
ACE-Step Inference API Documentation

Language / 语言 / 言語: English | 中文 | 日本語


This document provides comprehensive documentation for the ACE-Step inference API, including parameter specifications for all supported task types.

Table of Contents

  • Quick Start
  • API Overview
  • GenerationParams Parameters
  • GenerationConfig Parameters
  • Task Types
  • Helper Functions
  • Complete Examples
  • Best Practices
  • Troubleshooting
  • API Reference Summary
  • Version History

Quick Start

Basic Usage

from acestep.handler import AceStepHandler
from acestep.llm_inference import LLMHandler
from acestep.inference import GenerationParams, GenerationConfig, generate_music

# Initialize handlers
dit_handler = AceStepHandler()
llm_handler = LLMHandler()

# Initialize services
dit_handler.initialize_service(
    project_root="/path/to/project",
    config_path="acestep-v15-turbo",
    device="cuda"
)

llm_handler.initialize(
    checkpoint_dir="/path/to/checkpoints",
    lm_model_path="acestep-5Hz-lm-0.6B",
    backend="vllm",
    device="cuda"
)

# Configure generation parameters
params = GenerationParams(
    caption="upbeat electronic dance music with heavy bass",
    bpm=128,
    duration=30,
)

# Configure generation settings
config = GenerationConfig(
    batch_size=2,
    audio_format="flac",
)

# Generate music
result = generate_music(dit_handler, llm_handler, params, config, save_dir="/path/to/output")

# Access results
if result.success:
    for audio in result.audios:
        print(f"Generated: {audio['path']}")
        print(f"Key: {audio['key']}")
        print(f"Seed: {audio['params']['seed']}")
else:
    print(f"Error: {result.error}")

API Overview

Main Functions

generate_music

def generate_music(
    dit_handler,
    llm_handler,
    params: GenerationParams,
    config: GenerationConfig,
    save_dir: Optional[str] = None,
    progress=None,
) -> GenerationResult

Main function for generating music using the ACE-Step model.

understand_music

def understand_music(
    llm_handler,
    audio_codes: str,
    temperature: float = 0.85,
    top_k: Optional[int] = None,
    top_p: Optional[float] = None,
    repetition_penalty: float = 1.0,
    use_constrained_decoding: bool = True,
    constrained_decoding_debug: bool = False,
) -> UnderstandResult

Analyze audio semantic codes and extract metadata (caption, lyrics, BPM, key, etc.).

create_sample

def create_sample(
    llm_handler,
    query: str,
    instrumental: bool = False,
    vocal_language: Optional[str] = None,
    temperature: float = 0.85,
    top_k: Optional[int] = None,
    top_p: Optional[float] = None,
    repetition_penalty: float = 1.0,
    use_constrained_decoding: bool = True,
    constrained_decoding_debug: bool = False,
) -> CreateSampleResult

Generate a complete music sample (caption, lyrics, metadata) from a natural language description.

format_sample

def format_sample(
    llm_handler,
    caption: str,
    lyrics: str,
    user_metadata: Optional[Dict[str, Any]] = None,
    temperature: float = 0.85,
    top_k: Optional[int] = None,
    top_p: Optional[float] = None,
    repetition_penalty: float = 1.0,
    use_constrained_decoding: bool = True,
    constrained_decoding_debug: bool = False,
) -> FormatSampleResult

Format and enhance user-provided caption and lyrics, generating structured metadata.

Configuration Objects

The API uses two configuration dataclasses:

GenerationParams - Contains all music generation parameters:

@dataclass
class GenerationParams:
    # Task & Instruction
    task_type: str = "text2music"
    instruction: str = "Fill the audio semantic mask based on the given conditions:"
    
    # Audio Uploads
    reference_audio: Optional[str] = None
    src_audio: Optional[str] = None
    
    # LM Codes Hints
    audio_codes: str = ""
    
    # Text Inputs
    caption: str = ""
    lyrics: str = ""
    instrumental: bool = False
    
    # Metadata
    vocal_language: str = "unknown"
    bpm: Optional[int] = None
    keyscale: str = ""
    timesignature: str = ""
    duration: float = -1.0
    
    # Advanced Settings
    inference_steps: int = 8
    seed: int = -1
    guidance_scale: float = 7.0
    use_adg: bool = False
    cfg_interval_start: float = 0.0
    cfg_interval_end: float = 1.0
    shift: float = 1.0                    # NEW: Timestep shift factor
    infer_method: str = "ode"             # NEW: Diffusion inference method
    timesteps: Optional[List[float]] = None  # NEW: Custom timesteps
    
    repainting_start: float = 0.0
    repainting_end: float = -1
    audio_cover_strength: float = 1.0
    
    # 5Hz Language Model Parameters
    thinking: bool = True
    lm_temperature: float = 0.85
    lm_cfg_scale: float = 2.0
    lm_top_k: int = 0
    lm_top_p: float = 0.9
    lm_negative_prompt: str = "NO USER INPUT"
    use_cot_metas: bool = True
    use_cot_caption: bool = True
    use_cot_lyrics: bool = False
    use_cot_language: bool = True
    use_constrained_decoding: bool = True
    
    # CoT Generated Values (auto-filled by LM)
    cot_bpm: Optional[int] = None
    cot_keyscale: str = ""
    cot_timesignature: str = ""
    cot_duration: Optional[float] = None
    cot_vocal_language: str = "unknown"
    cot_caption: str = ""
    cot_lyrics: str = ""

GenerationConfig - Contains batch and output configuration:

@dataclass
class GenerationConfig:
    batch_size: int = 2
    allow_lm_batch: bool = False
    use_random_seed: bool = True
    seeds: Optional[List[int]] = None
    lm_batch_chunk_size: int = 8
    constrained_decoding_debug: bool = False
    audio_format: str = "flac"

Result Objects

GenerationResult - Result of music generation:

@dataclass
class GenerationResult:
    # Audio Outputs
    audios: List[Dict[str, Any]]  # List of audio dictionaries
    
    # Generation Information
    status_message: str           # Status message from generation
    extra_outputs: Dict[str, Any] # Extra outputs (latents, masks, lm_metadata, time_costs)
    
    # Success Status
    success: bool                 # Whether generation succeeded
    error: Optional[str]          # Error message if failed

Audio Dictionary Structure:

Each item in audios list contains:

{
    "path": str,           # File path to saved audio
    "tensor": Tensor,      # Audio tensor [channels, samples], CPU, float32
    "key": str,            # Unique audio key (UUID based on params)
    "sample_rate": int,    # Sample rate (default: 48000)
    "params": Dict,        # Generation params for this audio (includes seed, audio_codes, etc.)
}
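
For instance, a returned audio tensor can be re-saved or post-processed directly. The sketch below assumes torchaudio is installed and uses only the dictionary keys documented above.

import torchaudio

def resave_audio(audio: dict, out_path: str) -> None:
    # "tensor" is a CPU float32 tensor shaped [channels, samples];
    # "sample_rate" defaults to 48000 (both documented above).
    torchaudio.save(out_path, audio["tensor"], audio["sample_rate"])

# Usage (assuming `result` came from generate_music):
# resave_audio(result.audios[0], "variation_0.wav")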

UnderstandResult - Result of music understanding:

@dataclass
class UnderstandResult:
    # Metadata Fields
    caption: str = ""
    lyrics: str = ""
    bpm: Optional[int] = None
    duration: Optional[float] = None
    keyscale: str = ""
    language: str = ""
    timesignature: str = ""
    
    # Status
    status_message: str = ""
    success: bool = True
    error: Optional[str] = None

CreateSampleResult - Result of sample creation:

@dataclass
class CreateSampleResult:
    # Metadata Fields
    caption: str = ""
    lyrics: str = ""
    bpm: Optional[int] = None
    duration: Optional[float] = None
    keyscale: str = ""
    language: str = ""
    timesignature: str = ""
    instrumental: bool = False
    
    # Status
    status_message: str = ""
    success: bool = True
    error: Optional[str] = None

FormatSampleResult - Result of sample formatting:

@dataclass
class FormatSampleResult:
    # Metadata Fields
    caption: str = ""
    lyrics: str = ""
    bpm: Optional[int] = None
    duration: Optional[float] = None
    keyscale: str = ""
    language: str = ""
    timesignature: str = ""
    
    # Status
    status_message: str = ""
    success: bool = True
    error: Optional[str] = None

GenerationParams Parameters

Text Inputs

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| caption | str | "" | Text description of the desired music. Can be a simple prompt like "relaxing piano music" or a detailed description with genre, mood, instruments, etc. Max 512 characters. |
| lyrics | str | "" | Lyrics text for vocal music. Use "[Instrumental]" for instrumental tracks. Supports multiple languages. Max 4096 characters. |
| instrumental | bool | False | If True, generate instrumental music regardless of lyrics. |

Music Metadata

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| bpm | Optional[int] | None | Beats per minute (30-300). None enables auto-detection via the LM. |
| keyscale | str | "" | Musical key (e.g., "C Major", "Am", "F# minor"). Empty string enables auto-detection. |
| timesignature | str | "" | Time signature as a string ("2" for 2/4, "3" for 3/4, "4" for 4/4, "6" for 6/8). Empty string enables auto-detection. |
| vocal_language | str | "unknown" | Language code for vocals (ISO 639-1). Supported: "en", "zh", "ja", "es", "fr", etc. Use "unknown" for auto-detection. |
| duration | float | -1.0 | Target audio length in seconds (10-600). If <= 0 or None, the model chooses a length automatically based on the lyrics. |
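
As an illustration of these fields, the sketch below contrasts pinning metadata explicitly with leaving it to auto-detection; both variants use only the parameters listed above.

from acestep.inference import GenerationParams

# Explicit metadata: the model follows these values directly.
explicit = GenerationParams(
    caption="slow waltz with strings",
    bpm=90,
    keyscale="A minor",
    timesignature="3",
    duration=45,
    vocal_language="en",
)

# Auto-detection: leave bpm/keyscale/duration at their defaults and let
# the LM's chain-of-thought reasoning fill them in (thinking=True by default).
auto = GenerationParams(caption="slow waltz with strings")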

Generation Parameters

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| inference_steps | int | 8 | Number of denoising steps. Turbo model: 1-20 (recommended 8). Base model: 1-200 (recommended 32-64). Higher = better quality but slower. |
| guidance_scale | float | 7.0 | Classifier-free guidance scale (1.0-15.0). Higher values increase adherence to the text prompt. Only supported by the base (non-turbo) model. Typical range: 5.0-9.0. |
| seed | int | -1 | Random seed for reproducibility. Use -1 for a random seed, or any positive integer for a fixed seed. |

Advanced DiT Parameters

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| use_adg | bool | False | Use Adaptive Dual Guidance (base model only). Improves quality at the cost of speed. |
| cfg_interval_start | float | 0.0 | CFG application start ratio (0.0-1.0). Controls when to start applying classifier-free guidance. |
| cfg_interval_end | float | 1.0 | CFG application end ratio (0.0-1.0). Controls when to stop applying classifier-free guidance. |
| shift | float | 1.0 | Timestep shift factor (range 1.0-5.0). When != 1.0, applies t = shift * t / (1 + (shift - 1) * t) to the timesteps. Recommended 3.0 for turbo models. |
| infer_method | str | "ode" | Diffusion inference method. "ode" (Euler) is faster and deterministic. "sde" (stochastic) may produce different results with variance. |
| timesteps | Optional[List[float]] | None | Custom timesteps as a list of floats from 1.0 to 0.0 (e.g., [0.97, 0.76, 0.615, 0.5, 0.395, 0.28, 0.18, 0.085, 0]). If provided, overrides inference_steps and shift. |
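
To make the shift formula above concrete, the following standalone sketch applies t = shift * t / (1 + (shift - 1) * t) to a uniform 8-step schedule; it only demonstrates the math, not the internal scheduler implementation.

def shift_timestep(t: float, shift: float) -> float:
    # The formula documented for the shift parameter above.
    return shift * t / (1 + (shift - 1) * t)

uniform = [i / 8 for i in range(8, 0, -1)]                 # 1.0, 0.875, ..., 0.125
shifted = [round(shift_timestep(t, 3.0), 3) for t in uniform]
print(shifted)  # values are pushed toward 1.0, i.e. more steps at high noise levels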

Task-Specific Parameters

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| task_type | str | "text2music" | Generation task type. See the Task Types section for details. |
| instruction | str | "Fill the audio semantic mask based on the given conditions:" | Task-specific instruction prompt. |
| reference_audio | Optional[str] | None | Path to a reference audio file for style transfer or continuation tasks. |
| src_audio | Optional[str] | None | Path to a source audio file for audio-to-audio tasks (cover, repaint, etc.). |
| audio_codes | str | "" | Pre-extracted 5Hz audio semantic codes as a string. Advanced use only. |
| repainting_start | float | 0.0 | Repainting start time in seconds (for repaint/lego tasks). |
| repainting_end | float | -1 | Repainting end time in seconds. Use -1 for the end of the audio. |
| audio_cover_strength | float | 1.0 | Strength of the audio cover/codes influence (0.0-1.0). Use a smaller value (e.g., 0.2) for style-transfer tasks. |

5Hz Language Model Parameters

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| thinking | bool | True | Enable the 5Hz language model's chain-of-thought (CoT) reasoning for music metadata and semantic codes. |
| lm_temperature | float | 0.85 | LM sampling temperature (0.0-2.0). Higher = more creative/diverse, lower = more conservative. |
| lm_cfg_scale | float | 2.0 | LM classifier-free guidance scale. Higher = stronger adherence to the prompt. |
| lm_top_k | int | 0 | LM top-k sampling. 0 disables top-k filtering. Typical values: 40-100. |
| lm_top_p | float | 0.9 | LM nucleus sampling (0.0-1.0). 1.0 disables nucleus sampling. Typical values: 0.9-0.95. |
| lm_negative_prompt | str | "NO USER INPUT" | Negative prompt for LM guidance. Helps avoid unwanted characteristics. |
| use_cot_metas | bool | True | Generate metadata (BPM, key, duration, etc.) via LM CoT reasoning. |
| use_cot_caption | bool | True | Refine the user caption via LM CoT reasoning. |
| use_cot_language | bool | True | Detect the vocal language via LM CoT reasoning. |
| use_cot_lyrics | bool | False | (Reserved for future use) Generate/refine lyrics via LM CoT. |
| use_constrained_decoding | bool | True | Enable constrained decoding for structured LM output. |
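
For example, a configuration that keeps CoT metadata detection but preserves the user caption and samples more conservatively might look like this (all fields are documented above):

from acestep.inference import GenerationParams

params = GenerationParams(
    caption="lo-fi hip hop beat for studying",
    thinking=True,            # run the 5Hz LM
    use_cot_metas=True,       # auto-detect BPM, key, duration, etc.
    use_cot_caption=False,    # keep the user caption unchanged
    lm_temperature=0.7,       # more conservative sampling
    lm_top_p=0.9,
)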

CoT Generated Values

These fields are automatically populated by the LM when CoT reasoning is enabled:

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| cot_bpm | Optional[int] | None | LM-generated BPM value. |
| cot_keyscale | str | "" | LM-generated key/scale. |
| cot_timesignature | str | "" | LM-generated time signature. |
| cot_duration | Optional[float] | None | LM-generated duration. |
| cot_vocal_language | str | "unknown" | LM-detected vocal language. |
| cot_caption | str | "" | LM-refined caption. |
| cot_lyrics | str | "" | LM-generated/refined lyrics. |

GenerationConfig Parameters

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| batch_size | int | 2 | Number of samples to generate in parallel (1-8). Higher values require more GPU memory. |
| allow_lm_batch | bool | False | Allow batch processing in the LM. Faster when batch_size >= 2 and thinking=True. |
| use_random_seed | bool | True | Whether to use random seeds. True for different results each time, False for reproducible results. |
| seeds | Optional[List[int]] | None | Seeds for batch generation. If fewer seeds than batch_size are provided, the list is padded with random seeds. A single int is also accepted. |
| lm_batch_chunk_size | int | 8 | Maximum batch size per LM inference chunk (GPU memory constraint). |
| constrained_decoding_debug | bool | False | Enable debug logging for constrained decoding. |
| audio_format | str | "flac" | Output audio format. Options: "mp3", "wav", "flac". Default is FLAC for fast saving. |

Task Types

ACE-Step supports six generation task types, each optimized for specific use cases.

1. Text2Music (Default)

Purpose: Generate music from text descriptions and optional metadata.

Key Parameters:

params = GenerationParams(
    task_type="text2music",
    caption="energetic rock music with electric guitar",
    lyrics="[Instrumental]",  # or actual lyrics
    bpm=140,
    duration=30,
)

Required:

  • caption or lyrics (at least one)

Optional but Recommended:

  • bpm: Controls tempo
  • keyscale: Controls musical key
  • timesignature: Controls rhythm structure
  • duration: Controls length
  • vocal_language: Controls vocal characteristics

Use Cases:

  • Generate music from text descriptions
  • Create backing tracks from prompts
  • Generate songs with lyrics

2. Cover

Purpose: Transform existing audio while maintaining structure but changing style/timbre.

Key Parameters:

params = GenerationParams(
    task_type="cover",
    src_audio="original_song.mp3",
    caption="jazz piano version",
    audio_cover_strength=0.8,  # 0.0-1.0
)

Required:

  • src_audio: Path to source audio file
  • caption: Description of desired style/transformation

Optional:

  • audio_cover_strength: Controls influence of original audio
    • 1.0: Strong adherence to original structure
    • 0.5: Balanced transformation
    • 0.1: Loose interpretation
  • lyrics: New lyrics (if changing vocals)

Use Cases:

  • Create covers in different styles
  • Change instrumentation while keeping melody
  • Genre transformation

3. Repaint

Purpose: Regenerate a specific time segment of audio while keeping the rest unchanged.

Key Parameters:

params = GenerationParams(
    task_type="repaint",
    src_audio="original.mp3",
    repainting_start=10.0,  # seconds
    repainting_end=20.0,    # seconds
    caption="smooth transition with piano solo",
)

Required:

  • src_audio: Path to source audio file
  • repainting_start: Start time in seconds
  • repainting_end: End time in seconds (use -1 for end of file)
  • caption: Description of desired content for repainted section

Use Cases:

  • Fix specific sections of generated music
  • Add variations to parts of a song
  • Create smooth transitions
  • Replace problematic segments

4. Lego (Base Model Only)

Purpose: Generate a specific instrument track in context of existing audio.

Key Parameters:

params = GenerationParams(
    task_type="lego",
    src_audio="backing_track.mp3",
    instruction="Generate the guitar track based on the audio context:",
    caption="lead guitar melody with bluesy feel",
    repainting_start=0.0,
    repainting_end=-1,
)

Required:

  • src_audio: Path to source/backing audio
  • instruction: Must specify the track type (e.g., "Generate the {TRACK_NAME} track...")
  • caption: Description of desired track characteristics

Available Tracks:

  • "vocals", "backing_vocals", "drums", "bass", "guitar", "keyboard",
  • "percussion", "strings", "synth", "fx", "brass", "woodwinds"

Use Cases:

  • Add specific instrument tracks
  • Layer additional instruments over backing tracks
  • Create multi-track compositions iteratively

5. Extract (Base Model Only)

Purpose: Extract/isolate a specific instrument track from mixed audio.

Key Parameters:

params = GenerationParams(
    task_type="extract",
    src_audio="full_mix.mp3",
    instruction="Extract the vocals track from the audio:",
)

Required:

  • src_audio: Path to mixed audio file
  • instruction: Must specify track to extract

Available Tracks: Same as Lego task

Use Cases:

  • Stem separation
  • Isolate specific instruments
  • Create remixes
  • Analyze individual tracks

6. Complete (Base Model Only)

Purpose: Complete/extend partial tracks with specified instruments.

Key Parameters:

params = GenerationParams(
    task_type="complete",
    src_audio="incomplete_track.mp3",
    instruction="Complete the input track with drums, bass, guitar:",
    caption="rock style completion",
)

Required:

  • src_audio: Path to incomplete/partial track
  • instruction: Must specify which tracks to add
  • caption: Description of desired style

Use Cases:

  • Arrange incomplete compositions
  • Add backing tracks
  • Auto-complete musical ideas

Helper Functions

understand_music

Analyze audio codes to extract metadata about the music.

from acestep.inference import understand_music

result = understand_music(
    llm_handler=llm_handler,
    audio_codes="<|audio_code_123|><|audio_code_456|>...",
    temperature=0.85,
    use_constrained_decoding=True,
)

if result.success:
    print(f"Caption: {result.caption}")
    print(f"Lyrics: {result.lyrics}")
    print(f"BPM: {result.bpm}")
    print(f"Key: {result.keyscale}")
    print(f"Duration: {result.duration}s")
    print(f"Language: {result.language}")
else:
    print(f"Error: {result.error}")

Use Cases:

  • Analyze existing music
  • Extract metadata from audio codes
  • Reverse-engineer generation parameters

create_sample

Generate a complete music sample from a natural language description. This is the "Simple Mode" / "Inspiration Mode" feature.

from acestep.inference import create_sample

result = create_sample(
    llm_handler=llm_handler,
    query="a soft Bengali love song for a quiet evening",
    instrumental=False,
    vocal_language="bn",  # Optional: constrain to Bengali
    temperature=0.85,
)

if result.success:
    print(f"Caption: {result.caption}")
    print(f"Lyrics: {result.lyrics}")
    print(f"BPM: {result.bpm}")
    print(f"Duration: {result.duration}s")
    print(f"Key: {result.keyscale}")
    print(f"Is Instrumental: {result.instrumental}")
    
    # Use with generate_music
    params = GenerationParams(
        caption=result.caption,
        lyrics=result.lyrics,
        bpm=result.bpm,
        duration=result.duration,
        keyscale=result.keyscale,
        vocal_language=result.language,
    )
else:
    print(f"Error: {result.error}")

Parameters:

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| query | str | required | Natural language description of the desired music. |
| instrumental | bool | False | Whether to generate instrumental music. |
| vocal_language | Optional[str] | None | Constrain lyrics to a specific language (e.g., "en", "zh", "bn"). |
| temperature | float | 0.85 | Sampling temperature. |
| top_k | Optional[int] | None | Top-k sampling (None disables). |
| top_p | Optional[float] | None | Top-p sampling (None disables). |
| repetition_penalty | float | 1.0 | Repetition penalty. |
| use_constrained_decoding | bool | True | Use FSM-based constrained decoding. |

format_sample

Format and enhance user-provided caption and lyrics, generating structured metadata.

from acestep.inference import format_sample

result = format_sample(
    llm_handler=llm_handler,
    caption="Latin pop, reggaeton",
    lyrics="[Verse 1]\nBailando en la noche...",
    user_metadata={"bpm": 95},  # Optional: constrain specific values
    temperature=0.85,
)

if result.success:
    print(f"Enhanced Caption: {result.caption}")
    print(f"Formatted Lyrics: {result.lyrics}")
    print(f"BPM: {result.bpm}")
    print(f"Duration: {result.duration}s")
    print(f"Key: {result.keyscale}")
    print(f"Detected Language: {result.language}")
else:
    print(f"Error: {result.error}")

Parameters:

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| caption | str | required | User's caption/description. |
| lyrics | str | required | User's lyrics with structure tags. |
| user_metadata | Optional[Dict] | None | Constrain specific metadata values (bpm, duration, keyscale, timesignature, language). |
| temperature | float | 0.85 | Sampling temperature. |
| top_k | Optional[int] | None | Top-k sampling (None disables). |
| top_p | Optional[float] | None | Top-p sampling (None disables). |
| repetition_penalty | float | 1.0 | Repetition penalty. |
| use_constrained_decoding | bool | True | Use FSM-based constrained decoding. |

Complete Examples

Example 1: Simple Text-to-Music Generation

from acestep.inference import GenerationParams, GenerationConfig, generate_music

params = GenerationParams(
    task_type="text2music",
    caption="calm ambient music with soft piano and strings",
    duration=60,
    bpm=80,
    keyscale="C Major",
)

config = GenerationConfig(
    batch_size=2,  # Generate 2 variations
    audio_format="flac",
)

result = generate_music(dit_handler, llm_handler, params, config, save_dir="/output")

if result.success:
    for i, audio in enumerate(result.audios, 1):
        print(f"Variation {i}: {audio['path']}")

Example 2: Song Generation with Lyrics

params = GenerationParams(
    task_type="text2music",
    caption="pop ballad with emotional vocals",
    lyrics="""Verse 1:
Walking down the street today
Thinking of the words you used to say
Everything feels different now
But I'll find my way somehow

Chorus:
I'm moving on, I'm staying strong
This is where I belong
""",
    vocal_language="en",
    bpm=72,
    duration=45,
)

config = GenerationConfig(batch_size=1)

result = generate_music(dit_handler, llm_handler, params, config, save_dir="/output")

Example 3: Using Custom Timesteps

params = GenerationParams(
    task_type="text2music",
    caption="jazz fusion with complex harmonies",
    # Custom 9-step schedule
    timesteps=[0.97, 0.76, 0.615, 0.5, 0.395, 0.28, 0.18, 0.085, 0],
    thinking=True,
)

config = GenerationConfig(batch_size=1)

result = generate_music(dit_handler, llm_handler, params, config, save_dir="/output")

Example 4: Using Shift Parameter (Turbo Model)

params = GenerationParams(
    task_type="text2music",
    caption="upbeat electronic dance music",
    inference_steps=8,
    shift=3.0,  # Recommended for turbo models
    infer_method="ode",
)

config = GenerationConfig(batch_size=2)

result = generate_music(dit_handler, llm_handler, params, config, save_dir="/output")

Example 5: Simple Mode with create_sample

from acestep.inference import create_sample, GenerationParams, GenerationConfig, generate_music

# Step 1: Create sample from description
sample = create_sample(
    llm_handler=llm_handler,
    query="energetic K-pop dance track with catchy hooks",
    vocal_language="ko",
)

if sample.success:
    # Step 2: Generate music using the sample
    params = GenerationParams(
        caption=sample.caption,
        lyrics=sample.lyrics,
        bpm=sample.bpm,
        duration=sample.duration,
        keyscale=sample.keyscale,
        vocal_language=sample.language,
        thinking=True,
    )
    
    config = GenerationConfig(batch_size=2)
    result = generate_music(dit_handler, llm_handler, params, config, save_dir="/output")

Example 6: Format and Enhance User Input

from acestep.inference import format_sample, GenerationParams, GenerationConfig, generate_music

# Step 1: Format user input
formatted = format_sample(
    llm_handler=llm_handler,
    caption="rock ballad",
    lyrics="[Verse]\nIn the darkness I find my way...",
)

if formatted.success:
    # Step 2: Generate with enhanced input
    params = GenerationParams(
        caption=formatted.caption,
        lyrics=formatted.lyrics,
        bpm=formatted.bpm,
        duration=formatted.duration,
        keyscale=formatted.keyscale,
        thinking=True,
        use_cot_metas=False,  # Already formatted, skip metas CoT
    )
    
    config = GenerationConfig(batch_size=2)
    result = generate_music(dit_handler, llm_handler, params, config, save_dir="/output")

Example 7: Style Cover with LM Reasoning

params = GenerationParams(
    task_type="cover",
    src_audio="original_pop_song.mp3",
    caption="orchestral symphonic arrangement",
    audio_cover_strength=0.7,
    thinking=True,  # Enable LM for metadata
    use_cot_metas=True,
)

config = GenerationConfig(batch_size=1)

result = generate_music(dit_handler, llm_handler, params, config, save_dir="/output")

# Access LM-generated metadata
if result.extra_outputs.get("lm_metadata"):
    lm_meta = result.extra_outputs["lm_metadata"]
    print(f"LM detected BPM: {lm_meta.get('bpm')}")
    print(f"LM detected Key: {lm_meta.get('keyscale')}")

Example 8: Batch Generation with Specific Seeds

params = GenerationParams(
    task_type="text2music",
    caption="epic cinematic trailer music",
)

config = GenerationConfig(
    batch_size=4,           # Generate 4 variations
    seeds=[42, 123, 456],   # Specify 3 seeds, 4th will be random
    use_random_seed=False,  # Use provided seeds
    lm_batch_chunk_size=2,  # Process 2 at a time (GPU memory)
)

result = generate_music(dit_handler, llm_handler, params, config, save_dir="/output")

if result.success:
    print(f"Generated {len(result.audios)} variations")
    for audio in result.audios:
        print(f"  Seed {audio['params']['seed']}: {audio['path']}")

Example 9: High-Quality Generation (Base Model)

params = GenerationParams(
    task_type="text2music",
    caption="intricate jazz fusion with complex harmonies",
    inference_steps=64,     # High quality
    guidance_scale=8.0,
    use_adg=True,           # Adaptive Dual Guidance
    cfg_interval_start=0.0,
    cfg_interval_end=1.0,
    shift=3.0,              # Timestep shift
    seed=42,                # Reproducible results
)

config = GenerationConfig(
    batch_size=1,
    use_random_seed=False,
    audio_format="wav",     # Lossless format
)

result = generate_music(dit_handler, llm_handler, params, config, save_dir="/output")

Example 10: Understand Audio from Codes

from acestep.inference import understand_music

# Analyze audio codes (e.g., from a previous generation)
result = understand_music(
    llm_handler=llm_handler,
    audio_codes="<|audio_code_10695|><|audio_code_54246|>...",
    temperature=0.85,
)

if result.success:
    print(f"Detected Caption: {result.caption}")
    print(f"Detected Lyrics: {result.lyrics}")
    print(f"Detected BPM: {result.bpm}")
    print(f"Detected Key: {result.keyscale}")
    print(f"Detected Duration: {result.duration}s")
    print(f"Detected Language: {result.language}")

Best Practices

1. Caption Writing

Good Captions:

# Specific and descriptive
caption="upbeat electronic dance music with heavy bass and synthesizer leads"

# Include mood and genre
caption="melancholic indie folk with acoustic guitar and soft vocals"

# Specify instruments
caption="jazz trio with piano, upright bass, and brush drums"

Avoid:

# Too vague
caption="good music"

# Contradictory
caption="fast slow music"  # Conflicting tempos

2. Parameter Tuning

For Best Quality:

  • Use base model with inference_steps=64 or higher
  • Enable use_adg=True
  • Set guidance_scale=7.0-9.0
  • Set shift=3.0 for better timestep distribution
  • Use lossless audio format (audio_format="wav")

For Speed:

  • Use turbo model with inference_steps=8
  • Disable ADG (use_adg=False)
  • Use infer_method="ode" (default)
  • Use compressed format (audio_format="mp3") or default FLAC

For Consistency:

  • Set use_random_seed=False in config
  • Use fixed seeds list or single seed in params
  • Keep lm_temperature lower (0.7-0.85)

For Diversity:

  • Set use_random_seed=True in config
  • Increase lm_temperature (0.9-1.1)
  • Use batch_size > 1 for variations
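
Putting those knobs together, here is a rough sketch of a reproducible setup next to a more exploratory one:

from acestep.inference import GenerationParams, GenerationConfig

# Reproducible: fixed seed, lower LM temperature
repro_params = GenerationParams(caption="ambient pad texture", lm_temperature=0.75)
repro_config = GenerationConfig(batch_size=1, use_random_seed=False, seeds=[1234])

# Diverse: random seeds, higher LM temperature, several variations at once
diverse_params = GenerationParams(caption="ambient pad texture", lm_temperature=1.0)
diverse_config = GenerationConfig(batch_size=4, use_random_seed=True)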

3. Duration Guidelines

  • Instrumental: 30-180 seconds works well
  • With Lyrics: Auto-detection recommended (set duration=-1 or leave default)
  • Short clips: 10-20 seconds minimum
  • Long form: Up to 600 seconds (10 minutes) maximum

4. LM Usage

When to Enable LM (thinking=True):

  • Need automatic metadata detection
  • Want caption refinement
  • Generating from minimal input
  • Need diverse outputs

When to Disable LM (thinking=False):

  • Have precise metadata already
  • Need faster generation
  • Want full control over parameters
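
For example, when all metadata is already known, a configuration along these lines skips LM reasoning entirely:

params = GenerationParams(
    caption="driving synthwave with arpeggiated bass",
    lyrics="[Instrumental]",
    bpm=110,
    keyscale="F# minor",
    duration=60,
    thinking=False,   # skip 5Hz LM reasoning; the values above are used as-is
)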

5. Batch Processing

# Efficient batch generation
config = GenerationConfig(
    batch_size=8,           # Max supported
    allow_lm_batch=True,    # Enable for speed (when thinking=True)
    lm_batch_chunk_size=4,  # Adjust based on GPU memory
)

6. Error Handling

result = generate_music(dit_handler, llm_handler, params, config, save_dir="/output")

if not result.success:
    print(f"Generation failed: {result.error}")
    print(f"Status: {result.status_message}")
else:
    # Process successful result
    for audio in result.audios:
        path = audio['path']
        key = audio['key']
        seed = audio['params']['seed']
        # ... process audio files

7. Memory Management

For large batch sizes or long durations:

  • Monitor GPU memory usage
  • Reduce batch_size if OOM errors occur
  • Reduce lm_batch_chunk_size for LM operations
  • Consider using offload_to_cpu=True during initialization
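
A sketch of the last point is shown below; the offload_to_cpu keyword is taken from the note above, but its exact placement on initialize_service is an assumption, so verify it against your installed version's signature.

# Assumption: offload_to_cpu is passed to initialize_service; the parameter
# name comes from the note above -- check your version's actual signature.
dit_handler.initialize_service(
    project_root="/path/to/project",
    config_path="acestep-v15-turbo",
    device="cuda",
    offload_to_cpu=True,   # keep idle weights on the CPU to reduce GPU memory
)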

8. Accessing Time Costs

result = generate_music(dit_handler, llm_handler, params, config, save_dir="/output")

if result.success:
    time_costs = result.extra_outputs.get("time_costs", {})
    print(f"LM Phase 1 Time: {time_costs.get('lm_phase1_time', 0):.2f}s")
    print(f"LM Phase 2 Time: {time_costs.get('lm_phase2_time', 0):.2f}s")
    print(f"DiT Total Time: {time_costs.get('dit_total_time_cost', 0):.2f}s")
    print(f"Pipeline Total: {time_costs.get('pipeline_total_time', 0):.2f}s")

Troubleshooting

Common Issues

Issue: Out of memory errors

  • Solution: Reduce batch_size, inference_steps, or enable CPU offloading

Issue: Poor quality results

  • Solution: Increase inference_steps, adjust guidance_scale, use base model

Issue: Results don't match prompt

  • Solution: Make caption more specific, increase guidance_scale, enable LM refinement (thinking=True)

Issue: Slow generation

  • Solution: Use turbo model, reduce inference_steps, disable ADG

Issue: LM not generating codes

  • Solution: Verify llm_handler is initialized, check thinking=True and use_cot_metas=True

Issue: Seeds not being respected

  • Solution: Set use_random_seed=False in config and provide seeds list or seed in params

Issue: Custom timesteps not working

  • Solution: Ensure timesteps are a list of floats from 1.0 to 0.0, properly ordered
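
A quick sanity check along those lines (a standalone sketch, not part of the API):

def validate_timesteps(ts):
    # Values in [0.0, 1.0], strictly decreasing, ending at 0.0
    # (matching the documented example schedule).
    return (
        len(ts) > 0
        and all(0.0 <= t <= 1.0 for t in ts)
        and all(a > b for a, b in zip(ts, ts[1:]))
        and ts[-1] == 0.0
    )

assert validate_timesteps([0.97, 0.76, 0.615, 0.5, 0.395, 0.28, 0.18, 0.085, 0])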

API Reference Summary

GenerationParams Fields

See GenerationParams Parameters for complete documentation.

GenerationConfig Fields

See GenerationConfig Parameters for complete documentation.

GenerationResult Fields

@dataclass
class GenerationResult:
    # Audio Outputs
    audios: List[Dict[str, Any]]
    # Each audio dict contains:
    #   - "path": str (file path)
    #   - "tensor": Tensor (audio data)
    #   - "key": str (unique identifier)
    #   - "sample_rate": int (48000)
    #   - "params": Dict (generation params with seed, audio_codes, etc.)
    
    # Generation Information
    status_message: str
    extra_outputs: Dict[str, Any]
    # extra_outputs contains:
    #   - "lm_metadata": Dict (LM-generated metadata)
    #   - "time_costs": Dict (timing information)
    #   - "latents": Tensor (intermediate latents, if available)
    #   - "masks": Tensor (attention masks, if available)
    
    # Success Status
    success: bool
    error: Optional[str]

Version History

  • v1.5.2: Current version

    • Added shift parameter for timestep shifting
    • Added infer_method parameter for ODE/SDE selection
    • Added timesteps parameter for custom timestep schedules
    • Added understand_music() function for audio analysis
    • Added create_sample() function for simple mode generation
    • Added format_sample() function for input enhancement
    • Added UnderstandResult, CreateSampleResult, FormatSampleResult dataclasses
  • v1.5.1: Previous version

    • Split GenerationConfig into GenerationParams and GenerationConfig
    • Renamed parameters for consistency (key_scale → keyscale, time_signature → timesignature, audio_duration → duration, use_llm_thinking → thinking, audio_code_string → audio_codes)
    • Added instrumental parameter
    • Added use_constrained_decoding parameter
    • Added CoT auto-filled fields (cot_*)
    • Changed default audio_format to "flac"
    • Changed default batch_size to 2
    • Changed default thinking to True
    • Simplified GenerationResult structure with unified audios list
    • Added unified time_costs in extra_outputs
  • v1.5: Initial version

    • Introduced GenerationConfig and GenerationResult dataclasses
    • Simplified parameter passing
    • Added comprehensive documentation

For more information, see: