
ACE-Step Inference API Documentation

Language / 语言 / 言語: English | 中文 | 日本語


This document provides comprehensive documentation for the ACE-Step inference API, including parameter specifications for all supported task types.

Table of Contents

  • Quick Start
  • API Overview
  • GenerationParams Parameters
  • GenerationConfig Parameters
  • Task Types
  • Helper Functions
  • Complete Examples
  • Best Practices
  • Troubleshooting
  • API Reference Summary
  • Version History

Quick Start

Basic Usage

from acestep.handler import AceStepHandler
from acestep.llm_inference import LLMHandler
from acestep.inference import GenerationParams, GenerationConfig, generate_music

# Initialize handlers
dit_handler = AceStepHandler()
llm_handler = LLMHandler()

# Initialize services
dit_handler.initialize_service(
    project_root="/path/to/project",
    config_path="acestep-v15-turbo",
    device="cuda"
)

llm_handler.initialize(
    checkpoint_dir="/path/to/checkpoints",
    lm_model_path="acestep-5Hz-lm-0.6B",
    backend="vllm",
    device="cuda"
)

# Configure generation parameters
params = GenerationParams(
    caption="upbeat electronic dance music with heavy bass",
    bpm=128,
    duration=30,
)

# Configure generation settings
config = GenerationConfig(
    batch_size=2,
    audio_format="flac",
)

# Generate music
result = generate_music(dit_handler, llm_handler, params, config, save_dir="/path/to/output")

# Access results
if result.success:
    for audio in result.audios:
        print(f"Generated: {audio['path']}")
        print(f"Key: {audio['key']}")
        print(f"Seed: {audio['params']['seed']}")
else:
    print(f"Error: {result.error}")

API Overview

Main Functions

generate_music

def generate_music(
    dit_handler,
    llm_handler,
    params: GenerationParams,
    config: GenerationConfig,
    save_dir: Optional[str] = None,
    progress=None,
) -> GenerationResult

Main function for generating music using the ACE-Step model.

understand_music

def understand_music(
    llm_handler,
    audio_codes: str,
    temperature: float = 0.85,
    top_k: Optional[int] = None,
    top_p: Optional[float] = None,
    repetition_penalty: float = 1.0,
    use_constrained_decoding: bool = True,
    constrained_decoding_debug: bool = False,
) -> UnderstandResult

Analyze audio semantic codes and extract metadata (caption, lyrics, BPM, key, etc.).

create_sample

def create_sample(
    llm_handler,
    query: str,
    instrumental: bool = False,
    vocal_language: Optional[str] = None,
    temperature: float = 0.85,
    top_k: Optional[int] = None,
    top_p: Optional[float] = None,
    repetition_penalty: float = 1.0,
    use_constrained_decoding: bool = True,
    constrained_decoding_debug: bool = False,
) -> CreateSampleResult

Generate a complete music sample (caption, lyrics, metadata) from a natural language description.

format_sample

def format_sample(
    llm_handler,
    caption: str,
    lyrics: str,
    user_metadata: Optional[Dict[str, Any]] = None,
    temperature: float = 0.85,
    top_k: Optional[int] = None,
    top_p: Optional[float] = None,
    repetition_penalty: float = 1.0,
    use_constrained_decoding: bool = True,
    constrained_decoding_debug: bool = False,
) -> FormatSampleResult

Format and enhance user-provided caption and lyrics, generating structured metadata.

Configuration Objects

The API uses two configuration dataclasses:

GenerationParams - Contains all music generation parameters:

@dataclass
class GenerationParams:
    # Task & Instruction
    task_type: str = "text2music"
    instruction: str = "Fill the audio semantic mask based on the given conditions:"
    
    # Audio Uploads
    reference_audio: Optional[str] = None
    src_audio: Optional[str] = None
    
    # LM Codes Hints
    audio_codes: str = ""
    
    # Text Inputs
    caption: str = ""
    lyrics: str = ""
    instrumental: bool = False
    
    # Metadata
    vocal_language: str = "unknown"
    bpm: Optional[int] = None
    keyscale: str = ""
    timesignature: str = ""
    duration: float = -1.0
    
    # Advanced Settings
    inference_steps: int = 8
    seed: int = -1
    guidance_scale: float = 7.0
    use_adg: bool = False
    cfg_interval_start: float = 0.0
    cfg_interval_end: float = 1.0
    shift: float = 1.0                    # NEW: Timestep shift factor
    infer_method: str = "ode"             # NEW: Diffusion inference method
    timesteps: Optional[List[float]] = None  # NEW: Custom timesteps
    
    repainting_start: float = 0.0
    repainting_end: float = -1
    audio_cover_strength: float = 1.0
    
    # 5Hz Language Model Parameters
    thinking: bool = True
    lm_temperature: float = 0.85
    lm_cfg_scale: float = 2.0
    lm_top_k: int = 0
    lm_top_p: float = 0.9
    lm_negative_prompt: str = "NO USER INPUT"
    use_cot_metas: bool = True
    use_cot_caption: bool = True
    use_cot_lyrics: bool = False
    use_cot_language: bool = True
    use_constrained_decoding: bool = True
    
    # CoT Generated Values (auto-filled by LM)
    cot_bpm: Optional[int] = None
    cot_keyscale: str = ""
    cot_timesignature: str = ""
    cot_duration: Optional[float] = None
    cot_vocal_language: str = "unknown"
    cot_caption: str = ""
    cot_lyrics: str = ""

GenerationConfig - Contains batch and output configuration:

@dataclass
class GenerationConfig:
    batch_size: int = 2
    allow_lm_batch: bool = False
    use_random_seed: bool = True
    seeds: Optional[List[int]] = None
    lm_batch_chunk_size: int = 8
    constrained_decoding_debug: bool = False
    audio_format: str = "flac"

Result Objects

GenerationResult - Result of music generation:

@dataclass
class GenerationResult:
    # Audio Outputs
    audios: List[Dict[str, Any]]  # List of audio dictionaries
    
    # Generation Information
    status_message: str           # Status message from generation
    extra_outputs: Dict[str, Any] # Extra outputs (latents, masks, lm_metadata, time_costs)
    
    # Success Status
    success: bool                 # Whether generation succeeded
    error: Optional[str]          # Error message if failed

Audio Dictionary Structure:

Each item in audios list contains:

{
    "path": str,           # File path to saved audio
    "tensor": Tensor,      # Audio tensor [channels, samples], CPU, float32
    "key": str,            # Unique audio key (UUID based on params)
    "sample_rate": int,    # Sample rate (default: 48000)
    "params": Dict,        # Generation params for this audio (includes seed, audio_codes, etc.)
}
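
For instance, a returned audio tensor can be re-saved or post-processed directly. The sketch below assumes torchaudio is installed and uses only the dictionary keys documented above.

import torchaudio

def resave_audio(audio: dict, out_path: str) -> None:
    # "tensor" is a CPU float32 tensor shaped [channels, samples];
    # "sample_rate" defaults to 48000 (both documented above).
    torchaudio.save(out_path, audio["tensor"], audio["sample_rate"])

# Usage (assuming `result` came from generate_music):
# resave_audio(result.audios[0], "variation_0.wav")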

UnderstandResult - Result of music understanding:

@dataclass
class UnderstandResult:
    # Metadata Fields
    caption: str = ""
    lyrics: str = ""
    bpm: Optional[int] = None
    duration: Optional[float] = None
    keyscale: str = ""
    language: str = ""
    timesignature: str = ""
    
    # Status
    status_message: str = ""
    success: bool = True
    error: Optional[str] = None

CreateSampleResult - Result of sample creation:

@dataclass
class CreateSampleResult:
    # Metadata Fields
    caption: str = ""
    lyrics: str = ""
    bpm: Optional[int] = None
    duration: Optional[float] = None
    keyscale: str = ""
    language: str = ""
    timesignature: str = ""
    instrumental: bool = False
    
    # Status
    status_message: str = ""
    success: bool = True
    error: Optional[str] = None

FormatSampleResult - Result of sample formatting:

@dataclass
class FormatSampleResult:
    # Metadata Fields
    caption: str = ""
    lyrics: str = ""
    bpm: Optional[int] = None
    duration: Optional[float] = None
    keyscale: str = ""
    language: str = ""
    timesignature: str = ""
    
    # Status
    status_message: str = ""
    success: bool = True
    error: Optional[str] = None

GenerationParams Parameters

Text Inputs

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| caption | str | "" | Text description of the desired music. Can be a simple prompt like "relaxing piano music" or a detailed description with genre, mood, instruments, etc. Max 512 characters. |
| lyrics | str | "" | Lyrics text for vocal music. Use "[Instrumental]" for instrumental tracks. Supports multiple languages. Max 4096 characters. |
| instrumental | bool | False | If True, generate instrumental music regardless of lyrics. |

Music Metadata

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| bpm | Optional[int] | None | Beats per minute (30-300). None enables auto-detection via the LM. |
| keyscale | str | "" | Musical key (e.g., "C Major", "Am", "F# minor"). Empty string enables auto-detection. |
| timesignature | str | "" | Time signature as a string ("2" for 2/4, "3" for 3/4, "4" for 4/4, "6" for 6/8). Empty string enables auto-detection. |
| vocal_language | str | "unknown" | Language code for vocals (ISO 639-1). Supported: "en", "zh", "ja", "es", "fr", etc. Use "unknown" for auto-detection. |
| duration | float | -1.0 | Target audio length in seconds (10-600). If <= 0 or None, the model chooses a length automatically based on the lyrics. |
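
As an illustration of these fields, the sketch below contrasts pinning metadata explicitly with leaving it to auto-detection; both variants use only the parameters listed above.

from acestep.inference import GenerationParams

# Explicit metadata: the model follows these values directly.
explicit = GenerationParams(
    caption="slow waltz with strings",
    bpm=90,
    keyscale="A minor",
    timesignature="3",
    duration=45,
    vocal_language="en",
)

# Auto-detection: leave bpm/keyscale/duration at their defaults and let
# the LM's chain-of-thought reasoning fill them in (thinking=True by default).
auto = GenerationParams(caption="slow waltz with strings")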

Generation Parameters

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| inference_steps | int | 8 | Number of denoising steps. Turbo model: 1-20 (recommended 8). Base model: 1-200 (recommended 32-64). Higher = better quality but slower. |
| guidance_scale | float | 7.0 | Classifier-free guidance scale (1.0-15.0). Higher values increase adherence to the text prompt. Only supported by the base (non-turbo) model. Typical range: 5.0-9.0. |
| seed | int | -1 | Random seed for reproducibility. Use -1 for a random seed, or any positive integer for a fixed seed. |

Advanced DiT Parameters

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| use_adg | bool | False | Use Adaptive Dual Guidance (base model only). Improves quality at the cost of speed. |
| cfg_interval_start | float | 0.0 | CFG application start ratio (0.0-1.0). Controls when to start applying classifier-free guidance. |
| cfg_interval_end | float | 1.0 | CFG application end ratio (0.0-1.0). Controls when to stop applying classifier-free guidance. |
| shift | float | 1.0 | Timestep shift factor (range 1.0-5.0). When != 1.0, applies t = shift * t / (1 + (shift - 1) * t) to the timesteps. Recommended 3.0 for turbo models. |
| infer_method | str | "ode" | Diffusion inference method. "ode" (Euler) is faster and deterministic. "sde" (stochastic) may produce different results with variance. |
| timesteps | Optional[List[float]] | None | Custom timesteps as a list of floats from 1.0 to 0.0 (e.g., [0.97, 0.76, 0.615, 0.5, 0.395, 0.28, 0.18, 0.085, 0]). If provided, overrides inference_steps and shift. |
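
To make the shift formula above concrete, the following standalone sketch applies t = shift * t / (1 + (shift - 1) * t) to a uniform 8-step schedule; it only demonstrates the math, not the internal scheduler implementation.

def shift_timestep(t: float, shift: float) -> float:
    # The formula documented for the shift parameter above.
    return shift * t / (1 + (shift - 1) * t)

uniform = [i / 8 for i in range(8, 0, -1)]                 # 1.0, 0.875, ..., 0.125
shifted = [round(shift_timestep(t, 3.0), 3) for t in uniform]
print(shifted)  # values are pushed toward 1.0, i.e. more steps at high noise levels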

Task-Specific Parameters

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| task_type | str | "text2music" | Generation task type. See the Task Types section for details. |
| instruction | str | "Fill the audio semantic mask based on the given conditions:" | Task-specific instruction prompt. |
| reference_audio | Optional[str] | None | Path to a reference audio file for style transfer or continuation tasks. |
| src_audio | Optional[str] | None | Path to a source audio file for audio-to-audio tasks (cover, repaint, etc.). |
| audio_codes | str | "" | Pre-extracted 5Hz audio semantic codes as a string. Advanced use only. |
| repainting_start | float | 0.0 | Repainting start time in seconds (for repaint/lego tasks). |
| repainting_end | float | -1 | Repainting end time in seconds. Use -1 for the end of the audio. |
| audio_cover_strength | float | 1.0 | Strength of the audio cover/codes influence (0.0-1.0). Use a smaller value (e.g., 0.2) for style-transfer tasks. |

5Hz Language Model Parameters

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| thinking | bool | True | Enable the 5Hz language model's chain-of-thought (CoT) reasoning for music metadata and semantic codes. |
| lm_temperature | float | 0.85 | LM sampling temperature (0.0-2.0). Higher = more creative/diverse, lower = more conservative. |
| lm_cfg_scale | float | 2.0 | LM classifier-free guidance scale. Higher = stronger adherence to the prompt. |
| lm_top_k | int | 0 | LM top-k sampling. 0 disables top-k filtering. Typical values: 40-100. |
| lm_top_p | float | 0.9 | LM nucleus sampling (0.0-1.0). 1.0 disables nucleus sampling. Typical values: 0.9-0.95. |
| lm_negative_prompt | str | "NO USER INPUT" | Negative prompt for LM guidance. Helps avoid unwanted characteristics. |
| use_cot_metas | bool | True | Generate metadata (BPM, key, duration, etc.) via LM CoT reasoning. |
| use_cot_caption | bool | True | Refine the user caption via LM CoT reasoning. |
| use_cot_language | bool | True | Detect the vocal language via LM CoT reasoning. |
| use_cot_lyrics | bool | False | (Reserved for future use) Generate/refine lyrics via LM CoT. |
| use_constrained_decoding | bool | True | Enable constrained decoding for structured LM output. |
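
For example, a configuration that keeps CoT metadata detection but preserves the user caption and samples more conservatively might look like this (all fields are documented above):

from acestep.inference import GenerationParams

params = GenerationParams(
    caption="lo-fi hip hop beat for studying",
    thinking=True,            # run the 5Hz LM
    use_cot_metas=True,       # auto-detect BPM, key, duration, etc.
    use_cot_caption=False,    # keep the user caption unchanged
    lm_temperature=0.7,       # more conservative sampling
    lm_top_p=0.9,
)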

CoT Generated Values

These fields are automatically populated by the LM when CoT reasoning is enabled:

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| cot_bpm | Optional[int] | None | LM-generated BPM value. |
| cot_keyscale | str | "" | LM-generated key/scale. |
| cot_timesignature | str | "" | LM-generated time signature. |
| cot_duration | Optional[float] | None | LM-generated duration. |
| cot_vocal_language | str | "unknown" | LM-detected vocal language. |
| cot_caption | str | "" | LM-refined caption. |
| cot_lyrics | str | "" | LM-generated/refined lyrics. |

GenerationConfig Parameters

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| batch_size | int | 2 | Number of samples to generate in parallel (1-8). Higher values require more GPU memory. |
| allow_lm_batch | bool | False | Allow batch processing in the LM. Faster when batch_size >= 2 and thinking=True. |
| use_random_seed | bool | True | Whether to use random seeds. True for different results each time, False for reproducible results. |
| seeds | Optional[List[int]] | None | Seeds for batch generation. If fewer seeds than batch_size are provided, the list is padded with random seeds. A single int is also accepted. |
| lm_batch_chunk_size | int | 8 | Maximum batch size per LM inference chunk (GPU memory constraint). |
| constrained_decoding_debug | bool | False | Enable debug logging for constrained decoding. |
| audio_format | str | "flac" | Output audio format. Options: "mp3", "wav", "flac". Default is FLAC for fast saving. |

Task Types

ACE-Step supports six generation task types, each optimized for specific use cases.

1. Text2Music (Default)

Purpose: Generate music from text descriptions and optional metadata.

Key Parameters:

params = GenerationParams(
    task_type="text2music",
    caption="energetic rock music with electric guitar",
    lyrics="[Instrumental]",  # or actual lyrics
    bpm=140,
    duration=30,
)

Required:

  • caption or lyrics (at least one)

Optional but Recommended:

  • bpm: Controls tempo
  • keyscale: Controls musical key
  • timesignature: Controls rhythm structure
  • duration: Controls length
  • vocal_language: Controls vocal characteristics

Use Cases:

  • Generate music from text descriptions
  • Create backing tracks from prompts
  • Generate songs with lyrics

2. Cover

Purpose: Transform existing audio while maintaining structure but changing style/timbre.

Key Parameters:

params = GenerationParams(
    task_type="cover",
    src_audio="original_song.mp3",
    caption="jazz piano version",
    audio_cover_strength=0.8,  # 0.0-1.0
)

Required:

  • src_audio: Path to source audio file
  • caption: Description of desired style/transformation

Optional:

  • audio_cover_strength: Controls influence of original audio
    • 1.0: Strong adherence to original structure
    • 0.5: Balanced transformation
    • 0.1: Loose interpretation
  • lyrics: New lyrics (if changing vocals)

Use Cases:

  • Create covers in different styles
  • Change instrumentation while keeping melody
  • Genre transformation

3. Repaint

Purpose: Regenerate a specific time segment of audio while keeping the rest unchanged.

Key Parameters:

params = GenerationParams(
    task_type="repaint",
    src_audio="original.mp3",
    repainting_start=10.0,  # seconds
    repainting_end=20.0,    # seconds
    caption="smooth transition with piano solo",
)

Required:

  • src_audio: Path to source audio file
  • repainting_start: Start time in seconds
  • repainting_end: End time in seconds (use -1 for end of file)
  • caption: Description of desired content for repainted section

Use Cases:

  • Fix specific sections of generated music
  • Add variations to parts of a song
  • Create smooth transitions
  • Replace problematic segments

4. Lego (Base Model Only)

Purpose: Generate a specific instrument track in context of existing audio.

Key Parameters:

params = GenerationParams(
    task_type="lego",
    src_audio="backing_track.mp3",
    instruction="Generate the guitar track based on the audio context:",
    caption="lead guitar melody with bluesy feel",
    repainting_start=0.0,
    repainting_end=-1,
)

Required:

  • src_audio: Path to source/backing audio
  • instruction: Must specify the track type (e.g., "Generate the {TRACK_NAME} track...")
  • caption: Description of desired track characteristics

Available Tracks:

  • "vocals", "backing_vocals", "drums", "bass", "guitar", "keyboard",
  • "percussion", "strings", "synth", "fx", "brass", "woodwinds"

Use Cases:

  • Add specific instrument tracks
  • Layer additional instruments over backing tracks
  • Create multi-track compositions iteratively

5. Extract (Base Model Only)

Purpose: Extract/isolate a specific instrument track from mixed audio.

Key Parameters:

params = GenerationParams(
    task_type="extract",
    src_audio="full_mix.mp3",
    instruction="Extract the vocals track from the audio:",
)

Required:

  • src_audio: Path to mixed audio file
  • instruction: Must specify track to extract

Available Tracks: Same as Lego task

Use Cases:

  • Stem separation
  • Isolate specific instruments
  • Create remixes
  • Analyze individual tracks

6. Complete (Base Model Only)

Purpose: Complete/extend partial tracks with specified instruments.

Key Parameters:

params = GenerationParams(
    task_type="complete",
    src_audio="incomplete_track.mp3",
    instruction="Complete the input track with drums, bass, guitar:",
    caption="rock style completion",
)

Required:

  • src_audio: Path to incomplete/partial track
  • instruction: Must specify which tracks to add
  • caption: Description of desired style

Use Cases:

  • Arrange incomplete compositions
  • Add backing tracks
  • Auto-complete musical ideas

Helper Functions

understand_music

Analyze audio codes to extract metadata about the music.

from acestep.inference import understand_music

result = understand_music(
    llm_handler=llm_handler,
    audio_codes="<|audio_code_123|><|audio_code_456|>...",
    temperature=0.85,
    use_constrained_decoding=True,
)

if result.success:
    print(f"Caption: {result.caption}")
    print(f"Lyrics: {result.lyrics}")
    print(f"BPM: {result.bpm}")
    print(f"Key: {result.keyscale}")
    print(f"Duration: {result.duration}s")
    print(f"Language: {result.language}")
else:
    print(f"Error: {result.error}")

Use Cases:

  • Analyze existing music
  • Extract metadata from audio codes
  • Reverse-engineer generation parameters

create_sample

Generate a complete music sample from a natural language description. This is the "Simple Mode" / "Inspiration Mode" feature.

from acestep.inference import create_sample

result = create_sample(
    llm_handler=llm_handler,
    query="a soft Bengali love song for a quiet evening",
    instrumental=False,
    vocal_language="bn",  # Optional: constrain to Bengali
    temperature=0.85,
)

if result.success:
    print(f"Caption: {result.caption}")
    print(f"Lyrics: {result.lyrics}")
    print(f"BPM: {result.bpm}")
    print(f"Duration: {result.duration}s")
    print(f"Key: {result.keyscale}")
    print(f"Is Instrumental: {result.instrumental}")
    
    # Use with generate_music
    params = GenerationParams(
        caption=result.caption,
        lyrics=result.lyrics,
        bpm=result.bpm,
        duration=result.duration,
        keyscale=result.keyscale,
        vocal_language=result.language,
    )
else:
    print(f"Error: {result.error}")

Parameters:

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| query | str | required | Natural language description of the desired music. |
| instrumental | bool | False | Whether to generate instrumental music. |
| vocal_language | Optional[str] | None | Constrain lyrics to a specific language (e.g., "en", "zh", "bn"). |
| temperature | float | 0.85 | Sampling temperature. |
| top_k | Optional[int] | None | Top-k sampling (None disables). |
| top_p | Optional[float] | None | Top-p sampling (None disables). |
| repetition_penalty | float | 1.0 | Repetition penalty. |
| use_constrained_decoding | bool | True | Use FSM-based constrained decoding. |

format_sample

Format and enhance user-provided caption and lyrics, generating structured metadata.

from acestep.inference import format_sample

result = format_sample(
    llm_handler=llm_handler,
    caption="Latin pop, reggaeton",
    lyrics="[Verse 1]\nBailando en la noche...",
    user_metadata={"bpm": 95},  # Optional: constrain specific values
    temperature=0.85,
)

if result.success:
    print(f"Enhanced Caption: {result.caption}")
    print(f"Formatted Lyrics: {result.lyrics}")
    print(f"BPM: {result.bpm}")
    print(f"Duration: {result.duration}s")
    print(f"Key: {result.keyscale}")
    print(f"Detected Language: {result.language}")
else:
    print(f"Error: {result.error}")

Parameters:

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| caption | str | required | User's caption/description. |
| lyrics | str | required | User's lyrics with structure tags. |
| user_metadata | Optional[Dict] | None | Constrain specific metadata values (bpm, duration, keyscale, timesignature, language). |
| temperature | float | 0.85 | Sampling temperature. |
| top_k | Optional[int] | None | Top-k sampling (None disables). |
| top_p | Optional[float] | None | Top-p sampling (None disables). |
| repetition_penalty | float | 1.0 | Repetition penalty. |
| use_constrained_decoding | bool | True | Use FSM-based constrained decoding. |

Complete Examples

Example 1: Simple Text-to-Music Generation

from acestep.inference import GenerationParams, GenerationConfig, generate_music

params = GenerationParams(
    task_type="text2music",
    caption="calm ambient music with soft piano and strings",
    duration=60,
    bpm=80,
    keyscale="C Major",
)

config = GenerationConfig(
    batch_size=2,  # Generate 2 variations
    audio_format="flac",
)

result = generate_music(dit_handler, llm_handler, params, config, save_dir="/output")

if result.success:
    for i, audio in enumerate(result.audios, 1):
        print(f"Variation {i}: {audio['path']}")

Example 2: Song Generation with Lyrics

params = GenerationParams(
    task_type="text2music",
    caption="pop ballad with emotional vocals",
    lyrics="""Verse 1:
Walking down the street today
Thinking of the words you used to say
Everything feels different now
But I'll find my way somehow

Chorus:
I'm moving on, I'm staying strong
This is where I belong
""",
    vocal_language="en",
    bpm=72,
    duration=45,
)

config = GenerationConfig(batch_size=1)

result = generate_music(dit_handler, llm_handler, params, config, save_dir="/output")

Example 3: Using Custom Timesteps

params = GenerationParams(
    task_type="text2music",
    caption="jazz fusion with complex harmonies",
    # Custom 9-step schedule
    timesteps=[0.97, 0.76, 0.615, 0.5, 0.395, 0.28, 0.18, 0.085, 0],
    thinking=True,
)

config = GenerationConfig(batch_size=1)

result = generate_music(dit_handler, llm_handler, params, config, save_dir="/output")

Example 4: Using Shift Parameter (Turbo Model)

params = GenerationParams(
    task_type="text2music",
    caption="upbeat electronic dance music",
    inference_steps=8,
    shift=3.0,  # Recommended for turbo models
    infer_method="ode",
)

config = GenerationConfig(batch_size=2)

result = generate_music(dit_handler, llm_handler, params, config, save_dir="/output")

Example 5: Simple Mode with create_sample

from acestep.inference import create_sample, GenerationParams, GenerationConfig, generate_music

# Step 1: Create sample from description
sample = create_sample(
    llm_handler=llm_handler,
    query="energetic K-pop dance track with catchy hooks",
    vocal_language="ko",
)

if sample.success:
    # Step 2: Generate music using the sample
    params = GenerationParams(
        caption=sample.caption,
        lyrics=sample.lyrics,
        bpm=sample.bpm,
        duration=sample.duration,
        keyscale=sample.keyscale,
        vocal_language=sample.language,
        thinking=True,
    )
    
    config = GenerationConfig(batch_size=2)
    result = generate_music(dit_handler, llm_handler, params, config, save_dir="/output")

Example 6: Format and Enhance User Input

from acestep.inference import format_sample, GenerationParams, GenerationConfig, generate_music

# Step 1: Format user input
formatted = format_sample(
    llm_handler=llm_handler,
    caption="rock ballad",
    lyrics="[Verse]\nIn the darkness I find my way...",
)

if formatted.success:
    # Step 2: Generate with enhanced input
    params = GenerationParams(
        caption=formatted.caption,
        lyrics=formatted.lyrics,
        bpm=formatted.bpm,
        duration=formatted.duration,
        keyscale=formatted.keyscale,
        thinking=True,
        use_cot_metas=False,  # Already formatted, skip metas CoT
    )
    
    config = GenerationConfig(batch_size=2)
    result = generate_music(dit_handler, llm_handler, params, config, save_dir="/output")

Example 7: Style Cover with LM Reasoning

params = GenerationParams(
    task_type="cover",
    src_audio="original_pop_song.mp3",
    caption="orchestral symphonic arrangement",
    audio_cover_strength=0.7,
    thinking=True,  # Enable LM for metadata
    use_cot_metas=True,
)

config = GenerationConfig(batch_size=1)

result = generate_music(dit_handler, llm_handler, params, config, save_dir="/output")

# Access LM-generated metadata
if result.extra_outputs.get("lm_metadata"):
    lm_meta = result.extra_outputs["lm_metadata"]
    print(f"LM detected BPM: {lm_meta.get('bpm')}")
    print(f"LM detected Key: {lm_meta.get('keyscale')}")

Example 8: Batch Generation with Specific Seeds

params = GenerationParams(
    task_type="text2music",
    caption="epic cinematic trailer music",
)

config = GenerationConfig(
    batch_size=4,           # Generate 4 variations
    seeds=[42, 123, 456],   # Specify 3 seeds, 4th will be random
    use_random_seed=False,  # Use provided seeds
    lm_batch_chunk_size=2,  # Process 2 at a time (GPU memory)
)

result = generate_music(dit_handler, llm_handler, params, config, save_dir="/output")

if result.success:
    print(f"Generated {len(result.audios)} variations")
    for audio in result.audios:
        print(f"  Seed {audio['params']['seed']}: {audio['path']}")

Example 9: High-Quality Generation (Base Model)

params = GenerationParams(
    task_type="text2music",
    caption="intricate jazz fusion with complex harmonies",
    inference_steps=64,     # High quality
    guidance_scale=8.0,
    use_adg=True,           # Adaptive Dual Guidance
    cfg_interval_start=0.0,
    cfg_interval_end=1.0,
    shift=3.0,              # Timestep shift
    seed=42,                # Reproducible results
)

config = GenerationConfig(
    batch_size=1,
    use_random_seed=False,
    audio_format="wav",     # Lossless format
)

result = generate_music(dit_handler, llm_handler, params, config, save_dir="/output")

Example 10: Understand Audio from Codes

from acestep.inference import understand_music

# Analyze audio codes (e.g., from a previous generation)
result = understand_music(
    llm_handler=llm_handler,
    audio_codes="<|audio_code_10695|><|audio_code_54246|>...",
    temperature=0.85,
)

if result.success:
    print(f"Detected Caption: {result.caption}")
    print(f"Detected Lyrics: {result.lyrics}")
    print(f"Detected BPM: {result.bpm}")
    print(f"Detected Key: {result.keyscale}")
    print(f"Detected Duration: {result.duration}s")
    print(f"Detected Language: {result.language}")

Best Practices

1. Caption Writing

Good Captions:

# Specific and descriptive
caption="upbeat electronic dance music with heavy bass and synthesizer leads"

# Include mood and genre
caption="melancholic indie folk with acoustic guitar and soft vocals"

# Specify instruments
caption="jazz trio with piano, upright bass, and brush drums"

Avoid:

# Too vague
caption="good music"

# Contradictory
caption="fast slow music"  # Conflicting tempos

2. Parameter Tuning

For Best Quality:

  • Use base model with inference_steps=64 or higher
  • Enable use_adg=True
  • Set guidance_scale=7.0-9.0
  • Set shift=3.0 for better timestep distribution
  • Use lossless audio format (audio_format="wav")

For Speed:

  • Use turbo model with inference_steps=8
  • Disable ADG (use_adg=False)
  • Use infer_method="ode" (default)
  • Use compressed format (audio_format="mp3") or default FLAC

For Consistency:

  • Set use_random_seed=False in config
  • Use fixed seeds list or single seed in params
  • Keep lm_temperature lower (0.7-0.85)

For Diversity:

  • Set use_random_seed=True in config
  • Increase lm_temperature (0.9-1.1)
  • Use batch_size > 1 for variations
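
Putting those knobs together, here is a rough sketch of a reproducible setup next to a more exploratory one:

from acestep.inference import GenerationParams, GenerationConfig

# Reproducible: fixed seed, lower LM temperature
repro_params = GenerationParams(caption="ambient pad texture", lm_temperature=0.75)
repro_config = GenerationConfig(batch_size=1, use_random_seed=False, seeds=[1234])

# Diverse: random seeds, higher LM temperature, several variations at once
diverse_params = GenerationParams(caption="ambient pad texture", lm_temperature=1.0)
diverse_config = GenerationConfig(batch_size=4, use_random_seed=True)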

3. Duration Guidelines

  • Instrumental: 30-180 seconds works well
  • With Lyrics: Auto-detection recommended (set duration=-1 or leave default)
  • Short clips: 10-20 seconds minimum
  • Long form: Up to 600 seconds (10 minutes) maximum

4. LM Usage

When to Enable LM (thinking=True):

  • Need automatic metadata detection
  • Want caption refinement
  • Generating from minimal input
  • Need diverse outputs

When to Disable LM (thinking=False):

  • Have precise metadata already
  • Need faster generation
  • Want full control over parameters
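
For example, when all metadata is already known, a configuration along these lines skips LM reasoning entirely:

params = GenerationParams(
    caption="driving synthwave with arpeggiated bass",
    lyrics="[Instrumental]",
    bpm=110,
    keyscale="F# minor",
    duration=60,
    thinking=False,   # skip 5Hz LM reasoning; the values above are used as-is
)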

5. Batch Processing

# Efficient batch generation
config = GenerationConfig(
    batch_size=8,           # Max supported
    allow_lm_batch=True,    # Enable for speed (when thinking=True)
    lm_batch_chunk_size=4,  # Adjust based on GPU memory
)

6. Error Handling

result = generate_music(dit_handler, llm_handler, params, config, save_dir="/output")

if not result.success:
    print(f"Generation failed: {result.error}")
    print(f"Status: {result.status_message}")
else:
    # Process successful result
    for audio in result.audios:
        path = audio['path']
        key = audio['key']
        seed = audio['params']['seed']
        # ... process audio files

7. Memory Management

For large batch sizes or long durations:

  • Monitor GPU memory usage
  • Reduce batch_size if OOM errors occur
  • Reduce lm_batch_chunk_size for LM operations
  • Consider using offload_to_cpu=True during initialization
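
A sketch of the last point is shown below; the offload_to_cpu keyword is taken from the note above, but its exact placement on initialize_service is an assumption, so verify it against your installed version's signature.

# Assumption: offload_to_cpu is passed to initialize_service; the parameter
# name comes from the note above -- check your version's actual signature.
dit_handler.initialize_service(
    project_root="/path/to/project",
    config_path="acestep-v15-turbo",
    device="cuda",
    offload_to_cpu=True,   # keep idle weights on the CPU to reduce GPU memory
)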

8. Accessing Time Costs

result = generate_music(dit_handler, llm_handler, params, config, save_dir="/output")

if result.success:
    time_costs = result.extra_outputs.get("time_costs", {})
    print(f"LM Phase 1 Time: {time_costs.get('lm_phase1_time', 0):.2f}s")
    print(f"LM Phase 2 Time: {time_costs.get('lm_phase2_time', 0):.2f}s")
    print(f"DiT Total Time: {time_costs.get('dit_total_time_cost', 0):.2f}s")
    print(f"Pipeline Total: {time_costs.get('pipeline_total_time', 0):.2f}s")

Troubleshooting

Common Issues

Issue: Out of memory errors

  • Solution: Reduce batch_size, inference_steps, or enable CPU offloading

Issue: Poor quality results

  • Solution: Increase inference_steps, adjust guidance_scale, use base model

Issue: Results don't match prompt

  • Solution: Make caption more specific, increase guidance_scale, enable LM refinement (thinking=True)

Issue: Slow generation

  • Solution: Use turbo model, reduce inference_steps, disable ADG

Issue: LM not generating codes

  • Solution: Verify llm_handler is initialized, check thinking=True and use_cot_metas=True

Issue: Seeds not being respected

  • Solution: Set use_random_seed=False in config and provide seeds list or seed in params

Issue: Custom timesteps not working

  • Solution: Ensure timesteps are a list of floats from 1.0 to 0.0, properly ordered
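
A quick sanity check along those lines (a standalone sketch, not part of the API):

def validate_timesteps(ts):
    # Values in [0.0, 1.0], strictly decreasing, ending at 0.0
    # (matching the documented example schedule).
    return (
        len(ts) > 0
        and all(0.0 <= t <= 1.0 for t in ts)
        and all(a > b for a, b in zip(ts, ts[1:]))
        and ts[-1] == 0.0
    )

assert validate_timesteps([0.97, 0.76, 0.615, 0.5, 0.395, 0.28, 0.18, 0.085, 0])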

API Reference Summary

GenerationParams Fields

See GenerationParams Parameters for complete documentation.

GenerationConfig Fields

See GenerationConfig Parameters for complete documentation.

GenerationResult Fields

@dataclass
class GenerationResult:
    # Audio Outputs
    audios: List[Dict[str, Any]]
    # Each audio dict contains:
    #   - "path": str (file path)
    #   - "tensor": Tensor (audio data)
    #   - "key": str (unique identifier)
    #   - "sample_rate": int (48000)
    #   - "params": Dict (generation params with seed, audio_codes, etc.)
    
    # Generation Information
    status_message: str
    extra_outputs: Dict[str, Any]
    # extra_outputs contains:
    #   - "lm_metadata": Dict (LM-generated metadata)
    #   - "time_costs": Dict (timing information)
    #   - "latents": Tensor (intermediate latents, if available)
    #   - "masks": Tensor (attention masks, if available)
    
    # Success Status
    success: bool
    error: Optional[str]

Version History

  • v1.5.2: Current version

    • Added shift parameter for timestep shifting
    • Added infer_method parameter for ODE/SDE selection
    • Added timesteps parameter for custom timestep schedules
    • Added understand_music() function for audio analysis
    • Added create_sample() function for simple mode generation
    • Added format_sample() function for input enhancement
    • Added UnderstandResult, CreateSampleResult, FormatSampleResult dataclasses
  • v1.5.1: Previous version

    • Split GenerationConfig into GenerationParams and GenerationConfig
    • Renamed parameters for consistency (key_scale → keyscale, time_signature → timesignature, audio_duration → duration, use_llm_thinking → thinking, audio_code_string → audio_codes)
    • Added instrumental parameter
    • Added use_constrained_decoding parameter
    • Added CoT auto-filled fields (cot_*)
    • Changed default audio_format to "flac"
    • Changed default batch_size to 2
    • Changed default thinking to True
    • Simplified GenerationResult structure with unified audios list
    • Added unified time_costs in extra_outputs
  • v1.5: Initial version

    • Introduced GenerationConfig and GenerationResult dataclasses
    • Simplified parameter passing
    • Added comprehensive documentation

For more information, see: