ACE-Step Inference API Documentation
Language / 语言 / 言語: English | 中文 | 日本語
This document provides comprehensive documentation for the ACE-Step inference API, including parameter specifications for all supported task types.
Table of Contents
- Quick Start
- API Overview
- GenerationParams Parameters
- GenerationConfig Parameters
- Task Types
- Helper Functions
- Complete Examples
- Best Practices
- Troubleshooting
- API Reference Summary
- Version History
Quick Start
Basic Usage
from acestep.handler import AceStepHandler
from acestep.llm_inference import LLMHandler
from acestep.inference import GenerationParams, GenerationConfig, generate_music
# Initialize handlers
dit_handler = AceStepHandler()
llm_handler = LLMHandler()
# Initialize services
dit_handler.initialize_service(
    project_root="/path/to/project",
    config_path="acestep-v15-turbo",
    device="cuda",
)
llm_handler.initialize(
    checkpoint_dir="/path/to/checkpoints",
    lm_model_path="acestep-5Hz-lm-0.6B",
    backend="vllm",
    device="cuda",
)
# Configure generation parameters
params = GenerationParams(
    caption="upbeat electronic dance music with heavy bass",
    bpm=128,
    duration=30,
)
# Configure generation settings
config = GenerationConfig(
    batch_size=2,
    audio_format="flac",
)
# Generate music
result = generate_music(dit_handler, llm_handler, params, config, save_dir="/path/to/output")
# Access results
if result.success:
    for audio in result.audios:
        print(f"Generated: {audio['path']}")
        print(f"Key: {audio['key']}")
        print(f"Seed: {audio['params']['seed']}")
else:
    print(f"Error: {result.error}")
API Overview
Main Functions
generate_music
def generate_music(
    dit_handler,
    llm_handler,
    params: GenerationParams,
    config: GenerationConfig,
    save_dir: Optional[str] = None,
    progress=None,
) -> GenerationResult
Main function for generating music using the ACE-Step model.
understand_music
def understand_music(
    llm_handler,
    audio_codes: str,
    temperature: float = 0.85,
    top_k: Optional[int] = None,
    top_p: Optional[float] = None,
    repetition_penalty: float = 1.0,
    use_constrained_decoding: bool = True,
    constrained_decoding_debug: bool = False,
) -> UnderstandResult
Analyze audio semantic codes and extract metadata (caption, lyrics, BPM, key, etc.).
create_sample
def create_sample(
    llm_handler,
    query: str,
    instrumental: bool = False,
    vocal_language: Optional[str] = None,
    temperature: float = 0.85,
    top_k: Optional[int] = None,
    top_p: Optional[float] = None,
    repetition_penalty: float = 1.0,
    use_constrained_decoding: bool = True,
    constrained_decoding_debug: bool = False,
) -> CreateSampleResult
Generate a complete music sample (caption, lyrics, metadata) from a natural language description.
format_sample
def format_sample(
    llm_handler,
    caption: str,
    lyrics: str,
    user_metadata: Optional[Dict[str, Any]] = None,
    temperature: float = 0.85,
    top_k: Optional[int] = None,
    top_p: Optional[float] = None,
    repetition_penalty: float = 1.0,
    use_constrained_decoding: bool = True,
    constrained_decoding_debug: bool = False,
) -> FormatSampleResult
Format and enhance user-provided caption and lyrics, generating structured metadata.
Configuration Objects
The API uses two configuration dataclasses:
GenerationParams - Contains all music generation parameters:
@dataclass
class GenerationParams:
    # Task & Instruction
    task_type: str = "text2music"
    instruction: str = "Fill the audio semantic mask based on the given conditions:"

    # Audio Uploads
    reference_audio: Optional[str] = None
    src_audio: Optional[str] = None

    # LM Codes Hints
    audio_codes: str = ""

    # Text Inputs
    caption: str = ""
    lyrics: str = ""
    instrumental: bool = False

    # Metadata
    vocal_language: str = "unknown"
    bpm: Optional[int] = None
    keyscale: str = ""
    timesignature: str = ""
    duration: float = -1.0

    # Advanced Settings
    inference_steps: int = 8
    seed: int = -1
    guidance_scale: float = 7.0
    use_adg: bool = False
    cfg_interval_start: float = 0.0
    cfg_interval_end: float = 1.0
    shift: float = 1.0                       # NEW: Timestep shift factor
    infer_method: str = "ode"                # NEW: Diffusion inference method
    timesteps: Optional[List[float]] = None  # NEW: Custom timesteps
    repainting_start: float = 0.0
    repainting_end: float = -1
    audio_cover_strength: float = 1.0

    # 5Hz Language Model Parameters
    thinking: bool = True
    lm_temperature: float = 0.85
    lm_cfg_scale: float = 2.0
    lm_top_k: int = 0
    lm_top_p: float = 0.9
    lm_negative_prompt: str = "NO USER INPUT"
    use_cot_metas: bool = True
    use_cot_caption: bool = True
    use_cot_lyrics: bool = False
    use_cot_language: bool = True
    use_constrained_decoding: bool = True

    # CoT Generated Values (auto-filled by LM)
    cot_bpm: Optional[int] = None
    cot_keyscale: str = ""
    cot_timesignature: str = ""
    cot_duration: Optional[float] = None
    cot_vocal_language: str = "unknown"
    cot_caption: str = ""
    cot_lyrics: str = ""
GenerationConfig - Contains batch and output configuration:
@dataclass
class GenerationConfig:
    batch_size: int = 2
    allow_lm_batch: bool = False
    use_random_seed: bool = True
    seeds: Optional[List[int]] = None
    lm_batch_chunk_size: int = 8
    constrained_decoding_debug: bool = False
    audio_format: str = "flac"
Result Objects
GenerationResult - Result of music generation:
@dataclass
class GenerationResult:
    # Audio Outputs
    audios: List[Dict[str, Any]]   # List of audio dictionaries

    # Generation Information
    status_message: str            # Status message from generation
    extra_outputs: Dict[str, Any]  # Extra outputs (latents, masks, lm_metadata, time_costs)

    # Success Status
    success: bool                  # Whether generation succeeded
    error: Optional[str]           # Error message if failed
Audio Dictionary Structure:
Each item in audios list contains:
{
    "path": str,         # File path to saved audio
    "tensor": Tensor,    # Audio tensor [channels, samples], CPU, float32
    "key": str,          # Unique audio key (UUID based on params)
    "sample_rate": int,  # Sample rate (default: 48000)
    "params": Dict,      # Generation params for this audio (includes seed, audio_codes, etc.)
}
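For instance, a returned entry can be inspected or re-exported straight from its tensor. A minimal sketch (the torchaudio re-save and peak check are illustrative, not part of the ACE-Step API):
import torchaudio
def inspect_audio(audio: dict) -> None:
    """Illustrative consumer of one entry from result.audios."""
    tensor = audio["tensor"]          # [channels, samples], CPU, float32
    sr = audio["sample_rate"]         # 48000 by default
    peak = tensor.abs().max().item()  # simple level sanity check
    print(f"{audio['key']}: {tensor.shape[-1] / sr:.1f}s, peak {peak:.3f}")
    torchaudio.save(f"/tmp/{audio['key']}.wav", tensor, sr)  # hypothetical output path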
UnderstandResult - Result of music understanding:
@dataclass
class UnderstandResult:
    # Metadata Fields
    caption: str = ""
    lyrics: str = ""
    bpm: Optional[int] = None
    duration: Optional[float] = None
    keyscale: str = ""
    language: str = ""
    timesignature: str = ""

    # Status
    status_message: str = ""
    success: bool = True
    error: Optional[str] = None
CreateSampleResult - Result of sample creation:
@dataclass
class CreateSampleResult:
    # Metadata Fields
    caption: str = ""
    lyrics: str = ""
    bpm: Optional[int] = None
    duration: Optional[float] = None
    keyscale: str = ""
    language: str = ""
    timesignature: str = ""
    instrumental: bool = False

    # Status
    status_message: str = ""
    success: bool = True
    error: Optional[str] = None
FormatSampleResult - Result of sample formatting:
@dataclass
class FormatSampleResult:
    # Metadata Fields
    caption: str = ""
    lyrics: str = ""
    bpm: Optional[int] = None
    duration: Optional[float] = None
    keyscale: str = ""
    language: str = ""
    timesignature: str = ""

    # Status
    status_message: str = ""
    success: bool = True
    error: Optional[str] = None
GenerationParams Parameters
Text Inputs
| Parameter | Type | Default | Description |
|---|---|---|---|
| `caption` | str | `""` | Text description of the desired music. Can be a simple prompt like "relaxing piano music" or a detailed description with genre, mood, instruments, etc. Max 512 characters. |
| `lyrics` | str | `""` | Lyrics text for vocal music. Use "[Instrumental]" for instrumental tracks. Supports multiple languages. Max 4096 characters. |
| `instrumental` | bool | `False` | If True, generate instrumental music regardless of lyrics. |
Music Metadata
| Parameter | Type | Default | Description |
|---|---|---|---|
| `bpm` | Optional[int] | `None` | Beats per minute (30-300). None enables auto-detection via LM. |
| `keyscale` | str | `""` | Musical key (e.g., "C Major", "Am", "F# minor"). Empty string enables auto-detection. |
| `timesignature` | str | `""` | Time signature (2 for '2/4', 3 for '3/4', 4 for '4/4', 6 for '6/8'). Empty string enables auto-detection. |
| `vocal_language` | str | `"unknown"` | Language code for vocals (ISO 639-1). Supported: "en", "zh", "ja", "es", "fr", etc. Use "unknown" for auto-detection. |
| `duration` | float | `-1.0` | Target audio length in seconds (10-600). If <= 0 or None, the model chooses automatically based on lyrics length. |
Generation Parameters
| Parameter | Type | Default | Description |
|---|---|---|---|
| `inference_steps` | int | `8` | Number of denoising steps. Turbo model: 1-20 (recommended 8). Base model: 1-200 (recommended 32-64). Higher = better quality but slower. |
| `guidance_scale` | float | `7.0` | Classifier-free guidance scale (1.0-15.0). Higher values increase adherence to the text prompt. Only supported for the non-turbo model. Typical range: 5.0-9.0. |
| `seed` | int | `-1` | Random seed for reproducibility. Use -1 for a random seed, or any positive integer for a fixed seed. |
Advanced DiT Parameters
| Parameter | Type | Default | Description |
|---|---|---|---|
| `use_adg` | bool | `False` | Use Adaptive Dual Guidance (base model only). Improves quality at the cost of speed. |
| `cfg_interval_start` | float | `0.0` | CFG application start ratio (0.0-1.0). Controls when to start applying classifier-free guidance. |
| `cfg_interval_end` | float | `1.0` | CFG application end ratio (0.0-1.0). Controls when to stop applying classifier-free guidance. |
| `shift` | float | `1.0` | Timestep shift factor (range 1.0-5.0). When != 1.0, applies t' = shift * t / (1 + (shift - 1) * t) to each timestep t. Recommended 3.0 for turbo models. |
| `infer_method` | str | `"ode"` | Diffusion inference method. "ode" (Euler) is faster and deterministic. "sde" (stochastic) may produce different results with variance. |
| `timesteps` | Optional[List[float]] | `None` | Custom timesteps as a list of floats from 1.0 to 0.0 (e.g., [0.97, 0.76, 0.615, 0.5, 0.395, 0.28, 0.18, 0.085, 0]). If provided, overrides `inference_steps` and `shift`. |
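To make the shift behavior concrete, the sketch below applies the formula from the table to a uniform schedule; build_timesteps is a local helper for illustration, not an ACE-Step function:
def shift_timestep(t: float, shift: float) -> float:
    # t' = shift * t / (1 + (shift - 1) * t), as described above
    return shift * t / (1 + (shift - 1) * t)
def build_timesteps(steps: int = 8, shift: float = 3.0) -> list:
    # Uniform schedule from 1.0 down to 0.0, then shifted
    uniform = [1 - i / steps for i in range(steps + 1)]
    return [shift_timestep(t, shift) for t in uniform]
print(build_timesteps())  # larger shift keeps more steps at high noise levels
A schedule built this way could also be passed explicitly via `timesteps`, which overrides `inference_steps` and `shift`.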
Task-Specific Parameters
| Parameter | Type | Default | Description |
|---|---|---|---|
| `task_type` | str | `"text2music"` | Generation task type. See the Task Types section for details. |
| `instruction` | str | `"Fill the audio semantic mask based on the given conditions:"` | Task-specific instruction prompt. |
| `reference_audio` | Optional[str] | `None` | Path to reference audio file for style transfer or continuation tasks. |
| `src_audio` | Optional[str] | `None` | Path to source audio file for audio-to-audio tasks (cover, repaint, etc.). |
| `audio_codes` | str | `""` | Pre-extracted 5Hz audio semantic codes as a string. Advanced use only. |
| `repainting_start` | float | `0.0` | Repainting start time in seconds (for repaint/lego tasks). |
| `repainting_end` | float | `-1` | Repainting end time in seconds. Use -1 for end of audio. |
| `audio_cover_strength` | float | `1.0` | Strength of audio cover/codes influence (0.0-1.0). Set smaller (0.2) for style transfer tasks. |
5Hz Language Model Parameters
| Parameter | Type | Default | Description |
|---|---|---|---|
| `thinking` | bool | `True` | Enable 5Hz Language Model "Chain-of-Thought" reasoning for semantic/music metadata and codes. |
| `lm_temperature` | float | `0.85` | LM sampling temperature (0.0-2.0). Higher = more creative/diverse, lower = more conservative. |
| `lm_cfg_scale` | float | `2.0` | LM classifier-free guidance scale. Higher = stronger adherence to prompt. |
| `lm_top_k` | int | `0` | LM top-k sampling. 0 disables top-k filtering. Typical values: 40-100. |
| `lm_top_p` | float | `0.9` | LM nucleus sampling (0.0-1.0). 1.0 disables nucleus sampling. Typical values: 0.9-0.95. |
| `lm_negative_prompt` | str | `"NO USER INPUT"` | Negative prompt for LM guidance. Helps avoid unwanted characteristics. |
| `use_cot_metas` | bool | `True` | Generate metadata using LM CoT reasoning (BPM, key, duration, etc.). |
| `use_cot_caption` | bool | `True` | Refine the user caption using LM CoT reasoning. |
| `use_cot_language` | bool | `True` | Detect vocal language using LM CoT reasoning. |
| `use_cot_lyrics` | bool | `False` | (Reserved for future use) Generate/refine lyrics using LM CoT. |
| `use_constrained_decoding` | bool | `True` | Enable constrained decoding for structured LM output. |
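To see how these switches interact, the sketch below contrasts a fully LM-driven call with a fully manual one (the field values are arbitrary examples):
# LM-driven: let CoT fill metadata, refine the caption, and detect language
lm_driven = GenerationParams(
    caption="dreamy synthwave at night",
    thinking=True,
    use_cot_metas=True,
    use_cot_caption=True,
    lm_temperature=0.85,
    lm_top_p=0.9,
)
# Manual: disable the LM and supply every field yourself
manual = GenerationParams(
    caption="dreamy synthwave at night",
    thinking=False,
    bpm=100,
    keyscale="A Minor",
    duration=45,
    vocal_language="en",
)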
CoT Generated Values
These fields are automatically populated by the LM when CoT reasoning is enabled:
| Parameter | Type | Default | Description |
|---|---|---|---|
| `cot_bpm` | Optional[int] | `None` | LM-generated BPM value. |
| `cot_keyscale` | str | `""` | LM-generated key/scale. |
| `cot_timesignature` | str | `""` | LM-generated time signature. |
| `cot_duration` | Optional[float] | `None` | LM-generated duration. |
| `cot_vocal_language` | str | `"unknown"` | LM-detected vocal language. |
| `cot_caption` | str | `""` | LM-refined caption. |
| `cot_lyrics` | str | `""` | LM-generated/refined lyrics. |
GenerationConfig Parameters
| Parameter | Type | Default | Description |
|---|---|---|---|
| `batch_size` | int | `2` | Number of samples to generate in parallel (1-8). Higher values require more GPU memory. |
| `allow_lm_batch` | bool | `False` | Allow batch processing in the LM. Faster when batch_size >= 2 and thinking=True. |
| `use_random_seed` | bool | `True` | Whether to use a random seed. True for different results each time, False for reproducible results. |
| `seeds` | Optional[List[int]] | `None` | List of seeds for batch generation. If fewer than batch_size are provided, the list is padded with random seeds. Can also be a single int. |
| `lm_batch_chunk_size` | int | `8` | Maximum batch size per LM inference chunk (GPU memory constraint). |
| `constrained_decoding_debug` | bool | `False` | Enable debug logging for constrained decoding. |
| `audio_format` | str | `"flac"` | Output audio format. Options: "mp3", "wav", "flac". Default is FLAC for fast saving. |
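The documented seed handling can be pictured with a short sketch; pad_seeds is a hypothetical helper mirroring the semantics described above, not the library's internal code:
import random
def pad_seeds(seeds, batch_size: int) -> list:
    # A single int is treated as a one-element list
    if isinstance(seeds, int):
        seeds = [seeds]
    seeds = list(seeds or [])
    # Pad with random seeds up to batch_size, as documented
    while len(seeds) < batch_size:
        seeds.append(random.randint(0, 2**31 - 1))
    return seeds[:batch_size]
print(pad_seeds([42, 123], 4))  # e.g., [42, 123, <random>, <random>]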
Task Types
ACE-Step supports 6 different generation task types, each optimized for specific use cases.
1. Text2Music (Default)
Purpose: Generate music from text descriptions and optional metadata.
Key Parameters:
params = GenerationParams(
    task_type="text2music",
    caption="energetic rock music with electric guitar",
    lyrics="[Instrumental]",  # or actual lyrics
    bpm=140,
    duration=30,
)
Required:
- `caption` or `lyrics` (at least one)
Optional but Recommended:
- `bpm`: Controls tempo
- `keyscale`: Controls musical key
- `timesignature`: Controls rhythm structure
- `duration`: Controls length
- `vocal_language`: Controls vocal characteristics
Use Cases:
- Generate music from text descriptions
- Create backing tracks from prompts
- Generate songs with lyrics
2. Cover
Purpose: Transform existing audio while maintaining structure but changing style/timbre.
Key Parameters:
params = GenerationParams(
    task_type="cover",
    src_audio="original_song.mp3",
    caption="jazz piano version",
    audio_cover_strength=0.8,  # 0.0-1.0
)
Required:
- `src_audio`: Path to source audio file
- `caption`: Description of desired style/transformation
Optional:
- `audio_cover_strength`: Controls influence of the original audio
  - 1.0: Strong adherence to original structure
  - 0.5: Balanced transformation
  - 0.1: Loose interpretation
- `lyrics`: New lyrics (if changing vocals)
Use Cases:
- Create covers in different styles
- Change instrumentation while keeping melody
- Genre transformation
3. Repaint
Purpose: Regenerate a specific time segment of audio while keeping the rest unchanged.
Key Parameters:
params = GenerationParams(
    task_type="repaint",
    src_audio="original.mp3",
    repainting_start=10.0,  # seconds
    repainting_end=20.0,    # seconds
    caption="smooth transition with piano solo",
)
Required:
- `src_audio`: Path to source audio file
- `repainting_start`: Start time in seconds
- `repainting_end`: End time in seconds (use -1 for end of file)
- `caption`: Description of desired content for the repainted section
Use Cases:
- Fix specific sections of generated music
- Add variations to parts of a song
- Create smooth transitions
- Replace problematic segments
4. Lego (Base Model Only)
Purpose: Generate a specific instrument track in context of existing audio.
Key Parameters:
params = GenerationParams(
    task_type="lego",
    src_audio="backing_track.mp3",
    instruction="Generate the guitar track based on the audio context:",
    caption="lead guitar melody with bluesy feel",
    repainting_start=0.0,
    repainting_end=-1,
)
Required:
- `src_audio`: Path to source/backing audio
- `instruction`: Must specify the track type (e.g., "Generate the {TRACK_NAME} track...")
- `caption`: Description of desired track characteristics
Available Tracks:
"vocals", "backing_vocals", "drums", "bass", "guitar", "keyboard", "percussion", "strings", "synth", "fx", "brass", "woodwinds"
Use Cases:
- Add specific instrument tracks
- Layer additional instruments over backing tracks
- Create multi-track compositions iteratively
5. Extract (Base Model Only)
Purpose: Extract/isolate a specific instrument track from mixed audio.
Key Parameters:
params = GenerationParams(
    task_type="extract",
    src_audio="full_mix.mp3",
    instruction="Extract the vocals track from the audio:",
)
Required:
- `src_audio`: Path to mixed audio file
- `instruction`: Must specify the track to extract
Available Tracks: Same as the Lego task
Use Cases:
- Stem separation
- Isolate specific instruments
- Create remixes
- Analyze individual tracks
6. Complete (Base Model Only)
Purpose: Complete/extend partial tracks with specified instruments.
Key Parameters:
params = GenerationParams(
    task_type="complete",
    src_audio="incomplete_track.mp3",
    instruction="Complete the input track with drums, bass, guitar:",
    caption="rock style completion",
)
Required:
- `src_audio`: Path to the incomplete/partial track
- `instruction`: Must specify which tracks to add
- `caption`: Description of desired style
Use Cases:
- Arrange incomplete compositions
- Add backing tracks
- Auto-complete musical ideas
Helper Functions
understand_music
Analyze audio codes to extract metadata about the music.
from acestep.inference import understand_music
result = understand_music(
    llm_handler=llm_handler,
    audio_codes="<|audio_code_123|><|audio_code_456|>...",
    temperature=0.85,
    use_constrained_decoding=True,
)
if result.success:
    print(f"Caption: {result.caption}")
    print(f"Lyrics: {result.lyrics}")
    print(f"BPM: {result.bpm}")
    print(f"Key: {result.keyscale}")
    print(f"Duration: {result.duration}s")
    print(f"Language: {result.language}")
else:
    print(f"Error: {result.error}")
Use Cases:
- Analyze existing music
- Extract metadata from audio codes
- Reverse-engineer generation parameters
create_sample
Generate a complete music sample from a natural language description. This is the "Simple Mode" / "Inspiration Mode" feature.
from acestep.inference import create_sample
result = create_sample(
    llm_handler=llm_handler,
    query="a soft Bengali love song for a quiet evening",
    instrumental=False,
    vocal_language="bn",  # Optional: constrain to Bengali
    temperature=0.85,
)
if result.success:
    print(f"Caption: {result.caption}")
    print(f"Lyrics: {result.lyrics}")
    print(f"BPM: {result.bpm}")
    print(f"Duration: {result.duration}s")
    print(f"Key: {result.keyscale}")
    print(f"Is Instrumental: {result.instrumental}")
    # Use with generate_music
    params = GenerationParams(
        caption=result.caption,
        lyrics=result.lyrics,
        bpm=result.bpm,
        duration=result.duration,
        keyscale=result.keyscale,
        vocal_language=result.language,
    )
else:
    print(f"Error: {result.error}")
Parameters:
| Parameter | Type | Default | Description |
|---|---|---|---|
| `query` | str | required | Natural language description of desired music |
| `instrumental` | bool | `False` | Whether to generate instrumental music |
| `vocal_language` | Optional[str] | `None` | Constrain lyrics to a specific language (e.g., "en", "zh", "bn") |
| `temperature` | float | `0.85` | Sampling temperature |
| `top_k` | Optional[int] | `None` | Top-k sampling (None disables) |
| `top_p` | Optional[float] | `None` | Top-p sampling (None disables) |
| `repetition_penalty` | float | `1.0` | Repetition penalty |
| `use_constrained_decoding` | bool | `True` | Use FSM-based constrained decoding |
format_sample
Format and enhance user-provided caption and lyrics, generating structured metadata.
from acestep.inference import format_sample
result = format_sample(
    llm_handler=llm_handler,
    caption="Latin pop, reggaeton",
    lyrics="[Verse 1]\nBailando en la noche...",
    user_metadata={"bpm": 95},  # Optional: constrain specific values
    temperature=0.85,
)
if result.success:
    print(f"Enhanced Caption: {result.caption}")
    print(f"Formatted Lyrics: {result.lyrics}")
    print(f"BPM: {result.bpm}")
    print(f"Duration: {result.duration}s")
    print(f"Key: {result.keyscale}")
    print(f"Detected Language: {result.language}")
else:
    print(f"Error: {result.error}")
Parameters:
| Parameter | Type | Default | Description |
|---|---|---|---|
| `caption` | str | required | User's caption/description |
| `lyrics` | str | required | User's lyrics with structure tags |
| `user_metadata` | Optional[Dict] | `None` | Constrain specific metadata values (bpm, duration, keyscale, timesignature, language) |
| `temperature` | float | `0.85` | Sampling temperature |
| `top_k` | Optional[int] | `None` | Top-k sampling (None disables) |
| `top_p` | Optional[float] | `None` | Top-p sampling (None disables) |
| `repetition_penalty` | float | `1.0` | Repetition penalty |
| `use_constrained_decoding` | bool | `True` | Use FSM-based constrained decoding |
Complete Examples
Example 1: Simple Text-to-Music Generation
from acestep.inference import GenerationParams, GenerationConfig, generate_music
params = GenerationParams(
    task_type="text2music",
    caption="calm ambient music with soft piano and strings",
    duration=60,
    bpm=80,
    keyscale="C Major",
)
config = GenerationConfig(
    batch_size=2,  # Generate 2 variations
    audio_format="flac",
)
result = generate_music(dit_handler, llm_handler, params, config, save_dir="/output")
if result.success:
    for i, audio in enumerate(result.audios, 1):
        print(f"Variation {i}: {audio['path']}")
Example 2: Song Generation with Lyrics
params = GenerationParams(
    task_type="text2music",
    caption="pop ballad with emotional vocals",
    lyrics="""[Verse 1]
Walking down the street today
Thinking of the words you used to say
Everything feels different now
But I'll find my way somehow
[Chorus]
I'm moving on, I'm staying strong
This is where I belong
""",
    vocal_language="en",
    bpm=72,
    duration=45,
)
config = GenerationConfig(batch_size=1)
result = generate_music(dit_handler, llm_handler, params, config, save_dir="/output")
Example 3: Using Custom Timesteps
params = GenerationParams(
    task_type="text2music",
    caption="jazz fusion with complex harmonies",
    # Custom 9-step schedule
    timesteps=[0.97, 0.76, 0.615, 0.5, 0.395, 0.28, 0.18, 0.085, 0],
    thinking=True,
)
config = GenerationConfig(batch_size=1)
result = generate_music(dit_handler, llm_handler, params, config, save_dir="/output")
Example 4: Using Shift Parameter (Turbo Model)
params = GenerationParams(
    task_type="text2music",
    caption="upbeat electronic dance music",
    inference_steps=8,
    shift=3.0,  # Recommended for turbo models
    infer_method="ode",
)
config = GenerationConfig(batch_size=2)
result = generate_music(dit_handler, llm_handler, params, config, save_dir="/output")
Example 5: Simple Mode with create_sample
from acestep.inference import create_sample, GenerationParams, GenerationConfig, generate_music
# Step 1: Create sample from description
sample = create_sample(
    llm_handler=llm_handler,
    query="energetic K-pop dance track with catchy hooks",
    vocal_language="ko",
)
if sample.success:
    # Step 2: Generate music using the sample
    params = GenerationParams(
        caption=sample.caption,
        lyrics=sample.lyrics,
        bpm=sample.bpm,
        duration=sample.duration,
        keyscale=sample.keyscale,
        vocal_language=sample.language,
        thinking=True,
    )
    config = GenerationConfig(batch_size=2)
    result = generate_music(dit_handler, llm_handler, params, config, save_dir="/output")
Example 6: Format and Enhance User Input
from acestep.inference import format_sample, GenerationParams, GenerationConfig, generate_music
# Step 1: Format user input
formatted = format_sample(
    llm_handler=llm_handler,
    caption="rock ballad",
    lyrics="[Verse]\nIn the darkness I find my way...",
)
if formatted.success:
    # Step 2: Generate with enhanced input
    params = GenerationParams(
        caption=formatted.caption,
        lyrics=formatted.lyrics,
        bpm=formatted.bpm,
        duration=formatted.duration,
        keyscale=formatted.keyscale,
        thinking=True,
        use_cot_metas=False,  # Already formatted, skip metas CoT
    )
    config = GenerationConfig(batch_size=2)
    result = generate_music(dit_handler, llm_handler, params, config, save_dir="/output")
Example 7: Style Cover with LM Reasoning
params = GenerationParams(
    task_type="cover",
    src_audio="original_pop_song.mp3",
    caption="orchestral symphonic arrangement",
    audio_cover_strength=0.7,
    thinking=True,  # Enable LM for metadata
    use_cot_metas=True,
)
config = GenerationConfig(batch_size=1)
result = generate_music(dit_handler, llm_handler, params, config, save_dir="/output")
# Access LM-generated metadata
if result.extra_outputs.get("lm_metadata"):
    lm_meta = result.extra_outputs["lm_metadata"]
    print(f"LM detected BPM: {lm_meta.get('bpm')}")
    print(f"LM detected Key: {lm_meta.get('keyscale')}")
Example 8: Batch Generation with Specific Seeds
params = GenerationParams(
    task_type="text2music",
    caption="epic cinematic trailer music",
)
config = GenerationConfig(
    batch_size=4,           # Generate 4 variations
    seeds=[42, 123, 456],   # Specify 3 seeds; the 4th will be random
    use_random_seed=False,  # Use the provided seeds
    lm_batch_chunk_size=2,  # Process 2 at a time (GPU memory)
)
result = generate_music(dit_handler, llm_handler, params, config, save_dir="/output")
if result.success:
    print(f"Generated {len(result.audios)} variations")
    for audio in result.audios:
        print(f"  Seed {audio['params']['seed']}: {audio['path']}")
Example 9: High-Quality Generation (Base Model)
params = GenerationParams(
    task_type="text2music",
    caption="intricate jazz fusion with complex harmonies",
    inference_steps=64,  # High quality
    guidance_scale=8.0,
    use_adg=True,        # Adaptive Dual Guidance
    cfg_interval_start=0.0,
    cfg_interval_end=1.0,
    shift=3.0,           # Timestep shift
    seed=42,             # Reproducible results
)
config = GenerationConfig(
    batch_size=1,
    use_random_seed=False,
    audio_format="wav",  # Lossless format
)
result = generate_music(dit_handler, llm_handler, params, config, save_dir="/output")
Example 10: Understand Audio from Codes
from acestep.inference import understand_music
# Analyze audio codes (e.g., from a previous generation)
result = understand_music(
    llm_handler=llm_handler,
    audio_codes="<|audio_code_10695|><|audio_code_54246|>...",
    temperature=0.85,
)
if result.success:
    print(f"Detected Caption: {result.caption}")
    print(f"Detected Lyrics: {result.lyrics}")
    print(f"Detected BPM: {result.bpm}")
    print(f"Detected Key: {result.keyscale}")
    print(f"Detected Duration: {result.duration}s")
    print(f"Detected Language: {result.language}")
Best Practices
1. Caption Writing
Good Captions:
# Specific and descriptive
caption="upbeat electronic dance music with heavy bass and synthesizer leads"
# Include mood and genre
caption="melancholic indie folk with acoustic guitar and soft vocals"
# Specify instruments
caption="jazz trio with piano, upright bass, and brush drums"
Avoid:
# Too vague
caption="good music"
# Contradictory
caption="fast slow music" # Conflicting tempos
2. Parameter Tuning
For Best Quality:
- Use the base model with `inference_steps=64` or higher
- Enable `use_adg=True`
- Set `guidance_scale=7.0-9.0`
- Set `shift=3.0` for better timestep distribution
- Use a lossless audio format (`audio_format="wav"`)
For Speed:
- Use the turbo model with `inference_steps=8`
- Disable ADG (`use_adg=False`)
- Use `infer_method="ode"` (default)
- Use a compressed format (`audio_format="mp3"`) or the default FLAC
For Consistency:
- Set `use_random_seed=False` in config
- Use a fixed `seeds` list or a single `seed` in params
- Keep `lm_temperature` lower (0.7-0.85)
For Diversity:
- Set `use_random_seed=True` in config
- Increase `lm_temperature` (0.9-1.1)
- Use `batch_size > 1` for variations
3. Duration Guidelines
- Instrumental: 30-180 seconds works well
- With Lyrics: Auto-detection recommended (set `duration=-1` or leave the default)
- Short clips: 10-20 seconds minimum
- Long form: Up to 600 seconds (10 minutes) maximum
4. LM Usage
When to Enable LM (thinking=True):
- Need automatic metadata detection
- Want caption refinement
- Generating from minimal input
- Need diverse outputs
When to Disable LM (thinking=False):
- Have precise metadata already
- Need faster generation
- Want full control over parameters
5. Batch Processing
# Efficient batch generation
config = GenerationConfig(
    batch_size=8,            # Max supported
    allow_lm_batch=True,     # Enable for speed (when thinking=True)
    lm_batch_chunk_size=4,   # Adjust based on GPU memory
)
6. Error Handling
result = generate_music(dit_handler, llm_handler, params, config, save_dir="/output")
if not result.success:
    print(f"Generation failed: {result.error}")
    print(f"Status: {result.status_message}")
else:
    # Process successful result
    for audio in result.audios:
        path = audio['path']
        key = audio['key']
        seed = audio['params']['seed']
        # ... process audio files
7. Memory Management
For large batch sizes or long durations:
- Monitor GPU memory usage
- Reduce `batch_size` if OOM errors occur
- Reduce `lm_batch_chunk_size` for LM operations
- Consider using `offload_to_cpu=True` during initialization
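For example, offloading might be requested at initialization time; the exact keyword placement below is an assumption, so check your handler's initialize signature:
# Assumed placement of the offload flag; consult initialize_service's
# actual signature in your version of acestep before relying on this.
dit_handler.initialize_service(
    project_root="/path/to/project",
    config_path="acestep-v15-turbo",
    device="cuda",
    offload_to_cpu=True,  # trade speed for lower GPU memory use
)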
8. Accessing Time Costs
result = generate_music(dit_handler, llm_handler, params, config, save_dir="/output")
if result.success:
    time_costs = result.extra_outputs.get("time_costs", {})
    print(f"LM Phase 1 Time: {time_costs.get('lm_phase1_time', 0):.2f}s")
    print(f"LM Phase 2 Time: {time_costs.get('lm_phase2_time', 0):.2f}s")
    print(f"DiT Total Time: {time_costs.get('dit_total_time_cost', 0):.2f}s")
    print(f"Pipeline Total: {time_costs.get('pipeline_total_time', 0):.2f}s")
Troubleshooting
Common Issues
Issue: Out of memory errors
- Solution: Reduce `batch_size` or `inference_steps`, or enable CPU offloading
Issue: Poor quality results
- Solution: Increase `inference_steps`, adjust `guidance_scale`, or use the base model
Issue: Results don't match prompt
- Solution: Make the caption more specific, increase `guidance_scale`, or enable LM refinement (`thinking=True`)
Issue: Slow generation
- Solution: Use the turbo model, reduce `inference_steps`, or disable ADG
Issue: LM not generating codes
- Solution: Verify `llm_handler` is initialized; check `thinking=True` and `use_cot_metas=True`
Issue: Seeds not being respected
- Solution: Set `use_random_seed=False` in config and provide a `seeds` list or a `seed` in params
Issue: Custom timesteps not working
- Solution: Ensure timesteps are a list of floats from 1.0 to 0.0, properly ordered
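A quick sanity check (a hypothetical helper, not part of the API) can catch malformed schedules before generation:
def validate_timesteps(ts: list) -> None:
    # Expect floats in [0.0, 1.0], strictly decreasing, ending at 0.0
    if not ts or not all(0.0 <= t <= 1.0 for t in ts):
        raise ValueError("timesteps must lie in [0.0, 1.0]")
    if any(a <= b for a, b in zip(ts, ts[1:])):
        raise ValueError("timesteps must be strictly decreasing")
    if ts[-1] != 0.0:
        raise ValueError("timesteps should end at 0.0")
validate_timesteps([0.97, 0.76, 0.615, 0.5, 0.395, 0.28, 0.18, 0.085, 0])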
API Reference Summary
GenerationParams Fields
See GenerationParams Parameters for complete documentation.
GenerationConfig Fields
See GenerationConfig Parameters for complete documentation.
GenerationResult Fields
@dataclass
class GenerationResult:
    # Audio Outputs
    audios: List[Dict[str, Any]]
    # Each audio dict contains:
    # - "path": str (file path)
    # - "tensor": Tensor (audio data)
    # - "key": str (unique identifier)
    # - "sample_rate": int (48000)
    # - "params": Dict (generation params with seed, audio_codes, etc.)

    # Generation Information
    status_message: str
    extra_outputs: Dict[str, Any]
    # extra_outputs contains:
    # - "lm_metadata": Dict (LM-generated metadata)
    # - "time_costs": Dict (timing information)
    # - "latents": Tensor (intermediate latents, if available)
    # - "masks": Tensor (attention masks, if available)

    # Success Status
    success: bool
    error: Optional[str]
Version History
v1.5.2: Current version
- Added `shift` parameter for timestep shifting
- Added `infer_method` parameter for ODE/SDE selection
- Added `timesteps` parameter for custom timestep schedules
- Added `understand_music()` function for audio analysis
- Added `create_sample()` function for simple mode generation
- Added `format_sample()` function for input enhancement
- Added `UnderstandResult`, `CreateSampleResult`, `FormatSampleResult` dataclasses
v1.5.1: Previous version
- Split `GenerationConfig` into `GenerationParams` and `GenerationConfig`
- Renamed parameters for consistency (`key_scale` → `keyscale`, `time_signature` → `timesignature`, `audio_duration` → `duration`, `use_llm_thinking` → `thinking`, `audio_code_string` → `audio_codes`)
- Added `instrumental` parameter
- Added `use_constrained_decoding` parameter
- Added CoT auto-filled fields (`cot_*`)
- Changed default `audio_format` to "flac"
- Changed default `batch_size` to 2
- Changed default `thinking` to True
- Simplified `GenerationResult` structure with unified `audios` list
- Added unified `time_costs` in `extra_outputs`
v1.5: Initial version
- Introduced `GenerationConfig` and `GenerationResult` dataclasses
- Simplified parameter passing
- Added comprehensive documentation
For more information, see:
- Main README: ../../README.md
- REST API Documentation: API.md
- Gradio Demo Guide: GRADIO_GUIDE.md
- Project repository: ACE-Step-1.5