# ACE-Step Inference API Documentation **Language / 语言 / 言語:** [English](INFERENCE.md) | [中文](../zh/INFERENCE.md) | [日本語](../ja/INFERENCE.md) --- This document provides comprehensive documentation for the ACE-Step inference API, including parameter specifications for all supported task types. ## Table of Contents - [Quick Start](#quick-start) - [API Overview](#api-overview) - [GenerationParams Parameters](#generationparams-parameters) - [GenerationConfig Parameters](#generationconfig-parameters) - [Task Types](#task-types) - [Helper Functions](#helper-functions) - [Complete Examples](#complete-examples) - [Best Practices](#best-practices) --- ## Quick Start ### Basic Usage ```python from acestep.handler import AceStepHandler from acestep.llm_inference import LLMHandler from acestep.inference import GenerationParams, GenerationConfig, generate_music # Initialize handlers dit_handler = AceStepHandler() llm_handler = LLMHandler() # Initialize services dit_handler.initialize_service( project_root="/path/to/project", config_path="acestep-v15-turbo", device="cuda" ) llm_handler.initialize( checkpoint_dir="/path/to/checkpoints", lm_model_path="acestep-5Hz-lm-0.6B", backend="vllm", device="cuda" ) # Configure generation parameters params = GenerationParams( caption="upbeat electronic dance music with heavy bass", bpm=128, duration=30, ) # Configure generation settings config = GenerationConfig( batch_size=2, audio_format="flac", ) # Generate music result = generate_music(dit_handler, llm_handler, params, config, save_dir="/path/to/output") # Access results if result.success: for audio in result.audios: print(f"Generated: {audio['path']}") print(f"Key: {audio['key']}") print(f"Seed: {audio['params']['seed']}") else: print(f"Error: {result.error}") ``` --- ## API Overview ### Main Functions #### generate_music ```python def generate_music( dit_handler, llm_handler, params: GenerationParams, config: GenerationConfig, save_dir: Optional[str] = None, progress=None, ) -> GenerationResult ``` Main function for generating music using the ACE-Step model. #### understand_music ```python def understand_music( llm_handler, audio_codes: str, temperature: float = 0.85, top_k: Optional[int] = None, top_p: Optional[float] = None, repetition_penalty: float = 1.0, use_constrained_decoding: bool = True, constrained_decoding_debug: bool = False, ) -> UnderstandResult ``` Analyze audio semantic codes and extract metadata (caption, lyrics, BPM, key, etc.). #### create_sample ```python def create_sample( llm_handler, query: str, instrumental: bool = False, vocal_language: Optional[str] = None, temperature: float = 0.85, top_k: Optional[int] = None, top_p: Optional[float] = None, repetition_penalty: float = 1.0, use_constrained_decoding: bool = True, constrained_decoding_debug: bool = False, ) -> CreateSampleResult ``` Generate a complete music sample (caption, lyrics, metadata) from a natural language description. #### format_sample ```python def format_sample( llm_handler, caption: str, lyrics: str, user_metadata: Optional[Dict[str, Any]] = None, temperature: float = 0.85, top_k: Optional[int] = None, top_p: Optional[float] = None, repetition_penalty: float = 1.0, use_constrained_decoding: bool = True, constrained_decoding_debug: bool = False, ) -> FormatSampleResult ``` Format and enhance user-provided caption and lyrics, generating structured metadata. 
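All four entry points return a result object with the same `success`/`error` convention (see the result dataclasses below), so error handling can be centralized. The following `unwrap` helper is a minimal illustrative sketch that assumes only the documented `success` and `error` fields — it is not part of the API:

```python
def unwrap(result):
    """Raise on failure, return the result unchanged on success.

    Works for GenerationResult, UnderstandResult, CreateSampleResult,
    and FormatSampleResult, all of which expose `success` and `error`.
    """
    if not result.success:
        raise RuntimeError(result.error or "ACE-Step call failed")
    return result
```

With it, pipeline code can chain calls without repeating `if result.success:` blocks, e.g. `sample = unwrap(create_sample(llm_handler, query="lofi study beats"))`.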
### Configuration Objects The API uses two configuration dataclasses: **GenerationParams** - Contains all music generation parameters: ```python @dataclass class GenerationParams: # Task & Instruction task_type: str = "text2music" instruction: str = "Fill the audio semantic mask based on the given conditions:" # Audio Uploads reference_audio: Optional[str] = None src_audio: Optional[str] = None # LM Codes Hints audio_codes: str = "" # Text Inputs caption: str = "" lyrics: str = "" instrumental: bool = False # Metadata vocal_language: str = "unknown" bpm: Optional[int] = None keyscale: str = "" timesignature: str = "" duration: float = -1.0 # Advanced Settings inference_steps: int = 8 seed: int = -1 guidance_scale: float = 7.0 use_adg: bool = False cfg_interval_start: float = 0.0 cfg_interval_end: float = 1.0 shift: float = 1.0 # NEW: Timestep shift factor infer_method: str = "ode" # NEW: Diffusion inference method timesteps: Optional[List[float]] = None # NEW: Custom timesteps repainting_start: float = 0.0 repainting_end: float = -1 audio_cover_strength: float = 1.0 # 5Hz Language Model Parameters thinking: bool = True lm_temperature: float = 0.85 lm_cfg_scale: float = 2.0 lm_top_k: int = 0 lm_top_p: float = 0.9 lm_negative_prompt: str = "NO USER INPUT" use_cot_metas: bool = True use_cot_caption: bool = True use_cot_lyrics: bool = False use_cot_language: bool = True use_constrained_decoding: bool = True # CoT Generated Values (auto-filled by LM) cot_bpm: Optional[int] = None cot_keyscale: str = "" cot_timesignature: str = "" cot_duration: Optional[float] = None cot_vocal_language: str = "unknown" cot_caption: str = "" cot_lyrics: str = "" ``` **GenerationConfig** - Contains batch and output configuration: ```python @dataclass class GenerationConfig: batch_size: int = 2 allow_lm_batch: bool = False use_random_seed: bool = True seeds: Optional[List[int]] = None lm_batch_chunk_size: int = 8 constrained_decoding_debug: bool = False audio_format: str = "flac" ``` ### Result Objects **GenerationResult** - Result of music generation: ```python @dataclass class GenerationResult: # Audio Outputs audios: List[Dict[str, Any]] # List of audio dictionaries # Generation Information status_message: str # Status message from generation extra_outputs: Dict[str, Any] # Extra outputs (latents, masks, lm_metadata, time_costs) # Success Status success: bool # Whether generation succeeded error: Optional[str] # Error message if failed ``` **Audio Dictionary Structure:** Each item in `audios` list contains: ```python { "path": str, # File path to saved audio "tensor": Tensor, # Audio tensor [channels, samples], CPU, float32 "key": str, # Unique audio key (UUID based on params) "sample_rate": int, # Sample rate (default: 48000) "params": Dict, # Generation params for this audio (includes seed, audio_codes, etc.) 
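    # Tip (illustrative, not an API requirement): since the tensor is already
    # CPU-resident float32 in [channels, samples] layout, it can also be saved
    # manually, e.g. torchaudio.save(out_path, audio["tensor"], audio["sample_rate"])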
}
```

**UnderstandResult** - Result of music understanding:

```python
@dataclass
class UnderstandResult:
    # Metadata Fields
    caption: str = ""
    lyrics: str = ""
    bpm: Optional[int] = None
    duration: Optional[float] = None
    keyscale: str = ""
    language: str = ""
    timesignature: str = ""

    # Status
    status_message: str = ""
    success: bool = True
    error: Optional[str] = None
```

**CreateSampleResult** - Result of sample creation:

```python
@dataclass
class CreateSampleResult:
    # Metadata Fields
    caption: str = ""
    lyrics: str = ""
    bpm: Optional[int] = None
    duration: Optional[float] = None
    keyscale: str = ""
    language: str = ""
    timesignature: str = ""
    instrumental: bool = False

    # Status
    status_message: str = ""
    success: bool = True
    error: Optional[str] = None
```

**FormatSampleResult** - Result of sample formatting:

```python
@dataclass
class FormatSampleResult:
    # Metadata Fields
    caption: str = ""
    lyrics: str = ""
    bpm: Optional[int] = None
    duration: Optional[float] = None
    keyscale: str = ""
    language: str = ""
    timesignature: str = ""

    # Status
    status_message: str = ""
    success: bool = True
    error: Optional[str] = None
```

---

## GenerationParams Parameters

### Text Inputs

| Parameter | Type | Default | Description |
|-----------|------|---------|-------------|
| `caption` | `str` | `""` | Text description of the desired music. Can be a simple prompt like "relaxing piano music" or a detailed description with genre, mood, instruments, etc. Max 512 characters. |
| `lyrics` | `str` | `""` | Lyrics text for vocal music. Use `"[Instrumental]"` for instrumental tracks. Supports multiple languages. Max 4096 characters. |
| `instrumental` | `bool` | `False` | If True, generate instrumental music regardless of lyrics. |

### Music Metadata

| Parameter | Type | Default | Description |
|-----------|------|---------|-------------|
| `bpm` | `Optional[int]` | `None` | Beats per minute (30-300). `None` enables auto-detection via LM. |
| `keyscale` | `str` | `""` | Musical key (e.g., "C Major", "Am", "F# minor"). Empty string enables auto-detection. |
| `timesignature` | `str` | `""` | Time signature, passed as the beat-count string: `"2"` for 2/4, `"3"` for 3/4, `"4"` for 4/4, `"6"` for 6/8. Empty string enables auto-detection. |
| `vocal_language` | `str` | `"unknown"` | Language code for vocals (ISO 639-1). Supported: `"en"`, `"zh"`, `"ja"`, `"es"`, `"fr"`, etc. Use `"unknown"` for auto-detection. |
| `duration` | `float` | `-1.0` | Target audio length in seconds (10-600). If <= 0 or None, the model chooses a length automatically based on the lyrics. |

### Generation Parameters

| Parameter | Type | Default | Description |
|-----------|------|---------|-------------|
| `inference_steps` | `int` | `8` | Number of denoising steps. Turbo model: 1-20 (recommended 8). Base model: 1-200 (recommended 32-64). Higher = better quality but slower. |
| `guidance_scale` | `float` | `7.0` | Classifier-free guidance scale (1.0-15.0). Higher values increase adherence to the text prompt. Supported only by the base (non-turbo) model. Typical range: 5.0-9.0. |
| `seed` | `int` | `-1` | Random seed for reproducibility. Use `-1` for a random seed, or any positive integer for a fixed seed. |

### Advanced DiT Parameters

| Parameter | Type | Default | Description |
|-----------|------|---------|-------------|
| `use_adg` | `bool` | `False` | Use Adaptive Dual Guidance (base model only). Improves quality at the cost of speed. |
| `cfg_interval_start` | `float` | `0.0` | CFG application start ratio (0.0-1.0). Controls when to start applying classifier-free guidance.
| | `cfg_interval_end` | `float` | `1.0` | CFG application end ratio (0.0-1.0). Controls when to stop applying classifier-free guidance. | | `shift` | `float` | `1.0` | Timestep shift factor (range 1.0-5.0, default 1.0). When != 1.0, applies `t = shift * t / (1 + (shift - 1) * t)` to timesteps. Recommended 3.0 for turbo models. | | `infer_method` | `str` | `"ode"` | Diffusion inference method. `"ode"` (Euler) is faster and deterministic. `"sde"` (stochastic) may produce different results with variance. | | `timesteps` | `Optional[List[float]]` | `None` | Custom timesteps as a list of floats from 1.0 to 0.0 (e.g., `[0.97, 0.76, 0.615, 0.5, 0.395, 0.28, 0.18, 0.085, 0]`). If provided, overrides `inference_steps` and `shift`. | ### Task-Specific Parameters | Parameter | Type | Default | Description | |-----------|------|---------|-------------| | `task_type` | `str` | `"text2music"` | Generation task type. See [Task Types](#task-types) section for details. | | `instruction` | `str` | `"Fill the audio semantic mask based on the given conditions:"` | Task-specific instruction prompt. | | `reference_audio` | `Optional[str]` | `None` | Path to reference audio file for style transfer or continuation tasks. | | `src_audio` | `Optional[str]` | `None` | Path to source audio file for audio-to-audio tasks (cover, repaint, etc.). | | `audio_codes` | `str` | `""` | Pre-extracted 5Hz audio semantic codes as a string. Advanced use only. | | `repainting_start` | `float` | `0.0` | Repainting start time in seconds (for repaint/lego tasks). | | `repainting_end` | `float` | `-1` | Repainting end time in seconds. Use `-1` for end of audio. | | `audio_cover_strength` | `float` | `1.0` | Strength of audio cover/codes influence (0.0-1.0). Set smaller (0.2) for style transfer tasks. | ### 5Hz Language Model Parameters | Parameter | Type | Default | Description | |-----------|------|---------|-------------| | `thinking` | `bool` | `True` | Enable 5Hz Language Model "Chain-of-Thought" reasoning for semantic/music metadata and codes. | | `lm_temperature` | `float` | `0.85` | LM sampling temperature (0.0-2.0). Higher = more creative/diverse, lower = more conservative. | | `lm_cfg_scale` | `float` | `2.0` | LM classifier-free guidance scale. Higher = stronger adherence to prompt. | | `lm_top_k` | `int` | `0` | LM top-k sampling. `0` disables top-k filtering. Typical values: 40-100. | | `lm_top_p` | `float` | `0.9` | LM nucleus sampling (0.0-1.0). `1.0` disables nucleus sampling. Typical values: 0.9-0.95. | | `lm_negative_prompt` | `str` | `"NO USER INPUT"` | Negative prompt for LM guidance. Helps avoid unwanted characteristics. | | `use_cot_metas` | `bool` | `True` | Generate metadata using LM CoT reasoning (BPM, key, duration, etc.). | | `use_cot_caption` | `bool` | `True` | Refine user caption using LM CoT reasoning. | | `use_cot_language` | `bool` | `True` | Detect vocal language using LM CoT reasoning. | | `use_cot_lyrics` | `bool` | `False` | (Reserved for future use) Generate/refine lyrics using LM CoT. | | `use_constrained_decoding` | `bool` | `True` | Enable constrained decoding for structured LM output. | ### CoT Generated Values These fields are automatically populated by the LM when CoT reasoning is enabled: | Parameter | Type | Default | Description | |-----------|------|---------|-------------| | `cot_bpm` | `Optional[int]` | `None` | LM-generated BPM value. | | `cot_keyscale` | `str` | `""` | LM-generated key/scale. | | `cot_timesignature` | `str` | `""` | LM-generated time signature. 
| | `cot_duration` | `Optional[float]` | `None` | LM-generated duration. | | `cot_vocal_language` | `str` | `"unknown"` | LM-detected vocal language. | | `cot_caption` | `str` | `""` | LM-refined caption. | | `cot_lyrics` | `str` | `""` | LM-generated/refined lyrics. | --- ## GenerationConfig Parameters | Parameter | Type | Default | Description | |-----------|------|---------|-------------| | `batch_size` | `int` | `2` | Number of samples to generate in parallel (1-8). Higher values require more GPU memory. | | `allow_lm_batch` | `bool` | `False` | Allow batch processing in LM. Faster when `batch_size >= 2` and `thinking=True`. | | `use_random_seed` | `bool` | `True` | Whether to use random seed. `True` for different results each time, `False` for reproducible results. | | `seeds` | `Optional[List[int]]` | `None` | List of seeds for batch generation. If provided, will be padded with random seeds if fewer than batch_size. Can also be single int. | | `lm_batch_chunk_size` | `int` | `8` | Maximum batch size per LM inference chunk (GPU memory constraint). | | `constrained_decoding_debug` | `bool` | `False` | Enable debug logging for constrained decoding. | | `audio_format` | `str` | `"flac"` | Output audio format. Options: `"mp3"`, `"wav"`, `"flac"`. Default is FLAC for fast saving. | --- ## Task Types ACE-Step supports 6 different generation task types, each optimized for specific use cases. ### 1. Text2Music (Default) **Purpose**: Generate music from text descriptions and optional metadata. **Key Parameters**: ```python params = GenerationParams( task_type="text2music", caption="energetic rock music with electric guitar", lyrics="[Instrumental]", # or actual lyrics bpm=140, duration=30, ) ``` **Required**: - `caption` or `lyrics` (at least one) **Optional but Recommended**: - `bpm`: Controls tempo - `keyscale`: Controls musical key - `timesignature`: Controls rhythm structure - `duration`: Controls length - `vocal_language`: Controls vocal characteristics **Use Cases**: - Generate music from text descriptions - Create backing tracks from prompts - Generate songs with lyrics --- ### 2. Cover **Purpose**: Transform existing audio while maintaining structure but changing style/timbre. **Key Parameters**: ```python params = GenerationParams( task_type="cover", src_audio="original_song.mp3", caption="jazz piano version", audio_cover_strength=0.8, # 0.0-1.0 ) ``` **Required**: - `src_audio`: Path to source audio file - `caption`: Description of desired style/transformation **Optional**: - `audio_cover_strength`: Controls influence of original audio - `1.0`: Strong adherence to original structure - `0.5`: Balanced transformation - `0.1`: Loose interpretation - `lyrics`: New lyrics (if changing vocals) **Use Cases**: - Create covers in different styles - Change instrumentation while keeping melody - Genre transformation --- ### 3. Repaint **Purpose**: Regenerate a specific time segment of audio while keeping the rest unchanged. 
**Key Parameters**: ```python params = GenerationParams( task_type="repaint", src_audio="original.mp3", repainting_start=10.0, # seconds repainting_end=20.0, # seconds caption="smooth transition with piano solo", ) ``` **Required**: - `src_audio`: Path to source audio file - `repainting_start`: Start time in seconds - `repainting_end`: End time in seconds (use `-1` for end of file) - `caption`: Description of desired content for repainted section **Use Cases**: - Fix specific sections of generated music - Add variations to parts of a song - Create smooth transitions - Replace problematic segments --- ### 4. Lego (Base Model Only) **Purpose**: Generate a specific instrument track in context of existing audio. **Key Parameters**: ```python params = GenerationParams( task_type="lego", src_audio="backing_track.mp3", instruction="Generate the guitar track based on the audio context:", caption="lead guitar melody with bluesy feel", repainting_start=0.0, repainting_end=-1, ) ``` **Required**: - `src_audio`: Path to source/backing audio - `instruction`: Must specify the track type (e.g., "Generate the {TRACK_NAME} track...") - `caption`: Description of desired track characteristics **Available Tracks**: - `"vocals"`, `"backing_vocals"`, `"drums"`, `"bass"`, `"guitar"`, `"keyboard"`, - `"percussion"`, `"strings"`, `"synth"`, `"fx"`, `"brass"`, `"woodwinds"` **Use Cases**: - Add specific instrument tracks - Layer additional instruments over backing tracks - Create multi-track compositions iteratively --- ### 5. Extract (Base Model Only) **Purpose**: Extract/isolate a specific instrument track from mixed audio. **Key Parameters**: ```python params = GenerationParams( task_type="extract", src_audio="full_mix.mp3", instruction="Extract the vocals track from the audio:", ) ``` **Required**: - `src_audio`: Path to mixed audio file - `instruction`: Must specify track to extract **Available Tracks**: Same as Lego task **Use Cases**: - Stem separation - Isolate specific instruments - Create remixes - Analyze individual tracks --- ### 6. Complete (Base Model Only) **Purpose**: Complete/extend partial tracks with specified instruments. **Key Parameters**: ```python params = GenerationParams( task_type="complete", src_audio="incomplete_track.mp3", instruction="Complete the input track with drums, bass, guitar:", caption="rock style completion", ) ``` **Required**: - `src_audio`: Path to incomplete/partial track - `instruction`: Must specify which tracks to add - `caption`: Description of desired style **Use Cases**: - Arrange incomplete compositions - Add backing tracks - Auto-complete musical ideas --- ## Helper Functions ### understand_music Analyze audio codes to extract metadata about the music. ```python from acestep.inference import understand_music result = understand_music( llm_handler=llm_handler, audio_codes="<|audio_code_123|><|audio_code_456|>...", temperature=0.85, use_constrained_decoding=True, ) if result.success: print(f"Caption: {result.caption}") print(f"Lyrics: {result.lyrics}") print(f"BPM: {result.bpm}") print(f"Key: {result.keyscale}") print(f"Duration: {result.duration}s") print(f"Language: {result.language}") else: print(f"Error: {result.error}") ``` **Use Cases**: - Analyze existing music - Extract metadata from audio codes - Reverse-engineer generation parameters --- ### create_sample Generate a complete music sample from a natural language description. This is the "Simple Mode" / "Inspiration Mode" feature. 
```python from acestep.inference import create_sample result = create_sample( llm_handler=llm_handler, query="a soft Bengali love song for a quiet evening", instrumental=False, vocal_language="bn", # Optional: constrain to Bengali temperature=0.85, ) if result.success: print(f"Caption: {result.caption}") print(f"Lyrics: {result.lyrics}") print(f"BPM: {result.bpm}") print(f"Duration: {result.duration}s") print(f"Key: {result.keyscale}") print(f"Is Instrumental: {result.instrumental}") # Use with generate_music params = GenerationParams( caption=result.caption, lyrics=result.lyrics, bpm=result.bpm, duration=result.duration, keyscale=result.keyscale, vocal_language=result.language, ) else: print(f"Error: {result.error}") ``` **Parameters**: | Parameter | Type | Default | Description | |-----------|------|---------|-------------| | `query` | `str` | required | Natural language description of desired music | | `instrumental` | `bool` | `False` | Whether to generate instrumental music | | `vocal_language` | `Optional[str]` | `None` | Constrain lyrics to specific language (e.g., "en", "zh", "bn") | | `temperature` | `float` | `0.85` | Sampling temperature | | `top_k` | `Optional[int]` | `None` | Top-k sampling (None disables) | | `top_p` | `Optional[float]` | `None` | Top-p sampling (None disables) | | `repetition_penalty` | `float` | `1.0` | Repetition penalty | | `use_constrained_decoding` | `bool` | `True` | Use FSM-based constrained decoding | --- ### format_sample Format and enhance user-provided caption and lyrics, generating structured metadata. ```python from acestep.inference import format_sample result = format_sample( llm_handler=llm_handler, caption="Latin pop, reggaeton", lyrics="[Verse 1]\nBailando en la noche...", user_metadata={"bpm": 95}, # Optional: constrain specific values temperature=0.85, ) if result.success: print(f"Enhanced Caption: {result.caption}") print(f"Formatted Lyrics: {result.lyrics}") print(f"BPM: {result.bpm}") print(f"Duration: {result.duration}s") print(f"Key: {result.keyscale}") print(f"Detected Language: {result.language}") else: print(f"Error: {result.error}") ``` **Parameters**: | Parameter | Type | Default | Description | |-----------|------|---------|-------------| | `caption` | `str` | required | User's caption/description | | `lyrics` | `str` | required | User's lyrics with structure tags | | `user_metadata` | `Optional[Dict]` | `None` | Constrain specific metadata values (bpm, duration, keyscale, timesignature, language) | | `temperature` | `float` | `0.85` | Sampling temperature | | `top_k` | `Optional[int]` | `None` | Top-k sampling (None disables) | | `top_p` | `Optional[float]` | `None` | Top-p sampling (None disables) | | `repetition_penalty` | `float` | `1.0` | Repetition penalty | | `use_constrained_decoding` | `bool` | `True` | Use FSM-based constrained decoding | --- ## Complete Examples ### Example 1: Simple Text-to-Music Generation ```python from acestep.inference import GenerationParams, GenerationConfig, generate_music params = GenerationParams( task_type="text2music", caption="calm ambient music with soft piano and strings", duration=60, bpm=80, keyscale="C Major", ) config = GenerationConfig( batch_size=2, # Generate 2 variations audio_format="flac", ) result = generate_music(dit_handler, llm_handler, params, config, save_dir="/output") if result.success: for i, audio in enumerate(result.audios, 1): print(f"Variation {i}: {audio['path']}") ``` ### Example 2: Song Generation with Lyrics ```python params = GenerationParams( 
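    # Note: the lyrics below use structure tags ([Verse 1], [Chorus]),
    # matching the convention shown for format_sample above.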
    task_type="text2music",
    caption="pop ballad with emotional vocals",
    lyrics="""[Verse 1]
Walking down the street today
Thinking of the words you used to say
Everything feels different now
But I'll find my way somehow

[Chorus]
I'm moving on, I'm staying strong
This is where I belong
""",
    vocal_language="en",
    bpm=72,
    duration=45,
)

config = GenerationConfig(batch_size=1)
result = generate_music(dit_handler, llm_handler, params, config, save_dir="/output")
```

### Example 3: Using Custom Timesteps

```python
params = GenerationParams(
    task_type="text2music",
    caption="jazz fusion with complex harmonies",
    # Custom 9-step schedule
    timesteps=[0.97, 0.76, 0.615, 0.5, 0.395, 0.28, 0.18, 0.085, 0],
    thinking=True,
)

config = GenerationConfig(batch_size=1)
result = generate_music(dit_handler, llm_handler, params, config, save_dir="/output")
```

### Example 4: Using Shift Parameter (Turbo Model)

```python
params = GenerationParams(
    task_type="text2music",
    caption="upbeat electronic dance music",
    inference_steps=8,
    shift=3.0,  # Recommended for turbo models
    infer_method="ode",
)

config = GenerationConfig(batch_size=2)
result = generate_music(dit_handler, llm_handler, params, config, save_dir="/output")
```

### Example 5: Simple Mode with create_sample

```python
from acestep.inference import create_sample, GenerationParams, GenerationConfig, generate_music

# Step 1: Create sample from description
sample = create_sample(
    llm_handler=llm_handler,
    query="energetic K-pop dance track with catchy hooks",
    vocal_language="ko",
)

if sample.success:
    # Step 2: Generate music using the sample
    params = GenerationParams(
        caption=sample.caption,
        lyrics=sample.lyrics,
        bpm=sample.bpm,
        duration=sample.duration,
        keyscale=sample.keyscale,
        vocal_language=sample.language,
        thinking=True,
    )
    config = GenerationConfig(batch_size=2)
    result = generate_music(dit_handler, llm_handler, params, config, save_dir="/output")
```

### Example 6: Format and Enhance User Input

```python
from acestep.inference import format_sample, GenerationParams, GenerationConfig, generate_music

# Step 1: Format user input
formatted = format_sample(
    llm_handler=llm_handler,
    caption="rock ballad",
    lyrics="[Verse]\nIn the darkness I find my way...",
)

if formatted.success:
    # Step 2: Generate with enhanced input
    params = GenerationParams(
        caption=formatted.caption,
        lyrics=formatted.lyrics,
        bpm=formatted.bpm,
        duration=formatted.duration,
        keyscale=formatted.keyscale,
        thinking=True,
        use_cot_metas=False,  # Already formatted, skip metas CoT
    )
    config = GenerationConfig(batch_size=2)
    result = generate_music(dit_handler, llm_handler, params, config, save_dir="/output")
```

### Example 7: Style Cover with LM Reasoning

```python
params = GenerationParams(
    task_type="cover",
    src_audio="original_pop_song.mp3",
    caption="orchestral symphonic arrangement",
    audio_cover_strength=0.7,
    thinking=True,  # Enable LM for metadata
    use_cot_metas=True,
)

config = GenerationConfig(batch_size=1)
result = generate_music(dit_handler, llm_handler, params, config, save_dir="/output")

# Access LM-generated metadata
if result.extra_outputs.get("lm_metadata"):
    lm_meta = result.extra_outputs["lm_metadata"]
    print(f"LM detected BPM: {lm_meta.get('bpm')}")
    print(f"LM detected Key: {lm_meta.get('keyscale')}")
```

### Example 8: Batch Generation with Specific Seeds

```python
params = GenerationParams(
    task_type="text2music",
    caption="epic cinematic trailer music",
)

config = GenerationConfig(
    batch_size=4,  # Generate 4 variations
    seeds=[42, 123, 456],  # Specify 3 seeds, 4th will be random
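    # As documented for GenerationConfig.seeds: when fewer seeds than
    # batch_size are supplied, the remainder is padded with random seeds.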
use_random_seed=False, # Use provided seeds lm_batch_chunk_size=2, # Process 2 at a time (GPU memory) ) result = generate_music(dit_handler, llm_handler, params, config, save_dir="/output") if result.success: print(f"Generated {len(result.audios)} variations") for audio in result.audios: print(f" Seed {audio['params']['seed']}: {audio['path']}") ``` ### Example 9: High-Quality Generation (Base Model) ```python params = GenerationParams( task_type="text2music", caption="intricate jazz fusion with complex harmonies", inference_steps=64, # High quality guidance_scale=8.0, use_adg=True, # Adaptive Dual Guidance cfg_interval_start=0.0, cfg_interval_end=1.0, shift=3.0, # Timestep shift seed=42, # Reproducible results ) config = GenerationConfig( batch_size=1, use_random_seed=False, audio_format="wav", # Lossless format ) result = generate_music(dit_handler, llm_handler, params, config, save_dir="/output") ``` ### Example 10: Understand Audio from Codes ```python from acestep.inference import understand_music # Analyze audio codes (e.g., from a previous generation) result = understand_music( llm_handler=llm_handler, audio_codes="<|audio_code_10695|><|audio_code_54246|>...", temperature=0.85, ) if result.success: print(f"Detected Caption: {result.caption}") print(f"Detected Lyrics: {result.lyrics}") print(f"Detected BPM: {result.bpm}") print(f"Detected Key: {result.keyscale}") print(f"Detected Duration: {result.duration}s") print(f"Detected Language: {result.language}") ``` --- ## Best Practices ### 1. Caption Writing **Good Captions**: ```python # Specific and descriptive caption="upbeat electronic dance music with heavy bass and synthesizer leads" # Include mood and genre caption="melancholic indie folk with acoustic guitar and soft vocals" # Specify instruments caption="jazz trio with piano, upright bass, and brush drums" ``` **Avoid**: ```python # Too vague caption="good music" # Contradictory caption="fast slow music" # Conflicting tempos ``` ### 2. Parameter Tuning **For Best Quality**: - Use base model with `inference_steps=64` or higher - Enable `use_adg=True` - Set `guidance_scale=7.0-9.0` - Set `shift=3.0` for better timestep distribution - Use lossless audio format (`audio_format="wav"`) **For Speed**: - Use turbo model with `inference_steps=8` - Disable ADG (`use_adg=False`) - Use `infer_method="ode"` (default) - Use compressed format (`audio_format="mp3"`) or default FLAC **For Consistency**: - Set `use_random_seed=False` in config - Use fixed `seeds` list or single `seed` in params - Keep `lm_temperature` lower (0.7-0.85) **For Diversity**: - Set `use_random_seed=True` in config - Increase `lm_temperature` (0.9-1.1) - Use `batch_size > 1` for variations ### 3. Duration Guidelines - **Instrumental**: 30-180 seconds works well - **With Lyrics**: Auto-detection recommended (set `duration=-1` or leave default) - **Short clips**: 10-20 seconds minimum - **Long form**: Up to 600 seconds (10 minutes) maximum ### 4. LM Usage **When to Enable LM (`thinking=True`)**: - Need automatic metadata detection - Want caption refinement - Generating from minimal input - Need diverse outputs **When to Disable LM (`thinking=False`)**: - Have precise metadata already - Need faster generation - Want full control over parameters ### 5. Batch Processing ```python # Efficient batch generation config = GenerationConfig( batch_size=8, # Max supported allow_lm_batch=True, # Enable for speed (when thinking=True) lm_batch_chunk_size=4, # Adjust based on GPU memory ) ``` ### 6. 
Error Handling ```python result = generate_music(dit_handler, llm_handler, params, config, save_dir="/output") if not result.success: print(f"Generation failed: {result.error}") print(f"Status: {result.status_message}") else: # Process successful result for audio in result.audios: path = audio['path'] key = audio['key'] seed = audio['params']['seed'] # ... process audio files ``` ### 7. Memory Management For large batch sizes or long durations: - Monitor GPU memory usage - Reduce `batch_size` if OOM errors occur - Reduce `lm_batch_chunk_size` for LM operations - Consider using `offload_to_cpu=True` during initialization ### 8. Accessing Time Costs ```python result = generate_music(dit_handler, llm_handler, params, config, save_dir="/output") if result.success: time_costs = result.extra_outputs.get("time_costs", {}) print(f"LM Phase 1 Time: {time_costs.get('lm_phase1_time', 0):.2f}s") print(f"LM Phase 2 Time: {time_costs.get('lm_phase2_time', 0):.2f}s") print(f"DiT Total Time: {time_costs.get('dit_total_time_cost', 0):.2f}s") print(f"Pipeline Total: {time_costs.get('pipeline_total_time', 0):.2f}s") ``` --- ## Troubleshooting ### Common Issues **Issue**: Out of memory errors - **Solution**: Reduce `batch_size`, `inference_steps`, or enable CPU offloading **Issue**: Poor quality results - **Solution**: Increase `inference_steps`, adjust `guidance_scale`, use base model **Issue**: Results don't match prompt - **Solution**: Make caption more specific, increase `guidance_scale`, enable LM refinement (`thinking=True`) **Issue**: Slow generation - **Solution**: Use turbo model, reduce `inference_steps`, disable ADG **Issue**: LM not generating codes - **Solution**: Verify `llm_handler` is initialized, check `thinking=True` and `use_cot_metas=True` **Issue**: Seeds not being respected - **Solution**: Set `use_random_seed=False` in config and provide `seeds` list or `seed` in params **Issue**: Custom timesteps not working - **Solution**: Ensure timesteps are a list of floats from 1.0 to 0.0, properly ordered --- ## API Reference Summary ### GenerationParams Fields See [GenerationParams Parameters](#generationparams-parameters) for complete documentation. ### GenerationConfig Fields See [GenerationConfig Parameters](#generationconfig-parameters) for complete documentation. ### GenerationResult Fields ```python @dataclass class GenerationResult: # Audio Outputs audios: List[Dict[str, Any]] # Each audio dict contains: # - "path": str (file path) # - "tensor": Tensor (audio data) # - "key": str (unique identifier) # - "sample_rate": int (48000) # - "params": Dict (generation params with seed, audio_codes, etc.) 
# Generation Information status_message: str extra_outputs: Dict[str, Any] # extra_outputs contains: # - "lm_metadata": Dict (LM-generated metadata) # - "time_costs": Dict (timing information) # - "latents": Tensor (intermediate latents, if available) # - "masks": Tensor (attention masks, if available) # Success Status success: bool error: Optional[str] ``` --- ## Version History - **v1.5.2**: Current version - Added `shift` parameter for timestep shifting - Added `infer_method` parameter for ODE/SDE selection - Added `timesteps` parameter for custom timestep schedules - Added `understand_music()` function for audio analysis - Added `create_sample()` function for simple mode generation - Added `format_sample()` function for input enhancement - Added `UnderstandResult`, `CreateSampleResult`, `FormatSampleResult` dataclasses - **v1.5.1**: Previous version - Split `GenerationConfig` into `GenerationParams` and `GenerationConfig` - Renamed parameters for consistency (`key_scale` → `keyscale`, `time_signature` → `timesignature`, `audio_duration` → `duration`, `use_llm_thinking` → `thinking`, `audio_code_string` → `audio_codes`) - Added `instrumental` parameter - Added `use_constrained_decoding` parameter - Added CoT auto-filled fields (`cot_*`) - Changed default `audio_format` to "flac" - Changed default `batch_size` to 2 - Changed default `thinking` to True - Simplified `GenerationResult` structure with unified `audios` list - Added unified `time_costs` in `extra_outputs` - **v1.5**: Initial version - Introduced `GenerationConfig` and `GenerationResult` dataclasses - Simplified parameter passing - Added comprehensive documentation --- For more information, see: - Main README: [`../../README.md`](../../README.md) - REST API Documentation: [`API.md`](API.md) - Gradio Demo Guide: [`GRADIO_GUIDE.md`](GRADIO_GUIDE.md) - Project repository: [ACE-Step-1.5](https://github.com/yourusername/ACE-Step-1.5)