# ACE-Step Inference API Documentation

**Language / 语言 / 言語:** [English](INFERENCE.md) | [中文](../zh/INFERENCE.md) | [日本語](../ja/INFERENCE.md)

---

This document provides comprehensive documentation for the ACE-Step inference API, including parameter specifications for all supported task types.

## Table of Contents

- [Quick Start](#quick-start)
- [API Overview](#api-overview)
- [GenerationParams Parameters](#generationparams-parameters)
- [GenerationConfig Parameters](#generationconfig-parameters)
- [Task Types](#task-types)
- [Helper Functions](#helper-functions)
- [Complete Examples](#complete-examples)
- [Best Practices](#best-practices)

---

## Quick Start

### Basic Usage

```python
from acestep.handler import AceStepHandler
from acestep.llm_inference import LLMHandler
from acestep.inference import GenerationParams, GenerationConfig, generate_music

# Initialize handlers
dit_handler = AceStepHandler()
llm_handler = LLMHandler()

# Initialize services
dit_handler.initialize_service(
    project_root="/path/to/project",
    config_path="acestep-v15-turbo",
    device="cuda"
)
llm_handler.initialize(
    checkpoint_dir="/path/to/checkpoints",
    lm_model_path="acestep-5Hz-lm-0.6B",
    backend="vllm",
    device="cuda"
)

# Configure generation parameters
params = GenerationParams(
    caption="upbeat electronic dance music with heavy bass",
    bpm=128,
    duration=30,
)

# Configure generation settings
config = GenerationConfig(
    batch_size=2,
    audio_format="flac",
)

# Generate music
result = generate_music(dit_handler, llm_handler, params, config, save_dir="/path/to/output")

# Access results
if result.success:
    for audio in result.audios:
        print(f"Generated: {audio['path']}")
        print(f"Key: {audio['key']}")
        print(f"Seed: {audio['params']['seed']}")
else:
    print(f"Error: {result.error}")
```
---

## API Overview

### Main Functions

#### generate_music

```python
def generate_music(
    dit_handler,
    llm_handler,
    params: GenerationParams,
    config: GenerationConfig,
    save_dir: Optional[str] = None,
    progress=None,
) -> GenerationResult
```

Main function for generating music using the ACE-Step model.

#### understand_music

```python
def understand_music(
    llm_handler,
    audio_codes: str,
    temperature: float = 0.85,
    top_k: Optional[int] = None,
    top_p: Optional[float] = None,
    repetition_penalty: float = 1.0,
    use_constrained_decoding: bool = True,
    constrained_decoding_debug: bool = False,
) -> UnderstandResult
```

Analyze audio semantic codes and extract metadata (caption, lyrics, BPM, key, etc.).

#### create_sample

```python
def create_sample(
    llm_handler,
    query: str,
    instrumental: bool = False,
    vocal_language: Optional[str] = None,
    temperature: float = 0.85,
    top_k: Optional[int] = None,
    top_p: Optional[float] = None,
    repetition_penalty: float = 1.0,
    use_constrained_decoding: bool = True,
    constrained_decoding_debug: bool = False,
) -> CreateSampleResult
```

Generate a complete music sample (caption, lyrics, metadata) from a natural language description.

#### format_sample

```python
def format_sample(
    llm_handler,
    caption: str,
    lyrics: str,
    user_metadata: Optional[Dict[str, Any]] = None,
    temperature: float = 0.85,
    top_k: Optional[int] = None,
    top_p: Optional[float] = None,
    repetition_penalty: float = 1.0,
    use_constrained_decoding: bool = True,
    constrained_decoding_debug: bool = False,
) -> FormatSampleResult
```

Format and enhance user-provided caption and lyrics, generating structured metadata.
### Configuration Objects

The API uses two configuration dataclasses:

**GenerationParams** - Contains all music generation parameters:

```python
@dataclass
class GenerationParams:
    # Task & Instruction
    task_type: str = "text2music"
    instruction: str = "Fill the audio semantic mask based on the given conditions:"

    # Audio Uploads
    reference_audio: Optional[str] = None
    src_audio: Optional[str] = None

    # LM Codes Hints
    audio_codes: str = ""

    # Text Inputs
    caption: str = ""
    lyrics: str = ""
    instrumental: bool = False

    # Metadata
    vocal_language: str = "unknown"
    bpm: Optional[int] = None
    keyscale: str = ""
    timesignature: str = ""
    duration: float = -1.0

    # Advanced Settings
    inference_steps: int = 8
    seed: int = -1
    guidance_scale: float = 7.0
    use_adg: bool = False
    cfg_interval_start: float = 0.0
    cfg_interval_end: float = 1.0
    shift: float = 1.0  # NEW: Timestep shift factor
    infer_method: str = "ode"  # NEW: Diffusion inference method
    timesteps: Optional[List[float]] = None  # NEW: Custom timesteps
    repainting_start: float = 0.0
    repainting_end: float = -1
    audio_cover_strength: float = 1.0

    # 5Hz Language Model Parameters
    thinking: bool = True
    lm_temperature: float = 0.85
    lm_cfg_scale: float = 2.0
    lm_top_k: int = 0
    lm_top_p: float = 0.9
    lm_negative_prompt: str = "NO USER INPUT"
    use_cot_metas: bool = True
    use_cot_caption: bool = True
    use_cot_lyrics: bool = False
    use_cot_language: bool = True
    use_constrained_decoding: bool = True

    # CoT Generated Values (auto-filled by LM)
    cot_bpm: Optional[int] = None
    cot_keyscale: str = ""
    cot_timesignature: str = ""
    cot_duration: Optional[float] = None
    cot_vocal_language: str = "unknown"
    cot_caption: str = ""
    cot_lyrics: str = ""
```

**GenerationConfig** - Contains batch and output configuration:

```python
@dataclass
class GenerationConfig:
    batch_size: int = 2
    allow_lm_batch: bool = False
    use_random_seed: bool = True
    seeds: Optional[List[int]] = None
    lm_batch_chunk_size: int = 8
    constrained_decoding_debug: bool = False
    audio_format: str = "flac"
```
### Result Objects

**GenerationResult** - Result of music generation:

```python
@dataclass
class GenerationResult:
    # Audio Outputs
    audios: List[Dict[str, Any]]  # List of audio dictionaries

    # Generation Information
    status_message: str  # Status message from generation
    extra_outputs: Dict[str, Any]  # Extra outputs (latents, masks, lm_metadata, time_costs)

    # Success Status
    success: bool  # Whether generation succeeded
    error: Optional[str]  # Error message if failed
```

**Audio Dictionary Structure:**

Each item in the `audios` list contains:

```python
{
    "path": str,          # File path to saved audio
    "tensor": Tensor,     # Audio tensor [channels, samples], CPU, float32
    "key": str,           # Unique audio key (UUID based on params)
    "sample_rate": int,   # Sample rate (default: 48000)
    "params": Dict,       # Generation params for this audio (includes seed, audio_codes, etc.)
}
```
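
Because the `tensor` is already on CPU in float32 with shape `[channels, samples]`, it can be written out directly. A minimal sketch, assuming `torchaudio` is installed (it is not required by this API):

```python
import torchaudio

audio = result.audios[0]
# Write the in-memory tensor to a new file, independent of the auto-saved "path".
torchaudio.save("variation_copy.wav", audio["tensor"], audio["sample_rate"])
```
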
**UnderstandResult** - Result of music understanding:

```python
@dataclass
class UnderstandResult:
    # Metadata Fields
    caption: str = ""
    lyrics: str = ""
    bpm: Optional[int] = None
    duration: Optional[float] = None
    keyscale: str = ""
    language: str = ""
    timesignature: str = ""

    # Status
    status_message: str = ""
    success: bool = True
    error: Optional[str] = None
```

**CreateSampleResult** - Result of sample creation:

```python
@dataclass
class CreateSampleResult:
    # Metadata Fields
    caption: str = ""
    lyrics: str = ""
    bpm: Optional[int] = None
    duration: Optional[float] = None
    keyscale: str = ""
    language: str = ""
    timesignature: str = ""
    instrumental: bool = False

    # Status
    status_message: str = ""
    success: bool = True
    error: Optional[str] = None
```

**FormatSampleResult** - Result of sample formatting:

```python
@dataclass
class FormatSampleResult:
    # Metadata Fields
    caption: str = ""
    lyrics: str = ""
    bpm: Optional[int] = None
    duration: Optional[float] = None
    keyscale: str = ""
    language: str = ""
    timesignature: str = ""

    # Status
    status_message: str = ""
    success: bool = True
    error: Optional[str] = None
```
---

## GenerationParams Parameters

### Text Inputs

| Parameter | Type | Default | Description |
|-----------|------|---------|-------------|
| `caption` | `str` | `""` | Text description of the desired music. Can be a simple prompt like "relaxing piano music" or a detailed description with genre, mood, instruments, etc. Max 512 characters. |
| `lyrics` | `str` | `""` | Lyrics text for vocal music. Use `"[Instrumental]"` for instrumental tracks. Supports multiple languages. Max 4096 characters. |
| `instrumental` | `bool` | `False` | If `True`, generate instrumental music regardless of lyrics. |

### Music Metadata

| Parameter | Type | Default | Description |
|-----------|------|---------|-------------|
| `bpm` | `Optional[int]` | `None` | Beats per minute (30-300). `None` enables auto-detection via the LM. |
| `keyscale` | `str` | `""` | Musical key (e.g., "C Major", "Am", "F# minor"). Empty string enables auto-detection. |
| `timesignature` | `str` | `""` | Time signature, given as beats per bar: `"2"` for 2/4, `"3"` for 3/4, `"4"` for 4/4, `"6"` for 6/8. Empty string enables auto-detection. |
| `vocal_language` | `str` | `"unknown"` | Language code for vocals (ISO 639-1). Supported: `"en"`, `"zh"`, `"ja"`, `"es"`, `"fr"`, etc. Use `"unknown"` for auto-detection. |
| `duration` | `float` | `-1.0` | Target audio length in seconds (10-600). If `<= 0` or `None`, the model chooses automatically based on lyrics length. |

### Generation Parameters

| Parameter | Type | Default | Description |
|-----------|------|---------|-------------|
| `inference_steps` | `int` | `8` | Number of denoising steps. Turbo model: 1-20 (recommended 8). Base model: 1-200 (recommended 32-64). Higher = better quality but slower. |
| `guidance_scale` | `float` | `7.0` | Classifier-free guidance scale (1.0-15.0). Higher values increase adherence to the text prompt. Only supported for the non-turbo model. Typical range: 5.0-9.0. |
| `seed` | `int` | `-1` | Random seed for reproducibility. Use `-1` for a random seed, or any positive integer for a fixed seed. |
| | Parameter | Type | Default | Description | | |
| |-----------|------|---------|-------------| | |
| | `use_adg` | `bool` | `False` | Use Adaptive Dual Guidance (base model only). Improves quality at the cost of speed. | | |
| | `cfg_interval_start` | `float` | `0.0` | CFG application start ratio (0.0-1.0). Controls when to start applying classifier-free guidance. | | |
| | `cfg_interval_end` | `float` | `1.0` | CFG application end ratio (0.0-1.0). Controls when to stop applying classifier-free guidance. | | |
| | `shift` | `float` | `1.0` | Timestep shift factor (range 1.0-5.0, default 1.0). When != 1.0, applies `t = shift * t / (1 + (shift - 1) * t)` to timesteps. Recommended 3.0 for turbo models. | | |
| | `infer_method` | `str` | `"ode"` | Diffusion inference method. `"ode"` (Euler) is faster and deterministic. `"sde"` (stochastic) may produce different results with variance. | | |
| | `timesteps` | `Optional[List[float]]` | `None` | Custom timesteps as a list of floats from 1.0 to 0.0 (e.g., `[0.97, 0.76, 0.615, 0.5, 0.395, 0.28, 0.18, 0.085, 0]`). If provided, overrides `inference_steps` and `shift`. | | |
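
To see what the `shift` factor does to the sampling schedule, the transform can be applied to a uniform grid by hand. A minimal sketch for illustration only; this helper is not part of the API:

```python
def shifted_timesteps(n_steps: int, shift: float) -> list[float]:
    """Apply t' = shift * t / (1 + (shift - 1) * t) to a uniform 1.0 -> 0.0 grid."""
    uniform = [1.0 - i / n_steps for i in range(n_steps + 1)]
    return [shift * t / (1 + (shift - 1) * t) for t in uniform]

print(shifted_timesteps(8, shift=1.0))  # unchanged uniform grid: 1.0, 0.875, ..., 0.0
print(shifted_timesteps(8, shift=3.0))  # values stay higher longer: more steps at high noise
```
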

### Task-Specific Parameters

| Parameter | Type | Default | Description |
|-----------|------|---------|-------------|
| `task_type` | `str` | `"text2music"` | Generation task type. See the [Task Types](#task-types) section for details. |
| `instruction` | `str` | `"Fill the audio semantic mask based on the given conditions:"` | Task-specific instruction prompt. |
| `reference_audio` | `Optional[str]` | `None` | Path to a reference audio file for style transfer or continuation tasks. |
| `src_audio` | `Optional[str]` | `None` | Path to a source audio file for audio-to-audio tasks (cover, repaint, etc.). |
| `audio_codes` | `str` | `""` | Pre-extracted 5Hz audio semantic codes as a string. Advanced use only. |
| `repainting_start` | `float` | `0.0` | Repainting start time in seconds (for repaint/lego tasks). |
| `repainting_end` | `float` | `-1` | Repainting end time in seconds. Use `-1` for the end of the audio. |
| `audio_cover_strength` | `float` | `1.0` | Strength of audio cover/codes influence (0.0-1.0). Use a smaller value (e.g., 0.2) for style transfer tasks. |

### 5Hz Language Model Parameters

| Parameter | Type | Default | Description |
|-----------|------|---------|-------------|
| `thinking` | `bool` | `True` | Enable 5Hz Language Model "Chain-of-Thought" reasoning for semantic/music metadata and codes. |
| `lm_temperature` | `float` | `0.85` | LM sampling temperature (0.0-2.0). Higher = more creative/diverse, lower = more conservative. |
| `lm_cfg_scale` | `float` | `2.0` | LM classifier-free guidance scale. Higher = stronger adherence to prompt. |
| `lm_top_k` | `int` | `0` | LM top-k sampling. `0` disables top-k filtering. Typical values: 40-100. |
| `lm_top_p` | `float` | `0.9` | LM nucleus sampling (0.0-1.0). `1.0` disables nucleus sampling. Typical values: 0.9-0.95. |
| `lm_negative_prompt` | `str` | `"NO USER INPUT"` | Negative prompt for LM guidance. Helps avoid unwanted characteristics. |
| `use_cot_metas` | `bool` | `True` | Generate metadata using LM CoT reasoning (BPM, key, duration, etc.). |
| `use_cot_caption` | `bool` | `True` | Refine the user caption using LM CoT reasoning. |
| `use_cot_language` | `bool` | `True` | Detect vocal language using LM CoT reasoning. |
| `use_cot_lyrics` | `bool` | `False` | (Reserved for future use) Generate/refine lyrics using LM CoT. |
| `use_constrained_decoding` | `bool` | `True` | Enable constrained decoding for structured LM output. |
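
The sampling fields above combine as in ordinary autoregressive decoding. A hedged sketch of a more exploratory configuration; the values are illustrative, not recommendations from the authors:

```python
params = GenerationParams(
    caption="dreamy synthwave with airy pads",
    thinking=True,
    lm_temperature=1.0,   # more diverse CoT output than the 0.85 default
    lm_top_k=50,          # 0 would disable top-k filtering entirely
    lm_top_p=0.95,        # nucleus sampling cutoff
    lm_cfg_scale=2.5,     # slightly stronger prompt adherence
)
```
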

### CoT Generated Values

These fields are automatically populated by the LM when CoT reasoning is enabled:

| Parameter | Type | Default | Description |
|-----------|------|---------|-------------|
| `cot_bpm` | `Optional[int]` | `None` | LM-generated BPM value. |
| `cot_keyscale` | `str` | `""` | LM-generated key/scale. |
| `cot_timesignature` | `str` | `""` | LM-generated time signature. |
| `cot_duration` | `Optional[float]` | `None` | LM-generated duration. |
| `cot_vocal_language` | `str` | `"unknown"` | LM-detected vocal language. |
| `cot_caption` | `str` | `""` | LM-refined caption. |
| `cot_lyrics` | `str` | `""` | LM-generated/refined lyrics. |

---

## GenerationConfig Parameters

| Parameter | Type | Default | Description |
|-----------|------|---------|-------------|
| `batch_size` | `int` | `2` | Number of samples to generate in parallel (1-8). Higher values require more GPU memory. |
| `allow_lm_batch` | `bool` | `False` | Allow batch processing in the LM. Faster when `batch_size >= 2` and `thinking=True`. |
| `use_random_seed` | `bool` | `True` | Whether to use a random seed. `True` for different results each time, `False` for reproducible results. |
| `seeds` | `Optional[List[int]]` | `None` | List of seeds for batch generation. If fewer than `batch_size` are provided, the list is padded with random seeds. A single `int` is also accepted (see the sketch below). |
| `lm_batch_chunk_size` | `int` | `8` | Maximum batch size per LM inference chunk (GPU memory constraint). |
| `constrained_decoding_debug` | `bool` | `False` | Enable debug logging for constrained decoding. |
| `audio_format` | `str` | `"flac"` | Output audio format. Options: `"mp3"`, `"wav"`, `"flac"`. Default is FLAC for fast saving. |
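
A reproducible-batch configuration built from the fields above; the padding behavior follows the `seeds` description (a sketch, not additional API surface):

```python
config = GenerationConfig(
    batch_size=3,
    use_random_seed=False,  # honor the provided seeds
    seeds=[42, 123],        # third sample gets a random padding seed
    audio_format="wav",
)
```
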

---

## Task Types

ACE-Step supports 6 different generation task types, each optimized for specific use cases.

### 1. Text2Music (Default)

**Purpose**: Generate music from text descriptions and optional metadata.

**Key Parameters**:

```python
params = GenerationParams(
    task_type="text2music",
    caption="energetic rock music with electric guitar",
    lyrics="[Instrumental]",  # or actual lyrics
    bpm=140,
    duration=30,
)
```

**Required**:
- `caption` or `lyrics` (at least one)

**Optional but Recommended**:
- `bpm`: Controls tempo
- `keyscale`: Controls musical key
- `timesignature`: Controls rhythm structure
- `duration`: Controls length
- `vocal_language`: Controls vocal characteristics

**Use Cases**:
- Generate music from text descriptions
- Create backing tracks from prompts
- Generate songs with lyrics

---

### 2. Cover

**Purpose**: Transform existing audio, keeping its structure while changing style/timbre.

**Key Parameters**:

```python
params = GenerationParams(
    task_type="cover",
    src_audio="original_song.mp3",
    caption="jazz piano version",
    audio_cover_strength=0.8,  # 0.0-1.0
)
```

**Required**:
- `src_audio`: Path to source audio file
- `caption`: Description of desired style/transformation

**Optional**:
- `audio_cover_strength`: Controls influence of the original audio
  - `1.0`: Strong adherence to original structure
  - `0.5`: Balanced transformation
  - `0.1`: Loose interpretation
- `lyrics`: New lyrics (if changing vocals)

**Use Cases**:
- Create covers in different styles
- Change instrumentation while keeping the melody
- Genre transformation

---

### 3. Repaint

**Purpose**: Regenerate a specific time segment of audio while keeping the rest unchanged.

**Key Parameters**:

```python
params = GenerationParams(
    task_type="repaint",
    src_audio="original.mp3",
    repainting_start=10.0,  # seconds
    repainting_end=20.0,    # seconds
    caption="smooth transition with piano solo",
)
```

**Required**:
- `src_audio`: Path to source audio file
- `repainting_start`: Start time in seconds
- `repainting_end`: End time in seconds (use `-1` for end of file)
- `caption`: Description of desired content for the repainted section

**Use Cases**:
- Fix specific sections of generated music
- Add variations to parts of a song
- Create smooth transitions
- Replace problematic segments
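
Repainting composes naturally with a prior generation: the saved `path` of a generated audio can serve as the next call's `src_audio`. A sketch assuming `base_params` (hypothetical) and the `config` and handlers from earlier examples:

```python
# First pass: generate a full track.
first = generate_music(dit_handler, llm_handler, base_params, config, save_dir="/output")

# Second pass: regenerate only seconds 10-20 of that track.
if first.success:
    fix = GenerationParams(
        task_type="repaint",
        src_audio=first.audios[0]["path"],  # reuse the saved file as the source
        repainting_start=10.0,
        repainting_end=20.0,
        caption="smoother bridge with soft pads",
    )
    result = generate_music(dit_handler, llm_handler, fix, config, save_dir="/output")
```
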

---

### 4. Lego (Base Model Only)

**Purpose**: Generate a specific instrument track in the context of existing audio.

**Key Parameters**:

```python
params = GenerationParams(
    task_type="lego",
    src_audio="backing_track.mp3",
    instruction="Generate the guitar track based on the audio context:",
    caption="lead guitar melody with bluesy feel",
    repainting_start=0.0,
    repainting_end=-1,
)
```

**Required**:
- `src_audio`: Path to source/backing audio
- `instruction`: Must specify the track type (e.g., "Generate the {TRACK_NAME} track...")
- `caption`: Description of desired track characteristics

**Available Tracks**:
`"vocals"`, `"backing_vocals"`, `"drums"`, `"bass"`, `"guitar"`, `"keyboard"`, `"percussion"`, `"strings"`, `"synth"`, `"fx"`, `"brass"`, `"woodwinds"`

**Use Cases**:
- Add specific instrument tracks
- Layer additional instruments over backing tracks
- Create multi-track compositions iteratively

---

### 5. Extract (Base Model Only)

**Purpose**: Extract/isolate a specific instrument track from mixed audio.

**Key Parameters**:

```python
params = GenerationParams(
    task_type="extract",
    src_audio="full_mix.mp3",
    instruction="Extract the vocals track from the audio:",
)
```

**Required**:
- `src_audio`: Path to mixed audio file
- `instruction`: Must specify the track to extract

**Available Tracks**: Same as the Lego task

**Use Cases**:
- Stem separation (see the sketch below)
- Isolate specific instruments
- Create remixes
- Analyze individual tracks
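
Since the instruction string names the track, a set of stems can be pulled with a simple loop. A sketch reusing the handlers from Quick Start; the instruction template follows the example above:

```python
stems = {}
for track in ["vocals", "drums", "bass"]:
    p = GenerationParams(
        task_type="extract",
        src_audio="full_mix.mp3",
        instruction=f"Extract the {track} track from the audio:",
    )
    r = generate_music(dit_handler, llm_handler, p, GenerationConfig(batch_size=1), save_dir="/stems")
    if r.success:
        stems[track] = r.audios[0]["path"]  # one saved stem per track
```
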

---

### 6. Complete (Base Model Only)

**Purpose**: Complete/extend partial tracks with specified instruments.

**Key Parameters**:

```python
params = GenerationParams(
    task_type="complete",
    src_audio="incomplete_track.mp3",
    instruction="Complete the input track with drums, bass, guitar:",
    caption="rock style completion",
)
```

**Required**:
- `src_audio`: Path to incomplete/partial track
- `instruction`: Must specify which tracks to add
- `caption`: Description of desired style

**Use Cases**:
- Arrange incomplete compositions
- Add backing tracks
- Auto-complete musical ideas

---

## Helper Functions

### understand_music

Analyze audio codes to extract metadata about the music.

```python
from acestep.inference import understand_music

result = understand_music(
    llm_handler=llm_handler,
    audio_codes="<|audio_code_123|><|audio_code_456|>...",
    temperature=0.85,
    use_constrained_decoding=True,
)

if result.success:
    print(f"Caption: {result.caption}")
    print(f"Lyrics: {result.lyrics}")
    print(f"BPM: {result.bpm}")
    print(f"Key: {result.keyscale}")
    print(f"Duration: {result.duration}s")
    print(f"Language: {result.language}")
else:
    print(f"Error: {result.error}")
```

**Use Cases**:
- Analyze existing music
- Extract metadata from audio codes
- Reverse-engineer generation parameters

---

### create_sample

Generate a complete music sample from a natural language description. This is the "Simple Mode" / "Inspiration Mode" feature.

```python
from acestep.inference import create_sample

result = create_sample(
    llm_handler=llm_handler,
    query="a soft Bengali love song for a quiet evening",
    instrumental=False,
    vocal_language="bn",  # Optional: constrain to Bengali
    temperature=0.85,
)

if result.success:
    print(f"Caption: {result.caption}")
    print(f"Lyrics: {result.lyrics}")
    print(f"BPM: {result.bpm}")
    print(f"Duration: {result.duration}s")
    print(f"Key: {result.keyscale}")
    print(f"Is Instrumental: {result.instrumental}")

    # Use with generate_music
    params = GenerationParams(
        caption=result.caption,
        lyrics=result.lyrics,
        bpm=result.bpm,
        duration=result.duration,
        keyscale=result.keyscale,
        vocal_language=result.language,
    )
else:
    print(f"Error: {result.error}")
```

**Parameters**:

| Parameter | Type | Default | Description |
|-----------|------|---------|-------------|
| `query` | `str` | required | Natural language description of desired music |
| `instrumental` | `bool` | `False` | Whether to generate instrumental music |
| `vocal_language` | `Optional[str]` | `None` | Constrain lyrics to a specific language (e.g., "en", "zh", "bn") |
| `temperature` | `float` | `0.85` | Sampling temperature |
| `top_k` | `Optional[int]` | `None` | Top-k sampling (`None` disables) |
| `top_p` | `Optional[float]` | `None` | Top-p sampling (`None` disables) |
| `repetition_penalty` | `float` | `1.0` | Repetition penalty |
| `use_constrained_decoding` | `bool` | `True` | Use FSM-based constrained decoding |

---

### format_sample

Format and enhance user-provided caption and lyrics, generating structured metadata.

```python
from acestep.inference import format_sample

result = format_sample(
    llm_handler=llm_handler,
    caption="Latin pop, reggaeton",
    lyrics="[Verse 1]\nBailando en la noche...",
    user_metadata={"bpm": 95},  # Optional: constrain specific values
    temperature=0.85,
)

if result.success:
    print(f"Enhanced Caption: {result.caption}")
    print(f"Formatted Lyrics: {result.lyrics}")
    print(f"BPM: {result.bpm}")
    print(f"Duration: {result.duration}s")
    print(f"Key: {result.keyscale}")
    print(f"Detected Language: {result.language}")
else:
    print(f"Error: {result.error}")
```

**Parameters**:

| Parameter | Type | Default | Description |
|-----------|------|---------|-------------|
| `caption` | `str` | required | User's caption/description |
| `lyrics` | `str` | required | User's lyrics with structure tags |
| `user_metadata` | `Optional[Dict]` | `None` | Constrain specific metadata values (bpm, duration, keyscale, timesignature, language) |
| `temperature` | `float` | `0.85` | Sampling temperature |
| `top_k` | `Optional[int]` | `None` | Top-k sampling (`None` disables) |
| `top_p` | `Optional[float]` | `None` | Top-p sampling (`None` disables) |
| `repetition_penalty` | `float` | `1.0` | Repetition penalty |
| `use_constrained_decoding` | `bool` | `True` | Use FSM-based constrained decoding |

---

## Complete Examples

### Example 1: Simple Text-to-Music Generation

```python
from acestep.inference import GenerationParams, GenerationConfig, generate_music

params = GenerationParams(
    task_type="text2music",
    caption="calm ambient music with soft piano and strings",
    duration=60,
    bpm=80,
    keyscale="C Major",
)

config = GenerationConfig(
    batch_size=2,  # Generate 2 variations
    audio_format="flac",
)

result = generate_music(dit_handler, llm_handler, params, config, save_dir="/output")

if result.success:
    for i, audio in enumerate(result.audios, 1):
        print(f"Variation {i}: {audio['path']}")
```

### Example 2: Song Generation with Lyrics

```python
params = GenerationParams(
    task_type="text2music",
    caption="pop ballad with emotional vocals",
    lyrics="""[Verse 1]
Walking down the street today
Thinking of the words you used to say
Everything feels different now
But I'll find my way somehow

[Chorus]
I'm moving on, I'm staying strong
This is where I belong
""",
    vocal_language="en",
    bpm=72,
    duration=45,
)

config = GenerationConfig(batch_size=1)
result = generate_music(dit_handler, llm_handler, params, config, save_dir="/output")
```

### Example 3: Using Custom Timesteps

```python
params = GenerationParams(
    task_type="text2music",
    caption="jazz fusion with complex harmonies",
    # Custom 9-step schedule
    timesteps=[0.97, 0.76, 0.615, 0.5, 0.395, 0.28, 0.18, 0.085, 0],
    thinking=True,
)

config = GenerationConfig(batch_size=1)
result = generate_music(dit_handler, llm_handler, params, config, save_dir="/output")
```

### Example 4: Using Shift Parameter (Turbo Model)

```python
params = GenerationParams(
    task_type="text2music",
    caption="upbeat electronic dance music",
    inference_steps=8,
    shift=3.0,  # Recommended for turbo models
    infer_method="ode",
)

config = GenerationConfig(batch_size=2)
result = generate_music(dit_handler, llm_handler, params, config, save_dir="/output")
```

### Example 5: Simple Mode with create_sample

```python
from acestep.inference import create_sample, GenerationParams, GenerationConfig, generate_music

# Step 1: Create sample from description
sample = create_sample(
    llm_handler=llm_handler,
    query="energetic K-pop dance track with catchy hooks",
    vocal_language="ko",
)

if sample.success:
    # Step 2: Generate music using the sample
    params = GenerationParams(
        caption=sample.caption,
        lyrics=sample.lyrics,
        bpm=sample.bpm,
        duration=sample.duration,
        keyscale=sample.keyscale,
        vocal_language=sample.language,
        thinking=True,
    )
    config = GenerationConfig(batch_size=2)
    result = generate_music(dit_handler, llm_handler, params, config, save_dir="/output")
```

### Example 6: Format and Enhance User Input

```python
from acestep.inference import format_sample, GenerationParams, GenerationConfig, generate_music

# Step 1: Format user input
formatted = format_sample(
    llm_handler=llm_handler,
    caption="rock ballad",
    lyrics="[Verse]\nIn the darkness I find my way...",
)

if formatted.success:
    # Step 2: Generate with enhanced input
    params = GenerationParams(
        caption=formatted.caption,
        lyrics=formatted.lyrics,
        bpm=formatted.bpm,
        duration=formatted.duration,
        keyscale=formatted.keyscale,
        thinking=True,
        use_cot_metas=False,  # Already formatted, skip metas CoT
    )
    config = GenerationConfig(batch_size=2)
    result = generate_music(dit_handler, llm_handler, params, config, save_dir="/output")
```

### Example 7: Style Cover with LM Reasoning

```python
params = GenerationParams(
    task_type="cover",
    src_audio="original_pop_song.mp3",
    caption="orchestral symphonic arrangement",
    audio_cover_strength=0.7,
    thinking=True,  # Enable LM for metadata
    use_cot_metas=True,
)

config = GenerationConfig(batch_size=1)
result = generate_music(dit_handler, llm_handler, params, config, save_dir="/output")

# Access LM-generated metadata
if result.extra_outputs.get("lm_metadata"):
    lm_meta = result.extra_outputs["lm_metadata"]
    print(f"LM detected BPM: {lm_meta.get('bpm')}")
    print(f"LM detected Key: {lm_meta.get('keyscale')}")
```

### Example 8: Batch Generation with Specific Seeds

```python
params = GenerationParams(
    task_type="text2music",
    caption="epic cinematic trailer music",
)

config = GenerationConfig(
    batch_size=4,           # Generate 4 variations
    seeds=[42, 123, 456],   # Specify 3 seeds, 4th will be random
    use_random_seed=False,  # Use provided seeds
    lm_batch_chunk_size=2,  # Process 2 at a time (GPU memory)
)

result = generate_music(dit_handler, llm_handler, params, config, save_dir="/output")

if result.success:
    print(f"Generated {len(result.audios)} variations")
    for audio in result.audios:
        print(f"  Seed {audio['params']['seed']}: {audio['path']}")
```

### Example 9: High-Quality Generation (Base Model)

```python
params = GenerationParams(
    task_type="text2music",
    caption="intricate jazz fusion with complex harmonies",
    inference_steps=64,  # High quality
    guidance_scale=8.0,
    use_adg=True,  # Adaptive Dual Guidance
    cfg_interval_start=0.0,
    cfg_interval_end=1.0,
    shift=3.0,  # Timestep shift
    seed=42,  # Reproducible results
)

config = GenerationConfig(
    batch_size=1,
    use_random_seed=False,
    audio_format="wav",  # Lossless format
)

result = generate_music(dit_handler, llm_handler, params, config, save_dir="/output")
```

### Example 10: Understand Audio from Codes

```python
from acestep.inference import understand_music

# Analyze audio codes (e.g., from a previous generation)
result = understand_music(
    llm_handler=llm_handler,
    audio_codes="<|audio_code_10695|><|audio_code_54246|>...",
    temperature=0.85,
)

if result.success:
    print(f"Detected Caption: {result.caption}")
    print(f"Detected Lyrics: {result.lyrics}")
    print(f"Detected BPM: {result.bpm}")
    print(f"Detected Key: {result.keyscale}")
    print(f"Detected Duration: {result.duration}s")
    print(f"Detected Language: {result.language}")
```

---

## Best Practices

### 1. Caption Writing

**Good Captions**:

```python
# Specific and descriptive
caption = "upbeat electronic dance music with heavy bass and synthesizer leads"

# Include mood and genre
caption = "melancholic indie folk with acoustic guitar and soft vocals"

# Specify instruments
caption = "jazz trio with piano, upright bass, and brush drums"
```

**Avoid**:

```python
# Too vague
caption = "good music"

# Contradictory
caption = "fast slow music"  # Conflicting tempos
```

### 2. Parameter Tuning

The quality and speed checklists here are collected into two example presets in the sketch after these lists.

**For Best Quality**:
- Use the base model with `inference_steps=64` or higher
- Enable `use_adg=True`
- Set `guidance_scale` in the 7.0-9.0 range
- Set `shift=3.0` for better timestep distribution
- Use a lossless audio format (`audio_format="wav"`)

**For Speed**:
- Use the turbo model with `inference_steps=8`
- Disable ADG (`use_adg=False`)
- Use `infer_method="ode"` (default)
- Use a compressed format (`audio_format="mp3"`) or the default FLAC

**For Consistency**:
- Set `use_random_seed=False` in the config
- Use a fixed `seeds` list or a single `seed` in params
- Keep `lm_temperature` lower (0.7-0.85)

**For Diversity**:
- Set `use_random_seed=True` in the config
- Increase `lm_temperature` (0.9-1.1)
- Use `batch_size > 1` for variations
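
A sketch collecting those checklists into presets; the values come from the lists above and are starting points, not tuned settings:

```python
# Quality-first preset (base model).
quality_params = GenerationParams(
    caption="intricate jazz fusion with complex harmonies",
    inference_steps=64,
    guidance_scale=8.0,
    use_adg=True,
    shift=3.0,
)
quality_config = GenerationConfig(batch_size=1, audio_format="wav")

# Speed-first preset (turbo model).
fast_params = GenerationParams(
    caption="intricate jazz fusion with complex harmonies",
    inference_steps=8,
    use_adg=False,
    infer_method="ode",
)
fast_config = GenerationConfig(batch_size=1, audio_format="flac")
```
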

### 3. Duration Guidelines

- **Instrumental**: 30-180 seconds works well
- **With Lyrics**: Auto-detection recommended (set `duration=-1` or leave the default)
- **Short clips**: 10-20 seconds minimum
- **Long form**: Up to 600 seconds (10 minutes) maximum

### 4. LM Usage

**When to Enable LM (`thinking=True`)**:
- Need automatic metadata detection
- Want caption refinement
- Generating from minimal input
- Need diverse outputs

**When to Disable LM (`thinking=False`)**:
- Have precise metadata already
- Need faster generation
- Want full control over parameters (see the sketch after this list)
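
A sketch of the fully manual path, with the LM disabled and every metadata field supplied by the caller; the values are illustrative:

```python
params = GenerationParams(
    caption="lofi hip hop beat with vinyl crackle",
    lyrics="[Instrumental]",
    bpm=85,
    keyscale="A minor",
    timesignature="4",    # 4/4, per the Music Metadata table
    duration=60,
    thinking=False,       # skip LM CoT entirely
    use_cot_metas=False,  # rely on the metadata above
)
```
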

### 5. Batch Processing

```python
# Efficient batch generation
config = GenerationConfig(
    batch_size=8,           # Max supported
    allow_lm_batch=True,    # Enable for speed (when thinking=True)
    lm_batch_chunk_size=4,  # Adjust based on GPU memory
)
```

### 6. Error Handling

```python
result = generate_music(dit_handler, llm_handler, params, config, save_dir="/output")

if not result.success:
    print(f"Generation failed: {result.error}")
    print(f"Status: {result.status_message}")
else:
    # Process successful result
    for audio in result.audios:
        path = audio['path']
        key = audio['key']
        seed = audio['params']['seed']
        # ... process audio files
```

### 7. Memory Management

For large batch sizes or long durations:

- Monitor GPU memory usage
- Reduce `batch_size` if OOM errors occur (see the sketch after this list)
- Reduce `lm_batch_chunk_size` for LM operations
- Consider using `offload_to_cpu=True` during initialization
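
A sketch of the batch-size fallback. It assumes an OOM surfaces either as a `torch.cuda.OutOfMemoryError` or via `result.error`; that is an assumption about the handlers, not documented behavior:

```python
import torch
from acestep.inference import GenerationConfig, generate_music

def generate_with_fallback(params, config: GenerationConfig, save_dir: str):
    """Halve batch_size after an OOM until generation succeeds or batch_size hits 1."""
    while True:
        try:
            result = generate_music(dit_handler, llm_handler, params, config, save_dir=save_dir)
        except torch.cuda.OutOfMemoryError:
            result = None
        if result is not None and result.success:
            return result
        if config.batch_size <= 1:
            return result  # out of options; caller inspects result.error
        torch.cuda.empty_cache()
        config.batch_size = max(1, config.batch_size // 2)
```
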

### 8. Accessing Time Costs

```python
result = generate_music(dit_handler, llm_handler, params, config, save_dir="/output")

if result.success:
    time_costs = result.extra_outputs.get("time_costs", {})
    print(f"LM Phase 1 Time: {time_costs.get('lm_phase1_time', 0):.2f}s")
    print(f"LM Phase 2 Time: {time_costs.get('lm_phase2_time', 0):.2f}s")
    print(f"DiT Total Time: {time_costs.get('dit_total_time_cost', 0):.2f}s")
    print(f"Pipeline Total: {time_costs.get('pipeline_total_time', 0):.2f}s")
```

---

## Troubleshooting

### Common Issues

**Issue**: Out of memory errors
- **Solution**: Reduce `batch_size`, `inference_steps`, or enable CPU offloading

**Issue**: Poor quality results
- **Solution**: Increase `inference_steps`, adjust `guidance_scale`, use the base model

**Issue**: Results don't match the prompt
- **Solution**: Make the caption more specific, increase `guidance_scale`, enable LM refinement (`thinking=True`)

**Issue**: Slow generation
- **Solution**: Use the turbo model, reduce `inference_steps`, disable ADG

**Issue**: LM not generating codes
- **Solution**: Verify `llm_handler` is initialized, check `thinking=True` and `use_cot_metas=True`

**Issue**: Seeds not being respected
- **Solution**: Set `use_random_seed=False` in the config and provide a `seeds` list or `seed` in params

**Issue**: Custom timesteps not working
- **Solution**: Ensure timesteps are a list of floats from 1.0 to 0.0, in strictly decreasing order (a quick check is sketched below)
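
A quick sanity check for a custom schedule; this is a standalone helper for illustration, not part of the API:

```python
def check_timesteps(ts: list[float]) -> None:
    """Raise if a custom timestep schedule is out of range or misordered."""
    if not all(0.0 <= t <= 1.0 for t in ts):
        raise ValueError("timesteps must lie in [0.0, 1.0]")
    if not all(a > b for a, b in zip(ts, ts[1:])):
        raise ValueError("timesteps must be strictly decreasing (1.0 -> 0.0)")

check_timesteps([0.97, 0.76, 0.615, 0.5, 0.395, 0.28, 0.18, 0.085, 0])  # passes
```
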

---

## API Reference Summary

### GenerationParams Fields

See [GenerationParams Parameters](#generationparams-parameters) for complete documentation.

### GenerationConfig Fields

See [GenerationConfig Parameters](#generationconfig-parameters) for complete documentation.

### GenerationResult Fields

```python
@dataclass
class GenerationResult:
    # Audio Outputs
    audios: List[Dict[str, Any]]
    # Each audio dict contains:
    # - "path": str (file path)
    # - "tensor": Tensor (audio data)
    # - "key": str (unique identifier)
    # - "sample_rate": int (48000)
    # - "params": Dict (generation params with seed, audio_codes, etc.)

    # Generation Information
    status_message: str
    extra_outputs: Dict[str, Any]
    # extra_outputs contains:
    # - "lm_metadata": Dict (LM-generated metadata)
    # - "time_costs": Dict (timing information)
    # - "latents": Tensor (intermediate latents, if available)
    # - "masks": Tensor (attention masks, if available)

    # Success Status
    success: bool
    error: Optional[str]
```

---

## Version History

- **v1.5.2**: Current version
  - Added `shift` parameter for timestep shifting
  - Added `infer_method` parameter for ODE/SDE selection
  - Added `timesteps` parameter for custom timestep schedules
  - Added `understand_music()` function for audio analysis
  - Added `create_sample()` function for simple mode generation
  - Added `format_sample()` function for input enhancement
  - Added `UnderstandResult`, `CreateSampleResult`, `FormatSampleResult` dataclasses
- **v1.5.1**: Previous version
  - Split `GenerationConfig` into `GenerationParams` and `GenerationConfig`
  - Renamed parameters for consistency (`key_scale` → `keyscale`, `time_signature` → `timesignature`, `audio_duration` → `duration`, `use_llm_thinking` → `thinking`, `audio_code_string` → `audio_codes`)
  - Added `instrumental` parameter
  - Added `use_constrained_decoding` parameter
  - Added CoT auto-filled fields (`cot_*`)
  - Changed default `audio_format` to "flac"
  - Changed default `batch_size` to 2
  - Changed default `thinking` to True
  - Simplified `GenerationResult` structure with a unified `audios` list
  - Added unified `time_costs` in `extra_outputs`
- **v1.5**: Initial version
  - Introduced `GenerationConfig` and `GenerationResult` dataclasses
  - Simplified parameter passing
  - Added comprehensive documentation

---

For more information, see:

- Main README: [`../../README.md`](../../README.md)
- REST API Documentation: [`API.md`](API.md)
- Gradio Demo Guide: [`GRADIO_GUIDE.md`](GRADIO_GUIDE.md)
- Project repository: [ACE-Step-1.5](https://github.com/yourusername/ACE-Step-1.5)