ChuxiJ committed on
Commit 858eb3e
1 Parent(s): 77c327b

update doc

Files changed (1)
  1. INFERENCE.md +314 -132

INFERENCE.md CHANGED
@@ -6,7 +6,8 @@ This document provides comprehensive documentation for the ACE-Step inference AP
6
 
7
  - [Quick Start](#quick-start)
8
  - [API Overview](#api-overview)
9
- - [Configuration Parameters](#configuration-parameters)
 
10
  - [Task Types](#task-types)
11
  - [Complete Examples](#complete-examples)
12
  - [Best Practices](#best-practices)
@@ -20,7 +21,7 @@ This document provides comprehensive documentation for the ACE-Step inference AP
20
  ```python
21
  from acestep.handler import AceStepHandler
22
  from acestep.llm_inference import LLMHandler
23
- from acestep.inference import GenerationConfig, generate_music
24
 
25
  # Initialize handlers
26
  dit_handler = AceStepHandler()
@@ -40,21 +41,28 @@ llm_handler.initialize(
40
  device="cuda"
41
  )
42
 
43
- # Configure generation
44
- config = GenerationConfig(
45
  caption="upbeat electronic dance music with heavy bass",
46
  bpm=128,
47
- audio_duration=30,
48
- batch_size=1,
49
  )
50
 
51
  # Generate music
52
- result = generate_music(dit_handler, llm_handler, config)
53
 
54
  # Access results
55
  if result.success:
56
- for audio_path in result.audio_paths:
57
- print(f"Generated: {audio_path}")
 
 
58
  else:
59
  print(f"Error: {result.error}")
60
  ```
@@ -67,23 +75,94 @@ else:
67
 
68
  ```python
69
  def generate_music(
70
- dit_handler: AceStepHandler,
71
- llm_handler: LLMHandler,
 
72
  config: GenerationConfig,
 
 
73
  ) -> GenerationResult
74
  ```
75
 
76
- ### Configuration Object
 
 
77
 
78
- The `GenerationConfig` dataclass consolidates all generation parameters:
79
 
80
  ```python
81
  @dataclass
82
- class GenerationConfig:
83
- # Required parameters with sensible defaults
84
  caption: str = ""
85
  lyrics: str = ""
86
- # ... (see full parameter list below)
87
  ```
88
 
89
  ### Result Object
@@ -91,46 +170,61 @@ class GenerationConfig:
91
  ```python
92
  @dataclass
93
  class GenerationResult:
94
- audio_paths: List[str] # Paths to generated audio files
95
- generation_info: str # Markdown-formatted info
96
- status_message: str # Status message
97
- seed_value: str # Seed used
98
- lm_metadata: Optional[Dict] # LM-generated metadata
99
- success: bool # Success flag
100
- error: Optional[str] # Error message if failed
101
- # ... (see full fields below)
102
  ```
103
 
104
  ---
105
 
106
- ## Configuration Parameters
107
 
108
  ### Text Inputs
109
 
110
  | Parameter | Type | Default | Description |
111
  |-----------|------|---------|-------------|
112
- | `caption` | `str` | `""` | Text description of the desired music. Can be a simple prompt like "relaxing piano music" or detailed description with genre, mood, instruments, etc. |
113
- | `lyrics` | `str` | `""` | Lyrics text for vocal music. Use `"[Instrumental]"` for instrumental tracks. Supports multiple languages. |
 
114
 
115
  ### Music Metadata
116
 
117
  | Parameter | Type | Default | Description |
118
  |-----------|------|---------|-------------|
119
  | `bpm` | `Optional[int]` | `None` | Beats per minute (30-300). `None` enables auto-detection via LM. |
120
- | `key_scale` | `str` | `""` | Musical key (e.g., "C Major", "Am", "F# minor"). Empty string enables auto-detection. |
121
- | `time_signature` | `str` | `""` | Time signature (e.g., "4/4", "3/4", "6/8"). Empty string enables auto-detection. |
122
  | `vocal_language` | `str` | `"unknown"` | Language code for vocals (ISO 639-1). Supported: `"en"`, `"zh"`, `"ja"`, `"es"`, `"fr"`, etc. Use `"unknown"` for auto-detection. |
123
- | `audio_duration` | `Optional[float]` | `None` | Duration in seconds (10-600). `None` enables auto-detection based on lyrics length. |
124
 
125
  ### Generation Parameters
126
 
127
  | Parameter | Type | Default | Description |
128
  |-----------|------|---------|-------------|
129
  | `inference_steps` | `int` | `8` | Number of denoising steps. Turbo model: 1-8 (recommended 8). Base model: 1-100 (recommended 32-64). Higher = better quality but slower. |
130
- | `guidance_scale` | `float` | `7.0` | Classifier-free guidance scale (1.0-15.0). Higher values increase adherence to text prompt. Typical range: 5.0-9.0. |
131
- | `use_random_seed` | `bool` | `True` | Whether to use random seed. `True` for different results each time, `False` for reproducible results. |
132
  | `seed` | `int` | `-1` | Random seed for reproducibility. Use `-1` for random seed, or any positive integer for fixed seed. |
133
- | `batch_size` | `int` | `1` | Number of samples to generate in parallel (1-8). Higher values require more GPU memory. |
134
 
135
  ### Advanced DiT Parameters
136
 
@@ -139,43 +233,63 @@ class GenerationResult:
139
  | `use_adg` | `bool` | `False` | Use Adaptive Dual Guidance (base model only). Improves quality at the cost of speed. |
140
  | `cfg_interval_start` | `float` | `0.0` | CFG application start ratio (0.0-1.0). Controls when to start applying classifier-free guidance. |
141
  | `cfg_interval_end` | `float` | `1.0` | CFG application end ratio (0.0-1.0). Controls when to stop applying classifier-free guidance. |
142
- | `audio_format` | `str` | `"mp3"` | Output audio format. Options: `"mp3"`, `"wav"`, `"flac"`. |
143
 
144
  ### Task-Specific Parameters
145
 
146
  | Parameter | Type | Default | Description |
147
  |-----------|------|---------|-------------|
148
  | `task_type` | `str` | `"text2music"` | Generation task type. See [Task Types](#task-types) section for details. |
 
149
  | `reference_audio` | `Optional[str]` | `None` | Path to reference audio file for style transfer or continuation tasks. |
150
  | `src_audio` | `Optional[str]` | `None` | Path to source audio file for audio-to-audio tasks (cover, repaint, etc.). |
151
- | `audio_code_string` | `Union[str, List[str]]` | `""` | Pre-extracted 5Hz audio codes. Can be single string or list for batch mode. Advanced use only. |
152
  | `repainting_start` | `float` | `0.0` | Repainting start time in seconds (for repaint/lego tasks). |
153
  | `repainting_end` | `float` | `-1` | Repainting end time in seconds. Use `-1` for end of audio. |
154
- | `audio_cover_strength` | `float` | `1.0` | Strength of audio cover/codes influence (0.0-1.0). Higher = stronger influence from source audio. |
155
- | `instruction` | `str` | `""` | Task-specific instruction prompt. Auto-generated if empty. |
156
 
157
  ### 5Hz Language Model Parameters
158
 
159
  | Parameter | Type | Default | Description |
160
  |-----------|------|---------|-------------|
161
- | `use_llm_thinking` | `bool` | `False` | Enable LM-based Chain-of-Thought reasoning. When enabled, LM generates metadata and/or audio codes. |
162
  | `lm_temperature` | `float` | `0.85` | LM sampling temperature (0.0-2.0). Higher = more creative/diverse, lower = more conservative. |
163
- | `lm_cfg_scale` | `float` | `2.0` | LM classifier-free guidance scale (1.0-5.0). Higher = stronger adherence to prompt. |
164
  | `lm_top_k` | `int` | `0` | LM top-k sampling. `0` disables top-k filtering. Typical values: 40-100. |
165
  | `lm_top_p` | `float` | `0.9` | LM nucleus sampling (0.0-1.0). `1.0` disables nucleus sampling. Typical values: 0.9-0.95. |
166
  | `lm_negative_prompt` | `str` | `"NO USER INPUT"` | Negative prompt for LM guidance. Helps avoid unwanted characteristics. |
167
  | `use_cot_metas` | `bool` | `True` | Generate metadata using LM CoT reasoning (BPM, key, duration, etc.). |
168
  | `use_cot_caption` | `bool` | `True` | Refine user caption using LM CoT reasoning. |
169
  | `use_cot_language` | `bool` | `True` | Detect vocal language using LM CoT reasoning. |
170
- | `is_format_caption` | `bool` | `False` | Whether caption is already formatted/refined (skip LM refinement). |
171
- | `constrained_decoding_debug` | `bool` | `False` | Enable debug logging for constrained decoding. |
172
 
173
- ### Batch LM Generation
174
 
175
  | Parameter | Type | Default | Description |
176
  |-----------|------|---------|-------------|
177
- | `allow_lm_batch` | `bool` | `False` | Allow batch LM code generation. Faster when `batch_size >= 2` and `use_llm_thinking=True`. |
178
- | `lm_batch_chunk_size` | `int` | `4` | Maximum batch size per LM inference chunk (GPU memory constraint). |
179
 
180
  ---
181
 
@@ -189,12 +303,12 @@ ACE-Step supports 6 different generation task types, each optimized for specific
189
 
190
  **Key Parameters**:
191
  ```python
192
- config = GenerationConfig(
193
  task_type="text2music",
194
  caption="energetic rock music with electric guitar",
195
  lyrics="[Instrumental]", # or actual lyrics
196
  bpm=140,
197
- audio_duration=30,
198
  )
199
  ```
200
 
@@ -203,9 +317,9 @@ config = GenerationConfig(
203
 
204
  **Optional but Recommended**:
205
  - `bpm`: Controls tempo
206
- - `key_scale`: Controls musical key
207
- - `time_signature`: Controls rhythm structure
208
- - `audio_duration`: Controls length
209
  - `vocal_language`: Controls vocal characteristics
210
 
211
  **Use Cases**:
@@ -221,7 +335,7 @@ config = GenerationConfig(
221
 
222
  **Key Parameters**:
223
  ```python
224
- config = GenerationConfig(
225
  task_type="cover",
226
  src_audio="original_song.mp3",
227
  caption="jazz piano version",
@@ -253,7 +367,7 @@ config = GenerationConfig(
253
 
254
  **Key Parameters**:
255
  ```python
256
- config = GenerationConfig(
257
  task_type="repaint",
258
  src_audio="original.mp3",
259
  repainting_start=10.0, # seconds
@@ -282,7 +396,7 @@ config = GenerationConfig(
282
 
283
  **Key Parameters**:
284
  ```python
285
- config = GenerationConfig(
286
  task_type="lego",
287
  src_audio="backing_track.mp3",
288
  instruction="Generate the guitar track based on the audio context:",
@@ -314,7 +428,7 @@ config = GenerationConfig(
314
 
315
  **Key Parameters**:
316
  ```python
317
- config = GenerationConfig(
318
  task_type="extract",
319
  src_audio="full_mix.mp3",
320
  instruction="Extract the vocals track from the audio:",
@@ -341,7 +455,7 @@ config = GenerationConfig(
341
 
342
  **Key Parameters**:
343
  ```python
344
- config = GenerationConfig(
345
  task_type="complete",
346
  src_audio="incomplete_track.mp3",
347
  instruction="Complete the input track with drums, bass, guitar:",
@@ -366,28 +480,32 @@ config = GenerationConfig(
366
  ### Example 1: Simple Text-to-Music Generation
367
 
368
  ```python
369
- from acestep.inference import GenerationConfig, generate_music
370
 
371
- config = GenerationConfig(
372
  task_type="text2music",
373
  caption="calm ambient music with soft piano and strings",
374
- audio_duration=60,
375
  bpm=80,
376
- key_scale="C Major",
377
  batch_size=2, # Generate 2 variations
 
378
  )
379
 
380
- result = generate_music(dit_handler, llm_handler, config)
381
 
382
  if result.success:
383
- for i, path in enumerate(result.audio_paths, 1):
384
- print(f"Variation {i}: {path}")
385
  ```
386
 
387
  ### Example 2: Song Generation with Lyrics
388
 
389
  ```python
390
- config = GenerationConfig(
391
  task_type="text2music",
392
  caption="pop ballad with emotional vocals",
393
  lyrics="""Verse 1:
@@ -402,36 +520,41 @@ This is where I belong
402
  """,
403
  vocal_language="en",
404
  bpm=72,
405
- audio_duration=45,
406
  )
407
 
408
- result = generate_music(dit_handler, llm_handler, config)
 
 
409
  ```
410
 
411
  ### Example 3: Style Cover with LM Reasoning
412
 
413
  ```python
414
- config = GenerationConfig(
415
  task_type="cover",
416
  src_audio="original_pop_song.mp3",
417
  caption="orchestral symphonic arrangement",
418
  audio_cover_strength=0.7,
419
- use_llm_thinking=True, # Enable LM for metadata
420
  use_cot_metas=True,
421
  )
422
 
423
- result = generate_music(dit_handler, llm_handler, config)
 
 
424
 
425
  # Access LM-generated metadata
426
- if result.lm_metadata:
427
- print(f"LM detected BPM: {result.lm_metadata.get('bpm')}")
428
- print(f"LM detected Key: {result.lm_metadata.get('keyscale')}")
 
429
  ```
430
 
431
  ### Example 4: Repaint Section of Audio
432
 
433
  ```python
434
- config = GenerationConfig(
435
  task_type="repaint",
436
  src_audio="generated_track.mp3",
437
  repainting_start=15.0, # Start at 15 seconds
@@ -440,66 +563,78 @@ config = GenerationConfig(
440
  inference_steps=32, # Higher quality for base model
441
  )
442
 
443
- result = generate_music(dit_handler, llm_handler, config)
 
 
444
  ```
445
 
446
- ### Example 5: Batch Generation with LM
447
 
448
  ```python
449
- config = GenerationConfig(
450
  task_type="text2music",
451
  caption="epic cinematic trailer music",
452
- batch_size=4, # Generate 4 variations
453
- use_llm_thinking=True,
454
- use_cot_metas=True,
455
- allow_lm_batch=True, # Faster batch processing
 
 
456
  lm_batch_chunk_size=2, # Process 2 at a time (GPU memory)
457
  )
458
 
459
- result = generate_music(dit_handler, llm_handler, config)
460
 
461
  if result.success:
462
- print(f"Generated {len(result.audio_paths)} variations")
 
 
463
  ```
464
 
465
  ### Example 6: High-Quality Generation (Base Model)
466
 
467
  ```python
468
- config = GenerationConfig(
469
  task_type="text2music",
470
  caption="intricate jazz fusion with complex harmonies",
471
- inference_steps=64, # High quality
472
  guidance_scale=8.0,
473
- use_adg=True, # Adaptive Dual Guidance
474
  cfg_interval_start=0.0,
475
  cfg_interval_end=1.0,
476
- audio_format="wav", # Lossless format
477
  use_random_seed=False,
478
- seed=42, # Reproducible results
479
  )
480
 
481
- result = generate_music(dit_handler, llm_handler, config)
482
  ```
483
 
484
  ### Example 7: Extract Vocals from Mix
485
 
486
  ```python
487
- config = GenerationConfig(
488
  task_type="extract",
489
  src_audio="full_song_mix.mp3",
490
  instruction="Extract the vocals track from the audio:",
491
  )
492
 
493
- result = generate_music(dit_handler, llm_handler, config)
 
 
494
 
495
  if result.success:
496
- print(f"Extracted vocals: {result.audio_paths[0]}")
497
  ```
498
 
499
  ### Example 8: Add Guitar Track (Lego)
500
 
501
  ```python
502
- config = GenerationConfig(
503
  task_type="lego",
504
  src_audio="drums_and_bass.mp3",
505
  instruction="Generate the guitar track based on the audio context:",
@@ -508,7 +643,25 @@ config = GenerationConfig(
508
  repainting_end=-1, # Full duration
509
  )
510
 
511
- result = generate_music(dit_handler, llm_handler, config)
512
  ```
513
 
514
  ---
@@ -550,34 +703,34 @@ caption="fast slow music" # Conflicting tempos
550
  - Use turbo model with `inference_steps=8`
551
  - Disable ADG (`use_adg=False`)
552
  - Lower `guidance_scale=5.0-7.0`
553
- - Use compressed format (`audio_format="mp3"`)
554
 
555
  **For Consistency**:
556
- - Set `use_random_seed=False`
557
- - Use fixed `seed` value
558
  - Keep `lm_temperature` lower (0.7-0.85)
559
 
560
  **For Diversity**:
561
- - Set `use_random_seed=True`
562
  - Increase `lm_temperature` (0.9-1.1)
563
  - Use `batch_size > 1` for variations
564
 
565
  ### 3. Duration Guidelines
566
 
567
  - **Instrumental**: 30-180 seconds works well
568
- - **With Lyrics**: Auto-detection recommended (set `audio_duration=None`)
569
  - **Short clips**: 10-20 seconds minimum
570
  - **Long form**: Up to 600 seconds (10 minutes) maximum
571
 
572
  ### 4. LM Usage
573
 
574
- **When to Enable LM (`use_llm_thinking=True`)**:
575
  - Need automatic metadata detection
576
  - Want caption refinement
577
  - Generating from minimal input
578
  - Need diverse outputs
579
 
580
- **When to Disable LM**:
581
  - Have precise metadata already
582
  - Need faster generation
583
  - Want full control over parameters
@@ -587,9 +740,8 @@ caption="fast slow music" # Conflicting tempos
587
  ```python
588
  # Efficient batch generation
589
  config = GenerationConfig(
590
- batch_size=8, # Max supported
591
- use_llm_thinking=True,
592
- allow_lm_batch=True, # Enable for speed
593
  lm_batch_chunk_size=4, # Adjust based on GPU memory
594
  )
595
  ```
@@ -597,16 +749,18 @@ config = GenerationConfig(
597
  ### 6. Error Handling
598
 
599
  ```python
600
- result = generate_music(dit_handler, llm_handler, config)
601
 
602
  if not result.success:
603
  print(f"Generation failed: {result.error}")
604
- # Check logs for details
605
  else:
606
  # Process successful result
607
- for path in result.audio_paths:
608
  # ... process audio files
609
- pass
610
  ```
611
 
612
  ### 7. Memory Management
@@ -617,6 +771,19 @@ For large batch sizes or long durations:
617
  - Reduce `lm_batch_chunk_size` for LM operations
618
  - Consider using `offload_to_cpu=True` during initialization
619
 
620
  ---
621
 
622
  ## Troubleshooting
@@ -630,62 +797,77 @@ For large batch sizes or long durations:
630
  - **Solution**: Increase `inference_steps`, adjust `guidance_scale`, use base model
631
 
632
  **Issue**: Results don't match prompt
633
- - **Solution**: Make caption more specific, increase `guidance_scale`, enable LM refinement
634
 
635
  **Issue**: Slow generation
636
  - **Solution**: Use turbo model, reduce `inference_steps`, disable ADG
637
 
638
  **Issue**: LM not generating codes
639
- - **Solution**: Verify `llm_handler` is initialized, check `use_llm_thinking=True` and `use_cot_metas=True`
640
 
641
  ---
642
 
643
  ## API Reference Summary
644
 
645
  ### GenerationConfig Fields
646
 
647
- See [Configuration Parameters](#configuration-parameters) for complete documentation.
648
 
649
  ### GenerationResult Fields
650
 
651
  ```python
652
  @dataclass
653
  class GenerationResult:
654
- # Audio outputs
655
- audio_paths: List[str] # List of generated audio file paths
656
- first_audio: Optional[str] # First audio (backward compatibility)
657
- second_audio: Optional[str] # Second audio (backward compatibility)
658
-
659
- # Generation metadata
660
- generation_info: str # Markdown-formatted generation info
661
- status_message: str # Status message
662
- seed_value: str # Seed value used
663
-
664
- # LM outputs
665
- lm_metadata: Optional[Dict[str, Any]] # LM-generated metadata
666
 
667
- # Alignment scores (if available)
668
- align_score_1: Optional[float]
669
- align_text_1: Optional[str]
670
- align_plot_1: Optional[Any]
671
- align_score_2: Optional[float]
672
- align_text_2: Optional[str]
673
- align_plot_2: Optional[Any]
 
674
 
675
- # Status
676
- success: bool # Whether generation succeeded
677
- error: Optional[str] # Error message if failed
678
  ```
679
 
680
  ---
681
 
682
  ## Version History
683
 
684
- - **v1.5**: Current version with refactored inference API
685
  - Introduced `GenerationConfig` and `GenerationResult` dataclasses
686
  - Simplified parameter passing
687
  - Added comprehensive documentation
688
- - Maintained backward compatibility with Gradio UI
689
 
690
  ---
691
 
 
6
 
7
  - [Quick Start](#quick-start)
8
  - [API Overview](#api-overview)
9
+ - [GenerationParams Parameters](#generationparams-parameters)
10
+ - [GenerationConfig Parameters](#generationconfig-parameters)
11
  - [Task Types](#task-types)
12
  - [Complete Examples](#complete-examples)
13
  - [Best Practices](#best-practices)
 
21
  ```python
22
  from acestep.handler import AceStepHandler
23
  from acestep.llm_inference import LLMHandler
24
+ from acestep.inference import GenerationParams, GenerationConfig, generate_music
25
 
26
  # Initialize handlers
27
  dit_handler = AceStepHandler()
 
41
  device="cuda"
42
  )
43
 
44
+ # Configure generation parameters
45
+ params = GenerationParams(
46
  caption="upbeat electronic dance music with heavy bass",
47
  bpm=128,
48
+ duration=30,
49
+ )
50
+
51
+ # Configure generation settings
52
+ config = GenerationConfig(
53
+ batch_size=2,
54
+ audio_format="flac",
55
  )
56
 
57
  # Generate music
58
+ result = generate_music(dit_handler, llm_handler, params, config, save_dir="/path/to/output")
59
 
60
  # Access results
61
  if result.success:
62
+ for audio in result.audios:
63
+ print(f"Generated: {audio['path']}")
64
+ print(f"Key: {audio['key']}")
65
+ print(f"Seed: {audio['params']['seed']}")
66
  else:
67
  print(f"Error: {result.error}")
68
  ```
 
75
 
76
  ```python
77
  def generate_music(
78
+ dit_handler,
79
+ llm_handler,
80
+ params: GenerationParams,
81
  config: GenerationConfig,
82
+ save_dir: Optional[str] = None,
83
+ progress=None,
84
  ) -> GenerationResult
85
  ```
86
 
87
+ ### Configuration Objects
88
+
89
+ The API uses two configuration dataclasses:
90
 
91
+ **GenerationParams** - Contains all music generation parameters:
92
 
93
  ```python
94
  @dataclass
95
+ class GenerationParams:
96
+ # Task & Instruction
97
+ task_type: str = "text2music"
98
+ instruction: str = "Fill the audio semantic mask based on the given conditions:"
99
+
100
+ # Audio Uploads
101
+ reference_audio: Optional[str] = None
102
+ src_audio: Optional[str] = None
103
+
104
+ # LM Codes Hints
105
+ audio_codes: str = ""
106
+
107
+ # Text Inputs
108
  caption: str = ""
109
  lyrics: str = ""
110
+ instrumental: bool = False
111
+
112
+ # Metadata
113
+ vocal_language: str = "unknown"
114
+ bpm: Optional[int] = None
115
+ keyscale: str = ""
116
+ timesignature: str = ""
117
+ duration: float = -1.0
118
+
119
+ # Advanced Settings
120
+ inference_steps: int = 8
121
+ seed: int = -1
122
+ guidance_scale: float = 7.0
123
+ use_adg: bool = False
124
+ cfg_interval_start: float = 0.0
125
+ cfg_interval_end: float = 1.0
126
+
127
+ repainting_start: float = 0.0
128
+ repainting_end: float = -1
129
+ audio_cover_strength: float = 1.0
130
+
131
+ # 5Hz Language Model Parameters
132
+ thinking: bool = True
133
+ lm_temperature: float = 0.85
134
+ lm_cfg_scale: float = 2.0
135
+ lm_top_k: int = 0
136
+ lm_top_p: float = 0.9
137
+ lm_negative_prompt: str = "NO USER INPUT"
138
+ use_cot_metas: bool = True
139
+ use_cot_caption: bool = True
140
+ use_cot_lyrics: bool = False
141
+ use_cot_language: bool = True
142
+ use_constrained_decoding: bool = True
143
+
144
+ # CoT Generated Values (auto-filled by LM)
145
+ cot_bpm: Optional[int] = None
146
+ cot_keyscale: str = ""
147
+ cot_timesignature: str = ""
148
+ cot_duration: Optional[float] = None
149
+ cot_vocal_language: str = "unknown"
150
+ cot_caption: str = ""
151
+ cot_lyrics: str = ""
152
+ ```
153
+
154
+ **GenerationConfig** - Contains batch and output configuration:
155
+
156
+ ```python
157
+ @dataclass
158
+ class GenerationConfig:
159
+ batch_size: int = 2
160
+ allow_lm_batch: bool = False
161
+ use_random_seed: bool = True
162
+ seeds: Optional[List[int]] = None
163
+ lm_batch_chunk_size: int = 8
164
+ constrained_decoding_debug: bool = False
165
+ audio_format: str = "flac"
166
  ```
167
 
168
  ### Result Object
 
170
  ```python
171
  @dataclass
172
  class GenerationResult:
173
+ # Audio Outputs
174
+ audios: List[Dict[str, Any]] # List of audio dictionaries
175
+
176
+ # Generation Information
177
+ status_message: str # Status message from generation
178
+ extra_outputs: Dict[str, Any] # Extra outputs (latents, masks, lm_metadata, time_costs)
179
+
180
+ # Success Status
181
+ success: bool # Whether generation succeeded
182
+ error: Optional[str] # Error message if failed
183
+ ```
184
+
185
+ **Audio Dictionary Structure:**
186
+
187
+ Each item in the `audios` list contains:
188
+
189
+ ```python
190
+ {
191
+ "path": str, # File path to saved audio
192
+ "tensor": Tensor, # Audio tensor [channels, samples], CPU, float32
193
+ "key": str, # Unique audio key (UUID based on params)
194
+ "sample_rate": int, # Sample rate (default: 48000)
195
+ "params": Dict, # Generation params for this audio (includes seed, audio_codes, etc.)
196
+ }
197
  ```
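
If you want to work with the returned waveform directly instead of the saved file, the fields above are enough. A minimal sketch, assuming `torchaudio` is installed and `result` comes from an earlier `generate_music` call; the output filename is illustrative:

```python
import torchaudio

audio = result.audios[0]
print(audio["key"], audio["path"], audio["sample_rate"])
print(audio["tensor"].shape)  # [channels, samples], CPU, float32

# Re-save the tensor under a custom name.
torchaudio.save("my_copy.wav", audio["tensor"], audio["sample_rate"])
```
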
198
 
199
  ---
200
 
201
+ ## GenerationParams Parameters
202
 
203
  ### Text Inputs
204
 
205
  | Parameter | Type | Default | Description |
206
  |-----------|------|---------|-------------|
207
+ | `caption` | `str` | `""` | Text description of the desired music. Can be a simple prompt like "relaxing piano music" or a detailed description with genre, mood, instruments, etc. Max 512 characters. |
208
+ | `lyrics` | `str` | `""` | Lyrics text for vocal music. Use `"[Instrumental]"` for instrumental tracks. Supports multiple languages. Max 4096 characters. |
209
+ | `instrumental` | `bool` | `False` | If True, generate instrumental music regardless of lyrics. |
210
 
211
  ### Music Metadata
212
 
213
  | Parameter | Type | Default | Description |
214
  |-----------|------|---------|-------------|
215
  | `bpm` | `Optional[int]` | `None` | Beats per minute (30-300). `None` enables auto-detection via LM. |
216
+ | `keyscale` | `str` | `""` | Musical key (e.g., "C Major", "Am", "F# minor"). Empty string enables auto-detection. |
217
+ | `timesignature` | `str` | `""` | Time signature (2 for '2/4', 3 for '3/4', 4 for '4/4', 6 for '6/8'). Empty string enables auto-detection. |
218
  | `vocal_language` | `str` | `"unknown"` | Language code for vocals (ISO 639-1). Supported: `"en"`, `"zh"`, `"ja"`, `"es"`, `"fr"`, etc. Use `"unknown"` for auto-detection. |
219
+ | `duration` | `float` | `-1.0` | Target audio length in seconds (10-600). If <= 0 or None, model chooses automatically based on lyrics length. |
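
The metadata fields can either be pinned explicitly or left at their defaults so the LM fills them in via CoT reasoning. A short sketch, assuming the default CoT settings (`thinking=True`, `use_cot_metas=True`); the caption is illustrative:

```python
# Fully pinned metadata: the model follows these values directly.
explicit = GenerationParams(
    caption="lo-fi hip hop with mellow keys",
    bpm=85,
    keyscale="F Major",
    duration=60,
)

# Auto-detection: leave the metadata fields at their defaults and let
# CoT reasoning fill them in.
auto = GenerationParams(caption="lo-fi hip hop with mellow keys")
```
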
220
 
221
  ### Generation Parameters
222
 
223
  | Parameter | Type | Default | Description |
224
  |-----------|------|---------|-------------|
225
  | `inference_steps` | `int` | `8` | Number of denoising steps. Turbo model: 1-8 (recommended 8). Base model: 1-100 (recommended 32-64). Higher = better quality but slower. |
226
+ | `guidance_scale` | `float` | `7.0` | Classifier-free guidance scale (1.0-15.0). Higher values increase adherence to text prompt. Only supported for non-turbo model. Typical range: 5.0-9.0. |
 
227
  | `seed` | `int` | `-1` | Random seed for reproducibility. Use `-1` for random seed, or any positive integer for fixed seed. |
 
228
 
229
  ### Advanced DiT Parameters
230
 
 
233
  | `use_adg` | `bool` | `False` | Use Adaptive Dual Guidance (base model only). Improves quality at the cost of speed. |
234
  | `cfg_interval_start` | `float` | `0.0` | CFG application start ratio (0.0-1.0). Controls when to start applying classifier-free guidance. |
235
  | `cfg_interval_end` | `float` | `1.0` | CFG application end ratio (0.0-1.0). Controls when to stop applying classifier-free guidance. |
 
236
 
237
  ### Task-Specific Parameters
238
 
239
  | Parameter | Type | Default | Description |
240
  |-----------|------|---------|-------------|
241
  | `task_type` | `str` | `"text2music"` | Generation task type. See [Task Types](#task-types) section for details. |
242
+ | `instruction` | `str` | `"Fill the audio semantic mask based on the given conditions:"` | Task-specific instruction prompt. |
243
  | `reference_audio` | `Optional[str]` | `None` | Path to reference audio file for style transfer or continuation tasks. |
244
  | `src_audio` | `Optional[str]` | `None` | Path to source audio file for audio-to-audio tasks (cover, repaint, etc.). |
245
+ | `audio_codes` | `str` | `""` | Pre-extracted 5Hz audio semantic codes as a string. Advanced use only. |
246
  | `repainting_start` | `float` | `0.0` | Repainting start time in seconds (for repaint/lego tasks). |
247
  | `repainting_end` | `float` | `-1` | Repainting end time in seconds. Use `-1` for end of audio. |
248
+ | `audio_cover_strength` | `float` | `1.0` | Strength of the audio cover/codes influence (0.0-1.0). Use a smaller value (e.g., 0.2) for style-transfer tasks. |
 
249
 
250
  ### 5Hz Language Model Parameters
251
 
252
  | Parameter | Type | Default | Description |
253
  |-----------|------|---------|-------------|
254
+ | `thinking` | `bool` | `True` | Enable 5Hz Language Model Chain-of-Thought reasoning to generate music metadata and semantic codes. |
255
  | `lm_temperature` | `float` | `0.85` | LM sampling temperature (0.0-2.0). Higher = more creative/diverse, lower = more conservative. |
256
+ | `lm_cfg_scale` | `float` | `2.0` | LM classifier-free guidance scale. Higher = stronger adherence to prompt. |
257
  | `lm_top_k` | `int` | `0` | LM top-k sampling. `0` disables top-k filtering. Typical values: 40-100. |
258
  | `lm_top_p` | `float` | `0.9` | LM nucleus sampling (0.0-1.0). `1.0` disables nucleus sampling. Typical values: 0.9-0.95. |
259
  | `lm_negative_prompt` | `str` | `"NO USER INPUT"` | Negative prompt for LM guidance. Helps avoid unwanted characteristics. |
260
  | `use_cot_metas` | `bool` | `True` | Generate metadata using LM CoT reasoning (BPM, key, duration, etc.). |
261
  | `use_cot_caption` | `bool` | `True` | Refine user caption using LM CoT reasoning. |
262
  | `use_cot_language` | `bool` | `True` | Detect vocal language using LM CoT reasoning. |
263
+ | `use_cot_lyrics` | `bool` | `False` | (Reserved for future use) Generate/refine lyrics using LM CoT. |
264
+ | `use_constrained_decoding` | `bool` | `True` | Enable constrained decoding for structured LM output. |
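
The sampling knobs above trade diversity against repeatability. A sketch of two settings, using only the documented field names; the caption is illustrative:

```python
# More diverse / "creative" LM reasoning.
diverse = GenerationParams(
    caption="dreamy synthwave with arpeggios",
    thinking=True,
    lm_temperature=1.0,   # higher = more diverse
    lm_top_p=0.95,
    lm_top_k=50,
)

# More conservative, repeatable LM reasoning.
conservative = GenerationParams(
    caption="dreamy synthwave with arpeggios",
    lm_temperature=0.7,
    lm_top_p=0.9,
)
```
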
265
+
266
+ ### CoT Generated Values
267
+
268
+ These fields are automatically populated by the LM when CoT reasoning is enabled:
269
+
270
+ | Parameter | Type | Default | Description |
271
+ |-----------|------|---------|-------------|
272
+ | `cot_bpm` | `Optional[int]` | `None` | LM-generated BPM value. |
273
+ | `cot_keyscale` | `str` | `""` | LM-generated key/scale. |
274
+ | `cot_timesignature` | `str` | `""` | LM-generated time signature. |
275
+ | `cot_duration` | `Optional[float]` | `None` | LM-generated duration. |
276
+ | `cot_vocal_language` | `str` | `"unknown"` | LM-detected vocal language. |
277
+ | `cot_caption` | `str` | `""` | LM-refined caption. |
278
+ | `cot_lyrics` | `str` | `""` | LM-generated/refined lyrics. |
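
A sketch of reading the CoT values back after generation. This assumes the library surfaces them either on the `params` object you passed in or inside each audio's `params` dict; `extra_outputs["lm_metadata"]` is another place to look:

```python
result = generate_music(dit_handler, llm_handler, params, config, save_dir="/output")

if result.success:
    per_audio = result.audios[0]["params"]
    print("CoT BPM:", per_audio.get("cot_bpm", params.cot_bpm))
    print("CoT key:", per_audio.get("cot_keyscale", params.cot_keyscale))
    print("CoT caption:", per_audio.get("cot_caption", params.cot_caption))
```
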
279
+
280
+ ---
281
 
282
+ ## GenerationConfig Parameters
283
 
284
  | Parameter | Type | Default | Description |
285
  |-----------|------|---------|-------------|
286
+ | `batch_size` | `int` | `2` | Number of samples to generate in parallel (1-8). Higher values require more GPU memory. |
287
+ | `allow_lm_batch` | `bool` | `False` | Allow batch processing in LM. Faster when `batch_size >= 2` and `thinking=True`. |
288
+ | `use_random_seed` | `bool` | `True` | Whether to use random seed. `True` for different results each time, `False` for reproducible results. |
289
+ | `seeds` | `Optional[List[int]]` | `None` | Seeds for batch generation; a single int is also accepted. If fewer seeds than `batch_size` are provided, the list is padded with random seeds (see the sketch after this table). |
290
+ | `lm_batch_chunk_size` | `int` | `8` | Maximum batch size per LM inference chunk (GPU memory constraint). |
291
+ | `constrained_decoding_debug` | `bool` | `False` | Enable debug logging for constrained decoding. |
292
+ | `audio_format` | `str` | `"flac"` | Output audio format. Options: `"mp3"`, `"wav"`, `"flac"`. Default is FLAC for fast saving. |
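
A sketch of a reproducible batch configuration using the fields above; per the table, `seeds` also accepts a single int and is padded with random seeds when shorter than `batch_size`:

```python
config = GenerationConfig(
    batch_size=4,
    use_random_seed=False,   # respect the seeds below
    seeds=[42, 123],         # padded with random seeds up to batch_size
    audio_format="wav",
)

# A single int is also accepted for `seeds`.
single_seed_config = GenerationConfig(batch_size=2, use_random_seed=False, seeds=7)
```
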
293
 
294
  ---
295
 
 
303
 
304
  **Key Parameters**:
305
  ```python
306
+ params = GenerationParams(
307
  task_type="text2music",
308
  caption="energetic rock music with electric guitar",
309
  lyrics="[Instrumental]", # or actual lyrics
310
  bpm=140,
311
+ duration=30,
312
  )
313
  ```
314
 
 
317
 
318
  **Optional but Recommended**:
319
  - `bpm`: Controls tempo
320
+ - `keyscale`: Controls musical key
321
+ - `timesignature`: Controls rhythm structure
322
+ - `duration`: Controls length
323
  - `vocal_language`: Controls vocal characteristics
324
 
325
  **Use Cases**:
 
335
 
336
  **Key Parameters**:
337
  ```python
338
+ params = GenerationParams(
339
  task_type="cover",
340
  src_audio="original_song.mp3",
341
  caption="jazz piano version",
 
367
 
368
  **Key Parameters**:
369
  ```python
370
+ params = GenerationParams(
371
  task_type="repaint",
372
  src_audio="original.mp3",
373
  repainting_start=10.0, # seconds
 
396
 
397
  **Key Parameters**:
398
  ```python
399
+ params = GenerationParams(
400
  task_type="lego",
401
  src_audio="backing_track.mp3",
402
  instruction="Generate the guitar track based on the audio context:",
 
428
 
429
  **Key Parameters**:
430
  ```python
431
+ params = GenerationParams(
432
  task_type="extract",
433
  src_audio="full_mix.mp3",
434
  instruction="Extract the vocals track from the audio:",
 
455
 
456
  **Key Parameters**:
457
  ```python
458
+ params = GenerationParams(
459
  task_type="complete",
460
  src_audio="incomplete_track.mp3",
461
  instruction="Complete the input track with drums, bass, guitar:",
 
480
  ### Example 1: Simple Text-to-Music Generation
481
 
482
  ```python
483
+ from acestep.inference import GenerationParams, GenerationConfig, generate_music
484
 
485
+ params = GenerationParams(
486
  task_type="text2music",
487
  caption="calm ambient music with soft piano and strings",
488
+ duration=60,
489
  bpm=80,
490
+ keyscale="C Major",
491
+ )
492
+
493
+ config = GenerationConfig(
494
  batch_size=2, # Generate 2 variations
495
+ audio_format="flac",
496
  )
497
 
498
+ result = generate_music(dit_handler, llm_handler, params, config, save_dir="/output")
499
 
500
  if result.success:
501
+ for i, audio in enumerate(result.audios, 1):
502
+ print(f"Variation {i}: {audio['path']}")
503
  ```
504
 
505
  ### Example 2: Song Generation with Lyrics
506
 
507
  ```python
508
+ params = GenerationParams(
509
  task_type="text2music",
510
  caption="pop ballad with emotional vocals",
511
  lyrics="""Verse 1:
 
520
  """,
521
  vocal_language="en",
522
  bpm=72,
523
+ duration=45,
524
  )
525
 
526
+ config = GenerationConfig(batch_size=1)
527
+
528
+ result = generate_music(dit_handler, llm_handler, params, config, save_dir="/output")
529
  ```
530
 
531
  ### Example 3: Style Cover with LM Reasoning
532
 
533
  ```python
534
+ params = GenerationParams(
535
  task_type="cover",
536
  src_audio="original_pop_song.mp3",
537
  caption="orchestral symphonic arrangement",
538
  audio_cover_strength=0.7,
539
+ thinking=True, # Enable LM for metadata
540
  use_cot_metas=True,
541
  )
542
 
543
+ config = GenerationConfig(batch_size=1)
544
+
545
+ result = generate_music(dit_handler, llm_handler, params, config, save_dir="/output")
546
 
547
  # Access LM-generated metadata
548
+ if result.extra_outputs.get("lm_metadata"):
549
+ lm_meta = result.extra_outputs["lm_metadata"]
550
+ print(f"LM detected BPM: {lm_meta.get('bpm')}")
551
+ print(f"LM detected Key: {lm_meta.get('keyscale')}")
552
  ```
553
 
554
  ### Example 4: Repaint Section of Audio
555
 
556
  ```python
557
+ params = GenerationParams(
558
  task_type="repaint",
559
  src_audio="generated_track.mp3",
560
  repainting_start=15.0, # Start at 15 seconds
 
563
  inference_steps=32, # Higher quality for base model
564
  )
565
 
566
+ config = GenerationConfig(batch_size=1)
567
+
568
+ result = generate_music(dit_handler, llm_handler, params, config, save_dir="/output")
569
  ```
570
 
571
+ ### Example 5: Batch Generation with Specific Seeds
572
 
573
  ```python
574
+ params = GenerationParams(
575
  task_type="text2music",
576
  caption="epic cinematic trailer music",
577
+ )
578
+
579
+ config = GenerationConfig(
580
+ batch_size=4, # Generate 4 variations
581
+ seeds=[42, 123, 456], # Specify 3 seeds, 4th will be random
582
+ use_random_seed=False, # Use provided seeds
583
  lm_batch_chunk_size=2, # Process 2 at a time (GPU memory)
584
  )
585
 
586
+ result = generate_music(dit_handler, llm_handler, params, config, save_dir="/output")
587
 
588
  if result.success:
589
+ print(f"Generated {len(result.audios)} variations")
590
+ for audio in result.audios:
591
+ print(f" Seed {audio['params']['seed']}: {audio['path']}")
592
  ```
593
 
594
  ### Example 6: High-Quality Generation (Base Model)
595
 
596
  ```python
597
+ params = GenerationParams(
598
  task_type="text2music",
599
  caption="intricate jazz fusion with complex harmonies",
600
+ inference_steps=64, # High quality
601
  guidance_scale=8.0,
602
+ use_adg=True, # Adaptive Dual Guidance
603
  cfg_interval_start=0.0,
604
  cfg_interval_end=1.0,
605
+ seed=42, # Reproducible results
606
+ )
607
+
608
+ config = GenerationConfig(
609
+ batch_size=1,
610
  use_random_seed=False,
611
+ audio_format="wav", # Lossless format
612
  )
613
 
614
+ result = generate_music(dit_handler, llm_handler, params, config, save_dir="/output")
615
  ```
616
 
617
  ### Example 7: Extract Vocals from Mix
618
 
619
  ```python
620
+ params = GenerationParams(
621
  task_type="extract",
622
  src_audio="full_song_mix.mp3",
623
  instruction="Extract the vocals track from the audio:",
624
  )
625
 
626
+ config = GenerationConfig(batch_size=1)
627
+
628
+ result = generate_music(dit_handler, llm_handler, params, config, save_dir="/output")
629
 
630
  if result.success:
631
+ print(f"Extracted vocals: {result.audios[0]['path']}")
632
  ```
633
 
634
  ### Example 8: Add Guitar Track (Lego)
635
 
636
  ```python
637
+ params = GenerationParams(
638
  task_type="lego",
639
  src_audio="drums_and_bass.mp3",
640
  instruction="Generate the guitar track based on the audio context:",
 
643
  repainting_end=-1, # Full duration
644
  )
645
 
646
+ config = GenerationConfig(batch_size=1)
647
+
648
+ result = generate_music(dit_handler, llm_handler, params, config, save_dir="/output")
649
+ ```
650
+
651
+ ### Example 9: Instrumental Generation
652
+
653
+ ```python
654
+ params = GenerationParams(
655
+ task_type="text2music",
656
+ caption="upbeat electronic dance music",
657
+ instrumental=True, # Force instrumental output
658
+ duration=120,
659
+ bpm=128,
660
+ )
661
+
662
+ config = GenerationConfig(batch_size=2)
663
+
664
+ result = generate_music(dit_handler, llm_handler, params, config, save_dir="/output")
665
  ```
666
 
667
  ---
 
703
  - Use turbo model with `inference_steps=8`
704
  - Disable ADG (`use_adg=False`)
705
  - Lower `guidance_scale=5.0-7.0`
706
+ - Use a compressed format (`audio_format="mp3"`) or the default FLAC
707
 
708
  **For Consistency**:
709
+ - Set `use_random_seed=False` in config
710
+ - Provide a fixed `seeds` list in config or a single `seed` in params
711
  - Keep `lm_temperature` lower (0.7-0.85)
712
 
713
  **For Diversity**:
714
+ - Set `use_random_seed=True` in config
715
  - Increase `lm_temperature` (0.9-1.1)
716
  - Use `batch_size > 1` for variations
717
 
718
  ### 3. Duration Guidelines
719
 
720
  - **Instrumental**: 30-180 seconds works well
721
+ - **With Lyrics**: Auto-detection recommended (set `duration=-1` or leave it at the default; see the sketch below)
722
  - **Short clips**: 10-20 seconds minimum
723
  - **Long form**: Up to 600 seconds (10 minutes) maximum
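
A minimal sketch of the auto-duration case referenced above; the lyrics are illustrative and `duration` is simply left at its default:

```python
params = GenerationParams(
    caption="acoustic folk song with warm vocals",
    lyrics="Verse 1:\nWalking down the old dirt road\nCarrying a lighter load",
    vocal_language="en",
    # duration not set -> chosen automatically from the lyrics length
)
```
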
724
 
725
  ### 4. LM Usage
726
 
727
+ **When to Enable LM (`thinking=True`)**:
728
  - Need automatic metadata detection
729
  - Want caption refinement
730
  - Generating from minimal input
731
  - Need diverse outputs
732
 
733
+ **When to Disable LM (`thinking=False`)**:
734
  - Have precise metadata already
735
  - Need faster generation
736
  - Want full control over parameters
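
A sketch of a fully manual run with the LM disabled; all metadata is supplied explicitly, so no CoT reasoning is needed:

```python
params = GenerationParams(
    task_type="text2music",
    caption="minimal techno with driving kick",
    lyrics="[Instrumental]",
    thinking=False,
    use_cot_metas=False,
    use_cot_caption=False,
    use_cot_language=False,
    bpm=126,
    keyscale="A minor",
    duration=90,
)

config = GenerationConfig(batch_size=1)
result = generate_music(dit_handler, llm_handler, params, config, save_dir="/output")
```
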
 
740
  ```python
741
  # Efficient batch generation
742
  config = GenerationConfig(
743
+ batch_size=8, # Max supported
744
+ allow_lm_batch=True, # Enable for speed (when thinking=True)
 
745
  lm_batch_chunk_size=4, # Adjust based on GPU memory
746
  )
747
  ```
 
749
  ### 6. Error Handling
750
 
751
  ```python
752
+ result = generate_music(dit_handler, llm_handler, params, config, save_dir="/output")
753
 
754
  if not result.success:
755
  print(f"Generation failed: {result.error}")
756
+ print(f"Status: {result.status_message}")
757
  else:
758
  # Process successful result
759
+ for audio in result.audios:
760
+ path = audio['path']
761
+ key = audio['key']
762
+ seed = audio['params']['seed']
763
  # ... process audio files
 
764
  ```
765
 
766
  ### 7. Memory Management
 
771
  - Reduce `lm_batch_chunk_size` for LM operations
772
  - Consider using `offload_to_cpu=True` during initialization
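
A memory-conscious setup sketch. It assumes both handlers expose an `initialize()` that accepts `offload_to_cpu`, as referenced above; adjust the exact arguments to your installed version:

```python
dit_handler = AceStepHandler()
dit_handler.initialize(device="cuda", offload_to_cpu=True)

llm_handler = LLMHandler()
llm_handler.initialize(device="cuda", offload_to_cpu=True)

config = GenerationConfig(
    batch_size=2,            # smaller batches use less GPU memory
    lm_batch_chunk_size=2,   # smaller LM chunks for constrained GPUs
)
```
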
773
 
774
+ ### 8. Accessing Time Costs
775
+
776
+ ```python
777
+ result = generate_music(dit_handler, llm_handler, params, config, save_dir="/output")
778
+
779
+ if result.success:
780
+ time_costs = result.extra_outputs.get("time_costs", {})
781
+ print(f"LM Phase 1 Time: {time_costs.get('lm_phase1_time', 0):.2f}s")
782
+ print(f"LM Phase 2 Time: {time_costs.get('lm_phase2_time', 0):.2f}s")
783
+ print(f"DiT Total Time: {time_costs.get('dit_total_time_cost', 0):.2f}s")
784
+ print(f"Pipeline Total: {time_costs.get('pipeline_total_time', 0):.2f}s")
785
+ ```
786
+
787
  ---
788
 
789
  ## Troubleshooting
 
797
  - **Solution**: Increase `inference_steps`, adjust `guidance_scale`, use base model
798
 
799
  **Issue**: Results don't match prompt
800
+ - **Solution**: Make caption more specific, increase `guidance_scale`, enable LM refinement (`thinking=True`)
801
 
802
  **Issue**: Slow generation
803
  - **Solution**: Use turbo model, reduce `inference_steps`, disable ADG
804
 
805
  **Issue**: LM not generating codes
806
+ - **Solution**: Verify `llm_handler` is initialized, check `thinking=True` and `use_cot_metas=True`
807
+
808
+ **Issue**: Seeds not being respected
809
+ - **Solution**: Set `use_random_seed=False` in config and provide `seeds` list or `seed` in params
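
A minimal sketch of that fix:

```python
config = GenerationConfig(
    batch_size=2,
    use_random_seed=False,
    seeds=[42, 1234],
)
```
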
810
 
811
  ---
812
 
813
  ## API Reference Summary
814
 
815
+ ### GenerationParams Fields
816
+
817
+ See [GenerationParams Parameters](#generationparams-parameters) for complete documentation.
818
+
819
  ### GenerationConfig Fields
820
 
821
+ See [GenerationConfig Parameters](#generationconfig-parameters) for complete documentation.
822
 
823
  ### GenerationResult Fields
824
 
825
  ```python
826
  @dataclass
827
  class GenerationResult:
828
+ # Audio Outputs
829
+ audios: List[Dict[str, Any]]
830
+ # Each audio dict contains:
831
+ # - "path": str (file path)
832
+ # - "tensor": Tensor (audio data)
833
+ # - "key": str (unique identifier)
834
+ # - "sample_rate": int (48000)
835
+ # - "params": Dict (generation params with seed, audio_codes, etc.)
836
 
837
+ # Generation Information
838
+ status_message: str
839
+ extra_outputs: Dict[str, Any]
840
+ # extra_outputs contains:
841
+ # - "lm_metadata": Dict (LM-generated metadata)
842
+ # - "time_costs": Dict (timing information)
843
+ # - "latents": Tensor (intermediate latents, if available)
844
+ # - "masks": Tensor (attention masks, if available)
845
 
846
+ # Success Status
847
+ success: bool
848
+ error: Optional[str]
849
  ```
850
 
851
  ---
852
 
853
  ## Version History
854
 
855
+ - **v1.5.1**: Current version with refactored inference API
856
+ - Split `GenerationConfig` into `GenerationParams` and `GenerationConfig`
857
+ - Renamed parameters for consistency (`key_scale` → `keyscale`, `time_signature` → `timesignature`, `audio_duration` → `duration`, `use_llm_thinking` → `thinking`, `audio_code_string` → `audio_codes`)
858
+ - Added `instrumental` parameter
859
+ - Added `use_constrained_decoding` parameter
860
+ - Added CoT auto-filled fields (`cot_*`)
861
+ - Changed default `audio_format` to "flac"
862
+ - Changed default `batch_size` to 2
863
+ - Changed default `thinking` to True
864
+ - Simplified `GenerationResult` structure with unified `audios` list
865
+ - Added unified `time_costs` in `extra_outputs`
866
+
867
+ - **v1.5**: Previous version
868
  - Introduced `GenerationConfig` and `GenerationResult` dataclasses
869
  - Simplified parameter passing
870
  - Added comprehensive documentation
 
871
 
872
  ---
873