feat: update thinking mode
Files changed:
- API.md +55 -13
- acestep/api_server.py +80 -92
API.md
CHANGED
@@ -39,9 +39,30 @@ Suitable for passing only text parameters, or referencing audio file paths that
39   | :--- | :--- | :--- | :--- |
40   | `caption` | string | `""` | Music description prompt |
41   | `lyrics` | string | `""` | Lyrics content |
42   | `vocal_language` | string | `"en"` | Lyrics language (en, zh, ja, etc.) |
43   | `audio_format` | string | `"mp3"` | Output format (mp3, wav, flac) |
44
45   **Music Attribute Parameters**:
46
47   | Parameter Name | Type | Default | Description |
@@ -51,6 +72,12 @@ Suitable for passing only text parameters, or referencing audio file paths that
51   | `time_signature` | string | `""` | Time signature (e.g., "4/4") |
52   | `audio_duration` | float | null | Generation duration (seconds) |
53
54   **Generation Control Parameters**:
55
56   | Parameter Name | Type | Default | Description |
@@ -61,20 +88,19 @@ Suitable for passing only text parameters, or referencing audio file paths that
61   | `seed` | int | `-1` | Specify seed (when use_random_seed=false) |
62   | `batch_size` | int | null | Batch generation count |
63
64 - **5Hz LM Parameters (Optional, server-side
65
66 -
67
68   | Parameter Name | Type | Default | Description |
69   | :--- | :--- | :--- | :--- |
70 - | `use_5hz_lm` | bool | `false` | Enable server-side 5Hz LM code generation |
71   | `lm_model_path` | string | null | 5Hz LM checkpoint dir name (e.g. `acestep-5Hz-lm-0.6B`) |
72   | `lm_backend` | string | `"vllm"` | `vllm` or `pt` |
73 - | `lm_temperature` | float | `0.
74 - | `lm_cfg_scale` | float | `
75   | `lm_negative_prompt` | string | `"NO USER INPUT"` | Negative prompt used by CFG |
76   | `lm_top_k` | int | null | Top-k (0/null disables) |
77 - | `lm_top_p` | float |
78   | `lm_repetition_penalty` | float | `1.0` | Repetition penalty |
79
80   **Edit/Reference Audio Parameters** (requires absolute path on server):
@@ -124,7 +150,7 @@ curl -X POST http://localhost:8001/v1/music/generate \
124   }'
125   ```
126
127 - **JSON Method (
128
129   ```bash
130   curl -X POST http://localhost:8001/v1/music/generate \
@@ -132,22 +158,38 @@ curl -X POST http://localhost:8001/v1/music/generate \
132   -d '{
133   "caption": "upbeat pop song",
134   "lyrics": "Hello world",
135 - "
136 - "lm_temperature": 0.
137 - "lm_cfg_scale":
138 - "lm_top_k":
139 - "lm_top_p":
140   "lm_repetition_penalty": 1.0
141   }'
142   ```
143
144 -
145
146   - `bpm`
147   - `duration`
148   - `genres`
149   - `keyscale`
150   - `timesignature`
151
152   > Note: If you use `curl -d` but **forget** to add `-H 'Content-Type: application/json'`, curl will default to sending `application/x-www-form-urlencoded`, and older server versions will return 415.
153

39   | :--- | :--- | :--- | :--- |
40   | `caption` | string | `""` | Music description prompt |
41   | `lyrics` | string | `""` | Lyrics content |
42 + | `thinking` | bool | `false` | Whether to use 5Hz LM to generate audio codes (lm-dit behavior). |
43   | `vocal_language` | string | `"en"` | Lyrics language (en, zh, ja, etc.) |
44   | `audio_format` | string | `"mp3"` | Output format (mp3, wav, flac) |
45
46 + **thinking Semantics (Important)**:
47 +
48 + - `thinking=false`:
49 +   - The server will **NOT** use 5Hz LM to generate `audio_code_string`.
50 +   - DiT runs in **text2music** mode and **ignores** any provided `audio_code_string`.
51 + - `thinking=true`:
52 +   - The server will use 5Hz LM to generate `audio_code_string` (lm-dit behavior).
53 +   - DiT runs in **cover** mode and uses `audio_code_string`.
54 +
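The `thinking` rules above can be read as a small decision function. The sketch below is an illustration only, not the server's code: `resolve_mode` is a hypothetical name, and `DIT_INSTRUCTION` is a placeholder because the real default DiT instruction string is truncated in this diff. The LM instruction string is the one quoted in `acestep/api_server.py`.

```python
from typing import Tuple

# The LM instruction is quoted in acestep/api_server.py; the DiT one is
# truncated in this diff, so a placeholder stands in for it (assumption).
LM_INSTRUCTION = "Generate audio semantic tokens based on the given conditions:"
DIT_INSTRUCTION = "<default DiT instruction>"  # placeholder, not the real text


def resolve_mode(thinking: bool, task_type: str, instruction: str,
                 audio_codes: str) -> Tuple[str, str, str]:
    """Return (task_type, instruction, audio_codes) after applying `thinking`."""
    task_type = (task_type or "").strip() or "text2music"
    current = (instruction or "").strip()

    if not thinking:
        # Codes are dropped and a "cover" request is downgraded to text2music.
        if task_type == "cover":
            task_type = "text2music"
        if current in {"", LM_INSTRUCTION}:
            instruction = DIT_INSTRUCTION
        return task_type, instruction, ""

    # thinking=True: cover mode, and codes must already exist at this point.
    if not (audio_codes or "").strip():
        raise RuntimeError("thinking=true requires non-empty audio codes")
    if current in {"", DIT_INSTRUCTION}:
        instruction = LM_INSTRUCTION
    return "cover", instruction, audio_codes
```

A custom instruction is left untouched in both modes; only an empty value or the other mode's default is swapped.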
55 + **Metadata Auto-Completion (Always On)**:
56 +
57 + Regardless of `thinking`, if any of the following fields are missing, the server may call 5Hz LM to **fill only the missing fields** based on `caption`/`lyrics`:
58 +
59 + - `bpm`
60 + - `key_scale`
61 + - `time_signature`
62 + - `audio_duration`
63 +
64 + User-provided values always win; LM only fills the fields that are empty/missing.
65 +
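A minimal sketch of the "fill only what is missing" rule, under stated assumptions: `fill_missing` and `is_missing` are hypothetical helpers (the server's own helper is `_maybe_fill_from_metadata`, whose body is not shown in this diff), and the LM metadata is assumed to use the same key names as the request, which may not hold in the real code.

```python
from typing import Any, Dict

META_FIELDS = ("bpm", "key_scale", "time_signature", "audio_duration")


def is_missing(value: Any) -> bool:
    """A field counts as missing when it is None or a blank string."""
    return value is None or (isinstance(value, str) and not value.strip())


def fill_missing(user: Dict[str, Any], lm_meta: Dict[str, Any]) -> Dict[str, Any]:
    """Copy LM suggestions into fields the user left empty; never overwrite."""
    merged = dict(user)
    for name in META_FIELDS:
        if is_missing(merged.get(name)) and not is_missing(lm_meta.get(name)):
            merged[name] = lm_meta[name]
    return merged


# Hypothetical values: the user set bpm only, the LM suggested everything.
request = {"bpm": 72, "key_scale": "", "time_signature": "", "audio_duration": None}
suggested = {"bpm": 120, "key_scale": "C major", "time_signature": "4/4",
             "audio_duration": 180.0}
result = fill_missing(request, suggested)
```

Here `result["bpm"]` stays at the user's 72, while the three empty fields take the suggested values.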
66   **Music Attribute Parameters**:
67
68   | Parameter Name | Type | Default | Description |

72   | `time_signature` | string | `""` | Time signature (e.g., "4/4") |
73   | `audio_duration` | float | null | Generation duration (seconds) |
74
75 + **Audio Codes (Optional)**:
76 +
77 + | Parameter Name | Type | Default | Description |
78 + | :--- | :--- | :--- | :--- |
79 + | `audio_code_string` | string or string[] | `""` | Audio semantic tokens (5Hz) for `llm_dit`. If provided as an array, it should match `batch_size` (or the server batch size). |
80 +
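Because `audio_code_string` accepts either a single string or an array that should match `batch_size`, a client may want to check the shape before sending. The helper below is a hypothetical client-side sketch; whether the server rejects a mismatched array is not stated in this diff, so the `ValueError` is an assumption of this example. Repeating a single string across the batch mirrors what the server does with LM-generated codes.

```python
from typing import List, Union


def codes_for_batch(codes: Union[str, List[str]], batch_size: int) -> List[str]:
    """Return one code string per batch item.

    A single string is repeated for every item; a list must already have
    exactly `batch_size` entries.
    """
    batch_size = max(1, int(batch_size))
    if isinstance(codes, str):
        return [codes] * batch_size
    if len(codes) != batch_size:
        raise ValueError(f"expected {batch_size} code strings, got {len(codes)}")
    return list(codes)
```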
81   **Generation Control Parameters**:
82
83   | Parameter Name | Type | Default | Description |

88   | `seed` | int | `-1` | Specify seed (when use_random_seed=false) |
89   | `batch_size` | int | null | Batch generation count |
90
91 + **5Hz LM Parameters (Optional, server-side)**:
92
93 + These parameters control 5Hz LM sampling, used for metadata auto-completion and (when `thinking=true`) codes generation.
94
95   | Parameter Name | Type | Default | Description |
96   | :--- | :--- | :--- | :--- |
97   | `lm_model_path` | string | null | 5Hz LM checkpoint dir name (e.g. `acestep-5Hz-lm-0.6B`) |
98   | `lm_backend` | string | `"vllm"` | `vllm` or `pt` |
99 + | `lm_temperature` | float | `0.85` | Sampling temperature |
100 + | `lm_cfg_scale` | float | `2.0` | CFG scale (>1 enables CFG) |
101   | `lm_negative_prompt` | string | `"NO USER INPUT"` | Negative prompt used by CFG |
102   | `lm_top_k` | int | null | Top-k (0/null disables) |
103 + | `lm_top_p` | float | `0.9` | Top-p (>=1 will be treated as disabled) |
104   | `lm_repetition_penalty` | float | `1.0` | Repetition penalty |
105
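The "disabled" conventions in the table can be sketched as small normalizers. These are hypothetical stand-ins: the server calls `_normalize_optional_int` and `_normalize_optional_float`, whose bodies are not shown in this diff. Treating a non-positive `top_p` as disabled is an extra assumption made here.

```python
from typing import Any, Optional


def normalize_top_k(value: Any) -> Optional[int]:
    """0 or null disables top-k; any positive value is kept."""
    if value is None:
        return None
    k = int(value)
    return k if k > 0 else None


def normalize_top_p(value: Any) -> Optional[float]:
    """Values >= 1 disable top-p; non-positive values are also dropped (assumption)."""
    if value is None:
        return None
    p = float(value)
    return p if 0.0 < p < 1.0 else None


def cfg_enabled(cfg_scale: float) -> bool:
    """CFG is active only when the scale is strictly greater than 1."""
    return float(cfg_scale) > 1.0
```

With the documented defaults, `lm_top_k=null` stays disabled, `lm_top_p=0.9` is kept, and `lm_cfg_scale=2.0` enables CFG.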
106   **Edit/Reference Audio Parameters** (requires absolute path on server):

150   }'
151   ```
152
153 + **JSON Method (thinking=true: generate codes + fill missing metas)**:
154
155   ```bash
156   curl -X POST http://localhost:8001/v1/music/generate \

158   -d '{
159   "caption": "upbeat pop song",
160   "lyrics": "Hello world",
161 + "thinking": true,
162 + "lm_temperature": 0.85,
163 + "lm_cfg_scale": 2.0,
164 + "lm_top_k": null,
165 + "lm_top_p": 0.9,
166   "lm_repetition_penalty": 1.0
167   }'
168   ```
169
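For callers who prefer Python to curl, the same `thinking=true` request can be built with the standard library. This is a sketch under stated assumptions: `build_generate_request` is a hypothetical helper, and the URL is the `localhost:8001` endpoint used in the curl examples. The request is only constructed here, not sent.

```python
import json
import urllib.request

# Endpoint taken from the curl examples in API.md.
GENERATE_URL = "http://localhost:8001/v1/music/generate"


def build_generate_request(payload: dict,
                           url: str = GENERATE_URL) -> urllib.request.Request:
    """Build a JSON POST request; the explicit Content-Type avoids a form-encoded body."""
    body = json.dumps(payload).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


payload = {
    "caption": "upbeat pop song",
    "lyrics": "Hello world",
    "thinking": True,
    "lm_temperature": 0.85,
    "lm_cfg_scale": 2.0,
    "lm_top_k": None,  # serialized as JSON null
    "lm_top_p": 0.9,
    "lm_repetition_penalty": 1.0,
}
request = build_generate_request(payload)
# To send it against a running server: urllib.request.urlopen(request)
```

Setting the header explicitly sidesteps the 415 response that the note in API.md warns about for older server versions.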
170 + **JSON Method (thinking=false: do NOT generate codes, but fill missing metas)**:
171 +
172 + Example: user specifies `bpm` but omits `audio_duration`. The server may call LM to infer `duration` from `caption`/`lyrics` and use it only if the user did not set it.
173 +
174 + ```bash
175 + curl -X POST http://localhost:8001/v1/music/generate \
176 + -H 'Content-Type: application/json' \
177 + -d '{
178 + "caption": "slow emotional ballad",
179 + "lyrics": "...",
180 + "thinking": false,
181 + "bpm": 72
182 + }'
183 + ```
184 +
185 + When the server invokes the 5Hz LM (to fill metas and/or generate codes), the job `result` may include the following optional fields:
186
187   - `bpm`
188   - `duration`
189   - `genres`
190   - `keyscale`
191   - `timesignature`
192 + - `metas` (raw-ish metadata dict)
193
194   > Note: If you use `curl -d` but **forget** to add `-H 'Content-Type: application/json'`, curl will default to sending `application/x-www-form-urlencoded`, and older server versions will return 415.
195

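Since the LM-derived fields in the job `result` are optional, client code should not assume they are present. A minimal sketch follows; `lm_fields_from_result` is a hypothetical helper, and the rest of the `result` shape is assumed rather than taken from this diff.

```python
from typing import Any, Dict

# Field names as listed in API.md for the job result.
OPTIONAL_LM_FIELDS = ("bpm", "duration", "genres", "keyscale", "timesignature", "metas")


def lm_fields_from_result(result: Dict[str, Any]) -> Dict[str, Any]:
    """Collect whichever optional LM fields are present and not None."""
    return {name: result[name] for name in OPTIONAL_LM_FIELDS
            if result.get(name) is not None}
```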
acestep/api_server.py
CHANGED
@@ -44,8 +44,11 @@ class GenerateMusicRequest(BaseModel):
44       caption: str = Field(default="", description="Text caption describing the music")
45       lyrics: str = Field(default="", description="Lyric text")
46
47 -     #
48 -
49
50       bpm: Optional[int] = None
51       key_scale: str = ""
@@ -77,8 +80,7 @@ class GenerateMusicRequest(BaseModel):
77       audio_format: str = "mp3"
78       use_tiled_decode: bool = True
79
80 -     # 5Hz LM
81 -     use_5hz_lm: bool = False
82       lm_model_path: Optional[str] = None # e.g. "acestep-5Hz-lm-0.6B"
83       lm_backend: Literal["vllm", "pt"] = "vllm"
84
@@ -99,15 +101,6 @@ _DEFAULT_DIT_INSTRUCTION = "Fill the audio semantic mask based on the given cond
99   _DEFAULT_LM_INSTRUCTION = "Generate audio semantic tokens based on the given conditions:"
100
101
102 - def _normalize_infer_type(v: Any) -> Optional[str]:
103 -     s = str(v or "").strip().lower()
104 -     if not s:
105 -         return None
106 -     if s in {"dit", "llm_dit"}:
107 -         return s
108 -     return None
109 -
110 -
111   class CreateJobResponse(BaseModel):
112       job_id: str
113       status: JobStatus
@@ -123,7 +116,7 @@ class JobResult(BaseModel):
123       status_message: str = ""
124       seed_value: str = ""
125
126 -     # 5Hz LM metadata (present when
127       # Keep a raw-ish dict for clients that expect a `metas` object.
128       metas: Dict[str, Any] = Field(default_factory=dict)
129       bpm: Optional[int] = None
@@ -539,13 +532,7 @@ def create_app() -> FastAPI:
539   time_sig_val = req.time_signature
540   audio_duration_val = req.audio_duration
541
542 -
543 - # Default to llm_dit only when we actually have (or will generate) codes.
544 - explicit_infer = (req.infer_type or "").strip().lower() in {"dit", "llm_dit"}
545 - infer_type = (req.infer_type or "").strip().lower()
546 - if infer_type not in {"dit", "llm_dit"}:
547 - has_codes = bool(audio_code_string and str(audio_code_string).strip())
548 - infer_type = "llm_dit" if (req.use_5hz_lm or has_codes) else "dit"
549
550   # If LM-generated code hints are used, a too-strong cover strength can suppress lyric/vocal conditioning.
551   # We keep backward compatibility: only auto-adjust when user didn't override (still at default 1.0).
@@ -562,7 +549,16 @@ def create_app() -> FastAPI:
562   effective_batch_size = 1
563   effective_batch_size = max(1, int(effective_batch_size))
564
565 -
566   # Lazy init 5Hz LM once
567   with app.state._llm_init_lock:
568   if getattr(app.state, "_llm_initialized", False) is False and getattr(app.state, "_llm_init_error", None) is None:
@@ -590,87 +586,81 @@ def create_app() -> FastAPI:
590   app.state._llm_initialized = True
591
592   if getattr(app.state, "_llm_init_error", None):
593-619 - [27 removed lines; content not captured in this extract]
620 - #
621-648 - [28 removed lines; content not captured in this extract]
649 - #
650 - # - dit: metas only (ignore audio codes), keep text2music.
651 - # - llm_dit: metas + audio codes, run in cover mode with LM instruction.
652   instruction_val = req.instruction
653   task_type_val = (req.task_type or "").strip() or "text2music"
654
655 - if
656   audio_code_string = ""
657   if task_type_val == "cover":
658   task_type_val = "text2music"
659   if (instruction_val or "").strip() in {"", _DEFAULT_LM_INSTRUCTION}:
660   instruction_val = _DEFAULT_DIT_INSTRUCTION
661
662 - if
663   task_type_val = "cover"
664   if (instruction_val or "").strip() in {"", _DEFAULT_DIT_INSTRUCTION}:
665   instruction_val = _DEFAULT_LM_INSTRUCTION
666
667   if not (audio_code_string and str(audio_code_string).strip()):
668 -
669 -
670 - # If not explicitly requested, fall back to dit semantics.
671 - infer_type = "dit"
672 - task_type_val = "text2music"
673 - instruction_val = _DEFAULT_DIT_INSTRUCTION
674
675   first, second, paths, gen_info, status_msg, seed_value, *_ = h.generate_music(
676   captions=req.caption,
@@ -779,7 +769,7 @@ def create_app() -> FastAPI:
779   return GenerateMusicRequest(
780   caption=str(get("caption", "") or ""),
781   lyrics=str(get("lyrics", "") or ""),
782 -
783   bpm=_to_int(get("bpm"), None),
784   key_scale=str(get("key_scale", "") or ""),
785   time_signature=str(get("time_signature", "") or ""),
@@ -803,8 +793,6 @@ def create_app() -> FastAPI:
803   cfg_interval_end=_to_float(get("cfg_interval_end"), 1.0) or 1.0,
804   audio_format=str(get("audio_format", "mp3") or "mp3"),
805   use_tiled_decode=_to_bool(get("use_tiled_decode"), True),
806 -
807 - use_5hz_lm=_to_bool(get("use_5hz_lm"), False),
808   lm_model_path=str(get("lm_model_path") or "").strip() or None,
809   lm_backend=str(get("lm_backend", "vllm") or "vllm"),
810   lm_temperature=_to_float(get("lm_temperature"), _LM_DEFAULT_TEMPERATURE) or _LM_DEFAULT_TEMPERATURE,

44       caption: str = Field(default="", description="Text caption describing the music")
45       lyrics: str = Field(default="", description="Lyric text")
46
47 +     # New API semantics:
48 +     # - thinking=True: use 5Hz LM to generate audio codes (lm-dit behavior)
49 +     # - thinking=False: do not use LM to generate codes (dit behavior)
50 +     # Regardless of thinking, if some metas are missing, server may use LM to fill them.
51 +     thinking: bool = False
52
53       bpm: Optional[int] = None
54       key_scale: str = ""

80       audio_format: str = "mp3"
81       use_tiled_decode: bool = True
82
83 +     # 5Hz LM (server-side): used for metadata completion and (when thinking=True) codes generation.
84       lm_model_path: Optional[str] = None # e.g. "acestep-5Hz-lm-0.6B"
85       lm_backend: Literal["vllm", "pt"] = "vllm"
86

101   _DEFAULT_LM_INSTRUCTION = "Generate audio semantic tokens based on the given conditions:"
102
103
104   class CreateJobResponse(BaseModel):
105       job_id: str
106       status: JobStatus

116       status_message: str = ""
117       seed_value: str = ""
118
119 +     # 5Hz LM metadata (present when server invoked LM)
120       # Keep a raw-ish dict for clients that expect a `metas` object.
121       metas: Dict[str, Any] = Field(default_factory=dict)
122       bpm: Optional[int] = None

532   time_sig_val = req.time_signature
533   audio_duration_val = req.audio_duration
534
535 + thinking = bool(getattr(req, "thinking", False))
536
537   # If LM-generated code hints are used, a too-strong cover strength can suppress lyric/vocal conditioning.
538   # We keep backward compatibility: only auto-adjust when user didn't override (still at default 1.0).

549   effective_batch_size = 1
550   effective_batch_size = max(1, int(effective_batch_size))
551
552 + has_codes = bool(audio_code_string and str(audio_code_string).strip())
553 + need_lm_codes = bool(thinking) and (not has_codes)
554 + need_lm_metas = (
555 + (bpm_val is None)
556 + or (not (key_scale_val or "").strip())
557 + or (not (time_sig_val or "").strip())
558 + or (audio_duration_val is None)
559 + )
560 +
561 + if need_lm_metas or need_lm_codes:
562   # Lazy init 5Hz LM once
563   with app.state._llm_init_lock:
564   if getattr(app.state, "_llm_initialized", False) is False and getattr(app.state, "_llm_init_error", None) is None:

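The flags computed above, together with the init-error handling that follows in this file, amount to a small policy: run the LM only when something is needed, fail hard only when codes are required, and otherwise continue without it. The function below is a hypothetical condensation of that policy for illustration; `plan_lm_use` and its string return values are not names from the server.

```python
from typing import Optional


def plan_lm_use(thinking: bool, has_codes: bool, metas_missing: bool,
                lm_init_error: Optional[str]) -> str:
    """Return "skip", "metas", or "codes"; raise if codes are needed but the LM is unavailable."""
    need_codes = thinking and not has_codes
    if not (need_codes or metas_missing):
        return "skip"  # nothing for the LM to do
    if lm_init_error:
        if need_codes:
            raise RuntimeError(f"5Hz LM init failed: {lm_init_error}")
        return "skip"  # best effort: continue without LM-filled metadata
    return "codes" if need_codes else "metas"
```

In the diff, the "codes" case corresponds to calling the LM with `infer_type="llm_dit"` and the "metas" case to `infer_type="dit"`.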
586   app.state._llm_initialized = True
587
588   if getattr(app.state, "_llm_init_error", None):
589 + # If codes generation is required, fail hard.
590 + if need_lm_codes:
591 + raise RuntimeError(f"5Hz LM init failed: {app.state._llm_init_error}")
592 + # Otherwise, skip LM best-effort (fallback to default/meta-less behavior)
593 + else:
594 + lm_infer = "llm_dit" if need_lm_codes else "dit"
595 +
596 + def _lm_call() -> tuple[Dict[str, Any], str, str]:
597 + return llm.generate_with_stop_condition(
598 + caption=req.caption,
599 + lyrics=req.lyrics,
600 + infer_type=lm_infer,
601 + temperature=float(req.lm_temperature),
602 + cfg_scale=max(1.0, float(req.lm_cfg_scale)),
603 + negative_prompt=str(req.lm_negative_prompt or "NO USER INPUT"),
604 + top_k=_normalize_optional_int(req.lm_top_k),
605 + top_p=_normalize_optional_float(req.lm_top_p),
606 + repetition_penalty=float(req.lm_repetition_penalty),
607 + )
608 +
609 + meta, codes, status = _lm_call()
610 +
611 + if need_lm_codes:
612 + if not codes:
613 + raise RuntimeError(f"5Hz LM generation failed: {status}")
614 +
615 + # LM once per job; rely on DiT seeds for batch diversity.
616 + # For convenience, replicate the same codes across the batch.
617 + if effective_batch_size > 1:
618 + audio_code_string = [codes] * effective_batch_size
619 + else:
620 + audio_code_string = codes
621 +
622 + # Always expose LM metas when we invoked LM (even if user already set some fields).
623 + lm_fields = {
624 + "metas": _normalize_metas(meta),
625 + **_extract_lm_fields(meta),
626 + }
627 +
628 + # Fill only missing fields (user-provided values win)
629 + bpm_val, key_scale_val, time_sig_val, audio_duration_val = _maybe_fill_from_metadata(req, meta)
630 +
631 + # If user provided lyrics but LM didn't provide a usable duration, estimate a longer duration.
632 + if audio_duration_val is None and (req.audio_duration is None):
633 + est = _estimate_duration_from_lyrics(req.lyrics)
634 + if est is not None:
635 + audio_duration_val = est
636 +
637 + # Optional: auto-tune LM cover strength (opt-in) to avoid suppressing lyric/vocal conditioning.
638 + if thinking and audio_cover_strength_val >= 0.999 and (req.lyrics or "").strip():
639 + tuned = os.getenv("ACESTEP_LM_COVER_STRENGTH")
640 + if tuned is not None and tuned.strip() != "":
641 + audio_cover_strength_val = float(tuned)
642 +
643 + # Align behavior:
644 + # - thinking=False: metas only (ignore audio codes), keep text2music.
645 + # - thinking=True: metas + audio codes, run in cover mode with LM instruction.
646   instruction_val = req.instruction
647   task_type_val = (req.task_type or "").strip() or "text2music"
648
649 + if not thinking:
650   audio_code_string = ""
651   if task_type_val == "cover":
652   task_type_val = "text2music"
653   if (instruction_val or "").strip() in {"", _DEFAULT_LM_INSTRUCTION}:
654   instruction_val = _DEFAULT_DIT_INSTRUCTION
655
656 + if thinking:
657   task_type_val = "cover"
658   if (instruction_val or "").strip() in {"", _DEFAULT_DIT_INSTRUCTION}:
659   instruction_val = _DEFAULT_LM_INSTRUCTION
660
661   if not (audio_code_string and str(audio_code_string).strip()):
662 + # thinking=True requires codes generation.
663 + raise RuntimeError("thinking=true requires non-empty audio codes (LM generation failed).")
664
665   first, second, paths, gen_info, status_msg, seed_value, *_ = h.generate_music(
666   captions=req.caption,

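The opt-in cover-strength override in this hunk reads the `ACESTEP_LM_COVER_STRENGTH` environment variable, and only when `thinking` is on, lyrics are present, and the strength is still at its default. A standalone sketch of that condition is shown below; `tuned_cover_strength` is a hypothetical name, and the `env` parameter exists only so the example can be exercised without touching the real environment.

```python
import os
from typing import Mapping


def tuned_cover_strength(current: float, thinking: bool, lyrics: str,
                         env: Mapping[str, str] = os.environ) -> float:
    """Apply the opt-in override only for thinking requests with lyrics at default strength."""
    if not thinking or current < 0.999 or not (lyrics or "").strip():
        return current
    raw = env.get("ACESTEP_LM_COVER_STRENGTH")
    if raw is None or raw.strip() == "":
        return current  # variable unset or blank: keep the existing value
    return float(raw)
```

A value the user set explicitly below the default is never overridden, which matches the backward-compatibility comment in the source.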
769   return GenerateMusicRequest(
770   caption=str(get("caption", "") or ""),
771   lyrics=str(get("lyrics", "") or ""),
772 + thinking=_to_bool(get("thinking"), False),
773   bpm=_to_int(get("bpm"), None),
774   key_scale=str(get("key_scale", "") or ""),
775   time_signature=str(get("time_signature", "") or ""),

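When `thinking` arrives through a form-encoded body it is a string, not a JSON boolean, so some coercion is needed. The body of `_to_bool` is not part of this diff; the function below is a hypothetical stand-in showing one common way to do it, and the set of accepted spellings is an assumption.

```python
from typing import Any

# Accepted spellings are an assumption of this sketch.
_TRUE = {"1", "true", "yes", "on"}
_FALSE = {"0", "false", "no", "off"}


def to_bool(value: Any, default: bool) -> bool:
    """Coerce a form or JSON value to bool, falling back to `default` when unclear."""
    if value is None:
        return default
    if isinstance(value, bool):
        return value
    text = str(value).strip().lower()
    if text in _TRUE:
        return True
    if text in _FALSE:
        return False
    return default
```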
793   cfg_interval_end=_to_float(get("cfg_interval_end"), 1.0) or 1.0,
794   audio_format=str(get("audio_format", "mp3") or "mp3"),
795   use_tiled_decode=_to_bool(get("use_tiled_decode"), True),
796   lm_model_path=str(get("lm_model_path") or "").strip() or None,
797   lm_backend=str(get("lm_backend", "vllm") or "vllm"),
798   lm_temperature=_to_float(get("lm_temperature"), _LM_DEFAULT_TEMPERATURE) or _LM_DEFAULT_TEMPERATURE,