Update vLLM info, add time series demo
#11
by jack-zxy
- README.md +118 -4
- deployment_guide.md +2 -20
README.md
CHANGED
````diff
@@ -61,7 +61,7 @@ temperature = 0.8
 ### Serving
 
 > [!IMPORTANT]
-> Running a trillion-parameter model using the native Hugging Face forward method is challenging. We strongly recommend using an LLM inference engine (such as LMDeploy, vLLM, or
+> Running a trillion-parameter model using the native Hugging Face forward method is challenging. We strongly recommend using an LLM inference engine (such as LMDeploy, vLLM, or SGLang) to host Intern-S1-Pro and accessing the model via API.
 
 Intern-S1-Pro can be deployed using any of the following LLM inference frameworks:
 
@@ -71,8 +71,6 @@ Intern-S1-Pro can be deployed using any of the following LLM inference frameworks:
 
 Detailed deployment examples for these frameworks are available in the [Model Deployment Guide](./deployment_guide.md).
 
-> Deployment support for the time-series module is under optimization and will be released soon.
-
 
 ## Advanced Usage
 
@@ -249,7 +247,7 @@ text = tokenizer.apply_chat_template(
 )
 ```
 
-
+When serving Intern-S1-Pro models, you can dynamically control the thinking mode by adjusting the `enable_thinking` parameter in your requests.
 
 ```python
 from openai import OpenAI
````
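The added sentence in the hunk above names `enable_thinking`, but the surrounding snippet is cut off by the diff context. Below is a minimal sketch of how such a per-request toggle is commonly passed through an OpenAI-compatible server; routing it via `chat_template_kwargs` in `extra_body` is an assumption about the request schema, not something this PR confirms. The endpoint and model lookup follow the README's own examples.

```python
# Minimal sketch: per-request thinking-mode control over the OpenAI-compatible API.
# Assumption: the server accepts `enable_thinking` via `chat_template_kwargs` in
# `extra_body`; adjust to whatever schema your serving framework actually exposes.
from openai import OpenAI

client = OpenAI(api_key="EMPTY", base_url="http://0.0.0.0:8000/v1")
model_name = client.models.list().data[0].id

response = client.chat.completions.create(
    model=model_name,
    messages=[{"role": "user", "content": "Briefly explain P-waves and S-waves."}],
    extra_body={"chat_template_kwargs": {"enable_thinking": False}},  # assumed field name
)
print(response.choices[0].message.content)
```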
````diff
@@ -288,6 +286,122 @@ response = client.chat.completions.create(
 print(json.dumps(response.model_dump(), indent=2, ensure_ascii=False))
 ```
 
+### Time Series Demo
+
+Time series inference is currently supported only in LMDeploy. To get started, download and deploy Intern-S1-Pro with LMDeploy (>=v0.12.1) by following the [Model Deployment Guide](./deployment_guide.md).
+Below is an example of detecting earthquake events from a time-series signal file. Additional data types and functionalities are also supported.
+
+```python
+from openai import OpenAI
+from lmdeploy.vl.time_series_utils import encode_time_series_base64
+
+openai_api_key = "EMPTY"
+openai_api_base = "http://0.0.0.0:8000/v1"
+client = OpenAI(
+    api_key=openai_api_key,
+    base_url=openai_api_base,
+)
+model_name = client.models.list().data[0].id
+
+
+def send_base64(file_path: str, sampling_rate: int = 100):
+    """Send base64-encoded time-series data."""
+
+    # encode_time_series_base64 accepts local file paths and HTTP URLs,
+    # encoding time-series data (.npy, .csv, .wav, .mp3, .flac, etc.) into base64 strings.
+    base64_ts = encode_time_series_base64(file_path)
+
+    messages = [
+        {
+            "role": "user",
+            "content": [
+                {
+                    "type": "text",
+                    "text": "Please determine whether an earthquake event has occurred in the provided time-series data. If so, please specify the starting time point indices of the P-wave and S-wave in the event."
+                },
+                {
+                    "type": "time_series_url",
+                    "time_series_url": {
+                        "url": f"data:time_series/npy;base64,{base64_ts}",
+                        "sampling_rate": sampling_rate
+                    },
+                },
+            ],
+        }
+    ]
+
+    return client.chat.completions.create(
+        model=model_name,
+        messages=messages,
+        temperature=0,
+        max_tokens=200,
+    )
+
+
+def send_http_url(url: str, sampling_rate: int = 100):
+    """Send an HTTP(S) URL pointing to the time-series data."""
+    messages = [
+        {
+            "role": "user",
+            "content": [
+                {
+                    "type": "text",
+                    "text": "Please determine whether an earthquake event has occurred in the provided time-series data. If so, please specify the starting time point indices of the P-wave and S-wave in the event."
+                },
+                {
+                    "type": "time_series_url",
+                    "time_series_url": {
+                        "url": url,
+                        "sampling_rate": sampling_rate
+                    },
+                },
+            ],
+        }
+    ]
+
+    return client.chat.completions.create(
+        model=model_name,
+        messages=messages,
+        temperature=0,
+        max_tokens=200,
+    )
+
+
+def send_file_url(file_path: str, sampling_rate: int = 100):
+    """Send a file URL pointing to the time-series data."""
+    messages = [
+        {
+            "role": "user",
+            "content": [
+                {
+                    "type": "text",
+                    "text": "Please determine whether an earthquake event has occurred in the provided time-series data. If so, please specify the starting time point indices of the P-wave and S-wave in the event."
+                },
+                {
+                    "type": "time_series_url",
+                    "time_series_url": {
+                        "url": f"file://{file_path}",
+                        "sampling_rate": sampling_rate
+                    },
+                },
+            ],
+        }
+    ]
+
+    return client.chat.completions.create(
+        model=model_name,
+        messages=messages,
+        temperature=0,
+        max_tokens=200,
+    )
+
+
+response = send_base64("./0092638_seism.npy")
+# response = send_http_url("https://huggingface.co/internlm/Intern-S1-Pro/raw/main/0092638_seism.npy")
+# response = send_file_url("./0092638_seism.npy")
+
+print(response.choices[0].message)
+```
+
 ## Citation
 
 If you find this work useful, feel free to give us a cite.
````
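The demo above ships the signal as a `data:time_series/npy;base64,...` URL produced by `encode_time_series_base64`. For reference, here is a sketch of building an equivalent payload by hand; that the helper simply base64-encodes the raw file bytes is an assumption, so verify it against your LMDeploy version before relying on it.

```python
# Sketch: hand-building the time-series data URL used in the demo above.
# Assumption: encode_time_series_base64 base64-encodes the raw file bytes.
import base64

with open("./0092638_seism.npy", "rb") as f:
    base64_ts = base64.b64encode(f.read()).decode("utf-8")

ts_url = f"data:time_series/npy;base64,{base64_ts}"
print(ts_url[:60], "...")
```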
deployment_guide.md
CHANGED
````diff
@@ -9,7 +9,7 @@ The Intern-S1-Pro release is a 1T parameter model stored in FP8 format. Deployme
 
 ## LMDeploy
 
-Required version `lmdeploy>=0.12.
+Required version `lmdeploy>=0.12.1`
 
 - Tensor Parallelism
 
@@ -59,25 +59,7 @@ lmdeploy serve api_server \
 
 ## vLLM
 
--
-
-```bash
-# start ray on node 0 and node 1
-
-# node 0
-export VLLM_ENGINE_READY_TIMEOUT_S=10000
-vllm serve internlm/Intern-S1-Pro \
-    --tensor-parallel-size 16 \
-    --enable-expert-parallel \
-    --distributed-executor-backend ray \
-    --max-model-len 65536 \
-    --trust-remote-code \
-    --reasoning-parser deepseek_r1 \
-    --enable-auto-tool-choice \
-    --tool-call-parser hermes
-```
-
-- Data Parallelism + Expert Parallelism
+You can deploy using the vLLM nightly Docker image `vllm/vllm-openai:nightly`. Refer to [using-docker](https://docs.vllm.ai/en/latest/deployment/docker/?h=docker) for details.
 
 ```bash
 # node 0
````