Update vllm info, add time series demo

#11
by jack-zxy - opened
Files changed (2)
  1. README.md +118 -4
  2. deployment_guide.md +2 -20
README.md CHANGED
@@ -61,7 +61,7 @@ temperature = 0.8
61
  ### Serving
62
 
63
  > [!IMPORTANT]
64
- > Running a trillion-parameter model using the native Hugging Face forward method is challenging. We strongly recommend using an LLM inference engine (such as LMDeploy, vLLM, or sglang) to host Intern-S1-Pro and accessing the model via API.
65
 
66
  Intern-S1-Pro can be deployed using any of the following LLM inference frameworks:
67
 
@@ -71,8 +71,6 @@ Intern-S1-Pro can be deployed using any of the following LLM inference framework
71
 
72
  Detailed deployment examples for these frameworks are available in the [Model Deployment Guide](./deployment_guide.md).
73
 
74
- > Deployment support for the time-series module is under optimization and will be released soon.
75
-
76
 
77
  ## Advanced Usage
78
 
@@ -249,7 +247,7 @@ text = tokenizer.apply_chat_template(
249
  )
250
  ```
251
 
252
- With serving Intern-S1-Pro models, you can dynamically control the thinking mode by adjusting the `enable_thinking` parameter in your requests.
253
 
254
  ```python
255
  from openai import OpenAI
@@ -288,6 +286,122 @@ response = client.chat.completions.create(
288
  print(json.dumps(response.model_dump(), indent=2, ensure_ascii=False))
289
  ```
290
 
291
  ## Citation
292
 
293
  If you find this work useful, feel free to give us a cite.
 
61
  ### Serving
62
 
63
  > [!IMPORTANT]
64
+ > Running a trillion-parameter model using the native Hugging Face forward method is challenging. We strongly recommend using an LLM inference engine (such as LMDeploy, vLLM, or SGLang) to host Intern-S1-Pro and accessing the model via API.
65
 
66
  Intern-S1-Pro can be deployed using any of the following LLM inference frameworks:
67
 
 
71
 
72
  Detailed deployment examples for these frameworks are available in the [Model Deployment Guide](./deployment_guide.md).
73
 
 
 
74
 
75
  ## Advanced Usage
76
 
 
247
  )
248
  ```
249
 
250
+ When serving Intern-S1-Pro models, you can dynamically control the thinking mode by adjusting the `enable_thinking` parameter in your requests.
251
 
252
  ```python
253
  from openai import OpenAI
 
286
  print(json.dumps(response.model_dump(), indent=2, ensure_ascii=False))
287
  ```
288
 
289
+ ### Time Series Demo
290
+
291
+ Time-series inference is currently supported only in LMDeploy. To get started, download and deploy Intern-S1-Pro with LMDeploy (>=v0.12.1) by following the [Model Deployment Guide](./deployment_guide.md).
292
+ Below is an example of detecting earthquake events in a time-series signal file. Additional data types and functionalities are also supported.
293
+
294
+ ```python
295
+ from openai import OpenAI
296
+ from lmdeploy.vl.time_series_utils import encode_time_series_base64
297
+
298
+ openai_api_key = "EMPTY"
299
+ openai_api_base = "http://0.0.0.0:8000/v1"
300
+ client = OpenAI(
301
+     api_key=openai_api_key,
302
+     base_url=openai_api_base,
303
+ )
304
+ model_name = client.models.list().data[0].id
305
+
306
+
307
+ def send_base64(file_path: str, sampling_rate: int = 100):
308
+ """base64-encoded time-series data."""
309
+
310
+ # encode_time_series_base64 accepts local file paths and http urls,
311
+ # encoding time-series data (.npy, .csv, .wav, .mp3, .flac, etc.) into base64 strings.
312
+ base64_ts = encode_time_series_base64(file_path)
313
+
314
+ messages = [
315
+ {
316
+ "role": "user",
317
+ "content": [
318
+ {
319
+ "type": "text",
320
+ "text": "Please determine whether an Earthquake event has occurred in the provided time-series data. If so, please specify the starting time point indices of the P-wave and S-wave in the event."
321
+ },
322
+ {
323
+ "type": "time_series_url",
324
+ "time_series_url": {
325
+ "url": f"data:time_series/npy;base64,{base64_ts}",
326
+ "sampling_rate": sampling_rate
327
+ },
328
+ },
329
+ ],
330
+ }
331
+ ]
332
+
333
+ return client.chat.completions.create(
334
+ model=model_name,
335
+ messages=messages,
336
+ temperature=0,
337
+ max_tokens=200,
338
+ )
339
+
340
+
341
+ def send_http_url(url: str, sampling_rate: int = 100):
342
+ """http(s) url pointing to the time-series data."""
343
+ messages = [
344
+ {
345
+ "role": "user",
346
+ "content": [
347
+ {
348
+ "type": "text",
349
+ "text": "Please determine whether an Earthquake event has occurred in the provided time-series data. If so, please specify the starting time point indices of the P-wave and S-wave in the event."
350
+ },
351
+ {
352
+ "type": "time_series_url",
353
+ "time_series_url": {
354
+ "url": url,
355
+ "sampling_rate": sampling_rate
356
+ },
357
+ },
358
+ ],
359
+ }
360
+ ]
361
+
362
+ return client.chat.completions.create(
363
+ model=model_name,
364
+ messages=messages,
365
+ temperature=0,
366
+ max_tokens=200,
367
+ )
368
+
369
+
370
+ def send_file_url(file_path: str, sampling_rate: int = 100):
371
+ """file url pointing to the time-series data."""
372
+ messages = [
373
+ {
374
+ "role": "user",
375
+ "content": [
376
+ {
377
+ "type": "text",
378
+ "text": "Please determine whether an Earthquake event has occurred in the provided time-series data. If so, please specify the starting time point indices of the P-wave and S-wave in the event."
379
+ },
380
+ {
381
+ "type": "time_series_url",
382
+ "time_series_url": {
383
+ "url": f"file://{file_path}",
384
+ "sampling_rate": sampling_rate
385
+ },
386
+ },
387
+ ],
388
+ }
389
+ ]
390
+
391
+ return client.chat.completions.create(
392
+ model=model_name,
393
+ messages=messages,
394
+ temperature=0,
395
+ max_tokens=200,
396
+ )
397
+
398
+ response = send_base64("./0092638_seism.npy")
399
+ # response = send_http_url("https://huggingface.co/internlm/Intern-S1-Pro/raw/main/0092638_seism.npy")
400
+ # response = send_file_url("./0092638_seism.npy")
401
+
402
+ print(response.choices[0].message)
403
+ ```
404
+
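
The demo assumes an OpenAI-compatible endpoint is already listening on port 8000. A minimal sketch of launching one with LMDeploy is shown below; the tensor-parallel degree is a placeholder, and the recommended flags for Intern-S1-Pro are in the [Model Deployment Guide](./deployment_guide.md).

```bash
# Sketch only: start LMDeploy's OpenAI-compatible server for the demo above.
# Adjust --tp to the number of available GPUs (see the deployment guide).
lmdeploy serve api_server internlm/Intern-S1-Pro \
    --tp 8 \
    --server-port 8000
```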
405
  ## Citation
406
 
407
  If you find this work useful, feel free to give us a cite.
deployment_guide.md CHANGED
@@ -9,7 +9,7 @@ The Intern-S1-Pro release is a 1T parameter model stored in FP8 format. Deployme
9
 
10
  ## LMDeploy
11
 
12
- Required version `lmdeploy>=0.12.0`
13
 
14
  - Tensor Parallelism
15
 
@@ -59,25 +59,7 @@ lmdeploy serve api_server \
59
 
60
  ## vLLM
61
 
62
- - Tensor Parallelism + Expert Parallelism
63
-
64
- ```bash
65
- # start ray on node 0 and node 1
66
-
67
- # node 0
68
- export VLLM_ENGINE_READY_TIMEOUT_S=10000
69
- vllm serve internlm/Intern-S1-Pro \
70
- --tensor-parallel-size 16 \
71
- --enable-expert-parallel \
72
- --distributed-executor-backend ray \
73
- --max-model-len 65536 \
74
- --trust-remote-code \
75
- --reasoning-parser deepseek_r1 \
76
- --enable-auto-tool-choice \
77
- --tool-call-parser hermes
78
- ```
79
-
80
- - Data Parallelism + Expert Parallelism
81
 
82
  ```bash
83
  # node 0
 
9
 
10
  ## LMDeploy
11
 
12
+ Required version `lmdeploy>=0.12.1`
13
 
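
The pinned version can be installed or upgraded with pip; a minimal sketch:

```bash
# Sketch only: install or upgrade LMDeploy to the required version.
pip install -U "lmdeploy>=0.12.1"
```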
14
  - Tensor Parallelism
15
 
 
59
 
60
  ## vLLM
61
 
62
+ You can deploy using the vLLM nightly Docker image `vllm/vllm-openai:nightly`. Refer to [using-docker](https://docs.vllm.ai/en/latest/deployment/docker/?h=docker) for details.
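
A sketch of the basic Docker invocation is shown below, assuming the host's GPUs and Hugging Face cache are mounted into the container; the serve flags are illustrative, and the multi-node launch that follows remains the recommended way to host a model of this size.

```bash
# Sketch only: the container entrypoint starts the OpenAI-compatible server,
# so serve flags are passed after the image name.
docker run --gpus all --ipc=host -p 8000:8000 \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    vllm/vllm-openai:nightly \
    --model internlm/Intern-S1-Pro \
    --trust-remote-code \
    --max-model-len 65536
```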
63
 
64
  ```bash
65
  # node 0