INT8 quantization for KVCache on DGX Spark/GB10

#6, opened by JDWarner

Per the model card:
"On NVIDIA DGX Spark, the Step 3.5 Flash achieves a generation speed of 20 tokens per second; by integrating the INT8 quantization technology for KVCache, it supports an extended context window of up to 256K tokens, thus delivering long text processing capabilities on par with cloud-based inference."

Unless I missed something (possible!), the model card does not seem to include instructions or breadcrumbs for this. The provided start-up command for this GGUF on the Spark seems to limit context to 16k. Could you please provide some guidance on how to use the INT8 KVCache with this Int4 GGUF on the DGX Spark? Thanks!

StepFun org

Sorry I’m a bit late to this discussion.

Would you mind trying the Spark usage example from this page and checking whether it works for you?
https://github.com/stepfun-ai/Step-3.5-Flash/blob/main/llama.cpp/docs/step3.5-flash.md 

Thanks for the link; this is great documentation. It seems the -ctk q8_0 and -ctv q8_0 options are essential for the INT8 KVCache.
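For anyone following along, here is a minimal launch sketch with those options. The model filename is a placeholder, the flash-attention flag syntax varies between llama.cpp versions, and the linked guide has the exact recommended command:

```sh
# Minimal sketch, not the official command from the guide.
#   -c 204800        ~200k context, as discussed below
#   -ctk/-ctv q8_0   INT8 quantization for the K and V caches
#   -fa on           flash attention, required for a quantized V cache
#                    (older llama.cpp builds take plain -fa with no argument)
#   -ngl 99          offload all layers to the GPU
./build/bin/llama-server -m Step-3.5-Flash-Int4.gguf \
  -c 204800 -ctk q8_0 -ctv q8_0 -fa on -ngl 99
```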

Per the guide, I have so far stayed at or a little below 200k context rather than pushing it further. I am not currently forcing MMQ with the build flag -DGGML_CUDA_FORCE_MMQ=ON, favoring performance over a bit of extra context window, and I am seeing 10-20% higher throughput at low-to-medium context lengths (up to 24 t/s in the 2-8k range).
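As a sketch of the two build variants in question (assuming a standard CMake build of the llama.cpp tree linked above; adjust paths for your environment):

```sh
# Baseline CUDA build (what I'm currently running, favoring throughput):
cmake -B build -DGGML_CUDA=ON
# Alternative: force the MMQ kernels to free a little more memory for
# context, at some cost in throughput:
# cmake -B build -DGGML_CUDA=ON -DGGML_CUDA_FORCE_MMQ=ON
cmake --build build --config Release -j
```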

The linked guide suggests the environment variable GGML_CUDA_ENABLE_UNIFIED_MEMORY=1, which I'm unsure is relevant for the DGX Spark. Does this have any function on the Spark, where memory is already inherently unified?

StepFun org

GGML_CUDA_ENABLE_UNIFIED_MEMORY=1 allows llama.cpp to swap to system RAM instead of crashing when GPU VRAM is exhausted. In my DGX Spark tests, a ~256k context would crash without this setting, but with GGML_CUDA_ENABLE_UNIFIED_MEMORY=1 it runs (though it may hurt performance).
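For example, the variable is set as a prefix on the launch command; this sketch reuses the placeholder model path and flags from earlier in the thread:

```sh
# Sketch: enable unified-memory oversubscription for a ~256k context.
GGML_CUDA_ENABLE_UNIFIED_MEMORY=1 ./build/bin/llama-server \
  -m Step-3.5-Flash-Int4.gguf -c 262144 -ctk q8_0 -ctv q8_0 -fa on -ngl 99
```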
