Following integrations with TorchAO, Transformers, and vLLM, AutoRound-quantized models are now officially compatible with SGLang, bringing faster and more flexible deployment to your LLM workflows.
We've also enhanced the RTN mode (--iters 0), significantly cutting quantization costs for low-resource users (see the sketch below).
Star our repo and stay tuned for more exciting updates!
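For quick reference, here is a minimal sketch of the RTN-style flow through the Python API; the model name, bit settings, and output directory are placeholders, and `iters=0` mirrors the `--iters 0` flag mentioned above.

```python
# Minimal sketch of RTN-mode quantization (iters=0) via the AutoRound Python API.
# The model name, bits/group_size, and output directory are placeholders.
from transformers import AutoModelForCausalLM, AutoTokenizer
from auto_round import AutoRound

model_name = "Qwen/Qwen2.5-7B-Instruct"  # placeholder model
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype="auto")
tokenizer = AutoTokenizer.from_pretrained(model_name)

# iters=0 skips the tuning iterations and falls back to the cheaper RTN-style path.
autoround = AutoRound(model, tokenizer, bits=4, group_size=128, iters=0)
autoround.quantize()
autoround.save_quantized("./qwen2.5-7b-int4-rtn", format="auto_round")
```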
AutoRound keeps evolving its LLM quantization algorithms! After enhancing W2A16 quantization, we now offer a fast algorithm that generates mixed bits/data-type schemes (~2 minutes for 8B models), great for MXFP4 and W2A16. Learn more: https://github.com/intel/auto-round/blob/main/docs/step_by_step.md#autoscheme
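A rough sketch of what driving the scheme search could look like from Python follows; the `AutoScheme` name and its `avg_bits`/`options` arguments are assumptions inferred from the linked guide, so treat the exact signature as illustrative and check the step-by-step doc for the real interface.

```python
# Illustrative only: AutoScheme and its arguments below are assumptions inferred
# from the step-by-step guide; consult the linked doc for the exact interface.
from auto_round import AutoRound, AutoScheme

model_name = "Qwen/Qwen3-8B"  # placeholder 8B model

# Search for a mixed bits/data-type recipe targeting ~2 average bits, choosing
# per layer between the listed options (e.g., MXFP4 and W2A16).
scheme = AutoScheme(avg_bits=2, options=("MXFP4", "W2A16"))
ar = AutoRound(model=model_name, scheme=scheme)
ar.quantize_and_save("./qwen3-8b-mixed")
```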
AutoRound v0.7 is out! This release includes enhanced algorithms for W2A16, NVFP4, and MXFP4, along with support for FP8 models as input. Check out the full details here: https://github.com/intel/auto-round/releases/tag/v0.7.0
We're excited to announce that AutoRound now supports:
- GGUF format export: seamless compatibility with popular inference engines.
- Custom bit settings: tailor quantization to your needs for optimal performance.
A usage sketch follows the model list below.
Check out these newly released models:
- Intel/Qwen3-235B-A22B-Instruct-2507-gguf-q4km-AutoRound
- Intel/Qwen3-235B-A22B-Instruct-2507-gguf-q2ks-mixed-AutoRound
- Intel/Kimi-K2-Instruct-gguf-q2ks-mixed-AutoRound
Stay tuned! An even more advanced algorithm for some configurations is coming soon.
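As referenced above, here is a minimal sketch combining per-layer bit overrides with GGUF export; the model name, the overridden layer, and the output path are placeholders, and the `layer_config` argument and `gguf:q4_k_m` format string follow the conventions in the AutoRound README.

```python
# Minimal sketch: custom per-layer bits plus GGUF export with AutoRound.
# Model name, the overridden layer, and the output path are placeholders.
from transformers import AutoModelForCausalLM, AutoTokenizer
from auto_round import AutoRound

model_name = "Qwen/Qwen3-8B"  # placeholder
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype="auto")
tokenizer = AutoTokenizer.from_pretrained(model_name)

# Custom bit settings: keep a sensitive layer at higher precision.
layer_config = {"lm_head": {"bits": 8}}

autoround = AutoRound(model, tokenizer, bits=4, group_size=128, layer_config=layer_config)
autoround.quantize()
# Export in GGUF format (q4_k_m shown here) for llama.cpp-style engines.
autoround.save_quantized("./qwen3-8b-gguf-q4km", format="gguf:q4_k_m")
```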
AutoRound (https://github.com/intel/auto-round) has been integrated into vLLM, allowing you to run AutoRound-formatted models directly in the upcoming release.
Besides, we strongly recommend using AutoRound to generate AWQ INT4 models, as AutoAWQ is no longer maintained and manually configuring new models is not trivial due to the need for custom layer mappings.
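For illustration, a minimal sketch of offline inference on an AutoRound-quantized checkpoint with vLLM; the model ID is a placeholder, and this assumes a vLLM release that includes the AutoRound integration.

```python
# Minimal sketch: offline inference on an AutoRound-formatted model with vLLM.
# The model ID is a placeholder; swap in any AutoRound-quantized checkpoint.
from vllm import LLM, SamplingParams

llm = LLM(model="OPEA/Qwen2.5-7B-Instruct-int4-sym-inc")  # placeholder AutoRound model
params = SamplingParams(temperature=0.7, max_tokens=128)
outputs = llm.generate(["Explain weight-only quantization in one sentence."], params)
print(outputs[0].outputs[0].text)
```

For the AWQ recommendation, AutoRound's export path can also emit AWQ-style INT4 checkpoints (the README lists an `auto_awq` export format), which vLLM's existing AWQ kernels can then serve.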
AutoRound (https://github.com/intel/auto-round) has been integrated into Transformers, allowing you to run AutoRound-formatted models directly in the upcoming release. Additionally, we are actively working on supporting the GGUF double-quant format (e.g., q4_k_s), so stay tuned!
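Below is a minimal sketch of loading an AutoRound-formatted checkpoint directly through Transformers, assuming a release that ships the integration; the model ID is a placeholder.

```python
# Minimal sketch: load an AutoRound-formatted model directly with Transformers.
# Assumes a Transformers release with the AutoRound integration and the auto-round
# package installed; the model ID below is a placeholder.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "OPEA/Qwen2.5-7B-Instruct-int4-sym-inc"  # placeholder AutoRound checkpoint
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto", torch_dtype="auto")
tokenizer = AutoTokenizer.from_pretrained(model_id)

inputs = tokenizer("Hello, AutoRound!", return_tensors="pt").to(model.device)
print(tokenizer.decode(model.generate(**inputs, max_new_tokens=32)[0], skip_special_tokens=True))
```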
Check out the DeepSeek-R1 INT2 model (OPEA/DeepSeek-R1-int2-mixed-sym-inc). This 200 GB DeepSeek-R1 model shows only about a 2% drop in MMLU, though inference is quite slow due to a kernel issue.