GLM-4.7-REAP-30 4-bit GPTQ Quantization

This is a 4-bit AutoRound GPTQ quantization of 0xSero/GLM-4.7-REAP-30, a version of the GLM-4.7 MoE model with 30% of its experts pruned via REAP (Router-weighted Expert Activation Pruning).

  • Quantized with AutoRound v0.9.4 (bits=4, group_size=128, sym=True) in auto_gptq format; a reproduction sketch follows after this list.
  • Model size: ~124 GB (3.8x compression from the unquantized pruned model).
  • Compatible with vLLM, Transformers + AutoGPTQ, ExLlamaV2, etc.
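
For reference, here is a minimal sketch of how a quantization run like this could be reproduced with the auto-round library. Only bits=4, group_size=128, sym=True, and the auto_gptq export format come from this card; everything else (default calibration data, output directory name) is an assumption.

```python
# Hypothetical reproduction sketch. Only bits=4, group_size=128, sym=True and
# the auto_gptq export format are stated on this card; the rest is assumed
# (default AutoRound calibration, illustrative output directory name).
from transformers import AutoModelForCausalLM, AutoTokenizer
from auto_round import AutoRound

model_id = "0xSero/GLM-4.7-REAP-30"  # the pruned, unquantized base model
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype="auto")
tokenizer = AutoTokenizer.from_pretrained(model_id)

# Settings listed above: 4-bit weights, group size 128, symmetric quantization.
autoround = AutoRound(model, tokenizer, bits=4, group_size=128, sym=True)
autoround.quantize()
autoround.save_quantized("GLM-4.7-REAP-30-W4A16", format="auto_gptq")
```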

Example vLLM serving command

(tested on 2× NVIDIA RTX PRO 6000 Blackwell GPUs with the settings below)

```bash
vllm serve Jon-Nielsen/GLM-4.7-REAP-30-W4A16 \
  --tensor-parallel-size 2 \
  --kv-cache-dtype fp8 \
  --gpu-memory-utilization 0.96 \
  --max-model-len 196608 \
  --max-num-seqs 8 \
  --max-num-batched-tokens 16384 \
  --trust-remote-code \
  --enable-prefix-caching \
  --enable-chunked-prefill \
  --enable-expert-parallel \
  --tool-call-parser glm47 \
  --reasoning-parser glm45 \
  --enable-auto-tool-choice \
  --disable-custom-all-reduce
```
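
Once the server is running, it exposes an OpenAI-compatible API (on port 8000 by default). A minimal client sketch; the prompt and max_tokens value are purely illustrative:

```python
# Minimal sketch of a client call against vLLM's OpenAI-compatible endpoint.
# Assumes the default vLLM port (8000); the prompt is purely illustrative.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="Jon-Nielsen/GLM-4.7-REAP-30-W4A16",
    messages=[{"role": "user", "content": "Explain MoE expert pruning in two sentences."}],
    max_tokens=256,
)
print(response.choices[0].message.content)
```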