---
language:
- en
library_name: transformers
tags:
- glm
- glm4
- MOE
- pruning
- compression
- reap
- cerebras
- code
- function-calling
- agentic
license: apache-2.0
pipeline_tag: text-generation
base_model:
- zai/glm-4.7
---

<p align="center">
<em>🌳 <strong>REAP</strong> 🌳 the Experts: Why Pruning Prevails for One-Shot MoE Compression</em><br>
<a href="https://arxiv.org/abs/2510.13999">📄 Paper</a> • <a href="https://github.com/CerebrasResearch/reap">💻 Code</a> • <a href="https://www.cerebras.ai/blog/reap">📝 Blog</a>
</p>

# GLM-4.7-REAP-30

## ✨ Highlights

**30% Expert-Pruned** GLM-4.7 optimized for **code generation**, **function calling**, and **agentic workflows**.

Created using **[REAP (Router-weighted Expert Activation Pruning)](https://arxiv.org/abs/2510.13999)** by Cerebras:

- **358B → 251B**: 30% of MoE experts pruned (112 of 160 experts per layer retained)
- **Calibrated for Code & Tools**: Preserves coding and function-calling capabilities
- **One-Shot Compression**: No fine-tuning required
- **Drop-in Compatible**: Works with vLLM, Transformers, and SGLang

### 🙏 Acknowledgments

- **[Prime Intellect](https://www.primeintellect.ai/)** – compute sponsorship (8x H200 cluster)
- **[Cerebras](https://www.cerebras.net/)** – [REAP methodology](https://arxiv.org/abs/2510.13999)

---

## 📊 Model Specifications

| Property | Value |
|----------|-------|
| **Base Model** | [zai/glm-4.7](https://huggingface.co/zai/glm-4.7) |
| **Architecture** | Sparse Mixture-of-Experts (SMoE) |
| **Original Parameters** | 358B |
| **Pruned Parameters** | 251B |
| **Compression** | 30% of experts removed |
| **Experts per Layer** | 112 (was 160) |
| **MoE Layers** | 92 |
| **Activated Experts** | 8 per token |
| **Precision** | BF16 |
| **Disk Size** | ~470GB |
| **VRAM Required** | ~470GB |
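
As a rough sanity check on the size figures above, here is a minimal sketch assuming 2 bytes per parameter for BF16 weights and ignoring KV cache and other runtime overhead:

```python
# BF16 stores each parameter in 2 bytes; actual serving VRAM also needs KV cache and activations.
params = 251e9
total_bytes = params * 2
print(f"{total_bytes / 1e9:.0f} GB")     # ~502 GB (decimal)
print(f"{total_bytes / 2**30:.0f} GiB")  # ~468 GiB, i.e. the ~470GB figure quoted above
```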

---

## 🔬 Calibration Dataset: Deep Dive

REAP's effectiveness depends critically on **calibration data that represents the target use case**. We specifically optimized for **code generation**, **function/tool calling**, and **agentic workflows**.

### Why These 3 Datasets?

| Dataset | Samples | Purpose | Why It Matters |
|---------|---------|---------|----------------|
| [evol-codealpaca-v1](https://huggingface.co/datasets/theblackcat102/evol-codealpaca-v1) | 700 | Code generation | **51% of mix** – code tasks activate specific expert pathways; pruning without code calibration destroys coding ability |
| [xlam-function-calling-60k](https://huggingface.co/datasets/Salesforce/xlam-function-calling-60k) | 330 | Function/tool calling | **24% of mix** – tool use requires structured JSON output; experts handling schema generation must be preserved |
| [SWE-smith-trajectories](https://huggingface.co/datasets/SWE-bench/SWE-smith-trajectories) | 330 | Agentic multi-turn | **24% of mix** – real SWE-bench trajectories with tool calls, file edits, and multi-step reasoning |

### The Science Behind Dataset Selection

```
REAP Algorithm:
1. Forward pass calibration samples through the model
2. Record which experts activate and their magnitudes
3. Compute saliency = router_weight × activation_norm
4. Prune the lowest-saliency experts

Key Insight: Experts are TASK-SPECIFIC
├── Some experts specialize in natural language
├── Some experts specialize in code syntax
├── Some experts specialize in JSON/structured output
└── Some experts specialize in multi-turn context

If calibration lacks code → code-specialized experts appear "unused" → they get pruned → the model loses coding ability
```
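
To make the saliency rule concrete, here is a minimal, illustrative sketch of the scoring step for a single MoE layer. It is not the Cerebras implementation (see their repository for that), and the tensor names, shapes, and toy numbers are assumptions made for the example:

```python
import torch

def reap_saliency(router_probs: torch.Tensor, expert_outputs: torch.Tensor) -> torch.Tensor:
    """REAP-style saliency for one MoE layer (illustrative, not the official code).

    Assumed shapes:
      router_probs:   [tokens, experts]          gate weight each token assigns to each expert
      expert_outputs: [tokens, experts, hidden]  expert outputs (zeros where an expert was not routed)
    """
    act_norm = expert_outputs.norm(dim=-1)        # [tokens, experts]
    return (router_probs * act_norm).mean(dim=0)  # one saliency score per expert

# Toy numbers: keep the 112 highest-saliency experts out of 160 (a 30% prune).
tokens, experts, hidden = 256, 160, 1024
gates = torch.rand(tokens, experts)                     # stand-in for recorded router weights
outputs = torch.randn(tokens, experts, hidden)          # stand-in for recorded expert outputs
keep = reap_saliency(gates, outputs).topk(112).indices  # indices of experts to retain
```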

### Cerebras' Original Mix (from the paper)

Cerebras used the same 3 datasets in their GLM-4.6 REAP experiments:
- evol-codealpaca-v1 for code generation
- xlam-function-calling-60k for tool calling
- SWE-smith-trajectories for agentic tasks

We followed this exact recipe for reproducibility.

### Combined Dataset

Our calibration mix: [0xSero/glm47-reap-calibration-v2](https://huggingface.co/datasets/0xSero/glm47-reap-calibration-v2)
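
If you want to rebuild or inspect a similar mix locally, a minimal sketch with the `datasets` library is below. The 700/330/330 counts follow the table above; the `train` split name is an assumption, and some of these datasets may be gated on the Hub, so adjust as needed:

```python
from datasets import load_dataset

# Approximate the 700 / 330 / 330 mix (~51% / 24% / 24%) described above.
mix_spec = {
    "theblackcat102/evol-codealpaca-v1": 700,     # code generation
    "Salesforce/xlam-function-calling-60k": 330,  # function/tool calling
    "SWE-bench/SWE-smith-trajectories": 330,      # agentic multi-turn trajectories
}

calibration_rows = []
for repo, n in mix_spec.items():
    ds = load_dataset(repo, split="train")        # assumes a "train" split exists
    calibration_rows.extend(ds.shuffle(seed=42).select(range(n)))

print(len(calibration_rows))  # 1360 calibration samples
```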

---

## 📦 Related Models

| Model | Params | Experts | Size | Format |
|-------|--------|---------|------|--------|
| [GLM-4.7-REAP-30](https://huggingface.co/0xSero/GLM-4.7-REAP-30) | 251B | 112 | ~470GB | BF16 |
| [GLM-4.7-REAP-35](https://huggingface.co/0xSero/GLM-4.7-REAP-35) | 233B | 104 | ~439GB | BF16 |
| [GLM-4.7-REAP-40](https://huggingface.co/0xSero/GLM-4.7-REAP-40) | 218B | 96 | ~407GB | BF16 |
| [GLM-4.7-REAP-45](https://huggingface.co/0xSero/GLM-4.7-REAP-45) | 197B | 88 | ~370GB | BF16 |
| [GLM-4.7-REAP-50](https://huggingface.co/0xSero/GLM-4.7-REAP-50) | 179B | 80 | ~345GB | BF16 |
| [GLM-4.7-REAP-40-W4A16](https://huggingface.co/0xSero/GLM-4.7-REAP-40-W4A16) | 218B | 96 | ~108GB | GPTQ |
| [GLM-4.7-REAP-50-W4A16](https://huggingface.co/0xSero/GLM-4.7-REAP-50-W4A16) | 179B | 80 | ~92GB | GPTQ |

---

## 🚀 Deployment

### vLLM (Recommended)

```bash
vllm serve 0xSero/GLM-4.7-REAP-30 \
  --tensor-parallel-size 8 \
  --trust-remote-code \
  --dtype bfloat16
```
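
Once the server is running it exposes vLLM's OpenAI-compatible API (port 8000 by default), so any OpenAI-style client can talk to it; a minimal sketch:

```python
from openai import OpenAI

# Point an OpenAI-compatible client at the local vLLM server started above.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="0xSero/GLM-4.7-REAP-30",
    messages=[{"role": "user", "content": "Write a Python function to merge two sorted lists."}],
    max_tokens=512,
    temperature=0.7,
)
print(response.choices[0].message.content)
```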

### Transformers

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained(
    "0xSero/GLM-4.7-REAP-30",
    torch_dtype=torch.bfloat16,
    device_map="auto",
    trust_remote_code=True,
)
tokenizer = AutoTokenizer.from_pretrained("0xSero/GLM-4.7-REAP-30", trust_remote_code=True)

messages = [{"role": "user", "content": "Write a Python function to merge two sorted lists."}]
inputs = tokenizer.apply_chat_template(messages, return_tensors="pt", add_generation_prompt=True)
outputs = model.generate(inputs.to(model.device), max_new_tokens=512, temperature=0.7)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```
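
Since the calibration mix emphasizes function calling, you will likely want to exercise tool use as well. The sketch below passes a tool schema through `tokenizer.apply_chat_template(..., tools=...)`, which recent `transformers` versions support for models whose chat template renders tools; whether GLM-4.7's template does so, and the `get_weather` tool itself, are assumptions to verify:

```python
# Illustrative tool schema (hypothetical; not shipped with the model).
# Reuses `model` and `tokenizer` from the snippet above.
tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Get the current weather for a city.",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}]

messages = [{"role": "user", "content": "What's the weather in Tokyo right now?"}]
inputs = tokenizer.apply_chat_template(
    messages,
    tools=tools,                  # assumes the chat template accepts and renders tool schemas
    add_generation_prompt=True,
    return_tensors="pt",
)
outputs = model.generate(inputs.to(model.device), max_new_tokens=256)
print(tokenizer.decode(outputs[0][inputs.shape[-1]:], skip_special_tokens=True))
```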

---

## 🧩 Reproduction

### REAP Pruning Script

```python
#!/usr/bin/env python3
"""
REAP Pruning Script for MoE Models
Adapted from: https://github.com/CerebrasResearch/reap
"""

import subprocess
import sys


def run_reap(
    model_path: str,
    compression_ratio: float,
    dataset: str = "0xSero/glm47-reap-calibration-v2",
    samples: int = 1360,
    seed: int = 42,
    distance: str = "angular",
    reuse_observations: str | None = None,
):
    """
    Run REAP expert pruning.

    Args:
        model_path: Path to the base model
        compression_ratio: 0.30 = prune 30% of experts, keep 70%
        dataset: Calibration dataset (code + tools + agentic)
        samples: Number of calibration samples
        seed: Random seed for reproducibility
        distance: Distance metric for expert clustering
        reuse_observations: Path to pre-computed observations for instant pruning
    """
    cmd = [
        sys.executable, "src/reap/prune.py",
        "--model-name", model_path,
        "--dataset-name", dataset,
        "--compression-ratio", str(compression_ratio),
        "--prune-method", "reap",
        "--seed", str(seed),
        "--samples_per_category", str(samples),
        "--model_max_length", "2048",
        "--distance_measure", distance,
        "--record_pruning_metrics_only", "true",
    ]

    if reuse_observations:
        # Instant pruning: skip calibration, reuse precomputed expert scores
        cmd.extend(["--load_observations", reuse_observations])

    subprocess.run(cmd, check=True)


# Example: create a 40% pruned model
run_reap(
    model_path="/path/to/GLM-4.7",
    compression_ratio=0.40,  # Prune 40% of experts
)
```

### Observation Reuse (Instant Multi-Ratio Pruning)

REAP computes expert saliency scores during calibration. These scores are **compression-ratio independent**, enabling instant pruning at any ratio:

```bash
# First run: compute observations (~5 hours)
python prune.py --compression-ratio 0.40 --output_file_name observations.pt

# Subsequent runs: instant pruning (<5 minutes)
python prune.py --compression-ratio 0.30 --load_observations observations.pt
python prune.py --compression-ratio 0.50 --load_observations observations.pt
```
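
Combined with the `run_reap` helper above, sweeping several ratios from a single calibration pass might look like the sketch below; the `observations.pt` path refers to the file written by the first run:

```python
# Reuse one set of expert observations to prune at several ratios almost instantly.
for ratio in (0.30, 0.35, 0.40, 0.45, 0.50):
    run_reap(
        model_path="/path/to/GLM-4.7",
        compression_ratio=ratio,
        reuse_observations="observations.pt",  # skips the ~5 hour calibration pass
    )
```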

---

## ⚖️ License

Apache 2.0 (inherited from GLM-4)

---

## 🧾 Citation

```bibtex
@article{lasby2025reap,
  title={REAP the Experts: Why Pruning Prevails for One-Shot MoE Compression},
  author={Lasby, Mike and Lazarevich, Ivan and Sinnadurai, Nish and Lie, Sean and Ioannou, Yani and Thangarasa, Vithursan},
  journal={arXiv preprint arXiv:2510.13999},
  year={2025},
  url={https://arxiv.org/abs/2510.13999}
}
```