---
language:
- en
library_name: transformers
tags:
- glm
- glm4
- MOE
- pruning
- compression
- reap
- cerebras
- code
- function-calling
- agentic
license: apache-2.0
pipeline_tag: text-generation
base_model:
- zai/glm-4.7
---

<p align="center">
<em>𓌳 <strong>REAP</strong>𓌳 the Experts: Why Pruning Prevails for One-Shot MoE Compression</em><br>
<a href="https://arxiv.org/abs/2510.13999">📄 Paper</a> • <a href="https://github.com/CerebrasResearch/reap">💻 Code</a> • <a href="https://www.cerebras.ai/blog/reap">📝 Blog</a>
</p>

# GLM-4.7-REAP-30

## ✨ Highlights

**30% Expert-Pruned** GLM-4.7 optimized for **code generation**, **function calling**, and **agentic workflows**.

Created using **[REAP (Router-weighted Expert Activation Pruning)](https://arxiv.org/abs/2510.13999)** by Cerebras:

- **358B → 251B**: 30% of MoE experts pruned (112/160 remaining)
- **Calibrated for Code & Tools**: Preserves coding and function-calling capabilities
- **One-Shot Compression**: No fine-tuning required
- **Drop-in Compatible**: Works with vLLM, Transformers, SGLang

### 🙏 Acknowledgments

- **[Prime Intellect](https://www.primeintellect.ai/)** – Compute sponsorship (8x H200 cluster)
- **[Cerebras](https://www.cerebras.net/)** – [REAP methodology](https://arxiv.org/abs/2510.13999)

---

## 📋 Model Specifications

| Property | Value |
|----------|-------|
| **Base Model** | [zai/glm-4.7](https://huggingface.co/zai/glm-4.7) |
| **Architecture** | Sparse Mixture-of-Experts (SMoE) |
| **Original Parameters** | 358B |
| **Pruned Parameters** | 251B |
| **Compression** | 30% experts removed |
| **Experts per Layer** | 112 (was 160) |
| **MoE Layers** | 92 |
| **Activated Experts** | 8 per token |
| **Precision** | BF16 |
| **Disk Size** | ~470GB |
| **VRAM Required** | ~470GB |

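As a quick sanity check on these numbers (assuming the 358B → 251B reduction comes entirely from the removed experts), the implied per-expert parameter count can be back-calculated from the table:

```python
# Back-of-envelope consistency check on the specification table.
# Assumption: the entire parameter reduction comes from removed experts.
original_params = 358e9
pruned_params = 251e9
moe_layers = 92
experts_removed_per_layer = 160 - 112  # 48 experts pruned in each MoE layer

removed_experts_total = experts_removed_per_layer * moe_layers        # 4,416 experts
params_per_expert = (original_params - pruned_params) / removed_experts_total

print(f"{removed_experts_total} experts removed, ~{params_per_expert / 1e6:.0f}M params each")
# -> 4416 experts removed, ~24M params each
```
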
---

## 🔬 Calibration Dataset: Deep Dive

REAP's effectiveness depends critically on **calibration data that represents the target use case**. We specifically optimized for **code generation**, **function/tool calling**, and **agentic workflows**.

### Why These 3 Datasets?

| Dataset | Samples | Purpose | Why It Matters |
|---------|---------|---------|----------------|
| [evol-codealpaca-v1](https://huggingface.co/datasets/theblackcat102/evol-codealpaca-v1) | 700 | Code generation | **51% of mix** – Code tasks activate specific expert pathways; pruning without code calibration destroys coding ability |
| [xlam-function-calling-60k](https://huggingface.co/datasets/Salesforce/xlam-function-calling-60k) | 330 | Function/tool calling | **24% of mix** – Tool use requires structured JSON output; experts handling schema generation must be preserved |
| [SWE-smith-trajectories](https://huggingface.co/datasets/SWE-bench/SWE-smith-trajectories) | 330 | Agentic multi-turn | **24% of mix** – Real SWE-bench trajectories with tool calls, file edits, and multi-step reasoning |
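
Assembling the 700/330/330 mix is straightforward with the `datasets` library. The sketch below is illustrative only: the `train` split names and the final step of mapping each source into a common chat/text format are assumptions, not the exact preprocessing used for this release.

```python
# Minimal sketch of building the calibration mix (illustrative; split names
# and the downstream formatting step are assumptions).
from datasets import load_dataset

MIX = [
    ("theblackcat102/evol-codealpaca-v1", 700),     # code generation
    ("Salesforce/xlam-function-calling-60k", 330),  # function/tool calling
    ("SWE-bench/SWE-smith-trajectories", 330),      # agentic multi-turn
]

subsets = []
for name, n_samples in MIX:
    ds = load_dataset(name, split="train")
    subsets.append(ds.shuffle(seed=42).select(range(n_samples)))

# Each source has a different schema, so in practice every subset is mapped to a
# shared text/chat format before concatenation; that step is omitted here.
```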

### The Science Behind Dataset Selection

```
REAP Algorithm:
1. Forward pass calibration samples through model
2. Record which experts activate and their magnitudes
3. Compute saliency = router_weight × activation_norm
4. Prune lowest-saliency experts

Key Insight: Experts are TASK-SPECIFIC
├── Some experts specialize in natural language
├── Some experts specialize in code syntax
├── Some experts specialize in JSON/structured output
└── Some experts specialize in multi-turn context

If calibration lacks code → code-specialized experts appear "unused" → get pruned → model loses coding ability
```
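
Concretely, the scoring step above amounts to averaging router-weighted expert output norms over the calibration tokens. The snippet below is a simplified illustration of that rule, not the actual REAP implementation; tensor names and shapes are hypothetical.

```python
import torch

def expert_saliency(gate_weights: torch.Tensor, expert_out_norms: torch.Tensor) -> torch.Tensor:
    """Simplified REAP-style saliency for one MoE layer.

    gate_weights:     [num_tokens, num_experts] router weights
                      (0 where an expert is not among the top-k for that token)
    expert_out_norms: [num_tokens, num_experts] L2 norm of each expert's output
                      (0 where the expert was not run)
    Returns one score per expert: mean over calibration tokens of
    router_weight * activation_norm.
    """
    return (gate_weights * expert_out_norms).mean(dim=0)  # [num_experts]

def experts_to_prune(saliency: torch.Tensor, compression_ratio: float = 0.30) -> torch.Tensor:
    """Indices of the lowest-saliency experts to drop at the given ratio."""
    num_pruned = int(saliency.numel() * compression_ratio)
    return torch.argsort(saliency)[:num_pruned]
```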

### Cerebras' Original Mix (from paper)

Cerebras used the same 3 datasets in their GLM-4.6 REAP experiments:
- evol-codealpaca-v1 for code generation
- xlam-function-calling-60k for tool calling
- SWE-smith-trajectories for agentic tasks

We followed this exact recipe for reproducibility.

### Combined Dataset

Our calibration mix: [0xSero/glm47-reap-calibration-v2](https://huggingface.co/datasets/0xSero/glm47-reap-calibration-v2)

---

## 📦 Related Models

| Model | Params | Experts | Size | Format |
|-------|--------|---------|------|--------|
| [GLM-4.7-REAP-30](https://huggingface.co/0xSero/GLM-4.7-REAP-30) | 251B | 112 | ~470GB | BF16 |
| [GLM-4.7-REAP-35](https://huggingface.co/0xSero/GLM-4.7-REAP-35) | 233B | 104 | ~439GB | BF16 |
| [GLM-4.7-REAP-40](https://huggingface.co/0xSero/GLM-4.7-REAP-40) | 218B | 96 | ~407GB | BF16 |
| [GLM-4.7-REAP-45](https://huggingface.co/0xSero/GLM-4.7-REAP-45) | 197B | 88 | ~370GB | BF16 |
| [GLM-4.7-REAP-50](https://huggingface.co/0xSero/GLM-4.7-REAP-50) | 179B | 80 | ~345GB | BF16 |
| [GLM-4.7-REAP-40-W4A16](https://huggingface.co/0xSero/GLM-4.7-REAP-40-W4A16) | 218B | 96 | ~108GB | GPTQ |
| [GLM-4.7-REAP-50-W4A16](https://huggingface.co/0xSero/GLM-4.7-REAP-50-W4A16) | 179B | 80 | ~92GB | GPTQ |

---

## 🚀 Deployment

### vLLM (Recommended)

```bash
vllm serve 0xSero/GLM-4.7-REAP-30 \
    --tensor-parallel-size 8 \
    --trust-remote-code \
    --dtype bfloat16
```
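
The server exposes an OpenAI-compatible API (port 8000 by default), so any OpenAI client can be pointed at it. A minimal example, assuming the default host and port:

```python
# Query the vLLM server started above via its OpenAI-compatible endpoint.
# Adjust base_url if you changed the host or port.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="0xSero/GLM-4.7-REAP-30",
    messages=[{"role": "user", "content": "Write a Python function to merge two sorted lists."}],
    temperature=0.7,
    max_tokens=512,
)
print(response.choices[0].message.content)
```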

### Transformers

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained(
    "0xSero/GLM-4.7-REAP-30",
    torch_dtype=torch.bfloat16,
    device_map="auto",
    trust_remote_code=True
)
tokenizer = AutoTokenizer.from_pretrained("0xSero/GLM-4.7-REAP-30", trust_remote_code=True)

messages = [{"role": "user", "content": "Write a Python function to merge two sorted lists."}]
inputs = tokenizer.apply_chat_template(messages, return_tensors="pt", add_generation_prompt=True)
outputs = model.generate(
    inputs.to(model.device),
    max_new_tokens=512,
    do_sample=True,  # required for temperature to take effect
    temperature=0.7,
)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```
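
### SGLang

SGLang is also supported (see Highlights). The command below is an untested sketch using SGLang's standard server flags; adjust to your setup:

```bash
# Untested sketch: standard SGLang launch flags, not taken from this repo.
python -m sglang.launch_server \
    --model-path 0xSero/GLM-4.7-REAP-30 \
    --tp 8 \
    --trust-remote-code
```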

---

## 🧩 Reproduction

### REAP Pruning Script

```python
#!/usr/bin/env python3
"""
REAP Pruning Script for MoE Models
Adapted from: https://github.com/CerebrasResearch/reap
"""

import subprocess
import sys

def run_reap(
    model_path: str,
    compression_ratio: float,
    dataset: str = "0xSero/glm47-reap-calibration-v2",
    samples: int = 1360,
    seed: int = 42,
    distance: str = "angular",
    reuse_observations: str | None = None,
):
    """
    Run REAP expert pruning.

    Args:
        model_path: Path to base model
        compression_ratio: 0.30 = prune 30%, keep 70%
        dataset: Calibration dataset (code + tools + agentic)
        samples: Number of calibration samples
        seed: Random seed for reproducibility
        distance: Distance metric for expert clustering
        reuse_observations: Path to pre-computed observations for instant pruning
    """
    cmd = [
        sys.executable, "src/reap/prune.py",
        "--model-name", model_path,
        "--dataset-name", dataset,
        "--compression-ratio", str(compression_ratio),
        "--prune-method", "reap",
        "--seed", str(seed),
        "--samples_per_category", str(samples),
        "--model_max_length", "2048",
        "--distance_measure", distance,
        "--record_pruning_metrics_only", "true",
    ]

    if reuse_observations:
        # Instant pruning: skip calibration, reuse precomputed expert scores
        cmd.extend(["--load_observations", reuse_observations])

    subprocess.run(cmd, check=True)

# Example: Create a 40% pruned model
run_reap(
    model_path="/path/to/GLM-4.7",
    compression_ratio=0.40,  # Prune 40% of experts
)
```

### Observation Reuse (Instant Multi-Ratio Pruning)

REAP computes expert saliency scores during calibration. These scores are **compression-ratio independent**, enabling instant pruning at any ratio:

```bash
# First run: compute observations (~5 hours)
python prune.py --compression-ratio 0.40 --output_file_name observations.pt

# Subsequent runs: instant pruning (<5 minutes)
python prune.py --compression-ratio 0.30 --load_observations observations.pt
python prune.py --compression-ratio 0.50 --load_observations observations.pt
```

---

## ⚖️ License

Apache 2.0 (inherited from GLM-4)

---

## 🧾 Citation

```bibtex
@article{lasby2025reap,
  title={REAP the Experts: Why Pruning Prevails for One-Shot MoE Compression},
  author={Lasby, Mike and Lazarevich, Ivan and Sinnadurai, Nish and Lie, Sean and Ioannou, Yani and Thangarasa, Vithursan},
  journal={arXiv preprint arXiv:2510.13999},
  year={2025},
  url={https://arxiv.org/abs/2510.13999}
}
```