---
language:
- en
library_name: transformers
tags:
- glm
- glm4
- MOE
- pruning
- compression
- reap
- cerebras
- code
- function-calling
- agentic
license: apache-2.0
pipeline_tag: text-generation
base_model:
- zai/glm-4.7
---

<p align="center">
<em>𓌳 <strong>REAP</strong>𓌳 the Experts: Why Pruning Prevails for One-Shot MoE Compression</em><br>
<a href="https://arxiv.org/abs/2510.13999">📄 Paper</a> • <a href="https://github.com/CerebrasResearch/reap">💻 Code</a> • <a href="https://www.cerebras.ai/blog/reap">📝 Blog</a>
</p>

# GLM-4.7-REAP-30

## ✨ Highlights

**30% Expert-Pruned** GLM-4.7 optimized for **code generation**, **function calling**, and **agentic workflows**.

Created using **[REAP (Router-weighted Expert Activation Pruning)](https://arxiv.org/abs/2510.13999)** by Cerebras:

- **358B → 251B**: 30% of MoE experts pruned (112/160 remaining)
- **Calibrated for Code & Tools**: Preserves coding and function-calling capabilities
- **One-Shot Compression**: No fine-tuning required
- **Drop-in Compatible**: Works with vLLM, Transformers, SGLang

### 🙏 Acknowledgments

- **[Prime Intellect](https://www.primeintellect.ai/)** – Compute sponsorship (8x H200 cluster)
- **[Cerebras](https://www.cerebras.net/)** – [REAP methodology](https://arxiv.org/abs/2510.13999)

---

## 📋 Model Specifications

| Property | Value |
|----------|-------|
| **Base Model** | [zai/glm-4.7](https://huggingface.co/zai/glm-4.7) |
| **Architecture** | Sparse Mixture-of-Experts (SMoE) |
| **Original Parameters** | 358B |
| **Pruned Parameters** | 251B |
| **Compression** | 30% experts removed |
| **Experts per Layer** | 112 (was 160) |
| **MoE Layers** | 92 |
| **Activated Experts** | 8 per token |
| **Precision** | BF16 |
| **Disk Size** | ~470GB |
| **VRAM Required** | ~470GB |

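As a quick sanity check on these numbers (assuming the 358B → 251B reduction comes entirely from the removed experts), the implied per-expert parameter count can be back-calculated from the table:

```python
# Back-of-envelope consistency check on the specification table.
# Assumption: the entire parameter reduction comes from removed experts.
original_params = 358e9
pruned_params = 251e9
moe_layers = 92
experts_removed_per_layer = 160 - 112  # 48 experts pruned in each MoE layer

removed_experts_total = experts_removed_per_layer * moe_layers        # 4,416 experts
params_per_expert = (original_params - pruned_params) / removed_experts_total

print(f"{removed_experts_total} experts removed, ~{params_per_expert / 1e6:.0f}M params each")
# -> 4416 experts removed, ~24M params each
```
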
---

## 🔬 Calibration Dataset: Deep Dive

REAP's effectiveness depends critically on **calibration data that represents the target use case**. We specifically optimized for **code generation**, **function/tool calling**, and **agentic workflows**.

### Why These 3 Datasets?

| Dataset | Samples | Purpose | Why It Matters |
|---------|---------|---------|----------------|
| [evol-codealpaca-v1](https://huggingface.co/datasets/theblackcat102/evol-codealpaca-v1) | 700 | Code generation | **51% of mix** – Code tasks activate specific expert pathways; pruning without code calibration destroys coding ability |
| [xlam-function-calling-60k](https://huggingface.co/datasets/Salesforce/xlam-function-calling-60k) | 330 | Function/tool calling | **24% of mix** – Tool use requires structured JSON output; experts handling schema generation must be preserved |
| [SWE-smith-trajectories](https://huggingface.co/datasets/SWE-bench/SWE-smith-trajectories) | 330 | Agentic multi-turn | **24% of mix** – Real SWE-bench trajectories with tool calls, file edits, and multi-step reasoning |
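
Assembling the 700/330/330 mix is straightforward with the `datasets` library. The sketch below is illustrative only: the `train` split names and the final step of mapping each source into a common chat/text format are assumptions, not the exact preprocessing used for this release.

```python
# Minimal sketch of building the calibration mix (illustrative; split names
# and the downstream formatting step are assumptions).
from datasets import load_dataset

MIX = [
    ("theblackcat102/evol-codealpaca-v1", 700),     # code generation
    ("Salesforce/xlam-function-calling-60k", 330),  # function/tool calling
    ("SWE-bench/SWE-smith-trajectories", 330),      # agentic multi-turn
]

subsets = []
for name, n_samples in MIX:
    ds = load_dataset(name, split="train")
    subsets.append(ds.shuffle(seed=42).select(range(n_samples)))

# Each source has a different schema, so in practice every subset is mapped to a
# shared text/chat format before concatenation; that step is omitted here.
```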

### The Science Behind Dataset Selection

```
REAP Algorithm:
1. Forward pass calibration samples through model
2. Record which experts activate and their magnitudes
3. Compute saliency = router_weight × activation_norm
4. Prune lowest-saliency experts

Key Insight: Experts are TASK-SPECIFIC
├── Some experts specialize in natural language
├── Some experts specialize in code syntax
├── Some experts specialize in JSON/structured output
└── Some experts specialize in multi-turn context

If calibration lacks code → code-specialized experts appear "unused" → get pruned → model loses coding ability
```
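
Concretely, the scoring step above amounts to averaging router-weighted expert output norms over the calibration tokens. The snippet below is a simplified illustration of that rule, not the actual REAP implementation; tensor names and shapes are hypothetical.

```python
import torch

def expert_saliency(gate_weights: torch.Tensor, expert_out_norms: torch.Tensor) -> torch.Tensor:
    """Simplified REAP-style saliency for one MoE layer.

    gate_weights:     [num_tokens, num_experts] router weights
                      (0 where an expert is not among the top-k for that token)
    expert_out_norms: [num_tokens, num_experts] L2 norm of each expert's output
                      (0 where the expert was not run)
    Returns one score per expert: mean over calibration tokens of
    router_weight * activation_norm.
    """
    return (gate_weights * expert_out_norms).mean(dim=0)  # [num_experts]

def experts_to_prune(saliency: torch.Tensor, compression_ratio: float = 0.30) -> torch.Tensor:
    """Indices of the lowest-saliency experts to drop at the given ratio."""
    num_pruned = int(saliency.numel() * compression_ratio)
    return torch.argsort(saliency)[:num_pruned]
```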

### Cerebras' Original Mix (from paper)

Cerebras used the same 3 datasets in their GLM-4.6 REAP experiments:
- evol-codealpaca-v1 for code generation
- xlam-function-calling-60k for tool calling
- SWE-smith-trajectories for agentic tasks

We followed this exact recipe for reproducibility.

### Combined Dataset

Our calibration mix: [0xSero/glm47-reap-calibration-v2](https://huggingface.co/datasets/0xSero/glm47-reap-calibration-v2)

---

## 📦 Related Models

| Model | Params | Experts | Size | Format |
|-------|--------|---------|------|--------|
| [GLM-4.7-REAP-30](https://huggingface.co/0xSero/GLM-4.7-REAP-30) | 251B | 112 | ~470GB | BF16 |
| [GLM-4.7-REAP-35](https://huggingface.co/0xSero/GLM-4.7-REAP-35) | 233B | 104 | ~439GB | BF16 |
| [GLM-4.7-REAP-40](https://huggingface.co/0xSero/GLM-4.7-REAP-40) | 218B | 96 | ~407GB | BF16 |
| [GLM-4.7-REAP-45](https://huggingface.co/0xSero/GLM-4.7-REAP-45) | 197B | 88 | ~370GB | BF16 |
| [GLM-4.7-REAP-50](https://huggingface.co/0xSero/GLM-4.7-REAP-50) | 179B | 80 | ~345GB | BF16 |
| [GLM-4.7-REAP-40-W4A16](https://huggingface.co/0xSero/GLM-4.7-REAP-40-W4A16) | 218B | 96 | ~108GB | GPTQ |
| [GLM-4.7-REAP-50-W4A16](https://huggingface.co/0xSero/GLM-4.7-REAP-50-W4A16) | 179B | 80 | ~92GB | GPTQ |

---

## 🚀 Deployment

### vLLM (Recommended)

```bash
vllm serve 0xSero/GLM-4.7-REAP-30 \
    --tensor-parallel-size 8 \
    --trust-remote-code \
    --dtype bfloat16
```
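
The server exposes an OpenAI-compatible API (port 8000 by default), so any OpenAI client can be pointed at it. A minimal example, assuming the default host and port:

```python
# Query the vLLM server started above via its OpenAI-compatible endpoint.
# Adjust base_url if you changed the host or port.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="0xSero/GLM-4.7-REAP-30",
    messages=[{"role": "user", "content": "Write a Python function to merge two sorted lists."}],
    temperature=0.7,
    max_tokens=512,
)
print(response.choices[0].message.content)
```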

### Transformers

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained(
    "0xSero/GLM-4.7-REAP-30",
    torch_dtype=torch.bfloat16,
    device_map="auto",
    trust_remote_code=True
)
tokenizer = AutoTokenizer.from_pretrained("0xSero/GLM-4.7-REAP-30", trust_remote_code=True)

messages = [{"role": "user", "content": "Write a Python function to merge two sorted lists."}]
inputs = tokenizer.apply_chat_template(messages, return_tensors="pt", add_generation_prompt=True)
outputs = model.generate(
    inputs.to(model.device),
    max_new_tokens=512,
    do_sample=True,  # required for temperature to take effect
    temperature=0.7,
)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```
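
### SGLang

SGLang is also supported (see Highlights). The command below is an untested sketch using SGLang's standard server flags; adjust to your setup:

```bash
# Untested sketch: standard SGLang launch flags, not taken from this repo.
python -m sglang.launch_server \
    --model-path 0xSero/GLM-4.7-REAP-30 \
    --tp 8 \
    --trust-remote-code
```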

---

## 🧩 Reproduction

### REAP Pruning Script

```python
#!/usr/bin/env python3
"""
REAP Pruning Script for MoE Models
Adapted from: https://github.com/CerebrasResearch/reap
"""

import subprocess
import sys

def run_reap(
    model_path: str,
    compression_ratio: float,
    dataset: str = "0xSero/glm47-reap-calibration-v2",
    samples: int = 1360,
    seed: int = 42,
    distance: str = "angular",
    reuse_observations: str | None = None,
):
    """
    Run REAP expert pruning.

    Args:
        model_path: Path to base model
        compression_ratio: 0.30 = prune 30%, keep 70%
        dataset: Calibration dataset (code + tools + agentic)
        samples: Number of calibration samples
        seed: Random seed for reproducibility
        distance: Distance metric for expert clustering
        reuse_observations: Path to pre-computed observations for instant pruning
    """
    cmd = [
        sys.executable, "src/reap/prune.py",
        "--model-name", model_path,
        "--dataset-name", dataset,
        "--compression-ratio", str(compression_ratio),
        "--prune-method", "reap",
        "--seed", str(seed),
        "--samples_per_category", str(samples),
        "--model_max_length", "2048",
        "--distance_measure", distance,
        "--record_pruning_metrics_only", "true",
    ]

    if reuse_observations:
        # Instant pruning: skip calibration, reuse precomputed expert scores
        cmd.extend(["--load_observations", reuse_observations])

    subprocess.run(cmd, check=True)

# Example: Create a 40% pruned model
run_reap(
    model_path="/path/to/GLM-4.7",
    compression_ratio=0.40,  # Prune 40% of experts
)
```

### Observation Reuse (Instant Multi-Ratio Pruning)

REAP computes expert saliency scores during calibration. These scores are **compression-ratio independent**, enabling instant pruning at any ratio:

```bash
# First run: compute observations (~5 hours)
python prune.py --compression-ratio 0.40 --output_file_name observations.pt

# Subsequent runs: instant pruning (<5 minutes)
python prune.py --compression-ratio 0.30 --load_observations observations.pt
python prune.py --compression-ratio 0.50 --load_observations observations.pt
```

---

## ⚖️ License

Apache 2.0 (inherited from GLM-4)

---

## 🧾 Citation

```bibtex
@article{lasby2025reap,
  title={REAP the Experts: Why Pruning Prevails for One-Shot MoE Compression},
  author={Lasby, Mike and Lazarevich, Ivan and Sinnadurai, Nish and Lie, Sean and Ioannou, Yani and Thangarasa, Vithursan},
  journal={arXiv preprint arXiv:2510.13999},
  year={2025},
  url={https://arxiv.org/abs/2510.13999}
}
```