---
base_model: zai-org/GLM-5
library_name: mlx
license: mit
tags:
- mlx
- safetensors
- glm_moe_dsa
- conversational
- text-generation
- mxfp4
- quantized
language:
- en
- zh
---

# mlx-community/GLM-5-MXFP4-Q8

This model was converted to MLX format from [`zai-org/GLM-5`](https://huggingface.co/zai-org/GLM-5) using a custom MXFP4-Q8 quantization scheme.

GLM-5 is a 744B-parameter (40B active) Mixture-of-Experts model developed by Z.ai, targeting complex systems engineering and long-horizon agentic tasks. It uses Multi-Head Latent Attention (MLA) across 47 transformer layers; each MoE layer combines 64 routed experts (4 active per token) with 1 shared expert.
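
To make the expert routing concrete, below is a minimal, self-contained sketch of a top-k MoE block with a shared expert, written against MLX. It illustrates only the general technique; the class, dimensions, and expert modules are invented for this example and are not GLM-5's actual implementation.

```python
import mlx.core as mx
import mlx.nn as nn

class ToyMoE(nn.Module):
    """Toy top-k MoE block with a shared expert (illustration only)."""

    def __init__(self, dim: int = 64, num_experts: int = 8, top_k: int = 2):
        super().__init__()
        self.top_k = top_k
        self.router = nn.Linear(dim, num_experts, bias=False)
        self.experts = [nn.Linear(dim, dim) for _ in range(num_experts)]
        self.shared = nn.Linear(dim, dim)  # always active, like a shared expert

    def __call__(self, x: mx.array) -> mx.array:  # x: (tokens, dim)
        logits = self.router(x)  # one score per routed expert, per token
        # Indices and normalized gate weights of the top-k experts per token
        idx = mx.argpartition(-logits, kth=self.top_k - 1, axis=-1)[:, : self.top_k]
        gates = mx.softmax(mx.take_along_axis(logits, idx, axis=-1), axis=-1)
        rows = []
        for t in range(x.shape[0]):  # naive per-token loop, for clarity
            row = self.shared(x[t])  # the shared expert always contributes
            for k in range(self.top_k):
                e = idx[t, k].item()
                row = row + gates[t, k] * self.experts[e](x[t])
            rows.append(row)
        return mx.stack(rows)

x = mx.random.normal((5, 64))
print(ToyMoE()(x).shape)  # (5, 64)
```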

## Quantization

This model uses a mixed-precision scheme: the routed expert weights are quantized to 4-bit MXFP4, while all other weights use 8-bit affine quantization.

| Component | Mode | Bits | Group size |
|---|---|---|---|
| Expert weights (`switch_mlp`) | MXFP4 | 4 | 32 |
| Attention, embeddings, shared expert, dense MLP, `lm_head` | Affine | 8 | 64 |
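
A conversion of this shape can be expressed through the `quant_predicate` hook of `mlx_lm.convert`, which allows per-layer quantization overrides. The sketch below is a hedged approximation, not the exact recipe used to produce this repo; in particular, the per-layer `"mode"` override assumes a recent mlx/mlx-lm build with MXFP4 support, so verify against your installed versions.

```python
# Approximate reconversion recipe (assumes recent mlx / mlx-lm with
# MXFP4 support; not the exact script used to produce this repo).
from mlx_lm import convert

def mixed_precision(path, module, config):
    # Routed expert weights -> 4-bit MXFP4, group size 32
    if "switch_mlp" in path:
        return {"bits": 4, "group_size": 32, "mode": "mxfp4"}
    # Everything else -> 8-bit affine, group size 64
    return {"bits": 8, "group_size": 64, "mode": "affine"}

convert(
    "zai-org/GLM-5",
    mlx_path="GLM-5-MXFP4-Q8",
    quantize=True,
    quant_predicate=mixed_precision,
)
```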

## Use with mlx-lm

```bash
pip install mlx-lm
```

```python
from mlx_lm import load, generate

model, tokenizer = load("mlx-community/GLM-5-MXFP4-Q8")

prompt = "hello"

# Wrap the prompt with the model's chat template when one is available
if tokenizer.chat_template is not None:
    messages = [{"role": "user", "content": prompt}]
    prompt = tokenizer.apply_chat_template(
        messages, add_generation_prompt=True
    )

response = generate(model, tokenizer, prompt=prompt, verbose=True)
```
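
For a quick test without writing any code, the `mlx_lm.generate` CLI works as well:

```bash
mlx_lm.generate --model mlx-community/GLM-5-MXFP4-Q8 --prompt "hello"
```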