---
pipeline_tag: text-generation
license: other
license_name: modified-mit
license_link: https://github.com/MiniMax-AI/MiniMax-M2.5/blob/main/LICENSE
library_name: exllamav3
base_model: MiniMaxAI/MiniMax-M2.5
base_model_relation: quantized
tags:
- exl3
---
[exllamav3](https://github.com/turboderp-org/exllamav3/) quantizations of [MiniMaxAI/MiniMax-M2.5](https://huggingface.co/MiniMaxAI/MiniMax-M2.5). Quantized using commit 89b841d of the dev branch.
Note that tensor parallelism is not currently supported for this architecture, so multi-GPU setups will have a harder time fitting this model than they would otherwise (you'll get more context out of 1x96 GB GPU than 4x24 GB GPUs).
| Quant | Size | KLD | PPL | GPU Requirement Hint |
| --- | --- | --- | --- | --- |
| [2.00 bpw h6](https://huggingface.co/MikeRoz/MiniMax-M2.5-exl3/tree/2.00bpw_H6) | 61.054 GiB | 0.42365 | 9.31452 | 3x24 GB w/ 49152 FP16 context |
| [2.10 bpw h6](https://huggingface.co/MikeRoz/MiniMax-M2.5-exl3/tree/2.10bpw_H6) (optimized) | 57.292 GiB | 0.36355 | 9.20850 | 3x24 GB w/ 40960 FP16 context |
| [2.50 bpw h6](https://huggingface.co/MikeRoz/MiniMax-M2.5-exl3/tree/2.50bpw_H6) (optimized) | 67.838 GiB | 0.30152 | 8.88802 | 4x24 GB w/ 90112 FP16 context |
| [3.00 bpw h6](https://huggingface.co/MikeRoz/MiniMax-M2.5-exl3/tree/3.00bpw_H6) | 81.613 GiB | 0.17263 | 8.58626 | 4x24 GB w/ 16384 FP16 context |
| [3.06 bpw h6](https://huggingface.co/MikeRoz/MiniMax-M2.5-exl3/tree/3.06bpw_H6) (optimized) | 82.656 GiB | 0.15648 | 8.66856 | 4x24 GB w/ 12288 FP16 context |
| [3.50 bpw h6](https://huggingface.co/MikeRoz/MiniMax-M2.5-exl3/tree/3.50bpw_H6) (optimized) | 94.328 GiB | 0.12513 | 8.58743 | 5x24 GB w/ 49152 FP16 context |
| [4.00 bpw h6](https://huggingface.co/MikeRoz/MiniMax-M2.5-exl3/tree/4.00bpw_H6) | 108.087 GiB | 0.07882 | 8.45404 | 6x24 GB w/ 49152 FP16 context |
| [5.00 bpw h6](https://huggingface.co/MikeRoz/MiniMax-M2.5-exl3/tree/5.00bpw_H6) | 134.561 GiB | - | - | 5x24 GB + 1x32 GB w/ 24576 FP16 context (will not load for me with 6x24 GB) |
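
As a rough sanity check on the sizes in the table, file size scales almost linearly with bits per weight. The sketch below back-derives an effective weight count from one row and uses it to estimate the size at another bitrate. It ignores the 6-bit head, embeddings, and anything left unquantized, so treat the result as a ballpark figure only. The function names are illustrative and not part of exllamav3.

```python
GIB = 2**30  # bytes per GiB


def implied_weights(size_gib: float, bpw: float) -> float:
    """Effective number of weights implied by a quant's size and average bitrate."""
    return size_gib * GIB * 8 / bpw


def estimate_size_gib(n_weights: float, bpw: float) -> float:
    """Approximate on-disk size in GiB for a given weight count and bitrate."""
    return n_weights * bpw / 8 / GIB


# Use the 4.00 bpw row of the table as the reference point.
n = implied_weights(108.087, 4.00)
print(f"about {n / 1e9:.1f}B effective weights")
print(f"3.00 bpw estimate: {estimate_size_gib(n, 3.00):.1f} GiB (table says 81.613 GiB)")
```

The context cache comes on top of this, so the estimate is a lower bound on the VRAM you need, not a total.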
### K/L-D and PPL graphs


### Measurements for creating optimized quants
[measurement.json - 2.0bpw_H6 vs 3.0bpw_H6](https://huggingface.co/MikeRoz/MiniMax-M2.5-exl3/blob/main/measurement_MiniMaxAI_MiniMax-M2.5-2.0-3.0.json)
[measurement.json - 3.0bpw_H6 vs 4.0bpw_H6](https://huggingface.co/MikeRoz/MiniMax-M2.5-exl3/blob/main/measurement_MiniMaxAI_MiniMax-M2.5-3.0-4.0.json)
[measurement.json - 4.0bpw_H6 vs 5.0bpw_H6](https://huggingface.co/MikeRoz/MiniMax-M2.5-exl3/blob/main/measurement_MiniMaxAI_MiniMax-M2.5-4.0-5.0.json)
### matplotlib Catbench
<details><summary>Click to see cat plots!</summary>
2.0 bpw:

2.1 bpw:

2.5 bpw:

3.0 bpw:

3.06 bpw:

3.5 bpw:

4.0 bpw:

5.0 bpw:

Prompted in Text Generation Web UI in chat-instruct mode with MiniMax AI's recommended settings (temp 1, top p 0.95, top k 40).
Note that 2.1, 4.0, and 5.0 all required a single re-roll each to get a working script.
</details>
### How to use these quants
The documentation for [exllamav3](https://github.com/turboderp-org/exllamav3/) is your best bet here, as well as that of [TabbyAPI](https://github.com/theroyallab/tabbyAPI) or [Text Generation Web UI (oobabooga)](https://github.com/oobabooga/text-generation-webui). In short:
* You need to have sufficient VRAM to fit the model and your context cache. I give some pointers above that may be helpful.
* At this point, your GPUs need to be NVIDIA. AMD/ROCm, Intel, and offloading to system RAM are not currently supported.
* You will need a software package capable of loading exllamav3 models. I'm still somewhat partial to oobabooga, but TabbyAPI is another popular option. Follow the documentation for your choice in order to get yourself set up.
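
Each quant lives on its own branch of this repository (the branch names are the ones in the table links, such as `3.00bpw_H6`). One way to fetch a single branch is `snapshot_download` from the `huggingface_hub` package. This is a minimal sketch: the destination folder layout and the helper names are assumptions, not something the loaders require.

```python
from pathlib import Path

REPO_ID = "MikeRoz/MiniMax-M2.5-exl3"


def branch_for(bpw: str) -> str:
    """Map a bitrate label such as "3.00" to its branch name, e.g. "3.00bpw_H6"."""
    return f"{bpw}bpw_H6"


def download_quant(bpw: str, dest_root: str = "models") -> str:
    """Download one quant branch and return the local folder path."""
    # Imported lazily so branch_for() works without huggingface_hub installed.
    from huggingface_hub import snapshot_download

    # Hypothetical folder naming; use whatever your loader expects.
    local_dir = Path(dest_root) / f"MiniMax-M2.5-exl3-{bpw}bpw"
    return snapshot_download(
        repo_id=REPO_ID,
        revision=branch_for(bpw),
        local_dir=str(local_dir),
    )
```

Calling `download_quant("3.00")` would pull roughly 82 GiB, so check your free disk space first.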
### How to create a quant
The documentation for [exllamav3](https://github.com/turboderp-org/exllamav3/) is again the authoritative source. But for a short primer, click below to continue.
<details>
<summary>Expand for more details</summary>
Quantization happens a layer at a time, so you don't need nearly as much VRAM to quant as you do to load the whole model.
Not all architectures are supported by exllamav3. Check the documentation to ensure the model you want to quantize is supported.
To create a quant, you'll need to:
* Download your source model
* git clone exllamav3
* Set up a Python environment with all requirements from requirements.txt
* Run convert.py:
```bash
python convert.py -w [path/to/work_area] -i [path/to/source_model] -o [path/to/output_model] -b [bitrate] -hb [head bitrate]
```
Where:
* `path/to/work_area` is a folder where the script can save intermediate checkpoints as it works. If the process crashes, you can pass the `--resume` flag to pick up from where it left off.
* `path/to/source_model` is the folder containing the source model you downloaded.
* `path/to/output_model` is the destination folder for your completed quant (it will be created if it does not exist).
* `bitrate` is the average number of bits to use for each weight. It needs to be a float (pass `4.0` if you want just 4 even).
* `head bitrate` is the number of bits to use for the output layer (head) weights. 6 is usually most useful here. 8 is generally considered overkill, but may be useful in some situations.
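
If you plan to produce several quants, it can help to assemble the command in a script. The sketch below only builds the argument list from the flags described above (`-w`, `-i`, `-o`, `-b`, `-hb`, `--resume`); the folder names are placeholders and the helper is hypothetical, not part of exllamav3.

```python
import shlex
import sys


def build_convert_cmd(work_dir, source_dir, output_dir, bitrate, head_bits=6, resume=False):
    """Return the convert.py invocation as a list of arguments."""
    cmd = [
        sys.executable, "convert.py",
        "-w", str(work_dir),
        "-i", str(source_dir),
        "-o", str(output_dir),
        "-b", f"{float(bitrate):.2f}",  # always pass the bitrate as a float
        "-hb", str(int(head_bits)),
    ]
    if resume:
        cmd.append("--resume")  # pick up from the last checkpoint in work_dir
    return cmd


# Placeholder paths; replace with your own folders.
cmd = build_convert_cmd("work", "models/MiniMax-M2.5", "models/MiniMax-M2.5-4.0bpw-h6-exl3", 4.0)
print(shlex.join(cmd))
# To run it from the exllamav3 checkout: import subprocess; subprocess.run(cmd, check=True)
```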
</details>
### How to create optimized quants
It's possible to produce quants that are better for a given size than the ones you get by performing a quant directly to a given target bitrate. The process involves comparing two quants, measuring which modules are more affected by the quantization process, and selecting those modules first when targeting some in-between bitrate.
<details>
<summary>Expand for more details</summary>
exllamav3 includes a measurement script `util/measure.py` that will compare two exllamav3 models module by module against the original model. The goal is to see which modules are the most affected by the decrease in precision involved in going from a larger quant to a smaller quant.
The command is:
```bash
python util/measure.py -l [level] -d [device] -ms [max_sys_memory] -i [path/to/quant1] [path/to/quant2] -r [path/to/original_model] -o [path/to/measurement.json]
```
Where:
* `level` is an integer between 0 and 3 that determines the resolution of the measurement. 0 is fastest but least granular, 2 is default, 3 is most granular and slowest.
* `device` is the index of the CUDA device that will perform the work
* `max_sys_memory` is the amount of memory that can be used for state data to speed things up, in GiB
* `path/to/quant1` and `path/to/quant2` are the paths to the two quants to compare
* `path/to/original_model` is the path to the original model
* `path/to/measurement.json` is the path to the resulting json measurement file
The first measurement file linked above compares my 2.0bpw_H6 and 3.0bpw_H6 quants.
You can then feed this measurement file, along with the two quants, to `util/optimize.py` to create optimized quants that draw modules from both quants where appropriate to get the best result for a given bitrate.
The command is:
```bash
python util/optimize.py -i [path/to/quant1] [path/to/quant2] -o [path/to/resulting_model] -m [path/to/measurement.json] -b [target_bitrate]
```
Where:
* `path/to/quant1` and `path/to/quant2` are paths to the two source models
* `path/to/resulting_model` is the output path
* `target_bitrate` is the target bitrate as a number with a decimal point
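
Since one measurement file can serve several target bitrates, the optimize step is easy to batch. The sketch below builds one `util/optimize.py` command per target using only the flags shown above. All paths are hypothetical placeholders, and the helper is not part of exllamav3.

```python
import shlex
import sys


def build_optimize_cmd(quant_a, quant_b, measurement, out_dir, target_bpw):
    """Return the util/optimize.py invocation as a list of arguments."""
    return [
        sys.executable, "util/optimize.py",
        "-i", str(quant_a), str(quant_b),
        "-o", str(out_dir),
        "-m", str(measurement),
        "-b", f"{target_bpw:.2f}",  # target bitrate, always with a decimal point
    ]


# Hypothetical local paths for the two source quants and the measurement file.
LOW, HIGH = "quants/2.0bpw_H6", "quants/3.0bpw_H6"
MEASUREMENT = "measurement_2.0-3.0.json"

commands = [
    build_optimize_cmd(LOW, HIGH, MEASUREMENT, f"quants/{target:.2f}bpw_H6", target)
    for target in (2.10, 2.50)
]
for command in commands:
    print(shlex.join(command))
```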
You can use a measurement file from one pair of quants with another pair of quants of the same model. When I tried to use 2.0bpw and 4.0bpw quants to create a 2.25bpw quant, the resulting model came out larger than requested, at 2.48 bpw, because of the substitution, but it was still an improvement over a straight 2.48bpw quant. An explicitly requested 2.48bpw quant drawing from the 2.0bpw and 3.0bpw quants proved to be even better (in terms of KL divergence). Finally, I tried creating a 3.25bpw quant from 3.0bpw and 4.0bpw quants, still using my 2.0-vs-3.0 measurement file. This was not as successful as the optimized 2.25bpw quant, and may have benefited from a 'correct' measurement file that matched the two actual sources.
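
To make the selection idea concrete, here is a toy greedy version of it: upgrade the modules that gain the most accuracy per extra bit until the bit budget for the target bitrate is spent. This is only an illustration of the principle, not the algorithm `util/optimize.py` actually uses, and the module names and error numbers are invented.

```python
from dataclasses import dataclass


@dataclass
class Module:
    name: str
    n_weights: int   # number of weights in the module
    err_low: float   # measured error when taken from the lower-bitrate quant
    err_high: float  # measured error when taken from the higher-bitrate quant


def pick_upgrades(modules, bpw_low, bpw_high, target_bpw):
    """Greedily choose modules to take from the higher-bitrate quant."""
    total = sum(m.n_weights for m in modules)
    budget_bits = (target_bpw - bpw_low) * total
    step = bpw_high - bpw_low

    def gain_per_bit(m):
        return (m.err_low - m.err_high) / (step * m.n_weights)

    chosen, spent = [], 0.0
    for m in sorted(modules, key=gain_per_bit, reverse=True):
        cost = step * m.n_weights
        if spent + cost <= budget_bits:  # skip modules that would exceed the budget
            chosen.append(m.name)
            spent += cost
    return chosen, bpw_low + spent / total


# Invented example data.
modules = [
    Module("attn", 100, 0.30, 0.10),
    Module("mlp_a", 400, 0.40, 0.20),
    Module("mlp_b", 400, 0.25, 0.20),
    Module("other", 100, 0.05, 0.04),
]
chosen, achieved = pick_upgrades(modules, 2.0, 3.0, 2.5)
print(chosen, achieved)
```

Because whole modules are swapped, the achieved bitrate moves in steps rather than continuously, so a real result will not always land exactly on the requested target.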
</details>
### How to measure Perplexity and KL Divergence
<details>
<summary>Expand for details</summary>
Measuring KL divergence involves comparing the outputs of the quantized model to the outputs of the original model. If the original model is too large for your hardware to load without quantization, you can run a script to generate its logits once and save them to a file. That file can then be passed into the comparison script, sparing you the need to load the whole source model each time.
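
For reference, the two metrics can be sketched in plain Python as follows. This assumes the usual convention of treating the original model as the reference distribution, KL(original || quantized); the eval scripts may batch, average, or orient the comparison differently.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over one vocabulary-sized list of logits."""
    m = max(logits)
    lse = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - lse for x in logits]


def kl_divergence(ref_logits, quant_logits):
    """KL(P_ref || P_quant) for a single token position, in nats."""
    log_p = log_softmax(ref_logits)
    log_q = log_softmax(quant_logits)
    return sum(math.exp(a) * (a - b) for a, b in zip(log_p, log_q))


def perplexity(logits_per_position, target_ids):
    """exp of the mean negative log-likelihood of the actual next tokens."""
    nll = [-log_softmax(logits)[t] for logits, t in zip(logits_per_position, target_ids)]
    return math.exp(sum(nll) / len(nll))


# Tiny made-up example with a 3-token vocabulary.
print(kl_divergence([2.0, 0.0, 0.0], [1.0, 0.0, 0.0]))
print(perplexity([[2.0, 0.0, 0.0], [0.0, 1.0, 0.0]], [0, 1]))
```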
First, you'll need to create a dataset spec file. I based mine on `eval/spec/wiki2_llama3_large.json`.
```json
{
    "tokenize_fn": "transformers",
    "tokenizer_dir": "path/to/full_model",
    "dataset": "wiki2",
    "eval_stride": 512,
    "eval_len": 2048,
    "max_rows": 100
}
```
I passed this into `eval/compare_q_logits.py` as follows:
```bash
python eval/compare_q_logits.py -m [path/to/full_model] -o [path/to/output_logits.safetensors] -d [path/to/dataset_spec.json] -rpb [rows_per_batch] -dev [device_index]
```
Where:
* `path/to/full_model` is the path to the original (unquantized) model
* `path/to/output_logits.safetensors` is the path to the output logits file
* `path/to/dataset_spec.json` is the path to the dataset spec file described above
* `rows_per_batch` - I would run out of memory without this parameter. I set it to 32768.
* `device_index` - optional CUDA device index
Next, you'll need a model spec file that describes all the quants you want in the graph. You'll need to be able to load any model you'd like compared. Here's a sample of the one I used for these quants:
```json
[
    {
        "load_fn": "exllamav3",
        "fwd_fn": "exllamav3",
        "label": "EXL3 2.0bpw H6",
        "model_dir": "path/to/MiniMaxAI_MiniMax-M2.5-2.0bpw-h6-exl3"
    },
    {
        "load_fn": "exllamav3",
        "fwd_fn": "exllamav3",
        "label": "EXL3 2.1bpw H6 (optimized)",
        "model_dir": "path/to/MiniMaxAI_MiniMax-M2.5-2.1bpw-h6-exl3"
    }
]
```
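
With many quants, writing this file by hand gets tedious. A small script can generate it from a list of bitrates. The sketch below follows the field names and folder naming of the sample above; the helper itself and the list of quants passed in are illustrative.

```python
import json


def model_spec(quants, root="path/to"):
    """Build the model spec list from (bitrate label, is_optimized) pairs."""
    spec = []
    for bpw, optimized in quants:
        label = f"EXL3 {bpw}bpw H6" + (" (optimized)" if optimized else "")
        spec.append({
            "load_fn": "exllamav3",
            "fwd_fn": "exllamav3",
            "label": label,
            "model_dir": f"{root}/MiniMaxAI_MiniMax-M2.5-{bpw}bpw-h6-exl3",
        })
    return spec


quants = [("2.0", False), ("2.1", True), ("2.5", True), ("3.0", False)]
spec = model_spec(quants)
print(json.dumps(spec, indent=4))
# To save it: open("model_spec.json", "w").write(json.dumps(spec, indent=4))
```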
This spec file can be passed in to the following command:
```bash
python eval/compare_q.py -d [path/to/dataset_spec.json] -m [path/to/model_spec.json] -lf [path/to/logits.safetensors] -p [-kld] -t [chart_title]
```
Where:
* `path/to/dataset_spec.json` is the path to the dataset spec file described above
* `path/to/model_spec.json` is the path to the model spec file described above
* `path/to/logits.safetensors` is the path to the full model's logits, created above
* `-kld` is optional: the script creates a perplexity chart by default, so add this flag if you want KL divergence instead
* `chart_title` is the chart title in the resulting plot
Results are cached, so if the process crashes after processing one or more models, you just need to restart the script until every model has been tested (don't use the argument that clears the cache). Also note that if you're running this via SSH like me, you may not see anything - the script uses `plt.show()`. I hacked in an extra arg and a `plt.savefig()` call instead.
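
A change along those lines can be as small as the helper below, which saves the figure when an output path is given and falls back to showing it otherwise. This is a sketch of the general pattern, not the actual patch; the function name and its `out_path` parameter are made up for illustration.

```python
def finish_plot(plt, out_path=None, dpi=150):
    """Save the current figure if out_path is given, otherwise show it."""
    if out_path:
        plt.savefig(out_path, dpi=dpi, bbox_inches="tight")
        return out_path
    plt.show()
    return None


# Typical use on a headless machine:
#   import matplotlib
#   matplotlib.use("Agg")            # select a non-GUI backend before importing pyplot
#   import matplotlib.pyplot as plt
#   ... build the chart ...
#   finish_plot(plt, "kld.png")
```

Passing `plt` in as an argument keeps the helper independent of how the script imports matplotlib.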
</details>