MikeRoz committed · Commit 5dae421 · verified · 1 Parent(s): beac5a7

Update README.md

Files changed (1):
  1. README.md +108 -14
README.md CHANGED
@@ -10,28 +10,122 @@ tags:
  - exl3
  ---
 
- exllamav3 quantizations of [MiniMaxAI/MiniMax-M2.5](https://huggingface.co/MiniMaxAI/MiniMax-M2.5).
-
- ### Optimized quants
-
- [2.10 bpw h6](https://huggingface.co/MikeRoz/MiniMax-M2.5-exl3/tree/2.10bpw_H6) 57.292 GiB
- [2.50 bpw h6](https://huggingface.co/MikeRoz/MiniMax-M2.5-exl3/tree/2.50bpw_H6) 67.838 GiB
- [3.06 bpw h6](https://huggingface.co/MikeRoz/MiniMax-M2.5-exl3/tree/3.06bpw_H6) 82.656 GiB
-
- ### Straight quants
-
- As the charts below will show, a 4bpw or 5bpw is still better than an optimized 3.06 or 2.5bpw quant.
-
- [2.00 bpw h6](https://huggingface.co/MikeRoz/MiniMax-M2.5-exl3/tree/2.00bpw_H6) 61.054 GiB
- [3.00 bpw h6](https://huggingface.co/MikeRoz/MiniMax-M2.5-exl3/tree/3.00bpw_H6) 81.613 GiB
- [4.00 bpw h6](https://huggingface.co/MikeRoz/MiniMax-M2.5-exl3/tree/4.00bpw_H6) 108.087 GiB
- [5.00 bpw h6](https://huggingface.co/MikeRoz/MiniMax-M2.5-exl3/tree/5.00bpw_H6) 134.561 GiB
-
- ### K/L-D and PPL graphs
-
- ![KLD Chart](MinMax25kld.png)
- ![PPL Chart](MinMax25ppl.png)
+ [exllamav3](https://github.com/turboderp-org/exllamav3/) quantizations of [MiniMaxAI/MiniMax-M2.5](https://huggingface.co/MiniMaxAI/MiniMax-M2.5).
+
+ | Quant | Size | KLD | PPL |
+ | --- | --- | --- | --- |
+ | [2.00 bpw h6](https://huggingface.co/MikeRoz/MiniMax-M2.5-exl3/tree/2.00bpw_H6) | 61.054 GiB | 0.42365 | 9.31452 |
+ | [2.10 bpw h6](https://huggingface.co/MikeRoz/MiniMax-M2.5-exl3/tree/2.10bpw_H6) (optimized) | 57.292 GiB | 0.36355 | 9.20850 |
+ | [2.50 bpw h6](https://huggingface.co/MikeRoz/MiniMax-M2.5-exl3/tree/2.50bpw_H6) (optimized) | 67.838 GiB | 0.30152 | 8.88802 |
+ | [3.00 bpw h6](https://huggingface.co/MikeRoz/MiniMax-M2.5-exl3/tree/3.00bpw_H6) | 81.613 GiB | 0.17263 | 8.58626 |
+ | [3.06 bpw h6](https://huggingface.co/MikeRoz/MiniMax-M2.5-exl3/tree/3.06bpw_H6) (optimized) | 82.656 GiB | 0.15648 | 8.66856 |
+ | [4.00 bpw h6](https://huggingface.co/MikeRoz/MiniMax-M2.5-exl3/tree/4.00bpw_H6) | 108.087 GiB | 0.07882 | 8.45404 |
+ | [5.00 bpw h6](https://huggingface.co/MikeRoz/MiniMax-M2.5-exl3/tree/5.00bpw_H6) | 134.561 GiB | - | - |
+
+ ### K/L-D and PPL graphs
+
+ ![KLD Chart](MinMax25kld.png)
+ ![PPL Chart](MinMax25ppl.png)
+
+ ### How to create optimized quants
+
+ It's possible to produce quants that are better for a given size than the ones you get by quantizing directly to a target bitrate. The process involves comparing two existing quants, measuring which modules are most affected by the quantization process, and giving those modules precedence for higher precision when assembling a quant at an in-between bitrate.
+
+ <details>
+ <summary>Expand for more details</summary>
+
+ exllamav3 includes a measurement script `util/measure.py` that will compare two exllamav3 models module by module against the original model. The goal is to see which modules are the most affected by the decrease in precision involved in going from a larger quant to a smaller quant.
+
+ The command is:
+ ```bash
+ python util/measure.py -l [level] -d [device] -ms [max_sys_memory] -i [path/to/quant1] [path/to/quant2] -r [path/to/original_model] -o [path/to/measurement.json]
+ ```
+ Where:
+ * `level` is an integer between 0 and 3 that determines the resolution of the measurement. 0 is fastest but least granular, 2 is the default, and 3 is the most granular and slowest.
+ * `device` is the index of the CUDA device that will perform the work
+ * `max_sys_memory` is the amount of system memory, in GiB, that can be used for state data to speed things up
+ * `path/to/quant1` and `path/to/quant2` are the paths to the two quants to compare
+ * `path/to/original_model` is the path to the original model
+ * `path/to/measurement.json` is the path to the resulting JSON measurement file
+
+ The measurement file I created with the command above compared my 2.0bpw_H6 and my 3.0bpw_H6 quants.
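+
+ For illustration, that comparison could be launched with something like this (the paths and parameter values here are hypothetical, adjust them to your setup):
+ ```bash
+ python util/measure.py -l 2 -d 0 -ms 64 \
+     -i /models/MiniMax-M2.5-2.00bpw_H6 /models/MiniMax-M2.5-3.00bpw_H6 \
+     -r /models/MiniMaxAI_MiniMax-M2.5 \
+     -o measurement_MiniMax-M2.5-2.0-3.0.json
+ ```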
+
+ You can then feed this measurement file, along with the two quants, to `util/optimize.py` to create optimized quants that draw modules from both quants where appropriate to get the best result for a given bitrate.
+
+ The command is:
+ ```bash
+ python util/optimize.py -i [path/to/quant1] [path/to/quant2] -o [path/to/resulting_model] -m [path/to/measurement.json] -b [target_bitrate]
+ ```
+ Where:
+ * `path/to/quant1` and `path/to/quant2` are paths to the two source models
+ * `path/to/resulting_model` is the output path
+ * `target_bitrate` is the target bitrate as a decimal number (e.g. 2.10)
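+
+ As a sketch, an optimized quant like the 2.10 bpw one in the table above could be assembled along these lines (paths are hypothetical):
+ ```bash
+ python util/optimize.py \
+     -i /models/MiniMax-M2.5-2.00bpw_H6 /models/MiniMax-M2.5-3.00bpw_H6 \
+     -o /models/MiniMax-M2.5-2.10bpw_H6-optimized \
+     -m measurement_MiniMax-M2.5-2.0-3.0.json \
+     -b 2.10
+ ```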
+
+ You can use a measurement file from one pair of quants with another pair of quants of the same model. When I tried to use the 2.0bpw and 4.0bpw quants to create a 2.25bpw quant, the resulting model came out larger than requested (the substitutions landed at 2.48 bpw), but it was still an improvement over a straight 2.48bpw quant. An explicitly requested 2.48bpw quant drawing from the 2.0bpw and 3.0bpw quants proved to be even better in terms of KL divergence. Finally, I tried creating a 3.25bpw quant from the 3.0bpw and 4.0bpw quants, still using my 2.0-vs-3.0 measurement file. This was not as successful as the optimized 2.25bpw quant, and may have benefited from a 'correct' measurement file that matched the two actual sources.
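+
+ That cross-pair experiment would look roughly like this (hypothetical paths, reusing the 2.0-vs-3.0 measurement file with the 2.0bpw and 4.0bpw sources):
+ ```bash
+ python util/optimize.py \
+     -i /models/MiniMax-M2.5-2.00bpw_H6 /models/MiniMax-M2.5-4.00bpw_H6 \
+     -o /models/MiniMax-M2.5-2.25bpw-optimized \
+     -m measurement_MiniMax-M2.5-2.0-3.0.json \
+     -b 2.25
+ ```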
+ </details>
 
  [measurement.json - 2.0bpw_H6 vs 3.0bpw_H6](https://huggingface.co/MikeRoz/MiniMax-M2.5-exl3/blob/main/measurement_MiniMaxAI_MiniMax-M2.5-2.0-3.0.json)
  [measurement.json - 3.0bpw_H6 vs 4.0bpw_H6](https://huggingface.co/MikeRoz/MiniMax-M2.5-exl3/blob/main/measurement_MiniMaxAI_MiniMax-M2.5-3.0-4.0.json)
  [measurement.json - 4.0bpw_H6 vs 5.0bpw_H6](https://huggingface.co/MikeRoz/MiniMax-M2.5-exl3/blob/main/measurement_MiniMaxAI_MiniMax-M2.5-4.0-5.0.json)
+
+ ### How to measure Perplexity and KL Divergence
+
+ <details>
+ <summary>Expand for details</summary>
+ Measuring KL divergence involves comparing the outputs of the quantized model to the outputs of the original model. If the original model is too large for your hardware to load without quantization, you can run a script to generate its logits, which can then be passed into the comparison script, sparing you the need to load the whole source model during the comparisons.
+
+ First, you'll need to create a dataset spec file. I based mine on `eval/spec/wiki2_llama3_large.json`.
+ ```json
+ {
+     "tokenize_fn": "transformers",
+     "tokenizer_dir": "path/to/full_model",
+     "dataset": "wiki2",
+     "eval_stride": 512,
+     "eval_len": 2048,
+     "max_rows": 100
+ }
+ ```
+
+ I passed this into `eval/compare_q_logits.py` as follows:
+
+ ```bash
+ python eval/compare_q_logits.py -m [path/to/full_model] -o [path/to/output_logits.safetensors] -d [path/to/dataset_spec.json] -rpb [rows_per_batch] -dev [device_index]
+ ```
+ Where:
+ * `path/to/full_model` is the path to the model
+ * `path/to/output_logits.safetensors` is the path to the output logits file
+ * `path/to/dataset_spec.json` is the path to the dataset spec file described above
+ * `rows_per_batch` - I would run out of memory without this parameter. I set it to 32768.
+ * `device_index` - optional CUDA device index
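+
+ As an illustration, a concrete invocation might look like this (paths and file names are hypothetical; the `-rpb` value matches what I used):
+ ```bash
+ python eval/compare_q_logits.py -m /models/MiniMaxAI_MiniMax-M2.5 \
+     -o logits_MiniMax-M2.5_wiki2.safetensors \
+     -d my_dataset_spec.json \
+     -rpb 32768 -dev 0
+ ```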
+
+ Next, you'll need a model spec file that describes all the quants you want in the graph. You'll need to be able to load any model you'd like compared. Here's a sample of the one I used for these quants:
+ ```json
+ [
+     {
+         "load_fn": "exllamav3",
+         "fwd_fn": "exllamav3",
+         "label": "EXL3 2.0bpw H6",
+         "model_dir": "path/to/MiniMaxAI_MiniMax-M2.5-2.0bpw-h6-exl3"
+     },
+     {
+         "load_fn": "exllamav3",
+         "fwd_fn": "exllamav3",
+         "label": "EXL3 2.1bpw H6 (optimized)",
+         "model_dir": "path/to/MiniMaxAI_MiniMax-M2.5-2.1bpw-h6-exl3"
+     }
+ ]
+ ```
+
+ This spec file can be passed to the following command:
+ ```bash
+ python eval/compare_q.py -d [path/to/dataset_spec.json] -m [path/to/model_spec.json] -lf [path/to/logits.safetensors] -p [-kld] -t [chart_title]
+ ```
+ Where:
+ * `path/to/dataset_spec.json` is the path to the dataset spec file described above
+ * `path/to/model_spec.json` is the path to the model spec file described above
+ * `path/to/logits.safetensors` is the path to the full model's logits, created above
+ * `-kld` - the script creates a perplexity chart by default; add this flag if you want KL divergence instead
+ * `chart_title` is the title of the resulting chart
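+
+ For example, the KLD chart could be generated with something along these lines (file names and title are hypothetical, following the template above):
+ ```bash
+ python eval/compare_q.py -d my_dataset_spec.json -m my_model_spec.json \
+     -lf logits_MiniMax-M2.5_wiki2.safetensors \
+     -p -kld -t "MiniMax-M2.5 EXL3 quants"
+ ```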
+
+ Results are cached, so if the process crashes after processing one or more models, you just need to restart the script until every model has been tested (don't use the argument that clears the cache). Also note that if you're running this over SSH like me, you may not see anything - the script displays the chart with `plt.show()`. I hacked in an extra argument and a `plt.savefig()` call instead.
+ </details>