Benchmarks
Do you have benchmarks for this model against other models of similar size?
Instead of just publishing finetuned models, it would be great if you could share some benchmarks showing the improvements over the original models.
People like me really do compare metrics/benchmarks across models before putting one into real production use.
So I'm requesting that you publish benchmarks for these finetuned models, for the benefit of the whole AI community.
You're 100% right. I can run some benchmarks for the 8B models, similar to the ones we did for the 4B-Thinking-2507 models on our website (https://www.teichai.com/benchmarks). I would take those benchmarks with a grain of salt, though; the 2507 distills just feel very different from the original Qwen3 ones.
I will run the same benchmarks for all our Qwen3-8B models and upload the results.
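For anyone wanting to reproduce the comparison themselves, here is a minimal sketch using EleutherAI's lm-evaluation-harness. The model IDs and task list are illustrative placeholders, not our actual benchmark setup:

```python
# Sketch: score a base model and a distill on the same task suite so the
# deltas are directly comparable. Model IDs and tasks are placeholders.
import json
import lm_eval  # pip install lm-eval

MODELS = [
    "Qwen/Qwen3-8B",                     # baseline
    "TeichAI/Qwen3-8B-example-distill",  # hypothetical distill ID
]
TASKS = ["mmlu", "gsm8k"]  # use the same suite for every model

for model_id in MODELS:
    results = lm_eval.simple_evaluate(
        model="hf",
        model_args=f"pretrained={model_id}",
        tasks=TASKS,
        batch_size=8,
    )
    # Persist per-model scores so baseline-vs-distill diffs are easy to read
    out_path = f"{model_id.replace('/', '_')}_results.json"
    with open(out_path, "w") as f:
        json.dump(results["results"], f, indent=2)
```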
2.5 Flash makes the most sense, since it is smaller, and distilling from too large a model, without techniques like supervised RL, can even decrease quality (which is reflected in MMLU, where only 2.5 Flash improved and the others lost quality). It also helps that, unlike the others, which had at most 1k distillation questions, Flash had 11k, right? Btw, I really think y'all should leave the censored reasoning traces out; dropping them may also help in distillation and SRL.
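As a rough idea of what dropping censored traces could look like, here is a minimal sketch that filters a JSONL distill set before SFT. The field names ("reasoning", "response") and the refusal marker phrases are assumptions, not the actual dataset schema:

```python
# Sketch: strip refusal/censored reasoning traces from a distillation set.
# Field names and marker strings below are assumptions about the layout,
# not the real TeichAI dataset schema.
import json

REFUSAL_MARKERS = (
    "I can't help with",
    "I cannot assist",
    "I'm sorry, but",
)

def is_censored(example: dict) -> bool:
    # Check both the reasoning trace and the final answer for refusal text
    text = example.get("reasoning", "") + example.get("response", "")
    return any(marker in text for marker in REFUSAL_MARKERS)

with open("distill_traces.jsonl") as src, open("distill_clean.jsonl", "w") as dst:
    kept = dropped = 0
    for line in src:
        example = json.loads(line)
        if is_censored(example):
            dropped += 1
            continue
        dst.write(json.dumps(example) + "\n")
        kept += 1

print(f"kept {kept}, dropped {dropped} censored traces")
```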
Yeah, the biggest reason this model shows the most improvement is the 11k dataset size, which covers all sorts of prompts from different domains. Honestly, I hoped the improvement would be bigger, but I'm still happy with it overall. I can add making a non-reasoning instruct distill of the Gemini models to my list of todos.
Now that we have a NEW model (GLM 4.7 Flash) that pushed gpt-oss-120b off the throne (in speed, quality, and coding), I noticed this repo with Opus 4.5 training material.
But what is the effect of this extra finetuning, and is it proven to help this new GLM 4.7 Flash model?
https://huggingface.co/TeichAI/GLM-4.7-Flash-Claude-Opus-4.5-High-Reasoning-Distill-GGUF
PS: when I speak of the THRONE, I'm only considering models that can run on consumer-grade hardware with 64GB+ unified memory, like the AMD 395+ and the latest Mac models, because those are the ones actually usable offline. The 120B model was long the KING because, despite its size, its active parameter count was so low that it was really fast and therefore usable (versus dense models).
The Opus material is 250 prompts. That does not do much beyond transferring behaviour; do not expect noticeable gains in intelligence. There are only a few models here that show noticeable gains. They have a benchmark site for that, comparing the baseline to the Qwen3-4B (insert model name) distills. Pretty sure only GPT-5 Codex and Gemini 2.5 Flash beat the baseline, and 2.5 Flash was trained on 11k prompts, so that makes sense.
Just now realized I never ran the benchmarks for the other models. I probably won't get around to it, as it's super time-consuming and I'd rather use my compute elsewhere at the moment. When we get around to producing models that actually improve intelligence across the board, we will take benchmarks more seriously.