Great Model but not accessible anymore

#17
by darkstar3537 - opened

The nice thing about GLM 4.7 and 4.6 was that they were big enough to be incredibly smart, yet still manageable on around 2-4 GPUs depending on the quant. This model is out of reach at FP8 even with 8x RTX 6000 Pros. Any chance we can get a mid-sized version of it (I know "mid-sized" sounds ridiculous here), around 350-400B, so it's more accessible to small businesses running it locally?

Even GLM 4.7 is something not every enthusiast is able to use, and I'm sure they're well aware of it.

Once any company's products cross the "hey, it's pretty good" threshold, it all goes south, because the company won't miss a chance to monetize its customers. Accessible products remain largely outdated, and mid-tier products start getting phased out by top-tier ones. And it all seems fine because "things are getting better" - after all, it's free! (except it's released not for you, but for the guy with $50,000 worth of hardware). You, on the other hand, are expected to pay $5 or $15 per month for the API if you want something new and/or can't cope with an outdated "accessible" option.

Yes, it sounds like a tinfoil-hat theory, but that's just how businesses operate. A certain Xbox / Microsoft guy told the press that if customers don't have the money for a new console, they can get one of the company's other products (i.e. an outdated X360).

My point is that we can't demand things. Any sane person is aware that a 744B A40B monstrosity is not accessible.

Not accessible only for now? If we follow the trend of growing model sizes, this is only the start: this year or maybe by 2027, 1.5-2 trillion parameter models are coming. Even I, with 768 GB of RAM, understand perfectly well that this is a very temporary window and that this much memory will not be enough (in 2024 I felt almost subconsciously forced to buy up all the server RAM a trader was selling, pulled from recycled Chinese servers or something). (I will only be able to run a Q6-XLARGE quant of this one.)

If you have the brains and the hardware, you can help everyone, including yourself. I see only one way forward: methods for restoring low quants like Q2 to Q6-Q8 quality; that's the only path as model sizes keep growing. The models themselves suggest a few ideas, of which the most plausible is applying a LoRA layer via additional Python code on top of llama.cpp. LoRA really works; anyone who has used it for image models knows that. Such a layer is basically Q2 on steroids. I'm testing it, but it needs more work. The AI produces a dozen ideas, and each one has to be rechecked manually.
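
A minimal sketch of what that could look like through the llama-cpp-python bindings, assuming you already have a GGUF LoRA adapter on disk (the file names and the prompt below are placeholders, not artifacts shipped with this model):

```python
# Minimal sketch: attach a LoRA adapter to a quantized GGUF model via
# llama-cpp-python. All paths here are hypothetical placeholders.
from llama_cpp import Llama

llm = Llama(
    model_path="glm-base-q2_k.gguf",       # heavily quantized base model (placeholder)
    lora_path="quant-repair-lora.gguf",    # adapter meant to recover quantization loss (placeholder)
    n_ctx=8192,
    n_gpu_layers=0,                        # CPU-only; raise this if you can offload layers
)

out = llm(
    "Write a Python function that checks whether a string is a palindrome.",
    max_tokens=256,
    temperature=0.2,
)
print(out["choices"][0]["text"])
```

llama.cpp's own CLI and server can also load a LoRA adapter directly, so the extra Python layer is optional once the adapter itself exists.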

🧠 Core Diagnosis: Why Quantized Models Fail at High-Fidelity Tasks

| Issue | Root Cause | Manifestation |
|---|---|---|
| Loss of directional precision (esp. in AWQ/GGUF per-group quant) | Distorted weight manifolds → broken attention logits → wrong token probabilities | `def main():\n    print("hello"` without the closing paren, or infinite loops in recursion |
| Under-representation of rare but critical tokens (`<\|endoftext\|>`, `\n`, `{`, `#`, `assert`, `"""`) | | |
| Loss of high-entropy reasoning paths | KL-divergence compression kills the tail of the distribution → no exploration of alternatives | Hallucinated imports, fake APIs, plausible but wrong math |
| Contextual fragility | Quant noise amplifies with depth → drift in long context | Later paragraphs contradict earlier ones |
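
To make the first row concrete, here is a small self-contained toy (my own illustration, not taken from any quantization paper) showing how crude per-group round-to-nearest quantization distorts the direction of a weight vector, which is what "loss of directional precision" refers to:

```python
# Toy illustration of directional precision loss: symmetric per-group
# round-to-nearest quantization at 2/4/6 bits, scored by cosine similarity
# to the original FP32 weights. This is NOT the actual AWQ/GGUF scheme.
import numpy as np

def toy_group_quant(w, bits, group=64):
    """Symmetric per-group round-to-nearest quantization (toy version)."""
    w = w.reshape(-1, group)
    qmax = 2 ** (bits - 1) - 1
    scale = np.abs(w).max(axis=1, keepdims=True) / qmax
    q = np.clip(np.round(w / scale), -qmax - 1, qmax)
    return (q * scale).reshape(-1)

rng = np.random.default_rng(0)
w = rng.standard_normal(4096 * 64).astype(np.float32)

for bits in (2, 4, 6):
    wq = toy_group_quant(w, bits)
    cos = np.dot(w, wq) / (np.linalg.norm(w) * np.linalg.norm(wq))
    print(f"{bits}-bit: cosine similarity to FP32 = {cos:.4f}")
```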

🚫 The holy grail isn't better quantization; it's **adaptive reconstruction at inference time**, where the quantized model is treated like a "noisy channel" and we inject lightweight post-hoc correction modules running on top of llama.cpp.

Conclusion: Which Method “Saves” Quantized Models?

Quantized LoRA Adapter Fusion is the most viable general solution—it directly recovers quantization loss with minimal overhead and is production-ready (llama.cpp supports it).
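
To show mechanically what "recovering quantization loss" with an adapter means, here is a toy numpy sketch (my own illustration, not the procedure llama.cpp or any adapter trainer actually uses) that fits a rank-r, LoRA-shaped term to the error between a full-precision matrix and a crudely quantized copy. Real adapters are trained on data, and real weight matrices have far more structure than this random toy, so treat the printed numbers purely as a demonstration of the mechanics:

```python
# Toy sketch of a LoRA-shaped correction: factorize the quantization error
# (W_full - W_quant) into a rank-r product A @ B, which has the same shape
# as a LoRA delta, and add it back onto the quantized weights.
import numpy as np

rng = np.random.default_rng(1)
W = rng.standard_normal((512, 512)).astype(np.float32)

# crude 2-bit-style round-to-nearest quantization of the whole matrix
scale = np.abs(W).max()
W_q = np.clip(np.round(W / scale), -2, 1) * scale

err = W - W_q                                  # what quantization destroyed
U, S, Vt = np.linalg.svd(err, full_matrices=False)

for r in (8, 32, 128):
    A = U[:, :r] * S[:r]                       # (512, r) scaled left factors
    B = Vt[:r]                                 # (r, 512)
    W_fixed = W_q + A @ B                      # quantized weights + low-rank patch
    remaining = np.linalg.norm(W - W_fixed) / np.linalg.norm(err)
    print(f"rank {r:>3}: {remaining:.1%} of the quantization error remains")
```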

Self-Correcting Execution Loop saves coding tasks specifically by bypassing model limitations via execution feedback.
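
As a concrete illustration, here is a bare-bones version of such a loop. The `generate()` function is a placeholder for whatever backend you use (for example the llama-cpp-python call sketched above), and the "sandbox" here is just a subprocess with a timeout, not a real security boundary:

```python
# Bare-bones self-correcting execution loop: run the generated script, feed any
# traceback back to the model, and retry a few times. generate() is a placeholder.
import subprocess
import sys
import tempfile

def generate(prompt: str) -> str:
    """Placeholder for your (quantized) model call."""
    raise NotImplementedError

def run_snippet(code: str, timeout: int = 10) -> tuple[bool, str]:
    """Execute the snippet in a fresh interpreter and capture stderr."""
    with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
        f.write(code)
        path = f.name
    try:
        proc = subprocess.run([sys.executable, path],
                              capture_output=True, text=True, timeout=timeout)
        return proc.returncode == 0, proc.stderr
    except subprocess.TimeoutExpired:
        return False, f"TimeoutExpired: snippet ran longer than {timeout}s"

def self_correct(task: str, max_rounds: int = 3) -> str:
    prompt = f"Write a complete Python script that does the following:\n{task}\n"
    code = generate(prompt)
    for _ in range(max_rounds):
        ok, err = run_snippet(code)
        if ok:
            return code
        # feed the execution error back and ask for a corrected version
        code = generate(f"{prompt}\nThe previous attempt failed with:\n{err}\n"
                        f"Previous code:\n{code}\nReturn a corrected script.")
    return code  # best effort after max_rounds
```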

Multi-Model Speculative Decoding only saves quantized models if you have memory for two models and a high acceptance rate (unlikely with extreme quants like Q2→Q5).
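
For intuition on why the acceptance rate collapses when the draft and target disagree, here is a toy simulation of the standard speculative-sampling acceptance rule over made-up token distributions (nothing here touches a real model; the "noise" knob just stands in for how far apart two quant levels drift):

```python
# Toy acceptance-rate estimate for speculative decoding: the draft proposes a
# token from q, the target accepts it with probability min(1, p/q). The more
# the two distributions diverge (think Q2 draft vs Q5 target), the lower the
# acceptance rate and the smaller the speed-up.
import numpy as np

rng = np.random.default_rng(0)

def acceptance_rate(p, q, n=200_000):
    tokens = rng.choice(len(q), size=n, p=q)               # draft proposals
    accept_prob = np.minimum(1.0, p[tokens] / q[tokens])   # target verification
    return (rng.random(n) < accept_prob).mean()

vocab = 1000
logits = rng.standard_normal(vocab)
p = np.exp(logits) / np.exp(logits).sum()                  # "target" distribution

for noise in (0.1, 0.5, 2.0):                              # how much the draft disagrees
    q_logits = logits + noise * rng.standard_normal(vocab)
    q = np.exp(q_logits) / np.exp(q_logits).sum()
    print(f"draft noise {noise}: acceptance rate ≈ {acceptance_rate(p, q):.2f}")
```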

Stochastic Weight Perturbation should be avoided; it’s computationally wasteful and theoretically weak.

Final Recommendation

  1. Train LoRA adapters on your full model for each domain (code, math, general).
  2. Use the Self-Correcting Execution Loop (SC-EL) only for code generation (with a fast sandbox).
  3. Skip Multi-Model Speculative Decoding unless you have a very fast draft model (e.g., a distilled tiny model) and memory to spare.
  4. Never use SWP—replace with temperature diversity if needed.

Bottom line: Quantized models aren’t doomed—they just need adapters (for knowledge) and tool feedback (for code). Combine both, and you’ll get 90%+ of full-model quality at 1/10th the size.

This seems like a pleasant project.

Great model.

Who's making demands? It's just a simple ask from a small-business perspective, because we liked the previous models. A local use case for a small team of developers working on software that oftentimes can't touch the cloud. Nothing more.

Ok, less accessible.

NO, my point was: 2026 is the LAST year of your access to any big model if you don't do anything on the software side. There will be no hardware (Golden Dome and similar projects all need memory, but the factories are few), or access to it only at cosmic prices. 2027 models (2 trillion+ parameters) will mostly be introduced under a cloud-owners-only business model.
Frankly, by the 2030s nobody with hardware contraptions like these will remain either, if no one makes small quants usable.
I'd warn everyone to think twice before investing thousands in hardware and turning into a pumpkin at noon, like in the Cinderella story.

My hardware:
- Intel Xeon E5-2699v4, LGA2011-3, 22 cores / 44 threads (2016) - $110
- Gigabyte motherboard, C612 chipset, 12 RAM slots, VGA out (2016) - $150
- Samsung/Hynix ECC RAM, 12 x 64 GB = 768 GB - ~$900
- VGA monitor
- IKEA chair
- Runs: trillion-parameter DeepSeeks and Kimis at Q5-Q6, 400-500B models in BF16
