Quantized GLM-4.7-Flash with llama.cpp and opencode.


I had a lot of problems running this model on llama.cpp and using it with opencode. Most of the time, tool usage was broken.

The problems were worst when I worked with IQ4_NL quants that I made myself with convert_hf_to_gguf.py and llama-quantize: tool calls came out torn apart and full of syntax errors. When I instead made MXFP4_MOE quants and ran the model that way, those problems vanished.
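For reference, the quantization pipeline is roughly the following two steps. This is a minimal sketch, not my exact command history: the paths, the llama.cpp checkout location, and the output file names are placeholders.

```python
# Sketch: HF checkpoint -> f16 GGUF -> quantized GGUF via llama.cpp tools.
# MODEL_DIR, LLAMA_CPP and the binary path are assumptions for illustration.
import subprocess

MODEL_DIR = "GLM-4.7-Flash"   # assumed local HF snapshot directory
LLAMA_CPP = "llama.cpp"       # assumed llama.cpp checkout (built with cmake)

# 1) Convert the HF checkpoint to an f16 GGUF.
subprocess.run(
    ["python", f"{LLAMA_CPP}/convert_hf_to_gguf.py", MODEL_DIR,
     "--outtype", "f16", "--outfile", "glm-4.7-flash-f16.gguf"],
    check=True,
)

# 2) Quantize the f16 GGUF. IQ4_NL gave me broken tool calls; MXFP4_MOE
#    (needs a llama.cpp build that supports it) worked much better.
subprocess.run(
    [f"{LLAMA_CPP}/build/bin/llama-quantize",
     "glm-4.7-flash-f16.gguf", "glm-4.7-flash-MXFP4_MOE.gguf", "MXFP4_MOE"],
    check=True,
)
```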

I then hit another problem with tool calling. As far as I remember, GLM would retract a tool call mid-stream, which caused errors in opencode. I therefore asked codex to write a proxy that buffers the stream, and that finally worked.
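The idea looks roughly like the sketch below: a small proxy between opencode and llama-server that reads the whole upstream response before replaying anything to the client, so a half-emitted tool call never reaches opencode mid-stream. The port numbers and backend URL are assumptions, and the proxy codex actually generated for me may have done more (e.g. cleaning up retracted tool-call deltas before replaying).

```python
# Minimal stream-buffering proxy sketch (standard library only).
# opencode is pointed at LISTEN_PORT instead of the llama-server port.
from http.server import BaseHTTPRequestHandler, HTTPServer
import urllib.request

BACKEND = "http://127.0.0.1:8080"   # assumed llama-server address
LISTEN_PORT = 8081                  # assumed port opencode talks to

class BufferingProxy(BaseHTTPRequestHandler):
    def do_POST(self):
        body = self.rfile.read(int(self.headers.get("Content-Length", 0)))
        req = urllib.request.Request(
            BACKEND + self.path,
            data=body,
            headers={"Content-Type": "application/json"},
            method="POST",
        )
        # Buffer the entire upstream response (all SSE chunks) before
        # sending a single byte back to the client.
        with urllib.request.urlopen(req) as upstream:
            buffered = upstream.read()
            content_type = upstream.headers.get("Content-Type",
                                                "text/event-stream")
        self.send_response(200)
        self.send_header("Content-Type", content_type)
        self.send_header("Content-Length", str(len(buffered)))
        self.end_headers()
        self.wfile.write(buffered)

if __name__ == "__main__":
    HTTPServer(("127.0.0.1", LISTEN_PORT), BufferingProxy).serve_forever()
```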

Maybe someone can make use of my experience. After I got tool usage working, I tested the model as a programming assistant, but eventually I switched back to gpt-oss-20b.
