
cool model !!

#3
by gopi87 - opened

Cool model! I was able to run it at 12 t/s on a 3060 12 GB plus dual CPUs with 256 GB of RAM, but it's still slow compared to GLM 4.7 Flash MXFP4.

Seems like that has more to do with how the weights are split between system RAM and VRAM than with the model itself.
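For context, in llama.cpp that split is controlled mainly by -ngl / --n-gpu-layers (how many layers go to VRAM) and --n-cpu-moe (how many layers keep their MoE expert tensors in system RAM). A minimal sketch, assuming a local llama-server build; the model path and layer counts here are just placeholders:

# Sketch: offload all layers to the GPU, but keep the expert tensors
# of the first 40 layers in system RAM (hypothetical path and values).
CUDA_VISIBLE_DEVICES="0" ./bin/llama-server \
    --model /path/to/model-Q4_K_S.gguf \
    -ngl 99 \
    --n-cpu-moe 40

The more expert tensors stay in RAM, the less VRAM is needed, but every token then has to read those experts over the CPU memory path, which is usually what caps tokens per second.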

@gopi87 12 t/s is impressive. Do you have 256 GB of RAM or a fast SSD? Even GPT-OSS-120B on an RTX 5060 Ti 16 GB produces only 14 t/s with the MoE layers offloaded to RAM and the 5.6 GB of active weights in VRAM.

CUDA_VISIBLE_DEVICES="0" ./bin/llama-server \
    --model "/home/gopi/Storage/GPT OSS 120B/step3p5_flash_Q4_K_S.gguf" \
    --n-cpu-moe 46 \
    -ngl 99 \
    --ctx-size 50000 \
    --threads 40 \
    --threads-batch 40 \
    --host 0.0.0.0 \
    --jinja \
    --port 8080 \
    --temp 0.6 \
    --top-p 0.95

Yep, that's how I run it.
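If I read those flags right, -ngl 99 tries to offload every layer to the GPU, while --n-cpu-moe 46 keeps the MoE expert tensors of the first 46 layers in system RAM, so only the attention/dense weights and the remaining experts have to fit in the 3060's 12 GB; --ctx-size 50000 sets the context window and --jinja enables the model's bundled chat template. The RAM-resident expert traffic is likely what caps it around 12 t/s.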
