
cool model !!

#3
by gopi87 - opened

Cool model! I was able to run it at 12 t/s on a 3060 12 GB plus dual CPUs with 256 GB of RAM, but it's still slow compared to GLM 4.7 Flash MXFP4.

Seems like that has more to do with how the weights are split between system RAM and VRAM than with the model itself.
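For context, in llama.cpp that split is controlled mainly by -ngl / --n-gpu-layers (how many layers go to VRAM) and --n-cpu-moe (how many layers keep their MoE expert tensors in system RAM). A minimal sketch, assuming a local llama-server build; the model path and layer counts here are just placeholders:

# Sketch: offload all layers to the GPU, but keep the expert tensors
# of the first 40 layers in system RAM (hypothetical path and values).
CUDA_VISIBLE_DEVICES="0" ./bin/llama-server \
    --model /path/to/model-Q4_K_S.gguf \
    -ngl 99 \
    --n-cpu-moe 40

The more expert tensors stay in RAM, the less VRAM is needed, but every token then has to read those experts over the CPU memory path, which is usually what caps tokens per second.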

@gopi87 12 t/s is impressive. Do you have 256 GB of RAM or a fast SSD? Even GPT-OSS-120B on an RTX 5060 Ti 16 GB produces only 14 t/s with the MoE layers offloaded to RAM and the 5.6 GB of active weights in VRAM.

CUDA_VISIBLE_DEVICES="0" ./bin/llama-server \
    --model "/home/gopi/Storage/GPT OSS 120B/step3p5_flash_Q4_K_S.gguf" \
    --n-cpu-moe 46 \
    -ngl 99 \
    --ctx-size 50000 \
    --threads 40 \
    --threads-batch 40 \
    --host 0.0.0.0 \
    --jinja \
    --port 8080 \
    --temp 0.6 \
    --top-p 0.95

Yep, that's how I run it.
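If I read those flags right, -ngl 99 tries to offload every layer to the GPU, while --n-cpu-moe 46 keeps the MoE expert tensors of the first 46 layers in system RAM, so only the attention/dense weights and the remaining experts have to fit in the 3060's 12 GB; --ctx-size 50000 sets the context window and --jinja enables the model's bundled chat template. The RAM-resident expert traffic is likely what caps it around 12 t/s.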
