magiccodingman committed
Commit 92970fa · verified · 1 Parent(s): d98377c

initial upload

This view is limited to 50 files because it contains too many changes.
Files changed (50)
  1. .gitattributes +7 -0
  2. Benchmarks/DataCollection/Qwen3-30B-A3B-Thinking-2507-unsloth-IQ4_NL-attn_kv_IQ4_NL-attn_output_IQ4_NL-attn_q_IQ4_NL-embeddings_IQ4_NL-ffn_down_IQ4_NL-ffn_up_gate_IQ4_NL-lm_head_Q5_K/bench_metrics.json +44 -0
  3. Benchmarks/DataCollection/Qwen3-30B-A3B-Thinking-2507-unsloth-IQ4_NL-attn_kv_IQ4_NL-attn_output_IQ4_NL-attn_q_IQ4_NL-embeddings_IQ4_NL-ffn_down_IQ4_NL-ffn_up_gate_IQ4_NL-lm_head_Q5_K/llamabench.md +11 -0
  4. Benchmarks/DataCollection/Qwen3-30B-A3B-Thinking-2507-unsloth-IQ4_NL-attn_kv_IQ4_NL-attn_output_IQ4_NL-attn_q_IQ4_NL-embeddings_IQ4_NL-ffn_down_IQ4_NL-ffn_up_gate_IQ4_NL-lm_head_Q5_K/perplexity_code.log +177 -0
  5. Benchmarks/DataCollection/Qwen3-30B-A3B-Thinking-2507-unsloth-IQ4_NL-attn_kv_IQ4_NL-attn_output_IQ4_NL-attn_q_IQ4_NL-embeddings_IQ4_NL-ffn_down_IQ4_NL-ffn_up_gate_IQ4_NL-lm_head_Q5_K/perplexity_general.log +177 -0
  6. Benchmarks/DataCollection/Qwen3-30B-A3B-Thinking-2507-unsloth-IQ4_NL-attn_kv_IQ4_NL-attn_output_IQ4_NL-attn_q_IQ4_NL-embeddings_IQ4_NL-ffn_down_IQ4_NL-ffn_up_gate_IQ4_NL-lm_head_Q5_K/perplexity_math.log +177 -0
  7. Benchmarks/DataCollection/Qwen3-30B-A3B-Thinking-2507-unsloth-IQ4_NL-attn_kv_IQ4_NL-attn_output_IQ4_NL-attn_q_IQ4_NL-embeddings_IQ4_NL-ffn_down_IQ4_NL-ffn_up_gate_IQ4_NL-lm_head_Q6_K/bench_metrics.json +44 -0
  8. Benchmarks/DataCollection/Qwen3-30B-A3B-Thinking-2507-unsloth-IQ4_NL-attn_kv_IQ4_NL-attn_output_IQ4_NL-attn_q_IQ4_NL-embeddings_IQ4_NL-ffn_down_IQ4_NL-ffn_up_gate_IQ4_NL-lm_head_Q6_K/llamabench.md +11 -0
  9. Benchmarks/DataCollection/Qwen3-30B-A3B-Thinking-2507-unsloth-IQ4_NL-attn_kv_IQ4_NL-attn_output_IQ4_NL-attn_q_IQ4_NL-embeddings_IQ4_NL-ffn_down_IQ4_NL-ffn_up_gate_IQ4_NL-lm_head_Q6_K/perplexity_code.log +177 -0
  10. Benchmarks/DataCollection/Qwen3-30B-A3B-Thinking-2507-unsloth-IQ4_NL-attn_kv_IQ4_NL-attn_output_IQ4_NL-attn_q_IQ4_NL-embeddings_IQ4_NL-ffn_down_IQ4_NL-ffn_up_gate_IQ4_NL-lm_head_Q6_K/perplexity_general.log +177 -0
  11. Benchmarks/DataCollection/Qwen3-30B-A3B-Thinking-2507-unsloth-IQ4_NL-attn_kv_IQ4_NL-attn_output_IQ4_NL-attn_q_IQ4_NL-embeddings_IQ4_NL-ffn_down_IQ4_NL-ffn_up_gate_IQ4_NL-lm_head_Q6_K/perplexity_math.log +177 -0
  12. Benchmarks/DataCollection/Qwen3-30B-A3B-Thinking-2507-unsloth-IQ4_NL-attn_kv_IQ4_NL-attn_output_IQ4_NL-attn_q_IQ4_NL-embeddings_IQ4_NL-ffn_down_IQ4_NL-ffn_up_gate_IQ4_NL-lm_head_Q8_0/bench_metrics.json +44 -0
  13. Benchmarks/DataCollection/Qwen3-30B-A3B-Thinking-2507-unsloth-IQ4_NL-attn_kv_IQ4_NL-attn_output_IQ4_NL-attn_q_IQ4_NL-embeddings_IQ4_NL-ffn_down_IQ4_NL-ffn_up_gate_IQ4_NL-lm_head_Q8_0/llamabench.md +11 -0
  14. Benchmarks/DataCollection/Qwen3-30B-A3B-Thinking-2507-unsloth-IQ4_NL-attn_kv_IQ4_NL-attn_output_IQ4_NL-attn_q_IQ4_NL-embeddings_IQ4_NL-ffn_down_IQ4_NL-ffn_up_gate_IQ4_NL-lm_head_Q8_0/perplexity_code.log +177 -0
  15. Benchmarks/DataCollection/Qwen3-30B-A3B-Thinking-2507-unsloth-IQ4_NL-attn_kv_IQ4_NL-attn_output_IQ4_NL-attn_q_IQ4_NL-embeddings_IQ4_NL-ffn_down_IQ4_NL-ffn_up_gate_IQ4_NL-lm_head_Q8_0/perplexity_general.log +177 -0
  16. Benchmarks/DataCollection/Qwen3-30B-A3B-Thinking-2507-unsloth-IQ4_NL-attn_kv_IQ4_NL-attn_output_IQ4_NL-attn_q_IQ4_NL-embeddings_IQ4_NL-ffn_down_IQ4_NL-ffn_up_gate_IQ4_NL-lm_head_Q8_0/perplexity_math.log +177 -0
  17. Benchmarks/DataCollection/Qwen3-30B-A3B-Thinking-2507-unsloth-IQ4_NL-attn_kv_IQ4_NL-attn_output_IQ4_NL-attn_q_IQ4_NL-embeddings_Q5_K-ffn_down_IQ4_NL-ffn_up_gate_IQ4_NL-lm_head_IQ4_NL/bench_metrics.json +44 -0
  18. Benchmarks/DataCollection/Qwen3-30B-A3B-Thinking-2507-unsloth-IQ4_NL-attn_kv_IQ4_NL-attn_output_IQ4_NL-attn_q_IQ4_NL-embeddings_Q5_K-ffn_down_IQ4_NL-ffn_up_gate_IQ4_NL-lm_head_IQ4_NL/llamabench.md +11 -0
  19. Benchmarks/DataCollection/Qwen3-30B-A3B-Thinking-2507-unsloth-IQ4_NL-attn_kv_IQ4_NL-attn_output_IQ4_NL-attn_q_IQ4_NL-embeddings_Q5_K-ffn_down_IQ4_NL-ffn_up_gate_IQ4_NL-lm_head_IQ4_NL/perplexity_code.log +177 -0
  20. Benchmarks/DataCollection/Qwen3-30B-A3B-Thinking-2507-unsloth-IQ4_NL-attn_kv_IQ4_NL-attn_output_IQ4_NL-attn_q_IQ4_NL-embeddings_Q5_K-ffn_down_IQ4_NL-ffn_up_gate_IQ4_NL-lm_head_IQ4_NL/perplexity_general.log +177 -0
  21. Benchmarks/DataCollection/Qwen3-30B-A3B-Thinking-2507-unsloth-IQ4_NL-attn_kv_IQ4_NL-attn_output_IQ4_NL-attn_q_IQ4_NL-embeddings_Q5_K-ffn_down_IQ4_NL-ffn_up_gate_IQ4_NL-lm_head_IQ4_NL/perplexity_math.log +177 -0
  22. Benchmarks/DataCollection/Qwen3-30B-A3B-Thinking-2507-unsloth-IQ4_NL-attn_kv_IQ4_NL-attn_output_IQ4_NL-attn_q_IQ4_NL-embeddings_Q6_K-ffn_down_IQ4_NL-ffn_up_gate_IQ4_NL-lm_head_IQ4_NL/bench_metrics.json +44 -0
  23. Benchmarks/DataCollection/Qwen3-30B-A3B-Thinking-2507-unsloth-IQ4_NL-attn_kv_IQ4_NL-attn_output_IQ4_NL-attn_q_IQ4_NL-embeddings_Q6_K-ffn_down_IQ4_NL-ffn_up_gate_IQ4_NL-lm_head_IQ4_NL/llamabench.md +11 -0
  24. Benchmarks/DataCollection/Qwen3-30B-A3B-Thinking-2507-unsloth-IQ4_NL-attn_kv_IQ4_NL-attn_output_IQ4_NL-attn_q_IQ4_NL-embeddings_Q6_K-ffn_down_IQ4_NL-ffn_up_gate_IQ4_NL-lm_head_IQ4_NL/perplexity_code.log +177 -0
  25. Benchmarks/DataCollection/Qwen3-30B-A3B-Thinking-2507-unsloth-IQ4_NL-attn_kv_IQ4_NL-attn_output_IQ4_NL-attn_q_IQ4_NL-embeddings_Q6_K-ffn_down_IQ4_NL-ffn_up_gate_IQ4_NL-lm_head_IQ4_NL/perplexity_general.log +177 -0
  26. Benchmarks/DataCollection/Qwen3-30B-A3B-Thinking-2507-unsloth-IQ4_NL-attn_kv_IQ4_NL-attn_output_IQ4_NL-attn_q_IQ4_NL-embeddings_Q6_K-ffn_down_IQ4_NL-ffn_up_gate_IQ4_NL-lm_head_IQ4_NL/perplexity_math.log +177 -0
  27. Benchmarks/DataCollection/Qwen3-30B-A3B-Thinking-2507-unsloth-IQ4_NL-attn_kv_IQ4_NL-attn_output_IQ4_NL-attn_q_IQ4_NL-embeddings_Q8_0-ffn_down_IQ4_NL-ffn_up_gate_IQ4_NL-lm_head_IQ4_NL/bench_metrics.json +44 -0
  28. Benchmarks/DataCollection/Qwen3-30B-A3B-Thinking-2507-unsloth-IQ4_NL-attn_kv_IQ4_NL-attn_output_IQ4_NL-attn_q_IQ4_NL-embeddings_Q8_0-ffn_down_IQ4_NL-ffn_up_gate_IQ4_NL-lm_head_IQ4_NL/llamabench.md +11 -0
  29. Benchmarks/DataCollection/Qwen3-30B-A3B-Thinking-2507-unsloth-IQ4_NL-attn_kv_IQ4_NL-attn_output_IQ4_NL-attn_q_IQ4_NL-embeddings_Q8_0-ffn_down_IQ4_NL-ffn_up_gate_IQ4_NL-lm_head_IQ4_NL/perplexity_code.log +177 -0
  30. Benchmarks/DataCollection/Qwen3-30B-A3B-Thinking-2507-unsloth-IQ4_NL-attn_kv_IQ4_NL-attn_output_IQ4_NL-attn_q_IQ4_NL-embeddings_Q8_0-ffn_down_IQ4_NL-ffn_up_gate_IQ4_NL-lm_head_IQ4_NL/perplexity_general.log +177 -0
  31. Benchmarks/DataCollection/Qwen3-30B-A3B-Thinking-2507-unsloth-IQ4_NL-attn_kv_IQ4_NL-attn_output_IQ4_NL-attn_q_IQ4_NL-embeddings_Q8_0-ffn_down_IQ4_NL-ffn_up_gate_IQ4_NL-lm_head_IQ4_NL/perplexity_math.log +177 -0
  32. Benchmarks/DataCollection/Qwen3-30B-A3B-Thinking-2507-unsloth-iq4_nl-attn_kv_BF16-attn_output_BF16-attn_q_BF16-embeddings_BF16-ffn_down_BF16-ffn_up_gate_BF16-lm_head_BF16/bench_metrics.json +44 -0
  33. Benchmarks/DataCollection/Qwen3-30B-A3B-Thinking-2507-unsloth-iq4_nl-attn_kv_BF16-attn_output_BF16-attn_q_BF16-embeddings_BF16-ffn_down_BF16-ffn_up_gate_BF16-lm_head_BF16/llamabench.md +11 -0
  34. Benchmarks/DataCollection/Qwen3-30B-A3B-Thinking-2507-unsloth-iq4_nl-attn_kv_BF16-attn_output_BF16-attn_q_BF16-embeddings_BF16-ffn_down_BF16-ffn_up_gate_BF16-lm_head_BF16/perplexity_code.log +176 -0
  35. Benchmarks/DataCollection/Qwen3-30B-A3B-Thinking-2507-unsloth-iq4_nl-attn_kv_BF16-attn_output_BF16-attn_q_BF16-embeddings_BF16-ffn_down_BF16-ffn_up_gate_BF16-lm_head_BF16/perplexity_general.log +176 -0
  36. Benchmarks/DataCollection/Qwen3-30B-A3B-Thinking-2507-unsloth-iq4_nl-attn_kv_BF16-attn_output_BF16-attn_q_BF16-embeddings_BF16-ffn_down_BF16-ffn_up_gate_BF16-lm_head_BF16/perplexity_math.log +176 -0
  37. Benchmarks/DataCollection/Qwen3-30B-A3B-Thinking-2507-unsloth-iq4_nl-attn_kv_IQ4_NL-attn_output_IQ4_NL-attn_q_IQ4_NL-embeddings_IQ4_NL-ffn_down_IQ4_NL-ffn_up_gate_IQ4_NL-lm_head_IQ4_NL/bench_metrics.json +44 -0
  38. Benchmarks/DataCollection/Qwen3-30B-A3B-Thinking-2507-unsloth-iq4_nl-attn_kv_IQ4_NL-attn_output_IQ4_NL-attn_q_IQ4_NL-embeddings_IQ4_NL-ffn_down_IQ4_NL-ffn_up_gate_IQ4_NL-lm_head_IQ4_NL/llamabench.md +11 -0
  39. Benchmarks/DataCollection/Qwen3-30B-A3B-Thinking-2507-unsloth-iq4_nl-attn_kv_IQ4_NL-attn_output_IQ4_NL-attn_q_IQ4_NL-embeddings_IQ4_NL-ffn_down_IQ4_NL-ffn_up_gate_IQ4_NL-lm_head_IQ4_NL/perplexity_code.log +176 -0
  40. Benchmarks/DataCollection/Qwen3-30B-A3B-Thinking-2507-unsloth-iq4_nl-attn_kv_IQ4_NL-attn_output_IQ4_NL-attn_q_IQ4_NL-embeddings_IQ4_NL-ffn_down_IQ4_NL-ffn_up_gate_IQ4_NL-lm_head_IQ4_NL/perplexity_general.log +176 -0
  41. Benchmarks/DataCollection/Qwen3-30B-A3B-Thinking-2507-unsloth-iq4_nl-attn_kv_IQ4_NL-attn_output_IQ4_NL-attn_q_IQ4_NL-embeddings_IQ4_NL-ffn_down_IQ4_NL-ffn_up_gate_IQ4_NL-lm_head_IQ4_NL/perplexity_math.log +176 -0
  42. Benchmarks/DataCollection/Qwen3-30B-A3B-Thinking-2507-unsloth-iq4_nl-attn_kv_IQ4_NL-attn_output_IQ4_NL-attn_q_IQ4_NL-embeddings_MXFP4-ffn_down_IQ4_NL-ffn_up_gate_IQ4_NL-lm_head_Q5_K/bench_metrics.json +44 -0
  43. Benchmarks/DataCollection/Qwen3-30B-A3B-Thinking-2507-unsloth-iq4_nl-attn_kv_IQ4_NL-attn_output_IQ4_NL-attn_q_IQ4_NL-embeddings_MXFP4-ffn_down_IQ4_NL-ffn_up_gate_IQ4_NL-lm_head_Q5_K/llamabench.md +11 -0
  44. Benchmarks/DataCollection/Qwen3-30B-A3B-Thinking-2507-unsloth-iq4_nl-attn_kv_IQ4_NL-attn_output_IQ4_NL-attn_q_IQ4_NL-embeddings_MXFP4-ffn_down_IQ4_NL-ffn_up_gate_IQ4_NL-lm_head_Q5_K/perplexity_code.log +178 -0
  45. Benchmarks/DataCollection/Qwen3-30B-A3B-Thinking-2507-unsloth-iq4_nl-attn_kv_IQ4_NL-attn_output_IQ4_NL-attn_q_IQ4_NL-embeddings_MXFP4-ffn_down_IQ4_NL-ffn_up_gate_IQ4_NL-lm_head_Q5_K/perplexity_general.log +178 -0
  46. Benchmarks/DataCollection/Qwen3-30B-A3B-Thinking-2507-unsloth-iq4_nl-attn_kv_IQ4_NL-attn_output_IQ4_NL-attn_q_IQ4_NL-embeddings_MXFP4-ffn_down_IQ4_NL-ffn_up_gate_IQ4_NL-lm_head_Q5_K/perplexity_math.log +178 -0
  47. Benchmarks/DataCollection/Qwen3-30B-A3B-Thinking-2507-unsloth-iq4_nl-attn_kv_IQ4_NL-attn_output_IQ4_NL-attn_q_IQ4_NL-embeddings_MXFP4-ffn_down_IQ4_NL-ffn_up_gate_IQ4_NL-lm_head_Q8_0/bench_metrics.json +44 -0
  48. Benchmarks/DataCollection/Qwen3-30B-A3B-Thinking-2507-unsloth-iq4_nl-attn_kv_IQ4_NL-attn_output_IQ4_NL-attn_q_IQ4_NL-embeddings_MXFP4-ffn_down_IQ4_NL-ffn_up_gate_IQ4_NL-lm_head_Q8_0/llamabench.md +11 -0
  49. Benchmarks/DataCollection/Qwen3-30B-A3B-Thinking-2507-unsloth-iq4_nl-attn_kv_IQ4_NL-attn_output_IQ4_NL-attn_q_IQ4_NL-embeddings_MXFP4-ffn_down_IQ4_NL-ffn_up_gate_IQ4_NL-lm_head_Q8_0/perplexity_code.log +178 -0
  50. Benchmarks/DataCollection/Qwen3-30B-A3B-Thinking-2507-unsloth-iq4_nl-attn_kv_IQ4_NL-attn_output_IQ4_NL-attn_q_IQ4_NL-embeddings_MXFP4-ffn_down_IQ4_NL-ffn_up_gate_IQ4_NL-lm_head_Q8_0/perplexity_general.log +178 -0
.gitattributes CHANGED
@@ -33,3 +33,10 @@ saved_model/**/* filter=lfs diff=lfs merge=lfs -text
  *.zip filter=lfs diff=lfs merge=lfs -text
  *.zst filter=lfs diff=lfs merge=lfs -text
  *tfevents* filter=lfs diff=lfs merge=lfs -text
+ tokenizer.json filter=lfs diff=lfs merge=lfs -text
+ Qwen3-30B-A3B-Thinking-2507-Q5_K.gguf filter=lfs diff=lfs merge=lfs -text
+ Qwen3-30B-A3B-Thinking-2507-Q8_0.gguf filter=lfs diff=lfs merge=lfs -text
+ Qwen3-30B-A3B-Thinking-2507-iq4_nl-QKOUD-IQ4NL-E-MXFP4-H-Q5K.gguf filter=lfs diff=lfs merge=lfs -text
+ Qwen3-30B-A3B-Thinking-2507-iq4_nl-QKOUD-IQ4NL-EH-Q8_0.gguf filter=lfs diff=lfs merge=lfs -text
+ Qwen3-30B-A3B-Thinking-2507-mxfp4_moe-H-B16-EUD-IQ4NL-R-Q6K-QKO-Q8_0.gguf filter=lfs diff=lfs merge=lfs -text
+ Qwen3-30B-A3B-Thinking-2507-mxfp4_moe-HQKOR-B16-U-Q5K-E-Q6K-D-Q8_0.gguf filter=lfs diff=lfs merge=lfs -text
Benchmarks/DataCollection/Qwen3-30B-A3B-Thinking-2507-unsloth-IQ4_NL-attn_kv_IQ4_NL-attn_output_IQ4_NL-attn_q_IQ4_NL-embeddings_IQ4_NL-ffn_down_IQ4_NL-ffn_up_gate_IQ4_NL-lm_head_Q5_K/bench_metrics.json ADDED
@@ -0,0 +1,44 @@
+ {
+ "raw_metrics": {
+ "llamabench": {
+ "backend": "CUDA",
+ "log_path": "Qwen3-30B-A3B-Thinking-2507-unsloth-IQ4_NL-attn_kv_IQ4_NL-attn_output_IQ4_NL-attn_q_IQ4_NL-embeddings_IQ4_NL-ffn_down_IQ4_NL-ffn_up_gate_IQ4_NL-lm_head_Q5_K/llamabench.md",
+ "ngl": "35",
+ "raw_row": {
+ "backend": "CUDA",
+ "model": "qwen3moe 30B.A3B IQ4_NL - 4.5 bpw",
+ "ngl": "35",
+ "params": "30.53 B",
+ "size": "16.07 GiB",
+ "t/s": "147.62 \u00b1 3.10",
+ "test": "pp8",
+ "tps_value": 147.62
+ },
+ "test": "pp8",
+ "tps": 147.62
+ },
+ "perplexity": {
+ "code": {
+ "log_path": "Qwen3-30B-A3B-Thinking-2507-unsloth-IQ4_NL-attn_kv_IQ4_NL-attn_output_IQ4_NL-attn_q_IQ4_NL-embeddings_IQ4_NL-ffn_down_IQ4_NL-ffn_up_gate_IQ4_NL-lm_head_Q5_K/perplexity_code.log",
+ "ppl": 1.3063,
+ "ppl_error": 0.00707
+ },
+ "general": {
+ "log_path": "Qwen3-30B-A3B-Thinking-2507-unsloth-IQ4_NL-attn_kv_IQ4_NL-attn_output_IQ4_NL-attn_q_IQ4_NL-embeddings_IQ4_NL-ffn_down_IQ4_NL-ffn_up_gate_IQ4_NL-lm_head_Q5_K/perplexity_general.log",
+ "ppl": 6.3691,
+ "ppl_error": 0.12991
+ },
+ "math": {
+ "log_path": "Qwen3-30B-A3B-Thinking-2507-unsloth-IQ4_NL-attn_kv_IQ4_NL-attn_output_IQ4_NL-attn_q_IQ4_NL-embeddings_IQ4_NL-ffn_down_IQ4_NL-ffn_up_gate_IQ4_NL-lm_head_Q5_K/perplexity_math.log",
+ "ppl": 5.7368,
+ "ppl_error": 0.10519
+ }
+ }
+ },
+ "summary": {
+ "avg_prec_loss_pct": 1.1729,
+ "bench_tps": 147.62,
+ "file_size_bytes": 17263163424,
+ "file_size_gb": 16.08
+ }
+ }
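For reference, a minimal sketch (Python assumed; the helper name and the example path comment are illustrative, not part of this commit) of how a bench_metrics.json with the layout above can be loaded to pull out the throughput, size, and perplexity figures:

```python
import json
from pathlib import Path

def load_bench_metrics(path: str) -> dict:
    """Read one quant variant's bench_metrics.json and flatten the key numbers."""
    data = json.loads(Path(path).read_text())
    summary = data["summary"]
    ppl = data["raw_metrics"]["perplexity"]
    return {
        "tps": summary["bench_tps"],                  # llama-bench pp8 tokens/sec
        "size_gb": summary["file_size_gb"],           # GGUF size on disk
        "avg_prec_loss_pct": summary["avg_prec_loss_pct"],
        "ppl": {name: entry["ppl"] for name, entry in ppl.items()},  # code / general / math
    }

# Hypothetical usage against one of the directories listed above:
# print(load_bench_metrics("Benchmarks/DataCollection/<variant>/bench_metrics.json"))
```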
Benchmarks/DataCollection/Qwen3-30B-A3B-Thinking-2507-unsloth-IQ4_NL-attn_kv_IQ4_NL-attn_output_IQ4_NL-attn_q_IQ4_NL-embeddings_IQ4_NL-ffn_down_IQ4_NL-ffn_up_gate_IQ4_NL-lm_head_Q5_K/llamabench.md ADDED
@@ -0,0 +1,11 @@
+ ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
+ ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
+ ggml_cuda_init: found 2 CUDA devices:
+ Device 0: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
+ Device 1: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
+ | model | size | params | backend | ngl | test | t/s |
+ | ------------------------------ | ---------: | ---------: | ---------- | --: | --------------: | -------------------: |
+ | qwen3moe 30B.A3B IQ4_NL - 4.5 bpw | 16.07 GiB | 30.53 B | CUDA | 35 | pp8 | 147.62 ± 3.10 |
+ | qwen3moe 30B.A3B IQ4_NL - 4.5 bpw | 16.07 GiB | 30.53 B | CUDA | 35 | tg128 | 49.93 ± 0.96 |
+
+ build: 92bb442ad (7040)
Benchmarks/DataCollection/Qwen3-30B-A3B-Thinking-2507-unsloth-IQ4_NL-attn_kv_IQ4_NL-attn_output_IQ4_NL-attn_q_IQ4_NL-embeddings_IQ4_NL-ffn_down_IQ4_NL-ffn_up_gate_IQ4_NL-lm_head_Q5_K/perplexity_code.log ADDED
@@ -0,0 +1,177 @@
+ ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
+ ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
+ ggml_cuda_init: found 2 CUDA devices:
+ Device 0: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
+ Device 1: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
+ build: 7040 (92bb442ad) with cc (Ubuntu 13.3.0-6ubuntu2~24.04) 13.3.0 for x86_64-linux-gnu
+ llama_model_load_from_file_impl: using device CUDA0 (NVIDIA GeForce RTX 3090) (0000:01:00.0) - 20685 MiB free
+ llama_model_load_from_file_impl: using device CUDA1 (NVIDIA GeForce RTX 3090) (0000:03:00.0) - 23582 MiB free
+ llama_model_loader: loaded meta data with 39 key-value pairs and 579 tensors from /mnt/world8/AI/ToBench/Qwen3-30B-A3B-Thinking-2507-unsloth/Magic_Quant/GGUF/dc_round0_Qwen3-30B-A3B-Thinking-2507-unsloth-IQ4_NL-attn_kv_IQ4_NL-attn_output_IQ4_NL-attn_q_IQ4_NL-embeddings_IQ4_NL-ffn_down_IQ4_NL-ffn_up_gate_IQ4_NL-lm_head_Q5_K.gguf (version GGUF V3 (latest))
+ llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
+ llama_model_loader: - kv 0: general.architecture str = qwen3moe
+ llama_model_loader: - kv 1: general.type str = model
+ llama_model_loader: - kv 2: general.name str = Qwen3 30B A3B Thinking 2507 Unsloth
+ llama_model_loader: - kv 3: general.version str = 2507
+ llama_model_loader: - kv 4: general.finetune str = Thinking-unsloth
+ llama_model_loader: - kv 5: general.basename str = Qwen3
+ llama_model_loader: - kv 6: general.size_label str = 30B-A3B
+ llama_model_loader: - kv 7: general.license str = apache-2.0
+ llama_model_loader: - kv 8: general.license.link str = https://huggingface.co/Qwen/Qwen3-30B...
+ llama_model_loader: - kv 9: general.base_model.count u32 = 1
+ llama_model_loader: - kv 10: general.base_model.0.name str = Qwen3 30B A3B Thinking 2507
+ llama_model_loader: - kv 11: general.base_model.0.version str = 2507
+ llama_model_loader: - kv 12: general.base_model.0.organization str = Qwen
+ llama_model_loader: - kv 13: general.base_model.0.repo_url str = https://huggingface.co/Qwen/Qwen3-30B...
+ llama_model_loader: - kv 14: general.tags arr[str,3] = ["qwen", "qwen3", "unsloth"]
+ llama_model_loader: - kv 15: qwen3moe.block_count u32 = 48
+ llama_model_loader: - kv 16: qwen3moe.context_length u32 = 262144
+ llama_model_loader: - kv 17: qwen3moe.embedding_length u32 = 2048
+ llama_model_loader: - kv 18: qwen3moe.feed_forward_length u32 = 6144
+ llama_model_loader: - kv 19: qwen3moe.attention.head_count u32 = 32
+ llama_model_loader: - kv 20: qwen3moe.attention.head_count_kv u32 = 4
+ llama_model_loader: - kv 21: qwen3moe.rope.freq_base f32 = 10000000.000000
+ llama_model_loader: - kv 22: qwen3moe.attention.layer_norm_rms_epsilon f32 = 0.000001
+ llama_model_loader: - kv 23: qwen3moe.expert_used_count u32 = 8
+ llama_model_loader: - kv 24: qwen3moe.attention.key_length u32 = 128
+ llama_model_loader: - kv 25: qwen3moe.attention.value_length u32 = 128
+ llama_model_loader: - kv 26: qwen3moe.expert_count u32 = 128
+ llama_model_loader: - kv 27: qwen3moe.expert_feed_forward_length u32 = 768
+ llama_model_loader: - kv 28: tokenizer.ggml.model str = gpt2
+ llama_model_loader: - kv 29: tokenizer.ggml.pre str = qwen2
+ llama_model_loader: - kv 30: tokenizer.ggml.tokens arr[str,151936] = ["!", "\"", "#", "$", "%", "&", "'", ...
+ llama_model_loader: - kv 31: tokenizer.ggml.token_type arr[i32,151936] = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
+ llama_model_loader: - kv 32: tokenizer.ggml.merges arr[str,151387] = ["Ġ Ġ", "ĠĠ ĠĠ", "i n", "Ġ t",...
+ llama_model_loader: - kv 33: tokenizer.ggml.eos_token_id u32 = 151645
+ llama_model_loader: - kv 34: tokenizer.ggml.padding_token_id u32 = 151654
+ llama_model_loader: - kv 35: tokenizer.ggml.add_bos_token bool = false
+ llama_model_loader: - kv 36: tokenizer.chat_template str = {%- if tools %}\n {{- '<|im_start|>...
+ llama_model_loader: - kv 37: general.quantization_version u32 = 2
+ llama_model_loader: - kv 38: general.file_type u32 = 25
+ llama_model_loader: - type f32: 241 tensors
+ llama_model_loader: - type q5_K: 1 tensors
+ llama_model_loader: - type iq4_nl: 337 tensors
+ print_info: file format = GGUF V3 (latest)
+ print_info: file type = IQ4_NL - 4.5 bpw
+ print_info: file size = 16.07 GiB (4.52 BPW)
+ load: printing all EOG tokens:
+ load: - 151643 ('<|endoftext|>')
+ load: - 151645 ('<|im_end|>')
+ load: - 151662 ('<|fim_pad|>')
+ load: - 151663 ('<|repo_name|>')
+ load: - 151664 ('<|file_sep|>')
+ load: special tokens cache size = 26
+ load: token to piece cache size = 0.9311 MB
+ print_info: arch = qwen3moe
+ print_info: vocab_only = 0
+ print_info: n_ctx_train = 262144
+ print_info: n_embd = 2048
+ print_info: n_embd_inp = 2048
+ print_info: n_layer = 48
+ print_info: n_head = 32
+ print_info: n_head_kv = 4
+ print_info: n_rot = 128
+ print_info: n_swa = 0
+ print_info: is_swa_any = 0
+ print_info: n_embd_head_k = 128
+ print_info: n_embd_head_v = 128
+ print_info: n_gqa = 8
+ print_info: n_embd_k_gqa = 512
+ print_info: n_embd_v_gqa = 512
+ print_info: f_norm_eps = 0.0e+00
+ print_info: f_norm_rms_eps = 1.0e-06
+ print_info: f_clamp_kqv = 0.0e+00
+ print_info: f_max_alibi_bias = 0.0e+00
+ print_info: f_logit_scale = 0.0e+00
+ print_info: f_attn_scale = 0.0e+00
+ print_info: n_ff = 6144
+ print_info: n_expert = 128
+ print_info: n_expert_used = 8
+ print_info: n_expert_groups = 0
+ print_info: n_group_used = 0
+ print_info: causal attn = 1
+ print_info: pooling type = 0
+ print_info: rope type = 2
+ print_info: rope scaling = linear
+ print_info: freq_base_train = 10000000.0
+ print_info: freq_scale_train = 1
+ print_info: n_ctx_orig_yarn = 262144
+ print_info: rope_finetuned = unknown
+ print_info: model type = 30B.A3B
+ print_info: model params = 30.53 B
+ print_info: general.name = Qwen3 30B A3B Thinking 2507 Unsloth
+ print_info: n_ff_exp = 768
+ print_info: vocab type = BPE
+ print_info: n_vocab = 151936
+ print_info: n_merges = 151387
+ print_info: BOS token = 11 ','
+ print_info: EOS token = 151645 '<|im_end|>'
+ print_info: EOT token = 151645 '<|im_end|>'
+ print_info: PAD token = 151654 '<|vision_pad|>'
+ print_info: LF token = 198 'Ċ'
+ print_info: FIM PRE token = 151659 '<|fim_prefix|>'
+ print_info: FIM SUF token = 151661 '<|fim_suffix|>'
+ print_info: FIM MID token = 151660 '<|fim_middle|>'
+ print_info: FIM PAD token = 151662 '<|fim_pad|>'
+ print_info: FIM REP token = 151663 '<|repo_name|>'
+ print_info: FIM SEP token = 151664 '<|file_sep|>'
+ print_info: EOG token = 151643 '<|endoftext|>'
+ print_info: EOG token = 151645 '<|im_end|>'
+ print_info: EOG token = 151662 '<|fim_pad|>'
+ print_info: EOG token = 151663 '<|repo_name|>'
+ print_info: EOG token = 151664 '<|file_sep|>'
+ print_info: max token length = 256
+ load_tensors: loading model tensors, this can take a while... (mmap = true)
+ load_tensors: offloading 20 repeating layers to GPU
+ load_tensors: offloaded 20/49 layers to GPU
+ load_tensors: CPU_Mapped model buffer size = 9754.91 MiB
+ load_tensors: CUDA0 model buffer size = 3351.42 MiB
+ load_tensors: CUDA1 model buffer size = 3351.42 MiB
+ ....................................................................................................
+ llama_context: constructing llama_context
+ llama_context: n_seq_max = 1
+ llama_context: n_ctx = 2048
+ llama_context: n_ctx_seq = 2048
+ llama_context: n_batch = 2048
+ llama_context: n_ubatch = 512
+ llama_context: causal_attn = 1
+ llama_context: flash_attn = auto
+ llama_context: kv_unified = false
+ llama_context: freq_base = 10000000.0
+ llama_context: freq_scale = 1
+ llama_context: n_ctx_seq (2048) < n_ctx_train (262144) -- the full capacity of the model will not be utilized
+ llama_context: CPU output buffer size = 0.58 MiB
+ llama_kv_cache: CPU KV buffer size = 112.00 MiB
+ llama_kv_cache: CUDA0 KV buffer size = 40.00 MiB
+ llama_kv_cache: CUDA1 KV buffer size = 40.00 MiB
+ llama_kv_cache: size = 192.00 MiB ( 2048 cells, 48 layers, 1/1 seqs), K (f16): 96.00 MiB, V (f16): 96.00 MiB
+ llama_context: Flash Attention was auto, set to enabled
+ llama_context: CUDA0 compute buffer size = 504.77 MiB
+ llama_context: CUDA1 compute buffer size = 82.01 MiB
+ llama_context: CUDA_Host compute buffer size = 8.01 MiB
+ llama_context: graph nodes = 3031
+ llama_context: graph splits = 397 (with bs=512), 88 (with bs=1)
+ common_init_from_params: added <|endoftext|> logit bias = -inf
+ common_init_from_params: added <|im_end|> logit bias = -inf
+ common_init_from_params: added <|fim_pad|> logit bias = -inf
+ common_init_from_params: added <|repo_name|> logit bias = -inf
+ common_init_from_params: added <|file_sep|> logit bias = -inf
+ common_init_from_params: setting dry_penalty_last_n to ctx_size = 2048
+ common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable)
+
+ system_info: n_threads = 16 (n_threads_batch = 16) / 32 | CUDA : ARCHS = 860 | USE_GRAPHS = 1 | PEER_MAX_BATCH_SIZE = 128 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | BMI2 = 1 | AVX512 = 1 | AVX512_VBMI = 1 | AVX512_VNNI = 1 | AVX512_BF16 = 1 | LLAMAFILE = 1 | OPENMP = 1 | REPACK = 1 |
+ perplexity: tokenizing the input ..
+ perplexity: tokenization took 111.624 ms
+ perplexity: calculating perplexity over 44 chunks, n_ctx=2048, batch_size=2048, n_seq=1
+ perplexity: 3.31 seconds per pass - ETA 2.42 minutes
+ [1]1.5158,[2]1.4623,[3]1.2888,[4]1.2489,[5]1.3367,[6]1.3996,[7]1.4022,[8]1.4004,[9]1.3605,[10]1.3375,[11]1.3229,[12]1.3248,[13]1.3094,[14]1.3004,[15]1.2965,[16]1.2849,[17]1.2792,[18]1.2783,[19]1.2709,[20]1.2615,[21]1.2587,[22]1.2589,[23]1.2749,[24]1.2678,[25]1.2660,[26]1.2573,[27]1.2519,[28]1.2510,[29]1.2637,[30]1.2656,[31]1.2592,[32]1.2541,[33]1.2549,[34]1.2541,[35]1.2530,[36]1.2742,[37]1.2836,[38]1.2884,[39]1.2952,[40]1.2964,[41]1.2930,[42]1.3064,[43]1.3059,[44]1.3063,
+ Final estimate: PPL = 1.3063 +/- 0.00707
+
+ llama_perf_context_print: load time = 2360.00 ms
+ llama_perf_context_print: prompt eval time = 123483.57 ms / 90112 tokens ( 1.37 ms per token, 729.75 tokens per second)
+ llama_perf_context_print: eval time = 0.00 ms / 1 runs ( 0.00 ms per token, inf tokens per second)
+ llama_perf_context_print: total time = 124693.97 ms / 90113 tokens
+ llama_perf_context_print: graphs reused = 0
+ llama_memory_breakdown_print: | memory breakdown [MiB] | total free self model context compute unaccounted |
+ llama_memory_breakdown_print: | - CUDA0 (RTX 3090) | 24115 = 16670 + (3896 = 3351 + 40 + 504) + 3549 |
+ llama_memory_breakdown_print: | - CUDA1 (RTX 3090) | 24124 = 20000 + (3473 = 3351 + 40 + 82) + 649 |
+ llama_memory_breakdown_print: | - Host | 9874 = 9754 + 112 + 8 |
Benchmarks/DataCollection/Qwen3-30B-A3B-Thinking-2507-unsloth-IQ4_NL-attn_kv_IQ4_NL-attn_output_IQ4_NL-attn_q_IQ4_NL-embeddings_IQ4_NL-ffn_down_IQ4_NL-ffn_up_gate_IQ4_NL-lm_head_Q5_K/perplexity_general.log ADDED
@@ -0,0 +1,177 @@
+ ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
+ ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
+ ggml_cuda_init: found 2 CUDA devices:
+ Device 0: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
+ Device 1: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
+ build: 7040 (92bb442ad) with cc (Ubuntu 13.3.0-6ubuntu2~24.04) 13.3.0 for x86_64-linux-gnu
+ llama_model_load_from_file_impl: using device CUDA0 (NVIDIA GeForce RTX 3090) (0000:01:00.0) - 20692 MiB free
+ llama_model_load_from_file_impl: using device CUDA1 (NVIDIA GeForce RTX 3090) (0000:03:00.0) - 23582 MiB free
+ llama_model_loader: loaded meta data with 39 key-value pairs and 579 tensors from /mnt/world8/AI/ToBench/Qwen3-30B-A3B-Thinking-2507-unsloth/Magic_Quant/GGUF/dc_round0_Qwen3-30B-A3B-Thinking-2507-unsloth-IQ4_NL-attn_kv_IQ4_NL-attn_output_IQ4_NL-attn_q_IQ4_NL-embeddings_IQ4_NL-ffn_down_IQ4_NL-ffn_up_gate_IQ4_NL-lm_head_Q5_K.gguf (version GGUF V3 (latest))
+ llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
+ llama_model_loader: - kv 0: general.architecture str = qwen3moe
+ llama_model_loader: - kv 1: general.type str = model
+ llama_model_loader: - kv 2: general.name str = Qwen3 30B A3B Thinking 2507 Unsloth
+ llama_model_loader: - kv 3: general.version str = 2507
+ llama_model_loader: - kv 4: general.finetune str = Thinking-unsloth
+ llama_model_loader: - kv 5: general.basename str = Qwen3
+ llama_model_loader: - kv 6: general.size_label str = 30B-A3B
+ llama_model_loader: - kv 7: general.license str = apache-2.0
+ llama_model_loader: - kv 8: general.license.link str = https://huggingface.co/Qwen/Qwen3-30B...
+ llama_model_loader: - kv 9: general.base_model.count u32 = 1
+ llama_model_loader: - kv 10: general.base_model.0.name str = Qwen3 30B A3B Thinking 2507
+ llama_model_loader: - kv 11: general.base_model.0.version str = 2507
+ llama_model_loader: - kv 12: general.base_model.0.organization str = Qwen
+ llama_model_loader: - kv 13: general.base_model.0.repo_url str = https://huggingface.co/Qwen/Qwen3-30B...
+ llama_model_loader: - kv 14: general.tags arr[str,3] = ["qwen", "qwen3", "unsloth"]
+ llama_model_loader: - kv 15: qwen3moe.block_count u32 = 48
+ llama_model_loader: - kv 16: qwen3moe.context_length u32 = 262144
+ llama_model_loader: - kv 17: qwen3moe.embedding_length u32 = 2048
+ llama_model_loader: - kv 18: qwen3moe.feed_forward_length u32 = 6144
+ llama_model_loader: - kv 19: qwen3moe.attention.head_count u32 = 32
+ llama_model_loader: - kv 20: qwen3moe.attention.head_count_kv u32 = 4
+ llama_model_loader: - kv 21: qwen3moe.rope.freq_base f32 = 10000000.000000
+ llama_model_loader: - kv 22: qwen3moe.attention.layer_norm_rms_epsilon f32 = 0.000001
+ llama_model_loader: - kv 23: qwen3moe.expert_used_count u32 = 8
+ llama_model_loader: - kv 24: qwen3moe.attention.key_length u32 = 128
+ llama_model_loader: - kv 25: qwen3moe.attention.value_length u32 = 128
+ llama_model_loader: - kv 26: qwen3moe.expert_count u32 = 128
+ llama_model_loader: - kv 27: qwen3moe.expert_feed_forward_length u32 = 768
+ llama_model_loader: - kv 28: tokenizer.ggml.model str = gpt2
+ llama_model_loader: - kv 29: tokenizer.ggml.pre str = qwen2
+ llama_model_loader: - kv 30: tokenizer.ggml.tokens arr[str,151936] = ["!", "\"", "#", "$", "%", "&", "'", ...
+ llama_model_loader: - kv 31: tokenizer.ggml.token_type arr[i32,151936] = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
+ llama_model_loader: - kv 32: tokenizer.ggml.merges arr[str,151387] = ["Ġ Ġ", "ĠĠ ĠĠ", "i n", "Ġ t",...
+ llama_model_loader: - kv 33: tokenizer.ggml.eos_token_id u32 = 151645
+ llama_model_loader: - kv 34: tokenizer.ggml.padding_token_id u32 = 151654
+ llama_model_loader: - kv 35: tokenizer.ggml.add_bos_token bool = false
+ llama_model_loader: - kv 36: tokenizer.chat_template str = {%- if tools %}\n {{- '<|im_start|>...
+ llama_model_loader: - kv 37: general.quantization_version u32 = 2
+ llama_model_loader: - kv 38: general.file_type u32 = 25
+ llama_model_loader: - type f32: 241 tensors
+ llama_model_loader: - type q5_K: 1 tensors
+ llama_model_loader: - type iq4_nl: 337 tensors
+ print_info: file format = GGUF V3 (latest)
+ print_info: file type = IQ4_NL - 4.5 bpw
+ print_info: file size = 16.07 GiB (4.52 BPW)
+ load: printing all EOG tokens:
+ load: - 151643 ('<|endoftext|>')
+ load: - 151645 ('<|im_end|>')
+ load: - 151662 ('<|fim_pad|>')
+ load: - 151663 ('<|repo_name|>')
+ load: - 151664 ('<|file_sep|>')
+ load: special tokens cache size = 26
+ load: token to piece cache size = 0.9311 MB
+ print_info: arch = qwen3moe
+ print_info: vocab_only = 0
+ print_info: n_ctx_train = 262144
+ print_info: n_embd = 2048
+ print_info: n_embd_inp = 2048
+ print_info: n_layer = 48
+ print_info: n_head = 32
+ print_info: n_head_kv = 4
+ print_info: n_rot = 128
+ print_info: n_swa = 0
+ print_info: is_swa_any = 0
+ print_info: n_embd_head_k = 128
+ print_info: n_embd_head_v = 128
+ print_info: n_gqa = 8
+ print_info: n_embd_k_gqa = 512
+ print_info: n_embd_v_gqa = 512
+ print_info: f_norm_eps = 0.0e+00
+ print_info: f_norm_rms_eps = 1.0e-06
+ print_info: f_clamp_kqv = 0.0e+00
+ print_info: f_max_alibi_bias = 0.0e+00
+ print_info: f_logit_scale = 0.0e+00
+ print_info: f_attn_scale = 0.0e+00
+ print_info: n_ff = 6144
+ print_info: n_expert = 128
+ print_info: n_expert_used = 8
+ print_info: n_expert_groups = 0
+ print_info: n_group_used = 0
+ print_info: causal attn = 1
+ print_info: pooling type = 0
+ print_info: rope type = 2
+ print_info: rope scaling = linear
+ print_info: freq_base_train = 10000000.0
+ print_info: freq_scale_train = 1
+ print_info: n_ctx_orig_yarn = 262144
+ print_info: rope_finetuned = unknown
+ print_info: model type = 30B.A3B
+ print_info: model params = 30.53 B
+ print_info: general.name = Qwen3 30B A3B Thinking 2507 Unsloth
+ print_info: n_ff_exp = 768
+ print_info: vocab type = BPE
+ print_info: n_vocab = 151936
+ print_info: n_merges = 151387
+ print_info: BOS token = 11 ','
+ print_info: EOS token = 151645 '<|im_end|>'
+ print_info: EOT token = 151645 '<|im_end|>'
+ print_info: PAD token = 151654 '<|vision_pad|>'
+ print_info: LF token = 198 'Ċ'
+ print_info: FIM PRE token = 151659 '<|fim_prefix|>'
+ print_info: FIM SUF token = 151661 '<|fim_suffix|>'
+ print_info: FIM MID token = 151660 '<|fim_middle|>'
+ print_info: FIM PAD token = 151662 '<|fim_pad|>'
+ print_info: FIM REP token = 151663 '<|repo_name|>'
+ print_info: FIM SEP token = 151664 '<|file_sep|>'
+ print_info: EOG token = 151643 '<|endoftext|>'
+ print_info: EOG token = 151645 '<|im_end|>'
+ print_info: EOG token = 151662 '<|fim_pad|>'
+ print_info: EOG token = 151663 '<|repo_name|>'
+ print_info: EOG token = 151664 '<|file_sep|>'
+ print_info: max token length = 256
+ load_tensors: loading model tensors, this can take a while... (mmap = true)
+ load_tensors: offloading 20 repeating layers to GPU
+ load_tensors: offloaded 20/49 layers to GPU
+ load_tensors: CPU_Mapped model buffer size = 9754.91 MiB
+ load_tensors: CUDA0 model buffer size = 3351.42 MiB
+ load_tensors: CUDA1 model buffer size = 3351.42 MiB
+ ....................................................................................................
+ llama_context: constructing llama_context
+ llama_context: n_seq_max = 1
+ llama_context: n_ctx = 2048
+ llama_context: n_ctx_seq = 2048
+ llama_context: n_batch = 2048
+ llama_context: n_ubatch = 512
+ llama_context: causal_attn = 1
+ llama_context: flash_attn = auto
+ llama_context: kv_unified = false
+ llama_context: freq_base = 10000000.0
+ llama_context: freq_scale = 1
+ llama_context: n_ctx_seq (2048) < n_ctx_train (262144) -- the full capacity of the model will not be utilized
+ llama_context: CPU output buffer size = 0.58 MiB
+ llama_kv_cache: CPU KV buffer size = 112.00 MiB
+ llama_kv_cache: CUDA0 KV buffer size = 40.00 MiB
+ llama_kv_cache: CUDA1 KV buffer size = 40.00 MiB
+ llama_kv_cache: size = 192.00 MiB ( 2048 cells, 48 layers, 1/1 seqs), K (f16): 96.00 MiB, V (f16): 96.00 MiB
+ llama_context: Flash Attention was auto, set to enabled
+ llama_context: CUDA0 compute buffer size = 504.77 MiB
+ llama_context: CUDA1 compute buffer size = 82.01 MiB
+ llama_context: CUDA_Host compute buffer size = 8.01 MiB
+ llama_context: graph nodes = 3031
+ llama_context: graph splits = 397 (with bs=512), 88 (with bs=1)
+ common_init_from_params: added <|endoftext|> logit bias = -inf
+ common_init_from_params: added <|im_end|> logit bias = -inf
+ common_init_from_params: added <|fim_pad|> logit bias = -inf
+ common_init_from_params: added <|repo_name|> logit bias = -inf
+ common_init_from_params: added <|file_sep|> logit bias = -inf
+ common_init_from_params: setting dry_penalty_last_n to ctx_size = 2048
+ common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable)
+
+ system_info: n_threads = 16 (n_threads_batch = 16) / 32 | CUDA : ARCHS = 860 | USE_GRAPHS = 1 | PEER_MAX_BATCH_SIZE = 128 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | BMI2 = 1 | AVX512 = 1 | AVX512_VBMI = 1 | AVX512_VNNI = 1 | AVX512_BF16 = 1 | LLAMAFILE = 1 | OPENMP = 1 | REPACK = 1 |
+ perplexity: tokenizing the input ..
+ perplexity: tokenization took 49.041 ms
+ perplexity: calculating perplexity over 15 chunks, n_ctx=2048, batch_size=2048, n_seq=1
+ perplexity: 3.25 seconds per pass - ETA 0.80 minutes
+ [1]5.4570,[2]6.4839,[3]6.8398,[4]6.7548,[5]6.6892,[6]5.7698,[7]5.2742,[8]5.2992,[9]5.6087,[10]5.7400,[11]5.7881,[12]6.0905,[13]6.1602,[14]6.2905,[15]6.3691,
+ Final estimate: PPL = 6.3691 +/- 0.12991
+
+ llama_perf_context_print: load time = 2312.99 ms
+ llama_perf_context_print: prompt eval time = 44838.75 ms / 30720 tokens ( 1.46 ms per token, 685.12 tokens per second)
+ llama_perf_context_print: eval time = 0.00 ms / 1 runs ( 0.00 ms per token, inf tokens per second)
+ llama_perf_context_print: total time = 45404.54 ms / 30721 tokens
+ llama_perf_context_print: graphs reused = 0
+ llama_memory_breakdown_print: | memory breakdown [MiB] | total free self model context compute unaccounted |
+ llama_memory_breakdown_print: | - CUDA0 (RTX 3090) | 24115 = 16681 + (3896 = 3351 + 40 + 504) + 3537 |
+ llama_memory_breakdown_print: | - CUDA1 (RTX 3090) | 24124 = 20000 + (3473 = 3351 + 40 + 82) + 649 |
+ llama_memory_breakdown_print: | - Host | 9874 = 9754 + 112 + 8 |
Benchmarks/DataCollection/Qwen3-30B-A3B-Thinking-2507-unsloth-IQ4_NL-attn_kv_IQ4_NL-attn_output_IQ4_NL-attn_q_IQ4_NL-embeddings_IQ4_NL-ffn_down_IQ4_NL-ffn_up_gate_IQ4_NL-lm_head_Q5_K/perplexity_math.log ADDED
@@ -0,0 +1,177 @@
+ ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
+ ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
+ ggml_cuda_init: found 2 CUDA devices:
+ Device 0: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
+ Device 1: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
+ build: 7040 (92bb442ad) with cc (Ubuntu 13.3.0-6ubuntu2~24.04) 13.3.0 for x86_64-linux-gnu
+ llama_model_load_from_file_impl: using device CUDA0 (NVIDIA GeForce RTX 3090) (0000:01:00.0) - 20668 MiB free
+ llama_model_load_from_file_impl: using device CUDA1 (NVIDIA GeForce RTX 3090) (0000:03:00.0) - 23582 MiB free
+ llama_model_loader: loaded meta data with 39 key-value pairs and 579 tensors from /mnt/world8/AI/ToBench/Qwen3-30B-A3B-Thinking-2507-unsloth/Magic_Quant/GGUF/dc_round0_Qwen3-30B-A3B-Thinking-2507-unsloth-IQ4_NL-attn_kv_IQ4_NL-attn_output_IQ4_NL-attn_q_IQ4_NL-embeddings_IQ4_NL-ffn_down_IQ4_NL-ffn_up_gate_IQ4_NL-lm_head_Q5_K.gguf (version GGUF V3 (latest))
+ llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
+ llama_model_loader: - kv 0: general.architecture str = qwen3moe
+ llama_model_loader: - kv 1: general.type str = model
+ llama_model_loader: - kv 2: general.name str = Qwen3 30B A3B Thinking 2507 Unsloth
+ llama_model_loader: - kv 3: general.version str = 2507
+ llama_model_loader: - kv 4: general.finetune str = Thinking-unsloth
+ llama_model_loader: - kv 5: general.basename str = Qwen3
+ llama_model_loader: - kv 6: general.size_label str = 30B-A3B
+ llama_model_loader: - kv 7: general.license str = apache-2.0
+ llama_model_loader: - kv 8: general.license.link str = https://huggingface.co/Qwen/Qwen3-30B...
+ llama_model_loader: - kv 9: general.base_model.count u32 = 1
+ llama_model_loader: - kv 10: general.base_model.0.name str = Qwen3 30B A3B Thinking 2507
+ llama_model_loader: - kv 11: general.base_model.0.version str = 2507
+ llama_model_loader: - kv 12: general.base_model.0.organization str = Qwen
+ llama_model_loader: - kv 13: general.base_model.0.repo_url str = https://huggingface.co/Qwen/Qwen3-30B...
+ llama_model_loader: - kv 14: general.tags arr[str,3] = ["qwen", "qwen3", "unsloth"]
+ llama_model_loader: - kv 15: qwen3moe.block_count u32 = 48
+ llama_model_loader: - kv 16: qwen3moe.context_length u32 = 262144
+ llama_model_loader: - kv 17: qwen3moe.embedding_length u32 = 2048
+ llama_model_loader: - kv 18: qwen3moe.feed_forward_length u32 = 6144
+ llama_model_loader: - kv 19: qwen3moe.attention.head_count u32 = 32
+ llama_model_loader: - kv 20: qwen3moe.attention.head_count_kv u32 = 4
+ llama_model_loader: - kv 21: qwen3moe.rope.freq_base f32 = 10000000.000000
+ llama_model_loader: - kv 22: qwen3moe.attention.layer_norm_rms_epsilon f32 = 0.000001
+ llama_model_loader: - kv 23: qwen3moe.expert_used_count u32 = 8
+ llama_model_loader: - kv 24: qwen3moe.attention.key_length u32 = 128
+ llama_model_loader: - kv 25: qwen3moe.attention.value_length u32 = 128
+ llama_model_loader: - kv 26: qwen3moe.expert_count u32 = 128
+ llama_model_loader: - kv 27: qwen3moe.expert_feed_forward_length u32 = 768
+ llama_model_loader: - kv 28: tokenizer.ggml.model str = gpt2
+ llama_model_loader: - kv 29: tokenizer.ggml.pre str = qwen2
+ llama_model_loader: - kv 30: tokenizer.ggml.tokens arr[str,151936] = ["!", "\"", "#", "$", "%", "&", "'", ...
+ llama_model_loader: - kv 31: tokenizer.ggml.token_type arr[i32,151936] = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
+ llama_model_loader: - kv 32: tokenizer.ggml.merges arr[str,151387] = ["Ġ Ġ", "ĠĠ ĠĠ", "i n", "Ġ t",...
+ llama_model_loader: - kv 33: tokenizer.ggml.eos_token_id u32 = 151645
+ llama_model_loader: - kv 34: tokenizer.ggml.padding_token_id u32 = 151654
+ llama_model_loader: - kv 35: tokenizer.ggml.add_bos_token bool = false
+ llama_model_loader: - kv 36: tokenizer.chat_template str = {%- if tools %}\n {{- '<|im_start|>...
+ llama_model_loader: - kv 37: general.quantization_version u32 = 2
+ llama_model_loader: - kv 38: general.file_type u32 = 25
+ llama_model_loader: - type f32: 241 tensors
+ llama_model_loader: - type q5_K: 1 tensors
+ llama_model_loader: - type iq4_nl: 337 tensors
+ print_info: file format = GGUF V3 (latest)
+ print_info: file type = IQ4_NL - 4.5 bpw
+ print_info: file size = 16.07 GiB (4.52 BPW)
+ load: printing all EOG tokens:
+ load: - 151643 ('<|endoftext|>')
+ load: - 151645 ('<|im_end|>')
+ load: - 151662 ('<|fim_pad|>')
+ load: - 151663 ('<|repo_name|>')
+ load: - 151664 ('<|file_sep|>')
+ load: special tokens cache size = 26
+ load: token to piece cache size = 0.9311 MB
+ print_info: arch = qwen3moe
+ print_info: vocab_only = 0
+ print_info: n_ctx_train = 262144
+ print_info: n_embd = 2048
+ print_info: n_embd_inp = 2048
+ print_info: n_layer = 48
+ print_info: n_head = 32
+ print_info: n_head_kv = 4
+ print_info: n_rot = 128
+ print_info: n_swa = 0
+ print_info: is_swa_any = 0
+ print_info: n_embd_head_k = 128
+ print_info: n_embd_head_v = 128
+ print_info: n_gqa = 8
+ print_info: n_embd_k_gqa = 512
+ print_info: n_embd_v_gqa = 512
+ print_info: f_norm_eps = 0.0e+00
+ print_info: f_norm_rms_eps = 1.0e-06
+ print_info: f_clamp_kqv = 0.0e+00
+ print_info: f_max_alibi_bias = 0.0e+00
+ print_info: f_logit_scale = 0.0e+00
+ print_info: f_attn_scale = 0.0e+00
+ print_info: n_ff = 6144
+ print_info: n_expert = 128
+ print_info: n_expert_used = 8
+ print_info: n_expert_groups = 0
+ print_info: n_group_used = 0
+ print_info: causal attn = 1
+ print_info: pooling type = 0
+ print_info: rope type = 2
+ print_info: rope scaling = linear
+ print_info: freq_base_train = 10000000.0
+ print_info: freq_scale_train = 1
+ print_info: n_ctx_orig_yarn = 262144
+ print_info: rope_finetuned = unknown
+ print_info: model type = 30B.A3B
+ print_info: model params = 30.53 B
+ print_info: general.name = Qwen3 30B A3B Thinking 2507 Unsloth
+ print_info: n_ff_exp = 768
+ print_info: vocab type = BPE
+ print_info: n_vocab = 151936
+ print_info: n_merges = 151387
+ print_info: BOS token = 11 ','
+ print_info: EOS token = 151645 '<|im_end|>'
+ print_info: EOT token = 151645 '<|im_end|>'
+ print_info: PAD token = 151654 '<|vision_pad|>'
+ print_info: LF token = 198 'Ċ'
+ print_info: FIM PRE token = 151659 '<|fim_prefix|>'
+ print_info: FIM SUF token = 151661 '<|fim_suffix|>'
+ print_info: FIM MID token = 151660 '<|fim_middle|>'
+ print_info: FIM PAD token = 151662 '<|fim_pad|>'
+ print_info: FIM REP token = 151663 '<|repo_name|>'
+ print_info: FIM SEP token = 151664 '<|file_sep|>'
+ print_info: EOG token = 151643 '<|endoftext|>'
+ print_info: EOG token = 151645 '<|im_end|>'
+ print_info: EOG token = 151662 '<|fim_pad|>'
+ print_info: EOG token = 151663 '<|repo_name|>'
+ print_info: EOG token = 151664 '<|file_sep|>'
+ print_info: max token length = 256
+ load_tensors: loading model tensors, this can take a while... (mmap = true)
+ load_tensors: offloading 20 repeating layers to GPU
+ load_tensors: offloaded 20/49 layers to GPU
+ load_tensors: CPU_Mapped model buffer size = 9754.91 MiB
+ load_tensors: CUDA0 model buffer size = 3351.42 MiB
+ load_tensors: CUDA1 model buffer size = 3351.42 MiB
+ ....................................................................................................
+ llama_context: constructing llama_context
+ llama_context: n_seq_max = 1
+ llama_context: n_ctx = 2048
+ llama_context: n_ctx_seq = 2048
+ llama_context: n_batch = 2048
+ llama_context: n_ubatch = 512
+ llama_context: causal_attn = 1
+ llama_context: flash_attn = auto
+ llama_context: kv_unified = false
+ llama_context: freq_base = 10000000.0
+ llama_context: freq_scale = 1
+ llama_context: n_ctx_seq (2048) < n_ctx_train (262144) -- the full capacity of the model will not be utilized
+ llama_context: CPU output buffer size = 0.58 MiB
+ llama_kv_cache: CPU KV buffer size = 112.00 MiB
+ llama_kv_cache: CUDA0 KV buffer size = 40.00 MiB
+ llama_kv_cache: CUDA1 KV buffer size = 40.00 MiB
+ llama_kv_cache: size = 192.00 MiB ( 2048 cells, 48 layers, 1/1 seqs), K (f16): 96.00 MiB, V (f16): 96.00 MiB
+ llama_context: Flash Attention was auto, set to enabled
+ llama_context: CUDA0 compute buffer size = 504.77 MiB
+ llama_context: CUDA1 compute buffer size = 82.01 MiB
+ llama_context: CUDA_Host compute buffer size = 8.01 MiB
+ llama_context: graph nodes = 3031
+ llama_context: graph splits = 397 (with bs=512), 88 (with bs=1)
+ common_init_from_params: added <|endoftext|> logit bias = -inf
+ common_init_from_params: added <|im_end|> logit bias = -inf
+ common_init_from_params: added <|fim_pad|> logit bias = -inf
+ common_init_from_params: added <|repo_name|> logit bias = -inf
+ common_init_from_params: added <|file_sep|> logit bias = -inf
+ common_init_from_params: setting dry_penalty_last_n to ctx_size = 2048
+ common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable)
+
+ system_info: n_threads = 16 (n_threads_batch = 16) / 32 | CUDA : ARCHS = 860 | USE_GRAPHS = 1 | PEER_MAX_BATCH_SIZE = 128 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | BMI2 = 1 | AVX512 = 1 | AVX512_VBMI = 1 | AVX512_VNNI = 1 | AVX512_BF16 = 1 | LLAMAFILE = 1 | OPENMP = 1 | REPACK = 1 |
+ perplexity: tokenizing the input ..
+ perplexity: tokenization took 44.543 ms
+ perplexity: calculating perplexity over 16 chunks, n_ctx=2048, batch_size=2048, n_seq=1
+ perplexity: 3.35 seconds per pass - ETA 0.88 minutes
+ [1]4.5913,[2]5.0009,[3]5.2857,[4]5.4514,[5]5.6706,[6]5.6722,[7]5.6540,[8]5.6046,[9]5.6586,[10]5.6478,[11]5.6541,[12]5.6483,[13]5.7230,[14]5.7342,[15]5.7287,[16]5.7368,
+ Final estimate: PPL = 5.7368 +/- 0.10519
+
+ llama_perf_context_print: load time = 2333.37 ms
+ llama_perf_context_print: prompt eval time = 49087.99 ms / 32768 tokens ( 1.50 ms per token, 667.54 tokens per second)
+ llama_perf_context_print: eval time = 0.00 ms / 1 runs ( 0.00 ms per token, inf tokens per second)
+ llama_perf_context_print: total time = 49532.14 ms / 32769 tokens
+ llama_perf_context_print: graphs reused = 0
+ llama_memory_breakdown_print: | memory breakdown [MiB] | total free self model context compute unaccounted |
+ llama_memory_breakdown_print: | - CUDA0 (RTX 3090) | 24115 = 16667 + (3896 = 3351 + 40 + 504) + 3551 |
+ llama_memory_breakdown_print: | - CUDA1 (RTX 3090) | 24124 = 20000 + (3473 = 3351 + 40 + 82) + 649 |
+ llama_memory_breakdown_print: | - Host | 9874 = 9754 + 112 + 8 |
Benchmarks/DataCollection/Qwen3-30B-A3B-Thinking-2507-unsloth-IQ4_NL-attn_kv_IQ4_NL-attn_output_IQ4_NL-attn_q_IQ4_NL-embeddings_IQ4_NL-ffn_down_IQ4_NL-ffn_up_gate_IQ4_NL-lm_head_Q6_K/bench_metrics.json ADDED
@@ -0,0 +1,44 @@
+ {
+ "raw_metrics": {
+ "llamabench": {
+ "backend": "CUDA",
+ "log_path": "Qwen3-30B-A3B-Thinking-2507-unsloth-IQ4_NL-attn_kv_IQ4_NL-attn_output_IQ4_NL-attn_q_IQ4_NL-embeddings_IQ4_NL-ffn_down_IQ4_NL-ffn_up_gate_IQ4_NL-lm_head_Q6_K/llamabench.md",
+ "ngl": "35",
+ "raw_row": {
+ "backend": "CUDA",
+ "model": "qwen3moe 30B.A3B IQ4_NL - 4.5 bpw",
+ "ngl": "35",
+ "params": "30.53 B",
+ "size": "16.11 GiB",
+ "t/s": "134.10 \u00b1 5.78",
+ "test": "pp8",
+ "tps_value": 134.1
+ },
+ "test": "pp8",
+ "tps": 134.1
+ },
+ "perplexity": {
+ "code": {
+ "log_path": "Qwen3-30B-A3B-Thinking-2507-unsloth-IQ4_NL-attn_kv_IQ4_NL-attn_output_IQ4_NL-attn_q_IQ4_NL-embeddings_IQ4_NL-ffn_down_IQ4_NL-ffn_up_gate_IQ4_NL-lm_head_Q6_K/perplexity_code.log",
+ "ppl": 1.3066,
+ "ppl_error": 0.00709
+ },
+ "general": {
+ "log_path": "Qwen3-30B-A3B-Thinking-2507-unsloth-IQ4_NL-attn_kv_IQ4_NL-attn_output_IQ4_NL-attn_q_IQ4_NL-embeddings_IQ4_NL-ffn_down_IQ4_NL-ffn_up_gate_IQ4_NL-lm_head_Q6_K/perplexity_general.log",
+ "ppl": 6.3484,
+ "ppl_error": 0.12911
+ },
+ "math": {
+ "log_path": "Qwen3-30B-A3B-Thinking-2507-unsloth-IQ4_NL-attn_kv_IQ4_NL-attn_output_IQ4_NL-attn_q_IQ4_NL-embeddings_IQ4_NL-ffn_down_IQ4_NL-ffn_up_gate_IQ4_NL-lm_head_Q6_K/perplexity_math.log",
+ "ppl": 5.7026,
+ "ppl_error": 0.10404
+ }
+ }
+ },
+ "summary": {
+ "avg_prec_loss_pct": 0.8703,
+ "bench_tps": 134.1,
+ "file_size_bytes": 17304490016,
+ "file_size_gb": 16.12
+ }
+ }
Benchmarks/DataCollection/Qwen3-30B-A3B-Thinking-2507-unsloth-IQ4_NL-attn_kv_IQ4_NL-attn_output_IQ4_NL-attn_q_IQ4_NL-embeddings_IQ4_NL-ffn_down_IQ4_NL-ffn_up_gate_IQ4_NL-lm_head_Q6_K/llamabench.md ADDED
@@ -0,0 +1,11 @@
+ ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
+ ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
+ ggml_cuda_init: found 2 CUDA devices:
+ Device 0: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
+ Device 1: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
+ | model | size | params | backend | ngl | test | t/s |
+ | ------------------------------ | ---------: | ---------: | ---------- | --: | --------------: | -------------------: |
+ | qwen3moe 30B.A3B IQ4_NL - 4.5 bpw | 16.11 GiB | 30.53 B | CUDA | 35 | pp8 | 134.10 ± 5.78 |
+ | qwen3moe 30B.A3B IQ4_NL - 4.5 bpw | 16.11 GiB | 30.53 B | CUDA | 35 | tg128 | 48.35 ± 0.50 |
+
+ build: 92bb442ad (7040)
Benchmarks/DataCollection/Qwen3-30B-A3B-Thinking-2507-unsloth-IQ4_NL-attn_kv_IQ4_NL-attn_output_IQ4_NL-attn_q_IQ4_NL-embeddings_IQ4_NL-ffn_down_IQ4_NL-ffn_up_gate_IQ4_NL-lm_head_Q6_K/perplexity_code.log ADDED
@@ -0,0 +1,177 @@
+ ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
+ ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
+ ggml_cuda_init: found 2 CUDA devices:
+ Device 0: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
+ Device 1: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
+ build: 7040 (92bb442ad) with cc (Ubuntu 13.3.0-6ubuntu2~24.04) 13.3.0 for x86_64-linux-gnu
+ llama_model_load_from_file_impl: using device CUDA0 (NVIDIA GeForce RTX 3090) (0000:01:00.0) - 20935 MiB free
+ llama_model_load_from_file_impl: using device CUDA1 (NVIDIA GeForce RTX 3090) (0000:03:00.0) - 23582 MiB free
+ llama_model_loader: loaded meta data with 39 key-value pairs and 579 tensors from /mnt/world8/AI/ToBench/Qwen3-30B-A3B-Thinking-2507-unsloth/Magic_Quant/GGUF/dc_round0_Qwen3-30B-A3B-Thinking-2507-unsloth-IQ4_NL-attn_kv_IQ4_NL-attn_output_IQ4_NL-attn_q_IQ4_NL-embeddings_IQ4_NL-ffn_down_IQ4_NL-ffn_up_gate_IQ4_NL-lm_head_Q6_K.gguf (version GGUF V3 (latest))
+ llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
+ llama_model_loader: - kv 0: general.architecture str = qwen3moe
+ llama_model_loader: - kv 1: general.type str = model
+ llama_model_loader: - kv 2: general.name str = Qwen3 30B A3B Thinking 2507 Unsloth
+ llama_model_loader: - kv 3: general.version str = 2507
+ llama_model_loader: - kv 4: general.finetune str = Thinking-unsloth
+ llama_model_loader: - kv 5: general.basename str = Qwen3
+ llama_model_loader: - kv 6: general.size_label str = 30B-A3B
+ llama_model_loader: - kv 7: general.license str = apache-2.0
+ llama_model_loader: - kv 8: general.license.link str = https://huggingface.co/Qwen/Qwen3-30B...
+ llama_model_loader: - kv 9: general.base_model.count u32 = 1
+ llama_model_loader: - kv 10: general.base_model.0.name str = Qwen3 30B A3B Thinking 2507
+ llama_model_loader: - kv 11: general.base_model.0.version str = 2507
+ llama_model_loader: - kv 12: general.base_model.0.organization str = Qwen
+ llama_model_loader: - kv 13: general.base_model.0.repo_url str = https://huggingface.co/Qwen/Qwen3-30B...
+ llama_model_loader: - kv 14: general.tags arr[str,3] = ["qwen", "qwen3", "unsloth"]
+ llama_model_loader: - kv 15: qwen3moe.block_count u32 = 48
+ llama_model_loader: - kv 16: qwen3moe.context_length u32 = 262144
+ llama_model_loader: - kv 17: qwen3moe.embedding_length u32 = 2048
+ llama_model_loader: - kv 18: qwen3moe.feed_forward_length u32 = 6144
+ llama_model_loader: - kv 19: qwen3moe.attention.head_count u32 = 32
+ llama_model_loader: - kv 20: qwen3moe.attention.head_count_kv u32 = 4
+ llama_model_loader: - kv 21: qwen3moe.rope.freq_base f32 = 10000000.000000
+ llama_model_loader: - kv 22: qwen3moe.attention.layer_norm_rms_epsilon f32 = 0.000001
+ llama_model_loader: - kv 23: qwen3moe.expert_used_count u32 = 8
+ llama_model_loader: - kv 24: qwen3moe.attention.key_length u32 = 128
+ llama_model_loader: - kv 25: qwen3moe.attention.value_length u32 = 128
+ llama_model_loader: - kv 26: qwen3moe.expert_count u32 = 128
+ llama_model_loader: - kv 27: qwen3moe.expert_feed_forward_length u32 = 768
+ llama_model_loader: - kv 28: tokenizer.ggml.model str = gpt2
+ llama_model_loader: - kv 29: tokenizer.ggml.pre str = qwen2
+ llama_model_loader: - kv 30: tokenizer.ggml.tokens arr[str,151936] = ["!", "\"", "#", "$", "%", "&", "'", ...
+ llama_model_loader: - kv 31: tokenizer.ggml.token_type arr[i32,151936] = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
+ llama_model_loader: - kv 32: tokenizer.ggml.merges arr[str,151387] = ["Ġ Ġ", "ĠĠ ĠĠ", "i n", "Ġ t",...
+ llama_model_loader: - kv 33: tokenizer.ggml.eos_token_id u32 = 151645
+ llama_model_loader: - kv 34: tokenizer.ggml.padding_token_id u32 = 151654
+ llama_model_loader: - kv 35: tokenizer.ggml.add_bos_token bool = false
+ llama_model_loader: - kv 36: tokenizer.chat_template str = {%- if tools %}\n {{- '<|im_start|>...
+ llama_model_loader: - kv 37: general.quantization_version u32 = 2
+ llama_model_loader: - kv 38: general.file_type u32 = 25
+ llama_model_loader: - type f32: 241 tensors
+ llama_model_loader: - type q6_K: 1 tensors
+ llama_model_loader: - type iq4_nl: 337 tensors
+ print_info: file format = GGUF V3 (latest)
+ print_info: file type = IQ4_NL - 4.5 bpw
+ print_info: file size = 16.11 GiB (4.53 BPW)
+ load: printing all EOG tokens:
+ load: - 151643 ('<|endoftext|>')
+ load: - 151645 ('<|im_end|>')
+ load: - 151662 ('<|fim_pad|>')
+ load: - 151663 ('<|repo_name|>')
+ load: - 151664 ('<|file_sep|>')
+ load: special tokens cache size = 26
+ load: token to piece cache size = 0.9311 MB
+ print_info: arch = qwen3moe
+ print_info: vocab_only = 0
+ print_info: n_ctx_train = 262144
+ print_info: n_embd = 2048
+ print_info: n_embd_inp = 2048
+ print_info: n_layer = 48
+ print_info: n_head = 32
+ print_info: n_head_kv = 4
+ print_info: n_rot = 128
+ print_info: n_swa = 0
+ print_info: is_swa_any = 0
+ print_info: n_embd_head_k = 128
+ print_info: n_embd_head_v = 128
+ print_info: n_gqa = 8
+ print_info: n_embd_k_gqa = 512
+ print_info: n_embd_v_gqa = 512
+ print_info: f_norm_eps = 0.0e+00
+ print_info: f_norm_rms_eps = 1.0e-06
+ print_info: f_clamp_kqv = 0.0e+00
+ print_info: f_max_alibi_bias = 0.0e+00
+ print_info: f_logit_scale = 0.0e+00
+ print_info: f_attn_scale = 0.0e+00
+ print_info: n_ff = 6144
+ print_info: n_expert = 128
+ print_info: n_expert_used = 8
+ print_info: n_expert_groups = 0
+ print_info: n_group_used = 0
+ print_info: causal attn = 1
+ print_info: pooling type = 0
+ print_info: rope type = 2
+ print_info: rope scaling = linear
+ print_info: freq_base_train = 10000000.0
+ print_info: freq_scale_train = 1
+ print_info: n_ctx_orig_yarn = 262144
+ print_info: rope_finetuned = unknown
+ print_info: model type = 30B.A3B
+ print_info: model params = 30.53 B
+ print_info: general.name = Qwen3 30B A3B Thinking 2507 Unsloth
+ print_info: n_ff_exp = 768
+ print_info: vocab type = BPE
+ print_info: n_vocab = 151936
+ print_info: n_merges = 151387
+ print_info: BOS token = 11 ','
+ print_info: EOS token = 151645 '<|im_end|>'
+ print_info: EOT token = 151645 '<|im_end|>'
+ print_info: PAD token = 151654 '<|vision_pad|>'
+ print_info: LF token = 198 'Ċ'
+ print_info: FIM PRE token = 151659 '<|fim_prefix|>'
+ print_info: FIM SUF token = 151661 '<|fim_suffix|>'
+ print_info: FIM MID token = 151660 '<|fim_middle|>'
+ print_info: FIM PAD token = 151662 '<|fim_pad|>'
+ print_info: FIM REP token = 151663 '<|repo_name|>'
+ print_info: FIM SEP token = 151664 '<|file_sep|>'
+ print_info: EOG token = 151643 '<|endoftext|>'
+ print_info: EOG token = 151645 '<|im_end|>'
+ print_info: EOG token = 151662 '<|fim_pad|>'
+ print_info: EOG token = 151663 '<|repo_name|>'
+ print_info: EOG token = 151664 '<|file_sep|>'
+ print_info: max token length = 256
+ load_tensors: loading model tensors, this can take a while... (mmap = true)
+ load_tensors: offloading 20 repeating layers to GPU
+ load_tensors: offloaded 20/49 layers to GPU
+ load_tensors: CPU_Mapped model buffer size = 9794.32 MiB
+ load_tensors: CUDA0 model buffer size = 3351.42 MiB
+ load_tensors: CUDA1 model buffer size = 3351.42 MiB
+ ....................................................................................................
130
+ llama_context: constructing llama_context
131
+ llama_context: n_seq_max = 1
132
+ llama_context: n_ctx = 2048
133
+ llama_context: n_ctx_seq = 2048
134
+ llama_context: n_batch = 2048
135
+ llama_context: n_ubatch = 512
136
+ llama_context: causal_attn = 1
137
+ llama_context: flash_attn = auto
138
+ llama_context: kv_unified = false
139
+ llama_context: freq_base = 10000000.0
140
+ llama_context: freq_scale = 1
141
+ llama_context: n_ctx_seq (2048) < n_ctx_train (262144) -- the full capacity of the model will not be utilized
142
+ llama_context: CPU output buffer size = 0.58 MiB
143
+ llama_kv_cache: CPU KV buffer size = 112.00 MiB
144
+ llama_kv_cache: CUDA0 KV buffer size = 40.00 MiB
145
+ llama_kv_cache: CUDA1 KV buffer size = 40.00 MiB
146
+ llama_kv_cache: size = 192.00 MiB ( 2048 cells, 48 layers, 1/1 seqs), K (f16): 96.00 MiB, V (f16): 96.00 MiB
147
+ llama_context: Flash Attention was auto, set to enabled
148
+ llama_context: CUDA0 compute buffer size = 544.18 MiB
149
+ llama_context: CUDA1 compute buffer size = 82.01 MiB
150
+ llama_context: CUDA_Host compute buffer size = 8.01 MiB
151
+ llama_context: graph nodes = 3031
152
+ llama_context: graph splits = 397 (with bs=512), 88 (with bs=1)
153
+ common_init_from_params: added <|endoftext|> logit bias = -inf
154
+ common_init_from_params: added <|im_end|> logit bias = -inf
155
+ common_init_from_params: added <|fim_pad|> logit bias = -inf
156
+ common_init_from_params: added <|repo_name|> logit bias = -inf
157
+ common_init_from_params: added <|file_sep|> logit bias = -inf
158
+ common_init_from_params: setting dry_penalty_last_n to ctx_size = 2048
159
+ common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable)
160
+
161
+ system_info: n_threads = 16 (n_threads_batch = 16) / 32 | CUDA : ARCHS = 860 | USE_GRAPHS = 1 | PEER_MAX_BATCH_SIZE = 128 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | BMI2 = 1 | AVX512 = 1 | AVX512_VBMI = 1 | AVX512_VNNI = 1 | AVX512_BF16 = 1 | LLAMAFILE = 1 | OPENMP = 1 | REPACK = 1 |
162
+ perplexity: tokenizing the input ..
163
+ perplexity: tokenization took 113.646 ms
164
+ perplexity: calculating perplexity over 44 chunks, n_ctx=2048, batch_size=2048, n_seq=1
165
+ perplexity: 3.31 seconds per pass - ETA 2.42 minutes
166
+ [1]1.5148,[2]1.4623,[3]1.2888,[4]1.2491,[5]1.3370,[6]1.4009,[7]1.4038,[8]1.4016,[9]1.3615,[10]1.3384,[11]1.3239,[12]1.3258,[13]1.3104,[14]1.3014,[15]1.2974,[16]1.2857,[17]1.2800,[18]1.2791,[19]1.2716,[20]1.2621,[21]1.2593,[22]1.2594,[23]1.2754,[24]1.2683,[25]1.2665,[26]1.2577,[27]1.2523,[28]1.2514,[29]1.2641,[30]1.2660,[31]1.2595,[32]1.2544,[33]1.2553,[34]1.2544,[35]1.2533,[36]1.2744,[37]1.2839,[38]1.2887,[39]1.2955,[40]1.2967,[41]1.2933,[42]1.3066,[43]1.3063,[44]1.3066,
167
+ Final estimate: PPL = 1.3066 +/- 0.00709
168
+
169
+ llama_perf_context_print: load time = 2340.34 ms
170
+ llama_perf_context_print: prompt eval time = 123762.53 ms / 90112 tokens ( 1.37 ms per token, 728.10 tokens per second)
171
+ llama_perf_context_print: eval time = 0.00 ms / 1 runs ( 0.00 ms per token, inf tokens per second)
172
+ llama_perf_context_print: total time = 124977.94 ms / 90113 tokens
173
+ llama_perf_context_print: graphs reused = 0
174
+ llama_memory_breakdown_print: | memory breakdown [MiB] | total free self model context compute unaccounted |
175
+ llama_memory_breakdown_print: | - CUDA0 (RTX 3090) | 24115 = 16892 + (3935 = 3351 + 40 + 544) + 3286 |
176
+ llama_memory_breakdown_print: | - CUDA1 (RTX 3090) | 24124 = 20000 + (3473 = 3351 + 40 + 82) + 649 |
177
+ llama_memory_breakdown_print: | - Host | 9914 = 9794 + 112 + 8 |
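The per-chunk perplexity stream and the `Final estimate: PPL = ... +/- ...` line above are what populate the `ppl` / `ppl_error` fields in each variant's `bench_metrics.json`. A minimal sketch of reducing such a log to those two numbers (the regular expression and the local file name are illustrative assumptions, not the repo's actual collection tooling):

```python
# Sketch: extract "Final estimate: PPL = 1.3066 +/- 0.00709" from a llama-perplexity log.
# Any perplexity_*.log in this repo has the same shape; the path below is an example.
import re
from pathlib import Path

def read_final_ppl(log_path: str) -> tuple[float, float]:
    """Return (ppl, ppl_error) from a llama.cpp perplexity log."""
    text = Path(log_path).read_text(encoding="utf-8", errors="replace")
    match = re.search(r"Final estimate: PPL = ([\d.]+) \+/- ([\d.]+)", text)
    if match is None:
        raise ValueError(f"no final PPL estimate found in {log_path}")
    return float(match.group(1)), float(match.group(2))

if __name__ == "__main__":
    ppl, err = read_final_ppl("perplexity_code.log")  # hypothetical local copy of the log above
    print(f"PPL = {ppl} +/- {err}")
```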
Benchmarks/DataCollection/Qwen3-30B-A3B-Thinking-2507-unsloth-IQ4_NL-attn_kv_IQ4_NL-attn_output_IQ4_NL-attn_q_IQ4_NL-embeddings_IQ4_NL-ffn_down_IQ4_NL-ffn_up_gate_IQ4_NL-lm_head_Q6_K/perplexity_general.log ADDED
@@ -0,0 +1,177 @@
1
+ ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
2
+ ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
3
+ ggml_cuda_init: found 2 CUDA devices:
4
+ Device 0: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
5
+ Device 1: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
6
+ build: 7040 (92bb442ad) with cc (Ubuntu 13.3.0-6ubuntu2~24.04) 13.3.0 for x86_64-linux-gnu
7
+ llama_model_load_from_file_impl: using device CUDA0 (NVIDIA GeForce RTX 3090) (0000:01:00.0) - 20911 MiB free
8
+ llama_model_load_from_file_impl: using device CUDA1 (NVIDIA GeForce RTX 3090) (0000:03:00.0) - 23582 MiB free
9
+ llama_model_loader: loaded meta data with 39 key-value pairs and 579 tensors from /mnt/world8/AI/ToBench/Qwen3-30B-A3B-Thinking-2507-unsloth/Magic_Quant/GGUF/dc_round0_Qwen3-30B-A3B-Thinking-2507-unsloth-IQ4_NL-attn_kv_IQ4_NL-attn_output_IQ4_NL-attn_q_IQ4_NL-embeddings_IQ4_NL-ffn_down_IQ4_NL-ffn_up_gate_IQ4_NL-lm_head_Q6_K.gguf (version GGUF V3 (latest))
10
+ llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
11
+ llama_model_loader: - kv 0: general.architecture str = qwen3moe
12
+ llama_model_loader: - kv 1: general.type str = model
13
+ llama_model_loader: - kv 2: general.name str = Qwen3 30B A3B Thinking 2507 Unsloth
14
+ llama_model_loader: - kv 3: general.version str = 2507
15
+ llama_model_loader: - kv 4: general.finetune str = Thinking-unsloth
16
+ llama_model_loader: - kv 5: general.basename str = Qwen3
17
+ llama_model_loader: - kv 6: general.size_label str = 30B-A3B
18
+ llama_model_loader: - kv 7: general.license str = apache-2.0
19
+ llama_model_loader: - kv 8: general.license.link str = https://huggingface.co/Qwen/Qwen3-30B...
20
+ llama_model_loader: - kv 9: general.base_model.count u32 = 1
21
+ llama_model_loader: - kv 10: general.base_model.0.name str = Qwen3 30B A3B Thinking 2507
22
+ llama_model_loader: - kv 11: general.base_model.0.version str = 2507
23
+ llama_model_loader: - kv 12: general.base_model.0.organization str = Qwen
24
+ llama_model_loader: - kv 13: general.base_model.0.repo_url str = https://huggingface.co/Qwen/Qwen3-30B...
25
+ llama_model_loader: - kv 14: general.tags arr[str,3] = ["qwen", "qwen3", "unsloth"]
26
+ llama_model_loader: - kv 15: qwen3moe.block_count u32 = 48
27
+ llama_model_loader: - kv 16: qwen3moe.context_length u32 = 262144
28
+ llama_model_loader: - kv 17: qwen3moe.embedding_length u32 = 2048
29
+ llama_model_loader: - kv 18: qwen3moe.feed_forward_length u32 = 6144
30
+ llama_model_loader: - kv 19: qwen3moe.attention.head_count u32 = 32
31
+ llama_model_loader: - kv 20: qwen3moe.attention.head_count_kv u32 = 4
32
+ llama_model_loader: - kv 21: qwen3moe.rope.freq_base f32 = 10000000.000000
33
+ llama_model_loader: - kv 22: qwen3moe.attention.layer_norm_rms_epsilon f32 = 0.000001
34
+ llama_model_loader: - kv 23: qwen3moe.expert_used_count u32 = 8
35
+ llama_model_loader: - kv 24: qwen3moe.attention.key_length u32 = 128
36
+ llama_model_loader: - kv 25: qwen3moe.attention.value_length u32 = 128
37
+ llama_model_loader: - kv 26: qwen3moe.expert_count u32 = 128
38
+ llama_model_loader: - kv 27: qwen3moe.expert_feed_forward_length u32 = 768
39
+ llama_model_loader: - kv 28: tokenizer.ggml.model str = gpt2
40
+ llama_model_loader: - kv 29: tokenizer.ggml.pre str = qwen2
41
+ llama_model_loader: - kv 30: tokenizer.ggml.tokens arr[str,151936] = ["!", "\"", "#", "$", "%", "&", "'", ...
42
+ llama_model_loader: - kv 31: tokenizer.ggml.token_type arr[i32,151936] = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
43
+ llama_model_loader: - kv 32: tokenizer.ggml.merges arr[str,151387] = ["Ġ Ġ", "ĠĠ ĠĠ", "i n", "Ġ t",...
44
+ llama_model_loader: - kv 33: tokenizer.ggml.eos_token_id u32 = 151645
45
+ llama_model_loader: - kv 34: tokenizer.ggml.padding_token_id u32 = 151654
46
+ llama_model_loader: - kv 35: tokenizer.ggml.add_bos_token bool = false
47
+ llama_model_loader: - kv 36: tokenizer.chat_template str = {%- if tools %}\n {{- '<|im_start|>...
48
+ llama_model_loader: - kv 37: general.quantization_version u32 = 2
49
+ llama_model_loader: - kv 38: general.file_type u32 = 25
50
+ llama_model_loader: - type f32: 241 tensors
51
+ llama_model_loader: - type q6_K: 1 tensors
52
+ llama_model_loader: - type iq4_nl: 337 tensors
53
+ print_info: file format = GGUF V3 (latest)
54
+ print_info: file type = IQ4_NL - 4.5 bpw
55
+ print_info: file size = 16.11 GiB (4.53 BPW)
56
+ load: printing all EOG tokens:
57
+ load: - 151643 ('<|endoftext|>')
58
+ load: - 151645 ('<|im_end|>')
59
+ load: - 151662 ('<|fim_pad|>')
60
+ load: - 151663 ('<|repo_name|>')
61
+ load: - 151664 ('<|file_sep|>')
62
+ load: special tokens cache size = 26
63
+ load: token to piece cache size = 0.9311 MB
64
+ print_info: arch = qwen3moe
65
+ print_info: vocab_only = 0
66
+ print_info: n_ctx_train = 262144
67
+ print_info: n_embd = 2048
68
+ print_info: n_embd_inp = 2048
69
+ print_info: n_layer = 48
70
+ print_info: n_head = 32
71
+ print_info: n_head_kv = 4
72
+ print_info: n_rot = 128
73
+ print_info: n_swa = 0
74
+ print_info: is_swa_any = 0
75
+ print_info: n_embd_head_k = 128
76
+ print_info: n_embd_head_v = 128
77
+ print_info: n_gqa = 8
78
+ print_info: n_embd_k_gqa = 512
79
+ print_info: n_embd_v_gqa = 512
80
+ print_info: f_norm_eps = 0.0e+00
81
+ print_info: f_norm_rms_eps = 1.0e-06
82
+ print_info: f_clamp_kqv = 0.0e+00
83
+ print_info: f_max_alibi_bias = 0.0e+00
84
+ print_info: f_logit_scale = 0.0e+00
85
+ print_info: f_attn_scale = 0.0e+00
86
+ print_info: n_ff = 6144
87
+ print_info: n_expert = 128
88
+ print_info: n_expert_used = 8
89
+ print_info: n_expert_groups = 0
90
+ print_info: n_group_used = 0
91
+ print_info: causal attn = 1
92
+ print_info: pooling type = 0
93
+ print_info: rope type = 2
94
+ print_info: rope scaling = linear
95
+ print_info: freq_base_train = 10000000.0
96
+ print_info: freq_scale_train = 1
97
+ print_info: n_ctx_orig_yarn = 262144
98
+ print_info: rope_finetuned = unknown
99
+ print_info: model type = 30B.A3B
100
+ print_info: model params = 30.53 B
101
+ print_info: general.name = Qwen3 30B A3B Thinking 2507 Unsloth
102
+ print_info: n_ff_exp = 768
103
+ print_info: vocab type = BPE
104
+ print_info: n_vocab = 151936
105
+ print_info: n_merges = 151387
106
+ print_info: BOS token = 11 ','
107
+ print_info: EOS token = 151645 '<|im_end|>'
108
+ print_info: EOT token = 151645 '<|im_end|>'
109
+ print_info: PAD token = 151654 '<|vision_pad|>'
110
+ print_info: LF token = 198 'Ċ'
111
+ print_info: FIM PRE token = 151659 '<|fim_prefix|>'
112
+ print_info: FIM SUF token = 151661 '<|fim_suffix|>'
113
+ print_info: FIM MID token = 151660 '<|fim_middle|>'
114
+ print_info: FIM PAD token = 151662 '<|fim_pad|>'
115
+ print_info: FIM REP token = 151663 '<|repo_name|>'
116
+ print_info: FIM SEP token = 151664 '<|file_sep|>'
117
+ print_info: EOG token = 151643 '<|endoftext|>'
118
+ print_info: EOG token = 151645 '<|im_end|>'
119
+ print_info: EOG token = 151662 '<|fim_pad|>'
120
+ print_info: EOG token = 151663 '<|repo_name|>'
121
+ print_info: EOG token = 151664 '<|file_sep|>'
122
+ print_info: max token length = 256
123
+ load_tensors: loading model tensors, this can take a while... (mmap = true)
124
+ load_tensors: offloading 20 repeating layers to GPU
125
+ load_tensors: offloaded 20/49 layers to GPU
126
+ load_tensors: CPU_Mapped model buffer size = 9794.32 MiB
127
+ load_tensors: CUDA0 model buffer size = 3351.42 MiB
128
+ load_tensors: CUDA1 model buffer size = 3351.42 MiB
129
+ ....................................................................................................
130
+ llama_context: constructing llama_context
131
+ llama_context: n_seq_max = 1
132
+ llama_context: n_ctx = 2048
133
+ llama_context: n_ctx_seq = 2048
134
+ llama_context: n_batch = 2048
135
+ llama_context: n_ubatch = 512
136
+ llama_context: causal_attn = 1
137
+ llama_context: flash_attn = auto
138
+ llama_context: kv_unified = false
139
+ llama_context: freq_base = 10000000.0
140
+ llama_context: freq_scale = 1
141
+ llama_context: n_ctx_seq (2048) < n_ctx_train (262144) -- the full capacity of the model will not be utilized
142
+ llama_context: CPU output buffer size = 0.58 MiB
143
+ llama_kv_cache: CPU KV buffer size = 112.00 MiB
144
+ llama_kv_cache: CUDA0 KV buffer size = 40.00 MiB
145
+ llama_kv_cache: CUDA1 KV buffer size = 40.00 MiB
146
+ llama_kv_cache: size = 192.00 MiB ( 2048 cells, 48 layers, 1/1 seqs), K (f16): 96.00 MiB, V (f16): 96.00 MiB
147
+ llama_context: Flash Attention was auto, set to enabled
148
+ llama_context: CUDA0 compute buffer size = 544.18 MiB
149
+ llama_context: CUDA1 compute buffer size = 82.01 MiB
150
+ llama_context: CUDA_Host compute buffer size = 8.01 MiB
151
+ llama_context: graph nodes = 3031
152
+ llama_context: graph splits = 397 (with bs=512), 88 (with bs=1)
153
+ common_init_from_params: added <|endoftext|> logit bias = -inf
154
+ common_init_from_params: added <|im_end|> logit bias = -inf
155
+ common_init_from_params: added <|fim_pad|> logit bias = -inf
156
+ common_init_from_params: added <|repo_name|> logit bias = -inf
157
+ common_init_from_params: added <|file_sep|> logit bias = -inf
158
+ common_init_from_params: setting dry_penalty_last_n to ctx_size = 2048
159
+ common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable)
160
+
161
+ system_info: n_threads = 16 (n_threads_batch = 16) / 32 | CUDA : ARCHS = 860 | USE_GRAPHS = 1 | PEER_MAX_BATCH_SIZE = 128 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | BMI2 = 1 | AVX512 = 1 | AVX512_VBMI = 1 | AVX512_VNNI = 1 | AVX512_BF16 = 1 | LLAMAFILE = 1 | OPENMP = 1 | REPACK = 1 |
162
+ perplexity: tokenizing the input ..
163
+ perplexity: tokenization took 47.554 ms
164
+ perplexity: calculating perplexity over 15 chunks, n_ctx=2048, batch_size=2048, n_seq=1
165
+ perplexity: 3.26 seconds per pass - ETA 0.80 minutes
166
+ [1]5.4142,[2]6.4548,[3]6.8451,[4]6.7474,[5]6.6742,[6]5.7565,[7]5.2607,[8]5.2909,[9]5.5957,[10]5.7255,[11]5.7724,[12]6.0754,[13]6.1445,[14]6.2748,[15]6.3484,
167
+ Final estimate: PPL = 6.3484 +/- 0.12911
168
+
169
+ llama_perf_context_print: load time = 2343.92 ms
170
+ llama_perf_context_print: prompt eval time = 44899.49 ms / 30720 tokens ( 1.46 ms per token, 684.19 tokens per second)
171
+ llama_perf_context_print: eval time = 0.00 ms / 1 runs ( 0.00 ms per token, inf tokens per second)
172
+ llama_perf_context_print: total time = 45321.84 ms / 30721 tokens
173
+ llama_perf_context_print: graphs reused = 0
174
+ llama_memory_breakdown_print: | memory breakdown [MiB] | total free self model context compute unaccounted |
175
+ llama_memory_breakdown_print: | - CUDA0 (RTX 3090) | 24115 = 16892 + (3935 = 3351 + 40 + 544) + 3287 |
176
+ llama_memory_breakdown_print: | - CUDA1 (RTX 3090) | 24124 = 20000 + (3473 = 3351 + 40 + 82) + 649 |
177
+ llama_memory_breakdown_print: | - Host | 9914 = 9794 + 112 + 8 |
Benchmarks/DataCollection/Qwen3-30B-A3B-Thinking-2507-unsloth-IQ4_NL-attn_kv_IQ4_NL-attn_output_IQ4_NL-attn_q_IQ4_NL-embeddings_IQ4_NL-ffn_down_IQ4_NL-ffn_up_gate_IQ4_NL-lm_head_Q6_K/perplexity_math.log ADDED
@@ -0,0 +1,177 @@
1
+ ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
2
+ ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
3
+ ggml_cuda_init: found 2 CUDA devices:
4
+ Device 0: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
5
+ Device 1: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
6
+ build: 7040 (92bb442ad) with cc (Ubuntu 13.3.0-6ubuntu2~24.04) 13.3.0 for x86_64-linux-gnu
7
+ llama_model_load_from_file_impl: using device CUDA0 (NVIDIA GeForce RTX 3090) (0000:01:00.0) - 20936 MiB free
8
+ llama_model_load_from_file_impl: using device CUDA1 (NVIDIA GeForce RTX 3090) (0000:03:00.0) - 23582 MiB free
9
+ llama_model_loader: loaded meta data with 39 key-value pairs and 579 tensors from /mnt/world8/AI/ToBench/Qwen3-30B-A3B-Thinking-2507-unsloth/Magic_Quant/GGUF/dc_round0_Qwen3-30B-A3B-Thinking-2507-unsloth-IQ4_NL-attn_kv_IQ4_NL-attn_output_IQ4_NL-attn_q_IQ4_NL-embeddings_IQ4_NL-ffn_down_IQ4_NL-ffn_up_gate_IQ4_NL-lm_head_Q6_K.gguf (version GGUF V3 (latest))
10
+ llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
11
+ llama_model_loader: - kv 0: general.architecture str = qwen3moe
12
+ llama_model_loader: - kv 1: general.type str = model
13
+ llama_model_loader: - kv 2: general.name str = Qwen3 30B A3B Thinking 2507 Unsloth
14
+ llama_model_loader: - kv 3: general.version str = 2507
15
+ llama_model_loader: - kv 4: general.finetune str = Thinking-unsloth
16
+ llama_model_loader: - kv 5: general.basename str = Qwen3
17
+ llama_model_loader: - kv 6: general.size_label str = 30B-A3B
18
+ llama_model_loader: - kv 7: general.license str = apache-2.0
19
+ llama_model_loader: - kv 8: general.license.link str = https://huggingface.co/Qwen/Qwen3-30B...
20
+ llama_model_loader: - kv 9: general.base_model.count u32 = 1
21
+ llama_model_loader: - kv 10: general.base_model.0.name str = Qwen3 30B A3B Thinking 2507
22
+ llama_model_loader: - kv 11: general.base_model.0.version str = 2507
23
+ llama_model_loader: - kv 12: general.base_model.0.organization str = Qwen
24
+ llama_model_loader: - kv 13: general.base_model.0.repo_url str = https://huggingface.co/Qwen/Qwen3-30B...
25
+ llama_model_loader: - kv 14: general.tags arr[str,3] = ["qwen", "qwen3", "unsloth"]
26
+ llama_model_loader: - kv 15: qwen3moe.block_count u32 = 48
27
+ llama_model_loader: - kv 16: qwen3moe.context_length u32 = 262144
28
+ llama_model_loader: - kv 17: qwen3moe.embedding_length u32 = 2048
29
+ llama_model_loader: - kv 18: qwen3moe.feed_forward_length u32 = 6144
30
+ llama_model_loader: - kv 19: qwen3moe.attention.head_count u32 = 32
31
+ llama_model_loader: - kv 20: qwen3moe.attention.head_count_kv u32 = 4
32
+ llama_model_loader: - kv 21: qwen3moe.rope.freq_base f32 = 10000000.000000
33
+ llama_model_loader: - kv 22: qwen3moe.attention.layer_norm_rms_epsilon f32 = 0.000001
34
+ llama_model_loader: - kv 23: qwen3moe.expert_used_count u32 = 8
35
+ llama_model_loader: - kv 24: qwen3moe.attention.key_length u32 = 128
36
+ llama_model_loader: - kv 25: qwen3moe.attention.value_length u32 = 128
37
+ llama_model_loader: - kv 26: qwen3moe.expert_count u32 = 128
38
+ llama_model_loader: - kv 27: qwen3moe.expert_feed_forward_length u32 = 768
39
+ llama_model_loader: - kv 28: tokenizer.ggml.model str = gpt2
40
+ llama_model_loader: - kv 29: tokenizer.ggml.pre str = qwen2
41
+ llama_model_loader: - kv 30: tokenizer.ggml.tokens arr[str,151936] = ["!", "\"", "#", "$", "%", "&", "'", ...
42
+ llama_model_loader: - kv 31: tokenizer.ggml.token_type arr[i32,151936] = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
43
+ llama_model_loader: - kv 32: tokenizer.ggml.merges arr[str,151387] = ["Ġ Ġ", "ĠĠ ĠĠ", "i n", "Ġ t",...
44
+ llama_model_loader: - kv 33: tokenizer.ggml.eos_token_id u32 = 151645
45
+ llama_model_loader: - kv 34: tokenizer.ggml.padding_token_id u32 = 151654
46
+ llama_model_loader: - kv 35: tokenizer.ggml.add_bos_token bool = false
47
+ llama_model_loader: - kv 36: tokenizer.chat_template str = {%- if tools %}\n {{- '<|im_start|>...
48
+ llama_model_loader: - kv 37: general.quantization_version u32 = 2
49
+ llama_model_loader: - kv 38: general.file_type u32 = 25
50
+ llama_model_loader: - type f32: 241 tensors
51
+ llama_model_loader: - type q6_K: 1 tensors
52
+ llama_model_loader: - type iq4_nl: 337 tensors
53
+ print_info: file format = GGUF V3 (latest)
54
+ print_info: file type = IQ4_NL - 4.5 bpw
55
+ print_info: file size = 16.11 GiB (4.53 BPW)
56
+ load: printing all EOG tokens:
57
+ load: - 151643 ('<|endoftext|>')
58
+ load: - 151645 ('<|im_end|>')
59
+ load: - 151662 ('<|fim_pad|>')
60
+ load: - 151663 ('<|repo_name|>')
61
+ load: - 151664 ('<|file_sep|>')
62
+ load: special tokens cache size = 26
63
+ load: token to piece cache size = 0.9311 MB
64
+ print_info: arch = qwen3moe
65
+ print_info: vocab_only = 0
66
+ print_info: n_ctx_train = 262144
67
+ print_info: n_embd = 2048
68
+ print_info: n_embd_inp = 2048
69
+ print_info: n_layer = 48
70
+ print_info: n_head = 32
71
+ print_info: n_head_kv = 4
72
+ print_info: n_rot = 128
73
+ print_info: n_swa = 0
74
+ print_info: is_swa_any = 0
75
+ print_info: n_embd_head_k = 128
76
+ print_info: n_embd_head_v = 128
77
+ print_info: n_gqa = 8
78
+ print_info: n_embd_k_gqa = 512
79
+ print_info: n_embd_v_gqa = 512
80
+ print_info: f_norm_eps = 0.0e+00
81
+ print_info: f_norm_rms_eps = 1.0e-06
82
+ print_info: f_clamp_kqv = 0.0e+00
83
+ print_info: f_max_alibi_bias = 0.0e+00
84
+ print_info: f_logit_scale = 0.0e+00
85
+ print_info: f_attn_scale = 0.0e+00
86
+ print_info: n_ff = 6144
87
+ print_info: n_expert = 128
88
+ print_info: n_expert_used = 8
89
+ print_info: n_expert_groups = 0
90
+ print_info: n_group_used = 0
91
+ print_info: causal attn = 1
92
+ print_info: pooling type = 0
93
+ print_info: rope type = 2
94
+ print_info: rope scaling = linear
95
+ print_info: freq_base_train = 10000000.0
96
+ print_info: freq_scale_train = 1
97
+ print_info: n_ctx_orig_yarn = 262144
98
+ print_info: rope_finetuned = unknown
99
+ print_info: model type = 30B.A3B
100
+ print_info: model params = 30.53 B
101
+ print_info: general.name = Qwen3 30B A3B Thinking 2507 Unsloth
102
+ print_info: n_ff_exp = 768
103
+ print_info: vocab type = BPE
104
+ print_info: n_vocab = 151936
105
+ print_info: n_merges = 151387
106
+ print_info: BOS token = 11 ','
107
+ print_info: EOS token = 151645 '<|im_end|>'
108
+ print_info: EOT token = 151645 '<|im_end|>'
109
+ print_info: PAD token = 151654 '<|vision_pad|>'
110
+ print_info: LF token = 198 'Ċ'
111
+ print_info: FIM PRE token = 151659 '<|fim_prefix|>'
112
+ print_info: FIM SUF token = 151661 '<|fim_suffix|>'
113
+ print_info: FIM MID token = 151660 '<|fim_middle|>'
114
+ print_info: FIM PAD token = 151662 '<|fim_pad|>'
115
+ print_info: FIM REP token = 151663 '<|repo_name|>'
116
+ print_info: FIM SEP token = 151664 '<|file_sep|>'
117
+ print_info: EOG token = 151643 '<|endoftext|>'
118
+ print_info: EOG token = 151645 '<|im_end|>'
119
+ print_info: EOG token = 151662 '<|fim_pad|>'
120
+ print_info: EOG token = 151663 '<|repo_name|>'
121
+ print_info: EOG token = 151664 '<|file_sep|>'
122
+ print_info: max token length = 256
123
+ load_tensors: loading model tensors, this can take a while... (mmap = true)
124
+ load_tensors: offloading 20 repeating layers to GPU
125
+ load_tensors: offloaded 20/49 layers to GPU
126
+ load_tensors: CPU_Mapped model buffer size = 9794.32 MiB
127
+ load_tensors: CUDA0 model buffer size = 3351.42 MiB
128
+ load_tensors: CUDA1 model buffer size = 3351.42 MiB
129
+ ....................................................................................................
130
+ llama_context: constructing llama_context
131
+ llama_context: n_seq_max = 1
132
+ llama_context: n_ctx = 2048
133
+ llama_context: n_ctx_seq = 2048
134
+ llama_context: n_batch = 2048
135
+ llama_context: n_ubatch = 512
136
+ llama_context: causal_attn = 1
137
+ llama_context: flash_attn = auto
138
+ llama_context: kv_unified = false
139
+ llama_context: freq_base = 10000000.0
140
+ llama_context: freq_scale = 1
141
+ llama_context: n_ctx_seq (2048) < n_ctx_train (262144) -- the full capacity of the model will not be utilized
142
+ llama_context: CPU output buffer size = 0.58 MiB
143
+ llama_kv_cache: CPU KV buffer size = 112.00 MiB
144
+ llama_kv_cache: CUDA0 KV buffer size = 40.00 MiB
145
+ llama_kv_cache: CUDA1 KV buffer size = 40.00 MiB
146
+ llama_kv_cache: size = 192.00 MiB ( 2048 cells, 48 layers, 1/1 seqs), K (f16): 96.00 MiB, V (f16): 96.00 MiB
147
+ llama_context: Flash Attention was auto, set to enabled
148
+ llama_context: CUDA0 compute buffer size = 544.18 MiB
149
+ llama_context: CUDA1 compute buffer size = 82.01 MiB
150
+ llama_context: CUDA_Host compute buffer size = 8.01 MiB
151
+ llama_context: graph nodes = 3031
152
+ llama_context: graph splits = 397 (with bs=512), 88 (with bs=1)
153
+ common_init_from_params: added <|endoftext|> logit bias = -inf
154
+ common_init_from_params: added <|im_end|> logit bias = -inf
155
+ common_init_from_params: added <|fim_pad|> logit bias = -inf
156
+ common_init_from_params: added <|repo_name|> logit bias = -inf
157
+ common_init_from_params: added <|file_sep|> logit bias = -inf
158
+ common_init_from_params: setting dry_penalty_last_n to ctx_size = 2048
159
+ common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable)
160
+
161
+ system_info: n_threads = 16 (n_threads_batch = 16) / 32 | CUDA : ARCHS = 860 | USE_GRAPHS = 1 | PEER_MAX_BATCH_SIZE = 128 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | BMI2 = 1 | AVX512 = 1 | AVX512_VBMI = 1 | AVX512_VNNI = 1 | AVX512_BF16 = 1 | LLAMAFILE = 1 | OPENMP = 1 | REPACK = 1 |
162
+ perplexity: tokenizing the input ..
163
+ perplexity: tokenization took 43.891 ms
164
+ perplexity: calculating perplexity over 16 chunks, n_ctx=2048, batch_size=2048, n_seq=1
165
+ perplexity: 3.36 seconds per pass - ETA 0.88 minutes
166
+ [1]4.5753,[2]4.9789,[3]5.2511,[4]5.4213,[5]5.6313,[6]5.6336,[7]5.6162,[8]5.5661,[9]5.6202,[10]5.6090,[11]5.6163,[12]5.6108,[13]5.6870,[14]5.6986,[15]5.6948,[16]5.7026,
167
+ Final estimate: PPL = 5.7026 +/- 0.10404
168
+
169
+ llama_perf_context_print: load time = 2344.79 ms
170
+ llama_perf_context_print: prompt eval time = 49162.82 ms / 32768 tokens ( 1.50 ms per token, 666.52 tokens per second)
171
+ llama_perf_context_print: eval time = 0.00 ms / 1 runs ( 0.00 ms per token, inf tokens per second)
172
+ llama_perf_context_print: total time = 49605.37 ms / 32769 tokens
173
+ llama_perf_context_print: graphs reused = 0
174
+ llama_memory_breakdown_print: | memory breakdown [MiB] | total free self model context compute unaccounted |
175
+ llama_memory_breakdown_print: | - CUDA0 (RTX 3090) | 24115 = 16887 + (3935 = 3351 + 40 + 544) + 3292 |
176
+ llama_memory_breakdown_print: | - CUDA1 (RTX 3090) | 24124 = 20000 + (3473 = 3351 + 40 + 82) + 649 |
177
+ llama_memory_breakdown_print: | - Host | 9914 = 9794 + 112 + 8 |
Benchmarks/DataCollection/Qwen3-30B-A3B-Thinking-2507-unsloth-IQ4_NL-attn_kv_IQ4_NL-attn_output_IQ4_NL-attn_q_IQ4_NL-embeddings_IQ4_NL-ffn_down_IQ4_NL-ffn_up_gate_IQ4_NL-lm_head_Q8_0/bench_metrics.json ADDED
@@ -0,0 +1,44 @@
1
+ {
2
+ "raw_metrics": {
3
+ "llamabench": {
4
+ "backend": "CUDA",
5
+ "log_path": "Qwen3-30B-A3B-Thinking-2507-unsloth-IQ4_NL-attn_kv_IQ4_NL-attn_output_IQ4_NL-attn_q_IQ4_NL-embeddings_IQ4_NL-ffn_down_IQ4_NL-ffn_up_gate_IQ4_NL-lm_head_Q8_0/llamabench.md",
6
+ "ngl": "35",
7
+ "raw_row": {
8
+ "backend": "CUDA",
9
+ "model": "qwen3moe 30B.A3B IQ4_NL - 4.5 bpw",
10
+ "ngl": "35",
11
+ "params": "30.53 B",
12
+ "size": "16.18 GiB",
13
+ "t/s": "131.55 \u00b1 3.54",
14
+ "test": "pp8",
15
+ "tps_value": 131.55
16
+ },
17
+ "test": "pp8",
18
+ "tps": 131.55
19
+ },
20
+ "perplexity": {
21
+ "code": {
22
+ "log_path": "Qwen3-30B-A3B-Thinking-2507-unsloth-IQ4_NL-attn_kv_IQ4_NL-attn_output_IQ4_NL-attn_q_IQ4_NL-embeddings_IQ4_NL-ffn_down_IQ4_NL-ffn_up_gate_IQ4_NL-lm_head_Q8_0/perplexity_code.log",
23
+ "ppl": 1.3062,
24
+ "ppl_error": 0.00708
25
+ },
26
+ "general": {
27
+ "log_path": "Qwen3-30B-A3B-Thinking-2507-unsloth-IQ4_NL-attn_kv_IQ4_NL-attn_output_IQ4_NL-attn_q_IQ4_NL-embeddings_IQ4_NL-ffn_down_IQ4_NL-ffn_up_gate_IQ4_NL-lm_head_Q8_0/perplexity_general.log",
28
+ "ppl": 6.3521,
29
+ "ppl_error": 0.12931
30
+ },
31
+ "math": {
32
+ "log_path": "Qwen3-30B-A3B-Thinking-2507-unsloth-IQ4_NL-attn_kv_IQ4_NL-attn_output_IQ4_NL-attn_q_IQ4_NL-embeddings_IQ4_NL-ffn_down_IQ4_NL-ffn_up_gate_IQ4_NL-lm_head_Q8_0/perplexity_math.log",
33
+ "ppl": 5.7063,
34
+ "ppl_error": 0.10414
35
+ }
36
+ }
37
+ },
38
+ "summary": {
39
+ "avg_prec_loss_pct": 0.9013,
40
+ "bench_tps": 131.55,
41
+ "file_size_bytes": 17379850272,
42
+ "file_size_gb": 16.19
43
+ }
44
+ }
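Every variant directory pairs a `bench_metrics.json` like the one above with the logs it references, so variants can be compared from the `summary` block alone. A small sketch under that assumption (the field names are taken from the JSON above; the root directory is whatever has been downloaded locally):

```python
# Sketch: rank local quant variants by the summary fields of their bench_metrics.json.
import json
from pathlib import Path

def load_summaries(root: str) -> list[dict]:
    rows = []
    for metrics_file in Path(root).glob("*/bench_metrics.json"):
        summary = json.loads(metrics_file.read_text())["summary"]
        rows.append({
            "variant": metrics_file.parent.name,
            "size_gb": summary["file_size_gb"],
            "tps": summary["bench_tps"],
            "avg_prec_loss_pct": summary["avg_prec_loss_pct"],
        })
    # Lowest precision loss first, then highest throughput.
    return sorted(rows, key=lambda r: (r["avg_prec_loss_pct"], -r["tps"]))

if __name__ == "__main__":
    for row in load_summaries("Benchmarks/DataCollection"):
        print(row)
```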
Benchmarks/DataCollection/Qwen3-30B-A3B-Thinking-2507-unsloth-IQ4_NL-attn_kv_IQ4_NL-attn_output_IQ4_NL-attn_q_IQ4_NL-embeddings_IQ4_NL-ffn_down_IQ4_NL-ffn_up_gate_IQ4_NL-lm_head_Q8_0/llamabench.md ADDED
@@ -0,0 +1,11 @@
 
1
+ ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
2
+ ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
3
+ ggml_cuda_init: found 2 CUDA devices:
4
+ Device 0: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
5
+ Device 1: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
6
+ | model | size | params | backend | ngl | test | t/s |
7
+ | ------------------------------ | ---------: | ---------: | ---------- | --: | --------------: | -------------------: |
8
+ | qwen3moe 30B.A3B IQ4_NL - 4.5 bpw | 16.18 GiB | 30.53 B | CUDA | 35 | pp8 | 131.55 ± 3.54 |
9
+ | qwen3moe 30B.A3B IQ4_NL - 4.5 bpw | 16.18 GiB | 30.53 B | CUDA | 35 | tg128 | 45.73 ± 0.43 |
10
+
11
+ build: 92bb442ad (7040)
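The two table rows record llama-bench's pp8 (prompt processing, 8 tokens) and tg128 (128 generated tokens) throughput at 35 offloaded layers. If you want those numbers without eyeballing the markdown, here is a sketch that only assumes the seven-column layout shown above:

```python
# Sketch: read test name -> tokens/sec out of a llamabench.md table like the one above.
from pathlib import Path

def parse_llamabench(md_path: str) -> dict[str, float]:
    results = {}
    for line in Path(md_path).read_text(encoding="utf-8").splitlines():
        cells = [c.strip() for c in line.strip().strip("|").split("|")]
        # Keep only 7-column data rows; skip the header and the dashed separator row.
        if len(cells) == 7 and cells[0] != "model" and not set(cells[0]) <= {"-", ":", " "}:
            results[cells[5]] = float(cells[6].split("±")[0])  # drop the "± stddev" part
    return results

if __name__ == "__main__":
    print(parse_llamabench("llamabench.md"))  # e.g. {'pp8': 131.55, 'tg128': 45.73}
```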
Benchmarks/DataCollection/Qwen3-30B-A3B-Thinking-2507-unsloth-IQ4_NL-attn_kv_IQ4_NL-attn_output_IQ4_NL-attn_q_IQ4_NL-embeddings_IQ4_NL-ffn_down_IQ4_NL-ffn_up_gate_IQ4_NL-lm_head_Q8_0/perplexity_code.log ADDED
@@ -0,0 +1,177 @@
1
+ ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
2
+ ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
3
+ ggml_cuda_init: found 2 CUDA devices:
4
+ Device 0: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
5
+ Device 1: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
6
+ build: 7040 (92bb442ad) with cc (Ubuntu 13.3.0-6ubuntu2~24.04) 13.3.0 for x86_64-linux-gnu
7
+ llama_model_load_from_file_impl: using device CUDA0 (NVIDIA GeForce RTX 3090) (0000:01:00.0) - 20926 MiB free
8
+ llama_model_load_from_file_impl: using device CUDA1 (NVIDIA GeForce RTX 3090) (0000:03:00.0) - 23582 MiB free
9
+ llama_model_loader: loaded meta data with 39 key-value pairs and 579 tensors from /mnt/world8/AI/ToBench/Qwen3-30B-A3B-Thinking-2507-unsloth/Magic_Quant/GGUF/dc_round0_Qwen3-30B-A3B-Thinking-2507-unsloth-IQ4_NL-attn_kv_IQ4_NL-attn_output_IQ4_NL-attn_q_IQ4_NL-embeddings_IQ4_NL-ffn_down_IQ4_NL-ffn_up_gate_IQ4_NL-lm_head_Q8_0.gguf (version GGUF V3 (latest))
10
+ llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
11
+ llama_model_loader: - kv 0: general.architecture str = qwen3moe
12
+ llama_model_loader: - kv 1: general.type str = model
13
+ llama_model_loader: - kv 2: general.name str = Qwen3 30B A3B Thinking 2507 Unsloth
14
+ llama_model_loader: - kv 3: general.version str = 2507
15
+ llama_model_loader: - kv 4: general.finetune str = Thinking-unsloth
16
+ llama_model_loader: - kv 5: general.basename str = Qwen3
17
+ llama_model_loader: - kv 6: general.size_label str = 30B-A3B
18
+ llama_model_loader: - kv 7: general.license str = apache-2.0
19
+ llama_model_loader: - kv 8: general.license.link str = https://huggingface.co/Qwen/Qwen3-30B...
20
+ llama_model_loader: - kv 9: general.base_model.count u32 = 1
21
+ llama_model_loader: - kv 10: general.base_model.0.name str = Qwen3 30B A3B Thinking 2507
22
+ llama_model_loader: - kv 11: general.base_model.0.version str = 2507
23
+ llama_model_loader: - kv 12: general.base_model.0.organization str = Qwen
24
+ llama_model_loader: - kv 13: general.base_model.0.repo_url str = https://huggingface.co/Qwen/Qwen3-30B...
25
+ llama_model_loader: - kv 14: general.tags arr[str,3] = ["qwen", "qwen3", "unsloth"]
26
+ llama_model_loader: - kv 15: qwen3moe.block_count u32 = 48
27
+ llama_model_loader: - kv 16: qwen3moe.context_length u32 = 262144
28
+ llama_model_loader: - kv 17: qwen3moe.embedding_length u32 = 2048
29
+ llama_model_loader: - kv 18: qwen3moe.feed_forward_length u32 = 6144
30
+ llama_model_loader: - kv 19: qwen3moe.attention.head_count u32 = 32
31
+ llama_model_loader: - kv 20: qwen3moe.attention.head_count_kv u32 = 4
32
+ llama_model_loader: - kv 21: qwen3moe.rope.freq_base f32 = 10000000.000000
33
+ llama_model_loader: - kv 22: qwen3moe.attention.layer_norm_rms_epsilon f32 = 0.000001
34
+ llama_model_loader: - kv 23: qwen3moe.expert_used_count u32 = 8
35
+ llama_model_loader: - kv 24: qwen3moe.attention.key_length u32 = 128
36
+ llama_model_loader: - kv 25: qwen3moe.attention.value_length u32 = 128
37
+ llama_model_loader: - kv 26: qwen3moe.expert_count u32 = 128
38
+ llama_model_loader: - kv 27: qwen3moe.expert_feed_forward_length u32 = 768
39
+ llama_model_loader: - kv 28: tokenizer.ggml.model str = gpt2
40
+ llama_model_loader: - kv 29: tokenizer.ggml.pre str = qwen2
41
+ llama_model_loader: - kv 30: tokenizer.ggml.tokens arr[str,151936] = ["!", "\"", "#", "$", "%", "&", "'", ...
42
+ llama_model_loader: - kv 31: tokenizer.ggml.token_type arr[i32,151936] = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
43
+ llama_model_loader: - kv 32: tokenizer.ggml.merges arr[str,151387] = ["Ġ Ġ", "ĠĠ ĠĠ", "i n", "Ġ t",...
44
+ llama_model_loader: - kv 33: tokenizer.ggml.eos_token_id u32 = 151645
45
+ llama_model_loader: - kv 34: tokenizer.ggml.padding_token_id u32 = 151654
46
+ llama_model_loader: - kv 35: tokenizer.ggml.add_bos_token bool = false
47
+ llama_model_loader: - kv 36: tokenizer.chat_template str = {%- if tools %}\n {{- '<|im_start|>...
48
+ llama_model_loader: - kv 37: general.quantization_version u32 = 2
49
+ llama_model_loader: - kv 38: general.file_type u32 = 25
50
+ llama_model_loader: - type f32: 241 tensors
51
+ llama_model_loader: - type q8_0: 1 tensors
52
+ llama_model_loader: - type iq4_nl: 337 tensors
53
+ print_info: file format = GGUF V3 (latest)
54
+ print_info: file type = IQ4_NL - 4.5 bpw
55
+ print_info: file size = 16.18 GiB (4.55 BPW)
56
+ load: printing all EOG tokens:
57
+ load: - 151643 ('<|endoftext|>')
58
+ load: - 151645 ('<|im_end|>')
59
+ load: - 151662 ('<|fim_pad|>')
60
+ load: - 151663 ('<|repo_name|>')
61
+ load: - 151664 ('<|file_sep|>')
62
+ load: special tokens cache size = 26
63
+ load: token to piece cache size = 0.9311 MB
64
+ print_info: arch = qwen3moe
65
+ print_info: vocab_only = 0
66
+ print_info: n_ctx_train = 262144
67
+ print_info: n_embd = 2048
68
+ print_info: n_embd_inp = 2048
69
+ print_info: n_layer = 48
70
+ print_info: n_head = 32
71
+ print_info: n_head_kv = 4
72
+ print_info: n_rot = 128
73
+ print_info: n_swa = 0
74
+ print_info: is_swa_any = 0
75
+ print_info: n_embd_head_k = 128
76
+ print_info: n_embd_head_v = 128
77
+ print_info: n_gqa = 8
78
+ print_info: n_embd_k_gqa = 512
79
+ print_info: n_embd_v_gqa = 512
80
+ print_info: f_norm_eps = 0.0e+00
81
+ print_info: f_norm_rms_eps = 1.0e-06
82
+ print_info: f_clamp_kqv = 0.0e+00
83
+ print_info: f_max_alibi_bias = 0.0e+00
84
+ print_info: f_logit_scale = 0.0e+00
85
+ print_info: f_attn_scale = 0.0e+00
86
+ print_info: n_ff = 6144
87
+ print_info: n_expert = 128
88
+ print_info: n_expert_used = 8
89
+ print_info: n_expert_groups = 0
90
+ print_info: n_group_used = 0
91
+ print_info: causal attn = 1
92
+ print_info: pooling type = 0
93
+ print_info: rope type = 2
94
+ print_info: rope scaling = linear
95
+ print_info: freq_base_train = 10000000.0
96
+ print_info: freq_scale_train = 1
97
+ print_info: n_ctx_orig_yarn = 262144
98
+ print_info: rope_finetuned = unknown
99
+ print_info: model type = 30B.A3B
100
+ print_info: model params = 30.53 B
101
+ print_info: general.name = Qwen3 30B A3B Thinking 2507 Unsloth
102
+ print_info: n_ff_exp = 768
103
+ print_info: vocab type = BPE
104
+ print_info: n_vocab = 151936
105
+ print_info: n_merges = 151387
106
+ print_info: BOS token = 11 ','
107
+ print_info: EOS token = 151645 '<|im_end|>'
108
+ print_info: EOT token = 151645 '<|im_end|>'
109
+ print_info: PAD token = 151654 '<|vision_pad|>'
110
+ print_info: LF token = 198 'Ċ'
111
+ print_info: FIM PRE token = 151659 '<|fim_prefix|>'
112
+ print_info: FIM SUF token = 151661 '<|fim_suffix|>'
113
+ print_info: FIM MID token = 151660 '<|fim_middle|>'
114
+ print_info: FIM PAD token = 151662 '<|fim_pad|>'
115
+ print_info: FIM REP token = 151663 '<|repo_name|>'
116
+ print_info: FIM SEP token = 151664 '<|file_sep|>'
117
+ print_info: EOG token = 151643 '<|endoftext|>'
118
+ print_info: EOG token = 151645 '<|im_end|>'
119
+ print_info: EOG token = 151662 '<|fim_pad|>'
120
+ print_info: EOG token = 151663 '<|repo_name|>'
121
+ print_info: EOG token = 151664 '<|file_sep|>'
122
+ print_info: max token length = 256
123
+ load_tensors: loading model tensors, this can take a while... (mmap = true)
124
+ load_tensors: offloading 20 repeating layers to GPU
125
+ load_tensors: offloaded 20/49 layers to GPU
126
+ load_tensors: CPU_Mapped model buffer size = 9866.19 MiB
127
+ load_tensors: CUDA0 model buffer size = 3351.42 MiB
128
+ load_tensors: CUDA1 model buffer size = 3351.42 MiB
129
+ ....................................................................................................
130
+ llama_context: constructing llama_context
131
+ llama_context: n_seq_max = 1
132
+ llama_context: n_ctx = 2048
133
+ llama_context: n_ctx_seq = 2048
134
+ llama_context: n_batch = 2048
135
+ llama_context: n_ubatch = 512
136
+ llama_context: causal_attn = 1
137
+ llama_context: flash_attn = auto
138
+ llama_context: kv_unified = false
139
+ llama_context: freq_base = 10000000.0
140
+ llama_context: freq_scale = 1
141
+ llama_context: n_ctx_seq (2048) < n_ctx_train (262144) -- the full capacity of the model will not be utilized
142
+ llama_context: CPU output buffer size = 0.58 MiB
143
+ llama_kv_cache: CPU KV buffer size = 112.00 MiB
144
+ llama_kv_cache: CUDA0 KV buffer size = 40.00 MiB
145
+ llama_kv_cache: CUDA1 KV buffer size = 40.00 MiB
146
+ llama_kv_cache: size = 192.00 MiB ( 2048 cells, 48 layers, 1/1 seqs), K (f16): 96.00 MiB, V (f16): 96.00 MiB
147
+ llama_context: Flash Attention was auto, set to enabled
148
+ llama_context: CUDA0 compute buffer size = 616.05 MiB
149
+ llama_context: CUDA1 compute buffer size = 82.01 MiB
150
+ llama_context: CUDA_Host compute buffer size = 8.01 MiB
151
+ llama_context: graph nodes = 3031
152
+ llama_context: graph splits = 397 (with bs=512), 88 (with bs=1)
153
+ common_init_from_params: added <|endoftext|> logit bias = -inf
154
+ common_init_from_params: added <|im_end|> logit bias = -inf
155
+ common_init_from_params: added <|fim_pad|> logit bias = -inf
156
+ common_init_from_params: added <|repo_name|> logit bias = -inf
157
+ common_init_from_params: added <|file_sep|> logit bias = -inf
158
+ common_init_from_params: setting dry_penalty_last_n to ctx_size = 2048
159
+ common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable)
160
+
161
+ system_info: n_threads = 16 (n_threads_batch = 16) / 32 | CUDA : ARCHS = 860 | USE_GRAPHS = 1 | PEER_MAX_BATCH_SIZE = 128 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | BMI2 = 1 | AVX512 = 1 | AVX512_VBMI = 1 | AVX512_VNNI = 1 | AVX512_BF16 = 1 | LLAMAFILE = 1 | OPENMP = 1 | REPACK = 1 |
162
+ perplexity: tokenizing the input ..
163
+ perplexity: tokenization took 115.016 ms
164
+ perplexity: calculating perplexity over 44 chunks, n_ctx=2048, batch_size=2048, n_seq=1
165
+ perplexity: 3.34 seconds per pass - ETA 2.43 minutes
166
+ [1]1.5133,[2]1.4614,[3]1.2882,[4]1.2486,[5]1.3366,[6]1.4002,[7]1.4030,[8]1.4009,[9]1.3609,[10]1.3379,[11]1.3233,[12]1.3253,[13]1.3099,[14]1.3009,[15]1.2970,[16]1.2853,[17]1.2797,[18]1.2787,[19]1.2712,[20]1.2618,[21]1.2590,[22]1.2592,[23]1.2751,[24]1.2680,[25]1.2662,[26]1.2575,[27]1.2521,[28]1.2512,[29]1.2638,[30]1.2657,[31]1.2592,[32]1.2541,[33]1.2550,[34]1.2541,[35]1.2530,[36]1.2740,[37]1.2835,[38]1.2883,[39]1.2951,[40]1.2963,[41]1.2929,[42]1.3063,[43]1.3059,[44]1.3062,
167
+ Final estimate: PPL = 1.3062 +/- 0.00708
168
+
169
+ llama_perf_context_print: load time = 4892.48 ms
170
+ llama_perf_context_print: prompt eval time = 124979.07 ms / 90112 tokens ( 1.39 ms per token, 721.02 tokens per second)
171
+ llama_perf_context_print: eval time = 0.00 ms / 1 runs ( 0.00 ms per token, inf tokens per second)
172
+ llama_perf_context_print: total time = 126196.66 ms / 90113 tokens
173
+ llama_perf_context_print: graphs reused = 0
174
+ llama_memory_breakdown_print: | memory breakdown [MiB] | total free self model context compute unaccounted |
175
+ llama_memory_breakdown_print: | - CUDA0 (RTX 3090) | 24115 = 16810 + (4007 = 3351 + 40 + 616) + 3297 |
176
+ llama_memory_breakdown_print: | - CUDA1 (RTX 3090) | 24124 = 20000 + (3473 = 3351 + 40 + 82) + 649 |
177
+ llama_memory_breakdown_print: | - Host | 9986 = 9866 + 112 + 8 |
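Perplexity here is the exponential of the mean per-token negative log-likelihood, so the final estimates above translate directly into nats or bits per token; the code-corpus figure of 1.3062 works out to roughly 0.27 nats (about 0.39 bits) per token. A one-line check:

```python
# PPL = exp(mean NLL per token), so the code-log estimate converts to nats/bits per token.
import math

ppl = 1.3062                 # "Final estimate" from the code log above
nats = math.log(ppl)         # ≈ 0.267 nats per token
bits = nats / math.log(2)    # ≈ 0.386 bits per token
print(f"{nats:.3f} nats/token, {bits:.3f} bits/token")
```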
Benchmarks/DataCollection/Qwen3-30B-A3B-Thinking-2507-unsloth-IQ4_NL-attn_kv_IQ4_NL-attn_output_IQ4_NL-attn_q_IQ4_NL-embeddings_IQ4_NL-ffn_down_IQ4_NL-ffn_up_gate_IQ4_NL-lm_head_Q8_0/perplexity_general.log ADDED
@@ -0,0 +1,177 @@
1
+ ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
2
+ ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
3
+ ggml_cuda_init: found 2 CUDA devices:
4
+ Device 0: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
5
+ Device 1: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
6
+ build: 7040 (92bb442ad) with cc (Ubuntu 13.3.0-6ubuntu2~24.04) 13.3.0 for x86_64-linux-gnu
7
+ llama_model_load_from_file_impl: using device CUDA0 (NVIDIA GeForce RTX 3090) (0000:01:00.0) - 20929 MiB free
8
+ llama_model_load_from_file_impl: using device CUDA1 (NVIDIA GeForce RTX 3090) (0000:03:00.0) - 23582 MiB free
9
+ llama_model_loader: loaded meta data with 39 key-value pairs and 579 tensors from /mnt/world8/AI/ToBench/Qwen3-30B-A3B-Thinking-2507-unsloth/Magic_Quant/GGUF/dc_round0_Qwen3-30B-A3B-Thinking-2507-unsloth-IQ4_NL-attn_kv_IQ4_NL-attn_output_IQ4_NL-attn_q_IQ4_NL-embeddings_IQ4_NL-ffn_down_IQ4_NL-ffn_up_gate_IQ4_NL-lm_head_Q8_0.gguf (version GGUF V3 (latest))
10
+ llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
11
+ llama_model_loader: - kv 0: general.architecture str = qwen3moe
12
+ llama_model_loader: - kv 1: general.type str = model
13
+ llama_model_loader: - kv 2: general.name str = Qwen3 30B A3B Thinking 2507 Unsloth
14
+ llama_model_loader: - kv 3: general.version str = 2507
15
+ llama_model_loader: - kv 4: general.finetune str = Thinking-unsloth
16
+ llama_model_loader: - kv 5: general.basename str = Qwen3
17
+ llama_model_loader: - kv 6: general.size_label str = 30B-A3B
18
+ llama_model_loader: - kv 7: general.license str = apache-2.0
19
+ llama_model_loader: - kv 8: general.license.link str = https://huggingface.co/Qwen/Qwen3-30B...
20
+ llama_model_loader: - kv 9: general.base_model.count u32 = 1
21
+ llama_model_loader: - kv 10: general.base_model.0.name str = Qwen3 30B A3B Thinking 2507
22
+ llama_model_loader: - kv 11: general.base_model.0.version str = 2507
23
+ llama_model_loader: - kv 12: general.base_model.0.organization str = Qwen
24
+ llama_model_loader: - kv 13: general.base_model.0.repo_url str = https://huggingface.co/Qwen/Qwen3-30B...
25
+ llama_model_loader: - kv 14: general.tags arr[str,3] = ["qwen", "qwen3", "unsloth"]
26
+ llama_model_loader: - kv 15: qwen3moe.block_count u32 = 48
27
+ llama_model_loader: - kv 16: qwen3moe.context_length u32 = 262144
28
+ llama_model_loader: - kv 17: qwen3moe.embedding_length u32 = 2048
29
+ llama_model_loader: - kv 18: qwen3moe.feed_forward_length u32 = 6144
30
+ llama_model_loader: - kv 19: qwen3moe.attention.head_count u32 = 32
31
+ llama_model_loader: - kv 20: qwen3moe.attention.head_count_kv u32 = 4
32
+ llama_model_loader: - kv 21: qwen3moe.rope.freq_base f32 = 10000000.000000
33
+ llama_model_loader: - kv 22: qwen3moe.attention.layer_norm_rms_epsilon f32 = 0.000001
34
+ llama_model_loader: - kv 23: qwen3moe.expert_used_count u32 = 8
35
+ llama_model_loader: - kv 24: qwen3moe.attention.key_length u32 = 128
36
+ llama_model_loader: - kv 25: qwen3moe.attention.value_length u32 = 128
37
+ llama_model_loader: - kv 26: qwen3moe.expert_count u32 = 128
38
+ llama_model_loader: - kv 27: qwen3moe.expert_feed_forward_length u32 = 768
39
+ llama_model_loader: - kv 28: tokenizer.ggml.model str = gpt2
40
+ llama_model_loader: - kv 29: tokenizer.ggml.pre str = qwen2
41
+ llama_model_loader: - kv 30: tokenizer.ggml.tokens arr[str,151936] = ["!", "\"", "#", "$", "%", "&", "'", ...
42
+ llama_model_loader: - kv 31: tokenizer.ggml.token_type arr[i32,151936] = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
43
+ llama_model_loader: - kv 32: tokenizer.ggml.merges arr[str,151387] = ["Ġ Ġ", "ĠĠ ĠĠ", "i n", "Ġ t",...
44
+ llama_model_loader: - kv 33: tokenizer.ggml.eos_token_id u32 = 151645
45
+ llama_model_loader: - kv 34: tokenizer.ggml.padding_token_id u32 = 151654
46
+ llama_model_loader: - kv 35: tokenizer.ggml.add_bos_token bool = false
47
+ llama_model_loader: - kv 36: tokenizer.chat_template str = {%- if tools %}\n {{- '<|im_start|>...
48
+ llama_model_loader: - kv 37: general.quantization_version u32 = 2
49
+ llama_model_loader: - kv 38: general.file_type u32 = 25
50
+ llama_model_loader: - type f32: 241 tensors
51
+ llama_model_loader: - type q8_0: 1 tensors
52
+ llama_model_loader: - type iq4_nl: 337 tensors
53
+ print_info: file format = GGUF V3 (latest)
54
+ print_info: file type = IQ4_NL - 4.5 bpw
55
+ print_info: file size = 16.18 GiB (4.55 BPW)
56
+ load: printing all EOG tokens:
57
+ load: - 151643 ('<|endoftext|>')
58
+ load: - 151645 ('<|im_end|>')
59
+ load: - 151662 ('<|fim_pad|>')
60
+ load: - 151663 ('<|repo_name|>')
61
+ load: - 151664 ('<|file_sep|>')
62
+ load: special tokens cache size = 26
63
+ load: token to piece cache size = 0.9311 MB
64
+ print_info: arch = qwen3moe
65
+ print_info: vocab_only = 0
66
+ print_info: n_ctx_train = 262144
67
+ print_info: n_embd = 2048
68
+ print_info: n_embd_inp = 2048
69
+ print_info: n_layer = 48
70
+ print_info: n_head = 32
71
+ print_info: n_head_kv = 4
72
+ print_info: n_rot = 128
73
+ print_info: n_swa = 0
74
+ print_info: is_swa_any = 0
75
+ print_info: n_embd_head_k = 128
76
+ print_info: n_embd_head_v = 128
77
+ print_info: n_gqa = 8
78
+ print_info: n_embd_k_gqa = 512
79
+ print_info: n_embd_v_gqa = 512
80
+ print_info: f_norm_eps = 0.0e+00
81
+ print_info: f_norm_rms_eps = 1.0e-06
82
+ print_info: f_clamp_kqv = 0.0e+00
83
+ print_info: f_max_alibi_bias = 0.0e+00
84
+ print_info: f_logit_scale = 0.0e+00
85
+ print_info: f_attn_scale = 0.0e+00
86
+ print_info: n_ff = 6144
87
+ print_info: n_expert = 128
88
+ print_info: n_expert_used = 8
89
+ print_info: n_expert_groups = 0
90
+ print_info: n_group_used = 0
91
+ print_info: causal attn = 1
92
+ print_info: pooling type = 0
93
+ print_info: rope type = 2
94
+ print_info: rope scaling = linear
95
+ print_info: freq_base_train = 10000000.0
96
+ print_info: freq_scale_train = 1
97
+ print_info: n_ctx_orig_yarn = 262144
98
+ print_info: rope_finetuned = unknown
99
+ print_info: model type = 30B.A3B
100
+ print_info: model params = 30.53 B
101
+ print_info: general.name = Qwen3 30B A3B Thinking 2507 Unsloth
102
+ print_info: n_ff_exp = 768
103
+ print_info: vocab type = BPE
104
+ print_info: n_vocab = 151936
105
+ print_info: n_merges = 151387
106
+ print_info: BOS token = 11 ','
107
+ print_info: EOS token = 151645 '<|im_end|>'
108
+ print_info: EOT token = 151645 '<|im_end|>'
109
+ print_info: PAD token = 151654 '<|vision_pad|>'
110
+ print_info: LF token = 198 'Ċ'
111
+ print_info: FIM PRE token = 151659 '<|fim_prefix|>'
112
+ print_info: FIM SUF token = 151661 '<|fim_suffix|>'
113
+ print_info: FIM MID token = 151660 '<|fim_middle|>'
114
+ print_info: FIM PAD token = 151662 '<|fim_pad|>'
115
+ print_info: FIM REP token = 151663 '<|repo_name|>'
116
+ print_info: FIM SEP token = 151664 '<|file_sep|>'
117
+ print_info: EOG token = 151643 '<|endoftext|>'
118
+ print_info: EOG token = 151645 '<|im_end|>'
119
+ print_info: EOG token = 151662 '<|fim_pad|>'
120
+ print_info: EOG token = 151663 '<|repo_name|>'
121
+ print_info: EOG token = 151664 '<|file_sep|>'
122
+ print_info: max token length = 256
123
+ load_tensors: loading model tensors, this can take a while... (mmap = true)
124
+ load_tensors: offloading 20 repeating layers to GPU
125
+ load_tensors: offloaded 20/49 layers to GPU
126
+ load_tensors: CPU_Mapped model buffer size = 9866.19 MiB
127
+ load_tensors: CUDA0 model buffer size = 3351.42 MiB
128
+ load_tensors: CUDA1 model buffer size = 3351.42 MiB
129
+ ....................................................................................................
130
+ llama_context: constructing llama_context
131
+ llama_context: n_seq_max = 1
132
+ llama_context: n_ctx = 2048
133
+ llama_context: n_ctx_seq = 2048
134
+ llama_context: n_batch = 2048
135
+ llama_context: n_ubatch = 512
136
+ llama_context: causal_attn = 1
137
+ llama_context: flash_attn = auto
138
+ llama_context: kv_unified = false
139
+ llama_context: freq_base = 10000000.0
140
+ llama_context: freq_scale = 1
141
+ llama_context: n_ctx_seq (2048) < n_ctx_train (262144) -- the full capacity of the model will not be utilized
142
+ llama_context: CPU output buffer size = 0.58 MiB
143
+ llama_kv_cache: CPU KV buffer size = 112.00 MiB
144
+ llama_kv_cache: CUDA0 KV buffer size = 40.00 MiB
145
+ llama_kv_cache: CUDA1 KV buffer size = 40.00 MiB
146
+ llama_kv_cache: size = 192.00 MiB ( 2048 cells, 48 layers, 1/1 seqs), K (f16): 96.00 MiB, V (f16): 96.00 MiB
147
+ llama_context: Flash Attention was auto, set to enabled
148
+ llama_context: CUDA0 compute buffer size = 616.05 MiB
149
+ llama_context: CUDA1 compute buffer size = 82.01 MiB
150
+ llama_context: CUDA_Host compute buffer size = 8.01 MiB
151
+ llama_context: graph nodes = 3031
152
+ llama_context: graph splits = 397 (with bs=512), 88 (with bs=1)
153
+ common_init_from_params: added <|endoftext|> logit bias = -inf
154
+ common_init_from_params: added <|im_end|> logit bias = -inf
155
+ common_init_from_params: added <|fim_pad|> logit bias = -inf
156
+ common_init_from_params: added <|repo_name|> logit bias = -inf
157
+ common_init_from_params: added <|file_sep|> logit bias = -inf
158
+ common_init_from_params: setting dry_penalty_last_n to ctx_size = 2048
159
+ common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable)
160
+
161
+ system_info: n_threads = 16 (n_threads_batch = 16) / 32 | CUDA : ARCHS = 860 | USE_GRAPHS = 1 | PEER_MAX_BATCH_SIZE = 128 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | BMI2 = 1 | AVX512 = 1 | AVX512_VBMI = 1 | AVX512_VNNI = 1 | AVX512_BF16 = 1 | LLAMAFILE = 1 | OPENMP = 1 | REPACK = 1 |
162
+ perplexity: tokenizing the input ..
163
+ perplexity: tokenization took 48.552 ms
164
+ perplexity: calculating perplexity over 15 chunks, n_ctx=2048, batch_size=2048, n_seq=1
165
+ perplexity: 3.34 seconds per pass - ETA 0.83 minutes
166
+ [1]5.4240,[2]6.4552,[3]6.8390,[4]6.7454,[5]6.6753,[6]5.7589,[7]5.2620,[8]5.2916,[9]5.5976,[10]5.7269,[11]5.7740,[12]6.0750,[13]6.1452,[14]6.2766,[15]6.3521,
167
+ Final estimate: PPL = 6.3521 +/- 0.12931
168
+
169
+ llama_perf_context_print: load time = 2322.23 ms
170
+ llama_perf_context_print: prompt eval time = 45246.07 ms / 30720 tokens ( 1.47 ms per token, 678.95 tokens per second)
171
+ llama_perf_context_print: eval time = 0.00 ms / 1 runs ( 0.00 ms per token, inf tokens per second)
172
+ llama_perf_context_print: total time = 45672.70 ms / 30721 tokens
173
+ llama_perf_context_print: graphs reused = 0
174
+ llama_memory_breakdown_print: | memory breakdown [MiB] | total free self model context compute unaccounted |
175
+ llama_memory_breakdown_print: | - CUDA0 (RTX 3090) | 24115 = 16810 + (4007 = 3351 + 40 + 616) + 3297 |
176
+ llama_memory_breakdown_print: | - CUDA1 (RTX 3090) | 24124 = 20000 + (3473 = 3351 + 40 + 82) + 649 |
177
+ llama_memory_breakdown_print: | - Host | 9986 = 9866 + 112 + 8 |
Benchmarks/DataCollection/Qwen3-30B-A3B-Thinking-2507-unsloth-IQ4_NL-attn_kv_IQ4_NL-attn_output_IQ4_NL-attn_q_IQ4_NL-embeddings_IQ4_NL-ffn_down_IQ4_NL-ffn_up_gate_IQ4_NL-lm_head_Q8_0/perplexity_math.log ADDED
@@ -0,0 +1,177 @@
1
+ ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
2
+ ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
3
+ ggml_cuda_init: found 2 CUDA devices:
4
+ Device 0: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
5
+ Device 1: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
6
+ build: 7040 (92bb442ad) with cc (Ubuntu 13.3.0-6ubuntu2~24.04) 13.3.0 for x86_64-linux-gnu
7
+ llama_model_load_from_file_impl: using device CUDA0 (NVIDIA GeForce RTX 3090) (0000:01:00.0) - 20926 MiB free
8
+ llama_model_load_from_file_impl: using device CUDA1 (NVIDIA GeForce RTX 3090) (0000:03:00.0) - 23582 MiB free
9
+ llama_model_loader: loaded meta data with 39 key-value pairs and 579 tensors from /mnt/world8/AI/ToBench/Qwen3-30B-A3B-Thinking-2507-unsloth/Magic_Quant/GGUF/dc_round0_Qwen3-30B-A3B-Thinking-2507-unsloth-IQ4_NL-attn_kv_IQ4_NL-attn_output_IQ4_NL-attn_q_IQ4_NL-embeddings_IQ4_NL-ffn_down_IQ4_NL-ffn_up_gate_IQ4_NL-lm_head_Q8_0.gguf (version GGUF V3 (latest))
10
+ llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
11
+ llama_model_loader: - kv 0: general.architecture str = qwen3moe
12
+ llama_model_loader: - kv 1: general.type str = model
13
+ llama_model_loader: - kv 2: general.name str = Qwen3 30B A3B Thinking 2507 Unsloth
14
+ llama_model_loader: - kv 3: general.version str = 2507
15
+ llama_model_loader: - kv 4: general.finetune str = Thinking-unsloth
16
+ llama_model_loader: - kv 5: general.basename str = Qwen3
17
+ llama_model_loader: - kv 6: general.size_label str = 30B-A3B
18
+ llama_model_loader: - kv 7: general.license str = apache-2.0
19
+ llama_model_loader: - kv 8: general.license.link str = https://huggingface.co/Qwen/Qwen3-30B...
20
+ llama_model_loader: - kv 9: general.base_model.count u32 = 1
21
+ llama_model_loader: - kv 10: general.base_model.0.name str = Qwen3 30B A3B Thinking 2507
22
+ llama_model_loader: - kv 11: general.base_model.0.version str = 2507
23
+ llama_model_loader: - kv 12: general.base_model.0.organization str = Qwen
24
+ llama_model_loader: - kv 13: general.base_model.0.repo_url str = https://huggingface.co/Qwen/Qwen3-30B...
25
+ llama_model_loader: - kv 14: general.tags arr[str,3] = ["qwen", "qwen3", "unsloth"]
26
+ llama_model_loader: - kv 15: qwen3moe.block_count u32 = 48
27
+ llama_model_loader: - kv 16: qwen3moe.context_length u32 = 262144
28
+ llama_model_loader: - kv 17: qwen3moe.embedding_length u32 = 2048
29
+ llama_model_loader: - kv 18: qwen3moe.feed_forward_length u32 = 6144
30
+ llama_model_loader: - kv 19: qwen3moe.attention.head_count u32 = 32
31
+ llama_model_loader: - kv 20: qwen3moe.attention.head_count_kv u32 = 4
32
+ llama_model_loader: - kv 21: qwen3moe.rope.freq_base f32 = 10000000.000000
33
+ llama_model_loader: - kv 22: qwen3moe.attention.layer_norm_rms_epsilon f32 = 0.000001
34
+ llama_model_loader: - kv 23: qwen3moe.expert_used_count u32 = 8
35
+ llama_model_loader: - kv 24: qwen3moe.attention.key_length u32 = 128
36
+ llama_model_loader: - kv 25: qwen3moe.attention.value_length u32 = 128
37
+ llama_model_loader: - kv 26: qwen3moe.expert_count u32 = 128
38
+ llama_model_loader: - kv 27: qwen3moe.expert_feed_forward_length u32 = 768
39
+ llama_model_loader: - kv 28: tokenizer.ggml.model str = gpt2
40
+ llama_model_loader: - kv 29: tokenizer.ggml.pre str = qwen2
41
+ llama_model_loader: - kv 30: tokenizer.ggml.tokens arr[str,151936] = ["!", "\"", "#", "$", "%", "&", "'", ...
42
+ llama_model_loader: - kv 31: tokenizer.ggml.token_type arr[i32,151936] = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
43
+ llama_model_loader: - kv 32: tokenizer.ggml.merges arr[str,151387] = ["Ġ Ġ", "ĠĠ ĠĠ", "i n", "Ġ t",...
44
+ llama_model_loader: - kv 33: tokenizer.ggml.eos_token_id u32 = 151645
45
+ llama_model_loader: - kv 34: tokenizer.ggml.padding_token_id u32 = 151654
46
+ llama_model_loader: - kv 35: tokenizer.ggml.add_bos_token bool = false
47
+ llama_model_loader: - kv 36: tokenizer.chat_template str = {%- if tools %}\n {{- '<|im_start|>...
48
+ llama_model_loader: - kv 37: general.quantization_version u32 = 2
49
+ llama_model_loader: - kv 38: general.file_type u32 = 25
50
+ llama_model_loader: - type f32: 241 tensors
51
+ llama_model_loader: - type q8_0: 1 tensors
52
+ llama_model_loader: - type iq4_nl: 337 tensors
53
+ print_info: file format = GGUF V3 (latest)
54
+ print_info: file type = IQ4_NL - 4.5 bpw
55
+ print_info: file size = 16.18 GiB (4.55 BPW)
56
+ load: printing all EOG tokens:
57
+ load: - 151643 ('<|endoftext|>')
58
+ load: - 151645 ('<|im_end|>')
59
+ load: - 151662 ('<|fim_pad|>')
60
+ load: - 151663 ('<|repo_name|>')
61
+ load: - 151664 ('<|file_sep|>')
62
+ load: special tokens cache size = 26
63
+ load: token to piece cache size = 0.9311 MB
64
+ print_info: arch = qwen3moe
65
+ print_info: vocab_only = 0
66
+ print_info: n_ctx_train = 262144
67
+ print_info: n_embd = 2048
68
+ print_info: n_embd_inp = 2048
69
+ print_info: n_layer = 48
70
+ print_info: n_head = 32
71
+ print_info: n_head_kv = 4
72
+ print_info: n_rot = 128
73
+ print_info: n_swa = 0
74
+ print_info: is_swa_any = 0
75
+ print_info: n_embd_head_k = 128
76
+ print_info: n_embd_head_v = 128
77
+ print_info: n_gqa = 8
78
+ print_info: n_embd_k_gqa = 512
79
+ print_info: n_embd_v_gqa = 512
80
+ print_info: f_norm_eps = 0.0e+00
81
+ print_info: f_norm_rms_eps = 1.0e-06
82
+ print_info: f_clamp_kqv = 0.0e+00
83
+ print_info: f_max_alibi_bias = 0.0e+00
84
+ print_info: f_logit_scale = 0.0e+00
85
+ print_info: f_attn_scale = 0.0e+00
86
+ print_info: n_ff = 6144
87
+ print_info: n_expert = 128
88
+ print_info: n_expert_used = 8
89
+ print_info: n_expert_groups = 0
90
+ print_info: n_group_used = 0
91
+ print_info: causal attn = 1
92
+ print_info: pooling type = 0
93
+ print_info: rope type = 2
94
+ print_info: rope scaling = linear
95
+ print_info: freq_base_train = 10000000.0
96
+ print_info: freq_scale_train = 1
97
+ print_info: n_ctx_orig_yarn = 262144
98
+ print_info: rope_finetuned = unknown
99
+ print_info: model type = 30B.A3B
100
+ print_info: model params = 30.53 B
101
+ print_info: general.name = Qwen3 30B A3B Thinking 2507 Unsloth
102
+ print_info: n_ff_exp = 768
103
+ print_info: vocab type = BPE
104
+ print_info: n_vocab = 151936
105
+ print_info: n_merges = 151387
106
+ print_info: BOS token = 11 ','
107
+ print_info: EOS token = 151645 '<|im_end|>'
108
+ print_info: EOT token = 151645 '<|im_end|>'
109
+ print_info: PAD token = 151654 '<|vision_pad|>'
110
+ print_info: LF token = 198 'Ċ'
111
+ print_info: FIM PRE token = 151659 '<|fim_prefix|>'
112
+ print_info: FIM SUF token = 151661 '<|fim_suffix|>'
113
+ print_info: FIM MID token = 151660 '<|fim_middle|>'
114
+ print_info: FIM PAD token = 151662 '<|fim_pad|>'
115
+ print_info: FIM REP token = 151663 '<|repo_name|>'
116
+ print_info: FIM SEP token = 151664 '<|file_sep|>'
117
+ print_info: EOG token = 151643 '<|endoftext|>'
118
+ print_info: EOG token = 151645 '<|im_end|>'
119
+ print_info: EOG token = 151662 '<|fim_pad|>'
120
+ print_info: EOG token = 151663 '<|repo_name|>'
121
+ print_info: EOG token = 151664 '<|file_sep|>'
122
+ print_info: max token length = 256
123
+ load_tensors: loading model tensors, this can take a while... (mmap = true)
124
+ load_tensors: offloading 20 repeating layers to GPU
125
+ load_tensors: offloaded 20/49 layers to GPU
126
+ load_tensors: CPU_Mapped model buffer size = 9866.19 MiB
127
+ load_tensors: CUDA0 model buffer size = 3351.42 MiB
128
+ load_tensors: CUDA1 model buffer size = 3351.42 MiB
129
+ ....................................................................................................
130
+ llama_context: constructing llama_context
131
+ llama_context: n_seq_max = 1
132
+ llama_context: n_ctx = 2048
133
+ llama_context: n_ctx_seq = 2048
134
+ llama_context: n_batch = 2048
135
+ llama_context: n_ubatch = 512
136
+ llama_context: causal_attn = 1
137
+ llama_context: flash_attn = auto
138
+ llama_context: kv_unified = false
139
+ llama_context: freq_base = 10000000.0
140
+ llama_context: freq_scale = 1
141
+ llama_context: n_ctx_seq (2048) < n_ctx_train (262144) -- the full capacity of the model will not be utilized
142
+ llama_context: CPU output buffer size = 0.58 MiB
143
+ llama_kv_cache: CPU KV buffer size = 112.00 MiB
144
+ llama_kv_cache: CUDA0 KV buffer size = 40.00 MiB
145
+ llama_kv_cache: CUDA1 KV buffer size = 40.00 MiB
146
+ llama_kv_cache: size = 192.00 MiB ( 2048 cells, 48 layers, 1/1 seqs), K (f16): 96.00 MiB, V (f16): 96.00 MiB
147
+ llama_context: Flash Attention was auto, set to enabled
148
+ llama_context: CUDA0 compute buffer size = 616.05 MiB
149
+ llama_context: CUDA1 compute buffer size = 82.01 MiB
150
+ llama_context: CUDA_Host compute buffer size = 8.01 MiB
151
+ llama_context: graph nodes = 3031
152
+ llama_context: graph splits = 397 (with bs=512), 88 (with bs=1)
153
+ common_init_from_params: added <|endoftext|> logit bias = -inf
154
+ common_init_from_params: added <|im_end|> logit bias = -inf
155
+ common_init_from_params: added <|fim_pad|> logit bias = -inf
156
+ common_init_from_params: added <|repo_name|> logit bias = -inf
157
+ common_init_from_params: added <|file_sep|> logit bias = -inf
158
+ common_init_from_params: setting dry_penalty_last_n to ctx_size = 2048
159
+ common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable)
160
+
161
+ system_info: n_threads = 16 (n_threads_batch = 16) / 32 | CUDA : ARCHS = 860 | USE_GRAPHS = 1 | PEER_MAX_BATCH_SIZE = 128 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | BMI2 = 1 | AVX512 = 1 | AVX512_VBMI = 1 | AVX512_VNNI = 1 | AVX512_BF16 = 1 | LLAMAFILE = 1 | OPENMP = 1 | REPACK = 1 |
162
+ perplexity: tokenizing the input ..
163
+ perplexity: tokenization took 44.117 ms
164
+ perplexity: calculating perplexity over 16 chunks, n_ctx=2048, batch_size=2048, n_seq=1
165
+ perplexity: 3.37 seconds per pass - ETA 0.88 minutes
166
+ [1]4.5758,[2]4.9793,[3]5.2509,[4]5.4211,[5]5.6307,[6]5.6351,[7]5.6187,[8]5.5700,[9]5.6239,[10]5.6121,[11]5.6193,[12]5.6142,[13]5.6899,[14]5.7023,[15]5.6984,[16]5.7063,
167
+ Final estimate: PPL = 5.7063 +/- 0.10414
168
+
169
+ llama_perf_context_print: load time = 2343.02 ms
170
+ llama_perf_context_print: prompt eval time = 49355.25 ms / 32768 tokens ( 1.51 ms per token, 663.92 tokens per second)
171
+ llama_perf_context_print: eval time = 0.00 ms / 1 runs ( 0.00 ms per token, inf tokens per second)
172
+ llama_perf_context_print: total time = 49799.01 ms / 32769 tokens
173
+ llama_perf_context_print: graphs reused = 0
174
+ llama_memory_breakdown_print: | memory breakdown [MiB] | total free self model context compute unaccounted |
175
+ llama_memory_breakdown_print: | - CUDA0 (RTX 3090) | 24115 = 16809 + (4007 = 3351 + 40 + 616) + 3298 |
176
+ llama_memory_breakdown_print: | - CUDA1 (RTX 3090) | 24124 = 20000 + (3473 = 3351 + 40 + 82) + 649 |
177
+ llama_memory_breakdown_print: | - Host | 9986 = 9866 + 112 + 8 |
Benchmarks/DataCollection/Qwen3-30B-A3B-Thinking-2507-unsloth-IQ4_NL-attn_kv_IQ4_NL-attn_output_IQ4_NL-attn_q_IQ4_NL-embeddings_Q5_K-ffn_down_IQ4_NL-ffn_up_gate_IQ4_NL-lm_head_IQ4_NL/bench_metrics.json ADDED
@@ -0,0 +1,44 @@
1
+ {
2
+ "raw_metrics": {
3
+ "llamabench": {
4
+ "backend": "CUDA",
5
+ "log_path": "Qwen3-30B-A3B-Thinking-2507-unsloth-IQ4_NL-attn_kv_IQ4_NL-attn_output_IQ4_NL-attn_q_IQ4_NL-embeddings_Q5_K-ffn_down_IQ4_NL-ffn_up_gate_IQ4_NL-lm_head_IQ4_NL/llamabench.md",
6
+ "ngl": "35",
7
+ "raw_row": {
8
+ "backend": "CUDA",
9
+ "model": "qwen3moe 30B.A3B IQ4_NL - 4.5 bpw",
10
+ "ngl": "35",
11
+ "params": "30.53 B",
12
+ "size": "16.07 GiB",
13
+ "t/s": "140.60 \u00b1 5.97",
14
+ "test": "pp8",
15
+ "tps_value": 140.6
16
+ },
17
+ "test": "pp8",
18
+ "tps": 140.6
19
+ },
20
+ "perplexity": {
21
+ "code": {
22
+ "log_path": "Qwen3-30B-A3B-Thinking-2507-unsloth-IQ4_NL-attn_kv_IQ4_NL-attn_output_IQ4_NL-attn_q_IQ4_NL-embeddings_Q5_K-ffn_down_IQ4_NL-ffn_up_gate_IQ4_NL-lm_head_IQ4_NL/perplexity_code.log",
23
+ "ppl": 1.3094,
24
+ "ppl_error": 0.00712
25
+ },
26
+ "general": {
27
+ "log_path": "Qwen3-30B-A3B-Thinking-2507-unsloth-IQ4_NL-attn_kv_IQ4_NL-attn_output_IQ4_NL-attn_q_IQ4_NL-embeddings_Q5_K-ffn_down_IQ4_NL-ffn_up_gate_IQ4_NL-lm_head_IQ4_NL/perplexity_general.log",
28
+ "ppl": 6.5314,
29
+ "ppl_error": 0.13444
30
+ },
31
+ "math": {
32
+ "log_path": "Qwen3-30B-A3B-Thinking-2507-unsloth-IQ4_NL-attn_kv_IQ4_NL-attn_output_IQ4_NL-attn_q_IQ4_NL-embeddings_Q5_K-ffn_down_IQ4_NL-ffn_up_gate_IQ4_NL-lm_head_IQ4_NL/perplexity_math.log",
33
+ "ppl": 5.8549,
34
+ "ppl_error": 0.10805
35
+ }
36
+ }
37
+ },
38
+ "summary": {
39
+ "avg_prec_loss_pct": 2.8064,
40
+ "bench_tps": 140.6,
41
+ "file_size_bytes": 17263163424,
42
+ "file_size_gb": 16.08
43
+ }
44
+ }
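The bench_metrics.json entries above simply aggregate the "Final estimate: PPL = … +/- …" lines from the three perplexity logs and the pp8 row of llamabench.md. As a hedged illustration only (this is not the repository's actual collection script; the paths, function names, and the `base` placeholder are hypothetical), a minimal parser for those two formats could look like this:

```python
# Hypothetical helper, not part of this repo's tooling. It only assumes the
# literal line formats visible in the files above:
#   perplexity_*.log -> "Final estimate: PPL = 6.5314 +/- 0.13444"
#   llamabench.md    -> "| ... | 35 | pp8 | 140.60 ± 5.97 |"
import re
from pathlib import Path

def read_ppl(log_path: str) -> dict:
    """Extract the final PPL estimate and its error from a perplexity log."""
    text = Path(log_path).read_text(encoding="utf-8")
    m = re.search(r"Final estimate: PPL = ([0-9.]+) \+/- ([0-9.]+)", text)
    if m is None:
        raise ValueError(f"no final PPL estimate in {log_path}")
    return {"log_path": log_path, "ppl": float(m.group(1)), "ppl_error": float(m.group(2))}

def read_pp8_tps(md_path: str) -> float:
    """Extract the pp8 tokens/sec mean from a llama-bench markdown table."""
    for line in Path(md_path).read_text(encoding="utf-8").splitlines():
        cells = [c.strip() for c in line.split("|")]
        if "pp8" in cells:
            # the t/s cell looks like "140.60 ± 5.97"; keep the mean only
            return float(cells[-2].split()[0])
    raise ValueError(f"no pp8 row in {md_path}")

if __name__ == "__main__":
    base = "some-quant-directory"  # placeholder for one of the DataCollection folders
    metrics = {
        "perplexity": {k: read_ppl(f"{base}/perplexity_{k}.log")
                       for k in ("code", "general", "math")},
        "bench_tps": read_pp8_tps(f"{base}/llamabench.md"),
    }
    print(metrics)
```

Run against one of the directories shown here, this would reproduce the `ppl`, `ppl_error`, and `bench_tps` values recorded in the corresponding bench_metrics.json.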
Benchmarks/DataCollection/Qwen3-30B-A3B-Thinking-2507-unsloth-IQ4_NL-attn_kv_IQ4_NL-attn_output_IQ4_NL-attn_q_IQ4_NL-embeddings_Q5_K-ffn_down_IQ4_NL-ffn_up_gate_IQ4_NL-lm_head_IQ4_NL/llamabench.md ADDED
@@ -0,0 +1,11 @@
1
+ ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
2
+ ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
3
+ ggml_cuda_init: found 2 CUDA devices:
4
+ Device 0: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
5
+ Device 1: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
6
+ | model | size | params | backend | ngl | test | t/s |
7
+ | ------------------------------ | ---------: | ---------: | ---------- | --: | --------------: | -------------------: |
8
+ | qwen3moe 30B.A3B IQ4_NL - 4.5 bpw | 16.07 GiB | 30.53 B | CUDA | 35 | pp8 | 140.60 ± 5.97 |
9
+ | qwen3moe 30B.A3B IQ4_NL - 4.5 bpw | 16.07 GiB | 30.53 B | CUDA | 35 | tg128 | 50.90 ± 0.12 |
10
+
11
+ build: 92bb442ad (7040)
Benchmarks/DataCollection/Qwen3-30B-A3B-Thinking-2507-unsloth-IQ4_NL-attn_kv_IQ4_NL-attn_output_IQ4_NL-attn_q_IQ4_NL-embeddings_Q5_K-ffn_down_IQ4_NL-ffn_up_gate_IQ4_NL-lm_head_IQ4_NL/perplexity_code.log ADDED
@@ -0,0 +1,177 @@
1
+ ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
2
+ ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
3
+ ggml_cuda_init: found 2 CUDA devices:
4
+ Device 0: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
5
+ Device 1: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
6
+ build: 7040 (92bb442ad) with cc (Ubuntu 13.3.0-6ubuntu2~24.04) 13.3.0 for x86_64-linux-gnu
7
+ llama_model_load_from_file_impl: using device CUDA0 (NVIDIA GeForce RTX 3090) (0000:01:00.0) - 20671 MiB free
8
+ llama_model_load_from_file_impl: using device CUDA1 (NVIDIA GeForce RTX 3090) (0000:03:00.0) - 23582 MiB free
9
+ llama_model_loader: loaded meta data with 39 key-value pairs and 579 tensors from /mnt/world8/AI/ToBench/Qwen3-30B-A3B-Thinking-2507-unsloth/Magic_Quant/GGUF/dc_round0_Qwen3-30B-A3B-Thinking-2507-unsloth-IQ4_NL-attn_kv_IQ4_NL-attn_output_IQ4_NL-attn_q_IQ4_NL-embeddings_Q5_K-ffn_down_IQ4_NL-ffn_up_gate_IQ4_NL-lm_head_IQ4_NL.gguf (version GGUF V3 (latest))
10
+ llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
11
+ llama_model_loader: - kv 0: general.architecture str = qwen3moe
12
+ llama_model_loader: - kv 1: general.type str = model
13
+ llama_model_loader: - kv 2: general.name str = Qwen3 30B A3B Thinking 2507 Unsloth
14
+ llama_model_loader: - kv 3: general.version str = 2507
15
+ llama_model_loader: - kv 4: general.finetune str = Thinking-unsloth
16
+ llama_model_loader: - kv 5: general.basename str = Qwen3
17
+ llama_model_loader: - kv 6: general.size_label str = 30B-A3B
18
+ llama_model_loader: - kv 7: general.license str = apache-2.0
19
+ llama_model_loader: - kv 8: general.license.link str = https://huggingface.co/Qwen/Qwen3-30B...
20
+ llama_model_loader: - kv 9: general.base_model.count u32 = 1
21
+ llama_model_loader: - kv 10: general.base_model.0.name str = Qwen3 30B A3B Thinking 2507
22
+ llama_model_loader: - kv 11: general.base_model.0.version str = 2507
23
+ llama_model_loader: - kv 12: general.base_model.0.organization str = Qwen
24
+ llama_model_loader: - kv 13: general.base_model.0.repo_url str = https://huggingface.co/Qwen/Qwen3-30B...
25
+ llama_model_loader: - kv 14: general.tags arr[str,3] = ["qwen", "qwen3", "unsloth"]
26
+ llama_model_loader: - kv 15: qwen3moe.block_count u32 = 48
27
+ llama_model_loader: - kv 16: qwen3moe.context_length u32 = 262144
28
+ llama_model_loader: - kv 17: qwen3moe.embedding_length u32 = 2048
29
+ llama_model_loader: - kv 18: qwen3moe.feed_forward_length u32 = 6144
30
+ llama_model_loader: - kv 19: qwen3moe.attention.head_count u32 = 32
31
+ llama_model_loader: - kv 20: qwen3moe.attention.head_count_kv u32 = 4
32
+ llama_model_loader: - kv 21: qwen3moe.rope.freq_base f32 = 10000000.000000
33
+ llama_model_loader: - kv 22: qwen3moe.attention.layer_norm_rms_epsilon f32 = 0.000001
34
+ llama_model_loader: - kv 23: qwen3moe.expert_used_count u32 = 8
35
+ llama_model_loader: - kv 24: qwen3moe.attention.key_length u32 = 128
36
+ llama_model_loader: - kv 25: qwen3moe.attention.value_length u32 = 128
37
+ llama_model_loader: - kv 26: qwen3moe.expert_count u32 = 128
38
+ llama_model_loader: - kv 27: qwen3moe.expert_feed_forward_length u32 = 768
39
+ llama_model_loader: - kv 28: tokenizer.ggml.model str = gpt2
40
+ llama_model_loader: - kv 29: tokenizer.ggml.pre str = qwen2
41
+ llama_model_loader: - kv 30: tokenizer.ggml.tokens arr[str,151936] = ["!", "\"", "#", "$", "%", "&", "'", ...
42
+ llama_model_loader: - kv 31: tokenizer.ggml.token_type arr[i32,151936] = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
43
+ llama_model_loader: - kv 32: tokenizer.ggml.merges arr[str,151387] = ["Ġ Ġ", "ĠĠ ĠĠ", "i n", "Ġ t",...
44
+ llama_model_loader: - kv 33: tokenizer.ggml.eos_token_id u32 = 151645
45
+ llama_model_loader: - kv 34: tokenizer.ggml.padding_token_id u32 = 151654
46
+ llama_model_loader: - kv 35: tokenizer.ggml.add_bos_token bool = false
47
+ llama_model_loader: - kv 36: tokenizer.chat_template str = {%- if tools %}\n {{- '<|im_start|>...
48
+ llama_model_loader: - kv 37: general.quantization_version u32 = 2
49
+ llama_model_loader: - kv 38: general.file_type u32 = 25
50
+ llama_model_loader: - type f32: 241 tensors
51
+ llama_model_loader: - type q5_K: 1 tensors
52
+ llama_model_loader: - type iq4_nl: 337 tensors
53
+ print_info: file format = GGUF V3 (latest)
54
+ print_info: file type = IQ4_NL - 4.5 bpw
55
+ print_info: file size = 16.07 GiB (4.52 BPW)
56
+ load: printing all EOG tokens:
57
+ load: - 151643 ('<|endoftext|>')
58
+ load: - 151645 ('<|im_end|>')
59
+ load: - 151662 ('<|fim_pad|>')
60
+ load: - 151663 ('<|repo_name|>')
61
+ load: - 151664 ('<|file_sep|>')
62
+ load: special tokens cache size = 26
63
+ load: token to piece cache size = 0.9311 MB
64
+ print_info: arch = qwen3moe
65
+ print_info: vocab_only = 0
66
+ print_info: n_ctx_train = 262144
67
+ print_info: n_embd = 2048
68
+ print_info: n_embd_inp = 2048
69
+ print_info: n_layer = 48
70
+ print_info: n_head = 32
71
+ print_info: n_head_kv = 4
72
+ print_info: n_rot = 128
73
+ print_info: n_swa = 0
74
+ print_info: is_swa_any = 0
75
+ print_info: n_embd_head_k = 128
76
+ print_info: n_embd_head_v = 128
77
+ print_info: n_gqa = 8
78
+ print_info: n_embd_k_gqa = 512
79
+ print_info: n_embd_v_gqa = 512
80
+ print_info: f_norm_eps = 0.0e+00
81
+ print_info: f_norm_rms_eps = 1.0e-06
82
+ print_info: f_clamp_kqv = 0.0e+00
83
+ print_info: f_max_alibi_bias = 0.0e+00
84
+ print_info: f_logit_scale = 0.0e+00
85
+ print_info: f_attn_scale = 0.0e+00
86
+ print_info: n_ff = 6144
87
+ print_info: n_expert = 128
88
+ print_info: n_expert_used = 8
89
+ print_info: n_expert_groups = 0
90
+ print_info: n_group_used = 0
91
+ print_info: causal attn = 1
92
+ print_info: pooling type = 0
93
+ print_info: rope type = 2
94
+ print_info: rope scaling = linear
95
+ print_info: freq_base_train = 10000000.0
96
+ print_info: freq_scale_train = 1
97
+ print_info: n_ctx_orig_yarn = 262144
98
+ print_info: rope_finetuned = unknown
99
+ print_info: model type = 30B.A3B
100
+ print_info: model params = 30.53 B
101
+ print_info: general.name = Qwen3 30B A3B Thinking 2507 Unsloth
102
+ print_info: n_ff_exp = 768
103
+ print_info: vocab type = BPE
104
+ print_info: n_vocab = 151936
105
+ print_info: n_merges = 151387
106
+ print_info: BOS token = 11 ','
107
+ print_info: EOS token = 151645 '<|im_end|>'
108
+ print_info: EOT token = 151645 '<|im_end|>'
109
+ print_info: PAD token = 151654 '<|vision_pad|>'
110
+ print_info: LF token = 198 'Ċ'
111
+ print_info: FIM PRE token = 151659 '<|fim_prefix|>'
112
+ print_info: FIM SUF token = 151661 '<|fim_suffix|>'
113
+ print_info: FIM MID token = 151660 '<|fim_middle|>'
114
+ print_info: FIM PAD token = 151662 '<|fim_pad|>'
115
+ print_info: FIM REP token = 151663 '<|repo_name|>'
116
+ print_info: FIM SEP token = 151664 '<|file_sep|>'
117
+ print_info: EOG token = 151643 '<|endoftext|>'
118
+ print_info: EOG token = 151645 '<|im_end|>'
119
+ print_info: EOG token = 151662 '<|fim_pad|>'
120
+ print_info: EOG token = 151663 '<|repo_name|>'
121
+ print_info: EOG token = 151664 '<|file_sep|>'
122
+ print_info: max token length = 256
123
+ load_tensors: loading model tensors, this can take a while... (mmap = true)
124
+ load_tensors: offloading 20 repeating layers to GPU
125
+ load_tensors: offloaded 20/49 layers to GPU
126
+ load_tensors: CPU_Mapped model buffer size = 9754.91 MiB
127
+ load_tensors: CUDA0 model buffer size = 3351.42 MiB
128
+ load_tensors: CUDA1 model buffer size = 3351.42 MiB
129
+ ....................................................................................................
130
+ llama_context: constructing llama_context
131
+ llama_context: n_seq_max = 1
132
+ llama_context: n_ctx = 2048
133
+ llama_context: n_ctx_seq = 2048
134
+ llama_context: n_batch = 2048
135
+ llama_context: n_ubatch = 512
136
+ llama_context: causal_attn = 1
137
+ llama_context: flash_attn = auto
138
+ llama_context: kv_unified = false
139
+ llama_context: freq_base = 10000000.0
140
+ llama_context: freq_scale = 1
141
+ llama_context: n_ctx_seq (2048) < n_ctx_train (262144) -- the full capacity of the model will not be utilized
142
+ llama_context: CPU output buffer size = 0.58 MiB
143
+ llama_kv_cache: CPU KV buffer size = 112.00 MiB
144
+ llama_kv_cache: CUDA0 KV buffer size = 40.00 MiB
145
+ llama_kv_cache: CUDA1 KV buffer size = 40.00 MiB
146
+ llama_kv_cache: size = 192.00 MiB ( 2048 cells, 48 layers, 1/1 seqs), K (f16): 96.00 MiB, V (f16): 96.00 MiB
147
+ llama_context: Flash Attention was auto, set to enabled
148
+ llama_context: CUDA0 compute buffer size = 467.67 MiB
149
+ llama_context: CUDA1 compute buffer size = 82.01 MiB
150
+ llama_context: CUDA_Host compute buffer size = 8.01 MiB
151
+ llama_context: graph nodes = 3031
152
+ llama_context: graph splits = 397 (with bs=512), 88 (with bs=1)
153
+ common_init_from_params: added <|endoftext|> logit bias = -inf
154
+ common_init_from_params: added <|im_end|> logit bias = -inf
155
+ common_init_from_params: added <|fim_pad|> logit bias = -inf
156
+ common_init_from_params: added <|repo_name|> logit bias = -inf
157
+ common_init_from_params: added <|file_sep|> logit bias = -inf
158
+ common_init_from_params: setting dry_penalty_last_n to ctx_size = 2048
159
+ common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable)
160
+
161
+ system_info: n_threads = 16 (n_threads_batch = 16) / 32 | CUDA : ARCHS = 860 | USE_GRAPHS = 1 | PEER_MAX_BATCH_SIZE = 128 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | BMI2 = 1 | AVX512 = 1 | AVX512_VBMI = 1 | AVX512_VNNI = 1 | AVX512_BF16 = 1 | LLAMAFILE = 1 | OPENMP = 1 | REPACK = 1 |
162
+ perplexity: tokenizing the input ..
163
+ perplexity: tokenization took 117.87 ms
164
+ perplexity: calculating perplexity over 44 chunks, n_ctx=2048, batch_size=2048, n_seq=1
165
+ perplexity: 3.31 seconds per pass - ETA 2.42 minutes
166
+ [1]1.5203,[2]1.4651,[3]1.2904,[4]1.2484,[5]1.3391,[6]1.4023,[7]1.4060,[8]1.4050,[9]1.3644,[10]1.3415,[11]1.3268,[12]1.3283,[13]1.3123,[14]1.3032,[15]1.2987,[16]1.2873,[17]1.2816,[18]1.2804,[19]1.2729,[20]1.2636,[21]1.2605,[22]1.2608,[23]1.2769,[24]1.2698,[25]1.2677,[26]1.2590,[27]1.2536,[28]1.2528,[29]1.2654,[30]1.2671,[31]1.2607,[32]1.2556,[33]1.2566,[34]1.2560,[35]1.2550,[36]1.2767,[37]1.2862,[38]1.2911,[39]1.2982,[40]1.2997,[41]1.2962,[42]1.3097,[43]1.3091,[44]1.3094,
167
+ Final estimate: PPL = 1.3094 +/- 0.00712
168
+
169
+ llama_perf_context_print: load time = 2353.55 ms
170
+ llama_perf_context_print: prompt eval time = 123585.53 ms / 90112 tokens ( 1.37 ms per token, 729.15 tokens per second)
171
+ llama_perf_context_print: eval time = 0.00 ms / 1 runs ( 0.00 ms per token, inf tokens per second)
172
+ llama_perf_context_print: total time = 124806.28 ms / 90113 tokens
173
+ llama_perf_context_print: graphs reused = 0
174
+ llama_memory_breakdown_print: | memory breakdown [MiB] | total free self model context compute unaccounted |
175
+ llama_memory_breakdown_print: | - CUDA0 (RTX 3090) | 24115 = 16703 + (3859 = 3351 + 40 + 467) + 3552 |
176
+ llama_memory_breakdown_print: | - CUDA1 (RTX 3090) | 24124 = 20000 + (3473 = 3351 + 40 + 82) + 649 |
177
+ llama_memory_breakdown_print: | - Host | 9874 = 9754 + 112 + 8 |
Benchmarks/DataCollection/Qwen3-30B-A3B-Thinking-2507-unsloth-IQ4_NL-attn_kv_IQ4_NL-attn_output_IQ4_NL-attn_q_IQ4_NL-embeddings_Q5_K-ffn_down_IQ4_NL-ffn_up_gate_IQ4_NL-lm_head_IQ4_NL/perplexity_general.log ADDED
@@ -0,0 +1,177 @@
1
+ ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
2
+ ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
3
+ ggml_cuda_init: found 2 CUDA devices:
4
+ Device 0: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
5
+ Device 1: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
6
+ build: 7040 (92bb442ad) with cc (Ubuntu 13.3.0-6ubuntu2~24.04) 13.3.0 for x86_64-linux-gnu
7
+ llama_model_load_from_file_impl: using device CUDA0 (NVIDIA GeForce RTX 3090) (0000:01:00.0) - 20663 MiB free
8
+ llama_model_load_from_file_impl: using device CUDA1 (NVIDIA GeForce RTX 3090) (0000:03:00.0) - 23582 MiB free
9
+ llama_model_loader: loaded meta data with 39 key-value pairs and 579 tensors from /mnt/world8/AI/ToBench/Qwen3-30B-A3B-Thinking-2507-unsloth/Magic_Quant/GGUF/dc_round0_Qwen3-30B-A3B-Thinking-2507-unsloth-IQ4_NL-attn_kv_IQ4_NL-attn_output_IQ4_NL-attn_q_IQ4_NL-embeddings_Q5_K-ffn_down_IQ4_NL-ffn_up_gate_IQ4_NL-lm_head_IQ4_NL.gguf (version GGUF V3 (latest))
10
+ llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
11
+ llama_model_loader: - kv 0: general.architecture str = qwen3moe
12
+ llama_model_loader: - kv 1: general.type str = model
13
+ llama_model_loader: - kv 2: general.name str = Qwen3 30B A3B Thinking 2507 Unsloth
14
+ llama_model_loader: - kv 3: general.version str = 2507
15
+ llama_model_loader: - kv 4: general.finetune str = Thinking-unsloth
16
+ llama_model_loader: - kv 5: general.basename str = Qwen3
17
+ llama_model_loader: - kv 6: general.size_label str = 30B-A3B
18
+ llama_model_loader: - kv 7: general.license str = apache-2.0
19
+ llama_model_loader: - kv 8: general.license.link str = https://huggingface.co/Qwen/Qwen3-30B...
20
+ llama_model_loader: - kv 9: general.base_model.count u32 = 1
21
+ llama_model_loader: - kv 10: general.base_model.0.name str = Qwen3 30B A3B Thinking 2507
22
+ llama_model_loader: - kv 11: general.base_model.0.version str = 2507
23
+ llama_model_loader: - kv 12: general.base_model.0.organization str = Qwen
24
+ llama_model_loader: - kv 13: general.base_model.0.repo_url str = https://huggingface.co/Qwen/Qwen3-30B...
25
+ llama_model_loader: - kv 14: general.tags arr[str,3] = ["qwen", "qwen3", "unsloth"]
26
+ llama_model_loader: - kv 15: qwen3moe.block_count u32 = 48
27
+ llama_model_loader: - kv 16: qwen3moe.context_length u32 = 262144
28
+ llama_model_loader: - kv 17: qwen3moe.embedding_length u32 = 2048
29
+ llama_model_loader: - kv 18: qwen3moe.feed_forward_length u32 = 6144
30
+ llama_model_loader: - kv 19: qwen3moe.attention.head_count u32 = 32
31
+ llama_model_loader: - kv 20: qwen3moe.attention.head_count_kv u32 = 4
32
+ llama_model_loader: - kv 21: qwen3moe.rope.freq_base f32 = 10000000.000000
33
+ llama_model_loader: - kv 22: qwen3moe.attention.layer_norm_rms_epsilon f32 = 0.000001
34
+ llama_model_loader: - kv 23: qwen3moe.expert_used_count u32 = 8
35
+ llama_model_loader: - kv 24: qwen3moe.attention.key_length u32 = 128
36
+ llama_model_loader: - kv 25: qwen3moe.attention.value_length u32 = 128
37
+ llama_model_loader: - kv 26: qwen3moe.expert_count u32 = 128
38
+ llama_model_loader: - kv 27: qwen3moe.expert_feed_forward_length u32 = 768
39
+ llama_model_loader: - kv 28: tokenizer.ggml.model str = gpt2
40
+ llama_model_loader: - kv 29: tokenizer.ggml.pre str = qwen2
41
+ llama_model_loader: - kv 30: tokenizer.ggml.tokens arr[str,151936] = ["!", "\"", "#", "$", "%", "&", "'", ...
42
+ llama_model_loader: - kv 31: tokenizer.ggml.token_type arr[i32,151936] = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
43
+ llama_model_loader: - kv 32: tokenizer.ggml.merges arr[str,151387] = ["Ġ Ġ", "ĠĠ ĠĠ", "i n", "Ġ t",...
44
+ llama_model_loader: - kv 33: tokenizer.ggml.eos_token_id u32 = 151645
45
+ llama_model_loader: - kv 34: tokenizer.ggml.padding_token_id u32 = 151654
46
+ llama_model_loader: - kv 35: tokenizer.ggml.add_bos_token bool = false
47
+ llama_model_loader: - kv 36: tokenizer.chat_template str = {%- if tools %}\n {{- '<|im_start|>...
48
+ llama_model_loader: - kv 37: general.quantization_version u32 = 2
49
+ llama_model_loader: - kv 38: general.file_type u32 = 25
50
+ llama_model_loader: - type f32: 241 tensors
51
+ llama_model_loader: - type q5_K: 1 tensors
52
+ llama_model_loader: - type iq4_nl: 337 tensors
53
+ print_info: file format = GGUF V3 (latest)
54
+ print_info: file type = IQ4_NL - 4.5 bpw
55
+ print_info: file size = 16.07 GiB (4.52 BPW)
56
+ load: printing all EOG tokens:
57
+ load: - 151643 ('<|endoftext|>')
58
+ load: - 151645 ('<|im_end|>')
59
+ load: - 151662 ('<|fim_pad|>')
60
+ load: - 151663 ('<|repo_name|>')
61
+ load: - 151664 ('<|file_sep|>')
62
+ load: special tokens cache size = 26
63
+ load: token to piece cache size = 0.9311 MB
64
+ print_info: arch = qwen3moe
65
+ print_info: vocab_only = 0
66
+ print_info: n_ctx_train = 262144
67
+ print_info: n_embd = 2048
68
+ print_info: n_embd_inp = 2048
69
+ print_info: n_layer = 48
70
+ print_info: n_head = 32
71
+ print_info: n_head_kv = 4
72
+ print_info: n_rot = 128
73
+ print_info: n_swa = 0
74
+ print_info: is_swa_any = 0
75
+ print_info: n_embd_head_k = 128
76
+ print_info: n_embd_head_v = 128
77
+ print_info: n_gqa = 8
78
+ print_info: n_embd_k_gqa = 512
79
+ print_info: n_embd_v_gqa = 512
80
+ print_info: f_norm_eps = 0.0e+00
81
+ print_info: f_norm_rms_eps = 1.0e-06
82
+ print_info: f_clamp_kqv = 0.0e+00
83
+ print_info: f_max_alibi_bias = 0.0e+00
84
+ print_info: f_logit_scale = 0.0e+00
85
+ print_info: f_attn_scale = 0.0e+00
86
+ print_info: n_ff = 6144
87
+ print_info: n_expert = 128
88
+ print_info: n_expert_used = 8
89
+ print_info: n_expert_groups = 0
90
+ print_info: n_group_used = 0
91
+ print_info: causal attn = 1
92
+ print_info: pooling type = 0
93
+ print_info: rope type = 2
94
+ print_info: rope scaling = linear
95
+ print_info: freq_base_train = 10000000.0
96
+ print_info: freq_scale_train = 1
97
+ print_info: n_ctx_orig_yarn = 262144
98
+ print_info: rope_finetuned = unknown
99
+ print_info: model type = 30B.A3B
100
+ print_info: model params = 30.53 B
101
+ print_info: general.name = Qwen3 30B A3B Thinking 2507 Unsloth
102
+ print_info: n_ff_exp = 768
103
+ print_info: vocab type = BPE
104
+ print_info: n_vocab = 151936
105
+ print_info: n_merges = 151387
106
+ print_info: BOS token = 11 ','
107
+ print_info: EOS token = 151645 '<|im_end|>'
108
+ print_info: EOT token = 151645 '<|im_end|>'
109
+ print_info: PAD token = 151654 '<|vision_pad|>'
110
+ print_info: LF token = 198 'Ċ'
111
+ print_info: FIM PRE token = 151659 '<|fim_prefix|>'
112
+ print_info: FIM SUF token = 151661 '<|fim_suffix|>'
113
+ print_info: FIM MID token = 151660 '<|fim_middle|>'
114
+ print_info: FIM PAD token = 151662 '<|fim_pad|>'
115
+ print_info: FIM REP token = 151663 '<|repo_name|>'
116
+ print_info: FIM SEP token = 151664 '<|file_sep|>'
117
+ print_info: EOG token = 151643 '<|endoftext|>'
118
+ print_info: EOG token = 151645 '<|im_end|>'
119
+ print_info: EOG token = 151662 '<|fim_pad|>'
120
+ print_info: EOG token = 151663 '<|repo_name|>'
121
+ print_info: EOG token = 151664 '<|file_sep|>'
122
+ print_info: max token length = 256
123
+ load_tensors: loading model tensors, this can take a while... (mmap = true)
124
+ load_tensors: offloading 20 repeating layers to GPU
125
+ load_tensors: offloaded 20/49 layers to GPU
126
+ load_tensors: CPU_Mapped model buffer size = 9754.91 MiB
127
+ load_tensors: CUDA0 model buffer size = 3351.42 MiB
128
+ load_tensors: CUDA1 model buffer size = 3351.42 MiB
129
+ ....................................................................................................
130
+ llama_context: constructing llama_context
131
+ llama_context: n_seq_max = 1
132
+ llama_context: n_ctx = 2048
133
+ llama_context: n_ctx_seq = 2048
134
+ llama_context: n_batch = 2048
135
+ llama_context: n_ubatch = 512
136
+ llama_context: causal_attn = 1
137
+ llama_context: flash_attn = auto
138
+ llama_context: kv_unified = false
139
+ llama_context: freq_base = 10000000.0
140
+ llama_context: freq_scale = 1
141
+ llama_context: n_ctx_seq (2048) < n_ctx_train (262144) -- the full capacity of the model will not be utilized
142
+ llama_context: CPU output buffer size = 0.58 MiB
143
+ llama_kv_cache: CPU KV buffer size = 112.00 MiB
144
+ llama_kv_cache: CUDA0 KV buffer size = 40.00 MiB
145
+ llama_kv_cache: CUDA1 KV buffer size = 40.00 MiB
146
+ llama_kv_cache: size = 192.00 MiB ( 2048 cells, 48 layers, 1/1 seqs), K (f16): 96.00 MiB, V (f16): 96.00 MiB
147
+ llama_context: Flash Attention was auto, set to enabled
148
+ llama_context: CUDA0 compute buffer size = 467.67 MiB
149
+ llama_context: CUDA1 compute buffer size = 82.01 MiB
150
+ llama_context: CUDA_Host compute buffer size = 8.01 MiB
151
+ llama_context: graph nodes = 3031
152
+ llama_context: graph splits = 397 (with bs=512), 88 (with bs=1)
153
+ common_init_from_params: added <|endoftext|> logit bias = -inf
154
+ common_init_from_params: added <|im_end|> logit bias = -inf
155
+ common_init_from_params: added <|fim_pad|> logit bias = -inf
156
+ common_init_from_params: added <|repo_name|> logit bias = -inf
157
+ common_init_from_params: added <|file_sep|> logit bias = -inf
158
+ common_init_from_params: setting dry_penalty_last_n to ctx_size = 2048
159
+ common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable)
160
+
161
+ system_info: n_threads = 16 (n_threads_batch = 16) / 32 | CUDA : ARCHS = 860 | USE_GRAPHS = 1 | PEER_MAX_BATCH_SIZE = 128 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | BMI2 = 1 | AVX512 = 1 | AVX512_VBMI = 1 | AVX512_VNNI = 1 | AVX512_BF16 = 1 | LLAMAFILE = 1 | OPENMP = 1 | REPACK = 1 |
162
+ perplexity: tokenizing the input ..
163
+ perplexity: tokenization took 50.065 ms
164
+ perplexity: calculating perplexity over 15 chunks, n_ctx=2048, batch_size=2048, n_seq=1
165
+ perplexity: 3.26 seconds per pass - ETA 0.80 minutes
166
+ [1]5.5363,[2]6.5899,[3]7.0472,[4]6.9218,[5]6.8426,[6]5.8956,[7]5.3900,[8]5.4252,[9]5.7401,[10]5.8727,[11]5.9266,[12]6.2397,[13]6.3078,[14]6.4487,[15]6.5314,
167
+ Final estimate: PPL = 6.5314 +/- 0.13444
168
+
169
+ llama_perf_context_print: load time = 2400.47 ms
170
+ llama_perf_context_print: prompt eval time = 44875.22 ms / 30720 tokens ( 1.46 ms per token, 684.56 tokens per second)
171
+ llama_perf_context_print: eval time = 0.00 ms / 1 runs ( 0.00 ms per token, inf tokens per second)
172
+ llama_perf_context_print: total time = 45305.33 ms / 30721 tokens
173
+ llama_perf_context_print: graphs reused = 0
174
+ llama_memory_breakdown_print: | memory breakdown [MiB] | total free self model context compute unaccounted |
175
+ llama_memory_breakdown_print: | - CUDA0 (RTX 3090) | 24115 = 16703 + (3859 = 3351 + 40 + 467) + 3552 |
176
+ llama_memory_breakdown_print: | - CUDA1 (RTX 3090) | 24124 = 20000 + (3473 = 3351 + 40 + 82) + 649 |
177
+ llama_memory_breakdown_print: | - Host | 9874 = 9754 + 112 + 8 |
Benchmarks/DataCollection/Qwen3-30B-A3B-Thinking-2507-unsloth-IQ4_NL-attn_kv_IQ4_NL-attn_output_IQ4_NL-attn_q_IQ4_NL-embeddings_Q5_K-ffn_down_IQ4_NL-ffn_up_gate_IQ4_NL-lm_head_IQ4_NL/perplexity_math.log ADDED
@@ -0,0 +1,177 @@
1
+ ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
2
+ ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
3
+ ggml_cuda_init: found 2 CUDA devices:
4
+ Device 0: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
5
+ Device 1: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
6
+ build: 7040 (92bb442ad) with cc (Ubuntu 13.3.0-6ubuntu2~24.04) 13.3.0 for x86_64-linux-gnu
7
+ llama_model_load_from_file_impl: using device CUDA0 (NVIDIA GeForce RTX 3090) (0000:01:00.0) - 20669 MiB free
8
+ llama_model_load_from_file_impl: using device CUDA1 (NVIDIA GeForce RTX 3090) (0000:03:00.0) - 23582 MiB free
9
+ llama_model_loader: loaded meta data with 39 key-value pairs and 579 tensors from /mnt/world8/AI/ToBench/Qwen3-30B-A3B-Thinking-2507-unsloth/Magic_Quant/GGUF/dc_round0_Qwen3-30B-A3B-Thinking-2507-unsloth-IQ4_NL-attn_kv_IQ4_NL-attn_output_IQ4_NL-attn_q_IQ4_NL-embeddings_Q5_K-ffn_down_IQ4_NL-ffn_up_gate_IQ4_NL-lm_head_IQ4_NL.gguf (version GGUF V3 (latest))
10
+ llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
11
+ llama_model_loader: - kv 0: general.architecture str = qwen3moe
12
+ llama_model_loader: - kv 1: general.type str = model
13
+ llama_model_loader: - kv 2: general.name str = Qwen3 30B A3B Thinking 2507 Unsloth
14
+ llama_model_loader: - kv 3: general.version str = 2507
15
+ llama_model_loader: - kv 4: general.finetune str = Thinking-unsloth
16
+ llama_model_loader: - kv 5: general.basename str = Qwen3
17
+ llama_model_loader: - kv 6: general.size_label str = 30B-A3B
18
+ llama_model_loader: - kv 7: general.license str = apache-2.0
19
+ llama_model_loader: - kv 8: general.license.link str = https://huggingface.co/Qwen/Qwen3-30B...
20
+ llama_model_loader: - kv 9: general.base_model.count u32 = 1
21
+ llama_model_loader: - kv 10: general.base_model.0.name str = Qwen3 30B A3B Thinking 2507
22
+ llama_model_loader: - kv 11: general.base_model.0.version str = 2507
23
+ llama_model_loader: - kv 12: general.base_model.0.organization str = Qwen
24
+ llama_model_loader: - kv 13: general.base_model.0.repo_url str = https://huggingface.co/Qwen/Qwen3-30B...
25
+ llama_model_loader: - kv 14: general.tags arr[str,3] = ["qwen", "qwen3", "unsloth"]
26
+ llama_model_loader: - kv 15: qwen3moe.block_count u32 = 48
27
+ llama_model_loader: - kv 16: qwen3moe.context_length u32 = 262144
28
+ llama_model_loader: - kv 17: qwen3moe.embedding_length u32 = 2048
29
+ llama_model_loader: - kv 18: qwen3moe.feed_forward_length u32 = 6144
30
+ llama_model_loader: - kv 19: qwen3moe.attention.head_count u32 = 32
31
+ llama_model_loader: - kv 20: qwen3moe.attention.head_count_kv u32 = 4
32
+ llama_model_loader: - kv 21: qwen3moe.rope.freq_base f32 = 10000000.000000
33
+ llama_model_loader: - kv 22: qwen3moe.attention.layer_norm_rms_epsilon f32 = 0.000001
34
+ llama_model_loader: - kv 23: qwen3moe.expert_used_count u32 = 8
35
+ llama_model_loader: - kv 24: qwen3moe.attention.key_length u32 = 128
36
+ llama_model_loader: - kv 25: qwen3moe.attention.value_length u32 = 128
37
+ llama_model_loader: - kv 26: qwen3moe.expert_count u32 = 128
38
+ llama_model_loader: - kv 27: qwen3moe.expert_feed_forward_length u32 = 768
39
+ llama_model_loader: - kv 28: tokenizer.ggml.model str = gpt2
40
+ llama_model_loader: - kv 29: tokenizer.ggml.pre str = qwen2
41
+ llama_model_loader: - kv 30: tokenizer.ggml.tokens arr[str,151936] = ["!", "\"", "#", "$", "%", "&", "'", ...
42
+ llama_model_loader: - kv 31: tokenizer.ggml.token_type arr[i32,151936] = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
43
+ llama_model_loader: - kv 32: tokenizer.ggml.merges arr[str,151387] = ["Ġ Ġ", "ĠĠ ĠĠ", "i n", "Ġ t",...
44
+ llama_model_loader: - kv 33: tokenizer.ggml.eos_token_id u32 = 151645
45
+ llama_model_loader: - kv 34: tokenizer.ggml.padding_token_id u32 = 151654
46
+ llama_model_loader: - kv 35: tokenizer.ggml.add_bos_token bool = false
47
+ llama_model_loader: - kv 36: tokenizer.chat_template str = {%- if tools %}\n {{- '<|im_start|>...
48
+ llama_model_loader: - kv 37: general.quantization_version u32 = 2
49
+ llama_model_loader: - kv 38: general.file_type u32 = 25
50
+ llama_model_loader: - type f32: 241 tensors
51
+ llama_model_loader: - type q5_K: 1 tensors
52
+ llama_model_loader: - type iq4_nl: 337 tensors
53
+ print_info: file format = GGUF V3 (latest)
54
+ print_info: file type = IQ4_NL - 4.5 bpw
55
+ print_info: file size = 16.07 GiB (4.52 BPW)
56
+ load: printing all EOG tokens:
57
+ load: - 151643 ('<|endoftext|>')
58
+ load: - 151645 ('<|im_end|>')
59
+ load: - 151662 ('<|fim_pad|>')
60
+ load: - 151663 ('<|repo_name|>')
61
+ load: - 151664 ('<|file_sep|>')
62
+ load: special tokens cache size = 26
63
+ load: token to piece cache size = 0.9311 MB
64
+ print_info: arch = qwen3moe
65
+ print_info: vocab_only = 0
66
+ print_info: n_ctx_train = 262144
67
+ print_info: n_embd = 2048
68
+ print_info: n_embd_inp = 2048
69
+ print_info: n_layer = 48
70
+ print_info: n_head = 32
71
+ print_info: n_head_kv = 4
72
+ print_info: n_rot = 128
73
+ print_info: n_swa = 0
74
+ print_info: is_swa_any = 0
75
+ print_info: n_embd_head_k = 128
76
+ print_info: n_embd_head_v = 128
77
+ print_info: n_gqa = 8
78
+ print_info: n_embd_k_gqa = 512
79
+ print_info: n_embd_v_gqa = 512
80
+ print_info: f_norm_eps = 0.0e+00
81
+ print_info: f_norm_rms_eps = 1.0e-06
82
+ print_info: f_clamp_kqv = 0.0e+00
83
+ print_info: f_max_alibi_bias = 0.0e+00
84
+ print_info: f_logit_scale = 0.0e+00
85
+ print_info: f_attn_scale = 0.0e+00
86
+ print_info: n_ff = 6144
87
+ print_info: n_expert = 128
88
+ print_info: n_expert_used = 8
89
+ print_info: n_expert_groups = 0
90
+ print_info: n_group_used = 0
91
+ print_info: causal attn = 1
92
+ print_info: pooling type = 0
93
+ print_info: rope type = 2
94
+ print_info: rope scaling = linear
95
+ print_info: freq_base_train = 10000000.0
96
+ print_info: freq_scale_train = 1
97
+ print_info: n_ctx_orig_yarn = 262144
98
+ print_info: rope_finetuned = unknown
99
+ print_info: model type = 30B.A3B
100
+ print_info: model params = 30.53 B
101
+ print_info: general.name = Qwen3 30B A3B Thinking 2507 Unsloth
102
+ print_info: n_ff_exp = 768
103
+ print_info: vocab type = BPE
104
+ print_info: n_vocab = 151936
105
+ print_info: n_merges = 151387
106
+ print_info: BOS token = 11 ','
107
+ print_info: EOS token = 151645 '<|im_end|>'
108
+ print_info: EOT token = 151645 '<|im_end|>'
109
+ print_info: PAD token = 151654 '<|vision_pad|>'
110
+ print_info: LF token = 198 'Ċ'
111
+ print_info: FIM PRE token = 151659 '<|fim_prefix|>'
112
+ print_info: FIM SUF token = 151661 '<|fim_suffix|>'
113
+ print_info: FIM MID token = 151660 '<|fim_middle|>'
114
+ print_info: FIM PAD token = 151662 '<|fim_pad|>'
115
+ print_info: FIM REP token = 151663 '<|repo_name|>'
116
+ print_info: FIM SEP token = 151664 '<|file_sep|>'
117
+ print_info: EOG token = 151643 '<|endoftext|>'
118
+ print_info: EOG token = 151645 '<|im_end|>'
119
+ print_info: EOG token = 151662 '<|fim_pad|>'
120
+ print_info: EOG token = 151663 '<|repo_name|>'
121
+ print_info: EOG token = 151664 '<|file_sep|>'
122
+ print_info: max token length = 256
123
+ load_tensors: loading model tensors, this can take a while... (mmap = true)
124
+ load_tensors: offloading 20 repeating layers to GPU
125
+ load_tensors: offloaded 20/49 layers to GPU
126
+ load_tensors: CPU_Mapped model buffer size = 9754.91 MiB
127
+ load_tensors: CUDA0 model buffer size = 3351.42 MiB
128
+ load_tensors: CUDA1 model buffer size = 3351.42 MiB
129
+ ....................................................................................................
130
+ llama_context: constructing llama_context
131
+ llama_context: n_seq_max = 1
132
+ llama_context: n_ctx = 2048
133
+ llama_context: n_ctx_seq = 2048
134
+ llama_context: n_batch = 2048
135
+ llama_context: n_ubatch = 512
136
+ llama_context: causal_attn = 1
137
+ llama_context: flash_attn = auto
138
+ llama_context: kv_unified = false
139
+ llama_context: freq_base = 10000000.0
140
+ llama_context: freq_scale = 1
141
+ llama_context: n_ctx_seq (2048) < n_ctx_train (262144) -- the full capacity of the model will not be utilized
142
+ llama_context: CPU output buffer size = 0.58 MiB
143
+ llama_kv_cache: CPU KV buffer size = 112.00 MiB
144
+ llama_kv_cache: CUDA0 KV buffer size = 40.00 MiB
145
+ llama_kv_cache: CUDA1 KV buffer size = 40.00 MiB
146
+ llama_kv_cache: size = 192.00 MiB ( 2048 cells, 48 layers, 1/1 seqs), K (f16): 96.00 MiB, V (f16): 96.00 MiB
147
+ llama_context: Flash Attention was auto, set to enabled
148
+ llama_context: CUDA0 compute buffer size = 467.67 MiB
149
+ llama_context: CUDA1 compute buffer size = 82.01 MiB
150
+ llama_context: CUDA_Host compute buffer size = 8.01 MiB
151
+ llama_context: graph nodes = 3031
152
+ llama_context: graph splits = 397 (with bs=512), 88 (with bs=1)
153
+ common_init_from_params: added <|endoftext|> logit bias = -inf
154
+ common_init_from_params: added <|im_end|> logit bias = -inf
155
+ common_init_from_params: added <|fim_pad|> logit bias = -inf
156
+ common_init_from_params: added <|repo_name|> logit bias = -inf
157
+ common_init_from_params: added <|file_sep|> logit bias = -inf
158
+ common_init_from_params: setting dry_penalty_last_n to ctx_size = 2048
159
+ common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable)
160
+
161
+ system_info: n_threads = 16 (n_threads_batch = 16) / 32 | CUDA : ARCHS = 860 | USE_GRAPHS = 1 | PEER_MAX_BATCH_SIZE = 128 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | BMI2 = 1 | AVX512 = 1 | AVX512_VBMI = 1 | AVX512_VNNI = 1 | AVX512_BF16 = 1 | LLAMAFILE = 1 | OPENMP = 1 | REPACK = 1 |
162
+ perplexity: tokenizing the input ..
163
+ perplexity: tokenization took 44.494 ms
164
+ perplexity: calculating perplexity over 16 chunks, n_ctx=2048, batch_size=2048, n_seq=1
165
+ perplexity: 3.35 seconds per pass - ETA 0.88 minutes
166
+ [1]4.6628,[2]5.0886,[3]5.3551,[4]5.5435,[5]5.7587,[6]5.7679,[7]5.7713,[8]5.7213,[9]5.7670,[10]5.7605,[11]5.7618,[12]5.7592,[13]5.8437,[14]5.8542,[15]5.8479,[16]5.8549,
167
+ Final estimate: PPL = 5.8549 +/- 0.10805
168
+
169
+ llama_perf_context_print: load time = 2442.75 ms
170
+ llama_perf_context_print: prompt eval time = 48979.64 ms / 32768 tokens ( 1.49 ms per token, 669.01 tokens per second)
171
+ llama_perf_context_print: eval time = 0.00 ms / 1 runs ( 0.00 ms per token, inf tokens per second)
172
+ llama_perf_context_print: total time = 49425.17 ms / 32769 tokens
173
+ llama_perf_context_print: graphs reused = 0
174
+ llama_memory_breakdown_print: | memory breakdown [MiB] | total free self model context compute unaccounted |
175
+ llama_memory_breakdown_print: | - CUDA0 (RTX 3090) | 24115 = 16703 + (3859 = 3351 + 40 + 467) + 3552 |
176
+ llama_memory_breakdown_print: | - CUDA1 (RTX 3090) | 24124 = 20000 + (3473 = 3351 + 40 + 82) + 649 |
177
+ llama_memory_breakdown_print: | - Host | 9874 = 9754 + 112 + 8 |
Benchmarks/DataCollection/Qwen3-30B-A3B-Thinking-2507-unsloth-IQ4_NL-attn_kv_IQ4_NL-attn_output_IQ4_NL-attn_q_IQ4_NL-embeddings_Q6_K-ffn_down_IQ4_NL-ffn_up_gate_IQ4_NL-lm_head_IQ4_NL/bench_metrics.json ADDED
@@ -0,0 +1,44 @@
1
+ {
2
+ "raw_metrics": {
3
+ "llamabench": {
4
+ "backend": "CUDA",
5
+ "log_path": "Qwen3-30B-A3B-Thinking-2507-unsloth-IQ4_NL-attn_kv_IQ4_NL-attn_output_IQ4_NL-attn_q_IQ4_NL-embeddings_Q6_K-ffn_down_IQ4_NL-ffn_up_gate_IQ4_NL-lm_head_IQ4_NL/llamabench.md",
6
+ "ngl": "35",
7
+ "raw_row": {
8
+ "backend": "CUDA",
9
+ "model": "qwen3moe 30B.A3B IQ4_NL - 4.5 bpw",
10
+ "ngl": "35",
11
+ "params": "30.53 B",
12
+ "size": "16.11 GiB",
13
+ "t/s": "143.24 \u00b1 4.42",
14
+ "test": "pp8",
15
+ "tps_value": 143.24
16
+ },
17
+ "test": "pp8",
18
+ "tps": 143.24
19
+ },
20
+ "perplexity": {
21
+ "code": {
22
+ "log_path": "Qwen3-30B-A3B-Thinking-2507-unsloth-IQ4_NL-attn_kv_IQ4_NL-attn_output_IQ4_NL-attn_q_IQ4_NL-embeddings_Q6_K-ffn_down_IQ4_NL-ffn_up_gate_IQ4_NL-lm_head_IQ4_NL/perplexity_code.log",
23
+ "ppl": 1.3091,
24
+ "ppl_error": 0.00711
25
+ },
26
+ "general": {
27
+ "log_path": "Qwen3-30B-A3B-Thinking-2507-unsloth-IQ4_NL-attn_kv_IQ4_NL-attn_output_IQ4_NL-attn_q_IQ4_NL-embeddings_Q6_K-ffn_down_IQ4_NL-ffn_up_gate_IQ4_NL-lm_head_IQ4_NL/perplexity_general.log",
28
+ "ppl": 6.5299,
29
+ "ppl_error": 0.13434
30
+ },
31
+ "math": {
32
+ "log_path": "Qwen3-30B-A3B-Thinking-2507-unsloth-IQ4_NL-attn_kv_IQ4_NL-attn_output_IQ4_NL-attn_q_IQ4_NL-embeddings_Q6_K-ffn_down_IQ4_NL-ffn_up_gate_IQ4_NL-lm_head_IQ4_NL/perplexity_math.log",
33
+ "ppl": 5.8552,
34
+ "ppl_error": 0.10806
35
+ }
36
+ }
37
+ },
38
+ "summary": {
39
+ "avg_prec_loss_pct": 2.7924,
40
+ "bench_tps": 143.24,
41
+ "file_size_bytes": 17304490016,
42
+ "file_size_gb": 16.12
43
+ }
44
+ }
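These files do not spell out how the `avg_prec_loss_pct` summary field is computed; it reads like the mean percent PPL increase of this quant over some reference quant, averaged across the code, general, and math runs. A minimal sketch under that assumption only (the reference values are not part of this section, and the helper name is hypothetical):

```python
# Assumption, not documented in these files: average percent perplexity
# degradation of a quant relative to a reference quant, over the three tasks.
def avg_prec_loss_pct(quant_ppl: dict, reference_ppl: dict) -> float:
    """Mean percent PPL increase relative to a reference quant."""
    losses = [
        (quant_ppl[k] - reference_ppl[k]) / reference_ppl[k] * 100.0
        for k in ("code", "general", "math")
    ]
    return round(sum(losses) / len(losses), 4)
```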
Benchmarks/DataCollection/Qwen3-30B-A3B-Thinking-2507-unsloth-IQ4_NL-attn_kv_IQ4_NL-attn_output_IQ4_NL-attn_q_IQ4_NL-embeddings_Q6_K-ffn_down_IQ4_NL-ffn_up_gate_IQ4_NL-lm_head_IQ4_NL/llamabench.md ADDED
@@ -0,0 +1,11 @@
1
+ ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
2
+ ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
3
+ ggml_cuda_init: found 2 CUDA devices:
4
+ Device 0: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
5
+ Device 1: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
6
+ | model | size | params | backend | ngl | test | t/s |
7
+ | ------------------------------ | ---------: | ---------: | ---------- | --: | --------------: | -------------------: |
8
+ | qwen3moe 30B.A3B IQ4_NL - 4.5 bpw | 16.11 GiB | 30.53 B | CUDA | 35 | pp8 | 143.24 ± 4.42 |
9
+ | qwen3moe 30B.A3B IQ4_NL - 4.5 bpw | 16.11 GiB | 30.53 B | CUDA | 35 | tg128 | 53.04 ± 0.43 |
10
+
11
+ build: 92bb442ad (7040)
Benchmarks/DataCollection/Qwen3-30B-A3B-Thinking-2507-unsloth-IQ4_NL-attn_kv_IQ4_NL-attn_output_IQ4_NL-attn_q_IQ4_NL-embeddings_Q6_K-ffn_down_IQ4_NL-ffn_up_gate_IQ4_NL-lm_head_IQ4_NL/perplexity_code.log ADDED
@@ -0,0 +1,177 @@
1
+ ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
2
+ ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
3
+ ggml_cuda_init: found 2 CUDA devices:
4
+ Device 0: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
5
+ Device 1: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
6
+ build: 7040 (92bb442ad) with cc (Ubuntu 13.3.0-6ubuntu2~24.04) 13.3.0 for x86_64-linux-gnu
7
+ llama_model_load_from_file_impl: using device CUDA0 (NVIDIA GeForce RTX 3090) (0000:01:00.0) - 20673 MiB free
8
+ llama_model_load_from_file_impl: using device CUDA1 (NVIDIA GeForce RTX 3090) (0000:03:00.0) - 23582 MiB free
9
+ llama_model_loader: loaded meta data with 39 key-value pairs and 579 tensors from /mnt/world8/AI/ToBench/Qwen3-30B-A3B-Thinking-2507-unsloth/Magic_Quant/GGUF/dc_round0_Qwen3-30B-A3B-Thinking-2507-unsloth-IQ4_NL-attn_kv_IQ4_NL-attn_output_IQ4_NL-attn_q_IQ4_NL-embeddings_Q6_K-ffn_down_IQ4_NL-ffn_up_gate_IQ4_NL-lm_head_IQ4_NL.gguf (version GGUF V3 (latest))
10
+ llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
11
+ llama_model_loader: - kv 0: general.architecture str = qwen3moe
12
+ llama_model_loader: - kv 1: general.type str = model
13
+ llama_model_loader: - kv 2: general.name str = Qwen3 30B A3B Thinking 2507 Unsloth
14
+ llama_model_loader: - kv 3: general.version str = 2507
15
+ llama_model_loader: - kv 4: general.finetune str = Thinking-unsloth
16
+ llama_model_loader: - kv 5: general.basename str = Qwen3
17
+ llama_model_loader: - kv 6: general.size_label str = 30B-A3B
18
+ llama_model_loader: - kv 7: general.license str = apache-2.0
19
+ llama_model_loader: - kv 8: general.license.link str = https://huggingface.co/Qwen/Qwen3-30B...
20
+ llama_model_loader: - kv 9: general.base_model.count u32 = 1
21
+ llama_model_loader: - kv 10: general.base_model.0.name str = Qwen3 30B A3B Thinking 2507
22
+ llama_model_loader: - kv 11: general.base_model.0.version str = 2507
23
+ llama_model_loader: - kv 12: general.base_model.0.organization str = Qwen
24
+ llama_model_loader: - kv 13: general.base_model.0.repo_url str = https://huggingface.co/Qwen/Qwen3-30B...
25
+ llama_model_loader: - kv 14: general.tags arr[str,3] = ["qwen", "qwen3", "unsloth"]
26
+ llama_model_loader: - kv 15: qwen3moe.block_count u32 = 48
27
+ llama_model_loader: - kv 16: qwen3moe.context_length u32 = 262144
28
+ llama_model_loader: - kv 17: qwen3moe.embedding_length u32 = 2048
29
+ llama_model_loader: - kv 18: qwen3moe.feed_forward_length u32 = 6144
30
+ llama_model_loader: - kv 19: qwen3moe.attention.head_count u32 = 32
31
+ llama_model_loader: - kv 20: qwen3moe.attention.head_count_kv u32 = 4
32
+ llama_model_loader: - kv 21: qwen3moe.rope.freq_base f32 = 10000000.000000
33
+ llama_model_loader: - kv 22: qwen3moe.attention.layer_norm_rms_epsilon f32 = 0.000001
34
+ llama_model_loader: - kv 23: qwen3moe.expert_used_count u32 = 8
35
+ llama_model_loader: - kv 24: qwen3moe.attention.key_length u32 = 128
36
+ llama_model_loader: - kv 25: qwen3moe.attention.value_length u32 = 128
37
+ llama_model_loader: - kv 26: qwen3moe.expert_count u32 = 128
38
+ llama_model_loader: - kv 27: qwen3moe.expert_feed_forward_length u32 = 768
39
+ llama_model_loader: - kv 28: tokenizer.ggml.model str = gpt2
40
+ llama_model_loader: - kv 29: tokenizer.ggml.pre str = qwen2
41
+ llama_model_loader: - kv 30: tokenizer.ggml.tokens arr[str,151936] = ["!", "\"", "#", "$", "%", "&", "'", ...
42
+ llama_model_loader: - kv 31: tokenizer.ggml.token_type arr[i32,151936] = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
43
+ llama_model_loader: - kv 32: tokenizer.ggml.merges arr[str,151387] = ["Ġ Ġ", "ĠĠ ĠĠ", "i n", "Ġ t",...
44
+ llama_model_loader: - kv 33: tokenizer.ggml.eos_token_id u32 = 151645
45
+ llama_model_loader: - kv 34: tokenizer.ggml.padding_token_id u32 = 151654
46
+ llama_model_loader: - kv 35: tokenizer.ggml.add_bos_token bool = false
47
+ llama_model_loader: - kv 36: tokenizer.chat_template str = {%- if tools %}\n {{- '<|im_start|>...
48
+ llama_model_loader: - kv 37: general.quantization_version u32 = 2
49
+ llama_model_loader: - kv 38: general.file_type u32 = 25
50
+ llama_model_loader: - type f32: 241 tensors
51
+ llama_model_loader: - type q6_K: 1 tensors
52
+ llama_model_loader: - type iq4_nl: 337 tensors
53
+ print_info: file format = GGUF V3 (latest)
54
+ print_info: file type = IQ4_NL - 4.5 bpw
55
+ print_info: file size = 16.11 GiB (4.53 BPW)
56
+ load: printing all EOG tokens:
57
+ load: - 151643 ('<|endoftext|>')
58
+ load: - 151645 ('<|im_end|>')
59
+ load: - 151662 ('<|fim_pad|>')
60
+ load: - 151663 ('<|repo_name|>')
61
+ load: - 151664 ('<|file_sep|>')
62
+ load: special tokens cache size = 26
63
+ load: token to piece cache size = 0.9311 MB
64
+ print_info: arch = qwen3moe
65
+ print_info: vocab_only = 0
66
+ print_info: n_ctx_train = 262144
67
+ print_info: n_embd = 2048
68
+ print_info: n_embd_inp = 2048
69
+ print_info: n_layer = 48
70
+ print_info: n_head = 32
71
+ print_info: n_head_kv = 4
72
+ print_info: n_rot = 128
73
+ print_info: n_swa = 0
74
+ print_info: is_swa_any = 0
75
+ print_info: n_embd_head_k = 128
76
+ print_info: n_embd_head_v = 128
77
+ print_info: n_gqa = 8
78
+ print_info: n_embd_k_gqa = 512
79
+ print_info: n_embd_v_gqa = 512
80
+ print_info: f_norm_eps = 0.0e+00
81
+ print_info: f_norm_rms_eps = 1.0e-06
82
+ print_info: f_clamp_kqv = 0.0e+00
83
+ print_info: f_max_alibi_bias = 0.0e+00
84
+ print_info: f_logit_scale = 0.0e+00
85
+ print_info: f_attn_scale = 0.0e+00
86
+ print_info: n_ff = 6144
87
+ print_info: n_expert = 128
88
+ print_info: n_expert_used = 8
89
+ print_info: n_expert_groups = 0
90
+ print_info: n_group_used = 0
91
+ print_info: causal attn = 1
92
+ print_info: pooling type = 0
93
+ print_info: rope type = 2
94
+ print_info: rope scaling = linear
95
+ print_info: freq_base_train = 10000000.0
96
+ print_info: freq_scale_train = 1
97
+ print_info: n_ctx_orig_yarn = 262144
98
+ print_info: rope_finetuned = unknown
99
+ print_info: model type = 30B.A3B
100
+ print_info: model params = 30.53 B
101
+ print_info: general.name = Qwen3 30B A3B Thinking 2507 Unsloth
102
+ print_info: n_ff_exp = 768
103
+ print_info: vocab type = BPE
104
+ print_info: n_vocab = 151936
105
+ print_info: n_merges = 151387
106
+ print_info: BOS token = 11 ','
107
+ print_info: EOS token = 151645 '<|im_end|>'
108
+ print_info: EOT token = 151645 '<|im_end|>'
109
+ print_info: PAD token = 151654 '<|vision_pad|>'
110
+ print_info: LF token = 198 'Ċ'
111
+ print_info: FIM PRE token = 151659 '<|fim_prefix|>'
112
+ print_info: FIM SUF token = 151661 '<|fim_suffix|>'
113
+ print_info: FIM MID token = 151660 '<|fim_middle|>'
114
+ print_info: FIM PAD token = 151662 '<|fim_pad|>'
115
+ print_info: FIM REP token = 151663 '<|repo_name|>'
116
+ print_info: FIM SEP token = 151664 '<|file_sep|>'
117
+ print_info: EOG token = 151643 '<|endoftext|>'
118
+ print_info: EOG token = 151645 '<|im_end|>'
119
+ print_info: EOG token = 151662 '<|fim_pad|>'
120
+ print_info: EOG token = 151663 '<|repo_name|>'
121
+ print_info: EOG token = 151664 '<|file_sep|>'
122
+ print_info: max token length = 256
123
+ load_tensors: loading model tensors, this can take a while... (mmap = true)
124
+ load_tensors: offloading 20 repeating layers to GPU
125
+ load_tensors: offloaded 20/49 layers to GPU
126
+ load_tensors: CPU_Mapped model buffer size = 9794.32 MiB
127
+ load_tensors: CUDA0 model buffer size = 3351.42 MiB
128
+ load_tensors: CUDA1 model buffer size = 3351.42 MiB
129
+ ....................................................................................................
130
+ llama_context: constructing llama_context
131
+ llama_context: n_seq_max = 1
132
+ llama_context: n_ctx = 2048
133
+ llama_context: n_ctx_seq = 2048
134
+ llama_context: n_batch = 2048
135
+ llama_context: n_ubatch = 512
136
+ llama_context: causal_attn = 1
137
+ llama_context: flash_attn = auto
138
+ llama_context: kv_unified = false
139
+ llama_context: freq_base = 10000000.0
140
+ llama_context: freq_scale = 1
141
+ llama_context: n_ctx_seq (2048) < n_ctx_train (262144) -- the full capacity of the model will not be utilized
142
+ llama_context: CPU output buffer size = 0.58 MiB
143
+ llama_kv_cache: CPU KV buffer size = 112.00 MiB
144
+ llama_kv_cache: CUDA0 KV buffer size = 40.00 MiB
145
+ llama_kv_cache: CUDA1 KV buffer size = 40.00 MiB
146
+ llama_kv_cache: size = 192.00 MiB ( 2048 cells, 48 layers, 1/1 seqs), K (f16): 96.00 MiB, V (f16): 96.00 MiB
147
+ llama_context: Flash Attention was auto, set to enabled
148
+ llama_context: CUDA0 compute buffer size = 467.67 MiB
149
+ llama_context: CUDA1 compute buffer size = 82.01 MiB
150
+ llama_context: CUDA_Host compute buffer size = 8.01 MiB
151
+ llama_context: graph nodes = 3031
152
+ llama_context: graph splits = 397 (with bs=512), 88 (with bs=1)
153
+ common_init_from_params: added <|endoftext|> logit bias = -inf
154
+ common_init_from_params: added <|im_end|> logit bias = -inf
155
+ common_init_from_params: added <|fim_pad|> logit bias = -inf
156
+ common_init_from_params: added <|repo_name|> logit bias = -inf
157
+ common_init_from_params: added <|file_sep|> logit bias = -inf
158
+ common_init_from_params: setting dry_penalty_last_n to ctx_size = 2048
159
+ common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable)
160
+
161
+ system_info: n_threads = 16 (n_threads_batch = 16) / 32 | CUDA : ARCHS = 860 | USE_GRAPHS = 1 | PEER_MAX_BATCH_SIZE = 128 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | BMI2 = 1 | AVX512 = 1 | AVX512_VBMI = 1 | AVX512_VNNI = 1 | AVX512_BF16 = 1 | LLAMAFILE = 1 | OPENMP = 1 | REPACK = 1 |
162
+ perplexity: tokenizing the input ..
163
+ perplexity: tokenization took 120.29 ms
164
+ perplexity: calculating perplexity over 44 chunks, n_ctx=2048, batch_size=2048, n_seq=1
165
+ perplexity: 3.29 seconds per pass - ETA 2.40 minutes
166
+ [1]1.5278,[2]1.4668,[3]1.2915,[4]1.2495,[5]1.3401,[6]1.4029,[7]1.4067,[8]1.4054,[9]1.3647,[10]1.3419,[11]1.3268,[12]1.3283,[13]1.3125,[14]1.3034,[15]1.2991,[16]1.2875,[17]1.2820,[18]1.2811,[19]1.2736,[20]1.2642,[21]1.2612,[22]1.2615,[23]1.2772,[24]1.2701,[25]1.2680,[26]1.2592,[27]1.2538,[28]1.2530,[29]1.2656,[30]1.2672,[31]1.2609,[32]1.2558,[33]1.2568,[34]1.2561,[35]1.2551,[36]1.2768,[37]1.2863,[38]1.2912,[39]1.2983,[40]1.2996,[41]1.2959,[42]1.3094,[43]1.3088,[44]1.3091,
167
+ Final estimate: PPL = 1.3091 +/- 0.00711
168
+
169
+ llama_perf_context_print: load time = 2322.79 ms
170
+ llama_perf_context_print: prompt eval time = 123226.59 ms / 90112 tokens ( 1.37 ms per token, 731.27 tokens per second)
171
+ llama_perf_context_print: eval time = 0.00 ms / 1 runs ( 0.00 ms per token, inf tokens per second)
172
+ llama_perf_context_print: total time = 124445.92 ms / 90113 tokens
173
+ llama_perf_context_print: graphs reused = 0
174
+ llama_memory_breakdown_print: | memory breakdown [MiB] | total free self model context compute unaccounted |
175
+ llama_memory_breakdown_print: | - CUDA0 (RTX 3090) | 24115 = 16704 + (3859 = 3351 + 40 + 467) + 3551 |
176
+ llama_memory_breakdown_print: | - CUDA1 (RTX 3090) | 24124 = 20000 + (3473 = 3351 + 40 + 82) + 649 |
177
+ llama_memory_breakdown_print: | - Host | 9914 = 9794 + 112 + 8 |
Benchmarks/DataCollection/Qwen3-30B-A3B-Thinking-2507-unsloth-IQ4_NL-attn_kv_IQ4_NL-attn_output_IQ4_NL-attn_q_IQ4_NL-embeddings_Q6_K-ffn_down_IQ4_NL-ffn_up_gate_IQ4_NL-lm_head_IQ4_NL/perplexity_general.log ADDED
@@ -0,0 +1,177 @@
1
+ ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
2
+ ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
3
+ ggml_cuda_init: found 2 CUDA devices:
4
+ Device 0: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
5
+ Device 1: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
6
+ build: 7040 (92bb442ad) with cc (Ubuntu 13.3.0-6ubuntu2~24.04) 13.3.0 for x86_64-linux-gnu
7
+ llama_model_load_from_file_impl: using device CUDA0 (NVIDIA GeForce RTX 3090) (0000:01:00.0) - 20675 MiB free
8
+ llama_model_load_from_file_impl: using device CUDA1 (NVIDIA GeForce RTX 3090) (0000:03:00.0) - 23582 MiB free
9
+ llama_model_loader: loaded meta data with 39 key-value pairs and 579 tensors from /mnt/world8/AI/ToBench/Qwen3-30B-A3B-Thinking-2507-unsloth/Magic_Quant/GGUF/dc_round0_Qwen3-30B-A3B-Thinking-2507-unsloth-IQ4_NL-attn_kv_IQ4_NL-attn_output_IQ4_NL-attn_q_IQ4_NL-embeddings_Q6_K-ffn_down_IQ4_NL-ffn_up_gate_IQ4_NL-lm_head_IQ4_NL.gguf (version GGUF V3 (latest))
10
+ llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
11
+ llama_model_loader: - kv 0: general.architecture str = qwen3moe
12
+ llama_model_loader: - kv 1: general.type str = model
13
+ llama_model_loader: - kv 2: general.name str = Qwen3 30B A3B Thinking 2507 Unsloth
14
+ llama_model_loader: - kv 3: general.version str = 2507
15
+ llama_model_loader: - kv 4: general.finetune str = Thinking-unsloth
16
+ llama_model_loader: - kv 5: general.basename str = Qwen3
17
+ llama_model_loader: - kv 6: general.size_label str = 30B-A3B
18
+ llama_model_loader: - kv 7: general.license str = apache-2.0
19
+ llama_model_loader: - kv 8: general.license.link str = https://huggingface.co/Qwen/Qwen3-30B...
20
+ llama_model_loader: - kv 9: general.base_model.count u32 = 1
21
+ llama_model_loader: - kv 10: general.base_model.0.name str = Qwen3 30B A3B Thinking 2507
22
+ llama_model_loader: - kv 11: general.base_model.0.version str = 2507
23
+ llama_model_loader: - kv 12: general.base_model.0.organization str = Qwen
24
+ llama_model_loader: - kv 13: general.base_model.0.repo_url str = https://huggingface.co/Qwen/Qwen3-30B...
25
+ llama_model_loader: - kv 14: general.tags arr[str,3] = ["qwen", "qwen3", "unsloth"]
26
+ llama_model_loader: - kv 15: qwen3moe.block_count u32 = 48
27
+ llama_model_loader: - kv 16: qwen3moe.context_length u32 = 262144
28
+ llama_model_loader: - kv 17: qwen3moe.embedding_length u32 = 2048
29
+ llama_model_loader: - kv 18: qwen3moe.feed_forward_length u32 = 6144
30
+ llama_model_loader: - kv 19: qwen3moe.attention.head_count u32 = 32
31
+ llama_model_loader: - kv 20: qwen3moe.attention.head_count_kv u32 = 4
32
+ llama_model_loader: - kv 21: qwen3moe.rope.freq_base f32 = 10000000.000000
33
+ llama_model_loader: - kv 22: qwen3moe.attention.layer_norm_rms_epsilon f32 = 0.000001
34
+ llama_model_loader: - kv 23: qwen3moe.expert_used_count u32 = 8
35
+ llama_model_loader: - kv 24: qwen3moe.attention.key_length u32 = 128
36
+ llama_model_loader: - kv 25: qwen3moe.attention.value_length u32 = 128
37
+ llama_model_loader: - kv 26: qwen3moe.expert_count u32 = 128
38
+ llama_model_loader: - kv 27: qwen3moe.expert_feed_forward_length u32 = 768
39
+ llama_model_loader: - kv 28: tokenizer.ggml.model str = gpt2
40
+ llama_model_loader: - kv 29: tokenizer.ggml.pre str = qwen2
41
+ llama_model_loader: - kv 30: tokenizer.ggml.tokens arr[str,151936] = ["!", "\"", "#", "$", "%", "&", "'", ...
42
+ llama_model_loader: - kv 31: tokenizer.ggml.token_type arr[i32,151936] = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
43
+ llama_model_loader: - kv 32: tokenizer.ggml.merges arr[str,151387] = ["Ġ Ġ", "ĠĠ ĠĠ", "i n", "Ġ t",...
44
+ llama_model_loader: - kv 33: tokenizer.ggml.eos_token_id u32 = 151645
45
+ llama_model_loader: - kv 34: tokenizer.ggml.padding_token_id u32 = 151654
46
+ llama_model_loader: - kv 35: tokenizer.ggml.add_bos_token bool = false
47
+ llama_model_loader: - kv 36: tokenizer.chat_template str = {%- if tools %}\n {{- '<|im_start|>...
48
+ llama_model_loader: - kv 37: general.quantization_version u32 = 2
49
+ llama_model_loader: - kv 38: general.file_type u32 = 25
50
+ llama_model_loader: - type f32: 241 tensors
51
+ llama_model_loader: - type q6_K: 1 tensors
52
+ llama_model_loader: - type iq4_nl: 337 tensors
53
+ print_info: file format = GGUF V3 (latest)
54
+ print_info: file type = IQ4_NL - 4.5 bpw
55
+ print_info: file size = 16.11 GiB (4.53 BPW)
56
+ load: printing all EOG tokens:
57
+ load: - 151643 ('<|endoftext|>')
58
+ load: - 151645 ('<|im_end|>')
59
+ load: - 151662 ('<|fim_pad|>')
60
+ load: - 151663 ('<|repo_name|>')
61
+ load: - 151664 ('<|file_sep|>')
62
+ load: special tokens cache size = 26
63
+ load: token to piece cache size = 0.9311 MB
64
+ print_info: arch = qwen3moe
65
+ print_info: vocab_only = 0
66
+ print_info: n_ctx_train = 262144
67
+ print_info: n_embd = 2048
68
+ print_info: n_embd_inp = 2048
69
+ print_info: n_layer = 48
70
+ print_info: n_head = 32
71
+ print_info: n_head_kv = 4
72
+ print_info: n_rot = 128
73
+ print_info: n_swa = 0
74
+ print_info: is_swa_any = 0
75
+ print_info: n_embd_head_k = 128
76
+ print_info: n_embd_head_v = 128
77
+ print_info: n_gqa = 8
78
+ print_info: n_embd_k_gqa = 512
79
+ print_info: n_embd_v_gqa = 512
80
+ print_info: f_norm_eps = 0.0e+00
81
+ print_info: f_norm_rms_eps = 1.0e-06
82
+ print_info: f_clamp_kqv = 0.0e+00
83
+ print_info: f_max_alibi_bias = 0.0e+00
84
+ print_info: f_logit_scale = 0.0e+00
85
+ print_info: f_attn_scale = 0.0e+00
86
+ print_info: n_ff = 6144
87
+ print_info: n_expert = 128
88
+ print_info: n_expert_used = 8
89
+ print_info: n_expert_groups = 0
90
+ print_info: n_group_used = 0
91
+ print_info: causal attn = 1
92
+ print_info: pooling type = 0
93
+ print_info: rope type = 2
94
+ print_info: rope scaling = linear
95
+ print_info: freq_base_train = 10000000.0
96
+ print_info: freq_scale_train = 1
97
+ print_info: n_ctx_orig_yarn = 262144
98
+ print_info: rope_finetuned = unknown
99
+ print_info: model type = 30B.A3B
100
+ print_info: model params = 30.53 B
101
+ print_info: general.name = Qwen3 30B A3B Thinking 2507 Unsloth
102
+ print_info: n_ff_exp = 768
103
+ print_info: vocab type = BPE
104
+ print_info: n_vocab = 151936
105
+ print_info: n_merges = 151387
106
+ print_info: BOS token = 11 ','
107
+ print_info: EOS token = 151645 '<|im_end|>'
108
+ print_info: EOT token = 151645 '<|im_end|>'
109
+ print_info: PAD token = 151654 '<|vision_pad|>'
110
+ print_info: LF token = 198 'Ċ'
111
+ print_info: FIM PRE token = 151659 '<|fim_prefix|>'
112
+ print_info: FIM SUF token = 151661 '<|fim_suffix|>'
113
+ print_info: FIM MID token = 151660 '<|fim_middle|>'
114
+ print_info: FIM PAD token = 151662 '<|fim_pad|>'
115
+ print_info: FIM REP token = 151663 '<|repo_name|>'
116
+ print_info: FIM SEP token = 151664 '<|file_sep|>'
117
+ print_info: EOG token = 151643 '<|endoftext|>'
118
+ print_info: EOG token = 151645 '<|im_end|>'
119
+ print_info: EOG token = 151662 '<|fim_pad|>'
120
+ print_info: EOG token = 151663 '<|repo_name|>'
121
+ print_info: EOG token = 151664 '<|file_sep|>'
122
+ print_info: max token length = 256
123
+ load_tensors: loading model tensors, this can take a while... (mmap = true)
124
+ load_tensors: offloading 20 repeating layers to GPU
125
+ load_tensors: offloaded 20/49 layers to GPU
126
+ load_tensors: CPU_Mapped model buffer size = 9794.32 MiB
127
+ load_tensors: CUDA0 model buffer size = 3351.42 MiB
128
+ load_tensors: CUDA1 model buffer size = 3351.42 MiB
129
+ ....................................................................................................
130
+ llama_context: constructing llama_context
131
+ llama_context: n_seq_max = 1
132
+ llama_context: n_ctx = 2048
133
+ llama_context: n_ctx_seq = 2048
134
+ llama_context: n_batch = 2048
135
+ llama_context: n_ubatch = 512
136
+ llama_context: causal_attn = 1
137
+ llama_context: flash_attn = auto
138
+ llama_context: kv_unified = false
139
+ llama_context: freq_base = 10000000.0
140
+ llama_context: freq_scale = 1
141
+ llama_context: n_ctx_seq (2048) < n_ctx_train (262144) -- the full capacity of the model will not be utilized
142
+ llama_context: CPU output buffer size = 0.58 MiB
143
+ llama_kv_cache: CPU KV buffer size = 112.00 MiB
144
+ llama_kv_cache: CUDA0 KV buffer size = 40.00 MiB
145
+ llama_kv_cache: CUDA1 KV buffer size = 40.00 MiB
146
+ llama_kv_cache: size = 192.00 MiB ( 2048 cells, 48 layers, 1/1 seqs), K (f16): 96.00 MiB, V (f16): 96.00 MiB
147
+ llama_context: Flash Attention was auto, set to enabled
148
+ llama_context: CUDA0 compute buffer size = 467.67 MiB
149
+ llama_context: CUDA1 compute buffer size = 82.01 MiB
150
+ llama_context: CUDA_Host compute buffer size = 8.01 MiB
151
+ llama_context: graph nodes = 3031
152
+ llama_context: graph splits = 397 (with bs=512), 88 (with bs=1)
153
+ common_init_from_params: added <|endoftext|> logit bias = -inf
154
+ common_init_from_params: added <|im_end|> logit bias = -inf
155
+ common_init_from_params: added <|fim_pad|> logit bias = -inf
156
+ common_init_from_params: added <|repo_name|> logit bias = -inf
157
+ common_init_from_params: added <|file_sep|> logit bias = -inf
158
+ common_init_from_params: setting dry_penalty_last_n to ctx_size = 2048
159
+ common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable)
160
+
161
+ system_info: n_threads = 16 (n_threads_batch = 16) / 32 | CUDA : ARCHS = 860 | USE_GRAPHS = 1 | PEER_MAX_BATCH_SIZE = 128 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | BMI2 = 1 | AVX512 = 1 | AVX512_VBMI = 1 | AVX512_VNNI = 1 | AVX512_BF16 = 1 | LLAMAFILE = 1 | OPENMP = 1 | REPACK = 1 |
162
+ perplexity: tokenizing the input ..
163
+ perplexity: tokenization took 48.333 ms
164
+ perplexity: calculating perplexity over 15 chunks, n_ctx=2048, batch_size=2048, n_seq=1
165
+ perplexity: 3.22 seconds per pass - ETA 0.80 minutes
166
+ [1]5.5268,[2]6.5916,[3]7.0599,[4]6.9248,[5]6.8390,[6]5.8915,[7]5.3858,[8]5.4210,[9]5.7348,[10]5.8667,[11]5.9198,[12]6.2353,[13]6.3051,[14]6.4444,[15]6.5299,
167
+ Final estimate: PPL = 6.5299 +/- 0.13434
168
+
169
+ llama_perf_context_print: load time = 2330.41 ms
170
+ llama_perf_context_print: prompt eval time = 44414.37 ms / 30720 tokens ( 1.45 ms per token, 691.67 tokens per second)
171
+ llama_perf_context_print: eval time = 0.00 ms / 1 runs ( 0.00 ms per token, inf tokens per second)
172
+ llama_perf_context_print: total time = 44837.11 ms / 30721 tokens
173
+ llama_perf_context_print: graphs reused = 0
174
+ llama_memory_breakdown_print: | memory breakdown [MiB] | total free self model context compute unaccounted |
175
+ llama_memory_breakdown_print: | - CUDA0 (RTX 3090) | 24115 = 16713 + (3859 = 3351 + 40 + 467) + 3542 |
176
+ llama_memory_breakdown_print: | - CUDA1 (RTX 3090) | 24124 = 20000 + (3473 = 3351 + 40 + 82) + 649 |
177
+ llama_memory_breakdown_print: | - Host | 9914 = 9794 + 112 + 8 |
Benchmarks/DataCollection/Qwen3-30B-A3B-Thinking-2507-unsloth-IQ4_NL-attn_kv_IQ4_NL-attn_output_IQ4_NL-attn_q_IQ4_NL-embeddings_Q6_K-ffn_down_IQ4_NL-ffn_up_gate_IQ4_NL-lm_head_IQ4_NL/perplexity_math.log ADDED
@@ -0,0 +1,177 @@
1
+ ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
2
+ ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
3
+ ggml_cuda_init: found 2 CUDA devices:
4
+ Device 0: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
5
+ Device 1: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
6
+ build: 7040 (92bb442ad) with cc (Ubuntu 13.3.0-6ubuntu2~24.04) 13.3.0 for x86_64-linux-gnu
7
+ llama_model_load_from_file_impl: using device CUDA0 (NVIDIA GeForce RTX 3090) (0000:01:00.0) - 20665 MiB free
8
+ llama_model_load_from_file_impl: using device CUDA1 (NVIDIA GeForce RTX 3090) (0000:03:00.0) - 23582 MiB free
9
+ llama_model_loader: loaded meta data with 39 key-value pairs and 579 tensors from /mnt/world8/AI/ToBench/Qwen3-30B-A3B-Thinking-2507-unsloth/Magic_Quant/GGUF/dc_round0_Qwen3-30B-A3B-Thinking-2507-unsloth-IQ4_NL-attn_kv_IQ4_NL-attn_output_IQ4_NL-attn_q_IQ4_NL-embeddings_Q6_K-ffn_down_IQ4_NL-ffn_up_gate_IQ4_NL-lm_head_IQ4_NL.gguf (version GGUF V3 (latest))
10
+ llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
11
+ llama_model_loader: - kv 0: general.architecture str = qwen3moe
12
+ llama_model_loader: - kv 1: general.type str = model
13
+ llama_model_loader: - kv 2: general.name str = Qwen3 30B A3B Thinking 2507 Unsloth
14
+ llama_model_loader: - kv 3: general.version str = 2507
15
+ llama_model_loader: - kv 4: general.finetune str = Thinking-unsloth
16
+ llama_model_loader: - kv 5: general.basename str = Qwen3
17
+ llama_model_loader: - kv 6: general.size_label str = 30B-A3B
18
+ llama_model_loader: - kv 7: general.license str = apache-2.0
19
+ llama_model_loader: - kv 8: general.license.link str = https://huggingface.co/Qwen/Qwen3-30B...
20
+ llama_model_loader: - kv 9: general.base_model.count u32 = 1
21
+ llama_model_loader: - kv 10: general.base_model.0.name str = Qwen3 30B A3B Thinking 2507
22
+ llama_model_loader: - kv 11: general.base_model.0.version str = 2507
23
+ llama_model_loader: - kv 12: general.base_model.0.organization str = Qwen
24
+ llama_model_loader: - kv 13: general.base_model.0.repo_url str = https://huggingface.co/Qwen/Qwen3-30B...
25
+ llama_model_loader: - kv 14: general.tags arr[str,3] = ["qwen", "qwen3", "unsloth"]
26
+ llama_model_loader: - kv 15: qwen3moe.block_count u32 = 48
27
+ llama_model_loader: - kv 16: qwen3moe.context_length u32 = 262144
28
+ llama_model_loader: - kv 17: qwen3moe.embedding_length u32 = 2048
29
+ llama_model_loader: - kv 18: qwen3moe.feed_forward_length u32 = 6144
30
+ llama_model_loader: - kv 19: qwen3moe.attention.head_count u32 = 32
31
+ llama_model_loader: - kv 20: qwen3moe.attention.head_count_kv u32 = 4
32
+ llama_model_loader: - kv 21: qwen3moe.rope.freq_base f32 = 10000000.000000
33
+ llama_model_loader: - kv 22: qwen3moe.attention.layer_norm_rms_epsilon f32 = 0.000001
34
+ llama_model_loader: - kv 23: qwen3moe.expert_used_count u32 = 8
35
+ llama_model_loader: - kv 24: qwen3moe.attention.key_length u32 = 128
36
+ llama_model_loader: - kv 25: qwen3moe.attention.value_length u32 = 128
37
+ llama_model_loader: - kv 26: qwen3moe.expert_count u32 = 128
38
+ llama_model_loader: - kv 27: qwen3moe.expert_feed_forward_length u32 = 768
39
+ llama_model_loader: - kv 28: tokenizer.ggml.model str = gpt2
40
+ llama_model_loader: - kv 29: tokenizer.ggml.pre str = qwen2
41
+ llama_model_loader: - kv 30: tokenizer.ggml.tokens arr[str,151936] = ["!", "\"", "#", "$", "%", "&", "'", ...
42
+ llama_model_loader: - kv 31: tokenizer.ggml.token_type arr[i32,151936] = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
43
+ llama_model_loader: - kv 32: tokenizer.ggml.merges arr[str,151387] = ["Ġ Ġ", "ĠĠ ĠĠ", "i n", "Ġ t",...
44
+ llama_model_loader: - kv 33: tokenizer.ggml.eos_token_id u32 = 151645
45
+ llama_model_loader: - kv 34: tokenizer.ggml.padding_token_id u32 = 151654
46
+ llama_model_loader: - kv 35: tokenizer.ggml.add_bos_token bool = false
47
+ llama_model_loader: - kv 36: tokenizer.chat_template str = {%- if tools %}\n {{- '<|im_start|>...
48
+ llama_model_loader: - kv 37: general.quantization_version u32 = 2
49
+ llama_model_loader: - kv 38: general.file_type u32 = 25
50
+ llama_model_loader: - type f32: 241 tensors
51
+ llama_model_loader: - type q6_K: 1 tensors
52
+ llama_model_loader: - type iq4_nl: 337 tensors
53
+ print_info: file format = GGUF V3 (latest)
54
+ print_info: file type = IQ4_NL - 4.5 bpw
55
+ print_info: file size = 16.11 GiB (4.53 BPW)
56
+ load: printing all EOG tokens:
57
+ load: - 151643 ('<|endoftext|>')
58
+ load: - 151645 ('<|im_end|>')
59
+ load: - 151662 ('<|fim_pad|>')
60
+ load: - 151663 ('<|repo_name|>')
61
+ load: - 151664 ('<|file_sep|>')
62
+ load: special tokens cache size = 26
63
+ load: token to piece cache size = 0.9311 MB
64
+ print_info: arch = qwen3moe
65
+ print_info: vocab_only = 0
66
+ print_info: n_ctx_train = 262144
67
+ print_info: n_embd = 2048
68
+ print_info: n_embd_inp = 2048
69
+ print_info: n_layer = 48
70
+ print_info: n_head = 32
71
+ print_info: n_head_kv = 4
72
+ print_info: n_rot = 128
73
+ print_info: n_swa = 0
74
+ print_info: is_swa_any = 0
75
+ print_info: n_embd_head_k = 128
76
+ print_info: n_embd_head_v = 128
77
+ print_info: n_gqa = 8
78
+ print_info: n_embd_k_gqa = 512
79
+ print_info: n_embd_v_gqa = 512
80
+ print_info: f_norm_eps = 0.0e+00
81
+ print_info: f_norm_rms_eps = 1.0e-06
82
+ print_info: f_clamp_kqv = 0.0e+00
83
+ print_info: f_max_alibi_bias = 0.0e+00
84
+ print_info: f_logit_scale = 0.0e+00
85
+ print_info: f_attn_scale = 0.0e+00
86
+ print_info: n_ff = 6144
87
+ print_info: n_expert = 128
88
+ print_info: n_expert_used = 8
89
+ print_info: n_expert_groups = 0
90
+ print_info: n_group_used = 0
91
+ print_info: causal attn = 1
92
+ print_info: pooling type = 0
93
+ print_info: rope type = 2
94
+ print_info: rope scaling = linear
95
+ print_info: freq_base_train = 10000000.0
96
+ print_info: freq_scale_train = 1
97
+ print_info: n_ctx_orig_yarn = 262144
98
+ print_info: rope_finetuned = unknown
99
+ print_info: model type = 30B.A3B
100
+ print_info: model params = 30.53 B
101
+ print_info: general.name = Qwen3 30B A3B Thinking 2507 Unsloth
102
+ print_info: n_ff_exp = 768
103
+ print_info: vocab type = BPE
104
+ print_info: n_vocab = 151936
105
+ print_info: n_merges = 151387
106
+ print_info: BOS token = 11 ','
107
+ print_info: EOS token = 151645 '<|im_end|>'
108
+ print_info: EOT token = 151645 '<|im_end|>'
109
+ print_info: PAD token = 151654 '<|vision_pad|>'
110
+ print_info: LF token = 198 'Ċ'
111
+ print_info: FIM PRE token = 151659 '<|fim_prefix|>'
112
+ print_info: FIM SUF token = 151661 '<|fim_suffix|>'
113
+ print_info: FIM MID token = 151660 '<|fim_middle|>'
114
+ print_info: FIM PAD token = 151662 '<|fim_pad|>'
115
+ print_info: FIM REP token = 151663 '<|repo_name|>'
116
+ print_info: FIM SEP token = 151664 '<|file_sep|>'
117
+ print_info: EOG token = 151643 '<|endoftext|>'
118
+ print_info: EOG token = 151645 '<|im_end|>'
119
+ print_info: EOG token = 151662 '<|fim_pad|>'
120
+ print_info: EOG token = 151663 '<|repo_name|>'
121
+ print_info: EOG token = 151664 '<|file_sep|>'
122
+ print_info: max token length = 256
123
+ load_tensors: loading model tensors, this can take a while... (mmap = true)
124
+ load_tensors: offloading 20 repeating layers to GPU
125
+ load_tensors: offloaded 20/49 layers to GPU
126
+ load_tensors: CPU_Mapped model buffer size = 9794.32 MiB
127
+ load_tensors: CUDA0 model buffer size = 3351.42 MiB
128
+ load_tensors: CUDA1 model buffer size = 3351.42 MiB
129
+ ....................................................................................................
130
+ llama_context: constructing llama_context
131
+ llama_context: n_seq_max = 1
132
+ llama_context: n_ctx = 2048
133
+ llama_context: n_ctx_seq = 2048
134
+ llama_context: n_batch = 2048
135
+ llama_context: n_ubatch = 512
136
+ llama_context: causal_attn = 1
137
+ llama_context: flash_attn = auto
138
+ llama_context: kv_unified = false
139
+ llama_context: freq_base = 10000000.0
140
+ llama_context: freq_scale = 1
141
+ llama_context: n_ctx_seq (2048) < n_ctx_train (262144) -- the full capacity of the model will not be utilized
142
+ llama_context: CPU output buffer size = 0.58 MiB
143
+ llama_kv_cache: CPU KV buffer size = 112.00 MiB
144
+ llama_kv_cache: CUDA0 KV buffer size = 40.00 MiB
145
+ llama_kv_cache: CUDA1 KV buffer size = 40.00 MiB
146
+ llama_kv_cache: size = 192.00 MiB ( 2048 cells, 48 layers, 1/1 seqs), K (f16): 96.00 MiB, V (f16): 96.00 MiB
147
+ llama_context: Flash Attention was auto, set to enabled
148
+ llama_context: CUDA0 compute buffer size = 467.67 MiB
149
+ llama_context: CUDA1 compute buffer size = 82.01 MiB
150
+ llama_context: CUDA_Host compute buffer size = 8.01 MiB
151
+ llama_context: graph nodes = 3031
152
+ llama_context: graph splits = 397 (with bs=512), 88 (with bs=1)
153
+ common_init_from_params: added <|endoftext|> logit bias = -inf
154
+ common_init_from_params: added <|im_end|> logit bias = -inf
155
+ common_init_from_params: added <|fim_pad|> logit bias = -inf
156
+ common_init_from_params: added <|repo_name|> logit bias = -inf
157
+ common_init_from_params: added <|file_sep|> logit bias = -inf
158
+ common_init_from_params: setting dry_penalty_last_n to ctx_size = 2048
159
+ common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable)
160
+
161
+ system_info: n_threads = 16 (n_threads_batch = 16) / 32 | CUDA : ARCHS = 860 | USE_GRAPHS = 1 | PEER_MAX_BATCH_SIZE = 128 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | BMI2 = 1 | AVX512 = 1 | AVX512_VBMI = 1 | AVX512_VNNI = 1 | AVX512_BF16 = 1 | LLAMAFILE = 1 | OPENMP = 1 | REPACK = 1 |
162
+ perplexity: tokenizing the input ..
163
+ perplexity: tokenization took 45.264 ms
164
+ perplexity: calculating perplexity over 16 chunks, n_ctx=2048, batch_size=2048, n_seq=1
165
+ perplexity: 3.35 seconds per pass - ETA 0.88 minutes
166
+ [1]4.6592,[2]5.0932,[3]5.3619,[4]5.5518,[5]5.7639,[6]5.7737,[7]5.7765,[8]5.7245,[9]5.7687,[10]5.7621,[11]5.7632,[12]5.7604,[13]5.8453,[14]5.8550,[15]5.8489,[16]5.8552,
167
+ Final estimate: PPL = 5.8552 +/- 0.10806
168
+
169
+ llama_perf_context_print: load time = 2341.69 ms
170
+ llama_perf_context_print: prompt eval time = 49041.53 ms / 32768 tokens ( 1.50 ms per token, 668.17 tokens per second)
171
+ llama_perf_context_print: eval time = 0.00 ms / 1 runs ( 0.00 ms per token, inf tokens per second)
172
+ llama_perf_context_print: total time = 49486.92 ms / 32769 tokens
173
+ llama_perf_context_print: graphs reused = 0
174
+ llama_memory_breakdown_print: | memory breakdown [MiB] | total free self model context compute unaccounted |
175
+ llama_memory_breakdown_print: | - CUDA0 (RTX 3090) | 24115 = 16702 + (3859 = 3351 + 40 + 467) + 3554 |
176
+ llama_memory_breakdown_print: | - CUDA1 (RTX 3090) | 24124 = 20000 + (3473 = 3351 + 40 + 82) + 649 |
177
+ llama_memory_breakdown_print: | - Host | 9914 = 9794 + 112 + 8 |
Benchmarks/DataCollection/Qwen3-30B-A3B-Thinking-2507-unsloth-IQ4_NL-attn_kv_IQ4_NL-attn_output_IQ4_NL-attn_q_IQ4_NL-embeddings_Q8_0-ffn_down_IQ4_NL-ffn_up_gate_IQ4_NL-lm_head_IQ4_NL/bench_metrics.json ADDED
@@ -0,0 +1,44 @@
1
+ {
2
+ "raw_metrics": {
3
+ "llamabench": {
4
+ "backend": "CUDA",
5
+ "log_path": "Qwen3-30B-A3B-Thinking-2507-unsloth-IQ4_NL-attn_kv_IQ4_NL-attn_output_IQ4_NL-attn_q_IQ4_NL-embeddings_Q8_0-ffn_down_IQ4_NL-ffn_up_gate_IQ4_NL-lm_head_IQ4_NL/llamabench.md",
6
+ "ngl": "35",
7
+ "raw_row": {
8
+ "backend": "CUDA",
9
+ "model": "qwen3moe 30B.A3B IQ4_NL - 4.5 bpw",
10
+ "ngl": "35",
11
+ "params": "30.53 B",
12
+ "size": "16.18 GiB",
13
+ "t/s": "147.38 \u00b1 9.26",
14
+ "test": "pp8",
15
+ "tps_value": 147.38
16
+ },
17
+ "test": "pp8",
18
+ "tps": 147.38
19
+ },
20
+ "perplexity": {
21
+ "code": {
22
+ "log_path": "Qwen3-30B-A3B-Thinking-2507-unsloth-IQ4_NL-attn_kv_IQ4_NL-attn_output_IQ4_NL-attn_q_IQ4_NL-embeddings_Q8_0-ffn_down_IQ4_NL-ffn_up_gate_IQ4_NL-lm_head_IQ4_NL/perplexity_code.log",
23
+ "ppl": 1.3092,
24
+ "ppl_error": 0.0071
25
+ },
26
+ "general": {
27
+ "log_path": "Qwen3-30B-A3B-Thinking-2507-unsloth-IQ4_NL-attn_kv_IQ4_NL-attn_output_IQ4_NL-attn_q_IQ4_NL-embeddings_Q8_0-ffn_down_IQ4_NL-ffn_up_gate_IQ4_NL-lm_head_IQ4_NL/perplexity_general.log",
28
+ "ppl": 6.527,
29
+ "ppl_error": 0.13429
30
+ },
31
+ "math": {
32
+ "log_path": "Qwen3-30B-A3B-Thinking-2507-unsloth-IQ4_NL-attn_kv_IQ4_NL-attn_output_IQ4_NL-attn_q_IQ4_NL-embeddings_Q8_0-ffn_down_IQ4_NL-ffn_up_gate_IQ4_NL-lm_head_IQ4_NL/perplexity_math.log",
33
+ "ppl": 5.8546,
34
+ "ppl_error": 0.10803
35
+ }
36
+ }
37
+ },
38
+ "summary": {
39
+ "avg_prec_loss_pct": 2.7761,
40
+ "bench_tps": 147.38,
41
+ "file_size_bytes": 17379850272,
42
+ "file_size_gb": 16.19
43
+ }
44
+ }
Benchmarks/DataCollection/Qwen3-30B-A3B-Thinking-2507-unsloth-IQ4_NL-attn_kv_IQ4_NL-attn_output_IQ4_NL-attn_q_IQ4_NL-embeddings_Q8_0-ffn_down_IQ4_NL-ffn_up_gate_IQ4_NL-lm_head_IQ4_NL/llamabench.md ADDED
@@ -0,0 +1,11 @@
1
+ ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
2
+ ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
3
+ ggml_cuda_init: found 2 CUDA devices:
4
+ Device 0: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
5
+ Device 1: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
6
+ | model | size | params | backend | ngl | test | t/s |
7
+ | ------------------------------ | ---------: | ---------: | ---------- | --: | --------------: | -------------------: |
8
+ | qwen3moe 30B.A3B IQ4_NL - 4.5 bpw | 16.18 GiB | 30.53 B | CUDA | 35 | pp8 | 147.38 ± 9.26 |
9
+ | qwen3moe 30B.A3B IQ4_NL - 4.5 bpw | 16.18 GiB | 30.53 B | CUDA | 35 | tg128 | 53.39 ± 0.70 |
10
+
11
+ build: 92bb442ad (7040)
Benchmarks/DataCollection/Qwen3-30B-A3B-Thinking-2507-unsloth-IQ4_NL-attn_kv_IQ4_NL-attn_output_IQ4_NL-attn_q_IQ4_NL-embeddings_Q8_0-ffn_down_IQ4_NL-ffn_up_gate_IQ4_NL-lm_head_IQ4_NL/perplexity_code.log ADDED
@@ -0,0 +1,177 @@
1
+ ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
2
+ ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
3
+ ggml_cuda_init: found 2 CUDA devices:
4
+ Device 0: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
5
+ Device 1: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
6
+ build: 7040 (92bb442ad) with cc (Ubuntu 13.3.0-6ubuntu2~24.04) 13.3.0 for x86_64-linux-gnu
7
+ llama_model_load_from_file_impl: using device CUDA0 (NVIDIA GeForce RTX 3090) (0000:01:00.0) - 20666 MiB free
8
+ llama_model_load_from_file_impl: using device CUDA1 (NVIDIA GeForce RTX 3090) (0000:03:00.0) - 23582 MiB free
9
+ llama_model_loader: loaded meta data with 39 key-value pairs and 579 tensors from /mnt/world8/AI/ToBench/Qwen3-30B-A3B-Thinking-2507-unsloth/Magic_Quant/GGUF/dc_round0_Qwen3-30B-A3B-Thinking-2507-unsloth-IQ4_NL-attn_kv_IQ4_NL-attn_output_IQ4_NL-attn_q_IQ4_NL-embeddings_Q8_0-ffn_down_IQ4_NL-ffn_up_gate_IQ4_NL-lm_head_IQ4_NL.gguf (version GGUF V3 (latest))
10
+ llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
11
+ llama_model_loader: - kv 0: general.architecture str = qwen3moe
12
+ llama_model_loader: - kv 1: general.type str = model
13
+ llama_model_loader: - kv 2: general.name str = Qwen3 30B A3B Thinking 2507 Unsloth
14
+ llama_model_loader: - kv 3: general.version str = 2507
15
+ llama_model_loader: - kv 4: general.finetune str = Thinking-unsloth
16
+ llama_model_loader: - kv 5: general.basename str = Qwen3
17
+ llama_model_loader: - kv 6: general.size_label str = 30B-A3B
18
+ llama_model_loader: - kv 7: general.license str = apache-2.0
19
+ llama_model_loader: - kv 8: general.license.link str = https://huggingface.co/Qwen/Qwen3-30B...
20
+ llama_model_loader: - kv 9: general.base_model.count u32 = 1
21
+ llama_model_loader: - kv 10: general.base_model.0.name str = Qwen3 30B A3B Thinking 2507
22
+ llama_model_loader: - kv 11: general.base_model.0.version str = 2507
23
+ llama_model_loader: - kv 12: general.base_model.0.organization str = Qwen
24
+ llama_model_loader: - kv 13: general.base_model.0.repo_url str = https://huggingface.co/Qwen/Qwen3-30B...
25
+ llama_model_loader: - kv 14: general.tags arr[str,3] = ["qwen", "qwen3", "unsloth"]
26
+ llama_model_loader: - kv 15: qwen3moe.block_count u32 = 48
27
+ llama_model_loader: - kv 16: qwen3moe.context_length u32 = 262144
28
+ llama_model_loader: - kv 17: qwen3moe.embedding_length u32 = 2048
29
+ llama_model_loader: - kv 18: qwen3moe.feed_forward_length u32 = 6144
30
+ llama_model_loader: - kv 19: qwen3moe.attention.head_count u32 = 32
31
+ llama_model_loader: - kv 20: qwen3moe.attention.head_count_kv u32 = 4
32
+ llama_model_loader: - kv 21: qwen3moe.rope.freq_base f32 = 10000000.000000
33
+ llama_model_loader: - kv 22: qwen3moe.attention.layer_norm_rms_epsilon f32 = 0.000001
34
+ llama_model_loader: - kv 23: qwen3moe.expert_used_count u32 = 8
35
+ llama_model_loader: - kv 24: qwen3moe.attention.key_length u32 = 128
36
+ llama_model_loader: - kv 25: qwen3moe.attention.value_length u32 = 128
37
+ llama_model_loader: - kv 26: qwen3moe.expert_count u32 = 128
38
+ llama_model_loader: - kv 27: qwen3moe.expert_feed_forward_length u32 = 768
39
+ llama_model_loader: - kv 28: tokenizer.ggml.model str = gpt2
40
+ llama_model_loader: - kv 29: tokenizer.ggml.pre str = qwen2
41
+ llama_model_loader: - kv 30: tokenizer.ggml.tokens arr[str,151936] = ["!", "\"", "#", "$", "%", "&", "'", ...
42
+ llama_model_loader: - kv 31: tokenizer.ggml.token_type arr[i32,151936] = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
43
+ llama_model_loader: - kv 32: tokenizer.ggml.merges arr[str,151387] = ["Ġ Ġ", "ĠĠ ĠĠ", "i n", "Ġ t",...
44
+ llama_model_loader: - kv 33: tokenizer.ggml.eos_token_id u32 = 151645
45
+ llama_model_loader: - kv 34: tokenizer.ggml.padding_token_id u32 = 151654
46
+ llama_model_loader: - kv 35: tokenizer.ggml.add_bos_token bool = false
47
+ llama_model_loader: - kv 36: tokenizer.chat_template str = {%- if tools %}\n {{- '<|im_start|>...
48
+ llama_model_loader: - kv 37: general.quantization_version u32 = 2
49
+ llama_model_loader: - kv 38: general.file_type u32 = 25
50
+ llama_model_loader: - type f32: 241 tensors
51
+ llama_model_loader: - type q8_0: 1 tensors
52
+ llama_model_loader: - type iq4_nl: 337 tensors
53
+ print_info: file format = GGUF V3 (latest)
54
+ print_info: file type = IQ4_NL - 4.5 bpw
55
+ print_info: file size = 16.18 GiB (4.55 BPW)
56
+ load: printing all EOG tokens:
57
+ load: - 151643 ('<|endoftext|>')
58
+ load: - 151645 ('<|im_end|>')
59
+ load: - 151662 ('<|fim_pad|>')
60
+ load: - 151663 ('<|repo_name|>')
61
+ load: - 151664 ('<|file_sep|>')
62
+ load: special tokens cache size = 26
63
+ load: token to piece cache size = 0.9311 MB
64
+ print_info: arch = qwen3moe
65
+ print_info: vocab_only = 0
66
+ print_info: n_ctx_train = 262144
67
+ print_info: n_embd = 2048
68
+ print_info: n_embd_inp = 2048
69
+ print_info: n_layer = 48
70
+ print_info: n_head = 32
71
+ print_info: n_head_kv = 4
72
+ print_info: n_rot = 128
73
+ print_info: n_swa = 0
74
+ print_info: is_swa_any = 0
75
+ print_info: n_embd_head_k = 128
76
+ print_info: n_embd_head_v = 128
77
+ print_info: n_gqa = 8
78
+ print_info: n_embd_k_gqa = 512
79
+ print_info: n_embd_v_gqa = 512
80
+ print_info: f_norm_eps = 0.0e+00
81
+ print_info: f_norm_rms_eps = 1.0e-06
82
+ print_info: f_clamp_kqv = 0.0e+00
83
+ print_info: f_max_alibi_bias = 0.0e+00
84
+ print_info: f_logit_scale = 0.0e+00
85
+ print_info: f_attn_scale = 0.0e+00
86
+ print_info: n_ff = 6144
87
+ print_info: n_expert = 128
88
+ print_info: n_expert_used = 8
89
+ print_info: n_expert_groups = 0
90
+ print_info: n_group_used = 0
91
+ print_info: causal attn = 1
92
+ print_info: pooling type = 0
93
+ print_info: rope type = 2
94
+ print_info: rope scaling = linear
95
+ print_info: freq_base_train = 10000000.0
96
+ print_info: freq_scale_train = 1
97
+ print_info: n_ctx_orig_yarn = 262144
98
+ print_info: rope_finetuned = unknown
99
+ print_info: model type = 30B.A3B
100
+ print_info: model params = 30.53 B
101
+ print_info: general.name = Qwen3 30B A3B Thinking 2507 Unsloth
102
+ print_info: n_ff_exp = 768
103
+ print_info: vocab type = BPE
104
+ print_info: n_vocab = 151936
105
+ print_info: n_merges = 151387
106
+ print_info: BOS token = 11 ','
107
+ print_info: EOS token = 151645 '<|im_end|>'
108
+ print_info: EOT token = 151645 '<|im_end|>'
109
+ print_info: PAD token = 151654 '<|vision_pad|>'
110
+ print_info: LF token = 198 'Ċ'
111
+ print_info: FIM PRE token = 151659 '<|fim_prefix|>'
112
+ print_info: FIM SUF token = 151661 '<|fim_suffix|>'
113
+ print_info: FIM MID token = 151660 '<|fim_middle|>'
114
+ print_info: FIM PAD token = 151662 '<|fim_pad|>'
115
+ print_info: FIM REP token = 151663 '<|repo_name|>'
116
+ print_info: FIM SEP token = 151664 '<|file_sep|>'
117
+ print_info: EOG token = 151643 '<|endoftext|>'
118
+ print_info: EOG token = 151645 '<|im_end|>'
119
+ print_info: EOG token = 151662 '<|fim_pad|>'
120
+ print_info: EOG token = 151663 '<|repo_name|>'
121
+ print_info: EOG token = 151664 '<|file_sep|>'
122
+ print_info: max token length = 256
123
+ load_tensors: loading model tensors, this can take a while... (mmap = true)
124
+ load_tensors: offloading 20 repeating layers to GPU
125
+ load_tensors: offloaded 20/49 layers to GPU
126
+ load_tensors: CPU_Mapped model buffer size = 9866.19 MiB
127
+ load_tensors: CUDA0 model buffer size = 3351.42 MiB
128
+ load_tensors: CUDA1 model buffer size = 3351.42 MiB
129
+ ....................................................................................................
130
+ llama_context: constructing llama_context
131
+ llama_context: n_seq_max = 1
132
+ llama_context: n_ctx = 2048
133
+ llama_context: n_ctx_seq = 2048
134
+ llama_context: n_batch = 2048
135
+ llama_context: n_ubatch = 512
136
+ llama_context: causal_attn = 1
137
+ llama_context: flash_attn = auto
138
+ llama_context: kv_unified = false
139
+ llama_context: freq_base = 10000000.0
140
+ llama_context: freq_scale = 1
141
+ llama_context: n_ctx_seq (2048) < n_ctx_train (262144) -- the full capacity of the model will not be utilized
142
+ llama_context: CPU output buffer size = 0.58 MiB
143
+ llama_kv_cache: CPU KV buffer size = 112.00 MiB
144
+ llama_kv_cache: CUDA0 KV buffer size = 40.00 MiB
145
+ llama_kv_cache: CUDA1 KV buffer size = 40.00 MiB
146
+ llama_kv_cache: size = 192.00 MiB ( 2048 cells, 48 layers, 1/1 seqs), K (f16): 96.00 MiB, V (f16): 96.00 MiB
147
+ llama_context: Flash Attention was auto, set to enabled
148
+ llama_context: CUDA0 compute buffer size = 467.67 MiB
149
+ llama_context: CUDA1 compute buffer size = 82.01 MiB
150
+ llama_context: CUDA_Host compute buffer size = 8.01 MiB
151
+ llama_context: graph nodes = 3031
152
+ llama_context: graph splits = 397 (with bs=512), 88 (with bs=1)
153
+ common_init_from_params: added <|endoftext|> logit bias = -inf
154
+ common_init_from_params: added <|im_end|> logit bias = -inf
155
+ common_init_from_params: added <|fim_pad|> logit bias = -inf
156
+ common_init_from_params: added <|repo_name|> logit bias = -inf
157
+ common_init_from_params: added <|file_sep|> logit bias = -inf
158
+ common_init_from_params: setting dry_penalty_last_n to ctx_size = 2048
159
+ common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable)
160
+
161
+ system_info: n_threads = 16 (n_threads_batch = 16) / 32 | CUDA : ARCHS = 860 | USE_GRAPHS = 1 | PEER_MAX_BATCH_SIZE = 128 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | BMI2 = 1 | AVX512 = 1 | AVX512_VBMI = 1 | AVX512_VNNI = 1 | AVX512_BF16 = 1 | LLAMAFILE = 1 | OPENMP = 1 | REPACK = 1 |
162
+ perplexity: tokenizing the input ..
163
+ perplexity: tokenization took 115.443 ms
164
+ perplexity: calculating perplexity over 44 chunks, n_ctx=2048, batch_size=2048, n_seq=1
165
+ perplexity: 3.29 seconds per pass - ETA 2.40 minutes
166
+ [1]1.5173,[2]1.4633,[3]1.2894,[4]1.2472,[5]1.3385,[6]1.4012,[7]1.4050,[8]1.4039,[9]1.3630,[10]1.3404,[11]1.3259,[12]1.3272,[13]1.3114,[14]1.3024,[15]1.2981,[16]1.2869,[17]1.2810,[18]1.2799,[19]1.2726,[20]1.2632,[21]1.2603,[22]1.2606,[23]1.2765,[24]1.2694,[25]1.2674,[26]1.2587,[27]1.2532,[28]1.2523,[29]1.2650,[30]1.2667,[31]1.2604,[32]1.2554,[33]1.2565,[34]1.2557,[35]1.2548,[36]1.2766,[37]1.2860,[38]1.2910,[39]1.2981,[40]1.2996,[41]1.2959,[42]1.3094,[43]1.3089,[44]1.3092,
167
+ Final estimate: PPL = 1.3092 +/- 0.00710
168
+
169
+ llama_perf_context_print: load time = 2330.47 ms
170
+ llama_perf_context_print: prompt eval time = 123459.47 ms / 90112 tokens ( 1.37 ms per token, 729.89 tokens per second)
171
+ llama_perf_context_print: eval time = 0.00 ms / 1 runs ( 0.00 ms per token, inf tokens per second)
172
+ llama_perf_context_print: total time = 124970.84 ms / 90113 tokens
173
+ llama_perf_context_print: graphs reused = 0
174
+ llama_memory_breakdown_print: | memory breakdown [MiB] | total free self model context compute unaccounted |
175
+ llama_memory_breakdown_print: | - CUDA0 (RTX 3090) | 24115 = 16692 + (3859 = 3351 + 40 + 467) + 3563 |
176
+ llama_memory_breakdown_print: | - CUDA1 (RTX 3090) | 24124 = 20000 + (3473 = 3351 + 40 + 82) + 649 |
177
+ llama_memory_breakdown_print: | - Host | 9986 = 9866 + 112 + 8 |
Benchmarks/DataCollection/Qwen3-30B-A3B-Thinking-2507-unsloth-IQ4_NL-attn_kv_IQ4_NL-attn_output_IQ4_NL-attn_q_IQ4_NL-embeddings_Q8_0-ffn_down_IQ4_NL-ffn_up_gate_IQ4_NL-lm_head_IQ4_NL/perplexity_general.log ADDED
@@ -0,0 +1,177 @@
1
+ ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
2
+ ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
3
+ ggml_cuda_init: found 2 CUDA devices:
4
+ Device 0: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
5
+ Device 1: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
6
+ build: 7040 (92bb442ad) with cc (Ubuntu 13.3.0-6ubuntu2~24.04) 13.3.0 for x86_64-linux-gnu
7
+ llama_model_load_from_file_impl: using device CUDA0 (NVIDIA GeForce RTX 3090) (0000:01:00.0) - 20669 MiB free
8
+ llama_model_load_from_file_impl: using device CUDA1 (NVIDIA GeForce RTX 3090) (0000:03:00.0) - 23582 MiB free
9
+ llama_model_loader: loaded meta data with 39 key-value pairs and 579 tensors from /mnt/world8/AI/ToBench/Qwen3-30B-A3B-Thinking-2507-unsloth/Magic_Quant/GGUF/dc_round0_Qwen3-30B-A3B-Thinking-2507-unsloth-IQ4_NL-attn_kv_IQ4_NL-attn_output_IQ4_NL-attn_q_IQ4_NL-embeddings_Q8_0-ffn_down_IQ4_NL-ffn_up_gate_IQ4_NL-lm_head_IQ4_NL.gguf (version GGUF V3 (latest))
10
+ llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
11
+ llama_model_loader: - kv 0: general.architecture str = qwen3moe
12
+ llama_model_loader: - kv 1: general.type str = model
13
+ llama_model_loader: - kv 2: general.name str = Qwen3 30B A3B Thinking 2507 Unsloth
14
+ llama_model_loader: - kv 3: general.version str = 2507
15
+ llama_model_loader: - kv 4: general.finetune str = Thinking-unsloth
16
+ llama_model_loader: - kv 5: general.basename str = Qwen3
17
+ llama_model_loader: - kv 6: general.size_label str = 30B-A3B
18
+ llama_model_loader: - kv 7: general.license str = apache-2.0
19
+ llama_model_loader: - kv 8: general.license.link str = https://huggingface.co/Qwen/Qwen3-30B...
20
+ llama_model_loader: - kv 9: general.base_model.count u32 = 1
21
+ llama_model_loader: - kv 10: general.base_model.0.name str = Qwen3 30B A3B Thinking 2507
22
+ llama_model_loader: - kv 11: general.base_model.0.version str = 2507
23
+ llama_model_loader: - kv 12: general.base_model.0.organization str = Qwen
24
+ llama_model_loader: - kv 13: general.base_model.0.repo_url str = https://huggingface.co/Qwen/Qwen3-30B...
25
+ llama_model_loader: - kv 14: general.tags arr[str,3] = ["qwen", "qwen3", "unsloth"]
26
+ llama_model_loader: - kv 15: qwen3moe.block_count u32 = 48
27
+ llama_model_loader: - kv 16: qwen3moe.context_length u32 = 262144
28
+ llama_model_loader: - kv 17: qwen3moe.embedding_length u32 = 2048
29
+ llama_model_loader: - kv 18: qwen3moe.feed_forward_length u32 = 6144
30
+ llama_model_loader: - kv 19: qwen3moe.attention.head_count u32 = 32
31
+ llama_model_loader: - kv 20: qwen3moe.attention.head_count_kv u32 = 4
32
+ llama_model_loader: - kv 21: qwen3moe.rope.freq_base f32 = 10000000.000000
33
+ llama_model_loader: - kv 22: qwen3moe.attention.layer_norm_rms_epsilon f32 = 0.000001
34
+ llama_model_loader: - kv 23: qwen3moe.expert_used_count u32 = 8
35
+ llama_model_loader: - kv 24: qwen3moe.attention.key_length u32 = 128
36
+ llama_model_loader: - kv 25: qwen3moe.attention.value_length u32 = 128
37
+ llama_model_loader: - kv 26: qwen3moe.expert_count u32 = 128
38
+ llama_model_loader: - kv 27: qwen3moe.expert_feed_forward_length u32 = 768
39
+ llama_model_loader: - kv 28: tokenizer.ggml.model str = gpt2
40
+ llama_model_loader: - kv 29: tokenizer.ggml.pre str = qwen2
41
+ llama_model_loader: - kv 30: tokenizer.ggml.tokens arr[str,151936] = ["!", "\"", "#", "$", "%", "&", "'", ...
42
+ llama_model_loader: - kv 31: tokenizer.ggml.token_type arr[i32,151936] = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
43
+ llama_model_loader: - kv 32: tokenizer.ggml.merges arr[str,151387] = ["Ġ Ġ", "ĠĠ ĠĠ", "i n", "Ġ t",...
44
+ llama_model_loader: - kv 33: tokenizer.ggml.eos_token_id u32 = 151645
45
+ llama_model_loader: - kv 34: tokenizer.ggml.padding_token_id u32 = 151654
46
+ llama_model_loader: - kv 35: tokenizer.ggml.add_bos_token bool = false
47
+ llama_model_loader: - kv 36: tokenizer.chat_template str = {%- if tools %}\n {{- '<|im_start|>...
48
+ llama_model_loader: - kv 37: general.quantization_version u32 = 2
49
+ llama_model_loader: - kv 38: general.file_type u32 = 25
50
+ llama_model_loader: - type f32: 241 tensors
51
+ llama_model_loader: - type q8_0: 1 tensors
52
+ llama_model_loader: - type iq4_nl: 337 tensors
53
+ print_info: file format = GGUF V3 (latest)
54
+ print_info: file type = IQ4_NL - 4.5 bpw
55
+ print_info: file size = 16.18 GiB (4.55 BPW)
56
+ load: printing all EOG tokens:
57
+ load: - 151643 ('<|endoftext|>')
58
+ load: - 151645 ('<|im_end|>')
59
+ load: - 151662 ('<|fim_pad|>')
60
+ load: - 151663 ('<|repo_name|>')
61
+ load: - 151664 ('<|file_sep|>')
62
+ load: special tokens cache size = 26
63
+ load: token to piece cache size = 0.9311 MB
64
+ print_info: arch = qwen3moe
65
+ print_info: vocab_only = 0
66
+ print_info: n_ctx_train = 262144
67
+ print_info: n_embd = 2048
68
+ print_info: n_embd_inp = 2048
69
+ print_info: n_layer = 48
70
+ print_info: n_head = 32
71
+ print_info: n_head_kv = 4
72
+ print_info: n_rot = 128
73
+ print_info: n_swa = 0
74
+ print_info: is_swa_any = 0
75
+ print_info: n_embd_head_k = 128
76
+ print_info: n_embd_head_v = 128
77
+ print_info: n_gqa = 8
78
+ print_info: n_embd_k_gqa = 512
79
+ print_info: n_embd_v_gqa = 512
80
+ print_info: f_norm_eps = 0.0e+00
81
+ print_info: f_norm_rms_eps = 1.0e-06
82
+ print_info: f_clamp_kqv = 0.0e+00
83
+ print_info: f_max_alibi_bias = 0.0e+00
84
+ print_info: f_logit_scale = 0.0e+00
85
+ print_info: f_attn_scale = 0.0e+00
86
+ print_info: n_ff = 6144
87
+ print_info: n_expert = 128
88
+ print_info: n_expert_used = 8
89
+ print_info: n_expert_groups = 0
90
+ print_info: n_group_used = 0
91
+ print_info: causal attn = 1
92
+ print_info: pooling type = 0
93
+ print_info: rope type = 2
94
+ print_info: rope scaling = linear
95
+ print_info: freq_base_train = 10000000.0
96
+ print_info: freq_scale_train = 1
97
+ print_info: n_ctx_orig_yarn = 262144
98
+ print_info: rope_finetuned = unknown
99
+ print_info: model type = 30B.A3B
100
+ print_info: model params = 30.53 B
101
+ print_info: general.name = Qwen3 30B A3B Thinking 2507 Unsloth
102
+ print_info: n_ff_exp = 768
103
+ print_info: vocab type = BPE
104
+ print_info: n_vocab = 151936
105
+ print_info: n_merges = 151387
106
+ print_info: BOS token = 11 ','
107
+ print_info: EOS token = 151645 '<|im_end|>'
108
+ print_info: EOT token = 151645 '<|im_end|>'
109
+ print_info: PAD token = 151654 '<|vision_pad|>'
110
+ print_info: LF token = 198 'Ċ'
111
+ print_info: FIM PRE token = 151659 '<|fim_prefix|>'
112
+ print_info: FIM SUF token = 151661 '<|fim_suffix|>'
113
+ print_info: FIM MID token = 151660 '<|fim_middle|>'
114
+ print_info: FIM PAD token = 151662 '<|fim_pad|>'
115
+ print_info: FIM REP token = 151663 '<|repo_name|>'
116
+ print_info: FIM SEP token = 151664 '<|file_sep|>'
117
+ print_info: EOG token = 151643 '<|endoftext|>'
118
+ print_info: EOG token = 151645 '<|im_end|>'
119
+ print_info: EOG token = 151662 '<|fim_pad|>'
120
+ print_info: EOG token = 151663 '<|repo_name|>'
121
+ print_info: EOG token = 151664 '<|file_sep|>'
122
+ print_info: max token length = 256
123
+ load_tensors: loading model tensors, this can take a while... (mmap = true)
124
+ load_tensors: offloading 20 repeating layers to GPU
125
+ load_tensors: offloaded 20/49 layers to GPU
126
+ load_tensors: CPU_Mapped model buffer size = 9866.19 MiB
127
+ load_tensors: CUDA0 model buffer size = 3351.42 MiB
128
+ load_tensors: CUDA1 model buffer size = 3351.42 MiB
129
+ ....................................................................................................
130
+ llama_context: constructing llama_context
131
+ llama_context: n_seq_max = 1
132
+ llama_context: n_ctx = 2048
133
+ llama_context: n_ctx_seq = 2048
134
+ llama_context: n_batch = 2048
135
+ llama_context: n_ubatch = 512
136
+ llama_context: causal_attn = 1
137
+ llama_context: flash_attn = auto
138
+ llama_context: kv_unified = false
139
+ llama_context: freq_base = 10000000.0
140
+ llama_context: freq_scale = 1
141
+ llama_context: n_ctx_seq (2048) < n_ctx_train (262144) -- the full capacity of the model will not be utilized
142
+ llama_context: CPU output buffer size = 0.58 MiB
143
+ llama_kv_cache: CPU KV buffer size = 112.00 MiB
144
+ llama_kv_cache: CUDA0 KV buffer size = 40.00 MiB
145
+ llama_kv_cache: CUDA1 KV buffer size = 40.00 MiB
146
+ llama_kv_cache: size = 192.00 MiB ( 2048 cells, 48 layers, 1/1 seqs), K (f16): 96.00 MiB, V (f16): 96.00 MiB
147
+ llama_context: Flash Attention was auto, set to enabled
148
+ llama_context: CUDA0 compute buffer size = 467.67 MiB
149
+ llama_context: CUDA1 compute buffer size = 82.01 MiB
150
+ llama_context: CUDA_Host compute buffer size = 8.01 MiB
151
+ llama_context: graph nodes = 3031
152
+ llama_context: graph splits = 397 (with bs=512), 88 (with bs=1)
153
+ common_init_from_params: added <|endoftext|> logit bias = -inf
154
+ common_init_from_params: added <|im_end|> logit bias = -inf
155
+ common_init_from_params: added <|fim_pad|> logit bias = -inf
156
+ common_init_from_params: added <|repo_name|> logit bias = -inf
157
+ common_init_from_params: added <|file_sep|> logit bias = -inf
158
+ common_init_from_params: setting dry_penalty_last_n to ctx_size = 2048
159
+ common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable)
160
+
161
+ system_info: n_threads = 16 (n_threads_batch = 16) / 32 | CUDA : ARCHS = 860 | USE_GRAPHS = 1 | PEER_MAX_BATCH_SIZE = 128 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | BMI2 = 1 | AVX512 = 1 | AVX512_VBMI = 1 | AVX512_VNNI = 1 | AVX512_BF16 = 1 | LLAMAFILE = 1 | OPENMP = 1 | REPACK = 1 |
162
+ perplexity: tokenizing the input ..
163
+ perplexity: tokenization took 46.314 ms
164
+ perplexity: calculating perplexity over 15 chunks, n_ctx=2048, batch_size=2048, n_seq=1
165
+ perplexity: 3.24 seconds per pass - ETA 0.80 minutes
166
+ [1]5.5380,[2]6.6025,[3]7.0556,[4]6.9240,[5]6.8321,[6]5.8898,[7]5.3840,[8]5.4182,[9]5.7337,[10]5.8682,[11]5.9228,[12]6.2366,[13]6.3046,[14]6.4432,[15]6.5270,
167
+ Final estimate: PPL = 6.5270 +/- 0.13429
168
+
169
+ llama_perf_context_print: load time = 2539.66 ms
170
+ llama_perf_context_print: prompt eval time = 44703.27 ms / 30720 tokens ( 1.46 ms per token, 687.20 tokens per second)
171
+ llama_perf_context_print: eval time = 0.00 ms / 1 runs ( 0.00 ms per token, inf tokens per second)
172
+ llama_perf_context_print: total time = 45125.62 ms / 30721 tokens
173
+ llama_perf_context_print: graphs reused = 0
174
+ llama_memory_breakdown_print: | memory breakdown [MiB] | total free self model context compute unaccounted |
175
+ llama_memory_breakdown_print: | - CUDA0 (RTX 3090) | 24115 = 16695 + (3859 = 3351 + 40 + 467) + 3561 |
176
+ llama_memory_breakdown_print: | - CUDA1 (RTX 3090) | 24124 = 20000 + (3473 = 3351 + 40 + 82) + 649 |
177
+ llama_memory_breakdown_print: | - Host | 9986 = 9866 + 112 + 8 |
Benchmarks/DataCollection/Qwen3-30B-A3B-Thinking-2507-unsloth-IQ4_NL-attn_kv_IQ4_NL-attn_output_IQ4_NL-attn_q_IQ4_NL-embeddings_Q8_0-ffn_down_IQ4_NL-ffn_up_gate_IQ4_NL-lm_head_IQ4_NL/perplexity_math.log ADDED
@@ -0,0 +1,177 @@
1
+ ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
2
+ ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
3
+ ggml_cuda_init: found 2 CUDA devices:
4
+ Device 0: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
5
+ Device 1: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
6
+ build: 7040 (92bb442ad) with cc (Ubuntu 13.3.0-6ubuntu2~24.04) 13.3.0 for x86_64-linux-gnu
7
+ llama_model_load_from_file_impl: using device CUDA0 (NVIDIA GeForce RTX 3090) (0000:01:00.0) - 20663 MiB free
8
+ llama_model_load_from_file_impl: using device CUDA1 (NVIDIA GeForce RTX 3090) (0000:03:00.0) - 23582 MiB free
9
+ llama_model_loader: loaded meta data with 39 key-value pairs and 579 tensors from /mnt/world8/AI/ToBench/Qwen3-30B-A3B-Thinking-2507-unsloth/Magic_Quant/GGUF/dc_round0_Qwen3-30B-A3B-Thinking-2507-unsloth-IQ4_NL-attn_kv_IQ4_NL-attn_output_IQ4_NL-attn_q_IQ4_NL-embeddings_Q8_0-ffn_down_IQ4_NL-ffn_up_gate_IQ4_NL-lm_head_IQ4_NL.gguf (version GGUF V3 (latest))
10
+ llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
11
+ llama_model_loader: - kv 0: general.architecture str = qwen3moe
12
+ llama_model_loader: - kv 1: general.type str = model
13
+ llama_model_loader: - kv 2: general.name str = Qwen3 30B A3B Thinking 2507 Unsloth
14
+ llama_model_loader: - kv 3: general.version str = 2507
15
+ llama_model_loader: - kv 4: general.finetune str = Thinking-unsloth
16
+ llama_model_loader: - kv 5: general.basename str = Qwen3
17
+ llama_model_loader: - kv 6: general.size_label str = 30B-A3B
18
+ llama_model_loader: - kv 7: general.license str = apache-2.0
19
+ llama_model_loader: - kv 8: general.license.link str = https://huggingface.co/Qwen/Qwen3-30B...
20
+ llama_model_loader: - kv 9: general.base_model.count u32 = 1
21
+ llama_model_loader: - kv 10: general.base_model.0.name str = Qwen3 30B A3B Thinking 2507
22
+ llama_model_loader: - kv 11: general.base_model.0.version str = 2507
23
+ llama_model_loader: - kv 12: general.base_model.0.organization str = Qwen
24
+ llama_model_loader: - kv 13: general.base_model.0.repo_url str = https://huggingface.co/Qwen/Qwen3-30B...
25
+ llama_model_loader: - kv 14: general.tags arr[str,3] = ["qwen", "qwen3", "unsloth"]
26
+ llama_model_loader: - kv 15: qwen3moe.block_count u32 = 48
27
+ llama_model_loader: - kv 16: qwen3moe.context_length u32 = 262144
28
+ llama_model_loader: - kv 17: qwen3moe.embedding_length u32 = 2048
29
+ llama_model_loader: - kv 18: qwen3moe.feed_forward_length u32 = 6144
30
+ llama_model_loader: - kv 19: qwen3moe.attention.head_count u32 = 32
31
+ llama_model_loader: - kv 20: qwen3moe.attention.head_count_kv u32 = 4
32
+ llama_model_loader: - kv 21: qwen3moe.rope.freq_base f32 = 10000000.000000
33
+ llama_model_loader: - kv 22: qwen3moe.attention.layer_norm_rms_epsilon f32 = 0.000001
34
+ llama_model_loader: - kv 23: qwen3moe.expert_used_count u32 = 8
35
+ llama_model_loader: - kv 24: qwen3moe.attention.key_length u32 = 128
36
+ llama_model_loader: - kv 25: qwen3moe.attention.value_length u32 = 128
37
+ llama_model_loader: - kv 26: qwen3moe.expert_count u32 = 128
38
+ llama_model_loader: - kv 27: qwen3moe.expert_feed_forward_length u32 = 768
39
+ llama_model_loader: - kv 28: tokenizer.ggml.model str = gpt2
40
+ llama_model_loader: - kv 29: tokenizer.ggml.pre str = qwen2
41
+ llama_model_loader: - kv 30: tokenizer.ggml.tokens arr[str,151936] = ["!", "\"", "#", "$", "%", "&", "'", ...
42
+ llama_model_loader: - kv 31: tokenizer.ggml.token_type arr[i32,151936] = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
43
+ llama_model_loader: - kv 32: tokenizer.ggml.merges arr[str,151387] = ["Ġ Ġ", "ĠĠ ĠĠ", "i n", "Ġ t",...
44
+ llama_model_loader: - kv 33: tokenizer.ggml.eos_token_id u32 = 151645
45
+ llama_model_loader: - kv 34: tokenizer.ggml.padding_token_id u32 = 151654
46
+ llama_model_loader: - kv 35: tokenizer.ggml.add_bos_token bool = false
47
+ llama_model_loader: - kv 36: tokenizer.chat_template str = {%- if tools %}\n {{- '<|im_start|>...
48
+ llama_model_loader: - kv 37: general.quantization_version u32 = 2
49
+ llama_model_loader: - kv 38: general.file_type u32 = 25
50
+ llama_model_loader: - type f32: 241 tensors
51
+ llama_model_loader: - type q8_0: 1 tensors
52
+ llama_model_loader: - type iq4_nl: 337 tensors
53
+ print_info: file format = GGUF V3 (latest)
54
+ print_info: file type = IQ4_NL - 4.5 bpw
55
+ print_info: file size = 16.18 GiB (4.55 BPW)
56
+ load: printing all EOG tokens:
57
+ load: - 151643 ('<|endoftext|>')
58
+ load: - 151645 ('<|im_end|>')
59
+ load: - 151662 ('<|fim_pad|>')
60
+ load: - 151663 ('<|repo_name|>')
61
+ load: - 151664 ('<|file_sep|>')
62
+ load: special tokens cache size = 26
63
+ load: token to piece cache size = 0.9311 MB
64
+ print_info: arch = qwen3moe
65
+ print_info: vocab_only = 0
66
+ print_info: n_ctx_train = 262144
67
+ print_info: n_embd = 2048
68
+ print_info: n_embd_inp = 2048
69
+ print_info: n_layer = 48
70
+ print_info: n_head = 32
71
+ print_info: n_head_kv = 4
72
+ print_info: n_rot = 128
73
+ print_info: n_swa = 0
74
+ print_info: is_swa_any = 0
75
+ print_info: n_embd_head_k = 128
76
+ print_info: n_embd_head_v = 128
77
+ print_info: n_gqa = 8
78
+ print_info: n_embd_k_gqa = 512
79
+ print_info: n_embd_v_gqa = 512
80
+ print_info: f_norm_eps = 0.0e+00
81
+ print_info: f_norm_rms_eps = 1.0e-06
82
+ print_info: f_clamp_kqv = 0.0e+00
83
+ print_info: f_max_alibi_bias = 0.0e+00
84
+ print_info: f_logit_scale = 0.0e+00
85
+ print_info: f_attn_scale = 0.0e+00
86
+ print_info: n_ff = 6144
87
+ print_info: n_expert = 128
88
+ print_info: n_expert_used = 8
89
+ print_info: n_expert_groups = 0
90
+ print_info: n_group_used = 0
91
+ print_info: causal attn = 1
92
+ print_info: pooling type = 0
93
+ print_info: rope type = 2
94
+ print_info: rope scaling = linear
95
+ print_info: freq_base_train = 10000000.0
96
+ print_info: freq_scale_train = 1
97
+ print_info: n_ctx_orig_yarn = 262144
98
+ print_info: rope_finetuned = unknown
99
+ print_info: model type = 30B.A3B
100
+ print_info: model params = 30.53 B
101
+ print_info: general.name = Qwen3 30B A3B Thinking 2507 Unsloth
102
+ print_info: n_ff_exp = 768
103
+ print_info: vocab type = BPE
104
+ print_info: n_vocab = 151936
105
+ print_info: n_merges = 151387
106
+ print_info: BOS token = 11 ','
107
+ print_info: EOS token = 151645 '<|im_end|>'
108
+ print_info: EOT token = 151645 '<|im_end|>'
109
+ print_info: PAD token = 151654 '<|vision_pad|>'
110
+ print_info: LF token = 198 'Ċ'
111
+ print_info: FIM PRE token = 151659 '<|fim_prefix|>'
112
+ print_info: FIM SUF token = 151661 '<|fim_suffix|>'
113
+ print_info: FIM MID token = 151660 '<|fim_middle|>'
114
+ print_info: FIM PAD token = 151662 '<|fim_pad|>'
115
+ print_info: FIM REP token = 151663 '<|repo_name|>'
116
+ print_info: FIM SEP token = 151664 '<|file_sep|>'
117
+ print_info: EOG token = 151643 '<|endoftext|>'
118
+ print_info: EOG token = 151645 '<|im_end|>'
119
+ print_info: EOG token = 151662 '<|fim_pad|>'
120
+ print_info: EOG token = 151663 '<|repo_name|>'
121
+ print_info: EOG token = 151664 '<|file_sep|>'
122
+ print_info: max token length = 256
123
+ load_tensors: loading model tensors, this can take a while... (mmap = true)
124
+ load_tensors: offloading 20 repeating layers to GPU
125
+ load_tensors: offloaded 20/49 layers to GPU
126
+ load_tensors: CPU_Mapped model buffer size = 9866.19 MiB
127
+ load_tensors: CUDA0 model buffer size = 3351.42 MiB
128
+ load_tensors: CUDA1 model buffer size = 3351.42 MiB
129
+ ....................................................................................................
130
+ llama_context: constructing llama_context
131
+ llama_context: n_seq_max = 1
132
+ llama_context: n_ctx = 2048
133
+ llama_context: n_ctx_seq = 2048
134
+ llama_context: n_batch = 2048
135
+ llama_context: n_ubatch = 512
136
+ llama_context: causal_attn = 1
137
+ llama_context: flash_attn = auto
138
+ llama_context: kv_unified = false
139
+ llama_context: freq_base = 10000000.0
140
+ llama_context: freq_scale = 1
141
+ llama_context: n_ctx_seq (2048) < n_ctx_train (262144) -- the full capacity of the model will not be utilized
142
+ llama_context: CPU output buffer size = 0.58 MiB
143
+ llama_kv_cache: CPU KV buffer size = 112.00 MiB
144
+ llama_kv_cache: CUDA0 KV buffer size = 40.00 MiB
145
+ llama_kv_cache: CUDA1 KV buffer size = 40.00 MiB
146
+ llama_kv_cache: size = 192.00 MiB ( 2048 cells, 48 layers, 1/1 seqs), K (f16): 96.00 MiB, V (f16): 96.00 MiB
147
+ llama_context: Flash Attention was auto, set to enabled
148
+ llama_context: CUDA0 compute buffer size = 467.67 MiB
149
+ llama_context: CUDA1 compute buffer size = 82.01 MiB
150
+ llama_context: CUDA_Host compute buffer size = 8.01 MiB
151
+ llama_context: graph nodes = 3031
152
+ llama_context: graph splits = 397 (with bs=512), 88 (with bs=1)
153
+ common_init_from_params: added <|endoftext|> logit bias = -inf
154
+ common_init_from_params: added <|im_end|> logit bias = -inf
155
+ common_init_from_params: added <|fim_pad|> logit bias = -inf
156
+ common_init_from_params: added <|repo_name|> logit bias = -inf
157
+ common_init_from_params: added <|file_sep|> logit bias = -inf
158
+ common_init_from_params: setting dry_penalty_last_n to ctx_size = 2048
159
+ common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable)
160
+
161
+ system_info: n_threads = 16 (n_threads_batch = 16) / 32 | CUDA : ARCHS = 860 | USE_GRAPHS = 1 | PEER_MAX_BATCH_SIZE = 128 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | BMI2 = 1 | AVX512 = 1 | AVX512_VBMI = 1 | AVX512_VNNI = 1 | AVX512_BF16 = 1 | LLAMAFILE = 1 | OPENMP = 1 | REPACK = 1 |
162
+ perplexity: tokenizing the input ..
163
+ perplexity: tokenization took 45.982 ms
164
+ perplexity: calculating perplexity over 16 chunks, n_ctx=2048, batch_size=2048, n_seq=1
165
+ perplexity: 3.37 seconds per pass - ETA 0.88 minutes
166
+ [1]4.6504,[2]5.0826,[3]5.3555,[4]5.5459,[5]5.7597,[6]5.7685,[7]5.7712,[8]5.7222,[9]5.7676,[10]5.7611,[11]5.7628,[12]5.7604,[13]5.8451,[14]5.8550,[15]5.8482,[16]5.8546,
167
+ Final estimate: PPL = 5.8546 +/- 0.10803
168
+
169
+ llama_perf_context_print: load time = 2562.13 ms
170
+ llama_perf_context_print: prompt eval time = 49239.31 ms / 32768 tokens ( 1.50 ms per token, 665.48 tokens per second)
171
+ llama_perf_context_print: eval time = 0.00 ms / 1 runs ( 0.00 ms per token, inf tokens per second)
172
+ llama_perf_context_print: total time = 49718.33 ms / 32769 tokens
173
+ llama_perf_context_print: graphs reused = 0
174
+ llama_memory_breakdown_print: | memory breakdown [MiB] | total free self model context compute unaccounted |
175
+ llama_memory_breakdown_print: | - CUDA0 (RTX 3090) | 24115 = 16693 + (3859 = 3351 + 40 + 467) + 3563 |
176
+ llama_memory_breakdown_print: | - CUDA1 (RTX 3090) | 24124 = 20000 + (3473 = 3351 + 40 + 82) + 649 |
177
+ llama_memory_breakdown_print: | - Host | 9986 = 9866 + 112 + 8 |
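The `ppl` and `ppl_error` fields in the accompanying bench_metrics.json files correspond to the `Final estimate: PPL = X +/- Y` line near the end of each perplexity log above. Below is a minimal sketch of extracting that pair from a log; the helper name and approach are assumptions for illustration, not part of this repo's tooling.

```python
import re

# Assumed helper (not from this repo): pull the final perplexity and its
# reported error out of a llama-perplexity log such as perplexity_math.log.
PPL_RE = re.compile(r"Final estimate: PPL = ([0-9.]+) \+/- ([0-9.]+)")

def read_final_ppl(log_path: str) -> tuple[float, float]:
    """Return (ppl, ppl_error) from the 'Final estimate' line of a log."""
    with open(log_path, encoding="utf-8") as f:
        for line in f:
            m = PPL_RE.search(line)
            if m:
                return float(m.group(1)), float(m.group(2))
    raise ValueError(f"no final PPL estimate found in {log_path}")

# e.g. read_final_ppl("perplexity_math.log") -> (5.8546, 0.10803) for the log above
```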
Benchmarks/DataCollection/Qwen3-30B-A3B-Thinking-2507-unsloth-iq4_nl-attn_kv_BF16-attn_output_BF16-attn_q_BF16-embeddings_BF16-ffn_down_BF16-ffn_up_gate_BF16-lm_head_BF16/bench_metrics.json ADDED
@@ -0,0 +1,44 @@
1
+ {
2
+ "raw_metrics": {
3
+ "llamabench": {
4
+ "backend": "CUDA",
5
+ "log_path": "Qwen3-30B-A3B-Thinking-2507-unsloth-iq4_nl-attn_kv_BF16-attn_output_BF16-attn_q_BF16-embeddings_BF16-ffn_down_BF16-ffn_up_gate_BF16-lm_head_BF16/llamabench.md",
6
+ "ngl": "30",
7
+ "raw_row": {
8
+ "backend": "CUDA",
9
+ "model": "qwen3moe 30B.A3B IQ4_NL - 4.5 bpw",
10
+ "ngl": "30",
11
+ "params": "30.53 B",
12
+ "size": "56.89 GiB",
13
+ "t/s": "51.98 \u00b1 2.02",
14
+ "test": "pp8",
15
+ "tps_value": 51.98
16
+ },
17
+ "test": "pp8",
18
+ "tps": 51.98
19
+ },
20
+ "perplexity": {
21
+ "code": {
22
+ "log_path": "Qwen3-30B-A3B-Thinking-2507-unsloth-iq4_nl-attn_kv_BF16-attn_output_BF16-attn_q_BF16-embeddings_BF16-ffn_down_BF16-ffn_up_gate_BF16-lm_head_BF16/perplexity_code.log",
23
+ "ppl": 1.2903,
24
+ "ppl_error": 0.00687
25
+ },
26
+ "general": {
27
+ "log_path": "Qwen3-30B-A3B-Thinking-2507-unsloth-iq4_nl-attn_kv_BF16-attn_output_BF16-attn_q_BF16-embeddings_BF16-ffn_down_BF16-ffn_up_gate_BF16-lm_head_BF16/perplexity_general.log",
28
+ "ppl": 6.2878,
29
+ "ppl_error": 0.1285
30
+ },
31
+ "math": {
32
+ "log_path": "Qwen3-30B-A3B-Thinking-2507-unsloth-iq4_nl-attn_kv_BF16-attn_output_BF16-attn_q_BF16-embeddings_BF16-ffn_down_BF16-ffn_up_gate_BF16-lm_head_BF16/perplexity_math.log",
33
+ "ppl": 5.6808,
34
+ "ppl_error": 0.10471
35
+ }
36
+ }
37
+ },
38
+ "summary": {
39
+ "avg_prec_loss_pct": 0.0,
40
+ "bench_tps": 51.98,
41
+ "file_size_bytes": 61095802912,
42
+ "file_size_gb": 56.9
43
+ }
44
+ }
Benchmarks/DataCollection/Qwen3-30B-A3B-Thinking-2507-unsloth-iq4_nl-attn_kv_BF16-attn_output_BF16-attn_q_BF16-embeddings_BF16-ffn_down_BF16-ffn_up_gate_BF16-lm_head_BF16/llamabench.md ADDED
@@ -0,0 +1,11 @@
1
+ ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
2
+ ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
3
+ ggml_cuda_init: found 2 CUDA devices:
4
+ Device 0: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
5
+ Device 1: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
6
+ | model | size | params | backend | ngl | test | t/s |
7
+ | ------------------------------ | ---------: | ---------: | ---------- | --: | --------------: | -------------------: |
8
+ | qwen3moe 30B.A3B IQ4_NL - 4.5 bpw | 56.89 GiB | 30.53 B | CUDA | 30 | pp8 | 51.98 ± 2.02 |
9
+ | qwen3moe 30B.A3B IQ4_NL - 4.5 bpw | 56.89 GiB | 30.53 B | CUDA | 30 | tg128 | 16.62 ± 0.03 |
10
+
11
+ build: 92bb442ad (7040)
Benchmarks/DataCollection/Qwen3-30B-A3B-Thinking-2507-unsloth-iq4_nl-attn_kv_BF16-attn_output_BF16-attn_q_BF16-embeddings_BF16-ffn_down_BF16-ffn_up_gate_BF16-lm_head_BF16/perplexity_code.log ADDED
@@ -0,0 +1,176 @@
1
+ ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
2
+ ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
3
+ ggml_cuda_init: found 2 CUDA devices:
4
+ Device 0: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
5
+ Device 1: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
6
+ build: 7040 (92bb442ad) with cc (Ubuntu 13.3.0-6ubuntu2~24.04) 13.3.0 for x86_64-linux-gnu
7
+ llama_model_load_from_file_impl: using device CUDA0 (NVIDIA GeForce RTX 3090) (0000:01:00.0) - 20791 MiB free
8
+ llama_model_load_from_file_impl: using device CUDA1 (NVIDIA GeForce RTX 3090) (0000:03:00.0) - 23582 MiB free
9
+ llama_model_loader: loaded meta data with 39 key-value pairs and 579 tensors from /mnt/world8/AI/ToBench/Qwen3-30B-A3B-Thinking-2507-unsloth/Magic_Quant/GGUF/dc_round0_Qwen3-30B-A3B-Thinking-2507-unsloth-iq4_nl-attn_kv_BF16-attn_output_BF16-attn_q_BF16-embeddings_BF16-ffn_down_BF16-ffn_up_gate_BF16-lm_head_BF16.gguf (version GGUF V3 (latest))
10
+ llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
11
+ llama_model_loader: - kv 0: general.architecture str = qwen3moe
12
+ llama_model_loader: - kv 1: general.type str = model
13
+ llama_model_loader: - kv 2: general.name str = Qwen3 30B A3B Thinking 2507 Unsloth
14
+ llama_model_loader: - kv 3: general.version str = 2507
15
+ llama_model_loader: - kv 4: general.finetune str = Thinking-unsloth
16
+ llama_model_loader: - kv 5: general.basename str = Qwen3
17
+ llama_model_loader: - kv 6: general.size_label str = 30B-A3B
18
+ llama_model_loader: - kv 7: general.license str = apache-2.0
19
+ llama_model_loader: - kv 8: general.license.link str = https://huggingface.co/Qwen/Qwen3-30B...
20
+ llama_model_loader: - kv 9: general.base_model.count u32 = 1
21
+ llama_model_loader: - kv 10: general.base_model.0.name str = Qwen3 30B A3B Thinking 2507
22
+ llama_model_loader: - kv 11: general.base_model.0.version str = 2507
23
+ llama_model_loader: - kv 12: general.base_model.0.organization str = Qwen
24
+ llama_model_loader: - kv 13: general.base_model.0.repo_url str = https://huggingface.co/Qwen/Qwen3-30B...
25
+ llama_model_loader: - kv 14: general.tags arr[str,3] = ["qwen", "qwen3", "unsloth"]
26
+ llama_model_loader: - kv 15: qwen3moe.block_count u32 = 48
27
+ llama_model_loader: - kv 16: qwen3moe.context_length u32 = 262144
28
+ llama_model_loader: - kv 17: qwen3moe.embedding_length u32 = 2048
29
+ llama_model_loader: - kv 18: qwen3moe.feed_forward_length u32 = 6144
30
+ llama_model_loader: - kv 19: qwen3moe.attention.head_count u32 = 32
31
+ llama_model_loader: - kv 20: qwen3moe.attention.head_count_kv u32 = 4
32
+ llama_model_loader: - kv 21: qwen3moe.rope.freq_base f32 = 10000000.000000
33
+ llama_model_loader: - kv 22: qwen3moe.attention.layer_norm_rms_epsilon f32 = 0.000001
34
+ llama_model_loader: - kv 23: qwen3moe.expert_used_count u32 = 8
35
+ llama_model_loader: - kv 24: qwen3moe.attention.key_length u32 = 128
36
+ llama_model_loader: - kv 25: qwen3moe.attention.value_length u32 = 128
37
+ llama_model_loader: - kv 26: qwen3moe.expert_count u32 = 128
38
+ llama_model_loader: - kv 27: qwen3moe.expert_feed_forward_length u32 = 768
39
+ llama_model_loader: - kv 28: tokenizer.ggml.model str = gpt2
40
+ llama_model_loader: - kv 29: tokenizer.ggml.pre str = qwen2
41
+ llama_model_loader: - kv 30: tokenizer.ggml.tokens arr[str,151936] = ["!", "\"", "#", "$", "%", "&", "'", ...
42
+ llama_model_loader: - kv 31: tokenizer.ggml.token_type arr[i32,151936] = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
43
+ llama_model_loader: - kv 32: tokenizer.ggml.merges arr[str,151387] = ["Ġ Ġ", "ĠĠ ĠĠ", "i n", "Ġ t",...
44
+ llama_model_loader: - kv 33: tokenizer.ggml.eos_token_id u32 = 151645
45
+ llama_model_loader: - kv 34: tokenizer.ggml.padding_token_id u32 = 151654
46
+ llama_model_loader: - kv 35: tokenizer.ggml.add_bos_token bool = false
47
+ llama_model_loader: - kv 36: tokenizer.chat_template str = {%- if tools %}\n {{- '<|im_start|>...
48
+ llama_model_loader: - kv 37: general.quantization_version u32 = 2
49
+ llama_model_loader: - kv 38: general.file_type u32 = 25
50
+ llama_model_loader: - type f32: 241 tensors
51
+ llama_model_loader: - type bf16: 338 tensors
52
+ print_info: file format = GGUF V3 (latest)
53
+ print_info: file type = IQ4_NL - 4.5 bpw
54
+ print_info: file size = 56.89 GiB (16.01 BPW)
55
+ load: printing all EOG tokens:
56
+ load: - 151643 ('<|endoftext|>')
57
+ load: - 151645 ('<|im_end|>')
58
+ load: - 151662 ('<|fim_pad|>')
59
+ load: - 151663 ('<|repo_name|>')
60
+ load: - 151664 ('<|file_sep|>')
61
+ load: special tokens cache size = 26
62
+ load: token to piece cache size = 0.9311 MB
63
+ print_info: arch = qwen3moe
64
+ print_info: vocab_only = 0
65
+ print_info: n_ctx_train = 262144
66
+ print_info: n_embd = 2048
67
+ print_info: n_embd_inp = 2048
68
+ print_info: n_layer = 48
69
+ print_info: n_head = 32
70
+ print_info: n_head_kv = 4
71
+ print_info: n_rot = 128
72
+ print_info: n_swa = 0
73
+ print_info: is_swa_any = 0
74
+ print_info: n_embd_head_k = 128
75
+ print_info: n_embd_head_v = 128
76
+ print_info: n_gqa = 8
77
+ print_info: n_embd_k_gqa = 512
78
+ print_info: n_embd_v_gqa = 512
79
+ print_info: f_norm_eps = 0.0e+00
80
+ print_info: f_norm_rms_eps = 1.0e-06
81
+ print_info: f_clamp_kqv = 0.0e+00
82
+ print_info: f_max_alibi_bias = 0.0e+00
83
+ print_info: f_logit_scale = 0.0e+00
84
+ print_info: f_attn_scale = 0.0e+00
85
+ print_info: n_ff = 6144
86
+ print_info: n_expert = 128
87
+ print_info: n_expert_used = 8
88
+ print_info: n_expert_groups = 0
89
+ print_info: n_group_used = 0
90
+ print_info: causal attn = 1
91
+ print_info: pooling type = 0
92
+ print_info: rope type = 2
93
+ print_info: rope scaling = linear
94
+ print_info: freq_base_train = 10000000.0
95
+ print_info: freq_scale_train = 1
96
+ print_info: n_ctx_orig_yarn = 262144
97
+ print_info: rope_finetuned = unknown
98
+ print_info: model type = 30B.A3B
99
+ print_info: model params = 30.53 B
100
+ print_info: general.name = Qwen3 30B A3B Thinking 2507 Unsloth
101
+ print_info: n_ff_exp = 768
102
+ print_info: vocab type = BPE
103
+ print_info: n_vocab = 151936
104
+ print_info: n_merges = 151387
105
+ print_info: BOS token = 11 ','
106
+ print_info: EOS token = 151645 '<|im_end|>'
107
+ print_info: EOT token = 151645 '<|im_end|>'
108
+ print_info: PAD token = 151654 '<|vision_pad|>'
109
+ print_info: LF token = 198 'Ċ'
110
+ print_info: FIM PRE token = 151659 '<|fim_prefix|>'
111
+ print_info: FIM SUF token = 151661 '<|fim_suffix|>'
112
+ print_info: FIM MID token = 151660 '<|fim_middle|>'
113
+ print_info: FIM PAD token = 151662 '<|fim_pad|>'
114
+ print_info: FIM REP token = 151663 '<|repo_name|>'
115
+ print_info: FIM SEP token = 151664 '<|file_sep|>'
116
+ print_info: EOG token = 151643 '<|endoftext|>'
117
+ print_info: EOG token = 151645 '<|im_end|>'
118
+ print_info: EOG token = 151662 '<|fim_pad|>'
119
+ print_info: EOG token = 151663 '<|repo_name|>'
120
+ print_info: EOG token = 151664 '<|file_sep|>'
121
+ print_info: max token length = 256
122
+ load_tensors: loading model tensors, this can take a while... (mmap = true)
123
+ load_tensors: offloading 20 repeating layers to GPU
124
+ load_tensors: offloaded 20/49 layers to GPU
125
+ load_tensors: CPU_Mapped model buffer size = 34479.47 MiB
126
+ load_tensors: CUDA0 model buffer size = 11890.17 MiB
127
+ load_tensors: CUDA1 model buffer size = 11890.17 MiB
128
+ ....................................................................................................
129
+ llama_context: constructing llama_context
130
+ llama_context: n_seq_max = 1
131
+ llama_context: n_ctx = 2048
132
+ llama_context: n_ctx_seq = 2048
133
+ llama_context: n_batch = 2048
134
+ llama_context: n_ubatch = 512
135
+ llama_context: causal_attn = 1
136
+ llama_context: flash_attn = auto
137
+ llama_context: kv_unified = false
138
+ llama_context: freq_base = 10000000.0
139
+ llama_context: freq_scale = 1
140
+ llama_context: n_ctx_seq (2048) < n_ctx_train (262144) -- the full capacity of the model will not be utilized
141
+ llama_context: CPU output buffer size = 0.58 MiB
142
+ llama_kv_cache: CPU KV buffer size = 112.00 MiB
143
+ llama_kv_cache: CUDA0 KV buffer size = 40.00 MiB
144
+ llama_kv_cache: CUDA1 KV buffer size = 40.00 MiB
145
+ llama_kv_cache: size = 192.00 MiB ( 2048 cells, 48 layers, 1/1 seqs), K (f16): 96.00 MiB, V (f16): 96.00 MiB
146
+ llama_context: Flash Attention was auto, set to enabled
147
+ llama_context: CUDA0 compute buffer size = 894.25 MiB
148
+ llama_context: CUDA1 compute buffer size = 82.01 MiB
149
+ llama_context: CUDA_Host compute buffer size = 8.01 MiB
150
+ llama_context: graph nodes = 3031
151
+ llama_context: graph splits = 397 (with bs=512), 88 (with bs=1)
152
+ common_init_from_params: added <|endoftext|> logit bias = -inf
153
+ common_init_from_params: added <|im_end|> logit bias = -inf
154
+ common_init_from_params: added <|fim_pad|> logit bias = -inf
155
+ common_init_from_params: added <|repo_name|> logit bias = -inf
156
+ common_init_from_params: added <|file_sep|> logit bias = -inf
157
+ common_init_from_params: setting dry_penalty_last_n to ctx_size = 2048
158
+ common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable)
159
+
160
+ system_info: n_threads = 16 (n_threads_batch = 16) / 32 | CUDA : ARCHS = 860 | USE_GRAPHS = 1 | PEER_MAX_BATCH_SIZE = 128 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | BMI2 = 1 | AVX512 = 1 | AVX512_VBMI = 1 | AVX512_VNNI = 1 | AVX512_BF16 = 1 | LLAMAFILE = 1 | OPENMP = 1 | REPACK = 1 |
161
+ perplexity: tokenizing the input ..
162
+ perplexity: tokenization took 110.9 ms
163
+ perplexity: calculating perplexity over 44 chunks, n_ctx=2048, batch_size=2048, n_seq=1
164
+ perplexity: 9.77 seconds per pass - ETA 7.15 minutes
165
+ [1]1.4786,[2]1.4104,[3]1.2579,[4]1.2238,[5]1.3138,[6]1.3756,[7]1.3806,[8]1.3798,[9]1.3416,[10]1.3201,[11]1.3058,[12]1.3073,[13]1.2927,[14]1.2844,[15]1.2775,[16]1.2673,[17]1.2611,[18]1.2588,[19]1.2523,[20]1.2429,[21]1.2407,[22]1.2417,[23]1.2584,[24]1.2520,[25]1.2492,[26]1.2413,[27]1.2364,[28]1.2355,[29]1.2479,[30]1.2498,[31]1.2438,[32]1.2389,[33]1.2396,[34]1.2390,[35]1.2383,[36]1.2590,[37]1.2686,[38]1.2732,[39]1.2799,[40]1.2804,[41]1.2772,[42]1.2901,[43]1.2899,[44]1.2903,
166
+ Final estimate: PPL = 1.2903 +/- 0.00687
167
+
168
+ llama_perf_context_print: load time = 7194.73 ms
169
+ llama_perf_context_print: prompt eval time = 386399.67 ms / 90112 tokens ( 4.29 ms per token, 233.21 tokens per second)
170
+ llama_perf_context_print: eval time = 0.00 ms / 1 runs ( 0.00 ms per token, inf tokens per second)
171
+ llama_perf_context_print: total time = 387959.02 ms / 90113 tokens
172
+ llama_perf_context_print: graphs reused = 0
173
+ llama_memory_breakdown_print: | memory breakdown [MiB] | total free self model context compute unaccounted |
174
+ llama_memory_breakdown_print: | - CUDA0 (RTX 3090) | 24115 = 7760 + (12824 = 11890 + 40 + 894) + 3530 |
175
+ llama_memory_breakdown_print: | - CUDA1 (RTX 3090) | 24124 = 11454 + (12012 = 11890 + 40 + 82) + 657 |
176
+ llama_memory_breakdown_print: | - Host | 34599 = 34479 + 112 + 8 |
Benchmarks/DataCollection/Qwen3-30B-A3B-Thinking-2507-unsloth-iq4_nl-attn_kv_BF16-attn_output_BF16-attn_q_BF16-embeddings_BF16-ffn_down_BF16-ffn_up_gate_BF16-lm_head_BF16/perplexity_general.log ADDED
@@ -0,0 +1,176 @@
1
+ ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
2
+ ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
3
+ ggml_cuda_init: found 2 CUDA devices:
4
+ Device 0: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
5
+ Device 1: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
6
+ build: 7040 (92bb442ad) with cc (Ubuntu 13.3.0-6ubuntu2~24.04) 13.3.0 for x86_64-linux-gnu
7
+ llama_model_load_from_file_impl: using device CUDA0 (NVIDIA GeForce RTX 3090) (0000:01:00.0) - 20798 MiB free
8
+ llama_model_load_from_file_impl: using device CUDA1 (NVIDIA GeForce RTX 3090) (0000:03:00.0) - 23582 MiB free
9
+ llama_model_loader: loaded meta data with 39 key-value pairs and 579 tensors from /mnt/world8/AI/ToBench/Qwen3-30B-A3B-Thinking-2507-unsloth/Magic_Quant/GGUF/dc_round0_Qwen3-30B-A3B-Thinking-2507-unsloth-iq4_nl-attn_kv_BF16-attn_output_BF16-attn_q_BF16-embeddings_BF16-ffn_down_BF16-ffn_up_gate_BF16-lm_head_BF16.gguf (version GGUF V3 (latest))
10
+ llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
11
+ llama_model_loader: - kv 0: general.architecture str = qwen3moe
12
+ llama_model_loader: - kv 1: general.type str = model
13
+ llama_model_loader: - kv 2: general.name str = Qwen3 30B A3B Thinking 2507 Unsloth
14
+ llama_model_loader: - kv 3: general.version str = 2507
15
+ llama_model_loader: - kv 4: general.finetune str = Thinking-unsloth
16
+ llama_model_loader: - kv 5: general.basename str = Qwen3
17
+ llama_model_loader: - kv 6: general.size_label str = 30B-A3B
18
+ llama_model_loader: - kv 7: general.license str = apache-2.0
19
+ llama_model_loader: - kv 8: general.license.link str = https://huggingface.co/Qwen/Qwen3-30B...
20
+ llama_model_loader: - kv 9: general.base_model.count u32 = 1
21
+ llama_model_loader: - kv 10: general.base_model.0.name str = Qwen3 30B A3B Thinking 2507
22
+ llama_model_loader: - kv 11: general.base_model.0.version str = 2507
23
+ llama_model_loader: - kv 12: general.base_model.0.organization str = Qwen
24
+ llama_model_loader: - kv 13: general.base_model.0.repo_url str = https://huggingface.co/Qwen/Qwen3-30B...
25
+ llama_model_loader: - kv 14: general.tags arr[str,3] = ["qwen", "qwen3", "unsloth"]
26
+ llama_model_loader: - kv 15: qwen3moe.block_count u32 = 48
27
+ llama_model_loader: - kv 16: qwen3moe.context_length u32 = 262144
28
+ llama_model_loader: - kv 17: qwen3moe.embedding_length u32 = 2048
29
+ llama_model_loader: - kv 18: qwen3moe.feed_forward_length u32 = 6144
30
+ llama_model_loader: - kv 19: qwen3moe.attention.head_count u32 = 32
31
+ llama_model_loader: - kv 20: qwen3moe.attention.head_count_kv u32 = 4
32
+ llama_model_loader: - kv 21: qwen3moe.rope.freq_base f32 = 10000000.000000
33
+ llama_model_loader: - kv 22: qwen3moe.attention.layer_norm_rms_epsilon f32 = 0.000001
34
+ llama_model_loader: - kv 23: qwen3moe.expert_used_count u32 = 8
35
+ llama_model_loader: - kv 24: qwen3moe.attention.key_length u32 = 128
36
+ llama_model_loader: - kv 25: qwen3moe.attention.value_length u32 = 128
37
+ llama_model_loader: - kv 26: qwen3moe.expert_count u32 = 128
38
+ llama_model_loader: - kv 27: qwen3moe.expert_feed_forward_length u32 = 768
39
+ llama_model_loader: - kv 28: tokenizer.ggml.model str = gpt2
40
+ llama_model_loader: - kv 29: tokenizer.ggml.pre str = qwen2
41
+ llama_model_loader: - kv 30: tokenizer.ggml.tokens arr[str,151936] = ["!", "\"", "#", "$", "%", "&", "'", ...
42
+ llama_model_loader: - kv 31: tokenizer.ggml.token_type arr[i32,151936] = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
43
+ llama_model_loader: - kv 32: tokenizer.ggml.merges arr[str,151387] = ["Ġ Ġ", "ĠĠ ĠĠ", "i n", "Ġ t",...
44
+ llama_model_loader: - kv 33: tokenizer.ggml.eos_token_id u32 = 151645
45
+ llama_model_loader: - kv 34: tokenizer.ggml.padding_token_id u32 = 151654
46
+ llama_model_loader: - kv 35: tokenizer.ggml.add_bos_token bool = false
47
+ llama_model_loader: - kv 36: tokenizer.chat_template str = {%- if tools %}\n {{- '<|im_start|>...
48
+ llama_model_loader: - kv 37: general.quantization_version u32 = 2
49
+ llama_model_loader: - kv 38: general.file_type u32 = 25
50
+ llama_model_loader: - type f32: 241 tensors
51
+ llama_model_loader: - type bf16: 338 tensors
52
+ print_info: file format = GGUF V3 (latest)
53
+ print_info: file type = IQ4_NL - 4.5 bpw
54
+ print_info: file size = 56.89 GiB (16.01 BPW)
55
+ load: printing all EOG tokens:
56
+ load: - 151643 ('<|endoftext|>')
57
+ load: - 151645 ('<|im_end|>')
58
+ load: - 151662 ('<|fim_pad|>')
59
+ load: - 151663 ('<|repo_name|>')
60
+ load: - 151664 ('<|file_sep|>')
61
+ load: special tokens cache size = 26
62
+ load: token to piece cache size = 0.9311 MB
63
+ print_info: arch = qwen3moe
64
+ print_info: vocab_only = 0
65
+ print_info: n_ctx_train = 262144
66
+ print_info: n_embd = 2048
67
+ print_info: n_embd_inp = 2048
68
+ print_info: n_layer = 48
69
+ print_info: n_head = 32
70
+ print_info: n_head_kv = 4
71
+ print_info: n_rot = 128
72
+ print_info: n_swa = 0
73
+ print_info: is_swa_any = 0
74
+ print_info: n_embd_head_k = 128
75
+ print_info: n_embd_head_v = 128
76
+ print_info: n_gqa = 8
77
+ print_info: n_embd_k_gqa = 512
78
+ print_info: n_embd_v_gqa = 512
79
+ print_info: f_norm_eps = 0.0e+00
80
+ print_info: f_norm_rms_eps = 1.0e-06
81
+ print_info: f_clamp_kqv = 0.0e+00
82
+ print_info: f_max_alibi_bias = 0.0e+00
83
+ print_info: f_logit_scale = 0.0e+00
84
+ print_info: f_attn_scale = 0.0e+00
85
+ print_info: n_ff = 6144
86
+ print_info: n_expert = 128
87
+ print_info: n_expert_used = 8
88
+ print_info: n_expert_groups = 0
89
+ print_info: n_group_used = 0
90
+ print_info: causal attn = 1
91
+ print_info: pooling type = 0
92
+ print_info: rope type = 2
93
+ print_info: rope scaling = linear
94
+ print_info: freq_base_train = 10000000.0
95
+ print_info: freq_scale_train = 1
96
+ print_info: n_ctx_orig_yarn = 262144
97
+ print_info: rope_finetuned = unknown
98
+ print_info: model type = 30B.A3B
99
+ print_info: model params = 30.53 B
100
+ print_info: general.name = Qwen3 30B A3B Thinking 2507 Unsloth
101
+ print_info: n_ff_exp = 768
102
+ print_info: vocab type = BPE
103
+ print_info: n_vocab = 151936
104
+ print_info: n_merges = 151387
105
+ print_info: BOS token = 11 ','
106
+ print_info: EOS token = 151645 '<|im_end|>'
107
+ print_info: EOT token = 151645 '<|im_end|>'
108
+ print_info: PAD token = 151654 '<|vision_pad|>'
109
+ print_info: LF token = 198 'Ċ'
110
+ print_info: FIM PRE token = 151659 '<|fim_prefix|>'
111
+ print_info: FIM SUF token = 151661 '<|fim_suffix|>'
112
+ print_info: FIM MID token = 151660 '<|fim_middle|>'
113
+ print_info: FIM PAD token = 151662 '<|fim_pad|>'
114
+ print_info: FIM REP token = 151663 '<|repo_name|>'
115
+ print_info: FIM SEP token = 151664 '<|file_sep|>'
116
+ print_info: EOG token = 151643 '<|endoftext|>'
117
+ print_info: EOG token = 151645 '<|im_end|>'
118
+ print_info: EOG token = 151662 '<|fim_pad|>'
119
+ print_info: EOG token = 151663 '<|repo_name|>'
120
+ print_info: EOG token = 151664 '<|file_sep|>'
121
+ print_info: max token length = 256
122
+ load_tensors: loading model tensors, this can take a while... (mmap = true)
123
+ load_tensors: offloading 20 repeating layers to GPU
124
+ load_tensors: offloaded 20/49 layers to GPU
125
+ load_tensors: CPU_Mapped model buffer size = 34479.47 MiB
126
+ load_tensors: CUDA0 model buffer size = 11890.17 MiB
127
+ load_tensors: CUDA1 model buffer size = 11890.17 MiB
128
+ ....................................................................................................
129
+ llama_context: constructing llama_context
130
+ llama_context: n_seq_max = 1
131
+ llama_context: n_ctx = 2048
132
+ llama_context: n_ctx_seq = 2048
133
+ llama_context: n_batch = 2048
134
+ llama_context: n_ubatch = 512
135
+ llama_context: causal_attn = 1
136
+ llama_context: flash_attn = auto
137
+ llama_context: kv_unified = false
138
+ llama_context: freq_base = 10000000.0
139
+ llama_context: freq_scale = 1
140
+ llama_context: n_ctx_seq (2048) < n_ctx_train (262144) -- the full capacity of the model will not be utilized
141
+ llama_context: CPU output buffer size = 0.58 MiB
142
+ llama_kv_cache: CPU KV buffer size = 112.00 MiB
143
+ llama_kv_cache: CUDA0 KV buffer size = 40.00 MiB
144
+ llama_kv_cache: CUDA1 KV buffer size = 40.00 MiB
145
+ llama_kv_cache: size = 192.00 MiB ( 2048 cells, 48 layers, 1/1 seqs), K (f16): 96.00 MiB, V (f16): 96.00 MiB
146
+ llama_context: Flash Attention was auto, set to enabled
147
+ llama_context: CUDA0 compute buffer size = 894.25 MiB
148
+ llama_context: CUDA1 compute buffer size = 82.01 MiB
149
+ llama_context: CUDA_Host compute buffer size = 8.01 MiB
150
+ llama_context: graph nodes = 3031
151
+ llama_context: graph splits = 397 (with bs=512), 88 (with bs=1)
152
+ common_init_from_params: added <|endoftext|> logit bias = -inf
153
+ common_init_from_params: added <|im_end|> logit bias = -inf
154
+ common_init_from_params: added <|fim_pad|> logit bias = -inf
155
+ common_init_from_params: added <|repo_name|> logit bias = -inf
156
+ common_init_from_params: added <|file_sep|> logit bias = -inf
157
+ common_init_from_params: setting dry_penalty_last_n to ctx_size = 2048
158
+ common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable)
159
+
160
+ system_info: n_threads = 16 (n_threads_batch = 16) / 32 | CUDA : ARCHS = 860 | USE_GRAPHS = 1 | PEER_MAX_BATCH_SIZE = 128 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | BMI2 = 1 | AVX512 = 1 | AVX512_VBMI = 1 | AVX512_VNNI = 1 | AVX512_BF16 = 1 | LLAMAFILE = 1 | OPENMP = 1 | REPACK = 1 |
161
+ perplexity: tokenizing the input ..
162
+ perplexity: tokenization took 47.691 ms
163
+ perplexity: calculating perplexity over 15 chunks, n_ctx=2048, batch_size=2048, n_seq=1
164
+ perplexity: 9.62 seconds per pass - ETA 2.40 minutes
165
+ [1]5.3521,[2]6.3539,[3]6.7273,[4]6.6668,[5]6.5895,[6]5.6932,[7]5.2097,[8]5.2274,[9]5.5344,[10]5.6618,[11]5.7134,[12]6.0129,[13]6.0809,[14]6.2123,[15]6.2878,
166
+ Final estimate: PPL = 6.2878 +/- 0.12850
167
+
168
+ llama_perf_context_print: load time = 7182.82 ms
169
+ llama_perf_context_print: prompt eval time = 139724.78 ms / 30720 tokens ( 4.55 ms per token, 219.86 tokens per second)
170
+ llama_perf_context_print: eval time = 0.00 ms / 1 runs ( 0.00 ms per token, inf tokens per second)
171
+ llama_perf_context_print: total time = 140144.23 ms / 30721 tokens
172
+ llama_perf_context_print: graphs reused = 0
173
+ llama_memory_breakdown_print: | memory breakdown [MiB] | total free self model context compute unaccounted |
174
+ llama_memory_breakdown_print: | - CUDA0 (RTX 3090) | 24115 = 7763 + (12824 = 11890 + 40 + 894) + 3527 |
175
+ llama_memory_breakdown_print: | - CUDA1 (RTX 3090) | 24124 = 11454 + (12012 = 11890 + 40 + 82) + 657 |
176
+ llama_memory_breakdown_print: | - Host | 34599 = 34479 + 112 + 8 |
Benchmarks/DataCollection/Qwen3-30B-A3B-Thinking-2507-unsloth-iq4_nl-attn_kv_BF16-attn_output_BF16-attn_q_BF16-embeddings_BF16-ffn_down_BF16-ffn_up_gate_BF16-lm_head_BF16/perplexity_math.log ADDED
@@ -0,0 +1,176 @@
1
+ ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
2
+ ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
3
+ ggml_cuda_init: found 2 CUDA devices:
4
+ Device 0: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
5
+ Device 1: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
6
+ build: 7040 (92bb442ad) with cc (Ubuntu 13.3.0-6ubuntu2~24.04) 13.3.0 for x86_64-linux-gnu
7
+ llama_model_load_from_file_impl: using device CUDA0 (NVIDIA GeForce RTX 3090) (0000:01:00.0) - 20787 MiB free
8
+ llama_model_load_from_file_impl: using device CUDA1 (NVIDIA GeForce RTX 3090) (0000:03:00.0) - 23582 MiB free
9
+ llama_model_loader: loaded meta data with 39 key-value pairs and 579 tensors from /mnt/world8/AI/ToBench/Qwen3-30B-A3B-Thinking-2507-unsloth/Magic_Quant/GGUF/dc_round0_Qwen3-30B-A3B-Thinking-2507-unsloth-iq4_nl-attn_kv_BF16-attn_output_BF16-attn_q_BF16-embeddings_BF16-ffn_down_BF16-ffn_up_gate_BF16-lm_head_BF16.gguf (version GGUF V3 (latest))
10
+ llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
11
+ llama_model_loader: - kv 0: general.architecture str = qwen3moe
12
+ llama_model_loader: - kv 1: general.type str = model
13
+ llama_model_loader: - kv 2: general.name str = Qwen3 30B A3B Thinking 2507 Unsloth
14
+ llama_model_loader: - kv 3: general.version str = 2507
15
+ llama_model_loader: - kv 4: general.finetune str = Thinking-unsloth
16
+ llama_model_loader: - kv 5: general.basename str = Qwen3
17
+ llama_model_loader: - kv 6: general.size_label str = 30B-A3B
18
+ llama_model_loader: - kv 7: general.license str = apache-2.0
19
+ llama_model_loader: - kv 8: general.license.link str = https://huggingface.co/Qwen/Qwen3-30B...
20
+ llama_model_loader: - kv 9: general.base_model.count u32 = 1
21
+ llama_model_loader: - kv 10: general.base_model.0.name str = Qwen3 30B A3B Thinking 2507
22
+ llama_model_loader: - kv 11: general.base_model.0.version str = 2507
23
+ llama_model_loader: - kv 12: general.base_model.0.organization str = Qwen
24
+ llama_model_loader: - kv 13: general.base_model.0.repo_url str = https://huggingface.co/Qwen/Qwen3-30B...
25
+ llama_model_loader: - kv 14: general.tags arr[str,3] = ["qwen", "qwen3", "unsloth"]
26
+ llama_model_loader: - kv 15: qwen3moe.block_count u32 = 48
27
+ llama_model_loader: - kv 16: qwen3moe.context_length u32 = 262144
28
+ llama_model_loader: - kv 17: qwen3moe.embedding_length u32 = 2048
29
+ llama_model_loader: - kv 18: qwen3moe.feed_forward_length u32 = 6144
30
+ llama_model_loader: - kv 19: qwen3moe.attention.head_count u32 = 32
31
+ llama_model_loader: - kv 20: qwen3moe.attention.head_count_kv u32 = 4
32
+ llama_model_loader: - kv 21: qwen3moe.rope.freq_base f32 = 10000000.000000
33
+ llama_model_loader: - kv 22: qwen3moe.attention.layer_norm_rms_epsilon f32 = 0.000001
34
+ llama_model_loader: - kv 23: qwen3moe.expert_used_count u32 = 8
35
+ llama_model_loader: - kv 24: qwen3moe.attention.key_length u32 = 128
36
+ llama_model_loader: - kv 25: qwen3moe.attention.value_length u32 = 128
37
+ llama_model_loader: - kv 26: qwen3moe.expert_count u32 = 128
38
+ llama_model_loader: - kv 27: qwen3moe.expert_feed_forward_length u32 = 768
39
+ llama_model_loader: - kv 28: tokenizer.ggml.model str = gpt2
40
+ llama_model_loader: - kv 29: tokenizer.ggml.pre str = qwen2
41
+ llama_model_loader: - kv 30: tokenizer.ggml.tokens arr[str,151936] = ["!", "\"", "#", "$", "%", "&", "'", ...
42
+ llama_model_loader: - kv 31: tokenizer.ggml.token_type arr[i32,151936] = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
43
+ llama_model_loader: - kv 32: tokenizer.ggml.merges arr[str,151387] = ["Ġ Ġ", "ĠĠ ĠĠ", "i n", "Ġ t",...
44
+ llama_model_loader: - kv 33: tokenizer.ggml.eos_token_id u32 = 151645
45
+ llama_model_loader: - kv 34: tokenizer.ggml.padding_token_id u32 = 151654
46
+ llama_model_loader: - kv 35: tokenizer.ggml.add_bos_token bool = false
47
+ llama_model_loader: - kv 36: tokenizer.chat_template str = {%- if tools %}\n {{- '<|im_start|>...
48
+ llama_model_loader: - kv 37: general.quantization_version u32 = 2
49
+ llama_model_loader: - kv 38: general.file_type u32 = 25
50
+ llama_model_loader: - type f32: 241 tensors
51
+ llama_model_loader: - type bf16: 338 tensors
52
+ print_info: file format = GGUF V3 (latest)
53
+ print_info: file type = IQ4_NL - 4.5 bpw
54
+ print_info: file size = 56.89 GiB (16.01 BPW)
55
+ load: printing all EOG tokens:
56
+ load: - 151643 ('<|endoftext|>')
57
+ load: - 151645 ('<|im_end|>')
58
+ load: - 151662 ('<|fim_pad|>')
59
+ load: - 151663 ('<|repo_name|>')
60
+ load: - 151664 ('<|file_sep|>')
61
+ load: special tokens cache size = 26
62
+ load: token to piece cache size = 0.9311 MB
63
+ print_info: arch = qwen3moe
64
+ print_info: vocab_only = 0
65
+ print_info: n_ctx_train = 262144
66
+ print_info: n_embd = 2048
67
+ print_info: n_embd_inp = 2048
68
+ print_info: n_layer = 48
69
+ print_info: n_head = 32
70
+ print_info: n_head_kv = 4
71
+ print_info: n_rot = 128
72
+ print_info: n_swa = 0
73
+ print_info: is_swa_any = 0
74
+ print_info: n_embd_head_k = 128
75
+ print_info: n_embd_head_v = 128
76
+ print_info: n_gqa = 8
77
+ print_info: n_embd_k_gqa = 512
78
+ print_info: n_embd_v_gqa = 512
79
+ print_info: f_norm_eps = 0.0e+00
80
+ print_info: f_norm_rms_eps = 1.0e-06
81
+ print_info: f_clamp_kqv = 0.0e+00
82
+ print_info: f_max_alibi_bias = 0.0e+00
83
+ print_info: f_logit_scale = 0.0e+00
84
+ print_info: f_attn_scale = 0.0e+00
85
+ print_info: n_ff = 6144
86
+ print_info: n_expert = 128
87
+ print_info: n_expert_used = 8
88
+ print_info: n_expert_groups = 0
89
+ print_info: n_group_used = 0
90
+ print_info: causal attn = 1
91
+ print_info: pooling type = 0
92
+ print_info: rope type = 2
93
+ print_info: rope scaling = linear
94
+ print_info: freq_base_train = 10000000.0
95
+ print_info: freq_scale_train = 1
96
+ print_info: n_ctx_orig_yarn = 262144
97
+ print_info: rope_finetuned = unknown
98
+ print_info: model type = 30B.A3B
99
+ print_info: model params = 30.53 B
100
+ print_info: general.name = Qwen3 30B A3B Thinking 2507 Unsloth
101
+ print_info: n_ff_exp = 768
102
+ print_info: vocab type = BPE
103
+ print_info: n_vocab = 151936
104
+ print_info: n_merges = 151387
105
+ print_info: BOS token = 11 ','
106
+ print_info: EOS token = 151645 '<|im_end|>'
107
+ print_info: EOT token = 151645 '<|im_end|>'
108
+ print_info: PAD token = 151654 '<|vision_pad|>'
109
+ print_info: LF token = 198 'Ċ'
110
+ print_info: FIM PRE token = 151659 '<|fim_prefix|>'
111
+ print_info: FIM SUF token = 151661 '<|fim_suffix|>'
112
+ print_info: FIM MID token = 151660 '<|fim_middle|>'
113
+ print_info: FIM PAD token = 151662 '<|fim_pad|>'
114
+ print_info: FIM REP token = 151663 '<|repo_name|>'
115
+ print_info: FIM SEP token = 151664 '<|file_sep|>'
116
+ print_info: EOG token = 151643 '<|endoftext|>'
117
+ print_info: EOG token = 151645 '<|im_end|>'
118
+ print_info: EOG token = 151662 '<|fim_pad|>'
119
+ print_info: EOG token = 151663 '<|repo_name|>'
120
+ print_info: EOG token = 151664 '<|file_sep|>'
121
+ print_info: max token length = 256
122
+ load_tensors: loading model tensors, this can take a while... (mmap = true)
123
+ load_tensors: offloading 20 repeating layers to GPU
124
+ load_tensors: offloaded 20/49 layers to GPU
125
+ load_tensors: CPU_Mapped model buffer size = 34479.47 MiB
126
+ load_tensors: CUDA0 model buffer size = 11890.17 MiB
127
+ load_tensors: CUDA1 model buffer size = 11890.17 MiB
128
+ ....................................................................................................
129
+ llama_context: constructing llama_context
130
+ llama_context: n_seq_max = 1
131
+ llama_context: n_ctx = 2048
132
+ llama_context: n_ctx_seq = 2048
133
+ llama_context: n_batch = 2048
134
+ llama_context: n_ubatch = 512
135
+ llama_context: causal_attn = 1
136
+ llama_context: flash_attn = auto
137
+ llama_context: kv_unified = false
138
+ llama_context: freq_base = 10000000.0
139
+ llama_context: freq_scale = 1
140
+ llama_context: n_ctx_seq (2048) < n_ctx_train (262144) -- the full capacity of the model will not be utilized
141
+ llama_context: CPU output buffer size = 0.58 MiB
142
+ llama_kv_cache: CPU KV buffer size = 112.00 MiB
143
+ llama_kv_cache: CUDA0 KV buffer size = 40.00 MiB
144
+ llama_kv_cache: CUDA1 KV buffer size = 40.00 MiB
145
+ llama_kv_cache: size = 192.00 MiB ( 2048 cells, 48 layers, 1/1 seqs), K (f16): 96.00 MiB, V (f16): 96.00 MiB
146
+ llama_context: Flash Attention was auto, set to enabled
147
+ llama_context: CUDA0 compute buffer size = 894.25 MiB
148
+ llama_context: CUDA1 compute buffer size = 82.01 MiB
149
+ llama_context: CUDA_Host compute buffer size = 8.01 MiB
150
+ llama_context: graph nodes = 3031
151
+ llama_context: graph splits = 397 (with bs=512), 88 (with bs=1)
152
+ common_init_from_params: added <|endoftext|> logit bias = -inf
153
+ common_init_from_params: added <|im_end|> logit bias = -inf
154
+ common_init_from_params: added <|fim_pad|> logit bias = -inf
155
+ common_init_from_params: added <|repo_name|> logit bias = -inf
156
+ common_init_from_params: added <|file_sep|> logit bias = -inf
157
+ common_init_from_params: setting dry_penalty_last_n to ctx_size = 2048
158
+ common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable)
159
+
160
+ system_info: n_threads = 16 (n_threads_batch = 16) / 32 | CUDA : ARCHS = 860 | USE_GRAPHS = 1 | PEER_MAX_BATCH_SIZE = 128 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | BMI2 = 1 | AVX512 = 1 | AVX512_VBMI = 1 | AVX512_VNNI = 1 | AVX512_BF16 = 1 | LLAMAFILE = 1 | OPENMP = 1 | REPACK = 1 |
161
+ perplexity: tokenizing the input ..
162
+ perplexity: tokenization took 46.364 ms
163
+ perplexity: calculating perplexity over 16 chunks, n_ctx=2048, batch_size=2048, n_seq=1
164
+ perplexity: 9.95 seconds per pass - ETA 2.65 minutes
165
+ [1]4.6325,[2]5.0040,[3]5.2627,[4]5.4294,[5]5.6199,[6]5.6158,[7]5.6044,[8]5.5563,[9]5.6047,[10]5.5880,[11]5.5970,[12]5.5936,[13]5.6653,[14]5.6749,[15]5.6701,[16]5.6808,
166
+ Final estimate: PPL = 5.6808 +/- 0.10471
167
+
168
+ llama_perf_context_print: load time = 7238.09 ms
169
+ llama_perf_context_print: prompt eval time = 153467.89 ms / 32768 tokens ( 4.68 ms per token, 213.52 tokens per second)
170
+ llama_perf_context_print: eval time = 0.00 ms / 1 runs ( 0.00 ms per token, inf tokens per second)
171
+ llama_perf_context_print: total time = 154260.57 ms / 32769 tokens
172
+ llama_perf_context_print: graphs reused = 0
173
+ llama_memory_breakdown_print: | memory breakdown [MiB] | total free self model context compute unaccounted |
174
+ llama_memory_breakdown_print: | - CUDA0 (RTX 3090) | 24115 = 7767 + (12824 = 11890 + 40 + 894) + 3523 |
175
+ llama_memory_breakdown_print: | - CUDA1 (RTX 3090) | 24124 = 11454 + (12012 = 11890 + 40 + 82) + 657 |
176
+ llama_memory_breakdown_print: | - Host | 34599 = 34479 + 112 + 8 |
Benchmarks/DataCollection/Qwen3-30B-A3B-Thinking-2507-unsloth-iq4_nl-attn_kv_IQ4_NL-attn_output_IQ4_NL-attn_q_IQ4_NL-embeddings_IQ4_NL-ffn_down_IQ4_NL-ffn_up_gate_IQ4_NL-lm_head_IQ4_NL/bench_metrics.json ADDED
@@ -0,0 +1,44 @@
1
+ {
2
+ "raw_metrics": {
3
+ "llamabench": {
4
+ "backend": "CUDA",
5
+ "log_path": "Qwen3-30B-A3B-Thinking-2507-unsloth-iq4_nl-attn_kv_IQ4_NL-attn_output_IQ4_NL-attn_q_IQ4_NL-embeddings_IQ4_NL-ffn_down_IQ4_NL-ffn_up_gate_IQ4_NL-lm_head_IQ4_NL/llamabench.md",
6
+ "ngl": "35",
7
+ "raw_row": {
8
+ "backend": "CUDA",
9
+ "model": "qwen3moe 30B.A3B IQ4_NL - 4.5 bpw",
10
+ "ngl": "35",
11
+ "params": "30.53 B",
12
+ "size": "16.04 GiB",
13
+ "t/s": "143.64 \u00b1 9.45",
14
+ "test": "pp8",
15
+ "tps_value": 143.64
16
+ },
17
+ "test": "pp8",
18
+ "tps": 143.64
19
+ },
20
+ "perplexity": {
21
+ "code": {
22
+ "log_path": "Qwen3-30B-A3B-Thinking-2507-unsloth-iq4_nl-attn_kv_IQ4_NL-attn_output_IQ4_NL-attn_q_IQ4_NL-embeddings_IQ4_NL-ffn_down_IQ4_NL-ffn_up_gate_IQ4_NL-lm_head_IQ4_NL/perplexity_code.log",
23
+ "ppl": 1.3098,
24
+ "ppl_error": 0.00712
25
+ },
26
+ "general": {
27
+ "log_path": "Qwen3-30B-A3B-Thinking-2507-unsloth-iq4_nl-attn_kv_IQ4_NL-attn_output_IQ4_NL-attn_q_IQ4_NL-embeddings_IQ4_NL-ffn_down_IQ4_NL-ffn_up_gate_IQ4_NL-lm_head_IQ4_NL/perplexity_general.log",
28
+ "ppl": 6.5252,
29
+ "ppl_error": 0.13424
30
+ },
31
+ "math": {
32
+ "log_path": "Qwen3-30B-A3B-Thinking-2507-unsloth-iq4_nl-attn_kv_IQ4_NL-attn_output_IQ4_NL-attn_q_IQ4_NL-embeddings_IQ4_NL-ffn_down_IQ4_NL-ffn_up_gate_IQ4_NL-lm_head_IQ4_NL/perplexity_math.log",
33
+ "ppl": 5.8597,
34
+ "ppl_error": 0.10817
35
+ }
36
+ }
37
+ },
38
+ "summary": {
39
+ "avg_prec_loss_pct": 2.812,
40
+ "bench_tps": 143.64,
41
+ "file_size_bytes": 17224267808,
42
+ "file_size_gb": 16.04
43
+ }
44
+ }
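The `avg_prec_loss_pct` summary value of 2.812 for this all-IQ4_NL quant is consistent with averaging its relative perplexity increase over the all-BF16 reference above (code 1.2903, general 6.2878, math 5.6808) across the three domains. The sketch below reproduces that figure; it is an assumption about how the summary is computed, not a confirmed excerpt of the benchmark script.

```python
# Assumed reconstruction of avg_prec_loss_pct: mean relative PPL increase (in %)
# of a candidate quant versus the BF16 reference, over the three perplexity domains.
BASELINE = {"code": 1.2903, "general": 6.2878, "math": 5.6808}   # all-BF16 run above
CANDIDATE = {"code": 1.3098, "general": 6.5252, "math": 5.8597}  # all-IQ4_NL run above

def avg_prec_loss_pct(candidate: dict, baseline: dict) -> float:
    losses = [
        (candidate[d] - baseline[d]) / baseline[d] * 100.0  # % PPL increase per domain
        for d in baseline
    ]
    return round(sum(losses) / len(losses), 3)

print(avg_prec_loss_pct(CANDIDATE, BASELINE))  # -> 2.812, matching the JSON summary
```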
Benchmarks/DataCollection/Qwen3-30B-A3B-Thinking-2507-unsloth-iq4_nl-attn_kv_IQ4_NL-attn_output_IQ4_NL-attn_q_IQ4_NL-embeddings_IQ4_NL-ffn_down_IQ4_NL-ffn_up_gate_IQ4_NL-lm_head_IQ4_NL/llamabench.md ADDED
@@ -0,0 +1,11 @@
1
+ ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
2
+ ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
3
+ ggml_cuda_init: found 2 CUDA devices:
4
+ Device 0: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
5
+ Device 1: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
6
+ | model | size | params | backend | ngl | test | t/s |
7
+ | ------------------------------ | ---------: | ---------: | ---------- | --: | --------------: | -------------------: |
8
+ | qwen3moe 30B.A3B IQ4_NL - 4.5 bpw | 16.04 GiB | 30.53 B | CUDA | 35 | pp8 | 143.64 ± 9.45 |
9
+ | qwen3moe 30B.A3B IQ4_NL - 4.5 bpw | 16.04 GiB | 30.53 B | CUDA | 35 | tg128 | 50.81 ± 0.75 |
10
+
11
+ build: 92bb442ad (7040)
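bench_metrics.json records its `tps` value from the `pp8` row of this llama-bench table. A minimal sketch of reading that value back out of llamabench.md follows; the parser name and column handling are assumptions, not repo code.

```python
# Assumed helper (not from this repo): return the pp8 throughput from a
# llama-bench markdown table like the one above.
def pp8_tps(md_path: str) -> float:
    with open(md_path, encoding="utf-8") as f:
        for line in f:
            cells = [c.strip() for c in line.split("|") if c.strip()]
            # columns: model | size | params | backend | ngl | test | t/s
            if len(cells) >= 7 and cells[5] == "pp8":
                return float(cells[6].split("±")[0])  # "143.64 ± 9.45" -> 143.64
    raise ValueError(f"no pp8 row found in {md_path}")

# e.g. pp8_tps("llamabench.md") -> 143.64 for the table above
```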
Benchmarks/DataCollection/Qwen3-30B-A3B-Thinking-2507-unsloth-iq4_nl-attn_kv_IQ4_NL-attn_output_IQ4_NL-attn_q_IQ4_NL-embeddings_IQ4_NL-ffn_down_IQ4_NL-ffn_up_gate_IQ4_NL-lm_head_IQ4_NL/perplexity_code.log ADDED
@@ -0,0 +1,176 @@
1
+ ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
2
+ ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
3
+ ggml_cuda_init: found 2 CUDA devices:
4
+ Device 0: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
5
+ Device 1: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
6
+ build: 7040 (92bb442ad) with cc (Ubuntu 13.3.0-6ubuntu2~24.04) 13.3.0 for x86_64-linux-gnu
7
+ llama_model_load_from_file_impl: using device CUDA0 (NVIDIA GeForce RTX 3090) (0000:01:00.0) - 20789 MiB free
8
+ llama_model_load_from_file_impl: using device CUDA1 (NVIDIA GeForce RTX 3090) (0000:03:00.0) - 23582 MiB free
9
+ llama_model_loader: loaded meta data with 39 key-value pairs and 579 tensors from /mnt/world8/AI/ToBench/Qwen3-30B-A3B-Thinking-2507-unsloth/Magic_Quant/GGUF/dc_round0_Qwen3-30B-A3B-Thinking-2507-unsloth-iq4_nl-attn_kv_IQ4_NL-attn_output_IQ4_NL-attn_q_IQ4_NL-embeddings_IQ4_NL-ffn_down_IQ4_NL-ffn_up_gate_IQ4_NL-lm_head_IQ4_NL.gguf (version GGUF V3 (latest))
10
+ llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
11
+ llama_model_loader: - kv 0: general.architecture str = qwen3moe
12
+ llama_model_loader: - kv 1: general.type str = model
13
+ llama_model_loader: - kv 2: general.name str = Qwen3 30B A3B Thinking 2507 Unsloth
14
+ llama_model_loader: - kv 3: general.version str = 2507
15
+ llama_model_loader: - kv 4: general.finetune str = Thinking-unsloth
16
+ llama_model_loader: - kv 5: general.basename str = Qwen3
17
+ llama_model_loader: - kv 6: general.size_label str = 30B-A3B
18
+ llama_model_loader: - kv 7: general.license str = apache-2.0
19
+ llama_model_loader: - kv 8: general.license.link str = https://huggingface.co/Qwen/Qwen3-30B...
20
+ llama_model_loader: - kv 9: general.base_model.count u32 = 1
21
+ llama_model_loader: - kv 10: general.base_model.0.name str = Qwen3 30B A3B Thinking 2507
22
+ llama_model_loader: - kv 11: general.base_model.0.version str = 2507
23
+ llama_model_loader: - kv 12: general.base_model.0.organization str = Qwen
24
+ llama_model_loader: - kv 13: general.base_model.0.repo_url str = https://huggingface.co/Qwen/Qwen3-30B...
25
+ llama_model_loader: - kv 14: general.tags arr[str,3] = ["qwen", "qwen3", "unsloth"]
26
+ llama_model_loader: - kv 15: qwen3moe.block_count u32 = 48
27
+ llama_model_loader: - kv 16: qwen3moe.context_length u32 = 262144
28
+ llama_model_loader: - kv 17: qwen3moe.embedding_length u32 = 2048
29
+ llama_model_loader: - kv 18: qwen3moe.feed_forward_length u32 = 6144
30
+ llama_model_loader: - kv 19: qwen3moe.attention.head_count u32 = 32
31
+ llama_model_loader: - kv 20: qwen3moe.attention.head_count_kv u32 = 4
32
+ llama_model_loader: - kv 21: qwen3moe.rope.freq_base f32 = 10000000.000000
33
+ llama_model_loader: - kv 22: qwen3moe.attention.layer_norm_rms_epsilon f32 = 0.000001
34
+ llama_model_loader: - kv 23: qwen3moe.expert_used_count u32 = 8
35
+ llama_model_loader: - kv 24: qwen3moe.attention.key_length u32 = 128
36
+ llama_model_loader: - kv 25: qwen3moe.attention.value_length u32 = 128
37
+ llama_model_loader: - kv 26: qwen3moe.expert_count u32 = 128
38
+ llama_model_loader: - kv 27: qwen3moe.expert_feed_forward_length u32 = 768
39
+ llama_model_loader: - kv 28: tokenizer.ggml.model str = gpt2
40
+ llama_model_loader: - kv 29: tokenizer.ggml.pre str = qwen2
41
+ llama_model_loader: - kv 30: tokenizer.ggml.tokens arr[str,151936] = ["!", "\"", "#", "$", "%", "&", "'", ...
42
+ llama_model_loader: - kv 31: tokenizer.ggml.token_type arr[i32,151936] = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
43
+ llama_model_loader: - kv 32: tokenizer.ggml.merges arr[str,151387] = ["Ġ Ġ", "ĠĠ ĠĠ", "i n", "Ġ t",...
44
+ llama_model_loader: - kv 33: tokenizer.ggml.eos_token_id u32 = 151645
45
+ llama_model_loader: - kv 34: tokenizer.ggml.padding_token_id u32 = 151654
46
+ llama_model_loader: - kv 35: tokenizer.ggml.add_bos_token bool = false
47
+ llama_model_loader: - kv 36: tokenizer.chat_template str = {%- if tools %}\n {{- '<|im_start|>...
48
+ llama_model_loader: - kv 37: general.quantization_version u32 = 2
49
+ llama_model_loader: - kv 38: general.file_type u32 = 25
50
+ llama_model_loader: - type f32: 241 tensors
51
+ llama_model_loader: - type iq4_nl: 338 tensors
52
+ print_info: file format = GGUF V3 (latest)
53
+ print_info: file type = IQ4_NL - 4.5 bpw
54
+ print_info: file size = 16.04 GiB (4.51 BPW)
55
+ load: printing all EOG tokens:
56
+ load: - 151643 ('<|endoftext|>')
57
+ load: - 151645 ('<|im_end|>')
58
+ load: - 151662 ('<|fim_pad|>')
59
+ load: - 151663 ('<|repo_name|>')
60
+ load: - 151664 ('<|file_sep|>')
61
+ load: special tokens cache size = 26
62
+ load: token to piece cache size = 0.9311 MB
63
+ print_info: arch = qwen3moe
64
+ print_info: vocab_only = 0
65
+ print_info: n_ctx_train = 262144
66
+ print_info: n_embd = 2048
67
+ print_info: n_embd_inp = 2048
68
+ print_info: n_layer = 48
69
+ print_info: n_head = 32
70
+ print_info: n_head_kv = 4
71
+ print_info: n_rot = 128
72
+ print_info: n_swa = 0
73
+ print_info: is_swa_any = 0
74
+ print_info: n_embd_head_k = 128
75
+ print_info: n_embd_head_v = 128
76
+ print_info: n_gqa = 8
77
+ print_info: n_embd_k_gqa = 512
78
+ print_info: n_embd_v_gqa = 512
79
+ print_info: f_norm_eps = 0.0e+00
80
+ print_info: f_norm_rms_eps = 1.0e-06
81
+ print_info: f_clamp_kqv = 0.0e+00
82
+ print_info: f_max_alibi_bias = 0.0e+00
83
+ print_info: f_logit_scale = 0.0e+00
84
+ print_info: f_attn_scale = 0.0e+00
85
+ print_info: n_ff = 6144
86
+ print_info: n_expert = 128
87
+ print_info: n_expert_used = 8
88
+ print_info: n_expert_groups = 0
89
+ print_info: n_group_used = 0
90
+ print_info: causal attn = 1
91
+ print_info: pooling type = 0
92
+ print_info: rope type = 2
93
+ print_info: rope scaling = linear
94
+ print_info: freq_base_train = 10000000.0
95
+ print_info: freq_scale_train = 1
96
+ print_info: n_ctx_orig_yarn = 262144
97
+ print_info: rope_finetuned = unknown
98
+ print_info: model type = 30B.A3B
99
+ print_info: model params = 30.53 B
100
+ print_info: general.name = Qwen3 30B A3B Thinking 2507 Unsloth
101
+ print_info: n_ff_exp = 768
102
+ print_info: vocab type = BPE
103
+ print_info: n_vocab = 151936
104
+ print_info: n_merges = 151387
105
+ print_info: BOS token = 11 ','
106
+ print_info: EOS token = 151645 '<|im_end|>'
107
+ print_info: EOT token = 151645 '<|im_end|>'
108
+ print_info: PAD token = 151654 '<|vision_pad|>'
109
+ print_info: LF token = 198 'Ċ'
110
+ print_info: FIM PRE token = 151659 '<|fim_prefix|>'
111
+ print_info: FIM SUF token = 151661 '<|fim_suffix|>'
112
+ print_info: FIM MID token = 151660 '<|fim_middle|>'
113
+ print_info: FIM PAD token = 151662 '<|fim_pad|>'
114
+ print_info: FIM REP token = 151663 '<|repo_name|>'
115
+ print_info: FIM SEP token = 151664 '<|file_sep|>'
116
+ print_info: EOG token = 151643 '<|endoftext|>'
117
+ print_info: EOG token = 151645 '<|im_end|>'
118
+ print_info: EOG token = 151662 '<|fim_pad|>'
119
+ print_info: EOG token = 151663 '<|repo_name|>'
120
+ print_info: EOG token = 151664 '<|file_sep|>'
121
+ print_info: max token length = 256
122
+ load_tensors: loading model tensors, this can take a while... (mmap = true)
123
+ load_tensors: offloading 20 repeating layers to GPU
124
+ load_tensors: offloaded 20/49 layers to GPU
125
+ load_tensors: CPU_Mapped model buffer size = 9717.82 MiB
126
+ load_tensors: CUDA0 model buffer size = 3351.42 MiB
127
+ load_tensors: CUDA1 model buffer size = 3351.42 MiB
128
+ ....................................................................................................
129
+ llama_context: constructing llama_context
130
+ llama_context: n_seq_max = 1
131
+ llama_context: n_ctx = 2048
132
+ llama_context: n_ctx_seq = 2048
133
+ llama_context: n_batch = 2048
134
+ llama_context: n_ubatch = 512
135
+ llama_context: causal_attn = 1
136
+ llama_context: flash_attn = auto
137
+ llama_context: kv_unified = false
138
+ llama_context: freq_base = 10000000.0
139
+ llama_context: freq_scale = 1
140
+ llama_context: n_ctx_seq (2048) < n_ctx_train (262144) -- the full capacity of the model will not be utilized
141
+ llama_context: CPU output buffer size = 0.58 MiB
142
+ llama_kv_cache: CPU KV buffer size = 112.00 MiB
143
+ llama_kv_cache: CUDA0 KV buffer size = 40.00 MiB
144
+ llama_kv_cache: CUDA1 KV buffer size = 40.00 MiB
145
+ llama_kv_cache: size = 192.00 MiB ( 2048 cells, 48 layers, 1/1 seqs), K (f16): 96.00 MiB, V (f16): 96.00 MiB
146
+ llama_context: Flash Attention was auto, set to enabled
147
+ llama_context: CUDA0 compute buffer size = 467.67 MiB
148
+ llama_context: CUDA1 compute buffer size = 82.01 MiB
149
+ llama_context: CUDA_Host compute buffer size = 8.01 MiB
150
+ llama_context: graph nodes = 3031
151
+ llama_context: graph splits = 397 (with bs=512), 88 (with bs=1)
152
+ common_init_from_params: added <|endoftext|> logit bias = -inf
153
+ common_init_from_params: added <|im_end|> logit bias = -inf
154
+ common_init_from_params: added <|fim_pad|> logit bias = -inf
155
+ common_init_from_params: added <|repo_name|> logit bias = -inf
156
+ common_init_from_params: added <|file_sep|> logit bias = -inf
157
+ common_init_from_params: setting dry_penalty_last_n to ctx_size = 2048
158
+ common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable)
159
+
160
+ system_info: n_threads = 16 (n_threads_batch = 16) / 32 | CUDA : ARCHS = 860 | USE_GRAPHS = 1 | PEER_MAX_BATCH_SIZE = 128 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | BMI2 = 1 | AVX512 = 1 | AVX512_VBMI = 1 | AVX512_VNNI = 1 | AVX512_BF16 = 1 | LLAMAFILE = 1 | OPENMP = 1 | REPACK = 1 |
161
+ perplexity: tokenizing the input ..
162
+ perplexity: tokenization took 113.985 ms
163
+ perplexity: calculating perplexity over 44 chunks, n_ctx=2048, batch_size=2048, n_seq=1
164
+ perplexity: 3.36 seconds per pass - ETA 2.45 minutes
165
+ [1]1.5216,[2]1.4688,[3]1.2927,[4]1.2509,[5]1.3411,[6]1.4040,[7]1.4075,[8]1.4065,[9]1.3658,[10]1.3428,[11]1.3276,[12]1.3292,[13]1.3133,[14]1.3040,[15]1.3000,[16]1.2883,[17]1.2826,[18]1.2817,[19]1.2742,[20]1.2648,[21]1.2618,[22]1.2620,[23]1.2779,[24]1.2707,[25]1.2686,[26]1.2598,[27]1.2544,[28]1.2535,[29]1.2662,[30]1.2680,[31]1.2615,[32]1.2564,[33]1.2575,[34]1.2567,[35]1.2557,[36]1.2772,[37]1.2866,[38]1.2916,[39]1.2987,[40]1.3002,[41]1.2966,[42]1.3101,[43]1.3095,[44]1.3098,
166
+ Final estimate: PPL = 1.3098 +/- 0.00712
167
+
168
+ llama_perf_context_print: load time = 2525.79 ms
169
+ llama_perf_context_print: prompt eval time = 126559.42 ms / 90112 tokens ( 1.40 ms per token, 712.01 tokens per second)
170
+ llama_perf_context_print: eval time = 0.00 ms / 1 runs ( 0.00 ms per token, inf tokens per second)
171
+ llama_perf_context_print: total time = 127840.44 ms / 90113 tokens
172
+ llama_perf_context_print: graphs reused = 0
173
+ llama_memory_breakdown_print: | memory breakdown [MiB] | total free self model context compute unaccounted |
174
+ llama_memory_breakdown_print: | - CUDA0 (RTX 3090) | 24115 = 16883 + (3859 = 3351 + 40 + 467) + 3373 |
175
+ llama_memory_breakdown_print: | - CUDA1 (RTX 3090) | 24124 = 20000 + (3473 = 3351 + 40 + 82) + 649 |
176
+ llama_memory_breakdown_print: | - Host | 9837 = 9717 + 112 + 8 |
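
The running values `[1]1.5216, …, [44]1.3098` in the code-perplexity log above are cumulative per-chunk estimates; the final line reports the perplexity over all 44 chunks. As a minimal sketch of the conventional definition (an assumption that this matches the token-level formulation used by the perplexity tool, not something restated in the log), with N scored tokens and model probability p_i for the i-th reference token:

$$
\mathrm{PPL} = \exp\left(-\frac{1}{N}\sum_{i=1}^{N}\ln p_i\right)
$$

The `+/-` value on the "Final estimate" line is the reported uncertainty of that estimate.
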
Benchmarks/DataCollection/Qwen3-30B-A3B-Thinking-2507-unsloth-iq4_nl-attn_kv_IQ4_NL-attn_output_IQ4_NL-attn_q_IQ4_NL-embeddings_IQ4_NL-ffn_down_IQ4_NL-ffn_up_gate_IQ4_NL-lm_head_IQ4_NL/perplexity_general.log ADDED
@@ -0,0 +1,176 @@
1
+ ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
2
+ ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
3
+ ggml_cuda_init: found 2 CUDA devices:
4
+ Device 0: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
5
+ Device 1: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
6
+ build: 7040 (92bb442ad) with cc (Ubuntu 13.3.0-6ubuntu2~24.04) 13.3.0 for x86_64-linux-gnu
7
+ llama_model_load_from_file_impl: using device CUDA0 (NVIDIA GeForce RTX 3090) (0000:01:00.0) - 20856 MiB free
8
+ llama_model_load_from_file_impl: using device CUDA1 (NVIDIA GeForce RTX 3090) (0000:03:00.0) - 23582 MiB free
9
+ llama_model_loader: loaded meta data with 39 key-value pairs and 579 tensors from /mnt/world8/AI/ToBench/Qwen3-30B-A3B-Thinking-2507-unsloth/Magic_Quant/GGUF/dc_round0_Qwen3-30B-A3B-Thinking-2507-unsloth-iq4_nl-attn_kv_IQ4_NL-attn_output_IQ4_NL-attn_q_IQ4_NL-embeddings_IQ4_NL-ffn_down_IQ4_NL-ffn_up_gate_IQ4_NL-lm_head_IQ4_NL.gguf (version GGUF V3 (latest))
10
+ llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
11
+ llama_model_loader: - kv 0: general.architecture str = qwen3moe
12
+ llama_model_loader: - kv 1: general.type str = model
13
+ llama_model_loader: - kv 2: general.name str = Qwen3 30B A3B Thinking 2507 Unsloth
14
+ llama_model_loader: - kv 3: general.version str = 2507
15
+ llama_model_loader: - kv 4: general.finetune str = Thinking-unsloth
16
+ llama_model_loader: - kv 5: general.basename str = Qwen3
17
+ llama_model_loader: - kv 6: general.size_label str = 30B-A3B
18
+ llama_model_loader: - kv 7: general.license str = apache-2.0
19
+ llama_model_loader: - kv 8: general.license.link str = https://huggingface.co/Qwen/Qwen3-30B...
20
+ llama_model_loader: - kv 9: general.base_model.count u32 = 1
21
+ llama_model_loader: - kv 10: general.base_model.0.name str = Qwen3 30B A3B Thinking 2507
22
+ llama_model_loader: - kv 11: general.base_model.0.version str = 2507
23
+ llama_model_loader: - kv 12: general.base_model.0.organization str = Qwen
24
+ llama_model_loader: - kv 13: general.base_model.0.repo_url str = https://huggingface.co/Qwen/Qwen3-30B...
25
+ llama_model_loader: - kv 14: general.tags arr[str,3] = ["qwen", "qwen3", "unsloth"]
26
+ llama_model_loader: - kv 15: qwen3moe.block_count u32 = 48
27
+ llama_model_loader: - kv 16: qwen3moe.context_length u32 = 262144
28
+ llama_model_loader: - kv 17: qwen3moe.embedding_length u32 = 2048
29
+ llama_model_loader: - kv 18: qwen3moe.feed_forward_length u32 = 6144
30
+ llama_model_loader: - kv 19: qwen3moe.attention.head_count u32 = 32
31
+ llama_model_loader: - kv 20: qwen3moe.attention.head_count_kv u32 = 4
32
+ llama_model_loader: - kv 21: qwen3moe.rope.freq_base f32 = 10000000.000000
33
+ llama_model_loader: - kv 22: qwen3moe.attention.layer_norm_rms_epsilon f32 = 0.000001
34
+ llama_model_loader: - kv 23: qwen3moe.expert_used_count u32 = 8
35
+ llama_model_loader: - kv 24: qwen3moe.attention.key_length u32 = 128
36
+ llama_model_loader: - kv 25: qwen3moe.attention.value_length u32 = 128
37
+ llama_model_loader: - kv 26: qwen3moe.expert_count u32 = 128
38
+ llama_model_loader: - kv 27: qwen3moe.expert_feed_forward_length u32 = 768
39
+ llama_model_loader: - kv 28: tokenizer.ggml.model str = gpt2
40
+ llama_model_loader: - kv 29: tokenizer.ggml.pre str = qwen2
41
+ llama_model_loader: - kv 30: tokenizer.ggml.tokens arr[str,151936] = ["!", "\"", "#", "$", "%", "&", "'", ...
42
+ llama_model_loader: - kv 31: tokenizer.ggml.token_type arr[i32,151936] = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
43
+ llama_model_loader: - kv 32: tokenizer.ggml.merges arr[str,151387] = ["Ġ Ġ", "ĠĠ ĠĠ", "i n", "Ġ t",...
44
+ llama_model_loader: - kv 33: tokenizer.ggml.eos_token_id u32 = 151645
45
+ llama_model_loader: - kv 34: tokenizer.ggml.padding_token_id u32 = 151654
46
+ llama_model_loader: - kv 35: tokenizer.ggml.add_bos_token bool = false
47
+ llama_model_loader: - kv 36: tokenizer.chat_template str = {%- if tools %}\n {{- '<|im_start|>...
48
+ llama_model_loader: - kv 37: general.quantization_version u32 = 2
49
+ llama_model_loader: - kv 38: general.file_type u32 = 25
50
+ llama_model_loader: - type f32: 241 tensors
51
+ llama_model_loader: - type iq4_nl: 338 tensors
52
+ print_info: file format = GGUF V3 (latest)
53
+ print_info: file type = IQ4_NL - 4.5 bpw
54
+ print_info: file size = 16.04 GiB (4.51 BPW)
55
+ load: printing all EOG tokens:
56
+ load: - 151643 ('<|endoftext|>')
57
+ load: - 151645 ('<|im_end|>')
58
+ load: - 151662 ('<|fim_pad|>')
59
+ load: - 151663 ('<|repo_name|>')
60
+ load: - 151664 ('<|file_sep|>')
61
+ load: special tokens cache size = 26
62
+ load: token to piece cache size = 0.9311 MB
63
+ print_info: arch = qwen3moe
64
+ print_info: vocab_only = 0
65
+ print_info: n_ctx_train = 262144
66
+ print_info: n_embd = 2048
67
+ print_info: n_embd_inp = 2048
68
+ print_info: n_layer = 48
69
+ print_info: n_head = 32
70
+ print_info: n_head_kv = 4
71
+ print_info: n_rot = 128
72
+ print_info: n_swa = 0
73
+ print_info: is_swa_any = 0
74
+ print_info: n_embd_head_k = 128
75
+ print_info: n_embd_head_v = 128
76
+ print_info: n_gqa = 8
77
+ print_info: n_embd_k_gqa = 512
78
+ print_info: n_embd_v_gqa = 512
79
+ print_info: f_norm_eps = 0.0e+00
80
+ print_info: f_norm_rms_eps = 1.0e-06
81
+ print_info: f_clamp_kqv = 0.0e+00
82
+ print_info: f_max_alibi_bias = 0.0e+00
83
+ print_info: f_logit_scale = 0.0e+00
84
+ print_info: f_attn_scale = 0.0e+00
85
+ print_info: n_ff = 6144
86
+ print_info: n_expert = 128
87
+ print_info: n_expert_used = 8
88
+ print_info: n_expert_groups = 0
89
+ print_info: n_group_used = 0
90
+ print_info: causal attn = 1
91
+ print_info: pooling type = 0
92
+ print_info: rope type = 2
93
+ print_info: rope scaling = linear
94
+ print_info: freq_base_train = 10000000.0
95
+ print_info: freq_scale_train = 1
96
+ print_info: n_ctx_orig_yarn = 262144
97
+ print_info: rope_finetuned = unknown
98
+ print_info: model type = 30B.A3B
99
+ print_info: model params = 30.53 B
100
+ print_info: general.name = Qwen3 30B A3B Thinking 2507 Unsloth
101
+ print_info: n_ff_exp = 768
102
+ print_info: vocab type = BPE
103
+ print_info: n_vocab = 151936
104
+ print_info: n_merges = 151387
105
+ print_info: BOS token = 11 ','
106
+ print_info: EOS token = 151645 '<|im_end|>'
107
+ print_info: EOT token = 151645 '<|im_end|>'
108
+ print_info: PAD token = 151654 '<|vision_pad|>'
109
+ print_info: LF token = 198 'Ċ'
110
+ print_info: FIM PRE token = 151659 '<|fim_prefix|>'
111
+ print_info: FIM SUF token = 151661 '<|fim_suffix|>'
112
+ print_info: FIM MID token = 151660 '<|fim_middle|>'
113
+ print_info: FIM PAD token = 151662 '<|fim_pad|>'
114
+ print_info: FIM REP token = 151663 '<|repo_name|>'
115
+ print_info: FIM SEP token = 151664 '<|file_sep|>'
116
+ print_info: EOG token = 151643 '<|endoftext|>'
117
+ print_info: EOG token = 151645 '<|im_end|>'
118
+ print_info: EOG token = 151662 '<|fim_pad|>'
119
+ print_info: EOG token = 151663 '<|repo_name|>'
120
+ print_info: EOG token = 151664 '<|file_sep|>'
121
+ print_info: max token length = 256
122
+ load_tensors: loading model tensors, this can take a while... (mmap = true)
123
+ load_tensors: offloading 20 repeating layers to GPU
124
+ load_tensors: offloaded 20/49 layers to GPU
125
+ load_tensors: CPU_Mapped model buffer size = 9717.82 MiB
126
+ load_tensors: CUDA0 model buffer size = 3351.42 MiB
127
+ load_tensors: CUDA1 model buffer size = 3351.42 MiB
128
+ ....................................................................................................
129
+ llama_context: constructing llama_context
130
+ llama_context: n_seq_max = 1
131
+ llama_context: n_ctx = 2048
132
+ llama_context: n_ctx_seq = 2048
133
+ llama_context: n_batch = 2048
134
+ llama_context: n_ubatch = 512
135
+ llama_context: causal_attn = 1
136
+ llama_context: flash_attn = auto
137
+ llama_context: kv_unified = false
138
+ llama_context: freq_base = 10000000.0
139
+ llama_context: freq_scale = 1
140
+ llama_context: n_ctx_seq (2048) < n_ctx_train (262144) -- the full capacity of the model will not be utilized
141
+ llama_context: CPU output buffer size = 0.58 MiB
142
+ llama_kv_cache: CPU KV buffer size = 112.00 MiB
143
+ llama_kv_cache: CUDA0 KV buffer size = 40.00 MiB
144
+ llama_kv_cache: CUDA1 KV buffer size = 40.00 MiB
145
+ llama_kv_cache: size = 192.00 MiB ( 2048 cells, 48 layers, 1/1 seqs), K (f16): 96.00 MiB, V (f16): 96.00 MiB
146
+ llama_context: Flash Attention was auto, set to enabled
147
+ llama_context: CUDA0 compute buffer size = 467.67 MiB
148
+ llama_context: CUDA1 compute buffer size = 82.01 MiB
149
+ llama_context: CUDA_Host compute buffer size = 8.01 MiB
150
+ llama_context: graph nodes = 3031
151
+ llama_context: graph splits = 397 (with bs=512), 88 (with bs=1)
152
+ common_init_from_params: added <|endoftext|> logit bias = -inf
153
+ common_init_from_params: added <|im_end|> logit bias = -inf
154
+ common_init_from_params: added <|fim_pad|> logit bias = -inf
155
+ common_init_from_params: added <|repo_name|> logit bias = -inf
156
+ common_init_from_params: added <|file_sep|> logit bias = -inf
157
+ common_init_from_params: setting dry_penalty_last_n to ctx_size = 2048
158
+ common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable)
159
+
160
+ system_info: n_threads = 16 (n_threads_batch = 16) / 32 | CUDA : ARCHS = 860 | USE_GRAPHS = 1 | PEER_MAX_BATCH_SIZE = 128 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | BMI2 = 1 | AVX512 = 1 | AVX512_VBMI = 1 | AVX512_VNNI = 1 | AVX512_BF16 = 1 | LLAMAFILE = 1 | OPENMP = 1 | REPACK = 1 |
161
+ perplexity: tokenizing the input ..
162
+ perplexity: tokenization took 50.436 ms
163
+ perplexity: calculating perplexity over 15 chunks, n_ctx=2048, batch_size=2048, n_seq=1
164
+ perplexity: 3.30 seconds per pass - ETA 0.82 minutes
165
+ [1]5.5164,[2]6.5847,[3]7.0380,[4]6.9090,[5]6.8306,[6]5.8839,[7]5.3798,[8]5.4143,[9]5.7279,[10]5.8593,[11]5.9151,[12]6.2282,[13]6.2985,[14]6.4406,[15]6.5252,
166
+ Final estimate: PPL = 6.5252 +/- 0.13424
167
+
168
+ llama_perf_context_print: load time = 2632.70 ms
169
+ llama_perf_context_print: prompt eval time = 45810.21 ms / 30720 tokens ( 1.49 ms per token, 670.59 tokens per second)
170
+ llama_perf_context_print: eval time = 0.00 ms / 1 runs ( 0.00 ms per token, inf tokens per second)
171
+ llama_perf_context_print: total time = 46251.63 ms / 30721 tokens
172
+ llama_perf_context_print: graphs reused = 0
173
+ llama_memory_breakdown_print: | memory breakdown [MiB] | total free self model context compute unaccounted |
174
+ llama_memory_breakdown_print: | - CUDA0 (RTX 3090) | 24115 = 16832 + (3859 = 3351 + 40 + 467) + 3423 |
175
+ llama_memory_breakdown_print: | - CUDA1 (RTX 3090) | 24124 = 20000 + (3473 = 3351 + 40 + 82) + 649 |
176
+ llama_memory_breakdown_print: | - Host | 9837 = 9717 + 112 + 8 |
Benchmarks/DataCollection/Qwen3-30B-A3B-Thinking-2507-unsloth-iq4_nl-attn_kv_IQ4_NL-attn_output_IQ4_NL-attn_q_IQ4_NL-embeddings_IQ4_NL-ffn_down_IQ4_NL-ffn_up_gate_IQ4_NL-lm_head_IQ4_NL/perplexity_math.log ADDED
@@ -0,0 +1,176 @@
1
+ ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
2
+ ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
3
+ ggml_cuda_init: found 2 CUDA devices:
4
+ Device 0: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
5
+ Device 1: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
6
+ build: 7040 (92bb442ad) with cc (Ubuntu 13.3.0-6ubuntu2~24.04) 13.3.0 for x86_64-linux-gnu
7
+ llama_model_load_from_file_impl: using device CUDA0 (NVIDIA GeForce RTX 3090) (0000:01:00.0) - 20838 MiB free
8
+ llama_model_load_from_file_impl: using device CUDA1 (NVIDIA GeForce RTX 3090) (0000:03:00.0) - 23582 MiB free
9
+ llama_model_loader: loaded meta data with 39 key-value pairs and 579 tensors from /mnt/world8/AI/ToBench/Qwen3-30B-A3B-Thinking-2507-unsloth/Magic_Quant/GGUF/dc_round0_Qwen3-30B-A3B-Thinking-2507-unsloth-iq4_nl-attn_kv_IQ4_NL-attn_output_IQ4_NL-attn_q_IQ4_NL-embeddings_IQ4_NL-ffn_down_IQ4_NL-ffn_up_gate_IQ4_NL-lm_head_IQ4_NL.gguf (version GGUF V3 (latest))
10
+ llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
11
+ llama_model_loader: - kv 0: general.architecture str = qwen3moe
12
+ llama_model_loader: - kv 1: general.type str = model
13
+ llama_model_loader: - kv 2: general.name str = Qwen3 30B A3B Thinking 2507 Unsloth
14
+ llama_model_loader: - kv 3: general.version str = 2507
15
+ llama_model_loader: - kv 4: general.finetune str = Thinking-unsloth
16
+ llama_model_loader: - kv 5: general.basename str = Qwen3
17
+ llama_model_loader: - kv 6: general.size_label str = 30B-A3B
18
+ llama_model_loader: - kv 7: general.license str = apache-2.0
19
+ llama_model_loader: - kv 8: general.license.link str = https://huggingface.co/Qwen/Qwen3-30B...
20
+ llama_model_loader: - kv 9: general.base_model.count u32 = 1
21
+ llama_model_loader: - kv 10: general.base_model.0.name str = Qwen3 30B A3B Thinking 2507
22
+ llama_model_loader: - kv 11: general.base_model.0.version str = 2507
23
+ llama_model_loader: - kv 12: general.base_model.0.organization str = Qwen
24
+ llama_model_loader: - kv 13: general.base_model.0.repo_url str = https://huggingface.co/Qwen/Qwen3-30B...
25
+ llama_model_loader: - kv 14: general.tags arr[str,3] = ["qwen", "qwen3", "unsloth"]
26
+ llama_model_loader: - kv 15: qwen3moe.block_count u32 = 48
27
+ llama_model_loader: - kv 16: qwen3moe.context_length u32 = 262144
28
+ llama_model_loader: - kv 17: qwen3moe.embedding_length u32 = 2048
29
+ llama_model_loader: - kv 18: qwen3moe.feed_forward_length u32 = 6144
30
+ llama_model_loader: - kv 19: qwen3moe.attention.head_count u32 = 32
31
+ llama_model_loader: - kv 20: qwen3moe.attention.head_count_kv u32 = 4
32
+ llama_model_loader: - kv 21: qwen3moe.rope.freq_base f32 = 10000000.000000
33
+ llama_model_loader: - kv 22: qwen3moe.attention.layer_norm_rms_epsilon f32 = 0.000001
34
+ llama_model_loader: - kv 23: qwen3moe.expert_used_count u32 = 8
35
+ llama_model_loader: - kv 24: qwen3moe.attention.key_length u32 = 128
36
+ llama_model_loader: - kv 25: qwen3moe.attention.value_length u32 = 128
37
+ llama_model_loader: - kv 26: qwen3moe.expert_count u32 = 128
38
+ llama_model_loader: - kv 27: qwen3moe.expert_feed_forward_length u32 = 768
39
+ llama_model_loader: - kv 28: tokenizer.ggml.model str = gpt2
40
+ llama_model_loader: - kv 29: tokenizer.ggml.pre str = qwen2
41
+ llama_model_loader: - kv 30: tokenizer.ggml.tokens arr[str,151936] = ["!", "\"", "#", "$", "%", "&", "'", ...
42
+ llama_model_loader: - kv 31: tokenizer.ggml.token_type arr[i32,151936] = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
43
+ llama_model_loader: - kv 32: tokenizer.ggml.merges arr[str,151387] = ["Ġ Ġ", "ĠĠ ĠĠ", "i n", "Ġ t",...
44
+ llama_model_loader: - kv 33: tokenizer.ggml.eos_token_id u32 = 151645
45
+ llama_model_loader: - kv 34: tokenizer.ggml.padding_token_id u32 = 151654
46
+ llama_model_loader: - kv 35: tokenizer.ggml.add_bos_token bool = false
47
+ llama_model_loader: - kv 36: tokenizer.chat_template str = {%- if tools %}\n {{- '<|im_start|>...
48
+ llama_model_loader: - kv 37: general.quantization_version u32 = 2
49
+ llama_model_loader: - kv 38: general.file_type u32 = 25
50
+ llama_model_loader: - type f32: 241 tensors
51
+ llama_model_loader: - type iq4_nl: 338 tensors
52
+ print_info: file format = GGUF V3 (latest)
53
+ print_info: file type = IQ4_NL - 4.5 bpw
54
+ print_info: file size = 16.04 GiB (4.51 BPW)
55
+ load: printing all EOG tokens:
56
+ load: - 151643 ('<|endoftext|>')
57
+ load: - 151645 ('<|im_end|>')
58
+ load: - 151662 ('<|fim_pad|>')
59
+ load: - 151663 ('<|repo_name|>')
60
+ load: - 151664 ('<|file_sep|>')
61
+ load: special tokens cache size = 26
62
+ load: token to piece cache size = 0.9311 MB
63
+ print_info: arch = qwen3moe
64
+ print_info: vocab_only = 0
65
+ print_info: n_ctx_train = 262144
66
+ print_info: n_embd = 2048
67
+ print_info: n_embd_inp = 2048
68
+ print_info: n_layer = 48
69
+ print_info: n_head = 32
70
+ print_info: n_head_kv = 4
71
+ print_info: n_rot = 128
72
+ print_info: n_swa = 0
73
+ print_info: is_swa_any = 0
74
+ print_info: n_embd_head_k = 128
75
+ print_info: n_embd_head_v = 128
76
+ print_info: n_gqa = 8
77
+ print_info: n_embd_k_gqa = 512
78
+ print_info: n_embd_v_gqa = 512
79
+ print_info: f_norm_eps = 0.0e+00
80
+ print_info: f_norm_rms_eps = 1.0e-06
81
+ print_info: f_clamp_kqv = 0.0e+00
82
+ print_info: f_max_alibi_bias = 0.0e+00
83
+ print_info: f_logit_scale = 0.0e+00
84
+ print_info: f_attn_scale = 0.0e+00
85
+ print_info: n_ff = 6144
86
+ print_info: n_expert = 128
87
+ print_info: n_expert_used = 8
88
+ print_info: n_expert_groups = 0
89
+ print_info: n_group_used = 0
90
+ print_info: causal attn = 1
91
+ print_info: pooling type = 0
92
+ print_info: rope type = 2
93
+ print_info: rope scaling = linear
94
+ print_info: freq_base_train = 10000000.0
95
+ print_info: freq_scale_train = 1
96
+ print_info: n_ctx_orig_yarn = 262144
97
+ print_info: rope_finetuned = unknown
98
+ print_info: model type = 30B.A3B
99
+ print_info: model params = 30.53 B
100
+ print_info: general.name = Qwen3 30B A3B Thinking 2507 Unsloth
101
+ print_info: n_ff_exp = 768
102
+ print_info: vocab type = BPE
103
+ print_info: n_vocab = 151936
104
+ print_info: n_merges = 151387
105
+ print_info: BOS token = 11 ','
106
+ print_info: EOS token = 151645 '<|im_end|>'
107
+ print_info: EOT token = 151645 '<|im_end|>'
108
+ print_info: PAD token = 151654 '<|vision_pad|>'
109
+ print_info: LF token = 198 'Ċ'
110
+ print_info: FIM PRE token = 151659 '<|fim_prefix|>'
111
+ print_info: FIM SUF token = 151661 '<|fim_suffix|>'
112
+ print_info: FIM MID token = 151660 '<|fim_middle|>'
113
+ print_info: FIM PAD token = 151662 '<|fim_pad|>'
114
+ print_info: FIM REP token = 151663 '<|repo_name|>'
115
+ print_info: FIM SEP token = 151664 '<|file_sep|>'
116
+ print_info: EOG token = 151643 '<|endoftext|>'
117
+ print_info: EOG token = 151645 '<|im_end|>'
118
+ print_info: EOG token = 151662 '<|fim_pad|>'
119
+ print_info: EOG token = 151663 '<|repo_name|>'
120
+ print_info: EOG token = 151664 '<|file_sep|>'
121
+ print_info: max token length = 256
122
+ load_tensors: loading model tensors, this can take a while... (mmap = true)
123
+ load_tensors: offloading 20 repeating layers to GPU
124
+ load_tensors: offloaded 20/49 layers to GPU
125
+ load_tensors: CPU_Mapped model buffer size = 9717.82 MiB
126
+ load_tensors: CUDA0 model buffer size = 3351.42 MiB
127
+ load_tensors: CUDA1 model buffer size = 3351.42 MiB
128
+ ....................................................................................................
129
+ llama_context: constructing llama_context
130
+ llama_context: n_seq_max = 1
131
+ llama_context: n_ctx = 2048
132
+ llama_context: n_ctx_seq = 2048
133
+ llama_context: n_batch = 2048
134
+ llama_context: n_ubatch = 512
135
+ llama_context: causal_attn = 1
136
+ llama_context: flash_attn = auto
137
+ llama_context: kv_unified = false
138
+ llama_context: freq_base = 10000000.0
139
+ llama_context: freq_scale = 1
140
+ llama_context: n_ctx_seq (2048) < n_ctx_train (262144) -- the full capacity of the model will not be utilized
141
+ llama_context: CPU output buffer size = 0.58 MiB
142
+ llama_kv_cache: CPU KV buffer size = 112.00 MiB
143
+ llama_kv_cache: CUDA0 KV buffer size = 40.00 MiB
144
+ llama_kv_cache: CUDA1 KV buffer size = 40.00 MiB
145
+ llama_kv_cache: size = 192.00 MiB ( 2048 cells, 48 layers, 1/1 seqs), K (f16): 96.00 MiB, V (f16): 96.00 MiB
146
+ llama_context: Flash Attention was auto, set to enabled
147
+ llama_context: CUDA0 compute buffer size = 467.67 MiB
148
+ llama_context: CUDA1 compute buffer size = 82.01 MiB
149
+ llama_context: CUDA_Host compute buffer size = 8.01 MiB
150
+ llama_context: graph nodes = 3031
151
+ llama_context: graph splits = 397 (with bs=512), 88 (with bs=1)
152
+ common_init_from_params: added <|endoftext|> logit bias = -inf
153
+ common_init_from_params: added <|im_end|> logit bias = -inf
154
+ common_init_from_params: added <|fim_pad|> logit bias = -inf
155
+ common_init_from_params: added <|repo_name|> logit bias = -inf
156
+ common_init_from_params: added <|file_sep|> logit bias = -inf
157
+ common_init_from_params: setting dry_penalty_last_n to ctx_size = 2048
158
+ common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable)
159
+
160
+ system_info: n_threads = 16 (n_threads_batch = 16) / 32 | CUDA : ARCHS = 860 | USE_GRAPHS = 1 | PEER_MAX_BATCH_SIZE = 128 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | BMI2 = 1 | AVX512 = 1 | AVX512_VBMI = 1 | AVX512_VNNI = 1 | AVX512_BF16 = 1 | LLAMAFILE = 1 | OPENMP = 1 | REPACK = 1 |
161
+ perplexity: tokenizing the input ..
162
+ perplexity: tokenization took 44.471 ms
163
+ perplexity: calculating perplexity over 16 chunks, n_ctx=2048, batch_size=2048, n_seq=1
164
+ perplexity: 3.39 seconds per pass - ETA 0.90 minutes
165
+ [1]4.6492,[2]5.0826,[3]5.3607,[4]5.5500,[5]5.7673,[6]5.7750,[7]5.7793,[8]5.7272,[9]5.7721,[10]5.7634,[11]5.7653,[12]5.7635,[13]5.8485,[14]5.8593,[15]5.8536,[16]5.8597,
166
+ Final estimate: PPL = 5.8597 +/- 0.10817
167
+
168
+ llama_perf_context_print: load time = 2501.90 ms
169
+ llama_perf_context_print: prompt eval time = 50382.21 ms / 32768 tokens ( 1.54 ms per token, 650.39 tokens per second)
170
+ llama_perf_context_print: eval time = 0.00 ms / 1 runs ( 0.00 ms per token, inf tokens per second)
171
+ llama_perf_context_print: total time = 50841.87 ms / 32769 tokens
172
+ llama_perf_context_print: graphs reused = 0
173
+ llama_memory_breakdown_print: | memory breakdown [MiB] | total free self model context compute unaccounted |
174
+ llama_memory_breakdown_print: | - CUDA0 (RTX 3090) | 24115 = 16728 + (3859 = 3351 + 40 + 467) + 3528 |
175
+ llama_memory_breakdown_print: | - CUDA1 (RTX 3090) | 24124 = 20000 + (3473 = 3351 + 40 + 82) + 649 |
176
+ llama_memory_breakdown_print: | - Host | 9837 = 9717 + 112 + 8 |
Benchmarks/DataCollection/Qwen3-30B-A3B-Thinking-2507-unsloth-iq4_nl-attn_kv_IQ4_NL-attn_output_IQ4_NL-attn_q_IQ4_NL-embeddings_MXFP4-ffn_down_IQ4_NL-ffn_up_gate_IQ4_NL-lm_head_Q5_K/bench_metrics.json ADDED
@@ -0,0 +1,44 @@
1
+ {
2
+ "raw_metrics": {
3
+ "llamabench": {
4
+ "backend": "CUDA",
5
+ "log_path": "Qwen3-30B-A3B-Thinking-2507-unsloth-iq4_nl-attn_kv_IQ4_NL-attn_output_IQ4_NL-attn_q_IQ4_NL-embeddings_MXFP4-ffn_down_IQ4_NL-ffn_up_gate_IQ4_NL-lm_head_Q5_K/llamabench.md",
6
+ "ngl": "35",
7
+ "raw_row": {
8
+ "backend": "CUDA",
9
+ "model": "qwen3moe 30B.A3B IQ4_NL - 4.5 bpw",
10
+ "ngl": "35",
11
+ "params": "30.53 B",
12
+ "size": "16.06 GiB",
13
+ "t/s": "153.05 \u00b1 4.67",
14
+ "test": "pp8",
15
+ "tps_value": 153.05
16
+ },
17
+ "test": "pp8",
18
+ "tps": 153.05
19
+ },
20
+ "perplexity": {
21
+ "code": {
22
+ "log_path": "Qwen3-30B-A3B-Thinking-2507-unsloth-iq4_nl-attn_kv_IQ4_NL-attn_output_IQ4_NL-attn_q_IQ4_NL-embeddings_MXFP4-ffn_down_IQ4_NL-ffn_up_gate_IQ4_NL-lm_head_Q5_K/perplexity_code.log",
23
+ "ppl": 1.3056,
24
+ "ppl_error": 0.00706
25
+ },
26
+ "general": {
27
+ "log_path": "Qwen3-30B-A3B-Thinking-2507-unsloth-iq4_nl-attn_kv_IQ4_NL-attn_output_IQ4_NL-attn_q_IQ4_NL-embeddings_MXFP4-ffn_down_IQ4_NL-ffn_up_gate_IQ4_NL-lm_head_Q5_K/perplexity_general.log",
28
+ "ppl": 6.3772,
29
+ "ppl_error": 0.13014
30
+ },
31
+ "math": {
32
+ "log_path": "Qwen3-30B-A3B-Thinking-2507-unsloth-iq4_nl-attn_kv_IQ4_NL-attn_output_IQ4_NL-attn_q_IQ4_NL-embeddings_MXFP4-ffn_down_IQ4_NL-ffn_up_gate_IQ4_NL-lm_head_Q5_K/perplexity_math.log",
33
+ "ppl": 5.7351,
34
+ "ppl_error": 0.1051
35
+ }
36
+ }
37
+ },
38
+ "summary": {
39
+ "avg_prec_loss_pct": 1.1878,
40
+ "bench_tps": 153.05,
41
+ "file_size_bytes": 17253439520,
42
+ "file_size_gb": 16.07
43
+ }
44
+ }
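
A rough sketch of how the `summary` block above could be reproduced from `raw_metrics` (this is an assumption about the aggregation, not the repository's own collection script; the baseline perplexities below are hypothetical placeholders, since the full-precision reference values are not part of this file):

```python
# Sketch only: plausible derivation of the summary fields (assumed, not the
# actual benchmark pipeline). Baseline PPLs are hypothetical placeholders.

quant_ppl = {"code": 1.3056, "general": 6.3772, "math": 5.7351}  # from raw_metrics above
baseline_ppl = {"code": 1.29, "general": 6.31, "math": 5.67}     # hypothetical full-precision reference

def avg_prec_loss_pct(quant: dict, base: dict) -> float:
    """Mean percentage increase in perplexity across the three domains."""
    losses = [(quant[k] - base[k]) / base[k] * 100.0 for k in quant]
    return round(sum(losses) / len(losses), 4)

# File size in GiB follows directly from the byte count recorded above.
file_size_gb = round(17_253_439_520 / 1024**3, 2)  # -> 16.07, matching "file_size_gb"

print(avg_prec_loss_pct(quant_ppl, baseline_ppl), file_size_gb)
```
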
Benchmarks/DataCollection/Qwen3-30B-A3B-Thinking-2507-unsloth-iq4_nl-attn_kv_IQ4_NL-attn_output_IQ4_NL-attn_q_IQ4_NL-embeddings_MXFP4-ffn_down_IQ4_NL-ffn_up_gate_IQ4_NL-lm_head_Q5_K/llamabench.md ADDED
@@ -0,0 +1,11 @@
1
+ ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
2
+ ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
3
+ ggml_cuda_init: found 2 CUDA devices:
4
+ Device 0: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
5
+ Device 1: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
6
+ | model | size | params | backend | ngl | test | t/s |
7
+ | ------------------------------ | ---------: | ---------: | ---------- | --: | --------------: | -------------------: |
8
+ | qwen3moe 30B.A3B IQ4_NL - 4.5 bpw | 16.06 GiB | 30.53 B | CUDA | 35 | pp8 | 153.05 ± 4.67 |
9
+ | qwen3moe 30B.A3B IQ4_NL - 4.5 bpw | 16.06 GiB | 30.53 B | CUDA | 35 | tg128 | 50.86 ± 0.32 |
10
+
11
+ build: 92bb442ad (7040)
Benchmarks/DataCollection/Qwen3-30B-A3B-Thinking-2507-unsloth-iq4_nl-attn_kv_IQ4_NL-attn_output_IQ4_NL-attn_q_IQ4_NL-embeddings_MXFP4-ffn_down_IQ4_NL-ffn_up_gate_IQ4_NL-lm_head_Q5_K/perplexity_code.log ADDED
@@ -0,0 +1,178 @@
1
+ ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
2
+ ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
3
+ ggml_cuda_init: found 2 CUDA devices:
4
+ Device 0: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
5
+ Device 1: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
6
+ build: 7040 (92bb442ad) with cc (Ubuntu 13.3.0-6ubuntu2~24.04) 13.3.0 for x86_64-linux-gnu
7
+ llama_model_load_from_file_impl: using device CUDA0 (NVIDIA GeForce RTX 3090) (0000:01:00.0) - 20668 MiB free
8
+ llama_model_load_from_file_impl: using device CUDA1 (NVIDIA GeForce RTX 3090) (0000:03:00.0) - 23582 MiB free
9
+ llama_model_loader: loaded meta data with 39 key-value pairs and 579 tensors from /mnt/world8/AI/ToBench/Qwen3-30B-A3B-Thinking-2507-unsloth/Magic_Quant/GGUF/dc_round1_Qwen3-30B-A3B-Thinking-2507-unsloth-iq4_nl-attn_kv_IQ4_NL-attn_output_IQ4_NL-attn_q_IQ4_NL-embeddings_MXFP4-ffn_down_IQ4_NL-ffn_up_gate_IQ4_NL-lm_head_Q5_K.gguf (version GGUF V3 (latest))
10
+ llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
11
+ llama_model_loader: - kv 0: general.architecture str = qwen3moe
12
+ llama_model_loader: - kv 1: general.type str = model
13
+ llama_model_loader: - kv 2: general.name str = Qwen3 30B A3B Thinking 2507 Unsloth
14
+ llama_model_loader: - kv 3: general.version str = 2507
15
+ llama_model_loader: - kv 4: general.finetune str = Thinking-unsloth
16
+ llama_model_loader: - kv 5: general.basename str = Qwen3
17
+ llama_model_loader: - kv 6: general.size_label str = 30B-A3B
18
+ llama_model_loader: - kv 7: general.license str = apache-2.0
19
+ llama_model_loader: - kv 8: general.license.link str = https://huggingface.co/Qwen/Qwen3-30B...
20
+ llama_model_loader: - kv 9: general.base_model.count u32 = 1
21
+ llama_model_loader: - kv 10: general.base_model.0.name str = Qwen3 30B A3B Thinking 2507
22
+ llama_model_loader: - kv 11: general.base_model.0.version str = 2507
23
+ llama_model_loader: - kv 12: general.base_model.0.organization str = Qwen
24
+ llama_model_loader: - kv 13: general.base_model.0.repo_url str = https://huggingface.co/Qwen/Qwen3-30B...
25
+ llama_model_loader: - kv 14: general.tags arr[str,3] = ["qwen", "qwen3", "unsloth"]
26
+ llama_model_loader: - kv 15: qwen3moe.block_count u32 = 48
27
+ llama_model_loader: - kv 16: qwen3moe.context_length u32 = 262144
28
+ llama_model_loader: - kv 17: qwen3moe.embedding_length u32 = 2048
29
+ llama_model_loader: - kv 18: qwen3moe.feed_forward_length u32 = 6144
30
+ llama_model_loader: - kv 19: qwen3moe.attention.head_count u32 = 32
31
+ llama_model_loader: - kv 20: qwen3moe.attention.head_count_kv u32 = 4
32
+ llama_model_loader: - kv 21: qwen3moe.rope.freq_base f32 = 10000000.000000
33
+ llama_model_loader: - kv 22: qwen3moe.attention.layer_norm_rms_epsilon f32 = 0.000001
34
+ llama_model_loader: - kv 23: qwen3moe.expert_used_count u32 = 8
35
+ llama_model_loader: - kv 24: qwen3moe.attention.key_length u32 = 128
36
+ llama_model_loader: - kv 25: qwen3moe.attention.value_length u32 = 128
37
+ llama_model_loader: - kv 26: qwen3moe.expert_count u32 = 128
38
+ llama_model_loader: - kv 27: qwen3moe.expert_feed_forward_length u32 = 768
39
+ llama_model_loader: - kv 28: tokenizer.ggml.model str = gpt2
40
+ llama_model_loader: - kv 29: tokenizer.ggml.pre str = qwen2
41
+ llama_model_loader: - kv 30: tokenizer.ggml.tokens arr[str,151936] = ["!", "\"", "#", "$", "%", "&", "'", ...
42
+ llama_model_loader: - kv 31: tokenizer.ggml.token_type arr[i32,151936] = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
43
+ llama_model_loader: - kv 32: tokenizer.ggml.merges arr[str,151387] = ["Ġ Ġ", "ĠĠ ĠĠ", "i n", "Ġ t",...
44
+ llama_model_loader: - kv 33: tokenizer.ggml.eos_token_id u32 = 151645
45
+ llama_model_loader: - kv 34: tokenizer.ggml.padding_token_id u32 = 151654
46
+ llama_model_loader: - kv 35: tokenizer.ggml.add_bos_token bool = false
47
+ llama_model_loader: - kv 36: tokenizer.chat_template str = {%- if tools %}\n {{- '<|im_start|>...
48
+ llama_model_loader: - kv 37: general.quantization_version u32 = 2
49
+ llama_model_loader: - kv 38: general.file_type u32 = 25
50
+ llama_model_loader: - type f32: 241 tensors
51
+ llama_model_loader: - type q5_K: 1 tensors
52
+ llama_model_loader: - type iq4_nl: 336 tensors
53
+ llama_model_loader: - type mxfp4: 1 tensors
54
+ print_info: file format = GGUF V3 (latest)
55
+ print_info: file type = IQ4_NL - 4.5 bpw
56
+ print_info: file size = 16.06 GiB (4.52 BPW)
57
+ load: printing all EOG tokens:
58
+ load: - 151643 ('<|endoftext|>')
59
+ load: - 151645 ('<|im_end|>')
60
+ load: - 151662 ('<|fim_pad|>')
61
+ load: - 151663 ('<|repo_name|>')
62
+ load: - 151664 ('<|file_sep|>')
63
+ load: special tokens cache size = 26
64
+ load: token to piece cache size = 0.9311 MB
65
+ print_info: arch = qwen3moe
66
+ print_info: vocab_only = 0
67
+ print_info: n_ctx_train = 262144
68
+ print_info: n_embd = 2048
69
+ print_info: n_embd_inp = 2048
70
+ print_info: n_layer = 48
71
+ print_info: n_head = 32
72
+ print_info: n_head_kv = 4
73
+ print_info: n_rot = 128
74
+ print_info: n_swa = 0
75
+ print_info: is_swa_any = 0
76
+ print_info: n_embd_head_k = 128
77
+ print_info: n_embd_head_v = 128
78
+ print_info: n_gqa = 8
79
+ print_info: n_embd_k_gqa = 512
80
+ print_info: n_embd_v_gqa = 512
81
+ print_info: f_norm_eps = 0.0e+00
82
+ print_info: f_norm_rms_eps = 1.0e-06
83
+ print_info: f_clamp_kqv = 0.0e+00
84
+ print_info: f_max_alibi_bias = 0.0e+00
85
+ print_info: f_logit_scale = 0.0e+00
86
+ print_info: f_attn_scale = 0.0e+00
87
+ print_info: n_ff = 6144
88
+ print_info: n_expert = 128
89
+ print_info: n_expert_used = 8
90
+ print_info: n_expert_groups = 0
91
+ print_info: n_group_used = 0
92
+ print_info: causal attn = 1
93
+ print_info: pooling type = 0
94
+ print_info: rope type = 2
95
+ print_info: rope scaling = linear
96
+ print_info: freq_base_train = 10000000.0
97
+ print_info: freq_scale_train = 1
98
+ print_info: n_ctx_orig_yarn = 262144
99
+ print_info: rope_finetuned = unknown
100
+ print_info: model type = 30B.A3B
101
+ print_info: model params = 30.53 B
102
+ print_info: general.name = Qwen3 30B A3B Thinking 2507 Unsloth
103
+ print_info: n_ff_exp = 768
104
+ print_info: vocab type = BPE
105
+ print_info: n_vocab = 151936
106
+ print_info: n_merges = 151387
107
+ print_info: BOS token = 11 ','
108
+ print_info: EOS token = 151645 '<|im_end|>'
109
+ print_info: EOT token = 151645 '<|im_end|>'
110
+ print_info: PAD token = 151654 '<|vision_pad|>'
111
+ print_info: LF token = 198 'Ċ'
112
+ print_info: FIM PRE token = 151659 '<|fim_prefix|>'
113
+ print_info: FIM SUF token = 151661 '<|fim_suffix|>'
114
+ print_info: FIM MID token = 151660 '<|fim_middle|>'
115
+ print_info: FIM PAD token = 151662 '<|fim_pad|>'
116
+ print_info: FIM REP token = 151663 '<|repo_name|>'
117
+ print_info: FIM SEP token = 151664 '<|file_sep|>'
118
+ print_info: EOG token = 151643 '<|endoftext|>'
119
+ print_info: EOG token = 151645 '<|im_end|>'
120
+ print_info: EOG token = 151662 '<|fim_pad|>'
121
+ print_info: EOG token = 151663 '<|repo_name|>'
122
+ print_info: EOG token = 151664 '<|file_sep|>'
123
+ print_info: max token length = 256
124
+ load_tensors: loading model tensors, this can take a while... (mmap = true)
125
+ load_tensors: offloading 20 repeating layers to GPU
126
+ load_tensors: offloaded 20/49 layers to GPU
127
+ load_tensors: CPU_Mapped model buffer size = 9745.64 MiB
128
+ load_tensors: CUDA0 model buffer size = 3351.42 MiB
129
+ load_tensors: CUDA1 model buffer size = 3351.42 MiB
130
+ ...................................................................................................
131
+ llama_context: constructing llama_context
132
+ llama_context: n_seq_max = 1
133
+ llama_context: n_ctx = 2048
134
+ llama_context: n_ctx_seq = 2048
135
+ llama_context: n_batch = 2048
136
+ llama_context: n_ubatch = 512
137
+ llama_context: causal_attn = 1
138
+ llama_context: flash_attn = auto
139
+ llama_context: kv_unified = false
140
+ llama_context: freq_base = 10000000.0
141
+ llama_context: freq_scale = 1
142
+ llama_context: n_ctx_seq (2048) < n_ctx_train (262144) -- the full capacity of the model will not be utilized
143
+ llama_context: CPU output buffer size = 0.58 MiB
144
+ llama_kv_cache: CPU KV buffer size = 112.00 MiB
145
+ llama_kv_cache: CUDA0 KV buffer size = 40.00 MiB
146
+ llama_kv_cache: CUDA1 KV buffer size = 40.00 MiB
147
+ llama_kv_cache: size = 192.00 MiB ( 2048 cells, 48 layers, 1/1 seqs), K (f16): 96.00 MiB, V (f16): 96.00 MiB
148
+ llama_context: Flash Attention was auto, set to enabled
149
+ llama_context: CUDA0 compute buffer size = 504.77 MiB
150
+ llama_context: CUDA1 compute buffer size = 82.01 MiB
151
+ llama_context: CUDA_Host compute buffer size = 8.01 MiB
152
+ llama_context: graph nodes = 3031
153
+ llama_context: graph splits = 397 (with bs=512), 88 (with bs=1)
154
+ common_init_from_params: added <|endoftext|> logit bias = -inf
155
+ common_init_from_params: added <|im_end|> logit bias = -inf
156
+ common_init_from_params: added <|fim_pad|> logit bias = -inf
157
+ common_init_from_params: added <|repo_name|> logit bias = -inf
158
+ common_init_from_params: added <|file_sep|> logit bias = -inf
159
+ common_init_from_params: setting dry_penalty_last_n to ctx_size = 2048
160
+ common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable)
161
+
162
+ system_info: n_threads = 16 (n_threads_batch = 16) / 32 | CUDA : ARCHS = 860 | USE_GRAPHS = 1 | PEER_MAX_BATCH_SIZE = 128 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | BMI2 = 1 | AVX512 = 1 | AVX512_VBMI = 1 | AVX512_VNNI = 1 | AVX512_BF16 = 1 | LLAMAFILE = 1 | OPENMP = 1 | REPACK = 1 |
163
+ perplexity: tokenizing the input ..
164
+ perplexity: tokenization took 112.915 ms
165
+ perplexity: calculating perplexity over 44 chunks, n_ctx=2048, batch_size=2048, n_seq=1
166
+ perplexity: 3.31 seconds per pass - ETA 2.42 minutes
167
+ [1]1.5133,[2]1.4561,[3]1.2850,[4]1.2452,[5]1.3348,[6]1.3979,[7]1.4007,[8]1.3980,[9]1.3584,[10]1.3355,[11]1.3212,[12]1.3232,[13]1.3079,[14]1.2995,[15]1.2950,[16]1.2837,[17]1.2777,[18]1.2768,[19]1.2694,[20]1.2601,[21]1.2574,[22]1.2577,[23]1.2737,[24]1.2667,[25]1.2652,[26]1.2566,[27]1.2512,[28]1.2505,[29]1.2631,[30]1.2650,[31]1.2585,[32]1.2534,[33]1.2542,[34]1.2534,[35]1.2523,[36]1.2737,[37]1.2830,[38]1.2878,[39]1.2945,[40]1.2956,[41]1.2922,[42]1.3056,[43]1.3053,[44]1.3056,
168
+ Final estimate: PPL = 1.3056 +/- 0.00706
169
+
170
+ llama_perf_context_print: load time = 5258.07 ms
171
+ llama_perf_context_print: prompt eval time = 123618.62 ms / 90112 tokens ( 1.37 ms per token, 728.95 tokens per second)
172
+ llama_perf_context_print: eval time = 0.00 ms / 1 runs ( 0.00 ms per token, inf tokens per second)
173
+ llama_perf_context_print: total time = 124982.52 ms / 90113 tokens
174
+ llama_perf_context_print: graphs reused = 0
175
+ llama_memory_breakdown_print: | memory breakdown [MiB] | total free self model context compute unaccounted |
176
+ llama_memory_breakdown_print: | - CUDA0 (RTX 3090) | 24115 = 16666 + (3896 = 3351 + 40 + 504) + 3552 |
177
+ llama_memory_breakdown_print: | - CUDA1 (RTX 3090) | 24124 = 20000 + (3473 = 3351 + 40 + 82) + 649 |
178
+ llama_memory_breakdown_print: | - Host | 9865 = 9745 + 112 + 8 |
Benchmarks/DataCollection/Qwen3-30B-A3B-Thinking-2507-unsloth-iq4_nl-attn_kv_IQ4_NL-attn_output_IQ4_NL-attn_q_IQ4_NL-embeddings_MXFP4-ffn_down_IQ4_NL-ffn_up_gate_IQ4_NL-lm_head_Q5_K/perplexity_general.log ADDED
@@ -0,0 +1,178 @@
1
+ ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
2
+ ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
3
+ ggml_cuda_init: found 2 CUDA devices:
4
+ Device 0: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
5
+ Device 1: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
6
+ build: 7040 (92bb442ad) with cc (Ubuntu 13.3.0-6ubuntu2~24.04) 13.3.0 for x86_64-linux-gnu
7
+ llama_model_load_from_file_impl: using device CUDA0 (NVIDIA GeForce RTX 3090) (0000:01:00.0) - 20669 MiB free
8
+ llama_model_load_from_file_impl: using device CUDA1 (NVIDIA GeForce RTX 3090) (0000:03:00.0) - 23582 MiB free
9
+ llama_model_loader: loaded meta data with 39 key-value pairs and 579 tensors from /mnt/world8/AI/ToBench/Qwen3-30B-A3B-Thinking-2507-unsloth/Magic_Quant/GGUF/dc_round1_Qwen3-30B-A3B-Thinking-2507-unsloth-iq4_nl-attn_kv_IQ4_NL-attn_output_IQ4_NL-attn_q_IQ4_NL-embeddings_MXFP4-ffn_down_IQ4_NL-ffn_up_gate_IQ4_NL-lm_head_Q5_K.gguf (version GGUF V3 (latest))
10
+ llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
11
+ llama_model_loader: - kv 0: general.architecture str = qwen3moe
12
+ llama_model_loader: - kv 1: general.type str = model
13
+ llama_model_loader: - kv 2: general.name str = Qwen3 30B A3B Thinking 2507 Unsloth
14
+ llama_model_loader: - kv 3: general.version str = 2507
15
+ llama_model_loader: - kv 4: general.finetune str = Thinking-unsloth
16
+ llama_model_loader: - kv 5: general.basename str = Qwen3
17
+ llama_model_loader: - kv 6: general.size_label str = 30B-A3B
18
+ llama_model_loader: - kv 7: general.license str = apache-2.0
19
+ llama_model_loader: - kv 8: general.license.link str = https://huggingface.co/Qwen/Qwen3-30B...
20
+ llama_model_loader: - kv 9: general.base_model.count u32 = 1
21
+ llama_model_loader: - kv 10: general.base_model.0.name str = Qwen3 30B A3B Thinking 2507
22
+ llama_model_loader: - kv 11: general.base_model.0.version str = 2507
23
+ llama_model_loader: - kv 12: general.base_model.0.organization str = Qwen
24
+ llama_model_loader: - kv 13: general.base_model.0.repo_url str = https://huggingface.co/Qwen/Qwen3-30B...
25
+ llama_model_loader: - kv 14: general.tags arr[str,3] = ["qwen", "qwen3", "unsloth"]
26
+ llama_model_loader: - kv 15: qwen3moe.block_count u32 = 48
27
+ llama_model_loader: - kv 16: qwen3moe.context_length u32 = 262144
28
+ llama_model_loader: - kv 17: qwen3moe.embedding_length u32 = 2048
29
+ llama_model_loader: - kv 18: qwen3moe.feed_forward_length u32 = 6144
30
+ llama_model_loader: - kv 19: qwen3moe.attention.head_count u32 = 32
31
+ llama_model_loader: - kv 20: qwen3moe.attention.head_count_kv u32 = 4
32
+ llama_model_loader: - kv 21: qwen3moe.rope.freq_base f32 = 10000000.000000
33
+ llama_model_loader: - kv 22: qwen3moe.attention.layer_norm_rms_epsilon f32 = 0.000001
34
+ llama_model_loader: - kv 23: qwen3moe.expert_used_count u32 = 8
35
+ llama_model_loader: - kv 24: qwen3moe.attention.key_length u32 = 128
36
+ llama_model_loader: - kv 25: qwen3moe.attention.value_length u32 = 128
37
+ llama_model_loader: - kv 26: qwen3moe.expert_count u32 = 128
38
+ llama_model_loader: - kv 27: qwen3moe.expert_feed_forward_length u32 = 768
39
+ llama_model_loader: - kv 28: tokenizer.ggml.model str = gpt2
40
+ llama_model_loader: - kv 29: tokenizer.ggml.pre str = qwen2
41
+ llama_model_loader: - kv 30: tokenizer.ggml.tokens arr[str,151936] = ["!", "\"", "#", "$", "%", "&", "'", ...
42
+ llama_model_loader: - kv 31: tokenizer.ggml.token_type arr[i32,151936] = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
43
+ llama_model_loader: - kv 32: tokenizer.ggml.merges arr[str,151387] = ["Ġ Ġ", "ĠĠ ĠĠ", "i n", "Ġ t",...
44
+ llama_model_loader: - kv 33: tokenizer.ggml.eos_token_id u32 = 151645
45
+ llama_model_loader: - kv 34: tokenizer.ggml.padding_token_id u32 = 151654
46
+ llama_model_loader: - kv 35: tokenizer.ggml.add_bos_token bool = false
47
+ llama_model_loader: - kv 36: tokenizer.chat_template str = {%- if tools %}\n {{- '<|im_start|>...
48
+ llama_model_loader: - kv 37: general.quantization_version u32 = 2
49
+ llama_model_loader: - kv 38: general.file_type u32 = 25
50
+ llama_model_loader: - type f32: 241 tensors
51
+ llama_model_loader: - type q5_K: 1 tensors
52
+ llama_model_loader: - type iq4_nl: 336 tensors
53
+ llama_model_loader: - type mxfp4: 1 tensors
54
+ print_info: file format = GGUF V3 (latest)
55
+ print_info: file type = IQ4_NL - 4.5 bpw
56
+ print_info: file size = 16.06 GiB (4.52 BPW)
57
+ load: printing all EOG tokens:
58
+ load: - 151643 ('<|endoftext|>')
59
+ load: - 151645 ('<|im_end|>')
60
+ load: - 151662 ('<|fim_pad|>')
61
+ load: - 151663 ('<|repo_name|>')
62
+ load: - 151664 ('<|file_sep|>')
63
+ load: special tokens cache size = 26
64
+ load: token to piece cache size = 0.9311 MB
65
+ print_info: arch = qwen3moe
66
+ print_info: vocab_only = 0
67
+ print_info: n_ctx_train = 262144
68
+ print_info: n_embd = 2048
69
+ print_info: n_embd_inp = 2048
70
+ print_info: n_layer = 48
71
+ print_info: n_head = 32
72
+ print_info: n_head_kv = 4
73
+ print_info: n_rot = 128
74
+ print_info: n_swa = 0
75
+ print_info: is_swa_any = 0
76
+ print_info: n_embd_head_k = 128
77
+ print_info: n_embd_head_v = 128
78
+ print_info: n_gqa = 8
79
+ print_info: n_embd_k_gqa = 512
80
+ print_info: n_embd_v_gqa = 512
81
+ print_info: f_norm_eps = 0.0e+00
82
+ print_info: f_norm_rms_eps = 1.0e-06
83
+ print_info: f_clamp_kqv = 0.0e+00
84
+ print_info: f_max_alibi_bias = 0.0e+00
85
+ print_info: f_logit_scale = 0.0e+00
86
+ print_info: f_attn_scale = 0.0e+00
87
+ print_info: n_ff = 6144
88
+ print_info: n_expert = 128
89
+ print_info: n_expert_used = 8
90
+ print_info: n_expert_groups = 0
91
+ print_info: n_group_used = 0
92
+ print_info: causal attn = 1
93
+ print_info: pooling type = 0
94
+ print_info: rope type = 2
95
+ print_info: rope scaling = linear
96
+ print_info: freq_base_train = 10000000.0
97
+ print_info: freq_scale_train = 1
98
+ print_info: n_ctx_orig_yarn = 262144
99
+ print_info: rope_finetuned = unknown
100
+ print_info: model type = 30B.A3B
101
+ print_info: model params = 30.53 B
102
+ print_info: general.name = Qwen3 30B A3B Thinking 2507 Unsloth
103
+ print_info: n_ff_exp = 768
104
+ print_info: vocab type = BPE
105
+ print_info: n_vocab = 151936
106
+ print_info: n_merges = 151387
107
+ print_info: BOS token = 11 ','
108
+ print_info: EOS token = 151645 '<|im_end|>'
109
+ print_info: EOT token = 151645 '<|im_end|>'
110
+ print_info: PAD token = 151654 '<|vision_pad|>'
111
+ print_info: LF token = 198 'Ċ'
112
+ print_info: FIM PRE token = 151659 '<|fim_prefix|>'
113
+ print_info: FIM SUF token = 151661 '<|fim_suffix|>'
114
+ print_info: FIM MID token = 151660 '<|fim_middle|>'
115
+ print_info: FIM PAD token = 151662 '<|fim_pad|>'
116
+ print_info: FIM REP token = 151663 '<|repo_name|>'
117
+ print_info: FIM SEP token = 151664 '<|file_sep|>'
118
+ print_info: EOG token = 151643 '<|endoftext|>'
119
+ print_info: EOG token = 151645 '<|im_end|>'
120
+ print_info: EOG token = 151662 '<|fim_pad|>'
121
+ print_info: EOG token = 151663 '<|repo_name|>'
122
+ print_info: EOG token = 151664 '<|file_sep|>'
123
+ print_info: max token length = 256
124
+ load_tensors: loading model tensors, this can take a while... (mmap = true)
125
+ load_tensors: offloading 20 repeating layers to GPU
126
+ load_tensors: offloaded 20/49 layers to GPU
127
+ load_tensors: CPU_Mapped model buffer size = 9745.64 MiB
128
+ load_tensors: CUDA0 model buffer size = 3351.42 MiB
129
+ load_tensors: CUDA1 model buffer size = 3351.42 MiB
130
+ ...................................................................................................
131
+ llama_context: constructing llama_context
132
+ llama_context: n_seq_max = 1
133
+ llama_context: n_ctx = 2048
134
+ llama_context: n_ctx_seq = 2048
135
+ llama_context: n_batch = 2048
136
+ llama_context: n_ubatch = 512
137
+ llama_context: causal_attn = 1
138
+ llama_context: flash_attn = auto
139
+ llama_context: kv_unified = false
140
+ llama_context: freq_base = 10000000.0
141
+ llama_context: freq_scale = 1
142
+ llama_context: n_ctx_seq (2048) < n_ctx_train (262144) -- the full capacity of the model will not be utilized
143
+ llama_context: CPU output buffer size = 0.58 MiB
144
+ llama_kv_cache: CPU KV buffer size = 112.00 MiB
145
+ llama_kv_cache: CUDA0 KV buffer size = 40.00 MiB
146
+ llama_kv_cache: CUDA1 KV buffer size = 40.00 MiB
147
+ llama_kv_cache: size = 192.00 MiB ( 2048 cells, 48 layers, 1/1 seqs), K (f16): 96.00 MiB, V (f16): 96.00 MiB
148
+ llama_context: Flash Attention was auto, set to enabled
149
+ llama_context: CUDA0 compute buffer size = 504.77 MiB
150
+ llama_context: CUDA1 compute buffer size = 82.01 MiB
151
+ llama_context: CUDA_Host compute buffer size = 8.01 MiB
152
+ llama_context: graph nodes = 3031
153
+ llama_context: graph splits = 397 (with bs=512), 88 (with bs=1)
154
+ common_init_from_params: added <|endoftext|> logit bias = -inf
155
+ common_init_from_params: added <|im_end|> logit bias = -inf
156
+ common_init_from_params: added <|fim_pad|> logit bias = -inf
157
+ common_init_from_params: added <|repo_name|> logit bias = -inf
158
+ common_init_from_params: added <|file_sep|> logit bias = -inf
159
+ common_init_from_params: setting dry_penalty_last_n to ctx_size = 2048
160
+ common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable)
161
+
162
+ system_info: n_threads = 16 (n_threads_batch = 16) / 32 | CUDA : ARCHS = 860 | USE_GRAPHS = 1 | PEER_MAX_BATCH_SIZE = 128 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | BMI2 = 1 | AVX512 = 1 | AVX512_VBMI = 1 | AVX512_VNNI = 1 | AVX512_BF16 = 1 | LLAMAFILE = 1 | OPENMP = 1 | REPACK = 1 |
163
+ perplexity: tokenizing the input ..
164
+ perplexity: tokenization took 48.471 ms
165
+ perplexity: calculating perplexity over 15 chunks, n_ctx=2048, batch_size=2048, n_seq=1
166
+ perplexity: 3.26 seconds per pass - ETA 0.80 minutes
167
+ [1]5.4538,[2]6.4958,[3]6.8567,[4]6.7701,[5]6.6970,[6]5.7723,[7]5.2801,[8]5.3017,[9]5.6168,[10]5.7496,[11]5.7953,[12]6.0979,[13]6.1691,[14]6.2968,[15]6.3772,
168
+ Final estimate: PPL = 6.3772 +/- 0.13014
169
+
170
+ llama_perf_context_print: load time = 2328.69 ms
171
+ llama_perf_context_print: prompt eval time = 44950.22 ms / 30720 tokens ( 1.46 ms per token, 683.42 tokens per second)
172
+ llama_perf_context_print: eval time = 0.00 ms / 1 runs ( 0.00 ms per token, inf tokens per second)
173
+ llama_perf_context_print: total time = 45519.81 ms / 30721 tokens
174
+ llama_perf_context_print: graphs reused = 0
175
+ llama_memory_breakdown_print: | memory breakdown [MiB] | total free self model context compute unaccounted |
176
+ llama_memory_breakdown_print: | - CUDA0 (RTX 3090) | 24115 = 16664 + (3896 = 3351 + 40 + 504) + 3554 |
177
+ llama_memory_breakdown_print: | - CUDA1 (RTX 3090) | 24124 = 20000 + (3473 = 3351 + 40 + 82) + 649 |
178
+ llama_memory_breakdown_print: | - Host | 9865 = 9745 + 112 + 8 |
Benchmarks/DataCollection/Qwen3-30B-A3B-Thinking-2507-unsloth-iq4_nl-attn_kv_IQ4_NL-attn_output_IQ4_NL-attn_q_IQ4_NL-embeddings_MXFP4-ffn_down_IQ4_NL-ffn_up_gate_IQ4_NL-lm_head_Q5_K/perplexity_math.log ADDED
@@ -0,0 +1,178 @@
1
+ ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
2
+ ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
3
+ ggml_cuda_init: found 2 CUDA devices:
4
+ Device 0: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
5
+ Device 1: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
6
+ build: 7040 (92bb442ad) with cc (Ubuntu 13.3.0-6ubuntu2~24.04) 13.3.0 for x86_64-linux-gnu
7
+ llama_model_load_from_file_impl: using device CUDA0 (NVIDIA GeForce RTX 3090) (0000:01:00.0) - 20670 MiB free
8
+ llama_model_load_from_file_impl: using device CUDA1 (NVIDIA GeForce RTX 3090) (0000:03:00.0) - 23582 MiB free
9
+ llama_model_loader: loaded meta data with 39 key-value pairs and 579 tensors from /mnt/world8/AI/ToBench/Qwen3-30B-A3B-Thinking-2507-unsloth/Magic_Quant/GGUF/dc_round1_Qwen3-30B-A3B-Thinking-2507-unsloth-iq4_nl-attn_kv_IQ4_NL-attn_output_IQ4_NL-attn_q_IQ4_NL-embeddings_MXFP4-ffn_down_IQ4_NL-ffn_up_gate_IQ4_NL-lm_head_Q5_K.gguf (version GGUF V3 (latest))
10
+ llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
11
+ llama_model_loader: - kv 0: general.architecture str = qwen3moe
12
+ llama_model_loader: - kv 1: general.type str = model
13
+ llama_model_loader: - kv 2: general.name str = Qwen3 30B A3B Thinking 2507 Unsloth
14
+ llama_model_loader: - kv 3: general.version str = 2507
15
+ llama_model_loader: - kv 4: general.finetune str = Thinking-unsloth
16
+ llama_model_loader: - kv 5: general.basename str = Qwen3
17
+ llama_model_loader: - kv 6: general.size_label str = 30B-A3B
18
+ llama_model_loader: - kv 7: general.license str = apache-2.0
19
+ llama_model_loader: - kv 8: general.license.link str = https://huggingface.co/Qwen/Qwen3-30B...
20
+ llama_model_loader: - kv 9: general.base_model.count u32 = 1
21
+ llama_model_loader: - kv 10: general.base_model.0.name str = Qwen3 30B A3B Thinking 2507
22
+ llama_model_loader: - kv 11: general.base_model.0.version str = 2507
23
+ llama_model_loader: - kv 12: general.base_model.0.organization str = Qwen
24
+ llama_model_loader: - kv 13: general.base_model.0.repo_url str = https://huggingface.co/Qwen/Qwen3-30B...
25
+ llama_model_loader: - kv 14: general.tags arr[str,3] = ["qwen", "qwen3", "unsloth"]
26
+ llama_model_loader: - kv 15: qwen3moe.block_count u32 = 48
27
+ llama_model_loader: - kv 16: qwen3moe.context_length u32 = 262144
28
+ llama_model_loader: - kv 17: qwen3moe.embedding_length u32 = 2048
29
+ llama_model_loader: - kv 18: qwen3moe.feed_forward_length u32 = 6144
30
+ llama_model_loader: - kv 19: qwen3moe.attention.head_count u32 = 32
31
+ llama_model_loader: - kv 20: qwen3moe.attention.head_count_kv u32 = 4
32
+ llama_model_loader: - kv 21: qwen3moe.rope.freq_base f32 = 10000000.000000
33
+ llama_model_loader: - kv 22: qwen3moe.attention.layer_norm_rms_epsilon f32 = 0.000001
34
+ llama_model_loader: - kv 23: qwen3moe.expert_used_count u32 = 8
35
+ llama_model_loader: - kv 24: qwen3moe.attention.key_length u32 = 128
36
+ llama_model_loader: - kv 25: qwen3moe.attention.value_length u32 = 128
37
+ llama_model_loader: - kv 26: qwen3moe.expert_count u32 = 128
38
+ llama_model_loader: - kv 27: qwen3moe.expert_feed_forward_length u32 = 768
39
+ llama_model_loader: - kv 28: tokenizer.ggml.model str = gpt2
40
+ llama_model_loader: - kv 29: tokenizer.ggml.pre str = qwen2
41
+ llama_model_loader: - kv 30: tokenizer.ggml.tokens arr[str,151936] = ["!", "\"", "#", "$", "%", "&", "'", ...
42
+ llama_model_loader: - kv 31: tokenizer.ggml.token_type arr[i32,151936] = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
43
+ llama_model_loader: - kv 32: tokenizer.ggml.merges arr[str,151387] = ["Ġ Ġ", "ĠĠ ĠĠ", "i n", "Ġ t",...
44
+ llama_model_loader: - kv 33: tokenizer.ggml.eos_token_id u32 = 151645
45
+ llama_model_loader: - kv 34: tokenizer.ggml.padding_token_id u32 = 151654
46
+ llama_model_loader: - kv 35: tokenizer.ggml.add_bos_token bool = false
47
+ llama_model_loader: - kv 36: tokenizer.chat_template str = {%- if tools %}\n {{- '<|im_start|>...
48
+ llama_model_loader: - kv 37: general.quantization_version u32 = 2
49
+ llama_model_loader: - kv 38: general.file_type u32 = 25
50
+ llama_model_loader: - type f32: 241 tensors
51
+ llama_model_loader: - type q5_K: 1 tensors
52
+ llama_model_loader: - type iq4_nl: 336 tensors
53
+ llama_model_loader: - type mxfp4: 1 tensors
54
+ print_info: file format = GGUF V3 (latest)
55
+ print_info: file type = IQ4_NL - 4.5 bpw
56
+ print_info: file size = 16.06 GiB (4.52 BPW)
57
+ load: printing all EOG tokens:
58
+ load: - 151643 ('<|endoftext|>')
59
+ load: - 151645 ('<|im_end|>')
60
+ load: - 151662 ('<|fim_pad|>')
61
+ load: - 151663 ('<|repo_name|>')
62
+ load: - 151664 ('<|file_sep|>')
63
+ load: special tokens cache size = 26
64
+ load: token to piece cache size = 0.9311 MB
65
+ print_info: arch = qwen3moe
66
+ print_info: vocab_only = 0
67
+ print_info: n_ctx_train = 262144
68
+ print_info: n_embd = 2048
69
+ print_info: n_embd_inp = 2048
70
+ print_info: n_layer = 48
71
+ print_info: n_head = 32
72
+ print_info: n_head_kv = 4
73
+ print_info: n_rot = 128
74
+ print_info: n_swa = 0
75
+ print_info: is_swa_any = 0
76
+ print_info: n_embd_head_k = 128
77
+ print_info: n_embd_head_v = 128
78
+ print_info: n_gqa = 8
79
+ print_info: n_embd_k_gqa = 512
80
+ print_info: n_embd_v_gqa = 512
81
+ print_info: f_norm_eps = 0.0e+00
82
+ print_info: f_norm_rms_eps = 1.0e-06
83
+ print_info: f_clamp_kqv = 0.0e+00
84
+ print_info: f_max_alibi_bias = 0.0e+00
85
+ print_info: f_logit_scale = 0.0e+00
86
+ print_info: f_attn_scale = 0.0e+00
87
+ print_info: n_ff = 6144
88
+ print_info: n_expert = 128
89
+ print_info: n_expert_used = 8
90
+ print_info: n_expert_groups = 0
91
+ print_info: n_group_used = 0
92
+ print_info: causal attn = 1
93
+ print_info: pooling type = 0
94
+ print_info: rope type = 2
95
+ print_info: rope scaling = linear
96
+ print_info: freq_base_train = 10000000.0
97
+ print_info: freq_scale_train = 1
98
+ print_info: n_ctx_orig_yarn = 262144
99
+ print_info: rope_finetuned = unknown
100
+ print_info: model type = 30B.A3B
101
+ print_info: model params = 30.53 B
102
+ print_info: general.name = Qwen3 30B A3B Thinking 2507 Unsloth
103
+ print_info: n_ff_exp = 768
104
+ print_info: vocab type = BPE
105
+ print_info: n_vocab = 151936
106
+ print_info: n_merges = 151387
107
+ print_info: BOS token = 11 ','
108
+ print_info: EOS token = 151645 '<|im_end|>'
109
+ print_info: EOT token = 151645 '<|im_end|>'
110
+ print_info: PAD token = 151654 '<|vision_pad|>'
111
+ print_info: LF token = 198 'Ċ'
112
+ print_info: FIM PRE token = 151659 '<|fim_prefix|>'
113
+ print_info: FIM SUF token = 151661 '<|fim_suffix|>'
114
+ print_info: FIM MID token = 151660 '<|fim_middle|>'
115
+ print_info: FIM PAD token = 151662 '<|fim_pad|>'
116
+ print_info: FIM REP token = 151663 '<|repo_name|>'
117
+ print_info: FIM SEP token = 151664 '<|file_sep|>'
118
+ print_info: EOG token = 151643 '<|endoftext|>'
119
+ print_info: EOG token = 151645 '<|im_end|>'
120
+ print_info: EOG token = 151662 '<|fim_pad|>'
121
+ print_info: EOG token = 151663 '<|repo_name|>'
122
+ print_info: EOG token = 151664 '<|file_sep|>'
123
+ print_info: max token length = 256
124
+ load_tensors: loading model tensors, this can take a while... (mmap = true)
125
+ load_tensors: offloading 20 repeating layers to GPU
126
+ load_tensors: offloaded 20/49 layers to GPU
127
+ load_tensors: CPU_Mapped model buffer size = 9745.64 MiB
128
+ load_tensors: CUDA0 model buffer size = 3351.42 MiB
129
+ load_tensors: CUDA1 model buffer size = 3351.42 MiB
130
+ ...................................................................................................
131
+ llama_context: constructing llama_context
132
+ llama_context: n_seq_max = 1
133
+ llama_context: n_ctx = 2048
134
+ llama_context: n_ctx_seq = 2048
135
+ llama_context: n_batch = 2048
136
+ llama_context: n_ubatch = 512
137
+ llama_context: causal_attn = 1
138
+ llama_context: flash_attn = auto
139
+ llama_context: kv_unified = false
140
+ llama_context: freq_base = 10000000.0
141
+ llama_context: freq_scale = 1
142
+ llama_context: n_ctx_seq (2048) < n_ctx_train (262144) -- the full capacity of the model will not be utilized
143
+ llama_context: CPU output buffer size = 0.58 MiB
144
+ llama_kv_cache: CPU KV buffer size = 112.00 MiB
145
+ llama_kv_cache: CUDA0 KV buffer size = 40.00 MiB
146
+ llama_kv_cache: CUDA1 KV buffer size = 40.00 MiB
147
+ llama_kv_cache: size = 192.00 MiB ( 2048 cells, 48 layers, 1/1 seqs), K (f16): 96.00 MiB, V (f16): 96.00 MiB
148
+ llama_context: Flash Attention was auto, set to enabled
149
+ llama_context: CUDA0 compute buffer size = 504.77 MiB
150
+ llama_context: CUDA1 compute buffer size = 82.01 MiB
151
+ llama_context: CUDA_Host compute buffer size = 8.01 MiB
152
+ llama_context: graph nodes = 3031
153
+ llama_context: graph splits = 397 (with bs=512), 88 (with bs=1)
154
+ common_init_from_params: added <|endoftext|> logit bias = -inf
155
+ common_init_from_params: added <|im_end|> logit bias = -inf
156
+ common_init_from_params: added <|fim_pad|> logit bias = -inf
157
+ common_init_from_params: added <|repo_name|> logit bias = -inf
158
+ common_init_from_params: added <|file_sep|> logit bias = -inf
159
+ common_init_from_params: setting dry_penalty_last_n to ctx_size = 2048
160
+ common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable)
161
+
162
+ system_info: n_threads = 16 (n_threads_batch = 16) / 32 | CUDA : ARCHS = 860 | USE_GRAPHS = 1 | PEER_MAX_BATCH_SIZE = 128 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | BMI2 = 1 | AVX512 = 1 | AVX512_VBMI = 1 | AVX512_VNNI = 1 | AVX512_BF16 = 1 | LLAMAFILE = 1 | OPENMP = 1 | REPACK = 1 |
163
+ perplexity: tokenizing the input ..
164
+ perplexity: tokenization took 45.919 ms
165
+ perplexity: calculating perplexity over 16 chunks, n_ctx=2048, batch_size=2048, n_seq=1
166
+ perplexity: 3.36 seconds per pass - ETA 0.88 minutes
167
+ [1]4.5968,[2]5.0125,[3]5.2798,[4]5.4458,[5]5.6643,[6]5.6658,[7]5.6468,[8]5.5969,[9]5.6531,[10]5.6451,[11]5.6541,[12]5.6482,[13]5.7220,[14]5.7333,[15]5.7271,[16]5.7351,
168
+ Final estimate: PPL = 5.7351 +/- 0.10510
169
+
170
+ llama_perf_context_print: load time = 2390.66 ms
171
+ llama_perf_context_print: prompt eval time = 49186.34 ms / 32768 tokens ( 1.50 ms per token, 666.20 tokens per second)
172
+ llama_perf_context_print: eval time = 0.00 ms / 1 runs ( 0.00 ms per token, inf tokens per second)
173
+ llama_perf_context_print: total time = 49704.78 ms / 32769 tokens
174
+ llama_perf_context_print: graphs reused = 0
175
+ llama_memory_breakdown_print: | memory breakdown [MiB] | total free self model context compute unaccounted |
176
+ llama_memory_breakdown_print: | - CUDA0 (RTX 3090) | 24115 = 16667 + (3896 = 3351 + 40 + 504) + 3551 |
177
+ llama_memory_breakdown_print: | - CUDA1 (RTX 3090) | 24124 = 20000 + (3473 = 3351 + 40 + 82) + 649 |
178
+ llama_memory_breakdown_print: | - Host | 9865 = 9745 + 112 + 8 |
Benchmarks/DataCollection/Qwen3-30B-A3B-Thinking-2507-unsloth-iq4_nl-attn_kv_IQ4_NL-attn_output_IQ4_NL-attn_q_IQ4_NL-embeddings_MXFP4-ffn_down_IQ4_NL-ffn_up_gate_IQ4_NL-lm_head_Q8_0/bench_metrics.json ADDED
@@ -0,0 +1,44 @@
1
+ {
2
+ "raw_metrics": {
3
+ "llamabench": {
4
+ "backend": "CUDA",
5
+ "log_path": "Qwen3-30B-A3B-Thinking-2507-unsloth-iq4_nl-attn_kv_IQ4_NL-attn_output_IQ4_NL-attn_q_IQ4_NL-embeddings_MXFP4-ffn_down_IQ4_NL-ffn_up_gate_IQ4_NL-lm_head_Q8_0/llamabench.md",
6
+ "ngl": "35",
7
+ "raw_row": {
8
+ "backend": "CUDA",
9
+ "model": "qwen3moe 30B.A3B IQ4_NL - 4.5 bpw",
10
+ "ngl": "35",
11
+ "params": "30.53 B",
12
+ "size": "16.17 GiB",
13
+ "t/s": "138.04 \u00b1 8.36",
14
+ "test": "pp8",
15
+ "tps_value": 138.04
16
+ },
17
+ "test": "pp8",
18
+ "tps": 138.04
19
+ },
20
+ "perplexity": {
21
+ "code": {
22
+ "log_path": "Qwen3-30B-A3B-Thinking-2507-unsloth-iq4_nl-attn_kv_IQ4_NL-attn_output_IQ4_NL-attn_q_IQ4_NL-embeddings_MXFP4-ffn_down_IQ4_NL-ffn_up_gate_IQ4_NL-lm_head_Q8_0/perplexity_code.log",
23
+ "ppl": 1.3055,
24
+ "ppl_error": 0.00706
25
+ },
26
+ "general": {
27
+ "log_path": "Qwen3-30B-A3B-Thinking-2507-unsloth-iq4_nl-attn_kv_IQ4_NL-attn_output_IQ4_NL-attn_q_IQ4_NL-embeddings_MXFP4-ffn_down_IQ4_NL-ffn_up_gate_IQ4_NL-lm_head_Q8_0/perplexity_general.log",
28
+ "ppl": 6.3615,
29
+ "ppl_error": 0.12957
30
+ },
31
+ "math": {
32
+ "log_path": "Qwen3-30B-A3B-Thinking-2507-unsloth-iq4_nl-attn_kv_IQ4_NL-attn_output_IQ4_NL-attn_q_IQ4_NL-embeddings_MXFP4-ffn_down_IQ4_NL-ffn_up_gate_IQ4_NL-lm_head_Q8_0/perplexity_math.log",
33
+ "ppl": 5.7044,
34
+ "ppl_error": 0.10406
35
+ }
36
+ }
37
+ },
38
+ "summary": {
39
+ "avg_prec_loss_pct": 0.9219,
40
+ "bench_tps": 138.04,
41
+ "file_size_bytes": 17370126368,
42
+ "file_size_gb": 16.18
43
+ }
44
+ }
Benchmarks/DataCollection/Qwen3-30B-A3B-Thinking-2507-unsloth-iq4_nl-attn_kv_IQ4_NL-attn_output_IQ4_NL-attn_q_IQ4_NL-embeddings_MXFP4-ffn_down_IQ4_NL-ffn_up_gate_IQ4_NL-lm_head_Q8_0/llamabench.md ADDED
@@ -0,0 +1,11 @@
 
1
+ ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
2
+ ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
3
+ ggml_cuda_init: found 2 CUDA devices:
4
+ Device 0: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
5
+ Device 1: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
6
+ | model | size | params | backend | ngl | test | t/s |
7
+ | ------------------------------ | ---------: | ---------: | ---------- | --: | --------------: | -------------------: |
8
+ | qwen3moe 30B.A3B IQ4_NL - 4.5 bpw | 16.17 GiB | 30.53 B | CUDA | 35 | pp8 | 138.04 ± 8.36 |
9
+ | qwen3moe 30B.A3B IQ4_NL - 4.5 bpw | 16.17 GiB | 30.53 B | CUDA | 35 | tg128 | 44.46 ± 0.42 |
10
+
11
+ build: 92bb442ad (7040)
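The llamabench block inside the neighbouring bench_metrics.json stores a raw_row whose fields line up with the columns of this table (model, size, params, backend, ngl, test, t/s). A minimal parsing sketch for a table in this format; this is a hypothetical helper, not necessarily the script that produced the JSON:

```python
# Hypothetical parser (not the repo's collection script): extract the
# test name and t/s value from a llamabench.md table in the format above.
import re

def parse_llamabench(md_text: str) -> list[dict]:
    rows = []
    for line in md_text.splitlines():
        cells = [c.strip() for c in line.strip().strip("|").split("|")]
        if len(cells) != 7:
            continue                      # not a 7-column table row
        m = re.match(r"^([\d.]+)\s*±", cells[6])
        if not m:
            continue                      # header or separator row ("t/s", "---:")
        rows.append({"model": cells[0], "test": cells[5],
                     "tps_value": float(m.group(1))})
    return rows

# parse_llamabench(open("llamabench.md").read()) ->
# [{'model': 'qwen3moe 30B.A3B IQ4_NL - 4.5 bpw', 'test': 'pp8', 'tps_value': 138.04}, ...]
```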
Benchmarks/DataCollection/Qwen3-30B-A3B-Thinking-2507-unsloth-iq4_nl-attn_kv_IQ4_NL-attn_output_IQ4_NL-attn_q_IQ4_NL-embeddings_MXFP4-ffn_down_IQ4_NL-ffn_up_gate_IQ4_NL-lm_head_Q8_0/perplexity_code.log ADDED
@@ -0,0 +1,178 @@
1
+ ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
2
+ ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
3
+ ggml_cuda_init: found 2 CUDA devices:
4
+ Device 0: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
5
+ Device 1: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
6
+ build: 7040 (92bb442ad) with cc (Ubuntu 13.3.0-6ubuntu2~24.04) 13.3.0 for x86_64-linux-gnu
7
+ llama_model_load_from_file_impl: using device CUDA0 (NVIDIA GeForce RTX 3090) (0000:01:00.0) - 20658 MiB free
8
+ llama_model_load_from_file_impl: using device CUDA1 (NVIDIA GeForce RTX 3090) (0000:03:00.0) - 23582 MiB free
9
+ llama_model_loader: loaded meta data with 39 key-value pairs and 579 tensors from /mnt/world8/AI/ToBench/Qwen3-30B-A3B-Thinking-2507-unsloth/Magic_Quant/GGUF/dc_round2_Qwen3-30B-A3B-Thinking-2507-unsloth-iq4_nl-attn_kv_IQ4_NL-attn_output_IQ4_NL-attn_q_IQ4_NL-embeddings_MXFP4-ffn_down_IQ4_NL-ffn_up_gate_IQ4_NL-lm_head_Q8_0.gguf (version GGUF V3 (latest))
10
+ llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
11
+ llama_model_loader: - kv 0: general.architecture str = qwen3moe
12
+ llama_model_loader: - kv 1: general.type str = model
13
+ llama_model_loader: - kv 2: general.name str = Qwen3 30B A3B Thinking 2507 Unsloth
14
+ llama_model_loader: - kv 3: general.version str = 2507
15
+ llama_model_loader: - kv 4: general.finetune str = Thinking-unsloth
16
+ llama_model_loader: - kv 5: general.basename str = Qwen3
17
+ llama_model_loader: - kv 6: general.size_label str = 30B-A3B
18
+ llama_model_loader: - kv 7: general.license str = apache-2.0
19
+ llama_model_loader: - kv 8: general.license.link str = https://huggingface.co/Qwen/Qwen3-30B...
20
+ llama_model_loader: - kv 9: general.base_model.count u32 = 1
21
+ llama_model_loader: - kv 10: general.base_model.0.name str = Qwen3 30B A3B Thinking 2507
22
+ llama_model_loader: - kv 11: general.base_model.0.version str = 2507
23
+ llama_model_loader: - kv 12: general.base_model.0.organization str = Qwen
24
+ llama_model_loader: - kv 13: general.base_model.0.repo_url str = https://huggingface.co/Qwen/Qwen3-30B...
25
+ llama_model_loader: - kv 14: general.tags arr[str,3] = ["qwen", "qwen3", "unsloth"]
26
+ llama_model_loader: - kv 15: qwen3moe.block_count u32 = 48
27
+ llama_model_loader: - kv 16: qwen3moe.context_length u32 = 262144
28
+ llama_model_loader: - kv 17: qwen3moe.embedding_length u32 = 2048
29
+ llama_model_loader: - kv 18: qwen3moe.feed_forward_length u32 = 6144
30
+ llama_model_loader: - kv 19: qwen3moe.attention.head_count u32 = 32
31
+ llama_model_loader: - kv 20: qwen3moe.attention.head_count_kv u32 = 4
32
+ llama_model_loader: - kv 21: qwen3moe.rope.freq_base f32 = 10000000.000000
33
+ llama_model_loader: - kv 22: qwen3moe.attention.layer_norm_rms_epsilon f32 = 0.000001
34
+ llama_model_loader: - kv 23: qwen3moe.expert_used_count u32 = 8
35
+ llama_model_loader: - kv 24: qwen3moe.attention.key_length u32 = 128
36
+ llama_model_loader: - kv 25: qwen3moe.attention.value_length u32 = 128
37
+ llama_model_loader: - kv 26: qwen3moe.expert_count u32 = 128
38
+ llama_model_loader: - kv 27: qwen3moe.expert_feed_forward_length u32 = 768
39
+ llama_model_loader: - kv 28: tokenizer.ggml.model str = gpt2
40
+ llama_model_loader: - kv 29: tokenizer.ggml.pre str = qwen2
41
+ llama_model_loader: - kv 30: tokenizer.ggml.tokens arr[str,151936] = ["!", "\"", "#", "$", "%", "&", "'", ...
42
+ llama_model_loader: - kv 31: tokenizer.ggml.token_type arr[i32,151936] = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
43
+ llama_model_loader: - kv 32: tokenizer.ggml.merges arr[str,151387] = ["Ġ Ġ", "ĠĠ ĠĠ", "i n", "Ġ t",...
44
+ llama_model_loader: - kv 33: tokenizer.ggml.eos_token_id u32 = 151645
45
+ llama_model_loader: - kv 34: tokenizer.ggml.padding_token_id u32 = 151654
46
+ llama_model_loader: - kv 35: tokenizer.ggml.add_bos_token bool = false
47
+ llama_model_loader: - kv 36: tokenizer.chat_template str = {%- if tools %}\n {{- '<|im_start|>...
48
+ llama_model_loader: - kv 37: general.quantization_version u32 = 2
49
+ llama_model_loader: - kv 38: general.file_type u32 = 25
50
+ llama_model_loader: - type f32: 241 tensors
51
+ llama_model_loader: - type q8_0: 1 tensors
52
+ llama_model_loader: - type iq4_nl: 336 tensors
53
+ llama_model_loader: - type mxfp4: 1 tensors
54
+ print_info: file format = GGUF V3 (latest)
55
+ print_info: file type = IQ4_NL - 4.5 bpw
56
+ print_info: file size = 16.17 GiB (4.55 BPW)
57
+ load: printing all EOG tokens:
58
+ load: - 151643 ('<|endoftext|>')
59
+ load: - 151645 ('<|im_end|>')
60
+ load: - 151662 ('<|fim_pad|>')
61
+ load: - 151663 ('<|repo_name|>')
62
+ load: - 151664 ('<|file_sep|>')
63
+ load: special tokens cache size = 26
64
+ load: token to piece cache size = 0.9311 MB
65
+ print_info: arch = qwen3moe
66
+ print_info: vocab_only = 0
67
+ print_info: n_ctx_train = 262144
68
+ print_info: n_embd = 2048
69
+ print_info: n_embd_inp = 2048
70
+ print_info: n_layer = 48
71
+ print_info: n_head = 32
72
+ print_info: n_head_kv = 4
73
+ print_info: n_rot = 128
74
+ print_info: n_swa = 0
75
+ print_info: is_swa_any = 0
76
+ print_info: n_embd_head_k = 128
77
+ print_info: n_embd_head_v = 128
78
+ print_info: n_gqa = 8
79
+ print_info: n_embd_k_gqa = 512
80
+ print_info: n_embd_v_gqa = 512
81
+ print_info: f_norm_eps = 0.0e+00
82
+ print_info: f_norm_rms_eps = 1.0e-06
83
+ print_info: f_clamp_kqv = 0.0e+00
84
+ print_info: f_max_alibi_bias = 0.0e+00
85
+ print_info: f_logit_scale = 0.0e+00
86
+ print_info: f_attn_scale = 0.0e+00
87
+ print_info: n_ff = 6144
88
+ print_info: n_expert = 128
89
+ print_info: n_expert_used = 8
90
+ print_info: n_expert_groups = 0
91
+ print_info: n_group_used = 0
92
+ print_info: causal attn = 1
93
+ print_info: pooling type = 0
94
+ print_info: rope type = 2
95
+ print_info: rope scaling = linear
96
+ print_info: freq_base_train = 10000000.0
97
+ print_info: freq_scale_train = 1
98
+ print_info: n_ctx_orig_yarn = 262144
99
+ print_info: rope_finetuned = unknown
100
+ print_info: model type = 30B.A3B
101
+ print_info: model params = 30.53 B
102
+ print_info: general.name = Qwen3 30B A3B Thinking 2507 Unsloth
103
+ print_info: n_ff_exp = 768
104
+ print_info: vocab type = BPE
105
+ print_info: n_vocab = 151936
106
+ print_info: n_merges = 151387
107
+ print_info: BOS token = 11 ','
108
+ print_info: EOS token = 151645 '<|im_end|>'
109
+ print_info: EOT token = 151645 '<|im_end|>'
110
+ print_info: PAD token = 151654 '<|vision_pad|>'
111
+ print_info: LF token = 198 'Ċ'
112
+ print_info: FIM PRE token = 151659 '<|fim_prefix|>'
113
+ print_info: FIM SUF token = 151661 '<|fim_suffix|>'
114
+ print_info: FIM MID token = 151660 '<|fim_middle|>'
115
+ print_info: FIM PAD token = 151662 '<|fim_pad|>'
116
+ print_info: FIM REP token = 151663 '<|repo_name|>'
117
+ print_info: FIM SEP token = 151664 '<|file_sep|>'
118
+ print_info: EOG token = 151643 '<|endoftext|>'
119
+ print_info: EOG token = 151645 '<|im_end|>'
120
+ print_info: EOG token = 151662 '<|fim_pad|>'
121
+ print_info: EOG token = 151663 '<|repo_name|>'
122
+ print_info: EOG token = 151664 '<|file_sep|>'
123
+ print_info: max token length = 256
124
+ load_tensors: loading model tensors, this can take a while... (mmap = true)
125
+ load_tensors: offloading 20 repeating layers to GPU
126
+ load_tensors: offloaded 20/49 layers to GPU
127
+ load_tensors: CPU_Mapped model buffer size = 9856.92 MiB
128
+ load_tensors: CUDA0 model buffer size = 3351.42 MiB
129
+ load_tensors: CUDA1 model buffer size = 3351.42 MiB
130
+ ...................................................................................................
131
+ llama_context: constructing llama_context
132
+ llama_context: n_seq_max = 1
133
+ llama_context: n_ctx = 2048
134
+ llama_context: n_ctx_seq = 2048
135
+ llama_context: n_batch = 2048
136
+ llama_context: n_ubatch = 512
137
+ llama_context: causal_attn = 1
138
+ llama_context: flash_attn = auto
139
+ llama_context: kv_unified = false
140
+ llama_context: freq_base = 10000000.0
141
+ llama_context: freq_scale = 1
142
+ llama_context: n_ctx_seq (2048) < n_ctx_train (262144) -- the full capacity of the model will not be utilized
143
+ llama_context: CPU output buffer size = 0.58 MiB
144
+ llama_kv_cache: CPU KV buffer size = 112.00 MiB
145
+ llama_kv_cache: CUDA0 KV buffer size = 40.00 MiB
146
+ llama_kv_cache: CUDA1 KV buffer size = 40.00 MiB
147
+ llama_kv_cache: size = 192.00 MiB ( 2048 cells, 48 layers, 1/1 seqs), K (f16): 96.00 MiB, V (f16): 96.00 MiB
148
+ llama_context: Flash Attention was auto, set to enabled
149
+ llama_context: CUDA0 compute buffer size = 616.05 MiB
150
+ llama_context: CUDA1 compute buffer size = 82.01 MiB
151
+ llama_context: CUDA_Host compute buffer size = 8.01 MiB
152
+ llama_context: graph nodes = 3031
153
+ llama_context: graph splits = 397 (with bs=512), 88 (with bs=1)
154
+ common_init_from_params: added <|endoftext|> logit bias = -inf
155
+ common_init_from_params: added <|im_end|> logit bias = -inf
156
+ common_init_from_params: added <|fim_pad|> logit bias = -inf
157
+ common_init_from_params: added <|repo_name|> logit bias = -inf
158
+ common_init_from_params: added <|file_sep|> logit bias = -inf
159
+ common_init_from_params: setting dry_penalty_last_n to ctx_size = 2048
160
+ common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable)
161
+
162
+ system_info: n_threads = 16 (n_threads_batch = 16) / 32 | CUDA : ARCHS = 860 | USE_GRAPHS = 1 | PEER_MAX_BATCH_SIZE = 128 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | BMI2 = 1 | AVX512 = 1 | AVX512_VBMI = 1 | AVX512_VNNI = 1 | AVX512_BF16 = 1 | LLAMAFILE = 1 | OPENMP = 1 | REPACK = 1 |
163
+ perplexity: tokenizing the input ..
164
+ perplexity: tokenization took 114.273 ms
165
+ perplexity: calculating perplexity over 44 chunks, n_ctx=2048, batch_size=2048, n_seq=1
166
+ perplexity: 3.35 seconds per pass - ETA 2.45 minutes
167
+ [1]1.5110,[2]1.4554,[3]1.2846,[4]1.2451,[5]1.3348,[6]1.3984,[7]1.4012,[8]1.3983,[9]1.3587,[10]1.3358,[11]1.3216,[12]1.3237,[13]1.3084,[14]1.3000,[15]1.2953,[16]1.2841,[17]1.2782,[18]1.2772,[19]1.2697,[20]1.2604,[21]1.2577,[22]1.2579,[23]1.2739,[24]1.2669,[25]1.2654,[26]1.2567,[27]1.2513,[28]1.2506,[29]1.2632,[30]1.2650,[31]1.2585,[32]1.2535,[33]1.2542,[34]1.2534,[35]1.2523,[36]1.2735,[37]1.2829,[38]1.2876,[39]1.2944,[40]1.2955,[41]1.2921,[42]1.3054,[43]1.3052,[44]1.3055,
168
+ Final estimate: PPL = 1.3055 +/- 0.00706
169
+
170
+ llama_perf_context_print: load time = 2382.39 ms
171
+ llama_perf_context_print: prompt eval time = 125241.33 ms / 90112 tokens ( 1.39 ms per token, 719.51 tokens per second)
172
+ llama_perf_context_print: eval time = 0.00 ms / 1 runs ( 0.00 ms per token, inf tokens per second)
173
+ llama_perf_context_print: total time = 126477.18 ms / 90113 tokens
174
+ llama_perf_context_print: graphs reused = 0
175
+ llama_memory_breakdown_print: | memory breakdown [MiB] | total free self model context compute unaccounted |
176
+ llama_memory_breakdown_print: | - CUDA0 (RTX 3090) | 24115 = 16537 + (4007 = 3351 + 40 + 616) + 3570 |
177
+ llama_memory_breakdown_print: | - CUDA1 (RTX 3090) | 24124 = 20000 + (3473 = 3351 + 40 + 82) + 649 |
178
+ llama_memory_breakdown_print: | - Host | 9976 = 9856 + 112 + 8 |
Benchmarks/DataCollection/Qwen3-30B-A3B-Thinking-2507-unsloth-iq4_nl-attn_kv_IQ4_NL-attn_output_IQ4_NL-attn_q_IQ4_NL-embeddings_MXFP4-ffn_down_IQ4_NL-ffn_up_gate_IQ4_NL-lm_head_Q8_0/perplexity_general.log ADDED
@@ -0,0 +1,178 @@
1
+ ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
2
+ ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
3
+ ggml_cuda_init: found 2 CUDA devices:
4
+ Device 0: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
5
+ Device 1: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
6
+ build: 7040 (92bb442ad) with cc (Ubuntu 13.3.0-6ubuntu2~24.04) 13.3.0 for x86_64-linux-gnu
7
+ llama_model_load_from_file_impl: using device CUDA0 (NVIDIA GeForce RTX 3090) (0000:01:00.0) - 20659 MiB free
8
+ llama_model_load_from_file_impl: using device CUDA1 (NVIDIA GeForce RTX 3090) (0000:03:00.0) - 23582 MiB free
9
+ llama_model_loader: loaded meta data with 39 key-value pairs and 579 tensors from /mnt/world8/AI/ToBench/Qwen3-30B-A3B-Thinking-2507-unsloth/Magic_Quant/GGUF/dc_round2_Qwen3-30B-A3B-Thinking-2507-unsloth-iq4_nl-attn_kv_IQ4_NL-attn_output_IQ4_NL-attn_q_IQ4_NL-embeddings_MXFP4-ffn_down_IQ4_NL-ffn_up_gate_IQ4_NL-lm_head_Q8_0.gguf (version GGUF V3 (latest))
10
+ llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
11
+ llama_model_loader: - kv 0: general.architecture str = qwen3moe
12
+ llama_model_loader: - kv 1: general.type str = model
13
+ llama_model_loader: - kv 2: general.name str = Qwen3 30B A3B Thinking 2507 Unsloth
14
+ llama_model_loader: - kv 3: general.version str = 2507
15
+ llama_model_loader: - kv 4: general.finetune str = Thinking-unsloth
16
+ llama_model_loader: - kv 5: general.basename str = Qwen3
17
+ llama_model_loader: - kv 6: general.size_label str = 30B-A3B
18
+ llama_model_loader: - kv 7: general.license str = apache-2.0
19
+ llama_model_loader: - kv 8: general.license.link str = https://huggingface.co/Qwen/Qwen3-30B...
20
+ llama_model_loader: - kv 9: general.base_model.count u32 = 1
21
+ llama_model_loader: - kv 10: general.base_model.0.name str = Qwen3 30B A3B Thinking 2507
22
+ llama_model_loader: - kv 11: general.base_model.0.version str = 2507
23
+ llama_model_loader: - kv 12: general.base_model.0.organization str = Qwen
24
+ llama_model_loader: - kv 13: general.base_model.0.repo_url str = https://huggingface.co/Qwen/Qwen3-30B...
25
+ llama_model_loader: - kv 14: general.tags arr[str,3] = ["qwen", "qwen3", "unsloth"]
26
+ llama_model_loader: - kv 15: qwen3moe.block_count u32 = 48
27
+ llama_model_loader: - kv 16: qwen3moe.context_length u32 = 262144
28
+ llama_model_loader: - kv 17: qwen3moe.embedding_length u32 = 2048
29
+ llama_model_loader: - kv 18: qwen3moe.feed_forward_length u32 = 6144
30
+ llama_model_loader: - kv 19: qwen3moe.attention.head_count u32 = 32
31
+ llama_model_loader: - kv 20: qwen3moe.attention.head_count_kv u32 = 4
32
+ llama_model_loader: - kv 21: qwen3moe.rope.freq_base f32 = 10000000.000000
33
+ llama_model_loader: - kv 22: qwen3moe.attention.layer_norm_rms_epsilon f32 = 0.000001
34
+ llama_model_loader: - kv 23: qwen3moe.expert_used_count u32 = 8
35
+ llama_model_loader: - kv 24: qwen3moe.attention.key_length u32 = 128
36
+ llama_model_loader: - kv 25: qwen3moe.attention.value_length u32 = 128
37
+ llama_model_loader: - kv 26: qwen3moe.expert_count u32 = 128
38
+ llama_model_loader: - kv 27: qwen3moe.expert_feed_forward_length u32 = 768
39
+ llama_model_loader: - kv 28: tokenizer.ggml.model str = gpt2
40
+ llama_model_loader: - kv 29: tokenizer.ggml.pre str = qwen2
41
+ llama_model_loader: - kv 30: tokenizer.ggml.tokens arr[str,151936] = ["!", "\"", "#", "$", "%", "&", "'", ...
42
+ llama_model_loader: - kv 31: tokenizer.ggml.token_type arr[i32,151936] = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
43
+ llama_model_loader: - kv 32: tokenizer.ggml.merges arr[str,151387] = ["Ġ Ġ", "ĠĠ ĠĠ", "i n", "Ġ t",...
44
+ llama_model_loader: - kv 33: tokenizer.ggml.eos_token_id u32 = 151645
45
+ llama_model_loader: - kv 34: tokenizer.ggml.padding_token_id u32 = 151654
46
+ llama_model_loader: - kv 35: tokenizer.ggml.add_bos_token bool = false
47
+ llama_model_loader: - kv 36: tokenizer.chat_template str = {%- if tools %}\n {{- '<|im_start|>...
48
+ llama_model_loader: - kv 37: general.quantization_version u32 = 2
49
+ llama_model_loader: - kv 38: general.file_type u32 = 25
50
+ llama_model_loader: - type f32: 241 tensors
51
+ llama_model_loader: - type q8_0: 1 tensors
52
+ llama_model_loader: - type iq4_nl: 336 tensors
53
+ llama_model_loader: - type mxfp4: 1 tensors
54
+ print_info: file format = GGUF V3 (latest)
55
+ print_info: file type = IQ4_NL - 4.5 bpw
56
+ print_info: file size = 16.17 GiB (4.55 BPW)
57
+ load: printing all EOG tokens:
58
+ load: - 151643 ('<|endoftext|>')
59
+ load: - 151645 ('<|im_end|>')
60
+ load: - 151662 ('<|fim_pad|>')
61
+ load: - 151663 ('<|repo_name|>')
62
+ load: - 151664 ('<|file_sep|>')
63
+ load: special tokens cache size = 26
64
+ load: token to piece cache size = 0.9311 MB
65
+ print_info: arch = qwen3moe
66
+ print_info: vocab_only = 0
67
+ print_info: n_ctx_train = 262144
68
+ print_info: n_embd = 2048
69
+ print_info: n_embd_inp = 2048
70
+ print_info: n_layer = 48
71
+ print_info: n_head = 32
72
+ print_info: n_head_kv = 4
73
+ print_info: n_rot = 128
74
+ print_info: n_swa = 0
75
+ print_info: is_swa_any = 0
76
+ print_info: n_embd_head_k = 128
77
+ print_info: n_embd_head_v = 128
78
+ print_info: n_gqa = 8
79
+ print_info: n_embd_k_gqa = 512
80
+ print_info: n_embd_v_gqa = 512
81
+ print_info: f_norm_eps = 0.0e+00
82
+ print_info: f_norm_rms_eps = 1.0e-06
83
+ print_info: f_clamp_kqv = 0.0e+00
84
+ print_info: f_max_alibi_bias = 0.0e+00
85
+ print_info: f_logit_scale = 0.0e+00
86
+ print_info: f_attn_scale = 0.0e+00
87
+ print_info: n_ff = 6144
88
+ print_info: n_expert = 128
89
+ print_info: n_expert_used = 8
90
+ print_info: n_expert_groups = 0
91
+ print_info: n_group_used = 0
92
+ print_info: causal attn = 1
93
+ print_info: pooling type = 0
94
+ print_info: rope type = 2
95
+ print_info: rope scaling = linear
96
+ print_info: freq_base_train = 10000000.0
97
+ print_info: freq_scale_train = 1
98
+ print_info: n_ctx_orig_yarn = 262144
99
+ print_info: rope_finetuned = unknown
100
+ print_info: model type = 30B.A3B
101
+ print_info: model params = 30.53 B
102
+ print_info: general.name = Qwen3 30B A3B Thinking 2507 Unsloth
103
+ print_info: n_ff_exp = 768
104
+ print_info: vocab type = BPE
105
+ print_info: n_vocab = 151936
106
+ print_info: n_merges = 151387
107
+ print_info: BOS token = 11 ','
108
+ print_info: EOS token = 151645 '<|im_end|>'
109
+ print_info: EOT token = 151645 '<|im_end|>'
110
+ print_info: PAD token = 151654 '<|vision_pad|>'
111
+ print_info: LF token = 198 'Ċ'
112
+ print_info: FIM PRE token = 151659 '<|fim_prefix|>'
113
+ print_info: FIM SUF token = 151661 '<|fim_suffix|>'
114
+ print_info: FIM MID token = 151660 '<|fim_middle|>'
115
+ print_info: FIM PAD token = 151662 '<|fim_pad|>'
116
+ print_info: FIM REP token = 151663 '<|repo_name|>'
117
+ print_info: FIM SEP token = 151664 '<|file_sep|>'
118
+ print_info: EOG token = 151643 '<|endoftext|>'
119
+ print_info: EOG token = 151645 '<|im_end|>'
120
+ print_info: EOG token = 151662 '<|fim_pad|>'
121
+ print_info: EOG token = 151663 '<|repo_name|>'
122
+ print_info: EOG token = 151664 '<|file_sep|>'
123
+ print_info: max token length = 256
124
+ load_tensors: loading model tensors, this can take a while... (mmap = true)
125
+ load_tensors: offloading 20 repeating layers to GPU
126
+ load_tensors: offloaded 20/49 layers to GPU
127
+ load_tensors: CPU_Mapped model buffer size = 9856.92 MiB
128
+ load_tensors: CUDA0 model buffer size = 3351.42 MiB
129
+ load_tensors: CUDA1 model buffer size = 3351.42 MiB
130
+ ...................................................................................................
131
+ llama_context: constructing llama_context
132
+ llama_context: n_seq_max = 1
133
+ llama_context: n_ctx = 2048
134
+ llama_context: n_ctx_seq = 2048
135
+ llama_context: n_batch = 2048
136
+ llama_context: n_ubatch = 512
137
+ llama_context: causal_attn = 1
138
+ llama_context: flash_attn = auto
139
+ llama_context: kv_unified = false
140
+ llama_context: freq_base = 10000000.0
141
+ llama_context: freq_scale = 1
142
+ llama_context: n_ctx_seq (2048) < n_ctx_train (262144) -- the full capacity of the model will not be utilized
143
+ llama_context: CPU output buffer size = 0.58 MiB
144
+ llama_kv_cache: CPU KV buffer size = 112.00 MiB
145
+ llama_kv_cache: CUDA0 KV buffer size = 40.00 MiB
146
+ llama_kv_cache: CUDA1 KV buffer size = 40.00 MiB
147
+ llama_kv_cache: size = 192.00 MiB ( 2048 cells, 48 layers, 1/1 seqs), K (f16): 96.00 MiB, V (f16): 96.00 MiB
148
+ llama_context: Flash Attention was auto, set to enabled
149
+ llama_context: CUDA0 compute buffer size = 616.05 MiB
150
+ llama_context: CUDA1 compute buffer size = 82.01 MiB
151
+ llama_context: CUDA_Host compute buffer size = 8.01 MiB
152
+ llama_context: graph nodes = 3031
153
+ llama_context: graph splits = 397 (with bs=512), 88 (with bs=1)
154
+ common_init_from_params: added <|endoftext|> logit bias = -inf
155
+ common_init_from_params: added <|im_end|> logit bias = -inf
156
+ common_init_from_params: added <|fim_pad|> logit bias = -inf
157
+ common_init_from_params: added <|repo_name|> logit bias = -inf
158
+ common_init_from_params: added <|file_sep|> logit bias = -inf
159
+ common_init_from_params: setting dry_penalty_last_n to ctx_size = 2048
160
+ common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable)
161
+
162
+ system_info: n_threads = 16 (n_threads_batch = 16) / 32 | CUDA : ARCHS = 860 | USE_GRAPHS = 1 | PEER_MAX_BATCH_SIZE = 128 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | BMI2 = 1 | AVX512 = 1 | AVX512_VBMI = 1 | AVX512_VNNI = 1 | AVX512_BF16 = 1 | LLAMAFILE = 1 | OPENMP = 1 | REPACK = 1 |
163
+ perplexity: tokenizing the input ..
164
+ perplexity: tokenization took 48.386 ms
165
+ perplexity: calculating perplexity over 15 chunks, n_ctx=2048, batch_size=2048, n_seq=1
166
+ perplexity: 3.29 seconds per pass - ETA 0.82 minutes
167
+ [1]5.4240,[2]6.4707,[3]6.8596,[4]6.7633,[5]6.6858,[6]5.7635,[7]5.2691,[8]5.2953,[9]5.6070,[10]5.7377,[11]5.7823,[12]6.0837,[13]6.1553,[14]6.2838,[15]6.3615,
168
+ Final estimate: PPL = 6.3615 +/- 0.12957
169
+
170
+ llama_perf_context_print: load time = 2389.34 ms
171
+ llama_perf_context_print: prompt eval time = 45389.07 ms / 30720 tokens ( 1.48 ms per token, 676.81 tokens per second)
172
+ llama_perf_context_print: eval time = 0.00 ms / 1 runs ( 0.00 ms per token, inf tokens per second)
173
+ llama_perf_context_print: total time = 45816.96 ms / 30721 tokens
174
+ llama_perf_context_print: graphs reused = 0
175
+ llama_memory_breakdown_print: | memory breakdown [MiB] | total free self model context compute unaccounted |
176
+ llama_memory_breakdown_print: | - CUDA0 (RTX 3090) | 24115 = 16542 + (4007 = 3351 + 40 + 616) + 3565 |
177
+ llama_memory_breakdown_print: | - CUDA1 (RTX 3090) | 24124 = 20000 + (3473 = 3351 + 40 + 82) + 649 |
178
+ llama_memory_breakdown_print: | - Host | 9976 = 9856 + 112 + 8 |
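For reference, the "Final estimate: PPL = x +/- e" lines in these logs are perplexities, i.e. exp of the mean per-token negative log-likelihood over the evaluated chunks, and the bracketed values ([1], [2], ...) are the running estimate after each chunk. A minimal sketch of one standard way to compute such a figure and attach an uncertainty; the exact error formula used by the tool that wrote these logs is not shown here:

```python
# Minimal sketch: perplexity as exp(mean negative log-likelihood), with an
# uncertainty from the standard error of the mean NLL propagated through exp().
import math

def perplexity_with_error(logprobs: list[float]) -> tuple[float, float]:
    # `logprobs` = natural-log probabilities the model assigned to each evaluated token
    n = len(logprobs)
    nll = [-lp for lp in logprobs]
    mean_nll = sum(nll) / n
    ppl = math.exp(mean_nll)
    var = sum((x - mean_nll) ** 2 for x in nll) / max(n - 1, 1)
    sem = math.sqrt(var / n)          # standard error of the mean NLL
    return ppl, ppl * sem             # first-order error propagation through exp()

# Example: perplexity_with_error([-1.2, -0.7, -2.3]) -> (ppl, ppl_error)
```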