9 90

Proto_AGI PRO

mayafree

AI & ML interests

None yet

Recent Activity

upvoted an article about 7 hours ago

Darwin V6: Diagnostic-Guided Evolutionary Model Merging

liked a Space about 12 hours ago

FINAL-Bench/Darwin-4B-Opus

liked a model about 12 hours ago

FINAL-Bench/Darwin-4B-Opus

View all activity

Organizations

upvoted an article about 7 hours ago

Article

Darwin V6: Diagnostic-Guided Evolutionary Model Merging

about 11 hours ago

•

liked a Space about 12 hours ago

Darwin-31B-Opus

👀

gemma-4-31B-it + gemma-4-31B-it-Claude-Opus-Distill

liked a model about 12 hours ago

FINAL-Bench/Darwin-4B-Opus

Text Generation • Updated about 11 hours ago • 12

reactedto SeaWolf-AI's post with 👍 about 15 hours ago

Post

2232

🧬 Darwin V6: Diagnostic-Guided Evolutionary Model Merging

We are releasing Darwin-31B-Opus — a reasoning-enhanced model merging Google's Gemma-4-31B-it and TeichAI's Claude Opus Distill using the Darwin V6 engine.

Model: FINAL-Bench/Darwin-31B-Opus
Demo: FINAL-Bench/Darwin-31B-Opus

🔬 What Darwin V6 Does

Conventional merging tools (mergekit, etc.) apply a single ratio to all tensors. Set ratio=0.5 and all 1,188 tensors blend identically, with no distinction between which tensors matter for reasoning versus coding.

Darwin V6 diagnoses both parents at the tensor level before merging. It measures Shannon entropy, standard deviation, and L2 norm for every tensor, then passes 5 diagnostic probes (REASONING, CODE, MATH, KNOWLEDGE, LANGUAGE) through the model to determine layer-wise functional importance. Each of the 1,188 tensors receives an independent optimal ratio.

combined = static(entropy/std/norm) x 0.4 + probe(cosine_distance) x 0.6
final_ratio = mri_ratio x mri_trust + genome_ratio x (1 - mri_trust)

When one parent is overwhelmingly superior for a tensor (ratio < 0.15 or > 0.85), Darwin transplants it directly without interpolation. The mri_trust parameter itself is optimized by CMA-ES evolutionary search, so optimal transplant intensity is determined automatically. After merging, a Health Check compares the child against both parents layer-by-layer to detect interference or function loss.

🧬 Parent Models
Father: google/gemma-4-31B-it
Mother: TeichAI/gemma-4-31B-it-Claude-Opus-Distill

🧬 Results
Compared under identical conditions (same 50 questions, same seed, greedy, thinking mode):
Father: 60.0% (30/50)
Darwin-31B-Opus: 66.0% (33/50) — +10% relative improvement
ARC-Challenge: 82.89% (loglikelihood, zero-shot, 200 questions)
Optimal genome found by evolution:
ffn_ratio=0.93 — FFN layers strongly favor Mother (Claude Opus Distill)
block_5 (L50-L59)=0.86 and more...

8 replies

liked a model about 15 hours ago

FINAL-Bench/Darwin-9B-Opus

Text Generation • 10B • Updated about 12 hours ago • 763 • 18

liked 2 Spaces about 15 hours ago

Darwin 35B A3B Opus

👀

The child surpassed both parents — that is evolution

Darwin-31B-Opus

👀

gemma-4-31B-it + gemma-4-31B-it-Claude-Opus-Distill

liked a model about 15 hours ago

FINAL-Bench/Darwin-31B-Opus

Text Generation • 33B • Updated about 12 hours ago • 147 • 19

reactedto SeaWolf-AI's post with 🚀 6 days ago

Post

3120

💎 Gemma 4 Playground — Dual Model Demo on ZeroGPU

We just launched a Gemma 4 Playground that lets you chat with Google DeepMind's latest open models — directly on Hugging Face Spaces with ZeroGPU.

FINAL-Bench/Gemma-4-Multi

👉 Try it now: FINAL-Bench/Gemma-4-Multi
Two Models, One Space
Switch between both Gemma 4 variants in a single interface:

⚡ Gemma 4 26B-A4B — MoE with 128 experts, only 3.8B active params. 95% of the 31B's quality at ~8x faster inference. AIME 88.3%, GPQA 82.3%.
🏆 Gemma 4 31B — Dense 30.7B. Best quality among Gemma 4 family. AIME 89.2%, GPQA 84.3%, Codeforces 2150. Arena open-model top 3.

Features

Vision — Upload images for analysis, OCR, chart reading, document parsing
Thinking Mode — Toggle chain-of-thought reasoning with Gemma 4's native <|channel> thinking tokens
System Prompts — 6 presets (General, Code, Math, Creative, Translate, Research) or write your own
Streaming — Real-time token-by-token response via ZeroGPU
Apache 2.0 — Fully open, no restrictions

Technical Details
Built with the dev build of transformers (5.5.0.dev0) for full Gemma 4 support including multimodal apply_chat_template, variable-resolution image processing, and native thinking mode. Runs on HF ZeroGPU with @spaces .GPU — no dedicated GPU needed.
Both models support 256K context window and 140+ languages out of the box.

Links

- 🤗 Space: [FINAL-Bench/Gemma-4-Multi]( FINAL-Bench/Gemma-4-Multi)
- 📄 Gemma 4 26B-A4B: [google/gemma-4-26B-A4B-it]( google/gemma-4-26B-A4B-it)
- 📄 Gemma 4 31B: [google/gemma-4-31B-it]( google/gemma-4-31B-it)
- 🔬 DeepMind Blog: [Gemma 4 Launch](https://deepmind.google/blog/gemma-4-byte-for-byte-the-most-capable-open-models/)

liked a Space 6 days ago

Gemma-4 Multichat

👀

Gemma 4 — MoE 26B or Dense 31B, Vision, Thinking

liked 2 models 6 days ago

FINAL-Bench/Darwin-35B-A3B-Opus-Q8-GGUF

Text Generation • 35B • Updated about 12 hours ago • 1.05k • 15

bartowski/FINAL-Bench_Darwin-35B-A3B-Opus-GGUF

Image-Text-to-Text • 35B • Updated 7 days ago • 10.2k • 18

upvoted an article 8 days ago

Article

"The Child That Surpassed Both Parents Through MRI-Guided Evolutionary Merge"

8 days ago

•

reactedto SeaWolf-AI's post with 👍🔥 8 days ago

Post

2144

🧬 Darwin-35B-A3B-Opus — The Child That Surpassed Both Parents

What if a merged model could beat both its parents? We proved it can.
Darwin-35B-A3B-Opus is a 35B MoE model (3B active) built with our Darwin V5 engine — the first evolution system that CT-scans parent models before merging them.
🤗 Model: FINAL-Bench/Darwin-35B-A3B-Opus

The result speaks for itself: GPQA Diamond 90.0%, versus Father (Qwen3.5-35B-A3B) at 84.2% and Mother (Claude 4.6 Opus Distilled) at 85.0%. That's +6.9% over Father and +5.9% over Mother. Not a tradeoff — a genuine leap. Meanwhile, MMMLU sits at 85.0% (Father: 85.2%), multimodal is fully intact, and all 201 languages are preserved.

How? Model MRI changed everything. Traditional merging is guesswork. Darwin V4 added evolution. Darwin V5 added X-ray vision. Model MRI scans each parent layer by layer and discovers: Mother's L34–L38 is the reasoning engine (peak cosine distance), 50–65% of Mother's experts are dead (killed by text-only distillation), and Father is a healthy generalist with every expert alive. The prescription: transplant Mother's reasoning brain at L38 (90% weight), replace her dead experts with Father's living ones, and let Father's router handle the output layer. Reasoning went up. Versatility stayed intact. No tradeoff — just evolution.

35B total, 3B active (MoE) · GPQA Diamond 90.0% · MMMLU 85.0% (201 languages) · Multimodal Image & Video · 262K native context · 147.8 tok/s on H100 · Runs on a single RTX 4090 (Q4) · Apache 2.0
Darwin V5's full algorithm and technical details will be released alongside an upcoming paper.

🚀 Live Demo: FINAL-Bench/Darwin-35B-A3B-Opus

🏆 FINAL Bench Leaderboard: FINAL-Bench/Leaderboard

📊 ALL Bench Leaderboard: FINAL-Bench/all-bench-leaderboard

Built by VIDRAFT · Supported by the Korean Government GPU Support Program

8 replies

liked a Space 8 days ago

Darwin 35B A3B Opus

👀

The child surpassed both parents — that is evolution

liked a model 8 days ago

FINAL-Bench/Darwin-35B-A3B-Opus

Image-Text-to-Text • 36B • Updated about 12 hours ago • 1.33k • 63

reactedto SeaWolf-AI's post with ❤️ 9 days ago

Post

4655

🌍 World Model Bench — does your world model actually think?

FID measures realism. FVD measures smoothness. But neither tells you whether the model understood the scene.

We just released WM Bench — the first benchmark for cognitive intelligence in world models. The core question: when a beast charges from 3 meters away, does the model know to sprint — not walk? Does it respond differently to a human vs an animal? Does it remember the left corridor was blocked two steps ago?

Those are cognitive questions. No existing benchmark asks them. So we built one.

3 Pillars · 10 Categories · 100 Scenarios · 1,000-point scale

- 👁 P1 Perception (25%) — Can it read the scene?
- 🧠 P2 Cognition (45%) — Does it predict threats, escalate emotions, utilize memory?
- 🔥 P3 Embodiment (30%) — Does the body respond with the right motion?

All evaluation is via simple JSON I/O — no 3D engine, no special hardware. Any model with an API can participate.

We also built PROMETHEUS as a live reference implementation — runs in your browser on a T4, no install needed. Combines FloodDiffusion motion generation with a LLM cognitive brain (Perceive → Predict → Decide → Act). Scored 726/1000 (Grade B) on Track C — the only directly verified model so far. Submissions from other teams very welcome.

---

🗂 Dataset → FINAL-Bench/World-Model
🌍 Demo → FINAL-Bench/World-Model
🏆 Leaderboard → FINAL-Bench/worldmodel-bench
📝 Article → https://huggingface.co/blog/FINAL-Bench/world-model

Part of the FINAL Bench Family — alongside FINAL Bench (Feb 2026). Feedback on rubrics and missing models always welcome!