
NFTCID

AI & ML interests

None yet

Organizations

None yet

liked a model 3 days ago

LiquidAI/LFM2.5-VL-450M

Image-Text-to-Text • 0.4B • Updated about 1 month ago • 53.6k • 169

reacted to Crownelius's post with 👍 8 days ago

Post

3797

[DAY ONE] PROJECT CROWFEATHER 4/30/2026
...The day I forgot to attach wandb.ai
Just dropped Crowfeather-50m, the first checkpoint in a series, and yeah, no graphs.

https://huggingface.co/Crowfeather/Crowfeather-50m

54.5M params. Pretrain only. 17,500 steps banked on FineWeb-edu before Thunder credits ran dry. About 2.3B tokens, no SFT yet.
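As a rough sanity check, the quoted step and token counts imply a per-step token budget and a tokens-per-parameter ratio. The sketch below only divides the numbers given in the post. The "about 2.3B" figure is approximate, so the results are approximate too, and the batch configuration itself is not stated in the post.

```python
# Back-of-the-envelope check using only the numbers quoted in the post.
PARAMS = 54.5e6   # 54.5M parameters
STEPS = 17_500    # optimizer steps completed so far
TOKENS = 2.3e9    # "about 2.3B tokens" (approximate)

tokens_per_step = TOKENS / STEPS
tokens_per_param = TOKENS / PARAMS

print(f"tokens per step  ~ {tokens_per_step:,.0f}")
print(f"tokens per param ~ {tokens_per_param:.1f}")
```

The per-step figure lands near 2^17 = 131,072 tokens, which would be consistent with a power-of-two batch, though that is a guess rather than something the post confirms.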

Architecture: Gemma-4 alternating sliding/global attention (1024 window, last layer always global) plus DeepSeek-V4 Muon optimizer plus WSD scheduler plus Gemma-2 logit soft-cap plus PaLM z-loss. Recipe in the model card.
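The ingredients named above can be sketched in a few lines of plain Python. This is a minimal illustration, not the recipe from the model card: the cap value, z-loss coefficient, warmup length, decay fraction, peak learning rate, and the strict one-to-one sliding/global alternation are all assumptions chosen for the example, and every function name here is hypothetical.

```python
import math

def soft_cap(logits, cap=30.0):
    """Logit soft-capping: squash values smoothly into [-cap, cap] with tanh."""
    return [cap * math.tanh(x / cap) for x in logits]

def z_loss(logits, coeff=1e-4):
    """Auxiliary z-loss: coeff * (log Z)^2, where Z = sum(exp(logits))."""
    m = max(logits)  # subtract the max for numerical stability
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return coeff * log_z ** 2

def wsd_lr(step, total_steps, peak_lr=1e-3, warmup=1_000, decay_frac=0.1):
    """Warmup-stable-decay: linear warmup, flat plateau, linear decay to zero."""
    decay_start = int(total_steps * (1 - decay_frac))
    if step < warmup:
        return peak_lr * step / warmup
    if step < decay_start:
        return peak_lr
    return peak_lr * max(0.0, (total_steps - step) / (total_steps - decay_start))

def layer_pattern(n_layers):
    """Alternate sliding and global attention (assumed 1:1), last layer global."""
    kinds = ["sliding" if i % 2 == 0 else "global" for i in range(n_layers)]
    kinds[-1] = "global"
    return kinds

print(soft_cap([-100.0, 0.0, 100.0]))
print(layer_pattern(5))
print(wsd_lr(17_500, 100_000))
```

With these assumed settings, step 17,500 of 100k sits on the stable plateau, which is the property that makes a WSD schedule convenient for resuming a run from an intermediate checkpoint.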

What it can do: writes grammatical English. Knows that France has Rhine-adjacent monasteries (it picked Rouen instead of Paris but the vocabulary is in there). Tells stories about Mr. Fabien.

What it can't do yet: facts, code, math. Base LM, no SFT, no instruction tuning.

The series:
- Every additional training run becomes another model card here
- Every model card gets a matching post on this profile
- Continuation goes to Colab next, picking up from step 17,500 out of 100k

Limited to one post a day on Hugging Face, so updates will trickle out at that pace. Follow [@Crownelius](https://huggingface.co/Crownelius) and [@Crowfeather](https://huggingface.co/Crowfeather) if you want to watch this thing learn in public. Next drop will either come with the finished pre-train or whatever step I land on before the bank takes my credit card away.

Graphs will be available on my NEXT model lol

-Shane

3 replies

reacted to sequelbox's post with 👀 8 days ago

Post

3229

EARLY SNEAK PREVIEW of our first DeepSeek-V4-Pro dataset, Tachibana 4!

Tachibana 4 is our upcoming agentic coding dataset:
- Questions prioritize real-world, challenging agentic coding tasks across a variety of programming languages and topics.
- Areas of focus include back-end and front-end development, systems programming, distributed systems, performance optimization, data structures, databases and data engineering, game and mobile development, security engineering, compiler design, custom tooling, task automation, practical bugfixes, and more!
- A wide variety of emphasized languages improves development capability: Python, C, C++, C#, Go, TypeScript, Java, JavaScript, Rust, Haskell, SQL, Shell, R, Ruby, assembly code, and more!
- Synthetic prompts utilize a variety of personas, experience levels, and styles of communication to maximize real-world flexibility and usability.

Get it now: https://huggingface.co/datasets/sequelbox/Tachibana4-DeepSeek-V4-Pro-PREVIEW

These agentic datasets will power the upcoming Esper 4, and whatever you can build! We'll have more finetunes on the way as well! :) we're going to make open source better and better for your work!

If you would like to see Esper 4 and these datasets faster, this is the best way you can help us: https://huggingface.co/spaces/sequelbox/SupportOpenSource

for freedom, with love,
allegra

reacted to ManniX-ITA's post with 🚀 8 days ago

Post

2998

🚀 Two releases this week pushing merge methodology forward.

▶ Qwen3.6-27B-Omnimerge-v4-MLP
ManniX-ITA/Qwen3.6-27B-Omnimerge-v4

Same-base DARE-TIES merge of Qwen3.6-27B + 3 fine-tunes (rico03 Claude distill, Esper3.1, kai-os Opus reasoning anchor) via my Omnimerge_v2 method (OBIM-lite + DAREx-q + EMR election).

Hit a Qwen3.6-specific fragility: hyperparams that work flawlessly on 3.5 produced 80% unclosed-<think> on 3.6, collapsing pass@1 to ~20%. Per-tensor delta forensics localized the failure to mlp.{gate,up,down}_proj in layers 27–52. Fix: MLP-passthrough surgery (copy MLPs verbatim from base, keep merged attn + linear_attn). Leak → 0%.
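The passthrough idea can be illustrated on plain dictionaries standing in for state dicts. This is a sketch under stated assumptions, not the author's script: the key layout (`layers.<i>.mlp.<gate|up|down>_proj.`) is an assumed naming scheme, `mlp_passthrough` is a hypothetical helper, and the post does not say whether MLPs were restored in every layer or only in layers 27–52, so the range is left as a parameter.

```python
import re

# Matches MLP projection tensors; this key layout is an assumed naming scheme.
MLP_KEY = re.compile(r"layers\.(\d+)\.mlp\.(gate|up|down)_proj\.")

def mlp_passthrough(merged, base, first_layer=27, last_layer=52):
    """Copy `merged`, taking MLP tensors in the given layer range from `base`."""
    out = dict(merged)
    for key in merged:
        match = MLP_KEY.search(key)
        if match and first_layer <= int(match.group(1)) <= last_layer:
            out[key] = base[key]  # restore the base weights verbatim
    return out

# Toy example: strings stand in for tensors.
merged = {
    "model.layers.30.mlp.gate_proj.weight": "merged",
    "model.layers.30.self_attn.q_proj.weight": "merged",
    "model.layers.5.mlp.up_proj.weight": "merged",
}
base = {key: "base" for key in merged}
patched = mlp_passthrough(merged, base)
for key, value in patched.items():
    print(key, "->", value)
```

Only the layer-30 MLP tensor is replaced here; attention weights and MLPs outside the range keep their merged values.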

Q6_K results (vs Qwen3.6 base / vs Omnimerge-v2 on Qwen3.5):
• HumanEval: 84.76% (= base, +5.49 pp vs v2)
• MBPP corrected: 73.40% (+15.80 pp vs base, ≈ v2)
• GPQA Diamond: ~84.75% partial 192/198 (+15.5 pp vs v2)

▶ Qwen3.5-4B Importance-Signal Study (M1..M5)

Controlled 5-way comparison: same Qwen3.5-4B base, same 2 fine-tunes (Jackrong Claude-4.5 distill + Crow Opus-4.6 distill), only the importance signal driving DARE-TIES sparsification varies.

Q6_K HE / MBPP pass@1:
• M1 Vanilla DARE-TIES → 51.22 / 47.00
• M2 OMv2 (no signal) → 52.44 / 49.40
• M3 OMv2 + Fisher → 57.93 🥇 / 48.80
• M4 mergekit ex-LRP (PR #682) → 51.22 / 49.40
• M5 OMv2 + LRP → 53.05 / 51.40 🥇

Findings: Fisher wins HE (+4.88 pp over vanilla), LRP wins MBPP (+2.60 pp). Both signals + Omnimerge_v2 recipe beat vanilla. To make multimodal-LM ex-LRP work end-to-end against Qwen3_5ForConditionalGeneration, I filed 5 patches against arcee-ai/mergekit PR #682 + 1 against rachtibat/lxt.
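The percentage-point gaps can be recomputed directly from the M1..M5 scores listed above. The sketch below uses only those listed values; `delta_pp` is a hypothetical helper. On these numbers the gaps over M1 (vanilla) come out larger than the quoted ones, while +4.88 and +2.60 pp match the gaps between M3 and M5, so the quoted deltas may be measured between the two signal runs. The post does not say which baseline was intended.

```python
# Q6_K pass@1 scores as listed in the post: (HumanEval, MBPP).
scores = {
    "M1": (51.22, 47.00),  # vanilla DARE-TIES
    "M2": (52.44, 49.40),  # OMv2, no signal
    "M3": (57.93, 48.80),  # OMv2 + Fisher
    "M4": (51.22, 49.40),  # mergekit ex-LRP
    "M5": (53.05, 51.40),  # OMv2 + LRP
}

def delta_pp(a, b, metric):
    """Percentage-point gap of run `a` over run `b`; metric 0 = HE, 1 = MBPP."""
    return round(scores[a][metric] - scores[b][metric], 2)

print("Fisher vs vanilla, HE  :", delta_pp("M3", "M1", 0))
print("LRP vs vanilla, MBPP   :", delta_pp("M5", "M1", 1))
print("Fisher vs LRP, HE      :", delta_pp("M3", "M5", 0))
print("LRP vs Fisher, MBPP    :", delta_pp("M5", "M3", 1))
```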

All five Mx checkpoints + Fisher/LRP signal safetensors + reproducer scripts published.

1 reply

upvoted 3 papers 8 days ago

Turning the TIDE: Cross-Architecture Distillation for Diffusion Large Language Models

Paper • 2604.26951 • Published 10 days ago • 46

RADIO-ViPE: Online Tightly Coupled Multi-Modal Fusion for Open-Vocabulary Semantic SLAM in Dynamic Environments

Paper • 2604.26067 • Published 11 days ago • 73

GLM-5V-Turbo: Toward a Native Foundation Model for Multimodal Agents

Paper • 2604.26752 • Published 10 days ago • 100

upvoted a changelog 10 days ago

Hugging Face Changelog

Spaces agents.md for your coding agents

22 days ago

• 243

liked 2 Spaces 3 months ago

DINOv3 Video Tracking

🐠

In-browser video tracking, powered by Transformers.js

SoulX-Singer

🎤

157

Generate singing voice from lyrics or convert voices

reacted to AdinaY's post with 🚀 3 months ago

Post

3784

MiniMax M2.5 is now available on the hub 🚀

MiniMaxAI/MiniMax-M2.5

✨ 229B - Modified MIT license
✨ 37% faster than M2.1
✨ ~$1/hour at 100 TPS
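For comparison with per-token pricing, the hourly figure can be converted to a cost per million tokens. This is a rough sketch that assumes continuous generation at the quoted rate for the full hour, which the post does not claim.

```python
# Convert "~$1/hour at 100 TPS" into an approximate cost per million tokens.
price_per_hour = 1.0      # USD, approximate figure from the post
tokens_per_second = 100   # quoted throughput

tokens_per_hour = tokens_per_second * 3600
usd_per_million = price_per_hour / tokens_per_hour * 1_000_000

print(f"{tokens_per_hour:,} tokens/hour, about ${usd_per_million:.2f} per million tokens")
```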

2 replies

liked a dataset 3 months ago

nvidia/PhysicalAI-Robotics-GR00T-Teleop-Sim

Viewer • Updated Dec 17, 2025 • 5.82M • 5.46k • 17

liked 2 Spaces 3 months ago

MioTTS 0.1B Demo

📈

TTS demo for MioTTS-0.1B

Z Image Turbo

🖼

3.12k

Generate custom images from text prompts in seconds

reacted to danielhanchen's post with 🔥 3 months ago

Post

5225

We collaborated with Hugging Face to enable you to train MoE models 12× faster with 35% less VRAM via our new Triton kernels (no accuracy loss). 🤗

Train gpt-oss locally on 12.8GB VRAM with our free notebooks: https://unsloth.ai/docs/new/faster-moe

1 reply

liked a Space 3 months ago

Transformer Training Visualized

🚀

Visualize GPT training with weights, gradients, and attention

reacted to marksverdhei's post with 🔥 3 months ago

Post

4595

Poll: Will 2026 be the year of subquadratic attention?

The transformer architecture is cursed by its computational complexity. It is why you run out of tokens and have to compact. But some would argue that this is a feature, not a bug, and that this is also why these models are so good. We've been doing a lot of research on trying to make equally good models that are computationally cheaper, but so far none of the approaches have stood the test of time. Or so it seems.

Please vote, don't be shy. Remember that the Dunning-Kruger effect is very real, so the person who knows less about transformers than you is going to vote. We want everyone's opinion, no matter their confidence.

👍 if you think at least one frontier model* will have no O(n^2) attention by the end of 2026
🔥 if you disagree

* Frontier models: models that match or outperform the flagship Claude, Gemini, or ChatGPT at the time on multiple popular benchmarks
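To make the O(n^2) point concrete, the sketch below counts how many query-key pairs a causal model scores with full attention versus a fixed sliding window. Sliding-window attention is used here only as one familiar subquadratic example; the poll does not name a mechanism, and the window of 1,024 is an arbitrary illustrative choice.

```python
def full_causal_pairs(n):
    """Query-key pairs scored by full causal attention over n tokens: n(n+1)/2."""
    return n * (n + 1) // 2

def sliding_window_pairs(n, window):
    """Pairs when each token sees itself and at most window-1 predecessors."""
    return sum(min(i + 1, window) for i in range(n))

for n in (1_024, 8_192, 65_536):
    full = full_causal_pairs(n)
    local = sliding_window_pairs(n, window=1_024)
    print(f"n={n:>6}: full={full:>13,} sliding={local:>11,} ratio={full / local:.1f}x")
```

Full attention grows quadratically with context length, while the windowed count grows linearly once the context exceeds the window, which is why the ratio keeps widening as n increases.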

4 replies

reacted to mitkox's post with 👍 3 months ago

Post

4819

I just pushed Claude Code Agent Swarm with 20 coding agents on my desktop GPU workstation.

With local AI, I don’t have /fast CC switch, but I have /absurdlyfast:
- 100’499 tokens/second read, yeah 100k, not a typo | 811 tok/sec generation
- KV cache: 707’200 tokens
- Hardware: 5+ year old GPUs 4xA6K gen1; It’s not the car. It’s the driver.

Qwen3 Coder Next AWQ with cache at BF16. Scores 82.1% in C# on 29-years-in-dev codebase vs Opus 4.5 at only 57.5%. When your codebase predates Stack Overflow, you don't need the biggest model; you need the one that actually remembers Windows 95.

My current bottleneck is my 27" monitor. Can't fit all 20 Theos on screen without squinting.