All HF Hub posts

dronefreak 
posted an update 2 days ago
Excited to open-source the VisDrone Aerial Object Detection Model Zoo on Hugging Face.

The collection includes multiple YOLO variants trained and evaluated on the VisDrone benchmark for aerial object detection, with accompanying documentation and performance metrics.

If you're working on drones, aerial surveillance, robotics, or small-object detection, I hope these models save you some time.

Model Zoo: https://huggingface.co/collections/dronefreak/visdrone-detection-model-zoo

Feedback, issues, and contributions are welcome.
  • 5 replies
·
Reubencf 
posted an update 2 days ago
Shadows of Tomorrow is finally live on Hugging Face Spaces with Gradio.

It’s a browser-playable RPG built with Godot, set in a post-nuclear future where players explore Magnus Province, collect medicinal plants, craft medicine, and help cure NPCs.

Play it here: Reubencf/Shadows_of_Tomorrow
  • 8 replies
·
AxionLab-official 
posted an update 3 days ago
# An Open Letter from SupraLabs.

Over the past few days, SupraLabs has been mentioned in a public discussion regarding small language models, scaling laws, and training methodology. We'd like to clarify our position.

Before anything else, we want to make one thing absolutely clear: we have great respect for Lane and the work being done at Glint Research. At no point was our intention to disrespect Lane, Glint Research, or their research. What began as a technical discussion about model scaling and training methodology unfortunately became much more personal than we ever intended. From our perspective, it was simply an exchange of technical opinions, and we sincerely hope it remains that way.
We'd also like to acknowledge that one of our own comments during the discussion was poorly worded. Referring to a benchmark as "fake" was imprecise. What we intended to criticize was the comparison methodology, not the integrity of the evaluation itself. Comparing a merged checkpoint against a single checkpoint is, in our view, not an apples-to-apples comparison.

That said, this was never the core of the discussion.

Our disagreement was not about SLERP, model merging, or whether training a small model on massive amounts of data is an interesting research direction. We support experimentation and unconventional ideas.

The actual point of disagreement was much simpler.

The statement that a 1M parameter model trained on 1 trillion tokens will become a "100M killer" is, today, a prediction, not an experimental result.
Could it happen? Perhaps.
Would it be exciting if it did? Absolutely.

But until benchmark results, reproducible evaluations, and independent validation exist, we believe such statements should be presented as hypotheses rather than established conclusions.
Research advances by testing ideas, not by assuming their outcomes.

We sincerely wish Lane and everyone at Glint Research success in their experiments.

Thank you to everyone who read it.
  • 1 reply
·
Anran-MLLM 
posted an update 2 days ago
🚀 Introducing PerceptionDLM — the first multimodal diffusion LLM for parallel region perception!

Most MLLMs are autoregressive, so captioning N regions costs N sequential passes. PerceptionDLM instead describes ALL masked regions in a single denoising process. 🧩

✨ Highlights
• ⚡ Up to 3.4× faster on dense multi-region captioning, with stable per-image latency
• 🏆 PerceptionDLM-Base beats LLaDA-V on 15/16 multimodal benchmarks (new SOTA among open diffusion VLMs)
• 📊 New benchmark: ParaDLC-Bench — jointly evaluates caption quality AND inference efficiency
• 🔓 Code, models & benchmark all open-sourced

🤖 Models
MSALab/PerceptionDLM-Base
MSALab/PerceptionDLM

📊 Benchmark
MSALab/ParaDLC-Bench

📄 Paper: PerceptionDLM: Parallel Region Perception with Multimodal Diffusion Language Models (2606.19534)
💻 Code: https://github.com/MSALab-PKU/PerceptionDLM

Diffusion LLMs aren't just for text — they unlock efficient, parallel visual perception. 👁️✨

#multimodal #diffusion #VLM #perception
ajibawa-2023 
posted an update about 19 hours ago
Shell-Code-Large
Dataset: ajibawa-2023/Shell-Code-Large

Shell-Code-Large is a large-scale corpus of Shell scripting source code comprising approximately 640,000 code samples stored in JSON Lines (.jsonl) format. The dataset is designed to support research in large language model (LLM) pretraining, code intelligence, DevOps automation, cloud infrastructure engineering, system administration, and software engineering automation.

By providing a high-volume, language-specific corpus focused exclusively on Shell scripting, Shell-Code-Large enables systematic experimentation in automation workflows, deployment pipelines, infrastructure management, and command-line tooling. These domains remain foundational to Linux systems, cloud-native platforms, CI/CD environments, and modern DevOps practices.

Shell-Code-Large addresses the need for a dedicated Shell-focused dataset at substantial scale, enabling targeted research into scripting patterns, command composition, workflow orchestration, infrastructure automation, and operational engineering practices.
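
Since the corpus is stored as JSON Lines, each line of a shard is one standalone JSON object. A minimal sketch of streaming such a file with the Python standard library follows. The `text` field name and the two sample records are assumptions for illustration only; the actual schema should be checked on the dataset card.

```python
import io
import json

def iter_jsonl(stream):
    """Yield one parsed record per non-empty line of a JSON Lines stream."""
    for line in stream:
        line = line.strip()
        if line:
            yield json.loads(line)

# In-memory stand-in for a downloaded shard.
# The "text" field name is hypothetical, not taken from the dataset card.
sample = io.StringIO(
    '{"text": "#!/bin/sh\\necho hello"}\n'
    '{"text": "ls -la | wc -l"}\n'
)

records = list(iter_jsonl(sample))
# Count how many samples start with a shebang line.
shebang_count = sum(1 for r in records if r["text"].startswith("#!"))
print(len(records), shebang_count)
```

Reading line by line keeps memory use flat, which matters for a corpus of roughly 640,000 samples; replacing `sample` with an opened file handle works the same way.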
owensong 
posted an update 4 days ago
I just released Inflect-Nano-v1, an ultra-small 4.63M-parameter text-to-speech model.

The main idea is simple: instead of only making the acoustic model tiny and relying on a larger external vocoder, Inflect-Nano-v1 keeps the complete text-to-waveform stack under 5M parameters.

Quick facts:
- 4.63M total inference parameters
- 3.46M acoustic model
- 1.17M vocoder
- 24 kHz audio
- English-only
- Single male voice
- Runs locally with a simple PyTorch inference script
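
As a quick sanity check on the numbers above, the two component sizes add up to the stated total. The sketch below also estimates weight storage, assuming 4 bytes per parameter for fp32 and 2 bytes for fp16; the post does not state which dtype the released checkpoint uses, so those figures are rough estimates only.

```python
# Parameter counts in millions, taken from the quick facts above.
acoustic_m = 3.46
vocoder_m = 1.17
total_m = round(acoustic_m + vocoder_m, 2)

def weights_mb(params_m, bytes_per_param):
    """Approximate weight storage in decimal megabytes (weights only)."""
    return round(params_m * bytes_per_param, 2)

# Assumed dtypes: fp32 = 4 bytes per parameter, fp16 = 2 bytes per parameter.
fp32_mb = weights_mb(total_m, 4)
fp16_mb = weights_mb(total_m, 2)
print(total_m, fp32_mb, fp16_mb)
```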

Why I made it:
Most modern TTS models are much larger, and even many “small TTS” projects depend on a separate vocoder. I wanted to see how far a complete tiny TTS stack could be pushed while still producing usable speech.

It is not SOTA, and I am not trying to claim it competes with large TTS systems. The interesting part is the size-to-functionality ratio.

What works:
It can generate arbitrary English speech locally, and the model is small enough to be interesting for:

- local voice assistants
- embedded/edge experiments
- browser or WASM-style TTS exploration
- efficient inference research
- tiny-model baselines

Limitations:
The quality is still limited. It can sound robotic, stumble on difficult unseen text, and the vocoder is still a clear bottleneck. Long or unusual prompts are less reliable.

So I would frame this as a research/demo release, not a production TTS engine.

I’d love feedback from people interested in:
- tiny speech models
- vocoders
- local TTS
- efficient inference
- embedded speech synthesis
- improving small-model generalization

If people find it useful, I’m interested in putting more training budget into a stronger v2.

Model page:
owensong/Inflect-Nano-v1
AmelieSchreiber 
posted an update 1 day ago
Latest OpenAI Parameter Golf Competition training run BPB (<1K steps on a single 4090). See the ToricBLM, ToricGT, and TropicalGT methods.
Hari5115 
posted an update 1 day ago
Bit addictive. Fair warning!!!
Chain combos, fever mode, daily leaderboard. Free, runs in your browser.
Beat the score if you can 🫧

🎮 Hari5115/neon-pop

#SendHelp #JustOneMoreGame #NeonPop #NotAddicted

  • 2 replies
·
Jaward 
posted an update 3 days ago
Our preprint is out!
We attempt to model human teaching behaviors in agents, yielding a unified framework that enables adaptive, personalized learning experiences.
LectūraAgents addresses the prevailing limitations of current AI learning systems with three essential capabilities:
(1) A hierarchical multi-agent architecture modeled on academic standards. We observe that agents collaborating across hierarchies yield better personalized learning outcomes.
(2) An adaptive embodied teaching mechanism, in which the instructor agent executes visible, pedagogically motivated teaching actions (e.g., handwriting, highlighting, circling) on content in a teaching environment while speaking.
(3) To achieve this, we propose a novel teaching action-speech alignment algorithm (TASA) that dynamically aligns speech with visual teaching actions. Specifically, TASA temporally splits speech segments into word-level tokens, performs salience heuristics analysis on the learning content (text, images, etc.), and then identifies relevant regions in which to apply pedagogical teaching actions that guide attention and augment understanding.
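
To make the alignment idea concrete, here is a toy sketch of pairing word-level timestamps with content regions. It is not the TASA algorithm from the paper: the even spreading of time across words, the keyword lookup standing in for salience analysis, and all names (`word_timestamps`, `align_actions`, `regions`) are assumptions made purely for illustration.

```python
def word_timestamps(text, start, end):
    """Spread a segment's duration evenly across its words.

    A crude stand-in for real word-level alignment.
    """
    words = text.split()
    if not words:
        return []
    step = (end - start) / len(words)
    return [(w.strip(".,").lower(), start + i * step) for i, w in enumerate(words)]

def align_actions(segment, regions):
    """Pair each region with the first time one of its keywords is spoken."""
    actions = []
    seen = set()
    for word, t in word_timestamps(*segment):
        for name, keywords in regions.items():
            if word in keywords and name not in seen:
                seen.add(name)
                actions.append((round(t, 2), name))
    return actions

# Hypothetical speech segment: (text, start seconds, end seconds).
segment = ("The gradient points uphill so we step downhill", 0.0, 4.0)
# Hypothetical content regions, each with keywords acting as a salience proxy.
regions = {"equation": {"gradient"}, "diagram": {"downhill"}}

actions = align_actions(segment, regions)
print(actions)  # [(0.5, 'equation'), (3.5, 'diagram')]
```

Each tuple gives the time at which a teaching action (for example, a highlight) could be applied to the named region while the corresponding word is spoken.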

We conducted several experiments to assess these capabilities: a pedagogical evaluation of the various components under frontier models, a comparative analysis with existing frameworks, and an efficacy study with real students.

Results show consistent gains in standard instructional metrics (curated by expert educators) spanning lecture content quality, embodied teaching quality, assessment, and personalization over baseline systems, positioning LectūraAgents as a pedagogically grounded framework for personalized learning at scale.

Paper: LectūraAgents: A Multi-Agent Framework for Adaptive Personalized AI-Assisted Learning and Embodied Teaching (2606.16428)
Data: Jaward/lectura-agents-data
  • 1 reply
·
AxionLab-official 
posted an update about 3 hours ago