AI & ML interests

None defined yet.

Recent Activity

hesamation 
posted an update 4 days ago
this is big... 50 AI researchers from Bytedance, Alibaba, Tencent, and other labs/universities just published a 300-page paper with surprising lessons about coding models and agents (data, pre- and post-training, etc.).

key highlights:

> small LLMs can beat proprietary giants
RL (specifically RLVR, reinforcement learning with verifiable rewards) gives small open-source models an edge over big models in reasoning. a 14B model trained with RLVR on high-quality verified problems can match the performance of OpenAI's o3.
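(not from the paper, just to pin down the term: a verifiable reward scores a completion by executing it against tests rather than asking a learned reward model. a minimal sketch, with made-up problem and test strings:)

```python
import subprocess
import sys
import tempfile
import textwrap

def verifiable_reward(solution_code: str, test_code: str, timeout: int = 10) -> float:
    """Binary RLVR-style reward: 1.0 if the candidate solution passes the
    verification tests, 0.0 otherwise. No learned reward model involved."""
    program = textwrap.dedent(solution_code) + "\n\n" + textwrap.dedent(test_code)
    with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
        f.write(program)
        path = f.name
    try:
        result = subprocess.run([sys.executable, path], capture_output=True, timeout=timeout)
        return 1.0 if result.returncode == 0 else 0.0
    except subprocess.TimeoutExpired:
        return 0.0

# hypothetical usage: reward a model completion on a tiny verified problem
candidate = "def add(a, b):\n    return a + b"
tests = "assert add(2, 3) == 5\nassert add(-1, 1) == 0"
print(verifiable_reward(candidate, tests))  # 1.0 if the tests pass
```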

> models have a hard time learning Python.
mixing programming languages during pre-training is good, but Python behaves differently from statically typed languages. languages with similar syntax (Java and C#, or JavaScript and TypeScript) create high positive synergy. mixing Python heavily into the training of statically typed languages can actually hurt because of Python's dynamic typing.
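(a toy sketch of what such a mixture could look like when targeting statically typed languages — the weights below are illustrative, not the paper's:)

```python
import random

# Illustrative mixture weights for a statically-typed-focused run:
# keep syntactically similar languages together, downweight Python.
language_weights = {
    "java": 0.30,
    "csharp": 0.25,
    "typescript": 0.20,
    "javascript": 0.15,
    "python": 0.10,  # small share to limit interference from dynamic typing
}

def sample_language(weights: dict[str, float]) -> str:
    """Sample the source language for the next pre-training document."""
    langs, probs = zip(*weights.items())
    return random.choices(langs, weights=probs, k=1)[0]

print(sample_language(language_weights))
```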

> not all languages are equal (coding scaling laws)
the amount of data required to specialize a model on a language depends drastically on the language. the paper argues that languages like C# and Java are easier to learn (less training data required), while languages like Python and JavaScript are actually trickier to learn — ironically, the languages AI gets used for the most :)

> MoE vs Dense (ability vs stability)
MoE models offer higher capacity, but are much more fragile during SFT than dense models. training hyperparams have a more drastic effect on MoE models, while dense models are more stable. MoE models also require constant learning-rate schedules to avoid routing instability.
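(for reference, with the Hugging Face Trainer a constant schedule is just a config switch; the values below are placeholders, not the paper's recipe:)

```python
from transformers import TrainingArguments

# Sketch of SFT arguments for an MoE checkpoint: a constant learning-rate
# schedule (after warmup) instead of cosine decay, to keep expert routing stable.
args = TrainingArguments(
    output_dir="moe-sft",
    learning_rate=1e-5,                        # placeholder value
    lr_scheduler_type="constant_with_warmup",  # constant schedule, per the MoE advice
    warmup_ratio=0.03,
    per_device_train_batch_size=4,
    num_train_epochs=2,
)
```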

> code models are "insecure" by default (duh)
training on public repos makes models learn years of accumulated insecure coding patterns. safety fine-tuning often doesn't carry over well to code: a model might refuse to write a hate-speech email but will happily generate a SQL-injection-vulnerable function because it "works."
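(a minimal illustration of that failure mode, not taken from the paper — the string-built query is the insecure habit, the parameterized one is the fix:)

```python
import sqlite3

def find_user_insecure(conn: sqlite3.Connection, username: str):
    # The pattern models pick up from public repos: string-built SQL.
    # It "works", but username = "x' OR '1'='1" dumps every row.
    query = f"SELECT * FROM users WHERE name = '{username}'"
    return conn.execute(query).fetchall()

def find_user_safe(conn: sqlite3.Connection, username: str):
    # Parameterized query: the driver escapes the value, so no injection.
    return conn.execute("SELECT * FROM users WHERE name = ?", (username,)).fetchall()
```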

read the full paper:
From Code Foundation Models to Agents and Applications: A Practical Guide to Code Intelligence (2511.18538)
juhoinkinen 
posted an update 9 days ago
**AI4LAM’s annual conference, AI Everywhere, All at Once
December 3 – 5, 2025, British Library, London**

See the conference programme: 👉 https://www.conftool.org/fantastic-futures-2025/sessions.php

Some program items related to NatLibFi/Annif:
• Workshop:
  • Evaluating Automated Subject Indexing Methods, Maximilian Kähler
• Presentations:
  • Autocat Cataloguing Assistant
  • The usage of hardware resources for automatic subject cataloguing at the German National Library – an analysis and outlook for future challenges, Christoph Poley
• Posters:
  • AI-Powered Subject Indexing in the Archives – Piloting Finto AI at the Finnish Literature Society, Milla Eräsaari and Teemu Hirvonen
  • From Annotation to Insight: Human-in-the-Loop Machine Learning for Historical Archives in HAICu WP2, C.A. Romein and others
asigalov61 
posted an update 11 days ago
🔥🎵 ➕ 🖹 🔥Check out my new large-scale MIDI + Lyrics dataset!!!

asigalov61/Lyrics-MIDI-Dataset

~179k MIDIs with corresponding Lyrics to play with!!! 🤗
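a quick way to grab the files locally with the standard Hub tooling (assuming the usual dataset repo layout — not checked here):

```python
from huggingface_hub import snapshot_download

# Download the full dataset repo (MIDI files + lyrics) to a local folder.
local_dir = snapshot_download(
    repo_id="asigalov61/Lyrics-MIDI-Dataset",
    repo_type="dataset",
)
print(local_dir)
```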

If you liked the dataset, please ❤️

Any feedback and/or suggestions are also appreciated 🤗
Bils 
posted an update 22 days ago
lunarflu 
posted an update 28 days ago
💸🤑You don’t need 100 GPUs to train something amazing!

Our Smol Training Playbook teaches you a better path to world-class LLMs, for free!

Check out the #1 trending space on 🤗 :
HuggingFaceTB/smol-training-playbook
multimodalart 
posted an update about 2 months ago
Want to iterate on a Hugging Face Space with an LLM?

Now you can easily convert any entire HF repo (Model, Dataset or Space) to a text file and feed it to a language model!

multimodalart/repo2txt
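the Space handles the conversion for you; a bare-bones sketch of the same idea with huggingface_hub (not the Space's actual code, and the file filter is just an assumption) could look like:

```python
from huggingface_hub import list_repo_files, hf_hub_download

def repo_to_text(repo_id: str, repo_type: str = "model") -> str:
    """Concatenate a repo's text files into one LLM-friendly string."""
    chunks = []
    for path in list_repo_files(repo_id, repo_type=repo_type):
        if not path.endswith((".md", ".py", ".json", ".txt", ".yaml")):
            continue  # skip binaries / weights
        local = hf_hub_download(repo_id, path, repo_type=repo_type)
        with open(local, encoding="utf-8", errors="ignore") as f:
            chunks.append(f"===== {path} =====\n{f.read()}")
    return "\n\n".join(chunks)

# Example: dump a Space's files into one text blob for an LLM prompt.
print(repo_to_text("multimodalart/repo2txt", repo_type="space")[:500])
```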
BramVanroy 
posted an update about 2 months ago
What are currently the best multilingual models with at most 72B parameters? Are Llama 3.3 70B and Qwen 2.5 72B still king?
lunarflu 
posted an update 2 months ago
Cool stuff from these past weeks on Hugging Face! 🤗 🚀
• 📈Trackio, local-first W&B alternative
https://github.com/gradio-app/trackio/issues
• 🌍EmbeddingGemma, 300M-param, multilingual embeddings, on-device
https://huggingface.co/blog/embeddinggemma
• 💻Open LLMs in VS Code (Inference Providers)
https://x.com/reach_vb/status/1966185427582497171
• 🤖Smol2Operator GUI agents
https://huggingface.co/blog/smol2operator
• 🖼️Gradio visible watermarking
https://huggingface.co/blog/watermarking-with-gradio
Sri-Vigneshwar-DJ 
posted an update 2 months ago
Do you think domain-specific embedding fine-tuners are needed?
I've been working with embeddings for marketing use cases and noticed something: most embedding models don't capture marketing concepts very well, because they're trained for general-purpose use.
The Issue I'm Seeing
When I search marketing content with general embeddings:

"organic growth" returns farming articles
"conversion funnel" matches industrial equipment
"brand lift" doesn't connect to campaign effectiveness
Marketing jargon like CAC, ROAS, and CTR isn't properly understood

My Question
Do you think domain-specific embeddings are needed for marketing?
Some thoughts:

Marketing has its own vocabulary and concept relationships
General models trained on Wikipedia/web crawl miss these nuances
But is fine-tuning worth the effort vs just using more retrieval tricks?

Quick Example
I fine-tuned all-mpnet-base-v2 on ~1000 marketing concept pairs and saw 15-20% better retrieval accuracy. But I'm curious:

Has anyone else tried this for marketing or other domains?
When do you think domain-specific embeddings are actually necessary vs overkill?
Are there better approaches I'm missing?

https://huggingface.co/blog/Sri-Vigneshwar-DJ/why-your-marketing-rag-system-needs-domain-specifi
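The exact fine-tuning setup isn't described in the post, but a typical pair-based recipe with sentence-transformers would look roughly like this (the concept pairs below are invented placeholders):

```python
from torch.utils.data import DataLoader
from sentence_transformers import SentenceTransformer, InputExample, losses

# Illustrative marketing concept pairs; a real run would use ~1000 of these.
pairs = [
    ("organic growth", "unpaid user acquisition through content and SEO"),
    ("conversion funnel", "stages a prospect moves through before purchase"),
    ("CAC", "customer acquisition cost"),
]

model = SentenceTransformer("sentence-transformers/all-mpnet-base-v2")
train_examples = [InputExample(texts=[a, b]) for a, b in pairs]
train_loader = DataLoader(train_examples, shuffle=True, batch_size=16)

# MultipleNegativesRankingLoss treats the other in-batch pairs as negatives,
# which suits the case where you only have positive concept pairs.
loss = losses.MultipleNegativesRankingLoss(model)
model.fit(train_objectives=[(train_loader, loss)], epochs=1, warmup_steps=10)
model.save("marketing-mpnet")
```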
Sri-Vigneshwar-DJ 
posted an update 2 months ago
🚀 Exciting News! We've released a Performance Marketing Expert Dataset from Hawky.ai [www.hawky.ai]


This dataset empowers AI models with cutting-edge strategies for Meta, Google Ads, and TikTok campaigns. It includes:
1. Multi-platform strategies for e-commerce, DTC, B2B, and more
2. Creative optimization and audience targeting insights
3. ROI-driven recommendations based on 2025 best practices

Sri-Vigneshwar-DJ/Performance-Marketing-Data