Always surprised that so few people actually read the FineTasks blog, on ✨how to select training evals with the highest signal✨
If you're serious about training models without wasting compute on shitty runs, you absolutely should read it!!
A high-signal eval actually tells you precisely, during training, how well & what your model is learning, allowing you to discard the bad runs/bad samplings/...!
The blog covers prompt choice, metrics, and datasets in depth, across languages/capabilities, and my fave section is "which properties should evals have"👌 (to figure out, for your use case, how to select the best evals for you)
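To make "high signal" concrete, here's a toy sketch (my own, not code from the FineTasks blog) scoring a candidate eval on two of the properties that kind of analysis looks at: monotonicity (does the score improve as training progresses?) and low noise (do reruns with different seeds agree?). The function name and thresholds are illustrative assumptions.

```python
# Toy sketch of scoring a candidate eval's "signal" across checkpoints.
import numpy as np
from scipy.stats import spearmanr

def eval_signal(scores_per_seed: np.ndarray) -> dict:
    """scores_per_seed: shape (n_seeds, n_checkpoints), one eval score
    per training checkpoint per seed."""
    mean_curve = scores_per_seed.mean(axis=0)
    steps = np.arange(len(mean_curve))
    # Monotonicity: rank correlation between checkpoint index and score.
    monotonicity, _ = spearmanr(steps, mean_curve)
    # Noise: average spread across seeds, relative to the score's range.
    noise = scores_per_seed.std(axis=0).mean() / (np.ptp(mean_curve) + 1e-9)
    return {"monotonicity": monotonicity, "relative_noise": noise}

# Example: 3 seeds x 5 checkpoints for one candidate eval.
scores = np.array([[0.25, 0.31, 0.38, 0.42, 0.47],
                   [0.26, 0.30, 0.36, 0.44, 0.46],
                   [0.24, 0.33, 0.37, 0.41, 0.48]])
print(eval_signal(scores))  # high monotonicity + low noise -> high signal
```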
Gemma3 family is out! I've been reading the tech report, and this section was really interesting to me from a methods/scientific fairness pov.
Instead of doing over-hyped comparisons, they clearly state that **results are reported in a setup which is advantageous to their models**. (Which everybody does, but people usually don't say)
For a tech report, it makes a lot of sense to report model performance when used optimally! On leaderboards, on the other hand, comparisons will be apples to apples, but potentially in a suboptimal way for a given model family (since some users interact sub-optimally with models)
It also contains a cool section (6) on training data memorization rates! Important to check whether your model will output the training data it has seen verbatim: always an issue for privacy/copyright/... but also very much for evaluation!
Because if your model knows its evals by heart, you're not testing for generalization.
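For intuition, here's a rough sketch of how a memorization check along these lines typically works (not the report's actual harness): prompt the model with a prefix of a training document and test whether greedy decoding regenerates the true continuation verbatim. The model name and prefix/suffix lengths are illustrative assumptions.

```python
# Sketch of a verbatim-memorization check: does the model complete a
# training-document prefix with the exact continuation it saw in training?
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # stand-in; swap for the model under test
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

def is_memorized(doc: str, prefix_len: int = 50, suffix_len: int = 50) -> bool:
    ids = tok(doc, return_tensors="pt").input_ids[0]
    if len(ids) < prefix_len + suffix_len:
        return False
    prefix, true_suffix = ids[:prefix_len], ids[prefix_len:prefix_len + suffix_len]
    with torch.no_grad():
        out = model.generate(prefix.unsqueeze(0),
                             max_new_tokens=suffix_len,
                             do_sample=False)  # greedy decoding
    gen_suffix = out[0, prefix_len:prefix_len + suffix_len]
    # Exact match = the model has memorized this passage verbatim.
    return torch.equal(gen_suffix, true_suffix)

# Memorization rate = fraction of sampled training docs flagged as memorized.
```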
In basic chatbots, errors are annoyances. In medical LLMs, errors can have life-threatening consequences 🩸
It's therefore vital to benchmark/follow advances in medical LLMs before even thinking about deployment.
This is why a small research team introduced a medical LLM leaderboard, to get reproducible and comparable results between LLMs, and allow everyone to follow advances in the field.
Contamination free code evaluations with LiveCodeBench! 🖥️
LiveCodeBench is a new leaderboard, which contains:
- complete code evaluations (on code generation, self-repair, code execution, tests)
- my favorite feature: problem selection by publication date 📅
This feature means you can average model scores only over problems published after the model's training cutoff, i.e., problems that can't be in the training data. This means... contamination-free code evals! 🚀
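A minimal sketch of the idea (my own toy example, not LiveCodeBench's actual API or data): keep only problems published after the model's training cutoff when computing the score.

```python
# Score a model only on problems that postdate its training cutoff,
# so none of them can have leaked into the training data.
from datetime import date

# (problem_id, publication_date, model_passed) - illustrative records
results = [
    ("two-pointers-easy", date(2023, 5, 1), True),
    ("dp-hard", date(2024, 2, 10), False),
    ("graph-medium", date(2024, 6, 3), True),
]

def contamination_free_score(results, training_cutoff: date) -> float:
    fresh = [passed for _, pub, passed in results if pub > training_cutoff]
    return sum(fresh) / len(fresh) if fresh else float("nan")

# Model trained on data up to 2023-12-31: only the two 2024 problems count.
print(contamination_free_score(results, date(2023, 12, 31)))  # 0.5
```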
The new RL leaderboard evaluates agents in 87 possible environments (from Atari 🎮 to motion control simulations🚶and more)!
When you submit your model, it's run and evaluated in real time - and the leaderboard displays small videos of the best model's run, which is super fun to watch! ✨
Kudos to @qgallouedec for creating and maintaining the leaderboard! Let's find out which agent is the best at games! 🚀
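For a feel of what such an evaluation does under the hood, here's a hedged sketch (not the leaderboard's actual harness): roll out a few episodes in a Gymnasium environment, average the returns, and record a video clip of the runs. The random policy stands in for a submitted agent.

```python
# Sketch of an agent evaluation loop with video recording in Gymnasium.
import gymnasium as gym
import numpy as np

env = gym.make("CartPole-v1", render_mode="rgb_array")
env = gym.wrappers.RecordVideo(env, video_folder="videos")  # saves .mp4 clips

def random_policy(obs):
    return env.action_space.sample()  # stand-in for a submitted agent

returns = []
for episode in range(5):
    obs, info = env.reset()
    total, done = 0.0, False
    while not done:
        obs, reward, terminated, truncated, info = env.step(random_policy(obs))
        total += reward
        done = terminated or truncated
    returns.append(total)
env.close()
print(f"mean return over {len(returns)} episodes: {np.mean(returns):.1f}")
```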