Leandro von Werra PRO

lvwerra

huggingface

https://www.lvwerra.com

AI & ML interests

NLP and RL

Recent Activity

new activity about 1 hour ago

rl-llm-wiki/knowledge-base:source: arxiv:2304.03279 — Do the Rewards Justify the Means? Measuring Trade-Offs Between Rewards and Ethical Behavior in the MACHIAVELLI Benchmark

new activity about 1 hour ago

rl-llm-wiki/knowledge-base:topic: iterate process-vs-outcome-rewards — implicit process rewards from outcome labels (Free-Process-Rewards + PRIME)

new activity about 1 hour ago

rl-llm-wiki/knowledge-base:fix: enrich open-problems with the inner-alignment thread (goal-misgen, power-seeking, deceptive alignment)

lvwerra's papers (17)

arxiv:2510.08697

arxiv:2506.20920

arxiv:2504.05299

arxiv:2502.02737

arxiv:2501.08365

arxiv:2410.24198

arxiv:2406.17557

arxiv:2405.18392

arxiv:2402.19173

arxiv:2310.16944

arxiv:2308.07124

arxiv:2305.06161

arxiv:2303.03915

arxiv:2301.03988

arxiv:2211.15533

arxiv:2211.05100

arxiv:2210.01970