Mechanistic Interpretability Benchmark

university

https://mib-bench.github.io

AI & ML interests

Principled evaluation of mechanistic interpretability methods.

Recent Activity

hij

authored a paper 27 days ago

Blackbox Model Provenance via Palimpsestic Membership Inference

Paper • 2510.19796 • Published Oct 22 • 3

amueller

updated a Space about 2 months ago

MIB Leaderboard

Leaderboard for the Mechanistic Interpretability Benchmark

hij

authored 3 papers 3 months ago

AxBench: Steering LLMs? Even Simple Baselines Outperform Sparse Autoencoders

Paper • 2501.17148 • Published Jan 28 • 1

LLMs Encode Harmfulness and Refusal Separately

Paper • 2507.11878 • Published Jul 16 • 1

Internal Causal Mechanisms Robustly Predict Language Model Out-of-Distribution Behaviors

Paper • 2505.11770 • Published May 17 • 2

atticusg

authored a paper 10 months ago

Open Problems in Mechanistic Interpretability

Paper • 2501.16496 • Published Jan 27 • 20

belinkov

authored a paper 11 months ago

Padding Tone: A Mechanistic Analysis of Padding Tokens in T2I Models

Paper • 2501.06751 • Published Jan 12 • 32

hadasor

authored a paper 11 months ago

Padding Tone: A Mechanistic Analysis of Padding Tokens in T2I Models

Paper • 2501.06751 • Published Jan 12 • 32

belinkov

authored a paper about 1 year ago

LLMs Know More Than They Show: On the Intrinsic Representation of LLM Hallucinations

Paper • 2410.02707 • Published Oct 3, 2024 • 47

hadasor

authored a paper about 1 year ago

LLMs Know More Than They Show: On the Intrinsic Representation of LLM Hallucinations

Paper • 2410.02707 • Published Oct 3, 2024 • 47

atticusg

authored a paper about 1 year ago

Recurrent Neural Networks Learn to Store and Generate Sequences using Non-Linear Representations

Paper • 2408.10920 • Published Aug 20, 2024 • 1

AdamBelfki

authored a paper over 1 year ago

NNsight and NDIF: Democratizing Access to Foundation Model Internals

Paper • 2407.14561 • Published Jul 18, 2024 • 35

amueller

authored a paper over 1 year ago

NNsight and NDIF: Democratizing Access to Foundation Model Internals

Paper • 2407.14561 • Published Jul 18, 2024 • 35

sarahwie

authored 5 papers over 1 year ago

Self-Refine: Iterative Refinement with Self-Feedback

Paper • 2303.17651 • Published Mar 30, 2023 • 2

Reframing Human-AI Collaboration for Generating Free-Text Explanations

Paper • 2112.08674 • Published Dec 16, 2021

Teach Me to Explain: A Review of Datasets for Explainable Natural Language Processing

Paper • 2102.12060 • Published Feb 24, 2021

Attentiveness to Answer Choices Doesn't Always Entail High QA Accuracy

Paper • 2305.14596 • Published May 24, 2023 • 1

The Unreasonable Effectiveness of Easy Training Data for Hard Tasks

Paper • 2401.06751 • Published Jan 12, 2024 • 1

belinkov

authored a paper over 1 year ago

Confidence Regulation Neurons in Language Models

Paper • 2406.16254 • Published Jun 24, 2024 • 10

alestolfo

authored a paper over 1 year ago

Confidence Regulation Neurons in Language Models

Paper • 2406.16254 • Published Jun 24, 2024 • 10

Team members 20
