Blackbox Model Provenance via Palimpsestic Membership Inference Paper • 2510.19796 • Published Oct 22 • 3
AxBench: Steering LLMs? Even Simple Baselines Outperform Sparse Autoencoders Paper • 2501.17148 • Published Jan 28 • 1
Internal Causal Mechanisms Robustly Predict Language Model Out-of-Distribution Behaviors Paper • 2505.11770 • Published May 17 • 2
Padding Tone: A Mechanistic Analysis of Padding Tokens in T2I Models Paper • 2501.06751 • Published Jan 12 • 32
LLMs Know More Than They Show: On the Intrinsic Representation of LLM Hallucinations Paper • 2410.02707 • Published Oct 3, 2024 • 47
Recurrent Neural Networks Learn to Store and Generate Sequences using Non-Linear Representations Paper • 2408.10920 • Published Aug 20, 2024 • 1
NNsight and NDIF: Democratizing Access to Foundation Model Internals Paper • 2407.14561 • Published Jul 18, 2024 • 35
Reframing Human-AI Collaboration for Generating Free-Text Explanations Paper • 2112.08674 • Published Dec 16, 2021
Teach Me to Explain: A Review of Datasets for Explainable Natural Language Processing Paper • 2102.12060 • Published Feb 24, 2021
Attentiveness to Answer Choices Doesn't Always Entail High QA Accuracy Paper • 2305.14596 • Published May 24, 2023 • 1
The Unreasonable Effectiveness of Easy Training Data for Hard Tasks Paper • 2401.06751 • Published Jan 12, 2024 • 1