arxiv:2605.27882

VibeSearchBench: Benchmarking Long-horizon Proactive Search in the Wild

Published on May 27

Submitted by HuangMeow on May 28 · rednote-hilab

AI-generated summary

LLM-based agents perform poorly on VibeSearchBench, a benchmark that evaluates multi-turn dialogue search scenarios reflecting real user-agent collaboration rather than traditional single-turn query tasks.

Abstract

LLM-based agents score well on search benchmarks, yet real users consistently find results unsatisfying, revealing a persistent evaluation-experience gap. We attribute this gap to existing benchmarks' reliance on over-specified queries, single-turn interactions, and fixed-schema evaluation, none of which reflect real search behavior where users and agents collaboratively refine vague intent through multi-turn dialogue. We term this paradigm VibeSearch and introduce VibeSearchBench, a benchmark comprising 200 manually curated bilingual (Chinese and English) tasks across 20 domains, split into VibeSearch-Pro (professional) and VibeSearch-Daily (daily-life) subsets. Each task pairs a user persona with a schema-free ground-truth knowledge graph, and is evaluated through a progressive-disclosure user simulator and a graph-matching evaluation framework. We benchmark seven frontier models under both the ReAct framework and the OpenClaw agent harness. Results show that all models remain substantially inadequate for VibeSearch (best F1: 30.30), highlighting the need for fundamental advances in long-context reasoning, proactive intent elicitation, and structured knowledge construction.

Community

Luckyyy

Paper submitter 3 days ago

🚀 Introducing VibeSearchBench — a new benchmark that exposes a striking gap between how LLM agents are evaluated and how real users actually search.

💡 The problem. Today's search benchmarks (BrowseComp, WideSearch, DeepSearchQA…) all assume over-specified queries, single-turn interaction, and fixed-schema outputs. But in the wild, users don't know what they want upfront — they vibe-search: vague initial query → partial results → emerging preferences → iterative refinement. We call this the evaluation–experience gap.

🧪 What we built. 200 manually curated bilingual (EN/ZH) tasks across 20 domains, split into VibeSearch-Pro (professional) and VibeSearch-Daily (everyday life). Each task pairs a user persona with K progressive-disclosure stages and a schema-free ground-truth knowledge graph (avg. 212 nodes / 298 triples). We contribute two novel pieces: (1) a progressive-disclosure user simulator that unlocks needs only when trigger conditions are met, and (2) an LLM-as-judge graph-matching evaluator with 98.5%+ human agreement.
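To make the progressive-disclosure idea concrete, here is a minimal Python sketch of a simulated user that reveals one hidden need at a time. It is an illustration under stated assumptions, not the benchmark's implementation: the `Stage` and `ProgressiveUser` names, the keyword-based trigger, and the rule that `[DONE]` is emitted once every stage has been unlocked are all hypothetical stand-ins for whatever trigger conditions the actual simulator uses.

```python
import re
from dataclasses import dataclass


@dataclass
class Stage:
    """One hidden need plus the condition that unlocks it (hypothetical structure)."""
    need: str                    # what the simulated user reveals at this stage
    trigger_keywords: set[str]   # assumed trigger: every keyword appears in the agent reply


@dataclass
class ProgressiveUser:
    initial_query: str
    stages: list[Stage]
    next_stage: int = 0          # index of the first stage that is still locked

    def respond(self, agent_reply: str) -> str:
        """Reveal the next need if its trigger fires; otherwise stay vague."""
        if self.next_stage >= len(self.stages):
            return "[DONE]"      # assumption: all needs disclosed means the user is done
        stage = self.stages[self.next_stage]
        words = set(re.findall(r"\w+", agent_reply.lower()))
        if stage.trigger_keywords <= words:
            self.next_stage += 1
            return stage.need
        return "Hmm, not quite what I had in mind yet."


user = ProgressiveUser(
    initial_query="find me a laptop",
    stages=[
        Stage("It should weigh under 1 kg.", {"weight"}),
        Stage("My budget is 1500 USD.", {"price"}),
    ],
)
print(user.respond("Here are some popular laptops."))   # trigger not met, user stays vague
print(user.respond("Do you care about weight?"))        # unlocks the first need
```

An agent that never asks about weight or price never sees those needs, which is the behavior that rewards proactive intent elicitation.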

📊 Findings. Across 7 frontier models (Claude Opus 4.6, GPT-5.4, Gemini-3.1 Pro, Kimi K2.6, DeepSeek-V4-Pro…) under ReAct & OpenClaw:

- Best F1 = 30.30; every model scores below 33
- More tool calls ≠ better results (GPT-5.4 burns the most, scores lowest)
- Zero trajectories reach the user's [DONE] signal
- Sub-agents, local memory, life-long memory all yield no significant gain
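
For readers unfamiliar with F1 over knowledge-graph triples, the sketch below shows how precision, recall, and F1 combine. It is a simplification under a loud assumption: triples are compared by exact match, whereas the benchmark uses an LLM-as-judge matcher, and the function name `triple_f1` and the example triples are made up for illustration.

```python
def triple_f1(predicted, gold):
    """Precision, recall, and F1 over (head, relation, tail) triples.

    Assumption: exact string match decides whether a predicted triple is
    correct; this stands in for the benchmark's LLM-as-judge matching.
    """
    pred, ref = set(predicted), set(gold)
    if not pred or not ref:
        return 0.0, 0.0, 0.0
    true_positives = len(pred & ref)
    if true_positives == 0:
        return 0.0, 0.0, 0.0
    precision = true_positives / len(pred)
    recall = true_positives / len(ref)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


gold = [
    ("LaptopA", "weight", "0.9 kg"),
    ("LaptopA", "price", "1400 USD"),
    ("LaptopA", "battery", "15 h"),
]
predicted = [
    ("LaptopA", "weight", "0.9 kg"),   # matches a gold triple
    ("LaptopA", "price", "1600 USD"),  # wrong tail, counts as a miss
]
precision, recall, f1 = triple_f1(predicted, gold)
print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
```

Because F1 is the harmonic mean, a trajectory that recovers only a small part of a large ground-truth graph scores low even when most of what it returns is correct.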

🎯 Takeaway. VibeSearch demands fundamental model-level advances in long-context reasoning, proactive intent elicitation, and structured knowledge construction — not just more scaffolding. The road to truly helpful search agents is much longer than the leaderboards suggest.

🔗 vibebench.github.io/VibeSearchBench