arxiv:2605.27882

VibeSearchBench: Benchmarking Long-horizon Proactive Search in the Wild

Published on May 27 · Submitted by HuangMeow on May 28

Abstract

AI-generated summary: LLM-based agents perform poorly on VibeSearchBench, a benchmark that evaluates multi-turn dialogue search scenarios reflecting real user-agent collaboration rather than traditional single-turn query tasks.

LLM-based agents score well on search benchmarks, yet real users consistently find results unsatisfying, revealing a persistent evaluation-experience gap. We attribute this gap to existing benchmarks' reliance on over-specified queries, single-turn interactions, and fixed-schema evaluation, none of which reflect real search behavior where users and agents collaboratively refine vague intent through multi-turn dialogue. We term this paradigm VibeSearch and introduce VibeSearchBench, a benchmark comprising 200 manually curated bilingual (Chinese and English) tasks across 20 domains, split into VibeSearch-Pro (professional) and VibeSearch-Daily (daily-life) subsets. Each task pairs a user persona with a schema-free ground-truth knowledge graph, and is evaluated through a progressive-disclosure user simulator and a graph-matching evaluation framework. We benchmark seven frontier models under both the ReAct framework and the OpenClaw agent harness. Results show that all models remain substantially inadequate for VibeSearch (best F1: 30.30), highlighting the need for fundamental advances in long-context reasoning, proactive intent elicitation, and structured knowledge construction.

Community

Paper submitter

🚀 Introducing VibeSearchBench — a new benchmark that exposes a striking gap between how LLM agents are evaluated and how real users actually search.

💡 The problem. Today's search benchmarks (BrowseComp, WideSearch, DeepSearchQA…) all assume over-specified queries, single-turn interaction, and fixed-schema outputs. But in the wild, users don't know what they want upfront — they vibe-search: vague initial query → partial results → emerging preferences → iterative refinement. We call this the evaluation–experience gap.

🧪 What we built. 200 manually curated bilingual (EN/ZH) tasks across 20 domains, split into VibeSearch-Pro (professional) and VibeSearch-Daily (everyday life). Each task pairs a user persona with K progressive-disclosure stages and a schema-free ground-truth knowledge graph (avg. 212 nodes / 298 triples). We contribute two novel pieces: (1) a progressive-disclosure user simulator that unlocks needs only when trigger conditions are met, and (2) an LLM-as-judge graph-matching evaluator with 98.5%+ human agreement.
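To make the progressive-disclosure idea concrete, here is a minimal sketch of a simulator that reveals one need at a time. It is an illustration under stated assumptions, not the benchmark's implementation: the `Stage` and `UserSimulator` names are hypothetical, and the keyword-based trigger check stands in for whatever trigger conditions the paper actually uses.

```python
from dataclasses import dataclass


@dataclass
class Stage:
    need: str          # requirement the simulated user reveals at this stage
    trigger: set[str]  # hypothetical trigger: keywords the agent must mention first


@dataclass
class UserSimulator:
    initial_query: str
    stages: list[Stage]
    unlocked: int = 0  # number of stages disclosed so far

    def respond(self, agent_message: str) -> str:
        # Assumption: once every stage is disclosed, the user ends the dialogue.
        if self.unlocked == len(self.stages):
            return "[DONE]"
        stage = self.stages[self.unlocked]
        text = agent_message.lower()
        # Disclose the next need only when its trigger condition is met.
        if all(keyword in text for keyword in stage.trigger):
            self.unlocked += 1
            return stage.need
        return "That is not quite what I am after yet."
```

An agent that never surfaces the trigger content never unlocks later needs, which is one way a trajectory can fail to reach `[DONE]`.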

📊 Findings. Across 7 frontier models (Claude Opus 4.6, GPT-5.4, Gemini-3.1 Pro, Kimi K2.6, DeepSeek-V4-Pro…) under ReAct & OpenClaw:

  • Best F1 = 30.30 — every model below 33
  • More tool calls ≠ better results (GPT-5.4 makes the most tool calls, scores lowest)
  • Zero trajectories reach the user's [DONE] signal
  • Sub-agents, local memory, life-long memory all yield no significant gain
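For readers unfamiliar with F1 over knowledge graphs, the sketch below shows how such a score can be computed from predicted and ground-truth triples. It assumes exact string matching and triple-level scoring, both of which are simplifications: the benchmark's evaluator is an LLM-as-judge graph matcher, and the function name `triple_f1` is hypothetical.

```python
def triple_f1(predicted, gold):
    """F1 between two collections of (head, relation, tail) triples.

    Assumption: triples match only when all three fields are identical.
    """
    pred, ref = set(predicted), set(gold)
    true_positives = len(pred & ref)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(pred)  # share of predicted triples that are correct
    recall = true_positives / len(ref)      # share of gold triples that were recovered
    return 2 * precision * recall / (precision + recall)
```

The function returns a value in [0, 1]; multiplying by 100 gives a score on the scale reported above.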

🎯 Takeaway. VibeSearch demands fundamental model-level advances in long-context reasoning, proactive intent elicitation, and structured knowledge construction — not just more scaffolding. The road to truly helpful search agents is much longer than the leaderboards suggest.

🔗 vibebench.github.io/VibeSearchBench


