LongVidSearch

An Agentic Benchmark for Multi-hop Evidence Retrieval Planning in Long Videos

1 Peking University    2 Meituan
* Equal contribution    Corresponding author


Figure 1: Overview of LongVidSearch. Agents iteratively retrieve clips and read captions via standardized tools, and are evaluated by a three-judge majority-vote protocol.

Abstract

Long video question answering (Long-Video QA) increasingly relies on agentic tool use to retrieve evidence from long videos. In realistic settings, this process often requires multi-hop retrieval, where agents must iteratively gather multiple discontinuous evidence clips. However, existing long-video benchmarks are largely static: they rarely enforce strict multi-hop retrieval and typically lack a standardized evidence-access interface, making it difficult to separate failures in retrieval planning from those in answer generation.

To address this gap, we introduce LongVidSearch, a benchmark for evaluating agentic multi-hop evidence retrieval planning in long videos under standardized access constraints. LongVidSearch enforces retrieval necessity: a Hop-k question requires exactly k necessary evidence clips, and removing any single clip renders the question unsolvable. The benchmark contains 3,000 questions over 447 long videos (average length 26 minutes), covering four reasoning categories—State Mutation, Causal Inference, Global Summary, and Visual Tracking—with 2-, 3-, and 4-hop evidence requirements.
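To make the Hop-k necessity constraint concrete, the leave-one-out check below sketches what it requires of a question's evidence set. The is_answerable judge is a hypothetical placeholder (e.g., an LLM asked to answer from only the listed clip captions); it is not part of the benchmark interface.

def satisfies_hop_k(question, evidence_clips, k, is_answerable):
    # A Hop-k question needs exactly k evidence clips.
    if len(evidence_clips) != k:
        return False
    # The full evidence set must be sufficient to answer the question.
    if not is_answerable(question, evidence_clips):
        return False
    # Removing any single clip must render the question unsolvable.
    for i in range(k):
        remaining = evidence_clips[:i] + evidence_clips[i + 1:]
        if is_answerable(question, remaining):
            return False  # clip i was not strictly necessary
    return True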

To ensure fair and controlled evaluation, all agents interact with LongVidSearch through a unified tool interface, which fixes the retrieval backend and isolates the agent’s ability to formulate queries and plan iterative retrieval. In addition to answer accuracy, we measure tool-call cost to analyze the accuracy–efficiency trade-off under identical access conditions.

We evaluate VideoAgent-style QA agents with multiple backbone LLMs under three-judge majority voting. GPT-5 achieves the highest accuracy (42.43%), outperforming Gemini 3 Pro (30.97%) and GPT-4o (19.20%), yet it still falls below 50%, underscoring the difficulty of multi-hop retrieval planning. With gold evidence clips, performance becomes near-perfect, confirming retrieval planning as the primary bottleneck.

Dataset Statistics

3,000 QA pairs in total.

Category         | 2-Hop         | 3-Hop       | 4-Hop       | Total (Ratio)
Causal Inference |           436 |         282 |         144 |   862 (28.7%)
Global Summary   |           512 |         181 |         166 |   859 (28.6%)
Visual Tracking  |           653 |         136 |          61 |   850 (28.3%)
State Mutation   |           238 |         119 |          72 |   429 (14.3%)
Overall          | 1,839 (61.3%) | 718 (23.9%) | 443 (14.8%) | 3,000 (100%)

Results

Accuracy (%) by Model, Category & Hop Level
Model            | Overall | State Mutation        | Causal Inference      | Global Summary        | Visual Tracking
                 |         | 2h / 3h / 4h          | 2h / 3h / 4h          | 2h / 3h / 4h          | 2h / 3h / 4h
Closed-Source LLMs
GPT-5            |   42.43 | 38.24 / 36.13 / 22.22 | 47.71 / 43.97 / 39.58 | 44.34 / 35.36 / 29.52 | 49.77 / 37.50 / 29.51
Gemini 3 Pro     |   30.97 | 30.25 / 18.49 / 12.50 | 34.17 / 20.92 / 17.36 | 36.72 / 20.44 / 15.66 | 45.48 / 25.00 / 18.03
GPT-4o           |   19.20 | 15.55 / 14.29 / 12.50 | 20.18 / 12.77 / 11.81 | 19.73 / 13.81 / 11.45 | 29.40 / 20.59 / 11.48
GPT-4-mini       |   18.27 | 15.97 /  5.93 /  4.17 | 15.14 / 10.99 /  6.25 | 20.31 / 16.02 / 12.65 | 31.35 / 20.59 / 11.48
Open-Source LLMs
Qwen3-VL-32B     |   29.59 | 29.74 / 27.97 / 15.49 | 29.26 / 22.86 / 18.44 | 34.19 / 20.99 / 16.46 | 40.43 / 25.93 / 22.95
Qwen2.5-VL-72B   |   25.30 | 23.95 / 17.65 / 12.50 | 26.38 / 21.99 / 15.97 | 29.49 / 20.44 / 15.06 | 34.00 / 22.79 /  9.84
Qwen3-VL-8B      |   18.58 | 16.67 / 12.71 /  9.72 | 14.81 / 11.43 / 11.19 | 20.59 / 16.67 / 15.34 | 28.84 / 17.78 / 15.25
Qwen2.5-7B       |   11.10 | 10.92 /  4.20 /  4.17 |  8.72 /  5.32 /  4.86 | 15.82 /  7.73 /  3.61 | 18.99 /  7.35 /  6.56
Qwen2.5-VL-7B    |   10.41 |  7.73 /  7.69 /  4.35 |  7.64 /  5.05 /  2.82 | 13.83 /  7.39 /  5.00 | 18.50 / 10.29 /  4.92
Llama-3-8B       |    7.73 |  6.72 /  5.88 /  1.39 |  7.57 /  4.96 /  4.86 |  8.20 /  6.08 /  4.22 | 12.71 /  7.35 /  1.64

Key Findings: All agents use the same tool set and interact with the same retrieval system, so performance differences primarily reflect their ability to formulate effective queries and plan multi-step evidence acquisition.

Accuracy drops consistently as hop count increases across all models and categories, confirming that multi-hop retrieval planning is fundamentally harder. Even GPT-5, the strongest model, only achieves 42.43% overall—well below 50%—while oracle experiments with golden evidence clips yield near-perfect accuracy, pinpointing retrieval planning as the primary bottleneck.

Standard vs. Oracle (Golden Clips) Accuracy
Model            | Standard Acc (%) | Oracle Acc (%) | Gap (Δ)
Closed-Source LLMs
GPT-5            | 42.43 | 100.00 | 57.57
Gemini 3 Pro     | 30.97 |  99.97 | 69.00
GPT-4o           | 19.20 |  99.40 | 80.20
GPT-4-mini       | 18.27 |  98.73 | 80.46
Open-Source LLMs
Qwen3-VL-32B     | 29.59 |  98.56 | 68.97
Qwen3-VL-8B      | 18.58 |  96.90 | 78.32
Qwen2.5-VL-72B   | 25.30 |  98.60 | 73.30
Qwen2.5-VL-7B    | 10.41 |  97.23 | 86.82
Qwen2.5-7B       | 11.10 |  97.33 | 86.23
Llama-3-8B       |  7.73 |  96.89 | 89.16

Oracle Analysis: When agents receive golden evidence clips directly, all models achieve near-perfect accuracy (96–100%), yet standard accuracy remains below 43%. The massive gap (Δ = 57–89 points) confirms that retrieval planning—not answer generation—is the primary bottleneck in multi-hop long-video QA.

Human Verification vs. LLM Majority Vote

Model            | Checked | Disagreements | Disagree Rate
Closed-Source LLMs
GPT-5            |   598 | 3 | 0.50%
Gemini 3 Pro     |   601 | 5 | 0.83%
GPT-4o           |   617 | 6 | 0.97%
GPT-4-mini       |   628 | 6 | 0.96%
Open-Source LLMs
Qwen3-VL-32B     |   607 | 3 | 0.49%
Qwen3-VL-8B      |   613 | 6 | 0.98%
Qwen2.5-VL-72B   |   620 | 4 | 0.65%
Qwen2.5-VL-7B    |   628 | 6 | 0.96%
Qwen2.5-7B       |   631 | 7 | 1.11%
Llama-3-8B       |   629 | 8 | 1.27%
Overall          | 6,172 | 54 | 0.87%

Evaluation Reliability: Across 6,172 human-verified samples, the three-judge LLM majority vote disagrees with human annotators in only 0.87% of cases, confirming that our automatic evaluation protocol is highly reliable and consistent with expert judgment.
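As an illustration of the protocol, a minimal sketch of the majority vote is given below; the judge callables are hypothetical wrappers around the three judge LLMs, each returning a binary verdict for one prediction.

def majority_vote(question, reference, prediction, judges):
    # Each judge returns True/False for the (question, reference, prediction)
    # triple; the prediction counts as correct if at least 2 of 3 judges agree.
    votes = [judge(question, reference, prediction) for judge in judges]
    return sum(votes) >= 2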

Standardized Tools

All agents interact with LongVidSearch through the same fixed tool interface:

Search_Clips_In_Video(video_id, query, top_k)
Retrieves top-K relevant clips for a textual query within a given video.
Get_Clip_Detail(clip_id)
Returns a high-quality caption for the queried clip (used as evidence).
FINAL_ANSWER(answer_text, evidence_clip_ids)
Submits the answer and the list of viewed evidence clip IDs.

This fixed interface ensures performance differences primarily reflect agentic retrieval planning, not retriever strength or privileged evidence access.
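The loop below illustrates, as a non-authoritative sketch, how an agent might drive this interface: the tool names match those listed above, while the argument and return shapes and the llm planner object are assumptions for illustration.

def answer_question(video_id, question, tools, llm, max_calls=10):
    # Illustrative agent loop; tool names match the fixed interface above,
    # but the `tools` and `llm` objects are assumed wrappers, not released code.
    evidence = {}  # clip_id -> caption read so far
    for _ in range(max_calls):
        action = llm.plan(question=question, evidence=evidence)
        if action.name == "Search_Clips_In_Video":
            clips = tools.Search_Clips_In_Video(video_id, action.query, top_k=5)
            llm.observe(clips)  # candidate clip IDs returned by the retriever
        elif action.name == "Get_Clip_Detail":
            evidence[action.clip_id] = tools.Get_Clip_Detail(action.clip_id)
        elif action.name == "FINAL_ANSWER":
            return tools.FINAL_ANSWER(action.answer_text, list(evidence))
    # Call budget exhausted: answer from whatever evidence has been gathered.
    return tools.FINAL_ANSWER(llm.answer(question, evidence), list(evidence))

In such a loop, the tool-call cost measured in the benchmark would naturally correspond to the number of Search_Clips_In_Video and Get_Clip_Detail calls issued before FINAL_ANSWER.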

BibTeX

@inproceedings{longvidsearch2026,
  title     = {LongVidSearch: An Agentic Benchmark for Multi-hop Evidence Retrieval Planning in Long Videos},
  author    = {Rongyi Yu and Chenyuan Duan and Hao Liang and Haoze Sun and Peng Pei},
  booktitle = {ACM Multimedia},
  year      = {2026}
}