LongVidSearch

Figure 1: Overview of LongVidSearch. Agents iteratively retrieve clips and read captions via standardized tools, and are evaluated by a three-judge majority-vote protocol.

Abstract

Long video question answering (Long-Video QA) increasingly relies on agentic tool use to retrieve evidence from long videos. In realistic settings, this process often requires multi-hop retrieval, where agents must iteratively gather multiple discontinuous evidence clips. However, existing long-video benchmarks are largely static: they rarely enforce strict multi-hop retrieval and typically lack a standardized evidence-access interface, making it difficult to separate failures in retrieval planning from those in answer generation.

To address this gap, we introduce LongVidSearch, a benchmark for evaluating agentic multi-hop evidence retrieval planning in long videos under standardized access constraints. LongVidSearch enforces retrieval necessity: a Hop-k question requires exactly k necessary evidence clips, and removing any single clip renders the question unsolvable. The benchmark contains 3,000 questions over 447 long videos (average length 26 minutes), covering four reasoning categories—State Mutation, Causal Inference, Global Summary, and Visual Tracking—with 2-, 3-, and 4-hop evidence requirements.
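The retrieval-necessity constraint can be read as a leave-one-out test: the question must be solvable with all k clips and unsolvable once any single clip is removed. The sketch below illustrates that reading only. `is_solvable` is a hypothetical callable standing in for whatever check the benchmark construction actually uses, which is not described here.

```python
from typing import Callable, Sequence


def satisfies_retrieval_necessity(
    clips: Sequence[str],
    is_solvable: Callable[[Sequence[str]], bool],
) -> bool:
    """Return True if all clips together suffice and every single clip is necessary."""
    # The full evidence set must be enough to answer the question.
    if not is_solvable(clips):
        return False
    # Removing any one clip must make the question unsolvable.
    for i in range(len(clips)):
        reduced = [c for j, c in enumerate(clips) if j != i]
        if is_solvable(reduced):
            return False
    return True


# Toy solver (assumption): the question needs exactly clips c1, c2 and c3.
required = {"c1", "c2", "c3"}


def toy_solver(given: Sequence[str]) -> bool:
    return required.issubset(given)


ok = satisfies_retrieval_necessity(["c1", "c2", "c3"], toy_solver)
redundant = satisfies_retrieval_necessity(["c1", "c2", "c3", "c4"], toy_solver)
```

In this toy setup the three-clip set passes, while the four-clip set fails because `c4` can be dropped without affecting solvability.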

To ensure fair and controlled evaluation, all agents interact with LongVidSearch through a unified tool interface, which fixes the retrieval backend and isolates the agent’s ability to formulate queries and plan iterative retrieval. In addition to answer accuracy, we measure tool-call cost to analyze the accuracy–efficiency trade-off under identical access conditions.
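As a rough illustration of reporting accuracy alongside tool-call cost, the sketch below aggregates per-question records. The `Episode` record and the choice to measure cost as the mean number of tool calls per question are assumptions made for this example, not the benchmark's documented format.

```python
from dataclasses import dataclass
from typing import Iterable, Tuple


@dataclass
class Episode:
    """Hypothetical per-question record: judged correctness and tool calls used."""
    correct: bool
    tool_calls: int


def summarize(episodes: Iterable[Episode]) -> Tuple[float, float]:
    """Return (accuracy in percent, mean tool calls per question)."""
    episodes = list(episodes)
    if not episodes:
        raise ValueError("no episodes to summarize")
    accuracy = 100.0 * sum(e.correct for e in episodes) / len(episodes)
    mean_calls = sum(e.tool_calls for e in episodes) / len(episodes)
    return accuracy, mean_calls


# Made-up records, used only to exercise the function.
demo = [Episode(True, 4), Episode(False, 9), Episode(True, 5), Episode(False, 6)]
acc, calls = summarize(demo)
```

Because every agent uses the same retrieval backend, two agents with similar accuracy can still be compared on how many calls they needed to get there.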

We evaluate VideoAgent-style QA agents with multiple backbone LLMs using three-judge majority voting. GPT-5 achieves the highest accuracy (42.43%), ahead of Gemini 3 Pro (30.97%) and GPT-4o (19.20%), yet still falls below 50%, highlighting the difficulty of multi-hop retrieval planning. When agents are given the gold evidence clips, accuracy becomes near-perfect, which points to retrieval planning as the primary bottleneck.

Dataset Statistics

3,000 QA

Category	2-Hop	3-Hop	4-Hop	Total (Ratio)
Causal Inference	436	282	144	862 (28.7%)
Global Summary	512	181	166	859 (28.6%)
Visual Tracking	653	136	61	850 (28.3%)
State Mutation	238	119	72	429 (14.3%)
Overall	1,839 (61.3%)	718 (23.9%)	443 (14.8%)	3,000 (100%)

Results

Accuracy (%) by Model, Category & Hop Level

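Model	Overall	State Mutation 2h	State Mutation 3h	State Mutation 4h	Causal Inference 2h	Causal Inference 3h	Causal Inference 4h	Global Summary 2h	Global Summary 3h	Global Summary 4h	Visual Tracking 2h	Visual Tracking 3h	Visual Tracking 4h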
Closed-Source LLMs
GPT-5	42.43	38.24	36.13	22.22	47.71	43.97	39.58	44.34	35.36	29.52	49.77	37.50	29.51
Gemini 3 Pro	30.97	30.25	18.49	12.50	34.17	20.92	17.36	36.72	20.44	15.66	45.48	25.00	18.03
GPT-4o	19.20	15.55	14.29	12.50	20.18	12.77	11.81	19.73	13.81	11.45	29.40	20.59	11.48
GPT-4-mini	18.27	15.97	5.93	4.17	15.14	10.99	6.25	20.31	16.02	12.65	31.35	20.59	11.48
Open-Source LLMs
Qwen3-VL-32B	29.59	29.74	27.97	15.49	29.26	22.86	18.44	34.19	20.99	16.46	40.43	25.93	22.95
Qwen2.5-VL-72B	25.30	23.95	17.65	12.50	26.38	21.99	15.97	29.49	20.44	15.06	34.00	22.79	9.84
Qwen3-VL-8B	18.58	16.67	12.71	9.72	14.81	11.43	11.19	20.59	16.67	15.34	28.84	17.78	15.25
Qwen2.5-7B	11.10	10.92	4.20	4.17	8.72	5.32	4.86	15.82	7.73	3.61	18.99	7.35	6.56
Qwen2.5-VL-7B	10.41	7.73	7.69	4.35	7.64	5.05	2.82	13.83	7.39	5.00	18.50	10.29	4.92
Llama-3-8B	7.73	6.72	5.88	1.39	7.57	4.96	4.86	8.20	6.08	4.22	12.71	7.35	1.64

Key Findings: All agents use the same tool set and interact with the same retrieval system, so performance differences primarily reflect their ability to formulate effective queries and plan multi-step evidence acquisition.

Accuracy drops as hop count increases for every model and category in the table, indicating that each additional hop makes retrieval planning harder. Even GPT-5, the strongest model, reaches only 42.43% overall, while oracle experiments with golden evidence clips yield near-perfect accuracy, pointing to retrieval planning as the primary bottleneck.
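The hop-count trend can be checked directly against the reported numbers. The sketch below copies the 2h/3h/4h accuracies for the strongest and weakest models from the results table and tests whether each row is strictly decreasing. The helper name `strictly_decreasing` is introduced for this example only.

```python
from typing import Dict, Sequence, Tuple

# (model, category) -> accuracies at 2, 3 and 4 hops, copied from the results table.
rows: Dict[Tuple[str, str], Tuple[float, float, float]] = {
    ("GPT-5", "State Mutation"): (38.24, 36.13, 22.22),
    ("GPT-5", "Causal Inference"): (47.71, 43.97, 39.58),
    ("GPT-5", "Global Summary"): (44.34, 35.36, 29.52),
    ("GPT-5", "Visual Tracking"): (49.77, 37.50, 29.51),
    ("Llama-3-8B", "State Mutation"): (6.72, 5.88, 1.39),
    ("Llama-3-8B", "Causal Inference"): (7.57, 4.96, 4.86),
    ("Llama-3-8B", "Global Summary"): (8.20, 6.08, 4.22),
    ("Llama-3-8B", "Visual Tracking"): (12.71, 7.35, 1.64),
}


def strictly_decreasing(values: Sequence[float]) -> bool:
    """True if each value is smaller than the one before it."""
    return all(a > b for a, b in zip(values, values[1:]))


# Rows where accuracy does not fall at every additional hop.
violations = [key for key, accs in rows.items() if not strictly_decreasing(accs)]
```

For these eight rows `violations` is empty. The same check can be extended to the remaining models by adding their rows.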

Standard vs. Oracle (Golden Clips) Accuracy

Model	Standard Acc (%)	Oracle Acc (%)	Gap (Δ)
Closed-Source LLMs
GPT-5	42.43	100.00	57.57
Gemini 3 Pro	30.97	99.97	69.00
GPT-4o	19.20	99.40	80.20
GPT-4-mini	18.27	98.73	80.46
Open-Source LLMs
Qwen3-VL-32B	29.59	98.56	68.97
Qwen3-VL-8B	18.58	96.90	78.32
Qwen2.5-VL-72B	25.30	98.60	73.30
Qwen2.5-VL-7B	10.41	97.23	86.82
Qwen2.5-7B	11.10	97.33	86.23
Llama-3-8B	7.73	96.89	89.16

Oracle Analysis: When agents receive golden evidence clips directly, all models achieve near-perfect accuracy (96.89–100.00%), whereas standard accuracy never exceeds 42.43%. This gap (Δ = 57.57–89.16 points) indicates that retrieval planning, rather than answer generation, is the primary bottleneck in multi-hop long-video QA.
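The Gap (Δ) column is simply oracle accuracy minus standard accuracy. As a small worked example, the sketch below recomputes it for three rows of the table. The dictionary layout is an assumption made for illustration.

```python
# model -> (standard accuracy %, oracle accuracy %), copied from the table above.
scores = {
    "GPT-5": (42.43, 100.00),
    "Gemini 3 Pro": (30.97, 99.97),
    "Llama-3-8B": (7.73, 96.89),
}

# Gap in percentage points, rounded to two decimals as in the table.
gaps = {model: round(oracle - standard, 2) for model, (standard, oracle) in scores.items()}

# The model with the largest gap loses the most accuracy to retrieval planning.
largest_gap_model = max(gaps, key=gaps.get)
```

The recomputed values match the table (57.57, 69.00 and 89.16 points), with Llama-3-8B showing the largest gap among these three.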

Human Verification vs. LLM Majority Vote

Model	Checked	Disagree	Rate
Closed-Source LLMs
GPT-5	598	3	0.50%
Gemini 3 Pro	601	5	0.83%
GPT-4o	617	6	0.97%
GPT-4-mini	628	6	0.96%
Open-Source LLMs
Qwen3-VL-32B	607	3	0.49%
Qwen3-VL-8B	613	6	0.98%
Qwen2.5-VL-72B	620	4	0.65%
Qwen2.5-VL-7B	628	6	0.96%
Qwen2.5-7B	631	7	1.11%
Llama-3-8B	629	8	1.27%
Overall	6,172	54	0.87%

Evaluation Reliability: Across 6,172 human-verified samples, the three-judge LLM majority vote disagrees with human annotators in only 54 cases (0.87%), indicating that the automatic evaluation protocol closely tracks human judgment.
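A minimal sketch of the two quantities involved follows: the majority verdict over three judges, and the disagreement rate against human checks. It assumes each judge returns a boolean correct/incorrect verdict, which the page does not specify, and both function names are hypothetical.

```python
from typing import Sequence


def majority_vote(verdicts: Sequence[bool]) -> bool:
    """Return the verdict shared by more than half of the judges."""
    if len(verdicts) % 2 == 0:
        raise ValueError("use an odd number of judges to avoid ties")
    return sum(verdicts) > len(verdicts) // 2


def disagreement_rate(checked: int, disagree: int) -> float:
    """Percentage of human-checked samples where the vote and the human differ."""
    return 100.0 * disagree / checked


# Two of three judges accept the answer, so the majority verdict is "correct".
final = majority_vote([True, False, True])

# Overall row of the table: 54 disagreements out of 6,172 checked samples.
overall_rate = round(disagreement_rate(6172, 54), 2)
```

With the totals from the table, `overall_rate` reproduces the reported 0.87%.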

Standardized Tools

All agents interact with LongVidSearch through the same fixed tool interface:

Search_Clips_In_Video(video_id, query, top_k)

Retrieves the top_k most relevant clips for a textual query within the given video.

Get_Clip_Detail(clip_id)

Returns a high-quality caption for the queried clip (used as evidence).

FINAL_ANSWER(answer_text, evidence_clip_ids)

Submits the answer and the list of viewed evidence clip IDs.

This fixed interface ensures performance differences primarily reflect agentic retrieval planning, not retriever strength or privileged evidence access.
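To make the interaction pattern concrete, the sketch below runs a scripted agent against toy stand-ins for the three tools. Only the tool names and parameter names come from the interface above. The return types, the in-memory `CAPTIONS` store, the word-overlap retriever, the clip ID scheme, and the `run_agent` helper are all assumptions made for illustration, and a real agent would use an LLM to choose its queries and compose its answer.

```python
from typing import Dict, List, Tuple

# Hypothetical in-memory backend; the real retrieval system is not described here.
CAPTIONS: Dict[str, Dict[str, str]] = {
    "vid_0": {
        "vid_0_c1": "A man places a red cup on the table.",
        "vid_0_c2": "A cat knocks the red cup onto the floor.",
        "vid_0_c3": "The man sweeps up broken pieces.",
    }
}


def Search_Clips_In_Video(video_id: str, query: str, top_k: int) -> List[str]:
    """Toy retriever: rank clips by how many query words appear in the caption."""
    words = set(query.lower().split())
    clips = CAPTIONS[video_id]

    def overlap(clip_id: str) -> int:
        return len(words & set(clips[clip_id].lower().replace(".", "").split()))

    return sorted(clips, key=overlap, reverse=True)[:top_k]


def Get_Clip_Detail(clip_id: str) -> str:
    """Return the caption for a clip (toy IDs are '<video_id>_<clip>')."""
    video_id = clip_id.rsplit("_", 1)[0]
    return CAPTIONS[video_id][clip_id]


def FINAL_ANSWER(answer_text: str, evidence_clip_ids: List[str]) -> dict:
    """Package the submission; the real submission format is an assumption."""
    return {"answer": answer_text, "evidence": list(evidence_clip_ids)}


def run_agent(video_id: str, queries: List[str], top_k: int = 1) -> Tuple[dict, int]:
    """Issue one scripted query per hop, read each new clip, then submit."""
    evidence: List[str] = []
    notes: List[str] = []
    tool_calls = 0  # counts search and caption calls only
    for query in queries:
        hits = Search_Clips_In_Video(video_id, query, top_k)
        tool_calls += 1
        for clip_id in hits:
            if clip_id in evidence:
                continue  # skip clips that were already read
            notes.append(Get_Clip_Detail(clip_id))
            tool_calls += 1
            evidence.append(clip_id)
    # Placeholder answer: concatenate the captions instead of calling an LLM.
    return FINAL_ANSWER(" ".join(notes), evidence), tool_calls


result, calls = run_agent("vid_0", ["red cup table", "cat knocks cup"])
```

In this toy run the agent retrieves one clip per hop, reads both captions, and submits them as evidence after four tool calls. Swapping the scripted queries for an LLM planner changes only how `queries` are produced, which is the part of the pipeline the fixed interface is designed to isolate.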

BibTeX

@inproceedings{longvidsearch2026,
  title     = {LongVidSearch: An Agentic Benchmark for Multi-hop Evidence Retrieval Planning in Long Videos},
  author    = {Rongyi Yu and Chenyuan Duan and Hao Liang and Haoze Sun and Peng Pei},
  booktitle = {ACM Multimedia},
  year      = {2026}
}