DetailVerifyBench: A Benchmark for Dense Hallucination Localization in Long Image Captions

Xinran Wang, Yuxuan Zhang, Xiao Zhang, Haolong Yan, Muxi Diao, Songyu Xu, Zhonghao Yan, Hongbing Li, Kongming Liang*, Zhanyu Ma
Beijing University of Posts and Telecommunications, Beijing, China
Equal contribution * Corresponding author

Abstract

Accurately detecting and localizing hallucinations is critical to ensuring that image captions are accurate. In the era of Multimodal Large Language Models (MLLMs), captions have evolved from brief sentences into comprehensive narratives that often run to hundreds of words. This shift makes the task substantially harder: models must now pinpoint specific erroneous spans or words within extensive contexts, rather than merely flagging response-level inconsistencies. However, existing benchmarks lack the annotation granularity and domain diversity required to evaluate this capability. To bridge this gap, we introduce DetailVerifyBench, a rigorous benchmark comprising 1,000 high-quality images across five distinct domains. With an average caption length of over 200 words and dense, token-level annotations of multiple hallucination types, it stands as the most challenging benchmark to date for precise hallucination localization in long image captions.

Figure: Word cloud of hallucinated tokens across all 1,000 captions, sized by frequency.

1,000 Images
5 Domains
200+ Caption Avg. Words
10 Hallucination Types

Benchmark Overview

Domain  Source                    #Img  Avg. Len  #Hallu. Locations  Hallu. Rate
GUI     Screenspot Pro            200   196       274                68%
Nature  DOCCI                     200   148       69                 26%
Chart   Echarts Examples          200   197       192                41%
Movie   CineTechBench, ShotBench  200   214       613                88%
Poster  IMDB, Movie Poster 100k   200   257       576                90%
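
The headline figures can be cross-checked against the per-domain rows in the table above. The sketch below is only an illustration: it assumes "Avg. Len" is measured in words, and because the per-domain averages are already rounded, the overall mean is approximate. The variable names are hypothetical and not part of any released tooling.

```python
# Per-domain rows copied from the table above:
# (domain, images, average caption length, hallucination locations).
# Assumption: "Avg. Len" is a word count.
rows = [
    ("GUI", 200, 196, 274),
    ("Nature", 200, 148, 69),
    ("Chart", 200, 197, 192),
    ("Movie", 200, 214, 613),
    ("Poster", 200, 257, 576),
]

total_images = sum(n for _, n, _, _ in rows)
total_locations = sum(loc for _, _, _, loc in rows)

# Image-weighted mean caption length. Every domain has 200 images,
# so this equals the plain mean of the five per-domain averages.
mean_len = sum(n * length for _, n, length, _ in rows) / total_images

# Average number of annotated hallucination locations per image, by domain.
locations_per_image = {domain: loc / n for domain, n, _, loc in rows}

print(total_images, total_locations, mean_len)
```

Under these assumptions the rows sum to 1,000 images and 1,724 hallucination locations, with a mean caption length of about 202 words, which agrees with the "200+" figure quoted above. The per-image density ranges from roughly 0.3 locations (Nature) to roughly 3.1 (Movie).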

Benchmark Construction Pipeline

Pipeline Step 1

Adversarial Hallucination Injection Pipeline

Model Localization Results

Leaderboard

Domain scores show token-level F1 per domain.
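
The page does not spell out how token-level F1 is computed. A minimal sketch, assuming the standard definition over sets of hallucinated token indices, might look like the following. The function name `token_f1`, the index-set representation, and the handling of empty inputs are all assumptions for illustration, not the benchmark's official scorer.

```python
def token_f1(predicted: set[int], gold: set[int]) -> float:
    """F1 between predicted and gold sets of hallucinated token indices."""
    if not predicted and not gold:
        # Assumption: a caption with no hallucinations and no flags scores 1.0.
        return 1.0
    true_pos = len(predicted & gold)
    if true_pos == 0:
        return 0.0
    precision = true_pos / len(predicted)  # share of flagged tokens that are wrong
    recall = true_pos / len(gold)          # share of wrong tokens that were flagged
    return 2 * precision * recall / (precision + recall)


# Hypothetical example: the annotators marked tokens 3, 4 and 10;
# the model flagged tokens 4, 10, 11 and 12.
gold = {3, 4, 10}
predicted = {4, 10, 11, 12}
score = token_f1(predicted, gold)
```

In this example two of the four flagged tokens are correct (precision 0.5) and two of the three annotated tokens are found (recall 2/3), giving an F1 of 4/7, about 0.57. How per-caption scores are aggregated into a domain score (for instance, pooled over all tokens or averaged per caption) is not stated on this page, so the sketch leaves that step out.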

Citation

@misc{wang2026detailverifybenchbenchmarkdensehallucination,
      title={DetailVerifyBench: A Benchmark for Dense Hallucination Localization in Long Image Captions},
      author={Xinran Wang and Yuxuan Zhang and Xiao Zhang and Haolong Yan and Muxi Diao and Songyu Xu and Zhonghao Yan and Hongbing Li and Kongming Liang and Zhanyu Ma},
      year={2026},
      eprint={2604.05623},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2604.05623},
}