DetailVerifyBench

Abstract

Accurately detecting and localizing hallucinations is critical to keeping image captions reliable. In the era of Multimodal Large Language Models (MLLMs), captions have evolved from brief sentences into comprehensive narratives, often running to hundreds of words. This shift makes the task substantially harder: models must now pinpoint specific erroneous spans or words within long contexts, rather than merely flagging response-level inconsistencies. However, existing benchmarks lack the granularity and domain diversity required to evaluate this capability. To bridge this gap, we introduce DetailVerifyBench, a rigorous benchmark comprising 1,000 high-quality images across five distinct domains. With an average caption length of over 200 words and dense, token-level annotations of multiple hallucination types, it stands as the most challenging benchmark to date for precise hallucination localization in long image captions.

Word cloud of hallucinated tokens across all 1,000 captions, sized by frequency.

1,000 Images

5 Domains

200+ Avg. Words per Caption

10 Hallucination Types

Benchmark Overview

Domain	Source	#Images	Avg. Len (words)	#Hallu. Locations	Hallu. Rate
GUI	Screenspot Pro	200	196	274	68%
Nature	DOCCI	200	148	69	26%
Chart	Echarts Examples	200	197	192	41%
Movie	CineTechBench, ShotBench	200	214	613	88%
Poster	IMDB, Movie Poster 100k	200	257	576	90%
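As a quick consistency check, the per-domain figures can be aggregated to recover the headline numbers. The sketch below counts each of the five domains once and copies its values from the table. The variable names are illustrative, and the interpretation of "Avg. Len" as words per caption is taken from the abstract.

```python
# Per-domain figures copied from the overview table:
# domain -> (images, average caption length in words, hallucination locations)
DOMAINS = {
    "GUI": (200, 196, 274),
    "Nature": (200, 148, 69),
    "Chart": (200, 197, 192),
    "Movie": (200, 214, 613),
    "Poster": (200, 257, 576),
}

total_images = sum(n for n, _, _ in DOMAINS.values())
total_locations = sum(loc for _, _, loc in DOMAINS.values())

# Image-weighted mean caption length across all domains.
mean_len = sum(n * length for n, length, _ in DOMAINS.values()) / total_images

# Average number of annotated hallucination locations per image.
locations_per_image = total_locations / total_images

print(total_images, total_locations, round(mean_len, 1), round(locations_per_image, 3))
```

Under these assumptions the totals come to 1,000 images and 1,724 hallucination locations, with a mean caption length of 202.4 words, which agrees with the "over 200 words" figure in the abstract.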

Benchmark Construction Pipeline

Adversarial Hallucination Injection Pipeline

Model Localization Results

Leaderboard

Domain scores show token-level F1 per domain.
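For readers unfamiliar with the metric, here is a minimal sketch of token-level F1 for a single caption. It assumes that gold annotations and model predictions are both given as sets of hallucinated token indices. The function name `token_f1`, the index-set representation, and the handling of the empty case are assumptions made for illustration; the benchmark's own scoring script may tokenize and aggregate differently.

```python
def token_f1(predicted: set[int], gold: set[int]) -> tuple[float, float, float]:
    """Return (precision, recall, F1) over hallucinated token indices.

    Assumption: when a caption has no gold hallucinations and the model
    flags nothing, the prediction is scored as perfect (1.0).
    """
    if not predicted and not gold:
        return 1.0, 1.0, 1.0

    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0

    if precision + recall == 0.0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Hypothetical example: three gold hallucinated tokens, four flagged by a model.
gold_tokens = {12, 13, 47}
predicted_tokens = {12, 13, 20, 21}
precision, recall, f1 = token_f1(predicted_tokens, gold_tokens)
print(f"P={precision:.3f} R={recall:.3f} F1={f1:.3f}")
```

In this example two of the four flagged tokens are correct, so precision is 0.5, recall is 2/3, and F1 is 4/7. A per-domain score could then be obtained by averaging over that domain's captions or by pooling token counts before computing F1; which of the two the leaderboard uses is not stated here.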

Citation

@misc{wang2026detailverifybenchbenchmarkdensehallucination,
      title={DetailVerifyBench: A Benchmark for Dense Hallucination Localization in Long Image Captions},
      author={Xinran Wang and Yuxuan Zhang and Xiao Zhang and Haolong Yan and Muxi Diao and Songyu Xu and Zhonghao Yan and Hongbing Li and Kongming Liang and Zhanyu Ma},
      year={2026},
      eprint={2604.05623},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2604.05623},
}