Accurately detecting and localizing hallucinations is critical to ensuring the reliability of image captions. In the era of Multimodal Large Language Models (MLLMs), captions have evolved from brief sentences into comprehensive narratives, often exceeding hundreds of words. This shift substantially raises the difficulty: models must now pinpoint specific erroneous spans or words within extensive contexts, rather than merely flagging response-level inconsistencies. However, existing benchmarks lack the granularity and domain diversity required to evaluate this capability. To bridge this gap, we introduce DetailVerifyBench, a rigorous benchmark comprising 1,000 high-quality images across five distinct domains. With an average caption length of over 200 words and dense, token-level annotations of multiple hallucination types, it is, to date, the most challenging benchmark for precise hallucination localization in long image captions.
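To picture what dense, token-level annotation looks like, here is a minimal hypothetical record. The field names, the example caption, and the type label are illustrative assumptions, not the benchmark's released schema:

```python
# Hypothetical annotation record, for illustration only; field names and
# values are assumptions, not the benchmark's released schema. Each
# hallucinated span is marked with token offsets into the caption plus a
# hallucination-type label.
record = {
    "image_id": "poster_0042",
    "domain": "Poster",
    "caption": "A weathered detective stands beneath a red neon sign ...",
    "hallucinations": [
        {
            "span": "red neon sign",
            "start_token": 6,   # token offsets into the caption (inclusive)
            "end_token": 8,
            "type": "attribute",
        },
    ],
}
```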
| Domain | Source | #Images | Avg. Caption Length (words) | #Hallucinated Words | Hallucination Rate |
|---|---|---|---|---|---|
| GUI | Screenspot Pro | 200 | 196 | 425 | 68% |
| Nature | DOCCI | 200 | 148 | 173 | 26% |
| Chart | Echarts Examples | 200 | 197 | 322 | 41% |
| Movie | CineTechBench + ShotBench | 200 | 214 | 1,094 | 88% |
| Poster | IMDB + Movie Poster 100k | 200 | 257 | 1,235 | 90% |
Evaluation is reported as token-level F1 per domain.
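For concreteness, below is a minimal sketch of how token-level F1 might be computed over sets of hallucinated-token indices. The exact matching scheme (index sets vs. character spans, handling of hallucination-free captions) is an assumption, not the benchmark's official scorer:

```python
def token_f1(pred: set[int], gold: set[int]) -> float:
    """Token-level F1 between predicted and gold hallucinated token indices."""
    if not pred and not gold:
        return 1.0  # model correctly reports a hallucination-free caption
    tp = len(pred & gold)                        # correctly flagged tokens
    precision = tp / len(pred) if pred else 0.0
    recall = tp / len(gold) if gold else 0.0
    return 0.0 if tp == 0 else 2 * precision * recall / (precision + recall)

# Hypothetical example: gold annotation marks tokens 6-9 as hallucinated;
# the model flags tokens 6, 7, and 20.
print(round(token_f1({6, 7, 20}, {6, 7, 8, 9}), 3))  # 0.571
```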
If you find DetailVerifyBench useful, please cite:

@misc{detailverifybench,
  title={DetailVerifyBench: A Benchmark for Dense Hallucination Localization in Long Image Captions},
  author={Xinran Wang and Yuxuan Zhang and Xiao Zhang and Haolong Yan and Muxi Diao and Songyu Xu and Zhonghao Yan and Hongbing Li and Kongming Liang and Zhanyu Ma},
  year={2025},
  url={https://github.com/zyx-hhnkh/DetailVerifyBench}
}