Accurately detecting and localizing hallucinations is critical to ensuring the reliability of image captions. In the era of Multimodal Large Language Models (MLLMs), captions have evolved from brief sentences into comprehensive narratives, often exceeding hundreds of words. This shift substantially raises the difficulty: models must now pinpoint specific erroneous spans or words within extensive contexts, rather than merely flagging response-level inconsistencies. However, existing benchmarks lack the granularity and domain diversity required to evaluate this capability. To bridge this gap, we introduce DetailVerifyBench, a rigorous benchmark comprising 1,000 high-quality images across five distinct domains. With an average caption length of over 200 words and dense, token-level annotations of multiple hallucination types, it is, to date, the most challenging benchmark for precise hallucination localization in long image captions.
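To picture what dense, token-level annotation looks like, here is a minimal hypothetical record. The field names, the example caption, and the type label are illustrative assumptions, not the benchmark's released schema:

```python
# Hypothetical annotation record, for illustration only; field names and
# values are assumptions, not the benchmark's released schema. Each
# hallucinated span is marked with token offsets into the caption plus a
# hallucination-type label.
record = {
    "image_id": "poster_0042",
    "domain": "Poster",
    "caption": "A weathered detective stands beneath a red neon sign ...",
    "hallucinations": [
        {
            "span": "red neon sign",
            "start_token": 6,   # token offsets into the caption (inclusive)
            "end_token": 8,
            "type": "attribute",
        },
    ],
}
```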
| Domain | Source | #Images | Avg. Caption Length (words) | #Hallucinated Words | Hallucination Rate |
|---|---|---|---|---|---|
| GUI | Screenspot Pro | 200 | 196 | 425 | 68% |
| Nature | DOCCI | 200 | 148 | 173 | 26% |
| Chart | Echarts Examples | 200 | 197 | 322 | 41% |
| Movie | CineTechBench + ShotBench | 200 | 214 | 1,094 | 88% |
| Poster | IMDB + Movie Poster 100k | 200 | 257 | 1,235 | 90% |
Evaluation is reported as token-level F1 per domain.
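For concreteness, below is a minimal sketch of how token-level F1 might be computed over sets of hallucinated-token indices. The exact matching scheme (index sets vs. character spans, handling of hallucination-free captions) is an assumption, not the benchmark's official scorer:

```python
def token_f1(pred: set[int], gold: set[int]) -> float:
    """Token-level F1 between predicted and gold hallucinated token indices."""
    if not pred and not gold:
        return 1.0  # model correctly reports a hallucination-free caption
    tp = len(pred & gold)                        # correctly flagged tokens
    precision = tp / len(pred) if pred else 0.0
    recall = tp / len(gold) if gold else 0.0
    return 0.0 if tp == 0 else 2 * precision * recall / (precision + recall)

# Hypothetical example: gold annotation marks tokens 6-9 as hallucinated;
# the model flags tokens 6, 7, and 20.
print(round(token_f1({6, 7, 20}, {6, 7, 8, 9}), 3))  # 0.571
```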
If you find DetailVerifyBench useful, please cite:

@misc{detailverifybench,
  title={DetailVerifyBench: A Benchmark for Dense Hallucination Localization in Long Image Captions},
  author={Xinran Wang and Yuxuan Zhang and Xiao Zhang and Haolong Yan and Muxi Diao and Songyu Xu and Zhonghao Yan and Hongbing Li and Kongming Liang and Zhanyu Ma},
  year={2025},
  url={https://github.com/zyx-hhnkh/DetailVerifyBench}
}