TY - JOUR
T1 - Seeing Far and Clearly
T2 - 2025 IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2025
AU - Tang, Feilong
AU - Liu, Chengzhi
AU - Xu, Zhongxing
AU - Hu, Ming
AU - Huang, Zile
AU - Xue, Haochen
AU - Chen, Ziyang
AU - Peng, Zelin
AU - Yang, Zhiwei
AU - Zhou, Sijin
AU - Li, Wenxue
AU - Li, Yulong
AU - Song, Wenxuan
AU - Su, Shiyan
AU - Feng, Wei
AU - Su, Jionglong
AU - Lin, Minquan
AU - Peng, Yifan
AU - Cheng, Xuelian
AU - Razzak, Imran
AU - Ge, Zongyuan
N1 - Publisher Copyright:
© 2025 IEEE.
PY - 2025
Y1 - 2025
N2 - Recent advancements in multimodal large language models (MLLMs) have significantly improved performance in visual question answering. However, they often suffer from hallucinations. In this work, hallucinations are categorized into two main types: initial hallucinations and snowball hallucinations. We argue that adequate contextual information can be extracted directly from the token interaction process. Inspired by causal inference in the decoding strategy, we propose to leverage causal masks to establish information propagation between multimodal tokens. The hypothesis is that insufficient interaction between those tokens may lead the model to rely on outlier tokens, overlooking dense and rich contextual cues. Therefore, we propose to intervene in the propagation process by tackling outlier tokens to enhance in-context inference. With this goal, we present FarSight, a versatile plug-and-play decoding strategy to reduce attention interference from outlier tokens merely by optimizing the causal mask. The heart of our method is effective token propagation. We design an attention register structure within the upper triangular matrix of the causal mask, dynamically allocating attention to capture attention diverted to outlier tokens. Moreover, a positional awareness encoding method with a diminishing masking rate is proposed, allowing the model to attend to further preceding tokens, especially for video sequence tasks. With extensive experiments, FarSight demonstrates significant hallucination-mitigating performance across different MLLMs on both image and video benchmarks, proving its effectiveness.
AB - Recent advancements in multimodal large language models (MLLMs) have significantly improved performance in visual question answering. However, they often suffer from hallucinations. In this work, hallucinations are categorized into two main types: initial hallucinations and snowball hallucinations. We argue that adequate contextual information can be extracted directly from the token interaction process. Inspired by causal inference in the decoding strategy, we propose to leverage causal masks to establish information propagation between multimodal tokens. The hypothesis is that insufficient interaction between those tokens may lead the model to rely on outlier tokens, overlooking dense and rich contextual cues. Therefore, we propose to intervene in the propagation process by tackling outlier tokens to enhance in-context inference. With this goal, we present FarSight, a versatile plug-and-play decoding strategy to reduce attention interference from outlier tokens merely by optimizing the causal mask. The heart of our method is effective token propagation. We design an attention register structure within the upper triangular matrix of the causal mask, dynamically allocating attention to capture attention diverted to outlier tokens. Moreover, a positional awareness encoding method with a diminishing masking rate is proposed, allowing the model to attend to further preceding tokens, especially for video sequence tasks. With extensive experiments, FarSight demonstrates significant hallucination-mitigating performance across different MLLMs on both image and video benchmarks, proving its effectiveness.
UR - https://www.scopus.com/pages/publications/105017091359
U2 - 10.1109/CVPR52734.2025.02435
DO - 10.1109/CVPR52734.2025.02435
M3 - Conference article
AN - SCOPUS:105017091359
SN - 1063-6919
SP - 26147
EP - 26159
JO - Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition
JF - Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition
Y2 - 11 June 2025 through 15 June 2025
ER -