Seeing Far and Clearly: Mitigating Hallucinations in MLLMs with Attention Causal Decoding

Feilong Tang, Chengzhi Liu, Zhongxing Xu, Ming Hu, Zile Huang, Haochen Xue, Ziyang Chen, Zelin Peng, Zhiwei Yang, Sijin Zhou, Wenxue Li, Yulong Li, Wenxuan Song, Shiyan Su, Wei Feng, Jionglong Su, Minquan Lin, Yifan Peng, Xuelian Cheng, Imran Razzak*, Zongyuan Ge*

*Corresponding author for this work

    Research output: Contribution to journal › Conference article › peer-review

    1 Citation (Scopus)

    Abstract

    Recent advancements in multimodal large language models (MLLMs) have significantly improved performance in visual question answering. However, they often suffer from hallucinations. In this work, hallucinations are categorized into two main types: initial hallucinations and snowball hallucinations. We argue that adequate contextual information can be extracted directly from the token interaction process. Inspired by causal inference in the decoding strategy, we propose to leverage causal masks to establish information propagation between multimodal tokens. The hypothesis is that insufficient interaction between those tokens may lead the model to rely on outlier tokens, overlooking dense and rich contextual cues. Therefore, we propose to intervene in the propagation process by tackling outlier tokens to enhance in-context inference. With this goal, we present FarSight, a versatile plug-and-play decoding strategy to reduce attention interference from outlier tokens merely by optimizing the causal mask. The heart of our method is effective token propagation. We design an attention register structure within the upper triangular matrix of the causal mask, dynamically allocating attention to capture attention diverted to outlier tokens. Moreover, a positional awareness encoding method with a diminishing masking rate is proposed, allowing the model to attend to further preceding tokens, especially for video sequence tasks.
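    The abstract gives no formulas, so the following is only a rough sketch of how a register in the upper triangle of the causal mask could absorb attention. It assumes that future slots hold finite scores that decay with distance (a loose reading of the "diminishing masking rate") and that these slots are zeroed after the softmax. The function names, the `decay` parameter, and the linear decay form are hypothetical and are not taken from the paper.

```python
import math


def softmax(row):
    """Numerically stable softmax over a list of floats."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def causal_attention(scores):
    """Standard causal attention: future slots are set to -inf before softmax."""
    n = len(scores)
    out = []
    for i in range(n):
        row = [scores[i][j] if j <= i else -math.inf for j in range(n)]
        out.append(softmax(row))
    return out


def register_attention(scores, decay=1.0):
    """Hypothetical register variant (an assumption, not the paper's formula).

    Future slots (j > i) hold finite scores -decay * (j - i) instead of -inf,
    so they take part in the softmax and soak up some probability mass.
    They are zeroed afterwards, so no information flows from future tokens
    and each row of visible weights sums to at most 1.
    """
    n = len(scores)
    out = []
    for i in range(n):
        row = [scores[i][j] if j <= i else -decay * (j - i) for j in range(n)]
        weights = softmax(row)
        out.append([w if j <= i else 0.0 for j, w in enumerate(weights)])
    return out


# Toy scores: column 0 plays the role of an outlier token with large scores.
scores = [
    [4.0, 0.0, 0.0, 0.0],
    [4.0, 0.5, 0.0, 0.0],
    [4.0, 0.2, 0.4, 0.0],
    [4.0, 0.1, 0.3, 0.6],
]

plain = causal_attention(scores)
with_register = register_attention(scores)
for i in range(len(scores)):
    print(i, round(sum(plain[i]), 3), round(sum(with_register[i]), 3))
```

    In this toy setup the visible weights of every row except the last sum to less than 1, because part of the mass lands in the register slots and is discarded. Whether FarSight renormalises, how it shapes the register values, and how the positional encoding interacts with them are all questions the abstract leaves open.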

    Original language: English
    Pages (from-to): 26147-26159
    Number of pages: 13
    Journal: Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition
    Publication status: Published - 2025
    Event: 2025 IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2025 - Nashville, United States
    Duration: 11 Jun 2025 to 15 Jun 2025

    Keywords

    • With extensive experiments, FarSight demonstrates significant hallucination-mitigating performance across different MLLMs on both image and video benchmarks, proving its effectiveness.
