UniRaG: Unification, Retrieval, and Generation for Multimodal Question Answering With Pre-Trained Language Models

Qi Zhi Lim; Chin Poo Lee; Kian Ming Lim; Ahmad Kamsani Samingan

doi:10.1109/ACCESS.2024.3403101

Qi Zhi Lim, Chin Poo Lee^*, Kian Ming Lim, Ahmad Kamsani Samingan

^*Corresponding author for this work

Research output: Contribution to journal › Article › peer-review

1 Citation (Scopus)

Abstract

Multimodal Question Answering (MMQA) has emerged as a challenging frontier at the intersection of natural language processing (NLP) and computer vision, demanding the integration of diverse modalities for effective comprehension and response. While pre-trained language models (PLMs) exhibit impressive performance across a range of NLP tasks, the investigation of text-based approaches to address MMQA represents a compelling and promising avenue for further research and advancement in the field. Although recent research has delved into text-based approaches for MMQA, the attained results have been unsatisfactory, which could be attributed to potential information loss during the knowledge transformation processes. In response, a novel three-stage framework named UniRaG is proposed for tackling MMQA, which encompasses unified knowledge representation, context retrieval, and answer generation. At the initial stage, advanced techniques are employed for unified knowledge representation, including LLaVA for image captioning and table linearization for tabular data, facilitating seamless integration of visual and tabular information into textual representation. For context retrieval, a cross-encoder trained on sequence classification is utilized to predict relevance scores for question-document pairs, and a top-k retrieval strategy is employed to retrieve the documents with the highest relevance scores as the contexts for answer generation. Finally, the answer generation stage is facilitated by a text-to-text PLM, Flan-T5-Base, which follows the encoder-decoder architecture with attention mechanisms. During this stage, uniform prefix conditioning is applied to the input text for enhanced adaptability and generalizability. Moreover, contextual diversity training is introduced to improve model robustness by including distractor documents as negative contexts during training. 
Experimental results on the MultimodalQA dataset demonstrate the superior performance of UniRaG, surpassing the existing state-of-the-art methods across all scenarios with 67.4% EM and 71.3% F1. Overall, UniRaG showcases robustness and reliability in MMQA, heralding significant advancements in multimodal comprehension and question answering research.
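
The three stages named in the abstract (unified representation, top-k context retrieval, prefix-conditioned generation input) can be sketched roughly as follows. This is an illustrative sketch only, not the authors' implementation: the "column is value" linearization template, the token-overlap scorer `overlap_score` (a stand-in for the trained cross-encoder), and the prefix string are all assumptions, since the abstract does not specify them.

```python
import re
from typing import Callable, List, Sequence, Tuple


def linearize_table(header: Sequence[str], rows: Sequence[Sequence[str]]) -> str:
    """Flatten a table into text, one 'column is value' clause per cell.

    The template is an assumption made for illustration.
    """
    sentences = []
    for index, row in enumerate(rows, start=1):
        cells = ", ".join(f"{col} is {val}" for col, val in zip(header, row))
        sentences.append(f"row {index}: {cells}.")
    return " ".join(sentences)


def overlap_score(question: str, document: str) -> float:
    """Placeholder relevance scorer: share of question tokens found in the document.

    The paper uses a cross-encoder here; this function only stands in for it.
    """
    q_tokens = set(re.findall(r"\w+", question.lower()))
    d_tokens = set(re.findall(r"\w+", document.lower()))
    return len(q_tokens & d_tokens) / len(q_tokens) if q_tokens else 0.0


def retrieve_top_k(
    question: str,
    documents: Sequence[str],
    k: int,
    score_fn: Callable[[str, str], float] = overlap_score,
) -> List[Tuple[str, float]]:
    """Score every question-document pair and keep the k highest-scoring documents."""
    scored = [(doc, score_fn(question, doc)) for doc in documents]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


def build_generator_input(
    question: str, contexts: Sequence[str], prefix: str = "answer the question"
) -> str:
    """Prepend one fixed prefix to every input (hypothetical prefix wording)."""
    return f"{prefix}: question: {question} context: {' '.join(contexts)}"


if __name__ == "__main__":
    table_text = linearize_table(["Player", "Goals"], [["Ana", "12"], ["Ben", "7"]])
    corpus = [
        "A photo of a red bridge over a river.",  # e.g. a generated image caption
        table_text,
        "Ben joined the club in 2019.",
    ]
    question = "How many goals did Ana score?"
    top = retrieve_top_k(question, corpus, k=2)
    print(build_generator_input(question, [doc for doc, _ in top]))
```

In a closer approximation, `score_fn` would wrap a trained cross-encoder that scores each question-document pair jointly, and the string returned by `build_generator_input` would be passed to a text-to-text model such as Flan-T5-Base.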

Original language	English
Pages (from-to)	71505-71519
Number of pages	15
Journal	IEEE Access
Volume	12
DOIs	https://doi.org/10.1109/ACCESS.2024.3403101
Publication status	Published - 2024
Externally published	Yes

Keywords

Computer vision
information retrieval
multimodal question answering
natural language processing
pre-trained language models
unified knowledge representation

Access to Document

10.1109/ACCESS.2024.3403101

Cite this

@article{f581db9634fb4ff2b45ace8c8cb106d3,

title = "UniRaG: Unification, Retrieval, and Generation for Multimodal Question Answering With Pre-Trained Language Models",

abstract = "Multimodal Question Answering (MMQA) has emerged as a challenging frontier at the intersection of natural language processing (NLP) and computer vision, demanding the integration of diverse modalities for effective comprehension and response. While pre-trained language models (PLMs) exhibit impressive performance across a range of NLP tasks, the investigation of text-based approaches to address MMQA represents a compelling and promising avenue for further research and advancement in the field. Although recent research has delved into text-based approaches for MMQA, the attained results have been unsatisfactory, which could be attributed to potential information loss during the knowledge transformation processes. In response, a novel three-stage framework named UniRaG is proposed for tackling MMQA, which encompasses unified knowledge representation, context retrieval, and answer generation. At the initial stage, advanced techniques are employed for unified knowledge representation, including LLaVA for image captioning and table linearization for tabular data, facilitating seamless integration of visual and tabular information into textual representation. For context retrieval, a cross-encoder trained on sequence classification is utilized to predict relevance scores for question-document pairs, and a top-k retrieval strategy is employed to retrieve the documents with the highest relevance scores as the contexts for answer generation. Finally, the answer generation stage is facilitated by a text-to-text PLM, Flan-T5-Base, which follows the encoder-decoder architecture with attention mechanisms. During this stage, uniform prefix conditioning is applied to the input text for enhanced adaptability and generalizability. Moreover, contextual diversity training is introduced to improve model robustness by including distractor documents as negative contexts during training. 
Experimental results on the MultimodalQA dataset demonstrate the superior performance of UniRaG, surpassing the existing state-of-the-art methods across all scenarios with 67.4\% EM and 71.3\% F1. Overall, UniRaG showcases robustness and reliability in MMQA, heralding significant advancements in multimodal comprehension and question answering research.",

keywords = "Computer vision, information retrieval, multimodal question answering, natural language processing, pre-trained language models, unified knowledge representation",

author = "Lim, {Qi Zhi} and Lee, {Chin Poo} and Lim, {Kian Ming} and Samingan, {Ahmad Kamsani}",

note = "Publisher Copyright: {\textcopyright} 2013 IEEE.",

year = "2024",

doi = "10.1109/ACCESS.2024.3403101",

language = "English",

volume = "12",

pages = "71505--71519",

journal = "IEEE Access",

issn = "2169-3536",

}

TY - JOUR

T1 - UniRaG: Unification, Retrieval, and Generation for Multimodal Question Answering With Pre-Trained Language Models

AU - Lim, Qi Zhi

AU - Lee, Chin Poo

AU - Lim, Kian Ming

AU - Samingan, Ahmad Kamsani

PY - 2024

Y1 - 2024

N2 - Multimodal Question Answering (MMQA) has emerged as a challenging frontier at the intersection of natural language processing (NLP) and computer vision, demanding the integration of diverse modalities for effective comprehension and response. While pre-trained language models (PLMs) exhibit impressive performance across a range of NLP tasks, the investigation of text-based approaches to address MMQA represents a compelling and promising avenue for further research and advancement in the field. Although recent research has delved into text-based approaches for MMQA, the attained results have been unsatisfactory, which could be attributed to potential information loss during the knowledge transformation processes. In response, a novel three-stage framework named UniRaG is proposed for tackling MMQA, which encompasses unified knowledge representation, context retrieval, and answer generation. At the initial stage, advanced techniques are employed for unified knowledge representation, including LLaVA for image captioning and table linearization for tabular data, facilitating seamless integration of visual and tabular information into textual representation. For context retrieval, a cross-encoder trained on sequence classification is utilized to predict relevance scores for question-document pairs, and a top-k retrieval strategy is employed to retrieve the documents with the highest relevance scores as the contexts for answer generation. Finally, the answer generation stage is facilitated by a text-to-text PLM, Flan-T5-Base, which follows the encoder-decoder architecture with attention mechanisms. During this stage, uniform prefix conditioning is applied to the input text for enhanced adaptability and generalizability. Moreover, contextual diversity training is introduced to improve model robustness by including distractor documents as negative contexts during training. 
Experimental results on the MultimodalQA dataset demonstrate the superior performance of UniRaG, surpassing the existing state-of-the-art methods across all scenarios with 67.4% EM and 71.3% F1. Overall, UniRaG showcases robustness and reliability in MMQA, heralding significant advancements in multimodal comprehension and question answering research.

KW - Computer vision

KW - information retrieval

KW - multimodal question answering

KW - natural language processing

KW - pre-trained language models

KW - unified knowledge representation

UR - http://www.scopus.com/inward/record.url?scp=85194060883&partnerID=8YFLogxK

U2 - 10.1109/ACCESS.2024.3403101

DO - 10.1109/ACCESS.2024.3403101

M3 - Article

AN - SCOPUS:85194060883

SN - 2169-3536

VL - 12

SP - 71505

EP - 71519

JO - IEEE Access

JF - IEEE Access

ER -
