Image to Label to Answer: An Efficient Framework for Enhanced Clinical Applications in Medical Visual Question Answering

Jianfeng Wang, Kah Phooi Seng*, Yi Shen, Li Minn Ang, Difeng Huang

*Corresponding author for this work

Research output: Contribution to journal › Article › peer-review

Abstract

Medical Visual Question Answering (Med-VQA) faces significant limitations in application development because medical data are sparse and costly to acquire. Existing approaches rely on end-to-end multi-modal learning to equip models with both medical image inference and natural language understanding, which further amplifies the demand on already scarce Med-VQA data and hinders clinical application and advancement. This paper proposes the ITLTA (Image to Label to Answer) framework for Med-VQA, designed around the practical requirements of the field. ITLTA combines multi-label learning of medical images with the language understanding and reasoning capabilities of large language models (LLMs) to achieve zero-shot learning, meeting the needs of the natural language module without end-to-end training. This approach reduces deployment costs and training data requirements, allowing LLMs to function as flexible, plug-and-play modules. To improve multi-label classification accuracy, the framework pretrains on external medical image data and integrates a joint feature and label attention mechanism, ensuring robust performance and applicability even with limited data. Additionally, the framework makes the decision-making process explicit through visual labels and question prompts, enhancing the interpretability of Med-VQA. Validated on the VQA-Med 2019 dataset, our method outperforms existing approaches, confirming its effectiveness for enhanced clinical applications.
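To make the image-to-label-to-answer flow described in the abstract concrete, the sketch below shows one possible wiring of the pipeline under broad assumptions: a plain torchvision ResNet-18 stands in for the paper's pretrained, attention-augmented multi-label classifier, predicted labels and the question are formatted into a prompt, and a hypothetical `query_llm` function stands in for whichever plug-and-play LLM is used. The label set, threshold, and all function names are illustrative, not the authors' implementation.

```python
# Minimal sketch of the Image-to-Label-to-Answer flow (illustrative only).
# Assumptions: a torchvision ResNet-18 replaces the paper's pretrained,
# attention-augmented multi-label classifier; `query_llm` is a placeholder
# for the plug-and-play LLM; labels and threshold are invented.
import torch
import torch.nn as nn
from torchvision.models import resnet18

LABELS = ["ct", "mri", "abdomen", "chest", "axial plane"]  # hypothetical label set
THRESHOLD = 0.5  # hypothetical decision threshold

class MultiLabelClassifier(nn.Module):
    """Image -> independent sigmoid scores, one per medical label."""
    def __init__(self, num_labels: int):
        super().__init__()
        self.backbone = resnet18(weights=None)  # external pretraining would load weights here
        self.backbone.fc = nn.Linear(self.backbone.fc.in_features, num_labels)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return torch.sigmoid(self.backbone(x))  # multi-label probabilities

def labels_from_image(model: nn.Module, image: torch.Tensor) -> list[str]:
    """Keep every label whose predicted probability clears the threshold."""
    with torch.no_grad():
        probs = model(image.unsqueeze(0)).squeeze(0)
    return [name for name, p in zip(LABELS, probs.tolist()) if p >= THRESHOLD]

def build_prompt(labels: list[str], question: str) -> str:
    """Turn predicted labels plus the question into a zero-shot LLM prompt."""
    return (
        "The medical image shows the following findings: "
        f"{', '.join(labels) or 'no confident findings'}.\n"
        f"Question: {question}\nAnswer briefly:"
    )

def query_llm(prompt: str) -> str:
    """Placeholder for the plug-and-play LLM call (API or local model)."""
    return f"[LLM answer for prompt: {prompt!r}]"

if __name__ == "__main__":
    model = MultiLabelClassifier(num_labels=len(LABELS)).eval()
    dummy_image = torch.rand(3, 224, 224)  # stands in for a real scan
    labels = labels_from_image(model, dummy_image)
    answer = query_llm(build_prompt(labels, "What imaging modality was used?"))
    print(answer)
```

Because the answer is assembled from explicit labels and a visible prompt rather than an opaque joint embedding, the intermediate evidence stays human-readable, which is the interpretability property the abstract highlights.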

Original language: English
Article number: 2273
Journal: Electronics (Switzerland)
Volume: 13
Issue number: 12
DOIs
Publication status: Published - Jun 2024

Keywords

  • attention mechanisms
  • large language models (LLMs)
  • medical visual question answering (Med-VQA)
  • multi-label learning
  • zero-shot learning
