TY - JOUR
T1 - Image to Label to Answer
T2 - An Efficient Framework for Enhanced Clinical Applications in Medical Visual Question Answering
AU - Wang, Jianfeng
AU - Seng, Kah Phooi
AU - Shen, Yi
AU - Ang, Li Minn
AU - Huang, Difeng
N1 - Publisher Copyright:
© 2024 by the authors.
PY - 2024/6
Y1 - 2024/6
N2 - Medical Visual Question Answering (Med-VQA) faces significant limitations in application development due to the scarcity and difficulty of data acquisition. Existing approaches focus on multi-modal learning to equip models with both medical image inference and natural language understanding, but this exacerbates the data scarcity in Med-VQA, hindering clinical application and advancement. This paper proposes the ITLTA framework for Med-VQA, designed around the requirements of the field. ITLTA combines multi-label learning of medical images with the language understanding and reasoning capabilities of large language models (LLMs) to achieve zero-shot learning, meeting the needs of the natural language module without end-to-end training. This approach reduces deployment costs and training data requirements, allowing LLMs to function as flexible, plug-and-play modules. To enhance multi-label classification accuracy, the framework uses external medical image data for pretraining, integrated with a joint feature and label attention mechanism. This configuration ensures robust performance and applicability even with limited data. Additionally, the framework clarifies the decision-making process for visual labels and question prompts, enhancing the interpretability of Med-VQA. Validated on the VQA-Med 2019 dataset, our method demonstrates superior effectiveness compared to existing methods, confirming its outstanding performance for enhanced clinical applications.
AB - Medical Visual Question Answering (Med-VQA) faces significant limitations in application development due to the scarcity and difficulty of data acquisition. Existing approaches focus on multi-modal learning to equip models with both medical image inference and natural language understanding, but this exacerbates the data scarcity in Med-VQA, hindering clinical application and advancement. This paper proposes the ITLTA framework for Med-VQA, designed around the requirements of the field. ITLTA combines multi-label learning of medical images with the language understanding and reasoning capabilities of large language models (LLMs) to achieve zero-shot learning, meeting the needs of the natural language module without end-to-end training. This approach reduces deployment costs and training data requirements, allowing LLMs to function as flexible, plug-and-play modules. To enhance multi-label classification accuracy, the framework uses external medical image data for pretraining, integrated with a joint feature and label attention mechanism. This configuration ensures robust performance and applicability even with limited data. Additionally, the framework clarifies the decision-making process for visual labels and question prompts, enhancing the interpretability of Med-VQA. Validated on the VQA-Med 2019 dataset, our method demonstrates superior effectiveness compared to existing methods, confirming its outstanding performance for enhanced clinical applications.
KW - attention mechanisms
KW - large language models (LLMs)
KW - medical visual question answering (Med-VQA)
KW - multi-label learning
KW - zero-shot learning
UR - http://www.scopus.com/inward/record.url?scp=85197291863&partnerID=8YFLogxK
U2 - 10.3390/electronics13122273
DO - 10.3390/electronics13122273
M3 - Article
AN - SCOPUS:85197291863
SN - 2079-9292
VL - 13
JO - Electronics (Switzerland)
JF - Electronics (Switzerland)
IS - 12
M1 - 2273
ER -