Deep Fuzzy Multiteacher Distillation Network for Medical Visual Question Answering

Yishu Liu; Bingzhi Chen; Shuihua Wang; Guangming Lu; Zheng Zhang

doi:10.1109/TFUZZ.2024.3402086

Yishu Liu, Bingzhi Chen^*, Shuihua Wang, Guangming Lu, Zheng Zhang^*

^*Corresponding author for this work

Department of Biosciences and Bioinformatics

Research output: Contribution to journal › Article › peer-review

8 Citations (Scopus)

Abstract

Medical visual question answering (medical VQA) is a critical cross-modal interaction task that has garnered considerable attention in the medical domain. Several existing methods commonly leverage the vision-and-language pretraining paradigms to mitigate the limitation of small-scale data. Nevertheless, most of them still suffer from two challenges that remain for further research: 1) limited research focuses on distilling representation from a complete modality to guide the representation learning of masked data in other modalities; and 2) multimodal fusion based on self-attention mechanisms cannot effectively handle the inherent uncertainty and vagueness of information interaction across modalities. To mitigate these issues, in this article, we propose a novel deep fuzzy multiteacher distillation (DFMD) network for medical VQA, which can take advantage of fuzzy logic to model the uncertainties from vision-language representations across modalities in a multiteacher framework. Specifically, a multiteacher knowledge distillation module is conceived to assist in reconstructing the missing semantics under the supervision signal generated by teachers from the other complete modality, achieving more robust semantic interaction across modalities. Incorporating insights from the fuzzy logic theory, we propose a noise-robust encoder called FuzBERT that enables our DFMD model to reduce the imprecision and ambiguity in feature representation during the multimodal interaction process. To the best of our knowledge, our work is the first attempt to combine the fuzzy logic theory with the transformer-based encoder to effectively learn multimodal representation for medical VQA. Experimental results on the VQA-RAD and SLAKE datasets consistently demonstrate the superiority of our proposed DFMD method over state-of-the-art baselines.

Original language	English
Pages (from-to)	5413-5427
Number of pages	15
Journal	IEEE Transactions on Fuzzy Systems
Volume	32
Issue number	10
DOIs	https://doi.org/10.1109/TFUZZ.2024.3402086
Publication status	Published - 2024

Keywords

Fuzzy deep learning
fuzzy logic
knowledge distillation (KD)
medical visual question answering (VQA)

Cite this

@article{33b28a415ace46bba1b16e0029169fca,

title = "Deep Fuzzy Multiteacher Distillation Network for Medical Visual Question Answering",

abstract = "Medical visual question answering (medical VQA) is a critical cross-modal interaction task that has garnered considerable attention in the medical domain. Several existing methods commonly leverage the vision-and-language pretraining paradigms to mitigate the limitation of small-scale data. Nevertheless, most of them still suffer from two challenges that remain for further research: 1) limited research focuses on distilling representation from a complete modality to guide the representation learning of masked data in other modalities; and 2) multimodal fusion based on self-attention mechanisms cannot effectively handle the inherent uncertainty and vagueness of information interaction across modalities. To mitigate these issues, in this article, we propose a novel deep fuzzy multiteacher distillation (DFMD) network for medical VQA, which can take advantage of fuzzy logic to model the uncertainties from vision-language representations across modalities in a multiteacher framework. Specifically, a multiteacher knowledge distillation module is conceived to assist in reconstructing the missing semantics under the supervision signal generated by teachers from the other complete modality, achieving more robust semantic interaction across modalities. Incorporating insights from the fuzzy logic theory, we propose a noise-robust encoder called FuzBERT that enables our DFMD model to reduce the imprecision and ambiguity in feature representation during the multimodal interaction process. To the best of our knowledge, our work is the first attempt to combine the fuzzy logic theory with the transformer-based encoder to effectively learn multimodal representation for medical VQA. Experimental results on the VQA-RAD and SLAKE datasets consistently demonstrate the superiority of our proposed DFMD method over state-of-the-art baselines.",

keywords = "Fuzzy deep learning, fuzzy logic, knowledge distillation (KD), medical visual question answering (VQA)",

author = "Yishu Liu and Bingzhi Chen and Shuihua Wang and Guangming Lu and Zheng Zhang",

note = "Publisher Copyright: {\textcopyright} 1993-2012 IEEE.",

year = "2024",

doi = "10.1109/TFUZZ.2024.3402086",

language = "English",

volume = "32",

pages = "5413--5427",

journal = "IEEE Transactions on Fuzzy Systems",

issn = "1063-6706",

number = "10",

}

TY - JOUR

T1 - Deep Fuzzy Multiteacher Distillation Network for Medical Visual Question Answering

AU - Liu, Yishu

AU - Chen, Bingzhi

AU - Wang, Shuihua

AU - Lu, Guangming

AU - Zhang, Zheng

PY - 2024

Y1 - 2024

N2 - Medical visual question answering (medical VQA) is a critical cross-modal interaction task that has garnered considerable attention in the medical domain. Several existing methods commonly leverage the vision-and-language pretraining paradigms to mitigate the limitation of small-scale data. Nevertheless, most of them still suffer from two challenges that remain for further research: 1) limited research focuses on distilling representation from a complete modality to guide the representation learning of masked data in other modalities; and 2) multimodal fusion based on self-attention mechanisms cannot effectively handle the inherent uncertainty and vagueness of information interaction across modalities. To mitigate these issues, in this article, we propose a novel deep fuzzy multiteacher distillation (DFMD) network for medical VQA, which can take advantage of fuzzy logic to model the uncertainties from vision-language representations across modalities in a multiteacher framework. Specifically, a multiteacher knowledge distillation module is conceived to assist in reconstructing the missing semantics under the supervision signal generated by teachers from the other complete modality, achieving more robust semantic interaction across modalities. Incorporating insights from the fuzzy logic theory, we propose a noise-robust encoder called FuzBERT that enables our DFMD model to reduce the imprecision and ambiguity in feature representation during the multimodal interaction process. To the best of our knowledge, our work is the first attempt to combine the fuzzy logic theory with the transformer-based encoder to effectively learn multimodal representation for medical VQA. Experimental results on the VQA-RAD and SLAKE datasets consistently demonstrate the superiority of our proposed DFMD method over state-of-the-art baselines.

KW - Fuzzy deep learning

KW - fuzzy logic

KW - knowledge distillation (KD)

KW - medical visual question answering (VQA)

UR - http://www.scopus.com/inward/record.url?scp=85195386543&partnerID=8YFLogxK

U2 - 10.1109/TFUZZ.2024.3402086

DO - 10.1109/TFUZZ.2024.3402086

M3 - Article

AN - SCOPUS:85195386543

SN - 1063-6706

VL - 32

SP - 5413

EP - 5427

JO - IEEE Transactions on Fuzzy Systems

JF - IEEE Transactions on Fuzzy Systems

IS - 10

ER -
