TY - JOUR
T1 - KDViT: COVID-19 diagnosis on CT-scans with knowledge distillation of vision transformer
AU - Lim, Yu Jie
AU - Lim, Kian Ming
AU - Chang, Roy Kwang Yang
AU - Lee, Chin Poo
N1 - Publisher Copyright:
© 2024 The Author(s). Published by Informa UK Limited, trading as Taylor & Francis Group.
PY - 2024
Y1 - 2024
N2 - This paper introduces Knowledge Distillation of Vision Transformer (KDViT), a novel approach for medical image classification. The Vision Transformer architecture incorporates a self-attention mechanism to autonomously learn image structure. The input medical image is segmented into patches and transformed into low-dimensional linear embeddings. Position information is integrated into each patch, and a learnable classification token is appended for classification, thereby preserving spatial relationships within the image. The output vectors are then fed into a Transformer encoder to extract both local and global features, leveraging the inherent attention mechanism for robust feature extraction across diverse medical imaging scenarios. Furthermore, knowledge distillation is employed to enhance performance by transferring insights from a large teacher model to a small student model. This approach reduces the computational requirements of the larger model and improves overall effectiveness. Integrating knowledge distillation with two Vision Transformer models not only showcases the novelty of the proposed solution for medical image classification but also enhances model interpretability, reduces computational complexity, and improves generalization capabilities. The proposed KDViT model achieved high accuracy rates of 98.39%, 88.57%, and 99.15% on the SARS-CoV-2-CT, COVID-CT, and iCTCF datasets, respectively, surpassing the performance of other state-of-the-art methods.
KW - COVID-19 image classification
KW - CT scan images
KW - knowledge distillation
KW - vision transformer
UR - http://www.scopus.com/inward/record.url?scp=85193268578&partnerID=8YFLogxK
U2 - 10.1080/00051144.2024.2349416
DO - 10.1080/00051144.2024.2349416
M3 - Article
AN - SCOPUS:85193268578
SN - 0005-1144
VL - 65
SP - 1113
EP - 1126
JO - Automatika
JF - Automatika
IS - 3
ER -