TY - JOUR
T1 - SVD-KD
T2 - SVD-based hidden layer feature extraction for Knowledge distillation
AU - Zhang, Jianhua
AU - Gao, Yi
AU - Zhou, Mian
AU - Liu, Ruyu
AU - Cheng, Xu
AU - Nikolić, Saša V.
AU - Chen, Shengyong
N1 - Publisher Copyright:
© 2025
PY - 2025/11
Y1 - 2025/11
N2 - A recent advancement in knowledge distillation (KD) is to extract and transfer middle-layer knowledge from teacher models to student models, which outperforms original KD methods that transfer only last-layer knowledge. However, middle-layer knowledge commonly appears as a high-dimensional tensor, which is more difficult to transfer than the one-dimensional knowledge in the last layer. Moreover, when teachers and students differ significantly in parameter capacity and model structure, their middle layers differ in dimensionality and structure, which further increases the learning difficulty for student models. To solve these problems, we propose a novel knowledge extraction module that transforms the high-dimensional tensor-based knowledge in middle layers into one-dimensional knowledge based on singular value decomposition. Thus, the knowledge at the middle layers of teacher models can be effectively extracted and simplified, greatly facilitating the learning of student models even when the teacher and student networks differ considerably in structure and parameter capacity. To help students learn the knowledge from the middle layers of teachers as accurately as possible, we also propose a novel loss function that constrains the values of the one-dimensional knowledge learned by the student model to be as close as possible to those extracted from the teacher model, thus improving the learning efficiency of the student model. We have conducted extensive experiments on three major datasets (CIFAR-10, CIFAR-100, and ImageNet 1K), and the results demonstrate that our method achieves superior performance compared with state-of-the-art methods.
AB - A recent advancement in knowledge distillation (KD) is to extract and transfer middle-layer knowledge from teacher models to student models, which outperforms original KD methods that transfer only last-layer knowledge. However, middle-layer knowledge commonly appears as a high-dimensional tensor, which is more difficult to transfer than the one-dimensional knowledge in the last layer. Moreover, when teachers and students differ significantly in parameter capacity and model structure, their middle layers differ in dimensionality and structure, which further increases the learning difficulty for student models. To solve these problems, we propose a novel knowledge extraction module that transforms the high-dimensional tensor-based knowledge in middle layers into one-dimensional knowledge based on singular value decomposition. Thus, the knowledge at the middle layers of teacher models can be effectively extracted and simplified, greatly facilitating the learning of student models even when the teacher and student networks differ considerably in structure and parameter capacity. To help students learn the knowledge from the middle layers of teachers as accurately as possible, we also propose a novel loss function that constrains the values of the one-dimensional knowledge learned by the student model to be as close as possible to those extracted from the teacher model, thus improving the learning efficiency of the student model. We have conducted extensive experiments on three major datasets (CIFAR-10, CIFAR-100, and ImageNet 1K), and the results demonstrate that our method achieves superior performance compared with state-of-the-art methods.
KW - Computer vision
KW - Deep learning
KW - Knowledge distillation
KW - Model compression
KW - Singular value decomposition
UR - http://www.scopus.com/inward/record.url?scp=105004735793&partnerID=8YFLogxK
U2 - 10.1016/j.patcog.2025.111721
DO - 10.1016/j.patcog.2025.111721
M3 - Article
AN - SCOPUS:105004735793
SN - 0031-3203
VL - 167
JO - Pattern Recognition
JF - Pattern Recognition
M1 - 111721
ER -