KD-MSLRT: Lightweight Sign Language Recognition Model Based on Mediapipe and 3D to 1D Knowledge Distillation

Yulong Li, Bolin Ren, Ke Hu, Changyuan Liu, Zhengyong Jiang, Kang Dang^*, Jionglong Su^*

^*Corresponding author for this work

School of AI and Advanced Computing

Research output: Chapter in Book or Report/Conference proceeding › Conference Proceeding › peer-review

Abstract

Artificial intelligence has achieved notable results in sign language recognition and translation. However, relatively few efforts have been made to significantly improve the quality of life for the 72 million hearing-impaired people worldwide. Sign language translation models, relying on video inputs, involve large parameter sizes, making them time-consuming and computationally intensive to deploy. This directly contributes to the scarcity of human-centered technology in this field. Additionally, the lack of datasets in sign language translation hampers research progress in this area. To address these issues, we first propose a cross-modal multi-knowledge distillation technique from 3D to 1D and a novel end-to-end pre-training text correction framework. Compared to other pre-trained models, our framework achieves significant advancements in correcting text output errors. Our model achieves a decrease in Word Error Rate (WER) of at least 1.4% on the PHOENIX14 and PHOENIX14T datasets compared to the state-of-the-art CorrNet. Additionally, the TensorFlow Lite (TFLite) quantized model size is reduced to 12.93 MB, making it the smallest, fastest, and most accurate model to date. We have also collected and released extensive Chinese sign language datasets, and developed a specialized training vocabulary. To address the lack of research on data augmentation for landmark data, we have designed comparative experiments on various augmentation methods. Moreover, we performed a simulated deployment and prediction of our model on Intel platform CPUs and assessed the feasibility of deploying the model on other platforms.

Original language	English
Title of host publication	The 39th Annual AAAI Conference on Artificial Intelligence, AAAI 2025
Publisher	AAAI Press
Publication status	Accepted/In press - 2025

Access to Document

https://www.researchgate.net/publication/387767518_KD-MSLRT_Lightweight_Sign_Language_Recognition_Model_Based_on_Mediapipe_and_3D_to_1D_Knowledge_Distillation#fullTextFileContent

Cite this

Li, Y., Ren, B., Hu, K., Liu, C., Jiang, Z., Dang, K., & Su, J. (Accepted/In press). KD-MSLRT: Lightweight Sign Language Recognition Model Based on Mediapipe and 3D to 1D Knowledge Distillation. In The 39th Annual AAAI Conference on Artificial Intelligence, AAAI 2025. AAAI Press. https://www.researchgate.net/publication/387767518_KD-MSLRT_Lightweight_Sign_Language_Recognition_Model_Based_on_Mediapipe_and_3D_to_1D_Knowledge_Distillation#fullTextFileContent

@inproceedings{c6e46686cd59416ca630abc29d0ef827,

title = "KD-MSLRT: Lightweight Sign Language Recognition Model Based on Mediapipe and 3D to 1D Knowledge Distillation",

abstract = "Artificial intelligence has achieved notable results in sign language recognition and translation. However, relatively few efforts have been made to significantly improve the quality of life for the 72 million hearing-impaired people worldwide. Sign language translation models, relying on video inputs, involve large parameter sizes, making them time-consuming and computationally intensive to deploy. This directly contributes to the scarcity of human-centered technology in this field. Additionally, the lack of datasets in sign language translation hampers research progress in this area. To address these issues, we first propose a cross-modal multi-knowledge distillation technique from 3D to 1D and a novel end-to-end pre-training text correction framework. Compared to other pre-trained models, our framework achieves significant advancements in correcting text output errors. Our model achieves a decrease in Word Error Rate (WER) of at least 1.4% on the PHOENIX14 and PHOENIX14T datasets compared to the state-of-the-art CorrNet. Additionally, the TensorFlow Lite (TFLite) quantized model size is reduced to 12.93 MB, making it the smallest, fastest, and most accurate model to date. We have also collected and released extensive Chinese sign language datasets, and developed a specialized training vocabulary. To address the lack of research on data augmentation for landmark data, we have designed comparative experiments on various augmentation methods. Moreover, we performed a simulated deployment and prediction of our model on Intel platform CPUs and assessed the feasibility of deploying the model on other platforms.",

author = "Yulong Li and Bolin Ren and Ke Hu and Changyuan Liu and Zhengyong Jiang and Kang Dang and Jionglong Su",

year = "2025",

language = "English",

booktitle = "The 39th Annual AAAI Conference on Artificial Intelligence, AAAI 2025",

publisher = "AAAI Press",

}

Li, Y, Ren, B, Hu, K, Liu, C, Jiang, Z, Dang, K & Su, J 2025, KD-MSLRT: Lightweight Sign Language Recognition Model Based on Mediapipe and 3D to 1D Knowledge Distillation. in The 39th Annual AAAI Conference on Artificial Intelligence, AAAI 2025. AAAI Press. <https://www.researchgate.net/publication/387767518_KD-MSLRT_Lightweight_Sign_Language_Recognition_Model_Based_on_Mediapipe_and_3D_to_1D_Knowledge_Distillation#fullTextFileContent>

TY - GEN

T1 - KD-MSLRT: Lightweight Sign Language Recognition Model Based on Mediapipe and 3D to 1D Knowledge Distillation

AU - Li, Yulong

AU - Ren, Bolin

AU - Hu, Ke

AU - Liu, Changyuan

AU - Jiang, Zhengyong

AU - Dang, Kang

AU - Su, Jionglong

PY - 2025

Y1 - 2025

N2 - Artificial intelligence has achieved notable results in sign language recognition and translation. However, relatively few efforts have been made to significantly improve the quality of life for the 72 million hearing-impaired people worldwide. Sign language translation models, relying on video inputs, involve large parameter sizes, making them time-consuming and computationally intensive to deploy. This directly contributes to the scarcity of human-centered technology in this field. Additionally, the lack of datasets in sign language translation hampers research progress in this area. To address these issues, we first propose a cross-modal multi-knowledge distillation technique from 3D to 1D and a novel end-to-end pre-training text correction framework. Compared to other pre-trained models, our framework achieves significant advancements in correcting text output errors. Our model achieves a decrease in Word Error Rate (WER) of at least 1.4% on the PHOENIX14 and PHOENIX14T datasets compared to the state-of-the-art CorrNet. Additionally, the TensorFlow Lite (TFLite) quantized model size is reduced to 12.93 MB, making it the smallest, fastest, and most accurate model to date. We have also collected and released extensive Chinese sign language datasets, and developed a specialized training vocabulary. To address the lack of research on data augmentation for landmark data, we have designed comparative experiments on various augmentation methods. Moreover, we performed a simulated deployment and prediction of our model on Intel platform CPUs and assessed the feasibility of deploying the model on other platforms.

AB - Artificial intelligence has achieved notable results in sign language recognition and translation. However, relatively few efforts have been made to significantly improve the quality of life for the 72 million hearing-impaired people worldwide. Sign language translation models, relying on video inputs, involve large parameter sizes, making them time-consuming and computationally intensive to deploy. This directly contributes to the scarcity of human-centered technology in this field. Additionally, the lack of datasets in sign language translation hampers research progress in this area. To address these issues, we first propose a cross-modal multi-knowledge distillation technique from 3D to 1D and a novel end-to-end pre-training text correction framework. Compared to other pre-trained models, our framework achieves significant advancements in correcting text output errors. Our model achieves a decrease in Word Error Rate (WER) of at least 1.4% on the PHOENIX14 and PHOENIX14T datasets compared to the state-of-the-art CorrNet. Additionally, the TensorFlow Lite (TFLite) quantized model size is reduced to 12.93 MB, making it the smallest, fastest, and most accurate model to date. We have also collected and released extensive Chinese sign language datasets, and developed a specialized training vocabulary. To address the lack of research on data augmentation for landmark data, we have designed comparative experiments on various augmentation methods. Moreover, we performed a simulated deployment and prediction of our model on Intel platform CPUs and assessed the feasibility of deploying the model on other platforms.

M3 - Conference Proceeding

BT - The 39th Annual AAAI Conference on Artificial Intelligence, AAAI 2025

PB - AAAI Press

ER -
