Abstract
Artificial intelligence has achieved notable results in sign language recognition and translation. However, relatively few efforts have been made to significantly improve the quality of life for the 72 million hearing-impaired people worldwide. Sign language translation models, which rely on video inputs, involve large parameter sizes, making them time-consuming and computationally intensive to deploy. This directly contributes to the scarcity of human-centered technology in this field. Additionally, the lack of datasets in sign language translation hampers research progress in this area. To address these issues, we first propose a cross-modal multi-knowledge distillation technique from 3D to 1D and a novel end-to-end pre-training text correction framework. Compared to other pre-trained models, our framework achieves significant advancements in correcting text output errors. Our model achieves a decrease in Word Error Rate (WER) of at least 1.4% on the PHOENIX14 and PHOENIX14T datasets compared to the state-of-the-art CorrNet. Additionally, the TensorFlow Lite (TFLite) quantized model size is reduced to 12.93 MB, making it the smallest, fastest, and most accurate model to date. We have also collected and released extensive Chinese sign language datasets and developed a specialized training vocabulary. To address the lack of research on data augmentation for landmark data, we have designed comparative experiments on various augmentation methods. Moreover, we performed a simulated deployment and prediction of our model on Intel-platform CPUs and assessed the feasibility of deploying the model on other platforms.
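The abstract reports a 12.93 MB TFLite quantized model. As a minimal sketch of how such a size reduction is typically obtained, the snippet below applies post-training dynamic-range quantization with the standard TensorFlow Lite converter; the SavedModel path and output filename are hypothetical, and the paper's exact export and quantization settings are not specified here.

```python
# Minimal sketch: post-training dynamic-range quantization with the
# TensorFlow Lite converter. "export/sign_translation_model" is a
# hypothetical path to an exported model, not the authors' artifact.
import tensorflow as tf

saved_model_dir = "export/sign_translation_model"  # hypothetical path

converter = tf.lite.TFLiteConverter.from_saved_model(saved_model_dir)
# Dynamic-range quantization stores weights in 8-bit, shrinking the
# on-disk model (the paper reports a 12.93 MB TFLite model).
converter.optimizations = [tf.lite.Optimize.DEFAULT]
tflite_model = converter.convert()

with open("model_quantized.tflite", "wb") as f:
    f.write(tflite_model)
```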
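The abstract also mentions comparative experiments on augmentation methods for landmark data. As one illustrative example of a landmark-level augmentation (a random 2D rotation of per-frame keypoints about their centroid), consider the sketch below; this particular transform is an assumption for illustration, not a method confirmed by the abstract.

```python
# Illustrative landmark augmentation: random 2D rotation of keypoints
# about their centroid. An assumed example, not the paper's method.
import numpy as np

def rotate_landmarks(landmarks: np.ndarray, max_deg: float = 15.0) -> np.ndarray:
    """landmarks: (T, K, 2) array of 2D keypoints over T frames."""
    theta = np.deg2rad(np.random.uniform(-max_deg, max_deg))
    rot = np.array([[np.cos(theta), -np.sin(theta)],
                    [np.sin(theta),  np.cos(theta)]])
    center = landmarks.mean(axis=(0, 1), keepdims=True)  # rotation pivot
    return (landmarks - center) @ rot.T + center
```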
| Original language | English |
| --- | --- |
| Title of host publication | The 39th Annual AAAI Conference on Artificial Intelligence, AAAI 2025 |
| Publisher | AAAI Press |
| Publication status | Accepted/In press - 2025 |