TY - GEN
T1 - Cycle-Consistent Generative Adversarial Network Architectures for Audio Visual Speech Recognition
AU - He, Yibo
AU - Seng, Kah Phooi
AU - Ang, Li Minn
AU - Zhao, Xingyu
N1 - Publisher Copyright:
© 2023 IEEE.
PY - 2023
Y1 - 2023
N2 - Generative Adversarial Networks (GANs) have found extensive applications in image classification and image generation. Nevertheless, their utilisation for recognising and detecting multimodal images presents considerable difficulties. Audio Visual Speech Recognition (AVSR) is a classic task in multimodal audio-visual sensing, which leverages audio inputs from human speech and aligned visual inputs from lip movements. However, AVSR performance is affected by the inherent discrepancies present in real-world environments, such as variations in lighting intensity, noise, and sampling devices. To mitigate these challenges, this paper proposes an AVSR architecture based on a specially constructed Cycle-Consistent Generative Adversarial Network (CycleGAN). First, on the visual side, we applied data-augmentation methods such as flipping and rotation to the video data, increasing the number and variety of samples and thereby improving the robustness and generalisation capability of the model. Then, since the AVSR data were collected in different environments with different styles, we transformed the original images multiple times through the specially constructed CycleGAN module to address these inherent environmental differences. To validate the approach, we trained on augmented data from the well-known LRS2 (Lip Reading Sentences 2) and LRS3 datasets. Experimental results confirm the correctness and effectiveness of the approach.
AB - Generative Adversarial Networks (GANs) have found extensive applications in image classification and image generation. Nevertheless, their utilisation for recognising and detecting multimodal images presents considerable difficulties. Audio Visual Speech Recognition (AVSR) is a classic task in multimodal audio-visual sensing, which leverages audio inputs from human speech and aligned visual inputs from lip movements. However, AVSR performance is affected by the inherent discrepancies present in real-world environments, such as variations in lighting intensity, noise, and sampling devices. To mitigate these challenges, this paper proposes an AVSR architecture based on a specially constructed Cycle-Consistent Generative Adversarial Network (CycleGAN). First, on the visual side, we applied data-augmentation methods such as flipping and rotation to the video data, increasing the number and variety of samples and thereby improving the robustness and generalisation capability of the model. Then, since the AVSR data were collected in different environments with different styles, we transformed the original images multiple times through the specially constructed CycleGAN module to address these inherent environmental differences. To validate the approach, we trained on augmented data from the well-known LRS2 (Lip Reading Sentences 2) and LRS3 datasets. Experimental results confirm the correctness and effectiveness of the approach.
KW - Generative Adversarial Networks (GANs)
KW - audio visual speech recognition
KW - deep learning
UR - http://www.scopus.com/inward/record.url?scp=85184849653&partnerID=8YFLogxK
U2 - 10.1109/ICSPCC59353.2023.10400358
DO - 10.1109/ICSPCC59353.2023.10400358
M3 - Conference Proceeding
AN - SCOPUS:85184849653
T3 - Proceedings of 2023 IEEE International Conference on Signal Processing, Communications and Computing, ICSPCC 2023
BT - Proceedings of 2023 IEEE International Conference on Signal Processing, Communications and Computing, ICSPCC 2023
PB - Institute of Electrical and Electronics Engineers Inc.
T2 - 2023 IEEE International Conference on Signal Processing, Communications and Computing, ICSPCC 2023
Y2 - 14 November 2023 through 17 November 2023
ER -