TY - GEN
T1 - Cycle-Consistent Generative Adversarial Network Architectures for Audio Visual Speech Recognition
AU - He, Yibo
AU - Seng, Kah Phooi
AU - Ang, Li Minn
AU - Zhao, Xingyu
N1 - Publisher Copyright:
© 2023 IEEE.
PY - 2023
Y1 - 2023
N2 - Generative Adversarial Networks (GANs) have found extensive applications in image classification and image generation. Nevertheless, their utilisation for recognising and detecting multimodal images presents considerable difficulties. Audio Visual Speech Recognition (AVSR) is a classic task in multimodal audio-visual sensing, which leverages audio inputs from human speech and aligned visual inputs from lip movements. However, AVSR performance is affected by the inherent discrepancies present in real-world environments, such as variations in lighting intensity, noise, and sampling devices. To mitigate these challenges, this paper proposes an AVSR architecture based on a specially constructed Cycle-Consistent Generative Adversarial Network (CycleGAN). First, on the visual side, we applied data-augmentation methods such as flipping and rotation to the video data, increasing the number and variety of samples and thereby improving the robustness and generalisation capability of the model. Then, since the AVSR data were collected in different environments with different styles, we transformed the original images multiple times through the specially constructed CycleGAN module to address these inherent environmental differences. To validate the approach, we trained on augmented data from the well-known LRS2 (Lip Reading Sentences 2) and LRS3 datasets. Experimental results confirm the correctness and effectiveness of the approach.
AB - Generative Adversarial Networks (GANs) have found extensive applications in image classification and image generation. Nevertheless, their utilisation for recognising and detecting multimodal images presents considerable difficulties. Audio Visual Speech Recognition (AVSR) is a classic task in multimodal audio-visual sensing, which leverages audio inputs from human speech and aligned visual inputs from lip movements. However, AVSR performance is affected by the inherent discrepancies present in real-world environments, such as variations in lighting intensity, noise, and sampling devices. To mitigate these challenges, this paper proposes an AVSR architecture based on a specially constructed Cycle-Consistent Generative Adversarial Network (CycleGAN). First, on the visual side, we applied data-augmentation methods such as flipping and rotation to the video data, increasing the number and variety of samples and thereby improving the robustness and generalisation capability of the model. Then, since the AVSR data were collected in different environments with different styles, we transformed the original images multiple times through the specially constructed CycleGAN module to address these inherent environmental differences. To validate the approach, we trained on augmented data from the well-known LRS2 (Lip Reading Sentences 2) and LRS3 datasets. Experimental results confirm the correctness and effectiveness of the approach.
KW - Generative Adversarial Networks (GANs)
KW - audio visual speech recognition
KW - deep learning
UR - http://www.scopus.com/inward/record.url?scp=85184849653&partnerID=8YFLogxK
U2 - 10.1109/ICSPCC59353.2023.10400358
DO - 10.1109/ICSPCC59353.2023.10400358
M3 - Conference Proceeding
AN - SCOPUS:85184849653
T3 - Proceedings of 2023 IEEE International Conference on Signal Processing, Communications and Computing, ICSPCC 2023
BT - Proceedings of 2023 IEEE International Conference on Signal Processing, Communications and Computing, ICSPCC 2023
PB - Institute of Electrical and Electronics Engineers Inc.
T2 - 2023 IEEE International Conference on Signal Processing, Communications and Computing, ICSPCC 2023
Y2 - 14 November 2023 through 17 November 2023
ER -