CycleGAN∗: Collaborative AI Learning With Improved Adversarial Neural Networks for Multimodalities Data

Yibo He, Kah Phooi Seng*, Li Minn Ang

*Corresponding author for this work

Research output: Contribution to journal › Article › peer-review

Abstract

With the widespread adoption of generative adversarial networks (GANs) for sample generation, this article aims to enhance adversarial neural networks to facilitate collaborative artificial intelligence (AI) learning tailored to datasets containing multiple modalities. A significant portion of the current literature is devoted to sample generation with GANs, with the aim of improving the detection performance of machine learning (ML) classifiers by adding the generated samples to the original training set via adversarial training. The quality of the generated adversarial samples depends on having sufficient training data. In the multimodal domain, however, multimodal data are scarce owing to resource constraints. In this article, we address this challenge by proposing a new multimodal dataset generation approach based on the classical audio-visual speech recognition (AVSR) task, using CycleGAN, DiscoGAN, and StyleGAN2 for exploration and performance comparison. AVSR experiments are conducted on the LRS2 and LRS3 corpora. Our experiments reveal that CycleGAN, DiscoGAN, and StyleGAN2 do not effectively address the low-data problem in AVSR classification. Consequently, we introduce an enhanced model, CycleGAN∗, based on the original CycleGAN, which efficiently learns the features of the original dataset and generates high-quality multimodal data. Experimental results show that the multimodal datasets generated by the proposed CycleGAN∗ yield a significant improvement in word error rate (WER), indicating fewer recognition errors. Notably, the images produced by CycleGAN∗ show markedly improved visual clarity, evidence of its superior generative capability. Furthermore, in contrast to traditional approaches, we underscore the significance of collaborative learning: we implement co-training with diverse multimodal data to facilitate information sharing and complementary learning across modalities. This collaborative approach enhances the model's ability to integrate heterogeneous information, thereby boosting its performance in multimodal environments.
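The abstract reports its gains in word error rate (WER), the standard AVSR/ASR metric: the minimum number of word substitutions, deletions, and insertions needed to turn the hypothesis into the reference, divided by the reference length. A minimal sketch of this computation follows; the function name `wer` and whitespace tokenization are illustrative assumptions, not details from the paper.

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: (substitutions + deletions + insertions) / reference length.

    Computed via word-level Levenshtein distance with dynamic programming.
    Tokenization here is plain whitespace splitting (an assumption).
    """
    ref = reference.split()
    hyp = hypothesis.split()
    # d[i][j] = edit distance between ref[:i] and hyp[:j]
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i  # delete all remaining reference words
    for j in range(len(hyp) + 1):
        d[0][j] = j  # insert all hypothesis words
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j] + 1,        # deletion
                d[i][j - 1] + 1,        # insertion
                d[i - 1][j - 1] + cost, # substitution or match
            )
    return d[len(ref)][len(hyp)] / max(len(ref), 1)
```

A lower WER therefore means fewer recognition errors; for example, one substituted word in a three-word reference gives a WER of 1/3.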

Original language: English
Pages (from-to): 5616-5629
Number of pages: 14
Journal: IEEE Transactions on Artificial Intelligence
Volume: 5
Issue number: 11
DOIs
Publication status: Published - 2024

Keywords

  • Audio-visual speech recognition (AVSR)
  • deep learning
  • generative adversarial networks (GANs)

