Abstract
With the widespread adoption of Generative Adversarial Networks (GANs) for sample generation, this paper aims to enhance adversarial neural networks to facilitate collaborative Artificial Intelligence (AI) learning tailored to multimodal datasets. Much of the current literature focuses on GAN-based sample generation, with the objective of improving the detection performance of machine learning (ML) classifiers by incorporating the generated data into the original training set through adversarial training. The quality of the generated adversarial samples depends on having sufficient training data; in the multimodal domain, however, data is scarce owing to resource constraints. In this paper, we address this challenge by proposing a new multimodal dataset generation approach based on the classical audiovisual speech recognition task, using CycleGAN, DiscoGAN, and StyleGAN2 for exploration and performance comparison. Audiovisual Speech Recognition (AVSR) experiments are conducted on the LRS2 and LRS3 corpora. Our experiments reveal that CycleGAN, DiscoGAN, and StyleGAN2 do not effectively address the low-data problem in AVSR classification. We therefore introduce an enhanced model, CycleGAN*, based on the original CycleGAN, which efficiently learns the features of the original dataset and generates high-quality multimodal data. Experimental results show that the multimodal datasets generated by our proposed CycleGAN* yield a significant reduction in Word Error Rate (WER). Notably, the images produced by CycleGAN* are markedly clearer overall, indicating its superior generative capability. Furthermore, in contrast to traditional approaches, we underscore the significance of collaborative learning: we implement co-training with diverse multimodal data to enable information sharing and complementary learning across modalities. This collaborative approach strengthens the model's ability to integrate heterogeneous information, thereby boosting its performance in multimodal environments.
Field | Value
---|---
Original language | English
Pages (from-to) | 1-14
Number of pages | 14
Journal | IEEE Transactions on Artificial Intelligence
DOIs |
Publication status | Accepted/In press - 2024
Keywords
- Artificial intelligence
- Audio-visual speech recognition
- Data models
- Deep learning
- Generative adversarial networks
- Generators
- Speech recognition
- Task analysis
- Training