Disentangling Semantic-to-Visual Confusion for Zero-Shot Learning

Zihan Ye, Fuyuan Hu*, Fan Lyu, Linyan Li, Kaizhu Huang

*Corresponding author for this work

Research output: Contribution to journal › Article › peer-review

12 Citations (Scopus)

Abstract

Using generative models to synthesize visual features from semantic distributions is one of the most popular solutions to zero-shot learning (ZSL) image classification in recent years. The triplet loss (TL) is widely used to generate realistic visual distributions from semantics by automatically searching for discriminative representations. However, the traditional TL cannot find reliable disentangled representations for unseen classes, because unseen classes are unavailable during training in ZSL. To alleviate this drawback, we propose a multi-modal triplet loss (MMTL) that utilizes multi-modal information to search for a disentangled representation space. In this way, all classes can interact, which benefits learning disentangled class representations in the searched space. Furthermore, we develop a novel model called the Disentangling Class Representation Generative Adversarial Network (DCR-GAN), which exploits the disentangled representations in the training, feature-synthesis, and final recognition stages. Benefiting from the disentangled representations, DCR-GAN can fit a more realistic distribution over both seen and unseen features. Extensive experiments show that our proposed model achieves superior performance over state-of-the-art methods on four benchmark datasets.
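
For readers unfamiliar with the triplet losses referred to above, the following is a minimal PyTorch sketch. `triplet_loss` is the standard formulation; `multimodal_triplet_loss` is a hypothetical cross-modal variant that uses class semantic embeddings as anchors, assuming visual features and class semantics have already been projected into a shared space. The paper's exact MMTL formulation is not given in this abstract and may differ.

```python
import torch
import torch.nn.functional as F

def triplet_loss(anchor, positive, negative, margin=0.2):
    # Standard triplet loss: keep the positive closer to the anchor
    # than the negative by at least `margin` (Euclidean distance).
    d_pos = F.pairwise_distance(anchor, positive)
    d_neg = F.pairwise_distance(anchor, negative)
    return F.relu(d_pos - d_neg + margin).mean()

def multimodal_triplet_loss(visual_feats, labels, class_semantics, margin=0.2):
    # Hypothetical multi-modal variant (not the paper's exact MMTL):
    # class semantic embeddings act as anchors so that class-level
    # semantic information shapes the visual embedding space.
    # Assumes visual_feats (B, D) and class_semantics (C, D) already
    # live in a shared D-dimensional space.
    anchors = class_semantics[labels]                    # true-class semantics, (B, D)
    dists = torch.cdist(visual_feats, class_semantics)   # (B, C) visual-to-semantic distances
    dists.scatter_(1, labels.unsqueeze(1), float('inf')) # mask out the true class
    negatives = class_semantics[dists.argmin(dim=1)]     # hardest wrong-class semantics
    return triplet_loss(anchors, visual_feats, negatives, margin)
```

Because the anchors are semantic prototypes rather than individual visual samples, every class (including ones with no visual training data) contributes a point in the searched space, which is the intuition behind letting all classes "interplay" in the abstract.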

Original language: English
Pages (from-to): 2828-2840
Number of pages: 13
Journal: IEEE Transactions on Multimedia
Volume: 24
DOIs
Publication status: Published - 2022

Keywords

  • Zero-shot learning
  • deep learning
  • generative adversarial network
  • representation learning

