Text-to-image synthesis with self-supervised bi-stage generative adversarial network

Yong Xuan Tan; Chin Poo Lee; Mai Neo; Kian Ming Lim; Jit Yan Lim

doi:10.1016/j.patrec.2023.03.023

Yong Xuan Tan^*, Chin Poo Lee, Mai Neo, Kian Ming Lim, Jit Yan Lim

^*Corresponding author for this work

Multimedia University

Research output: Contribution to journal › Article › peer-review

7 Citations (Scopus)

Abstract

Text-to-image synthesis is challenging as generating images that are visually realistic and semantically consistent with the given text description involves multi-modal learning with text and image. To address the challenges, this paper presents a text-to-image synthesis model that utilizes self-supervision and bi-stage image distribution architecture, referred to as the Self-Supervised Bi-Stage Generative Adversarial Network (SSBi-GAN). The self-supervision diversifies the learned representation thus improving the quality of the synthesized images. Besides that, the bi-stage architecture with Residual network enables the generation of larger images with finer visual contents. Not only that, some enhancements including L1 distance, one-sided smoothing and feature matching are incorporated to enhance the visual realism and semantic consistency of the images as well as the training stability of the model. The empirical results on Oxford-102 and CUB datasets corroborate the ability of the proposed SSBi-GAN in generating visually realistic and semantically consistent images.
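The abstract names three training enhancements without defining them: L1 distance, one-sided smoothing and feature matching. The sketch below illustrates the commonly used textbook form of each in plain Python. It is not taken from the SSBi-GAN paper; the function names, the smoothing value of 0.9 and the squared-L2 form of feature matching are all assumptions made for illustration.

```python
import math


def bce(prediction, target):
    """Binary cross-entropy for a single discriminator output in (0, 1)."""
    eps = 1e-12
    p = min(max(prediction, eps), 1.0 - eps)  # clamp to avoid log(0)
    return -(target * math.log(p) + (1.0 - target) * math.log(1.0 - p))


def discriminator_loss(real_scores, fake_scores, smooth=0.9):
    """One-sided smoothing: real targets become `smooth` (assumed 0.9),
    while fake targets stay at exactly 0."""
    real = sum(bce(s, smooth) for s in real_scores) / len(real_scores)
    fake = sum(bce(s, 0.0) for s in fake_scores) / len(fake_scores)
    return real + fake


def l1_distance(a, b):
    """Mean absolute difference between two equally sized flat vectors,
    for example flattened real and generated images."""
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)


def feature_matching_loss(real_features, fake_features):
    """Squared L2 gap between the batch-mean feature vectors of real and
    generated images (each argument is a list of per-image feature lists)."""
    def batch_mean(batch):
        return [sum(column) / len(batch) for column in zip(*batch)]

    real_mean = batch_mean(real_features)
    fake_mean = batch_mean(fake_features)
    return sum((r - f) ** 2 for r, f in zip(real_mean, fake_mean))
```

With smoothed targets, a discriminator that outputs 0.9 on real images incurs a lower loss than one that outputs 0.99, which is how one-sided smoothing discourages overconfident predictions.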

Original language	English
Pages (from-to)	43-49
Number of pages	7
Journal	Pattern Recognition Letters
Volume	169
DOIs	https://doi.org/10.1016/j.patrec.2023.03.023
Publication status	Published - May 2023
Externally published	Yes

Keywords

GAN
Generative adversarial network
Self-supervised learning
Text-to-image-synthesis

Access to Document

10.1016/j.patrec.2023.03.023

Cite this

Tan, Y. X., Lee, C. P., Neo, M., Lim, K. M., & Lim, J. Y. (2023). Text-to-image synthesis with self-supervised bi-stage generative adversarial network. Pattern Recognition Letters, 169, 43-49. https://doi.org/10.1016/j.patrec.2023.03.023

@article{c7ab5f1cd7974a16a0925af7a7d5bd91,

title = "Text-to-image synthesis with self-supervised bi-stage generative adversarial network",

abstract = "Text-to-image synthesis is challenging as generating images that are visually realistic and semantically consistent with the given text description involves multi-modal learning with text and image. To address the challenges, this paper presents a text-to-image synthesis model that utilizes self-supervision and bi-stage image distribution architecture, referred to as the Self-Supervised Bi-Stage Generative Adversarial Network (SSBi-GAN). The self-supervision diversifies the learned representation thus improving the quality of the synthesized images. Besides that, the bi-stage architecture with Residual network enables the generation of larger images with finer visual contents. Not only that, some enhancements including L1 distance, one-sided smoothing and feature matching are incorporated to enhance the visual realism and semantic consistency of the images as well as the training stability of the model. The empirical results on Oxford-102 and CUB datasets corroborate the ability of the proposed SSBi-GAN in generating visually realistic and semantically consistent images.",

keywords = "GAN, Generative adversarial network, Self-supervised learning, Text-to-image-synthesis",

author = "Tan, {Yong Xuan} and Lee, {Chin Poo} and Neo, Mai and Lim, {Kian Ming} and Lim, {Jit Yan}",

note = "Publisher Copyright: {\textcopyright} 2023 Elsevier B.V.",

year = "2023",

month = may,

doi = "10.1016/j.patrec.2023.03.023",

language = "English",

volume = "169",

pages = "43--49",

journal = "Pattern Recognition Letters",

issn = "0167-8655",

}

TY - JOUR

T1 - Text-to-image synthesis with self-supervised bi-stage generative adversarial network

AU - Tan, Yong Xuan

AU - Lee, Chin Poo

AU - Neo, Mai

AU - Lim, Kian Ming

AU - Lim, Jit Yan

PY - 2023/5

Y1 - 2023/5

AB - Text-to-image synthesis is challenging as generating images that are visually realistic and semantically consistent with the given text description involves multi-modal learning with text and image. To address the challenges, this paper presents a text-to-image synthesis model that utilizes self-supervision and bi-stage image distribution architecture, referred to as the Self-Supervised Bi-Stage Generative Adversarial Network (SSBi-GAN). The self-supervision diversifies the learned representation thus improving the quality of the synthesized images. Besides that, the bi-stage architecture with Residual network enables the generation of larger images with finer visual contents. Not only that, some enhancements including L1 distance, one-sided smoothing and feature matching are incorporated to enhance the visual realism and semantic consistency of the images as well as the training stability of the model. The empirical results on Oxford-102 and CUB datasets corroborate the ability of the proposed SSBi-GAN in generating visually realistic and semantically consistent images.

KW - GAN

KW - Generative adversarial network

KW - Self-supervised learning

KW - Text-to-image-synthesis

UR - http://www.scopus.com/inward/record.url?scp=85151670285&partnerID=8YFLogxK

U2 - 10.1016/j.patrec.2023.03.023

DO - 10.1016/j.patrec.2023.03.023

M3 - Article

AN - SCOPUS:85151670285

SN - 0167-8655

VL - 169

SP - 43

EP - 49

JO - Pattern Recognition Letters

JF - Pattern Recognition Letters

ER -
