Semantic Similarity Distance: Towards better text-image consistency metric in text-to-image generation

Zhaorui Tan, Xi Yang^*, Zihan Ye, Qiu-Feng Wang, Yuyao Yan, Anh Nguyen, Kaizhu Huang

^*Corresponding author for this work

Research output: Contribution to journal › Article › peer-review

12 Citations (Scopus)

Abstract

Generating high-quality images from text remains a challenge in visual-language understanding, with text-image consistency being a major concern. Particularly, the most popular metric R-precision may not accurately reflect the text-image consistency, leading to misleading semantics in generated images. Albeit its significance, designing a better text-image consistency metric surprisingly remains under-explored in the community. In this paper, we make a further step forward to develop a novel CLIP-based metric, Semantic Similarity Distance (SSD), which is both theoretically founded from a distributional viewpoint and empirically verified on benchmark datasets. We also introduce Parallel Deep Fusion Generative Adversarial Networks (PDF-GAN), which use two novel components to mitigate inconsistent semantics and bridge the text-image semantic gap. A series of experiments indicate that, under the guidance

Original language	English
Journal	Pattern Recognition
Publication status	Published - 2023

Keywords

text-to-image generation
text-image consistency metric

Access to Document

https://www.sciencedirect.com/science/article/abs/pii/S0031320323005812

Cite this

@article{4b2d565b99ec4beb861bbd759170d196,
  title = "Semantic Similarity Distance: Towards better text-image consistency metric in text-to-image generation",
  abstract = "Generating high-quality images from text remains a challenge in visual-language understanding, with text-image consistency being a major concern. Particularly, the most popular metric R-precision may not accurately reflect the text-image consistency, leading to misleading semantics in generated images. Albeit its significance, designing a better text-image consistency metric surprisingly remains under-explored in the community. In this paper, we make a further step forward to develop a novel CLIP-based metric, Semantic Similarity Distance (SSD), which is both theoretically founded from a distributional viewpoint and empirically verified on benchmark datasets. We also introduce Parallel Deep Fusion Generative Adversarial Networks (PDF-GAN), which use two novel components to mitigate inconsistent semantics and bridge the text-image semantic gap. A series of experiments indicate that, under the guidance ",
  keywords = "text-to-image generation, text-image consistency metric",
  author = "Zhaorui Tan and Xi Yang and Zihan Ye and Qiu-Feng Wang and Yuyao Yan and Anh Nguyen and Kaizhu Huang",
  year = "2023",
  language = "English",
  journal = "Pattern Recognition",
  issn = "0031-3203",
}

TY  - JOUR
T1  - Semantic Similarity Distance: Towards better text-image consistency metric in text-to-image generation
AU  - Tan, Zhaorui
AU  - Yang, Xi
AU  - Ye, Zihan
AU  - Wang, Qiu-Feng
AU  - Yan, Yuyao
AU  - Nguyen, Anh
AU  - Huang, Kaizhu
PY  - 2023
Y1  - 2023
N2  - Generating high-quality images from text remains a challenge in visual-language understanding, with text-image consistency being a major concern. Particularly, the most popular metric R-precision may not accurately reflect the text-image consistency, leading to misleading semantics in generated images. Albeit its significance, designing a better text-image consistency metric surprisingly remains under-explored in the community. In this paper, we make a further step forward to develop a novel CLIP-based metric, Semantic Similarity Distance (SSD), which is both theoretically founded from a distributional viewpoint and empirically verified on benchmark datasets. We also introduce Parallel Deep Fusion Generative Adversarial Networks (PDF-GAN), which use two novel components to mitigate inconsistent semantics and bridge the text-image semantic gap. A series of experiments indicate that, under the guidance
AB  - Generating high-quality images from text remains a challenge in visual-language understanding, with text-image consistency being a major concern. Particularly, the most popular metric R-precision may not accurately reflect the text-image consistency, leading to misleading semantics in generated images. Albeit its significance, designing a better text-image consistency metric surprisingly remains under-explored in the community. In this paper, we make a further step forward to develop a novel CLIP-based metric, Semantic Similarity Distance (SSD), which is both theoretically founded from a distributional viewpoint and empirically verified on benchmark datasets. We also introduce Parallel Deep Fusion Generative Adversarial Networks (PDF-GAN), which use two novel components to mitigate inconsistent semantics and bridge the text-image semantic gap. A series of experiments indicate that, under the guidance
KW  - text-to-image generation
KW  - text-image consistency metric
M3  - Article
SN  - 0031-3203
JO  - Pattern Recognition
JF  - Pattern Recognition
ER  -
