TY - GEN
T1 - Talk2Face: A Unified Sequence-based Approach for Diverse Face Generation and Analysis Tasks
T2 - 30th ACM International Conference on Multimedia, MM 2022
AU - Li, Yudong
AU - Hou, Xianxu
AU - Zhao, Zhe
AU - Shen, Linlin
AU - Yang, Xuefeng
AU - Yan, Kimmo
N1 - Publisher Copyright:
© 2022 ACM.
PY - 2022/10/10
Y1 - 2022/10/10
AB - Facial analysis is an important domain in computer vision and has received extensive research attention. For numerous downstream tasks with different input/output formats and modalities, existing methods usually design task-specific architectures and train them on face datasets collected in the particular task domain. In this work, we propose a single model, Talk2Face, to simultaneously tackle a large number of face generation and analysis tasks, e.g., text-guided face synthesis, face captioning, and age estimation. Specifically, we cast different tasks into a sequence-to-sequence format with the same architecture, parameters, and objectives. While text and facial images are tokenized into sequences, the annotation labels of faces for different tasks are also converted to natural language for a unified representation. We collect 2.3M face-text pairs from available datasets across different tasks to train the proposed model. Uniform templates are then designed to enable the model to perform different downstream tasks according to the task context and target. Experiments on different tasks show that our model achieves better face generation and captioning performance than SOTA approaches. On age estimation and multi-attribute classification, our model achieves performance competitive with models specifically designed and trained for these particular tasks. In practice, our model is much easier to deploy for different facial analysis tasks. Code and dataset will be available at https://github.com/ydli-ai/Talk2Face.
KW - cross-modal generation
KW - face captioning
KW - text-to-face synthesis
UR - http://www.scopus.com/inward/record.url?scp=85144803045&partnerID=8YFLogxK
U2 - 10.1145/3503161.3548205
DO - 10.1145/3503161.3548205
M3 - Conference Proceeding
AN - SCOPUS:85144803045
T3 - MM 2022 - Proceedings of the 30th ACM International Conference on Multimedia
SP - 4594
EP - 4604
BT - MM 2022 - Proceedings of the 30th ACM International Conference on Multimedia
PB - Association for Computing Machinery, Inc
Y2 - 10 October 2022 through 14 October 2022
ER -