多模态情感识别与理解发展现状及趋势

Translated title of the contribution: Development of multimodal sentiment recognition and understanding

Jianhua Tao, Cunhang Fan, Zheng Lian, Zhao Lyu, Ying Shen, Shan Liang*

*Corresponding author for this work

Research output: Contribution to journal › Article › peer-review

Abstract

Affective computing is an important branch of artificial intelligence (AI). It aims to build computational systems that can automatically perceive, recognize, understand, and provide feedback on human emotions, and it lies at the intersection of multiple disciplines such as computer science, neuroscience, psychology, and social science. Deep emotional understanding and interaction enable computers to better understand and respond to human emotional needs, and to provide personalized interactions and feedback based on emotional states, which enhances the human-computer interaction experience. Affective computing has applications in areas such as intelligent assistants, virtual reality, and smart healthcare. Relying solely on single-modal information, such as the speech signal or video, does not align with the way humans perceive emotions, and recognition accuracy drops rapidly in the presence of interference. Multimodal emotion understanding and interaction technologies aim to fully model multidimensional information from audio, video, and physiological signals to achieve more accurate emotion understanding. This technology is a fundamental and important prerequisite for natural, human-like, and personalized human-computer interaction, and it holds significant value for ushering in the era of intelligence and digitalization. Multimodal fusion for sentiment recognition is receiving increasing attention from researchers because it fully exploits the complementary nature of different modalities. This study introduces the current research status of multimodal sentiment computing from three dimensions: an overview of multimodal sentiment recognition, multimodal sentiment understanding, and the detection and assessment of emotional disorders such as depression. The overview of emotion recognition covers academic definitions, mainstream datasets, and international competitions.
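The complementarity of modalities described above can be illustrated with a minimal decision-level (late) fusion sketch. This is not a method from the paper; it assumes that per-modality classifiers already output class probabilities, and the modality names, label set, and weights are purely illustrative:

```python
import numpy as np

def late_fusion(modality_probs, weights=None):
    """Fuse per-modality class probabilities by weighted averaging.

    modality_probs: dict mapping modality name -> probability vector
    weights: optional dict of per-modality reliability weights
    """
    names = list(modality_probs)
    if weights is None:
        weights = {m: 1.0 for m in names}
    total = sum(weights[m] for m in names)
    fused = sum(weights[m] * np.asarray(modality_probs[m]) for m in names)
    return fused / total

# Illustrative example: audio and video disagree; weighting lets the
# more reliable cue dominate (labels: happy / neutral / sad).
probs = {
    "audio": np.array([0.7, 0.2, 0.1]),
    "video": np.array([0.3, 0.5, 0.2]),
}
fused = late_fusion(probs, weights={"audio": 2.0, "video": 1.0})
```

In practice, fusion can also happen at the feature level (concatenating modality embeddings before classification); late fusion is shown here only because it makes the complementarity argument explicit.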
In recent years, large language models (LLMs) have demonstrated excellent modeling capabilities and achieved great success in natural language processing thanks to their outstanding language understanding and reasoning abilities. LLMs have garnered widespread attention because they can handle various complex tasks by understanding prompts with few-shot or zero-shot learning. Through methods such as self-supervised learning or contrastive learning, LLMs can learn more expressive multimodal representations that capture the correlations between different modalities and emotional information. Multimodal sentiment recognition and understanding are discussed in terms of emotion feature extraction, multimodal fusion, and the representations and models involved in sentiment recognition against the background of pretrained large models. With the rapid development of society, people face increasing pressure, which can lead to depression, anxiety, and other negative emotions, and those in a prolonged state of depression and anxiety are more likely to develop mental illnesses. Depression is a common and serious condition, with symptoms including low mood, poor sleep quality, loss of appetite, fatigue, and difficulty concentrating. It not only harms individuals and families but also causes significant economic losses to society. The discussion of emotional-disorder detection starts from a specific application, taking depression as the most common emotional disorder, and we analyze its latest developments and trends from the perspectives of assessment and intervention. In addition, this study provides a detailed comparison of the domestic and international research status of affective computing, and prospects for future development trends are offered. We believe that scalable emotion feature design and methods based on transfer learning from large-scale models will be the future directions of development.
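The zero-shot prompting ability mentioned above can be sketched as a simple prompt-construction step. The template and the emotion label set below are illustrative assumptions, not taken from the paper, and the prompt would be sent to whatever LLM the system uses:

```python
def build_emotion_prompt(utterance, labels=("happy", "sad", "angry", "neutral")):
    """Compose a zero-shot emotion-classification prompt for an LLM.

    The instruction template and label inventory are hypothetical;
    real systems tune both, and may add few-shot examples.
    """
    label_list = ", ".join(labels)
    return (
        "You are an emotion recognition assistant.\n"
        f"Classify the speaker's emotion as exactly one of: {label_list}.\n"
        f'Utterance: "{utterance}"\n'
        "Answer with a single label."
    )

prompt = build_emotion_prompt("I can't believe we finally won!")
```

Few-shot variants simply prepend labeled example utterances to the same template, which is often enough to adapt an LLM to a new emotion taxonomy without any fine-tuning.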
The main challenge in multimodal emotion recognition lies in data scarcity: the data available for building and exploring complex models are insufficient, which makes it difficult to train robust models with deep neural network methods. These issues can be addressed by constructing large-scale multimodal emotion databases and by exploring transfer learning methods based on large models. By transferring knowledge learned from unsupervised tasks or other tasks to emotion recognition tasks, the problem of limited data resources can be alleviated. The use of explicit discrete and dimensional labels to represent ambiguous emotional states has limitations due to the inherent fuzziness of emotions. Enhancing the interpretability of predictions to improve the reliability of recognition results is also an important research direction for the future. The role of multimodal emotion computing in addressing emotional disorders such as depression and anxiety is increasingly prominent, and future research can proceed in three areas. First, the research and construction of multimodal emotion-disorder datasets can provide a solid foundation for the automatic recognition of emotional disorders; however, this field still needs to address challenges such as data privacy and ethics, and considerations such as designing targeted interview questions, ensuring patient safety during data collection, and sample augmentation through algorithms are still worth exploring. Second, more effective algorithms should be developed. Emotional disorders fall within the psychological domain, but they can also affect the physiological features of patients, such as voice and body movements, and this psychological-physiological correlation deserves comprehensive exploration. Improving the accuracy of algorithms for multimodal emotion-disorder recognition is therefore a pressing research issue.
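The transfer-learning recipe above can be sketched under one simplifying assumption: a pretrained encoder (speech, text, or video) is kept frozen and supplies fixed embeddings, so only a small classification head is trained on the scarce emotion labels. The embeddings here are synthetic stand-ins, not outputs of any real encoder:

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in for embeddings produced by a frozen pretrained encoder
# (assumption: the encoder was trained on a large unlabelled corpus
# and its weights are not updated here).
X = rng.normal(size=(200, 16))
true_w = rng.normal(size=16)
y = (X @ true_w > 0).astype(float)  # synthetic binary emotion labels

# Train only a small logistic-regression head on the frozen features.
w = np.zeros(16)
for _ in range(500):
    p = 1.0 / (1.0 + np.exp(-(X @ w)))       # predicted probabilities
    w -= 0.1 * X.T @ (p - y) / len(y)        # cross-entropy gradient step

preds = (1.0 / (1.0 + np.exp(-(X @ w))) > 0.5).astype(float)
accuracy = (preds == y).mean()
```

Because the head has few parameters relative to the encoder, even a few hundred labeled samples can suffice, which is the point of the data-scarcity argument above.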
Finally, intelligent psychological intervention systems should be designed and implemented. The following issues can be studied further: effectively simulating the counseling process of a psychologist, promptly receiving user emotional feedback, and generating empathetic conversations.

Original language: Chinese (Simplified)
Pages (from-to): 1607-1627
Number of pages: 21
Journal: Journal of Image and Graphics
Volume: 29
Issue number: 6
DOIs
Publication status: Published - Jun 2024

Keywords

  • cognitive behavior therapy
  • depression detection
  • emotion disorder intervention
  • human-computer interaction
  • multimodal fusion
  • sentiment recognition
