TY - GEN
T1 - Generating Valid and Natural Adversarial Examples with Large Language Models
AU - Wang, Zimu
AU - Wang, Wei
AU - Chen, Qi
AU - Wang, Qiufeng
AU - Nguyen, Anh
N1 - Publisher Copyright:
© 2024 IEEE.
PY - 2024
Y1 - 2024
N2 - Deep learning-based natural language processing (NLP) models, particularly pre-trained language models (PLMs), have been shown to be vulnerable to adversarial attacks. However, the adversarial examples generated by many mainstream word-level adversarial attack models are neither valid nor natural, sacrificing semantic preservation, grammaticality, and human imperceptibility. Leveraging the exceptional language understanding and generation capabilities of large language models (LLMs), we propose LLM-Attack, which aims to generate both valid and natural adversarial examples with LLMs. The method consists of two stages: word importance ranking (which searches for the most vulnerable words) and word synonym replacement (which substitutes them with synonyms obtained from LLMs). Experimental results on the Movie Review (MR), IMDB, and Yelp Review Polarity datasets against baseline adversarial attack models illustrate the effectiveness of LLM-Attack, which outperforms the baselines in human and GPT-4 evaluation by a significant margin. The model can generate adversarial examples that are typically valid and natural, preserving semantic meaning, grammaticality, and human imperceptibility.
AB - Deep learning-based natural language processing (NLP) models, particularly pre-trained language models (PLMs), have been shown to be vulnerable to adversarial attacks. However, the adversarial examples generated by many mainstream word-level adversarial attack models are neither valid nor natural, sacrificing semantic preservation, grammaticality, and human imperceptibility. Leveraging the exceptional language understanding and generation capabilities of large language models (LLMs), we propose LLM-Attack, which aims to generate both valid and natural adversarial examples with LLMs. The method consists of two stages: word importance ranking (which searches for the most vulnerable words) and word synonym replacement (which substitutes them with synonyms obtained from LLMs). Experimental results on the Movie Review (MR), IMDB, and Yelp Review Polarity datasets against baseline adversarial attack models illustrate the effectiveness of LLM-Attack, which outperforms the baselines in human and GPT-4 evaluation by a significant margin. The model can generate adversarial examples that are typically valid and natural, preserving semantic meaning, grammaticality, and human imperceptibility.
KW - Adversarial attack
KW - adversarial examples
KW - large language models
KW - natural language processing
KW - text classification
UR - http://www.scopus.com/inward/record.url?scp=85199078756&partnerID=8YFLogxK
U2 - 10.1109/CSCWD61410.2024.10580402
DO - 10.1109/CSCWD61410.2024.10580402
M3 - Conference Proceeding
AN - SCOPUS:85199078756
T3 - Proceedings of the 2024 27th International Conference on Computer Supported Cooperative Work in Design, CSCWD 2024
SP - 1716
EP - 1721
BT - Proceedings of the 2024 27th International Conference on Computer Supported Cooperative Work in Design, CSCWD 2024
A2 - Shen, Weiming
A2 - Barthes, Jean-Paul
A2 - Luo, Junzhou
A2 - Qiu, Tie
A2 - Zhou, Xiaobo
A2 - Zhang, Jinghui
A2 - Zhu, Haibin
A2 - Peng, Kunkun
A2 - Xu, Tianyi
A2 - Chen, Ning
PB - Institute of Electrical and Electronics Engineers Inc.
T2 - 27th International Conference on Computer Supported Cooperative Work in Design, CSCWD 2024
Y2 - 8 May 2024 through 10 May 2024
ER -