Generating Valid and Natural Adversarial Examples with Large Language Models

Zimu Wang, Wei Wang*, Qi Chen, Qiufeng Wang, Anh Nguyen

*Corresponding author for this work

Research output: Chapter in Book/Report/Conference proceeding › Conference Proceeding › peer-review

Abstract

Deep learning-based natural language processing (NLP) models, particularly pre-trained language models (PLMs), have been shown to be vulnerable to adversarial attacks. However, the adversarial examples generated by many mainstream word-level attack models are neither valid nor natural, sacrificing semantic preservation, grammaticality, and imperceptibility to humans. Leveraging the exceptional language understanding and generation capacity of large language models (LLMs), we propose LLM-Attack, which generates both valid and natural adversarial examples with LLMs. The method consists of two stages: word importance ranking, which searches for the most vulnerable words, and word synonym replacement, which substitutes them with synonyms obtained from LLMs. Experimental results on the Movie Review (MR), IMDB, and Yelp Review Polarity datasets against baseline adversarial attack models demonstrate the effectiveness of LLM-Attack, which outperforms the baselines in human and GPT-4 evaluation by a significant margin. The generated adversarial examples are typically valid and natural, preserving semantic meaning, grammaticality, and imperceptibility to humans.
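
The two-stage pipeline from the abstract can be illustrated with a short sketch. The Python below is a minimal, hypothetical rendering of the described procedure, not the authors' released code: `victim` stands in for the attacked classifier (returning its confidence in the true label), `llm_synonyms` for an LLM-backed synonym provider, and the greedy acceptance rule and 0.5 decision threshold are assumptions made for illustration.

    from typing import Callable, List

    def rank_word_importance(words: List[str],
                             victim: Callable[[str], float]) -> List[int]:
        # Stage 1: score each word by the confidence drop caused by deleting
        # it, then return word indices from most to least vulnerable.
        base = victim(" ".join(words))
        drops = [base - victim(" ".join(words[:i] + words[i + 1:]))
                 for i in range(len(words))]
        return sorted(range(len(words)), key=lambda i: drops[i], reverse=True)

    def llm_attack(text: str,
                   victim: Callable[[str], float],
                   llm_synonyms: Callable[[str, str], List[str]],
                   threshold: float = 0.5) -> str:
        # Stage 2: greedily replace the most vulnerable words with the
        # LLM-proposed synonym that lowers the victim's confidence the most,
        # stopping as soon as the prediction flips (confidence < threshold).
        words = text.split()
        for i in rank_word_importance(words, victim):
            best_word, best_conf = words[i], victim(" ".join(words))
            for cand in llm_synonyms(words[i], text):  # synonyms from the LLM
                trial = words[:i] + [cand] + words[i + 1:]
                conf = victim(" ".join(trial))
                if conf < best_conf:
                    best_word, best_conf = cand, conf
            words[i] = best_word  # keep the strongest substitution so far
            if best_conf < threshold:
                return " ".join(words)  # adversarial example found
        return " ".join(words)  # attack failed within the word budget

One-word-per-position replacement keeps indices from the importance ranking valid throughout the greedy pass; how the paper actually accepts or rejects candidate synonyms may differ from this sketch.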

Original language: English
Title of host publication: Proceedings of the 2024 27th International Conference on Computer Supported Cooperative Work in Design, CSCWD 2024
Editors: Weiming Shen, Jean-Paul Barthes, Junzhou Luo, Tie Qiu, Xiaobo Zhou, Jinghui Zhang, Haibin Zhu, Kunkun Peng, Tianyi Xu, Ning Chen
Publisher: Institute of Electrical and Electronics Engineers Inc.
Pages: 1716-1721
Number of pages: 6
ISBN (Electronic): 9798350349184
DOIs
Publication status: Published - 2024
Event: 27th International Conference on Computer Supported Cooperative Work in Design, CSCWD 2024 - Tianjin, China
Duration: 8 May 2024 – 10 May 2024

Publication series

Name: Proceedings of the 2024 27th International Conference on Computer Supported Cooperative Work in Design, CSCWD 2024

Conference

Conference: 27th International Conference on Computer Supported Cooperative Work in Design, CSCWD 2024
Country/Territory: China
City: Tianjin
Period: 8/05/24 – 10/05/24

Keywords

  • Adversarial attack
  • Adversarial examples
  • Large language models
  • Natural language processing
  • Text classification
