AttackEval: How to Evaluate the Effectiveness of Jailbreak Attacking on Large Language Models

Dong Shu; Mingyu Jin; Chong Zhang; Lingyao Li; Zihao Zhou; Yongfeng Zhang

Xi'an Jiaotong-Liverpool University

Research output: Contribution to journal › Article › peer-review

Abstract

Ensuring the security of large language models (LLMs) against attacks has become increasingly urgent, with jailbreak attacks representing one of the most sophisticated threats. To deal with such risks, we introduce an innovative framework that can help evaluate the effectiveness of jailbreak attacks on LLMs. Unlike traditional binary evaluations focusing solely on the robustness of LLMs, our method assesses the effectiveness of the attacking prompts themselves. We present two distinct evaluation frameworks: a coarse-grained evaluation and a fine-grained evaluation. Each framework uses a scoring range from 0 to 1, offering unique perspectives and allowing for the assessment of attack effectiveness in different scenarios. Additionally, we develop a comprehensive ground truth dataset specifically tailored for jailbreak prompts. This dataset serves as a crucial benchmark for our current study and provides a foundational resource for future research. By comparing with traditional evaluation methods, our study shows that the current results align with baseline metrics while offering a more nuanced and fine-grained assessment. It also helps identify potentially harmful attack prompts that might appear harmless in traditional evaluations. Overall, our work establishes a solid foundation for assessing a broader range of attack prompts in the area of prompt injection.

Original language	English
Number of pages	10
Journal	ACM SIGKDD Explorations Newsletter
Volume	27
Issue number	1
Publication status	Published - 31 May 2025

Access to Document

AttackEval: How to Evaluate the Effectiveness of Jailbreak Attacking on Large Language Models (submitted manuscript, 1.39 MB)

Cite this

@article{0e7c302dea434dd79408015ed396a793,
  title = "AttackEval: How to Evaluate the Effectiveness of Jailbreak Attacking on Large Language Models",
  abstract = "Ensuring the security of large language models (LLMs) against attacks has become increasingly urgent, with jailbreak attacks representing one of the most sophisticated threats. To deal with such risks, we introduce an innovative framework that can help evaluate the effectiveness of jailbreak attacks on LLMs. Unlike traditional binary evaluations focusing solely on the robustness of LLMs, our method assesses the effectiveness of the attacking prompts themselves. We present two distinct evaluation frameworks: a coarse-grained evaluation and a fine-grained evaluation. Each framework uses a scoring range from 0 to 1, offering unique perspectives and allowing for the assessment of attack effectiveness in different scenarios. Additionally, we develop a comprehensive ground truth dataset specifically tailored for jailbreak prompts. This dataset serves as a crucial benchmark for our current study and provides a foundational resource for future research. By comparing with traditional evaluation methods, our study shows that the current results align with baseline metrics while offering a more nuanced and fine-grained assessment. It also helps identify potentially harmful attack prompts that might appear harmless in traditional evaluations. Overall, our work establishes a solid foundation for assessing a broader range of attack prompts in the area of prompt injection.",
  author = "Dong Shu and Mingyu Jin and Chong Zhang and Lingyao Li and Zihao Zhou and Yongfeng Zhang",
  year = "2025",
  month = may,
  day = "31",
  language = "English",
  volume = "27",
  journal = "ACM SIGKDD Explorations Newsletter",
  issn = "1931-0145",
  publisher = "Association for Computing Machinery (ACM)",
  number = "1",
}

TY - JOUR
T1 - AttackEval: How to Evaluate the Effectiveness of Jailbreak Attacking on Large Language Models
AU - Shu, Dong
AU - Jin, Mingyu
AU - Zhang, Chong
AU - Li, Lingyao
AU - Zhou, Zihao
AU - Zhang, Yongfeng
PY - 2025/5/31
Y1 - 2025/5/31

N2 - Ensuring the security of large language models (LLMs) against attacks has become increasingly urgent, with jailbreak attacks representing one of the most sophisticated threats. To deal with such risks, we introduce an innovative framework that can help evaluate the effectiveness of jailbreak attacks on LLMs. Unlike traditional binary evaluations focusing solely on the robustness of LLMs, our method assesses the effectiveness of the attacking prompts themselves. We present two distinct evaluation frameworks: a coarse-grained evaluation and a fine-grained evaluation. Each framework uses a scoring range from 0 to 1, offering unique perspectives and allowing for the assessment of attack effectiveness in different scenarios. Additionally, we develop a comprehensive ground truth dataset specifically tailored for jailbreak prompts. This dataset serves as a crucial benchmark for our current study and provides a foundational resource for future research. By comparing with traditional evaluation methods, our study shows that the current results align with baseline metrics while offering a more nuanced and fine-grained assessment. It also helps identify potentially harmful attack prompts that might appear harmless in traditional evaluations. Overall, our work establishes a solid foundation for assessing a broader range of attack prompts in the area of prompt injection.

M3 - Article
SN - 1931-0145
VL - 27
JO - ACM SIGKDD Explorations Newsletter
JF - ACM SIGKDD Explorations Newsletter
IS - 1
ER -
