Behind the Bait: Delving into PhishTank's hidden data

Affan Yasin; Rubia Fatima; Javed Ali Khan; Wasif Afzal

doi:10.1016/j.dib.2023.109959

Affan Yasin, Rubia Fatima, Javed Ali Khan, Wasif Afzal*

*Corresponding author for this work

Research output: Contribution to journal › Article › peer-review

2 Citations (Scopus)

Abstract

Phishing constitutes a form of social engineering that aims to deceive individuals through email communication. Extensive prior research has underscored phishing as one of the most commonly employed attack vectors for infiltrating organizational networks. A prevalent method involves misleading the target by employing phishing URLs concealed through hyperlink strategies. PhishTank, a website employing the concept of crowd-sourcing, aggregates phishing URLs and subsequently verifies their authenticity. In the course of this study, we leveraged a Python script to extract data from the PhishTank website, amassing a comprehensive dataset comprising over 190,0000 phishing URLs. This dataset is a valuable resource that can be harnessed by both researchers and practitioners for enhancing phishing filters, fortifying firewalls, security education, and refining training and testing models, among other applications.

Original language	English
Article number	109959
Journal	Data in Brief
Volume	52
DOIs	https://doi.org/10.1016/j.dib.2023.109959
Publication status	Published - Feb 2024
Externally published	Yes

Keywords

Artificial intelligence
Computer security
Dataset
Email security
Phished URL
Social engineering
Web security

Cite this

@article{e974bb778c77491cb57c3808d34c00fd,
title = "Behind the Bait: Delving into PhishTank's hidden data",
abstract = "Phishing constitutes a form of social engineering that aims to deceive individuals through email communication. Extensive prior research has underscored phishing as one of the most commonly employed attack vectors for infiltrating organizational networks. A prevalent method involves misleading the target by employing phishing URLs concealed through hyperlink strategies. PhishTank, a website employing the concept of crowd-sourcing, aggregates phishing URLs and subsequently verifies their authenticity. In the course of this study, we leveraged a Python script to extract data from the PhishTank website, amassing a comprehensive dataset comprising over 190,0000 phishing URLs. This dataset is a valuable resource that can be harnessed by both researchers and practitioners for enhancing phishing filters, fortifying firewalls, security education, and refining training and testing models, among other applications.",
keywords = "Artificial intelligence, Computer security, Dataset, Email security, Phished URL, Social engineering, Web security",
author = "Affan Yasin and Rubia Fatima and Khan, {Javed Ali} and Wasif Afzal",
note = "Publisher Copyright: {\textcopyright} 2023 The Authors",
year = "2024",
month = feb,
doi = "10.1016/j.dib.2023.109959",
language = "English",
volume = "52",
journal = "Data in Brief",
issn = "2352-3409",
}

TY - JOUR
T1 - Behind the Bait
T2 - Delving into PhishTank's hidden data
AU - Yasin, Affan
AU - Fatima, Rubia
AU - Khan, Javed Ali
AU - Afzal, Wasif
PY - 2024/2
Y1 - 2024/2
N2 - Phishing constitutes a form of social engineering that aims to deceive individuals through email communication. Extensive prior research has underscored phishing as one of the most commonly employed attack vectors for infiltrating organizational networks. A prevalent method involves misleading the target by employing phishing URLs concealed through hyperlink strategies. PhishTank, a website employing the concept of crowd-sourcing, aggregates phishing URLs and subsequently verifies their authenticity. In the course of this study, we leveraged a Python script to extract data from the PhishTank website, amassing a comprehensive dataset comprising over 190,0000 phishing URLs. This dataset is a valuable resource that can be harnessed by both researchers and practitioners for enhancing phishing filters, fortifying firewalls, security education, and refining training and testing models, among other applications.
AB - Phishing constitutes a form of social engineering that aims to deceive individuals through email communication. Extensive prior research has underscored phishing as one of the most commonly employed attack vectors for infiltrating organizational networks. A prevalent method involves misleading the target by employing phishing URLs concealed through hyperlink strategies. PhishTank, a website employing the concept of crowd-sourcing, aggregates phishing URLs and subsequently verifies their authenticity. In the course of this study, we leveraged a Python script to extract data from the PhishTank website, amassing a comprehensive dataset comprising over 190,0000 phishing URLs. This dataset is a valuable resource that can be harnessed by both researchers and practitioners for enhancing phishing filters, fortifying firewalls, security education, and refining training and testing models, among other applications.
KW - Artificial intelligence
KW - Computer security
KW - Dataset
KW - Email security
KW - Phished URL
KW - Social engineering
KW - Web security
UR - http://www.scopus.com/inward/record.url?scp=85180539147&partnerID=8YFLogxK
U2 - 10.1016/j.dib.2023.109959
DO - 10.1016/j.dib.2023.109959
M3 - Article
AN - SCOPUS:85180539147
SN - 2352-3409
VL - 52
JO - Data in Brief
JF - Data in Brief
M1 - 109959
ER -
