TY - GEN
T1 - Boosting the phishing detection performance by semantic analysis
AU - Zhang, Xi
AU - Zeng, Yu
AU - Jin, Xiao Bo
AU - Yan, Zhi Wei
AU - Geng, Guang Gang
N1 - Publisher Copyright: © 2017 IEEE.
PY - 2017/7/1
Y1 - 2017/7/1
N2 - Phishing has grown increasingly severe in recent years, seriously threatening the privacy and property security of Internet users. Phishing is essentially brand counterfeiting: to deceive victims effectively, phishing sites are made visually and semantically similar to the legitimate sites they imitate. Machine learning-based methods have become the mainstream approach to anti-phishing, and their effectiveness hinges on the statistical features extracted. However, these features mainly focus on visual similarity, information stealing, and third-party services, and ignore the semantic information of web pages. We therefore extract a series of semantic features through word2vec to better characterize phishing sites, and fuse them with other multi-scale statistical features to construct a more robust phishing detection model. Experimental results on real-world data sets show that the majority of phishing websites can be identified effectively using only the semantic features mined from word embeddings. The detection models based on fused features obtained the best results, which shows that semantic features and the other statistical features complement each other well. The proposed method offers a promising way to improve phishing detection performance in a real Internet environment.
KW - deep learning
KW - phishing detection
KW - semantic analysis
KW - statistical feature
KW - word embeddings
UR - http://www.scopus.com/inward/record.url?scp=85047748929&partnerID=8YFLogxK
U2 - 10.1109/BigData.2017.8258030
DO - 10.1109/BigData.2017.8258030
M3 - Conference Proceeding
AN - SCOPUS:85047748929
T3 - Proceedings - 2017 IEEE International Conference on Big Data, Big Data 2017
SP - 1063
EP - 1070
BT - Proceedings - 2017 IEEE International Conference on Big Data, Big Data 2017
A2 - Nie, Jian-Yun
A2 - Obradovic, Zoran
A2 - Suzumura, Toyotaro
A2 - Ghosh, Rumi
A2 - Nambiar, Raghunath
A2 - Wang, Chonggang
A2 - Zang, Hui
A2 - Baeza-Yates, Ricardo
A2 - Hu, Xiaohua
A2 - Kepner, Jeremy
A2 - Cuzzocrea, Alfredo
A2 - Tang, Jian
A2 - Toyoda, Masashi
PB - Institute of Electrical and Electronics Engineers Inc.
T2 - 5th IEEE International Conference on Big Data, Big Data 2017
Y2 - 11 December 2017 through 14 December 2017
ER -