Binary PSO with mutation operator for feature selection using decision tree applied to spam detection

Yudong Zhang*, Shuihua Wang, Preetha Phillips, Genlin Ji

*Corresponding author for this work

Research output: Contribution to journal › Article › peer-review

366 Citations (Scopus)

Abstract

In this paper, we proposed a novel spam detection method that focuses on reducing the false positive error, i.e., mislabeling nonspam as spam. First, we used a wrapper-based feature selection method to extract crucial features. Second, a decision tree was chosen as the classifier model, with C4.5 as the training algorithm. Third, a cost matrix was introduced to give different weights to the two error types, i.e., the false positive and false negative errors; we defined a weight parameter α to adjust their relative importance. Fourth, K-fold cross validation was employed to reduce out-of-sample error. Finally, binary PSO with a mutation operator (MBPSO) was used as the subset search strategy. Our experimental dataset contains 6000 emails collected during 2012. We conducted a Kolmogorov-Smirnov hypothesis test on the capital-run-length related features and found that all the p-values were less than 0.001. We then found that α = 7 was the most appropriate value for our model. Among seven meta-heuristic algorithms, we demonstrated that MBPSO is superior to GA, RSA, PSO, and BPSO in terms of classification performance. The sensitivity, specificity, and accuracy of the decision tree with feature selection by MBPSO were 91.02%, 97.51%, and 94.27%, respectively. We also compared MBPSO with conventional feature selection methods such as SFS and SBS, and the results showed that MBPSO performs better than both. We further demonstrated that wrappers are more effective than filters with regard to classification performance indexes. The results show that the proposed method is effective and can reduce the false positive error without compromising sensitivity or accuracy.
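The abstract names three ingredients that are easy to misread without a concrete form: the α-weighted cost over the two error types, the sensitivity/specificity/accuracy indexes, and a binary PSO update with mutation. The Python sketch below illustrates one plausible reading of each, with spam taken as the positive class. Everything in it is an assumption for illustration, not the authors' implementation: the cost form `alpha * fp + fn`, the sigmoid transfer function of standard binary PSO, the bit-flip mutation, and all names and default parameter values (`w`, `c1`, `c2`, `v_max`, `p_mut`) are hypothetical.

```python
import math
import random


def weighted_cost(fp, fn, alpha):
    """Assumed cost: one false positive (nonspam flagged as spam)
    counts alpha times as much as one false negative."""
    return alpha * fp + fn


def rates(tp, fn, tn, fp):
    """Sensitivity, specificity, and accuracy from confusion counts
    (positive class = spam)."""
    sensitivity = tp / (tp + fn)   # share of spam that is caught
    specificity = tn / (tn + fp)   # share of nonspam that is kept
    accuracy = (tp + tn) / (tp + fn + tn + fp)
    return sensitivity, specificity, accuracy


def mbpso_step(position, velocity, pbest, gbest, rng,
               w=0.7, c1=2.0, c2=2.0, v_max=4.0, p_mut=0.05):
    """One update of a single particle whose bits mark selected features.

    Hypothetical sketch: standard binary-PSO velocity update and sigmoid
    sampling, followed by a bit-flip mutation with probability p_mut.
    """
    new_position, new_velocity = [], []
    for x, v, pb, gb in zip(position, velocity, pbest, gbest):
        # Pull the velocity toward the personal and global best bits.
        v = w * v + c1 * rng.random() * (pb - x) + c2 * rng.random() * (gb - x)
        v = max(-v_max, min(v_max, v))  # clamp to keep the sigmoid unsaturated
        # The sigmoid maps velocity to the probability that the bit is 1.
        bit = 1 if rng.random() < 1.0 / (1.0 + math.exp(-v)) else 0
        # Mutation: occasionally flip the bit to help escape stagnation.
        if rng.random() < p_mut:
            bit = 1 - bit
        new_position.append(bit)
        new_velocity.append(v)
    return new_position, new_velocity
```

In a wrapper setting, each particle's bit vector would select a feature subset, a decision tree would be trained and evaluated on that subset under K-fold cross validation, and a score such as `weighted_cost` would serve as the fitness to minimize. With α > 1, a subset that produces fewer false positives is preferred even when it produces somewhat more false negatives.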

Original language: English
Pages (from-to): 22-31
Number of pages: 10
Journal: Knowledge-Based Systems
Volume: 64
Publication status: Published - Jul 2014
Externally published: Yes

Keywords

  • Binary Particle Swarm Optimization
  • Cost matrix
  • Decision tree
  • Feature selection
  • Mutation operator
  • Premature convergence
  • Spam detection
  • Wrapper
