TY - JOUR
T1 - An integrated approach for intrinsic plagiarism detection
AU - AlSallal, Muna
AU - Iqbal, Rahat
AU - Palade, Vasile
AU - Amin, Saad
AU - Chang, Victor
N1 - Publisher Copyright:
© 2017 Elsevier B.V.
PY - 2019/7
Y1 - 2019/7
N2 - Employing effective plagiarism detection methods are seen to be essential in the next generation web. In this paper, we present a novel approach for plagiarism detection without reference collections. The proposed approach relies on using some statistical properties of the most common words, and the Latent Semantic Analysis that is applied to extract the most common words usage patterns. This method aims to generate a model of author's “style” by revealing a set of certain features of authorship. The model generation procedure focuses on just one author, as an attempt to summarise the aspects of an author's style in a definitive and clear-cut manner. The feature set of the intrinsic model were based on the frequency of the most common words, their relative frequencies in the book series, and the deviation of these frequencies across all books for a particular author. The approach has been evaluated using the leave-one-out-cross-validation method on the CEN (Corpus of English Novel) data set. Results have indicated that, by integrating deep latent semantic and stylometric analyses, hidden changes can be identified when a reference collection does not exist. The results have also shown that our Multi-Layer Perceptron based approach statistically outperforms Bayesian Network, Support Vector Machine and Random Forest models, by accurately predicting the author classes with an overall accuracy of 97%.
AB - Employing effective plagiarism detection methods are seen to be essential in the next generation web. In this paper, we present a novel approach for plagiarism detection without reference collections. The proposed approach relies on using some statistical properties of the most common words, and the Latent Semantic Analysis that is applied to extract the most common words usage patterns. This method aims to generate a model of author's “style” by revealing a set of certain features of authorship. The model generation procedure focuses on just one author, as an attempt to summarise the aspects of an author's style in a definitive and clear-cut manner. The feature set of the intrinsic model were based on the frequency of the most common words, their relative frequencies in the book series, and the deviation of these frequencies across all books for a particular author. The approach has been evaluated using the leave-one-out-cross-validation method on the CEN (Corpus of English Novel) data set. Results have indicated that, by integrating deep latent semantic and stylometric analyses, hidden changes can be identified when a reference collection does not exist. The results have also shown that our Multi-Layer Perceptron based approach statistically outperforms Bayesian Network, Support Vector Machine and Random Forest models, by accurately predicting the author classes with an overall accuracy of 97%.
UR - http://www.scopus.com/inward/record.url?scp=85040009178&partnerID=8YFLogxK
U2 - 10.1016/j.future.2017.11.023
DO - 10.1016/j.future.2017.11.023
M3 - Article
AN - SCOPUS:85040009178
SN - 0167-739X
VL - 96
SP - 700
EP - 712
JO - Future Generation Computer Systems
JF - Future Generation Computer Systems
ER -