An integrated approach for intrinsic plagiarism detection

Muna AlSallal; Rahat Iqbal; Vasile Palade; Saad Amin; Victor Chang

doi:10.1016/j.future.2017.11.023

Muna AlSallal, Rahat Iqbal*, Vasile Palade, Saad Amin, Victor Chang

*Corresponding author for this work

International Business School Suzhou

Coventry University

Research output: Contribution to journal › Article › peer-review

36 Citations (Scopus)

Abstract

Employing effective plagiarism detection methods is seen to be essential in the next generation web. In this paper, we present a novel approach for plagiarism detection without reference collections. The proposed approach relies on statistical properties of the most common words, and on Latent Semantic Analysis, which is applied to extract the usage patterns of the most common words. This method aims to generate a model of an author's “style” by revealing a set of certain features of authorship. The model generation procedure focuses on just one author, as an attempt to summarise the aspects of an author's style in a definitive and clear-cut manner. The feature set of the intrinsic model was based on the frequency of the most common words, their relative frequencies in the book series, and the deviation of these frequencies across all books for a particular author. The approach has been evaluated using the leave-one-out cross-validation method on the CEN (Corpus of English Novels) data set. Results have indicated that, by integrating deep latent semantic and stylometric analyses, hidden changes can be identified when a reference collection does not exist. The results have also shown that our Multi-Layer Perceptron based approach statistically outperforms Bayesian Network, Support Vector Machine and Random Forest models, by accurately predicting the author classes with an overall accuracy of 97%.
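
The three word-frequency features named in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the `top_k` cut-off, the tokenisation rule and the toy texts are all assumptions made for illustration, and the Latent Semantic Analysis and Multi-Layer Perceptron stages are omitted entirely.

```python
from collections import Counter
from statistics import mean, pstdev
import re


def tokenize(text):
    """Lowercase a text and split it into simple word tokens (assumed rule)."""
    return re.findall(r"[a-z']+", text.lower())


def common_word_features(books, top_k=5):
    """Return per-word statistics over one author's books.

    books: list of raw text strings, one per book (toy data in this sketch).
    top_k: how many of the most common words to keep (a hypothetical value).
    """
    tokenized = [tokenize(book) for book in books]

    # Most common words over the whole series by this author.
    series_counts = Counter(word for tokens in tokenized for word in tokens)
    series_total = sum(series_counts.values())
    vocabulary = [word for word, _ in series_counts.most_common(top_k)]

    features = {}
    for word in vocabulary:
        # Relative frequency of the word in each individual book;
        # max(..., 1) avoids dividing by zero for an empty book.
        per_book = [tokens.count(word) / max(len(tokens), 1) for tokens in tokenized]
        features[word] = {
            "series_rel_freq": series_counts[word] / series_total,
            "mean_book_rel_freq": mean(per_book),
            "deviation_across_books": pstdev(per_book),
        }
    return features


# Toy usage: two very short "books" by the same hypothetical author.
books = ["the cat sat on the mat", "the dog and the cat"]
features = common_word_features(books, top_k=2)
for word, stats in features.items():
    print(word, stats)
```

A passage whose common-word frequencies fall far outside the author's usual spread would be a candidate for closer inspection under this kind of model. How the paper actually scores such deviations is not stated in the abstract.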

Original language	English
Pages (from-to)	700-712
Number of pages	13
Journal	Future Generation Computer Systems
Volume	96
DOIs	https://doi.org/10.1016/j.future.2017.11.023
Publication status	Published - Jul 2019

Cite this

@article{dff5ae0d4af14215ae7ba1de846bdc6c,
title = "An integrated approach for intrinsic plagiarism detection",
abstract = "Employing effective plagiarism detection methods is seen to be essential in the next generation web. In this paper, we present a novel approach for plagiarism detection without reference collections. The proposed approach relies on statistical properties of the most common words, and on Latent Semantic Analysis, which is applied to extract the usage patterns of the most common words. This method aims to generate a model of an author's “style” by revealing a set of certain features of authorship. The model generation procedure focuses on just one author, as an attempt to summarise the aspects of an author's style in a definitive and clear-cut manner. The feature set of the intrinsic model was based on the frequency of the most common words, their relative frequencies in the book series, and the deviation of these frequencies across all books for a particular author. The approach has been evaluated using the leave-one-out cross-validation method on the CEN (Corpus of English Novels) data set. Results have indicated that, by integrating deep latent semantic and stylometric analyses, hidden changes can be identified when a reference collection does not exist. The results have also shown that our Multi-Layer Perceptron based approach statistically outperforms Bayesian Network, Support Vector Machine and Random Forest models, by accurately predicting the author classes with an overall accuracy of 97%.",
author = "Muna AlSallal and Rahat Iqbal and Vasile Palade and Saad Amin and Victor Chang",
note = "Publisher Copyright: {\textcopyright} 2017 Elsevier B.V.",
year = "2019",
month = jul,
doi = "10.1016/j.future.2017.11.023",
language = "English",
volume = "96",
pages = "700--712",
journal = "Future Generation Computer Systems",
issn = "0167-739X",
}

TY  - JOUR
T1  - An integrated approach for intrinsic plagiarism detection
AU  - AlSallal, Muna
AU  - Iqbal, Rahat
AU  - Palade, Vasile
AU  - Amin, Saad
AU  - Chang, Victor
PY  - 2019/7
Y1  - 2019/7
N2  - Employing effective plagiarism detection methods is seen to be essential in the next generation web. In this paper, we present a novel approach for plagiarism detection without reference collections. The proposed approach relies on statistical properties of the most common words, and on Latent Semantic Analysis, which is applied to extract the usage patterns of the most common words. This method aims to generate a model of an author's “style” by revealing a set of certain features of authorship. The model generation procedure focuses on just one author, as an attempt to summarise the aspects of an author's style in a definitive and clear-cut manner. The feature set of the intrinsic model was based on the frequency of the most common words, their relative frequencies in the book series, and the deviation of these frequencies across all books for a particular author. The approach has been evaluated using the leave-one-out cross-validation method on the CEN (Corpus of English Novels) data set. Results have indicated that, by integrating deep latent semantic and stylometric analyses, hidden changes can be identified when a reference collection does not exist. The results have also shown that our Multi-Layer Perceptron based approach statistically outperforms Bayesian Network, Support Vector Machine and Random Forest models, by accurately predicting the author classes with an overall accuracy of 97%.
AB  - Employing effective plagiarism detection methods is seen to be essential in the next generation web. In this paper, we present a novel approach for plagiarism detection without reference collections. The proposed approach relies on statistical properties of the most common words, and on Latent Semantic Analysis, which is applied to extract the usage patterns of the most common words. This method aims to generate a model of an author's “style” by revealing a set of certain features of authorship. The model generation procedure focuses on just one author, as an attempt to summarise the aspects of an author's style in a definitive and clear-cut manner. The feature set of the intrinsic model was based on the frequency of the most common words, their relative frequencies in the book series, and the deviation of these frequencies across all books for a particular author. The approach has been evaluated using the leave-one-out cross-validation method on the CEN (Corpus of English Novels) data set. Results have indicated that, by integrating deep latent semantic and stylometric analyses, hidden changes can be identified when a reference collection does not exist. The results have also shown that our Multi-Layer Perceptron based approach statistically outperforms Bayesian Network, Support Vector Machine and Random Forest models, by accurately predicting the author classes with an overall accuracy of 97%.
UR  - http://www.scopus.com/inward/record.url?scp=85040009178&partnerID=8YFLogxK
U2  - 10.1016/j.future.2017.11.023
DO  - 10.1016/j.future.2017.11.023
M3  - Article
AN  - SCOPUS:85040009178
SN  - 0167-739X
VL  - 96
SP  - 700
EP  - 712
JO  - Future Generation Computer Systems
JF  - Future Generation Computer Systems
ER  -
