TY - JOUR
T1 - A domain-independent automatic labeling system for large-scale social data annotation using lexicon and web-based augmentation
AU - Khatoon, Shaheen
AU - Abu Romman, Lamis
AU - Maruf Hasan, Md
N1 - Publisher Copyright: © 2020, Kauno Technologijos Universitetas. All rights reserved.
PY - 2020
Y1 - 2020
N2 - Recently, with the large-scale adoption of social media, people have begun to express their opinions on these sites in the form of reviews. Potential consumers are often forced to wade through a massive number of reviews to make an informed decision. Sentiment analysis has become a fast and effective way to gauge consumers’ opinions automatically. However, such analysis often requires a tedious process of manually annotating extensive training examples or a manually crafted lexicon to find the Semantic Orientation (SO) of online reviews. In this paper, we present a method to automate the laborious process of labeling extensive textual data in an unsupervised, domain-independent, and scalable manner. The proposed method combines lexicon-based and Web-based Pointwise Mutual Information (PMI) statistics to find the SO of the opinion expressed in a review. Based on the proposed method, a system called Domain-Independent Automatic Labeling System (DIALS) has been implemented, which takes a collection of text from any domain as input and generates a fully labeled dataset without any manual intervention. The generated result can be used to track and summarize online discussion and/or to train any classifier in the next stage of development. The effectiveness of the system is tested by comparing its results with baseline machine learning and lexicon-based methods. Experiments on cross-domain datasets show that the proposed system consistently improves recall and accuracy compared to these baselines.
KW - Information retrieval
KW - Sentiment analysis
KW - Unsupervised learning
UR - http://www.scopus.com/inward/record.url?scp=85085896822&partnerID=8YFLogxK
U2 - 10.5755/j01.itc.49.1.23769
DO - 10.5755/j01.itc.49.1.23769
M3 - Article
AN - SCOPUS:85085896822
SN - 1392-124X
VL - 49
SP - 36
EP - 54
JO - Information Technology and Control
JF - Information Technology and Control
IS - 1
ER -