A domain-independent automatic labeling system for large-scale social data annotation using lexicon and web-based augmentation

Shaheen Khatoon*, Lamis Abu Romman, Md Maruf Hasan

*Corresponding author for this work

Research output: Contribution to journal › Article › peer-review

6 Citations (Scopus)

Abstract

Recently, with the large-scale adoption of social media, people have begun to express their opinions on these sites in the form of reviews. Potential consumers are often forced to wade through a massive number of reviews to make an informed decision. Sentiment analysis has become a fast and effective way to gauge consumers’ opinions automatically. However, such analysis often requires a tedious process of manually annotating extensive training examples or manually crafting a lexicon to find the Semantic Orientation (SO) of online reviews. In this paper, we present a method to automate the laborious process of labeling extensive textual data in an unsupervised, domain-independent, and scalable manner. The proposed method combines lexicon-based and Web-based Pointwise Mutual Information (PMI) statistics to find the SO of the opinion expressed in a review. Based on the proposed method, a system called Domain-Independent Automatic Labeling System (DIALS) has been implemented, which takes a collection of text from any domain as input and generates a fully labeled dataset without any manual intervention. The resulting labeled dataset can be used to track and summarize online discussion and/or to train a classifier in a later stage of development. The effectiveness of the system is tested by comparing its results with baseline machine learning and lexicon-based methods. Experiments on cross-domain datasets show that the proposed system consistently achieves higher recall and accuracy than these baselines.
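
The abstract does not spell out how the PMI statistics are computed. As a rough illustration only, the sketch below follows the commonly used SO-PMI formulation, in which the semantic orientation of a phrase is its PMI with a positive reference word minus its PMI with a negative one. The function names, the smoothing constant, and the hit counts are all hypothetical and are not taken from DIALS.

```python
import math


def pmi(joint, count_a, count_b, total):
    """PMI(a, b) = log2(P(a, b) / (P(a) * P(b))), estimated from raw counts."""
    return math.log2((joint * total) / (count_a * count_b))


def semantic_orientation(hits_near_pos, hits_near_neg, hits_pos, hits_neg,
                         smoothing=0.01):
    """SO(phrase) = PMI(phrase, pos) - PMI(phrase, neg).

    The phrase count and the corpus size cancel in the difference, leaving a
    log-ratio of co-occurrence counts. `smoothing` (an assumed value) avoids
    division by zero when a phrase never co-occurs with a reference word.
    """
    numerator = (hits_near_pos + smoothing) * (hits_neg + smoothing)
    denominator = (hits_near_neg + smoothing) * (hits_pos + smoothing)
    return math.log2(numerator / denominator)


def label(so, threshold=0.0):
    """Map an SO score to a polarity label (the threshold is an assumption)."""
    return "positive" if so > threshold else "negative"


# Hypothetical co-occurrence counts for one phrase against two reference words.
score = semantic_orientation(hits_near_pos=80, hits_near_neg=20,
                             hits_pos=1000, hits_neg=1000)
print(label(score))
```

In a Web-based setting the counts would typically come from search-engine hit counts or a large corpus; how DIALS combines such scores with lexicon scores is described in the full paper, not here.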

Original language: English
Pages (from-to): 36-54
Number of pages: 19
Journal: Information Technology and Control
Volume: 49
Issue number: 1
Publication status: Published - 2020
Externally published: Yes

Keywords

  • Information retrieval
  • Sentiment analysis
  • Unsupervised learning
