Location analysis for Arabic COVID-19 Twitter data using enhanced dialect identification models

Nader Essam, Abdullah M. Moussa^*, Khaled M. Elsayed, Sherif Abdou, Mohsen Rashwan, Shaheen Khatoon, Md Maruf Hasan, Amna Asif, Majed A. Alshamari

^*Corresponding author for this work

doi:10.3390/app112311328

Research output: Contribution to journal › Article › peer-review

10 Citations (Scopus)

Abstract

The recent surge of social media networks has provided a channel to gather and publish vital medical and health information. The focal role of these networks has become more prominent in periods of crisis, such as the recent pandemic of COVID-19. These social networks have been the leading platform for broadcasting health news updates, precaution instructions, and governmental procedures. They also provide an effective means for gathering public opinion and tracking breaking events and stories. To achieve location-based analysis for social media input, the location information of the users must be captured. Most of the time, this information is either missing or hidden. For some languages, such as Arabic, the users’ location can be predicted from their dialects. The Arabic language has many local dialects for most Arab countries. Natural Language Processing (NLP) techniques have provided several approaches for dialect identification. The recent advanced language models using contextual-based word representations in the continuous domain, such as BERT models, have provided significant improvement for many NLP applications. In this work, we present our efforts to use BERT-based models to improve the dialect identification of Arabic text. We show the results of the developed models to recognize the source of the Arabic country, or the Arabic region, from Twitter data. Our results show 3.4% absolute enhancement in dialect identification accuracy on the regional level over the state-of-the-art result. When we excluded the Modern Standard Arabic (MSA) set, which is formal Arabic language, we achieved 3% absolute gain in accuracy between the three major Arabic dialects over the state-of-the-art level. Finally, we applied the developed models on a recently collected resource for COVID-19 Arabic tweets to recognize the source country from the users’ tweets. We achieved a weighted average accuracy of 97.36%, which proposes a tool to be used by policymakers to support country-level disaster-related activities.
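
The abstract reports a weighted average accuracy without defining it. A minimal sketch of one common reading, assuming the figure is the per-country accuracy weighted by the number of tweets from each country, is shown below. The function name, country codes, and all numbers are hypothetical illustrations and are not taken from the article.

```python
def weighted_average_accuracy(per_country):
    """Average per-country accuracy, weighted by tweet count.

    per_country maps a country label to a pair
    (accuracy in [0, 1], number of evaluated tweets).
    This weighting scheme is an assumption, not the authors' stated method.
    """
    total = sum(count for _, count in per_country.values())
    if total == 0:
        raise ValueError("no evaluated tweets")
    weighted_sum = sum(acc * count for acc, count in per_country.values())
    return weighted_sum / total


# Hypothetical per-country results, for illustration only.
example = {
    "EG": (0.98, 500),
    "SA": (0.96, 300),
    "AE": (0.95, 200),
}
score = weighted_average_accuracy(example)
print(f"weighted average accuracy: {score:.3f}")
```

Under this reading, countries with more tweets contribute more to the final figure, so a high overall score can coexist with weaker accuracy on sparsely represented countries.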

Original language	English
Article number	11328
Journal	Applied Sciences (Switzerland)
Volume	11
Issue number	23
DOIs	https://doi.org/10.3390/app112311328
Publication status	Published - 1 Dec 2021
Externally published	Yes

Keywords

BERT models
Dialect identification
Language identification
Location analysis
Social networks

Cite this

@article{a2f811b9261e4da8a964cfdc8ed1d2f1,

title = "Location analysis for {Arabic} {COVID-19} {Twitter} data using enhanced dialect identification models",

abstract = "The recent surge of social media networks has provided a channel to gather and publish vital medical and health information. The focal role of these networks has become more prominent in periods of crisis, such as the recent pandemic of COVID-19. These social networks have been the leading platform for broadcasting health news updates, precaution instructions, and governmental procedures. They also provide an effective means for gathering public opinion and tracking breaking events and stories. To achieve location-based analysis for social media input, the location information of the users must be captured. Most of the time, this information is either missing or hidden. For some languages, such as Arabic, the users{\textquoteright} location can be predicted from their dialects. The Arabic language has many local dialects for most Arab countries. Natural Language Processing (NLP) techniques have provided several approaches for dialect identification. The recent advanced language models using contextual-based word representations in the continuous domain, such as BERT models, have provided significant improvement for many NLP applications. In this work, we present our efforts to use BERT-based models to improve the dialect identification of Arabic text. We show the results of the developed models to recognize the source of the Arabic country, or the Arabic region, from Twitter data. Our results show 3.4% absolute enhancement in dialect identification accuracy on the regional level over the state-of-the-art result. When we excluded the Modern Standard Arabic (MSA) set, which is formal Arabic language, we achieved 3% absolute gain in accuracy between the three major Arabic dialects over the state-of-the-art level. Finally, we applied the developed models on a recently collected resource for COVID-19 Arabic tweets to recognize the source country from the users{\textquoteright} tweets. 
We achieved a weighted average accuracy of 97.36%, which proposes a tool to be used by policymakers to support country-level disaster-related activities.",

keywords = "BERT models, Dialect identification, Language identification, Location analysis, Social networks",

author = "Nader Essam and Moussa, {Abdullah M.} and Elsayed, {Khaled M.} and Sherif Abdou and Mohsen Rashwan and Shaheen Khatoon and Hasan, {Md Maruf} and Amna Asif and Alshamari, {Majed A.}",

note = "Funding Information: The authors are grateful to the Saudi Arabian Ministry of Education{\textquoteright}s Deputyship for Research and Innovation for supporting this research through project number 523. Publisher Copyright: {\textcopyright} 2021 by the authors. Licensee MDPI, Basel, Switzerland.",

year = "2021",

month = dec,

day = "1",

doi = "10.3390/app112311328",

language = "English",

volume = "11",

journal = "Applied Sciences (Switzerland)",

issn = "2076-3417",

number = "23",

}

TY - JOUR

T1 - Location analysis for Arabic COVID-19 Twitter data using enhanced dialect identification models

AU - Essam, Nader

AU - Moussa, Abdullah M.

AU - Elsayed, Khaled M.

AU - Abdou, Sherif

AU - Rashwan, Mohsen

AU - Khatoon, Shaheen

AU - Hasan, Md Maruf

AU - Asif, Amna

AU - Alshamari, Majed A.

N1 - Funding Information: The authors are grateful to the Saudi Arabian Ministry of Education’s Deputyship for Research and Innovation for supporting this research through project number 523. Publisher Copyright: © 2021 by the authors. Licensee MDPI, Basel, Switzerland.

PY - 2021/12/1

Y1 - 2021/12/1

AB - The recent surge of social media networks has provided a channel to gather and publish vital medical and health information. The focal role of these networks has become more prominent in periods of crisis, such as the recent pandemic of COVID-19. These social networks have been the leading platform for broadcasting health news updates, precaution instructions, and governmental procedures. They also provide an effective means for gathering public opinion and tracking breaking events and stories. To achieve location-based analysis for social media input, the location information of the users must be captured. Most of the time, this information is either missing or hidden. For some languages, such as Arabic, the users’ location can be predicted from their dialects. The Arabic language has many local dialects for most Arab countries. Natural Language Processing (NLP) techniques have provided several approaches for dialect identification. The recent advanced language models using contextual-based word representations in the continuous domain, such as BERT models, have provided significant improvement for many NLP applications. In this work, we present our efforts to use BERT-based models to improve the dialect identification of Arabic text. We show the results of the developed models to recognize the source of the Arabic country, or the Arabic region, from Twitter data. Our results show 3.4% absolute enhancement in dialect identification accuracy on the regional level over the state-of-the-art result. When we excluded the Modern Standard Arabic (MSA) set, which is formal Arabic language, we achieved 3% absolute gain in accuracy between the three major Arabic dialects over the state-of-the-art level. Finally, we applied the developed models on a recently collected resource for COVID-19 Arabic tweets to recognize the source country from the users’ tweets. We achieved a weighted average accuracy of 97.36%, which proposes a tool to be used by policymakers to support country-level disaster-related activities.

KW - BERT models

KW - Dialect identification

KW - Language identification

KW - Location analysis

KW - Social networks

UR - http://www.scopus.com/inward/record.url?scp=85120868415&partnerID=8YFLogxK

U2 - 10.3390/app112311328

DO - 10.3390/app112311328

M3 - Article

AN - SCOPUS:85120868415

SN - 2076-3417

VL - 11

JO - Applied Sciences (Switzerland)

JF - Applied Sciences (Switzerland)

IS - 23

M1 - 11328

ER -
