TY - JOUR
T1 - Improving Non-Negative Positive-Unlabeled Learning for News Headline Classification
AU - Ji, Zhanlin
AU - Du, Chengyuan
AU - Jiang, Jiawen
AU - Zhao, Li
AU - Zhang, Haiyang
AU - Ganchev, Ivan
N1 - Publisher Copyright: Author
PY - 2023
Y1 - 2023
N2 - With the development of Internet technology, online platforms have gradually become a major channel through which people obtain hot news. Filtering the current hot news out of large news collections and pushing it to users therefore has important application value. In supervised learning scenarios, each piece of news must be labeled manually, which takes considerable time and effort. From the perspective of semi-supervised learning, and building on non-negative Positive-Unlabeled (nnPU) learning, this paper proposes a novel algorithm, called 'Enhanced nnPU with Focal Loss' (FLPU), for news headline classification, which replaces the way the classical nnPU algorithm calculates the empirical risk of positive and negative samples with the Focal Loss. Then, by introducing the Virtual Adversarial Training (VAT) of Adversarial training for large neural LangUage Models (ALUM) into FLPU, another (and better) algorithm, called 'FLPU+ALUM', is proposed for the same purpose, requiring only a small number of positive samples to be labeled. The superiority of both algorithms over the state-of-the-art PU algorithms considered is demonstrated through performance-comparison experiments conducted on two datasets. Moreover, another set of experiments shows that, when enriched with the proposed algorithms, the RoBERTa-wwm-ext model can achieve better classification performance than the state-of-the-art binary classification models included in the comparison. In addition, a 'Ratio Batch' method is elaborated and proposed as a more stable alternative for scenarios involving only a small number of labeled positive samples, which is also demonstrated experimentally.
AB - With the development of Internet technology, online platforms have gradually become a major channel through which people obtain hot news. Filtering the current hot news out of large news collections and pushing it to users therefore has important application value. In supervised learning scenarios, each piece of news must be labeled manually, which takes considerable time and effort. From the perspective of semi-supervised learning, and building on non-negative Positive-Unlabeled (nnPU) learning, this paper proposes a novel algorithm, called 'Enhanced nnPU with Focal Loss' (FLPU), for news headline classification, which replaces the way the classical nnPU algorithm calculates the empirical risk of positive and negative samples with the Focal Loss. Then, by introducing the Virtual Adversarial Training (VAT) of Adversarial training for large neural LangUage Models (ALUM) into FLPU, another (and better) algorithm, called 'FLPU+ALUM', is proposed for the same purpose, requiring only a small number of positive samples to be labeled. The superiority of both algorithms over the state-of-the-art PU algorithms considered is demonstrated through performance-comparison experiments conducted on two datasets. Moreover, another set of experiments shows that, when enriched with the proposed algorithms, the RoBERTa-wwm-ext model can achieve better classification performance than the state-of-the-art binary classification models included in the comparison. In addition, a 'Ratio Batch' method is elaborated and proposed as a more stable alternative for scenarios involving only a small number of labeled positive samples, which is also demonstrated experimentally.
KW - Text classification
KW - adversarial training for large neural language models (ALUM)
KW - focal loss
KW - non-negative positive-unlabeled (nnPU) learning
KW - virtual adversarial training (VAT)
UR - http://www.scopus.com/inward/record.url?scp=85153801352&partnerID=8YFLogxK
U2 - 10.1109/ACCESS.2023.3269304
DO - 10.1109/ACCESS.2023.3269304
M3 - Article
AN - SCOPUS:85153801352
SN - 2169-3536
VL - 11
SP - 40192
EP - 40203
JO - IEEE Access
JF - IEEE Access
ER -
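
Note: to connect the abstract's description to an implementation, the following is a minimal PyTorch sketch of the core FLPU idea as the abstract states it: the non-negative PU risk estimator of Kiryo et al. (2017) with the surrogate loss replaced by the Focal Loss. The function names, the gamma default, the omission of the focal alpha-balancing term, and the omission of nnPU's gradient-handling trick during training are illustrative assumptions, not the paper's exact formulation.

    import torch

    def focal_loss(logits, targets, gamma=2.0):
        """Per-sample focal loss: -(1 - p_t)^gamma * log(p_t).
        targets are 1.0 (positive) or 0.0 (negative); logits are raw
        scores for the positive class. Alpha balancing is omitted here."""
        p = torch.sigmoid(logits)
        p_t = torch.where(targets == 1, p, 1 - p)
        return -((1 - p_t) ** gamma) * torch.log(p_t.clamp(min=1e-8))

    def nnpu_focal_risk(logits_p, logits_u, prior, gamma=2.0):
        """Non-negative PU risk (Kiryo et al., 2017) with the surrogate
        loss swapped for focal loss, per the FLPU idea. 'prior' is the
        class prior pi_p of positives among the unlabeled data."""
        ones_p = torch.ones_like(logits_p)
        zeros_p = torch.zeros_like(logits_p)
        zeros_u = torch.zeros_like(logits_u)

        risk_p_pos = focal_loss(logits_p, ones_p, gamma).mean()   # R_p^+
        risk_p_neg = focal_loss(logits_p, zeros_p, gamma).mean()  # R_p^-
        risk_u_neg = focal_loss(logits_u, zeros_u, gamma).mean()  # R_u^-

        # Negative-class risk estimated from unlabeled data, corrected
        # by the positive contribution and clamped at zero (the "nn" part).
        neg_risk = risk_u_neg - prior * risk_p_neg
        return prior * risk_p_pos + torch.clamp(neg_risk, min=0.0)

As a usage sketch, logits_p would come from the model's scores on labeled positive headlines and logits_u from unlabeled ones within a batch; the returned scalar is backpropagated as the training loss.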