Implications of imbalanced datasets for empirical ROC-AUC estimation in binary classification tasks

Yujian Liu; Dejun Xie; Yazhe Li

doi:10.1080/00949655.2023.2238235

Yujian Liu, Dejun Xie^*, Yazhe Li

^*Corresponding author for this work

Department of Financial and Actuarial Mathematics

Research output: Contribution to journal › Article › peer-review

9 Citations (Scopus)

Abstract

The area under the curve (AUC) is the most popular measure for summarizing a binary classifier's receiver operating characteristic (ROC) curve. Therefore, it is essential to ensure that the AUC estimation is accurate. One straightforward and popular estimation approach is to calculate the empirical AUC from the data. However, one must look closely at the behaviour of this point estimator, particularly its variance. This study demonstrates both analytically and empirically that the empirical AUC estimation could be highly volatile in many circumstances when applied to an imbalanced dataset. To be more specific, we have proved that under some frequently encountered circumstances, variances of the empirical AUC estimator increase with the imbalanced level of the dataset. Hence, under the imbalanced setting, variances could be high. Furthermore, we conduct several simulations and experiments to solidify our findings. Therefore, extra attention must be paid when the empirical ROC-AUC is used to summarize the classifier's performance, especially when the dataset presents high class imbalance.
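The abstract's central claim, that the variance of the empirical AUC grows as the dataset becomes more imbalanced, can be illustrated with a small Monte Carlo sketch. This is not the authors' experimental setup. The Gaussian score model, the sample size of 1000, the three positive-class fractions, and the helper names `empirical_auc` and `auc_variance` are all assumptions chosen for illustration. The empirical AUC is computed in its Mann-Whitney form, as the share of positive-negative pairs in which the positive example receives the higher score.

```python
import numpy as np


def empirical_auc(pos_scores, neg_scores):
    """Empirical AUC in Mann-Whitney form.

    Share of (positive, negative) pairs in which the positive example
    scores higher; tied pairs count one half.
    """
    pos = np.asarray(pos_scores, dtype=float)[:, None]
    neg = np.asarray(neg_scores, dtype=float)[None, :]
    return float(np.mean((pos > neg) + 0.5 * (pos == neg)))


def auc_variance(n_total, pos_fraction, n_repeats=300, shift=1.0, seed=0):
    """Sample variance of the empirical AUC over repeated simulated datasets.

    Assumed (hypothetical) score model: negatives ~ N(0, 1),
    positives ~ N(shift, 1). The total sample size is held fixed so that
    only the class balance changes.
    """
    rng = np.random.default_rng(seed)
    n_pos = max(1, int(round(n_total * pos_fraction)))
    n_neg = n_total - n_pos
    aucs = np.empty(n_repeats)
    for i in range(n_repeats):
        pos = rng.normal(shift, 1.0, size=n_pos)
        neg = rng.normal(0.0, 1.0, size=n_neg)
        aucs[i] = empirical_auc(pos, neg)
    return float(np.var(aucs, ddof=1))


# Same total sample size, increasingly imbalanced class split.
variances = {frac: auc_variance(1000, frac) for frac in (0.5, 0.1, 0.01)}
for frac, var in variances.items():
    print(f"positive fraction {frac:>4}: variance of empirical AUC = {var:.2e}")
```

Under these assumptions the printed variance should rise as the positive fraction falls, because the estimate then rests on very few positive examples even though the total sample size is unchanged.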

Original language	English
Pages (from-to)	183-203
Number of pages	21
Journal	Journal of Statistical Computation and Simulation
Volume	94
Issue number	1
DOIs	https://doi.org/10.1080/00949655.2023.2238235
Publication status	Published - 23 Jul 2023

Keywords

Area under receiver operating characteristic curve
empirical AUC estimator
imbalanced dataset

Access to Document

10.1080/00949655.2023.2238235

Cite this

@article{69b1c48437ae4240959a12051e644426,

title = "Implications of imbalanced datasets for empirical ROC-AUC estimation in binary classification tasks",

abstract = "The area under the curve (AUC) is the most popular measure for summarizing a binary classifier's receiver operating characteristic (ROC) curve. Therefore, it is essential to ensure that the AUC estimation is accurate. One straightforward and popular estimation approach is to calculate the empirical AUC from the data. However, one must look closely at the behaviour of this point estimator, particularly its variance. This study demonstrates both analytically and empirically that the empirical AUC estimation could be highly volatile in many circumstances when applied to an imbalanced dataset. To be more specific, we have proved that under some frequently encountered circumstances, variances of the empirical AUC estimator increase with the imbalanced level of the dataset. Hence, under the imbalanced setting, variances could be high. Furthermore, we conduct several simulations and experiments to solidify our findings. Therefore, extra attention must be paid when the empirical ROC-AUC is used to summarize the classifier's performance, especially when the dataset presents high class imbalance.",

keywords = "Area under receiver operating characteristic curve, empirical AUC estimator, imbalanced dataset",

author = "Yujian Liu and Dejun Xie and Yazhe Li",

note = "Publisher Copyright: {\textcopyright} 2023 Informa UK Limited, trading as Taylor \& Francis Group.",

year = "2023",

month = jul,

day = "23",

doi = "10.1080/00949655.2023.2238235",

language = "English",

volume = "94",

pages = "183--203",

journal = "Journal of Statistical Computation and Simulation",

issn = "0094-9655",

number = "1",

}

TY - JOUR

T1 - Implications of imbalanced datasets for empirical ROC-AUC estimation in binary classification tasks

AU - Liu, Yujian

AU - Xie, Dejun

AU - Li, Yazhe

PY - 2023/7/23

Y1 - 2023/7/23

N2 - The area under the curve (AUC) is the most popular measure for summarizing a binary classifier's receiver operating characteristic (ROC) curve. Therefore, it is essential to ensure that the AUC estimation is accurate. One straightforward and popular estimation approach is to calculate the empirical AUC from the data. However, one must look closely at the behaviour of this point estimator, particularly its variance. This study demonstrates both analytically and empirically that the empirical AUC estimation could be highly volatile in many circumstances when applied to an imbalanced dataset. To be more specific, we have proved that under some frequently encountered circumstances, variances of the empirical AUC estimator increase with the imbalanced level of the dataset. Hence, under the imbalanced setting, variances could be high. Furthermore, we conduct several simulations and experiments to solidify our findings. Therefore, extra attention must be paid when the empirical ROC-AUC is used to summarize the classifier's performance, especially when the dataset presents high class imbalance.

KW - Area under receiver operating characteristic curve

KW - empirical AUC estimator

KW - imbalanced dataset

UR - http://www.scopus.com/inward/record.url?scp=85165444424&partnerID=8YFLogxK

U2 - 10.1080/00949655.2023.2238235

DO - 10.1080/00949655.2023.2238235

M3 - Article

AN - SCOPUS:85165444424

SN - 0094-9655

VL - 94

SP - 183

EP - 203

JO - Journal of Statistical Computation and Simulation

JF - Journal of Statistical Computation and Simulation

IS - 1

ER -
