TY - JOUR
T1 - Implications of imbalanced datasets for empirical ROC-AUC estimation in binary classification tasks
AU - Liu, Yujian
AU - Xie, Dejun
AU - Li, Yazhe
N1 - Publisher Copyright: © 2023 Informa UK Limited, trading as Taylor & Francis Group.
PY - 2023/7/23
Y1 - 2023/7/23
N2 - The area under the curve (AUC) is the most popular measure for summarizing a binary classifier's receiver operating characteristic (ROC) curve. Therefore, it is essential to ensure that the AUC estimation is accurate. One straightforward and popular estimation approach is to calculate the empirical AUC from the data. However, one must look closely at the behaviour of this point estimator, particularly its variance. This study demonstrates both analytically and empirically that the empirical AUC estimation could be highly volatile in many circumstances when applied to an imbalanced dataset. To be more specific, we have proved that under some frequently encountered circumstances, variances of the empirical AUC estimator increase with the imbalanced level of the dataset. Hence, under the imbalanced setting, variances could be high. Furthermore, we conduct several simulations and experiments to solidify our findings. Therefore, extra attention must be paid when the empirical ROC-AUC is used to summarize the classifier's performance, especially when the dataset presents high class imbalance.
KW - Area under receiver operating characteristic curve
KW - empirical AUC estimator
KW - imbalanced dataset
UR - http://www.scopus.com/inward/record.url?scp=85165444424&partnerID=8YFLogxK
U2 - 10.1080/00949655.2023.2238235
DO - 10.1080/00949655.2023.2238235
M3 - Article
AN - SCOPUS:85165444424
SN - 0094-9655
VL - 94
SP - 183
EP - 203
JO - Journal of Statistical Computation and Simulation
JF - Journal of Statistical Computation and Simulation
IS - 1
ER -