TY - JOUR
T1 - Handling mislabeled data in fault diagnosis: A graph-assisted random forest approach
AU - Chen, Shaozhi
AU - Xi, Xiaopeng
AU - Zhong, Maiying
AU - Yang, Rui
AU - Orchard, Marcos E.
N1 - Publisher Copyright: © 2026 Elsevier B.V.
PY - 2026/3/28
Y1 - 2026/3/28
N2 - Label information plays a critical role in both supervised and semi-supervised learning-based fault diagnosis methods. However, mislabeled data can significantly degrade the classification performance of the resulting fault diagnosis model. To address this challenge, this paper proposes a graph-assisted random forest (GARF) approach that mitigates the adverse effects of mislabeled data in fault diagnosis. The core of the approach is a spectral clustering matching (SCM)-based method for identifying incorrect labels, which leverages the independence of the graph structure from sample labels. Identified mislabeled samples are stripped of their incorrect labels and treated as unlabeled data. A graph-based semi-supervised learning (GSSL) algorithm then infers corrected labels for these samples from the underlying graph topology. Finally, a random forest (RF) classifier is trained on the rectified dataset to establish the GARF-based fault diagnosis model, which supports real-time fault diagnosis. The proposed method is validated using monitoring data from a hardware-in-the-loop high-speed train simulation platform. Experimental results show that the GARF method outperforms multiple existing approaches across key metrics, including accuracy, recall, F1-score, and computational efficiency.
KW - Fault diagnosis
KW - Graph-assisted random forest
KW - Label correction
KW - Mislabeled data
KW - Semi-supervised learning
UR - https://www.scopus.com/pages/publications/105027403845
U2 - 10.1016/j.neucom.2026.132669
DO - 10.1016/j.neucom.2026.132669
M3 - Article
AN - SCOPUS:105027403845
SN - 0925-2312
VL - 671
JO - Neurocomputing
JF - Neurocomputing
M1 - 132669
ER -