TY - JOUR
T1 - CGUFS
T2 - A clustering-guided unsupervised feature selection algorithm for gene expression data
AU - Xu, Zhaozhao
AU - Yang, Fangyuan
AU - Wang, Hong
AU - Sun, Junding
AU - Zhu, Hengde
AU - Wang, Shuihua
AU - Zhang, Yudong
N1 - Publisher Copyright: © 2023 The Author(s)
PY - 2023/10
Y1 - 2023/10
N2 - (Aim) Gene expression data are typically high-dimensional, with a limited number of samples, and contain many features that are unrelated to the disease of interest. Existing unsupervised feature selection algorithms primarily focus on the significance of features in maintaining the data structure and do not take into account the redundancy among features. Determining the appropriate number of significant features is another challenge. (Method) In this paper, we propose a clustering-guided unsupervised feature selection (CGUFS) algorithm for gene expression data that addresses these problems. The proposed algorithm introduces three improvements over existing algorithms. First, because existing clustering algorithms require the number of clusters to be specified manually, we propose an adaptive k-value strategy that assigns appropriate pseudo-labels to each sample by iteratively updating a change function. Second, because existing algorithms fail to consider the redundancy among features, we propose a feature grouping strategy that groups highly redundant features. Third, because existing algorithms cannot filter out redundant features, we propose an adaptive filtering strategy that determines the feature combinations to be retained by calculating the potentially effective features and potentially redundant features of each feature group. (Result) Experimental results show that the average accuracy (ACC) and Matthews correlation coefficient (MCC) of the C4.5 classifier on the optimal features selected by the CGUFS algorithm reach 74.37% and 63.84%, respectively, significantly superior to the existing algorithms. (Conclusion) Similarly, the average ACC and MCC of the AdaBoost classifier on the optimal features selected by the CGUFS algorithm are significantly superior to those of the existing algorithms. In addition, statistical experiment results show significant differences between the CGUFS algorithm and the existing algorithms.
AB - (Aim) Gene expression data are typically high-dimensional, with a limited number of samples, and contain many features that are unrelated to the disease of interest. Existing unsupervised feature selection algorithms primarily focus on the significance of features in maintaining the data structure and do not take into account the redundancy among features. Determining the appropriate number of significant features is another challenge. (Method) In this paper, we propose a clustering-guided unsupervised feature selection (CGUFS) algorithm for gene expression data that addresses these problems. The proposed algorithm introduces three improvements over existing algorithms. First, because existing clustering algorithms require the number of clusters to be specified manually, we propose an adaptive k-value strategy that assigns appropriate pseudo-labels to each sample by iteratively updating a change function. Second, because existing algorithms fail to consider the redundancy among features, we propose a feature grouping strategy that groups highly redundant features. Third, because existing algorithms cannot filter out redundant features, we propose an adaptive filtering strategy that determines the feature combinations to be retained by calculating the potentially effective features and potentially redundant features of each feature group. (Result) Experimental results show that the average accuracy (ACC) and Matthews correlation coefficient (MCC) of the C4.5 classifier on the optimal features selected by the CGUFS algorithm reach 74.37% and 63.84%, respectively, significantly superior to the existing algorithms. (Conclusion) Similarly, the average ACC and MCC of the AdaBoost classifier on the optimal features selected by the CGUFS algorithm are significantly superior to those of the existing algorithms. In addition, statistical experiment results show significant differences between the CGUFS algorithm and the existing algorithms.
KW - Clustering-guided
KW - Gene expression data
KW - Spectral clustering
KW - Unsupervised feature selection
KW - k-means
UR - http://www.scopus.com/inward/record.url?scp=85171993199&partnerID=8YFLogxK
U2 - 10.1016/j.jksuci.2023.101731
DO - 10.1016/j.jksuci.2023.101731
M3 - Article
AN - SCOPUS:85171993199
SN - 1319-1578
VL - 35
JO - Journal of King Saud University - Computer and Information Sciences
JF - Journal of King Saud University - Computer and Information Sciences
IS - 9
M1 - 101731
ER -