CGUFS: A clustering-guided unsupervised feature selection algorithm for gene expression data

Zhaozhao Xu, Fangyuan Yang, Hong Wang, Junding Sun*, Hengde Zhu, Shuihua Wang, Yudong Zhang

*Corresponding author for this work

Research output: Contribution to journal › Article › peer-review

5 Citations (Scopus)

Abstract

(Aim) Gene expression data are typically high-dimensional, with a limited number of samples, and contain many features that are unrelated to the disease of interest. Existing unsupervised feature selection algorithms focus primarily on how well features preserve the data structure and do not take into account the redundancy among features. Determining the appropriate number of significant features is a further challenge. (Method) In this paper, we propose a clustering-guided unsupervised feature selection (CGUFS) algorithm for gene expression data that addresses these problems. The proposed algorithm introduces three improvements over existing algorithms. First, because existing clustering algorithms require the number of clusters to be specified manually, we propose an adaptive k-value strategy that assigns appropriate pseudo-labels to each sample by iteratively updating a change function. Second, because existing algorithms fail to consider the redundancy among features, we propose a feature grouping strategy that groups highly redundant features. Third, because existing algorithms cannot filter out redundant features, we propose an adaptive filtering strategy that determines the feature combinations to retain by calculating the potentially effective features and potentially redundant features of each feature group. (Result) Experimental results show that the average accuracy (ACC) and Matthews correlation coefficient (MCC) of the C4.5 classifier on the optimal features selected by the CGUFS algorithm reach 74.37% and 63.84%, respectively, significantly superior to the existing algorithms. (Conclusion) Similarly, the average ACC and MCC of the AdaBoost classifier on the optimal features selected by the CGUFS algorithm are significantly superior to those of the existing algorithms. In addition, statistical experiment results show significant differences between the CGUFS algorithm and the existing algorithms.
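
The three strategies named in the abstract (pseudo-labels from clustering with an automatically chosen k, grouping of redundant features, and filtering within each group) can be illustrated with a rough sketch. This is not the authors' CGUFS implementation: the abstract does not define the change function or the filtering rule, so the stopping rule for k (`tol`), the correlation `threshold`, the relevance score, and all function names below are assumptions chosen for brevity.

```python
import numpy as np

def kmeans(X, k, n_iter=50, seed=0):
    """Plain Lloyd's k-means; returns labels and within-cluster sum of squares."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        dists = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = dists.argmin(axis=1)
        # Keep the old center if a cluster happens to become empty.
        new_centers = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centers[j]
            for j in range(k)
        ])
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    dists = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    return dists.argmin(axis=1), dists.min(axis=1).sum()

def pseudo_labels(X, k_max=6, tol=0.3):
    """Assumed stand-in for the adaptive k-value strategy: grow k until the
    relative drop in within-cluster sum of squares falls below `tol`."""
    labels, inertia = kmeans(X, 1)
    for k in range(2, k_max + 1):
        new_labels, new_inertia = kmeans(X, k)
        if inertia - new_inertia < tol * inertia:
            break
        labels, inertia = new_labels, new_inertia
    return labels

def relevance(X, labels):
    """Per-feature share of variance explained by the pseudo-labels."""
    overall = X.mean(axis=0)
    between = np.zeros(X.shape[1])
    for c in np.unique(labels):
        Xc = X[labels == c]
        between += len(Xc) * (Xc.mean(axis=0) - overall) ** 2
    return between / (len(X) * (X.var(axis=0) + 1e-12))

def group_features(X, threshold=0.9):
    """Greedily group features whose absolute Pearson correlation with the
    first ungrouped feature is at least `threshold` (an assumed criterion)."""
    corr = np.abs(np.corrcoef(X, rowvar=False))
    unassigned = list(range(X.shape[1]))
    groups = []
    while unassigned:
        first = unassigned.pop(0)
        group = [first] + [j for j in unassigned if corr[first, j] >= threshold]
        unassigned = [j for j in unassigned if j not in group]
        groups.append(group)
    return groups

def select_features(X, threshold=0.9):
    """Keep the most relevant feature of each redundancy group."""
    scores = relevance(X, pseudo_labels(X))
    groups = group_features(X, threshold)
    return sorted(max(g, key=lambda j: scores[j]) for g in groups)
```

Called as `select_features(X)` on a samples-by-genes NumPy array, the sketch returns the column indices it retains. Keeping exactly one feature per group is a deliberate simplification; the paper's adaptive filtering strategy decides which feature combinations to retain within each group.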

Original language: English
Article number: 101731
Journal: Journal of King Saud University - Computer and Information Sciences
Volume: 35
Issue number: 9
Publication status: Published - Oct 2023
Externally published: Yes

Keywords

  • Clustering-guided
  • Gene expression data
  • Spectral clustering
  • Unsupervised feature selection
  • k-means
