TY - JOUR
T1 - Domain-knowledge enabled ensemble learning of 5-formylcytosine (f5C) modification sites
AU - Huang, Jiaming
AU - Wang, Xuan
AU - Xia, Rong
AU - Yang, Dongqing
AU - Liu, Jian
AU - Lv, Qi
AU - Yu, Xiaoxuan
AU - Meng, Jia
AU - Chen, Kunqi
AU - Song, Bowen
AU - Wang, Yue
N1 - Publisher Copyright:
© 2024 The Authors
PY - 2024/12
Y1 - 2024/12
N2 - 5-formylcytidine (f5C) is a unique post-transcriptional RNA modification found in mRNA and tRNA at the wobble site, playing a crucial role in mitochondrial protein synthesis and potentially contributing to the regulation of translation. Recent studies have unveiled that the f5C modifications may drive mitochondrial mRNA translation to power cancer metastasis. Accurate identification of f5C sites is essential for further unraveling their molecular functions and regulatory mechanisms, but there are currently no computational methods available for predicting their locations. In this study, we introduce an innovative ensemble approach, successfully enabling the computational recognition of Saccharomyces cerevisiae f5C. We conducted a comprehensive model selection process that involved multiple basic machine learning and deep learning algorithms such as recurrent neural networks, convolutional neural networks and Transformer-based models. Initially trained only on sequence information, these individual models achieved an AUROC ranging from 0.7104 to 0.7492. Through the integration of 32 novel domain-derived genomic features, the performance of individual models has significantly improved to an AUROC between 0.7309 and 0.8076. To further enhance accuracy and robustness, we then constructed the ensembles of these individual models with different combinations. The best performance attained by our ensemble models reached an AUROC of 0.8391. Shapley additive explanations were conducted to explain the significant contributions of genomic features, providing insights into the putative distribution of f5C across various topological regions and potentially paving the way for revealing their functional relevance within distinct genomic contexts. A freely accessible web server that allows real-time analysis of user-uploaded sites can be accessed at: www.rnamd.org/Resf5C-Pred.
AB - 5-formylcytidine (f5C) is a unique post-transcriptional RNA modification found in mRNA and tRNA at the wobble site, playing a crucial role in mitochondrial protein synthesis and potentially contributing to the regulation of translation. Recent studies have unveiled that the f5C modifications may drive mitochondrial mRNA translation to power cancer metastasis. Accurate identification of f5C sites is essential for further unraveling their molecular functions and regulatory mechanisms, but there are currently no computational methods available for predicting their locations. In this study, we introduce an innovative ensemble approach, successfully enabling the computational recognition of Saccharomyces cerevisiae f5C. We conducted a comprehensive model selection process that involved multiple basic machine learning and deep learning algorithms such as recurrent neural networks, convolutional neural networks and Transformer-based models. Initially trained only on sequence information, these individual models achieved an AUROC ranging from 0.7104 to 0.7492. Through the integration of 32 novel domain-derived genomic features, the performance of individual models has significantly improved to an AUROC between 0.7309 and 0.8076. To further enhance accuracy and robustness, we then constructed the ensembles of these individual models with different combinations. The best performance attained by our ensemble models reached an AUROC of 0.8391. Shapley additive explanations were conducted to explain the significant contributions of genomic features, providing insights into the putative distribution of f5C across various topological regions and potentially paving the way for revealing their functional relevance within distinct genomic contexts. A freely accessible web server that allows real-time analysis of user-uploaded sites can be accessed at: www.rnamd.org/Resf5C-Pred.
KW - 5-formylcytidine
KW - Ensemble learning
KW - Epitranscriptomic marks
KW - Genomic features
KW - RNA modification
UR - http://www.scopus.com/inward/record.url?scp=85201666362&partnerID=8YFLogxK
U2 - 10.1016/j.csbj.2024.08.004
DO - 10.1016/j.csbj.2024.08.004
M3 - Article
AN - SCOPUS:85201666362
SN - 2001-0370
VL - 23
SP - 3175
EP - 3185
JO - Computational and Structural Biotechnology Journal
JF - Computational and Structural Biotechnology Journal
ER -