TY - GEN
T1 - Differentiating amino acids from nanopore sequencing
AU - Zhang, Jiahao
AU - Meng, Jia
AU - Zhang, Yuxin
N1 - Publisher Copyright: © 2024 Copyright held by the owner/author(s).
PY - 2024/11/18
Y1 - 2024/11/18
N2 - Nanopore sequencing of amino acids is a significant advance in molecular biology, biochemistry, and medical diagnostics. Its high sensitivity, specificity, and real-time analytic capability make it essential for accurately identifying amino acids. A new nanopore, Msp-NTA-Ni, has recently pushed the boundaries by allowing accurate differentiation of all 20 proteinogenic amino acids and their post-translational modifications (PTMs). Using the data produced by this nanopore, we thoroughly examined five features and identified the most useful pairs for classification. We then trained, fine-tuned, and compared multiple machine learning models, including Random Forest, CatBoost, and SVM. The results indicate that the Random Forest model surpasses the current benchmarks, achieving a validation accuracy of 99.04%. Moreover, our research highlights the importance of particular feature combinations, such as the mean and standard deviation, in improving model performance, despite some limitations in differentiating between certain pairs of amino acids.
AB - Nanopore sequencing of amino acids is a significant advance in molecular biology, biochemistry, and medical diagnostics. Its high sensitivity, specificity, and real-time analytic capability make it essential for accurately identifying amino acids. A new nanopore, Msp-NTA-Ni, has recently pushed the boundaries by allowing accurate differentiation of all 20 proteinogenic amino acids and their post-translational modifications (PTMs). Using the data produced by this nanopore, we thoroughly examined five features and identified the most useful pairs for classification. We then trained, fine-tuned, and compared multiple machine learning models, including Random Forest, CatBoost, and SVM. The results indicate that the Random Forest model surpasses the current benchmarks, achieving a validation accuracy of 99.04%. Moreover, our research highlights the importance of particular feature combinations, such as the mean and standard deviation, in improving model performance, despite some limitations in differentiating between certain pairs of amino acids.
UR - http://www.scopus.com/inward/record.url?scp=85212871239&partnerID=8YFLogxK
U2 - 10.1145/3674658.3674663
DO - 10.1145/3674658.3674663
M3 - Conference Proceeding
AN - SCOPUS:85212871239
T3 - ACM International Conference Proceeding Series
SP - 25
EP - 30
BT - ICBBT 2024 - Proceedings of the 2024 16th International Conference on Bioinformatics and Biomedical Technology
PB - Association for Computing Machinery
T2 - 16th International Conference on Bioinformatics and Biomedical Technology, ICBBT 2024
Y2 - 24 May 2024 through 26 May 2024
ER -