How Does Sampling Affect the AI Prediction Accuracy of Peptides' Physicochemical Properties?

Meiru Yan, Ankeer Abuduhebaier, Haojin Zhou*, Jiaqi Wang*

*Corresponding author for this work

Research output: Contribution to journal (Article)

Abstract

Accurate AI prediction of peptide physicochemical properties is essential for advancing peptide-based biomedicine, biotechnology, and bioengineering. However, the performance of predictive AI models is significantly affected by the representativeness of the training data, which depends on the sample size and sampling methods employed. This study addresses the challenge of determining the optimal sample size and sampling methods to enhance the predictive accuracy and generalization capacity of AI models for estimating the aggregation propensity, hydrophilicity, and isoelectric point of tetrapeptides. Four sampling methods were evaluated: Latin Hypercube Sampling (LHS), Uniform Design Sampling (UDS), Simple Random Sampling (SRS), and Probability-Proportional-to-Size Sampling (PPS), across sample sizes ranging from 100 to 20,000. The findings reveal that LHS and UDS generally provide superior predictive accuracy, particularly at smaller sample sizes, while the impact of the sampling method diminishes as the sample size increases beyond 8000. Specifically, a sample size of approximately 12,000, representing 8% of the complete sequence space of tetrapeptides, was identified as a practical threshold for achieving reliable predictions. This study provides valuable insights into the interplay between sample size, sampling strategies, and model performance, offering a foundational framework for optimizing data collection and AI model training for the prediction of peptides' physicochemical properties, especially for prediction in the complete sequence space of longer peptides with more than four amino acids.
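
To make the sampling setup concrete, the following is a minimal Python sketch of two of the four methods named in the abstract (SRS and LHS) applied to the tetrapeptide sequence space. It is an illustration under stated assumptions, not the authors' implementation: it assumes the 20 standard amino acids (so the space holds 20^4 = 160,000 sequences, of which 12,000 is 7.5%, consistent with the roughly 8% quoted above), and it assumes each residue position is treated as one LHS dimension. The function names and the position-wise encoding are hypothetical; the abstract does not say how peptides were encoded for sampling.

```python
import random

# Assumption: the 20 standard amino acids, one-letter codes.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
LENGTH = 4  # tetrapeptides
SPACE_SIZE = len(AMINO_ACIDS) ** LENGTH  # 20**4 = 160,000 sequences


def index_to_peptide(index):
    """Decode an integer in [0, SPACE_SIZE) into a tetrapeptide string."""
    residues = []
    for _ in range(LENGTH):
        index, r = divmod(index, len(AMINO_ACIDS))
        residues.append(AMINO_ACIDS[r])
    return "".join(reversed(residues))


def simple_random_sample(n, rng):
    """SRS: draw n distinct sequences uniformly from the full space."""
    return [index_to_peptide(i) for i in rng.sample(range(SPACE_SIZE), n)]


def latin_hypercube_sample(n, rng):
    """LHS sketch: each residue position is one dimension.

    Per dimension, [0, 1) is split into n equal strata, one point is drawn
    in each stratum, and the strata are shuffled independently before the
    point is mapped onto an amino acid. The result may contain duplicate
    sequences; deduplication is left to the caller.
    """
    k = len(AMINO_ACIDS)
    columns = []
    for _ in range(LENGTH):
        points = [(s + rng.random()) / n for s in range(n)]
        rng.shuffle(points)
        # min() guards against a floating-point result of exactly 1.0.
        columns.append([AMINO_ACIDS[min(int(u * k), k - 1)] for u in points])
    return ["".join(residues) for residues in zip(*columns)]


if __name__ == "__main__":
    rng = random.Random(0)  # fixed seed so the draw is reproducible
    print(simple_random_sample(5, rng))
    print(latin_hypercube_sample(5, rng))
    print(f"12,000 / {SPACE_SIZE:,} = {12_000 / SPACE_SIZE:.1%}")
```

In this sketch, when the sample size is a multiple of 20, the LHS draw places every amino acid equally often at each position, whereas SRS only achieves that balance on average. That per-position balance is one plausible reason a stratified design could help most at small sample sizes, as the abstract reports.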
Original language: English
Journal: bioRxiv
Early online date: 2 Feb 2025
DOIs
Publication status: Published - 2 Feb 2025
