TY - JOUR
T1 - How Does Sampling Affect the AI Prediction Accuracy of Peptides' Physicochemical Properties?
AU - Yan, Meiru
AU - Abuduhebaier, Ankeer
AU - Zhou, Haojin
AU - Wang, Jiaqi
PY - 2025/02/02
Y1 - 2025/02/02
N2 - Accurate AI prediction of peptide physicochemical properties is essential for advancing peptide-based biomedicine, biotechnology, and bioengineering. However, the performance of predictive AI models is significantly affected by the representativeness of the training data, which depends on the sample size and sampling methods employed. This study addresses the challenge of determining the optimal sample size and sampling methods to enhance the predictive accuracy and generalization capacity of AI models for estimating the aggregation propensity, hydrophilicity, and isoelectric point of tetrapeptides. Four sampling methods were evaluated: Latin Hypercube Sampling (LHS), Uniform Design Sampling (UDS), Simple Random Sampling (SRS), and Probability-Proportional-to-Size Sampling (PPS), across sample sizes ranging from 100 to 20,000. The findings reveal that LHS and UDS generally provide superior predictive accuracy, particularly at smaller sample sizes, while the impact of the sampling method diminishes as the sample size increases beyond 8,000. Specifically, a sample size of approximately 12,000, representing 8% of the complete sequence space of tetrapeptides, was identified as a practical threshold for achieving reliable predictions. This study provides valuable insights into the interplay between sample size, sampling strategies, and model performance, offering a foundational framework for optimizing data collection and AI model training for the prediction of peptides' physicochemical properties, especially for prediction in the complete sequence space of longer peptides with more than four amino acids.
AB - Accurate AI prediction of peptide physicochemical properties is essential for advancing peptide-based biomedicine, biotechnology, and bioengineering. However, the performance of predictive AI models is significantly affected by the representativeness of the training data, which depends on the sample size and sampling methods employed. This study addresses the challenge of determining the optimal sample size and sampling methods to enhance the predictive accuracy and generalization capacity of AI models for estimating the aggregation propensity, hydrophilicity, and isoelectric point of tetrapeptides. Four sampling methods were evaluated: Latin Hypercube Sampling (LHS), Uniform Design Sampling (UDS), Simple Random Sampling (SRS), and Probability-Proportional-to-Size Sampling (PPS), across sample sizes ranging from 100 to 20,000. The findings reveal that LHS and UDS generally provide superior predictive accuracy, particularly at smaller sample sizes, while the impact of the sampling method diminishes as the sample size increases beyond 8,000. Specifically, a sample size of approximately 12,000, representing 8% of the complete sequence space of tetrapeptides, was identified as a practical threshold for achieving reliable predictions. This study provides valuable insights into the interplay between sample size, sampling strategies, and model performance, offering a foundational framework for optimizing data collection and AI model training for the prediction of peptides' physicochemical properties, especially for prediction in the complete sequence space of longer peptides with more than four amino acids.
U2 - 10.1101/2025.01.29.635451
DO - 10.1101/2025.01.29.635451
M3 - Article
JO - bioRxiv
JF - bioRxiv
ER -