How Does Sampling Affect the AI Prediction Accuracy of Peptides' Physicochemical Properties?

Meiru Yan; Ankeer Abuduhebaier; Haojin Zhou; Jiaqi Wang

doi:10.1101/2025.01.29.635451

Meiru Yan, Ankeer Abuduhebaier, Haojin Zhou^*, Jiaqi Wang^*

^*Corresponding author for this work

AoPHA Faculty

Research output: Contribution to journal › Article

Abstract

Accurate AI prediction of peptide physicochemical properties is essential for advancing peptide-based biomedicine, biotechnology, and bioengineering. However, the performance of predictive AI models is significantly affected by the representativeness of the training data, which depends on the sample size and sampling methods employed. This study addresses the challenge of determining the optimal sample size and sampling methods to enhance the predictive accuracy and generalization capacity of AI models for estimating the aggregation propensity, hydrophilicity, and isoelectric point of tetrapeptides. Four sampling methods were evaluated: Latin Hypercube Sampling (LHS), Uniform Design Sampling (UDS), Simple Random Sampling (SRS), and Probability-Proportional-to-Size Sampling (PPS), across sample sizes ranging from 100 to 20,000. The findings reveal that LHS and UDS generally provide superior predictive accuracy, particularly at smaller sample sizes, while the impact of the sampling method diminishes as the sample size increases beyond 8,000. Specifically, a sample size of approximately 12,000, representing 8% of the complete sequence space of tetrapeptides, was identified as a practical threshold for achieving reliable predictions. This study provides valuable insights into the interplay between sample size, sampling strategies, and model performance, offering a foundational framework for optimizing data collection and AI model training for the prediction of peptides' physicochemical properties, especially for prediction in the complete sequence space of longer peptides with more than four amino acids.
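
As a rough illustration of the quantities in the abstract, the sketch below enumerates the tetrapeptide sequence space and draws two kinds of samples from it. It assumes the 20 standard amino acids, which gives 20^4 = 160,000 sequences, so 12,000 samples is 7.5% of the space, in line with the approximately 8% quoted above. The function names and the discretized Latin-hypercube-style design are illustrative assumptions, not the authors' implementation; UDS and PPS are not shown.

```python
import itertools
import random

# Assumption: the 20 standard amino acids, one-letter codes.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def sequence_space(length=4):
    """Enumerate every peptide of the given length (20**length sequences)."""
    return ["".join(p) for p in itertools.product(AMINO_ACIDS, repeat=length)]


def simple_random_sample(space, n, seed=0):
    """Simple random sampling: n distinct sequences, each equally likely."""
    rng = random.Random(seed)
    return rng.sample(space, n)


def latin_hypercube_sample(n, length=4, seed=0):
    """Discretized Latin-hypercube-style design (hypothetical helper).

    Each position is treated as one dimension. The n strata of that
    dimension are mapped evenly onto the residues and then shuffled,
    so every residue is used (almost) equally often at every position.
    Unlike simple random sampling, duplicates are possible.
    """
    rng = random.Random(seed)
    k = len(AMINO_ACIDS)
    columns = []
    for _ in range(length):
        levels = [(i * k) // n for i in range(n)]  # stratum -> residue index
        rng.shuffle(levels)                        # independent order per position
        columns.append([AMINO_ACIDS[j] for j in levels])
    return ["".join(residues) for residues in zip(*columns)]


space = sequence_space()
fraction_percent = 12_000 / len(space) * 100  # share of the space covered by 12,000 samples
srs = simple_random_sample(space, 100)
lhs = latin_hypercube_sample(100)
```

With a sample size that is a multiple of 20, the Latin-hypercube-style draw places every residue the same number of times at each position, whereas a simple random sample of the same size only achieves that balance on average. This kind of per-position coverage is one plausible reason a stratified design could help at small sample sizes.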

Original language	English
Journal	bioRxiv
Early online date	2 Feb 2025
DOIs	https://doi.org/10.1101/2025.01.29.635451
Publication status	Published - 2 Feb 2025

Cite this

@article{ea18030f7e51434a986fd7474cbbbe2e,

title = "How Does Sampling Affect the AI Prediction Accuracy of Peptides' Physicochemical Properties?",

abstract = "Accurate AI prediction of peptide physicochemical properties is essential for advancing peptide-based biomedicine, biotechnology, and bioengineering. However, the performance of predictive AI models is significantly affected by the representativeness of the training data, which depends on the sample size and sampling methods employed. This study addresses the challenge of determining the optimal sample size and sampling methods to enhance the predictive accuracy and generalization capacity of AI models for estimating the aggregation propensity, hydrophilicity, and isoelectric point of tetrapeptides. Four sampling methods were evaluated: Latin Hypercube Sampling (LHS), Uniform Design Sampling (UDS), Simple Random Sampling (SRS), and Probability-Proportional-to-Size Sampling (PPS), across sample sizes ranging from 100 to 20,000. The findings reveal that LHS and UDS generally provide superior predictive accuracy, particularly at smaller sample sizes, while the impact of the sampling method diminishes as the sample size increases beyond 8,000. Specifically, a sample size of approximately 12,000, representing 8% of the complete sequence space of tetrapeptides, was identified as a practical threshold for achieving reliable predictions. This study provides valuable insights into the interplay between sample size, sampling strategies, and model performance, offering a foundational framework for optimizing data collection and AI model training for the prediction of peptides' physicochemical properties, especially for prediction in the complete sequence space of longer peptides with more than four amino acids.",

author = "Meiru Yan and Ankeer Abuduhebaier and Haojin Zhou and Jiaqi Wang",

year = "2025",

month = feb,

day = "2",

doi = "10.1101/2025.01.29.635451",