TY - JOUR
T1 - Benchmarking Sequence-Based Compound–Protein Interaction Prediction through Constructing a Debiased Data Set CDPN
AU - Hao, Yang
AU - Li, Bo
AU - Huang, Daiyun
AU - Fu, Lei
AU - Cao, Zhiwei
AU - Liu, Xin
N1 - Publisher Copyright: © 2025 The Authors. Published by American Chemical Society
PY - 2025/12/8
Y1 - 2025/12/8
N2 - Accurate prediction of compound–protein interactions (CPIs) is critical for drug discovery, but existing data sets often suffer from biases that hinder model generalization. Here, we first highlight that over-represented molecular scaffolds and imbalanced label distributions can lead to machine learning shortcuts. While existing debiasing approaches often compromise data set diversity, we present Clustering-based Down-sampling and Putative Negatives (CDPN), a novel protocol for constructing a debiased CPI benchmark. CDPN mitigates biases through compound Cluster-level Down-sampling and generates Putative Negatives from unexplored chemical spaces, ensuring balanced label distributions. Using CDPN, we systematically benchmark deep learning-based CPI models, with a particular focus on protein language models. Although systematic evaluation on PDBbind reveals critical limitations in attention interpretability, thorough ablation studies on the CDPN data set identify superior models such as KPGT-Ankh, which exhibits enhanced generalization and virtual screening performance. The top-performing models from the benchmark were also integrated into DeepSEQreen, a no-code web server designed to facilitate community feedback and broader accessibility.
AB - Accurate prediction of compound–protein interactions (CPIs) is critical for drug discovery, but existing data sets often suffer from biases that hinder model generalization. Here, we first highlight that over-represented molecular scaffolds and imbalanced label distributions can lead to machine learning shortcuts. While existing debiasing approaches often compromise data set diversity, we present Clustering-based Down-sampling and Putative Negatives (CDPN), a novel protocol for constructing a debiased CPI benchmark. CDPN mitigates biases through compound Cluster-level Down-sampling and generates Putative Negatives from unexplored chemical spaces, ensuring balanced label distributions. Using CDPN, we systematically benchmark deep learning-based CPI models, with a particular focus on protein language models. Although systematic evaluation on PDBbind reveals critical limitations in attention interpretability, thorough ablation studies on the CDPN data set identify superior models such as KPGT-Ankh, which exhibits enhanced generalization and virtual screening performance. The top-performing models from the benchmark were also integrated into DeepSEQreen, a no-code web server designed to facilitate community feedback and broader accessibility.
UR - https://www.scopus.com/pages/publications/105024253184
U2 - 10.1021/acs.jcim.5c02040
DO - 10.1021/acs.jcim.5c02040
M3 - Article
C2 - 41264813
AN - SCOPUS:105024253184
SN - 1549-9596
VL - 65
SP - 12737
EP - 12751
JO - Journal of Chemical Information and Modeling
JF - Journal of Chemical Information and Modeling
IS - 23
ER -