Benchmarking Sequence-Based Compound–Protein Interaction Prediction through Constructing a Debiased Data Set CDPN

Research output: Contribution to journal › Article › peer-review

Abstract

Accurate prediction of compound–protein interactions (CPIs) is critical for drug discovery, but existing data sets often suffer from biases that hinder model generalization. Here, we first highlight that over-represented molecular scaffolds and imbalanced label distributions can lead to machine learning shortcuts. While existing debiasing approaches often compromise data set diversity, we present Clustering-based Down-sampling and Putative Negatives (CDPN), a novel protocol for constructing a debiased CPI benchmark. CDPN mitigates biases through compound Cluster-level Down-sampling and generates Putative Negatives from unexplored chemical spaces, ensuring balanced label distributions. Using CDPN, we systematically benchmark deep learning-based CPI models, with a particular focus on protein language models. Although systematic evaluation on PDBbind reveals critical limitations in attention interpretability, thorough ablation studies on the CDPN data set identify superior models such as KPGT-Ankh, which exhibits enhanced generalization and virtual screening performance. The top-performing models from the benchmark were also integrated into DeepSEQreen, a no-code web server designed to facilitate community feedback and broader accessibility.
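
The general idea of cluster-level down-sampling mentioned in the abstract can be illustrated with a minimal sketch. This is not the CDPN protocol itself: the function name `downsample_by_cluster`, the per-cluster cap, and the toy cluster labels are all hypothetical, and the cluster assignments are assumed to come from an upstream step such as scaffold grouping.

```python
import random
from collections import defaultdict


def downsample_by_cluster(records, max_per_cluster, seed=0):
    """Keep at most `max_per_cluster` compounds from each cluster.

    `records` is a list of (compound_id, cluster_id) pairs. How clusters
    are defined is an assumption left to the caller; this sketch only
    caps the size of each cluster so no single group dominates.
    """
    rng = random.Random(seed)  # fixed seed for reproducible sampling

    # Group compound IDs by their cluster label.
    by_cluster = defaultdict(list)
    for compound_id, cluster_id in records:
        by_cluster[cluster_id].append(compound_id)

    kept = []
    for cluster_id in sorted(by_cluster):
        members = by_cluster[cluster_id]
        # Only over-represented clusters are sampled down;
        # small clusters are kept whole to preserve diversity.
        if len(members) > max_per_cluster:
            members = rng.sample(members, max_per_cluster)
        kept.extend((m, cluster_id) for m in members)
    return kept


# Toy data: cluster "A" is over-represented relative to "B" and "C".
records = [
    ("c1", "A"), ("c2", "A"), ("c3", "A"), ("c4", "A"),
    ("c5", "B"), ("c6", "C"),
]
balanced = downsample_by_cluster(records, max_per_cluster=2)
```

In this toy case the four compounds in cluster "A" are reduced to two, while the singleton clusters "B" and "C" are left untouched, which is the sense in which capping per cluster reduces over-representation without discarding rare groups.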

Original language: English
Pages (from-to): 12737-12751
Number of pages: 15
Journal: Journal of Chemical Information and Modeling
Volume: 65
Issue number: 23
Publication status: Published - 8 Dec 2025
