TY - JOUR
T1 - 4acCPred
T2 - Weakly supervised prediction of N4-acetyldeoxycytosine DNA modification from sequences
AU - Zhou, Jingxian
AU - Wang, Xuan
AU - Wei, Zhen
AU - Meng, Jia
AU - Huang, Daiyun
N1 - Publisher Copyright: © 2022 The Authors
PY - 2022/12/13
Y1 - 2022/12/13
AB - DNA methylation is one of the earliest extensively studied mechanisms of epigenetic regulation, and it is critical for normal development, disease, and gene expression. N4-acetyldeoxycytosine (4acC), a recently identified chemical modification of DNA, was shown to be abundant in Arabidopsis and strongly associated with gene expression and actively transcribed genes. Precise identification of 4acC is essential for studying its biological function. We propose 4acCPred, the first computational framework for predicting 4acC-carrying regions from Arabidopsis genomic DNA sequences. Because existing 4acC data do not resolve individual bases and only report regions that are hundreds of bases long, we formulated the task as a weakly supervised learning problem and built 4acCPred using a multi-instance-based deep neural network. Both cross-validation and independent testing on four datasets under different conditions show promising performance, with mean areas under the receiver operating characteristic curve (AUCs) of 0.9877 and 0.9899, respectively. 4acCPred also provides motif mining through model interpretation. The motifs found by 4acCPred are consistent with existing knowledge, indicating that the model captured real biological signals. In addition, a user-friendly web server was built to facilitate 4acC prediction, motif visualization, and data access. Our framework and web server should serve as useful tools for 4acC research.
KW - DNA modification
KW - MT: Bioinformatics
KW - N4-acetyldeoxycytosine
KW - deep neural network
KW - multiple-instance learning
KW - sequence motif
KW - weakly supervised learning
UR - http://www.scopus.com/inward/record.url?scp=85140931721&partnerID=8YFLogxK
U2 - 10.1016/j.omtn.2022.10.004
DO - 10.1016/j.omtn.2022.10.004
M3 - Article
AN - SCOPUS:85140931721
SN - 2162-2531
VL - 30
SP - 337
EP - 345
JO - Molecular Therapy - Nucleic Acids
JF - Molecular Therapy - Nucleic Acids
ER -