Clustering count-based RNA methylation data using a nonparametric generative model

Lin Zhang; Yanling He; Huaizhi Wang; Hui Liu; Yufei Huang; Xuesong Wang; Jia Meng

doi:10.2174/1574893613666180601080008

Lin Zhang, Yanling He, Huaizhi Wang, Hui Liu^*, Yufei Huang, Xuesong Wang, Jia Meng

^*Corresponding author for this work

Department of Biosciences and Bioinformatics

Research output: Contribution to journal › Article › peer-review

14 Citations (Scopus)

Abstract

Background: The RNA methylome has emerged as an important layer of gene regulation and can be profiled directly with count-based measurements from high-throughput sequencing data. Although the detailed regulatory circuitry of the epitranscriptome remains uncharted, clustering effects in methylation status among RNA methylation sites can be identified from transcriptome-wide RNA methylation profiles and may reflect epitranscriptomic regulation. Count-based RNA methylation sequencing data have distinctive features, such as low read coverage, that call for novel clustering approaches.
Objective: Clustering analysis of count-based RNA methylation sequencing data must cope with low read coverage and should also preserve the integer nature of the measurements.
Method: We propose a nonparametric generative model, together with its Gibbs sampling solution, for clustering analysis. The approach uses a beta-binomial mixture model to capture the clustering effect in methylation level from the original count-based measurements rather than from an estimated continuous methylation level. It also adopts a nonparametric Dirichlet process to determine an optimal number of clusters automatically, which avoids the model selection problem common in clustering analysis.
Results: On simulated data, the method showed improved clustering performance over hierarchical clustering, K-means, MClust, NMF and EMclust. On a real dataset, it revealed two novel RNA N6-methyladenosine (m⁶A) co-methylation patterns that may be induced directly by METTL14 and WTAP, two known regulatory components of the RNA m⁶A methyltransferase complex.
Conclusion: The proposed DPBBM method properly handles count-based measurements of RNA methylation data from sites with very low read coverage, and it learns an optimal number of clusters adaptively from the data analyzed.
Availability: The source code and documentation of the DPBBM R package are freely available through the Comprehensive R Archive Network (CRAN): https://cran.r-project.org/web/packages/DPBBM/.
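
The Method paragraph can be made more concrete with a small illustration. The following is a minimal Python sketch, not the DPBBM package's code, of one ingredient of a Gibbs sweep for a Dirichlet process beta-binomial mixture: the probability that a single methylation site joins each existing cluster or opens a new one. All function names, the parameter values, and the choice to represent the new cluster by one candidate (alpha, beta) pair are assumptions made for illustration; the paper's actual sampler may differ.

```python
import math


def log_beta(a, b):
    """Log of the Beta function B(a, b)."""
    return math.lgamma(a) + math.lgamma(b) - math.lgamma(a + b)


def betabinom_logpmf(k, n, a, b):
    """Log-probability of k methylated reads out of n under Beta-Binomial(a, b)."""
    log_choose = math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)
    return log_choose + log_beta(k + a, n - k + b) - log_beta(a, b)


def site_loglik(ks, ns, a, b):
    """Log-likelihood of one site observed in several samples (assumed independent)."""
    return sum(betabinom_logpmf(k, n, a, b) for k, n in zip(ks, ns))


def assignment_probs(ks, ns, clusters, sizes, new_params, concentration):
    """Normalized weights for joining each existing cluster or a new one.

    Existing clusters are weighted by their size (Chinese restaurant process),
    the candidate new cluster by the concentration parameter.
    """
    log_w = [
        math.log(size) + site_loglik(ks, ns, a, b)
        for (a, b), size in zip(clusters, sizes)
    ]
    a_new, b_new = new_params  # hypothetical candidate drawn from a base prior
    log_w.append(math.log(concentration) + site_loglik(ks, ns, a_new, b_new))
    shift = max(log_w)  # subtract the maximum for numerical stability
    weights = [math.exp(x - shift) for x in log_w]
    total = sum(weights)
    return [w / total for w in weights]


# A weakly methylated site with low coverage in two samples (made-up counts).
probs = assignment_probs(
    ks=[1, 0], ns=[10, 8],
    clusters=[(2.0, 8.0), (8.0, 2.0)],  # low- and high-methylation clusters
    sizes=[5, 5],
    new_params=(1.0, 1.0),
    concentration=1.0,
)
print(probs)
```

Because the likelihood is evaluated on the raw counts, a site with few reads contributes a flatter likelihood and is therefore pulled less strongly toward any one cluster than a deeply covered site with the same methylation ratio. In a full sampler, a cluster label would be drawn from these weights for every site in turn, and the cluster parameters would then be updated.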

Original language	English
Pages (from-to)	11-23
Number of pages	13
Journal	Current Bioinformatics
Volume	14
Issue number	1
DOIs	https://doi.org/10.2174/1574893613666180601080008
Publication status	Published - 2019

Keywords

Beta-binomial mixture
Clustering
Dirichlet process
Epitranscriptome
m⁶A-seq
RNA methylation

Cite this

@article{1f9ecc2d81e245ed8f9dc58cea3058f6,

title = "Clustering count-based RNA methylation data using a nonparametric generative model",

keywords = "Beta-binomial mixture, Clustering, Dirichlet process, Epitranscriptome, m6A-seq, RNA methylation",

author = "Lin Zhang and Yanling He and Huaizhi Wang and Hui Liu and Yufei Huang and Xuesong Wang and Jia Meng",

note = "Publisher Copyright: {\textcopyright} 2019 Bentham Science Publishers.",

year = "2019",

doi = "10.2174/1574893613666180601080008",

language = "English",

volume = "14",

pages = "11--23",

journal = "Current Bioinformatics",

issn = "1574-8936",

number = "1",

}

TY - JOUR

T1 - Clustering count-based RNA methylation data using a nonparametric generative model

AU - Zhang, Lin

AU - He, Yanling

AU - Wang, Huaizhi

AU - Liu, Hui

AU - Huang, Yufei

AU - Wang, Xuesong

AU - Meng, Jia

PY - 2019

Y1 - 2019

KW - Beta-binomial mixture

KW - Clustering

KW - Dirichlet process

KW - Epitranscriptome

KW - m6A-seq

KW - RNA methylation

UR - http://www.scopus.com/inward/record.url?scp=85061733359&partnerID=8YFLogxK

U2 - 10.2174/1574893613666180601080008

DO - 10.2174/1574893613666180601080008

M3 - Article

AN - SCOPUS:85061733359

SN - 1574-8936

VL - 14

SP - 11

EP - 23

JO - Current Bioinformatics

JF - Current Bioinformatics

IS - 1

ER -
