Bayesian inference for genomic data integration reduces misclassification rate in predicting protein-protein interactions.
Protein-protein interactions (PPIs) are essential to most fundamental cellular processes, and there has been increasing interest in reconstructing PPI networks. However, several critical difficulties stand in the way of reliable predictions. Notably, false positive rates can exceed 80%. Correcting errors separately for each generating source can be both time-consuming and inefficient, because a single test can hardly cover errors arising from multiple levels of data processing. We propose a novel Bayesian integration method, termed nonparametric Bayes ensemble learning (NBEL), that lowers the misclassification rate (both false positives and false negatives) by automatically up-weighting the most informative data sources while down-weighting less informative and biased ones. Extensive studies indicate that NBEL is significantly more robust than classic naïve Bayes to unreliable, error-prone and contaminated data. On a large human data set, our NBEL approach predicts many more PPIs than naïve Bayes, which suggests that previous studies may contain large numbers of not only false positives but also false negatives. Validation on two high-quality human PPI datasets supports our observations. Our experiments demonstrate that it is feasible to predict PPIs computationally at high throughput with substantially reduced false positives and false negatives. The ability to predict large numbers of PPIs both reliably and automatically may encourage the use of computational approaches to correct data errors in general, and may speed up high-quality PPI prediction. Such reliable predictions may provide a solid platform for other studies, such as protein function prediction and the roles of PPIs in disease susceptibility.
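The abstract contrasts classic naïve Bayes with an approach that up-weights informative sources and down-weights unreliable ones. The sketch below is not NBEL, since the abstract does not specify the model; it only illustrates the general idea, assuming binary evidence sources and a simple weighted naïve Bayes in which each source's log-likelihood ratio is scaled by a weight. The function names, the example sensitivities and false positive rates, the prior, and the weights are all hypothetical.

```python
import math


def log_likelihood_ratio(evidence, sensitivity, false_positive_rate):
    """Log-likelihood ratio contributed by one binary source for one protein pair."""
    if evidence:
        return math.log(sensitivity / false_positive_rate)
    return math.log((1.0 - sensitivity) / (1.0 - false_positive_rate))


def posterior_interaction(evidence, sources, prior, weights=None):
    """Posterior probability that a protein pair interacts.

    evidence: list of booleans, one per source.
    sources:  list of (sensitivity, false_positive_rate), one per source.
    weights:  per-source weights; None means classic naive Bayes (all 1.0).
    """
    if weights is None:
        weights = [1.0] * len(sources)
    # Start from the prior log-odds and add each weighted log-likelihood ratio.
    log_odds = math.log(prior / (1.0 - prior))
    for observed, (sens, fpr), weight in zip(evidence, sources, weights):
        log_odds += weight * log_likelihood_ratio(observed, sens, fpr)
    return 1.0 / (1.0 + math.exp(-log_odds))


# Hypothetical example: a reliable source reports no interaction,
# while a noisy source reports one.
sources = [(0.8, 0.05), (0.6, 0.4)]
evidence = [False, True]
prior = 0.01

naive = posterior_interaction(evidence, sources, prior)
weighted = posterior_interaction(evidence, sources, prior, weights=[1.0, 0.2])
print(f"naive Bayes posterior:    {naive:.5f}")
print(f"weighted Bayes posterior: {weighted:.5f}")
```

In this toy setting, shrinking the noisy source's weight reduces the influence of its positive call on the posterior. Here the weights are fixed by hand; per the abstract, NBEL instead determines the weighting automatically from the data.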
Published Version (Please cite this version)
Xing, Chuanhua, and David B Dunson (2011). Bayesian inference for genomic data integration reduces misclassification rate in predicting protein-protein interactions. PLoS Comput Biol, 7(7). p. e1002110. 10.1371/journal.pcbi.1002110 Retrieved from https://hdl.handle.net/10161/15602.
This citation is constructed from limited available data and may be imprecise. To cite this article, please review and use the official citation provided by the journal.
My research focuses on developing new tools for probabilistic learning from complex data; methods development is directly motivated by challenging applications in ecology/biodiversity, neuroscience, environmental health, criminal justice/fairness, and more. We seek to develop new modeling frameworks, algorithms and corresponding code that can be used routinely by scientists and decision makers. We are also interested in new inference frameworks and in studying the theoretical properties of the methods we develop.
Some highlight application areas:
(1) Modeling of biological communities and biodiversity - we are considering global data on fungi, insects, birds and animals including DNA sequences, images, audio, etc. Data contain large numbers of species unknown to science and we would like to learn about these new species, community network structure, and the impact of environmental change and climate.
(2) Brain connectomics - based on high resolution imaging data of the human brain, we are seeking to develop new statistical and machine learning models for relating brain networks to human traits and diseases.
(3) Environmental health & mixtures - we are building tools for relating chemical and other exposures (air pollution etc) to human health outcomes, accounting for spatial dependence in both exposures and disease. This includes an emphasis on infectious disease modeling, such as COVID-19.
Some statistical areas that play a prominent role in our methods development include models for low-dimensional structure in data (latent factors, clustering, geometric and manifold learning), flexible/nonparametric models (neural networks, Gaussian/spatial processes, other stochastic processes), Bayesian inference frameworks, efficient sampling and analytic approximation algorithms, and models for "object data" (trees, networks, images, spatial processes, etc).
Unless otherwise indicated, scholarly articles published by Duke faculty members are made available here with a CC-BY-NC (Creative Commons Attribution Non-Commercial) license, as enabled by the Duke Open Access Policy. If you wish to use the materials in ways not already permitted under CC-BY-NC, please consult the copyright owner. Other materials are made available here through the author’s grant of a non-exclusive license to make their work openly accessible.