# Browsing by Author "Mukherjee, Sayan"

Item Open Access A Case for Quantifying Statistical Robustness of Specialized Probabilistic AI Accelerators (2019 IBM IEEE CAS/EDS – AI Compute Symposium, 2019-10-02) Zhang, Xiangyu; Mukherjee, Sayan; Lebeck, Alvin

Item Open Access A Digital Network Approach to Infer Sex Behavior in Emerging HIV Epidemics (PLOS ONE, 2014-07-03) Kapur, Abhinav; Schneider, John A; Heard, Daniel; Mukherjee, Sayan; Schumm, Phil; Oruganti, Ganesh; Laumann, Edward O

Item Open Access A Geometric Approach for Inference on Graphical Models (2009) Lunagomez, Simon

We formulate a novel approach to infer conditional independence models or Markov structure of a multivariate distribution. Specifically, our objective is to place informative prior distributions over graphs (decomposable and unrestricted) and sample efficiently from the induced posterior distribution. We also explore the idea of factorizing according to complete sets of a graph, which implies working with a hypergraph that cannot be retrieved from the graph alone. The key idea we develop in this paper is a parametrization of hypergraphs using the geometry of points in $R^m$. This induces informative priors on graphs from specified priors on finite sets of points. Constructing hypergraphs from finite point sets has been well studied in the fields of computational topology and random geometric graphs. We develop the framework underlying this idea and illustrate its efficacy using simulations.

Item Open Access A Large Deviation Approach to Posterior Consistency in Dynamical Systems Su, Langxuan; Mukherjee, Sayan

In this paper, we provide asymptotic results concerning (generalized) Bayesian inference for certain dynamical systems based on a large deviation approach.
Given a sequence of observations $y$, a class of model processes parameterized by $\theta \in \Theta$ which can be characterized as a stochastic process $X^\theta$ or a measure $\mu_\theta$, and a loss function $L$ which measures the error between $y$ and a realization of $X^\theta$, we specify the generalized posterior distribution $\pi_t(\theta \mid y)$. The goal of this paper is to study the asymptotic behavior of $\pi_t(\theta \mid y)$ as $t \to \infty$. In particular, we state conditions on the model family $\{\mu_\theta\}_{\theta \in \Theta}$ and the loss function $L$ such that the posterior distribution converges. The two conditions we require are: (1) a conditional large deviation behavior for a single $X^\theta$, and (2) an exponential continuity condition over the model family for the map from the parameter $\theta$ to the loss incurred between $X^\theta$ and the observation sequence $y$. The proposed framework is quite general; we apply it to two very different classes of dynamical systems: continuous time hypermixing processes and Gibbs processes on shifts of finite type. We also show that the generalized posterior distribution concentrates asymptotically on those parameters that minimize the expected loss and a divergence term, hence proving posterior consistency.

Item Open Access A new fully automated approach for aligning and comparing shapes. (Anatomical record (Hoboken, N.J. : 2007), 2015-01) Boyer, Doug M; Puente, Jesus; Gladman, Justin T; Glynn, Chris; Mukherjee, Sayan; Yapuncich, Gabriel S; Daubechies, Ingrid

Three-dimensional geometric morphometric (3DGM) methods for placing landmarks on digitized bones have become increasingly sophisticated in the last 20 years, including greater degrees of automation. One aspect shared by all 3DGM methods is that the researcher must designate initial landmarks. Thus, researcher interpretations of homology and correspondence are required for and influence representations of shape.
We present an algorithm allowing fully automatic placement of correspondence points on samples of 3D digital models representing bones of different individuals/species, which can then be input into standard 3DGM software and analyzed with dimension reduction techniques. We test this algorithm against several samples, primarily a dataset of 106 primate calcanei represented by 1,024 correspondence points per bone. Results of our automated analysis of these samples are compared to a published study using a traditional 3DGM approach with 27 landmarks on each bone. Data were analyzed with morphologika(2.5) and PAST. Our analyses returned strong correlations between principal component scores, similar variance partitioning among components, and similarities between the shape spaces generated by the automatic and traditional methods. While cluster analyses of both automatically generated and traditional datasets produced broadly similar patterns, there were also differences. Overall these results suggest to us that automatic quantifications can lead to shape spaces that are as meaningful as those based on observer landmarks, thereby presenting potential to save time in data collection, increase completeness of morphological quantification, eliminate observer error, and allow comparisons of shape diversity between different types of bones. We provide an R package for implementing this analysis.

Item Open Access A phylogenetic transform enhances analysis of compositional microbiota data. (Elife, 2017-02-15) Silverman, Justin D; Washburne, Alex D; Mukherjee, Sayan; David, Lawrence A

Surveys of microbial communities (microbiota), typically measured as relative abundance of species, have illustrated the importance of these communities in human health and disease. Yet, statistical artifacts commonly plague the analysis of relative abundance data.
Here, we introduce the PhILR transform, which incorporates microbial evolutionary models with the isometric log-ratio transform to allow off-the-shelf statistical tools to be safely applied to microbiota surveys. We demonstrate that analyses of community-level structure can be applied to PhILR transformed data with performance on benchmarks rivaling or surpassing standard tools. Additionally, by decomposing distance in the PhILR transformed space, we identified neighboring clades that may have adapted to distinct human body sites. Decomposing variance revealed that covariation of bacterial clades within human body sites increases with phylogenetic relatedness. Together, these findings illustrate how the PhILR transform combines statistical and phylogenetic models to overcome compositional data challenges and enable evolutionary insights relevant to microbial communities.

Item Open Access A Theory of Statistical Inference for Ensuring the Robustness of Scientific Results (2018) Coker, Beau

Inference is the process of using facts we know to learn about facts we do not know. A theory of inference gives assumptions necessary to get from the former to the latter, along with a definition for and summary of the resulting uncertainty. Any one theory of inference is neither right nor wrong, but merely an axiom that may or may not be useful. Each of the many diverse theories of inference can be valuable for certain applications. However, no existing theory of inference addresses the tendency to choose, from the range of plausible data analysis specifications consistent with prior evidence, those that inadvertently favor one's own hypotheses. Since the biases from these choices are a growing concern across scientific fields, and in a sense the reason the scientific community was invented in the first place, we introduce a new theory of inference designed to address this critical problem.
From this theory, we derive "hacking intervals," which are the range of a summary statistic one may obtain given a class of possible endogenous manipulations of the data. They make no appeal to hypothetical data sets drawn from imaginary superpopulations. A scientific result with a small hacking interval is more robust to researcher manipulation than one with a larger interval, and is often easier to interpret than a classic confidence interval. Hacking intervals turn out to be equivalent to classical confidence intervals under the linear regression model, and are equivalent to profile likelihood confidence intervals under certain other conditions, which means they may sometimes provide a more intuitive and potentially more useful interpretation of classical intervals.

Item Open Access A unifying framework for interpreting and predicting mutualistic systems. (Nature communications, 2019-01) Wu, Feilun; Lopatkin, Allison J; Needs, Daniel A; Lee, Charlotte T; Mukherjee, Sayan; You, Lingchong

Coarse-grained rules are widely used in chemistry, physics and engineering. In biology, however, such rules are less common and under-appreciated. This gap can be attributed to the difficulty in establishing general rules to encompass the immense diversity and complexity of biological systems. Furthermore, even when a rule is established, it is often challenging to map it to mechanistic details and to quantify these details. Here we report a framework that addresses these challenges for mutualistic systems. We first deduce a general rule that predicts the various outcomes of mutualistic systems, including coexistence and productivity. We further develop a standardized machine-learning-based calibration procedure to use the rule without the need to fully elucidate or characterize their mechanistic underpinnings. Our approach consistently provides explanatory and predictive power with various simulated and experimental mutualistic systems. Our strategy can pave the way for establishing and implementing other simple rules for biological systems.

Item Open Access Absolute Continuity of Singular SPDEs and Bayesian Inference on Dynamical Systems (2023) Su, Langxuan

We explore the interplay among probability, stochastic analysis, and dynamical systems through two lenses: (1) absolute continuity of singular stochastic partial differential equations (SPDEs); (2) Bayesian inference on dynamical systems.

In the first part, we prove that up to a certain singular regime, the law of the stochastic Burgers equation at a fixed time is absolutely continuous with respect to the corresponding stochastic heat equation with the nonlinearity removed. The results follow from an extension of the Girsanov Theorem to handle less spatially regular solutions while only proving absolute continuity at a fixed time. To deal with the singularity, we introduce a novel decomposition in the spirit of Da Prato-Debussche and Gaussian chaos decomposition in singular SPDEs, by separating out the noise into different levels of regularity, along with a number of renormalization techniques. The number of levels in this decomposition diverges to infinity as we move to the stochastic Burgers equation associated with the KPZ equation. This result illustrates the fundamental probabilistic structure of a class of singular SPDEs and a notion of "ellipticity" in the infinite-dimensional setting.

In the second part, we establish connections between large deviations and a class of generalized Bayesian inference procedures on dynamical systems. We show that posterior consistency can be derived using a combination of classical large deviation techniques, such as Varadhan's lemma, conditional/quenched large deviations, annealed large deviations, and exponential approximations. We identified the divergence term as the Donsker-Varadhan relative entropy rate, also related to the Kolmogorov-Sinai entropy in ergodic theory. As an application, we prove new quenched/annealed large deviation asymptotics and a new Bayesian posterior consistency result for a class of mixing stochastic processes. In the case of Markov processes, one can obtain explicit conditions for posterior consistency, when estimates for log-Sobolev constants are available, which makes our framework essentially a black box. We also recover state-of-the-art posterior consistency on classical dynamical systems with a simple proof. Our approach has the potential of proving posterior consistency for a wide range of Bayesian procedures in a unified way.

Item Open Access Advances in Choquet theories (2022) Caprio, Michele

Choquet theory and the theory of capacities, both initiated by French mathematician Gustave Choquet, share the heuristic notion of studying the extrema of a convex set in order to give interesting results regarding its elements. In this work, we put to use Choquet theory in the study of finite mixture models and the theory of capacities in studying severe uncertainty.

In chapter 2, we show how by combining a classical non-parametric density estimator based on a Pólya tree with techniques from Choquet theory, it is possible to retrieve the weights of a finite mixture model. We also give the rate of convergence of the Pólya tree posterior to the Dirac measure on the weights.

In chapter 3, we introduce dynamic probability kinematics (DPK), a method for an agent to mechanically update subjective beliefs in the presence of partial information. We then generalize it to dynamic imprecise probability kinematics (DIPK), which allows the agent to express their initial beliefs via a set of probabilities. We provide bounds for the lower probability associated with the updated probability sets, and we study the behavior of the latter, in particular contraction, dilation, and sure loss. Examples are provided to illustrate how the methods work. We also formulate in chapter 4 an ergodic theory for the limit of the sequence of successive DIPK updates. As a consequence, we formulate a strong law of large numbers.

Finally, in chapter 5 we propose a new, more general definition of extended probability measures ("probabilities" whose codomain is the interval [-1,1]). We study their properties and provide a behavioral interpretation. We use them in an inference procedure, whose environment is canonically represented by a probability space, when both the probability measure and the composition of the state space are unknown. We develop an ex ante analysis - taking place before the statistical analysis requiring knowledge of the state space - in which we progressively learn its true composition. We describe how to update extended probabilities in this setting, and introduce the concept of lower extended probabilities. We provide two examples in the fields of ecology and opinion dynamics.

Item Open Access Age-specific differences in oncogenic pathway deregulation seen in human breast tumors. (PLoS One, 2008-01-02) Anders, Carey K; Acharya, Chaitanya R; Hsu, David S; Broadwater, Gloria; Garman, Katherine; Foekens, John A; Zhang, Yi; Wang, Yixin; Marcom, Kelly; Marks, Jeffrey R; Mukherjee, Sayan; Nevins, Joseph R; Blackwell, Kimberly L; Potti, Anil

PURPOSE: To define the biology driving the aggressive nature of breast cancer arising in young women. EXPERIMENTAL DESIGN: Among 784 patients with early stage breast cancer, using prospectively-defined, age-specific cohorts (young ≤45 years; older ≥65 years), 411 eligible patients (n = 200 ≤45 years; n = 211 ≥65 years) with clinically-annotated Affymetrix microarray data were identified. GSEA, signatures of oncogenic pathway deregulation and predictors of chemotherapy sensitivity were evaluated within the two age-defined cohorts. RESULTS: In comparing deregulation of oncogenic pathways between age groups, a higher probability of PI3K (p = 0.006) and Myc (p = 0.03) pathway deregulation was observed in breast tumors arising in younger women. When evaluating unique patterns of pathway deregulation, a low probability of Src and E2F deregulation in tumors of younger women, concurrent with a higher probability of PI3K, Myc, and beta-catenin, conferred a worse prognosis (HR = 4.15). In contrast, a higher probability of Src and E2F pathway activation in tumors of older women, with concurrent low probability of PI3K, Myc and beta-catenin deregulation, was associated with poorer outcome (HR = 2.7). In multivariate analyses, genomic clusters of pathway deregulation illustrate prognostic value. CONCLUSION: Results demonstrate that breast cancer arising in young women represents a distinct biologic entity characterized by unique patterns of deregulated signaling pathways that are prognostic, independent of currently available clinico-pathologic variables.
These results should enable refinement of targeted treatment strategies in this clinically challenging situation.

Item Open Access An Analysis of NBA Spatio-Temporal Data (2017) Robertson, Megan

This project examines the utility of spatio-temporal tracking data from professional basketball games by fitting models predicting whether a player will make a shot. The first part of the project explored the data, evaluated its issues, and generated features to use as covariates in the models. The second part fit various classification models and evaluated their predictive performance. The paper concludes with a discussion of methods to improve the models and future work.

Item Open Access AN APPLICATION OF GRAPH DIFFUSION FOR GESTURE CLASSIFICATION (2020) Voisin, Perry Samuel

Reliable and widely available robotic prostheses have long been a dream of science fiction writers and researchers alike. The problem of sufficiently generalizable gesture recognition algorithms and technology remains a barrier to these ambitions despite numerous advances in computer science, engineering, and machine learning. Often the failure of a particular algorithm to generalize to the population at large is due to superficial characteristics of subjects in the training data set. These superficial characteristics are captured and integrated into the signal intended to capture the gesture being performed. This work applies methods developed in computer vision and graph theory to the problem of identifying pertinent features in a set of time series modalities.

Item Open Access An Euler Characteristic Curve Based Representation of 3D Shapes in Statistical Analysis (2021) Ma, Zining

3D shape analysis is common in many fields such as medical science and biology. Analysis of original shape data is challenging and can be computationally heavy. A simplified representation of 3D shapes could help develop accessible shape analysis methods. In this paper we propose a method to generate a specific form of 3D shape representation that can be applied in statistical analysis. We use Euler Characteristic curves to create the representation of shapes and utilize scaling function bases from diffusion wavelets to generate the representation. We discuss the details of our method and, finally, apply it to a shape classification problem to test the performance of the representation.

Item Metadata only An integrated approach to the prediction of chemotherapeutic response in patients with breast cancer. (PLoS One, 2008-04-02) Salter, Kelly H; Acharya, Chaitanya R; Walters, Kelli S; Redman, Richard; Anguiano, Ariel; Garman, Katherine S; Anders, Carey K; Mukherjee, Sayan; Dressman, Holly K; Barry, William T; Marcom, Kelly P; Olson, John; Nevins, Joseph R; Potti, Anil

BACKGROUND: A major challenge in oncology is the selection of the most effective chemotherapeutic agents for individual patients, while the administration of ineffective chemotherapy increases mortality and decreases quality of life in cancer patients. This emphasizes the need to evaluate every patient's probability of responding to each chemotherapeutic agent and limiting the agents used to those most likely to be effective. METHODS AND RESULTS: Using gene expression data on the NCI-60 and corresponding drug sensitivity, mRNA and microRNA profiles were developed representing sensitivity to individual chemotherapeutic agents. The mRNA signatures were tested in an independent cohort of 133 breast cancer patients treated with the TFAC (paclitaxel, 5-fluorouracil, adriamycin, and cyclophosphamide) chemotherapy regimen. To further dissect the biology of resistance, we applied signatures of oncogenic pathway activation and performed hierarchical clustering. We then used mRNA signatures of chemotherapy sensitivity to identify alternative therapeutics for patients resistant to TFAC. Profiles from mRNA and microRNA expression data represent distinct biologic mechanisms of resistance to common cytotoxic agents. The individual mRNA signatures were validated in an independent dataset of breast tumors (P = 0.002, NPV = 82%). When the accuracy of the signatures was analyzed based on molecular variables, the predictive ability was found to be greater in basal-like than non basal-like patients (P = 0.03 and P = 0.06).
Samples from patients with co-activated Myc and E2F represented the cohort with the lowest percentage (8%) of responders. Using mRNA signatures of sensitivity to other cytotoxic agents, we predict that TFAC non-responders are more likely to be sensitive to docetaxel (P = 0.04), representing a viable alternative therapy. CONCLUSIONS: Our results suggest that the optimal strategy for chemotherapy sensitivity prediction integrates molecular variables such as ER and HER2 status with corresponding microRNA and mRNA expression profiles. Importantly, we also present evidence to support the concept that analysis of molecular variables can present a rational strategy to identifying alternative therapeutic opportunities.

Item Open Access Approximate Inference for High-Dimensional Latent Variable Models (2018) Tan, Zilong

Latent variable models are widely used in applications ranging from natural language processing to recommender systems. Exact inference using maximum likelihood for these models is generally NP-hard, and computationally prohibitive on big and/or high-dimensional data. This has motivated the development of approximate inference methods that balance between computational complexity and statistical efficiency. Understanding the computational and statistical tradeoff is important for analyzing approximate inference approaches as well as designing new ones. Towards this goal, this dissertation presents a study of new approximate inference algorithms with provable guarantees for three classes of inference tasks.

The first class is based on the method of moments. The inference in this setting is typically reduced to a tensor decomposition problem which requires decomposing a $p$-by-$p$-by-$p$ estimator tensor for $p$ variables. We divide-and-conquer the tensor method to instead decompose $O\left(p/k\right)$ sub-tensors each of size $O\left(k^3\right)$, achieving significant reduction in computational complexity when the number of latent variables $k$ is small. Our approach can also enforce the nonnegativity of estimates for inferring nonnegative model parameters. Theoretical analysis gives sufficient conditions for ensuring robustness of the divide-and-conquer method, as well as proof of linear convergence for the nonnegative factorization.

In the second class, we further consider mixed-effect models in which the variance of latent variables also needs to be inferred. We present approximate estimators which have closed-form analytical expressions. Fast computational techniques based on the subsampled randomized Hadamard transform are also developed achieving sublinear complexity in the dimension. This makes our approach useful for high-dimensional applications like genome-wide association studies. Moreover, we provide theoretical analysis that states provable error guarantees for the approximation.

The last class is more general inference in an infinite-dimensional function space specified by a Gaussian process (GP) prior. We provide a dual formulation of GPs using random functions in a reproducing kernel Hilbert space (RKHS) where the function representation is specified as latent variables. We show that the dual GP can realize an expanded class of functions, and can also be well-approximated by a low-dimensional sufficient dimension reduction subspace of the RKHS. A fast learning algorithm for the dual GP is developed which improves upon the state-of-the-art computational complexity of GPs.

Item Open Access Assessing the radiation response of lung cancer with different gene mutations using genetically engineered mice. (Front Oncol, 2013) Perez, Bradford A; Ghafoori, A Paiman; Lee, Chang-Lung; Johnston, Samuel M; Li, Yifan; Moroshek, Jacob G; Ma, Yan; Mukherjee, Sayan; Kim, Yongbaek; Badea, Cristian T; Kirsch, David G

PURPOSE: Non-small cell lung cancers (NSCLC) are a heterogeneous group of carcinomas harboring a variety of different gene mutations. We have utilized two distinct genetically engineered mouse models of human NSCLC (adenocarcinoma) to investigate how genetic factors within tumor parenchymal cells influence the in vivo tumor growth delay after one or two fractions of radiation therapy (RT). MATERIALS AND METHODS: Primary lung adenocarcinomas were generated in vivo in mice by intranasal delivery of an adenovirus expressing Cre-recombinase. Lung cancers expressed oncogenic Kras(G12D) and were also deficient in one of two tumor suppressor genes: p53 or Ink4a/ARF. Mice received no radiation treatment or whole lung irradiation in a single fraction (11.6 Gy) or in two 7.3 Gy fractions (14.6 Gy total) separated by 24 h. In each case, the biologically effective dose (BED) equaled 25 Gy$_{10}$. Response to RT was assessed by micro-CT 2 weeks after treatment. Quantitative reverse transcription-polymerase chain reaction (qRT-PCR) and immunohistochemical staining were performed to assess the integrity of the p53 pathway, the G1 cell-cycle checkpoint, and apoptosis. RESULTS: Tumor growth rates prior to RT were similar for the two genetic variants of lung adenocarcinoma. Lung cancers with wild-type (WT) p53 (LSL-Kras; Ink4a/ARF(FL/FL) mice) responded better to two daily fractions of 7.3 Gy compared to a single fraction of 11.6 Gy (P = 0.002). There was no statistically significant difference in the response of lung cancers deficient in p53 (LSL-Kras; p53(FL/FL) mice) to a single fraction (11.6 Gy) compared to 7.3 Gy × 2 (P = 0.23).
Expression of the p53 target genes p21 and PUMA was higher and bromodeoxyuridine uptake was lower after RT in tumors with WT p53. CONCLUSION: Using an in vivo model of malignant lung cancer in mice, we demonstrate that the response of primary lung cancers to one or two fractions of RT can be influenced by specific gene mutations.

Item Open Access Bayesian Estimation and Sensitivity Analysis for Causal Inference (2019) Zaidi, Abbas M

This dissertation aims to explore Bayesian methods for causal inference. In chapter 1, we present an overview of fundamental ideas from causal inference along with an outline of the methodological developments that we hope to tackle. In chapter 2, we develop a Gaussian-process mixture model for heterogeneous treatment effect estimation that leverages the use of transformed outcomes. The approach we will present attempts to improve point estimation and uncertainty quantification relative to past work that has used transformed variable related methods as well as traditional outcome modeling. Earlier work on modeling treatment effect heterogeneity using transformed outcomes has relied on tree based methods such as single regression trees and random forests. Under the umbrella of non-parametric models, outcome modeling has been performed using Bayesian additive regression trees and various flavors of weighted single trees. These approaches work well when large samples are available, but suffer in smaller samples where results are more sensitive to model misspecification; our method attempts to garner improvements in inference quality via a correctly specified model rooted in Bayesian non-parametrics. Furthermore, while we begin with a model that assumes that the treatment assignment mechanism is known, an extension where it is learnt from the data is presented for applications to observational studies.
Our approach is applied to simulated and real data to demonstrate our theorized improvements in inference with respect to two causal estimands: the conditional average treatment effect and the average treatment effect. By leveraging our correctly specified model, we are able to more accurately estimate the treatment effects while reducing their variance. In chapter 3, we parametrically and hierarchically estimate the average causal effects of different lengths of stay in the Udayan Ghar Program under the assumption that selection into different lengths is based on a set of observed covariates. This program was piloted in New Delhi, India as a means of providing a residential surrogate to vulnerable and at risk children with the hope of improving their psychological development. We find that the estimated effects on the psychological ideas of self concept and ego resilience (measured by the standardized Piers-Harris score) increase with the length of the time spent in the program. We are also able to conclude that there are measurable differences that exist between male and female children that spend time in the program. In chapter 4, we supplement the estimation of hierarchical dose-response function estimation by introducing a novel sensitivity-analysis and summarization strategy for assessing the robustness of our results to violations of the assumption of unconfoundedness. Finally, in chapter 5, we summarize what this dissertation has achieved, and briefly outline important areas where our work warrants further development.

Item Open Access Bayesian Kernel Models for Statistical Genetics and Cancer Genomics (2017) Crawford, Lorin Anthony

The main contribution of this thesis is to examine the utility of kernel regression approaches and variance component models for solving complex problems in statistical genetics and molecular biology. Many of these types of statistical methods have been developed specifically to be applied to solve similar biological problems. For example, kernel regression models have a long history in statistics, applied mathematics, and machine learning. More recently, variance component models have been extensively utilized as tools to broaden understanding of the genetic basis of phenotypic variation. However, because of large combinatorial search spaces and other confounding factors, many of these current methods face enormous computational challenges and often suffer from low statistical power, particularly when phenotypic variation is driven by complicated underlying genetic architectures (e.g. the presence of epistatic effects involving higher order genetic interactions). This thesis highlights two novel methods which provide innovative solutions to better address the important statistical and computational hurdles faced within complex biological data sets. The first is a Bayesian non-parametric statistical framework that allows for efficient variable selection in nonlinear regression which we refer to as "Bayesian approximate kernel regression", or BAKR. The second is a novel algorithm for identifying genetic variants that are involved in epistasis without the need to identify the exact partners with which the variants interact. We refer to this method as the "MArginal ePIstasis Test", or MAPIT. Here, we develop the theory of these two approaches, and demonstrate their power, interpretability, and computational efficiency for analyzing complex phenotypes.
We also illustrate their ability to facilitate novel biological discoveries in several real data sets, each of them representing a particular class of analyses: genome-wide association studies (GWASs), molecular trait quantitative trait loci (QTL) mapping studies, and cancer biology association studies. Lastly, we will also explore the potential of these approaches in radiogenomics, a brand new subfield of genetics and genomics that focuses on the study of correlations between imaging or network features and genetic variation.

Item Open Access Bayesian meta-analysis models for heterogeneous genomics data (2013) Zheng, Lingling

The accumulation of high-throughput data from vast sources has drawn considerable attention to developing methods for extracting meaningful information from massive data. More interesting questions arise from how to combine the disparate information, which goes beyond modeling sparsity and dimension reduction. This dissertation focuses on innovations in the area of heterogeneous data integration.

Chapter 1 contextualizes this dissertation by introducing different aspects of meta-analysis and model frameworks for high-dimensional genomic data.

Chapter 2 introduces a novel technique, joint Bayesian sparse factor analysis model, to vertically integrate multi-dimensional genomic data from different platforms.

Chapter 3 extends the above model to a nonparametric Bayes formulation that directly infers the number of factors through a model-based approach.

On the other hand, chapter 4 deals with horizontal integration of diverse gene expression data; the model infers pathway activities across various experimental conditions.

All the methods mentioned above are demonstrated in both simulation studies and real data applications in chapters 2-4.

Finally, chapter 5 summarizes the dissertation and discusses future directions.
