# Browsing by Subject "Bayesian statistics"


Item Open Access: Advancements in Probabilistic Machine Learning and Causal Inference for Personalized Medicine (2019). Lorenzi, Elizabeth Catherine. In this dissertation, we present four novel contributions to the field of statistics with the shared goal of personalizing medicine to individual patients. These methods are developed to directly address problems in health care through two subfields of statistics: probabilistic machine learning and causal inference. These projects include improving predictions of adverse events after surgery and learning the effectiveness of treatments for specific subgroups and for individuals. We begin the dissertation in Chapter 1 with a discussion of personalized medicine, the use of electronic health record (EHR) data, and a brief discussion of learning heterogeneous treatment effects. In Chapter 2, we present a novel algorithm, Predictive Hierarchical Clustering (PHC), for agglomerative hierarchical clustering of current procedural terminology (CPT) codes. PHC clusters subgroups found within our data, rather than individual observations, such that the discovered clusters yield optimal performance of a classification model, specifically for predicting surgical complications. In Chapter 3, we develop a hierarchical infinite latent factor model (HIFM) to appropriately account for the covariance structure across subpopulations in data. We propose a novel hierarchical Dirichlet process shrinkage prior on the loadings matrix that flexibly captures the underlying structure of our data across subpopulations while sharing information to improve inference and prediction. We apply this work to the problem of predicting surgical complications using electronic health record data for geriatric patients at Duke University Health System (DUHS).
The last chapters of the dissertation address personalized medicine from a causal perspective, where the goal is to understand how interventions affect individuals, not full populations. In Chapter 4, we address heterogeneous treatment effects across subgroups, where guidance for observational comparisons within subgroups is lacking, as is a connection to classic design principles for propensity score (PS) analyses. We address these shortcomings by proposing a novel propensity score method for subgroup analysis (SGA) that seeks to balance existing strategies in an automatic and efficient way. Using overlap weights, we prove that an over-specified propensity model including interactions between subgroups and all covariates results in exact covariate balance within subgroups. This is paired with variable selection approaches to adjust for a possibly overspecified propensity score model. Finally, Chapter 5 discusses our final contribution, a longitudinal matching algorithm that aims to predict individual treatment effects of a medication change for diabetes patients. This project develops a novel and generalizable causal inference framework for learning heterogeneous treatment effects from EHR data. The key methodological innovation is to cast the sparse and irregularly spaced EHR time series into functional data analysis in the design stage, to adjust for confounding that changes over time. We conclude the dissertation and discuss future work in Chapter 6, outlining many directions for continued research on these topics.
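The exact-balance result quoted above has a short numerical illustration: when the propensity score e(x) is fit by maximum-likelihood logistic regression, giving treated units weight 1 - e(x) and controls weight e(x) balances every covariate in the model exactly, because the weighted means reduce to the logistic score equations. A minimal sketch with simulated data (the Newton fit and all variable names are illustrative, not the dissertation's code):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 500
X = rng.normal(size=(n, 2))                      # two confounders
logit = 0.5 * X[:, 0] - 0.8 * X[:, 1]
z = rng.binomial(1, 1 / (1 + np.exp(-logit)))    # treatment assignment

# Fit a logistic propensity model by Newton-Raphson (intercept + covariates).
A = np.column_stack([np.ones(n), X])
beta = np.zeros(A.shape[1])
for _ in range(25):
    e = 1 / (1 + np.exp(-A @ beta))
    W = e * (1 - e)
    beta += np.linalg.solve(A.T @ (W[:, None] * A), A.T @ (z - e))
e = 1 / (1 + np.exp(-A @ beta))

# Overlap weights: 1 - e for treated, e for controls.
w = np.where(z == 1, 1 - e, e)
treated_mean = (w * z) @ X / (w * z).sum()
control_mean = (w * (1 - z)) @ X / (w * (1 - z)).sum()
print(np.abs(treated_mean - control_mean))       # numerically zero
```

The balance is exact (up to solver tolerance) for any covariate included in the logistic model; the subgroup version in the dissertation extends this by interacting subgroup indicators with the covariates.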

Item Open Access: Advances in Bayesian Modeling of Protein Structure Evolution (2018). Larson, Gary. This thesis contributes to a statistical modeling framework for protein sequence and structure evolution. An existing Bayesian model for protein structure evolution is extended in two unique ways. Each of these model extensions addresses an important limitation which has not yet been satisfactorily addressed in the wider literature. These extensions are followed by work regarding inherent statistical bias in models for sequence evolution.

Most available models for protein structure evolution do not model interdependence between the backbone sites of the protein, yet the assumption that the sites evolve independently is known to be false. I argue that ignoring such dependence leads to biased estimation of evolutionary distance between proteins. To mitigate this bias, I express an existing Bayesian model in a generalized form and introduce site-dependence via the generalized model. In the process, I show that the effect of protein structure information on the measure of evolutionary distance can be suppressed by the model formulation, and I further modify the model to help mitigate this problem. In addition to the statistical model itself, I provide computational details and computer code. I modify a well-known bioinformatics algorithm in order to preserve efficient computation under this model. The modified algorithm can be easily understood and used by practitioners familiar with the original algorithm. My approach to modeling dependence is computationally tractable and interpretable with little additional computational burden over the model on which it is based.

The second model expansion allows for evolutionary inference on protein pairs having structural discrepancies attributable to backbone flexion. Thus, the model expansion exposes flexible protein structures to the capabilities of Bayesian protein structure alignment and phylogenetics. Unlike most of the few existing methods that deal with flexible protein structures, our Bayesian flexible alignment model requires no prior knowledge of the presence or absence of flexion points in the protein structure, and uncertainty measures are available for the alignment and other parameters of interest. The model can detect subtle flexion while not overfitting non-flexible protein pairs, and is demonstrated to improve phylogenetic inference in a simulated data setting and in a difficult-to-align set of proteins. The flexible model is a unique addition to the small but growing set of tools available for analysis of flexible protein structure. The ability to perform inference on flexible proteins in a Bayesian framework is likely to be of immediate interest to the structural phylogenetics community.

Finally, I present work related to the study of bias in site-independent models for sequence evolution. In the case of binary sequences, I discuss strategies for theoretical proof of bias and provide various details to that end, including detailing efforts undertaken to produce a site-dependent sequence model with similar properties to the site-dependent structural model introduced in an earlier chapter. I highlight the challenges of theoretical proof for this bias and include miscellaneous related work of general interest to researchers studying dependent sequence models.

Item Open Access: Bayesian Adjustment for Multiplicity (2009). Scott, James Gordon. This thesis is about Bayesian approaches for handling multiplicity. It considers three main kinds of multiple-testing scenarios: tests of exchangeable experimental units, tests for variable inclusion in linear regression models, and tests for conditional independence in jointly normal vectors. Multiplicity adjustment in these three areas will be seen to have many common structural features. Though the modeling approach throughout is Bayesian, frequentist reasoning regarding error rates will often be employed.

Chapter 1 frames the issues in the context of historical debates about Bayesian multiplicity adjustment. Chapter 2 confronts the problem of large-scale screening of functional data, where control over Type-I error rates is a crucial issue. Chapter 3 develops new theory for comparing Bayes and empirical-Bayes approaches for multiplicity correction in regression variable selection. Chapters 4 and 5 describe new theoretical and computational tools for Gaussian graphical-model selection, where multiplicity arises in performing many simultaneous tests of pairwise conditional independence. Chapter 6 introduces a new approach to sparse-signal modeling based upon local shrinkage rules. Here the focus is not on multiplicity per se, but rather on using ideas from Bayesian multiple-testing models to motivate a new class of multivariate scale-mixture priors. Finally, Chapter 7 describes some directions for future study, many of which are the subjects of my current research agenda.
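The flavor of multiplicity correction studied in Chapters 2-3 can be made concrete in a toy example: putting a uniform prior on the common inclusion probability p, rather than fixing it, lets the data teach the model that signals are rare, which shrinks every test's posterior inclusion probability. A small enumeration sketch (the Bayes factors below are invented purely for illustration):

```python
import numpy as np
from itertools import product
from math import lgamma, exp

# Hypothetical per-test Bayes factors (alternative vs. null):
# one borderline signal and nine tests whose data favor the null.
bf = np.array([3.0] + [0.2] * 9)
m = len(bf)

def log_prior(k):
    # P(configuration with k signals) under p ~ Uniform(0, 1):
    # integral of p^k (1-p)^(m-k) dp = k! (m-k)! / (m+1)!
    return lgamma(k + 1) + lgamma(m - k + 1) - lgamma(m + 2)

incl = np.zeros(m)               # posterior inclusion probabilities
total = 0.0
for gamma in product([0, 1], repeat=m):   # enumerate all 2^m configurations
    g = np.array(gamma)
    w = exp(log_prior(g.sum()) + np.log(bf[g == 1]).sum())
    incl += w * g
    total += w
incl /= total
print(incl.round(3))
```

A fixed prior p = 1/2 would give the first test posterior probability 3/(3+1) = 0.75 no matter how many null tests accompany it; under the uniform prior on p, the nine null-looking tests pull it well below that, which is the automatic multiplicity penalty.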

Item Open Access: Bayesian Analysis and Computational Methods for Dynamic Modeling (2009). Niemi, Jarad. Dynamic models, also termed state space models, comprise an extremely rich model class for time series analysis. This dissertation focuses on building state space models for a variety of contexts and computationally efficient methods for Bayesian inference for simultaneous estimation of latent states and unknown fixed parameters.

Chapter 1 introduces state space models and methods of inference in these models. Chapter 2 describes a novel method for jointly sampling the entire latent state vector in a nonlinear Gaussian state space model using a computationally efficient adaptive mixture modeling procedure. This method is embedded in an overall Markov chain Monte Carlo algorithm for estimating fixed parameters as well as states. In Chapter 3 the method of the previous chapter is implemented in a few illustrative nonlinear models and compared to standard existing methods. This chapter also looks at the effect of the number of mixture components, as well as the length of the time series, on the efficiency of the method. I then turn to a biological application in Chapter 4. I discuss modeling choices as well as derivation of the state space model to be used in this application. Parameter and state estimation are analyzed in these models for both simulated and real data. Chapter 5 extends the methodology introduced in Chapter 2 from nonlinear Gaussian models to general state space models. The method is then applied to a financial stochastic volatility model on US dollar-British pound exchange rates. Bayesian inference in the previous chapter is accomplished through Markov chain Monte Carlo, which is suitable for batch analyses but computationally limiting in sequential analysis. Chapter 6 introduces sequential Monte Carlo. It discusses two methods currently available for simultaneous sequential estimation of latent states and fixed parameters and then introduces a novel algorithm that reduces the key, limiting degeneracy issue while remaining usable in a wide model class. Chapter 7 implements the novel algorithm in a disease surveillance context, modeling influenza epidemics. Finally, Chapter 8 suggests areas for future work in both modeling and Bayesian inference. Several appendices provide detailed technical support material as well as relevant related work.
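As a concrete baseline for the sequential Monte Carlo material, a bootstrap particle filter for a simple stochastic volatility model can be sketched in a few lines. The fixed parameters are treated as known here, which is exactly the limitation that simultaneous state-and-parameter algorithms address; all settings below are illustrative:

```python
import numpy as np

rng = np.random.default_rng(1)
T, N = 200, 1000
phi, sigma = 0.95, 0.3          # log-volatility AR(1) parameters (assumed known)

# Simulate a stochastic-volatility series: y_t ~ N(0, exp(x_t)).
x = np.zeros(T)
for t in range(1, T):
    x[t] = phi * x[t - 1] + sigma * rng.normal()
y = np.exp(x / 2) * rng.normal(size=T)

# Bootstrap particle filter for the latent log-volatility.
particles = rng.normal(0, sigma / np.sqrt(1 - phi**2), size=N)  # stationary init
filt_mean = np.empty(T)
loglik = 0.0
for t in range(T):
    particles = phi * particles + sigma * rng.normal(size=N)    # propagate
    # Log-density of y_t under N(0, exp(x_t)) for each particle.
    lw = -0.5 * (np.log(2 * np.pi) + particles + y[t]**2 * np.exp(-particles))
    m = lw.max()
    w = np.exp(lw - m)
    loglik += m + np.log(w.mean())                              # likelihood estimate
    w /= w.sum()
    filt_mean[t] = w @ particles
    particles = rng.choice(particles, size=N, p=w)              # resample
print(loglik)
```

Resampling every step is the simplest scheme and also the source of the degeneracy in fixed-parameter paths that the novel algorithm above is designed to reduce.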

Item Open Access: Bayesian Hierarchical Models for Model Choice (2013). Li, Yingbo. With the development of modern data collection approaches, researchers may collect hundreds to millions of variables, yet may not need to utilize all explanatory variables available in predictive models. Hence, choosing models that consist of a subset of variables often becomes a crucial step. In linear regression, variable selection not only reduces model complexity, but also prevents over-fitting. From a Bayesian perspective, prior specification of model parameters plays an important role in model selection as well as parameter estimation, and often prevents over-fitting through shrinkage and model averaging.

We develop two novel hierarchical priors for selection and model averaging, for Generalized Linear Models (GLMs) and normal linear regression, respectively. They can be considered "spike-and-slab" prior distributions or, more appropriately, "spike-and-bell" distributions. Under these priors we achieve dimension reduction, since their point masses at zero allow predictors to be excluded with positive posterior probability. In addition, these hierarchical priors have heavy tails to provide robustness when MLEs are far from zero.
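The mechanics of a point mass at zero can be seen in a one-predictor toy example: with a normal slab (the "bell"), the Bayes factor for inclusion has a closed form through the least-squares estimate, so exclusion gets positive posterior probability. A sketch, with a slab scale and prior inclusion probability of 1/2 chosen purely for illustration:

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(2)
n, sigma, tau = 100, 1.0, 1.0              # noise sd and slab ("bell") sd
x = rng.normal(size=n)
y = 1.0 * x + sigma * rng.normal(size=n)   # true coefficient = 1

# The data enter the Bayes factor only through the least-squares estimate.
s = x @ x
beta_hat = x @ y / s
se = sigma / np.sqrt(s)

# Marginal density of beta_hat: N(0, se^2) under the spike (beta = 0),
# N(0, tau^2 + se^2) under the slab beta ~ N(0, tau^2).
bf_slab = norm.pdf(beta_hat, 0, np.sqrt(tau**2 + se**2)) / norm.pdf(beta_hat, 0, se)
post_incl = bf_slab / (1 + bf_slab)        # with prior P(beta = 0) = 1/2
print(post_incl)
```

With a weak signal the same calculation would leave substantial mass on exclusion, which is the dimension-reduction behavior described above.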

Zellner's g-prior is widely used in linear models. It preserves the correlation structure among predictors in its prior covariance, and yields closed-form marginal likelihoods, which leads to huge computational savings by avoiding sampling in the parameter space. Mixtures of g-priors avoid fixing g in advance, and can resolve consistency problems that arise with fixed g. For GLMs, we show that the mixture of g-priors using a Compound Confluent Hypergeometric distribution unifies existing choices in the literature and maintains their good properties, such as tractable (approximate) marginal likelihoods and asymptotic consistency for model selection and parameter estimation under specific values of the hyperparameters.
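The closed-form marginal likelihood under a fixed-g Zellner prior is easy to state: the Bayes factor against the intercept-only null depends on the data only through the sample size, model dimension, and R-squared, with no sampling required. A hedged sketch of the standard formula (the simulated design and the unit-information choice g = n are illustrative):

```python
import numpy as np

rng = np.random.default_rng(3)
n, p = 100, 3
X = rng.normal(size=(n, p))
y = X @ np.array([1.0, -0.5, 0.0]) + rng.normal(size=n)

def log_bf_gprior(y, X, g):
    """Log Bayes factor of {intercept + X} vs. the intercept-only null
    under Zellner's g-prior; depends on the data only through R^2."""
    n, p = X.shape
    yc = y - y.mean()
    Xc = X - X.mean(axis=0)
    resid = yc - Xc @ np.linalg.lstsq(Xc, yc, rcond=None)[0]
    r2 = 1 - resid @ resid / (yc @ yc)
    return 0.5 * (n - 1 - p) * np.log1p(g) - 0.5 * (n - 1) * np.log1p(g * (1 - r2))

lbf_signal = log_bf_gprior(y, X, g=float(n))                    # real predictors
lbf_noise = log_bf_gprior(y, rng.normal(size=(n, p)), g=float(n))  # pure noise
print(lbf_signal, lbf_noise)
```

Mixtures of g-priors replace the fixed g with a one-dimensional integral over this same expression, which is why the computational savings survive.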

While the g-prior is invariant under rotation within a model, a potential problem with the g-prior is that it inherits the instability of ordinary least squares (OLS) estimates when predictors are highly correlated. We build a hierarchical prior based on scale mixtures of independent normals, which incorporates invariance under rotations within models like ridge regression and the g-prior, but has heavy tails like the Zellner-Siow Cauchy prior. We find this method out-performs the gold standard mixture of g-priors and other methods in the case of highly correlated predictors in Gaussian linear models. We incorporate a non-parametric structure, the Dirichlet Process (DP), as a hyperprior, to allow more flexibility and adaptivity to the data.

Item Open Access: Bayesian meta-analysis models for heterogeneous genomics data (2013). Zheng, Lingling. The accumulation of high-throughput data from vast sources has drawn much attention to developing methods for extracting meaningful information out of the massive data. More interesting questions arise from how to combine the disparate information, which goes beyond modeling sparsity and dimension reduction. This dissertation focuses on innovations in the area of heterogeneous data integration.

Chapter 1 contextualizes this dissertation by introducing different aspects of meta-analysis and model frameworks for high-dimensional genomic data.

Chapter 2 introduces a novel technique, joint Bayesian sparse factor analysis model, to vertically integrate multi-dimensional genomic data from different platforms.

Chapter 3 extends the above model to a nonparametric Bayesian formulation, which directly infers the number of factors from a model-based approach.

Chapter 4, on the other hand, deals with horizontal integration of diverse gene expression data; the model infers pathway activities across various experimental conditions.

All the methods mentioned above are demonstrated in both simulation studies and real data applications in Chapters 2-4.

Finally, Chapter 5 summarizes the dissertation and discusses future directions.

Item Open Access: Bayesian Mixture Modeling Approaches for Intermediate Variables and Causal Inference (2010). Schwartz, Scott Lee. This thesis examines causal-inference-related topics involving intermediate variables, and uses Bayesian methodologies to advance analysis capabilities in these areas. First, joint modeling of outcome variables with intermediate variables is considered in the context of birthweight and censored gestational age analyses. The proposed methodology provides improved inference capabilities for birthweight and gestational age, avoids post-treatment selection bias problems associated with conditional-on-gestational-age analyses, and appropriately assesses the uncertainty associated with censored gestational age. Second, principal stratification methodology for settings where causal inference analysis requires appropriate adjustment of intermediate variables is extended to observational settings with binary treatments and binary intermediate variables. This is done by uncovering the structural pathways of unmeasured confounding affecting principal stratification analysis and directly incorporating them into a model-based sensitivity analysis methodology. Demonstration focuses on a study of the efficacy of influenza vaccination in elderly populations. Third, the flexibility, interpretability, and capability of principal stratification analyses for continuous intermediate variables are improved by replacing the current fully parametric methodologies with semiparametric Bayesian alternatives. This presentation is one of the first uses of nonparametric techniques in causal inference analysis, and opens a connection between these two fields. Demonstration focuses on two studies, one involving a cholesterol reduction drug, and one examining the effect of physical activity on cardiovascular disease as it relates to body mass index.

Item Open Access: Bayesian Statistical Models of Cell-Cycle Progression at Single-Cell and Population Levels (2014). Mayhew, Michael Benjamin. Cell division is a biological process fundamental to all life. One aspect of the process that is still under investigation is whether or not cells in a lineage are correlated in their cell-cycle progression. Data on cell-cycle progression is typically acquired either in lineages of single cells or in synchronized cell populations, and each source of data offers complementary information on cell division. To formally assess dependence in cell-cycle progression, I develop a hierarchical statistical model of single-cell measurements and extend a previously proposed model of population cell division in the budding yeast, Saccharomyces cerevisiae. Both models capture correlation and cell-to-cell heterogeneity in cell-cycle progression, and parameter inference is carried out in a fully Bayesian manner. The single-cell model is fit to three published time-lapse microscopy datasets and the population-based model is fit to simulated data for which the true model is known. Based on posterior inferences and formal model comparisons, the single-cell analysis demonstrates that budding yeast mother and daughter cells do not appear to correlate in their cell-cycle progression in two of the three experimental settings. In contrast, mother cells grown in a less preferred sugar source, glycerol/ethanol, did correlate in their rate of cell division in two successive cell cycles. Population model fitting to simulated data suggested that, under typical synchrony experimental conditions, population-based measurements of the cell-cycle were not informative for correlation in cell-cycle progression or heterogeneity in daughter-specific G1 phase progression.

Item Open Access: Clustering Multiple Related Datasets with a Hierarchical Dirichlet Process (2011). de Oliveira Sales, Ana Paula. I consider the problem of clustering multiple related groups of data. My approach entails mixture models in the context of hierarchical Dirichlet processes, focusing on their ability to perform inference on the unknown number of components in the mixture, as well as to facilitate the sharing of information and borrowing of strength across the various data groups. Here, I build upon the hierarchical Dirichlet process model proposed by Muller et al. (2004), revising some relevant aspects of the model, as well as improving the MCMC sampler's convergence by combining local Gibbs sampler moves with global Metropolis-Hastings split-merge moves. I demonstrate the strengths of my model by employing it to cluster both synthetic and real datasets.
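The local Gibbs moves mentioned above can be illustrated on a single group: a collapsed Gibbs sampler for a Dirichlet process mixture of normals (known variance, conjugate prior on the means) visits each observation and reassigns it to an existing cluster, with probability proportional to cluster size, or to a fresh cluster, with probability proportional to the concentration parameter. A minimal sketch, not the dissertation's sampler (which adds hierarchical sharing across groups and split-merge moves); all parameter settings are illustrative:

```python
import numpy as np

rng = np.random.default_rng(4)
# Two well-separated groups of observations.
y = np.concatenate([rng.normal(0.0, 1.0, 30), rng.normal(10.0, 1.0, 30)])
n = len(y)
alpha, sigma2, mu0, tau2 = 1.0, 1.0, 5.0, 100.0  # DP and base-measure settings

z = np.zeros(n, dtype=int)                       # start with a single cluster

def log_pred(yi, members):
    """Posterior predictive density of yi for a cluster (conjugate normal mean)."""
    m = len(members)
    prec = 1.0 / tau2 + m / sigma2
    mean = (mu0 / tau2 + members.sum() / sigma2) / prec
    var = sigma2 + 1.0 / prec
    return -0.5 * (np.log(2 * np.pi * var) + (yi - mean) ** 2 / var)

for _ in range(50):                              # collapsed Gibbs sweeps
    for i in range(n):
        mask = np.arange(n) != i
        zi, y_others = z[mask], y[mask]
        labels = np.unique(zi)
        logp = [np.log((zi == k).sum()) + log_pred(y[i], y_others[zi == k])
                for k in labels]
        logp.append(np.log(alpha) + log_pred(y[i], np.array([])))  # new cluster
        logp = np.array(logp)
        w = np.exp(logp - logp.max())
        choice = rng.choice(len(w), p=w / w.sum())
        z[i] = labels[choice] if choice < len(labels) else z.max() + 1

print(len(np.unique(z)))                         # inferred number of clusters
```

The number of occupied clusters is inferred rather than fixed in advance, which is the property the abstract highlights; the hierarchical version shares cluster atoms across the related datasets.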

Item Open Access: Computational Systems Biology of Saccharomyces cerevisiae Cell Growth and Division (2014). Mayhew, Michael Benjamin. Cell division and growth are complex processes fundamental to all living organisms. In the budding yeast, Saccharomyces cerevisiae, these two processes are known to be coordinated with one another as a cell's mass must roughly double before division. Moreover, cell-cycle progression is dependent on cell size with smaller cells at birth generally taking more time in the cell cycle. This dependence is a signature of size control. Systems biology is an emerging field that emphasizes connections or dependencies between biological entities and processes over the characteristics of individual entities. Statistical models provide a quantitative framework for describing and analyzing these dependencies. In this dissertation, I take a statistical systems biology approach to study cell division and growth and the dependencies within and between these two processes, drawing on observations from richly informative microscope images and time-lapse movies. I review the current state of knowledge on these processes, highlighting key results and open questions from the biological literature. I then discuss my development of machine learning and statistical approaches to extract cell-cycle information from microscope images and to better characterize the cell-cycle progression of populations of cells. In addition, I analyze single cells to uncover correlation in cell-cycle progression, evaluate potential models of dependence between growth and division, and revisit classical assertions about budding yeast size control. This dissertation presents a unique perspective and approach towards comprehensive characterization of the coordination between growth and division.

Item Open Access: Dynamic modeling and Bayesian predictive synthesis (2017). McAlinn, Kenichiro. This dissertation discusses model and forecast comparison, calibration, and combination from a foundational perspective. For nearly five decades, the field of forecast combination has grown exponentially, propelled by its practicality and effectiveness in important real-world problems concerning forecasting, uncertainty, and decisions. Ample research, theoretical and empirical, into new methods and justifications has been produced. However, its foundations, the philosophical and theoretical underpinnings on which methods and strategies are built, have been unexplored in recent literature. Bayesian predictive synthesis (BPS) defines a coherent theoretical basis for combining multiple forecast densities, whether from models, individuals, or other sources, and generalizes existing forecast pooling and Bayesian model mixing methods. By understanding the underlying foundation that defines the combination of forecasts, multiple extensions are revealed, resulting in significant advances in the understanding and efficacy of the methods for decision making in multiple fields.

The extensions discussed in this dissertation are into the temporal domain. Many important decision problems are time series, including policy decisions in macroeconomics and investment decisions in finance, where decisions are sequentially updated over time. Time series extensions of BPS are implicit dynamic latent factor models, allowing adaptation to time-varying biases, mis-calibration, and dependencies among models or forecasters. Multiple studies using different data and different decision problems are presented, demonstrating the effectiveness of dynamic BPS, in terms of forecast accuracy and improved decision making, and highlighting the unique insight it provides.

Item Open Access: Employing Neural Language Models and a Bayesian Hierarchical Framework for Classification and Engagement Analysis of Misinformation on Social Media (2022-04). List, Abbey. While social media can be an effective tool for maintaining personal relationships and making global connections, it has become a powerful force in the damaging spread of misinformation, especially during universally difficult and taxing events such as the COVID-19 pandemic. In this study, we collected a sample of Tweets related to COVID-19 from Twitter accounts of influential political media commentators and news organizations, assigning labels of misinformation, misleading, or legitimate to each Tweet. We constructed a Bayesian hierarchical negative binomial regression model to analyze any associations between Tweet engagement and misleading status while controlling for factors such as political lean, lexical diversity, and Retweet status. We found evidence that engagement had a positive association with misleading status and text readability, as well as a negative association with Retweets. We also employed a DeBERTa neural language classification model to predict the presence of misinformative or misleading content in Tweets, and we experimented with external datasets, multitask fine-tuning, backtranslation, and weighted loss to achieve accuracy of 0.683 and a macro F1-score of 0.593. We then examined DeBERTa explainability through word attributions with integrated gradients and found that tokens with the highest influence on model predictions often possessed connotations or context that was understandably related to the predicted label.
The results of this study indicate that misleading status, Retweet status, and linguistic features may hold associations with overall Tweet engagement, and the DeBERTa model represents a potentially useful tool that can examine Tweet text alone, without an external knowledge base, and determine whether misinformation is present.

Item Open Access: Exploiting Big Data in Logistics Risk Assessment via Bayesian Nonparametrics (2014). Shang, Yan. In cargo logistics, a key performance measure is transport risk, defined as the deviation of the actual arrival time from the planned arrival time. Neither earliness nor tardiness is desirable for the customer and freight forwarder. In this paper, we investigate ways to assess and forecast transport risks using a half-year of air cargo data, provided by a leading forwarder on 1336 routes served by 20 airlines. Interestingly, our preliminary data analysis shows a strong multimodal feature in the transport risks, driven by unobserved events, such as cargo missing flights. To accommodate this feature, we introduce a Bayesian nonparametric model, the probit stick-breaking process (PSBP) mixture model, for flexible estimation of the conditional (i.e., state-dependent) density function of transport risk. We demonstrate that using simpler methods, such as OLS linear regression, can lead to misleading inferences. Our model provides a tool for the forwarder to offer customized price and service quotes. It can also generate baseline airline performance to enable fair supplier evaluation. Furthermore, the method allows us to separate recurrent risks from disruption risks. This is important, because hedging strategies for these two kinds of risks are often drastically different.
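The PSBP construction behind the abstract above can be sketched directly: component weights are products of probit stick-breaks whose linear predictors depend on covariates, so the mixture weights, and hence the conditional density, shift with the state of the shipment. Everything below (the atoms, the stick-break coefficients, the "delay regime" reading) is invented for illustration, not taken from the paper:

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(5)

# Component atoms, read here as delay regimes (hours): on time, late, very late.
atoms_mean = np.array([0.0, 24.0, 48.0])
atoms_sd = np.array([2.0, 4.0, 4.0])

def psbp_weights(x):
    """Covariate-dependent mixture weights via probit stick-breaking.
    The linear stick-break predictors below are illustrative."""
    alpha = np.array([2.0 * x - 1.0, 0.5])        # K - 1 = 2 stick-breaks
    v = norm.cdf(alpha)
    return np.array([v[0],
                     (1 - v[0]) * v[1],
                     (1 - v[0]) * (1 - v[1])])    # sums to 1 exactly

def sample_risk(x, size):
    w = psbp_weights(x)
    k = rng.choice(3, size=size, p=w)
    return rng.normal(atoms_mean[k], atoms_sd[k])

low = sample_risk(1.0, 5000)    # covariate value favoring the on-time regime
high = sample_risk(0.0, 5000)   # weights shift toward the delayed regimes
print(low.mean(), high.mean())
```

Because the weights, not just the means, move with the covariate, the implied conditional density can change from unimodal to multimodal, which is exactly the feature that a single OLS fit cannot capture.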

Item Open Access: General and Efficient Bayesian Computation through Hamiltonian Monte Carlo Extensions (2017). Nishimura, Akihiko. Hamiltonian Monte Carlo (HMC) is a state-of-the-art sampling algorithm for Bayesian computation. The popular probabilistic programming languages Stan and PyMC rely on HMC's generality and efficiency to provide automatic Bayesian inference platforms for practitioners. Despite its widespread use and numerous success stories, HMC has several well-known pitfalls. This thesis presents extensions of HMC that overcome its two most prominent weaknesses: inability to handle discrete parameters and slow mixing on multi-modal target distributions.

Discontinuous HMC (DHMC), presented in Chapter 2, extends HMC to discontinuous target distributions (and hence to discrete parameter distributions, through embedding them into continuous spaces) using an idea of event-driven Monte Carlo from the computational physics literature. DHMC is guaranteed to outperform a Metropolis-within-Gibbs algorithm since, as it turns out, the two algorithms coincide under a specific (and sub-optimal) implementation of DHMC. The theoretical justification of DHMC extends an existing theory of non-smooth Hamiltonian mechanics and of measure-valued differential inclusions.

Geometrically tempered HMC (GTHMC), presented in Chapter 3, improves HMC's performance on multi-modal target distributions. The efficiency improvement is achieved through differential geometric techniques, relating a target distribution to another distribution with less severe multi-modality. We establish a geometric theory behind Riemannian manifold HMC to motivate our geometric tempering methods. We then develop an explicit variable-stepsize reversible integrator for simulating Hamiltonian dynamics to overcome a stability issue of the usual Störmer-Verlet integrator. The integrator is of independent interest, being the first of its kind designed specifically for HMC variants.

In addition to the two extensions described above, Chapter 4 describes a variable trajectory length algorithm that generalizes the acceptance and rejection procedure of HMC (and in fact of any reversible-dynamics-based sampler) to allow for more flexible choices of trajectory lengths. The algorithm in particular enables an effective application of a variable-stepsize integrator to HMC extensions, including GTHMC. The algorithm is widely applicable and provides a recipe for constructing valid dynamics-based samplers beyond the known HMC variants. Chapter 5 concludes the thesis with a simple and practical algorithm to improve the computational efficiency of HMC and related algorithms over their traditional implementations.
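For readers unfamiliar with the baseline that these extensions modify, a minimal HMC sampler (leapfrog integration of Hamiltonian dynamics plus a Metropolis correction for the discretization error) on a standard normal target looks like this; the step size, path length, and target are illustrative choices, not the thesis's:

```python
import numpy as np

rng = np.random.default_rng(6)

def U(q):
    return 0.5 * q @ q        # potential energy: standard normal target

def grad_U(q):
    return q

def hmc_step(q, eps=0.2, L=20):
    p = rng.normal(size=q.shape)              # resample momentum
    H0 = U(q) + 0.5 * p @ p
    qn, pn = q.copy(), p.copy()
    pn -= 0.5 * eps * grad_U(qn)              # leapfrog: half momentum step
    for _ in range(L - 1):
        qn += eps * pn
        pn -= eps * grad_U(qn)
    qn += eps * pn
    pn -= 0.5 * eps * grad_U(qn)
    H1 = U(qn) + 0.5 * pn @ pn
    # Accept/reject corrects the leapfrog discretization error exactly.
    if np.log(rng.uniform()) < H0 - H1:
        return qn, 1
    return q, 0

q = np.zeros(2)
draws, acc = [], 0
for _ in range(2000):
    q, a = hmc_step(q)
    acc += a
    draws.append(q)
draws = np.array(draws)
print(acc / 2000, draws.var(axis=0))
```

The pieces the thesis generalizes are visible here: the gradient step fails on discontinuous targets (Chapter 2), the fixed geometry mixes poorly across modes (Chapter 3), and the fixed `eps` and `L` are what the variable-stepsize and variable-trajectory-length machinery of Chapter 4 relaxes.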

Item Open Access: Interfaces between Bayesian and Frequentist Multiple Testing (2015). Chang, Shih-Han. This thesis investigates frequentist properties of Bayesian multiple testing procedures in a variety of scenarios and depicts the asymptotic behavior of Bayesian methods. Both Bayesian and frequentist approaches to multiplicity control are studied and compared, with special focus on understanding the multiplicity control behavior in situations of dependence between test statistics.

Chapter 2 examines a problem of testing mutually exclusive hypotheses with dependent data. The Bayesian approach is shown to have excellent frequentist properties and is argued to be the most effective way of obtaining frequentist multiplicity control without sacrificing power. Chapter 3 further generalizes the model such that multiple signals are acceptable, and depicts the asymptotic behavior of false positive rates and the expected number of false positives. Chapter 4 considers the problem of dealing with a sequence of different trials concerning some medical or scientific issue, and discusses the possibilities for multiplicity control of the sequence. Chapter 5 addresses issues and efforts in reconciling frequentist and Bayesian approaches in sequential endpoint testing. We consider the conditional frequentist approach in sequential endpoint testing and show several examples in which Bayesian and frequentist methodologies cannot be made to match.

Item Open Access: Modeling Heterogeneity With Bayesian Additive Regression Trees (2023). Orlandi, Vittorio. This work focuses on using Bayesian Additive Regression Trees (BART), a flexible and computationally efficient regression method, to model heterogeneity in data. In particular, we focus on the closely related tasks of hierarchical modeling, latent variable modeling, and density regression. We begin by introducing BART in Chapter 2, presenting the prior, various extensions, and an in-depth case study using BART to analyze the impact of ABO-incompatible cardiac transplant on infants. Chapter 3 describes a methodological contribution, in which we use BART to model data structured within known groups by allowing for group-specific forests, each of which is only updated using units corresponding to that group. We further introduce an intercept forest common to all units and a hierarchical prior across the leaf variances in order to allow for sharing of information. We find that such an approach yields more parsimonious models than other BART-based approaches in the literature, which in turn translates to better out-of-sample accuracy, at virtually no added computational cost. In Chapter 4, we consider models involving latent variables within BART. The original motivation is to extend the known-group approach in Chapter 3 to a setting where group information is unavailable. However, this idea lends itself well to many different analyses, including those involving continuous omitted or latent variables. Another application is a generalization of a BART-based approach to sensitivity analysis, in which we allow for the unobserved confounder to flexibly influence the outcome. The latent variable framework we consider is computationally efficient, can help BART model data much more accurately than if restricting oneself to observed covariates, and is widely applicable to many different settings.
In Chapter 5, we study one such application in great detail: using BART for density regression. By integrating out the latent variable in our model, we can model conditional densities in a way that outperforms a variety of other approaches on simulated tasks, and also allows us to bound its posterior concentration rate. We hope that the tools we develop in this work are useful to practitioners seeking to model heterogeneity in their data and also serve as a foundation for future methodological advances.
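The Chapter 3 decomposition above — a forest shared by all units plus group-specific forests fit to within-group residuals — can be caricatured by substituting one-split trees (stumps) for BART forests. This is a minimal sketch under stated assumptions: the exhaustive-search stump, the single backfitting pass, and the toy data are all illustrative, and none of BART's priors or MCMC machinery appears here.

```python
def fit_stump(x, y):
    """One-split regression tree fit by exhaustive threshold search."""
    mean = sum(y) / len(y)
    best = (sum((yi - mean) ** 2 for yi in y), None, mean, mean)
    for t in sorted(set(x))[:-1]:
        left = [yi for xi, yi in zip(x, y) if xi <= t]
        right = [yi for xi, yi in zip(x, y) if xi > t]
        ml, mr = sum(left) / len(left), sum(right) / len(right)
        sse = (sum((yi - ml) ** 2 for yi in left)
               + sum((yi - mr) ** 2 for yi in right))
        if sse < best[0]:
            best = (sse, t, ml, mr)
    _, t, ml, mr = best
    return lambda xi: ml if t is None or xi <= t else mr

def fit_grouped(x, y, g):
    """Shared 'intercept' component plus group-specific residual fits."""
    common = fit_stump(x, y)
    resid = [yi - common(xi) for xi, yi in zip(x, y)]
    group_fit = {}
    for grp in set(g):
        idx = [i for i, gi in enumerate(g) if gi == grp]
        group_fit[grp] = fit_stump([x[i] for i in idx], [resid[i] for i in idx])
    return lambda xi, gi: common(xi) + group_fit[gi](xi)

# two groups sharing a common jump at x = 1, with opposite offsets
x = [0, 1, 2, 3, 0, 1, 2, 3]
g = ["A"] * 4 + ["B"] * 4
y = [1, 1, 3, 3, -1, -1, 1, 1]
model = fit_grouped(x, y, g)
```

The shared component absorbs structure common to all groups, so each group-specific fit only has to explain its own deviation — the same division of labor the hierarchical BART prior encourages.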

Item Open Access Monitoring and Improving Markov Chain Monte Carlo Convergence by Partitioning(2015) VanDerwerken, DouglasSince Bayes' Theorem was first published in 1763, many have argued for the Bayesian paradigm on purely philosophical grounds. For much of this time, however, practical implementation of Bayesian methods was limited to a relatively small class of "conjugate" or otherwise computationally tractable problems. With the development of Markov chain Monte Carlo (MCMC) and improvements in computers over the last few decades, the number of problems amenable to Bayesian analysis has increased dramatically. The ensuing spread of Bayesian modeling has led to new computational challenges as models become more complex and higher-dimensional, and both parameter sets and data sets become orders of magnitude larger. This dissertation introduces methodological improvements to deal with these challenges. These include methods for enhanced convergence assessment, for parallelization of MCMC, for estimation of the convergence rate, and for estimation of normalizing constants. A recurring theme across these methods is the utilization of one or more chain-dependent partitions of the state space.
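A minimal sketch of the partition idea described above (the standard-normal target, the random-walk sampler, and the bin edges are all illustrative assumptions): partition the state space into cells, run chains from dispersed starting points, and compare their occupancy fractions across cells — close agreement suggests the chains have reached the same stationary distribution.

```python
import math
import random

def metropolis(n, start, seed, step=1.0):
    """Random-walk Metropolis chain targeting a standard normal."""
    rng = random.Random(seed)
    x, out = start, []
    for _ in range(n):
        prop = x + rng.uniform(-step, step)
        # accept with probability min(1, pi(prop)/pi(x))
        if math.log(rng.random()) < 0.5 * (x * x - prop * prop):
            x = prop
        out.append(x)
    return out

def occupancy(chain, edges):
    """Fraction of samples falling in each cell of a state-space partition."""
    counts = [0] * (len(edges) + 1)
    for x in chain:
        counts[sum(x > e for e in edges)] += 1
    return [c / len(chain) for c in counts]

def tv_distance(p, q):
    """Total variation distance between two occupancy vectors."""
    return 0.5 * sum(abs(a - b) for a, b in zip(p, q))

# chains started far apart; similar cell occupancies suggest both have
# explored the same stationary distribution
edges = [-1.0, 0.0, 1.0]
a = occupancy(metropolis(20000, -5.0, seed=1), edges)
b = occupancy(metropolis(20000, 5.0, seed=2), edges)
```

A large distance between occupancy vectors flags non-convergence directly in terms of where the chains actually spend their time, rather than through summary statistics of a single scalar functional.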

Item Open Access On Bayesian Analyses of Functional Regression, Correlated Functional Data and Non-homogeneous Computer Models(2013) Montagna, SilviaCurrent frontiers in complex stochastic modeling of high-dimensional processes include major emphases on so-called functional data: problems in which the data are snapshots of curves and surfaces representing fundamentally important scientific quantities. This thesis explores new Bayesian methodologies for functional data analysis.

The first part of the thesis places emphasis on the role of factor models in functional data analysis. Data reduction becomes mandatory when dealing with such high-dimensional data, more so when data are available on a large number of individuals. In Chapter 2 we present a novel Bayesian framework which employs a latent factor construction to represent each variable by a low-dimensional summary. Further, we explore the important issue of modeling and analyzing the relationship of functional data with other covariate and outcome variables simultaneously measured on the same subjects.

The second part of the thesis is concerned with the analysis of circadian data. The focus is on the identification of circadian genes, that is, genes whose expression levels appear to be rhythmic over time with a period of approximately 24 hours. In addressing this goal, most of the current literature does not account for potential dependence across genes. In Chapter 4, we propose a Bayesian approach which employs latent factors to accommodate dependence and uncover patterns and relationships between genes, while representing the true gene expression trajectories in the Fourier domain to allow inference on the period, phase, and amplitude of the signal.
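The Fourier-domain representation that yields period, phase, and amplitude can be sketched with a single-harmonic least-squares fit — a simplified stand-in for the thesis's Bayesian model, with the 24-hour period fixed and noiseless hourly data assumed for illustration.

```python
import math

def fit_rhythm(times, values, period=24.0):
    """Least-squares fit of values ~ a*cos(wt) + b*sin(wt); returns
    the amplitude and phase of the fitted harmonic."""
    w = 2 * math.pi / period
    c = [math.cos(w * t) for t in times]
    s = [math.sin(w * t) for t in times]
    # normal equations for the two-regressor least-squares problem
    scc = sum(v * v for v in c)
    sss = sum(v * v for v in s)
    scs = sum(p * q for p, q in zip(c, s))
    scy = sum(p * y for p, y in zip(c, values))
    ssy = sum(q * y for q, y in zip(s, values))
    det = scc * sss - scs * scs
    a = (sss * scy - scs * ssy) / det
    b = (scc * ssy - scs * scy) / det
    return math.hypot(a, b), math.atan2(b, a)

# noiseless hourly expression with amplitude 2 and phase 0.5 rad
times = list(range(24))
values = [2 * math.cos(2 * math.pi * t / 24 - 0.5) for t in times]
amp, phase = fit_rhythm(times, values)
```

Since A*cos(wt - phi) expands to A*cos(phi)*cos(wt) + A*sin(phi)*sin(wt), the amplitude and phase fall out of the fitted coefficients directly — the quantities the Fourier-domain representation makes available for inference.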

The third part of the thesis is concerned with the statistical analysis of computer models (simulators). The heavy computational demand of these input-output maps calls for statistical techniques that quickly estimate the output surface at untried inputs given a few preliminary runs of the simulator at a set of design points. In this regard, we propose a Bayesian methodology based on a non-stationary Gaussian process. Relying on a model-based assessment of uncertainty, we envision a sequential design technique which helps choose input points where the simulator should be run to minimize the uncertainty in posterior surface estimation in an optimal way. The proposed non-stationary approach adapts well to output surfaces of unconstrained shape.
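The sequential design idea can be sketched with an ordinary stationary GP emulator: fit the GP to a few simulator runs and place the next run where the posterior predictive variance is largest. This is a minimal sketch under stated assumptions — a stationary squared-exponential kernel with an arbitrary length-scale, in contrast to the non-stationary process the thesis proposes, and a one-dimensional candidate grid.

```python
import math

def rbf(a, b, ls=0.5):
    """Stationary squared-exponential kernel (illustrative length-scale)."""
    return math.exp(-0.5 * ((a - b) / ls) ** 2)

def solve(A, b):
    """Gaussian elimination with partial pivoting; fine for tiny systems."""
    n = len(A)
    M = [row[:] + [bv] for row, bv in zip(A, b)]
    for i in range(n):
        p = max(range(i, n), key=lambda r: abs(M[r][i]))
        M[i], M[p] = M[p], M[i]
        for r in range(i + 1, n):
            f = M[r][i] / M[i][i]
            for c in range(i, n + 1):
                M[r][c] -= f * M[i][c]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        x[i] = (M[i][n] - sum(M[i][c] * x[c] for c in range(i + 1, n))) / M[i][i]
    return x

def gp_posterior(X, y, xstar, noise=1e-6):
    """Posterior mean and variance of a GP emulator at an untried input."""
    K = [[rbf(p, q) + (noise if i == j else 0.0)
          for j, q in enumerate(X)] for i, p in enumerate(X)]
    alpha = solve(K, y)
    ks = [rbf(p, xstar) for p in X]
    mean = sum(k * w for k, w in zip(ks, alpha))
    v = solve(K, ks)
    var = rbf(xstar, xstar) - sum(k * w for k, w in zip(ks, v))
    return mean, max(var, 0.0)

# sequential design: the next simulator run goes where predictive
# variance (posterior uncertainty) is largest
X, y = [0.0, 0.5, 1.0], [0.0, 0.25, 1.0]
grid = [i / 20 for i in range(21)]
next_x = max(grid, key=lambda x: gp_posterior(X, y, x)[1])
```

With design points at 0, 0.5, and 1, the predictive variance peaks midway between runs, so the criterion selects 0.25 or 0.75 — the inputs about which the emulator is most uncertain.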

Item Open Access Probabilistic Models for Text in Social Networks(2018) Owens-Oas, DerekText in social networks is a common form of data. Common examples include emails between coworkers, text messages in a group chat, or comments on Facebook. There is value in developing models for such data. Examples of related services include archiving emails by topic and recommending job prospects for those seeking employment. However, due to privacy concerns, these data are relatively hard to obtain. We therefore work with similar data of the same structure which are publicly available, for model design and experimentation.

Motivated primarily by topic discovery, this thesis begins with a thorough survey of models which extend the foundational probabilistic topic model, latent Dirichlet allocation. Our focus is on those which endow documents with metadata, like a time stamp, the author, or a set of links to other authors. Each approach is given common notation, described in terms of a structural innovation to LDA, and presented in a graphical model. The review reveals that, to our knowledge, no previous model combines dynamic topic modeling and community detection.

The first data set studied in this thesis is a corpus of political blog posts. Our motivation is to learn communities, guided by the presence of links and dynamic topic interests. This formulation enables new link recommendation. We therefore develop an appropriate Bayesian probabilistic model to learn these parameters jointly. Experiments reveal the model successfully identifies a group of blogs which discuss sensational crime, despite there being very few links between these blogs. It also enables presentation of top blogs, according to various criteria, for a specified topic interest community.

In a second analysis of the blog post data I develop a similar model. The motivation is to partition documents into groups. The groups are defined by shared topic interest proportions and shared linking patterns. Documents in the same group are reasonable recommendations to a reader. The model is designed to extend the foundational LDA. This enables easy comparison to a strong baseline. Also, it offers an alternative to LDA for situations where a hard clustering of documents is desired, and documents with similar enough topic proportions are clustered together. It simultaneously learns the linking tendency for each of these groups.

We next apply probabilistic models for text in social networks to related event-sequence data. Here we analyze a transcription of group conversation from the movie 12 Angry Men. A main contribution is an algorithm based on marked multivariate Hawkes processes to recover latent structure, learning the root source of an event. The algorithm is tested on synthetic data and a Reddit data set where structure is observed. The algorithm enables partial credit attribution, distributing the credit over the people likely to have started each new conversation thread.
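The credit-attribution step can be sketched with the standard branching-structure probabilities of a Hawkes process with an exponential kernel. The baseline rate mu, excitation weight alpha, and decay beta below are illustrative assumptions; the thesis's marked multivariate version additionally conditions on who speaks and on event marks.

```python
import math

def parent_probs(times, mu=0.1, alpha=0.5, beta=1.0):
    """For event i, the probability (given the rates) that it is a
    background/root event versus triggered by each earlier event j,
    proportional to mu and to alpha*beta*exp(-beta*(t_i - t_j))."""
    out = []
    for i, t in enumerate(times):
        w = [mu] + [alpha * beta * math.exp(-beta * (t - times[j]))
                    for j in range(i)]
        z = sum(w)
        out.append([wi / z for wi in w])
    return out

# three events: the second follows the first closely, the third is isolated
probs = parent_probs([0.0, 0.1, 5.0])
```

Each event's probability vector spreads "credit" over the background process and all earlier events, which is exactly the partial-credit attribution the abstract describes: an event soon after another is likely triggered, while an isolated event is likely the root of a new thread.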

The above models and applications demonstrate the value of text network data. Generalized software for such data enables visualization and summarization of model outputs for text data in social networks.

Item Open Access Probabilistic Models on Fibre Bundles(2019) Shan, ShanIn this thesis, we propose probabilistic models on fibre bundles for learning the generative process of data. The main tool we use is the diffusion kernel, and we use it in two ways. First, we build from the diffusion kernel on a fibre bundle a projected kernel that generates robust representations of the data, and we show that it outperforms regular diffusion maps under noise. Second, this diffusion kernel gives rise to a natural covariance function when defining Gaussian processes (GP) on the fibre bundle. To demonstrate the uses of GPs on a fibre bundle, we apply them to simulated data on a Möbius strip for prediction and regression problems. Parameter tuning can also be guided by a novel semi-group test arising from the geometric properties of the diffusion kernel. As an example of real-world application, we use probabilistic models on fibre bundles to study evolutionary processes on anatomical surfaces. In a separate chapter, we propose a robust algorithm (ariaDNE) for computing curvature on each individual surface. The proposed machinery, relating diffusion processes to probabilistic models on fibre bundles, provides a unified framework for ideas from a variety of different topics such as geometric operators, dimension reduction, regression and Bayesian statistics.
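The diffusion-kernel machinery underlying this work can be sketched in its simplest Euclidean form (a minimal sketch, assuming points on the line, a Gaussian kernel with illustrative bandwidth, and no fibre-bundle structure): build a Markov transition matrix from the kernel and compare points through their t-step transition rows.

```python
import math

def diffusion_distances(points, eps=0.5, t=3):
    """Distances between rows of the t-step Markov transition matrix
    built from a Gaussian diffusion kernel on the sample points."""
    n = len(points)
    K = [[math.exp(-((points[i] - points[j]) ** 2) / eps) for j in range(n)]
         for i in range(n)]
    # row-normalize the kernel into a Markov transition matrix
    P = [[K[i][j] / sum(K[i]) for j in range(n)] for i in range(n)]
    Pt = [row[:] for row in P]
    for _ in range(t - 1):                      # P^t by repeated multiplication
        Pt = [[sum(Pt[i][k] * P[k][j] for k in range(n)) for j in range(n)]
              for i in range(n)]
    def dist(i, j):
        return math.sqrt(sum((Pt[i][k] - Pt[j][k]) ** 2 for k in range(n)))
    return dist

# two well-separated clusters on the line
points = [0.0, 0.1, 0.2, 5.0, 5.1, 5.2]
d = diffusion_distances(points)
```

Points connected by many short diffusion paths end up close, while points in different components stay far apart — the geometric behavior that the thesis's projected kernel and bundle-valued GPs build upon.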