Browsing by Author "Hoff, Peter D"
Item Open Access: Advanced Topics in Introductory Statistics (2023). Bryan, Jordan Grey. It is now common practice in many scientific disciplines to collect large amounts of experimental or observational data in the course of a research study. The abundance of such data creates a circumstance in which even simply posed research questions may, or sometimes must, be answered using multivariate datasets with complex structure. Introductory-level statistical tools familiar to practitioners may be applied to these types of data, but inference will either be sub-optimal or invalid if properties of the data violate the assumptions made by these statistical procedures. In this thesis, we provide examples of how basic statistical procedures may be adapted to suit the complexity of modern datasets while preserving the simplicity of low-dimensional parametric models. In the context of genomics studies, we propose a frequentist-assisted-by-Bayes (FAB) method for conducting hypothesis tests for the means of normal models when auxiliary information about the means is available. If the auxiliary information accurately describes the means, then the proposed FAB hypothesis tests may be more powerful than the corresponding classical $t$-tests. If the information is not accurate, then the FAB tests retain type-I error control. For multivariate financial and climatological data, we develop a semiparametric model in order to characterize the dependence between two sets of random variables. Our approach is inspired by a multivariate notion of the sample rank and extends classical concepts such as canonical correlation analysis (CCA) and the Gaussian copula model. The proposed model allows for the analysis of multivariate dependence between variable sets with arbitrary marginal distributions. Motivated by fluorescence spectroscopy data collected from sites along the Neuse River, we also propose a least squares estimator for quantifying the contribution of various land-use sources to the water quality of the river. The estimator can be computed quickly relative to estimators derived using parallel factor analysis (PARAFAC) and it performs favorably in two source apportionment tasks.
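To make the FAB testing idea from this abstract concrete, the following is a minimal, hedged sketch (not the exact procedure from the thesis) of a level-$\alpha$ test of $H_0: \theta = 0$ based on $Z \sim N(\theta, 1)$ that splits the type-I error budget between the two tails according to a weight $w$ chosen from auxiliary information; $w = 1/2$ recovers the usual equal-tailed test, and any fixed $w$ keeps the size at exactly $\alpha$.

```python
# Hedged sketch: a level-alpha z-test that allocates the type-I error
# unevenly across the two tails according to a weight w in (0, 1).
# Choosing w from auxiliary information about the likely sign of the mean
# increases power when that information is accurate, while the size of
# the test remains exactly alpha for any fixed w.  Illustrative only; not
# the exact FAB procedure developed in the thesis.
import numpy as np
from scipy.stats import norm

def tail_allocation_test(z, alpha=0.05, w=0.5):
    """Reject H0: theta = 0 based on an observed Z ~ N(theta, 1)."""
    lower = norm.ppf(alpha * w)            # reject if z falls below this
    upper = norm.ppf(1 - alpha * (1 - w))  # reject if z falls above this
    return (z < lower) or (z > upper)

# Under H0, P(reject) = alpha*w + alpha*(1-w) = alpha for every w.
# If auxiliary data suggest theta > 0, a small w shifts the error budget
# to the upper tail and raises power against positive alternatives.
print(tail_allocation_test(1.8, alpha=0.05, w=0.2))   # True
print(tail_allocation_test(1.8, alpha=0.05, w=0.5))   # False
```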
Item Open Access: Advances in Survey Methodology and Sports Science (2019). Burris, Kyle Christian. This thesis develops statistical methodology for efficient uncertainty quantification in the presence of small sample sizes and/or missing data. These methods have a wide range of potential applications, though they are particularly relevant for the analysis of cross-sectional survey data.
In the analysis of survey data it is frequently of interest to estimate and quantify uncertainty about means or totals for each of several non-overlapping subpopulations, or areas. Sometimes there are areas with small sample sizes under the survey design, which can result in wide confidence intervals. While some model-based methods have been developed to reduce interval width by utilizing data from other areas, these interval procedures do not have the nominal frequentist coverage rate for all values of the target quantity. We develop an alternative model-based confidence interval procedure that leverages data from other areas to reduce expected interval width. Importantly, our procedure maintains the nominal frequentist coverage rate for all values of the target quantity and is coverage-robust to model misspecification.
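As a hedged illustration of how an interval can use indirect information without sacrificing coverage, one classical construction in this spirit (a generic sketch, not necessarily the thesis's exact procedure) works as follows. For $Z \sim N(\theta, \sigma^2)$ and a spending function $w(\theta) \in [0, 1]$, define the acceptance region and interval

$$
A(\theta) = \left\{ z : \theta + \sigma\,\Phi^{-1}\!\big(\alpha\, w(\theta)\big) \le z \le \theta + \sigma\,\Phi^{-1}\!\big(1 - \alpha\,(1 - w(\theta))\big) \right\},
\qquad
C(z) = \{\theta : z \in A(\theta)\}.
$$

For every choice of $w(\cdot)$, $P_\theta(Z \in A(\theta)) = 1 - \alpha$ exactly, so $C(Z)$ has the nominal frequentist coverage rate; choosing $w(\cdot)$ using indirect information about $\theta$ from other areas tilts the interval and can reduce its expected width when that information is accurate, while $w \equiv 1/2$ recovers the usual equal-tailed interval.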
Missing data values are also pervasive in survey samples. Imputing multiple completed datasets is a principled way to avoid removing observations with incomplete values while accounting for the uncertainty involved in the imputation procedure. The quality of imputations can be improved when the support of the data is known a priori. We develop methodology for multiple imputation of mixed data when the support is known a priori to be a subset of the data product space. This can improve the quality of the resulting imputed datasets, yielding more efficient statistical inference.
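A minimal, hedged sketch of imputation that respects a known support is given below, assuming a univariate normal imputation model and a known interval support $[a, b]$; the thesis treats general mixed data whose support is a subset of the product space, and the function and variable names here are illustrative only.

```python
# Hedged sketch: drawing imputations that respect a known support [a, b],
# using a univariate normal imputation model.  Illustrative only; the
# thesis addresses general mixed data on a product-space subset.
import numpy as np
from scipy.stats import truncnorm

def impute_missing(y, a, b, n_imputations=5, rng=None):
    rng = np.random.default_rng(rng)
    obs = y[~np.isnan(y)]
    mu, sigma = obs.mean(), obs.std(ddof=1)
    lo, hi = (a - mu) / sigma, (b - mu) / sigma   # standardized bounds
    n_mis = int(np.isnan(y).sum())
    completed = []
    for _ in range(n_imputations):
        # Note: proper multiple imputation would also draw (mu, sigma)
        # from their posterior for each completed dataset.
        draws = truncnorm.rvs(lo, hi, loc=mu, scale=sigma,
                              size=n_mis, random_state=rng)
        filled = y.copy()
        filled[np.isnan(y)] = draws
        completed.append(filled)
    return completed  # analyze each and combine with Rubin's rules

y = np.array([0.2, 1.4, np.nan, 0.7, np.nan])
imputed = impute_missing(y, a=0.0, b=2.0)
```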
Beyond survey methodology, this thesis also contributes to the sports science literature by developing Bayesian latent variable models for the analysis of visual-motor expertise.
In particular, we consider a multivariate dataset consisting of visual-motor assessments for 2317 athletes, including 252 professional baseball players. We quantify variation in visual-motor expertise across athletes by expertise level, gender, and sport type. Moreover, we examine the dependence among the battery of assessments and their relationship to on-field performance in professional baseball. We find significant positive associations between performance on the assessment battery and measures of baseball performance, particularly those involving plate discipline.
Item Open Access: Comparison of Bayesian Inference Methods for Probit Network Models (2021). Shen, YueMing. This thesis explores and compares Bayesian inference procedures for probit network models. Network data typically exhibit high dyadic correlation due to reciprocity. For binary network data, the presence of dyadic correlation often makes a basic implementation of Markov chain Monte Carlo (MCMC) inefficient. We first explore variational inference as a fast approximation to the posterior distribution. Because this approximation can be insufficient for quantifying posterior uncertainty, we propose an alternative MCMC algorithm that is more efficient and accurate. In particular, we propose to update the dyadic correlation parameter $\rho$ using the marginal likelihood, unconditional of the latent relations $Z$. This reduces autocorrelation in the posterior samples of $\rho$ and hence improves mixing. A simulation study and real data examples are provided to compare the performance of these Bayesian inference methods.
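The following is a minimal, hedged sketch of the kind of marginal update described above, assuming a probit dyad model $Y_{ij} = 1\{Z_{ij} > 0\}$ in which $(Z_{ij}, Z_{ji})$ is bivariate normal with means $(\mu_{ij}, \mu_{ji})$, unit variances, and correlation $\rho$; the proposal scale, the uniform prior on $\rho$, and the function names are illustrative, not the thesis's exact algorithm.

```python
# Hedged sketch: a Metropolis update for the dyadic correlation rho that
# uses the marginal (Z-integrated-out) likelihood of the binary dyads.
# Assumed model: Y_ij = 1{Z_ij > 0}, with (Z_ij, Z_ji) bivariate normal,
# means (M[i, j], M[j, i]), unit variances, correlation rho.
import numpy as np
from scipy.stats import multivariate_normal

def dyad_loglik(Y, M, rho):
    """Sum over dyads i < j of log P(Y_ij, Y_ji | M, rho), Z integrated out."""
    n = Y.shape[0]
    ll = 0.0
    for i in range(n):
        for j in range(i + 1, n):
            s1, s2 = 2 * Y[i, j] - 1, 2 * Y[j, i] - 1   # signs in {-1, +1}
            cov = [[1.0, s1 * s2 * rho], [s1 * s2 * rho, 1.0]]
            # orthant probability P(s1*Z_ij > 0, s2*Z_ji > 0)
            ll += np.log(multivariate_normal.cdf(
                [s1 * M[i, j], s2 * M[j, i]], mean=[0.0, 0.0], cov=cov))
    return ll

def metropolis_rho(Y, M, rho, step=0.05, rng=None):
    """One random-walk Metropolis step for rho with a uniform prior on (-1, 1)."""
    rng = np.random.default_rng(rng)
    prop = rho + step * rng.standard_normal()
    if not -1.0 < prop < 1.0:
        return rho
    logr = dyad_loglik(Y, M, prop) - dyad_loglik(Y, M, rho)
    return prop if np.log(rng.uniform()) < logr else rho
```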
Item Open Access: Geometric Methods for Point Estimation (2023). McCormack, Andrew R. The focus of this dissertation is on geometric aspects of point estimation problems. In the first half of this work, we examine the estimation of location parameters for non-Euclidean data that lie in a known manifold or metric space. Ideas from statistical decision theory motivate the construction of new estimators for location parameters. The second half of this work explores information-geometric aspects of covariance matrix estimation. In a regular statistical model, the Fisher information metric endows the parameter space with a Riemannian manifold structure. Parameter estimation can therefore also be viewed as a problem in non-Euclidean data analysis.
Chapter 2 introduces and formalizes the problem of estimating Fréchet mean location parameters for metric space-valued data. We highlight the importance of the isometry group of distance-preserving transformations and how this group interacts with Fréchet means. Pitman's minimum risk equivariant estimator for location models in Euclidean space is generalized to manifold settings, and we discuss aspects of the performance and computation of this estimator.
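As a minimal, hedged illustration of the Fréchet mean just described, the sketch below computes it for angles on the unit circle, where the geodesic distance is arc length and the mean minimizes the sum of squared geodesic distances; a coarse grid search stands in for proper optimization, and this is a generic example rather than the equivariant estimator developed in the chapter.

```python
# Hedged sketch: Frechet mean of angles on the unit circle, i.e. the
# minimizer of the sum of squared geodesic (arc-length) distances.
import numpy as np

def geodesic_dist(a, b):
    """Arc-length distance between angles a and b (radians)."""
    d = np.abs(a - b) % (2 * np.pi)
    return np.minimum(d, 2 * np.pi - d)

def frechet_mean_circle(angles, grid_size=10000):
    grid = np.linspace(0, 2 * np.pi, grid_size, endpoint=False)
    # Frechet objective: sum of squared geodesic distances to the data
    obj = np.array([np.sum(geodesic_dist(m, angles) ** 2) for m in grid])
    return grid[np.argmin(obj)]

angles = np.array([0.1, 0.3, 6.2])      # clustered near angle 0
print(frechet_mean_circle(angles))      # close to 0 (mod 2*pi)
```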
Turning from equivariant estimation to shrinkage estimation, Chapter 3 introduces a shrinkage estimator for Fréchet means that is inspired by Stein's estimator. This estimator uses metric space geodesics to shrink an estimate towards a pre-specified shrinkage point. It is shown that the performance of this geodesic James-Stein estimator depends on the curvature of the underlying metric space, with shrinkage being especially beneficial in non-positively curved spaces.
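For reference, the classical positive-part James-Stein estimator in Euclidean space is sketched below; the chapter's geodesic version replaces the straight-line shrinkage toward $\mu_0$ with movement along a metric space geodesic. This is a generic sketch, not the estimator proposed in Chapter 3.

```python
# Hedged sketch: the classical positive-part James-Stein estimator in R^p,
# shrinking an observation X ~ N(theta, sigma^2 I) toward a chosen point mu0.
import numpy as np

def james_stein(x, mu0, sigma2=1.0):
    x, mu0 = np.asarray(x, float), np.asarray(mu0, float)
    p = x.size
    if p < 3:
        return x                       # shrinkage gains require p >= 3
    resid = x - mu0
    shrink = 1.0 - (p - 2) * sigma2 / np.sum(resid ** 2)
    return mu0 + max(shrink, 0.0) * resid   # positive-part estimator

x = np.array([1.2, -0.4, 0.9, 2.1, 0.3])
print(james_stein(x, mu0=np.zeros(5)))
```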
Chapter 4 discusses shrinkage estimation for covariance matrices that approximately have a Kronecker product structure. The idea of geodesic shrinkage can be applied with respect to the alpha-geodesics that arise from the information geometry of a statistical model. These alpha-geodesics also lead to interpretable parameter space decompositions. In a Wishart model, we propose an empirical Bayes procedure for estimating approximate Kronecker covariance matrices, a procedure that can be viewed as shrinkage along (-1)-geodesics.
The last chapter of this work further discusses information-geometric aspects of covariance estimation, with a view towards asymptotic efficiency. The asymptotic performance of the partial trace estimator for a Kronecker covariance is contrasted with that of the maximum likelihood estimator. A correction to the partial trace estimator is proposed that is both asymptotically efficient and simple to compute.
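A hedged sketch of a partial-trace construction of Kronecker factor estimates is given below, assuming the usual row-major Kronecker convention $\Sigma = A \otimes B$ with $A$ of size $p \times p$ and $B$ of size $q \times q$; normalization conventions vary, and this is the uncorrected estimator applied to a covariance matrix, not the efficient correction proposed in the chapter.

```python
# Hedged sketch: Kronecker factor estimates from the partial traces of a
# pq x pq covariance matrix S, assuming S is approximately kron(A, B).
import numpy as np

def partial_trace_factors(S, p, q):
    S4 = S.reshape(p, q, p, q)          # S4[i, a, j, b] = S[i*q + a, j*q + b]
    A_hat = np.einsum('iaja->ij', S4)   # trace over the B factor: A * tr(B)
    B_hat = np.einsum('iaib->ab', S4)   # trace over the A factor: B * tr(A)
    B_hat = B_hat / np.trace(S)         # divide by tr(A)*tr(B) so kron(A_hat, B_hat) ~ S
    return A_hat, B_hat

# sanity check on an exact Kronecker covariance
A = np.array([[2.0, 0.5], [0.5, 1.0]])
B = np.diag([1.0, 2.0, 3.0])
A_hat, B_hat = partial_trace_factors(np.kron(A, B), p=2, q=3)
assert np.allclose(np.kron(A_hat, B_hat), np.kron(A, B))
```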
Item Open Access: Methodological Advances for Multi-group Data (2024). Bersson, Elizabeth. This dissertation focuses on improving inference in analyses of multi-group data, that is, data obtained from non-overlapping subpopulations such as counties within a state or socio-economic groups. Precise and accurate group-specific inference based on such data may be encumbered by small within-group sample sizes. In such cases, inference may be improved by making use of auxiliary information. In this work, we present two streams of methodological development aimed at improving group-specific inference for multi-group data that may feature small within-group sample sizes for some or all of the groups. First, we detail methodology that constructs frequentist-valid prediction regions based on indirect information. We show that such prediction regions can be more precise than those constructed with standard approaches. To this end, we present methods that result in accurate and precise prediction regions for multi-group data based on a continuous response in Chapter 2 and a categorical response in Chapter 3. We develop straightforward computational algorithms to compute the regions and detail empirical Bayes estimation procedures that allow information to be shared across groups in the construction of the prediction regions. In Chapter 4, we present work that improves covariance estimation for structured multi-group data with shrinkage estimation that allows for robustness to structural assumptions. In particular, for multi-group matrix-variate data, we describe a hierarchical prior distribution that improves covariance estimation accuracy by flexibly allowing for shrinkage within groups towards a Kronecker structure and across groups towards a pooled covariance estimate. We illustrate the utility of all methods presented with simulation studies and data applications.
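For context, the sketch below shows the standard within-group normal-theory prediction interval that such methods are typically compared against; it attains nominal coverage but can be very wide when the within-group sample size is small. This is the textbook baseline, not the prediction regions developed in Chapters 2 and 3.

```python
# Hedged sketch: the standard t prediction interval for a new observation
# from a single group, assuming a normal sample of size n.  Nominal
# coverage is 1 - alpha, but the interval widens quickly as n shrinks.
import numpy as np
from scipy.stats import t

def t_prediction_interval(y, alpha=0.05):
    y = np.asarray(y, float)
    n, ybar, s = y.size, y.mean(), y.std(ddof=1)
    half = t.ppf(1 - alpha / 2, df=n - 1) * s * np.sqrt(1 + 1 / n)
    return ybar - half, ybar + half

print(t_prediction_interval([3.1, 2.4, 3.8, 2.9]))
```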
Item Open Access: Random Orthogonal Matrices with Applications in Statistics (2019). Jauch, Michael. This dissertation focuses on random orthogonal matrices with applications in statistics. While Bayesian inference for statistical models with orthogonal matrix parameters is a recurring theme, several of the results on random orthogonal matrices may be of interest to those in the broader probability and random matrix theory communities. In Chapter 2, we parametrize the Stiefel and Grassmann manifolds, represented as subsets of orthogonal matrices, in terms of Euclidean parameters using the Cayley transform and then derive Jacobian terms for change of variables formulas. This allows for Markov chain Monte Carlo simulation from probability distributions defined on the Stiefel and Grassmann manifolds. We also establish an asymptotic independent normal approximation for the distribution of the Euclidean parameters corresponding to the uniform distribution on the Stiefel manifold. In Chapter 3, we present polar expansion, a general approach to Monte Carlo simulation from probability distributions on the Stiefel manifold. When combined with modern Markov chain Monte Carlo software, polar expansion allows for routine and flexible posterior inference in models with orthogonal matrix parameters. Chapter 4 addresses prior distributions for structured orthogonal matrices, introducing a construction that leads to tractable posterior simulation via polar expansion. We state two main results that support our approach and offer a new perspective on approximating the entries of random orthogonal matrices.
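A minimal numerical illustration of the Cayley transform is given below: a skew-symmetric matrix $A$ is mapped to the orthogonal matrix $Q = (I - A)(I + A)^{-1}$, giving Euclidean parameters for (a dense subset of) the orthogonal group. This is a generic sketch of the transform itself, not the Stiefel or Grassmann parametrization or the Jacobian computations from Chapter 2.

```python
# Hedged sketch: the Cayley transform maps a skew-symmetric matrix A to an
# orthogonal matrix Q = (I - A)(I + A)^{-1}.
import numpy as np

def cayley(A):
    """Orthogonal matrix from a skew-symmetric matrix A."""
    I = np.eye(A.shape[0])
    return (I - A) @ np.linalg.inv(I + A)

rng = np.random.default_rng(0)
X = rng.standard_normal((4, 4))
A = (X - X.T) / 2                        # skew-symmetric: A.T == -A
Q = cayley(A)
assert np.allclose(Q.T @ Q, np.eye(4))   # Q is orthogonal
```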
Item Open Access: Smaller $p$-values in genomics studies using distilled historical information. Bryan, Jordan G; Hoff, Peter D. Medical research institutions have generated massive amounts of biological data by genetically profiling hundreds of cancer cell lines. In parallel, academic biology labs have conducted genetic screens on small numbers of cancer cell lines under custom experimental conditions. In order to share information between these two approaches to scientific discovery, this article proposes a "frequentist assisted by Bayes" (FAB) procedure for hypothesis testing that allows historical information from massive genomics datasets to increase the power of hypothesis tests in specialized studies. The exchange of information takes place through a novel probability model for multimodal genomics data, which distills historical information pertaining to cancer cell lines and genes across a wide variety of experimental contexts. If the relevance of the historical information for a given study is high, then the resulting FAB tests can be more powerful than the corresponding classical tests. If the relevance is low, then the FAB tests yield as many discoveries as the classical tests. Simulations and practical investigations demonstrate that the FAB testing procedure can increase the number of effects discovered in genomics studies while still maintaining strict control of type I error and false discovery rates.
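For context, the sketch below shows the standard Benjamini-Hochberg step-up procedure that is commonly used to turn a vector of $p$-values, whether classical or FAB, into a discovery set with false discovery rate control; it is a generic illustration and not necessarily the multiple-testing procedure used in the article.

```python
# Hedged sketch: the Benjamini-Hochberg step-up procedure for converting a
# vector of p-values into discoveries with FDR control at level q.
import numpy as np

def benjamini_hochberg(pvals, q=0.05):
    p = np.asarray(pvals, float)
    m = p.size
    order = np.argsort(p)
    thresh = q * np.arange(1, m + 1) / m        # step-up thresholds q*i/m
    below = p[order] <= thresh
    k = np.max(np.nonzero(below)[0]) + 1 if below.any() else 0
    rejected = np.zeros(m, dtype=bool)
    rejected[order[:k]] = True                  # reject the k smallest p-values
    return rejected

print(benjamini_hochberg([0.001, 0.02, 0.04, 0.3, 0.8]))
```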