Browsing by Author "Clyde, Merlise A"
Item Open Access A tutorial on Bayesian multi-model linear regression with BAS and JASP. (Behavior research methods, 2021-12) Bergh, Don van den; Clyde, Merlise A; Gupta, Akash R Komarlu Narendra; de Jong, Tim; Gronau, Quentin F; Marsman, Maarten; Ly, Alexander; Wagenmakers, Eric-Jan
Linear regression analyses commonly involve two consecutive stages of statistical inquiry. In the first stage, a single 'best' model is defined by a specific selection of relevant predictors; in the second stage, the regression coefficients of the winning model are used for prediction and for inference concerning the importance of the predictors. However, such second-stage inference ignores the model uncertainty from the first stage, resulting in overconfident parameter estimates that generalize poorly. These drawbacks can be overcome by model averaging, a technique that retains all models for inference, weighting each model's contribution by its posterior probability. Although conceptually straightforward, model averaging is rarely used in applied research, possibly due to the lack of easily accessible software. To bridge the gap between theory and practice, we provide a tutorial on linear regression using Bayesian model averaging in JASP, based on the BAS package in R. Firstly, we provide theoretical background on linear regression, Bayesian inference, and Bayesian model averaging. Secondly, we demonstrate the method on an example data set from the World Happiness Report.
Lastly, we discuss limitations of model averaging and directions for dealing with violations of model assumptions.
Item Restricted Association between DNA damage response and repair genes and risk of invasive serous ovarian cancer. (PLoS One, 2010-04-08) Schildkraut, Joellen M; Iversen, Edwin S; Wilson, Melanie A; Clyde, Merlise A; Moorman, Patricia G; Palmieri, Rachel T; Whitaker, Regina; Bentley, Rex C; Marks, Jeffrey R; Berchuck, Andrew
BACKGROUND: We analyzed the association between 53 genes related to DNA repair and p53-mediated damage response and serous ovarian cancer risk using case-control data from the North Carolina Ovarian Cancer Study (NCOCS), a population-based, case-control study. METHODS/PRINCIPAL FINDINGS: The analysis was restricted to 364 invasive serous ovarian cancer cases and 761 controls of white, non-Hispanic race. Statistical analysis was two staged: a screen using marginal Bayes factors (BFs) for 484 SNPs and a modeling stage in which we calculated multivariate adjusted posterior probabilities of association for 77 SNPs that passed the screen. These probabilities were conditional on subject age at diagnosis/interview, batch, a DNA quality metric and genotypes of other SNPs and allowed for uncertainty in the genetic parameterizations of the SNPs and number of associated SNPs. Six SNPs had Bayes factors greater than 10 in favor of an association with invasive serous ovarian cancer. These included rs5762746 (median OR (odds ratio) (per allele) = 0.66; 95% credible interval (CI) = 0.44-1.00) and rs6005835 (median OR (per allele) = 0.69; 95% CI = 0.53-0.91) in CHEK2, rs2078486 (median OR (per allele) = 1.65; 95% CI = 1.21-2.25) and rs12951053 (median OR (per allele) = 1.65; 95% CI = 1.20-2.26) in TP53, rs411697 (median OR (rare homozygote) = 0.53; 95% CI = 0.35-0.79) in BACH1 and rs10131 (median OR (rare homozygote) = not estimable) in LIG4.
The six most highly associated SNPs are either predicted to be functionally significant or are in LD with such a variant. The variants in TP53 were confirmed to be associated in a large follow-up study. CONCLUSIONS/SIGNIFICANCE: Based on our findings, further follow-up of the DNA repair and response pathways in a larger dataset is warranted to confirm these results.
Item Open Access Bayesian Hierarchical Models for Model Choice (2013) Li, Yingbo
With the development of modern data collection approaches, researchers may collect hundreds to millions of variables, yet may not need to utilize all explanatory variables available in predictive models. Hence, choosing models that consist of a subset of variables often becomes a crucial step. In linear regression, variable selection not only reduces model complexity, but also prevents over-fitting. From a Bayesian perspective, prior specification of model parameters plays an important role in model selection as well as parameter estimation, and often prevents over-fitting through shrinkage and model averaging.
We develop two novel hierarchical priors for selection and model averaging, for Generalized Linear Models (GLMs) and normal linear regression, respectively. They can be considered as "spike-and-slab" prior distributions or, more appropriately, "spike-and-bell" distributions. Under these priors we achieve dimension reduction, since their point masses at zero allow predictors to be excluded with positive posterior probability. In addition, these hierarchical priors have heavy tails to provide robustness when MLEs are far from zero.
Zellner's g-prior is widely used in linear models. It preserves correlation structure among predictors in its prior covariance, and yields closed-form marginal likelihoods, which lead to huge computational savings by avoiding sampling in the parameter space. Mixtures of g-priors avoid fixing g in advance, and can resolve consistency problems that arise with fixed g. For GLMs, we show that the mixture of g-priors using a Compound Confluent Hypergeometric distribution unifies existing choices in the literature and maintains their good properties such as tractable (approximate) marginal likelihoods and asymptotic consistency for model selection and parameter estimation under specific values of the hyperparameters.
While the g-prior is invariant under rotation within a model, a potential problem with the g-prior is that it inherits the instability of ordinary least squares (OLS) estimates when predictors are highly correlated. We build a hierarchical prior based on scale mixtures of independent normals, which incorporates invariance under rotations within models like ridge regression and the g-prior, but has heavy tails like the Zellner-Siow Cauchy prior. We find this method outperforms the gold standard mixture of g-priors and other methods in the case of highly correlated predictors in Gaussian linear models. We incorporate a non-parametric structure, the Dirichlet Process (DP) as a hyper prior, to allow more flexibility and adaptivity to the data.
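The closed-form marginal likelihood mentioned above can be illustrated with a minimal sketch. It assumes a Gaussian linear model with an intercept, a fixed g (here g = n, a choice made only for illustration), and a uniform prior over models; none of the function names come from the thesis or from the BAS package. The Bayes factor against the intercept-only model depends on the data only through R^2, so all 2^p models can be scored without sampling.

```python
import itertools
import numpy as np


def r_squared(X, y):
    """R^2 of a least-squares fit of y on the columns of X plus an intercept."""
    yc = y - y.mean()
    Xc = X - X.mean(axis=0)
    beta, *_ = np.linalg.lstsq(Xc, yc, rcond=None)
    resid = yc - Xc @ beta
    return 1.0 - (resid @ resid) / (yc @ yc)


def log_bf_vs_null(r2, n, k, g):
    """Log Bayes factor of a k-predictor model against the intercept-only
    model under Zellner's g-prior with fixed g."""
    return 0.5 * (n - 1 - k) * np.log1p(g) - 0.5 * (n - 1) * np.log1p(g * (1.0 - r2))


def model_average(X, y, g=None):
    """Score every subset of predictors and return the subsets, their
    posterior model probabilities, and posterior inclusion probabilities."""
    n, p = X.shape
    g = float(n) if g is None else g  # assumption: g = n, for illustration only
    subsets = [s for k in range(p + 1) for s in itertools.combinations(range(p), k)]
    log_bf = np.array([
        log_bf_vs_null(r_squared(X[:, list(s)], y) if s else 0.0, n, len(s), g)
        for s in subsets
    ])
    weights = np.exp(log_bf - log_bf.max())  # uniform prior over models (assumption)
    post = weights / weights.sum()
    incl = np.array([
        sum(post[i] for i, s in enumerate(subsets) if j in s) for j in range(p)
    ])
    return subsets, post, incl
```

With simulated data in which only the first predictor matters, the inclusion probability of that predictor should dominate the others. Mixtures of g-priors, as studied in the thesis, replace the fixed g with a prior on g and are not covered by this sketch.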
Item Open Access Bayesian Model Uncertainty and Prior Choice with Applications to Genetic Association Studies (2010) Wilson, Melanie Ann
The Bayesian approach to model selection allows for uncertainty in both model-specific parameters and in the models themselves. Much of the recent Bayesian model uncertainty literature has focused on defining these prior distributions in an objective manner, providing conditions under which Bayes factors lead to the correct model selection, particularly in the situation where the number of variables, p, increases with the sample size, n. This is certainly the case in our area of motivation: the biological application of genetic association studies involving single nucleotide polymorphisms. While the most common approach to this problem has been to apply a marginal test to all genetic markers, we employ analytical strategies that improve upon these marginal methods by modeling the outcome variable as a function of a multivariate genetic profile using Bayesian variable selection. In doing so, we perform variable selection on a large number of correlated covariates within studies involving modest sample sizes.
In particular, we present an efficient Bayesian model search strategy that searches over the space of genetic markers and their genetic parametrization. The resulting method for Multilevel Inference of SNP Associations (MISA) allows computation of multilevel posterior probabilities and Bayes factors at the global, gene and SNP level. We use simulated data sets to characterize MISA's statistical power, and show that MISA has higher power to detect association than standard procedures. Using data from the North Carolina Ovarian Cancer Study (NCOCS), MISA identifies variants that were not identified by standard methods and have been externally 'validated' in independent studies.
In the context of Bayesian model uncertainty for problems involving a large number of correlated covariates we characterize commonly used prior distributions on the model space and investigate their implicit multiplicity correction properties first in the extreme case where the model includes an increasing number of redundant covariates and then under the case of full rank design matrices. We provide conditions on the asymptotic (in n and p) behavior of the model space prior required to achieve consistent selection of the global hypothesis of at least one associated variable in the analysis using global posterior probabilities (i.e. under 0-1 loss). In particular, under the assumption that the null model is true, we show that the commonly used uniform prior on the model space leads to inconsistent selection of the global hypothesis via global posterior probabilities (the posterior probability of at least one association goes to 1) when the rank of the design matrix is finite. In the full rank case, we also show inconsistency when p goes to infinity faster than the square root of n. Alternatively, we show that any model space prior such that the global prior odds of association increases at a rate slower than the square root of n results in consistent selection of the global hypothesis in terms of posterior probabilities.
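The role of the global prior odds can be made concrete with a small sketch. It compares the uniform prior on the model space with a Beta-Binomial(1, 1) prior; the latter is an assumption introduced here purely for contrast and is not named in the abstract. The global prior odds are the prior probability of at least one association divided by the prior probability of the null model.

```python
from math import comb


def uniform_model_prior(p, k):
    """Prior probability of one specific model with k of p predictors
    when all 2^p models are equally likely."""
    return 0.5 ** p


def beta_binomial_model_prior(p, k):
    """Prior probability of one specific model with k of p predictors
    under a Beta-Binomial(1, 1) prior on model size (illustrative assumption)."""
    return 1.0 / ((p + 1) * comb(p, k))


def global_prior_odds(prior, p):
    """Prior odds of 'at least one association' against the null model."""
    null_prob = prior(p, 0)
    return (1.0 - null_prob) / null_prob


def total_mass(prior, p):
    """Sanity check: the prior should sum to 1 over all 2^p models."""
    return sum(comb(p, k) * prior(p, k) for k in range(p + 1))
```

Under the uniform prior the global prior odds equal 2^p - 1 and so grow exponentially in p, whereas under the Beta-Binomial(1, 1) prior they equal p. Whether a given prior meets the rate condition stated in the abstract depends on how p grows with n, which this sketch does not address.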
Item Open Access Bayesian Spatial Quantile Regression (2010) Lum, Kristian
Spatial quantile regression is the combination of two separate and individually well-developed ideas that, to date, has barely been explored. Quantile regression seeks to model each quantile of an outcome distribution, whether separately or jointly, conditional upon covariates. Spatial methods have been developed for instances when spatial dependence ought to be incorporated into the model, whether to adjust for the decreased effective sample size that comes with highly correlated data or to allow the ability to create a model-based spatial surface that interpolates between the data collected. Combining the spatial methods with quantile regression, this dissertation proposes and studies the properties of several process models for quantile regression that incorporate spatial dependence. In each chapter, we present an application for the model presented therein. In all cases, we are able to achieve improved check loss by incorporating a spatial component into the model.
In Chapter 1, the introduction, we motivate this work by exploring several examples that demonstrate the utility of both quantile regression and spatial models separately.
In Chapter 2, we present the asymmetric Laplace process (ALP), a process model suitable for quantile regression. We derive several covariance properties of various specifications of this model and discuss the advantages and disadvantages of each option. As an example, we apply this model to real estate data.
In Chapter 3, we extend the ALP to accommodate large data sets by incorporating a predictive process covariance structure and sampling scheme into the ALP. By doing so, we create the asymmetric Laplace predictive process (ALPP), which we apply to a data set of approximately 3,000 births in the state of North Carolina in the year 2000. Here, interest lies primarily on the relationship between various maternal covariates and the lower tails of the distribution of birth weights.
In Chapter 4, we again extend the ALP, this time to incorporate a temporal component. We discuss several ways in which both continuous and discrete time can be included in the model. We further develop and outline the details of a discrete time spatial dynamic model. We apply this model to a data set of spatially and temporally indexed temperatures, given elevation.
In Chapter 5, we propose an alternative to the ALP, which re-scales a Gaussian process using two separate scale parameters. We investigate the properties of this double normal process (DNP), and present a simulation example to illustrate the utility (and disutility) of this model.
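The check loss used above to compare models is the standard quantile-regression loss, rho_tau(u) = u * (tau - 1{u < 0}), averaged over residuals u. A minimal sketch follows; the function names are illustrative and are not taken from the dissertation.

```python
import numpy as np


def check_loss(y, pred, tau):
    """Mean check (pinball) loss of predictions `pred` for quantile level tau."""
    u = np.asarray(y, dtype=float) - pred
    return float(np.mean(u * (tau - (u < 0))))


def best_constant(y, tau, grid):
    """Grid value that minimizes the check loss when used as a constant prediction."""
    losses = [check_loss(y, c, tau) for c in grid]
    return float(grid[int(np.argmin(losses))])
```

For a constant prediction, the minimizer of the check loss is an empirical tau-quantile of the data, which is why a lower check loss indicates a better fit to the targeted quantile.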
Item Open Access Default Prior Choice for Bayesian Model Selection in Generalized Linear Models with Applications in Mortgage Default (2014) Kramer, Zachary
The adoption of Zellner's g prior is a popular prior choice in Bayesian Model Averaging, although literature has shown that using a fixed g has undesirable properties. Mixtures of g priors have recently been proposed for generalized linear models, extending results from the Gaussian linear model context. This paper will specifically look at the model selection problem as it applies to logistic regression. The effect of prior choice on both model selection and prediction using Bayesian Model Averaging is analyzed. This is done by testing a variety of model space and mixtures of g priors in a simulated data study as well as illustrating their use in mortgage default data. This paper shows that the different mixtures of g priors tend to fall into one of two groups that have similar behavior. Additionally, priors in one of these groups, specifically the n/2, Beta Prime, and Robust mixtures of g priors, tend to outperform the other choices.
Item Open Access Risk Prediction for Epithelial Ovarian Cancer in 11 United States-Based Case-Control Studies: Incorporation of Epidemiologic Risk Factors and 17 Confirmed Genetic Loci. (Am J Epidemiol, 2016-10-15) Clyde, Merlise A; Palmieri Weber, Rachel; Iversen, Edwin S; Poole, Elizabeth M; Doherty, Jennifer A; Goodman, Marc T; Ness, Roberta B; Risch, Harvey A; Rossing, Mary Anne; Terry, Kathryn L; Wentzensen, Nicolas; Whittemore, Alice S; Anton-Culver, Hoda; Bandera, Elisa V; Berchuck, Andrew; Carney, Michael E; Cramer, Daniel W; Cunningham, Julie M; Cushing-Haugen, Kara L; Edwards, Robert P; Fridley, Brooke L; Goode, Ellen L; Lurie, Galina; McGuire, Valerie; Modugno, Francesmary; Moysich, Kirsten B; Olson, Sara H; Pearce, Celeste Leigh; Pike, Malcolm C; Rothstein, Joseph H; Sellers, Thomas A; Sieh, Weiva; Stram, Daniel; Thompson, Pamela J; Vierkant, Robert A; Wicklund, Kristine G; Wu, Anna H; Ziogas, Argyrios; Tworoger, Shelley S; Schildkraut, Joellen M
Previously developed models for predicting absolute risk of invasive epithelial ovarian cancer have included a limited number of risk factors and have had low discriminatory power (area under the receiver operating characteristic curve (AUC) < 0.60). Because of this, we developed and internally validated a relative risk prediction model that incorporates 17 established epidemiologic risk factors and 17 genome-wide significant single nucleotide polymorphisms (SNPs) using data from 11 case-control studies in the United States (5,793 cases; 9,512 controls) from the Ovarian Cancer Association Consortium (data accrued from 1992 to 2010). We developed a hierarchical logistic regression model for predicting case-control status that included imputation of missing data. We randomly divided the data into an 80% training sample and used the remaining 20% for model evaluation. The AUC for the full model was 0.664. A reduced model without SNPs performed similarly (AUC = 0.649).
Both models performed better than a baseline model that included age and study site only (AUC = 0.563). The best predictive power was obtained in the full model among women younger than 50 years of age (AUC = 0.714); however, the addition of SNPs increased the AUC the most for women older than 50 years of age (AUC = 0.638 vs. 0.616). Adapting this improved model to estimate absolute risk and evaluating it in prospective data sets is warranted.
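The evaluation design described above, a random 80%/20% split followed by AUC on the held-out part, can be sketched as follows. This illustrates only the split and the AUC computation; it is not the hierarchical logistic regression model from the paper, and the function names and the seed are hypothetical.

```python
import numpy as np


def train_test_indices(n, train_frac=0.8, seed=0):
    """Randomly partition the indices 0..n-1 into training and evaluation sets."""
    rng = np.random.default_rng(seed)  # seed chosen arbitrarily for reproducibility
    perm = rng.permutation(n)
    cut = int(round(train_frac * n))
    return perm[:cut], perm[cut:]


def auc(labels, scores):
    """Probability that a randomly chosen case scores higher than a randomly
    chosen control, with ties counted as one half."""
    labels = np.asarray(labels).astype(bool)
    scores = np.asarray(scores, dtype=float)
    pos, neg = scores[labels], scores[~labels]
    diff = pos[:, None] - neg[None, :]  # all case-control score differences
    return float(((diff > 0) + 0.5 * (diff == 0)).mean())
```

An AUC of 0.5 corresponds to ranking cases and controls at random, so the reported values between 0.563 and 0.714 can be read as the fraction of case-control pairs that each model orders correctly.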