dc.description.abstract |
<p>This thesis develops flexible non- and semiparametric Bayesian models for mixed
continuous, ordered and unordered categorical data. These methods have a range of
possible applications; the applications considered in this thesis are drawn primarily
from the social sciences, where multivariate, heterogeneous datasets with complex
dependence and missing observations are the norm.</p><p>The first contribution is
an extension of the Gaussian factor model to Gaussian copula factor models, which
accommodate continuous and ordinal data with unspecified marginal distributions. I
describe how this model is the most natural extension of the Gaussian factor model,
preserving its essential dependence structure and the interpretability of factor loadings
and the latent variables. I adopt an approximate likelihood for posterior inference
and prove that, if the Gaussian copula model is true, the approximate posterior distribution
of the copula correlation matrix converges asymptotically to the correct parameter
under nearly arbitrary marginal distributions. I demonstrate with simulations that this
method is both robust and efficient, and illustrate its use in an application from
political science.</p><p>The second contribution is a novel nonparametric hierarchical
mixture model for continuous, ordered and unordered categorical data. The model includes
a hierarchical prior used to couple component indices of two separate models, which
are also linked by local multivariate regressions. This structure overcomes
a key limitation of existing mixture models for mixed data, namely their overly strong
local independence assumptions. In the proposed model, local independence is replaced
by local conditional independence, so that the induced model adapts more readily
to structure in the data. I demonstrate the utility of this model as a default
engine for multiple imputation of mixed data in a large repeated-sampling study using
data from the Survey of Income and Program Participation. I show that it improves substantially
on its most popular competitor, multiple imputation by chained equations (MICE), while
enjoying certain theoretical properties that MICE lacks.</p><p>The third contribution
is a latent variable model for density regression. Most existing density regression
models are quite flexible but somewhat cumbersome to specify and fit, particularly
when the regressors are a combination of continuous and categorical variables. The
majority of these methods rely on extensions of infinite discrete mixture models to
incorporate covariate dependence in mixture weights, atoms or both. I take a fundamentally
different approach, introducing a continuous latent variable which depends on covariates
through a parametric regression. In turn, the observed response depends on the latent
variable through an unknown function. I demonstrate that a spline prior for the unknown
function performs well relative to Dirichlet process mixture models in density
estimation settings (i.e., without covariates), even though Dirichlet process
mixtures have better asymptotic theoretical properties. The spline formulation
enjoys a number of computational advantages over more flexible priors on functions.
Finally, I demonstrate the utility of this model in regression applications using
a dataset on U.S. wages from the Census Bureau, where I estimate the return to schooling
as a smooth function of the quantile index.</p>
|
|