Bayesian Nonparametric Modeling of Latent Structures
An unprecedented amount of data has been collected in diverse fields such as social networks, infectious disease, and political science in this era of information explosion. Such high-dimensional, complex, and heterogeneous data pose tremendous challenges for traditional statistical models. Bayesian nonparametric methods address these challenges by providing models whose complexity can grow with the data. In this thesis, we design novel Bayesian nonparametric models for datasets from three different fields: hyperspectral image analysis, infectious disease, and voting behavior.
First, we consider the analysis of noisy and incomplete hyperspectral imagery, with the objective of removing the noise and inferring the missing data. The noise statistics may be wavelength-dependent, and the fraction of data missing (at random) may be substantial, potentially including entire bands, which offers the potential to significantly reduce the quantity of data that needs to be measured. We achieve this objective by employing a Bayesian dictionary learning model, considering two distinct means of imposing sparse dictionary usage, and drawing the dictionary elements from a Gaussian process prior, which imposes structure on the wavelength dependence of the dictionary elements.
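As a rough illustration of the generative idea described above, the following sketch draws dictionary elements from a Gaussian process over wavelength, combines them sparsely, and then adds wavelength-dependent noise and random missingness. It is not the thesis's model or inference procedure: the squared-exponential kernel, the fixed Bernoulli usage probability (standing in for whichever sparsity prior is used), and every name and numeric value here are assumptions made for illustration.

```python
import numpy as np

def sample_gp_dictionary(wavelengths, n_atoms, length_scale, rng, jitter=1e-6):
    """Draw n_atoms dictionary elements from a zero-mean GP over wavelength.

    Assumption: squared-exponential kernel; the abstract does not name one.
    """
    diff = wavelengths[:, None] - wavelengths[None, :]
    cov = np.exp(-0.5 * (diff / length_scale) ** 2)
    cov += jitter * np.eye(len(wavelengths))  # numerical stabilizer
    chol = np.linalg.cholesky(cov)
    # Each column is one smooth, wavelength-correlated dictionary element.
    return chol @ rng.standard_normal((len(wavelengths), n_atoms))

def simulate_pixels(dictionary, n_pixels, usage_prob, noise_std, missing_frac, rng):
    """Simulate noisy, incomplete spectra from sparse dictionary usage."""
    n_bands, n_atoms = dictionary.shape
    # Assumption: independent Bernoulli indicators select the atoms each pixel uses.
    usage = rng.random((n_atoms, n_pixels)) < usage_prob
    weights = rng.standard_normal((n_atoms, n_pixels)) * usage
    clean = dictionary @ weights
    # Noise standard deviation varies by band (wavelength-dependent noise).
    noisy = clean + noise_std[:, None] * rng.standard_normal((n_bands, n_pixels))
    # Entries are dropped uniformly at random; NaN marks a missing measurement.
    observed = rng.random((n_bands, n_pixels)) >= missing_frac
    return clean, np.where(observed, noisy, np.nan), observed, usage

rng = np.random.default_rng(0)
wavelengths = np.linspace(400.0, 2500.0, 60)  # hypothetical band centers, in nm
D = sample_gp_dictionary(wavelengths, n_atoms=8, length_scale=150.0, rng=rng)
noise_std = np.linspace(0.05, 0.3, wavelengths.size)
clean, data, observed, usage = simulate_pixels(D, 100, 0.2, noise_std, 0.5, rng)
```

Recovering `clean` from `data` would require posterior inference over the dictionary, the usage indicators, and the weights, which this sketch does not attempt.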
Second, a Bayesian statistical model is developed for analysis of the time-evolving properties of infectious disease, with a particular focus on viruses. The model employs a latent semi-Markovian state process, and the state-transition statistics are driven by three terms: (i) a general time-evolving trend of the overall population, (ii) a semi-periodic term that accounts for effects caused by the days of the week, and (iii) a regression term that relates the probability of infection to covariates (here, specifically, to the Google Flu Trends data).
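One way to picture how three such terms might drive a transition probability is the sketch below. The abstract does not say how the terms are combined, so the additive form and the logistic link are assumptions, as are the function and parameter names. The day-of-week term is also simplified to a strictly periodic lookup, whereas the text describes it as semi-periodic.

```python
import numpy as np

def transition_probability(trend, weekday_effect, covariates, coef):
    """Per-day transition probability from three additive terms (a sketch).

    trend          : array (T,), population-level time trend
    weekday_effect : array (7,), one offset per day of the week
    covariates     : array (T, P), e.g. a flu-activity index per day
    coef           : array (P,), regression coefficients
    Assumption: the terms add on the logit scale.
    """
    days = np.arange(len(trend))
    linear = trend + weekday_effect[days % 7] + covariates @ coef
    return 1.0 / (1.0 + np.exp(-linear))  # logistic link

rng = np.random.default_rng(1)
n_days = 14
trend = np.linspace(-2.0, -1.0, n_days)           # slowly rising baseline
weekday_effect = np.array([0.3, 0.2, 0.1, 0.0, 0.0, -0.4, -0.5])
covariates = rng.standard_normal((n_days, 1))     # stand-in for a covariate series
coef = np.array([0.5])
probs = transition_probability(trend, weekday_effect, covariates, coef)
```

In a full model these quantities would be latent and inferred jointly with the state sequence; here they are fixed inputs so that the role of each term is visible.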
Third, extensive information on 3 million randomly sampled United States citizens is used to construct a statistical model of constituent preferences for each U.S. congressional district. This model is linked to the legislative voting record of the legislator from each district, yielding an integrated model for constituency data, legislative roll-call votes, and the text of the legislation. The model is used to examine the extent to which legislators' voting records are aligned with constituent preferences, and the implications of that alignment (or lack thereof) for subsequent election outcomes. The analysis is based on a Bayesian nonparametric formalism, with fast inference via a stochastic variational Bayesian analysis.
This work is licensed under a Creative Commons Attribution-Noncommercial-No Derivative Works 3.0 United States License.
Rights for Collection: Duke Dissertations