Browsing by Author "Tokdar, Surya Tapas"
Item Open Access A Data-Retaining Framework for Tail Estimation (2020) Cunningham, Erika
Modeling of extreme data often involves thresholding, or retaining only the most extreme observations, in order that the tail may "speak" and not be overwhelmed by the bulk of the data. We describe a transformation-based framework that allows univariate density estimation to smoothly transition from a flexible, semi-parametric estimation of the bulk into a parametric estimation of the tail without thresholding. In the limit, this framework has desirable theoretical tail-matching properties to the selected parametric distribution. We develop three Bayesian models under the framework: one using a logistic Gaussian process (LGP) approach; one using a Dirichlet process mixture model (DPMM); and one using a predictive recursion approximation of the DPMM. Models produce estimates and intervals for density, distribution, and quantile functions across the full data range and for the tail index (inverse-power-decay parameter), under an assumption of heavy tails. For each approach, we carry out a simulation study to explore the model's practical usage in non-asymptotic settings, comparing its performance to methods that involve thresholding.
Among the three models proposed, the LGP has lowest bias through the bulk and highest quantile interval coverage generally. Compared to thresholding methods, its tail predictions have lower root mean squared error (RMSE) in all scenarios but the most complicated, e.g. a sharp bulk-to-tail transition. The LGP's consistent underestimation of the tail index does not hinder tail estimation in pre-extrapolation to moderate-extrapolation regions but does affect extreme extrapolations.
An interplay between the parametric transform and the natural sparsity of the DPMM sometimes causes the DPMM to favor estimation of the bulk over estimation of the tail. This can be overcome by increasing prior precision on less sparse (flatter) base-measure density shapes. A finite mixture model (FMM), substituted for the DPMM in simulation, proves effective at reducing tail RMSE over thresholding methods in some, but not all, scenarios and quantile levels.
The predictive recursion marginal posterior (PRMP) model is fast and does the best job among proposed models of estimating the tail-index parameter. This allows it to reduce RMSE in extrapolation over thresholding methods in most scenarios considered. However, bias from the predictive recursion contaminates the tail, casting doubt on the PRMP's predictions in tail regions where data should still inform estimation. We recommend the PRMP model as a quick tool for visualizing the marginal posterior over transformation parameters, which can aid in diagnosing multimodality and informing the precision needed to overcome sparsity in the mixture model approach.
In summary, there is not enough information in the likelihood alone to prevent the bulk from overwhelming the tail. However, a model that harnesses the likelihood with a carefully specified prior can allow both the bulk and tail to speak without an explicit separation of the two. Moreover, retaining all of the data under this framework reduces quantile variability, improving prediction in the tails compared to methods that threshold.
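A minimal numerical sketch of the transformation idea follows, with a Student-t CDF standing in for the fitted heavy-tailed parametric family and a Gaussian KDE standing in for the LGP/DPMM bulk estimator (both simplifying assumptions, not the dissertation's exact models): data are mapped through the parametric CDF to the uniform scale, a flexible density is estimated there, and the implied density on the original scale inherits the parametric tail decay without any thresholding.

```python
import numpy as np
from scipy import stats

def transformed_density(x_grid, data, df=2.0):
    # Map data through the parametric CDF to (0, 1), then to the real line,
    # where a flexible estimator models the bulk; tail decay is inherited
    # from the parametric family, so no threshold is needed.
    z_data = stats.norm.ppf(stats.t.cdf(data, df))
    kde = stats.gaussian_kde(z_data)                 # stand-in for LGP / DPMM
    z_grid = stats.norm.ppf(stats.t.cdf(x_grid, df))
    # Change of variables: f(x) = g(z(x)) * f_t(x) / phi(z(x))
    jac = stats.t.pdf(x_grid, df) / stats.norm.pdf(z_grid)
    return kde(z_grid) * jac

rng = np.random.default_rng(1)
data = stats.t.rvs(2.5, size=500, random_state=rng)  # heavy-tailed sample
grid = np.linspace(-10.0, 10.0, 201)
f_hat = transformed_density(grid, data)
```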
Item Open Access Bayesian Analysis of Latent Threshold Dynamic Models (2012) Nakajima, Jochi
Time series modeling faces increasingly high-dimensional problems in many scientific areas. Lack of relevant, data-based constraints typically leads to increased uncertainty in estimation and degradation of predictive performance. This dissertation addresses these general questions with a new and broadly applicable idea based on latent threshold models. The latent threshold approach is a model-based framework for inducing data-driven shrinkage of elements of parameter processes, collapsing them fully to zero when redundant or irrelevant while allowing for time-varying non-zero values when supported by the data. This dynamic sparsity modeling technique is implemented in broad classes of multivariate time series models with application to various time series data. The analyses demonstrate the utility of the latent threshold idea in reducing estimation uncertainty and improving predictions as well as model interpretation. Chapter 1 overviews the idea of the latent threshold approach and outlines the dissertation. Chapter 2 introduces the new approach to dynamic sparsity using latent threshold modeling and also discusses Bayesian analysis and computation for model fitting. Chapter 3 describes latent threshold multivariate models for a wide range of applications in the real data analysis that follows. Chapter 4 provides US and Japanese macroeconomic data analysis using latent threshold VAR models. Chapter 5 analyzes time series of foreign currency exchange rates (FX) using latent threshold dynamic factor models. Chapter 6 provides a study of electroencephalographic (EEG) time series using latent threshold factor process models. Chapter 7 develops a new framework of dynamic network modeling for multivariate time series using the latent threshold approach. Finally, Chapter 8 concludes the dissertation with open questions and future work.
Item Open Access Bayesian Decoupling: A Decision Theory-Based Approach to Bayesian Variable Selection (2022) Li, Aihua
The spike and slab prior offers a canonical approach to Bayesian variable selection, with known caveats related to data dimension and correlation. Motivated by the pitfalls of the spike and slab prior on high-dimensional correlated data, this paper introduces Bayesian decoupling (BD, proposed by Hahn and Carvalho [HC15]) as a decision theory-based approach to inducing sparsity on the posterior. We formalize the decision-theoretic foundation of BD and argue that BD conducts a sparsification of the posterior mean with a tolerable degradation of predictive ability. Moreover, the application of BD to sparse estimation motivates the notion of decoupled model fitting and variable estimation, an idea rooted in Bayesian decision theory stating that variable estimation should be recovered explicitly as a decision-making problem after the model fitting stage. We suggest a broader use of BD in Bayesian statistics, emphasizing that it allows multiple estimation tasks to be carried out simultaneously under a single prior by using different loss functions for different estimation purposes. Our simulation results show that BD with appropriately defined loss functions leads to the desired support recovery with low MSE and FDR and offers an accurate representation of the posterior belief.
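The decoupling step itself is easy to sketch in the spirit of Hahn and Carvalho's decoupled shrinkage and selection: fit first, then recover a sparse "action" by penalized projection of the fitted values. In the sketch below, a ridge fit stands in for a full Bayesian shrinkage posterior mean (a simplifying assumption), and the L1 path supplies the candidate sparse actions.

```python
import numpy as np
from sklearn.linear_model import Ridge, lasso_path

# Sketch of the decoupling step: a ridge fit stands in for a Bayesian
# shrinkage posterior mean; selection is then a separate decision,
# obtained by an L1-penalized projection of the fitted values.
rng = np.random.default_rng(0)
n, p = 100, 20
X = rng.standard_normal((n, p))
beta_true = np.r_[2.0, -1.5, 1.0, np.zeros(p - 3)]
y = X @ beta_true + rng.standard_normal(n)

beta_bar = Ridge(alpha=1.0).fit(X, y).coef_     # stand-in posterior mean
y_fit = X @ beta_bar                            # posterior-mean fit
alphas, coefs, _ = lasso_path(X, y_fit)         # family of sparse actions
# Choose the sparsest action whose predictive degradation relative to
# y_fit is tolerable; selection is a decision, not part of the prior.
```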
Item Open Access Bayesian Density Regression With a Jump Discontinuity at a Given Threshold (2019) Zhang, Shuangjie
Standard regression discontinuity design usually concentrates on causal effects by assigning a threshold above or below which an intervention is assigned. By comparing the values of observations near the threshold, it is possible to design discontinuity structures that capture the discontinuity information. Instead of looking at the observations directly, this paper develops a Bayesian density regression model whose parameters are related to covariates, achieving a regression discontinuity design. The methodology is applied first to simulated data and then to outcomes of yes/no votes on a large number of company proposals. For Bayesian inference we adopt adaptive MCMC, importance sampling, and the Metropolis-adjusted Langevin algorithm. These techniques prove useful for estimating the parameters and for explaining how the covariates affect the magnitude of the discontinuity.
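Since the Metropolis-adjusted Langevin algorithm features in the computation above, here is a generic sketch of a single MALA step on a toy Gaussian target (the paper's actual posterior and gradients are not reproduced): the proposal drifts along the gradient of the log posterior, and the Metropolis correction accounts for the asymmetric proposal density.

```python
import numpy as np

def mala_step(theta, log_post, grad_log_post, eps, rng):
    # Langevin proposal: gradient drift plus Gaussian noise
    mu = theta + 0.5 * eps**2 * grad_log_post(theta)
    prop = mu + eps * rng.standard_normal(theta.shape)
    mu_rev = prop + 0.5 * eps**2 * grad_log_post(prop)
    # log proposal densities (up to the common Gaussian constant)
    log_q_fwd = -np.sum((prop - mu) ** 2) / (2 * eps**2)
    log_q_rev = -np.sum((theta - mu_rev) ** 2) / (2 * eps**2)
    log_acc = log_post(prop) - log_post(theta) + log_q_rev - log_q_fwd
    return prop if np.log(rng.random()) < log_acc else theta

# Toy target: standard bivariate Gaussian (illustrative assumption)
rng = np.random.default_rng(0)
theta = np.zeros(2)
for _ in range(1000):
    theta = mala_step(theta, lambda t: -0.5 * t @ t, lambda t: -t, 0.5, rng)
```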
Item Open Access Bayesian Hierarchical Models to Address Problems in Neuroscience and Economics (2017) Zaman, Azeem Zahid
In the first chapter, motivated by a model used to analyze spike train data, we present a method for learning multiple probability vectors by using information from large samples to improve estimates for smaller samples. The method makes use of Polya-gamma data augmentation to construct a conjugate model whose posterior can estimate the weights of a mixture distribution. This novel method successfully borrows information from large samples to increase the precision and accuracy of estimates for smaller samples.
In the second chapter, data from Federal Communications Commission spectrum auction number 73 are analyzed. By analyzing the structure of the auction, bounds are placed on the valuations that govern the bidders' decisions. With these bounds, common models are estimated by imputing valuations, and the results are compared with estimates from standard methods used in the literature. The comparison shows some important differences between the approaches. A second model that accounts for the geographic relationship between the licenses sold finds strong evidence of correlation between the values of adjacent licenses, as expected by economic theory.
Item Open Access Bayesian Models for Causal Analysis with Many Potentially Weak Instruments (2015) Jiang, Sheng
This paper investigates Bayesian instrumental variable models with many instruments. The number of instrumental variables grows with the sample size and is allowed to be much larger than the sample size. Under a sparsity condition on the coefficients of the instruments, we characterize a general prior specification for which posterior consistency of the parameters is established, and we calculate the corresponding convergence rate.
In particular, we show posterior consistency for a class of spike and slab priors on the many potentially weak instruments. The spike and slab prior shrinks the number of instrumental variables, which avoids overfitting and provides uncertainty quantification for the first stage. A simulation study illustrates the convergence notion and the estimation/selection performance under dependent instruments. Computational issues related to the Gibbs sampler are also discussed.
Item Open Access Computational Challenges to Bayesian Density Discontinuity Regression (2022) Zheng, Haoliang
Many policies subject an underlying continuous variable to an artificial cutoff. Agents may regulate the magnitude of the variable to stay on the preferred side of a known cutoff, which manifests as a jump discontinuity in the distribution of the variable at the cutoff value. In this paper, we present a statistical method to estimate the presence and magnitude of such jump discontinuities as functions of measured covariates.
For posterior computation, we use a Gibbs sampling scheme as the overall structure. Each iteration involves two layers of data augmentation: we first adopt the rejection-history strategy to remove an intractable integral, and then generate Pólya-Gamma latent variables to ease the computation (a generic sketch of this augmentation follows the abstract). Within the Gibbs sampler we implement algorithms including adaptive Metropolis, ensemble MCMC, and independent Metropolis for the individual parameters. We validate our method in a simulation study.
For the real data, we focus on a study of corporate proposal voting, where we encounter several computational challenges. We discuss the multimodality issue from two angles and, to address it, borrow the idea of parallel tempering, building an adaptive parallel-tempered version of our sampler. The results show that introducing tempering indeed improves the performance of the original sampler.
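Pólya-Gamma augmentation restores conditional conjugacy for logistic-type likelihoods. Below is a generic, hedged sketch of the Polson-Scott-Windle Gibbs sampler for plain Bayesian logistic regression (not this dissertation's full model); it assumes the third-party `polyagamma` package for the PG draws.

```python
import numpy as np
from polyagamma import random_polyagamma  # assumed: pip install polyagamma

def gibbs_logistic(X, y, n_iter=2000, prior_var=100.0):
    """Polson-Scott-Windle Gibbs sampler for logistic regression."""
    n, p = X.shape
    B0_inv = np.eye(p) / prior_var     # N(0, prior_var * I) prior on beta
    kappa = y - 0.5                    # y_i in {0, 1}
    beta = np.zeros(p)
    draws = np.empty((n_iter, p))
    rng = np.random.default_rng(0)
    for t in range(n_iter):
        omega = random_polyagamma(1, X @ beta)    # omega_i ~ PG(1, x_i'beta)
        V = np.linalg.inv(X.T @ (omega[:, None] * X) + B0_inv)
        m = V @ (X.T @ kappa)
        beta = rng.multivariate_normal(m, V)      # conjugate Gaussian update
        draws[t] = beta
    return draws
```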
Item Open Access Consistency and Adaptation of Gaussian Process Regression, Bayesian Stochastic Block Model and Tail Index (2021) Jiang, Sheng
Bayesian methods offer adaptive inference via hierarchical extensions, with uncertainty quantification arising automatically from the corresponding posterior distribution. Frequentist evaluation of Bayesian methods is therefore a fundamental and necessary step in Bayesian analysis.
Bayesian nonparametric regression under a rescaled Gaussian process prior offers smoothness-adaptive function estimation with near minimax-optimal error rates. Hierarchical extensions of this approach, equipped with stochastic variable selection, are known to also adapt to the unknown intrinsic dimension of a sparse true regression function. But it remains unclear whether such extensions offer variable selection consistency, i.e., whether the true subset of important variables can be consistently learned from the data. It is shown here that variable selection consistency may indeed be achieved with such models, at least when the true regression function has finite smoothness, which induces a polynomially larger penalty on the inclusion of false-positive predictors. Our result covers the high-dimensional asymptotic setting where the predictor dimension is allowed to grow with the sample size.
Stochastic Block Models (SBMs) are a fundamental tool for community detection in network analysis. But little theoretical work exists on the statistical performance of Bayesian SBMs, especially when the number of communities is unknown. This project studies weakly assortative SBMs, in which members of the same community are more likely to connect with one another than with members of other communities. The weak assortativity constraint is embedded within an otherwise weak prior, and, under mild regularity conditions, the resulting posterior distribution is shown to concentrate on the true number of communities and membership allocation as the network size grows to infinity. A reversible-jump Markov chain Monte Carlo posterior computation strategy is developed by adapting the allocation sampler. Finite-sample properties are examined via simulation studies, in which the proposed method offers competitive estimation accuracy relative to existing methods under a variety of challenging scenarios.
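As a concrete illustration of the weak assortativity constraint (each within-community connection probability exceeds the between-community probabilities in its row and column), here is a small simulated SBM and its Bernoulli log-likelihood; all sizes and probabilities are assumed for illustration.

```python
import numpy as np

# Simulate a weakly assortative SBM and evaluate its log-likelihood.
rng = np.random.default_rng(0)
K, n = 3, 120
z = rng.integers(0, K, size=n)              # community memberships
P = np.full((K, K), 0.05)                   # between-community probability
np.fill_diagonal(P, [0.30, 0.25, 0.20])     # within > between, per community
probs = P[z[:, None], z[None, :]]           # edge probabilities, n x n
A = (rng.random((n, n)) < probs).astype(int)
A = np.triu(A, 1)
A = A + A.T                                 # simple undirected graph

# Bernoulli log-likelihood of the adjacency matrix given (z, P)
iu = np.triu_indices(n, 1)
p_edge = probs[iu]
loglik = np.sum(A[iu] * np.log(p_edge) + (1 - A[iu]) * np.log1p(-p_edge))
```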
Tail index estimation has been well studied in the frequentist literature, but few asymptotic studies of Bayesian tail index estimation are available. This work considers a transformation-based semiparametric density model obtained by nonparametrically transforming a parametric CDF. The semiparametric density model offers both accurate density estimation and tail index estimation. Compared with frequentist methods, it avoids choosing a high quantile at which to threshold the data. We provide sufficient conditions on the parametric family and the logistic Gaussian process prior under which a posterior contraction rate for the tail index can be established. Limitations of the semiparametric density model are also discussed.
Item Open Access Continuous-Time Models of Arrival Times and Optimization Methods for Variable Selection (2018) Lindon, Michael Scott
This thesis naturally divides into two parts. The first two chapters concern the development of Bayesian semi-parametric models for arrival times. Chapter 2 considers Bayesian inference for a Gaussian process modulated temporal inhomogeneous Poisson point process, made challenging by an intractable likelihood. The intractable likelihood is circumvented by two novel data augmentation strategies which result in Gaussian measurements of the Gaussian process, connecting the model with a larger literature on modelling time-dependent functions, from Bayesian non-parametric regression to time series. A scalable state-space representation of the Matérn Gaussian process in one dimension is used to provide access to linear-time filtering algorithms for performing inference. An MCMC algorithm based on Gibbs sampling with slice-sampling steps is provided and illustrated on simulated and real datasets; it exhibits excellent mixing and scalability.
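The linear-time idea can be sketched concretely. The block below assumes a Matérn-3/2 kernel, whose two-dimensional stochastic-differential-equation representation is standard, and evaluates the GP marginal likelihood by Kalman filtering in O(n); hyperparameter and noise values are illustrative.

```python
import numpy as np
from scipy.linalg import expm

# Matern-3/2 state-space form: the GP is the first component of a
# 2-dimensional linear SDE, so inference on n time points costs O(n)
# via Kalman filtering instead of O(n^3) for a dense covariance solve.
def matern32_ss(ell, sigma2):
    lam = np.sqrt(3.0) / ell
    F = np.array([[0.0, 1.0], [-lam**2, -2.0 * lam]])
    Pinf = np.diag([sigma2, sigma2 * lam**2])   # stationary state covariance
    return F, Pinf

def kalman_loglik(t, y, ell=1.0, sigma2=1.0, noise_var=0.1):
    F, Pinf = matern32_ss(ell, sigma2)
    H = np.array([[1.0, 0.0]])
    m, P = np.zeros(2), Pinf.copy()
    loglik, prev = 0.0, t[0]
    for ti, yi in zip(t, y):
        A = expm(F * (ti - prev))               # exact discretization
        m = A @ m
        P = A @ P @ A.T + (Pinf - A @ Pinf @ A.T)
        S = (H @ P @ H.T)[0, 0] + noise_var     # innovation variance
        r = yi - (H @ m)[0]                     # innovation
        K = (P @ H.T) / S                       # Kalman gain
        m = m + K[:, 0] * r
        P = P - K @ H @ P
        loglik += -0.5 * (np.log(2 * np.pi * S) + r**2 / S)
        prev = ti
    return loglik
```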
Chapter 3 builds on the previous model to detect specific signals in temporal point patterns arising in neuroscience. The firing of a neuron over time in response to an external stimulus generates a temporal point pattern, or "spike train". Of special interest is how neurons encode information from two simultaneous external stimuli. Among many hypotheses is the presence of multiplexing: the neuron interleaves periods of firing as it would for each individual stimulus in isolation. Statistical models are developed to quantify evidence for a variety of experimental hypotheses, each of which translates to a particular form of intensity function for the dual-stimuli trials. The dual-stimuli intensity is modelled as a dynamic superposition of single-stimulus intensities, defined by a time-dependent weight function that is modelled non-parametrically as a transformed Gaussian process. Experiments on simulated data demonstrate that the model learns the weight function very well, but learns other model parameters, which have meaningful physical interpretations, less well.
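A toy illustration of the dynamic superposition follows; the intensity shapes and weight path below are invented for illustration, and a fixed curve stands in for the transformed Gaussian process that the thesis places on the weight function.

```python
import numpy as np
from scipy.special import expit

# Dual-stimulus intensity as a dynamic superposition of the two
# single-stimulus intensities, mixed by a time-varying weight in (0, 1).
t = np.linspace(0.0, 1.0, 500)
lam_A = 40 + 20 * np.sin(2 * np.pi * t)        # single-stimulus intensities
lam_B = 15 + 5 * np.cos(2 * np.pi * t)         # (assumed shapes)
g = 3 * np.sin(6 * np.pi * t)                  # stand-in for a GP path
alpha = expit(g)                               # transformed to (0, 1)
lam_AB = alpha * lam_A + (1 - alpha) * lam_B   # dual-stimulus intensity
```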
Chapters 4 and 5 concern mathematical optimization and theoretical properties of Bayesian models for variable selection. Such optimizations are challenging due to the non-convexity, non-smoothness and discontinuity of the objective. Chapter 4 presents advances in continuous optimization algorithms by relating mathematical and statistical approaches defined in connection with several iterative algorithms for penalized linear regression. I demonstrate the equivalence of parameter mappings using EM under several data augmentation strategies (location-mixture representations, orthogonal data augmentation and LQ design matrix decompositions) and show that these model-based approaches are equivalent to algorithmic derivations via proximal gradient methods. This provides new perspectives on model-based and algorithmic approaches, connects several research themes in optimization and statistics, and provides access, beyond EM, to relevant theory from the proximal gradient and convex analysis literatures.
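As a concrete instance of the proximal gradient connection, here is a minimal ISTA sketch for the lasso (illustrative, not the thesis's algorithms): each iteration takes a gradient step on the least-squares term and then applies the soft-thresholding proximal operator of the L1 penalty.

```python
import numpy as np

# ISTA: proximal gradient descent for the lasso objective
#   0.5 * ||y - X beta||^2 + lam * ||beta||_1
def ista(X, y, lam, step=None, n_iter=500):
    n, p = X.shape
    if step is None:
        step = 1.0 / np.linalg.norm(X, 2) ** 2     # 1 / Lipschitz constant
    beta = np.zeros(p)
    for _ in range(n_iter):
        grad = X.T @ (X @ beta - y)
        u = beta - step * grad                     # gradient step
        beta = np.sign(u) * np.maximum(np.abs(u) - step * lam, 0.0)  # prox
    return beta
```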
Chapter 5 presents a modern approach to discrete optimization for variable selection models through their formulation as mixed integer programming models. Mixed integer quadratic and quadratically constrained programs are developed for the point-mass-Laplace and g-priors. Combined with warm starts and optimality-based bound-tightening procedures provided by the heuristics of the previous chapter, the MIQP model developed for the point-mass-Laplace prior converges to global optimality in a matter of seconds for moderately sized real datasets. The obtained estimator is demonstrated to possess superior predictive performance over that obtained by cross-validated lasso on a number of real datasets. The MIQCP model for the g-prior struggles to match the performance of the former and highlights the fact that the performance of the mixed integer solver depends critically on the ability of the prior to rapidly concentrate posterior mass on good models.
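A hedged sketch of the big-M mixed integer quadratic program underlying best-subset-type selection follows; it illustrates the MIQP formulation in general, not the chapter's exact point-mass-Laplace program. It assumes cvxpy with a mixed-integer-capable solver (ECOS_BB here), and M and k are illustrative choices.

```python
import numpy as np
import cvxpy as cp

# Big-M MIQP: binary z_j switches coefficient beta_j on or off.
rng = np.random.default_rng(0)
n, p, k, M = 60, 12, 3, 10.0
X = rng.standard_normal((n, p))
beta_true = np.r_[3.0, -2.0, 1.5, np.zeros(p - 3)]
y = X @ beta_true + 0.5 * rng.standard_normal(n)

beta = cp.Variable(p)
z = cp.Variable(p, boolean=True)
constraints = [cp.abs(beta) <= M * z,     # beta_j = 0 unless z_j = 1
               cp.sum(z) <= k]            # at most k active predictors
problem = cp.Problem(cp.Minimize(cp.sum_squares(y - X @ beta)), constraints)
problem.solve(solver=cp.ECOS_BB)          # assumes a MIP-capable solver
print(np.round(beta.value, 2))
```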
Item Open Access Multiple-try Stochastic Search for Bayesian Variable Selection (2017) Chen, Xu
Variable selection is a key issue when analyzing high-dimensional data. The explosion of data with large sample size and dimensionality brings new challenges to this problem in both inference efficiency and computational complexity. To alleviate these problems, a scalable Markov chain Monte Carlo (MCMC) sampling algorithm is proposed by generalizing multiple-try Metropolis to the discrete model space and further incorporating neighborhood-based stochastic search. In this thesis, we study the behavior of this MCMC sampler in the "large p, small n" scenario where the number of predictors p is much greater than the number of observations n. Extensive numerical experiments, including simulated and real data examples, are provided to illustrate its performance, and choices of tuning parameters are discussed.
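A minimal sketch of multiple-try Metropolis on the discrete model space follows, under two stated assumptions that differ from the thesis's tuned sampler: a symmetric neighborhood proposal that flips one inclusion indicator, and a BIC-style score standing in for the log marginal likelihood of a model.

```python
import numpy as np

def score(gamma, X, y):
    # BIC-style stand-in for the log marginal likelihood of model gamma
    n = len(y)
    if gamma.sum() == 0:
        rss = np.sum((y - y.mean()) ** 2)
    else:
        Xg = X[:, gamma.astype(bool)]
        coef, *_ = np.linalg.lstsq(Xg, y, rcond=None)
        rss = np.sum((y - Xg @ coef) ** 2)
    return -0.5 * n * np.log(rss) - 0.5 * gamma.sum() * np.log(n)

def flip(gamma, rng):
    g = gamma.copy()
    g[rng.integers(len(g))] ^= 1          # flip one inclusion indicator
    return g

def mtm_step(gamma, X, y, k, rng):
    trials = [flip(gamma, rng) for _ in range(k)]           # k proposals
    w = np.array([score(g, X, y) for g in trials])
    pick = np.exp(w - w.max())
    pick /= pick.sum()
    cand = trials[rng.choice(k, p=pick)]                    # select one
    refs = [flip(cand, rng) for _ in range(k - 1)] + [gamma]
    wr = np.array([score(g, X, y) for g in refs])           # reference set
    log_acc = np.logaddexp.reduce(w) - np.logaddexp.reduce(wr)
    return cand if np.log(rng.random()) < log_acc else gamma

rng = np.random.default_rng(0)
n, p = 100, 30
X = rng.standard_normal((n, p))
y = X[:, :3] @ np.array([2.0, -2.0, 1.5]) + rng.standard_normal(n)
gamma = np.zeros(p, dtype=int)
for _ in range(500):
    gamma = mtm_step(gamma, X, y, k=5, rng=rng)
```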
Item Open Access On Bayesian Analyses of Functional Regression, Correlated Functional Data and Non-homogeneous Computer Models (2013) Montagna, Silvia
Current frontiers in complex stochastic modeling of high-dimensional processes include major emphases on so-called functional data: problems in which the data are snapshots of curves and surfaces representing fundamentally important scientific quantities. This thesis explores new Bayesian methodologies for functional data analysis.
The first part of the thesis places emphasis on the role of factor models in functional data analysis. Data reduction becomes mandatory when dealing with such high-dimensional data, more so when data are available on a large number of individuals. In Chapter 2 we present a novel Bayesian framework which employs a latent factor construction to represent each variable by a low dimensional summary. Further, we explore the important issue of modeling and analyzing the relationship of functional data with other covariate and outcome variables simultaneously measured on the same subjects.
The second part of the thesis is concerned with the analysis of circadian data. The focus is on the identification of circadian genes, that is, genes whose expression levels appear to be rhythmic through time with a period of approximately 24 hours. Most of the current literature addressing this goal does not account for potential dependence across genes. In Chapter 4, we propose a Bayesian approach which employs latent factors to accommodate dependence and to uncover patterns and relationships between genes, while representing the true gene expression trajectories in the Fourier domain to allow inference on the period, phase, and amplitude of the signal.
The third part of the thesis is concerned with the statistical analysis of computer models (simulators). The heavy computational demand of these input-output maps calls for statistical techniques that quickly estimate the output surface at untried inputs given a few preliminary runs of the simulator at a set of design points. In this regard, we propose a Bayesian methodology based on a non-stationary Gaussian process. Relying on a model-based assessment of uncertainty, we devise a sequential design technique that chooses the input points at which the simulator should be run so as to optimally minimize the uncertainty in posterior surface estimation. The proposed non-stationary approach adapts well to output surfaces of unconstrained shape.
Item Open Access Simulation Study on Exchangeability and Significant Test on Survey Data (2015) Cao, Yong
The two years of the Master of Science in Statistical and Economic Modeling program have been the most rewarding time of my life. This thesis acts as a portfolio of project and applied experience gained while enrolled in the program. It summarizes my graduate study in two parts: a Simulation Study of Exchangeability for Binary Data, and a Summary of my Summer Internship at the Center for Responsible Lending. The simulation study draws on a team project performed jointly by Sheng Jiang, Xuan Sun and me. Abstracts for both projects follow in order.
(1) Simulation Study of Exchangeability for Binary Data
To investigate tractable Bayesian tests of exchangeability, this project considers a special case of non-exchangeable random sequences: Markov chains. Asymptotic results for the Bayes factor (BF) are derived. When the null hypothesis is true, the Bayes factor in favor of the null goes to infinity at a geometric rate (provided the true success probability is not one half). When the null hypothesis is false, the Bayes factor in favor of the null goes to 0 faster than a geometric rate. The results are robust under misspecification. Simulation studies examine the performance of the test when the sample size is small and as prior beliefs and true parameters vary.
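In this setting the Bayes factor has a closed form. The sketch below assumes Beta(1, 1) priors on the Bernoulli success probability (under the null) and on each Markov transition probability (under the alternative), conditioning on the first observation under both models, so each marginal likelihood is a ratio of Beta functions.

```python
import numpy as np
from scipy.special import betaln

def log_bf_iid_vs_markov(x, a=1.0, b=1.0):
    # H0: i.i.d. Bernoulli; H1: first-order Markov chain.
    # Both marginal likelihoods condition on x[0].
    x = np.asarray(x)
    s, n = x[1:].sum(), len(x) - 1
    log_m0 = betaln(a + s, b + n - s) - betaln(a, b)
    log_m1 = 0.0
    for prev in (0, 1):                       # transitions out of each state
        nxt = x[1:][x[:-1] == prev]
        log_m1 += betaln(a + nxt.sum(), b + len(nxt) - nxt.sum()) - betaln(a, b)
    return log_m0 - log_m1                    # > 0 favors exchangeability

rng = np.random.default_rng(0)
print(log_bf_iid_vs_markov(rng.integers(0, 2, size=200)))  # i.i.d. data
```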
(2) Summary of Summer Internship at Center for Responsible Lending
My summer internship dealt with survey data about auto financing from Social Science Research Solutions. The dataset includes about one thousand valid responses and 114 variables per response. Exploratory statistical analysis uncovered many interesting findings; for example, African Americans and Latinos receive, on average, a 2.02% higher APR than white buyers, after accounting for the effects of relevant variables. In addition, Fisher's exact test was used extensively to assess the significance of a series of variables, with results presented in organized tables and findings included in weekly reports. One example finding is that warranty add-ons on a financed car have significant impacts on all three aspects of a loan: annual percentage rate, loan amount, and monthly payment.
Item Open Access Some Explorations of Bayesian Joint Quantile Regression (2017) Shi, Wenli
Although quantile regression provides a comprehensive and robust alternative to traditional mean regression, a complete estimation technique was long missing. The original approach of estimating each quantile level separately can cause severe problems, such as crossing quantile curves, which has obstructed its adoption in methodology and application. A novel joint Bayesian estimation of quantile regression has been proposed and serves as a thorough solution to this historical challenge. In this thesis, we first introduce this modeling technique and then present some preliminary but important theoretical developments on the posterior convergence rate of the joint estimation, which offer significant guidance toward the ultimate results. We provide the posterior convergence rate for the density estimation model induced by the joint quantile regression model. Furthermore, we prove the prior concentration condition for the truncated version of the joint quantile regression model and verify the entropy condition of the truncated model for any spherical predictor domain centered at zero. An application to high school math achievement is also presented, revealing a deep association between math achievement and socio-economic status. Further developments of the estimation technique, convergence rates and applications are discussed, and some suggestions on school choice for minority students are offered based on the application.
Item Open Access Statistical Analysis of Response Distribution for Dependent Data via Joint Quantile Regression (2021) Chen, Xu
Linear quantile regression is a powerful tool to investigate how predictors may affect a response heterogeneously across different quantile levels. Unfortunately, existing approaches find it extremely difficult to adjust for any dependency between observation units, largely because such methods are not based upon a fully generative model of the data. In this dissertation, we address this difficulty for analyzing spatial point-referenced data and hierarchical data. Several models are introduced by generalizing the joint quantile regression model of Yang and Tokdar (2017) and characterizing different dependency structures via a copula model on the underlying quantile levels of the observation units. A Bayesian semiparametric approach is introduced to perform inference of model parameters and carry out prediction. Multiple copula families are discussed for modeling response data with tail dependence and/or tail asymmetry. An effective model comparison criterion is provided for selecting between models with different combinations of sets of predictors, marginal base distributions and copula models.
Extensive simulation studies and real applications are presented to illustrate substantial gains of the proposed models in inference quality, prediction accuracy and uncertainty quantification over existing alternatives. Through case studies, we highlight that the proposed models admit great interpretability and are competent in offering insightful new discoveries of response-predictor relationship at non-central parts of the response distribution. The effectiveness of the proposed model comparison criteria is verified with both empirical and theoretical evidence.
Item Open Access Testing Between Different Types of Poisson Mixtures with Applications to Neuroscience (2019) Chen, Yunran
In a Bayesian framework, using a recursive algorithm called the predictive recursion marginal likelihood (PRML) algorithm, we propose a hypothesis test for distinguishing different types of stochastic order in mixture distributions (the PRML classifier) and a hypothesis test for screening out data with mixture distributions (the PRML filter). Of particular interest is the special case of testing between different types of Poisson mixtures, and testing a Poisson distribution against Poisson mixtures. The first testing procedure applies a Laplace approximation coupled with an optimization algorithm; it helps neuroscientists classify the activation patterns that a single neuron exhibits when preserving information from multiple stimuli. The second test aims to screen out over-dispersed data to boost the scientific information. Simulations show that the new classifier and filter outperform previous tests, especially for over-dispersed data. We apply the PRML classifier to inferior colliculus neurons screened by the PRML filter and show that the PRML classifier emphasizes second-order stochasticity. We present empirical evidence that the PRML filter helps avoid mistaking trial-to-trial variation for second-order stochasticity.
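The predictive recursion at the heart of PRML is compact. Below is a hedged sketch for the Poisson-mixture case, with the mixing distribution discretized on a rate grid and the marginal likelihood accumulated from the one-step-ahead predictive densities; the grid, weight sequence, and simulated data are illustrative assumptions.

```python
import numpy as np
from scipy.stats import poisson

def prml_poisson(x, grid):
    f = np.full(len(grid), 1.0 / len(grid))   # initial mixing weights
    log_ml = 0.0
    for i, xi in enumerate(x, start=1):
        kern = poisson.pmf(xi, grid)          # Poisson kernel at xi
        m = np.sum(kern * f)                  # predictive density of xi
        log_ml += np.log(m)                   # PRML accumulates these
        w = 1.0 / (i + 1)                     # decaying PR weight
        f = (1 - w) * f + w * kern * f / m    # predictive recursion update
    return f, log_ml

rng = np.random.default_rng(1)
lam = np.where(rng.random(300) < 0.5, 3.0, 12.0)   # two-component truth
x = rng.poisson(lam)
f_hat, log_ml = prml_poisson(x, grid=np.linspace(0.5, 25.0, 100))
```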