Browsing by Subject "Nonparametric"
Item Open Access A Data-Retaining Framework for Tail Estimation (2020) Cunningham, Erika
Modeling of extreme data often involves thresholding, or retaining only the most extreme observations, in order that the tail may "speak" and not be overwhelmed by the bulk of the data. We describe a transformation-based framework that allows univariate density estimation to smoothly transition from a flexible, semi-parametric estimation of the bulk into a parametric estimation of the tail without thresholding. In the limit, this framework has desirable theoretical tail-matching properties to the selected parametric distribution. We develop three Bayesian models under the framework: one using a logistic Gaussian process (LGP) approach; one using a Dirichlet process mixture model (DPMM); and one using a predictive recursion approximation of the DPMM. Models produce estimates and intervals for density, distribution, and quantile functions across the full data range and for the tail index (inverse-power-decay parameter), under an assumption of heavy tails. For each approach, we carry out a simulation study to explore the model's practical usage in non-asymptotic settings, comparing its performance to methods that involve thresholding.
Among the three models proposed, the LGP has the lowest bias through the bulk and the highest quantile interval coverage generally. Compared to thresholding methods, its tail predictions have lower root mean squared error (RMSE) in all but the most complicated scenarios, e.g. those with a sharp bulk-to-tail transition. The LGP's consistent underestimation of the tail index does not hinder tail estimation in pre-extrapolation to moderate-extrapolation regions but does affect extreme extrapolations.
An interplay between the parametric transform and the natural sparsity of the DPMM sometimes causes the DPMM to favor estimation of the bulk over estimation of the tail. This can be overcome by increasing prior precision on less sparse (flatter) base-measure density shapes. A finite mixture model (FMM), substituted for the DPMM in simulation, proves effective at reducing tail RMSE over thresholding methods in some, but not all, scenarios and quantile levels.
The predictive recursion marginal posterior (PRMP) model is fast and does the best job among proposed models of estimating the tail-index parameter. This allows it to reduce RMSE in extrapolation over thresholding methods in most scenarios considered. However, bias from the predictive recursion contaminates the tail, casting doubt on the PRMP's predictions in tail regions where data should still inform estimation. We recommend the PRMP model as a quick tool for visualizing the marginal posterior over transformation parameters, which can aid in diagnosing multimodality and informing the precision needed to overcome sparsity in the mixture model approach.
In summary, there is not enough information in the likelihood alone to prevent the bulk from overwhelming the tail. However, a model that harnesses the likelihood with a carefully specified prior can allow both the bulk and tail to speak without an explicit separation of the two. Moreover, retaining all of the data under this framework reduces quantile variability, improving prediction in the tails compared to methods that threshold.
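For contrast with the data-retaining framework above, the classical thresholding alternative can be sketched in a few lines. This is a minimal illustration of the Hill estimator of the tail index (one standard thresholding method, not one of the dissertation's models), run on simulated Pareto data with illustrative sample size and threshold choice:

```python
import math
import random

def hill_estimator(data, k):
    """Hill estimator of the tail index from the k largest observations.

    A classical thresholding approach: only the k most extreme points
    (the exceedances over the (k+1)-th largest value) inform the estimate.
    """
    xs = sorted(data, reverse=True)
    threshold = xs[k]                          # the (k+1)-th largest point
    mean_log_excess = sum(math.log(xs[i] / threshold) for i in range(k)) / k
    return 1.0 / mean_log_excess               # inverse-power-decay estimate

# Exact Pareto(alpha = 2) sample, P(X > x) = x^{-2}, drawn by inversion.
rng = random.Random(7)
sample = [(1.0 - rng.random()) ** (-0.5) for _ in range(20000)]
alpha_hat = hill_estimator(sample, k=500)      # should land near alpha = 2
```

Only the k retained exceedances inform the estimate; the framework above instead lets every observation contribute while a parametric transform governs the tail.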
Item Open Access Applications and Computation of Stateful Polya Trees (2017) Christensen, Jonathan
Polya trees are a class of nonparametric priors on distributions which are able to model absolutely continuous distributions directly, rather than modeling a discrete distribution over parameters of a mixing kernel to obtain an absolutely continuous distribution. The Polya tree discretizes the state space with a recursive partition, generating a distribution by assigning mass to the child elements at each level of the recursive partition according to a Beta distribution. Stateful Polya trees are an extension of the Polya tree where each set in the recursive partition has one or more discrete state variables associated with it. We can learn the posterior distributions of these state variables along with the posterior of the distribution. State variables may be of interest in their own right, or may be nuisance parameters which we use to achieve more flexible models but wish to integrate out in the posterior. We discuss the development of stateful Polya trees and present the Hierarchical Adaptive Polya Tree, which uses state variables to flexibly model the concentration parameter of Polya trees in a hierarchical Bayesian model. We also consider difficulties with the use of marginal likelihoods to determine posterior probabilities of states.
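The recursive Beta-splitting mechanism described above can be made concrete. The following toy draw from an ordinary (non-stateful) Polya tree on [0, 1] uses the common Beta(c·j², c·j²) depth scaling; the depth and concentration values are illustrative assumptions, not the thesis's settings:

```python
import random

def sample_polya_tree_density(depth=6, c=1.0, seed=0):
    """Draw one random density from a Polya tree prior on [0, 1].

    At level j, each interval splits its mass between its two children
    according to a Beta(c*j^2, c*j^2) draw; quadratically growing
    parameters are the standard choice that concentrates the prior on
    absolutely continuous distributions.
    """
    rng = random.Random(seed)
    masses = [1.0]                           # total mass of the root interval
    for level in range(1, depth + 1):
        a = c * level ** 2                   # Beta parameter at this depth
        children = []
        for m in masses:
            left = rng.betavariate(a, a)     # fraction of m sent to left child
            children += [m * left, m * (1.0 - left)]
        masses = children
    width = 1.0 / len(masses)                # 2**depth equal-width leaf bins
    return [m / width for m in masses]       # piecewise-constant density

density = sample_polya_tree_density()        # 2**6 = 64 density values
```

A stateful Polya tree would additionally attach discrete state variables to each interval of the partition and learn their posterior alongside the Beta splits.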
Item Open Access Bayesian Methods for Two-Sample Comparison (2015) Soriano, Jacopo
Two-sample comparison is a fundamental problem in statistics. Given two samples of data, the interest lies in understanding whether the two samples were generated by the same distribution or not. Traditional two-sample comparison methods are not suitable for modern data where the underlying distributions are multivariate and highly multi-modal, and the differences across the distributions are often locally concentrated. The focus of this thesis is to develop novel statistical methodology for two-sample comparison which is effective in such scenarios. Tools from the nonparametric Bayesian literature are used to flexibly describe the distributions. Additionally, the two-sample comparison problem is decomposed into a collection of local tests on individual parameters describing the distributions. This strategy not only yields high statistical power, but also allows one to identify the nature of the distributional difference. In many real-world applications, detecting the nature of the difference is as important as the existence of the difference itself. Generalizations to multi-sample comparison and more complex statistical problems, such as multi-way analysis of variance, are also discussed.
Item Open Access Bayesian Techniques for Adaptive Acoustic Surveillance (2010) Morton, Kenneth D
Automated acoustic sensing systems are required to detect, classify and localize acoustic signals in real-time. Despite the fact that humans are capable of performing acoustic sensing tasks with ease in a variety of situations, the performance of current automated acoustic sensing algorithms is limited by seemingly benign changes in environmental or operating conditions. In this work, a framework for acoustic surveillance that is capable of accounting for changing environmental and operational conditions is developed and analyzed. The algorithms employed in this work utilize non-stationary and nonparametric Bayesian inference techniques to allow the resulting framework to adapt to varying background signals and allow the system to characterize new signals of interest when additional information is available. The performance of each of the two stages of the framework is compared to existing techniques and superior performance of the proposed methodology is demonstrated. The algorithms developed operate on the time-domain acoustic signals in a nonparametric manner, thus enabling them to operate on other types of time-series data without the need to perform application-specific tuning. This is demonstrated in this work as the developed models are successfully applied, without alteration, to landmine signatures resulting from ground penetrating radar data. The nonparametric statistical models developed in this work for the characterization of acoustic signals may ultimately be useful not only in acoustic surveillance but also in other areas of acoustic sensing.
Item Open Access Continuous-Time Models of Arrival Times and Optimization Methods for Variable Selection (2018) Lindon, Michael Scott
This thesis naturally divides itself into two sections. The first two chapters concern the development of Bayesian semi-parametric models for arrival times. Chapter 2 considers Bayesian inference for a Gaussian process modulated temporal inhomogeneous Poisson point process, made challenging by an intractable likelihood. The intractable likelihood is circumvented by two novel data augmentation strategies which result in Gaussian measurements of the Gaussian process, connecting the model with a larger literature on modelling time-dependent functions, from Bayesian non-parametric regression to time series. A scalable state-space representation of the Matérn Gaussian process in one dimension is used to provide access to linear-time filtering algorithms for performing inference. An MCMC algorithm based on Gibbs sampling with slice-sampling steps is provided and illustrated on simulated and real datasets. The MCMC algorithm exhibits excellent mixing and scalability.
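As background for the point-process model in Chapter 2, arrival times from an inhomogeneous Poisson process with a known intensity can be simulated by Lewis-Shedler thinning; the smooth sinusoidal intensity below is an illustrative stand-in for a GP-modulated one, not the thesis's model:

```python
import math
import random

def thin_inhomogeneous_poisson(intensity, t_max, lam_max, seed=1):
    """Simulate arrival times on (0, t_max] from an inhomogeneous Poisson
    process by Lewis-Shedler thinning: generate a homogeneous process at
    rate lam_max, then keep each candidate time t with probability
    intensity(t) / lam_max.  Requires intensity(t) <= lam_max everywhere."""
    rng = random.Random(seed)
    arrivals, t = [], 0.0
    while True:
        t += rng.expovariate(lam_max)        # next candidate arrival gap
        if t > t_max:
            return arrivals
        if rng.random() < intensity(t) / lam_max:
            arrivals.append(t)               # accepted (thinned-in) arrival

# Illustrative smooth intensity, standing in for a GP-modulated one.
rate = lambda t: 5.0 + 4.0 * math.sin(t)
times = thin_inhomogeneous_poisson(rate, t_max=10.0, lam_max=9.0)
```

Inference reverses this generative direction: given the observed arrival times, the chapter's data augmentation strategies recover tractable Gaussian measurements of the latent intensity.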
Chapter 3 builds on the previous model to detect specific signals in temporal point patterns arising in neuroscience. The firing of a neuron over time in response to an external stimulus generates a temporal point pattern or ``spike train''. Of special interest is how neurons encode information from dual simultaneous external stimuli. Among many hypotheses is the presence of multiplexing: interleaving periods of firing as the neuron would for each individual stimulus in isolation. Statistical models are developed to quantify evidence for a variety of experimental hypotheses. Each experimental hypothesis translates to a particular form of intensity function for the dual stimuli trials. The dual stimuli intensity is modelled as a dynamic superposition of single stimulus intensities, defined by a time-dependent weight function that is modelled non-parametrically as a transformed Gaussian process. Experiments on simulated data demonstrate that the model learns the weight function very well, but recovers other model parameters, which have meaningful physical interpretations, less well.
Chapters 4 and 5 concern mathematical optimization and theoretical properties of Bayesian models for variable selection. Such optimizations are challenging due to non-convexity, non-smoothness and discontinuity of the objective. Chapter 4 presents advances in continuous optimization algorithms based on relating mathematical and statistical approaches defined in connection with several iterative algorithms for penalized linear regression. I demonstrate the equivalence of parameter mappings using EM under several data augmentation strategies: location-mixture representations, orthogonal data augmentation and LQ design matrix decompositions. I show that these model-based approaches are equivalent to algorithmic derivation via proximal gradient methods. This provides new perspectives on model-based and algorithmic approaches, connects across several research themes in optimization and statistics, and provides access, beyond EM, to relevant theory from the proximal gradient and convex analysis literatures.
Chapter 5 presents a modern approach to discrete optimization for variable selection models through their formulation as mixed integer programming models. Mixed integer quadratic and quadratically constrained programs are developed for the point-mass Laplace prior and the g-prior. Combined with warm starts and optimality-based bounds-tightening procedures provided by the heuristics of the previous chapter, the MIQP model developed for the point-mass Laplace prior converges to global optimality in a matter of seconds for moderately sized real datasets. The obtained estimator is demonstrated to possess superior predictive performance over that obtained by cross-validated lasso on a number of real datasets. The MIQCP model for the g-prior struggles to match the performance of the former and highlights the fact that the performance of the mixed integer solver depends critically on the ability of the prior to rapidly concentrate posterior mass on good models.
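The discrete objective behind these mixed integer programs is penalized least squares with an L0 penalty. A toy sketch solves it by exhaustive enumeration, under the simplifying (and strong) assumption of orthonormal design columns so each support's OLS fit reduces to per-column projections; the MIQP formulations in Chapter 5 attack the same objective for general designs at scale:

```python
import itertools

def best_subset_orthonormal(X, y, lam):
    """Minimise ||y - X beta||^2 + lam * ||beta||_0 by brute force.

    Assumes the columns of X are orthonormal, so the OLS fit on any
    support reduces to per-column projections <x_j, y>.  Enumeration is
    only feasible for tiny p; mixed integer programming handles the same
    discrete objective for realistic problem sizes.
    """
    n, p = len(X), len(X[0])
    proj = [sum(X[i][j] * y[i] for i in range(n)) for j in range(p)]
    yy = sum(v * v for v in y)
    best_obj, best_support = float("inf"), ()
    for k in range(p + 1):
        for s in itertools.combinations(range(p), k):
            rss = yy - sum(proj[j] ** 2 for j in s)   # orthonormal-design RSS
            if rss + lam * k < best_obj:
                best_obj, best_support = rss + lam * k, s
    return best_support

# Toy design: first three standard basis vectors of R^4; two strong signals.
X = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0], [0.0, 0.0, 0.0]]
y = [3.0, 2.0, 0.1, 0.0]
support = best_subset_orthonormal(X, y, lam=1.0)      # -> (0, 1)
```

The L0 penalty prices each nonzero coefficient at lam, so the weak third signal (projection 0.1) is dropped while the two strong ones are kept.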
Item Open Access Dependent Hierarchical Bayesian Models for Joint Analysis of Social Networks and Associated Text (2012) Wang, Eric Xun
This thesis presents spatially and temporally dependent hierarchical Bayesian models for the analysis of social networks and associated textual data. Social network analysis has received significant recent attention and has been applied to fields as varied as analysis of Supreme Court votes, Congressional roll call data, and inferring links between authors of scientific papers. In many traditional social network analysis models, temporal and spatial dependencies are not considered due to computational difficulties, even though such dependencies often play a significant role in the underlying generative process of the observed social network data.
Thus motivated, this thesis presents four new models that consider spatial and/or temporal dependencies and (when available) the associated text. The first is a time-dependent (dynamic) relational topic model that models nodes by their relevant documents and uses a probit regression construction to map topic overlap between nodes to a link. The second is a factor model with dynamic random effects that is used to analyze the voting patterns of the United States Supreme Court. The last two models present the primary contribution of this thesis: two spatially and temporally dependent models that jointly analyze legislative roll call data and their associated legislative text. These models introduce a new paradigm for social network factor analysis: predicting new columns (or rows) of matrices from the text. The first uses a nonparametric joint clustering approach to link the factor and topic models while the second uses a text regression construction. Finally, two other models on analysis of and tracking in video are also presented and discussed.
Item Open Access Essays on the Econometrics of Option Prices (2014) Vogt, Erik
This dissertation develops new econometric techniques for use in estimating and conducting inference on parameters that can be identified from option prices. The techniques in question extend the existing literature in financial econometrics along several directions.
The first essay considers the problem of estimating and conducting inference on the term structures of a class of economically interesting option portfolios. The option portfolios of interest play the role of functionals on an infinite-dimensional parameter (the option surface indexed by the term structure of state-price densities) that is well-known to be identified from option prices. Admissible functionals in the essay are generalizations of the VIX volatility index, which represent weighted integrals of option prices at a fixed maturity. By forming portfolios for various maturities, one can study their term structure. However, an important econometric difficulty that must be addressed is the illiquidity of options at longer maturities, which the essay overcomes by proposing a new nonparametric framework that takes advantage of asset pricing restrictions to estimate a shape-conforming option surface. In a second stage, the option portfolios of interest are cast as functionals of the estimated option surface, which then gives rise to a new, asymptotic distribution theory for option portfolios. The distribution theory is used to quantify the estimation error induced by computing integrated option portfolios from a sample of noisy option data. Moreover, by relying on the method of sieves, the framework is nonparametric, adheres to economic shape restrictions for arbitrary maturities, yields closed-form option prices, and is easy to compute. The framework also permits the extraction of the entire term structure of risk-neutral distributions in closed-form. Monte Carlo simulations confirm the framework's performance in finite samples. An application to the term structure of the synthetic variance swap portfolio finds sizeable uncertainty around the swap's true fair value, particularly when the variance swap is synthesized from noisy long-maturity options.
A nonparametric investigation into the term structure of the variance risk premium finds growing compensation for variance risk at long maturities.
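To fix ideas about the VIX-style functionals in the first essay, a weighted integral of out-of-the-money option prices at one maturity can be computed on a discrete strike grid. The toy quotes below are exact Black-Scholes prices at a flat 20% volatility (illustrative inputs, not the essay's estimator or data), so the discretized portfolio should approximately recover sigma squared:

```python
import math

def bs_price(F, K, T, sigma, call):
    """Black-Scholes price of an option on a forward F (zero rates)."""
    sd = sigma * math.sqrt(T)
    d1 = (math.log(F / K) + 0.5 * sd * sd) / sd
    d2 = d1 - sd
    N = lambda z: 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))
    return F * N(d1) - K * N(d2) if call else K * N(-d2) - F * N(-d1)

def vix_style_portfolio(F, T, strikes, quotes):
    """Discretised VIX-type functional (2/T) * sum_i dK_i / K_i^2 * Q(K_i),
    a weighted integral of out-of-the-money prices at a single maturity."""
    total, n = 0.0, len(strikes)
    for i in range(n):
        lo = strikes[i - 1] if i > 0 else strikes[0]
        hi = strikes[i + 1] if i < n - 1 else strikes[n - 1]
        dK = (hi - lo) / 2.0                   # midpoint strike spacing
        total += dK / strikes[i] ** 2 * quotes[i]
    return 2.0 / T * total

# Exact Black-Scholes quotes at a flat 20% volatility: the portfolio
# should then recover sigma^2 up to discretisation and truncation error.
F, T, sigma = 100.0, 0.5, 0.2
strikes = [20.0 + 0.5 * i for i in range(561)]            # 20, 20.5, ..., 300
quotes = [bs_price(F, K, T, sigma, call=(K >= F)) for K in strikes]
implied_var = vix_style_portfolio(F, T, strikes, quotes)  # approx 0.04
```

With noisy real quotes and sparse long-maturity strikes, this plug-in sum inherits the quote noise; quantifying that estimation error is precisely what the essay's sieve-based distribution theory addresses.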
The second essay, which represents joint work with Jia Li, proposes an econometric framework for inference on parametric option pricing models with two novel features. First, point identification is not assumed. The lack of identification arises naturally when a researcher only has interval observations on option quotes rather than on the efficient option price itself, which implies that the parameters of interest are only partially identified by observed option prices. This issue is solved by adopting a moment inequality approach. Second, the essay imposes no-arbitrage restrictions between the risk-neutral and the physical measures by nonparametrically estimating quantities that are invariant to changes of measures using high-frequency returns data. Theoretical justification for this framework is provided and is based on an asymptotic setting in which the sampling interval of high-frequency returns goes to zero as the sampling span goes to infinity. Empirically, the essay shows that inference on risk-neutral parameters becomes much more conservative once the assumption of identification is relaxed. At the same time, however, the conservative inference approach yields new and interesting insights into how option model parameters are related. Finally, the essay shows how the informativeness of the inference can be restored with the use of high-frequency observations on the underlying.
The third essay applies the sieve estimation framework developed in this dissertation to estimate a weekly time series of the risk-neutral return distribution's quantiles. Analogous quantiles for the objective-measure distribution are estimated using available methods in the literature for forecasting conditional quantiles from historical data. The essay documents the time-series properties for a range of return quantiles under each measure and further compares the difference between matching return quantiles. This difference is shown to correspond to a risk premium on binary options that pay off when the underlying asset moves below a given quantile. A brief empirical study shows asymmetric compensation for these return risk premia across different quantiles of the conditional return distribution.
Item Open Access Exploiting Big Data in Logistics Risk Assessment via Bayesian Nonparametrics (2014) Shang, Yan
In cargo logistics, a key performance measure is transport risk, defined as the deviation of the actual arrival time from the planned arrival time. Neither earliness nor tardiness is desirable for the customer and freight forwarder. In this paper, we investigate ways to assess and forecast transport risks using a half-year of air cargo data, provided by a leading forwarder on 1336 routes served by 20 airlines. Interestingly, our preliminary data analysis shows a strong multimodal feature in the transport risks, driven by unobserved events, such as cargo missing flights. To accommodate this feature, we introduce a Bayesian nonparametric model -- the probit stick-breaking process (PSBP) mixture model -- for flexible estimation of the conditional (i.e., state-dependent) density function of transport risk. We demonstrate that using simpler methods, such as OLS linear regression, can lead to misleading inferences. Our model provides a tool for the forwarder to offer customized price and service quotes. It can also generate baseline airline performance to enable fair supplier evaluation. Furthermore, the method allows us to separate recurrent risks from disruption risks. This is important, because hedging strategies for these two kinds of risks are often drastically different.
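The stick-breaking construction underlying the PSBP can be sketched in isolation: component k receives a fraction Phi(alpha_k(x)) of whatever stick length remains, with the alpha_k(x) depending on covariates. The linear alpha_k(x) below are invented placeholders for the regression functions the model would actually fit:

```python
import math

def probit_sticks(alphas):
    """Probit stick-breaking: component k takes a fraction Phi(alpha_k) of
    the stick remaining after the components before it; the final
    component absorbs whatever is left, so the weights sum to one."""
    Phi = lambda z: 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))
    weights, remaining = [], 1.0
    for a in alphas[:-1]:
        piece = Phi(a)
        weights.append(remaining * piece)
        remaining *= 1.0 - piece
    weights.append(remaining)
    return weights

# Hypothetical covariate-dependent stick parameters alpha_k(x), linear in
# a transport-risk covariate x; placeholders for fitted regressions.
x = 1.5
alphas = [0.4 - 0.8 * x, -0.2 + 0.3 * x, 0.0]   # last entry is unused
w = probit_sticks(alphas)                        # mixture weights at this x
```

Because the weights vary with x, the implied mixture density is state-dependent, which is how the model captures the multimodal transport risks described above.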
Item Open Access Mixture of Quantile Regression (2024) Ti, Tze Hong
Quantile Regression (QR) is a potent statistical technique enabling the estimation of conditional quantiles within a distribution. However, its application to dependent data units remains a challenging endeavor. In this paper, we introduce a novel approach aimed at extending the utility of quantile regression to correlated data structures. Leveraging the Ewens-Pitman Attraction (EPA) distribution, we propose a non-parametric mixture model of Quantile Regression, facilitating the application of quantile regression to dependent/clustered data. Through experiments conducted on both synthetic and real-world datasets, we demonstrate the efficacy of our model in uncovering latent clusters embedded within the data while accommodating the fitting of adaptive quantiles to capture diverse patterns. Our findings underscore the versatility and effectiveness of the proposed framework, offering a promising avenue for the analysis of correlated data structures through quantile regression methodologies.
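The QR building block of the proposed mixture is estimation under the check (pinball) loss, whose minimizing constant is the tau-quantile. A small self-contained illustration (the EPA mixture layer itself is not shown):

```python
def pinball_loss(u, tau):
    """Check loss rho_tau: residual u = y - q costs tau*u if positive
    (underprediction) and (tau - 1)*u if negative (overprediction)."""
    return tau * u if u >= 0.0 else (tau - 1.0) * u

def best_constant(ys, tau, grid):
    """The constant minimising average check loss is the tau-quantile."""
    return min(grid, key=lambda q: sum(pinball_loss(y - q, tau) for y in ys))

ys = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 10.0]
grid = [0.5 * i for i in range(25)]           # candidates 0, 0.5, ..., 12
q75 = best_constant(ys, tau=0.75, grid=grid)  # the 0.75-quantile: 8.0
```

Replacing the constant q with a linear function of covariates gives quantile regression proper; the paper's mixture then lets each latent cluster carry its own quantile fit.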
Item Open Access Modeling Temporal and Spatial Data Dependence with Bayesian Nonparametrics (2010) Ren, Lu
In this thesis, temporal and spatial dependence are considered within nonparametric priors to help infer patterns, clusters or segments in data. In traditional nonparametric mixture models, observations are usually assumed exchangeable, even though dependence associated with the space or time at which data are generated often exists.
Focused on model-based clustering and segmentation, this thesis addresses this issue in different ways for temporal and spatial dependence.
For sequential data analysis, the dynamic hierarchical Dirichlet process is proposed to capture the temporal dependence across different groups. The data collected at any time point are represented via a mixture associated with an appropriate underlying model; the statistical properties of data collected at consecutive time points are linked via a random parameter that controls their probabilistic similarity. The new model favors a smooth evolutionary clustering while allowing innovative patterns to be inferred. Experimental analysis is performed on music data; the approach may also be employed on text data for learning topics.
Spatially dependent data are more challenging to model due to their spatial grid structure and the often large computational cost of analysis. As a non-parametric clustering prior, the logistic stick-breaking process introduced here encodes the belief that proximate data are more likely to be clustered together. Multiple logistic regression functions generate a set of sticks, with each dominating a spatially localized segment. The proposed model is employed on image segmentation and speaker diarization, yielding generally homogeneous segments with sharp boundaries.
In addition, we consider multi-task learning in which each task is associated with spatial dependence. For the specific application of co-segmentation of multiple images, a hierarchical Bayesian model called H-LSBP is proposed. By sharing the same mixture atoms across images, the model infers the similarity between each pair of images, and hence can be employed for image sorting.
Item Open Access Scalable Nonparametric Bayes Learning (2013) Banerjee, Anjishnu
Capturing high dimensional complex ensembles of data is becoming commonplace in a variety of application areas. Some examples include biological studies exploring relationships between genetic mutations and diseases, atmospheric and spatial data, and internet usage and online behavioral data. These large complex data present many challenges in their modeling and statistical analysis. Motivated by high dimensional data applications, in this thesis, we focus on building scalable Bayesian nonparametric regression algorithms and on developing models for joint distributions of complex object ensembles.
We begin with a scalable method for Gaussian process regression, a commonly used tool for nonparametric regression, prediction and spatial modeling. A very common bottleneck for large data sets is the need for repeated inversions of a big covariance matrix, which is required for likelihood evaluation and inference. Such inversion can be practically infeasible and, even if implemented, highly numerically unstable. We propose an algorithm utilizing random projection ideas to construct flexible, computationally efficient and easy-to-implement approaches for generic scenarios. We then further improve the algorithm by incorporating structure and blocking ideas in our random projections and demonstrate their applicability in other contexts requiring inversion of large covariance matrices. We show theoretical guarantees for performance as well as substantial improvements over existing methods with simulated and real data. A byproduct of this work is the discovery of hitherto unknown equivalences between approaches in machine learning, randomized linear algebra and Bayesian statistics. Finally, we connect random projection methods for high-dimensional predictors and large sample sizes under a unifying theoretical framework.
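The random projection idea can be illustrated at its simplest: a Gaussian sketching matrix approximately preserves pairwise distances (the Johnson-Lindenstrauss lemma), which is what lets computations on a large covariance matrix be replaced by computations on a much smaller sketch. The dimensions and seeds below are arbitrary illustrative choices:

```python
import math
import random

def random_projection(points, m, seed=0):
    """Multiply each point by a random Gaussian matrix scaled by 1/sqrt(m).

    By the Johnson-Lindenstrauss lemma, the m-dimensional sketch
    approximately preserves pairwise Euclidean distances between the
    original d-dimensional points."""
    rng = random.Random(seed)
    d = len(points[0])
    R = [[rng.gauss(0.0, 1.0) / math.sqrt(m) for _ in range(d)]
         for _ in range(m)]
    return [[sum(row[j] * p[j] for j in range(d)) for row in R]
            for p in points]

def dist(a, b):
    return math.sqrt(sum((u - v) ** 2 for u, v in zip(a, b)))

rng = random.Random(42)
pts = [[rng.gauss(0.0, 1.0) for _ in range(500)] for _ in range(5)]
sketch = random_projection(pts, m=200)       # 500 dims compressed to 200
ratio = dist(sketch[0], sketch[1]) / dist(pts[0], pts[1])   # close to 1
```

The thesis's algorithms apply structured and blocked versions of such projections to covariance matrices, so that likelihood evaluations invert a small sketched matrix rather than the full one.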
The other focus of this thesis is joint modeling of complex ensembles of data from different domains. This goes beyond traditional relational modeling of ensembles of one type of data and relies on probability mixing measures over tensors. These models have added flexibility over some existing product mixture model approaches in letting each component of the ensemble have its own dependent cluster structure. We further investigate the question of measuring dependence between variables of different types and propose a very general novel scaled measure based on divergences between the joint and marginal distributions of the objects. Once again, we show excellent performance in both simulated and real data scenarios.
Item Open Access Some Advances in Nonparametric Statistics (2023) Zhu, Yichen
Nonparametric statistics is an important branch of statistics that utilizes infinite-dimensional models to achieve great flexibility. However, such flexibility often comes with difficulties in computation and convergence properties. One approach is to study the natural patterns of one type of dataset and summarize such patterns into mathematical assumptions that can potentially provide computational and theoretical benefits. I carried out the above idea on three different problems. The first problem is classification trees for imbalanced datasets, where I formulate the regularity of shapes into a surface-to-volume ratio and develop satisfactory theory and methodology using this ratio. The second problem is the approximation of Gaussian processes, where I observe the critical role of spatially decaying covariance functions in Gaussian process approximations and use such decay properties to bound the approximation error of my proposed method. The last problem is posterior contraction rates in Kullback-Leibler (KL) divergence, where I am motivated by the mismatch between KL divergence and Hellinger distance and develop a posterior contraction theory based entirely on KL divergence.
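The KL-Hellinger mismatch motivating the third problem can be made precise. With the convention that the squared Hellinger distance is H^2(P,Q) = 1 - \int \sqrt{pq}\, d\mu, Jensen's inequality gives a one-sided bound only:

```latex
H^2(P,Q) \;=\; 1 - \int \sqrt{pq}\, d\mu
\;\le\; -\log \int \sqrt{pq}\, d\mu
\;\le\; \tfrac{1}{2}\, \mathrm{KL}(P \,\|\, Q).
```

No reverse inequality holds: KL(P||Q) is infinite whenever P puts mass where Q has none, even when H^2 is small, so posterior contraction in Hellinger distance does not by itself imply contraction in KL; this is why a contraction theory built directly on KL divergence is needed.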