# Browsing by Author "Tokdar, Surya T"

Item Open Access A Data-Retaining Framework for Tail Estimation (2020) Cunningham, Erika

Modeling of extreme data often involves thresholding, or retaining only the most extreme observations, in order that the tail may "speak" and not be overwhelmed by the bulk of the data. We describe a transformation-based framework that allows univariate density estimation to smoothly transition from a flexible, semi-parametric estimation of the bulk into a parametric estimation of the tail without thresholding. In the limit, this framework has desirable theoretical tail-matching properties to the selected parametric distribution. We develop three Bayesian models under the framework: one using a logistic Gaussian process (LGP) approach; one using a Dirichlet process mixture model (DPMM); and one using a predictive recursion approximation of the DPMM. Models produce estimates and intervals for density, distribution, and quantile functions across the full data range and for the tail index (inverse-power-decay parameter), under an assumption of heavy tails. For each approach, we carry out a simulation study to explore the model's practical usage in non-asymptotic settings, comparing its performance to methods that involve thresholding.

Among the three models proposed, the LGP has lowest bias through the bulk and highest quantile interval coverage generally. Compared to thresholding methods, its tail predictions have lower root mean squared error (RMSE) in all scenarios but the most complicated, e.g. a sharp bulk-to-tail transition. The LGP's consistent underestimation of the tail index does not hinder tail estimation in pre-extrapolation to moderate-extrapolation regions but does affect extreme extrapolations.

An interplay between the parametric transform and the natural sparsity of the DPMM sometimes causes the DPMM to favor estimation of the bulk over estimation of the tail. This can be overcome by increasing prior precision on less sparse (flatter) base-measure density shapes. A finite mixture model (FMM), substituted for the DPMM in simulation, proves effective at reducing tail RMSE over thresholding methods in some, but not all, scenarios and quantile levels.

The predictive recursion marginal posterior (PRMP) model is fast and does the best job among proposed models of estimating the tail-index parameter. This allows it to reduce RMSE in extrapolation over thresholding methods in most scenarios considered. However, bias from the predictive recursion contaminates the tail, casting doubt on the PRMP's predictions in tail regions where data should still inform estimation. We recommend the PRMP model as a quick tool for visualizing the marginal posterior over transformation parameters, which can aid in diagnosing multimodality and informing the precision needed to overcome sparsity in the mixture model approach.

In summary, there is not enough information in the likelihood alone to prevent the bulk from overwhelming the tail. However, a model that harnesses the likelihood with a carefully specified prior can allow both the bulk and tail to speak without an explicit separation of the two. Moreover, retaining all of the data under this framework reduces quantile variability, improving prediction in the tails compared to methods that threshold.
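
For readers unfamiliar with the thresholding methods this abstract compares against, the following is a minimal sketch of one classical thresholding estimator of the tail index, the Hill estimator. It is not the framework proposed in the thesis; the function name `hill_tail_index` and the choice of `k` are illustrative assumptions. It shows how only the `k` largest observations enter the estimate, with the rest of the data discarded.

```python
import math
import random


def hill_tail_index(data, k):
    """Hill estimate of the tail index from the k largest observations.

    Only the top k order statistics are retained (the thresholding step);
    everything at or below the (k+1)-th largest value is discarded.
    """
    xs = sorted(data)
    n = len(xs)
    if not 0 < k < n:
        raise ValueError("k must satisfy 0 < k < len(data)")
    threshold = xs[n - k - 1]  # the (k+1)-th largest value
    if threshold <= 0:
        raise ValueError("threshold must be positive to take logarithms")
    exceedances = xs[n - k:]  # the k largest values
    # Average log-excess over the threshold estimates 1 / tail index.
    mean_log_excess = sum(math.log(x / threshold) for x in exceedances) / k
    return 1.0 / mean_log_excess


if __name__ == "__main__":
    rng = random.Random(0)
    # Synthetic Pareto data with true tail index 2 (illustrative only).
    sample = [rng.paretovariate(2.0) for _ in range(20000)]
    print(hill_tail_index(sample, k=1000))
```

The estimate depends on `k`, which is the kind of sensitivity to threshold choice that motivates retaining all of the data.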

Item Open Access Computational Challenges to Bayesian Density Discontinuity Regression (2022) Zheng, Haoliang

Many policies subject an underlying continuous variable to an artificial cutoff. Agents may regulate the magnitude of the variable to stay on the preferred side of a known cutoff, which results in a jump discontinuity in the distribution of the variable at the cutoff value. In this paper, we present a statistical method to estimate the presence and magnitude of such jump discontinuities as functions of measured covariates.

For the posterior computation of our model, we use a Gibbs sampling scheme as the overall structure. For each iteration, we have two layers of data augmentation. We first adopt the rejection history strategy to remove the intractable integral and then generate Pólya-Gamma latent variables to ease the computation. We implement algorithms including adaptive Metropolis, ensemble MCMC, and independent Metropolis for each parameter within the Gibbs sampler. We justify our method based on the simulation study.

As for the real data, we focus on a study of corporate proposal voting, where we encounter several computational challenges. We discuss the multimodality issue from two aspects. In an effort to solve this problem, we borrow the idea from parallel tempering. We build an adaptive parallel tempered version of our sampler. The result shows that introducing the tempering method indeed improves the performance of our original sampler.

Item Open Access Continuous-Time Models of Arrival Times and Optimization Methods for Variable Selection (2018) Lindon, Michael Scott

This thesis naturally divides itself into two sections. The first two chapters concern the development of Bayesian semi-parametric models for arrival times. Chapter 2 considers Bayesian inference for a Gaussian process modulated temporal inhomogeneous Poisson point process, made challenging by an intractable likelihood. The intractable likelihood is circumvented by two novel data augmentation strategies which result in Gaussian measurements of the Gaussian process, connecting the model with a larger literature on modelling time-dependent functions from Bayesian non-parametric regression to time series. A scalable state-space representation of the Matérn Gaussian process in 1 dimension is used to provide access to linear time filtering algorithms for performing inference. An MCMC algorithm based on Gibbs sampling with slice-sampling steps is provided and illustrated on simulated and real datasets. The MCMC algorithm exhibits excellent mixing and scalability.

Chapter 3 builds on the previous model to detect specific signals in temporal point patterns arising in neuroscience. The firing of a neuron over time in response to an external stimulus generates a temporal point pattern or "spike train". Of special interest is how neurons encode information from dual simultaneous external stimuli. Among many hypotheses is the presence of multiplexing: the neuron interleaves periods of firing as it would for each individual stimulus in isolation. Statistical models are developed to quantify evidence for a variety of experimental hypotheses. Each experimental hypothesis translates to a particular form of intensity function for the dual stimuli trials. The dual stimuli intensity is modelled as a dynamic superposition of single stimulus intensities, defined by a time-dependent weight function that is modelled non-parametrically as a transformed Gaussian process. Experiments on simulated data demonstrate that the model learns the weight function very well, but learns other model parameters with meaningful physical interpretations less well.

Chapters 4 and 5 concern mathematical optimization and theoretical properties of Bayesian models for variable selection. Such optimizations are challenging due to non-convexity, non-smoothness and discontinuity of the objective. Chapter 4 presents advances in continuous optimization algorithms based on relating mathematical and statistical approaches defined in connection with several iterative algorithms for penalized linear regression. I demonstrate the equivalence of parameter mappings using EM under several data augmentation strategies - location-mixture representations, orthogonal data augmentation and LQ design matrix decompositions. I show that these model-based approaches are equivalent to algorithmic derivation via proximal gradient methods. This provides new perspectives on model-based and algorithmic approaches, connects across several research themes in optimization and statistics, and provides access, beyond EM, to relevant theory from the proximal gradient and convex analysis literatures.

Chapter 5 presents a modern and technologically up-to-date approach to discrete optimization for variable selection models through their formulation as mixed integer programming models. Mixed integer quadratic and quadratically constrained programs are developed for the point-mass-Laplace and g-prior. Combined with warm-starts and optimality-based bounds tightening procedures provided by the heuristics of the previous chapter, the MIQP model developed for the point-mass-Laplace prior converges to global optimality in a matter of seconds for moderately sized real datasets. The obtained estimator is demonstrated to possess superior predictive performance over that obtained by cross-validated lasso in a number of real datasets. The MIQCP model for the g-prior struggles to match the performance of the former and highlights the fact that the performance of the mixed integer solver depends critically on the ability of the prior to rapidly concentrate posterior mass on good models.

Item Open Access Coordinated multiplexing of information about separate objects in visual cortex. (eLife, 2022-11) Jun, Na Young; Ruff, Douglas A; Kramer, Lily E; Bowes, Brittany; Tokdar, Surya T; Cohen, Marlene R; Groh, Jennifer M

Sensory receptive fields are large enough that they can contain more than one perceptible stimulus. How, then, can the brain encode information about *each* of the stimuli that may be present at a given moment? We recently showed that when more than one stimulus is present, single neurons can fluctuate between coding one vs. the other(s) across some time period, suggesting a form of neural multiplexing of different stimuli (Caruso et al., 2018). Here, we investigate (a) whether such coding fluctuations occur in early visual cortical areas; (b) how coding fluctuations are coordinated across the neural population; and (c) how coordinated coding fluctuations depend on the parsing of stimuli into separate vs. fused objects. We found coding fluctuations do occur in macaque V1 but only when the two stimuli form separate objects. Such separate objects evoked a novel pattern of V1 spike count ('noise') correlations involving distinct distributions of positive and negative values. This bimodal correlation pattern was most pronounced among pairs of neurons showing the strongest evidence for coding fluctuations or multiplexing. Whether a given pair of neurons exhibited positive or negative correlations depended on whether the two neurons both responded better to the same object or had different object preferences. Distinct distributions of spike count correlations based on stimulus preferences were also seen in V4 for separate objects but not when two stimuli fused to form one object. These findings suggest multiple objects evoke different response dynamics than those evoked by single stimuli, lending support to the multiplexing hypothesis and suggesting a means by which information about multiple objects can be preserved despite the apparent coarseness of sensory coding.

Item Open Access Efficient Gaussian process regression for large datasets. (Biometrika, 2013-03) Banerjee, Anjishnu; Dunson, David B; Tokdar, Surya T

Gaussian processes are widely used in nonparametric regression, classification and spatiotemporal modelling, facilitated in part by a rich literature on their theoretical properties. However, one of their practical limitations is expensive computation, typically on the order of n^3 where n is the number of data points, in performing the necessary matrix inversions. For large datasets, storage and processing also lead to computational bottlenecks, and numerical stability of the estimates and predicted values degrades with increasing n. Various methods have been proposed to address these problems, including predictive processes in spatial data analysis and the subset-of-regressors technique in machine learning. The idea underlying these approaches is to use a subset of the data, but this raises questions concerning sensitivity to the choice of subset and limitations in estimating fine-scale structure in regions that are not well covered by the subset. Motivated by the literature on compressive sensing, we propose an alternative approach that involves linear projection of all the data points onto a lower-dimensional subspace. We demonstrate the superiority of this approach from a theoretical perspective and through simulated and real data examples.

Item Open Access Multiple-try Stochastic Search for Bayesian Variable Selection (2017) Chen, Xu

Variable selection is a key issue when analyzing high-dimensional data. The explosion of data with large sample size and dimensionality brings new challenges to this problem in both inference efficiency and computational complexity. To alleviate these problems, a scalable Markov chain Monte Carlo (MCMC) sampling algorithm is proposed by generalizing multiple-try Metropolis to discrete model space and further incorporating neighborhood-based stochastic search. In this thesis, we study the behaviors of this MCMC sampler in the "large p small n" scenario where the number of predictors p is much greater than the number of observations n. Extensive numerical experiments including simulated and real data examples are provided to illustrate its performance. Choices of tuning parameters are discussed.
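
To make the multiple-try Metropolis idea concrete, here is a minimal sketch of the textbook multiple-try update on a space of binary inclusion vectors. It is not the sampler developed in the thesis, which also incorporates neighborhood-based stochastic search. The toy target (independent inclusion probabilities `theta`), the uniform single-flip proposal, and all function names are assumptions made for illustration.

```python
import math
import random


def log_target(gamma, theta):
    """Log-probability of gamma under independent Bernoulli(theta_i) bits (toy target)."""
    return sum(math.log(t if g else 1.0 - t) for g, t in zip(gamma, theta))


def flip(gamma, i):
    """Return a copy of gamma with bit i flipped."""
    out = list(gamma)
    out[i] = 1 - out[i]
    return tuple(out)


def mtm_step(gamma, theta, n_tries, rng):
    """One multiple-try Metropolis update with a uniform single-flip proposal."""
    p = len(gamma)
    # Draw n_tries candidates. The proposal is symmetric and constant over
    # neighbors, so each candidate's weight reduces to its target probability.
    trials = [flip(gamma, rng.randrange(p)) for _ in range(n_tries)]
    w_trials = [math.exp(log_target(y, theta)) for y in trials]
    chosen = rng.choices(trials, weights=w_trials, k=1)[0]
    # Reference set: n_tries - 1 proposals drawn from `chosen`, plus the current state.
    refs = [flip(chosen, rng.randrange(p)) for _ in range(n_tries - 1)] + [gamma]
    w_refs = [math.exp(log_target(x, theta)) for x in refs]
    accept_prob = min(1.0, sum(w_trials) / sum(w_refs))
    return chosen if rng.random() < accept_prob else gamma


def run_chain(theta, n_iter, n_tries=3, seed=0):
    """Run the chain and return the empirical inclusion frequency of each bit."""
    rng = random.Random(seed)
    gamma = tuple(0 for _ in theta)
    counts = [0] * len(theta)
    for _ in range(n_iter):
        gamma = mtm_step(gamma, theta, n_tries, rng)
        for i, g in enumerate(gamma):
            counts[i] += g
    return [c / n_iter for c in counts]
```

With a toy target such as `theta = [0.2, 0.5, 0.8]`, the returned frequencies should settle near the corresponding entries of `theta` as `n_iter` grows, which gives a quick sanity check on the update.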

Item Open Access Sensorimotor abilities predict on-field performance in professional baseball. (Scientific reports, 2018-01-08) Burris, Kyle; Vittetoe, Kelly; Ramger, Benjamin; Suresh, Sunith; Tokdar, Surya T; Reiter, Jerome P; Appelbaum, L Gregory

Baseball players must be able to see and react in an instant, yet it is hotly debated whether superior performance is associated with superior sensorimotor abilities. In this study, we compare sensorimotor abilities, measured through 8 psychomotor tasks comprising the Nike Sensory Station assessment battery, and game statistics in a sample of 252 professional baseball players to evaluate the links between sensorimotor skills and on-field performance. For this purpose, we develop a series of Bayesian hierarchical latent variable models enabling us to compare statistics across professional baseball leagues. Within this framework, we find that sensorimotor abilities are significant predictors of on-base percentage, walk rate and strikeout rate, accounting for age, position, and league. We find no such relationship for either slugging percentage or fielder-independent pitching. The pattern of results suggests performance contributions from both visual-sensory and visual-motor abilities and indicates that sensorimotor screenings may be useful for player scouting.

Item Open Access Single neurons may encode simultaneous stimuli by switching between activity patterns. (Nature communications, 2018-07-13) Caruso, Valeria C; Mohl, Jeff T; Glynn, Christopher; Lee, Jungah; Willett, Shawn M; Zaman, Azeem; Ebihara, Akinori F; Estrada, Rolando; Freiwald, Winrich A; Tokdar, Surya T; Groh, Jennifer M

How the brain preserves information about multiple simultaneous items is poorly understood. We report that single neurons can represent multiple stimuli by interleaving signals across time. We record single units in an auditory region, the inferior colliculus, while monkeys localize 1 or 2 simultaneous sounds. During dual-sound trials, we find that some neurons fluctuate between firing rates observed for each single sound, either on a whole-trial or on a sub-trial timescale. These fluctuations are correlated in pairs of neurons, can be predicted by the state of local field potentials prior to sound onset, and, in one monkey, can predict which sound will be reported first. We find corroborating evidence of fluctuating activity patterns in a separate dataset involving responses of inferotemporal cortex neurons to multiple visual stimuli. Alternation between activity patterns corresponding to each of multiple items may therefore be a general strategy to enhance the brain's processing capacity, potentially linking such disparate phenomena as variable neural firing, neural oscillations, and limits in attentional/memory capacity.

Item Open Access Statistical Analysis of Response Distribution for Dependent Data via Joint Quantile Regression (2021) Chen, Xu

Linear quantile regression is a powerful tool to investigate how predictors may affect a response heterogeneously across different quantile levels. Unfortunately, existing approaches find it extremely difficult to adjust for any dependency between observation units, largely because such methods are not based upon a fully generative model of the data. In this dissertation, we address this difficulty for analyzing spatial point-referenced data and hierarchical data. Several models are introduced by generalizing the joint quantile regression model of Yang and Tokdar (2017) and characterizing different dependency structures via a copula model on the underlying quantile levels of the observation units. A Bayesian semiparametric approach is introduced to perform inference of model parameters and carry out prediction. Multiple copula families are discussed for modeling response data with tail dependence and/or tail asymmetry. An effective model comparison criterion is provided for selecting between models with different combinations of sets of predictors, marginal base distributions and copula models.

Extensive simulation studies and real applications are presented to illustrate substantial gains of the proposed models in inference quality, prediction accuracy and uncertainty quantification over existing alternatives. Through case studies, we highlight that the proposed models admit great interpretability and are competent in offering insightful new discoveries of response-predictor relationship at non-central parts of the response distribution. The effectiveness of the proposed model comparison criteria is verified with both empirical and theoretical evidence.

Item Open Access Visual abilities distinguish pitchers from hitters in professional baseball. (Journal of sports sciences, 2018-01) Klemish, David; Ramger, Benjamin; Vittetoe, Kelly; Reiter, Jerome P; Tokdar, Surya T; Appelbaum, Lawrence Gregory

This study aimed to evaluate the possibility that differences in sensorimotor abilities exist between hitters and pitchers in a large cohort of baseball players of varying levels of experience. Secondary data analysis was performed on 9 sensorimotor tasks comprising the Nike Sensory Station assessment battery. Bayesian hierarchical regression modelling was applied to test for differences between pitchers and hitters in data from 566 baseball players (112 high school, 85 college, 369 professional) collected at 20 testing centres. Explanatory variables including height, handedness, eye dominance, concussion history, and player position were modelled along with age curves using basis regression splines. Regression analyses revealed better performance for hitters relative to pitchers at the professional level in the visual clarity and depth perception tasks, but these differences did not exist at the high school or college levels. No significant differences were observed in the other 7 measures of sensorimotor capabilities included in the test battery, and no systematic biases were found between the testing centres. These findings, indicating that professional-level hitters have better visual acuity and depth perception than professional-level pitchers, affirm the notion that highly experienced athletes have differing perceptual skills. Findings are discussed in relation to deliberate practice theory.