Browsing by Subject "stat.ME"
Item Open Access
Adaptive Hyper-box Matching for Interpretable Individualized Treatment Effect Estimation (CoRR, 2020)
Morucci, Marco; Orlandi, Vittorio; Rudin, Cynthia; Roy, Sudeepa; Volfovsky, Alexander
We propose a matching method for observational data that matches units with others in unit-specific, hyper-box-shaped regions of the covariate space. These regions are large enough that many matches are created for each unit and small enough that the treatment effect is roughly constant throughout. The regions are found either as the solution to a mixed integer program or by a (fast) approximation algorithm. The result is an interpretable and tailored estimate of a causal effect for each unit.

Item Open Access
An Improved Multi-Output Gaussian Process RNN with Real-Time Validation for Early Sepsis Detection (2017-08-19)
Futoma, Joseph; Hariharan, Sanjay; Sendak, Mark; Brajer, Nathan; Clement, Meredith; Bedoya, Armando; O'Brien, Cara; Heller, Katherine
Sepsis is a poorly understood and potentially life-threatening complication that can occur as a result of infection. Early detection and treatment improve patient outcomes, so detecting sepsis early poses an important challenge in medicine. In this work, we develop a flexible classifier that leverages streaming lab results, vitals, and medications to predict sepsis before it occurs. We model patient clinical time series with multi-output Gaussian processes, maintaining uncertainty about the physiological state of a patient while also imputing missing values. The mean function takes into account the effects of administered medications on the trajectories of the physiological variables. Latent function values from the Gaussian process are then fed into a deep recurrent neural network to classify patient encounters as septic or not, and the overall model is trained end-to-end using back-propagation.
We train and validate our model on a large dataset of 18 months of heterogeneous inpatient stays from the Duke University Health System, and develop a new "real-time" validation scheme for simulating the performance of our model as it will actually be used. Our proposed method substantially outperforms clinical baselines, and improves on a previous related model for detecting sepsis. Our model's predictions will be displayed in a real-time analytics dashboard to be used by a sepsis rapid response team to help detect and improve treatment of sepsis.

Item Open Access
Microclustering: When the Cluster Sizes Grow Sublinearly with the Size of the Data Set
Miller, Jeffrey; Betancourt, Brenda; Zaidi, Abbas; Wallach, Hanna; Steorts, Rebecca C
Most generative models for clustering implicitly assume that the number of data points in each cluster grows linearly with the total number of data points. Finite mixture models, Dirichlet process mixture models, and Pitman-Yor process mixture models make this assumption, as do all other infinitely exchangeable clustering models. However, for some tasks, this assumption is undesirable. For example, when performing entity resolution, the size of each cluster is often unrelated to the size of the data set. Consequently, each cluster contains a negligible fraction of the total number of data points. Such tasks therefore require models that yield clusters whose sizes grow sublinearly with the size of the data set. We address this requirement by defining the microclustering property and introducing a new model that exhibits this property. We compare this model to several commonly used clustering models by checking model fit using real and simulated data sets.

Item Open Access
Multiscale integrators for stochastic differential equations and irreversible Langevin samplers (2017-04-23)
Lu, Jianfeng; Spiliopoulos, K
We study multiscale integrator numerical schemes for a class of stiff stochastic differential equations (SDEs).
We consider multiscale SDEs that behave as diffusions on graphs as the stiffness parameter goes to its limit. Classical numerical discretization schemes, such as the Euler-Maruyama scheme, become unstable as the stiffness parameter converges to its limit, and appropriate multiscale integrators can correct for this. We rigorously establish the convergence of the numerical method to the related diffusion on a graph, identifying the appropriate choice of discretization parameters. Theoretical results are supplemented by numerical studies drawn from the recently developing area of introducing irreversibility into Langevin samplers in order to accelerate convergence to equilibrium.

Item Open Access
Simultaneous Edit and Imputation For Household Data with Structural Zeros (Journal of Survey Statistics and Methodology)
Akande, O; Barrientos, Andres; Reiter, Jerome
Multivariate categorical data nested within households often include reported values that fail edit constraints (for example, a participating household reports a child as older than the child's biological parent) as well as missing values. Generally, agencies prefer datasets to be free from erroneous or missing values before analyzing them or disseminating them to secondary data users. We present a model-based engine for editing and imputation of household data based on a Bayesian hierarchical model that includes (i) a nested data Dirichlet process mixture of products of multinomial distributions as the model for the true latent values of the data, truncated to allow only households that satisfy all edit constraints, (ii) a model for the location of errors, and (iii) a reporting model for the observed responses in error. The approach propagates uncertainty due to unknown locations of errors and missing values, generates plausible datasets that satisfy all edit constraints, and can preserve multivariate relationships within and across individuals in the same household.
We illustrate the approach using data from the 2012 American Community Survey.

Item Open Access
Smaller $p$-values in genomics studies using distilled historical information
Bryan, Jordan G; Hoff, Peter D
Medical research institutions have generated massive amounts of biological data by genetically profiling hundreds of cancer cell lines. In parallel, academic biology labs have conducted genetic screens on small numbers of cancer cell lines under custom experimental conditions. In order to share information between these two approaches to scientific discovery, this article proposes a "frequentist assisted by Bayes" (FAB) procedure for hypothesis testing that allows historical information from massive genomics datasets to increase the power of hypothesis tests in specialized studies. The exchange of information takes place through a novel probability model for multimodal genomics data, which distills historical information pertaining to cancer cell lines and genes across a wide variety of experimental contexts. If the relevance of the historical information for a given study is high, then the resulting FAB tests can be more powerful than the corresponding classical tests. If the relevance is low, then the FAB tests yield as many discoveries as the classical tests. Simulations and practical investigations demonstrate that the FAB testing procedure can increase the number of effects discovered in genomics studies while still maintaining strict control of type I error and false discovery rates.

Item Open Access
Variational Bayes for Merging Noisy Databases
Broderick, Tamara; Steorts, Rebecca C
Bayesian entity resolution merges together multiple, noisy databases and returns the minimal collection of unique individuals represented, together with their true, latent record values. Bayesian methods allow flexible generative models that share power across databases as well as principled quantification of uncertainty for queries of the final, resolved database.
However, existing Bayesian methods for entity resolution use Markov chain Monte Carlo (MCMC) approximations and are too slow to run on modern databases containing millions or billions of records. Instead, we propose applying variational approximations to allow scalable Bayesian inference in these models. We derive a coordinate-ascent approximation for mean-field variational Bayes, qualitatively compare our algorithm to existing methods, note unique challenges for inference that arise from the expected distribution of cluster sizes in entity resolution, and discuss directions for future work in this domain.
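
Illustrative note: the abstract above mentions a coordinate-ascent approximation for mean-field variational Bayes. The sketch below shows the general idea on a much simpler, assumed toy model (a Gaussian with unknown mean and precision under a conjugate Normal-Gamma prior). It is not the entity-resolution algorithm from the listed paper; the function name `cavi_gaussian` and all hyperparameter names and defaults are hypothetical choices made for this example.

```python
import random


def cavi_gaussian(xs, mu0=0.0, lambda0=1.0, a0=1.0, b0=1.0, iters=100):
    """Coordinate-ascent mean-field VB for a toy Gaussian model.

    Assumed model (for illustration only, not from the listed paper):
        x_i ~ Normal(mu, 1/tau)
        mu | tau ~ Normal(mu0, 1/(lambda0 * tau))
        tau ~ Gamma(shape=a0, rate=b0)
    Factorized approximation q(mu, tau) = q(mu) q(tau), where
        q(mu) = Normal(mu_n, 1/lambda_n), q(tau) = Gamma(a_n, b_n).
    """
    n = len(xs)
    xbar = sum(xs) / n

    # These two quantities do not depend on the other factor,
    # so they are computed once, outside the loop.
    mu_n = (lambda0 * mu0 + n * xbar) / (lambda0 + n)
    a_n = a0 + (n + 1) / 2.0

    # Squared-error terms evaluated at the mean of q(mu).
    sq = sum((x - mu_n) ** 2 for x in xs) + lambda0 * (mu_n - mu0) ** 2

    e_tau = a0 / b0  # initial guess for E_q[tau]
    for _ in range(iters):
        # Update q(mu) given the current E_q[tau].
        lambda_n = (lambda0 + n) * e_tau
        # Update q(tau) given the current q(mu); the extra term is the
        # variance of q(mu), counted once per squared-error term.
        b_n = b0 + 0.5 * (sq + (lambda0 + n) / lambda_n)
        e_tau = a_n / b_n
    return mu_n, lambda_n, a_n, b_n


if __name__ == "__main__":
    rng = random.Random(0)
    data = [rng.gauss(2.0, 0.5) for _ in range(2000)]
    mu_n, lambda_n, a_n, b_n = cavi_gaussian(data)
    print("approx. posterior mean of mu:", mu_n)
    print("approx. posterior mean of tau:", a_n / b_n)
```

Each pass updates one factor while holding the other fixed, which is the coordinate-ascent pattern. In a model with cluster assignments, such as entity resolution, the factors and their updates would differ and would need to be derived for that model.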