Browsing by Subject "stat.AP"
Item Open Access
An Improved Multi-Output Gaussian Process RNN with Real-Time Validation for Early Sepsis Detection (2017-08-19)
Futoma, Joseph; Hariharan, Sanjay; Sendak, Mark; Brajer, Nathan; Clement, Meredith; Bedoya, Armando; O'Brien, Cara; Heller, Katherine
Sepsis is a poorly understood and potentially life-threatening complication that can occur as a result of infection. Early detection and treatment improve patient outcomes, and as such it poses an important challenge in medicine. In this work, we develop a flexible classifier that leverages streaming lab results, vitals, and medications to predict sepsis before it occurs. We model patient clinical time series with multi-output Gaussian processes, maintaining uncertainty about the physiological state of a patient while also imputing missing values. The mean function takes into account the effects of medications administered on the trajectories of the physiological variables. Latent function values from the Gaussian process are then fed into a deep recurrent neural network to classify patient encounters as septic or not, and the overall model is trained end-to-end using back-propagation. We train and validate our model on a large dataset of 18 months of heterogeneous inpatient stays from the Duke University Health System, and develop a new "real-time" validation scheme for simulating the performance of our model as it will actually be used. Our proposed method substantially outperforms clinical baselines, and improves on a previous related model for detecting sepsis. Our model's predictions will be displayed in a real-time analytics dashboard to be used by a sepsis rapid response team to help detect and improve treatment of sepsis.

Item Open Access
Evaluating Partisan Gerrymandering in Wisconsin (2017-09-07)
Ravier, R; Mattingly, J; Herschlag, GJ
We examine the extent of gerrymandering for the 2010 General Assembly district map of Wisconsin.
We find that there is substantial variability in the election outcome depending on what maps are used. We also find robust evidence that the district maps are highly gerrymandered and that this gerrymandering likely altered the partisan makeup of the Wisconsin General Assembly in some elections. Compared to the distribution of possible redistricting plans for the General Assembly, Wisconsin's chosen plan is an outlier in that it yields results that are highly skewed to the Republicans when the statewide proportion of Democratic votes comprises more than 50-52% of the overall vote (with the precise threshold depending on the election considered). Wisconsin's plan acts to preserve the Republican majority by providing extra Republican seats even when the Democratic vote increases into the range where the balance of power would shift for the vast majority of redistricting plans.

Item Open Access
Linear regression model with a randomly censored predictor: Estimation procedures (2017-11-01)
Atem, Folefac; Matsouaka, Roland A
We consider linear regression model estimation where the covariate of interest is randomly censored. Under a non-informative censoring mechanism, one may obtain valid estimates by deleting censored observations. However, this comes at a cost of lost information and decreased efficiency, especially under heavy censoring. Other methods for dealing with censored covariates, such as ignoring censoring or replacing censored observations with a fixed number, often lead to severely biased results and are of limited practicality. Parametric methods based on maximum likelihood estimation as well as semiparametric and non-parametric methods have been successfully used in linear regression estimation with censored covariates where censoring is due to a limit of detection.
In this paper, we adapt some of these methods to handle randomly censored covariates and compare them under different scenarios to recently developed semiparametric and nonparametric methods for randomly censored covariates. Specifically, we consider both dependent and independent randomly censored mechanisms as well as the impact of using a non-parametric algorithm on the distribution of the randomly censored covariate. Through extensive simulation studies, we compare the performance of these methods under different scenarios. Finally, we illustrate and compare the methods using the Framingham Health Study data to assess the association between low-density lipoprotein (LDL) in offspring and parental age at onset of a clinically diagnosed cardiovascular event.

Item Open Access
Microclustering: When the Cluster Sizes Grow Sublinearly with the Size of the Data Set
Miller, Jeffrey; Betancourt, Brenda; Zaidi, Abbas; Wallach, Hanna; Steorts, Rebecca C
Most generative models for clustering implicitly assume that the number of data points in each cluster grows linearly with the total number of data points. Finite mixture models, Dirichlet process mixture models, and Pitman-Yor process mixture models make this assumption, as do all other infinitely exchangeable clustering models. However, for some tasks, this assumption is undesirable. For example, when performing entity resolution, the size of each cluster is often unrelated to the size of the data set. Consequently, each cluster contains a negligible fraction of the total number of data points. Such tasks therefore require models that yield clusters whose sizes grow sublinearly with the size of the data set. We address this requirement by defining the microclustering property and introducing a new model that exhibits this property.
We compare this model to several commonly used clustering models by checking model fit using real and simulated data sets.

Item Open Access
Quantifying Gerrymandering in North Carolina
Herschlag, G; Kang, HS; Luo, J; Graves, CV; Bangia, S; Ravier, R; Mattingly, JC
Using an ensemble of redistricting plans, we evaluate whether a given political districting faithfully represents the geo-political landscape. Redistricting plans are sampled by a Monte Carlo algorithm from a probability distribution that adheres to realistic and non-partisan criteria. Using the sampled redistricting plans and historical voting data, we produce an ensemble of elections that reveal geo-political structure within the state. We showcase our methods on the two most recent districtings of NC for the U.S. House of Representatives, as well as a plan drawn by a bipartisan redistricting panel. We find the two state-enacted plans are highly atypical outliers whereas the bipartisan plan accurately represents the ensemble both in partisan outcome and in the fine-scale structure of district-level results.

Item Open Access
Redistricting: Drawing the Line (2017-04-12)
Bangia, Sachet; Graves, Christy Vaughn; Herschlag, Gregory; Kang, Han Sung; Luo, Justin; Mattingly, Jonathan C; Ravier, Robert
We develop methods to evaluate whether a political districting accurately represents the will of the people. To explore and showcase our ideas, we concentrate on the congressional districts for the U.S. House of Representatives and use the state of North Carolina and its redistrictings since the 2010 census. Using a Monte Carlo algorithm, we randomly generate over 24,000 redistrictings that are non-partisan and adhere to criteria from proposed legislation. Applying historical voting data to these random redistrictings, we find that the number of Democratic and Republican representatives elected varies drastically depending on how districts are drawn.
Some results are more common, and we gain a clear range of expected election outcomes. Using the statistics of our generated redistrictings, we critique the particular congressional districtings used in the 2012 and 2016 NC elections as well as a districting proposed by a bipartisan redistricting commission. We find that the 2012 and 2016 districtings are highly atypical and not representative of the will of the people. On the other hand, our results indicate that a plan produced by a bipartisan panel of retired judges is highly typical and representative. Since our analyses are based on an ensemble of reasonable redistrictings of North Carolina, they provide a baseline for a given election which incorporates the geometry of the state's population distribution.

Item Open Access
Smaller $p$-values in genomics studies using distilled historical information
Bryan, Jordan G; Hoff, Peter D
Medical research institutions have generated massive amounts of biological data by genetically profiling hundreds of cancer cell lines. In parallel, academic biology labs have conducted genetic screens on small numbers of cancer cell lines under custom experimental conditions. In order to share information between these two approaches to scientific discovery, this article proposes a "frequentist assisted by Bayes" (FAB) procedure for hypothesis testing that allows historical information from massive genomics datasets to increase the power of hypothesis tests in specialized studies. The exchange of information takes place through a novel probability model for multimodal genomics data, which distills historical information pertaining to cancer cell lines and genes across a wide variety of experimental contexts. If the relevance of the historical information for a given study is high, then the resulting FAB tests can be more powerful than the corresponding classical tests. If the relevance is low, then the FAB tests yield as many discoveries as the classical tests.
Simulations and practical investigations demonstrate that the FAB testing procedure can increase the number of effects discovered in genomics studies while still maintaining strict control of type I error and false discovery rates.

Item Open Access
SMERED: A Bayesian Approach to Graphical Record Linkage and De-duplication
Steorts, RC; Hall, R; Fienberg, SE
We propose a novel unsupervised approach for linking records across arbitrarily many files, while simultaneously detecting duplicate records within files. Our key innovation is to represent the pattern of links between records as a bipartite graph, in which records are directly linked to latent true individuals, and only indirectly linked to other records. This flexible new representation of the linkage structure naturally allows us to estimate the attributes of the unique observable people in the population, calculate $k$-way posterior probabilities of matches across records, and propagate the uncertainty of record linkage into later analyses. Our linkage structure lends itself to an efficient, linear-time, hybrid Markov chain Monte Carlo algorithm, which overcomes many obstacles encountered by previously proposed methods of record linkage, despite the high-dimensional parameter space. We assess our results on real and simulated data.