Browsing by Author "Steorts, Rebecca C"
Item Open Access
A Privacy Preserving Algorithm to Release Sparse High-dimensional Histograms (2017)
Li, Bai
Differential privacy (DP) aims to design methods and algorithms that satisfy rigorous notions of privacy while simultaneously providing utility with valid statistical inference. More recently, an emphasis has been placed on combining notions of statistical utility with algorithmic approaches to address privacy risk in the presence of big data, with differential privacy emerging as a rigorous notion of risk. While DP provides strong guarantees for privacy, there are often tradeoffs regarding data utility and computational scalability. In this paper, we introduce a categorical data synthesizer that releases high-dimensional sparse histograms, illustrating its ability to overcome current limitations of data synthesizers in the literature. Specifically, we combine a differentially private algorithm, the stability-based algorithm, with feature hashing, which allows for dimension reduction of the histograms, and with Gibbs sampling. As a result, our proposed algorithm is differentially private, offers similar or better statistical utility, and is scalable to large databases. In addition, we give an analytical result for the error caused by the stability-based algorithm, which allows us to control the loss of utility. Finally, we study the behavior of our algorithm on both simulated and real data.
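The two ingredients named in this abstract, feature hashing for dimension reduction and a stability-based (noise-then-threshold) release of the non-empty bins, can be sketched in a few lines. This is a minimal illustration, not the paper's implementation; the bucket count, the SHA-256 hash, and the threshold formula are assumptions chosen for the sketch.

```python
import hashlib
import math
import random
from collections import Counter

def hashed_histogram(records, num_buckets=1024):
    """Feature-hash high-dimensional categorical records into a fixed
    number of buckets, yielding a reduced histogram."""
    counts = Counter()
    for record in records:
        key = "|".join(map(str, record))  # flatten the categorical tuple
        bucket = int(hashlib.sha256(key.encode()).hexdigest(), 16) % num_buckets
        counts[bucket] += 1
    return counts

def stability_release(counts, epsilon=1.0, delta=1e-6):
    """Stability-based histogram sketch: add Laplace(2/epsilon) noise to
    the non-empty bins only, and suppress any noisy count that falls below
    a threshold, so the set of released bins itself stays private."""
    threshold = 1.0 + 2.0 * math.log(2.0 / delta) / epsilon  # assumed form
    released = {}
    for bucket, count in counts.items():
        # Laplace noise as the difference of two exponentials with rate eps/2
        noisy = count + random.expovariate(epsilon / 2) - random.expovariate(epsilon / 2)
        if noisy >= threshold:
            released[bucket] = noisy
    return released

records = [("F", "1970", "NC"), ("M", "1985", "CA"), ("F", "1970", "NC")]
print(stability_release(hashed_histogram(records, num_buckets=16)))
```

Suppressing noisy counts below the threshold is what prevents the mere presence of a rare bin from leaking which unusual attribute combinations appeared in the data.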
Item Open Access
Development, Implementation, and Evaluation of an In-Hospital Optimized Early Warning Score for Patient Deterioration (MDM Policy & Practice, 2020-01-10)
O'Brien, Cara; Goldstein, Benjamin A; Shen, Yueqi; Phelan, Matthew; Lambert, Curtis; Bedoya, Armando D; Steorts, Rebecca C
Background. Identification of patients at risk of deteriorating during their hospitalization is an important concern. However, many off-the-shelf scores have poor in-center performance. In this article, we report our experience developing, implementing, and evaluating an in-hospital score for deterioration. Methods. We abstracted 3 years of data (2014-2016) and identified patients on medical wards who died or were transferred to the intensive care unit. We developed a time-varying risk model and then implemented the model over a 10-week period to assess prospective predictive performance. We compared performance to our currently used tool, the National Early Warning Score (see the AUC sketch after the next listing). To aid clinical decision making, we transformed the quantitative score into a three-level clinical decision support tool. Results. The developed risk score had an average area under the curve of 0.814 (95% confidence interval = 0.79-0.83) versus 0.740 (95% confidence interval = 0.72-0.76) for the National Early Warning Score. We found the proposed score was able to respond to acute changes in patients' clinical status. Upon implementing the score, we achieved the desired positive predictive value but needed to retune the thresholds to reach the desired sensitivity. Discussion. This work illustrates the potential for academic medical centers to build, refine, and implement risk models that are targeted to their patient population and workflow.

Item Open Access
Dynamic Time Varying Models for Predicting Patient Deterioration (2017)
McCreanor, Reuben Knowles
Streaming data are becoming more common in a variety of fields. One common data stream in clinical medicine is electronic health records (EHRs), which have been used to develop risk prediction models. Our motivating application considers the risk of patient deterioration, defined as in-hospital mortality or transfer to the Intensive Care Unit (ICU). Duke University Hospital recently implemented an alert risk score for acute care wards: the National Early Warning Score (NEWS). However, NEWS was designed to be hand-calculable from patient vital data rather than to optimize prediction. Our approach considers three further methods to use on real-time EHR data to predict patient deterioration: a Cox model, a joint modeling approach, and a Gaussian process. By evaluating these models on clinical EHR data from more than 51,000 patients, we provide a comparison of the methods on real EHR data for patient deterioration. We evaluate the results on both performance and scalability, and consider the feasibility of implementing each approach in a clinical environment. While the more complicated models may offer a small gain in predictive performance, they do not scale to a full patient data set. Thus, within a clinical setting, the Cox model is clearly the best approach.
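The Cox model that the McCreanor abstract ultimately recommends can accommodate time-varying vitals through a counting-process data layout, with one row per interval during which a patient's covariates were constant. A minimal sketch using the third-party lifelines library; the toy vitals, column names, and interval layout are illustrative assumptions, not the thesis's data schema:

```python
import pandas as pd
from lifelines import CoxTimeVaryingFitter

# One row per (start, stop] interval with constant covariates;
# event = 1 marks deterioration (death or ICU transfer) at `stop`.
df = pd.DataFrame({
    "id":         [1, 1, 2, 2, 3, 3, 4, 4],
    "start":      [0, 4, 0, 6, 0, 3, 0, 5],
    "stop":       [4, 8, 6, 10, 3, 9, 5, 12],
    "heart_rate": [82, 118, 76, 90, 95, 130, 70, 72],
    "resp_rate":  [16, 26, 14, 18, 20, 30, 12, 14],
    "event":      [0, 1, 0, 0, 0, 1, 0, 0],
})

ctv = CoxTimeVaryingFitter()
ctv.fit(df, id_col="id", event_col="event",
        start_col="start", stop_col="stop")
ctv.print_summary()  # hazard ratios for the time-varying vitals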
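Both patient-deterioration items above benchmark against NEWS using the area under the ROC curve. A sketch of that comparison with scikit-learn; the outcome and score vectors here are toy stand-ins, not values from either study:

```python
from sklearn.metrics import roc_auc_score

# Toy stand-ins: 1 = deteriorated (death or ICU transfer), 0 = did not
outcomes   = [0, 0, 1, 0, 1, 1, 0, 0, 1, 0]
new_score  = [0.1, 0.2, 0.8, 0.3, 0.7, 0.9, 0.2, 0.1, 0.6, 0.4]
news_score = [0.3, 0.2, 0.5, 0.4, 0.6, 0.7, 0.5, 0.3, 0.4, 0.2]

print("developed score AUC:", roc_auc_score(outcomes, new_score))
print("NEWS AUC:           ", roc_auc_score(outcomes, news_score))
```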
Item Open Access
Microclustering: When the Cluster Sizes Grow Sublinearly with the Size of the Data Set
Miller, Jeffrey; Betancourt, Brenda; Zaidi, Abbas; Wallach, Hanna; Steorts, Rebecca C
Most generative models for clustering implicitly assume that the number of data points in each cluster grows linearly with the total number of data points. Finite mixture models, Dirichlet process mixture models, and Pitman-Yor process mixture models make this assumption, as do all other infinitely exchangeable clustering models. However, for some tasks, this assumption is undesirable. For example, when performing entity resolution, the size of each cluster is often unrelated to the size of the data set; consequently, each cluster contains a negligible fraction of the total number of data points. Such tasks therefore require models that yield clusters whose sizes grow sublinearly with the size of the data set. We address this requirement by defining the microclustering property (stated formally after these listings) and introducing a new model that exhibits this property. We compare this model to several commonly used clustering models by checking model fit using real and simulated data sets.

Item Open Access
Variational Bayes for Merging Noisy Databases
Broderick, Tamara; Steorts, Rebecca C
Bayesian entity resolution merges multiple noisy databases and returns the minimal collection of unique individuals represented, together with their true, latent record values. Bayesian methods allow flexible generative models that share power across databases, as well as principled quantification of uncertainty for queries of the final, resolved database. However, existing Bayesian methods for entity resolution use Markov chain Monte Carlo (MCMC) approximations and are too slow to run on modern databases containing millions or billions of records. Instead, we propose applying variational approximations to allow scalable Bayesian inference in these models. We derive a coordinate-ascent approximation for mean-field variational Bayes, qualitatively compare our algorithm to existing methods, note unique challenges for inference that arise from the expected distribution of cluster sizes in entity resolution, and discuss directions for future work in this domain.
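For reference, the microclustering property named in the Miller et al. item above is usually stated as follows; a sketch of the standard formulation, with M_n denoting the size of the largest cluster in a random partition of n data points:

```latex
% A sequence of random partitions of [n] exhibits the microclustering
% property when the largest cluster is asymptotically negligible:
\[
  \frac{M_n}{n} \;\xrightarrow{\;p\;}\; 0
  \quad \text{as } n \to \infty,
\]
% where $M_n$ is the size of the largest cluster among $n$ data points,
% so every cluster holds a vanishing fraction of the data.
```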
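The coordinate-ascent mean-field scheme mentioned in the Broderick and Steorts item follows a generic pattern: update each variational factor in turn, holding the others fixed, until the objective stops improving. The sketch below applies that pattern to a toy two-component Gaussian mixture rather than the paper's entity-resolution model; the model, priors, and all names here are illustrative assumptions:

```python
import numpy as np

def cavi_gmm(x, K=2, prior_var=10.0, n_iters=50, seed=0):
    """Coordinate-ascent variational inference (CAVI) for a toy mixture:
    x_i | c_i=k ~ N(mu_k, 1), mu_k ~ N(0, prior_var), c_i uniform over K.
    Mean-field factors: q(mu_k) = N(m[k], s2[k]); q(c_i) = Cat(phi[i])."""
    rng = np.random.default_rng(seed)
    n = len(x)
    m = rng.normal(size=K)            # variational means for the mu_k
    s2 = np.ones(K)                   # variational variances for the mu_k
    phi = np.full((n, K), 1.0 / K)    # responsibilities q(c_i = k)
    for _ in range(n_iters):
        # Update each q(c_i), holding the q(mu_k) fixed.
        logits = np.outer(x, m) - 0.5 * (s2 + m ** 2)
        logits -= logits.max(axis=1, keepdims=True)   # stabilize the exp
        phi = np.exp(logits)
        phi /= phi.sum(axis=1, keepdims=True)
        # Update each q(mu_k), holding the responsibilities fixed.
        precision = 1.0 / prior_var + phi.sum(axis=0)
        m = (phi * x[:, None]).sum(axis=0) / precision
        s2 = 1.0 / precision
    return m, s2, phi

rng = np.random.default_rng(42)
x = np.concatenate([rng.normal(-3, 1, 100), rng.normal(3, 1, 100)])
m, s2, phi = cavi_gmm(x)
print("variational posterior means:", np.sort(m))
```

Each sweep monotonically improves the variational objective, which is what makes the coordinate-ascent scheme attractive at scales where MCMC is too slow.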