Browsing by Subject "Prediction"
Item Open Access
A Data-Retaining Framework for Tail Estimation (2020)
Cunningham, Erika
Modeling of extreme data often involves thresholding, or retaining only the most extreme observations, in order that the tail may "speak" and not be overwhelmed by the bulk of the data. We describe a transformation-based framework that allows univariate density estimation to smoothly transition from a flexible, semi-parametric estimation of the bulk into a parametric estimation of the tail without thresholding. In the limit, this framework has desirable theoretical tail-matching properties to the selected parametric distribution. We develop three Bayesian models under the framework: one using a logistic Gaussian process (LGP) approach; one using a Dirichlet process mixture model (DPMM); and one using a predictive recursion approximation of the DPMM. Models produce estimates and intervals for density, distribution, and quantile functions across the full data range and for the tail index (inverse-power-decay parameter), under an assumption of heavy tails. For each approach, we carry out a simulation study to explore the model's practical usage in non-asymptotic settings, comparing its performance to methods that involve thresholding.
Among the three models proposed, the LGP has lowest bias through the bulk and highest quantile interval coverage generally. Compared to thresholding methods, its tail predictions have lower root mean squared error (RMSE) in all scenarios but the most complicated, e.g. a sharp bulk-to-tail transition. The LGP's consistent underestimation of the tail index does not hinder tail estimation in pre-extrapolation to moderate-extrapolation regions but does affect extreme extrapolations.
An interplay between the parametric transform and the natural sparsity of the DPMM sometimes causes the DPMM to favor estimation of the bulk over estimation of the tail. This can be overcome by increasing prior precision on less sparse (flatter) base-measure density shapes. A finite mixture model (FMM), substituted for the DPMM in simulation, proves effective at reducing tail RMSE over thresholding methods in some, but not all, scenarios and quantile levels.
The predictive recursion marginal posterior (PRMP) model is fast and does the best job among proposed models of estimating the tail-index parameter. This allows it to reduce RMSE in extrapolation over thresholding methods in most scenarios considered. However, bias from the predictive recursion contaminates the tail, casting doubt on the PRMP's predictions in tail regions where data should still inform estimation. We recommend the PRMP model as a quick tool for visualizing the marginal posterior over transformation parameters, which can aid in diagnosing multimodality and informing the precision needed to overcome sparsity in the mixture model approach.
In summary, there is not enough information in the likelihood alone to prevent the bulk from overwhelming the tail. However, a model that harnesses the likelihood with a carefully specified prior can allow both the bulk and tail to speak without an explicit separation of the two. Moreover, retaining all of the data under this framework reduces quantile variability, improving prediction in the tails compared to methods that threshold.
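For readers unfamiliar with the thresholding approach that this abstract contrasts against, the sketch below shows one standard thresholding estimator, the Hill estimator. It is an illustration only: the abstract does not say which thresholding methods were used for comparison, and the function name, sample size, and choice of k are assumptions made here. The estimator keeps the k largest observations, discards the rest, and estimates gamma = 1/alpha for a tail that decays like x^(-alpha).

```python
import math
import random


def hill_estimator(data, k):
    """Hill estimate of gamma = 1/alpha from the k largest observations.

    Illustrative thresholding baseline; not the thesis's framework.
    """
    if not 1 <= k < len(data):
        raise ValueError("k must satisfy 1 <= k < len(data)")
    ordered = sorted(data, reverse=True)
    threshold = ordered[k]  # the (k+1)-th largest value acts as the threshold
    if threshold <= 0:
        raise ValueError("the threshold must be positive to take logarithms")
    # Mean log-excess of the k retained points over the threshold.
    return sum(math.log(x / threshold) for x in ordered[:k]) / k


# Hypothetical example: exact Pareto data with alpha = 2, so gamma = 0.5.
rng = random.Random(0)
sample = [rng.paretovariate(2.0) for _ in range(5000)]
gamma_hat = hill_estimator(sample, k=500)
print(f"Hill estimate of gamma with k=500: {gamma_hat:.3f}")
```

Only 500 of the 5000 observations inform this estimate, and the result depends on the choice of k. That dependence on a cutoff is the cost that a data-retaining approach aims to avoid.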
Item Open Access
Analysis and Error Correction in Structures of Macromolecular Interiors and Interfaces (2009)
Headd, Jeffrey John
As of late 2009, the Protein Data Bank (PDB) has grown to contain over 70,000 models. This recent increase in the amount of structural data allows for more extensive explication of the governing principles of macromolecular folding and association to complement traditional studies focused on a single molecule or complex. PDB-wide characterization of structural features yields insights that are useful in prediction and validation of the 3D structure of macromolecules and their complexes. Here, these insights lead to a deeper understanding of protein--protein interfaces, full-atom critical assessment of increasingly accurate structure predictions, a better defined library of RNA backbone conformers for validation and building 3D models, and knowledge-based automatic correction of errors in protein sidechain rotamers.
My study of protein--protein interfaces identifies amino acid pairing preferences in a set of 146 transient interfaces. Using a geometric interface surface definition devoid of arbitrary cutoffs common to previous studies of interface composition, I calculate inter- and intrachain amino acid pairing preferences. As expected, salt-bridges and hydrophobic patches are prevalent, but likelihood correction of observed pairing frequencies reveals some surprising pairing preferences, such as Cys-His interchain pairs and Met-Met intrachain pairs. To complement my statistical observations, I introduce a 2D visualization of the 3D interface surface that can display a variety of interface characteristics, including residue type, atomic distance and backbone/sidechain composition.
My study of protein interiors finds that 3D structure prediction from sequence (as part of the CASP experiment) is very close to full-atom accuracy. Validation of structure prediction should therefore consider all atom positions instead of the traditional Calpha-only evaluation. I introduce six new full-model quality criteria to assess the accuracy of CASP predictions, which demonstrate that groups who use structural knowledge culled from the PDB to inform their prediction protocols produce the most accurate results.
My study of RNA backbone introduces a set of rotamer-like "suite" conformers. Initially hand-identified by the Richardson laboratory, these 7D conformers represent backbone segments that are found to be genuine and favorable. X-ray crystallographers can use backbone conformers for model building in often poor backbone density and in validation after refinement. Increasing amounts of high quality RNA data allow for improved conformer identification, but also complicate hand-curation. I demonstrate that affinity propagation successfully differentiates between two related but distinct suite conformers, and is a useful tool for automated conformer clustering.
My study of protein sidechain rotamers in X-ray structures identifies a class of systematic errors that results in sidechains misfit by approximately 180 degrees. I introduce Autofix, a method for automated detection and correction of such errors. Autofix corrects over 40% of errors for Leu, Thr, and Val residues, and a significant number of Arg residues. On average, Autofix made four corrections per PDB file in 945 X-ray structures. Autofix will be implemented into MolProbity and PHENIX for easy integration into X-ray crystallography workflows.
Item Open Access
Differentially Private Verification of Predictions from Synthetic Data (2017)
Yu, Haoyang
When data are confidential, one approach for releasing publicly available files is to make synthetic data, i.e., data simulated from statistical models estimated on the confidential data. Given access only to synthetic data, users have no way to tell whether the synthetic data preserve the adequacy of their analysis. Thus, I present methods that help users make such assessments automatically while controlling the information disclosure risks in the confidential data. Three verification methods are presented in this thesis: differentially private prediction tolerance intervals, a differentially private prediction histogram, and a differentially private Kolmogorov-Smirnov test. I use simulation to illustrate these prediction verification methods.
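As a rough illustration of how a differentially private histogram can be released, the sketch below adds Laplace noise to bin counts. It is not the thesis's method: the function names, the binning of hypothetical prediction errors, and the privacy accounting (neighboring datasets differ by adding or removing one record, so the count vector has L1 sensitivity 1) are all assumptions made here. Python's `random` module is also not a cryptographically secure noise source, so this is for exposition only.

```python
import bisect
import random


def laplace_noise(scale, rng):
    """Draw Laplace(0, scale) noise as the difference of two exponentials."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def dp_histogram(values, edges, epsilon, rng):
    """Return noisy bin counts under epsilon-differential privacy.

    Assumes add/remove-one neighbors, so the L1 sensitivity of the counts is 1.
    Values outside [edges[0], edges[-1]] are dropped; the last bin is closed.
    """
    if epsilon <= 0:
        raise ValueError("epsilon must be positive")
    counts = [0] * (len(edges) - 1)
    for v in values:
        if edges[0] <= v <= edges[-1]:
            i = min(bisect.bisect_right(edges, v) - 1, len(counts) - 1)
            counts[i] += 1
    scale = 1.0 / epsilon  # Laplace scale = sensitivity / epsilon
    return [c + laplace_noise(scale, rng) for c in counts]


# Hypothetical absolute prediction errors computed on confidential data.
errors = [0.05, 0.12, 0.31, 0.33, 0.48, 0.52, 0.77, 0.91]
noisy = dp_histogram(errors, edges=[0.0, 0.25, 0.5, 0.75, 1.0],
                     epsilon=1.0, rng=random.Random(42))
print([round(c, 2) for c in noisy])
```

Smaller values of `epsilon` give stronger privacy protection and noisier counts, so a released histogram can contain negative or fractional entries.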
Item Open Access
Dynamic Time Varying Models for Predicting Patient Deterioration (2017)
McCreanor, Reuben Knowles
Streaming data are becoming more common in a variety of fields. One common data stream in clinical medicine is electronic health records (EHRs) which have been used to develop risk prediction models. Our motivating application considers the risk of patient deterioration, which is defined as in-hospital mortality or transfer to the Intensive Care Unit (ICU). Duke University Hospital recently implemented an alert risk score for acute care wards: the National Early Warning Score (NEWS). However, NEWS was designed to be hand-calculable from patient vital data rather than to optimize prediction. Our approach considers three further methods to use on real-time EHR data to predict patient deterioration. We propose a Cox model, a joint modeling approach, and a Gaussian process. By evaluating the implementation of these models on clinical EHR data from more than 51,000 patients, we are able to provide a comparison of the methods on real EHR data for patient deterioration. We evaluate the results on both performance and scalability and consider the feasibility of implementing each approach in a clinical environment. While the more complicated models may potentially offer a small gain in predictive performance, they do not scale to a full patient data set. Thus, within a clinical setting, the Cox model is clearly the best approach.
Item Open Access
Modeling Time Series and Sequences: Learning Representations and Making Predictions (2015)
Lian, Wenzhao
The analysis of time series and sequences has been challenging in both the statistics and machine learning communities because of properties including high dimensionality, pattern dynamics, and irregular observations. In this thesis, novel methods are proposed to handle these difficulties, thus enabling representation learning (dimension reduction and pattern extraction) and prediction making (classification and forecasting). This thesis consists of three main parts.
The first part analyzes multivariate time series, which is often non-stationary due to high levels of ambient noise and various interferences. We propose a nonlinear dimensionality reduction framework using diffusion maps on a learned statistical manifold, which gives rise to the construction of a low-dimensional representation of the high-dimensional non-stationary time series. We show that diffusion maps, with affinity kernels based on the Kullback-Leibler divergence between the local statistics of samples, allow for efficient approximation of pairwise geodesic distances. To construct the statistical manifold, we estimate time-evolving parametric distributions by designing a family of Bayesian generative models. The proposed framework can be applied to problems in which the time-evolving distributions (of temporally localized data), rather than the samples themselves, are driven by a low-dimensional underlying process. We provide efficient parameter estimation and dimensionality reduction methodology and apply it to two applications: music analysis and epileptic-seizure prediction.
The second part focuses on a time series classification task, where we want to leverage the temporal dynamic information in the classifier design. Many time series classification problems, including fraud detection, require a low false alarm rate while still achieving a high positive detection rate. Therefore, we directly optimize the partial area under the curve (PAUC), which maximizes the accuracy in low false alarm rate regions. Latent variables are introduced to incorporate the temporal information while keeping the max-margin based method solvable. An optimization routine is proposed and its properties are analyzed; the algorithm is designed to scale to web-scale data. Simulation results demonstrate the effectiveness of optimizing the performance in the low false alarm rate regions.
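To make the PAUC criterion concrete, the sketch below computes an empirical partial AUC restricted to false alarm rates up to a chosen limit. It only evaluates a given set of scores; it is not the latent-variable max-margin optimizer described in the thesis. The function name, the normalization to the range [0, 1], and the example scores are assumptions made here. The idea is that only the highest-scoring negatives, those that would be false alarms at a strict operating point, are compared against the positives.

```python
import math


def partial_auc(scores, labels, max_fpr):
    """Empirical partial AUC over false positive rates in [0, max_fpr].

    Compares each positive with the top floor(max_fpr * n_neg) negatives
    (at least one); ties count one half. The result is normalized to [0, 1].
    """
    if not 0 < max_fpr <= 1:
        raise ValueError("max_fpr must be in (0, 1]")
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = sorted((s for s, y in zip(scores, labels) if y == 0), reverse=True)
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    j = max(1, math.floor(max_fpr * len(neg)))
    hardest = neg[:j]  # negatives most likely to trigger false alarms
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in hardest)
    return wins / (len(pos) * j)


# Hypothetical classifier scores; label 1 marks a positive (e.g. fraud) case.
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.3]
labels = [1, 0, 1, 0, 0, 0]
print(partial_auc(scores, labels, max_fpr=0.5))  # low false alarm region only
print(partial_auc(scores, labels, max_fpr=1.0))  # equals the ordinary AUC
```

With `max_fpr=1.0` every negative is included and the value reduces to the usual AUC. Lowering `max_fpr` focuses the measure on the negatives that are hardest to separate from the positives.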
The third part focuses on pattern extraction from correlated point process data, which consist of multiple correlated sequences observed at irregular times. The analysis of correlated point process data has wide applications, ranging from biomedical research to network analysis. We model such data as generated by a latent collection of continuous-time binary semi-Markov processes, corresponding to external events appearing and disappearing. A continuous-time modeling framework is more appropriate for multichannel point process data than a binning approach requiring time discretization, and we show connections between our model and recent ideas from the discrete-time literature. We describe an efficient MCMC algorithm for posterior inference, and apply our ideas to both synthetic data and a real-world biometrics application.
Item Open Access
Predicting Adolescent Mental Health Outcomes Across Cultures: A Machine Learning Approach. (Journal of youth and adolescence, 2023-04)
Rothenberg, W Andrew; Bizzego, Andrea; Esposito, Gianluca; Lansford, Jennifer E; Al-Hassan, Suha M; Bacchini, Dario; Bornstein, Marc H; Chang, Lei; Deater-Deckard, Kirby; Di Giunta, Laura; Dodge, Kenneth A; Gurdal, Sevtap; Liu, Qin; Long, Qian; Oburu, Paul; Pastorelli, Concetta; Skinner, Ann T; Sorbring, Emma; Tapanya, Sombat; Steinberg, Laurence; Tirado, Liliana Maria Uribe; Yotanyamaneewong, Saengduean; Alampay, Liane Peña
Adolescent mental health problems are rising rapidly around the world. To combat this rise, clinicians and policymakers need to know which risk factors matter most in predicting poor adolescent mental health. Theory-driven research has identified numerous risk factors that predict adolescent mental health problems but has difficulty distilling and replicating these findings. Data-driven machine learning methods can distill risk factors and replicate findings but have difficulty interpreting findings because these methods are atheoretical. This study demonstrates how data- and theory-driven methods can be integrated to identify the most important preadolescent risk factors in predicting adolescent mental health. Machine learning models examined which of 79 variables assessed at age 10 were the most important predictors of adolescent mental health at ages 13 and 17. These models were examined in a sample of 1176 families with adolescents from nine nations. Machine learning models accurately classified 78% of adolescents who were above-median in age 13 internalizing behavior, 77.3% who were above-median in age 13 externalizing behavior, 73.2% who were above-median in age 17 externalizing behavior, and 60.6% who were above-median in age 17 internalizing behavior.
Age 10 measures of youth externalizing and internalizing behavior were the most important predictors of age 13 and 17 externalizing/internalizing behavior, followed by family context variables, parenting behaviors, individual child characteristics, and finally neighborhood and cultural variables. The combination of theoretical and machine-learning models strengthens both approaches and, for approximately 7 of 10 adolescents, accurately predicts who will demonstrate above-average mental health difficulties 3-7 years after the data used in the machine learning models were collected.
Item Open Access
Prediction of Stock Market Price Index using Machine Learning and Global Trade Information (2020-10-30)
Wong, Eugene Lu Xian
Globalization has led to an increasingly integrated global economy, one with fewer trade barriers and more capital mobility between countries. Consequently, no country is an island of its own. This paper aims to investigate how global trade affects a country's stock market and to determine whether such information, combined with machine learning techniques, can predict a country's stock market index.
Item Open Access
Predictive Models for Point Processes (2015)
Lian, Wenzhao
Point process data are commonly observed in fields like healthcare and social science. Designing predictive models for such event streams is an under-explored problem, due to often scarce training data. In this thesis, a multitask point process model via a hierarchical Gaussian Process (GP) is proposed, to leverage statistical strength across multiple point processes. Nonparametric learning functions implemented by a GP, which map from past events to future rates, allow analysis of flexible arrival patterns. To facilitate efficient inference, a sparse construction for this hierarchical model is proposed, and a variational Bayes method is derived for learning and inference. Experimental results are shown on both synthetic data and real electronic health-records data.