Browsing by Subject "survival analysis"
Item Open Access Genetic variants of DOCK2, EPHB1 and VAV2 in the natural killer cell-related pathway are associated with non-small cell lung cancer survival. (American journal of cancer research, 2021-01) Du, Hailei; Liu, Lihua; Liu, Hongliang; Luo, Sheng; Patz, Edward F; Glass, Carolyn; Su, Li; Du, Mulong; Christiani, David C; Wei, Qingyi
Although natural killer (NK) cells are a known major player in anti-tumor immunity, the effect of genetic variation in NK-associated genes on survival in patients with non-small cell lung cancer (NSCLC) remains unknown. Here, in 1,185 NSCLC cases of a discovery dataset, we evaluated associations of 28,219 single nucleotide polymorphisms (SNPs) in 276 NK-associated genes with survival. These patients were from the reported genome-wide association study (GWAS) from the Prostate, Lung, Colorectal and Ovarian (PLCO) Cancer Screening Trial. We further validated the findings in an additional 984 cases from the Harvard Lung Cancer Susceptibility (HLCS) Study. We identified three SNPs (i.e., DOCK2 rs261083 G>C, VAV2 rs2519996 C>T and EPHB1 rs36215 A>G) to be independently associated with overall survival (OS) in NSCLC cases, with adjusted hazard ratios (HRs) of 1.16 (95% confidence interval [CI] = 1.07-1.26, P = 3.34×10⁻⁴), 1.28 (1.12-1.47, P = 4.57×10⁻⁴) and 0.75 (0.67-0.83, P = 1.50×10⁻⁷), respectively. Additional joint assessment of the unfavorable genotypes of the three SNPs showed significant associations with OS and disease-specific survival of NSCLC cases in the PLCO dataset (P trend<0.0001 and <0.0001, respectively). Moreover, the survival-associated DOCK2 rs261083 C allele had a significant correlation with reduced DOCK2 transcript levels in lung adenocarcinoma (LUAD), while the rs36215 G allele was significantly correlated with reduced EPHB1 transcript levels in lymphoblastoid cell lines in the 1000 Genomes Project.
These results revealed that DOCK2 and EPHB1 genetic variants may be prognostic biomarkers of NSCLC survival, likely via transcription regulation of the respective genes.
Item Open Access Improving Clinical Prediction Models with Statistical Representation Learning (2021) Xiu, Zidi
This dissertation studies novel statistical machine learning approaches for healthcare risk prediction applications in the presence of challenging scenarios, such as rare events, noisy observations, data imbalance, missingness and censoring. Such scenarios manifest frequently in practice, and they compromise the validity of standard predictive models, which often expect clean and complete data. As such, alleviating the negative impacts of real-world data challenges is of great significance and constitutes the overarching goal of this dissertation, which investigates novel strategies to (i) account for data uncertainties and statistical characteristics of low-prevalence events; (ii) re-balance and augment the representations for minority data under proper causal assumptions; and (iii) dynamically assign scores and attend to the observed units to derive robust features. By integrating ideas from representation learning, variational Bayes, causal inference, and contrastive training, this dissertation builds tools for risk modeling frameworks that are robust to various peculiarities of real-world datasets and yield reliable individualized risk evaluations.
This dissertation starts with a systematic review of classical risk prediction models in Chapter 1, and discusses the new opportunities and challenges presented by the big data era. With the increasing availability of healthcare data and the current rapid development of machine learning models, clinical decision support systems have seen new opportunities to improve clinical practice. However, in healthcare risk prediction applications, statistical analysis is challenged not only by data incompleteness and skewed distributions but also by the complexity of the inputs. To motivate the subsequent developments, the chapter discusses the limitations of risk minimization methods, robustness to high-dimensional incomplete data, and the need for individualization.
As a concrete example addressing a canonical problem, Chapter 2 proposes a variational disentanglement approach that learns semi-parametrically from heavily imbalanced binary classification datasets. In this new method, named Variational Inference for Extremals (VIE), we apply extreme value theory to enable efficient learning with few observations. By organically integrating the generalized additive model and isotonic neural nets, VIE enjoys the merits of improved robustness, interpretability, and generalizability for the accurate prediction of rare events. An analysis of the COVID-19 cohort from Duke Hospitals demonstrates that the proposed approach outperforms competing solutions. In Chapter 3, we investigate the more general setting of multi-class classification with heavily imbalanced data, from the perspective of causal machine learning, to promote sample efficiency and model generalization. Our solution, named Energy-based Causal Representation Transfer (ECRT), posits a meta-distributional scenario, where the data generating mechanism for label-conditional features is invariant across different labels. Such a causal assumption enables efficient knowledge transfer from the dominant classes to their under-represented counterparts, even if their feature distributions show apparent disparities. This allows us to leverage a causally informed data augmentation procedure based on nonlinear independent component analysis to enrich the representation of minority classes and simultaneously whiten the data. The effectiveness and enhanced prediction accuracy are demonstrated on synthetic data and real-world benchmarks in comparison with state-of-the-art models.
In Chapter 4 we address time-to-event prediction with censored (missing in outcomes) and incomplete (missing in covariates) observations. Also known as survival analysis, time-to-event prediction plays a crucial role in many clinical applications, yet classical survival solutions scale poorly with respect to data complexity and incompleteness. To better handle sophisticated modern health data and alleviate the impact of real-world data challenges, we introduce a self-attention based model that captures helpful information for time-to-event prediction, called Energy-based Latent Self-Attentive Survival Analysis (ELSSA). A key novelty of this approach is the integration of a contrastive, mutual-information-based loss that non-parametrically maximizes the informativeness of the learned data representations. The effectiveness of our approaches has been extensively validated on synthetic and real-world benchmarks, showing improved performance relative to competing solutions.
In summary, this dissertation presents flexible risk prediction frameworks that acknowledge representation uncertainties, data heterogeneity and incompleteness. Altogether, it makes three contributions: more efficient learning from imbalanced data, enhanced robustness to missing data, and better generalizability to out-of-sample subjects.
Item Open Access Modeling and Methodological Advances in Causal Inference (2021) Zeng, Shuxi
This thesis presents several novel modeling and methodological advancements in causal inference. First, we investigate the use of propensity score weighting for covariate adjustment in randomized trials. We introduce the class of balancing weights and study its theoretical properties. We demonstrate that it is asymptotically equivalent to the analysis of covariance (ANCOVA) and derive the closed-form variance estimator. We further recommend the overlap weighting estimator based on its semiparametric efficiency and good finite-sample performance. Next, we focus on comparative effectiveness studies with survival outcomes. As opposed to the approach coupled with a Cox proportional hazards model, we follow a "once for all" approach and construct pseudo-observations of the censored outcomes. We study the theoretical properties of the propensity score weighting estimator based on pseudo-observations and provide closed-form variance estimators. The third contribution lies in the domain of causal mediation analysis, which studies how much of the treatment effect is mediated or explained through a given intermediate variable. The existing approaches are not directly applicable to scenarios where both the mediator and the outcome are measured on sparse and irregular time grids. We propose a causal mediation framework that treats the sparse and irregular data as realizations of smooth processes, and we provide the assumptions for nonparametric identification. We also provide a functional principal component analysis (FPCA) approach for estimation and carry out inference under a Bayesian paradigm. Furthermore, we study how to achieve double robustness with machine learning approaches. We develop a new algorithm that learns double-robust representations in observational studies.
The proposed method can learn the low-dimensional representations and the balancing weights simultaneously. Lastly, we study how to build a robust prediction model by exploiting causal relationships. From a causal perspective, we argue that robust models should capture stable causal relationships rather than spurious correlations. We propose a causal transfer random forest method that efficiently learns the stable causal relationships from large-scale observational data and a small amount of randomized data. We provide theoretical justifications and validate the algorithm empirically with synthetic experiments and real-world prediction tasks.
In summary, this thesis makes contributions to the following three major areas in causal inference: (i) propensity score weighting methods for randomized experiments and observational studies, covering (a) randomized controlled trials (Chapter 2) and (b) survival outcomes (Chapter 3); (ii) causal mediation analysis with sparse and irregular longitudinal data (Chapter 4); (iii) machine learning methods for causal inference, covering (a) double robustness (Chapter 5) and (b) the causal transfer random forest (Chapter 6).
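The overlap weighting estimator recommended in the abstract above can be illustrated with a small sketch. It assumes the standard definition of overlap weights (each treated unit is weighted by 1 - e(x), each control by e(x), where e(x) is the propensity score) and that the propensity scores have already been estimated. The function names `overlap_weights` and `weighted_mean_difference` are hypothetical and are not taken from the thesis; the thesis's variance estimators are not reproduced here.

```python
def overlap_weights(treated, propensity):
    """Return overlap weights: 1 - e(x) for treated units, e(x) for controls.

    `treated` holds 0/1 treatment indicators and `propensity` holds the
    (assumed already estimated) probabilities of receiving treatment.
    """
    return [1.0 - e if z == 1 else e for z, e in zip(treated, propensity)]


def weighted_mean_difference(outcome, treated, weights):
    """Difference of weighted outcome means between treated and control units."""
    sums = {0: 0.0, 1: 0.0}    # weighted outcome totals per arm
    totals = {0: 0.0, 1: 0.0}  # weight totals per arm
    for y, z, w in zip(outcome, treated, weights):
        sums[z] += w * y
        totals[z] += w
    return sums[1] / totals[1] - sums[0] / totals[0]


if __name__ == "__main__":
    # Toy data, purely illustrative.
    treated = [1, 1, 0, 0]
    propensity = [0.8, 0.4, 0.6, 0.2]
    outcome = [5.0, 1.0, 2.0, 0.0]
    weights = overlap_weights(treated, propensity)
    print(weighted_mean_difference(outcome, treated, weights))
```

Units whose propensity score is near 0 or 1 receive small weights, so the estimate emphasizes the region where treated and control units overlap.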
Item Open Access Novel Genetic Variants of ALG6 and GALNTL4 of the Glycosylation Pathway Predict Cutaneous Melanoma-Specific Survival. (Cancers, 2020-01-24) Zhou, Bingrong; Zhao, Yu Chen; Liu, Hongliang; Luo, Sheng; Amos, Christopher I; Lee, Jeffrey E; Li, Xin; Nan, Hongmei; Wei, Qingyi
Because aberrant glycosylation is known to play a role in the progression of melanoma, we hypothesize that genetic variants of glycosylation pathway genes are associated with the survival of cutaneous melanoma (CM) patients. To test this hypothesis, we used a Cox proportional hazards regression model in a single-locus analysis to evaluate associations between 34,096 genetic variants of 227 glycosylation pathway genes and CM disease-specific survival (CMSS) using genotyping data from two previously published genome-wide association studies. The discovery dataset included 858 CM patients with 95 deaths from The University of Texas MD Anderson Cancer Center, and the replication dataset included 409 CM patients with 48 deaths from Harvard University nurse/physician cohorts. In the multivariable Cox regression analysis, we found that two novel single-nucleotide polymorphisms (SNPs) (ALG6 rs10889417 G>A and GALNTL4 rs12270446 G>C) predicted CMSS, with adjusted hazard ratios of 0.60 (95% confidence interval = 0.44-0.83, p = 0.002) and 0.66 (0.52-0.84, p = 0.004), respectively. Subsequent expression quantitative trait loci (eQTL) analysis revealed that ALG6 rs10889417 was associated with mRNA expression levels in cultured skin fibroblasts and whole blood cells and that GALNTL4 rs12270446 was associated with mRNA expression levels in skin tissues (all p < 0.05).
Our findings suggest that, once validated in other large patient cohorts, these two novel SNPs in the glycosylation pathway genes may be useful prognostic biomarkers for CMSS, likely through modulating their gene expression.
Item Open Access On the Advancement of Probabilistic Models of Decompression Sickness (2016) Hada, Ethan Alexander
The work presented in this dissertation is focused on applying engineering methods to develop and explore probabilistic survival models for the prediction of decompression sickness in US Navy divers. Mathematical modeling, computational model development, and numerical optimization techniques were employed to formulate and evaluate the predictive quality of models fitted to empirical data. In Chapters 1 and 2 we present general background information relevant to the development of probabilistic models applied to predicting the incidence of decompression sickness. The remainder of the dissertation introduces techniques developed in an effort to improve the predictive quality of probabilistic decompression models and to reduce the difficulty of model parameter optimization.
The first project explored seventeen variations of the hazard function using a well-perfused parallel compartment model. Models were parametrically optimized using the maximum likelihood technique. Model performance was evaluated using both classical statistical methods and model selection techniques based on information theory. Optimized model parameters were overall similar to those previously published. Results indicated that a novel hazard function definition that included both ambient pressure scaling and individually fitted compartment exponent scaling terms.
We developed ten pharmacokinetic compartmental models that included explicit delay mechanics to determine whether predictive quality could be improved through the inclusion of material transfer lags. A fitted discrete delay parameter augmented the inflow to the compartment systems from the environment. Based on the observation that, for many of our models, symptoms are often reported after risk accumulation begins, we hypothesized that the inclusion of delays might improve the correlation between model predictions and observed data. Model selection techniques identified two models as having the best overall performance, but both a comparison to the best-performing model without delay and model selection using our best identified no-delay pharmacokinetic model indicated that the delay mechanism was not statistically justified and did not substantially improve model predictions.
Our final investigation explored parameter bounding techniques to identify parameter regions in which statistical model failure will not occur. Statistical model failure occurs when a model predicts zero probability of a diver experiencing decompression sickness for an exposure that is known to produce symptoms. Using a metric related to the instantaneous risk, we successfully identify regions where model failure will not occur and locate the boundaries of these regions using a root bounding technique. Several models are used to demonstrate the techniques, which may be employed to reduce the difficulty of model optimization in future investigations.
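The link between a hazard function, the predicted probability of an event, and the "statistical model failure" described above can be sketched as follows. This is a generic survival-model illustration, assuming the usual relation P(event) = 1 - exp(-∫ r(t) dt) for a hazard r(t); it is not the dissertation's compartment model, and the function names and sample values are hypothetical.

```python
import math


def cumulative_hazard(times, hazards):
    """Integrate sampled hazard values over time with the trapezoidal rule."""
    total = 0.0
    for i in range(1, len(times)):
        dt = times[i] - times[i - 1]
        total += 0.5 * (hazards[i] + hazards[i - 1]) * dt
    return total


def event_probability(times, hazards):
    """P(event) = 1 - exp(-cumulative hazard)."""
    return 1.0 - math.exp(-cumulative_hazard(times, hazards))


def log_likelihood(observed_event, probability):
    """Log-likelihood of one binary outcome given the predicted probability.

    Returns -inf when the model assigns zero probability to what was observed,
    which is the failure case: no finite likelihood exists for that exposure.
    """
    p = probability if observed_event else 1.0 - probability
    return math.log(p) if p > 0.0 else float("-inf")


if __name__ == "__main__":
    times = [0.0, 1.0, 2.0]
    # A parameter set that accumulates some risk during the exposure.
    print(log_likelihood(1, event_probability(times, [0.0, 0.1, 0.1])))
    # A parameter set with zero hazard throughout: an observed event
    # then has probability zero and the log-likelihood is -inf.
    print(log_likelihood(1, event_probability(times, [0.0, 0.0, 0.0])))
```

Because a single exposure with log-likelihood of negative infinity makes the total log-likelihood negative infinity, an optimizer that wanders into such a parameter region cannot make progress, which is one way to see why bounding the admissible parameter region can ease optimization.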
Item Open Access Probabilistic Time-to-Event Modeling Approaches for Risk Profiling (2021) Chapfuwa, Paidamoyo
Modern health data science applications leverage abundant molecular and electronic health data, providing opportunities for machine learning to build statistical models to support clinical practice. Time-to-event analysis, also called survival analysis, stands as one of the most representative examples of such statistical models. Models for predicting the time of a future event are crucial for risk assessment across a diverse range of applications, e.g., drug development, risk profiling, and clinical trials, and such data are also relevant in fields like manufacturing (e.g., for equipment monitoring). Existing time-to-event (survival) models have focused primarily on preserving the pairwise ordering of estimated event times (i.e., relative risk).
In this dissertation, we propose neural time-to-event models that account for calibration and uncertainty, while predicting accurate absolute event times. Specifically, we introduce an adversarial nonparametric model for estimating matched time-to-event distributions for probabilistically concentrated and accurate predictions. We consider replacing the discriminator of the adversarial nonparametric model with a survival-function matching estimator that accounts for model calibration. The proposed estimator can be used as a means of estimating and comparing conditional survival distributions while accounting for the predictive uncertainty of probabilistic models.
Moreover, we introduce a theoretically grounded unified counterfactual inference framework for survival analysis, which adjusts for bias from two sources, namely, confounding (from covariates influencing both the treatment assignment and the outcome) and censoring (informative or non-informative). To account for censoring biases, a proposed flexible and nonparametric probabilistic model is leveraged for event times. Then, we formulate a model-free nonparametric hazard ratio metric for comparing treatment effects or leveraging prior randomized real-world experiments in longitudinal studies. Further, the proposed model-free hazard-ratio estimator can be used to identify or stratify heterogeneous treatment effects. For stratifying risk profiles, we formulate an interpretable time-to-event driven clustering method for observations (patients) via a Bayesian nonparametric stick-breaking representation of the Dirichlet Process.
Finally, through experiments on real-world datasets, consistent improvements in predictive performance and interpretability are demonstrated relative to existing state-of-the-art survival analysis models.
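The pairwise-ordering criterion mentioned in the abstract above is commonly measured with a concordance index. The sketch below is a minimal, quadratic-time version of Harrell's concordance index for right-censored data, assuming higher predicted times mean later events; the function name and argument layout are hypothetical and it is not the dissertation's evaluation code. It also shows why ordering alone says nothing about calibration: rescaling every predicted time leaves the index unchanged.

```python
def concordance_index(event_times, event_observed, predicted_times):
    """Fraction of comparable pairs whose predicted ordering matches the data.

    A pair (i, j) is comparable when subject i had an observed event and
    event_times[i] < event_times[j]. Tied predictions count as half.
    """
    concordant = 0.0
    comparable = 0
    n = len(event_times)
    for i in range(n):
        if not event_observed[i]:
            continue  # a censored subject cannot anchor a comparable pair
        for j in range(n):
            if event_times[i] < event_times[j]:
                comparable += 1
                if predicted_times[i] < predicted_times[j]:
                    concordant += 1.0
                elif predicted_times[i] == predicted_times[j]:
                    concordant += 0.5
    if comparable == 0:
        raise ValueError("no comparable pairs in the data")
    return concordant / comparable


if __name__ == "__main__":
    times = [1.0, 2.0, 3.0]
    observed = [1, 1, 0]  # the last subject is censored
    predictions = [1.5, 2.5, 4.0]
    print(concordance_index(times, observed, predictions))
    # Multiplying all predictions by 100 keeps the ordering, hence the index,
    # even though the absolute event-time predictions are now far off.
    print(concordance_index(times, observed, [100 * p for p in predictions]))
```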