Browsing by Department "Biostatistics and Bioinformatics Doctor of Philosophy"
Item Open Access: AN INTEGRATIVE MODELING FRAMEWORK FOR MULTIVARIATE LONGITUDINAL CLINICAL DATA (2024), Choi, Dongrak. This dissertation centers on the development of innovative statistical methodologies tailored to the complex nature of multivariate data associated with Parkinson's disease (PD). PD manifests impairments across both behavioral and cognitive domains, and no disease-modifying treatments are currently available. Consequently, researchers typically gather a diverse array of data types, encompassing binary, continuous, and categorical outcomes as well as time-to-event data, in order to gain a comprehensive understanding of the multifaceted aspects of this disorder. Furthermore, because PD studies involve long follow-up, intermittent missing data are a common challenge and can introduce bias into the analysis. In this dissertation, we introduce novel approaches for jointly modeling multivariate longitudinal outcomes and time-to-event data using functional latent trait models and generalized multivariate functional mixed models to accommodate the different data types. Additionally, we present a method for detecting and handling missing data patterns using joint modeling techniques.
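For orientation, a common shared-latent-process formulation of such a joint model is sketched below; this is only a schematic of the general idea, not necessarily the exact specification developed in the dissertation. Each of the K mixed-type longitudinal outcomes is tied to a subject-level latent process through an outcome-specific link and loading, and the same process enters the hazard:

```latex
% Schematic joint model (illustrative): K mixed-type longitudinal outcomes and a
% time-to-event endpoint linked through a subject-level latent process U_i(t).
\[
  g_k\!\left\{ E\big[Y_{ik}(t) \mid U_i(t)\big] \right\}
    = \mathbf{x}_{ik}^{\top}(t)\,\boldsymbol{\beta}_k + \lambda_k\, U_i(t),
  \qquad k = 1, \dots, K,
\]
\[
  h_i\big(t \mid U_i\big) = h_0(t)\,
    \exp\!\big\{ \mathbf{z}_i^{\top}\boldsymbol{\gamma} + \alpha\, U_i(t) \big\},
\]
% where g_k is the link for outcome type k (identity, logit, ...), the loadings
% \lambda_k play the role of latent-trait parameters, and \alpha measures the
% association between the longitudinal process and the hazard.
```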
Item Open Access: Assessing Interchangeability among Raters with Continuous Outcomes in Agreement Studies (2020), Wang, Tongrong. In various medical settings, new raters become available to take measurements for the evaluation of medical conditions. One may want to use new and existing raters simultaneously, or to replace the existing raters with the new one because of lower cost or better portability. In both situations, the raters should be interchangeable, in the sense that it makes no clinical difference which rater measures a subject in the population of interest. This is a problem of claiming sufficient agreement among the raters. Most existing methods for assessing agreement are limited to two raters, and those for more than two raters suffer from issues of interpretability or unrealistic assumptions. This dissertation proposes new overall agreement indices for multiple raters by extending the preferred pairwise agreement indices: the coverage probability, the total deviation index, and the relative area under the coverage probability curve. The new indices have intuitive interpretations in terms of the clinical judgment of interchangeability. A unified generalized estimating equation (GEE) approach is developed for inference. Simulation studies are conducted to assess the performance and theoretical properties of the proposed approach, and a blood pressure dataset is used for illustration. Because the literature on sample size calculation for agreement studies is limited, this dissertation also investigates sample size formulas with pre-specified power when one of the proposed overall agreement indices is used to claim satisfactory interchangeability. While the sample size formulas based on inference within the GEE framework are somewhat complicated, simplified formulas giving conservative sample size estimates are also proposed for easy implementation. Our simulation studies indicate that the sample size formulas work well if the resulting number of subjects is at least 30, with each rater taking about 3 replicates. We demonstrate how to design an agreement study based on a pilot blood pressure dataset. The U.S. Food and Drug Administration recommends a regression-based approach when a new device is intended to replace a commercially marketed device. In the third part, we discuss the potential pitfalls of this approach and compare it with the coverage probability approach, the currently preferred approach for assessing agreement. A respiratory rate dataset is used to illustrate the issues with the regression-based approach.
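For reference, the pairwise agreement indices being extended are standard quantities in the agreement literature; for the difference D = Y_1 - Y_2 between two raters' readings on the same subject they can be written as follows (definitions as commonly stated, not the dissertation's multi-rater generalizations):

```latex
\[
  \mathrm{CP}(\delta) = P\big(|D| \le \delta\big),
  \qquad
  \mathrm{TDI}(p) = \inf\big\{ d : P(|D| \le d) \ge p \big\},
\]
% CP(delta) is the probability that two readings agree within a clinically
% acceptable difference delta; TDI(p) is the p-th quantile of |D|; and the relative
% area under the coverage probability curve summarizes CP(delta) over a range of
% delta values.
```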
Item Embargo: Design and Analysis of Clinical Trials with Restricted Mean Survival Time (2024), Hua, Kaiyuan. Restricted mean survival time (RMST), a summary of survival time up to a pre-specified, clinically relevant truncation time, is increasingly recognized as a measure of treatment effect in recent biomedical studies with time-to-event endpoints. The difference or ratio of RMST between two groups (e.g., treatment versus control) measures the relative treatment effect in terms of a gain or loss of survival time. The RMST offers greater flexibility than the hazard ratio (HR), which is typically estimated from the Cox proportional hazards model under the proportional hazards (PH) assumption. Because of delayed treatment effects or other biomedical reasons, the PH assumption is often violated in oncology and cardiovascular trials, leading to biased estimation and misleading interpretations of treatment effects. Compared with the HR, the RMST requires no PH assumption and offers a more straightforward interpretation of treatment effects. In this dissertation, we propose novel RMST-based methodologies for clinical trials with time-to-event endpoints in three research areas: 1) individual participant data network meta-analysis, 2) inference in multi-regional clinical trials, and 3) biomarker-guided adaptive and enrichment designs.
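For reference, the RMST and the two contrasts mentioned above can be written as follows, with S_a(t) the survival function in arm a and tau the pre-specified truncation time:

```latex
\[
  \mu_a(\tau) = E\big[\min(T_a, \tau)\big] = \int_0^{\tau} S_a(t)\, dt,
  \qquad
  \Delta(\tau) = \mu_1(\tau) - \mu_0(\tau),
  \qquad
  R(\tau) = \mu_1(\tau) \big/ \mu_0(\tau).
\]
```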
Item Embargo: Design and Inference Methods for Randomized Clinical Trials (2023), Cao, Shiwei. Traditionally, phase II trials have employed single-arm designs, recruiting patients exclusively for the experimental therapy and comparing results with historical controls. Because of limited sample sizes and patient heterogeneity, the characteristics of patients in new phase II trials often differ from those in the selected historical controls, leading to potential false positive or false negative conclusions. Randomized phase II trials offer a solution by randomizing patients between an experimental arm and a control arm. In this dissertation, we seek efficient designs for multi-stage randomized clinical trials and develop inference methods for the widely used odds ratio parameter. We propose a two-stage randomized phase II trial design based on Fisher's exact tests. This design includes options for early stopping due to either superiority or futility, aimed at optimizing patient enrollment whether or not the experimental therapy proves efficacious. Furthermore, we introduce a novel criterion, the weighted expected sample size, to define optimal designs for multi-stage clinical trials, and we have developed a Java software tool capable of identifying these optimal designs. Additionally, we present a bias-corrected estimator and an exact conditional confidence interval for the odds ratio in multi-stage randomized clinical trials.
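As a minimal illustration of the design's building block (not the dissertation's software, which implements the two-stage stopping rules and the weighted expected sample size criterion), the sketch below runs a one-sided Fisher's exact test comparing response rates between the two arms on made-up stage-1 counts:

```python
# Minimal sketch: Fisher's exact test comparing response rates between an
# experimental arm and a control arm (hypothetical stage-1 counts).
from scipy.stats import fisher_exact

table = [[12, 18],   # experimental arm: 12 responders, 18 non-responders
         [6, 24]]    # control arm:       6 responders, 24 non-responders
odds_ratio, p_value = fisher_exact(table, alternative="greater")
print(f"one-sided Fisher's exact p-value = {p_value:.3f}")
```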
Item Open Access: Design and Monitoring of Clinical Trials with Clustered Time-to-Event Endpoint (2020), Li, Jianghao. Many clinical trials involve clustered data consisting of groups (called clusters) of nested subjects (called subunits). Observations from subunits within each cluster tend to be positively correlated due to shared characteristics, so the analysis of such data must account for the dependency between subunits. For clustered time-to-event endpoints, only a few methods have been proposed for sample size calculation, especially when cluster sizes are variable. In this dissertation, we derive sample size formulas for clustered survival endpoints based on nonparametric weighted rank tests. First, we propose closed-form sample size formulas for cluster randomization trials and subunit randomization trials, and accordingly derive the intracluster correlation coefficient for clustered time-to-event endpoints. We find that the required number of clusters is affected not only by the mean cluster size but also by the variance of the cluster size distribution. In addition, we prove that in group sequentially monitored cluster randomization studies, the log-rank statistic does not have the independent increments property, in contrast to the result for independent survival data. We further derive the limiting distribution of sequentially computed log-rank statistics and develop a group sequential testing procedure based on the alpha spending approach, along with a corresponding sample size calculation method.
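As a point of comparison, the standard design-effect heuristic from the CRT literature for non-survival outcomes already shows how both the mean and the variability of cluster size inflate the required sample size; the dissertation derives the analogous quantities for clustered time-to-event endpoints under weighted rank tests:

```latex
\[
  \mathrm{DE} = 1 + \big\{ (\mathrm{CV}^2 + 1)\,\bar{m} - 1 \big\}\,\rho,
\]
% where \bar{m} is the mean cluster size, CV is the coefficient of variation of the
% cluster sizes, and \rho is the intracluster correlation coefficient.
```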
Item Embargo: Develop Novel Statistical and Computational Methods for Omics Data Analysis (2023), Gao, Qi. Recent advances in sequencing technologies have enabled the measurement of gene expression and other omics profiles at multi-cell, single-cell, or subcellular resolution. However, these advances also pose challenges for data analysis, such as identifying differentially expressed feature gene sets with high accuracy and benchmarking computational methods for various analysis tasks on data with complex heterogeneity. In my dissertation, we have focused on developing novel statistical and computational methods to address these challenges.
In project 1, we developed SifiNet, a versatile pipeline to identify cell-subpopulation-specific feature genes, annotate cell subpopulations, and reveal their relationships. The major advantage of SifiNet is that it bypasses cell clustering and thus avoids the bias introduced by inaccurate clustering; as a result, SifiNet achieves significantly higher accuracy in feature gene identification and cell annotation than traditional two-step methods that rely on clustering. SifiNet can analyze both single-cell RNA sequencing (scRNA-seq) and single-cell sequencing assay for transposase-accessible chromatin (scATAC-seq) data, providing insight into multiomic cellular profiles.
In project 2, we developed GeneScape, a novel scRNA-seq data simulator that can simulate complex cellular heterogeneity. Existing scRNA-seq data simulators are limited in their ability to simulate data with complex or subtle cellular heterogeneities, especially for cells exhibiting both cell-type and cell-state differences (such as differences in cell cycle, senescence level, and DNA-damage level). GeneScape can successfully simulate gene expression for cells with complex heterogeneity structures.
In project 3, we developed GeneScape-S (GeneScape-Spatial), a simulator for spatially resolved transcriptomics (SRT) data. Existing SRT-specific simulators cannot fulfill customized needs such as simulating multi-layer data, mimicking local tissue heterogeneity, and accommodating mixed cell-type structures in low-resolution spots. To fill these gaps, we propose GeneScape-S, which preserves the expression and spatial patterns of real SRT data and offers specially designed functions tailored to these customized needs. GeneScape-S also incorporates the features of GeneScape to simulate complex heterogeneities.
Item Open Access: Developing Quantitative Models in Analyzing High-throughput Sequencing Data (2021), Kim, Young-Sook. Diverse functional genomics assays have been developed and have helped investigate complex gene regulation in various biological conditions. For example, RNA-seq has been used to capture gene expression in diverse human tissues, helping to study tissue-common and tissue-specific gene regulation, and ChIP-seq has been used to identify the genomic regions bound by numerous transcription factors, helping to identify collaborative and competitive binding mechanisms of those factors. Despite this huge increase in the amount and accessibility of genomic data, several challenges remain in analyzing these data with proper statistical methods. Some assays, such as STARR-seq, lack a statistical model that detects both activated and repressed regulatory elements, forcing researchers to rely on models developed for other assays. Other assays, such as ChIP-seq and RNA-seq, have few joint analysis models that are both flexible and computationally scalable, limiting the statistical power for identifying genomic regions or genes shared by multiple biological conditions. To address these challenges, we first developed a statistical model called correcting reads and analysis of differential active elements, or CRADLE, to analyze STARR-seq data. CRADLE removes technical biases that can confound the quantification of regulatory activity and then detects both activated and repressed regulatory elements. We observed that the corrected read counts improved the visualization of regulatory activity, allowing more accurate detection of regulatory elements. Indeed, through a simulation study, we showed that CRADLE significantly improved precision and recall in detecting regulatory elements compared with previous statistical approaches, and that the improvement was especially prominent in identifying repressed regulatory elements. Building on CRADLE, we adapted its statistical framework and developed a joint analysis model of multiple data for biology, or JAMMY, that can be applied to diverse high-throughput sequencing data. JAMMY is a flexible statistical model that jointly analyzes multiple conditions, identifies condition-shared and condition-specific genomic regions, and quantifies the preferential activity of a subset of biological conditions for each genomic region. We applied JAMMY to STARR-seq, ChIP-seq, and RNA-seq data and observed that JAMMY overall improved the precision and recall in identifying condition-shared activity compared with the traditional condition-by-condition analysis. This gain of statistical power from the joint analysis led us to find a novel co-binding of two transcription factors in our study. These results show the substantial advantages of using a joint analysis model when integrating genomic data from multiple biological conditions.
Item Open Access: EVALUATING AND INTERPRETING MACHINE LEARNING OUTPUTS IN GENOMICS DATA (2022), Fang, Jiyuan. In my dissertation, we have developed statistical and computational tools to evaluate and interpret machine learning outputs in genomics data. The first two projects focus on single-cell RNA-sequencing (scRNA-seq) data. In project 1, we evaluated the fit of widely used distribution families to scRNA-seq UMI counts and concluded that UMI counts of polyclonal cells follow gene-specific, cell-type-specific negative binomial (NB) distributions without zero-inflation. Based on this modeling, we proposed the working dispersion score (WDS) to select genes that are differentially expressed across cell types. In project 2, we developed a new internal (unsupervised) index, the Clustering Deviation Index (CDI), to evaluate cell label sets obtained from clustering algorithms. We conducted in silico and experimental scRNA-seq studies to show that CDI can select the optimal clustering label set. We also benchmarked CDI against other internal indices in terms of agreement with external indices using high-quality benchmark label sets. In addition, we demonstrated that CDI is more computationally efficient than other internal indices, especially for million-scale datasets. In project 3, we proposed a model-agnostic hypothesis testing framework to interpret feature interactions underneath complex machine learning models. The simulation results demonstrated high power while controlling the type I error rate.
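As a toy illustration of the modeling idea in project 1 (not the dissertation's code, and not the WDS itself), the sketch below simulates NB-distributed UMI counts for one gene in one cell type and recovers the dispersion by the method of moments:

```python
# Minimal sketch: gene-specific, cell-type-specific NB counts without zero-inflation,
# with a moment-based dispersion estimate (Var(Y) = mu + phi * mu^2).
import numpy as np

def nb_moment_dispersion(counts):
    """Method-of-moments estimate of the NB dispersion phi."""
    counts = np.asarray(counts, dtype=float)
    mu, var = counts.mean(), counts.var(ddof=1)
    return max(var - mu, 0.0) / mu**2 if mu > 0 else np.nan

rng = np.random.default_rng(0)
mu, phi = 2.0, 0.5                          # assumed gene mean and dispersion
p = 1.0 / (1.0 + phi * mu)                  # NB size r = 1/phi, success prob p
y = rng.negative_binomial(n=1.0 / phi, p=p, size=5000)
print(nb_moment_dispersion(y))              # should be close to phi = 0.5
```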
Item Embargo: Exposomic modeling approaches for social and environmental determinants of health (2023), McCormack, Kara. Studies of human health have recently expanded to focus on the exposome paradigm, encompassing all exposures humans encounter from conception onward. However, the source, measurement, and volume intricacies inherent to these data have constrained the development of statistical methods for key research questions. The central theme of this work is to develop and test novel statistical methodologies that can address the challenges posed by the complex relationships among environmental exposures, socioeconomic distress, and health outcomes.
In this work, we explore three approaches to characterizing community health and its potential impact on several types of disease outcomes. In the first approach, we fit a latent class model to socioeconomic and comorbidity data and include the resulting classifications as fixed effects in an ecological spatial model of COVID-19 cases and deaths in NYC during two periods of the pandemic. In the second, we use a nonparametric Bayesian approach to form socioeconomic and pollution cluster profiles across US counties and then use these profiles to inform a Bayesian spatial model of breast cancer mortality for data from 2014. In the final approach, we use a latent network model traditionally employed in psychometrics research to explore structural racism. Using information from five domains (employment, education, housing, health, and criminal justice), we identify new variable complexes that illustrate the complex manifestations of structural racism at the census tract level in Pennsylvania.
Item Open Access: Extending Probabilistic Record Linkage (2020), Solomon, Nicole Chanel. Probabilistic record linkage is the task of combining multiple data sources for statistical analysis by identifying records pertaining to the same individual in different databases. The need to perform probabilistic record linkage arises in comparative effectiveness research and other clinical research scenarios when records in different databases do not share an error-free unique patient identifier. This dissertation seeks to develop new methodology for probabilistic record linkage to address two highly practical and recurring challenges: how to implement record linkage in a manner that optimizes downstream statistical analyses of the linked data, and how to efficiently link databases having a clustered or multi-level data structure.
In Chapter 2 we propose a new framework for balancing the tradeoff between false positive and false negative linkage errors when linked data are analyzed in a generalized linear model framework and non-linked records lead to missing data for the study outcome variable. Our method seeks to maximize the probability that the point estimate of the parameter of interest will have the correct sign and that the confidence interval around this estimate will correctly exclude the null value of zero. Using large sample approximations and a model for linkage errors, we derive expressions relating bias and hypothesis testing power to the user's choice of threshold that determines how many records will be linked. We use these results to propose three data-driven threshold selection rules. Under one set of simplifying assumptions we prove that maximizing asymptotic power requires that the threshold be relaxed at least until the point where all pairs with >50% probability of being a true match are linked.
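Stated schematically (this is a restatement of the criterion above, not the dissertation's notation), the threshold t is chosen to maximize the probability of a directionally correct and statistically significant estimate:

```latex
\[
  t^{\ast} = \arg\max_{t}\;
    P\Big( \operatorname{sign}\{\hat{\beta}(t)\} = \operatorname{sign}(\beta)
           \ \text{and}\ 0 \notin \mathrm{CI}(t) \Big),
\]
% where \hat{\beta}(t) and CI(t) denote the estimate and confidence interval
% obtained after linking all record pairs whose match probability exceeds t.
```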
In Chapter 3 we explore the consequences of linkage errors when the study outcome variable is determined by linkage status and so linkage errors may cause outcome misclassification. This scenario arises when the outcome is disease status and those linked are classified as having the disease while those not linked are classified as disease-free. We assume the parameter of interest can be expressed as a linear combination of binomial proportions having mean zero under the null hypothesis. We derive an expression for the asymptotic relative efficiency of a Wald test calculated with a misclassified outcome compared to an error-free outcome using a linkage error model and large sample approximations. We use this expression to generate insights for planning and implementing studies using record linkage.
In Chapter 4 we develop a modeling framework for linking files with a clustered data structure. Linking such clustered data is especially challenging when error-free identifiers are unavailable for both individual-level and cluster-level units. The proposed approach improves over current methodology by modeling inter-pair dependencies in clustered data and producing collective link decisions. It is novel in that it models both record attributes and record relationships, and resolves match statuses for individual-level and cluster-level units simultaneously. We show that linkage probabilities can be estimated without labeled training data using assumptions that are less restrictive compared to existing record linkage models. Using Monte Carlo simulations based on real study data, we demonstrate its advantages over the current standard method.
Item Open Access: Extending the Weighted Generalized Score Statistic for Comparison of Correlated Means (2023), Jones, Aaron Douglas. The generalized score (GS) statistic is widely used to test hypotheses about mean model parameters in the generalized estimating equations (GEE) framework. However, when comparing predictive values of two diagnostic tests in a paired study design, or comparing correlated proportions between two unequally sized groups with both paired and independent outcomes, GS has been shown neither to adequately control the type I error rate nor to reduce to the score statistic under independence. Weighting the residuals in the empirical variance estimate by the ratio of the two groups' sample sizes produces a weighted generalized score (WGS) statistic that resolves these issues and is now used in the diagnostic testing literature. Potential improvements from weighting in more general uses of GS have not previously been investigated. This dissertation extends the WGS method in several ways. Formulas are derived to extend the WGS statistic for paired and/or independent data from two binary proportions to two means in a quasi-likelihood model with any suitable link and variance functions, assuming finite fourth moments. The asymptotic convergence of WGS to the chi-square distribution in these general cases is proven. Finite-sample type I error rates are compared between GS and WGS, for which purpose the variance of the test statistic's denominator (i.e., the variance of the empirical variance estimator) is proposed as an analytic heuristic. New weights are derived to optimize this variance-of-the-denominator criterion for approximate type I error control. Simulation results verify that the heuristically optimal weights achieve type I error rates closer to the nominal alpha level than GS or WGS for combinations of correlation and sample size where either GS or WGS demonstrates poor control.
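For orientation, the GS statistic and the weighting idea behind WGS can be sketched as follows; the notation is illustrative, and the exact form of the weights follows the WGS literature rather than this schematic:

```latex
% Schematic: U(beta~) is the GEE estimating function evaluated at the
% null-restricted estimate, and Sigma-hat is an empirical (sandwich-type) estimate
% of its variance built from residual cross-products.
\[
  T_{\mathrm{GS}} = U(\tilde{\boldsymbol{\beta}})^{\top}\,
    \widehat{\Sigma}^{-1}\, U(\tilde{\boldsymbol{\beta}}),
  \qquad
  \widehat{\Sigma} = \sum_{g=1}^{2} \sum_{i=1}^{n_g}
    \hat{U}_{gi}\hat{U}_{gi}^{\top}.
\]
% The WGS statistic reweights each group's residual cross-products by the ratio of
% the two groups' sample sizes, which restores type I error control and makes the
% statistic reduce to the ordinary score statistic under independence.
```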
Item Open Access: Improving Clinical Prediction Models with Statistical Representation Learning (2021), Xiu, Zidi. This dissertation studies novel statistical machine learning approaches for healthcare risk prediction in the presence of challenging scenarios such as rare events, noisy observations, data imbalance, missingness, and censoring. Such scenarios arise frequently in practice and compromise the validity of standard predictive models, which often expect clean and complete data. Alleviating the negative impact of these real-world data challenges is therefore of great significance and constitutes the overarching goal of this dissertation, which investigates novel strategies to (i) account for data uncertainties and the statistical characteristics of low-prevalence events; (ii) re-balance and augment the representations of minority data under proper causal assumptions; and (iii) dynamically assign scores to, and attend over, the observed units to derive robust features. By integrating ideas from representation learning, variational Bayes, causal inference, and contrastive training, this dissertation builds risk modeling frameworks that are robust to various peculiarities of real-world datasets and yield reliable individualized risk evaluations.
This dissertation starts with a systematic review of classical risk prediction models in Chapter 1 and discusses the new opportunities and challenges presented by the big data era. With the increasing availability of healthcare data and the rapid development of machine learning models, clinical decision support systems have new opportunities to improve clinical practice. However, in healthcare risk prediction applications, statistical analysis is challenged not only by data incompleteness and skewed distributions but also by the complexity of the inputs. To motivate the subsequent developments, we discuss the limitations of risk minimization methods, robustness against high-dimensional data with incompleteness, and the need for individualization.
As a concrete example addressing a canonical problem, Chapter 2 proposes a variational disentanglement approach to semi-parametrically learn from heavily imbalanced binary classification datasets. In this new method, named Variational Inference for Extremals (VIE), we apply extreme value theory to enable efficient learning from few observations. By organically integrating a generalized additive model and isotonic neural nets, VIE enjoys improved robustness, interpretability, and generalizability for the accurate prediction of rare events. An analysis of the COVID-19 cohort from Duke Hospitals demonstrates that the proposed approach outperforms competing solutions. In Chapter 3 we investigate the more general setting of multi-class classification with heavily imbalanced data, from the perspective of causal machine learning, to promote sample efficiency and model generalization. Our solution, named Energy-based Causal Representation Transfer (ECRT), posits a meta-distributional scenario in which the data-generating mechanism for label-conditional features is invariant across labels. This causal assumption enables efficient knowledge transfer from the dominant classes to their under-represented counterparts, even if their feature distributions show apparent disparities. It also allows us to leverage a causally informed data augmentation procedure based on nonlinear independent component analysis to enrich the representation of minority classes and simultaneously whiten the data. The effectiveness and enhanced prediction accuracy are demonstrated on synthetic data and real-world benchmarks in comparison with state-of-the-art models.
In Chapter 4 we address time-to-event prediction with censored (missing outcome) and incomplete (missing covariate) observations. Also known as survival analysis, time-to-event prediction plays a crucial role in many clinical applications, yet classical survival solutions scale poorly with data complexity and incompleteness. To better handle sophisticated modern health data and alleviate the impact of real-world data challenges, we introduce a self-attention-based model, Energy-based Latent Self-Attentive Survival Analysis (ELSSA), to capture information helpful for time-to-event prediction. A key novelty of this approach is the integration of a contrastive, mutual-information-based loss that non-parametrically maximizes the informativeness of the learned data representations. The effectiveness of our approaches has been extensively validated on synthetic and real-world benchmarks, showing improved performance relative to competing solutions.
In summary, this dissertation presents flexible risk prediction frameworks that acknowledge representation uncertainty, data heterogeneity, and incompleteness. Altogether, it makes three contributions: improved efficient learning from imbalanced data, enhanced robustness to missing data, and better generalizability to out-of-sample subjects.
Item Open Access: Marginal Methods for the Design and Analysis of Cluster Randomized Trials and Related Studies (2022), Wang, Xueqi. Cluster randomized trials (CRTs) are used to study the effectiveness of complex or community-level interventions across a diverse range of contexts. These contexts present a range of logistical and statistical challenges for the design and analysis of CRTs and related studies, such as individually randomized group treatment (IRGT) trials, in which clustering of outcomes arises. This dissertation, consisting of four distinct topics, uses real-world CRTs and IRGT trials to identify unanswered statistical challenges in the design and analysis of those trials, and then tackles those questions and provides solutions. All four topics adopt the marginal modeling framework to accommodate the correlated outcome data that arise in CRTs and IRGT trials, with two topics focused on design and two on analysis.
The two design-focused topics assume a marginal modeling framework in which data are analyzed using generalized estimating equations paired with matrix-adjusted estimating equations and a bias-corrected sandwich variance estimator for the correlation parameters. In the first topic, we develop methods for sample size and power calculations in four-level intervention studies when intervention assignment is carried out at any level, with a particular focus on CRTs, assuming arbitrary link and variance functions. We demonstrate that, under both balanced and unbalanced designs, empirical power corresponds well with that predicted by the proposed method for as few as 8 clusters. In the second topic, we develop sample size formulas for longitudinal IRGT trials under four models with different assumptions regarding the time effect. We show that empirical power corresponds well with that predicted by the proposed method for as few as 6 groups in the intervention arm.
The two analysis-focused topics address current challenges in the analysis of CRTs. In the first, we propose nine bias-corrected sandwich variance estimators for CRTs with time-to-event data analyzed through the marginal Cox model, evaluate their performance, and develop an R package, CoxBcv, to facilitate their implementation. Our results indicate that the optimal choice of bias-corrected sandwich variance estimator for CRTs with survival outcomes can depend on the variability of cluster sizes and can also differ depending on whether it is evaluated according to relative bias or type I error rate. In the second, we compare four generalized estimating equations analyses for CRTs when cluster sizes vary and the goal is to generalize to a hypothetical population of clusters. We conclude that an analysis using both an exchangeable working correlation matrix and weighting by inverse cluster size, which may be considered the natural analytic approach, can lead to incorrect results, and that an analysis with both an independence working correlation matrix and weighting by inverse cluster size is the only approach that always provides valid results.
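As a toy illustration of the comparison in the second analysis topic (simulated data; not the dissertation's code, and assuming the statsmodels GEE interface with per-observation weights), the sketch below fits a continuous CRT outcome under independence and exchangeable working correlation structures, each weighted by inverse cluster size:

```python
# Minimal sketch: GEE analyses of a simulated CRT with variable cluster sizes,
# comparing working correlation structures under inverse-cluster-size weights.
import numpy as np
import pandas as pd
import statsmodels.api as sm

rng = np.random.default_rng(1)
n_clusters, icc = 30, 0.05
sizes = rng.integers(10, 60, size=n_clusters)                 # variable cluster sizes
cluster = np.repeat(np.arange(n_clusters), sizes)
arm_cluster = rng.permutation(np.repeat([0, 1], n_clusters // 2))
b = rng.normal(0.0, np.sqrt(icc), size=n_clusters)            # cluster effects
arm = arm_cluster[cluster]
y = 0.3 * arm + b[cluster] + rng.normal(0.0, np.sqrt(1 - icc), size=cluster.size)

df = pd.DataFrame({"y": y, "arm": arm, "cluster": cluster})
df["w"] = 1.0 / df.groupby("cluster")["y"].transform("size")  # inverse cluster size

X = sm.add_constant(df[["arm"]])
for label, cov in [("independence", sm.cov_struct.Independence()),
                   ("exchangeable", sm.cov_struct.Exchangeable())]:
    fit = sm.GEE(df["y"], X, groups=df["cluster"], cov_struct=cov,
                 family=sm.families.Gaussian(), weights=df["w"]).fit()
    print(f"{label:>12}: estimate = {fit.params['arm']:.3f}, "
          f"robust SE = {fit.bse['arm']:.3f}")
```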
Item Open Access: Multiple Testing Embedded in an Aggregation Tree With Applications to Omics Data (2020), Pura, John. In my dissertation, I have developed computational methods for high-dimensional inference, motivated by the analysis of omics data. This dissertation is divided into two parts. The first part is motivated by flow cytometry data analysis, where a key goal is to identify sparse cell subpopulations that differ between two groups. I have developed an algorithm called multiple Testing Embedded on an Aggregation tree Method (TEAM) to locate where distributions differ between two samples. Regions containing differences can be identified in layers along the tree: the first layer searches for regions containing short-range, strong distributional differences, and higher layers search for regions containing long-range, weak distributional differences. TEAM is able to pinpoint local differences and, under mild assumptions, asymptotically controls the layer-specific and overall false discovery rate (FDR). Simulations verify our theoretical results. When applied to real flow cytometry data, TEAM captures cell subtypes that are overexpressed under cytomegalovirus stimulation versus control. In addition, I have extended the TEAM algorithm so that it can incorporate information from more than one cell attribute, allowing for more robust conclusions. The second part is motivated by rare variant association studies, where a key goal is to identify regions of rare variants associated with disease. This problem is addressed via a flexible method called stochastic aggregation tree-embedded testing (SATET). SATET embeds testing of genomic regions onto an aggregation tree, which provides a way to test association at various resolutions. The rejection rule at each layer depends on the previous layer and leads to a procedure that controls the layer-specific FDR. Compared with methods that search for rare-variant association over large regions, such as protein domains, SATET can pinpoint sub-genic regions associated with disease. Numerical experiments show FDR control for different genetic architectures and superior performance compared with domain-based analyses. When applied to a case-control study in amyotrophic lateral sclerosis (ALS), SATET identified sub-genic regions in known ALS-related genes while implicating regions in new genes not previously captured by domain-based analyses.
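For reference, the quantity being controlled layer by layer is the usual false discovery rate, stated here in standard notation rather than the dissertation's layer-specific formulation:

```latex
\[
  \mathrm{FDR} = E\!\left[ \frac{V}{\max(R, 1)} \right],
\]
% where, within a given layer of the aggregation tree, V is the number of false
% rejections and R is the total number of rejections.
```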
Item Open Access: Multiple Testing for Data with Ancillary Information (2022), Li, Xuechan. In my dissertation, I develop three powerful hierarchical multiple testing methods that account for ancillary information in the data. In my first project, we develop a multiple testing framework named Distance Assisted Recursive Testing (DART), which assumes there exists informative distance information in the data. Through rigorous proof and extensive simulations, we justify the false discovery rate (FDR) control and sensitivity improvement of DART. As an illustration, we apply the method to a clinical trial in leukemia patients receiving hematopoietic cell transplantation to identify the gut microbiota whose abundance is affected by post-transplant care. The second project is motivated by flow cytometry analysis in immunology studies. The analysis can be translated into the statistical problem of pinpointing the regions where two density functions differ. By partitioning the sample space into small bins and conducting a test on each bin, we cast the analysis as a multiple testing problem. We provide theoretical justification that the procedure achieves the statistical goal of pinpointing the regions with differential density with high sensitivity and precision. My third project is motivated by rare variant association studies. We develop a multiple testing framework named DATED (Dynamic Aggregation and Tree-Embedded testing) to pinpoint disease-associated rare-variant regions hierarchically and dynamically. To accommodate the application objective, DATED adopts a rare-variant region-level FDR weighted by the proportion of neutral rare variants. Extensive numerical simulations demonstrate the superior performance of DATED under various scenarios compared with existing methods. We illustrate DATED by applying it to an amyotrophic lateral sclerosis (ALS) study to identify pathogenic rare variants.
Item Open Access: Propensity Score Methods For Causal Subgroup Analysis (2022), Yang, Siyun. Subgroup analyses are frequently conducted in comparative effectiveness research and randomized clinical trials to assess evidence of heterogeneous treatment effect across patient subpopulations. Though widely used in medical research, causal inference methods for conducting statistical analyses on a range of pre-specified subpopulations remain underdeveloped, particularly in observational studies. This dissertation develops and extends propensity score methods for causal subgroup analysis.
In Chapter 2, we develop a suite of analytical methods and visualization tools for causal subgroup analysis. First, we introduce the estimand of the subgroup weighted average treatment effect and provide the corresponding propensity score weighting estimator. We show that balancing covariates within a subgroup bounds the bias of the estimator of subgroup causal effects. Second, we propose using the overlap weighting method to achieve exact balance within subgroups, and we further propose a method that combines overlap weighting and LASSO to balance the bias-variance tradeoff in subgroup analysis. Finally, we design a new diagnostic plot, the Connect-S plot, for visualizing subgroup covariate balance. Extensive simulation studies are presented to compare the proposed method with several existing methods. We apply the proposed methods to the observational COMPARE-UF study to evaluate the causal effects of myomectomy versus hysterectomy on the relief of symptoms and quality of life (a continuous outcome) in a number of pre-specified subgroups of patients with uterine fibroids.
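For reference, the overlap weights used above have a simple standard form; with propensity score e(x) = P(Z = 1 | X = x), treated and control units are weighted by the probability of receiving the opposite treatment:

```latex
\[
  w_1(x) = 1 - e(x), \qquad w_0(x) = e(x),
\]
% which emphasizes the region of covariate overlap and, when e(x) is estimated by a
% logistic model, yields exact mean balance on the covariates included in that
% model; Chapter 2 exploits this property within subgroups.
```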
In Chapter 3, we investigate propensity score weighting for causal subgroup analysis with time-to-event outcomes. We introduce two causal estimands, the subgroup marginal hazard ratio and the subgroup restricted average causal effect, and provide corresponding propensity score weighting estimators. We analytically establish that the bias of the subgroup restricted average causal effect is determined by subgroup covariate balance. Using extensive simulations, we compare the performance of various combinations of propensity score models (logistic regression, random forests, LASSO, and generalized boosted models) and weighting schemes (inverse probability weighting and overlap weighting) for estimating the survival causal estimands. We find that the logistic model with subgroup-covariate interactions selected by LASSO consistently outperforms other propensity score models. Also, overlap weighting generally outperforms inverse probability weighting in terms of balance, bias, and variance, and the advantage is particularly pronounced in small subgroups and/or in the presence of poor overlap. Again, we apply the methods to the COMPARE-UF study with a time-to-event outcome, the time to disease recurrence after receiving a procedure.
In Chapter 4, we extend the propensity score weighting methodology for covariate adjustment to improve the precision and power of subgroup analyses in RCTs. We fit a logistic regression propensity model with pre-specified covariate-subgroup interactions and show that, by construction, overlap weighting exactly balances the covariates with interaction terms in each subgroup. Extensive simulations are performed to compare the operating characteristics of the unadjusted estimator, different propensity score weighting estimators, and the analysis of covariance estimator. We apply these methods to the HF-ACTION trial to evaluate the effect of exercise training on the 6-minute walk test in several pre-specified subgroups.
Item Open Access: Topics and Applications of Weighting Methods in Case-Control and Observational Studies (2019), Li, Fan. Weighting methods are widely used in statistics and related applications; for example, inverse probability weighting is a standard approach to correct for survey non-response. The case-control design, frequently seen in epidemiologic or genetic studies, can be regarded as a special type of survey design, and analogous inverse probability weighting approaches have been explored both when the interest is the association between exposures and the disease (primary analysis) and when the interest is the association among exposures (secondary analysis). Meanwhile, in observational comparative effectiveness research, inverse probability weighting has been suggested as a valid approach to correct for confounding bias. This dissertation develops and extends weighting methods for case-control and observational studies.
The first part of this dissertation extends the inverse probability weighting approach for secondary analysis of case-control data. We revisit an inverse probability weighting estimator to offer new insights and extensions. Specifically, we construct its more general form by generalized least squares (GLS). This construction allows us to connect the GLS estimator with the generalized method of moments and motivates a new specification test designed to assess the adequacy of the inverse probability weights. The specification test statistic measures the weighted discrepancy between the case and control subsample estimators and asymptotically follows a chi-squared distribution under correct model specification. We illustrate the GLS estimator and specification test using a case-control sample of peripheral arterial disease, and use simulations to shed light on the operating characteristics of the specification test. The second part develops a robust difference-in-differences (DID) estimator for estimating causal effects with observational before-after data. Within the DID framework, two common estimation strategies are outcome regression and propensity score weighting. Motivated by a real application in traffic safety research, we propose a new doubly robust DID estimator that hybridizes outcome regression and propensity score weighting. We show that the proposed estimator possesses the desirable large-sample robustness property that consistency requires only one of the outcome model or the propensity score model to be correctly specified. We illustrate the new estimator in a study of the causal effect of rumble strips on reducing vehicle crashes, and conduct a simulation study to examine its finite-sample performance. The third part discusses a unified framework, the balancing weights, for estimating causal effects in observational studies with multiple treatments. These weights incorporate the generalized propensity scores to balance the weighted covariate distributions of the treatment groups, all weighted toward a common pre-specified target population. Within this framework, we further develop the generalized overlap weights, constructed as the product of the inverse probability weights and the harmonic mean of the generalized propensity scores. The generalized overlap weights correspond to the target population with the most overlap in covariates between treatments, similar to the population in equipoise in clinical trials. We show that the generalized overlap weights minimize the total asymptotic variance of the nonparametric estimators of the pairwise contrasts within the class of balancing weights. We apply the new weighting method to study racial disparities in medical expenditure and further examine its operating characteristics by simulation.
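As a schematic of the framework described in the third part (standard notation from the balancing-weights literature, not a restatement of the dissertation's derivations), with J treatments and generalized propensity scores e_j(x) = P(Z = j | X = x), the balancing weight for group j is proportional to a tilting function divided by e_j(x), and the generalized overlap weights take the tilting function to be (proportional to) the harmonic mean of the generalized propensity scores:

```latex
\[
  w_j(x) \;\propto\; \frac{h(x)}{e_j(x)},
  \qquad
  h(x) = \left( \sum_{k=1}^{J} \frac{1}{e_k(x)} \right)^{-1},
\]
% targeting the subpopulation with the most covariate overlap across the J
% treatment groups.
```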