Browsing by Subject "Biostatistics"
Item Open Access Advanced Topics in Introductory Statistics (2023) Bryan, Jordan Grey
It is now common practice in many scientific disciplines to collect large amounts of experimental or observational data in the course of a research study. The abundance of such data creates a circumstance in which even simply posed research questions may, or sometimes must, be answered using multivariate datasets with complex structure. Introductory-level statistical tools familiar to practitioners may be applied to these types of data, but inference will either be sub-optimal or invalid if properties of the data violate the assumptions made by these statistical procedures. In this thesis, we provide examples of how basic statistical procedures may be adapted to suit the complexity of modern datasets while preserving the simplicity of low-dimensional parametric models. In the context of genomics studies, we propose a frequentist-assisted-by-Bayes (FAB) method for conducting hypothesis tests for the means of normal models when auxiliary information about the means is available. If the auxiliary information accurately describes the means, then the proposed FAB hypothesis tests may be more powerful than the corresponding classical $t$-tests. If the information is not accurate, then the FAB tests retain type-I error control. For multivariate financial and climatological data, we develop a semiparametric model in order to characterize the dependence between two sets of random variables. Our approach is inspired by a multivariate notion of the sample rank and extends classical concepts such as canonical correlation analysis (CCA) and the Gaussian copula model. The proposed model allows for the analysis of multivariate dependence between variable sets with arbitrary marginal distributions.
Motivated by fluorescence spectroscopy data collected from sites along the Neuse River, we also propose a least squares estimator for quantifying the contribution of various land-use sources to the water quality of the river. The estimator can be computed quickly relative to estimators derived using parallel factor analysis (PARAFAC) and it performs favorably in two source apportionment tasks.
Item Open Access AN INTEGRATIVE MODELING FRAMEWORK FOR MULTIVARIATE LONGITUDINAL CLINICAL DATA (2024) Choi, Dongrak
This dissertation is centered on the development of innovative statistical methodologies specifically tailored to address the complex nature of multivariate data associated with Parkinson's disease (PD). PD is known to manifest impairments across both behavioral and cognitive domains, and as of now, there are no available disease-modifying treatments. Consequently, researchers typically gather a diverse array of data types, encompassing binary, continuous, categorical outcomes, and time-to-event data, in order to gain a comprehensive understanding of the multifaceted aspects of this disorder. Furthermore, due to the long follow-up of PD studies, the occurrence of intermittent missing data is a common challenge, which can introduce bias into the analysis. In this dissertation, we introduce novel approaches for jointly modeling multivariate longitudinal outcomes and time-to-event data using functional latent trait models and generalized multivariate functional mixed models to deal with different types of data. Additionally, we present a method for the detection and handling of missing data patterns through the utilization of joint modeling techniques.
Item Open Access Assessing Interchangeability among Raters with Continuous Outcomes in Agreement Studies (2020) Wang, Tongrong
In various medical settings, new raters are available to take measurements for evaluations of medical conditions. One may want to use new and existing raters simultaneously or replace the existing raters with the new one because of lower cost or portability. For both situations, the raters should be interchangeable, such that it makes no clinical difference which rater measures a subject in the population of interest. This is a problem of claiming sufficient agreement among the raters. Existing literature on assessing agreement is largely limited to two raters, and the methods for more than two raters have various issues of interpretability or unrealistic assumptions. This dissertation proposes new overall agreement indexes for multiple raters by extending the preferred pairwise agreement indexes, coverage probability, total deviation index, and the relative area under the coverage probability. The new indexes have intuitive interpretations regarding the clinical judgment of interchangeability. A unified generalized estimating equation (GEE) approach is developed for inference. Simulation studies are conducted to assess the performance and theoretical properties of the proposed approach. A blood pressure dataset is used for illustration. Due to limited literature on sample size calculation in the agreement study, this dissertation investigates sample size formulas with pre-specified power if one of the proposed overall agreement indices is used for claiming satisfactory interchangeability. While the sample size formulas based on the inference of the GEE framework are somewhat complicated, simplified formulas giving conservative sample size estimates are also proposed for easy implementation. Our simulation studies indicated that the sample size formulas work well if the resulting number of subjects is at least 30 where each rater takes about 3 replicates.
We demonstrate how to design an agreement study based on a pilot blood pressure data set. The U.S. Food and Drug Administration recommends using a regression-based approach in case of replacing a new device with a commercially marketed device. In the third part, we discuss the potential pitfalls of this approach and compare it with the coverage probability approach, the currently preferred approach for assessing agreement. A respiratory rate dataset is used to illustrate the issues of the regression-based approach.
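As a rough illustration of the two pairwise indices named above, the following sketch computes empirical versions of the coverage probability (the proportion of paired differences within a tolerance) and the total deviation index (the tolerance needed to cover a given proportion). It is a simplified two-rater calculation, not the multi-rater GEE approach proposed in the dissertation; the function names, the use of "less than or equal to", and the blood pressure values are assumptions made for the example.

```python
import math

def coverage_probability(x, y, delta):
    """Empirical CP: share of paired readings with |x - y| <= delta."""
    diffs = [abs(a - b) for a, b in zip(x, y)]
    return sum(d <= delta for d in diffs) / len(diffs)

def total_deviation_index(x, y, p):
    """Empirical TDI: smallest observed |x - y| covering proportion p of pairs."""
    diffs = sorted(abs(a - b) for a, b in zip(x, y))
    k = max(math.ceil(p * len(diffs)), 1)  # rank of the covering difference
    return diffs[k - 1]

# Hypothetical systolic readings (mmHg) from two raters on five subjects.
rater_a = [120, 130, 125, 140, 118]
rater_b = [122, 129, 131, 138, 118]

print(coverage_probability(rater_a, rater_b, delta=2))  # 0.8
print(total_deviation_index(rater_a, rater_b, p=0.9))   # 6
```

Read together, the two numbers say that 80% of paired readings fall within 2 mmHg, and that a tolerance of 6 mmHg would be needed to cover 90% of them.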
Item Open Access B cells and the Antibody-Dependent Immune Response in Cancer and Infection (2015) Lykken, Jacquelyn
B cells and humoral immunity are critical components of an effective immune response. However, B cells are also a significant driver of a variety of autoimmune diseases and can also become malignant. Antibody-mediated B cell depletion is now regularly used in the clinic to treat both B cell-derived cancers and B-cell driven autoimmunity, and while depletion itself is effective in some patients, removal of B cells is not often curative for patients and may present additional, unforeseen risks. The overall goal of this dissertation was therefore to determine the impact of B cell depletion on T cell homeostasis and function during infection and to elucidate the genetic factors that determine the effectiveness of antibody-mediated therapy.
In Chapter 3 of this dissertation, the role of B cells in promoting T cell homeostasis was investigated by depleting mature B cells using CD20 monoclonal antibody (mAb). Acute B cell depletion in adult mice significantly reduced spleen and lymph node T cell numbers, including naïve, activated, and cytokine-producing cells, as well as Foxp3+ regulatory T cells, whereas chronic B cell depletion in aged mice resulted in a profound decrease in activated and cytokine-producing T cell numbers. To determine the significance of this finding, B cell-depleted adult mice were infected with acute lymphocytic choriomeningitis virus (LCMV). Despite their expansion, activated and cytokine-producing T cell numbers were still significantly reduced one week later. Moreover, viral peptide-specific T cell numbers and effector cell development were significantly reduced in mice lacking B cells, while LCMV titers were dramatically increased. Thus, B cells are required for optimal T cell homeostasis, activation, and effector development in vivo, particularly during acute viral infection.
In Chapter 4 of this dissertation, lymphoma genetic changes that conferred either sensitivity or resistance to CD20 mAb therapy were examined in a preclinical mouse lymphoma model. An examination of primary lymphomas and extensive lymphoma families demonstrated that sensitivity to CD20 mAb was not regulated by differences in CD20 expression, prior exposure to CD20 mAb, nor serial in vivo passage. An unbiased forward genetic screen of CD20 mAb-resistant and -sensitive lymphomas identified galectin-1 as a significant factor driving CD20 mAb therapy resistance. As lymphomas acquired therapy resistance following serial in vivo passage, galectin-1 expression also increased. Furthermore, inducing lymphoma galectin-1 expression within the tumor microenvironment ablated lymphoma sensitivity to CD20 mAb. Therefore, lymphoma acquisition of galectin-1 expression confers CD20 mAb therapy resistance.
In Chapter 5 of this dissertation, the distinct germline components that control the efficacy of host CD20 mAb-dependent B cell and lymphoma depletion were evaluated using genetically distinct lab mouse strains. Variations in B cell depletion by CD20 mAb among several lab mouse strains were observed, where 129 mice had significantly impaired mAb-dependent depletion of endogenous B cells and primary lymphomas relative to B6 mice. An unbiased forward genetic screen of mice revealed that a 1.5 Mbp region of Chromosome 12 that contains mycn significantly altered CD20 mAb-dependent lymphoma depletion. Elevated mycn expression enhanced mAb-dependent B cell depletion and lymphoma phagocytosis and correlated with higher macrophage numbers. Thus, host genetic variations in mycn expression in macrophages alter the outcome of Ab-dependent depletion of endogenous and malignant cells.
These studies collectively demonstrate that B cells are required for effective cellular immune responses during infection and identified factors that alter the effectiveness of mAb-dependent B cell depletion. This research also established and validated an unbiased forward genetics approach to identify the totality of host and tumor-intrinsic factors that influence mAb therapy in vivo. The findings of these studies ultimately urge careful consideration in the clinical application of B cell depletion therapies.
Item Open Access Bayesian interaction estimation with high-dimensional dependent predictors (2021) Ferrari, Federico
Humans are constantly exposed to mixtures of different chemicals arising from environmental contamination. While certain compounds, such as heavy metals and mercury, are well known to be toxic, there are many complex mixtures whose health effects are still unknown. It is of fundamental public health importance to understand how these exposures interact to impact risk of disease and the health effects of cumulative exposure to multiple agents. The goal of this thesis is to build data-driven models to tackle major challenges in modern health applications, with a special interest in estimating statistical interactions among correlated exposures. In Chapter 1, we develop a flexible Gaussian process regression model (MixSelect) that simultaneously estimates a complex nonparametric model and provides interpretability. A key component of this approach is the incorporation of a heredity constraint to only include interactions in the presence of main effects, effectively reducing dimensionality of the model search. Next, we focus our modelling effort on characterizing the joint variability of chemical exposures using factor models. In fact, chemicals usually co-occur in the environment or in synthetic mixtures; as a result, their exposure levels can be highly correlated. In Chapter 3, we build a Factor analysis for INteractions (FIN) framework that jointly provides dimensionality reduction in the chemical measurements and allows estimation of main effects and interactions. Through appropriate modifications of the factor modeling structure, FIN can accommodate higher order interactions and multivariate outcomes. Further, we extend FIN to survival analysis and exponential families in Chapter 4, as medical studies often collect high-dimensional data and time-to-event outcomes.
We address these cases through a joint factor analysis modeling approach in which latent factors underlying the predictors are included in a quadratic proportional hazards regression model, and we provide expressions for the induced coefficients on the covariates. In Chapter 5, we combine factor models and nonparametric regression. We build a copula factor model for the chemical exposures and use Bayesian B-splines for flexible dose-response modeling. Finally, in Chapter 6 we propose a post-processing algorithm that allows for identification and interpretation of the factor loadings matrix and can be easily applied to the models described in the previous chapters.
Item Open Access Bayesian Kernel Models for Statistical Genetics and Cancer Genomics (2017) Crawford, Lorin Anthony
The main contribution of this thesis is to examine the utility of kernel regression approaches and variance component models for solving complex problems in statistical genetics and molecular biology. Many of these types of statistical methods have been developed specifically to be applied to solve similar biological problems. For example, kernel regression models have a long history in statistics, applied mathematics, and machine learning. More recently, variance component models have been extensively utilized as tools to broaden understanding of the genetic basis of phenotypic variation. However, because of large combinatorial search spaces and other confounding factors, many of these current methods face enormous computational challenges and often suffer from low statistical power, particularly when phenotypic variation is driven by complicated underlying genetic architectures (e.g. the presence of epistatic effects involving higher order genetic interactions). This thesis highlights two novel methods which provide innovative solutions to better address the important statistical and computational hurdles faced within complex biological data sets. The first is a Bayesian non-parametric statistical framework that allows for efficient variable selection in nonlinear regression which we refer to as "Bayesian approximate kernel regression", or BAKR. The second is a novel algorithm for identifying genetic variants that are involved in epistasis without the need to identify the exact partners with which the variants interact. We refer to this method as the "MArginal ePIstasis Test", or MAPIT. Here, we develop the theory of these two approaches, and demonstrate their power, interpretability, and computational efficiency for analyzing complex phenotypes.
We also illustrate their ability to facilitate novel biological discoveries in several real data sets, each of them representing a particular class of analyses: genome-wide association studies (GWASs), molecular trait quantitative trait loci (QTL) mapping studies, and cancer biology association studies. Lastly, we will also explore the potential of these approaches in radiogenomics, a brand new subfield of genetics and genomics that focuses on the study of correlations between imaging or network features and genetic variation.
Item Open Access Behavioral and Geophysical Factors Influencing Success in Long Distance Navigation (2023) Granger, Jesse
Many animals can sense the earth’s magnetic field and use it to perform incredible feats of navigation; however, studying this phenomenon in the lab is difficult because behavioral responses to magnetic cues can be highly variable. My Ph.D. research attempts to fill this knowledge gap in the following ways: we first explore potential sources for this variability, including both natural and artificial sources of noise. We then examine the ways in which these natural sources of noise could be used to study magnetoreception in animals that are not feasible to study in the laboratory. Finally, we propose a possible solution for how navigating animals may overcome noise to still accomplish highly accurate migrations. Chapter 1 contains the relevant background and introduction. In Chapter 2, we conduct a synthetic review of natural and anthropogenic sources of radio frequency electromagnetic noise (RF) and its effects on magnetoreception. Anthropogenic RF has been shown to disrupt magnetic orientation behavior in some animals. Two sources of natural RF might also have the potential to disturb magnetic orientation behavior under some conditions: solar RF and atmospheric RF. In this review, we outline the frequency ranges and electric/magnetic field magnitudes of RF that have been shown to disturb magnetoreceptive behavior in laboratory studies and compare these to the ranges of solar and atmospheric RF. Frequencies shown to be disruptive in laboratory studies range from 0.1 to 10 MHz, with magnetic magnitudes as low as 1 nT reported to have effects. Based on these values, it appears unlikely that solar RF alone routinely disrupts magnetic orientation. In contrast, atmospheric RF does sometimes exceed the levels known to disrupt magnetic orientation in laboratory studies.
We provide a reference for when and where atmospheric RF can be expected to reach these levels, as well as a guide for quantifying RF measurements.
In Chapter 3, we explore how these natural sources of noise may allow us to study magnetoreception in animals that are not feasible to study in the laboratory. Although it is difficult to perform behavioral experiments on baleen whales, it may be possible to use live stranding data (strandings that indicate the whale may have made a navigational error, rather than those having died at sea and washed ashore) as a tool to investigate the cues they use while navigating. Here we show that there is a 2.1-fold increase in the likelihood of a live gray whale (Eschrichtius robustus) stranding (n=186) on days with a high sunspot count than on low sunspot days (p<0.0001). Increased sunspot count is strongly correlated with solar storms – sudden releases of high-energy particles from the sun which have the potential to disrupt magnetic orientation behavior when they interact with earth’s magnetosphere. We further explore this relationship by examining portions of earth’s electromagnetic spectrum that are affected by solar storms and found a 3.7-fold increase in the likelihood of a live stranding on days with high solar radio flux (RF) as measured from earth (p<0.0001). One hypothesized mechanism for magnetoreception, the radical-pair theory, predicts that magnetoreception can be disrupted by RF radiation, and RF noise has been shown to disrupt magnetic orientation in certain species. To our knowledge, this is the first evidence that provides support for a specific magnetoreception mechanism in whales.
Finally, in Chapter 4, we propose a mechanism for how magnetoreceptive animals may overcome noise to perform incredibly accurate migrations. Many animals use the geomagnetic field to migrate long distances with high accuracy; however, research has shown that individual responses to magnetic cues in the laboratory can be highly variable. Thus, it has been hypothesized that magnetoreception alone is insufficient for accurate migrations and animals must either switch to a more accurate sensory cue or integrate their magnetic sense over time. Here we suggest that magnetoreceptive migrators could also use collective navigation strategies. Using agent-based models, we compare agents utilizing collective navigation to both the use of a secondary sensory system and time-integration. In our models, by using collective navigation alone, over 70% of the group is still able to successfully reach their goal even as their ability to navigate becomes extremely noisy. To reach the same success rates, in our models, a secondary sensory system must provide perfect navigation for over 73% of the migratory route, and time integration must integrate over 50 time-steps, indicating that magnetoreceptive animals could benefit from using collective navigation. Finally, we explore the impact of population loss on animals relying on collective navigation. We show that as population density decreases, a greater proportion of individuals fail to reach their destination and, in our models, a 50% population reduction resulted in up to a 37% decrease in the proportion of individuals completing their migration. We additionally show that this process is compounding, eventually resulting in complete population collapse.
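The intuition behind collective navigation can be illustrated with a toy calculation: if each animal's heading estimate is noisy but unbiased, the circular mean of a group's headings tends to point closer to the goal than a single animal's heading. The sketch below is not the agent-based model used in the dissertation; the Gaussian heading noise, the group size, the goal direction of 0 radians, and all function names are assumptions chosen for illustration.

```python
import math
import random

def circular_mean(angles):
    """Mean direction of a set of angles (radians), robust to wrap-around."""
    s = sum(math.sin(a) for a in angles)
    c = sum(math.cos(a) for a in angles)
    return math.atan2(s, c)

def angular_error(angle, target=0.0):
    """Absolute angular distance to the target, in [0, pi]."""
    d = angle - target
    return abs(math.atan2(math.sin(d), math.cos(d)))

def mean_errors(group_size, noise_sd, trials, seed=0):
    """Average heading error of one agent vs. the group's circular mean."""
    rng = random.Random(seed)
    solo_total = group_total = 0.0
    for _ in range(trials):
        # Each agent perceives the goal direction (0 rad) plus Gaussian noise.
        headings = [rng.gauss(0.0, noise_sd) for _ in range(group_size)]
        solo_total += angular_error(headings[0])
        group_total += angular_error(circular_mean(headings))
    return solo_total / trials, group_total / trials

solo, group = mean_errors(group_size=20, noise_sd=1.0, trials=500)
print(f"solo error: {solo:.2f} rad, group error: {group:.2f} rad")
```

In this simplified setting the group's error shrinks as the group grows, which is consistent with the abstract's observation that reduced population density hurts migration success.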
Item Open Access Deriving Real-World Insights From Real-World Data: Biostatistics to the Rescue. (Annals of internal medicine, 2018-09) Pencina, Michael J; Rockhold, Frank W; D'Agostino, Ralph B
Item Embargo Design and Analysis of Clinical Trials with Restricted Mean Survival Time (2024) Hua, Kaiyuan
Restricted mean survival time (RMST), a summary of survival time up to a pre-specified clinically relevant truncation time, is increasingly recognized as a measure for treatment effect in recent biomedical studies with time-to-event endpoints. The difference or ratio of RMST between two groups (e.g., treatment versus control) measures the relative treatment effect concerning a gain or loss of survival time. The RMST offers greater flexibility than the hazard ratio (HR), which is often estimated from the Cox proportional hazards model under the proportional hazards (PH) assumption. Due to delayed treatment effects or other biomedical reasons, the PH assumption is often violated in oncology and cardiovascular trials, leading to biased estimation and misleading interpretations of treatment effects. Compared to HR, RMST requires no PH assumption and offers a more straightforward interpretation of treatment effects. In this dissertation, we propose novel RMST-based methodologies for clinical trials with time-to-event endpoints in three research areas including 1) individual participant data network meta-analysis, 2) inference in multi-regional clinical trials, and 3) biomarker-guided adaptive and enrichment design.
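For readers unfamiliar with RMST, a common way to estimate it is as the area under the Kaplan-Meier survival curve from time 0 to the truncation time. The following is a minimal sketch of that standard estimator, not the methodology developed in the dissertation; the function names and the toy data are assumptions for illustration, and `events[i]` is taken to be 1 for an observed event and 0 for censoring.

```python
def kaplan_meier(times, events):
    """Return [(t, S(t))] at each distinct event time (events: 1=event, 0=censored)."""
    data = sorted(zip(times, events))
    n_at_risk = len(data)
    surv = 1.0
    curve = []
    i = 0
    while i < len(data):
        t = data[i][0]
        deaths = removed = 0
        # Collect all subjects sharing this time point.
        while i < len(data) and data[i][0] == t:
            deaths += data[i][1]
            removed += 1
            i += 1
        if deaths > 0:
            surv *= 1.0 - deaths / n_at_risk
            curve.append((t, surv))
        n_at_risk -= removed
    return curve

def rmst(times, events, tau):
    """Area under the Kaplan-Meier step function from 0 to tau."""
    area = 0.0
    prev_t, prev_s = 0.0, 1.0
    for t, s in kaplan_meier(times, events):
        if t >= tau:
            break
        area += prev_s * (t - prev_t)
        prev_t, prev_s = t, s
    return area + prev_s * (tau - prev_t)

# Toy two-arm comparison with truncation time tau = 4 (hypothetical data).
control = rmst([1, 2, 3, 4], [1, 1, 1, 1], tau=4)
treated = rmst([2, 3, 5], [1, 0, 1], tau=4)
print(control, treated, treated - control)
```

The difference `treated - control` is read directly as survival time gained up to the truncation time, which is the interpretability advantage the abstract describes.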
Item Embargo Design and Inference Methods for Randomized Clinical Trials (2023) Cao, Shiwei
Traditionally, phase II trials have employed single-arm designs, recruiting patients exclusively for the experimental therapy, and comparing results with historical controls. Due to the limited sample size and patient heterogeneity, the characteristics of patients in new phase II trials often differ from those in the selected historical controls, leading to potential false positive or false negative conclusions. Randomized phase II trials offer a solution by randomizing patients between an experimental arm and a control arm.
In this dissertation, we seek efficient designs for multi-stage randomized clinical trials and develop inference methods for the widely used odds ratio parameter. We propose a two-stage randomized phase II trial design based on Fisher's exact tests. This design includes options for early stopping due to either superiority or futility, aimed at optimizing patient enrollment whether the experimental therapy proves efficacious or not. Furthermore, we introduce a novel criterion, the weighted expected sample size, to define optimal designs for multi-stage clinical trials. We have also developed a Java software tool capable of identifying these optimal designs. Additionally, we present a bias-corrected estimator and an exact conditional confidence interval for the odds ratio in multi-stage randomized clinical trials.
Item Open Access Design and Monitoring of Clinical Trials with Clustered Time-to-Event Endpoint (2020) Li, Jianghao
Many clinical trials involve clustered data that consist of groups (called clusters) of nested subjects (called subunits). Observations from subunits within each cluster tend to be positively correlated due to shared characteristics. Therefore, analysis of such data needs to account for the dependency between subunits. For clustered time-to-event endpoints, only a few methods have been proposed for sample size calculation, especially when the cluster sizes are variable. In this dissertation, we aim to derive sample size formulas for clustered survival endpoints based on nonparametric weighted rank tests. First, we propose closed form sample size formulas for cluster randomization trials and subunit randomization trials; accordingly, we derive the intracluster correlation coefficient for clustered time-to-event endpoints. We find that the required number of clusters is affected not only by the mean cluster size, but also by the variance of the cluster size distribution. In addition, we prove that in group sequentially monitored cluster randomization studies, the log-rank statistic does not have the independent increments property, which is different from the result for independent survival data. We further derive the limiting distribution of sequentially computed log-rank statistics, and develop a group sequential testing procedure based on the alpha spending approach, as well as a corresponding sample size calculation method.
Item Embargo Develop Novel Statistical and Computational Methods for Omics Data Analysis (2023) Gao, Qi
Recent advances in sequencing technologies have enabled the measurement of gene expression and other omics profiles at multi-cell, single-cell or subcellular resolution. However, these advances also posed challenges for data analysis, such as identifying differentially expressed feature gene sets with high accuracy and benchmarking computational methods for various analysis topics on data with complex heterogeneity. In my dissertation, we have focused on developing novel statistical and computational methods to address these challenges.
In project 1, we developed SifiNet, a versatile pipeline to identify cell-subpopulation specific feature genes, annotate cell subpopulations, and reveal their relationships. The major advantage of SifiNet is that it bypasses cell clustering and thus avoids possible bias introduced by inaccurate clustering; as a result, SifiNet achieves significantly higher accuracy in feature gene identification and cell annotation than traditional two-step methods relying on clustering. SifiNet can analyze both single cell RNA sequencing (scRNA-seq) and single-cell sequencing assay for transposase-accessible chromatin (scATAC-seq) data, providing insight into multiomic cellular profiles.
In project 2, we developed GeneScape, a novel scRNA-seq data simulator that can simulate complex cellular heterogeneity. Existing scRNA-seq data simulators are limited in their abilities to simulate data with complex or subtle cellular heterogeneities, especially for cells that exhibit both cell type and cell state differences (such as differences in cell cycles, senescence levels, and DNA-damage levels). GeneScape can successfully simulate gene expressions for cells with complex heterogeneity structures.
In project 3, we developed GeneScape-S (GeneScape-Spatial), a simulator for spatially resolved transcriptomics (SRT) data. Existing SRT-specific simulators cannot fulfill customized needs such as simulating multi-layer data, mimicking local tissue heterogeneity, and accommodating mixing cell-type structures in low-resolution spots. To fill these gaps, we propose GeneScape-S, which preserves the expression and spatial patterns of real SRT data, and offers specially designed functions tailored to fulfill customized needs. GeneScape-S also incorporates the features in GeneScape to simulate complex heterogeneities.
Item Open Access Efficient analysis of complex, multimodal genomic data (2016) Acharya, Chaitanya Ramanuj
Our primary goal is to better understand complex diseases using statistically disciplined approaches. As multi-modal data is streaming out of consortium projects like the Genotype-Tissue Expression (GTEx) project, which aims at collecting samples from various tissue sites in order to understand tissue-specific gene regulation, new approaches are needed that can efficiently model groups of data with minimal loss of power. For example, the GTEx project delivers RNA-Seq, Microarray gene expression and genotype data (SNP Arrays) from a vast number of tissues in a given individual subject. In order to analyze this type of multi-level (hierarchical) multi-modal data, we proposed a series of efficient-score based tests or score tests and leveraged groups of tissues or gene isoforms in order to map genomic biomarkers. We model group-specific variability as a random effect within a mixed effects model framework. In one instance, we proposed a score-test based approach to map expression quantitative trait loci (eQTL) across multiple tissues. In order to do that we jointly model all the tissues and make use of all the information available to maximize the power of eQTL mapping and investigate an overall shift in the gene expression combined with tissue-specific effects due to genetic variants. In the second instance, we showed the flexibility of our model framework by expanding it to include tissue-specific epigenetic data (DNA methylation) and map eQTL by leveraging both tissues and methylation. Finally, we also showed that our methods are applicable to other data types, such as whole transcriptome expression data, which is designed to analyze genomic events such as alternative gene splicing.
In order to accomplish this, we proposed two different models that exploit gene expression data of all available gene-isoforms within a gene to map biomarkers of interest (either genes or gene-sets) in paired early-stage breast tumor samples before and after treatment with external beam radiation. Our efficient score-based approaches have very distinct advantages. They have a computational edge over existing methods because they do not need parameter estimation under the alternative hypothesis. As a result, model parameters only have to be estimated once per genome, significantly decreasing computation time. Also, the efficient score is the locally most powerful test and is guaranteed a theoretical optimality over all other approaches in a neighborhood of the null hypothesis. This theoretical performance is borne out in extensive simulation studies which show that our approaches consistently outperform existing methods both in statistical power and computational speed. We applied our methods to publicly available datasets. It is important to note that all of our methods also accommodate the analysis of next-generation sequencing data.
Item Open Access EVALUATING AND INTERPRETING MACHINE LEARNING OUTPUTS IN GENOMICS DATA(2022) Fang, JiyuanIn my dissertation, we have developed statistical and computational tools to evaluate and interpret machine learning outputs in genomics data. The first two projects focus on single-cell RNA-sequencing (scRNA-seq) data. In project 1, we evaluated the fitting of widely-used distribution families on scRNA-seq UMI counts and concluded that UMI counts of polyclonal cells following gene-specific cell-type-specific NB distributions without zero- inflation. Based on this modeling, we proposed the working dispersion score (WDS) to select genes that differentially express across cell types. In project 2, we developed a new internal (unsupervised) index, Clustering Deviation Index (CDI), to evaluate cell label sets obtained from clustering algorithms. We conducted in silico and experimental scRNA-seq studies to show that CDI can select the optimal clustering label set. We also benchmarked CDI by comparing it with other internal indices in terms of the agreement with external indices using high-quality benchmark label sets. In addition, we demonstrated that CDI is more computationally efficient than other internal indices, especially for million-scale datasets. In project 3, we proposed a model-agnostic hypothesis testing framework to interpret feature interactions underneath complex machine learning models. The simulation study results demonstrated large power while controlling the type I error rate.
Item Embargo Exposomic modeling approaches for social and environmental determinants of health (2023) McCormack, Kara
Studies of human health have recently expanded to focus on the exposome paradigm, encompassing all exposures humans encounter from conception onward. The central theme of this work is to develop and test novel statistical methodologies that can address the challenges of the complex relationships between environmental exposures, socioeconomic distress, and health outcomes. However, source, measurement, and volume intricacies inherent to these data have constrained progression of statistical methods for key research questions.
In this work, we explore three approaches to characterizing community health and its potential impact on several types of disease outcomes. In the first approach, we apply a latent class model to socioeconomic and comorbidity data and explore these classifications as fixed effects in an ecological spatial model of COVID-19 cases and deaths in NYC during two time periods of the pandemic. In the second, we use a non-parametric Bayesian approach to form socioeconomic and pollution cluster profiles across US counties. We then use these profiles to inform a Bayesian spatial model of breast cancer mortality for data from 2014. In the final approach, we utilize a latent network model traditionally used in psychometrics research to explore structural racism. Using information from five domains (employment, education, housing, health, and criminal justice), we identify new variable complexes to illustrate the complex manifestations of structural racism at the census tract level in Pennsylvania.
Item Open Access Extending Probabilistic Record Linkage (2020) Solomon, Nicole Chanel
Probabilistic record linkage is the task of combining multiple data sources for statistical analysis by identifying records pertaining to the same individual in different databases. The need to perform probabilistic record linkage arises in comparative effectiveness research and other clinical research scenarios when records in different databases do not share an error-free unique patient identifier. This dissertation seeks to develop new methodology for probabilistic record linkage to address two highly practical and recurring challenges: how to implement record linkage in a manner that optimizes downstream statistical analyses of the linked data, and how to efficiently link databases having a clustered or multi-level data structure.
In Chapter 2 we propose a new framework for balancing the tradeoff between false positive and false negative linkage errors when linked data are analyzed in a generalized linear model framework and non-linked records lead to missing data for the study outcome variable. Our method seeks to maximize the probability that the point estimate of the parameter of interest will have the correct sign and that the confidence interval around this estimate will correctly exclude the null value of zero. Using large sample approximations and a model for linkage errors, we derive expressions relating bias and hypothesis testing power to the user's choice of threshold that determines how many records will be linked. We use these results to propose three data-driven threshold selection rules. Under one set of simplifying assumptions we prove that maximizing asymptotic power requires that the threshold be relaxed at least until the point where all pairs with >50% probability of being a true match are linked.
In Chapter 3 we explore the consequences of linkage errors when the study outcome variable is determined by linkage status and so linkage errors may cause outcome misclassification. This scenario arises when the outcome is disease status and those linked are classified as having the disease while those not linked are classified as disease-free. We assume the parameter of interest can be expressed as a linear combination of binomial proportions having mean zero under the null hypothesis. We derive an expression for the asymptotic relative efficiency of a Wald test calculated with a misclassified outcome compared to an error-free outcome using a linkage error model and large sample approximations. We use this expression to generate insights for planning and implementing studies using record linkage.
In Chapter 4 we develop a modeling framework for linking files with a clustered data structure. Linking such clustered data is especially challenging when error-free identifiers are unavailable for both individual-level and cluster-level units. The proposed approach improves over current methodology by modeling inter-pair dependencies in clustered data and producing collective link decisions. It is novel in that it models both record attributes and record relationships, and resolves match statuses for individual-level and cluster-level units simultaneously. We show that linkage probabilities can be estimated without labeled training data using assumptions that are less restrictive compared to existing record linkage models. Using Monte Carlo simulations based on real study data, we demonstrate its advantages over the current standard method.
Item Open Access Extending the Weighted Generalized Score Statistic for Comparison of Correlated Means (2023) Jones, Aaron Douglas
The generalized score (GS) statistic is widely used to test hypotheses about mean model parameters in the generalized estimating equations (GEE) framework. However, when comparing predictive values of two diagnostic tests in a paired study design, or comparing correlated proportions between two unequally sized groups with both paired and independent outcomes, GS has been shown neither to adequately control type I error nor to reduce to the score statistic under independence. Weighting the residuals in empirical variance estimation by the ratio of the two groups’ sample sizes produces a weighted generalized score (WGS) statistic that has been shown to resolve these issues and is now used in the diagnostic testing literature. Potential improvements from weighting in more general uses of GS have not previously been investigated.
This dissertation extends the WGS method in several ways. Formulas are derived to extend the WGS statistic for paired and/or independent data from two binary proportions to two means in a quasi-likelihood model with any suitable link and variance functions, assuming finite fourth moments. The asymptotic convergence of WGS to the chi-square distribution in these general cases is proven. Finite-sample type I error rates are compared between GS and WGS, for which purpose the variance of the test statistic denominator (i.e., the variance of the empirical variance estimator) is proposed as an analytic heuristic. New weights are derived to optimize the variance-of-the-denominator criterion for approximate type I error control. Simulation results verify that the heuristically optimal weights achieve type I error rates closer to the nominal alpha level than GS or WGS for combinations of correlation and sample size where either GS or WGS demonstrates poor control.
Item Open Access Gaussian Process-Based Models for Clinical Time Series in Healthcare (2018) Futoma, Joseph David
Clinical prediction models offer the ability to help physicians make better data-driven decisions that can improve patient outcomes. Given the wealth of data available with the widespread adoption of electronic health records, more flexible statistical models are required that can account for the messiness and complexity of this data. In this dissertation we focus on developing models for clinical time series, as most data within healthcare is collected longitudinally and it is important to take this structure into account. Models built off of Gaussian processes are natural in this setting of irregularly sampled, noisy time series with many missing values. In addition, they have the added benefit of accounting for and quantifying uncertainty, which can be extremely useful in medical decision making. In this dissertation, we develop new Gaussian process-based models for medical time series along with associated algorithms for efficient inference on large-scale electronic health records data. We apply these models to several real healthcare applications, using local data obtained from the Duke University healthcare system.
In Chapter 1 we give a brief overview of clinical prediction models, electronic health records, and Gaussian processes. In Chapter 2, we develop several Gaussian process models for clinical time series in the context of chronic kidney disease management. We show how our proposed joint model for longitudinal and time-to-event data and our model for multivariate time series can make accurate predictions about a patient's future disease trajectory. In Chapter 3, we combine multi-output Gaussian processes with a downstream black-box deep recurrent neural network. We apply this modeling framework to clinical time series to improve early detection of sepsis among patients in the hospital, and show that the Gaussian process preprocessing layer both allows for uncertainty quantification and acts as a form of data augmentation to reduce overfitting. In Chapter 4, we again use multi-output Gaussian processes as a preprocessing layer, this time in model-free deep reinforcement learning. Here the goal is to learn optimal treatments for sepsis given clinical time series and historical treatment decisions taken by clinicians, and we show that the Gaussian process preprocessing layer and the use of a recurrent architecture offer improvements over standard deep reinforcement learning methods. We conclude in Chapter 5 with a summary of future areas for work and a discussion of practical considerations and challenges involved in deploying machine learning models into actual clinical practice.
Item Open Access Gene set-based Signal-Detection Analyses with Goodness-of-Fit Statistics and Their Application in Complex Diseases (2019) Zhang, Mengqi
Rare diseases are difficult to diagnose and uncertain to treat. The identification of specific genes associated with particular rare diseases and phenotypes can provide insight into the mechanism of certain rare disease subtypes and suggest therapeutic targets to improve patient outcomes. However, single gene-based methods for detecting rare disease-associated variants are often underpowered and can be hard to interpret. Therefore, this dissertation explores alternative approaches based on gene set-based methods. These analyses can be carried out with a goodness-of-fit test that assesses whether the distribution of observed statistics for a given set of genes/variants differs significantly from the expected distribution.
This dissertation explores a flexible gene set-based signal-detection framework based on goodness-of-fit tests. A user-friendly and efficient R program was developed for this research. In addition, this dissertation proposes a new gene-set analysis method that leverages prior information to detect whether any of the genes within a biologically informed gene set is associated with disease phenotypes, using a special goodness-of-fit test called higher criticism. Further, this dissertation investigates the asymptotic distribution of our higher criticism statistic based on the theoretically weighted p-values. Collectively, these methods are innovative because they are gene set-based and incorporate prior information, which enhances the power to detect associations between rare variants and complex diseases. These results improve the ability to identify and optimally treat genetic disease subtypes.
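For readers unfamiliar with higher criticism, the following is a minimal sketch of the standard, unweighted form of the statistic (the Donoho-Jin formulation), not the weighted version developed in the dissertation. The function name `higher_criticism` and the example p-values are illustrative assumptions and do not come from the dissertation or its R program.

```python
import math


def higher_criticism(p_values):
    """Standard (unweighted) higher criticism statistic for a list of p-values.

    Illustrative assumption: this is the textbook Donoho-Jin form,
        HC = max_i sqrt(n) * (i/n - p_(i)) / sqrt(p_(i) * (1 - p_(i))),
    where p_(1) <= ... <= p_(n) are the sorted p-values. It is NOT the
    weighted statistic proposed in the dissertation.
    """
    n = len(p_values)
    if n == 0:
        raise ValueError("at least one p-value is required")
    best = -math.inf
    for i, p in enumerate(sorted(p_values), start=1):
        # Skip p-values of exactly 0 or 1, where the denominator is zero.
        if 0.0 < p < 1.0:
            z = math.sqrt(n) * (i / n - p) / math.sqrt(p * (1.0 - p))
            best = max(best, z)
    return best


# Hypothetical p-values for a four-gene set: evenly spread (null-like)
# versus a set containing two very small p-values (signal-like).
null_like = [0.2, 0.4, 0.6, 0.8]
signal_like = [0.001, 0.002, 0.5, 0.9]
print(higher_criticism(null_like), higher_criticism(signal_like))
```

A large value suggests that the set contains more small p-values than expected under the global null. In practice the statistic is compared against a null distribution obtained analytically or by permutation.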
Item Open Access Genetic Analysis of Gulf War illness: Phenotype Development, GWAS, and Gene-Environment Interaction (2022) Vahey, Jacqueline
Veterans who served in the 1990-1991 Gulf War experience debilitating chronic symptoms at extremely high rates. In the 30 years since the Gulf War, many researchers have worked to identify the cause and biological pathway of Gulf War illness (GWI). There is, however, no biomarker, ICD code, or other standardized way to identify veterans with GWI; veterans are told they have GWI based on a clinician’s assessment of their unexplained chronic symptoms. There is also little agreement on the causes and potential biological pathways of GWI. This dissertation describes phenotyping efforts, the first genome-wide association study (GWAS) of GWI, and a candidate gene-environment interaction study. First, I describe methods for developing well-documented indicators for complex phenotypes, which have generated GWI indicators that are used for the MVP and GWECB datasets. This is the only tested and published algorithm for defining GWI. This work required extensive exploratory analysis and data cleaning, as it was the first major analysis of the GWECB dataset. The variables generated through both the data cleaning and GWI algorithm have been incorporated into the GWECB. Then, I performed the first GWAS of GWI, which supports prior work in the field and suggests further candidate analyses. Top gene-set associations include response to cadmium ion, regulation of response to interferon gamma, and regulation of autophagosome maturation. Among other top associations, these results indicate association with a neuroimmune response to exposure. GWAS summary statistics will be made available. Finally, I developed a hypothesis-driven candidate gene-environment interaction study, which replicated a previously published statistically significant association of rs662/PB pill exposure with GWI.
Future research building off my contributions could help identify the underlying biological pathways and causes of GWI, allowing better treatment of the underlying disease for hundreds of thousands of Gulf War Veterans.