Multiple Testing Embedded in an Aggregation Tree With Applications to Omics Data


In this dissertation, I develop computational methods for high-dimensional inference, motivated by the analysis of omics data. The dissertation is divided into two parts. The first part is motivated by flow cytometry data analysis, where a key goal is to identify sparse cell subpopulations that differ between two groups. I have developed an algorithm called multiple Testing Embedded on an Aggregation tree Method (TEAM) to locate where distributions differ between two samples. Regions containing differences are identified in layers along the tree: the first layer searches for regions containing short-range, strong distributional differences, and higher layers search for regions containing long-range, weak distributional differences. TEAM pinpoints local differences and, under mild assumptions, asymptotically controls the layer-specific and overall false discovery rate (FDR). Simulations verify the theoretical results. When applied to real flow cytometry data, TEAM captures cell subtypes that are overexpressed under cytomegalovirus stimulation versus control. In addition, I have extended the TEAM algorithm to incorporate information from more than one cell attribute, allowing for more robust conclusions.

The second part of this dissertation is motivated by rare variant association studies, where a key goal is to identify regions of rare variants that are associated with disease. This problem is addressed via a flexible method called stochastic aggregation tree-embedded testing (SATET). SATET embeds testing of genomic regions onto an aggregation tree, which provides a way to test association at various resolutions. The rejection rule at each layer depends on the previous layer, yielding a procedure that controls the layer-specific FDR. Compared to methods that search for rare-variant association over large regions, such as protein domains, SATET can pinpoint sub-genic regions associated with disease. Numerical experiments show FDR control for different genetic architectures and superior performance compared to domain-based analyses. When applied to a case-control study in amyotrophic lateral sclerosis (ALS), SATET identified sub-genic regions in known ALS-related genes and implicated regions in new genes not previously captured by domain-based analyses.





Pura, John (2020). Multiple Testing Embedded in an Aggregation Tree With Applications to Omics Data. Dissertation, Duke University. Retrieved from


Duke's student scholarship is made available to the public under a Creative Commons Attribution / Non-commercial / No derivative (CC-BY-NC-ND) license.