Multiple Testing Embedded in an Aggregation Tree With Applications to Omics Data

Pura, John

Date

2020

Authors

Pura, John

Advisors

Xie, Jichun

Abstract

In my dissertation, I have developed computational methods for high-dimensional inference, motivated by the analysis of omics data. This dissertation is divided into two parts.

The first part is motivated by flow cytometry data analysis, where a key goal is to identify sparse cell subpopulations that differ between two groups. I have developed an algorithm called multiple Testing Embedded on an Aggregation tree Method (TEAM) to locate where distributions differ between two samples. Regions containing differences can be identified in layers along the tree: the first layer searches for regions containing short-range, strong distributional differences, and higher layers search for regions containing long-range, weak distributional differences. TEAM can pinpoint local differences and, under mild assumptions, asymptotically controls the layer-specific and overall false discovery rate (FDR). Simulations verify the theoretical results. When applied to real flow cytometry data, TEAM captures cell subtypes that are overexpressed in cytomegalovirus stimulation vs. control. In addition, I have extended the TEAM algorithm so that it can incorporate information from more than one cell attribute, allowing for more robust conclusions.

The second part is motivated by rare variant association studies, where a key goal is to identify regions of rare variants that are associated with disease. This problem is addressed via a flexible method called stochastic aggregation tree-embedded testing (SATET). SATET embeds testing of genomic regions onto an aggregation tree, which provides a way to test association at various resolutions. The rejection rule at each layer depends on the previous layer and leads to a procedure that controls the layer-specific FDR. Compared to methods that search for rare-variant association over large regions, such as protein domains, SATET can pinpoint sub-genic regions associated with disease. Numerical experiments show FDR control for different genetic architectures and superior performance compared to domain-based analyses. When applied to a case-control study in amyotrophic lateral sclerosis (ALS), SATET identified sub-genic regions in known ALS-related genes, while implicating regions in new genes not previously captured by domain-based analyses.
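
The layered idea described in the abstract (test fine regions first, then merge what was not rejected and test again at a coarser resolution) can be illustrated with a toy sketch. This is not the TEAM or SATET procedure from the dissertation: the exact binomial test, the use of the standard Benjamini-Hochberg step-up rule at each layer, the pairwise merging of adjacent regions, and every name below (`binom_upper_pvalue`, `bh_reject`, `layered_test`, `theta0`) are assumptions made for illustration only, and the sketch makes no claim about FDR control across layers.

```python
from math import comb


def binom_upper_pvalue(k, n, theta0):
    """Exact one-sided p-value P(X >= k) for X ~ Binomial(n, theta0)."""
    return sum(
        comb(n, j) * theta0**j * (1 - theta0) ** (n - j) for j in range(k, n + 1)
    )


def bh_reject(pvalues, alpha):
    """Benjamini-Hochberg step-up rule: return the set of rejected indices."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    cutoff = 0
    for rank, i in enumerate(order, start=1):
        if pvalues[i] <= alpha * rank / m:
            cutoff = rank  # largest rank whose p-value is under its threshold
    return set(order[:cutoff])


def layered_test(bins, theta0=0.5, alpha=0.05, layers=3):
    """Toy layered testing on adjacent regions (illustrative assumption only).

    bins: list of (case_count, total_count), one per adjacent leaf region.
    Returns a list of (layer, first_leaf, last_leaf) for rejected regions.
    """
    # Each node is (first_leaf, last_leaf, case_count, total_count).
    nodes = [(i, i, k, n) for i, (k, n) in enumerate(bins)]
    discoveries = []
    for layer in range(1, layers + 1):
        if not nodes:
            break
        pvals = [binom_upper_pvalue(k, n, theta0) for (_, _, k, n) in nodes]
        rejected = bh_reject(pvals, alpha)
        survivors = []
        for idx, node in enumerate(nodes):
            if idx in rejected:
                discoveries.append((layer, node[0], node[1]))
            else:
                survivors.append(node)
        # Merge non-rejected nodes pairwise, but only when they are adjacent
        # in leaf order; a node without an adjacent partner is dropped.
        merged = []
        i = 0
        while i + 1 < len(survivors):
            a, b = survivors[i], survivors[i + 1]
            if a[1] + 1 == b[0]:
                merged.append((a[0], b[1], a[2] + b[2], a[3] + b[3]))
                i += 2
            else:
                i += 1
        nodes = merged
    return discoveries


# Hypothetical data: leaf 0 carries a strong local excess of cases,
# leaves 3 and 4 carry a weaker excess that only shows up once pooled.
example_bins = [(20, 20), (10, 20), (10, 20), (15, 20),
                (15, 20), (10, 20), (10, 20), (10, 20)]
print(layered_test(example_bins))  # [(1, 0, 0), (2, 3, 4)]
```

In this made-up example the strong signal in leaf 0 is rejected at layer 1, while the weaker signal spread over leaves 3 and 4 is only rejected at layer 2 after the two leaves are pooled, which mirrors the short-range/strong versus long-range/weak distinction drawn in the abstract.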

Type

Dissertation

Department

Biostatistics and Bioinformatics Doctor of Philosophy

Subjects

Biostatistics, Bioinformatics, Genetics, aggregation tree, false discovery proportion, Flow cytometry, Multiple testing, rare variant association

Permalink

https://hdl.handle.net/10161/21453

Citation

Pura, John (2020). Multiple Testing Embedded in an Aggregation Tree With Applications to Omics Data. Dissertation, Duke University. Retrieved from https://hdl.handle.net/10161/21453.

Collections

Dissertations

Duke's student scholarship is made available to the public using a Creative Commons Attribution / Non-commercial / No derivative (CC-BY-NC-ND) license.
