dc.description.abstract |
<p>In my dissertation, I have developed computational methods for high-dimensional
inference, motivated by the analysis of omics data. This dissertation is divided into
two parts. The first part of this dissertation is motivated by flow cytometry data
analysis, where a key goal is to identify sparse cell subpopulations that differ between
two groups. I have developed an algorithm called multiple Testing Embedded on
an Aggregation tree Method (TEAM) to locate where distributions differ between two
samples. Regions containing differences can be identified in layers along the tree:
the first layer searches for regions containing short-range, strong distributional
differences, and higher layers search for regions containing long-range, weak distributional
differences. TEAM is able to pinpoint local differences and, under mild assumptions,
asymptotically control the layer-specific and overall false discovery rate (FDR).
Simulations verify our theoretical results. When applied to real flow cytometry data,
TEAM captures cell subtypes that are overexpressed in cytomegalovirus stimulation
vs. control. In addition, I have extended the TEAM algorithm so that it can incorporate
information from more than one cell attribute, allowing for more robust conclusions.
The second part of this dissertation is motivated by rare variant association studies,
where a key goal is to identify regions of rare variants that are associated with
disease. This problem is addressed via a flexible method called stochastic aggregation
tree-embedded testing (SATET). SATET embeds testing of genomic regions onto an aggregation
tree, which provides a way to test association at various resolutions. The rejection
rule at each layer depends on the previous layer and leads to a procedure that controls
the layer-specific FDR. Compared to methods that search for rare-variant association
over large regions, such as protein domains, SATET can pinpoint sub-genic regions
associated with disease. Numerical experiments show FDR control for different genetic
architectures and superior performance compared to domain-based analyses. When applied
to a case-control study in amyotrophic lateral sclerosis (ALS), SATET identified sub-genic
regions in known ALS-related genes, while implicating regions in new genes not previously
captured by domain-based analyses.</p>
|
|