EVALUATING AND INTERPRETING MACHINE LEARNING OUTPUTS IN GENOMICS DATA

Loading...
Thumbnail Image

Date

2022

Journal Title

Journal ISSN

Volume Title

Repository Usage Stats

82
views
101
downloads

Abstract

In my dissertation, we have developed statistical and computational tools to evaluate and interpret machine learning outputs in genomics data. The first two projects focus on single-cell RNA-sequencing (scRNA-seq) data. In project 1, we evaluated the fitting of widely-used distribution families on scRNA-seq UMI counts and concluded that UMI counts of polyclonal cells following gene-specific cell-type-specific NB distributions without zero- inflation. Based on this modeling, we proposed the working dispersion score (WDS) to select genes that differentially express across cell types. In project 2, we developed a new internal (unsupervised) index, Clustering Deviation Index (CDI), to evaluate cell label sets obtained from clustering algorithms. We conducted in silico and experimental scRNA-seq studies to show that CDI can select the optimal clustering label set. We also benchmarked CDI by comparing it with other internal indices in terms of the agreement with external indices using high-quality benchmark label sets. In addition, we demonstrated that CDI is more computationally efficient than other internal indices, especially for million-scale datasets. In project 3, we proposed a model-agnostic hypothesis testing framework to interpret feature interactions underneath complex machine learning models. The simulation study results demonstrated large power while controlling the type I error rate.

Description

Provenance

Citation

Citation

Fang, Jiyuan (2022). EVALUATING AND INTERPRETING MACHINE LEARNING OUTPUTS IN GENOMICS DATA. Dissertation, Duke University. Retrieved from https://hdl.handle.net/10161/25813.

Collections


Dukes student scholarship is made available to the public using a Creative Commons Attribution / Non-commercial / No derivative (CC-BY-NC-ND) license.