EVALUATING AND INTERPRETING MACHINE LEARNING OUTPUTS IN GENOMICS DATA

Fang, Jiyuan

Date

2022

Authors

Fang, Jiyuan

Advisors

Xie, Jichun

Abstract

In this dissertation, we developed statistical and computational tools to evaluate and interpret machine learning outputs in genomics data. The first two projects focus on single-cell RNA-sequencing (scRNA-seq) data. In project 1, we evaluated how well widely used distribution families fit scRNA-seq unique molecular identifier (UMI) counts and concluded that UMI counts of polyclonal cells follow gene-specific, cell-type-specific negative binomial (NB) distributions without zero-inflation. Based on this model, we proposed the working dispersion score (WDS) to select genes that are differentially expressed across cell types. In project 2, we developed a new internal (unsupervised) index, the Clustering Deviation Index (CDI), to evaluate cell label sets obtained from clustering algorithms. We conducted in silico and experimental scRNA-seq studies to show that CDI can select the optimal clustering label set. We also benchmarked CDI against other internal indices in terms of their agreement with external indices, using high-quality benchmark label sets. In addition, we demonstrated that CDI is more computationally efficient than other internal indices, especially for million-scale datasets. In project 3, we proposed a model-agnostic hypothesis testing framework to interpret feature interactions underlying complex machine learning models. Simulation studies demonstrated high power while controlling the type I error rate.

Type

Dissertation

Department

Biostatistics and Bioinformatics, Doctor of Philosophy

Subjects

Biostatistics

Permalink

https://hdl.handle.net/10161/25813

Citation

Fang, Jiyuan (2022). EVALUATING AND INTERPRETING MACHINE LEARNING OUTPUTS IN GENOMICS DATA. Dissertation, Duke University. Retrieved from https://hdl.handle.net/10161/25813.

Collections

Dissertations

Except where otherwise noted, student scholarship that was shared on DukeSpace after 2009 is made available to the public under a Creative Commons Attribution / Non-commercial / No derivatives (CC-BY-NC-ND) license. All rights in student work shared on DukeSpace before 2009 remain with the author and/or their designee, whose permission may be required for reuse.
