Database and Computational Methods Development for Multi-Modal Single-Cell Data
Date
2025
Abstract
High-throughput sequencing techniques have been developed to investigate cellular processes by generating data across different stages of the central dogma. These approaches encompass genomics, epigenomics, transcriptomics, and proteomics, offering insights into distinct molecular processes. Integrating data from these modalities, or "omics," enables scientists to unravel the complexities of cellular function and deepen our understanding of molecular biology.
The advent of single-cell sequencing has raised the resolution of omics data to the level of individual cells. Various methods have been developed to capture multi-modal information within single cells. Among them, Multiome, a co-assay of RNA-seq and ATAC-seq, simultaneously profiles the transcriptome and epigenome. Since its commercialization in 2020, Multiome has become a widely adopted tool for studying gene transcription regulation and interpreting key genetic variants, leading to a rapid accumulation of publicly available datasets.
However, several obstacles hinder researchers from fully utilizing multi-modal single-cell data. The first challenge lies in reusing publicly available datasets such as Multiome data. Different laboratories often adopt varying strategies, pipelines, and reference genomes for data processing, introducing technical variance that complicates the integration and comparison of samples across studies. This inconsistency also limits the development of machine learning methods that rely on standardized datasets.
The second challenge is the lack of robust methods for integrating information across modalities and effectively analyzing multi-modal single-cell data. Conventional single-cell analysis tools are primarily designed for single-modality datasets and are not easily adaptable to tasks such as trajectory inference, cell clustering, or batch correction in multi-modal data. This gap highlights the need for specialized computational tools capable of handling the complexity of multi-modal data integration and analysis.
In the second chapter, I present Compass (comparative analysis of gene regulation in single cells). Compass consists of two main modules: CompassDB, a comprehensive database of publicly available Multiome datasets, and CompassR, an open-source R software package designed for data exploration and visualization. CompassDB contains 435 Multiome samples, all processed using a uniform pipeline to ensure consistency and comparability across datasets. I also provide case studies that use CompassR to explore gene regulation at the tissue, cell type, and sample levels.
The third chapter introduces MILL (single-cell modality information integration with lightweight contrastive learning). MILL is designed to handle various types of multi-modal single-cell data, including Multiome, CITE-seq, TCR-seq, and spatial transcriptomics. Our results show that MILL effectively integrates information from multiple modalities by projecting them into a shared latent space. The latent space can be used for downstream analyses such as cell type annotation, trajectory inference, and batch effect correction. Notably, by incorporating image-based information, MILL significantly enhances cell type identification, particularly in cells with low RNA quality. Our benchmark results demonstrate that MILL outperforms existing state-of-the-art methods in tasks such as batch correction while requiring significantly fewer computational resources and less time.
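The idea of projecting modalities into a shared latent space with contrastive learning can be illustrated with a generic objective. The sketch below is not MILL's implementation, which this abstract does not describe. It assumes a plain linear projection per modality and a symmetric InfoNCE loss, in which the two views of the same cell form the positive pair and all other cells in the batch act as negatives. The names `project` and `info_nce`, the temperature value, and the toy embeddings are hypothetical.

```python
import math


def l2_normalize(v):
    """Scale a vector to unit length (left unchanged if it is all zeros)."""
    norm = math.sqrt(sum(x * x for x in v)) or 1.0
    return [x / norm for x in v]


def project(x, weights):
    """Map one cell's feature vector into the latent space.

    `weights` has one row per latent dimension; each row is as long as `x`.
    This linear map is an illustrative assumption, not MILL's encoder.
    """
    return l2_normalize([sum(w * xi for w, xi in zip(row, x)) for row in weights])


def info_nce(z_a, z_b, temperature=0.1):
    """Symmetric InfoNCE loss over a batch of paired, unit-length embeddings.

    z_a[i] and z_b[i] come from the same cell (positive pair); every other
    pairing in the batch is treated as a negative.
    """
    n = len(z_a)
    # Scaled dot-product similarity between every modality-A / modality-B pair.
    sim = [
        [sum(p * q for p, q in zip(z_a[i], z_b[j])) / temperature for j in range(n)]
        for i in range(n)
    ]

    def cross_entropy(rows):
        # Cross-entropy with the diagonal entry as the correct class,
        # computed with a numerically stable log-sum-exp.
        total = 0.0
        for i, row in enumerate(rows):
            m = max(row)
            log_sum = m + math.log(sum(math.exp(s - m) for s in row))
            total += log_sum - row[i]
        return total / n

    sim_t = [list(col) for col in zip(*sim)]  # B-to-A direction
    return 0.5 * (cross_entropy(sim) + cross_entropy(sim_t))


# Toy batch of two cells with two-dimensional latent embeddings.
aligned = [[1.0, 0.0], [0.0, 1.0]]
swapped = [[0.0, 1.0], [1.0, 0.0]]  # same embeddings, wrong cell pairing

loss_matched = info_nce(aligned, aligned)
loss_mismatched = info_nce(aligned, swapped)
print(loss_matched < loss_mismatched)
```

In this toy batch the loss is lower when each cell's two embeddings coincide than when they are paired with the wrong cell, which is the signal a contrastive objective uses to pull modalities of the same cell together in the latent space.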
In conclusion, Compass serves as a valuable resource and platform for the reuse of publicly available single-cell Multiome data, while MILL offers an efficient and powerful tool for analyzing multi-modal single-cell datasets. Together, these contributions provide robust computational tools and novel insights that advance the field of multi-modal single-cell data analysis.
Citation
Wan, Changxin (2025). Database and Computational Methods Development for Multi-Modal Single-Cell Data. Dissertation, Duke University. Retrieved from https://hdl.handle.net/10161/32717.
Except where otherwise noted, student scholarship that was shared on DukeSpace after 2009 is made available to the public under a Creative Commons Attribution / Non-commercial / No derivatives (CC-BY-NC-ND) license. All rights in student work shared on DukeSpace before 2009 remain with the author and/or their designee, whose permission may be required for reuse.