Browsing by Author "Xie, Jichun"
Item Embargo Develop Novel Statistical and Computational Methods for Omics Data Analysis (2023) Gao, Qi
Recent advances in sequencing technologies have enabled the measurement of gene expression and other omics profiles at multi-cell, single-cell or subcellular resolution. However, these advances also posed challenges for data analysis, such as identifying differentially expressed feature gene sets with high accuracy and benchmarking computational methods for various analysis topics on data with complex heterogeneity. In my dissertation, we have focused on developing novel statistical and computational methods to address these challenges.
In project 1, we developed SifiNet, a versatile pipeline to identify cell-subpopulation-specific feature genes, annotate cell subpopulations, and reveal their relationships. The major advantage of SifiNet is that it bypasses cell clustering and so avoids the bias that inaccurate clustering can introduce; as a result, SifiNet achieves significantly higher accuracy in feature gene identification and cell annotation than traditional two-step methods relying on clustering. SifiNet can analyze both single-cell RNA sequencing (scRNA-seq) and single-cell sequencing assay for transposase-accessible chromatin (scATAC-seq) data, providing insight into multiomic cellular profiles.
In project 2, we developed GeneScape, a novel scRNA-seq data simulator that can simulate complex cellular heterogeneity. Existing scRNA-seq data simulators are limited in their ability to simulate data with complex or subtle cellular heterogeneity, especially for cells that exhibit both cell type and cell state differences (such as differences in cell cycles, senescence levels, and DNA-damage levels). GeneScape can successfully simulate gene expression for cells with complex heterogeneity structures.
In project 3, we developed GeneScape-S (GeneScape-Spatial), a simulator for spatially resolved transcriptomics (SRT) data. Existing SRT-specific simulators cannot fulfill customized needs such as simulating multi-layer data, mimicking local tissue heterogeneity, and accommodating mixing cell-type structures in low-resolution spots. To fill these gaps, we propose GeneScape-S, which preserves the expression and spatial patterns of real SRT data, and offers specially designed functions tailored to fulfill customized needs. GeneScape-S also incorporates the features in GeneScape to simulate complex heterogeneities.
Item Open Access EVALUATING AND INTERPRETING MACHINE LEARNING OUTPUTS IN GENOMICS DATA (2022) Fang, Jiyuan
In my dissertation, we have developed statistical and computational tools to evaluate and interpret machine learning outputs in genomics data. The first two projects focus on single-cell RNA-sequencing (scRNA-seq) data. In project 1, we evaluated the fit of widely used distribution families to scRNA-seq UMI counts and concluded that UMI counts of polyclonal cells follow gene-specific, cell-type-specific negative binomial (NB) distributions without zero-inflation. Based on this modeling, we proposed the working dispersion score (WDS) to select genes that are differentially expressed across cell types. In project 2, we developed a new internal (unsupervised) index, Clustering Deviation Index (CDI), to evaluate cell label sets obtained from clustering algorithms. We conducted in silico and experimental scRNA-seq studies to show that CDI can select the optimal clustering label set. We also benchmarked CDI by comparing it with other internal indices in terms of the agreement with external indices using high-quality benchmark label sets. In addition, we demonstrated that CDI is more computationally efficient than other internal indices, especially for million-scale datasets. In project 3, we proposed a model-agnostic hypothesis testing framework to interpret feature interactions underneath complex machine learning models. The simulation study results demonstrated high power while controlling the type I error rate.
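As a rough illustration of what "NB without zero-inflation" means in project 1 above, the sketch below fits a negative binomial to one gene's counts by the method of moments and compares the zero fraction the NB predicts with the zero fraction observed. This is only an assumed, simplified check: the toy counts, the function names, and the method-of-moments fit are illustrative choices and are not taken from the dissertation.

```python
import math


def nb_zero_probability(mean, variance):
    """P(X = 0) under a negative binomial fitted by the method of moments."""
    if variance <= mean:
        # No overdispersion detected: use the Poisson limit of the NB.
        return math.exp(-mean)
    size = mean ** 2 / (variance - mean)  # NB size (inverse dispersion)
    return (size / (size + mean)) ** size


def zero_fraction_check(counts):
    """Return (observed zero fraction, zero fraction expected under the NB fit)."""
    n = len(counts)
    mean = sum(counts) / n
    variance = sum((c - mean) ** 2 for c in counts) / (n - 1)
    observed = sum(1 for c in counts if c == 0) / n
    expected = nb_zero_probability(mean, variance)
    return observed, expected


# Hypothetical UMI counts for one gene within one cell type.
toy_counts = [0, 0, 0, 1, 0, 2, 0, 5, 0, 1, 3, 0]
observed, expected = zero_fraction_check(toy_counts)
print(f"observed zero fraction: {observed:.3f}, NB-expected: {expected:.3f}")
```

When the observed zero fraction is close to the NB-expected one, the overdispersion alone accounts for the zeros and no separate zero-inflation component is needed; a formal assessment would use a goodness-of-fit test rather than this side-by-side comparison.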
Item Open Access Genetic variant of IRAK2 in the toll-like receptor signaling pathway and survival of non-small cell lung cancer. (International journal of cancer, 2018-11) Xu, Yinghui; Liu, Hongliang; Liu, Shun; Wang, Yanru; Xie, Jichun; Stinchcombe, Thomas E; Su, Li; Zhang, Ruyang; Christiani, David C; Li, Wei; Wei, Qingyi
The toll-like receptor (TLR) signaling pathway plays an important role in the innate immune responses and antigen-specific acquired immunity. Aberrant activation of the TLR pathway has a significant impact on carcinogenesis or tumor progression. Therefore, we hypothesize that genetic variants in the TLR signaling pathway genes are associated with overall survival (OS) of patients with non-small cell lung cancer (NSCLC). To test this hypothesis, we first performed Cox proportional hazards regression analysis to evaluate associations between genetic variants of 165 TLR signaling pathway genes and NSCLC OS using the genome-wide association study (GWAS) dataset from the Prostate, Lung, Colorectal, and Ovarian Cancer Screening Trial (PLCO). The results were further validated by the Harvard Lung Cancer Susceptibility GWAS dataset. Specifically, we identified IRAK2 rs779901 C > T as a predictor of NSCLC OS, with a variant-allele (T) attributed hazards ratio (HR) of 0.78 [95% confidence interval (CI) = 0.67-0.91, P = 0.001] in the PLCO dataset, 0.84 (0.72-0.98, 0.031) in the Harvard dataset, and 0.81 (0.73-0.90, 1.08 × 10^-4) in the meta-analysis of these two GWAS datasets. In addition, the T allele was significantly associated with an increased mRNA expression level of IRAK2.
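The pooled hazard ratio quoted above can be approximately reproduced from the two study-level estimates. The sketch below assumes a fixed-effect, inverse-variance meta-analysis on the log-HR scale, with standard errors back-calculated from the reported 95% confidence intervals; the abstract does not state which meta-analysis model was used, so this is an illustration and not the authors' procedure. All function names are hypothetical.

```python
import math

Z_95 = 1.959964  # standard normal quantile for a two-sided 95% interval


def log_hr_and_se(hr, ci_low, ci_high):
    """Recover the log hazard ratio and its standard error from a 95% CI."""
    log_hr = math.log(hr)
    se = (math.log(ci_high) - math.log(ci_low)) / (2 * Z_95)
    return log_hr, se


def fixed_effect_pool(studies):
    """Inverse-variance pooled HR with its 95% CI from (hr, low, high) tuples."""
    total_weight = 0.0
    weighted_sum = 0.0
    for hr, ci_low, ci_high in studies:
        log_hr, se = log_hr_and_se(hr, ci_low, ci_high)
        weight = 1.0 / se ** 2
        total_weight += weight
        weighted_sum += weight * log_hr
    pooled = weighted_sum / total_weight
    pooled_se = math.sqrt(1.0 / total_weight)
    return (
        math.exp(pooled),
        math.exp(pooled - Z_95 * pooled_se),
        math.exp(pooled + Z_95 * pooled_se),
    )


# HR and 95% CI for IRAK2 rs779901 as reported for the PLCO and Harvard datasets.
studies = [(0.78, 0.67, 0.91), (0.84, 0.72, 0.98)]
hr, low, high = fixed_effect_pool(studies)
print(f"pooled HR = {hr:.2f} ({low:.2f}-{high:.2f})")
```

Under these assumptions the pooled estimate comes out at roughly 0.81 (0.73-0.90), in line with the reported meta-analysis result.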
Our findings suggest that IRAK2 rs779901 C > T may be a promising prognostic biomarker for NSCLC OS.
Item Open Access Genetic variants in the calcium signaling pathway genes are associated with cutaneous melanoma-specific survival. (Carcinogenesis, 2018-12-29) Wang, Xiaomeng; Liu, Hongliang; Xu, Yinghui; Xie, Jichun; Zhu, Dakai; Amos, Christopher I; Fang, Shenying; Lee, Jeffrey E; Li, Xin; Nan, Hongmei; Song, Yanqiu; Wei, Qingyi
Remodeling or deregulation of the calcium signaling pathway is a relevant hallmark of cancer including cutaneous melanoma (CM). In the present study, using data from a published genome-wide association study (GWAS) from The University of Texas M.D. Anderson Cancer Center, we assessed the role of 41,377 common single nucleotide polymorphisms (SNPs) of 167 calcium signaling pathway genes in CM survival. We used another GWAS from Harvard University as the validation dataset. In the single-locus analysis, 1,830 SNPs were found to be significantly associated with CM-specific survival (CMSS) (P ≤ 0.050 and false-positive report probability ≤ 0.2), of which nine SNPs were validated in the Harvard study (P ≤ 0.050). Among these, three independent SNPs (i.e., PDE1A rs6750552 T>C, ITPR1 rs6785564 A>G and RYR3 rs2596191 C>A) had a predictive role in CMSS, with a meta-analysis derived hazards ratio (HR) of 1.52 [95% confidence interval (CI) = 1.19-1.94, P = 7.21 × 10^-4], 0.49 (0.33-0.73, 3.94 × 10^-4) and 0.67 (0.53-0.86, 0.0017), respectively. Patients with an increasing number of protective genotypes had remarkably improved CMSS. Additional expression quantitative trait loci (eQTL) analysis showed that these genotypes were also significantly associated with mRNA expression levels of the genes.
Taken together, these results may help us to identify prospective biomarkers in the calcium signaling pathway for CM prognosis.
Item Open Access Genetic variants in the platelet-derived growth factor subunit B gene associated with pancreatic cancer risk. (International journal of cancer, 2018-04) Duan, Bensong; Hu, Jiangfeng; Liu, Hongliang; Wang, Yanru; Li, Hongyu; Liu, Shun; Xie, Jichun; Owzar, Kouros; Abbruzzese, James; Hurwitz, Herbert; Gao, Hengjun; Wei, Qingyi
The platelet-derived growth factor (PDGF) signaling pathway plays important roles in development and progression of human cancers. In our study, we aimed to identify genetic variants of the PDGF pathway genes associated with pancreatic cancer (PC) risk in European populations using three published genome-wide association study datasets, which consisted of 9,381 cases and 7,719 controls. The expression quantitative trait loci (eQTL) analysis was also performed using data from the 1000 Genomes, TCGA and GTEx projects. As a result, we identified two potential susceptibility loci (rs5757573 and rs6001516) of PDGFB associated with PC risk [odds ratio (OR) = 1.10, 95% confidence interval (CI) = 1.05-1.16, and p = 4.70 × 10^-5 for the rs5757573 C allele and 1.21, 1.11-1.32, and 2.01 × 10^-5 for the rs6001516 T allele]. Haplotype analysis revealed that the C-T haplotype carriers had a significantly higher risk of PC than those carrying the T-C haplotype (OR = 1.23, 95% CI = 1.12-1.34, p = 5.00 × 10^-6). The multivariate regression model incorporating the number of unfavorable genotypes (NUGs) with age and sex showed that carriers with 1-2 NUGs, particularly those in the 60-70 age group or males, had an increased risk of PC, compared to those without NUGs. Furthermore, the eQTL analysis revealed that both loci were correlated with a decreased mRNA expression level of PDGFB in lymphoblastoid cell lines and pancreatic tumor tissues (p = 0.015 and 0.071, respectively).
Our results suggest that genetic variants in PDGFB may play a role in susceptibility to PC. Further population and functional validations of our findings are warranted.
Item Open Access Multiple Testing Embedded in an Aggregation Tree With Applications to Omics Data (2020) Pura, John
In my dissertation, I have developed computational methods for high dimensional inference, motivated by the analysis of omics data. This dissertation is divided into two parts. The first part of this dissertation is motivated by flow cytometry data analysis, where a key goal is to identify sparse cell subpopulations that differ between two groups. I have developed an algorithm called multiple Testing Embedded on an Aggregation tree Method (TEAM) to locate where distributions differ between two samples. Regions containing differences can be identified in layers along the tree: the first layer searches for regions containing short-range, strong distributional differences, and higher layers search for regions containing long-range, weak distributional differences. TEAM is able to pinpoint local differences and under mild assumptions, asymptotically control the layer-specific and overall false discovery rate (FDR). Simulations verify our theoretical results. When applied to real flow cytometry data, TEAM captures cell subtypes that are overexpressed in cytomegalovirus stimulation vs. control. In addition, I have extended the TEAM algorithm so that it can incorporate information from more than one cell attribute, allowing for more robust conclusions. The second part of this dissertation is motivated by rare variant association studies, where a key goal is to identify regions of rare variants, which are associated with disease. This problem is addressed via a flexible method called stochastic aggregation tree-embedded testing (SATET). SATET embeds testing of genomic regions onto an aggregation tree, which provides a way to test association at various resolutions.
The rejection rule at each layer depends on the previous layer, and leads to a procedure that controls the layer-specific FDR. Compared to methods that search for rare-variant association over large regions, such as protein domains, SATET can pinpoint sub-genic regions associated with disease. Numerical experiments show FDR control for different genetic architectures and superior performance compared to domain-based analyses. When applied to a case-control study in amyotrophic lateral sclerosis (ALS), SATET identified sub-genic regions in known ALS-related genes, while implicating regions in new genes not previously captured by domain-based analyses.
Item Open Access Multiple Testing for Data with Ancillary Information (2022) Li, Xuechan
In my dissertation, I develop three powerful hierarchical multiple testing methods by accounting for ancillary information of data. In my first project, we develop a multiple testing framework named Distance Assisted Recursive Testing (DART). DART assumes there exists some informative distance information in the data. Through rigorous proof and extensive simulations, we justify the false discovery rate (FDR) control and sensitivity improvement of DART. As an illustration, we apply our method to a clinical trial in leukemia patients receiving hematopoietic cell transplantation to identify the gut microbiota whose abundance is impacted by the after-transplant care. The second project is motivated by flow cytometry analysis in immunology studies. The analysis can be translated into a statistical problem: pinpointing the regions where two density functions differ. By partitioning the sample space into small bins and conducting a test on each bin, we cast the analysis as a multiple testing problem. We provide theoretical justification that the procedure achieves the statistical goal of pinpointing the regions with differential density with high sensitivity and precision. My third project is motivated by rare variant association studies. We develop a multiple testing framework named DATED (Dynamic Aggregation and Tree-Embedded testing) to pinpoint the disease-associated rare-variant regions hierarchically and dynamically. To accommodate the application objective, DATED adopts a rare variant region-level FDR weighted by the proportions of the neutral rare variants. Extensive numerical simulations demonstrate the superior performance of DATED under various scenarios compared to the existing methods. We illustrate DATED by applying it to an amyotrophic lateral sclerosis (ALS) study for identifying pathogenic rare variants.
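The idea in the second project above, partitioning the sample space into bins and testing each bin, can be sketched in one dimension as follows. This is a minimal stand-in and not the dissertation's procedure: it assumes data on [0, 1), equal-width bins, an exact binomial test per bin, and the Benjamini-Hochberg step-up procedure for FDR control, none of which the abstract specifies. All names and the toy data are hypothetical.

```python
import math


def binomial_two_sided_p(k, n, p=0.5):
    """Exact two-sided binomial p-value: total mass of outcomes no likelier than k."""
    pmf = [math.comb(n, i) * p ** i * (1 - p) ** (n - i) for i in range(n + 1)]
    return min(1.0, sum(q for q in pmf if q <= pmf[k] * (1 + 1e-9)))


def benjamini_hochberg(p_values, alpha=0.05):
    """Indices of hypotheses rejected by the Benjamini-Hochberg procedure."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    cutoff = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= alpha * rank / m:
            cutoff = rank  # largest rank that passes the step-up threshold
    return sorted(order[:cutoff])


def differential_bins(sample_a, sample_b, num_bins=10, alpha=0.05):
    """Bin two samples on [0, 1) and flag bins whose densities appear to differ."""
    counts_a = [0] * num_bins
    counts_b = [0] * num_bins
    for x in sample_a:
        counts_a[min(int(x * num_bins), num_bins - 1)] += 1
    for x in sample_b:
        counts_b[min(int(x * num_bins), num_bins - 1)] += 1
    # Under equal densities, a point in any bin comes from A with this probability.
    p_null = len(sample_a) / (len(sample_a) + len(sample_b))
    p_values = [
        binomial_two_sided_p(a, a + b, p_null) for a, b in zip(counts_a, counts_b)
    ]
    return benjamini_hochberg(p_values, alpha)


# Toy data: A is spread evenly; B has an extra cluster of points in [0.4, 0.5).
sample_a = [(i + 0.5) / 200 for i in range(200)]
sample_b = [(i + 0.5) / 150 for i in range(150)]
sample_b += [0.4 + (i + 0.5) / 500 for i in range(50)]
rejected = differential_bins(sample_a, sample_b)
print("bins flagged as differential:", rejected)
```

On this toy input only the bin covering [0.4, 0.5) is flagged. Fixed equal-width bins are the simplest choice; hierarchical methods such as those described in these abstracts aggregate bins across layers to pick up weaker, longer-range differences.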
Item Open Access Novel genetic variants in the P38MAPK pathway gene ZAK and susceptibility to lung cancer. (Molecular carcinogenesis, 2018-02) Feng, Yun; Wang, Yanru; Liu, Hongliang; Liu, Zhensheng; Mills, Coleman; Owzar, Kouros; Xie, Jichun; Han, Younghun; Qian, David C; Hung, Rayjean J; Brhane, Yonathan; McLaughlin, John; Brennan, Paul; Bickeböller, Heike; Rosenberger, Albert; Houlston, Richard S; Caporaso, Neil; Landi, Maria Teresa; Brüske, Irene; Risch, Angela; Ye, Yuanqing; Wu, Xifeng; Christiani, David C; Amos, Christopher I; Wei, Qingyi
The P38MAPK pathway participates in regulating cell cycle, inflammation, development, cell death, cell differentiation, and tumorigenesis. Genetic variants of some genes in the P38MAPK pathway are reportedly associated with lung cancer risk. To substantiate this finding, we used six genome-wide association studies (GWASs) to comprehensively investigate the associations of 14 904 single nucleotide polymorphisms (SNPs) in 108 genes of this pathway with lung cancer risk. We identified six significant lung cancer risk-associated SNPs in two genes (CSNK2B and ZAK) after correction for multiple comparisons by a false discovery rate (FDR) <0.20. After removal of three CSNK2B SNPs that are located in the same locus previously reported by GWAS, we performed the LD analysis and found that rs3769201 and rs7604288 were in high LD. We then chose two independent representative SNPs of rs3769201 and rs722864 in ZAK for further analysis. We also expanded the analysis by including these two SNPs from additional GWAS datasets of Harvard University (984 cases and 970 controls) and deCODE (1319 cases and 26 380 controls). The overall effects of these two SNPs were assessed using all eight GWAS datasets (OR = 0.92, 95% CI = 0.89-0.95, and P = 1.03 × 10^-5 for rs3769201; OR = 0.91, 95% CI = 0.88-0.95, and P = 2.03 × 10^-6 for rs722864).
Finally, we performed an expression quantitative trait loci (eQTL) analysis and found that these two SNPs were significantly associated with ZAK mRNA expression levels in lymphoblastoid cell lines. In conclusion, the ZAK rs3769201 and rs722864 may be functional susceptibility loci for lung cancer risk.
Item Open Access PenPC: A two-step approach to estimate the skeletons of high-dimensional directed acyclic graphs. (Biometrics, 2016-03) Ha, Min Jin; Sun, Wei; Xie, Jichun
Estimation of the skeleton of a directed acyclic graph (DAG) is of great importance for understanding the underlying DAG, and causal effects can be assessed from the skeleton when the DAG is not identifiable. We propose a novel method named PenPC to estimate the skeleton of a high-dimensional DAG by a two-step approach. We first estimate the nonzero entries of a concentration matrix using penalized regression, and then fix the difference between the concentration matrix and the skeleton by evaluating a set of conditional independence hypotheses. For high-dimensional problems where the number of vertices p is in polynomial or exponential scale of sample size n, we study the asymptotic property of PenPC on two types of graphs: traditional random graphs where all the vertices have the same expected number of neighbors, and scale-free graphs where a few vertices may have a large number of neighbors.
As illustrated by extensive simulations and applications on gene expression data of cancer patients, PenPC has higher sensitivity and specificity than the state-of-the-art method, the PC-stable algorithm.
Item Open Access Single-cell landscape analysis unravels molecular programming of the human B cell compartment in chronic GVHD. (JCI insight, 2023-06) Poe, Jonathan C; Fang, Jiyuan; Zhang, Dadong; Lee, Marissa R; DiCioccio, Rachel A; Su, Hsuan; Qin, Xiaodi; Zhang, Jennifer Y; Visentin, Jonathan; Bracken, Sonali J; Ho, Vincent T; Wang, Kathy S; Rose, Jeremy J; Pavletic, Steven Z; Hakim, Frances T; Jia, Wei; Suthers, Amy N; Curry-Chisolm, Itaevia M; Horwitz, Mitchell E; Rizzieri, David A; McManigle, William C; Chao, Nelson J; Cardones, Adela R; Xie, Jichun; Owzar, Kouros; Sarantopoulos, Stefanie
Alloreactivity can drive autoimmune syndromes. After allogeneic hematopoietic stem cell transplantation (allo-HCT), chronic graft-versus-host disease (cGVHD), a B cell-associated autoimmune-like syndrome, commonly occurs. Because donor-derived B cells continually develop under selective pressure from host alloantigens, aberrant B cell receptor (BCR) activation and IgG production can emerge and contribute to cGVHD pathobiology. To better understand molecular programming of B cells in allo-HCT, we performed scRNA-Seq analysis on high numbers of purified B cells from patients. An unsupervised analysis revealed 10 clusters, distinguishable by signature genes for maturation, activation, and memory. Within the memory B cell compartment, we found striking transcriptional differences in allo-HCT patients compared with healthy or infected individuals, including potentially pathogenic atypical B cells (ABCs) that were expanded in active cGVHD. To identify intrinsic alterations in potentially pathological B cells, we interrogated all clusters for differentially expressed genes (DEGs) in active cGVHD versus patients who never had signs of immune tolerance loss (no cGVHD).
Active cGVHD DEGs occurred in both naive and BCR-activated B cell clusters. Remarkably, some DEGs occurred across most clusters, suggesting common molecular programs that may promote B cell plasticity. Our study of human allo-HCT and cGVHD provides understanding of altered B cell memory during chronic alloantigen stimulation.