Browsing by Subject "Bioinformatics"
Item Open Access A Bayesian Model for Nucleosome Positioning Using DNase-seq Data (2015) Zhong, Jianling
As fundamental structural units of the chromatin, nucleosomes are involved in virtually all aspects of genome function. Different methods have been developed to map genome-wide nucleosome positions, including MNase-seq and a recent chemical method requiring genetically engineered cells. However, these methods are either low resolution and prone to enzymatic sequence bias or require genetically modified cells. The DNase I enzyme has been used to probe nucleosome structure since the 1960s, but in the current high throughput sequencing era, DNase-seq has mainly been used to study regulatory sequences known as DNase hypersensitive sites. This thesis shows that DNase-seq data is also very informative about nucleosome positioning. The distinctive oscillatory DNase I cutting patterns on nucleosomal DNA are shown and discussed. Based on these patterns, a Bayes factor is proposed to be used for distinguishing nucleosomal and non-nucleosomal genome positions. The results show that this approach is highly sensitive and specific. A Bayesian method that simulates the data generation process and can provide more interpretable results is further developed based on the Bayes factor investigations. Preliminary results on a test genomic region show that the Bayesian model works well in identifying nucleosome positioning. Estimated posterior distributions also agree with some known biological observations from external data. Taken together, methods developed in this thesis show that DNase-seq can be used to identify nucleosome positioning, adding great value to this widely utilized protocol.
Item Open Access A Cloud-Based Infrastructure for Cancer Genomics (2020) Panea, Razvan Ioan
The advent of new genomic approaches, particularly next generation sequencing (NGS), has resulted in explosive growth of biological data. As the size of biological data keeps growing at exponential rates, new methods for data management and data processing are becoming essential in bioinformatics and computational biology. Indeed, data analysis has now become the central challenge in genomics.
NGS has provided rich tools for defining genomic alterations that cause cancer. The processing time and computing requirements have now become a serious bottleneck to the characterization and analysis of these genomic alterations. Moreover, as the adoption of NGS continues to increase, the computing power required often exceeds what any single institution can provide, leading to major restraints in the type and number of analyses that can be performed.
Cloud computing represents a potential solution to this problem. On a cloud platform, computing resources can be available on-demand, thus allowing users to implement scalable and highly parallel methods. However, few centralized frameworks exist that allow the average researcher to apply bioinformatics workflows using cloud resources. Moreover, bioinformatics approaches are associated with multiple processing challenges, such as the variability in the methods or data used and the reproducibility requirements of the research analysis.
Here, we present CloudConductor, a software system that is specifically designed to harness the power of cloud computing to perform complex analysis pipelines on large biological datasets. CloudConductor was designed with five central features in mind: scalability, modularity, parallelism, reproducibility and platform agnosticism.
We demonstrate the processing power afforded by CloudConductor on a real-world genomics problem. Using CloudConductor, we processed and analyzed 101 whole genome tumor-normal paired samples from Burkitt lymphoma subtypes to identify novel genomic alterations. We identified a total of 72 driver genes associated with the disease. Somatic events were identified in both coding and non-coding regions of nearly all driver genes, notably in genes IGLL5, BACH2, SIN3A, and DNMT1. We have developed the analysis framework by implementing a graphical user interface, a back-end database system, a data loader and a workflow management system.
In this thesis, we develop the concepts and describe an implementation of automated cloud-based infrastructure to analyze genomics data, creating a fast and efficient analysis resource for genomics researchers.
Item Open Access A Computational Synthesis of Genes, Behavior, and Evolution Provides Insights into the Molecular Basis of Vocal Learning (2012) Pfenning, Andreas R
Vocal learning is the ability to modify vocal output based on auditory input and is the basis of human speech acquisition. It is shared by a few distantly related bird and mammal orders, and is thus very likely to be an example of convergent evolution, having evolved independently in multiple lineages. This complex behavior is presumed to require networks of regulated genes to develop the necessary neural circuits for learning and maintaining vocalizations. Deciphering these networks has been limited by the lack of high throughput genomic tools in vocal learning avian species and the lack of a solid computational framework to understand the relationship between gene expression and behavior. This dissertation provides new insights into the evolution and mechanisms of vocal learning by taking a top-down, systems biology approach to understanding gene expression regulation across avian and mammalian species. First, I worked with colleagues to develop a zebra finch Agilent oligonucleotide microarray, including developing programs for more accurate annotation of oligonucleotides and genes. I then used these arrays and tools in multiple collaborative, but related projects, to measure transcriptome expression data in vocal learning and non-learning avian species, under a number of behavioral paradigms, with a focus on song production. To make sense of the avian microarray data, I compiled microarray data from other sources, including expression analyses across over 900 human brain regions generated by the Allen Brain Institute.
To compare these data sets, I developed and performed a variety of computational analyses including clustering, linear models, gene set enrichment analysis, motif discovery, and phylogenetic inference, providing a novel framework to study the gene regulatory networks associated with a complex behavior. Using the developed framework, we are able to better understand vocal learning at different levels: how the brain regions for vocal learning evolved and how those brain regions function during the production of learned vocalizations. At the evolutionary level, we identified genes with unique expression patterns in the brains of vocal learning birds and humans. Interesting candidates include genes related to formation of neural connections, in particular the SLIT/ROBO axon guidance pathway. This algorithm also allowed us to identify the analogous regions that are part of the vocal learning circuit across species, providing the first quantitative evidence relating the human vocal learning circuit to the avian vocal learning circuit. With the avian song system verified as a model for human speech at the molecular level, we conducted an experiment to better understand what is happening in those brain regions during singing by profiling gene expression in a time course as birds are producing song. Surprisingly, an overwhelming majority of the gene expression identified was strongly enriched in a particular region. We also found a tight coupling between the behavioral function of a particular region and the gene expression pattern. To gain insight into the mechanisms of this gene regulation, we conducted a genomic scan of transcription factor binding sites in zebra finch. Many transcription factor binding sites were enriched in the promoters of genes with particular temporal patterns, several of which had already been hypothesized to play a role in the neural system.
Using this data set of gene expression profiles and transcription factor binding sites along with separate experiments conducted in mouse, we were able to uncover evidence that the transcription factor CARF plays a role in neuron homeostasis. These results have broadened our understanding of the molecular basis of vocal learning at multiple levels. Overall, this dissertation outlines a novel way of approaching the study of the relationship between genes and behavior.
Item Open Access A Genomic Definition of Centromeres in Complex Genomes (2011) Hayden, Karen Elizabeth
Centromeres, or sites of chromosomal spindle attachment during mitosis and meiosis, are non-randomly distributed in complex genomes and are largely associated with expansive, near-identical satellite DNA arrays. While the sequence basis of centromere identity remains a subject of considerable debate, one approach is to examine the genomic organization of satellite DNA arrays and their potential function. Current genome assembly and sequence annotation strategies, however, are dependent on robust sequence variation, and, as a result, these regions of near sequence identity remain absent from current genome reference sequences and thus are detached from explorations of centromere biology. This dissertation is designed as a foundational study for centromere genomics, providing the initial steps to characterize those sequences at endogenous centromeres, while further classifying 'functional' sequences that directly interact with, or are capable of recruiting proteins involved in, centromere function. These studies build on and take advantage of the limited sequence variation in centromeric satellite DNA, providing the necessary genomic scope to promote biologically meaningful characterization of endogenous centromere sequences in both human and non-human genomes. As a result, this thesis demonstrates possible genomic standards for future studies in the emerging field of satellite biology, which is now positioned to address functional centromere sequence variation across evolutionary time.
Item Open Access A Geometric Approach to Biomedical Time Series Analysis (2020) Malik, John
Biomedical time series are non-invasive windows through which we may observe human systems. Although a vast amount of information is hidden in the medical field's growing collection of long-term, high-resolution, and multi-modal biomedical time series, effective algorithms for extracting that information have not yet been developed. We are particularly interested in the physiological dynamics of a human system, namely the changes in state that the system experiences over time (which may be intrinsic or extrinsic in origin). We introduce a mathematical model for a particular class of biomedical time series, called the wave-shape oscillatory model, which quantifies the sense in which dynamics are hidden in those time series. There are two key ideas behind the new model. First, instead of viewing a biomedical time series as a sequence of measurements made at the sampling rate of the signal, we can often view it as a sequence of cycles occurring at irregularly-sampled time points. Second, the "shape" of an individual cycle is assumed to have a one-to-one correspondence with the state of the system being monitored; as such, changes in system state (dynamics) can be inferred by tracking changes in cycle shape. Since physiological dynamics are not random but are well-regulated (except in the most pathological of cases), we can assume that all of the system's states lie on a low-dimensional, abstract Riemannian manifold called the phase manifold. When we model the correspondence between the hidden system states and the observed cycle shapes using a diffeomorphism, we allow the topology of the phase manifold to be recovered by methods belonging to the field of unsupervised manifold learning.
In particular, we prove that the physiological dynamics hidden in a time series adhering to the wave-shape oscillatory model can be well-recovered by applying the diffusion maps algorithm to the time series' set of oscillatory cycles. We provide several applications of the wave-shape oscillatory model and the associated algorithm for dynamics recovery, including unsupervised and supervised heartbeat classification, derived respiratory monitoring, intra-operative cardiovascular monitoring, supervised and unsupervised sleep stage classification, and f-wave extraction (a single-channel blind source separation problem).
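As a rough illustration of the dynamics-recovery step described in this abstract, the following Python sketch applies a basic diffusion maps embedding to a set of oscillatory cycles. It is not the author's implementation: the function name `diffusion_map`, the median bandwidth heuristic, the `alpha` normalization parameter, and the synthetic cycles are all assumptions made for illustration only.

```python
import numpy as np


def diffusion_map(cycles, n_components=2, alpha=1.0):
    """Embed cycles (one per row) with a basic diffusion maps construction."""
    X = np.asarray(cycles, dtype=float)

    # Pairwise squared Euclidean distances between cycles.
    sq_norms = np.sum(X ** 2, axis=1)
    d2 = np.maximum(sq_norms[:, None] + sq_norms[None, :] - 2.0 * (X @ X.T), 0.0)

    # Assumed heuristic: kernel bandwidth = median positive squared distance.
    eps = np.median(d2[d2 > 0])
    K = np.exp(-d2 / eps)

    # Density normalization (alpha = 1 reduces the effect of uneven sampling).
    q = K.sum(axis=1)
    K = K / np.outer(q, q) ** alpha

    # Symmetric matrix that shares eigenvalues with the Markov matrix D^-1 K.
    d = K.sum(axis=1)
    S = K / np.sqrt(np.outer(d, d))
    vals, vecs = np.linalg.eigh(S)
    order = np.argsort(vals)[::-1]
    vals, vecs = vals[order], vecs[:, order]

    # Convert to right eigenvectors of D^-1 K and drop the trivial constant one.
    psi = vecs / np.sqrt(d)[:, None]
    return psi[:, 1:n_components + 1] * vals[1:n_components + 1]


# Hypothetical synthetic cycles whose shape depends on one hidden state s.
s = np.linspace(0.0, 1.0, 60)
t = np.linspace(0.0, 1.0, 50, endpoint=False)
cycles = np.sin(2 * np.pi * t)[None, :] + s[:, None] * np.sin(4 * np.pi * t)[None, :]
embedding = diffusion_map(cycles)
print(embedding.shape)
```

In this toy setting the first embedding coordinate should track the hidden state `s` closely, which mirrors the idea that changes in cycle shape reveal changes in system state. Real recordings would additionally require cycle segmentation and alignment, which the sketch does not attempt.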
Item Open Access A Phylogenetic, Ecological, and Functional Characterization of Non-Photoautotrophic Bacteria in the Lichen Microbiome (2011) Hodkinson, Brendan P.
Although common knowledge dictates that the lichen thallus is formed solely by a fungus (mycobiont) that develops a symbiotic relationship with an alga and/or cyanobacterium (photobiont), the non-photoautotrophic bacteria found in lichen microbiomes are increasingly regarded as integral components of lichen thalli and significant players in the ecology and physiology of lichens. Despite recent interest in this topic, the phylogeny, ecology, and function of these bacteria remain largely unknown. The experiments presented in this dissertation employ culture-free methods to examine the bacteria housed in these unique environments to ultimately inform an assessment of their status with regard to the lichen symbiosis. Microbiotic surveys of lichen thalli using new oligonucleotide primers targeting the 16S SSU rRNA gene (developed as part of this study to target Bacteria, but exclude sequences derived from chloroplasts and Cyanobacteria) revealed the identity of diverse bacterial associates, including members of an undescribed lineage in the order Rhizobiales (Lichen-Associated Rhizobiales 1; 'LAR1'). It is shown that the LAR1 bacterial lineage, uniquely associated with lichen thalli, is widespread among lichens formed by distantly related lichen-forming fungi and is found in lichens collected from the tropics to the arctic. Through extensive molecular cloning of the 16S rRNA gene and 454 16S amplicon sequencing, ecological trends were inferred based on mycobiont, photobiont, and geography. The implications for using lichens as microcosms to study larger principles of ecology and evolution are discussed.
In addition to phylogenetic and ecological studies of lichen-associated bacterial communities, this dissertation provides a first assessment of the functions performed by these bacteria within the lichen microbiome in nature through 454 sequencing of two different lichen metatranscriptomes (one from a chlorolichen, Cladonia grayi, and one from a cyanolichen, Peltigera praetextata). Non-photobiont bacterial genes for nitrogen fixation were not detected in the Cladonia thallus (even though transcripts of cyanobacterial nitrogen fixation genes from two different pathways were detected in the cyanolichen thallus), implying that the role of nitrogen fixation in the maintenance of chlorolichens might have previously been overstated. Additionally, bacterial polyol dehydrogenases were found to be expressed in chlorolichen thalli (along with fungal polyol dehydrogenases and kinases from the mycobiont), suggesting the potential for bacteria to begin the process of breaking down the fixed carbon compounds secreted by the photobiont for easier metabolism by the mycobiont. This first look at the group of functional genes expressed at the level of transcription provides initial insights into the symbiotic network of interacting genes within the lichen microbiome.
Item Open Access A Semi-Supervised Predictive Model to Link Regulatory Regions to Their Target Genes (2015) Hafez, Dina Mohamed
Next generation sequencing technologies have provided us with a wealth of data profiling a diverse range of biological processes. In an effort to better understand the process of gene regulation, two predictive machine learning models specifically tailored for analyzing gene transcription and polyadenylation are presented.
Transcriptional enhancers are specific DNA sequences that act as "information integration hubs" to confer regulatory requirements on a given cell. These non-coding DNA sequences can regulate genes from long distances, or across chromosomes, and their relationships with their target genes are not limited to one-to-one. With thousands of putative enhancers and fewer than 14,000 protein-coding genes, detecting enhancer-gene pairs becomes a very complex machine learning and data analysis challenge.
In order to predict these specific sequences and link them to the genes they regulate, we developed McEnhancer. Using DNase I sensitivity data and annotated in-situ hybridization gene expression clusters, McEnhancer builds interpolated Markov models to learn enriched sequence content of known enhancer-gene pairs and predicts unknown interactions in a semi-supervised learning algorithm. Classification of predicted relationships was 73-98% accurate for gene sets with varying levels of initial known examples. Predicted interactions showed a great overlap when compared to Hi-C identified interactions. Enrichment of known functionally related TF binding motifs, enhancer-associated histone modification marks, along with corresponding developmental time point was highly evident.
On the other hand, pre-mRNA cleavage and polyadenylation is an essential step for 3'-end maturation and subsequent stability and degradation of mRNAs. This process is highly controlled by cis-regulatory elements surrounding the cleavage site (polyA site), which are frequently constrained by sequence content and position. More than 50% of human transcripts have multiple functional polyA sites, and the specific use of alternative polyA sites (APA) results in isoforms with variable 3'-UTRs, thus potentially affecting gene regulation. Elucidating the regulatory mechanisms underlying differential polyA preferences in multiple cell types has been hindered by the lack of appropriate tests for determining APAs with significant differences across multiple libraries.
We specified a linear effects regression model to identify tissue-specific biases indicating regulated APA; the significance of differences between tissue types was assessed by an appropriately designed permutation test. This combination allowed us to identify highly specific subsets of APA events in the individual tissue types. Predictive kernel-based SVM models successfully distinguished constitutive polyA sites from a biologically relevant background (auROC = 99.6%), as well as tissue-specific regulated sets from each other. The main cis-regulatory elements described for polyadenylation were found to be a strong, and highly informative, hallmark for constitutive sites only. Tissue-specific regulated sites were found to contain other regulatory motifs, with the canonical PAS signal being nearly absent at brain-specific sites. We applied this model to data on SRp20, an RNA-binding protein that might be involved in oncogene activation, and obtained interesting insights.
Together, these two models contribute to the understanding of enhancers and the key role they play in regulating tissue-specific expression patterns during development, as well as provide a better understanding of the diversity of post-transcriptional gene regulation in multiple tissue types.
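The abstract above mentions assessing differences between tissue types with a permutation test but gives no details of its design. The following Python sketch shows only the generic idea, a two-group label-permutation test on a difference in means. The function name `permutation_pvalue`, the test statistic, and the usage values (hypothetical polyA site usage fractions in two tissues) are assumptions and are not taken from the thesis.

```python
import numpy as np


def permutation_pvalue(group_a, group_b, n_permutations=10000, seed=0):
    """Two-sided permutation p-value for a difference in group means."""
    rng = np.random.default_rng(seed)
    a = np.asarray(group_a, dtype=float)
    b = np.asarray(group_b, dtype=float)
    observed = abs(a.mean() - b.mean())
    pooled = np.concatenate([a, b])

    count = 0
    for _ in range(n_permutations):
        # Shuffle the tissue labels by permuting the pooled values.
        perm = rng.permutation(pooled)
        stat = abs(perm[:a.size].mean() - perm[a.size:].mean())
        # Small tolerance guards against floating-point ties.
        if stat >= observed - 1e-12:
            count += 1

    # Add-one correction keeps the estimated p-value strictly positive.
    return (count + 1) / (n_permutations + 1)


# Hypothetical usage fractions of one polyA site in two tissues.
brain = [0.82, 0.79, 0.85, 0.80, 0.83, 0.81]
liver = [0.41, 0.45, 0.39, 0.44, 0.40, 0.43]
print(permutation_pvalue(brain, liver))
```

A permutation test of this kind makes no distributional assumption about the usage values, which is one plausible reason to prefer it when comparing sequencing libraries of varying depth. The test described in the abstract was presumably tailored to the regression model and may differ substantially from this sketch.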
Item Open Access A Systems Level Analysis of Temperature-Dependent Sex Determination in the Red-Eared Slider Turtle Trachemys Scripta Elegans. (2016) Czerwinski, Michael James
Sex determination is a critical biological process for all sexually reproducing animals. Despite its significance, evolution has provided a vast array of mechanisms by which sexual phenotype is determined and elaborated even within amniote vertebrates. The most prevalent systems of sex determination in this clade are genetic and temperature dependent sex determination. These two systems are sometimes consistent within large groups of species, such as the mammals, which nearly ubiquitously utilize XY genetic sex determination, or they can be much more mixed, as in reptiles that use genetic or temperature dependent systems and even both simultaneously. The turtles are a particularly diverse group in the way they determine sex, with multiple different genetic and temperature based systems having been described. We investigated the nature of the temperature based sex determination system in Trachemys scripta elegans to ascertain whether it behaved as a purely temperature based system or if some other global source of sex determining information might be apparent within thermal regions insufficient to fully induce male or female development. These experiments found that sex determination in this species is much more complex and early acting than previously thought and that each gonad within an individual has the same sexual fate established firmly enough that it can persist even without further communication between them. We established a best practice for the assembly and annotation of de novo whole transcriptomes from T. scripta RNA-seq and utilized the technique to quantify the gene regulatory events that occur across the thermal sensitive period.
Evidence is entirely lacking on the resolution of TSD when eggs are incubated at the pivotal temperature at which equal numbers of males and females are produced. We have produced a timecourse data set that allowed for the elucidation of the gene expression events that occur at both the MPT and FPT over the course of the thermal sensitive period. Our data suggest that early establishment of a male or female fate is possible when temperature is sufficiently strong, as at MPT and FPT. We see a strong pattern of mutually antagonistic gene expression emerging early and expanding over time through the end of the period of gonad plasticity. In addition, we have identified a strong pattern of differential expression in the early embryo at stages prior to the formation of the gonad. Even without the known systemic signaling attributed to sex hormones emanating from the gonad, the early embryo has a clear male and female gene expression pattern. We discuss how this early potential masculinization or feminization of the embryo may indicate that the influence of temperature may extend beyond the determination of gonadal sex or even metabolic adjustments and how this challenges the well-defined paradigm in which gonadal sex determines peripheral sexual characteristics.
Item Open Access Application of Phylogenetic Analysis in Cancer Evolution (2018) Ding, Yuantong
Cancer is a major threat to human health and results in 1 in 6 deaths globally. Despite an extraordinary amount of effort and money spent, eradication or control of advanced disease has not yet been achieved. Understanding cancer from an evolutionary point of view may provide new insight into more effective control and treatment of the disease. Cancer as a disease of dynamic, stochastic somatic genomic evolution was first described by Nowell in 1976, and since then researchers have identified clonal expansions and genetic heterogeneity within many different types of neoplasms. The advancement in sequencing technology, especially single-cell sequencing, has opened up a new frontier by bringing the study of genomes to the cellular level. Phylogenetic analysis, which is a powerful tool for inferring evolutionary relationships among various biological species or other entities based upon similarities and differences in their physical or genetic characteristics, has recently been applied to cancer studies and is starting to show promise in deciphering cancer evolution. However, new challenges have also arisen in experimental design, methodology and interpretation regarding the phylogeny of cancer cells. The overarching theme of this dissertation is to bring phylogenetic analysis to the context of cancer evolution. By using in silico simulations, I show the advantages and disadvantages of different sampling designs for phylogenetic analysis. Although bulk sequencing can hardly recover the topology of phylogenetic trees, I developed a new method to infer sub-clone spatial distribution utilizing phased haplotypes from bulk sequencing. Lastly, I demonstrate the usage of phylogenetic analysis in breast cancer with multi-regional bulk sequencing and lung cancer with single cell sequencing.
Item Open Access Bayesian meta-analysis models for heterogeneous genomics data (2013) Zheng, Lingling
The accumulation of high-throughput data from vast sources has drawn much attention to developing methods for extracting meaningful information out of the massive data. More interesting questions arise from how to combine the disparate information, which goes beyond modeling sparsity and dimension reduction. This dissertation focuses on the innovations in the area of heterogeneous data integration.
Chapter 1 contextualizes this dissertation by introducing different aspects of meta-analysis and model frameworks for high-dimensional genomic data.
Chapter 2 introduces a novel technique, joint Bayesian sparse factor analysis model, to vertically integrate multi-dimensional genomic data from different platforms.
Chapter 3 extends the above model to a nonparametric Bayes formulation. It directly infers the number of factors using a model-based approach.
On the other hand, chapter 4 deals with horizontal integration of diverse gene expression data; the model infers pathway activities across various experimental conditions.
All the methods mentioned above are demonstrated in both simulation studies and real data applications in chapters 2-4.
Finally, chapter 5 summarizes the dissertation and discusses future directions.
Item Open Access Bayesian Modeling for Identifying Selection in B cell Maturation (2023) Tang, Tengjie
This thesis focuses on modeling the selection effects on B cell antibody mutations to identify amino acids under strong selection. Site-wise selection coefficients are parameterized by the fitnesses of amino acids. First, we conduct simulation studies to evaluate the accuracy of the Monte Carlo p-value approach for identifying selection for specific amino acid/location combinations. Then, we adopt Bayesian methods to infer location-specific fitness parameters for each amino acid. In particular, we propose the use of a spike-and-slab prior and implement Markov chain Monte Carlo (MCMC) algorithms for posterior sampling. Further simulation studies are conducted to evaluate the performance of the proposed Bayesian methods in inferring fitness parameters and identifying strong selection. The results demonstrate the reliable inference and detection performance of the proposed Bayesian methods. Finally, an example using real antibody sequences is provided. This work can help identify important early mutations in B cell antibodies, which is crucial for developing an effective HIV vaccine.
Item Open Access Bayesian modeling of microbial physiology (2017) Tonner, Peter
Microbial population growth measurements are widespread in the study of microorganisms, providing insight into areas including genetics, physiology, and engineering. The most common models of microbial population growth data are parametric, and are derived from specific assumptions about the underlying growth process. While useful in cases where these assumptions are valid, these models are inadequate in many cases typically found in microbial growth studies, including the presence of significant population death and the presence of multiple growth phases (e.g. diauxie). Here, we explore the use of Gaussian processes, a Bayesian non-parametric model, for microbial population growth. We first develop a general hypothesis test using Gaussian process regression and false-discovery rate corrected Bayes factor scores. We then explore a fully Bayesian model with Gaussian process priors that can capture the latent growth processes of many population measurements under a single model. Finally, we develop a hierarchical Bayesian model with GP priors in order to capture random effects in microbial population growth data.
Item Open Access Bayesian Multivariate Count Models for the Analysis of Microbiome Studies (2019) Silverman, Justin David
Advances in high-throughput DNA sequencing allow for rapid and affordable surveys of thousands of bacterial taxa across thousands of samples. The exploding availability of sequencing data has poised microbiota research to advance our understanding of fields as diverse as ecology, evolution, medicine, and agriculture. Yet, while microbiota data is now ubiquitous, methods for the analysis of such data remain underdeveloped. This gap reflects the challenge of analyzing sparse high-dimensional count data that contains compositional (relative abundance) information. To address these challenges this dissertation introduces a number of tools for Bayesian inference applied to microbiome data. A central theme throughout this work is the use of multinomial logistic-normal models which are found to concisely address these challenges. In particular, the connection between the logistic-normal distribution and the Aitchison geometry of the simplex is commonly used to develop interpretable tools for the analysis of microbiome data.
The structure of this dissertation is as follows. Chapter 1 introduces key challenges in the analysis of microbiome data. Chapter 2 introduces a novel log-ratio transform between the simplex and real space to enable the development of statistical tools for compositional data with phylogenetic structure. Chapter 3 introduces a multinomial logistic-normal generalized dynamic linear modelling framework for analysis of microbiome time-series data. Chapter 4 explores the analysis of zero values in sequence count data from a stochastic process perspective and demonstrates that zero-inflated models often produce counter-intuitive results in this regime. Finally, Chapter 5 introduces the theory of Marginally Latent Matrix-T Processes as a means of developing efficient, accurate inference for a large class of multinomial logistic-normal models, including linear regression, non-linear regression, and dynamic linear models. Notably, the inference schemes developed in Chapter 5 are found to often be orders of magnitude faster than Hamiltonian Monte Carlo without sacrificing accuracy in point estimation or uncertainty quantification.
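The novel log-ratio transform of Chapter 2 is not specified in this abstract and is not reproduced here. As a simpler stand-in, the following Python sketch shows the standard centered log-ratio (CLR) transform from Aitchison's compositional data analysis, which maps relative abundances on the simplex to real-valued coordinates. The function name `clr`, the pseudocount used to handle zero counts, and the example counts are assumptions for illustration.

```python
import numpy as np


def clr(counts, pseudocount=0.5):
    """Centered log-ratio transform of count vectors (one sample per row)."""
    x = np.asarray(counts, dtype=float) + pseudocount  # assumed zero handling
    proportions = x / x.sum(axis=-1, keepdims=True)    # close to the simplex
    log_p = np.log(proportions)
    # Subtracting the mean log is dividing by the geometric mean.
    return log_p - log_p.mean(axis=-1, keepdims=True)


# Hypothetical taxon counts for two samples.
samples = np.array([[10, 20, 70],
                    [0, 5, 95]])
print(clr(samples))
```

Each transformed row sums to zero, and without the pseudocount the result is unchanged when a sample's counts are rescaled, which reflects the point that sequence counts carry only relative abundance information.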
Item Open Access Bayesian Statistical Models of Cell-Cycle Progression at Single-Cell and Population Levels (2014) Mayhew, Michael Benjamin
Cell division is a biological process fundamental to all life. One aspect of the process that is still under investigation is whether or not cells in a lineage are correlated in their cell-cycle progression. Data on cell-cycle progression is typically acquired either in lineages of single cells or in synchronized cell populations, and each source of data offers complementary information on cell division. To formally assess dependence in cell-cycle progression, I develop a hierarchical statistical model of single-cell measurements and extend a previously proposed model of population cell division in the budding yeast, Saccharomyces cerevisiae. Both models capture correlation and cell-to-cell heterogeneity in cell-cycle progression, and parameter inference is carried out in a fully Bayesian manner. The single-cell model is fit to three published time-lapse microscopy datasets and the population-based model is fit to simulated data for which the true model is known. Based on posterior inferences and formal model comparisons, the single-cell analysis demonstrates that budding yeast mother and daughter cells do not appear to correlate in their cell-cycle progression in two of the three experimental settings. In contrast, mother cells grown in a less preferred sugar source, glycerol/ethanol, did correlate in their rate of cell division in two successive cell cycles. Population model fitting to simulated data suggested that, under typical synchrony experimental conditions, population-based measurements of the cell-cycle were not informative for correlation in cell-cycle progression or heterogeneity in daughter-specific G1 phase progression.
Item Open Access Bioinformatics and Molecular Approaches for the Construction of Biological Artificial Cartilage (2018) Huynh, Nguyen Phuong Thao
Osteoarthritis (OA) is one of the leading causes of disability in the United States, afflicting over 27 million Americans and imposing an economic burden of more than $128 billion each year (1, 2). OA is characterized by progressive degeneration of articular cartilage together with sub-chondral bone remodeling and synovial joint inflammation. Currently, OA treatments are limited and inadequate to restore the joint to its full functionality.
Over the years, progress has been made in creating biologic cartilage substitutes. However, the repair of degenerated cartilage remains challenging due to its complex architecture and limited capability to integrate with surrounding tissues. Hence, there exists a need to create not only functional chondral constructs, but functional osteochondral constructs, which could potentially enhance affixing properties of cartilage implants utilizing the underlying bone. Furthermore, the molecular mechanisms driving chondrogenesis are still not fully understood. Therefore, detailed transcriptomic profiling would bring forth the progression of not only genes, but gene entities and networks that orchestrate this process.
Bone-marrow derived mesenchymal stem cells (MSCs) are routinely utilized to create cartilage constructs in vitro for the study of chondrogenesis. In this work, we set out to examine the underlying mechanisms of these cells, as well as the intricate gene correlation networks over the time course of lineage development. We first asked the question of how transforming growth factors are determining MSC differentiation, and subsequently utilized genetic engineering to manipulate this pathway to create an osteochondral construct. Next, we performed high-throughput next-generation sequencing to profile the dynamics of MSC transcriptomes over the time course of chondrogenesis. Bioinformatics analyses of these big data have yielded a multitude of information: the chondrogenic functional module, the associated gene ontologies, and finally the elucidation of GRASLND and its crucial function in chondrogenesis. We extended our results with a detailed molecular characterization of GRASLND and its underlying mechanisms. We showed that GRASLND could enhance chondrogenesis, and thus proposed its therapeutic use in cartilage tissue engineering as well as in the treatment of OA.
Item Open Access Characterization of Gene-by-Age Interaction and Gene-by-Gene Interaction In Coronary Artery Disease (2012) Zhao, Yi
The success of genome-wide association studies (GWAS) has been limited by missing heritability and lack of biological relevance of identified variants. We sought to address these issues by characterizing interaction among genotypes and environment using case-control samples enrolled at Duke University Medical Center. First, we studied the impact of age on coronary artery disease (CAD). Gene-by-age (GxAGE) interactions were tested at genome-wide scale, along with genes' marginal effects in age-stratified groups. Based on the interaction model, age acts as a modifier of the age-CAD relationship. SNPs associated with CAD in both young and old demonstrate consistency in effect sizes and directions. In spite of these SNPs, vastly different CAD associated genes were discovered across age and race groups, suggesting age-dependent mechanisms of CAD onset. Second, we explored gene-by-gene interaction (GxG) using a statistical model and compared results to biological evidence. Specifically, we investigated GATA2 as a candidate gene transcription factor, and modeled the interaction with genome-wide SNPs. The genetic effects at interacting loci were modified by GATA2 genotype. Without taking GATA2 variants into account, no marginal main effects were detected. Open access ChIP-seq data was available for comparison with the statistical model, and to relate GWAS findings with biological mechanisms. The agreement between the statistical and biological models was very limited.
Item Open Access Characterization of Influenza A Virus Infection through Analysis of Intrahost Viral Evolution and Within-host Infection Dynamics (2016) Sobel Leonard, Ashley Elizabeth
Influenza A virus is a major source of morbidity and mortality, annually resulting in over 9000 deaths in the United States alone. As a segmented RNA virus, influenza has a high mutation rate, facilitating its evolution to overcome cross protective immunity through natural selection and adapt to new host species or sources of evolutionary pressure through reassortment. The high viral mutation rate also means that these processes affect not only evolution at the population level, but also at the intrahost level. While these processes have been well characterized for population-level viral evolution, viral evolution within a single host is far less well defined. In this dissertation, I characterize influenza infection at the intrahost level with respect to viral evolution and infection dynamics. In the second chapter, I critically evaluate methods for estimating the transmission bottleneck size for influenza A virus from viral sequencing data. The transmission bottleneck describes the infecting population size, a determinant for the level of genetic diversity present at the onset of infection. I show current methods may be biased, both by the criteria used to identify sequencing variants and the presence of demographic stochasticity. In response to these biases, I introduce a new method that (1) corrects for differences in variant calling criteria and (2) accommodates demographic stochasticity. Chapters 3-5 are based on data collected from an existing human challenge study with influenza A virus. In this challenge study, volunteers were experimentally infected with a heterogeneous viral inoculum that had adapted to the conditions in which it had been generated.
In chapter 3, I show that transmission was governed by a selective bottleneck and that subsequent intrahost viral evolution was dominated by purifying selection. In chapter 4, I further characterize the observed intrahost viral evolution through the reconstruction of viral haplotypes to evaluate different models of selection. These models differed by the level at which selection was acting, whether selection is focused on individual loci, multiple loci within a single gene segment, or across gene segments. Model selection favored the third model, wherein selection acted across gene segments, thereby establishing that the effective viral reassortment rate was limited in these subjects. In chapter 5, I develop a mathematical model for within-host influenza infection linking viral replication and the host immune response with the development of disease symptoms. I fit this model to experimental data collected from the challenge study. Analysis of the model fits indicated that much of the heterogeneity in the data between subjects could be explained by interindividual variation in viral infectivity. This finding echoed the results of chapters 3 and 4, that there were quantifiable differences in the infecting viral populations between the study subjects. Taken together, these observations suggest that intrahost viral genetics may underlie the differences between the subjects’ response to infection.
Item Open Access Characterizing Genetic Drivers of Lymphoma through High-Throughput Sequencing (2016) Zhang, Jenny
The advent of next-generation sequencing, now nearing a decade in age, has enabled, among other capabilities, measurement of genome-wide sequence features at unprecedented scale and resolution.
In this dissertation, I describe work to understand the genetic underpinnings of non-Hodgkin’s lymphoma through exploration of the epigenetics of its cell of origin, initial characterization and interpretation of driver mutations, and finally, a larger-scale, population-level study that incorporates mutation interpretation with clinical outcome.
In the first research chapter, I describe genomic characteristics of lymphomas through the lens of their cells of origin. Just as many other cancers, such as breast cancer or lung cancer, are categorized based on their cell of origin, lymphoma subtypes can be examined through the context of their normal B Cells of origin, Naïve, Germinal Center, and post-Germinal Center. By applying integrative analysis of the epigenetics of normal B Cells of origin through chromatin-immunoprecipitation sequencing, we find that differences in normal B Cell subtypes are reflected in the mutational landscapes of the cancers that arise from them, namely Mantle Cell, Burkitt, and Diffuse Large B-Cell Lymphoma.
In the next research chapter, I describe our first endeavor into understanding the genetic heterogeneity of Diffuse Large B Cell Lymphoma, the most common form of non-Hodgkin’s lymphoma, which affects 100,000 patients in the world. Through whole-genome sequencing of 1 case as well as whole-exome sequencing of 94 cases, we characterize the most recurrent genetic features of DLBCL and lay the groundwork for a larger study.
In the last research chapter, I describe work to characterize and interpret the whole exomes of 1001 cases of DLBCL in the largest single-cancer study to date. This highly-powered study enabled sub-gene, gene-level, and gene-network level understanding of driver mutations within DLBCL. Moreover, matched genomic and clinical data enabled the connection of these driver mutations to clinical features such as treatment response or overall survival. As sequencing costs continue to drop, whole-exome sequencing will become a routine clinical assay, and another diagnostic dimension in addition to existing methods such as histology. However, to unlock the full utility of sequencing data, we must be able to interpret it. This study undertakes a first step in developing the understanding necessary to uncover the genomic signals of DLBCL hidden within its exomes. However, beyond the scope of this one disease, the experimental and analytical methods can be readily applied to other cancer sequencing studies.
Thus, this dissertation leverages next-generation sequencing analysis to understand the genetic underpinnings of lymphoma, both by examining its normal cells of origin and through a large-scale study to sensitively identify recurrently mutated genes and their relationship to clinical outcome.
Item Open Access Chromatin Accessibility Dynamics Underlying Development and Disease (2015) Frank, Christopher L.
Despite a largely static DNA sequence, our genomes are incredibly malleable. Comparative studies of chromatin features between different cell types, tissues, and species have revealed tremendous differences in how the genome is accessed, transcribed, and replicated. However, how the dynamics of chromatin accessibility contribute to development, environmental response, and disease status has only begun to be appreciated. In this work we identified chromatin accessibility changes by DNase-seq in three diverse processes: in granule neurons of the developing cerebellum, with intestinal epithelial cells in the absence of a normal microbiota, and with myelogenous leukemia cells in response to histone deacetylase inhibitor treatments. In all cases, we coupled these analyses with RNA-seq assays to identify concurrent transcriptional changes. By mapping the changes to these genome-wide signals we defined the contribution of local chromatin structure to the transcriptional programs underlying these processes, and improved our understanding of their relation to other chromatin changes like histone modifications. Furthermore, we demonstrated use of the strongest accessibility changes to identify transcription factors critical for these processes by finding enrichment of their binding motifs. For a few of these key factors, depletion or overexpression of the protein was sufficient to regulate the expression of predicted target genes or exert limited chromatin accessibility changes, demonstrating the functional significance of these proteins in these processes. Together these studies have informed our understanding of the role chromatin accessibility changes play in development and environmental responses while also proving their utility for key regulator identification.
Item Open Access Chromatin Determinants of the Eukaryotic DNA Replication Program (2011) Eaton, Matthew Lucas
The accurate and timely replication of eukaryotic DNA during S-phase is of critical importance for the cell and for the inheritance of genetic information. Missteps in the replication program can activate cell cycle checkpoints or, worse, trigger the genomic instability and aneuploidy associated with diseases such as cancer. Eukaryotic DNA replication initiates asynchronously from hundreds to tens of thousands of replication origins spread across the genome. The origins are acted upon independently, but patterns emerge in the form of large-scale replication timing domains. Each of these origins must be localized, and the activation time determined by a system of signals that, though they have yet to be fully understood, are not dependent on the primary DNA sequence. This regulation of DNA replication has been shown to be extremely plastic, changing to fit the needs of cells in development or affected by replication stress.
We have investigated the role of chromatin in specifying the eukaryotic DNA replication program. Chromatin elements, including histone variants, histone modifications and nucleosome positioning, are an attractive candidate for DNA replication control, as they are not specified fully by sequence, and they can be modified to fit the unique needs of a cell without altering the DNA template. The origin recognition complex (ORC) specifies replication origin location by binding the DNA of origins. The S. cerevisiae ORC recognizes the ARS (autonomously replicating sequence) consensus sequence (ACS), but only a subset of potential genomic sites are bound, suggesting other chromosomal features influence ORC binding. Using high-throughput sequencing to map ORC binding and nucleosome positioning, we show that yeast origins are characterized by an asymmetric pattern of positioned nucleosomes flanking the ACS. The origin sequences are sufficient to maintain a nucleosome-free origin; however, ORC is required for the precise positioning of nucleosomes flanking the origin. These findings identify local nucleosomes as an important determinant for origin selection and function. Next, we describe the D. melanogaster replication program in the context of the chromatin and transcription landscape for multiple cell lines using data generated by the modENCODE consortium. We find that while the cell lines exhibit similar replication programs, there are numerous cell line-specific differences that correlate with changes in the chromatin architecture. We identify chromatin features that are associated with replication timing, early origin usage, and ORC binding. Primary sequence, activating chromatin marks, and DNA-binding proteins (including chromatin remodelers) contribute in an additive manner to specify ORC-binding sites. We also generate accurate and predictive models from the chromatin data to describe origin usage and strength between cell lines. 
Multiple activating chromatin modifications contribute to the function and relative strength of replication origins, suggesting that the chromatin environment does not regulate origins of replication as a simple binary switch, but rather acts as a tunable rheostat to regulate replication initiation events.
Taken together, our data and analyses imply that the chromatin contains sufficient information to direct the DNA replication program.