Browsing by Subject "Biology, Bioinformatics"
Item Open Access A Bayesian Hierarchical Model with SNP-level Functional Priors Applied to a Pathway-wide Association Study. (2010) Huang, Weizi
Tremendous effort has been put into studying the etiology of complex diseases, including breast cancer, type 2 diabetes, cardiovascular diseases, and prostate cancer. Despite the large number of reported disease-associated loci, few have been replicated, and some true associations do not rank among the most significant loci reported. We built a Bayesian hierarchical model that incorporates SNP-level functional data to help identify associated SNPs in pathway-wide association studies.
We applied the model to an association study of serous invasive ovarian cancer based on the DNA repair and apoptosis pathways. We found that, using our model, blocks of SNPs located in regions enriched for missense SNPs or gene inversions were more likely to be identified as association candidates.
Item Open Access Analysis and Error Correction in Structures of Macromolecular Interiors and Interfaces (2009) Headd, Jeffrey John
As of late 2009, the Protein Data Bank (PDB) has grown to contain over 70,000 models. This recent increase in the amount of structural data allows for more extensive explication of the governing principles of macromolecular folding and association to complement traditional studies focused on a single molecule or complex. PDB-wide characterization of structural features yields insights that are useful in prediction and validation of the 3D structure of macromolecules and their complexes. Here, these insights lead to a deeper understanding of protein--protein interfaces, full-atom critical assessment of increasingly more accurate structure predictions, a better defined library of RNA backbone conformers for validation and building 3D models, and knowledge-based automatic correction of errors in protein sidechain rotamers.
My study of protein--protein interfaces identifies amino acid pairing preferences in a set of 146 transient interfaces. Using a geometric interface surface definition devoid of arbitrary cutoffs common to previous studies of interface composition, I calculate inter- and intrachain amino acid pairing preferences. As expected, salt-bridges and hydrophobic patches are prevalent, but likelihood correction of observed pairing frequencies reveals some surprising pairing preferences, such as Cys-His interchain pairs and Met-Met intrachain pairs. To complement my statistical observations, I introduce a 2D visualization of the 3D interface surface that can display a variety of interface characteristics, including residue type, atomic distance and backbone/sidechain composition.
My study of protein interiors finds that 3D structure prediction from sequence (as part of the CASP experiment) is very close to full-atom accuracy. Validation of structure prediction should therefore consider all atom positions instead of the traditional Calpha-only evaluation. I introduce six new full-model quality criteria to assess the accuracy of CASP predictions, which demonstrate that groups who use structural knowledge culled from the PDB to inform their prediction protocols produce the most accurate results.
My study of RNA backbone introduces a set of rotamer-like "suite" conformers. Initially hand-identified by the Richardson laboratory, these 7D conformers represent backbone segments that are found to be genuine and favorable. X-ray crystallographers can use backbone conformers for model building in often poor backbone density and in validation after refinement. Increasing amounts of high quality RNA data allow for improved conformer identification, but also complicate hand-curation. I demonstrate that affinity propagation successfully differentiates between two related but distinct suite conformers, and is a useful tool for automated conformer clustering.
My study of protein sidechain rotamers in X-ray structures identifies a class of systematic errors that results in sidechains misfit by approximately 180 degrees. I introduce Autofix, a method for automated detection and correction of such errors. Autofix corrects over 40% of errors for Leu, Thr, and Val residues, and a significant number of Arg residues. On average, Autofix made four corrections per PDB file in 945 X-ray structures. Autofix will be implemented into MolProbity and PHENIX for easy integration into X-ray crystallography workflows.
Item Open Access Automated Microscopy and High Throughput Image Analysis in Arabidopsis and Drosophila (2009) Mace, Daniel L.
Development of a single cell into an adult organism is accomplished through an elaborate and complex cascade of spatiotemporal gene expression. While methods exist for capturing spatiotemporal expression patterns---in situ hybridization, reporter constructs, fluorescent tags---these methods have been highly laborious, and results are frequently assessed by subjective qualitative comparisons. To address these issues, methods must be developed for automating the capture of images, as well as for the normalization and quantification of the resulting data. In this thesis, I design computational approaches for high throughput image analysis which can be grouped into three main areas. First, I develop methods for the capture of high resolution images from high throughput platforms. In addition to the informatics aspect of this problem, I also devise a novel multiscale probabilistic model that allows us to identify and segment objects in an automated fashion. Second, high resolution images must be registered and normalized to a common frame of reference for cross image comparisons. To address these issues, I implement approaches for image registration using statistical shape models and non-rigid registration. Lastly, I validate the spatial expression data obtained from microscopy images against other known spatial expression methods, and develop methods for comparing spatial expression patterns and calculating the significance of their differences. I demonstrate these methods on two model developmental organisms: Arabidopsis and Drosophila.
Item Open Access Characterization of Gene Interaction and Assessment of Ld Matrix Measures for the Analysis of Biological Pathway Association (2009) Crosslin, David Russell
Leukotrienes are arachidonic acid derivatives long known for their inflammatory properties and their involvement with a number of human diseases, most notably asthma. Recently, leukotriene-based inflammation has also been implicated in atherosclerosis: ALOX5AP and LTA4H, two genes in the leukotriene biosynthesis pathway, have been associated with various cardiovascular disease (CVD) phenotypes. To assess the role of the leukotriene pathway in CVD pathogenesis, we performed genetic association studies of ALOX5AP and LTA4H in a non-familial data set of early onset coronary artery disease. Our results support a modest role for the leukotriene pathway in atherosclerosis pathogenesis, reveal important genomic interactions within the pathway, and suggest the importance of using pathway-based modeling for evaluating the genomics of atherosclerosis susceptibility. Motivated by this need, we investigated the statistical properties of a class of matrix-based statistics to assess epistasis. We simulated multiple two-variant disease models with haplotypes to gain an understanding of pathway interactions in terms of correlation patterns. Our goal was to detect an interaction between multiple disease-causing variants by means of their linkage disequilibrium (LD) patterns with other haplotype markers. The simulated models can be summarized into three categories: (1) no epistasis in the presence of marginal effects and LD; (2) epistasis in the presence of LD and no marginal effects; and (3) epistasis in the presence of marginal effects and LD. We then assessed previously introduced single-gene methods that compare whole matrices of Single Nucleotide Polymorphism (SNP) LD between two samples.
These methods include comparing two sets of principal components, a sum-of-squared-differences statistic comparing pairwise LD, and a contrast test that controls for background LD. We also considered a partial least-squares (PLS) approach for modeling gene-gene interactions. Our results indicate that these measures can be used to assess epistasis as well as marginal effects under certain disease models. Understanding and quantifying whole-gene variation and association to disease using multiple SNPs remains a difficult task. Providing a single statistical measure per gene will facilitate combining multiple types of genomic data at the gene level and will serve as an alternative approach to assess epistasis in genome-wide association studies. The matrix-based measures can also be used in pathway ascertainment tools that require gene-level scores.
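The sum-of-squared-differences idea mentioned in this abstract can be illustrated with a minimal sketch. It is not the thesis's implementation: it assumes genotypes are coded as allele counts (0/1/2), uses the squared Pearson correlation between genotype columns as a stand-in for pairwise LD, and the function names `ld_matrix` and `ld_ssd` are hypothetical.

```python
from itertools import combinations
from math import sqrt


def pearson_r(x, y):
    """Pearson correlation of two equal-length sequences (0.0 if either is constant)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    return sxy / sqrt(sxx * syy)


def ld_matrix(genotypes):
    """Pairwise r^2 between SNP columns.

    genotypes: list of samples, each a list of allele counts per SNP.
    Assumption: genotype correlation is used as a proxy for LD.
    """
    n_snps = len(genotypes[0])
    cols = [[sample[j] for sample in genotypes] for j in range(n_snps)]
    ld = [[1.0] * n_snps for _ in range(n_snps)]
    for i, j in combinations(range(n_snps), 2):
        ld[i][j] = ld[j][i] = pearson_r(cols[i], cols[j]) ** 2
    return ld


def ld_ssd(cases, controls):
    """Sum of squared differences between the two samples' LD matrices (upper triangle)."""
    a, b = ld_matrix(cases), ld_matrix(controls)
    return sum((a[i][j] - b[i][j]) ** 2 for i, j in combinations(range(len(a)), 2))
```

A larger value indicates that the LD structure of the gene differs more between the two samples; assessing significance (for example by permuting case/control labels) is left out of this sketch.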
Item Open Access Computational Methods to Study Diversification in Pathogens, and Invertebrate and Vertebrate Immune Systems (2010) Munshaw, Supriya Shaunak
Pathogens and host immune systems use strikingly similar methods of diversification. Mechanisms such as point mutations and recombination help pathogens escape the host immune system and similar mechanisms help the host immune system attack rapidly evolving pathogens. Understanding the interplay between pathogen and immune system evolution is crucial to effective drug and vaccine development. In this thesis we employ various computational methods to study diversification in a pathogen, an invertebrate and a vertebrate immune system.
First, we develop a technique for phylogenetic inference in the presence of recombination based on the principle of minimum description length, which assigns a cost (the description length) to each network topology given the observed sequence data. We show that the method performs well on simulated data and demonstrate its application on HIV env gene sequence data from 8 human subjects.
Next, we demonstrate via phylogenetic analysis that the evolution of repeats in an immune-related gene family in Strongylocentrotus purpuratus is the result of recombination and duplication and/or deletion. These results support the evidence suggesting that invertebrate immune systems are highly complex and may employ similar mechanisms for diversification as higher vertebrates.
Third, we develop a probabilistic model of the immunoglobulin (Ig) rearrangement process and a Bayesian method for estimating posterior probabilities for the comparison of multiple plausible rearrangements. We validate the resulting software, SoDA2, using various datasets; in all tests, SoDA2 performed better than other available software.
Finally, we characterize the somatic population genetics of the nucleotide sequences of >1000 recombinant Ig pairs derived from the blood of 5 acute HIV-1 infected (AHI) subjects. We found that the Ig genes from the 20 day AHI PC showed extraordinary clonal relatedness among themselves; a single clone comprised 52 members, with observed and inferred precursor antibodies specific for HIV-1 Env gp41. Antibodies from AHI patients show a decreased CDR3H length and an increased mutation frequency when compared to influenza vaccinated individuals. The high mutation frequency is coupled with a comparatively low synonymous to non-synonymous mutation ratio in the heavy chain. Our results may suggest the presence of positive antigenic selection in previously triggered non-HIV-1 memory B cells in AHI.
Taken together, the studies presented in this thesis provide methods to study diversification in pathogens, and invertebrate and vertebrate immune systems.
Item Open Access Computational Molecular Engineering Nucleic Acid Binding Proteins and Enzymes (2010) Reza, Faisal
Interactions between nucleic acid substrates and the proteins and enzymes that bind and catalyze them are ubiquitous and essential for reading, writing, replicating, repairing, and regulating the genomic code by the proteomic machinery. In this dissertation, computational molecular engineering furthered the elucidation of spatial-temporal interactions of natural nucleic acid binding proteins and enzymes and the creation of synthetic counterparts with structure-function interactions at predictive proficiency. We examined spatial-temporal interactions to study how natural proteins can process signals and substrates. The signals, propagated by spatial interactions between genes and proteins, can encode and decode information in the temporal domain. Natural proteins evolved through facilitating signaling, limiting crosstalk, and overcoming noise locally and globally. Findings indicate that fidelity and speed of frequency signal transmission in cellular noise was coordinated by a critical frequency, beyond which interactions may degrade or fail. The substrates, bound to their corresponding proteins, present structural information that is precisely recognized and acted upon in the spatial domain. Natural proteins evolved by coordinating substrate features with their own. Findings highlight the importance of accurate structural modeling. We explored structure-function interactions to study how synthetic proteins can complex with substrates. These complexes, composed of nucleic acid containing substrates and amino acid containing enzymes, can recognize and catalyze information in the spatial and temporal domains. Natural proteins evolved by balancing stability, solubility, substrate affinity, specificity, and catalytic activity.
Accurate computational modeling of mutants with desirable properties for nucleic acids while maintaining such balances extended molecular redesign approaches. Findings demonstrate that binding and catalyzing proteins redesigned by single-conformation and multiple-conformation approaches maintained this balance to function, often as well as or better than those found in nature. We enabled access to computational molecular engineering of these interactions through open-source practices. We examined the applications and issues of engineering nucleic acid binding proteins and enzymes for nanotechnology and therapeutics, and in their ethical, legal, and social dimensions. Findings suggest that such access and applications can make engineering biology more widely adopted, easier, more effective, and safer.
Item Open Access Computer Aided Detection of Masses in Breast Tomosynthesis Imaging Using Information Theory Principles (2008-09-18) Singh, Swatee
Breast cancer screening is currently performed by mammography, which is limited by overlying anatomy and dense breast tissue. Computer aided detection (CADe) systems can serve as a double reader to improve radiologist performance. Tomosynthesis is a limited-angle cone-beam x-ray imaging modality that is currently being investigated to overcome mammography's limitations. CADe systems will play a crucial role in enhancing workflow and performance for breast tomosynthesis.
The purpose of this work was to develop unique CADe algorithms for breast tomosynthesis reconstructed volumes. Unlike traditional CADe algorithms, which rely on segmentation followed by feature extraction, selection and merging, this dissertation instead adopts information theory principles, which are more robust. Information theory relies entirely on the statistical properties of an image and makes no assumptions about underlying distributions; it is thus advantageous for smaller datasets such as those currently used for all tomosynthesis CADe studies.
The proposed algorithm has two stages: (1) initial candidate generation of suspicious locations and (2) false positive reduction. Images were accrued from 250 human subjects. In the first stage, suspicious locations were isolated in the 25 projection images per subject acquired by the tomosynthesis system. Only these suspicious locations were reconstructed to yield 3D volumes of interest (VOIs). In the second stage, false positive reduction was done in three ways: (1) using only the central slice of the VOI containing the largest cross-section of the mass, (2) using the entire volume, and (3) making decisions on a per-slice basis and then combining those decisions using either a linear discriminant or decision fusion. A 92% sensitivity was achieved by all three approaches, with 4.4 FPs/volume for the first approach, 3.9 for the second, and 2.5 for the slice-by-slice algorithm using decision fusion.
We have therefore developed a novel CADe algorithm for breast tomosynthesis. The technique uses an information theory approach to achieve very high sensitivity for cancer detection while effectively minimizing false positives.
Item Open Access Detecting Changes in Alternative mRNA Processing From Microarray Expression Data (2010) Robinson, Timothy J.
Alternative mRNA processing can result in the generation of multiple, qualitatively different RNA transcripts from the same gene and is a powerful engine of complexity in higher organisms. Recent deep sequencing studies have indicated that essentially all human genes containing more than a single exon generate multiple RNA transcripts. Functional roles of alternative processing have been established in virtually all areas of biological regulation, particularly in development and cancer. Changes in alternative mRNA processing can now be detected from over a billion dollars' worth of conventional gene expression microarray data archived over the past 20 years using a program we created called SplicerAV. Application of SplicerAV to publicly available microarray data has granted new insights into previously existing studies of oncogene over-expression and clinical cancer prognosis.
Adaptation of SplicerAV to the new Affymetrix Human Exon arrays has resulted in the creation of SplicerEX, the first program that can automatically categorize microarray detected changes in alternative processing into biologically pertinent categories. We use SplicerEX's automatic event categorization to identify changes in global mRNA processing during B cell transformation and show that the conventional U133 platform is able to detect 3' located changes in mRNA processing five times more frequently than the Human Exon array.
Item Open Access Localized Correlation Analysis and Genetic Association with Cardiovascular Disease (2010) Ou, ChernHan
Variations in gene expression are potential risk factors for atherosclerosis, which is one of the most common forms of cardiovascular disease. We performed a localized Pearson correlation test in 372 individuals from seven datasets relevant to cardiovascular disease studies. The genomes of the samples were separated into 20Mb windows and correlation tests were performed locally in these windows. The localized Pearson correlation test found that chr3:115Mb–135Mb was tightly connected by a significantly high proportion of highly correlated pairs (P value = 0.0266 with Z-test). LSAMP, GATA2, MBD4, and other genes in the region were considered associated with cardiovascular disease because they were involved in highly correlated pairs. Furthermore, these genes were also associated with cardiovascular disease by having significantly high SNP odds ratios (P value < 0.1) between patients and controls in an independent Duke University Medical Center database. In addition, a permutation test was performed to demonstrate that chr3:115Mb–135Mb might underlie the regulation of cardiovascular disease. Finally, the localized Pearson correlation test also found some other regions that could be associated with cardiovascular disease.
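The windowed correlation procedure described in this abstract can be sketched as follows. This is only an illustration under stated assumptions: fixed, non-overlapping 20Mb windows aligned to position 0 (the reported chr3:115Mb–135Mb region suggests the thesis used a different window placement), a hypothetical correlation cutoff of 0.8 for "highly correlated", and a made-up function name `windowed_high_corr_fraction`.

```python
from collections import defaultdict
from itertools import combinations
from math import sqrt

WINDOW_BP = 20_000_000  # 20Mb window size, as stated in the abstract


def pearson_r(x, y):
    """Pearson correlation of two equal-length sequences (0.0 if either is constant)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    return sxy / sqrt(sxx * syy)


def windowed_high_corr_fraction(genes, threshold=0.8, window_bp=WINDOW_BP):
    """Fraction of gene pairs per window whose |r| meets the threshold.

    genes: iterable of (chrom, position_bp, expression_values_across_individuals).
    threshold: assumed cutoff for a "highly correlated" pair (not from the source).
    Returns {(chrom, window_start_bp): fraction}; windows with < 2 genes are skipped.
    """
    windows = defaultdict(list)
    for chrom, pos, expr in genes:
        windows[(chrom, (pos // window_bp) * window_bp)].append(expr)

    fractions = {}
    for key, profiles in windows.items():
        pairs = list(combinations(profiles, 2))
        if not pairs:
            continue
        high = sum(1 for x, y in pairs if abs(pearson_r(x, y)) >= threshold)
        fractions[key] = high / len(pairs)
    return fractions
```

Windows whose fraction is unusually high relative to the genome-wide distribution would then be candidates for a significance test such as the Z-test the abstract mentions; that step is not shown here.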
Item Open Access Mechanistic and Genetic Biases in Human Immunoglobulin Heavy Chain Development (2008-04-23) Volpe, Joseph M
Broadly neutralizing antibodies against HIV are rare; most patients never develop them at detectable levels. The discovery of four such antibodies therefore warrants research into their origins and their presumed unique characteristics. Such studies, however, require baseline knowledge about commonalities and biases affecting human immunoglobulin development. Obtaining that knowledge requires large sets of gene sequence data and the appropriate statistical techniques and tools. The Genbank repository provides a free and easily accessible source for such data. Several large datasets cumulatively comprising over 10,000 human Ig heavy chain genes were identified, downloaded, and carefully filtered. We then developed a special software tool called SoDA, which employs a unique dynamic programming algorithm to provide a statistical reconstruction of the events that led to a given antigen receptor gene. Once SoDA had been developed, tested, and peer-reviewed, we used it to provide initial data about each downloaded gene with respect to gene segment usage, n-nucleotide addition, CDR3 length, and mutation frequency, thereby establishing the most precise estimates currently available for human Ig heavy chain gene segment usage frequencies. We compared data from productive non-autoreactive Ig to non-productive Ig and found evidence for gene segment usage biases, D/J segment pairing preferences resulting from multiple sequential D-to-J recombination events, and biases in TdT action between the V-D and D-J. Further analysis of autoreactive Ig genes yielded evidence that n-nucleotide addition comes at a cost: the higher the ratio of n-nucleotides to germline-encoded nucleotides for a given CDR3 length, the greater the probability of autoreactivity. These results suggest that the germline gene segments have been selected for lack of autoreactivity.
It has previously been shown that human Ig gene segments have evolved efficient evolvability under somatic hypermutation. We have now extended these results, showing that Ig gene sequences are "tuned" to preferentially produce consequential mutations in the antigen-binding domains, and synonymous mutations in the framework regions. Together, these analyses provide new insights into the genetic and mechanistic biases shaping the human Ig repertoire.
Item Open Access Modeling Cancer Progression on the Pathway Level (2008-12-11) Edelman, Elena Jane
Over the past several decades, many genes have been discovered that govern important functions in the development of a variety of different cancers. However, biological insight from the list of genes is still limited and the underlying mechanisms that occur in the cell during tumorigenesis have not been well established. Studying cancer progression in terms of the oncogenic pathways that are responsible for specific actions that change normal cells into tumors is a means for bringing insight into these issues. The work presented here will uncover mechanisms that are occurring at the pathway level that first initiate tumor formation and then continue through cancer progression and finally metastasis. This knowledge will allow for drug treatment that is better targeted towards an individual.
Microarray technology has allowed for the collection of gene expression datasets from clinical cancer and other studies. These datasets can be used to study how expression levels of individual genes or groups of related genes are altered in individuals from different phenotypic groups. Statistical methods exist which assay pathway enrichment by phenotypic class but do not describe individual variation. In order to study this individual variation, we developed a formal statistical method called ASSESS which measures the enrichment of a gene set in each sample in an expression dataset.
As cancer advances through the stages of initiation, progression, and proliferation, multiple pathways experience disruptions at various times. However, much is still unknown about the particular pathways that show gene expression changes throughout tumorigenesis. Using gene expression datasets comprising individuals with tumors classified by location and stage, we applied ASSESS in order to study the data on the pathway level. We then utilized novel statistical methods to uncover the pathways that play a role in cancer progression and the order in which the pathways become perturbed.
These analyses can give a basis for how genetic disruptions serve to alter actions in specific cell types. The results may provide insight that will lead to treatments of existing tumors and prevention of incipient cancers from forming. Treatments for existing tumors will use multiple drugs to target the pathways that show an altered state of activity.
Item Open Access Modeling Multi-factor Binding of the Genome (2010) Wasson, Todd Steven
Hundreds of different factors adorn the eukaryotic genome, binding to it in large numbers. These DNA binding factors (DBFs) include nucleosomes, transcription factors (TFs), and other proteins and protein complexes, such as the origin recognition complex (ORC). DBFs compete with one another for binding along the genome, yet many current models of genome binding do not consider different types of DBFs together simultaneously. Additionally, binding is a stochastic process that results in a continuum of binding probabilities at any position along the genome, but many current models tend to consider positions as being either binding sites or not.
Here, we present a model that allows a multitude of DBFs, each at different concentrations, to compete with one another for binding sites along the genome. The result is an 'occupancy profile', a probabilistic description of the DNA occupancy of each factor at each position. We implement our model efficiently as the software package COMPETE. We demonstrate genome-wide and at specific loci how modeling nucleosome binding alters TF binding, and vice versa, and illustrate how factor concentration influences binding occupancy. Binding cooperativity between nearby TFs arises implicitly via mutual competition with nucleosomes. Our method applies not only to TFs, but also recapitulates known occupancy profiles of a well-studied replication origin with and without ORC binding.
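To make the idea of concentration-dependent competition concrete, here is a deliberately simplified sketch. It is not the COMPETE model: it assumes a single isolated site at thermodynamic equilibrium where each factor's statistical weight is concentration times affinity, and the function name `site_occupancy` and all numeric values are hypothetical.

```python
def site_occupancy(concentrations, affinities):
    """Equilibrium occupancy of one binding site by competing factors.

    concentrations: {factor_name: relative concentration}
    affinities: {factor_name: relative binding affinity at this site}
    Returns {factor_name: probability bound, "unbound": probability empty}.
    Assumption: weight of a factor = concentration * affinity; the empty
    site has weight 1, so probabilities are weights divided by their sum.
    """
    weights = {name: concentrations[name] * affinities[name] for name in concentrations}
    partition = 1.0 + sum(weights.values())  # 1.0 is the weight of the unbound state
    occupancy = {name: w / partition for name, w in weights.items()}
    occupancy["unbound"] = 1.0 / partition
    return occupancy


# Raising the nucleosome concentration lowers TF occupancy at the same site.
low = site_occupancy({"TF": 1.0, "nucleosome": 1.0}, {"TF": 2.0, "nucleosome": 1.0})
high = site_occupancy({"TF": 1.0, "nucleosome": 5.0}, {"TF": 2.0, "nucleosome": 1.0})
```

Even in this toy setting the output is a probability for every factor rather than a binary bound/unbound call, which is the sense in which the abstract speaks of an occupancy profile. A genome-wide model additionally has to account for factor footprints that span many positions and overlap with one another.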
We then develop a statistical framework for tuning our model concentrations to further improve its predictions. Importantly, this tuning optimizes with respect to actual biological data. We take steps to ensure that our tuned parameters are biologically plausible.
Finally, we discuss novel extensions and applications of our model, suggesting next steps in its development and deployment.
Item Open Access Protein-DNA Binding: Discovering Motifs and Distinguishing Direct from Indirect Interactions (2009) Gordan, Raluca Mihaela
The initiation of two major processes in the eukaryotic cell, gene transcription and DNA replication, is regulated largely through interactions between proteins or protein complexes and DNA. Although a lot is known about the interacting proteins and their role in regulating transcription and replication, the specific DNA binding motifs of many regulatory proteins and complexes are still to be determined. For this purpose, many computational tools for DNA motif discovery have been developed in the last two decades. These tools employ a variety of strategies, from exhaustive search to sampling techniques, with the hope of finding over-represented motifs in sets of co-regulated or co-bound sequences. Despite the variety of computational tools aimed at solving the problem of motif discovery, their ability to correctly detect known DNA motifs is still limited. The motifs are usually short and often degenerate, which makes them difficult to distinguish from genomic background. We believe the most efficient strategy for improving the performance of motif discovery is not to use increasingly complex computational and statistical methods and models, but to incorporate more of the biology into the computational techniques, in a principled manner. To this end, we propose a novel motif discovery algorithm: PRIORITY. Based on a general Gibbs sampling framework, PRIORITY has a major advantage over other motif discovery tools: it can incorporate different types of biological information (e.g., nucleosome positioning information) to guide the search for DNA binding sites toward regions where these sites are more likely to occur (e.g., nucleosome-free regions).
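How a positional prior can steer a sampling-based motif search is sketched below. This is not PRIORITY's implementation: it shows one generic step in which a motif start position is drawn with probability proportional to a prior times a likelihood ratio, and it assumes a uniform background model, a fixed position weight matrix, and hypothetical function names.

```python
import random


def position_posterior(seq, pwm, prior, background=0.25):
    """Posterior over motif start positions in one sequence.

    seq: DNA string; pwm: list of {base: probability} dicts, one per motif column;
    prior: prior probability for each valid start position (e.g. higher in
    regions assumed to be nucleosome-free); background: assumed uniform base
    frequency. Posterior is proportional to prior * (motif / background likelihood).
    """
    width = len(pwm)
    scores = []
    for start in range(len(seq) - width + 1):
        ratio = 1.0
        for offset in range(width):
            ratio *= pwm[offset][seq[start + offset]] / background
        scores.append(prior[start] * ratio)
    total = sum(scores)
    return [s / total for s in scores]


def sample_position(seq, pwm, prior, rng=random):
    """Draw one start position from the posterior (a single sampling step)."""
    posterior = position_posterior(seq, pwm, prior)
    return rng.choices(range(len(posterior)), weights=posterior, k=1)[0]
```

With a uniform prior, two identical occurrences of the motif are equally likely to be chosen; an informative prior breaks the tie in favor of the position the extra biological evidence supports. A full Gibbs sampler would alternate such draws across all sequences while re-estimating the weight matrix.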
We use transcription factor (TF) binding data from yeast chromatin immunoprecipitation (ChIP-chip) experiments to test the performance of our motif discovery algorithm when incorporating three types of biological information: information about nucleosome positioning, information about DNA double-helical stability, and evolutionary conservation information. In each case, incorporating additional biological information has proven very useful in increasing the accuracy of motif finding, with the number of correctly identified motifs increasing by up to 52%. PRIORITY is not restricted to TF binding data. In this work, we also analyze origin recognition complex (ORC) binding data and show that PRIORITY can utilize DNA structural information to predict the binding specificity of the yeast ORC.
Despite the improvement obtained using additional biological information, the success of motif discovery algorithms in identifying known motifs is still limited, especially when applied to sequences bound in vivo (such as those of ChIP-chip) because the observed protein-DNA interactions are not necessarily direct. Some TFs associate with DNA only indirectly via protein partners, while others exhibit both direct and indirect binding. We propose a novel method to distinguish between direct and indirect TF-DNA interactions, integrating in vivo TF binding data, in vivo nucleosome occupancy data, and in vitro motifs from protein binding microarrays. When applied to yeast ChIP-chip data, our method reveals that only 48% of the ChIP-chip data sets can be readily explained by direct binding of the profiled TF, while 16% can be explained by indirect DNA binding. In the remaining 36%, we found that none of the motifs used in our analysis was able to explain the ChIP-chip data, either because the data was too noisy or because the set of motifs was incomplete. As more in vitro motifs become available, our method can be used to build a complete catalog of direct and indirect TF-DNA interactions.
Item Open Access Regulation of Global Transcription Dynamics During Cell Division and Root Development (2009) Orlando, David Anthony
The successful completion of many critical biological processes depends on the proper execution of complex spatial and temporal gene expression programs. With the advent of high-throughput microarray technology, it is now possible to measure the dynamics of these expression programs on a genome-wide level. In this thesis we present work focused on utilizing this technology, in combination with novel computational techniques, to examine the role of transcriptional regulatory mechanisms in controlling the complex gene expression programs underlying two fundamental biological processes---the cell cycle and the development and differentiation of an organ.
We generate a dataset describing the genomic expression program which occurs during the cell division cycle of Saccharomyces cerevisiae. By concurrently measuring the dynamics in both wild-type cells and mutant cells that do not express either S-phase or mitotic cyclins, we quantify the relative contributions of cyclin-CDK complexes and transcriptional regulatory networks in the regulation of the cell cycle expression program. We show that, contrary to previously accepted models, CDKs are not the sole regulators of periodic transcription, and we hypothesize an oscillating transcriptional regulatory network which could work independently of, or in tandem with, the CDK oscillator to control the cell cycle expression program.
To understand the acquisition of cellular identity, we generate a nearly complete gene expression map of the Arabidopsis thaliana root at the resolution of individual cell types and developmental stages. An analysis of these data reveals a representative set of dominant expression patterns which are used to begin defining the spatiotemporal transcriptional programs that control development within the root.
Additionally, we develop computational tools that improve the interpretability and power of these data. We present CLOCCS, a model for the dynamics of population synchrony loss in time-series experiments. We demonstrate the utility of CLOCCS in integrating disparate datasets and present a CLOCCS-based deconvolution of the cell-cycle expression data. A deconvolution method is also developed for the Arabidopsis dataset, increasing its resolution to cell-type/section subregion specificity. Finally, a method for identifying biological processes occurring on multiple timescales is presented and applied to both datasets.
It is through the combination of these new genome-wide expression studies and computational tools that we begin to elucidate the transcriptional regulatory mechanisms controlling fundamental biological processes.
Item Open Access RNA Backbone Rotamers and Chiropraxis(2007-07-25) Murray, Laura WestonRNA backbone is biologically important with many roles in reactions and interactions, but has historically been a challenge in structural determination. It has many atoms and torsions to place, and often there is less data on it than one might wish. This problem leads to both random and systematic error, producing noise in an already high-dimensional and complex distribution to further complicate data-driven analysis. With the advent of the ribosomal subunit structures published in 2000, large RNA structures at good resolution, it became possible to apply the Richardson laboratory's quality-filtering, visualization, and analysis techniques to RNA and develop new tools for RNA as well. A first set of 42 RNA backbone rotamers was identified, developed, and published in 2003; it has since been thoroughly overhauled in conjunction with the backbone group of the RNA Ontology Consortium to combine the strengths of different approaches, incorporate new data, and produce a consensus set of 46 conformers. Meanwhile, extensive work has taken place on developing validation and remodeling tools to correct and improve existing structures as well as to assist in initial fitting. The use of base-phosphate perpendicular distances to identify sugar pucker has proven very useful in both hand-refitting and the semi-automated process of using RNABC (RNA Backbone Correction), a program developed in conjunction with Dr. Jack Snoeyink's laboratory. The guanine riboswitch structure ur0039/1U8D, by Dr. Rob Batey's laboratory, has been collaboratively refit and rerefined as a successful test case of the utility of these tools and techniques. 
Their testing and development will continue, and they are expected to help to improve RNA structure determination in both ease and quality.
Item Open Access Studies into Location-specific cis-Regulatory Motifs(2010) Yokoyama, Ken DaigoroGene expression and regulation are major determinants of phenotypic traits displayed across species. Although the DNA sequence elements that control gene expression play a crucial role in determining species morphology, predicting cis-regulatory elements through sequence analysis alone remains a difficult task. A few regulatory elements, such as the TATA-box and Initiator sequence, have been known to exhibit overrepresentation at specific locations within the proximal promoter. However, the extent to which this occurs among cis-regulatory elements is not well understood. Here, we take a genome-wide approach towards detecting such functional sequence elements, using location-specific overrepresentation as a criterion for regulatory function. We provide evidence that a surprisingly large number of regulatory elements exhibit locational overrepresentation with respect to the transcription start site. We then utilize this characteristic to predict novel cis-regulatory elements overrepresented at particular locations within the proximal promoter.
Transcriptional regulation is most often controlled not by single protein factors acting in isolation, but instead by multiple transcription factors acting together within multi-protein complexes. As protein-protein interactions are largely determined through protein structure, we would expect to see patterns of spatial preference between motif-pairs binding interacting factors. However, in the absence of methods to predict such spatial preferences between motifs, comprehensive assessments of such inter-relationships have not been previously conducted. As our model provides a general tool for detecting positional specificities of a motif relative to a given reference point, we expanded our model to measure distance preferences between pairs of motifs on a genome-wide scale. We show that there often exist patterns of spatial dependencies between pairs of sequence elements that bind interacting protein factors. We find that regulatory motifs binding interacting proteins often have multiple inter-motif distances at which they preferentially occur, and we show that the intervals between preferred distances are highly consistent across motif-pairs. This distance preference `phasing' was empirically found to occur at consistent intervals of ~8-10 bp, corresponding to approximately the number of nucleotides within a single turn of the DNA double-helix. This finding suggests a tendency for protein factor-pairs to interact in a specific orientation with respect to the turn of the DNA molecule, and offers a convenient method by which to determine motif-pairs binding interacting transcription factors de novo.
While little is known about the mechanisms by which individual cis-regulatory elements ultimately control gene expression, even less is known about how such elements evolve over time. A single transcription factor can potentially target hundreds of genes across the genome, and thus modifications in the binding affinities of such proteins must induce conversions at a multitude of functional sites in order to preserve the set of target genes that the trans-factor regulates. It is therefore commonly assumed that such changes occur rarely and at a slow rate over the course of evolution. Despite this widespread assumption, we find that a surprisingly large number of cis-regulatory elements have been subject to significant changes in consensus sequence in a lineage-specific manner. Here, we demonstrate that the genomic landscape is highly adaptable, rapidly adjusting to global changes in preferred regulatory consensus sequences. Focusing upon regulatory elements exhibiting location-specific overrepresentation, we find that a substantial fraction of regulatory elements have been subject to evolutionary modifications, even between closely related eutherians. These findings have broad implications regarding evolving phenotypes observed across species.
Item Open Access Studies on Human Chromatin Using High-Throughput DNaseI Sequencing(2009) Boyle, Alan PCis-elements govern the key step of transcription to regulate gene expression within a cell. Identification of utilized elements within a particular cell line will help further our understanding of individual and cumulative effects of trans-acting factors. These elements can be identified through an assay leveraging the ability of DNaseI to cut DNA that is in an open and accessible state, making it hypersensitive to cleavage. Here we develop and explore computational techniques to measure open chromatin from sequencing and microarray data. We are able to identify 94,925 DNaseI hypersensitive sites genome-wide in CD4+ T cells. Interestingly, only 16%-20% of these sites were found in promoters. We also show that these regions are associated with different chromatin modifications. We found that DNaseI data can also be used to identify precise 'footprints' indicating protein-DNA interaction sites. Footprints for specific transcription factors correlate well with ChIP-seq enrichment, reveal distinct conservation patterns, and reveal a cell-type specific arrangement of transcriptional regulation. These footprints can be used in addition to or in lieu of ChIP-seq data to better understand genomic regulatory systems.
Item Open Access The Geometry of Cancer(2009) Guinney, JustinCancer is a complex, multifaceted disease that operates through dynamic changes in the genome. Cancer is best understood through the process that generates it -- random mutations operated on by natural selection -- and several global hallmarks that describe its broad mechanisms. While many genes, protein interactions, and pathways have been enumerated as a kind of ``parts'' list for cancer, researchers are attempting to synthesize broader models for inferring and predicting cancer behavior using high-throughput data and integrative analyses.
The focus of this thesis is on the development of two novel methods that are optimized for the analysis of complex cancer phenotypes. The first method incorporates ideas from gradient learning with multitask learning to assess statistical dependencies across multiple related data sets. The second method integrates multiscale analysis on graphs and manifolds developed in applied harmonic analysis with sparse factor models, a mainstay of applied statistics. This method generates multiscale factors that are used for inferring hierarchical associations within complex biological networks. The primary biological focus is the inference of gene and pathway dependencies associated with cancer progression and metastatic disease in prostate cancer. Significant findings include evidence of Skp2 degradation of the cell-cycle regulator p27, and the upstream deregulation of the TGF-beta pathway, driving prostate cancer recurrence.
Item Open Access Using Genome-wide Approaches to Characterize the Relationship Between Genomic Variation and Disease: A Case Study in Oligodendroglioma and Staphylococcus arueus(2010) Johnson, NicoleGenetic variation is a natural occurrence in the genome that contributes to the phenotypic differences observed between individuals as well as the phenotypic outcomes of various diseases, including infectious disease and cancer. Single nucleotide polymorphisms (SNPs) have been identified as genetic factors influencing host susceptibility to infectious disease, while the study of copy number variation (CNV) in various cancers has led to the identification of causal genetic factors influencing tumor formation and severity. In this work, we evaluated the association between genomic variation and disease phenotypes to identify SNPs contributing to host susceptibility in Staphylococcus aureus (S. aureus) infection and to characterize a brain tumor of the nervous system, known as oligodendroglioma (OD), using the CNV observed in tumors with varying degrees of malignancy.
Using SNP data, we utilized a computational approach, known as in silico haplotype mapping (ISHM), to identify genomic regions significantly associated with susceptibility to S. aureus infection in inbred mouse strains. We conducted ISHM on four phenotypes collected from S. aureus infected mice and identified genes contained in the significant regions, which were considered to be potential candidate genes. Gene expression studies were then conducted on inbred mice considered to be resistant or susceptible to S. aureus infection to identify genes differentially expressed between the two groups, which provided biological validation of the genes identified in significant ISHM regions. Genes identified by both analyses were considered our top priority genes, and known biological information about the genes was used to determine their functional roles in susceptibility to S. aureus infection.
We then evaluated CNV in subtypes of ODs to characterize the tumors by their genomic aberrations. We conducted array-based comparative genomic hybridization (CGH) on 74 ODs to generate genomic profiles that were classified by tumor grade, providing insight about the genomic changes that typically occur in patients with OD ranging from the less to more severe tumor types. Additionally, smaller genomic intervals with substantial copy number differences between normal and OD DNA samples, known as minimal critical regions (MCRs), were identified among the tumors. The genomic regions with copy number changes were interrogated for genes and assessed for their biological roles in the tumors' phenotype and formation. This information was used to assess the validity of using genomic variation in these tumors to further classify these tumors in addition to standard classification techniques.
The studies described in this project demonstrate the utility of using genetic variation to study disease phenotypes and applying computational and experimental techniques to identify the underlying genetic factors contributing to disease pathogenesis. Moreover, the continued development of similar approaches could aid in the development of new diagnostic procedures as well as novel therapeutics for the generation of more personalized treatments.