Browsing by Author "Zhao, Shiwen"
Item Open Access
Efficient Enumeration and Visualization of Helix-coil Ensembles. (bioRxiv, 2023-09-17) Schmidler, Scott C; Hughes, Roy Gene; Oas, Terrence G; Zhao, Shiwen
Helix-coil models are routinely used to interpret CD data of helical peptides or predict the helicity of naturally occurring and designed polypeptides. However, a helix-coil model contains significantly more information than mean helicity alone, as it defines the entire ensemble - the equilibrium population of every possible helix-coil configuration - for a given sequence. Many desirable quantities of this ensemble are either not obtained as ensemble averages, or are not available using standard helicity-averaging calculations. Enumeration of the entire ensemble can allow calculation of a wider set of ensemble properties, but the exponential size of the configuration space typically renders this intractable. We present an algorithm that efficiently approximates the helix-coil ensemble to arbitrary accuracy, by sequentially generating a list of the M highest populated configurations in descending order of population. Truncating this list of (configuration, population) pairs at a desired accuracy provides an approximating sub-ensemble. We demonstrate several uses of this approach for providing insight into helix-coil ensembles and folding mechanisms, including landscape visualization.

Item Open Access
New Advancements of Scalable Statistical Methods for Learning Latent Structures in Big Data (2016) Zhao, Shiwen
Continual technological advances have caused a data explosion in recent years. Accordingly, modern statistical and machine learning methods must be adapted to deal with complex and heterogeneous data types. This is particularly true in the analysis of biological data. For example, DNA sequence data can be viewed as categorical variables, with each nucleotide taking one of four categories.
Gene expression data, depending on the quantification technology, may be continuous values or counts. With the advancement of high-throughput technology, such data have become unprecedentedly abundant. Efficient statistical approaches are therefore crucial in this big data era.
Previous statistical methods for big data often aim to find low-dimensional structures in the observed data. For example, a factor analysis model assumes a latent multivariate Gaussian vector. Under this assumption, a factor model produces a low-rank estimate of the covariance of the observed variables. Another example is the latent Dirichlet allocation model for documents, in which the mixture proportions of topics are represented by a Dirichlet-distributed variable. This dissertation proposes several novel extensions to these previous statistical methods, developed to address challenges in big data. The novel methods are applied in multiple real-world applications, including construction of condition-specific gene co-expression networks, estimating shared topics among newsgroups, analysis of promoter sequences, analysis of political-economics risk data, and estimating population structure from genotype data.
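
The low-rank covariance structure that the abstract attributes to factor models can be illustrated with a small sketch. This is not taken from the dissertation: it assumes the textbook factor model x = Λz + ε, with standard Gaussian factors z and independent noise, so that the implied covariance is ΛΛᵀ + Ψ with Ψ diagonal. The loading matrix `LOADINGS`, the noise variances `NOISE`, and the helper names are hypothetical values chosen only for illustration.

```python
from fractions import Fraction


def implied_covariance(loadings, noise_vars):
    """Return Lambda Lambda^T + diag(noise_vars) for a p x k loading matrix."""
    p = len(loadings)
    cov = [
        [sum(a * b for a, b in zip(loadings[i], loadings[j])) for j in range(p)]
        for i in range(p)
    ]
    for i in range(p):
        cov[i][i] += noise_vars[i]
    return cov


def rank(matrix):
    """Matrix rank via exact Gaussian elimination over Fractions."""
    rows = [[Fraction(v) for v in row] for row in matrix]
    r = 0
    for c in range(len(rows[0])):
        # Find a row at or below position r with a nonzero entry in column c.
        pivot = next((i for i in range(r, len(rows)) if rows[i][c] != 0), None)
        if pivot is None:
            continue
        rows[r], rows[pivot] = rows[pivot], rows[r]
        # Eliminate column c from all rows below the pivot row.
        for i in range(r + 1, len(rows)):
            factor = rows[i][c] / rows[r][c]
            rows[i] = [x - factor * y for x, y in zip(rows[i], rows[r])]
        r += 1
    return r


# Hypothetical example: p = 4 observed variables, k = 2 latent factors.
LOADINGS = [[1, 0], [2, 1], [0, 3], [1, 1]]
NOISE = [1, 1, 1, 1]

low_rank_part = implied_covariance(LOADINGS, [0, 0, 0, 0])  # Lambda Lambda^T only
full_cov = implied_covariance(LOADINGS, NOISE)              # plus diagonal noise

print(rank(low_rank_part), rank(full_cov))
```

Under these assumptions the shared component ΛΛᵀ has rank at most k (the number of factors), while the added diagonal noise makes the full covariance full rank. The covariance is therefore described by far fewer parameters than an unconstrained p × p matrix.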