The importance of residue-level filtering and the Top2018 best-parts dataset of high-quality protein residues.
Date
2022-01
Journal Title
Journal ISSN
Volume Title
Repository Usage Stats
views
downloads
Citation Stats
Abstract
We have curated a high-quality, "best-parts" reference dataset of about 3 million protein residues in about 15,000 PDB-format coordinate files, each containing only residues with good electron density support for a physically acceptable model conformation. The resulting prefiltered data typically contain the entire core of each chain, in quite long continuous fragments. Each reference file is a single protein chain, and the total set of files were selected for low redundancy, high resolution, good MolProbity score, and other chain-level criteria. Then each residue was critically tested for adequate local map quality to firmly support its conformation, which must also be free of serious clashes or covalent-geometry outliers. The resulting Top2018 prefiltered datasets have been released on the Zenodo online web service and are freely available for all uses under a Creative Commons license. Currently, one dataset is residue filtered on main chain plus Cβ atoms, and a second dataset is full-residue filtered; each is available at four different sequence-identity levels. Here, we illustrate both statistics and examples that show the beneficial consequences of residue-level filtering. That process is necessary because even the best of structures contain a few highly disordered local regions with poor density and low-confidence conformations that should not be included in reference data. Therefore, the open distribution of these very large, prefiltered reference datasets constitutes a notable advance for structural bioinformatics and the fields that depend upon it.
Type
Department
Description
Provenance
Subjects
Citation
Permalink
Published Version (Please cite this version)
Publication Info
Williams, Christopher J, David C Richardson and Jane S Richardson (2022). The importance of residue-level filtering and the Top2018 best-parts dataset of high-quality protein residues. Protein science : a publication of the Protein Society, 31(1). pp. 290–300. 10.1002/pro.4239 Retrieved from https://hdl.handle.net/10161/29455.
This is constructed from limited available data and may be imprecise. To cite this article, please review & use the official citation provided by the journal.
Collections
Scholars@Duke

David C. Richardson
Protein structure, folding, and design; 3D computer graphics; x-ray crystallography.

Jane Shelby Richardson
3D structure of macromolecules; molecular graphics; protein folding and design; all-atom contacts; x-ray crystallography; structure validation.
Unless otherwise indicated, scholarly articles published by Duke faculty members are made available here with a CC-BY-NC (Creative Commons Attribution Non-Commercial) license, as enabled by the Duke Open Access Policy. If you wish to use the materials in ways not already permitted under CC-BY-NC, please consult the copyright owner. Other materials are made available here through the author’s grant of a non-exclusive license to make their work openly accessible.