The importance of residue-level filtering and the Top2018 best-parts dataset of high-quality protein residues.
We have curated a high-quality, "best-parts" reference dataset of about 3 million protein residues in about 15,000 PDB-format coordinate files, each containing only residues with good electron density support for a physically acceptable model conformation. The resulting prefiltered data typically contain the entire core of each chain, in quite long continuous fragments. Each reference file is a single protein chain, and the total set of files was selected for low redundancy, high resolution, good MolProbity score, and other chain-level criteria. Each residue was then critically tested for adequate local map quality to firmly support its conformation, which must also be free of serious clashes or covalent-geometry outliers. The resulting Top2018 prefiltered datasets have been released on Zenodo and are freely available for all uses under a Creative Commons license. Currently, one dataset is residue-filtered on main chain plus Cβ atoms, and a second dataset is full-residue filtered; each is available at four different sequence-identity levels. Here, we present both statistics and examples that show the beneficial consequences of residue-level filtering. That process is necessary because even the best structures contain a few highly disordered local regions with poor density and low-confidence conformations that should not be included in reference data. Therefore, the open distribution of these very large, prefiltered reference datasets constitutes a notable advance for structural bioinformatics and the fields that depend upon it.
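The residue-level step described in the abstract can be pictured with a minimal sketch. This is not the authors' pipeline: the `Residue` fields, the boolean flags, and the function names are hypothetical stand-ins for the real map-quality, clash, and covalent-geometry tests, whose thresholds are not given here. The sketch only shows how dropping failing residues leaves a chain as a set of continuous fragments.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Residue:
    # Hypothetical per-residue record; the flags stand in for the real
    # density, clash, and geometry tests, which are not specified here.
    number: int
    density_ok: bool
    has_serious_clash: bool
    has_geometry_outlier: bool


def passes_filter(res: Residue) -> bool:
    """Keep a residue only if every (assumed) quality test is satisfied."""
    return (
        res.density_ok
        and not res.has_serious_clash
        and not res.has_geometry_outlier
    )


def best_parts(chain: List[Residue]) -> List[List[Residue]]:
    """Split one chain into continuous fragments of passing residues."""
    fragments: List[List[Residue]] = []
    current: List[Residue] = []
    for res in chain:
        if passes_filter(res):
            # A gap in residue numbering also ends the current fragment.
            if current and res.number != current[-1].number + 1:
                fragments.append(current)
                current = []
            current.append(res)
        elif current:
            # A failing residue closes the fragment built so far.
            fragments.append(current)
            current = []
    if current:
        fragments.append(current)
    return fragments
```

In this sketch, a six-residue chain in which only residue 4 fails the density test would yield two fragments, residues 1 to 3 and residues 5 to 6, which mirrors the idea of keeping the well-supported core while excluding a disordered local region.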
Published Version (Please cite this version)
Williams, Christopher J, David C Richardson and Jane S Richardson (2022). The importance of residue-level filtering and the Top2018 best-parts dataset of high-quality protein residues. Protein Science: A Publication of the Protein Society, 31(1), pp. 290–300. doi:10.1002/pro.4239. Retrieved from https://hdl.handle.net/10161/29455.
This citation is constructed from limited available data and may be imprecise. To cite this article, please review and use the official citation provided by the journal.
3D structure of macromolecules; molecular graphics; protein folding and design; all-atom contacts; x-ray crystallography; structure validation.
Unless otherwise indicated, scholarly articles published by Duke faculty members are made available here with a CC-BY-NC (Creative Commons Attribution Non-Commercial) license, as enabled by the Duke Open Access Policy. If you wish to use the materials in ways not already permitted under CC-BY-NC, please consult the copyright owner. Other materials are made available here through the author’s grant of a non-exclusive license to make their work openly accessible.