The importance of residue-level filtering and the Top2018 best-parts dataset of high-quality protein residues.
dc.contributor.author | Williams, Christopher J | |
dc.contributor.author | Richardson, David C | |
dc.contributor.author | Richardson, Jane S | |
dc.date.accessioned | 2023-12-01T18:52:06Z | |
dc.date.available | 2023-12-01T18:52:06Z | |
dc.date.issued | 2022-01 | |
dc.date.updated | 2023-12-01T18:52:03Z | |
dc.description.abstract | We have curated a high-quality, "best-parts" reference dataset of about 3 million protein residues in about 15,000 PDB-format coordinate files, each containing only residues with good electron density support for a physically acceptable model conformation. The resulting prefiltered data typically contain the entire core of each chain, in quite long continuous fragments. Each reference file is a single protein chain, and the total set of files was selected for low redundancy, high resolution, good MolProbity score, and other chain-level criteria. Then each residue was critically tested for adequate local map quality to firmly support its conformation, which must also be free of serious clashes or covalent-geometry outliers. The resulting Top2018 prefiltered datasets have been released on the Zenodo online web service and are freely available for all uses under a Creative Commons license. Currently, one dataset is residue filtered on main chain plus Cβ atoms, and a second dataset is full-residue filtered; each is available at four different sequence-identity levels. Here, we illustrate both statistics and examples that show the beneficial consequences of residue-level filtering. That process is necessary because even the best of structures contain a few highly disordered local regions with poor density and low-confidence conformations that should not be included in reference data. Therefore, the open distribution of these very large, prefiltered reference datasets constitutes a notable advance for structural bioinformatics and the fields that depend upon it. | |
dc.identifier.issn | 0961-8368 | |
dc.identifier.issn | 1469-896X | |
dc.identifier.uri | ||
dc.language | eng | |
dc.publisher | Wiley | |
dc.relation.ispartof | Protein science : a publication of the Protein Society | |
dc.relation.isversionof | 10.1002/pro.4239 | |
dc.subject | Proteins | |
dc.subject | Crystallography, X-Ray | |
dc.subject | Computational Biology | |
dc.subject | Protein Conformation | |
dc.subject | Algorithms | |
dc.subject | Models, Molecular | |
dc.subject | Software | |
dc.subject | Databases, Protein | |
dc.title | The importance of residue-level filtering and the Top2018 best-parts dataset of high-quality protein residues. | |
dc.type | Journal article | |
duke.contributor.orcid | Richardson, Jane S|0000-0002-3311-2944 | |
pubs.begin-page | 290 | |
pubs.end-page | 300 | |
pubs.issue | 1 | |
pubs.organisational-group | Duke | |
pubs.organisational-group | School of Medicine | |
pubs.organisational-group | Basic Science Departments | |
pubs.organisational-group | Biochemistry | |
pubs.publication-status | Published | |
pubs.volume | 31 |
Files
Original bundle
- Name: The importance of residue-level filtering and the Top2018 best-parts dataset of high-quality protein residues.pdf
- Size: 3.81 MB
- Format: Adobe Portable Document Format