Intrinsic disorder in the protein data bank

Tanguy Le Gall, Pedro R. Romero, Marc S. Cortese, Vladimir N. Uversky, A. Dunker

Research output: Contribution to journal › Article

97 Citations (Scopus)

Abstract

The Protein Data Bank (PDB) is the preeminent source of protein structural information. PDB contains over 32,500 experimentally determined 3-D structures solved using X-ray crystallography or nuclear magnetic resonance spectroscopy. Intrinsically disordered regions fail to form a fixed 3-D structure under physiological conditions. In this study, we compare the amino-acid sequences of proteins whose structures were determined by X-ray crystallography with the corresponding sequences from the Swiss-Prot database. The analyzed dataset includes 16,370 structures, which represent 18,101 PDB chains and 5,434 different proteins (2,793 eukaryotic, 2,109 bacterial, 288 viral, and 244 archaeal) from 910 different organisms. In this dataset, on average, each Swiss-Prot protein is represented by 7 PDB chains, with 76% of the crystallized regions being represented by more than one structure. Intriguingly, the complete sequences of only ∼7% of proteins are observed in the corresponding PDB structures, and only ∼25% of the proteins in the dataset have >95% of their lengths observed in the corresponding PDB structures. This suggests that the vast majority of PDB proteins are shorter than their corresponding Swiss-Prot sequences and/or contain numerous residues that are not observed in maps of electron density. To determine the prevalence of disordered regions in PDB, the residues in the Swiss-Prot sequences were grouped into four general categories, "Observed" (which correspond to structured regions), "Not observed" (regions with missing electron density, potentially disordered), "Uncharacterized," and "Ambiguous," depending on their appearance in the corresponding PDB entries. This non-redundant set of residues can be viewed as a 'fragment' or empirical domain database that contains a set of experimentally determined structured regions or domains and a set of experimentally verified disordered regions or domains.
We studied the propensities and properties of residues in these four categories and analyzed how they relate to the predictions of several disorder prediction algorithms. "Not observed," "Ambiguous," and "Uncharacterized" regions were shown to possess the amino acid compositional biases typical of intrinsically disordered proteins. The application of four different disorder predictors (PONDR® VL-XT, VL3-BA, VSL1P, and IUPred) revealed that the vast majority of residues in the "Observed" dataset are ordered, and that the "Not observed" regions are mostly disordered. The "Uncharacterized" regions possess some tendency toward order, whereas the predictions for the short "Ambiguous" regions are indeed ambiguous. Long "Ambiguous" regions (>70 amino acid residues) are mostly predicted to be ordered, suggesting that they are likely to be "wobbly" domains. Overall, we showed that completely ordered proteins are not highly abundant in PDB and that many PDB sequences have disordered regions. In fact, in the analyzed dataset, ∼10% of the PDB proteins contain regions of consecutive missing or ambiguous residues longer than 30 amino acids, and ∼40% of the proteins possess short regions (≥10 and <30 amino acids long) of missing and ambiguous residues.
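
The four-way grouping of residues described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' pipeline. The abstract does not spell out how "Uncharacterized" and "Ambiguous" are assigned, so the rules below are assumptions: a residue is "Uncharacterized" if no crystallized construct includes it, and "Ambiguous" if it is seen in some chains but missing in others. The function names, the input layout (sets of 1-based Swiss-Prot positions per chain), and the toy data are all hypothetical.

```python
from itertools import groupby

def classify_residues(seq_len, chains):
    """Label each Swiss-Prot position 1..seq_len.

    chains: list of (covered, observed) pairs of position sets, where
    `covered` is the crystallized construct and `observed` holds the
    positions with coordinates. The rules are assumptions, not the paper's.
    """
    labels = []
    for pos in range(1, seq_len + 1):
        in_construct = [c for c in chains if pos in c[0]]
        if not in_construct:
            labels.append("Uncharacterized")   # never part of any construct
            continue
        seen = [pos in observed for _, observed in in_construct]
        if all(seen):
            labels.append("Observed")          # present in every chain
        elif not any(seen):
            labels.append("Not observed")      # missing in every chain
        else:
            labels.append("Ambiguous")         # present in some chains only
    return labels

def regions(labels, wanted):
    """Return (start, length) for each run of consecutive wanted labels."""
    runs, pos = [], 1
    for is_wanted, group in groupby(labels, key=lambda lab: lab in wanted):
        n = len(list(group))
        if is_wanted:
            runs.append((pos, n))
        pos += n
    return runs

def length_class(n):
    """Bin a region using the abstract's thresholds, read literally."""
    if n > 30:
        return "long"
    if 10 <= n < 30:
        return "short"
    return None  # under 10 residues, or exactly 30 (covered by neither bin)

# Toy example: a 60-residue protein crystallized twice as residues 1-50.
covered = set(range(1, 51))
chain_a = (covered, set(range(1, 36)))   # coordinates for residues 1-35
chain_b = (covered, set(range(1, 41)))   # coordinates for residues 1-40
labels = classify_residues(60, [chain_a, chain_b])
missing_or_ambiguous = regions(labels, {"Not observed", "Ambiguous"})
```

In this toy case, residues 36-40 are "Ambiguous", 41-50 are "Not observed", and 51-60 are "Uncharacterized". The missing and ambiguous residues merge into a single 15-residue run, which falls in the short bin.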

Original language: English
Pages (from-to): 325-341
Number of pages: 17
Journal: Journal of Biomolecular Structure and Dynamics
Volume: 24
Issue number: 4
State: Published - Feb 2007

Fingerprint

Databases
Proteins
Amino Acids
X Ray Crystallography
Intrinsically Disordered Proteins
Electrons
Amino Acid Sequence
Magnetic Resonance Spectroscopy

Keywords

  • Intrinsic disorder
  • PONDR
  • Protein data bank

ASJC Scopus subject areas

  • Molecular Biology
  • Structural Biology

Cite this

Le Gall, T., Romero, P. R., Cortese, M. S., Uversky, V. N., & Dunker, A. (2007). Intrinsic disorder in the protein data bank. Journal of Biomolecular Structure and Dynamics, 24(4), 325-341.

@article{579d5fad8c414158819d5e0ed026548d,
title = "Intrinsic disorder in the protein data bank",
keywords = "Intrinsic disorder, PONDR, Protein data bank",
author = "{Le Gall}, Tanguy and Romero, {Pedro R.} and Cortese, {Marc S.} and Uversky, {Vladimir N.} and A. Dunker",
year = "2007",
month = "2",
language = "English",
volume = "24",
pages = "325--341",
journal = "Journal of Biomolecular Structure and Dynamics",
issn = "0739-1102",
publisher = "Adenine Press",
number = "4",

}

TY - JOUR

T1 - Intrinsic disorder in the protein data bank

AU - Le Gall, Tanguy

AU - Romero, Pedro R.

AU - Cortese, Marc S.

AU - Uversky, Vladimir N.

AU - Dunker, A.

PY - 2007/2

Y1 - 2007/2

KW - Intrinsic disorder

KW - PONDR

KW - Protein data bank

UR - http://www.scopus.com/inward/record.url?scp=33847200077&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=33847200077&partnerID=8YFLogxK

M3 - Article

VL - 24

SP - 325

EP - 341

JO - Journal of Biomolecular Structure and Dynamics

JF - Journal of Biomolecular Structure and Dynamics

SN - 0739-1102

IS - 4

ER -