Intelligent data analysis for protein disorder prediction

Pedro Romero, Zoran Obradovic, A. Keith Dunker

Research output: Contribution to journal › Article › peer-review

29 Scopus citations

Abstract

Although an ordered 3D structure is generally considered a necessary precondition for protein function, disordered counterexamples with biological activity have been found. The objectives of our data mining project are: (1) to generalize from the limited set of counterexamples and then apply this knowledge to large databases of amino acid sequences in order to estimate how common disordered protein regions are in nature, and (2) to determine whether there are different types of protein disorder. For general disorder estimation, a neural-network-based predictor was designed and tested on data built from several public-domain data banks through a nontrivial search, statistical analysis, and data dimensionality reduction. In addition, predictors for identifying family-specific disorder were developed by extracting knowledge from databases generated through multiple sequence alignments of a known disordered sequence with other highly related proteins. Family-specific predictors were also integrated into hybrid prediction systems to test the quality of general protein disorder identification by such systems. Out-of-sample cross-validation performance of several predictors was computed first, followed by tests on an unrelated database of proteins with long disordered regions and the application of a few selected predictors to two large protein data banks: Nrl_3D, currently containing more than 10,000 protein fragments of known 3D structure, and Swiss Protein, containing almost 60,000 protein sequences. The results provide evidence that long disordered regions are common in nature, with an estimate that 11% of all the residues in the Swiss Protein data bank belong to disordered regions of length 40 or greater.
The hypothesis that different protein disorder types exist is supported by the high-specificity/low-sensitivity results of two family-specific predictors, by hybrid systems outperforming general models on a two-family test, and by the existence of significant gaps between the Swiss Protein and Nrl_3D disorder frequency estimates for both families. These findings call for a revision of the current understanding of protein structure and function, as well as for the development of improved disorder predictors, which should have important uses in biotechnology applications.
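
As an illustration of the kind of statistic quoted above (the share of residues that fall in disordered regions of length 40 or greater), the following minimal Python sketch counts residues in long predicted-disorder runs. It is not the authors' pipeline: the function name `long_disorder_fraction`, the 0/1 per-residue label encoding, and the toy data are all assumptions made for this example.

```python
from itertools import groupby


def long_disorder_fraction(labels_by_protein, min_len=40):
    """Return the fraction of all residues lying in disordered runs of at least min_len.

    labels_by_protein: iterable of per-residue label sequences, where
    1 means "predicted disordered" and 0 means "predicted ordered"
    (a hypothetical encoding chosen for this sketch).
    """
    total = 0
    in_long_runs = 0
    for labels in labels_by_protein:
        total += len(labels)
        # groupby yields maximal runs of consecutive identical labels
        for is_disordered, run in groupby(labels):
            run_len = sum(1 for _ in run)
            if is_disordered and run_len >= min_len:
                in_long_runs += run_len
    return in_long_runs / total if total else 0.0


# Toy data, invented for illustration only (not taken from any data bank).
toy = [
    [1] * 45 + [0] * 55,  # one 45-residue disordered run: counted
    [1] * 10 + [0] * 90,  # 10-residue run is below the threshold: ignored
]
fraction = long_disorder_fraction(toy)
print(f"{fraction:.3f}")
```

Pooling residues across all sequences, rather than averaging per-protein fractions, matches the "of all the residues" phrasing of the estimate; a per-protein average would weight short sequences more heavily.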

Original language: English (US)
Pages (from-to): 447-484
Number of pages: 38
Journal: Artificial Intelligence Review
Volume: 14
Issue number: 6
State: Published - Dec 1 2000
Externally published: Yes

Keywords

  • Data mining
  • Neural networks
  • Protein databases
  • Protein disorder prediction
  • Protein structure

ASJC Scopus subject areas

  • Language and Linguistics
  • Linguistics and Language
  • Artificial Intelligence
