Intelligent data analysis for protein disorder prediction

Pedro Romero, Zoran Obradovic, A. Dunker

Research output: Contribution to journal (Article)

29 Citations (Scopus)

Abstract

Although an ordered 3D structure is generally considered to be a necessary precondition for protein functionality, there are disordered counterexamples known to have biological activity. The objectives of our data mining project are: (1) to generalize from the limited set of counterexamples and then apply this knowledge to large databases of amino acid sequences in order to estimate how common disordered protein regions are in nature, and (2) to determine whether there are different types of protein disorder. For general disorder estimation, a neural-network-based predictor was designed and tested on data built from several public-domain data banks through a nontrivial search, statistical analysis, and data dimensionality reduction. In addition, predictors for identifying family-specific disorder were developed by extracting knowledge from databases generated through multiple sequence alignments of a known disordered sequence to other highly related proteins. Family-specific predictors were also integrated into hybrid prediction systems to test the quality of general protein disorder identification by such systems. Out-of-sample cross-validation performance of several predictors was computed first, followed by tests on an unrelated database of proteins with long disordered regions, and by the application of a few selected predictors to two large protein data banks: Nrl_3D, currently containing more than 10,000 protein fragments of known 3D structure, and Swiss Protein, containing almost 60,000 protein sequences. The results provide evidence that long disordered regions are common in nature, with an estimate that 11% of all residues in the Swiss Protein data bank belong to disordered regions of length 40 or greater.
The hypothesis that different types of protein disorder exist is supported by the high-specificity, low-sensitivity results of two family-specific predictors, by hybrid systems outperforming general models on a two-family test, and by significant gaps between the Swiss Protein and Nrl_3D disorder frequency estimates for both families. These findings call for a revision of the current understanding of protein structure and function, as well as for the development of improved disorder predictors, which should have important uses in biotechnology applications.
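
The headline statistic in the abstract (the share of residues that lie in disordered regions of length 40 or greater) can be illustrated with a small sketch. This is not the authors' pipeline: it assumes that some predictor has already produced one boolean call per residue, and the function names `residues_in_long_disorder` and `long_disorder_fraction`, as well as the toy data, are hypothetical.

```python
from itertools import groupby
from typing import Iterable, List


def residues_in_long_disorder(labels: Iterable[bool], min_len: int = 40) -> int:
    """Count residues inside runs of consecutive disordered calls of length >= min_len.

    `labels` holds one boolean per residue (True = predicted disordered).
    This input format is an assumption made for illustration.
    """
    total = 0
    for is_disordered, run in groupby(labels):
        run_len = sum(1 for _ in run)
        if is_disordered and run_len >= min_len:
            total += run_len
    return total


def long_disorder_fraction(proteins: List[List[bool]], min_len: int = 40) -> float:
    """Fraction of all residues in the data set that fall in long disordered runs."""
    n_residues = sum(len(p) for p in proteins)
    if n_residues == 0:
        return 0.0
    n_long = sum(residues_in_long_disorder(p, min_len) for p in proteins)
    return n_long / n_residues


if __name__ == "__main__":
    # Toy per-residue calls, not real predictions.
    toy = [
        [False] * 60 + [True] * 45 + [False] * 20,  # one disordered run of 45 residues
        [True] * 10 + [False] * 90,                 # run of 10 is too short to count
    ]
    print(long_disorder_fraction(toy))
```

Counting residues rather than proteins matches the way the abstract states its estimate; a per-protein statistic would need a different denominator.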

Original language: English (US)
Pages (from-to): 447-484
Number of pages: 38
Journal: Artificial Intelligence Review
Volume: 14
Issue number: 6
DOI: 10.1023/A:1006678623815
State: Published - Dec 2000
Externally published: Yes

Keywords

  • Data mining
  • Neural networks
  • Protein databases
  • Protein disorder prediction
  • Protein structure

ASJC Scopus subject areas

  • Control and Systems Engineering
  • Artificial Intelligence

Cite this

Intelligent data analysis for protein disorder prediction. / Romero, Pedro; Obradovic, Zoran; Dunker, A.

In: Artificial Intelligence Review, Vol. 14, No. 6, 12.2000, p. 447-484.

@article{7eb0d901090f4e57a6ffdfb3e84a1c18,
title = "Intelligent data analysis for protein disorder prediction",
keywords = "Data mining, Neural networks, Protein databases, Protein disorder prediction, Protein structure",
author = "Pedro Romero and Zoran Obradovic and A. Dunker",
year = "2000",
month = "12",
doi = "10.1023/A:1006678623815",
language = "English (US)",
volume = "14",
pages = "447--484",
journal = "Artificial Intelligence Review",
issn = "0269-2821",
publisher = "Springer Netherlands",
number = "6"
}

TY - JOUR
T1 - Intelligent data analysis for protein disorder prediction
AU - Romero, Pedro
AU - Obradovic, Zoran
AU - Dunker, A.
PY - 2000/12
Y1 - 2000/12

KW - Data mining
KW - Neural networks
KW - Protein databases
KW - Protein disorder prediction
KW - Protein structure
UR - http://www.scopus.com/inward/record.url?scp=0034458146&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=0034458146&partnerID=8YFLogxK
U2 - 10.1023/A:1006678623815
DO - 10.1023/A:1006678623815
M3 - Article
AN - SCOPUS:0034458146
VL - 14
SP - 447
EP - 484
JO - Artificial Intelligence Review
JF - Artificial Intelligence Review
SN - 0269-2821
IS - 6
ER -