Classification and knowledge discovery in protein databases

Predrag Radivojac, Nitesh V. Chawla, A. Keith Dunker, Zoran Obradovic

Research output: Contribution to journal › Article

63 Scopus citations


We consider the problem of classification in noisy, high-dimensional, and class-imbalanced protein datasets. In order to design a complete classification system, we use a three-stage machine learning framework consisting of a feature selection stage, a method addressing noise and class-imbalance, and a method for combining biologically related tasks through a prior-knowledge based clustering. In the first stage, we employ Fisher's permutation test as a feature selection filter. Comparisons with the alternative criteria show that it may be favorable for typical protein datasets. In the second stage, noise and class imbalance are addressed by using minority class over-sampling, majority class under-sampling, and ensemble learning. The performance of logistic regression models, decision trees, and neural networks is systematically evaluated. The experimental results show that in many cases ensembles of logistic regression classifiers may outperform more expressive models due to their robustness to noise and low sample density in a high-dimensional feature space. However, ensembles of neural networks may be the best solution for large datasets. In the third stage, we use prior knowledge to partition unlabeled data such that the class distributions among non-overlapping clusters significantly differ. In our experiments, training classifiers specialized to the class distributions of each cluster resulted in a further decrease in classification error.
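The first-stage filter can be illustrated with a minimal sketch. The abstract names a permutation test but does not state the test statistic or the significance threshold, so the absolute difference in class means, the `alpha` cutoff, and the function names `permutation_p_value` and `select_features` are all assumptions made for illustration, not the authors' implementation.

```python
import random
from statistics import mean


def permutation_p_value(values, labels, n_permutations=1000, seed=0):
    """Two-sided permutation p-value for one feature.

    Assumed statistic: absolute difference between the class means.
    Labels are expected to be 0/1 with both classes present.
    """
    rng = random.Random(seed)

    def mean_diff(lbls):
        pos = [v for v, l in zip(values, lbls) if l == 1]
        neg = [v for v, l in zip(values, lbls) if l == 0]
        return abs(mean(pos) - mean(neg))

    observed = mean_diff(labels)
    shuffled = list(labels)
    hits = 0
    for _ in range(n_permutations):
        # Shuffling labels breaks any feature/class association
        # while keeping the class sizes fixed.
        rng.shuffle(shuffled)
        if mean_diff(shuffled) >= observed:
            hits += 1
    # Add-one correction keeps the estimate strictly above zero.
    return (hits + 1) / (n_permutations + 1)


def select_features(X, y, alpha=0.05, **kwargs):
    """Return indices of columns whose p-value falls below alpha (hypothetical cutoff)."""
    selected = []
    for j in range(len(X[0])):
        column = [row[j] for row in X]
        if permutation_p_value(column, y, **kwargs) < alpha:
            selected.append(j)
    return selected
```

Here `X` is a list of feature rows and `y` a list of 0/1 class labels. A filter of this kind scores each feature independently of any classifier, so it can run before the resampling and ensemble stage described above.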

Original language: English (US)
Pages (from-to): 224-239
Number of pages: 16
Journal: Journal of Biomedical Informatics
Issue number: 4
State: Published - Aug 1 2004


Keywords

  • Class imbalance
  • Class-distribution estimation
  • Classification
  • Clustering
  • Feature selection
  • Noise

ASJC Scopus subject areas

  • Computer Science Applications
  • Health Informatics
