Classification and knowledge discovery in protein databases

Predrag Radivojac, Nitesh V. Chawla, A. Dunker, Zoran Obradovic

Research output: Contribution to journal › Article

58 Citations (Scopus)

Abstract

We consider the problem of classification in noisy, high-dimensional, and class-imbalanced protein datasets. In order to design a complete classification system, we use a three-stage machine learning framework consisting of a feature selection stage, a method addressing noise and class-imbalance, and a method for combining biologically related tasks through a prior-knowledge based clustering. In the first stage, we employ Fisher's permutation test as a feature selection filter. Comparisons with the alternative criteria show that it may be favorable for typical protein datasets. In the second stage, noise and class imbalance are addressed by using minority class over-sampling, majority class under-sampling, and ensemble learning. The performance of logistic regression models, decision trees, and neural networks is systematically evaluated. The experimental results show that in many cases ensembles of logistic regression classifiers may outperform more expressive models due to their robustness to noise and low sample density in a high-dimensional feature space. However, ensembles of neural networks may be the best solution for large datasets. In the third stage, we use prior knowledge to partition unlabeled data such that the class distributions among non-overlapping clusters significantly differ. In our experiments, training classifiers specialized to the class distributions of each cluster resulted in a further decrease in classification error.
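
As a rough illustration of the first two stages described above, the sketch below filters features with a permutation test and draws a class-balanced training sample by majority-class under-sampling. It is a minimal sketch under stated assumptions, not the authors' implementation: the abstract does not give the test statistic, so the absolute difference in class means is assumed, and the function names (`permutation_p_value`, `select_features`, `balanced_sample`) and the defaults `alpha` and `n_perm` are hypothetical.

```python
import numpy as np


def permutation_p_value(x, y, n_perm=2000, rng=None):
    """Two-sided permutation p-value for one feature.

    Assumed statistic: absolute difference between the class means.
    x: 1-D feature values; y: boolean class labels of the same length.
    """
    rng = np.random.default_rng(rng)
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=bool)
    observed = abs(x[y].mean() - x[~y].mean())
    hits = 0
    for _ in range(n_perm):
        perm = rng.permutation(y)  # shuffle labels, keep class sizes fixed
        if abs(x[perm].mean() - x[~perm].mean()) >= observed:
            hits += 1
    # Add-one correction so the estimated p-value is never exactly zero.
    return (hits + 1) / (n_perm + 1)


def select_features(X, y, alpha=0.05, n_perm=2000, seed=0):
    """Filter step: keep the columns of X whose p-value is below alpha."""
    rng = np.random.default_rng(seed)
    p = np.array([
        permutation_p_value(X[:, j], y, n_perm, rng)
        for j in range(X.shape[1])
    ])
    return np.flatnonzero(p < alpha), p


def balanced_sample(y, rng):
    """Majority-class under-sampling: return row indices containing every
    minority example plus an equally sized random subset of the majority."""
    y = np.asarray(y, dtype=bool)
    pos, neg = np.flatnonzero(y), np.flatnonzero(~y)
    minority, majority = (pos, neg) if len(pos) <= len(neg) else (neg, pos)
    drawn = rng.choice(majority, size=len(minority), replace=False)
    return np.concatenate([minority, drawn])
```

Calling `balanced_sample` repeatedly with the same generator yields different balanced subsets; training one classifier per subset and averaging the predictions is one simple way to form an ensemble of the kind the abstract mentions.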

Original language: English
Pages (from-to): 224-239
Number of pages: 16
Journal: Journal of Biomedical Informatics
Volume: 37
Issue number: 4
DOI: 10.1016/j.jbi.2004.07.008
State: Published - Aug 2004

Fingerprint

Protein Databases
Data mining
Noise
Logistic Models
Proteins
Logistics
Feature extraction
Classifiers
Sampling
Neural networks
Decision Trees
Cluster Analysis
Learning systems
Learning
Datasets
Experiments

Keywords

  • Class imbalance
  • Class-distribution estimation
  • Classification
  • Clustering
  • Feature selection
  • Noise

ASJC Scopus subject areas

  • Computer Science Applications
  • Health Informatics

Cite this

Classification and knowledge discovery in protein databases. / Radivojac, Predrag; Chawla, Nitesh V.; Dunker, A.; Obradovic, Zoran.

In: Journal of Biomedical Informatics, Vol. 37, No. 4, 08.2004, p. 224-239.

Radivojac, Predrag ; Chawla, Nitesh V. ; Dunker, A. ; Obradovic, Zoran. / Classification and knowledge discovery in protein databases. In: Journal of Biomedical Informatics. 2004 ; Vol. 37, No. 4. pp. 224-239.
@article{6bdde3b458f245f3b4bcbc86401c1298,
title = "Classification and knowledge discovery in protein databases",
abstract = "We consider the problem of classification in noisy, high-dimensional, and class-imbalanced protein datasets. In order to design a complete classification system, we use a three-stage machine learning framework consisting of a feature selection stage, a method addressing noise and class-imbalance, and a method for combining biologically related tasks through a prior-knowledge based clustering. In the first stage, we employ Fisher's permutation test as a feature selection filter. Comparisons with the alternative criteria show that it may be favorable for typical protein datasets. In the second stage, noise and class imbalance are addressed by using minority class over-sampling, majority class under-sampling, and ensemble learning. The performance of logistic regression models, decision trees, and neural networks is systematically evaluated. The experimental results show that in many cases ensembles of logistic regression classifiers may outperform more expressive models due to their robustness to noise and low sample density in a high-dimensional feature space. However, ensembles of neural networks may be the best solution for large datasets. In the third stage, we use prior knowledge to partition unlabeled data such that the class distributions among non-overlapping clusters significantly differ. In our experiments, training classifiers specialized to the class distributions of each cluster resulted in a further decrease in classification error.",
keywords = "Class imbalance, Class-distribution estimation, Classification, Clustering, Feature selection, Noise",
author = "Predrag Radivojac and Chawla, {Nitesh V.} and A. Dunker and Zoran Obradovic",
year = "2004",
month = "8",
doi = "10.1016/j.jbi.2004.07.008",
language = "English",
volume = "37",
pages = "224--239",
journal = "Journal of Biomedical Informatics",
issn = "1532-0464",
publisher = "Academic Press Inc.",
number = "4",

}

TY - JOUR

T1 - Classification and knowledge discovery in protein databases

AU - Radivojac, Predrag

AU - Chawla, Nitesh V.

AU - Dunker, A.

AU - Obradovic, Zoran

PY - 2004/8

Y1 - 2004/8

AB - We consider the problem of classification in noisy, high-dimensional, and class-imbalanced protein datasets. In order to design a complete classification system, we use a three-stage machine learning framework consisting of a feature selection stage, a method addressing noise and class-imbalance, and a method for combining biologically related tasks through a prior-knowledge based clustering. In the first stage, we employ Fisher's permutation test as a feature selection filter. Comparisons with the alternative criteria show that it may be favorable for typical protein datasets. In the second stage, noise and class imbalance are addressed by using minority class over-sampling, majority class under-sampling, and ensemble learning. The performance of logistic regression models, decision trees, and neural networks is systematically evaluated. The experimental results show that in many cases ensembles of logistic regression classifiers may outperform more expressive models due to their robustness to noise and low sample density in a high-dimensional feature space. However, ensembles of neural networks may be the best solution for large datasets. In the third stage, we use prior knowledge to partition unlabeled data such that the class distributions among non-overlapping clusters significantly differ. In our experiments, training classifiers specialized to the class distributions of each cluster resulted in a further decrease in classification error.

KW - Class imbalance

KW - Class-distribution estimation

KW - Classification

KW - Clustering

KW - Feature selection

KW - Noise

UR - http://www.scopus.com/inward/record.url?scp=4744344959&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=4744344959&partnerID=8YFLogxK

U2 - 10.1016/j.jbi.2004.07.008

DO - 10.1016/j.jbi.2004.07.008

M3 - Article

VL - 37

SP - 224

EP - 239

JO - Journal of Biomedical Informatics

JF - Journal of Biomedical Informatics

SN - 1532-0464

IS - 4

ER -