Predictive power estimation algorithm (PPEA) - a new algorithm to reduce overfitting for genomic biomarker discovery

Jiangang Liu, Robert A. Jolly, Aaron T. Smith, George H. Searfoss, Keith M. Goldstein, Vladimir N. Uversky, A. Dunker, Shuyu Li, Craig E. Thomas, Tao Wei

Research output: Contribution to journal, Article

6 Citations (Scopus)

Abstract

Toxicogenomics promises to aid in predicting adverse effects, understanding the mechanisms of drug action or toxicity, and uncovering unexpected or secondary pharmacology. However, modeling adverse effects using high-dimensional, high-noise genomic data is prone to overfitting. Models constructed from such data sets often consist of a large number of genes with no obvious functional relevance to the biological effect the model intends to predict, which can make the modeling results challenging to interpret. To address these issues, we developed a novel algorithm, the Predictive Power Estimation Algorithm (PPEA), which estimates the predictive power of each individual transcript through an iterative two-way bootstrapping procedure. By repeatedly enforcing that the sample number is larger than the transcript number in each iteration of modeling and testing, PPEA reduces the potential risk of overfitting. We show with three different case studies that: (1) PPEA can quickly derive a reliable rank order of the predictive power of individual transcripts in a relatively small number of iterations; (2) the top-ranked transcripts tend to be functionally related to the phenotype they are intended to predict; (3) using only the most predictive top-ranked transcripts greatly facilitates development of a multiplex assay, such as qRT-PCR, as a biomarker; and (4) more importantly, a small number of genes identified from the top-ranked transcripts are highly predictive of phenotype, as their expression changes distinguished adverse from nonadverse effects of compounds in completely independent tests. Thus, we believe that PPEA effectively addresses the overfitting problem and can be used to facilitate genomic biomarker discovery for predictive toxicology and drug responses.
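The abstract gives only the outline of the procedure, so the following is a minimal sketch of the general idea rather than the authors' implementation. It assumes that "two-way" means resampling both samples and transcripts in each iteration, that out-of-bag samples serve as the test set, and that a transcript's predictive power is the mean test accuracy of the models it took part in. The nearest-centroid classifier is a stand-in chosen for brevity, and the names `estimate_predictive_power`, `centroid_accuracy`, `n_iter`, and `n_feats` are hypothetical.

```python
import random
from statistics import mean


def centroid_accuracy(X, y, train_idx, test_idx, feats):
    """Fit a nearest-centroid classifier (a stand-in model) and score it."""
    centroids = {}
    for label in {y[i] for i in train_idx}:
        rows = [X[i] for i in train_idx if y[i] == label]
        centroids[label] = [mean(r[f] for r in rows) for f in feats]

    def predict(row):
        def sq_dist(label):
            return sum((row[f] - c) ** 2 for f, c in zip(feats, centroids[label]))
        return min(centroids, key=sq_dist)

    # Fraction of held-out samples classified correctly
    return mean(1.0 if predict(X[i]) == y[i] else 0.0 for i in test_idx)


def estimate_predictive_power(X, y, n_iter=500, n_feats=3, seed=0):
    """Score each transcript by the mean test accuracy of models that used it."""
    rng = random.Random(seed)
    n_samples, n_transcripts = len(X), len(X[0])
    scores = [[] for _ in range(n_transcripts)]
    for _ in range(n_iter):
        # Resample the samples with replacement; out-of-bag samples form the test set
        train_idx = [rng.randrange(n_samples) for _ in range(n_samples)]
        in_bag = set(train_idx)
        test_idx = [i for i in range(n_samples) if i not in in_bag]
        # Enforce more distinct training samples than transcripts in the model
        if len(in_bag) <= n_feats or not test_idx:
            continue
        # Skip draws that happen to contain only one class
        if len({y[i] for i in train_idx}) < 2:
            continue
        # Draw a small random subset of transcripts for this iteration (assumption)
        feats = rng.sample(range(n_transcripts), n_feats)
        acc = centroid_accuracy(X, y, train_idx, test_idx, feats)
        for f in feats:
            scores[f].append(acc)
    return [mean(s) if s else 0.0 for s in scores]


# Synthetic data (not from the paper): transcript 0 separates the classes,
# transcripts 1 to 5 are pure noise.
data_rng = random.Random(1)
y = [0] * 10 + [1] * 10
X = [[100.0 * label + data_rng.gauss(0, 1)] + [data_rng.gauss(0, 1) for _ in range(5)]
     for label in y]

power = estimate_predictive_power(X, y)
ranking = sorted(range(len(power)), key=lambda f: power[f], reverse=True)
print(ranking)
```

Keeping `n_feats` below the number of distinct training samples is what mirrors the constraint described in the abstract; the scoring rule and the classifier could be swapped for whatever model the full paper specifies.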

Original language: English
Article number: e24233
Journal: PLoS One
Volume: 6
Issue number: 9
DOI: https://doi.org/10.1371/journal.pone.0024233
State: Published - Sep 15 2011

Fingerprint

Biomarkers
Genomics
Genes
Toxicogenetics
Phenotype
Toxicogenomics
Biological Models
Adverse effects
Drugs
Pharmaceutical Preparations
Toxicology
Pharmacology
Toxicity
Noise
Assays

ASJC Scopus subject areas

  • Agricultural and Biological Sciences(all)
  • Biochemistry, Genetics and Molecular Biology(all)
  • Medicine(all)

Cite this

Liu, J., Jolly, R. A., Smith, A. T., Searfoss, G. H., Goldstein, K. M., Uversky, V. N., ... Wei, T. (2011). Predictive power estimation algorithm (PPEA) - a new algorithm to reduce overfitting for genomic biomarker discovery. PLoS One, 6(9), [e24233]. https://doi.org/10.1371/journal.pone.0024233

@article{42928f7febfb4154ae7292ad654b7652,
title = "Predictive power estimation algorithm (PPEA) - a new algorithm to reduce overfitting for genomic biomarker discovery",
author = "Jiangang Liu and Jolly, {Robert A.} and Smith, {Aaron T.} and Searfoss, {George H.} and Goldstein, {Keith M.} and Uversky, {Vladimir N.} and A. Dunker and Shuyu Li and Thomas, {Craig E.} and Tao Wei",
year = "2011",
month = "9",
day = "15",
doi = "10.1371/journal.pone.0024233",
language = "English",
volume = "6",
journal = "PLoS One",
issn = "1932-6203",
publisher = "Public Library of Science",
number = "9",

}

TY - JOUR

T1 - Predictive power estimation algorithm (PPEA) - a new algorithm to reduce overfitting for genomic biomarker discovery

AU - Liu, Jiangang

AU - Jolly, Robert A.

AU - Smith, Aaron T.

AU - Searfoss, George H.

AU - Goldstein, Keith M.

AU - Uversky, Vladimir N.

AU - Dunker, A.

AU - Li, Shuyu

AU - Thomas, Craig E.

AU - Wei, Tao

PY - 2011/9/15

Y1 - 2011/9/15

UR - http://www.scopus.com/inward/record.url?scp=80052848777&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=80052848777&partnerID=8YFLogxK

U2 - 10.1371/journal.pone.0024233

DO - 10.1371/journal.pone.0024233

M3 - Article

C2 - 21935387

AN - SCOPUS:80052848777

VL - 6

JO - PLoS One

JF - PLoS One

SN - 1932-6203

IS - 9

M1 - e24233

ER -