Joint between-sample normalization and differential expression detection through ℓ₀-regularized regression

Kefei Liu, Li Shen, Hui Jiang

Research output: Contribution to journal › Article

Abstract

Background: A fundamental problem in RNA-seq data analysis is to identify genes or exons that are differentially expressed across varying experimental conditions based on the read counts. Because RNA-seq measurements are relative, between-sample normalization of read counts is an essential step in differential expression (DE) analysis. In most existing methods, the normalization step is performed prior to the DE analysis. Recently, Jiang and Zhan proposed a statistical method that introduces sample-specific normalization parameters into a joint model, which allows for simultaneous normalization and differential expression analysis from log-transformed RNA-seq data. Furthermore, an ℓ₀ penalty is used to yield a sparse solution that selects a subset of DE genes. The experimental conditions are restricted to be categorical in their work. Results: In this paper, we generalize Jiang and Zhan's method to handle experimental conditions that are measured as continuous variables. As a result, genes with expression levels associated with a single covariate or multiple covariates can be detected. Because the problem is high-dimensional, non-differentiable and non-convex, we develop an efficient algorithm for model fitting. Conclusions: Experiments on synthetic data demonstrate that the proposed method outperforms existing methods in terms of detection accuracy when a large fraction of genes are differentially expressed in an asymmetric manner, and the performance gain becomes more substantial for larger sample sizes. We also apply our method to a real prostate cancer RNA-seq dataset to identify genes associated with pre-operative prostate-specific antigen (PSA) levels in patients.

Original language: English (US)
Article number: 593
Journal: BMC Bioinformatics
Volume: 20
DOIs: https://doi.org/10.1186/s12859-019-3070-4
State: Published - Dec 2 2019
Externally published: Yes

Keywords

  • Between-sample normalization
  • Differential expression
  • RNA-seq
  • ℓ₀-regularized regression

ASJC Scopus subject areas

  • Structural Biology
  • Biochemistry
  • Molecular Biology
  • Computer Science Applications
  • Applied Mathematics

Cite this

Joint between-sample normalization and differential expression detection through ℓ₀-regularized regression. / Liu, Kefei; Shen, Li; Jiang, Hui.

In: BMC Bioinformatics, Vol. 20, 593, 02.12.2019.


@article{4c97d36fdcf4496388dd1fbb05f92349,
title = "Joint between-sample normalization and differential expression detection through ℓ₀-regularized regression",
abstract = "Background: A fundamental problem in RNA-seq data analysis is to identify genes or exons that are differentially expressed across varying experimental conditions based on the read counts. Because RNA-seq measurements are relative, between-sample normalization of read counts is an essential step in differential expression (DE) analysis. In most existing methods, the normalization step is performed prior to the DE analysis. Recently, Jiang and Zhan proposed a statistical method that introduces sample-specific normalization parameters into a joint model, which allows for simultaneous normalization and differential expression analysis from log-transformed RNA-seq data. Furthermore, an ℓ₀ penalty is used to yield a sparse solution that selects a subset of DE genes. The experimental conditions are restricted to be categorical in their work. Results: In this paper, we generalize Jiang and Zhan's method to handle experimental conditions that are measured as continuous variables. As a result, genes with expression levels associated with a single covariate or multiple covariates can be detected. Because the problem is high-dimensional, non-differentiable and non-convex, we develop an efficient algorithm for model fitting. Conclusions: Experiments on synthetic data demonstrate that the proposed method outperforms existing methods in terms of detection accuracy when a large fraction of genes are differentially expressed in an asymmetric manner, and the performance gain becomes more substantial for larger sample sizes. We also apply our method to a real prostate cancer RNA-seq dataset to identify genes associated with pre-operative prostate-specific antigen (PSA) levels in patients.",
keywords = "Between-sample normalization, Differential expression, RNA-seq, ℓ₀-regularized regression",
author = "Kefei Liu and Li Shen and Hui Jiang",
year = "2019",
month = "12",
day = "2",
doi = "10.1186/s12859-019-3070-4",
language = "English (US)",
volume = "20",
journal = "BMC Bioinformatics",
issn = "1471-2105",
publisher = "BioMed Central",

}

TY - JOUR

T1 - Joint between-sample normalization and differential expression detection through ℓ₀-regularized regression

AU - Liu, Kefei

AU - Shen, Li

AU - Jiang, Hui

PY - 2019/12/2

Y1 - 2019/12/2

AB - Background: A fundamental problem in RNA-seq data analysis is to identify genes or exons that are differentially expressed across varying experimental conditions based on the read counts. Because RNA-seq measurements are relative, between-sample normalization of read counts is an essential step in differential expression (DE) analysis. In most existing methods, the normalization step is performed prior to the DE analysis. Recently, Jiang and Zhan proposed a statistical method that introduces sample-specific normalization parameters into a joint model, which allows for simultaneous normalization and differential expression analysis from log-transformed RNA-seq data. Furthermore, an ℓ₀ penalty is used to yield a sparse solution that selects a subset of DE genes. The experimental conditions are restricted to be categorical in their work. Results: In this paper, we generalize Jiang and Zhan's method to handle experimental conditions that are measured as continuous variables. As a result, genes with expression levels associated with a single covariate or multiple covariates can be detected. Because the problem is high-dimensional, non-differentiable and non-convex, we develop an efficient algorithm for model fitting. Conclusions: Experiments on synthetic data demonstrate that the proposed method outperforms existing methods in terms of detection accuracy when a large fraction of genes are differentially expressed in an asymmetric manner, and the performance gain becomes more substantial for larger sample sizes. We also apply our method to a real prostate cancer RNA-seq dataset to identify genes associated with pre-operative prostate-specific antigen (PSA) levels in patients.

KW - Between-sample normalization

KW - Differential expression

KW - RNA-seq

KW - ℓ₀-regularized regression

UR - http://www.scopus.com/inward/record.url?scp=85075843690&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=85075843690&partnerID=8YFLogxK

U2 - 10.1186/s12859-019-3070-4

DO - 10.1186/s12859-019-3070-4

M3 - Article

AN - SCOPUS:85075843690

VL - 20

JO - BMC Bioinformatics

JF - BMC Bioinformatics

SN - 1471-2105

M1 - 593

ER -