Correcting imbalanced reads coverage in bacterial transcriptome sequencing with extreme deep coverage

Xinjun Zhang, Dharanesh Gangaiah, Robert S. Munson, Stanley Spinola, Yunlong Liu

Research output: Contribution to journal › Article

Abstract

High-throughput bacterial RNA-Seq experiments can generate extremely high and imbalanced sequencing coverage. Over- or underestimation of gene expression levels hinders accurate gene differential expression analysis. Here we evaluated strategies to identify expression differences of genes with high coverage in bacterial transcriptome data using either raw sequence reads or unique reads with duplicate fragments removed. In addition, we proposed a generalised linear model (GLM) based approach to identify imbalance in read coverage based on sequence composition. Our results show that analysis using raw reads identifies more differentially expressed genes, with more accurate fold changes, than analysis using unique reads. We also demonstrate the presence of sequence-composition-related biases that are independent of gene expression levels and experimental conditions. Finally, genes that still showed strong coverage imbalance after correction were tagged using a statistical approach.
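
The raw-versus-unique comparison in the abstract can be illustrated with a toy calculation. The sketch below is not the authors' pipeline: the read representation, the function names (`count_reads`, `fold_change`, `simulate`) and the numbers are all hypothetical. It assumes one plausible mechanism that the abstract itself does not state: once coverage exceeds the number of distinct start positions in a gene, removing duplicates caps the count and compresses the fold change.

```python
from collections import Counter

GENE_LENGTH = 100  # hypothetical toy gene: at most 100 distinct start positions


def count_reads(reads, unique=False):
    """Count reads per gene.

    reads: iterable of (gene, start, strand) tuples, a simplified stand-in
    for aligned reads. With unique=True, reads sharing the same gene, start
    and strand are collapsed to one, mimicking duplicate removal.
    """
    if unique:
        reads = set(reads)
    return Counter(gene for gene, _start, _strand in reads)


def fold_change(count_a, count_b):
    """Ratio of condition B to condition A (no normalisation in this toy)."""
    return count_b / count_a


def simulate(n_reads, gene="geneX"):
    """Deterministically spread n_reads over the gene's start positions."""
    return [(gene, i % GENE_LENGTH, "+") for i in range(n_reads)]


control = simulate(200)  # 2 reads per start position
treated = simulate(800)  # 8 reads per start position

raw_fc = fold_change(count_reads(control)["geneX"],
                     count_reads(treated)["geneX"])
uniq_fc = fold_change(count_reads(control, unique=True)["geneX"],
                      count_reads(treated, unique=True)["geneX"])

print(f"raw fold change:    {raw_fc}")   # 800 / 200
print(f"unique fold change: {uniq_fc}")  # 100 / 100, saturated
```

In this toy setting both conditions saturate every start position, so the unique-read counts are identical and the fourfold difference seen in the raw counts disappears. Real data are less extreme, but the example shows why duplicate removal can understate expression differences for very highly covered genes.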

Original language: English
Pages (from-to): 195-213
Number of pages: 19
Journal: International Journal of Computational Biology and Drug Design
Volume: 7
Issue number: 2-3
DOI: 10.1504/IJCBDD.2014.061646
State: Published - 2014

Fingerprint

Transcriptome
Genes
Gene Expression
Bacterial RNA
RNA
Chemical analysis
Linear Models
Throughput
Experiments

Keywords

  • Bacterial transcriptome sequencing
  • Computational biology
  • Coverage imbalance
  • Gene differential expression
  • Generalised linear model
  • GLM
  • RNA-Seq
  • Tri-nucleotides

ASJC Scopus subject areas

  • Computer Science Applications
  • Drug Discovery

Cite this

Correcting imbalanced reads coverage in bacterial transcriptome sequencing with extreme deep coverage. / Zhang, Xinjun; Gangaiah, Dharanesh; Munson, Robert S.; Spinola, Stanley; Liu, Yunlong.

In: International Journal of Computational Biology and Drug Design, Vol. 7, No. 2-3, 2014, p. 195-213.

@article{8c9bb158eaf74998807ef6d834bedd8e,
title = "Correcting imbalanced reads coverage in bacterial transcriptome sequencing with extreme deep coverage",
abstract = "High throughput bacterial RNA-Seq experiments can generate extremely high and imbalanced sequencing coverage. Over- or underestimation of gene expression levels will hinder accurate gene differential expression analysis. Here we evaluated strategies to identify expression differences of genes with high coverage in bacterial transcriptome data using either raw sequence reads or unique reads with duplicate fragments removed. In addition, we proposed a generalised linear model (GLM) based approach to identify imbalance in read coverage based on sequence compositions. Our results show that analysis using raw reads identifies more differentially expressed genes with more accurate fold change than using unique reads. We also demonstrate the presence of sequence composition related biases that are independent of gene expression levels and experimental conditions. Finally, genes that still show strong coverage imbalance after correction were tagged using statistical approach.",
keywords = "Bacterial transcriptome sequencing, Computational biology, Coverage imbalance, Gene differential expression, Generalised linear model, GLM, RNA-Seq, Tri-nucleotides",
author = "Xinjun Zhang and Dharanesh Gangaiah and Munson, {Robert S.} and Stanley Spinola and Yunlong Liu",
year = "2014",
doi = "10.1504/IJCBDD.2014.061646",
language = "English",
volume = "7",
pages = "195--213",
journal = "International Journal of Computational Biology and Drug Design",
issn = "1756-0756",
publisher = "Inderscience Enterprises Ltd",
number = "2-3",
}

TY  - JOUR
T1  - Correcting imbalanced reads coverage in bacterial transcriptome sequencing with extreme deep coverage
AU  - Zhang, Xinjun
AU  - Gangaiah, Dharanesh
AU  - Munson, Robert S.
AU  - Spinola, Stanley
AU  - Liu, Yunlong
PY  - 2014
Y1  - 2014
AB  - High throughput bacterial RNA-Seq experiments can generate extremely high and imbalanced sequencing coverage. Over- or underestimation of gene expression levels will hinder accurate gene differential expression analysis. Here we evaluated strategies to identify expression differences of genes with high coverage in bacterial transcriptome data using either raw sequence reads or unique reads with duplicate fragments removed. In addition, we proposed a generalised linear model (GLM) based approach to identify imbalance in read coverage based on sequence compositions. Our results show that analysis using raw reads identifies more differentially expressed genes with more accurate fold change than using unique reads. We also demonstrate the presence of sequence composition related biases that are independent of gene expression levels and experimental conditions. Finally, genes that still show strong coverage imbalance after correction were tagged using statistical approach.
KW  - Bacterial transcriptome sequencing
KW  - Computational biology
KW  - Coverage imbalance
KW  - Gene differential expression
KW  - Generalised linear model
KW  - GLM
KW  - RNA-Seq
KW  - Tri-nucleotides
UR  - http://www.scopus.com/inward/record.url?scp=84901939129&partnerID=8YFLogxK
UR  - http://www.scopus.com/inward/citedby.url?scp=84901939129&partnerID=8YFLogxK
U2  - 10.1504/IJCBDD.2014.061646
DO  - 10.1504/IJCBDD.2014.061646
M3  - Article
C2  - 24878730
AN  - SCOPUS:84901939129
VL  - 7
SP  - 195
EP  - 213
JO  - International Journal of Computational Biology and Drug Design
JF  - International Journal of Computational Biology and Drug Design
SN  - 1756-0756
IS  - 2-3
ER  -