DDIG-in: Detecting disease-causing genetic variations due to frameshifting indels and nonsense mutations employing sequence and structural properties at nucleotide and protein levels

Lukas Folkman, Yuedong Yang, Zhixiu Li, Bela Stantic, Abdul Sattar, Matthew Mort, David N. Cooper, Yunlong Liu, Yaoqi Zhou

Research output: Contribution to journal (Article)

28 Citations (Scopus)

Abstract

Motivation: Frameshifting (FS) indels and nonsense (NS) variants disrupt the protein-coding sequence downstream of the mutation site by changing the reading frame or introducing a premature termination codon, respectively. Despite such drastic changes to the protein sequence, FS indels and NS variants have been discovered in healthy individuals. How to discriminate disease-causing from neutral FS indels and NS variants is an understudied problem.

Results: We have built a machine learning method called DDIG-in (FS) based on real human genetic variations from the Human Gene Mutation Database (inherited disease-causing) and the 1000 Genomes Project (GP) (putatively neutral). The method incorporates both sequence and predicted structural features and yields a robust performance by 10-fold cross-validation and independent tests on both FS indels and NS variants. We showed that human-derived NS variants and FS indels derived from animal orthologs can be effectively employed for independent testing of our method trained on human-derived FS indels. DDIG-in (FS) achieves a Matthews correlation coefficient (MCC) of 0.59, a sensitivity of 86%, and a specificity of 72% for FS indels. Application of DDIG-in (FS) to NS variants yields essentially the same performance (MCC of 0.43) as a method that was specifically trained for NS variants. DDIG-in (FS) was shown to make a significant improvement over existing techniques.
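The three metrics quoted in the abstract (MCC, sensitivity, specificity) are all derived from a binary confusion matrix. The sketch below shows how they relate. The counts are hypothetical and not taken from the paper: they assume a balanced set of 100 disease-causing and 100 neutral variants, chosen only so that sensitivity and specificity match the reported 86% and 72%. The function names are illustrative as well.

```python
from math import sqrt


def sensitivity(tp: int, fn: int) -> float:
    """Fraction of disease-causing variants correctly flagged."""
    return tp / (tp + fn)


def specificity(tn: int, fp: int) -> float:
    """Fraction of neutral variants correctly left unflagged."""
    return tn / (tn + fp)


def mcc(tp: int, tn: int, fp: int, fn: int) -> float:
    """Matthews correlation coefficient for a binary confusion matrix."""
    denom = sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denom == 0:
        return 0.0  # common convention when any marginal total is zero
    return (tp * tn - fp * fn) / denom


# Hypothetical balanced counts (an assumption, not data from the paper).
tp, fn = 86, 14   # 100 disease-causing variants
tn, fp = 72, 28   # 100 neutral variants

print(f"sensitivity = {sensitivity(tp, fn):.2f}")
print(f"specificity = {specificity(tn, fp):.2f}")
print(f"MCC         = {mcc(tp, tn, fp, fn):.2f}")
```

Under this balanced-set assumption the MCC comes out at about 0.59, in line with the value reported for FS indels. With a different class ratio, the same sensitivity and specificity would give a different MCC.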

Original language: English (US)
Pages (from-to): 1599-1606
Number of pages: 8
Journal: Bioinformatics
Volume: 31
Issue number: 10
DOIs: 10.1093/bioinformatics/btu862
State: Published - Sep 25 2014

Fingerprint

Genetic Variation
Nonsense Codon
Nucleotides
Structural Properties
Mutation
Genes
Proteins
Protein
Correlation coefficient
Learning systems
Animals
Reading Frames
Robust Performance
Medical Genetics
Testing
Protein Sequence
Cross-validation
Termination
Specificity

ASJC Scopus subject areas

  • Biochemistry
  • Molecular Biology
  • Computational Theory and Mathematics
  • Computer Science Applications
  • Computational Mathematics
  • Statistics and Probability
  • Medicine (all)

Cite this

DDIG-in: Detecting disease-causing genetic variations due to frameshifting indels and nonsense mutations employing sequence and structural properties at nucleotide and protein levels. / Folkman, Lukas; Yang, Yuedong; Li, Zhixiu; Stantic, Bela; Sattar, Abdul; Mort, Matthew; Cooper, David N.; Liu, Yunlong; Zhou, Yaoqi.

In: Bioinformatics, Vol. 31, No. 10, 25.09.2014, p. 1599-1606.

Folkman, Lukas; Yang, Yuedong; Li, Zhixiu; Stantic, Bela; Sattar, Abdul; Mort, Matthew; Cooper, David N.; Liu, Yunlong; Zhou, Yaoqi. / DDIG-in: Detecting disease-causing genetic variations due to frameshifting indels and nonsense mutations employing sequence and structural properties at nucleotide and protein levels. In: Bioinformatics. 2014; Vol. 31, No. 10. pp. 1599-1606.
@article{02a6bce7a5d749ffb8868c13de95ecc2,
title = "DDIG-in: Detecting disease-causing genetic variations due to frameshifting indels and nonsense mutations employing sequence and structural properties at nucleotide and protein levels",
abstract = "Motivation: Frameshifting (FS) indels and nonsense (NS) variants disrupt the protein-coding sequence downstream of the mutation site by changing the reading frame or introducing a premature termination codon, respectively. Despite such drastic changes to the protein sequence, FS indels and NS variants have been discovered in healthy individuals. How to discriminate disease-causing from neutral FS indels and NS variants is an understudied problem. Results: We have built a machine learning method called DDIG-in (FS) based on real human genetic variations from the Human Gene Mutation Database (inherited disease-causing) and the 1000 Genomes Project (GP) (putatively neutral). The method incorporates both sequence and predicted structural features and yields a robust performance by 10-fold cross-validation and independent tests on both FS indels and NS variants. We showed that human-derived NS variants and FS indels derived from animal orthologs can be effectively employed for independent testing of our method trained on human-derived FS indels. DDIG-in (FS) achieves a Matthews correlation coefficient (MCC) of 0.59, a sensitivity of 86{\%}, and a specificity of 72{\%} for FS indels. Application of DDIG-in (FS) to NS variants yields essentially the same performance (MCC of 0.43) as a method that was specifically trained for NS variants. DDIG-in (FS) was shown to make a significant improvement over existing techniques.",
author = "Lukas Folkman and Yuedong Yang and Zhixiu Li and Bela Stantic and Abdul Sattar and Matthew Mort and Cooper, {David N.} and Yunlong Liu and Yaoqi Zhou",
year = "2014",
month = "9",
day = "25",
doi = "10.1093/bioinformatics/btu862",
language = "English (US)",
volume = "31",
pages = "1599--1606",
journal = "Bioinformatics",
issn = "1367-4803",
publisher = "Oxford University Press",
number = "10",

}

TY - JOUR

T1 - DDIG-in

T2 - Detecting disease-causing genetic variations due to frameshifting indels and nonsense mutations employing sequence and structural properties at nucleotide and protein levels

AU - Folkman, Lukas

AU - Yang, Yuedong

AU - Li, Zhixiu

AU - Stantic, Bela

AU - Sattar, Abdul

AU - Mort, Matthew

AU - Cooper, David N.

AU - Liu, Yunlong

AU - Zhou, Yaoqi

PY - 2014/9/25

Y1 - 2014/9/25

N2 - Motivation: Frameshifting (FS) indels and nonsense (NS) variants disrupt the protein-coding sequence downstream of the mutation site by changing the reading frame or introducing a premature termination codon, respectively. Despite such drastic changes to the protein sequence, FS indels and NS variants have been discovered in healthy individuals. How to discriminate disease-causing from neutral FS indels and NS variants is an understudied problem. Results: We have built a machine learning method called DDIG-in (FS) based on real human genetic variations from the Human Gene Mutation Database (inherited disease-causing) and the 1000 Genomes Project (GP) (putatively neutral). The method incorporates both sequence and predicted structural features and yields a robust performance by 10-fold cross-validation and independent tests on both FS indels and NS variants. We showed that human-derived NS variants and FS indels derived from animal orthologs can be effectively employed for independent testing of our method trained on human-derived FS indels. DDIG-in (FS) achieves a Matthews correlation coefficient (MCC) of 0.59, a sensitivity of 86%, and a specificity of 72% for FS indels. Application of DDIG-in (FS) to NS variants yields essentially the same performance (MCC of 0.43) as a method that was specifically trained for NS variants. DDIG-in (FS) was shown to make a significant improvement over existing techniques.

UR - http://www.scopus.com/inward/record.url?scp=84929628001&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=84929628001&partnerID=8YFLogxK

U2 - 10.1093/bioinformatics/btu862

DO - 10.1093/bioinformatics/btu862

M3 - Article

C2 - 25573915

AN - SCOPUS:84929628001

VL - 31

SP - 1599

EP - 1606

JO - Bioinformatics

JF - Bioinformatics

SN - 1367-4803

IS - 10

ER -