Protein sequence alignment and structural disorder

A substitution matrix for an extended alphabet

Uros Midic, A. Dunker, Zoran Obradovic

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

4 Citations (Scopus)

Abstract

In protein sequence alignment algorithms, a substitution matrix of 20×20 alignment parameters is used to describe the rates of amino acid substitutions over time. The development and evaluation of most substitution matrices, including the BLOSUM family [1], were based almost entirely on fully structured proteins. Structurally disordered proteins (i.e., proteins that lack structure, either in part or as a whole), which have been shown to be very common in nature [2], have a significantly different amino acid composition than ordered (i.e., structured) proteins [3]. Furthermore, in proteins containing both structured and unstructured regions, the sequence evolution rate is higher in the unstructured regions than in the structured ones [4]. These results cast doubt on the appropriateness of the BLOSUM substitution matrices for alignment of structurally disordered proteins [5].

To address this problem, we take the concept of structural disorder into account by extending the alphabet for sequence representation from 20 to 2×20=40 symbols: 20 for amino acids in disordered regions and 20 for amino acids in ordered regions. A 40×40 substitution matrix is required for alignment of sequences represented in the extended alphabet. Such an expanded matrix contains 20×20 submatrices that correspond to matching ordered-ordered, ordered-disordered, and disordered-disordered pairs of residues. In this paper we describe an iterative procedure that we used to estimate such a 40×40 substitution matrix. The iterative procedure converged, and its results were stable with respect to the choice of sequences in the dataset. In the obtained 40×40 matrix we found substantial differences between the 20×20 submatrices corresponding to ordered-ordered, ordered-disordered, and disordered-disordered region matching. These differences provide evidence that, for alignment of protein sequences that contain disordered segments, the discovered substitution matrix is more appropriate than the BLOSUM substitution matrices. At the same time, the new substitution matrix is applicable to sequence alignment of fully ordered proteins, as its ordered-ordered submatrix is very similar to a BLOSUM matrix.
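
The extended alphabet and the block structure of the 40×40 matrix can be illustrated with a minimal sketch. It is not the authors' implementation: the uppercase/lowercase encoding, the residue order, the function names (`encode`, `toy_matrix`, `submatrix`, `ungapped_score`), and above all the scores in `toy_matrix` are assumptions made for illustration only. The matrix estimated in the paper is not reproduced here.

```python
"""Toy illustration of a 40-symbol ordered/disordered alphabet (not the paper's code)."""

AMINO_ACIDS = "ARNDCQEGHILKMFPSTWYV"  # the 20 standard residues

# Assumed convention: uppercase = ordered residue, lowercase = disordered residue.
EXTENDED_ALPHABET = AMINO_ACIDS + AMINO_ACIDS.lower()
INDEX = {symbol: i for i, symbol in enumerate(EXTENDED_ALPHABET)}


def encode(sequence, disordered):
    """Rewrite a sequence in the 40-symbol alphabet using a per-residue disorder mask."""
    if len(sequence) != len(disordered):
        raise ValueError("sequence and disorder mask must have equal length")
    encoded = []
    for residue, is_disordered in zip(sequence.upper(), disordered):
        if residue not in AMINO_ACIDS:
            raise ValueError(f"unsupported residue: {residue!r}")
        encoded.append(residue.lower() if is_disordered else residue)
    return "".join(encoded)


def toy_matrix():
    """Build a placeholder 40x40 matrix; the values are invented, not estimated."""
    n = len(AMINO_ACIDS)
    matrix = [[-1] * (2 * n) for _ in range(2 * n)]
    for i in range(n):
        matrix[i][i] = matrix[i + n][i + n] = 2   # same residue, same state
        matrix[i][i + n] = matrix[i + n][i] = 1   # same residue, different state
    return matrix


def submatrix(matrix, row_state, col_state):
    """Extract one 20x20 block, e.g. submatrix(m, "ordered", "disordered")."""
    n = len(AMINO_ACIDS)
    offsets = {"ordered": 0, "disordered": n}
    r, c = offsets[row_state], offsets[col_state]
    return [row[c:c + n] for row in matrix[r:r + n]]


def ungapped_score(a, b, matrix):
    """Sum substitution scores over two equal-length encoded sequences."""
    if len(a) != len(b):
        raise ValueError("sequences must have equal length")
    return sum(matrix[INDEX[x]][INDEX[y]] for x, y in zip(a, b))


if __name__ == "__main__":
    first = encode("MKV", [False, True, True])
    second = encode("MKV", [False, False, True])
    print(first, second, ungapped_score(first, second, toy_matrix()))
```

With this layout, rows and columns 0 to 19 hold the ordered symbols and 20 to 39 the disordered ones, so the ordered-ordered, ordered-disordered, and disordered-disordered blocks mentioned in the abstract are contiguous 20×20 slices. A real alignment would also need gap penalties, which the abstract does not specify.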

Original language: English
Title of host publication: Proceedings of the KDD-09 Workshop on Statistical and Relational Learning in Bioinformatics, StReBio '09
Pages: 27-31
Number of pages: 5
DOI: https://doi.org/10.1145/1562090.1562096
State: Published - 2009
Event: KDD-09 Workshop on Statistical and Relational Learning in Bioinformatics, StReBio '09 - Paris, France
Duration: Jun 28 2009 → Jun 28 2009

Other

Other: KDD-09 Workshop on Statistical and Relational Learning in Bioinformatics, StReBio '09
Country: France
City: Paris
Period: 6/28/09 → 6/28/09

Fingerprint

Sequence Alignment
Substitution reactions
Proteins
Amino acids
Amino Acid Substitution

Keywords

  • Protein sequence alignment
  • Structurally disordered proteins
  • Substitution matrices

ASJC Scopus subject areas

  • Software
  • Biomedical Engineering
  • Health Informatics

Cite this

Midic, U., Dunker, A., & Obradovic, Z. (2009). Protein sequence alignment and structural disorder: A substitution matrix for an extended alphabet. In Proceedings of the KDD-09 Workshop on Statistical and Relational Learning in Bioinformatics, StReBio '09 (pp. 27-31) https://doi.org/10.1145/1562090.1562096

@inproceedings{736d5dace6154fbdab2b5f889f3b19b0,
title = "Protein sequence alignment and structural disorder: A substitution matrix for an extended alphabet",
keywords = "Protein sequence alignment, Structurally disordered proteins, Substitution matrices",
author = "Uros Midic and A. Dunker and Zoran Obradovic",
year = "2009",
doi = "10.1145/1562090.1562096",
language = "English",
isbn = "9781605586670",
pages = "27--31",
booktitle = "Proceedings of the KDD-09 Workshop on Statistical and Relational Learning in Bioinformatics, StReBio '09",

}

TY - GEN

T1 - Protein sequence alignment and structural disorder

T2 - A substitution matrix for an extended alphabet

AU - Midic, Uros

AU - Dunker, A.

AU - Obradovic, Zoran

PY - 2009

Y1 - 2009

KW - Protein sequence alignment

KW - Structurally disordered proteins

KW - Substitution matrices

UR - http://www.scopus.com/inward/record.url?scp=70350678701&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=70350678701&partnerID=8YFLogxK

U2 - 10.1145/1562090.1562096

DO - 10.1145/1562090.1562096

M3 - Conference contribution

SN - 9781605586670

SP - 27

EP - 31

BT - Proceedings of the KDD-09 Workshop on Statistical and Relational Learning in Bioinformatics, StReBio '09

ER -