Protein sequence alignment and structural disorder: A substitution matrix for an extended alphabet

Uros Midic, A. Keith Dunker, Zoran Obradovic

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

5 Scopus citations

Abstract

In protein sequence alignment algorithms, a substitution matrix of 20×20 alignment parameters is used to describe the rates of amino acid substitutions over time. The development and evaluation of most substitution matrices, including the BLOSUM family [1], were based almost entirely on fully structured proteins. Structurally disordered proteins (i.e., proteins that lack structure, either in part or as a whole), which have been shown to be very common in nature [2], have a significantly different amino acid composition than ordered (i.e., structured) proteins [3]. Furthermore, in proteins containing both structured and unstructured regions, the sequence evolution rate is higher in the unstructured regions than in the structured ones [4]. These results cast doubt on the appropriateness of the BLOSUM substitution matrices for alignment of structurally disordered proteins [5].

To address this problem, we take the concept of structural disorder into account by extending the alphabet for sequence representation from 20 to 2×20=40 symbols: 20 for amino acids in disordered regions and 20 for amino acids in ordered regions. A 40×40 substitution matrix is required for alignment of sequences represented in the extended alphabet. Such an expanded matrix contains 20×20 submatrices that correspond to matching ordered-ordered, ordered-disordered, and disordered-disordered pairs of residues. In this paper we describe an iterative procedure that we used to estimate such a 40×40 substitution matrix. The iterative procedure converged, and its results were stable with respect to the choice of sequences in the dataset. In the resulting 40×40 matrix we found substantial differences between the 20×20 submatrices corresponding to ordered-ordered, ordered-disordered, and disordered-disordered region matching. These differences provide evidence that, for alignment of protein sequences that contain disordered segments, the new substitution matrix is more appropriate than the BLOSUM substitution matrices. At the same time, the new substitution matrix is applicable to sequence alignment of fully ordered proteins, as its ordered-ordered submatrix is very similar to a BLOSUM matrix.
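The extended alphabet and the block structure of the 40×40 matrix can be illustrated with a minimal Python sketch. It is not the authors' implementation. The index convention (0-19 for ordered residues, 20-39 for disordered residues), the residue ordering, the function names, and the placeholder matrix are all assumptions made for illustration; the matrix values are not the published substitution scores.

```python
import numpy as np

# Assumed residue ordering (alphabetical one-letter codes); hypothetical choice.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
N = len(AMINO_ACIDS)  # 20


def encode(sequence, disorder_mask):
    """Map residues to indices in the 40-symbol alphabet.

    Assumed convention: 0..19 = ordered residues, 20..39 = disordered residues.
    disorder_mask[i] is True when residue i lies in a disordered region.
    """
    if len(sequence) != len(disorder_mask):
        raise ValueError("sequence and disorder_mask must have equal length")
    return [AMINO_ACIDS.index(aa) + (N if disordered else 0)
            for aa, disordered in zip(sequence, disorder_mask)]


def split_blocks(matrix):
    """Return the three 20x20 submatrices of a 40x40 substitution matrix."""
    m = np.asarray(matrix)
    if m.shape != (2 * N, 2 * N):
        raise ValueError("expected a 40x40 matrix")
    return {
        "ordered-ordered": m[:N, :N],
        "ordered-disordered": m[:N, N:],
        "disordered-disordered": m[N:, N:],
    }


def score_ungapped(encoded_a, encoded_b, matrix):
    """Sum substitution scores over an ungapped, equal-length alignment."""
    if len(encoded_a) != len(encoded_b):
        raise ValueError("encoded sequences must have equal length")
    m = np.asarray(matrix)
    return int(sum(m[i, j] for i, j in zip(encoded_a, encoded_b)))


# Placeholder matrix with made-up values, used only to show the indexing.
placeholder = np.arange((2 * N) ** 2).reshape(2 * N, 2 * N)
encoded = encode("ACD", [False, True, False])
blocks = split_blocks(placeholder)
```

With this layout, a standard 20×20 matrix such as a BLOSUM matrix could be compared directly against the `"ordered-ordered"` block, which is the comparison the abstract refers to.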

Original language: English (US)
Title of host publication: Proceedings of the KDD-09 Workshop on Statistical and Relational Learning in Bioinformatics, StReBio '09
Pages: 27-31
Number of pages: 5
DOIs: https://doi.org/10.1145/1562090.1562096
State: Published - Nov 9 2009
Event: KDD-09 Workshop on Statistical and Relational Learning in Bioinformatics, StReBio '09 - Paris, France
Duration: Jun 28 2009 → Jun 28 2009

Publication series

Name: Proceedings of the KDD-09 Workshop on Statistical and Relational Learning in Bioinformatics, StReBio '09

Other

Other: KDD-09 Workshop on Statistical and Relational Learning in Bioinformatics, StReBio '09
Country: France
City: Paris
Period: 6/28/09 → 6/28/09

Keywords

  • Protein sequence alignment
  • Structurally disordered proteins
  • Substitution matrices

ASJC Scopus subject areas

  • Software
  • Biomedical Engineering
  • Health Informatics

Cite this

Midic, U., Dunker, A. K., & Obradovic, Z. (2009). Protein sequence alignment and structural disorder: A substitution matrix for an extended alphabet. In Proceedings of the KDD-09 Workshop on Statistical and Relational Learning in Bioinformatics, StReBio '09 (pp. 27-31). (Proceedings of the KDD-09 Workshop on Statistical and Relational Learning in Bioinformatics, StReBio '09). https://doi.org/10.1145/1562090.1562096