### Abstract

In protein sequence alignment algorithms, a substitution matrix of 20×20 alignment parameters describes the rates of amino acid substitutions over time. The development and evaluation of most substitution matrices, including the BLOSUM family [1], was based almost entirely on fully structured proteins. Structurally disordered proteins (i.e., proteins that lack structure, either in part or as a whole), which have been shown to be very common in nature [2], have a significantly different amino acid composition than ordered (i.e., structured) proteins [3]. Furthermore, in proteins containing both structured and unstructured regions, the sequence evolution rate is higher in the unstructured regions than in the structured ones [4]. These results cast doubt on the appropriateness of the BLOSUM substitution matrices for alignment of structurally disordered proteins [5].

To address this problem, we take structural disorder into account by extending the alphabet for sequence representation from 20 to 2×20=40 symbols: 20 for amino acids in disordered regions and 20 for amino acids in ordered regions. A 40×40 substitution matrix is required for alignment of sequences represented in the extended alphabet. Such an expanded matrix contains 20×20 submatrices that correspond to matching ordered-ordered, ordered-disordered, and disordered-disordered pairs of residues. In this paper we describe an iterative procedure that we used to estimate such a 40×40 substitution matrix. The iterative procedure converged, and its results were stable with respect to the choice of sequences in the dataset. In the obtained 40×40 matrix we found substantial differences between the 20×20 submatrices corresponding to ordered-ordered, ordered-disordered, and disordered-disordered region matching. These differences provide evidence that, for alignment of protein sequences that contain disordered segments, the discovered substitution matrix is more appropriate than the BLOSUM substitution matrices. At the same time, the new substitution matrix is applicable to sequence alignment of fully ordered proteins, as its ordered-ordered submatrix is very similar to a BLOSUM matrix.
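The extended alphabet and the block structure of the 40×40 matrix can be illustrated with a minimal sketch. It is not the paper's implementation: the residue ordering, the index convention (ordered residues at 0..19, disordered at 20..39), the function names `encode` and `split_blocks`, and the placeholder matrix values are all assumptions made for illustration. The iterative estimation procedure itself is not shown.

```python
# Assumed residue order (a common BLOSUM-style ordering); the paper may use another.
AMINO_ACIDS = "ARNDCQEGHILKMFPSTWYV"
AA_INDEX = {aa: i for i, aa in enumerate(AMINO_ACIDS)}


def encode(sequence, disordered):
    """Map residues and per-residue disorder flags to indices in 0..39.

    Hypothetical convention: an ordered residue keeps its index (0..19),
    a disordered residue is shifted by 20 (20..39).
    """
    if len(sequence) != len(disordered):
        raise ValueError("sequence and disorder flags must have equal length")
    return [AA_INDEX[aa] + (20 if flag else 0)
            for aa, flag in zip(sequence, disordered)]


def split_blocks(matrix40):
    """Return the three 20x20 submatrices named in the abstract."""
    if len(matrix40) != 40 or any(len(row) != 40 for row in matrix40):
        raise ValueError("expected a 40x40 matrix")

    def block(row_start, col_start):
        return [row[col_start:col_start + 20]
                for row in matrix40[row_start:row_start + 20]]

    return {
        "ordered-ordered": block(0, 0),
        "ordered-disordered": block(0, 20),
        "disordered-disordered": block(20, 20),
    }


# Placeholder values only; these are NOT estimated substitution scores.
placeholder = [[40 * i + j for j in range(40)] for i in range(40)]

codes = encode("MKV", [False, True, True])
blocks = split_blocks(placeholder)
# Score for aligning ordered M against disordered K under the placeholder matrix.
score = placeholder[codes[0]][codes[1]]
```

Under this convention, a pairwise score lookup is a single index into the 40×40 matrix, so an existing alignment routine could in principle be reused once sequences are encoded with their disorder annotation.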

Original language | English (US) |
---|---|
Title of host publication | Proceedings of the KDD-09 Workshop on Statistical and Relational Learning in Bioinformatics, StReBio '09 |
Pages | 27-31 |
Number of pages | 5 |
DOIs | https://doi.org/10.1145/1562090.1562096 |
State | Published - Nov 9 2009 |
Event | KDD-09 Workshop on Statistical and Relational Learning in Bioinformatics, StReBio '09 - Paris, France. Duration: Jun 28 2009 → Jun 28 2009 |

### Publication series

Name | Proceedings of the KDD-09 Workshop on Statistical and Relational Learning in Bioinformatics, StReBio '09 |
---|---|

### Other

Other | KDD-09 Workshop on Statistical and Relational Learning in Bioinformatics, StReBio '09 |
---|---|
Country | France |
City | Paris |
Period | 6/28/09 → 6/28/09 |

### Keywords

- Protein sequence alignment
- Structurally disordered proteins
- Substitution matrices

### ASJC Scopus subject areas

- Software
- Biomedical Engineering
- Health Informatics

### Cite this

*Proceedings of the KDD-09 Workshop on Statistical and Relational Learning in Bioinformatics, StReBio '09* (pp. 27-31). (Proceedings of the KDD-09 Workshop on Statistical and Relational Learning in Bioinformatics, StReBio '09). https://doi.org/10.1145/1562090.1562096