| 1 | This is just an ASCII text version of the manuscript describing |
|---|
| 2 | Clustal W, without the figures. It was published: |
|---|
| 3 | |
|---|
| 4 | Nucleic Acids Research, 22(22):4673-4680. |
|---|
| 5 | |
|---|
| 6 | |
|---|
| 7 | |
|---|
| 8 | CLUSTAL W: improving the sensitivity of progressive multiple |
|---|
| 9 | sequence alignment through sequence weighting, position specific |
|---|
| 10 | gap penalties and weight matrix choice. |
|---|
| 11 | |
|---|
| 12 | |
|---|
| 13 | |
|---|
| 14 | Julie D. Thompson, Desmond G. Higgins1 and Toby J. Gibson* |
|---|
| 15 | |
|---|
| 16 | European Molecular Biology Laboratory |
|---|
| 17 | Postfach 102209 |
|---|
| 18 | Meyerhofstrasse 1 |
|---|
| 19 | D-69012 Heidelberg |
|---|
| 20 | Germany |
|---|
| 21 | |
|---|
| 22 | |
|---|
| 23 | Phone: +49-6221-387398 |
|---|
| 24 | Fax: +49-6221-387306 |
|---|
| 25 | E-mail: Gibson@EMBL-Heidelberg.DE |
|---|
| 26 | Des.Higgins@EBI.AC.UK |
|---|
| 27 | Thompson@EMBL-Heidelberg.DE |
|---|
| 28 | |
|---|
| 29 | |
|---|
| 30 | Keywords: Multiple alignment, phylogenetic tree, weight matrix, gap |
|---|
| 31 | penalty, dynamic programming, sequence weighting. |
|---|
| 32 | |
|---|
| 33 | |
|---|
| 34 | 1 Current address: |
|---|
| 35 | European Bioinformatics Institute |
|---|
| 36 | Hinxton Hall |
|---|
| 37 | Hinxton |
|---|
| 38 | Cambridge CB10 1RQ |
|---|
| 39 | UK. |
|---|
| 40 | |
|---|
| 41 | * To whom correspondence should be addressed |
|---|
| 42 | |
|---|
| 43 | |
|---|
| 44 | ABSTRACT |
|---|
| 45 | |
|---|
| 46 | The sensitivity of the commonly used progressive multiple sequence |
|---|
| 47 | alignment method has been greatly improved for the alignment of divergent |
|---|
| 48 | protein sequences. Firstly, individual weights are assigned to each sequence |
|---|
| 49 | in a partial alignment in order to downweight near-duplicate sequences and |
|---|
| 50 | upweight the most divergent ones. Secondly, amino acid substitution |
|---|
| 51 | matrices are varied at different alignment stages according to the divergence |
|---|
| 52 | of the sequences to be aligned. Thirdly, residue specific gap penalties and |
|---|
| 53 | locally reduced gap penalties in hydrophilic regions encourage new gaps in |
|---|
| 54 | potential loop regions rather than regular secondary structure. Fourthly, |
|---|
| 55 | positions in early alignments where gaps have been opened receive locally |
|---|
| 56 | reduced gap penalties to encourage the opening up of new gaps at these |
|---|
| 57 | positions. These modifications are incorporated into a new program, |
|---|
| 58 | CLUSTAL W which is freely available. |
|---|
| 59 | |
|---|
| 60 | |
|---|
| 61 | INTRODUCTION |
|---|
| 62 | |
|---|
| 63 | The simultaneous alignment of many nucleotide or amino acid sequences is |
|---|
| 64 | now an essential tool in molecular biology. Multiple alignments are used to |
|---|
| 65 | find diagnostic patterns to characterise protein families; to detect or |
|---|
| 66 | demonstrate homology between new sequences and existing families of |
|---|
| 67 | sequences; to help predict the secondary and tertiary structures of new |
|---|
| 68 | sequences; to suggest oligonucleotide primers for PCR; as an essential prelude |
|---|
| 69 | to molecular evolutionary analysis. The rate of appearance of new sequence |
|---|
| 70 | data is steadily increasing and the development of efficient and accurate |
|---|
| 71 | automatic methods for multiple alignment is, therefore, of major |
|---|
| 72 | importance. The majority of automatic multiple alignments are now carried |
|---|
| 73 | out using the "progressive" approach of Feng and Doolittle (1). In this paper, |
|---|
| 74 | we describe a number of improvements to the progressive multiple |
|---|
| 75 | alignment method which greatly improve the sensitivity without sacrificing |
|---|
| 76 | any of the speed and efficiency which makes this approach so practical. The |
|---|
| 77 | new methods are made available in a program called CLUSTAL W which is |
|---|
| 78 | freely available and portable to a wide variety of computers and operating |
|---|
| 79 | systems. |
|---|
| 80 | |
|---|
| 81 | In order to align just two sequences, it is standard practice to use dynamic |
|---|
| 82 | programming (2). This guarantees a mathematically optimal alignment, |
|---|
| 83 | given a table of scores for matches and mismatches between all amino acids |
|---|
| 84 | or nucleotides (e.g. the PAM250 matrix (3) or BLOSUM62 matrix (4)) and |
|---|
| 85 | penalties for insertions or deletions of different lengths. Attempts at |
|---|
| 86 | generalising dynamic programming to multiple alignments are limited to |
|---|
| 87 | small numbers of short sequences (5). For much more than eight or so |
|---|
| 88 | proteins of average length, the problem is uncomputable given current |
|---|
| 89 | computer power. Therefore, all of the methods capable of handling larger |
|---|
| 90 | problems in practical timescales, make use of heuristics. Currently, the most |
|---|
| 91 | widely used approach is to exploit the fact that homologous sequences are |
|---|
| 92 | evolutionarily related. One can build up a multiple alignment progressively |
|---|
| 93 | by a series of pairwise alignments, following the branching order in a |
|---|
| 94 | phylogenetic tree (1). One first aligns the most closely related sequences, |
|---|
| 95 | gradually adding in the more distant ones. This approach is sufficiently fast |
|---|
| 96 | to allow alignments of virtually any size. Further, in simple cases, the |
|---|
| 97 | quality of the alignments is excellent, as judged by the ability to correctly align |
|---|
| 98 | corresponding domains from sequences of known secondary or tertiary |
|---|
| 99 | structure (6). In more difficult cases, the alignments give good starting points |
|---|
| 100 | for further automatic or manual refinement. |
|---|
| 101 | |
|---|
| 102 | This approach works well when the data set consists of sequences of different |
|---|
| 103 | degrees of divergence. Pairwise alignment of very closely related sequences |
|---|
| 104 | can be carried out very accurately. The correct answer may often be obtained |
|---|
| 105 | using a wide range of parameter values (gap penalties and weight matrix). By |
|---|
| 106 | the time the most distantly related sequences are aligned, one already has a |
|---|
| 107 | sample of aligned sequences which gives important information about the |
|---|
| 108 | variability at each position. The positions of the gaps that were introduced |
|---|
| 109 | during the early alignments of the closely related sequences are not changed |
|---|
| 110 | as new sequences are added. This is justified because the placement of gaps |
|---|
| 111 | in alignments between closely related sequences is much more accurate than |
|---|
| 112 | between distantly related ones. When all of the sequences are highly |
|---|
| 113 | divergent (e.g. less than approximately 25-30% identity between any pair of |
|---|
| 114 | sequences), this progressive approach becomes much less reliable. |
|---|
| 115 | |
|---|
| 116 | There are two major problems with the progressive approach: the local |
|---|
| 117 | minimum problem and the choice of alignment parameters. The local |
|---|
| 118 | minimum problem stems from the "greedy" nature of the alignment strategy. |
|---|
| 119 | The algorithm greedily adds sequences together, following the initial tree. |
|---|
| 120 | There is no guarantee that the global optimal solution, as defined by some |
|---|
| 121 | overall measure of multiple alignment quality (7,8), or anything close to it, |
|---|
| 122 | will be found. More specifically, any mistakes (misaligned regions) made |
|---|
| 123 | early in the alignment process cannot be corrected later as new information |
|---|
| 124 | from other sequences is added. This problem is frequently thought of as |
|---|
| 125 | mainly resulting from an incorrect branching order in the initial tree. The |
|---|
| 126 | initial trees are derived from a matrix of distances between separately aligned |
|---|
| 127 | pairs of sequences and are much less reliable than trees from complete |
|---|
| 128 | multiple alignments. In our experience, however, the real problem is caused |
|---|
| 129 | simply by errors in the initial alignments. Even if the topology of the guide |
|---|
| 130 | tree is correct, each alignment step in the multiple alignment process may |
|---|
| 131 | have some percentage of the residues misaligned. This percentage will be |
|---|
| 132 | very low on average for very closely related sequences but will increase as |
|---|
| 133 | sequences diverge. It is these misalignments which carry through from the |
|---|
| 134 | early alignment steps that cause the local minimum problem. The only way |
|---|
| 135 | to correct this is to use an iterative or stochastic sampling procedure (e.g. |
|---|
| 136 | 7,9,10). We do not directly address this problem in this paper. |
|---|
| 137 | |
|---|
| 138 | The alignment parameter choice problem is, in our view, at least as serious as |
|---|
| 139 | the local minimum problem. Stochastic or iterative algorithms will be just |
|---|
| 140 | as badly affected as progressive ones if the parameters are inappropriate: they |
|---|
| 141 | will arrive at a false global minimum. Traditionally, one chooses one weight |
|---|
| 142 | matrix and two gap penalties (one for opening a new gap and one for |
|---|
| 143 | extending an existing gap) and hope that these will work well over all parts of |
|---|
| 144 | all the sequences in the data set. When the sequences are all closely related, |
|---|
| 145 | this works. The first reason is that virtually all residue weight matrices give |
|---|
| 146 | most weight to identities. When identities dominate an alignment, almost |
|---|
| 147 | any weight matrix will find approximately the correct solution. With very |
|---|
| 148 | divergent sequences, however, the scores given to non-identical residues will |
|---|
| 149 | become critically important; there will be more mismatches than identities. |
|---|
| 150 | Different weight matrices will be optimal at different evolutionary distances |
|---|
| 151 | or for different classes of proteins. |
|---|
| 152 | |
|---|
| 153 | The second reason is that the range of gap penalty values that will find the |
|---|
| 154 | correct or best possible solution can be very broad for highly similar sequences |
|---|
| 155 | (11). As more and more divergent sequences are used, however, the exact |
|---|
| 156 | values of the gap penalties become important for success. In each case, there |
|---|
| 157 | may be a very narrow range of values which will deliver the best alignment. |
|---|
| 158 | Further, in protein alignments, gaps do not occur randomly (i.e. with equal |
|---|
| 159 | probability at all positions). They occur far more often between the major |
|---|
| 160 | secondary structural elements of alpha helices and beta strands than within |
|---|
| 161 | (12). |
|---|
| 162 | |
|---|
| 163 | The major improvements described in this paper attempt to address the |
|---|
| 164 | alignment parameter choice problem. We dynamically vary the gap |
|---|
| 165 | penalties in a position and residue specific manner. The observed relative |
|---|
| 166 | frequencies of gaps adjacent to each of the 20 amino acids (12) are used to |
|---|
| 167 | locally adjust the gap opening penalty after each residue. Short stretches of |
|---|
| 168 | hydrophilic residues (e.g. 5 or more) usually indicate loop or random coil |
|---|
| 169 | regions and the gap opening penalties are locally reduced in these stretches. |
|---|
| 170 | In addition, the locations of the gaps found in the early alignments are also |
|---|
| 171 | given reduced gap opening penalties. It has been observed in alignments |
|---|
| 172 | between sequences of known structure that gaps tend not to be closer than |
|---|
| 173 | roughly eight residues on average (12). We increase the gap opening penalty |
|---|
| 174 | within eight residues of exising gaps. The two main series of amino acid |
|---|
| 175 | weight matrices that are used today are the PAM series (3) and the BLOSUM |
|---|
| 176 | series (4). In each case, there is a range of matrices to choose from. Some |
|---|
| 177 | matrices are appropriate for aligning very closely related sequences where |
|---|
| 178 | most weight by far is given to identities, with only the most frequent |
|---|
| 179 | conservative substitutions receiving high scores. Other matrices work better |
|---|
| 180 | at greater evolutionary distances where less importance is attached to |
|---|
| 181 | identities (13). We choose different weight matrices, as the alignment |
|---|
| 182 | proceeds, depending on the estimated divergence of the sequences to be |
|---|
| 183 | aligned at each stage. |
|---|
| 184 | |
|---|
| 185 | Sequences are weighted to correct for unequal sampling across all |
|---|
| 186 | evolutionary distances in the data set (14). This downweights sequences that |
|---|
| 187 | are very similar to other sequences in the data set and upweights the most |
|---|
| 188 | divergent ones. The weights are calculated directly from the branch lengths |
|---|
| 189 | in the initial guide tree (15). Sequence weighting has already been shown to |
|---|
| 190 | be effective in improving the sensitivity of profile searches (15,16). In the |
|---|
| 191 | original CLUSTAL programs (17-19), the initial guide trees, used to guide the |
|---|
| 192 | multiple alignment, were calculated using the UPGMA method (20). We |
|---|
| 193 | now use the Neighbour-Joining method (21) which is more robust against the |
|---|
| 194 | effects of unequal evolutionary rates in different lineages and which gives |
|---|
| 195 | better estimates of individual branch lengths. This is useful because it is these |
|---|
| 196 | branch lengths which are used to derive the sequence weights. We also allow |
|---|
| 197 | users to choose between fast approximate alignments (22) or full dynamic |
|---|
| 198 | programming for the distance calculations used to make the guide tree. |
|---|
| 199 | |
|---|
| 200 | The new improvements dramatically improve the sensitivity of the |
|---|
| 201 | progressive alignment method for difficult alignments involving highly |
|---|
| 202 | diverged sequences. We show one very demanding test case of over 60 SH3 |
|---|
| 203 | domains (23) which includes sequence pairs with as little as 12% identity and |
|---|
| 204 | where there is only one exactly conserved residue across all of the sequences. |
|---|
| 205 | Using default parameters, we can achieve an alignment that is almost exactly |
|---|
| 206 | correct, according to available structural information (24). Using the program |
|---|
| 207 | in a wide variety of situations, we find that it will normally find the correct |
|---|
| 208 | alignment, in all but the most difficult and pathological of cases. |
|---|
| 209 | |
|---|
| 210 | |
|---|
| 211 | MATERIAL AND METHODS |
|---|
| 212 | |
|---|
| 213 | |
|---|
| 214 | The basic alignment method |
|---|
| 215 | |
|---|
| 216 | The basic multiple alignment algorithm consists of three main stages: 1) all |
|---|
| 217 | pairs of sequences are aligned separately in order to calculate a distance matrix |
|---|
| 218 | giving the divergence of each pair of sequences; 2) a guide tree is calculated |
|---|
| 219 | from the distance matrix; 3) the sequences are progressively aligned according |
|---|
| 220 | to the branching order in the guide tree. An example using 7 globin |
|---|
| 221 | sequences of known tertiary structure (25) is given in figure 1. |
|---|
| 222 | |
|---|
| 223 | |
|---|
| 224 | 1) The distance matrix/pairwise alignments |
|---|
| 225 | |
|---|
| 226 | In the original CLUSTAL programs, the pairwise distances were calculated |
|---|
| 227 | using a fast approximate method (22). This allows very large numbers of |
|---|
| 228 | sequences to be aligned, even on a microcomputer. The scores are calculated |
|---|
| 229 | as the number of k-tuple matches (runs of identical residues, typically 1 or 2 |
|---|
| 230 | long for proteins or 2 to 4 long for nucleotide sequences) in the best alignment |
|---|
| 231 | between two sequences minus a fixed penalty for every gap. We now offer a |
|---|
| 232 | choice between this method and the slower but more accurate scores from full |
|---|
| 233 | dynamic programming alignments using two gap penalties (for opening or |
|---|
| 234 | extending gaps) and a full amino acid weight matrix. These scores are |
|---|
| 235 | calculated as the number of identities in the best alignment divided by the |
|---|
| 236 | number of residues compared (gap positions are excluded). Both of these |
|---|
| 237 | scores are initially calculated as percent identity scores and are converted to |
|---|
| 238 | distances by dividing by 100 and subtracting from 1.0 to give number of |
|---|
| 239 | differences per site. We do not correct for multiple substitutions in these |
|---|
| 240 | initial distances. In figure 1 we give the 7x7 distance matrix between the 7 |
|---|
| 241 | globin sequences calculated using the full dynamic programming method. |
|---|
| 242 | |
|---|
| 243 | |
|---|
| 244 | 2) The guide tree |
|---|
| 245 | |
|---|
| 246 | The trees used to guide the final multiple alignment process are calculated |
|---|
| 247 | from the distance matrix of step 1 using the Neighbour-Joining method (21). |
|---|
| 248 | This produces unrooted trees with branch lengths proportional to estimated |
|---|
| 249 | divergence along each branch. The root is placed by a "mid-point" method |
|---|
| 250 | (15) at a position where the means of the branch lengths on either side of the |
|---|
| 251 | root are equal. These trees are also used to derive a weight for each sequence |
|---|
| 252 | (15). The weights are dependent upon the distance from the root of the tree |
|---|
| 253 | but sequences which have a common branch with other sequences share the |
|---|
| 254 | weight derived from the shared branch. In the example in figure 1, the |
|---|
| 255 | leghaemoglobin (Lgb2_Luplu) gets a weight of 0.442 which is equal to the |
|---|
| 256 | length of the branch from the root to it. The Human beta globin |
|---|
| 257 | (Hbb_Human) gets a weight consisting of the length of the branch leading to |
|---|
| 258 | it that is not shared with any other sequences (0.081) plus half the length of |
|---|
| 259 | the branch shared with the horse beta globin (0.226/2) plus one quarter the |
|---|
| 260 | length of the branch shared by all four haemoglobins (0.061/4) plus one fifth |
|---|
| 261 | the branch shared between the haemoglobins and the myoglobin (0.015/5) |
|---|
| 262 | plus one sixth the branch leading to all the vertebrate globins (0.062). This |
|---|
| 263 | sums to a total of 0.221. By contrast, in the normal progressive alignment |
|---|
| 264 | algorithm, all sequences would be equally weighted. The rooted tree with |
|---|
| 265 | branch lengths and sequence weights for the 7 globins is given in figure 1. |
|---|
| 266 | |
|---|
| 267 | |
|---|
| 268 | 3) Progressive alignment |
|---|
| 269 | |
|---|
| 270 | The basic procedure at this stage is to use a series of pairwise alignments to |
|---|
| 271 | align larger and larger groups of sequences, following the branching order in |
|---|
| 272 | the guide tree. You proceed from the tips of the rooted tree towards the root. |
|---|
| 273 | In the globin example in figure 1 you align the sequences in the following |
|---|
| 274 | order: human vs. horse beta globin; human vs. horse alpha globin; the 2 |
|---|
| 275 | alpha globins vs. the 2 beta globins; the myoglobin vs. the haemoglobins; the |
|---|
| 276 | cyanohaemoglobin vs the haemoglobins plus myoglobin; the leghaemoglobin |
|---|
| 277 | vs. all the rest. At each stage a full dynamic programming (26,27) algorithm is |
|---|
| 278 | used with a residue weight matrix and penalties for opening and extending |
|---|
| 279 | gaps. Each step consists of aligning two existing alignments or sequences. |
|---|
| 280 | Gaps that are present in older alignments remain fixed. In the basic |
|---|
| 281 | algorithm, new gaps that are introduced at each stage get full gap opening and |
|---|
| 282 | extension penalties, even if they are introduced inside old gap positions (see |
|---|
| 283 | the section on gap penalties below for modifications to this rule). In order to |
|---|
| 284 | calculate the score between a position from one sequence or alignment and |
|---|
| 285 | one from another, the average of all the pairwise weight matrix scores from |
|---|
| 286 | the amino acids in the two sets of sequences is used i.e. if you align 2 |
|---|
| 287 | alignments with 2 and 4 sequences respectively, the score at each position is |
|---|
| 288 | the average of 8 (2x4) comparisons. This is illustrated in figure 2. If either set |
|---|
| 289 | of sequences contains one or more gaps in one of the positions being |
|---|
| 290 | considered, each gap versus a residue is scored as zero. The default amino |
|---|
| 291 | acid weight matrices we use are rescored to have only positive values. |
|---|
| 292 | Therefore, this treatment of gaps treats the score of a residue versus a gap as |
|---|
| 293 | having the worst possible score. When sequences are weighted (see |
|---|
| 294 | improvements to progressive alignment, below), each weight matrix value is |
|---|
| 295 | multiplied by the weights from the 2 sequences, as illustrated in figure 2. |
|---|
| 296 | |
|---|
| 297 | |
|---|
| 298 | Improvements to progressive alignment |
|---|
| 299 | |
|---|
| 300 | All of the remaining modifications apply only to the final progressive |
|---|
| 301 | alignment stage. Sequence weighting is relatively straightforward and is |
|---|
| 302 | already widely used in profile searches (15,16). The treatment of gap penalties |
|---|
| 303 | is more complicated. Initial gap penalties are calculated depending on the |
|---|
| 304 | weight matrix, the similarity of the sequences, and the length of the |
|---|
| 305 | sequences. Then, an attempt is made to derive sensible local gap opening |
|---|
| 306 | penalties at every position in each pre-aligned group of sequences that will |
|---|
| 307 | vary as new sequences are added. The use of different weight matrices as the |
|---|
| 308 | alignment progresses is novel and largely by-passes the problem of initial |
|---|
| 309 | choice of weight matrix. The final modification allows us to delay the |
|---|
| 310 | addition of very divergent sequences until the end of the alignment process |
|---|
| 311 | when all of the more closely related sequences have already been aligned. |
|---|
| 312 | |
|---|
| 313 | |
|---|
| 314 | Sequence weighting |
|---|
| 315 | |
|---|
| 316 | Sequence weights are calculated directly from the guide tree. The weights |
|---|
| 317 | are normalised such that the biggest one is set to 1.0 and the rest are all less |
|---|
| 318 | than one. Groups of closely related sequences receive lowered weights |
|---|
| 319 | because they contain much duplicated information. Highly divergent |
|---|
| 320 | sequences without any close relatives receive high weights. These weights |
|---|
| 321 | are used as simple multiplication factors for scoring positions from different |
|---|
| 322 | sequences or prealigned groups of sequences. The method is illustrated in |
|---|
| 323 | figure 2. In the globin example in figure 1, the two alpha globins get |
|---|
| 324 | downweighted because they are almost duplicate sequences (as do the two |
|---|
| 325 | beta globins); they receive a combined weight of only slightly more than if a |
|---|
| 326 | single alpha globin was used. |
|---|
| 327 | |
|---|
| 328 | |
|---|
| 329 | Initial gap penalties |
|---|
| 330 | |
|---|
| 331 | Initially, two gap penalties are used: a gap opening penalty (GOP) which gives |
|---|
| 332 | the cost of opening a new gap of any length and a gap extension penalty (GEP) |
|---|
| 333 | which gives the cost of every item in a gap. Initial values can be set by the |
|---|
| 334 | user from a menu. The software then automatically attempts to choose |
|---|
| 335 | appropriate gap penalties for each sequence alignment, depending on the |
|---|
| 336 | following factors. |
|---|
| 337 | |
|---|
| 338 | 1) Dependence on the weight matrix |
|---|
| 339 | |
|---|
| 340 | It has been shown (16,28) that varying the gap penalties used with different |
|---|
| 341 | weight matrices can improve the accuracy of sequence alignments. Here, we |
|---|
| 342 | use the average score for two mismatched residues (ie. off-diagonal values in |
|---|
| 343 | the matrix) as a scaling factor for the GOP. |
|---|
| 344 | |
|---|
| 345 | 2) Dependence on the similarity of the sequences |
|---|
| 346 | |
|---|
| 347 | The percent identity of the two (groups of) sequences to be aligned is used to |
|---|
| 348 | increase the GOP for closely related sequences and decrease it for more |
|---|
| 349 | divergent sequences on a linear scale. |
|---|
| 350 | |
|---|
| 351 | 3) Dependence on the lengths of the sequences |
|---|
| 352 | |
|---|
| 353 | The scores for both true and false sequence alignments grow with the length |
|---|
| 354 | of the sequences. We use the logarithm of the length of the shorter sequence |
|---|
| 355 | to increase the GOP with sequence length. |
|---|
| 356 | |
|---|
| 357 | Using these three modifications, the initial GOP calculated by the program is: |
|---|
| 358 | |
|---|
| 359 | GOP->(GOP+log(MIN(N,M))) * (average residue mismatch score) * |
|---|
| 360 | (percent identity scaling factor) |
|---|
| 361 | where N, M are the lengths of the two sequences. |
|---|
| 362 | |
|---|
| 363 | 4) Dependence on the difference in the lengths of the sequences |
|---|
| 364 | |
|---|
| 365 | The GEP is modified depending on the difference between the lengths of the |
|---|
| 366 | two sequences to be aligned. If one sequence is much shorter than the other, |
|---|
| 367 | the GEP is increased to inhibit too many long gaps in the shorter sequence. |
|---|
| 368 | The initial GEP calculated by the program is: |
|---|
| 369 | |
|---|
| 370 | GEP -> GEP*(1.0+|log(N/M)|) |
|---|
| 371 | where N, M are the lengths of the two sequences. |
|---|
| 372 | |
|---|
| 373 | |
|---|
| 374 | Position-specific gap penalties |
|---|
| 375 | |
|---|
| 376 | In most dynamic programming applications, the initial gap opening and |
|---|
| 377 | extension penalties are applied equally at every position in the sequence, |
|---|
| 378 | regardless of the location of a gap, except for terminal gaps which are usually |
|---|
| 379 | allowed at no cost. In CLUSTAL W, before any pair of sequences or |
|---|
| 380 | prealigned groups of sequences are aligned, we generate a table of gap opening |
|---|
| 381 | penalties for every position in the two (sets of) sequences. An example is |
|---|
| 382 | shown in figure 3. We manipulate the initial gap opening penalty in a |
|---|
| 383 | position specific manner, in order to make gaps more or less likely at different |
|---|
| 384 | positions. |
|---|
| 385 | |
|---|
| 386 | The local gap penalty modification rules are applied in a hierarchical manner. |
|---|
| 387 | The exact details of each rule are given below. Firstly, if there is a gap at a |
|---|
| 388 | position, the gap opening and gap extension penalties are lowered; the other |
|---|
| 389 | rules do not apply. This makes gaps more likely at positions where there are |
|---|
| 390 | already gaps. If there is no gap at a position, then the gap opening penalty is |
|---|
| 391 | increased if the position is within 8 residues of an existing gap. This |
|---|
| 392 | discourages gaps that are too close together. Finally, at any position within a |
|---|
| 393 | run of hydrophilic residues, the penalty is decreased. These runs usually |
|---|
| 394 | indicate loop regions in protein structures. If there is no run of hydrophilic |
|---|
| 395 | residues, the penalty is modified using a table of residue specific gap |
|---|
| 396 | propensities (12). These propensities were derived by counting the frequency |
|---|
| 397 | of each residue at either end of gaps in alignments of proteins of known |
|---|
| 398 | structure. An illustration of the application of these rules from one part of |
|---|
| 399 | the globin example, in figure 1, is given in figure 3. |
|---|
| 400 | |
|---|
| 401 | 1) Lowered gap penalties at existing gaps |
|---|
| 402 | |
|---|
| 403 | If there are already gaps at a position, then the GOP is reduced in proportion |
|---|
| 404 | to the number of sequences with a gap at this position and the GEP is lowered |
|---|
| 405 | by a half. The new gap opening penalty is calculated as: |
|---|
| 406 | |
|---|
| 407 | GOP -> GOP*0.3*(no. of sequences without a gap/no. of sequences). |
|---|
| 408 | |
|---|
| 409 | 2) Increased gap penalties near existing gaps |
|---|
| 410 | |
|---|
| 411 | If a position does not have any gaps but is within 8 residues of an existing gap, |
|---|
| 412 | the GOP is increased by: |
|---|
| 413 | |
|---|
| 414 | GOP -> GOP*(2+((8-distance from gap)*2)/8) |
|---|
| 415 | |
|---|
| 416 | 3) Reduced gap penalties in hydrophilic stretches |
|---|
| 417 | |
|---|
| 418 | Any run of 5 hydrophilic residues is considered to be a hydrophilic stretch. |
|---|
| 419 | The residues that are to be considered hydrophilic may be set by the user but |
|---|
| 420 | are conservatively set to D, E, G, K, N, Q, P, R or S by default. If, at any |
|---|
| 421 | position, there are no gaps and any of the sequences has such a stretch, the |
|---|
| 422 | GOP is reduced by one third. |
|---|
| 423 | |
|---|
| 424 | |
|---|
| 425 | 4) Residue specific penalties |
|---|
| 426 | |
|---|
| 427 | If there is no hydrophilic stretch and the position does not contain any gaps, |
|---|
| 428 | then the GOP is multiplied by one of the 20 numbers in table 1, depending on |
|---|
| 429 | the residue. If there is a mixture of residues at a position, the multiplication |
|---|
| 430 | factor is the average of all the contributions from each sequence. |
|---|
| 431 | |
|---|
| 432 | |
|---|
| 433 | Weight matrices |
|---|
| 434 | |
|---|
| 435 | Two main series of weight matrices are offered to the user: the Dayhoff PAM |
|---|
| 436 | series (3) and the BLOSUM series (4). The default is the BLOSUM series. In |
|---|
| 437 | each case, there is a choice of matrix ranging from strict ones, useful for |
|---|
| 438 | comparing very closely related sequences to very "soft" ones that are useful |
|---|
| 439 | for comparing very distantly related sequences. Depending on the distance |
|---|
| 440 | between the two sequences or groups of sequences to be compared, we switch |
|---|
| 441 | between 4 different matrices. The distances are measured directly from the |
|---|
| 442 | guide tree. The ranges of distances and tables used with the PAM series of |
|---|
| 443 | matrices is: 80-100%:PAM20, 60-80%:PAM60, 40-60%:PAM120, 0-40%:PAM350. |
|---|
| 444 | The range used with the BLOSUM series is:80-100%:BLOSUM80, |
|---|
| 445 | 60-80%:BLOSUM62, 30-60%:BLOSUM45, 0-30%:BLOSUM30. |
|---|
| 446 | |
|---|
| 447 | |
|---|
| 448 | Divergent sequences |
|---|
| 449 | |
|---|
| 450 | The most divergent sequences (most different, on average from all of the |
|---|
| 451 | other sequences) are usually the most difficult to align correctly. It is |
|---|
| 452 | sometimes better to delay the incorporation of these sequences until all of the |
|---|
| 453 | more easily aligned sequences are merged first. This may give a better chance |
|---|
| 454 | of correctly placing the gaps and matching weakly conserved positions against |
|---|
| 455 | the rest of the sequences. A choice is offered to set a cut off (default is 40% |
|---|
| 456 | identity or less with any other sequence) that will delay the alignment of the |
|---|
| 457 | divergent sequences until all of the rest have been aligned. |
|---|
| 458 | |
|---|
| 459 | |
|---|
| 460 | Software and Algorithms |
|---|
| 461 | |
|---|
| 462 | |
|---|
| 463 | Dynamic Programming |
|---|
| 464 | |
|---|
| 465 | The most demanding part of the multiple alignment strategy, in terms of |
|---|
| 466 | computer processing and memory usage, is the alignment of two (groups of) |
|---|
| 467 | sequences at each step in the final progressive alignment. To make it |
|---|
| 468 | possible to align very long sequences (e.g. dynein heavy chains at ~ 5,000 |
|---|
| 469 | residues) in a reasonable amount of memory, we use the memory efficient |
|---|
| 470 | dynamic programming algorithm of Myers and Miller (26). This sacrifices |
|---|
| 471 | some processing time but makes very large alignments practical in very little |
|---|
| 472 | memory. One disadvantage of this algorithm is that it does not allow |
|---|
| 473 | different gap opening and extension penalties at each position. We have |
|---|
| 474 | modified the algorithm so as to allow this and the details are described in a |
|---|
| 475 | separate paper (27). |
|---|
| 476 | |
|---|
| 477 | |
|---|
| 478 | |
|---|
| 479 | Menus/file formats |
|---|
| 480 | |
|---|
| 481 | Six different sequence input formats are detected automatically and read by |
|---|
| 482 | the program: EMBL/Swiss Prot, NBRF/PIR, Pearson/FASTA (29), GCG/MSF |
|---|
| 483 | (30), GDE (Steven Smith, Harvard University Genome Center) and CLUSTAL |
|---|
| 484 | format alignments. The last three formats allow users to read in complete |
|---|
| 485 | alignments (e.g. for calculating phylogenetic trees or for addition of new |
|---|
| 486 | sequences to an existing alignment). Alignment output may be requested in |
|---|
| 487 | standard CLUSTAL format (self-explanatory blocked alignments) or in |
|---|
| 488 | formats compatible with the GDE, PHYLIP (31) or GCG (30) packages. The |
|---|
| 489 | program offers the user the ability to calculate Neighbour-Joining |
|---|
| 490 | phylogenetic trees from existing alignments with options to correct for |
|---|
| 491 | multiple hits (32,33) and to estimate confidence levels using a bootstrap |
|---|
| 492 | resampling procedure (34). The trees may be output in the "New |
|---|
| 493 | Hampshire" format that is compatible with the PHYLIP package (31). |
|---|
| 494 | |
|---|
| 495 | Alignment to an alignment |
|---|
| 496 | |
|---|
| 497 | Profile alignment is used to align two existing alignments (either of which |
|---|
| 498 | may consist of just one sequence) or to add a series of new sequences to an |
|---|
| 499 | existing alignment. This is useful because one may wish to build up a |
|---|
| 500 | multiple alignment gradually, choosing different parameters manually, or |
|---|
| 501 | correcting intermediate errors as the alignment proceeds. Often, just a few |
|---|
| 502 | sequences cause misalignments in the progressive algorithm and these can be |
|---|
| 503 | removed from the process and then added at the end by profile alignment. A |
|---|
| 504 | second use is where one has a high quality reference alignment and wishes to |
|---|
| 505 | keep it fixed while adding new sequences automatically. |
|---|
| 506 | |
|---|
| 507 | |
|---|
| 508 | Portability/Availability |
|---|
| 509 | |
|---|
| 510 | The full source code of the package is provided free to academic users. The |
|---|
| 511 | program will run on any machine with a full ANSI conforming C compiler. |
|---|
| 512 | It has been tested on the following hardware/software combinations: |
|---|
| 513 | Decstation/Ultrix, Vax or ALPHA/VMS, Silicon Graphics/IRIX. The source |
|---|
| 514 | code and documentation are available by E-mail from the EMBL file server |
|---|
| 515 | (send the words HELP and HELP SOFTWARE on two lines to the internet |
|---|
| 516 | address: |
|---|
| 517 | Netserv@EMBL-Heidelberg.DE) or by anonymous FTP from |
|---|
| 518 | FTP.EMBL-Heidelberg.DE. Queries may be addressed by E-mail to |
|---|
| 519 | Des.Higgins@EBI.AC.UK or Gibson@EMBL-Heidelberg.DE. |
|---|
| 520 | |
|---|
| 521 | |
|---|
| 522 | RESULTS AND DISCUSSION |
|---|
| 523 | |
|---|
| 524 | |
|---|
| 525 | Alignment of SH3 Domains |
|---|
| 526 | |
|---|
| 527 | The ~60 residue SH3 domain was chosen to illustrate the performance of |
|---|
| 528 | CLUSTAL W, as there is a reference manual alignment (23) and the fold is |
|---|
| 529 | known (24). SH3 domains, with a minimum similarity below 12% identity, |
|---|
| 530 | are poorly aligned by progressive alignment programs such as CLUSTAL V |
|---|
| 531 | and PILEUP: neither program can generate the correct blocks corresponding to |
|---|
| 532 | the secondary structure elements. |
|---|
| 533 | |
|---|
| 534 | Figure 4 shows an alignment generated by CLUSTAL W of the example set of |
|---|
| 535 | SH3 domains. The alignment was generated in two steps. After progressive |
|---|
| 536 | alignment, five blocks were produced, corresponding to structural elements, |
|---|
| 537 | with gaps inserted exclusively in the known loop regions. The beta strands in |
|---|
| 538 | blocks 1, 4 and 5 were all correctly superposed. However, four sequences in |
|---|
| 539 | block 2 and one sequence in block 3 were misaligned by 1-2 residues |
|---|
| 540 | (underlined in figure 4). A second progressive alignment of the aligned |
|---|
| 541 | sequences, including the gaps, improved this alignment: A single misaligned |
|---|
| 542 | sequence, H_P55, remains in block 2 (boxed in figure 4), while block 3 is now |
|---|
| 543 | completely aligned. This alignment corrects several errors (eg. P85A, P85B |
|---|
| 544 | and FUS1) in the manual alignment (23). |
|---|
| 545 | |
|---|
| 546 | The SH3 alignment illustrates several features of CLUSTAL W usage. Firstly, |
|---|
| 547 | in a practical application involving divergent sequences, the initial |
|---|
| 548 | progressive alignment is likely to be a good but not perfect approximation to |
|---|
| 549 | the correct alignment. The alignment quality can be improved in a number of |
|---|
| 550 | ways. If the block structure of the alignment appears to be correct, realignment |
|---|
| 551 | of the alignment will usually improve most of the misaligned blocks: the |
|---|
| 552 | existing gaps allow the blocks to "float" cheaply to a locally optimal position |
|---|
| 553 | without disturbing the rest of the alignment. Remaining sequences which are |
|---|
| 554 | doubtfully aligned can then be individually tested by profile alignment to the |
|---|
| 555 | remainder: the misaligned H_P55 SH3 domain can be correctly aligned by |
|---|
| 556 | profile (with GOP <= 8). The indel regions in the final alignment can then be |
|---|
| 557 | manually cleaned up: Usually the exact alignment in the loop regions is not |
|---|
| 558 | determinable, and may have no meaning in structural terms. It is then |
|---|
| 559 | desirable to have a single gap per structural loop. CLUSTAL W achieved this |
|---|
| 560 | for two of the four SH3 loop regions (figure 4). |
|---|
| 561 | |
|---|
| 562 | If the block structure of the alignment appears suspect, greater intervention by |
|---|
| 563 | the user may be required. The most divergent sequences, especially if they |
|---|
| 564 | have large insertions (which can be discerned with the aid of dot matrix |
|---|
| 565 | plots), should be left out of the progressive alignment. If there are sets of |
|---|
| 566 | closely related sequences that are deeply diverged from other sets, these can be |
|---|
| 567 | separately aligned and then merged by profile alignment. Incorrectly |
|---|
| 568 | determined sequences, containing frameshifts, can also confound regions of |
|---|
| 569 | an alignment: these can be hard to detect but sometimes they have been |
|---|
| 570 | grouped within the excluded divergent sequences: then they may be revealed |
|---|
| 571 | when they are individually compared to the alignment as having apparently |
|---|
| 572 | nonsense segments with respect to the other sequences. |
|---|
| 573 | |
|---|
| 574 | |
|---|
| 575 | |
|---|
| 576 | Finding the best alignment |
|---|
| 577 | |
|---|
| 578 | In cases where all of the sequences in a data set are very similar (e.g. no pair |
|---|
| 579 | less than 35% identical), CLUSTAL W will find an alignment which is |
|---|
| 580 | difficult to improve by eye. In this sense, the alignment is optimal with |
|---|
| 581 | regard to the alternative of manual alignment. Mathematically, this is vague |
|---|
| 582 | and can only be put on a more systematic footing by finding an objective |
|---|
| 583 | function (a measure of multiple alignment quality) that exactly mirrors the |
|---|
| 584 | information used by an "expert" to evaluate an alignment. Nonetheless, if an |
|---|
| 585 | alignment is impossible to improve by eye, then the program has achieved a |
|---|
| 586 | very useful result. |
|---|
| 587 | |
|---|
| 588 | In more difficult cases, as more divergent sequences are included, it becomes |
|---|
| 589 | increasingly difficult to find good alignments and to evaluate them. What |
|---|
| 590 | we find with CLUSTAL W is that the basic block-like structure of the |
|---|
| 591 | alignment (corresponding to the major secondary structure elements) is |
|---|
| 592 | usually recovered, with some of the most divergent sequences misaligned in |
|---|
| 593 | small regions. This is a very useful starting point for manual refinement as it |
|---|
| 594 | helps define the major blocks of similarity. The problem sequences can be |
|---|
| 595 | removed from the analysis and realigned to the rest of the sequences |
|---|
| 596 | automatically or with different parameter settings. An examination of the |
|---|
| 597 | tree used to guide the alignment will usually show which sequences will be |
|---|
| 598 | most unreliably placed (those that branch off closest to the root and/or those |
|---|
| 599 | that align to other single sequences at a very low level of sequence identity |
|---|
| 600 | rather than align to a group of pre-aligned sequences). Finally, one can |
|---|
| 601 | simply iterate the multiple alignment process by feeding an output alignment |
|---|
| 602 | back into CLUSTAL W and repeating the multiple alignment process (using |
|---|
| 603 | the same or different parameters). The SH3 domain alignment in figure 4 |
|---|
| 604 | was derived in this way by 2 passes using default parameters. In the second |
|---|
| 605 | pass, the local gap penalties are dominated by the placement of the initial |
|---|
| 606 | major gap positions. The alignment will either remain unchanged or will |
|---|
| 607 | converge rapidly (after 1 or 2 extra passes) on a better solution. If the |
|---|
| 608 | placement of the initial gaps is approximately correct but some of the |
|---|
| 609 | sequences are locally misaligned, this works well. |
|---|
| 610 | |
|---|
| 611 | |
|---|
| 612 | Comparison with other methods |
|---|
| 613 | |
|---|
| 614 | Recently, several papers have addressed the problem of position specific |
|---|
| 615 | parameters for multiple alignment. In one case (35), local gap penalties are |
|---|
| 616 | increased in alpha helical and beta strand regions, when the 3-D structures of |
|---|
| 617 | one or more of the sequences are known. In a second case (36), a hidden |
|---|
| 618 | Markov model was used to estimate position specific gap penalties and |
|---|
| 619 | residue substitution weight matrices when large numbers of examples of a |
|---|
| 620 | protein domain were known. With CLUSTAL W, we attempt to derive the |
|---|
| 621 | same information purely from the set of sequences to be aligned. Therefore, |
|---|
| 622 | we can apply the method to any set of sequences. The success of this approach |
|---|
| 623 | will depend on the number of available sequences and their evolutionary |
|---|
| 624 | relationships. It will also depend on the decision making process during |
|---|
| 625 | multiple alignment (e.g. when to change weight matrix) and the accuracy and |
|---|
| 626 | appropriateness of our parameterisation. In the long term, this can only be |
|---|
| 627 | evaluated by exhaustive testing of sets of sequences where the correct |
|---|
| 628 | alignment (or parts of it) are known from structural information. What is |
|---|
| 629 | clear, however, is that the modifications described here significantly improve |
|---|
| 630 | the sensitivity of the progressive multiple alignment approach. This is |
|---|
| 631 | achieved with almost no sacrifice in speed and efficiency. |
|---|
| 632 | |
|---|
| 633 | There are several areas where further improvements in sensitivity and |
|---|
| 634 | accuracy can be made. Firstly, the residue weight matrices and gap settings |
|---|
| 635 | can be made more accurate as more and more data accumulate, while |
|---|
| 636 | matrices for specific sequence types can be derived (e.g. for transmembrane |
|---|
| 637 | regions (37)). Secondly, stochastic or iterative optimisation methods can be |
|---|
| 638 | used to refine initial alignments (7,9,10). CLUSTAL W could be run with |
|---|
| 639 | several sets of starting parameters and in each case, the alignments refined |
|---|
| 640 | according to an objective function. The search for a good objective function, |
|---|
| 641 | that takes into account the sequence and position specific information used in |
|---|
| 642 | CLUSTAL W is a key area of research. Finally, the average number of |
|---|
| 643 | examples of each protein domain or family is growing steadily. It is not only |
|---|
| 644 | important that programs can cope with the large volumes of data that are |
|---|
| 645 | being generated, they should be able to exploit the new information to make |
|---|
| 646 | the alignments more and more accurate. Globally optimal alignments |
|---|
| 647 | (according to an objective function) may not always be possible but the |
|---|
| 648 | problem may be avoided if sufficiently large volumes of data become |
|---|
| 649 | available. CLUSTAL W is a step in this direction. |
|---|
| 650 | |
|---|
| 651 | ACKNOWLEDGEMENTS |
|---|
| 652 | |
|---|
| 653 | Numerous people have offered advice and suggestions for improvements to |
|---|
| 654 | earlier versions of the CLUSTAL programs. D.H. wishes to apologise to all of |
|---|
| 655 | the irate CLUSTAL V users who had to live with the bugs and lack of facilities |
|---|
| 656 | for getting trees in the New Hampshire format. We wish to specifically thank |
|---|
| 657 | Jeroen Coppieters who suggested using a series of weight matrices and Steven |
|---|
| 658 | Henikoff for advice on using the BLOSUM matrices. We are grateful to Rein |
|---|
| 659 | Aasland, Peer Bork, Ariel Blocker and Brtrand Seraphin for providing |
|---|
| 660 | challenging alignment problems. T.G. and J.T. thank Kevin Leonard for |
|---|
| 661 | support and encouragement. Finally, we thank all of the people who were |
|---|
| 662 | involved with various CLUSTAL programs over the years, namely: Paul |
|---|
| 663 | Sharp, Rainer Fuchs and Alan Bleasby. |
|---|
| 664 | |
|---|
| 665 | |
|---|
| 666 | REFERENCES |
|---|
| 667 | |
|---|
| 668 | 1.Feng, D.-F. and Doolittle, R.F. (1987). J. Mol. Evol. 25, 351-360. |
|---|
| 669 | 2.Needleman, S.B. and Wunsch, C.D. (1970). J. Mol. Biol. 48, 443-453. |
|---|
| 670 | 3.Dayhoff, M.O., Schwartz, R.M. and Orcutt, B.C. (1978) in Atlas of Protein |
|---|
| 671 | Sequence and Structure, vol. 5, suppl. 3 (Dayhoff, M.O., ed.), pp 345-352, |
|---|
| 672 | NBRF, Washington. |
|---|
| 673 | 4.Henikoff, S. and Henikoff, J.G. (1992). Proc. Natl. Acad. Sci. USA 89, 10915- |
|---|
| 674 | 10919. |
|---|
| 675 | 5.Lipman, D.J., Altschul, S.F. and Kececioglu, J.D. (1989). Proc. Natl. Acad. Sci. |
|---|
| 676 | USA 86, 4412-4415. |
|---|
| 677 | 6.Barton, G.J. and Sternberg, M.J.E. (1987). J. Mol. Biol. 198, 327-337. |
|---|
| 678 | 7.Gotoh, O. (1993). CABIOS 9, 361-370. |
|---|
| 679 | 8.Altschul, S.F. (1989). J. Theor. Biol. 138, 297-309. |
|---|
| 680 | 9.Lukashin, A.V., Engelbrecht, J. and Brunak, S. (1992). Nucl. Acids Res. 20, |
|---|
| 681 | 2511-2516. |
|---|
| 682 | 10.Lawrence, C.E., Altschul, S.F., Boguski, M.S., Liu, J.S., Neuwald, A.F. and |
|---|
| 683 | Wooton, J.C. (1993). Science, 262, 208-214. |
|---|
| 684 | 11.Vingron, M. and Waterman, M.S. (1993). J. Mol. Biol. 234, 1-12. |
|---|
| 685 | 12.Pascarella, S. and Argos, P. (1992). J. Mol. Biol. 224, 461-471. |
|---|
| 686 | 13.Collins, J.F. and Coulson, A.F.W. (1987). In Nucleic acid and protein |
|---|
| 687 | sequence analysis a practical approach, Bishop, M.J. and Rawlings, C.J. ed., |
|---|
| 688 | chapter 13, pp. 323-358. |
|---|
| 689 | 14.Vingron, M. and Sibbald, P.R. (1993). Proc. Natl. Acad. Sci. USA, 90, 8777- |
|---|
| 690 | 8781. |
|---|
| 691 | 15.Thompson, J.D., Higgins, D.G. and Gibson, T.J. (1994). CABIOS, 10, 19-29. |
|---|
| 692 | 16.Lthy, R., Xenarios, I. and Bucher, P. (1994). Protein Science, 3, 139-146. |
|---|
| 693 | 17.Higgins, D.G. and Sharp, P.M. (1988). Gene, 73, 237-244. |
|---|
| 694 | 18.Higgins, D.G. and Sharp, P.M. (1989). CABIOS, 5, 151-153. |
|---|
| 695 | 19.Higgins, D.G., Bleasby, A.J. and Fuchs, R. (1992). CABIOS, 8, 189-191. |
|---|
| 696 | 20.Sneath, P.H.A. and Sokal, R.R. (1973). Numerical Taxonomy, W.H. |
|---|
| 697 | Freeman, San Francisco. |
|---|
| 698 | 21.Saitou, N. and Nei, M. (1987). Mol. Biol. Evol. 4, 406-425. |
|---|
| 699 | 22.Wilbur, W.J. and Lipman, D.J. (1983). Proc. Natl. Acad. Sci. USA, 80, 726- |
|---|
| 700 | 730. |
|---|
| 701 | 23.Musacchio, A., Gibson, T., Lehto, V.-P. and Saraste, M. (1992). FEBS Lett. |
|---|
| 702 | 307, 55-61. |
|---|
| 703 | 24.Musacchio, A., Noble, M., Pauptit, R., Wierenga, R. and Saraste, M. (1992). |
|---|
| 704 | Nature, 359, 851-855. |
|---|
| 705 | 25.Bashford, D., Chothia, C. and Lesk, A.M. (1987). J. Mol. Biol. 196, 199-216. |
|---|
| 706 | 26.Myers, E.W. and Miller, W. (1988). CABIOS, 4, 11-17. |
|---|
| 707 | 27.Thompson, J.D. (1994). CABIOS, (Submitted). |
|---|
| 708 | 28.Smith, T.F., Waterman, M.S. and Fitch, W.M. (1981). J. Mol. Evol. 18, 38-46. |
|---|
| 709 | 29.Pearson, W.R. and Lipman, D.J. (1988). Proc. Natl. Acad. Sci. USA. 85, 2444- |
|---|
| 710 | 2448. |
|---|
| 711 | 30.Devereux, J., Haeberli, P. and Smithies, O. (1984). Nucleic Acids Res. 12, |
|---|
| 712 | 387-395. |
|---|
| 713 | 31.Felsenstein, J. (1989). Cladistics 5, 164-166. |
|---|
| 714 | 32.Kimura, M. (1980). J. Mol. Evol. 16, 111-120. |
|---|
| 715 | 33.Kimura, M. (1983). The Neutral Theory of Molecular Evolution. |
|---|
| 716 | Cambridge University Press, Cambridge. |
|---|
| 717 | 34.Felsenstein, J. (1985). Evolution 39, 783-791. |
|---|
| 718 | 35.Smith, R.F. and Smith, T.F. (1992) Protein Engineering 5, 35-41. |
|---|
| 719 | 36.Krogh, A., Brown, M., Mian, S., Sjlander, K. and Haussler, D. (1994) J. Mol. |
|---|
| 720 | Biol. 235-1501-1531. |
|---|
| 721 | 37.Jones, D.T., Taylor, W.R. and Thornton, J.M. (1994). FEBS Lett. 339, 269-275. |
|---|
| 722 | 38.Bairoch, A. and Bckmann, B. (1992) Nucleic Acids Res., 20, 2019-2022. |
|---|
| 723 | 39.Noble, M.E.M., Musacchio, A., Saraste, M., Courtneidge, S.A. and |
|---|
| 724 | Wierenga, R.K. (1993) EMBO J. 12, 2617-2624. |
|---|
| 725 | 40.Kabsch, W. and Sander, C. (1983) Biopolymers, 22, 2577-2637. |
|---|
| 726 | |
|---|
| 727 | FIGURE LEGENDS |
|---|
| 728 | |
|---|
| 729 | Figure 1. The basic progressive alignment procedure, illustrated using a set of |
|---|
| 730 | 7 globins of known tertiary structure. The sequence names are from Swiss |
|---|
| 731 | Prot (38): Hba_Horse: horse alpha globin; Hba_Human: human alpha globin; |
|---|
| 732 | Hbb_Horse: horse beta globin; Hbb_Human: human beta globin; Myg_Phyca: |
|---|
| 733 | sperm whale myoglobin; Glb5_Petma: lamprey cyanohaemoglobin; |
|---|
| 734 | Lgb2_Luplu: lupin leghaemoglobin. In the distance matrix, the mean |
|---|
| 735 | number of differences per residue is given. The unrooted tree shows all |
|---|
| 736 | branch lengths drawn to scale. In the rooted tree, all branch lengths (mean |
|---|
| 737 | number of differences per residue along each branch) are given as well as |
|---|
| 738 | weights for each sequence. In the multiple alignment, the approximate |
|---|
| 739 | positions of the 7 alpha helices, common to all 7 proteins are shown. This |
|---|
| 740 | alignment was derived using CLUSTAL W with default parameters and the |
|---|
| 741 | PAM (3) series of weight matrices. |
|---|
| 742 | |
|---|
| 743 | Figure 2. The scoring scheme for comparing two positions from two |
|---|
| 744 | alignments. Two sections of alignment with 4 and 2 sequences respectively |
|---|
| 745 | are shown. The score of the position with amino acids T,L,K,K versus the |
|---|
| 746 | position with amino acids V and I is given with and without sequence |
|---|
| 747 | weights. M(X,Y) is the weight matrix entry for amino acid X versus amino |
|---|
| 748 | acid Y. Wn is the weight for sequence n. |
|---|
| 749 | |
|---|
| 750 | Figure 3. The variation in local gap opening penalty is plotted for a section of |
|---|
| 751 | alignment. The inital gap opening penalty is indicated by a dotted line. Two |
|---|
| 752 | hydrophilic stretches are underlined. The lowest penalties correspond to the |
|---|
| 753 | ends of the alignment, the hydrophilic stretches and the two positions with |
|---|
| 754 | gaps. The highest values are within 8 residues of the two gap positions. The |
|---|
| 755 | rest of the variation is caused by the residue specific gap penalties (12). |
|---|
| 756 | |
|---|
| 757 | Figure 4. CLUSTAL W Alignment of a set of SH3 domains taken from (23). |
|---|
| 758 | Secondary structure assignments for the solved Spectrin (24) and Fyn (39) |
|---|
| 759 | domains are according to DSSP (40). The alignment was generated in two |
|---|
| 760 | steps using default parameters. After full multiple alignment, the aligned |
|---|
| 761 | sequences were realigned. Segments which were correctly aligned in the |
|---|
| 762 | second pass are underlined. The single misaligned segment in H_P55 and the |
|---|
| 763 | misaligned residue in H_NCK/2 are boxed. |
|---|
| 764 | |
|---|
| 765 | The sequences are coloured to illustrate significant features. All G (orange) |
|---|
| 766 | and P (yellow) are coloured. Other residues matching a frequent occurrence of |
|---|
| 767 | a property in a column are coloured: hydrophobic = blue; hydrophobic |
|---|
| 768 | tendency = light blue; basic = red; acidic = purple; hydrophilic = green; White |
|---|
| 769 | = unconserved. The alignment figure was prepared with the GDE sequence |
|---|
| 770 | editor (S. Smith, Harvard University) and COLORMASK (J. Thompson, |
|---|
| 771 | EMBL). |
|---|
| 772 | |
|---|
| 773 | |
|---|
| 774 | |
|---|
| 775 | |
|---|
| 776 | Table 1. Pascarella and Argos residue specific gap modification factors. |
|---|
| 777 | ----------------------------------------------------------------------------------- |
|---|
| 778 | A 1.13 M 1.29 |
|---|
| 779 | C 1.13 N 0.63 |
|---|
| 780 | D 0.96 P 0.74 |
|---|
| 781 | E 1.31 Q 1.07 |
|---|
| 782 | F 1.20 R 0.72 |
|---|
| 783 | G 0.61 S 0.76 |
|---|
| 784 | H 1.00 T 0.89 |
|---|
| 785 | I 1.32 V 1.25 |
|---|
| 786 | K 0.96 Y 1.00 |
|---|
| 787 | L 1.21 W 1.23 |
|---|
| 788 | ----------------------------------------------------------------------------------- |
|---|
| 789 | The values are normalised around a mean value of 1.0 for H. The lower the |
|---|
| 790 | value, the greater the chance of having an adjacent gap. These are derived |
|---|
| 791 | from the original table of relative frequencies of gaps adjacent to each residue |
|---|
| 792 | (12) by subtraction from 2.0. |
|---|
| 793 | |
|---|
| 794 | |
|---|