1 | This is just an ASCII text version of the manuscript describing |
---|
2 | Clustal W, without the figures. It was published: |
---|
3 | |
---|
4 | Nucleic Acids Research, 22(22):4673-4680. |
---|
5 | |
---|
6 | |
---|
7 | |
---|
8 | CLUSTAL W: improving the sensitivity of progressive multiple |
---|
9 | sequence alignment through sequence weighting, position specific |
---|
10 | gap penalties and weight matrix choice. |
---|
11 | |
---|
12 | |
---|
13 | |
---|
14 | Julie D. Thompson, Desmond G. Higgins1 and Toby J. Gibson* |
---|
15 | |
---|
16 | European Molecular Biology Laboratory |
---|
17 | Postfach 102209 |
---|
18 | Meyerhofstrasse 1 |
---|
19 | D-69012 Heidelberg |
---|
20 | Germany |
---|
21 | |
---|
22 | |
---|
23 | Phone: +49-6221-387398 |
---|
24 | Fax: +49-6221-387306 |
---|
25 | E-mail: Gibson@EMBL-Heidelberg.DE |
---|
26 | Des.Higgins@EBI.AC.UK |
---|
27 | Thompson@EMBL-Heidelberg.DE |
---|
28 | |
---|
29 | |
---|
30 | Keywords: Multiple alignment, phylogenetic tree, weight matrix, gap |
---|
31 | penalty, dynamic programming, sequence weighting. |
---|
32 | |
---|
33 | |
---|
34 | 1 Current address: |
---|
35 | European Bioinformatics Institute |
---|
36 | Hinxton Hall |
---|
37 | Hinxton |
---|
38 | Cambridge CB10 1RQ |
---|
39 | UK. |
---|
40 | |
---|
41 | * To whom correspondence should be addressed |
---|
42 | |
---|
43 | |
---|
44 | ABSTRACT |
---|
45 | |
---|
46 | The sensitivity of the commonly used progressive multiple sequence |
---|
47 | alignment method has been greatly improved for the alignment of divergent |
---|
48 | protein sequences. Firstly, individual weights are assigned to each sequence |
---|
49 | in a partial alignment in order to downweight near-duplicate sequences and |
---|
50 | upweight the most divergent ones. Secondly, amino acid substitution |
---|
51 | matrices are varied at different alignment stages according to the divergence |
---|
52 | of the sequences to be aligned. Thirdly, residue specific gap penalties and |
---|
53 | locally reduced gap penalties in hydrophilic regions encourage new gaps in |
---|
54 | potential loop regions rather than regular secondary structure. Fourthly, |
---|
55 | positions in early alignments where gaps have been opened receive locally |
---|
56 | reduced gap penalties to encourage the opening up of new gaps at these |
---|
57 | positions. These modifications are incorporated into a new program, |
---|
58 | CLUSTAL W which is freely available. |
---|
59 | |
---|
60 | |
---|
61 | INTRODUCTION |
---|
62 | |
---|
63 | The simultaneous alignment of many nucleotide or amino acid sequences is |
---|
64 | now an essential tool in molecular biology. Multiple alignments are used to |
---|
65 | find diagnostic patterns to characterise protein families; to detect or |
---|
66 | demonstrate homology between new sequences and existing families of |
---|
67 | sequences; to help predict the secondary and tertiary structures of new |
---|
68 | sequences; to suggest oligonucleotide primers for PCR; as an essential prelude |
---|
69 | to molecular evolutionary analysis. The rate of appearance of new sequence |
---|
70 | data is steadily increasing and the development of efficient and accurate |
---|
71 | automatic methods for multiple alignment is, therefore, of major |
---|
72 | importance. The majority of automatic multiple alignments are now carried |
---|
73 | out using the "progressive" approach of Feng and Doolittle (1). In this paper, |
---|
74 | we describe a number of improvements to the progressive multiple |
---|
75 | alignment method which greatly improve the sensitivity without sacrificing |
---|
76 | any of the speed and efficiency which makes this approach so practical. The |
---|
77 | new methods are made available in a program called CLUSTAL W which is |
---|
78 | freely available and portable to a wide variety of computers and operating |
---|
79 | systems. |
---|
80 | |
---|
81 | In order to align just two sequences, it is standard practice to use dynamic |
---|
82 | programming (2). This guarantees a mathematically optimal alignment, |
---|
83 | given a table of scores for matches and mismatches between all amino acids |
---|
84 | or nucleotides (e.g. the PAM250 matrix (3) or BLOSUM62 matrix (4)) and |
---|
85 | penalties for insertions or deletions of different lengths. Attempts at |
---|
86 | generalising dynamic programming to multiple alignments are limited to |
---|
87 | small numbers of short sequences (5). For much more than eight or so |
---|
88 | proteins of average length, the problem is uncomputable given current |
---|
89 | computer power. Therefore, all of the methods capable of handling larger |
---|
90 | problems in practical timescales, make use of heuristics. Currently, the most |
---|
91 | widely used approach is to exploit the fact that homologous sequences are |
---|
92 | evolutionarily related. One can build up a multiple alignment progressively |
---|
93 | by a series of pairwise alignments, following the branching order in a |
---|
94 | phylogenetic tree (1). One first aligns the most closely related sequences, |
---|
95 | gradually adding in the more distant ones. This approach is sufficiently fast |
---|
96 | to allow alignments of virtually any size. Further, in simple cases, the |
---|
97 | quality of the alignments is excellent, as judged by the ability to correctly align |
---|
98 | corresponding domains from sequences of known secondary or tertiary |
---|
99 | structure (6). In more difficult cases, the alignments give good starting points |
---|
100 | for further automatic or manual refinement. |
---|
101 | |
---|
102 | This approach works well when the data set consists of sequences of different |
---|
103 | degrees of divergence. Pairwise alignment of very closely related sequences |
---|
104 | can be carried out very accurately. The correct answer may often be obtained |
---|
105 | using a wide range of parameter values (gap penalties and weight matrix). By |
---|
106 | the time the most distantly related sequences are aligned, one already has a |
---|
107 | sample of aligned sequences which gives important information about the |
---|
108 | variability at each position. The positions of the gaps that were introduced |
---|
109 | during the early alignments of the closely related sequences are not changed |
---|
110 | as new sequences are added. This is justified because the placement of gaps |
---|
111 | in alignments between closely related sequences is much more accurate than |
---|
112 | between distantly related ones. When all of the sequences are highly |
---|
113 | divergent (e.g. less than approximately 25-30% identity between any pair of |
---|
114 | sequences), this progressive approach becomes much less reliable. |
---|
115 | |
---|
116 | There are two major problems with the progressive approach: the local |
---|
117 | minimum problem and the choice of alignment parameters. The local |
---|
118 | minimum problem stems from the "greedy" nature of the alignment strategy. |
---|
119 | The algorithm greedily adds sequences together, following the initial tree. |
---|
120 | There is no guarantee that the global optimal solution, as defined by some |
---|
121 | overall measure of multiple alignment quality (7,8), or anything close to it, |
---|
122 | will be found. More specifically, any mistakes (misaligned regions) made |
---|
123 | early in the alignment process cannot be corrected later as new information |
---|
124 | from other sequences is added. This problem is frequently thought of as |
---|
125 | mainly resulting from an incorrect branching order in the initial tree. The |
---|
126 | initial trees are derived from a matrix of distances between separately aligned |
---|
127 | pairs of sequences and are much less reliable than trees from complete |
---|
128 | multiple alignments. In our experience, however, the real problem is caused |
---|
129 | simply by errors in the initial alignments. Even if the topology of the guide |
---|
130 | tree is correct, each alignment step in the multiple alignment process may |
---|
131 | have some percentage of the residues misaligned. This percentage will be |
---|
132 | very low on average for very closely related sequences but will increase as |
---|
133 | sequences diverge. It is these misalignments which carry through from the |
---|
134 | early alignment steps that cause the local minimum problem. The only way |
---|
135 | to correct this is to use an iterative or stochastic sampling procedure (e.g. |
---|
136 | 7,9,10). We do not directly address this problem in this paper. |
---|
137 | |
---|
138 | The alignment parameter choice problem is, in our view, at least as serious as |
---|
139 | the local minimum problem. Stochastic or iterative algorithms will be just |
---|
140 | as badly affected as progressive ones if the parameters are inappropriate: they |
---|
141 | will arrive at a false global minimum. Traditionally, one chooses one weight |
---|
142 | matrix and two gap penalties (one for opening a new gap and one for |
---|
143 | extending an existing gap) and hope that these will work well over all parts of |
---|
144 | all the sequences in the data set. When the sequences are all closely related, |
---|
145 | this works. The first reason is that virtually all residue weight matrices give |
---|
146 | most weight to identities. When identities dominate an alignment, almost |
---|
147 | any weight matrix will find approximately the correct solution. With very |
---|
148 | divergent sequences, however, the scores given to non-identical residues will |
---|
149 | become critically important; there will be more mismatches than identities. |
---|
150 | Different weight matrices will be optimal at different evolutionary distances |
---|
151 | or for different classes of proteins. |
---|
152 | |
---|
153 | The second reason is that the range of gap penalty values that will find the |
---|
154 | correct or best possible solution can be very broad for highly similar sequences |
---|
155 | (11). As more and more divergent sequences are used, however, the exact |
---|
156 | values of the gap penalties become important for success. In each case, there |
---|
157 | may be a very narrow range of values which will deliver the best alignment. |
---|
158 | Further, in protein alignments, gaps do not occur randomly (i.e. with equal |
---|
159 | probability at all positions). They occur far more often between the major |
---|
160 | secondary structural elements of alpha helices and beta strands than within |
---|
161 | (12). |
---|
162 | |
---|
163 | The major improvements described in this paper attempt to address the |
---|
164 | alignment parameter choice problem. We dynamically vary the gap |
---|
165 | penalties in a position and residue specific manner. The observed relative |
---|
166 | frequencies of gaps adjacent to each of the 20 amino acids (12) are used to |
---|
167 | locally adjust the gap opening penalty after each residue. Short stretches of |
---|
168 | hydrophilic residues (e.g. 5 or more) usually indicate loop or random coil |
---|
169 | regions and the gap opening penalties are locally reduced in these stretches. |
---|
170 | In addition, the locations of the gaps found in the early alignments are also |
---|
171 | given reduced gap opening penalties. It has been observed in alignments |
---|
172 | between sequences of known structure that gaps tend not to be closer than |
---|
173 | roughly eight residues on average (12). We increase the gap opening penalty |
---|
174 | within eight residues of exising gaps. The two main series of amino acid |
---|
175 | weight matrices that are used today are the PAM series (3) and the BLOSUM |
---|
176 | series (4). In each case, there is a range of matrices to choose from. Some |
---|
177 | matrices are appropriate for aligning very closely related sequences where |
---|
178 | most weight by far is given to identities, with only the most frequent |
---|
179 | conservative substitutions receiving high scores. Other matrices work better |
---|
180 | at greater evolutionary distances where less importance is attached to |
---|
181 | identities (13). We choose different weight matrices, as the alignment |
---|
182 | proceeds, depending on the estimated divergence of the sequences to be |
---|
183 | aligned at each stage. |
---|
184 | |
---|
185 | Sequences are weighted to correct for unequal sampling across all |
---|
186 | evolutionary distances in the data set (14). This downweights sequences that |
---|
187 | are very similar to other sequences in the data set and upweights the most |
---|
188 | divergent ones. The weights are calculated directly from the branch lengths |
---|
189 | in the initial guide tree (15). Sequence weighting has already been shown to |
---|
190 | be effective in improving the sensitivity of profile searches (15,16). In the |
---|
191 | original CLUSTAL programs (17-19), the initial guide trees, used to guide the |
---|
192 | multiple alignment, were calculated using the UPGMA method (20). We |
---|
193 | now use the Neighbour-Joining method (21) which is more robust against the |
---|
194 | effects of unequal evolutionary rates in different lineages and which gives |
---|
195 | better estimates of individual branch lengths. This is useful because it is these |
---|
196 | branch lengths which are used to derive the sequence weights. We also allow |
---|
197 | users to choose between fast approximate alignments (22) or full dynamic |
---|
198 | programming for the distance calculations used to make the guide tree. |
---|
199 | |
---|
200 | The new improvements dramatically improve the sensitivity of the |
---|
201 | progressive alignment method for difficult alignments involving highly |
---|
202 | diverged sequences. We show one very demanding test case of over 60 SH3 |
---|
203 | domains (23) which includes sequence pairs with as little as 12% identity and |
---|
204 | where there is only one exactly conserved residue across all of the sequences. |
---|
205 | Using default parameters, we can achieve an alignment that is almost exactly |
---|
206 | correct, according to available structural information (24). Using the program |
---|
207 | in a wide variety of situations, we find that it will normally find the correct |
---|
208 | alignment, in all but the most difficult and pathological of cases. |
---|
209 | |
---|
210 | |
---|
211 | MATERIAL AND METHODS |
---|
212 | |
---|
213 | |
---|
214 | The basic alignment method |
---|
215 | |
---|
216 | The basic multiple alignment algorithm consists of three main stages: 1) all |
---|
217 | pairs of sequences are aligned separately in order to calculate a distance matrix |
---|
218 | giving the divergence of each pair of sequences; 2) a guide tree is calculated |
---|
219 | from the distance matrix; 3) the sequences are progressively aligned according |
---|
220 | to the branching order in the guide tree. An example using 7 globin |
---|
221 | sequences of known tertiary structure (25) is given in figure 1. |
---|
222 | |
---|
223 | |
---|
224 | 1) The distance matrix/pairwise alignments |
---|
225 | |
---|
226 | In the original CLUSTAL programs, the pairwise distances were calculated |
---|
227 | using a fast approximate method (22). This allows very large numbers of |
---|
228 | sequences to be aligned, even on a microcomputer. The scores are calculated |
---|
229 | as the number of k-tuple matches (runs of identical residues, typically 1 or 2 |
---|
230 | long for proteins or 2 to 4 long for nucleotide sequences) in the best alignment |
---|
231 | between two sequences minus a fixed penalty for every gap. We now offer a |
---|
232 | choice between this method and the slower but more accurate scores from full |
---|
233 | dynamic programming alignments using two gap penalties (for opening or |
---|
234 | extending gaps) and a full amino acid weight matrix. These scores are |
---|
235 | calculated as the number of identities in the best alignment divided by the |
---|
236 | number of residues compared (gap positions are excluded). Both of these |
---|
237 | scores are initially calculated as percent identity scores and are converted to |
---|
238 | distances by dividing by 100 and subtracting from 1.0 to give number of |
---|
239 | differences per site. We do not correct for multiple substitutions in these |
---|
240 | initial distances. In figure 1 we give the 7x7 distance matrix between the 7 |
---|
241 | globin sequences calculated using the full dynamic programming method. |
---|
242 | |
---|
243 | |
---|
244 | 2) The guide tree |
---|
245 | |
---|
246 | The trees used to guide the final multiple alignment process are calculated |
---|
247 | from the distance matrix of step 1 using the Neighbour-Joining method (21). |
---|
248 | This produces unrooted trees with branch lengths proportional to estimated |
---|
249 | divergence along each branch. The root is placed by a "mid-point" method |
---|
250 | (15) at a position where the means of the branch lengths on either side of the |
---|
251 | root are equal. These trees are also used to derive a weight for each sequence |
---|
252 | (15). The weights are dependent upon the distance from the root of the tree |
---|
253 | but sequences which have a common branch with other sequences share the |
---|
254 | weight derived from the shared branch. In the example in figure 1, the |
---|
255 | leghaemoglobin (Lgb2_Luplu) gets a weight of 0.442 which is equal to the |
---|
256 | length of the branch from the root to it. The Human beta globin |
---|
257 | (Hbb_Human) gets a weight consisting of the length of the branch leading to |
---|
258 | it that is not shared with any other sequences (0.081) plus half the length of |
---|
259 | the branch shared with the horse beta globin (0.226/2) plus one quarter the |
---|
260 | length of the branch shared by all four haemoglobins (0.061/4) plus one fifth |
---|
261 | the branch shared between the haemoglobins and the myoglobin (0.015/5) |
---|
262 | plus one sixth the branch leading to all the vertebrate globins (0.062). This |
---|
263 | sums to a total of 0.221. By contrast, in the normal progressive alignment |
---|
264 | algorithm, all sequences would be equally weighted. The rooted tree with |
---|
265 | branch lengths and sequence weights for the 7 globins is given in figure 1. |
---|
266 | |
---|
267 | |
---|
268 | 3) Progressive alignment |
---|
269 | |
---|
270 | The basic procedure at this stage is to use a series of pairwise alignments to |
---|
271 | align larger and larger groups of sequences, following the branching order in |
---|
272 | the guide tree. You proceed from the tips of the rooted tree towards the root. |
---|
273 | In the globin example in figure 1 you align the sequences in the following |
---|
274 | order: human vs. horse beta globin; human vs. horse alpha globin; the 2 |
---|
275 | alpha globins vs. the 2 beta globins; the myoglobin vs. the haemoglobins; the |
---|
276 | cyanohaemoglobin vs the haemoglobins plus myoglobin; the leghaemoglobin |
---|
277 | vs. all the rest. At each stage a full dynamic programming (26,27) algorithm is |
---|
278 | used with a residue weight matrix and penalties for opening and extending |
---|
279 | gaps. Each step consists of aligning two existing alignments or sequences. |
---|
280 | Gaps that are present in older alignments remain fixed. In the basic |
---|
281 | algorithm, new gaps that are introduced at each stage get full gap opening and |
---|
282 | extension penalties, even if they are introduced inside old gap positions (see |
---|
283 | the section on gap penalties below for modifications to this rule). In order to |
---|
284 | calculate the score between a position from one sequence or alignment and |
---|
285 | one from another, the average of all the pairwise weight matrix scores from |
---|
286 | the amino acids in the two sets of sequences is used i.e. if you align 2 |
---|
287 | alignments with 2 and 4 sequences respectively, the score at each position is |
---|
288 | the average of 8 (2x4) comparisons. This is illustrated in figure 2. If either set |
---|
289 | of sequences contains one or more gaps in one of the positions being |
---|
290 | considered, each gap versus a residue is scored as zero. The default amino |
---|
291 | acid weight matrices we use are rescored to have only positive values. |
---|
292 | Therefore, this treatment of gaps treats the score of a residue versus a gap as |
---|
293 | having the worst possible score. When sequences are weighted (see |
---|
294 | improvements to progressive alignment, below), each weight matrix value is |
---|
295 | multiplied by the weights from the 2 sequences, as illustrated in figure 2. |
---|
296 | |
---|
297 | |
---|
298 | Improvements to progressive alignment |
---|
299 | |
---|
300 | All of the remaining modifications apply only to the final progressive |
---|
301 | alignment stage. Sequence weighting is relatively straightforward and is |
---|
302 | already widely used in profile searches (15,16). The treatment of gap penalties |
---|
303 | is more complicated. Initial gap penalties are calculated depending on the |
---|
304 | weight matrix, the similarity of the sequences, and the length of the |
---|
305 | sequences. Then, an attempt is made to derive sensible local gap opening |
---|
306 | penalties at every position in each pre-aligned group of sequences that will |
---|
307 | vary as new sequences are added. The use of different weight matrices as the |
---|
308 | alignment progresses is novel and largely by-passes the problem of initial |
---|
309 | choice of weight matrix. The final modification allows us to delay the |
---|
310 | addition of very divergent sequences until the end of the alignment process |
---|
311 | when all of the more closely related sequences have already been aligned. |
---|
312 | |
---|
313 | |
---|
314 | Sequence weighting |
---|
315 | |
---|
316 | Sequence weights are calculated directly from the guide tree. The weights |
---|
317 | are normalised such that the biggest one is set to 1.0 and the rest are all less |
---|
318 | than one. Groups of closely related sequences receive lowered weights |
---|
319 | because they contain much duplicated information. Highly divergent |
---|
320 | sequences without any close relatives receive high weights. These weights |
---|
321 | are used as simple multiplication factors for scoring positions from different |
---|
322 | sequences or prealigned groups of sequences. The method is illustrated in |
---|
323 | figure 2. In the globin example in figure 1, the two alpha globins get |
---|
324 | downweighted because they are almost duplicate sequences (as do the two |
---|
325 | beta globins); they receive a combined weight of only slightly more than if a |
---|
326 | single alpha globin was used. |
---|
327 | |
---|
328 | |
---|
329 | Initial gap penalties |
---|
330 | |
---|
331 | Initially, two gap penalties are used: a gap opening penalty (GOP) which gives |
---|
332 | the cost of opening a new gap of any length and a gap extension penalty (GEP) |
---|
333 | which gives the cost of every item in a gap. Initial values can be set by the |
---|
334 | user from a menu. The software then automatically attempts to choose |
---|
335 | appropriate gap penalties for each sequence alignment, depending on the |
---|
336 | following factors. |
---|
337 | |
---|
338 | 1) Dependence on the weight matrix |
---|
339 | |
---|
340 | It has been shown (16,28) that varying the gap penalties used with different |
---|
341 | weight matrices can improve the accuracy of sequence alignments. Here, we |
---|
342 | use the average score for two mismatched residues (ie. off-diagonal values in |
---|
343 | the matrix) as a scaling factor for the GOP. |
---|
344 | |
---|
345 | 2) Dependence on the similarity of the sequences |
---|
346 | |
---|
347 | The percent identity of the two (groups of) sequences to be aligned is used to |
---|
348 | increase the GOP for closely related sequences and decrease it for more |
---|
349 | divergent sequences on a linear scale. |
---|
350 | |
---|
351 | 3) Dependence on the lengths of the sequences |
---|
352 | |
---|
353 | The scores for both true and false sequence alignments grow with the length |
---|
354 | of the sequences. We use the logarithm of the length of the shorter sequence |
---|
355 | to increase the GOP with sequence length. |
---|
356 | |
---|
357 | Using these three modifications, the initial GOP calculated by the program is: |
---|
358 | |
---|
359 | GOP->(GOP+log(MIN(N,M))) * (average residue mismatch score) * |
---|
360 | (percent identity scaling factor) |
---|
361 | where N, M are the lengths of the two sequences. |
---|
362 | |
---|
363 | 4) Dependence on the difference in the lengths of the sequences |
---|
364 | |
---|
365 | The GEP is modified depending on the difference between the lengths of the |
---|
366 | two sequences to be aligned. If one sequence is much shorter than the other, |
---|
367 | the GEP is increased to inhibit too many long gaps in the shorter sequence. |
---|
368 | The initial GEP calculated by the program is: |
---|
369 | |
---|
370 | GEP -> GEP*(1.0+|log(N/M)|) |
---|
371 | where N, M are the lengths of the two sequences. |
---|
372 | |
---|
373 | |
---|
374 | Position-specific gap penalties |
---|
375 | |
---|
376 | In most dynamic programming applications, the initial gap opening and |
---|
377 | extension penalties are applied equally at every position in the sequence, |
---|
378 | regardless of the location of a gap, except for terminal gaps which are usually |
---|
379 | allowed at no cost. In CLUSTAL W, before any pair of sequences or |
---|
380 | prealigned groups of sequences are aligned, we generate a table of gap opening |
---|
381 | penalties for every position in the two (sets of) sequences. An example is |
---|
382 | shown in figure 3. We manipulate the initial gap opening penalty in a |
---|
383 | position specific manner, in order to make gaps more or less likely at different |
---|
384 | positions. |
---|
385 | |
---|
386 | The local gap penalty modification rules are applied in a hierarchical manner. |
---|
387 | The exact details of each rule are given below. Firstly, if there is a gap at a |
---|
388 | position, the gap opening and gap extension penalties are lowered; the other |
---|
389 | rules do not apply. This makes gaps more likely at positions where there are |
---|
390 | already gaps. If there is no gap at a position, then the gap opening penalty is |
---|
391 | increased if the position is within 8 residues of an existing gap. This |
---|
392 | discourages gaps that are too close together. Finally, at any position within a |
---|
393 | run of hydrophilic residues, the penalty is decreased. These runs usually |
---|
394 | indicate loop regions in protein structures. If there is no run of hydrophilic |
---|
395 | residues, the penalty is modified using a table of residue specific gap |
---|
396 | propensities (12). These propensities were derived by counting the frequency |
---|
397 | of each residue at either end of gaps in alignments of proteins of known |
---|
398 | structure. An illustration of the application of these rules from one part of |
---|
399 | the globin example, in figure 1, is given in figure 3. |
---|
400 | |
---|
401 | 1) Lowered gap penalties at existing gaps |
---|
402 | |
---|
403 | If there are already gaps at a position, then the GOP is reduced in proportion |
---|
404 | to the number of sequences with a gap at this position and the GEP is lowered |
---|
405 | by a half. The new gap opening penalty is calculated as: |
---|
406 | |
---|
407 | GOP -> GOP*0.3*(no. of sequences without a gap/no. of sequences). |
---|
408 | |
---|
409 | 2) Increased gap penalties near existing gaps |
---|
410 | |
---|
411 | If a position does not have any gaps but is within 8 residues of an existing gap, |
---|
412 | the GOP is increased by: |
---|
413 | |
---|
414 | GOP -> GOP*(2+((8-distance from gap)*2)/8) |
---|
415 | |
---|
416 | 3) Reduced gap penalties in hydrophilic stretches |
---|
417 | |
---|
418 | Any run of 5 hydrophilic residues is considered to be a hydrophilic stretch. |
---|
419 | The residues that are to be considered hydrophilic may be set by the user but |
---|
420 | are conservatively set to D, E, G, K, N, Q, P, R or S by default. If, at any |
---|
421 | position, there are no gaps and any of the sequences has such a stretch, the |
---|
422 | GOP is reduced by one third. |
---|
423 | |
---|
424 | |
---|
425 | 4) Residue specific penalties |
---|
426 | |
---|
427 | If there is no hydrophilic stretch and the position does not contain any gaps, |
---|
428 | then the GOP is multiplied by one of the 20 numbers in table 1, depending on |
---|
429 | the residue. If there is a mixture of residues at a position, the multiplication |
---|
430 | factor is the average of all the contributions from each sequence. |
---|
431 | |
---|
432 | |
---|
433 | Weight matrices |
---|
434 | |
---|
435 | Two main series of weight matrices are offered to the user: the Dayhoff PAM |
---|
436 | series (3) and the BLOSUM series (4). The default is the BLOSUM series. In |
---|
437 | each case, there is a choice of matrix ranging from strict ones, useful for |
---|
438 | comparing very closely related sequences to very "soft" ones that are useful |
---|
439 | for comparing very distantly related sequences. Depending on the distance |
---|
440 | between the two sequences or groups of sequences to be compared, we switch |
---|
441 | between 4 different matrices. The distances are measured directly from the |
---|
442 | guide tree. The ranges of distances and tables used with the PAM series of |
---|
443 | matrices is: 80-100%:PAM20, 60-80%:PAM60, 40-60%:PAM120, 0-40%:PAM350. |
---|
444 | The range used with the BLOSUM series is:80-100%:BLOSUM80, |
---|
445 | 60-80%:BLOSUM62, 30-60%:BLOSUM45, 0-30%:BLOSUM30. |
---|
446 | |
---|
447 | |
---|
448 | Divergent sequences |
---|
449 | |
---|
450 | The most divergent sequences (most different, on average from all of the |
---|
451 | other sequences) are usually the most difficult to align correctly. It is |
---|
452 | sometimes better to delay the incorporation of these sequences until all of the |
---|
453 | more easily aligned sequences are merged first. This may give a better chance |
---|
454 | of correctly placing the gaps and matching weakly conserved positions against |
---|
455 | the rest of the sequences. A choice is offered to set a cut off (default is 40% |
---|
456 | identity or less with any other sequence) that will delay the alignment of the |
---|
457 | divergent sequences until all of the rest have been aligned. |
---|
458 | |
---|
459 | |
---|
460 | Software and Algorithms |
---|
461 | |
---|
462 | |
---|
463 | Dynamic Programming |
---|
464 | |
---|
465 | The most demanding part of the multiple alignment strategy, in terms of |
---|
466 | computer processing and memory usage, is the alignment of two (groups of) |
---|
467 | sequences at each step in the final progressive alignment. To make it |
---|
468 | possible to align very long sequences (e.g. dynein heavy chains at ~ 5,000 |
---|
469 | residues) in a reasonable amount of memory, we use the memory efficient |
---|
470 | dynamic programming algorithm of Myers and Miller (26). This sacrifices |
---|
471 | some processing time but makes very large alignments practical in very little |
---|
472 | memory. One disadvantage of this algorithm is that it does not allow |
---|
473 | different gap opening and extension penalties at each position. We have |
---|
474 | modified the algorithm so as to allow this and the details are described in a |
---|
475 | separate paper (27). |
---|
476 | |
---|
477 | |
---|
478 | |
---|
479 | Menus/file formats |
---|
480 | |
---|
481 | Six different sequence input formats are detected automatically and read by |
---|
482 | the program: EMBL/Swiss Prot, NBRF/PIR, Pearson/FASTA (29), GCG/MSF |
---|
483 | (30), GDE (Steven Smith, Harvard University Genome Center) and CLUSTAL |
---|
484 | format alignments. The last three formats allow users to read in complete |
---|
485 | alignments (e.g. for calculating phylogenetic trees or for addition of new |
---|
486 | sequences to an existing alignment). Alignment output may be requested in |
---|
487 | standard CLUSTAL format (self-explanatory blocked alignments) or in |
---|
488 | formats compatible with the GDE, PHYLIP (31) or GCG (30) packages. The |
---|
489 | program offers the user the ability to calculate Neighbour-Joining |
---|
490 | phylogenetic trees from existing alignments with options to correct for |
---|
491 | multiple hits (32,33) and to estimate confidence levels using a bootstrap |
---|
492 | resampling procedure (34). The trees may be output in the "New |
---|
493 | Hampshire" format that is compatible with the PHYLIP package (31). |
---|
494 | |
---|
495 | Alignment to an alignment |
---|
496 | |
---|
497 | Profile alignment is used to align two existing alignments (either of which |
---|
498 | may consist of just one sequence) or to add a series of new sequences to an |
---|
499 | existing alignment. This is useful because one may wish to build up a |
---|
500 | multiple alignment gradually, choosing different parameters manually, or |
---|
501 | correcting intermediate errors as the alignment proceeds. Often, just a few |
---|
502 | sequences cause misalignments in the progressive algorithm and these can be |
---|
503 | removed from the process and then added at the end by profile alignment. A |
---|
504 | second use is where one has a high quality reference alignment and wishes to |
---|
505 | keep it fixed while adding new sequences automatically. |
---|
506 | |
---|
507 | |
---|
508 | Portability/Availability |
---|
509 | |
---|
510 | The full source code of the package is provided free to academic users. The |
---|
511 | program will run on any machine with a full ANSI conforming C compiler. |
---|
512 | It has been tested on the following hardware/software combinations: |
---|
513 | Decstation/Ultrix, Vax or ALPHA/VMS, Silicon Graphics/IRIX. The source |
---|
514 | code and documentation are available by E-mail from the EMBL file server |
---|
515 | (send the words HELP and HELP SOFTWARE on two lines to the internet |
---|
516 | address: |
---|
517 | Netserv@EMBL-Heidelberg.DE) or by anonymous FTP from |
---|
518 | FTP.EMBL-Heidelberg.DE. Queries may be addressed by E-mail to |
---|
519 | Des.Higgins@EBI.AC.UK or Gibson@EMBL-Heidelberg.DE. |
---|
520 | |
---|
521 | |
---|
522 | RESULTS AND DISCUSSION |
---|
523 | |
---|
524 | |
---|
525 | Alignment of SH3 Domains |
---|
526 | |
---|
527 | The ~60 residue SH3 domain was chosen to illustrate the performance of |
---|
528 | CLUSTAL W, as there is a reference manual alignment (23) and the fold is |
---|
529 | known (24). SH3 domains, with a minimum similarity below 12% identity, |
---|
530 | are poorly aligned by progressive alignment programs such as CLUSTAL V |
---|
531 | and PILEUP: neither program can generate the correct blocks corresponding to |
---|
532 | the secondary structure elements. |
---|
533 | |
---|
534 | Figure 4 shows an alignment generated by CLUSTAL W of the example set of |
---|
535 | SH3 domains. The alignment was generated in two steps. After progressive |
---|
536 | alignment, five blocks were produced, corresponding to structural elements, |
---|
537 | with gaps inserted exclusively in the known loop regions. The beta strands in |
---|
538 | blocks 1, 4 and 5 were all correctly superposed. However, four sequences in |
---|
539 | block 2 and one sequence in block 3 were misaligned by 1-2 residues |
---|
540 | (underlined in figure 4). A second progressive alignment of the aligned |
---|
541 | sequences, including the gaps, improved this alignment: A single misaligned |
---|
542 | sequence, H_P55, remains in block 2 (boxed in figure 4), while block 3 is now |
---|
543 | completely aligned. This alignment corrects several errors (eg. P85A, P85B |
---|
544 | and FUS1) in the manual alignment (23). |
---|
545 | |
---|
546 | The SH3 alignment illustrates several features of CLUSTAL W usage. Firstly, |
---|
547 | in a practical application involving divergent sequences, the initial |
---|
548 | progressive alignment is likely to be a good but not perfect approximation to |
---|
549 | the correct alignment. The alignment quality can be improved in a number of |
---|
550 | ways. If the block structure of the alignment appears to be correct, realignment |
---|
551 | of the alignment will usually improve most of the misaligned blocks: the |
---|
552 | existing gaps allow the blocks to "float" cheaply to a locally optimal position |
---|
553 | without disturbing the rest of the alignment. Remaining sequences which are |
---|
554 | doubtfully aligned can then be individually tested by profile alignment to the |
---|
555 | remainder: the misaligned H_P55 SH3 domain can be correctly aligned by |
---|
556 | profile (with GOP <= 8). The indel regions in the final alignment can then be |
---|
557 | manually cleaned up: Usually the exact alignment in the loop regions is not |
---|
558 | determinable, and may have no meaning in structural terms. It is then |
---|
559 | desirable to have a single gap per structural loop. CLUSTAL W achieved this |
---|
560 | for two of the four SH3 loop regions (figure 4). |
---|
561 | |
---|
562 | If the block structure of the alignment appears suspect, greater intervention by |
---|
563 | the user may be required. The most divergent sequences, especially if they |
---|
564 | have large insertions (which can be discerned with the aid of dot matrix |
---|
565 | plots), should be left out of the progressive alignment. If there are sets of |
---|
566 | closely related sequences that are deeply diverged from other sets, these can be |
---|
567 | separately aligned and then merged by profile alignment. Incorrectly |
---|
568 | determined sequences, containing frameshifts, can also confound regions of |
---|
569 | an alignment: these can be hard to detect but sometimes they have been |
---|
570 | grouped within the excluded divergent sequences: then they may be revealed |
---|
571 | when they are individually compared to the alignment as having apparently |
---|
572 | nonsense segments with respect to the other sequences. |
---|
573 | |
---|
574 | |
---|
575 | |
---|
576 | Finding the best alignment |
---|
577 | |
---|
578 | In cases where all of the sequences in a data set are very similar (e.g. no pair |
---|
579 | less than 35% identical), CLUSTAL W will find an alignment which is |
---|
580 | difficult to improve by eye. In this sense, the alignment is optimal with |
---|
581 | regard to the alternative of manual alignment. Mathematically, this is vague |
---|
582 | and can only be put on a more systematic footing by finding an objective |
---|
583 | function (a measure of multiple alignment quality) that exactly mirrors the |
---|
584 | information used by an "expert" to evaluate an alignment. Nonetheless, if an |
---|
585 | alignment is impossible to improve by eye, then the program has achieved a |
---|
586 | very useful result. |
---|
587 | |
---|
588 | In more difficult cases, as more divergent sequences are included, it becomes |
---|
589 | increasingly difficult to find good alignments and to evaluate them. What |
---|
590 | we find with CLUSTAL W is that the basic block-like structure of the |
---|
591 | alignment (corresponding to the major secondary structure elements) is |
---|
592 | usually recovered, with some of the most divergent sequences misaligned in |
---|
593 | small regions. This is a very useful starting point for manual refinement as it |
---|
594 | helps define the major blocks of similarity. The problem sequences can be |
---|
595 | removed from the analysis and realigned to the rest of the sequences |
---|
596 | automatically or with different parameter settings. An examination of the |
---|
597 | tree used to guide the alignment will usually show which sequences will be |
---|
598 | most unreliably placed (those that branch off closest to the root and/or those |
---|
599 | that align to other single sequences at a very low level of sequence identity |
---|
600 | rather than align to a group of pre-aligned sequences). Finally, one can |
---|
601 | simply iterate the multiple alignment process by feeding an output alignment |
---|
602 | back into CLUSTAL W and repeating the multiple alignment process (using |
---|
603 | the same or different parameters). The SH3 domain alignment in figure 4 |
---|
604 | was derived in this way by 2 passes using default parameters. In the second |
---|
605 | pass, the local gap penalties are dominated by the placement of the initial |
---|
606 | major gap positions. The alignment will either remain unchanged or will |
---|
607 | converge rapidly (after 1 or 2 extra passes) on a better solution. If the |
---|
608 | placement of the initial gaps is approximately correct but some of the |
---|
609 | sequences are locally misaligned, this works well. |
---|
610 | |
---|
611 | |
---|
612 | Comparison with other methods |
---|
613 | |
---|
614 | Recently, several papers have addressed the problem of position specific |
---|
615 | parameters for multiple alignment. In one case (35), local gap penalties are |
---|
616 | increased in alpha helical and beta strand regions, when the 3-D structures of |
---|
617 | one or more of the sequences are known. In a second case (36), a hidden |
---|
618 | Markov model was used to estimate position specific gap penalties and |
---|
619 | residue substitution weight matrices when large numbers of examples of a |
---|
620 | protein domain were known. With CLUSTAL W, we attempt to derive the |
---|
621 | same information purely from the set of sequences to be aligned. Therefore, |
---|
622 | we can apply the method to any set of sequences. The success of this approach |
---|
623 | will depend on the number of available sequences and their evolutionary |
---|
624 | relationships. It will also depend on the decision making process during |
---|
625 | multiple alignment (e.g. when to change weight matrix) and the accuracy and |
---|
626 | appropriateness of our parameterisation. In the long term, this can only be |
---|
627 | evaluated by exhaustive testing of sets of sequences where the correct |
---|
628 | alignment (or parts of it) are known from structural information. What is |
---|
629 | clear, however, is that the modifications described here significantly improve |
---|
630 | the sensitivity of the progressive multiple alignment approach. This is |
---|
631 | achieved with almost no sacrifice in speed and efficiency. |
---|
632 | |
---|
633 | There are several areas where further improvements in sensitivity and |
---|
634 | accuracy can be made. Firstly, the residue weight matrices and gap settings |
---|
635 | can be made more accurate as more and more data accumulate, while |
---|
636 | matrices for specific sequence types can be derived (e.g. for transmembrane |
---|
637 | regions (37)). Secondly, stochastic or iterative optimisation methods can be |
---|
638 | used to refine initial alignments (7,9,10). CLUSTAL W could be run with |
---|
639 | several sets of starting parameters and in each case, the alignments refined |
---|
640 | according to an objective function. The search for a good objective function, |
---|
641 | that takes into account the sequence and position specific information used in |
---|
642 | CLUSTAL W is a key area of research. Finally, the average number of |
---|
643 | examples of each protein domain or family is growing steadily. It is not only |
---|
644 | important that programs can cope with the large volumes of data that are |
---|
645 | being generated, they should be able to exploit the new information to make |
---|
646 | the alignments more and more accurate. Globally optimal alignments |
---|
647 | (according to an objective function) may not always be possible but the |
---|
648 | problem may be avoided if sufficiently large volumes of data become |
---|
649 | available. CLUSTAL W is a step in this direction. |
---|
650 | |
---|
651 | ACKNOWLEDGEMENTS |
---|
652 | |
---|
653 | Numerous people have offered advice and suggestions for improvements to |
---|
654 | earlier versions of the CLUSTAL programs. D.H. wishes to apologise to all of |
---|
655 | the irate CLUSTAL V users who had to live with the bugs and lack of facilities |
---|
656 | for getting trees in the New Hampshire format. We wish to specifically thank |
---|
657 | Jeroen Coppieters who suggested using a series of weight matrices and Steven |
---|
658 | Henikoff for advice on using the BLOSUM matrices. We are grateful to Rein |
---|
659 | Aasland, Peer Bork, Ariel Blocker and Brtrand Seraphin for providing |
---|
660 | challenging alignment problems. T.G. and J.T. thank Kevin Leonard for |
---|
661 | support and encouragement. Finally, we thank all of the people who were |
---|
662 | involved with various CLUSTAL programs over the years, namely: Paul |
---|
663 | Sharp, Rainer Fuchs and Alan Bleasby. |
---|
664 | |
---|
665 | |
---|
666 | REFERENCES |
---|
667 | |
---|
668 | 1.Feng, D.-F. and Doolittle, R.F. (1987). J. Mol. Evol. 25, 351-360. |
---|
669 | 2.Needleman, S.B. and Wunsch, C.D. (1970). J. Mol. Biol. 48, 443-453. |
---|
670 | 3.Dayhoff, M.O., Schwartz, R.M. and Orcutt, B.C. (1978) in Atlas of Protein |
---|
671 | Sequence and Structure, vol. 5, suppl. 3 (Dayhoff, M.O., ed.), pp 345-352, |
---|
672 | NBRF, Washington. |
---|
673 | 4.Henikoff, S. and Henikoff, J.G. (1992). Proc. Natl. Acad. Sci. USA 89, 10915- |
---|
674 | 10919. |
---|
675 | 5.Lipman, D.J., Altschul, S.F. and Kececioglu, J.D. (1989). Proc. Natl. Acad. Sci. |
---|
676 | USA 86, 4412-4415. |
---|
677 | 6.Barton, G.J. and Sternberg, M.J.E. (1987). J. Mol. Biol. 198, 327-337. |
---|
678 | 7.Gotoh, O. (1993). CABIOS 9, 361-370. |
---|
679 | 8.Altschul, S.F. (1989). J. Theor. Biol. 138, 297-309. |
---|
680 | 9.Lukashin, A.V., Engelbrecht, J. and Brunak, S. (1992). Nucl. Acids Res. 20, |
---|
681 | 2511-2516. |
---|
682 | 10.Lawrence, C.E., Altschul, S.F., Boguski, M.S., Liu, J.S., Neuwald, A.F. and |
---|
683 | Wooton, J.C. (1993). Science, 262, 208-214. |
---|
684 | 11.Vingron, M. and Waterman, M.S. (1993). J. Mol. Biol. 234, 1-12. |
---|
685 | 12.Pascarella, S. and Argos, P. (1992). J. Mol. Biol. 224, 461-471. |
---|
686 | 13.Collins, J.F. and Coulson, A.F.W. (1987). In Nucleic acid and protein |
---|
687 | sequence analysis a practical approach, Bishop, M.J. and Rawlings, C.J. ed., |
---|
688 | chapter 13, pp. 323-358. |
---|
689 | 14.Vingron, M. and Sibbald, P.R. (1993). Proc. Natl. Acad. Sci. USA, 90, 8777- |
---|
690 | 8781. |
---|
691 | 15.Thompson, J.D., Higgins, D.G. and Gibson, T.J. (1994). CABIOS, 10, 19-29. |
---|
692 | 16.Lthy, R., Xenarios, I. and Bucher, P. (1994). Protein Science, 3, 139-146. |
---|
693 | 17.Higgins, D.G. and Sharp, P.M. (1988). Gene, 73, 237-244. |
---|
694 | 18.Higgins, D.G. and Sharp, P.M. (1989). CABIOS, 5, 151-153. |
---|
695 | 19.Higgins, D.G., Bleasby, A.J. and Fuchs, R. (1992). CABIOS, 8, 189-191. |
---|
696 | 20.Sneath, P.H.A. and Sokal, R.R. (1973). Numerical Taxonomy, W.H. |
---|
697 | Freeman, San Francisco. |
---|
698 | 21.Saitou, N. and Nei, M. (1987). Mol. Biol. Evol. 4, 406-425. |
---|
699 | 22.Wilbur, W.J. and Lipman, D.J. (1983). Proc. Natl. Acad. Sci. USA, 80, 726- |
---|
700 | 730. |
---|
701 | 23.Musacchio, A., Gibson, T., Lehto, V.-P. and Saraste, M. (1992). FEBS Lett. |
---|
702 | 307, 55-61. |
---|
703 | 24.Musacchio, A., Noble, M., Pauptit, R., Wierenga, R. and Saraste, M. (1992). |
---|
704 | Nature, 359, 851-855. |
---|
705 | 25.Bashford, D., Chothia, C. and Lesk, A.M. (1987). J. Mol. Biol. 196, 199-216. |
---|
706 | 26.Myers, E.W. and Miller, W. (1988). CABIOS, 4, 11-17. |
---|
707 | 27.Thompson, J.D. (1994). CABIOS, (Submitted). |
---|
708 | 28.Smith, T.F., Waterman, M.S. and Fitch, W.M. (1981). J. Mol. Evol. 18, 38-46. |
---|
709 | 29.Pearson, W.R. and Lipman, D.J. (1988). Proc. Natl. Acad. Sci. USA. 85, 2444- |
---|
710 | 2448. |
---|
711 | 30.Devereux, J., Haeberli, P. and Smithies, O. (1984). Nucleic Acids Res. 12, |
---|
712 | 387-395. |
---|
713 | 31.Felsenstein, J. (1989). Cladistics 5, 164-166. |
---|
714 | 32.Kimura, M. (1980). J. Mol. Evol. 16, 111-120. |
---|
715 | 33.Kimura, M. (1983). The Neutral Theory of Molecular Evolution. |
---|
716 | Cambridge University Press, Cambridge. |
---|
717 | 34.Felsenstein, J. (1985). Evolution 39, 783-791. |
---|
718 | 35.Smith, R.F. and Smith, T.F. (1992) Protein Engineering 5, 35-41. |
---|
719 | 36.Krogh, A., Brown, M., Mian, S., Sjlander, K. and Haussler, D. (1994) J. Mol. |
---|
720 | Biol. 235-1501-1531. |
---|
721 | 37.Jones, D.T., Taylor, W.R. and Thornton, J.M. (1994). FEBS Lett. 339, 269-275. |
---|
722 | 38.Bairoch, A. and Bckmann, B. (1992) Nucleic Acids Res., 20, 2019-2022. |
---|
723 | 39.Noble, M.E.M., Musacchio, A., Saraste, M., Courtneidge, S.A. and |
---|
724 | Wierenga, R.K. (1993) EMBO J. 12, 2617-2624. |
---|
725 | 40.Kabsch, W. and Sander, C. (1983) Biopolymers, 22, 2577-2637. |
---|
726 | |
---|
727 | FIGURE LEGENDS |
---|
728 | |
---|
729 | Figure 1. The basic progressive alignment procedure, illustrated using a set of |
---|
730 | 7 globins of known tertiary structure. The sequence names are from Swiss |
---|
731 | Prot (38): Hba_Horse: horse alpha globin; Hba_Human: human alpha globin; |
---|
732 | Hbb_Horse: horse beta globin; Hbb_Human: human beta globin; Myg_Phyca: |
---|
733 | sperm whale myoglobin; Glb5_Petma: lamprey cyanohaemoglobin; |
---|
734 | Lgb2_Luplu: lupin leghaemoglobin. In the distance matrix, the mean |
---|
735 | number of differences per residue is given. The unrooted tree shows all |
---|
736 | branch lengths drawn to scale. In the rooted tree, all branch lengths (mean |
---|
737 | number of differences per residue along each branch) are given as well as |
---|
738 | weights for each sequence. In the multiple alignment, the approximate |
---|
739 | positions of the 7 alpha helices, common to all 7 proteins are shown. This |
---|
740 | alignment was derived using CLUSTAL W with default parameters and the |
---|
741 | PAM (3) series of weight matrices. |
---|
742 | |
---|
743 | Figure 2. The scoring scheme for comparing two positions from two |
---|
744 | alignments. Two sections of alignment with 4 and 2 sequences respectively |
---|
745 | are shown. The score of the position with amino acids T,L,K,K versus the |
---|
746 | position with amino acids V and I is given with and without sequence |
---|
747 | weights. M(X,Y) is the weight matrix entry for amino acid X versus amino |
---|
748 | acid Y. Wn is the weight for sequence n. |
---|
749 | |
---|
750 | Figure 3. The variation in local gap opening penalty is plotted for a section of |
---|
751 | alignment. The inital gap opening penalty is indicated by a dotted line. Two |
---|
752 | hydrophilic stretches are underlined. The lowest penalties correspond to the |
---|
753 | ends of the alignment, the hydrophilic stretches and the two positions with |
---|
754 | gaps. The highest values are within 8 residues of the two gap positions. The |
---|
755 | rest of the variation is caused by the residue specific gap penalties (12). |
---|
756 | |
---|
757 | Figure 4. CLUSTAL W Alignment of a set of SH3 domains taken from (23). |
---|
758 | Secondary structure assignments for the solved Spectrin (24) and Fyn (39) |
---|
759 | domains are according to DSSP (40). The alignment was generated in two |
---|
760 | steps using default parameters. After full multiple alignment, the aligned |
---|
761 | sequences were realigned. Segments which were correctly aligned in the |
---|
762 | second pass are underlined. The single misaligned segment in H_P55 and the |
---|
763 | misaligned residue in H_NCK/2 are boxed. |
---|
764 | |
---|
765 | The sequences are coloured to illustrate significant features. All G (orange) |
---|
766 | and P (yellow) are coloured. Other residues matching a frequent occurrence of |
---|
767 | a property in a column are coloured: hydrophobic = blue; hydrophobic |
---|
768 | tendency = light blue; basic = red; acidic = purple; hydrophilic = green; White |
---|
769 | = unconserved. The alignment figure was prepared with the GDE sequence |
---|
770 | editor (S. Smith, Harvard University) and COLORMASK (J. Thompson, |
---|
771 | EMBL). |
---|
772 | |
---|
773 | |
---|
774 | |
---|
775 | |
---|
776 | Table 1. Pascarella and Argos residue specific gap modification factors. |
---|
777 | ----------------------------------------------------------------------------------- |
---|
778 | A 1.13 M 1.29 |
---|
779 | C 1.13 N 0.63 |
---|
780 | D 0.96 P 0.74 |
---|
781 | E 1.31 Q 1.07 |
---|
782 | F 1.20 R 0.72 |
---|
783 | G 0.61 S 0.76 |
---|
784 | H 1.00 T 0.89 |
---|
785 | I 1.32 V 1.25 |
---|
786 | K 0.96 Y 1.00 |
---|
787 | L 1.21 W 1.23 |
---|
788 | ----------------------------------------------------------------------------------- |
---|
789 | The values are normalised around a mean value of 1.0 for H. The lower the |
---|
790 | value, the greater the chance of having an adjacent gap. These are derived |
---|
791 | from the original table of relative frequencies of gaps adjacent to each residue |
---|
792 | (12) by subtraction from 2.0. |
---|
793 | |
---|
794 | |
---|