source: trunk/GDE/CLUSTALW/clustalw.ms

Last change on this file was 13845, checked in by westram, 9 years ago
  • remove executable flag
  • Property svn:eol-style set to native
  • Property svn:keywords set to Author Date Id Revision
File size: 43.4 KB
Line 
1This is just an ASCII text version of the manuscript describing
2Clustal W, without the figures.  It was published:
3
4Nucleic Acids Research, 22(22):4673-4680.
5
6
7
8CLUSTAL W: improving the sensitivity of progressive multiple
9sequence alignment through sequence weighting, position specific
10gap penalties and weight matrix choice.
11
12
13
14Julie D. Thompson, Desmond G. Higgins1 and Toby J. Gibson*
15
16European Molecular Biology Laboratory
17Postfach 102209
18Meyerhofstrasse 1
19D-69012 Heidelberg
20Germany
21
22
23Phone:          +49-6221-387398
24Fax:            +49-6221-387306
25E-mail:         Gibson@EMBL-Heidelberg.DE
26                Des.Higgins@EBI.AC.UK
27                Thompson@EMBL-Heidelberg.DE
28
29
30Keywords:       Multiple alignment, phylogenetic tree, weight matrix, gap
31                penalty, dynamic programming, sequence weighting.
32
33
341 Current address:
35European Bioinformatics Institute
36Hinxton Hall
37Hinxton
38Cambridge CB10 1RQ
39UK.
40
41* To whom correspondence should be addressed
42
43
44ABSTRACT
45
46The sensitivity of the commonly used progressive multiple sequence
47alignment method has been greatly improved for the alignment of divergent
48protein sequences.   Firstly, individual weights are assigned to each sequence
49in a partial alignment in order to downweight near-duplicate sequences and
50upweight the most divergent ones.   Secondly, amino acid substitution
51matrices are varied at different alignment stages according to the divergence
52of the sequences to be aligned.    Thirdly, residue specific gap penalties and
53locally reduced gap penalties in hydrophilic regions encourage new gaps in
54potential loop regions rather than regular secondary structure.   Fourthly,
55positions in early alignments where gaps have been opened receive locally
56reduced gap penalties to encourage the opening up of new gaps at these
57positions.  These modifications are incorporated into a new program,
58CLUSTAL W which is freely available. 
59
60
61INTRODUCTION
62
63The simultaneous alignment of many nucleotide or amino acid sequences is
64now an essential tool in molecular biology.  Multiple alignments are used to
65find diagnostic patterns to characterise protein families; to detect or
66demonstrate homology between new sequences and existing families of
67sequences; to help predict the secondary and tertiary structures of new
68sequences; to suggest oligonucleotide primers for PCR; as an essential prelude
69to molecular evolutionary analysis.   The rate of appearance of new sequence
70data is steadily increasing and the development of efficient and accurate
71automatic methods for multiple alignment is, therefore, of major
72importance.   The majority of automatic multiple alignments are now carried
73out using the "progressive" approach of Feng and Doolittle (1).   In this paper,
74we describe a number of improvements to the progressive multiple
75alignment method which greatly improve the sensitivity without sacrificing
76any of the speed and efficiency which makes this approach so practical.  The
77new methods are made available in a program called CLUSTAL W which is
78freely available and portable to a wide variety of computers and operating
79systems.
80
81In order to align just two sequences, it is standard practice to use dynamic
82programming (2).  This guarantees a mathematically optimal alignment,
83given a table of scores for matches and mismatches between all amino acids
84or nucleotides (e.g. the PAM250 matrix (3) or BLOSUM62 matrix (4)) and
85penalties for insertions or deletions of different lengths.   Attempts at
86generalising dynamic programming to multiple alignments are limited to
87small numbers of short sequences (5).  For much more than eight or so
88proteins of average length, the problem is uncomputable given current
89computer power.  Therefore, all of the methods capable of handling larger
90problems in practical timescales, make use of heuristics.    Currently, the most
91widely used approach is to exploit the fact that homologous sequences are
92evolutionarily related.  One can build up a multiple alignment progressively
93by a series of pairwise alignments, following the branching order in a
94phylogenetic tree (1).  One first aligns the most closely related sequences,
95gradually adding in the more distant ones.   This approach is sufficiently fast
96to allow alignments of virtually any size.   Further, in simple cases, the
97quality of the alignments is excellent, as judged by the ability to correctly align
98corresponding domains from sequences of known secondary or tertiary
99structure (6).  In more difficult cases, the alignments give good starting points
100for further automatic or manual refinement.
101
102This approach works well when the data set consists of sequences of different
103degrees of divergence.   Pairwise alignment of very closely related sequences
104can be carried out very accurately.   The correct answer may often be obtained
105using a wide range of parameter values (gap penalties and weight matrix).  By
106the time the most distantly related sequences are aligned, one already has a
107sample of aligned sequences which gives important information about the
108variability at each position.   The positions of the gaps that were introduced
109during the early alignments of the closely related sequences are not changed
110as new sequences are added.   This is justified because the placement of gaps
111in alignments between closely related sequences is much more accurate than
112between distantly related ones.   When all of the sequences are highly
113divergent (e.g. less than approximately 25-30% identity between any pair of
114sequences), this progressive approach becomes much less reliable.
115
116There are two major problems with the progressive approach:  the local
117minimum problem and the choice of alignment parameters.   The local
118minimum problem stems from the "greedy" nature of the alignment strategy. 
119The algorithm greedily adds sequences together, following the initial tree. 
120There is no guarantee that the global optimal solution, as defined by some
121overall measure of multiple alignment quality (7,8), or anything close to it,
122will be found.   More specifically, any mistakes (misaligned regions) made
123early in the alignment process cannot be corrected later as new information
124from other sequences is added.   This problem is frequently thought of as
125mainly resulting from an incorrect branching order in the initial tree.  The
126initial trees are derived from a matrix of distances between separately aligned
127pairs of sequences and are much less reliable than trees from complete
128multiple alignments.   In our experience, however, the real problem is caused
129simply by errors in the initial alignments.  Even if the topology of the guide
130tree is correct, each alignment step in the multiple alignment process may
131have some percentage of the residues misaligned.   This percentage will be
132very low on average for very closely related sequences but will increase as
133sequences diverge.   It is these misalignments which carry through from the
134early alignment steps that cause the local minimum problem.   The only way
135to correct this is to use an iterative or stochastic sampling procedure (e.g.
1367,9,10).   We do not directly address this problem in this paper.
137
138The alignment parameter choice problem is, in our view, at least as serious as
139the local minimum problem.   Stochastic or iterative algorithms will be just
140as badly affected as progressive ones if the parameters are inappropriate: they
141will arrive at a false global minimum.  Traditionally, one chooses one weight
142matrix and two gap penalties (one for opening a new gap and one for
143extending an existing gap) and hope that these will work well over all parts of
144all the sequences in the data set.   When the sequences are all closely related,
145this works.  The first reason is that virtually all residue weight matrices give
146most weight to identities.   When identities dominate an alignment, almost
147any weight matrix will find approximately the correct solution.   With very
148divergent sequences, however, the scores given to non-identical residues will
149become critically important; there will be more mismatches than identities.   
150Different weight matrices will be optimal at different evolutionary distances
151or for different classes of proteins. 
152
153The second reason is that the range of gap penalty values that will find the
154correct or best possible solution can be very broad for highly similar sequences
155(11).   As more and more divergent sequences are used, however, the exact
156values of the gap penalties become important for success.   In each case, there
157may be a very narrow range of values which will deliver the best alignment. 
158Further, in protein alignments, gaps do not occur randomly (i.e. with equal
159probability at all positions).  They occur far more often between the major
160secondary structural elements of alpha helices and beta strands than within
161(12).
162
163The major improvements described in this paper attempt to address the
164alignment parameter choice problem.   We dynamically vary the gap
165penalties in a position and residue specific manner. The observed relative
166frequencies of gaps adjacent to each of the 20 amino acids (12) are used to
167locally adjust the gap opening penalty after each residue.   Short stretches of
168hydrophilic residues (e.g. 5 or more) usually indicate loop or random coil
169regions and the gap opening penalties are locally reduced in these stretches.   
170In addition, the locations of the gaps found in the early alignments are also
171given reduced gap opening penalties.  It has been observed in alignments
172between sequences of known structure that gaps tend not to be closer than
173roughly eight residues on average (12).   We increase the gap opening penalty
174within eight residues of exising gaps.   The two main series of amino acid
175weight matrices that are used today are the PAM series (3) and the BLOSUM
176series (4).   In each case, there is a range of matrices to choose from.  Some
177matrices are appropriate for aligning very closely related sequences where
178most weight by far is given to identities, with only the most frequent
179conservative substitutions receiving high scores.  Other matrices work better
180at greater evolutionary distances where less importance is attached to
181identities (13).  We choose different weight matrices, as the alignment
182proceeds, depending on the estimated divergence of the sequences to be
183aligned at each stage. 
184
185Sequences are weighted to correct for unequal sampling across all
186evolutionary distances in the data set (14).   This downweights sequences that
187are very similar to other sequences in the data set and upweights the most
188divergent ones.  The weights are calculated directly from the branch lengths
189in the initial guide tree (15).   Sequence weighting has already been shown to
190be effective in improving the sensitivity of profile searches (15,16).  In the
191original CLUSTAL programs (17-19), the initial guide trees, used to guide the
192multiple alignment, were calculated using the UPGMA method (20).  We
193now use the Neighbour-Joining method (21) which is more robust against the
194effects of unequal evolutionary rates in different lineages and which gives
195better estimates of individual branch lengths.  This is useful because it is these
196branch lengths which are used to derive the sequence weights.  We also allow
197users to choose between fast approximate alignments (22) or full dynamic
198programming for the distance calculations used to make the guide tree.
199
200The new improvements dramatically improve the sensitivity of the
201progressive alignment method for difficult alignments involving highly
202diverged sequences.  We show one very demanding test case of over 60 SH3
203domains (23) which includes sequence pairs with as little as 12% identity and
204where there is only one exactly conserved residue across all of the sequences.   
205Using default parameters, we can achieve an alignment that is almost exactly
206correct, according to available structural information (24).   Using the program
207in a wide variety of situations, we find that it will normally find the correct
208alignment, in all but the most difficult and pathological of cases. 
209
210
211MATERIAL AND METHODS
212
213
214The basic alignment method
215
216The basic multiple alignment algorithm consists of three main stages: 1) all
217pairs of sequences are aligned separately in order to calculate a distance matrix
218giving the divergence of each pair of sequences; 2) a guide tree is calculated
219from the distance matrix; 3) the sequences are progressively aligned according
220to the branching order in the guide tree.   An example using 7 globin
221sequences of known tertiary structure (25) is given in figure 1.
222
223
2241) The distance matrix/pairwise alignments
225
226In the original CLUSTAL programs, the pairwise distances were calculated
227using a fast approximate method (22).   This allows very large numbers of
228sequences to be aligned, even on a microcomputer.   The scores are calculated
229as the number of k-tuple matches (runs of identical residues, typically 1 or 2
230long for proteins or 2 to 4 long for nucleotide sequences) in the best alignment
231between two sequences minus a fixed penalty for every gap.   We now offer a
232choice between this method and the slower but more accurate scores from full
233dynamic programming alignments using two gap penalties (for opening or
234extending gaps) and a full amino acid weight matrix.   These scores are
235calculated as the number of identities in the best alignment divided by the
236number of residues compared (gap positions are excluded).   Both of these
237scores are initially calculated as percent identity scores and are converted to
238distances by dividing by 100 and subtracting from 1.0 to give number of
239differences per site.   We do not correct for multiple substitutions in these
240initial distances.   In figure 1 we give the 7x7 distance matrix between the 7
241globin sequences calculated using the full dynamic programming method.
242
243
2442) The guide tree
245
246The trees used to guide the final multiple alignment process are calculated
247from the distance matrix of step 1 using the Neighbour-Joining method (21).   
248This produces unrooted trees with branch lengths proportional to estimated
249divergence along each branch.   The root is placed by a "mid-point" method
250(15) at a position where the means of the branch lengths on either side of the
251root are equal.   These trees are also used to derive a weight for each sequence
252(15).   The weights are dependent upon the distance from the root of the tree
253but sequences which have a common branch with other sequences share the
254weight derived from the shared branch.   In the example in figure 1, the
255leghaemoglobin (Lgb2_Luplu) gets a weight of 0.442 which is equal to the
256length of the branch from the root to it.  The Human beta globin
257(Hbb_Human) gets a weight consisting of the length of the branch leading to
258it that is not shared with any other sequences (0.081) plus half the length of
259the branch shared with the horse beta globin (0.226/2) plus one quarter the
260length of the branch shared by all four haemoglobins (0.061/4) plus one fifth
261the branch shared between the haemoglobins and the myoglobin (0.015/5)
262plus one sixth the branch leading to all the vertebrate globins (0.062).  This
263sums to a total of 0.221.  By contrast, in the normal progressive alignment
264algorithm, all sequences would be equally weighted.  The rooted tree with
265branch lengths and sequence weights for the 7 globins is given in figure 1. 
266
267
2683) Progressive alignment
269
270The basic procedure at this stage is to use a series of pairwise alignments to
271align larger and larger groups of sequences, following the branching order in
272the guide tree.   You proceed from the tips of the rooted tree towards the root.   
273In the globin example in figure 1 you align the sequences in the following
274order: human vs. horse beta globin; human vs. horse alpha globin; the 2
275alpha globins vs. the 2 beta globins; the myoglobin vs. the haemoglobins; the
276cyanohaemoglobin vs the haemoglobins plus myoglobin; the leghaemoglobin
277vs. all the rest.  At each stage a full dynamic programming (26,27) algorithm is
278used with a residue weight matrix and penalties for opening and extending
279gaps.   Each step consists of aligning two existing alignments or sequences. 
280Gaps that are present in older alignments remain fixed.  In the basic
281algorithm, new gaps that are introduced at each stage get full gap opening and
282extension penalties, even if they are introduced inside old gap positions (see
283the section on gap penalties below for modifications to this rule).  In order to
284calculate the score between a position from one sequence or alignment and
285one from another, the average of all the pairwise weight matrix scores from
286the amino acids in the two sets of sequences is used i.e. if you align 2
287alignments with 2 and 4 sequences respectively, the score at each position is
288the average of 8 (2x4) comparisons.   This is illustrated in figure 2.  If either set
289of sequences contains one or more gaps in one of the positions being
290considered, each gap versus a residue is scored as zero.   The default amino
291acid weight matrices we use are rescored to have only positive values.
292Therefore, this treatment of gaps treats the score of a residue versus a gap as
293having the worst possible score.  When sequences are weighted (see
294improvements to progressive alignment, below), each weight matrix value is
295multiplied by the weights from the 2 sequences, as illustrated in figure 2.
296
297
298Improvements to progressive alignment
299
300All of the remaining modifications apply only to the final progressive
301alignment stage.   Sequence weighting is relatively straightforward and is
302already widely used in profile searches (15,16).   The treatment of gap penalties
303is more complicated.   Initial gap penalties are calculated depending on the
304weight matrix, the similarity of the sequences, and the length of the
305sequences. Then, an attempt is made to derive sensible local gap opening
306penalties at every position in each pre-aligned group of sequences that will
307vary as new sequences are added.   The use of different weight matrices as the
308alignment progresses is novel and largely by-passes the problem of initial
309choice of weight matrix.   The final modification allows us to delay the
310addition of very divergent sequences until the end of the alignment process
311when all of the more closely related sequences have already been aligned.
312
313
314Sequence weighting
315
316Sequence weights are calculated directly from the guide tree.    The weights
317are normalised such that the biggest one is set to 1.0 and the rest are all less
318than one.  Groups of closely related sequences receive lowered weights
319because they contain much duplicated information.  Highly divergent
320sequences without any close relatives receive high weights.  These weights
321are used as simple multiplication factors for scoring positions from different
322sequences or prealigned groups of sequences.  The method is illustrated in
323figure 2.  In the globin example in figure 1, the two alpha globins get
324downweighted because they are almost duplicate sequences (as do the two
325beta globins); they receive a combined weight of only slightly more than if a
326single alpha globin was used.   
327
328
329Initial gap penalties
330
331Initially, two gap penalties are used: a gap opening penalty (GOP) which gives
332the cost of opening a new gap of any length and a gap extension penalty (GEP)
333which gives the cost of every item in a gap.  Initial values can be set by the
334user from a menu.   The software then automatically attempts to choose
335appropriate gap penalties for each sequence alignment, depending on the
336following factors.
337
3381) Dependence on the weight matrix
339
340It has been shown (16,28) that varying the gap penalties used with different
341weight matrices can improve the accuracy of sequence alignments. Here, we
342use the average score for two mismatched residues (ie. off-diagonal values in
343the matrix) as a scaling factor for the GOP.
344
3452) Dependence on the similarity of the sequences
346
347The percent identity of the two (groups of) sequences to be aligned is used to
348increase the GOP for closely related sequences and decrease it for more
349divergent sequences on a linear scale.
350
3513) Dependence on the lengths of the sequences   
352
353The scores for both true and false sequence alignments grow with the length
354of the sequences. We use the logarithm of the length of the shorter sequence
355to increase the GOP with sequence length.
356
357Using these three modifications, the initial GOP calculated by the program is:
358
359GOP->(GOP+log(MIN(N,M))) * (average residue mismatch score) *
360                                                               (percent identity scaling factor)
361where N, M are the lengths of the two sequences.
362
3634) Dependence on the difference in the lengths of the sequences
364
365The GEP is modified depending on the difference between the lengths of the
366two sequences to be aligned. If one sequence is much shorter than the other,
367the GEP is increased to inhibit too many long gaps in the shorter sequence.
368The initial GEP calculated by the program is:
369
370GEP ->  GEP*(1.0+|log(N/M)|)
371where N, M are the lengths of the two sequences.
372
373
374Position-specific gap penalties
375
376 In most dynamic programming applications, the initial gap opening and
377extension penalties are applied equally at every position in the sequence,
378regardless of the location of a gap, except for terminal gaps which are usually
379allowed at no cost.   In CLUSTAL W, before any pair of sequences or
380prealigned groups of sequences are aligned, we generate a table of gap opening
381penalties for every position in the two (sets of) sequences.  An example is
382shown in figure 3.  We manipulate the initial gap opening penalty in a
383position specific manner, in order to make gaps more or less likely at different
384positions.   
385
386The local gap penalty modification rules are applied in a hierarchical manner.   
387The exact details of each rule are given below.  Firstly, if there is a gap at a
388position, the gap opening and gap extension penalties are lowered; the other
389rules do not apply.   This makes gaps more likely at positions where there are
390already gaps.  If there is no gap at a position, then the gap opening penalty is
391increased if the position is within 8 residues of an existing gap.   This
392discourages gaps that are too close together.  Finally, at any position within a
393run of hydrophilic residues, the penalty is decreased.  These runs usually
394indicate loop regions in protein structures.  If there is no run of hydrophilic
395residues, the penalty is modified using a table of residue specific gap
396propensities (12).   These propensities were derived by counting the frequency
397of each residue at either end of gaps in alignments of proteins of known
398structure.  An illustration of the application of these rules from one part of
399the globin example, in figure 1, is given in figure 3. 
400
4011) Lowered gap penalties at existing gaps
402
403If there are already gaps at a position, then the GOP is reduced in proportion
404to the number of sequences with a gap at this position and the GEP is lowered
405by a half.  The new gap opening penalty is calculated as:
406
407GOP ->  GOP*0.3*(no. of sequences without a gap/no. of sequences).
408
4092) Increased gap penalties near existing gaps
410
411If a position does not have any gaps but is within 8 residues of an existing gap,
412the GOP is increased by:
413
414GOP ->  GOP*(2+((8-distance from gap)*2)/8)
415
4163) Reduced gap penalties in hydrophilic stretches
417
418Any run of 5 hydrophilic residues is considered to be a hydrophilic stretch. 
419The residues that are to be considered hydrophilic may be set by the user but
420are conservatively set to D, E, G, K, N, Q, P, R or S by default.   If, at any
421position, there are no gaps and any of the sequences has such a stretch, the
422GOP is reduced by one third.
423
424
4254) Residue specific penalties
426
427If there is no hydrophilic stretch and the position does not contain any gaps,
428then the GOP is multiplied by one of the 20 numbers in table 1, depending on
429the residue.  If there is a mixture of residues at a position, the multiplication
430factor is the average of all the contributions from each sequence. 
431
432
433Weight matrices
434
435Two main series of weight matrices are offered to the user: the Dayhoff PAM
436series (3) and the BLOSUM series (4).   The default is the BLOSUM series.  In
437each case, there is a choice of matrix ranging from strict ones, useful for
438comparing very closely related sequences to very "soft" ones that are useful
439for comparing very distantly related sequences.   Depending on the distance
440between the two sequences or groups of sequences to be compared, we switch
441between 4 different matrices.  The distances are measured directly from the
442guide tree.  The ranges of distances and tables used with the PAM series of
443matrices is: 80-100%:PAM20, 60-80%:PAM60, 40-60%:PAM120, 0-40%:PAM350.
444The range used with the BLOSUM series is:80-100%:BLOSUM80,
44560-80%:BLOSUM62, 30-60%:BLOSUM45, 0-30%:BLOSUM30.
446
447
448Divergent sequences
449
450The most divergent sequences (most different, on average from all of the
451other sequences) are usually the most difficult to align correctly.  It is
452sometimes better to delay the incorporation of these sequences until all of the
453more easily aligned sequences are merged first.  This may give a better chance
454of correctly placing the gaps and matching weakly conserved positions against
455the rest of the sequences.   A choice is offered to set a cut off (default is 40%
456identity or less with any other sequence) that will delay the alignment of the
457divergent sequences until all of the rest have been aligned. 
458
459
460Software and Algorithms
461
462
463Dynamic Programming
464
465The most demanding part of the multiple alignment strategy, in terms of
466computer processing and memory usage, is the alignment of two (groups of)
467sequences at each step in the final progressive alignment.   To make it
468possible to align very long sequences (e.g. dynein heavy chains at ~ 5,000
469residues) in a reasonable amount of memory, we use the memory efficient
470dynamic programming algorithm of Myers and Miller (26).   This sacrifices
471some processing time but makes very large alignments practical in very little
472memory.   One disadvantage of this algorithm is that it does not allow
473different gap opening and extension penalties at each position.  We have
474modified the algorithm so as to allow this and the details are described in a
475separate paper (27).   
476
477
478
479Menus/file formats
480
481Six different sequence input formats are detected automatically and read by
482the program:  EMBL/Swiss Prot, NBRF/PIR, Pearson/FASTA (29), GCG/MSF
483(30), GDE (Steven Smith, Harvard University Genome Center) and CLUSTAL
484format alignments.   The last three formats allow users to read in complete
485alignments (e.g. for calculating phylogenetic trees or for addition of new
486sequences to an existing alignment).   Alignment output may be requested in
487standard CLUSTAL format (self-explanatory blocked alignments) or in
488formats compatible with the GDE, PHYLIP (31) or GCG (30) packages.   The
489program offers the user the ability to calculate Neighbour-Joining
490phylogenetic trees from existing alignments with options to correct for
491multiple hits (32,33) and to estimate confidence levels using a bootstrap
492resampling procedure (34).   The trees may be output in the "New
493Hampshire" format that is compatible with the PHYLIP package (31).
494
495Alignment to an alignment
496
497Profile alignment is used to align two existing alignments (either of which
498may consist of just one sequence) or to add a series of new sequences to an
499existing alignment.   This is useful because one may wish to build up a
500multiple alignment gradually, choosing different parameters manually, or
501correcting intermediate errors as the alignment proceeds.   Often, just a few
502sequences cause misalignments in the progressive algorithm and these can be
503removed from the process and then added at the end by profile alignment.  A
504second use is where one has a high quality reference alignment and wishes to
505keep it fixed while adding new sequences automatically. 
506
507
508Portability/Availability
509
510The full source code of the package is provided free to academic users.   The
511program will run on any machine with a full ANSI conforming C compiler. 
512It has been tested on the following hardware/software combinations: 
513Decstation/Ultrix, Vax or ALPHA/VMS, Silicon Graphics/IRIX.   The source
514code and documentation are available by E-mail from the EMBL file server
515(send the words HELP and HELP SOFTWARE on two lines to the internet
516address:
517Netserv@EMBL-Heidelberg.DE) or by anonymous FTP from
518FTP.EMBL-Heidelberg.DE.  Queries may be addressed by E-mail to
519Des.Higgins@EBI.AC.UK or Gibson@EMBL-Heidelberg.DE.
520
521
522RESULTS AND DISCUSSION
523
524
525Alignment of SH3 Domains
526
527The ~60 residue SH3 domain was chosen to illustrate the performance of
528CLUSTAL W, as there is a reference manual alignment (23) and the fold is
529known (24).  SH3 domains, with a minimum similarity below 12% identity,
530are poorly aligned by progressive alignment programs such as CLUSTAL V
531and PILEUP: neither program can generate the correct blocks corresponding to
532the secondary structure elements.
533
534Figure 4 shows an alignment generated by CLUSTAL W of the example set of
535SH3 domains. The alignment was generated in two steps. After progressive
536alignment, five blocks were produced, corresponding to structural elements,
537with gaps inserted exclusively in the known loop regions. The beta strands in
538blocks 1, 4 and 5 were all correctly superposed. However, four sequences in
539block 2 and one sequence in block 3 were misaligned by 1-2 residues
540(underlined in figure 4). A second progressive alignment of the aligned
541sequences, including the gaps, improved this alignment: A single misaligned
542sequence, H_P55, remains in block 2 (boxed in figure 4), while block 3 is now
543completely aligned.  This alignment corrects several errors (eg. P85A, P85B
544and FUS1) in the manual alignment (23).
545
546The SH3 alignment illustrates several features of CLUSTAL W usage. Firstly,
547in a practical application involving divergent sequences, the initial
548progressive alignment is likely to be a good but not perfect approximation to
549the correct alignment. The alignment quality can be improved in a number of
550ways. If the block structure of the alignment appears to be correct, realignment
551of the alignment will usually improve most of the misaligned blocks: the
552existing gaps allow the blocks to "float" cheaply to a locally optimal position
553without disturbing the rest of the alignment. Remaining sequences which are
554doubtfully aligned can then be individually tested by profile alignment to the
555remainder: the misaligned H_P55 SH3 domain can be correctly aligned by
556profile (with GOP <= 8). The indel regions in the final alignment can then be
557manually cleaned up: Usually the exact alignment in the loop regions is not
558determinable, and may have no meaning in structural terms. It is then
559desirable to have a single gap per structural loop. CLUSTAL W achieved this
560for two of the four SH3 loop regions (figure 4).
561
562If the block structure of the alignment appears suspect, greater intervention by
563the user may be required. The most divergent sequences, especially if they
564have large insertions (which can be discerned with the aid of dot matrix
565plots), should be left out of the progressive alignment. If there are sets of
566closely related sequences that are deeply diverged from other sets, these can be
567separately aligned and then merged by profile alignment. Incorrectly
568determined sequences, containing frameshifts, can also confound regions of
569an alignment: these can be hard to detect but sometimes they have been
570grouped within the excluded divergent sequences: then they may be revealed
571when they are individually compared to the alignment as having apparently
572nonsense segments with respect to the other sequences.
573
574
575
576Finding the best alignment
577
578In cases where all of the sequences in a data set are very similar (e.g. no pair
579less than 35% identical), CLUSTAL W will find an alignment which is
580difficult to improve by eye.  In this sense, the alignment is optimal with
581regard to the alternative of manual alignment.  Mathematically, this is vague
582and can only be put on a more systematic footing by finding an objective
583function (a measure of multiple alignment quality) that exactly mirrors the
584information used by an "expert" to evaluate an alignment.  Nonetheless, if an
585alignment is impossible to improve by eye, then the program has achieved a
586very useful result.   
587
588In more difficult cases, as more divergent sequences are included, it becomes
589increasingly difficult to find good alignments and to evaluate them.    What
590we find with CLUSTAL W is that the basic block-like structure of the
591alignment (corresponding to the major secondary structure elements) is
592usually recovered, with some of the most divergent sequences misaligned in
593small regions.  This is a very useful starting point for manual refinement as it
594helps define the major blocks of similarity.   The problem sequences can be
595removed from the analysis and realigned to the rest of the sequences
596automatically or with different parameter settings.   An examination of the
597tree used to guide the alignment will usually show which sequences will be
598most unreliably placed (those that branch off closest to the root and/or those
599that align to other single sequences at a very low level of sequence identity
600rather than align to a group of pre-aligned sequences).  Finally, one can
601simply iterate the multiple alignment process by feeding an output alignment
602back into CLUSTAL W and repeating the multiple alignment process (using
603the same or different parameters).   The SH3 domain alignment in figure 4
604was derived in this way by 2 passes using default parameters.  In the second
605pass, the local gap penalties are dominated by the placement of the initial
606major gap positions.  The alignment will either remain unchanged or will
607converge rapidly (after 1 or 2 extra passes) on a better solution.  If the
608placement of the initial gaps is approximately correct but some of the
609sequences are locally misaligned, this works well. 
610
611
612Comparison with other methods
613
614Recently, several papers have addressed the problem of position specific
615parameters for multiple alignment.  In one case (35), local gap penalties are
616increased in alpha helical and beta strand regions, when the 3-D structures of
617one or more of the sequences are known.  In a second case (36), a hidden
618Markov model was used to estimate position specific gap penalties and
619residue substitution weight matrices when large numbers of examples of a
620protein domain were known.  With CLUSTAL W, we attempt to derive the
621same information purely from the set of sequences to be aligned.  Therefore,
622we can apply the method to any set of sequences.  The success of this approach
623will depend on the number of available sequences and their evolutionary
624relationships.  It will also depend on the decision making process during
625multiple alignment (e.g. when to change weight matrix) and the accuracy and
626appropriateness of our parameterisation.  In the long term, this can only be
627evaluated by exhaustive testing of sets of sequences where the correct
628alignment (or parts of it) are known from structural information.   What is
629clear, however, is that the modifications described here significantly improve
630the sensitivity of the progressive multiple alignment approach.  This is
631achieved with almost no sacrifice in speed and efficiency. 
632
633There are several areas where further improvements in sensitivity and
634accuracy can be made.  Firstly, the residue weight matrices and gap settings
635can be made more accurate as more and more data accumulate, while
636matrices for specific sequence types can be derived (e.g. for transmembrane
637regions (37)).  Secondly, stochastic or iterative optimisation methods can be
638used to refine initial alignments (7,9,10).   CLUSTAL W could be run with
639several sets of starting parameters and in each case, the alignments refined
640according to an objective function.   The search for a good objective function,
641that takes into account the sequence and position specific information used in
642CLUSTAL W is a key area of research.   Finally, the average number of
643examples of each protein domain or family is growing steadily.  It is not only
644important that programs can cope with the large volumes of data that are
645being generated, they should be able to exploit the new information to make
646the alignments more and more accurate.   Globally optimal alignments
647(according to an objective function) may not always be possible but the
648problem may be avoided if sufficiently large volumes of data become
649available.  CLUSTAL W is a step in this direction.
650
651ACKNOWLEDGEMENTS
652
653Numerous people have offered advice and suggestions for improvements to
654earlier versions of the CLUSTAL programs.  D.H. wishes to apologise to all of
655the irate CLUSTAL V users who had to live with the bugs and lack of facilities
656for getting trees in the New Hampshire format.  We wish to specifically thank
657Jeroen Coppieters who suggested using a series of weight matrices and Steven
658Henikoff for advice on using the BLOSUM matrices.  We are grateful to Rein
659Aasland, Peer Bork, Ariel Blocker and BŽrtrand Seraphin for providing
660challenging alignment problems.   T.G. and J.T. thank Kevin Leonard for
661support and encouragement.  Finally, we thank all of the people who were
662involved with various CLUSTAL programs over the years, namely: Paul
663Sharp, Rainer Fuchs and Alan Bleasby.
664
665
666REFERENCES
667
668 1.Feng, D.-F. and Doolittle, R.F. (1987). J. Mol. Evol. 25, 351-360.
669 2.Needleman, S.B. and Wunsch, C.D. (1970). J. Mol. Biol. 48, 443-453.
670 3.Dayhoff, M.O., Schwartz, R.M. and Orcutt, B.C. (1978)  in Atlas of Protein
671Sequence and Structure, vol. 5, suppl. 3 (Dayhoff, M.O., ed.), pp 345-352,
672NBRF, Washington.
673 4.Henikoff, S. and Henikoff, J.G. (1992). Proc. Natl. Acad. Sci. USA 89, 10915-
67410919.
675 5.Lipman, D.J., Altschul, S.F. and Kececioglu, J.D. (1989). Proc. Natl. Acad. Sci.
676USA 86, 4412-4415.
677 6.Barton, G.J. and Sternberg, M.J.E. (1987). J. Mol. Biol. 198, 327-337.
678 7.Gotoh, O. (1993). CABIOS 9, 361-370.
679 8.Altschul, S.F. (1989). J. Theor. Biol. 138, 297-309.
680 9.Lukashin, A.V., Engelbrecht, J. and Brunak, S. (1992). Nucl. Acids Res. 20,
6812511-2516.
68210.Lawrence, C.E., Altschul, S.F., Boguski, M.S., Liu, J.S., Neuwald, A.F. and
683Wooton, J.C. (1993). Science, 262, 208-214.
68411.Vingron, M. and Waterman, M.S. (1993). J. Mol. Biol. 234, 1-12.
68512.Pascarella, S. and Argos, P. (1992). J. Mol. Biol. 224, 461-471.
68613.Collins, J.F. and Coulson, A.F.W. (1987). In Nucleic acid and protein
687sequence analysis a practical approach, Bishop, M.J. and Rawlings, C.J. ed.,
688chapter 13, pp. 323-358.
68914.Vingron, M. and Sibbald, P.R. (1993). Proc. Natl. Acad. Sci. USA, 90, 8777-
6908781.
69115.Thompson, J.D., Higgins, D.G. and Gibson, T.J. (1994). CABIOS, 10, 19-29.
69216.LŸthy, R., Xenarios, I. and Bucher, P. (1994). Protein Science, 3, 139-146.
69317.Higgins, D.G. and Sharp, P.M. (1988). Gene, 73, 237-244.
69418.Higgins, D.G. and Sharp, P.M. (1989). CABIOS, 5, 151-153.
69519.Higgins, D.G., Bleasby, A.J. and Fuchs, R. (1992). CABIOS, 8, 189-191.
69620.Sneath, P.H.A. and Sokal, R.R. (1973). Numerical Taxonomy, W.H.
697Freeman, San Francisco.
69821.Saitou, N. and Nei, M. (1987). Mol. Biol. Evol. 4, 406-425.
69922.Wilbur, W.J. and Lipman, D.J. (1983). Proc. Natl. Acad. Sci. USA, 80, 726-
700730.
70123.Musacchio, A., Gibson, T., Lehto, V.-P. and Saraste, M. (1992). FEBS Lett.
702307, 55-61.
70324.Musacchio, A., Noble, M., Pauptit, R., Wierenga, R. and Saraste, M. (1992).
704Nature, 359, 851-855.
70525.Bashford, D., Chothia, C. and Lesk, A.M. (1987). J. Mol. Biol. 196, 199-216.
70626.Myers, E.W. and Miller, W. (1988). CABIOS, 4, 11-17.
70727.Thompson, J.D. (1994). CABIOS, (Submitted).
70828.Smith, T.F., Waterman, M.S. and Fitch, W.M. (1981). J. Mol. Evol. 18, 38-46.
70929.Pearson, W.R. and Lipman, D.J. (1988). Proc. Natl. Acad. Sci. USA. 85, 2444-
7102448.
71130.Devereux, J., Haeberli, P. and Smithies, O. (1984). Nucleic Acids Res. 12,
712387-395.
71331.Felsenstein, J. (1989). Cladistics 5, 164-166.
71432.Kimura, M. (1980). J. Mol. Evol. 16, 111-120.
71533.Kimura, M. (1983). The Neutral Theory of Molecular Evolution. 
716Cambridge University Press, Cambridge.
71734.Felsenstein, J. (1985). Evolution 39, 783-791.
71835.Smith, R.F. and Smith, T.F. (1992) Protein Engineering 5, 35-41.
71936.Krogh, A., Brown, M., Mian, S., Sjšlander, K. and Haussler, D. (1994) J. Mol.
720Biol. 235-1501-1531.
72137.Jones, D.T., Taylor, W.R. and Thornton, J.M.  (1994). FEBS Lett. 339, 269-275.
72238.Bairoch, A. and Bšckmann, B. (1992) Nucleic Acids Res., 20, 2019-2022.
72339.Noble, M.E.M., Musacchio, A., Saraste, M., Courtneidge, S.A. and
724Wierenga, R.K. (1993) EMBO J. 12, 2617-2624.
72540.Kabsch, W. and Sander, C. (1983) Biopolymers, 22, 2577-2637.
726
727FIGURE LEGENDS
728
729Figure 1.  The basic progressive alignment procedure, illustrated using a set of
7307 globins of known tertiary structure.  The sequence names are from Swiss
731Prot (38):  Hba_Horse: horse alpha globin; Hba_Human: human alpha globin;
732Hbb_Horse: horse beta globin; Hbb_Human: human beta globin; Myg_Phyca:
733sperm whale myoglobin; Glb5_Petma: lamprey cyanohaemoglobin;
734Lgb2_Luplu: lupin leghaemoglobin.   In the distance matrix, the mean
735number of differences per residue is given.  The unrooted tree shows all
736branch lengths drawn to scale.  In the rooted tree, all branch lengths (mean
737number of differences per residue along each branch) are given as well as
738weights for each sequence.  In the multiple alignment, the approximate
739positions of the 7 alpha helices, common to all 7 proteins are shown.  This
740alignment was derived using CLUSTAL W with default parameters and the
741PAM (3) series of weight matrices. 
742
743Figure 2.  The scoring scheme for comparing two positions from two
744alignments.   Two sections of alignment with 4 and 2 sequences respectively
745are shown.   The score of the position with amino acids T,L,K,K versus the
746position with amino acids V and I is given with and without sequence
747weights.  M(X,Y) is the weight matrix entry for amino acid X versus amino
748acid Y.  Wn is the weight for sequence n.
749
750Figure 3.  The variation in local gap opening penalty is plotted for a section of
751alignment.  The inital gap opening penalty is indicated by a dotted line. Two
752hydrophilic stretches are underlined.  The lowest penalties correspond to the
753ends of the alignment, the hydrophilic stretches and the two positions with
754gaps.   The highest values are within 8 residues of the two gap positions.  The
755rest of the variation is caused by the residue specific gap penalties (12).
756
757Figure 4.  CLUSTAL W Alignment of a set of SH3 domains taken from (23).
758Secondary structure assignments for the solved Spectrin (24) and Fyn (39)
759domains are according to DSSP (40). The alignment was generated in two
760steps using default parameters. After full multiple alignment, the aligned
761sequences were realigned. Segments which were correctly aligned in the
762second pass are underlined. The single misaligned segment in H_P55 and the
763misaligned residue in H_NCK/2 are boxed.
764
765The sequences are coloured to illustrate significant features. All G (orange)
766and P (yellow) are coloured. Other residues matching a frequent occurrence of
767a property in a column are coloured: hydrophobic = blue; hydrophobic
768tendency = light blue; basic = red; acidic = purple; hydrophilic = green; White
769= unconserved. The alignment figure was prepared with the GDE sequence
770editor (S. Smith, Harvard University) and COLORMASK (J. Thompson,
771EMBL).
772
773
774
775
776Table 1.  Pascarella and Argos residue specific gap modification factors.   
777-----------------------------------------------------------------------------------
778A       1.13            M       1.29
779C       1.13            N       0.63
780D       0.96            P       0.74
781E       1.31            Q       1.07
782F       1.20            R       0.72
783G       0.61            S       0.76
784H       1.00            T       0.89
785I       1.32            V       1.25
786K       0.96            Y       1.00
787L       1.21            W       1.23
788-----------------------------------------------------------------------------------
789The values are normalised around a mean value of 1.0 for H.  The lower the
790value, the greater the chance of having an adjacent gap.  These are derived
791from the original table of relative frequencies of gaps adjacent to each residue
792(12) by subtraction from 2.0.
793
794
Note: See TracBrowser for help on using the repository browser.