Context Navigation

← Previous Revision
Next Revision →
Blame
Revision Log

source: branches/profile/GDE/CLUSTALW/clustalw.ms

Visit:

Last change on this file was 176, checked in by jobb, 23 years ago
* empty log message *
Property svn:eol-style set to `native` Property svn:executable set to ``* Property svn:keywords set to `Author Date Id Revision`
File size: 43.4 KB

Line
1	This is just an ASCII text version of the manuscript describing
2	Clustal W, without the figures. It was published:
3
4	Nucleic Acids Research, 22(22):4673-4680.
5
6
7
8	CLUSTAL W: improving the sensitivity of progressive multiple
9	sequence alignment through sequence weighting, position specific
10	gap penalties and weight matrix choice.
11
12
13
14	Julie D. Thompson, Desmond G. Higgins1 and Toby J. Gibson*
15
16	European Molecular Biology Laboratory
17	Postfach 102209
18	Meyerhofstrasse 1
19	D-69012 Heidelberg
20	Germany
21
22
23	Phone: +49-6221-387398
24	Fax: +49-6221-387306
25	E-mail: Gibson@EMBL-Heidelberg.DE
26	Des.Higgins@EBI.AC.UK
27	Thompson@EMBL-Heidelberg.DE
28
29
30	Keywords: Multiple alignment, phylogenetic tree, weight matrix, gap
31	penalty, dynamic programming, sequence weighting.
32
33
34	1 Current address:
35	European Bioinformatics Institute
36	Hinxton Hall
37	Hinxton
38	Cambridge CB10 1RQ
39	UK.
40
41	* To whom correspondence should be addressed
42
43
44	ABSTRACT
45
46	The sensitivity of the commonly used progressive multiple sequence
47	alignment method has been greatly improved for the alignment of divergent
48	protein sequences. Firstly, individual weights are assigned to each sequence
49	in a partial alignment in order to downweight near-duplicate sequences and
50	upweight the most divergent ones. Secondly, amino acid substitution
51	matrices are varied at different alignment stages according to the divergence
52	of the sequences to be aligned. Thirdly, residue specific gap penalties and
53	locally reduced gap penalties in hydrophilic regions encourage new gaps in
54	potential loop regions rather than regular secondary structure. Fourthly,
55	positions in early alignments where gaps have been opened receive locally
56	reduced gap penalties to encourage the opening up of new gaps at these
57	positions. These modifications are incorporated into a new program,
58	CLUSTAL W which is freely available.
59
60
61	INTRODUCTION
62
63	The simultaneous alignment of many nucleotide or amino acid sequences is
64	now an essential tool in molecular biology. Multiple alignments are used to
65	find diagnostic patterns to characterise protein families; to detect or
66	demonstrate homology between new sequences and existing families of
67	sequences; to help predict the secondary and tertiary structures of new
68	sequences; to suggest oligonucleotide primers for PCR; as an essential prelude
69	to molecular evolutionary analysis. The rate of appearance of new sequence
70	data is steadily increasing and the development of efficient and accurate
71	automatic methods for multiple alignment is, therefore, of major
72	importance. The majority of automatic multiple alignments are now carried
73	out using the "progressive" approach of Feng and Doolittle (1). In this paper,
74	we describe a number of improvements to the progressive multiple
75	alignment method which greatly improve the sensitivity without sacrificing
76	any of the speed and efficiency which makes this approach so practical. The
77	new methods are made available in a program called CLUSTAL W which is
78	freely available and portable to a wide variety of computers and operating
79	systems.
80
81	In order to align just two sequences, it is standard practice to use dynamic
82	programming (2). This guarantees a mathematically optimal alignment,
83	given a table of scores for matches and mismatches between all amino acids
84	or nucleotides (e.g. the PAM250 matrix (3) or BLOSUM62 matrix (4)) and
85	penalties for insertions or deletions of different lengths. Attempts at
86	generalising dynamic programming to multiple alignments are limited to
87	small numbers of short sequences (5). For much more than eight or so
88	proteins of average length, the problem is uncomputable given current
89	computer power. Therefore, all of the methods capable of handling larger
90	problems in practical timescales, make use of heuristics. Currently, the most
91	widely used approach is to exploit the fact that homologous sequences are
92	evolutionarily related. One can build up a multiple alignment progressively
93	by a series of pairwise alignments, following the branching order in a
94	phylogenetic tree (1). One first aligns the most closely related sequences,
95	gradually adding in the more distant ones. This approach is sufficiently fast
96	to allow alignments of virtually any size. Further, in simple cases, the
97	quality of the alignments is excellent, as judged by the ability to correctly align
98	corresponding domains from sequences of known secondary or tertiary
99	structure (6). In more difficult cases, the alignments give good starting points
100	for further automatic or manual refinement.
101
102	This approach works well when the data set consists of sequences of different
103	degrees of divergence. Pairwise alignment of very closely related sequences
104	can be carried out very accurately. The correct answer may often be obtained
105	using a wide range of parameter values (gap penalties and weight matrix). By
106	the time the most distantly related sequences are aligned, one already has a
107	sample of aligned sequences which gives important information about the
108	variability at each position. The positions of the gaps that were introduced
109	during the early alignments of the closely related sequences are not changed
110	as new sequences are added. This is justified because the placement of gaps
111	in alignments between closely related sequences is much more accurate than
112	between distantly related ones. When all of the sequences are highly
113	divergent (e.g. less than approximately 25-30% identity between any pair of
114	sequences), this progressive approach becomes much less reliable.
115
116	There are two major problems with the progressive approach: the local
117	minimum problem and the choice of alignment parameters. The local
118	minimum problem stems from the "greedy" nature of the alignment strategy.
119	The algorithm greedily adds sequences together, following the initial tree.
120	There is no guarantee that the global optimal solution, as defined by some
121	overall measure of multiple alignment quality (7,8), or anything close to it,
122	will be found. More specifically, any mistakes (misaligned regions) made
123	early in the alignment process cannot be corrected later as new information
124	from other sequences is added. This problem is frequently thought of as
125	mainly resulting from an incorrect branching order in the initial tree. The
126	initial trees are derived from a matrix of distances between separately aligned
127	pairs of sequences and are much less reliable than trees from complete
128	multiple alignments. In our experience, however, the real problem is caused
129	simply by errors in the initial alignments. Even if the topology of the guide
130	tree is correct, each alignment step in the multiple alignment process may
131	have some percentage of the residues misaligned. This percentage will be
132	very low on average for very closely related sequences but will increase as
133	sequences diverge. It is these misalignments which carry through from the
134	early alignment steps that cause the local minimum problem. The only way
135	to correct this is to use an iterative or stochastic sampling procedure (e.g.
136	7,9,10). We do not directly address this problem in this paper.
137
138	The alignment parameter choice problem is, in our view, at least as serious as
139	the local minimum problem. Stochastic or iterative algorithms will be just
140	as badly affected as progressive ones if the parameters are inappropriate: they
141	will arrive at a false global minimum. Traditionally, one chooses one weight
142	matrix and two gap penalties (one for opening a new gap and one for
143	extending an existing gap) and hope that these will work well over all parts of
144	all the sequences in the data set. When the sequences are all closely related,
145	this works. The first reason is that virtually all residue weight matrices give
146	most weight to identities. When identities dominate an alignment, almost
147	any weight matrix will find approximately the correct solution. With very
148	divergent sequences, however, the scores given to non-identical residues will
149	become critically important; there will be more mismatches than identities.
150	Different weight matrices will be optimal at different evolutionary distances
151	or for different classes of proteins.
152
153	The second reason is that the range of gap penalty values that will find the
154	correct or best possible solution can be very broad for highly similar sequences
155	(11). As more and more divergent sequences are used, however, the exact
156	values of the gap penalties become important for success. In each case, there
157	may be a very narrow range of values which will deliver the best alignment.
158	Further, in protein alignments, gaps do not occur randomly (i.e. with equal
159	probability at all positions). They occur far more often between the major
160	secondary structural elements of alpha helices and beta strands than within
161	(12).
162
163	The major improvements described in this paper attempt to address the
164	alignment parameter choice problem. We dynamically vary the gap
165	penalties in a position and residue specific manner. The observed relative
166	frequencies of gaps adjacent to each of the 20 amino acids (12) are used to
167	locally adjust the gap opening penalty after each residue. Short stretches of
168	hydrophilic residues (e.g. 5 or more) usually indicate loop or random coil
169	regions and the gap opening penalties are locally reduced in these stretches.
170	In addition, the locations of the gaps found in the early alignments are also
171	given reduced gap opening penalties. It has been observed in alignments
172	between sequences of known structure that gaps tend not to be closer than
173	roughly eight residues on average (12). We increase the gap opening penalty
174	within eight residues of exising gaps. The two main series of amino acid
175	weight matrices that are used today are the PAM series (3) and the BLOSUM
176	series (4). In each case, there is a range of matrices to choose from. Some
177	matrices are appropriate for aligning very closely related sequences where
178	most weight by far is given to identities, with only the most frequent
179	conservative substitutions receiving high scores. Other matrices work better
180	at greater evolutionary distances where less importance is attached to
181	identities (13). We choose different weight matrices, as the alignment
182	proceeds, depending on the estimated divergence of the sequences to be
183	aligned at each stage.
184
185	Sequences are weighted to correct for unequal sampling across all
186	evolutionary distances in the data set (14). This downweights sequences that
187	are very similar to other sequences in the data set and upweights the most
188	divergent ones. The weights are calculated directly from the branch lengths
189	in the initial guide tree (15). Sequence weighting has already been shown to
190	be effective in improving the sensitivity of profile searches (15,16). In the
191	original CLUSTAL programs (17-19), the initial guide trees, used to guide the
192	multiple alignment, were calculated using the UPGMA method (20). We
193	now use the Neighbour-Joining method (21) which is more robust against the
194	effects of unequal evolutionary rates in different lineages and which gives
195	better estimates of individual branch lengths. This is useful because it is these
196	branch lengths which are used to derive the sequence weights. We also allow
197	users to choose between fast approximate alignments (22) or full dynamic
198	programming for the distance calculations used to make the guide tree.
199
200	The new improvements dramatically improve the sensitivity of the
201	progressive alignment method for difficult alignments involving highly
202	diverged sequences. We show one very demanding test case of over 60 SH3
203	domains (23) which includes sequence pairs with as little as 12% identity and
204	where there is only one exactly conserved residue across all of the sequences.
205	Using default parameters, we can achieve an alignment that is almost exactly
206	correct, according to available structural information (24). Using the program
207	in a wide variety of situations, we find that it will normally find the correct
208	alignment, in all but the most difficult and pathological of cases.
209
210
211	MATERIAL AND METHODS
212
213
214	The basic alignment method
215
216	The basic multiple alignment algorithm consists of three main stages: 1) all
217	pairs of sequences are aligned separately in order to calculate a distance matrix
218	giving the divergence of each pair of sequences; 2) a guide tree is calculated
219	from the distance matrix; 3) the sequences are progressively aligned according
220	to the branching order in the guide tree. An example using 7 globin
221	sequences of known tertiary structure (25) is given in figure 1.
222
223
224	1) The distance matrix/pairwise alignments
225
226	In the original CLUSTAL programs, the pairwise distances were calculated
227	using a fast approximate method (22). This allows very large numbers of
228	sequences to be aligned, even on a microcomputer. The scores are calculated
229	as the number of k-tuple matches (runs of identical residues, typically 1 or 2
230	long for proteins or 2 to 4 long for nucleotide sequences) in the best alignment
231	between two sequences minus a fixed penalty for every gap. We now offer a
232	choice between this method and the slower but more accurate scores from full
233	dynamic programming alignments using two gap penalties (for opening or
234	extending gaps) and a full amino acid weight matrix. These scores are
235	calculated as the number of identities in the best alignment divided by the
236	number of residues compared (gap positions are excluded). Both of these
237	scores are initially calculated as percent identity scores and are converted to
238	distances by dividing by 100 and subtracting from 1.0 to give number of
239	differences per site. We do not correct for multiple substitutions in these
240	initial distances. In figure 1 we give the 7x7 distance matrix between the 7
241	globin sequences calculated using the full dynamic programming method.
242
243
244	2) The guide tree
245
246	The trees used to guide the final multiple alignment process are calculated
247	from the distance matrix of step 1 using the Neighbour-Joining method (21).
248	This produces unrooted trees with branch lengths proportional to estimated
249	divergence along each branch. The root is placed by a "mid-point" method
250	(15) at a position where the means of the branch lengths on either side of the
251	root are equal. These trees are also used to derive a weight for each sequence
252	(15). The weights are dependent upon the distance from the root of the tree
253	but sequences which have a common branch with other sequences share the
254	weight derived from the shared branch. In the example in figure 1, the
255	leghaemoglobin (Lgb2_Luplu) gets a weight of 0.442 which is equal to the
256	length of the branch from the root to it. The Human beta globin
257	(Hbb_Human) gets a weight consisting of the length of the branch leading to
258	it that is not shared with any other sequences (0.081) plus half the length of
259	the branch shared with the horse beta globin (0.226/2) plus one quarter the
260	length of the branch shared by all four haemoglobins (0.061/4) plus one fifth
261	the branch shared between the haemoglobins and the myoglobin (0.015/5)
262	plus one sixth the branch leading to all the vertebrate globins (0.062). This
263	sums to a total of 0.221. By contrast, in the normal progressive alignment
264	algorithm, all sequences would be equally weighted. The rooted tree with
265	branch lengths and sequence weights for the 7 globins is given in figure 1.
266
267
268	3) Progressive alignment
269
270	The basic procedure at this stage is to use a series of pairwise alignments to
271	align larger and larger groups of sequences, following the branching order in
272	the guide tree. You proceed from the tips of the rooted tree towards the root.
273	In the globin example in figure 1 you align the sequences in the following
274	order: human vs. horse beta globin; human vs. horse alpha globin; the 2
275	alpha globins vs. the 2 beta globins; the myoglobin vs. the haemoglobins; the
276	cyanohaemoglobin vs the haemoglobins plus myoglobin; the leghaemoglobin
277	vs. all the rest. At each stage a full dynamic programming (26,27) algorithm is
278	used with a residue weight matrix and penalties for opening and extending
279	gaps. Each step consists of aligning two existing alignments or sequences.
280	Gaps that are present in older alignments remain fixed. In the basic
281	algorithm, new gaps that are introduced at each stage get full gap opening and
282	extension penalties, even if they are introduced inside old gap positions (see
283	the section on gap penalties below for modifications to this rule). In order to
284	calculate the score between a position from one sequence or alignment and
285	one from another, the average of all the pairwise weight matrix scores from
286	the amino acids in the two sets of sequences is used i.e. if you align 2
287	alignments with 2 and 4 sequences respectively, the score at each position is
288	the average of 8 (2x4) comparisons. This is illustrated in figure 2. If either set
289	of sequences contains one or more gaps in one of the positions being
290	considered, each gap versus a residue is scored as zero. The default amino
291	acid weight matrices we use are rescored to have only positive values.
292	Therefore, this treatment of gaps treats the score of a residue versus a gap as
293	having the worst possible score. When sequences are weighted (see
294	improvements to progressive alignment, below), each weight matrix value is
295	multiplied by the weights from the 2 sequences, as illustrated in figure 2.
296
297
298	Improvements to progressive alignment
299
300	All of the remaining modifications apply only to the final progressive
301	alignment stage. Sequence weighting is relatively straightforward and is
302	already widely used in profile searches (15,16). The treatment of gap penalties
303	is more complicated. Initial gap penalties are calculated depending on the
304	weight matrix, the similarity of the sequences, and the length of the
305	sequences. Then, an attempt is made to derive sensible local gap opening
306	penalties at every position in each pre-aligned group of sequences that will
307	vary as new sequences are added. The use of different weight matrices as the
308	alignment progresses is novel and largely by-passes the problem of initial
309	choice of weight matrix. The final modification allows us to delay the
310	addition of very divergent sequences until the end of the alignment process
311	when all of the more closely related sequences have already been aligned.
312
313
314	Sequence weighting
315
316	Sequence weights are calculated directly from the guide tree. The weights
317	are normalised such that the biggest one is set to 1.0 and the rest are all less
318	than one. Groups of closely related sequences receive lowered weights
319	because they contain much duplicated information. Highly divergent
320	sequences without any close relatives receive high weights. These weights
321	are used as simple multiplication factors for scoring positions from different
322	sequences or prealigned groups of sequences. The method is illustrated in
323	figure 2. In the globin example in figure 1, the two alpha globins get
324	downweighted because they are almost duplicate sequences (as do the two
325	beta globins); they receive a combined weight of only slightly more than if a
326	single alpha globin was used.
327
328
329	Initial gap penalties
330
331	Initially, two gap penalties are used: a gap opening penalty (GOP) which gives
332	the cost of opening a new gap of any length and a gap extension penalty (GEP)
333	which gives the cost of every item in a gap. Initial values can be set by the
334	user from a menu. The software then automatically attempts to choose
335	appropriate gap penalties for each sequence alignment, depending on the
336	following factors.
337
338	1) Dependence on the weight matrix
339
340	It has been shown (16,28) that varying the gap penalties used with different
341	weight matrices can improve the accuracy of sequence alignments. Here, we
342	use the average score for two mismatched residues (ie. off-diagonal values in
343	the matrix) as a scaling factor for the GOP.
344
345	2) Dependence on the similarity of the sequences
346
347	The percent identity of the two (groups of) sequences to be aligned is used to
348	increase the GOP for closely related sequences and decrease it for more
349	divergent sequences on a linear scale.
350
351	3) Dependence on the lengths of the sequences
352
353	The scores for both true and false sequence alignments grow with the length
354	of the sequences. We use the logarithm of the length of the shorter sequence
355	to increase the GOP with sequence length.
356
357	Using these three modifications, the initial GOP calculated by the program is:
358
359	GOP->(GOP+log(MIN(N,M))) * (average residue mismatch score) *
360	(percent identity scaling factor)
361	where N, M are the lengths of the two sequences.
362
363	4) Dependence on the difference in the lengths of the sequences
364
365	The GEP is modified depending on the difference between the lengths of the
366	two sequences to be aligned. If one sequence is much shorter than the other,
367	the GEP is increased to inhibit too many long gaps in the shorter sequence.
368	The initial GEP calculated by the program is:
369
370	GEP -> GEP*(1.0+\|log(N/M)\|)
371	where N, M are the lengths of the two sequences.
372
373
374	Position-specific gap penalties
375
376	In most dynamic programming applications, the initial gap opening and
377	extension penalties are applied equally at every position in the sequence,
378	regardless of the location of a gap, except for terminal gaps which are usually
379	allowed at no cost. In CLUSTAL W, before any pair of sequences or
380	prealigned groups of sequences are aligned, we generate a table of gap opening
381	penalties for every position in the two (sets of) sequences. An example is
382	shown in figure 3. We manipulate the initial gap opening penalty in a
383	position specific manner, in order to make gaps more or less likely at different
384	positions.
385
386	The local gap penalty modification rules are applied in a hierarchical manner.
387	The exact details of each rule are given below. Firstly, if there is a gap at a
388	position, the gap opening and gap extension penalties are lowered; the other
389	rules do not apply. This makes gaps more likely at positions where there are
390	already gaps. If there is no gap at a position, then the gap opening penalty is
391	increased if the position is within 8 residues of an existing gap. This
392	discourages gaps that are too close together. Finally, at any position within a
393	run of hydrophilic residues, the penalty is decreased. These runs usually
394	indicate loop regions in protein structures. If there is no run of hydrophilic
395	residues, the penalty is modified using a table of residue specific gap
396	propensities (12). These propensities were derived by counting the frequency
397	of each residue at either end of gaps in alignments of proteins of known
398	structure. An illustration of the application of these rules from one part of
399	the globin example, in figure 1, is given in figure 3.
400
401	1) Lowered gap penalties at existing gaps
402
403	If there are already gaps at a position, then the GOP is reduced in proportion
404	to the number of sequences with a gap at this position and the GEP is lowered
405	by a half. The new gap opening penalty is calculated as:
406
407	GOP -> GOP0.3(no. of sequences without a gap/no. of sequences).
408
409	2) Increased gap penalties near existing gaps
410
411	If a position does not have any gaps but is within 8 residues of an existing gap,
412	the GOP is increased by:
413
414	GOP -> GOP(2+((8-distance from gap)2)/8)
415
416	3) Reduced gap penalties in hydrophilic stretches
417
418	Any run of 5 hydrophilic residues is considered to be a hydrophilic stretch.
419	The residues that are to be considered hydrophilic may be set by the user but
420	are conservatively set to D, E, G, K, N, Q, P, R or S by default. If, at any
421	position, there are no gaps and any of the sequences has such a stretch, the
422	GOP is reduced by one third.
423
424
425	4) Residue specific penalties
426
427	If there is no hydrophilic stretch and the position does not contain any gaps,
428	then the GOP is multiplied by one of the 20 numbers in table 1, depending on
429	the residue. If there is a mixture of residues at a position, the multiplication
430	factor is the average of all the contributions from each sequence.
431
432
433	Weight matrices
434
435	Two main series of weight matrices are offered to the user: the Dayhoff PAM
436	series (3) and the BLOSUM series (4). The default is the BLOSUM series. In
437	each case, there is a choice of matrix ranging from strict ones, useful for
438	comparing very closely related sequences to very "soft" ones that are useful
439	for comparing very distantly related sequences. Depending on the distance
440	between the two sequences or groups of sequences to be compared, we switch
441	between 4 different matrices. The distances are measured directly from the
442	guide tree. The ranges of distances and tables used with the PAM series of
443	matrices is: 80-100%:PAM20, 60-80%:PAM60, 40-60%:PAM120, 0-40%:PAM350.
444	The range used with the BLOSUM series is:80-100%:BLOSUM80,
445	60-80%:BLOSUM62, 30-60%:BLOSUM45, 0-30%:BLOSUM30.
446
447
448	Divergent sequences
449
450	The most divergent sequences (most different, on average from all of the
451	other sequences) are usually the most difficult to align correctly. It is
452	sometimes better to delay the incorporation of these sequences until all of the
453	more easily aligned sequences are merged first. This may give a better chance
454	of correctly placing the gaps and matching weakly conserved positions against
455	the rest of the sequences. A choice is offered to set a cut off (default is 40%
456	identity or less with any other sequence) that will delay the alignment of the
457	divergent sequences until all of the rest have been aligned.
458
459
460	Software and Algorithms
461
462
463	Dynamic Programming
464
465	The most demanding part of the multiple alignment strategy, in terms of
466	computer processing and memory usage, is the alignment of two (groups of)
467	sequences at each step in the final progressive alignment. To make it
468	possible to align very long sequences (e.g. dynein heavy chains at ~ 5,000
469	residues) in a reasonable amount of memory, we use the memory efficient
470	dynamic programming algorithm of Myers and Miller (26). This sacrifices
471	some processing time but makes very large alignments practical in very little
472	memory. One disadvantage of this algorithm is that it does not allow
473	different gap opening and extension penalties at each position. We have
474	modified the algorithm so as to allow this and the details are described in a
475	separate paper (27).
476
477
478
479	Menus/file formats
480
481	Six different sequence input formats are detected automatically and read by
482	the program: EMBL/Swiss Prot, NBRF/PIR, Pearson/FASTA (29), GCG/MSF
483	(30), GDE (Steven Smith, Harvard University Genome Center) and CLUSTAL
484	format alignments. The last three formats allow users to read in complete
485	alignments (e.g. for calculating phylogenetic trees or for addition of new
486	sequences to an existing alignment). Alignment output may be requested in
487	standard CLUSTAL format (self-explanatory blocked alignments) or in
488	formats compatible with the GDE, PHYLIP (31) or GCG (30) packages. The
489	program offers the user the ability to calculate Neighbour-Joining
490	phylogenetic trees from existing alignments with options to correct for
491	multiple hits (32,33) and to estimate confidence levels using a bootstrap
492	resampling procedure (34). The trees may be output in the "New
493	Hampshire" format that is compatible with the PHYLIP package (31).
494
495	Alignment to an alignment
496
497	Profile alignment is used to align two existing alignments (either of which
498	may consist of just one sequence) or to add a series of new sequences to an
499	existing alignment. This is useful because one may wish to build up a
500	multiple alignment gradually, choosing different parameters manually, or
501	correcting intermediate errors as the alignment proceeds. Often, just a few
502	sequences cause misalignments in the progressive algorithm and these can be
503	removed from the process and then added at the end by profile alignment. A
504	second use is where one has a high quality reference alignment and wishes to
505	keep it fixed while adding new sequences automatically.
506
507
508	Portability/Availability
509
510	The full source code of the package is provided free to academic users. The
511	program will run on any machine with a full ANSI conforming C compiler.
512	It has been tested on the following hardware/software combinations:
513	Decstation/Ultrix, Vax or ALPHA/VMS, Silicon Graphics/IRIX. The source
514	code and documentation are available by E-mail from the EMBL file server
515	(send the words HELP and HELP SOFTWARE on two lines to the internet
516	address:
517	Netserv@EMBL-Heidelberg.DE) or by anonymous FTP from
518	FTP.EMBL-Heidelberg.DE. Queries may be addressed by E-mail to
519	Des.Higgins@EBI.AC.UK or Gibson@EMBL-Heidelberg.DE.
520
521
522	RESULTS AND DISCUSSION
523
524
525	Alignment of SH3 Domains
526
527	The ~60 residue SH3 domain was chosen to illustrate the performance of
528	CLUSTAL W, as there is a reference manual alignment (23) and the fold is
529	known (24). SH3 domains, with a minimum similarity below 12% identity,
530	are poorly aligned by progressive alignment programs such as CLUSTAL V
531	and PILEUP: neither program can generate the correct blocks corresponding to
532	the secondary structure elements.
533
534	Figure 4 shows an alignment generated by CLUSTAL W of the example set of
535	SH3 domains. The alignment was generated in two steps. After progressive
536	alignment, five blocks were produced, corresponding to structural elements,
537	with gaps inserted exclusively in the known loop regions. The beta strands in
538	blocks 1, 4 and 5 were all correctly superposed. However, four sequences in
539	block 2 and one sequence in block 3 were misaligned by 1-2 residues
540	(underlined in figure 4). A second progressive alignment of the aligned
541	sequences, including the gaps, improved this alignment: A single misaligned
542	sequence, H_P55, remains in block 2 (boxed in figure 4), while block 3 is now
543	completely aligned. This alignment corrects several errors (eg. P85A, P85B
544	and FUS1) in the manual alignment (23).
545
546	The SH3 alignment illustrates several features of CLUSTAL W usage. Firstly,
547	in a practical application involving divergent sequences, the initial
548	progressive alignment is likely to be a good but not perfect approximation to
549	the correct alignment. The alignment quality can be improved in a number of
550	ways. If the block structure of the alignment appears to be correct, realignment
551	of the alignment will usually improve most of the misaligned blocks: the
552	existing gaps allow the blocks to "float" cheaply to a locally optimal position
553	without disturbing the rest of the alignment. Remaining sequences which are
554	doubtfully aligned can then be individually tested by profile alignment to the
555	remainder: the misaligned H_P55 SH3 domain can be correctly aligned by
556	profile (with GOP <= 8). The indel regions in the final alignment can then be
557	manually cleaned up: Usually the exact alignment in the loop regions is not
558	determinable, and may have no meaning in structural terms. It is then
559	desirable to have a single gap per structural loop. CLUSTAL W achieved this
560	for two of the four SH3 loop regions (figure 4).
561
562	If the block structure of the alignment appears suspect, greater intervention by
563	the user may be required. The most divergent sequences, especially if they
564	have large insertions (which can be discerned with the aid of dot matrix
565	plots), should be left out of the progressive alignment. If there are sets of
566	closely related sequences that are deeply diverged from other sets, these can be
567	separately aligned and then merged by profile alignment. Incorrectly
568	determined sequences, containing frameshifts, can also confound regions of
569	an alignment: these can be hard to detect but sometimes they have been
570	grouped within the excluded divergent sequences: then they may be revealed
571	when they are individually compared to the alignment as having apparently
572	nonsense segments with respect to the other sequences.
573
574
575
576	Finding the best alignment
577
578	In cases where all of the sequences in a data set are very similar (e.g. no pair
579	less than 35% identical), CLUSTAL W will find an alignment which is
580	difficult to improve by eye. In this sense, the alignment is optimal with
581	regard to the alternative of manual alignment. Mathematically, this is vague
582	and can only be put on a more systematic footing by finding an objective
583	function (a measure of multiple alignment quality) that exactly mirrors the
584	information used by an "expert" to evaluate an alignment. Nonetheless, if an
585	alignment is impossible to improve by eye, then the program has achieved a
586	very useful result.
587
588	In more difficult cases, as more divergent sequences are included, it becomes
589	increasingly difficult to find good alignments and to evaluate them. What
590	we find with CLUSTAL W is that the basic block-like structure of the
591	alignment (corresponding to the major secondary structure elements) is
592	usually recovered, with some of the most divergent sequences misaligned in
593	small regions. This is a very useful starting point for manual refinement as it
594	helps define the major blocks of similarity. The problem sequences can be
595	removed from the analysis and realigned to the rest of the sequences
596	automatically or with different parameter settings. An examination of the
597	tree used to guide the alignment will usually show which sequences will be
598	most unreliably placed (those that branch off closest to the root and/or those
599	that align to other single sequences at a very low level of sequence identity
600	rather than align to a group of pre-aligned sequences). Finally, one can
601	simply iterate the multiple alignment process by feeding an output alignment
602	back into CLUSTAL W and repeating the multiple alignment process (using
603	the same or different parameters). The SH3 domain alignment in figure 4
604	was derived in this way by 2 passes using default parameters. In the second
605	pass, the local gap penalties are dominated by the placement of the initial
606	major gap positions. The alignment will either remain unchanged or will
607	converge rapidly (after 1 or 2 extra passes) on a better solution. If the
608	placement of the initial gaps is approximately correct but some of the
609	sequences are locally misaligned, this works well.
610
611
612	Comparison with other methods
613
614	Recently, several papers have addressed the problem of position specific
615	parameters for multiple alignment. In one case (35), local gap penalties are
616	increased in alpha helical and beta strand regions, when the 3-D structures of
617	one or more of the sequences are known. In a second case (36), a hidden
618	Markov model was used to estimate position specific gap penalties and
619	residue substitution weight matrices when large numbers of examples of a
620	protein domain were known. With CLUSTAL W, we attempt to derive the
621	same information purely from the set of sequences to be aligned. Therefore,
622	we can apply the method to any set of sequences. The success of this approach
623	will depend on the number of available sequences and their evolutionary
624	relationships. It will also depend on the decision making process during
625	multiple alignment (e.g. when to change weight matrix) and the accuracy and
626	appropriateness of our parameterisation. In the long term, this can only be
627	evaluated by exhaustive testing of sets of sequences where the correct
628	alignment (or parts of it) are known from structural information. What is
629	clear, however, is that the modifications described here significantly improve
630	the sensitivity of the progressive multiple alignment approach. This is
631	achieved with almost no sacrifice in speed and efficiency.
632
633	There are several areas where further improvements in sensitivity and
634	accuracy can be made. Firstly, the residue weight matrices and gap settings
635	can be made more accurate as more and more data accumulate, while
636	matrices for specific sequence types can be derived (e.g. for transmembrane
637	regions (37)). Secondly, stochastic or iterative optimisation methods can be
638	used to refine initial alignments (7,9,10). CLUSTAL W could be run with
639	several sets of starting parameters and in each case, the alignments refined
640	according to an objective function. The search for a good objective function,
641	that takes into account the sequence and position specific information used in
642	CLUSTAL W is a key area of research. Finally, the average number of
643	examples of each protein domain or family is growing steadily. It is not only
644	important that programs can cope with the large volumes of data that are
645	being generated, they should be able to exploit the new information to make
646	the alignments more and more accurate. Globally optimal alignments
647	(according to an objective function) may not always be possible but the
648	problem may be avoided if sufficiently large volumes of data become
649	available. CLUSTAL W is a step in this direction.
650
651	ACKNOWLEDGEMENTS
652
653	Numerous people have offered advice and suggestions for improvements to
654	earlier versions of the CLUSTAL programs. D.H. wishes to apologise to all of
655	the irate CLUSTAL V users who had to live with the bugs and lack of facilities
656	for getting trees in the New Hampshire format. We wish to specifically thank
657	Jeroen Coppieters who suggested using a series of weight matrices and Steven
658	Henikoff for advice on using the BLOSUM matrices. We are grateful to Rein
659	Aasland, Peer Bork, Ariel Blocker and Brtrand Seraphin for providing
660	challenging alignment problems. T.G. and J.T. thank Kevin Leonard for
661	support and encouragement. Finally, we thank all of the people who were
662	involved with various CLUSTAL programs over the years, namely: Paul
663	Sharp, Rainer Fuchs and Alan Bleasby.
664
665
666	REFERENCES
667
668	1.Feng, D.-F. and Doolittle, R.F. (1987). J. Mol. Evol. 25, 351-360.
669	2.Needleman, S.B. and Wunsch, C.D. (1970). J. Mol. Biol. 48, 443-453.
670	3.Dayhoff, M.O., Schwartz, R.M. and Orcutt, B.C. (1978) in Atlas of Protein
671	Sequence and Structure, vol. 5, suppl. 3 (Dayhoff, M.O., ed.), pp 345-352,
672	NBRF, Washington.
673	4.Henikoff, S. and Henikoff, J.G. (1992). Proc. Natl. Acad. Sci. USA 89, 10915-
674	10919.
675	5.Lipman, D.J., Altschul, S.F. and Kececioglu, J.D. (1989). Proc. Natl. Acad. Sci.
676	USA 86, 4412-4415.
677	6.Barton, G.J. and Sternberg, M.J.E. (1987). J. Mol. Biol. 198, 327-337.
678	7.Gotoh, O. (1993). CABIOS 9, 361-370.
679	8.Altschul, S.F. (1989). J. Theor. Biol. 138, 297-309.
680	9.Lukashin, A.V., Engelbrecht, J. and Brunak, S. (1992). Nucl. Acids Res. 20,
681	2511-2516.
682	10.Lawrence, C.E., Altschul, S.F., Boguski, M.S., Liu, J.S., Neuwald, A.F. and
683	Wooton, J.C. (1993). Science, 262, 208-214.
684	11.Vingron, M. and Waterman, M.S. (1993). J. Mol. Biol. 234, 1-12.
685	12.Pascarella, S. and Argos, P. (1992). J. Mol. Biol. 224, 461-471.
686	13.Collins, J.F. and Coulson, A.F.W. (1987). In Nucleic acid and protein
687	sequence analysis a practical approach, Bishop, M.J. and Rawlings, C.J. ed.,
688	chapter 13, pp. 323-358.
689	14.Vingron, M. and Sibbald, P.R. (1993). Proc. Natl. Acad. Sci. USA, 90, 8777-
690	8781.
691	15.Thompson, J.D., Higgins, D.G. and Gibson, T.J. (1994). CABIOS, 10, 19-29.
692	16.Lthy, R., Xenarios, I. and Bucher, P. (1994). Protein Science, 3, 139-146.
693	17.Higgins, D.G. and Sharp, P.M. (1988). Gene, 73, 237-244.
694	18.Higgins, D.G. and Sharp, P.M. (1989). CABIOS, 5, 151-153.
695	19.Higgins, D.G., Bleasby, A.J. and Fuchs, R. (1992). CABIOS, 8, 189-191.
696	20.Sneath, P.H.A. and Sokal, R.R. (1973). Numerical Taxonomy, W.H.
697	Freeman, San Francisco.
698	21.Saitou, N. and Nei, M. (1987). Mol. Biol. Evol. 4, 406-425.
699	22.Wilbur, W.J. and Lipman, D.J. (1983). Proc. Natl. Acad. Sci. USA, 80, 726-
700	730.
701	23.Musacchio, A., Gibson, T., Lehto, V.-P. and Saraste, M. (1992). FEBS Lett.
702	307, 55-61.
703	24.Musacchio, A., Noble, M., Pauptit, R., Wierenga, R. and Saraste, M. (1992).
704	Nature, 359, 851-855.
705	25.Bashford, D., Chothia, C. and Lesk, A.M. (1987). J. Mol. Biol. 196, 199-216.
706	26.Myers, E.W. and Miller, W. (1988). CABIOS, 4, 11-17.
707	27.Thompson, J.D. (1994). CABIOS, (Submitted).
708	28.Smith, T.F., Waterman, M.S. and Fitch, W.M. (1981). J. Mol. Evol. 18, 38-46.
709	29.Pearson, W.R. and Lipman, D.J. (1988). Proc. Natl. Acad. Sci. USA. 85, 2444-
710	2448.
711	30.Devereux, J., Haeberli, P. and Smithies, O. (1984). Nucleic Acids Res. 12,
712	387-395.
713	31.Felsenstein, J. (1989). Cladistics 5, 164-166.
714	32.Kimura, M. (1980). J. Mol. Evol. 16, 111-120.
715	33.Kimura, M. (1983). The Neutral Theory of Molecular Evolution.
716	Cambridge University Press, Cambridge.
717	34.Felsenstein, J. (1985). Evolution 39, 783-791.
718	35.Smith, R.F. and Smith, T.F. (1992) Protein Engineering 5, 35-41.
719	36.Krogh, A., Brown, M., Mian, S., Sjlander, K. and Haussler, D. (1994) J. Mol.
720	Biol. 235-1501-1531.
721	37.Jones, D.T., Taylor, W.R. and Thornton, J.M. (1994). FEBS Lett. 339, 269-275.
722	38.Bairoch, A. and Bckmann, B. (1992) Nucleic Acids Res., 20, 2019-2022.
723	39.Noble, M.E.M., Musacchio, A., Saraste, M., Courtneidge, S.A. and
724	Wierenga, R.K. (1993) EMBO J. 12, 2617-2624.
725	40.Kabsch, W. and Sander, C. (1983) Biopolymers, 22, 2577-2637.
726
727	FIGURE LEGENDS
728
729	Figure 1. The basic progressive alignment procedure, illustrated using a set of
730	7 globins of known tertiary structure. The sequence names are from Swiss
731	Prot (38): Hba_Horse: horse alpha globin; Hba_Human: human alpha globin;
732	Hbb_Horse: horse beta globin; Hbb_Human: human beta globin; Myg_Phyca:
733	sperm whale myoglobin; Glb5_Petma: lamprey cyanohaemoglobin;
734	Lgb2_Luplu: lupin leghaemoglobin. In the distance matrix, the mean
735	number of differences per residue is given. The unrooted tree shows all
736	branch lengths drawn to scale. In the rooted tree, all branch lengths (mean
737	number of differences per residue along each branch) are given as well as
738	weights for each sequence. In the multiple alignment, the approximate
739	positions of the 7 alpha helices, common to all 7 proteins are shown. This
740	alignment was derived using CLUSTAL W with default parameters and the
741	PAM (3) series of weight matrices.
742
743	Figure 2. The scoring scheme for comparing two positions from two
744	alignments. Two sections of alignment with 4 and 2 sequences respectively
745	are shown. The score of the position with amino acids T,L,K,K versus the
746	position with amino acids V and I is given with and without sequence
747	weights. M(X,Y) is the weight matrix entry for amino acid X versus amino
748	acid Y. Wn is the weight for sequence n.
749
750	Figure 3. The variation in local gap opening penalty is plotted for a section of
751	alignment. The inital gap opening penalty is indicated by a dotted line. Two
752	hydrophilic stretches are underlined. The lowest penalties correspond to the
753	ends of the alignment, the hydrophilic stretches and the two positions with
754	gaps. The highest values are within 8 residues of the two gap positions. The
755	rest of the variation is caused by the residue specific gap penalties (12).
756
757	Figure 4. CLUSTAL W Alignment of a set of SH3 domains taken from (23).
758	Secondary structure assignments for the solved Spectrin (24) and Fyn (39)
759	domains are according to DSSP (40). The alignment was generated in two
760	steps using default parameters. After full multiple alignment, the aligned
761	sequences were realigned. Segments which were correctly aligned in the
762	second pass are underlined. The single misaligned segment in H_P55 and the
763	misaligned residue in H_NCK/2 are boxed.
764
765	The sequences are coloured to illustrate significant features. All G (orange)
766	and P (yellow) are coloured. Other residues matching a frequent occurrence of
767	a property in a column are coloured: hydrophobic = blue; hydrophobic
768	tendency = light blue; basic = red; acidic = purple; hydrophilic = green; White
769	= unconserved. The alignment figure was prepared with the GDE sequence
770	editor (S. Smith, Harvard University) and COLORMASK (J. Thompson,
771	EMBL).
772
773
774
775
776	Table 1. Pascarella and Argos residue specific gap modification factors.
777	-----------------------------------------------------------------------------------
778	A 1.13 M 1.29
779	C 1.13 N 0.63
780	D 0.96 P 0.74
781	E 1.31 Q 1.07
782	F 1.20 R 0.72
783	G 0.61 S 0.76
784	H 1.00 T 0.89
785	I 1.32 V 1.25
786	K 0.96 Y 1.00
787	L 1.21 W 1.23
788	-----------------------------------------------------------------------------------
789	The values are normalised around a mean value of 1.0 for H. The lower the
790	value, the greater the chance of having an adjacent gap. These are derived
791	from the original table of relative frequencies of gaps adjacent to each residue
792	(12) by subtraction from 2.0.
793
794

Note: See TracBrowser for help on using the repository browser.

Download in other formats: