1	README for Clustal W version 1.7 June 1997
2
3	Clustal W version 1.7 Documentation
4
5	This file provides some notes on the latest changes, installation and usage
6	of the Clustal W multiple sequence alignment program.
7
8
9
10	Julie Thompson (Thompson@EMBL-Heidelberg.DE)
11	Toby Gibson (Gibson@EMBL-Heidelberg.DE)
12
13	European Molecular Biology Laboratory
14	Meyerhofstrasse 1
15	D 69117 Heidelberg
16	Germany
17
18
19	Des Higgins (Higgins@ucc.ie)
20
21	University College Cork
22	Cork
23	Ireland
24
25
26	Please e-mail bug reports/complaints/suggestions (polite if possible)
27	to Toby Gibson or Des Higgins.
28
29
30
31	Thompson, J.D., Higgins, D.G. and Gibson, T.J. (1994)
32	CLUSTAL W: improving the sensitivity of progressive multiple sequence alignment
33	through sequence weighting, position-specific gap penalties and weight matrix
34	choice. Nucleic Acids Research, 22:4673-4680.
35
36	--------------------------------------------------------------
37
38	What's New (June 1997) in Version 1.7 (since version 1.6).
39
40
41	1. The static arrays used by clustalw for storing the alignment data have been
42	replaced by dynamically allocated memory. There is now no limit on the number
43	or length of sequences which can be input.
44
45	2. The alignment of DNA sequences now offers a new hard-coded matrix, as well
46	as the identity matrix used previously. The new matrix is the default scoring
47	matrix used by the BESTFIT program of the GCG package for the comparison of
48	nucleic acid sequences. X's and N's are treated as matches to any IUB ambiguity
49	symbol. All matches score 1.9; all mismatches for IUB symbols score 0.0.
50
51	3. The transition weight option for aligning nucleotide sequences has been
52	changed from an on/off toggle to a weight between 0 and 1. A weight of zero
53	means that the transitions are scored as mismatches; a weight of 1 gives
54	transitions the full match score. For distantly related DNA sequences, the
55	weight should be near to zero; for closely related sequences it can be useful
56	to assign a higher score.
57
58	4. The RSF sequence alignment file format used by GCG Version 9 can now be
59	read.
60
61	5. The clustal sequence alignment file format has been changed to allow
62	sequence names longer than 10 characters. The maximum length allowed is set in
63	clustalw.h by the statement:
64
65	#define MAXNAMES 10
66
67	For the fasta format, the name is taken as the first string after the '>'
68	character, stopping at the first white space. (Previously, the first 10
69	characters were taken, replacing blanks by underscores).
70
71	6. The bootstrap values written in the phylip tree file format can be assigned
72	either to branches or nodes. The default is to write the values on the nodes,
73	as this can be read by several commonly-used tree display programs. Note,
74	however, that this can lead to confusion if the tree is rooted; the bootstraps
75	may be better attached to the internal branches. Software developers should
76	ensure they can read the branch label format.
77
78	7. The sequence weighting used during sequence to profile alignments has been
79	changed. The tree weight is now multiplied by the percent identity of the
80	new sequence compared with the most closely related sequence in the profile.
81
82	8. The sequence weighting used during profile to profile alignments has been
83	changed. A guide tree is now built for each profile separately and the
84	sequence weights calculated from the two trees. The weights for each
85	sequence are then multiplied by the percent identity of the sequence compared
86	with the most closely related sequence in the opposite profile.
87
88	9. The adjustment of the Gap Opening and Gap Extension Penalties for sequences
89	of unequal length has been improved.
90
91	10. The default order of the sequences in the output alignment file has been
92	changed. Previously the default was to output the sequences in the same order
93	as the input file. Now the default is to use the order in which the sequences
94	were aligned (from the guide tree/dendrogram), thus automatically grouping
95	closely related sequences.
96
97	11. The option to 'Reset Gaps between alignments' has been switched off by
98	default.
99
100	12. The conservation line output in the clustal format alignment file has been
101	changed. Three characters are now used:
102
103	'*' indicates positions which have a single, fully conserved residue
104
105	':' indicates that one of the following 'strong' groups is fully conserved:-
106
107	STA
108	NEQK
109	NHQK
110	NDEQ
111	QHRK
112	MILV
113	MILF
114	HY
115	FYW
116
117	'.' indicates that one of the following 'weaker' groups is fully conserved:-
118
119	CSA
120	ATV
121	SAG
122	STNK
123	STPA
124	SGND
125	SNDEQK
126	NDEQHK
127	NEQHRK
128	FVLIM
129	HFY
130
131	These are all the positively scoring groups that occur in the Gonnet Pam250
132	matrix. The strong and weak groups are defined as strong score >0.5 and weak
133	score =<0.5 respectively.
134
135	13. A bug in the modification of the Myers and Miller alignment algorithm
136	for residue-specific gap penalties has been fixed. This occasionally caused
137	new gaps to be opened a few residues away from the optimal position.
138
139	14. The GCG/MSF input format no longer needs the word PILEUP on the first
140	line. Several versions can now be recognised:-
141	1. The word PILEUP as the first word in the file
142	2. The word !!AA_MULTIPLE_ALIGNMENT or !!NA_MULTIPLE_ALIGNMENT
143	as the first word in the file
144	3. The characters MSF on the first line of the file, and the
145	characters .. at the end of that line.
146
147	15. The standard command line separator for UNIX systems has been changed from
148	'/' to '-', i.e. to give options on the command line, you now type
149
150	clustalw input.aln -gapopen=8.0
151
152	instead of
153
154	clustalw input.aln /gapopen=8.0
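
As an illustration of items 2 and 3 above, the following Python sketch scores a
pair of unambiguous DNA/RNA bases. It is not taken from the Clustal W source:
the function names are hypothetical, and it assumes that the transition weight
scales the match score linearly between the two documented end points (a weight
of 0 scores a transition as a mismatch, a weight of 1 gives the full match
score). Ambiguity symbols are not handled.

```python
# Hypothetical helper for illustration; not part of Clustal W.
MATCH_SCORE = 1.9      # match score quoted in item 2 above
MISMATCH_SCORE = 0.0   # mismatch score quoted in item 2 above

PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def _normalise(base):
    """Upper-case the base and treat U as T."""
    return base.upper().replace("U", "T")


def is_transition(a, b):
    """True for purine<->purine or pyrimidine<->pyrimidine changes."""
    a, b = _normalise(a), _normalise(b)
    if a == b:
        return False
    return {a, b} <= PURINES or {a, b} <= PYRIMIDINES


def pair_score(a, b, transition_weight):
    """Score two unambiguous bases; the weight must lie between 0 and 1."""
    if not 0.0 <= transition_weight <= 1.0:
        raise ValueError("transition weight must be between 0 and 1")
    if _normalise(a) == _normalise(b):
        return MATCH_SCORE
    if is_transition(a, b):
        # Assumed linear scaling between mismatch and full match.
        return MISMATCH_SCORE + transition_weight * (MATCH_SCORE - MISMATCH_SCORE)
    return MISMATCH_SCORE
```

With a weight near zero, A/G and C/T pairs score like any other mismatch, which
is the setting suggested above for distantly related sequences.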
155
156
157	ATTENTION SOFTWARE DEVELOPERS!!
158	-------------------------------
159
160	The CLUSTAL sequence alignment output format has been modified:
161
162	1. Names longer than 10 chars are now allowed. (The maximum is specified in
163	clustalw.h by '#define MAXNAMES'.)
164
165	2. The consensus line now consists of three characters: '*',':' and '.'. (Only
166	the '*' and '.' were previously used.)
167
168	3. An option (not the default) has been added, allowing the user to print out
169	sequence numbers at the end of each line of the alignment output.
170
171	4. Both RNA bases (U) and base ambiguities are now supported in nucleic acid
172	sequences. In the past, all characters (upper or lower case) other than
173	a,c,g,t or u were converted to N. Now the following characters are recognised
174	and retained in the alignment output: ABCDGHKMNRSTUVWXY (upper or lower case).
175
176	5. A blank line inadvertently added in the version 1.6 header has been taken
177	out again.
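
For developers parsing the new consensus line, the following Python sketch
shows one way the three symbols could be derived from the 'strong' and 'weaker'
groups listed earlier in this file. It is an illustration only, not the Clustal
W code; the function names are hypothetical, and the treatment of columns that
contain a gap (no symbol) is an assumption.

```python
STRONG_GROUPS = ["STA", "NEQK", "NHQK", "NDEQ", "QHRK",
                 "MILV", "MILF", "HY", "FYW"]
WEAK_GROUPS = ["CSA", "ATV", "SAG", "STNK", "STPA", "SGND",
               "SNDEQK", "NDEQHK", "NEQHRK", "FVLIM", "HFY"]


def conservation_symbol(column):
    """Return '*', ':', '.' or ' ' for one column of aligned residues."""
    residues = {r.upper() for r in column}
    if "-" in residues:
        return " "  # assumption: a column containing a gap gets no symbol
    if len(residues) == 1:
        return "*"  # single, fully conserved residue
    if any(residues <= set(group) for group in STRONG_GROUPS):
        return ":"  # every residue belongs to one 'strong' group
    if any(residues <= set(group) for group in WEAK_GROUPS):
        return "."  # every residue belongs to one 'weaker' group
    return " "


def conservation_line(sequences):
    """Build the conservation line for equal-length aligned sequences."""
    return "".join(conservation_symbol(col) for col in zip(*sequences))
```

A reader of the format only needs to accept the three characters and blanks;
the group tables matter only if the line has to be regenerated.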
178
179
180	--------------------------------------------------------------
181
182	What's New (March 1996) in Version 1.6 (since version 1.5).
183
184
185	1) Improved handling of sequences of unequal length. Previously, we
186	increased the gap extension penalties for both sequences if the two sequences
187	(or groups of previously aligned sequences) were of different lengths.
188	Now, we increase the gap opening and extension penalties for the shorter
189	sequence only. This helps prevent short sequences being stretched out
190	along longer ones.
191
192	2) Added the "Gonnet" series of weight matrices (from Gaston Gonnet and
193	co-workers at the ETH in Zurich). Fixed a bug in the matrix
194	choice menu; now PAM matrices can be selected ok.
195
196	3) Added secondary structure/gap penalty masks. These allow you to
197	include, in an alignment, a position specific set of gap penalties.
198	You can either set a gap opening penalty at each position or specify
199	the secondary structure (if protein; alpha helix, beta strand or loop)
200	and have gap penalties set automatically. This, basically, is used to make
201	gaps harder to open inside helices or strands.
202
203	These masks are only used in the "profile alignment" menu. They may be read in
204	as part of an alignment in a special format (see the on-line help for
205	details) or associated with each sequence, if the sequences are in Swiss Prot
206	format and secondary structure information is given. All of the mask
207	parameters can be set from the profile alignment menu. Basically, the
208	mask is made up of a series of numbers between 1 and 9, one per position.
209	The gap opening penalty at a position is calculated as the starting penalty
210	multiplied by the mask value at that site.
211
212	4) Added command line options /profile and /sequences.
213	These allow users to choose between normal profile alignment where the
214	two profiles (pre-existing alignments specified in the files
215	/profile1= and /profile2=) are merged/aligned with each other (/profile)
216	and the case where the individual sequences in /profile2 are aligned
217	sequentially with the alignment in /profile1 (/sequences).
218
219	5) Fixed bug in modified Myers and Miller algorithm - gap penalty score
220	was not always calculated properly for type 2 midpoints. This is the core
221	alignment algorithm.
222
223	6) Only allows one output file format to be selected from command line
224	- i.e. multiple output alignment files are not allowed.
225
226	7) Fixed 'bad calls to ckfree' error during calculation of phylip distance
227	matrix.
228
229	8) Fixed command line options /gapopen /gapext /type=protein /negative.
230
231	9) Allowed user to change command line separator on UNIX from '/' to '-'.
232	This allows unix users to use the more conventional '-' symbol
233	for separating command line options. "/" can then be used in unix
234	file names on the command line. The symbol that is used
235	is specified in the file clustalw.h which must be edited if you
236	wish to change it (and the program must then be recompiled). Find the
237	block of code in clustalw.h that corresponds to the operating system you
238	are using. These blocks are started by one of the following:
239
240	#ifdef VMS
241	#elif MAC
242	#elif MSDOS
243	#elif UNIX
244
245	On the next line after each is the line:
246
247	#define COMMANDSEP '/'
248
249	Change this in the appropriate block of code (e.g. the UNIX block) to
250
251	#define COMMANDSEP '-'
252
253	if you wish to use the "-" character as command separator.
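
The gap penalty mask arithmetic described in item 3 above can be sketched in
Python as follows. This is a minimal illustration, assuming the mask is given
as a string of digits 1 to 9, one per alignment position; the function name is
hypothetical and the real mask file format is described in the on-line help.

```python
def masked_gap_open_penalties(mask, base_gap_open):
    """Return one gap opening penalty per position of the mask string."""
    penalties = []
    for ch in mask:
        if ch not in "123456789":
            raise ValueError("mask values must be digits between 1 and 9")
        # Penalty at this site = starting penalty * mask value.
        penalties.append(base_gap_open * int(ch))
    return penalties
```

A mask value of 1 leaves the starting penalty unchanged; higher values make a
gap harder to open at that position, e.g. inside a helix or strand.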
254
255
256
257	--------------------------------------------------------------
258
259	What's New (April 1995) in Version 1.5 (since version 1.3).
260
261	1) ported to MAC and PC. These versions are quite slow unless you
262	have a nice beefy machine. On a Power Mac or a Pentium box
263	it is nice and fast. Two precompiled versions are supplied for Macs
264	(Power mac and old mac versions).
265
266	Mac:        1500 residues by 100 sequences
267	Power Mac:  3000 residues by 100 sequences
268	PC:         1500 residues by 100 sequences
269
270	2) alignment of new sequences to an alignment. Fixed a serious bug
271	which assigned weights to the wrong sequences. Now also, weights
272	sequences according to distance from the incoming sequence. The
273	new weights are: tree weights * similarity to incoming sequence.
274	The tree weights are the old weights that we derive from the tree
275	connecting all the sequences in the existing alignment.
276
277	3) for all platforms, output linelength = 60.
278
279	4) Bootstrap files (*.phb): the "final" node (arbitrary trichotomy
280	at the end of the neighbor-joining process) is labelled as
281	TRICHOTOMY in the bootstrap output files. This is to help
282	link bootstrap figures with nodes when you reroot the tree.
283
284	5) Command line /bootstrap option now more robust.
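
The reweighting in item 2 above (tree weights * similarity to incoming
sequence) amounts to an element-wise product. The Python sketch below is an
illustration only; the function name is hypothetical, and expressing the
similarity as a fraction between 0 and 1 is an assumption.

```python
def profile_weights(tree_weights, similarities):
    """Multiply each tree weight by the similarity to the incoming sequence."""
    if len(tree_weights) != len(similarities):
        raise ValueError("need one similarity per sequence in the alignment")
    return [w * s for w, s in zip(tree_weights, similarities)]
```

Sequences in the existing alignment that are close to the incoming sequence
keep most of their tree weight; distant ones are down-weighted.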
285
286	--------------------------------------------------------------
287	INTRODUCTION
288
289
290
291	This document gives some BRIEF notes about usage of the Clustal W
292	multiple alignment program for UNIX and VMS machines. Clustal W
293	is a major update and rewrite of the Clustal V program which
294	was described in:
295
296	Higgins, D.G., Bleasby, A.J. and Fuchs, R. (1992)
297	CLUSTAL V: improved software for multiple sequence alignment.
298	Computer Applications in the Biosciences (CABIOS), 8(2):189-191.
299
300	The main new features are a greatly improved (more sensitive)
301	multiple alignment procedure for proteins and improved support
302	for different file formats. This software was described in:
303
304	Thompson, J.D., Higgins, D.G. and Gibson, T.J. (1994)
305	CLUSTAL W: improving the sensitivity of progressive multiple
306	sequence alignment through sequence weighting, position specific
307	gap penalties and weight matrix choice.
308	Nucleic Acids Research, 22(22):4673-4680.
309
310
311	The usage of Clustal W is largely the same as for
312	Clustal V, details of which are described in clustalv.txt. Details of the
313	new alignment algorithms are described in the manuscript by
314	Thompson et al. above, an ascii/text version of which is included
315	(clustalw.ms). This file lists some of the details not covered by either
316	of the above documents.
317
318
319	There are brief notes on the following topics:
320
321	1) Installation for VMS, UNIX, MAC and PC
322	2) File input
323	3) File output
324	4) Changes to the alignment algorithms
325	5) Minor modifications to the phylogenetic tree and bootstrapping methods
326	6) Summary of the command line usage
327
328	-------------------------------------------------------------------
329
330	1) INSTALLATION (for Unix, VAX/VMS, PC and MAC)
331
332
333
334	***IMPORTANT***
335	If you wish to recompile the program (or compile it for the first
336	time; you will have to do this with UNIX):
337	first check the file CLUSTALW.H which needs to be changed if you
338	move the code between unix and vms machines. At the top
339	of the file are four lines which define one of VMS, MSDOS, MAC or
340	UNIX to be 1. All of these EXCEPT one must be commented out
341	using enclosed /* ... */.
342	*******************
343
344
345	Unix
346	-----
347
348	Make files are supplied for unix machines. The code was compiled and
349	tested using Decstation (Ultrix), SUN (Gnu C compiler/gcc), Silicon
350	Graphics (IRIX) and DEC/Alpha (OSF1). We have not tested the code on any other
351	systems. Just use makefile to make on most systems. For Sun, you need to
352	have the Gnu C (gcc) compiler installed ... use the file makefile.sun in this
353	case. You make the program with:
354	make (or make -f makefile.sun)
355
356	This produces the file clustalw which can be run by typing clustalw and
357	pressing return. The help file is called clustalw_help.
358
359
360	VMS
361	----
362
363	There is a small DCL command file (VMSLINK.COM) to compile and link the
364	code for VMS machines (vax or alpha). This procedure just compiles the
365	source files and links using default settings. Run it using:
366	$ @vmslink
367	This produces Clustalw.exe which can be run using the run command:
368	$ run clustalw
369
370	The intermediate object files can be deleted with:
371	$ del *.obj;
372
373	There is an extensive command line facility. To use this, you must
374	create a symbol to run the program (and put this in your login.com file).
375	e.g.
376	$ clustalw :== $$drive:[dir.dir]clustalw
377	where $drive is the drive on which the executable file is stored (clustalw.exe)
378	and [dir.dir] is the full directory specification. NOTE THE EXTRA DOLLAR SIGN.
379	Then the program can be run using the command:
380	$ clustalw
381
382
383	PC
384	--
385
386	We supply an executable file (Clustalw.exe) which will run using MSDOS.
387	It will also run under windows (as a DOS application)
388	* IF you have a maths coprocessor*. If you do not have a maths chip
389	(e.g. 80387), the program can only be run under MSDOS. In the latter case,
390	you must have the file EMU387.exe in the same directory as CLUSTALW.EXE.
391	This file emulates a maths chip if you do not have one.
392
393
394	We generated the executable file using gnu c for MSDOS.
395	It will also compile (with about 10,000 warning messages)
396	using Microsoft C but we have not tested it and there appear to be problems
397	with the executable.
398
399	You will need to use a "memory extender" to allow the program to get at more
400	than 640kb of memory.
401
402
403
404	MAC
405	---
406
407	The code compiles for Power Mac and older macs using the Metrowerks CodeWarrior
408	C compiler. We supply 2 executable programs (one each for PowerMac and
409	older mac: ClustalwPPC and Clustalw68k). These need up to
410	10mb of memory to run which needs to be adjusted with the Get Info (%I)
411	command from the Finder if you have problems. Just double click the
412	executable file name or icon and off you go (we hope).
413
414	As a special treat for Mac users, we supply an executable and brief readme
415	file for NJPLOT. This is a really nice program by Manolo Gouy
416	(University of Lyon, France) that allows you to import the trees
417	made by Clustal W and display them/manipulate them. It will properly
418	display the bootstrap figures from the *.phb files. It can export the
419	trees in PICT format which can then be used by MacDraw for example.
420
421
422	-------------------------------------------------------------------------
423
424	2) FILE INPUT (sequences to be aligned)
425
426
427
428	The sequences must all be in one file (or two files for a "profile alignment")
429	in ONE of the following formats:
430
431	FASTA (Pearson), NBRF/PIR, EMBL/Swiss Prot, GDE, CLUSTAL, GCG/MSF, GCG9/RSF.
432
433	The program tries to "guess" which format is being used and whether
434	the sequences are nucleic acid (DNA/RNA) or amino acid (proteins). The
435	format is recognised by the first characters in the file. This is kind
436	of stupid/crude but works most of the time and it is difficult
437	to do reliably any other way.
438
439
440	Format            First non blank word or character in the file.
441	...............................................................
442	FASTA             >
443	NBRF              >P1; or >D1;
444	EMBL/SWISS        ID
445	GDE protein       %
446	GDE nucleotide    #
447	CLUSTAL           CLUSTAL (blocked multiple alignments)
448	GCG/MSF           PILEUP or !!AA_MULTIPLE_ALIGNMENT or !!NA_MULTIPLE_ALIGNMENT
449	                  or MSF on the first line, and '..' at the end of line
450	GCG9/RSF          !!RICH_SEQUENCE
451
452	Note that the only way of spotting that a file is MSF format is if one of
453	the markers listed above is present (e.g. the word PILEUP at the very beginning
454	of the file). If you produce this format from software other than the GCG
455	package, you may have to insert the word PILEUP at the start of the file.
456	Similarly, if you use clustal format, the word CLUSTAL must appear first.
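
The format table above can be read as a simple decision procedure. The Python
sketch below is an illustration only, not the Clustal W code: the function name
and the returned labels are hypothetical, matching is assumed to be
case-sensitive, and "first line" is taken to mean the first non-blank line.

```python
def guess_format(text):
    """Guess the sequence file format from the start of the file contents."""
    stripped = text.lstrip()
    if not stripped:
        return None
    first_line = stripped.splitlines()[0].rstrip()
    first_word = first_line.split()[0]

    # Check the more specific NBRF markers before the generic FASTA '>'.
    if first_word.startswith((">P1;", ">D1;")):
        return "NBRF"
    if first_word.startswith(">"):
        return "FASTA"
    if first_word == "ID":
        return "EMBL/SWISS"
    if first_word.startswith("%"):
        return "GDE protein"
    if first_word.startswith("#"):
        return "GDE nucleotide"
    if first_word == "CLUSTAL":
        return "CLUSTAL"
    if first_word == "!!RICH_SEQUENCE":
        return "GCG9/RSF"
    if first_word in ("PILEUP", "!!AA_MULTIPLE_ALIGNMENT",
                      "!!NA_MULTIPLE_ALIGNMENT"):
        return "GCG/MSF"
    if "MSF" in first_line and first_line.endswith(".."):
        return "GCG/MSF"
    return None  # format not recognised
```

The order of the tests matters: an NBRF file also starts with '>', so it has
to be tested before FASTA.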
457
458	All of these formats can be used to read in AN EXISTING FULL ALIGNMENT.
459	With CLUSTAL format, this is just the same as the output format of this
460	program and Clustal V. If you use PILEUP or CLUSTAL format, all sequences
461	must be the same length, INCLUDING GAPS ("-" in clustal format; "." in MSF).
462	With the other formats, sequences can be gapped with "-" characters. If you
463	read in any gaps these are kept during any later alignments. You can use
464	this facility to read in an alignment in order to calculate a phylogenetic
465	tree OR to output the same alignment in a different format (from the
466	output format options menu of the multiple alignment menu) e.g. read
467	in a GCG/MSF format alignment and output a PHYLIP format alignment. This is
468	also useful to read in one reference alignment and to add one or more new
469	sequences to it using the "profile alignment" facilities.
470
471	DNA vs. PROTEIN: the program will count the number of A,C,G,T,U and N
472	characters. If 85% or more of the characters in a sequence are as above,
473	then DNA/RNA is assumed, protein otherwise.
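
The 85% rule can be sketched in Python as follows. This is an illustration
only; the function name is hypothetical, and ignoring gaps and other
non-letter characters when counting is an assumption.

```python
def looks_like_nucleic_acid(sequence, threshold_percent=85):
    """Apply the 85% A,C,G,T,U,N rule to one sequence."""
    residues = [c for c in sequence.upper() if c.isalpha()]
    if not residues:
        return False
    hits = sum(1 for c in residues if c in "ACGTUN")
    # Integer comparison avoids floating point rounding at the threshold.
    return hits * 100 >= threshold_percent * len(residues)
```

Short protein sequences rich in A, C, G and T can be misclassified by a rule
of this kind.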
474
475	-------------------------------------------------------------------------
476
477
478	3) FILE OUTPUT
479
480
481	1) the alignments.
482
483	In the multiple alignment and profile alignment menus, there is a menu
484	item to control the output format(s).
485
486	The alignment output format can be set to any (or all) of:
487	CLUSTAL (a self explanatory blocked alignment)
488	NBRF/PIR (same as input format but with "-" characters for gaps)
489	MSF (the main GCG package multiple alignment format)
490	PHYLIP (Joe Felsenstein's phylogeny inference package. Gaps are set to
491	"-" characters. For some programs (e.g. PROTPARS/DNAPARS) these
492	should be changed to "?" characters for unknown residues.)
493	GDE (Used by Steven Smith's GDE package)
494
495	You can also choose between having the sequences in the same order as in
496	the input file or writing them out in an order that more closely matches the
497	order used to carry out the multiple alignment.
498
499
500	2) The trees.
501
502	Believe it or not, we now use the New Hampshire (nested parentheses)
503	format as default for our trees. This format is compatible with e.g. the
504	PHYLIP package. If you want to view a tree, you can use the RETREE or
505	DRAWGRAM/DRAWTREE programs of PHYLIP. This format is used for all our
506	trees, even the initial guide trees for deciding the order of multiple
507	alignment. The output trees from the phylogenetic tree menu can also be
508	requested in our old verbose/cryptic format. This may be more useful
509	if, for example, you wish to see the bootstrap figures. The bootstrap
510	trees in the default New Hampshire format give the bootstrap figures
511	as extra labels which can be viewed very easily using TREETOOL which is
512	available as part of the GDE package. TREETOOL is available from the
513	RDP project by ftp from rdp.life.uiuc.edu.
514
515	The New Hampshire format is only useful if you have software to display or
516	manipulate the trees. The PHYLIP package is highly recommended if you intend
517	to do much work with trees and includes programs for doing this. If you do
518	not have such software, request the trees in the older clustal format
519	and see the documentation for Clustal V (clustalv.txt). WE DO NOT PROVIDE
520	ANY DIRECT MEANS FOR VIEWING TREES GRAPHICALLY.
521
522	-------------------------------------------------------------------------
523
524	4) THE ALIGNMENT ALGORITHMS
525
526
527	The basic algorithm is the same as for Clustal V and is described in some
528	detail in clustalv.txt. The new modifications are described in detail in
529	clustalw.ms. Here we just list some notes to help answer some of the most
530	obvious questions.
531
532
533	Terminal Gaps
534
535	In the original Clustal V program, terminal gaps were penalised the same
536	as all other gaps. This caused some ugly side effects e.g.
537
538	acgtacgtacgtacgt                          acgtacgtacgtacgt
539	a----cgtacgtacgt  gets the same score as  ----acgtacgtacgt
540
541	NOW, terminal gaps are free. This is better on average and stops silly
542	effects like single residues jumping to the edge of the alignment. However,
543	it is not perfect. It does mean that if there should be a gap near the end
544	of the alignment, the program may be reluctant to insert it i.e.
545
546	cccccgggccccc                                              cccccgggccccc
547	ccccc---ccccc  may be considered worse (lower score) than  cccccccccc---
548
549	In the right hand case above, the terminal gap is free and may score higher
550	than the left hand alignment. This can be prevented by lowering the gap
551	opening and extension penalties. It is difficult to get this right all the
552	time. Please watch the ends of your alignments.
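
The effect of free terminal gaps on the examples above can be reproduced with
a small Python sketch that computes only the gap part of an alignment score.
It is an illustration, not the Clustal W scoring code: the function name is
hypothetical, and charging gap_open once per gap plus gap_ext per gap position
is an assumed form of the penalty.

```python
def gap_penalty(aligned, gap_open, gap_ext, free_terminal=True):
    """Total gap penalty for one row of a pairwise alignment."""
    start, end = 0, len(aligned)
    if free_terminal:
        # Skip leading and trailing runs of '-' so they cost nothing.
        start = len(aligned) - len(aligned.lstrip("-"))
        end = len(aligned.rstrip("-"))
    total = 0.0
    in_gap = False
    for ch in aligned[start:end]:
        if ch == "-":
            # The first position of a gap also pays the opening penalty.
            total += gap_ext if in_gap else gap_open + gap_ext
            in_gap = True
        else:
            in_gap = False
    return total
```

With free_terminal=False the rows a----cgtacgtacgt and ----acgtacgtacgt get
the same penalty, as in Clustal V; with free_terminal=True the second costs
nothing, and so does cccccccccc---, which is why an internal gap near the end
of an alignment may lose out to a terminal one.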
553
554
555
556	Speed of the initial (pairwise) alignments (fast approximate/slow accurate)
557
558	By default, the initial pairwise alignments are now carried out using a full
559	dynamic programming algorithm. This is more accurate than the older hash/
560	k-tuple based alignments (Wilbur and Lipman) but is MUCH slower. On a fast
561	workstation you may not notice but on a slow box, the difference is extreme.
562	You can set the alignment method from the menus easily to the older, faster
563	method.
564
565
566
567	Delaying alignment of distant sequences
568
569	The user can set a cut off to delay the alignment of the most divergent
570	sequences in a data set until all other sequences have been aligned. By
571	default, this is set to 40% which means that if a sequence is less than 40%
572	identical to any other sequence, its alignment will be delayed.
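
One reading of the cut off rule, sketched in Python: a sequence is delayed
when its best percent identity to every other sequence is below the cut off.
This is an illustration under that assumed interpretation; the function name
and the matrix layout are hypothetical.

```python
def sequences_to_delay(identity, cutoff=40.0):
    """Return indices of sequences whose alignment would be delayed.

    identity[i][j] is the percent identity between sequences i and j.
    """
    delayed = []
    n = len(identity)
    for i in range(n):
        others = [identity[i][j] for j in range(n) if j != i]
        # Delay only if no other sequence reaches the cut off.
        if others and max(others) < cutoff:
            delayed.append(i)
    return delayed
```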
573
574
575
576	Iterative realignment/Reset gaps between alignments
577
578	If you align a set of sequences a second time (e.g. with changed gap
579	penalties), the gaps from the first alignment are either discarded or kept,
580	depending on the "reset gaps" switch, which can be set from the menus. (Since
581	version 1.7 the reset is switched off by default, so older gaps are kept.)
582	Keeping the gaps and doing the full multiple alignment a second time can
583	sometimes give better alignments. Sometimes, the alignment will converge on a
584	better solution; sometimes the new alignment will be the same as the first.
585	There can be a strange side effect: you can get columns of nothing but gaps.
586
587	Any gaps that are read in from the input file are always kept, regardless
588	of the setting of this switch. If you read in a full multiple alignment, the "reset
589	gaps" switch has no effect. The old gaps will remain and if you carry out
590	a multiple alignment, any new gaps will be added in. If you wish to carry out
591	a full new alignment of a set of sequences that are already aligned in a file
592	you must input the sequences without gaps.
593
594
595
596	Profile alignment
597
598	By profile alignment, we simply mean the alignment of old alignments/sequences.
599	In this context, a profile is just an existing alignment (or even a set of
600	unaligned sequences; see below). This allows you to
601	read in an old alignment (in any of the allowed input formats) and align
602	one or more new sequences to it. From the profile alignment menu, you
603	are allowed to read in 2 profiles. Either profile can be a full alignment
604	OR a single sequence. In the simplest mode, you simply align the two profiles
605	to each other. This is useful if you want to gradually build up a full
606	multiple alignment.
607
608	A second option is to align the sequences from the second profile, one at
609	a time to the first profile. This is done, taking the underlying tree between
610	the sequences into account. This is useful if you have a set of new sequences
611	(not aligned) and you wish to add them all to an older alignment.
612
613	----------------------------------------------------------------------------
614
615	5) CHANGES TO THE PHYLOGENETIC TREE CALCULATIONS AND SOME HINTS.
616
617
618
619	IMPROVED DISTANCE CALCULATIONS FOR PROTEIN TREES
620
621
622	The phylogenetic trees in Clustal W (the real trees that you calculate
623	AFTER alignment; not the guide trees used to decide the branching order
624	for multiple alignment) use the Neighbor-Joining method of Saitou and
625	Nei based on a matrix of "distances" between all sequences. These distances
626	can be corrected for "multiple hits". This is normal practice when accurate
627	trees are needed. This correction stretches distances (especially large ones)
628	to try to correct for the fact that OBSERVED distances (mean number of
629	differences per site) greatly underestimate the actual number that happened
630	during evolution.
631
632	In Clustal V we used a simple formula to convert an observed distance to one
633	that is corrected for multiple hits. The observed distance is the mean number
634	of differences per site in an alignment (ignoring sites with a gap) and is
635	therefore always between 0.0 (for identical sequences) and 1.0 (no residues the
636	same at any site). These distances can be multiplied by 100 to give percent
637	difference values. 100 minus percent difference gives percent identity.
638	The formula we use to correct for multiple hits is from Motoo Kimura
639	(Kimura, M. The neutral Theory of Molecular Evolution, Camb.Univ.Press, 1983,
640	page 75) and is:
641
642	K = -Ln(1 - D - (D.D)/5) where D is the observed distance and K is
643	corrected distance.
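
The observed distance and Kimura's correction can be written down directly.
The Python sketch below is an illustration, not the Clustal W code; the
function names are hypothetical, and raising an error where the logarithm is
undefined is a choice made for this example.

```python
import math


def observed_distance(seq_a, seq_b):
    """Mean number of differences per site, ignoring sites with a gap."""
    if len(seq_a) != len(seq_b):
        raise ValueError("aligned sequences must have the same length")
    compared = 0
    differences = 0
    for a, b in zip(seq_a.upper(), seq_b.upper()):
        if a == "-" or b == "-":
            continue  # sites with a gap are not counted
        compared += 1
        if a != b:
            differences += 1
    if compared == 0:
        raise ValueError("no ungapped sites to compare")
    return differences / compared


def kimura_distance(d):
    """K = -ln(1 - D - D*D/5), the corrected distance for observed D."""
    inner = 1.0 - d - (d * d) / 5.0
    if inner <= 0.0:
        raise ValueError("formula undefined for this observed distance")
    return -math.log(inner)
```

Multiplying the result of kimura_distance by 100 gives the distance in PAM
units.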
644
645	This formula gives mean number of estimated substitutions per site and, in
646	contrast to D (the observed number), can be greater than 1 i.e. more than
647	one substitution per site, on average. For example, if you observe 0.8
648	differences per site (80% difference; 20% identity), then the above formula
649	predicts that there have been about 2.6 substitutions per site over the course
650	of evolution since the 2 sequences diverged. This can also be expressed in
651	PAM units by multiplying by 100 (mean number of substitutions per 100 residues).
652	The PAM scale of evolution and its derivation/calculation comes from the
653	work of Margaret Dayhoff and co workers (the famous Dayhoff PAM series
654	of weight matrices also came from this work). Dayhoff et al constructed
655	an elaborate model of protein evolution based on observed frequencies
656	of substitution between very closely related proteins. Using this model,
657	they derived a table relating observed distances to predicted PAM distances.
658	Kimura's formula, above, is just a "curve fitting" approximation to this table.
659	It is very accurate in the range 0.75 > D > 0.0 but becomes increasingly
660	inaccurate at high D (>0.75) and fails completely at around D = 0.85.
661
662	To circumvent this problem, we calculated all the values for K corresponding
663	to D above 0.75 directly using the Dayhoff model and store these in an
664	internal table, used by Clustal W. This table is declared in the file dayhoff.h and
665	gives values of K for all D between 0.75 and 0.93 in intervals of 0.001 i.e.
666	for D = 0.750, 0.751, 0.752 ...... 0.929, 0.930. For any observed D
667	higher than 0.930, we arbitrarily set K to 10.0. This sounds drastic but
668	with real sequences, distances of 0.93 (less than 7% identity) are rare.
669	If your data set includes sequences with this degree of divergence, you
670	will have great difficulty getting accurate trees by ANY method; the alignment
671	itself will be very difficult (to construct and to evaluate).
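
The three regimes described above (formula, table lookup, fixed cap) can be
sketched in Python as follows. This is an illustration only: the function name
is hypothetical, the table is passed in by the caller because the values from
dayhoff.h are not reproduced here, and rounding D to the nearest 0.001 for the
lookup is an assumption.

```python
import math


def corrected_distance(d, dayhoff_table):
    """Correct an observed distance d (0.0 to 1.0) for multiple hits.

    dayhoff_table must hold the K values for D = 0.750, 0.751, ..., 0.930
    (181 entries), supplied by the caller.
    """
    if not 0.0 <= d <= 1.0:
        raise ValueError("observed distance must be between 0.0 and 1.0")
    if d > 0.930:
        return 10.0  # arbitrary cap for very divergent pairs
    if d >= 0.75:
        # Assumed lookup: index of the nearest 0.001 step above 0.750.
        index = int(round((d - 0.75) * 1000))
        return dayhoff_table[index]
    # Kimura's formula is used where it is accurate (D below 0.75).
    return -math.log(1.0 - d - (d * d) / 5.0)
```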
672
673	There are some important
674	things to note. Firstly, this formula works well if your sequences are
675	of average amino acid composition and if the amino acids substitute according
676	to the original Dayhoff model. In other cases, it may be misleading. Secondly,
677	it is based only on observed percent distance i.e. it does not DIRECTLY
678	take conservative substitutions into account. Thirdly, the error on the
679	estimated PAM distances may be VERY great for high distances; at very high
680	distance (e.g. over 85%) it may give largely arbitrary corrected distances.
681	In most cases, however, the correction is still worth using; the trees will
682	be more accurate and the branch lengths will be more realistic.
683
684	A far more sophisticated distance correction based on a full Dayhoff
685	model which DOES take conservative substitutions and actual amino acid
686	composition into account, may be found in the PROTDIST program of the
687	PHYLIP package. For serious tree makers, this program is highly recommended.
688
689
690
691	TWO NOTES ON BOOTSTRAPPING...
692
693	When you use the BOOTSTRAP in Clustal W to estimate the reliability of parts
694	of a tree, many of the uncorrected distances may randomly exceed the arbitrary cut
695	off of 0.93 (sequences only 7% identical) if the sequences are distantly
696	related. This will happen randomly i.e. even if none of the pairs of
697	sequences are less than 7% identical, the bootstrap samples may contain pairs
698	of sequences that do exceed this cut off.
699	If this happens, you will be warned and told how often the problem
700	occurs. In practice, this can happen with many data sets. It is not a
701	serious problem if it happens rarely. If it happens often, you should
702	consider removing the most distantly related sequences and/or using the
703	PHYLIP package instead.
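
The effect described in this note can be explored with a small Python
sketch. It resamples alignment columns with replacement for one pair of
aligned sequences and counts how many bootstrap samples give an uncorrected
distance above the 0.93 cut off. The function names, the fixed random seed
and the decision to skip columns containing a gap character ('-') are
assumptions of this sketch, not a description of how Clustal W does it.

```python
import random

CUTOFF = 0.93  # the arbitrary cut off discussed above


def observed_distance(seq_a, seq_b):
    """Fraction of differing positions, ignoring columns with a gap."""
    pairs = [(a, b) for a, b in zip(seq_a, seq_b) if a != "-" and b != "-"]
    if not pairs:
        return 0.0
    mismatches = sum(1 for a, b in pairs if a != b)
    return mismatches / len(pairs)


def count_bootstrap_overflows(seq_a, seq_b, n_samples=1000, seed=0):
    """Count bootstrap samples whose observed distance exceeds CUTOFF."""
    rng = random.Random(seed)
    length = min(len(seq_a), len(seq_b))
    overflows = 0
    for _ in range(n_samples):
        # Draw alignment columns with replacement.
        columns = [rng.randrange(length) for _ in range(length)]
        sample_a = "".join(seq_a[i] for i in columns)
        sample_b = "".join(seq_b[i] for i in columns)
        if observed_distance(sample_a, sample_b) > CUTOFF:
            overflows += 1
    return overflows
```

For a pair whose observed distance is below but close to 0.93, some samples
will usually exceed the cut off by chance, which is the situation the
warning reports.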
704
705
706	A further problem arises in almost exactly the opposite situation: when
707	you bootstrap a data set which contains 3 or more sequences that are identical
708	or almost identical. Here, the sets of identical sequences should be shown
709	as a multifurcation (several sequences joining at the same part of the tree).
710	Because the Neighbor-Joining method only gives strictly dichotomous trees
711	(never more than 2 sequences join at one time), this cannot be exactly
712	represented. In practice, this is NOT a problem as there will be some
713	internal branches of zero length separating the sequences. If you
714	display the tree with all branch lengths, you will still see a multifurcation.
715	However, when you bootstrap
716	the tree, only the branching orders are stored and counted. In the case
717	of multifurcations, the exact branching order is arbitrary but the program
718	will always get the same branching order, depending only on the input order
719	of the sequences. In practice, this is only a problem in situations where
720	you have a set of sequences where all of them are VERY similar. In this case,
721	you can find very high support for some groupings which will disappear if you
722	run the analysis with a different input order. Again, the PHYLIP package
723	deals with this by offering a JUMBLE option to shuffle the input order
724	of your sequences between each bootstrap sample.
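
The idea of shuffling the input order between bootstrap samples can be
sketched in Python as follows. This only illustrates the general approach;
it is not PHYLIP's implementation, and the function name and seed argument
are hypothetical.

```python
import random


def jumbled_orders(names, n_samples, seed=0):
    """Yield one randomly shuffled copy of names per bootstrap sample."""
    rng = random.Random(seed)
    for _ in range(n_samples):
        order = list(names)  # copy, so the caller's list is left untouched
        rng.shuffle(order)
        yield order
```

Each bootstrap sample would then be analysed with its sequences in the
order given, so that an arbitrary branching order among near-identical
sequences is not reproduced in every sample.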
725
726	----------------------------------------------------------------------------
727
728	6) SUMMARY OF THE COMMAND LINE USAGE
729
730	Clustal W is designed to be run interactively. However, there are many
731	situations where it is convenient to run it from the command line, especially
732	if you wish to run it from another piece of software (e.g. SeqApp or GDE).
733	All parameters can be set from the command line by giving options after the
734	clustalw command. On UNIX, options should be preceded by '-'; all other systems
735	use the '/' character.
736
737	If anything is put on the command line, the program will (attempt to) carry
738	out whatever is requested and will exit. If you wish to use the command
739	line to set some parameters and then go into interactive mode, use the
740	command line switch: interactive .... e.g.
741
742	clustalw -quicktree -interactive      on UNIX
743	    or
744	clustalw /quicktree /interactive      on VMS, MAC and PC
745
746	will set the default initial alignment mode to fast/approximate and will then
747	go to the main menu.
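
When Clustal W is called from another piece of software, the command line
has to be assembled with the right option character for the system. The
following Python sketch shows one way to do this. The function name and its
arguments are hypothetical; only the switches named in this file
(quicktree, interactive, options, help, check) are used as examples.

```python
def build_clustalw_command(switches, unix=True, program="clustalw"):
    """Return the command as a list of words, e.g. for subprocess.run().

    switches are given without their prefix; '-' is added on UNIX and
    '/' on all other systems.
    """
    prefix = "-" if unix else "/"
    return [program] + [prefix + name for name in switches]


# Join the words with spaces to see the command as it would be typed.
print(" ".join(build_clustalw_command(["quicktree", "interactive"])))
print(" ".join(build_clustalw_command(["quicktree", "interactive"], unix=False)))
```

The two print calls reproduce the UNIX and non-UNIX forms of the example
given above.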
748
749
750	To see a list of all the command line parameters, type:
751
752	clustalw -options      on UNIX
753	    or
754	clustalw /options      on VMS, MAC and PC
755
756	and you will see a list with no explanation.
757
758
759	To get (VERY BRIEF) help on command line usage, use the /HELP or /CHECK
760	(-help or -check on UNIX systems) options. Otherwise, the command line
761	usage is self-explanatory or is explained in clustalv.txt. The defaults
762	for all parameters are set in the file param.h which can be changed easily
763	(remember to recompile the program afterwards :-).
764
765	------------------------------------------------------------------------------
