clustalv.txt

Last change on this file was 19575, checked in by westram, 44 hours ago
reintegrates 'help' into 'trunk' preformatted text gets checked for width now (to enforce it fits into the arb help window). fixed help following these checks, using the following steps: ignore problems in foreign documentation. increase default help window width. introduce control comments to accept oversized preformatted sections. enforce preformatted style for whole sections. simply define single-line preformatted sections Used intensive for definition of internal script languages. fixed several non-related problems found in documentation. minor layout changes for HTML version of arb help (more compacted; highlight anchored/all sections). refactor system interface (GUI version) and use it from help module. adds: log:branches/help@19532:19574
Property svn:eol-style set to `native` Property svn:keywords set to `Author Date Id Revision`
File size: 73.8 KB

4	Clustal V Multiple Sequence Alignments.
5
6	Documentation (Installation and Usage).
7
8	Des Higgins
9	European Molecular Biology Laboratory
10	Postfach 10.2209
11	D-6900 Heidelberg
12	Germany.
13
14	higgins@EMBL-Heidelberg.DE
15
16
17	*******************************************************************
18
19
20	Contents.
21
22
23	1 Overview
24
25	2 Installation
26
27	3 Interactive usage
28
29	4 Command-line interface
30
31	5 Algorithms and references
32
33
34	*******************************************************************
35
36	1. Overview
37
38	This document describes how to install and use ClustalV on various
39	machines. ClustalV is a complete upgrade and rewrite of the Clustal
40	package of multiple alignment programs (Higgins and Sharp, 1988 and
41	1989). The original programs were written in Fortran for
42	microcomputers running MSDOS. You carried out a complete alignment
43	by running 3 programs in succession. Later, these were merged into
44	a single menu driven program with on-line help, for VAX/VMS.
45	ClustalV was written in C and has all of the features of the old
46	programs plus many new ones. It has been compiled and tested using
47	VAX/VMS C, Decstation ULTRIX C, Gnu C for Sun workstations, Turbo C
48	for IBM PC's and Think C for Apple Mac's. The original Clustal was
49	written by Des Higgins while he was a Post-Doc in the lab of Paul
50	Sharp in the Genetics Department, Trinity College, Dublin 2,
51	Ireland.
52
53	The main feature of the old package was the ability to carry out
54	reliable multiple alignments of many sequences. The sensitivity of
55	the program is as good as from any other program we have tried, with
56	the exception of the programs of Vingron and Argos (1991), while it
57	works in reasonable time on a microcomputer. The programs of
58	Vingron and Argos are specialised for finding distant similarities
59	between proteins but require mainframes or workstations and are more
60	difficult to use.
61
62	The main new features are: profile alignments (alignments of old
63	alignments); phylogenetic trees (Neighbor Joining trees calculated
64	after multiple alignment with a bootstrapping option); better
65	sequence input (automatically recognise and read NBRF/PIR, Pearson
66	(Fasta) or EMBL/SwissProt formats); flexible alignment output
67	(choose one of: old Clustal format, NBRF/PIR, GCG msf format or
68	Phylip format); full command line interface (everything that you can
69	do interactively can be specified on the command line).
70
71	In version 7 of the GCG package, there is a program called PILEUP
72	which uses a very similar algorithm to the one in ClustalV. There
73	are 2 main differences between the programs: 1) the metric used to
74	compare the sequences for the initial "guide tree" uses a full
75	global, optimal alignment in PILEUP instead of the fast, approximate
76	ones in ClustalV. This makes PILEUP much slower for the comparison
77	of long sequences. In principle, the distances calculated from
78	PILEUP will be more sensitive than ours, but in practice it will not
79	make much difference, except in difficult cases. 2) During the
80	multiple alignment, terminal gaps are penalised in ClustalV but not
81	in PILEUP. This will make the PILEUP alignments better when the
82	sequences are of very different lengths (has no effect if there are
83	no large terminal gaps).
84
85
86	This software may be distributed and used freely, provided that you
87	do not modify it or this documentation in any way without the
88	permission of the authors.
89
90	If you wish to refer to ClustalV, please cite:
91	Higgins,D.G. Bleasby,A.J. and Fuchs,R. (1991) CLUSTAL V: improved software
92	for multiple sequence alignment. CABIOS, vol. 8, 189-191.
93
94	The overall multiple alignment algorithm was described in:
95	Higgins,D.G. and Sharp,P.M. (1989). Fast and sensitive multiple
96	sequence alignments on a microcomputer. CABIOS, vol. 5, 151-153.
97
98
99	ACKNOWLEDGEMENTS.
100
101	D.H. would particularly like to thank Paul Sharp, in whose lab. this
102	work originated. We also thank Manolo Gouy, Gene Myers, Peter Rice
103	and Martin Vingron for suggestions, bug-fixes and help.
104
105	Des Higgins and Rainer Fuchs,
106	EMBL Data Library, Heidelberg, Germany.
107
108	Alan Bleasby,
109	Daresbury, UK.
110
111	JUNE 1991
112	*******************************************************************
113
114	2. Installation.
115
116
117
118	As far as possible, we have tried to make ClustalV portable to any
119	machine with a standard C compiler (proposed ANSI C standard). The
120	source code, as supplied by us, has been compiled and tested using
121	the following compilers:
122
123	VAX/VMS C
124	Ultrix C (on a Decstation 2100)
125	Gnu C on a Sun 4 workstation
126	Think C on an Apple Macintosh SE
127	Turbo C on an IBM AT.
128
129	In each case, one must make 1 change to 1 line of code in 1 header
130	file. This is described below. The exact capacity of the program
131	(how many sequences of what length can be aligned) will depend of
132	course on available memory but can also be set in this header file.
133
134	The package comes as 9 C source files; 3 header files; 1 file of on-
135	line help; this documentation file; 3 make files:
136
137	Source code: clustalv.c, amenu.c, gcgcheck.c, myers.c, sequence.c,
138	showpair.c, trees.c, upgma.c, util.c
139
140	Header files: clustalv.h, general.h, matrices.h
141
142	On-Line help: clustalv.hlp (must be renamed or defined as
143	clustalv_help except on PC's)
144
145	Documentation: clustalv.txt (this file).
146
147	Makefiles: makefile.sun (gnu c on Sun), vmslink.com (vax/vms),
148	makefile.ult (ultrix).
149
150
151
152
153
154
155
156	Before compiling ClustalV you must look at and possibly change
157	clustalv.h, shown below.
158
159	/*****************CLUSTALV.H******************************/
160
161	/*
162	Main header file for ClustalV. Uncomment ONE of the following lines
163	depending on which compiler you wish to use.
164	*/
165
166	#define VMS 1 /* VAX VMS */
167
168	/*#define MAC 1      Think_C for MacIntosh */
169
170	/*#define MSDOS 1    Turbo C for PC's */
171
172	/*#define UNIX 1     Ultrix for Decstations or Gnu C for Sun */
173
174	/*************************************************************/
175
176	#include "general.h"
177
178	#define MAXNAMES 10
179	#define MAXTITLES 60
180	#define FILENAMELEN 256
181
182	#define UNKNOWN 0
183	#define EMBLSWISS 1
184	#define PIR 2
185	#define PEARSON 3
186
187	#define PAGE_LEN 22
188
189	#if VMS
190	#define DIRDELIM ']'
191	#define MAXLEN 3000
192	#define MAXN 150
193	#define FSIZE 15000
194	#define LINELENGTH 60
195	#define GCG_LINELENGTH 50
196
197	#elif MAC
198	#define DIRDELIM ':'
199	#define MAXLEN 2600
200	#define MAXN 30
201	#define FSIZE 10000
202	#define LINELENGTH 50
203	#define GCG_LINELENGTH 50
204
205	#elif MSDOS
206	#define DIRDELIM '\\'
207	#define MAXLEN 1300
208	#define MAXN 30
209	#define FSIZE 5000
210	#define LINELENGTH 50
211	#define GCG_LINELENGTH 50
212
213	#elif UNIX
214	#define DIRDELIM '/'
215	#define MAXLEN 3000
216	#define MAXN 50
217	#define FSIZE 15000
218	#define LINELENGTH 60
219	#define GCG_LINELENGTH 50
220	#endif
221	/**************** end of CLUSTALV.H **************************/
222
223
224
225	First, you must remove the comments from one of the first 10 lines.
226	There are 4 'define' compiler directives here (e.g. #define VMS 1),
227	and you should use one of these, depending on which system you wish
228	to work on. So choose one of these, remove its comments (if it is
229	already commented out) and put comments around any of the others
230	that are still active. If you wish to use a different system, you
231	will need to insert a new line with a new keyword (which you must
232	invent) to identify your system. Most of the rest of this header
233	file is taken up with a block of 'define' statements for each system
234	type; e.g. the VAX/VMS block is:
235
236	#if VMS
237	#define DIRDELIM ']'
238	#define MAXLEN 3000
239	#define MAXN 150
240	#define FSIZE 15000
241	#define LINELENGTH 60
242	#define GCG_LINELENGTH 50
243
244	In this block, you can specify the maximum number of sequences to be
245	allowed (MAXN); the maximum sequence length, including gaps
246	(MAXLEN); FSIZE declares the size of some workspace, used by the
247	fast 2 sequence comparison routines and should be APPROXIMATELY 4
248	times MAXLEN; LINELENGTH is the length of the blocks of alignment
249	output in the output files; GCG_LINELENGTH is the same but for the
250	GCG compatible output only. Finally, DIRDELIM is the character used
251	to specify directories and subdirectories in file names. It should
252	be the character used to separate the file name itself from the
253	directory name (e.g. in VMS, file names are like:
254	$drive:[dir1.dir2.dir3]filename.ext;2 so ']' is used as DIRDELIM).
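
As a rough illustration of what DIRDELIM is for, the sketch below
splits a full file specification at the last occurrence of the
delimiter. The function name file_part is hypothetical; it is not taken
from the ClustalV sources, which may handle file names differently.

```c
#include <string.h>

/* Hypothetical helper (not from the ClustalV sources): return a pointer
   to the file-name part of `path`, i.e. the text after the last
   occurrence of the directory delimiter `dirdelim`.  If the delimiter
   does not occur, the whole string is taken to be the file name. */
const char *file_part(const char *path, char dirdelim)
{
    const char *last = strrchr(path, dirdelim);
    return last ? last + 1 : path;
}
```

With ']' as the delimiter, the VMS name shown above yields
"filename.ext;2"; with '/' a Unix path yields its last component.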
255
256	So, if you want to use a system not covered in clustalv.h, you will
257	have to insert a new block, like the above one. To compile and link
258	the program, we supply 3 makefiles: one each for VAX/VMS, Ultrix
259	and GNU C for Sun workstations.
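
A new block for a system that is not covered might look like the sketch
below. The keyword MYSYS and all of the values are invented for
illustration only; pick values that suit the memory of your machine,
keeping FSIZE at roughly 4 times MAXLEN. In the real header the block
would go in as an extra '#elif' before the final '#endif'; it is shown
here as a stand-alone '#if' so that it can be compiled on its own.

```c
/* Hypothetical keyword: any name not already used in clustalv.h will do.
   This line takes the place of "#define VMS 1" at the top of the header. */
#define MYSYS 1

#if MYSYS
#define DIRDELIM '/'        /* separates directory from file name      */
#define MAXLEN 2000         /* longest sequence, including gaps        */
#define MAXN 40             /* maximum number of sequences             */
#define FSIZE 8000          /* workspace: approximately 4 times MAXLEN */
#define LINELENGTH 60       /* width of alignment output blocks        */
#define GCG_LINELENGTH 50   /* same, for GCG compatible output         */
#endif
```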
260
261
262
263	VAX/VMS
264
265	Compile and link the program with the
266	supplied makefile for vms: vmslink.com .
267
268	$ @vmslink
269
270	This will produce clustalv.exe (and a lot of .obj files which you can delete).
271
272	The on-line help file (clustalv.hlp) should be 'defined' as
273	clustalv_help as follows:
274
275	$ def clustalv_help $drive:[dir1.dir2]clustalv.hlp
276
277	where $drive is the drive designation and [dir1.dir2] is the
278	directory where clustalv.hlp is kept.
279
280	To make use of the command-line interface, you must make clustalv a
281	'foreign' command with:
282
283	$ clustalv :== $$drive:[dir1.dir2]clustalv
284
285	where $drive is the drive designation and [dir1.dir2] is the
286	directory where clustalv.exe is kept.
287
288
289
290	IBM PC/MSDOS/TURBO C
291
292	Create a makefile (something.prj) with the names of the source files
293	(clustalv.c, amenu.c etc.) and 'make' this using the HUGE memory
294	model. You will get half a dozen warnings from the compiler about
295	pieces of code that look suspicious to it but ignore these. The
296	help file should remain as clustalv.hlp . To run the program using
297	the default settings in Clustalv.h, you need approximately 500k of
298	memory. To reduce this, the main influence on memory usage is the
299	parameter MAXLEN; reduce MAXLEN to reduce memory usage.
300
301
302
303	Apple Mac/THINK_C version 4.0.2
304
305	This version of the program is not at all Mac like. It runs in a
306	window, the inside of which looks just like a normal character based
307	terminal. In the future we might put a proper Mac interface on it
308	but do not have the time right now. With the default settings in
309	the header file ClustalV.h, you need just over 800k of memory to run
310	the program. To reduce this, reduce MAXLEN; this is easily the
311	biggest influence on memory usage. To compile the program and save
312	it as an application you need to 'set the application type'; here
313	you specify how much memory (in kilobytes (k)) the application will
314	need. You should set this to 900k to run the application as it is
315	OR reduce MAXLEN in the header. To compile the program you have to
316	create a 'project'; you 'add' the names of the 9 source files to the
317	project AND the name of the ANSI library. The source code is too
318	large to compile in one compilation unit. You will get a 'link
319	error: code segment too big' if you try to compile and link as is.
320	You should compile amenu.c (the biggest source file) as a separate
321	unit ..... you will have to read the manual/ask someone/mail me to
322	find out what this is.
323
324
325	*******************************************************************
326
327	3. Interactive usage.
328
329
330
331	Interactive usage of Clustal V is completely menu driven. On-line
332	help is provided, defaults are offered for all parameters and file
333	names. With a little effort it should be completely self
334	explanatory. The main menu, which appears when you run the
335	program, is shown below. Each item brings you to a sub menu.
336
337
338
339	Main menu for Clustal V:
340
341
342	1. Sequence Input From Disc
343	2. Multiple Alignments
344	3. Profile Alignments
345	4. Phylogenetic trees
346
347	S. Execute a system command
348	H. HELP
349	X. EXIT (leave program)
350
351
352	Your choice:
353
354
355
356	The options S and H appear on all the main menus. H will provide
357	help and if you type S you will be asked to enter a command, such as
358	DIR or LS, which will be sent to the system (does not work on
359	Mac's). Before carrying out an alignment, you must use option 1
360	(sequence input); the format for sequences is explained below.
361	Under menu item 2 you will be able to automatically align your
362	sequences to each other. Menu item 3 allows you to do profile
363	alignments. These are alignments of old alignments. This allows
364	you to build up a multiple alignment in stages or add a new sequence
365	to an old alignment. You can calculate phylogenetic trees from
366	alignments using menu item 4.
367
368
369
370
371	******************************
372	* SEQUENCE INPUT. *
373	******************************
374
375
376	All sequences should be in 1 file. Three formats are automatically
377	recognised and used: NBRF/PIR, EMBL/SwissProt and FASTA (Pearson and
378	Lipman (1988) format).
379
380	***
381	Users of the Wisconsin GCG package should use the command TONBRF
382	(recently changed to TOPIR) to reformat their sequences before use.
383	***
384
385	Sequences can be in upper or lower case. For proteins, the only
386	symbols recognised are: A,C,D,E,F,G,H,I,K,L,M,N,P,Q,R,S,T,V,W,Y and
387	for DNA/RNA use: A,C,G and T (or U). Any other letters of the
388	alphabet will be treated as X (proteins) or N (DNA/RNA) for unknown.
389	All other symbols (blanks, digits etc.) will be ignored EXCEPT for
390	the hyphen "-" which can be used to specify a gap. This last point
391	is especially useful for 2 reasons: 1) you can fix the positions of
392	some gaps in advance; 2) the alignment output from this program can
393	be written out in NBRF format using "-"'s to specify gaps; these
394	alignments can be used again as input, either for profile alignments
395	or for phylogenetic trees.
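
The symbol rules above can be summarised by the following sketch. The
function clean_symbol is hypothetical and is not the input routine used
by ClustalV. It assumes that U is kept as U rather than converted to T,
which the text above does not specify.

```c
#include <ctype.h>
#include <string.h>

/* Hypothetical sketch of the symbol rules described above.
   Returns the symbol to store, or 0 if the character is ignored. */
char clean_symbol(char c, int is_dna)
{
    const char *valid = is_dna ? "ACGTU" : "ACDEFGHIKLMNPQRSTVWY";

    if (c == '-')
        return '-';                      /* hyphen specifies a gap      */
    if (!isalpha((unsigned char)c))
        return 0;                        /* blanks, digits etc. ignored */
    c = (char)toupper((unsigned char)c); /* upper or lower case allowed */
    if (strchr(valid, c))
        return c;                        /* recognised symbol           */
    return is_dna ? 'N' : 'X';           /* any other letter: unknown   */
}
```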
396
397	If you are using an editor to create sequence files, use the FASTA
398	format as it is by far the simplest (see below). If you have access
399	to utility programs for generating/converting the NBRF/PIR format
400	then use it in preference.
401
402
403
404	FASTA (PEARSON AND LIPMAN, 1988) FORMAT: The sequences are
405	delimited by an angle bracket ">" in column 1. The text immediately
406	after the ">" is used as a title. Everything on the following line
407	until the next ">" or the end of the file is one sequence.
408
409	e.g.
410
411	> RABSTOUT rabbit Guinness receptor
412	LKMHLMGHLKMGLKMGLKGMHLMHLKHMHLMTYTYTTYRRWPLWMWLPDFGHAS
413	ADSCVCAHGFAVCACFAHFDVCFGAVCFHAVCFAHVCFAAAVCFAVCAC
414	> MUSNOSE mouse nose drying factor
415	mhkmmhkgmkhmhgmhmhglhmkmhlkmgkhmgkmkytytytryrwtqtqwtwyt
416	fdgfdsgafdagfdgfsagdfavdfdvgavfsvfgvdfsvdgvagvfdv
417	> HSHEAVEN human Guinness receptor repeat
418	mhkmmhkgmkhmhgmhmhg lhmkmhlkmgkhmgkmk ytytytryrwtqtqwtwyt
419	fdgfdsgafdagfdgfsag dfavdfdvgavfsvfgv dfsvdgvagvfdv
420	mhkmmhkgmkhmhgmhmhg lhmkmhlkmgkhmgkmk ytytytryrwtqtqwtwyt
421	fdgfdsgafdagfdgfsag dfavdfdvgavfsvfgv dfsvdgvagvfdv
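
To make the FASTA layout concrete, here is a minimal sketch of a reader
that pulls one entry out of text held in memory. The function fasta_get
and its arguments are hypothetical; the reader inside ClustalV works on
files and will differ in detail. Blanks inside the sequence are skipped,
as in the HSHEAVEN example above.

```c
#include <ctype.h>
#include <stddef.h>

/* Hypothetical sketch: copy the title and residues of entry number
   `index` (counting from 0) out of FASTA text held in memory.
   Returns the number of residues stored, or -1 if there is no such
   entry.  Output that does not fit in the buffers is truncated. */
int fasta_get(const char *text, int index,
              char *title, size_t tmax, char *seq, size_t smax)
{
    int current = -1;                  /* entry currently being scanned */
    int at_line_start = 1, in_title = 0;
    size_t t = 0, s = 0;
    const char *p;

    if (tmax == 0 || smax == 0)
        return -1;
    for (p = text; *p; p++) {
        if (at_line_start && *p == '>') {   /* ">" in column 1          */
            if (current == index)
                break;                      /* next entry begins: done  */
            current++;
            in_title = 1;
            at_line_start = 0;
            continue;
        }
        if (*p == '\n') {
            in_title = 0;                   /* title is one line only   */
            at_line_start = 1;
            continue;
        }
        at_line_start = 0;
        if (current != index)
            continue;
        if (in_title) {
            /* drop blanks between ">" and the first word of the title */
            if ((t > 0 || *p != ' ') && t + 1 < tmax)
                title[t++] = *p;
        } else if (isalpha((unsigned char)*p) || *p == '-') {
            if (s + 1 < smax)
                seq[s++] = *p;              /* letters and gaps only    */
        }
    }
    title[t] = '\0';
    seq[s] = '\0';
    return current >= index ? (int)s : -1;
}
```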
422
423
424
425	NBRF/PIR FORMAT is similar to FASTA format but immediately
426	after the ">", you find the characters "P1;" if the sequences are
427	protein or "DL;" if they are nucleic acid. Clustalv looks for the
428	";" character as the third character after the ">". If it finds one
429	it assumes that the format is NBRF; if not, FASTA format is assumed.
430	The text after the ";" is treated as a sequence name while the
431	entire next line is treated as a title. The sequence is terminated
432	by a star "*" and the next sequence can then begin (with a >P1;
433	etc.). This is just the basic format description (there are other
434	variations and rules).
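
The format test described above (a ";" as the third character after the
">") can be sketched as follows. The function guess_format is
hypothetical; the constant values mirror the definitions in clustalv.h
shown in the installation section, and the check for an "ID" line is the
EMBL/SwissProt rule described later in this section.

```c
#include <string.h>

/* Same values as the corresponding #defines in clustalv.h. */
enum { FMT_UNKNOWN = 0, FMT_EMBLSWISS = 1, FMT_PIR = 2, FMT_PEARSON = 3 };

/* Hypothetical sketch: guess the file format from its first line. */
int guess_format(const char *first_line)
{
    if (first_line[0] == '>') {
        /* ">P1;NAME": the ';' is the third character after the '>' */
        if (strlen(first_line) >= 4 && first_line[3] == ';')
            return FMT_PIR;
        return FMT_PEARSON;
    }
    if (strncmp(first_line, "ID", 2) == 0)
        return FMT_EMBLSWISS;
    return FMT_UNKNOWN;
}
```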
435
436	ANY files/sequences in GCG format can be converted to this format
437	using the TONBRF command (now TOPIR) of the Wisconsin GCG package.
438
439
440	e.g.
441
442	>P1;RABSTOUT
443	rabbit Guinness receptor
444	LKMHLMGHLKMGLKMGLKGMHLMHLKHMHLMTYTYTTYRRWPLWMWLPDFGHAS
445	ADSCVCAHGFAVCACFAHFDVCFGAVCFHAVCFAHVCFAAAVCFAVCAC*
446	>P1;MUSNOSE
447	mouse nose drying factor
448	mhkmmhkgmkhmhgmhmhglhmkmhlkmgkhmgkmkytytytryrwtqtqwtwyt
449	fdgfdsgafdagfdgfsagdfavdfdvgavfsvfgvdfsvdgvagvfd
450	*
451	>P1;HSHEAVEN
452	human Guinness receptor repeat protein.
453	mhkmmhkgmkhmhgmhmhg lhmkmhlkmgkhmgkmk ytytytryrwtqtqwtwyt
454	fdgfdsgafdagfdgfsag dfavdfdvgavfsvfgv dfsvdgvagvfdv
455	mhkmmhkgmkhmhgmhmhg lhmkmhlkmgkhmgkmk ytytytryrwtqtqwtwyt
456	fdgfdsgafdagfdgfsag dfavdfdvgavfsvfgv dfsvdgvagvfdv*
457
458
459
460
461	EMBL/SWISSPROT FORMAT: Do not try to create files with this
462	format unless you have utilities to help. If you are just using an
463	editor, use one of the above formats. If you do use this format,
464	the program will ignore everything between the ID line (line
465	beginning with the characters "ID") and the SQ line. The sequence
466	is then read from between the SQ line and the "//" characters.
467
468
469
470	It is critically important for the program to know whether or not it
471	is aligning DNA or protein sequences. The input routines attempt to
472	guess which type of sequence is being used by counting the number of
473	A,C,G,T or U's in the sequences. If the total is more than 85% of
474	the sequence length then DNA is assumed. If you use very bizarre
475	sequences (proteins with really strange aa compositions or DNA
476	sequences with loads of strange ambiguity codes) you might confuse
477	the program. It is difficult to do but be careful.
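
The 85% rule can be written down in a few lines. The sketch below is
hypothetical (the name looks_like_dna is invented) and assumes that
only letters count towards the sequence length, which the text above
does not state explicitly.

```c
#include <ctype.h>
#include <string.h>

/* Hypothetical sketch of the DNA/protein guess: returns 1 if more
   than 85% of the letters in `seq` are A, C, G, T or U. */
int looks_like_dna(const char *seq)
{
    size_t residues = 0, nucleotides = 0;

    for (; *seq; seq++) {
        int c = toupper((unsigned char)*seq);
        if (!isalpha(c))
            continue;                 /* gaps, blanks, digits: skipped */
        residues++;
        if (strchr("ACGTU", c))
            nucleotides++;
    }
    /* integer form of: nucleotides / residues > 0.85 */
    return residues > 0 && nucleotides * 100 > residues * 85;
}
```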
478
479
480
481
482
483	******************************
484	* MULTIPLE ALIGNMENT MENU. *
485	******************************
486
487	The multiple alignment menu is shown below. Before explaining how
488	to use it, you must be introduced briefly to the alignment strategy.
489	If you do not follow this, try using option 1 anyway; the entire
490	process will be carried out automatically.
491
492	To do a complete multiple alignment, we need to know the approximate
493	relationships of the sequences to each other (which ones are most
494	similar to each other). We do this by calculating a crude
495	phylogenetic tree which we call a dendrogram (to distinguish it from
496	the more sensitive trees available under the phylogenetic tree
497	menu). This dendrogram is used as a guide to align bigger and
498	bigger groups of sequences during the multiple alignment. The
499	dendrogram is calculated in 2 stages: 1) all pairs of sequences are
500	compared using the fast/approximate method of Wilbur and Lipman
501	(1983); the result of each comparison is a similarity score. 2) the
502	similarity scores are used to construct the dendrogram using the
503	UPGMA cluster analysis method of Sneath and Sokal (1973).
504
505	The construction of the dendrogram can be very time consuming if you
506	wish to align many sequences (e.g. for 100 sequences you need to
507	carry out 100x99/2 sequence comparisons = 4950). During every
508	multiple alignment, a dendrogram is constructed and saved to a file
509	(something.dnd). These can be reused later.
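
The number of comparisons grows with the square of the number of
sequences. As a worked version of the arithmetic above (the function
name is invented for illustration):

```c
/* Number of pairwise comparisons needed for n sequences: n(n-1)/2. */
long pairwise_comparisons(long n)
{
    return n * (n - 1) / 2;
}
```

For 100 sequences this gives 4950, as stated above; for the limit of
150 sequences in the VAX/VMS block of clustalv.h it gives 11175.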
510
511
512
513
514
515
516
517
518	***** Multiple Alignment Menu *****
519
520
521	1. Do complete multiple alignment now
522	2. Produce dendrogram file only
523	3. Use old dendrogram file
524	4. Pairwise alignment parameters
525	5. Multiple alignment parameters
526	6. Output format options
527
528	S. Execute a system command
529	H. HELP
530	or press [RETURN] to go back to main menu
531
532
533	Your choice:
534
535
536	So, if in doubt, and you have already loaded some sequences from the
537	main menu, just try option 1 and press the <Return> key in response
538	to any questions. You will be prompted for 2 file names e.g. if the
539	sequence input file was called DRINK.PEP, you will be offered
540	DRINK.ALN as the file to contain the alignment and DRINK.DND for the
541	dendrogram.
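
The default names are formed by swapping the extension of the input
file name. One way to do this is sketched below; the function
default_name is hypothetical and the code inside ClustalV may differ.
The directory delimiter is passed in so that a dot inside a directory
name (as in VMS) is not mistaken for the start of the extension.

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical sketch: build "<input without extension>.<ext>" in out. */
void default_name(const char *input, char dirdelim, const char *ext,
                  char *out, size_t outlen)
{
    const char *name = strrchr(input, dirdelim);
    const char *dot;
    size_t stem;

    name = name ? name + 1 : input;      /* file name without directory */
    dot = strrchr(name, '.');            /* last dot in the file name   */
    stem = dot ? (size_t)(dot - input) : strlen(input);
    snprintf(out, outlen, "%.*s.%s", (int)stem, input, ext);
}
```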
542
543	If you wish to repeat a multiple alignment (e.g. to experiment with
544	different gap penalties) but do not wish to make a dendrogram all
545	over again use menu item 3 (providing you are using the same
546	sequences). Similarly, menu item 2 allows you to produce the
547	dendrogram file only.
548
549
550
551
552	PAIRWISE ALIGNMENT PARAMETERS:
553
554	The parameters that control the initial fast/approximate comparisons
555	can be set from menu item 4 which looks like:
556
557
558	******* WILBUR/LIPMAN PAIRWISE ALIGNMENT PARAMETERS *******
559
560
561	1. Toggle Scoring Method :Percentage
562	2. Gap Penalty :3
563	3. K-tuple :1
564	4. No. of top diagonals :5
565	5. Window size :5
566
567	H. HELP
568
569
570	Enter number (or [RETURN] to exit):
571
572
573
574	The similarity scores are calculated from fast alignments generated
575	by the method of Wilbur and Lipman (1983). These are 'hash' or
576	'word' or 'k-tuple' alignments carried out in 3 stages.
577
578	First you mark the positions of every fragment of sequence, K-tuple
579	long (for proteins, the default length is 1 residue, for DNA it is 2
580	bases) in both sequences. Then you locate all k-tuple matches
581	between the 2 sequences. At this stage you have to imagine a dot-
582	matrix plot between the 2 sequences with each k-tuple match as a
583	dot. You find those diagonals in the plot with most matches (you
584	take the "No. of top diagonals" best ones) and mark all diagonals
585	within "Window size" of each top diagonal. This process will define
586	diagonal bands in the plot where you hope the most likely regions of
587	similarity will lie.
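
The idea of counting k-tuple matches along the diagonals of the
imaginary dot-matrix plot is sketched below. This is a brute force
illustration only: it compares every k-tuple in one sequence with every
k-tuple in the other, whereas the method of Wilbur and Lipman uses a
lookup ('hash') table to find the matches quickly. The function names
are invented and do not come from the ClustalV sources.

```c
#include <string.h>

/* Hypothetical sketch: count k-tuple matches on every diagonal of the
   dot-matrix plot of a versus b.  A match of a[i..] with b[j..] lies
   on diagonal (i - j), stored at index (i - j + strlen(b) - 1).
   diag_count must have room for strlen(a) + strlen(b) - 1 entries. */
void count_diagonals(const char *a, const char *b, int k, int *diag_count)
{
    int la = (int)strlen(a), lb = (int)strlen(b);
    int i, j;

    for (i = 0; i < la + lb - 1; i++)
        diag_count[i] = 0;
    for (i = 0; i + k <= la; i++)
        for (j = 0; j + k <= lb; j++)
            if (strncmp(a + i, b + j, (size_t)k) == 0)
                diag_count[i - j + lb - 1]++;
}

/* Index of the diagonal with the most matches (first one if tied). */
int best_diagonal(const int *diag_count, int n)
{
    int i, best = 0;

    for (i = 1; i < n; i++)
        if (diag_count[i] > diag_count[best])
            best = i;
    return best;
}
```

Taking the "No. of top diagonals" best entries of diag_count, and
marking every diagonal within "Window size" of each, would give the
diagonal bands described above.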
588
589	The final alignment stage is to find that head to tail arrangement
590	of k-tuple matches from these diagonal regions that will give the
591	highest score. The score is calculated as the number of exactly
592	matching residues in this alignment minus a "gap penalty" for every
593	gap that was introduced. When you toggle "Scoring method" you
594	choose between expressing these similarity scores as raw scores or
595	as a percentage of the shorter sequence length.
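
The two ways of expressing the score can be sketched as follows. The
function similarity_score and its argument names are invented for
illustration; only the arithmetic is taken from the description above.

```c
/* Hypothetical sketch of the two scoring methods: the raw score is
   the number of exact matches minus a penalty for every gap; the
   percentage score divides this by the length of the shorter
   sequence. */
double similarity_score(int matches, int gaps, int gap_penalty,
                        int len_a, int len_b, int as_percentage)
{
    int raw = matches - gaps * gap_penalty;
    int shorter = len_a < len_b ? len_a : len_b;

    if (!as_percentage)
        return (double)raw;
    return shorter > 0 ? 100.0 * raw / shorter : 0.0;
}
```

For example, 40 matches and 2 gaps with a gap penalty of 3 give a raw
score of 34; if the shorter sequence has 50 residues, the percentage
score is 68.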
596
597	K-TUPLE SIZE: Can be 1 or 2 for proteins; 1 to 4 for DNA.
598	Increase this to increase speed; decrease to improve sensitivity.
599
600	GAP PENALTY: The number of matching residues that must be found
601	in order to introduce a gap. This should be larger than K-Tuple
602	Size. This has little effect on speed or sensitivity.
603
604	NO. OF TOP DIAGONALS: The number of best diagonals in the
605	imaginary dot-matrix plot that are considered. Decrease (must be
606	greater than zero) to increase speed; increase to improve
607	sensitivity.
608
609	WINDOW SIZE: The number of diagonals around each "top" diagonal
610	that are considered. Decrease for speed; increase for greater
611	sensitivity.
612
613	SCORING METHOD: The similarity scores may be expressed as raw scores
614	(number of identical residues minus a "gap penalty" for each gap) or
615	as percentage scores. If the sequences are of very different
616	lengths, percentage scores make more sense.
617
618
619
620	CHANGING THE PAIRWISE ALIGNMENT PARAMETERS
621
622	The main reason for wanting to change the above parameters is SPEED
623	(especially on microcomputers), NOT SENSITIVITY. The dendrograms
624	that are produced can only show the relationships between the
625	sequences APPROXIMATELY because the similarity scores are calculated
626	from separate pairwise alignments; not from a multiple alignment
627	(that is what we eventually hope to produce). If the groupings of
628	the sequences are "obvious", the above method should work well; if
629	the relationships are obscure or weakly represented by the data, it
630	will not make much difference playing with the parameters. The main
631	factor influencing speed is the K-TUPLE SIZE followed by the WINDOW
632	SIZE.
633
634	The alignments are carried out in a small amount of memory.
635	Occasionally (it is hard to predict), you will run out of memory
636	while doing these alignments; when this happens, it will say on the
637	screen: "Sequences (a,b) partially aligned" (instead of "Sequences
638	(a,b) aligned"). This means that the alignment score for these
639	sequences will be approximate; it is not a problem unless many of
640	the alignments do this. It can be fixed by using less sensitive
641	parameters or increasing parameter FSIZE in clustalv.h .
642
643
644	THE DENDROGRAM ITSELF
645
646	The similarity scores generated by the fast comparison of all the
647	sequences are used to construct a dendrogram by the UPGMA method of
648	Sneath and Sokal (1973). This is a form of cluster analysis and the
649	end result produces something that looks like a tree. It represents
650	the similarity of the sequences as a hierarchy. The dendrogram is
651	written to a file in a machine readable format and is shown below
652	for an example with 6 sequences.
653
654
655	91.0 0 0 2 012000 ! seq 2 joins seq 3 at 91% ID.
656	72.0 1 0 3 011200 ! seq 4 joins seqs 2,3 at 72%
657	71.1 0 0 2 000012 ! seq 5 joins seq 6 at 71%
658	35.5 0 2 4 122200 ! seq 1 joins seqs 2,3,4
659	21.7 4 3 6 111122 ! seqs 1,2,3,4 join seqs 5,6
660
661	This LOOKS complicated but you do not normally need to care what is
662	in here. Anyway, each row represents the joining together of 2 or
663	more sequences. You progress from the top down, joining more and
664	more sequences until all are joined together; for N sequences you
665	have N-1 groupings hence there are 5 rows in the above file (there
666	were 6 sequences). In each row, the first number is the similarity
667	score of this grouping; ignore the next three columns for the
668	moment; the last 6 digits in the line show which sequences are
669	grouped; there is one digit for each sequence (the first digit is
670	for the first sequence). The rule is: in each row, all of the "1"s
671	join all of the "2"s; the zero's do nothing.
672
673	Hence, in the first row, sequence 2 joins sequence 3 at a similarity
674	level of 91% identity; next, sequence 4 joins the previous grouping
675	of 2 plus 3 at a level of 72% etc. This is shown diagrammatically
676	below. Before leaving the dendrogram format, the other 3 columns of
677	numbers are: a pointer to the row from which the "1" sequences were
678	last joined (or zero if only one of them); a pointer to the row in
679	which the "2"s were last joined; the total number of sequences
680	joined in this line.
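
If you do want to read a dendrogram file with your own software, a row
can be picked apart as in the sketch below. It assumes the layout shown
in the example above (a score, three integers, then one digit per
sequence with no spaces between the digits); the structure and function
names are invented and are not those used inside ClustalV.

```c
#include <stdio.h>
#include <string.h>

/* One row of a dendrogram file, as laid out in the example above. */
struct dnd_row {
    double score;       /* similarity score of this grouping          */
    int ptr1;           /* row where the "1"s were last joined, or 0  */
    int ptr2;           /* row where the "2"s were last joined, or 0  */
    int total;          /* number of sequences joined in this row     */
    char groups[256];   /* one digit (0, 1 or 2) per sequence         */
};

/* Hypothetical sketch: returns 1 if the row was read, 0 otherwise.
   Anything after the digits (e.g. a "!" comment) is ignored. */
int parse_dnd_row(const char *line, struct dnd_row *row)
{
    return sscanf(line, "%lf %d %d %d %255[012]", &row->score,
                  &row->ptr1, &row->ptr2, &row->total, row->groups) == 5;
}

/* Number of sequences marked with `label` ('1' or '2') in a row. */
int count_group(const struct dnd_row *row, char label)
{
    int n = 0;
    const char *p;

    for (p = row->groups; *p; p++)
        if (*p == label)
            n++;
    return n;
}
```

In every row the number of "1"s plus the number of "2"s should equal
the total in the fourth column, which makes a simple consistency check.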
681
682
683
684
685	                   I------ 2
686	            I------I
687	            I      I------ 3        Diagram of the sequence similarity
688	       I----I
689	       I    I------------- 4        relationships shown in the above
690	    I--I
691	    I  I------------------ 1        dendrogram file (branch lengths are
692	----I
693	    I       I------------- 5        not to scale).
694	    I-------I
695	            I------------- 6
696
697
698
699
700
701
702
703
704
705	MULTIPLE ALIGNMENT PARAMETERS:
706
707
708	Having calculated a dendrogram between a set of sequences, the final
709	multiple alignment is carried out by a series of alignments of
710	larger and larger groups of sequences. The order is determined by
711	the dendrogram so that the most similar sequences get aligned first.
712	Any gaps that are introduced in the early alignments are fixed.
713	When two groups of sequences are aligned against each other, a full
714	protein weight matrix (such as a Dayhoff PAM 250) is used. Two gap
715	penalties are offered: a "FIXED" penalty for opening up a gap and a
716	"FLOATING" penalty for extending a gap.
717
718
719	******* MULTIPLE ALIGNMENT PARAMETERS *******
720
721
722	1. Fixed Gap Penalty :10
723	2. Floating Gap Penalty :10
724	3. Toggle Transitions (DNA):Weighted
725	4. Protein weight matrix :PAM 250
726
727	H. HELP
728
729
730	Enter number (or [RETURN] to exit):
731
732
733	FIXED GAP PENALTY: Reduce this to encourage gaps of all sizes;
734	increase it to discourage them. Terminal gaps are penalised the same
735	as all others. BEWARE of making this too small (approx 5 or so); if
736	the penalty is too small, the program may prefer to align each
737	sequence opposite one long gap.
738
739	FLOATING GAP PENALTY: Reduce this to encourage longer gaps;
740	increase it to shorten them. Terminal gaps are penalised the same as
741	all others. BEWARE of making this too small (approx 5 or so); if
742	the penalty is too small, the program may prefer to align each
743	sequence opposite one long gap.
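
To see how the two penalties interact, it can help to think of the cost
of a single gap as an opening charge plus a charge for each position it
covers. The sketch below ASSUMES this simple form; the exact formula
used by ClustalV is not given in this section, and the function
gap_cost is invented for illustration.

```c
/* Hypothetical sketch, assuming the cost of one gap is the fixed
   penalty plus the floating penalty for every position in the gap. */
int gap_cost(int fixed_penalty, int floating_penalty, int gap_length)
{
    if (gap_length <= 0)
        return 0;                       /* no gap, no cost */
    return fixed_penalty + floating_penalty * gap_length;
}
```

Under this assumption, with both penalties at the default of 10, a gap
of 5 positions costs 60. Halving the floating penalty brings this down
to 35 (longer gaps become cheaper), while halving the fixed penalty
brings it down to 55 and lowers the cost of gaps of every length by the
same amount.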
744
745
746	DNA TRANSITIONS = WEIGHTED or UNWEIGHTED: By default, transitions
747	(A versus G; C versus T) are weighted more strongly than
748	transversions (an A aligned with a G will be preferred to an A
749	aligned with a C or a T). You can make all pairs of nucleotides
750	equally weighted with this option.
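
The distinction between transitions and transversions can be expressed
as a small test. The function is_transition is invented for
illustration; it treats U as T and says nothing about the actual
weights used by ClustalV.

```c
#include <ctype.h>

/* Hypothetical sketch: returns 1 if the two bases form a transition
   (A versus G, or C versus T/U), 0 for identical bases,
   transversions and anything else. */
int is_transition(char x, char y)
{
    x = (char)toupper((unsigned char)x);
    y = (char)toupper((unsigned char)y);
    if (x == 'U') x = 'T';
    if (y == 'U') y = 'T';
    return (x == 'A' && y == 'G') || (x == 'G' && y == 'A') ||
           (x == 'C' && y == 'T') || (x == 'T' && y == 'C');
}
```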
751
752	PROTEIN WEIGHT MATRIX: For protein comparisons, a weight matrix is
753	used to differentially weight different pairs of aligned amino
754	acids. The default is the well known Dayhoff PAM 250 matrix. We
755	also offer a PAM 100 matrix, an identity matrix (all weights are the
756	same for exact matches) or allow you to give the name of a file with
757	your own matrix. The weight matrices used by Clustal V are shown in
758	full in the Algorithms and References section of this documentation.
759
760	If you input a matrix from a file, it must be in the following
761	format. Use a 20x20 matrix only (entries for the 20 "normal" amino
762	acids only; no ambiguity codes etc.). Input the lower left triangle
763	of the matrix, INCLUDING the diagonal. The order of the amino acids
764	(rows and columns) must be: CSTPAGNDEQHRKMILVFYW. The values can be
765	in free format separated by spaces (not commas). The PAM 250 matrix
766	is shown below in this format.
767
768	12
769	0 2
770	-2 1 3
771	-3 1 0 6
772	-2 1 1 1 2
773	-3 1 0 -1 1 5
774	-4 1 0 -1 0 0 2
775	-5 0 0 -1 0 1 2 4
776	-5 0 0 -1 0 0 1 3 4
777	-5 -1 -1 0 0 -1 1 2 2 4
778	-3 -1 -1 0 -1 -2 2 1 1 3 6
779	-4 0 -1 0 -2 -3 0 -1 -1 1 2 6
780	-5 0 0 -1 -1 -2 1 0 0 1 0 3 5
781	-5 -2 -1 -2 -1 -3 -2 -3 -2 -1 -2 0 0 6
782	-2 -1 0 -2 -1 -3 -2 -2 -2 -2 -2 -2 -2 2 5
783	-6 -3 -2 -3 -2 -4 -3 -4 -3 -2 -2 -3 -3 4 2 6
784	-2 -1 0 -1 0 -1 -2 -2 -2 -2 -2 -2 -2 2 4 2 4
785	-4 -3 -3 -5 -4 -5 -4 -6 -5 -5 -2 -4 -5 0 1 2 -1 9
786	0 -3 -3 -5 -3 -5 -2 -4 -4 -4 0 -4 -4 -2 -1 -1 -2 7 10
787	-8 -2 -5 -6 -6 -7 -4 -7 -7 -5 -3 2 -3 -4 -5 -2 -6 0 0 17
788
789	Values must be integers and can be all positive or positive and
790	negative as above. These are SIMILARITY values.
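
A matrix in this format is easy to read with your own software. The
sketch below fills a full symmetric 20x20 table from the lower left
triangle held as text in memory. The function names are invented and
this is not the routine used inside ClustalV; only the amino acid order
and the triangular layout are taken from the description above.

```c
#include <stdio.h>
#include <string.h>

#define NAA 20
/* Row and column order required for the matrix file. */
static const char AA_ORDER[] = "CSTPAGNDEQHRKMILVFYW";

/* Hypothetical sketch: read the 210 integers of the lower left
   triangle (including the diagonal) from free format text and fill
   both halves of m.  Returns 1 on success, 0 if a value is missing
   or is not an integer. */
int read_lower_triangle(const char *text, int m[NAA][NAA])
{
    int row, col, value, used;

    for (row = 0; row < NAA; row++)
        for (col = 0; col <= row; col++) {
            if (sscanf(text, "%d%n", &value, &used) != 1)
                return 0;
            text += used;               /* step past the number read */
            m[row][col] = m[col][row] = value;
        }
    return 1;
}

/* Position of an amino acid (upper case) in the matrix, or -1. */
int aa_index(char aa)
{
    const char *p = aa ? strchr(AA_ORDER, aa) : NULL;
    return p ? (int)(p - AA_ORDER) : -1;
}
```

After reading the PAM 250 values shown above, m[aa_index('W')][aa_index('R')]
and m[aa_index('R')][aa_index('W')] would both hold 2.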
791
792
793
794
795	ALIGNMENT OUTPUT OPTIONS:
796
797	By default, the alignment goes to a file in a self explanatory
798	"blocked" alignment format. This format is fine for displaying the
799	results but requires heavy editing if you wish to use the alignment
800	with other software. To help, we provide 3 other formats which can
801	be turned on or off. If you have a sequence data set or alignment
802	in memory, you can also ask for output files in whatever formats are
803	turned on, NOW. The menu you use to choose format is shown below.
804
805	***
806	We draw your attention to NBRF/PIR format in particular. This
807	format is EXACTLY the same as one of the input formats. Therefore,
808	alignments written in this format can be used again as input (to the
809	profile alignments or phylogenetic trees).
810	***
811
812
813	******* Format of Alignment Output *******
814
815
816	1. Toggle CLUSTAL format output = ON
817	2. Toggle NBRF/PIR format output = OFF
818	3. Toggle GCG format output = OFF
819	4. Toggle PHYLIP format output = OFF
820
821	5. Create alignment output file(s) now?
822	H. HELP
823
824
825	Enter number (or [RETURN] to exit):
826
827
828
829	CLUSTAL FORMAT: This is a self explanatory alignment. The
830	alignment is written out in blocks. Identities are highlighted and
831	(if you use a PAM 250 matrix) positions in the alignment where all
832	of the residues are "similar" to each other (PAM 250 score of 8 or
833	more) are indicated.
834
835	NBRF/PIR FORMAT: This is the usual NBRF/PIR format with gaps
836	indicated by hyphens ("-"). As we have stressed before, this format
837	is EXACTLY compatible with the sequence input format. Therefore you
838	can read in these alignments again for profile alignments or for
839	calculating phylogenetic trees.
840
841	GCG FORMAT: In version 7 of the Wisconsin GCG package, a new
842	multiple sequence format was introduced. This is the MSF (Multiple
843	Sequence Format) format. It can be used as input to the GCG
844	sequence editor or any of the GCG programs that make use of multiple
845	alignments. THIS FORMAT IS ONLY SUPPORTED IN VERSION 7 OF THE GCG
846	PACKAGE OR LATER.
847
848	PHYLIP FORMAT: This format can be used by the Phylip package of
849	Joe Felsenstein (see the references/algorithms section for details
850	of how to get it). Phylip allows you to do a huge range of
851	phylogenetic analyses (we just offer one method in this program) and
852	is probably the most widely used set of programs for drawing trees.
853	It also works on just about every computer you can think of,
854	providing you have a decent Pascal compiler.
855
856
857
858
859
860	******************************
861	* PROFILE ALIGNMENT MENU. *
862	******************************
863
864
865
866	This menu is for taking two old alignments (or single sequences) and
867	aligning them with each other. The result is one bigger alignment.
868	The menu is very similar to the multiple alignment menu except that
869	there is no mention of dendrograms here (they are not needed) and
870	you need to input two sets of sequences. The menu looks like this:
871
872
873
874	***** Profile Alignment Menu *****
875
876
877	1. Input 1st. profile/sequence
878	2. Input 2nd. profile/sequence
879	3. Do alignment now
880	4. Alignment parameters
881	5. Output format options
882
883	S. Execute a system command
884	H. HELP
885	or press [RETURN] to go back to main menu
886
887
888	Your choice:
889
890
891	You must input profile number 1 first. When both profiles are
892	loaded, use item 3 (Do alignment now) and the 2 profiles will be
893	aligned. Items 4 and 5 (parameters and output options) are
894	identical to the equivalent options on the multiple alignment menu.
895
896	The same input routines that are used for general input are used
897	here i.e. sequences must be in NBRF/PIR, EMBL/SwissProt or FASTA
898	format, with gaps indicated by hyphens ("-"). This is why we have
899	continually drawn your attention to the NBRF/PIR format as a useful
900	output format.
901
902	Either profile can consist of just one sequence. Therefore, if you
903	have a favourite alignment of sequences that you are working on and
904	wish to add a new sequence, you can use this menu, provided the
905	alignment is in the correct format.
906
907	The total number of sequences in the two profiles must be less
908	than or equal to the MAXN parameter set in the clustalv.h header
909	file.
910
911
912
913
914
915
916
917
918
919
920
921	******************************
922	* PHYLOGENETIC TREE MENU. *
923	******************************
924
925
926	This menu allows you to input an alignment and calculate a
927	phylogenetic tree. You can also calculate a tree if you have just
928	carried out a multiple alignment and the alignment is still in
929	memory. THE SEQUENCES MUST BE ALIGNED ALREADY!!!!!! The tree will
930	look strange if the sequences are not already aligned. You can also
931	"BOOTSTRAP" the tree to show confidence levels for groupings. This
932	is SLOW on microcomputers but works fine on workstations or
933	mainframes.
934
935
936
937	***** Phylogenetic tree Menu *****
938
939
940	1. Input an alignment
941	2. Exclude positions with gaps? = OFF
942	3. Correct for multiple substitutions? = OFF
943	4. Draw tree now
944	5. Bootstrap tree
945
946	S. Execute a system command
947	H. HELP
948	or press [RETURN] to go back to main menu
949
950
951	Your choice:
952
953
954
955
956	The same input routine that is used for general input is used here
957	i.e. sequences must be in NBRF/PIR, EMBL/SwissProt or FASTA format,
958	with gaps indicated by hyphens ("-"). This is why we have
959	continually drawn your attention to the NBRF/PIR format as a useful
960	output format.
961
962	If you have input an alignment, then just use item 4 to draw a tree.
963	The method used is the Neighbor Joining method of Saitou and Nei
964	(1987). This is a "distance method". First, percent divergence
965	figures are calculated between all pairs of sequences. These
966	divergence figures are then used by the NJ method to give the tree.
967	Example trees will be shown below.
968
969	There are two options which can be used to control the way the
970	distances are calculated. These are set by options 2 and 3 in the
971	menu.
972
973	EXCLUDE POSITIONS WITH GAPS? This option allows you to ignore all
974	alignment positions (columns) where there is a gap in ANY sequence.
975	This guarantees that "like" is compared with "like" in all distances
976	i.e. the same positions are used to calculate all distances. It
977	also means that the distances will be "metric". The disadvantage of
978	using this option is that you throw away much of the data if there
979	are many gaps. If the total number of gaps is small, it has little
980	effect.
981
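The effect of this option can be sketched in Python as follows. This
is an illustration only, assuming the alignment is held as a list of
equal-length strings with "-" for gaps; the function name
drop_gap_columns is hypothetical and is not taken from the program.

```python
def drop_gap_columns(alignment, gap="-"):
    """Return the alignment with every column that contains a gap removed."""
    if len({len(row) for row in alignment}) > 1:
        raise ValueError("aligned sequences must all have the same length")
    # zip(*alignment) walks the alignment column by column.
    kept = [column for column in zip(*alignment) if gap not in column]
    if not kept:
        return ["" for _ in alignment]
    # Transpose the surviving columns back into rows.
    return ["".join(chars) for chars in zip(*kept)]
```

Every pairwise distance is then computed over the same surviving
columns, which is the "like with like" property described above.
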
982	CORRECT FOR MULTIPLE SUBSTITUTIONS? As sequences diverge,
983	substitutions accumulate. It becomes increasingly likely that more
984	than one substitution (as a result of a mutation) will have happened
985	at a site where you observe just one difference now. This option
986	allows you to use formulae developed by Motoo Kimura to correct for
987	this effect. It has the effect of stretching long branches in trees
988	while leaving short ones relatively untouched. The desired effect
989	is to try and make distances proportional to time since divergence.
990
991	The tree is sent to a file called BLAH.NJ, where BLAH.SEQ is the
992	name of the input alignment file. An example is shown below for 6
993	globin sequences.
994
995
996
997	DIST = percentage divergence (/100)
998	Length = number of sites used in comparison
999
1000	1 vs. 2 DIST = 0.5683; length = 139
1001	1 vs. 3 DIST = 0.5540; length = 139
1002	1 vs. 4 DIST = 0.5315; length = 111
1003	1 vs. 5 DIST = 0.7447; length = 141
1004	1 vs. 6 DIST = 0.7571; length = 140
1005	2 vs. 3 DIST = 0.0897; length = 145
1006	2 vs. 4 DIST = 0.1391; length = 115
1007	2 vs. 5 DIST = 0.7517; length = 145
1008	2 vs. 6 DIST = 0.7431; length = 144
1009	3 vs. 4 DIST = 0.0957; length = 115
1010	3 vs. 5 DIST = 0.7379; length = 145
1011	3 vs. 6 DIST = 0.7361; length = 144
1012	4 vs. 5 DIST = 0.7304; length = 115
1013	4 vs. 6 DIST = 0.7368; length = 114
1014	5 vs. 6 DIST = 0.2697; length = 152
1015
1016
1017	Neighbor-joining Method
1018
1019	Saitou, N. and Nei, M. (1987) The Neighbor-joining Method:
1020	A New Method for Reconstructing Phylogenetic Trees.
1021	Mol. Biol. Evol., 4(4), 406-425
1022
1023
1024	This is an UNROOTED tree
1025
1026	Numbers in parentheses are branch lengths
1027
1028
1029	Cycle 1 = SEQ: 5 ( 0.13382) joins SEQ: 6 ( 0.13592)
1030
1031	Cycle 2 = SEQ: 1 ( 0.28142) joins Node: 5 ( 0.33462)
1032
1033	Cycle 3 = SEQ: 2 ( 0.05879) joins SEQ: 3 ( 0.03086)
1034
1035	Cycle 4 (Last cycle, trichotomy):
1036
1037	Node: 1 ( 0.20798) joins
1038	Node: 2 ( 0.02341) joins
1039	SEQ: 4 ( 0.04915)
1040
1041
1042
1043	The output file first shows the percent divergence (distance)
1044	figures between each pair of sequences. Then a description of a NJ
1045	tree is given. This description shows which sequences (SEQ:) or
1046	which groups of sequences (NODE: , a node is numbered using the
1047	lowest sequence that belongs to it) join at each level of the tree.
1048
1049	This is an unrooted tree!! This means that the direction of
1050	evolution through the tree is not shown. This can only be inferred
1051	in one of two ways:
1052	1) assume a degree of constancy in the molecular clock and place the
1053	root (bottom of the tree; the point where all the sequences radiate
1054	from) half way along the longest branch. OR
1055	2) use an "outgroup", a sequence from an organism that you "know"
1056	must be outside of the rest of the sequences i.e. root the tree
1057	manually, on biological grounds.
1058
1059	The above tree can be represented diagrammatically as follows:
1060
1061
1062	                       SEQ 1       SEQ 4
1063	                         I           I
1064	        13.6             I 28.1      I 4.9          5.9
1065	SEQ 6 ----------I        I           I          I--------- SEQ 2
1066	                I        I           I          I
1067	                I--------I-----------I----------I
1068	        13.4    I  33.5      20.8        2.3    I   3.1
1069	SEQ 5 ----------I                               I--------- SEQ 3
1070
1071
1072	The figures along each branch are percent divergences along that
1073	branch. If you root the tree by placing the root along the longest
1074	branch (33.5%) then you can draw it again as follows, this time
1075	rooted:
1076
1077
1078
1079	                          13.6
1080	               I-------------------- SEQ 6
1081	     I---------I          13.4
1082	     I         I-------------------- SEQ 5
1083	     I  33.5
1084	-----I                    28.1
1085	     I         I-------------------- SEQ 1
1086	     I         I
1087	     I---------I             4.9
1088	               I  20.8  I----------- SEQ 4
1089	               I--------I
1090	                        I       5.9
1091	                        I 2.3 I----- SEQ 2
1092	                        I-----I 3.1
1093	                              I----- SEQ 3
1094
1095
1096
1097	The longest branch (33.5% between 5,6 and 1,2,3,4) is split between
1098	the 2 bottom branches of the tree. As it happens in this particular
1099	case, sequences 5 and 6 are myoglobins while sequences 1,2,3 and 4
1100	are alpha and beta globins, so you could also justify the above
1101	rooting on biological grounds. If you do not have any particular
1102	need or evidence for the position of the root, then LEAVE THE TREE
1103	UNROOTED. Unrooted trees do not look as pretty as rooted ones but
1104	it is usual to leave them unrooted if you do not have any evidence
1105	for the position of the root.
1106
1107
1108	BOOTSTRAPPING: Different sets of sequences and different tree
1109	drawing methods may give different topologies (branching orders) for
1110	parts of a tree that are weakly supported by the data. It is useful
1111	to have an indication of the degree of error in the tree. There are
1112	several ways of doing this, some of them rather technical. We
1113	provide one general purpose method in this program, which makes use
1114	of a technique called bootstrapping (see Felsenstein, 1985).
1115
1116	In the case of sequence alignments, bootstrapping involves taking
1117	random samples of positions from the alignment. If the alignment
1118	has N positions, each bootstrap sample consists of a random sample
1119	of N positions, taken WITH REPLACEMENT i.e. in any given sample,
1120	some sites may be sampled several times, others not at all. Then,
1121	with each sample of sites, you calculate a distance matrix as usual
1122	and draw a tree. If the data very strongly support just one tree
1123	then the sample trees will be very similar to each other and to the
1124	original tree, drawn without bootstrapping. However, if parts of
1125	the tree are not well supported, then the sample trees will vary
1126	considerably in how they represent these parts.
1127
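The resampling step described above can be sketched in Python as
follows. This is a minimal illustration, assuming the alignment is a
list of equal-length strings. The function names are hypothetical,
and Python's random number generator is not the one used by
Clustal V, so a given seed will not reproduce the program's samples.

```python
import random


def bootstrap_sample(alignment, rng):
    """Draw N columns WITH REPLACEMENT from an alignment of N columns."""
    n = len(alignment[0])
    # Some positions may be picked several times, others not at all.
    picks = [rng.randrange(n) for _ in range(n)]
    return ["".join(row[i] for i in picks) for row in alignment]


def bootstrap_samples(alignment, trials, seed):
    """Return `trials` resampled alignments; the same seed gives the same samples."""
    rng = random.Random(seed)
    return [bootstrap_sample(alignment, rng) for _ in range(trials)]
```

Each resampled alignment would then go through the usual distance
and tree calculation, and the groupings of the sample trees would be
counted against those of the original tree.
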
1128	In practice, you should use a very large number of bootstrap
1129	replicates (1000 is recommended, even if it means running the
1130	program for an hour on a slow microcomputer; on a workstation it
1131	will be MUCH faster). For each grouping on the tree, you record the
1132	number of times this grouping occurs in the sample trees. For a
1133	group to be considered "significant" at the 95% level (or P <= 0.05
1134	in statistical terms) you expect the grouping to show up in >= 95%
1135	of the sample trees. If this happens, then you can say that the
1136	grouping is significant, given the data set and the method used to
1137	draw the tree.
1138
1139	So, when you use the bootstrap option, a NJ tree is drawn as before
1140	and then you are asked to say how many bootstrap samples you want
1141	(1000 is the default) and you are asked to give a seed number for
1142	the random number generator. If you give the same seed number in
1143	future, you will get the same results (we hope). Remember to give
1144	different seed numbers if you wish to carry out genuinely different
1145	bootstrap sampling experiments. Below is the output file from using
1146	the same data for the 6 globin sequences as used before. The output
1147	file has the same name as the input file with the extension ".njb".
1148
1149	//
1150	STUFF DELETED .... same as for the ordinary NJ output
1151	//
1152	Bootstrap Confidence Limits
1153
1154
1155	Random number generator seed = 99
1156
1157	Number of bootstrap trials = 1000
1158
1159
1160	Diagrammatic representation of the above tree:
1161
1162	Each row represents 1 tree cycle; defining 2 groups.
1163
1164	Each column is 1 sequence; the stars in each line show 1 group;
1165	the dots show the other
1166
1167	Numbers show occurences in bootstrap samples.
1168
1169	****.. 1000
1170	.***.. 1000 <- This is the answer!!
1171	..** 812
1172	122311
1173
1174
1175	For an unrooted tree with N sequences, there are actually only N-3
1176	genuinely different groupings that we can test (this is the number
1177	of "internal branches"; each internal branch splits the sequences
1178	into 2 groups). In this example, we have 6 sequences with 3
1179	internal branches in the reference tree. In the bootstrap
1180	resampling, we count how often each of these internal branches
1181	occurs. Here, we find that the branch which splits 1,2,3 and 4
1182	versus 5 and 6 occurs in all 1000 samples; the branch which splits
1183	2,3 and 4 versus 1,5 and 6 occurs in 1000; the branch which splits 2
1184	and 3 versus 1,4,5 and 6 occurs in 812/1000 samples. We can put
1185	these figures on to the diagrammatic representation we made earlier
1186	of our unrooted NJ tree as follows:
1187
1188
1189
1190	                       SEQ 1       SEQ 4
1191	                         I           I
1192	                         I           I
1193	SEQ 6 ----------I        I           I          I--------- SEQ 2
1194	                I  1000  I   1000    I   812    I
1195	                I--------I-----------I----------I
1196	                I                               I
1197	SEQ 5 ----------I                               I--------- SEQ 3
1198
1199
1200
1201	You can equally put these confidence figures on the rooted tree (in
1202	fact the interpretation is simpler with rooted trees). With the
1203	unrooted tree, the grouping of sequence 5 with 6 is significant (as
1204	is the grouping of sequences 1,2,3 and 4). Equally the grouping of
1205	sequences 1,5 and 6 is significant (the same as saying that 2,3 and
1206	4 group significantly). However, the grouping of 2 and 3 is not
1207	significant, although it is relatively strongly supported.
1208
1209	Unfortunately, there is a small complication in the interpretation
1210	of these results. In statistical hypothesis testing, it is not
1211	valid to make multiple simultaneous tests and to treat the result of
1212	each test completely independently. In the above case, if you have
1213	one particular test (grouping) that you wish to make in advance, it
1214	is valid to test IT ALONE and to simply show the other bootstrap
1215	figures for reference. If you do not have any particular test in
1216	mind before you do the bootstrapping, you can just show all of the
1217	figures and use the 95% level as an ARBITRARY cut off to show those
1218	groups that are very strongly supported; but not mention anything
1219	about SIGNIFICANCE testing. In the literature, it is common
1220	practice to simply show the figures with a tree; they frequently
1221	speak for themselves.
1222
1223
1224
1225	*******************************************************************
1226
1227	4. Command Line Interface.
1228
1229
1230
1231	You can do almost everything that can be done from the menus, using
1232	a command line interface. In this mode, the program will take all of
1233	its instructions as "switches" when you activate it; no questions
1234	will be asked; if there are no errors, the program just does an
1235	analysis and stops. It does not work so well on the MAC but is
1236	still possible. To get you started we will show you the 2 simplest
1237	uses of the command line as it looks on VAX/VMS. On all other
1238	machines (except the MAC) it works in the same way.
1239
1240	$ clustalv /help OR $ clustalv /check
1241
1242	Both of the above switches give you a one page summary of the
1243	command line on the screen and then the program stops.
1244
1245
1246	$ clustalv proteins.seq OR $ clustalv /infile=proteins.seq
1247
1248	This will read the sequences from the file 'proteins.seq' and do a
1249	complete multiple alignment. Default parameters will be used, the
1250	program will try to tell whether or not the sequences are DNA or
1251	protein and the output will go to a file called 'proteins.aln' . A
1252	dendrogram file called 'proteins.dnd' will also be created. Thus
1253	the default action for the program, when it successfully reads in an
1254	input file is to do a full multiple alignment. Some further
1255	examples of command line usage will be given later.
1256
1257	Command line switches can be abbreviated but MAKE SURE YOU DO NOT
1258	MAKE THEM AMBIGUOUS. No attempt will be made to detect ambiguity.
1259	Use enough characters to distinguish each switch uniquely.
1260
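Because the program does not detect ambiguity itself, you may wish to
check an abbreviation before using it. The Python sketch below does
this against the switch names listed in this section. It assumes that
abbreviations are matched as leading characters of the switch name,
which is our reading of the rule above and not something confirmed
from the program source; the function name matching_switches is
hypothetical.

```python
# Switch names as listed in this section of the documentation.
SWITCHES = [
    "INFILE", "PROFILE1", "PROFILE2", "HELP", "CHECK", "ALIGN", "TREE",
    "BOOTSTRAP", "KTUP", "TOPDIAGS", "WINDOW", "PAIRGAP", "FIXEDGAP",
    "FLOATGAP", "MATRIX", "TYPE", "OUTPUT", "TRANSIT", "KIMURA",
    "TOSSGAPS", "SEED",
]


def matching_switches(abbreviation, switches=SWITCHES):
    """List every switch that starts with the given abbreviation."""
    key = abbreviation.lstrip("/").upper()
    return [name for name in switches if name.startswith(key)]
```

Under this assumption /boot and /fixed each match exactly one switch,
while /t matches five (TREE, TOPDIAGS, TYPE, TRANSIT and TOSSGAPS)
and should be spelled out further.
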
1261
1262
1263
1264
1265
1266
1267	The full list of allowed switches is given below:
1268
1269
1270	DATA (sequences)
1271
1272	/INFILE=file.ext :input sequences. If you give an input file and
1273	nothing else as a switch, the default action is
1274	to do a complete multiple alignment. The input
1275	file can also be specified by giving it as the
1276	first command line parameter with no "/" in
1277	front of it e.g. $ clustalv file.ext .
1278
1279	/PROFILE1=file.ext :You use these two switches to give the names of
1280	/PROFILE2=file.ext two profiles. The default action is to align
1281	the two. You must give the names of both profile
1282	files.
1283
1284
1285
1286	VERBS (do things)
1287
1288	/HELP :list the command line parameters on the screen.
1289	/CHECK
1290
1291	/ALIGN :do full multiple alignment. This is the default
1292	action if no other switches except for input files
1293	are given.
1294
1295	/TREE :calculate NJ tree. If this is the only action
1296	specified (e.g. $ clustalv proteins.seq/tree ) it IS
1297	ASSUMED THAT THE SEQUENCES ARE ALREADY ALIGNED. If
1298	the sequences are not already aligned, you should
1299	also give the /ALIGN switch. This will align the
1300	sequences first, output an alignment file and
1301	calculate the tree in memory.
1302
1303	/BOOTSTRAP(=n) :bootstrap a NJ tree (n= number of bootstraps;
1304	default = 1000). If this is the only action
1305	specified (e.g. $ clustalv proteins.seq/bootstrap )
1306	it IS ASSUMED THAT THE SEQUENCES ARE ALREADY ALIGNED.
1307	If the sequences are not already aligned, you should
1308	also give the /ALIGN switch. This will align the
1309	sequences first, output an alignment file and
1310	calculate the bootstraps in memory. You can set the
1311	number of bootstrap trials here (e.g. /bootstrap=500).
1312	You can set the seed number for the random number
1313	generator with /seed=n.
1314
1315
1316
1317	PARAMETERS (set things)
1318
1319	*Pairwise alignments:*
1320
1321	/KTUP=n :word size
1322
1323	/TOPDIAGS=n :number of best diagonals
1324
1325	/WINDOW=n :window around best diagonals
1326
1327	/PAIRGAP=n :gap penalty
1328
1329
1330
1331	*Multiple alignments:*
1332
1333	/FIXEDGAP=n :fixed length gap pen.
1334
1335	/FLOATGAP=n :variable length gap pen.
1336
1337	/MATRIX= :PAM100 or ID or file name. The default weight matrix
1338	for proteins is PAM 250.
1339
1340	/TYPE=p or d :type is protein or DNA. This allows you to
1341	explicitly override the program's attempt at guessing
1342	the type of the sequence. It is only useful if you
1343	are using sequences with a VERY strange composition.
1344
1345	/OUTPUT= :GCG or PHYLIP or PIR. The default output is
1346	Clustal format.
1347
1348	/TRANSIT :transitions not weighted. The default is to weight
1349	transitions as more favourable than other mismatches
1350	in DNA alignments. This switch makes all nucleotide
1351	mismatches equally weighted.
1352
1353
1354	*Trees:*
1355
1356	/KIMURA :use Kimura's correction on distances.
1357
1358	/TOSSGAPS :ignore positions with a gap in ANY sequence.
1359
1360	/SEED=n :seed number for bootstraps.
1361
1362
1363
1364
1365	EXAMPLES:
1366
1367	These examples use the VAX/VMS $ prompt; otherwise, command-line
1368	usage is the same on all machines except the Macintosh.
1369
1370
1371	$ clustalv proteins.seq OR $ clustalv /infile=proteins.seq
1372
1373	Read whatever sequences are in the file "proteins.seq" and do a full
1374	multiple alignment; output will go to the files: "proteins.dnd"
1375	(dendrogram) and "proteins.aln" (alignment).
1376
1377
1378	$ clustalv proteins.seq/ktup=2/matrix=pam100/output=pir
1379
1380	Same as last example but use K-Tuple size of 2; use a PAM 100
1381	protein weight matrix; write the alignment out in NBRF/PIR format
1382	(goes to a file called "proteins.pir").
1383
1384
1385	$ clustalv /profile1=proteins.seq/profile2=more.seq/type=p/fixed=11
1386
1387	Take the alignment in "proteins.seq" and align it with "more.seq"
1388	using default values for everything except the fixed gap penalty
1389	which is set to 11. The sequence type is explicitly set to
1390	PROTEIN.
1391
1392
1393	$ clustalv proteins.pir/tree/kimura
1394
1395	Take the sequences in proteins.pir (they MUST BE ALIGNED ALREADY)
1396	and calculate a phylogenetic tree using Kimura's correction for
1397	distances.
1398
1399
1400	$ clustalv proteins.pir/align/tree/kimura
1401
1402	Same as the previous example, EXCEPT THAT AN ALIGNMENT IS DONE
1403	FIRST.
1404
1405
1406	$ clustalv proteins.seq/align/boot=500/seed=99/tossgaps/type=p
1407
1408	Take the sequences in proteins.seq; they are explicitly set to be
1409	protein; align them; bootstrap a tree using 500 samples and a seed
1410	number of 99.
1411
1412
1413	*******************************************************************
1414
1415	5. Algorithms and references.
1416
1417
1418
1419	In this section, we will try to BRIEFLY describe the algorithms used
1420	in ClustalV and give references. The topics covered are:
1421
1422
1423	-Multiple alignments
1424
1425	-Profile alignments
1426
1427	-Protein weight matrices
1428
1429	-Phylogenetic trees
1430
1431	-distances
1432
1433	-NJ method
1434
1435	-Bootstrapping
1436
1437	-Phylip
1438
1439	-References
1440
1441
1442
1443
1444
1445
1446	MULTIPLE ALIGNMENTS.
1447
1448	The approach used in ClustalV is a modified version of the method of
1449	Feng and Doolittle (1987) who aligned the sequences in larger and
1450	larger groups according to the branching order in an initial
1451	phylogenetic tree. This approach allows a very useful combination
1452	of computational tractability and sensitivity.
1453
1454	The positions of gaps that are generated in early alignments remain
1455	through later stages. This can be justified because gaps that arise
1456	from the comparison of closely related sequences should not be moved
1457	because of later alignment with more distantly related sequences.
1458	At each alignment stage, you align two groups of already aligned
1459	sequences. This is done using a dynamic programming algorithm where
1460	one allows the residues that occur in every sequence at each
1461	alignment position to contribute to the alignment score. A Dayhoff
1462	(1978) PAM matrix is used in protein comparisons.
1463
1464	The details of the algorithm used in ClustalV have been published in
1465	Higgins and Sharp (1989). This was an improved version of an
1466	earlier algorithm published in Higgins and Sharp (1988). First, you
1467	calculate a crude similarity measure between every pair of sequences.
1468	This is done using the fast, approximate alignment algorithm of
1469	Wilbur and Lipman (1983). Then, these scores are used to calculate
1470	a "guide tree" or dendrogram, which will tell the multiple alignment
1471	stage in which order to align the sequences for the final multiple
1472	alignment. This "guide tree" is calculated using the UPGMA method
1473	of Sneath and Sokal (1973). UPGMA is a fancy name for one type of
1474	average linkage cluster analysis, invented by Sokal and Michener
1475	(1958).
1476
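To show what average linkage clustering does, here is a small Python
sketch of UPGMA. It is not the code used in ClustalV. For simplicity
it assumes a square matrix of DISTANCES (smaller means closer),
whereas the guide tree in the program is built from similarity
scores; the function name upgma and the nested-tuple result are our
own choices.

```python
def upgma(dist, names):
    """Average linkage clustering; returns nested tuples giving the join order."""
    clusters = {i: (names[i], 1) for i in range(len(names))}  # id -> (tree, size)
    d = {(i, j): dist[i][j] for i in range(len(names)) for j in range(i)}
    next_id = len(names)

    def get(a, b):
        return d[(a, b)] if (a, b) in d else d[(b, a)]

    while len(clusters) > 1:
        a, b = min(d, key=d.get)  # closest pair of clusters
        (tree_a, size_a), (tree_b, size_b) = clusters.pop(a), clusters.pop(b)
        for c in clusters:
            # Size-weighted mean, i.e. the average over all member pairs.
            d[(next_id, c)] = (size_a * get(a, c) + size_b * get(b, c)) / (size_a + size_b)
        # Forget every distance that involves the two merged clusters.
        d = {pair: value for pair, value in d.items() if a not in pair and b not in pair}
        clusters[next_id] = ((tree_a, tree_b), size_a + size_b)
        next_id += 1
    return next(iter(clusters.values()))[0]
```

The innermost tuples are the pairs joined first, so reading the
result from the inside out gives the order in which the multiple
alignment stage would align the sequences.
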
1477	Having calculated the dendrogram, the sequences are aligned in
1478	larger and larger groups. At each alignment stage, we use the
1479	algorithm of Myers and Miller (1988) for the optimal alignments.
1480	This algorithm is a very memory efficient variation of Gotoh's
1481	algorithm (Gotoh, 1982). It is because of this algorithm that
1482	ClustalV can work on microcomputers. Each of these alignments
1483	consists of aligning 2 alignments, using what we call "profile
1484	alignments".
1485
1486
1487	PROFILE ALIGNMENTS.
1488
1489	We use the term "profile alignment" to describe the alignment of 2
1490	alignments. We use this term because the method is a simple
1491	extension of the profile method of Gribskov, et al. (1987) for
1492	aligning 1 sequence with an alignment. Normally, with a 2 sequence
1493	alignment, you use a weight matrix (e.g. a PAM 250 matrix) to give a
1494	score between the pairs of aligned residues. The alignment is
1495	considered "optimal" if it gives the best total score for aligned
1496	residues minus penalties for any gaps (insertions or deletions) that
1497	must be introduced.
1498
1499	Profile alignments are a simple extension of 2 sequence alignments
1500	in that you can treat each of the two input alignments as single
1501	sequences but you calculate the score at aligned positions as the
1502	average weight matrix score of all the residues in one alignment
1503	versus all those in the other e.g. if you have 2 alignments with I
1504	and J sequences respectively; the score at any position is the
1505	average of all the I times J scores of the residues compared
1506	separately. Any gaps that are introduced are placed in all of the
1507	sequences of an alignment at the same position. The profile
1508	alignments offered in the "profile alignment menu" are also
1509	calculated in this way.
1510
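The scoring rule for one pair of aligned positions can be sketched in
Python as follows. This is an illustration only: it assumes each
column is given as a string of residues (one per sequence, no gaps)
and that the weight matrix is a lookup keyed by residue pairs. How
the program treats gaps already present in a column is not covered
here. The function name column_score is hypothetical.

```python
def column_score(column_a, column_b, weight):
    """Average weight-matrix score over all I x J residue pairs of two columns."""
    total = sum(weight[a, b] for a in column_a for b in column_b)
    return total / (len(column_a) * len(column_b))
```

With I = J = 1 this reduces to the ordinary 2 sequence case, where
the score is simply the weight matrix entry for the two residues.
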
1511
1512	PROTEIN WEIGHT MATRICES.
1513
1514	There are 3 built-in weight matrices used by clustalV. These are
1515	the PAM 100 and PAM 250 matrices of Dayhoff (1978) and an identity
1516	matrix. Each matrix is given as the bottom left half, including the
1517	diagonal of a 20 by 20 matrix. The order of the rows and columns is
1518	CSTPAGNDEQHRKMILVFYW.
1519
1520
1521	PAM 250
1522
1523	C 12
1524	S 0 2
1525	T -2 1 3
1526	P -3 1 0 6
1527	A -2 1 1 1 2
1528	G -3 1 0 -1 1 5
1529	N -4 1 0 -1 0 0 2
1530	D -5 0 0 -1 0 1 2 4
1531	E -5 0 0 -1 0 0 1 3 4
1532	Q -5 -1 -1 0 0 -1 1 2 2 4
1533	H -3 -1 -1 0 -1 -2 2 1 1 3 6
1534	R -4 0 -1 0 -2 -3 0 -1 -1 1 2 6
1535	K -5 0 0 -1 -1 -2 1 0 0 1 0 3 5
1536	M -5 -2 -1 -2 -1 -3 -2 -3 -2 -1 -2 0 0 6
1537	I -2 -1 0 -2 -1 -3 -2 -2 -2 -2 -2 -2 -2 2 5
1538	L -6 -3 -2 -3 -2 -4 -3 -4 -3 -2 -2 -3 -3 4 2 6
1539	V -2 -1 0 -1 0 -1 -2 -2 -2 -2 -2 -2 -2 2 4 2 4
1540	F -4 -3 -3 -5 -4 -5 -4 -6 -5 -5 -2 -4 -5 0 1 2 -1 9
1541	Y 0 -3 -3 -5 -3 -5 -2 -4 -4 -4 0 -4 -4 -2 -1 -1 -2 7 10
1542	W -8 -2 -5 -6 -6 -7 -4 -7 -7 -5 -3 2 -3 -4 -5 -2 -6 0 0 17
1543	----------------------------------------------------------------
1544	C S T P A G N D E Q H R K M I L V F Y W
1545
1546
1547	IDENTITY MATRIX
1548
1549	10
1550	0 10
1551	0 0 10
1552	0 0 0 10
1553	0 0 0 0 10
1554	0 0 0 0 0 10
1555	0 0 0 0 0 0 10
1556	0 0 0 0 0 0 0 10
1557	0 0 0 0 0 0 0 0 10
1558	0 0 0 0 0 0 0 0 0 10
1559	0 0 0 0 0 0 0 0 0 0 10
1560	0 0 0 0 0 0 0 0 0 0 0 10
1561	0 0 0 0 0 0 0 0 0 0 0 0 10
1562	0 0 0 0 0 0 0 0 0 0 0 0 0 10
1563	0 0 0 0 0 0 0 0 0 0 0 0 0 0 10
1564	0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 10
1565	0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 10
1566	0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 10
1567	0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 10
1568	0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 10
1569
1570
1571
1572
1573
1574	PAM 100
1575
1576	14
1577	-1 6
1578	-5 2 7
1579	-6 1 -1 10
1580	-5 2 2 1 6
1581	-8 1 -3 -3 1 8
1582	-8 2 0 -3 -1 -1 7
1583	-11 -1 -2 -4 -1 -1 4 8
1584	-11 -2 -3 -3 0 -2 1 5 8
1585	-11 -3 -3 -1 -2 -5 -1 1 4 9
1586	-6 -4 -5 -2 -5 -7 2 -1 -2 4 11
1587	-6 -1 -4 -2 -5 -8 -3 -6 -5 1 1 10
1588	-11 -2 -1 -4 -4 -5 1 -2 -2 -1 -3 3 8
1589	-11 -4 -2 -6 -3 -8 -5 -8 -6 -2 -7 -2 1 13
1590	-5 -4 -1 -6 -3 -7 -4 -6 -5 -5 -7 -4 4 2 9
1591	-12 -7 -5 -5 -5 -8 -6 -9 -7 -3 -5 -7 -6 4 2 9
1592	-4 -4 -1 -4 0 -4 -5 -6 -5 -5 -6 -6 -6 1 5 1 8
1593	-10 -5 -6 -9 -7 -8 -6 -11 -11 -10 -4 -7 -11 -2 0 0 -5 12
1594	-2 -6 -6 -11 -6 -11 -3 -9 -7 -9 -1 -10 -10 -8 -4 -5 -6 6 13
1595	-13 -4 -10 -11 -11 -13 -8 -13 -14 -11 -7 1 -9 -11 -12 -7 -14 -2 -2 19
1596
1597
1598
1599
1600	PHYLOGENETIC TREES.
1601
1602	There are two COMMONLY used approaches for inferring phylogenetic
1603	trees from sequence data: parsimony and distance methods. There are
1604	other approaches which are probably superior in theory but which are
1605	yet to be used widely. This does not mean that they are no use; we
1606	(the authors of this program at any rate) simply do not know enough
1607	about them yet. You should see the documentation accompanying the
1608	Phylip package and some of the references there for an explanation
1609	of the different methods and what assumptions are implied when you
1610	use them.
1611
1612	There is a constant debate in the literature as to the merits of
1613	different methods but unfortunately, a lot of what is said is
1614	incomprehensible or inaccurate. It is also a field that is prone to
1615	having highly opinionated schools of thought. This is a pity
1616	because it prevents rational discussion of the pros and cons of
1617	the different methods. The approach adopted in ClustalV is to
1618	supply just one method and to produce alignments in a format that
1619	can be used by Phylip. In simple cases, the trees produced will be
1620	as "good" (reliable, robust) as those from ANY other method. In
1621	more complicated cases, there is no single magic recipe that we can
1622	supply that will work well in even most situations.
1623
1624	The method we provide is the Neighbor Joining method (NJ) of Saitou
1625	and Nei (1987) which is a distance method. We use this for three
1626	reasons: it is conceptually and computationally simple; it is fast;
1627	it gives "good" trees in simple cases. It is difficult to prove that
1628	one tree is "better" than another if you do not know the true
1629	phylogeny; the few systematic surveys of methods show it to work
1630	more or less as well as any other method ON AVERAGE. Another reason
1631	for using the NJ method is that it is very commonly used; THIS IS A
1632	BAD REASON SCIENTIFICALLY but at least you will not feel lonely if
1633	you use it.
1634
1635	The NJ method works on a matrix of distances (the distance matrix)
1636	between all pairs of sequences to be analysed. These distances are
1637	related to the degree of divergence between the sequences. It is
1638	normal to calculate the distances from the sequences after they are
1639	multiply aligned. If you calculate them from separate alignments
1640	(as done for the dendrograms in another part of this program), you
1641	may increase the error considerably.
1642
1643
1644	DISTANCES
1645
1646	The simplest measure of distance between sequences is percent
1647	divergence (100% minus percent identity). For two sequences, you
1648	count how many positions differ between them (ignoring all positions
1649	with a gap or an unknown residue) and divide by the number of
1650	positions considered. It is common practice to also ignore all
1651	positions in the alignment where there is a GAP in ANY of the
1652	sequences (Tossgaps ? option in the menu). Usually, you express the
1653	percent distance divided by 100 (gives distances between 0.0 and
1654	1.0).
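
As a rough illustration of the distance calculation described above,
here is a minimal Python sketch. It is NOT the ClustalV source code.
The function names, the gap characters ("-" and ".") and the use of
"N" as the unknown residue are assumptions made for the example; for
proteins you would probably pass "X" as the unknown residue instead.

```python
GAP_CHARS = "-."  # assumption: characters treated as gaps in the alignment


def gapped_columns(alignment):
    """Indices of columns where ANY sequence has a gap (the 'tossgaps' idea)."""
    length = len(alignment[0])
    return {i for i in range(length)
            if any(seq[i] in GAP_CHARS for seq in alignment)}


def divergence(seq_a, seq_b, unknown="N", skip=frozenset()):
    """Fraction of differing positions between two aligned sequences.

    Positions with a gap or an unknown residue in either sequence are
    ignored, as are any column indices listed in `skip`.
    """
    compared = 0
    differing = 0
    for i, (a, b) in enumerate(zip(seq_a.upper(), seq_b.upper())):
        if i in skip:
            continue
        if a in GAP_CHARS or b in GAP_CHARS or a in unknown or b in unknown:
            continue
        compared += 1
        if a != b:
            differing += 1
    if compared == 0:
        raise ValueError("no comparable positions between the two sequences")
    return differing / compared  # between 0.0 and 1.0


def distance_matrix(alignment, unknown="N", tossgaps=False):
    """Symmetric matrix of pairwise distances for a list of aligned strings."""
    skip = gapped_columns(alignment) if tossgaps else frozenset()
    n = len(alignment)
    dist = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            d = divergence(alignment[i], alignment[j], unknown, skip)
            dist[i][j] = dist[j][i] = d
    return dist


if __name__ == "__main__":
    example = ["ACGT-A", "ACCTTA", "AC-TTC"]
    for row in distance_matrix(example, tossgaps=True):
        print(row)
```

With tossgaps switched on, every pair is compared over the same set
of columns, so the distances in one matrix are all based on the same
number of sites.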
1655
1656	This measure of distance is perfectly adequate (with some further
1657	modification described below) for rRNA sequences. However it treats
1658	all residues identically e.g. all amino acid substitutions are
1659	equally weighted. It also treats all positions identically e.g. it
1660	does not take account of different rates of substitution in
1661	different positions of different codons in protein coding DNA
1662	sequences; see Li et al (1985) for a distance measure that does.
1663	Despite these shortcomings, these percent identity distances do work
1664	well in practice in a wide variety of situations.
1665
1666	In a simple world, you would like a distance to be proportional to
1667	the time since the sequences diverged. If this were EXACTLY true,
1668	then the calculation of the tree would be a simple matter of algebra
1669	(UPGMA does this for you) and the branch lengths would be nice and
1670	meaningful (times). In practice this OBVIOUSLY depends on the
1671	existence and quality of the "molecular clock", a subject of
1672	ongoing debate. However, even if there is a good clock, there is a
1673	further problem with estimating divergences. As sequences diverge,
1674	they become "saturated" with mutations. Sites can have
1675	substitutions more than once. Calculated distances will
1676	underestimate actual divergence times; the greater the divergence,
1677	the greater the discrepancy. There are various methods for dealing
1678	with this and we provide two commonly used ones, both due to Motoo
1679	Kimura; one for proteins and one for DNA.
1680
1681
1682	For distance K (percent divergence /100 ) ...
1683
1684	Correction for Protein distances: (Kimura, 1983).
1685
1686	Corrected K = -ln(1.0 - K - (K * K/5.0))
1687
1688
1689
1690	Correction for nucleotide distances: Kimura's 2-parameter method
1691	(Kimura, 1980).
1692
1693	Corrected K = 0.5ln(a) + 0.25ln(b)
1694
1695	where a = 1/(1 - 2*P - Q)
1696	and b = 1/(1 - 2*Q)
1697
1698	P and Q are the proportions of transitions (A<-->G, C<-->T)
1699	and transversions occurring between the sequences.
1700
1701
1702	One paradoxical effect of these corrections is that distances can
1703	be corrected to have more than 100% divergence. That is because,
1704	for very highly diverged sequences of length N, you can estimate
1705	that more than N substitutions have occurred by correcting the
1706	observed distance in the above ways. Don't panic!
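
The two corrections above can be written out as a short Python
sketch. The function names are made up for this example. Raising an
error when the logarithm is undefined is also an assumption; how
ClustalV itself treats such extreme distances is not stated here.

```python
import math


def kimura_protein(k):
    """Kimura (1983) correction for an observed protein distance k (0.0-1.0)."""
    arg = 1.0 - k - (k * k) / 5.0
    if arg <= 0.0:
        # Assumption: treat distances too large to correct as an error.
        raise ValueError("observed distance too large to correct")
    return -math.log(arg)


def kimura_2p(p, q):
    """Kimura (1980) 2-parameter correction for nucleotide distances.

    p is the proportion of transitions and q the proportion of
    transversions observed between the two sequences.
    """
    a_inverse = 1.0 - 2.0 * p - q
    b_inverse = 1.0 - 2.0 * q
    if a_inverse <= 0.0 or b_inverse <= 0.0:
        # Assumption: same error handling as for proteins.
        raise ValueError("observed distance too large to correct")
    a = 1.0 / a_inverse
    b = 1.0 / b_inverse
    return 0.5 * math.log(a) + 0.25 * math.log(b)


if __name__ == "__main__":
    print(kimura_protein(0.7))
    print(kimura_2p(0.10, 0.05))
```

For an observed protein distance of 0.7, the corrected value comes
out at roughly 1.6, i.e. more than 100% divergence, which is the
effect described above.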
1707
1708
1709
1710	NEIGHBOR JOINING TREES.
1711
1712	VERY briefly, the NJ method works as follows. You start by placing
1713	the sequences in a star topology (no internal branches). You then
1714	find that internal branch (take 2 sequences; join them; connect them
1715	to the rest by the internal branch) which when added to the tree
1716	will minimise the total branch length. The two joined sequences
1717	(neighbours) are merged into a single sequence and the process is
1718	repeated. For an unrooted tree with N sequences, there are N-3
1719	internal branches. The above process is repeated N-3 times to give
1720	the final tree. The full details are given in Saitou and Nei
1721	(1987).
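
For readers who prefer code to words, the following Python sketch
shows one way of carrying out the joining process. It is NOT the
ClustalV implementation. It uses the Q-matrix form of the selection
criterion that is common in textbooks (usually credited to Studier
and Keppler), which is generally described as choosing the same pairs
as minimising the total branch length. The function name and the
"node0", "node1" ... labels for internal nodes are invented for the
example.

```python
def neighbor_joining(names, dist):
    """Build an unrooted tree from a symmetric distance matrix.

    Returns a list of (node, parent_node, branch_length) edges.
    """
    if len(names) < 3:
        raise ValueError("need at least 3 sequences")
    nodes = list(names)
    # Distance table as a dict of dicts so that new nodes can be added.
    d = {a: {b: dist[i][j] for j, b in enumerate(names)}
         for i, a in enumerate(names)}
    edges = []
    next_id = 0
    # Each pass adds one internal branch; N sequences need N-3 passes.
    while len(nodes) > 3:
        n = len(nodes)
        row_sum = {a: sum(d[a][b] for b in nodes) for a in nodes}
        best = None
        for i, a in enumerate(nodes):
            for b in nodes[i + 1:]:
                q = (n - 2) * d[a][b] - row_sum[a] - row_sum[b]
                if best is None or q < best[0]:
                    best = (q, a, b)
        _, a, b = best
        new = "node%d" % next_id  # hypothetical label for the new node
        next_id += 1
        # Branch lengths from the two neighbours to the new node.
        len_a = 0.5 * d[a][b] + (row_sum[a] - row_sum[b]) / (2.0 * (n - 2))
        len_b = d[a][b] - len_a
        edges.append((a, new, len_a))
        edges.append((b, new, len_b))
        # The joined pair is replaced by the new node.
        d[new] = {new: 0.0}
        for c in nodes:
            if c != a and c != b:
                d[c][new] = d[new][c] = 0.5 * (d[a][c] + d[b][c] - d[a][b])
        nodes = [c for c in nodes if c != a and c != b] + [new]
    # The last three nodes meet at a single central node.
    a, b, c = nodes
    centre = "node%d" % next_id
    edges.append((a, centre, 0.5 * (d[a][b] + d[a][c] - d[b][c])))
    edges.append((b, centre, 0.5 * (d[a][b] + d[b][c] - d[a][c])))
    edges.append((c, centre, 0.5 * (d[a][c] + d[b][c] - d[a][b])))
    return edges


if __name__ == "__main__":
    matrix = [[0, 5, 9, 9, 8],
              [5, 0, 10, 10, 9],
              [9, 10, 0, 8, 7],
              [9, 10, 8, 0, 3],
              [8, 9, 7, 3, 0]]
    for edge in neighbor_joining(list("abcde"), matrix):
        print(edge)
```

The sketch returns 2N-3 branches for N sequences, of which N-3 are
internal. Ties between equally good pairs are broken by input order,
which is an arbitrary choice.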
1722
1723	As explained elsewhere in the documentation, you can only root the
1724	tree by one of two methods:
1725
1726	1) assume a degree of constancy in the molecular clock and place the
1727	root along the longest branch (internal or external). Methods that
1728	appear to produce rooted trees automatically are often just doing
1729	this without letting you know; this is true of UPGMA.
1730
1731	2) root the tree on biological grounds. The usual method is to
1732	include an "outgroup", a sequence that you are certain will branch
1733	to the outside of the tree.
1734
1735
1736
1737	BOOTSTRAPPING.
1738
1739	Bootstrapping is a general purpose technique that can be used for
1740	placing confidence limits on statistics that you estimate without
1741	any knowledge of the underlying distribution (e.g. a normal or
1742	Poisson distribution). In the case of phylogenetic trees, there are
1743	several analytical methods for placing confidence limits on
1744	groupings (actually on the internal branches) but these are either
1745	restricted to particular tree drawing methods or only work on small
1746	trees of 4 or 5 sequences. Felsenstein (1985) showed how to use
1747	bootstrapping to calculate confidence limits on trees. His approach
1748	is completely general and can be applied to any tree drawing method.
1749	The main assumption of the method in this context is that the sites
1750	in the alignment are independent; this will be true of some sequence
1751	alignments (e.g. pseudogenes) but not others (e.g. rRNAs). What
1752	effect lack of independence will have on the results is not known.
1753
1754	The method works by taking random samples of data from the complete
1755	data set. You compute the test statistic (tree in this case) on
1756	each sample. Variation in the statistic computed from the samples
1757	gives a measure of variation in the statistic which can be used to
1758	calculate confidence intervals. Each random sample is the same size
1759	as the complete data set and is taken WITH REPLACEMENT i.e. a data
1760	point can be selected more than once (or not at all) in any given
1761	sample.
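
The general idea can be sketched in a few lines of Python, using a
simple statistic (the mean) instead of a tree. The interval is read
off with the so-called percentile method, which is only one of
several ways of turning bootstrap samples into confidence limits.
The function name and the default settings are assumptions made for
this example.

```python
import random
import statistics


def bootstrap_interval(data, statistic=statistics.mean,
                       n_samples=1000, level=0.95, seed=0):
    """Percentile bootstrap confidence interval for a statistic."""
    rng = random.Random(seed)
    n = len(data)
    estimates = sorted(
        # Each sample has the same size as the data, drawn WITH replacement.
        statistic([data[rng.randrange(n)] for _ in range(n)])
        for _ in range(n_samples)
    )
    tail = (1.0 - level) / 2.0
    lower = estimates[int(tail * (n_samples - 1))]
    upper = estimates[int((1.0 - tail) * (n_samples - 1))]
    return lower, upper


if __name__ == "__main__":
    values = [2.1, 3.4, 1.9, 5.6, 4.2, 3.8, 2.7, 4.9]
    print(bootstrap_interval(values))
```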
1762
1763	In the case of an alignment N residues long, each random sample is a
1764	random selection of N sites from the alignment. For each sample, we
1765	calculate a distance matrix and tree in the usual way. Variation in
1766	the sample trees compared to a tree calculated from the full data
1767	set gives an indication of how well supported the tree is by the
1768	data. If the sample trees are very similar to each other and to the
1769	full tree, then the tree is "strongly" supported; if the sample
1770	trees show great variation, then the tree will be weakly supported.
1771	In practice, you usually find some parts of a tree well supported,
1772	others weakly. This can be seen by counting how often each
1773	monophyletic group in the full tree occurs in the sample trees.
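
The resampling of alignment sites and the counting of a grouping can
be sketched in Python as follows. To keep the example short, the
tree-building step is replaced by a toy stand-in (closest_pair) that
just reports which two sequences differ least; in a real analysis
you would build a full tree from each sample and count every
monophyletic group. All function names here are invented for the
example.

```python
import random


def bootstrap_sample(alignment, rng):
    """Draw N columns WITH replacement from an alignment of length N."""
    n_sites = len(alignment[0])
    columns = [rng.randrange(n_sites) for _ in range(n_sites)]
    return ["".join(seq[c] for c in columns) for seq in alignment]


def closest_pair(alignment):
    """Toy stand-in for tree building: the pair with fewest differences."""
    best = None
    for i in range(len(alignment)):
        for j in range(i + 1, len(alignment)):
            diff = sum(a != b for a, b in zip(alignment[i], alignment[j]))
            if best is None or diff < best[0]:
                best = (diff, frozenset((i, j)))
    return best[1]


def bootstrap_support(alignment, n_samples=1000, seed=0):
    """Fraction of bootstrap samples that recover the full-data grouping."""
    rng = random.Random(seed)
    reference = closest_pair(alignment)
    hits = 0
    for _ in range(n_samples):
        if closest_pair(bootstrap_sample(alignment, rng)) == reference:
            hits += 1
    return reference, hits / n_samples


if __name__ == "__main__":
    example = ["ACGTACGTAC", "ACGTACGTAA", "ACGAACCTTC", "TCGAACCTTC"]
    print(bootstrap_support(example, n_samples=200))
```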
1774
1775	For a particular grouping, one considers it to be significant at the
1776	95% level (P <= 0.05) if it occurs in at least 95% of the bootstrap samples.
1777	If a grouping is significant, it is significant with respect to the
1778	particular data set and method used for drawing the tree.
1779	Biological "significance" is another matter.
1780
1781
1782	PHYLIP.
1783
1784	The Phylip package was written by Joe Felsenstein, University of
1785	Washington, USA. It provides Pascal source code for a large number
1786	of programs for doing most types of phylogenetic analyses. The
1787	Phylip format alignments produced by this program can be used by all
1788	of the Phylip programs, version 3.4 or later (March 1991). It is
1789	freely available from him as follows.
1790
1791
1792
1793	================= PHYLIP information sheet =====================
1794
1795	PHYLIP - Phylogeny Inference Package (version 3.3)
1796
1797	This is a FREE package of programs for inferring phylogenies and
1798	carrying out certain related tasks. At present it contains 28
1799	programs, which carry out different algorithms on different kinds of
1800	data. The programs in the package are:
1801
1802	---------- Programs for molecular sequence data ----------
1803	PROTPARS  Protein parsimony
1804	DNAPARS   Parsimony method for DNA
1805	DNAMOVE   Interactive DNA parsimony
1806	DNAPENNY  Branch and bound for DNA
1807	DNABOOT   Bootstraps DNA parsimony
1808	DNACOMP   Compatibility for DNA
1809	DNAINVAR  Phylogenetic invariants
1810	DNAML     Maximum likelihood method
1811	DNAMLK    DNAML with molecular clock
1812	DNADIST   Distances from sequences
1813	RESTML    ML for restriction sites
1814
1815	----------- Programs for distance matrix data ------------
1816	FITCH     Fitch-Margoliash and least-squares methods
1817	KITSCH    Fitch-Margoliash and least squares methods with
1818	          evolutionary clock
1819
1820	--- Programs for gene frequencies and continuous characters --
1821	CONTML    Maximum likelihood method
1822	GENDIST   Computes genetic distances
1823
1824	------------- Programs for discrete state data -----------
1825	MIX       Wagner, Camin-Sokal, and mixed parsimony criteria
1826	MOVE      Interactive Wagner, C-S, mixed parsimony program
1827	PENNY     Finds all most parsimonious trees by branch-and-bound
1828	BOOT      Bootstrap confidence interval on mixed parsimony methods
1829	DOLLOP, DOLMOVE, DOLPENNY, DOLBOOT same as preceding four
1830	          programs, but for the Dollo and polymorphism parsimony
1831	          criteria
1832	CLIQUE    Compatibility method
1833	FACTOR    recode multistate characters
1834
1835	---- Programs for plotting trees and consensus trees ----
1836	DRAWGRAM  Draws cladograms and phenograms on screens, plotters and
1837	          printers
1838	DRAWTREE  Draws unrooted phylogenies on screens, plotters and
1839	          printers
1840	CONSENSE  Majority-rule and strict consensus trees
1841
1842	The package includes extensive documentation files that provide the
1843	information necessary to use and modify the programs.
1844
1845	COMPATIBILITY: The programs are written in a very standard subset of
1846	Pascal, a language that is available on most computers (including
1847	microcomputers). The programs require only trivial modifications to
1848	run on most machines: for example they work with only minor
1849	modifications with Turbo Pascal, and without modifications on VAX
1850	VMS Pascal. Pascal source code is distributed in the regular version
1851	of PHYLIP: compiled object code is not. To use that version, you
1852	must have a Pascal compiler.
1853
1854	DISKETTE DISTRIBUTION: The package is distributed in a variety of
1855	microcomputer diskette formats. You should send FORMATTED
1856	diskettes, which I will return with the package written on them.
1857	Unfortunately, I cannot write any Apple formats. See below for how
1858	many diskettes to send. The programs on the magnetic tape or
1859	electronic network versions may of course also be moved to
1860	microcomputers using a terminal program.
1861
1862	PRECOMPILED VERSIONS: Precompiled executable programs for PCDOS
1863	systems are available from me. Specify the "PCDOS executable
1864	version" and send the number of extra diskettes indicated below.
1865	An Apple Macintosh version with precompiled code is available from
1866	Willem Ellis, Instituut voor Taxonomische Zoologie, Zoologisch
1867	Museum, Universiteit van Amsterdam, Plantage Middenlaan 64, 1018DH
1868	Amsterdam, Netherlands, who asks that you send 5 800K diskettes.
1869
1870	HOW MANY DISKETTES TO SEND: The following table shows for different
1871	PCDOS formats how many diskettes to send, and how many extra
1872	diskettes to send for the PCDOS executable version:
1873
1874	Diskette size   Density   For source code   For executables, send
1875	                                            in addition
1876	3.5 inch        1.44 Mb   2                 1
1877	5.25 inch       1.2 Mb    2                 2
1878	3.5 inch        720 Kb    4                 2
1879	5.25 inch       360 Kb    7                 4
1880
1881	Some other formats are also available. You MUST tell me EXACTLY
1882	which of these formats you need. The diskettes MUST be formatted by
1883	you before being sent to me. Sending an extra diskette may be
1884	helpful.
1885
1886	NETWORK DISTRIBUTION: The package is also available by distribution
1887	of the files directly over electronic networks, and by anonymous ftp
1888	from evolution.genetics.washington.edu. Contact me by electronic
1889	mail for details.
1890
1891	TAPE DISTRIBUTION: The programs are also distributed on a magnetic
1892	tape provided by you (which should be a small tape and need only be
1893	able to hold two megabytes) in the following format: 9-track, ASCII,
1894	odd parity, unlabelled, 6250 bpi (unless otherwise indicated).
1895	Logical record: 80 bytes, physical record: 3200 bytes (i.e. blocking
1896	factor 40). There are a total of 71 files. The first one describes
1897	the contents of the package.
1898
1899	POLICIES: The package is distributed free. I do not make it
1900	available or support it in South Africa. The package will be
1901	written on the diskettes or tape, which will be mailed back. They
1902	can be sent to:
1903
1904	                                                Joe Felsenstein
1905	Electronic mail addresses:                      Department of Genetics SK-50
1906	 Internet:    joe@genetics.washington.edu       University of Washington
1907	 Bitnet/EARN: felsenst@uwavm                    Seattle, Washington 98195
1908	 UUCP:        uw-beaver!evolution.genetics!joe  U.S.A.
1909
1910
1911	===================== End of Phylip Info. Sheet ====================
1912
1913
1914
1915
1916	REFERENCES.
1917
1918	Dayhoff, M.O., Schwartz, R.M. and Orcutt, B.C. (1978) in Atlas of
1919	Protein Sequence and Structure, Vol. 5 supplement 3, Dayhoff, M.O.
1920	(ed.), NBRF, Washington, p. 345.
1921
1922	Felsenstein, J. (1985) Confidence limits on phylogenies: an
1923	approach using the bootstrap. Evolution 39, 783-791.
1924
1925	Feng, D.-F. and Doolittle, R.F. (1987) Progressive sequence
1926	alignment as a prerequisite to correct phylogenetic trees.
1927	J.Mol.Evol. 25, 351-360.
1928
1929	Gotoh, O. (1982) An improved algorithm for matching biological
1930	sequences. J.Mol.Biol. 162, 705-708.
1931
1932	Gribskov, M., McLachlan, A.D. and Eisenberg, D. (1987) Profile
1933	analysis: detection of distantly related proteins. PNAS USA 84,
1934	4355-4358.
1935
1936	Higgins, D.G. and Sharp, P.M. (1988) CLUSTAL: a package for
1937	performing multiple sequence alignments on a microcomputer. Gene
1938	73, 237-244.
1939
1940	Higgins, D.G. and Sharp, P.M. (1989) Fast and sensitive multiple
1941	sequence alignments on a microcomputer. CABIOS 5, 151-153.
1942
1943	Kimura, M. (1980) A simple method for estimating evolutionary
1944	rates of base substitutions through comparative studies of
1945	nucleotide sequences. J. Mol. Evol. 16, 111-120.
1946
1947	Kimura, M. (1983) The Neutral Theory of Molecular Evolution.
1948	Cambridge University Press, Cambridge, England.
1949
1950	Li, W.-H., Wu, C.-I. and Luo, C.-C. (1985) A new method for
1951	estimating synonymous and nonsynonymous rates of nucleotide
1952	substitution considering the relative likelihood of nucleotide and
1953	codon changes. Mol.Biol.Evol. 2, 150-174.
1954
1955	Myers, E.W. and Miller, W. (1988) Optimal alignments in linear
1956	space. CABIOS 4, 11-17.
1957
1958	Pearson, W.R. and Lipman, D.J. (1988) Improved tools for biological
1959	sequence comparison. PNAS USA 85, 2444-2448.
1960
1961	Saitou, N. and Nei, M. (1987) The neighbor-joining method: a new
1962	method for reconstructing phylogenetic trees. Mol.Biol.Evol. 4,
1963	406-425.
1964
1965	Sneath, P.H.A. and Sokal, R.R. (1973) Numerical Taxonomy. Freeman,
1966	San Francisco.
1967
1968	Sokal, R.R. and Michener, C.D. (1958) A statistical method for
1969	evaluating systematic relationships. Univ.Kansas Sci.Bull. 38,
1970	1409-1438.
1971
1972	Vingron, M. and Argos, P. (1991) Motif recognition and alignment
1973	for many sequences by comparison of dot matrices. J.Mol.Biol. 218,
1974	33-43.
1975
1976	Wilbur, W.J. and Lipman, D.J. (1983) Rapid similarity searches of
1977	nucleic acid and protein data banks. PNAS USA 80, 726-730.
1978
