1
2
3
4                Clustal V  Multiple Sequence Alignments.
5
6                Documentation (Installation and Usage).
7
8                Des Higgins
9                European Molecular Biology Laboratory
10                Postfach 10.2209
11                D-6900 Heidelberg
12                Germany.
13
14                higgins@EMBL-Heidelberg.DE
15
16
17*******************************************************************
18
19
20                Contents.
21
22
23                1               Overview
24
25                2               Installation
26
27                3               Interactive usage
28
29                4               Command-line interface
30
31                5               Algorithms and references
32
33
34*******************************************************************
35
36                1.  Overview
37
38This document describes how to install and use ClustalV on various
39machines.  ClustalV is a complete upgrade and rewrite of the Clustal
40package of multiple alignment programs (Higgins and Sharp, 1988 and
411989).   The original programs were written in Fortran for
42microcomputers running MSDOS.   You carried out a complete alignment
43by running 3 programs in succession.   Later, these were merged into
44a single menu driven program with on-line help, for VAX/VMS. 
45ClustalV was written in C and has all of the features of the old
46programs plus many new ones.  It has been compiled and tested using
47VAX/VMS C, Decstation ULTRIX C, Gnu C for Sun workstations, Turbo C
48for IBM PC's and Think C for Apple Mac's.   The original Clustal was
49written by Des Higgins while he was a Post-Doc in the lab of Paul
50Sharp in the Genetics Department, Trinity College, Dublin 2,
51Ireland.
52
The main feature of the old package was the ability to carry out
reliable multiple alignments of many sequences.  The sensitivity of
the program is as good as that of any other program we have tried,
with the exception of the programs of Vingron and Argos (1991), while
it works in reasonable time on a microcomputer.  The programs of
Vingron and Argos are specialised for finding distant similarities
between proteins but require mainframes or workstations and are more
difficult to use.
61
62The main new features are: profile alignments (alignments of old
63alignments); phylogenetic trees (Neighbor Joining trees calculated
64after multiple alignment with a bootstrapping option); better
65sequence input (automatically recognise and read NBRF/PIR, Pearson
66(Fasta) or EMBL/SwissProt formats); flexible alignment output
67(choose one of: old Clustal format, NBRF/PIR, GCG msf format or
68Phylip format); full command line interface (everything that you can
69do interactively can be specified on the command line).
70
71In version 7 of the GCG package, there is a program called PILEUP
72which uses a very similar algorithm to the one in ClustalV.  There
73are 2 main differences between the programs: 1) the metric used to
74compare the sequences for the initial "guide tree" uses a full
75global, optimal alignment in PILEUP instead of the fast, approximate
76ones in ClustalV.  This makes PILEUP much slower for the comparison
77of long sequences.  In principle, the distances calculated from
78PILEUP will be more sensitive than ours, but in practice it will not
79make much difference, except in difficult cases.  2)  During the
80multiple alignment, terminal gaps are penalised in ClustalV but not
81in PILEUP.  This will make the PILEUP alignments better when the
82sequences are of very different lengths (has no effect if there are
83no large terminal gaps).   
84
85
86This software may be distributed and used freely, provided that you
87do not modify it or this documentation in any way without the
88permission of the authors. 
89
90If you wish to refer to ClustalV, please cite:
Higgins,D.G. Bleasby,A.J. and Fuchs,R. (1991) CLUSTAL V: improved software
for multiple sequence alignment. CABIOS, vol. 8, 189-191.
93
94The overall multiple alignment algorithm was described in:
95Higgins,D.G. and Sharp,P.M. (1989).  Fast and sensitive multiple
96sequence alignments on a microcomputer.  CABIOS, vol. 5, 151-153.
97
98
99ACKNOWLEDGEMENTS.
100
101D.H. would particularly like to thank Paul Sharp, in whose lab. this
102work originated.  We also thank Manolo Gouy, Gene Myers, Peter Rice
103and Martin Vingron for suggestions, bug-fixes and help.   
104
105Des Higgins and Rainer Fuchs,
106EMBL Data Library, Heidelberg, Germany.
107
108Alan Bleasby, 
109Daresbury, UK.
110
111JUNE 1991
112*******************************************************************
113
114                2.  Installation.
115
116
117
118As far as possible, we have tried to make ClustalV portable to any
119machine with a standard C compiler (proposed ANSI C standard).  The
120source code, as supplied by us, has been compiled and tested using
121the following compilers:
122
123VAX/VMS C
124Ultrix C (on a Decstation 2100)
125Gnu C on a Sun 4 workstation
126Think C on an Apple Macintosh SE
127Turbo C on an IBM AT.
128
129In each case, one must make 1 change to 1 line of code in 1 header
130file.  This is described below.  The exact capacity of the program
131(how many sequences of what length can be aligned) will depend of
132course on available memory but can also be set in this header file.
133
134The package comes as 9 C source files; 3 header files; 1 file of on-
135line help; this documentation file; 3 make files:
136
137Source code:    clustalv.c, amenu.c, gcgcheck.c, myers.c, sequence.c,
138                        showpair.c, trees.c, upgma.c, util.c
139
140Header files:   clustalv.h, general.h, matrices.h
141
142On-Line help:   clustalv.hlp  (must be renamed or defined as           
143                        clustalv_help except on PC's)
144
145Documentation:  clustalv.txt (this file).
146
147Makefiles:      makefile.sun (gnu c on Sun), vmslink.com (vax/vms),
148                        makefile.ult (ultrix).
149
150
151
152
153
154
155
Before compiling ClustalV you must look at and possibly change
clustalv.h, shown below.
158
/*******************CLUSTALV.H********************************/

/*
Main header file for ClustalV. Uncomment ONE of the following lines
depending on which compiler you wish to use.
*/

#define VMS 1             /* VAX VMS */

/*#define MAC 1           Think_C for MacIntosh */

/*#define MSDOS 1         Turbo C for PC's */

/*#define UNIX 1          Ultrix for Decstations or Gnu C for Sun */

/*************************************************************/

#include "general.h"

#define MAXNAMES          10
#define MAXTITLES         60
#define FILENAMELEN      256

#define UNKNOWN   0
#define EMBLSWISS 1
#define PIR       2
#define PEARSON   3

#define PAGE_LEN       22

#if VMS
#define DIRDELIM ']'
#define MAXLEN          3000
#define MAXN             150
#define FSIZE          15000
#define LINELENGTH        60
#define GCG_LINELENGTH    50

#elif MAC
#define DIRDELIM ':'
#define MAXLEN          2600
#define MAXN              30
#define FSIZE          10000
#define LINELENGTH        50
#define GCG_LINELENGTH    50

#elif MSDOS
#define DIRDELIM '\\'
#define MAXLEN          1300
#define MAXN              30
#define FSIZE           5000
#define LINELENGTH        50
#define GCG_LINELENGTH    50

#elif UNIX
#define DIRDELIM '/'
#define MAXLEN         3000
#define MAXN             50
#define FSIZE         15000
#define LINELENGTH       60
#define GCG_LINELENGTH   50
#endif
/*****************end*of*CLUSTALV.H***************************/
222
223
224
First, you must uncomment one of the first 10 lines of the header.
There are 4 'define' compiler directives there (e.g. #define VMS 1),
one for each supported system.  Choose the one that matches the
system you wish to work on, remove the comment markers around it (if
it is commented out) and put comment markers around any of the
others that are still active, so that exactly one remains active.  If
you wish to use a different system, you will need to insert a new
line with a new keyword (which you must invent) to identify your
system.  Most of the rest of this header file is taken up with a
block of 'define' statements for each system type; e.g. the VAX/VMS
block is:
235
#if VMS
#define DIRDELIM ']'
#define MAXLEN          3000
#define MAXN             150
#define FSIZE          15000
#define LINELENGTH        60
#define GCG_LINELENGTH    50
243
In this block, you can specify the maximum number of sequences to be
allowed (MAXN) and the maximum sequence length, including gaps
(MAXLEN).  FSIZE declares the size of some workspace used by the
fast 2 sequence comparison routines and should be APPROXIMATELY 4
times MAXLEN.  LINELENGTH is the length of the blocks of alignment
output in the output files; GCG_LINELENGTH is the same but for the
GCG compatible output only.  Finally, DIRDELIM is the character used
to specify directories and subdirectories in file names.  It should
be the character used to separate the file name itself from the
directory name (e.g. in VMS, file names are like:
$drive:[dir1.dir2.dir3]filename.ext;2  so ']' is used as DIRDELIM).
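
To illustrate the role of DIRDELIM, here is a minimal sketch of how a
file name could be split from its directory.  The function file_part
is hypothetical (it is not taken from the ClustalV sources); it only
assumes that the file name is whatever follows the last occurrence of
the delimiter.

```c
#include <string.h>

/* Hypothetical helper (not part of ClustalV): return a pointer to the
   file name part of `path`, i.e. the text after the last occurrence
   of the directory delimiter `dirdelim`.  If the delimiter does not
   occur, the whole path is taken to be the file name. */
const char *file_part(const char *path, char dirdelim)
{
    const char *last = strrchr(path, dirdelim);
    return last ? last + 1 : path;
}
```

Called as file_part(name, DIRDELIM), the same code would serve on each
system because only the delimiter character differs.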
255
So, if you want to use a system not covered in clustalv.h, you will
have to insert a new block like the one above.  To compile and link
the program, we supply 3 makefiles: one each for VAX/VMS, Ultrix
and GNU C for Sun workstations.
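
As a sketch of what such an addition might look like, the fragment
below defines a block for an imaginary system.  The keyword MYSYS and
every value in it are invented for illustration.  In the real header
the block would be added as a further '#elif MYSYS' branch before the
final '#endif', and the other system keywords would be commented out.

```c
/* Hypothetical addition for a system not listed in clustalv.h.
   The keyword MYSYS and all of the values are invented. */
#define MYSYS 1

#if MYSYS
#define DIRDELIM '/'           /* separates directory from file name */
#define MAXLEN          2000   /* longest sequence, including gaps */
#define MAXN              40   /* maximum number of sequences */
#define FSIZE           8000   /* workspace: about 4 times MAXLEN */
#define LINELENGTH        60   /* width of alignment output blocks */
#define GCG_LINELENGTH    50   /* same, for GCG compatible output */
#endif
```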
260
261 
262
263VAX/VMS
264
265Compile and link the program with the
266supplied makefile for vms: vmslink.com .
267
268$ @vmslink
269
270This will produce clustalv.exe (and a lot of .obj files which you can delete). 
271
272The on-line help file (clustalv.hlp) should be 'defined' as
273clustalv_help as follows:
274
275$ def clustalv_help $drive:[dir1.dir2]clustalv.hlp
276
277where $drive is the drive designation and [dir1.dir2] is the
278directory where clustalv.hlp is kept. 
279
280To make use of the command-line interface, you must make clustalv a
281'foreign' command with:
282
283$ clustalv :== $$drive:[dir1.dir2]clustalv
284
285where $drive is the drive designation and [dir1.dir2] is the
286directory where clustalv.exe is kept. 
287
288
289
290IBM PC/MSDOS/TURBO C
291
Create a makefile (something.prj) with the names of the source files
(clustalv.c, amenu.c etc.) and 'make' this using the HUGE memory
model.  You will get half a dozen warnings from the compiler about
pieces of code that look suspicious to it but ignore these.  The
help file should remain as clustalv.hlp.  To run the program using
the default settings in clustalv.h, you need approximately 500k of
memory.  MAXLEN is the main influence on memory usage; reduce MAXLEN
to reduce memory usage.
300
301
302
303Apple Mac/THINK_C version 4.0.2
304
This version of the program is not at all Mac-like.  It runs in a
window, the inside of which looks just like a normal character based
terminal.  In the future we might put a proper Mac interface on it
but do not have the time right now.  With the default settings in
the header file clustalv.h, you need just over 800k of memory to run
the program.  To reduce this, reduce MAXLEN; this is easily the
biggest influence on memory usage.  To compile the program and save
it as an application you need to 'set the application type'; here
you specify how much memory (in kilobytes (k)) the application will
need.  You should set this to 900k to run the application as it is
OR reduce MAXLEN in the header.  To compile the program you have to
create a 'project'; you 'add' the names of the 9 source files to the
project AND the name of the ANSI library.  The source code is too
large to compile in one compilation unit.  You will get a 'link
error: code segment too big' if you try to compile and link as is.
You should compile amenu.c (the biggest source file) as a separate
unit; you will have to read the manual, ask someone or mail me to
find out what this is.
323
324
325*******************************************************************
326
327                3.  Interactive usage.
328
329
330
Interactive usage of Clustal V is completely menu driven.  On-line
help is provided and defaults are offered for all parameters and file
names.  With a little effort it should be completely self
explanatory.   The main menu, which appears when you run the
program, is shown below.  Each item brings you to a sub menu.
336
337
338
339Main menu for Clustal V:
340
341
342     1. Sequence Input From Disc
343     2. Multiple Alignments
344     3. Profile Alignments
345     4. Phylogenetic trees
346
347     S. Execute a system command
348     H. HELP
349     X. EXIT (leave program)
350
351
352Your choice:
353
354
355
356The options S and H appear on all the main menus.  H will provide
357help and if you type S you will be asked to enter a command, such as
358DIR or LS, which will be sent to the system (does not work on
359Mac's).  Before carrying out an alignment, you must use option 1
360(sequence input); the format for sequences is explained below. 
361Under menu item 2 you will be able to automatically align your
362sequences to each other.  Menu item 3 allows you to do profile
363alignments.  These are alignments of old alignments.  This allows
364you to build up a multiple alignment in stages or add a new sequence
365to an old alignment.   You can calculate phylogenetic trees from
366alignments using menu item 4.
367
368
369
370
371      ******************************
372      *       SEQUENCE INPUT.      *
373      ******************************
374
375
376All sequences should be in 1 file.  Three formats are automatically
377recognised and used: NBRF/PIR, EMBL/SwissProt and FASTA (Pearson and
378Lipman (1988) format).   
379
380***
381Users of the Wisconsin GCG package should use the command TONBRF
382(recently changed to TOPIR) to reformat their sequences before use.
383***
384
385Sequences can be in upper or lower case.  For proteins, the only
386symbols recognised are:  A,C,D,E,F,G,H,I,K,L,M,N,P,Q,R,S,T,V,W,Y and
387for DNA/RNA use: A,C,G and T (or U).  Any other letters of the
388alphabet will be treated as X (proteins) or N (DNA/RNA) for unknown. 
389All other symbols (blanks, digits etc.) will be ignored EXCEPT for
390the hyphen "-" which can be used to specify a gap.  This last point
391is especially useful for 2 reasons: 1) you can fix the positions of
392some gaps in advance; 2) the alignment output from this program can
393be written out in NBRF format using "-"'s to specify gaps; these
394alignments can be used again as input, either for profile alignments
395or for phylogenetic trees.
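
The symbol rules above can be sketched as a single function.  This is
an illustration only, not the program's own input routine; the name
clean_symbol is invented, and folding lower case to upper case and
leaving U as U are assumptions made for the example.

```c
#include <ctype.h>
#include <string.h>

/* Hypothetical helper (not ClustalV's own input routine).
   Returns the cleaned-up symbol for input character c, or 0 if the
   character should be ignored.  is_dna selects the DNA/RNA alphabet. */
char clean_symbol(char c, int is_dna)
{
    if (c == '-')
        return '-';                      /* hyphen specifies a gap */
    if (!isalpha((unsigned char)c))
        return 0;                        /* blanks, digits etc. are ignored */
    c = (char)toupper((unsigned char)c); /* assumption: case is folded */
    if (is_dna)
        return strchr("ACGTU", c) ? c : 'N';
    return strchr("ACDEFGHIKLMNPQRSTVWY", c) ? c : 'X';
}
```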
396
397If you are using an editor to create sequence files, use the FASTA
398format as it is by far the simplest (see below).  If you have access
399to utility programs for generating/converting the NBRF/PIR format
400then use it in preference.
401
402
403
FASTA (PEARSON AND LIPMAN, 1988) FORMAT:     The sequences are
delimited by an angle bracket ">" in column 1.  The text immediately
after the ">" is used as a title.  Everything on the following lines
until the next ">" or the end of the file is one sequence.
408
409e.g.
410
411> RABSTOUT   rabbit Guinness receptor
412   LKMHLMGHLKMGLKMGLKGMHLMHLKHMHLMTYTYTTYRRWPLWMWLPDFGHAS
413   ADSCVCAHGFAVCACFAHFDVCFGAVCFHAVCFAHVCFAAAVCFAVCAC
414> MUSNOSE   mouse nose drying factor
415    mhkmmhkgmkhmhgmhmhglhmkmhlkmgkhmgkmkytytytryrwtqtqwtwyt
416    fdgfdsgafdagfdgfsagdfavdfdvgavfsvfgvdfsvdgvagvfdv
417> HSHEAVEN    human Guinness receptor repeat
418 mhkmmhkgmkhmhgmhmhg   lhmkmhlkmgkhmgkmk  ytytytryrwtqtqwtwyt
419 fdgfdsgafdagfdgfsag   dfavdfdvgavfsvfgv  dfsvdgvagvfdv
420 mhkmmhkgmkhmhgmhmhg   lhmkmhlkmgkhmgkmk  ytytytryrwtqtqwtwyt
421 fdgfdsgafdagfdgfsag   dfavdfdvgavfsvfgv  dfsvdgvagvfdv
422
423
424
NBRF/PIR FORMAT         is similar to FASTA format but immediately
after the ">", you find the characters "P1;" if the sequences are
protein or "DL;" if they are nucleic acid.  Clustalv looks for the
";" character as the third character after the ">".  If it finds one,
it assumes that the format is NBRF; if not, FASTA format is assumed.
The text after the ";" is treated as a sequence name while the
entire next line is treated as a title.  The sequence is terminated
by a star "*" and the next sequence can then begin (with a >P1;
etc.).  This is just the basic format description (there are other
variations and rules).
435
436ANY files/sequences in GCG format can be converted to this format
437using the TONBRF command (now TOPIR) of the Wisconsin GCG package.
438
439
440e.g.
441
442>P1;RABSTOUT
443rabbit Guinness receptor
444LKMHLMGHLKMGLKMGLKGMHLMHLKHMHLMTYTYTTYRRWPLWMWLPDFGHAS
445ADSCVCAHGFAVCACFAHFDVCFGAVCFHAVCFAHVCFAAAVCFAVCAC*
446>P1;MUSNOSE   
447mouse nose drying factor
448mhkmmhkgmkhmhgmhmhglhmkmhlkmgkhmgkmkytytytryrwtqtqwtwyt
449fdgfdsgafdagfdgfsagdfavdfdvgavfsvfgvdfsvdgvagvfd
450*
451>P1;HSHEAVEN   
452human Guinness receptor repeat protein.
453mhkmmhkgmkhmhgmhmhg   lhmkmhlkmgkhmgkmk  ytytytryrwtqtqwtwyt
454fdgfdsgafdagfdgfsag   dfavdfdvgavfsvfgv  dfsvdgvagvfdv
455mhkmmhkgmkhmhgmhmhg   lhmkmhlkmgkhmgkmk  ytytytryrwtqtqwtwyt
456fdgfdsgafdagfdgfsag   dfavdfdvgavfsvfgv  dfsvdgvagvfdv*
457
458
459 
460
461EMBL/SWISSPROT FORMAT:       Do not try to create files with this
462format unless you have utilities to help.  If you are just using an
463editor, use one of the above formats.  If you do use this format,
464the program will ignore everything between the ID line (line
465beginning with the characters "ID") and the SQ line.  The sequence
466is then read from between the SQ line and the "//" characters.
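
The recognition rules for the three formats can be summarised in a
short sketch.  The function guess_format is hypothetical and looks
only at the first non-blank line of a file; the format codes reuse
the names and values defined in clustalv.h.  The real input routines
may apply further checks.

```c
#include <string.h>

/* Format codes with the same names and values as in clustalv.h. */
#define UNKNOWN   0
#define EMBLSWISS 1
#define PIR       2
#define PEARSON   3

/* Hypothetical sketch: guess the file format from the first
   non-blank line of a sequence file. */
int guess_format(const char *line)
{
    if (line[0] == '>') {
        /* ">P1;NAME": the third character after '>' is ';' */
        if (strlen(line) >= 4 && line[3] == ';')
            return PIR;
        return PEARSON;             /* any other '>' line is FASTA */
    }
    if (strncmp(line, "ID", 2) == 0)
        return EMBLSWISS;           /* entries begin with an ID line */
    return UNKNOWN;
}
```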
467
468
469
It is critically important for the program to know whether it is
aligning DNA or protein sequences.  The input routines attempt to
guess which type of sequence is being used by counting the number of
A, C, G, T or U residues in the sequences.  If the total is more than
85% of the sequence length then DNA is assumed.  If you use very
bizarre sequences (proteins with really strange aa compositions or
DNA sequences with loads of strange ambiguity codes) you might
confuse the program.  It is difficult to do but be careful.
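
A minimal sketch of the 85% rule follows.  The function name is
invented, and the sketch assumes that "sequence length" means the
number of letters (gaps, blanks and digits are not counted); the
program itself may count differently.

```c
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical sketch of the 85% rule.
   Returns 1 if seq looks like DNA/RNA, 0 if it looks like protein. */
int looks_like_dna(const char *seq)
{
    size_t letters = 0, nucleotides = 0;

    for (; *seq; seq++) {
        if (!isalpha((unsigned char)*seq))
            continue;                 /* assumption: only letters count */
        letters++;
        if (strchr("ACGTU", toupper((unsigned char)*seq)))
            nucleotides++;
    }
    /* nucleotides / letters > 0.85, written with integers only */
    return letters > 0 && nucleotides * 100 > letters * 85;
}
```

A DNA sequence in which more than 15% of the letters are ambiguity
codes would be taken for protein by this test, which is the kind of
confusion the paragraph above warns about.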
478
479
480
481
482
483      ******************************
484      *  MULTIPLE ALIGNMENT MENU.  *
485      ******************************
486
487The multiple alignment menu is shown below.  Before explaining how
488to use it, you must be introduced briefly to the alignment strategy.
489If you do not follow this, try using option 1 anyway; the entire
490process will be carried out automatically.
491
To do a complete multiple alignment, we need to know the approximate
relationships of the sequences to each other (which ones are most
similar to each other).  We do this by calculating a crude
phylogenetic tree which we call a dendrogram (to distinguish it from
the more sensitive trees available under the phylogenetic tree
menu).   This dendrogram is used as a guide to align bigger and
bigger groups of sequences during the multiple alignment.  The
dendrogram is calculated in 2 stages: 1) all pairs of sequences are
compared using the fast/approximate method of Wilbur and Lipman
(1983); the result of each comparison is a similarity score. 2) the
similarity scores are used to construct the dendrogram using the
UPGMA cluster analysis method of Sneath and Sokal (1973).
504
505The construction of the dendrogram can be very time consuming if you
506wish to align many sequences (e.g. for 100 sequences you need to
507carry out 100x99/2 sequence comparisons = 4950). During every
508multiple alignment, a dendrogram is constructed and saved to a file
509(something.dnd).  These can be reused later.
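
The count of comparisons quoted above is N(N-1)/2 for N sequences.
As a small illustrative function (the name pair_count is invented):

```c
/* Number of pairwise comparisons needed for n sequences: n(n-1)/2. */
long pair_count(long n)
{
    return n * (n - 1) / 2;
}
```

Doubling the number of sequences therefore roughly quadruples the
work of building the dendrogram, which is why reusing a saved
dendrogram file is worthwhile.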
510
511
512
513
514
515
516
517
518******Multiple*Alignment*Menu******
519
520
521    1.  Do complete multiple alignment now
522    2.  Produce dendrogram file only
523    3.  Use old dendrogram file
524    4.  Pairwise alignment parameters
525    5.  Multiple alignment parameters
526    6.  Output format options
527
528    S.  Execute a system command
529    H.  HELP
530    or press [RETURN] to go back to main menu
531
532
533Your choice:
534
535
536So, if in doubt, and you have already loaded some sequences from the
537main menu, just try option 1 and press the <Return> key in response
538to any questions.  You will be prompted for 2 file names e.g. if the
539sequence input file was called DRINK.PEP, you will be offered
540DRINK.ALN as the file to contain the alignment and DRINK.DND for the
541dendrogram. 
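
The default names offered are formed by swapping the extension of the
input file name.  The sketch below shows one way this could be done;
default_name is a hypothetical helper, and it assumes the name has no
directory part (on VMS a '.' may also occur inside the directory
name, which this simple version does not handle).

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: copy `input` to `out`, replacing its extension
   (the text after the last '.') with `ext`; if there is no extension,
   `ext` is appended.  Assumes `input` has no directory part.
   Returns 0 on success, -1 if `out` is too small. */
int default_name(const char *input, const char *ext,
                 char *out, size_t outsize)
{
    const char *dot = strrchr(input, '.');
    size_t stem = dot ? (size_t)(dot - input) : strlen(input);
    int n = snprintf(out, outsize, "%.*s.%s", (int)stem, input, ext);

    return (n < 0 || (size_t)n >= outsize) ? -1 : 0;
}
```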
542
543If you wish to repeat a multiple alignment (e.g. to experiment with
544different gap penalties) but do not wish to make a dendrogram all
545over again use menu item 3 (providing you are using the same
546sequences).  Similarly, menu item 2 allows you to produce the
547dendrogram file only.
548
549
550
551
552PAIRWISE ALIGNMENT PARAMETERS:     
553
554The parameters that control the initial fast/approximate comparisons
555can be set from menu item 4 which looks like:
556
557
558 ********* WILBUR/LIPMAN PAIRWISE ALIGNMENT PARAMETERS *********
559
560
561     1. Toggle Scoring Method  :Percentage
562     2. Gap Penalty            :3
563     3. K-tuple                :1
564     4. No. of top diagonals   :5
565     5. Window size            :5
566
567     H. HELP
568
569
570Enter number (or [RETURN] to exit):
571
572
573
574The similarity scores are calculated from fast alignments generated
575by the method of Wilbur and Lipman (1983).  These are 'hash' or
576'word' or 'k-tuple' alignments carried out in 3 stages. 
577
578First you mark the positions of every fragment of sequence, K-tuple
579long (for proteins, the default length is 1 residue, for DNA it is 2
580bases) in both sequences.  Then you locate all k-tuple matches
581between the 2 sequences.   At this stage you have to imagine a dot-
582matrix plot between the 2 sequences with each k-tuple match as a
583dot.   You find those diagonals in the plot with most matches (you
584take the "No. of top diagonals" best ones) and mark all diagonals
585within "Window size" of each top diagonal.  This process will define
586diagonal bands in the plot where you hope the most likely regions of
587similarity will lie. 
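
The idea of counting k-tuple matches along diagonals can be sketched
as follows.  This is not the ClustalV routine: it compares every pair
of positions directly instead of using a hash table of k-tuples, so
it is slow, and the function name and diagonal numbering are chosen
for the example.  Picking the "No. of top diagonals" largest counts
and widening each by "Window size" would follow from this array.

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical sketch, not the ClustalV routine.  Counts k-tuple
   matches on every diagonal of the dot-matrix plot of a against b.
   A match at positions (i, j) lies on diagonal i - j + (length of
   b - 1), so diagonals are numbered from 0 to
   (length of a) + (length of b) - 2.  Returns a malloc'd array of
   counts (the caller frees it) and stores its length in *ndiag;
   returns NULL if a sequence is shorter than k or memory runs out. */
int *diagonal_counts(const char *a, const char *b, int k, int *ndiag)
{
    int la = (int)strlen(a), lb = (int)strlen(b);
    int *count;
    int i, j;

    if (k < 1 || la < k || lb < k)
        return NULL;
    *ndiag = la + lb - 1;
    count = calloc((size_t)*ndiag, sizeof *count);
    if (count == NULL)
        return NULL;
    for (i = 0; i + k <= la; i++)
        for (j = 0; j + k <= lb; j++)
            if (strncmp(a + i, b + j, (size_t)k) == 0)
                count[i - j + lb - 1]++;   /* one more dot on this diagonal */
    return count;
}
```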
588
The final alignment stage is to find that head to tail arrangement
of k-tuple matches from these diagonal regions that will give the
highest score.  The score is calculated as the number of exactly
matching residues in this alignment minus a "gap penalty" for every
gap that was introduced.  When you toggle "Scoring method" you
choose between expressing these similarity scores as raw scores or
as a percentage of the shorter sequence length.
596
597K-TUPLE SIZE:   Can be 1 or 2 for proteins; 1 to 4 for DNA. 
598Increase this to increase speed; decrease to improve sensitivity.
599
600GAP PENALTY:    The number of matching residues that must be found
601in order to introduce a gap.  This should be larger than K-Tuple
602Size.  This has little effect on speed or sensitivity.
603
604NO. OF TOP DIAGONALS:    The number of best diagonals in the
605imaginary dot-matrix plot that are considered.  Decrease (must be
606greater than zero) to increase speed; increase to improve
607sensitivity.
608
609WINDOW SIZE:    The number of diagonals around each "top" diagonal
610that are considered.   Decrease for speed; increase for greater
611sensitivity.
612
613SCORING METHOD: The similarity scores may be expressed as raw scores
614(number of identical residues minus a "gap penalty" for each gap) or
615as percentage scores.  If the sequences are of very different
616lengths, percentage scores make more sense.
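
The two scoring methods can be written out as a short sketch.  The
function names are invented for illustration; the arithmetic follows
the description above (identical residues minus a penalty per gap,
optionally expressed as a percentage of the shorter sequence length).

```c
/* Raw score: identical residues minus a penalty for each gap. */
int raw_score(int matches, int gaps, int gap_penalty)
{
    return matches - gaps * gap_penalty;
}

/* Percentage score: the raw score as a percentage of the length of
   the shorter of the two sequences (0 if either length is not
   positive). */
double percent_score(int raw, int len1, int len2)
{
    int shorter = len1 < len2 ? len1 : len2;
    return shorter > 0 ? 100.0 * raw / shorter : 0.0;
}
```

For example, 50 identical residues and 2 gaps with a gap penalty of 3
give a raw score of 44; for sequences of length 100 and 88 that is a
percentage score of 50.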
617
618
619
620CHANGING THE PAIRWISE ALIGNMENT PARAMETERS
621
The main reason for wanting to change the above parameters is SPEED
(especially on microcomputers), NOT SENSITIVITY.   The dendrograms
that are produced can only show the relationships between the
sequences APPROXIMATELY because the similarity scores are calculated
from separate pairwise alignments, not from a multiple alignment
(that is what we eventually hope to produce).  If the groupings of
the sequences are "obvious", the above method should work well; if
the relationships are obscure or weakly represented by the data, it
will not make much difference playing with the parameters.  The main
factor influencing speed is the K-TUPLE SIZE followed by the WINDOW
SIZE.
633
634The alignments are carried out in a small amount of memory. 
635Occasionally (it is hard to predict), you will run out of memory
636while doing these alignments; when this happens, it will say on the
637screen: "Sequences (a,b) partially aligned" (instead of "Sequences
638(a,b) aligned").  This means that the alignment score for these
639sequences will be approximate;  it is not a problem unless many of
640the alignments do this.  It can be fixed by using less sensitive
641parameters or increasing parameter FSIZE in clustalv.h .
642
643
644THE DENDROGRAM ITSELF
645
The similarity scores generated by the fast comparison of all the
sequences are used to construct a dendrogram by the UPGMA method of
Sneath and Sokal (1973).  This is a form of cluster analysis and the
end result produces something that looks like a tree.  It represents
the similarity of the sequences as a hierarchy.  The dendrogram is
written to a file in a machine readable format and is shown below
for an example with 6 sequences.
653
654
    91.0   0   0   2   012000         ! seq 2 joins seq 3 at 91% ID.
    72.0   1   0   3   011200         ! seq 4 joins seqs 2,3 at 72%
    71.1   0   0   2   000012         ! seq 5 joins seq 6 at 71%
    35.5   0   2   4   122200         ! seq 1 joins seqs 2,3,4
    21.7   4   3   6   111122         ! seqs 1,2,3,4 join seqs 5,6
660
661This LOOKS complicated but you do not normally need to care what is
662in here.  Anyway, each row represents the joining together of 2 or
663more sequences.  You progress from the top down, joining more and
664more sequences until all are joined together; for N sequences you
665have N-1 groupings hence there are 5 rows in the above file (there
666were 6 sequences).  In each row, the first number is the similarity
667score of this grouping; ignore the next three columns for the
668moment; the last 6 digits in the line show which sequences are
669grouped; there is one digit for each sequence (the first digit is
670for the first sequence).  The rule is:  in each row, all of the "1"s
671join all of the "2"s; the zero's do nothing.   
672
673Hence, in the first row, sequence 2 joins sequence 3 at a similarity
674level of 91% identity; next, sequence 4 joins the previous grouping
675of 2 plus 3 at a level of 72% etc.   This is shown diagrammatically
676below.  Before leaving the dendrogram format, the other 3 columns of
677numbers are: a pointer to the row from which the "1" sequences were
678last joined (or zero if only one of them); a pointer to the row in
679which the "2"s were last joined; the total number of sequences
680joined in this line.
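
For readers who want to process a dendrogram file, the row layout
described above can be read with a sketch like the following.  The
structure and function names are invented, and the check that the
number of non-zero digits equals the stated group size is an extra
sanity test added for the example, not something the program is
known to do.

```c
#include <stdio.h>
#include <string.h>

#define MAXN 150   /* largest MAXN used in clustalv.h (VAX/VMS) */

/* Hypothetical record for one row of a dendrogram file. */
struct dnd_row {
    double score;          /* similarity score of this grouping */
    int row_of_ones;       /* row where the "1"s were last joined, or 0 */
    int row_of_twos;       /* row where the "2"s were last joined, or 0 */
    int total;             /* number of sequences joined in this row */
    char groups[MAXN + 1]; /* one digit ('0', '1' or '2') per sequence */
};

/* Parse one line such as "91.0   0   0   2   012000".  Anything after
   the digit string is left unread.  Returns 0 on success, -1 if the
   line does not have the expected layout. */
int parse_dnd_row(const char *line, struct dnd_row *row)
{
    int fields = sscanf(line, "%lf %d %d %d %150[012]", &row->score,
                        &row->row_of_ones, &row->row_of_twos,
                        &row->total, row->groups);
    size_t i, joined = 0;

    if (fields != 5)
        return -1;
    for (i = 0; row->groups[i] != '\0'; i++)
        if (row->groups[i] != '0')
            joined++;
    /* the digits must agree with the stated group size */
    return joined == (size_t)row->total ? 0 : -1;
}
```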
681
682
683
684
685                      I------ 2
686               I------I
687               I      I------ 3  Diagram of the sequence similarity
688          I----I
689          I    I------------- 4  relationships shown in the above
690       I--I
691       I  I------------------ 1  dendrogram file (branch lengths are
692   ----I
693       I       I------------- 5  not to scale).
694       I-------I
695               I------------- 6
696
697
698
699
700
701
702
703
704
705MULTIPLE ALIGNMENT PARAMETERS:
706
707
708Having calculated a dendrogram between a set of sequences, the final
709multiple alignment is carried out by a series of alignments of
710larger and larger groups of sequences.  The order is determined by
711the dendrogram so that the most similar sequences get aligned first. 
712Any gaps that are introduced in the early alignments are fixed. 
713When two groups of sequences are aligned against each other, a full
714protein weight matrix (such as a Dayhoff PAM 250) is used.  Two gap
715penalties are offered: a "FIXED" penalty for opening up a gap and a
716"FLOATING" penalty for extending a gap. 
717
718
719 ********* MULTIPLE ALIGNMENT PARAMETERS *********
720
721
722     1. Fixed Gap Penalty       :10
723     2. Floating Gap Penalty    :10
724     3. Toggle Transitions (DNA):Weighted
725     4. Protein weight matrix   :PAM 250
726
727     H. HELP
728
729
730Enter number (or [RETURN] to exit):
731
732
FIXED GAP PENALTY:   Reduce this to encourage gaps of all sizes;
increase it to discourage them.   Terminal gaps are penalised the
same as all others.  BEWARE of making this too small (approx 5 or
so); if the penalty is too small, the program may prefer to align
each sequence opposite one long gap.

FLOATING GAP PENALTY:   Reduce this to encourage longer gaps;
increase it to shorten them.   Terminal gaps are penalised the same
as all others.  BEWARE of making this too small (approx 5 or so); if
the penalty is too small, the program may prefer to align each
sequence opposite one long gap.
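
To see how the two penalties interact, consider one common way of
combining an opening and an extension penalty.  The formula below is
an assumption made for illustration: this documentation does not
state exactly how ClustalV combines the two values, for instance
whether the floating penalty is also charged for the first position
of a gap.

```c
/* Hypothetical affine gap cost (an assumed form, not confirmed by the
   ClustalV documentation): the fixed penalty is charged once per gap
   and the floating penalty once for every position in the gap. */
int gap_cost(int fixed_penalty, int floating_penalty, int gap_length)
{
    if (gap_length <= 0)
        return 0;                      /* no gap, no cost */
    return fixed_penalty + floating_penalty * gap_length;
}
```

Under this assumed form, lowering the fixed penalty makes every gap
cheaper by the same amount, while lowering the floating penalty
favours long gaps in particular, which matches the advice given above
for the two parameters.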
744
745
DNA TRANSITIONS = WEIGHTED or UNWEIGHTED:   By default, transitions
(A versus G; C versus T) are weighted more strongly than
transversions (an A aligned with a G will be preferred to an A
aligned with a C or a T).  You can make all pairs of nucleotides
equally weighted with this option.
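
The distinction between transitions and transversions can be stated
as a small test.  The function is_transition is a hypothetical
helper; treating U as T is an assumption made so that RNA sequences
are handled in the same way.

```c
#include <ctype.h>

/* Returns 1 if a and b form a transition pair (A/G or C/T, with U
   treated as T), 0 otherwise (identical bases, transversions and
   any other symbols). */
int is_transition(char a, char b)
{
    int x = toupper((unsigned char)a);
    int y = toupper((unsigned char)b);

    if (x == 'U') x = 'T';
    if (y == 'U') y = 'T';
    return (x == 'A' && y == 'G') || (x == 'G' && y == 'A') ||
           (x == 'C' && y == 'T') || (x == 'T' && y == 'C');
}
```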
751
752PROTEIN WEIGHT MATRIX:  For protein comparisons, a weight matrix is
753used to differentially weight different pairs of aligned amino
754acids.  The default is the well known Dayhoff PAM 250 matrix.  We
755also offer a PAM 100 matrix, an identity matrix (all weights are the
756same for exact matches) or allow you to give the name of a file with
757your own matrix.  The weight matrices used by Clustal V are shown in
758full in the Algorithms and References section of this documentation. 
759
If you input a matrix from a file, it must be in the following
format.  Use a 20x20 matrix only (entries for the 20 "normal" amino
acids only; no ambiguity codes etc.).  Input the lower left triangle
of the matrix, INCLUDING the diagonal.  The order of the amino acids
(rows and columns) must be: CSTPAGNDEQHRKMILVFYW.  The values can be
in free format separated by spaces (not commas).  The PAM 250 matrix
is shown below in this format.
767
  12
   0  2
  -2  1  3
  -3  1  0  6
  -2  1  1  1  2
  -3  1  0 -1  1  5
  -4  1  0 -1  0  0  2
  -5  0  0 -1  0  1  2  4
  -5  0  0 -1  0  0  1  3  4
  -5 -1 -1  0  0 -1  1  2  2  4
  -3 -1 -1  0 -1 -2  2  1  1  3  6
  -4  0 -1  0 -2 -3  0 -1 -1  1  2  6
  -5  0  0 -1 -1 -2  1  0  0  1  0  3  5
  -5 -2 -1 -2 -1 -3 -2 -3 -2 -1 -2  0  0  6
  -2 -1  0 -2 -1 -3 -2 -2 -2 -2 -2 -2 -2  2  5
  -6 -3 -2 -3 -2 -4 -3 -4 -3 -2 -2 -3 -3  4  2  6
  -2 -1  0 -1  0 -1 -2 -2 -2 -2 -2 -2 -2  2  4  2  4
  -4 -3 -3 -5 -4 -5 -4 -6 -5 -5 -2 -4 -5  0  1  2 -1  9
   0 -3 -3 -5 -3 -5 -2 -4 -4 -4  0 -4 -4 -2 -1 -1 -2  7 10
  -8 -2 -5 -6 -6 -7 -4 -7 -7 -5 -3  2 -3 -4 -5 -2 -6  0  0 17
788
789Values must be integers and can be all positive or positive and
790negative as above.  These are SIMILARITY values. 
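
If you want to check a matrix file before giving it to the program,
the format rules above can be tested with a short Python sketch.
The function name is hypothetical and this is not the program's own
input routine; it only applies the rules stated in the text (210
whitespace-separated integers, lower left triangle with diagonal,
residue order CSTPAGNDEQHRKMILVFYW).

```python
AMINO_ACIDS = "CSTPAGNDEQHRKMILVFYW"  # row/column order required above


def parse_lower_triangle(text: str) -> dict:
    """Turn a lower-left triangle (diagonal included) into a symmetric table."""
    # Free format: any whitespace separates values.  A token such as
    # "-4," is not an integer, so int() rejects comma-separated input.
    values = [int(token) for token in text.split()]
    n = len(AMINO_ACIDS)
    expected = n * (n + 1) // 2  # 210 values for a 20x20 triangle
    if len(values) != expected:
        raise ValueError(f"expected {expected} integers, got {len(values)}")
    scores = {}
    k = 0
    for i in range(n):
        for j in range(i + 1):
            a, b = AMINO_ACIDS[i], AMINO_ACIDS[j]
            scores[a, b] = scores[b, a] = values[k]
            k += 1
    return scores
```

Reading the file contents into a string and passing it to
parse_lower_triangle either returns a table that can be looked up
as scores["C", "W"], or raises ValueError for a wrong count or a
non-integer value.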



ALIGNMENT OUTPUT OPTIONS:

By default, the alignment goes to a file in a self-explanatory
"blocked" alignment format.  This format is fine for displaying the
results but requires heavy editing if you wish to use the alignment
with other software.  To help, we provide 3 other formats which can
be turned on or off.  If you have a sequence data set or alignment
in memory, you can also ask for output files in whatever formats are
turned on, NOW.  The menu you use to choose format is shown below.

***
We draw your attention to NBRF/PIR format in particular.  This
format is EXACTLY the same as one of the input formats.  Therefore,
alignments written in this format can be used again as input (to the
profile alignments or phylogenetic trees).
***


 ********* Format of Alignment Output *********


     1. Toggle CLUSTAL format output   =  ON
     2. Toggle NBRF/PIR format output  =  OFF
     3. Toggle GCG format output       =  OFF
     4. Toggle PHYLIP format output    =  OFF

     5. Create alignment output file(s) now?
     H. HELP


Enter number (or [RETURN] to exit):



CLUSTAL FORMAT:     This is a self-explanatory alignment.  The
alignment is written out in blocks.  Identities are highlighted and
(if you use a PAM 250 matrix) positions in the alignment where all
of the residues are "similar" to each other (PAM 250 score of 8 or
more) are indicated.

NBRF/PIR FORMAT:    This is the usual NBRF/PIR format with gaps
indicated by hyphens ("-").  As we have stressed before, this format
is EXACTLY compatible with the sequence input format.  Therefore you
can read in these alignments again for profile alignments or for
calculating phylogenetic trees.

GCG FORMAT:         In version 7 of the Wisconsin GCG package, a new
multiple sequence format was introduced.  This is the MSF (Multiple
Sequence Format) format.  It can be used as input to the GCG
sequence editor or any of the GCG programs that make use of multiple
alignments.   THIS FORMAT IS ONLY SUPPORTED IN VERSION 7 OF THE GCG
PACKAGE OR LATER.

PHYLIP FORMAT:      This format can be used by the Phylip package of
Joe Felsenstein (see the references/algorithms section for details
of how to get it).  Phylip allows you to do a huge range of
phylogenetic analyses (we just offer one method in this program) and
is probably the most widely used set of programs for drawing trees.
It also works on just about every computer you can think of,
providing you have a decent Pascal compiler.




      ******************************
      *   PROFILE ALIGNMENT MENU.  *
      ******************************



This menu is for taking two old alignments (or single sequences) and
aligning them with each other.  The result is one bigger alignment.
The menu is very similar to the multiple alignment menu except that
there is no mention of dendrograms here (they are not needed) and
you need to input two sets of sequences.  The menu looks like this:



******Profile*Alignment*Menu******


    1.  Input 1st. profile/sequence
    2.  Input 2nd. profile/sequence
    3.  Do alignment now
    4.  Alignment parameters
    5.  Output format options

    S.  Execute a system command
    H.  HELP
    or press [RETURN] to go back to main menu


Your choice:


You must input profile number 1 first.   When both profiles are
loaded, use item 3 (Do alignment now) and the 2 profiles will be
aligned.  Items 4 and 5 (parameters and output options) are
identical to the equivalent options on the multiple alignment menu.

The same input routines that are used for general input are used
here i.e. sequences must be in NBRF/PIR, EMBL/SwissProt or FASTA
format, with gaps indicated by hyphens ("-").  This is why we have
continually drawn your attention to the NBRF/PIR format as a useful
output format.

Either profile can consist of just one sequence.   Therefore, if you
have a favourite alignment of sequences that you are working on and
wish to add a new sequence, you can use this menu, provided the
alignment is in the correct format.

The total number of sequences in the two profiles must be less than
or equal to the MAXN parameter set in the clustalv.h header file.




      ******************************
      *   PHYLOGENETIC TREE MENU.  *
      ******************************


This menu allows you to input an alignment and calculate a
phylogenetic tree.  You can also calculate a tree if you have just
carried out a multiple alignment and the alignment is still in
memory.  THE SEQUENCES MUST BE ALIGNED ALREADY!!!!!!   The tree will
look strange if the sequences are not already aligned.  You can also
"BOOTSTRAP" the tree to show confidence levels for groupings.  This
is SLOW on microcomputers but works fine on workstations or
mainframes.



******Phylogenetic*tree*Menu******


    1.  Input an alignment
    2.  Exclude positions with gaps?        = OFF
    3.  Correct for multiple substitutions? = OFF
    4.  Draw tree now
    5.  Bootstrap tree

    S.  Execute a system command
    H.  HELP
    or press [RETURN] to go back to main menu


Your choice:



The same input routine that is used for general input is used here
i.e. sequences must be in NBRF/PIR, EMBL/SwissProt or FASTA format,
with gaps indicated by hyphens ("-").  This is why we have
continually drawn your attention to the NBRF/PIR format as a useful
output format.

If you have input an alignment, then just use item 4 to draw a tree.
The method used is the Neighbor Joining method of Saitou and Nei
(1987).  This is a "distance method".  First, percent divergence
figures are calculated between all pairs of sequences.  These
divergence figures are then used by the NJ method to give the tree.
Example trees will be shown below.
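
As a rough illustration of the first step, the Python sketch below
computes a divergence figure for one pair of aligned sequences.  It
assumes that a position is skipped for a pair when either of the two
sequences has a gap there; the function names are hypothetical and
this is not the program's own code.

```python
GAP = "-"


def divergence(seq_a: str, seq_b: str):
    """Return (fraction of differing sites, number of sites compared)."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must come from the same alignment")
    compared = differences = 0
    for a, b in zip(seq_a.upper(), seq_b.upper()):
        if a == GAP or b == GAP:
            continue  # assumption: gapped positions are skipped for this pair
        compared += 1
        if a != b:
            differences += 1
    if compared == 0:
        raise ValueError("no ungapped positions in common")
    return differences / compared, compared


def distance_matrix(seqs):
    """Symmetric matrix of divergences for a list of aligned sequences."""
    n = len(seqs)
    dist = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            dist[i][j] = dist[j][i] = divergence(seqs[i], seqs[j])[0]
    return dist
```

For example, divergence("ACGT-A", "ACCTTA") compares 5 sites, finds
1 difference and returns (0.2, 5).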

There are two options which can be used to control the way the
distances are calculated.  These are set by options 2 and 3 in the
menu.

EXCLUDE POSITIONS WITH GAPS?   This option allows you to ignore all
alignment positions (columns) where there is a gap in ANY sequence.
This guarantees that "like" is compared with "like" in all distances
i.e. the same positions are used to calculate all distances.  It
also means that the distances will be "metric".  The disadvantage of
using this option is that you throw away much of the data if there
are many gaps.  If the total number of gaps is small, it has little
effect.
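
What this option does to an alignment can be sketched as follows
(Python; the function name is hypothetical and gaps are assumed to
be written as hyphens, as elsewhere in this document).

```python
def toss_gap_columns(alignment):
    """Drop every column that has a gap ('-') in ANY sequence."""
    if len({len(seq) for seq in alignment}) > 1:
        raise ValueError("all aligned sequences must have the same length")
    # zip(*alignment) walks the alignment column by column.
    kept = [column for column in zip(*alignment) if "-" not in column]
    if not kept:
        return ["" for _ in alignment]
    # Transpose the kept columns back into one string per sequence.
    return ["".join(residues) for residues in zip(*kept)]
```

With the three sequences "AC-GT", "A-CGT" and "ACCGT", columns 2 and
3 each contain a gap, so every sequence is reduced to "AGT".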

CORRECT FOR MULTIPLE SUBSTITUTIONS?    As sequences diverge,
substitutions accumulate.  It becomes increasingly likely that more
than one substitution (as a result of a mutation) will have happened
at a site where you observe just one difference now.  This option
allows you to use formulae developed by Motoo Kimura to correct for
this effect.  It has the effect of stretching long branches in trees
while leaving short ones relatively untouched.  The desired effect
is to make distances proportional to time since divergence.
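
The formulae themselves are not given in this part of the
documentation.  Purely as an illustration of the "stretching"
effect, the Python sketch below uses one correction commonly
attributed to Kimura for protein distances, K = -ln(1 - D - D*D/5),
where D is the observed fractional divergence.  That the program
uses exactly this expression is an assumption here, and the function
name is hypothetical.

```python
import math


def kimura_protein_distance(d: float) -> float:
    """Correct an observed divergence d (a fraction, 0 to 1) for multiple hits.

    Assumed formula: K = -ln(1 - d - d*d/5).
    """
    inner = 1.0 - d - 0.2 * d * d
    if inner <= 0.0:
        raise ValueError("divergence too large for this correction")
    return -math.log(inner)
```

Small distances are almost unchanged by this formula, while large
ones grow considerably, which is the behaviour described above.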

The tree is sent to a file called BLAH.NJ, where BLAH.SEQ is the
name of the input alignment file.  An example is shown below for 6
globin sequences.



 DIST   = percentage divergence (/100)
 Length = number of sites used in comparison

   1 vs.   2  DIST = 0.5683;  length =    139
   1 vs.   3  DIST = 0.5540;  length =    139
   1 vs.   4  DIST = 0.5315;  length =    111
   1 vs.   5  DIST = 0.7447;  length =    141
   1 vs.   6  DIST = 0.7571;  length =    140
   2 vs.   3  DIST = 0.0897;  length =    145
   2 vs.   4  DIST = 0.1391;  length =    115
   2 vs.   5  DIST = 0.7517;  length =    145
   2 vs.   6  DIST = 0.7431;  length =    144
   3 vs.   4  DIST = 0.0957;  length =    115
   3 vs.   5  DIST = 0.7379;  length =    145
   3 vs.   6  DIST = 0.7361;  length =    144
   4 vs.   5  DIST = 0.7304;  length =    115
   4 vs.   6  DIST = 0.7368;  length =    114
   5 vs.   6  DIST = 0.2697;  length =    152


                        Neighbor-joining Method

 Saitou, N. and Nei, M. (1987) The Neighbor-joining Method:
 A New Method for Reconstructing Phylogenetic Trees.
 Mol. Biol. Evol., 4(4), 406-425


 This is an UNROOTED tree

 Numbers in parentheses are branch lengths


 Cycle   1     =  SEQ:   5 (  0.13382) joins  SEQ:   6 (  0.13592)

 Cycle   2     =  SEQ:   1 (  0.28142) joins Node:   5 (  0.33462)

 Cycle   3     =  SEQ:   2 (  0.05879) joins  SEQ:   3 (  0.03086)

 Cycle   4 (Last cycle, trichotomy):

                 Node:   1 (  0.20798) joins
                 Node:   2 (  0.02341) joins
                  SEQ:   4 (  0.04915)



The output file first shows the percent divergence (distance)
figures between each pair of sequences.  Then a description of a NJ
tree is given.  This description shows which sequences (SEQ:) or
which groups of sequences (Node: , a node is numbered using the
lowest sequence that belongs to it) join at each level of the tree.
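
To give an idea of what happens in one "cycle", here is a minimal
Python sketch of a single Neighbor Joining step in its usual
textbook form.  It is not taken from the program's source; the
function names are hypothetical and details such as tie-breaking
may differ from what the program does.

```python
def nj_step(dist):
    """Pick the pair that Neighbor Joining would join next.

    dist is a symmetric matrix (list of lists) with n >= 3 rows.
    Returns (i, j, branch_length_i, branch_length_j).
    """
    n = len(dist)
    if n < 3:
        raise ValueError("need at least 3 sequences or nodes")
    totals = [sum(row) for row in dist]
    best = None
    for i in range(n):
        for j in range(i + 1, n):
            # Criterion minimised by NJ: (n - 2) * d(i,j) - r(i) - r(j)
            q = (n - 2) * dist[i][j] - totals[i] - totals[j]
            if best is None or q < best[0]:
                best = (q, i, j)
    _, i, j = best
    length_i = dist[i][j] / 2 + (totals[i] - totals[j]) / (2 * (n - 2))
    length_j = dist[i][j] - length_i
    return i, j, length_i, length_j


def join(dist, i, j):
    """Replace i and j by their new node; the node becomes the last row."""
    keep = [k for k in range(len(dist)) if k not in (i, j)]
    to_node = [(dist[i][k] + dist[j][k] - dist[i][j]) / 2 for k in keep]
    reduced = [[dist[a][b] for b in keep] + [to_node[x]]
               for x, a in enumerate(keep)]
    reduced.append(to_node + [0.0])
    return reduced
```

Calling nj_step and then join repeatedly, until only 3 rows remain,
gives the series of cycles ending in the final trichotomy.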

This is an unrooted tree!!  This means that the direction of
evolution through the tree is not shown.  This can only be inferred
in one of two ways:
1) assume a degree of constancy in the molecular clock and place the
root (bottom of the tree; the point where all the sequences radiate
from) half way along the longest branch.     **OR**
2) use an "outgroup", a sequence from an organism that you "know"
must be outside of the rest of the sequences i.e. root the tree
manually, on biological grounds.

The above tree can be represented diagrammatically as follows:


                          SEQ 1       SEQ 4
                           I           I
          13.6             I 28.1      I 4.9          5.9
  SEQ 6 ----------I        I           I          I--------- SEQ 2
                  I        I           I          I
                  I--------I-----------I----------I
          13.4    I  33.5      20.8         2.3   I   3.1
  SEQ 5 ----------I                               I--------- SEQ 3


The figures along each branch are percent divergences along that
branch.  If you root the tree by placing the root along the longest
branch (33.5%) then you can draw it again as follows, this time
rooted:



                        13.6
                I-------------------- SEQ 6
      I---------I       13.4
      I         I-------------------- SEQ 5
      I 33.5
 -----I                 28.1
      I         I-------------------- SEQ 1
      I         I
      I---------I                4.9
                I  20.8  I----------- SEQ 4
                I--------I
                         I       5.9
                         I 2.3 I----- SEQ 2
                         I-----I 3.1
                               I----- SEQ 3



The longest branch (33.5% between 5,6 and 1,2,3,4) is split between
the 2 bottom branches of the tree.  As it happens in this particular
case, sequences 5 and 6 are myoglobins while sequences 1,2,3 and 4
are alpha and beta globins, so you could also justify the above
rooting on biological grounds.  If you do not have any particular
need or evidence for the position of the root, then LEAVE THE TREE
UNROOTED.  Unrooted trees do not look as pretty as rooted ones but
it is usual to leave them unrooted if you do not have any evidence
for the position of the root.


BOOTSTRAPPING:    Different sets of sequences and different tree
drawing methods may give different topologies (branching orders) for
parts of a tree that are weakly supported by the data.  It is useful
to have an indication of the degree of error in the tree.  There are
several ways of doing this, some of them rather technical.  We
provide one general purpose method in this program, which makes use
of a technique called bootstrapping (see Felsenstein, 1985).

In the case of sequence alignments, bootstrapping involves taking
random samples of positions from the alignment.  If the alignment
has N positions, each bootstrap sample consists of a random sample
of N positions, taken WITH REPLACEMENT i.e. in any given sample,
some sites may be sampled several times, others not at all.  Then,
with each sample of sites, you calculate a distance matrix as usual
and draw a tree.  If the data very strongly support just one tree
then the sample trees will be very similar to each other and to the
original tree, drawn without bootstrapping.  However, if parts of
the tree are not well supported, then the sample trees will vary
considerably in how they represent these parts.
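
The sampling step can be sketched in Python as follows.  The
function names are hypothetical, and Python's random number
generator is not the one used by the program, so a given seed will
not reproduce the program's own samples.

```python
import random


def bootstrap_sample(alignment, rng):
    """Resample the columns of an alignment WITH REPLACEMENT."""
    n_sites = len(alignment[0])
    # Draw N column indices from N columns; repeats are allowed.
    picks = [rng.randrange(n_sites) for _ in range(n_sites)]
    return ["".join(seq[p] for p in picks) for seq in alignment]


def bootstrap_samples(alignment, trials=1000, seed=99):
    """Yield `trials` resampled alignments; same seed, same samples."""
    rng = random.Random(seed)
    for _ in range(trials):
        yield bootstrap_sample(alignment, rng)
```

Each sample has the same number of sequences and positions as the
original alignment; a distance matrix and a tree would then be
calculated from each sample in the usual way.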

In practice, you should use a very large number of bootstrap
replicates (1000 is recommended, even if it means running the
program for an hour on a slow microcomputer; on a workstation it
will be MUCH faster).  For each grouping on the tree, you record the
number of times this grouping occurs in the sample trees.  For a
group to be considered "significant" at the 95% level (or P <= 0.05
in statistical terms) you expect the grouping to show up in >= 95%
of the sample trees.  If this happens, then you can say that the
grouping is significant, given the data set and the method used to
draw the tree.

So, when you use the bootstrap option, a NJ tree is drawn as before
and then you are asked to say how many bootstrap samples you want
(1000 is the default) and you are asked to give a seed number for
the random number generator.  If you give the same seed number in
future, you will get the same results (we hope).  Remember to give
different seed numbers if you wish to carry out genuinely different
bootstrap sampling experiments.  Below is the output file from using
the same data for the 6 globin sequences as used before.  The output
file has the same name as the input file with the extension ".njb".

//
STUFF DELETED  .... same as for the ordinary NJ output
//
                        Bootstrap Confidence Limits


 Random number generator seed =      99

 Number of bootstrap trials   =    1000


 Diagrammatic representation of the above tree:

 Each row represents 1 tree cycle; defining 2 groups.

 Each column is 1 sequence; the stars in each line show 1 group;
 the dots show the other

 Numbers show occurrences in bootstrap samples.

****..   1000
.***..   1000                <- This is the answer!!
*..***    812
122311


For an unrooted tree with N sequences, there are actually only N-3
genuinely different groupings that we can test (this is the number
of "internal branches"; each internal branch splits the sequences
into 2 groups).  In this example, we have 6 sequences with 3
internal branches in the reference tree.  In the bootstrap
resampling, we count how often each of these internal branches
occurs.  Here, we find that the branch which splits 1,2,3 and 4
versus 5 and 6 occurs in all 1000 samples; the branch which splits
2,3 and 4 versus 1,5 and 6 occurs in 1000; the branch which splits 2
and 3 versus 1,4,5 and 6 occurs in 812/1000 samples.  We can put
these figures on to the diagrammatic representation we made earlier
of our unrooted NJ tree as follows:



                          SEQ 1       SEQ 4
                           I           I
                           I           I
  SEQ 6 ----------I        I           I          I--------- SEQ 2
                  I  1000  I   1000    I   812    I
                  I--------I-----------I----------I
                  I                               I
  SEQ 5 ----------I                               I--------- SEQ 3



You can equally put these confidence figures on the rooted tree (in
fact the interpretation is simpler with rooted trees).  With the
unrooted tree, the grouping of sequence 5 with 6 is significant (as
is the grouping of sequences 1,2,3 and 4).  Equally the grouping of
sequences 1,5 and 6 is significant (the same as saying that 2,3 and
4 group significantly).  However, the grouping of 2 and 3 is not
significant, although it is relatively strongly supported.
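
The counting described above can be sketched in Python.  Because
each internal branch splits the sequences into 2 groups, "1,2,3 and
4" and "5 and 6" describe the same branch; the sketch therefore
always stores the side that does not contain sequence 1.  All names
are hypothetical and a tree is represented simply as a list of its
groupings.

```python
def canonical_split(group, n_seqs):
    """Describe an internal branch by the side WITHOUT sequence 1."""
    side = frozenset(group)
    everyone = frozenset(range(1, n_seqs + 1))
    return side if 1 not in side else everyone - side


def support(reference_splits, sample_trees, n_seqs):
    """For each reference branch, count the sample trees that contain it."""
    counts = {canonical_split(s, n_seqs): 0 for s in reference_splits}
    for tree in sample_trees:  # each tree is a list of groupings
        present = {canonical_split(s, n_seqs) for s in tree}
        for split in counts:
            if split in present:
                counts[split] += 1
    return counts


def as_stars(split, n_seqs):
    """One star/dot row, one column per sequence."""
    return "".join("*" if i in split else "." for i in range(1, n_seqs + 1))


def is_significant(count, trials, percent=95):
    """True if the grouping occurs in at least `percent` % of the trials."""
    return count * 100 >= percent * trials
```

With the figures from the example, is_significant(1000, 1000) is
true and is_significant(812, 1000) is false.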

Unfortunately, there is a small complication in the interpretation
of these results.  In statistical hypothesis testing, it is not
valid to make multiple simultaneous tests and to treat the result of
each test completely independently.  In the above case, if you have
one particular test (grouping) that you wish to make in advance, it
is valid to test IT ALONE and to simply show the other bootstrap
figures for reference.  If you do not have any particular test in
mind before you do the bootstrapping, you can just show all of the
figures and use the 95% level as an ARBITRARY cut off to show those
groups that are very strongly supported; but not mention anything
about SIGNIFICANCE testing.  In the literature, it is common
practice to simply show the figures with a tree; they frequently
speak for themselves.



*******************************************************************

                4.  Command Line Interface.



You can do almost everything that can be done from the menus, using
a command line interface.  In this mode, the program will take all
of its instructions as "switches" when you activate it; no questions
will be asked; if there are no errors, the program just does an
analysis and stops.   It does not work so well on the MAC but it is
still possible.  To get you started we will show you the 2 simplest
uses of the command line as it looks on VAX/VMS.  On all other
machines (except the MAC) it works in the same way.

$ clustalv /help           **OR**   $ clustalv /check

Both of the above switches give you a one page summary of the
command line on the screen and then the program stops.


$ clustalv proteins.seq    **OR**   $ clustalv /infile=proteins.seq

This will read the sequences from the file 'proteins.seq' and do a
complete multiple alignment.  Default parameters will be used, the
program will try to tell whether or not the sequences are DNA or
protein and the output will go to a file called 'proteins.aln'.  A
dendrogram file called 'proteins.dnd' will also be created.  Thus
the default action for the program, when it successfully reads in an
input file, is to do a full multiple alignment.  Some further
examples of command line usage will be given later.

Command line switches can be abbreviated but MAKE SURE YOU DO NOT
MAKE THEM AMBIGUOUS.  No attempt will be made to detect ambiguity.
Use enough characters to distinguish each switch uniquely.
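
Since the program itself does not detect ambiguity, you may wish to
check an abbreviation before using it.  The Python sketch below is
an external helper (hypothetical names, not part of Clustal V); it
assumes only that an abbreviation is a leading part of a switch
name, and uses the switch names from the full list in this section.

```python
SWITCHES = [
    "INFILE", "PROFILE1", "PROFILE2", "HELP", "CHECK", "ALIGN", "TREE",
    "BOOTSTRAP", "KTUP", "TOPDIAGS", "WINDOW", "PAIRGAP", "FIXEDGAP",
    "FLOATGAP", "MATRIX", "TYPE", "OUTPUT", "TRANSIT", "KIMURA",
    "TOSSGAPS", "SEED",
]


def matching_switches(abbrev: str):
    """All switch names that start with the given abbreviation."""
    # Accept forms such as "/boot=500": drop the slash and any value.
    word = abbrev.lstrip("/").split("=")[0].upper()
    return [name for name in SWITCHES if name.startswith(word)]


def check_abbreviation(abbrev: str) -> str:
    """Return the one switch an abbreviation stands for, or raise ValueError."""
    hits = matching_switches(abbrev)
    if not hits:
        raise ValueError(f"{abbrev!r} matches no switch")
    if len(hits) > 1:
        raise ValueError(f"{abbrev!r} is ambiguous: {', '.join(hits)}")
    return hits[0]
```

For example, "/boot=500" stands for BOOTSTRAP only, whereas "/tr"
could mean either TREE or TRANSIT and is reported as ambiguous.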




The full list of allowed switches is given below:


                DATA (sequences)

/INFILE=file.ext    :input sequences.  If you give an input file and
                    nothing else as a switch, the default action is
                    to do a complete multiple alignment.  The input
                    file can also be specified by giving it as the
                    first command line parameter with no "/" in
                    front of it e.g. $ clustalv file.ext  .

/PROFILE1=file.ext  :You use these two switches to give the names of
/PROFILE2=file.ext  two profiles.  The default action is to align
                    the two.  You must give the names of both
                    profile files.



                VERBS (do things)

/HELP           :list the command line parameters on the screen.
/CHECK

/ALIGN          :do full multiple alignment.  This is the default
                action if no other switches except for input files
                are given.

/TREE           :calculate NJ tree.  If this is the only action
                specified (e.g. $ clustalv proteins.seq/tree ) it IS
                ASSUMED THAT THE SEQUENCES ARE ALREADY ALIGNED.  If
                the sequences are not already aligned, you should
                also give the /ALIGN switch.  This will align the
                sequences first, output an alignment file and
                calculate the tree in memory.

/BOOTSTRAP(=n)  :bootstrap a NJ tree (n= number of bootstraps;
                default = 1000).  If this is the only action
                specified (e.g. $ clustalv proteins.seq/bootstrap )
                it IS ASSUMED THAT THE SEQUENCES ARE ALREADY ALIGNED.
                If the sequences are not already aligned, you should
                also give the /ALIGN switch.  This will align the
                sequences first, output an alignment file and
                calculate the bootstraps in memory.  You can set the
                number of bootstrap trials here (e.g. /bootstrap=500).
                You can set the seed number for the random number
                generator with /seed=n.



                PARAMETERS (set things)

***Pairwise alignments:***

/KTUP=n         :word size

/TOPDIAGS=n     :number of best diagonals

/WINDOW=n       :window around best diagonals

/PAIRGAP=n      :gap penalty



***Multiple alignments:***

/FIXEDGAP=n     :fixed length gap penalty.

/FLOATGAP=n     :variable length gap penalty.

/MATRIX=        :PAM100 or ID or file name.  The default weight
                matrix for proteins is PAM 250.

/TYPE=p or d    :type is protein or DNA.   This allows you to
                explicitly override the program's attempt at guessing
                the type of the sequence.  It is only useful if you
                are using sequences with a VERY strange composition.

/OUTPUT=        :GCG or PHYLIP or PIR.  The default output is
                Clustal format.

/TRANSIT        :transitions not weighted.  The default is to weight
                transitions as more favourable than other mismatches
                in DNA alignments.  This switch makes all nucleotide
                mismatches equally weighted.


***Trees:***

/KIMURA         :use Kimura's correction on distances.

/TOSSGAPS       :ignore positions with a gap in ANY sequence.

/SEED=n         :seed number for bootstraps.




EXAMPLES:

These examples use the VAX/VMS $ prompt; otherwise, command-line
usage is the same on all machines except the Macintosh.


$ clustalv proteins.seq      OR     $ clustalv /infile=proteins.seq

Read whatever sequences are in the file "proteins.seq" and do a full
multiple alignment; output will go to the files: "proteins.dnd"
(dendrogram) and "proteins.aln" (alignment).


$ clustalv proteins.seq/ktup=2/matrix=pam100/output=pir

Same as last example but use K-Tuple size of 2; use a PAM 100
protein weight matrix; write the alignment out in NBRF/PIR format
(goes to a file called "proteins.pir").


$ clustalv /profile1=proteins.seq/profile2=more.seq/type=p/fixed=11

Take the alignment in "proteins.seq" and align it with "more.seq"
using default values for everything except the fixed gap penalty
which is set to 11.  The sequence type is explicitly set to
PROTEIN.


$ clustalv proteins.pir/tree/kimura

Take the sequences in proteins.pir (they MUST BE ALIGNED ALREADY)
and calculate a phylogenetic tree using Kimura's correction for
distances.


$ clustalv proteins.pir/align/tree/kimura

Same as the previous example, EXCEPT THAT AN ALIGNMENT IS DONE
FIRST.


$ clustalv proteins.seq/align/boot=500/seed=99/tossgaps/type=p

Take the sequences in proteins.seq; they are explicitly set to be
protein; align them; bootstrap a tree using 500 samples and a seed
number of 99.


*******************************************************************

                5.  Algorithms and references.



In this section, we will try to BRIEFLY describe the algorithms used
in ClustalV and give references.  The topics covered are:


        -Multiple alignments

        -Profile alignments

        -Protein weight matrices

        -Phylogenetic trees

                -distances

                -NJ method

                -Bootstrapping

                -Phylip

        -References




MULTIPLE ALIGNMENTS.

The approach used in ClustalV is a modified version of the method of
Feng and Doolittle (1987) who aligned the sequences in larger and
larger groups according to the branching order in an initial
phylogenetic tree.  This approach allows a very useful combination
of computational tractability and sensitivity.

The positions of gaps that are generated in early alignments remain
through later stages.  This can be justified because gaps that arise
from the comparison of closely related sequences should not be moved
because of later alignment with more distantly related sequences.
At each alignment stage, you align two groups of already aligned
sequences.  This is done using a dynamic programming algorithm where
one allows the residues that occur in every sequence at each
alignment position to contribute to the alignment score.  A Dayhoff
(1978) PAM matrix is used in protein comparisons.

The details of the algorithm used in ClustalV have been published in
Higgins and Sharp (1989).  This was an improved version of an
earlier algorithm published in Higgins and Sharp (1988).  First, you
calculate a crude similarity measure between every pair of
sequences.  This is done using the fast, approximate alignment
algorithm of Wilbur and Lipman (1983).  Then, these scores are used
to calculate a "guide tree" or dendrogram, which will tell the
multiple alignment stage in which order to align the sequences for
the final multiple alignment.  This "guide tree" is calculated using
the UPGMA method of Sneath and Sokal (1973).  UPGMA is a fancy name
for one type of average linkage cluster analysis, invented by Sokal
and Michener (1958).
1477Having calculated the dendrogram, the sequences are aligned in
1478larger and larger groups.  At each alignment stage, we use the
1479algorithm of Myers and Miller (1988) for the optimal alignments. 
1480This algorithm is a very memory efficient variation of Gotoh's
1481algorithm (Gotoh, 1982).  It is because of this algorithm that
1482ClustalV can work on microcomputers.   Each of these alignments
1483consists of aligning 2 alignments, using what we call "profile
1484alignments".
1485
1486
1487PROFILE ALIGNMENTS.
1488
1489We use the term "profile alignment" to describe the alignment of 2
1490alignments.  We use this term because the method is a simple
1491extension of the profile method of Gribskov, et al. (1987) for
1492aligning 1 sequence with an alignment.  Normally, with a 2 sequence
1493alignment, you use a weight matrix (e.g. a PAM 250 matrix) to give a
1494score between the pairs of aligned residues.  The alignment is
1495considered "optimal" if it gives the best total score for aligned
1496residues minus penalties for any gaps (insertions or deletions) that
1497must be introduced. 
1498
1499Profile alignments are a simple extension of 2 sequence alignments
1500in that you can treat each of the two input alignments as single
1501sequences but you calculate the score at aligned positions as the
1502average weight matrix score of all the residues in one alignment
1503versus all those in the other e.g. if you have 2 alignments with I
1504and J sequences respectively; the score at any position is the
1505average of all the I times J scores of the residues compared
1506seperately.  Any gaps that are introduced are placed in all of the
1507sequences of an alignment at the same position.  The profile
1508alignments offered in the "profile alignment menu" are also
1509calculated in this way.
1510
1511
1512PROTEIN WEIGHT MATRICES.
1513
1514There are 3 built-in weight matrices used by clustalV.  These are
1515the PAM 100 and PAM 250 matrices of Dayhoff (1978) and an identity
1516matrix.  Each matrix is given as the bottom left half, including the
1517diagonal of a 20 by 20 matrix.  The order of the rows and columns is
1518CSTPAGNDEQHRKMILVFYW.
1519
1520
1521PAM 250
1522
1523C  12
1524S   0  2
1525T  -2  1  3
1526P  -3  1  0  6
1527A  -2  1  1  1  2
1528G  -3  1  0 -1  1  5
1529N  -4  1  0 -1  0  0  2
1530D  -5  0  0 -1  0  1  2  4
1531E  -5  0  0 -1  0  0  1  3  4
1532Q  -5 -1 -1  0  0 -1  1  2  2  4
1533H  -3 -1 -1  0 -1 -2  2  1  1  3  6
1534R  -4  0 -1  0 -2 -3  0 -1 -1  1  2  6
1535K  -5  0  0 -1 -1 -2  1  0  0  1  0  3  5
1536M  -5 -2 -1 -2 -1 -3 -2 -3 -2 -1 -2  0  0  6
1537I  -2 -1  0 -2 -1 -3 -2 -2 -2 -2 -2 -2 -2  2  5
1538L  -6 -3 -2 -3 -2 -4 -3 -4 -3 -2 -2 -3 -3  4  2  6
1539V  -2 -1  0 -1  0 -1 -2 -2 -2 -2 -2 -2 -2  2  4  2  4
1540F  -4 -3 -3 -5 -4 -5 -4 -6 -5 -5 -2 -4 -5  0  1  2 -1  9
1541Y   0 -3 -3 -5 -3 -5 -2 -4 -4 -4  0 -4 -4 -2 -1 -1 -2  7 10
1542W  -8 -2 -5 -6 -6 -7 -4 -7 -7 -5 -3  2 -3 -4 -5 -2 -6  0  0 17
1543----------------------------------------------------------------
1544    C  S  T  P  A  G  N  D  E  Q  H  R  K  M  I  L  V  F  Y  W


IDENTITY MATRIX

C  10
S   0 10
T   0  0 10
P   0  0  0 10
A   0  0  0  0 10
G   0  0  0  0  0 10
N   0  0  0  0  0  0 10
D   0  0  0  0  0  0  0 10
E   0  0  0  0  0  0  0  0 10
Q   0  0  0  0  0  0  0  0  0 10
H   0  0  0  0  0  0  0  0  0  0 10
R   0  0  0  0  0  0  0  0  0  0  0 10
K   0  0  0  0  0  0  0  0  0  0  0  0 10
M   0  0  0  0  0  0  0  0  0  0  0  0  0 10
I   0  0  0  0  0  0  0  0  0  0  0  0  0  0 10
L   0  0  0  0  0  0  0  0  0  0  0  0  0  0  0 10
V   0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0 10
F   0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0 10
Y   0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0 10
W   0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0 10
----------------------------------------------------------------
    C  S  T  P  A  G  N  D  E  Q  H  R  K  M  I  L  V  F  Y  W


PAM 100

C  14
S  -1   6
T  -5   2   7
P  -6   1  -1  10
A  -5   2   2   1   6
G  -8   1  -3  -3   1   8
N  -8   2   0  -3  -1  -1   7
D -11  -1  -2  -4  -1  -1   4   8
E -11  -2  -3  -3   0  -2   1   5   8
Q -11  -3  -3  -1  -2  -5  -1   1   4   9
H  -6  -4  -5  -2  -5  -7   2  -1  -2   4  11
R  -6  -1  -4  -2  -5  -8  -3  -6  -5   1   1  10
K -11  -2  -1  -4  -4  -5   1  -2  -2  -1  -3   3   8
M -11  -4  -2  -6  -3  -8  -5  -8  -6  -2  -7  -2   1  13
I  -5  -4  -1  -6  -3  -7  -4  -6  -5  -5  -7  -4   4   2   9
L -12  -7  -5  -5  -5  -8  -6  -9  -7  -3  -5  -7  -6   4   2   9
V  -4  -4  -1  -4   0  -4  -5  -6  -5  -5  -6  -6  -6   1   5   1   8
F -10  -5  -6  -9  -7  -8  -6 -11 -11 -10  -4  -7 -11  -2   0   0  -5  12
Y  -2  -6  -6 -11  -6 -11  -3  -9  -7  -9  -1 -10 -10  -8  -4  -5  -6   6  13
W -13  -4 -10 -11 -11 -13  -8 -13 -14 -11  -7   1  -9 -11 -12  -7 -14  -2  -2  19
---------------------------------------------------------------------------------
    C   S   T   P   A   G   N   D   E   Q   H   R   K   M   I   L   V   F   Y   W


PHYLOGENETIC TREES.

There are two COMMONLY used approaches for inferring phylogenetic
trees from sequence data: parsimony and distance methods. There are
other approaches which are probably superior in theory but which are
yet to be used widely. This does not mean that they are of no use; we
(the authors of this program at any rate) simply do not know enough
about them yet.  You should see the documentation accompanying the
Phylip package and some of the references there for an explanation
of the different methods and what assumptions are implied when you
use them.

There is a constant debate in the literature as to the merits of
different methods but unfortunately, a lot of what is said is
incomprehensible or inaccurate.  It is also a field that is prone to
having highly opinionated schools of thought.  This is a pity
because it prevents rational discussion of the pros and cons of
the different methods.  The approach adopted in ClustalV is to
supply just one method and to produce alignments in a format that
can be used by Phylip.  In simple cases, the trees produced will be
as "good" (reliable, robust) as those from ANY other method.  In
more complicated cases, there is no single magic recipe that we can
supply that will work well in even most situations.

The method we provide is the Neighbor Joining method (NJ) of Saitou
and Nei (1987) which is a distance method.  We use this for three
reasons:  it is conceptually and computationally simple; it is fast;
it gives "good" trees in simple cases. It is difficult to prove that
one tree is "better" than another if you do not know the true
phylogeny; the few systematic surveys of methods show it to work
more or less as well as any other method ON AVERAGE.  Another reason
for using the NJ method is that it is very commonly used; THIS IS A
BAD REASON SCIENTIFICALLY but at least you will not feel lonely if
you use it.

The NJ method works on a matrix of distances (the distance matrix)
between all pairs of sequences to be analysed.  These distances are
related to the degree of divergence between the sequences.  It is
normal to calculate the distances from the sequences after they are
multiply aligned.  If you calculate them from separate alignments
(as done for the dendrograms in another part of this program), you
may increase the error considerably.


DISTANCES

The simplest measure of distance between sequences is percent
divergence (100% minus percent identity).  For two sequences, you
count how many positions differ between them (ignoring all positions
with a gap or an unknown residue) and divide by the number of
positions considered.  It is common practice to also ignore all
positions in the alignment where there is a GAP in ANY of the
sequences (Tossgaps ? option in the menu).  Usually, you express the
percent distance divided by 100 (gives distances between 0.0 and
1.0).
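
The counting rule above can be sketched in Python as follows.  This
is only an illustration: the names percent_distance and
gapped_columns, the use of "-" for a gap and the default unknown
symbol "X" are assumptions of the example (for DNA you would
probably pass "N" instead), not details taken from the ClustalV
source.

```python
GAP = "-"


def percent_distance(seq_a, seq_b, unknown=frozenset("X"), skip=frozenset()):
    """Fraction of differing positions (0.0 to 1.0) between two aligned rows.

    Positions with a gap or an unknown residue in either sequence are
    ignored, as are the column indices listed in `skip`.
    """
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must come from the same alignment")
    compared = differing = 0
    for i, (a, b) in enumerate(zip(seq_a, seq_b)):
        if i in skip or GAP in (a, b) or a in unknown or b in unknown:
            continue
        compared += 1
        if a != b:
            differing += 1
    if compared == 0:
        raise ValueError("no comparable positions")
    return differing / compared


def gapped_columns(alignment):
    """Indices of columns that hold a gap in ANY sequence of the alignment."""
    width = len(alignment[0])
    return frozenset(
        i for i in range(width) if any(row[i] == GAP for row in alignment)
    )


# Toy alignment, made up for the example.
alignment = ["ACGT-A", "ACCTTA", "A-GTTA"]
d_pair = percent_distance(alignment[0], alignment[1])
d_toss = percent_distance(
    alignment[0], alignment[1], skip=gapped_columns(alignment)
)
print(d_pair, d_toss)
```

Passing the result of gapped_columns as `skip` mimics the effect of
ignoring every position with a gap in any sequence, so the two
distances can differ for the same pair.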

This measure of distance is perfectly adequate (with some further
modification described below) for rRNA sequences. However it treats
all residues identically e.g. all amino acid substitutions are
equally weighted. It also treats all positions identically e.g. it
does not take account of different rates of substitution in
different positions of different codons in protein coding DNA
sequences; see Li et al (1985) for a distance measure that does.
Despite these shortcomings, these percent identity distances do work
well in practice in a wide variety of situations.

In a simple world, you would like a distance to be proportional to
the time since the sequences diverged.  If this were EXACTLY true,
then the calculation of the tree would be a simple matter of algebra
(UPGMA does this for you) and the branch lengths would be nice and
meaningful (times).  In practice this OBVIOUSLY depends on the
existence and quality of the "molecular clock", a subject of
ongoing debate.  However, even if there is a good clock, there is a
further problem with estimating divergences.  As sequences diverge,
they become "saturated" with mutations.  Sites can have
substitutions more than once.  Calculated distances will
underestimate actual divergence times; the greater the divergence,
the greater the discrepancy.  There are various methods for dealing
with this and we provide two commonly used ones, both due to Motoo
Kimura; one for proteins and one for DNA.


For distance K (percent divergence / 100) ...

Correction for Protein distances:  (Kimura, 1983).

       Corrected K = -ln(1.0 - K - (K * K/5.0))
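
A minimal Python sketch of the protein correction, assuming K is the
observed distance expressed between 0.0 and 1.0.  The function name
is made up for the example, and raising an error when the logarithm
is undefined is a choice of this sketch; how ClustalV itself handles
such extreme distances is not described here.

```python
import math


def kimura_protein_distance(k):
    """Corrected K = -ln(1 - K - K*K/5) for an observed distance K."""
    inner = 1.0 - k - (k * k / 5.0)
    if inner <= 0.0:
        # The logarithm is undefined; the sequences are too diverged
        # for this correction (an assumption of the sketch).
        raise ValueError("observed distance too large to correct")
    return -math.log(inner)


for observed in (0.1, 0.5, 0.7):
    print(observed, kimura_protein_distance(observed))
```

The corrected value is always at least as large as the observed one,
and it grows faster as the observed distance increases.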


Correction for nucleotide distances: Kimura's 2-parameter method
(Kimura, 1980).

       Corrected K = 0.5*ln(a) + 0.25*ln(b)

       where     a = 1/(1 - 2*P - Q)
       and       b = 1/(1 - 2*Q)

       P and Q are the proportions of transitions (A<-->G, C<-->T)
       and transversions occurring between the sequences.
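
The 2-parameter correction can be sketched in Python as below.  The
function names are hypothetical, and the sketch assumes upper-case
DNA using only A, C, G and T; any other symbol (gap, unknown base)
causes the position to be skipped.  Raising an error when a
logarithm is undefined is a choice of the example.

```python
import math

PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}
BASES = PURINES | PYRIMIDINES


def transition_transversion_fractions(seq_a, seq_b):
    """Return (P, Q): proportions of transitions and transversions."""
    compared = transitions = transversions = 0
    for a, b in zip(seq_a, seq_b):
        if a not in BASES or b not in BASES:
            continue  # skip gaps and unknown bases
        compared += 1
        if a == b:
            continue
        pair = {a, b}
        if pair <= PURINES or pair <= PYRIMIDINES:
            transitions += 1  # A<-->G or C<-->T
        else:
            transversions += 1  # purine <--> pyrimidine
    if compared == 0:
        raise ValueError("no comparable positions")
    return transitions / compared, transversions / compared


def kimura_2p_distance(p, q):
    """Corrected K = 0.5*ln(a) + 0.25*ln(b), with a and b as defined above."""
    denom_a = 1.0 - 2.0 * p - q
    denom_b = 1.0 - 2.0 * q
    if denom_a <= 0.0 or denom_b <= 0.0:
        raise ValueError("observed distance too large to correct")
    return 0.5 * math.log(1.0 / denom_a) + 0.25 * math.log(1.0 / denom_b)


# Toy pair: one transition (A/G) and one transversion (A/C) in 10 sites.
p, q = transition_transversion_fractions("AAAAAAAAAA", "GAAAAAAAAC")
print(p, q, kimura_2p_distance(p, q))
```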


One paradoxical effect of these corrections is that distances can
be corrected to have more than 100% divergence.  That is because,
for very highly diverged sequences of length N, you can estimate
that more than N substitutions have occurred by correcting the
observed distance in the above ways.  Don't panic!



NEIGHBOR JOINING TREES.

VERY briefly, the NJ method works as follows.  You start by placing
the sequences in a star topology (no internal branches).  You then
find that internal branch (take 2 sequences; join them; connect them
to the rest by the internal branch) which when added to the tree
will minimise the total branch length. The two joined sequences
(neighbours) are merged into a single sequence and the process is
repeated.  For an unrooted tree with N sequences, there are N-3
internal branches.  The above process is repeated N-3 times to give
the final tree.  The full details are given in Saitou and Nei
(1987).
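
The following Python sketch illustrates the joining loop.  It is not
the ClustalV code.  It picks each pair with the commonly quoted
criterion (n - 2) * d(i, j) - r(i) - r(j), where r(i) is the sum of
the distances from i to all other nodes; this is usually presented
as equivalent to minimising the total branch length.  The function
name, the "nodeX" labels for internal nodes and the toy distances
are all assumptions of the example.

```python
def neighbor_joining(names, dist):
    """Return an unrooted tree as (internal_node, neighbour, length) edges.

    `dist` maps pairs of names, in either order, to distances.
    """
    if len(names) < 3:
        raise ValueError("need at least 3 sequences")
    d = {frozenset(pair): value for pair, value in dist.items()}
    nodes = list(names)
    edges = []
    next_id = 0
    while len(nodes) > 3:  # runs N-3 times for N sequences
        n = len(nodes)
        # r[i]: sum of distances from node i to every other node
        r = {i: sum(d[frozenset((i, k))] for k in nodes if k != i) for i in nodes}
        candidates = [(i, j) for x, i in enumerate(nodes) for j in nodes[x + 1:]]
        i, j = min(
            candidates,
            key=lambda p: (n - 2) * d[frozenset(p)] - r[p[0]] - r[p[1]],
        )
        dij = d[frozenset((i, j))]
        new = f"node{next_id}"
        next_id += 1
        # Branch lengths from the new internal node to the joined pair
        len_i = 0.5 * dij + (r[i] - r[j]) / (2 * (n - 2))
        edges.append((new, i, len_i))
        edges.append((new, j, dij - len_i))
        # Distances from the merged node to all remaining nodes
        for k in nodes:
            if k not in (i, j):
                d[frozenset((new, k))] = 0.5 * (
                    d[frozenset((i, k))] + d[frozenset((j, k))] - dij
                )
        nodes = [k for k in nodes if k not in (i, j)] + [new]
    # Three nodes remain: attach them to one central node
    a, b, c = nodes
    dab, dac, dbc = (d[frozenset(p)] for p in ((a, b), (a, c), (b, c)))
    centre = f"node{next_id}"
    edges.append((centre, a, 0.5 * (dab + dac - dbc)))
    edges.append((centre, b, 0.5 * (dab + dbc - dac)))
    edges.append((centre, c, 0.5 * (dac + dbc - dab)))
    return edges


# Toy distance matrix for five sequences, made up for the example.
names = ["a", "b", "c", "d", "e"]
dist = {
    ("a", "b"): 5, ("a", "c"): 9, ("a", "d"): 9, ("a", "e"): 8,
    ("b", "c"): 10, ("b", "d"): 10, ("b", "e"): 9,
    ("c", "d"): 8, ("c", "e"): 7, ("d", "e"): 3,
}
edges = neighbor_joining(names, dist)
for parent, child, length in edges:
    print(parent, child, length)
```

With five sequences the loop runs twice (N-3) and the final step
joins the last three nodes, which gives 2N-3 = 7 branches in total.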

As explained elsewhere in the documentation, you can only root the
tree by one of two methods:

1) assume a degree of constancy in the molecular clock and place the
root along the longest branch (internal or external).  Methods that
appear to produce rooted trees automatically are often just doing
this without letting you know; this is true of UPGMA.

2) root the tree on biological grounds.  The usual method is to
include an "outgroup", a sequence that you are certain will branch
to the outside of the tree.



BOOTSTRAPPING.

Bootstrapping is a general purpose technique that can be used for
placing confidence limits on statistics that you estimate without
any knowledge of the underlying distribution (e.g. a normal or
Poisson distribution).  In the case of phylogenetic trees, there are
several analytical methods for placing confidence limits on
groupings (actually on the internal branches) but these are either
restricted to particular tree drawing methods or only work on small
trees of 4 or 5 sequences.  Felsenstein (1985) showed how to use
bootstrapping to calculate confidence limits on trees.  His approach
is completely general and can be applied to any tree drawing method.
The main assumption of the method in this context is that the sites
in the alignment are independent; this will be true of some sequence
alignments (e.g. pseudogenes) but not others (e.g. rRNAs).  What
effect lack of independence will have on the results is not known.

The method works by taking random samples of data from the complete
data set.  You compute the test statistic (tree in this case) on
each sample.  Variation in the statistic computed from the samples
gives a measure of variation in the statistic which can be used to
calculate confidence intervals.  Each random sample is the same size
as the complete data set and is taken WITH REPLACEMENT i.e. a data
point can be selected more than once (or not at all) in any given
sample.

In the case of an alignment N residues long, each random sample is a
random selection of N sites from the alignment.  For each sample, we
calculate a distance matrix and tree in the usual way.  Variation in
the sample trees compared to a tree calculated from the full data
set gives an indication of how well supported the tree is by the
data.  If the sample trees are very similar to each other and to the
full tree, then the tree is "strongly" supported; if the sample
trees show great variation, then the tree will be weakly supported.
In practice, you usually find some parts of a tree well supported,
others weakly.  This can be seen by counting how often each
monophyletic group in the full tree occurs in the sample trees.
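
The resampling and counting steps can be sketched in Python as
follows.  Building the distance matrix and tree for each sample is
left out.  The names bootstrap_alignment and group_support, and the
representation of each sample tree as a set of groups (each group a
frozenset of sequence names), are assumptions made for this example.

```python
import random


def bootstrap_alignment(alignment, rng):
    """Resample the alignment columns WITH REPLACEMENT.

    The result has the same number of sites as the input; any site
    may appear more than once or not at all.
    """
    n_sites = len(alignment[0])
    picks = [rng.randrange(n_sites) for _ in range(n_sites)]
    return ["".join(row[i] for i in picks) for row in alignment]


def group_support(group, sample_groups):
    """Fraction of sample trees that contain `group` as a grouping."""
    group = frozenset(group)
    hits = sum(1 for groups in sample_groups if group in groups)
    return hits / len(sample_groups)


# Toy alignment, made up for the example; the seed only makes the
# run repeatable.
alignment = ["ACGTAC", "AGGTAC", "ACGTTC"]
rng = random.Random(0)
sample = bootstrap_alignment(alignment, rng)
print(sample)

# Groups found in four hypothetical sample trees.
sample_groups = [
    {frozenset({"a", "b"})},
    {frozenset({"a", "c"})},
    {frozenset({"a", "b"}), frozenset({"d", "e"})},
    set(),
]
support = group_support({"a", "b"}, sample_groups)
print(support)
```

Whole columns are resampled, so every sequence keeps the same sites
in each sample and the sample is still a valid alignment.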

For a particular grouping, one considers it to be significant at the
95% level (P <= 0.05) if it occurs in at least 95% of the bootstrap
samples.  If a grouping is significant, it is significant with
respect to the particular data set and method used for drawing the
tree.  Biological "significance" is another matter.


PHYLIP.

The Phylip package was written by Joe Felsenstein, University of
Washington, USA.  It provides Pascal source code for a large number
of programs for doing most types of phylogenetic analyses.  The
Phylip format alignments produced by this program can be used by all
of the Phylip programs, version 3.4 or later (March 1991).  It is
freely available from him as follows.



================= PHYLIP information sheet =====================

    PHYLIP - Phylogeny Inference Package (version 3.3)

This is a FREE package of programs for inferring phylogenies and
carrying out certain related tasks.  At present it contains 28
programs, which carry out different algorithms on different kinds of
data.  The programs in the package are:

      ---------- Programs for molecular sequence data ----------
PROTPARS  Protein parsimony
DNAPARS   Parsimony method for DNA
DNAMOVE   Interactive DNA parsimony
DNAPENNY  Branch and bound for DNA
DNABOOT   Bootstraps DNA parsimony
DNACOMP   Compatibility for DNA
DNAINVAR  Phylogenetic invariants
DNAML     Maximum likelihood method
DNAMLK    DNAML with molecular clock
DNADIST   Distances from sequences
RESTML    ML for restriction sites

    ----------- Programs for distance matrix data ------------
FITCH     Fitch-Margoliash and least-squares methods
KITSCH    Fitch-Margoliash and least squares methods with
          evolutionary clock

    --- Programs for gene frequencies and continuous characters --
CONTML    Maximum likelihood method
GENDIST   Computes genetic distances

    ------------- Programs for discrete state data -----------
MIX       Wagner, Camin-Sokal, and mixed parsimony criteria
MOVE      Interactive Wagner, C-S, mixed parsimony program
PENNY     Finds all most parsimonious trees by branch-and-bound
BOOT      Bootstrap confidence interval on mixed parsimony methods
DOLLOP, DOLMOVE, DOLPENNY, DOLBOOT   same as preceding four
          programs, but for the Dollo and polymorphism parsimony
          criteria
CLIQUE    Compatibility method
FACTOR    recode multistate characters

    ---- Programs for plotting trees and consensus trees ----
DRAWGRAM  Draws cladograms and phenograms on screens, plotters and
          printers
DRAWTREE  Draws unrooted phylogenies on screens, plotters and
          printers
CONSENSE  Majority-rule and strict consensus trees

The package includes extensive documentation files that provide the
information necessary to use and modify the programs.

COMPATIBILITY: The programs are written in a very standard subset of
Pascal, a language that is available on most computers (including
microcomputers). The programs require only trivial modifications to
run on most machines: for example they work with only minor
modifications with Turbo Pascal, and without modifications on VAX
VMS Pascal. Pascal source code is distributed in the regular version
of PHYLIP: compiled object code is not.  To use that version, you
must have a Pascal compiler.

DISKETTE DISTRIBUTION: The package is distributed in a variety of
microcomputer diskette formats.   You should send FORMATTED
diskettes, which I will return with the package written on them.
Unfortunately, I cannot write any Apple formats.   See below for how
many diskettes to send.  The programs on the magnetic tape or
electronic network versions may of course also be moved to
microcomputers using a terminal program.

PRECOMPILED VERSIONS: Precompiled executable programs for PCDOS
systems are available from me.  Specify the "PCDOS executable
version" and send the number of extra diskettes indicated below.
An Apple Macintosh version with precompiled code is available from
Willem Ellis, Instituut voor Taxonomische Zoologie, Zoologisch
Museum, Universiteit van Amsterdam, Plantage Middenlaan 64, 1018DH
Amsterdam, Netherlands, who asks that you send 5 800K diskettes.

HOW MANY DISKETTES TO SEND: The following table shows for different
PCDOS formats how many diskettes to send, and how many extra
diskettes to send for the PCDOS executable version:

Diskette size   Density   For source code    For executables, send
                                                in addition
  3.5 inch      1.44 Mb          2                     1
  5.25 inch      1.2 Mb          2                     2
  3.5 inch       720 Kb          4                     2
  5.25 inch      360 Kb          7                     4

Some other formats are also available. You MUST tell me EXACTLY
which of these formats you need.  The diskettes MUST be formatted by
you before being sent to me. Sending an extra diskette may be
helpful.

NETWORK DISTRIBUTION: The package is also available by distribution
of the files directly over electronic networks, and by anonymous ftp
from evolution.genetics.washington.edu.  Contact me by electronic
mail for details.

TAPE DISTRIBUTION: The programs are also distributed on a magnetic
tape provided by you (which should be a small tape and need only be
able to hold two megabytes) in the following format: 9-track, ASCII,
odd parity, unlabelled, 6250 bpi (unless otherwise indicated).
Logical record: 80 bytes, physical record: 3200 bytes (i.e. blocking
factor 40). There are a total of 71 files. The first one describes
the contents of the package.

POLICIES: The package is distributed free.  I do not make it
available or support it in South Africa.  The package will be
written on the diskettes or tape, which will be mailed back.  They
can be sent to:

                                           Joe Felsenstein
Electronic mail addresses:                 Department of Genetics SK-50
 Internet:  joe@genetics.washington.edu    University of Washington
 Bitnet/EARN:  felsenst@uwavm              Seattle, Washington 98195
 UUCP:  uw-beaver!evolution.genetics!joe   U.S.A.


===================== End of Phylip Info. Sheet ====================



REFERENCES.

Dayhoff, M.O., Schwartz, R.M. and Orcutt, B.C. (1978) in Atlas of
Protein Sequence and Structure, Vol. 5 supplement 3, Dayhoff, M.O.
(ed.), NBRF, Washington, p. 345.

Felsenstein, J. (1985)  Confidence limits on phylogenies: an
approach using the bootstrap.  Evolution 39, 783-791.

Feng, D.-F. and Doolittle, R.F. (1987)  Progressive sequence
alignment as a prerequisite to correct phylogenetic trees.
J.Mol.Evol. 25, 351-360.

Gotoh, O. (1982)  An improved algorithm for matching biological
sequences.  J.Mol.Biol. 162, 705-708.

Gribskov, M., McLachlan, A.D. and Eisenberg, D. (1987)  Profile
analysis: detection of distantly related proteins.  PNAS USA 84,
4355-4358.

Higgins, D.G. and Sharp, P.M. (1988)  CLUSTAL: a package for
performing multiple sequence alignments on a microcomputer.  Gene
73, 237-244.

Higgins, D.G. and Sharp, P.M. (1989)  Fast and sensitive multiple
sequence alignments on a microcomputer.  CABIOS 5, 151-153.

Kimura, M. (1980)  A simple method for estimating evolutionary
rates of base substitutions through comparative studies of
nucleotide sequences.  J.Mol.Evol. 16, 111-120.

Kimura, M. (1983)  The Neutral Theory of Molecular Evolution.
Cambridge University Press, Cambridge, England.

Li, W.-H., Wu, C.-I. and Luo, C.-C. (1985)  A new method for
estimating synonymous and nonsynonymous rates of nucleotide
substitution considering the relative likelihood of nucleotide and
codon changes.  Mol.Biol.Evol. 2, 150-174.

Myers, E.W. and Miller, W. (1988)  Optimal alignments in linear
space.  CABIOS 4, 11-17.

Pearson, W.R. and Lipman, D.J. (1988)  Improved tools for biological
sequence comparison.  PNAS USA 85, 2444-2448.

Saitou, N. and Nei, M. (1987)  The neighbor-joining method: a new
method for reconstructing phylogenetic trees.  Mol.Biol.Evol. 4,
406-425.

Sneath, P.H.A. and Sokal, R.R. (1973)  Numerical Taxonomy.  Freeman,
San Francisco.

Sokal, R.R. and Michener, C.D. (1958)  A statistical method for
evaluating systematic relationships.  Univ.Kansas Sci.Bull. 38,
1409-1438.

Vingron, M. and Argos, P. (1991)  Motif recognition and alignment
for many sequences by comparison of dot matrices.  J.Mol.Biol. 218,
33-43.

Wilbur, W.J. and Lipman, D.J. (1983)  Rapid similarity searches of
nucleic acid and protein data banks.  PNAS USA 80, 726-730.
