| 1 | |
|---|
| 2 | |
|---|
| 3 | |
|---|
| 4 | Clustal V Multiple Sequence Alignments. |
|---|
| 5 | |
|---|
| 6 | Documentation (Installation and Usage). |
|---|
| 7 | |
|---|
| 8 | Des Higgins |
|---|
| 9 | European Molecular Biology Laboratory |
|---|
| 10 | Postfach 10.2209 |
|---|
| 11 | D-6900 Heidelberg |
|---|
| 12 | Germany. |
|---|
| 13 | |
|---|
| 14 | higgins@EMBL-Heidelberg.DE |
|---|
| 15 | |
|---|
| 16 | |
|---|
| 17 | ******************************************************************* |
|---|
| 18 | |
|---|
| 19 | |
|---|
| 20 | Contents. |
|---|
| 21 | |
|---|
| 22 | |
|---|
| 23 | 1 Overview |
|---|
| 24 | |
|---|
| 25 | 2 Installation |
|---|
| 26 | |
|---|
| 27 | 3 Interactive usage |
|---|
| 28 | |
|---|
| 29 | 4 Command-line interface |
|---|
| 30 | |
|---|
| 31 | 5 Algorithms and references |
|---|
| 32 | |
|---|
| 33 | |
|---|
| 34 | ******************************************************************* |
|---|
| 35 | |
|---|
| 36 | 1. Overview |
|---|
| 37 | |
|---|
| 38 | This document describes how to install and use ClustalV on various |
|---|
| 39 | machines. ClustalV is a complete upgrade and rewrite of the Clustal |
|---|
| 40 | package of multiple alignment programs (Higgins and Sharp, 1988 and |
|---|
| 41 | 1989). The original programs were written in Fortran for |
|---|
| 42 | microcomputers running MSDOS. You carried out a complete alignment |
|---|
| 43 | by running 3 programs in succession. Later, these were merged into |
|---|
| 44 | a single menu driven program with on-line help, for VAX/VMS. |
|---|
| 45 | ClustalV was written in C and has all of the features of the old |
|---|
| 46 | programs plus many new ones. It has been compiled and tested using |
|---|
| 47 | VAX/VMS C, Decstation ULTRIX C, Gnu C for Sun workstations, Turbo C |
|---|
| 48 | for IBM PC's and Think C for Apple Mac's. The original Clustal was |
|---|
| 49 | written by Des Higgins while he was a Post-Doc in the lab of Paul |
|---|
| 50 | Sharp in the Genetics Department, Trinity College, Dublin 2, |
|---|
| 51 | Ireland. |
|---|
| 52 | |
|---|
| 53 | The main feature of the old package was the ability to carry out |
|---|
| 54 | reliable multiple alignments of many sequences. The sensitivity of |
|---|
| 55 | the program is as good as from any other program we have tried, with |
|---|
| 56 | the exception of the programs of Vingron and Argos (1991), while it |
|---|
| 57 | works in reasonable time on a microcomputer. The programs of |
|---|
| 58 | Vingron and Argos are specialised for finding distant similarities |
|---|
| 59 | between proteins but require mainframes or workstations and are more |
|---|
| 60 | difficult to use. |
|---|
| 61 | |
|---|
| 62 | The main new features are: profile alignments (alignments of old |
|---|
| 63 | alignments); phylogenetic trees (Neighbor Joining trees calculated |
|---|
| 64 | after multiple alignment with a bootstrapping option); better |
|---|
| 65 | sequence input (automatically recognise and read NBRF/PIR, Pearson |
|---|
| 66 | (Fasta) or EMBL/SwissProt formats); flexible alignment output |
|---|
| 67 | (choose one of: old Clustal format, NBRF/PIR, GCG msf format or |
|---|
| 68 | Phylip format); full command line interface (everything that you can |
|---|
| 69 | do interactively can be specified on the command line). |
|---|
| 70 | |
|---|
| 71 | In version 7 of the GCG package, there is a program called PILEUP |
|---|
| 72 | which uses a very similar algorithm to the one in ClustalV. There |
|---|
| 73 | are 2 main differences between the programs: 1) the metric used to |
|---|
| 74 | compare the sequences for the initial "guide tree" uses a full |
|---|
| 75 | global, optimal alignment in PILEUP instead of the fast, approximate |
|---|
| 76 | ones in ClustalV. This makes PILEUP much slower for the comparison |
|---|
| 77 | of long sequences. In principle, the distances calculated from |
|---|
| 78 | PILEUP will be more sensitive than ours, but in practice it will not |
|---|
| 79 | make much difference, except in difficult cases. 2) During the |
|---|
| 80 | multiple alignment, terminal gaps are penalised in ClustalV but not |
|---|
| 81 | in PILEUP. This will make the PILEUP alignments better when the |
|---|
| 82 | sequences are of very different lengths (has no effect if there are |
|---|
| 83 | no large terminal gaps). |
|---|
| 84 | |
|---|
| 85 | |
|---|
| 86 | This software may be distributed and used freely, provided that you |
|---|
| 87 | do not modify it or this documentation in any way without the |
|---|
| 88 | permission of the authors. |
|---|
| 89 | |
|---|
| 90 | If you wish to refer to ClustalV, please cite: |
|---|
| 91 | Higgins,D.G. Bleasby,A.J. and Fuchs,R. (1991) CLUSTAL V: improved software |
|---|
| 92 | for multiple sequence alignment. CABIOS, vol .8, 189-191. |
|---|
| 93 | |
|---|
| 94 | The overall multiple alignment algorithm was described in: |
|---|
| 95 | Higgins,D.G. and Sharp,P.M. (1989). Fast and sensitive multiple |
|---|
| 96 | sequence alignments on a microcomputer. CABIOS, vol. 5, 151-153. |
|---|
| 97 | |
|---|
| 98 | |
|---|
| 99 | ACKNOWLEDGEMENTS. |
|---|
| 100 | |
|---|
| 101 | D.H. would particularly like to thank Paul Sharp, in whose lab. this |
|---|
| 102 | work originated. We also thank Manolo Gouy, Gene Myers, Peter Rice |
|---|
| 103 | and Martin Vingron for suggestions, bug-fixes and help. |
|---|
| 104 | |
|---|
| 105 | Des Higgins and Rainer Fuchs, |
|---|
| 106 | EMBL Data Library, Heidelberg, Germany. |
|---|
| 107 | |
|---|
| 108 | Alan Bleasby, |
|---|
| 109 | Daresbury, UK. |
|---|
| 110 | |
|---|
| 111 | JUNE 1991 |
|---|
| 112 | ******************************************************************* |
|---|
| 113 | |
|---|
| 114 | 2. Installation. |
|---|
| 115 | |
|---|
| 116 | |
|---|
| 117 | |
|---|
| 118 | As far as possible, we have tried to make ClustalV portable to any |
|---|
| 119 | machine with a standard C compiler (proposed ANSI C standard). The |
|---|
| 120 | source code, as supplied by us, has been compiled and tested using |
|---|
| 121 | the following compilers: |
|---|
| 122 | |
|---|
| 123 | VAX/VMS C |
|---|
| 124 | Ultrix C (on a Decstation 2100) |
|---|
| 125 | Gnu C on a Sun 4 workstation |
|---|
| 126 | Think C on an Apple Macintosh SE |
|---|
| 127 | Turbo C on an IBM AT. |
|---|
| 128 | |
|---|
| 129 | In each case, one must make 1 change to 1 line of code in 1 header |
|---|
| 130 | file. This is described below. The exact capacity of the program |
|---|
| 131 | (how many sequences of what length can be aligned) will depend of |
|---|
| 132 | course on available memory but can also be set in this header file. |
|---|
| 133 | |
|---|
| 134 | The package comes as 9 C source files; 3 header files; 1 file of on- |
|---|
| 135 | line help; this documentation file; 3 make files: |
|---|
| 136 | |
|---|
| 137 | Source code: clustalv.c, amenu.c, gcgcheck.c, myers.c, sequence.c, |
|---|
| 138 | showpair.c, trees.c, upgma.c, util.c |
|---|
| 139 | |
|---|
| 140 | Header files: clustalv.h, general.h, matrices.h |
|---|
| 141 | |
|---|
| 142 | On-Line help: clustalv.hlp (must be renamed or defined as |
|---|
| 143 | clustalv_help except on PC's) |
|---|
| 144 | |
|---|
| 145 | Documentation: clustalv.txt (this file). |
|---|
| 146 | |
|---|
| 147 | Makefiles: makefile.sun (gnu c on Sun), vmslink.com (vax/vms), |
|---|
| 148 | makefile.ult (ultrix). |
|---|
| 149 | |
|---|
| 150 | |
|---|
| 151 | |
|---|
| 152 | |
|---|
| 153 | |
|---|
| 154 | |
|---|
| 155 | |
|---|
| 156 | Before compiling ClustalV you must look at and possibly change |
|---|
| 157 | clustalV.h, shown below.. |
|---|
| 158 | |
|---|
| 159 | /*******************CLUSTALV.H********************************/ |
|---|
| 160 | |
|---|
| 161 | /* |
|---|
| 162 | Main header file for ClustalV. Uncomment ONE of the following lines |
|---|
| 163 | depending on which compiler you wish to use. |
|---|
| 164 | */ |
|---|
| 165 | |
|---|
| 166 | #define VMS 1 /* VAX VMS */ |
|---|
| 167 | |
|---|
| 168 | /*#define MAC 1 Think_C for MacIntosh */ |
|---|
| 169 | |
|---|
| 170 | /*#define MSDOS 1 Turbo C for PC's */ |
|---|
| 171 | |
|---|
| 172 | /*#define UNIX 1 Ultrix for Decstations or Gnu C for Sun */ |
|---|
| 173 | |
|---|
| 174 | /*************************************************************/ |
|---|
| 175 | |
|---|
| 176 | #include "general.h" |
|---|
| 177 | |
|---|
| 178 | #define MAXNAMES 10 |
|---|
| 179 | #define MAXTITLES 60 |
|---|
| 180 | #define FILENAMELEN 256 |
|---|
| 181 | |
|---|
| 182 | #define UNKNOWN 0 |
|---|
| 183 | #define EMBLSWISS 1 |
|---|
| 184 | #define PIR 2 |
|---|
| 185 | #define PEARSON 3 |
|---|
| 186 | |
|---|
| 187 | #define PAGE_LEN 22 |
|---|
| 188 | |
|---|
| 189 | #if VMS |
|---|
| 190 | #define DIRDELIM ']' |
|---|
| 191 | #define MAXLEN 3000 |
|---|
| 192 | #define MAXN 150 |
|---|
| 193 | #define FSIZE 15000 |
|---|
| 194 | #define LINELENGTH 60 |
|---|
| 195 | #define GCG_LINELENGTH 50 |
|---|
| 196 | |
|---|
| 197 | #elif MAC |
|---|
| 198 | #define DIRDELIM ':' |
|---|
| 199 | #define MAXLEN 2600 |
|---|
| 200 | #define MAXN 30 |
|---|
| 201 | #define FSIZE 10000 |
|---|
| 202 | #define LINELENGTH 50 |
|---|
| 203 | #define GCG_LINELENGTH 50 |
|---|
| 204 | |
|---|
| 205 | #elif MSDOS |
|---|
| 206 | #define DIRDELIM '\\' |
|---|
| 207 | #define MAXLEN 1300 |
|---|
| 208 | #define MAXN 30 |
|---|
| 209 | #define FSIZE 5000 |
|---|
| 210 | #define LINELENGTH 50 |
|---|
| 211 | #define GCG_LINELENGTH 50 |
|---|
| 212 | |
|---|
| 213 | #elif UNIX |
|---|
| 214 | #define DIRDELIM '/' |
|---|
| 215 | #define MAXLEN 3000 |
|---|
| 216 | #define MAXN 50 |
|---|
| 217 | #define FSIZE 15000 |
|---|
| 218 | #define LINELENGTH 60 |
|---|
| 219 | #define GCG_LINELENGTH 50 |
|---|
| 220 | #endif |
|---|
| 221 | /*****************end*of*CLUSTALV.H***************************/ |
|---|
| 222 | |
|---|
| 223 | |
|---|
| 224 | |
|---|
| 225 | First, you must remove the comments from one of the first 10 lines. |
|---|
| 226 | There are 4 'define' compiler directives here (e.g. #define VMS 1), |
|---|
| 227 | and you should use one of these, depending on which system you wish |
|---|
| 228 | to work. So choose one of these, remove its comments (if it is |
|---|
| 229 | already commented out) and put comments around any of the others |
|---|
| 230 | that are still active. If you wish to use a different system, you |
|---|
| 231 | will need to insert a new line with a new keyword (which you must |
|---|
| 232 | invent) to identify your system. Most of the rest of this header |
|---|
| 233 | file is taken up with a block of 'define' statements for each system |
|---|
| 234 | type; e.g. the VAX/VMS block is: |
|---|
| 235 | |
|---|
| 236 | #if VMS |
|---|
| 237 | #define DIRDELIM ']' |
|---|
| 238 | #define MAXLEN 3000 |
|---|
| 239 | #define MAXN 150 |
|---|
| 240 | #define FSIZE 15000 |
|---|
| 241 | #define LINELENGTH 60 |
|---|
| 242 | #define GCG_LINELENGTH 50 |
|---|
| 243 | |
|---|
| 244 | In this block, you can specify the maximum number of sequences to be |
|---|
| 245 | allowed (MAXN); the maximum sequence length, including gaps |
|---|
| 246 | (MAXLEN); FSIZE declares the size of some workspace, used by the |
|---|
| 247 | fast 2 sequence comparison routines and should be APPROXIMATELY 4 |
|---|
| 248 | times MAXLEN; LINELENGTH is the length of the blocks of alignment |
|---|
| 249 | output in the output files; GCG_LINELENGTH is the same but for the |
|---|
| 250 | GCG compatible output only. Finally, DIRDELIM is the character used |
|---|
| 251 | to specify directories and subdirectories in file names. It should |
|---|
| 252 | be the character used to seperate the file name itself from the |
|---|
| 253 | directory name (e.g. in VMS, file names are like: |
|---|
| 254 | $drive:[dir1.dir2.dir3]filename.ext;2 so ']' is used as DIRDELIM). |
|---|
| 255 | |
|---|
| 256 | So, if you want to use a system, not covered in Clustalv.h, you will |
|---|
| 257 | have to insert a new block, like the above one. To compile and link |
|---|
| 258 | the program, we supply 3 makefiles: one each for VAX/VMS, Ultrix |
|---|
| 259 | and GNU C for Sun workstations. |
|---|
| 260 | |
|---|
| 261 | |
|---|
| 262 | |
|---|
| 263 | VAX/VMS |
|---|
| 264 | |
|---|
| 265 | Compile and link the program with the |
|---|
| 266 | supplied makefile for vms: vmslink.com . |
|---|
| 267 | |
|---|
| 268 | $ @vmslink |
|---|
| 269 | |
|---|
| 270 | This will produce clustalv.exe (and a lot of .obj files which you can delete). |
|---|
| 271 | |
|---|
| 272 | The on-line help file (clustalv.hlp) should be 'defined' as |
|---|
| 273 | clustalv_help as follows: |
|---|
| 274 | |
|---|
| 275 | $ def clustalv_help $drive:[dir1.dir2]clustalv.hlp |
|---|
| 276 | |
|---|
| 277 | where $drive is the drive designation and [dir1.dir2] is the |
|---|
| 278 | directory where clustalv.hlp is kept. |
|---|
| 279 | |
|---|
| 280 | To make use of the command-line interface, you must make clustalv a |
|---|
| 281 | 'foreign' command with: |
|---|
| 282 | |
|---|
| 283 | $ clustalv :== $$drive:[dir1.dir2]clustalv |
|---|
| 284 | |
|---|
| 285 | where $drive is the drive designation and [dir1.dir2] is the |
|---|
| 286 | directory where clustalv.exe is kept. |
|---|
| 287 | |
|---|
| 288 | |
|---|
| 289 | |
|---|
| 290 | IBM PC/MSDOS/TURBO C |
|---|
| 291 | |
|---|
| 292 | Create a makefile (something.prj) with the names of the source files |
|---|
| 293 | (clustalv.c, amenu.c etc.) and 'make' this using the HUGE memory |
|---|
| 294 | model. You will get half a dozen warnings from the compiler about |
|---|
| 295 | pieces of code than look suspicious to it but ignore these. The |
|---|
| 296 | help file should remain as clustalv.hlp . To run the program using |
|---|
| 297 | the default settings in Clustalv.h, you need approximately 500k of |
|---|
| 298 | memory. To reduce this, the main influence on memory usage is the |
|---|
| 299 | parameter MAXLEN; reduce MAXLEN to reduce memory usage. |
|---|
| 300 | |
|---|
| 301 | |
|---|
| 302 | |
|---|
| 303 | Apple Mac/THINK_C version 4.0.2 |
|---|
| 304 | |
|---|
| 305 | This version of the program is not at all Mac like. It runs in a |
|---|
| 306 | window, the inside of which looks just like a normal character based |
|---|
| 307 | terminal. In the future we might put a proper Mac interface on it |
|---|
| 308 | but do not have the time right now. With the default settings in |
|---|
| 309 | the header file ClustalV.h, you need just over 800k of memory to run |
|---|
| 310 | the program. To reduce this, reduce MAXLEN; this is easily the |
|---|
| 311 | biggest influence on memory usage. To compile the program and save |
|---|
| 312 | it as an application you need to 'set the application type'; here |
|---|
| 313 | you specify how much memory (in kilobytes (k)) the application will |
|---|
| 314 | need. You should set this to 900k to run the application as it is |
|---|
| 315 | OR reduce MAXLEN in the header. To compile the program you have to |
|---|
| 316 | create a 'project'; you 'add' the names of the 9 source files to the |
|---|
| 317 | project AND the name of the ANSI library. The source code is too |
|---|
| 318 | large to compile in one compilation unit. You will get a 'link |
|---|
| 319 | error: code segment too big' if you try to compile and link as is. |
|---|
| 320 | You should compile amenu.c (the biggest source file) as a seperate |
|---|
| 321 | unit ..... you will have to read the manual/ask someone/mail me to |
|---|
| 322 | find out what this is. |
|---|
| 323 | |
|---|
| 324 | |
|---|
| 325 | ******************************************************************* |
|---|
| 326 | |
|---|
| 327 | 3. Interactive usage. |
|---|
| 328 | |
|---|
| 329 | |
|---|
| 330 | |
|---|
| 331 | Interactive usage of Clustal V is completely menu driven. On-line |
|---|
| 332 | help is provided, defaults are offered for all parameters and file |
|---|
| 333 | names. With a little effort it should be completely self |
|---|
| 334 | explanatory. The main menu, which appears when you run the |
|---|
| 335 | programs is shown below. Each item brings you to a sub menu. |
|---|
| 336 | |
|---|
| 337 | |
|---|
| 338 | |
|---|
| 339 | Main menu for Clustal V: |
|---|
| 340 | |
|---|
| 341 | |
|---|
| 342 | 1. Sequence Input From Disc |
|---|
| 343 | 2. Multiple Alignments |
|---|
| 344 | 3. Profile Alignments |
|---|
| 345 | 4. Phylogenetic trees |
|---|
| 346 | |
|---|
| 347 | S. Execute a system command |
|---|
| 348 | H. HELP |
|---|
| 349 | X. EXIT (leave program) |
|---|
| 350 | |
|---|
| 351 | |
|---|
| 352 | Your choice: |
|---|
| 353 | |
|---|
| 354 | |
|---|
| 355 | |
|---|
| 356 | The options S and H appear on all the main menus. H will provide |
|---|
| 357 | help and if you type S you will be asked to enter a command, such as |
|---|
| 358 | DIR or LS, which will be sent to the system (does not work on |
|---|
| 359 | Mac's). Before carrying out an alignment, you must use option 1 |
|---|
| 360 | (sequence input); the format for sequences is explained below. |
|---|
| 361 | Under menu item 2 you will be able to automatically align your |
|---|
| 362 | sequences to each other. Menu item 3 allows you to do profile |
|---|
| 363 | alignments. These are alignments of old alignments. This allows |
|---|
| 364 | you to build up a multiple alignment in stages or add a new sequence |
|---|
| 365 | to an old alignment. You can calculate phylogenetic trees from |
|---|
| 366 | alignments using menu item 4. |
|---|
| 367 | |
|---|
| 368 | |
|---|
| 369 | |
|---|
| 370 | |
|---|
| 371 | ****************************** |
|---|
| 372 | * SEQUENCE INPUT. * |
|---|
| 373 | ****************************** |
|---|
| 374 | |
|---|
| 375 | |
|---|
| 376 | All sequences should be in 1 file. Three formats are automatically |
|---|
| 377 | recognised and used: NBRF/PIR, EMBL/SwissProt and FASTA (Pearson and |
|---|
| 378 | Lipman (1988) format). |
|---|
| 379 | |
|---|
| 380 | *** |
|---|
| 381 | Users of the Wisconsin GCG package should use the command TONBRF |
|---|
| 382 | (recently changed to TOPIR) to reformat their sequences before use. |
|---|
| 383 | *** |
|---|
| 384 | |
|---|
| 385 | Sequences can be in upper or lower case. For proteins, the only |
|---|
| 386 | symbols recognised are: A,C,D,E,F,G,H,I,K,L,M,N,P,Q,R,S,T,V,W,Y and |
|---|
| 387 | for DNA/RNA use: A,C,G and T (or U). Any other letters of the |
|---|
| 388 | alphabet will be treated as X (proteins) or N (DNA/RNA) for unknown. |
|---|
| 389 | All other symbols (blanks, digits etc.) will be ignored EXCEPT for |
|---|
| 390 | the hyphen "-" which can be used to specify a gap. This last point |
|---|
| 391 | is especially useful for 2 reasons: 1) you can fix the positions of |
|---|
| 392 | some gaps in advance; 2) the alignment output from this program can |
|---|
| 393 | be written out in NBRF format using "-"'s to specify gaps; these |
|---|
| 394 | alignments can be used again as input, either for profile alignments |
|---|
| 395 | or for phylogenetic trees. |
|---|
| 396 | |
|---|
| 397 | If you are using an editor to create sequence files, use the FASTA |
|---|
| 398 | format as it is by far the simplest (see below). If you have access |
|---|
| 399 | to utility programs for generating/converting the NBRF/PIR format |
|---|
| 400 | then use it in preference. |
|---|
| 401 | |
|---|
| 402 | |
|---|
| 403 | |
|---|
| 404 | FASTA (PEARSON AND LIPMAN, 1988) FORMAT: The sequences are |
|---|
| 405 | delimited by an angle bracket ">" in column 1. The text immediately |
|---|
| 406 | after the ">" is used as a title. Everything on the following line |
|---|
| 407 | until the next ">" or the end of the file is one sequence. |
|---|
| 408 | |
|---|
| 409 | e.g. |
|---|
| 410 | |
|---|
| 411 | > RABSTOUT rabbit Guinness receptor |
|---|
| 412 | LKMHLMGHLKMGLKMGLKGMHLMHLKHMHLMTYTYTTYRRWPLWMWLPDFGHAS |
|---|
| 413 | ADSCVCAHGFAVCACFAHFDVCFGAVCFHAVCFAHVCFAAAVCFAVCAC |
|---|
| 414 | > MUSNOSE mouse nose drying factor |
|---|
| 415 | mhkmmhkgmkhmhgmhmhglhmkmhlkmgkhmgkmkytytytryrwtqtqwtwyt |
|---|
| 416 | fdgfdsgafdagfdgfsagdfavdfdvgavfsvfgvdfsvdgvagvfdv |
|---|
| 417 | > HSHEAVEN human Guinness receptor repeat |
|---|
| 418 | mhkmmhkgmkhmhgmhmhg lhmkmhlkmgkhmgkmk ytytytryrwtqtqwtwyt |
|---|
| 419 | fdgfdsgafdagfdgfsag dfavdfdvgavfsvfgv dfsvdgvagvfdv |
|---|
| 420 | mhkmmhkgmkhmhgmhmhg lhmkmhlkmgkhmgkmk ytytytryrwtqtqwtwyt |
|---|
| 421 | fdgfdsgafdagfdgfsag dfavdfdvgavfsvfgv dfsvdgvagvfdv |
|---|
| 422 | |
|---|
| 423 | |
|---|
| 424 | |
|---|
| 425 | NBRF/PIR FORMAT is similar to FASTA format but immediately |
|---|
| 426 | after the ">", you find the characters "P1;" if the sequences are |
|---|
| 427 | protein or "DL;" if they are nucleic acid. Clustalv looks for the |
|---|
| 428 | ";" character as the third character after the ">". If it finds one |
|---|
| 429 | it assumes that the format is NBRF if not, FASTA format is assumed. |
|---|
| 430 | The text after the ";" is treated as a sequence name while the |
|---|
| 431 | entire next line is treated as a title. The sequence is terminated |
|---|
| 432 | by a star "*" and the next sequence can then begin (with a >P1; etc |
|---|
| 433 | ). This is just the basic format description (there are other |
|---|
| 434 | variations and rules). |
|---|
| 435 | |
|---|
| 436 | ANY files/sequences in GCG format can be converted to this format |
|---|
| 437 | using the TONBRF command (now TOPIR) of the Wisconsin GCG package. |
|---|
| 438 | |
|---|
| 439 | |
|---|
| 440 | e.g. |
|---|
| 441 | |
|---|
| 442 | >P1;RABSTOUT |
|---|
| 443 | rabbit Guinness receptor |
|---|
| 444 | LKMHLMGHLKMGLKMGLKGMHLMHLKHMHLMTYTYTTYRRWPLWMWLPDFGHAS |
|---|
| 445 | ADSCVCAHGFAVCACFAHFDVCFGAVCFHAVCFAHVCFAAAVCFAVCAC* |
|---|
| 446 | >P1;MUSNOSE |
|---|
| 447 | mouse nose drying factor |
|---|
| 448 | mhkmmhkgmkhmhgmhmhglhmkmhlkmgkhmgkmkytytytryrwtqtqwtwyt |
|---|
| 449 | fdgfdsgafdagfdgfsagdfavdfdvgavfsvfgvdfsvdgvagvfd |
|---|
| 450 | * |
|---|
| 451 | >P1;HSHEAVEN |
|---|
| 452 | human Guinness receptor repeat protein. |
|---|
| 453 | mhkmmhkgmkhmhgmhmhg lhmkmhlkmgkhmgkmk ytytytryrwtqtqwtwyt |
|---|
| 454 | fdgfdsgafdagfdgfsag dfavdfdvgavfsvfgv dfsvdgvagvfdv |
|---|
| 455 | mhkmmhkgmkhmhgmhmhg lhmkmhlkmgkhmgkmk ytytytryrwtqtqwtwyt |
|---|
| 456 | fdgfdsgafdagfdgfsag dfavdfdvgavfsvfgv dfsvdgvagvfdv* |
|---|
| 457 | |
|---|
| 458 | |
|---|
| 459 | |
|---|
| 460 | |
|---|
| 461 | EMBL/SWISSPROT FORMAT: Do not try to create files with this |
|---|
| 462 | format unless you have utilities to help. If you are just using an |
|---|
| 463 | editor, use one of the above formats. If you do use this format, |
|---|
| 464 | the program will ignore everything between the ID line (line |
|---|
| 465 | beginning with the characters "ID") and the SQ line. The sequence |
|---|
| 466 | is then read from between the SQ line and the "//" characters. |
|---|
| 467 | |
|---|
| 468 | |
|---|
| 469 | |
|---|
| 470 | It is critically important for the program to know whether or not it |
|---|
| 471 | is aligning DNA or protein sequences. The input routines attempt to |
|---|
| 472 | guess which type of sequence is being used by counting the number of |
|---|
| 473 | A,C,G,T or U's in the sequences. If the total is more than 85% of |
|---|
| 474 | the sequence length then DNA is assumed. If you use very bizarre |
|---|
| 475 | sequences (proteins with really strange aa compositions or DNA |
|---|
| 476 | sequences with loads of strange ambiguity codes) you might confuse |
|---|
| 477 | the program. It is difficult to do but be careful. |
|---|
| 478 | |
|---|
| 479 | |
|---|
| 480 | |
|---|
| 481 | |
|---|
| 482 | |
|---|
| 483 | ****************************** |
|---|
| 484 | * MULTIPLE ALIGNMENT MENU. * |
|---|
| 485 | ****************************** |
|---|
| 486 | |
|---|
| 487 | The multiple alignment menu is shown below. Before explaining how |
|---|
| 488 | to use it, you must be introduced briefly to the alignment strategy. |
|---|
| 489 | If you do not follow this, try using option 1 anyway; the entire |
|---|
| 490 | process will be carried out automatically. |
|---|
| 491 | |
|---|
| 492 | To do a complete multiple alignment, we need to know the approximate |
|---|
| 493 | relationships of the sequences to each other (which ones are most |
|---|
| 494 | similar to each other). We do this by calculating a crude |
|---|
| 495 | phylogenetic tree which we call a dendrogram (to distinguish it from |
|---|
| 496 | the more sensitive trees available under the phylogenetic tree |
|---|
| 497 | menu). This dendrogram is used as a guide to align bigger and |
|---|
| 498 | bigger groups of sequences during the multiple alignment. The |
|---|
| 499 | dendrogram is calculated in 2 stages: 1) all pairs of sequence are |
|---|
| 500 | compared using the fast/approximate method of Wilbur and Lipman |
|---|
| 501 | (1983); the result of each comparison is a similarity score. 2) the |
|---|
| 502 | similarity scores are used to construct the dendrogram using the |
|---|
| 503 | UPGMA cluster analysis method of Sneath and Sokal (1973). |
|---|
| 504 | |
|---|
| 505 | The construction of the dendrogram can be very time consuming if you |
|---|
| 506 | wish to align many sequences (e.g. for 100 sequences you need to |
|---|
| 507 | carry out 100x99/2 sequence comparisons = 4950). During every |
|---|
| 508 | multiple alignment, a dendrogram is constructed and saved to a file |
|---|
| 509 | (something.dnd). These can be reused later. |
|---|
| 510 | |
|---|
| 511 | |
|---|
| 512 | |
|---|
| 513 | |
|---|
| 514 | |
|---|
| 515 | |
|---|
| 516 | |
|---|
| 517 | |
|---|
| 518 | ******Multiple*Alignment*Menu****** |
|---|
| 519 | |
|---|
| 520 | |
|---|
| 521 | 1. Do complete multiple alignment now |
|---|
| 522 | 2. Produce dendrogram file only |
|---|
| 523 | 3. Use old dendrogram file |
|---|
| 524 | 4. Pairwise alignment parameters |
|---|
| 525 | 5. Multiple alignment parameters |
|---|
| 526 | 6. Output format options |
|---|
| 527 | |
|---|
| 528 | S. Execute a system command |
|---|
| 529 | H. HELP |
|---|
| 530 | or press [RETURN] to go back to main menu |
|---|
| 531 | |
|---|
| 532 | |
|---|
| 533 | Your choice: |
|---|
| 534 | |
|---|
| 535 | |
|---|
| 536 | So, if in doubt, and you have already loaded some sequences from the |
|---|
| 537 | main menu, just try option 1 and press the <Return> key in response |
|---|
| 538 | to any questions. You will be prompted for 2 file names e.g. if the |
|---|
| 539 | sequence input file was called DRINK.PEP, you will be offered |
|---|
| 540 | DRINK.ALN as the file to contain the alignment and DRINK.DND for the |
|---|
| 541 | dendrogram. |
|---|
| 542 | |
|---|
| 543 | If you wish to repeat a multiple alignment (e.g. to experiment with |
|---|
| 544 | different gap penalties) but do not wish to make a dendrogram all |
|---|
| 545 | over again use menu item 3 (providing you are using the same |
|---|
| 546 | sequences). Similarly, menu item 2 allows you to produce the |
|---|
| 547 | dendrogram file only. |
|---|
| 548 | |
|---|
| 549 | |
|---|
| 550 | |
|---|
| 551 | |
|---|
| 552 | PAIRWISE ALIGNMENT PARAMETERS: |
|---|
| 553 | |
|---|
| 554 | The parameters that control the initial fast/approximate comparisons |
|---|
| 555 | can be set from menu item 4 which looks like: |
|---|
| 556 | |
|---|
| 557 | |
|---|
| 558 | ********* WILBUR/LIPMAN PAIRWISE ALIGNMENT PARAMETERS ********* |
|---|
| 559 | |
|---|
| 560 | |
|---|
| 561 | 1. Toggle Scoring Method :Percentage |
|---|
| 562 | 2. Gap Penalty :3 |
|---|
| 563 | 3. K-tuple :1 |
|---|
| 564 | 4. No. of top diagonals :5 |
|---|
| 565 | 5. Window size :5 |
|---|
| 566 | |
|---|
| 567 | H. HELP |
|---|
| 568 | |
|---|
| 569 | |
|---|
| 570 | Enter number (or [RETURN] to exit): |
|---|
| 571 | |
|---|
| 572 | |
|---|
| 573 | |
|---|
| 574 | The similarity scores are calculated from fast alignments generated |
|---|
| 575 | by the method of Wilbur and Lipman (1983). These are 'hash' or |
|---|
| 576 | 'word' or 'k-tuple' alignments carried out in 3 stages. |
|---|
| 577 | |
|---|
| 578 | First you mark the positions of every fragment of sequence, K-tuple |
|---|
| 579 | long (for proteins, the default length is 1 residue, for DNA it is 2 |
|---|
| 580 | bases) in both sequences. Then you locate all k-tuple matches |
|---|
| 581 | between the 2 sequences. At this stage you have to imagine a dot- |
|---|
| 582 | matrix plot between the 2 sequences with each k-tuple match as a |
|---|
| 583 | dot. You find those diagonals in the plot with most matches (you |
|---|
| 584 | take the "No. of top diagonals" best ones) and mark all diagonals |
|---|
| 585 | within "Window size" of each top diagonal. This process will define |
|---|
| 586 | diagonal bands in the plot where you hope the most likely regions of |
|---|
| 587 | similarity will lie. |
|---|
| 588 | |
|---|
| 589 | The final alignment stage is to find that head to tail arrangement |
|---|
| 590 | of k-tuple matches from these diagonal regions that will give the |
|---|
| 591 | highest score. The score is calculated as the number of exactly |
|---|
| 592 | matching residues in this alignment minus a "gap penalty" for every |
|---|
| 593 | gap that was introduced. When you toggle "Scoring method" you |
|---|
| 594 | choose between expressing these similarity scores as raw scores or |
|---|
| 595 | expressed as a percentage of the shorter sequence length. |
|---|
| 596 | |
|---|
| 597 | K-TUPLE SIZE: Can be 1 or 2 for proteins; 1 to 4 for DNA. |
|---|
| 598 | Increase this to increase speed; decrease to improve sensitivity. |
|---|
| 599 | |
|---|
| 600 | GAP PENALTY: The number of matching residues that must be found |
|---|
| 601 | in order to introduce a gap. This should be larger than K-Tuple |
|---|
| 602 | Size. This has little effect on speed or sensitivity. |
|---|
| 603 | |
|---|
| 604 | NO. OF TOP DIAGONALS: The number of best diagonals in the |
|---|
| 605 | imaginary dot-matrix plot that are considered. Decrease (must be |
|---|
| 606 | greater than zero) to increase speed; increase to improve |
|---|
| 607 | sensitivity. |
|---|
| 608 | |
|---|
| 609 | WINDOW SIZE: The number of diagonals around each "top" diagonal |
|---|
| 610 | that are considered. Decrease for speed; increase for greater |
|---|
| 611 | sensitivity. |
|---|
| 612 | |
|---|
| 613 | SCORING METHOD: The similarity scores may be expressed as raw scores |
|---|
| 614 | (number of identical residues minus a "gap penalty" for each gap) or |
|---|
| 615 | as percentage scores. If the sequences are of very different |
|---|
| 616 | lengths, percentage scores make more sense. |
|---|
| 617 | |
|---|
| 618 | |
|---|
| 619 | |
|---|
| 620 | CHANGING THE PAIRWISE ALIGNMENT PARAMETERS |
|---|
| 621 | |
|---|
| 622 | The main reason for wanting to change the above parameters is SPEED |
|---|
| 623 | (especially on microcomputers), NOT SENSITIVITY. The dendrograms |
|---|
| 624 | that are produced can only show the relationships between the |
|---|
| 625 | sequences APPROXIMATELY because the similarity scores are calculated |
|---|
| 626 | from seperate pairwise alignments; not from a multiple alignment |
|---|
| 627 | (that is what we eventually hope to produce). If the groupings of |
|---|
| 628 | the sequences are "obvious", the above method should work well; if |
|---|
| 629 | the relationships are obscure or weakly represented by the data, it |
|---|
| 630 | will not make much difference playing with the parameters. The main |
|---|
| 631 | factor influencing speed is the K-TUPLE SIZE followed by the WINDOW |
|---|
| 632 | SIZE. |
|---|
| 633 | |
|---|
| 634 | The alignments are carried out in a small amount of memory. |
|---|
| 635 | Occasionally (it is hard to predict), you will run out of memory |
|---|
| 636 | while doing these alignments; when this happens, it will say on the |
|---|
| 637 | screen: "Sequences (a,b) partially aligned" (instead of "Sequences |
|---|
| 638 | (a,b) aligned"). This means that the alignment score for these |
|---|
| 639 | sequences will be approximate; it is not a problem unless many of |
|---|
| 640 | the alignments do this. It can be fixed by using less sensitive |
|---|
| 641 | parameters or increasing parameter FSIZE in clustalv.h . |
|---|
| 642 | |
|---|
| 643 | |
|---|
| 644 | THE DENDROGRAM ITSELF |
|---|
| 645 | |
|---|
| 646 | The similarity scores generated by the fast comparison of all the |
|---|
| 647 | sequences are used to construct a dendrogram by the UPGMA method of |
|---|
| 648 | Sneath and Sokal (1973). This is a form of cluster analysis and the |
|---|
| 649 | end result produces something that looks like a tree. It represents |
|---|
| 650 | the similarity of the sequences as a hierarchy. The dendrogram is |
|---|
| 651 | written to a file in a machine readable format and is ahown below |
|---|
| 652 | for an example with 6 sequences. |
|---|
| 653 | |
|---|
| 654 | |
|---|
| 655 | 91.0 0 0 2 012000 ! seq 2 joins seq 3 at 91% ID. |
|---|
| 656 | 72.0 1 0 3 011200 ! seq 4 joins seqs 2,3 at 72% |
|---|
| 657 | 71.1 0 0 2 000012 ! seq 5 joins seq 6 at 71% |
|---|
| 658 | 35.5 0 2 4 122200 ! seq 1 joins seqs 2,3,4 |
|---|
| 659 | 21.7 4 3 6 111122 ! seqs 1,2,3,4 join seqs 5,6 |
|---|
| 660 | |
|---|
| 661 | This LOOKS complicated but you do not normally need to care what is |
|---|
| 662 | in here. Anyway, each row represents the joining together of 2 or |
|---|
| 663 | more sequences. You progress from the top down, joining more and |
|---|
| 664 | more sequences until all are joined together; for N sequences you |
|---|
| 665 | have N-1 groupings hence there are 5 rows in the above file (there |
|---|
| 666 | were 6 sequences). In each row, the first number is the similarity |
|---|
| 667 | score of this grouping; ignore the next three columns for the |
|---|
| 668 | moment; the last 6 digits in the line show which sequences are |
|---|
| 669 | grouped; there is one digit for each sequence (the first digit is |
|---|
| 670 | for the first sequence). The rule is: in each row, all of the "1"s |
|---|
| 671 | join all of the "2"s; the zero's do nothing. |
|---|
| 672 | |
|---|
| 673 | Hence, in the first row, sequence 2 joins sequence 3 at a similarity |
|---|
| 674 | level of 91% identity; next, sequence 4 joins the previous grouping |
|---|
| 675 | of 2 plus 3 at a level of 72% etc. This is shown diagrammatically |
|---|
| 676 | below. Before leaving the dendrogram format, the other 3 columns of |
|---|
| 677 | numbers are: a pointer to the row from which the "1" sequences were |
|---|
| 678 | last joined (or zero if only one of them); a pointer to the row in |
|---|
| 679 | which the "2"s were last joined; the total number of sequences |
|---|
| 680 | joined in this line. |
|---|
| 681 | |
|---|
| 682 | |
|---|
| 683 | |
|---|
| 684 | |
|---|
| 685 | I------ 2 |
|---|
| 686 | I------I |
|---|
| 687 | I I------ 3 Diagram of the sequence similarity |
|---|
| 688 | I----I |
|---|
| 689 | I I------------- 4 relationships shown in the above |
|---|
| 690 | I--I |
|---|
| 691 | I I------------------ 1 dendrogram file (branch lengths are |
|---|
| 692 | ----I |
|---|
| 693 | I I------------- 5 not to scale). |
|---|
| 694 | I-------I |
|---|
| 695 | I------------- 6 |
|---|
| 696 | |
|---|
| 697 | |
|---|
| 698 | |
|---|
| 699 | |
|---|
| 700 | |
|---|
| 701 | |
|---|
| 702 | |
|---|
| 703 | |
|---|
| 704 | |
|---|
| 705 | MULTIPLE ALIGNMENT PARAMETERS: |
|---|
| 706 | |
|---|
| 707 | |
|---|
| 708 | Having calculated a dendrogram between a set of sequences, the final |
|---|
| 709 | multiple alignment is carried out by a series of alignments of |
|---|
| 710 | larger and larger groups of sequences. The order is determined by |
|---|
| 711 | the dendrogram so that the most similar sequences get aligned first. |
|---|
| 712 | Any gaps that are introduced in the early alignments are fixed. |
|---|
| 713 | When two groups of sequences are aligned against each other, a full |
|---|
| 714 | protein weight matrix (such as a Dayhoff PAM 250) is used. Two gap |
|---|
| 715 | penalties are offered: a "FIXED" penalty for opening up a gap and a |
|---|
| 716 | "FLOATING" penalty for extending a gap. |
|---|
| 717 | |
|---|
| 718 | |
|---|
| 719 | ********* MULTIPLE ALIGNMENT PARAMETERS ********* |
|---|
| 720 | |
|---|
| 721 | |
|---|
| 722 | 1. Fixed Gap Penalty :10 |
|---|
| 723 | 2. Floating Gap Penalty :10 |
|---|
| 724 | 3. Toggle Transitions (DNA):Weighted |
|---|
| 725 | 4. Protein weight matrix :PAM 250 |
|---|
| 726 | |
|---|
| 727 | H. HELP |
|---|
| 728 | |
|---|
| 729 | |
|---|
| 730 | Enter number (or [RETURN] to exit): |
|---|
| 731 | |
|---|
| 732 | |
|---|
| 733 | FIXED GAP PENALTY: Reduce this to encourage gaps of all sizes; |
|---|
| 734 | increase it to discourage them. Terminal gaps are penalised same |
|---|
| 735 | as all others. BEWARE of making this too small (approx 5 or so); if |
|---|
| 736 | the penalty is too small, the program may prefer to align each |
|---|
| 737 | sequence opposite one long gap. |
|---|
| 738 | |
|---|
| 739 | FLOATING GAP PENALTY: Reduce this to encourage longer gaps; |
|---|
| 740 | increase it to shorten them. Terminal gaps are penalised same as |
|---|
| 741 | all others. BEWARE of making this too small (approx 5 or so); if |
|---|
| 742 | the penalty is too small, the program may prefer to align each |
|---|
| 743 | sequence opposite one long gap. |
|---|
| 744 | |
|---|
| 745 | |
|---|
| 746 | DNA TRANSITIONS = WEIGHTED or UNWEIGHTED: By default, transitions |
|---|
| 747 | (A versus G; C versus T) are weighted more strongly than |
|---|
| 748 | transversions (an A aligned with a G will be preferred to an A |
|---|
| 749 | aligned with a C or a T). You can make all pairs of nucleotide |
|---|
| 750 | equally weighted with this option. |
|---|
| 751 | |
|---|
| 752 | PROTEIN WEIGHT MATRIX: For protein comparisons, a weight matrix is |
|---|
| 753 | used to differentially weight different pairs of aligned amino |
|---|
| 754 | acids. The default is the well known Dayhoff PAM 250 matrix. We |
|---|
| 755 | also offer a PAM 100 matrix, an identity matrix (all weights are the |
|---|
| 756 | same for exact matches) or allow you to give the name of a file with |
|---|
| 757 | your own matrix. The weight matrices used by Clustal V are shown in |
|---|
| 758 | full in the Algorithms and References section of this documentation. |
|---|
| 759 | |
|---|
| 760 | If you input a matrix from a file, it must be in the following |
|---|
| 761 | format. Use a 20x20 matrix only (entries for the 20 "normal" amino |
|---|
| 762 | acids only; no ambiguity codes etc.). Input the lower left triangle |
|---|
| 763 | of the matrix, INCLUDING the diagonal. The order of the amino acids |
|---|
| 764 | (rows and columns) must be: CSTPAGNDEQHRKMILVFYW. The values can be |
|---|
| 765 | in free format seperated by spaces (not commas). The PAM 250 matrix |
|---|
| 766 | is shown below in this format. |
|---|
| 767 | |
|---|
| 768 | 12 |
|---|
| 769 | 0 2 |
|---|
| 770 | -2 1 3 |
|---|
| 771 | -3 1 0 6 |
|---|
| 772 | -2 1 1 1 2 |
|---|
| 773 | -3 1 0 -1 1 5 |
|---|
| 774 | -4 1 0 -1 0 0 2 |
|---|
| 775 | -5 0 0 -1 0 1 2 4 |
|---|
| 776 | -5 0 0 -1 0 0 1 3 4 |
|---|
| 777 | -5 -1 -1 0 0 -1 1 2 2 4 |
|---|
| 778 | -3 -1 -1 0 -1 -2 2 1 1 3 6 |
|---|
| 779 | -4 0 -1 0 -2 -3 0 -1 -1 1 2 6 |
|---|
| 780 | -5 0 0 -1 -1 -2 1 0 0 1 0 3 5 |
|---|
| 781 | -5 -2 -1 -2 -1 -3 -2 -3 -2 -1 -2 0 0 6 |
|---|
| 782 | -2 -1 0 -2 -1 -3 -2 -2 -2 -2 -2 -2 -2 2 5 |
|---|
| 783 | -6 -3 -2 -3 -2 -4 -3 -4 -3 -2 -2 -3 -3 4 2 6 |
|---|
| 784 | -2 -1 0 -1 0 -1 -2 -2 -2 -2 -2 -2 -2 2 4 2 4 |
|---|
| 785 | -4 -3 -3 -5 -4 -5 -4 -6 -5 -5 -2 -4 -5 0 1 2 -1 9 |
|---|
| 786 | 0 -3 -3 -5 -3 -5 -2 -4 -4 -4 0 -4 -4 -2 -1 -1 -2 7 10 |
|---|
| 787 | -8 -2 -5 -6 -6 -7 -4 -7 -7 -5 -3 2 -3 -4 -5 -2 -6 0 0 17 |
|---|
| 788 | |
|---|
| 789 | Values must be integers and can be all positive or positive and |
|---|
| 790 | negative as above. These are SIMILARITY values. |
|---|
| 791 | |
|---|
| 792 | |
|---|
| 793 | |
|---|
| 794 | |
|---|
| 795 | ALIGNMENT OUTPUT OPTIONS: |
|---|
| 796 | |
|---|
| 797 | By default, the alignment goes to a file in a self explanatory |
|---|
| 798 | "blocked" alignment format. This format is fine for displaying the |
|---|
| 799 | results but requires heavy editing if you wish to use the alignment |
|---|
| 800 | with other software. To help, we provide 3 other formats which can |
|---|
| 801 | be turned on or off. If you have a sequence data set or alignment |
|---|
| 802 | in memory, you can also ask for output files in whatever formats are |
|---|
| 803 | turned on, NOW. The menu you use to choose format is shown below. |
|---|
| 804 | |
|---|
| 805 | *** |
|---|
| 806 | We draw your attention to NBRF/PIR format in particular. This |
|---|
| 807 | format is EXACTLY the same as one of the input formats. Therefore, |
|---|
| 808 | alignments written in this format can be used again as input (to the |
|---|
| 809 | profile alignments or phylogenetic trees). |
|---|
| 810 | *** |
|---|
| 811 | |
|---|
| 812 | |
|---|
| 813 | ********* Format of Alignment Output ********* |
|---|
| 814 | |
|---|
| 815 | |
|---|
| 816 | 1. Toggle CLUSTAL format output = ON |
|---|
| 817 | 2. Toggle NBRF/PIR format output = OFF |
|---|
| 818 | 3. Toggle GCG format output = OFF |
|---|
| 819 | 4. Toggle PHYLIP format output = OFF |
|---|
| 820 | |
|---|
| 821 | 5. Create alignment output file(s) now? |
|---|
| 822 | H. HELP |
|---|
| 823 | |
|---|
| 824 | |
|---|
| 825 | Enter number (or [RETURN] to exit): |
|---|
| 826 | |
|---|
| 827 | |
|---|
| 828 | |
|---|
| 829 | CLUSTAL FORMAT: This is a self explanatory alignment. The |
|---|
| 830 | alignment is written out in blocks. Identities are highlighted and |
|---|
| 831 | (if you use a PAM 250 matrix) positions in the alignment where all |
|---|
| 832 | of the residues are "similar" to each other (PAM 250 score of 8 or |
|---|
| 833 | more) are indicated. |
|---|
| 834 | |
|---|
| 835 | NBRF/PIR FORMAT: This is the usual NBRF/PIR format with gaps |
|---|
| 836 | indicated by hyphens ("-"). AS we have stressed before, this format |
|---|
| 837 | is EXACTLY compatible with the sequence input format. Therefore you |
|---|
| 838 | can read in these alignments again for profile alignments or for |
|---|
| 839 | calculating phylogenetic trees. |
|---|
| 840 | |
|---|
| 841 | GCG FORMAT: In version 7 of the Wisconsin GCG package, a new |
|---|
| 842 | multiple sequence format was introduced. This is the MSF (Multiple |
|---|
| 843 | Sequence Format) format. It can be used as input to the GCG |
|---|
| 844 | sequence editor or any of the GCG programs that make use of multiple |
|---|
| 845 | alignments. THIS FORMAT IS ONLY SUPPORTED IN VERSION 7 OF THE GCG |
|---|
| 846 | PACKAGE OR LATER. |
|---|
| 847 | |
|---|
| 848 | PHYLIP FORMAT: This format can be used by the Phylip package of |
|---|
| 849 | Joe Felsenstein (see the references/algorithms section for details |
|---|
| 850 | of how to get it). Phylip allows you to do a huge range of |
|---|
| 851 | phylogenetic analyses (we just offer one method in this program) and |
|---|
| 852 | is probably the most widely used set of programs for drawing trees. |
|---|
| 853 | It also works on just about every computer you can think of, |
|---|
| 854 | providing you have a decent Pascal compiler. |
|---|
| 855 | |
|---|
| 856 | |
|---|
| 857 | |
|---|
| 858 | |
|---|
| 859 | |
|---|
| 860 | ****************************** |
|---|
| 861 | * PROFILE ALIGNMENT MENU. * |
|---|
| 862 | ****************************** |
|---|
| 863 | |
|---|
| 864 | |
|---|
| 865 | |
|---|
| 866 | This menu is for taking two old alignments (or single sequences) and |
|---|
| 867 | aligning them with each other. The result is one bigger alignment. |
|---|
| 868 | The menu is very similar to the multiple alignment menu except that |
|---|
| 869 | there is no mention of dendrograms here (they are not needed) and |
|---|
| 870 | you need to input two sets of sequences. The menu looks like this: |
|---|
| 871 | |
|---|
| 872 | |
|---|
| 873 | |
|---|
| 874 | ******Profile*Alignment*Menu****** |
|---|
| 875 | |
|---|
| 876 | |
|---|
| 877 | 1. Input 1st. profile/sequence |
|---|
| 878 | 2. Input 2nd. profile/sequence |
|---|
| 879 | 3. Do alignment now |
|---|
| 880 | 4. Alignment parameters |
|---|
| 881 | 5. Output format options |
|---|
| 882 | |
|---|
| 883 | S. Execute a system command |
|---|
| 884 | H. HELP |
|---|
| 885 | or press [RETURN] to go back to main menu |
|---|
| 886 | |
|---|
| 887 | |
|---|
| 888 | Your choice: |
|---|
| 889 | |
|---|
| 890 | |
|---|
| 891 | You must input profile number 1 first. When both profiles are |
|---|
| 892 | loaded, use item 3 (Do alignment now) and the 2 profiles will be |
|---|
| 893 | aligned. Items 4 and 5 (parameters and output options) are |
|---|
| 894 | identical to the equivalent options on the multiple alignment menu. |
|---|
| 895 | |
|---|
| 896 | The same input routines that are used for general input are used |
|---|
| 897 | here i.e. sequences must be in NBRF/PIR, EMBL/SwissProt or FASTA |
|---|
| 898 | format, with gaps indicated by hyphens ("-"). This is why we have |
|---|
| 899 | continualy drawn your attention to the NBRF/PIR format as a useful |
|---|
| 900 | output format. |
|---|
| 901 | |
|---|
| 902 | Either profile can consist of just one sequence. Therefore, if you |
|---|
| 903 | have a favourite alignment of sequences that you are working on and |
|---|
| 904 | wish to add a new sequence, you can use this menu, provided the |
|---|
| 905 | alignment is in the correct format. |
|---|
| 906 | |
|---|
| 907 | The total number of sequences in the two profiles must be less less |
|---|
| 908 | than or equal to the MAXN parameter set in the clustalv.h header |
|---|
| 909 | file. |
|---|
| 910 | |
|---|
| 911 | |
|---|
| 912 | |
|---|
| 913 | |
|---|
| 914 | |
|---|
| 915 | |
|---|
| 916 | |
|---|
| 917 | |
|---|
| 918 | |
|---|
| 919 | |
|---|
| 920 | |
|---|
| 921 | ****************************** |
|---|
| 922 | * PHYLOGENETIC TREE MENU. * |
|---|
| 923 | ****************************** |
|---|
| 924 | |
|---|
| 925 | |
|---|
| 926 | This menu allows you to input an alignment and calculate a |
|---|
| 927 | phylogenetic tree. You can also calculate a tree if you have just |
|---|
| 928 | carried out a multiple alignment and the alignment is still in |
|---|
| 929 | memory. THE SEQUENCES MUST BE ALIGNED ALREADY!!!!!! The tree will |
|---|
| 930 | look strange if the sequences are not already aligned. You can also |
|---|
| 931 | "BOOTSTRAP" the tree to show confidence levels for groupings. This |
|---|
| 932 | is SLOW on microcomputers but works fine on workstations or |
|---|
| 933 | mainframes. |
|---|
| 934 | |
|---|
| 935 | |
|---|
| 936 | |
|---|
| 937 | ******Phylogenetic*tree*Menu****** |
|---|
| 938 | |
|---|
| 939 | |
|---|
| 940 | 1. Input an alignment |
|---|
| 941 | 2. Exclude positions with gaps? = OFF |
|---|
| 942 | 3. Correct for multiple substitutions? = OFF |
|---|
| 943 | 4. Draw tree now |
|---|
| 944 | 5. Bootstrap tree |
|---|
| 945 | |
|---|
| 946 | S. Execute a system command |
|---|
| 947 | H. HELP |
|---|
| 948 | or press [RETURN] to go back to main menu |
|---|
| 949 | |
|---|
| 950 | |
|---|
| 951 | Your choice: |
|---|
| 952 | |
|---|
| 953 | |
|---|
| 954 | |
|---|
| 955 | |
|---|
| 956 | The same input routine that is used for general input is used here |
|---|
| 957 | i.e. sequences must be in NBRF/PIR, EMBL/SwissProt or FASTA format, |
|---|
| 958 | with gaps indicated by hyphens ("-"). This is why we have |
|---|
| 959 | continualy drawn your attention to the NBRF/PIR format as a useful |
|---|
| 960 | output format. |
|---|
| 961 | |
|---|
| 962 | If you have input an alignment, then just use item 4 to draw a tree. |
|---|
| 963 | The method used is the Neighbor Joining method of Saitou and Nei |
|---|
| 964 | (1987). This is a "distance method". First, percent divergence |
|---|
| 965 | figures are calculated between all pairs of sequence. These |
|---|
| 966 | divergence figures are then used by the NJ method to give the tree. |
|---|
| 967 | Example trees will be shown below. |
|---|
| 968 | |
|---|
| 969 | There are two options which can be used to control the way the |
|---|
| 970 | distances are calculated. These are set by options 2 and 3 in the |
|---|
| 971 | menu. |
|---|
| 972 | |
|---|
| 973 | EXCLUDE POSITIONS WITH GAPS? This option allows you to ignore all |
|---|
| 974 | alignment positions (columns) where there is a gap in ANY sequence. |
|---|
| 975 | This guarantees that "like" is compared with "like" in all distances |
|---|
| 976 | i.e. the same positions are used to calculate all distances. It |
|---|
| 977 | also means that the distances will be "metric". The disadvantage of |
|---|
| 978 | using this option is that you throw away much of the data if there |
|---|
| 979 | are many gaps. If the total number of gaps is small, it has little |
|---|
| 980 | effect. |
|---|
| 981 | |
|---|
| 982 | CORRECT FOR MULTIPLE SUBSTITUTIONS? As sequences diverge, |
|---|
| 983 | substitutions accumulate. It becomes increasingly likely that more |
|---|
| 984 | than one substitution (as a result of a mutation) will have happened |
|---|
| 985 | at a site where you observe just one difference now. This option |
|---|
| 986 | allows you to use formulae developed by Motoo Kimura to correct for |
|---|
| 987 | this effect. It has the effect of stretching long branches in tres |
|---|
| 988 | while leaving short ones relatively untouched. The desired effect |
|---|
| 989 | is to try and make distances proportional to time since divergence. |
|---|
| 990 | |
|---|
| 991 | The tree is sent to a file called BLAH.NJ, where BLAH.SEQ is the |
|---|
| 992 | name of the input, alignment file. An example is shown below for 6 |
|---|
| 993 | globin sequences. |
|---|
| 994 | |
|---|
| 995 | |
|---|
| 996 | |
|---|
| 997 | DIST = percentage divergence (/100) |
|---|
| 998 | Length = number of sites used in comparison |
|---|
| 999 | |
|---|
| 1000 | 1 vs. 2 DIST = 0.5683; length = 139 |
|---|
| 1001 | 1 vs. 3 DIST = 0.5540; length = 139 |
|---|
| 1002 | 1 vs. 4 DIST = 0.5315; length = 111 |
|---|
| 1003 | 1 vs. 5 DIST = 0.7447; length = 141 |
|---|
| 1004 | 1 vs. 6 DIST = 0.7571; length = 140 |
|---|
| 1005 | 2 vs. 3 DIST = 0.0897; length = 145 |
|---|
| 1006 | 2 vs. 4 DIST = 0.1391; length = 115 |
|---|
| 1007 | 2 vs. 5 DIST = 0.7517; length = 145 |
|---|
| 1008 | 2 vs. 6 DIST = 0.7431; length = 144 |
|---|
| 1009 | 3 vs. 4 DIST = 0.0957; length = 115 |
|---|
| 1010 | 3 vs. 5 DIST = 0.7379; length = 145 |
|---|
| 1011 | 3 vs. 6 DIST = 0.7361; length = 144 |
|---|
| 1012 | 4 vs. 5 DIST = 0.7304; length = 115 |
|---|
| 1013 | 4 vs. 6 DIST = 0.7368; length = 114 |
|---|
| 1014 | 5 vs. 6 DIST = 0.2697; length = 152 |
|---|
| 1015 | |
|---|
| 1016 | |
|---|
| 1017 | Neighbor-joining Method |
|---|
| 1018 | |
|---|
| 1019 | Saitou, N. and Nei, M. (1987) The Neighbor-joining Method: |
|---|
| 1020 | A New Method for Reconstructing Phylogenetic Trees. |
|---|
| 1021 | Mol. Biol. Evol., 4(4), 406-425 |
|---|
| 1022 | |
|---|
| 1023 | |
|---|
| 1024 | This is an UNROOTED tree |
|---|
| 1025 | |
|---|
| 1026 | Numbers in parentheses are branch lengths |
|---|
| 1027 | |
|---|
| 1028 | |
|---|
| 1029 | Cycle 1 = SEQ: 5 ( 0.13382) joins SEQ: 6 ( 0.13592) |
|---|
| 1030 | |
|---|
| 1031 | Cycle 2 = SEQ: 1 ( 0.28142) joins Node: 5 ( 0.33462) |
|---|
| 1032 | |
|---|
| 1033 | Cycle 3 = SEQ: 2 ( 0.05879) joins SEQ: 3 ( 0.03086) |
|---|
| 1034 | |
|---|
| 1035 | Cycle 4 (Last cycle, trichotomy): |
|---|
| 1036 | |
|---|
| 1037 | Node: 1 ( 0.20798) joins |
|---|
| 1038 | Node: 2 ( 0.02341) joins |
|---|
| 1039 | SEQ: 4 ( 0.04915) |
|---|
| 1040 | |
|---|
| 1041 | |
|---|
| 1042 | |
|---|
| 1043 | The output file first shows the percent divergence (distance) |
|---|
| 1044 | figures between each pair of sequence. Then a description of a NJ |
|---|
| 1045 | tree is given. This description shows which sequences (SEQ:) or |
|---|
| 1046 | which groups of sequences (NODE: , a node is numbered using the |
|---|
| 1047 | lowest sequence that belongs to it) join at each level of the tree. |
|---|
| 1048 | |
|---|
| 1049 | This is an unrooted tree!! This means that the direction of |
|---|
| 1050 | evolution through the tree is not shown. This can only be inferred |
|---|
| 1051 | in one of two ways: |
|---|
| 1052 | 1) assume a degree of constancy in the molecular clock and place the |
|---|
| 1053 | root (bottom of the tree; the point where all the sequences radiate |
|---|
| 1054 | from) half way along the longest branch. **OR** |
|---|
| 1055 | 2) use an "outgroup", a sequence from an organism that you "know" |
|---|
| 1056 | must be outside of the rest of the sequences i.e. root the tree |
|---|
| 1057 | manually, on biological grounds. |
|---|
| 1058 | |
|---|
| 1059 | The above tree can be represented diagramatically as follows: |
|---|
| 1060 | |
|---|
| 1061 | |
|---|
| 1062 | SEQ 1 SEQ 4 |
|---|
| 1063 | I I |
|---|
| 1064 | 13.6 I 28.1 I 4.9 5.9 |
|---|
| 1065 | SEQ 6 ----------I I I I--------- SEQ 2 |
|---|
| 1066 | I I I I |
|---|
| 1067 | I--------I-----------I----------I |
|---|
| 1068 | 13.4 I 33.5 20.8 2.3 I 3.1 |
|---|
| 1069 | SEQ 5 ----------I I--------- SEQ 3 |
|---|
| 1070 | |
|---|
| 1071 | |
|---|
| 1072 | The figures along each branch are percent divergences along that |
|---|
| 1073 | branch. If you root the tree by placing the root along the longest |
|---|
| 1074 | branch (33.5%) then you can draw it again as follows, this time |
|---|
| 1075 | rooted: |
|---|
| 1076 | |
|---|
| 1077 | |
|---|
| 1078 | |
|---|
| 1079 | 13.6 |
|---|
| 1080 | I-------------------- SEQ 6 |
|---|
| 1081 | I---------I 13.4 |
|---|
| 1082 | I I-------------------- SEQ 5 |
|---|
| 1083 | I 33.5 |
|---|
| 1084 | -----I 28.1 |
|---|
| 1085 | I I-------------------- SEQ 1 |
|---|
| 1086 | I I |
|---|
| 1087 | I---------I 4.9 |
|---|
| 1088 | I 20.8 I----------- SEQ 4 |
|---|
| 1089 | I--------I |
|---|
| 1090 | I 5.9 |
|---|
| 1091 | I 2.3 I----- SEQ 2 |
|---|
| 1092 | I-----I 3.1 |
|---|
| 1093 | I----- SEQ 3 |
|---|
| 1094 | |
|---|
| 1095 | |
|---|
| 1096 | |
|---|
| 1097 | The longest branch (33.5% between 5,6 and 1,2,3,4) is split between |
|---|
| 1098 | the 2 bottom branches of the tree. As it happens in this particular |
|---|
| 1099 | case, sequences 5 and 6 are myoglobins while sequences 1,2,3 and 4 |
|---|
| 1100 | are alpha and beta globins, so you could also justify the above |
|---|
| 1101 | rooting on biological grounds. If you do not have any particular |
|---|
| 1102 | need or evidence for the position of the root, then LEAVE THE TREE |
|---|
| 1103 | UNROOTED. Unrooted trees do not look as pretty as rooted ones but |
|---|
| 1104 | it is uaual to leave them unrooted if you do not have any evidence |
|---|
| 1105 | for the position of the root. |
|---|
| 1106 | |
|---|
| 1107 | |
|---|
| 1108 | BOTSTRAPPING: Different sets of sequences and different tree |
|---|
| 1109 | drawing methods may give different topologies (branching orders) for |
|---|
| 1110 | parts of a tree that are weakly supported by the data. It is useful |
|---|
| 1111 | to have an indication of the degree of error in the tree. There are |
|---|
| 1112 | several ways of doing this, some of them rather technical. We |
|---|
| 1113 | provide one general purpose method in this program, which makes use |
|---|
| 1114 | of a technique called bootstrapping (see Felsenstein, 1985). |
|---|
| 1115 | |
|---|
| 1116 | In the case of sequence alignments, bootstrapping involves taking |
|---|
| 1117 | random samples of positions from the alignment. If the alignment |
|---|
| 1118 | has N positions, each bootstrap sample consists of a random sample |
|---|
| 1119 | of N positions, taken WITH REPLACEMENT i.e. in any given sample, |
|---|
| 1120 | some sites may be sampled several times, others not at all. Then, |
|---|
| 1121 | with each sample of sites, you calculate a distance matrix as usual |
|---|
| 1122 | and draw a tree. If the data very strongly support just one tree |
|---|
| 1123 | then the sample trees will be very similar to each other and to the |
|---|
| 1124 | original tree, drawn without bootstrapping. However, if parts of |
|---|
| 1125 | the tree are not well supported, then the sample trees will vary |
|---|
| 1126 | considerably in how they represent these parts. |
|---|
| 1127 | |
|---|
| 1128 | In practice, you should use a very large number of bootstrap |
|---|
| 1129 | replicates (1000 is recommended, even if it means running the |
|---|
| 1130 | program for an hour on a slow microcomputer; on a workstation it |
|---|
| 1131 | will be MUCH faster). For each grouping on the tree, you record the |
|---|
| 1132 | number of times this grouping occurs in the sample trees. For a |
|---|
| 1133 | group to be considered "significant" at the 95% level (or P <= 0.05 |
|---|
| 1134 | in statistical terms) you expect the grouping to show up in >= 95% |
|---|
| 1135 | of the sample trees. If this happens, then you can say that the |
|---|
| 1136 | grouping is significant, given the data set and the method used to |
|---|
| 1137 | draw the tree. |
|---|
| 1138 | |
|---|
| 1139 | So, when you use the bootstrap option, a NJ tree is drawn as before |
|---|
| 1140 | and then you are asked to say how many bootstrap samples you want |
|---|
| 1141 | (1000 is the default) and you are asked to give a seed number for |
|---|
| 1142 | the random number generator. If you give the same seed number in |
|---|
| 1143 | future, you will get the same results (we hope). Remember to give |
|---|
| 1144 | different seed numbers if you wish to carry out genuinely different |
|---|
| 1145 | bootstrap sampling experiments. Below is the output file from using |
|---|
| 1146 | the same data for the 6 globin sequences as used before. The output |
|---|
| 1147 | file has the same name as the input fike with the extension ".njb". |
|---|
| 1148 | |
|---|
| 1149 | // |
|---|
| 1150 | STUFF DELETED .... same as for the ordinary NJ output |
|---|
| 1151 | // |
|---|
| 1152 | Bootstrap Confidence Limits |
|---|
| 1153 | |
|---|
| 1154 | |
|---|
| 1155 | Random number generator seed = 99 |
|---|
| 1156 | |
|---|
| 1157 | Number of bootstrap trials = 1000 |
|---|
| 1158 | |
|---|
| 1159 | |
|---|
| 1160 | Diagrammatic representation of the above tree: |
|---|
| 1161 | |
|---|
| 1162 | Each row represents 1 tree cycle; defining 2 groups. |
|---|
| 1163 | |
|---|
| 1164 | Each column is 1 sequence; the stars in each line show 1 group; |
|---|
| 1165 | the dots show the other |
|---|
| 1166 | |
|---|
| 1167 | Numbers show occurences in bootstrap samples. |
|---|
| 1168 | |
|---|
| 1169 | ****.. 1000 |
|---|
| 1170 | .***.. 1000 <- This is the answer!! |
|---|
| 1171 | *..*** 812 |
|---|
| 1172 | 122311 |
|---|
| 1173 | |
|---|
| 1174 | |
|---|
| 1175 | For an unrooted tree with N sequences, there are actually only N-3 |
|---|
| 1176 | genuinely different groupings that we can test (this is the number |
|---|
| 1177 | of "internal branches"; each internal branch splits the sequences |
|---|
| 1178 | into 2 groups). In this example, we have 6 sequences with 3 |
|---|
| 1179 | internal branches in the reference tree. In the bootstrap |
|---|
| 1180 | resampling, we count how often each of these internal branches |
|---|
| 1181 | occur. Here, we find that the branch which splits 1,2,3 and 4 |
|---|
| 1182 | versus 5 and 6 occurs in all 1000 samples; the branch which splits |
|---|
| 1183 | 2,3 and 4 versus 1,5 and 6 occurs in 1000; the branch which splits 2 |
|---|
| 1184 | and 3 versus 1,4,5 and 6 occurs in 812/1000 samples. We can put |
|---|
| 1185 | these figures on to the diagrammatic representation we made earlier |
|---|
| 1186 | of our unrooted NJ tree as follows: |
|---|
| 1187 | |
|---|
| 1188 | |
|---|
| 1189 | |
|---|
| 1190 | SEQ 1 SEQ 4 |
|---|
| 1191 | I I |
|---|
| 1192 | I I |
|---|
| 1193 | SEQ 6 ----------I I I I--------- SEQ 2 |
|---|
| 1194 | I 1000 I 1000 I 812 I |
|---|
| 1195 | I--------I-----------I----------I |
|---|
| 1196 | I I |
|---|
| 1197 | SEQ 5 ----------I I--------- SEQ 3 |
|---|
| 1198 | |
|---|
| 1199 | |
|---|
| 1200 | |
|---|
| 1201 | You can equally put these confidence figures on the rooted tree (in |
|---|
| 1202 | fact the interpretation is simpler with rooted trees). With the |
|---|
| 1203 | unrooted tree, the grouping of sequence 5 with 6 is significant (as |
|---|
| 1204 | is the grouping of sequences 1,2,3 and 4). Equally the grouping of |
|---|
| 1205 | sequences 1,5 and 6 is significant (the same as saying that 2,3 and |
|---|
| 1206 | 4 group significantly). However, the grouping of 2 and 3 is not |
|---|
| 1207 | significant, although it is relatively strongly supported. |
|---|
| 1208 | |
|---|
| 1209 | Unfortunately, there is a small complication in the interpretation |
|---|
| 1210 | of these results. In statistical hypothesis testing, it is not |
|---|
| 1211 | valid to make multiple simultaneous tests and to treat the result of |
|---|
| 1212 | each test completely independantly. In the above case, if you have |
|---|
| 1213 | one particular test (grouping) that you wish to make in advance, it |
|---|
| 1214 | is valid to test IT ALONE and to simply show the other bootstrap |
|---|
| 1215 | figures for reference. If you do not have any particular test in |
|---|
| 1216 | mind before you do the bootstrapping, you can just show all of the |
|---|
| 1217 | figures and use the 95% level as an ARBITRARY cut off to show those |
|---|
| 1218 | groups that are very strongly supported; but not mention anything |
|---|
| 1219 | about SIGNIFICANCE testing. In the literature, it is common |
|---|
| 1220 | practice to simply show the figures with a tree; they frequently |
|---|
| 1221 | speak for themselves. |
|---|
| 1222 | |
|---|
| 1223 | |
|---|
| 1224 | |
|---|
| 1225 | ******************************************************************* |
|---|
| 1226 | |
|---|
| 1227 | 4. Command Line Interface. |
|---|
| 1228 | |
|---|
| 1229 | |
|---|
| 1230 | |
|---|
| 1231 | You can do almost everything that can be done from the menus, using |
|---|
| 1232 | a command line interface. In this mode, the program will take all of |
|---|
| 1233 | its instructions as "switches" when you activate it; no questions |
|---|
| 1234 | will be asked; if there are no errors, the program just does an |
|---|
| 1235 | analysis and stops. It does not work so well on the MAC but is |
|---|
| 1236 | still possible. To get you started we will show you the 2 simplest |
|---|
| 1237 | uses of the command line as it looks on VAX/VMS. On all other |
|---|
| 1238 | machines (except the MAC) it works in the same way. |
|---|
| 1239 | |
|---|
| 1240 | $ clustalv /help **OR** $ clustalv /check |
|---|
| 1241 | |
|---|
| 1242 | Both of the above switches give you a one page summary of the |
|---|
| 1243 | command line on the screen and then the program stops. |
|---|
| 1244 | |
|---|
| 1245 | |
|---|
| 1246 | $ clustalv proteins.seq **OR** $ clustalv /infile=proteins.seq |
|---|
| 1247 | |
|---|
| 1248 | This will read the sequences from the file 'proteins.seq' and do a |
|---|
| 1249 | complete multiple alignment. Default parameters will be used, the |
|---|
| 1250 | program will try to tell whether or not the sequences are DNA or |
|---|
| 1251 | protein and the output will go to a file called 'proteins.aln' . A |
|---|
| 1252 | dendrogram file called 'proteins.dnd' will also be created. Thus |
|---|
| 1253 | the default action for the program, when it successfully reads in an |
|---|
| 1254 | input file is to do a full multiple alignment. Some further |
|---|
| 1255 | examples of command line usage will be given leter. |
|---|
| 1256 | |
|---|
| 1257 | Command line switches can be abbreviated but MAKE SURE YOU DO NOT |
|---|
| 1258 | MAKE THEM AMBIGUOUS. No attempt will be made to detect ambiguity. |
|---|
| 1259 | Use enough characters to distinguish each switch uniquely. |
|---|
| 1260 | |
|---|
| 1261 | |
|---|
| 1262 | |
|---|
| 1263 | |
|---|
| 1264 | |
|---|
| 1265 | |
|---|
| 1266 | |
|---|
| 1267 | The full list of allowed switches is given below: |
|---|
| 1268 | |
|---|
| 1269 | |
|---|
| 1270 | DATA (sequences) |
|---|
| 1271 | |
|---|
| 1272 | /INFILE=file.ext :input sequences. If you give an input file and |
|---|
| 1273 | nothing else as a switch, the default action is |
|---|
| 1274 | to do a complete multiple alignment. The input |
|---|
| 1275 | file can also be specified by giving it as the |
|---|
| 1276 | first command line parameter with no "/" in |
|---|
| 1277 | front of it e.g $ clustalv file.ext . |
|---|
| 1278 | |
|---|
| 1279 | /PROFILE1=file.ext :You use these two switches to give the names of |
|---|
| 1280 | /PROFILE2=file.ext two profiles. The default action is to align |
|---|
| 1281 | the two. You must give the names of both profile |
|---|
| 1282 | files. |
|---|
| 1283 | |
|---|
| 1284 | |
|---|
| 1285 | |
|---|
| 1286 | VERBS (do things) |
|---|
| 1287 | |
|---|
| 1288 | /HELP :list the command line parameters on the screen. |
|---|
| 1289 | /CHECK |
|---|
| 1290 | |
|---|
| 1291 | /ALIGN :do full multiple alignment. This is the default |
|---|
| 1292 | action if no other switches except for input files |
|---|
| 1293 | are given. |
|---|
| 1294 | |
|---|
| 1295 | /TREE :calculate NJ tree. If this is the only action |
|---|
| 1296 | specified (e.g. $ clustalv proteins.seq/tree ) it IS |
|---|
| 1297 | ASSUMED THAT THE SEQUENCES ARE ALREADY ALIGNED. If |
|---|
| 1298 | the sequences are not already aligned, you should |
|---|
| 1299 | also give the /ALIGN switch. This will align the |
|---|
| 1300 | sequences first, output an alignment file and |
|---|
| 1301 | calculate the tree in memory. |
|---|
| 1302 | |
|---|
| 1303 | /BOOTSTRAP(=n) :bootstrap a NJ tree (n= number of bootstraps; |
|---|
| 1304 | default = 1000). If this is the only action |
|---|
| 1305 | specified (e.g. $ clustalv proteins.seq/bootstrap ) |
|---|
| 1306 | it IS ASSUMED THAT THE SEQUENCES ARE ALREADY ALIGNED. |
|---|
| 1307 | If the sequences are not already aligned, you should |
|---|
| 1308 | also give the /ALIGN switch. This will align the |
|---|
| 1309 | sequences first, output an alignment file and |
|---|
| 1310 | calculate the bootstraps in memory. You can set the |
|---|
| 1311 | number of bootstrap trials here (e.g./bootstrap=500). |
|---|
| 1312 | You can set the seed number for the random number |
|---|
| 1313 | generator with /seed=n. |
|---|
| 1314 | |
|---|
| 1315 | |
|---|
| 1316 | |
|---|
| 1317 | PARAMETERS (set things) |
|---|
| 1318 | |
|---|
| 1319 | ***Pairwise alignments:*** |
|---|
| 1320 | |
|---|
| 1321 | /KTUP=n :word size |
|---|
| 1322 | |
|---|
| 1323 | /TOPDIAGS=n :number of best diagonals |
|---|
| 1324 | |
|---|
| 1325 | /WINDOW=n :window around best diagonals |
|---|
| 1326 | |
|---|
| 1327 | /PAIRGAP=n :gap penalty |
|---|
| 1328 | |
|---|
| 1329 | |
|---|
| 1330 | |
|---|
| 1331 | ***Multiple alignments:*** |
|---|
| 1332 | |
|---|
| 1333 | /FIXEDGAP=n :fixed length gap pen. |
|---|
| 1334 | |
|---|
| 1335 | /FLOATGAP=n :variable length gap pen. |
|---|
| 1336 | |
|---|
| 1337 | /MATRIX= :PAM100 or ID or file name. The default weight matrix |
|---|
| 1338 | for proteins is PAM 250. |
|---|
| 1339 | |
|---|
| 1340 | /TYPE=p or d :type is protein or DNA. This allows you to |
|---|
| 1341 | explicitely overide the programs attempt at guessing |
|---|
| 1342 | the type of the sequence. It is only useful if you |
|---|
| 1343 | are using sequences with a VERY strange composition. |
|---|
| 1344 | |
|---|
| 1345 | /OUTPUT= :GCG or PHYLIP or PIR. The default output is |
|---|
| 1346 | Clustal format. |
|---|
| 1347 | |
|---|
| 1348 | /TRANSIT :transitions not weighted. The default is to weight |
|---|
| 1349 | transitions as more favourable than other mismatches |
|---|
| 1350 | in DNA alignments. This switch makes all nucleotide |
|---|
| 1351 | mismatches equally weighted. |
|---|
| 1352 | |
|---|
| 1353 | |
|---|
| 1354 | ***Trees:*** |
|---|
| 1355 | |
|---|
| 1356 | /KIMURA :use Kimura's correction on distances. |
|---|
| 1357 | |
|---|
| 1358 | /TOSSGAPS :ignore positions with a gap in ANY sequence. |
|---|
| 1359 | |
|---|
| 1360 | /SEED=n :seed number for bootstraps. |
|---|
| 1361 | |
|---|
| 1362 | |
|---|
| 1363 | |
|---|
| 1364 | |
|---|
| 1365 | EXAMPLES: |
|---|
| 1366 | |
|---|
| 1367 | These examples use the VAX/VMS $ prompt; otherwise, command-line |
|---|
| 1368 | usage is the same on all machines except the Macintosh. |
|---|
| 1369 | |
|---|
| 1370 | |
|---|
| 1371 | $ clustalv proteins.seq OR $ clustalv /infile=proteins.seq |
|---|
| 1372 | |
|---|
| 1373 | Read whatever sequences are in the file "proteins.seq" and do a full |
|---|
| 1374 | multiple alignment; output will go to the files: "proteins.dnd" |
|---|
| 1375 | (dendrogram) and "proteins.aln" (alignment). |
|---|
| 1376 | |
|---|
| 1377 | |
|---|
| 1378 | $ clustalv proteins.seq/ktup=2/matrix=pam100/output=pir |
|---|
| 1379 | |
|---|
| 1380 | Same as last example but use K-Tuple size of 2; use a PAM 100 |
|---|
| 1381 | protein weight matrix; write the alignment out in NBRF/PIR format |
|---|
| 1382 | (goes to a file called "proteins.pir"). |
|---|
| 1383 | |
|---|
| 1384 | |
|---|
| 1385 | $ clustalv /profile1=proteins.seq/profile2=more.seq/type=p/fixed=11 |
|---|
| 1386 | |
|---|
| 1387 | Take the alignment in "proteins.seq" and align it with "more.seq" |
|---|
| 1388 | using default values for everything except the fixed gap penalty |
|---|
| 1389 | which is set to 11. The sequence type is explicitely set to |
|---|
| 1390 | PROTEIN. |
|---|
| 1391 | |
|---|
| 1392 | |
|---|
| 1393 | $ clustalv proteins.pir/tree/kimura |
|---|
| 1394 | |
|---|
| 1395 | Take the sequences in proteins.pir (they MUST BE ALIGNED ALREADY) |
|---|
| 1396 | and calculate a phylogenetic tree using Kimura's correction for |
|---|
| 1397 | distances. |
|---|
| 1398 | |
|---|
| 1399 | |
|---|
| 1400 | $ clustalv proteins.pir/align/tree/kimura |
|---|
| 1401 | |
|---|
| 1402 | Same as the previous example, EXCEPT THAT AN ALIGNMENT IS DONE |
|---|
| 1403 | FIRST. |
|---|
| 1404 | |
|---|
| 1405 | |
|---|
| 1406 | $ clustalv proteins.seq/align/boot=500/seed=99/tossgaps/type=p |
|---|
| 1407 | |
|---|
| 1408 | Take the sequences in proteins.seq; they are explicitely set to be |
|---|
| 1409 | protein; align them; bootstrap a tree using 500 samples and a seed |
|---|
| 1410 | number of 99. |
|---|
| 1411 | |
|---|
| 1412 | |
|---|
| 1413 | ******************************************************************* |
|---|
| 1414 | |
|---|
| 1415 | 5. Algorithms and references. |
|---|
| 1416 | |
|---|
| 1417 | |
|---|
| 1418 | |
|---|
| 1419 | In this section, we will try to BRIEFLY describe the algorithms used |
|---|
| 1420 | in ClustalV and give references. The topics covered are: |
|---|
| 1421 | |
|---|
| 1422 | |
|---|
| 1423 | -Multiple alignments |
|---|
| 1424 | |
|---|
| 1425 | -Profile alignments |
|---|
| 1426 | |
|---|
| 1427 | -Protein weight matrices |
|---|
| 1428 | |
|---|
| 1429 | -Phylogenetic trees |
|---|
| 1430 | |
|---|
| 1431 | -distances |
|---|
| 1432 | |
|---|
| 1433 | -NJ method |
|---|
| 1434 | |
|---|
| 1435 | -Bootstrapping |
|---|
| 1436 | |
|---|
| 1437 | -Phylip |
|---|
| 1438 | |
|---|
| 1439 | -References |
|---|
| 1440 | |
|---|
| 1441 | |
|---|
| 1442 | |
|---|
| 1443 | |
|---|
| 1444 | |
|---|
| 1445 | |
|---|
| 1446 | MULTIPLE ALIGNMENTS. |
|---|
| 1447 | |
|---|
| 1448 | The approach used in ClustalV is a modified version of the method of |
|---|
| 1449 | Feng and Doolittle (1987) who aligned the sequences in larger and |
|---|
| 1450 | larger groups according to the branching order in an initial |
|---|
| 1451 | phylogenetic tree. This approach allows a very useful combination |
|---|
| 1452 | of computational tractability and sensitivity. |
|---|
| 1453 | |
|---|
| 1454 | The positions of gaps that are generated in early alignments remain |
|---|
| 1455 | through later stages. This can be justified because gaps that arise |
|---|
| 1456 | from the comparison of closely related sequences should not be moved |
|---|
| 1457 | because of later alignment with more distantly related sequences. |
|---|
| 1458 | At each alignment stage, you align two groups of already aligned |
|---|
| 1459 | sequences. This is done using a dynamic programming algorithm where |
|---|
| 1460 | one allows the residues that occur in every sequence at each |
|---|
| 1461 | alignment position to contribute to the alignment score. A Dayhoff |
|---|
| 1462 | (1978) PAM matrix is used in protein comparisons. |
|---|
| 1463 | |
|---|
| 1464 | The details of the algorithm used in ClustalV have been published in |
|---|
| 1465 | Higgins and Sharp (1989). This was an improved version of an |
|---|
| 1466 | earlier algorithm published in Higgins and Sharp (1988). First, you |
|---|
| 1467 | calculate a crude similarity measure between every pair of sequence. |
|---|
| 1468 | This is done using the fast, approximate alignment algorithm of |
|---|
| 1469 | Wilbur and Lipman (1983). Then, these scores are used to calculate |
|---|
| 1470 | a "guide tree" or dendrogram, which will tell the multiple alignment |
|---|
| 1471 | stage in which order to align the sequences for the final multiple |
|---|
| 1472 | alignment. This "guide tree" is calculated using the UPGMA method |
|---|
| 1473 | of Sneath and Sokal (1973). UPGMA is a fancy name for one type of |
|---|
| 1474 | average linkage cluster analysis, invented by Sokal and Michener |
|---|
| 1475 | (1958). |
|---|
| 1476 | |
|---|
| 1477 | Having calculated the dendrogram, the sequences are aligned in |
|---|
| 1478 | larger and larger groups. At each alignment stage, we use the |
|---|
| 1479 | algorithm of Myers and Miller (1988) for the optimal alignments. |
|---|
| 1480 | This algorithm is a very memory efficient variation of Gotoh's |
|---|
| 1481 | algorithm (Gotoh, 1982). It is because of this algorithm that |
|---|
| 1482 | ClustalV can work on microcomputers. Each of these alignments |
|---|
| 1483 | consists of aligning 2 alignments, using what we call "profile |
|---|
| 1484 | alignments". |
|---|
| 1485 | |
|---|
| 1486 | |
|---|
| 1487 | PROFILE ALIGNMENTS. |
|---|
| 1488 | |
|---|
| 1489 | We use the term "profile alignment" to describe the alignment of 2 |
|---|
| 1490 | alignments. We use this term because the method is a simple |
|---|
| 1491 | extension of the profile method of Gribskov, et al. (1987) for |
|---|
| 1492 | aligning 1 sequence with an alignment. Normally, with a 2 sequence |
|---|
| 1493 | alignment, you use a weight matrix (e.g. a PAM 250 matrix) to give a |
|---|
| 1494 | score between the pairs of aligned residues. The alignment is |
|---|
| 1495 | considered "optimal" if it gives the best total score for aligned |
|---|
| 1496 | residues minus penalties for any gaps (insertions or deletions) that |
|---|
| 1497 | must be introduced. |
|---|
| 1498 | |
|---|
| 1499 | Profile alignments are a simple extension of 2 sequence alignments |
|---|
| 1500 | in that you can treat each of the two input alignments as single |
|---|
| 1501 | sequences but you calculate the score at aligned positions as the |
|---|
| 1502 | average weight matrix score of all the residues in one alignment |
|---|
| 1503 | versus all those in the other e.g. if you have 2 alignments with I |
|---|
| 1504 | and J sequences respectively; the score at any position is the |
|---|
| 1505 | average of all the I times J scores of the residues compared |
|---|
| 1506 | seperately. Any gaps that are introduced are placed in all of the |
|---|
| 1507 | sequences of an alignment at the same position. The profile |
|---|
| 1508 | alignments offered in the "profile alignment menu" are also |
|---|
| 1509 | calculated in this way. |
|---|
| 1510 | |
|---|
| 1511 | |
|---|
| 1512 | PROTEIN WEIGHT MATRICES. |
|---|
| 1513 | |
|---|
| 1514 | There are 3 built-in weight matrices used by clustalV. These are |
|---|
| 1515 | the PAM 100 and PAM 250 matrices of Dayhoff (1978) and an identity |
|---|
| 1516 | matrix. Each matrix is given as the bottom left half, including the |
|---|
| 1517 | diagonal of a 20 by 20 matrix. The order of the rows and columns is |
|---|
| 1518 | CSTPAGNDEQHRKMILVFYW. |
|---|
| 1519 | |
|---|
| 1520 | |
|---|
| 1521 | PAM 250 |
|---|
| 1522 | |
|---|
| 1523 | C 12 |
|---|
| 1524 | S 0 2 |
|---|
| 1525 | T -2 1 3 |
|---|
| 1526 | P -3 1 0 6 |
|---|
| 1527 | A -2 1 1 1 2 |
|---|
| 1528 | G -3 1 0 -1 1 5 |
|---|
| 1529 | N -4 1 0 -1 0 0 2 |
|---|
| 1530 | D -5 0 0 -1 0 1 2 4 |
|---|
| 1531 | E -5 0 0 -1 0 0 1 3 4 |
|---|
| 1532 | Q -5 -1 -1 0 0 -1 1 2 2 4 |
|---|
| 1533 | H -3 -1 -1 0 -1 -2 2 1 1 3 6 |
|---|
| 1534 | R -4 0 -1 0 -2 -3 0 -1 -1 1 2 6 |
|---|
| 1535 | K -5 0 0 -1 -1 -2 1 0 0 1 0 3 5 |
|---|
| 1536 | M -5 -2 -1 -2 -1 -3 -2 -3 -2 -1 -2 0 0 6 |
|---|
| 1537 | I -2 -1 0 -2 -1 -3 -2 -2 -2 -2 -2 -2 -2 2 5 |
|---|
| 1538 | L -6 -3 -2 -3 -2 -4 -3 -4 -3 -2 -2 -3 -3 4 2 6 |
|---|
| 1539 | V -2 -1 0 -1 0 -1 -2 -2 -2 -2 -2 -2 -2 2 4 2 4 |
|---|
| 1540 | F -4 -3 -3 -5 -4 -5 -4 -6 -5 -5 -2 -4 -5 0 1 2 -1 9 |
|---|
| 1541 | Y 0 -3 -3 -5 -3 -5 -2 -4 -4 -4 0 -4 -4 -2 -1 -1 -2 7 10 |
|---|
| 1542 | W -8 -2 -5 -6 -6 -7 -4 -7 -7 -5 -3 2 -3 -4 -5 -2 -6 0 0 17 |
|---|
| 1543 | ---------------------------------------------------------------- |
|---|
| 1544 | C S T P A G N D E Q H R K M I L V F Y W |
|---|
| 1545 | |
|---|
| 1546 | |
|---|
| 1547 | IDENTITY MATRIX |
|---|
| 1548 | |
|---|
| 1549 | 10 |
|---|
| 1550 | 0 10 |
|---|
| 1551 | 0 0 10 |
|---|
| 1552 | 0 0 0 10 |
|---|
| 1553 | 0 0 0 0 10 |
|---|
| 1554 | 0 0 0 0 1 10 |
|---|
| 1555 | 0 0 0 0 0 0 10 |
|---|
| 1556 | 0 0 0 0 0 0 0 10 |
|---|
| 1557 | 0 0 0 0 0 0 0 0 10 |
|---|
| 1558 | 0 0 0 0 0 0 0 0 0 10 |
|---|
| 1559 | 0 0 0 0 0 0 0 0 0 0 10 |
|---|
| 1560 | 0 0 0 0 0 0 0 0 0 0 0 10 |
|---|
| 1561 | 0 0 0 0 0 0 0 0 0 0 0 0 10 |
|---|
| 1562 | 0 0 0 0 0 0 0 0 0 0 0 0 0 10 |
|---|
| 1563 | 0 0 0 0 0 0 0 0 0 0 0 0 0 0 10 |
|---|
| 1564 | 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 10 |
|---|
| 1565 | 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 10 |
|---|
| 1566 | 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 10 |
|---|
| 1567 | 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 10 |
|---|
| 1568 | 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 10 |
|---|
| 1569 | |
|---|
| 1570 | |
|---|
| 1571 | |
|---|
| 1572 | |
|---|
| 1573 | |
|---|
| 1574 | PAM 100 |
|---|
| 1575 | |
|---|
| 1576 | 14 |
|---|
| 1577 | -1 6 |
|---|
| 1578 | -5 2 7 |
|---|
| 1579 | -6 1 -1 10 |
|---|
| 1580 | -5 2 2 1 6 |
|---|
| 1581 | -8 1 -3 -3 1 8 |
|---|
| 1582 | -8 2 0 -3 -1 -1 7 |
|---|
| 1583 | -11 -1 -2 -4 -1 -1 4 8 |
|---|
| 1584 | -11 -2 -3 -3 0 -2 1 5 8 |
|---|
| 1585 | -11 -3 -3 -1 -2 -5 -1 1 4 9 |
|---|
| 1586 | -6 -4 -5 -2 -5 -7 2 -1 -2 4 11 |
|---|
| 1587 | -6 -1 -4, -2 -5 -8 -3 -6 -5 1 1 10 |
|---|
| 1588 | -11 -2 -1 -4 -4 -5 1 -2 -2 -1 -3 3 8 |
|---|
| 1589 | -11 -4 -2 -6 -3 -8 -5 -8 -6 -2 -7 -2 1 13 |
|---|
| 1590 | -5 -4 -1 -6 -3 -7 -4 -6 -5 -5 -7 -4 4 2 9 |
|---|
| 1591 | -12 -7 -5 -5 -5 -8 -6 -9 -7 -3 -5 -7 -6 4 2 9 |
|---|
| 1592 | -4 -4 -1 -4 0 -4 -5 -6 -5 -5 -6 -6 -6 1 5 1 8 |
|---|
| 1593 | -10 -5 -6 -9 -7 -8 -6 -11 -11 -10 -4 -7-11 -2 0 0 -5 12 |
|---|
| 1594 | -2 -6 -6 -11 -6 -11 -3 -9 -7 -9 -1-10-10 -8 -4 -5 -6 6 13 |
|---|
| 1595 | -13 -4 -10 -11 -11 -13 -8 -13 -14 -11 -7 1 -9-11-12 -7-14 -2 -2 19 |
|---|
| 1596 | |
|---|
| 1597 | |
|---|
| 1598 | |
|---|
| 1599 | |
|---|
| 1600 | PHYLOGENETIC TREES. |
|---|
| 1601 | |
|---|
| 1602 | There are two COMMONLY used approaches for inferring phylogentic |
|---|
| 1603 | trees from sequence data: parsimony and distance methods. There are |
|---|
| 1604 | other approaches which are probably superior in theory but which are |
|---|
| 1605 | yet to be used widely. This does not mean that they are no use; we |
|---|
| 1606 | (the authors of this program at any rate) simply do not know enough |
|---|
| 1607 | about them yet. You should see the documentation accompanying the |
|---|
| 1608 | Phylip package and some of the references there for an explanation |
|---|
| 1609 | of the different methods and what assumptions are implied when you |
|---|
| 1610 | use them. |
|---|
| 1611 | |
|---|
| 1612 | There is a constant debate in the literature as to the merits of |
|---|
| 1613 | different methods but unfortunately, a lot of what is said is |
|---|
| 1614 | incomprehensible or inaccurate. It is also a field that is prone to |
|---|
| 1615 | having highly opinionated schools of thought. This is a pity |
|---|
| 1616 | because it prevents rational discussion of the pro's and con's of |
|---|
| 1617 | the different methods. The approach adopted in ClustalV is to |
|---|
| 1618 | supply just one method and to produce alignments in a format that |
|---|
| 1619 | can be used by Phylip. In simple cases, the trees produced will be |
|---|
| 1620 | as "good" (reliable, robust) as those from ANY other method. In |
|---|
| 1621 | more complicated cases, there is no single magic recipe that we can |
|---|
| 1622 | supply that will work well in even most situations. |
|---|
| 1623 | |
|---|
| 1624 | The method we provide is the Neighbor Joining method (NJ) of Saitou |
|---|
| 1625 | and Nei (1987) which is a distance method. We use this for three |
|---|
| 1626 | reasons: it is conceptually and computationally simple; it is fast; |
|---|
| 1627 | it gives "good" trees in simple cases. It is difficult to prove that |
|---|
| 1628 | one tree is "better" than another if you do not know the true |
|---|
| 1629 | phylogeny; the few systematic surveys of methods show it to work |
|---|
| 1630 | more or less as well as any other method ON AVERAGE. Another reason |
|---|
| 1631 | for using the NJ method is that it is very commonly used; THIS IS A |
|---|
| 1632 | BAD REASON SCIENTIFICALLY but at least you will not feel lonely if |
|---|
| 1633 | you use it. |
|---|
| 1634 | |
|---|
| 1635 | The NJ method works on a matrix of distances (the distance matrix) |
|---|
| 1636 | between all pairs of sequence to be analysed. These distances are |
|---|
| 1637 | related to the degree of divergence between the sequences. It is |
|---|
| 1638 | normal to calculate the distances from the sequences after they are |
|---|
| 1639 | multiply aligned. If you calculate them from seperate alignments |
|---|
| 1640 | (as done for the dendrograms in another part of this program), you |
|---|
| 1641 | may increase the error considerably. |
|---|
| 1642 | |
|---|
| 1643 | |
|---|
| 1644 | DISTANCES |
|---|
| 1645 | |
|---|
| 1646 | The simplest measure of distance between sequences is percent |
|---|
| 1647 | divergence (100% minus percent identity). For two sequences, you |
|---|
| 1648 | count how many positions differ between them (ignoring all positions |
|---|
| 1649 | with a gap or an unknown residue) and divide by the number of |
|---|
| 1650 | positions considered. It is common practice to also ignore all |
|---|
| 1651 | positions in the alignment where there is a GAP in ANY of the |
|---|
| 1652 | sequences (Tossgaps ? option in the menu). Usually, you express the |
|---|
| 1653 | percent distance divided by 100 (gives distances between 0.0 and |
|---|
| 1654 | 1.0). |
|---|
| 1655 | |
|---|
| 1656 | This measure of distance is perfectly adequate (with some further |
|---|
| 1657 | modification described below) for rRNA sequences. However it treats |
|---|
| 1658 | all residues identically e.g. all amino acid substitutions are |
|---|
| 1659 | equally weighted. It also treats all positions identically e.g. it |
|---|
| 1660 | does not take account of different rates of substitution in |
|---|
| 1661 | different positions of different codons in protein coding DNA |
|---|
| 1662 | sequences; see Li et al (1985) for a distance measure that does. |
|---|
| 1663 | Despite these shortcomings, these percent identity distances do work |
|---|
| 1664 | well in practice in a wide variety of situations. |
|---|
| 1665 | |
|---|
| 1666 | In a simple world, you would like a distance to be proportional to |
|---|
| 1667 | the time since the sequences diverged. If this were EXACTLY true, |
|---|
| 1668 | then the calculation of the tree would be a simple matter of algebra |
|---|
| 1669 | (UPGMA does this for you) and the branch lengths will be nice and |
|---|
| 1670 | meaningful (times). In practice this OBVIOUSLY depends on the |
|---|
| 1671 | existence and quality of the "molecular clock", a subject of on- |
|---|
| 1672 | going debate. However, even if there is a good clock, there is a |
|---|
| 1673 | further problem with estimating divergences. As sequences diverge, |
|---|
| 1674 | they become "saturated" with mutations. Sites can have |
|---|
| 1675 | substitutions more than once. Calculated distances will |
|---|
| 1676 | underestimate actual divergence times; the greater the divergence, |
|---|
| 1677 | the greater the discrepancy. There are various methods for dealing |
|---|
| 1678 | with this and we provide two commonly used ones, both due to Motoo |
|---|
| 1679 | Kimura; one for proteins and one for DNA. |
|---|
| 1680 | |
|---|
| 1681 | |
|---|
| 1682 | For distance K (percent divergence /100 ) ... |
|---|
| 1683 | |
|---|
| 1684 | Correction for Protein distances: (Kimura, 1983). |
|---|
| 1685 | |
|---|
| 1686 | Corrected K = -ln(1.0 - K - (K * k/5.0)) |
|---|
| 1687 | |
|---|
| 1688 | |
|---|
| 1689 | |
|---|
| 1690 | Correction for nucleotide distances: Kimura's 2-parameter method |
|---|
| 1691 | (Kimura, 1980). |
|---|
| 1692 | |
|---|
| 1693 | Corrected K = 0.5*ln(a) + 0.25*ln(b) |
|---|
| 1694 | |
|---|
| 1695 | where a = 1/(1 - 2*P - Q) |
|---|
| 1696 | and b = 1/(1 - 2*Q) |
|---|
| 1697 | |
|---|
| 1698 | P and Q are the proportions of transitions (A<-->G, C<-->T) |
|---|
| 1699 | and transversions occuring between the sequences. |
|---|
| 1700 | |
|---|
| 1701 | |
|---|
| 1702 | One paradoxical effect of these corrections, is that distances can |
|---|
| 1703 | be corrected to have more than 100% divergence. That is because, |
|---|
| 1704 | for very highly diverged sequences of length N, you can estimate |
|---|
| 1705 | that more than N substitutions have occured by correcting the |
|---|
| 1706 | observed distance in the above ways. Don't panic! |
|---|
| 1707 | |
|---|
| 1708 | |
|---|
| 1709 | |
|---|
| 1710 | NEIGHBOR JOINING TREES. |
|---|
| 1711 | |
|---|
| 1712 | VERY briefly, the NJ method works as follows. You start by placing |
|---|
| 1713 | the sequences in a star topology (no internal branches). You then |
|---|
| 1714 | find that internal branch (take 2 sequences; join them; connect them |
|---|
| 1715 | to the rest by the internal branch) which when added to the tree |
|---|
| 1716 | will minimise the total branch length. The two joined sequences |
|---|
| 1717 | (neighbours) are merged into a single sequence and the process is |
|---|
| 1718 | repeated. For an unrooted tree with N sequences, there are N-3 |
|---|
| 1719 | internal branches. The above process is repeated N-3 times to give |
|---|
| 1720 | the final tree. The full details are given in Saitou and Nei |
|---|
| 1721 | (1987). |
|---|
| 1722 | |
|---|
| 1723 | As explained elsewhere in the documentation, you can only root the |
|---|
| 1724 | tree by one of two methods: |
|---|
| 1725 | |
|---|
| 1726 | 1) assume a degree of constancy in the molecular clock and place the |
|---|
| 1727 | root along the longest branch (internal or external). Methods that |
|---|
| 1728 | appear to produce rooted trees automatically are often just doing |
|---|
| 1729 | this without letting you know; this is true of UPGMA. |
|---|
| 1730 | |
|---|
| 1731 | 2) root the tree on biological grounds. The usual method is to |
|---|
| 1732 | include an "outgroup", a sequence that you are certain will branch |
|---|
| 1733 | to the outside of the tree. |
|---|
| 1734 | |
|---|
| 1735 | |
|---|
| 1736 | |
|---|
| 1737 | BOOTSTRAPPING. |
|---|
| 1738 | |
|---|
| 1739 | Bootstrapping is a general purpose technique that can be used for |
|---|
| 1740 | placing confidence limits on statistics that you estimate without |
|---|
| 1741 | any knowledge of the underlying distribution (e.g. a normal or |
|---|
| 1742 | poisson distribution). In the case of phylogenetic trees, there are |
|---|
| 1743 | several analytical methods for placing confidence limits on |
|---|
| 1744 | groupings (actually on the internal branches) but these are either |
|---|
| 1745 | restricted to particular tree drawing methods or only work on small |
|---|
| 1746 | trees of 4 or 5 sequences. Felsenstein (1985) showed how to use |
|---|
| 1747 | bootstrapping to calculate confidence limits on trees. His approach |
|---|
| 1748 | is completely general and can be applied to any tree drawing method. |
|---|
| 1749 | The main assumption of the method in this context is that the sites |
|---|
| 1750 | in the alignment are independant; this will be true of some sequence |
|---|
| 1751 | alignments (e.g. pseudogenes) but not others (e.g. rRNA's). What |
|---|
| 1752 | effect, lack of independance will have on the results is not known. |
|---|
| 1753 | |
|---|
| 1754 | The method works by taking random samples of data from the complete |
|---|
| 1755 | data set. You compute the test statistic (tree in this case) on |
|---|
| 1756 | each sample. Variation in the statistic computed from the samples |
|---|
| 1757 | gives a measure of variation in the statistic which can be used to |
|---|
| 1758 | calculate confidence intervals. Each random sample is the same size |
|---|
| 1759 | as the complete data set and is taken WITH REPLACEMENT i.e. a data |
|---|
| 1760 | point can be selected more than once (or not at all) in any given |
|---|
| 1761 | sample. |
|---|
| 1762 | |
|---|
| 1763 | In the case of an alignment N residues long, each random sample is a |
|---|
| 1764 | random selection of N sites form the alignment. For each sample, we |
|---|
| 1765 | calculate a distance matrix and tree in the usual way. Variation in |
|---|
| 1766 | the sample trees compared to a tree calculated from the full data |
|---|
| 1767 | set gives an indication of how well supported the tree is by the |
|---|
| 1768 | data. If the sample trees are very similar to each other and to the |
|---|
| 1769 | full tree, then the tree is "strongly" supported; if the sample |
|---|
| 1770 | trees show great variation, then the tree will be weakly supported. |
|---|
| 1771 | In practice, you usually find some parts of a tree well supported, |
|---|
| 1772 | others weakly. This can be seen by counting how often each |
|---|
| 1773 | monophyletic group in the full tree occurs in the sample trees. |
|---|
| 1774 | |
|---|
| 1775 | For a particular grouping, one considers it to be significant at the |
|---|
| 1776 | 95% level (P <= 0.05) if it occurs in 95% of the bootstrap samples. |
|---|
| 1777 | If a grouping is significant, it is significant with respect to the |
|---|
| 1778 | particular data set and method used for drawing the tree. |
|---|
| 1779 | Biological "significance" is another matter. |
|---|
| 1780 | |
|---|
| 1781 | |
|---|
| 1782 | PHYLIP. |
|---|
| 1783 | |
|---|
| 1784 | The Phylip package was written by Joe Felsenstein, University of |
|---|
| 1785 | Washington, USA. It provides Pascal source code for a large number |
|---|
| 1786 | of programs for doing most types of phylogenetic analyses. The |
|---|
| 1787 | Phylip format alignments produced by this program can be used by all |
|---|
| 1788 | of the Phylip programs, version 3.4 or later (March 1991). It is |
|---|
| 1789 | freely available from him as follows. |
|---|
| 1790 | |
|---|
| 1791 | |
|---|
| 1792 | |
|---|
| 1793 | ================= PHYLIP information sheet ===================== |
|---|
| 1794 | |
|---|
| 1795 | PHYLIP - Phylogeny Inference Package (version 3.3) |
|---|
| 1796 | |
|---|
| 1797 | This is a FREE package of programs for inferring phylogenies and |
|---|
| 1798 | carrying out certain related tasks. At present it contains 28 |
|---|
| 1799 | programs, which carry out different algorithms on different kinds of |
|---|
| 1800 | data. The programs in the package are: |
|---|
| 1801 | |
|---|
| 1802 | ---------- Programs for molecular sequence data ---------- |
|---|
| 1803 | PROTPARS Protein parsimony |
|---|
| 1804 | DNAPARS Parsimony method for DNA |
|---|
| 1805 | DNAMOVE Interactive DNA parsimony |
|---|
| 1806 | DNAPENNY Branch and bound for DNA |
|---|
| 1807 | DNABOOT Bootstraps DNA parsimony |
|---|
| 1808 | DNACOMP Compatibility for DNA |
|---|
| 1809 | DNAINVAR Phylogenetic invariants |
|---|
| 1810 | DNAML Maximum likelihood method |
|---|
| 1811 | DNAMLK DNAML with molecular clock |
|---|
| 1812 | DNADIST Distances from sequences |
|---|
| 1813 | RESTML ML for restriction sites |
|---|
| 1814 | |
|---|
| 1815 | ----------- Programs for distance matrix data ------------ |
|---|
| 1816 | FITCH Fitch-Margoliash and least-squares methods |
|---|
| 1817 | KITSCH Fitch-Margoliash and least squares methods with |
|---|
| 1818 | evolutionary clock |
|---|
| 1819 | |
|---|
| 1820 | --- Programs for gene frequencies and continuous characters -- |
|---|
| 1821 | CONTML Maximum likelihood method |
|---|
| 1822 | GENDIST Computes genetic distances |
|---|
| 1823 | |
|---|
| 1824 | ------------- Programs for discrete state data ----------- |
|---|
| 1825 | MIX Wagner, Camin-Sokal, and mixed parsimony criteria |
|---|
| 1826 | MOVE Interactive Wagner, C-S, mixed parsimony program |
|---|
| 1827 | PENNY Finds all most parsimonious trees by branch-and-bound |
|---|
| 1828 | BOOT Bootstrap confidence interval on mixed parsimony methods |
|---|
| 1829 | DOLLOP, DOLMOVE, DOLPENNY, DOLBOOT same as preceding four |
|---|
| 1830 | programs, but for the Dollo and polymorphism parsimony |
|---|
| 1831 | criteria |
|---|
| 1832 | CLIQUE Compatibility method |
|---|
| 1833 | FACTOR recode multistate characters |
|---|
| 1834 | |
|---|
| 1835 | ---- Programs for plotting trees and consensus trees ---- |
|---|
| 1836 | DRAWGRAM Draws cladograms and phenograms on screens, plotters and |
|---|
| 1837 | printers |
|---|
| 1838 | DRAWTREE Draws unrooted phylogenies on screens, plotters and |
|---|
| 1839 | printers |
|---|
| 1840 | CONSENSE Majority-rule and strict consensus trees |
|---|
| 1841 | |
|---|
| 1842 | The package includes extensive documentation files that provide the |
|---|
| 1843 | information necessary to use and modify the programs. |
|---|
| 1844 | |
|---|
| 1845 | COMPATIBILITY: The programs are written in a very standard subset of |
|---|
| 1846 | Pascal, a language that is available on most computers (including |
|---|
| 1847 | microcomputers). The programs require only trivial modifications to |
|---|
| 1848 | run on most machines: for example they work with only minor |
|---|
| 1849 | modifications with Turbo Pascal, and without modifications on VAX |
|---|
| 1850 | VMS Pascal. Pascal source code is distributed in the regular version |
|---|
| 1851 | of PHYLIP: compiled object code is not. To use that version, you |
|---|
| 1852 | must have a Pascal compiler. |
|---|
| 1853 | |
|---|
| 1854 | DISKETTE DISTRIBUTION: The package is distributed in a variety of |
|---|
| 1855 | microcomputer diskette formats. You should send FORMATTED |
|---|
| 1856 | diskettes, which I will return with the package written on them. |
|---|
| 1857 | Unfortunately, I cannot write any Apple formats. See below for how |
|---|
| 1858 | many diskettes to send. The programs on the magnetic tape or |
|---|
| 1859 | electronic network versions may of course also be moved to |
|---|
| 1860 | microcomputers using a terminal program. |
|---|
| 1861 | |
|---|
| 1862 | PRECOMPILED VERSIONS: Precompiled executable programs for PCDOS |
|---|
| 1863 | systems are available from me. Specify the "PCDOS executable |
|---|
| 1864 | version" and send the number of extra diskettes indicated below. |
|---|
| 1865 | An Apple Macintosh version with precompiled code is available from |
|---|
| 1866 | Willem Ellis, Instituut voor Taxonomische Zoologie, Zoologisch |
|---|
| 1867 | Museum, Universiteit van Amsterdam, Plantage Middenlaan 64, 1018DH |
|---|
| 1868 | Amsterdam, Netherlands, who asks that you send 5 800K diskettes. |
|---|
| 1869 | |
|---|
| 1870 | HOW MANY DISKETTES TO SEND: The following table shows for different |
|---|
| 1871 | PCDOS formats how many diskettes to send, and how many extra |
|---|
| 1872 | diskettes to send for the PCDOS executable version: |
|---|
| 1873 | |
|---|
| 1874 | Diskette size Density For source code For executables, send |
|---|
| 1875 | in addition |
|---|
| 1876 | 3.5 inch 1.44 Mb 2 1 |
|---|
| 1877 | 5.25 inch 1.2 Mb 2 2 |
|---|
| 1878 | 3.5 inch 720 Kb 4 2 |
|---|
| 1879 | 5.25 inch 360 Kb 7 4 |
|---|
| 1880 | |
|---|
| 1881 | Some other formats are also available. You MUST tell me EXACTLY |
|---|
| 1882 | which of these formats you need. The diskettes MUST be formatted by |
|---|
| 1883 | you before being sent to me. Sending an extra diskette may be |
|---|
| 1884 | helpful. |
|---|
| 1885 | |
|---|
| 1886 | NETWORK DISTRIBUTION: The package is also available by distribution |
|---|
| 1887 | of the files directly over electronic networks, and by anonymous ftp |
|---|
| 1888 | from evolution.genetics.washington.edu. Contact me by electronic |
|---|
| 1889 | mail for details. |
|---|
| 1890 | |
|---|
| 1891 | TAPE DISTRIBUTION: The programs are also distributed on a magnetic |
|---|
| 1892 | tape provided by you (which should be a small tape and need only be |
|---|
| 1893 | able to hold two megabytes) in the following format: 9-track, ASCII, |
|---|
| 1894 | odd parity, unlabelled, 6250 bpi (unless otherwise indicated). |
|---|
| 1895 | Logical record: 80 bytes, physical record: 3200 bytes (i.e. blocking |
|---|
| 1896 | factor 40). There are a total of 71 files. The first one describes |
|---|
| 1897 | the contents of the package. |
|---|
| 1898 | |
|---|
| 1899 | POLICIES: The package is distributed free. I do not make it |
|---|
| 1900 | available or support it in South Africa. The package will be |
|---|
| 1901 | written on the diskettes or tape, which will be mailed back. They |
|---|
| 1902 | can be sent to: |
|---|
| 1903 | |
|---|
| 1904 | Joe Felsenstein |
|---|
| 1905 | Electronic mail addresses: Department of Genetics SK-50 |
|---|
| 1906 | Internet: joe@genetics.washington.edu University of Washington |
|---|
| 1907 | Bitnet/EARN: felsenst@uwavm Seattle, Washington 98195 |
|---|
| 1908 | UUCP: uw-beaver!evolution.genetics!joe U.S.A. |
|---|
| 1909 | |
|---|
| 1910 | |
|---|
| 1911 | ===================== End of Phylip Info. Sheet ==================== |
|---|
| 1912 | |
|---|
| 1913 | |
|---|
| 1914 | |
|---|
| 1915 | |
|---|
| 1916 | REFERENCES. |
|---|
| 1917 | |
|---|
| 1918 | Dayhoff, M.O., Schwartz, R.M. and Orcutt, B.C. (1978) in Atlas of |
|---|
| 1919 | Protein Sequence and Structure, Vol. 5 supplement 3, Dayhoff, M.O. |
|---|
| 1920 | (ed.), NBRF, Washington, p. 345. |
|---|
| 1921 | |
|---|
| 1922 | Felsenstein, J. (1985) Confidence limits on phylogenies: an |
|---|
| 1923 | approach using the bootstrap. Evolution 39, 783-791. |
|---|
| 1924 | |
|---|
| 1925 | Feng, D.-F. and Doolittle, R.F. (1987) Progressive sequence |
|---|
| 1926 | alignment as a prerequisite to correct phylogenetic trees. |
|---|
| 1927 | J.Mol.Evol. 25, 351-360. |
|---|
| 1928 | |
|---|
| 1929 | Gotoh, O. (1982) An improved algorithm for matching biological |
|---|
| 1930 | sequences. J.Mol.Biol. 162, 705-708. |
|---|
| 1931 | |
|---|
| 1932 | Gribskov, M., McLachlan, A.D. and Eisenberg, D. (1987) Profile |
|---|
| 1933 | analysis: detection of distantly related proteins. PNAS USA 84, |
|---|
| 1934 | 4355-4358. |
|---|
| 1935 | |
|---|
| 1936 | Higgins, D.G. and Sharp, P.M. (1988) CLUSTAL: a package for |
|---|
| 1937 | performing multiple sequence alignments on a microcomputer. Gene |
|---|
| 1938 | 73, 237-244. |
|---|
| 1939 | |
|---|
| 1940 | Higgins, D.G. and Sharp, P.M. (1989) Fast and sensitive multiple |
|---|
| 1941 | sequence alignments on a microcomputer. CABIOS 5, 151-153. |
|---|
| 1942 | |
|---|
| 1943 | Kimura, M. (1980) A simple method for estimating evolutionary |
|---|
| 1944 | rates of base substitutions through comparative studies of |
|---|
| 1945 | nucleotide sequences. J. Mol. Evol. 16, 111-120. |
|---|
| 1946 | |
|---|
| 1947 | Kimura, M. (1983) The Neutral Theory of Molecular Evolution. |
|---|
| 1948 | Cambridge University Press, Cambridge, England. |
|---|
| 1949 | |
|---|
| 1950 | Li, W.-H., Wu, C.-I. and Luo, C.-C. (1985) A new method for |
|---|
| 1951 | estimating synonymous and nonsynonymous rates of nucleotide |
|---|
| 1952 | substitution considering the relative likelihood of nucleotide and |
|---|
| 1953 | codon changes. Mol.Biol.Evol. 2, 150-174. |
|---|
| 1954 | |
|---|
| 1955 | Myers, E.W. and Miller, W. (1988) Optimal alignments in linear |
|---|
| 1956 | space. CABIOS 4, 11-17. |
|---|
| 1957 | |
|---|
| 1958 | Pearson, W.R. and Lipman, D.J. (1988) Improved tools for biological |
|---|
| 1959 | sequence comparison. PNAS USA 85, 2444-2448. |
|---|
| 1960 | |
|---|
| 1961 | Saitou, N. and Nei, M. (1987) The neighbor-joining method: a new |
|---|
| 1962 | method for reconstructing phylogenetic trees. Mol.Biol.Evol. 4, |
|---|
| 1963 | 406-425. |
|---|
| 1964 | |
|---|
| 1965 | Sneath, P.H.A. and Sokal, R.R. (1973) Numerical Taxonomy. Freeman, |
|---|
| 1966 | San Francisco. |
|---|
| 1967 | |
|---|
| 1968 | Sokal, R.R. and Michener, C.D. (1958) A statistical method for |
|---|
| 1969 | evaluating systematic relationships. Univ.Kansas Sci.Bull. 38, |
|---|
| 1970 | 1409-1438. |
|---|
| 1971 | |
|---|
| 1972 | Vingron, M. and Argos, P. (1991) Motif recognition and alignment |
|---|
| 1973 | for many sequences by comparison of dot matrices. J.Mol.Biol. 218, |
|---|
| 1974 | 33-43. |
|---|
| 1975 | |
|---|
| 1976 | Wilbur, W.J. and Lipman, D.J. (1983) Rapid similarity searches of |
|---|
| 1977 | nucleic acid and protein data banks. PNAS USA 80, 726-730. |
|---|
| 1978 | |
|---|