| 1 | README for Clustal W version 1.7 June 1997 |
|---|
| 2 | |
|---|
| 3 | Clustal W version 1.7 Documentation |
|---|
| 4 | |
|---|
| 5 | This file provides some notes on the latest changes, installation and usage |
|---|
| 6 | of the Clustal W multiple sequence alignment program. |
|---|
| 7 | |
|---|
| 8 | |
|---|
| 9 | |
|---|
| 10 | Julie Thompson (Thompson@EMBL-Heidelberg.DE) |
|---|
| 11 | Toby Gibson (Gibson@EMBL-Heidelberg.DE) |
|---|
| 12 | |
|---|
| 13 | European Molecular Biology Laboratory |
|---|
| 14 | Meyerhofstrasse 1 |
|---|
| 15 | D 69117 Heidelberg |
|---|
| 16 | Germany |
|---|
| 17 | |
|---|
| 18 | |
|---|
| 19 | Des Higgins (Higgins@ucc.ie) |
|---|
| 20 | |
|---|
| 21 | University College Cork |
|---|
| 22 | Cork |
|---|
| 23 | Ireland |
|---|
| 24 | |
|---|
| 25 | |
|---|
| 26 | Please e-mail bug reports/complaints/suggestions (polite if possible) |
|---|
| 27 | to Toby Gibson or Des Higgins. |
|---|
| 28 | |
|---|
| 29 | |
|---|
| 30 | |
|---|
| 31 | Thompson, J.D., Higgins, D.G. and Gibson, T.J. (1994) |
|---|
| 32 | CLUSTAL W: improving the sensitivity of progressive multiple sequence alignment |
|---|
| 33 | through sequence weighting, position-specific gap penalties and weight matrix |
|---|
| 34 | choice. Nucleic Acids Research, 22:4673-4680. |
|---|
| 35 | |
|---|
| 36 | -------------------------------------------------------------- |
|---|
| 37 | |
|---|
| 38 | What's New (June 1997) in Version 1.7 (since version 1.6). |
|---|
| 39 | |
|---|
| 40 | |
|---|
| 41 | 1. The static arrays used by clustalw for storing the alignment data have been |
|---|
| 42 | replaced by dynamically allocated memory. There is now no limit on the number |
|---|
| 43 | or length of sequences which can be input. |
|---|
| 44 | |
|---|
| 45 | 2. The alignment of DNA sequences now offers a new hard-coded matrix, as well |
|---|
| 46 | as the identity matrix used previously. The new matrix is the default scoring |
|---|
| 47 | matrix used by the BESTFIT program of the GCG package for the comparison of |
|---|
| 48 | nucleic acid sequences. X's and N's are treated as matches to any IUB ambiguity |
|---|
| 49 | symbol. All matches score 1.9; all mismatches for IUB symbols score 0.0. |
|---|
| 50 | |
|---|
| 51 | 3. The transition weight option for aligning nucleotide sequences has been |
|---|
| 52 | changed from an on/off toggle to a weight between 0 and 1. A weight of zero |
|---|
| 53 | means that the transitions are scored as mismatches; a weight of 1 gives |
|---|
| 54 | transitions the full match score. For distantly related DNA sequences, the |
|---|
| 55 | weight should be near to zero; for closely related sequences it can be useful |
|---|
| 56 | to assign a higher score. |
|---|
| 57 | |
|---|
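The DNA scoring described in items 2 and 3 can be sketched roughly as follows. This is only an illustration: the match and mismatch scores (1.9 and 0.0) come from the text above, but the linear scaling for weights between 0 and 1, the function names, and the omission of IUB ambiguity symbols are assumptions, not the actual clustalw code.

```python
# Illustrative sketch only; not taken from the clustalw source.
MATCH_SCORE = 1.9      # from the text: all matches score 1.9
MISMATCH_SCORE = 0.0   # from the text: mismatches score 0.0

PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def _normalise(base):
    """Upper-case the base and treat U as T (assumption for this sketch)."""
    base = base.upper()
    return "T" if base == "U" else base


def is_transition(a, b):
    """True for a purine<->purine or pyrimidine<->pyrimidine substitution."""
    a, b = _normalise(a), _normalise(b)
    if a == b:
        return False
    return {a, b} <= PURINES or {a, b} <= PYRIMIDINES


def dna_score(a, b, transition_weight):
    """Score one pair of bases for a transition weight between 0 and 1.

    Weight 0 scores transitions as mismatches, weight 1 gives them the
    full match score; values in between are scaled linearly (assumption).
    """
    if not 0.0 <= transition_weight <= 1.0:
        raise ValueError("transition weight must be between 0 and 1")
    if _normalise(a) == _normalise(b):
        return MATCH_SCORE
    if is_transition(a, b):
        return transition_weight * MATCH_SCORE
    return MISMATCH_SCORE
```

With a weight of 0, an A/G pair scores like any other mismatch; with a weight of 1 it scores like a match. Transversions such as A/C are unaffected by the weight.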
| 58 | 4. The RSF sequence alignment file format used by GCG Version 9 can now be |
|---|
| 59 | read. |
|---|
| 60 | |
|---|
| 61 | 5. The clustal sequence alignment file format has been changed to allow |
|---|
| 62 | sequence names longer than 10 characters. The maximum length allowed is set in |
|---|
| 63 | clustalw.h by the statement: |
|---|
| 64 | |
|---|
| 65 | #define MAXNAMES 10 |
|---|
| 66 | |
|---|
| 67 | For the fasta format, the name is taken as the first string after the '>' |
|---|
| 68 | character, stopping at the first white space. (Previously, the first 10 |
|---|
| 69 | characters were taken, replacing blanks by underscores). |
|---|
| 70 | |
|---|
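The two fasta naming rules in item 5 can be sketched as below. The function names are hypothetical, and the MAXNAMES limit from clustalw.h is deliberately not modelled.

```python
# Illustrative sketch only; function names are hypothetical.
def fasta_name(header_line):
    """Version 1.7 rule: first string after '>' up to the first white space."""
    if not header_line.startswith(">"):
        raise ValueError("not a fasta header line")
    fields = header_line[1:].split()
    return fields[0] if fields else ""


def fasta_name_old(header_line, width=10):
    """Earlier rule: first 10 characters after '>', blanks become underscores."""
    if not header_line.startswith(">"):
        raise ValueError("not a fasta header line")
    return header_line[1:1 + width].rstrip("\n").replace(" ", "_")
```

For the header `>HBA HUMAN alpha globin`, the new rule gives `HBA` and the old rule gives `HBA_HUMAN_`.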
| 71 | 6. The bootstrap values written in the phylip tree file format can be assigned |
|---|
| 72 | either to branches or nodes. The default is to write the values on the nodes, |
|---|
| 73 | as this can be read by several commonly-used tree display programs. But note |
|---|
| 74 | that this can lead to confusion if the tree is rooted and the bootstraps may |
|---|
| 75 | be better attached to the internal branches: Software developers should ensure |
|---|
| 76 | they can read the branch label format. |
|---|
| 77 | |
|---|
| 78 | 7. The sequence weighting used during sequence to profile alignments has been |
|---|
| 79 | changed. The tree weight is now multiplied by the percent identity of the |
|---|
| 80 | new sequence compared with the most closely related sequence in the profile. |
|---|
| 81 | |
|---|
| 82 | 8. The sequence weighting used during profile to profile alignments has been |
|---|
| 83 | changed. A guide tree is now built for each profile separately and the |
|---|
| 84 | sequence weights calculated from the two trees. The weights for each |
|---|
| 85 | sequence are then multiplied by the percent identity of the sequence compared |
|---|
| 86 | with the most closely related sequence in the opposite profile. |
|---|
| 87 | |
|---|
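As a rough sketch of the weighting in items 7 and 8: each sequence's tree weight is scaled by its percent identity to the closest sequence in the opposite profile. The function name, the input layout, and the division by 100 to turn a percentage into a factor are assumptions; how clustalw computes the identities themselves is not shown here.

```python
# Illustrative sketch only; names and the 0-100 scaling are assumptions.
def profile_weights(tree_weights, identity_to_other):
    """Scale tree weights by identity to the closest opposite sequence.

    tree_weights[i]         weight of sequence i from its profile's guide tree
    identity_to_other[i][j] percent identity (0-100) between sequence i and
                            sequence j of the opposite profile
    """
    if len(tree_weights) != len(identity_to_other):
        raise ValueError("one identity row is needed per sequence")
    weights = []
    for weight, row in zip(tree_weights, identity_to_other):
        closest = max(row)  # most closely related sequence in the other profile
        weights.append(weight * closest / 100.0)
    return weights
```

For sequence to profile alignment (item 7), the opposite "profile" is just the single new sequence, so each row has one entry.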
| 88 | 9. The adjustment of the Gap Opening and Gap Extension Penalties for sequences |
|---|
| 89 | of unequal length has been improved. |
|---|
| 90 | |
|---|
| 91 | 10. The default order of the sequences in the output alignment file has been |
|---|
| 92 | changed. Previously the default was to output the sequences in the same order |
|---|
| 93 | as the input file. Now the default is to use the order in which the sequences |
|---|
| 94 | were aligned (from the guide tree/dendrogram), thus automatically grouping |
|---|
| 95 | closely related sequences. |
|---|
| 96 | |
|---|
| 97 | 11. The option to 'Reset Gaps between alignments' has been switched off by |
|---|
| 98 | default. |
|---|
| 99 | |
|---|
| 100 | 12. The conservation line output in the clustal format alignment file has been |
|---|
| 101 | changed. Three characters are now used: |
|---|
| 102 | |
|---|
| 103 | '*' indicates positions which have a single, fully conserved residue |
|---|
| 104 | |
|---|
| 105 | ':' indicates that one of the following 'strong' groups is fully conserved:- |
|---|
| 106 | |
|---|
| 107 | STA |
|---|
| 108 | NEQK |
|---|
| 109 | NHQK |
|---|
| 110 | NDEQ |
|---|
| 111 | QHRK |
|---|
| 112 | MILV |
|---|
| 113 | MILF |
|---|
| 114 | HY |
|---|
| 115 | FYW |
|---|
| 116 | |
|---|
| 117 | '.' indicates that one of the following 'weaker' groups is fully conserved:- |
|---|
| 118 | |
|---|
| 119 | CSA |
|---|
| 120 | ATV |
|---|
| 121 | SAG |
|---|
| 122 | STNK |
|---|
| 123 | STPA |
|---|
| 124 | SGND |
|---|
| 125 | SNDEQK |
|---|
| 126 | NDEQHK |
|---|
| 127 | NEQHRK |
|---|
| 128 | FVLIM |
|---|
| 129 | HFY |
|---|
| 130 | |
|---|
| 131 | These are all the positively scoring groups that occur in the Gonnet Pam250 |
|---|
| 132 | matrix. The strong and weak groups are defined as strong score >0.5 and weak |
|---|
| 133 | score <=0.5 respectively. |
|---|
| 134 | |
|---|
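The conservation line rules in item 12 can be sketched as follows. The group lists are copied from the text above; the treatment of gaps (any column containing '-' gets a blank) and the function names are assumptions.

```python
# Illustrative sketch only; gap handling is an assumption.
STRONG_GROUPS = ["STA", "NEQK", "NHQK", "NDEQ", "QHRK",
                 "MILV", "MILF", "HY", "FYW"]
WEAK_GROUPS = ["CSA", "ATV", "SAG", "STNK", "STPA", "SGND",
               "SNDEQK", "NDEQHK", "NEQHRK", "FVLIM", "HFY"]


def conservation_symbol(column):
    """Return '*', ':', '.' or ' ' for one alignment column."""
    residues = set(column.upper())
    if "-" in residues or not residues:
        return " "
    if len(residues) == 1:
        return "*"  # a single, fully conserved residue
    if any(residues <= set(group) for group in STRONG_GROUPS):
        return ":"  # all residues fall inside one 'strong' group
    if any(residues <= set(group) for group in WEAK_GROUPS):
        return "."  # all residues fall inside one 'weaker' group
    return " "


def conservation_line(aligned_seqs):
    """Build the conservation line for equal-length aligned sequences."""
    return "".join(conservation_symbol("".join(col))
                   for col in zip(*aligned_seqs))
```

For the three aligned sequences MKST, MRAW and MKTC, the columns are MMM, KRK, SAT and TWC, giving the line `*:: `.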
| 135 | 13. A bug in the modification of the Myers and Miller alignment algorithm |
|---|
| 136 | for residue-specific gap penalties has been fixed. This occasionally caused |
|---|
| 137 | new gaps to be opened a few residues away from the optimal position. |
|---|
| 138 | |
|---|
| 139 | 14. The GCG/MSF input format no longer needs the word PILEUP on the first |
|---|
| 140 | line. Several versions can now be recognised:- |
|---|
| 141 | 1. The word PILEUP as the first word in the file |
|---|
| 142 | 2. The word !!AA_MULTIPLE_ALIGNMENT or !!NA_MULTIPLE_ALIGNMENT |
|---|
| 143 | as the first word in the file |
|---|
| 144 | 3. The characters MSF on the first line of the file, and the |
|---|
| 145 | characters .. at the end of the line. |
|---|
| 146 | |
|---|
| 147 | 15. The standard command line separator for UNIX systems has been changed from |
|---|
| 148 | '/' to '-', i.e. to give options on the command line, you now type |
|---|
| 149 | |
|---|
| 150 | clustalw input.aln -gapopen=8.0 |
|---|
| 151 | |
|---|
| 152 | instead of |
|---|
| 153 | |
|---|
| 154 | clustalw input.aln /gapopen=8.0 |
|---|
| 155 | |
|---|
| 156 | |
|---|
| 157 | ATTENTION SOFTWARE DEVELOPERS!! |
|---|
| 158 | ------------------------------- |
|---|
| 159 | |
|---|
| 160 | The CLUSTAL sequence alignment output format has been modified: |
|---|
| 161 | |
|---|
| 162 | 1. Names longer than 10 chars are now allowed. (The maximum is specified in |
|---|
| 163 | clustalw.h by '#define MAXNAMES'.) |
|---|
| 164 | |
|---|
| 165 | 2. The consensus line now consists of three characters: '*',':' and '.'. (Only |
|---|
| 166 | the '*' and '.' were previously used.) |
|---|
| 167 | |
|---|
| 168 | 3. An option (not the default) has been added, allowing the user to print out |
|---|
| 169 | sequence numbers at the end of each line of the alignment output. |
|---|
| 170 | |
|---|
| 171 | 4. Both RNA bases (U) and base ambiguities are now supported in nucleic acid |
|---|
| 172 | sequences. In the past, all characters (upper or lower case) other than |
|---|
| 173 | a,c,g,t or u were converted to N. Now the following characters are recognised |
|---|
| 174 | and retained in the alignment output: ABCDGHKMNRSTUVWXY (upper or lower case). |
|---|
| 175 | |
|---|
| 176 | 5. A Blank line inadvertently added in the version 1.6 header has been taken |
|---|
| 177 | out again. |
|---|
| 178 | |
|---|
| 179 | |
|---|
| 180 | -------------------------------------------------------------- |
|---|
| 181 | |
|---|
| 182 | What's New (March 1996) in Version 1.6 (since version 1.5). |
|---|
| 183 | |
|---|
| 184 | |
|---|
| 185 | 1) Improved handling of sequences of unequal length. Previously, we |
|---|
| 186 | increased the gap extension penalties for both sequences if the two sequences |
|---|
| 187 | (or groups of previously aligned sequences) were of different lengths. |
|---|
| 188 | Now, we increase the gap opening and extension penalties for the shorter |
|---|
| 189 | sequence only. This helps prevent short sequences being stretched out |
|---|
| 190 | along longer ones. |
|---|
| 191 | |
|---|
| 192 | 2) Added the "Gonnet" series of weight matrices (from Gaston Gonnet and |
|---|
| 193 | co-workers at the ETH in Zurich). Fixed a bug in the matrix |
|---|
| 194 | choice menu; now PAM matrices can be selected ok. |
|---|
| 195 | |
|---|
| 196 | 3) Added secondary structure/gap penalty masks. These allow you to |
|---|
| 197 | include, in an alignment, a position specific set of gap penalties. |
|---|
| 198 | You can either set a gap opening penalty at each position or specify |
|---|
| 199 | the secondary structure (if protein; alpha helix, beta strand or loop) |
|---|
| 200 | and have gap penalties set automatically. This, basically, is used to make |
|---|
| 201 | gaps harder to open inside helices or strands. |
|---|
| 202 | |
|---|
| 203 | These masks are only used in the "profile alignment" menu. They may be read in |
|---|
| 204 | as part of an alignment in a special format (see the on-line help for |
|---|
| 205 | details) or associated with each sequence, if the sequences are in Swiss Prot |
|---|
| 206 | format and secondary structure information is given. All of the mask |
|---|
| 207 | parameters can be set from the profile alignment menu. Basically, the |
|---|
| 208 | mask is made up of a series of numbers between 1 and 9, one per position. |
|---|
| 209 | The gap opening penalty at a position is calculated as the starting penalty |
|---|
| 210 | multiplied by the mask value at that site. |
|---|
| 211 | |
|---|
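The mask arithmetic just described (starting penalty multiplied by the mask value at each site) can be sketched as below. Representing the mask as a string of digits and the function name are assumptions made for the example; the real mask file format is described in the on-line help.

```python
# Illustrative sketch only; the string-of-digits mask is an assumption.
def position_gap_open(starting_penalty, mask):
    """Return the gap opening penalty at each position of the mask.

    mask is a string with one digit (1-9) per alignment position.
    """
    penalties = []
    for position, char in enumerate(mask):
        if char not in "123456789":
            raise ValueError("mask value at position %d is not 1-9" % position)
        penalties.append(starting_penalty * int(char))
    return penalties
```

With a starting penalty of 10.0 and the mask 11441, gaps are four times as expensive to open at the third and fourth positions (for instance inside a helix) as at the others.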
| 212 | 4) Added command line options /profile and /sequences. |
|---|
| 213 | These allow users to choose between normal profile alignment where the |
|---|
| 214 | two profiles (pre-existing alignments specified in the files |
|---|
| 215 | /profile1= and /profile2=) are merged/aligned with each other (/profile) |
|---|
| 216 | and the case where the individual sequences in /profile2 are aligned |
|---|
| 217 | sequentially with the alignment in /profile1 (/sequences). |
|---|
| 218 | |
|---|
| 219 | 5) Fixed bug in modified Myers and Miller algorithm - gap penalty score |
|---|
| 220 | was not always calculated properly for type 2 midpoints. This is the core |
|---|
| 221 | alignment algorithm. |
|---|
| 222 | |
|---|
| 223 | 6) Only allows one output file format to be selected from command line |
|---|
| 224 | - ie. multiple output alignment files are not allowed. |
|---|
| 225 | |
|---|
| 226 | 7) Fixed 'bad calls to ckfree' error during calculation of phylip distance |
|---|
| 227 | matrix. |
|---|
| 228 | |
|---|
| 229 | 8) Fixed command line options /gapopen /gapext /type=protein /negative. |
|---|
| 230 | |
|---|
| 231 | 9) Allowed user to change command line separator on UNIX from '/' to '-'. |
|---|
| 232 | This allows unix users to use the more conventional '-' symbol |
|---|
| 233 | for separating command line options. "/" can then be used in unix |
|---|
| 234 | file names on the command line. The symbol that is used, |
|---|
| 235 | is specified in the file clustalw.h which must be edited if you |
|---|
| 236 | wish to change it (and the program must then be recompiled). Find the |
|---|
| 237 | block of code in clustalw.h that corresponds to the operating system you |
|---|
| 238 | are using. These blocks are started by one of the following: |
|---|
| 239 | |
|---|
| 240 | #ifdef VMS |
|---|
| 241 | #elif MAC |
|---|
| 242 | #elif MSDOS |
|---|
| 243 | #elif UNIX |
|---|
| 244 | |
|---|
| 245 | On the next line after each is the line: |
|---|
| 246 | |
|---|
| 247 | #define COMMANDSEP '/' |
|---|
| 248 | |
|---|
| 249 | Change this in the appropriate block of code (e.g. the UNIX block) to |
|---|
| 250 | |
|---|
| 251 | #define COMMANDSEP '-' |
|---|
| 252 | |
|---|
| 253 | if you wish to use the "-" character as command separator. |
|---|
| 254 | |
|---|
| 255 | |
|---|
| 256 | |
|---|
| 257 | -------------------------------------------------------------- |
|---|
| 258 | |
|---|
| 259 | What's New (April 1995) in Version 1.5 (since version 1.3). |
|---|
| 260 | |
|---|
| 261 | 1) ported to MAC and PC. These versions are quite slow unless you |
|---|
| 262 | have a nice beefy machine. On a Power Mac or a Pentium box |
|---|
| 263 | it is nice and fast. Two precompiled versions are supplied for Macs |
|---|
| 264 | (Power mac and old mac versions). |
|---|
| 265 | |
|---|
| 266 | Mac: 1500 residues by 100 sequences |
|---|
| 267 | Power Mac: 3000 residues by 100 sequences |
|---|
| 268 | PC: 1500 residues by 100 sequences |
|---|
| 269 | |
|---|
| 270 | 2) alignment of new sequences to an alignment. Fixed a serious bug |
|---|
| 271 | which assigned weights to the wrong sequences. Now also, weights |
|---|
| 272 | sequences according to distance from the incoming sequence. The |
|---|
| 273 | new weights are: tree weights * similarity to incoming sequence. |
|---|
| 274 | The tree weights are the old weights that we derive from the tree |
|---|
| 275 | connecting all the sequences in the existing alignment. |
|---|
| 276 | |
|---|
| 277 | 3) for all platforms, output linelength = 60. |
|---|
| 278 | |
|---|
| 279 | 4) Bootstrap files (*.phb): the "final" node (arbitrary trichotomy |
|---|
| 280 | at the end of the neighbor-joining process) is labelled as |
|---|
| 281 | TRICHOTOMY in the bootstrap output files. This is to help |
|---|
| 282 | link bootstrap figures with nodes when you reroot the tree. |
|---|
| 283 | |
|---|
| 284 | 5) Command line /bootstrap option now more robust. |
|---|
| 285 | |
|---|
| 286 | -------------------------------------------------------------- |
|---|
| 287 | INTRODUCTION |
|---|
| 288 | |
|---|
| 289 | |
|---|
| 290 | |
|---|
| 291 | This document gives some BRIEF notes about usage of the Clustal W |
|---|
| 292 | multiple alignment program for UNIX and VMS machines. Clustal W |
|---|
| 293 | is a major update and rewrite of the Clustal V program which |
|---|
| 294 | was described in: |
|---|
| 295 | |
|---|
| 296 | Higgins, D.G., Bleasby, A.J. and Fuchs, R. (1992) |
|---|
| 297 | CLUSTAL V: improved software for multiple sequence alignment. |
|---|
| 298 | Computer Applications in the Biosciences (CABIOS), 8(2):189-191. |
|---|
| 299 | |
|---|
| 300 | The main new features are a greatly improved (more sensitive) |
|---|
| 301 | multiple alignment procedure for proteins and improved support |
|---|
| 302 | for different file formats. This software was described in: |
|---|
| 303 | |
|---|
| 304 | Thompson, J.D., Higgins, D.G. and Gibson, T.J. (1994) |
|---|
| 305 | CLUSTAL W: improving the sensitivity of progressive multiple |
|---|
| 306 | sequence alignment through sequence weighting, position specific |
|---|
| 307 | gap penalties and weight matrix choice. |
|---|
| 308 | Nucleic Acids Research, 22(22):4673-4680. |
|---|
| 309 | |
|---|
| 310 | |
|---|
| 311 | The usage of Clustal W is largely the same as for |
|---|
| 312 | Clustal V, details of which are described in clustalv.txt. Details of the |
|---|
| 313 | new alignment algorithms are described in the manuscript by |
|---|
| 314 | Thompson et al. above, an ascii/text version of which is included |
|---|
| 315 | (clustalw.ms). This file lists some of the details not covered by either |
|---|
| 316 | of the above documents. |
|---|
| 317 | |
|---|
| 318 | |
|---|
| 319 | There are brief notes on the following topics: |
|---|
| 320 | |
|---|
| 321 | 1) Installation for VMS, UNIX, MAC and PC |
|---|
| 322 | 2) File input |
|---|
| 323 | 3) File output |
|---|
| 324 | 4) Changes to the alignment algorithms |
|---|
| 325 | 5) Minor modifications to the phylogenetic tree and bootstrapping methods |
|---|
| 326 | 6) Summary of the command line usage |
|---|
| 327 | |
|---|
| 328 | ------------------------------------------------------------------- |
|---|
| 329 | |
|---|
| 330 | 1) INSTALLATION (for Unix, VAX/VMS, PC and MAC) |
|---|
| 331 | |
|---|
| 332 | |
|---|
| 333 | |
|---|
| 334 | *****IMPORTANT***** |
|---|
| 335 | If you wish to recompile the program (or compile it for the first |
|---|
| 336 | time; you will have to do this with UNIX): |
|---|
| 337 | first check the file CLUSTALW.H which needs to be changed if you |
|---|
| 338 | move the code between unix and vms machines. At the top |
|---|
| 339 | of the file are four lines which define one of VMS, MSDOS, MAC or |
|---|
| 340 | UNIX to be 1. All of these EXCEPT one must be commented out |
|---|
| 341 | using enclosed /* ... */. |
|---|
| 342 | ******************* |
|---|
| 343 | |
|---|
| 344 | |
|---|
| 345 | Unix |
|---|
| 346 | ----- |
|---|
| 347 | |
|---|
| 348 | Make files are supplied for unix machines. The code was compiled and |
|---|
| 349 | tested using Decstation (Ultrix), SUN (Gnu C compiler/gcc), Silicon |
|---|
| 350 | Graphics (IRIX) and DEC/Alpha (OSF1). We have not tested the code on any other |
|---|
| 351 | systems. Just use makefile to make on most systems. For Sun, you need to |
|---|
| 352 | have the GNU C (gcc) compiler installed ... use the file makefile.sun in this |
|---|
| 353 | case. You make the program with: |
|---|
| 354 | make (or make -f makefile.sun) |
|---|
| 355 | |
|---|
| 356 | This produces the file clustalw which can be run by typing clustalw and |
|---|
| 357 | pressing return. The help file is called clustalw_help |
|---|
| 358 | |
|---|
| 359 | |
|---|
| 360 | VMS |
|---|
| 361 | ---- |
|---|
| 362 | |
|---|
| 363 | There is a small DCL command file (VMSLINK.COM) to compile and link the |
|---|
| 364 | code for VMS machines (vax or alpha). This procedure just compiles the |
|---|
| 365 | source files and links using default settings. Run it using: |
|---|
| 366 | $ @vmslink |
|---|
| 367 | This produces Clustalw.exe which can be run using the run command: |
|---|
| 368 | $ run clustalw |
|---|
| 369 | |
|---|
| 370 | The intermediate object files can be deleted with: |
|---|
| 371 | $ del *.obj; |
|---|
| 372 | |
|---|
| 373 | There is an extensive command line facility. To use this, you must |
|---|
| 374 | create a symbol to run the program (and put this in your login.com file). |
|---|
| 375 | e.g. |
|---|
| 376 | $ clustalw :== $$drive:[dir.dir]clustalw |
|---|
| 377 | where $drive is the drive on which the executable file is stored (clustalw.exe) |
|---|
| 378 | and [dir.dir] is the full directory specification. NOTE THE EXTRA DOLLAR SIGN. |
|---|
| 379 | Then the program can be run using the command: |
|---|
| 380 | $ clustalw |
|---|
| 381 | |
|---|
| 382 | |
|---|
| 383 | PC |
|---|
| 384 | -- |
|---|
| 385 | |
|---|
| 386 | We supply an executable file (Clustalw.exe) which will run using MSDOS. |
|---|
| 387 | It will also run under windows (as a DOS application) |
|---|
| 388 | *** IF you have a maths coprocessor***. If you do not have a maths chip |
|---|
| 389 | (e.g. 80387), the program can only be run under MSDOS. In the latter case, |
|---|
| 390 | you must have the file EMU387.exe in the same directory as CLUSTALW.EXE. |
|---|
| 391 | This file emulates a maths chip if you do not have one. |
|---|
| 392 | |
|---|
| 393 | |
|---|
| 394 | We generated the executable file using gnu c for MSDOS. |
|---|
| 395 | It will also compile (with about 10,000 warning messages) |
|---|
| 396 | using Microsoft C but we have not tested it and there appear to be problems |
|---|
| 397 | with the executable. |
|---|
| 398 | |
|---|
| 399 | You will need to use a "memory extender" to allow the program to get at more |
|---|
| 400 | than 640kb of memory. |
|---|
| 401 | |
|---|
| 402 | |
|---|
| 403 | |
|---|
| 404 | MAC |
|---|
| 405 | --- |
|---|
| 406 | |
|---|
| 407 | The code compiles for Power Mac and older macs using the Metrowerks CodeWarrior |
|---|
| 408 | C compiler. We supply 2 executable programs (one each for PowerMac and |
|---|
| 409 | older mac): ClustalwPPC and Clustalw68k. These need up to |
|---|
| 410 | 10mb of memory to run which needs to be adjusted with the Get Info (%I) |
|---|
| 411 | command from the Finder if you have problems. Just double click the |
|---|
| 412 | executable file name or icon and off you go (we hope). |
|---|
| 413 | |
|---|
| 414 | As a special treat for Mac users, we supply an executable and brief readme |
|---|
| 415 | file for NJPLOT. This is a really nice program by Manolo Gouy |
|---|
| 416 | (University of Lyon, France) that allows you to import the trees |
|---|
| 417 | made by Clustal W and display them/manipulate them. It will properly |
|---|
| 418 | display the bootstrap figures from the *.phb files. It can export the |
|---|
| 419 | trees in PICT format which can then be used by MacDraw for example. |
|---|
| 420 | |
|---|
| 421 | |
|---|
| 422 | ------------------------------------------------------------------------- |
|---|
| 423 | |
|---|
| 424 | 2) FILE INPUT (sequences to be aligned) |
|---|
| 425 | |
|---|
| 426 | |
|---|
| 427 | |
|---|
| 428 | The sequences must all be in one file (or two files for a "profile alignment") |
|---|
| 429 | in ONE of the following formats: |
|---|
| 430 | |
|---|
| 431 | FASTA (Pearson), NBRF/PIR, EMBL/Swiss Prot, GDE, CLUSTAL, GCG/MSF, GCG9/RSF. |
|---|
| 432 | |
|---|
| 433 | The program tries to "guess" which format is being used and whether |
|---|
| 434 | the sequences are nucleic acid (DNA/RNA) or amino acid (proteins). The |
|---|
| 435 | format is recognised by the first characters in the file. This is kind |
|---|
| 436 | of stupid/crude but works most of the time and it is difficult |
|---|
| 437 | to do reliably any other way. |
|---|
| 438 | |
|---|
| 439 | |
|---|
| 440 | Format First non blank word or character in the file. |
|---|
| 441 | ............................................................... |
|---|
| 442 | FASTA > |
|---|
| 443 | NBRF >P1; or >D1; |
|---|
| 444 | EMBL/SWISS ID |
|---|
| 445 | GDE protein % |
|---|
| 446 | GDE nucleotide # |
|---|
| 447 | CLUSTAL CLUSTAL (blocked multiple alignments) |
|---|
| 448 | GCG/MSF PILEUP or !!AA_MULTIPLE_ALIGNMENT or !!NA_MULTIPLE_ALIGNMENT |
|---|
| 449 | or MSF on the first line, and '..' at the end of line |
|---|
| 450 | GCG9/RSF !!RICH_SEQUENCE |
|---|
| 451 | |
|---|
| 452 | Note that a file is only recognised as MSF format if its first line |
|---|
| 453 | carries one of the markers listed above; since version 1.7 the word PILEUP |
|---|
| 454 | is no longer required. If you produce this format from software other than |
|---|
| 455 | the GCG programs, make sure the first line carries one of these markers. |
|---|
| 456 | Similarly, if you use clustal format, the word CLUSTAL must appear first. |
|---|
| 457 | |
|---|
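The format table above can be read as a simple decision procedure. The sketch below is one way to write it, under some assumptions: matching is case sensitive, NBRF is tested before FASTA because both start with '>', and the function name and returned labels are hypothetical. It is not the clustalw implementation.

```python
# Illustrative sketch only; not taken from the clustalw source.
def guess_format(text):
    """Guess the input format from the first non-blank line of a file."""
    lines = [line.strip() for line in text.splitlines() if line.strip()]
    if not lines:
        return None
    first_line = lines[0]
    first_word = first_line.split()[0]

    if first_word.startswith((">P1;", ">D1;")):
        return "NBRF"
    if first_word.startswith(">"):
        return "FASTA"
    if first_word == "ID":
        return "EMBL/SWISS"
    if first_word.startswith("%"):
        return "GDE protein"
    if first_word.startswith("#"):
        return "GDE nucleotide"
    if first_word == "CLUSTAL":
        return "CLUSTAL"
    if first_word == "!!RICH_SEQUENCE":
        return "GCG9/RSF"
    if first_word in ("PILEUP", "!!AA_MULTIPLE_ALIGNMENT",
                      "!!NA_MULTIPLE_ALIGNMENT"):
        return "GCG/MSF"
    # Third MSF variant: 'MSF' on the first line and '..' at its end.
    if "MSF" in first_line and first_line.endswith(".."):
        return "GCG/MSF"
    return None
```

A file whose first line is not one of these markers returns None in this sketch; what clustalw itself does in that case is not covered here.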
| 458 | All of these formats can be used to read in AN EXISTING FULL ALIGNMENT. |
|---|
| 459 | With CLUSTAL format, this is just the same as the output format of this |
|---|
| 460 | program and Clustal V. If you use PILEUP or CLUSTAL format, all sequences |
|---|
| 461 | must be the same length, INCLUDING GAPS ("-" in clustal format; "." in MSF). |
|---|
| 462 | With the other formats, sequences can be gapped with "-" characters. If you |
|---|
| 463 | read in any gaps these are kept during any later alignments. You can use |
|---|
| 464 | this facility to read in an alignment in order to calculate a phylogenetic |
|---|
| 465 | tree OR to output the same alignment in a different format (from the |
|---|
| 466 | output format options menu of the multiple alignment menu) e.g. read |
|---|
| 467 | in a GCG/MSF format alignment and output a PHYLIP format alignment. This is |
|---|
| 468 | also useful to read in one reference alignment and to add one or more new |
|---|
| 469 | sequences to it using the "profile alignment" facilities. |
|---|
| 470 | |
|---|
| 471 | DNA vs. PROTEIN: the program will count the number of A,C,G,T,U and N |
|---|
| 472 | characters. If 85% or more of the characters in a sequence are as above, |
|---|
| 473 | then DNA/RNA is assumed, protein otherwise. |
|---|
| 474 | |
|---|
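The 85% rule can be sketched as below. Ignoring gap symbols and other non-letters when counting is an assumption of this sketch, as is the function name.

```python
# Illustrative sketch only; excluding non-letters is an assumption.
def looks_like_nucleic_acid(sequence, threshold=0.85):
    """Return True if at least 85% of the letters are A, C, G, T, U or N."""
    letters = [char for char in sequence.upper() if char.isalpha()]
    if not letters:
        return False
    hits = sum(1 for char in letters if char in "ACGTUN")
    return hits / len(letters) >= threshold
```

A short protein made up mostly of A, C, G and T residues would be misclassified by this rule, so it is worth checking the guess for unusual sequences.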
| 475 | ------------------------------------------------------------------------- |
|---|
| 476 | |
|---|
| 477 | |
|---|
| 478 | 3) FILE OUTPUT |
|---|
| 479 | |
|---|
| 480 | |
|---|
| 481 | 1) the alignments. |
|---|
| 482 | |
|---|
| 483 | In the multiple alignment and profile alignment menus, there is a menu |
|---|
| 484 | item to control the output format(s). |
|---|
| 485 | |
|---|
| 486 | The alignment output format can be set to any (or all) of: |
|---|
| 487 | CLUSTAL (a self explanatory blocked alignment) |
|---|
| 488 | NBRF/PIR (same as input format but with "-" characters for gaps) |
|---|
| 489 | MSF (the main GCG package multiple alignment format) |
|---|
| 490 | PHYLIP (Joe Felsenstein's phylogeny inference package. Gaps are set to |
|---|
| 491 | "-" characters. For some programs (e.g. PROTPARS/DNAPARS) these |
|---|
| 492 | should be changed to "?" characters for unknown residues.) |
|---|
| 493 | GDE (Used by Steven Smith's GDE package) |
|---|
| 494 | |
|---|
| 495 | You can also choose between having the sequences in the same order as in |
|---|
| 496 | the input file or writing them out in an order that more closely matches the |
|---|
| 497 | order used to carry out the multiple alignment. |
|---|
| 498 | |
|---|
| 499 | |
|---|
| 500 | 2) The trees. |
|---|
| 501 | |
|---|
| 502 | Believe it or not, we now use the New Hampshire (nested parentheses) |
|---|
| 503 | format as default for our trees. This format is compatible with e.g. the |
|---|
| 504 | PHYLIP package. If you want to view a tree, you can use the RETREE or |
|---|
| 505 | DRAWGRAM/DRAWTREE programs of PHYLIP. This format is used for all our |
|---|
| 506 | trees, even the initial guide trees for deciding the order of multiple |
|---|
| 507 | alignment. The output trees from the phylogenetic tree menu can also be |
|---|
| 508 | requested in our old verbose/cryptic format. This may be more useful |
|---|
| 509 | if, for example, you wish to see the bootstrap figures. The bootstrap |
|---|
| 510 | trees in the default New Hampshire format give the bootstrap figures |
|---|
| 511 | as extra labels which can be viewed very easily using TREETOOL which is |
|---|
| 512 | available as part of the GDE package. TREETOOL is available from the |
|---|
| 513 | RDP project by ftp from rdp.life.uiuc.edu. |
|---|
| 514 | |
|---|
| 515 | The New Hampshire format is only useful if you have software to display or |
|---|
| 516 | manipulate the trees. The PHYLIP package is highly recommended if you intend |
|---|
| 517 | to do much work with trees and includes programs for doing this. If you do |
|---|
| 518 | not have such software, request the trees in the older clustal format |
|---|
| 519 | and see the documentation for Clustal V (clustalv.txt). WE DO NOT PROVIDE |
|---|
| 520 | ANY DIRECT MEANS FOR VIEWING TREES GRAPHICALLY. |
|---|
| 521 | |
|---|
| 522 | ------------------------------------------------------------------------- |
|---|
| 523 | |
|---|
| 524 | 4) THE ALIGNMENT ALGORITHMS |
|---|
| 525 | |
|---|
| 526 | |
|---|
| 527 | The basic algorithm is the same as for Clustal V and is described in some |
|---|
| 528 | detail in clustalv.txt. The new modifications are described in detail in |
|---|
| 529 | clustalw.ms. Here we just list some notes to help answer some of the most |
|---|
| 530 | obvious questions. |
|---|
| 531 | |
|---|
| 532 | |
|---|
| 533 | Terminal Gaps |
|---|
| 534 | |
|---|
| 535 | In the original Clustal V program, terminal gaps were penalised the same |
|---|
| 536 | as all other gaps. This caused some ugly side effects e.g. |
|---|
| 537 | |
|---|
| 538 | acgtacgtacgtacgt acgtacgtacgtacgt |
|---|
| 539 | a----cgtacgtacgt gets the same score as ----acgtacgtacgt |
|---|
| 540 | |
|---|
| 541 | NOW, terminal gaps are free. This is better on average and stops silly |
|---|
| 542 | effects like single residues jumping to the edge of the alignment. However, |
|---|
| 543 | it is not perfect. It does mean that if there should be a gap near the end |
|---|
| 544 | of the alignment, the program may be reluctant to insert it i.e. |
|---|
| 545 | |
|---|
| 546 | cccccgggccccc cccccgggccccc |
|---|
| 547 | ccccc---ccccc may be considered worse (lower score) than cccccccccc--- |
|---|
| 548 | |
|---|
| 549 | In the right hand case above, the terminal gap is free and may score higher |
|---|
| 550 | than the left hand alignment. This can be prevented by lowering the gap |
|---|
| 551 | opening and extension penalties. It is difficult to get this right all the |
|---|
| 552 | time. Please watch the ends of your alignments. |
|---|
| 553 | |
|---|
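The two examples above can be checked with a toy scoring function. Everything numeric here is an assumption made for illustration: matches score 1, mismatches 0, and a gap of length n costs gap_open + n * gap_ext. Only the idea of free versus penalised terminal gaps comes from the text.

```python
# Illustrative sketch only; the scoring scheme is an assumption.
def score_alignment(top, bottom, gap_open, gap_ext, free_end_gaps):
    """Score two aligned sequences, optionally with free terminal gaps."""
    if len(top) != len(bottom):
        raise ValueError("aligned sequences must have equal length")
    n = len(top)
    score = 0.0
    i = 0
    while i < n:
        if top[i] != "-" and bottom[i] != "-":
            score += 1.0 if top[i] == bottom[i] else 0.0
            i += 1
            continue
        # Find the end of this run of gap characters in one sequence.
        gapped = top if top[i] == "-" else bottom
        j = i
        while j < n and gapped[j] == "-":
            j += 1
        terminal = (i == 0) or (j == n)
        if not (free_end_gaps and terminal):
            score -= gap_open + gap_ext * (j - i)
        i = j
    return score
```

With gap_open=10 and gap_ext=1, the two acgt alignments tie when terminal gaps are penalised, and the internal gap in the cccccgggccccc example scores -3 against 7 for the terminal gap when terminal gaps are free. Lowering the penalties to gap_open=1 and gap_ext=0.1 brings the internal gap up to 8.7, which is the effect described above.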
| 554 | |
|---|
| 555 | |
|---|
| 556 | Speed of the initial (pairwise) alignments (fast approximate/slow accurate) |
|---|
| 557 | |
|---|
| 558 | By default, the initial pairwise alignments are now carried out using a full |
|---|
| 559 | dynamic programming algorithm. This is more accurate than the older hash/ |
|---|
| 560 | k-tuple based alignments (Wilbur and Lipman) but is MUCH slower. On a fast |
|---|
| 561 | workstation you may not notice but on a slow box, the difference is extreme. |
|---|
| 562 | You can set the alignment method from the menus easily to the older, faster |
|---|
| 563 | method. |
|---|
| 564 | |
|---|
| 565 | |
|---|
| 566 | |
|---|
| 567 | Delaying alignment of distant sequences |
|---|
| 568 | |
|---|
| 569 | The user can set a cut off to delay the alignment of the most divergent |
|---|
| 570 | sequences in a data set until all other sequences have been aligned. By |
|---|
| 571 | default, this is set to 40% which means that if a sequence is less than 40% |
|---|
| 572 | identical to any other sequence, its alignment will be delayed. |
|---|
| 573 | |
|---|
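One reading of the cut off is sketched below: a sequence is delayed when its best percent identity to every other sequence falls below the cut off. The function name and the matrix layout are assumptions, and the way clustalw actually derives the identities is not shown.

```python
# Illustrative sketch only; this is one interpretation of the rule.
def delayed_sequences(identity, cutoff=40.0):
    """Return indices of sequences whose alignment would be delayed.

    identity[i][j] is the percent identity between sequences i and j.
    """
    delayed = []
    for i, row in enumerate(identity):
        others = [value for j, value in enumerate(row) if j != i]
        if others and max(others) < cutoff:
            delayed.append(i)
    return delayed
```

In a set of three sequences where the third is only 30% and 35% identical to the other two, only the third is delayed.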
| 574 | |
|---|
| 575 | |
|---|
| 576 | Iterative realignment/Reset gaps between alignments |
|---|
| 577 | |
|---|
| 578 | If you align a set of sequences a second time (e.g. with changed gap |
|---|
| 579 | penalties), the "reset gaps" switch in the menus decides whether the gaps |
|---|
| 580 | from the first alignment are discarded or kept. Since version 1.7 the switch |
|---|
| 581 | is off by default, so the older gaps are kept. Keeping the gaps and doing the |
|---|
| 582 | full multiple alignment again can sometimes give a better result. Sometimes, the |
|---|
| 583 | alignment will converge on a better solution; sometimes the new alignment will |
|---|
| 584 | be the same as the first. There can be a strange side effect: you can get |
|---|
| 585 | columns of nothing but gaps introduced. |
|---|
| 586 | |
|---|
| 587 | Any gaps that are read in from the input file are always kept, regardless |
|---|
| 588 | of the setting of this switch. If you read in a full multiple alignment, the "reset |
|---|
| 589 | gaps" switch has no effect. The old gaps will remain and if you carry out |
|---|
| 590 | a multiple alignment, any new gaps will be added in. If you wish to carry out |
|---|
| 591 | a full new alignment of a set of sequences that are already aligned in a file |
|---|
| 592 | you must input the sequences without gaps. |
|---|
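If the sequences are only available as an alignment, the gaps have to be removed before they are read in again. A minimal sketch of that preprocessing step follows; it is not part of Clustal W, and the function name, the dictionary layout and the choice of '-' and '.' as gap characters are assumptions.

```python
def strip_gaps(sequences, gap_chars="-."):
    """Return a copy of the sequences with all gap characters removed."""
    return {
        name: "".join(ch for ch in seq if ch not in gap_chars)
        for name, seq in sequences.items()
    }


# Two made-up aligned sequences.
aligned = {"seq1": "MKV--LAT", "seq2": "MK-IALAT"}
print(strip_gaps(aligned))  # {'seq1': 'MKVLAT', 'seq2': 'MKIALAT'}
```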
| 593 | |
|---|
| 594 | |
|---|
| 595 | |
|---|
| 596 | Profile alignment |
|---|
| 597 | |
|---|
| 598 | By profile alignment, we simply mean the alignment of old alignments/sequences. |
|---|
| 599 | In this context, a profile is just an existing alignment (or even a set of |
|---|
| 600 | unaligned sequences; see below). This allows you to |
|---|
| 601 | read in an old alignment (in any of the allowed input formats) and align |
|---|
| 602 | one or more new sequences to it. From the profile alignment menu, you |
|---|
| 603 | are allowed to read in 2 profiles. Either profile can be a full alignment |
|---|
| 604 | OR a single sequence. In the simplest mode, you simply align the two profiles |
|---|
| 605 | to each other. This is useful if you want to gradually build up a full |
|---|
| 606 | multiple alignment. |
|---|
| 607 | |
|---|
| 608 | A second option is to align the sequences from the second profile, one at |
|---|
| 609 | a time to the first profile. This is done, taking the underlying tree between |
|---|
| 610 | the sequences into account. This is useful if you have a set of new sequences |
|---|
| 611 | (not aligned) and you wish to add them all to an older alignment. |
|---|
| 612 | |
|---|
| 613 | ---------------------------------------------------------------------------- |
|---|
| 614 | |
|---|
| 615 | 5) CHANGES TO THE PHYLOGENETIC TREE CALCULATIONS AND SOME HINTS. |
|---|
| 616 | |
|---|
| 617 | |
|---|
| 618 | |
|---|
| 619 | IMPROVED DISTANCE CALCULATIONS FOR PROTEIN TREES |
|---|
| 620 | |
|---|
| 621 | |
|---|
| 622 | The phylogenetic trees in Clustal W (the real trees that you calculate |
|---|
| 623 | AFTER alignment; not the guide trees used to decide the branching order |
|---|
| 624 | for multiple alignment) use the Neighbor-Joining method of Saitou and |
|---|
| 625 | Nei based on a matrix of "distances" between all sequences. These distances |
|---|
| 626 | can be corrected for "multiple hits". This is normal practice when accurate |
|---|
| 627 | trees are needed. This correction stretches distances (especially large ones) |
|---|
| 628 | to try to correct for the fact that OBSERVED distances (mean number of |
|---|
| 629 | differences per site) greatly underestimate the actual number that happened |
|---|
| 630 | during evolution. |
|---|
| 631 | |
|---|
| 632 | In Clustal V we used a simple formula to convert an observed distance to one |
|---|
| 633 | that is corrected for multiple hits. The observed distance is the mean number |
|---|
| 634 | of differences per site in an alignment (ignoring sites with a gap) and is |
|---|
| 635 | therefore always between 0.0 (for identical sequences) and 1.0 (no residues the |
|---|
| 636 | same at any site). These distances can be multiplied by 100 to give percent |
|---|
| 637 | difference values. 100 minus percent difference gives percent identity. |
|---|
| 638 | The formula we use to correct for multiple hits is from Motoo Kimura |
|---|
| 639 | (Kimura, M. The neutral Theory of Molecular Evolution, Camb.Univ.Press, 1983, |
|---|
| 640 | page 75) and is: |
|---|
| 641 | |
|---|
| 642 | K = -Ln(1 - D - (D.D)/5) where D is the observed distance and K is |
|---|
| 643 | corrected distance. |
|---|
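The two quantities defined above can be sketched in a few lines. This is an illustration of the definitions in the text, not the Clustal W source: the function names, the gap characters and the example sequences are assumptions.

```python
import math


def observed_distance(seq_a, seq_b, gap_chars="-."):
    """Mean number of differences per site, skipping sites with a gap."""
    if len(seq_a) != len(seq_b):
        raise ValueError("aligned sequences must have the same length")
    compared = 0
    differences = 0
    for a, b in zip(seq_a, seq_b):
        if a in gap_chars or b in gap_chars:
            continue  # sites with a gap in either sequence are ignored
        compared += 1
        if a != b:
            differences += 1
    if compared == 0:
        raise ValueError("no ungapped sites to compare")
    return differences / compared


def kimura_distance(d):
    """K = -ln(1 - D - D*D/5); defined only while the log argument is > 0."""
    arg = 1.0 - d - d * d / 5.0
    if arg <= 0.0:
        raise ValueError("formula is undefined for this observed distance")
    return -math.log(arg)


# Made-up pair: 8 ungapped sites, 2 of them different.
d = observed_distance("MKVLA-TGQ", "MKILAATGE")
print("D =", d)
print("percent identity =", 100 * (1 - d))
print("K =", kimura_distance(d))
```

For this pair D is 0.25 (75% identity), and K comes out slightly larger than D, as expected for a small distance.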
| 644 | |
|---|
| 645 | This formula gives the mean number of estimated substitutions per site and, in |
|---|
| 646 | contrast to D (the observed number), can be greater than 1 i.e. more than |
|---|
| 647 | one substitution per site, on average. For example, if you observe 0.8 |
|---|
| 648 | differences per site (80% difference; 20% identity), then the above formula |
|---|
| 649 | predicts that there have been about 2.6 substitutions per site over the course |
|---|
| 650 | of evolution since the 2 sequences diverged. This can also be expressed in |
|---|
| 651 | PAM units by multiplying by 100 (mean number of substitutions per 100 residues). |
|---|
| 652 | The PAM scale of evolution and its derivation/calculation comes from the |
|---|
| 653 | work of Margaret Dayhoff and co-workers (the famous Dayhoff PAM series |
|---|
| 654 | of weight matrices also came from this work). Dayhoff et al. constructed |
|---|
| 655 | an elaborate model of protein evolution based on observed frequencies |
|---|
| 656 | of substitution between very closely related proteins. Using this model, |
|---|
| 657 | they derived a table relating observed distances to predicted PAM distances. |
|---|
| 658 | Kimura's formula, above, is just a "curve fitting" approximation to this table. |
|---|
| 659 | It is very accurate in the range 0.75 > D > 0.0 but becomes increasingly |
|---|
| 660 | inaccurate at high D (>0.75) and fails completely at around D = 0.85. |
|---|
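The point where the formula fails can be checked directly: the logarithm is undefined once 1 - D - (D.D)/5 reaches zero. The sketch below solves for that point and prints K (and the same value in PAM units) for a few values of D. The function name and the chosen values of D are illustrative assumptions.

```python
import math


def kimura_distance(d):
    """K = -ln(1 - D - D*D/5), or infinity once the argument is <= 0."""
    arg = 1.0 - d - d * d / 5.0
    return -math.log(arg) if arg > 0.0 else math.inf


# Positive root of 1 - D - D*D/5 = 0, i.e. of D*D + 5*D - 5 = 0.
breakdown = (-5.0 + math.sqrt(45.0)) / 2.0
print(f"formula undefined from D = {breakdown:.3f}")

for d in (0.10, 0.50, 0.75, 0.85):
    k = kimura_distance(d)
    print(f"D = {d:.2f}  K = {k:.2f}  PAM = {100 * k:.0f}")
```

The steep rise of K just below the breakdown point is why small errors in D translate into very large errors in the corrected distance at high divergence.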
| 661 | |
|---|
| 662 | To circumvent this problem, we calculated all the values for K corresponding |
|---|
| 663 | to D above 0.75 directly using the Dayhoff model and store these in an |
|---|
| 664 | internal table, used by Clustal W. This table is declared in the file dayhoff.h and |
|---|
| 665 | gives values of K for all D between 0.75 and 0.93 in intervals of 0.001 i.e. |
|---|
| 666 | for D = 0.750, 0.751, 0.752 ...... 0.929, 0.930. For any observed D |
|---|
| 667 | higher than 0.930, we arbitrarily set K to 10.0. This sounds drastic but |
|---|
| 668 | with real sequences, distances of 0.93 (less than 7% identity) are rare. |
|---|
| 669 | If your data set includes sequences with this degree of divergence, you |
|---|
| 670 | will have great difficulty getting accurate trees by ANY method; the alignment |
|---|
| 671 | itself will be very difficult (to construct and to evaluate). |
|---|
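The three ranges described above (formula, table lookup, fixed value) can be sketched as follows. This is not the code from Clustal W: the function name is hypothetical, the table is passed in by the caller and filled here with placeholder numbers rather than the real values from dayhoff.h, it is assumed to hold K directly in substitutions per site, and rounding D to the nearest 0.001 is one possible choice.

```python
import math


def corrected_distance(d, table):
    """Pick the correction by range: formula, table lookup, or fixed value.

    table is assumed to hold 181 values of K for D = 0.750 ... 0.930
    in steps of 0.001.
    """
    if d < 0.75:
        return -math.log(1.0 - d - d * d / 5.0)
    if d > 0.93:
        return 10.0  # arbitrary value used above the end of the table
    index = round((d - 0.75) * 1000)  # 0.750 -> 0, 0.930 -> 180
    return table[index]


# Placeholder values only, NOT the real Dayhoff-derived numbers.
placeholder_table = [float(i) for i in range(181)]

print(corrected_distance(0.751, placeholder_table))  # 1.0, entry 1 of the placeholder
print(corrected_distance(0.95, placeholder_table))   # 10.0
```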
| 672 | |
|---|
| 673 | There are some important |
|---|
| 674 | things to note. Firstly, this formula works well if your sequences are |
|---|
| 675 | of average amino acid composition and if the amino acids substitute according |
|---|
| 676 | to the original Dayhoff model. In other cases, it may be misleading. Secondly, |
|---|
| 677 | it is based only on observed percent distance i.e. it does not DIRECTLY |
|---|
| 678 | take conservative substitutions into account. Thirdly, the error on the |
|---|
| 679 | estimated PAM distances may be VERY great for high distances; at very high |
|---|
| 680 | distance (e.g. over 85%) it may give largely arbitrary corrected distances. |
|---|
| 681 | In most cases, however, the correction is still worth using; the trees will |
|---|
| 682 | be more accurate and the branch lengths will be more realistic. |
|---|
| 683 | |
|---|
| 684 | A far more sophisticated distance correction based on a full Dayhoff |
|---|
| 685 | model which DOES take conservative substitutions and actual amino acid |
|---|
| 686 | composition into account, may be found in the PROTDIST program of the |
|---|
| 687 | PHYLIP package. For serious tree makers, this program is highly recommended. |
|---|
| 688 | |
|---|
| 689 | |
|---|
| 690 | |
|---|
| 691 | TWO NOTES ON BOOTSTRAPPING... |
|---|
| 692 | |
|---|
| 693 | When you use the BOOTSTRAP in Clustal W to estimate the reliability of parts |
|---|
| 694 | of a tree, many of the uncorrected distances may randomly exceed the arbitrary cut |
|---|
| 695 | off of 0.93 (sequences only 7% identical) if the sequences are distantly |
|---|
| 696 | related. This will happen randomly i.e. even if none of the pairs of |
|---|
| 697 | sequences are less than 7% identical, the bootstrap samples may contain pairs |
|---|
| 698 | of sequences that do exceed this cut off. |
|---|
| 699 | If this happens, you will be warned. In practice, this can |
|---|
| 700 | happen with many data sets. It is not a serious problem if it happens rarely. |
|---|
| 701 | If it happens often (you are warned when it happens and told how often the |
|---|
| 702 | problem occurs), you should consider removing the most distantly |
|---|
| 703 | related sequences and/or using the PHYLIP package instead. |
|---|
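The effect can be reproduced with a small simulation. The sketch below resamples alignment columns with replacement, which is the usual way of bootstrapping an alignment, and counts how many samples push a pair over the cut off. It is an illustration, not the Clustal W implementation; the function names, the seed and the two made-up sequences (18 differences in 20 sites, so D = 0.90) are assumptions.

```python
import random


def observed_distance(seq_a, seq_b, gap_chars="-."):
    """Mean number of differences per ungapped site."""
    pairs = [(a, b) for a, b in zip(seq_a, seq_b)
             if a not in gap_chars and b not in gap_chars]
    if not pairs:
        return 1.0  # assumption: nothing comparable counts as maximally distant
    return sum(a != b for a, b in pairs) / len(pairs)


def count_over_cutoff(seq_a, seq_b, samples=1000, cutoff=0.93, seed=0):
    """Count bootstrap samples whose observed distance exceeds the cut off."""
    rng = random.Random(seed)
    n = len(seq_a)
    over = 0
    for _ in range(samples):
        cols = [rng.randrange(n) for _ in range(n)]  # columns, with replacement
        a = "".join(seq_a[i] for i in cols)
        b = "".join(seq_b[i] for i in cols)
        if observed_distance(a, b) > cutoff:
            over += 1
    return over


seq_a = "ACDEFGHIKLMNPQRSTVWY"
seq_b = "ACEDGFIHLKNMQPSRVTYW"
print("D for the full alignment:", observed_distance(seq_a, seq_b))
print("samples over 0.93:", count_over_cutoff(seq_a, seq_b), "of 1000")
```

Although the pair itself is below the cut off, a sizeable fraction of the resampled alignments exceeds it, simply because some samples happen to miss the two identical columns.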
| 704 | |
|---|
| 705 | |
|---|
| 706 | A further problem arises in almost exactly the opposite situation: when |
|---|
| 707 | you bootstrap a data set which contains 3 or more sequences that are identical |
|---|
| 708 | or almost identical. Here, the sets of identical sequences should be shown |
|---|
| 709 | as a multifurcation (several sequences joining at the same part of the tree). |
|---|
| 710 | Because the Neighbor-Joining method only gives strictly dichotomous trees |
|---|
| 711 | (never more than 2 sequences join at one time), this cannot be exactly |
|---|
| 712 | represented. In practice, this is NOT a problem as there will be some |
|---|
| 713 | internal branches of zero length separating the sequences. If you |
|---|
| 714 | display the tree with all branch lengths, you will still see a multifurcation. |
|---|
| 715 | However, when you bootstrap |
|---|
| 716 | the tree, only the branching orders are stored and counted. In the case |
|---|
| 717 | of multifurcations, the exact branching order is arbitrary but the program |
|---|
| 718 | will always get the same branching order, depending only on the input order |
|---|
| 719 | of the sequences. In practice, this is only a problem in situations where |
|---|
| 720 | you have a set of sequences where all of them are VERY similar. In this case, |
|---|
| 721 | you can find very high support for some groupings which will disappear if you |
|---|
| 722 | run the analysis with a different input order. Again, the PHYLIP package |
|---|
| 723 | deals with this by offering a JUMBLE option to shuffle the input order |
|---|
| 724 | of your sequences between each bootstrap sample. |
|---|
| 725 | |
|---|
| 726 | ---------------------------------------------------------------------------- |
|---|
| 727 | |
|---|
| 728 | 6) SUMMARY OF THE COMMAND LINE USAGE |
|---|
| 729 | |
|---|
| 730 | Clustal W is designed to be run interactively. However, there are many |
|---|
| 731 | situations where it is convenient to run it from the command line, especially |
|---|
| 732 | if you wish to run it from another piece of software (e.g. SeqApp or GDE). |
|---|
| 733 | All parameters can be set from the command line by giving options after the |
|---|
| 734 | clustalw command. On UNIX, options should be preceded by '-'; all other systems |
|---|
| 735 | use the '/' character. |
|---|
| 736 | |
|---|
| 737 | If anything is put on the command line, the program will (attempt to) carry |
|---|
| 738 | out whatever is requested and will exit. If you wish to use the command |
|---|
| 739 | line to set some parameters and then go into interactive mode, use the |
|---|
| 740 | command line switch: interactive .... e.g. |
|---|
| 741 | |
|---|
| 742 | clustalw -quicktree -interactive on UNIX |
|---|
| 743 | or |
|---|
| 744 | clustalw /quicktree /interactive on VMS,MAC and PC |
|---|
| 745 | |
|---|
| 746 | will set the default initial alignment mode to fast/approximate and will then |
|---|
| 747 | go to the main menu. |
|---|
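When Clustal W is driven from another piece of software, the only platform difference described here is the option prefix. A minimal sketch of a helper that builds the argument list follows; the helper and its parameters are hypothetical, and only the two switches shown in the example above are used.

```python
def clustalw_argv(switches, unix=True, program="clustalw"):
    """Build an argument list, using '-' on UNIX and '/' elsewhere."""
    prefix = "-" if unix else "/"
    return [program] + [prefix + name for name in switches]


print(clustalw_argv(["quicktree", "interactive"]))
# ['clustalw', '-quicktree', '-interactive']
print(clustalw_argv(["quicktree", "interactive"], unix=False))
# ['clustalw', '/quicktree', '/interactive']
```

The resulting list could be handed to something like Python's subprocess.run; whether the program is found under the name "clustalw" depends on the local installation.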
| 748 | |
|---|
| 749 | |
|---|
| 750 | To see a list of all the command line parameters, type: |
|---|
| 751 | |
|---|
| 752 | clustalw -options on UNIX |
|---|
| 753 | or |
|---|
| 754 | clustalw /options on VMS,MAC and PC |
|---|
| 755 | |
|---|
| 756 | and you will see a list with no explanation. |
|---|
| 757 | |
|---|
| 758 | |
|---|
| 759 | To get (VERY BRIEF) help on command line usage, use the /HELP or /CHECK |
|---|
| 760 | (-help or -check on UNIX systems) options. Otherwise, the command line |
|---|
| 761 | usage is self-explanatory or is explained in clustalv.txt. The defaults |
|---|
| 762 | for all parameters are set in the file param.h which can be changed easily |
|---|
| 763 | (remember to recompile the program afterwards :-). |
|---|
| 764 | |
|---|
| 765 | ------------------------------------------------------------------------------ |
|---|