| 1 | ****************************************************************************** |
|---|
| 2 | |
|---|
| 3 | CLUSTAL W Multiple Sequence Alignment Program |
|---|
| 4 | (version 1.83, Feb 2003) |
|---|
| 5 | |
|---|
| 6 | ****************************************************************************** |
|---|
| 7 | |
|---|
| 8 | |
|---|
| 9 | Please send bug reports, comments etc. to one of:- |
|---|
| 10 | gibson@embl-heidelberg.de |
|---|
| 11 | thompson@igbmc.u-strasbg.fr |
|---|
| 12 | d.higgins@ucc.ie |
|---|
| 13 | |
|---|
| 14 | |
|---|
| 15 | ****************************************************************************** |
|---|
| 16 | |
|---|
| 17 | POLICY ON COMMERCIAL DISTRIBUTION OF CLUSTAL W |
|---|
| 18 | |
|---|
| 19 | Clustal W is freely available to the user community. However, Clustal W is |
|---|
| 20 | increasingly being distributed as part of commercial sequence analysis |
|---|
| 21 | packages. To help us safeguard future maintenance and development, commercial |
|---|
| 22 | distributors of Clustal W must take out a NON-EXCLUSIVE LICENCE. Anyone |
|---|
| 23 | wishing to commercially distribute version 1.81 of Clustal W should contact the |
|---|
| 24 | authors unless they have previously taken out a licence. |
|---|
| 25 | |
|---|
| 26 | ****************************************************************************** |
|---|
| 27 | |
|---|
| 28 | Clustal W is written in ANSI-C and can be run on any machine with an ANSI-C |
|---|
| 29 | compiler. Executables are provided for several major platforms. |
|---|
| 30 | |
|---|
| 31 | Changes since CLUSTAL X Version 1.82 |
|---|
| 32 | ------------------------------------ |
|---|
| 33 | |
|---|
| 34 | 1. The FASTA format has been added to the list of alignment output options. |
|---|
| 35 | |
|---|
| 36 | 2. It is now possible to save the residue ranges (appended after the sequence |
|---|
| 37 | names) when saving a specified range of the alignment. |
|---|
| 38 | |
|---|
| 39 | 3. The efficiency of the neighbour-joining algorithm has been improved. This |
|---|
| 40 | work was done by Tadashi Koike at the Center for Information Biology and DNA Data |
|---|
| 41 | Bank of Japan and FUJITSU Limited. |
|---|
| 42 | |
|---|
| 43 | Some example speedups are given below : (timings on a SPARC64 CPU) |
|---|
| 44 | |
|---|
| 45 | No. of sequences    original NJ    new NJ |
|---|
| 46 | 200                 0' 12"         0.1" |
|---|
| 47 | 500                 9' 19"         1.4" |
|---|
| 48 | 1000                XXXX           0' 31" |
|---|
| 49 | |
|---|
| 50 | Changes since version 1.8 |
|---|
| 51 | -------------------------- |
|---|
| 52 | |
|---|
| 53 | 1. ClustalW now returns error codes for some common errors when exiting. This |
|---|
| 54 | may be useful for people who run clustalw automatically from within a script. |
|---|
| 55 | Error codes are: |
|---|
| 56 | 1 bad command line option |
|---|
| 57 | 2 cannot open sequence file |
|---|
| 58 | 3 wrong format in sequence file |
|---|
| 59 | 4 sequence file contains only 1 sequence (for multiple alignments) |
|---|
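For scripted use, the error codes listed above can be mapped to readable messages. The following is a minimal Python sketch, not part of the Clustal W distribution. The helper names `explain_exit_code` and `run_clustalw` are hypothetical, the executable is assumed to be on the PATH under the name `clustalw`, and treating 0 as success is an assumption based on the usual convention rather than on this README.

```python
import subprocess

# Exit codes as listed in this README.
CLUSTALW_ERRORS = {
    1: "bad command line option",
    2: "cannot open sequence file",
    3: "wrong format in sequence file",
    4: "sequence file contains only 1 sequence (for multiple alignments)",
}


def explain_exit_code(code):
    """Return a readable description of a ClustalW exit code."""
    if code == 0:
        # Assumption: 0 means success, as for most command line programs.
        return "success"
    return CLUSTALW_ERRORS.get(code, "undocumented exit code %d" % code)


def run_clustalw(args, executable="clustalw"):
    """Run ClustalW with the given arguments; return (exit code, description)."""
    result = subprocess.run([executable] + list(args))
    return result.returncode, explain_exit_code(result.returncode)
```

A calling script could then branch on the returned code, for example skipping an input file when the code is 4 and stopping on any other non-zero value.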
| 60 | |
|---|
| 61 | 2. Alignments can now be saved in Nexus format, for compatibility with PAUP, |
|---|
| 62 | MacClade etc. For a description of the Nexus format, see: |
|---|
| 63 | Maddison, D. R., D. L. Swofford and W. P. Maddison. 1997. |
|---|
| 64 | NEXUS: an extensible file format for systematic information. |
|---|
| 65 | Systematic Biology 46:590-621. |
|---|
| 66 | |
|---|
| 67 | 3. Phylogenetic trees can also be saved in Nexus format. |
|---|
| 68 | |
|---|
| 69 | 4. A ClustalW icon has been designed for MAC and PC systems. |
|---|
| 70 | |
|---|
| 71 | |
|---|
| 72 | Changes since version 1.74 |
|---|
| 73 | -------------------------- |
|---|
| 74 | |
|---|
| 75 | 1. Some work has been done to automatically select the optimal parameters |
|---|
| 76 | depending on the set of sequences to be aligned. The Gonnet series of residue |
|---|
| 77 | comparison matrices are now used by default. The Blosum series remains as an |
|---|
| 78 | option. The default gap extension penalty for proteins has been changed to 0.2 |
|---|
| 79 | (was 0.05). The 'delay divergent sequences' option has been changed to 30% |
|---|
| 80 | residue identity (was 40%). |
|---|
| 81 | |
|---|
| 82 | 2. The default parameters used when the 'Negative matrix' option is selected |
|---|
| 83 | have been optimised. This option may help when the sequences to be aligned are |
|---|
| 84 | not superposable over their whole lengths (e.g. in the presence of N/C terminal |
|---|
| 85 | extensions). |
|---|
| 86 | |
|---|
| 87 | 3. A bug in the calculation of phylogenetic trees for 2 sequences has been |
|---|
| 88 | fixed. |
|---|
| 89 | |
|---|
| 90 | 4. A command line option has been added to turn off the sequence weighting |
|---|
| 91 | calculation. |
|---|
| 92 | |
|---|
| 93 | 5. The phylogenetic tree calculation now ignores any ambiguity codes in the |
|---|
| 94 | sequences. |
|---|
| 95 | |
|---|
| 96 | 6. A bug in the memory access during the calculation of profiles has been |
|---|
| 97 | fixed. (Thanks to Haruna Cofer at SGI). |
|---|
| 98 | |
|---|
| 99 | 7. A bug has been fixed in the 'transition weight' option for nucleic acid |
|---|
| 100 | sequences. (Thanks to Chanan Rubin at Compugen). |
|---|
| 101 | |
|---|
| 102 | 8. An option has been added to read in a series of comparison matrices from a |
|---|
| 103 | file. This option is only applicable for protein sequences. For details of the |
|---|
| 104 | file format, see the on-line documentation. |
|---|
| 105 | |
|---|
| 106 | 9. The MSF output file format has been changed. The sequence weights |
|---|
| 107 | calculated by Clustal W are now included in the header. |
|---|
| 108 | |
|---|
| 109 | 10. Two bugs in the FAST/APPROXIMATE pairwise alignments have been fixed. One |
|---|
| 110 | involved the alignment of new sequences to an existing profile using the fast |
|---|
| 111 | pairwise alignment option; the second was caused by changing the default |
|---|
| 112 | options for the fast pairwise alignments. |
|---|
| 113 | |
|---|
| 114 | 11. A bug in the alignment of a small number of sequences has been fixed. |
|---|
| 115 | Previously a Guide Tree was not calculated for fewer than 4 sequences. |
|---|
| 116 | |
|---|
| 117 | |
|---|
| 118 | Changes since version 1.6 |
|---|
| 119 | ------------------------- |
|---|
| 120 | |
|---|
| 121 | 1. The static arrays used by clustalw for storing the alignment data have been |
|---|
| 122 | replaced by dynamically allocated memory. There is now no limit on the number |
|---|
| 123 | or length of sequences which can be input. |
|---|
| 124 | |
|---|
| 125 | 2. The alignment of DNA sequences now offers a new hard-coded matrix, as well |
|---|
| 126 | as the identity matrix used previously. The new matrix is the default scoring |
|---|
| 127 | matrix used by the BESTFIT program of the GCG package for the comparison of |
|---|
| 128 | nucleic acid sequences. X's and N's are treated as matches to any IUB ambiguity |
|---|
| 129 | symbol. All matches score 1.9; all mismatches for IUB symbols score 0.0. |
|---|
| 130 | |
|---|
| 131 | 3. The transition weight option for aligning nucleotide sequences has been |
|---|
| 132 | changed from an on/off toggle to a weight between 0 and 1. A weight of zero |
|---|
| 133 | means that the transitions are scored as mismatches; a weight of 1 gives |
|---|
| 134 | transitions the full match score. For distantly related DNA sequences, the |
|---|
| 135 | weight should be near to zero; for closely related sequences it can be useful |
|---|
| 136 | to assign a higher score. |
|---|
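The effect of the transition weight can be illustrated with a small Python sketch. It is not the Clustal W source code. The README fixes only the two end points (a weight of 0 scores transitions as mismatches, a weight of 1 gives them the full match score); the linear scaling in between is an assumption, as are the function names. The default match and mismatch scores (1.9 and 0.0) are taken from item 2 above, and only the unambiguous bases A, C, G and T are handled.

```python
PURINES = frozenset("AG")
PYRIMIDINES = frozenset("CT")


def is_transition(a, b):
    """True for a purine-purine or pyrimidine-pyrimidine substitution."""
    pair = {a.upper(), b.upper()}
    if len(pair) != 2:
        return False  # identical bases are a match, not a substitution
    return pair <= PURINES or pair <= PYRIMIDINES


def pair_score(a, b, transition_weight, match=1.9, mismatch=0.0):
    """Score two bases, scaling transitions between mismatch and match."""
    if not 0.0 <= transition_weight <= 1.0:
        raise ValueError("transition weight must be between 0 and 1")
    if a.upper() == b.upper():
        return match
    if is_transition(a, b):
        # Assumption: linear interpolation between the two documented extremes.
        return mismatch + transition_weight * (match - mismatch)
    return mismatch  # transversion
```

With a weight of 0 the pair A/G scores the same as a transversion such as A/C; with a weight of 1 it scores the same as A/A.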
| 137 | |
|---|
| 138 | 4. The RSF sequence alignment file format used by GCG Version 9 can now be |
|---|
| 139 | read. |
|---|
| 140 | |
|---|
| 141 | 5. The clustal sequence alignment file format has been changed to allow |
|---|
| 142 | sequence names longer than 10 characters. The maximum length allowed is set in |
|---|
| 143 | clustalw.h by the statement: |
|---|
| 144 | #define MAXNAMES 10 |
|---|
| 145 | |
|---|
| 146 | For the fasta format, the name is taken as the first string after the '>' |
|---|
| 147 | character, stopping at the first white space. (Previously, the first 10 |
|---|
| 148 | characters were taken, replacing blanks by underscores). |
|---|
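The FASTA naming rule described above can be sketched in Python as follows. This is an illustration only, not the Clustal W implementation. The function name `fasta_name` is hypothetical, and skipping any white space between the '>' and the name is an assumption, since the README does not say how that case is handled.

```python
def fasta_name(header_line):
    """Return the sequence name from a FASTA header line ('>name description')."""
    if not header_line.startswith(">"):
        raise ValueError("not a FASTA header line: %r" % header_line)
    # split() with no separator splits on any run of white space, so the
    # first field is the name and the rest of the line is ignored.
    fields = header_line[1:].split(None, 1)
    return fields[0] if fields else ""
```

For a header such as `>seq1 human haemoglobin` this returns `seq1`, whatever the length of the name.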
| 149 | |
|---|
| 150 | 6. The bootstrap values written in the phylip tree file format can be assigned |
|---|
| 151 | either to branches or nodes. The default is to write the values on the nodes, |
|---|
| 152 | as this can be read by several commonly-used tree display programs. Note, |
|---|
| 153 | however, that this can lead to confusion if the tree is rooted; the bootstraps |
|---|
| 154 | may be better attached to the internal branches. Software developers should |
|---|
| 155 | ensure they can read the branch label format. |
|---|
| 156 | |
|---|
| 157 | 7. The sequence weighting used during sequence to profile alignments has been |
|---|
| 158 | changed. The tree weight is now multiplied by the percent identity of the |
|---|
| 159 | new sequence compared with the most closely related sequence in the profile. |
|---|
| 160 | |
|---|
| 161 | 8. The sequence weighting used during profile to profile alignments has been |
|---|
| 162 | changed. A guide tree is now built for each profile separately and the |
|---|
| 163 | sequence weights calculated from the two trees. The weights for each |
|---|
| 164 | sequence are then multiplied by the percent identity of the sequence compared |
|---|
| 165 | with the most closely related sequence in the opposite profile. |
|---|
| 166 | |
|---|
| 167 | 9. The adjustment of the Gap Opening and Gap Extension Penalties for sequences |
|---|
| 168 | of unequal length has been improved. |
|---|
| 169 | |
|---|
| 170 | 10. The default order of the sequences in the output alignment file has been |
|---|
| 171 | changed. Previously the default was to output the sequences in the same order |
|---|
| 172 | as the input file. Now the default is to use the order in which the sequences |
|---|
| 173 | were aligned (from the guide tree/dendrogram), thus automatically grouping |
|---|
| 174 | closely related sequences. |
|---|
| 175 | |
|---|
| 176 | 11. The option to 'Reset Gaps between alignments' has been switched off by |
|---|
| 177 | default. |
|---|
| 178 | |
|---|
| 179 | 12. The conservation line output in the clustal format alignment file has been |
|---|
| 180 | changed. Three characters are now used: |
|---|
| 181 | '*' indicates positions which have a single, fully conserved residue |
|---|
| 182 | ':' indicates that one of the following 'strong' groups is fully conserved:- |
|---|
| 183 | STA |
|---|
| 184 | NEQK |
|---|
| 185 | NHQK |
|---|
| 186 | NDEQ |
|---|
| 187 | QHRK |
|---|
| 188 | MILV |
|---|
| 189 | MILF |
|---|
| 190 | HY |
|---|
| 191 | FYW |
|---|
| 192 | |
|---|
| 193 | '.' indicates that one of the following 'weaker' groups is fully conserved:- |
|---|
| 194 | CSA |
|---|
| 195 | ATV |
|---|
| 196 | SAG |
|---|
| 197 | STNK |
|---|
| 198 | STPA |
|---|
| 199 | SGND |
|---|
| 200 | SNDEQK |
|---|
| 201 | NDEQHK |
|---|
| 202 | NEQHRK |
|---|
| 203 | FVLIM |
|---|
| 204 | HFY |
|---|
| 205 | |
|---|
| 206 | These are all the positively scoring groups that occur in the Gonnet Pam250 |
|---|
| 207 | matrix. The strong and weak groups are defined as strong score >0.5 and weak |
|---|
| 208 | score <=0.5 respectively. |
|---|
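For developers who parse or regenerate the conservation line, the rules above can be sketched in Python. This is a minimal illustration, not the Clustal W source. The function names are hypothetical, and giving a blank to any column that contains a gap ('-') is an assumption that the README does not state.

```python
STRONG_GROUPS = ["STA", "NEQK", "NHQK", "NDEQ", "QHRK",
                 "MILV", "MILF", "HY", "FYW"]
WEAK_GROUPS = ["CSA", "ATV", "SAG", "STNK", "STPA", "SGND",
               "SNDEQK", "NDEQHK", "NEQHRK", "FVLIM", "HFY"]


def conservation_symbol(column):
    """Return '*', ':', '.' or ' ' for one alignment column (a string)."""
    residues = set(column.upper())
    if not residues or "-" in residues:
        return " "  # assumption: columns containing gaps are left blank
    if len(residues) == 1:
        return "*"  # a single, fully conserved residue
    if any(residues <= set(group) for group in STRONG_GROUPS):
        return ":"  # every residue falls inside one 'strong' group
    if any(residues <= set(group) for group in WEAK_GROUPS):
        return "."  # every residue falls inside one 'weaker' group
    return " "


def conservation_line(sequences):
    """Build the conservation line for a list of equal-length aligned sequences."""
    return "".join(conservation_symbol("".join(col)) for col in zip(*sequences))
```

The strong groups are tested before the weak ones, so a column that fits both (for example S and A, which occur together in STA and in CSA) is marked ':'.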
| 209 | |
|---|
| 210 | 13. A bug in the modification of the Myers and Miller alignment algorithm |
|---|
| 211 | for residue-specific gap penalties has been fixed. This occasionally caused |
|---|
| 212 | new gaps to be opened a few residues away from the optimal position. |
|---|
| 213 | |
|---|
| 214 | 14. The GCG/MSF input format no longer needs the word PILEUP on the first |
|---|
| 215 | line. Several versions can now be recognised:- |
|---|
| 216 | 1. The word PILEUP as the first word in the file |
|---|
| 217 | 2. The word !!AA_MULTIPLE_ALIGNMENT or !!NA_MULTIPLE_ALIGNMENT |
|---|
| 218 | as the first word in the file |
|---|
| 219 | 3. The characters MSF on the first line, and the |
|---|
| 220 | characters .. at the end of the line. |
|---|
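The three recognition rules can be sketched as a single check on the first line of a file. This Python sketch is an illustration under stated assumptions, not the Clustal W parser. The function name `looks_like_msf` is hypothetical, the comparison is assumed to be case-sensitive, and rule 3 is read as applying to the first line of the file.

```python
MSF_FIRST_WORDS = ("PILEUP", "!!AA_MULTIPLE_ALIGNMENT", "!!NA_MULTIPLE_ALIGNMENT")


def looks_like_msf(first_line):
    """Return True if the first line of a file matches one of the three rules."""
    words = first_line.split()
    if not words:
        return False
    # Rules 1 and 2: a recognised keyword is the first word in the file.
    if words[0] in MSF_FIRST_WORDS:
        return True
    # Rule 3: 'MSF' appears on the line and the line ends with '..'.
    return "MSF" in first_line and first_line.rstrip().endswith("..")
```

A caller would read only the first line of the input and fall back to the other supported formats when this check fails.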
| 221 | |
|---|
| 222 | 15. The standard command line separator for UNIX systems has been changed from |
|---|
| 223 | '/' to '-', i.e. to give options on the command line, you now type |
|---|
| 224 | |
|---|
| 225 | clustalw input.aln -gapopen=8.0 |
|---|
| 226 | |
|---|
| 227 | instead of clustalw input.aln /gapopen=8.0 |
|---|
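Scripts that build Clustal W command lines can isolate the separator in one place. The Python sketch below is a hypothetical helper, not part of the distribution. The function name and its `separator` parameter are assumptions; `gapopen` is the only option name taken from this README.

```python
def build_clustalw_command(input_file, executable="clustalw", separator="-", **options):
    """Return an argument list such as ['clustalw', 'input.aln', '-gapopen=8.0']."""
    command = [executable, input_file]
    for name, value in options.items():
        # Each option is written as <separator><name>=<value>.
        command.append("%s%s=%s" % (separator, name, value))
    return command
```

Calling `build_clustalw_command("input.aln", gapopen=8.0)` reproduces the new-style example above, and passing `separator="/"` reproduces the old one. The resulting list can be handed to `subprocess.run`.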
| 228 | |
|---|
| 229 | |
|---|
| 230 | ATTENTION SOFTWARE DEVELOPERS!! |
|---|
| 231 | ------------------------------- |
|---|
| 232 | |
|---|
| 233 | The CLUSTAL sequence alignment output format was modified from version 1.7: |
|---|
| 234 | |
|---|
| 235 | 1. Names longer than 10 chars are now allowed. (The maximum is specified in |
|---|
| 236 | clustalw.h by '#define MAXNAMES'.) |
|---|
| 237 | |
|---|
| 238 | 2. The consensus line now consists of three characters: '*',':' and '.'. (Only |
|---|
| 239 | the '*' and '.' were previously used.) |
|---|
| 240 | |
|---|
| 241 | 3. An option (not the default) has been added, allowing the user to print out |
|---|
| 242 | sequence numbers at the end of each line of the alignment output. |
|---|
| 243 | |
|---|
| 244 | 4. Both RNA bases (U) and base ambiguities are now supported in nucleic acid |
|---|
| 245 | sequences. In the past, all characters (upper or lower case) other than |
|---|
| 246 | a,c,g,t or u were converted to N. Now the following characters are recognised |
|---|
| 247 | and retained in the alignment output: ABCDGHKMNRSTUVWXY (upper or lower case). |
|---|
| 248 | |
|---|
| 249 | 5. A blank line inadvertently added in the version 1.6 header has been taken |
|---|
| 250 | out again. |
|---|
| 251 | |
|---|
| 252 | CLUSTAL REFERENCES |
|---|
| 253 | ------------------ |
|---|
| 254 | |
|---|
| 255 | Details of algorithms, implementation and useful tips on usage of Clustal |
|---|
| 256 | programs can be found in the following publications: |
|---|
| 257 | |
|---|
| 258 | Jeanmougin,F., Thompson,J.D., Gouy,M., Higgins,D.G. and Gibson,T.J. (1998) |
|---|
| 259 | Multiple sequence alignment with Clustal X. Trends Biochem Sci, 23, 403-5. |
|---|
| 260 | |
|---|
| 261 | Thompson,J.D., Gibson,T.J., Plewniak,F., Jeanmougin,F. and Higgins,D.G. (1997) |
|---|
| 262 | The ClustalX windows interface: flexible strategies for multiple sequence |
|---|
| 263 | alignment aided by quality analysis tools. Nucleic Acids Research, 25:4876-4882. |
|---|
| 264 | |
|---|
| 265 | Higgins, D. G., Thompson, J. D. and Gibson, T. J. (1996) Using CLUSTAL for |
|---|
| 266 | multiple sequence alignments. Methods Enzymol., 266, 383-402. |
|---|
| 267 | |
|---|
| 268 | Thompson, J.D., Higgins, D.G. and Gibson, T.J. (1994) CLUSTAL W: improving the |
|---|
| 269 | sensitivity of progressive multiple sequence alignment through sequence |
|---|
| 270 | weighting, position-specific gap penalties and weight matrix choice. Nucleic |
|---|
| 271 | Acids Research, 22:4673-4680. |
|---|
| 272 | |
|---|
| 273 | Higgins,D.G., Bleasby,A.J. and Fuchs,R. (1992) CLUSTAL V: improved software for |
|---|
| 274 | multiple sequence alignment. CABIOS 8,189-191. |
|---|
| 275 | |
|---|
| 276 | Higgins,D.G. and Sharp,P.M. (1989) Fast and sensitive multiple sequence |
|---|
| 277 | alignments on a microcomputer. CABIOS 5,151-153. |
|---|
| 278 | |
|---|
| 279 | Higgins,D.G. and Sharp,P.M. (1988) CLUSTAL: a package for performing multiple |
|---|
| 280 | sequence alignment on a microcomputer. Gene 73,237-244. |
|---|