1 | ****************************************************************************** |
---|
2 | |
---|
3 | CLUSTAL W Multiple Sequence Alignment Program |
---|
4 | (version 1.83, Feb 2003) |
---|
5 | |
---|
6 | ****************************************************************************** |
---|
7 | |
---|
8 | |
---|
9 | Please send bug reports, comments etc. to one of:- |
---|
10 | gibson@embl-heidelberg.de |
---|
11 | thompson@igbmc.u-strasbg.fr |
---|
12 | d.higgins@ucc.ie |
---|
13 | |
---|
14 | |
---|
15 | ****************************************************************************** |
---|
16 | |
---|
17 | POLICY ON COMMERCIAL DISTRIBUTION OF CLUSTAL W |
---|
18 | |
---|
19 | Clustal W is freely available to the user community. However, Clustal W is |
---|
20 | increasingly being distributed as part of commercial sequence analysis |
---|
21 | packages. To help us safeguard future maintenance and development, commercial |
---|
22 | distributors of Clustal W must take out a NON-EXCLUSIVE LICENCE. Anyone |
---|
23 | wishing to commercially distribute version 1.81 of Clustal W should contact the |
---|
24 | authors unless they have previously taken out a licence. |
---|
25 | |
---|
26 | ****************************************************************************** |
---|
27 | |
---|
28 | Clustal W is written in ANSI-C and can be run on any machine with an ANSI-C |
---|
29 | compiler. Executables are provided for several major platforms. |
---|
30 | |
---|
31 | Changes since CLUSTAL X Version 1.82 |
---|
32 | ------------------------------------ |
---|
33 | |
---|
34 | 1. The FASTA format has been added to the list of alignment output options. |
---|
35 | |
---|
36 | 2. It is now possible to save the residue ranges (appended after the sequence |
---|
37 | names) when saving a specified range of the alignment. |
---|
38 | |
---|
39 | 3. The efficiency of the neighbour-joining algorithm has been improved. This |
---|
40 | work was done by Tadashi Koike at the Center for Information Biology and DNA Data |
---|
41 | Bank of Japan and FUJITSU Limited. |
---|
42 | |
---|
43 | Some example speedups are given below (timings on a SPARC64 CPU): |
---|
44 | |
---|
45 | No. of sequences    original NJ    new NJ |
---|
46 | 200                 0' 12"         0.1" |
---|
47 | 500                 9' 19"         1.4" |
---|
48 | 1000                XXXX           0' 31" |
---|
49 | |
---|
50 | Changes since version 1.8 |
---|
51 | -------------------------- |
---|
52 | |
---|
53 | 1. ClustalW now returns error codes for some common errors when exiting. This |
---|
54 | may be useful for people who run clustalw automatically from within a script. |
---|
55 | Error codes are: |
---|
56 | 1 bad command line option |
---|
57 | 2 cannot open sequence file |
---|
58 | 3 wrong format in sequence file |
---|
59 | 4 sequence file contains only 1 sequence (for multiple alignments) |
---|
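A script that drives clustalw can turn these exit codes into readable messages. The following is a minimal Python sketch, not part of the Clustal W distribution. The function names, the executable name "clustalw" and the treatment of exit code 0 as success are assumptions made for illustration.

```python
import subprocess

# Exit codes and messages copied from the list above.
CLUSTALW_ERRORS = {
    1: "bad command line option",
    2: "cannot open sequence file",
    3: "wrong format in sequence file",
    4: "sequence file contains only 1 sequence (for multiple alignments)",
}


def describe_exit_code(code):
    """Return a readable message for a clustalw exit status."""
    # Assumption: 0 means success, as is conventional for command line tools.
    if code == 0:
        return "success"
    # Codes outside the documented list are reported without interpretation.
    return CLUSTALW_ERRORS.get(code, "undocumented exit code %d" % code)


def run_clustalw(args, executable="clustalw"):
    """Run clustalw with the given arguments and return (code, message).

    'executable' is an assumption: adjust it to the name or path used
    on your system.
    """
    completed = subprocess.run([executable] + list(args))
    return completed.returncode, describe_exit_code(completed.returncode)
```

For example, run_clustalw(["input.aln", "-gapopen=8.0"]) would return the exit status together with its description.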
60 | |
---|
61 | 2. Alignments can now be saved in Nexus format, for compatibility with PAUP, |
---|
62 | MacClade etc. For a description of the Nexus format, see: |
---|
63 | Maddison, D. R., D. L. Swofford and W. P. Maddison. 1997. |
---|
64 | NEXUS: an extensible file format for systematic information. |
---|
65 | Systematic Biology 46:590-621. |
---|
66 | |
---|
67 | 3. Phylogenetic trees can also be saved in Nexus format. |
---|
68 | |
---|
69 | 4. A ClustalW icon has been designed for MAC and PC systems. |
---|
70 | |
---|
71 | |
---|
72 | Changes since version 1.74 |
---|
73 | -------------------------- |
---|
74 | |
---|
75 | 1. Some work has been done to automatically select the optimal parameters |
---|
76 | depending on the set of sequences to be aligned. The Gonnet series of residue |
---|
77 | comparison matrices are now used by default. The Blosum series remains as an |
---|
78 | option. The default gap extension penalty for proteins has been changed to 0.2 |
---|
79 | (was 0.05). The 'delay divergent sequences' option has been changed to 30% |
---|
80 | residue identity (was 40%). |
---|
81 | |
---|
82 | 2. The default parameters used when the 'Negative matrix' option is selected |
---|
83 | have been optimised. This option may help when the sequences to be aligned are |
---|
84 | not superposable over their whole lengths (e.g. in the presence of N/C terminal |
---|
85 | extensions). |
---|
86 | |
---|
87 | 3. A bug in the calculation of phylogenetic trees for 2 sequences has been |
---|
88 | fixed. |
---|
89 | |
---|
90 | 4. A command line option has been added to turn off the sequence weighting |
---|
91 | calculation. |
---|
92 | |
---|
93 | 5. The phylogenetic tree calculation now ignores any ambiguity codes in the |
---|
94 | sequences. |
---|
95 | |
---|
96 | 6. A bug in the memory access during the calculation of profiles has been |
---|
97 | fixed. (Thanks to Haruna Cofer at SGI). |
---|
98 | |
---|
99 | 7. A bug has been fixed in the 'transition weight' option for nucleic acid |
---|
100 | sequences. (Thanks to Chanan Rubin at Compugen). |
---|
101 | |
---|
102 | 8. An option has been added to read in a series of comparison matrices from a |
---|
103 | file. This option is only applicable for protein sequences. For details of the |
---|
104 | file format, see the on-line documentation. |
---|
105 | |
---|
106 | 9. The MSF output file format has been changed. The sequence weights |
---|
107 | calculated by Clustal W are now included in the header. |
---|
108 | |
---|
109 | 10. Two bugs in the FAST/APPROXIMATE pairwise alignments have been fixed. One |
---|
110 | involved the alignment of new sequences to an existing profile using the fast |
---|
111 | pairwise alignment option; the second was caused by changing the default |
---|
112 | options for the fast pairwise alignments. |
---|
113 | |
---|
114 | 11. A bug in the alignment of a small number of sequences has been fixed. |
---|
115 | Previously a Guide Tree was not calculated for fewer than 4 sequences. |
---|
116 | |
---|
117 | |
---|
118 | Changes since version 1.6 |
---|
119 | ------------------------- |
---|
120 | |
---|
121 | 1. The static arrays used by clustalw for storing the alignment data have been |
---|
122 | replaced by dynamically allocated memory. There is now no limit on the number |
---|
123 | or length of sequences which can be input. |
---|
124 | |
---|
125 | 2. The alignment of DNA sequences now offers a new hard-coded matrix, as well |
---|
126 | as the identity matrix used previously. The new matrix is the default scoring |
---|
127 | matrix used by the BESTFIT program of the GCG package for the comparison of |
---|
128 | nucleic acid sequences. X's and N's are treated as matches to any IUB ambiguity |
---|
129 | symbol. All matches score 1.9; all mismatches for IUB symbols score 0.0. |
---|
130 | |
---|
131 | 3. The transition weight option for aligning nucleotide sequences has been |
---|
132 | changed from an on/off toggle to a weight between 0 and 1. A weight of zero |
---|
133 | means that the transitions are scored as mismatches; a weight of 1 gives |
---|
134 | transitions the full match score. For distantly related DNA sequences, the |
---|
135 | weight should be near to zero; for closely related sequences it can be useful |
---|
136 | to assign a higher score. |
---|
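The effect of the transition weight on a single pair of bases can be sketched as follows. This is an illustration only, not the Clustal W source. It assumes that intermediate weights scale the score linearly between the mismatch and match scores, and it reuses the values 1.9 and 0.0 quoted above for the new DNA matrix. Only the four bases A, C, G and T are handled; the function name is hypothetical.

```python
MATCH_SCORE = 1.9      # match score quoted above for the new DNA matrix
MISMATCH_SCORE = 0.0   # mismatch score quoted above

# Transitions are purine<->purine (A/G) or pyrimidine<->pyrimidine (C/T) changes.
TRANSITIONS = {frozenset("AG"), frozenset("CT")}


def pair_score(a, b, transition_weight):
    """Score one pair of bases under a transition weight between 0 and 1."""
    if not 0.0 <= transition_weight <= 1.0:
        raise ValueError("transition weight must lie between 0 and 1")
    a, b = a.upper(), b.upper()
    if a == b:
        return MATCH_SCORE
    if frozenset((a, b)) in TRANSITIONS:
        # Assumption: linear interpolation between mismatch and match score.
        span = MATCH_SCORE - MISMATCH_SCORE
        return MISMATCH_SCORE + transition_weight * span
    # Transversions always score as mismatches.
    return MISMATCH_SCORE
```

With a weight of 0 an A/G pair scores as a mismatch, and with a weight of 1 it receives the full match score, matching the two end points described above.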
137 | |
---|
138 | 4. The RSF sequence alignment file format used by GCG Version 9 can now be |
---|
139 | read. |
---|
140 | |
---|
141 | 5. The clustal sequence alignment file format has been changed to allow |
---|
142 | sequence names longer than 10 characters. The maximum length allowed is set in |
---|
143 | clustalw.h by the statement: |
---|
144 | #define MAXNAMES 10 |
---|
145 | |
---|
146 | For the fasta format, the name is taken as the first string after the '>' |
---|
147 | character, stopping at the first white space. (Previously, the first 10 |
---|
148 | characters were taken, replacing blanks by underscores). |
---|
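The two naming rules for fasta input can be compared with a short Python sketch. It is an illustration of the rules as described above, not the Clustal W parser, and both function names are hypothetical.

```python
def fasta_name(header_line):
    """New rule: the first string after '>', stopping at the first white space."""
    if not header_line.startswith(">"):
        raise ValueError("not a fasta header line")
    # Assumption: white space directly after '>' is skipped.
    fields = header_line[1:].split()
    return fields[0] if fields else ""


def fasta_name_old(header_line, width=10):
    """Previous rule: the first 10 characters, with blanks replaced by underscores."""
    if not header_line.startswith(">"):
        raise ValueError("not a fasta header line")
    text = header_line[1:].rstrip("\r\n")
    return text[:width].replace(" ", "_")
```

Under the new rule a header such as ">sp|P12345|HBA_HUMAN Hemoglobin alpha" keeps its full identifier, whereas the old rule would have cut it after 10 characters.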
149 | |
---|
150 | 6. The bootstrap values written in the phylip tree file format can be assigned |
---|
151 | either to branches or nodes. The default is to write the values on the nodes, |
---|
152 | as this can be read by several commonly-used tree display programs. Note, |
---|
153 | however, that this can lead to confusion if the tree is rooted; the bootstraps |
---|
154 | may be better attached to the internal branches. Software developers should |
---|
155 | ensure that their programs can read the branch label format. |
---|
156 | |
---|
157 | 7. The sequence weighting used during sequence to profile alignments has been |
---|
158 | changed. The tree weight is now multiplied by the percent identity of the |
---|
159 | new sequence compared with the most closely related sequence in the profile. |
---|
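The adjusted weight for a sequence to profile alignment can be sketched in a few lines of Python. This is a hedged illustration rather than the Clustal W code: the function name is hypothetical, the percent identities are taken as precomputed inputs, and the percent identity is assumed to be applied as a fraction (divided by 100).

```python
def sequence_to_profile_weight(tree_weight, percent_identities):
    """Scale a tree weight by the identity to the closest profile sequence.

    percent_identities holds the percent identity (0-100) between the new
    sequence and each sequence already in the profile.
    """
    if not percent_identities:
        raise ValueError("the profile must contain at least one sequence")
    # The most closely related sequence is the one with the highest identity.
    closest = max(percent_identities)
    # Assumption: the percentage is applied as a fraction of 1.
    return tree_weight * closest / 100.0
```

A new sequence with tree weight 2.0 whose closest relative in the profile is 85% identical would receive a weight of about 1.7 under these assumptions.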
160 | |
---|
161 | 8. The sequence weighting used during profile to profile alignments has been |
---|
162 | changed. A guide tree is now built for each profile separately and the |
---|
163 | sequence weights calculated from the two trees. The weights for each |
---|
164 | sequence are then multiplied by the percent identity of the sequence compared |
---|
165 | with the most closely related sequence in the opposite profile. |
---|
166 | |
---|
167 | 9. The adjustment of the Gap Opening and Gap Extension Penalties for sequences |
---|
168 | of unequal length has been improved. |
---|
169 | |
---|
170 | 10. The default order of the sequences in the output alignment file has been |
---|
171 | changed. Previously the default was to output the sequences in the same order |
---|
172 | as the input file. Now the default is to use the order in which the sequences |
---|
173 | were aligned (from the guide tree/dendrogram), thus automatically grouping |
---|
174 | closely related sequences. |
---|
175 | |
---|
176 | 11. The option to 'Reset Gaps between alignments' has been switched off by |
---|
177 | default. |
---|
178 | |
---|
179 | 12. The conservation line output in the clustal format alignment file has been |
---|
180 | changed. Three characters are now used: |
---|
181 | '*' indicates positions which have a single, fully conserved residue |
---|
182 | ':' indicates that one of the following 'strong' groups is fully conserved:- |
---|
183 | STA |
---|
184 | NEQK |
---|
185 | NHQK |
---|
186 | NDEQ |
---|
187 | QHRK |
---|
188 | MILV |
---|
189 | MILF |
---|
190 | HY |
---|
191 | FYW |
---|
192 | |
---|
193 | '.' indicates that one of the following 'weaker' groups is fully conserved:- |
---|
194 | CSA |
---|
195 | ATV |
---|
196 | SAG |
---|
197 | STNK |
---|
198 | STPA |
---|
199 | SGND |
---|
200 | SNDEQK |
---|
201 | NDEQHK |
---|
202 | NEQHRK |
---|
203 | FVLIM |
---|
204 | HFY |
---|
205 | |
---|
206 | These are all the positively scoring groups that occur in the Gonnet Pam250 |
---|
207 | matrix. Strong groups are defined as those scoring >0.5 and weak groups as |
---|
208 | those scoring <=0.5. |
---|
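Software that needs to reproduce or check the conservation line can derive each symbol from the groups listed above. The following Python sketch is an illustration, not the Clustal W implementation. The function names are hypothetical, and the handling of columns that contain a gap (a blank is written) is an assumption.

```python
STRONG_GROUPS = ["STA", "NEQK", "NHQK", "NDEQ", "QHRK",
                 "MILV", "MILF", "HY", "FYW"]
WEAK_GROUPS = ["CSA", "ATV", "SAG", "STNK", "STPA", "SGND",
               "SNDEQK", "NDEQHK", "NEQHRK", "FVLIM", "HFY"]


def conservation_symbol(column):
    """Return '*', ':', '.' or ' ' for one alignment column."""
    residues = set(column.upper())
    # Assumption: empty columns and columns with a gap get no symbol.
    if not residues or "-" in residues:
        return " "
    if len(residues) == 1:
        return "*"
    # A group is fully conserved when every residue in the column belongs to it.
    if any(residues <= set(group) for group in STRONG_GROUPS):
        return ":"
    if any(residues <= set(group) for group in WEAK_GROUPS):
        return "."
    return " "


def conservation_line(aligned_sequences):
    """Build the conservation line for equal-length aligned sequences."""
    columns = zip(*aligned_sequences)
    return "".join(conservation_symbol("".join(col)) for col in columns)
```

For the three aligned sequences MKSA, MRTC and MKAS this gives "*::.": the first column is identical, the second falls in QHRK, the third in STA, and the fourth only in the weaker group CSA.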
209 | |
---|
210 | 13. A bug in the modification of the Myers and Miller alignment algorithm |
---|
211 | for residue-specific gap penalties has been fixed. This occasionally caused |
---|
212 | new gaps to be opened a few residues away from the optimal position. |
---|
213 | |
---|
214 | 14. The GCG/MSF input format no longer needs the word PILEUP on the first |
---|
215 | line. Several versions can now be recognised:- |
---|
216 | 1. The word PILEUP as the first word in the file |
---|
217 | 2. The word !!AA_MULTIPLE_ALIGNMENT or !!NA_MULTIPLE_ALIGNMENT |
---|
218 | as the first word in the file |
---|
219 | 3. The characters MSF on the first line of the file, and the |
---|
220 | characters .. at the end of that line. |
---|
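The three recognition rules can be expressed as a small check. This Python sketch only illustrates the rules listed above and is not the Clustal W reader. The function name is hypothetical, and exact upper-case matching of the keywords is an assumption.

```python
def looks_like_msf(text):
    """Return True if the text matches one of the three MSF header rules."""
    words = text.split()
    if not words:
        return False
    first_word = words[0]
    # Rule 1: the word PILEUP is the first word in the file.
    if first_word == "PILEUP":
        return True
    # Rule 2: a GCG multiple alignment marker is the first word in the file.
    if first_word in ("!!AA_MULTIPLE_ALIGNMENT", "!!NA_MULTIPLE_ALIGNMENT"):
        return True
    # Rule 3: MSF appears on the first line and that line ends with '..'.
    first_line = text.splitlines()[0].rstrip()
    return "MSF" in first_line and first_line.endswith("..")
```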
221 | |
---|
222 | 15. The standard command line separator for UNIX systems has been changed from |
---|
223 | '/' to '-', i.e. to give options on the command line, you now type |
---|
224 | |
---|
225 | clustalw input.aln -gapopen=8.0 |
---|
226 | |
---|
227 | instead of clustalw input.aln /gapopen=8.0 |
---|
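Scripts written for the old separator can be updated mechanically. The helper below is a hypothetical Python sketch, not part of Clustal W. It assumes that the first argument is the sequence file and that every later argument is an option, so that a file path beginning with '/' is not rewritten by mistake.

```python
def convert_options(args):
    """Rewrite old-style '/option' arguments to the new '-option' form.

    args is [sequence_file, option, option, ...] (an assumed layout).
    """
    if not args:
        return []
    converted = [args[0]]  # the sequence file name is left untouched
    for option in args[1:]:
        if option.startswith("/"):
            option = "-" + option[1:]
        converted.append(option)
    return converted
```

Applied to ["input.aln", "/gapopen=8.0"], the helper returns ["input.aln", "-gapopen=8.0"], which corresponds to the new command line shown above.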
228 | |
---|
229 | |
---|
230 | ATTENTION SOFTWARE DEVELOPERS!! |
---|
231 | ------------------------------- |
---|
232 | |
---|
233 | The CLUSTAL sequence alignment output format was modified from version 1.7: |
---|
234 | |
---|
235 | 1. Names longer than 10 chars are now allowed. (The maximum is specified in |
---|
236 | clustalw.h by '#define MAXNAMES'.) |
---|
237 | |
---|
238 | 2. The consensus line now consists of three characters: '*',':' and '.'. (Only |
---|
239 | the '*' and '.' were previously used.) |
---|
240 | |
---|
241 | 3. An option (not the default) has been added, allowing the user to print out |
---|
242 | sequence numbers at the end of each line of the alignment output. |
---|
243 | |
---|
244 | 4. Both RNA bases (U) and base ambiguities are now supported in nucleic acid |
---|
245 | sequences. In the past, all characters (upper or lower case) other than |
---|
246 | a,c,g,t or u were converted to N. Now the following characters are recognised |
---|
247 | and retained in the alignment output: ABCDGHKMNRSTUVWXY (upper or lower case). |
---|
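Programs that read Clustal W nucleic acid output may want to check for the wider character set. The Python sketch below contrasts the old conversion with the characters now recognised. It is an illustration only: the function names are hypothetical, and leaving non-letter characters untouched in the old conversion is an assumption.

```python
# Characters now recognised and retained, as listed above (either case).
RECOGNISED = set("ABCDGHKMNRSTUVWXY")


def is_recognised(ch):
    """Return True if the character is retained in the alignment output."""
    return ch.upper() in RECOGNISED


def old_conversion(sequence):
    """Past behaviour: letters other than a, c, g, t or u become N."""
    converted = []
    for ch in sequence:
        # Assumption: only letters are converted; other characters pass through.
        if ch.isalpha() and ch.lower() not in "acgtu":
            converted.append("N")
        else:
            converted.append(ch)
    return "".join(converted)
```

For example, the ambiguity codes in "ACGTRY" would previously have been written as "ACGTNN", whereas R and Y are now kept.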
248 | |
---|
249 | 5. A blank line inadvertently added in the version 1.6 header has been taken |
---|
250 | out again. |
---|
251 | |
---|
252 | CLUSTAL REFERENCES |
---|
253 | ------------------ |
---|
254 | |
---|
255 | Details of algorithms, implementation and useful tips on usage of Clustal |
---|
256 | programs can be found in the following publications: |
---|
257 | |
---|
258 | Jeanmougin,F., Thompson,J.D., Gouy,M., Higgins,D.G. and Gibson,T.J. (1998) |
---|
259 | Multiple sequence alignment with Clustal X. Trends Biochem Sci, 23, 403-5. |
---|
260 | |
---|
261 | Thompson,J.D., Gibson,T.J., Plewniak,F., Jeanmougin,F. and Higgins,D.G. (1997) |
---|
262 | The ClustalX windows interface: flexible strategies for multiple sequence |
---|
263 | alignment aided by quality analysis tools. Nucleic Acids Research, 25:4876-4882. |
---|
264 | |
---|
265 | Higgins, D. G., Thompson, J. D. and Gibson, T. J. (1996) Using CLUSTAL for |
---|
266 | multiple sequence alignments. Methods Enzymol., 266, 383-402. |
---|
267 | |
---|
268 | Thompson, J.D., Higgins, D.G. and Gibson, T.J. (1994) CLUSTAL W: improving the |
---|
269 | sensitivity of progressive multiple sequence alignment through sequence |
---|
270 | weighting, position-specific gap penalties and weight matrix choice. Nucleic |
---|
271 | Acids Research, 22:4673-4680. |
---|
272 | |
---|
273 | Higgins,D.G., Bleasby,A.J. and Fuchs,R. (1992) CLUSTAL V: improved software for |
---|
274 | multiple sequence alignment. CABIOS 8,189-191. |
---|
275 | |
---|
276 | Higgins,D.G. and Sharp,P.M. (1989) Fast and sensitive multiple sequence |
---|
277 | alignments on a microcomputer. CABIOS 5,151-153. |
---|
278 | |
---|
279 | Higgins,D.G. and Sharp,P.M. (1988) CLUSTAL: a package for performing multiple |
---|
280 | sequence alignment on a microcomputer. Gene 73,237-244. |
---|