******************************************************************************

               CLUSTAL W Multiple Sequence Alignment Program
                        (version 1.83, Feb 2003)

******************************************************************************


Please send bug reports, comments etc. to one of:-
        gibson@embl-heidelberg.de
        thompson@igbmc.u-strasbg.fr
        d.higgins@ucc.ie


******************************************************************************

                  POLICY ON COMMERCIAL DISTRIBUTION OF CLUSTAL W

Clustal W is freely available to the user community. However, Clustal W is
increasingly being distributed as part of commercial sequence analysis
packages. To help us safeguard future maintenance and development, commercial
distributors of Clustal W must take out a NON-EXCLUSIVE LICENCE. Anyone
wishing to commercially distribute version 1.81 of Clustal W should contact the
authors unless they have previously taken out a licence.

******************************************************************************

Clustal W is written in ANSI-C and can be run on any machine with an ANSI-C
compiler. Executables are provided for several major platforms.

Changes since CLUSTAL X Version 1.82
------------------------------------

1. The FASTA format has been added to the list of alignment output options.

2. It is now possible to save the residue ranges (appended after the sequence
names) when saving a specified range of the alignment.

3. The efficiency of the neighbour-joining algorithm has been improved. This
work was done by Tadashi Koike at the Center for Information Biology and DNA Data
Bank of Japan and FUJITSU Limited.

Some example speedups are given below (timings on a SPARC64 CPU):

No. of sequences        original NJ     new NJ
     200                0' 12"          0.1"
     500                9' 19"          1.4"
     1000               XXXX            0' 31"
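
As a rough illustration of the timings above, the short Python sketch below
converts the two complete rows to seconds and computes the speedup factor.
It assumes the table uses the usual notation (' for minutes, " for seconds);
the names TIMINGS, to_seconds and speedup are made up for this example. The
1000-sequence row is left out because no original timing is given.

```python
def to_seconds(minutes: int, seconds: float) -> float:
    """Convert a minutes/seconds timing to seconds."""
    return 60 * minutes + seconds


# (original NJ, new NJ) in seconds, keyed by number of sequences.
# Values are read from the table above; 0.1" and 1.4" are taken as seconds.
TIMINGS = {
    200: (to_seconds(0, 12), 0.1),
    500: (to_seconds(9, 19), 1.4),
}


def speedup(original_s: float, new_s: float) -> float:
    """Return how many times faster the new timing is than the original."""
    return original_s / new_s


if __name__ == "__main__":
    for n_seqs, (old, new) in TIMINGS.items():
        print(f"{n_seqs} sequences: about {speedup(old, new):.0f}x faster")
```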

Changes since version 1.8
-------------------------

1. ClustalW now returns error codes for some common errors when exiting. This
may be useful for people who run clustalw automatically from within a script.
Error codes are:
        1       bad command line option
        2       cannot open sequence file
        3       wrong format in sequence file
        4       sequence file contains only 1 sequence (for multiple alignments)

2. Alignments can now be saved in Nexus format, for compatibility with PAUP,
MacClade etc. For a description of the Nexus format, see:
Maddison, D. R., D. L. Swofford and W. P. Maddison.  1997.
NEXUS: an extensible file format for systematic information.
Systematic Biology 46:590-621.

3. Phylogenetic trees can also be saved in Nexus format.

4. A ClustalW icon has been designed for MAC and PC systems.
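
For people who call clustalw from a script, the error codes listed in item 1
above could be handled along the following lines. This is only a sketch in
Python: it assumes the executable is called "clustalw" and is on the PATH,
and it assumes that an exit status of 0 means success (the list above does
not say so explicitly). The function names are made up for this example.

```python
import subprocess

# Exit codes as documented in item 1 above.
EXIT_CODE_MEANINGS = {
    1: "bad command line option",
    2: "cannot open sequence file",
    3: "wrong format in sequence file",
    4: "sequence file contains only 1 sequence (for multiple alignments)",
}


def describe_exit_code(code: int) -> str:
    """Translate a clustalw exit status into a readable message."""
    if code == 0:
        # Assumption: 0 follows the usual convention for success.
        return "success"
    return EXIT_CODE_MEANINGS.get(code, f"unrecognised exit code {code}")


def run_clustalw(infile: str, executable: str = "clustalw", options=()):
    """Run clustalw on infile and return (exit status, description).

    'executable' is an assumed name; adjust it to the local installation.
    """
    result = subprocess.run([executable, infile, *options], check=False)
    return result.returncode, describe_exit_code(result.returncode)
```

A call such as run_clustalw("input.aln", options=["-gapopen=8.0"]) would then
return the status together with its description, so that the calling script
can decide whether to stop or continue.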


Changes since version 1.74
--------------------------

1. Some work has been done to automatically select the optimal parameters
depending on the set of sequences to be aligned. The Gonnet series of residue
comparison matrices are now used by default. The Blosum series remains as an
option. The default gap extension penalty for proteins has been changed to 0.2
(was 0.05). The 'delay divergent sequences' option has been changed to 30%
residue identity (was 40%).

2. The default parameters used when the 'Negative matrix' option is selected
have been optimised. This option may help when the sequences to be aligned are
not superposable over their whole lengths (e.g. in the presence of N/C terminal
extensions).

3. A bug in the calculation of phylogenetic trees for 2 sequences has been
fixed.

4. A command line option has been added to turn off the sequence weighting
calculation.

5. The phylogenetic tree calculation now ignores any ambiguity codes in the
sequences.

6. A bug in the memory access during the calculation of profiles has been
fixed. (Thanks to Haruna Cofer at SGI).

7. A bug has been fixed in the 'transition weight' option for nucleic acid
sequences. (Thanks to Chanan Rubin at Compugen).

8. An option has been added to read in a series of comparison matrices from a
file. This option is only applicable for protein sequences. For details of the
file format, see the on-line documentation.

9. The MSF output file format has been changed. The sequence weights
calculated by Clustal W are now included in the header.

10. Two bugs in the FAST/APPROXIMATE pairwise alignments have been fixed. One
involved the alignment of new sequences to an existing profile using the fast
pairwise alignment option; the second was caused by changing the default
options for the fast pairwise alignments.

11. A bug in the alignment of a small number of sequences has been fixed.
Previously a Guide Tree was not calculated for fewer than 4 sequences.


Changes since version 1.6
-------------------------

1. The static arrays used by clustalw for storing the alignment data have been
replaced by dynamically allocated memory. There is now no limit on the number
or length of sequences which can be input.

2. The alignment of DNA sequences now offers a new hard-coded matrix, as well
as the identity matrix used previously. The new matrix is the default scoring
matrix used by the BESTFIT program of the GCG package for the comparison of
nucleic acid sequences. X's and N's are treated as matches to any IUB ambiguity
symbol. All matches score 1.9; all mismatches for IUB symbols score 0.0.

3. The transition weight option for aligning nucleotide sequences has been
changed from an on/off toggle to a weight between 0 and 1. A weight of zero
means that the transitions are scored as mismatches; a weight of 1 gives
transitions the full match score. For distantly related DNA sequences, the
weight should be near to zero; for closely related sequences it can be useful
to assign a higher score.

4. The RSF sequence alignment file format used by GCG Version 9 can now be
read.

5. The clustal sequence alignment file format has been changed to allow
sequence names longer than 10 characters. The maximum length allowed is set in
clustalw.h by the statement:
#define MAXNAMES        10

For the fasta format, the name is taken as the first string after the '>'
character, stopping at the first white space. (Previously, the first 10
characters were taken, replacing blanks by underscores).

6. The bootstrap values written in the phylip tree file format can be assigned
either to branches or nodes. The default is to write the values on the nodes,
as this can be read by several commonly-used tree display programs. But note
that this can lead to confusion if the tree is rooted and the bootstraps may
be better attached to the internal branches. Software developers should ensure
they can read the branch label format.

7. The sequence weighting used during sequence to profile alignments has been
changed. The tree weight is now multiplied by the percent identity of the
new sequence compared with the most closely related sequence in the profile.

8. The sequence weighting used during profile to profile alignments has been
changed. A guide tree is now built for each profile separately and the
sequence weights calculated from the two trees. The weights for each
sequence are then multiplied by the percent identity of the sequence compared
with the most closely related sequence in the opposite profile.

9. The adjustment of the Gap Opening and Gap Extension Penalties for sequences
of unequal length has been improved.

10. The default order of the sequences in the output alignment file has been
changed. Previously the default was to output the sequences in the same order
as the input file. Now the default is to use the order in which the sequences
were aligned (from the guide tree/dendrogram), thus automatically grouping
closely related sequences.

11. The option to 'Reset Gaps between alignments' has been switched off by
default.

12. The conservation line output in the clustal format alignment file has been
changed. Three characters are now used:
'*' indicates positions which have a single, fully conserved residue
':' indicates that one of the following 'strong' groups is fully conserved:-
                 STA
                 NEQK
                 NHQK
                 NDEQ
                 QHRK
                 MILV
                 MILF
                 HY
                 FYW

'.' indicates that one of the following 'weaker' groups is fully conserved:-
                 CSA
                 ATV
                 SAG
                 STNK
                 STPA
                 SGND
                 SNDEQK
                 NDEQHK
                 NEQHRK
                 FVLIM
                 HFY

These are all the positively scoring groups that occur in the Gonnet Pam250
matrix. The strong and weak groups are defined as strong score >0.5 and weak
score <=0.5 respectively.

13. A bug in the modification of the Myers and Miller alignment algorithm
for residue-specific gap penalties has been fixed. This occasionally caused
new gaps to be opened a few residues away from the optimal position.

14. The GCG/MSF input format no longer needs the word PILEUP on the first
line. Several versions can now be recognised:-
      1.  The word PILEUP as the first word in the file
      2.  The word !!AA_MULTIPLE_ALIGNMENT or !!NA_MULTIPLE_ALIGNMENT
          as the first word in the file
      3.  The characters MSF on the first line of the file, and the
          characters .. at the end of the line.

15. The standard command line separator for UNIX systems has been changed from
'/' to '-'. i.e. to give options on the command line, you now type

     clustalw input.aln -gapopen=8.0

instead of  clustalw input.aln /gapopen=8.0
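
Developers who read or write the same files may find it useful to see the
input rules from items 5 and 14 above written out. The Python sketch below is
one possible reading of those rules, not the code used inside clustalw. The
function names are made up for this example, and using the first line of the
text (rather than, say, the first non-blank line) for the MSF test is an
assumption.

```python
MSF_FIRST_WORDS = (
    "PILEUP",
    "!!AA_MULTIPLE_ALIGNMENT",
    "!!NA_MULTIPLE_ALIGNMENT",
)


def fasta_name(header: str) -> str:
    """Current rule: first string after '>', stopping at the first white space."""
    fields = header.rstrip("\r\n")[1:].split()
    return fields[0] if fields else ""


def legacy_fasta_name(header: str) -> str:
    """Previous rule: first 10 characters after '>', blanks replaced by '_'."""
    return header.rstrip("\r\n")[1:11].replace(" ", "_")


def looks_like_msf(text: str) -> bool:
    """Apply the three GCG/MSF recognition rules from item 14."""
    words = text.split()
    # Rules 1 and 2: a known marker is the first word in the file.
    if words and words[0] in MSF_FIRST_WORDS:
        return True
    # Rule 3: 'MSF' on the first line, and '..' at the end of that line.
    lines = text.splitlines()
    first_line = lines[0].rstrip() if lines else ""
    return "MSF" in first_line and first_line.endswith("..")
```

For example, fasta_name(">sp|P12345 some protein") gives "sp|P12345", where
the previous rule would have given "sp|P12345_".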


                      ATTENTION SOFTWARE DEVELOPERS!!
                      -------------------------------

The CLUSTAL sequence alignment output format was modified from version 1.7:

1. Names longer than 10 chars are now allowed. (The maximum is specified in
clustalw.h by '#define MAXNAMES'.)

2. The consensus line now consists of three characters: '*',':' and '.'. (Only
the '*' and '.' were previously used.)

3. An option (not the default) has been added, allowing the user to print out
sequence numbers at the end of each line of the alignment output.

4. Both RNA bases (U) and base ambiguities are now supported in nucleic acid
sequences. In the past, all characters (upper or lower case) other than
a,c,g,t or u were converted to N. Now the following characters are recognised
and retained in the alignment output: ABCDGHKMNRSTUVWXY (upper or lower case).

5. A blank line inadvertently added in the version 1.6 header has been taken
out again.
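
For software that needs to reproduce or check the three-character consensus
line (item 2 above), the rule can be sketched in Python as follows, using the
'strong' and 'weaker' groups listed earlier in this file. This is an
illustration only and not the code used inside clustalw. In particular,
writing a blank for any column that contains a gap ('-') is an assumption,
and the function names are made up for this example.

```python
STRONG_GROUPS = ["STA", "NEQK", "NHQK", "NDEQ", "QHRK",
                 "MILV", "MILF", "HY", "FYW"]
WEAK_GROUPS = ["CSA", "ATV", "SAG", "STNK", "STPA", "SGND",
               "SNDEQK", "NDEQHK", "NEQHRK", "FVLIM", "HFY"]


def column_symbol(column: str) -> str:
    """Return '*', ':', '.' or ' ' for one alignment column."""
    residues = set(column.upper())
    if "-" in residues:
        return " "  # assumption: columns with gaps get no symbol
    if len(residues) == 1:
        return "*"  # a single, fully conserved residue
    if any(residues <= set(group) for group in STRONG_GROUPS):
        return ":"  # all residues fall within one 'strong' group
    if any(residues <= set(group) for group in WEAK_GROUPS):
        return "."  # all residues fall within one 'weaker' group
    return " "


def conservation_line(sequences) -> str:
    """Build the consensus line for equal-length aligned sequences."""
    return "".join(column_symbol("".join(col)) for col in zip(*sequences))
```

With the three aligned sequences MKSCW, MRTSG and MKAAW, the columns are
M/M/M (identical), K/R/K (within QHRK), S/T/A (within STA), C/S/A (within
CSA) and W/G/W (no group), so conservation_line returns "*::. ".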

                              CLUSTAL REFERENCES
                              ------------------

Details of algorithms, implementation and useful tips on usage of Clustal
programs can be found in the following publications:

Jeanmougin,F., Thompson,J.D., Gouy,M., Higgins,D.G. and Gibson,T.J. (1998)
Multiple sequence alignment with Clustal X. Trends Biochem Sci, 23, 403-5.

Thompson,J.D., Gibson,T.J., Plewniak,F., Jeanmougin,F. and Higgins,D.G. (1997)
The ClustalX windows interface: flexible strategies for multiple sequence
alignment aided by quality analysis tools. Nucleic Acids Research, 25:4876-4882.

Higgins, D. G., Thompson, J. D. and Gibson, T. J. (1996) Using CLUSTAL for
multiple sequence alignments. Methods Enzymol., 266, 383-402.

Thompson, J.D., Higgins, D.G. and Gibson, T.J. (1994) CLUSTAL W: improving the
sensitivity of progressive multiple sequence alignment through sequence
weighting, position-specific gap penalties and weight matrix choice.  Nucleic
Acids Research, 22:4673-4680.

Higgins,D.G., Bleasby,A.J. and Fuchs,R. (1992) CLUSTAL V: improved software for
multiple sequence alignment. CABIOS 8,189-191.

Higgins,D.G. and Sharp,P.M. (1989) Fast and sensitive multiple sequence
alignments on a microcomputer. CABIOS 5,151-153.

Higgins,D.G. and Sharp,P.M. (1988) CLUSTAL: a package for performing multiple
sequence alignment on a microcomputer. Gene 73,237-244.