source: tags/initial/GDEHELP/clustal_help

Last change on this file was 2, checked in by oldcode, 24 years ago

ClustalV - Multiple sequence alignment by Des Higgins

ClustalV is an order-independent multiple sequence alignment algorithm.
It first determines which sequences are similar to each other, and aligns
those sequences first.  This helps reduce alignment bias based on the
early alignment of poor sequences.


Be warned that all sequences are returned in upper case, and all
extended information under Get info.. will be lost.  For this reason,
the results are placed in a new GDE window.

The following is the help file that Higgins provides.

---

This is the on-line help file for CLUSTAL V.

It should be named or defined as:   clustalv_hlp

>>HELP<< 1            General help for CLUSTAL V
CLUSTAL V is a general purpose multiple alignment program for DNA or proteins.

SEQUENCE INPUT:  all sequences must be in 1 file, one after another.  3 formats
are automatically recognised: NBRF/PIR, EMBL/SWISSPROT or Pearson (Fasta).
All non-alphabetic characters (spaces, digits, punctuation marks) are ignored
except "-" which is used to indicate a GAP.  Upper or lower case is allowed.
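
The character rule above can be sketched in Python.  This is only an
illustration for Pearson (Fasta) input, not the program's own reader; the
function names clean_sequence and read_pearson are hypothetical, and taking
the first word after ">" as the sequence name is an assumption.

```python
def clean_sequence(raw: str) -> str:
    """Keep letters and '-' (gap); drop spaces, digits and punctuation."""
    return "".join(ch for ch in raw if ch.isalpha() or ch == "-")


def read_pearson(text: str) -> dict[str, str]:
    """Parse Pearson (Fasta) text into {name: cleaned sequence}."""
    sequences: dict[str, str] = {}
    name = None
    for line in text.splitlines():
        if line.startswith(">"):
            # Header line: the first word after '>' is used as the name
            # (an assumption made for this sketch).
            words = line[1:].split()
            name = words[0] if words else ""
            sequences[name] = ""
        elif name is not None:
            # Sequence line: case is left untouched, gaps are kept.
            sequences[name] += clean_sequence(line)
    return sequences
```

Both upper and lower case letters pass through unchanged, matching the
statement that either case is allowed.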


To do a MULTIPLE ALIGNMENT on a set of sequences, use item 1 from this menu to
INPUT them; go to menu item 2 to do the multiple alignment.


PROFILE ALIGNMENTS (menu item 3) are used to align 2 alignments.  Use this to
add a new sequence to an old alignment.  GAPS in the old alignments are
indicated using the "-" character.   PROFILES can be input as PIR format files.


PHYLOGENETIC TREES (menu item 4) can be calculated from old alignments (read in
in PIR format with "-" characters to indicate gaps) OR after a multiple
alignment while the alignment is still in memory.
>>HELP<< 2     Help for multiple alignments

If you have already loaded sequences, use menu item 1 to do the complete
multiple alignment.  You will be prompted for 2 output files: 1 for the
alignment itself; another to store a dendrogram that describes the similarity
of the sequences to each other.

Multiple alignments are carried out in 3 stages (automatically done from menu
item 1 ... multiple alignments NOW):

1) all sequences are compared to each other (pairwise alignments);

2) a dendrogram (like a phylogenetic tree) is constructed, describing the
approximate groupings of the sequences by similarity (stored in a file).

3) the final multiple alignment is carried out, using the dendrogram as a guide.


PAIRWISE ALIGNMENT parameters control the speed/sensitivity of the initial
alignments.

MULTIPLE ALIGNMENT parameters control the gaps in the final multiple alignments.


You can skip the first stages (pairwise alignments; dendrogram) by using an
old dendrogram file (menu item 3); or you can just produce the dendrogram
with no final multiple alignment (menu item 2).


OUTPUT FORMAT: Menu item 6 (format options) allows you to choose between 4
different alignment formats (CLUSTAL, GCG, NBRF/PIR and PHYLIP).

>>HELP<< 3     Help for pairwise alignment parameters

A similarity score is calculated between every pair of sequences and these are
used to construct the dendrogram which guides the final multiple alignment.

These similarity scores are calculated from fast, approximate, global
alignments, which are controlled by 4 parameters.   2 techniques are used to
make these alignments very fast: 1) only exactly matching fragments (k-tuples)
are considered; 2) only the 'best' diagonals (the ones with most k-tuple
matches) are used.


K-TUPLE SIZE:  This is the size of the exactly matching fragment that is used.
INCREASE for speed (max= 2 for proteins; 4 for DNA), DECREASE for sensitivity.
For longer sequences (e.g. > 300 residues) you may need to increase the default.


GAP PENALTY:   This is a penalty for each gap in the fast alignments.  It has
little effect on the speed or sensitivity.


TOP DIAGONALS: The number of k-tuple matches on each diagonal (in an imaginary
dot-matrix plot) is calculated.  Only the best ones (with most matches) are
used in the alignment.  This parameter specifies how many.  Decrease for speed;
increase for sensitivity.
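
The k-tuple and diagonal ideas above can be illustrated with a short Python
sketch.  It only counts exact k-tuple matches per diagonal and picks the best
ones; it is not the program's actual code, and the function names
ktuple_diagonal_counts and top_diagonals are hypothetical.  A diagonal is
identified here by i - j, the offset between the match positions in the two
sequences (a convention chosen for this sketch).

```python
from collections import Counter


def ktuple_diagonal_counts(seq_a: str, seq_b: str, ktup: int) -> Counter:
    """Count exact k-tuple matches on each diagonal (i - j) of a dot plot."""
    # Index every k-tuple of seq_b by its start positions.
    positions: dict[str, list[int]] = {}
    for j in range(len(seq_b) - ktup + 1):
        positions.setdefault(seq_b[j:j + ktup], []).append(j)

    # Each shared k-tuple at (i, j) adds one match to diagonal i - j.
    counts: Counter = Counter()
    for i in range(len(seq_a) - ktup + 1):
        for j in positions.get(seq_a[i:i + ktup], []):
            counts[i - j] += 1
    return counts


def top_diagonals(counts: Counter, topdiags: int) -> list[int]:
    """Return the `topdiags` diagonals with the most k-tuple matches."""
    return [diag for diag, _ in counts.most_common(topdiags)]
```

A larger ktup gives fewer chance matches and less work (faster, less
sensitive); a larger topdiags keeps more diagonals (slower, more sensitive).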


DIAGONAL WINDOW:  This is the number of diagonals around each of the 'best'
diagonals that will be used.  Decrease for speed; increase for sensitivity.


SCORING METHOD = PERCENTAGE or ABSOLUTE:   This controls whether the similarity
scores are calculated as raw alignment scores (number of k-tuple matches minus a
gap penalty for every gap) (ABSOLUTE) or as the alignment score divided by the
length of the shorter sequence (PERCENTAGE).
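
As a worked illustration of the two scoring methods, the following Python
sketch applies the description above literally.  The function name
similarity_score and its argument names are hypothetical, and any further
rescaling the program may apply (for example to a 0-100 range) is not shown
because the help text does not state it.

```python
def similarity_score(ktuple_matches: int, gaps: int, gap_penalty: int,
                     len_a: int, len_b: int, percentage: bool) -> float:
    """Score a fast pairwise alignment as ABSOLUTE or PERCENTAGE."""
    # ABSOLUTE: number of k-tuple matches minus a penalty for every gap.
    absolute = ktuple_matches - gap_penalty * gaps
    if not percentage:
        return float(absolute)
    # PERCENTAGE: the same score divided by the shorter sequence length.
    return absolute / min(len_a, len_b)
```

With 40 k-tuple matches, 2 gaps and a gap penalty of 3, the ABSOLUTE score
is 34; for sequences of length 100 and 80 the PERCENTAGE form divides this
by 80.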



>>HELP<< 4     Help for multiple alignment parameters
These parameters control the final multiple alignment.  There are 2 gap penalty
parameters and 1 for whether transitions (A <--> G or C <--> T) are weighted in
DNA alignments.  The default weight matrix for protein alignments is a PAM250
matrix, converted to distances.

GAP PENALTY (FIXED):    This is a penalty for opening up a gap.   Decrease it
and you will encourage gaps of all sizes.  TERMINAL GAPS are penalised (same as
internal ones).  BEWARE:  if you make this too small (+/- 5 or so), the program
will prefer to align each sequence opposite a long gap.

GAP PENALTY (VARYING):  This penalty is incurred for every item in a gap.  This
penalises long gaps more.  Increase this and gaps will get shorter.   BEWARE:
if you make this too small (+/- 5 or so), the program will prefer to align each
sequence opposite a long gap.
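
Read together, the two penalties suggest that a gap costs the fixed penalty
once plus the varying penalty for every position in it.  The Python sketch
below shows that arithmetic under this reading; the function name gap_cost is
hypothetical, and the defaults of 10 are taken from the statement elsewhere in
this help file that the default gap penalties are 10 each.

```python
def gap_cost(length: int, fixed: int = 10, varying: int = 10) -> int:
    """Cost of one gap: fixed opening penalty plus a per-position penalty."""
    if length <= 0:
        return 0  # no gap, no penalty
    return fixed + varying * length
```

Under this reading one gap of length 6 costs 70, while two gaps of length 3
cost 40 each (80 in total), so lowering the fixed penalty makes many separate
gaps cheaper and raising the varying penalty makes long gaps dearer.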

TRANSITIONS = WEIGHTED or UNWEIGHTED:  With UNWEIGHTED transitions identical
bases in a DNA alignment have a DISTANCE of 0; different ones have a distance
of 10.  If transitions are WEIGHTED then A vs G and C vs T will have a distance
of 5 (less distant than A vs C,T or C vs A,G).
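
The DNA distances just described can be written as a small Python function.
This is a sketch of the stated values (0, 5, 10) only; the name dna_distance
is hypothetical, and converting input to upper case is an assumption made
for convenience.

```python
# The two transition pairs: purine <-> purine and pyrimidine <-> pyrimidine.
TRANSITIONS = {frozenset("AG"), frozenset("CT")}


def dna_distance(a: str, b: str, weighted: bool) -> int:
    """Distance between two bases under WEIGHTED or UNWEIGHTED transitions."""
    a, b = a.upper(), b.upper()
    if a == b:
        return 0  # identical bases
    if weighted and frozenset((a, b)) in TRANSITIONS:
        return 5  # transition, counted as half as distant
    return 10     # transversion, or any mismatch when UNWEIGHTED
```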
>>HELP<< 5     Help for output format options.
Four output formats are offered.  You can choose more than one (or all four if
you wish).  NBRF/PIR format is ESPECIALLY USEFUL.  Alignments that are written
in this format can be used again as input (for calculating phylogenetic trees;
profile alignments; general input).

CLUSTAL format output is a self-explanatory alignment format.  It shows the
sequences aligned in blocks.

GCG output can be used by any of the GCG programs that can work on multiple
alignments (e.g. PRETTY, PROFILEMAKE, PLOTALIGN).  It is the same as the GCG
.msf format files (multiple sequence file); new in version 7 of GCG.

PHYLIP format output can be used for input to the PHYLIP package of Joe
Felsenstein.  This is an extremely widely used package for doing every
imaginable form of phylogenetic analysis (MUCH more than the modest
introduction offered by this program).

NBRF/PIR:  this is the same as the standard PIR format with ONE ADDITION.  Gap
characters "-" are used to indicate the positions of gaps in the multiple
alignment.   These files can be re-used as input in any part of clustal that
allows sequences (or alignments or profiles) to be read in.
>>HELP<< 6     Help for profile alignments

By PROFILE ALIGNMENT, we mean the alignment of two old alignments.  One of the
alignments can be a single sequence.

The profiles should be in PIR format (one of the 4 output formats produced by
this program).   This is the same as standard NBRF/PIR format, with 1 addition:
gap characters are indicated by "-".

The alignment method produces a global, optimal alignment using an amino acid
weight matrix (PAM250 is default) and 2 gap penalty parameters.

Profile alignments allow you to store alignments of your favourite sequences (as
long as they are in PIR format) and add new sequences to them in small bunches
at a time.  One of the 2 profiles can simply be a single sequence.


>>HELP<< 7     Help for phylogenetic trees
Before calculating a tree, you must have an alignment in memory.  This can be
input in NBRF/PIR format, or it can be the result of a full multiple alignment
that is still in memory.

The method used is the NJ (Neighbour Joining) method of Saitou and Nei.  First
you calculate distances (percent divergence) between all pairs of sequences from
a multiple alignment; second you apply the NJ method to the distance matrix.

EXCLUDE POSITIONS WITH GAPS?  If you choose this option, any alignment positions
where ANY of the sequences have a gap will be ignored.  This guarantees that
the distances will be 'metric'.  Also, it means that 'like' will be compared to
'like' in all distances.  The disadvantage is that you may throw away much of
the data if there are many gaps.
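
The distance step and the gap option can be sketched in Python as follows.
This is an illustration, not the program's code: the names gapped_columns and
percent_divergence are hypothetical, and skipping positions where either
sequence of the pair has a gap when the option is off is an assumption of
this sketch.

```python
def gapped_columns(alignment: list[str]) -> set[int]:
    """Columns where ANY sequence in the alignment has a gap."""
    return {col for col in range(len(alignment[0]))
            if any(seq[col] == "-" for seq in alignment)}


def percent_divergence(seq_a: str, seq_b: str,
                       skip_columns: frozenset = frozenset()) -> float:
    """Percentage of compared positions at which two aligned sequences differ."""
    compared = differences = 0
    for col, (a, b) in enumerate(zip(seq_a, seq_b)):
        # Skip excluded columns, and (by assumption) pairwise gap positions.
        if col in skip_columns or a == "-" or b == "-":
            continue
        compared += 1
        if a != b:
            differences += 1
    return 100.0 * differences / compared if compared else 0.0
```

Passing frozenset(gapped_columns(alignment)) as skip_columns gives the
EXCLUDE POSITIONS WITH GAPS behaviour, so every pair is compared over exactly
the same columns.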

CORRECT FOR MULTIPLE SUBSTITUTIONS?  For small divergence (say <10%) this
option makes little difference.  For greater divergence, this option corrects
for the fact that observed distances underestimate actual evolutionary
distances.  This is because, as sequences diverge, more than one substitution
will happen at many sites.  However, you only see one difference when you look
at the present day sequences.  Therefore, this option has the effect of
stretching branch lengths in trees (especially long branches).  The corrections
used here (for DNA or proteins) are both due to Motoo Kimura.
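
To illustrate how such a correction stretches larger distances, the Python
sketch below uses a widely cited Kimura formula for proteins,
d = -ln(1 - p - 0.2 * p * p), where p is the observed fraction of differing
sites.  The help text does not state which formulas the program applies, so
the choice of this formula is an assumption, and the name
kimura_protein_distance is hypothetical.

```python
import math


def kimura_protein_distance(p: float) -> float:
    """Corrected distance from observed divergence p (a fraction, 0 to 1).

    Assumed formula: d = -ln(1 - p - 0.2 * p**2).
    """
    inner = 1.0 - p - 0.2 * p * p
    if inner <= 0.0:
        raise ValueError("divergence too large for this correction")
    return -math.log(inner)
```

At p = 0.05 the corrected value is only slightly above 0.05, while at
p = 0.5 it is close to 0.8, which is the stretching of long branches
described above.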

To calculate a tree, use option 4 (DRAW TREE NOW).  This gives an UNROOTED
tree and all branch lengths.  The root of the tree can only be inferred by
using an outgroup (a sequence that you are certain branches at the outside
of the tree .... certain on biological grounds) OR if you assume a degree
of constancy in the 'molecular clock', you can place the root along the
longest branch.

BOOTSTRAPPING is a method for deriving confidence values for the groupings in
a tree (first adapted for trees by Joe Felsenstein).   It involves making N
random samples of sites from the alignment (N should be LARGE, e.g. 500 - 1000);
drawing N trees (1 from each sample) and counting how many times each grouping
from the original tree occurs in the sample trees.   For a group to be
considered significant at the 5% level (p <= 0.05) it should occur in at least
95% of the sample trees. You must supply a seed number for the random number
generator.
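
The sampling and counting steps of bootstrapping can be sketched in Python.
Tree building itself is left out; groupings are represented as frozensets of
sequence names, which is an assumption of this sketch, and the names
bootstrap_sample and grouping_support are hypothetical.  Sites are drawn with
replacement, as is usual for the bootstrap.

```python
import random


def bootstrap_sample(alignment: list[str], rng: random.Random) -> list[str]:
    """Resample alignment columns (sites) with replacement."""
    n_sites = len(alignment[0])
    columns = [rng.randrange(n_sites) for _ in range(n_sites)]
    # Every sequence is rebuilt from the same list of sampled columns.
    return ["".join(seq[c] for c in columns) for seq in alignment]


def grouping_support(original_groups: list[frozenset],
                     sample_trees: list[set]) -> dict[frozenset, float]:
    """Fraction of sample trees that contain each grouping of the original tree."""
    return {
        group: sum(group in tree for tree in sample_trees) / len(sample_trees)
        for group in original_groups
    }
```

Creating the generator as random.Random(seed) makes a run repeatable, which
is the role of the seed number; a grouping with support of at least 0.95
meets the 5% criterion given above.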
>>HELP<< 8     Help for choosing protein weight matrix
For protein alignments, you use a weight matrix to determine the similarity of
non-identical amino acids.  For example, Tyr aligned with Phe is usually judged
to be 'better' than Tyr aligned with Pro.


There are three 'in-built' weight matrices offered:


1) PAM 100 and 2) PAM 250    These are from the work of M. Dayhoff and are often
simply called Dayhoff matrices.   The PAM 250 matrix is the most commonly used
and is the default in most protein comparison packages.   It is claimed that
a PAM 100 matrix is more sensitive in many cases, so we have included it
here.


3) Identity matrix.   This matrix just scores identical residues.


You can also input your own matrix.  If so then be careful:  1) follow the
instructions on format below; 2) watch the gap penalty parameters (the default
values may not be appropriate).   Conservative substitutions will not be
indicated in alignments.

The values in a new weight matrix must be integers and the scores should be
similarities.  You can use negative as well as positive values if you wish.


INPUT FORMAT  The lower triangle of a 20x20 matrix of values is read in, in free
format, row by row.  The diagonal must be included.   Using the 1 letter code,
the order of amino acids in the matrix is:   CSTPAGNDEQHRKMILVFYW.   Separate
the values by spaces (not commas).   You can put the values on as many lines
as you like as long as they are in the right order.
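
The input format above implies 210 integers (20 rows of 1, 2, ... 20 values).
The Python sketch below reads such a file's text into a symmetric lookup
table; it illustrates the described layout and is not the program's parser.
The names AMINO_ACID_ORDER and read_weight_matrix are hypothetical.

```python
AMINO_ACID_ORDER = "CSTPAGNDEQHRKMILVFYW"


def read_weight_matrix(text: str) -> dict[tuple[str, str], int]:
    """Read a lower-triangle matrix (diagonal included) in free format."""
    # Values are whitespace-separated integers; line breaks do not matter.
    values = [int(token) for token in text.split()]
    n = len(AMINO_ACID_ORDER)
    expected = n * (n + 1) // 2  # 210 values for a 20x20 lower triangle
    if len(values) != expected:
        raise ValueError(f"expected {expected} values, got {len(values)}")

    matrix: dict[tuple[str, str], int] = {}
    stream = iter(values)
    for row in range(n):
        for col in range(row + 1):  # row by row, up to the diagonal
            score = next(stream)
            a, b = AMINO_ACID_ORDER[row], AMINO_ACID_ORDER[col]
            matrix[(a, b)] = matrix[(b, a)] = score
    return matrix
```

A value written with a trailing comma (for example "3,") fails the integer
conversion, in line with the instruction to separate values by spaces.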


GAP PENALTIES  The default gap penalty parameters work fine with a PAM 250
matrix.  The range of PAM 250 values is 0 to 25 (when rescaled to be positive)
and the default gap penalties are 10 each.   Very approximately, the best gap
penalty settings are 2/5 of the maximum weight matrix score.
>>HELP<< 9     Help for command line parameters
                DATA (sequences)

/INFILE=file.ext                             :input sequences.
/PROFILE1=file.ext  and  /PROFILE2=file.ext  :profiles (old alignment).

                VERBS (do things)

/HELP  or /CHECK    :list the command line params.
/ALIGN              :do full multiple alignment
/TREE               :calculate NJ tree.
/BOOTSTRAP(=n)      :bootstrap a NJ tree (n= number of bootstraps; def. = 1000).

                PARAMETERS (set things)

***Pairwise alignments:***
/KTUP=n      :word size                  /TOPDIAGS=n  :number of best diags.
/WINDOW=n    :window around best diags.  /PAIRGAP=n   :gap penalty

***Multiple alignments:***
/FIXEDGAP=n  :fixed length gap pen.      /FLOATGAP=n  :variable length gap pen.
/MATRIX=     :PAM100 or ID or file name. /TYPE=p or d :type is prot. or DNA
/OUTPUT=     :GCG or PHYLIP or PIR.      /TRANSIT     :transitions not weighted.

***Trees:***                             /SEED      :seed number for bootstraps.
/KIMURA      :use Kimura's correction.   /TOSSGAPS  :ignore positions with gaps.
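
As an illustration of how these switches combine, the Python sketch below
assembles an argument list from the parameter names listed above.  It only
builds the list and does not run anything.  The executable name "clustalv",
the file names, and the helper name clustalv_command are assumptions; check
your installation for the actual program name.

```python
def clustalv_command(infile: str, verb: str = "ALIGN",
                     **params: object) -> list[str]:
    """Build an argument list such as /INFILE=..., /ALIGN, /KTUP=2."""
    # "clustalv" is an assumed executable name, not taken from the help text.
    args = ["clustalv", f"/INFILE={infile}", f"/{verb}"]
    for key, value in params.items():
        if value is True:
            args.append(f"/{key.upper()}")           # bare switch, e.g. /KIMURA
        else:
            args.append(f"/{key.upper()}={value}")   # valued switch, e.g. /KTUP=2
    return args
```

For example, clustalv_command("seqs.pir", "ALIGN", ktup=2, output="PHYLIP")
yields an /ALIGN run with a word size of 2 and PHYLIP output, and
clustalv_command("aln.pir", "TREE", kimura=True, tossgaps=True) yields a tree
run with both tree options switched on.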