1 | ClustalV - Multiple sequence alignment by Des Higgins |
---|
2 | |
---|
3 | ClustalV is an order-independent multiple sequence alignment algorithm. |
---|
4 | It first determines which sequences are similar to each other, and aligns |
---|
5 | those sequences first. This reduces the bias that arises when poorly |
---|
6 | matching sequences are aligned early. |
---|
7 | |
---|
8 | |
---|
9 | Be warned that all sequences are returned in upper case, and all |
---|
10 | extended information under Get info... will be lost. For this reason, |
---|
11 | the results are placed in a new GDE window. |
---|
12 | |
---|
13 | The following is the help file that Higgins provides. |
---|
14 | |
---|
15 | --- |
---|
16 | |
---|
17 | This is the on-line help file for CLUSTAL V. |
---|
18 | |
---|
19 | It should be named or defined as: clustalv_hlp |
---|
20 | |
---|
21 | >>HELP<< 1 General help for CLUSTAL V |
---|
22 | CLUSTAL V is a general purpose multiple alignment program for DNA or proteins. |
---|
23 | |
---|
24 | SEQUENCE INPUT: all sequences must be in 1 file, one after another. 3 formats |
---|
25 | are automatically recognised: NBRF/PIR, EMBL/SWISSPROT or Pearson (Fasta). |
---|
26 | All non-alphabetic characters (spaces, digits, punctuation marks) are ignored |
---|
27 | except "-" which is used to indicate a GAP. Upper or lower case is allowed. |
---|
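
The SEQUENCE INPUT rule above can be sketched in a few lines of Python. This is only an illustration of the stated rule (letters and "-" are kept, everything else is ignored); the function name clean_sequence is hypothetical and is not part of CLUSTAL V.

```python
def clean_sequence(raw: str) -> str:
    """Keep ASCII letters and '-' (gap); drop spaces, digits and punctuation."""
    return "".join(
        ch for ch in raw
        if ch == "-" or (ch.isascii() and ch.isalpha())
    )


if __name__ == "__main__":
    # Digits, spaces, newlines and punctuation are ignored; case is preserved.
    print(clean_sequence("1 acg-t 10\nGG.a*"))
```
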
28 | |
---|
29 | |
---|
30 | To do a MULTIPLE ALIGNMENT on a set of sequences, use item 1 from this menu to |
---|
31 | INPUT them; go to menu item 2 to do the multiple alignment. |
---|
32 | |
---|
33 | |
---|
34 | PROFILE ALIGNMENTS (menu item 3) are used to align 2 alignments. Use this to |
---|
35 | add a new sequence to an old alignment. GAPS in the old alignments are |
---|
36 | indicated using the "-" character. PROFILES can be input as PIR format files. |
---|
37 | |
---|
38 | |
---|
39 | PHYLOGENETIC TREES (menu item 4) can be calculated from old alignments (read in |
---|
40 | in PIR format with "-" characters to indicate gaps) OR after a multiple |
---|
41 | alignment while the alignment is still in memory. |
---|
42 | >>HELP<< 2 Help for multiple alignments |
---|
43 | |
---|
44 | If you have already loaded sequences, use menu item 1 to do the complete |
---|
45 | multiple alignment. You will be prompted for 2 output files: 1 for the |
---|
46 | alignment itself; another to store a dendrogram that describes the similarity |
---|
47 | of the sequences to each other. |
---|
48 | |
---|
49 | Multiple alignments are carried out in 3 stages (automatically done from menu |
---|
50 | item 1 ... multiple alignments NOW): |
---|
51 | |
---|
52 | 1) all sequences are compared to each other (pairwise alignments); |
---|
53 | |
---|
54 | 2) a dendrogram (like a phylogenetic tree) is constructed, describing the |
---|
55 | approximate groupings of the sequences by similarity (stored in a file). |
---|
56 | |
---|
57 | 3) the final multiple alignment is carried out, using the dendrogram as a guide. |
---|
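
The 3 stages listed above can be sketched roughly as follows. This is a toy illustration, not the program's own code: toy_similarity is a stand-in for the fast pairwise alignments, and joining groups by average similarity is an assumption about how the dendrogram might be built. Only the order of joining is produced; the final alignment (stage 3) is not shown.

```python
from itertools import combinations


def toy_similarity(a: str, b: str) -> float:
    """Stand-in pairwise score: identical positions / shorter length."""
    shorter = min(len(a), len(b))
    if shorter == 0:
        return 0.0
    return sum(x == y for x, y in zip(a, b)) / shorter


def guide_merges(seqs: dict) -> list:
    """Return the order in which groups would be joined, most similar first."""
    # Stage 1: compare all sequences to each other.
    sim = {frozenset((p, q)): toy_similarity(seqs[p], seqs[q])
           for p, q in combinations(seqs, 2)}

    def average(c1, c2):
        # Mean similarity between the members of two groups.
        total = sum(sim[frozenset((x, y))] for x in c1 for y in c2)
        return total / (len(c1) * len(c2))

    # Stage 2: build the dendrogram by joining the closest groups (assumed rule).
    clusters = [(name,) for name in seqs]
    merges = []
    while len(clusters) > 1:
        i, j = max(combinations(range(len(clusters)), 2),
                   key=lambda ij: average(clusters[ij[0]], clusters[ij[1]]))
        merges.append((clusters[i], clusters[j]))
        joined = clusters[i] + clusters[j]
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append(joined)
    # Stage 3 (not shown): align the groups in the order listed in `merges`.
    return merges
```

The most similar sequences are joined first, so they would also be aligned first, which is the behaviour the stages describe.
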
58 | |
---|
59 | |
---|
60 | PAIRWISE ALIGNMENT parameters control the speed/sensitivity of the initial |
---|
61 | alignments. |
---|
62 | |
---|
63 | MULTIPLE ALIGNMENT parameters control the gaps in the final multiple alignments. |
---|
64 | |
---|
65 | |
---|
66 | |
---|
67 | |
---|
68 | You can skip the first stages (pairwise alignments; dendrogram) by using an |
---|
69 | old dendrogram file (menu item 3); or you can just produce the dendrogram |
---|
70 | with no final multiple alignment (menu item 2). |
---|
71 | |
---|
72 | |
---|
73 | OUTPUT FORMAT: Menu item 6 (format options) allows you to choose between 4 |
---|
74 | different alignment formats (CLUSTAL, GCG, NBRF/PIR and PHYLIP). |
---|
75 | |
---|
76 | |
---|
77 | >>HELP<< 3 Help for pairwise alignment parameters |
---|
78 | |
---|
79 | A similarity score is calculated between every pair of sequences and these are |
---|
80 | used to construct the dendrogram which guides the final multiple alignment. |
---|
81 | |
---|
82 | These similarity scores are calculated from fast, approximate, global align- |
---|
83 | ments, which are controlled by 4 parameters. 2 techniques are used to make |
---|
84 | these alignments very fast: 1) only exactly matching fragments (k-tuples) are |
---|
85 | considered; 2) only the 'best' diagonals (the ones with most k-tuple matches) |
---|
86 | are used. |
---|
87 | |
---|
88 | |
---|
89 | K-TUPLE SIZE: This is the size of exactly matching fragment that is used. |
---|
90 | INCREASE for speed (max= 2 for proteins; 4 for DNA), DECREASE for sensitivity. |
---|
91 | For longer sequences (e.g. > 300 residues) you may need to increase the default. |
---|
92 | |
---|
93 | |
---|
94 | GAP PENALTY: This is a penalty for each gap in the fast alignments. It has |
---|
95 | little effect on the speed or sensitivity. |
---|
96 | |
---|
97 | |
---|
98 | |
---|
99 | |
---|
100 | |
---|
101 | |
---|
102 | TOP DIAGONALS: The number of k-tuple matches on each diagonal (in an imaginary |
---|
103 | dot-matrix plot) is calculated. Only the best ones (with most matches) are |
---|
104 | used in the alignment. This parameter specifies how many. Decrease for speed; |
---|
105 | increase for sensitivity. |
---|
106 | |
---|
107 | |
---|
108 | DIAGONAL WINDOW: This is the number of diagonals around each of the 'best' |
---|
109 | diagonals that will be used. Decrease for speed; increase for sensitivity. |
---|
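
To make K-TUPLE SIZE, TOP DIAGONALS and DIAGONAL WINDOW more concrete, here is a minimal sketch of counting exact k-tuple matches per diagonal and selecting the best ones. The function names are hypothetical, and treating the window as that many diagonals on either side of each best diagonal is an assumption.

```python
from collections import Counter, defaultdict


def diagonal_counts(a: str, b: str, ktup: int) -> Counter:
    """Count exact k-tuple matches on each diagonal (i - j) of the dot matrix."""
    positions = defaultdict(list)
    for j in range(len(b) - ktup + 1):
        positions[b[j:j + ktup]].append(j)
    counts = Counter()
    for i in range(len(a) - ktup + 1):
        for j in positions.get(a[i:i + ktup], ()):
            counts[i - j] += 1
    return counts


def selected_diagonals(counts: Counter, topdiags: int, window: int) -> set:
    """Keep the `topdiags` best diagonals plus `window` diagonals either side."""
    keep = set()
    for diag, _ in counts.most_common(topdiags):
        keep.update(range(diag - window, diag + window + 1))
    return keep
```

A larger ktup gives fewer candidate matches to examine (faster, less sensitive); a larger topdiags or window keeps more of the dot matrix in play (slower, more sensitive).
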
110 | |
---|
111 | |
---|
112 | SCORING METHOD = PERCENTAGE or ABSOLUTE: This controls whether the similarity |
---|
113 | scores are calculated as raw alignment scores (number of k-tuple matches minus a |
---|
114 | gap penalty for every gap) (ABSOLUTE) or as the alignment score divided by the |
---|
115 | length of the shorter sequence (PERCENTAGE). |
---|
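
The two scoring methods can be written out as a small sketch. The function and argument names are hypothetical; multiplying by 100 to express the PERCENTAGE score as a percent is an assumption (the text above only says the score is divided by the length of the shorter sequence).

```python
def similarity_score(matches: int, gaps: int, gap_penalty: int,
                     len_a: int, len_b: int, percentage: bool = True) -> float:
    """Score a fast pairwise alignment as ABSOLUTE or PERCENTAGE."""
    # ABSOLUTE: number of k-tuple matches minus a penalty for every gap.
    absolute = matches - gap_penalty * gaps
    if not percentage:
        return float(absolute)
    # PERCENTAGE: the same score relative to the shorter sequence
    # (the factor of 100 is an assumption).
    return 100.0 * absolute / min(len_a, len_b)
```
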
116 | |
---|
117 | |
---|
118 | |
---|
119 | >>HELP<< 4 Help for multiple alignment parameters |
---|
120 | These parameters control the final multiple alignment. There are 2 gap penalty |
---|
121 | parameters and 1 for whether transitions (A <--> G or C <--> T) are weighted in |
---|
122 | DNA alignments. The default weight matrix for protein alignments is a PAM250 |
---|
123 | matrix, converted to distances. |
---|
124 | |
---|
125 | GAP PENALTY (FIXED): This is a penalty for opening up a gap. Decrease it |
---|
126 | and you will encourage gaps of all sizes. TERMINAL GAPS are penalised (same as |
---|
127 | internal ones). BEWARE: if you make this too small (+/- 5 or so), the program |
---|
128 | will prefer to align each sequence opposite a long gap. |
---|
129 | |
---|
130 | GAP PENALTY (VARYING): This penalty is incurred for every item in a gap. This |
---|
131 | penalises long gaps more. Increase this and gaps will get shorter. BEWARE: |
---|
132 | if you make this too small (+/- 5 or so), the program will prefer to align each |
---|
133 | sequence opposite a long gap. |
---|
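
As a rough illustration of how the 2 gap penalties combine, the cost of one gap can be sketched as the fixed penalty plus the varying penalty for every item in the gap. The function name is hypothetical and the defaults of 10 are the default values quoted in the weight matrix help; how the program applies the penalties internally is not shown here.

```python
def gap_cost(length: int, fixed: int = 10, varying: int = 10) -> int:
    """Cost of a single gap: opening penalty plus a penalty per gap item."""
    if length <= 0:
        return 0
    return fixed + varying * length


if __name__ == "__main__":
    # One gap of 4 items is cheaper than two gaps of 2 items each,
    # because the fixed penalty is paid once instead of twice.
    print(gap_cost(4), 2 * gap_cost(2))
```
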
134 | |
---|
135 | TRANSITIONS = WEIGHTED or UNWEIGHTED: With UNWEIGHTED transitions identical |
---|
136 | bases in a DNA alignment have a DISTANCE of 0; different ones have a distance |
---|
137 | of 10. If transitions are WEIGHTED then A vs G and C vs T will have a distance |
---|
138 | of 5 (less distant than A vs C,T or C vs A,G). |
---|
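
The DNA distances described above can be written as a small lookup. This is a sketch using the values quoted in the text (0, 5 and 10); the function name is hypothetical.

```python
TRANSITIONS = {frozenset("AG"), frozenset("CT")}


def base_distance(x: str, y: str, weighted: bool = True) -> int:
    """Distance between two bases: 0 if identical, 5 for a weighted
    transition (A<->G or C<->T), 10 otherwise."""
    x, y = x.upper(), y.upper()
    if x == y:
        return 0
    if weighted and frozenset((x, y)) in TRANSITIONS:
        return 5
    return 10
```
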
139 | >>HELP<< 5 Help for output format options. |
---|
140 | Four output formats are offered. You can choose more than one (or all four if |
---|
141 | you wish). NBRF/PIR format is ESPECIALLY USEFUL. Alignments that are written |
---|
142 | in this format can be used again as input (for calculating phylogenetic trees; |
---|
143 | profile alignments; general input). |
---|
144 | |
---|
145 | CLUSTAL format output is a self-explanatory alignment format. It shows the |
---|
146 | sequences aligned in blocks. |
---|
147 | |
---|
148 | GCG output can be used by any of the GCG programs that can work on multiple |
---|
149 | alignments (e.g. PRETTY, PROFILEMAKE, PLOTALIGN). It is the same as the GCG |
---|
150 | .msf format files (multiple sequence file); new in version 7 of GCG. |
---|
151 | |
---|
152 | PHYLIP format output can be used for input to the PHYLIP package of Joe |
---|
153 | Felsenstein. This is an extremely widely used package for doing every |
---|
154 | imaginable form of phylogenetic analysis (MUCH more than the modest |
---|
155 | introduction offered by this program). |
---|
156 | |
---|
157 | NBRF/PIR: this is the same as the standard PIR format with ONE ADDITION. Gap |
---|
158 | characters "-" are used to indicate the positions of gaps in the multiple |
---|
159 | alignment. These files can be re-used as input in any part of clustal that |
---|
160 | allows sequences (or alignments or profiles) to be read in. |
---|
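
For readers who want to handle these gapped NBRF/PIR files outside the program, a minimal reader might look like the sketch below. It assumes the usual PIR layout (a ">P1;name" style header line, one description line, then sequence lines ending with "*"); the function name read_pir is hypothetical.

```python
def read_pir(text: str) -> dict:
    """Read gapped NBRF/PIR records into {name: aligned sequence}."""
    records = {}
    lines = iter(text.splitlines())
    for line in lines:
        if not line.startswith(">"):
            continue
        # Header looks like ">P1;name"; keep the part after the ';'.
        name = line[1:].split(";", 1)[-1].strip()
        next(lines, "")  # skip the description line
        chunks = []
        for seq_line in lines:
            chunks.append(seq_line)
            if "*" in seq_line:  # '*' terminates the sequence
                break
        seq = "".join(chunks).split("*", 1)[0]
        # Keep residues and gap characters only.
        records[name] = "".join(c for c in seq if c.isalpha() or c == "-")
    return records
```
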
161 | >>HELP<< 6 Help for profile alignments |
---|
162 | |
---|
163 | By PROFILE ALIGNMENT, we mean the alignment of two old alignments. One of the |
---|
164 | alignments can be a single sequence. |
---|
165 | |
---|
166 | The profiles should be in PIR format (one of the 4 output formats produced by |
---|
167 | this program). This is the same as standard NBRF/PIR format, with 1 addition: |
---|
168 | gap characters are indicated by "-". |
---|
169 | |
---|
170 | The alignment method produces a global, optimal alignment using an amino acid |
---|
171 | weight matrix (PAM250 is default) and 2 gap penalty parameters. |
---|
172 | |
---|
173 | Profile alignments allow you to store alignments of your favourite sequences (as |
---|
174 | long as they are in PIR format) and add new sequences to them in small bunches |
---|
175 | at a time. One of the 2 profiles can simply be a single sequence. |
---|
176 | |
---|
177 | |
---|
178 | |
---|
179 | >>HELP<< 7 Help for phylogenetic trees |
---|
180 | Before calculating a tree, you must have an alignment in memory. This can be |
---|
181 | input in NBRF/PIR format, or it can be left over from a full multiple |
---|
182 | alignment that you have just carried out and that is still in memory. |
---|
183 | |
---|
184 | The method used is the NJ (Neighbour Joining) method of Saitou and Nei. First |
---|
185 | you calculate distances (percent divergence) between all pairs of sequences from |
---|
186 | a multiple alignment; second you apply the NJ method to the distance matrix. |
---|
187 | |
---|
188 | EXCLUDE POSITIONS WITH GAPS? If you choose this option, any alignment positions |
---|
189 | where ANY of the sequences have a gap will be ignored. This guarantees that |
---|
190 | the distances will be 'metric'. Also, it means that 'like' will be compared to |
---|
191 | 'like' in all distances. The disadvantage is that you may throw away much of |
---|
192 | the data if there are many gaps. |
---|
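
The effect of EXCLUDE POSITIONS WITH GAPS on the distances can be sketched as follows. The function names are hypothetical, and skipping positions where either of the two sequences has a gap when the option is off is an assumption, not a description of the program's code.

```python
def gapped_columns(alignment: list) -> frozenset:
    """Alignment positions where ANY sequence has a gap."""
    return frozenset(i for i, col in enumerate(zip(*alignment)) if "-" in col)


def percent_divergence(a: str, b: str, skip: frozenset = frozenset()) -> float:
    """Percent of compared positions at which two aligned sequences differ."""
    differ = compared = 0
    for pos, (x, y) in enumerate(zip(a, b)):
        if pos in skip or x == "-" or y == "-":
            continue
        compared += 1
        differ += x != y
    return 100.0 * differ / compared if compared else 0.0
```

Passing skip=gapped_columns(alignment) compares every pair over the same set of positions ('like' with 'like'), at the cost of discarding every column that contains a gap in any sequence.
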
193 | |
---|
194 | CORRECT FOR MULTIPLE SUBSTITUTIONS? For small divergence (say <10%) this |
---|
195 | option makes little difference. For greater divergence, this option corrects |
---|
196 | for the fact that observed distances underestimate actual evolutionary dist- |
---|
197 | ances. This is because, as sequences diverge, more than one substitution will |
---|
198 | happen at many sites. However, you only see one difference when you look at the |
---|
199 | present day sequences. Therefore, this option has the effect of stretching |
---|
200 | branch lengths in trees (especially long branches). The corrections used here |
---|
201 | (for DNA or proteins) are both due to Motoo Kimura. |
---|
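
As an illustration of what such a correction does, here is a sketch of one well-known formula attributed to Kimura for proteins, K = -ln(1 - D - D^2/5), where D is the observed fraction of differing sites. Whether CLUSTAL V uses exactly this form is an assumption; the function name is hypothetical.

```python
import math


def kimura_protein_distance(observed: float) -> float:
    """Corrected distance from an observed fraction of differences (0..1)."""
    inner = 1.0 - observed - 0.2 * observed ** 2
    if inner <= 0:
        raise ValueError("divergence too large for this correction")
    return -math.log(inner)


if __name__ == "__main__":
    # Small divergences barely change; large ones are stretched.
    for d in (0.05, 0.10, 0.50):
        print(d, round(kimura_protein_distance(d), 4))
```
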
202 | |
---|
203 | To calculate a tree, use option 4 (DRAW TREE NOW). This gives an UNROOTED |
---|
204 | tree and all branch lengths. The root of the tree can only be inferred by |
---|
205 | using an outgroup (a sequence that you are certain branches at the outside |
---|
206 | of the tree .... certain on biological grounds) OR if you assume a degree |
---|
207 | of constancy in the 'molecular clock', you can place the root along the |
---|
208 | longest branch. |
---|
209 | |
---|
210 | BOOTSTRAPPING is a method for deriving confidence values for the groupings in |
---|
211 | a tree (first adapted for trees by Joe Felsenstein). It involves making N |
---|
212 | random samples of sites from the alignment (N should be LARGE, e.g. 500 - 1000); |
---|
213 | drawing N trees (1 from each sample) and counting how many times each grouping |
---|
214 | from the original tree occurs in the sample trees. For a group to be consid- |
---|
215 | ered significant at the 5% level (p <= 0.05) it should occur in at least 95% of |
---|
216 | the sample trees. You must supply a seed number for the random number generator. |
---|
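
The resampling step of bootstrapping can be sketched as below: each sample draws alignment positions at random, with replacement, keeping the sequences in step with each other. The function names are hypothetical, tree drawing is not shown, and Python's random module stands in for whatever generator the program uses.

```python
import random


def bootstrap_alignments(alignment: list, n: int, seed: int):
    """Yield n resampled alignments (sites drawn with replacement)."""
    rng = random.Random(seed)  # the seed makes the samples reproducible
    length = len(alignment[0])
    for _ in range(n):
        cols = [rng.randrange(length) for _ in range(length)]
        yield ["".join(seq[c] for c in cols) for seq in alignment]


def is_significant(count: int, n: int) -> bool:
    """True if a grouping occurs in at least 95% of the n sample trees."""
    return 100 * count >= 95 * n
```
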
217 | >>HELP<< 8 Help for choosing protein weight matrix |
---|
218 | For protein alignments, you use a weight matrix to determine the similarity of |
---|
219 | non-identical amino acids. For example, Tyr aligned with Phe is usually judged |
---|
220 | to be 'better' than Tyr aligned with Pro. |
---|
221 | |
---|
222 | |
---|
223 | |
---|
224 | There are three 'in-built' weight matrices offered: |
---|
225 | |
---|
226 | |
---|
227 | 1) PAM 100 and 2) PAM 250: These are from the work of M. Dayhoff and are often |
---|
228 | simply called Dayhoff matrices. The PAM 250 matrix is the most commonly used |
---|
229 | and is the default in most protein comparison packages. It is claimed that |
---|
230 | a PAM 100 matrix is more sensitive in many cases, so we have included it |
---|
231 | here. |
---|
232 | |
---|
233 | |
---|
234 | 3) Identity matrix. This matrix just scores identical residues. |
---|
235 | |
---|
236 | |
---|
237 | |
---|
238 | |
---|
239 | |
---|
240 | You can also input your own matrix. If so then be careful: 1) follow the |
---|
241 | instructions on format below; 2) watch the gap penalty parameters (the default |
---|
242 | values may not be appropriate). Conservative substitutions will not be |
---|
243 | indicated in alignments. |
---|
244 | |
---|
245 | The values in a new weight matrix must be integers and the scores should be |
---|
246 | similarities. You can use negative as well as positive values if you wish. |
---|
247 | |
---|
248 | |
---|
249 | INPUT FORMAT: The lower triangle of a 20x20 matrix of values is read in, in free |
---|
250 | format, row by row. The diagonal must be included. Using the 1 letter code, |
---|
251 | the order of amino acids in the matrix is: CSTPAGNDEQHRKMILVFYW. Separate |
---|
252 | the values by spaces (not commas). You can put the values on as many lines |
---|
253 | as you like as long as they are in the right order. |
---|
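
A matrix file in the format just described could be checked before use with a sketch like the following. It reads the lower triangle (diagonal included) in the stated amino acid order and fills in the symmetric entries; the function name is hypothetical and the program's own reader may behave differently on malformed input.

```python
ORDER = "CSTPAGNDEQHRKMILVFYW"


def read_lower_triangle(text: str) -> dict:
    """Parse 210 integers, row by row, into a symmetric {(aa, aa): score} table."""
    values = [int(tok) for tok in text.split()]  # free format: any whitespace
    expected = len(ORDER) * (len(ORDER) + 1) // 2
    if len(values) != expected:
        raise ValueError(f"expected {expected} values, got {len(values)}")
    matrix = {}
    it = iter(values)
    for i, row_aa in enumerate(ORDER):
        for col_aa in ORDER[:i + 1]:
            score = next(it)
            matrix[row_aa, col_aa] = matrix[col_aa, row_aa] = score
    return matrix
```

Because the values are split on whitespace, they can be spread over as many lines as you like; commas would cause int() to fail, in line with the instruction to separate values by spaces.
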
254 | |
---|
255 | |
---|
256 | GAP PENALTIES: The default gap penalty parameters work fine with a PAM 250 |
---|
257 | matrix. The range of PAM 250 values is 0 to 25 (when rescaled to be positive) |
---|
258 | and the default gap penalties are 10 each. Very approximately, the best gap |
---|
259 | penalty settings are 2/5 the maximum weight matrix score. |
---|
260 | >>HELP<< 9 Help for command line parameters |
---|
261 | DATA (sequences) |
---|
262 | |
---|
263 | /INFILE=file.ext :input sequences. |
---|
264 | /PROFILE1=file.ext and /PROFILE2=file.ext :profiles (old alignment). |
---|
265 | |
---|
266 | VERBS (do things) |
---|
267 | |
---|
268 | /HELP or /CHECK :list the command line params. |
---|
269 | /ALIGN :do full multiple alignment |
---|
270 | /TREE :calculate NJ tree. |
---|
271 | /BOOTSTRAP(=n) :bootstrap a NJ tree (n= number of bootstraps; def. = 1000). |
---|
272 | |
---|
273 | PARAMETERS (set things) |
---|
274 | |
---|
275 | ***Pairwise alignments:*** |
---|
276 | /KTUP=n :word size /TOPDIAGS=n :number of best diags. |
---|
277 | /WINDOW=n :window around best diags. /PAIRGAP=n :gap penalty |
---|
278 | |
---|
279 | ***Multiple alignments:*** |
---|
280 | /FIXEDGAP=n :fixed length gap pen. /FLOATGAP=n :variable length gap pen. |
---|
281 | /MATRIX= :PAM100 or ID or file name. /TYPE=p or d :type is prot. or DNA |
---|
282 | /OUTPUT= :GCG or PHYLIP or PIR. /TRANSIT :transitions not weighted. |
---|
283 | |
---|
284 | ***Trees:*** /SEED :seed number for bootstraps. |
---|
285 | /KIMURA :use Kimura's correction. /TOSSGAPS :ignore positions with gaps. |
---|
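
If you drive the program from a script, the parameters above can be assembled into an argument list as in the sketch below. The helper name build_command, the executable name "clustalv" and the file name "seqs.pir" are assumptions for illustration; only switches listed above are used.

```python
def build_command(program: str, infile: str, verbs=(), **params) -> list:
    """Assemble a CLUSTAL V style command line as a list of arguments."""
    args = [program, f"/INFILE={infile}"]
    # Verbs are bare switches such as /ALIGN or /TREE.
    args += [f"/{verb.upper()}" for verb in verbs]
    # Parameters take the form /NAME=value, e.g. /KTUP=2.
    args += [f"/{name.upper()}={value}" for name, value in params.items()]
    return args


if __name__ == "__main__":
    print(" ".join(build_command("clustalv", "seqs.pir",
                                 verbs=["align"], output="PIR", ktup=2)))
```
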
286 | |
---|