| 1 | ClustalV - Multiple sequence alignment by Des Higgins |
|---|
| 2 | |
|---|
| 3 | ClustalV is an order-independent multiple sequence alignment algorithm. |
|---|
| 4 | It first determines which sequences are similar to each other, and aligns |
|---|
| 5 | those sequences first. This helps reduce alignment bias based on the |
|---|
| 6 | early alignment of poor sequences. |
|---|
| 7 | |
|---|
| 8 | |
|---|
| 9 | Be warned that all sequences are returned in upper case, and all |
|---|
| 10 | extended information under Get info.. will be lost. For this reason, |
|---|
| 11 | the results are placed in a new GDE window. |
|---|
| 12 | |
|---|
| 13 | The following is the help file that Higgins provides. |
|---|
| 14 | |
|---|
| 15 | --- |
|---|
| 16 | |
|---|
| 17 | This is the on-line help file for CLUSTAL V. |
|---|
| 18 | |
|---|
| 19 | It should be named or defined as: clustalv_hlp |
|---|
| 20 | |
|---|
| 21 | >>HELP<< 1 General help for CLUSTAL V |
|---|
| 22 | CLUSTAL V is a general purpose multiple alignment program for DNA or proteins. |
|---|
| 23 | |
|---|
| 24 | SEQUENCE INPUT: all sequences must be in 1 file, one after another. 3 formats |
|---|
| 25 | are automatically recognised: NBRF/PIR, EMBL/SWISSPROT or Pearson (Fasta). |
|---|
| 26 | All non-alphabetic characters (spaces, digits, punctuation marks) are ignored |
|---|
| 27 | except "-" which is used to indicate a GAP. Upper or lower case is allowed. |
|---|
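
The input rule above (letters and "-" are kept, everything else is ignored) can be sketched as follows. This is only an illustration of the stated rule, not code from CLUSTAL V; the function name `clean_sequence` is hypothetical.

```python
def clean_sequence(raw: str) -> str:
    """Keep letters and the gap character '-'; drop spaces, digits, punctuation.

    Case is left untouched, since the help text says upper or lower
    case is allowed on input.
    """
    return "".join(ch for ch in raw if ch.isalpha() or ch == "-")


# Position numbers and spaces, as found in many sequence files, disappear.
print(clean_sequence("1 acgt-ac 10\n11 GG--tt 20"))  # acgt-acGG--tt
```
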
| 28 | |
|---|
| 29 | |
|---|
| 30 | To do a MULTIPLE ALIGNMENT on a set of sequences, use item 1 from this menu to |
|---|
| 31 | INPUT them; go to menu item 2 to do the multiple alignment. |
|---|
| 32 | |
|---|
| 33 | |
|---|
| 34 | PROFILE ALIGNMENTS (menu item 3) are used to align 2 alignments. Use this to |
|---|
| 35 | add a new sequence to an old alignment. GAPS in the old alignments are |
|---|
| 36 | indicated using the "-" character. PROFILES can be input as PIR format files. |
|---|
| 37 | |
|---|
| 38 | |
|---|
| 39 | PHYLOGENETIC TREES (menu item 4) can be calculated from old alignments (read in |
|---|
| 40 | in PIR format with "-" characters to indicate gaps) OR after a multiple |
|---|
| 41 | alignment while the alignment is still in memory. |
|---|
| 42 | >>HELP<< 2 Help for multiple alignments |
|---|
| 43 | |
|---|
| 44 | If you have already loaded sequences, use menu item 1 to do the complete |
|---|
| 45 | multiple alignment. You will be prompted for 2 output files: 1 for the |
|---|
| 46 | alignment itself; another to store a dendrogram that describes the similarity |
|---|
| 47 | of the sequences to each other. |
|---|
| 48 | |
|---|
| 49 | Multiple alignments are carried out in 3 stages (automatically done from menu |
|---|
| 50 | item 1 ... multiple alignments NOW): |
|---|
| 51 | |
|---|
| 52 | 1) all sequences are compared to each other (pairwise alignments); |
|---|
| 53 | |
|---|
| 54 | 2) a dendrogram (like a phylogenetic tree) is constructed, describing the |
|---|
| 55 | approximate groupings of the sequences by similarity (stored in a file). |
|---|
| 56 | |
|---|
| 57 | 3) the final multiple alignment is carried out, using the dendrogram as a guide. |
|---|
| 58 | |
|---|
| 59 | |
|---|
| 60 | PAIRWISE ALIGNMENT parameters control the speed/sensitivity of the initial |
|---|
| 61 | alignments. |
|---|
| 62 | |
|---|
| 63 | MULTIPLE ALIGNMENT parameters control the gaps in the final multiple alignments. |
|---|
| 64 | |
|---|
| 65 | |
|---|
| 66 | |
|---|
| 67 | |
|---|
| 68 | You can skip the first stages (pairwise alignments; dendrogram) by using an |
|---|
| 69 | old dendrogram file (menu item 3); or you can just produce the dendrogram |
|---|
| 70 | with no final multiple alignment (menu item 2). |
|---|
| 71 | |
|---|
| 72 | |
|---|
| 73 | OUTPUT FORMAT: Menu item 6 (format options) allows you to choose between 4 |
|---|
| 74 | different alignment formats (CLUSTAL, GCG, NBRF/PIR and PHYLIP). |
|---|
| 75 | |
|---|
| 76 | |
|---|
| 77 | >>HELP<< 3 Help for pairwise alignment parameters |
|---|
| 78 | |
|---|
| 79 | A similarity score is calculated between every pair of sequences and these are |
|---|
| 80 | used to construct the dendrogram which guides the final multiple alignment. |
|---|
| 81 | |
|---|
| 82 | These similarity scores are calculated from fast, approximate, global |
|---|
| 83 | alignments, which are controlled by 4 parameters. 2 techniques are used to make |
|---|
| 84 | these alignments very fast: 1) only exactly matching fragments (k-tuples) are |
|---|
| 85 | considered; 2) only the 'best' diagonals (the ones with most k-tuple matches) |
|---|
| 86 | are used. |
|---|
| 87 | |
|---|
| 88 | |
|---|
| 89 | K-TUPLE SIZE: This is the size of exactly matching fragment that is used. |
|---|
| 90 | INCREASE for speed (max= 2 for proteins; 4 for DNA), DECREASE for sensitivity. |
|---|
| 91 | For longer sequences (e.g. > 300 residues) you may need to increase the default. |
|---|
| 92 | |
|---|
| 93 | |
|---|
| 94 | GAP PENALTY: This is a penalty for each gap in the fast alignments. It has |
|---|
| 95 | little effect on the speed or sensitivity. |
|---|
| 96 | |
|---|
| 97 | |
|---|
| 98 | |
|---|
| 99 | |
|---|
| 100 | |
|---|
| 101 | |
|---|
| 102 | TOP DIAGONALS: The number of k-tuple matches on each diagonal (in an imaginary |
|---|
| 103 | dot-matrix plot) is calculated. Only the best ones (with most matches) are |
|---|
| 104 | used in the alignment. This parameter specifies how many. Decrease for speed; |
|---|
| 105 | increase for sensitivity. |
|---|
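
The idea of counting k-tuple matches per diagonal and keeping only the best ones can be sketched as below. This is a minimal illustration, not the program's actual implementation; the function name `top_diagonals` and its arguments `ktup` and `top` are hypothetical stand-ins for the K-TUPLE SIZE and TOP DIAGONALS parameters.

```python
from collections import Counter, defaultdict


def top_diagonals(seq_a, seq_b, ktup=2, top=3):
    """Count exact k-tuple matches per diagonal and return the `top` best.

    Returns a list of (diagonal, match_count) pairs, best first.
    """
    # Index the start position of every k-tuple in seq_b.
    positions = defaultdict(list)
    for j in range(len(seq_b) - ktup + 1):
        positions[seq_b[j:j + ktup]].append(j)

    # A match between position i in seq_a and j in seq_b lies on
    # diagonal i - j of the imaginary dot-matrix plot.
    counts = Counter()
    for i in range(len(seq_a) - ktup + 1):
        for j in positions.get(seq_a[i:i + ktup], ()):
            counts[i - j] += 1
    return counts.most_common(top)


# Two identical sequences: the main diagonal (0) collects the most matches.
print(top_diagonals("ACGTACGT", "ACGTACGT", ktup=2, top=1))  # [(0, 7)]
```

A larger `ktup` gives fewer candidate matches to examine (faster, less sensitive); a larger `top` keeps more diagonals for the alignment (slower, more sensitive).
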
| 106 | |
|---|
| 107 | |
|---|
| 108 | DIAGONAL WINDOW: This is the number of diagonals around each of the 'best' |
|---|
| 109 | diagonals that will be used. Decrease for speed; increase for sensitivity. |
|---|
| 110 | |
|---|
| 111 | |
|---|
| 112 | SCORING METHOD = PERCENTAGE or ABSOLUTE: This controls whether the similarity |
|---|
| 113 | scores are calculated as raw alignment scores (number of k-tuple matches minus a |
|---|
| 114 | gap penalty for every gap) (ABSOLUTE) or as the alignment score divided by the |
|---|
| 115 | length of the shorter sequence (PERCENTAGE). |
|---|
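
The two scoring methods can be written out as a short sketch. The function and argument names are hypothetical, and multiplying by 100 to express the PERCENTAGE score as a percentage is an assumption; the help text only says the score is divided by the length of the shorter sequence.

```python
def similarity_score(ktuple_matches, gaps, gap_penalty, len_a, len_b,
                     percentage=True):
    """Similarity score for one pair of sequences from a fast alignment."""
    # ABSOLUTE: number of k-tuple matches minus a penalty for every gap.
    absolute = ktuple_matches - gap_penalty * gaps
    if not percentage:
        return absolute
    # PERCENTAGE: the same score relative to the shorter sequence.
    # The factor of 100 is an assumption made for this illustration.
    return 100.0 * absolute / min(len_a, len_b)


print(similarity_score(40, 2, 3, 100, 80, percentage=False))  # 34
print(similarity_score(40, 2, 3, 100, 80))                    # 42.5
```
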
| 116 | |
|---|
| 117 | |
|---|
| 118 | |
|---|
| 119 | >>HELP<< 4 Help for multiple alignment parameters |
|---|
| 120 | These parameters control the final multiple alignment. There are 2 gap penalty |
|---|
| 121 | parameters and 1 for whether transitions (A <--> G or C <--> T) are weighted in |
|---|
| 122 | DNA alignments. The default weight matrix for protein alignments is a PAM250 |
|---|
| 123 | matrix, converted to distances. |
|---|
| 124 | |
|---|
| 125 | GAP PENALTY (FIXED): This is a penalty for opening up a gap. Decrease it |
|---|
| 126 | and you will encourage gaps of all sizes. TERMINAL GAPS are penalised (same as |
|---|
| 127 | internal ones). BEWARE: if you make this too small (+/- 5 or so), the program |
|---|
| 128 | will prefer to align each sequence opposite a long gap. |
|---|
| 129 | |
|---|
| 130 | GAP PENALTY (VARYING): This penalty is incurred for every item in a gap. This |
|---|
| 131 | penalises long gaps more. Increase this and gaps will get shorter. BEWARE: |
|---|
| 132 | if you make this too small (+/- 5 or so), the program will prefer to align each |
|---|
| 133 | sequence opposite a long gap. |
|---|
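
Read together, the two penalties describe a gap cost made of a fixed part plus a part that grows with gap length. A minimal sketch under that reading follows; the function name is hypothetical, and the defaults of 10 are the default penalties quoted in the weight matrix help.

```python
def gap_cost(length, fixed=10, varying=10):
    """Cost of one gap of `length` positions.

    fixed   : charged once for opening the gap
    varying : charged for every position in the gap
    """
    if length <= 0:
        return 0
    return fixed + varying * length


# One gap of three positions is cheaper than three gaps of one position,
# which is why lowering the fixed penalty encourages gaps of all sizes.
print(gap_cost(3))      # 40
print(3 * gap_cost(1))  # 60
```
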
| 134 | |
|---|
| 135 | TRANSITIONS = WEIGHTED or UNWEIGHTED: With UNWEIGHTED transitions identical |
|---|
| 136 | bases in a DNA alignment have a DISTANCE of 0; different ones have a distance |
|---|
| 137 | of 10. If transitions are WEIGHTED then A vs G and C vs T will have a distance |
|---|
| 138 | of 5 (less distant than A vs C,T or C vs A,G). |
|---|
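
The DNA distances described above can be tabulated with a small function. This is a sketch of the stated rule only; the function name `dna_distance` is hypothetical.

```python
PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def dna_distance(x, y, weighted=True):
    """Distance between two bases: 0 if identical, otherwise 10,
    or 5 for a transition (A <-> G, C <-> T) when transitions are weighted."""
    x, y = x.upper(), y.upper()
    if x == y:
        return 0
    is_transition = {x, y} <= PURINES or {x, y} <= PYRIMIDINES
    return 5 if weighted and is_transition else 10


print(dna_distance("A", "G"))                  # 5
print(dna_distance("A", "C"))                  # 10
print(dna_distance("A", "G", weighted=False))  # 10
```
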
| 139 | >>HELP<< 5 Help for output format options. |
|---|
| 140 | Four output formats are offered. You can choose more than one (or all four if |
|---|
| 141 | you wish). NBRF/PIR format is ESPECIALLY USEFUL. Alignments that are written |
|---|
| 142 | in this format can be used again as input (for calculating phylogenetic trees; |
|---|
| 143 | profile alignments; general input). |
|---|
| 144 | |
|---|
| 145 | CLUSTAL format output is a self-explanatory alignment format. It shows the |
|---|
| 146 | sequences aligned in blocks. |
|---|
| 147 | |
|---|
| 148 | GCG output can be used by any of the GCG programs that can work on multiple |
|---|
| 149 | alignments (e.g. PRETTY, PROFILEMAKE, PLOTALIGN). It is the same as the GCG |
|---|
| 150 | .msf format files (multiple sequence file); new in version 7 of GCG. |
|---|
| 151 | |
|---|
| 152 | PHYLIP format output can be used for input to the PHYLIP package of Joe |
|---|
| 153 | Felsenstein. This is an extremely widely used package for doing every |
|---|
| 154 | imaginable form of phylogenetic analysis (MUCH more than the modest |
|---|
| 155 | introduction offered by this program). |
|---|
| 156 | |
|---|
| 157 | NBRF/PIR: this is the same as the standard PIR format with ONE ADDITION. Gap |
|---|
| 158 | characters "-" are used to indicate the positions of gaps in the multiple |
|---|
| 159 | alignment. These files can be re-used as input in any part of clustal that |
|---|
| 160 | allows sequences (or alignments or profiles) to be read in. |
|---|
| 161 | >>HELP<< 6 Help for profile alignments |
|---|
| 162 | |
|---|
| 163 | By PROFILE ALIGNMENT, we mean the alignment of two old alignments. One of the |
|---|
| 164 | alignments can be a single sequence. |
|---|
| 165 | |
|---|
| 166 | The profiles should be in PIR format (one of the 4 output formats produced by |
|---|
| 167 | this program). This is the same as standard NBRF/PIR format, with 1 addition: |
|---|
| 168 | gap characters are indicated by "-". |
|---|
| 169 | |
|---|
| 170 | The alignment method produces a global, optimal alignment using an amino acid |
|---|
| 171 | weight matrix (PAM250 is default) and 2 gap penalty parameters. |
|---|
| 172 | |
|---|
| 173 | Profile alignments allow you to store alignments of your favourite sequences (as |
|---|
| 174 | long as they are in PIR format) and add new sequences to them in small bunches |
|---|
| 175 | at a time. One of the 2 profiles can simply be a single sequence. |
|---|
| 176 | |
|---|
| 177 | |
|---|
| 178 | |
|---|
| 179 | >>HELP<< 7 Help for phylogenetic trees |
|---|
| 180 | Before calculating a tree, you must have an alignment in memory. This can be |
|---|
| 181 | input in NBRF/PIR format or you should have just carried out a full multiple |
|---|
| 182 | alignment and the alignment is still in memory. |
|---|
| 183 | |
|---|
| 184 | The method used is the NJ (Neighbour Joining) method of Saitou and Nei. First |
|---|
| 185 | you calculate distances (percent divergence) between all pairs of sequences from |
|---|
| 186 | a multiple alignment; second you apply the NJ method to the distance matrix. |
|---|
| 187 | |
|---|
| 188 | EXCLUDE POSITIONS WITH GAPS? If you choose this option, any alignment positions |
|---|
| 189 | where ANY of the sequences have a gap will be ignored. This guarantees that |
|---|
| 190 | the distances will be 'metric'. Also, it means that 'like' will be compared to |
|---|
| 191 | 'like' in all distances. The disadvantage is that you may throw away much of |
|---|
| 192 | the data if there are many gaps. |
|---|
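
The effect of excluding positions with gaps on the percent divergence can be sketched as below. The function name and the `toss_gaps` flag are hypothetical. How gaps are treated when the option is off is not stated in the help text; the sketch assumes that positions where either sequence of the pair has a gap are skipped.

```python
def percent_divergence(alignment, i, j, toss_gaps=False):
    """Percent of compared positions at which sequences i and j differ.

    alignment : list of equal-length aligned sequences, '-' marks a gap
    toss_gaps : if True, ignore every column where ANY sequence has a gap
    """
    a, b = alignment[i], alignment[j]
    if toss_gaps:
        cols = [c for c in range(len(a))
                if all(seq[c] != "-" for seq in alignment)]
    else:
        # Assumption: only gaps in the two sequences being compared matter.
        cols = [c for c in range(len(a)) if a[c] != "-" and b[c] != "-"]
    if not cols:
        return 0.0
    diffs = sum(a[c] != b[c] for c in cols)
    return 100.0 * diffs / len(cols)


aln = ["ACGTAC",
       "AC-TAG",
       "TCGTAC"]
# Sequences 0 and 2 have no gaps, but column 2 is dropped when gaps are
# tossed, because sequence 1 has a gap there.
print(percent_divergence(aln, 0, 2, toss_gaps=True))  # 20.0
```

With the option on, every pairwise distance is computed over the same set of columns, which is what the text means by comparing 'like' with 'like'.
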
| 193 | |
|---|
| 194 | CORRECT FOR MULTIPLE SUBSTITUTIONS? For small divergence (say <10%) this |
|---|
| 195 | option makes little difference. For greater divergence, this option corrects |
|---|
| 196 | for the fact that observed distances underestimate actual evolutionary |
|---|
| 197 | distances. This is because, as sequences diverge, more than one substitution will |
|---|
| 198 | happen at many sites. However, you only see one difference when you look at the |
|---|
| 199 | present day sequences. Therefore, this option has the effect of stretching |
|---|
| 200 | branch lengths in trees (especially long branches). The corrections used here |
|---|
| 201 | (for DNA or proteins) are both due to Motoo Kimura. |
|---|
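
The help text does not give the correction formulas. As an illustration of how such a correction stretches distances, the sketch below assumes Kimura's commonly cited empirical formula for proteins, K = -ln(1 - D - D^2/5), where D is the observed fraction of differing sites. Whether the program uses exactly this form is an assumption, and the function name is hypothetical.

```python
import math


def kimura_protein_distance(d):
    """Corrected distance from the observed fraction of differing sites `d`.

    Assumes Kimura's empirical protein correction K = -ln(1 - D - D^2/5);
    the formula is undefined once 1 - D - D^2/5 reaches zero.
    """
    inner = 1.0 - d - 0.2 * d * d
    if inner <= 0.0:
        raise ValueError("divergence too large for this correction")
    return -math.log(inner)


# Small divergences barely change; large ones are stretched noticeably.
for d in (0.05, 0.10, 0.50):
    print(d, round(kimura_protein_distance(d), 4))
```
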
| 202 | |
|---|
| 203 | To calculate a tree, use option 4 (DRAW TREE NOW). This gives an UNROOTED |
|---|
| 204 | tree and all branch lengths. The root of the tree can only be inferred by |
|---|
| 205 | using an outgroup (a sequence that you are certain branches at the outside |
|---|
| 206 | of the tree .... certain on biological grounds) OR if you assume a degree |
|---|
| 207 | of constancy in the 'molecular clock', you can place the root along the |
|---|
| 208 | longest branch. |
|---|
| 209 | |
|---|
| 210 | BOOTSTRAPPING is a method for deriving confidence values for the groupings in |
|---|
| 211 | a tree (first adapted for trees by Joe Felsenstein). It involves making N |
|---|
| 212 | random samples of sites from the alignment (N should be LARGE, e.g. 500 - 1000); |
|---|
| 213 | drawing N trees (1 from each sample) and counting how many times each grouping |
|---|
| 214 | from the original tree occurs in the sample trees. For a group to be |
|---|
| 215 | considered significant at the 5% level (p <= 0.05) it should occur in at least 95% of |
|---|
| 216 | the sample trees. You must supply a seed number for the random number generator. |
|---|
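
The sampling step of bootstrapping can be sketched as follows. This is a minimal illustration that assumes the usual form of the bootstrap, in which sites (alignment columns) are drawn with replacement until the sample is as long as the original alignment. The function name is hypothetical, and tree building is left out.

```python
import random


def bootstrap_alignments(alignment, n_samples, seed):
    """Yield `n_samples` resampled alignments.

    alignment : list of equal-length aligned sequences
    seed      : seed for the random number generator, so runs can be repeated
    """
    rng = random.Random(seed)
    length = len(alignment[0])
    for _ in range(n_samples):
        # Draw column indices with replacement; the same columns are
        # taken from every sequence so the sample is still an alignment.
        cols = [rng.randrange(length) for _ in range(length)]
        yield ["".join(seq[c] for c in cols) for seq in alignment]


for sample in bootstrap_alignments(["ACGTAC", "ACGAAC", "TCGTAC"],
                                   n_samples=2, seed=111):
    print(sample)
```

A tree would then be drawn from each sample, and each grouping of the original tree counted across the sample trees.
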
| 217 | >>HELP<< 8 Help for choosing protein weight matrix |
|---|
| 218 | For protein alignments, you use a weight matrix to determine the similarity of |
|---|
| 219 | non-identical amino acids. For example, Tyr aligned with Phe is usually judged |
|---|
| 220 | to be 'better' than Tyr aligned with Pro. |
|---|
| 221 | |
|---|
| 222 | |
|---|
| 223 | |
|---|
| 224 | There are three 'in-built' weight matrices offered: |
|---|
| 225 | |
|---|
| 226 | |
|---|
| 227 | 1) PAM 100 and 2) PAM 250. These are from the work of M. Dayhoff and are often |
|---|
| 228 | simply called Dayhoff matrices. The PAM 250 matrix is the most commonly used |
|---|
| 229 | and is the default in most protein comparison packages. It is claimed that |
|---|
| 230 | a PAM 100 matrix is more sensitive in many cases, so we have included it |
|---|
| 231 | here. |
|---|
| 232 | |
|---|
| 233 | |
|---|
| 234 | 3) Identity matrix. This matrix just scores identical residues. |
|---|
| 235 | |
|---|
| 236 | |
|---|
| 237 | |
|---|
| 238 | |
|---|
| 239 | |
|---|
| 240 | You can also input your own matrix. If so then be careful: 1) follow the |
|---|
| 241 | instructions on format below; 2) watch the gap penalty parameters (the default |
|---|
| 242 | values may not be appropriate). Conservative substitutions will not be |
|---|
| 243 | indicated in alignments. |
|---|
| 244 | |
|---|
| 245 | The values in a new weight matrix must be integers and the scores should be |
|---|
| 246 | similarities. You can use negative as well as positive values if you wish. |
|---|
| 247 | |
|---|
| 248 | |
|---|
| 249 | INPUT FORMAT: The lower triangle of a 20x20 matrix of values is read in, in free |
|---|
| 250 | format, row by row. The diagonal must be included. Using the 1 letter code, |
|---|
| 251 | the order of amino acids in the matrix is: CSTPAGNDEQHRKMILVFYW. Separate |
|---|
| 252 | the values by spaces (not commas). You can put the values on as many lines |
|---|
| 253 | as you like as long as they are in the right order. |
|---|
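
Reading a user-supplied matrix in the format just described can be sketched as below. This is an illustration of the stated format, not the program's own reader; the function name is hypothetical. With the diagonal included, a 20x20 lower triangle holds 20 * 21 / 2 = 210 integers.

```python
ORDER = "CSTPAGNDEQHRKMILVFYW"


def read_lower_triangle(text, order=ORDER):
    """Parse whitespace-separated integers, row by row, diagonal included.

    Returns a dict mapping (residue, residue) pairs to scores, filled
    symmetrically. Line breaks in `text` may fall anywhere.
    """
    values = [int(tok) for tok in text.split()]  # commas would raise ValueError
    n = len(order)
    expected = n * (n + 1) // 2
    if len(values) != expected:
        raise ValueError(f"expected {expected} values, got {len(values)}")
    matrix = {}
    it = iter(values)
    for r in range(n):
        for c in range(r + 1):
            score = next(it)
            matrix[order[r], order[c]] = score
            matrix[order[c], order[r]] = score
    return matrix


# A 3-residue toy example (C, S, T) keeps the illustration short.
toy = read_lower_triangle("5\n1 4\n0 2 3", order="CST")
print(toy["S", "C"], toy["C", "S"], toy["T", "T"])  # 1 1 3
```
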
| 254 | |
|---|
| 255 | |
|---|
| 256 | GAP PENALTIES: The default gap penalty parameters work fine with a PAM 250 |
|---|
| 257 | matrix. The range of PAM 250 values is 0 to 25 (when rescaled to be positive) |
|---|
| 258 | and the default gap penalties are 10 each. Very approximately, the best gap |
|---|
| 259 | penalty settings are 2/5 the maximum weight matrix score. |
|---|
| 260 | >>HELP<< 9 Help for command line parameters |
|---|
| 261 | DATA (sequences) |
|---|
| 262 | |
|---|
| 263 | /INFILE=file.ext :input sequences. |
|---|
| 264 | /PROFILE1=file.ext and /PROFILE2=file.ext :profiles (old alignment). |
|---|
| 265 | |
|---|
| 266 | VERBS (do things) |
|---|
| 267 | |
|---|
| 268 | /HELP or /CHECK :list the command line params. |
|---|
| 269 | /ALIGN :do full multiple alignment |
|---|
| 270 | /TREE :calculate NJ tree. |
|---|
| 271 | /BOOTSTRAP(=n) :bootstrap a NJ tree (n= number of bootstraps; def. = 1000). |
|---|
| 272 | |
|---|
| 273 | PARAMETERS (set things) |
|---|
| 274 | |
|---|
| 275 | ***Pairwise alignments:*** |
|---|
| 276 | /KTUP=n :word size /TOPDIAGS=n :number of best diags. |
|---|
| 277 | /WINDOW=n :window around best diags. /PAIRGAP=n :gap penalty |
|---|
| 278 | |
|---|
| 279 | ***Multiple alignments:*** |
|---|
| 280 | /FIXEDGAP=n :fixed length gap pen. /FLOATGAP=n :variable length gap pen. |
|---|
| 281 | /MATRIX= :PAM100 or ID or file name. /TYPE=p or d :type is prot. or DNA |
|---|
| 282 | /OUTPUT= :GCG or PHYLIP or PIR. /TRANSIT :transitions not weighted. |
|---|
| 283 | |
|---|
| 284 | ***Trees:*** /SEED :seed number for bootstraps. |
|---|
| 285 | /KIMURA :use Kimura's correction. /TOSSGAPS :ignore positions with gaps. |
|---|
| 286 | |
|---|
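
The parameters above can be combined on one command line. The sketch below only assembles the argument list. The executable name `clustalv` and the input file name `globins.pir` are assumptions for illustration, and the helper function is hypothetical; only switches listed in this help section are used.

```python
def clustalv_command(infile, verb="ALIGN", **params):
    """Build an argument list such as
    ['clustalv', '/INFILE=globins.pir', '/ALIGN', '/OUTPUT=PIR'].

    A parameter set to True is emitted as a bare switch (e.g. /KIMURA);
    any other value is emitted as /NAME=value.
    """
    args = ["clustalv", f"/INFILE={infile}", f"/{verb}"]  # program name assumed
    for name, value in params.items():
        switch = f"/{name.upper()}"
        args.append(switch if value is True else f"{switch}={value}")
    return args


print(" ".join(clustalv_command("globins.pir", output="PIR",
                                fixedgap=10, floatgap=10)))
# clustalv /INFILE=globins.pir /ALIGN /OUTPUT=PIR /FIXEDGAP=10 /FLOATGAP=10
```

The resulting list could be passed to `subprocess.run` on a system where the program is installed under that name.
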