[176] | 1 | |
---|
[1754] | 2 | This is the on-line help file for CLUSTAL W ( version 1.83). |
---|
[176] | 3 | |
---|
| 4 | It should be named or defined as: clustalw_help |
---|
| 5 | except with MSDOS in which case it should be named CLUSTALW.HLP |
---|
| 6 | |
---|
| 7 | For full details of usage and algorithms, please read the CLUSTALW.DOC file. |
---|
| 8 | |
---|
| 9 | |
---|
| 10 | Toby Gibson EMBL, Heidelberg, Germany. |
---|
| 11 | Des Higgins UCC, Cork, Ireland. |
---|
| 12 | Julie Thompson IGBMC, Strasbourg, France. |
---|
| 13 | |
---|
| 14 | |
---|
| 15 | |
---|
[1754] | 16 | >>NEW << |
---|
| 17 | |
---|
| 18 | Fasta output |
---|
| 19 | =========== |
---|
| 20 | |
---|
| 21 | Write/Read sequence with range specified. The command line syntax |
---|
| 22 | for range specification is flexible. You can use one of the following |
---|
| 23 | forms. |
---|
| 24 | |
---|
| 25 | -range=n:m |
---|
| 26 | -range=n-m |
---|
| 27 | -range="n m" |
---|
| 28 | |
---|
| 29 | where n is the starting position and m is the length of the sequence. |
---|
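As a rough illustration of how flexible the range syntax is, the three forms listed above can be handled by one small parser. This is only a sketch in Python; the function name parse_range is hypothetical and is not part of CLUSTAL W. It returns the two numbers in the order they were written.

```python
import re

def parse_range(value):
    """Parse a -range value written as n:m, n-m or "n m" (hypothetical helper)."""
    # Drop surrounding whitespace and optional double quotes.
    cleaned = value.strip().strip('"')
    # Two integers separated by ':', '-' or whitespace.
    match = re.fullmatch(r"(\d+)\s*[:\-\s]\s*(\d+)", cleaned)
    if match is None:
        raise ValueError(f"unrecognised range: {value!r}")
    return int(match.group(1)), int(match.group(2))

print(parse_range("10:50"))
```

All three spellings of the same range give the same pair of numbers.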
| 30 | |
---|
| 31 | Range and range numbers. |
---|
| 32 | ======================= |
---|
| 33 | |
---|
| 34 | Include range numbers in the output. |
---|
| 35 | |
---|
| 36 | -seqno_range=on/off |
---|
| 37 | |
---|
| 38 | The sequence range will be appended to the names of the sequences. |
---|
| 39 | |
---|
| 40 | |
---|
| 41 | PIM: Percentage Identity Matrix |
---|
| 42 | =============================== |
---|
| 43 | |
---|
| 44 | |
---|
| 45 | |
---|
[176] | 46 | >>HELP 1 << General help for CLUSTAL W (1.81) |
---|
| 47 | |
---|
| 48 | Clustal W is a general purpose multiple alignment program for DNA or proteins. |
---|
| 49 | |
---|
| 50 | SEQUENCE INPUT: all sequences must be in 1 file, one after another. |
---|
| 51 | 7 formats are automatically recognised: NBRF-PIR, EMBL-SWISSPROT, |
---|
| 52 | Pearson (Fasta), Clustal (*.aln), GCG-MSF (Pileup), GCG9-RSF and GDE flat file. |
---|
| 53 | All non-alphabetic characters (spaces, digits, punctuation marks) are ignored |
---|
| 54 | except "-" which is used to indicate a GAP ("." in MSF-RSF). |
---|
| 55 | |
---|
| 56 | To do a MULTIPLE ALIGNMENT on a set of sequences, use item 1 from this menu to |
---|
| 57 | INPUT them; go to menu item 2 to do the multiple alignment. |
---|
| 58 | |
---|
| 59 | PROFILE ALIGNMENTS (menu item 3) are used to align 2 alignments. Use this to |
---|
| 60 | add a new sequence to an old alignment, or to use secondary structure to guide |
---|
| 61 | the alignment process. GAPS in the old alignments are indicated using the "-" |
---|
| 62 | character. PROFILES can be input in ANY of the allowed formats; just |
---|
| 63 | use "-" (or "." for MSF-RSF) for each gap position. |
---|
| 64 | |
---|
| 65 | PHYLOGENETIC TREES (menu item 4) can be calculated from old alignments (read in |
---|
| 66 | with "-" characters to indicate gaps) OR after a multiple alignment while the |
---|
| 67 | alignment is still in memory. |
---|
| 68 | |
---|
| 69 | |
---|
| 70 | The program tries to automatically recognise the different file formats used |
---|
| 71 | and to guess whether the sequences are amino acid or nucleotide. This is not |
---|
| 72 | always foolproof. |
---|
| 73 | |
---|
| 74 | FASTA and NBRF-PIR formats are recognised by having a ">" as the first |
---|
| 75 | character in the file. |
---|
| 76 | |
---|
| 77 | EMBL-Swiss Prot formats are recognised by the letters |
---|
| 78 | ID at the start of the file (the token for the entry name field). |
---|
| 79 | |
---|
| 80 | CLUSTAL format is recognised by the word CLUSTAL at the beginning of the file. |
---|
| 81 | |
---|
| 82 | GCG-MSF format is recognised by one of the following: |
---|
| 83 | - the word PileUp at the start of the file. |
---|
| 84 | - the word !!AA_MULTIPLE_ALIGNMENT or !!NA_MULTIPLE_ALIGNMENT |
---|
| 85 | at the start of the file. |
---|
| 86 | - the word MSF on the first line of the file, and the characters .. |
---|
| 87 | at the end of this line. |
---|
| 88 | |
---|
| 89 | GCG-RSF format is recognised by the word !!RICH_SEQUENCE at the beginning of |
---|
| 90 | the file. |
---|
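The recognition rules above can be sketched as a simple check on the start of the file. This is a minimal Python illustration, not the program's own code: the function name guess_format is hypothetical, leading whitespace is skipped as an assumption, and FASTA and NBRF-PIR are not told apart because the rule given here is the same for both.

```python
def guess_format(text):
    """Guess the input format from the start of the file (hypothetical helper)."""
    stripped = text.lstrip()  # assumption: leading blank space is ignored
    first_line = stripped.splitlines()[0] if stripped else ""
    if stripped.startswith(">"):
        return "FASTA or NBRF-PIR"
    if stripped.startswith("ID"):
        return "EMBL-SWISSPROT"
    if stripped.startswith("CLUSTAL"):
        return "CLUSTAL"
    if stripped.startswith("!!RICH_SEQUENCE"):
        return "GCG-RSF"
    if (stripped.startswith("PileUp")
            or stripped.startswith(("!!AA_MULTIPLE_ALIGNMENT",
                                    "!!NA_MULTIPLE_ALIGNMENT"))
            or ("MSF" in first_line and first_line.rstrip().endswith(".."))):
        return "GCG-MSF"
    return "unknown"

print(guess_format(">seq1\nACGT\n"))
```

Anything that matches none of the rules is reported as "unknown" rather than guessed.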
| 91 | |
---|
| 92 | |
---|
| 93 | If 85% or more of the characters in the sequence are from A,C,G,T,U or N, the |
---|
| 94 | sequence will be assumed to be nucleotide. This works in 97.3% of cases |
---|
| 95 | but watch out! |
---|
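The 85% rule above is easy to try out by hand. The following Python sketch only illustrates the rule as stated here; the function name looks_like_nucleotide is hypothetical, and non-alphabetic characters (including gaps) are skipped on the assumption that they are ignored as described for sequence input.

```python
def looks_like_nucleotide(sequence, threshold=0.85):
    """Return True if at least `threshold` of the letters are A, C, G, T, U or N."""
    # Only alphabetic characters are counted; gaps and digits are skipped.
    letters = [c.upper() for c in sequence if c.isalpha()]
    if not letters:
        return False
    hits = sum(1 for c in letters if c in "ACGTUN")
    return hits / len(letters) >= threshold

print(looks_like_nucleotide("ACGTACGTNN"))
```

A short protein rich in A, C, G and T could still be misclassified, which is why the guess should be checked.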
| 96 | |
---|
| 97 | >>HELP 2 << Help for multiple alignments |
---|
| 98 | |
---|
| 99 | If you have already loaded sequences, use menu item 1 to do the complete |
---|
| 100 | multiple alignment. You will be prompted for 2 output files: 1 for the |
---|
| 101 | alignment itself; another to store a dendrogram that describes the similarity |
---|
| 102 | of the sequences to each other. |
---|
| 103 | |
---|
| 104 | Multiple alignments are carried out in 3 stages (automatically done from menu |
---|
| 105 | item 1 ...Do complete multiple alignments now): |
---|
| 106 | |
---|
| 107 | 1) all sequences are compared to each other (pairwise alignments); |
---|
| 108 | |
---|
| 109 | 2) a dendrogram (like a phylogenetic tree) is constructed, describing the |
---|
| 110 | approximate groupings of the sequences by similarity (stored in a file). |
---|
| 111 | |
---|
| 112 | 3) the final multiple alignment is carried out, using the dendrogram as a guide. |
---|
| 113 | |
---|
| 114 | |
---|
| 115 | PAIRWISE ALIGNMENT parameters control the speed-sensitivity of the initial |
---|
| 116 | alignments. |
---|
| 117 | |
---|
| 118 | MULTIPLE ALIGNMENT parameters control the gaps in the final multiple alignments. |
---|
| 119 | |
---|
| 120 | |
---|
| 121 | RESET GAPS (menu item 7) will remove any new gaps introduced into the sequences |
---|
| 122 | during multiple alignment if you wish to change the parameters and try again. |
---|
| 123 | This only takes effect just before you do a second multiple alignment. You |
---|
| 124 | can make phylogenetic trees after alignment whether or not this is ON. |
---|
| 125 | If you turn this OFF, the new gaps are kept even if you do a second multiple |
---|
| 126 | alignment. This allows you to iterate the alignment gradually. Sometimes, the |
---|
| 127 | alignment is improved by a second or third pass. |
---|
| 128 | |
---|
| 129 | SCREEN DISPLAY (menu item 8) can be used to send the output alignments to the |
---|
| 130 | screen as well as to the output file. |
---|
| 131 | |
---|
| 132 | You can skip the first stages (pairwise alignments; dendrogram) by using an |
---|
| 133 | old dendrogram file (menu item 3); or you can just produce the dendrogram |
---|
| 134 | with no final multiple alignment (menu item 2). |
---|
| 135 | |
---|
| 136 | |
---|
| 137 | OUTPUT FORMAT: Menu item 9 (format options) allows you to choose from 7 |
---|
[1754] | 138 | different alignment formats (CLUSTAL, GCG, NBRF-PIR, PHYLIP, GDE, NEXUS, and FASTA). |
---|
[176] | 139 | |
---|
| 140 | |
---|
| 141 | >>HELP 3 << Help for pairwise alignment parameters |
---|
| 142 | A distance is calculated between every pair of sequences and these are used to |
---|
| 143 | construct the dendrogram which guides the final multiple alignment. The scores |
---|
| 144 | are calculated from separate pairwise alignments. These can be calculated using |
---|
| 145 | 2 methods: dynamic programming (slow but accurate) or by the method of Wilbur |
---|
| 146 | and Lipman (extremely fast but approximate). |
---|
| 147 | |
---|
| 148 | You can choose between the 2 alignment methods using menu option 8. The |
---|
| 149 | slow-accurate method is fine for short sequences but will be VERY SLOW for |
---|
| 150 | many (e.g. >100) long (e.g. >1000 residue) sequences. |
---|
| 151 | |
---|
| 152 | SLOW-ACCURATE alignment parameters: |
---|
| 153 | These parameters do not have any effect on the speed of the alignments. |
---|
| 154 | They are used to give initial alignments which are then rescored to give percent |
---|
| 155 | identity scores. These % scores are the ones which are displayed on the |
---|
| 156 | screen. The scores are converted to distances for the trees. |
---|
| 157 | |
---|
| 158 | 1) Gap Open Penalty: the penalty for opening a gap in the alignment. |
---|
| 159 | 2) Gap extension penalty: the penalty for extending a gap by 1 residue. |
---|
| 160 | 3) Protein weight matrix: the scoring table which describes the similarity |
---|
| 161 | of each amino acid to each other. |
---|
| 162 | 4) DNA weight matrix: the scores assigned to matches and mismatches |
---|
| 163 | (including IUB ambiguity codes). |
---|
| 164 | |
---|
| 165 | |
---|
| 166 | FAST-APPROXIMATE alignment parameters: |
---|
| 167 | |
---|
[6136] | 168 | These similarity scores are calculated from fast, approximate, global alignments, |
---|
| 169 | which are controlled by 4 parameters. 2 techniques are used to make |
---|
[176] | 170 | these alignments very fast: 1) only exactly matching fragments (k-tuples) are |
---|
| 171 | considered; 2) only the 'best' diagonals (the ones with most k-tuple matches) |
---|
| 172 | are used. |
---|
| 173 | |
---|
| 174 | K-TUPLE SIZE: This is the size of exactly matching fragment that is used. |
---|
| 175 | INCREASE for speed (max= 2 for proteins; 4 for DNA), DECREASE for sensitivity. |
---|
| 176 | For longer sequences (e.g. >1000 residues) you may need to increase the default. |
---|
| 177 | |
---|
| 178 | GAP PENALTY: This is a penalty for each gap in the fast alignments. It has |
---|
| 179 | little effect on the speed or sensitivity except for extreme values. |
---|
| 180 | |
---|
| 181 | TOP DIAGONALS: The number of k-tuple matches on each diagonal (in an imaginary |
---|
| 182 | dot-matrix plot) is calculated. Only the best ones (with most matches) are |
---|
| 183 | used in the alignment. This parameter specifies how many. Decrease for speed; |
---|
| 184 | increase for sensitivity. |
---|
| 185 | |
---|
| 186 | WINDOW SIZE: This is the number of diagonals around each of the 'best' |
---|
| 187 | diagonals that will be used. Decrease for speed; increase for sensitivity. |
---|
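To make the K-TUPLE SIZE and TOP DIAGONALS ideas more concrete, the sketch below counts exact k-tuple matches on each diagonal of an imaginary dot-matrix plot and keeps the best ones. It is a simplified Python illustration only and is not the Wilbur and Lipman implementation used by the program; the function name top_diagonals and its arguments are hypothetical.

```python
from collections import Counter, defaultdict

def top_diagonals(seq_a, seq_b, ktuple=2, top=5):
    """Return the `top` diagonals with the most exact k-tuple matches."""
    # Index every k-tuple of seq_a by its start positions.
    positions = defaultdict(list)
    for i in range(len(seq_a) - ktuple + 1):
        positions[seq_a[i:i + ktuple]].append(i)
    # A match starting at (i, j) lies on diagonal i - j of the dot-matrix plot.
    counts = Counter()
    for j in range(len(seq_b) - ktuple + 1):
        for i in positions.get(seq_b[j:j + ktuple], ()):
            counts[i - j] += 1
    # Each entry is (diagonal, number of k-tuple matches), best first.
    return counts.most_common(top)

print(top_diagonals("ACGTACGT", "ACGTACGT", ktuple=2, top=1))
```

A larger k-tuple gives fewer, more specific matches (faster); a larger number of kept diagonals lets weaker similarities contribute (more sensitive).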
| 188 | |
---|
| 189 | |
---|
| 190 | >>HELP 4 << Help for multiple alignment parameters |
---|
| 191 | |
---|
| 192 | These parameters control the final multiple alignment. This is the core of the |
---|
| 193 | program and the details are complicated. To fully understand the use of the |
---|
| 194 | parameters and the scoring system, you will have to refer to the documentation. |
---|
| 195 | |
---|
| 196 | Each step in the final multiple alignment consists of aligning two alignments |
---|
| 197 | or sequences. This is done progressively, following the branching order in |
---|
| 198 | the GUIDE TREE. The basic parameters to control this are two gap penalties and |
---|
[6141] | 199 | the scores for various identical-non-identical residues. |
---|
[176] | 200 | |
---|
[10842] | 201 | 1) and .. |
---|
| 202 | |
---|
| 203 | 2) The GAP PENALTIES are set by menu items 1 and 2. These control the |
---|
[176] | 204 | cost of opening up every new gap and the cost of every item in a gap. |
---|
| 205 | Increasing the gap opening penalty will make gaps less frequent. Increasing |
---|
| 206 | the gap extension penalty will make gaps shorter. Terminal gaps are not |
---|
| 207 | penalised. |
---|
| 208 | |
---|
| 209 | 3) The DELAY DIVERGENT SEQUENCES switch delays the alignment of the most |
---|
| 210 | distantly related sequences until after the most closely related sequences have |
---|
| 211 | been aligned. The setting shows the percent identity level required to delay |
---|
| 212 | the addition of a sequence; sequences that are less identical than this level |
---|
| 213 | to any other sequences will be aligned later. |
---|
| 214 | |
---|
| 215 | |
---|
| 216 | |
---|
| 217 | 4) The TRANSITION WEIGHT gives transitions (A <--> G or C <--> T |
---|
| 218 | i.e. purine-purine or pyrimidine-pyrimidine substitutions) a weight between 0 |
---|
| 219 | and 1; a weight of zero means that the transitions are scored as mismatches, |
---|
| 220 | while a weight of 1 gives the transitions the match score. For distantly related |
---|
| 221 | DNA sequences, the weight should be near to zero; for closely related sequences |
---|
| 222 | it can be useful to assign a higher score. |
---|
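The effect of the transition weight can be sketched as follows. The two end points (weight 0 scores a transition as a mismatch, weight 1 scores it as a match) come from the description above; the linear scaling for values in between, the match and mismatch scores, and the function names are assumptions made for this Python illustration.

```python
PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}

def is_transition(a, b):
    """True for A<->G or C<->T substitutions."""
    return a != b and ({a, b} <= PURINES or {a, b} <= PYRIMIDINES)

def pair_score(a, b, weight, match=1.0, mismatch=0.0):
    """Score one aligned pair of bases with a transition weight between 0 and 1."""
    a, b = a.upper(), b.upper()
    if a == b:
        return match
    if is_transition(a, b):
        # Assumed linear scaling between the mismatch and match scores.
        return mismatch + weight * (match - mismatch)
    return mismatch

print(pair_score("A", "G", 0.5))
```

Transversions (for example A aligned with C) always receive the mismatch score, whatever the weight.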
| 223 | |
---|
| 224 | |
---|
| 225 | 5) PROTEIN WEIGHT MATRIX leads to a new menu where you are offered a choice of |
---|
| 226 | weight matrices. The default for proteins in version 1.8 is the PAM series |
---|
| 227 | derived by Gonnet and colleagues. Note, a series is used! The actual matrix |
---|
| 228 | that is used depends on how similar the sequences to be aligned at this |
---|
| 229 | alignment step are. Different matrices work differently at each evolutionary |
---|
| 230 | distance. |
---|
| 231 | |
---|
| 232 | 6) DNA WEIGHT MATRIX leads to a new menu where a single matrix (not a series) |
---|
| 233 | can be selected. The default is the matrix used by BESTFIT for comparison of |
---|
| 234 | nucleic acid sequences. |
---|
| 235 | |
---|
| 236 | Further help is offered in the weight matrix menu. |
---|
| 237 | |
---|
| 238 | |
---|
| 239 | 7) In the weight matrices, you can use negative as well as positive values if |
---|
| 240 | you wish, although the matrix will be automatically adjusted to all positive |
---|
| 241 | scores, unless the NEGATIVE MATRIX option is selected. |
---|
| 242 | |
---|
| 243 | 8) PROTEIN GAP PARAMETERS displays a menu allowing you to set some Gap Penalty |
---|
| 244 | options which are only used in protein alignments. |
---|
| 245 | |
---|
| 246 | |
---|
| 247 | >>HELP A << Help for protein gap parameters. |
---|
| 248 | 1) RESIDUE SPECIFIC PENALTIES are amino acid specific gap penalties that reduce |
---|
| 249 | or increase the gap opening penalties at each position in the alignment or |
---|
| 250 | sequence. See the documentation for details. As an example, positions that |
---|
| 251 | are rich in glycine are more likely to have an adjacent gap than positions that |
---|
| 252 | are rich in valine. |
---|
| 253 | |
---|
[10842] | 254 | 2) [and ..] |
---|
| 255 | |
---|
| 256 | 3) HYDROPHILIC GAP PENALTIES are used to increase the chances of a gap within |
---|
[176] | 257 | a run (5 or more residues) of hydrophilic amino acids; these are likely to |
---|
| 258 | be loop or random coil regions where gaps are more common. The residues that |
---|
| 259 | are "considered" to be hydrophilic are set by menu item 3. |
---|
| 260 | |
---|
| 261 | 4) GAP SEPARATION DISTANCE tries to decrease the chances of gaps being too |
---|
| 262 | close to each other. Gaps that are less than this distance apart are penalised |
---|
| 263 | more than other gaps. This does not prevent close gaps; it makes them less |
---|
| 264 | frequent, promoting a block-like appearance of the alignment. |
---|
| 265 | |
---|
| 266 | 5) END GAP SEPARATION treats end gaps just like internal gaps for the purposes |
---|
| 267 | of avoiding gaps that are too close (set by GAP SEPARATION DISTANCE above). |
---|
| 268 | If you turn this off, end gaps will be ignored for this purpose. This is |
---|
| 269 | useful when you wish to align fragments where the end gaps are not biologically |
---|
| 270 | meaningful. |
---|
| 271 | >>HELP 5 << Help for output format options. |
---|
| 272 | |
---|
| 273 | Six output formats are offered. You can choose any (or all 6 if you wish). |
---|
| 274 | |
---|
| 275 | CLUSTAL format output is a self explanatory alignment format. It shows the |
---|
| 276 | sequences aligned in blocks. It can be read in again at a later date to |
---|
| 277 | (for example) calculate a phylogenetic tree or add a new sequence with a |
---|
| 278 | profile alignment. |
---|
| 279 | |
---|
| 280 | GCG output can be used by any of the GCG programs that can work on multiple |
---|
| 281 | alignments (e.g. PRETTY, PROFILEMAKE, PLOTALIGN). It is the same as the GCG |
---|
| 282 | .msf format files (multiple sequence file); new in version 7 of GCG. |
---|
| 283 | |
---|
| 284 | PHYLIP format output can be used for input to the PHYLIP package of Joe |
---|
| 285 | Felsenstein. This is an extremely widely used package for doing every |
---|
[6136] | 286 | imaginable form of phylogenetic analysis (MUCH more than the modest |
---|
| 287 | introduction offered by this program). |
---|
[176] | 288 | |
---|
| 289 | NBRF-PIR: this is the same as the standard PIR format with ONE ADDITION. Gap |
---|
| 290 | characters "-" are used to indicate the positions of gaps in the multiple |
---|
| 291 | alignment. These files can be re-used as input in any part of clustal that |
---|
| 292 | allows sequences (or alignments or profiles) to be read in. |
---|
| 293 | |
---|
| 294 | GDE: this is the flat file format used by the GDE package of Steven Smith. |
---|
| 295 | |
---|
| 296 | NEXUS: the format used by several phylogeny programs, including PAUP and |
---|
| 297 | MacClade. |
---|
| 298 | |
---|
| 299 | GDE OUTPUT CASE: sequences in GDE format may be written in either upper or |
---|
| 300 | lower case. |
---|
| 301 | |
---|
| 302 | CLUSTALW SEQUENCE NUMBERS: residue numbers may be added to the end of the |
---|
| 303 | alignment lines in clustalw format. |
---|
| 304 | |
---|
| 305 | OUTPUT ORDER is used to control the order of the sequences in the output |
---|
| 306 | alignments. By default, the order corresponds to the order in which the |
---|
| 307 | sequences were aligned (from the guide tree-dendrogram), thus automatically |
---|
| 308 | grouping closely related sequences. This switch can be used to set the order |
---|
| 309 | to the same as the input file. |
---|
[1754] | 310 | |
---|
[176] | 311 | PARAMETER OUTPUT: This option allows you to save all your parameter settings |
---|
| 312 | in a parameter file. This file can be used subsequently to rerun Clustal W |
---|
| 313 | using the same parameters. |
---|
| 314 | |
---|
| 315 | >>HELP 6 << Help for profile and structure alignments |
---|
| 316 | |
---|
| 317 | By PROFILE ALIGNMENT, we mean alignment using existing alignments. Profile |
---|
| 318 | alignments allow you to store alignments of your favourite sequences and add |
---|
| 319 | new sequences to them in small bunches at a time. A profile is simply an |
---|
| 320 | alignment of one or more sequences (e.g. an alignment output file from CLUSTAL |
---|
| 321 | W). Each input can be a single sequence. One or both sets of input sequences |
---|
| 322 | may include secondary structure assignments or gap penalty masks to guide the |
---|
| 323 | alignment. |
---|
| 324 | |
---|
| 325 | The profiles can be in any of the allowed input formats with "-" characters |
---|
| 326 | used to specify gaps (except for MSF-RSF where "." is used). |
---|
| 327 | |
---|
| 328 | You have to specify the 2 profiles by choosing menu items 1 and 2 and giving |
---|
| 329 | 2 file names. Then Menu item 3 will align the 2 profiles to each other. |
---|
| 330 | Secondary structure masks in either profile can be used to guide the alignment. |
---|
| 331 | |
---|
| 332 | Menu item 4 will take the sequences in the second profile and align them to |
---|
| 333 | the first profile, 1 at a time. This is useful to add some new sequences to |
---|
| 334 | an existing alignment, or to align a set of sequences to a known structure. |
---|
| 335 | In this case, the second profile would not be pre-aligned. |
---|
| 336 | |
---|
| 337 | |
---|
| 338 | The alignment parameters can be set using menu items 5, 6 and 7. These are |
---|
| 339 | EXACTLY the same parameters as used by the general, automatic multiple |
---|
| 340 | alignment procedure. The general multiple alignment procedure is simply a |
---|
| 341 | series of profile alignments. Carrying out a series of profile alignments on |
---|
| 342 | larger and larger groups of sequences allows you to manually build up a |
---|
| 343 | complete alignment, if necessary editing intermediate alignments. |
---|
| 344 | |
---|
| 345 | SECONDARY STRUCTURE OPTIONS. Menu Option 0 allows you to set 2D structure |
---|
| 346 | parameters. If a solved structure is available, it can be used to guide the |
---|
| 347 | alignment by raising gap penalties within secondary structure elements, so |
---|
| 348 | that gaps will preferentially be inserted into unstructured surface loops. |
---|
| 349 | Alternatively, a user-specified gap penalty mask can be supplied directly. |
---|
| 350 | |
---|
| 351 | A gap penalty mask is a series of numbers between 1 and 9, one per position in |
---|
| 352 | the alignment. Each number specifies how much the gap opening penalty is to be |
---|
| 353 | raised at that position (raised by multiplying the basic gap opening penalty |
---|
| 354 | by the number) i.e. a mask figure of 1 at a position means no change |
---|
| 355 | in gap opening penalty; a figure of 4 means that the gap opening penalty is |
---|
| 356 | four times greater at that position, making gaps 4 times harder to open. |
---|
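The arithmetic of a gap penalty mask can be shown in a few lines. This Python sketch simply multiplies a basic gap opening penalty by each mask digit, as described above; the function name apply_gap_mask and the example penalty of 10.0 are hypothetical.

```python
def apply_gap_mask(base_gap_open, mask):
    """Return the gap opening penalty at each position for a mask of digits 1-9."""
    if not all(ch in "123456789" for ch in mask):
        raise ValueError("mask must contain only the digits 1 to 9")
    # One penalty per alignment position: basic penalty times the mask digit.
    return [base_gap_open * int(ch) for ch in mask]

print(apply_gap_mask(10.0, "1124"))
```

With a basic penalty of 10.0, the mask 1124 leaves the first two positions unchanged, doubles the third and makes the fourth four times as expensive.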
| 357 | |
---|
| 358 | The format for gap penalty masks and secondary structure masks is explained |
---|
| 359 | in the help under option 0 (secondary structure options). |
---|
| 360 | >>HELP B << Help for secondary structure - gap penalty masks |
---|
| 361 | |
---|
| 362 | The use of secondary structure-based penalties has been shown to improve the |
---|
| 363 | accuracy of multiple alignment. Therefore CLUSTAL W now allows gap penalty |
---|
| 364 | masks to be supplied with the input sequences. The masks work by raising gap |
---|
| 365 | penalties in specified regions (typically secondary structure elements) so that |
---|
| 366 | gaps are preferentially opened in the less well conserved regions (typically |
---|
| 367 | surface loops). |
---|
| 368 | |
---|
| 369 | Options 1 and 2 control whether the input secondary structure information or |
---|
| 370 | gap penalty masks will be used. |
---|
| 371 | |
---|
| 372 | Option 3 controls whether the secondary structure and gap penalty masks should |
---|
| 373 | be included in the output alignment. |
---|
| 374 | |
---|
| 375 | Options 4 and 5 provide the value for raising the gap penalty at core Alpha |
---|
| 376 | Helical (A) and Beta Strand (B) residues. In CLUSTAL format, capital residues |
---|
| 377 | denote the A and B core structure notation. The basic gap penalties are |
---|
| 378 | multiplied by the amount specified. |
---|
| 379 | |
---|
| 380 | Option 6 provides the value for the gap penalty in Loops. By default this |
---|
| 381 | penalty is not raised. In CLUSTAL format, loops are specified by "." in the |
---|
| 382 | secondary structure notation. |
---|
| 383 | |
---|
| 384 | Option 7 provides the value for setting the gap penalty at the ends of |
---|
| 385 | secondary structures. Ends of secondary structures are observed to grow |
---|
| 386 | and-or shrink in related structures. Therefore by default these are given |
---|
| 387 | intermediate values, lower than the core penalties. All secondary structure |
---|
| 388 | read in as lower case in CLUSTAL format gets the reduced terminal penalty. |
---|
| 389 | |
---|
| 390 | Options 8 and 9 specify the range of structure termini for the intermediate |
---|
| 391 | penalties. In the alignment output, these are indicated as lower case. |
---|
| 392 | For Alpha Helices, by default, the range spans the end helical turn. For |
---|
| 393 | Beta Strands, the default range spans the end residue and the adjacent loop |
---|
| 394 | residue, since sequence conservation often extends beyond the actual H-bonded |
---|
| 395 | Beta Strand. |
---|
| 396 | |
---|
| 397 | CLUSTAL W can read the masks from SWISS-PROT, CLUSTAL or GDE format input |
---|
| 398 | files. For many 3-D protein structures, secondary structure information is |
---|
| 399 | recorded in the feature tables of SWISS-PROT database entries. You should |
---|
| 400 | always check that the assignments are correct - some are quite inaccurate. |
---|
| 401 | CLUSTAL W looks for SWISS-PROT HELIX and STRAND assignments e.g. |
---|
| 402 | |
---|
| 403 | FT HELIX 100 115 |
---|
| 404 | FT STRAND 118 119 |
---|
| 405 | |
---|
| 406 | The structure and penalty masks can also be read from CLUSTAL alignment format |
---|
| 407 | as comment lines beginning "!SS_" or "!GM_" e.g. |
---|
| 408 | |
---|
| 409 | !SS_HBA_HUMA ..aaaAAAAAAAAAAaaa.aaaAAAAAAAAAAaaaaaaAaaa.........aaaAAAAAA |
---|
| 410 | !GM_HBA_HUMA 112224444444444222122244444444442222224222111111111222444444 |
---|
| 411 | HBA_HUMA VLSPADKTNVKAAWGKVGAHAGEYGAEALERMFLSFPTTKTYFPHFDLSHGSAQVKGHGK |
---|
| 412 | |
---|
| 413 | Note that the mask itself is a set of numbers between 1 and 9 each of which is |
---|
| 414 | assigned to the residue(s) in the same column below. |
---|
| 415 | |
---|
| 416 | In GDE flat file format, the masks are specified as text and the names must |
---|
| 417 | begin with "SS_ or "GM_. |
---|
| 418 | |
---|
| 419 | Either a structure or penalty mask or both may be used. If both are included in |
---|
| 420 | an alignment, the user will be asked which is to be used. |
---|
| 421 | |
---|
| 422 | >>HELP C << Help for secondary structure - gap penalty mask output options |
---|
| 423 | |
---|
| 424 | The options in this menu let you choose whether or not to include the masks |
---|
| 425 | in the CLUSTAL W output alignments. Showing both is useful for understanding |
---|
| 426 | how the masks work. The secondary structure information is itself very useful |
---|
| 427 | in judging the alignment quality and in seeing how residue conservation |
---|
| 428 | patterns vary with secondary structure. |
---|
| 429 | |
---|
| 430 | |
---|
| 431 | >>HELP 7 << Help for phylogenetic trees |
---|
| 432 | |
---|
| 433 | 1) Before calculating a tree, you must have an ALIGNMENT in memory. This can be |
---|
| 434 | input in any format, or it can be the result of a full multiple alignment |
---|
| 435 | that you have just carried out and that is still in memory. |
---|
| 436 | |
---|
| 437 | |
---|
| 438 | *************** Remember YOU MUST ALIGN THE SEQUENCES FIRST!!!! *************** |
---|
| 439 | |
---|
| 440 | |
---|
| 441 | The method used is the NJ (Neighbour Joining) method of Saitou and Nei. First |
---|
| 442 | you calculate distances (percent divergence) between all pairs of sequences from |
---|
| 443 | a multiple alignment; second you apply the NJ method to the distance matrix. |
---|
| 444 | |
---|
| 445 | 2) EXCLUDE POSITIONS WITH GAPS? With this option, any alignment positions where |
---|
| 446 | ANY of the sequences have a gap will be ignored. This means that 'like' will be |
---|
| 447 | compared to 'like' in all distances, which is highly desirable. It also |
---|
| 448 | automatically throws away the most ambiguous parts of the alignment, which are |
---|
| 449 | concentrated around gaps (usually). The disadvantage is that you may throw away |
---|
| 450 | much of the data if there are many gaps (which is why it is difficult for us to |
---|
| 451 | make it the default). |
---|
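What excluding positions with gaps does to an alignment can be sketched like this. The Python below is an illustration only: the function name drop_gap_columns is hypothetical, the sequences are assumed to be of equal length, and both "-" and "." are treated as gap characters.

```python
def drop_gap_columns(alignment, gap_chars="-."):
    """Remove every column in which ANY sequence has a gap."""
    # zip(*alignment) walks the alignment column by column.
    kept = [col for col in zip(*alignment)
            if not any(c in gap_chars for c in col)]
    if not kept:
        return ["" for _ in alignment]
    # Transpose the kept columns back into rows.
    return ["".join(row) for row in zip(*kept)]

print(drop_gap_columns(["AC-GT", "A-CGT", "ACCGT"]))
```

In this small example two of the five columns are lost, which shows how quickly data can disappear when gaps are frequent.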
| 452 | |
---|
| 453 | |
---|
| 454 | |
---|
| 455 | 3) CORRECT FOR MULTIPLE SUBSTITUTIONS? For small divergence (say <10%) this |
---|
| 456 | option makes no difference. For greater divergence, it corrects for the fact |
---|
| 457 | that observed distances underestimate actual evolutionary distances. This is |
---|
| 458 | because, as sequences diverge, more than one substitution will happen at many |
---|
| 459 | sites. However, you only see one difference when you look at the present day |
---|
| 460 | sequences. Therefore, this option has the effect of stretching branch lengths |
---|
| 461 | in trees (especially long branches). The corrections used here (for DNA or |
---|
| 462 | proteins) are both due to Motoo Kimura. See the documentation for details. |
---|
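The stretching effect of the correction can be illustrated with one published form of Kimura's empirical correction for proteins, K = -ln(1 - D - D*D/5), where D is the observed fraction of differing sites. This formula is not given in this help file, and using it here is an assumption for illustration; see the documentation for what the program actually does. The function name is hypothetical.

```python
import math

def kimura_protein_distance(observed):
    """Correct an observed divergence D (0..1) for multiple substitutions."""
    inner = 1.0 - observed - 0.2 * observed * observed
    if inner <= 0.0:
        # Very divergent sequences cannot be corrected by this formula.
        raise ValueError("divergence too large to correct reliably")
    return -math.log(inner)

for d in (0.05, 0.25, 0.50):
    print(d, round(kimura_protein_distance(d), 3))
```

For small divergence the corrected value is almost the same as the observed one; for larger divergence it grows noticeably faster, which is what lengthens the long branches.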
| 463 | |
---|
| 464 | Where possible, this option should be used. However, for VERY divergent |
---|
| 465 | sequences, the distances cannot be reliably corrected. You will be warned if |
---|
| 466 | this happens. Even if none of the distances in a data set exceed the reliable |
---|
| 467 | threshold, if you bootstrap the data, some of the bootstrap distances may |
---|
| 468 | randomly exceed the safe limit. |
---|
| 469 | |
---|
| 470 | 4) To calculate a tree, use option 4 (DRAW TREE NOW). This gives an UNROOTED |
---|
| 471 | tree and all branch lengths. The root of the tree can only be inferred by |
---|
| 472 | using an outgroup (a sequence that you are certain branches at the outside |
---|
| 473 | of the tree .... certain on biological grounds) OR if you assume a degree |
---|
| 474 | of constancy in the 'molecular clock', you can place the root in the 'middle' |
---|
| 475 | of the tree (roughly equidistant from all tips). |
---|
| 476 | |
---|
| 477 | 5) TOGGLE PHYLIP BOOTSTRAP POSITIONS |
---|
| 478 | By default, the bootstrap values are correctly placed on the tree branches of |
---|
| 479 | the phylip format output tree. The toggle allows them to be placed on the |
---|
| 480 | nodes, which is incorrect, but some display packages (e.g. TreeTool, TreeView |
---|
| 481 | and Phylowin) only support node labelling but not branch labelling. Care |
---|
| 482 | should be taken to note which branches and labels go together. |
---|
| 483 | |
---|
| 484 | 6) OUTPUT FORMATS: four different formats are allowed. None of these displays |
---|
| 485 | the tree visually. Useful display programs accepting PHYLIP format include |
---|
| 486 | NJplot (from Manolo Gouy and supplied with Clustal W), TreeView (Mac-PC), and |
---|
| 487 | PHYLIP itself - OR get the PHYLIP package and use the tree drawing facilities |
---|
| 488 | there. (Get the PHYLIP package anyway if you are interested in trees). The |
---|
| 489 | NEXUS format can be read into PAUP or MacClade. |
---|
| 490 | |
---|
| 491 | >>HELP 8 << Help for choosing a weight matrix |
---|
| 492 | |
---|
| 493 | For protein alignments, you use a weight matrix to determine the similarity of |
---|
| 494 | non-identical amino acids. For example, Tyr aligned with Phe is usually judged |
---|
| 495 | to be 'better' than Tyr aligned with Pro. |
---|
| 496 | |
---|
| 497 | There are three 'in-built' series of weight matrices offered. Each consists of |
---|
| 498 | several matrices which work differently at different evolutionary distances. To |
---|
| 499 | see the exact details, read the documentation. Crudely, we store several |
---|
| 500 | matrices in memory, spanning the full range of amino acid distance (from almost |
---|
| 501 | identical sequences to highly divergent ones). For very similar sequences, it |
---|
| 502 | is best to use a strict weight matrix which only gives a high score to |
---|
| 503 | identities and the most favoured conservative substitutions. For more divergent |
---|
| 504 | sequences, it is appropriate to use "softer" matrices which give a high score |
---|
| 505 | to many other frequent substitutions. |
---|
| 506 | |
---|
| 507 | 1) BLOSUM (Henikoff). These matrices appear to be the best available for |
---|
| 508 | carrying out database similarity (homology searches). The matrices used are: |
---|
| 509 | Blosum 80, 62, 45 and 30. (BLOSUM was the default in earlier Clustal W |
---|
| 510 | versions) |
---|
| 511 | |
---|
| 512 | 2) PAM (Dayhoff). These have been extremely widely used since the late '70s. |
---|
| 513 | We use the PAM 20, 60, 120 and 350 matrices. |
---|
| 514 | |
---|
| 515 | 3) GONNET. These matrices were derived using almost the same procedure as the |
---|
| 516 | Dayhoff one (above) but are much more up to date and are based on a far larger |
---|
| 517 | data set. They appear to be more sensitive than the Dayhoff series. We use the |
---|
| 518 | GONNET 80, 120, 160, 250 and 350 matrices. This series is the default for |
---|
| 519 | Clustal W version 1.8. |
---|
| 520 | |
---|
| 521 | We also supply an identity matrix which gives a score of 1.0 to two identical |
---|
| 522 | amino acids and a score of zero otherwise. This matrix is not very useful. |
---|
| 523 | Alternatively, you can read in your own (just one matrix, not a series). |
---|
| 524 | |
---|
| 525 | A new matrix can be read from a file on disk, if the filename consists only |
---|
| 526 | of lower case characters. The values in the new weight matrix must be integers |
---|
| 527 | and the scores should be similarities. You can use negative as well as positive |
---|
| 528 | values if you wish, although the matrix will be automatically adjusted to all |
---|
| 529 | positive scores. |
---|
| 530 | |
---|
| 531 | |
---|
| 532 | |
---|
| 533 | For DNA, a single matrix (not a series) is used. Two hard-coded matrices are |
---|
| 534 | available: |
---|
| 535 | |
---|
| 536 | |
---|
| 537 | 1) IUB. This is the default scoring matrix used by BESTFIT for the comparison |
---|
| 538 | of nucleic acid sequences. X's and N's are treated as matches to any IUB |
---|
| 539 | ambiguity symbol. All matches score 1.9; all mismatches for IUB symbols score 0. |
---|
| 540 | |
---|
| 541 | |
---|
| 542 | 2) CLUSTALW(1.6). The previous system used by Clustal W, in which matches score |
---|
| 543 | 1.0 and mismatches score 0. All matches for IUB symbols also score 0. |
---|
| 544 | |
---|
| 545 | INPUT FORMAT: The format used for a new matrix is the same as that of the BLAST program. |
---|
| 546 | Any lines beginning with a # character are assumed to be comments. The first |
---|
| 547 | non-comment line should contain a list of amino acids in any order, using the |
---|
| 548 | 1 letter code, followed by a * character. This should be followed by a square |
---|
| 549 | matrix of integer scores, with one row and one column for each amino acid. The |
---|
| 550 | last row and column of the matrix (corresponding to the * character) contain |
---|
| 551 | the minimum score over the whole matrix. |
---|
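The input format described above can be sketched in a few lines of Python. This is only an illustration of the layout (comment lines, a header row of symbols ending in *, then a square block of integers), not the reader used by Clustal W itself. The function name `parse_matrix` and the toy two-letter alphabet are assumptions made for the example; rows that repeat the row symbol in the first column, as BLAST matrix files commonly do, are tolerated.

```python
def parse_matrix(text):
    """Parse a BLAST-style score matrix into (symbols, {(row, col): score})."""
    symbols = None
    scores = {}
    row_index = 0
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # blank lines and comments carry no data
        fields = line.split()
        if symbols is None:
            symbols = fields  # first non-comment line: the residue symbols
            continue
        if row_index >= len(symbols):
            raise ValueError("more rows than symbols in header")
        # Drop a leading row label if the row has one.
        if len(fields) == len(symbols) + 1:
            fields = fields[1:]
        if len(fields) != len(symbols):
            raise ValueError("row length does not match header")
        row_symbol = symbols[row_index]
        for col_symbol, value in zip(symbols, fields):
            scores[(row_symbol, col_symbol)] = int(value)  # integers only
        row_index += 1
    if symbols is None or row_index != len(symbols):
        raise ValueError("matrix is not square")
    return symbols, scores


# Toy matrix (hypothetical values); the * row and column hold the minimum score.
example = """\
# toy matrix for illustration
   A  C  *
A  4 -2 -2
C -2  5 -2
* -2 -2 -2
"""
symbols, scores = parse_matrix(example)
print(symbols, scores[("A", "C")])
```

A dictionary keyed on symbol pairs keeps the lookup independent of the order in which the amino acids are listed, which matches the "in any order" wording above.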
| 552 | |
---|
| 553 | >>HELP 9 << Help for command line parameters |
---|
| 554 | DATA (sequences) |
---|
| 555 | |
---|
| 556 | -INFILE=file.ext :input sequences. |
---|
| 557 | -PROFILE1=file.ext and -PROFILE2=file.ext :profiles (old alignment). |
---|
| 558 | |
---|
| 559 | |
---|
| 560 | VERBS (do things) |
---|
| 561 | |
---|
| 562 | -OPTIONS :list the command line parameters |
---|
| 563 | -HELP or -CHECK :outline the command line params. |
---|
| 564 | -ALIGN :do full multiple alignment |
---|
| 565 | -TREE :calculate NJ tree. |
---|
| 566 | -BOOTSTRAP(=n) :bootstrap a NJ tree (n= number of bootstraps; def. = 1000). |
---|
| 567 | -CONVERT :output the input sequences in a different file format. |
---|
| 568 | |
---|
| 569 | |
---|
| 570 | PARAMETERS (set things) |
---|
| 571 | |
---|
| 572 | ***General settings:*** |
---|
| 573 | -INTERACTIVE :read command line, then enter normal interactive menus |
---|
| 574 | -QUICKTREE :use FAST algorithm for the alignment guide tree |
---|
| 575 | -TYPE= :PROTEIN or DNA sequences |
---|
| 576 | -NEGATIVE :protein alignment with negative values in matrix |
---|
| 577 | -OUTFILE= :sequence alignment file name |
---|
| 578 | -OUTPUT= :GCG, GDE, PHYLIP, PIR or NEXUS |
---|
| 579 | -OUTORDER= :INPUT or ALIGNED |
---|
| 580 | -CASE :LOWER or UPPER (for GDE output only) |
---|
| 581 | -SEQNOS= :OFF or ON (for Clustal output only) |
---|
[1754] | 582 | -SEQNO_RANGE= :OFF or ON (NEW: for all output formats) |
---|
| 583 | -RANGE=m,n :sequence range to write, starting at m and ending at m+n. |
---|
[176] | 584 | |
---|
| 585 | ***Fast Pairwise Alignments:*** |
---|
| 586 | -KTUPLE=n :word size |
---|
| 587 | -TOPDIAGS=n :number of best diags. |
---|
| 588 | -WINDOW=n :window around best diags. |
---|
| 589 | -PAIRGAP=n :gap penalty |
---|
| 590 | -SCORE :PERCENT or ABSOLUTE |
---|
| 591 | |
---|
| 592 | |
---|
| 593 | ***Slow Pairwise Alignments:*** |
---|
| 594 | -PWMATRIX= :Protein weight matrix=BLOSUM, PAM, GONNET, ID or filename |
---|
| 595 | -PWDNAMATRIX= :DNA weight matrix=IUB, CLUSTALW or filename |
---|
| 596 | -PWGAPOPEN=f :gap opening penalty |
---|
| 597 | -PWGAPEXT=f :gap extension penalty |
---|
| 598 | |
---|
| 599 | |
---|
| 600 | ***Multiple Alignments:*** |
---|
| 601 | -NEWTREE= :file for new guide tree |
---|
| 602 | -USETREE= :file for old guide tree |
---|
| 603 | -MATRIX= :Protein weight matrix=BLOSUM, PAM, GONNET, ID or filename |
---|
| 604 | -DNAMATRIX= :DNA weight matrix=IUB, CLUSTALW or filename |
---|
| 605 | -GAPOPEN=f :gap opening penalty |
---|
| 606 | -GAPEXT=f :gap extension penalty |
---|
| 607 | -ENDGAPS :no end gap separation pen. |
---|
| 608 | -GAPDIST=n :gap separation pen. range |
---|
| 609 | -NOPGAP :residue-specific gaps off |
---|
| 610 | -NOHGAP :hydrophilic gaps off |
---|
| 611 | -HGAPRESIDUES= :list hydrophilic res. |
---|
| 612 | -MAXDIV=n :% ident. for delay |
---|
| 613 | -TYPE= :PROTEIN or DNA |
---|
| 614 | -TRANSWEIGHT=f :transitions weighting |
---|
| 615 | |
---|
| 616 | |
---|
| 617 | ***Profile Alignments:*** |
---|
| 618 | -PROFILE :Merge two alignments by profile alignment |
---|
| 619 | -NEWTREE1= :file for new guide tree for profile1 |
---|
| 620 | -NEWTREE2= :file for new guide tree for profile2 |
---|
| 621 | -USETREE1= :file for old guide tree for profile1 |
---|
| 622 | -USETREE2= :file for old guide tree for profile2 |
---|
| 623 | |
---|
| 624 | |
---|
| 625 | ***Sequence to Profile Alignments:*** |
---|
| 626 | -SEQUENCES :Sequentially add profile2 sequences to profile1 alignment |
---|
| 627 | -NEWTREE= :file for new guide tree |
---|
| 628 | -USETREE= :file for old guide tree |
---|
| 629 | |
---|
| 630 | |
---|
| 631 | ***Structure Alignments:*** |
---|
| 632 | -NOSECSTR1 :do not use secondary structure-gap penalty mask for profile 1 |
---|
| 633 | -NOSECSTR2 :do not use secondary structure-gap penalty mask for profile 2 |
---|
| 634 | -SECSTROUT=STRUCTURE or MASK or BOTH or NONE :output in alignment file |
---|
| 635 | -HELIXGAP=n :gap penalty for helix core residues |
---|
| 636 | -STRANDGAP=n :gap penalty for strand core residues |
---|
| 637 | -LOOPGAP=n :gap penalty for loop regions |
---|
| 638 | -TERMINALGAP=n :gap penalty for structure termini |
---|
| 639 | -HELIXENDIN=n :number of residues inside helix to be treated as terminal |
---|
| 640 | -HELIXENDOUT=n :number of residues outside helix to be treated as terminal |
---|
| 641 | -STRANDENDIN=n :number of residues inside strand to be treated as terminal |
---|
| 642 | -STRANDENDOUT=n :number of residues outside strand to be treated as terminal |
---|
| 643 | |
---|
| 644 | |
---|
| 645 | ***Trees:*** |
---|
| 646 | -OUTPUTTREE=nj OR phylip OR dist OR nexus |
---|
| 647 | -SEED=n :seed number for bootstraps. |
---|
| 648 | -KIMURA :use Kimura's correction. |
---|
| 649 | -TOSSGAPS :ignore positions with gaps. |
---|
| 650 | -BOOTLABELS=node OR branch :position of bootstrap values in tree display |
---|
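As an illustration of how the data, verb and parameter options above combine on one command line, the following Python sketch assembles an argument list. The executable name `clustalw`, the input file name `seqs.fasta`, the helper `build_command` and the numeric penalty values are assumptions chosen for the example; only the option names are taken from the list above.

```python
def build_command(infile, verbs=(), **params):
    """Return an argument list for a Clustal W run.

    The executable name "clustalw" is an assumption; adjust it to
    whatever the program is called on your system.
    """
    argv = ["clustalw", f"-INFILE={infile}"]
    # Verbs take no value, e.g. -ALIGN or -TREE.
    argv += [f"-{verb.upper()}" for verb in verbs]
    for name, value in params.items():
        if value is True:
            argv.append(f"-{name.upper()}")          # switch, e.g. -QUICKTREE
        else:
            argv.append(f"-{name.upper()}={value}")  # setting, e.g. -TYPE=PROTEIN
    return argv


# Hypothetical run: full multiple alignment of proteins, PHYLIP output.
cmd = build_command(
    "seqs.fasta",
    verbs=["align"],
    type="PROTEIN",
    matrix="GONNET",
    gapopen=10.0,   # illustrative value only
    gapext=0.2,     # illustrative value only
    output="PHYLIP",
    quicktree=True,
)
print(" ".join(cmd))
```

The resulting list could be passed to `subprocess.run(cmd)` on a system where the program is installed under that name.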
| 651 | |
---|
| 652 | >>HELP 0 << Help for tree output format options |
---|
| 653 | |
---|
[10842] | 654 | Four output formats are offered: |
---|
| 655 | 1) Clustal, |
---|
| 656 | 2) Phylip, |
---|
| 657 | 3) Just the distances |
---|
| 658 | 4) Nexus |
---|
[176] | 659 | |
---|
| 660 | None of these formats displays the results graphically. Many packages can |
---|
| 661 | display trees in the PHYLIP format 2) below. It can also be imported into |
---|
| 662 | the PHYLIP programs RETREE, DRAWTREE and DRAWGRAM for graphical display. |
---|
| 663 | NEXUS format trees can be read by PAUP and MacClade. |
---|
| 664 | |
---|
[10842] | 665 | 1) Clustal format output. |
---|
[176] | 666 | |
---|
[10842] | 667 | This format is verbose and lists all of the distances between the sequences and |
---|
| 668 | the number of alignment positions used for each. The tree is described at the |
---|
| 669 | end of the file. It lists the sequences that are joined at each alignment step |
---|
| 670 | and the branch lengths. After two sequences are joined, they are referred to later |
---|
| 671 | as a NODE. The number of a NODE is the number of the lowest sequence in that |
---|
| 672 | NODE. |
---|
| 673 | |
---|
[176] | 674 | 2) Phylip format output. |
---|
| 675 | |
---|
[10842] | 676 | This format is the New Hampshire format, used by many phylogenetic analysis |
---|
| 677 | packages. It consists of a series of nested parentheses, describing the |
---|
| 678 | branching order, with the sequence names and branch lengths. It can be used by |
---|
| 679 | the RETREE, DRAWGRAM and DRAWTREE programs of the PHYLIP package to see the |
---|
| 680 | trees graphically. This is the same format used during multiple alignment for |
---|
| 681 | the guide trees. |
---|
| 682 | |
---|
| 683 | Use this format with NJplot (Manolo Gouy), supplied with Clustal W. Some other |
---|
| 684 | packages that can read and display New Hampshire format are TreeView (Mac/PC), |
---|
| 685 | TreeTool (UNIX), and Phylowin. |
---|
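To make the nested-parentheses layout concrete, here is a minimal Python sketch that walks a New Hampshire format tree and collects each sequence name with its branch length. It is an illustration under simplifying assumptions, not a full parser: quoted names and bracketed comments are not handled, malformed input is not diagnosed carefully, and the function name `newick_leaves` and the example tree are invented for the example.

```python
def newick_leaves(text):
    """Return [(name, branch_length)] for every leaf of a New Hampshire tree."""
    text = "".join(text.split())  # tree files may wrap over several lines
    leaves = []
    pos = 0

    def read_label():
        # Read characters up to the next structural symbol.
        nonlocal pos
        start = pos
        while pos < len(text) and text[pos] not in "(),:;":
            pos += 1
        return text[start:pos]

    def read_length():
        # A branch length, if present, follows a colon.
        nonlocal pos
        if pos < len(text) and text[pos] == ":":
            pos += 1
            return float(read_label())
        return None

    def parse_subtree():
        nonlocal pos
        if text[pos] == "(":
            pos += 1  # open a group of joined subtrees
            while True:
                parse_subtree()
                if text[pos] == ",":
                    pos += 1
                elif text[pos] == ")":
                    pos += 1
                    break
                else:
                    raise ValueError(f"unexpected character at {pos}")
            read_label()   # optional label on the internal node
            read_length()  # branch length of the internal node
        else:
            name = read_label()
            leaves.append((name, read_length()))

    parse_subtree()
    if pos >= len(text) or text[pos] != ";":
        raise ValueError("tree must end with ';'")
    return leaves


tree = "(human:0.1,(mouse:0.2,rat:0.15):0.05,chicken:0.4);"
print(newick_leaves(tree))
```

Each opening parenthesis starts a group of joined subtrees, so the recursion depth mirrors the branching order described above.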
[176] | 686 | |
---|
| 687 | 3) The distances only. |
---|
| 688 | |
---|
[10842] | 689 | This format just outputs a matrix of all the pairwise distances in a format |
---|
| 690 | that can be used by the Phylip package. It used to be useful when one could not |
---|
| 691 | produce distances from protein sequences in the Phylip package but is now |
---|
| 692 | redundant (Protdist of Phylip 3.5 now does this). |
---|
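For readers unfamiliar with the layout, the following Python sketch writes a square distance matrix in the style generally accepted by the Phylip distance programs: a first line with the number of sequences, then one line per sequence with the name padded to ten characters followed by its distances. This is an assumption about the usual Phylip convention rather than a description of the exact file Clustal W writes; the function name and the example distances are hypothetical.

```python
def format_phylip_distances(names, matrix):
    """Format a square distance matrix in a Phylip-style layout."""
    if len(names) != len(matrix) or any(len(row) != len(names) for row in matrix):
        raise ValueError("matrix must be square and match the name list")
    lines = [f"{len(names):5d}"]  # first line: number of sequences
    for name, row in zip(names, matrix):
        cells = " ".join(f"{d:.3f}" for d in row)
        # Names are truncated and padded to ten characters.
        lines.append(f"{name[:10]:<10} {cells}")
    return "\n".join(lines)


# Hypothetical distances for two sequences.
out = format_phylip_distances(
    ["human", "mouse"],
    [[0.0, 0.25],
     [0.25, 0.0]],
)
print(out)
```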
[176] | 693 | |
---|
[10842] | 694 | 4) NEXUS FORMAT TREE. |
---|
| 695 | |
---|
| 696 | This format is used by several popular phylogeny programs, |
---|
| 697 | including PAUP and MacClade. The format is described fully in: |
---|
| 698 | Maddison, D. R., D. L. Swofford and W. P. Maddison. 1997. |
---|
| 699 | NEXUS: an extensible file format for systematic information. |
---|
| 700 | Systematic Biology 46:590-621. |
---|
| 701 | |
---|
[176] | 702 | 5) TOGGLE PHYLIP BOOTSTRAP POSITIONS |
---|
| 703 | |
---|
[10842] | 704 | By default, the bootstrap values are placed on the nodes of the phylip format |
---|
| 705 | output tree. This is inaccurate as the bootstrap values should be associated |
---|
| 706 | with the tree branches and not the nodes. However, this format can be read and |
---|
| 707 | displayed by TreeTool, TreeView and Phylowin. An option is available to |
---|
| 708 | correctly place the bootstrap values on the branches with which they are |
---|
| 709 | associated. |
---|
| 710 | |
---|