| 1 | <!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 3.2 Final//EN"> |
|---|
| 2 | <HTML> |
|---|
| 3 | <HEAD> |
|---|
| 4 | <TITLE>sequence</TITLE> |
|---|
| 5 | <META NAME="description" CONTENT="sequence"> |
|---|
| 6 | <META NAME="keywords" CONTENT="sequence"> |
|---|
| 7 | <META NAME="resource-type" CONTENT="document"> |
|---|
| 8 | <META NAME="distribution" CONTENT="global"> |
|---|
| 9 | <META HTTP-EQUIV="Content-Type" CONTENT="text/html; charset=iso-8859-1"> |
|---|
| 10 | </HEAD> |
|---|
| 11 | <BODY BGCOLOR="#ccffff"> |
|---|
| 12 | <DIV ALIGN=RIGHT> |
|---|
| 13 | version 3.6 |
|---|
| 14 | </DIV> |
|---|
| 15 | <DIV ALIGN=CENTER> |
|---|
| 16 | <H1>Molecular Sequence Programs</H1> |
|---|
| 17 | </DIV> |
|---|
| 18 | <P> |
|---|
| 19 | (c) Copyright 1986-2000 by The University of |
|---|
| 20 | Washington. Written by Joseph Felsenstein. Permission is granted to copy |
|---|
| 21 | this document provided that no fee is charged for it and that this copyright |
|---|
| 22 | notice is not removed. |
|---|
| 23 | <P> |
|---|
| 24 | These programs estimate phylogenies from protein |
|---|
| 25 | sequence or nucleic acid sequence data. PROTPARS uses a parsimony method |
|---|
| 26 | intermediate between Eck and Dayhoff's |
|---|
| 27 | method (1966) of allowing transitions between all amino acids and counting |
|---|
| 28 | those, and Fitch's (1971) method of counting the number of nucleotide changes |
|---|
| 29 | that would be needed to evolve the protein sequence. DNAPARS uses the |
|---|
| 30 | parsimony method allowing changes between all bases |
|---|
| 31 | and counting the number of those. DNAMOVE is an interactive parsimony |
|---|
| 32 | program allowing the user to rearrange trees by hand and see where |
|---|
| 33 | character states change. DNAPENNY |
|---|
| 34 | uses the branch-and-bound method to search for all most |
|---|
| 35 | parsimonious trees in the nucleic acid sequence case. DNACOMP |
|---|
| 36 | adapts to nucleotide sequences the compatibility (largest clique) |
|---|
| 37 | approach. DNAINVAR does not directly estimate a phylogeny, but computes Lake's |
|---|
| 38 | (1987) and Cavender's (Cavender and Felsenstein, 1987) phylogenetic invariants, |
|---|
| 39 | which are quantities whose values depend on the phylogeny. DNAML does a |
|---|
| 40 | maximum likelihood estimate of the phylogeny (Felsenstein, 1981a). DNAMLK |
|---|
| 41 | is similar to DNAML but assumes a molecular clock. DNADIST |
|---|
| 42 | computes distance measures between pairs of species from nucleotide sequences, |
|---|
| 43 | distances that can then be used by the distance matrix programs FITCH and |
|---|
| 44 | KITSCH. RESTML does a maximum likelihood estimate from restriction |
|---|
| 45 | sites data. SEQBOOT allows you to read in a data set and then produce |
|---|
| 46 | multiple data sets from it by bootstrapping, delete-half jackknifing, or |
|---|
| 47 | by permuting within sites. This |
|---|
| 48 | then allows most of these methods to be bootstrapped or jackknifed, and |
|---|
| 49 | for the Permutation Tail Probability Test of Archie (1989) and Faith and |
|---|
| 50 | Cranston (1991) to be carried out. |
|---|
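<P>
The three resampling schemes named above can be sketched in a few lines of Python. This is only an illustration of the ideas and not SEQBOOT's own code: the function names, the use of a list of equal-length strings to hold the alignment, and the explicit random number generator argument are all assumptions made for this example.

```python
import random


def bootstrap_sites(seqs, rng):
    """Resample alignment columns with replacement, keeping the same length."""
    n = len(seqs[0])
    cols = [rng.randrange(n) for _ in range(n)]
    return ["".join(s[c] for c in cols) for s in seqs]


def delete_half_jackknife(seqs, rng):
    """Keep a random half of the columns, sampled without replacement."""
    n = len(seqs[0])
    cols = sorted(rng.sample(range(n), n // 2))
    return ["".join(s[c] for c in cols) for s in seqs]


def permute_within_sites(seqs, rng):
    """Shuffle the states among species independently within each column."""
    columns = [list(col) for col in zip(*seqs)]
    for col in columns:
        rng.shuffle(col)
    return ["".join(chars) for chars in zip(*columns)]


if __name__ == "__main__":
    rng = random.Random(1)  # hypothetical seed, used only to make runs repeatable
    data = ["AAGC", "ACGT", "TTGA"]
    print(bootstrap_sites(data, rng))
    print(delete_half_jackknife(data, rng))
    print(permute_within_sites(data, rng))
```

Each function returns a new data set with the species in the same order as the input, so the result could be written out in the same format as the original data.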
| 51 | <P> |
|---|
| 52 | The input and output format for RESTML is described in |
|---|
| 53 | its document files. In general its input format is similar to |
|---|
| 54 | those described here, except that the one-letter codes for restriction sites |
|---|
| 55 | are specific to that program and are described in that document file. Since |
|---|
| 56 | the input formats for the eight DNA sequence and two protein sequence |
|---|
| 57 | programs apply to more than one program, they are described here. Their |
|---|
| 58 | input formats are standard, making use of the IUPAC standards. |
|---|
| 61 | <H2>INTERLEAVED AND SEQUENTIAL FORMATS</H2> |
|---|
| 62 | <P> |
|---|
| 63 | The sequences can continue over multiple lines; when this is done the |
|---|
| 64 | sequences must be either in "interleaved" format, similar to the |
|---|
| 65 | output of alignment programs, or "sequential" format. These are |
|---|
| 66 | described in the main document file. In sequential format all |
|---|
| 67 | of one sequence is given, possibly on multiple lines, before the next starts. |
|---|
| 68 | In interleaved format the first part of the file should contain the first |
|---|
| 69 | part of each of the sequences, then possibly a line containing nothing |
|---|
| 70 | but a carriage-return character, then the second part of each sequence, |
|---|
| 71 | and so on. Only the first parts of the sequences should be preceded by |
|---|
| 72 | names. Here is a hypothetical example of interleaved format: |
|---|
| 73 | <P> |
|---|
| 74 | <TABLE><TR><TD BGCOLOR=white> |
|---|
| 75 | <PRE> |
|---|
| 76 | 5 42 |
|---|
| 77 | Turkey    AAGCTNGGGC ATTTCAGGGT |
|---|
| 78 | Salmo gairAAGCCTTGGC AGTGCAGGGT |
|---|
| 79 | H. SapiensACCGGTTGGC CGTTCAGGGT |
|---|
| 80 | Chimp     AAACCCTTGC CGTTACGCTT |
|---|
| 81 | Gorilla   AAACCCTTGC CGGTACGCTT |
|---|
| 82 | |
|---|
| 83 | GAGCCCGGGC AATACAGGGT AT |
|---|
| 84 | GAGCCGTGGC CGGGCACGGT AT |
|---|
| 85 | ACAGGTTGGC CGTTCAGGGT AA |
|---|
| 86 | AAACCGAGGC CGGGACACTC AT |
|---|
| 87 | AAACCATTGC CGGTACGCTT AA |
|---|
| 88 | </PRE> |
|---|
| 89 | </TD></TR></TABLE> |
|---|
| 90 | <P> |
|---|
| 91 | while in sequential format the same sequences would be: |
|---|
| 92 | <P> |
|---|
| 93 | <TABLE><TR><TD BGCOLOR=white> |
|---|
| 94 | <PRE> |
|---|
| 95 | 5 42 |
|---|
| 96 | Turkey    AAGCTNGGGC ATTTCAGGGT |
|---|
| 97 | GAGCCCGGGC AATACAGGGT AT |
|---|
| 98 | Salmo gairAAGCCTTGGC AGTGCAGGGT |
|---|
| 99 | GAGCCGTGGC CGGGCACGGT AT |
|---|
| 100 | H. SapiensACCGGTTGGC CGTTCAGGGT |
|---|
| 101 | ACAGGTTGGC CGTTCAGGGT AA |
|---|
| 102 | Chimp     AAACCCTTGC CGTTACGCTT |
|---|
| 103 | AAACCGAGGC CGGGACACTC AT |
|---|
| 104 | Gorilla   AAACCCTTGC CGGTACGCTT |
|---|
| 105 | AAACCATTGC CGGTACGCTT AA |
|---|
| 106 | </PRE> |
|---|
| 107 | </TD></TR></TABLE> |
|---|
| 108 | <P> |
|---|
| 109 | Note, of course, that a portion of a sequence like this: |
|---|
| 110 | <P> |
|---|
| 111 | 300 AAGCGTGAAC GTTGTACTAA TRCAG |
|---|
| 112 | <P> |
|---|
| 113 | is perfectly legal, assuming that the species name has gone before, and is |
|---|
| 114 | filled out to full length by blanks. The above |
|---|
| 115 | digits and blanks will be ignored, the sequence being taken as starting |
|---|
| 116 | at the first base symbol (in this case an A). This should enable you to |
|---|
| 117 | use output from many multiple-sequence alignment programs with only |
|---|
| 118 | minimal editing. |
|---|
| 119 | <P> |
|---|
| 120 | In interleaved format |
|---|
| 121 | the present versions of the programs may sometimes have difficulties with the |
|---|
| 122 | blank lines between groups of lines, and if so you might want to retype |
|---|
| 123 | those lines, making sure that they have only a carriage-return and no blank |
|---|
| 124 | characters on them, or you may perhaps have to eliminate them. The symptoms |
|---|
| 125 | of this problem are that the programs complain that the sequences are not |
|---|
| 126 | properly aligned, and you can find no other cause for this complaint. |
|---|
| 127 | <P> |
|---|
| 128 | <H2>INPUT FOR THE DNA SEQUENCE PROGRAMS</H2> |
|---|
| 129 | <P> |
|---|
| 130 | The input format for the DNA sequence programs is |
|---|
| 131 | standard: the data have A's, G's, C's and T's (or U's). The first line of the |
|---|
| 132 | input file contains the number of species and the number of sites. As |
|---|
| 133 | with the other programs, options information may follow this. Following this, |
|---|
| 134 | each species starts on a new line. The first 10 |
|---|
| 135 | characters of that line are the species name. There then follows |
|---|
| 136 | the base sequence of that species, each character |
|---|
| 137 | being one of the letters A, B, C, D, G, H, K, M, N, O, R, S, T, U, V, |
|---|
| 138 | W, X, Y, ?, or - (a period was also previously allowed but it is no longer |
|---|
| 139 | allowed, because it sometimes is used in different senses in other |
|---|
| 140 | programs). Blanks will be ignored, and so will numerical |
|---|
| 141 | digits. This allows GENBANK and EMBL sequence entries to be read with |
|---|
| 142 | minimum editing. |
|---|
| 143 | <P> |
|---|
| 144 | These characters can be either upper or lower case; the programs |
|---|
| 145 | convert all input characters to upper case before processing them. |
|---|
| 146 | The characters constitute the IUPAC (IUB) nucleic acid code |
|---|
| 147 | plus some slight |
|---|
| 148 | extensions. They enable input of nucleic acid sequences taking full account |
|---|
| 149 | of any ambiguities in the sequence. |
|---|
| 150 | <P> |
|---|
| 151 | <DIV ALIGN=CENTER> |
|---|
| 152 | <TABLE BORDER=0> |
|---|
| 153 | <TR><TD ALIGN=LEFT><B>Symbol</B><TD><TD><B>Meaning</B></TD><TD></TD></TR> |
|---|
| 154 | <TR><TD></TD><TD></TD></TR> |
|---|
| 155 | <TR><TD>A<TD><TD>Adenine</TD><TD></TD></TR> |
|---|
| 156 | <TR><TD>G<TD><TD>Guanine</TD><TD></TD></TR> |
|---|
| 157 | <TR><TD>C<TD><TD>Cytosine</TD><TD></TD></TR> |
|---|
| 158 | <TR><TD>T<TD><TD>Thymine</TD><TD></TD></TR> |
|---|
| 159 | <TR><TD>U<TD><TD>Uracil </TD><TD></TD></TR> |
|---|
| 160 | <TR><TD>Y<TD><TD>pYrimidine<TD><TD>(C or T)</TD></TR> |
|---|
| 161 | <TR><TD>R<TD><TD>puRine<TD><TD>(A or G)</TD></TR> |
|---|
| 162 | <TR><TD>W<TD><TD>"Weak"<TD><TD>(A or T)</TD></TR> |
|---|
| 163 | <TR><TD>S<TD><TD>"Strong"<TD><TD>(C or G)</TD></TR> |
|---|
| 164 | <TR><TD>K<TD><TD>"Keto"<TD><TD>(T or G)</TD></TR> |
|---|
| 165 | <TR><TD>M<TD><TD>"aMino"<TD><TD>(C or A)</TD></TR> |
|---|
| 166 | <TR><TD>B<TD><TD>not A<TD><TD>(C or G or T)</TD></TR> |
|---|
| 167 | <TR><TD>D<TD><TD>not C<TD><TD>(A or G or T)</TD></TR> |
|---|
| 168 | <TR><TD>H<TD><TD>not G<TD><TD>(A or C or T)</TD></TR> |
|---|
| 169 | <TR><TD>V<TD><TD>not T<TD><TD>(A or C or G)</TD></TR> |
|---|
| 170 | <TR><TD>X,N,?<TD><TD>unknown<TD><TD>(A or C or G or T)</TD></TR> |
|---|
| 171 | <TR><TD>O<TD><TD>deletion</TD><TD></TD></TR> |
|---|
| 172 | <TR><TD>-<TD><TD>deletion</TD><TD></TD></TR> |
|---|
| 173 | </TABLE> |
|---|
| 174 | </DIV> |
|---|
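<P>
The table above can be written down as a lookup from each symbol to the set of bases it may stand for. The Python sketch below is an illustration only and is not taken from the programs: the names IUPAC_DNA, possible_bases and could_match are hypothetical, treating U as equivalent to T is an assumption of this example, and representing a deletion by an empty set is just one convenient choice.

```python
# Each accepted symbol mapped to the set of bases it may represent.
IUPAC_DNA = {
    "A": {"A"}, "G": {"G"}, "C": {"C"}, "T": {"T"},
    "U": {"T"},                      # assumption: uracil handled like thymine
    "Y": {"C", "T"}, "R": {"A", "G"},
    "W": {"A", "T"}, "S": {"C", "G"},
    "K": {"T", "G"}, "M": {"C", "A"},
    "B": {"C", "G", "T"}, "D": {"A", "G", "T"},
    "H": {"A", "C", "T"}, "V": {"A", "C", "G"},
    "X": set("ACGT"), "N": set("ACGT"), "?": set("ACGT"),
    "O": set(), "-": set(),          # deletion: no base present
}


def possible_bases(symbol):
    """Return the bases a symbol may stand for; case is not significant."""
    key = symbol.upper()
    if key not in IUPAC_DNA:
        raise ValueError(f"{symbol!r} is not an accepted nucleotide symbol")
    return IUPAC_DNA[key]


def could_match(a, b):
    """True if two symbols share at least one possible base."""
    return bool(possible_bases(a) & possible_bases(b))
```

For example, possible_bases("r") gives the set containing A and G, while possible_bases(".") raises an error, since the period is no longer an accepted symbol.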
| 175 | <P> |
|---|
| 176 | <H2>INPUT FOR THE PROTEIN SEQUENCE PROGRAMS</H2> |
|---|
| 177 | <P> |
|---|
| 178 | The input for the protein sequence programs is fairly standard. The first |
|---|
| 179 | line contains the |
|---|
| 180 | number of species and the number of amino acid positions (counting any |
|---|
| 181 | stop codons that you want to include). These are followed on the same line |
|---|
| 182 | by the options. The only options which |
|---|
| 183 | need information in the input file are U (User Tree) and W (Weights). They are |
|---|
| 184 | as described in the main documentation file. If the W (Weights) option is |
|---|
| 185 | used there must be a W in the first line of the input file. |
|---|
| 186 | <P> |
|---|
| 187 | Next come the species data. Each |
|---|
| 188 | sequence starts on a new line, has a ten-character species name |
|---|
| 189 | that must be blank-filled to be of that length, followed immediately |
|---|
| 190 | by the species data in the one-letter code. The sequences must be in |
|---|
| 191 | either the "interleaved" or the "sequential" format. The I option |
|---|
| 192 | selects between them. The sequences can have internal |
|---|
| 193 | blanks in the sequence but there must be no extra blanks at the end of the |
|---|
| 194 | terminated line. Note that a blank is not a valid symbol for a deletion. |
|---|
| 195 | <P> |
|---|
| 196 | The protein sequences are given by the one-letter code used by |
|---|
| 197 | the late Margaret Dayhoff's group in the Atlas of Protein Sequences, |
|---|
| 198 | and consistent with the IUB standard abbreviations. |
|---|
| 199 | In the present version it is: |
|---|
| 200 | <P> |
|---|
| 201 | <DIV ALIGN=CENTER> |
|---|
| 202 | <TABLE> |
|---|
| 203 | <TR><TD ALIGN=CENTER><B>Symbol</B></TD><TD ALIGN=CENTER><B>Stands for</B></TD></TR> |
|---|
| 204 | <TR><TD ALIGN=CENTER></TD><TD ALIGN=CENTER></TD></TR> |
|---|
| 205 | <TR><TD ALIGN=CENTER>A</TD><TD ALIGN=CENTER>ala</TD></TR> |
|---|
| 206 | <TR><TD ALIGN=CENTER>B</TD><TD ALIGN=CENTER>asx</TD></TR> |
|---|
| 207 | <TR><TD ALIGN=CENTER>C</TD><TD ALIGN=CENTER>cys</TD></TR> |
|---|
| 208 | <TR><TD ALIGN=CENTER>D</TD><TD ALIGN=CENTER>asp</TD></TR> |
|---|
| 209 | <TR><TD ALIGN=CENTER>E</TD><TD ALIGN=CENTER>glu</TD></TR> |
|---|
| 210 | <TR><TD ALIGN=CENTER>F</TD><TD ALIGN=CENTER>phe</TD></TR> |
|---|
| 211 | <TR><TD ALIGN=CENTER>G</TD><TD ALIGN=CENTER>gly</TD></TR> |
|---|
| 212 | <TR><TD ALIGN=CENTER>H</TD><TD ALIGN=CENTER>his</TD></TR> |
|---|
| 213 | <TR><TD ALIGN=CENTER>I</TD><TD ALIGN=CENTER>ileu</TD></TR> |
|---|
| 214 | <TR><TD ALIGN=CENTER>J</TD><TD ALIGN=CENTER>(not used)</TD></TR> |
|---|
| 215 | <TR><TD ALIGN=CENTER>K</TD><TD ALIGN=CENTER>lys</TD></TR> |
|---|
| 216 | <TR><TD ALIGN=CENTER>L</TD><TD ALIGN=CENTER>leu</TD></TR> |
|---|
| 217 | <TR><TD ALIGN=CENTER>M</TD><TD ALIGN=CENTER>met</TD></TR> |
|---|
| 218 | <TR><TD ALIGN=CENTER>N</TD><TD ALIGN=CENTER>asn</TD></TR> |
|---|
| 219 | <TR><TD ALIGN=CENTER>O</TD><TD ALIGN=CENTER>(not used)</TD></TR> |
|---|
| 220 | <TR><TD ALIGN=CENTER>P</TD><TD ALIGN=CENTER>pro</TD></TR> |
|---|
| 221 | <TR><TD ALIGN=CENTER>Q</TD><TD ALIGN=CENTER>gln</TD></TR> |
|---|
| 222 | <TR><TD ALIGN=CENTER>R</TD><TD ALIGN=CENTER>arg</TD></TR> |
|---|
| 223 | <TR><TD ALIGN=CENTER>S</TD><TD ALIGN=CENTER>ser</TD></TR> |
|---|
| 224 | <TR><TD ALIGN=CENTER>T</TD><TD ALIGN=CENTER>thr</TD></TR> |
|---|
| 225 | <TR><TD ALIGN=CENTER>U</TD><TD ALIGN=CENTER>(not used)</TD></TR> |
|---|
| 226 | <TR><TD ALIGN=CENTER>V</TD><TD ALIGN=CENTER>val</TD></TR> |
|---|
| 227 | <TR><TD ALIGN=CENTER>W</TD><TD ALIGN=CENTER>trp</TD></TR> |
|---|
| 228 | <TR><TD ALIGN=CENTER>X</TD><TD ALIGN=CENTER>unknown amino acid</TD></TR> |
|---|
| 229 | <TR><TD ALIGN=CENTER>Y</TD><TD ALIGN=CENTER>tyr</TD></TR> |
|---|
| 230 | <TR><TD ALIGN=CENTER>Z</TD><TD ALIGN=CENTER>glx</TD></TR> |
|---|
| 231 | <TR><TD ALIGN=CENTER>*</TD><TD ALIGN=CENTER>nonsense (stop)</TD></TR> |
|---|
| 232 | <TR><TD ALIGN=CENTER>?</TD><TD ALIGN=CENTER>unknown amino acid or deletion</TD></TR> |
|---|
| 233 | <TR><TD ALIGN=CENTER>-</TD><TD ALIGN=CENTER>deletion</TD></TR> |
|---|
| 234 | </TABLE> |
|---|
| 235 | </DIV> |
|---|
| 236 | <P> |
|---|
| 237 | where "nonsense" and "unknown" mean, respectively, a nonsense (chain |
|---|
| 238 | termination) codon and an amino acid whose identity has not been |
|---|
| 239 | determined. The state "asx" means "either asn or asp", |
|---|
| 240 | the state "glx" means "either gln or glu", and the state "deletion" |
|---|
| 241 | means that alignment studies indicate a deletion has happened in the |
|---|
| 242 | ancestry of this position, so that it is no longer present. Note that |
|---|
| 243 | if two polypeptide chains are being used that are of different length |
|---|
| 244 | owing to one terminating before the other, they can be coded as (say) |
|---|
| 245 | <PRE> |
|---|
| 246 | HIINMA*???? |
|---|
| 247 | HIPNMGVWABT |
|---|
| 248 | </PRE> |
|---|
| 249 | since after the stop codon we do not definitely know that |
|---|
| 250 | there has been a deletion, and do not know what amino acid would |
|---|
| 251 | have been there. If DNA studies tell us that there is |
|---|
| 252 | DNA sequence in that region, then we could use "X" rather than "?". Note |
|---|
| 253 | that "X" means an unknown amino acid, but definitely an amino acid, |
|---|
| 254 | while "?" could mean either that or a deletion. Otherwise one will usually |
|---|
| 255 | want to use "?" after a stop codon, if one does not know what amino acid is |
|---|
| 256 | there. If the DNA sequence has been observed there, one probably ought to |
|---|
| 257 | resist putting in the amino acids that this DNA would code for, and one should |
|---|
| 258 | use "X" instead, because under the assumptions implicit in either the |
|---|
| 259 | parsimony or the distance |
|---|
| 260 | methods, changes to any noncoding sequence are much easier than |
|---|
| 261 | changes in a coding region that change the amino acid. |
|---|
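<P>
The coding rule for chains of unequal length can be illustrated with a small Python helper. This is a sketch under stated assumptions, not part of any of the programs: the name pad_after_stop and its dna_observed argument are hypothetical, and the input is assumed to be a single chain with no internal blanks.

```python
# One-letter symbols accepted for protein data; J, O and U are not used.
PROTEIN_SYMBOLS = set("ABCDEFGHIKLMNPQRSTVWXYZ*?-")


def pad_after_stop(seq, length, dna_observed=False):
    """Pad a shorter chain to the alignment length.

    Positions after the end of the chain are filled with "?" (unknown amino
    acid or deletion), or with "X" (unknown but definitely an amino acid)
    when DNA sequence is known to exist in that region.
    """
    seq = seq.upper()
    bad = set(seq) - PROTEIN_SYMBOLS
    if bad:
        raise ValueError(f"symbols not in the one-letter code: {sorted(bad)}")
    if len(seq) > length:
        raise ValueError("sequence is longer than the alignment")
    filler = "X" if dna_observed else "?"
    return seq + filler * (length - len(seq))
```

With the chains shown above, pad_after_stop("HIINMA*", 11) returns HIINMA*????, and passing dna_observed=True returns HIINMA*XXXX instead.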
| 262 | <P> |
|---|
| 263 | Here are the same one-letter codes tabulated the other way 'round: |
|---|
| 264 | <P> |
|---|
| 265 | <DIV ALIGN=CENTER> |
|---|
| 266 | <TABLE> |
|---|
| 267 | <TR><TD ALIGN=CENTER><B>Amino acid</B></TD><TD ALIGN=CENTER><B>One-letter code</B></TD></TR> |
|---|
| 268 | <TR><TD ALIGN=CENTER></TD><TD ALIGN=CENTER></TD></TR> |
|---|
| 269 | <TR><TD ALIGN=CENTER>ala</TD><TD ALIGN=CENTER>A</TD></TR> |
|---|
| 270 | <TR><TD ALIGN=CENTER>arg</TD><TD ALIGN=CENTER>R</TD></TR> |
|---|
| 271 | <TR><TD ALIGN=CENTER>asn</TD><TD ALIGN=CENTER>N</TD></TR> |
|---|
| 272 | <TR><TD ALIGN=CENTER>asp</TD><TD ALIGN=CENTER>D</TD></TR> |
|---|
| 273 | <TR><TD ALIGN=CENTER>asx</TD><TD ALIGN=CENTER>B</TD></TR> |
|---|
| 274 | <TR><TD ALIGN=CENTER>cys</TD><TD ALIGN=CENTER>C</TD></TR> |
|---|
| 275 | <TR><TD ALIGN=CENTER>gln</TD><TD ALIGN=CENTER>Q</TD></TR> |
|---|
| 276 | <TR><TD ALIGN=CENTER>glu</TD><TD ALIGN=CENTER>E</TD></TR> |
|---|
| 277 | <TR><TD ALIGN=CENTER>gly</TD><TD ALIGN=CENTER>G</TD></TR> |
|---|
| 278 | <TR><TD ALIGN=CENTER>glx</TD><TD ALIGN=CENTER>Z</TD></TR> |
|---|
| 279 | <TR><TD ALIGN=CENTER>his</TD><TD ALIGN=CENTER>H</TD></TR> |
|---|
| 280 | <TR><TD ALIGN=CENTER>ileu</TD><TD ALIGN=CENTER>I</TD></TR> |
|---|
| 281 | <TR><TD ALIGN=CENTER>leu</TD><TD ALIGN=CENTER>L</TD></TR> |
|---|
| 282 | <TR><TD ALIGN=CENTER>lys</TD><TD ALIGN=CENTER>K</TD></TR> |
|---|
| 283 | <TR><TD ALIGN=CENTER>met</TD><TD ALIGN=CENTER>M</TD></TR> |
|---|
| 284 | <TR><TD ALIGN=CENTER>phe</TD><TD ALIGN=CENTER>F</TD></TR> |
|---|
| 285 | <TR><TD ALIGN=CENTER>pro</TD><TD ALIGN=CENTER>P</TD></TR> |
|---|
| 286 | <TR><TD ALIGN=CENTER>ser</TD><TD ALIGN=CENTER>S</TD></TR> |
|---|
| 287 | <TR><TD ALIGN=CENTER>thr</TD><TD ALIGN=CENTER>T</TD></TR> |
|---|
| 288 | <TR><TD ALIGN=CENTER>trp</TD><TD ALIGN=CENTER>W</TD></TR> |
|---|
| 289 | <TR><TD ALIGN=CENTER>tyr</TD><TD ALIGN=CENTER>Y</TD></TR> |
|---|
| 290 | <TR><TD ALIGN=CENTER>val</TD><TD ALIGN=CENTER>V</TD></TR> |
|---|
| 291 | <TR><TD ALIGN=CENTER>deletion</TD><TD ALIGN=CENTER>-</TD></TR> |
|---|
| 292 | <TR><TD ALIGN=CENTER>nonsense (stop)</TD><TD ALIGN=CENTER>*</TD></TR> |
|---|
| 293 | <TR><TD ALIGN=CENTER>unknown amino acid</TD><TD ALIGN=CENTER>X</TD></TR> |
|---|
| 294 | <TR><TD ALIGN=CENTER>unknown (incl. deletion)</TD><TD ALIGN=CENTER>?</TD></TR> |
|---|
| 295 | </TABLE> |
|---|
| 296 | </DIV> |
|---|
| 297 | <P> |
|---|
| 298 | <H2>THE OPTIONS</H2> |
|---|
| 299 | <P> |
|---|
| 300 | The programs allow options chosen from their menus. Many of these are as described in the |
|---|
| 301 | main documentation file, particularly the options J, O, U, T, W, |
|---|
| 302 | and Y (although T has a different meaning in the programs DNAML and |
|---|
| 303 | DNADIST than in the others). |
|---|
| 304 | <P> |
|---|
| 305 | The U option indicates that |
|---|
| 306 | user-defined trees are provided at the end of the input file. This |
|---|
| 307 | happens in the usual way, except that for PROTPARS, DNAPARS, DNACOMP, and |
|---|
| 308 | DNAMLK, the trees must be strictly |
|---|
| 309 | bifurcating, containing only two-way splits, e.g.: ((A,B),(C,(D,E)));. For |
|---|
| 310 | DNAML and RESTML each tree must have a trifurcation at its base, |
|---|
| 311 | e.g.: ((A,B),C,(D,E));. The |
|---|
| 312 | root of the tree may in those cases be placed arbitrarily, since the trees |
|---|
| 313 | needed are actually unrooted, though they look different when printed out. The |
|---|
| 314 | program RETREE should enable you to reroot the trees without having to |
|---|
| 315 | hand-edit or retype them. For |
|---|
| 316 | DNAMOVE the U option is not available (although |
|---|
| 317 | there is an equivalent feature which uses rooted user trees). |
|---|
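<P>
Before running a program it can be useful to check which shape a user tree has. The Python sketch below counts the branches at each fork of a tree written in the parenthesis notation used above. It is an illustration only: the function names are hypothetical, and it assumes species names contain no parentheses or commas.

```python
def split_sizes(tree):
    """Number of branches at each fork, in closing order (the base is last)."""
    sizes, stack = [], []
    for ch in tree:
        if ch == "(":
            stack.append(1)          # a new fork starts with one branch
        elif ch == ",":
            if not stack:
                raise ValueError("comma outside any parentheses")
            stack[-1] += 1           # each comma adds a branch to the open fork
        elif ch == ")":
            if not stack:
                raise ValueError("unbalanced parentheses")
            sizes.append(stack.pop())
    if stack or not sizes:
        raise ValueError("unbalanced or empty tree")
    return sizes


def is_strictly_bifurcating(tree):
    """True if every fork, including the base, is a two-way split."""
    return all(n == 2 for n in split_sizes(tree))


def has_basal_trifurcation(tree):
    """True if the fork at the base of the tree has exactly three branches."""
    return split_sizes(tree)[-1] == 3
```

The first example tree above passes is_strictly_bifurcating and the second passes has_basal_trifurcation, but not the other way round.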
| 318 | <P> |
|---|
| 319 | A feature of the nucleotide sequence programs other than DNAMOVE |
|---|
| 320 | is that they save time and computer memory space by recognizing sites |
|---|
| 321 | at which the pattern of bases is the same, and doing their computation only |
|---|
| 322 | once. Thus if we have only four species but a large number of sites, there |
|---|
| 323 | are (ignoring ambiguous bases) at most 256 different patterns of |
|---|
| 324 | nucleotides (4 x 4 x 4 x 4) that can occur. The programs automatically |
|---|
| 325 | count how many occurrences there are of each and then need to do only as much |
|---|
| 326 | computation as would be |
|---|
| 327 | needed with 256 sites, even though the number of sites is actually much |
|---|
| 328 | larger. If there are ambiguities (such as Y or R nucleotides), these are also |
|---|
| 329 | handled correctly, and do not cause trouble. The programs store the full |
|---|
| 330 | sequences but reserve other space for bookkeeping only for the distinct |
|---|
| 331 | patterns. This saves space. Thus the programs will run very effectively |
|---|
| 332 | with few species and many sites. With larger numbers of species, |
|---|
| 333 | if rates of evolution are small, many of the sites will be invariant |
|---|
| 334 | (such as having all A's) and thus will mostly have one of four patterns. The |
|---|
| 335 | programs will in this way automatically avoid doing duplicate |
|---|
| 336 | computations for such sites. |
|---|
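<P>
The idea of counting identical site patterns can be sketched in Python as follows. This is not the bookkeeping used inside the programs, only a minimal illustration: the function name site_patterns is hypothetical, and the alignment is assumed to be a list of equal-length strings, one per species.

```python
from collections import Counter


def site_patterns(sequences):
    """Count how often each distinct column (site pattern) occurs."""
    if len({len(s) for s in sequences}) != 1:
        raise ValueError("sequences must all have the same length")
    # zip(*...) walks the alignment column by column; each column becomes
    # a tuple with one symbol per species.
    return Counter(zip(*(s.upper() for s in sequences)))


if __name__ == "__main__":
    alignment = ["AAAC", "AAAG", "AAAT", "AAAA"]
    for pattern, count in site_patterns(alignment).items():
        print("".join(pattern), count)
```

A computation that depends only on the pattern at a site can then be done once per distinct key and weighted by its count. With four species and only the symbols A, C, G and T there can be no more than 4 x 4 x 4 x 4 = 256 keys, however many sites there are.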
| 337 | </BODY> |
|---|
| 338 | </HTML> |
|---|