| 1 | <!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 3.2 Final//EN"> |
|---|
| 2 | <HTML> |
|---|
| 3 | <HEAD> |
|---|
| 4 | <TITLE>protpars</TITLE> |
|---|
| 5 | <META NAME="description" CONTENT="protpars"> |
|---|
| 6 | <META NAME="keywords" CONTENT="protpars"> |
|---|
| 7 | <META NAME="resource-type" CONTENT="document"> |
|---|
| 8 | <META NAME="distribution" CONTENT="global"> |
|---|
| 9 | <META HTTP-EQUIV="Content-Type" CONTENT="text/html; charset=iso-8859-1"> |
|---|
| 10 | </HEAD> |
|---|
| 11 | <BODY BGCOLOR="#ccffff"> |
|---|
| 12 | <DIV ALIGN=RIGHT> |
|---|
| 13 | version 3.6 |
|---|
| 14 | </DIV> |
|---|
| 15 | <P> |
|---|
| 16 | <DIV ALIGN=CENTER> |
|---|
| 17 | <H1>PROTPARS -- Protein Sequence Parsimony Method</H1> |
|---|
| 18 | </DIV> |
|---|
| 19 | <P> |
|---|
| 20 | © Copyright 1986-2002 by the University of |
|---|
| 21 | Washington. Written by Joseph Felsenstein. Permission is granted to copy |
|---|
| 22 | this document provided that no fee is charged for it and that this copyright |
|---|
| 23 | notice is not removed. |
|---|
| 24 | <P> |
|---|
| 25 | </EM> |
|---|
| 26 | <P> |
|---|
| 27 | This program infers an unrooted phylogeny from protein sequences, using a |
|---|
| 28 | new method intermediate between the approaches of Eck and Dayhoff (1966) and |
|---|
| 29 | Fitch (1971). Eck and Dayhoff (1966) allowed any amino acid to change to |
|---|
| 30 | any other, and counted the number of such changes needed to evolve the |
|---|
| 31 | protein sequences on each given phylogeny. This has the problem that it |
|---|
| 32 | allows replacements which are not consistent with the genetic code, counting |
|---|
| 33 | them equally with replacements that are consistent. Fitch, on the other hand, |
|---|
| 34 | counted the minimum number of nucleotide substitutions that would be |
|---|
| 35 | needed to achieve the given protein sequences. This counts silent |
|---|
| 36 | changes equally with those that change the amino acid. |
|---|
| 37 | <P> |
|---|
| 38 | The present method insists that any changes of amino acid be consistent |
|---|
| 39 | with the genetic code so that, for example, lysine is allowed to change |
|---|
| 40 | to methionine but not to proline. However, changes between two amino acids |
|---|
| 41 | via a third are allowed and counted as two changes if each of the two |
|---|
| 42 | replacements is individually allowed. This sometimes allows changes that |
|---|
| 43 | at first sight you would think should be outlawed. Thus we can change from |
|---|
| 44 | phenylalanine to glutamine via leucine in two steps |
|---|
| 45 | total. Consulting the genetic code, you will find that there is a leucine |
|---|
| 46 | codon one step away from a phenylalanine codon, and a leucine codon one |
|---|
| 47 | step away from glutamine. But they are not the same leucine codon. It |
|---|
| 48 | actually takes three base substitutions to get from either of the |
|---|
| 49 | phenylalanine codons TTT and TTC to either of the glutamine codons |
|---|
| 50 | CAA or CAG. Why then does this program count only two? The answer |
|---|
| 51 | is that recent DNA sequence comparisons seem to show that synonymous |
|---|
| 52 | changes are considerably faster and easier than ones that change the |
|---|
| 53 | amino acid. We are assuming that, in effect, synonymous changes occur |
|---|
| 54 | so much more readily that they need not be counted. Thus, in the chain |
|---|
| 55 | of changes TTT (Phe) --> CTT (Leu) --> CTA (Leu) --> CAA (Gln), the middle |
|---|
| 56 | one is not counted because it does not change the amino acid (leucine). |
|---|
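The counting rule described above can be checked with a small sketch. It is an illustration only, not the program's own bit-set bookkeeping: it assumes the universal genetic code, treats each codon as a node, charges one step for a single-base change that alters the amino acid and nothing for a synonymous one, and (as a further assumption) does not route paths through stop codons. The function names `min_base_changes` and `replacement_steps` are hypothetical.

```python
import heapq
from itertools import product

BASES = "TCAG"
# Universal genetic code, codons enumerated in TCAG order (TTT, TTC, TTA, ...).
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def neighbors(codon):
    """Yield every codon that differs from `codon` at exactly one base."""
    for i, base in product(range(3), BASES):
        if base != codon[i]:
            yield codon[:i] + base + codon[i + 1:]


def min_base_changes(aa1, aa2):
    """Fewest base substitutions between any codon of aa1 and any codon of aa2."""
    return min(
        sum(x != y for x, y in zip(c1, c2))
        for c1 in CODE if CODE[c1] == aa1
        for c2 in CODE if CODE[c2] == aa2
    )


def replacement_steps(aa1, aa2):
    """Fewest amino-acid-changing steps from aa1 to aa2 (synonymous steps are free)."""
    dist = {c: 0 for c in CODE if CODE[c] == aa1}
    heap = [(0, c) for c in dist]
    heapq.heapify(heap)
    while heap:
        d, codon = heapq.heappop(heap)
        if d > dist[codon]:
            continue  # stale queue entry
        if CODE[codon] == aa2:
            return d
        for nxt in neighbors(codon):
            if CODE[nxt] == "*":
                continue  # assumption: no detour through a stop codon
            nd = d + (CODE[nxt] != CODE[codon])  # synonymous change costs 0
            if nd < dist.get(nxt, float("inf")):
                dist[nxt] = nd
                heapq.heappush(heap, (nd, nxt))
    return None


print(min_base_changes("F", "Q"))   # base substitutions, Phe to Gln
print(replacement_steps("F", "Q"))  # counted steps, Phe to Gln
print(replacement_steps("K", "M"))  # Lys to Met
print(replacement_steps("K", "P"))  # Lys to Pro
```

Under these assumptions the sketch reports three base substitutions but only two counted steps for phenylalanine to glutamine, one step for lysine to methionine, and two for lysine to proline, matching the reasoning in the text.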
| 57 | <P> |
|---|
| 58 | To maintain consistency with the genetic code, it is necessary for the |
|---|
| 59 | program internally to treat serine as two separate states (ser1 and ser2) |
|---|
| 60 | since the two groups of serine codons are not adjacent in the |
|---|
| 61 | code. Changes to the state "deletion" are counted as three steps to prevent the |
|---|
| 62 | algorithm from assuming unnecessary deletions. The state "unknown" is |
|---|
| 63 | simply taken to mean that the amino acid, which has not been determined, |
|---|
| 64 | will in each part of a tree that is evaluated be assumed to be whichever one |
|---|
| 65 | causes the fewest steps. |
|---|
| 66 | <P> |
|---|
| 67 | The assumptions of this method (which has not been described in the |
|---|
| 68 | literature) are thus something like this: |
|---|
| 69 | <P> |
|---|
| 70 | <OL> |
|---|
| 71 | <LI>Change in different sites is independent. |
|---|
| 72 | <LI>Change in different lineages is independent. |
|---|
| 73 | <LI>The probability of a base substitution that changes the amino |
|---|
| 74 | acid sequence is small over the lengths of time involved in |
|---|
| 75 | a branch of the phylogeny. |
|---|
| 76 | <LI>The expected amounts of change in different branches of the phylogeny |
|---|
| 77 | do not vary by so much that two changes in a high-rate branch |
|---|
| 78 | are more probable than one change in a low-rate branch. |
|---|
| 79 | <LI>The expected amounts of change do not vary enough among sites that two |
|---|
| 80 | changes in one site are more probable than one change in another. |
|---|
| 81 | <LI>The probability of a base change that is synonymous is much higher |
|---|
| 82 | than the probability of a change that is not synonymous. |
|---|
| 83 | </OL> |
|---|
| 84 | <P> |
|---|
| 85 | That these are the assumptions of parsimony methods has been documented |
|---|
| 86 | in a series of papers of mine (1973a, 1978b, 1979, 1981b, 1983b, 1988b). For |
|---|
| 87 | an opposing view arguing that the parsimony methods make no substantive |
|---|
| 88 | assumptions such as these, see the works by Farris (1983) and Sober (1983a, |
|---|
| 89 | 1983b, 1988), but also read the exchange between Felsenstein and Sober (1986). |
|---|
| 90 | <P> |
|---|
| 91 | The input for the program is fairly standard. The first line contains the |
|---|
| 92 | number of species and the number of amino acid positions (counting any |
|---|
| 93 | stop codons that you want to include). |
|---|
| 94 | <P> |
|---|
| 95 | Next come the species data. Each |
|---|
| 96 | sequence starts on a new line, has a ten-character species name |
|---|
| 97 | that must be blank-filled to be of that length, followed immediately |
|---|
| 98 | by the species data in the one-letter code. The sequences must be in |
|---|
| 99 | either the "interleaved" or the "sequential" format |
|---|
| 100 | described in the Molecular Sequence Programs document. The I option |
|---|
| 101 | selects between them. The sequences can have internal |
|---|
| 102 | blanks, but there must be no extra blanks at the end of a |
|---|
| 103 | line. Note that a blank is not a valid symbol for a deletion. |
|---|
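As a rough illustration of the layout just described, the following sketch reads a small data set in which every sequence fits on a single line after its name. This is a simplification of the "sequential" format, not a full reader for either format, and the function name `parse_sequential` is hypothetical.

```python
def parse_sequential(text):
    """Parse a simplified sequential-format data set into {name: sequence}."""
    lines = [ln for ln in text.splitlines() if ln.strip()]
    # First line: number of species, then number of amino acid positions.
    nspecies, nsites = map(int, lines[0].split())
    sequences = {}
    for ln in lines[1:1 + nspecies]:
        name = ln[:10].strip()           # ten-character, blank-filled name
        seq = ln[10:].replace(" ", "")   # internal blanks are allowed
        if len(seq) != nsites:
            raise ValueError(f"{name}: expected {nsites} sites, got {len(seq)}")
        sequences[name] = seq
    return sequences


EXAMPLE = """\
    5   10
Alpha     ABCDEFGHIK
Beta      AB--EFGHIK
Gamma     ?BCDSFG*??
Delta     CIKDEFGHIK
Epsilon   DIKDEFGHIK
"""

for name, seq in parse_sequential(EXAMPLE).items():
    print(f"{name:<10}{seq}")
```

The fixed ten-column slice is what makes the blank-filling of names matter: a shorter name without padding would cause the first residues to be read as part of the name.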
| 104 | <P> |
|---|
| 105 | The protein sequences are given in the one-letter code |
|---|
| 106 | described in the <A HREF="sequence.html">Molecular Sequence Programs documentation file</A>. Note that |
|---|
| 107 | if two polypeptide chains are being used that are of different length |
|---|
| 108 | owing to one terminating before the other, they should be coded as (say) |
|---|
| 109 | <P><PRE> |
|---|
| 110 | HIINMA*???? |
|---|
| 111 | HIPNMGVWABT |
|---|
| 112 | </PRE><P> |
|---|
| 113 | since after the stop codon we do not definitely know that |
|---|
| 114 | there has been a deletion, and do not know what amino acid would |
|---|
| 115 | have been there. If DNA studies tell us that there is |
|---|
| 116 | DNA sequence in that region, then we could use "X" rather than "?". Note |
|---|
| 117 | that "X" means an unknown amino acid, but definitely an amino acid, |
|---|
| 118 | while "?" could mean either that or a deletion. The distinction is often |
|---|
| 119 | significant in regions where there are deletions: one may want to encode |
|---|
| 120 | a deletion of six positions as "-?????" since that way the program will count only |
|---|
| 121 | one deletion, not six deletion events, when the deletion arises. However, |
|---|
| 122 | if there are overlapping deletions it may not be so easy to know what |
|---|
| 123 | coding is correct. |
|---|
| 124 | <P> |
|---|
| 125 | One will usually want to |
|---|
| 126 | use "?" after a stop codon, if one does not know what amino acid is there. If |
|---|
| 127 | the DNA sequence has been observed there, one probably ought to resist |
|---|
| 128 | putting in the amino acids that this DNA would code for, and one should use |
|---|
| 129 | "X" instead, because under the assumptions implicit in this parsimony |
|---|
| 130 | method, changes to any noncoding sequence are much easier than |
|---|
| 131 | changes in a coding region that change the amino acid, so that they |
|---|
| 132 | shouldn't be counted anyway! |
|---|
| 133 | <P> |
|---|
| 134 | The form of the input, including any user-defined trees, |
|---|
| 135 | is the standard one described in the main documentation file. For the U option |
|---|
| 136 | the tree |
|---|
| 137 | provided must be a rooted bifurcating tree, with the root placed anywhere |
|---|
| 138 | you want, since that root placement does not affect anything. |
|---|
| 139 | <P> |
|---|
| 140 | The options are selected using an interactive menu. The menu looks like this: |
|---|
| 141 | <P> |
|---|
| 142 | <TABLE><TR><TD BGCOLOR=white> |
|---|
| 143 | <PRE> |
|---|
| 144 | Protein parsimony algorithm, version 3.6 |
|---|
| 145 | |
|---|
| 146 | Setting for this run: |
|---|
| 147 | U Search for best tree? Yes |
|---|
| 148 | J Randomize input order of sequences? No. Use input order |
|---|
| 149 | O Outgroup root? No, use as outgroup species 1 |
|---|
| 150 | T Use Threshold parsimony? No, use ordinary parsimony |
|---|
| 151 | C Use which genetic code? Universal |
|---|
| 152 | M Analyze multiple data sets? No |
|---|
| 153 | I Input sequences interleaved? Yes |
|---|
| 154 | 0 Terminal type (IBM PC, VT52, ANSI)? (none) |
|---|
| 155 | 1 Print out the data at start of run No |
|---|
| 156 | 2 Print indications of progress of run Yes |
|---|
| 157 | 3 Print out tree Yes |
|---|
| 158 | 4 Print out steps in each site No |
|---|
| 159 | 5 Print sequences at all nodes of tree No |
|---|
| 160 | 6 Write out trees onto tree file? Yes |
|---|
| 161 | |
|---|
| 162 | Are these settings correct? (type Y or the letter for one to change) |
|---|
| 163 | |
|---|
| 164 | </PRE> |
|---|
| 165 | </TD></TR></TABLE> |
|---|
| 166 | <P> |
|---|
| 167 | The user either types "Y" (followed, of course, by a carriage-return) |
|---|
| 168 | if the settings shown are to be accepted, or the letter or digit corresponding |
|---|
| 169 | to an option that is to be changed. |
|---|
| 170 | <P> |
|---|
| 171 | The options U, J, O, T, W, M, and 0 are the usual ones. They are described in |
|---|
| 172 | the main documentation file of this package. Option I is the same as in |
|---|
| 173 | other molecular sequence programs and is described in the documentation file |
|---|
| 174 | for the sequence programs. Option C allows the user to select among various |
|---|
| 175 | nuclear and mitochondrial genetic codes. There is no provision for coping |
|---|
| 176 | with data where different genetic codes have been used in different |
|---|
| 177 | organisms. |
|---|
| 178 | <P> |
|---|
| 179 | In the U (User tree) option, the trees should |
|---|
| 180 | not be preceded by a line with the number of trees on it. |
|---|
| 181 | <P> |
|---|
| 182 | Output is standard: if option 1 is toggled on, the data is printed out, |
|---|
| 183 | with the convention that "." means "the same as in the first species". |
|---|
| 184 | Then comes a list of equally parsimonious trees, and (if option 4 is |
|---|
| 185 | toggled on) a table of the |
|---|
| 186 | number of changes of state required in each position. If option 5 is toggled |
|---|
| 187 | on, a table is printed |
|---|
| 188 | out after each tree, showing for each branch whether there are known to be |
|---|
| 189 | changes in the branch, and what the states are inferred to have been at the |
|---|
| 190 | top end of the branch. If the inferred state is a "?" there will be multiple |
|---|
| 191 | equally-parsimonious assignments of states; the user must work these out for |
|---|
| 192 | themselves by hand. If option 6 is left in its default state the trees |
|---|
| 193 | found will be written to a tree file, so that they are available to be used |
|---|
| 194 | in other programs. |
|---|
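The "." convention used when the data are printed can be illustrated with a short sketch. It is not the program's own output routine, and the function name `dot_format` is hypothetical.

```python
def dot_format(reference, sequence):
    """Replace each state that matches the first species with a '.'."""
    return "".join("." if s == r else s for r, s in zip(reference, sequence))


first = "ABCDEFGHIK"            # first species in the data set
print(first)                    # the first species is printed in full
print(dot_format(first, "AB--EFGHIK"))
print(dot_format(first, "CIKDEFGHIK"))
```

Only the positions that differ from the first species remain visible, which makes variable sites easy to spot in the listing.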
| 195 | <P> |
|---|
| 196 | If the U (User Tree) option is used and more than one tree is supplied, the |
|---|
| 197 | program also performs a statistical test of each of these trees against the |
|---|
| 198 | best tree. This test is a version of the test proposed by |
|---|
| 199 | Alan Templeton (1983) and evaluated in a test case by me (1985a). It is |
|---|
| 200 | closely parallel to a test using log likelihood differences |
|---|
| 201 | due to Kishino and Hasegawa (1989), and uses the mean |
|---|
| 202 | and variance of |
|---|
| 203 | step differences between trees, taken across positions. If the mean |
|---|
| 204 | differs from zero by more than 1.96 standard deviations, the trees are declared |
|---|
| 205 | significantly different. The program |
|---|
| 206 | prints out a table of the steps for each tree, the differences of |
|---|
| 207 | each from the best one, the variance of that quantity as determined by |
|---|
| 208 | the step differences at individual positions, and a conclusion as to |
|---|
| 209 | whether that tree is or is not significantly worse than the best one. |
|---|
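The arithmetic of this comparison can be sketched as follows. This is a minimal illustration, assuming that the variance of the total step difference is estimated as the number of positions times the sample variance of the per-position differences; the program's exact computation may differ in detail. The function name `compare_trees` and the step counts for the second tree are hypothetical.

```python
import math


def compare_trees(steps_best, steps_other, z=1.96):
    """Return (total difference, its standard deviation, significantly worse?)."""
    if len(steps_best) != len(steps_other):
        raise ValueError("both trees need a step count for every position")
    n = len(steps_best)
    diffs = [o - b for b, o in zip(steps_best, steps_other)]
    total = sum(diffs)
    mean = total / n
    # Sample variance of the per-position differences (assumed estimator).
    sample_var = sum((d - mean) ** 2 for d in diffs) / (n - 1)
    sd_total = math.sqrt(n * sample_var)
    return total, sd_total, abs(total) > z * sd_total


best = [3, 1, 5, 3, 2, 0, 0, 2, 0, 0]    # steps per position, 16 in total
other = [3, 2, 5, 4, 2, 1, 0, 2, 1, 0]   # hypothetical competing tree
total, sd, worse = compare_trees(best, other)
print(f"difference {total}, standard deviation {sd:.3f}, significantly worse: {worse}")
```

Comparing the total with 1.96 times its standard deviation is equivalent to comparing the mean difference per position with 1.96 times its standard error.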
| 210 | <P> |
|---|
| 211 | The program is derived from MIX but has had some rather elaborate |
|---|
| 212 | bookkeeping using sets of bits installed. It is not a very fast |
|---|
| 213 | program but has been sped up substantially since version 3.2. |
|---|
| 214 | <P> |
|---|
| 215 | <HR> |
|---|
| 216 | <H3>TEST DATA SET</H3> |
|---|
| 217 | <P> |
|---|
| 218 | <TABLE><TR><TD BGCOLOR=white> |
|---|
| 219 | <PRE> |
|---|
| 220 |     5    10 |
|---|
| 221 | Alpha     ABCDEFGHIK |
|---|
| 222 | Beta      AB--EFGHIK |
|---|
| 223 | Gamma     ?BCDSFG*?? |
|---|
| 224 | Delta     CIKDEFGHIK |
|---|
| 225 | Epsilon   DIKDEFGHIK |
|---|
| 226 | </PRE> |
|---|
| 227 | </TD></TR></TABLE> |
|---|
| 228 | <P> |
|---|
| 229 | <HR> |
|---|
| 230 | <P> |
|---|
| 231 | <H3>CONTENTS OF OUTPUT FILE (with all numerical options on)</H3> |
|---|
| 232 | <P> |
|---|
| 233 | <TABLE><TR><TD BGCOLOR=white> |
|---|
| 234 | <PRE> |
|---|
| 235 | |
|---|
| 236 | Protein parsimony algorithm, version 3.6 |
|---|
| 237 | |
|---|
| 238 | |
|---|
| 239 | |
|---|
| 240 | 3 trees in all found |
|---|
| 241 | |
|---|
| 242 | |
|---|
| 243 | |
|---|
| 244 | |
|---|
| 245 | +--------Gamma |
|---|
| 246 | ! |
|---|
| 247 | +--2 +--Epsilon |
|---|
| 248 | ! ! +--4 |
|---|
| 249 | ! +--3 +--Delta |
|---|
| 250 | 1 ! |
|---|
| 251 | ! +-----Beta |
|---|
| 252 | ! |
|---|
| 253 | +-----------Alpha |
|---|
| 254 | |
|---|
| 255 | remember: this is an unrooted tree! |
|---|
| 256 | |
|---|
| 257 | |
|---|
| 258 | requires a total of 16.000 |
|---|
| 259 | |
|---|
| 260 | steps in each position: |
|---|
| 261 | 0 1 2 3 4 5 6 7 8 9 |
|---|
| 262 | *----------------------------------------- |
|---|
| 263 | 0! 3 1 5 3 2 0 0 2 0 |
|---|
| 264 | 10! 0 |
|---|
| 265 | |
|---|
| 266 | From To Any Steps? State at upper node |
|---|
| 267 | ( . means same as in the node below it on tree) |
|---|
| 268 | |
|---|
| 269 | |
|---|
| 270 | 1 ANCDEFGHIK |
|---|
| 271 | 1 2 no .......... |
|---|
| 272 | 2 Gamma yes ?B..S..*?? |
|---|
| 273 | 2 3 yes ..?....... |
|---|
| 274 | 3 4 yes ?IK....... |
|---|
| 275 | 4 Epsilon maybe D......... |
|---|
| 276 | 4 Delta yes C......... |
|---|
| 277 | 3 Beta yes .B--...... |
|---|
| 278 | 1 Alpha maybe .B........ |
|---|
| 279 | |
|---|
| 280 | |
|---|
| 281 | |
|---|
| 282 | |
|---|
| 283 | |
|---|
| 284 | +--Epsilon |
|---|
| 285 | +--4 |
|---|
| 286 | +--3 +--Delta |
|---|
| 287 | ! ! |
|---|
| 288 | +--2 +-----Gamma |
|---|
| 289 | ! ! |
|---|
| 290 | 1 +--------Beta |
|---|
| 291 | ! |
|---|
| 292 | +-----------Alpha |
|---|
| 293 | |
|---|
| 294 | remember: this is an unrooted tree! |
|---|
| 295 | |
|---|
| 296 | |
|---|
| 297 | requires a total of 16.000 |
|---|
| 298 | |
|---|
| 299 | steps in each position: |
|---|
| 300 | 0 1 2 3 4 5 6 7 8 9 |
|---|
| 301 | *----------------------------------------- |
|---|
| 302 | 0! 3 1 5 3 2 0 0 2 0 |
|---|
| 303 | 10! 0 |
|---|
| 304 | |
|---|
| 305 | From To Any Steps? State at upper node |
|---|
| 306 | ( . means same as in the node below it on tree) |
|---|
| 307 | |
|---|
| 308 | |
|---|
| 309 | 1 ANCDEFGHIK |
|---|
| 310 | 1 2 no .......... |
|---|
| 311 | 2 3 maybe ?......... |
|---|
| 312 | 3 4 yes .IK....... |
|---|
| 313 | 4 Epsilon maybe D......... |
|---|
| 314 | 4 Delta yes C......... |
|---|
| 315 | 3 Gamma yes ?B..S..*?? |
|---|
| 316 | 2 Beta yes .B--...... |
|---|
| 317 | 1 Alpha maybe .B........ |
|---|
| 318 | |
|---|
| 319 | |
|---|
| 320 | |
|---|
| 321 | |
|---|
| 322 | |
|---|
| 323 | +--Epsilon |
|---|
| 324 | +-----4 |
|---|
| 325 | ! +--Delta |
|---|
| 326 | +--3 |
|---|
| 327 | ! ! +--Gamma |
|---|
| 328 | 1 +-----2 |
|---|
| 329 | ! +--Beta |
|---|
| 330 | ! |
|---|
| 331 | +-----------Alpha |
|---|
| 332 | |
|---|
| 333 | remember: this is an unrooted tree! |
|---|
| 334 | |
|---|
| 335 | |
|---|
| 336 | requires a total of 16.000 |
|---|
| 337 | |
|---|
| 338 | steps in each position: |
|---|
| 339 | 0 1 2 3 4 5 6 7 8 9 |
|---|
| 340 | *----------------------------------------- |
|---|
| 341 | 0! 3 1 5 3 2 0 0 2 0 |
|---|
| 342 | 10! 0 |
|---|
| 343 | |
|---|
| 344 | From To Any Steps? State at upper node |
|---|
| 345 | ( . means same as in the node below it on tree) |
|---|
| 346 | |
|---|
| 347 | |
|---|
| 348 | 1 ANCDEFGHIK |
|---|
| 349 | 1 3 no .......... |
|---|
| 350 | 3 4 yes ?IK....... |
|---|
| 351 | 4 Epsilon maybe D......... |
|---|
| 352 | 4 Delta yes C......... |
|---|
| 353 | 3 2 no .......... |
|---|
| 354 | 2 Gamma yes ?B..S..*?? |
|---|
| 355 | 2 Beta yes .B--...... |
|---|
| 356 | 1 Alpha maybe .B........ |
|---|
| 357 | |
|---|
| 358 | |
|---|
| 359 | </PRE> |
|---|
| 360 | </TD></TR></TABLE> |
|---|
| 361 | </BODY> |
|---|
| 362 | </HTML> |
|---|