1 | <!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 3.2 Final//EN"> |
---|
2 | <HTML> |
---|
3 | <HEAD> |
---|
4 | <TITLE>protpars</TITLE> |
---|
5 | <META NAME="description" CONTENT="protpars"> |
---|
6 | <META NAME="keywords" CONTENT="protpars"> |
---|
7 | <META NAME="resource-type" CONTENT="document"> |
---|
8 | <META NAME="distribution" CONTENT="global"> |
---|
9 | <META HTTP-EQUIV="Content-Type" CONTENT="text/html; charset=iso-8859-1"> |
---|
10 | </HEAD> |
---|
11 | <BODY BGCOLOR="#ccffff"> |
---|
12 | <DIV ALIGN=RIGHT> |
---|
13 | version 3.6 |
---|
14 | </DIV> |
---|
15 | <P> |
---|
16 | <DIV ALIGN=CENTER> |
---|
17 | <H1>PROTPARS -- Protein Sequence Parsimony Method</H1> |
---|
18 | </DIV> |
---|
19 | <P> |
---|
20 | © Copyright 1986-2002 by the University of |
---|
21 | Washington. Written by Joseph Felsenstein. Permission is granted to copy |
---|
22 | this document provided that no fee is charged for it and that this copyright |
---|
23 | notice is not removed. |
---|
24 | <P> |
---|
25 | |
---|
26 | <P> |
---|
27 | This program infers an unrooted phylogeny from protein sequences, using a |
---|
28 | new method intermediate between the approaches of Eck and Dayhoff (1966) and |
---|
29 | Fitch (1971). Eck and Dayhoff (1966) allowed any amino acid to change to |
---|
30 | any other, and counted the number of such changes needed to evolve the |
---|
31 | protein sequences on each given phylogeny. This has the problem that it |
---|
32 | allows replacements which are not consistent with the genetic code, counting |
---|
33 | them equally with replacements that are consistent. Fitch, on the other hand, |
---|
34 | counted the minimum number of nucleotide substitutions that would be |
---|
35 | needed to achieve the given protein sequences. This counts silent |
---|
36 | changes equally with those that change the amino acid. |
---|
37 | <P> |
---|
38 | The present method insists that any changes of amino acid be consistent |
---|
39 | with the genetic code so that, for example, lysine is allowed to change |
---|
40 | to methionine but not to proline. However, changes between two amino acids |
---|
41 | via a third are allowed and counted as two changes if each of the two |
---|
42 | replacements is individually allowed. This sometimes allows changes that |
---|
43 | at first sight you would think should be outlawed. Thus we can change from |
---|
44 | phenylalanine to glutamine via leucine in two steps |
---|
45 | total. Consulting the genetic code, you will find that there is a leucine |
---|
46 | codon one step away from a phenylalanine codon, and a leucine codon one |
---|
47 | step away from glutamine. But they are not the same leucine codon. It |
---|
48 | actually takes three base substitutions to get from either of the |
---|
49 | phenylalanine codons TTT and TTC to either of the glutamine codons |
---|
50 | CAA or CAG. Why then does this program count only two? The answer |
---|
51 | is that recent DNA sequence comparisons seem to show that synonymous |
---|
52 | changes are considerably faster and easier than ones that change the |
---|
53 | amino acid. We are assuming that, in effect, synonymous changes occur |
---|
54 | so much more readily that they need not be counted. Thus, in the chain |
---|
55 | of changes TTT (Phe) --> CTT (Leu) --> CTA (Leu) --> CAA (Gln), the middle |
---|
56 | one is not counted because it does not change the amino acid (leucine). |
---|
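<P>
The counting rule above can be illustrated with a minimal Python sketch. This is not the program's own code. It assumes the universal genetic code, treats synonymous base changes as free, charges one step for each base change that alters the amino acid, and (a further assumption made only for this sketch) does not let a path pass through a stop codon. The names CODE, neighbors and min_steps are hypothetical.

```python
from collections import deque
from itertools import product

# Universal genetic code, codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def neighbors(codon):
    """Yield every codon that differs from `codon` at exactly one base."""
    for i in range(3):
        for base in BASES:
            if base != codon[i]:
                yield codon[:i] + base + codon[i + 1:]


def min_steps(src, dst):
    """Fewest amino-acid-changing substitutions from src to dst (one-letter codes).

    Synonymous substitutions cost 0, all others cost 1 (a 0-1 breadth-first search).
    Assumption of this sketch: paths may not pass through stop codons.
    """
    dist = {c: 0 for c, aa in CODE.items() if aa == src}
    queue = deque(dist)
    while queue:
        codon = queue.popleft()
        for nxt in neighbors(codon):
            if CODE[nxt] == "*":
                continue
            step = int(CODE[nxt] != CODE[codon])
            cost = dist[codon] + step
            if cost < dist.get(nxt, float("inf")):
                dist[nxt] = cost
                # Free moves go to the front of the queue, costly ones to the back.
                if step == 0:
                    queue.appendleft(nxt)
                else:
                    queue.append(nxt)
    return min(dist.get(c, float("inf")) for c, aa in CODE.items() if aa == dst)


print(min_steps("K", "M"))  # lysine to methionine
print(min_steps("K", "P"))  # lysine to proline
print(min_steps("F", "Q"))  # phenylalanine to glutamine
```

Under these assumptions lysine reaches methionine in one step, while lysine to proline and phenylalanine to glutamine each need two, the latter by way of leucine as described above.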
57 | <P> |
---|
58 | To maintain consistency with the genetic code, it is necessary for the |
---|
59 | program internally to treat serine as two separate states (ser1 and ser2) |
---|
60 | since the two groups of serine codons are not adjacent in the |
---|
61 | code. Changes to the state "deletion" are counted as three steps to prevent the |
---|
62 | algorithm from assuming unnecessary deletions. The state "unknown" is |
---|
63 | simply taken to mean that the amino acid, which has not been determined, |
---|
64 | will in each part of a tree that is evaluated be assumed to be whichever one |
---|
65 | causes the fewest steps. |
---|
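<P>
To see why serine needs two states, the following Python sketch (an illustration only, assuming the universal genetic code; the helper names are hypothetical) groups the codons of an amino acid into sets that can be reached from one another by single-base synonymous changes.

```python
from itertools import product

# Universal genetic code, codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def one_base_apart(a, b):
    """True if codons a and b differ at exactly one position."""
    return sum(x != y for x, y in zip(a, b)) == 1


def codon_groups(amino_acid):
    """Split the codons of one amino acid into groups connected by
    single-base changes that stay within that amino acid."""
    remaining = sorted(c for c, aa in CODE.items() if aa == amino_acid)
    groups = []
    while remaining:
        group = [remaining.pop()]
        for codon in group:  # the list grows while we walk it
            linked = [c for c in remaining if one_base_apart(codon, c)]
            for c in linked:
                remaining.remove(c)
            group.extend(linked)
        groups.append(sorted(group))
    return sorted(groups)


print(codon_groups("S"))  # serine
print(codon_groups("L"))  # leucine
```

Serine falls into two groups (the four TCN codons, and AGT with AGC), whereas leucine and arginine, which also have six codons each, form a single group and so need no such split.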
66 | <P> |
---|
67 | The assumptions of this method (which has not been described in the |
---|
68 | literature) are thus something like this: |
---|
69 | <P> |
---|
70 | <OL> |
---|
71 | <LI>Change in different sites is independent. |
---|
72 | <LI>Change in different lineages is independent. |
---|
73 | <LI>The probability of a base substitution that changes the amino |
---|
74 | acid sequence is small over the lengths of time involved in |
---|
75 | a branch of the phylogeny. |
---|
76 | <LI>The expected amounts of change in different branches of the phylogeny |
---|
77 | do not vary by so much that two changes in a high-rate branch |
---|
78 | are more probable than one change in a low-rate branch. |
---|
79 | <LI>The expected amounts of change do not vary enough among sites that two |
---|
80 | changes in one site are more probable than one change in another. |
---|
81 | <LI>The probability of a base change that is synonymous is much higher |
---|
82 | than the probability of a change that is not synonymous. |
---|
83 | </OL> |
---|
84 | <P> |
---|
85 | That these are the assumptions of parsimony methods has been documented |
---|
86 | in a series of papers of mine: (1973a, 1978b, 1979, 1981b, 1983b, 1988b). For |
---|
87 | an opposing view arguing that the parsimony methods make no substantive |
---|
88 | assumptions such as these, see the works by Farris (1983) and Sober (1983a, |
---|
89 | 1983b, 1988), but also read the exchange between Felsenstein and Sober (1986). |
---|
90 | <P> |
---|
91 | The input for the program is fairly standard. The first line contains the |
---|
92 | number of species and the number of amino acid positions (counting any |
---|
93 | stop codons that you want to include). |
---|
94 | <P> |
---|
95 | Next come the species data. Each |
---|
96 | sequence starts on a new line, has a ten-character species name |
---|
97 | that must be blank-filled to be of that length, followed immediately |
---|
98 | by the species data in the one-letter code. The sequences must be in |
---|
99 | either the "interleaved" or the "sequential" format |
---|
100 | described in the Molecular Sequence Programs document. The I option |
---|
101 | selects between them. The sequences can have internal |
---|
102 | blanks in the sequence but there must be no extra blanks at the end of the |
---|
103 | terminated line. Note that a blank is not a valid symbol for a deletion. |
---|
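<P>
As a rough illustration of this layout, here is a small Python sketch that writes one line per species with the name blank-filled to ten characters. The function name and its checks are hypothetical and not part of the package; the sequences are those of the test data set shown later in this document.

```python
def format_input(records, name_width=10):
    """Return input-file text: a header line with the number of species and
    positions, then one line per species with a blank-filled name."""
    n_sites = len(next(iter(records.values())))
    lines = [f"{len(records)} {n_sites}"]
    for name, seq in records.items():
        if len(name) > name_width:
            raise ValueError(f"name longer than {name_width} characters: {name}")
        if len(seq) != n_sites:
            raise ValueError(f"sequence for {name} has the wrong length")
        # ljust pads the name with blanks; rstrip guards against trailing blanks.
        lines.append((name.ljust(name_width) + seq).rstrip())
    return "\n".join(lines) + "\n"


records = {
    "Alpha": "ABCDEFGHIK",
    "Beta": "AB--EFGHIK",
    "Gamma": "?BCDSFG*??",
    "Delta": "CIKDEFGHIK",
    "Epsilon": "DIKDEFGHIK",
}
print(format_input(records), end="")
```

The sequence always starts in column 11, so a short name such as Beta is followed by six blanks before its first residue.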
104 | <P> |
---|
105 | The protein sequences are given by the one-letter code |
---|
106 | described in the <A HREF="sequence.html">Molecular Sequence Programs documentation file</A>. Note that |
---|
107 | if two polypeptide chains are being used that are of different length |
---|
108 | owing to one terminating before the other, they should be coded as (say) |
---|
109 | <P><PRE> |
---|
110 | HIINMA*???? |
---|
111 | HIPNMGVWABT |
---|
112 | </PRE><P> |
---|
113 | since after the stop codon we do not definitely know that |
---|
114 | there has been a deletion, and do not know what amino acid would |
---|
115 | have been there. If DNA studies tell us that there is |
---|
116 | DNA sequence in that region, then we could use "X" rather than "?". Note |
---|
117 | that "X" means an unknown amino acid, but definitely an amino acid, |
---|
118 | while "?" could mean either that or a deletion. The distinction is often |
---|
119 | significant in regions where there are deletions: one may want to encode |
---|
120 | a six-residue deletion as "-?????" since that way the program will only count |
---|
121 | one deletion, not six deletion events, when the deletion arises. However, |
---|
122 | if there are overlapping deletions it may not be so easy to know what |
---|
123 | coding is correct. |
---|
124 | <P> |
---|
125 | One will usually want to |
---|
126 | use "?" after a stop codon, if one does not know what amino acid is there. If |
---|
127 | the DNA sequence has been observed there, one probably ought to resist |
---|
128 | putting in the amino acids that this DNA would code for, and one should use |
---|
129 | "X" instead, because under the assumptions implicit in this parsimony |
---|
130 | method, changes to any noncoding sequence are much easier than |
---|
131 | changes in a coding region that change the amino acid, so that they |
---|
132 | shouldn't be counted anyway! |
---|
133 | <P> |
---|
134 | The form of the input |
---|
135 | is the standard one described in the main documentation file. For the U option |
---|
136 | the tree |
---|
137 | provided must be a rooted bifurcating tree, with the root placed anywhere |
---|
138 | you want, since that root placement does not affect anything. |
---|
139 | <P> |
---|
140 | The options are selected using an interactive menu. The menu looks like this: |
---|
141 | <P> |
---|
142 | <TABLE><TR><TD BGCOLOR=white> |
---|
143 | <PRE> |
---|
144 | Protein parsimony algorithm, version 3.6 |
---|
145 | |
---|
146 | Setting for this run: |
---|
147 | U Search for best tree? Yes |
---|
148 | J Randomize input order of sequences? No. Use input order |
---|
149 | O Outgroup root? No, use as outgroup species 1 |
---|
150 | T Use Threshold parsimony? No, use ordinary parsimony |
---|
151 | C Use which genetic code? Universal |
---|
152 | M Analyze multiple data sets? No |
---|
153 | I Input sequences interleaved? Yes |
---|
154 | 0 Terminal type (IBM PC, VT52, ANSI)? (none) |
---|
155 | 1 Print out the data at start of run No |
---|
156 | 2 Print indications of progress of run Yes |
---|
157 | 3 Print out tree Yes |
---|
158 | 4 Print out steps in each site No |
---|
159 | 5 Print sequences at all nodes of tree No |
---|
160 | 6 Write out trees onto tree file? Yes |
---|
161 | |
---|
162 | Are these settings correct? (type Y or the letter for one to change) |
---|
163 | |
---|
164 | </PRE> |
---|
165 | </TD></TR></TABLE> |
---|
166 | <P> |
---|
167 | The user either types "Y" (followed, of course, by a carriage-return) |
---|
168 | if the settings shown are to be accepted, or the letter or digit corresponding |
---|
169 | to an option that is to be changed. |
---|
170 | <P> |
---|
171 | The options U, J, O, T, M, and 0 are the usual ones. They are described in |
---|
172 | the main documentation file of this package. Option I is the same as in |
---|
173 | other molecular sequence programs and is described in the documentation file |
---|
174 | for the sequence programs. Option C allows the user to select among various |
---|
175 | nuclear and mitochondrial genetic codes. There is no provision for coping |
---|
176 | with data where different genetic codes have been used in different |
---|
177 | organisms. |
---|
178 | <P> |
---|
179 | In the U (User tree) option, the trees should |
---|
180 | not be preceded by a line with the number of trees on it. |
---|
181 | <P> |
---|
182 | Output is standard: if option 1 is toggled on, the data is printed out, |
---|
183 | with the convention that "." means "the same as in the first species". |
---|
184 | Then comes a list of equally parsimonious trees, and (if option 4 is |
---|
185 | toggled on) a table of the |
---|
186 | number of changes of state required in each position. If option 5 is toggled |
---|
187 | on, a table is printed |
---|
188 | out after each tree, showing for each branch whether there are known to be |
---|
189 | changes in the branch, and what the states are inferred to have been at the |
---|
190 | top end of the branch. If the inferred state is a "?" there will be multiple |
---|
191 | equally-parsimonious assignments of states; the user must work these out for |
---|
192 | themselves by hand. If option 6 is left in its default state the trees |
---|
193 | found will be written to a tree file, so that they are available to be used |
---|
194 | in other programs. |
---|
195 | <P> |
---|
196 | If the U (User Tree) option is used and more than one tree is supplied, the |
---|
197 | program also performs a statistical test of each of these trees against the |
---|
198 | best tree. This test is a version of the test proposed by |
---|
199 | Alan Templeton (1983) and evaluated in a test case by me (1985a). It is |
---|
200 | closely parallel to a test using log likelihood differences |
---|
201 | due to Kishino and Hasegawa (1989), and uses the mean |
---|
202 | and variance of |
---|
203 | step differences between trees, taken across positions. If the mean |
---|
204 | is more than 1.96 standard deviations different then the trees are declared |
---|
205 | significantly different. The program |
---|
206 | prints out a table of the steps for each tree, the differences of |
---|
207 | each from the best one, the variance of that quantity as determined by |
---|
208 | the step differences at individual positions, and a conclusion as to |
---|
209 | whether that tree is or is not significantly worse than the best one. |
---|
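<P>
The arithmetic of such a paired-sites comparison can be sketched in Python as follows. This is an illustration under stated assumptions, not the program's own code: it takes the per-position step counts of two trees, treats positions as independent, estimates the variance of the total difference as the number of positions times the sample variance of the per-position differences, and applies the 1.96 criterion. The function name and the step counts are made up for the example.

```python
import math
from statistics import variance


def compare_trees(steps_a, steps_b, z_crit=1.96):
    """Compare two trees from their per-position step counts.

    Returns (total difference, its estimated standard deviation,
    whether the difference exceeds z_crit standard deviations).
    """
    diffs = [a - b for a, b in zip(steps_a, steps_b)]
    n = len(diffs)
    total = sum(diffs)
    # Variance of a sum of n independent differences: n times the sample variance.
    sd_total = math.sqrt(n * variance(diffs))
    significant = sd_total > 0 and abs(total) > z_crit * sd_total
    return total, sd_total, significant


# Hypothetical per-position step counts for two user trees.
tree_a = [1, 1, 2, 0, 1, 3, 1, 0, 2, 1]
tree_b = [2, 1, 3, 0, 2, 3, 2, 1, 2, 2]
total, sd_total, significant = compare_trees(tree_a, tree_b)
print(total, round(sd_total, 3), significant)
```

Comparing the total difference with its standard deviation is equivalent to comparing the mean difference with the standard error of the mean, since both are scaled by the number of positions.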
210 | <P> |
---|
211 | The program is derived from MIX but has had some rather elaborate |
---|
212 | bookkeeping using sets of bits installed. It is not a very fast |
---|
213 | program but has been sped up substantially over version 3.2. |
---|
214 | <P> |
---|
215 | <HR> |
---|
216 | <H3>TEST DATA SET</H3> |
---|
217 | <P> |
---|
218 | <TABLE><TR><TD BGCOLOR=white> |
---|
219 | <PRE> |
---|
220 | 5 10 |
---|
221 | Alpha     ABCDEFGHIK |
---|
222 | Beta      AB--EFGHIK |
---|
223 | Gamma     ?BCDSFG*?? |
---|
224 | Delta     CIKDEFGHIK |
---|
225 | Epsilon   DIKDEFGHIK |
---|
226 | </PRE> |
---|
227 | </TD></TR></TABLE> |
---|
228 | <P> |
---|
229 | <HR> |
---|
230 | <P> |
---|
231 | <H3>CONTENTS OF OUTPUT FILE (with all numerical options on)</H3> |
---|
232 | <P> |
---|
233 | <TABLE><TR><TD BGCOLOR=white> |
---|
234 | <PRE> |
---|
235 | |
---|
236 | Protein parsimony algorithm, version 3.6 |
---|
237 | |
---|
238 | |
---|
239 | |
---|
240 | 3 trees in all found |
---|
241 | |
---|
242 | |
---|
243 | |
---|
244 | |
---|
245 | +--------Gamma |
---|
246 | ! |
---|
247 | +--2 +--Epsilon |
---|
248 | ! ! +--4 |
---|
249 | ! +--3 +--Delta |
---|
250 | 1 ! |
---|
251 | ! +-----Beta |
---|
252 | ! |
---|
253 | +-----------Alpha |
---|
254 | |
---|
255 | remember: this is an unrooted tree! |
---|
256 | |
---|
257 | |
---|
258 | requires a total of 16.000 |
---|
259 | |
---|
260 | steps in each position: |
---|
261 | 0 1 2 3 4 5 6 7 8 9 |
---|
262 | *----------------------------------------- |
---|
263 | 0! 3 1 5 3 2 0 0 2 0 |
---|
264 | 10! 0 |
---|
265 | |
---|
266 | From To Any Steps? State at upper node |
---|
267 | ( . means same as in the node below it on tree) |
---|
268 | |
---|
269 | |
---|
270 | 1 ANCDEFGHIK |
---|
271 | 1 2 no .......... |
---|
272 | 2 Gamma yes ?B..S..*?? |
---|
273 | 2 3 yes ..?....... |
---|
274 | 3 4 yes ?IK....... |
---|
275 | 4 Epsilon maybe D......... |
---|
276 | 4 Delta yes C......... |
---|
277 | 3 Beta yes .B--...... |
---|
278 | 1 Alpha maybe .B........ |
---|
279 | |
---|
280 | |
---|
281 | |
---|
282 | |
---|
283 | |
---|
284 | +--Epsilon |
---|
285 | +--4 |
---|
286 | +--3 +--Delta |
---|
287 | ! ! |
---|
288 | +--2 +-----Gamma |
---|
289 | ! ! |
---|
290 | 1 +--------Beta |
---|
291 | ! |
---|
292 | +-----------Alpha |
---|
293 | |
---|
294 | remember: this is an unrooted tree! |
---|
295 | |
---|
296 | |
---|
297 | requires a total of 16.000 |
---|
298 | |
---|
299 | steps in each position: |
---|
300 | 0 1 2 3 4 5 6 7 8 9 |
---|
301 | *----------------------------------------- |
---|
302 | 0! 3 1 5 3 2 0 0 2 0 |
---|
303 | 10! 0 |
---|
304 | |
---|
305 | From To Any Steps? State at upper node |
---|
306 | ( . means same as in the node below it on tree) |
---|
307 | |
---|
308 | |
---|
309 | 1 ANCDEFGHIK |
---|
310 | 1 2 no .......... |
---|
311 | 2 3 maybe ?......... |
---|
312 | 3 4 yes .IK....... |
---|
313 | 4 Epsilon maybe D......... |
---|
314 | 4 Delta yes C......... |
---|
315 | 3 Gamma yes ?B..S..*?? |
---|
316 | 2 Beta yes .B--...... |
---|
317 | 1 Alpha maybe .B........ |
---|
318 | |
---|
319 | |
---|
320 | |
---|
321 | |
---|
322 | |
---|
323 | +--Epsilon |
---|
324 | +-----4 |
---|
325 | ! +--Delta |
---|
326 | +--3 |
---|
327 | ! ! +--Gamma |
---|
328 | 1 +-----2 |
---|
329 | ! +--Beta |
---|
330 | ! |
---|
331 | +-----------Alpha |
---|
332 | |
---|
333 | remember: this is an unrooted tree! |
---|
334 | |
---|
335 | |
---|
336 | requires a total of 16.000 |
---|
337 | |
---|
338 | steps in each position: |
---|
339 | 0 1 2 3 4 5 6 7 8 9 |
---|
340 | *----------------------------------------- |
---|
341 | 0! 3 1 5 3 2 0 0 2 0 |
---|
342 | 10! 0 |
---|
343 | |
---|
344 | From To Any Steps? State at upper node |
---|
345 | ( . means same as in the node below it on tree) |
---|
346 | |
---|
347 | |
---|
348 | 1 ANCDEFGHIK |
---|
349 | 1 3 no .......... |
---|
350 | 3 4 yes ?IK....... |
---|
351 | 4 Epsilon maybe D......... |
---|
352 | 4 Delta yes C......... |
---|
353 | 3 2 no .......... |
---|
354 | 2 Gamma yes ?B..S..*?? |
---|
355 | 2 Beta yes .B--...... |
---|
356 | 1 Alpha maybe .B........ |
---|
357 | |
---|
358 | |
---|
359 | </PRE> |
---|
360 | </TD></TR></TABLE> |
---|
361 | </BODY> |
---|
362 | </HTML> |
---|