<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 3.2 Final//EN">
<HTML>
<HEAD>
<TITLE>protpars</TITLE>
<META NAME="description" CONTENT="protpars">
<META NAME="keywords" CONTENT="protpars">
<META NAME="resource-type" CONTENT="document">
<META NAME="distribution" CONTENT="global">
<META HTTP-EQUIV="Content-Type" CONTENT="text/html; charset=iso-8859-1">
</HEAD>
<BODY BGCOLOR="#ccffff">
<DIV ALIGN=RIGHT>
version 3.6
</DIV>
<P>
<DIV ALIGN=CENTER>
<H1>PROTPARS -- Protein Sequence Parsimony Method</H1>
</DIV>
<P>
&#169; Copyright 1986-2002 by the University of
Washington.  Written by Joseph Felsenstein.  Permission is granted to copy
this document provided that no fee is charged for it and that this copyright
notice is not removed.
<P>
This program infers an unrooted phylogeny from protein sequences, using a
new method intermediate between the approaches of Eck and Dayhoff (1966) and
Fitch (1971).  Eck and Dayhoff (1966) allowed any amino acid to change to
any other, and counted the number of such changes needed to evolve the
protein sequences on each given phylogeny.  This has the problem that it
allows replacements which are not consistent with the genetic code, counting
them equally with replacements that are consistent.  Fitch, on the other hand,
counted the minimum number of nucleotide substitutions that would be
needed to achieve the given protein sequences.  This counts silent
changes equally with those that change the amino acid.
<P>
The present method insists that any changes of amino acid be consistent
with the genetic code so that, for example, lysine is allowed to change
to methionine but not to proline.  However, changes between two amino acids
via a third are allowed and counted as two changes if each of the two
replacements is individually allowed.  This sometimes allows changes that
at first sight you would think should be outlawed.  Thus we can change from
phenylalanine to glutamine via leucine in two steps
total.  Consulting the genetic code, you will find that there is a leucine
codon one step away from a phenylalanine codon, and a leucine codon one
step away from a glutamine codon.  But they are not the same leucine codon.  It
actually takes three base substitutions to get from either of the
phenylalanine codons TTT and TTC to either of the glutamine codons
CAA or CAG.  Why then does this program count only two?  The answer
is that recent DNA sequence comparisons seem to show that synonymous
changes are considerably faster and easier than ones that change the
amino acid.  We are assuming that, in effect, synonymous changes occur
so much more readily that they need not be counted.  Thus, in the chain
of changes  TTT (Phe) --> CTT (Leu) --> CTA (Leu) --> CAA (Gln), the middle
one is not counted because it does not change the amino acid (leucine).
<P>
To maintain consistency with the genetic code, it is necessary for the
program internally to treat serine as two separate states (ser1 and ser2)
since the two groups of serine codons are not adjacent in the
code.  Changes to the state "deletion" are counted as three steps to prevent the
algorithm from assuming unnecessary deletions.  The state "unknown" is
simply taken to mean that the amino acid, which has not been determined,
will in each part of a tree that is evaluated be assumed to be whichever one
causes the fewest steps.
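<P>
The counting rule described in the two paragraphs above can be illustrated
with a short Python sketch.  It is not the program's own implementation: it
simply builds a graph whose nodes are amino acid states, joins two states when
some codon of one is a single base change away from some codon of the other in
the universal code, and counts steps as the shortest path in that graph.  The
names steps, state, S1 and S2 are hypothetical, and treating the stop codon
("*") as an ordinary state is an assumption of the sketch.

```python
from collections import deque
from itertools import product

BASES = "TCAG"
# Universal genetic code, codons enumerated in TCAG order (TTT, TTC, TTA, ...).
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODE = {"".join(c): a for c, a in zip(product(BASES, repeat=3), AMINO)}


def state(codon):
    """Return the amino acid state of a codon, splitting serine in two."""
    aa = CODE[codon]
    if aa == "S":
        # ser1 = TCN codons, ser2 = AGT/AGC codons (labels are hypothetical).
        return "S1" if codon.startswith("TC") else "S2"
    return aa


def neighbours(codon):
    """Yield every codon that differs from `codon` at exactly one base."""
    for pos in range(3):
        for base in BASES:
            if base != codon[pos]:
                yield codon[:pos] + base + codon[pos + 1:]


# Two states are adjacent if any of their codons are one base apart.
ADJACENT = {}
for codon in CODE:
    for other in neighbours(codon):
        a, b = state(codon), state(other)
        if a != b:
            ADJACENT.setdefault(a, set()).add(b)


def steps(start, goal):
    """Breadth-first search: fewest amino acid replacements from start to goal."""
    seen = {start: 0}
    queue = deque([start])
    while queue:
        cur = queue.popleft()
        if cur == goal:
            return seen[cur]
        for nxt in ADJACENT[cur]:
            if nxt not in seen:
                seen[nxt] = seen[cur] + 1
                queue.append(nxt)
    return None


print(steps("K", "M"), steps("K", "P"), steps("F", "Q"), steps("S1", "S2"))
```

Under these assumptions lysine to methionine costs one step, lysine to proline
two, and phenylalanine to glutamine two, as in the text.  The two serine states
are two steps apart, which is why merging them would undercount some paths:
tryptophan to asparagine would appear to need only two steps via serine,
whereas with the split it needs three.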
<P>
The assumptions of this method (which has not been described in the
literature) are thus something like this:
<P>
<OL>
<LI>Change in different sites is independent.
<LI>Change in different lineages is independent.
<LI>The probability of a base substitution that changes the amino
acid sequence is small over the lengths of time involved in
a branch of the phylogeny.
<LI>The expected amounts of change in different branches of the phylogeny
do not vary by so much that two changes in a high-rate branch
are more probable than one change in a low-rate branch.
<LI>The expected amounts of change do not vary enough among sites that two
changes in one site are more probable than one change in another.
<LI>The probability of a base change that is synonymous is much higher
than the probability of a change that is not synonymous.
</OL>
<P>
That these are the assumptions of parsimony methods has been documented
in a series of papers of mine (1973a, 1978b, 1979, 1981b, 1983b, 1988b).  For
an opposing view arguing that the parsimony methods make no substantive
assumptions such as these, see the works by Farris (1983) and Sober (1983a,
1983b, 1988), but also read the exchange between Felsenstein and Sober (1986).
<P>
The input for the program is fairly standard.  The first line contains the
number of species and the number of amino acid positions (counting any
stop codons that you want to include).
<P>
Next come the species data.  Each
sequence starts on a new line, has a ten-character species name
that must be blank-filled to be of that length, followed immediately
by the species data in the one-letter code.  The sequences must be in
either the "interleaved" or the "sequential" format
described in the Molecular Sequence Programs document.  The I option
selects between them.  The sequences can have internal
blanks in the sequence but there must be no extra blanks at the end of the
terminated line.  Note that a blank is not a valid symbol for a deletion.
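<P>
As an illustration of the layout just described, here is a minimal Python
sketch of a reader for the simplest case.  It assumes the sequential format
with each sequence written on a single line; the function name
parse_sequential is hypothetical and the sketch does not handle interleaved
input or sequences continued over several lines.

```python
def parse_sequential(text):
    """Parse a small sequential-format data set into {name: sequence}."""
    lines = [ln for ln in text.splitlines() if ln.strip()]
    # First line: number of species, then number of amino acid positions.
    n_species, n_sites = (int(tok) for tok in lines[0].split())
    records = {}
    for line in lines[1:1 + n_species]:
        name = line[:10].strip()          # ten-character, blank-filled name
        seq = line[10:].replace(" ", "")  # internal blanks are ignored
        if len(seq) != n_sites:
            raise ValueError(f"{name}: expected {n_sites} sites, got {len(seq)}")
        records[name] = seq
    return records


example = """     5    10
Alpha     ABCDEFGHIK
Beta      AB--EFGHIK
Gamma     ?BCDSFG*??
Delta     CIKDEFGHIK
Epsilon   DIKDEFGHIK
"""
data = parse_sequential(example)
print(data["Gamma"])
```

The example input is the test data set shown later in this document.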
<P>
The protein sequences are given by the one-letter code
described in the <A HREF="sequence.html">Molecular Sequence Programs documentation file</A>.  Note that
if two polypeptide chains are being used that are of different length
owing to one terminating before the other, they should be coded as (say)
<P><PRE>
             HIINMA*????
             HIPNMGVWABT
</PRE><P>
since after the stop codon we do not definitely know that
there has been a deletion, and do not know what amino acid would
have been there.  If DNA studies tell us that there is
DNA sequence in that region, then we could use "X" rather than "?".  Note
that "X" means an unknown amino acid, but definitely an amino acid,
while "?" could mean either that or a deletion.  The distinction is often
significant in regions where there are deletions: one may want to encode
a deletion spanning six positions as "-?????" since that way the program will only count
one deletion, not six deletion events, when the deletion arises.  However,
if there are overlapping deletions it may not be so easy to know what
coding is correct.
<P>
One will usually want to
use "?" after a stop codon, if one does not know what amino acid is there.  If
the DNA sequence has been observed there, one probably ought to resist
putting in the amino acids that this DNA would code for, and one should use
"X" instead, because under the assumptions implicit in this parsimony
method, changes to any noncoding sequence are much easier than
changes in a coding region that change the amino acid, so that they
shouldn't be counted anyway!
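<P>
The coding advice for chains that terminate early can be expressed as a small
helper.  This is only a sketch: the function name pad_after_stop and its
dna_observed flag are hypothetical and are not part of the package.

```python
def pad_after_stop(seq, length, dna_observed=False):
    """Pad a chain that ends in a stop codon out to the alignment length.

    Uses "?" (unknown, possibly a deletion) by default, or "X" (unknown but
    definitely an amino acid) when DNA is known to be present in the region.
    """
    if len(seq) > length:
        raise ValueError("sequence is longer than the alignment")
    fill = "X" if dna_observed else "?"
    return seq + fill * (length - len(seq))


print(pad_after_stop("HIINMA*", 11))
print(pad_after_stop("HIINMA*", 11, dna_observed=True))
```

The first call reproduces the coding of the shorter chain in the example above.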
<P>
For the U (user tree) option, the form of the tree information
is the standard one described in the main documentation file.  The tree
provided must be a rooted bifurcating tree, with the root placed anywhere
you want, since that root placement does not affect anything.
<P>
The options are selected using an interactive menu.  The menu looks like this:
<P>
<TABLE><TR><TD BGCOLOR=white>
<PRE>
Protein parsimony algorithm, version 3.6

Setting for this run:
  U                 Search for best tree?  Yes
  J   Randomize input order of sequences?  No. Use input order
  O                        Outgroup root?  No, use as outgroup species  1
  T              Use Threshold parsimony?  No, use ordinary parsimony
  C               Use which genetic code?  Universal
  M           Analyze multiple data sets?  No
  I          Input sequences interleaved?  Yes
  0   Terminal type (IBM PC, VT52, ANSI)?  (none)
  1    Print out the data at start of run  No
  2  Print indications of progress of run  Yes
  3                        Print out tree  Yes
  4          Print out steps in each site  No
  5  Print sequences at all nodes of tree  No
  6       Write out trees onto tree file?  Yes

Are these settings correct? (type Y or the letter for one to change)

</PRE>
</TD></TR></TABLE>
<P>
The user either types "Y" (followed, of course, by a carriage-return)
if the settings shown are to be accepted, or the letter or digit corresponding
to an option that is to be changed.
<P>
The options U, J, O, T, W, M, and 0 are the usual ones.  They are described in
the main documentation file of this package.  Option I is the same as in
other molecular sequence programs and is described in the documentation file
for the sequence programs.  Option C allows the user to select among various
nuclear and mitochondrial genetic codes.  There is no provision for coping
with data where different genetic codes have been used in different
organisms.
<P>
In the U (User tree) option, the trees should
not be preceded by a line with the number of trees on it.
<P>
Output is standard: if option 1 is toggled on, the data is printed out,
with the convention that "." means "the same as in the first species".
Then comes a list of equally parsimonious trees, and (if option 4 is
toggled on) a table of the
number of changes of state required in each position.  If option 5 is toggled
on, a table is printed
out after each tree, showing for each branch whether there are known to be
changes in the branch, and what the states are inferred to have been at the
top end of the branch.  If the inferred state is a "?" there will be multiple
equally-parsimonious assignments of states; the user must work these out
by hand.  If option 6 is left in its default state the trees
found will be written to a tree file, so that they are available to be used
in other programs.
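<P>
The "." convention used in the data listing, and in the table of states at
the nodes, amounts to a position-by-position comparison against a reference
sequence.  A minimal Python sketch follows; the function name dotify is
hypothetical.

```python
def dotify(reference, sequence):
    """Replace every character that matches the reference with a dot."""
    return "".join("." if s == r else s for r, s in zip(reference, sequence))


# Node state taken from the sample output, compared with two tip sequences.
print(dotify("ANCDEFGHIK", "?BCDSFG*??"))
print(dotify("ANCDEFGHIK", "ABCDEFGHIK"))
```

With the node state ANCDEFGHIK from the sample output at the end of this
document, the two calls give ?B..S..*?? and .B........, which are the rows
shown there for Gamma and Alpha in the first tree.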
<P>
If the U (User Tree) option is used and more than one tree is supplied, the
program also performs a statistical test of each of these trees against the
best tree.  This test is a version of the test proposed by
Alan Templeton (1983) and evaluated in a test case by me (1985a).  It is
closely parallel to a test using log likelihood differences
due to Kishino and Hasegawa (1989), and uses the mean
and variance of
step differences between trees, taken across positions.  If the mean
difference is more than 1.96 standard deviations away from zero, the trees
are declared significantly different.  The program
prints out a table of the steps for each tree, the differences of
each from the best one, the variance of that quantity as determined by
the step differences at individual positions, and a conclusion as to
whether that tree is or is not significantly worse than the best one.
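<P>
The logic of this test can be sketched in Python as a paired comparison of
per-position step counts.  This is an illustration only: the function name
compare_to_best is hypothetical, and the exact variance estimator used by the
program is not given in this document, so the sketch assumes the usual sample
variance and compares the mean difference with 1.96 times its standard error.

```python
import math


def compare_to_best(steps_best, steps_other, z_crit=1.96):
    """Compare two trees from their per-position step counts.

    Returns (total difference, mean difference, standard error, significant).
    """
    diffs = [o - b for b, o in zip(steps_best, steps_other)]
    n = len(diffs)
    mean = sum(diffs) / n
    # Sample variance of the per-position differences (assumed estimator).
    variance = sum((d - mean) ** 2 for d in diffs) / (n - 1)
    std_error = math.sqrt(variance / n)
    significant = abs(mean) > z_crit * std_error
    return sum(diffs), mean, std_error, significant


# Per-position steps of a best tree from the sample output, and a made-up
# competing tree that needs one extra step at two positions.
best = [3, 1, 5, 3, 2, 0, 0, 2, 0, 0]
other = [3, 2, 5, 3, 3, 0, 0, 2, 0, 0]
print(compare_to_best(best, other))
```

In this made-up comparison the second tree needs two more steps in total, but
the mean difference is only 1.5 standard errors from zero, so it would not be
declared significantly worse.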
<P>
The program is derived from MIX but has had some rather elaborate
bookkeeping using sets of bits installed.  It is not a very fast
program but is speeded up substantially over version 3.2.
<P>
<HR>
<H3>TEST DATA SET</H3>
<P>
<TABLE><TR><TD BGCOLOR=white>
<PRE>
     5    10
Alpha     ABCDEFGHIK
Beta      AB--EFGHIK
Gamma     ?BCDSFG*??
Delta     CIKDEFGHIK
Epsilon   DIKDEFGHIK
</PRE>
</TD></TR></TABLE>
<P>
<HR>
<P>
<H3>CONTENTS OF OUTPUT FILE (with all numerical options on)</H3>
<P>
<TABLE><TR><TD BGCOLOR=white>
<PRE>

Protein parsimony algorithm, version 3.6



     3 trees in all found




     +--------Gamma     
     ! 
  +--2     +--Epsilon   
  !  !  +--4 
  !  +--3  +--Delta     
  1     ! 
  !     +-----Beta     
  ! 
  +-----------Alpha     

  remember: this is an unrooted tree!


requires a total of     16.000

steps in each position:
         0   1   2   3   4   5   6   7   8   9
     *-----------------------------------------
    0!       3   1   5   3   2   0   0   2   0
   10!   0                                   

From    To     Any Steps?    State at upper node
                             ( . means same as in the node below it on tree)


         1                ANCDEFGHIK
  1      2         no     ..........
  2   Gamma        yes    ?B..S..*??
  2      3         yes    ..?.......
  3      4         yes    ?IK.......
  4   Epsilon     maybe   D.........
  4   Delta        yes    C.........
  3   Beta         yes    .B--......
  1   Alpha       maybe   .B........





           +--Epsilon   
        +--4 
     +--3  +--Delta     
     !  ! 
  +--2  +-----Gamma     
  !  ! 
  1  +--------Beta     
  ! 
  +-----------Alpha     

  remember: this is an unrooted tree!


requires a total of     16.000

steps in each position:
         0   1   2   3   4   5   6   7   8   9
     *-----------------------------------------
    0!       3   1   5   3   2   0   0   2   0
   10!   0                                   

From    To     Any Steps?    State at upper node
                             ( . means same as in the node below it on tree)


         1                ANCDEFGHIK
  1      2         no     ..........
  2      3        maybe   ?.........
  3      4         yes    .IK.......
  4   Epsilon     maybe   D.........
  4   Delta        yes    C.........
  3   Gamma        yes    ?B..S..*??
  2   Beta         yes    .B--......
  1   Alpha       maybe   .B........





           +--Epsilon   
     +-----4 
     !     +--Delta     
  +--3 
  !  !     +--Gamma     
  1  +-----2 
  !        +--Beta     
  ! 
  +-----------Alpha     

  remember: this is an unrooted tree!


requires a total of     16.000

steps in each position:
         0   1   2   3   4   5   6   7   8   9
     *-----------------------------------------
    0!       3   1   5   3   2   0   0   2   0
   10!   0                                   

From    To     Any Steps?    State at upper node
                             ( . means same as in the node below it on tree)


         1                ANCDEFGHIK
  1      3         no     ..........
  3      4         yes    ?IK.......
  4   Epsilon     maybe   D.........
  4   Delta        yes    C.........
  3      2         no     ..........
  2   Gamma        yes    ?B..S..*??
  2   Beta         yes    .B--......
  1   Alpha       maybe   .B........


</PRE>
</TD></TR></TABLE>
</BODY>
</HTML>