protpars.html

Last change on this file was 2176, checked in by westram, 22 years ago
* empty log message *
Property svn:eol-style set to `native` Property svn:keywords set to `Author Date Id Revision`
File size: 13.7 KB
1	<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 3.2 Final//EN">
2	<HTML>
3	<HEAD>
4	<TITLE>protpars</TITLE>
5	<META NAME="description" CONTENT="protpars">
6	<META NAME="keywords" CONTENT="protpars">
7	<META NAME="resource-type" CONTENT="document">
8	<META NAME="distribution" CONTENT="global">
9	<META HTTP-EQUIV="Content-Type" CONTENT="text/html; charset=iso-8859-1">
10	</HEAD>
11	<BODY BGCOLOR="#ccffff">
12	<DIV ALIGN=RIGHT>
13	version 3.6
14	</DIV>
15	<P>
16	<DIV ALIGN=CENTER>
17	<H1>PROTPARS -- Protein Sequence Parsimony Method</H1>
18	</DIV>
19	<P>
20	© Copyright 1986-2002 by the University of
21	Washington. Written by Joseph Felsenstein. Permission is granted to copy
22	this document provided that no fee is charged for it and that this copyright
23	notice is not removed.
24	<P>
26	<P>
27	This program infers an unrooted phylogeny from protein sequences, using a
28	new method intermediate between the approaches of Eck and Dayhoff (1966) and
29	Fitch (1971). Eck and Dayhoff (1966) allowed any amino acid to change to
30	any other, and counted the number of such changes needed to evolve the
31	protein sequences on each given phylogeny. This has the problem that it
32	allows replacements which are not consistent with the genetic code, counting
33	them equally with replacements that are consistent. Fitch, on the other hand,
34	counted the minimum number of nucleotide substitutions that would be
35	needed to achieve the given protein sequences. This counts silent
36	changes equally with those that change the amino acid.
37	<P>
38	The present method insists that any changes of amino acid be consistent
39	with the genetic code so that, for example, lysine is allowed to change
40	to methionine but not to proline. However, changes between two amino acids
41	via a third are allowed and counted as two changes if each of the two
42	replacements is individually allowed. This sometimes allows changes that
43	at first sight you would think should be outlawed. Thus we can change from
44	phenylalanine to glutamine via leucine in two steps
45	total. Consulting the genetic code, you will find that there is a leucine
46	codon one step away from a phenylalanine codon, and a leucine codon one
47	step away from glutamine. But they are not the same leucine codon. It
48	actually takes three base substitutions to get from either of the
49	phenylalanine codons TTT and TTC to either of the glutamine codons
50	CAA or CAG. Why then does this program count only two? The answer
51	is that recent DNA sequence comparisons seem to show that synonymous
52	changes are considerably faster and easier than ones that change the
53	amino acid. We are assuming that, in effect, synonymous changes occur
54	so much more readily that they need not be counted. Thus, in the chain
55	of changes TTT (Phe) --> CTT (Leu) --> CTA (Leu) --> CAA (Gln), the middle
56	one is not counted because it does not change the amino acid (leucine).
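<P>
The counting rule described above can be sketched as a shortest-path search
over amino acid states. This is only an illustration, not the program's own
bit-set implementation: it assumes the universal genetic code, treats the stop
codon as an ordinary state, splits serine into the hypothetical labels
"ser1" and "ser2", and uses function names invented for this example.

```python
from collections import deque
from itertools import product

BASES = "TCAG"
# Universal genetic code, codons ordered TTT, TTC, TTA, TTG, TCT, ...
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"


def codon_states():
    """Map each codon to a state; serine is split into two states."""
    table = {}
    for i, triple in enumerate(product(BASES, repeat=3)):
        codon = "".join(triple)
        state = AMINO[i]
        if state == "S":
            # TCN codons and AGT/AGC codons are not adjacent in the code.
            state = "ser1" if codon.startswith("TC") else "ser2"
        table[codon] = state
    return table


def adjacency(table):
    """Two states are adjacent if some pair of their codons differs by one base."""
    adj = {state: set() for state in table.values()}
    for codon, state in table.items():
        for pos in range(3):
            for base in BASES:
                if base == codon[pos]:
                    continue
                other = table[codon[:pos] + base + codon[pos + 1:]]
                if other != state:
                    adj[state].add(other)
    return adj


def steps(start, goal, adj):
    """Breadth-first search: synonymous changes within a state cost nothing."""
    dist = {start: 0}
    queue = deque([start])
    while queue:
        state = queue.popleft()
        if state == goal:
            return dist[state]
        for nxt in adj[state]:
            if nxt not in dist:
                dist[nxt] = dist[state] + 1
                queue.append(nxt)
    return None


table = codon_states()
adj = adjacency(table)
print(steps("F", "Q", adj))  # phenylalanine to glutamine
print(steps("K", "M", adj))  # lysine to methionine
print(steps("K", "P", adj))  # lysine to proline
```

Under these assumptions phenylalanine to glutamine costs two steps even though
three base substitutions separate their codons, because the synonymous move
between leucine codons is free.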
57	<P>
58	To maintain consistency with the genetic code, it is necessary for the
59	program internally to treat serine as two separate states (ser1 and ser2)
60	since the two groups of serine codons are not adjacent in the
61	code. Changes to the state "deletion" are counted as three steps to prevent the
62	algorithm from assuming unnecessary deletions. The state "unknown" is
63	simply taken to mean that the amino acid, which has not been determined,
64	will in each part of a tree that is evaluated be assumed to be whichever one
65	causes the fewest steps.
66	<P>
67	The assumptions of this method (which has not been described in the
68	literature) are thus something like this:
69	<P>
70	<OL>
71	<LI>Change in different sites is independent.
72	<LI>Change in different lineages is independent.
73	<LI>The probability of a base substitution that changes the amino
74	acid sequence is small over the lengths of time involved in
75	a branch of the phylogeny.
76	<LI>The expected amounts of change in different branches of the phylogeny
77	do not vary by so much that two changes in a high-rate branch
78	are more probable than one change in a low-rate branch.
79	<LI>The expected amounts of change do not vary enough among sites that two
80	changes in one site are more probable than one change in another.
81	<LI>The probability of a base change that is synonymous is much higher
82	than the probability of a change that is not synonymous.
83	</OL>
84	<P>
85	That these are the assumptions of parsimony methods has been documented
86	in a series of papers of mine: (1973a, 1978b, 1979, 1981b, 1983b, 1988b). For
87	an opposing view arguing that the parsimony methods make no substantive
88	assumptions such as these, see the works by Farris (1983) and Sober (1983a,
89	1983b, 1988), but also read the exchange between Felsenstein and Sober (1986).
90	<P>
91	The input for the program is fairly standard. The first line contains the
92	number of species and the number of amino acid positions (counting any
93	stop codons that you want to include).
94	<P>
95	Next come the species data. Each
96	sequence starts on a new line, has a ten-character species name
97	that must be blank-filled to be of that length, followed immediately
98	by the species data in the one-letter code. The sequences must be in
99	either the "interleaved" or the "sequential" format
100	described in the Molecular Sequence Programs document. The I option
101	selects between them. The sequences can have internal
102	blanks in the sequence but there must be no extra blanks at the end of the
103	terminated line. Note that a blank is not a valid symbol for a deletion.
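<P>
As a rough illustration of the layout just described, the following sketch
reads a small data set in which every sequence fits on a single line. The
function name is hypothetical, and the sketch does not handle sequences
that continue over several lines or interleaved blocks.

```python
def parse_simple_input(text):
    """Parse a header line plus one line per species (ten-character names)."""
    lines = [ln for ln in text.splitlines() if ln.strip()]
    n_species, n_sites = (int(tok) for tok in lines[0].split()[:2])
    sequences = {}
    for line in lines[1:1 + n_species]:
        name = line[:10].strip()               # name field is blank-filled to 10
        residues = line[10:].replace(" ", "")  # internal blanks are allowed
        if len(residues) != n_sites:
            raise ValueError(f"{name}: expected {n_sites} positions, "
                             f"got {len(residues)}")
        sequences[name] = residues
    if len(sequences) != n_species:
        raise ValueError("fewer species than the header line promises")
    return sequences


EXAMPLE = """\
    5   10
Alpha     ABCDEFGHIK
Beta      AB--EFGHIK
Gamma     ?BCDSFG*??
Delta     CIKDEFGHIK
Epsilon   DIKDEFGHIK
"""

sequences = parse_simple_input(EXAMPLE)
print(sequences["Gamma"])
```

The length check is a convenience of this sketch; it catches names that were
not padded to ten characters, since the extra characters would spill into the
sequence field.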
104	<P>
105	The protein sequences are given in the one-letter code
106	described in the <A HREF="sequence.html">Molecular Sequence Programs documentation file</A>. Note that
107	if two polypeptide chains are being used that are of different length
108	owing to one terminating before the other, they should be coded as (say)
109	<P><PRE>
110	HIINMA*????
111	HIPNMGVWABT
112	</PRE><P>
113	since after the stop codon we do not definitely know that
114	there has been a deletion, and do not know what amino acid would
115	have been there. If DNA studies tell us that there is
116	DNA sequence in that region, then we could use "X" rather than "?". Note
117	that "X" means an unknown amino acid, but definitely an amino acid,
118	while "?" could mean either that or a deletion. The distinction is often
119	significant in regions where there are deletions: one may want to encode
120	a six-residue deletion as "-?????" since that way the program will count only
121	one deletion, not six deletion events, when the deletion arises. However,
122	if there are overlapping deletions it may not be so easy to know what
123	coding is correct.
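<P>
The "-?????" coding can be produced mechanically for a single sequence. The
sketch below uses a hypothetical helper name and simply rewrites each run of
"-" as one "-" followed by "?" characters; it makes no attempt to decide what
is correct when deletions in different sequences overlap.

```python
import re


def recode_deletion_runs(seq):
    """Keep the first '-' of each run and turn the rest of the run into '?'."""
    return re.sub(r"-+", lambda m: "-" + "?" * (len(m.group()) - 1), seq)


print(recode_deletion_runs("AB------CD"))
```

A run of length one is left unchanged, so isolated single-position deletions
are still counted individually.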
124	<P>
125	One will usually want to
126	use "?" after a stop codon, if one does not know what amino acid is there. If
127	the DNA sequence has been observed there, one probably ought to resist
128	putting in the amino acids that this DNA would code for, and one should use
129	"X" instead, because under the assumptions implicit in this parsimony
130	method, changes to any noncoding sequence are much easier than
131	changes in a coding region that change the amino acid, so that they
132	shouldn't be counted anyway!
133	<P>
134	The form of the user tree information
135	is the standard one described in the main documentation file. For the U option
136	the tree
137	provided must be a rooted bifurcating tree, with the root placed anywhere
138	you want, since that root placement does not affect anything.
139	<P>
140	The options are selected using an interactive menu. The menu looks like this:
141	<P>
142	<TABLE><TR><TD BGCOLOR=white>
143	<PRE>
144	Protein parsimony algorithm, version 3.6
145
146	Setting for this run:
147	  U                 Search for best tree?  Yes
148	  J   Randomize input order of sequences?  No. Use input order
149	  O                        Outgroup root?  No, use as outgroup species  1
150	  T              Use Threshold parsimony?  No, use ordinary parsimony
151	  C               Use which genetic code?  Universal
152	  M           Analyze multiple data sets?  No
153	  I          Input sequences interleaved?  Yes
154	  0   Terminal type (IBM PC, VT52, ANSI)?  (none)
155	  1    Print out the data at start of run  No
156	  2  Print indications of progress of run  Yes
157	  3                        Print out tree  Yes
158	  4          Print out steps in each site  No
159	  5  Print sequences at all nodes of tree  No
160	  6       Write out trees onto tree file?  Yes
161
162	Are these settings correct? (type Y or the letter for one to change)
163
164	</PRE>
165	</TD></TR></TABLE>
166	<P>
167	The user either types "Y" (followed, of course, by a carriage-return)
168	if the settings shown are to be accepted, or the letter or digit corresponding
169	to an option that is to be changed.
170	<P>
171	The options U, J, O, T, M, and 0 are the usual ones. They are described in
172	the main documentation file of this package. Option I is the same as in
173	other molecular sequence programs and is described in the documentation file
174	for the sequence programs. Option C allows the user to select among various
175	nuclear and mitochondrial genetic codes. There is no provision for coping
176	with data where different genetic codes have been used in different
177	organisms.
178	<P>
179	In the U (User tree) option, the trees should
180	not be preceded by a line with the number of trees on it.
181	<P>
182	Output is standard: if option 1 is toggled on, the data is printed out,
183	with the convention that "." means "the same as in the first species".
184	Then comes a list of equally parsimonious trees, and (if option 4 is
185	toggled on) a table of the
186	number of changes of state required in each position. If option 5 is toggled
187	on, a table is printed
188	out after each tree, showing for each branch whether there are known to be
189	changes in the branch, and what the states are inferred to have been at the
190	top end of the branch. If the inferred state is a "?" there will be multiple
191	equally-parsimonious assignments of states; the user must work these out for
192	themselves by hand. If option 6 is left in its default state the trees
193	found will be written to a tree file, so that they are available to be used
194	in other programs.
195	<P>
196	If the U (User Tree) option is used and more than one tree is supplied, the
197	program also performs a statistical test of each of these trees against the
198	best tree. This test is a version of the test proposed by
199	Alan Templeton (1983) and evaluated in a test case by me (1985a). It is
200	closely parallel to a test using log likelihood differences
201	due to Kishino and Hasegawa (1989), and uses the mean
202	and variance of
203	step differences between trees, taken across positions. If the mean difference
204	is more than 1.96 standard deviations away from zero, the trees are declared
205	significantly different. The program
206	prints out a table of the steps for each tree, the differences of
207	each from the best one, the variance of that quantity as determined by
208	the step differences at individual positions, and a conclusion as to
209	whether that tree is or is not significantly worse than the best one.
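<P>
The comparison described above can be sketched as follows. This is an
illustration rather than the program's code: the function name is
hypothetical, the variance of the total difference is estimated as the number
of positions times the sample variance of the per-position differences (an
assumption about the estimator), and the second vector of step counts is
invented for the example.

```python
import math


def compare_steps(steps_best, steps_other, z=1.96):
    """Compare per-position step counts of a tree against the best tree."""
    diffs = [other - best for best, other in zip(steps_best, steps_other)]
    n = len(diffs)
    total = sum(diffs)
    mean = total / n
    # Sample variance of the per-position differences.
    var_site = sum((d - mean) ** 2 for d in diffs) / (n - 1)
    # Variance of the summed difference across n positions.
    var_total = n * var_site
    significant = abs(total) > z * math.sqrt(var_total)
    return total, var_total, significant


# Steps per position for the best tree (taken from the example output in this
# document) and for a hypothetical competing tree.
best = [3, 1, 5, 3, 2, 0, 0, 2, 0, 0]
other = [3, 2, 5, 4, 2, 1, 0, 2, 1, 0]

total, var_total, significant = compare_steps(best, other)
print(total, round(var_total, 3), significant)
```

Comparing the total difference with 1.96 standard deviations of the total is
equivalent to comparing the mean difference with 1.96 standard errors of the
mean.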
210	<P>
211	The program is derived from MIX but has had some rather elaborate
212	bookkeeping using sets of bits installed. It is not a very fast
213	program but is speeded up substantially over version 3.2.
214	<P>
215	<HR>
216	<H3>TEST DATA SET</H3>
217	<P>
218	<TABLE><TR><TD BGCOLOR=white>
219	<PRE>
220	    5   10
221	Alpha     ABCDEFGHIK
222	Beta      AB--EFGHIK
223	Gamma     ?BCDSFG*??
224	Delta     CIKDEFGHIK
225	Epsilon   DIKDEFGHIK
226	</PRE>
227	</TD></TR></TABLE>
228	<P>
229	<HR>
230	<P>
231	<H3>CONTENTS OF OUTPUT FILE (with all numerical options on)</H3>
232	<P>
233	<TABLE><TR><TD BGCOLOR=white>
234	<PRE>
235
236	Protein parsimony algorithm, version 3.6
237
238
239
240	     3 trees in all found
241
242
243
244
245	     +--------Gamma
246	     !
247	  +--2     +--Epsilon
248	  !  !  +--4
249	  !  +--3  +--Delta
250	  1     !
251	  !     +-----Beta
252	  !
253	  +-----------Alpha
254
255	  remember: this is an unrooted tree!
256
257
258	requires a total of     16.000
259
260	steps in each position:
261	         0   1   2   3   4   5   6   7   8   9
262	     *-----------------------------------------
263	    0!       3   1   5   3   2   0   0   2   0
264	   10!   0
265
266	From    To     Any Steps?    State at upper node
267	                             ( . means same as in the node below it on tree)
268
269
270	         1                ANCDEFGHIK
271	  1      2         no     ..........
272	  2   Gamma        yes    ?B..S..*??
273	  2      3         yes    ..?.......
274	  3      4         yes    ?IK.......
275	  4   Epsilon     maybe   D.........
276	  4   Delta        yes    C.........
277	  3   Beta         yes    .B--......
278	  1   Alpha       maybe   .B........
279
280
281
282
283
284	           +--Epsilon
285	        +--4
286	     +--3  +--Delta
287	     !  !
288	  +--2  +-----Gamma
289	  !  !
290	  1  +--------Beta
291	  !
292	  +-----------Alpha
293
294	  remember: this is an unrooted tree!
295
296
297	requires a total of     16.000
298
299	steps in each position:
300	         0   1   2   3   4   5   6   7   8   9
301	     *-----------------------------------------
302	    0!       3   1   5   3   2   0   0   2   0
303	   10!   0
304
305	From    To     Any Steps?    State at upper node
306	                             ( . means same as in the node below it on tree)
307
308
309	         1                ANCDEFGHIK
310	  1      2         no     ..........
311	  2      3        maybe   ?.........
312	  3      4         yes    .IK.......
313	  4   Epsilon     maybe   D.........
314	  4   Delta        yes    C.........
315	  3   Gamma        yes    ?B..S..*??
316	  2   Beta         yes    .B--......
317	  1   Alpha       maybe   .B........
318
319
320
321
322
323	           +--Epsilon
324	     +-----4
325	     !     +--Delta
326	  +--3
327	  !  !     +--Gamma
328	  1  +-----2
329	  !        +--Beta
330	  !
331	  +-----------Alpha
332
333	  remember: this is an unrooted tree!
334
335
336	requires a total of     16.000
337
338	steps in each position:
339	         0   1   2   3   4   5   6   7   8   9
340	     *-----------------------------------------
341	    0!       3   1   5   3   2   0   0   2   0
342	   10!   0
343
344	From    To     Any Steps?    State at upper node
345	                             ( . means same as in the node below it on tree)
346
347
348	         1                ANCDEFGHIK
349	  1      3         no     ..........
350	  3      4         yes    ?IK.......
351	  4   Epsilon     maybe   D.........
352	  4   Delta        yes    C.........
353	  3      2         no     ..........
354	  2   Gamma        yes    ?B..S..*??
355	  2   Beta         yes    .B--......
356	  1   Alpha       maybe   .B........
357
358
359	</PRE>
360	</TD></TR></TABLE>
361	</BODY>
362	</HTML>