Context Navigation

penny.html

Visit:

Last change on this file was 2176, checked in by westram, 22 years ago
* empty log message *
Property svn:eol-style set to `native` Property svn:keywords set to `Author Date Id Revision`
File size: 24.6 KB

Line
1	<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 3.2 Final//EN">
2	<HTML>
3	<HEAD>
4	<TITLE>penny</TITLE>
5	<META NAME="description" CONTENT="penny">
6	<META NAME="keywords" CONTENT="penny">
7	<META NAME="resource-type" CONTENT="document">
8	<META NAME="distribution" CONTENT="global">
9	<META HTTP-EQUIV="Content-Type" CONTENT="text/html; charset=iso-8859-1">
10	</HEAD>
11	<BODY BGCOLOR="#ccffff">
12	<DIV ALIGN=RIGHT>
13	version 3.6
14	</DIV>
15	<P>
16	<DIV ALIGN=CENTER>
17	<H1>PENNY - Branch and bound to find<BR>all most parsimonious trees</H1>
18	</DIV>
19	<P>
20	© Copyright 1986-2002 by the University of
21	Washington. Written by Joseph Felsenstein. Permission is granted to copy
22	this document provided that no fee is charged for it and that this copyright
23	notice is not removed.
24	<P>
25	PENNY is a program that will find all of the most parsimonious trees
26	implied by your data. It does so not by examining all possible trees,
27	but by using the more sophisticated "branch and bound" algorithm, a
28	standard computer science search strategy first applied to
29	phylogenetic inference by Hendy and Penny (1982). (J. S. Farris
30	[personal communication, 1975] had also suggested that this strategy,
31	which is well-known in computer science, might
32	be applied to phylogenies, but he did not publish this suggestion).
33	<P>
34	There is, however, a price to be paid for the certainty that one has
35	found all members of the set of most parsimonious trees. The problem of
36	finding these has been shown (Graham and Foulds, 1982; Day, 1983) to be
37	NP-complete, which is equivalent to saying that there is no
38	fast algorithm that is guaranteed to solve the problem in all cases
39	(for a discussion of NP-completeness, see the Scientific American
40	article by Lewis and Papadimitriou, 1978). The result is that this
41	program, despite its algorithmic sophistication, is VERY SLOW.
42	<P>
43	The program should be slower than the other tree-building programs
44	in the package, but useable up to about ten species. Above this it will
45	bog down rapidly, but exactly when depends on the data and on how much
46	computer time you have (it may be more effective in the hands of someone
47	who can let a microcomputer grind all night than for someone who
48	has the "benefit" of paying for time on the campus mainframe computer). IT
49	IS VERY IMPORTANT FOR YOU TO GET A FEEL FOR HOW LONG THE PROGRAM
50	WILL TAKE ON YOUR DATA. This can be done by running it on subsets
51	of the species, increasing the number of species in the run until you
52	either are able to treat the full data set or know that the program
53	will take unacceptably long on it. (Making a plot of the logarithm of run
54	time against species number may help to project run times).
55	<P>
56	<H2>The Algorithm</H2>
57	<P>
58	The search strategy used by PENNY starts by making a tree consisting of the
59	first two species (the first three if the tree is to be unrooted). Then
60	it tries to add the next species in all possible places (there are three
61	of these). For each of the resulting trees it evaluates the number of
62	steps. It adds the next species to each of these, again in all
63	possible spaces. If this process would continue it would simply
64	generate all possible trees, of which there are a very large number even
65	when the number of species is moderate (34,459,425 with 10 species). Actually
66	it does not do this, because the trees are generated in a
67	particular order and some of them are never generated.
68	<P>
69	Actually the order in which trees are generated is not quite as
70	implied above, but is a "depth-first search". This
71	means that first one adds the third species in the first possible
72	place, then the fourth species in its first possible place, then
73	the fifth and so on until the first possible tree has been produced. Its
74	number of steps is evaluated. Then one "backtracks" by trying the
75	alternative placements of the last species. When these are exhausted
76	one tries the next placement of the next-to-last species. The
77	order of placement in a depth-first search is like this for a
78	four-species case (parentheses enclose monophyletic groups):
79	<P>
80	Make tree of first two species     (A,B)<BR>
81	&nbsp     Add C in first place     ((A,B),C)<BR>
82	&nbsp    &nbsp     Add D in first place     (((A,D),B),C)<BR>
83	&nbsp    &nbsp     Add D in second place     ((A,(B,D)),C)<BR>
84	&nbsp    &nbsp     Add D in third place     (((A,B),D),C)<BR>
85	&nbsp    &nbsp     Add D in fourth place     ((A,B),(C,D))<BR>
86	&nbsp    &nbsp     Add D in fifth place     (((A,B),C),D)<BR>
87	&nbsp     Add C in second place: ((A,C),B)<BR>
88	&nbsp    &nbsp     Add D in first place     (((A,D),C),B)<BR>
89	&nbsp    &nbsp     Add D in second place     ((A,(C,D)),B)<BR>
90	&nbsp    &nbsp     Add D in third place     (((A,C),D),B)<BR>
91	&nbsp    &nbsp     Add D in fourth place     ((A,C),(B,D))<BR>
92	&nbsp    &nbsp     Add D in fifth place     (((A,C),B),D)<BR>
93	&nbsp     Add C in third place     (A,(B,C))<BR>
94	&nbsp    &nbsp     Add D in first place     ((A,D),(B,C))<BR>
95	&nbsp    &nbsp     Add D in second place     (A,((B,D),C))<BR>
96	&nbsp    &nbsp     Add D in third place     (A,(B,(C,D)))<BR>
97	&nbsp    &nbsp     Add D in fourth place     (A,((B,C),D))<BR>
98	&nbsp    &nbsp     Add D in fifth place     ((A,(B,C)),D)<BR>
99	<P>
100	Among these fifteen trees you will find all of the four-species
101	rooted bifurcating trees, each exactly once (the parentheses each enclose
102	a monophyletic group). As displayed above, the backtracking
103	depth-first search algorithm is just another way of producing all
104	possible trees one at a time. The branch and bound algorithm
105	consists of this with one change. As each tree is constructed,
106	including the partial trees such as (A,(B,C)), its number of steps
107	is evaluated. In addition a prediction is made as to how many
108	steps will be added, at a minimum, as further species are added.
109	<P>
110	This is done by counting how many binary characters which are invariant in the
111	data up the species most recently added will ultimately show variation when
112	further species
113	are added. Thus if 20 characters vary among species A, B, and C and their
114	root, and if tree ((A,C),B) requires 24 steps, then if there are 8 more
115	characters which will be seen to vary when species D is added, we can
116	immediately say that no matter how we add D, the resulting tree can have no less
117	than 24 + 8 = 32 steps. The point of all this is that if a previously-found
118	tree such as ((A,B),(C,D)) required only 30 steps, then we know that
119	there is no point in even trying to add D to ((A,C),B). We have
120	computed the bound that enables us to cut off a whole line of inquiry
121	(in this case five trees) and avoid going down that particular branch
122	any farther.
123	<P>
124	The branch-and-bound algorithm thus allows us to find all most parsimonious
125	trees without generating all possible trees. How much of a saving this
126	is depends strongly on the data. For very clean (nearly "Hennigian")
127	data, it saves much time, but on very messy data it will still take
128	a very long time.
129	<P>
130	The algorithm in the program differs from the one outlined here
131	in some essential details: it investigates possibilities in the
132	order of their apparent promise. This applies to the order of addition
133	of species, and to the places where they are added to the tree. After
134	the first two-species tree is constructed, the program tries adding
135	each of the remaining species in turn, each in the best possible place it
136	can find. Whichever of those species adds (at a minimum) the most
137	additional steps is taken to be the one to be added next to the tree. When
138	it is added, it is added in turn to places which cause the fewest
139	additional steps to be added. This sounds a bit complex, but it is done
140	with the intention of eliminating regions of the search of all possible
141	trees as soon as possible, and lowering the bound on tree length as quickly
142	as possible.
143	<P>
144	The program keeps a list of all the most parsimonious
145	trees found so far. Whenever
146	it finds one that has fewer steps than
147	these, it clears out the list and
148	restarts the list with that tree. In the process the bound tightens and
149	fewer possibilities need be investigated. At the end the list contains
150	all the shortest trees. These are then printed out. It should be
151	mentioned that the program CLIQUE for finding all largest cliques
152	also works by branch-and-bound. Both problems are NP-complete but for
153	some reason CLIQUE runs far faster. Although their worst-case behavior
154	is bad for both programs, those worst cases occur far more frequently
155	in parsimony problems than in compatibility problems.
156	<P>
157	<H2>Controlling Run Times</H2>
158	<P>
159	Among the quantities available to be set at the
160	beginning of a run of PENNY, two (howoften and howmany) are of particular
161	importance. As PENNY goes along it will keep count of how many
162	trees it has examined. Suppose that howoften is 100 and howmany is 1000,
163	the default settings. Every time 100 trees have been examined, PENNY
164	will print out a line saying how many multiples of 100 trees have now been
165	examined, how many steps the most parsimonious tree found so far has,
166	how many trees of with that number of steps have been found, and a very
167	rough estimate of what fraction of all trees have been looked at so far.
168	<P>
169	When the number of these multiples printed out reaches the number howmany
170	(say 1000), the whole algorithm aborts and prints out that it has not
171	found all most parsimonious trees, but prints out what is has got so far
172	anyway. These trees need not be any of the most parsimonious trees: they are
173	simply the most parsimonious ones found so far. By setting the product
174	(howoften times howmany) large you can make
175	the algorithm less likely to abort, but then you risk getting bogged
176	down in a gigantic computation. You should adjust these constants so that
177	the program cannot go beyond examining the number of trees you are reasonably
178	willing to wait for. In their initial setting the program will
179	abort after looking at 100,000 trees. Obviously you may want to adjust
180	howoften in order to get more or fewer lines of intermediate notice of how
181	many trees have been looked at so far. Of course, in small cases you may
182	never even reach the first multiple of howoften and nothing will be printed out
183	except some headings and then the final trees.
184	<P>
185	The indication of the approximate percentage of trees searched so far will
186	be helpful in judging how much farther you would have to go to get the full
187	search. Actually, since that fraction is the fraction of the set of all
188	possible trees searched or ruled out so far, and since the search becomes
189	progressively more efficient, the approximate fraction printed out will
190	usually be an underestimate of how far along the program is, sometimes a
191	serious underestimate.
192	<P>
193	A constant that affects the result is
194	"maxtrees", which controls the
195	maximum number of trees that can be stored. Thus if
196	"maxtrees" is 25, and 32 most parsimonious trees are found,
197	only the first 25 of these are stored and printed out. If "maxtrees"
198	is increased, the program does not run any slower but requires a little
199	more intermediate storage space. I recommend that
200	"maxtrees" be kept as large as you can, provided you are willing to
201	look at an output with that many trees on it! Initially,
202	"maxtrees" is set to 100 in the distribution copy.
203	<P>
204	<H2>Methods and Options</H2>
205	<P>
206	The counting of the length of trees is done by an algorithm nearly
207	identical to the corresponding algorithms in MIX, and thus the remainder
208	of this document will be nearly identical to the MIX document. MIX
209	is a general parsimony program which carries out the Wagner and
210	Camin-Sokal parsimony methods in mixture, where each character can have
211	its method specified. The program defaults to carrying out Wagner
212	parsimony.
213	<P>
214	The Camin-Sokal parsimony method explains the data by assuming that
215	changes 0 --> 1 are allowed but not changes 1 --> 0. Wagner parsimony
216	allows both kinds of changes. (This under the assumption that 0 is the
217	ancestral state, though the program allows reassignment of the ancestral
218	state, in which case we must reverse the state numbers 0 and 1
219	throughout this discussion). The criterion is to find the tree which
220	requires the minimum number of changes. The Camin-Sokal method is due
221	to Camin and Sokal (1965) and the Wagner method to Eck and Dayhoff
222	(1966) and to Kluge and Farris (1969).
223	<P>
224	Here are the assumptions of these two methods:
225	<P>
226	<OL>
227	<LI>Ancestral states are known (Camin-Sokal) or unknown (Wagner).
228	<LI>Different characters evolve independently.
229	<LI>Different lineages evolve independently.
230	<LI>Changes 0 --> 1 are much more probable than changes 1 --> 0
231	(Camin-Sokal) or equally probable (Wagner).
232	<LI>Both of these kinds of changes are a priori improbable over the
233	evolutionary time spans involved in the differentiation of the
234	group in question.
235	<LI>Other kinds of evolutionary event such as retention of polymorphism
236	are far less probable than 0 --> 1 changes.
237	<LI>Rates of evolution in different lineages are sufficiently low that
238	two changes in a long segment of the tree are far less probable
239	than one change in a short segment.
240	</OL>
241	<P>
242	That these are the assumptions of parsimony methods has been documented
243	in a series of papers of mine: (1973a, 1978b, 1979, 1981b,
244	1983b, 1988b). For an opposing view arguing that the parsimony methods
245	make no substantive
246	assumptions such as these, see the papers by Farris (1983) and Sober (1983a,
247	1983b), but also read the exchange between Felsenstein and Sober (1986).
248	<P>
249	The input for PENNY is the standard input for discrete characters
250	programs, described above in the documentation file for the
251	discrete-characters programs. States "?", "P", and "B" are allowed.
252	<P>
253	Most of the options are selected using a menu:
254	<P>
255	<TABLE><TR><TD BGCOLOR=white>
256	<PRE>
257
258	Penny algorithm, version 3.6a3
259	branch-and-bound to find all most parsimonious trees
260
261	Settings for this run:
262	X Use Mixed method? No
263	P Parsimony method? Wagner
264	F How often to report, in trees: 100
265	H How many groups of 100 trees: 1000
266	O Outgroup root? No, use as outgroup species 1
267	S Branch and bound is simple? Yes
268	T Use Threshold parsimony? No, use ordinary parsimony
269	A Use ancestral states in input file? No
270	W Sites weighted? No
271	M Analyze multiple data sets? No
272	0 Terminal type (IBM PC, ANSI, none)? (none)
273	1 Print out the data at start of run No
274	2 Print indications of progress of run Yes
275	3 Print out tree Yes
276	4 Print out steps in each character No
277	5 Print states at all nodes of tree No
278	6 Write out trees onto tree file? Yes
279
280	Are these settings correct? (type Y or the letter for one to change)
281	</PRE>
282	</TD></TR></TABLE>
283	<P>
284	The options X, O, T, A, and M are the usual Mixed Methods, Outgroup,
285	Threshold, Ancestral
286	States, and Multiple Data Sets options. They are described in the Main
287	documentation file and in the Discrete Characters Programs documentation
288	file. The O option is only acted upon if the final tree is unrooted.
289	<P>
290	The option P toggles between the Camin-Sokal parsimony criterion
291	and the Wagner parsimony criterion. Options F and H reset the
292	variables howoften (F) and howmany (H). The user is prompted for the new
293	values. By setting these larger the program will report its progress less
294	often (howoften) and will run longer (howmany times howoften). These values
295	default to 100 and 1000 which
296	guarantees a search of 100,000 trees, but these can be changed. Note that
297	option F in this program is not the Factors option available in some of
298	the other programs in this section of the package.
299	<P>
300	The A (Ancestral states) option works in the usual way, described in the
301	Discrete Characters Programs documentation file. If
302	the A option is not used, then the program will assume 0 as the
303	ancestral state for those characters following the Camin-Sokal method,
304	and will assume that the ancestral state is unknown for those characters
305	following Wagner parsimony. If any characters have unknown ancestral
306	states, and if the resulting tree is rooted (even by outgroup),
307	a table will be printed out
308	showing the best guesses of which are the ancestral states in each
309	character.
310	<P>
311	The S (Simple) option alters a step in PENNY which reconsiders the
312	order in which species are added to the tree. Normally the decision as to
313	what species to add to the tree next is made as the first tree is being
314	constructed; that ordering of species is not altered subsequently. The
315	S option causes it to be continually reconsidered. This will probably
316	result in a substantial increase in run time, but on some data sets of
317	intermediate messiness it may help. It is included in case it might prove
318	of use on some data sets.
319	<P>
320	The F (Factors)
321	option is not available in this program, as it would have no effect on
322	the result even if that information were provided in the input file.
323	<P>
324	The final output is standard: a set of trees, which
325	will be printed as rooted or unrooted
326	depending on which is appropriate, and if the user elects to see them,
327	tables of the number of changes
328	of state required in each character. If the Wagner option is in force for a
329	character, it may not be possible to unambiguously locate the places on
330	the tree where the changes occur, as there may be multiple possibilities. A
331	table is available to be printed out after each tree, showing for each branch
332	whether
333	there are known to be changes in the branch, and what the states are inferred
334	to have been at the top end of the branch. If the inferred state is a "?"
335	there will be multiple equally-parsimonious assignments of states; the user
336	must work these out for themselves by hand.
337	<P>
338	If the Camin-Sokal parsimony method (option C or S)
339	is invoked and the A option is also used, then the program will
340	infer, for any character whose ancestral state is unknown ("?") whether the
341	ancestral state 0 or 1 will give the fewest state changes. If these are
342	tied, then it may not be possible for the program to infer the
343	state in the internal nodes, and these will all be printed as ".". If this
344	has happened and you want to know more about the states at the internal
345	nodes, you will find helpful to use MOVE to display the tree and examine
346	its interior states, as the algorithm in MOVE shows all that can be known
347	in this case about the interior states, including where there is and is not
348	amibiguity. The algorithm in PENNY gives up more easily on displaying these
349	states.
350	<P>
351	If the A option is not used, then the program will assume 0 as the
352	ancestral state for those characters following the Camin-Sokal method,
353	and will assume that the ancestral state is unknown for those characters
354	following Wagner parsimony. If any characters have unknown ancestral
355	states, and if the resulting tree is rooted (even by outgroup),
356	a table will be printed out
357	showing the best guesses of which are the ancestral states in each
358	character. You will find it useful to understand the difference between
359	the Camin-Sokal parsimony criterion with unknown ancestral state and the Wagner
360	parsimony criterion.
361	<P>
362	At the beginning of the program are a series of constants,
363	which can be changed to help adapt the program to different computer systems.
364	Two are the initial values of howmany and howoften,
365	constants "often" and "many". Constant "maxtrees"
366	is the maximum number of tied trees that will be stored.
367	<P>
368	<HR>
369	<P>
370	<H3>TEST DATA SET</H3>
371	<P>
372	<TABLE><TR><TD BGCOLOR=white>
373	<PRE>
374	7 6
375	Alpha1 110110
376	Alpha2 110110
377	Beta1 110000
378	Beta2 110000
379	Gamma1 100110
380	Delta 001001
381	Epsilon 001110
382	</PRE>
383	</TD></TR></TABLE>
384	<P>
385	<HR>
386	<P>
387	<H3>TEST SET OUTPUT (with all numerical options turned on)</H3>
388	<P>
389	<TABLE><TR><TD BGCOLOR=white>
390	<PRE>
391
392	Penny algorithm, version 3.6a3
393	branch-and-bound to find all most parsimonious trees
394
395	7 species, 6 characters
396	Wagner parsimony method
397
398
399	Name Characters
400	---- ----------
401
402	Alpha1 11011 0
403	Alpha2 11011 0
404	Beta1 11000 0
405	Beta2 11000 0
406	Gamma1 10011 0
407	Delta 00100 1
408	Epsilon 00111 0
409
410
411
412	requires a total of 8.000
413
414	3 trees in all found
415
416
417
418
419	+-----------------Alpha1
420	!
421	! +--------Alpha2
422	--1 !
423	! +-----4 +--Epsilon
424	! ! ! +--6
425	! ! +--5 +--Delta
426	+--2 !
427	! +-----Gamma1
428	!
429	! +--Beta2
430	+-----------3
431	+--Beta1
432
433	remember: this is an unrooted tree!
434
435
436	steps in each character:
437	0 1 2 3 4 5 6 7 8 9
438	*-----------------------------------------
439	0! 1 1 1 2 2 1
440
441	From To Any Steps? State at upper node
442	( . means same as in the node below it on tree)
443
444	1 11011 0
445	1 Alpha1 no ..... .
446	1 2 no ..... .
447	2 4 no ..... .
448	4 Alpha2 no ..... .
449	4 5 yes .0... .
450	5 6 yes 0.1.. .
451	6 Epsilon no ..... .
452	6 Delta yes ...00 1
453	5 Gamma1 no ..... .
454	2 3 yes ...00 .
455	3 Beta2 no ..... .
456	3 Beta1 no ..... .
457
458
459
460
461	+-----------------Alpha1
462	!
463	--1 +--------------Alpha2
464	! !
465	! ! +--Epsilon
466	+--2 +--6
467	! +-----5 +--Delta
468	! ! !
469	+--4 +-----Gamma1
470	!
471	! +--Beta2
472	+--------3
473	+--Beta1
474
475	remember: this is an unrooted tree!
476
477
478	steps in each character:
479	0 1 2 3 4 5 6 7 8 9
480	*-----------------------------------------
481	0! 1 1 1 2 2 1
482
483	From To Any Steps? State at upper node
484	( . means same as in the node below it on tree)
485
486	1 11011 0
487	1 Alpha1 no ..... .
488	1 2 no ..... .
489	2 Alpha2 no ..... .
490	2 4 no ..... .
491	4 5 yes .0... .
492	5 6 yes 0.1.. .
493	6 Epsilon no ..... .
494	6 Delta yes ...00 1
495	5 Gamma1 no ..... .
496	4 3 yes ...00 .
497	3 Beta2 no ..... .
498	3 Beta1 no ..... .
499
500
501
502
503	+-----------------Alpha1
504	!
505	! +-----Alpha2
506	--1 +--------2
507	! ! ! +--Beta2
508	! ! +--3
509	+--4 +--Beta1
510	!
511	! +--Epsilon
512	! +--6
513	+--------5 +--Delta
514	!
515	+-----Gamma1
516
517	remember: this is an unrooted tree!
518
519
520	steps in each character:
521	0 1 2 3 4 5 6 7 8 9
522	*-----------------------------------------
523	0! 1 1 1 2 2 1
524
525	From To Any Steps? State at upper node
526	( . means same as in the node below it on tree)
527
528	1 11011 0
529	1 Alpha1 no ..... .
530	1 4 no ..... .
531	4 2 no ..... .
532	2 Alpha2 no ..... .
533	2 3 yes ...00 .
534	3 Beta2 no ..... .
535	3 Beta1 no ..... .
536	4 5 yes .0... .
537	5 6 yes 0.1.. .
538	6 Epsilon no ..... .
539	6 Delta yes ...00 1
540	5 Gamma1 no ..... .
541
542	</PRE>
543	</TD></TR></TABLE>
544	</BODY>
545	</HTML>

Note: See TracBrowser for help on using the repository browser.

Context Navigation

source: trunk/GDE/PHYLIP/doc/penny.html

Download in other formats: