Context Navigation

dolpenny.html

Visit:

Last change on this file was 2176, checked in by westram, 22 years ago
* empty log message *
Property svn:eol-style set to `native` Property svn:keywords set to `Author Date Id Revision`
File size: 25.9 KB

Line
1	<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 3.2 Final//EN">
2	<HTML>
3	<HEAD>
4	<TITLE>dolpenny</TITLE>
5	<META NAME="description" CONTENT="dolpenny">
6	<META NAME="keywords" CONTENT="dolpenny">
7	<META NAME="resource-type" CONTENT="document">
8	<META NAME="distribution" CONTENT="global">
9	<META HTTP-EQUIV="Content-Type" CONTENT="text/html; charset=iso-8859-1">
10	</HEAD>
11	<BODY BGCOLOR="#ccffff">
12	<DIV ALIGN=RIGHT>
13	version 3.6
14	</DIV>
15	<P>
16	<DIV ALIGN=CENTER>
17	<H1>DOLPENNY - Branch and bound<BR>to find all most parsimonious trees<BR>
18	for Dollo, polymorphism parsimony criteria</H1>
19	</DIV>
20	<P>
21	© Copyright 1986-2002 by the University of
22	Washington. Written by Joseph Felsenstein. Permission is granted to copy
23	this document provided that no fee is charged for it and that this copyright
24	notice is not removed.
25	<P>
26	DOLPENNY is a program that will find all of the most parsimonious trees
27	implied by your data when the Dollo or polymorphism parsimony criteria are
28	employed. It does so not by examining all possible trees,
29	but by using the more sophisticated "branch and bound" algorithm, a
30	standard computer science search strategy first applied to
31	phylogenetic inference by Hendy and Penny (1982). (J. S. Farris
32	[personal communication, 1975] had also suggested that this strategy,
33	which is well-known in computer science, might
34	be applied to phylogenies, but he did not publish this suggestion).
35	<P>
36	There is, however, a price to be paid for the certainty that one has
37	found all members of the set of most parsimonious trees. The problem of
38	finding these has been shown (Graham and Foulds, 1982; Day, 1983) to be
39	NP-complete, which is equivalent to saying that there is no
40	fast algorithm that is guaranteed to solve the problem in all cases
41	(for a discussion of NP-completeness, see the Scientific American
42	article by Lewis and Papadimitriou, 1978). The result is that
43	this program, despite its algorithmic sophistication, is VERY SLOW.
44	<P>
45	The program should be slower than the other tree-building programs
46	in the package, but useable up to about ten species. Above this it will
47	bog down rapidly, but exactly when depends on the data and on how much
48	computer time you have (it may be more effective in the hands of someone
49	who can let a microcomputer grind all night than for someone who
50	has the "benefit" of paying for time on the campus mainframe
51	computer). IT IS VERY IMPORTANT FOR YOU TO GET A FEEL FOR HOW LONG THE
52	PROGRAM WILL TAKE ON YOUR DATA. This can be done by running it on subsets
53	of the species, increasing the number of species in the run until you
54	either are able to treat the full data set or know that the program
55	will take unacceptably long on it. (Making a plot of the logarithm of run
56	time against species number may help to project run times).
57	<P>
58	<H2>The Algorithm</H2>
59	<P>
60	The search strategy used by DOLPENNY starts by making a tree consisting of the
61	first two species (the first three if the tree is to be unrooted). Then
62	it tries to add the next species in all possible places (there are three
63	of these). For each of the resulting trees it evaluates the number of
64	losses. It adds the next species to each of these, again in all
65	possible spaces. If this process would continue it would simply
66	generate all possible trees, of which there are a very large number even
67	when the number of species is moderate (34,459,425 with 10 species). Actually
68	it does not do this, because the trees are generated in a
69	particular order and some of them are never generated.
70	<P>
71	Actually the order in which trees are generated is not quite as
72	implied above, but is a "depth-first search". This means that first
73	one adds the third species in the first possible
74	place, then the fourth species in its first possible place, then
75	the fifth and so on until the first possible tree has been produced. Its
76	number of steps is evaluated. Then one "backtracks" by trying the
77	alternative placements of the last species. When these are exhausted
78	one tries the next placement of the next-to-last species. The order of
79	placement in a depth-first search is like this for a
80	four-species case (parentheses enclose monophyletic groups):
81	<P>
82	Make tree of first two species     (A,B)<BR>
83	&nbsp     Add C in first place     ((A,B),C)<BR>
84	&nbsp    &nbsp     Add D in first place     (((A,D),B),C)<BR>
85	&nbsp    &nbsp     Add D in second place     ((A,(B,D)),C)<BR>
86	&nbsp    &nbsp     Add D in third place     (((A,B),D),C)<BR>
87	&nbsp    &nbsp     Add D in fourth place     ((A,B),(C,D))<BR>
88	&nbsp    &nbsp     Add D in fifth place     (((A,B),C),D)<BR>
89	&nbsp     Add C in second place: ((A,C),B)<BR>
90	&nbsp    &nbsp     Add D in first place     (((A,D),C),B)<BR>
91	&nbsp    &nbsp     Add D in second place     ((A,(C,D)),B)<BR>
92	&nbsp    &nbsp     Add D in third place     (((A,C),D),B)<BR>
93	&nbsp    &nbsp     Add D in fourth place     ((A,C),(B,D))<BR>
94	&nbsp    &nbsp     Add D in fifth place     (((A,C),B),D)<BR>
95	&nbsp     Add C in third place     (A,(B,C))<BR>
96	&nbsp    &nbsp     Add D in first place     ((A,D),(B,C))<BR>
97	&nbsp    &nbsp     Add D in second place     (A,((B,D),C))<BR>
98	&nbsp    &nbsp     Add D in third place     (A,(B,(C,D)))<BR>
99	&nbsp    &nbsp     Add D in fourth place     (A,((B,C),D))<BR>
100	&nbsp    &nbsp     Add D in fifth place     ((A,(B,C)),D)<BR>
101	<P>
102	Among these fifteen trees you will find all of the four-species
103	rooted bifurcating trees, each exactly once (the parentheses each enclose
104	a monophyletic group). As displayed above, the backtracking
105	depth-first search algorithm is just another way of producing all
106	possible trees one at a time. The branch and bound algorithm
107	consists of this with one change. As each tree is constructed,
108	including the partial trees such as (A,(B,C)), its number of losses
109	(or retentions of polymorphism)
110	is evaluated.
111	<P>
112	The point of this is that if a previously-found
113	tree such as ((A,B),(C,D)) required fewer losses, then we know that
114	there is no point in even trying to add D to ((A,C),B). We have
115	computed the bound that enables us to cut off a whole line of inquiry
116	(in this case five trees) and avoid going down that particular branch
117	any farther.
118	<P>
119	The branch-and-bound algorithm thus allows us to find all most parsimonious
120	trees without generating all possible trees. How much of a saving this
121	is depends strongly on the data. For very clean (nearly "Hennigian")
122	data, it saves much time, but on very messy data it will still take
123	a very long time.
124	<P>
125	The algorithm in the program differs from the one outlined here
126	in some essential details: it investigates possibilities in the
127	order of their apparent promise. This applies to the order of addition
128	of species, and to the places where they are added to the tree. After
129	the first two-species tree is constructed, the program tries adding
130	each of the remaining species in turn, each in the best possible place it
131	can find. Whichever of those species adds (at a minimum) the most
132	additional steps is taken to be the one to be added next to the tree. When
133	it is added, it is added in turn to places which cause the fewest
134	additional steps to be added. This sounds a bit complex, but it is done
135	with the intention of eliminating regions of the search of all possible
136	trees as soon as possible, and lowering the bound on tree length as quickly
137	as possible.
138	<P>
139	The program keeps a list of all the most parsimonious
140	trees found so far. Whenever
141	it finds one that has fewer losses than
142	these, it clears out the list and
143	restarts the list with that tree. In the process the bound tightens and
144	fewer possibilities need be investigated. At the end the list contains
145	all the shortest trees. These are then printed out. It should be
146	mentioned that the program CLIQUE for finding all largest cliques
147	also works by branch-and-bound. Both problems are NP-complete but for
148	some reason CLIQUE runs far faster. Although their worst-case behavior
149	is bad for both programs, those worst cases occur far more frequently
150	in parsimony problems than in compatibility problems.
151	<P>
152	<H2>Controlling Run Times</H2>
153	<P>
154	Among the quantities available to be set at the
155	beginning of a run of DOLPENNY, two (howoften and howmany) are of particular
156	importance. As DOLPENNY goes along it will keep count of how many
157	trees it has examined. Suppose that howoften is 100 and howmany is 300,
158	the default settings. Every time 100 trees have been examined, DOLPENNY
159	will print out a line saying how many multiples of 100 trees have now been
160	examined, how many steps the most parsimonious tree found so far has,
161	how many trees of with that number of steps have been found, and a very
162	rough estimate of what fraction of all trees have been looked at so far.
163	<P>
164	When the number of these multiples printed out reaches the number howmany
165	(say 1000), the whole algorithm aborts and prints out that it has not
166	found all most parsimonious trees, but prints out what is has got so far
167	anyway. These trees need not be any of the most parsimonious trees: they are
168	simply the most parsimonious ones found so far. By setting the product
169	(howoften X howmany) large you can make
170	the algorithm less likely to abort, but then you risk getting bogged
171	down in a gigantic computation. You should adjust these constants so that
172	the program cannot go beyond examining the number of trees you are reasonably
173	willing to pay for (or wait for). In their initial setting the program will
174	abort after looking at 100,000 trees. Obviously you may want to adjust
175	howoften in order to get more or fewer lines of intermediate notice of how
176	many trees have been looked at so far. Of course, in small cases you may
177	never even reach the first multiple of howoften and nothing will be printed out
178	except some headings and then the final trees.
179	<P>
180	The indication of the approximate percentage of trees searched so far will
181	be helpful in judging how much farther you would have to go to get the full
182	search. Actually, since that fraction is the fraction of the set of all
183	possible trees searched or ruled out so far, and since the search becomes
184	progressively more efficient, the approximate fraction printed out will
185	usually be an underestimate of how far along the program is, sometimes a
186	serious underestimate.
187	<P>
188	A constant that affects the result is "maxtrees",
189	which controls the maximum number of trees that can be stored. Thus if
190	"maxtrees"
191	is 25, and 32 most parsimonious trees are found, only the first 25 of these are
192	stored and printed out. If "maxtrees"
193	is increased, the program does not run any slower but requires a little
194	more intermediate storage space. I recommend
195	that "maxtrees"
196	be kept as large as you can, provided you are willing to
197	look at an output with that many trees on it! Initially,
198	"maxtrees" is set to 100 in the distribution copy.
199	<P>
200	<H2>Methods and Options</H2>
201	<P>
202	The counting of the length of trees is done by an algorithm nearly
203	identical to the corresponding algorithms in DOLLOP, and thus the remainder
204	of this document will be nearly identical to the DOLLOP document. The
205	Dollo parsimony method was
206	first suggested in print in verbal form by Le Quesne (1974) and was
207	first well-specified by Farris (1977). The method is named after Louis
208	Dollo since he was one of the first to assert that in evolution it is
209	harder to gain a complex feature than to lose it. The algorithm
210	explains the presence of the state 1 by allowing up to one forward
211	change 0-->1 and as many reversions 1-->0 as are necessary to explain
212	the pattern of states seen. The program attempts to minimize the number
213	of 1-->0 reversions necessary.
214	<P>
215	The assumptions of this method are in effect:
216	<OL>
217	<LI>We know which state is the ancestral one (state 0).
218	<LI>The characters are evolving independently.
219	<LI>Different lineages evolve independently.
220	<LI>The probability of a forward change (0-->1) is small over the
221	evolutionary times involved.
222	<LI>The probability of a reversion (1-->0) is also small, but
223	still far larger than the probability of a forward change, so
224	that many reversions are easier to envisage than even one
225	extra forward change.
226	<LI>Retention of polymorphism for both states (0 and 1) is highly
227	improbable.
228	<LI>The lengths of the segments of the true tree are not so
229	unequal that two changes in a long segment are as probable as
230	one in a short segment.
231	</OL>
232	<P>
233	That these are the assumptions is established in several of my
234	papers (1973a, 1978b, 1979, 1981b, 1983). For an opposing view arguing
235	that the parsimony methods make no substantive
236	assumptions such as these, see the papers by Farris (1983) and Sober (1983a,
237	1983b), but also read the exchange between Felsenstein and Sober (1986).
238	<P>
239	One problem can arise when using additive binary recoding to
240	represent a multistate character as a series of two-state characters. Unlike
241	the Camin-Sokal, Wagner, and Polymorphism methods, the Dollo
242	method can reconstruct ancestral states which do not exist. An example
243	is given in my 1979 paper. It will be necessary to check the output to
244	make sure that this has not occurred.
245	<P>
246	The polymorphism parsimony method was first used by me,
247	and the results published
248	(without a clear
249	specification of the method) by Inger (1967). The method was
250	published by Farris (1978a) and by me (1979). The method
251	assumes that we can explain the pattern of states by no more than one
252	origination (0-->1) of state 1, followed by retention of polymorphism
253	along as many segments of the tree as are necessary, followed by loss of
254	state 0 or of state 1 where necessary. The program tries to minimize
255	the total number of polymorphic characters, where each polymorphism is
256	counted once for each segment of the tree in which it is retained.
257	<P>
258	The assumptions of the polymorphism parsimony method are in effect:
259	<OL>
260	<LI>The ancestral state (state 0) is known in each character.
261	<LI>The characters are evolving independently of each other.
262	<LI>Different lineages are evolving independently.
263	<LI>Forward change (0-->1) is highly improbable over the length of
264	time involved in the evolution of the group.
265	<LI>Retention of polymorphism is also improbable, but far more
266	probable that forward change, so that we can more easily
267	envisage much polymorhism than even one additional forward
268	change.
269	<LI>Once state 1 is reached, reoccurrence of state 0 is very
270	improbable, much less probable than multiple retentions of
271	polymorphism.
272	<LI>The lengths of segments in the true tree are not so unequal
273	that we can more easily envisage retention events occurring in
274	both of two long segments than one retention in a short
275	segment.
276	</OL>
277	<P>
278	That these are the assumptions of parsimony methods has been documented
279	in a series of papers of mine: (1973a, 1978b, 1979, 1981b,
280	1983b, 1988b). For an opposing view arguing that the parsimony methods
281	make no substantive
282	assumptions such as these, see the papers by Farris (1983) and Sober (1983a,
283	1983b), but also read the exchange between Felsenstein and Sober (1986).
284	<P>
285	The input format is the standard one, with "?", "P", "B" states
286	allowed. Most of the options are selected using a menu:
287	<P>
288	<TABLE><TR><TD BGCOLOR=white>
289	<PRE>
290
291	Penny algorithm for Dollo or polymorphism parsimony, version 3.6a3
292	branch-and-bound to find all most parsimonious trees
293
294	Settings for this run:
295	P Parsimony method? Dollo
296	H How many groups of 100 trees: 1000
297	F How often to report, in trees: 100
298	S Branch and bound is simple? Yes
299	T Use Threshold parsimony? No, use ordinary parsimony
300	A Use ancestral states? No
301	W Sites weighted? No
302	M Analyze multiple data sets? No
303	0 Terminal type (IBM PC, ANSI, none)? (none)
304	1 Print out the data at start of run No
305	2 Print indications of progress of run Yes
306	3 Print out tree Yes
307	4 Print out steps in each character No
308	5 Print states at all nodes of tree No
309	6 Write out trees onto tree file? Yes
310
311	Are these settings correct? (type Y or the letter for one to change)
312	</PRE>
313	</TD></TR></TABLE>
314	<P>
315	The P option toggles between the Polymorphism parsimony method and the
316	default Dollo parsimony method.
317	<P>
318	The options T, A, and M are the usual Threshold, Ancestral
319	States, and Multiple Data Sets options. They are described in the Main
320	documentation file and in the Discrete Characters Programs documentation
321	file.
322	<P>
323	Options F and H reset the
324	variables howoften (F) and howmany (H). The user is prompted for the new
325	values. By setting these larger the program will report its progress less
326	often (howoften) and will run longer (howmany times howoften). These values
327	default to 100 and 1000 which
328	guarantees a search of 100,000 trees, but these can be changed. Note that
329	option F in this program is not the Factors option available in some of
330	the other programs in this section of the package.
331	<P>
332	The use of the A
333	option allows implementation of the unordered Dollo parsimony and unordered
334	polymorphism parsimony methods which I have
335	described elsewhere (1984b). When the A option is used the ancestor is
336	not to be counted as one of the species. The O (outgroup) option is not
337	available since the tree produced is already rooted.
338	<P>
339	Setting T at or below
340	1.0 but above 0 causes the criterion to become compatibility rather than
341	polymorphism parsimony, although there is no advantage to using this
342	program instead of PENNY to do a compatibility method. Setting
343	the threshold value higher brings about an intermediate between
344	the Dollo or polymorphism parsimony methods and the compatibility method,
345	so that there is some rationale for doing that.
346	<P>
347	Using a threshold value of 1.0 or lower, but above 0, one can
348	obtain a rooted (or, if the A option is used with ancestral states of
349	"?", unrooted) compatibility criterion, but there is no particular
350	advantage to using this program for that instead of MIX. Higher
351	threshold values are of course meaningful and provide
352	intermediates between Dollo and compatibility methods.
353	<P>
354	The S (Simple) option alters a step in DOLPENNY which reconsiders the
355	order in which species are added to the tree. Normally the decision as to
356	what species to add to the tree next is made as the first tree is being
357	constructucted; that ordering of species is not altered subsequently. The
358	R option causes it to be continually reconsidered. This will probably
359	result in a substantial increase in run time, but on some data sets of
360	intermediate messiness it may help. It is included in case it might prove
361	of use on some data sets.
362	<P>
363	The Factors
364	option is not available in this program, as it would have no effect on
365	the result even if that information were provided in the input file.
366	<P>
367	The output format is also standard. It includes a rooted tree and,
368	if the user selects option 4, a table
369	of the numbers of reversions or retentions of polymorphism necessary
370	in each character. If any of the
371	ancestral states has been specified to be unknown, a table of
372	reconstructed ancestral states is also provided. When reconstructing
373	the placement of forward changes and reversions under the Dollo method,
374	keep in mind that each
375	polymorphic state in the input data will require one "last minute"
376	reversion. This is included in the tabulated counts. Thus if we have
377	both states 0 and 1 at a tip of the tree the program will assume that
378	the lineage had state 1 up to the last minute, and then state 0 arose in
379	that population by reversion, without loss of state 1.
380	<P>
381	A table is available to be printed out after each tree, showing for each
382	branch whether
383	there are known to be changes in the branch, and what the states are inferred
384	to have been at the top end of the branch. If the inferred state is a "?"
385	there will be multiple equally-parsimonious assignments of states; the user
386	must work these out for themselves by hand.
387	<P>
388	If the A option is used, then the program will
389	infer, for any character whose ancestral state is unknown ("?") whether the
390	ancestral state 0 or 1 will give the best tree. If these are
391	tied, then it may not be possible for the program to infer the
392	state in the internal nodes, and these will all be printed as ".". If this
393	has happened and you want to know more about the states at the internal
394	nodes, you will find helpful to use DOLMOVE to display the tree and examine
395	its interior states, as the algorithm in DOLMOVE shows all that can be known
396	in this case about the interior states, including where there is and is not
397	amibiguity. The algorithm in DOLPENNY gives up more easily on displaying these
398	states.
399	<P>
400	At the beginning of the program are a series of constants,
401	which can be changed to help adapt the program to different computer systems.
402	Two are the initial values of howmany and howoften,
403	constants "often" and "many". Constant "maxtrees"
404	is the maximum number of tied trees that will
405	be stored.
406	<P>
407	<HR>
408	<P>
409	<H3>TEST DATA SET</H3>
410	<P>
411	<TABLE><TR><TD BGCOLOR=white>
412	<PRE>
413	7 6
414	Alpha1 110110
415	Alpha2 110110
416	Beta1 110000
417	Beta2 110000
418	Gamma1 100110
419	Delta 001001
420	Epsilon 001110
421	</PRE>
422	</TD></TR></TABLE>
423	<P>
424	<HR>
425	<P>
426	<H3>TEST SET OUTPUT (with all numerical options turned on)</H3>
427	<P>
428	<TABLE><TR><TD BGCOLOR=white>
429	<PRE>
430
431	Penny algorithm for Dollo or polymorphism parsimony, version 3.6a3
432	branch-and-bound to find all most parsimonious trees
433
434	7 species, 6 characters
435	Dollo parsimony method
436
437
438	Name Characters
439	---- ----------
440
441	Alpha1 11011 0
442	Alpha2 11011 0
443	Beta1 11000 0
444	Beta2 11000 0
445	Gamma1 10011 0
446	Delta 00100 1
447	Epsilon 00111 0
448
449
450
451	requires a total of 3.000
452
453	3 trees in all found
454
455
456
457
458	+-----------------Delta
459	!
460	--2 +--------------Epsilon
461	! !
462	+--3 +-----------Gamma1
463	! !
464	+--6 +--------Alpha2
465	! !
466	+--1 +--Beta2
467	! +--5
468	+--4 +--Beta1
469	!
470	+-----Alpha1
471
472
473	reversions in each character:
474	0 1 2 3 4 5 6 7 8 9
475	*-----------------------------------------
476	0! 0 0 1 1 1 0
477
478	From To Any Steps? State at upper node
479	( . means same as in the node below it on tree)
480
481	root 2 yes ..1.. .
482	2 Delta yes ..... 1
483	2 3 yes ...11 .
484	3 Epsilon no ..... .
485	3 6 yes 1.0.. .
486	6 Gamma1 no ..... .
487	6 1 yes .1... .
488	1 Alpha2 no ..... .
489	1 4 no ..... .
490	4 5 yes ...00 .
491	5 Beta2 no ..... .
492	5 Beta1 no ..... .
493	4 Alpha1 no ..... .
494
495
496
497
498
499	+-----------------Delta
500	!
501	--2 +--------------Epsilon
502	! !
503	+--3 +-----------Gamma1
504	! !
505	+--6 +--Beta2
506	! +-----5
507	! ! +--Beta1
508	+--4
509	! +--Alpha2
510	+-----1
511	+--Alpha1
512
513
514	reversions in each character:
515	0 1 2 3 4 5 6 7 8 9
516	*-----------------------------------------
517	0! 0 0 1 1 1 0
518
519	From To Any Steps? State at upper node
520	( . means same as in the node below it on tree)
521
522	root 2 yes ..1.. .
523	2 Delta yes ..... 1
524	2 3 yes ...11 .
525	3 Epsilon no ..... .
526	3 6 yes 1.0.. .
527	6 Gamma1 no ..... .
528	6 4 yes .1... .
529	4 5 yes ...00 .
530	5 Beta2 no ..... .
531	5 Beta1 no ..... .
532	4 1 no ..... .
533	1 Alpha2 no ..... .
534	1 Alpha1 no ..... .
535
536
537
538
539
540	+-----------------Delta
541	!
542	--2 +--------------Epsilon
543	! !
544	+--3 +-----------Gamma1
545	! !
546	! ! +--Beta2
547	+--6 +--5
548	! +--4 +--Beta1
549	! ! !
550	+--1 +-----Alpha2
551	!
552	+--------Alpha1
553
554
555	reversions in each character:
556	0 1 2 3 4 5 6 7 8 9
557	*-----------------------------------------
558	0! 0 0 1 1 1 0
559
560	From To Any Steps? State at upper node
561	( . means same as in the node below it on tree)
562
563	root 2 yes ..1.. .
564	2 Delta yes ..... 1
565	2 3 yes ...11 .
566	3 Epsilon no ..... .
567	3 6 yes 1.0.. .
568	6 Gamma1 no ..... .
569	6 1 yes .1... .
570	1 4 no ..... .
571	4 5 yes ...00 .
572	5 Beta2 no ..... .
573	5 Beta1 no ..... .
574	4 Alpha2 no ..... .
575	1 Alpha1 no ..... .
576
577
578	</PRE>
579	</TD></TR></TABLE>
580	</BODY>
581	</HTML>

Note: See TracBrowser for help on using the repository browser.

Context Navigation

source: trunk/GDE/PHYLIP/doc/dolpenny.html

Download in other formats: