discrete.html

Last change on this file was 2176, checked in by westram, 22 years ago
* empty log message *
Property svn:eol-style set to `native` Property svn:keywords set to `Author Date Id Revision`
File size: 20.1 KB

Line
1	<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 3.2 Final//EN">
2	<HTML>
3	<HEAD>
4	<TITLE>discrete</TITLE>
5	<META NAME="description" CONTENT="discrete">
6	<META NAME="keywords" CONTENT="discrete">
7	<META NAME="resource-type" CONTENT="document">
8	<META NAME="distribution" CONTENT="global">
9	<META HTTP-EQUIV="Content-Type" CONTENT="text/html; charset=iso-8859-1">
10	</HEAD>
11	<BODY BGCOLOR="#ccffff">
12	<DIV ALIGN=RIGHT>
13	version 3.6
14	</DIV>
15	<P>
16	<DIV ALIGN=CENTER>
17	<H1>DOCUMENTATION FOR (0,1) DISCRETE CHARACTER PROGRAMS</H1>
18	</DIV>
19	<P>
20	© Copyright 1986-2002 by the University of
21	Washington. Written by Joseph Felsenstein. Permission is granted to copy
22	this document provided that no fee is charged for it and that this copyright
23	notice is not removed.
24	<P>
25	These programs are intended for use by morphological
26	systematists who are dealing with discrete characters,
27	or by molecular evolutionists dealing with presence-absence data on
28	restriction sites. One of the programs (PARS) allows multistate
29	characters, with up to 8 states, plus the unknown state symbol "?".
30	For the others, the characters
31	are assumed to be coded into a series of (0,1) two-state characters. For
32	most of the programs there are two other states possible, "P", which
33	stands for the state of Polymorphism for both states (0 and 1), and "?",
34	which stands for the state of ignorance: it is the state "unknown", or
35	"does not apply". The state "P" can also be denoted by "B", for "both".
36	<P>
37	There is a method invented by Sokal and Sneath (1963) for linear
38	sequences of character states, and fully developed for branching sequences
39	of character states
40	by Kluge and Farris (1969) for recoding a multistate character
41	into a series of two-state (0,1) characters. Suppose we had a character
42	with four states whose character-state tree had the rooted form:
43	<P>
44	<PRE>
45	1 ---> 0 ---> 2
46	       |
47	       |
48	       V
49	       3
50	</PRE>
51	<P>
52	<P>
53	so that 1 is the ancestral state and 0, 2 and 3 derived states. We can
54	represent this as three two-state characters:
55	<P>
56	<PRE>
57	Old State     New States
58	--- -----     --- ------
59	    0            001
60	    1            000
61	    2            011
62	    3            101
63	</PRE>
64	<P>
65	The three new states correspond to the three arrows in the above character
66	state tree. Possession of one of the new states corresponds to whether or not
67	the old state had that arrow in its ancestry. Thus the first new state
68	corresponds to the bottommost arrow, which only state 3 has in its ancestry,
69	the second state to the rightmost of the top arrows, and the third state to
70	the leftmost top arrow. This coding will guarantee that the number of times
71	that states arise on the tree (in programs MIX, MOVE, PENNY and BOOT)
72	or the number of polymorphic states in a tree segment (in the Polymorphism
73	option of DOLLOP, DOLMOVE, DOLPENNY and DOLBOOT) will correctly
74	correspond to what would have been the case had our programs been able to take
75	multistate characters into account. Although I have shown the above character
76	state tree as rooted, the recoding method works equally well on unrooted
77	multistate characters as long as the connections between the states are known
78	and contain no loops.
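<P>
The recoding rule described above can be sketched in a few lines of Python.
This is only an illustration, not the code used by FACTOR: the function name
<TT>recode_binary</TT> and the representation of the character-state tree as a
list of (parent, child) arrows are assumptions made for the example. Each new
binary character is 1 exactly when the corresponding arrow lies in the ancestry
of the old state. The sketch assumes the arrows contain no loops.
<P>
<PRE>
```python
def recode_binary(arrows, states):
    """Additive binary recoding of one multistate character.

    arrows: list of (parent_state, child_state) pairs, one per arrow of the
            character-state tree, in the order the new binary characters
            should appear.
    states: iterable of the old state symbols.
    Returns a dict mapping each old state to its string of 0/1 digits.
    """
    parent_of = {child: parent for parent, child in arrows}
    codes = {}
    for state in states:
        # Collect every arrow on the path from this state back to the root.
        ancestry = set()
        current = state
        while current in parent_of:
            ancestry.add((parent_of[current], current))
            current = parent_of[current]
        codes[state] = "".join("1" if arrow in ancestry else "0" for arrow in arrows)
    return codes


# Arrows of the example tree, ordered as in the text: the bottommost arrow,
# then the rightmost top arrow, then the leftmost top arrow.
arrows = [("0", "3"), ("0", "2"), ("1", "0")]
codes = recode_binary(arrows, ["0", "1", "2", "3"])
for state in sorted(codes):
    print(state, codes[state])
```
</PRE>
<P>
For the four-state example this gives 001, 000, 011 and 101 for old states
0, 1, 2 and 3, matching the table above.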
79	<P>
80	However, in the default option of programs DOLLOP, DOLMOVE, DOLPENNY
81	and DOLBOOT the multistate recoding does not necessarily work properly, as it
82	may lead the program to reconstruct nonexistent state combinations such as
83	010. An example of this problem is given in my paper on alternative
84	phylogenetic methods (1979).
85	<P>
86	If you have multistate character data where the states are connected in a
87	branching "character state tree" you may want to do the binary recoding
88	yourself. Thanks to Christopher Meacham, the package contains
89	a program, FACTOR, which will do the recoding itself. For details see
90	the documentation file for FACTOR.
91	<P>
92	We now also have the program PARS, which can do parsimony for unordered
93	character states.
94	<P>
95	<H2>COMPARISON OF METHODS</H2>
96	<P>
97	The methods used in these programs make different assumptions about
98	evolutionary rates, probabilities of different kinds of events, and our
99	knowledge about the characters or about the character state trees.
100	Basic references on these assumptions are my 1979, 1981b and 1983b
101	papers, particularly the last. The
102	assumptions of each method are briefly described in the documentation
103	file for the corresponding program. In most cases my assertions about what are
104	the assumptions of these methods are challenged by others, whose papers I also
105	cite at that point. Personally, I believe that they are wrong and I am
106	right. I must emphasize the importance of
107	understanding the assumptions underlying the methods you are using. No
108	matter how fancy the algorithms, how maximum the likelihood or how
109	minimum the number of steps, your results can only be as good as the
110	correspondence between biological reality and your assumptions!
111	<P>
112	<H2>INPUT FORMAT</H2>
113	<P>
114	The input format is as described in the general documentation file. The
115	input starts with a line containing the number of
116	species and the number of characters.
117	<P>
118	In PARS, each character can have up to 8 states plus a "?" state. In any
119	character, the first 8 symbols encountered will be taken to represent
120	these states. Any of the digits 0-9, letters A-Z and a-z, and even symbols
121	such as + and -, can be used (and in fact which 8 symbols are used can
122	be different in different characters).
123	<P>
124	In the other discrete characters programs the allowable states are
125	0, 1, P, B, and ?. Blanks
126	may be included between the states (i.e. you can have a
127	species whose data is DISCOGLOSS0 1 1 0 1 1 1). It is possible for
128	extraneous information to follow the end of the character state data on
129	the same line. For example, if there were 7 characters in the data set,
130	a line of species data could read "DISCOGLOSS0110111 Hello there".
131	<P>
132	The discrete character data can continue to a new line whenever needed.
133	The characters are not in the "aligned" or "interleaved" format used by the
134	molecular sequence programs: they have the name and entire set of characters
135	for one species, then the name and entire set of characters for the next
136	one, and so on. This is known as the sequential format. Be particularly
137	careful when you use restriction sites
138	data, which can be in either the aligned or the sequential format for use in
139	RESTML but must be in the sequential format for these discrete character
140	programs.
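<P>
As an illustration of the sequential format, here is a minimal Python sketch of
a reader. It is not the input routine of any of the programs. It assumes that
the species name occupies the first 10 characters of the line (the width is
defined in the main documentation file, so <TT>NAME_LENGTH</TT> is an
assumption here), and the function name <TT>read_sequential</TT> and the second
species name in the example are hypothetical.
<P>
<PRE>
```python
NAME_LENGTH = 10  # assumed species-name width; see the main documentation file
VALID_STATES = set("01PB?")


def read_sequential(lines):
    """Parse (0,1) character data in sequential format.

    lines: list of text lines, the first holding the number of species and
    the number of characters. Returns a list of (name, states) pairs.
    """
    n_species, n_chars = (int(field) for field in lines[0].split()[:2])
    position = 1
    species = []
    for _ in range(n_species):
        line = lines[position]
        position += 1
        name = line[:NAME_LENGTH].strip()
        rest = line[NAME_LENGTH:]
        states = []
        while True:
            for symbol in rest:
                if len(states) == n_chars:
                    break          # anything further on this line is ignored
                if symbol.isspace():
                    continue       # blanks between states are allowed
                if symbol not in VALID_STATES:
                    raise ValueError(f"bad state {symbol!r} in species {name!r}")
                states.append(symbol)
            if len(states) == n_chars:
                break
            # The data for this species continue on the next line.
            rest = lines[position]
            position += 1
        species.append((name, "".join(states)))
    return species


# Two species with 7 characters each; the second name is made up.
example = [
    "2 7",
    "DISCOGLOSS0 1 1 0 1 1 1",
    "ALPHA     0110",
    "111 Hello there",
]
records = read_sequential(example)
```
</PRE>
<P>
A missing character makes a reader of this kind run on into the next species
name, which is why the reported error can lie later in the file than the
actual mistake.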
141	<P>
142	For PARS the discrete character data can be in either Sequential or
143	Interleaved format; the latter is the default.
144	<P>
145	Errors in the input data will often be detected by the programs, and this will
146	cause them to issue an error message such as 'BAD OUTGROUP NUMBER: ' together
147	with information as to which species, character, or in this case outgroup
148	number is the incorrect one. The program will then terminate; you will have
149	to look at the data and figure out what went wrong and fix it. Often an error
150	in the data causes a lack of synchronization between what is in the data file
151	and what the program thinks is to be there. Thus a missing character may
152	cause the program to read part of the next species name as a character and
153	complain about its value. In this type of case you should look for the error
154	earlier in the data file than the point about which the program is
155	complaining.
156	<P>
157	<H2>OPTIONS GENERALLY AVAILABLE</H2>
158	<P>
159	Specific information on options will be given in the documentation
160	file associated with each program. However, some options occur in many
161	programs. Options are selected from the menu in each
162	program, but the Old Style programs CLIQUE and FACTOR require information to be put into
163	the beginning of the input file (particularly the Ancestors, Factors, Weights,
164	and Mixtures options). The options information described here is for
165	the other programs. See the documentation page for CLIQUE and
166	FACTOR to find out how they get their options information.
167	<P>
168	<UL>
169	<LI>The A (Ancestral states) option. This indicates that we are
170	specifying the ancestral states for each character. In the menu the
171	ancestors (A) option must be selected.
172	An ancestral states input file is read, whose default name is
173	<TT>ancestors</TT>. It contains
174	a line or lines giving the ancestral states for each character.
175	These may be 0, 1 or ?, the latter
176	indicating that the ancestral state is unknown.
177	<P>
178	An example is:
179	<P>
180	001??11
181	<P>
182	The ancestor information can be continued to a new line and can have blanks
183	between any of the characters in the same way that species character data
184	can.
185	In the program CLIQUE the ancestor is instead to be included as a
186	regular species and
187	no A option is available.
188	<P>
189	<LI>The F (Factors) option. This is used in programs MOVE, DOLMOVE,
190	and FACTOR. It specifies which binary characters correspond
191	to which multistate characters. To use the F option you
192	choose the F option in the program menu. After that the program
193	will read a factors file (default name <TT>factors</TT>)
194	which consists of a line or lines containing a symbol
195	for each binary character. The
196	symbol can be anything, provided that it is the same for binary characters
197	that correspond to the same multistate character, and changes between
198	multistate characters. A good practice is to make it the lower-order digit
199	of the number of the multistate character.
200	<P>
201	For example, if there were 20 binary characters that had been generated by
202	nine multistate characters having respectively 4, 3, 3, 2, 1, 2, 2, 2, and 1
203	binary factors you would make the factors file be:
204	<P>
205	11112223334456677889
206	<P>
207	although it could equivalently be:
208	<P>
209	aaaabbbaaabbabbaabba
210	<P>
211	All that is important is that the symbol
212	for each binary character change only when adjacent binary characters
213	correspond to different multistate characters. The factors
214	file contents
215	can continue to a new line at any time except during the initial characters
216	filling out the length of a species name.
217	<P>
218	In programs CLIQUE and FACTOR the factors information is given in
219	the Old Style system of putting that information into the input
220	data file. The method for doing so is described in the documentation
221	files for these programs. We hope to change this in the next
222	release to use an input factors file.
223	<P>
224	<LI>The J (Jumble) option. This causes the species to be entered into the
225	tree in a random order rather than in their order in the input file. The
226	program prompts you for a random number seed. This option is described in
227	the main documentation file.
228	<P>
229	<LI>The M (Multiple data sets) option. This has also been described in the
230	main documentation file. It is not to be confused with the M option specified
231	in the input file, which is the Mixture of methods option (yes, I know
232	this is confusing).
233	<P>
234	<LI>The O (outgroup) option. This has also already been discussed in the
235	general documentation file. It specifies the number of the particular species
236	which will be used as the outgroup in rerooting the final tree when it is
237	printed out. It will not have any effect if the tree is already rooted or is
238	a user-defined tree. This option is not available in DOLLOP, DOLMOVE,
239	or DOLPENNY, which always infer a rooted tree, or CLIQUE, which
240	requires you to work out the rerooting by hand. The menu selection will
241	cause you to be prompted for the number of the outgroup.
242	<P>
243	<LI>The T (threshold) option. This sets a threshold such that if the
244	number of steps counted in a character is higher than the threshold, it
245	will be taken to be the threshold value rather than the actual number of
246	steps. This option has already been described in the main documentation
247	file. The user is prompted for the threshold value. My 1981 paper
248	(Felsenstein, 1981b)
249	explains the logic behind the Threshold option, which is an attractive
250	alternative to successive weighting of characters.
251	<P>
252	<LI>The U (User tree) option. This has already been described in the
253	main documentation file. For all of these programs user trees are to be
254	specified as bifurcating trees, even in the cases where the tree that
255	is inferred by the programs is to be regarded as unrooted.
256	<P>
257	<LI>The W (Weights) option. This allows us to specify weights on the
258	characters, including the possibility of omitting characters from the
259	analysis. It has already been described in the main documentation file. If
260	the Weights option is used there must be a W on the first line of the
261	input file.
262	<P>
263	<LI>The X (miXture) option. In the programs MIX, MOVE, and PENNY
264	the user can specify for each character which parsimony method is
265	in effect. This is done by selecting menu option X (not M) and having
266	an input mixture file, whose default name is <TT>mixture</TT>.
267	It contains a line or lines with one letter for
268	each character. These letters are C or S if the character is to
269	be reconstructed according to Camin-Sokal parsimony, W or ? if the
270	character is to be reconstructed according to Wagner parsimony. So if
271	there are 10 characters the line giving the mixture might look like this:
272	<P>
273	<PRE>
274	WWWCC WWCWC
275	</PRE>
276	<P>
277	Note that blanks in the sequence of characters (after the first ones that
278	are as long as the species names) will be ignored, and the information
279	can go on to a new line at any point. So this could equally well have been
280	specified by
281	<P>
282	<PRE>
283	WW
284	WCCWWCWC
285	</PRE>
286	</UL>
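<P>
The factors, mixture and threshold conventions described in the list above can
be sketched in Python as follows. These helpers are illustrations only, not
the routines used by the programs, and all of the function names are
hypothetical. The sketch assumes that the options files contain nothing but
the symbols themselves plus blanks and line breaks.
<P>
<PRE>
```python
def read_symbols(text):
    """Return the non-blank symbols of an options file, in order."""
    return [symbol for symbol in text if not symbol.isspace()]


def factor_groups(symbols):
    """Group binary characters (numbered from 1) by multistate character.

    A new group starts wherever the factors symbol changes.
    """
    groups = []
    for number, symbol in enumerate(symbols, start=1):
        if number == 1 or symbol != symbols[number - 2]:
            groups.append([])
        groups[-1].append(number)
    return groups


def mixture_methods(symbols):
    """Translate mixture letters into method names, one per character."""
    names = {"C": "Camin-Sokal", "S": "Camin-Sokal", "W": "Wagner", "?": "Wagner"}
    return [names[symbol.upper()] for symbol in symbols]


def thresholded_steps(steps, threshold):
    """Steps counted for one character under the T (threshold) option."""
    return min(steps, threshold)


# The two equivalent factors files from the text give the same grouping.
by_digits = factor_groups(read_symbols("11112223334456677889"))
by_letters = factor_groups(read_symbols("aaaabbbaaabbabbaabba"))
print([len(group) for group in by_digits])

# Blanks and line breaks in a mixture file do not matter.
one_line = mixture_methods(read_symbols("WWWCC WWCWC"))
two_lines = mixture_methods(read_symbols("WWW\nCCWWCWC"))
```
</PRE>
<P>
Because only a change of symbol starts a new group, the digit and letter
versions of the factors file yield the same nine groups of sizes
4, 3, 3, 2, 1, 2, 2, 2 and 1.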
287	<P><PRE>
288	30! 1 2 1 1 1 2 1 3 1 1
289	40! 1
290	</PRE>
291	<P>
292	The numbers across the top and down the side indicate which character
293	is being referred to. Thus character 23 is column "3" of row "20"
294	and has 2 steps in this case.
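<P>
Reading such a table by program rather than by eye might look like the
following Python sketch. The function names are hypothetical, and the sketch
assumes that each row consists of a label, a "!" and up to ten step counts
separated by blanks, and that the row labelled 0 starts at character 1.
<P>
<PRE>
```python
def locate(character):
    """Return the (row label, column) at which a character is listed."""
    return 10 * (character // 10), character % 10


def parse_steps_rows(rows):
    """Map character number to steps from rows such as '30! 1 2 1'."""
    steps = {}
    for row in rows:
        label, _, values = row.partition("!")
        first = max(int(label), 1)   # row 0 starts at character 1 (assumed)
        for offset, value in enumerate(values.split()):
            steps[first + offset] = int(value)
    return steps


# The two rows shown above.
steps = parse_steps_rows(["30! 1 2 1 1 1 2 1 3 1 1", "40! 1"])
print(locate(23), steps[37])
```
</PRE>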
295	<P>
296	I cannot emphasize too strongly that just because the tree diagram
297	which the program prints out contains a particular
298	branch DOES NOT MEAN
299	THAT WE HAVE EVIDENCE THAT THE BRANCH IS OF NONZERO LENGTH.
300	In program PARS the branches have lengths estimated and there
301	can be trifurcations, but in all other discrete characters programs
302	the procedure which prints out the tree cannot cope with a trifurcation, nor
303	can the internal data structures used in my programs. Therefore, even
304	when we have no resolution and a multifurcation, successive bifurcations
305	will be printed out, although some of the branches shown will in fact
306	be of zero length. To find out which, you will have to work out
307	character by character where the placements of the changes on the tree
308	are, under all possible ways that the changes can be placed on that
309	tree.
310	<P>
311	In PARS the trees are truly multifurcating, and the search is over both
312	bifurcating and multifurcating trees. A branch is retained in a tree only
313	if there is at least one character, under at least one possible most
314	parsimonious reconstruction of the placement of changes, that has a change in
315	that branch. This means that two branches can both be present which are,
316	however, not both in existence at the same time (in that there is no
317	most parsimonious reconstruction of changes in the characters that has changes
318	in both these branches at the same time).
319	<P>
320	In PARS, MIX, PENNY, DOLLOP, and DOLPENNY the trees will be (if the user selects
321	the option to see them)
322	accompanied by tables showing the reconstructed states of the characters in
323	the hypothetical ancestral nodes in the interior of the tree. This will enable
324	you to reconstruct where the changes were in each of the characters. In some
325	cases the state shown in an interior node will be "?", which means that either
326	0 or 1 would be possible at that point. In such cases you have to work out
327	the ambiguity by hand. A unique assignment of locations of changes is often
328	not possible in the case of the Wagner parsimony method. There may be multiple
329	ways of assigning changes to segments of the tree with that method. Printing
330	only one would be misleading, as it might imply that certain segments of the
331	tree had no change, when another equally valid assignment would put changes
332	there. It must be emphasized that all these multiple assignments have exactly
333	equal numbers of total changes, so that none is preferred over any other.
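<P>
To see how an interior state can be ambiguous, the following Python sketch
counts changes for a single two-state character with a Fitch-style
set-intersection pass, which for characters with only states 0 and 1 gives the
same count as Wagner parsimony. It is a stand-in written for illustration, not
the algorithm as implemented in the programs. The function name
<TT>fitch</TT> and the nested-tuple tree are assumptions, the states P and B
are not handled, and the sets come from a single pass from the tips toward the
root, so working out every equally parsimonious assignment would need more
than this.
<P>
<PRE>
```python
def fitch(node, tip_states):
    """Return (state_set, steps) for the subtree rooted at node.

    node: a tip name (str) or a 2-tuple (left, right) of subtrees.
    tip_states: dict mapping tip name to "0", "1" or "?".
    """
    if isinstance(node, str):
        state = tip_states[node]
        return ({"0", "1"} if state == "?" else {state}), 0
    left_set, left_steps = fitch(node[0], tip_states)
    right_set, right_steps = fitch(node[1], tip_states)
    common = left_set.intersection(right_set)
    if common:
        return common, left_steps + right_steps
    # No shared state: one change is needed on one of the two branches.
    return left_set.union(right_set), left_steps + right_steps + 1


tree = (("A", "B"), ("C", "D"))
tips = {"A": "0", "B": "1", "C": "0", "D": "1"}
root_set, total_steps = fitch(tree, tips)
print(sorted(root_set), total_steps)
```
</PRE>
<P>
Here two changes are required and both 0 and 1 remain possible at the root,
which is the kind of situation that is shown as "?" in the table.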
334	<P>
335	I have followed the convention of having
336	a "." printed out in the table of character states of the hypothetical
337	ancestral nodes whenever a state is 0 or 1 and its immediate ancestor is the
338	same. This has the effect of highlighting the places where changes might have
339	occurred and making it easy for the user to reconstruct all the alternative
340	patterns of the character states in the hypothetical ancestral nodes.
341	In PARS you can, using the menu, turn off this dot-differencing
342	convention and see all states at all hypothetical ancestral nodes of the tree.
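<P>
The dot convention can be written down in a few lines of Python. This is a
sketch of the rule as stated above, with a hypothetical function name, and it
is not taken from the programs themselves.
<P>
<PRE>
```python
def dot_difference(ancestor_states, node_states):
    """Replace a state by "." when it is 0 or 1 and equals its ancestor's."""
    shown = []
    for above, here in zip(ancestor_states, node_states):
        if here in "01" and here == above:
            shown.append(".")
        else:
            shown.append(here)
    return "".join(shown)


row = dot_difference("0110?1", "0100?1")
print(row)
```
</PRE>
<P>
Only the third character differs from the ancestor in this made-up example, so
it is the only 0 or 1 left visible. The "?" is always shown.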
343	<P>
344	On the line in that table corresponding to each branch of the tree will also
345	be printed "yes", "no" or "maybe" as an answer to the question of whether this
346	branch is of nonzero length. If there is no evidence that any character has
347	changed in that branch, then "no" will be printed. If there is definite
348	evidence that one has changed, then "yes" will be printed. If the matter is
349	ambiguous, then "maybe" will be printed. You should keep in mind that all of
350	these conclusions assume that we are only interested in the assignment of
351	states that requires the least amount of change. In reality, the confidence
352	limit on tree topology usually includes many different topologies, and
353	presumably the confidence limits on amounts of change in branches
354	are then also very broad.
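<P>
One simple way to arrive at such a verdict is sketched below in Python. The
rule used here is an assumption made for illustration (a definite difference
between the two ends of a branch gives "yes", otherwise any "?" at either end
gives "maybe"), and the programs may decide the matter differently. The
function name is hypothetical.
<P>
<PRE>
```python
def branch_verdict(ancestor_states, node_states):
    """Answer whether a branch is of nonzero length: yes, no or maybe."""
    ambiguous = False
    for above, here in zip(ancestor_states, node_states):
        if "?" in (above, here):
            ambiguous = True      # a change here can be neither shown nor ruled out
        elif above != here:
            return "yes"          # a definite change in this character
    return "maybe" if ambiguous else "no"


verdicts = [
    branch_verdict("0110", "0100"),
    branch_verdict("01?0", "0110"),
    branch_verdict("0110", "0110"),
]
print(verdicts)
```
</PRE>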
355	<P>
375	In addition to the table showing numbers of events, a table may be printed out
376	showing which ancestral state causes the fewest events for each
377	character. This will not always be done, but only when the tree is rooted and
378	some ancestral states are unknown. This can be used to infer states of
379	ancestors. For example, if you use the O (Outgroup) and A (Ancestral states)
380	options together, with at least some of the ancestral states being given as
381	"?", then inferences will be made for those characters, as the outgroup makes
382	the tree rooted if it was not already.
383	<P>
384	In programs MIX and PENNY, if you are using the Camin-Sokal parsimony option
385	with ancestral state "?" and it turns out that the program cannot decide
386	between ancestral states 0 and 1, it will fail to even attempt reconstruction
387	of states of the hypothetical ancestors, printing them all out as "." for
388	those characters. This is done for internal bookkeeping reasons -- to
389	reconstruct their changes would require a fair amount of additional code and
390	additional data structures. It is not too hard to reconstruct the internal
391	states by hand, trying the two possible ancestral states one after the
392	other. A similar comment applies to the use of ancestral state "?" in the
393	Dollo or Polymorphism parsimony methods (programs DOLLOP and DOLPENNY) which
394	also can result in a similar hesitancy to print the estimate of the states of
395	the hypothetical ancestors. In all of these cases the program will print "?"
396	rather than "no" when it describes whether there are any changes in a branch,
397	since there might or might not be changes in those characters which are not
398	reconstructed.
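<P>
Trying the two possible ancestral states one after the other can also be done
with a short script. The Python sketch below counts changes for one character
under Camin-Sokal parsimony, in which only changes from the ancestral to the
derived state are allowed, and then compares the two counts. It is an
illustration under stated assumptions, not the code of MIX or PENNY: the
function names and the nested-tuple tree are hypothetical, tips may only be 0
or 1, and a tree whose tips are all derived is charged one change on the
branch below the root.
<P>
<PRE>
```python
def camin_sokal_steps(tree, tip_states, ancestral):
    """Count changes when only ancestral-to-derived changes are allowed."""

    def visit(node):
        # Returns (all_derived, steps) for the subtree rooted at node.
        if isinstance(node, str):
            return tip_states[node] != ancestral, 0
        results = [visit(child) for child in node]
        if all(derived for derived, _ in results):
            return True, 0   # the single change is charged further down the tree
        steps = sum(count for _, count in results)
        # Each wholly derived child needs one change on its own branch.
        steps += sum(1 for derived, _ in results if derived)
        return False, steps

    all_derived, steps = visit(tree)
    return 1 if all_derived else steps


def best_ancestral_state(tree, tip_states):
    """Return the ancestral state needing fewest changes, or "?" on a tie."""
    counts = {state: camin_sokal_steps(tree, tip_states, state) for state in "01"}
    fewest = min(counts.values())
    best = [state for state in "01" if counts[state] == fewest]
    return (best[0] if len(best) == 1 else "?"), counts


tree = (("A", "B"), ("C", "D"))
tips = {"A": "1", "B": "1", "C": "0", "D": "1"}
choice, counts = best_ancestral_state(tree, tips)
print(choice, counts)
```
</PRE>
<P>
In this made-up example ancestral state 0 needs two changes and ancestral
state 1 needs one, so 1 is preferred. When the two counts are equal the sketch
returns "?", which corresponds to the case where the program cannot decide.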
399	<P>
400	For further information see the documentation files for the
401	individual programs.
402	</BODY>
403	</HTML>

source: trunk/GDE/PHYLIP/doc/discrete.html
