source: trunk/GDE/PHYLIP/doc/discrete.html

Last change on this file was 2176, checked in by westram, 21 years ago

<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 3.2 Final//EN">
<HTML>
<HEAD>
<TITLE>discrete</TITLE>
<META NAME="description" CONTENT="discrete">
<META NAME="keywords" CONTENT="discrete">
<META NAME="resource-type" CONTENT="document">
<META NAME="distribution" CONTENT="global">
<META HTTP-EQUIV="Content-Type" CONTENT="text/html; charset=iso-8859-1">
</HEAD>
<BODY BGCOLOR="#ccffff">
<DIV ALIGN=RIGHT>
version 3.6
</DIV>
<P>
<DIV ALIGN=CENTER>
<H1>DOCUMENTATION FOR (0,1) DISCRETE CHARACTER PROGRAMS</H1>
</DIV>
<P>
&#169; Copyright 1986-2002 by the University of
Washington.  Written by Joseph Felsenstein.  Permission is granted to copy
this document provided that no fee is charged for it and that this copyright
notice is not removed.
<P>
These programs are intended for use by morphological
systematists who are dealing with discrete characters,
or by molecular evolutionists dealing with presence-absence data on
restriction sites.  One of the programs (PARS) allows multistate
characters, with up to 8 states, plus the unknown state symbol "?".
For the others, the characters
are assumed to be coded into a series of (0,1) two-state characters.  For
most of the programs there are two other states possible, "P", which
stands for the state of Polymorphism for both states (0 and 1), and "?",
which stands for the state of ignorance: it is the state "unknown", or
"does not apply".  The state "P" can also be denoted by "B", for "both".
<P>
There is a method invented by Sokal and Sneath (1963) for linear
sequences of character states, and fully developed for branching sequences
of character states
by Kluge and Farris (1969) for recoding a multistate character
into a series of two-state (0,1) characters.  Suppose we had a character
with four states whose character-state tree had the rooted form:
<P>
<PRE>
               1 ---> 0 ---> 2
                      |
                      |
                      V
                      3
</PRE>
<P>
so that 1 is the ancestral state and 0, 2 and 3 are derived states.  We can
represent this as three two-state characters:
<P>
<PRE>
                Old State           New States
                --- -----           --- ------
                    0                  001
                    1                  000
                    2                  011
                    3                  101
</PRE>
<P>
The three new states correspond to the three arrows in the above character
state tree.  Whether an old state possesses one of the new states depends on
whether the old state has that arrow in its ancestry.  Thus the first new state
corresponds to the bottommost arrow, which only state 3 has in its ancestry,
the second state to the rightmost of the top arrows, and the third state to
the leftmost top arrow.  This coding will guarantee that the number of times
that states arise on the tree (in programs MIX, MOVE, PENNY and BOOT)
or the number of polymorphic states in a tree segment (in the Polymorphism
option of DOLLOP, DOLMOVE, DOLPENNY and DOLBOOT) will correctly
correspond to what would have been the case had our programs been able to take
multistate characters into account.  Although I have shown the above character
state tree as rooted, the recoding method works equally well on unrooted
multistate characters as long as the connections between the states are known
and contain no loops.
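<P>
The recoding just described can be sketched in a few lines of Python.  This
is only an illustration for the rooted case, not code from the package (the
program FACTOR does the real work): the function name
<TT>binary_factors</TT> and its arguments are hypothetical.  Each arrow of
the character state tree becomes one binary character, and a state gets a 1
for every arrow that lies in its ancestry.

```python
# Hypothetical helper, not part of PHYLIP: binary recoding of one
# multistate character from its rooted character state tree.

def binary_factors(arrows, root):
    """Map each state to a string of 0/1 digits, one digit per arrow.

    arrows: list of (from_state, to_state) pairs; digit i of every code
            refers to arrows[i].
    root:   the ancestral state, whose code is all zeros.
    """
    parent = {child: par for par, child in arrows}
    index = {arrow: i for i, arrow in enumerate(arrows)}
    codes = {}
    for state in [root] + [child for _, child in arrows]:
        digits = ["0"] * len(arrows)
        node = state
        # Walk back toward the root, marking every arrow passed on the way.
        while node != root:
            digits[index[(parent[node], node)]] = "1"
            node = parent[node]
        codes[state] = "".join(digits)
    return codes


# Arrow order follows the text: the bottommost arrow (0 to 3), then the
# rightmost top arrow (0 to 2), then the leftmost top arrow (1 to 0).
codes = binary_factors([(0, 3), (0, 2), (1, 0)], root=1)
for state in sorted(codes):
    print(state, codes[state])
```

With the arrows listed in that order the sketch reproduces the table above:
state 0 becomes 001, state 1 becomes 000, state 2 becomes 011 and state 3
becomes 101.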
<P>
However, in the default option of programs DOLLOP, DOLMOVE, DOLPENNY
and DOLBOOT the multistate recoding does not necessarily work properly, as it
may lead the program to reconstruct nonexistent state combinations such as
010.  An example of this problem is given in my paper on alternative
phylogenetic methods (1979).
<P>
If you have multistate character data where the states are connected in a
branching "character state tree" you may want to do the binary recoding
yourself.  Thanks to Christopher Meacham, the package contains
a program, FACTOR, which will do the recoding for you.  For details see
the documentation file for FACTOR.
<P>
We now also have the program PARS, which can do parsimony for unordered
character states.
<P>
<H2>COMPARISON OF METHODS</H2>
<P>
The methods used in these programs make different assumptions about
evolutionary rates, probabilities of different kinds of events, and our
knowledge about the characters or about the character state trees.
Basic references on these assumptions are my 1979, 1981b and 1983b
papers, particularly the last.  The
assumptions of each method are briefly described in the documentation
file for the corresponding program.  In most cases my assertions about what are
the assumptions of these methods are challenged by others, whose papers I also
cite at that point.  Personally, I believe that they are wrong and I am
right.  I must emphasize the importance of
understanding the assumptions underlying the methods you are using.  No
matter how fancy the algorithms, how maximum the likelihood or how
minimum the number of steps, your results can only be as good as the
correspondence between biological reality and your assumptions!
<P>
<H2>INPUT FORMAT</H2>
<P>
The input format is as described in the general documentation file.  The
input starts with a line containing the number of
species and the number of characters.
<P>
In PARS, each character can have up to 8 states plus a "?" state.  In any
character, the first 8 symbols encountered will be taken to represent
these states.  Any of the digits 0-9, letters A-Z and a-z, and even symbols
such as + and -, can be used (and in fact which 8 symbols are used can
be different in different characters).
<P>
In the other discrete characters programs the allowable states are
0, 1, P, B, and ?.  Blanks
may be included between the states (i.e. you can have a
species whose data is DISCOGLOSS0 1 1 0 1 1 1).  It is possible for
extraneous information to follow the end of the character state data on
the same line.  For example, if there were 7 characters in the data set,
a line of species data could read "DISCOGLOSS0110111 Hello there".
<P>
The discrete character data can continue to a new line whenever needed.
The characters are not in the "aligned" or "interleaved" format used by the
molecular sequence programs: they have the name and entire set of characters
for one species, then the name and entire set of characters for the next
one, and so on.  This is known as the sequential format.  Be particularly
careful when you use restriction sites
data, which can be in either the aligned or the sequential format for use in
RESTML but must be in the sequential format for these discrete character
programs.
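<P>
As an illustration of the sequential format, here is a minimal Python sketch
of a reader for it.  It is not the code the programs use, and the name
<TT>read_sequential</TT> is hypothetical.  It assumes that species names
occupy the first 10 characters of a line, as described in the general
documentation file, and it accepts only the states 0, 1, P, B and ?.

```python
NAME_LENGTH = 10  # assumed species-name width, per the general documentation


def read_sequential(lines, allowed="01PB?"):
    """Parse sequential-format discrete character data.

    Returns a list of (name, states) pairs.  Hypothetical helper, not PHYLIP code.
    """
    lines = iter(lines)
    # First line: number of species, then number of characters.
    n_species, n_chars = (int(tok) for tok in next(lines).split()[:2])
    data = []
    for _ in range(n_species):
        line = next(lines).rstrip("\n")
        name, rest = line[:NAME_LENGTH], line[NAME_LENGTH:]
        states = []
        while True:
            for symbol in rest:
                if len(states) == n_chars:
                    break        # anything further on this line is ignored
                if symbol == " ":
                    continue     # blanks between states are allowed
                if symbol not in allowed:
                    raise ValueError(
                        f"bad state {symbol!r} in species {name.strip()!r}")
                states.append(symbol)
            if len(states) == n_chars:
                break
            rest = next(lines).rstrip("\n")  # data continue on the next line
        data.append((name.strip(), "".join(states)))
    return data


example = [
    "2 7",
    "DISCOGLOSS0110111 Hello there",
    "Alpha".ljust(NAME_LENGTH) + "1 0 1",
    "0 0 1 P",
]
print(read_sequential(example))
```

The second species in the example spreads its seven characters over two
lines and separates them with blanks; the trailing "Hello there" on the
first species line is ignored because all seven characters have already been
read.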
<P>
For PARS the discrete character data can be in either Sequential or
Interleaved format; the latter is the default.
<P>
Errors in the input data will often be detected by the programs, and this will
cause them to issue an error message such as 'BAD OUTGROUP NUMBER: ' together
with information as to which species, character, or in this case outgroup
number is the incorrect one.  The program will then terminate; you will have
to look at the data and figure out what went wrong and fix it.  Often an error
in the data causes a lack of synchronization between what is in the data file
and what the program thinks is to be there.  Thus a missing character may
cause the program to read part of the next species name as a character and
complain about its value.  In this type of case you should look for the error
earlier in the data file than the point about which the program is
complaining.
<P>
<H2>OPTIONS GENERALLY AVAILABLE</H2>
<P>
Specific information on options will be given in the documentation
file associated with each program.  However, some options occur in many
programs.  Options are selected from the menu in each
program, but the Old Style programs CLIQUE and FACTOR require information to be put into
the beginning of the input file (particularly the Ancestors, Factors, Weights,
and Mixtures options).  The options information described here is for
the other programs.  See the documentation page for CLIQUE and
FACTOR to find out how they get their options information.
<P>
<UL>
<LI>The A (Ancestral states) option.  This indicates that we are
specifying the ancestral states for each character.  In the menu the
ancestors (A) option must be selected.
An ancestral states input file is read, whose default name is
<TT>ancestors</TT>.  It contains
a line or lines giving the ancestral states for each character.
These may be 0, 1 or ?, the last of these
indicating that the ancestral state is unknown.
<P>
An example is:
<P>
001??11
<P>
The ancestor information can be continued to a new line and can have blanks
between any of the characters in the same way that species character data
can.
In the program CLIQUE the ancestor is instead to be included as a
regular species and
no A option is available.
<P>
<LI>The F (Factors) option.  This is used in programs MOVE, DOLMOVE,
and FACTOR.  It specifies which binary characters correspond
to which multistate characters.  To use the F option you
choose the F option in the program menu.  After that the program
will read a factors file (default name <TT>factors</TT>),
which consists of a line or lines containing a symbol
for each binary character.  The
symbol can be anything, provided that it is the same for binary characters
that correspond to the same multistate character, and changes between
multistate characters.  A good practice is to make it the lower-order digit
of the number of the multistate character.
<P>
For example, if there were 20 binary characters that had been generated by
nine multistate characters having respectively 4, 3, 3, 2, 1, 2, 2, 2, and 1
binary factors you would make the factors file be:
<P>
11112223334456677889
<P>
although it could equivalently be:
<P>
aaaabbbaaabbabbaabba
<P>
All that is important is that the symbol
for each binary character change only when adjacent binary characters
correspond to different multistate characters.  The factors
file contents
can continue to a new line at any time except during the initial characters
filling out the length of a species name.
<P>
In programs CLIQUE and FACTOR the factors information is given in
the Old Style system of putting that information into the input
data file.  The method for doing so is described in the documentation
files for these programs.  We hope to change this in the next
release to use an input factors file.
<P>
<LI>The J (Jumble) option.  This causes the species to be entered into the
tree in a random order rather than in their order in the input file.  The
program prompts you for a random number seed.  This option is described in
the main documentation file.
<P>
<LI>The M (Multiple data sets) option.  This has also been described in the
main documentation file.  It is not to be confused with the M option specified
in the input file, which is the Mixture of methods option (yes, I know
this is confusing).
<P>
<LI>The O (outgroup) option.  This has also already been discussed in the
general documentation file.  It specifies the number of the particular species
which will be used as the outgroup in rerooting the final tree when it is
printed out.  It will not have any effect if the tree is already rooted or is
a user-defined tree.  This option is not available in DOLLOP, DOLMOVE,
or DOLPENNY, which always infer a rooted tree, or CLIQUE, which
requires you to work out the rerooting by hand.  The menu selection will
cause you to be prompted for the number of the outgroup.
<P>
<LI>The T (threshold) option.  This sets a threshold such that if the
number of steps counted in a character is higher than the threshold, it
will be taken to be the threshold value rather than the actual number of
steps.  This option has already been described in the main documentation
file.  The user is prompted for the threshold value.  My 1981 paper
(Felsenstein, 1981b)
explains the logic behind the Threshold option, which is an attractive
alternative to successive weighting of characters.
<P>
<LI>The U (User tree) option.  This has already been described in the
main documentation file.  For all of these programs user trees are to be
specified as bifurcating trees, even in the cases where the tree that
is inferred by the programs is to be regarded as unrooted.
<P>
<LI>The W (Weights) option.  This allows us to specify weights on the
characters, including the possibility of omitting characters from the
analysis.  It has already been described in the main documentation file.  If
the Weights option is used there must be a W on the first line of the
input file.
<P>
<LI>The X (miXture) option.  In the programs MIX, MOVE, and PENNY
the user can specify for each character which parsimony method is
in effect.  This is done by selecting menu option X (not M) and having
an input mixture file, whose default name is <TT>mixture</TT>.
It contains a line or lines with one letter for
each character.  These letters are C or S if the character is to
be reconstructed according to Camin-Sokal parsimony, W or ? if the
character is to be reconstructed according to Wagner parsimony.  So if
there are 10 characters the line giving the mixture might look like this:
<P>
<PRE>
WWWCC WWCWC
</PRE>
<P>
Note that blanks in the sequence of characters (after the first ones that
are as long as the species names) will be ignored, and the information
can go on to a new line at any point.  So this could equally well have been
specified by
<P>
<PRE>
WWW
CCWWCWC
</PRE>
</UL>
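<P>
Two of the options above come down to very small computations, sketched here
in Python.  Neither function is part of the package; the names
<TT>factors_line</TT> and <TT>thresholded_steps</TT> are hypothetical.  The
first builds a factors line using the suggested practice of writing the
lower-order digit of the multistate character number, and the second applies
the rule of the T option to a step count.

```python
def factors_line(factor_counts):
    """One symbol per binary character: the low-order digit of the number
    of the multistate character that the binary character came from."""
    return "".join(str(number % 10) * count
                   for number, count in enumerate(factor_counts, start=1))


def thresholded_steps(steps, threshold):
    """Steps counted for one character, capped at the threshold (T option)."""
    return min(steps, threshold)


# Nine multistate characters with 4, 3, 3, 2, 1, 2, 2, 2 and 1 binary factors.
print(factors_line([4, 3, 3, 2, 1, 2, 2, 2, 1]))   # 11112223334456677889

# With a threshold of 3, characters needing 5 or 8 steps count as 3.
print([thresholded_steps(s, 3) for s in [1, 2, 5, 8]])   # [1, 2, 3, 3]
```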
<P>
<PRE>
   30!   1   2   1   1   1   2   1   3   1   1
   40!   1
</PRE>
<P>
The numbers across the top and down the side indicate which character
is being referred to.  Thus character 23 is column "3" of row "20"
and has 2 steps in this case.
<P>
I cannot emphasize too strongly that just because the tree diagram
which the program prints out contains a particular
branch DOES NOT MEAN
THAT WE HAVE EVIDENCE THAT THE BRANCH IS OF NONZERO LENGTH.
In program PARS the branches have lengths estimated and there
can be trifurcations, but in all other discrete characters programs
the procedure which prints out the tree cannot cope with a trifurcation, nor
can the internal data structures used in my programs.  Therefore, even
when we have no resolution and a multifurcation, successive bifurcations
will be printed out, although some of the branches shown will in fact
be of zero length.  To find out which, you will have to work out
character by character where the placements of the changes on the tree
are, under all possible ways that the changes can be placed on that
tree.
<P>
In PARS the trees are truly multifurcating, and the search is over both
bifurcating and multifurcating trees.  A branch is retained in a tree only
if there is at least one character, under at least one possible most
parsimonious reconstruction of the placement of changes, that has a change in
that branch.  This means that two branches can both be present which are,
however, not both in existence at the same time (in that there is no
most parsimonious reconstruction of changes in the characters that has changes
in both these branches at the same time).
<P>
In PARS, MIX, PENNY, DOLLOP, and DOLPENNY the trees will be (if the user selects
the option to see them)
accompanied by tables showing the reconstructed states of the characters in
the hypothetical ancestral nodes in the interior of the tree.  This will enable
you to reconstruct where the changes were in each of the characters.  In some
cases the state shown in an interior node will be "?", which means that either
0 or 1 would be possible at that point.  In such cases you have to work out
the ambiguity by hand.  A unique assignment of locations of changes is often
not possible in the case of the Wagner parsimony method.  There may be multiple
ways of assigning changes to segments of the tree with that method.  Printing
only one would be misleading, as it might imply that certain segments of the
tree had no change, when another equally valid assignment would put changes
there.  It must be emphasized that all these multiple assignments have exactly
equal numbers of total changes, so that none is preferred over any other.
<P>
I have followed the convention of having
a "." printed out in the table of character states of the hypothetical
ancestral nodes whenever a state is 0 or 1 and its immediate ancestor is the
same.  This has the effect of highlighting the places where changes might have
occurred and making it easy for the user to reconstruct all the alternative
patterns of the character states in the hypothetical ancestral nodes.
In PARS you can, using the menu, turn off this dot-differencing
convention and see all states at all hypothetical ancestral nodes of the tree.
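<P>
The dot-differencing convention can be written out as a short Python sketch.
This is an illustration of the rule as stated above, not the output routine
of the programs; the function name <TT>dot_difference</TT> and the example
states are made up.

```python
def dot_difference(node_states, ancestor_states):
    """Replace a 0 or 1 by "." when the immediate ancestor has the same state."""
    return "".join(
        "." if state in "01" and state == ancestor else state
        for state, ancestor in zip(node_states, ancestor_states)
    )


# Hypothetical states of one node and of its immediate ancestor.
print(dot_difference("0110?1", "0100?0"))   # ..1.?1
```

Only the third and sixth characters differ from the ancestor, so they are
the only digits left visible; the "?" is printed as it is because the rule
applies only to states 0 and 1.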
<P>
On the line in that table corresponding to each branch of the tree will also
be printed "yes", "no" or "maybe" as an answer to the question of whether this
branch is of nonzero length.  If there is no evidence that any character has
changed in that branch, then "no" will be printed.  If there is definite
evidence that one has changed, then "yes" will be printed.  If the matter is
ambiguous, then "maybe" will be printed.  You should keep in mind that all of
these conclusions assume that we are only interested in the assignment of
states that requires the least amount of change.  In reality, the confidence
limit on tree topology usually includes many different topologies, and
presumably the confidence limits on amounts of change in branches
are then also very broad.
<P>
In addition to the table showing numbers of events, a table may be printed out
showing which ancestral state causes the fewest events for each
character.  This will not always be done, but only when the tree is rooted and
some ancestral states are unknown.  This can be used to infer states of
ancestors.  For example, if you use the O (Outgroup) and A (Ancestral states)
options together, with at least some of the ancestral states being given as
"?", then inferences will be made for those characters, as the outgroup makes
the tree rooted if it was not already.
<P>
In programs MIX and PENNY, if you are using the Camin-Sokal parsimony option
with ancestral state "?" and it turns out that the program cannot decide
between ancestral states 0 and 1, it will fail to even attempt reconstruction
of states of the hypothetical ancestors, printing them all out as "." for
those characters.  This is done for internal bookkeeping reasons -- to
reconstruct their changes would require a fair amount of additional code and
additional data structures.  It is not too hard to reconstruct the internal
states by hand, trying the two possible ancestral states one after the
other.  A similar comment applies to the use of ancestral state "?" in the
Dollo or Polymorphism parsimony methods (programs DOLLOP and DOLPENNY) which
also can result in a similar hesitancy to print the estimate of the states of
the hypothetical ancestors.  In all of these cases the program will print "?"
rather than "no" when it describes whether there are any changes in a branch,
since there might or might not be changes in those characters which are not
reconstructed.
<P>
For further information see the documentation files for the
individual programs.
</BODY>
</HTML>