1 | <!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 3.2 Final//EN"> |
---|
2 | <HTML> |
---|
3 | <HEAD> |
---|
4 | <TITLE>discrete</TITLE> |
---|
5 | <META NAME="description" CONTENT="discrete"> |
---|
6 | <META NAME="keywords" CONTENT="discrete"> |
---|
7 | <META NAME="resource-type" CONTENT="document"> |
---|
8 | <META NAME="distribution" CONTENT="global"> |
---|
9 | <META HTTP-EQUIV="Content-Type" CONTENT="text/html; charset=iso-8859-1"> |
---|
10 | </HEAD> |
---|
11 | <BODY BGCOLOR="#ccffff"> |
---|
12 | <DIV ALIGN=RIGHT> |
---|
13 | version 3.6 |
---|
14 | </DIV> |
---|
15 | <P> |
---|
16 | <DIV ALIGN=CENTER> |
---|
17 | <H1>DOCUMENTATION FOR (0,1) DISCRETE CHARACTER PROGRAMS</H1> |
---|
18 | </DIV> |
---|
19 | <P> |
---|
20 | © Copyright 1986-2002 by the University of |
---|
21 | Washington. Written by Joseph Felsenstein. Permission is granted to copy |
---|
22 | this document provided that no fee is charged for it and that this copyright |
---|
23 | notice is not removed. |
---|
24 | <P> |
---|
25 | These programs are intended for the use of morphological |
---|
26 | systematists who are dealing with discrete characters, |
---|
27 | or by molecular evolutionists dealing with presence-absence data on |
---|
28 | restriction sites. One of the programs (PARS) allows multistate |
---|
29 | characters, with up to 8 states, plus the unknown state symbol "?". |
---|
30 | For the others, the characters |
---|
31 | are assumed to be coded into a series of (0,1) two-state characters. For |
---|
32 | most of the programs there are two other states possible, "P", which |
---|
33 | stands for the state of Polymorphism for both states (0 and 1), and "?", |
---|
34 | which stands for the state of ignorance: it is the state "unknown", or |
---|
35 | "does not apply". The state "P" can also be denoted by "B", for "both". |
---|
36 | <P> |
---|
37 | There is a method invented by Sokal and Sneath (1963) for linear |
---|
38 | sequences of character states, and fully developed for branching sequences |
---|
39 | of character states |
---|
40 | by Kluge and Farris (1969) for recoding a multistate character |
---|
41 | into a series of two-state (0,1) characters. Suppose we had a character |
---|
42 | with four states whose character-state tree had the rooted form: |
---|
43 | <P> |
---|
44 | <PRE> |
---|
45 | 1 ---> 0 ---> 2 |
---|
46 |        | |
---|
47 |        | |
---|
48 |        V |
---|
49 |        3 |
---|
50 | </PRE> |
---|
51 | <P> |
---|
52 | <P> |
---|
53 | so that 1 is the ancestral state and 0, 2 and 3 derived states. We can |
---|
54 | represent this as three two-state characters: |
---|
55 | <P> |
---|
56 | <PRE> |
---|
57 | Old State New States |
---|
58 | --- ----- --- ------ |
---|
59 | 0 001 |
---|
60 | 1 000 |
---|
61 | 2 011 |
---|
62 | 3 101 |
---|
63 | </PRE> |
---|
64 | <P> |
---|
65 | The three new states correspond to the three arrows in the above character |
---|
66 | state tree. Possession of one of the new states corresponds to whether or not |
---|
67 | the old state had that arrow in its ancestry. Thus the first new state |
---|
68 | corresponds to the bottommost arrow, which only state 3 has in its ancestry, |
---|
69 | the second state to the rightmost of the top arrows, and the third state to |
---|
70 | the leftmost top arrow. This coding will guarantee that the number of times |
---|
71 | that states arise on the tree (in programs MIX, MOVE, PENNY and BOOT) |
---|
72 | or the number of polymorphic states in a tree segment (in the Polymorphism |
---|
73 | option of DOLLOP, DOLMOVE, DOLPENNY and DOLBOOT) will correctly |
---|
74 | correspond to what would have been the case had our programs been able to take |
---|
75 | multistate characters into account. Although I have shown the above character |
---|
76 | state tree as rooted, the recoding method works equally well on unrooted |
---|
77 | multistate characters as long as the connections between the states are known |
---|
78 | and contain no loops. |
---|
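<P>
The recoding described above can be sketched in a few lines of Python. This is only an illustration, not the code used by FACTOR: the function name <TT>recode</TT>, the <TT>parent</TT> map and the <TT>arrows</TT> list are hypothetical names chosen here. The sketch assumes the character-state tree is given as a map from each derived state to the state it arose from, and that the arrows are listed in the order of the new binary characters (bottommost arrow, rightmost top arrow, leftmost top arrow).

```python
def recode(parent, arrows):
    """Binary recoding of one multistate character (illustrative sketch).

    parent maps each derived state to the state it arose from.
    arrows lists the (from, to) arrows of the character-state tree
    in the order of the new binary characters.
    """
    codes = {}
    states = set(parent) | set(parent.values())
    for state in states:
        ancestry = set()
        s = state
        while s in parent:  # walk back toward the ancestral state
            ancestry.add((parent[s], s))
            s = parent[s]
        # a new binary character is 1 when its arrow lies in the ancestry
        codes[state] = "".join("1" if a in ancestry else "0" for a in arrows)
    return codes


# The four-state example: 1 is ancestral, 0 arises from 1, 2 and 3 arise from 0.
parent = {0: 1, 2: 0, 3: 0}
arrows = [(0, 3), (0, 2), (1, 0)]
codes = recode(parent, arrows)
for state in sorted(codes):
    print(state, codes[state])
```

For the example tree this reproduces the table of old and new states given above.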
79 | <P> |
---|
80 | However, in the default option of programs DOLLOP, DOLMOVE, DOLPENNY |
---|
81 | and DOLBOOT the multistate recoding does not necessarily work properly, as it |
---|
82 | may lead the program to reconstruct nonexistent state combinations such as |
---|
83 | 010. An example of this problem is given in my paper on alternative |
---|
84 | phylogenetic methods (1979). |
---|
85 | <P> |
---|
86 | If you have multistate character data where the states are connected in a |
---|
87 | branching "character state tree" you may want to do the binary recoding |
---|
88 | yourself. Thanks to Christopher Meacham, the package contains |
---|
89 | a program, FACTOR, which will do the recoding for you. For details see |
---|
90 | the documentation file for FACTOR. |
---|
91 | <P> |
---|
92 | We now also have the program PARS, which can do parsimony for unordered |
---|
93 | character states. |
---|
94 | <P> |
---|
95 | <H2>COMPARISON OF METHODS</H2> |
---|
96 | <P> |
---|
97 | The methods used in these programs make different assumptions about |
---|
98 | evolutionary rates, probabilities of different kinds of events, and our |
---|
99 | knowledge about the characters or about the character state trees. |
---|
100 | Basic references on these assumptions are my 1979, 1981b and 1983b |
---|
101 | papers, particularly the last of these. The |
---|
102 | assumptions of each method are briefly described in the documentation |
---|
103 | file for the corresponding program. In most cases my assertions about what are |
---|
104 | the assumptions of these methods are challenged by others, whose papers I also |
---|
105 | cite at that point. Personally, I believe that they are wrong and I am |
---|
106 | right. I must emphasize the importance of |
---|
107 | understanding the assumptions underlying the methods you are using. No |
---|
108 | matter how fancy the algorithms, how maximum the likelihood or how |
---|
109 | minimum the number of steps, your results can only be as good as the |
---|
110 | correspondence between biological reality and your assumptions! |
---|
111 | <P> |
---|
112 | <H2>INPUT FORMAT</H2> |
---|
113 | <P> |
---|
114 | The input format is as described in the general documentation file. The |
---|
115 | input starts with a line containing the number of |
---|
116 | species and the number of characters. |
---|
117 | <P> |
---|
118 | In PARS, each character can have up to 8 states plus a "?" state. In any |
---|
119 | character, the first 8 symbols encountered will be taken to represent |
---|
120 | these states. Any of the digits 0-9, letters A-Z and a-z, and even symbols |
---|
121 | such as + and -, can be used (and in fact which 8 symbols are used can |
---|
122 | be different in different characters). |
---|
123 | <P> |
---|
124 | In the other discrete characters programs the allowable states are |
---|
125 | 0, 1, P, B, and ?. Blanks |
---|
126 | may be included between the states (i.e., you can have a |
---|
127 | species whose data is DISCOGLOSS0 1 1 0 1 1 1). It is possible for |
---|
128 | extraneous information to follow the end of the character state data on |
---|
129 | the same line. For example, if there were 7 characters in the data set, |
---|
130 | a line of species data could read "DISCOGLOSS0110111 Hello there". |
---|
131 | <P> |
---|
132 | The discrete character data can continue to a new line whenever needed. |
---|
133 | The characters are not in the "aligned" or "interleaved" format used by the |
---|
134 | molecular sequence programs: they have the name and entire set of characters |
---|
135 | for one species, then the name and entire set of characters for the next |
---|
136 | one, and so on. This is known as the sequential format. Be particularly |
---|
137 | careful when you use restriction sites |
---|
138 | data, which can be in either the aligned or the sequential format for use in |
---|
139 | RESTML but must be in the sequential format for these discrete character |
---|
140 | programs. |
---|
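<P>
As a rough illustration of the sequential format just described, the following Python sketch reads a small data set. It is not the reader used by the programs. The function name <TT>read_sequential</TT> is hypothetical, the species name ALPHA is made up, and the fixed name width of 10 characters is an assumption taken from the general documentation rather than from this page.

```python
ALLOWED = set("01PB?")
NAME_LENGTH = 10  # assumed fixed width of a species name


def read_sequential(text, name_length=NAME_LENGTH):
    """Read (0,1) discrete character data in sequential format (sketch)."""
    lines = iter(text.splitlines())
    nspecies, nchars = (int(tok) for tok in next(lines).split()[:2])
    data = {}
    for _ in range(nspecies):
        line = next(lines)
        name = line[:name_length].strip()
        rest = line[name_length:]
        states = []
        while True:
            for ch in rest:
                if ch.isspace():
                    continue  # blanks between states are ignored
                if ch not in ALLOWED:
                    raise ValueError(f"bad state {ch!r} in species {name!r}")
                states.append(ch)
                if len(states) == nchars:
                    break  # anything left on this line is extraneous
            if len(states) == nchars:
                break
            rest = next(lines)  # data continue on the next line
        data[name] = "".join(states)
    return data


sample = (
    "2 7\n"
    "DISCOGLOSS0 1 1 0 1 1 1\n"
    "ALPHA     0110\n"
    "111 Hello there\n"
)
data = read_sequential(sample)
print(data)
```

The second species shows both a continuation onto a new line and trailing text after the last character, which the sketch skips.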
141 | <P> |
---|
142 | For PARS the discrete character data can be in either Sequential or |
---|
143 | Interleaved format; the latter is the default. |
---|
144 | <P> |
---|
145 | Errors in the input data will often be detected by the programs, and this will |
---|
146 | cause them to issue an error message such as 'BAD OUTGROUP NUMBER: ' together |
---|
147 | with information as to which species, character, or in this case outgroup |
---|
148 | number is the incorrect one. The program will then terminate; you will have |
---|
149 | to look at the data and figure out what went wrong and fix it. Often an error |
---|
150 | in the data causes a lack of synchronization between what is in the data file |
---|
151 | and what the program thinks is to be there. Thus a missing character may |
---|
152 | cause the program to read part of the next species name as a character and |
---|
153 | complain about its value. In this type of case you should look for the error |
---|
154 | earlier in the data file than the point about which the program is |
---|
155 | complaining. |
---|
156 | <P> |
---|
157 | <H2>OPTIONS GENERALLY AVAILABLE</H2> |
---|
158 | <P> |
---|
159 | Specific information on options will be given in the documentation |
---|
160 | file associated with each program. However, some options occur in many |
---|
161 | programs. Options are selected from the menu in each |
---|
162 | program, but the Old Style programs CLIQUE and FACTOR require information to be put into |
---|
163 | the beginning of the input file (particularly the Ancestors, Factors, Weights, |
---|
164 | and Mixtures options). The options information described here is for |
---|
165 | the other programs. See the documentation page for CLIQUE and |
---|
166 | FACTOR to find out how they get their options information. |
---|
167 | <P> |
---|
168 | <UL> |
---|
169 | <LI>The A (Ancestral states) option. This indicates that we are |
---|
170 | specifying the ancestral states for each character. In the menu the |
---|
171 | ancestors (A) option must be selected. |
---|
172 | An ancestral states input file is read, whose default name is |
---|
173 | <TT>ancestors</TT>. It contains |
---|
174 | a line or lines giving the ancestral states for each character. |
---|
175 | These may be 0, 1 or ?, the latter |
---|
176 | indicating that the ancestral state is unknown. |
---|
177 | <P> |
---|
178 | An example is: |
---|
179 | <P> |
---|
180 | 001??11 |
---|
181 | <P> |
---|
182 | The ancestor information can be continued to a new line and can have blanks |
---|
183 | between any of the characters in the same way that species character data |
---|
184 | can. |
---|
185 | In the program CLIQUE the ancestor is instead to be included as a |
---|
186 | regular species and |
---|
187 | no A option is available. |
---|
188 | <P> |
---|
189 | <LI>The F (Factors) option. This is used in programs MOVE, DOLMOVE, |
---|
190 | and FACTOR. It specifies which binary characters correspond |
---|
191 | to which multistate characters. To use the F option you |
---|
192 | choose the F option in the program menu. After that the program |
---|
193 | will read a factors file (default name <TT>factors</TT>), |
---|
194 | which consists of a line or lines containing a symbol |
---|
195 | for each binary character. The |
---|
196 | symbol can be anything, provided that it is the same for binary characters |
---|
197 | that correspond to the same multistate character, and changes between |
---|
198 | multistate characters. A good practice is to make it the lower-order digit |
---|
199 | of the number of the multistate character. |
---|
200 | <P> |
---|
201 | For example, if there were 20 binary characters that had been generated by |
---|
202 | nine multistate characters having respectively 4, 3, 3, 2, 1, 2, 2, 2, and 1 |
---|
203 | binary factors you would make the factors file be: |
---|
204 | <P> |
---|
205 | 11112223334456677889 |
---|
206 | <P> |
---|
207 | although it could equivalently be: |
---|
208 | <P> |
---|
209 | aaaabbbaaabbabbaabba |
---|
210 | <P> |
---|
211 | All that is important is that the symbol |
---|
212 | for each binary character change only when adjacent binary characters |
---|
213 | correspond to different multistate characters. The factors |
---|
214 | file contents |
---|
215 | can continue to a new line at any time except during the initial characters |
---|
216 | filling out the length of a species name. |
---|
217 | <P> |
---|
218 | In programs CLIQUE and FACTOR the factors information is given in |
---|
219 | the Old Style system of putting that information into the input |
---|
220 | data file. The method for doing so is described in the documentation |
---|
221 | files for these programs. We hope to change this in the next |
---|
222 | release to use an input factors file. |
---|
223 | <P> |
---|
224 | <LI>The J (Jumble) option. This causes the species to be entered into the |
---|
225 | tree in a random order rather than in their order in the input file. The |
---|
226 | program prompts you for a random number seed. This option is described in |
---|
227 | the main documentation file. |
---|
228 | <P> |
---|
229 | <LI>The M (Multiple data sets) option. This has also been described in the |
---|
230 | main documentation file. It is not to be confused with the M option specified |
---|
231 | in the input file, which is the Mixture of methods option (yes, I know |
---|
232 | this is confusing). |
---|
233 | <P> |
---|
234 | <LI>The O (outgroup) option. This has also already been discussed in the |
---|
235 | general documentation file. It specifies the number of the particular species |
---|
236 | which will be used as the outgroup in rerooting the final tree when it is |
---|
237 | printed out. It will not have any effect if the tree is already rooted or is |
---|
238 | a user-defined tree. This option is not available in DOLLOP, DOLMOVE, |
---|
239 | or DOLPENNY, which always infer a rooted tree, or CLIQUE, which |
---|
240 | requires you to work out the rerooting by hand. The menu selection will |
---|
241 | cause you to be prompted for the number of the outgroup. |
---|
242 | <P> |
---|
243 | <LI>The T (threshold) option. This sets a threshold such that if the |
---|
244 | number of steps counted in a character is higher than the threshold, it |
---|
245 | will be taken to be the threshold value rather than the actual number of |
---|
246 | steps. This option has already been described in the main documentation |
---|
247 | file. The user is prompted for the threshold value. My 1981 paper |
---|
248 | (Felsenstein, 1981b) |
---|
249 | explains the logic behind the Threshold option, which is an attractive |
---|
250 | alternative to successive weighting of characters. |
---|
251 | <P> |
---|
252 | <LI>The U (User tree) option. This has already been described in the |
---|
253 | main documentation file. For all of these programs user trees are to be |
---|
254 | specified as bifurcating trees, even in the cases where the tree that |
---|
255 | is inferred by the programs is to be regarded as unrooted. |
---|
256 | <P> |
---|
257 | <LI>The W (Weights) option. This allows us to specify weights on the |
---|
258 | characters, including the possibility of omitting characters from the |
---|
259 | analysis. It has already been described in the main documentation file. If |
---|
260 | the Weights option is used there must be a W on the first line of the |
---|
261 | input file. |
---|
262 | <P> |
---|
263 | <LI>The X (miXture) option. In the programs MIX, MOVE, and PENNY |
---|
264 | the user can specify for each character which parsimony method is |
---|
265 | in effect. This is done by selecting menu option X (not M) and having |
---|
266 | an input mixture file, whose default name is <TT>mixture</TT>. |
---|
267 | It contains a line or lines with one letter for |
---|
268 | each character. These letters are C or S if the character is to |
---|
269 | be reconstructed according to Camin-Sokal parsimony, W or ? if the |
---|
270 | character is to be reconstructed according to Wagner parsimony. So if |
---|
271 | there are 10 characters the line giving the mixture might look like this: |
---|
272 | <P> |
---|
273 | <PRE> |
---|
274 | WWWCC WWCWC |
---|
275 | </PRE> |
---|
276 | <P> |
---|
277 | Note that blanks in the sequence of characters (after the first ones that |
---|
278 | are as long as the species names) will be ignored, and the information |
---|
279 | can go on to a new line at any point. So this could equally well have been |
---|
280 | specified by |
---|
281 | <P> |
---|
282 | <PRE> |
---|
283 | WWW |
---|
284 | CCWWCWC |
---|
285 | </PRE> |
---|
286 | </UL> |
---|
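<P>
The factors file example in the F option above can be checked with a short Python sketch. The helper names <TT>factors_line</TT> and <TT>group_sizes</TT> are hypothetical and are not part of any program in the package. The first builds a factors line using the low-order digit of the number of each multistate character, and the second recovers how many binary factors each multistate character has from any valid factors line.

```python
from itertools import groupby


def factors_line(counts):
    """One symbol per binary character: the low-order digit of the
    number of the multistate character it belongs to."""
    return "".join(str(i % 10) * n for i, n in enumerate(counts, start=1))


def group_sizes(line):
    """Number of binary factors per multistate character, read from a
    factors line. Blanks are dropped; a new group starts whenever the
    symbol changes."""
    symbols = "".join(line.split())
    return [len(list(run)) for _, run in groupby(symbols)]


counts = [4, 3, 3, 2, 1, 2, 2, 2, 1]
line = factors_line(counts)
print(line)
print(group_sizes("aaaabbbaaabbabbaabba"))
```

Both forms of the factors line shown in the F option give the same grouping, which is all that matters to the programs.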
287 | <P><PRE> |
---|
288 | 30! 1 2 1 1 1 2 1 3 1 1 |
---|
289 | 40! 1 |
---|
290 | </PRE> |
---|
291 | <P> |
---|
292 | The numbers across the top and down the side indicate which character |
---|
293 | is being referred to. Thus character 23 is column "3" of row "20" |
---|
294 | and has 2 steps in this case. |
---|
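<P>
Reading the steps table amounts to splitting the character number into a row label (the tens) and a column (the units). A minimal Python sketch, using only the two rows visible above; the names <TT>parse_row</TT> and <TT>steps_for</TT> are hypothetical, and the sketch assumes that each row lists its entries starting at column 0, which may not hold for the first row of a real table.

```python
def parse_row(line):
    """Split a table row such as '30! 1 2 1' into its label and values."""
    label, values = line.split("!")
    return int(label), [int(v) for v in values.split()]


def steps_for(character, rows):
    """Look up the number of steps for a character number."""
    tens, column = divmod(character, 10)
    return rows[tens * 10][column]


rows = dict(parse_row(line) for line in [
    "30! 1 2 1 1 1 2 1 3 1 1",
    "40! 1",
])
print(steps_for(37, rows))
print(steps_for(40, rows))
```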
295 | <P> |
---|
296 | I cannot emphasize too strongly that just because the tree diagram |
---|
297 | which the program prints out contains a particular |
---|
298 | branch DOES NOT MEAN |
---|
299 | THAT WE HAVE EVIDENCE THAT THE BRANCH IS OF NONZERO LENGTH. |
---|
300 | In program PARS the branches have lengths estimated and there |
---|
301 | can be trifurcations, but in all other discrete characters programs |
---|
302 | the procedure which prints out the tree cannot cope with a trifurcation, nor |
---|
303 | can the internal data structures used in my programs. Therefore, even |
---|
304 | when we have no resolution and a multifurcation, successive bifurcations |
---|
305 | will be printed out, although some of the branches shown will in fact |
---|
306 | be of zero length. To find out which, you will have to work out |
---|
307 | character by character where the placements of the changes on the tree |
---|
308 | are, under all possible ways that the changes can be placed on that |
---|
309 | tree. |
---|
310 | <P> |
---|
311 | In PARS the trees are truly multifurcating, and the search is over both |
---|
312 | bifurcating and multifurcating trees. A branch is retained in a tree only |
---|
313 | if there is at least one character, under at least one possible most |
---|
314 | parsimonious reconstruction of the placement of changes, that has a change in |
---|
315 | that branch. This means that two branches can both be present which are, |
---|
316 | however, not both in existence at the same time (in that there is no |
---|
317 | most parsimonious reconstruction of changes in the characters that has changes |
---|
318 | in both these branches at the same time). |
---|
319 | <P> |
---|
320 | In PARS, MIX, PENNY, DOLLOP, and DOLPENNY the trees will be (if the user selects |
---|
321 | the option to see them) |
---|
322 | accompanied by tables showing the reconstructed states of the characters in |
---|
323 | the hypothetical ancestral nodes in the interior of the tree. This will enable |
---|
324 | you to reconstruct where the changes were in each of the characters. In some |
---|
325 | cases the state shown in an interior node will be "?", which means that either |
---|
326 | 0 or 1 would be possible at that point. In such cases you have to work out |
---|
327 | the ambiguity by hand. A unique assignment of locations of changes is often |
---|
328 | not possible in the case of the Wagner parsimony method. There may be multiple |
---|
329 | ways of assigning changes to segments of the tree with that method. Printing |
---|
330 | only one would be misleading, as it might imply that certain segments of the |
---|
331 | tree had no change, when another equally valid assignment would put changes |
---|
332 | there. It must be emphasized that all these multiple assignments have exactly |
---|
333 | equal numbers of total changes, so that none is preferred over any other. |
---|
334 | <P> |
---|
335 | I have followed the convention of having |
---|
336 | a "." printed out in the table of character states of the hypothetical |
---|
337 | ancestral nodes whenever a state is 0 or 1 and its immediate ancestor is the |
---|
338 | same. This has the effect of highlighting the places where changes might have |
---|
339 | occurred and making it easy for the user to reconstruct all the alternative |
---|
340 | patterns of the character states in the hypothetical ancestral nodes. |
---|
341 | In PARS you can, using the menu, turn off this dot-differencing |
---|
342 | convention and see all states at all hypothetical ancestral nodes of the tree. |
---|
343 | <P> |
---|
344 | On the line in that table corresponding to each branch of the tree will also |
---|
345 | be printed "yes", "no" or "maybe" as an answer to the question of whether this |
---|
346 | branch is of nonzero length. If there is no evidence that any character has |
---|
347 | changed in that branch, then "no" will be printed. If there is definite |
---|
348 | evidence that one has changed, then "yes" will be printed. If the matter is |
---|
349 | ambiguous, then "maybe" will be printed. You should keep in mind that all of |
---|
350 | these conclusions assume that we are only interested in the assignment of |
---|
351 | states that requires the least amount of change. In reality, the confidence |
---|
352 | limit on tree topology usually includes many different topologies, and |
---|
353 | presumably then the confidence limits on amounts of change in branches |
---|
354 | are also very broad. |
---|
355 | <P> |
---|
356 | In addition to the table showing numbers of events, a table may be printed out |
---|
357 | showing which ancestral state causes the fewest events for each |
---|
377 | character. This will not always be done, but only when the tree is rooted and |
---|
378 | some ancestral states are unknown. This can be used to infer states of |
---|
379 | ancestors. For example, if you use the O (Outgroup) and A (Ancestral states) |
---|
380 | options together, with at least some of the ancestral states being given as |
---|
381 | "?", then inferences will be made for those characters, as the outgroup makes |
---|
382 | the tree rooted if it was not already. |
---|
383 | <P> |
---|
384 | In programs MIX and PENNY, if you are using the Camin-Sokal parsimony option |
---|
385 | with ancestral state "?" and it turns out that the program cannot decide |
---|
386 | between ancestral states 0 and 1, it will fail to even attempt reconstruction |
---|
387 | of states of the hypothetical ancestors, printing them all out as "." for |
---|
388 | those characters. This is done for internal bookkeeping reasons -- to |
---|
389 | reconstruct their changes would require a fair amount of additional code and |
---|
390 | additional data structures. It is not too hard to reconstruct the internal |
---|
391 | states by hand, trying the two possible ancestral states one after the |
---|
392 | other. A similar comment applies to the use of ancestral state "?" in the |
---|
393 | Dollo or Polymorphism parsimony methods (programs DOLLOP and DOLPENNY) which |
---|
394 | also can result in a similar hesitancy to print the estimate of the states of |
---|
395 | the hypothetical ancestors. In all of these cases the program will print "?" |
---|
396 | rather than "no" when it describes whether there are any changes in a branch, |
---|
397 | since there might or might not be changes in those characters which are not |
---|
398 | reconstructed. |
---|
399 | <P> |
---|
400 | For further information see the documentation files for the |
---|
401 | individual programs. |
---|
402 | </BODY> |
---|
403 | </HTML> |
---|