source: branches/profile/GDE/PHYLIP/doc/dnapars.html

Last change on this file was 2176, checked in by westram, 20 years ago

* empty log message *

  • Property svn:eol-style set to native
  • Property svn:keywords set to Author Date Id Revision
File size: 13.6 KB
Line 
1<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 3.2 Final//EN">
2<HTML>
3<HEAD>
4<TITLE>main</TITLE>
5<META NAME="description" CONTENT="dnapars">
6<META NAME="keywords" CONTENT="dnapars">
7<META NAME="resource-type" CONTENT="document">
8<META NAME="distribution" CONTENT="global">
9<META HTTP-EQUIV="Content-Type" CONTENT="text/html; charset=iso-8859-1">
10</HEAD>
11<BODY BGCOLOR="#ccffff">
12<P>
13<DIV ALIGN=RIGHT>
14version 3.6
15</DIV>
16<P>
17<DIV ALIGN=CENTER>
18<H1>DNAPARS -- DNA Parsimony Program</H1>
19</DIV>
20<P>
21&#169; Copyright 1986-2002 by The University of
22Washington.  Written by Joseph Felsenstein.  Permission is granted to copy
23this document provided that no fee is charged for it and that this copyright
24notice is not removed.
25<P>
26This program carries out unrooted parsimony (analogous to Wagner
27trees) (Eck and Dayhoff, 1966; Kluge and Farris, 1969) on DNA
28sequences.  The method of Fitch (1971) is used to count the number of
29changes of base needed on a given tree.
30The assumptions of this method are analogous to those of MIX:
31<OL>
32<LI>Each site evolves independently.
33<LI>Different lineages evolve independently.
34<LI>The probability of a base substitution at a given site is
35small over the lengths of time involved in
36a branch of the phylogeny.
37<LI>The expected amounts of change in different branches of the phylogeny
38do not vary by so much that two changes in a high-rate branch
39are more probable than one change in a low-rate branch.
40<LI>The expected amounts of change do not vary enough among sites that two
41changes in one site are more probable than one change in another.
42</OL>
43<P>
44That these are the assumptions of parsimony methods has been documented
45in a series of papers of mine: (1973a, 1978b, 1979, 1981b, 1983b, 1988b).  For
46an opposing view arguing that the parsimony methods make no substantive
47assumptions such as these, see the papers by Farris (1983) and Sober (1983a,
481983b, 1988), but also read the exchange between Felsenstein and Sober (1986). 
49<P>
50Change from an occupied site to a deletion is counted as one
51change.  Reversion from a deletion to an occupied site is allowed and is also
52counted as one change.  Note that this in effect assumes that a deletion
53N bases long is N separate events.
54<P>
55Dnapars can handle both bifurcating and multifurcating trees.  In doing its
56search for most parsimonious trees, it adds species not only by creating new
57forks in the middle of existing branches, but it also tries putting them at
58the end of new branches which are added to existing forks.  Thus it searches
59among both bifurcating and multifurcating trees.  If a branch in a tree
60does not have any characters which might change in that branch in the most
61parsimonious tree, it does not save that tree.  Thus in any tree that
62results, a branch exists only if some character has a most parsimonious
63reconstruction that would involve change in that branch.
64<P>
65It also saves a number of trees tied for best (you can alter the number
66it saves using the V option in the menu).  When rearranging trees, it
67tries rearrangements of all of the saved trees.  This makes the algorithm
68slower than earlier versions of Dnapars.
69<P>
70The input data is standard.  The first line of the input file contains the
71number of species and the number of sites.
72<P>
73Next come the species data.  Each
74sequence starts on a new line, has a ten-character species name
75that must be blank-filled to be of that length, followed immediately
76by the species data in the one-letter code.  The sequences must either
77be in the "interleaved" or "sequential" formats
78described in the Molecular Sequence Programs document.  The I option
79selects between them.  The sequences can have internal
80blanks in the sequence but there must be no extra blanks at the end of the
81terminated line.  Note that a blank is not a valid symbol for a deletion.
82<P>
83The options are selected using an interactive menu.  The menu looks like this:
84<P>
85<TABLE><TR><TD BGCOLOR=white>
86<PRE>
87
88DNA parsimony algorithm, version 3.6a3
89
90Setting for this run:
91  U                 Search for best tree?  Yes
92  S                        Search option?  More thorough search
93  V              Number of trees to save?  100
94  J   Randomize input order of sequences?  No. Use input order
95  O                        Outgroup root?  No, use as outgroup species  1
96  T              Use Threshold parsimony?  No, use ordinary parsimony
97  N           Use Transversion parsimony?  No, count all steps
98  W                       Sites weighted?  No
99  M           Analyze multiple data sets?  No
100  I          Input sequences interleaved?  Yes
101  0   Terminal type (IBM PC, ANSI, none)?  (none)
102  1    Print out the data at start of run  No
103  2  Print indications of progress of run  Yes
104  3                        Print out tree  Yes
105  4          Print out steps in each site  No
106  5  Print sequences at all nodes of tree  No
107  6       Write out trees onto tree file?  Yes
108
109  Y to accept these or type the letter for one to change
110</PRE>
111</TD></TR></TABLE> 
112<P>
113The user either types "Y" (followed, of course, by a carriage-return)
114if the settings shown are to be accepted, or the letter or digit corresponding
115to an option that is to be changed.
116<P>
117The N option allows you to choose transversion parsimony, which counts only
118transversions (changes between one of the purines A or G and one of the
119pyrimidines C or T).  This setting is turned off by default.
120<P>
121The Weights (W) option
122takes the weights from a file whose default name is "weights".  The weights
123follow the format described in the main documentation file, with integer
124weights from 0 to 35 allowed by using the characters 0, 1, 2, ..., 9 and
125A, B, ... Z.
126<P>
127The User tree (option U) is read from a file whose default name is <TT>intree</TT>.
128The trees can be multifurcating. They must be preceded in the file by a
129line giving the number of trees in the file.
130<P>
131The options J, O, T, M, and 0 are the usual ones.  They are described in the
132main documentation file of this package.  Option I is the same as in
133other molecular sequence programs and is described in the documentation file
134for the sequence programs.
135<P>
136The M (multiple data sets option) will ask you whether you want to
137use multiple sets of weights (from the weights file) or multiple data sets.
138The ability to use a single data set with multiple weights means that
139much less disk space will be used for this input data.  The bootstrapping
140and jackknifing tool Seqboot has the ability to create a weights file with
141multiple weights.
142<P>
143The O (outgroup) option will have no effect if the U (user-defined tree)
144option is in effect.
145The T (threshold) option allows a continuum of methods
146between parsimony and compatibility.  Thresholds less than or equal to 1.0 do
147not have any meaning and should
148not be used: they will result in a tree dependent only on the input
149order of species and not at all on the data!
150<P>
151Output is standard: if option 1 is toggled on, the data is printed out,
152with the convention that "." means "the same as in the first species".
153Then comes a list of equally parsimonious trees.
154Each tree has branch lengths.  These are computed using an algorithm
155published by Hochbaum and Pathria (1997) which I first heard of from
156Wayne Maddison who invented it independently of them.  This algorithm
157averages the number of reconstructed changes of state over all sites a
158over all possible most parsimonious placements of the changes of state
159among branches.  Note that it does not correct in any way for multiple
160changes that overlay each other.
161<P>
162If option 2 is
163toggled on a table of the
164number of changes of state required in each character is also
165printed.  If option 5 is toggled
166on, a table is printed
167out after each tree, showing for each  branch whether there are known to be
168changes in the branch, and what the states are inferred to have been at the
169top end of the branch.  This is a reconstruction of the ancestral sequences
170in the tree.  If you choose option 5, a menu item D appears which gives you
171the opportunity to turn off dot-differencing so that complete ancestral
172sequences are shown.  If the inferred state is a "?" or one of the IUB
173ambiguity symbols, there will be multiple
174equally-parsimonious assignments of states; the user must work these out for
175themselves by hand.  A "?" in the reconstructed states means that in
176addition to one or more bases, a deletion may or may not be present.  If
177option 6 is left in its default state the trees
178found will be written to a tree file, so that they are available to be used
179in other programs.
180<P>
181If the U (User Tree) option is used and more than one tree is supplied, the
182program also performs a statistical test of each of these trees against the
183best tree.  This test, which is a version of the test proposed by
184Alan Templeton (1983) and evaluated in a test case by me (1985a).  It is
185closely parallel to a test using log likelihood differences
186due to Kishino and Hasegawa (1989), and
187uses the mean and variance of
188step differences between trees, taken across sites.  If the mean
189is more than 1.96 standard deviations different then the trees are declared
190significantly different.  The program
191prints out a table of the steps for each tree, the differences of
192each from the best one, the variance of that quantity as determined by
193the step differences at individual sites, and a conclusion as to
194whether that tree is or is not significantly worse than the best one.
195If the U (User Tree) option is used and more than one tree is supplied,
196and the program is not told to assume autocorrelation between the
197rates at different sites, the
198program also performs a statistical test of each of these trees against the
199one with highest likelihood.   If there are two user trees, this
200is a version of the test proposed by
201Alan Templeton (1983) and evaluated in a test case by me (1985a).  It is
202closely parallel to a test using log likelihood differences
203due to Kishino and Hasegawa (1989)
204It uses the mean and variance of the differences in the number
205of steps between trees, taken across sites.  If the two
206trees' means are more than 1.96 standard deviations different,
207then the trees are
208declared significantly different.
209<P>
210If there are more than two trees, the test done is an extension of
211the KHT test, due to Shimodaira and Hasegawa (1999).  They pointed out
212that a correction for the number of trees was necessary, and they
213introduced a resampling method to make this correction.  In the version
214used here the variances and covariances of the sums of steps across
215sites are computed for all pairs of trees.  To test whether the
216difference between each tree and the best one is larger than could have
217been expected if they all had the same expected number of steps,
218numbers of steps for all trees are sampled with these covariances and equal
219means (Shimodaira and Hasegawa's "least favorable hypothesis"),
220and a P value is computed from the fraction of times the difference between
221the tree's value and the lowest number of steps exceeds that actually
222observed.  Note that this sampling needs random numbers, and so the
223program will prompt the user for a random number seed if one has not
224already been supplied.  With the two-tree KHT test no random numbers
225are used.
226<P>
227In either the KHT or the SH test the program
228prints out a table of the number of steps for each tree, the differences of
229each from the lowest one, the variance of that quantity as determined by
230the differences of the numbers of steps at individual sites,
231and a conclusion as to
232whether that tree is or is not significantly worse than the best one.
233<P>
234Option 6 in the menu controls whether the tree estimated by the program
235is written onto a tree file.  The default name of this output tree file
236is "outtree".  If the U option is in effect, all the user-defined
237trees are written to the output tree file.
238<P>
239The program is a straightforward relative of MIX
240and runs reasonably quickly, especially with many sites and few species.
241<P>
242<HR>
243<P>
244<H3>TEST DATA SET</H3>
245<P>
246<TABLE><TR><TD BGCOLOR=white>
247<PRE> 
248   5   13
249Alpha     AACGUGGCCAAAU
250Beta      AAGGUCGCCAAAC
251Gamma     CAUUUCGUCACAA
252Delta     GGUAUUUCGGCCU
253Epsilon   GGGAUCUCGGCCC
254</PRE>
255</TD></TR></TABLE> 
256<P>
257<HR>
258<P>
259<H3>CONTENTS OF OUTPUT FILE (if all numerical options are on)</H3>
260<P>
261<TABLE><TR><TD BGCOLOR=white>
262<PRE>
263
264DNA parsimony algorithm, version 3.6a3
265
266 5 species,  13  sites
267
268
269Name            Sequences
270----            ---------
271
272Alpha        AACGUGGCCA AAU
273Beta         ..G..C.... ..C
274Gamma        C.UU.C.U.. C.A
275Delta        GGUA.UU.GG CC.
276Epsilon      GGGA.CU.GG CCC
277
278
279
280One most parsimonious tree found:
281
282
283                                            +-----Epsilon   
284               +----------------------------3 
285  +------------2                            +-------Delta     
286  |            | 
287  |            +----------------Gamma     
288  | 
289  1----Beta     
290  | 
291  +---------Alpha     
292
293
294requires a total of     19.000
295
296  between      and       length
297  -------      ---       ------
298     1           2       0.217949
299     2           3       0.487179
300     3      Epsilon      0.096154
301     3      Delta        0.134615
302     2      Gamma        0.275641
303     1      Beta         0.076923
304     1      Alpha        0.173077
305
306steps in each site:
307         0   1   2   3   4   5   6   7   8   9
308     *-----------------------------------------
309    0|       2   1   3   2   0   2   1   1   1
310   10|   1   1   1   3                       
311
312From    To     Any Steps?    State at upper node
313                             ( . means same as in the node below it on tree)
314
315          1                AABGTCGCCA AAY
316   1      2         yes    V.KD...... C..
317   2      3         yes    GG.A..T.GG .C.
318   3   Epsilon     maybe   ..G....... ..C
319   3   Delta        yes    ..T..T.... ..T
320   2   Gamma        yes    C.TT...T.. ..A
321   1   Beta        maybe   ..G....... ..C
322   1   Alpha        yes    ..C..G.... ..T
323
324
325</PRE>
326<P>
327</TD></TR></TABLE> 
328</BODY>
329</HTML>
330
Note: See TracBrowser for help on using the repository browser.