<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 3.2 Final//EN">
<HTML>
<HEAD>
<TITLE>dnacomp</TITLE>
<META NAME="description" CONTENT="dnacomp">
<META NAME="keywords" CONTENT="dnacomp">
<META NAME="resource-type" CONTENT="document">
<META NAME="distribution" CONTENT="global">
<META HTTP-EQUIV="Content-Type" CONTENT="text/html; charset=iso-8859-1">
</HEAD>
<BODY BGCOLOR="#ccffff">
<DIV ALIGN=RIGHT>
version 3.6
</DIV>
<P>
<DIV ALIGN=CENTER>
<H1>DNACOMP -- DNA Compatibility Program</H1>
</DIV>
<P>
&#169; Copyright 1986-2002 by the University of
Washington.  Written by Joseph Felsenstein.  Permission is granted to copy
this document provided that no fee is charged for it and that this copyright
notice is not removed.
<P>
This program implements the compatibility method for DNA sequence
data.  For a four-state character without a character-state tree, as in
DNA sequences, the usual clique theorems cannot be applied.  The
approach taken in this program is to directly evaluate each tree
topology by counting how many substitutions are needed in each site,
comparing this to the minimum number that might be needed (one less than
the number of bases observed at that site), and then evaluating the
number of sites which achieve the minimum number.  This is the
evaluation of the tree (the number of compatible sites), and the
topology is chosen so as to maximize that number.
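<P>
The criterion can be illustrated with a minimal Python sketch.  It is not the
program's own code: the names SEQS, TREE, fitch and site_report are
hypothetical, the tree is written as nested pairs rooted at an arbitrary
branch, and gaps and ambiguity symbols are ignored.  It counts the steps at
each site by Fitch's algorithm and calls a site compatible when that count is
one less than the number of bases observed there, using the five sequences of
the test data set given at the end of this document.

```python
# Illustrative sketch only; names and tree encoding are assumptions, not DNACOMP code.
SEQS = {
    "Alpha":   "AACGUGGCCAAAU",
    "Beta":    "AAGGUCGCCAAAC",
    "Gamma":   "CAUUUCGUCACAA",
    "Delta":   "GGUAUUUCGGCCU",
    "Epsilon": "GGGAUCUCGGCCC",
}

# Bifurcating tree as nested pairs of species names (rooted on the branch to Alpha).
TREE = (((("Epsilon", "Delta"), "Gamma"), "Beta"), "Alpha")


def fitch(node, site):
    """Return (possible states, steps) for one site below this node."""
    if isinstance(node, str):
        return {SEQS[node][site]}, 0
    left_set, left_steps = fitch(node[0], site)
    right_set, right_steps = fitch(node[1], site)
    common = left_set.intersection(right_set)
    if common:
        return common, left_steps + right_steps
    # No shared state: one extra substitution is needed here.
    return left_set.union(right_set), left_steps + right_steps + 1


def site_report(tree, nsites):
    """Steps per site, and whether each site reaches its minimum."""
    steps, compatible = [], []
    for site in range(nsites):
        _, needed = fitch(tree, site)
        observed = {seq[site] for seq in SEQS.values()}
        steps.append(needed)
        compatible.append(needed == len(observed) - 1)
    return steps, compatible


steps, compatible = site_report(TREE, 13)
print(steps)            # steps needed at each of the 13 sites
print(sum(compatible))  # number of compatible sites
```

For this tree the sketch gives 11 compatible sites, with sites 3 and 13
incompatible, which agrees with the sample output file shown below.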
<P>
Compatibility methods originated with Le Quesne's (1969) suggestion that
one ought to look for trees supported by the largest number of perfectly
fitting (compatible) characters.  Fitch (1975) showed by counterexample that
one could not use the pairwise compatibility methods used in CLIQUE to
discover the largest clique of jointly compatible characters.
<P>
The assumptions of this method are similar to those of CLIQUE.  In
a paper in the Biological Journal of the Linnean Society (1981b)
I discuss this matter extensively.  In effect, the assumptions are that:
<OL>
<LI>Each character evolves independently.
<LI>Different lineages evolve independently.
<LI>The ancestral base at each site is unknown.
<LI>The rates of change in most sites over the time spans involved
in the divergence of the group are very small.
<LI>A few of the sites have very high rates of change.
<LI>We do not know in advance which are the high and which the low
rate sites.
</OL>
<P>
That these are the assumptions of compatibility methods has been documented
in a series of my papers (1973a, 1978b, 1979, 1981b,
1983b, 1988b).  For an opposing
view, which holds that arguments such as mine are invalid
and that parsimony (and perhaps compatibility) methods make no substantive
assumptions such as these, see the papers by Farris (1983) and Sober (1983a,
1983b, 1988), but also read the exchange between Felsenstein and Sober (1986).
<P>
There is, however, some reason to believe that the present criterion is not the
proper way to correct for the presence of some sites with high rates of
change in nucleotide sequence data.  It can be argued that sites showing more
than two nucleotide states, even if those are compatible with the other sites,
are also candidates for sites with high rates of change.  It might then be more
proper to use DNAPARS with the Threshold option with a threshold value of 2.
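<P>
As a rough illustration of what a threshold of 2 would do, the following
Python sketch assumes that threshold parsimony counts a site needing more
than the threshold number of steps as contributing only the threshold value;
the DNAPARS documentation is the authority on the exact rule.  The function
name thresholded_steps is hypothetical, and the step counts are those shown
in the sample output file at the end of this document.

```python
# Illustrative sketch; assumes a site contributes min(steps, threshold).
def thresholded_steps(steps_per_site, threshold):
    """Total tree length with each site's steps capped at the threshold."""
    return sum(min(steps, threshold) for steps in steps_per_site)


# Steps per site from the sample output file of this document.
sample_steps = [2, 1, 3, 2, 0, 2, 1, 1, 1, 1, 1, 1, 3]

print(sum(sample_steps))                   # plain parsimony length
print(thresholded_steps(sample_steps, 2))  # length with threshold 2
```

Under this assumption a site with three bases that needs two steps counts the
same as a site that needs three or more, so neither kind is treated as a
perfectly fitting site.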
<P>
Change from an occupied site to a gap is counted as one
change.  Reversion from a gap to an occupied site is allowed and is also
counted as one change.  Note that this in effect assumes that a gap
N bases long is N separate events.  This may be an overcorrection.  When
we have nonoverlapping gaps, we could instead code a gap as a
single event by changing all but the first "-" in the gap into "?" characters.
In this way only the first base of the gap causes the program to infer a
change.
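<P>
The recoding just described can be done before the data are given to the
program.  A minimal Python sketch, with the hypothetical function name
recode_gaps, follows; it treats every run of "-" in a single sequence as one
gap and so is only appropriate when the gaps do not overlap.

```python
import re


def recode_gaps(sequence):
    """Keep the first '-' of each gap and turn the rest of the run into '?'."""
    return re.sub(r"-+", lambda run: "-" + "?" * (len(run.group()) - 1), sequence)


print(recode_gaps("AC---GT-A"))
```

A gap of length one is left unchanged, and a sequence without gaps is
returned as it was.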
<P>
The input data is standard.  The first line of the input file contains the
number of species and the number of sites.
<P>
Next come the species data.  Each
sequence starts on a new line, has a ten-character species name
that must be blank-filled to be of that length, followed immediately
by the species data in the one-letter code.  The sequences must be in
either the "interleaved" or the "sequential" format
described in the Molecular Sequence Programs document.  The I option
selects between them.  The sequences can have internal
blanks but there must be no extra blanks at the end of a
line.  Note that a blank is not a valid symbol for a deletion.
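<P>
To make the layout concrete, here is a minimal Python sketch of a reader for
the sequential format only.  It is an illustration under simplifying
assumptions rather than the program's input routine: the name
read_sequential is hypothetical, options and weights lines are not handled,
and a sequence is taken to continue on following lines until the stated
number of sites has been read.

```python
def read_sequential(text, name_length=10):
    """Parse a simple sequential-format data set into {name: sequence}."""
    lines = [line for line in text.splitlines() if line.strip()]
    nspecies, nsites = (int(token) for token in lines[0].split())
    records = {}
    position = 1
    for _ in range(nspecies):
        # The name occupies exactly name_length characters, blank-filled.
        name = lines[position][:name_length].strip()
        sequence = lines[position][name_length:].replace(" ", "")
        position += 1
        while len(sequence) < nsites:  # sequence continues on later lines
            sequence += lines[position].replace(" ", "")
            position += 1
        if len(sequence) != nsites:
            raise ValueError("wrong number of sites for " + name)
        records[name] = sequence
    return records


SAMPLE = """    5   13
Alpha     AACGUGGCCAAAU
Beta      AAGGUCGCCAAAC
Gamma     CAUUUCGUCACAA
Delta     GGUAUUUCGGCCU
Epsilon   GGGAUCUCGGCCC
"""

records = read_sequential(SAMPLE)
print(records["Alpha"])
```

Internal blanks are simply dropped, which is why a blank cannot stand for a
deletion.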
<P>
The options are selected using an interactive menu.  The menu looks like this:
<P>
<TABLE><TR><TD BGCOLOR=white>
<PRE>

DNA compatibility algorithm, version 3.6a3

Settings for this run:
  U                 Search for best tree?  Yes
  J   Randomize input order of sequences?  No. Use input order
  O                        Outgroup root?  No, use as outgroup species  1
  W                       Sites weighted?  No
  M           Analyze multiple data sets?  No
  I          Input sequences interleaved?  Yes
  0   Terminal type (IBM PC, ANSI, none)?  (none)
  1    Print out the data at start of run  No
  2  Print indications of progress of run  Yes
  3                        Print out tree  Yes
  4  Print steps & compatibility at sites  No
  5  Print sequences at all nodes of tree  No
  6       Write out trees onto tree file?  Yes

Are these settings correct? (type Y or the letter for one to change)

</PRE>
</TD></TR></TABLE>
<P>
The user either types "Y" (followed, of course, by a carriage-return)
if the settings shown are to be accepted, or the letter or digit corresponding
to an option that is to be changed.
<P>
The options U, J, O, W, M, and 0 are the usual ones.  They are described in the
main documentation file of this package.  Option I is the same as in
other molecular sequence programs and is described in the documentation file
for the sequence programs.
<P>
The O (outgroup) option has no effect if the U (user-defined tree) option
is in effect.  The user-defined trees (option U) fed in must be strictly
bifurcating, with a two-way split at their base.
<P>
The interpretation of weights (option W) in the case of a compatibility method
is that they count how many times the character (in this case the site) is
counted in the analysis.  Thus a character can be dropped from the
analysis by assigning it zero weight.  On the other hand, giving it a
weight of 5 means that in any clique it is in, it is counted as 5
characters when the size of the clique is evaluated.  Generally, weights
other than 0 or 1 do not have much meaning when dealing with DNA sequences.
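<P>
In other words the score of a tree is the sum of the weights of its
compatible sites.  A small Python sketch, with the hypothetical function name
weighted_compatibility, shows the effect of the weights on the compatibility
pattern printed in the sample output file at the end of this document.

```python
def weighted_compatibility(pattern, weights):
    """Sum the weights of the sites marked 'Y' (compatible) in the pattern."""
    return sum(w for flag, w in zip(pattern, weights) if flag == "Y")


# Compatibility of the 13 test-data sites with the tree in the sample output.
PATTERN = "YYNYYYYYYYYYN"

equal = [1] * 13
drop_first = [0] + [1] * 12      # site 1 removed from the analysis
heavy_second = [1, 5] + [1] * 11  # site 2 counted five times

print(weighted_compatibility(PATTERN, equal))
print(weighted_compatibility(PATTERN, drop_first))
print(weighted_compatibility(PATTERN, heavy_second))
```

Raising the weight of an incompatible site, such as site 3 here, leaves the
score of this tree unchanged.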
<P>
Output is standard: if option 1 is toggled on, the data is printed out,
with the convention that "." means "the same as in the first species".
Then comes a list of equally parsimonious trees, and (if option 4 is
toggled on) tables of the
number of changes of state required at each site and of whether each site
is compatible with the tree.  If option 5 is toggled
on, a table is printed
out after each tree, showing for each branch whether there are known to be
changes in the branch, and what the states are inferred to have been at the
top end of the branch.  If the inferred state is a "?" or one of the IUB
ambiguity symbols, there will be multiple
equally-parsimonious assignments of states; the user must work these out
by hand.  A "?" in the reconstructed states means that in
addition to one or more bases, a gap may or may not be present.  If
option 6 is left in its default state the trees
found will be written to a tree file, so that they are available to be used
in other programs.
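<P>
When working through these assignments it helps to expand each ambiguity
symbol into the bases it stands for.  The Python sketch below uses the
standard IUB code; the names IUB and ambiguous_sites are hypothetical, and
the "?" symbol is deliberately not handled.  It only lists the candidate
bases at each ambiguous site and does not decide which combinations across
sites and nodes are equally parsimonious.

```python
# Standard IUB nucleotide ambiguity code (each symbol and the bases it covers).
IUB = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "M": "AC", "R": "AG", "W": "AT", "S": "CG", "Y": "CT", "K": "GT",
    "V": "ACG", "H": "ACT", "D": "AGT", "B": "CGT", "N": "ACGT",
}


def ambiguous_sites(states):
    """Return (site number, candidate bases) for each ambiguous symbol."""
    symbols = states.replace(" ", "").upper()
    return [
        (number, sorted(IUB[symbol]))
        for number, symbol in enumerate(symbols, start=1)
        if len(IUB[symbol]) > 1
    ]


# States printed for the upper node of branch 2-3 in the sample output.
for number, bases in ambiguous_sites("VAKDTCGCCA CAY"):
    print(number, bases)
```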
<P>
If the U (User Tree) option is used and more than one tree is supplied,
the program also performs a statistical test of each of these trees against the
one with the highest compatibility.  If there are two user trees, the test
done is one which is due to Kishino and Hasegawa (1989), a version
of a test originally introduced by Templeton (1983); it is referred to below
as the KHT test.  In this
implementation it uses the mean and variance of weighted
compatibility differences between trees, taken across sites.  If the two
trees' compatibilities are more than 1.96 standard deviations different then
the trees are declared significantly different.
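<P>
The calculation can be sketched in Python as follows.  This is a hedged
illustration of a paired-sites test of this general kind, not a transcript of
the program: the function name paired_sites_test and the example data are
hypothetical, and the variance of the total difference is assumed to be
estimated as n/(n-1) times the sum of squared deviations of the per-site
differences from their mean.

```python
import math


def paired_sites_test(scores_a, scores_b, weights=None):
    """Compare two trees from per-site compatibility scores (1 or 0).

    Returns the total weighted difference, its estimated standard
    deviation, and whether the difference exceeds 1.96 standard deviations.
    """
    n = len(scores_a)
    if weights is None:
        weights = [1] * n
    diffs = [w * (a - b) for w, a, b in zip(weights, scores_a, scores_b)]
    total = sum(diffs)
    mean = total / n
    # Assumed estimator of the variance of the summed difference.
    variance = n / (n - 1) * sum((d - mean) ** 2 for d in diffs)
    sd = math.sqrt(variance)
    return total, sd, abs(total) > 1.96 * sd


# Hypothetical per-site compatibilities of ten sites on two user trees.
tree_a = [1, 1, 1, 1, 0, 1, 1, 1, 1, 1]
tree_b = [1, 0, 0, 1, 0, 0, 1, 0, 1, 0]

total, sd, significant = paired_sites_test(tree_a, tree_b)
print(total, round(sd, 3), significant)
```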
<P>
If there are more than two trees, the test done is an extension of
the KHT test, due to Shimodaira and Hasegawa (1999), referred to below as
the SH test.  They pointed out
that a correction for the number of trees was necessary, and they
introduced a resampling method to make this correction.  In the version
used here the variances and covariances of the sum of weighted
compatibilities of sites are computed for all pairs of trees.  To
test whether the
difference between each tree and the best one is larger than could have
been expected if they all had the same expected compatibility,
compatibilities for all trees are sampled with these covariances and equal
means (Shimodaira and Hasegawa's "least favorable hypothesis"),
and a P value is computed from the fraction of times the difference between
the tree's value and the highest compatibility exceeds that actually
observed.  Note that this sampling needs random numbers, and so the
program will prompt the user for a random number seed if one has not
already been supplied.  With the two-tree KHT test no random numbers
are used.
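<P>
The resampling step can be sketched in Python as below.  This is an
assumption-laden illustration, not the program's algorithm: the function name
sh_test, the number of replicates and the example scores are hypothetical.
Samples with the required covariances and equal means are drawn here by
multiplying each site's centered scores by an independent standard normal
deviate and summing over sites, which is one of several ways to obtain them.

```python
import random


def sh_test(site_scores, replicates=1000, seed=1):
    """P values for each tree; site_scores[t][i] is site i's score on tree t."""
    rng = random.Random(seed)
    ntrees, nsites = len(site_scores), len(site_scores[0])
    totals = [sum(row) for row in site_scores]
    observed = [max(totals) - total for total in totals]
    # Center each tree's scores so that all trees have the same (zero) mean.
    centered = [
        [score - totals[t] / nsites for score in site_scores[t]]
        for t in range(ntrees)
    ]
    exceed = [0] * ntrees
    for _ in range(replicates):
        z = [rng.gauss(0.0, 1.0) for _ in range(nsites)]
        # Sums over sites share the sample covariances of the trees' scores.
        sample = [sum(zi * ci for zi, ci in zip(z, row)) for row in centered]
        top = max(sample)
        for t in range(ntrees):
            if top - sample[t] >= observed[t]:
                exceed[t] += 1
    return [count / replicates for count in exceed]


# Hypothetical per-site compatibilities of 13 sites on three user trees.
scores = [
    [1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0],
    [1, 0, 0, 1, 1, 1, 0, 1, 1, 0, 1, 1, 0],
    [0, 0, 0, 1, 1, 0, 0, 1, 0, 0, 1, 0, 0],
]

p_values = sh_test(scores)
print(p_values)
```

The best tree always receives a P value of 1, since its observed difference
from the highest compatibility is zero.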
<P>
In either the KHT or the SH test the program
prints out a table of the compatibility of each tree, the differences of
each from the highest one, the variance of that quantity as determined by
the compatibility differences at individual sites, and a conclusion as to
whether that tree is or is not significantly worse than the best one.
<P>
The algorithm is a straightforward modification of DNAPARS, but with
some extra machinery added to calculate, as each species is added, how
many base changes are the minimum which could be required at that site.  The
program runs fairly quickly.
<P>
The constants
which can be changed at the beginning of the program are:
the name length "nmlngth",
"maxtrees", the maximum number of trees which the program will store for output,
and "maxuser",
the maximum number of user trees that can be used in the paired sites test.
<P>
<HR><H3>TEST DATA SET</H3>
<P>
<TABLE><TR><TD BGCOLOR=white>
<PRE>
    5   13
Alpha     AACGUGGCCAAAU
Beta      AAGGUCGCCAAAC
Gamma     CAUUUCGUCACAA
Delta     GGUAUUUCGGCCU
Epsilon   GGGAUCUCGGCCC
</PRE>
</TD></TR></TABLE>
<P>
<H3>CONTENTS OF OUTPUT FILE (if all numerical options are turned on)</H3>
<P>
<TABLE><TR><TD BGCOLOR=white>
<PRE>

DNA compatibility algorithm, version 3.6a3

 5 species,  13  sites

Name            Sequences
----            ---------

Alpha        AACGUGGCCA AAU
Beta         AAGGUCGCCA AAC
Gamma        CAUUUCGUCA CAA
Delta        GGUAUUUCGG CCU
Epsilon      GGGAUCUCGG CCC



One most parsimonious tree found:




           +--Epsilon
        +--4
     +--3  +--Delta
     !  !
  +--2  +-----Gamma
  !  !
  1  +--------Beta
  !
  +-----------Alpha

  remember: this is an unrooted tree!


total number of compatible sites is       11.0

steps in each site:
         0   1   2   3   4   5   6   7   8   9
     *-----------------------------------------
    0|       2   1   3   2   0   2   1   1   1
   10|   1   1   1   3

 compatibility (Y or N) of each site with this tree:

      0123456789
     *----------
   0 ! YYNYYYYYY
  10 !YYYN

From    To     Any Steps?    State at upper node

          1                AABGTSGCCA AAY
   1      2        maybe   AABGTCGCCA AAY
   2      3         yes    VAKDTCGCCA CAY
   3      4         yes    GGKATCTCGG CCY
   4   Epsilon     maybe   GGGATCTCGG CCC
   4   Delta        yes    GGTATTTCGG CCT
   3   Gamma        yes    CATTTCGTCA CAA
   2   Beta        maybe   AAGGTCGCCA AAC
   1   Alpha       maybe   AACGTGGCCA AAT


</PRE>
</TD></TR></TABLE>
</BODY>
</HTML>