1 | <!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 3.2 Final//EN"> |
---|
2 | <HTML> |
---|
3 | <HEAD> |
---|
4 | <TITLE>dnacomp</TITLE> |
---|
5 | <META NAME="description" CONTENT="dnacomp"> |
---|
6 | <META NAME="keywords" CONTENT="dnacomp"> |
---|
7 | <META NAME="resource-type" CONTENT="document"> |
---|
8 | <META NAME="distribution" CONTENT="global"> |
---|
9 | <META HTTP-EQUIV="Content-Type" CONTENT="text/html; charset=iso-8859-1"> |
---|
10 | </HEAD> |
---|
11 | <BODY BGCOLOR="#ccffff"> |
---|
12 | <DIV ALIGN=RIGHT> |
---|
13 | version 3.6 |
---|
14 | </DIV> |
---|
15 | <P> |
---|
16 | <DIV ALIGN=CENTER> |
---|
17 | <H1>DNACOMP -- DNA Compatibility Program</H1> |
---|
18 | </DIV> |
---|
19 | <P> |
---|
20 | © Copyright 1986-2002 by the University of |
---|
21 | Washington. Written by Joseph Felsenstein. Permission is granted to copy |
---|
22 | this document provided that no fee is charged for it and that this copyright |
---|
23 | notice is not removed. |
---|
24 | <P> |
---|
25 | </EM> |
---|
26 | This program implements the compatibility method for DNA sequence |
---|
27 | data. For a four-state character without a character-state tree, as in |
---|
28 | DNA sequences, the usual clique theorems cannot be applied. The |
---|
29 | approach taken in this program is to directly evaluate each tree |
---|
30 | topology by counting how many substitutions are needed in each site, |
---|
31 | comparing this to the minimum number that might be needed (one less than |
---|
32 | the number of bases observed at that site), and then evaluating the |
---|
33 | number of sites which achieve the minimum number. This is the |
---|
34 | evaluation of the tree (the number of compatible sites), and the |
---|
35 | topology is chosen so as to maximize that number. |
---|
36 | <P> |
---|
37 | Compatibility methods originated with Le Quesne's (1969) suggestion that |
---|
38 | one ought to look for trees supported by the largest number of perfectly |
---|
39 | fitting (compatible) characters. Fitch (1975) showed by counterexample that |
---|
40 | one could not use the pairwise compatibility methods used in CLIQUE to |
---|
41 | discover the largest clique of jointly compatible characters. |
---|
42 | <P> |
---|
43 | The assumptions of this method are similar to those of CLIQUE. In |
---|
44 | a paper in the Biological Journal of the Linnean Society (1981b) |
---|
45 | I discuss this matter extensively. In effect, the assumptions are that: |
---|
46 | <OL> |
---|
47 | <LI>Each character evolves independently. |
---|
48 | <LI>Different lineages evolve independently. |
---|
49 | <LI>The ancestral base at each site is unknown. |
---|
50 | <LI>The rates of change in most sites over the time spans involved |
---|
51 | in the the divergence of the group are very small. |
---|
52 | <LI>A few of the sites have very high rates of change. |
---|
53 | <LI>We do not know in advance which are the high and which the low |
---|
54 | rate sites. |
---|
55 | </OL> |
---|
56 | <P> |
---|
57 | That these are the assumptions of compatibility methods has been documented |
---|
58 | in a series of papers of mine: (1973a, 1978b, 1979, 1981b, |
---|
59 | 1983b, 1988b). For an opposing |
---|
60 | view arguing that arguments such as mine are invalid |
---|
61 | and that parsimony (and perhaps compatibility) methods make no substantive |
---|
62 | assumptions such as these, see the papers by Farris (1983) and Sober (1983a, |
---|
63 | 1983b, 1988), but also read the exchange between Felsenstein and Sober (1986). |
---|
64 | <P> |
---|
65 | There is, however, some reason to believe that the present criterion is not the |
---|
66 | proper way to correct for the presence of some sites with high rates of |
---|
67 | change in nucleotide sequence data. It can be argued that sites showing more |
---|
68 | than two nucleotide states, even if those are compatible with the other sites, |
---|
69 | are also candidates for sites with high rates of change. It might then be more |
---|
70 | proper to use DNAPARS with the Threshold option with a threshold value of 2. |
---|
71 | <P> |
---|
72 | Change from an occupied site to a gap is counted as one |
---|
73 | change. Reversion from a gap to an occupied site is allowed and is also |
---|
74 | counted as one change. Note that this in effect assumes that a gap |
---|
75 | N bases long is N separate events. This may be an overcorrection. When |
---|
76 | we have nonoverlapping gaps, we could instead code a gap as a |
---|
77 | single event by changing all but the first "-" in the gap into "?" characters. |
---|
78 | In this way only the first base of the gap causes the program to infer a |
---|
79 | change. |
---|
80 | <P> |
---|
81 | The input data is standard. The first line of the input file contains the |
---|
82 | number of species and the number of sites. |
---|
83 | <P> |
---|
84 | Next come the species data. Each |
---|
85 | sequence starts on a new line, has a ten-character species name |
---|
86 | that must be blank-filled to be of that length, followed immediately |
---|
87 | by the species data in the one-letter code. The sequences must either |
---|
88 | be in the "interleaved" or "sequential" formats |
---|
89 | described in the Molecular Sequence Programs document. The I option |
---|
90 | selects between them. The sequences can have internal |
---|
91 | blanks in the sequence but there must be no extra blanks at the end of the |
---|
92 | terminated line. Note that a blank is not a valid symbol for a deletion. |
---|
93 | <P> |
---|
94 | The options are selected using an interactive menu. The menu looks like this: |
---|
95 | <P> |
---|
96 | <TABLE><TR><TD BGCOLOR=white> |
---|
97 | <PRE> |
---|
98 | |
---|
99 | DNA compatibility algorithm, version 3.6a3 |
---|
100 | |
---|
101 | Settings for this run: |
---|
102 | U Search for best tree? Yes |
---|
103 | J Randomize input order of sequences? No. Use input order |
---|
104 | O Outgroup root? No, use as outgroup species 1 |
---|
105 | W Sites weighted? No |
---|
106 | M Analyze multiple data sets? No |
---|
107 | I Input sequences interleaved? Yes |
---|
108 | 0 Terminal type (IBM PC, ANSI, none)? (none) |
---|
109 | 1 Print out the data at start of run No |
---|
110 | 2 Print indications of progress of run Yes |
---|
111 | 3 Print out tree Yes |
---|
112 | 4 Print steps & compatibility at sites No |
---|
113 | 5 Print sequences at all nodes of tree No |
---|
114 | 6 Write out trees onto tree file? Yes |
---|
115 | |
---|
116 | Are these settings correct? (type Y or the letter for one to change) |
---|
117 | |
---|
118 | </PRE> |
---|
119 | </TD></TR></TABLE> |
---|
120 | <P> |
---|
121 | The user either types "Y" (followed, of course, by a carriage-return) |
---|
122 | if the settings shown are to be accepted, or the letter or digit corresponding |
---|
123 | to an option that is to be changed. |
---|
124 | <P> |
---|
125 | The options U, J, O, W, M, and 0 are the usual ones. They are described in the |
---|
126 | main documentation file of this package. Option I is the same as in |
---|
127 | other molecular sequence programs and is described in the documentation file |
---|
128 | for the sequence programs. |
---|
129 | <P> |
---|
130 | The O (outgroup) option has no effect if the U (user-defined tree) option |
---|
131 | is in effect. The user-defined trees (option U) fed in must be strictly |
---|
132 | bifurcating, with a two-way split at their base. |
---|
133 | <P> |
---|
134 | The interpretation of weights (option W) in the case of a compatibility method |
---|
135 | is that they count how many times the character (in this case the site) is |
---|
136 | counted in the analysis. Thus a character can be dropped from the |
---|
137 | analysis by assigning it zero weight. On the other hand, giving it a |
---|
138 | weight of 5 means that in any clique it is in, it is counted as 5 |
---|
139 | characters when the size of the clique is evaluated. Generally, weights |
---|
140 | other than 0 or 1 do not have much meaning when dealing with DNA sequences. |
---|
141 | <P> |
---|
142 | Output is standard: if option 1 is toggled on, the data is printed out, |
---|
143 | with the convention that "." means "the same as in the first species". |
---|
144 | Then comes a list of equally parsimonious trees, and (if option 2 is |
---|
145 | toggled on) a table of the |
---|
146 | number of changes of state required in each character. If option 5 is toggled |
---|
147 | on, a table is printed |
---|
148 | out after each tree, showing for each branch whether there are known to be |
---|
149 | changes in the branch, and what the states are inferred to have been at the |
---|
150 | top end of the branch. If the inferred state is a "?" or one of the IUB |
---|
151 | ambiguity symbols, there will be multiple |
---|
152 | equally-parsimonious assignments of states; the user must work these out for |
---|
153 | themselves by hand. A "?" in the reconstructed states means that in |
---|
154 | addition to one or more bases, a gap may or may not be present. If |
---|
155 | option 6 is left in its default state the trees |
---|
156 | found will be written to a tree file, so that they are available to be used |
---|
157 | in other programs. |
---|
158 | <P> |
---|
159 | If the U (User Tree) option is used and more than one tree is supplied, |
---|
160 | the program also performs a statistical test of each of these trees against the |
---|
161 | one with highest likelihood. If there are two user trees, the test |
---|
162 | done is one which is due to Kishino and Hasegawa (1989), a version |
---|
163 | of a test originally introduced by Templeton (1983). In this |
---|
164 | implementation it uses the mean and variance of weighted |
---|
165 | compatibility differences between trees, taken across sites. If the two |
---|
166 | trees compatibilities are more than 1.96 standard deviations different then |
---|
167 | the trees are declared significantly different. |
---|
168 | <P> |
---|
169 | If there are more than two trees, the test done is an extension of |
---|
170 | the KHT test, due to Shimodaira and Hasegawa (1999). They pointed out |
---|
171 | that a correction for the number of trees was necessary, and they |
---|
172 | introduced a resampling method to make this correction. In the version |
---|
173 | used here the variances and covariances of the sum of weighted |
---|
174 | compatibilities of sites are computed for all pairs of trees. To |
---|
175 | test whether the |
---|
176 | difference between each tree and the best one is larger than could have |
---|
177 | been expected if they all had the same expected compatibility, |
---|
178 | compatibilities for all trees are sampled with these covariances and equal |
---|
179 | means (Shimodaira and Hasegawa's "least favorable hypothesis"), |
---|
180 | and a P value is computed from the fraction of times the difference between |
---|
181 | the tree's value and the highest compatibility exceeds that actually |
---|
182 | observed. Note that this sampling needs random numbers, and so the |
---|
183 | program will prompt the user for a random number seed if one has not |
---|
184 | already been supplied. With the two-tree KHT test no random numbers |
---|
185 | are used. |
---|
186 | <P> |
---|
187 | In either the KHT or the SH test the program |
---|
188 | prints out a table of the compatibility of each tree, the differences of |
---|
189 | each from the highest one, the variance of that quantity as determined by |
---|
190 | the compatibility differences at individual sites, and a conclusion as to |
---|
191 | whether that tree is or is not significantly worse than the best one. |
---|
192 | <P> |
---|
193 | The algorithm is a straightforward modification of DNAPARS, but with |
---|
194 | some extra machinery added to calculate, as each species is added, how |
---|
195 | many base changes are the minimum which could be required at that site. The |
---|
196 | program runs fairly quickly. |
---|
197 | <P> |
---|
198 | The constants |
---|
199 | which can be changed at the beginning of the program are: |
---|
200 | the name length "nmlngth", |
---|
201 | "maxtrees", the maximum number of trees which the program will store for output, |
---|
202 | and "maxuser", |
---|
203 | the maximum number of user trees that can be used in the paired sites test. |
---|
204 | <P> |
---|
205 | <HR><H3>TEST DATA SET</H3> |
---|
206 | <P> |
---|
207 | <TABLE><TR><TD BGCOLOR=white> |
---|
208 | <PRE> |
---|
209 | 5 13 |
---|
210 | Alpha AACGUGGCCAAAU |
---|
211 | Beta AAGGUCGCCAAAC |
---|
212 | Gamma CAUUUCGUCACAA |
---|
213 | Delta GGUAUUUCGGCCU |
---|
214 | Epsilon GGGAUCUCGGCCC |
---|
215 | </PRE> |
---|
216 | </TD></TR></TABLE> |
---|
217 | <P> |
---|
218 | <H3>CONTENTS OF OUTPUT FILE (if all numerical options are turned on)</H3> |
---|
219 | <P> |
---|
220 | <TABLE><TR><TD BGCOLOR=white> |
---|
221 | <PRE> |
---|
222 | |
---|
223 | DNA compatibility algorithm, version 3.6a3 |
---|
224 | |
---|
225 | 5 species, 13 sites |
---|
226 | |
---|
227 | Name Sequences |
---|
228 | ---- --------- |
---|
229 | |
---|
230 | Alpha AACGUGGCCA AAU |
---|
231 | Beta AAGGUCGCCA AAC |
---|
232 | Gamma CAUUUCGUCA CAA |
---|
233 | Delta GGUAUUUCGG CCU |
---|
234 | Epsilon GGGAUCUCGG CCC |
---|
235 | |
---|
236 | |
---|
237 | |
---|
238 | One most parsimonious tree found: |
---|
239 | |
---|
240 | |
---|
241 | |
---|
242 | |
---|
243 | +--Epsilon |
---|
244 | +--4 |
---|
245 | +--3 +--Delta |
---|
246 | ! ! |
---|
247 | +--2 +-----Gamma |
---|
248 | ! ! |
---|
249 | 1 +--------Beta |
---|
250 | ! |
---|
251 | +-----------Alpha |
---|
252 | |
---|
253 | remember: this is an unrooted tree! |
---|
254 | |
---|
255 | |
---|
256 | total number of compatible sites is 11.0 |
---|
257 | |
---|
258 | steps in each site: |
---|
259 | 0 1 2 3 4 5 6 7 8 9 |
---|
260 | *----------------------------------------- |
---|
261 | 0| 2 1 3 2 0 2 1 1 1 |
---|
262 | 10| 1 1 1 3 |
---|
263 | |
---|
264 | compatibility (Y or N) of each site with this tree: |
---|
265 | |
---|
266 | 0123456789 |
---|
267 | *---------- |
---|
268 | 0 ! YYNYYYYYY |
---|
269 | 10 !YYYN |
---|
270 | |
---|
271 | From To Any Steps? State at upper node |
---|
272 | |
---|
273 | 1 AABGTSGCCA AAY |
---|
274 | 1 2 maybe AABGTCGCCA AAY |
---|
275 | 2 3 yes VAKDTCGCCA CAY |
---|
276 | 3 4 yes GGKATCTCGG CCY |
---|
277 | 4 Epsilon maybe GGGATCTCGG CCC |
---|
278 | 4 Delta yes GGTATTTCGG CCT |
---|
279 | 3 Gamma yes CATTTCGTCA CAA |
---|
280 | 2 Beta maybe AAGGTCGCCA AAC |
---|
281 | 1 Alpha maybe AACGTGGCCA AAT |
---|
282 | |
---|
283 | |
---|
284 | </PRE> |
---|
285 | </TD></TR></TABLE> |
---|
286 | </BODY> |
---|
287 | </HTML> |
---|