| 1 | <!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 3.2 Final//EN"> |
|---|
| 2 | <HTML> |
|---|
| 3 | <HEAD> |
|---|
| 4 | <TITLE>dnacomp</TITLE> |
|---|
| 5 | <META NAME="description" CONTENT="dnacomp"> |
|---|
| 6 | <META NAME="keywords" CONTENT="dnacomp"> |
|---|
| 7 | <META NAME="resource-type" CONTENT="document"> |
|---|
| 8 | <META NAME="distribution" CONTENT="global"> |
|---|
| 9 | <META HTTP-EQUIV="Content-Type" CONTENT="text/html; charset=iso-8859-1"> |
|---|
| 10 | </HEAD> |
|---|
| 11 | <BODY BGCOLOR="#ccffff"> |
|---|
| 12 | <DIV ALIGN=RIGHT> |
|---|
| 13 | version 3.6 |
|---|
| 14 | </DIV> |
|---|
| 15 | <P> |
|---|
| 16 | <DIV ALIGN=CENTER> |
|---|
| 17 | <H1>DNACOMP -- DNA Compatibility Program</H1> |
|---|
| 18 | </DIV> |
|---|
| 19 | <P> |
|---|
| 20 | © Copyright 1986-2002 by the University of |
|---|
| 21 | Washington. Written by Joseph Felsenstein. Permission is granted to copy |
|---|
| 22 | this document provided that no fee is charged for it and that this copyright |
|---|
| 23 | notice is not removed. |
|---|
| 24 | <P> |
|---|
| 25 | </EM> |
|---|
| 26 | This program implements the compatibility method for DNA sequence |
|---|
| 27 | data. For a four-state character without a character-state tree, as in |
|---|
| 28 | DNA sequences, the usual clique theorems cannot be applied. The |
|---|
| 29 | approach taken in this program is to directly evaluate each tree |
|---|
| 30 | topology by counting how many substitutions are needed in each site, |
|---|
| 31 | comparing this to the minimum number that might be needed (one less than |
|---|
| 32 | the number of bases observed at that site), and then evaluating the |
|---|
| 33 | number of sites which achieve the minimum number. This is the |
|---|
| 34 | evaluation of the tree (the number of compatible sites), and the |
|---|
| 35 | topology is chosen so as to maximize that number. |
|---|
| 36 | <P> |
|---|
| 37 | Compatibility methods originated with Le Quesne's (1969) suggestion that |
|---|
| 38 | one ought to look for trees supported by the largest number of perfectly |
|---|
| 39 | fitting (compatible) characters. Fitch (1975) showed by counterexample that |
|---|
| 40 | one could not use the pairwise compatibility methods used in CLIQUE to |
|---|
| 41 | discover the largest clique of jointly compatible characters. |
|---|
| 42 | <P> |
|---|
| 43 | The assumptions of this method are similar to those of CLIQUE. In |
|---|
| 44 | a paper in the Biological Journal of the Linnean Society (1981b) |
|---|
| 45 | I discuss this matter extensively. In effect, the assumptions are that: |
|---|
| 46 | <OL> |
|---|
| 47 | <LI>Each character evolves independently. |
|---|
| 48 | <LI>Different lineages evolve independently. |
|---|
| 49 | <LI>The ancestral base at each site is unknown. |
|---|
| 50 | <LI>The rates of change in most sites over the time spans involved |
|---|
| 51 | in the the divergence of the group are very small. |
|---|
| 52 | <LI>A few of the sites have very high rates of change. |
|---|
| 53 | <LI>We do not know in advance which are the high and which the low |
|---|
| 54 | rate sites. |
|---|
| 55 | </OL> |
|---|
| 56 | <P> |
|---|
| 57 | That these are the assumptions of compatibility methods has been documented |
|---|
| 58 | in a series of papers of mine: (1973a, 1978b, 1979, 1981b, |
|---|
| 59 | 1983b, 1988b). For an opposing |
|---|
| 60 | view arguing that arguments such as mine are invalid |
|---|
| 61 | and that parsimony (and perhaps compatibility) methods make no substantive |
|---|
| 62 | assumptions such as these, see the papers by Farris (1983) and Sober (1983a, |
|---|
| 63 | 1983b, 1988), but also read the exchange between Felsenstein and Sober (1986). |
|---|
| 64 | <P> |
|---|
| 65 | There is, however, some reason to believe that the present criterion is not the |
|---|
| 66 | proper way to correct for the presence of some sites with high rates of |
|---|
| 67 | change in nucleotide sequence data. It can be argued that sites showing more |
|---|
| 68 | than two nucleotide states, even if those are compatible with the other sites, |
|---|
| 69 | are also candidates for sites with high rates of change. It might then be more |
|---|
| 70 | proper to use DNAPARS with the Threshold option with a threshold value of 2. |
|---|
| 71 | <P> |
|---|
| 72 | Change from an occupied site to a gap is counted as one |
|---|
| 73 | change. Reversion from a gap to an occupied site is allowed and is also |
|---|
| 74 | counted as one change. Note that this in effect assumes that a gap |
|---|
| 75 | N bases long is N separate events. This may be an overcorrection. When |
|---|
| 76 | we have nonoverlapping gaps, we could instead code a gap as a |
|---|
| 77 | single event by changing all but the first "-" in the gap into "?" characters. |
|---|
| 78 | In this way only the first base of the gap causes the program to infer a |
|---|
| 79 | change. |
|---|
| 80 | <P> |
|---|
| 81 | The input data is standard. The first line of the input file contains the |
|---|
| 82 | number of species and the number of sites. |
|---|
| 83 | <P> |
|---|
| 84 | Next come the species data. Each |
|---|
| 85 | sequence starts on a new line, has a ten-character species name |
|---|
| 86 | that must be blank-filled to be of that length, followed immediately |
|---|
| 87 | by the species data in the one-letter code. The sequences must either |
|---|
| 88 | be in the "interleaved" or "sequential" formats |
|---|
| 89 | described in the Molecular Sequence Programs document. The I option |
|---|
| 90 | selects between them. The sequences can have internal |
|---|
| 91 | blanks in the sequence but there must be no extra blanks at the end of the |
|---|
| 92 | terminated line. Note that a blank is not a valid symbol for a deletion. |
|---|
| 93 | <P> |
|---|
| 94 | The options are selected using an interactive menu. The menu looks like this: |
|---|
| 95 | <P> |
|---|
| 96 | <TABLE><TR><TD BGCOLOR=white> |
|---|
| 97 | <PRE> |
|---|
| 98 | |
|---|
| 99 | DNA compatibility algorithm, version 3.6a3 |
|---|
| 100 | |
|---|
| 101 | Settings for this run: |
|---|
| 102 | U Search for best tree? Yes |
|---|
| 103 | J Randomize input order of sequences? No. Use input order |
|---|
| 104 | O Outgroup root? No, use as outgroup species 1 |
|---|
| 105 | W Sites weighted? No |
|---|
| 106 | M Analyze multiple data sets? No |
|---|
| 107 | I Input sequences interleaved? Yes |
|---|
| 108 | 0 Terminal type (IBM PC, ANSI, none)? (none) |
|---|
| 109 | 1 Print out the data at start of run No |
|---|
| 110 | 2 Print indications of progress of run Yes |
|---|
| 111 | 3 Print out tree Yes |
|---|
| 112 | 4 Print steps & compatibility at sites No |
|---|
| 113 | 5 Print sequences at all nodes of tree No |
|---|
| 114 | 6 Write out trees onto tree file? Yes |
|---|
| 115 | |
|---|
| 116 | Are these settings correct? (type Y or the letter for one to change) |
|---|
| 117 | |
|---|
| 118 | </PRE> |
|---|
| 119 | </TD></TR></TABLE> |
|---|
| 120 | <P> |
|---|
| 121 | The user either types "Y" (followed, of course, by a carriage-return) |
|---|
| 122 | if the settings shown are to be accepted, or the letter or digit corresponding |
|---|
| 123 | to an option that is to be changed. |
|---|
| 124 | <P> |
|---|
| 125 | The options U, J, O, W, M, and 0 are the usual ones. They are described in the |
|---|
| 126 | main documentation file of this package. Option I is the same as in |
|---|
| 127 | other molecular sequence programs and is described in the documentation file |
|---|
| 128 | for the sequence programs. |
|---|
| 129 | <P> |
|---|
| 130 | The O (outgroup) option has no effect if the U (user-defined tree) option |
|---|
| 131 | is in effect. The user-defined trees (option U) fed in must be strictly |
|---|
| 132 | bifurcating, with a two-way split at their base. |
|---|
| 133 | <P> |
|---|
| 134 | The interpretation of weights (option W) in the case of a compatibility method |
|---|
| 135 | is that they count how many times the character (in this case the site) is |
|---|
| 136 | counted in the analysis. Thus a character can be dropped from the |
|---|
| 137 | analysis by assigning it zero weight. On the other hand, giving it a |
|---|
| 138 | weight of 5 means that in any clique it is in, it is counted as 5 |
|---|
| 139 | characters when the size of the clique is evaluated. Generally, weights |
|---|
| 140 | other than 0 or 1 do not have much meaning when dealing with DNA sequences. |
|---|
| 141 | <P> |
|---|
| 142 | Output is standard: if option 1 is toggled on, the data is printed out, |
|---|
| 143 | with the convention that "." means "the same as in the first species". |
|---|
| 144 | Then comes a list of equally parsimonious trees, and (if option 2 is |
|---|
| 145 | toggled on) a table of the |
|---|
| 146 | number of changes of state required in each character. If option 5 is toggled |
|---|
| 147 | on, a table is printed |
|---|
| 148 | out after each tree, showing for each branch whether there are known to be |
|---|
| 149 | changes in the branch, and what the states are inferred to have been at the |
|---|
| 150 | top end of the branch. If the inferred state is a "?" or one of the IUB |
|---|
| 151 | ambiguity symbols, there will be multiple |
|---|
| 152 | equally-parsimonious assignments of states; the user must work these out for |
|---|
| 153 | themselves by hand. A "?" in the reconstructed states means that in |
|---|
| 154 | addition to one or more bases, a gap may or may not be present. If |
|---|
| 155 | option 6 is left in its default state the trees |
|---|
| 156 | found will be written to a tree file, so that they are available to be used |
|---|
| 157 | in other programs. |
|---|
| 158 | <P> |
|---|
| 159 | If the U (User Tree) option is used and more than one tree is supplied, |
|---|
| 160 | the program also performs a statistical test of each of these trees against the |
|---|
| 161 | one with highest likelihood. If there are two user trees, the test |
|---|
| 162 | done is one which is due to Kishino and Hasegawa (1989), a version |
|---|
| 163 | of a test originally introduced by Templeton (1983). In this |
|---|
| 164 | implementation it uses the mean and variance of weighted |
|---|
| 165 | compatibility differences between trees, taken across sites. If the two |
|---|
| 166 | trees compatibilities are more than 1.96 standard deviations different then |
|---|
| 167 | the trees are declared significantly different. |
|---|
| 168 | <P> |
|---|
| 169 | If there are more than two trees, the test done is an extension of |
|---|
| 170 | the KHT test, due to Shimodaira and Hasegawa (1999). They pointed out |
|---|
| 171 | that a correction for the number of trees was necessary, and they |
|---|
| 172 | introduced a resampling method to make this correction. In the version |
|---|
| 173 | used here the variances and covariances of the sum of weighted |
|---|
| 174 | compatibilities of sites are computed for all pairs of trees. To |
|---|
| 175 | test whether the |
|---|
| 176 | difference between each tree and the best one is larger than could have |
|---|
| 177 | been expected if they all had the same expected compatibility, |
|---|
| 178 | compatibilities for all trees are sampled with these covariances and equal |
|---|
| 179 | means (Shimodaira and Hasegawa's "least favorable hypothesis"), |
|---|
| 180 | and a P value is computed from the fraction of times the difference between |
|---|
| 181 | the tree's value and the highest compatibility exceeds that actually |
|---|
| 182 | observed. Note that this sampling needs random numbers, and so the |
|---|
| 183 | program will prompt the user for a random number seed if one has not |
|---|
| 184 | already been supplied. With the two-tree KHT test no random numbers |
|---|
| 185 | are used. |
|---|
| 186 | <P> |
|---|
| 187 | In either the KHT or the SH test the program |
|---|
| 188 | prints out a table of the compatibility of each tree, the differences of |
|---|
| 189 | each from the highest one, the variance of that quantity as determined by |
|---|
| 190 | the compatibility differences at individual sites, and a conclusion as to |
|---|
| 191 | whether that tree is or is not significantly worse than the best one. |
|---|
| 192 | <P> |
|---|
| 193 | The algorithm is a straightforward modification of DNAPARS, but with |
|---|
| 194 | some extra machinery added to calculate, as each species is added, how |
|---|
| 195 | many base changes are the minimum which could be required at that site. The |
|---|
| 196 | program runs fairly quickly. |
|---|
| 197 | <P> |
|---|
| 198 | The constants |
|---|
| 199 | which can be changed at the beginning of the program are: |
|---|
| 200 | the name length "nmlngth", |
|---|
| 201 | "maxtrees", the maximum number of trees which the program will store for output, |
|---|
| 202 | and "maxuser", |
|---|
| 203 | the maximum number of user trees that can be used in the paired sites test. |
|---|
| 204 | <P> |
|---|
| 205 | <HR><H3>TEST DATA SET</H3> |
|---|
| 206 | <P> |
|---|
| 207 | <TABLE><TR><TD BGCOLOR=white> |
|---|
| 208 | <PRE> |
|---|
| 209 | 5 13 |
|---|
| 210 | Alpha AACGUGGCCAAAU |
|---|
| 211 | Beta AAGGUCGCCAAAC |
|---|
| 212 | Gamma CAUUUCGUCACAA |
|---|
| 213 | Delta GGUAUUUCGGCCU |
|---|
| 214 | Epsilon GGGAUCUCGGCCC |
|---|
| 215 | </PRE> |
|---|
| 216 | </TD></TR></TABLE> |
|---|
| 217 | <P> |
|---|
| 218 | <H3>CONTENTS OF OUTPUT FILE (if all numerical options are turned on)</H3> |
|---|
| 219 | <P> |
|---|
| 220 | <TABLE><TR><TD BGCOLOR=white> |
|---|
| 221 | <PRE> |
|---|
| 222 | |
|---|
| 223 | DNA compatibility algorithm, version 3.6a3 |
|---|
| 224 | |
|---|
| 225 | 5 species, 13 sites |
|---|
| 226 | |
|---|
| 227 | Name Sequences |
|---|
| 228 | ---- --------- |
|---|
| 229 | |
|---|
| 230 | Alpha AACGUGGCCA AAU |
|---|
| 231 | Beta AAGGUCGCCA AAC |
|---|
| 232 | Gamma CAUUUCGUCA CAA |
|---|
| 233 | Delta GGUAUUUCGG CCU |
|---|
| 234 | Epsilon GGGAUCUCGG CCC |
|---|
| 235 | |
|---|
| 236 | |
|---|
| 237 | |
|---|
| 238 | One most parsimonious tree found: |
|---|
| 239 | |
|---|
| 240 | |
|---|
| 241 | |
|---|
| 242 | |
|---|
| 243 | +--Epsilon |
|---|
| 244 | +--4 |
|---|
| 245 | +--3 +--Delta |
|---|
| 246 | ! ! |
|---|
| 247 | +--2 +-----Gamma |
|---|
| 248 | ! ! |
|---|
| 249 | 1 +--------Beta |
|---|
| 250 | ! |
|---|
| 251 | +-----------Alpha |
|---|
| 252 | |
|---|
| 253 | remember: this is an unrooted tree! |
|---|
| 254 | |
|---|
| 255 | |
|---|
| 256 | total number of compatible sites is 11.0 |
|---|
| 257 | |
|---|
| 258 | steps in each site: |
|---|
| 259 | 0 1 2 3 4 5 6 7 8 9 |
|---|
| 260 | *----------------------------------------- |
|---|
| 261 | 0| 2 1 3 2 0 2 1 1 1 |
|---|
| 262 | 10| 1 1 1 3 |
|---|
| 263 | |
|---|
| 264 | compatibility (Y or N) of each site with this tree: |
|---|
| 265 | |
|---|
| 266 | 0123456789 |
|---|
| 267 | *---------- |
|---|
| 268 | 0 ! YYNYYYYYY |
|---|
| 269 | 10 !YYYN |
|---|
| 270 | |
|---|
| 271 | From To Any Steps? State at upper node |
|---|
| 272 | |
|---|
| 273 | 1 AABGTSGCCA AAY |
|---|
| 274 | 1 2 maybe AABGTCGCCA AAY |
|---|
| 275 | 2 3 yes VAKDTCGCCA CAY |
|---|
| 276 | 3 4 yes GGKATCTCGG CCY |
|---|
| 277 | 4 Epsilon maybe GGGATCTCGG CCC |
|---|
| 278 | 4 Delta yes GGTATTTCGG CCT |
|---|
| 279 | 3 Gamma yes CATTTCGTCA CAA |
|---|
| 280 | 2 Beta maybe AAGGTCGCCA AAC |
|---|
| 281 | 1 Alpha maybe AACGTGGCCA AAT |
|---|
| 282 | |
|---|
| 283 | |
|---|
| 284 | </PRE> |
|---|
| 285 | </TD></TR></TABLE> |
|---|
| 286 | </BODY> |
|---|
| 287 | </HTML> |
|---|