| 1 | <!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 3.2 Final//EN"> |
|---|
| 2 | <HTML> |
|---|
| 3 | <HEAD> |
|---|
| 4 | <TITLE>contml</TITLE> |
|---|
| 5 | <META NAME="description" CONTENT="contml"> |
|---|
| 6 | <META NAME="keywords" CONTENT="contml"> |
|---|
| 7 | <META NAME="resource-type" CONTENT="document"> |
|---|
| 8 | <META NAME="distribution" CONTENT="global"> |
|---|
| 9 | <META HTTP-EQUIV="Content-Type" CONTENT="text/html; charset=iso-8859-1"> |
|---|
| 10 | </HEAD> |
|---|
| 11 | <BODY BGCOLOR="#ccffff"> |
|---|
| 12 | <DIV ALIGN=RIGHT> |
|---|
| 13 | version 3.6 |
|---|
| 14 | </DIV> |
|---|
| 15 | <P> |
|---|
| 16 | <DIV ALIGN=CENTER> |
|---|
| 17 | <H1>CONTML - Gene Frequencies and Continuous Characters Maximum Likelihood method</H1> |
|---|
| 18 | </DIV> |
|---|
| 19 | <P> |
|---|
| 20 | © Copyright 1986-2002 by the University of |
|---|
| 21 | Washington. Written by Joseph Felsenstein. Permission is granted to copy |
|---|
| 22 | this document provided that no fee is charged for it and that this copyright |
|---|
| 23 | notice is not removed. |
|---|
| 24 | <P> |
|---|
| 25 | This program estimates phylogenies by the restricted maximum likelihood method |
|---|
| 26 | based on the Brownian motion model. It is based on the model of Edwards and |
|---|
| 27 | Cavalli-Sforza (1964; Cavalli-Sforza and Edwards, 1967). Gomberg (1966), |
|---|
| 28 | Felsenstein (1973b, 1981c) and Thompson (1975) have done extensive further work |
|---|
| 29 | leading to efficient algorithms. CONTML uses restricted maximum |
|---|
| 30 | likelihood estimation (REML), which is the criterion used by Felsenstein |
|---|
| 31 | (1973b). The actual algorithm is an iterative EM Algorithm (Dempster, |
|---|
| 32 | Laird, and Rubin, 1977) which is guaranteed to always give increasing |
|---|
| 33 | likelihoods. The algorithm is described in detail in a paper of mine |
|---|
| 34 | (Felsenstein, 1981c), which you should definitely consult if you are |
|---|
| 35 | going to use this program. Some simulation tests of it are given |
|---|
| 36 | by Rohlf and Wooten (1988) and Kim and Burgman (1988). |
|---|
| 37 | <P> |
|---|
| 38 | The default (gene frequency) mode treats the input as gene frequencies at a |
|---|
| 39 | series of loci, and |
|---|
| 40 | square-root-transforms the allele frequencies (constructing the frequency of |
|---|
| 41 | the missing allele at each locus first). This enables us to use the |
|---|
| 42 | Brownian motion model on the resulting coordinates, in an approximation |
|---|
| 43 | equivalent to using Cavalli-Sforza and Edwards's (1967) chord measure |
|---|
| 44 | of genetic distance and taking that to give distance between particles |
|---|
| 45 | undergoing pure Brownian motion. It assumes that each locus evolves |
|---|
| 46 | independently by pure genetic drift. |
|---|
| 47 | <P> |
|---|
| 48 | The alternative continuous characters mode (menu option C) treats the input |
|---|
| 49 | as a series of coordinates of each species in N dimensions. It assumes |
|---|
| 50 | that we have transformed the characters to remove correlations and to |
|---|
| 51 | standardize their variances. |
|---|
| 52 | <P> |
|---|
| 53 | The input file is as described in the continuous characters |
|---|
| 54 | documentation file above. Options are selected using a menu: |
|---|
| 55 | <P> |
|---|
| 56 | <TABLE><TR><TD BGCOLOR=white> |
|---|
| 57 | <PRE> |
|---|
| 58 | |
|---|
| 59 | Continuous character Maximum Likelihood method version 3.6a3 |
|---|
| 60 | |
|---|
| 61 | Settings for this run: |
|---|
| 62 | U Search for best tree? Yes |
|---|
| 63 | C Gene frequencies or continuous characters? Gene frequencies |
|---|
| 64 | A Input file has all alleles at each locus? No, one allele missing at each |
|---|
| 65 | O Outgroup root? No, use as outgroup species 1 |
|---|
| 66 | G Global rearrangements? No |
|---|
| 67 | J Randomize input order of species? No. Use input order |
|---|
| 68 | M Analyze multiple data sets? No |
|---|
| 69 | 0 Terminal type (IBM PC, ANSI, none)? (none) |
|---|
| 70 | 1 Print out the data at start of run No |
|---|
| 71 | 2 Print indications of progress of run Yes |
|---|
| 72 | 3 Print out tree Yes |
|---|
| 73 | 4 Write out trees onto tree file? Yes |
|---|
| 74 | |
|---|
| 75 | Y to accept these or type the letter for one to change |
|---|
| 76 | |
|---|
| 77 | </PRE> |
|---|
| 78 | </TD></TR></TABLE> |
|---|
| 79 | <P> |
|---|
| 80 | Option U is the usual User Tree option. Options C (Continuous Characters) |
|---|
| 81 | and A (All alleles present) have been described |
|---|
| 82 | in the Gene Frequencies and Continuous Characters Programs documentation |
|---|
| 83 | file. The options G, J, O and M are the usual Global Rearrangements, Jumble |
|---|
| 84 | order of species, Outgroup root, and Multiple Data Sets options. |
|---|
| 85 | <P> |
|---|
| 86 | The M (Multiple data sets) option does not allow multiple sets of weights |
|---|
| 87 | instead of multiple data sets, as there are no weights in this program. |
|---|
| 88 | <P> |
|---|
| 89 | The G and J options have no effect if the User Tree option is selected. User |
|---|
| 90 | trees are given with a trifurcation (three-way split) at the base. They |
|---|
| 91 | can start from any interior node. Thus the tree: |
|---|
| 92 | <P> |
|---|
| 93 | <PRE> |
|---|
| 94 | A |
|---|
| 95 | ! |
|---|
| 96 | *--B |
|---|
| 97 | ! |
|---|
| 98 | *-----C |
|---|
| 99 | ! |
|---|
| 100 | *--D |
|---|
| 101 | ! |
|---|
| 102 | E |
|---|
| 103 | </PRE> |
|---|
| 104 | <P> |
|---|
| 105 | can be represented by any of the following: |
|---|
| 106 | <P> |
|---|
| 107 | <PRE> |
|---|
| 108 | (A,B,(C,(D,E))); |
|---|
| 109 | ((A,B),C,(D,E)); |
|---|
| 110 | (((A,B),C),D,E); |
|---|
| 111 | </PRE> |
|---|
| 112 | <P> |
|---|
| 113 | (there are of course 69 other representations as well obtained from these |
|---|
| 114 | by swapping the order of branches at an interior node). |
|---|
| 115 | <P> |
|---|
| 116 | The output has a standard appearance. The topology of the tree |
|---|
| 117 | is given by an unrooted tree diagram. The lengths (in time or in |
|---|
| 118 | expected amounts of variance) are given in a table below the topology, |
|---|
| 119 | and a rough confidence interval given for each length. Negative lower |
|---|
| 120 | bounds on length indicate that rearrangements may be acceptable. |
|---|
| 121 | <P> |
|---|
| 122 | The units of length are amounts of expected accumulated variance (not |
|---|
| 123 | time). The |
|---|
| 124 | log likelihood (natural log) of each tree is also given, and it is |
|---|
| 125 | indicated how many topologies have been tried. The tree does not |
|---|
| 126 | necessarily have all tips contemporary, and the log likelihood may be |
|---|
| 127 | either positive or negative (this simply corresponds to whether the |
|---|
| 128 | density function does or does not exceed 1) and a negative log |
|---|
| 129 | likelihood does not indicate any error. The log likelihood allows |
|---|
| 130 | various formal likelihood ratio hypothesis tests. The description of |
|---|
| 131 | the tree includes approximate standard errors on the lengths of segments |
|---|
| 132 | of the tree. These are calculated by considering only the curvature of |
|---|
| 133 | the likelihood surface as the length of the segment is varied, holding |
|---|
| 134 | all other lengths constant. As such they are most probably underestimates of |
|---|
| 135 | the variance, and hence may give too much confidence in the given tree. |
|---|
| 136 | <P> |
|---|
| 137 | One should use caution in interpreting the likelihoods that are printed |
|---|
| 138 | out. If the model is wrong, it will not be possible to use the |
|---|
| 139 | likelihoods to make formal statistical statements. Thus, if gene |
|---|
| 140 | frequencies are being analyzed, but the gene frequencies change not only |
|---|
| 141 | by genetic drift, but also by mutation, the model is not correct. It |
|---|
| 142 | would be as well-justified in this case to use GENDIST to compute the |
|---|
| 143 | Nei (1972) genetic distance and then use FITCH, KITSCH or NEIGHBOR to make a |
|---|
| 144 | tree. If continuous characters are being analyzed, but if the |
|---|
| 145 | characters have not been transformed to new coordinates that evolve |
|---|
| 146 | independently and at equal rates, then the model is also violated and no |
|---|
| 147 | statistical analysis is possible. |
|---|
| 148 | <P> |
|---|
| 149 | If the U (User Tree) option is used and more than one tree is supplied, |
|---|
| 150 | the program also performs a statistical test of each of these trees against the |
|---|
| 151 | one with highest likelihood. If there are two user trees, the test |
|---|
| 152 | done is one which is due to Kishino and Hasegawa (1989), a version |
|---|
| 153 | of a test originally introduced by Templeton (1983). In this |
|---|
| 154 | implementation it uses the mean and variance of |
|---|
| 155 | log-likelihood differences between trees, taken across loci. If the two |
|---|
| 156 | trees means are more than 1.96 standard deviations different then the trees are |
|---|
| 157 | declared significantly different. This use of the empirical variance of |
|---|
| 158 | log-likelihood differences is more robust and nonparametric than the |
|---|
| 159 | classical likelihood ratio test, and may to some extent compensate for the |
|---|
| 160 | any lack of realism in the model underlying this program. |
|---|
| 161 | <P> |
|---|
| 162 | If there are more than two trees, the test done is an extension of |
|---|
| 163 | the KHT test, due to Shimodaira and Hasegawa (1999). They pointed out |
|---|
| 164 | that a correction for the number of trees was necessary, and they |
|---|
| 165 | introduced a resampling method to make this correction. The version |
|---|
| 166 | used here is a multivariate normal approximation to their test; it is |
|---|
| 167 | due to Shimodaira (1998). The variances and covariances of the sum of |
|---|
| 168 | log likelihoods across loci are computed for all pairs of trees. To test |
|---|
| 169 | whether the difference between each tree and the best one is larger than |
|---|
| 170 | could have been expected if they all had the same expected log-likelihood, |
|---|
| 171 | log-likelihoods for all trees are sampled with these covariances and equal |
|---|
| 172 | means (Shimodaira and Hasegawa's "least favorable hypothesis"), |
|---|
| 173 | and a P value is computed from the fraction of times the difference between |
|---|
| 174 | the tree's value and the highest log-likelihood exceeds that actually |
|---|
| 175 | observed. Note that this sampling needs random numbers, and so the |
|---|
| 176 | program will prompt the user for a random number seed if one has not |
|---|
| 177 | already been supplied. With the two-tree KHT test no random numbers |
|---|
| 178 | are used. |
|---|
| 179 | <P> |
|---|
| 180 | In either the KHT or the SH test the program |
|---|
| 181 | prints out a table of the log-likelihoods of each tree, the differences of |
|---|
| 182 | each from the highest one, the variance of that quantity as determined by |
|---|
| 183 | the log-likelihood differences at individual sites, and a conclusion as to |
|---|
| 184 | whether that tree is or is not significantly worse than the best one. |
|---|
| 185 | <P> |
|---|
| 186 | One problem which sometimes arises is that the program is fed two species |
|---|
| 187 | (or populations) with identical transformed gene frequencies: this can |
|---|
| 188 | happen if sample sizes are small and/or many loci are monomorphic. In |
|---|
| 189 | this case the program "gets its knickers in a twist" and can divide by |
|---|
| 190 | zero, usually causing a crash. If you suspect that this has happened, |
|---|
| 191 | check for two species with identical coordinates. If you find them, |
|---|
| 192 | eliminate one from the problem: the two must always show up as being at the |
|---|
| 193 | same point on the tree anyway. |
|---|
| 194 | <P> |
|---|
| 195 | The constants |
|---|
| 196 | available for modification at the beginning of the |
|---|
| 197 | program include "epsilon1", |
|---|
| 198 | a small quantity used in the iterations of branch lengths, |
|---|
| 199 | "epsilon2", another not quite so small quantity used to check |
|---|
| 200 | whether gene frequencies that were fed in for all alleles do not add up to 1, |
|---|
| 201 | "smoothings", the number of passes through a |
|---|
| 202 | given tree in the iterative likelihood maximization for a given topology, |
|---|
| 203 | "maxtrees", the maximum number of user trees that will be used for the |
|---|
| 204 | Kishino-Hasegawa-Templeton test, and |
|---|
| 205 | "namelength", the length of species names. |
|---|
| 206 | There is no provision in this program for saving multiple trees that are |
|---|
| 207 | tied for having the highest likelihood, mostly because an exact tie is |
|---|
| 208 | unlikely anyway. |
|---|
| 209 | <P> |
|---|
| 210 | The algorithm does not run as quickly as the discrete character |
|---|
| 211 | methods but is not enormously slower. Like them, its execution time |
|---|
| 212 | should rise as the cube of the number of species. |
|---|
| 213 | <P> |
|---|
| 214 | <H3>TEST DATA SET</H3> |
|---|
| 215 | <P> |
|---|
| 216 | This data set was compiled by me from the compilation of human gene |
|---|
| 217 | frequencies by Mourant (1976). It appeared in a paper of mine |
|---|
| 218 | (Felsenstein, 1981c) on maximum likelihood phylogenies from gene |
|---|
| 219 | frequencies. The names of the loci and alleles are given in that |
|---|
| 220 | paper. |
|---|
| 221 | <P> |
|---|
| 222 | <TABLE><TR><TD BGCOLOR=white> |
|---|
| 223 | <PRE> |
|---|
| 224 | 5 10 |
|---|
| 225 | 2 2 2 2 2 2 2 2 2 2 |
|---|
| 226 | European 0.2868 0.5684 0.4422 0.4286 0.3828 0.7285 0.6386 0.0205 |
|---|
| 227 | 0.8055 0.5043 |
|---|
| 228 | African 0.1356 0.4840 0.0602 0.0397 0.5977 0.9675 0.9511 0.0600 |
|---|
| 229 | 0.7582 0.6207 |
|---|
| 230 | Chinese 0.1628 0.5958 0.7298 1.0000 0.3811 0.7986 0.7782 0.0726 |
|---|
| 231 | 0.7482 0.7334 |
|---|
| 232 | American 0.0144 0.6990 0.3280 0.7421 0.6606 0.8603 0.7924 0.0000 |
|---|
| 233 | 0.8086 0.8636 |
|---|
| 234 | Australian 0.1211 0.2274 0.5821 1.0000 0.2018 0.9000 0.9837 0.0396 |
|---|
| 235 | 0.9097 0.2976 |
|---|
| 236 | </PRE> |
|---|
| 237 | </TD></TR></TABLE> |
|---|
| 238 | <P> |
|---|
| 239 | <HR> |
|---|
| 240 | <P> |
|---|
| 241 | <H3>TEST SET OUTPUT (WITH ALL NUMERICAL OPTIONS TURNED ON)</H3> |
|---|
| 242 | <P> |
|---|
| 243 | <TABLE><TR><TD BGCOLOR=white> |
|---|
| 244 | <PRE> |
|---|
| 245 | |
|---|
| 246 | Continuous character Maximum Likelihood method version 3.6a3 |
|---|
| 247 | |
|---|
| 248 | |
|---|
| 249 | 5 Populations, 10 Loci |
|---|
| 250 | |
|---|
| 251 | Numbers of alleles at the loci: |
|---|
| 252 | ------- -- ------- -- --- ----- |
|---|
| 253 | |
|---|
| 254 | 2 2 2 2 2 2 2 2 2 2 |
|---|
| 255 | |
|---|
| 256 | Name Gene Frequencies |
|---|
| 257 | ---- ---- ----------- |
|---|
| 258 | |
|---|
| 259 | locus: 1 2 3 4 5 6 |
|---|
| 260 | 7 8 9 10 |
|---|
| 261 | |
|---|
| 262 | European 0.28680 0.56840 0.44220 0.42860 0.38280 0.72850 |
|---|
| 263 | 0.63860 0.02050 0.80550 0.50430 |
|---|
| 264 | African 0.13560 0.48400 0.06020 0.03970 0.59770 0.96750 |
|---|
| 265 | 0.95110 0.06000 0.75820 0.62070 |
|---|
| 266 | Chinese 0.16280 0.59580 0.72980 1.00000 0.38110 0.79860 |
|---|
| 267 | 0.77820 0.07260 0.74820 0.73340 |
|---|
| 268 | American 0.01440 0.69900 0.32800 0.74210 0.66060 0.86030 |
|---|
| 269 | 0.79240 0.00000 0.80860 0.86360 |
|---|
| 270 | Australian 0.12110 0.22740 0.58210 1.00000 0.20180 0.90000 |
|---|
| 271 | 0.98370 0.03960 0.90970 0.29760 |
|---|
| 272 | |
|---|
| 273 | |
|---|
| 274 | +----------------------------------African |
|---|
| 275 | ! |
|---|
| 276 | ! +--------American |
|---|
| 277 | 1--------------2 |
|---|
| 278 | ! ! +-----------------------Australian |
|---|
| 279 | ! +--------------------3 |
|---|
| 280 | ! +Chinese |
|---|
| 281 | ! |
|---|
| 282 | +--European |
|---|
| 283 | |
|---|
| 284 | |
|---|
| 285 | remember: this is an unrooted tree! |
|---|
| 286 | |
|---|
| 287 | Ln Likelihood = 33.29060 |
|---|
| 288 | |
|---|
| 289 | Between And Length Approx. Confidence Limits |
|---|
| 290 | ------- --- ------ ------- ---------- ------ |
|---|
| 291 | 1 African 0.08464 ( 0.02351, 0.17917) |
|---|
| 292 | 1 2 0.03569 ( -0.00262, 0.09493) |
|---|
| 293 | 2 American 0.02094 ( -0.00904, 0.06731) |
|---|
| 294 | 2 3 0.05098 ( 0.00555, 0.12124) |
|---|
| 295 | 3 Australian 0.05959 ( 0.01775, 0.12430) |
|---|
| 296 | 3 Chinese 0.00221 ( -0.02034, 0.03710) |
|---|
| 297 | 1 European 0.00624 ( -0.01948, 0.04601) |
|---|
| 298 | |
|---|
| 299 | |
|---|
| 300 | </PRE> |
|---|
| 301 | </TD></TR></TABLE> |
|---|
| 302 | </BODY> |
|---|
| 303 | </HTML> |
|---|