source: trunk/GDE/PHYLIP/doc/contml.html

<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 3.2 Final//EN">
<HTML>
<HEAD>
<TITLE>contml</TITLE>
<META NAME="description" CONTENT="contml">
<META NAME="keywords" CONTENT="contml">
<META NAME="resource-type" CONTENT="document">
<META NAME="distribution" CONTENT="global">
<META HTTP-EQUIV="Content-Type" CONTENT="text/html; charset=iso-8859-1">
</HEAD>
<BODY BGCOLOR="#ccffff">
<DIV ALIGN=RIGHT>
version 3.6
</DIV>
<P>
<DIV ALIGN=CENTER>
<H1>CONTML - Gene Frequencies and Continuous Characters Maximum Likelihood method</H1>
</DIV>
<P>
&#169; Copyright 1986-2002 by the University of
Washington.  Written by Joseph Felsenstein.  Permission is granted to copy
this document provided that no fee is charged for it and that this copyright
notice is not removed.
<P>
This program estimates phylogenies by the restricted maximum likelihood method
based on the Brownian motion model.  It is based on the model of Edwards and
Cavalli-Sforza (1964; Cavalli-Sforza and Edwards, 1967).  Gomberg (1966),
Felsenstein (1973b, 1981c) and Thompson (1975) have done extensive further work
leading to efficient algorithms.  CONTML uses restricted maximum
likelihood estimation (REML), which is the criterion used by Felsenstein
(1973b).  The actual algorithm is an iterative EM Algorithm (Dempster,
Laird, and Rubin, 1977) which is guaranteed to always give increasing
likelihoods.  The algorithm is described in detail in a paper of mine
(Felsenstein, 1981c), which you should definitely consult if you are
going to use this program.  Some simulation tests of it are given
by Rohlf and Wooten (1988) and Kim and Burgman (1988).
<P>
The default (gene frequency) mode treats the input as gene frequencies at a
series of loci, and
square-root-transforms the allele frequencies (constructing the frequency of
the missing allele at each locus first).  This enables us to use the
Brownian motion model on the resulting coordinates, in an approximation
equivalent to using Cavalli-Sforza and Edwards's (1967) chord measure
of genetic distance and taking that to give distance between particles
undergoing pure Brownian motion.  It assumes that each locus evolves
independently by pure genetic drift.
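<P>
The two steps named above (constructing the missing allele frequency, then
taking square roots) can be sketched in Python.  This is an illustration
only, not the CONTML source: the function name <TT>sqrt_transform</TT> is
hypothetical, it assumes the default input in which one allele is omitted at
each locus, and the program itself may apply further scaling that is not
shown here.

```python
import math

def sqrt_transform(freqs, alleles_per_locus):
    """Square-root-transform gene frequencies for one population.

    freqs holds the frequencies as read from the input file, with one
    allele omitted at each locus (the default when option A is off).
    alleles_per_locus gives the true number of alleles at each locus.
    """
    coords = []
    pos = 0
    for n_alleles in alleles_per_locus:
        given = list(freqs[pos:pos + n_alleles - 1])
        pos += n_alleles - 1
        # Reconstruct the frequency of the missing allele first;
        # clamp at zero to guard against rounding in the input.
        missing = max(0.0, 1.0 - sum(given))
        coords.extend(math.sqrt(p) for p in given + [missing])
    return coords

# European row of the test data set: 10 loci with 2 alleles each
european = [0.2868, 0.5684, 0.4422, 0.4286, 0.3828,
            0.7285, 0.6386, 0.0205, 0.8055, 0.5043]
coords = sqrt_transform(european, [2] * 10)
```

<P>
Because the frequencies at a locus sum to 1, the squared coordinates at each
locus also sum to 1, so each population is placed on a unit sphere at each
locus.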
<P>
The alternative continuous characters mode (menu option C) treats the input
as a series of coordinates of each species in N dimensions.  It assumes
that we have transformed the characters to remove correlations and to
standardize their variances.
<P>
The input file is as described in the continuous characters
documentation file above.  Options are selected using a menu:
<P>
<TABLE><TR><TD BGCOLOR=white>
<PRE>

Continuous character Maximum Likelihood method version 3.6a3

Settings for this run:
  U                       Search for best tree?  Yes
  C  Gene frequencies or continuous characters?  Gene frequencies
  A   Input file has all alleles at each locus?  No, one allele missing at each
  O                              Outgroup root?  No, use as outgroup species 1
  G                      Global rearrangements?  No
  J           Randomize input order of species?  No. Use input order
  M                 Analyze multiple data sets?  No
  0         Terminal type (IBM PC, ANSI, none)?  (none)
  1          Print out the data at start of run  No
  2        Print indications of progress of run  Yes
  3                              Print out tree  Yes
  4             Write out trees onto tree file?  Yes

  Y to accept these or type the letter for one to change

</PRE>
</TD></TR></TABLE>
<P>
Option U is the usual User Tree option.  Options C (Continuous Characters)
and A (All alleles present) have been described
in the Gene Frequencies and Continuous Characters Programs documentation
file.  The options G, J, O and M are the usual Global Rearrangements, Jumble
order of species, Outgroup root, and Multiple Data Sets options.
<P>
The M (Multiple data sets) option does not allow multiple sets of weights
instead of multiple data sets, as there are no weights in this program.
<P>
The G and J options have no effect if the User Tree option is selected.  User
trees are given with a trifurcation (three-way split) at the base.  They
can start from any interior node.  Thus the tree:
<P>
<PRE>
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;A
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;!
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;*--B
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;!
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;*-----C
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;!
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;*--D
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;!
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;E
</PRE>
<P>
can be represented by any of the following:
<P>
<PRE>
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;(A,B,(C,(D,E)));
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;((A,B),C,(D,E));
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;(((A,B),C),D,E);
</PRE>
<P>
(There are of course 69 other representations as well, obtained from these
by swapping the order of branches at an interior node.)
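<P>
The count of representations can be checked with a short Python sketch.  It
is not part of PHYLIP; the interior node labels <TT>x</TT>, <TT>y</TT> and
<TT>z</TT> and the function names are arbitrary choices made for this
illustration.  It starts from each interior node in turn and writes out every
ordering of the branches at every node.

```python
from itertools import permutations, product

# Unrooted tree from the text: interior nodes x, y, z (names are arbitrary)
TREE = {
    "x": ["A", "B", "y"],
    "y": ["x", "C", "z"],
    "z": ["y", "D", "E"],
}

def subtree_strings(node, parent):
    """All Newick strings for the subtree at node, seen from parent."""
    if node not in TREE:          # a tip
        return [node]
    children = [n for n in TREE[node] if n != parent]
    results = []
    for order in permutations(children):
        # every combination of representations of the ordered children
        options = [subtree_strings(child, node) for child in order]
        for combo in product(*options):
            results.append("(" + ",".join(combo) + ")")
    return results

def all_representations():
    """Newick strings starting from each interior node in turn."""
    reps = []
    for root in TREE:
        reps.extend(s + ";" for s in subtree_strings(root, None))
    return reps

reps = all_representations()
print(len(reps))   # 72: the three shown in the text plus 69 others
```

<P>
Each of the three starting nodes contributes 6 orderings of its three
branches times 2 orderings at each of the other two interior nodes, giving
3 x 6 x 2 x 2 = 72 strings in all.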
<P>
The output has a standard appearance.  The topology of the tree
is given by an unrooted tree diagram.  The lengths (in
expected amounts of variance) are given in a table below the topology,
and a rough confidence interval is given for each length.  Negative lower
bounds on length indicate that rearrangements may be acceptable.
<P>
The units of length are amounts of expected accumulated variance (not
time).  The
log likelihood (natural log) of each tree is also given, and it is
indicated how many topologies have been tried.  The tree does not
necessarily have all tips contemporary.  The log likelihood may be
either positive or negative; this simply corresponds to whether the
density function does or does not exceed 1, so a negative log
likelihood does not indicate any error.  The log likelihood allows
various formal likelihood ratio hypothesis tests.  The description of
the tree includes approximate standard errors on the lengths of segments
of the tree.  These are calculated by considering only the curvature of
the likelihood surface as the length of the segment is varied, holding
all other lengths constant.  As such they are most probably underestimates of
the variance, and hence may give too much confidence in the given tree.
<P>
One should use caution in interpreting the likelihoods that are printed
out.  If the model is wrong, it will not be possible to use the
likelihoods to make formal statistical statements.  Thus, if gene
frequencies are being analyzed, but the gene frequencies change not only
by genetic drift, but also by mutation, the model is not correct.  It
would be as well-justified in this case to use GENDIST to compute the
Nei (1972) genetic distance and then use FITCH, KITSCH or NEIGHBOR to make a
tree.  If continuous characters are being analyzed, but the
characters have not been transformed to new coordinates that evolve
independently and at equal rates, then the model is also violated and no
statistical analysis is possible.
<P>
If the U (User Tree) option is used and more than one tree is supplied,
the program also performs a statistical test of each of these trees against the
one with highest likelihood.  If there are two user trees, the test
done is one which is due to Kishino and Hasegawa (1989), a version
of a test originally introduced by Templeton (1983).  In this
implementation it uses the mean and variance of
log-likelihood differences between trees, taken across loci.  If the two
trees' means are more than 1.96 standard deviations different then the trees are
declared significantly different.  This use of the empirical variance of
log-likelihood differences is more robust and nonparametric than the
classical likelihood ratio test, and may to some extent compensate for
any lack of realism in the model underlying this program.
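<P>
The two-tree comparison can be sketched in Python as follows.  This is a
minimal illustration under stated assumptions, not the program's code: the
function name <TT>kh_test</TT> is hypothetical, the per-locus
log-likelihoods are invented for the example, and the particular variance
estimator (the sample variance of the per-locus differences, scaled up to the
sum over loci) is an assumption about how the mean and variance are combined.

```python
import math

def kh_test(lnl_tree1, lnl_tree2, critical=1.96):
    """Compare two trees from their per-locus log-likelihoods."""
    diffs = [a - b for a, b in zip(lnl_tree1, lnl_tree2)]
    n = len(diffs)
    total = sum(diffs)
    mean = total / n
    # Sample variance of a single locus difference, then scaled up to
    # the variance of the sum over n loci (an assumed estimator).
    var_one = sum((d - mean) ** 2 for d in diffs) / (n - 1)
    sd_total = math.sqrt(n * var_one)
    z = total / sd_total
    return total, sd_total, abs(z) > critical

# Hypothetical per-locus log-likelihoods, for illustration only
tree1 = [3.2, 2.9, 3.5, 3.1, 3.4, 3.0]
tree2 = [3.0, 2.5, 3.3, 2.4, 3.1, 2.9]
total, sd_total, significant = kh_test(tree1, tree2)
```

<P>
No random numbers are involved: the decision depends only on the summed
difference and its estimated standard deviation.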
<P>
If there are more than two trees, the test done is an extension of
the KHT test, due to Shimodaira and Hasegawa (1999).  They pointed out
that a correction for the number of trees was necessary, and they
introduced a resampling method to make this correction.  The version
used here is a multivariate normal approximation to their test; it is
due to Shimodaira (1998).  The variances and covariances of the sum of
log likelihoods across loci are computed for all pairs of trees.  To test
whether the difference between each tree and the best one is larger than
could have been expected if they all had the same expected log-likelihood,
log-likelihoods for all trees are sampled with these covariances and equal
means (Shimodaira and Hasegawa's "least favorable hypothesis"),
and a P value is computed from the fraction of times the difference between
the tree's value and the highest log-likelihood exceeds that actually
observed.  Note that this sampling needs random numbers, and so the
program will prompt the user for a random number seed if one has not
already been supplied.  With the two-tree KHT test no random numbers
are used.
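<P>
The sampling procedure just described can be sketched in Python using only
the standard library.  This is a rough illustration, not the program's
implementation: all function names are hypothetical, the per-locus
log-likelihoods are invented, the covariance estimator is an assumption, and
the covariance matrix is assumed to be positive definite so that a Cholesky
factor exists.

```python
import math
import random

def covariance_of_totals(lnl):
    """Covariances of summed log-likelihoods; lnl[t][i] is tree t, locus i."""
    k, n = len(lnl), len(lnl[0])
    means = [sum(row) / n for row in lnl]
    cov = [[0.0] * k for _ in range(k)]
    for a in range(k):
        for b in range(k):
            s = sum((lnl[a][i] - means[a]) * (lnl[b][i] - means[b])
                    for i in range(n))
            # per-locus sample covariance, scaled to a sum over n loci
            cov[a][b] = n * s / (n - 1)
    return cov

def cholesky(m):
    """Lower-triangular factor L with L * L^T = m (m positive definite)."""
    k = len(m)
    low = [[0.0] * k for _ in range(k)]
    for i in range(k):
        for j in range(i + 1):
            s = sum(low[i][c] * low[j][c] for c in range(j))
            if i == j:
                low[i][j] = math.sqrt(m[i][i] - s)
            else:
                low[i][j] = (m[i][j] - s) / low[j][j]
    return low

def sh_p_values(lnl, replicates=10000, seed=4333):
    """Fraction of draws whose gap to the best draw reaches the observed gap."""
    rng = random.Random(seed)
    k = len(lnl)
    totals = [sum(row) for row in lnl]
    observed = [max(totals) - t for t in totals]
    low = cholesky(covariance_of_totals(lnl))
    exceed = [0] * k
    for _ in range(replicates):
        z = [rng.gauss(0.0, 1.0) for _ in range(k)]
        # correlated draws with equal (zero) means: the least favorable case
        draw = [sum(low[i][j] * z[j] for j in range(i + 1)) for i in range(k)]
        best = max(draw)
        for t in range(k):
            if best - draw[t] >= observed[t]:
                exceed[t] += 1
    return [e / replicates for e in exceed]

# Hypothetical per-locus log-likelihoods for three user trees
lnl = [[3.2, 2.9, 3.5, 3.1, 3.4, 3.0],
       [3.0, 2.5, 3.3, 2.4, 3.1, 2.9],
       [2.1, 2.0, 2.6, 1.9, 2.2, 2.3]]
p_values = sh_p_values(lnl)
```

<P>
The best tree has an observed difference of zero, so its P value is always 1.
The seed is fixed here only to make the sketch repeatable.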
<P>
In either the KHT or the SH test the program
prints out a table of the log-likelihoods of each tree, the differences of
each from the highest one, the variance of that quantity as determined by
the log-likelihood differences at individual loci, and a conclusion as to
whether that tree is or is not significantly worse than the best one.
<P>
One problem which sometimes arises is that the program is fed two species
(or populations) with identical transformed gene frequencies: this can
happen if sample sizes are small and/or many loci are monomorphic.  In
this case the program "gets its knickers in a twist" and can divide by
zero, usually causing a crash.  If you suspect that this has happened,
check for two species with identical coordinates.  If you find them,
eliminate one from the problem: the two must always show up as being at the
same point on the tree anyway.
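<P>
A check of this kind can be run on the data before the program is started.
The Python sketch below is an illustration only: the function name and the
three small populations are hypothetical.  It compares the raw frequencies,
which is sufficient because identical raw frequencies give identical
transformed ones.

```python
from itertools import combinations

def identical_populations(data, tolerance=0.0):
    """Return pairs of names whose frequency vectors coincide."""
    pairs = []
    for (name_a, row_a), (name_b, row_b) in combinations(data.items(), 2):
        if all(abs(p - q) <= tolerance for p, q in zip(row_a, row_b)):
            pairs.append((name_a, name_b))
    return pairs

# Hypothetical input: two populations that happen to share all frequencies
data = {
    "PopA": [0.25, 1.00, 0.50],
    "PopB": [0.10, 1.00, 0.75],
    "PopC": [0.25, 1.00, 0.50],
}
duplicates = identical_populations(data)
```

<P>
Any pair reported this way is a candidate for dropping one of its two
members before the run.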
<P>
The constants available for modification at the beginning of the
program include "epsilon1", a small quantity used in the iterations of
branch lengths, "epsilon2", another not quite so small quantity used to
check whether gene frequencies that were fed in for all alleles add up
to 1, "smoothings", the number of passes through a
given tree in the iterative likelihood maximization for a given topology,
"maxtrees", the maximum number of user trees that will be used for the
Kishino-Hasegawa-Templeton test, and
"namelength", the length of species names.
There is no provision in this program for saving multiple trees that are
tied for having the highest likelihood, mostly because an exact tie is
unlikely anyway.
<P>
The algorithm does not run as quickly as the discrete character
methods but is not enormously slower.  Like them, its execution time
should rise as the cube of the number of species.
<P>
<H3>TEST DATA SET</H3>
<P>
This data set was compiled by me from the compilation of human gene
frequencies by Mourant (1976).  It appeared in a paper of mine
(Felsenstein, 1981c) on maximum likelihood phylogenies from gene
frequencies.  The names of the loci and alleles are given in that
paper.
<P>
<TABLE><TR><TD BGCOLOR=white>
<PRE>
    5    10
2 2 2 2 2 2 2 2 2 2
European   0.2868 0.5684 0.4422 0.4286 0.3828 0.7285 0.6386 0.0205
0.8055 0.5043
African    0.1356 0.4840 0.0602 0.0397 0.5977 0.9675 0.9511 0.0600
0.7582 0.6207
Chinese    0.1628 0.5958 0.7298 1.0000 0.3811 0.7986 0.7782 0.0726
0.7482 0.7334
American   0.0144 0.6990 0.3280 0.7421 0.6606 0.8603 0.7924 0.0000
0.8086 0.8636
Australian 0.1211 0.2274 0.5821 1.0000 0.2018 0.9000 0.9837 0.0396
0.9097 0.2976
</PRE>
</TD></TR></TABLE>
<P>
<HR>
<P>
<H3>TEST SET OUTPUT (WITH ALL NUMERICAL OPTIONS TURNED ON)</H3>
<P>
<TABLE><TR><TD BGCOLOR=white>
<PRE>

Continuous character Maximum Likelihood method version 3.6a3


   5 Populations,   10 Loci

Numbers of alleles at the loci:
------- -- ------- -- --- -----

   2   2   2   2   2   2   2   2   2   2

Name                 Gene Frequencies
----                 ---- -----------

  locus:         1         2         3         4         5         6
                 7         8         9        10

European     0.28680   0.56840   0.44220   0.42860   0.38280   0.72850
             0.63860   0.02050   0.80550   0.50430
African      0.13560   0.48400   0.06020   0.03970   0.59770   0.96750
             0.95110   0.06000   0.75820   0.62070
Chinese      0.16280   0.59580   0.72980   1.00000   0.38110   0.79860
             0.77820   0.07260   0.74820   0.73340
American     0.01440   0.69900   0.32800   0.74210   0.66060   0.86030
             0.79240   0.00000   0.80860   0.86360
Australian   0.12110   0.22740   0.58210   1.00000   0.20180   0.90000
             0.98370   0.03960   0.90970   0.29760


  +----------------------------------African
  !
  !              +--------American
  1--------------2
  !              !                    +-----------------------Australian
  !              +--------------------3
  !                                   +Chinese
  !
  +--European


remember: this is an unrooted tree!

Ln Likelihood =    33.29060

Between     And             Length      Approx. Confidence Limits
-------     ---             ------      ------- ---------- ------
  1       African           0.08464   (     0.02351,     0.17917)
  1          2              0.03569   (    -0.00262,     0.09493)
  2       American          0.02094   (    -0.00904,     0.06731)
  2          3              0.05098   (     0.00555,     0.12124)
  3       Australian        0.05959   (     0.01775,     0.12430)
  3       Chinese           0.00221   (    -0.02034,     0.03710)
  1       European          0.00624   (    -0.01948,     0.04601)


</PRE>
</TD></TR></TABLE>
</BODY>
</HTML>