Context Navigation

contml.html

Visit:

Last change on this file was 2176, checked in by westram, 22 years ago
* empty log message *
Property svn:eol-style set to `native` Property svn:keywords set to `Author Date Id Revision`
File size: 13.2 KB

Line
1	<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 3.2 Final//EN">
2	<HTML>
3	<HEAD>
4	<TITLE>contml</TITLE>
5	<META NAME="description" CONTENT="contml">
6	<META NAME="keywords" CONTENT="contml">
7	<META NAME="resource-type" CONTENT="document">
8	<META NAME="distribution" CONTENT="global">
9	<META HTTP-EQUIV="Content-Type" CONTENT="text/html; charset=iso-8859-1">
10	</HEAD>
11	<BODY BGCOLOR="#ccffff">
12	<DIV ALIGN=RIGHT>
13	version 3.6
14	</DIV>
15	<P>
16	<DIV ALIGN=CENTER>
17	<H1>CONTML - Gene Frequencies and Continuous Characters Maximum Likelihood method</H1>
18	</DIV>
19	<P>
20	© Copyright 1986-2002 by the University of
21	Washington. Written by Joseph Felsenstein. Permission is granted to copy
22	this document provided that no fee is charged for it and that this copyright
23	notice is not removed.
24	<P>
25	This program estimates phylogenies by the restricted maximum likelihood method
26	based on the Brownian motion model. It is based on the model of Edwards and
27	Cavalli-Sforza (1964; Cavalli-Sforza and Edwards, 1967). Gomberg (1966),
28	Felsenstein (1973b, 1981c) and Thompson (1975) have done extensive further work
29	leading to efficient algorithms. CONTML uses restricted maximum
30	likelihood estimation (REML), which is the criterion used by Felsenstein
31	(1973b). The actual algorithm is an iterative EM Algorithm (Dempster,
32	Laird, and Rubin, 1977) which is guaranteed to always give increasing
33	likelihoods. The algorithm is described in detail in a paper of mine
34	(Felsenstein, 1981c), which you should definitely consult if you are
35	going to use this program. Some simulation tests of it are given
36	by Rohlf and Wooten (1988) and Kim and Burgman (1988).
37	<P>
38	The default (gene frequency) mode treats the input as gene frequencies at a
39	series of loci, and
40	square-root-transforms the allele frequencies (constructing the frequency of
41	the missing allele at each locus first). This enables us to use the
42	Brownian motion model on the resulting coordinates, in an approximation
43	equivalent to using Cavalli-Sforza and Edwards's (1967) chord measure
44	of genetic distance and taking that to give distance between particles
45	undergoing pure Brownian motion. It assumes that each locus evolves
46	independently by pure genetic drift.
47	<P>
48	The alternative continuous characters mode (menu option C) treats the input
49	as a series of coordinates of each species in N dimensions. It assumes
50	that we have transformed the characters to remove correlations and to
51	standardize their variances.
52	<P>
53	The input file is as described in the continuous characters
54	documentation file above. Options are selected using a menu:
55	<P>
56	<TABLE><TR><TD BGCOLOR=white>
57	<PRE>
58
59	Continuous character Maximum Likelihood method version 3.6a3
60
61	Settings for this run:
62	U Search for best tree? Yes
63	C Gene frequencies or continuous characters? Gene frequencies
64	A Input file has all alleles at each locus? No, one allele missing at each
65	O Outgroup root? No, use as outgroup species 1
66	G Global rearrangements? No
67	J Randomize input order of species? No. Use input order
68	M Analyze multiple data sets? No
69	0 Terminal type (IBM PC, ANSI, none)? (none)
70	1 Print out the data at start of run No
71	2 Print indications of progress of run Yes
72	3 Print out tree Yes
73	4 Write out trees onto tree file? Yes
74
75	Y to accept these or type the letter for one to change
76
77	</PRE>
78	</TD></TR></TABLE>
79	<P>
80	Option U is the usual User Tree option. Options C (Continuous Characters)
81	and A (All alleles present) have been described
82	in the Gene Frequencies and Continuous Characters Programs documentation
83	file. The options G, J, O and M are the usual Global Rearrangements, Jumble
84	order of species, Outgroup root, and Multiple Data Sets options.
85	<P>
86	The M (Multiple data sets) option does not allow multiple sets of weights
87	instead of multiple data sets, as there are no weights in this program.
88	<P>
89	The G and J options have no effect if the User Tree option is selected. User
90	trees are given with a trifurcation (three-way split) at the base. They
91	can start from any interior node. Thus the tree:
92	<P>
93	<PRE>
94	A
95	!
96	*--B
97	!
98	*-----C
99	!
100	*--D
101	!
102	E
103	</PRE>
104	<P>
105	can be represented by any of the following:
106	<P>
107	<PRE>
108	(A,B,(C,(D,E)));
109	((A,B),C,(D,E));
110	(((A,B),C),D,E);
111	</PRE>
112	<P>
113	(there are of course 69 other representations as well obtained from these
114	by swapping the order of branches at an interior node).
115	<P>
116	The output has a standard appearance. The topology of the tree
117	is given by an unrooted tree diagram. The lengths (in time or in
118	expected amounts of variance) are given in a table below the topology,
119	and a rough confidence interval given for each length. Negative lower
120	bounds on length indicate that rearrangements may be acceptable.
121	<P>
122	The units of length are amounts of expected accumulated variance (not
123	time). The
124	log likelihood (natural log) of each tree is also given, and it is
125	indicated how many topologies have been tried. The tree does not
126	necessarily have all tips contemporary, and the log likelihood may be
127	either positive or negative (this simply corresponds to whether the
128	density function does or does not exceed 1) and a negative log
129	likelihood does not indicate any error. The log likelihood allows
130	various formal likelihood ratio hypothesis tests. The description of
131	the tree includes approximate standard errors on the lengths of segments
132	of the tree. These are calculated by considering only the curvature of
133	the likelihood surface as the length of the segment is varied, holding
134	all other lengths constant. As such they are most probably underestimates of
135	the variance, and hence may give too much confidence in the given tree.
136	<P>
137	One should use caution in interpreting the likelihoods that are printed
138	out. If the model is wrong, it will not be possible to use the
139	likelihoods to make formal statistical statements. Thus, if gene
140	frequencies are being analyzed, but the gene frequencies change not only
141	by genetic drift, but also by mutation, the model is not correct. It
142	would be as well-justified in this case to use GENDIST to compute the
143	Nei (1972) genetic distance and then use FITCH, KITSCH or NEIGHBOR to make a
144	tree. If continuous characters are being analyzed, but if the
145	characters have not been transformed to new coordinates that evolve
146	independently and at equal rates, then the model is also violated and no
147	statistical analysis is possible.
148	<P>
149	If the U (User Tree) option is used and more than one tree is supplied,
150	the program also performs a statistical test of each of these trees against the
151	one with highest likelihood. If there are two user trees, the test
152	done is one which is due to Kishino and Hasegawa (1989), a version
153	of a test originally introduced by Templeton (1983). In this
154	implementation it uses the mean and variance of
155	log-likelihood differences between trees, taken across loci. If the two
156	trees means are more than 1.96 standard deviations different then the trees are
157	declared significantly different. This use of the empirical variance of
158	log-likelihood differences is more robust and nonparametric than the
159	classical likelihood ratio test, and may to some extent compensate for the
160	any lack of realism in the model underlying this program.
161	<P>
162	If there are more than two trees, the test done is an extension of
163	the KHT test, due to Shimodaira and Hasegawa (1999). They pointed out
164	that a correction for the number of trees was necessary, and they
165	introduced a resampling method to make this correction. The version
166	used here is a multivariate normal approximation to their test; it is
167	due to Shimodaira (1998). The variances and covariances of the sum of
168	log likelihoods across loci are computed for all pairs of trees. To test
169	whether the difference between each tree and the best one is larger than
170	could have been expected if they all had the same expected log-likelihood,
171	log-likelihoods for all trees are sampled with these covariances and equal
172	means (Shimodaira and Hasegawa's "least favorable hypothesis"),
173	and a P value is computed from the fraction of times the difference between
174	the tree's value and the highest log-likelihood exceeds that actually
175	observed. Note that this sampling needs random numbers, and so the
176	program will prompt the user for a random number seed if one has not
177	already been supplied. With the two-tree KHT test no random numbers
178	are used.
179	<P>
180	In either the KHT or the SH test the program
181	prints out a table of the log-likelihoods of each tree, the differences of
182	each from the highest one, the variance of that quantity as determined by
183	the log-likelihood differences at individual sites, and a conclusion as to
184	whether that tree is or is not significantly worse than the best one.
185	<P>
186	One problem which sometimes arises is that the program is fed two species
187	(or populations) with identical transformed gene frequencies: this can
188	happen if sample sizes are small and/or many loci are monomorphic. In
189	this case the program "gets its knickers in a twist" and can divide by
190	zero, usually causing a crash. If you suspect that this has happened,
191	check for two species with identical coordinates. If you find them,
192	eliminate one from the problem: the two must always show up as being at the
193	same point on the tree anyway.
194	<P>
195	The constants
196	available for modification at the beginning of the
197	program include "epsilon1",
198	a small quantity used in the iterations of branch lengths,
199	"epsilon2", another not quite so small quantity used to check
200	whether gene frequencies that were fed in for all alleles do not add up to 1,
201	"smoothings", the number of passes through a
202	given tree in the iterative likelihood maximization for a given topology,
203	"maxtrees", the maximum number of user trees that will be used for the
204	Kishino-Hasegawa-Templeton test, and
205	"namelength", the length of species names.
206	There is no provision in this program for saving multiple trees that are
207	tied for having the highest likelihood, mostly because an exact tie is
208	unlikely anyway.
209	<P>
210	The algorithm does not run as quickly as the discrete character
211	methods but is not enormously slower. Like them, its execution time
212	should rise as the cube of the number of species.
213	<P>
214	<H3>TEST DATA SET</H3>
215	<P>
216	This data set was compiled by me from the compilation of human gene
217	frequencies by Mourant (1976). It appeared in a paper of mine
218	(Felsenstein, 1981c) on maximum likelihood phylogenies from gene
219	frequencies. The names of the loci and alleles are given in that
220	paper.
221	<P>
222	<TABLE><TR><TD BGCOLOR=white>
223	<PRE>
224	5 10
225	2 2 2 2 2 2 2 2 2 2
226	European 0.2868 0.5684 0.4422 0.4286 0.3828 0.7285 0.6386 0.0205
227	0.8055 0.5043
228	African 0.1356 0.4840 0.0602 0.0397 0.5977 0.9675 0.9511 0.0600
229	0.7582 0.6207
230	Chinese 0.1628 0.5958 0.7298 1.0000 0.3811 0.7986 0.7782 0.0726
231	0.7482 0.7334
232	American 0.0144 0.6990 0.3280 0.7421 0.6606 0.8603 0.7924 0.0000
233	0.8086 0.8636
234	Australian 0.1211 0.2274 0.5821 1.0000 0.2018 0.9000 0.9837 0.0396
235	0.9097 0.2976
236	</PRE>
237	</TD></TR></TABLE>
238	<P>
239	<HR>
240	<P>
241	<H3>TEST SET OUTPUT (WITH ALL NUMERICAL OPTIONS TURNED ON)</H3>
242	<P>
243	<TABLE><TR><TD BGCOLOR=white>
244	<PRE>
245
246	Continuous character Maximum Likelihood method version 3.6a3
247
248
249	5 Populations, 10 Loci
250
251	Numbers of alleles at the loci:
252	------- -- ------- -- --- -----
253
254	2 2 2 2 2 2 2 2 2 2
255
256	Name Gene Frequencies
257	---- ---- -----------
258
259	locus: 1 2 3 4 5 6
260	7 8 9 10
261
262	European 0.28680 0.56840 0.44220 0.42860 0.38280 0.72850
263	0.63860 0.02050 0.80550 0.50430
264	African 0.13560 0.48400 0.06020 0.03970 0.59770 0.96750
265	0.95110 0.06000 0.75820 0.62070
266	Chinese 0.16280 0.59580 0.72980 1.00000 0.38110 0.79860
267	0.77820 0.07260 0.74820 0.73340
268	American 0.01440 0.69900 0.32800 0.74210 0.66060 0.86030
269	0.79240 0.00000 0.80860 0.86360
270	Australian 0.12110 0.22740 0.58210 1.00000 0.20180 0.90000
271	0.98370 0.03960 0.90970 0.29760
272
273
274	+----------------------------------African
275	!
276	! +--------American
277	1--------------2
278	! ! +-----------------------Australian
279	! +--------------------3
280	! +Chinese
281	!
282	+--European
283
284
285	remember: this is an unrooted tree!
286
287	Ln Likelihood = 33.29060
288
289	Between And Length Approx. Confidence Limits
290	------- --- ------ ------- ---------- ------
291	1 African 0.08464 ( 0.02351, 0.17917)
292	1 2 0.03569 ( -0.00262, 0.09493)
293	2 American 0.02094 ( -0.00904, 0.06731)
294	2 3 0.05098 ( 0.00555, 0.12124)
295	3 Australian 0.05959 ( 0.01775, 0.12430)
296	3 Chinese 0.00221 ( -0.02034, 0.03710)
297	1 European 0.00624 ( -0.01948, 0.04601)
298
299
300	</PRE>
301	</TD></TR></TABLE>
302	</BODY>
303	</HTML>

Note: See TracBrowser for help on using the repository browser.

Context Navigation

source: trunk/GDE/PHYLIP/doc/contml.html

Download in other formats: