dnaml.html

Last change on this file was 2176, checked in by westram, 20 years ago
* empty log message *
Property svn:eol-style set to `native` Property svn:keywords set to `Author Date Id Revision`
File size: 36.3 KB

1	<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 3.2 Final//EN">
2	<HTML>
3	<HEAD>
4	<TITLE>dnaml</TITLE>
5	<META NAME="description" CONTENT="dnaml">
6	<META NAME="keywords" CONTENT="dnaml">
7	<META NAME="resource-type" CONTENT="document">
8	<META NAME="distribution" CONTENT="global">
9	<META HTTP-EQUIV="Content-Type" CONTENT="text/html; charset=iso-8859-1">
10	</HEAD>
11	<BODY BGCOLOR="#ccffff">
12	<DIV ALIGN=RIGHT>
13	version 3.6
14	</DIV>
15	<P>
16	<DIV ALIGN=CENTER>
17	<H1>DnaML -- DNA Maximum Likelihood program</H1>
18	</DIV>
19	<P>
20	© Copyright 1986-2002 by the University of
21	Washington. Written by Joseph Felsenstein. Permission is granted to copy
22	this document provided that no fee is charged for it and that this copyright
23	notice is not removed.
24	<P>
25	This program implements the maximum likelihood method for DNA
26	sequences. The
27	present version is faster than earlier versions of
28	DNAML. Details of the algorithm are published in
29	the paper by Felsenstein and Churchill (1996).
30	The model of base substitution allows the expected frequencies
31	of the four bases to be unequal, allows the expected frequencies of
32	transitions and transversions to be unequal, and has several
33	ways of allowing different rates of evolution at
34	different sites.
35	<P>
36	The assumptions of the present model are:
37	<OL>
38	<LI>Each site in the sequence evolves independently.
39	<LI>Different lineages evolve independently.
40	<LI>Each site undergoes substitution at an expected rate which is
41	chosen from a series of rates (each with a probability of occurrence)
42	which we specify.
43	<LI>All relevant sites are included in the sequence, not just those that
44	have changed or those that are "phylogenetically informative".
45	<LI>A substitution consists of one of two sorts of events:
46	<DL COMPACT>
47	<DT>(a)</DT>
48	<DD>The first kind
49	of event consists of the replacement of the existing base by a base
50	drawn from a pool of purines or a pool of pyrimidines (depending on
51	whether the base being replaced was a purine or a pyrimidine). It can
52	lead either to no change or to a transition.</DD>
53	<DT>(b)</DT>
54	<DD>The second kind of
55	event consists of the replacement of the existing base
56	by a base drawn at random from a pool of bases at known
57	frequencies, independently of the identity of the base which
58	is being replaced. This could lead either to no change, to a transition
59	or to a transversion.</DD>
60	</DL>
61	<P>
62	The ratio of the two
63	purines in the purine replacement pool is the same as their ratio in the
64	overall pool, and similarly for the pyrimidines.
65	<P>
66	The ratios of transitions to transversions can be set by the
67	user. The substitution process can be diagrammed as follows:
68	Suppose that you specified A, C, G, and T base frequencies of
69	0.24, 0.28, 0.27, and 0.21.
70	<P>
71	<UL>
72	<LI>First kind of event:
73	<P>
74	<OL>
75	<LI>Determine whether the existing base is a purine or a pyrimidine.
76	<LI>Draw from the proper pool:
77	<P>
78	<PRE>
79	    Purine pool:        Pyrimidine pool:
80	
81	   |               |   |               |
82	   |   0.4706 A    |   |   0.5714 C    |
83	   |   0.5294 G    |   |   0.4286 T    |
84	   | (ratio is     |   | (ratio is     |
85	   |  0.24 : 0.27) |   |  0.28 : 0.21) |
86	   |_______________|   |_______________|
87	</PRE>
88	</OL>
89	<P>
90	<LI>Second kind of event:
91	<P>
92	Draw from the overall pool:
93	<PRE>
94	
95	   |                  |
96	   |      0.24 A      |
97	   |      0.28 C      |
98	   |      0.27 G      |
99	   |      0.21 T      |
100	   |__________________|
101	</PRE>
102	</UL>
103	<P>
104	Note that if the existing base is, say, an A, the first kind of event has
105	a 0.4706 probability of "replacing" it by another A. The second kind of
106	event has a 0.24 chance of replacing it by another A. This rather
107	disconcerting model is used because it has nice mathematical properties that
108	make likelihood calculations far easier. A closely similar, but not
109	precisely identical model having different rates of transitions and
110	transversions has been used by Hasegawa et al. (1985b). The transition
111	probability formulas for the current model were given (with my
112	permission) by Kishino and Hasegawa (1989). Another explanation is
113	available in the paper by Felsenstein and Churchill (1996).
114	</OL>
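<P>
The two kinds of events described above can be turned into transition
probabilities directly. The following is a minimal Python sketch, not code
taken from the program: the rate names <TT>alpha</TT> (first kind of event)
and <TT>beta</TT> (second kind of event) and the function name are
assumptions introduced here for illustration.

```python
import math

PURINES = "AG"
PYRIMIDINES = "CT"


def transition_probabilities(freqs, alpha, beta, t):
    """Return P[i][j]: probability that base i has become base j after time t.

    alpha -- assumed rate of the first kind of event (draw from the purine
             or pyrimidine pool, depending on the class of the current base)
    beta  -- assumed rate of the second kind of event (draw from the overall pool)
    """
    # Total frequency of the class (purines or pyrimidines) each base belongs to
    class_total = {
        b: sum(freqs[x] for x in (PURINES if b in PURINES else PYRIMIDINES))
        for b in freqs
    }
    no_event = math.exp(-(alpha + beta) * t)
    # No second-kind event, but at least one first-kind event
    only_first_kind = math.exp(-beta * t) * (1.0 - math.exp(-alpha * t))
    # At least one second-kind event: the base is then distributed as the overall pool
    some_second_kind = 1.0 - math.exp(-beta * t)

    probs = {}
    for i in freqs:
        probs[i] = {}
        for j in freqs:
            p = some_second_kind * freqs[j]
            if (i in PURINES) == (j in PURINES):
                p += only_first_kind * freqs[j] / class_total[j]
            if i == j:
                p += no_event
            probs[i][j] = p
    return probs


freqs = {"A": 0.24, "C": 0.28, "G": 0.27, "T": 0.21}
P = transition_probabilities(freqs, alpha=1.0, beta=0.5, t=0.1)
```

Each row of <TT>P</TT> sums to 1, and for very long times every row
approaches the base frequencies, as one would expect from the description
of the pools.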
115	<P>
116	Note the assumption that we are looking at all sites, including those
117	that have not changed at all. It is important not to restrict attention
118	to some sites based on whether or not they have changed; doing that
119	would bias branch lengths by making them too long, and that in turn
120	would cause the method to misinterpret the meaning of those sites that
121	had changed.
122	<P>
123	This program uses a Hidden Markov Model (HMM)
124	method of inferring different rates of evolution at different sites. This
125	was described in a paper by me and Gary Churchill (1996). It allows us to
126	specify to the program that there will be
127	a number of different possible evolutionary rates, what the prior
128	probabilities of occurrence of each are, and what the average length of a
129	patch of sites all having the same rate is. The rates can also be chosen
130	by the program to approximate a Gamma distribution of rates, or a
131	Gamma distribution plus a class of invariant sites. The program computes the
132	likelihood by summing it over all possible assignments of rates to sites,
133	weighting each by its prior probability of occurrence.
134	<P>
135	For example, if we have used the C and A options (described below) to specify
136	that there are three possible rates of evolution, 1.0, 2.4, and 0.0,
137	that the prior probabilities of a site having these rates are 0.4, 0.3, and
138	0.3, and that the average patch length (number of consecutive sites
139	with the same rate) is 2.0, the program will sum the likelihood over
140	all possibilities, but giving less weight to those that (say) assign all
141	sites to rate 2.4, or that fail to have consecutive sites that have the
142	same rate.
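<P>
The sum over all assignments of rates to sites need not be done by
enumeration. One standard way is a forward recursion along the sequence,
sketched below in Python. This is an illustration, not the program's code:
the per-site likelihood values are made up, and the transition rule
(with probability 1/patch length draw a fresh rate from the prior,
otherwise keep the current rate) is taken from the description of
the A option.

```python
from itertools import product


def hmm_transition(probs, patch_length):
    """Transition matrix between rate categories of adjacent sites."""
    lam = 1.0 / patch_length  # chance of redrawing the rate after each site
    k = len(probs)
    return [[(1.0 - lam) * (i == j) + lam * probs[j] for j in range(k)]
            for i in range(k)]


def total_likelihood(site_likes, probs, patch_length):
    """Forward recursion; site_likes[s][c] = likelihood of site s at rate c."""
    trans = hmm_transition(probs, patch_length)
    k = len(probs)
    forward = [probs[c] * site_likes[0][c] for c in range(k)]
    for likes in site_likes[1:]:
        forward = [likes[j] * sum(forward[i] * trans[i][j] for i in range(k))
                   for j in range(k)]
    return sum(forward)


def brute_force_likelihood(site_likes, probs, patch_length):
    """Same quantity by explicit enumeration of every assignment (check only)."""
    trans = hmm_transition(probs, patch_length)
    k = len(probs)
    total = 0.0
    for assignment in product(range(k), repeat=len(site_likes)):
        weight = probs[assignment[0]]  # prior probability of this assignment
        for prev, cur in zip(assignment, assignment[1:]):
            weight *= trans[prev][cur]
        like = 1.0
        for site, cat in enumerate(assignment):
            like *= site_likes[site][cat]
        total += weight * like
    return total


# Prior probabilities from the example above; site likelihoods are hypothetical.
probs = [0.4, 0.3, 0.3]
site_likes = [[0.02, 0.05, 0.001],
              [0.03, 0.01, 0.2],
              [0.04, 0.06, 0.0005],
              [0.01, 0.02, 0.3]]
fast = total_likelihood(site_likes, probs, patch_length=2.0)
slow = brute_force_likelihood(site_likes, probs, patch_length=2.0)
```

The two results agree, but the recursion takes time proportional to the
number of sites rather than growing exponentially with it.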
143	<P>
144	The Hidden Markov Model framework for rate variation among sites
145	was independently developed by Yang (1993, 1994, 1995). We have
146	implemented a general scheme for a Hidden Markov Model of
147	rates; we allow the rates and their prior probabilities to be specified
148	arbitrarily by the user, or by a discrete approximation to a Gamma
149	distribution of rates (Yang, 1995), or by a mixture of a Gamma
150	distribution and a class of invariant sites.
151	<P>
152	This feature effectively removes the artificial assumption that all sites
153	have the same rate, and also means that we need not know in advance the
154	identities of the sites that have a particular rate of evolution.
155	<P>
156	Another layer of rate variation also is available. The user can assign
157	categories of rates to each site (for example, we might want first, second,
158	and third codon positions in a protein coding sequence to be three different
159	categories). This is done with the categories input file and the C option.
160	We then specify (using the menu) the relative rates of evolution of sites
161	in the different categories. For example, we might specify that first,
162	second, and third positions evolve at relative rates of 1.0, 0.8, and 2.7.
163	<P>
164	If both user-assigned rate categories and Hidden Markov Model rates
165	are allowed, the program assumes that the
166	actual rate at a site is the product of the user-assigned category rate
167	and the Hidden Markov Model regional rate. (This may not always make
168	perfect biological sense: it would be more natural to assume some upper
169	bound to the rate, as we have discussed in the Felsenstein and Churchill
170	paper). Nevertheless you may want to use both types of rate variation.
171	<P>
172	<H2>INPUT FORMAT AND OPTIONS</H2>
173	<P>
174	Subject to these assumptions, the program is a
175	correct maximum likelihood method. The
176	input is fairly standard, with one addition. As usual the first line of the
177	file gives the number of species and the number of sites.
178	<P>
179	Next come the species data. Each
180	sequence starts on a new line, has a ten-character species name
181	that must be blank-filled to be of that length, followed immediately
182	by the species data in the one-letter code. The sequences must either
183	be in the "interleaved" or "sequential" formats
184	described in the Molecular Sequence Programs document. The I option
185	selects between them. The sequences can have internal
186	blanks in the sequence but there must be no extra blanks at the end of the
187	terminated line. Note that a blank is not a valid symbol for a deletion.
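<P>
As an illustration of the layout just described, here is a small Python
sketch that writes sequences in the "sequential" arrangement. The function
name and the example species are hypothetical, and each sequence is put on
a single line for simplicity.

```python
def format_sequential(sequences):
    """Format (name, sequence) pairs as a sequential-format input file."""
    n_sites = len(sequences[0][1])
    # First line: number of species and number of sites
    lines = ["%5d %5d" % (len(sequences), n_sites)]
    for name, seq in sequences:
        if len(seq) != n_sites:
            raise ValueError("all sequences must have the same length")
        # Species name truncated or blank-filled to exactly ten characters,
        # followed immediately by the data with no trailing blanks
        lines.append(name[:10].ljust(10) + seq.rstrip())
    return "\n".join(lines) + "\n"


example = format_sequential([
    ("Alpha", "AACGTGGCCAAAT"),
    ("Beta", "AAGGTCGCCAAAC"),
    ("Gamma", "CATTTCGTCACAA"),
])
```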
188	<P>
189	The options are selected using an interactive menu. The menu looks like this:
190	<P>
191	<TABLE><TR><TD BGCOLOR=white>
192	<PRE>
193	Nucleic acid sequence Maximum Likelihood method, version 3.6a3
194	
195	Settings for this run:
196	  U                 Search for best tree?  Yes
197	  T        Transition/transversion ratio:  2.0000
198	  F       Use empirical base frequencies?  Yes
199	  C                One category of sites?  Yes
200	  R           Rate variation among sites?  constant rate
201	  W                       Sites weighted?  No
202	  S        Speedier but rougher analysis?  Yes
203	  G                Global rearrangements?  No
204	  J   Randomize input order of sequences?  No. Use input order
205	  O                        Outgroup root?  No, use as outgroup species  1
206	  M           Analyze multiple data sets?  No
207	  I          Input sequences interleaved?  Yes
208	  0   Terminal type (IBM PC, ANSI, none)?  (none)
209	  1    Print out the data at start of run  No
210	  2  Print indications of progress of run  Yes
211	  3                        Print out tree  Yes
212	  4       Write out trees onto tree file?  Yes
213	  5   Reconstruct hypothetical sequences?  No
214	
215	  Y to accept these or type the letter for one to change
216	
217	</PRE>
218	</TD></TR></TABLE>
219	<P>
220	The user either types "Y" (followed, of course, by a carriage-return)
221	if the settings shown are to be accepted, or the letter or digit corresponding
222	to an option that is to be changed.
223	<P>
224	The options U, W, J, O, M, and 0 are the usual ones. They are described in the
225	main documentation file of this package. Option I is the same as in
226	other molecular sequence programs and is described in the documentation file
227	for the sequence programs.
228	<P>
229	The T option in this program does not stand for Threshold,
230	but instead is the Transition/transversion option. The user is prompted for
231	a real number greater than 0.0, as the expected ratio of transitions to
232	transversions. Note
233	that this is not the ratio of the first to the second kinds of events,
234	but the resulting expected ratio of transitions to transversions. The exact
235	relationship between these two quantities depends on the frequencies in the
236	base pools. The default value of the T parameter if you do not use the T
237	option is 2.0.
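<P>
The relationship between the two quantities can be worked out from the
description of the two kinds of events. The Python sketch below does so
under the assumption that the first and second kinds of event occur at
rates <TT>alpha</TT> and <TT>beta</TT>; these names and both function names
are introduced here for illustration and are not the program's own.

```python
def expected_ts_tv_ratio(freqs, alpha, beta):
    """Expected transition/transversion ratio for given event rates."""
    pur = freqs["A"] + freqs["G"]
    pyr = freqs["C"] + freqs["T"]
    ag = freqs["A"] * freqs["G"]
    ct = freqs["C"] * freqs["T"]
    # First kind of event: a transition occurs when the other base of the
    # same class is drawn from the purine or pyrimidine pool.
    ts_first = 2.0 * alpha * (ag / pur + ct / pyr)
    # Second kind of event: draw from the overall pool.
    ts_second = 2.0 * beta * (ag + ct)
    tv = 2.0 * beta * pur * pyr
    return (ts_first + ts_second) / tv


def alpha_over_beta(freqs, ratio):
    """Ratio of event rates that yields the requested ts/tv ratio."""
    pur = freqs["A"] + freqs["G"]
    pyr = freqs["C"] + freqs["T"]
    ag = freqs["A"] * freqs["G"]
    ct = freqs["C"] * freqs["T"]
    result = (ratio * pur * pyr - (ag + ct)) / (ag / pur + ct / pyr)
    if result < 0.0:
        raise ValueError("ratio too small to be reached with these frequencies")
    return result


equal = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}
freqs = {"A": 0.24, "C": 0.28, "G": 0.27, "T": 0.21}
needed = alpha_over_beta(freqs, 2.0)
```

With equal base frequencies and no events of the first kind the expected
ratio is 0.5, which shows why the ratio of the two kinds of events is not
the same thing as the transition/transversion ratio.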
238	<P>
239	The F (Frequencies) option is one which may save users much time. If you
240	want to use the empirical frequencies of the bases, observed in the input
241	sequences, as the base frequencies, you simply use the default setting of
242	the F option. These empirical
243	frequencies are not really the maximum likelihood estimates of the base
244	frequencies, but they will often be close to those values (what they are is
245	maximum likelihood estimates under a "star" or "explosion" phylogeny).
246	If you change the setting of the F option you will be prompted for the
247	frequencies of the four bases. These must add to 1 and are to be typed on
248	one line separated by blanks, not commas.
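<P>
Empirical base frequencies of the kind used by the default setting can be
obtained by simple counting. The Python sketch below is a simplification
with a hypothetical function name: it skips gaps and ambiguity codes
entirely, which may differ from how the program itself treats them.

```python
from collections import Counter


def empirical_base_frequencies(sequences):
    """Fraction of A, C, G and T among the unambiguous bases observed."""
    counts = Counter()
    for seq in sequences:
        # Only the four unambiguous bases are counted (an assumption)
        counts.update(base for base in seq.upper() if base in "ACGT")
    total = sum(counts.values())
    if total == 0:
        raise ValueError("no unambiguous bases found")
    return {base: counts[base] / total for base in "ACGT"}


observed = empirical_base_frequencies(["AACG", "acgt", "A-CN"])
```

The values are printed in the order A, C, G, T, which is also the order in
which the program expects user-supplied frequencies to be typed.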
249	<P>
250	The R (Hidden Markov Model rates) option allows the user to
251	approximate a Gamma distribution of rates among sites, or a
252	Gamma distribution plus a class of invariant sites, or to specify how
253	many categories of
254	substitution rates there will be in a Hidden Markov Model of rate
255	variation, and what are the rates and probabilities
256	for each. By repeatedly selecting the R option one toggles among
257	no rate variation, the Gamma, Gamma+I, and general HMM possibilities.
258	<P>
259	If you choose Gamma or Gamma+I the program will ask how many rate
260	categories you want. If you have chosen Gamma+I, keep in mind that
261	one rate category will be set aside for the invariant class and only
262	the remaining ones used to approximate the Gamma distribution.
263	For the approximation we do not use the quantile method of Yang (1995)
264	but instead use a quadrature method using generalized Laguerre
265	polynomials. This should give a good approximation to the Gamma
266	distribution with as few as 5 or 6 categories.
267	<P>
268	In the Gamma and Gamma+I cases, the user will be
269	asked to supply the coefficient of variation of the rate of substitution
270	among sites. This is different from the parameters used by Nei and Jin
271	(1990) but
272	related to them: their parameter <EM>a</EM> is also known as "alpha",
273	the shape parameter of the Gamma distribution. It is
274	related to the coefficient of variation by
275	<P>
276	CV = 1 / a<SUP>1/2</SUP>
277	<P>
278	or
279	<P>
280	a = 1 / (CV)<SUP>2</SUP>
281	<P>
282	(their parameter <EM>b</EM> is absorbed here by the requirement that time is scaled so
283	that the mean rate of evolution is 1 per unit time, which means that <EM>a = b</EM>).
284	As we consider cases in which the rates are less variable we should set <EM>a</EM>
285	larger and larger, as <EM>CV</EM> gets smaller and smaller.
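<P>
The conversion between the two parameterizations is a one-line calculation
in either direction. A minimal Python sketch (function names are
illustrative only):

```python
import math


def alpha_from_cv(cv):
    """Shape parameter of the Gamma distribution from the coefficient of variation."""
    if cv <= 0.0:
        raise ValueError("coefficient of variation must be positive")
    return 1.0 / cv ** 2


def cv_from_alpha(alpha):
    """Coefficient of variation from the shape parameter."""
    if alpha <= 0.0:
        raise ValueError("shape parameter must be positive")
    return 1.0 / math.sqrt(alpha)


shape = alpha_from_cv(0.5)   # a = 1 / 0.25 = 4
```

A coefficient of variation of 1 corresponds to a shape parameter of 1, and
smaller coefficients of variation correspond to larger shape parameters.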
286	<P>
287	If the user instead chooses the general Hidden Markov Model option,
288	they are first asked how many HMM rate categories there
289	will be (for the moment there is an upper limit of 9,
290	which should not be restrictive). Then
291	the program asks for the rates for each category. These rates are
292	only meaningful relative to each other, so that rates 1.0, 2.0, and 2.4
293	have the exact same effect as rates 2.0, 4.0, and 4.8. Note that an
294	HMM rate category
295	can have rate of change 0, so that this allows us to take into account that
296	there may be a category of sites that are invariant. Note that the run time
297	of the program will be proportional to the number of HMM rate categories:
298	twice as
299	many categories means twice as long a run. Finally the program will ask for
300	the probabilities of a random site falling into each of these
301	regional rate categories. These probabilities must be nonnegative and sum to
302	1. Default
303	for the program is one category, with rate 1.0 and probability 1.0 (actually
304	the rate does not matter in that case).
305	<P>
306	If more than one HMM rate category is specified, then another
307	option, A, becomes
308	visible in the menu. This allows us to specify that we want to assume that
309	sites that have the same HMM rate category are expected to be clustered
310	so that there is autocorrelation of rates. The
311	program asks for the value of the average patch length. This is an expected
312	length of patches that have the same rate. If it is 1, the rates of
313	successive sites will be independent. If it is, say, 10.25, then the
314	chance of change to a new rate will be 1/10.25 after every site. However
315	the "new rate" is randomly drawn from the mix of rates, and hence could
316	even be the same. So the actual observed length of patches with the same
317	rate will be a bit larger than 10.25. Note below that if you choose
318	multiple patches, there will be an estimate in the output file as to
319	which combination of rate categories contributed most to the likelihood.
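<P>
How much longer the observed patches are can be estimated from the rule
just given. In the Python sketch below (function names are illustrative),
a site keeps its rate either because no redraw happens or because the
redraw returns the same rate, so the run length of a rate with prior
probability <TT>p_i</TT> is geometric.

```python
def stay_probability(p_i, patch_length):
    """Chance that the next site has the same rate as the current one."""
    lam = 1.0 / patch_length  # chance of redrawing the rate after a site
    return (1.0 - lam) + lam * p_i


def expected_observed_patch_length(p_i, patch_length):
    """Mean number of consecutive sites sharing a rate with prior probability p_i."""
    return 1.0 / (1.0 - stay_probability(p_i, patch_length))


observed = expected_observed_patch_length(0.4, 10.25)
```

With a patch length of 1 the stay probability reduces to <TT>p_i</TT>,
which is the case of independent rates at successive sites. The increase
over the nominal patch length depends on the prior probability of the rate:
it is small for rare rates and larger for common ones.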
320	<P>
321	Note that the autocorrelation scheme we use is somewhat different
322	from Yang's (1995) autocorrelated Gamma distribution. I am unsure
323	whether this difference is of any importance -- our scheme is chosen
324	for the ease with which it can be implemented.
325	<P>
326	The C option allows user-defined rate categories. The user is prompted
327	for the number of user-defined rates, and for the rates themselves,
328	which cannot be negative but can be zero. These numbers
329	are defined relative to each other,
330	so that if rates for three categories
331	are set to 1 : 3 : 2.5 this would have the same meaning as setting them
332	to 2 : 6 : 5.
333	The assignment of rates to
334	sites is then made by reading a file whose default name is "categories".
335	It should contain a string of digits 1 through 9. A new line or a blank
336	can occur after any character in this string. Thus the categories file
337	might look like this:
338	<P>
339	<PRE>
340	122231111122411155
341	1155333333444
342	</PRE>
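<P>
Reading such a file amounts to discarding the blanks and line breaks and
checking that every remaining character is a digit from 1 to 9. A minimal
Python sketch, with a hypothetical function name and an optional check
against the number of sites:

```python
def parse_categories(text, n_sites=None):
    """Return the list of category numbers (1-9), one per site."""
    categories = []
    for ch in text:
        if ch.isspace():
            continue  # a new line or blank may occur after any character
        if ch not in "123456789":
            raise ValueError("invalid category character: %r" % ch)
        categories.append(int(ch))
    if n_sites is not None and len(categories) != n_sites:
        raise ValueError("expected %d categories, found %d"
                         % (n_sites, len(categories)))
    return categories


cats = parse_categories("122231111122411155\n1155333333444\n")
```

For the example file above this yields one category number for each of
31 sites, the largest being 5.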
343	<P>
344	With the current options R, A, and C the program has gained greatly in its
345	ability to infer different rates at different sites and estimate
346	phylogenies under a more realistic model. Note that Likelihood Ratio
347	Tests can be used to test whether one combination of rates is
348	significantly better than another, provided one rate scheme represents
349	a restriction of another with fewer parameters. The number of parameters
350	needed for rate variation is the number of regional rate categories, plus
351	the number of user-defined rate categories less 2, plus one if the
352	regional rate categories have a nonzero autocorrelation.
353	<P>
354	The G (global search) option causes, after the last species is added to
355	the tree, each possible group to be removed and re-added. This improves the
356	result, since the position of every species is reconsidered. It
357	approximately triples the run-time of the program.
358	<P>
359	The User tree (option U) is read from a file whose default name is
360	<TT>intree</TT>.
361	The trees can be multifurcating.
362	<P>
363	If the U (user tree) option is chosen another option appears in
364	the menu, the L option. If it is selected,
365	it signals the program that it
366	should take any branch lengths that are in the user tree and
367	simply evaluate the likelihood of that tree, without further altering
368	those branch lengths. This means that if some branches have lengths
369	and others do not, the program will estimate the lengths of those that
370	do not have lengths given in the user tree. Note that the program RETREE
371	can be used to add and remove lengths from a tree.
372	<P>
373	The U option can read a multifurcating tree. This allows us to
374	test the hypothesis that a certain branch has zero length (we can also
375	do this by using RETREE to set the length of that branch to 0.0 when
376	it is present in the tree). By
377	doing a series of runs with different specified lengths for a branch we
378	can plot a likelihood curve for its branch length while allowing all
379	other branches to adjust their lengths to it. If all branches have
380	lengths specified, none of them will be iterated. This is useful to allow
381	a tree produced by another method to have its likelihood
382	evaluated. The L option has no effect and does not appear in the
383	menu if the U option is not used.
384	<P>
385	The W (Weights) option is invoked in the usual way, with only weights 0
386	and 1 allowed. It selects a set of sites to be analyzed, ignoring the
387	others. The sites selected are those with weight 1. If the W option is
388	not invoked, all sites are analyzed.
389	The Weights (W) option
390	takes the weights from a file whose default name is "weights". The weights
391	follow the format described in the main documentation file.
392	<P>
393	The M (multiple data sets) option will ask you whether you want to
394	use multiple sets of weights (from the weights file) or multiple data sets
395	from the input file.
396	The ability to use a single data set with multiple weights means that
397	much less disk space will be used for this input data. The bootstrapping
398	and jackknifing tool Seqboot has the ability to create a weights file with
399	multiple weights. Note also that when we use multiple weights for
400	bootstrapping we can also then maintain different rate categories for
401	different sites in a meaningful way. If you use the multiple
402	data sets option without using multiple weights, you should not at the
403	same time use the user-defined rate categories option (option C).
404	<P>
405	The algorithm used for searching among trees is faster than it was in
406	version 3.5, thanks to using a technique invented by David Swofford
407	and J. S. Rogers. This involves not iterating most branch lengths on most
408	trees when searching among tree topologies. This is of necessity a
409	"quick-and-dirty" search but it saves much time. There is a menu option
410	(option S) which can turn off this search and revert to the earlier
411	search method which iterated branch lengths in all topologies. This will
412	be substantially slower but will also be a bit more likely to find the
413	tree topology of highest likelihood.
414	<P>
415	<H2>OUTPUT FORMAT</H2>
416	<P>
417	The output starts by giving the number of species, the number of sites,
418	and the base frequencies for A, C, G, and T that have been specified. It
419	then prints out the transition/transversion ratio that was specified or
420	used by default. It also uses the base frequencies to compute the actual
421	transition/transversion ratio implied by the parameter.
422	<P>
423	If the R (HMM rates) option is used a table of the relative rates of
424	expected substitution at each category of sites is printed, as well
425	as the probabilities of each of those rates.
426	<P>
427	There then follow the data sequences, if the user has selected the menu
428	option to print them out, with the base sequences printed in
429	groups of ten bases along the lines of the Genbank and EMBL formats. The
430	trees found are printed as an unrooted
431	tree topology (possibly rooted by outgroup if so requested). The
432	internal nodes are numbered arbitrarily for the sake of
433	identification. The number of trees evaluated so far and the log
434	likelihood of the tree are also given. Note that the trees printed out
435	have a trifurcation at the base. The branch lengths in the diagram are
436	roughly proportional to the estimated branch lengths, except that very short
437	branches are printed out at least three characters in length so that the
438	connections can be seen.
439	<P>
440	A table is printed
441	showing the length of each tree segment (in units of expected nucleotide
442	substitutions per site), as well as (very) rough confidence
443	limits on their lengths. If a confidence limit is
444	negative, this indicates that rearrangement of the tree in that region
445	is not excluded, while if both limits are positive, rearrangement is
446	still not necessarily excluded because the variance calculation on which
447	the confidence limits are based results in an underestimate, which makes
448	the confidence limits too narrow.
449	<P>
450	In addition to the confidence limits,
451	the program performs a crude Likelihood Ratio Test (LRT) for each
452	branch of the tree. The program computes the ratio of likelihoods with and
453	without this branch length forced to zero length. This is done by comparing the
454	likelihoods changing only that branch length. A truly correct LRT would
455	force that branch length to zero and also allow the other branch lengths to
456	adjust to that. The result would be a likelihood ratio closer to 1. Therefore
457	the present LRT will err on the side of being too significant. YOU ARE
458	WARNED AGAINST TAKING IT TOO SERIOUSLY. If you want to get a better
459	likelihood curve for a branch length you can do multiple runs with
460	different prespecified lengths for that branch, as discussed above in the
461	discussion of the L option.
462	<P>
463	One should also
464	realize that if you are looking not at a previously-chosen branch but at all
465	branches, you are seeing the results of multiple tests. With 20 tests,
466	one is expected to reach significance at the P = .05 level purely by
467	chance. You should therefore use a much more conservative significance level,
468	such as .05 divided by the number of tests. The significance of these tests
469	is shown by printing asterisks next to
470	the confidence interval on each branch length. It is important to keep
471	in mind that both the confidence limits and the tests
472	are very rough and approximate, and probably indicate more significance than
473	they should. Nevertheless, maximum likelihood is one of the few methods that
474	can give you any indication of its own error; most other methods simply fail to
475	warn the user that there is any error! (In fact, whole philosophical schools
476	of taxonomists exist whose main point seems to be that there isn't any
477	error, that the "most parsimonious" tree is the best tree by definition and
478	that's that).
479	<P>
480	The log likelihood printed out with the final tree can be used to perform
481	various likelihood ratio tests. One can, for example, compare runs with
482	different values of the expected transition/transversion ratio to determine
483	which value is the maximum likelihood estimate, and what is the allowable range
484	of values (using a likelihood ratio test, which you will find described in
485	mathematical statistics books). One could also estimate the base frequencies
486	in the same way. Both of these, particularly the latter, require multiple runs
487	of the program to evaluate different possible values, and this might get
488	expensive.
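<P>
For a comparison that differs in a single parameter, such as two values of
the transition/transversion ratio, the likelihood ratio test can be sketched
as follows. This is a generic textbook calculation in Python, not output or
code from the program; the log likelihoods are hypothetical, and the
chi-square distribution with one degree of freedom is assumed to apply.

```python
import math


def lrt_one_parameter(lnl_restricted, lnl_general):
    """Return (test statistic, P value) for a one-parameter likelihood ratio test."""
    stat = 2.0 * (lnl_general - lnl_restricted)
    if stat < 0.0:
        raise ValueError("the general model cannot have the lower log likelihood")
    # Upper tail of the chi-square distribution with 1 degree of freedom
    p_value = math.erfc(math.sqrt(stat / 2.0))
    return stat, p_value


# Hypothetical log likelihoods from two runs of the program
stat, p_value = lrt_one_parameter(-1001.9207, -1000.0)
```

A difference of about 1.92 log likelihood units corresponds to the
conventional 5% level, so values of the parameter whose log likelihood is
within that distance of the maximum form a rough allowable range.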
489	<P>
490	If the U (User Tree) option is used and more than one tree is supplied,
491	and the program is not told to assume autocorrelation between the
492	rates at different sites, the
493	program also performs a statistical test of each of these trees against the
494	one with highest likelihood. If there are two user trees, the test
495	done is one which is due to Kishino and Hasegawa (1989), a version
496	of a test originally introduced by Templeton (1983). In this
497	implementation it uses the mean and variance of
498	log-likelihood differences between trees, taken across sites. If the two
499	trees' means are more than 1.96 standard deviations different
500	then the trees are
501	declared significantly different. This use of the empirical variance of
502	log-likelihood differences is more robust and nonparametric than the
503	classical likelihood ratio test, and may to some extent compensate for the
504	any lack of realism in the model underlying this program.
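<P>
The two-tree comparison can be sketched in Python as below. It follows the
description above (mean and variance of the per-site log-likelihood
differences, with a 1.96 standard deviation threshold); the function name
and the per-site values are hypothetical, and details such as the variance
estimator may differ from the program's implementation.

```python
import math


def kishino_hasegawa(site_lnl_a, site_lnl_b):
    """Compare two trees from their per-site log likelihoods.

    Returns (total difference, its standard deviation, significant?).
    """
    diffs = [a - b for a, b in zip(site_lnl_a, site_lnl_b)]
    n = len(diffs)
    total = sum(diffs)
    mean = total / n
    # Sample variance of the per-site differences (n - 1 denominator assumed)
    sample_var = sum((d - mean) ** 2 for d in diffs) / (n - 1)
    # Variance of the summed difference over n sites
    sd_total = math.sqrt(n * sample_var)
    if sd_total == 0.0:
        return total, sd_total, total != 0.0
    return total, sd_total, abs(total) / sd_total > 1.96


tree_a = [-2.0, -2.1, -1.9, -2.0]   # hypothetical per-site log likelihoods
tree_b = [-2.5, -2.4, -2.6, -2.5]
total, sd_total, significant = kishino_hasegawa(tree_a, tree_b)
```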
505	<P>
506	If there are more than two trees, the test done is an extension of
507	the KHT test, due to Shimodaira and Hasegawa (1999). They pointed out
508	that a correction for the number of trees was necessary, and they
509	introduced a resampling method to make this correction. In the version
510	used here the variances and covariances of the sum of log likelihoods across
511	sites are computed for all pairs of trees. To test whether the
512	difference between each tree and the best one is larger than could have
513	been expected if they all had the same expected log-likelihood,
514	log-likelihoods for all trees are sampled with these covariances and equal
515	means (Shimodaira and Hasegawa's "least favorable hypothesis"),
516	and a P value is computed from the fraction of times the difference between
517	the tree's value and the highest log-likelihood exceeds that actually
518	observed. Note that this sampling needs random numbers, and so the
519	program will prompt the user for a random number seed if one has not
520	already been supplied. With the two-tree KHT test no random numbers
521	are used.
522	<P>
523	In either the KHT or the SH test the program
524	prints out a table of the log-likelihoods of each tree, the differences of
525	each from the highest one, the variance of that quantity as determined by
526	the log-likelihood differences at individual sites, and a conclusion as to
527	whether that tree is or is not significantly worse than the best one. However
528	the test is not available if we assume that there
529	is autocorrelation of rates at neighboring sites (option A) and is not
530	done in those cases.
531	<P>
532	The branch lengths printed out are scaled in terms of expected numbers of
533	substitutions, counting both transitions and transversions but not
534	replacements of a base by itself, and scaled so that the average rate of
535	change, averaged over all sites analyzed, is set to 1.0
536	if there are multiple categories of sites. This means that whether or not
537	there are multiple categories of sites, the expected fraction of change
538	for very small branches is equal to the branch length. Of course,
539	when a branch is twice as
540	long this does not mean that there will be twice as much net change expected
541	along it, since some of the changes occur in the same site and overlie or
542	even reverse each
543	other. The branch length estimates here are in terms of the expected
544	underlying numbers of changes. That means that a branch of length 0.26
545	is 26 times as long as one which would show a 1% difference between
546	the nucleotide sequences at the beginning and end of the branch. But we
547	would not expect the sequences at the beginning and end of the branch to be
548	26% different, as there would be some overlaying of changes.
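<P>
The size of this effect can be illustrated with the simpler Jukes-Cantor
model, which has equal base frequencies and a single kind of change. It is
used here only because its formula is short; it is not the model of this
program, so the numbers are indicative rather than exact.

```python
import math


def jc_expected_difference(branch_length):
    """Expected fraction of sites differing between the two ends of a branch
    under the Jukes-Cantor model (a stand-in for the program's model)."""
    return 0.75 * (1.0 - math.exp(-4.0 * branch_length / 3.0))


short = jc_expected_difference(0.01)
longer = jc_expected_difference(0.26)
```

For a branch of length 0.01 the expected difference is very nearly 1%,
while for a branch of length 0.26 it is about 22% rather than 26%.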
549	<P>
550	Confidence limits on the branch lengths are
551	also given. Of course a
552	negative value of the branch length is meaningless, and a confidence
553	limit overlapping zero simply means that the branch length is not necessarily
554	significantly different from zero. Because of limitations of the numerical
555	algorithm, branch length estimates of zero will often print out as small
556	numbers such as 0.00001. If you see a branch length that small, it is really
557	estimated to be of zero length. Note that versions 2.7 and earlier of this
558	program printed out the branch lengths in terms of expected probability of
559	change, so that they were scaled differently.
560	<P>
561	Another possible source of confusion is the existence of negative values for
562	the log likelihood. This is not really a problem; the log likelihood is not a
563	probability but the logarithm of a probability. When it is
564	negative it simply means that the corresponding probability is less
565	than one (since we are seeing its logarithm). The log likelihood is
566	maximized by being made more positive: -30.23 is worse than -29.14.
567	<P>
568	At the end of the output, if the R option is in effect with multiple
569	HMM rates, the program will print a list of what site categories
570	contributed the most to the final likelihood. This combination of
571	HMM rate categories need not have contributed a majority of the likelihood,
572	just a plurality. Still, it will be helpful as a view of where the
573	program infers that the higher and lower rates are. Note that the
574	use in this calculation of the prior probabilities of different rates,
575	and the average patch length, gives this inference a "smoothed"
576	appearance: some other combination of rates might make a greater
577	contribution to the likelihood, but be discounted because it conflicts
578	with this prior information. See the example output below to see
579	what this printout of rate categories looks like.
580	A second list will also be printed out, showing for each site which
581	rate accounted for the highest fraction of the likelihood. If the fraction
582	of the likelihood accounted for is less than 95%, a dot is printed instead.
583	<P>
584	Option 3 in the menu controls whether the tree is printed out into
585	the output file. This is on by default, and usually you will want to
586	leave it this way. However for runs with multiple data sets such as
587	bootstrapping runs, you will primarily be interested in the trees
588	which are written onto the output tree file, rather than the trees
589	printed on the output file. To keep the output file from becoming too
590	large, it may be wisest to use option 3 to prevent trees from being
591	printed onto the output file.
592	<P>
593	Option 4 in the menu controls whether the tree estimated by the program
594	is written onto a tree file. The default name of this output tree file
595	is "outtree". If the U option is in effect, all the user-defined
596	trees are written to the output tree file.
597	<P>
598	Option 5 in the menu controls whether ancestral states are estimated
599	at each node in the tree. If it is in effect, a table of ancestral
600	sequences is printed out (including the sequences in the tip species which
601	are the input sequences). In that table, if a site has a base which
602	accounts for more than 95% of the likelihood, it is printed in capital
603	letters (A rather than a). If the best nucleotide accounts for less
604	than 50% of the likelihood, the program prints out an ambiguity code
605	(such as M for "A or C") for the set of nucleotides which, taken together,
606	account for more than half of the likelihood. The ambiguity codes are listed
607	in the sequence programs documentation file. One limitation of the current
608	version of the program is that when there are multiple HMM rates
609	(option R) the reconstructed nucleotides are based on only the single
610	assignment of rates to sites which accounts for the largest amount of the
611	likelihood. Thus the assessment of 95% of the likelihood, in tabulating
612	the ancestral states, refers to 95% of the likelihood that is accounted
613	for by that particular combination of rates.
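<P>
The printing rule for a single site can be sketched in Python as shown here. This is not the program's code. The function name is hypothetical, and two details are assumptions not stated above: the set of nucleotides is built by adding bases in order of decreasing probability, and ambiguity codes are printed in lower case.

```python
# Standard IUPAC nucleotide codes, keyed by the sorted set of bases.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "AC": "M", "AG": "R", "AT": "W", "CG": "S", "CT": "Y", "GT": "K",
    "ACG": "V", "ACT": "H", "AGT": "D", "CGT": "B", "ACGT": "N",
}

def call_base(fractions, high=0.95, half=0.5):
    """fractions maps each of A, C, G, T to its share of the likelihood."""
    ranked = sorted(fractions.items(), key=lambda item: item[1], reverse=True)
    best_base, best_fraction = ranked[0]
    if best_fraction > high:
        return best_base.upper()      # well supported: capital letter
    if best_fraction >= half:
        return best_base.lower()      # best base, but below 95%
    # Best base is below 50%: collect bases until they exceed half (assumed greedy).
    chosen, total = [], 0.0
    for base, fraction in ranked:
        chosen.append(base)
        total += fraction
        if total > half:
            break
    return IUPAC["".join(sorted(chosen))].lower()

print(call_base({"A": 0.97, "C": 0.01, "G": 0.01, "T": 0.01}))
print(call_base({"A": 0.70, "C": 0.10, "G": 0.10, "T": 0.10}))
print(call_base({"A": 0.48, "C": 0.40, "G": 0.07, "T": 0.05}))
```

In the last call no single base reaches 50%, but A and C together do, so the code for "A or C" is returned.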
614	<P>
615	<H2>PROGRAM CONSTANTS</H2>
616	<P>
617	The constants defined at the beginning of the program include "maxtrees",
618	the maximum number of user trees that can be processed. It is kept small (100)
619	at present to save memory, but the cost of increasing it
620	is not very great. Other constants
621	include "maxcategories", the maximum number of site
622	categories, "namelength", the length of species names in
623	characters, and three others, "smoothings", "iterations", and "epsilon", that
624	help "tune" the algorithm and define the compromise between execution speed and
625	the quality of the branch lengths found by iteratively maximizing the
626	likelihood. Reducing iterations and smoothings, and increasing epsilon, will
627	result in faster execution but lower-quality branch lengths. These values
628	will not usually have to be changed.
629	<P>
630	The program spends most of its time doing real arithmetic.
631	The algorithm, with separate and independent computations
632	occurring for each pattern, lends itself readily to parallel processing.
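<P>
One way to picture the per-pattern structure: identical columns of the alignment form a single pattern, and each pattern can be evaluated on its own. The Python sketch below is illustrative only and is not the program's code. It groups the columns of the test data set into patterns. The function names are hypothetical, <code>toy_lnl</code> is a placeholder and not a real likelihood, and site categories, weights and the HMM coupling between neighboring sites are all ignored.

```python
from collections import Counter

SEQUENCES = [
    "AACGTGGCCAAAT",  # Alpha
    "AAGGTCGCCAAAC",  # Beta
    "CATTTCGTCACAA",  # Gamma
    "GGTATTTCGGCCT",  # Delta
    "GGGATCTCGGCCC",  # Epsilon
]

def site_patterns(sequences):
    """Count how often each distinct alignment column (pattern) occurs."""
    return Counter("".join(column) for column in zip(*sequences))

def toy_lnl(pattern):
    """Placeholder per-pattern term; a real program would compute a log likelihood."""
    return -float(len(set(pattern)))

def total_lnl(pattern_counts, pattern_lnl):
    """Sum independent per-pattern terms, each weighted by its count."""
    return sum(count * pattern_lnl(p) for p, count in pattern_counts.items())

patterns = site_patterns(SEQUENCES)
print(len(patterns), "patterns for", sum(patterns.values()), "sites")
print(total_lnl(patterns, toy_lnl))
```

Because each term in the sum depends on one pattern only, the terms could be handed to separate workers and added up afterwards.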
633	<P>
634	<H2>PAST AND FUTURE OF THE PROGRAM</H2>
635	<P>
636	This program, which in version 2.6 replaced the old version of DNAML,
637	is not derived directly
638	from it but instead was developed by modifying CONTML, with which it shares
639	many of its data structures and much of its strategy. It was sped up
640	by two major developments, the use of aliasing of nucleotide sites (version
641	3.1) and pretabulation of some exponentials (added by Akiko Fuseki in version
642	3.4). In version 3.5 the Hidden Markov Model code was added and the method
643	of iterating branch lengths was changed from an EM algorithm to direct
644	search. The Hidden Markov Model code slows things down, especially if
645	there is autocorrelation between sites, so this version is slower than
646	version 3.4. Nevertheless we hope that the sacrifice is worth it.
647	<P>
648	One change that is needed in the future is to put in some way of
649	allowing the base composition of nucleotide sequences to differ in different parts
650	of the phylogeny.
651	<P>
652	<HR>
653	<P>
654	<H3>TEST DATA SET</H3>
655	<P>
656	<TABLE><TR><TD BGCOLOR=white>
657	<PRE>
658	5 13
659	Alpha AACGTGGCCAAAT
660	Beta AAGGTCGCCAAAC
661	Gamma CATTTCGTCACAA
662	Delta GGTATTTCGGCCT
663	Epsilon GGGATCTCGGCCC
664	</PRE>
665	</TD></TR></TABLE>
666	<P>
667	<HR>
668	<H3>CONTENTS OF OUTPUT FILE (with all numerical options on)</H3>
669	<P>
670	(It was run with HMM rates having gamma-distributed rates
671	approximated by 5 rate categories,
672	with coefficient of variation of rates 1.0, and with patch length
673	parameter = 1.5. Two user-defined rate categories were used, one for
674	the first 6 sites, the other for the last 7, with rates 1.0 : 2.0.
675	Weights were used, with site 1 given weight 0, and all others
676	weight 1.)
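<P>
The empirical base frequencies reported in this output can be checked by hand. The Python sketch below is not the program's code and its function name is hypothetical. It assumes that, with no ambiguous bases in the data, the frequencies reduce to weighted counts, and it uses the weights as they are printed in the output.

```python
from collections import Counter

SEQUENCES = {
    "Alpha":   "AACGTGGCCAAAT",
    "Beta":    "AAGGTCGCCAAAC",
    "Gamma":   "CATTTCGTCACAA",
    "Delta":   "GGTATTTCGGCCT",
    "Epsilon": "GGGATCTCGGCCC",
}
WEIGHTS = "0111111111111"  # one digit per site, as printed in the output

def empirical_base_frequencies(sequences, weights):
    """Weighted fraction of each base over all species and sites."""
    counts = Counter()
    for seq in sequences.values():
        for base, weight in zip(seq, weights):
            counts[base] += int(weight)
    total = sum(counts.values())
    return {base: counts[base] / total for base in "ACGT"}

for base, freq in empirical_base_frequencies(SEQUENCES, WEIGHTS).items():
    print(base, format(freq, ".5f"))
```

With these weights 60 bases are counted (14 A, 18 C, 14 G, 14 T), which agrees with the frequencies printed in the output.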
677	<P>
678	<TABLE><TR><TD BGCOLOR=white>
679	<PRE>
680
681	Nucleic acid sequence Maximum Likelihood method, version 3.6a3
682
683	5 species, 13 sites
684
685	Site categories are:
686
687	1111112222 222
688
689
690	Sites are weighted as follows:
691
692	0111111111 111
693
694
695	Name Sequences
696	---- ---------
697
698	Alpha AACGTGGCCA AAT
699	Beta AAGGTCGCCA AAC
700	Gamma CATTTCGTCA CAA
701	Delta GGTATTTCGG CCT
702	Epsilon GGGATCTCGG CCC
703
704
705
706	Empirical Base Frequencies:
707
708	A 0.23333
709	C 0.30000
710	G 0.23333
711	T(U) 0.23333
712
713	Transition/transversion ratio = 2.000000
714
715
716	Discrete approximation to gamma distributed rates
717	Coefficient of variation of rates = 1.000000 (alpha = 1.000000)
718
719	State in HMM Rate of change Probability
720
721	1 0.264 0.522
722	2 1.413 0.399
723	3 3.596 0.076
724	4 7.086 0.0036
725	5 12.641 0.000023
726
727	Expected length of a patch of sites having the same rate = 1.500
728
729
730	Site category Rate of change
731
732	1 1.000
733	2 2.000
734
735
736
737	  +Beta
738	  |
739	  |                                                         +Epsilon
740	  |  +------------------------------------------------------3
741	  1--2                                                      +--Delta
742	  |  |
743	  |  +--------Gamma
744	  |
745	  +--Alpha
746
747
748	remember: this is an unrooted tree!
749
750	Ln Likelihood = -66.19167
751
752	Between And Length Approx. Confidence Limits
753	------- --- ------ ------- ---------- ------
754
755	1 Alpha 0.49468 ( zero, 1.23032) **
756	1 Beta 0.00006 ( zero, 0.62569)
757	1 2 0.22531 ( zero, 2.28474)
758	2 3 8.20666 ( zero, 23.52785) **
759	3 Epsilon 0.00006 ( zero, 0.65419)
760	3 Delta 0.44668 ( zero, 1.10233) **
761	2 Gamma 1.34187 ( zero, 3.46288) **
762
763	* = significantly positive, P < 0.05
764	** = significantly positive, P < 0.01
765
766	Combination of categories that contributes the most to the likelihood:
767
768	1122121111 112
769
770	Most probable category at each site if > 0.95 probability ("." otherwise)
771
772	.......... ...
773
774	Probable sequences at interior nodes:
775
776	node Reconstructed sequence (caps if > 0.95)
777
778	1 .AGGTCGCCA AAC
779	Beta AAGGTCGCCA AAC
780	2 .AggTcGcCA aAc
781	3 .GGATCTCGG CCC
782	Epsilon GGGATCTCGG CCC
783	Delta GGTATTTCGG CCT
784	Gamma CATTTCGTCA CAA
785	Alpha AACGTGGCCA AAT
786
787	</PRE>
788	</TD></TR></TABLE>
789	</BODY>
790	</HTML>
