Context Navigation

dnamlk.html

Visit:

Last change on this file was 2176, checked in by westram, 22 years ago
* empty log message *
Property svn:eol-style set to `native` Property svn:keywords set to `Author Date Id Revision`
File size: 34.4 KB

Line
1	<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 3.2 Final//EN">
2	<HTML>
3	<HEAD>
4	<TITLE>dnamlk</TITLE>
5	<META NAME="description" CONTENT="dnamlk">
6	<META NAME="keywords" CONTENT="dnamlk">
7	<META NAME="resource-type" CONTENT="document">
8	<META NAME="distribution" CONTENT="global">
9	<META HTTP-EQUIV="Content-Type" CONTENT="text/html; charset=iso-8859-1">
10	</HEAD>
11	<BODY BGCOLOR="#ccffff">
12	<DIV ALIGN=RIGHT>
13	version 3.6
14	</DIV>
15	<P>
16	<DIV ALIGN=CENTER>
17	<H1>DnaMLK -- DNA Maximum Likelihood program<BR>with molecular clock</H1>
18	</DIV>
19	<P>
20	© Copyright 1986-2002 by the University of
21	Washington. Written by Joseph Felsenstein. Permission is granted to copy
22	this document provided that no fee is charged for it and that this copyright
23	notice is not removed.
24	<P>
25	This program implements the maximum likelihood method for DNA
26	sequences under the constraint that the trees estimated must be
27	consistent with a molecular clock. The molecular clock is the
28	assumption that the tips of the tree are all equidistant, in branch
29	length, from its root. This program is indirectly related to DNAML.
30	Details of the algorithm are not yet published, but many aspects
31	of it are similar to DNAML, and these are published in
32	the paper by Felsenstein and Churchill (1996).
33	The model of base substitution allows the expected frequencies
34	of the four bases to be unequal, allows the expected frequencies of
35	transitions and transversions to be unequal, and has several
36	ways of allowing different rates of evolution at
37	different sites.
38	<P>
39	The assumptions of the model are:
40	<OL>
41	<LI>Each site in the sequence evolves independently.
42	<LI>Different lineages evolve independently.
43	<LI>There is a molecular clock.
44	<LI>Each site undergoes substitution at an expected rate which is
45	chosen from a series of rates (each with a probability of occurrence)
46	which we specify.
47	<LI>All relevant sites are included in the sequence, not just those that
48	have changed or those that are "phylogenetically informative".
49	<LI>A substitution consists of one of two sorts of events:
50	<DL COMPACT>
51	<DT>(a)</DT>
52	<DD>The first kind
53	of event consists of the replacement of the existing base by a base
54	drawn from a pool of purines or a pool of pyrimidines (depending on
55	whether the base being replaced was a purine or a pyrimidine). It can
56	lead either to no change or to a transition.</DD>
57	<DT>(b)</DT>
58	<DD>The second kind of
59	event consists of the replacement of the existing base
60	by a base drawn at random from a pool of bases at known
61	frequencies, independently of the identity of the base which
62	is being replaced. This could lead either to a no change, to a transition
63	or to a transversion.</DD>
64	</DL>
65	<P>
66	The ratio of the two
67	purines in the purine replacement pool is the same as their ratio in the
68	overall pool, and similarly for the pyrimidines.
69	<P>
70	The ratios of transitions to transversions can be set by the
71	user. The substitution process can be diagrammed as follows:
72	Suppose that you specified A, C, G, and T base frequencies of
73	0.24, 0.28, 0.27, and 0.21.
74	<P>
75	<UL>
76	<LI>First kind of event:
77	<P>
78	<OL>
79	<LI>Determine whether the existing base is a purine or a pyrimidine.
80	<LI>Draw from the proper pool:
81	<P>
82	<PRE>
83	Purine pool: Pyrimidine pool:
84
85	\| \| \| \|
86	\| 0.4706 A \| \| 0.5714 C \|
87	\| 0.5294 G \| \| 0.4286 T \|
88	\| (ratio is \| \| (ratio is \|
89	\| 0.24 : 0.27) \| \| 0.28 : 0.21) \|
90	\|_______________\| \|_______________\|
91	</PRE>
92	</OL>
93	<P>
94	<LI>Second kind of event:
95	<P>
96	Draw from the overall pool:
97	<PRE>
98
99	\| \|
100	\| 0.24 A \|
101	\| 0.28 C \|
102	\| 0.27 G \|
103	\| 0.21 T \|
104	\|__________________\|
105	</PRE>
106	</UL>
107	<P>
108	Note that if the existing base is, say, an A, the first kind of event has
109	a 0.4706 probability of "replacing" it by another A. The second kind of
110	event has a 0.24 chance of replacing it by another A. This rather
111	disconcerting model is used because it has nice mathematical properties that
112	make likelihood calculations far easier. A closely similar, but not
113	precisely identical model having different rates of transitions and
114	transversions has been used by Hasegawa et. al. (1985b). The transition
115	probability formulas for the current model were given (with my
116	permission) by Kishino and Hasegawa (1989). Another explanation is
117	available in the paper by Felsenstein and Churchill (1996).
118	</OL>
119	<P>
120	Note the assumption that we are looking at all sites, including those
121	that have not changed at all. It is important not to restrict attention
122	to some sites based on whether or not they have changed; doing that
123	would bias branch lengths by making them too long, and that in turn
124	would cause the method to misinterpret the meaning of those sites that
125	had changed.
126	<P>
127	This program uses a Hidden Markov Model (HMM)
128	method of inferring different rates of evolution at different sites. This
129	was described in a paper by me and Gary Churchill (1996). It allows us to
130	specify to the program that there will be
131	a number of different possible evolutionary rates, what the prior
132	probabilities of occurrence of each is, and what the average length of a
133	patch of sites all having the same rate. The rates can also be chosen
134	by the program to approximate a Gamma distribution of rates, or a
135	Gamma distribution plus a class of invariant sites. The program computes the
136	the likelihood by summing it over all possible assignments of rates to sites,
137	weighting each by its prior probability of occurrence.
138	<P>
139	For example, if we have used the C and A options (described below) to specify
140	that there are three possible rates of evolution, 1.0, 2.4, and 0.0,
141	that the prior probabilities of a site having these rates are 0.4, 0.3, and
142	0.3, and that the average patch length (number of consecutive sites
143	with the same rate) is 2.0, the program will sum the likelihood over
144	all possibilities, but giving less weight to those that (say) assign all
145	sites to rate 2.4, or that fail to have consecutive sites that have the
146	same rate.
147	<P>
148	The Hidden Markov Model framework for rate variation among sites
149	was independently developed by Yang (1993, 1994, 1995). We have
150	implemented a general scheme for a Hidden Markov Model of
151	rates; we allow the rates and their prior probabilities to be specified
152	arbitrarily by the user, or by a discrete approximation to a Gamma
153	distribution of rates (Yang, 1995), or by a mixture of a Gamma
154	distribution and a class of invariant sites.
155	<P>
156	This feature effectively removes the artificial assumption that all sites
157	have the same rate, and also means that we need not know in advance the
158	identities of the sites that have a particular rate of evolution.
159	<P>
160	Another layer of rate variation also is available. The user can assign
161	categories of rates to each site (for example, we might want first, second,
162	and third codon positions in a protein coding sequence to be three different
163	categories. This is done with the categories input file and the C option.
164	We then specify (using the menu) the relative rates of evolution of sites
165	in the different categories. For example, we might specify that first,
166	second, and third positions evolve at relative rates of 1.0, 0.8, and 2.7.
167	<P>
168	If both user-assigned rate categories and Hidden Markov Model rates
169	are allowed, the program assumes that the
170	actual rate at a site is the product of the user-assigned category rate
171	and the Hidden Markov Model regional rate. (This may not always make
172	perfect biological sense: it would be more natural to assume some upper
173	bound to the rate, as we have discussed in the Felsenstein and Churchill
174	paper). Nevertheless you may want to use both types of rate variation.
175	<P>
176	<H2>INPUT FORMAT AND OPTIONS</H2>
177	<P>
178	Subject to these assumptions, the program is a
179	correct maximum likelihood method. The
180	input is fairly standard, with one addition. As usual the first line of the
181	file gives the number of species and the number of sites.
182	<P>
183	Next come the species data. Each
184	sequence starts on a new line, has a ten-character species name
185	that must be blank-filled to be of that length, followed immediately
186	by the species data in the one-letter code. The sequences must either
187	be in the "interleaved" or "sequential" formats
188	described in the Molecular Sequence Programs document. The I option
189	selects between them. The sequences can have internal
190	blanks in the sequence but there must be no extra blanks at the end of the
191	terminated line. Note that a blank is not a valid symbol for a deletion.
192	<P>
193	The options are selected using an interactive menu. The menu looks like this:
194	<P>
195	<TABLE><TR><TD BGCOLOR=white>
196	<PRE>
197
198	Nucleic acid sequence
199	Maximum Likelihood method with molecular clock, version 3.6a3
200
201	Settings for this run:
202	U Search for best tree? Yes
203	T Transition/transversion ratio: 2.0
204	F Use empirical base frequencies? Yes
205	C One category of substitution rates? Yes
206	R Rate variation among sites? constant rate
207	G Global rearrangements? No
208	W Sites weighted? No
209	J Randomize input order of sequences? No. Use input order
210	M Analyze multiple data sets? No
211	I Input sequences interleaved? Yes
212	0 Terminal type (IBM PC, ANSI, none)? (none)
213	1 Print out the data at start of run No
214	2 Print indications of progress of run Yes
215	3 Print out tree Yes
216	4 Write out trees onto tree file? Yes
217	5 Reconstruct hypothetical sequences? No
218
219	Are these settings correct? (type Y or the letter for one to change)
220
221	</PRE>
222	</TD></TR></TABLE>
223	<P>
224	The user either types "Y" (followed, of course, by a carriage-return)
225	if the settings shown are to be accepted, or the letter or digit corresponding
226	to an option that is to be changed.
227	<P>
228	The options U, W, J, O, M, and 0 are the usual ones. They are described in the
229	main documentation file of this package. Option I is the same as in
230	other molecular sequence programs and is described in the documentation file
231	for the sequence programs.
232	<P>
233	The T option in this program does not stand for Threshold,
234	but instead is the Transition/transversion option. The user is prompted for
235	a real number greater than 0.0, as the expected ratio of transitions to
236	transversions. Note
237	that this is not the ratio of the first to the second kinds of events,
238	but the resulting expected ratio of transitions to transversions. The exact
239	relationship between these two quantities depends on the frequencies in the
240	base pools. The default value of the T parameter if you do not use the T
241	option is 2.0.
242	<P>
243	The F (Frequencies) option is one which may save users much time. If you
244	want to use the empirical frequencies of the bases, observed in the input
245	sequences, as the base frequencies, you simply use the default setting of
246	the F option. These empirical
247	frequencies are not really the maximum likelihood estimates of the base
248	frequencies, but they will often be close to those values (what they are is
249	maximum likelihood estimates under a "star" or "explosion" phylogeny).
250	If you change the setting of the F option you will be prompted for the
251	frequencies of the four bases. These must add to 1 and are to be typed on
252	one line separated by blanks, not commas.
253	<P>
254	The R (Hidden Markov Model rates) option allows the user to
255	approximate a Gamma distribution of rates among sites, or a
256	Gamma distribution plus a class of invariant sites, or to specify how
257	many categories of
258	substitution rates there will be in a Hidden Markov Model of rate
259	variation, and what are the rates and probabilities
260	for each. By repeatedly selecting the R option one toggles among
261	no rate variation, the Gamma, Gamma+I, and general HMM possibilities.
262	<P>
263	If you choose Gamma or Gamma+I the program will ask how many rate
264	categories you want. If you have chosen Gamma+I, keep in mind that
265	one rate category will be set aside for the invariant class and only
266	the remaining ones used to approximate the Gamma distribution.
267	For the approximation we do not use the quantile method of Yang (1995)
268	but instead use a quadrature method using generalized Laguerre
269	polynomials. This should give a good approximation to the Gamma
270	distribution with as few as 5 or 6 categories.
271	<P>
272	In the Gamma and Gamma+I cases, the user will be
273	asked to supply the coefficient of variation of the rate of substitution
274	among sites. This is different from the parameters used by Nei and Jin
275	(1990) but
276	related to them: their parameter <EM>a</EM> is also known as "alpha",
277	the shape parameter of the Gamma distribution. It is
278	related to the coefficient of variation by
279	<P>
280	CV = 1 / a<SUP>1/2</SUP>
281	<P>
282	or
283	<P>
284	a = 1 / (CV)<SUP>2</SUP>
285	<P>
286	(their parameter <EM>b</EM> is absorbed here by the requirement that time is scaled so
287	that the mean rate of evolution is 1 per unit time, which means that <EM>a = b</EM>).
288	As we consider cases in which the rates are less variable we should set <EM>a</EM>
289	larger and larger, as <EM>CV</EM> gets smaller and smaller.
290	<P>
291	If the user instead chooses the general Hidden Markov Model option,
292	they are first asked how many HMM rate categories there
293	will be (for the moment there is an upper limit of 9,
294	which should not be restrictive). Then
295	the program asks for the rates for each category. These rates are
296	only meaningful relative to each other, so that rates 1.0, 2.0, and 2.4
297	have the exact same effect as rates 2.0, 4.0, and 4.8. Note that an
298	HMM rate category
299	can have rate of change 0, so that this allows us to take into account that
300	there may be a category of sites that are invariant. Note that the run time
301	of the program will be proportional to the number of HMM rate categories:
302	twice as
303	many categories means twice as long a run. Finally the program will ask for
304	the probabilities of a random site falling into each of these
305	regional rate categories. These probabilities must be nonnegative and sum to
306	1. Default
307	for the program is one category, with rate 1.0 and probability 1.0 (actually
308	the rate does not matter in that case).
309	<P>
310	If more than one HMM rate category is specified, then another
311	option, A, becomes
312	If more than one category is specified, then another option, A, becomes
313	visible in the menu. This allows us to specify that we want to assume that
314	sites that have the same HMM rate category are expected to be clustered
315	so that there is autocorrelation of rates. The
316	program asks for the value of the average patch length. This is an expected
317	length of patches that have the same rate. If it is 1, the rates of
318	successive sites will be independent. If it is, say, 10.25, then the
319	chance of change to a new rate will be 1/10.25 after every site. However
320	the "new rate" is randomly drawn from the mix of rates, and hence could
321	even be the same. So the actual observed length of patches with the same
322	rate will be a bit larger than 10.25. Note below that if you choose
323	multiple patches, there will be an estimate in the output file as to
324	which combination of rate categories contributed most to the likelihood.
325	<P>
326	Note that the autocorrelation scheme we use is somewhat different
327	from Yang's (1995) autocorrelated Gamma distribution. I am unsure
328	whether this difference is of any importance -- our scheme is chosen
329	for the ease with which it can be implemented.
330	<P>
331	The C option allows user-defined rate categories. The user is prompted
332	for the number of user-defined rates, and for the rates themselves,
333	which cannot be negative but can be zero. These numbers, which must be
334	nonnegative (some could be 0),
335	are defined relative to each other, so that if rates for three categories
336	are set to 1 : 3 : 2.5 this would have the same meaning as setting them
337	to 2 : 6 : 5.
338	The assignment of rates to
339	sites is then made by reading a file whose default name is "categories".
340	It should contain a string of digits 1 through 9. A new line or a blank
341	can occur after any character in this string. Thus the categories file
342	might look like this:
343	<P>
344	<PRE>
345	122231111122411155
346	1155333333444
347	</PRE>
348	<P>
349	With the current options R, A, and C the program has gained greatly in its
350	ability to infer different rates at different sites and estimate
351	phylogenies under a more realistic model. Note that Likelihood Ratio
352	Tests can be used to test whether one combination of rates is
353	significantly better than another, provided one rate scheme represents
354	a restriction of another with fewer parameters. The number of parameters
355	needed for rate variation is the number of regional rate categories, plus
356	the number of user-defined rate categories less 2, plus one if the
357	regional rate categories have a nonzero autocorrelation.
358	<P>
359	The G (global search) option causes, after the last species is added to
360	the tree, each possible group to be removed and re-added. This improves the
361	result, since the position of every species is reconsidered. It
362	approximately triples the run-time of the program.
363	<P>
364	The User tree (option U) is read from a file whose default name is
365	<TT>intree</TT>.
366	The trees can be multifurcating. This allows us to test the
367	hypothesis that a given branch has zero length.
368	<P>
369	If the U (user tree) option is chosen another option appears in
370	the menu, the L option. If it is selected,
371	it signals the program that it
372	should take any branch lengths that are in the user tree and
373	simply evaluate the likelihood of that tree, without further altering
374	those branch lengths. In the case of a clock, if some branches have lengths
375	and others do not, the program does not estimate the lengths of those that
376	do not have lengths given in the user tree. If any of the branches
377	do not have lengths, the program re-estimates the lengths of all of them.
378	This is done because estimating some and not others is hard in the
379	case of a clock.
380	<P>
381	The W (Weights) option is invoked in the usual way, with only weights 0
382	and 1 allowed. It selects a set of sites to be analyzed, ignoring the
383	others. The sites selected are those with weight 1. If the W option is
384	not invoked, all sites are analyzed.
385	The Weights (W) option
386	takes the weights from a file whose default name is "weights". The weights
387	follow the format described in the main documentation file.
388	<P>
389	The M (multiple data sets) option will ask you whether you want to
390	use multiple sets of weights (from the weights file) or multiple data sets
391	from the input file.
392	The ability to use a single data set with multiple weights means that
393	much less disk space will be used for this input data. The bootstrapping
394	and jackknifing tool Seqboot has the ability to create a weights file with
395	multiple weights. Note also that when we use multiple weights for
396	bootstrapping we can also then maintain different rate categories for
397	different sites in a meaningful way. You should not use the multiple
398	data sets option without using multiple weights, you should not at the
399	same time use the user-defined rate categories option (option C).
400	<P>
401	The algorithm used for searching among trees is faster than it was in
402	version 3.5, thanks to using a technique invented by David Swofford
403	and J. S. Rogers. This involves not iterating most branch lengths on most
404	trees when searching among tree topologies, This is of necessity a
405	"quick-and-dirty" search but it saves much time.
406	<P>
407	<H2>OUTPUT FORMAT</H2>
408	<P>
409	The output starts by giving the number of species, the number of sites,
410	and the base frequencies for A, C, G, and T that have been specified. It
411	then prints out the transition/transversion ratio that was specified or
412	used by default. It also uses the base frequencies to compute the actual
413	transition/transversion ratio implied by the parameter.
414	<P>
415	If the R (HMM rates) option is used a table of the relative rates of
416	expected substitution at each category of sites is printed, as well
417	as the probabilities of each of those rates.
418	<P>
419	There then follow the data sequences, if the user has selected the menu
420	option to print them out, with the base sequences printed in
421	groups of ten bases along the lines of the Genbank and EMBL formats. The
422	trees found are printed as a rooted
423	tree topology. The
424	internal nodes are numbered arbitrarily for the sake of
425	identification. The number of trees evaluated so far and the log
426	likelihood of the tree are also given. The branch lengths in the diagram are
427	roughly proportional to the estimated branch lengths, except that very short
428	branches are printed out at least three characters in length so that the
429	connections can be seen.
430	<P>
431	A table is printed
432	showing the length of each tree segment, and the time (in units of
433	expected nucleotide substitutions per site) of each fork in the tree,
434	measured from the root of the tree. I have not attempted in include
435	code for approximate confidence limits on branch points, as I have done
436	for branch lengths in DNAML, both because of the extreme crudeness of
437	that test, and because the variation of times for different forks would be
438	highly correlated.
439	<P>
440	The log likelihood printed out with the final tree can be used to perform
441	various likelihood ratio tests. One can, for example, compare runs with
442	different values of the expected transition/transversion ratio to determine
443	which value is the maximum likelihood estimate, and what is the allowable range
444	of values (using a likelihood ratio test, which you will find described in
445	mathematical statistics books). One could also estimate the base frequencies
446	in the same way. Both of these, particularly the latter, require multiple runs
447	of the program to evaluate different possible values, and this might get
448	expensive.
449	<P>
450	This program makes possible a (reasonably) legitimate
451	statistical test of the molecular clock. To do such a test, run DNAML
452	and DNAMLK on the same data. If the trees obtained are of the same
453	topology (when considered as unrooted), it is legitimate to compare
454	their likelihoods by the likelihood ratio test. In DNAML the likelihood
455	has been computed by estimating 2n-3 branch lengths, if their are n tips
456	on the tree. In DNAMLK it has been computed by estimating n-1 branching
457	times (in effect, n-1 branch lengths). The difference in the number of
458	parameters is (2n-3)-(n-1) = n-2. To perform the test take the
459	difference in log likelihoods between the two runs (DNAML should be the
460	higher of the two, barring numerical iteration difficulties) and double
461	it. Look this up on a chi-square distribution with n-2 degrees of
462	freedom. If the result is significant, the log likelihood has been
463	significantly increased by allowing all 2n-3 branch lengths to be
464	estimated instead of just n-1, and molecular clock may be rejected.
465	<P>
466	If the U (User Tree) option is used and more than one tree is supplied,
467	and the program is not told to assume autocorrelation between the
468	rates at different sites, the
469	program also performs a statistical test of each of these trees against the
470	one with highest likelihood. If there are two user trees, the test
471	done is one which is due to Kishino and Hasegawa (1989), a version
472	of a test originally introduced by Templeton (1983). In this
473	implementation it uses the mean and variance of
474	log-likelihood differences between trees, taken across sites. If the two
475	trees' means are more than 1.96 standard deviations different
476	then the trees are
477	declared significantly different. This use of the empirical variance of
478	log-likelihood differences is more robust and nonparametric than the
479	classical likelihood ratio test, and may to some extent compensate for the
480	any lack of realism in the model underlying this program.
481	<P>
482	If there are more than two trees, the test done is an extension of
483	the KHT test, due to Shimodaira and Hasegawa (1999). They pointed out
484	that a correction for the number of trees was necessary, and they
485	introduced a resampling method to make this correction. In the version
486	used here the variances and covariances of the sum of log likelihoods across
487	sites are computed for all pairs of trees. To test whether the
488	difference between each tree and the best one is larger than could have
489	been expected if they all had the same expected log-likelihood,
490	log-likelihoods for all trees are sampled with these covariances and equal
491	means (Shimodaira and Hasegawa's "least favorable hypothesis"),
492	and a P value is computed from the fraction of times the difference between
493	the tree's value and the highest log-likelihood exceeds that actually
494	observed. Note that this sampling needs random numbers, and so the
495	program will prompt the user for a random number seed if one has not
496	already been supplied. With the two-tree KHT test no random numbers
497	are used.
498	<P>
499	In either the KHT or the SH test the program
500	prints out a table of the log-likelihoods of each tree, the differences of
501	each from the highest one, the variance of that quantity as determined by
502	the log-likelihood differences at individual sites, and a conclusion as to
503	whether that tree is or is not significantly worse than the best one. However
504	the test is not available if we assume that there
505	is autocorrelation of rates at neighboring sites (option A) and is not
506	done in those cases.
507	<P>
508	The branch lengths printed out are scaled in terms of expected numbers of
509	substitutions, counting both transitions and transversions but not
510	replacements of a base by itself, and scaled so that the average rate of
511	change, averaged over all sites analyzed, is set to 1.0
512	if there are multiple categories of sites. This means that whether or not
513	there are multiple categories of sites, the expected fraction of change
514	for very small branches is equal to the branch length. Of course,
515	when a branch is twice as
516	long this does not mean that there will be twice as much net change expected
517	along it, since some of the changes occur in the same site and overlie or
518	even reverse each
519	other. The branch length estimates here are in terms of the expected
520	underlying numbers of changes. That means that a branch of length 0.26
521	is 26 times as long as one which would show a 1% difference between
522	the nucleotide sequences at the beginning and end of the branch. But we
523	would not expect the sequences at the beginning and end of the branch to be
524	26% different, as there would be some overlaying of changes.
525	<P>
526	Because of limitations of the numerical
527	algorithm, branch length estimates of zero will often print out as small
528	numbers such as 0.00001. If you see a branch length that small, it is really
529	estimated to be of zero length.
530	<P>
531	Another possible source of confusion is the existence of negative values for
532	the log likelihood. This is not really a problem; the log likelihood is not a
533	probability but the logarithm of a probability. When it is
534	negative it simply means that the corresponding probability is less
535	than one (since we are seeing its logarithm). The log likelihood is
536	maximized by being made more positive: -30.23 is worse than -29.14.
537	<P>
538	At the end of the output, if the R option is in effect with multiple
539	HMM rates, the program will print a list of what site categories
540	contributed the most to the final likelihood. This combination of
541	HMM rate categories need not have contributed a majority of the likelihood,
542	just a plurality. Still, it will be helpful as a view of where the
543	program infers that the higher and lower rates are. Note that the
544	use in this calculations of the prior probabilities of different rates,
545	and the average patch length, gives this inference a "smoothed"
546	appearance: some other combination of rates might make a greater
547	contribution to the likelihood, but be discounted because it conflicts
548	with this prior information. See the example output below to see
549	what this printout of rate categories looks like.
550	<P>
551	A second list will also be printed out, showing for each site which
552	rate accounted for the highest fraction of the likelihood. If the fraction
553	of the likelihood accounted for is less than 95%, a dot is printed instead.
554	<P>
555	Option 3 in the menu controls whether the tree is printed out into
556	the output file. This is on by default, and usually you will want to
557	leave it this way. However for runs with multiple data sets such as
558	bootstrapping runs, you will primarily be interested in the trees
559	which are written onto the output tree file, rather than the trees
560	printed on the output file. To keep the output file from becoming too
561	large, it may be wisest to use option 3 to prevent trees being
562	printed onto the output file.
563	<P>
564	Option 4 in the menu controls whether the tree estimated by the program
565	is written onto a tree file. The default name of this output tree file
566	is "outtree". If the U option is in effect, all the user-defined
567	trees are written to the output tree file.
568	<P>
569	Option 5 in the menu controls whether ancestral states are estimated
570	at each node in the tree. If it is in effect, a table of ancestral
571	sequences is printed out (including the sequences in the tip species which
572	are the input sequences). In that table, if a site has a base which
573	accounts for more than 95% of the likelihood, it is printed in capital
574	letters (A rather than a). If the best nucleotide accounts for less
575	than 50% of the likelihood, the program prints out an ambiguity code
576	(such as M for "A or C") for the set of nucleotides which, taken together,
577	account for more half of the likelihood. The ambiguity codes are listed
578	in the sequence programs documentation file. One limitation of the current
579	version of the program is that when there are multiple HMM rates
580	(option R) the reconstructed nucleotides are based on only the single
581	assignment of rates to sites which accounts for the largest amount of the
582	likelihood. Thus the assessment of 95% of the likelihood, in tabulating
583	the ancestral states, refers to 95% of the likelihood that is accounted
584	for by that particular combination of rates.
585
586	<P>
587	<H2>PROGRAM CONSTANTS</H2>
588	<P>
589	The constants defined at the beginning of the program include "maxtrees",
590	the maximum number of user trees that can be processed. It is small (100)
591	at present to save some further memory but the cost of increasing it
592	is not very great. Other constants
593	include "maxcategories", the maximum number of site
594	categories, "namelength", the length of species names in
595	characters, and three others, "smoothings", "iterations", and "epsilon", that
596	help "tune" the algorithm and define the compromise between execution speed and
597	the quality of the branch lengths found by iteratively maximizing the
598	likelihood. Reducing iterations and smoothings, and increasing epsilon, will
599	result in faster execution but a worse result. These values
600	will not usually have to be changed.
601	<P>
602	The program spends most of its time doing real arithmetic.
603	The algorithm, with separate and independent computations
604	occurring for each pattern, lends itself readily to parallel processing.
605	<P>
606	<H2>PAST AND FUTURE OF THE PROGRAM</H2>
607	<P>
608	This program was developed in 1989 by combining code from DNAPARS and from
609	DNAML. It was speeded up
610	by two major developments, the use of aliasing of nucleotide sites (version
611	3.1) and pretabulation of some exponentials (added by Akiko Fuseki in version
612	3.4). In version 3.5 the Hidden Markov Model code was added and the method
613	of iterating branch lengths was changed from an EM algorithm to direct
614	search. The Hidden Markov Model code slows things down, especially if
615	there is autocorrelation between sites, so this version is slower than
616	version 3.4. Nevertheless we hope that the sacrifice is worth it.
617	<P>
618	One change that is needed in the future is to put in some way of
619	allowing for base composition of nucleotide sequences in different parts
620	of the phylogeny.
621	<P>
622	<HR>
623	<P>
624	<H3>TEST DATA SET</H3>
625	<P>
626	<TABLE><TR><TD BGCOLOR=white>
627	<PRE>
628	5 13
629	Alpha AACGTGGCCAAAT
630	Beta AAGGTCGCCAAAC
631	Gamma CATTTCGTCACAA
632	Delta GGTATTTCGGCCT
633	Epsilon GGGATCTCGGCCC
634	</PRE>
635	</TD></TR></TABLE>
636	<P>
637	<HR>
638	<H3>CONTENTS OF OUTPUT FILE (with all numerical options on)</H3>
639	<P>
640	(It was run with HMM rates having gamma-distributed rates
641	approximated by 5 rate categories,
642	with coefficient of variation of rates 1.0, and with patch length
643	parameter = 1.5. Two user-defined rate categories were used, one for
644	the first 6 sites, the other for the last 7, with rates 1.0 : 2.0.
645	Weights were used, with sites 1 and 13 given weight 0, and all others
646	weight 1.)
647	<P>
648	<TABLE><TR><TD BGCOLOR=white>
649	<PRE>
650
651	Nucleic acid sequence
652	Maximum Likelihood method with molecular clock, version 3.6a3
653
654	5 species, 13 sites
655
656	Site categories are:
657
658	1111112222 222
659
660
661	Sites are weighted as follows:
662
663	0111111111 111
664
665
666	Name Sequences
667	---- ---------
668
669	Alpha AACGTGGCCA AAT
670	Beta AAGGTCGCCA AAC
671	Gamma CATTTCGTCA CAA
672	Delta GGTATTTCGG CCT
673	Epsilon GGGATCTCGG CCC
674
675
676
677	Empirical Base Frequencies:
678
679	A 0.23333
680	C 0.30000
681	G 0.23333
682	T(U) 0.23333
683
684	Transition/transversion ratio = 2.000000
685
686
687	Discrete approximation to gamma distributed rates
688	Coefficient of variation of rates = 1.000000 (alpha = 1.000000)
689
690	State in HMM Rate of change Probability
691
692	1 0.264 0.522
693	2 1.413 0.399
694	3 3.596 0.076
695	4 7.086 0.0036
696	5 12.641 0.000023
697
698	Expected length of a patch of sites having the same rate = 1.500
699
700
701	Site category Rate of change
702
703	1 1.000
704	2 2.000
705
706
707
708
709
710
711	+-----Epsilon
712	+------------------------------------------------4
713	! +-----Delta
714	--3
715	! +----------Gamma
716	+-------------------------------------------2
717	! +--Beta
718	+-------1
719	+--Alpha
720
721
722	Ln Likelihood = -68.25148
723
724	Ancestor Node Node Height Length
725	-------- ---- ---- ------ ------
726	root 3
727	3 4 4.37769 4.37769
728	4 Epsilon 4.92983 0.55214
729	4 Delta 4.92983 0.55214
730	3 2 3.97954 3.97954
731	2 Gamma 4.92983 0.95029
732	2 1 4.64910 0.66957
733	1 Beta 4.92983 0.28073
734	1 Alpha 4.92983 0.28073
735
736	Combination of categories that contributes the most to the likelihood:
737
738	1122121111 112
739
740	Most probable category at each site if > 0.95 probability ("." otherwise)
741
742	.......... ...
743
744
745	Probable sequences at interior nodes:
746
747	node Reconstructed sequence (caps if > 0.95)
748
749	3 .rymtyscsr ymy
750	4 .GkaTcTCGG CCy
751	Epsilon GGGATCTCGG CCC
752	Delta GGTATTTCGG CCT
753	2 .AykTcGcCA mAy
754	Gamma CATTTCGTCA CAA
755	1 .AcGTcGCCA AAy
756	Beta AAGGTCGCCA AAC
757	Alpha AACGTGGCCA AAT
758
759	</PRE>
760	</TD></TR></TABLE>
761	</BODY>
762	</HTML>

Note: See TracBrowser for help on using the repository browser.

Context Navigation

source: trunk/GDE/PHYLIP/doc/dnamlk.html

Download in other formats: