proml.html

Last change on this file was 2176, checked in by westram, 22 years ago
* empty log message *
Property svn:eol-style set to `native` Property svn:keywords set to `Author Date Id Revision`
File size: 33.1 KB

1	<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 3.2 Final//EN">
2	<HTML>
3	<HEAD>
4	<TITLE>proml</TITLE>
5	<META NAME="description" CONTENT="proml">
6	<META NAME="keywords" CONTENT="proml">
7	<META NAME="resource-type" CONTENT="document">
8	<META NAME="distribution" CONTENT="global">
9	<META HTTP-EQUIV="Content-Type" CONTENT="text/html; charset=iso-8859-1">
10	</HEAD>
11	<BODY BGCOLOR="#ccffff">
12	<DIV ALIGN=RIGHT>
13	version 3.6
14	</DIV>
15	<P>
16	<DIV ALIGN=CENTER>
17	<H1>ProML -- Protein Maximum Likelihood program</H1>
18	</DIV>
19	<P>
20	© Copyright 1986-2002 by the University of
21	Washington. Written by Joseph Felsenstein. Permission is granted to copy
22	this document provided that no fee is charged for it and that this copyright
23	notice is not removed.
24	<P>
25	This program implements the maximum likelihood method for protein
26	amino acid sequences.
27	It uses either the Jones-Taylor-Thornton or the Dayhoff
28	probability model of change between amino acids.
29	The assumptions of these models are:
30	<OL>
31	<LI>Each position in the sequence evolves independently.
32	<LI>Different lineages evolve independently.
33	<LI>Each position undergoes substitution at an expected rate which is
34	chosen from a series of rates (each with a probability of occurrence)
35	which we specify.
36	<LI>All relevant positions are included in the sequence, not just those that
37	have changed or those that are "phylogenetically informative".
38	<LI>The probabilities of change between amino acids are given by the
39	model of Jones, Taylor, and Thornton (1992) or by the PAM model of
40	Dayhoff (Dayhoff and Eck, 1968; Dayhoff et al., 1979).
41	</OL>
42	<P>
43	Note the assumption that we are looking at all positions, including those
44	that have not changed at all. It is important not to restrict attention
45	to some positions based on whether or not they have changed; doing that
46	would bias branch lengths by making them too long, and that in turn
47	would cause the method to misinterpret the meaning of those positions that
48	had changed.
49	<P>
50	This program uses a Hidden Markov Model (HMM)
51	method of inferring different rates of evolution at different amino acid
52	positions. This
53	was described in a paper by me and Gary Churchill (1996). It allows us to
54	specify to the program that there will be
55	a number of different possible evolutionary rates, what the prior
56	probability of occurrence of each is, and what the average length of a
57	patch of positions all having the same rate is. The rates can also be chosen
58	by the program to approximate a Gamma distribution of rates, or a
59	Gamma distribution plus a class of invariant positions. The program computes
60	the likelihood by summing it over all possible assignments of rates to positions,
61	weighting each by its prior probability of occurrence.
62	<P>
63	For example, if we have used the R and A options (described below) to specify
64	that there are three possible rates of evolution, 1.0, 2.4, and 0.0,
65	that the prior probabilities of a position having these rates are 0.4, 0.3, and
66	0.3, and that the average patch length (number of consecutive positions
67	with the same rate) is 2.0, the program will sum the likelihood over
68	all possibilities, but giving less weight to those that (say) assign all
69	positions to rate 2.4, or that fail to have consecutive positions that have the
70	same rate.
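<P>
The sum over all assignments of rates to positions does not have to be enumerated explicitly. The following is a minimal Python sketch, not the program's actual code, of how such a sum can be computed with the standard forward recursion for Hidden Markov Models. It assumes the transition rule described under the A option (with probability 1/patch length a new rate is drawn from the prior probabilities, otherwise the rate is kept). The per-position likelihoods in <TT>site_liks</TT> are made-up numbers; in the real program they would come from the tree. All function names are hypothetical.

```python
from itertools import product


def transition_matrix(priors, patch_length):
    """T[i][j]: with probability 1/patch_length draw a new rate from the
    priors (possibly the same rate again), otherwise keep the current rate."""
    change = 1.0 / patch_length
    k = len(priors)
    return [[(1.0 - change) * (i == j) + change * priors[j] for j in range(k)]
            for i in range(k)]


def hmm_likelihood(site_liks, priors, patch_length):
    """Forward recursion: sums over every assignment of rates to positions.
    site_liks[s][r] is the likelihood of position s under rate category r."""
    T = transition_matrix(priors, patch_length)
    k = len(priors)
    forward = [priors[r] * site_liks[0][r] for r in range(k)]
    for liks in site_liks[1:]:
        forward = [liks[j] * sum(forward[i] * T[i][j] for i in range(k))
                   for j in range(k)]
    return sum(forward)


def brute_force_likelihood(site_liks, priors, patch_length):
    """Same quantity by explicit enumeration; only feasible for tiny inputs."""
    T = transition_matrix(priors, patch_length)
    total = 0.0
    for path in product(range(len(priors)), repeat=len(site_liks)):
        weight = priors[path[0]] * site_liks[0][path[0]]
        for s in range(1, len(path)):
            weight *= T[path[s - 1]][path[s]] * site_liks[s][path[s]]
        total += weight
    return total


# Priors and patch length from the example in the text (rates 1.0, 2.4, 0.0).
priors = [0.4, 0.3, 0.3]
patch_length = 2.0
# Hypothetical per-position likelihoods, one column per rate category.
site_liks = [
    [0.020, 0.011, 0.001],
    [0.015, 0.018, 0.000],
    [0.030, 0.022, 0.040],
    [0.012, 0.016, 0.000],
]
print(hmm_likelihood(site_liks, priors, patch_length))
```

The forward recursion costs time proportional to the number of positions, whereas the enumeration grows exponentially with it; the two agree on small inputs, which is a convenient check.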
71	<P>
72	The Hidden Markov Model framework for rate variation among positions
73	was independently developed by Yang (1993, 1994, 1995). We have
74	implemented a general scheme for a Hidden Markov Model of
75	rates; we allow the rates and their prior probabilities to be specified
76	arbitrarily by the user, or by a discrete approximation to a Gamma
77	distribution of rates (Yang, 1995), or by a mixture of a Gamma
78	distribution and a class of invariant positions.
79	<P>
80	This feature effectively removes the artificial assumption that all positions
81	have the same rate, and also means that we need not know in advance the
82	identities of the positions that have a particular rate of evolution.
83	<P>
84	Another layer of rate variation is also available. The user can assign
85	a rate category to each position (for example, we might want
86	amino acid positions in the active site of a protein to change more slowly
87	than other positions). This is done with the categories input file and the
88	C option. We then specify (using the menu) the relative rates of evolution of
89	amino acid positions
90	in the different categories. For example, we might specify that positions
91	in the active site
92	evolve at relative rates of 0.2 compared to 1.0 at other positions. If we
93	are assuming that a particular position maintains a cysteine bridge to another,
94	we may want to put it in a category of positions (including perhaps the
95	initial position of the protein sequence which maintains methionine) which
96	changes at a rate of 0.0.
97	<P>
98	If both user-assigned rate categories and Hidden Markov Model rates
99	are allowed, the program assumes that the
100	actual rate at a position is the product of the user-assigned category rate
101	and the Hidden Markov Model regional rate. (This may not always make
102	perfect biological sense: it would be more natural to assume some upper
103	bound to the rate, as we have discussed in the Felsenstein and Churchill
104	paper). Nevertheless you may want to use both types of rate variation.
105	<P>
106	<H2>INPUT FORMAT AND OPTIONS</H2>
107	<P>
108	Subject to these assumptions, the program is a
109	correct maximum likelihood method. The
110	input is fairly standard, with one addition. As usual the first line of the
111	file gives the number of species and the number of amino acid positions.
112	<P>
113	Next come the species data. Each
114	sequence starts on a new line, has a ten-character species name
115	that must be blank-filled to be of that length, followed immediately
116	by the species data in the one-letter amino acid code. The sequences must
117	be in either the "interleaved" or the "sequential" format
118	described in the Molecular Sequence Programs document. The I option
119	selects between them. The sequences can have internal
120	blanks in the sequence but there must be no extra blanks at the end of the
121	terminated line. Note that a blank is not a valid symbol for a deletion.
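<P>
As an illustration of the layout just described, here is a small Python sketch that writes sequences in the sequential format. The function name is hypothetical, and the five-character width of the two header fields is an assumption made for tidy output; the text above only requires that the first line give the two counts.

```python
def format_phylip_sequential(sequences):
    """sequences: list of (name, residues) pairs in one-letter amino acid code.
    Returns the text of a sequential-format input file."""
    lengths = {len(seq) for _, seq in sequences}
    if len(lengths) != 1:
        raise ValueError("all sequences must have the same number of positions")
    # First line: number of species, then number of amino acid positions.
    lines = [f"{len(sequences):5d}{lengths.pop():5d}"]
    for name, seq in sequences:
        if len(name) > 10:
            raise ValueError(f"name longer than ten characters: {name!r}")
        # Blank-fill the name to exactly ten characters; data follows at once.
        lines.append(name.ljust(10) + seq)
    return "\n".join(lines) + "\n"


print(format_phylip_sequential([
    ("Alpha", "AACGTGGCCAAAT"),
    ("Beta", "AAGGTCGCCAAAC"),
]), end="")
```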
122	<P>
123	The options are selected using an interactive menu. The menu looks like this:
124	<P>
125	<TABLE><TR><TD BGCOLOR=white>
126	<PRE>
127	Amino acid sequence Maximum Likelihood method, version 3.6a3
128
129	Settings for this run:
130	U Search for best tree? Yes
131	P JTT or PAM amino acid change model? Jones-Taylor-Thornton model
132	C One category of sites? Yes
133	R Rate variation among sites? constant rate of change
134	W Sites weighted? No
135	S Speedier but rougher analysis? Yes
136	G Global rearrangements? No
137	J Randomize input order of sequences? No. Use input order
138	O Outgroup root? No, use as outgroup species 1
139	M Analyze multiple data sets? No
140	I Input sequences interleaved? Yes
141	0 Terminal type (IBM PC, ANSI, none)? (none)
142	1 Print out the data at start of run No
143	2 Print indications of progress of run Yes
144	3 Print out tree Yes
145	4 Write out trees onto tree file? Yes
146	5 Reconstruct hypothetical sequences? No
147
148	Y to accept these or type the letter for one to change
149
150	</PRE>
151	</TD></TR></TABLE>
152	<P>
153	The user either types "Y" (followed, of course, by a carriage-return)
154	if the settings shown are to be accepted, or the letter or digit corresponding
155	to an option that is to be changed.
156	<P>
157	The options U, W, J, O, M, and 0 are the usual ones. They are described in the
158	main documentation file of this package. Option I is the same as in
159	other molecular sequence programs and is described in the documentation file
160	for the sequence programs.
161	<P>
162	The P option toggles between two models of amino acid change. One
163	is the Jones-Taylor-Thornton model, the other the Dayhoff PAM matrix
164	model. These are both based on Margaret Dayhoff's (Dayhoff and Eck, 1968;
165	Dayhoff et al., 1979) method of empirically tabulating changes in
166	amino acid sequences and converting these counts to a probability
167	model of amino acid change. That model yields a transition probability
168	matrix, which gives the probability of changing from any
169	one amino acid to any other and also predicts the equilibrium amino acid
170	composition.
171	<P>
172	The default method is that of Jones,
173	Taylor, and Thornton (1992). This is similar to the Dayhoff
174	PAM model, except that it is based on a recounting of the number of
175	observed changes in amino acids, using a much larger sample of protein
176	sequences than did Dayhoff. Because its sample is so much larger this
177	model is to be preferred over the original Dayhoff PAM model.
178	The Dayhoff model uses Dayhoff's PAM 001 matrix from
179	Dayhoff et al. (1979), page 348.
180	<P>
181	The R (Hidden Markov Model rates) option allows the user to
182	approximate a Gamma distribution of rates among positions, or a
183	Gamma distribution plus a class of invariant positions, or to specify how
184	many categories of
185	substitution rates there will be in a Hidden Markov Model of rate
186	variation, and what the rates and probabilities are
187	for each. By repeatedly selecting the R option one toggles among
188	no rate variation, the Gamma, Gamma+I, and general HMM possibilities.
189	<P>
190	If you choose Gamma or Gamma+I the program will ask how many rate
191	categories you want. If you have chosen Gamma+I, keep in mind that
192	one rate category will be set aside for the invariant class and only
193	the remaining ones used to approximate the Gamma distribution.
194	For the approximation we do not use the quantile method of Yang (1995)
195	but instead use a quadrature method using generalized Laguerre
196	polynomials. This should give a good approximation to the Gamma
197	distribution with as few as 5 or 6 categories.
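<P>
To give a feel for the quadrature approach, the sketch below handles only the special case of shape parameter 1 (coefficient of variation 1.0), where the generalized Laguerre polynomials reduce to the ordinary ones. It is an independent illustration in plain Python, not the program's code, and the function names are hypothetical. For five categories it gives rates and probabilities that agree, after rounding, with the table in the example output at the end of this document.

```python
def laguerre(n, x):
    """Ordinary Laguerre polynomial L_n(x), via the three-term recurrence."""
    if n == 0:
        return 1.0
    prev, cur = 1.0, 1.0 - x
    for k in range(1, n):
        prev, cur = cur, ((2 * k + 1 - x) * cur - k * prev) / (k + 1)
    return cur


def laguerre_roots(n, step=0.01):
    """Bracket sign changes of L_n on a grid, then refine each by bisection.
    Adequate for the small n used here; all roots lie below 4n + 2."""
    roots = []
    a, fa = 0.0, laguerre(n, 0.0)
    while a < 4 * n + 2 and len(roots) < n:
        b = a + step
        fb = laguerre(n, b)
        if fa * fb < 0.0:
            lo, hi, flo = a, b, fa
            for _ in range(60):
                mid = 0.5 * (lo + hi)
                fmid = laguerre(n, mid)
                if flo * fmid <= 0.0:
                    hi = mid
                else:
                    lo, flo = mid, fmid
            roots.append(0.5 * (lo + hi))
        a, fa = b, fb
    return roots


def gamma_rates_alpha_one(n):
    """Rates and probabilities for a mean-1 Gamma with shape alpha = 1 only.
    The quadrature nodes are the rates; the weights are the probabilities."""
    rates = laguerre_roots(n)
    probs = [x / ((n + 1) ** 2 * laguerre(n + 1, x) ** 2) for x in rates]
    return rates, probs


rates, probs = gamma_rates_alpha_one(5)
for r, p in zip(rates, probs):
    print(f"{r:8.3f} {p:10.6f}")
```

Other shape parameters would need the generalized Laguerre polynomials and a rescaling of the nodes so that the mean rate stays at 1; that is not attempted here.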
198	<P>
199	In the Gamma and Gamma+I cases, the user will be
200	asked to supply the coefficient of variation of the rate of substitution
201	among positions. This is different from the parameters used by Nei and Jin
202	(1990) but
203	related to them: their parameter <EM>a</EM> is also known as "alpha",
204	the shape parameter of the Gamma distribution. It is
205	related to the coefficient of variation by
206	<P>
207	CV = 1 / a<SUP>1/2</SUP>
208	<P>
209	or
210	<P>
211	a = 1 / (CV)<SUP>2</SUP>
212	<P>
213	(their parameter <EM>b</EM> is absorbed here by the requirement that time is scaled so
214	that the mean rate of evolution is 1 per unit time, which means that <EM>a = b</EM>).
215	As we consider cases in which the rates are less variable we should set <EM>a</EM>
216	larger and larger, as <EM>CV</EM> gets smaller and smaller.
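<P>
The two formulas above are easy to apply by hand, but a short Python sketch makes the direction of the relationship concrete. The function names are hypothetical.

```python
import math


def alpha_from_cv(cv):
    """Gamma shape parameter a from the coefficient of variation: a = 1 / CV^2."""
    if cv <= 0.0:
        raise ValueError("the coefficient of variation must be positive")
    return 1.0 / cv ** 2


def cv_from_alpha(alpha):
    """Coefficient of variation from the shape parameter: CV = 1 / sqrt(a)."""
    if alpha <= 0.0:
        raise ValueError("the shape parameter must be positive")
    return 1.0 / math.sqrt(alpha)


# Less variable rates (smaller CV) correspond to a larger shape parameter.
for cv in (2.0, 1.0, 0.5, 0.25):
    print(cv, alpha_from_cv(cv))
```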
217	<P>
218	If the user instead chooses the general Hidden Markov Model option,
219	they are first asked how many HMM rate categories there
220	will be (for the moment there is an upper limit of 9,
221	which should not be restrictive). Then
222	the program asks for the rates for each category. These rates are
223	only meaningful relative to each other, so that rates 1.0, 2.0, and 2.4
224	have exactly the same effect as rates 2.0, 4.0, and 4.8. Note that an
225	HMM rate category
226	can have rate of change 0, so that this allows us to take into account that
227	there may be a category of amino acid positions that are invariant. Note that
228	the run time
229	of the program will be proportional to the number of HMM rate categories:
230	twice as
231	many categories means twice as long a run. Finally the program will ask for
232	the probabilities of a random amino acid position falling into each of these
233	regional rate categories. These probabilities must be nonnegative and sum to
234	1. Default
235	for the program is one category, with rate 1.0 and probability 1.0 (actually
236	the rate does not matter in that case).
237	<P>
238	If more than one HMM rate category is specified, then another
239	option, A, becomes
240	visible in the menu. This allows us to specify that we want to assume that
241	positions that have the same HMM rate category are expected to be clustered
242	so that there is autocorrelation of rates. The
243	program asks for the value of the average patch length. This is an expected
244	length of patches that have the same rate. If it is 1, the rates of
245	successive positions will be independent. If it is, say, 10.25, then the
246	chance of change to a new rate will be 1/10.25 after every position. However
247	the "new rate" is randomly drawn from the mix of rates, and hence could
248	even be the same. So the actual observed length of patches with the same
249	rate will be a bit larger than 10.25. As noted below, if you specify
250	multiple HMM rate categories, the output file will report
251	which combination of rate categories contributed most to the likelihood.
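<P>
The remark that observed patches run longer than the specified patch length can be made quantitative. Under the rule described above, a position leaves its current rate only if a change occurs (probability 1/patch length) and the newly drawn rate differs from the current one (probability one minus that rate's prior). The Python sketch below works out the resulting expected run length; the prior of 0.25 in the example is an assumption chosen for illustration, and the function name is hypothetical.

```python
def expected_run_length(patch_length, prior):
    """Expected number of consecutive positions sharing one rate, for a rate
    whose prior probability is `prior`, given the average patch length."""
    if patch_length < 1.0:
        raise ValueError("the patch length must be at least 1")
    if not 0.0 <= prior < 1.0:
        raise ValueError("the prior must lie in [0, 1)")
    # Probability that the next position has a different rate.
    leave = (1.0 / patch_length) * (1.0 - prior)
    return 1.0 / leave


# Patch length 10.25 as in the text; prior 0.25 is an illustrative assumption.
print(expected_run_length(10.25, 0.25))
```

The smaller a rate's prior probability, the closer its expected run length is to the specified patch length.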
252	<P>
253	Note that the autocorrelation scheme we use is somewhat different
254	from Yang's (1995) autocorrelated Gamma distribution. I am unsure
255	whether this difference is of any importance -- our scheme is chosen
256	for the ease with which it can be implemented.
257	<P>
258	The C option allows user-defined rate categories. The user is prompted
259	for the number of user-defined rates, and for the rates themselves,
260	which cannot be negative but can be zero. These numbers
261	are meaningful only
262	relative to each other, so that if rates for three categories
263	are set to 1 : 3 : 2.5 this would have the same meaning as setting them
264	to 2 : 6 : 5.
265	The assignment of rates to amino acid positions
266	is then made by reading a file whose default name is "categories".
267	It should contain a string of digits 1 through 9. A new line or a blank
268	can occur after any character in this string. Thus the categories file
269	might look like this:
270	<P>
271	<PRE>
272	122231111122411155
273	1155333333444
274	</PRE>
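<P>
A categories file of this kind is simple to read. The Python sketch below, with a hypothetical function name and hypothetical rates for five categories, maps the digit string shown above to one relative rate per position, skipping blanks and line breaks as the format allows.

```python
def parse_categories(text, rates):
    """Map a categories string (digits 1-9, blanks and newlines ignored)
    to a list with the relative rate of each position."""
    per_position = []
    for ch in text:
        if ch.isspace():
            continue  # a new line or blank may follow any character
        if ch not in "123456789":
            raise ValueError(f"invalid category symbol: {ch!r}")
        index = int(ch)
        if index > len(rates):
            raise ValueError(f"category {index} has no rate defined")
        per_position.append(rates[index - 1])
    return per_position


example = "122231111122411155\n1155333333444\n"
# Hypothetical relative rates for categories 1 through 5.
rates = [1.0, 3.0, 2.5, 0.2, 0.0]
per_position = parse_categories(example, rates)
print(len(per_position), per_position[:6])
```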
275	<P>
276	With the current options R, A, and C the program has a good
277	ability to infer different rates at different positions and estimate
278	phylogenies under a more realistic model. Note that Likelihood Ratio
279	Tests can be used to test whether one combination of rates is
280	significantly better than another, provided one rate scheme represents
281	a restriction of another with fewer parameters. The number of parameters
282	needed for rate variation is the number of regional rate categories, plus
283	the number of user-defined rate categories less 2, plus one if the
284	regional rate categories have a nonzero autocorrelation.
285	<P>
286	The G (global search) option causes, after the last species is added to
287	the tree, each possible group to be removed and re-added. This improves the
288	result, since the position of every species is reconsidered. It
289	approximately triples the run-time of the program.
290	<P>
291	The User tree (option U) is read from a file whose default name is
292	<TT>intree</TT>. The trees can be multifurcating. They must be
293	preceded in the file by a line giving the number of trees in the file.
294	<P>
295	If the U (user tree) option is chosen another option appears in
296	the menu, the L option. If it is selected,
297	it signals the program that it
298	should take any branch lengths that are in the user tree and
299	simply evaluate the likelihood of that tree, without further altering
300	those branch lengths. This means that if some branches have lengths
301	and others do not, the program will estimate the lengths of those that
302	do not have lengths given in the user tree. Note that the program RETREE
303	can be used to add and remove lengths from a tree.
304	<P>
305	The U option can read a multifurcating tree. This allows us to
306	test the hypothesis that a certain branch has zero length (we can also
307	do this by using RETREE to set the length of that branch to 0.0 when
308	it is present in the tree). By
309	doing a series of runs with different specified lengths for a branch we
310	can plot a likelihood curve for its branch length while allowing all
311	other branches to adjust their lengths to it. If all branches have
312	lengths specified, none of them will be iterated. This is useful to allow
313	a tree produced by another method to have its likelihood
314	evaluated. The L option has no effect and does not appear in the
315	menu if the U option is not used.
316	<P>
317	The W (Weights) option is invoked in the usual way, with only weights 0
318	and 1 allowed. It selects a set of positions to be analyzed, ignoring the
319	others. The positions selected are those with weight 1. If the W option is
320	not invoked, all positions are analyzed.
321	The Weights (W) option
322	takes the weights from a file whose default name is "weights". The weights
323	follow the format described in the main documentation file.
324	<P>
325	The M (multiple data sets) option will ask you whether you want to
326	use multiple sets of weights (from the weights file) or multiple data sets
327	from the input file.
328	The ability to use a single data set with multiple weights means that
329	much less disk space will be used for this input data. The bootstrapping
330	and jackknifing tool Seqboot has the ability to create a weights file with
331	multiple weights. Note also that when we use multiple weights for
332	bootstrapping we can also then maintain different rate categories for
333	different positions in a meaningful way. If you use the multiple
334	data sets option without using multiple weights, you should not at the
335	same time use the user-defined rate categories option (option C).
336	<P>
337	The algorithm used for searching among trees uses
338	a technique invented by David Swofford
339	and J. S. Rogers. This involves not iterating most branch lengths on most
340	trees when searching among tree topologies. This is of necessity a
341	"quick-and-dirty" search but it saves much time. There is a menu option
342	(option S) which can turn off this search and revert to the earlier
343	search method which iterated branch lengths in all topologies. This will
344	be substantially slower but will also be a bit more likely to find the
345	tree topology of highest likelihood. If the Swofford/Rogers search
346	finds the best tree topology, the branch lengths inferred will
347	be almost precisely the same as they would be with the more thorough
348	search, as the maximization of likelihood with respect to branch
349	lengths for the final tree is not different in the two kinds of search.
350	<P>
351	<H2>OUTPUT FORMAT</H2>
352	<P>
353	The output starts by giving the number of species and the number of amino acid
354	positions.
355	<P>
356	If the R (HMM rates) option is used a table of the relative rates of
357	expected substitution at each category of positions is printed, as well
358	as the probabilities of each of those rates.
359	<P>
360	There then follow the data sequences, if the user has selected the menu
361	option to print them, with the sequences printed in
362	groups of ten amino acids. The
363	trees found are printed as an unrooted
364	tree topology (possibly rooted by outgroup if so requested). The
365	internal nodes are numbered arbitrarily for the sake of
366	identification. The number of trees evaluated so far and the log
367	likelihood of the tree are also given. Note that the trees printed out
368	have a trifurcation at the base. The branch lengths in the diagram are
369	roughly proportional to the estimated branch lengths, except that very short
370	branches are printed out at least three characters in length so that the
371	connections can be seen. The unit of branch length is the expected
372	fraction of amino acids changed (so that 1.0 is 100 PAMs).
373	<P>
374	A table is printed
375	showing the length of each tree segment (in units of expected amino acid
376	substitutions per position), as well as (very) rough confidence
377	limits on their lengths. If a confidence limit is
378	negative, this indicates that rearrangement of the tree in that region
379	is not excluded, while if both limits are positive, rearrangement is
380	still not necessarily excluded because the variance calculation on which
381	the confidence limits are based results in an underestimate, which makes
382	the confidence limits too narrow.
383	<P>
384	In addition to the confidence limits,
385	the program performs a crude Likelihood Ratio Test (LRT) for each
386	branch of the tree. The program computes the ratio of likelihoods with and
387	without this branch length forced to zero length. This is done by comparing the
388	likelihoods changing only that branch length. A truly correct LRT would
389	force that branch length to zero and also allow the other branch lengths to
390	adjust to that. The result would be a likelihood ratio closer to 1. Therefore
391	the present LRT will err on the side of being too significant. YOU ARE
392	WARNED AGAINST TAKING IT TOO SERIOUSLY. If you want to get a better
393	likelihood curve for a branch length you can do multiple runs with
394	different prespecified lengths for that branch, as discussed above in the
395	discussion of the L option.
396	<P>
397	One should also
398	realize that if you are looking not at a previously-chosen branch but at all
399	branches, you are seeing the results of multiple tests. With 20 tests,
400	one is expected to reach significance at the P = .05 level purely by
401	chance. You should therefore use a much more conservative significance level,
402	such as .05 divided by the number of tests. The significance of these tests
403	is shown by printing asterisks next to
404	the confidence interval on each branch length. It is important to keep
405	in mind that both the confidence limits and the tests
406	are very rough and approximate, and probably indicate more significance than
407	they should. Nevertheless, maximum likelihood is one of the few methods that
408	can give you any indication of its own error; most other methods simply fail to
409	warn the user that there is any error! (In fact, whole philosophical schools
410	of taxonomists exist whose main point seems to be that there isn't any
411	error, that the "most parsimonious" tree is the best tree by definition and
412	that's that).
413	<P>
414	The log likelihood printed out with the final tree can be used to perform
415	various likelihood ratio tests. One can, for example, compare runs with
416	different values of the relative rate of change in the active site and in
417	the rest of the protein to determine
418	which value is the maximum likelihood estimate, and what is the allowable range
419	of values (using a likelihood ratio test, which you will find described in
420	mathematical statistics books). One could also estimate the base frequencies
421	in the same way. Both of these, particularly the latter, require multiple runs
422	of the program to evaluate different possible values, and this might get
423	expensive.
424	<P>
425	If the U (User Tree) option is used and more than one tree is supplied,
426	and the program is not told to assume autocorrelation between the
427	rates at different amino acid positions, the
428	program also performs a statistical test of each of these trees against the
429	one with highest likelihood. If there are two user trees, the test
430	done is one which is due to Kishino and Hasegawa (1989), a version
431	of a test originally introduced by Templeton (1983). In this
432	implementation it uses the mean and variance of
433	log-likelihood differences between trees, taken across amino acid
434	positions. If the two
435	trees' means are more than 1.96 standard deviations different
436	then the trees are
437	declared significantly different. This use of the empirical variance of
438	log-likelihood differences is more robust and nonparametric than the
439	classical likelihood ratio test, and may to some extent compensate for
440	any lack of realism in the model underlying this program.
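<P>
The two-tree comparison described above can be sketched in a few lines of Python. This is an illustration of the idea, not the program's implementation: it takes per-position log-likelihoods for two trees (the numbers below are invented), estimates the standard deviation of the total difference from the empirical variance across positions, and applies the 1.96 threshold. The function name is hypothetical.

```python
import math


def kh_test(site_lnl_a, site_lnl_b, critical=1.96):
    """Return (total difference, its estimated standard deviation, significant?)
    from per-position log-likelihoods of two trees."""
    n = len(site_lnl_a)
    if n != len(site_lnl_b) or n < 2:
        raise ValueError("need two equally long lists with at least 2 positions")
    diffs = [a - b for a, b in zip(site_lnl_a, site_lnl_b)]
    total = sum(diffs)
    mean = total / n
    # Empirical variance of the per-position differences.
    var_site = sum((d - mean) ** 2 for d in diffs) / (n - 1)
    # The total is a sum of n such differences.
    sd_total = math.sqrt(n * var_site)
    if sd_total == 0.0:
        return total, sd_total, False
    return total, sd_total, abs(total) / sd_total > critical


# Invented per-position log-likelihoods for two user trees.
tree_a = [-2.1, -3.4, -1.9, -4.0, -2.7, -3.3]
tree_b = [-2.5, -3.3, -2.6, -4.1, -3.5, -3.2]
print(kh_test(tree_a, tree_b))
```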
441	<P>
442	If there are more than two trees, the test done is an extension of
443	the KHT test, due to Shimodaira and Hasegawa (1999). They pointed out
444	that a correction for the number of trees was necessary, and they
445	introduced a resampling method to make this correction. In the version
446	used here the variances and covariances of the sum of log likelihoods across
447	amino acid positions are computed for all pairs of trees. To test whether the
448	difference between each tree and the best one is larger than could have
449	been expected if they all had the same expected log-likelihood,
450	log-likelihoods for all trees are sampled with these covariances and equal
451	means (Shimodaira and Hasegawa's "least favorable hypothesis"),
452	and a P value is computed from the fraction of times the difference between
453	the tree's value and the highest log-likelihood exceeds that actually
454	observed. Note that this sampling needs random numbers, and so the
455	program will prompt the user for a random number seed if one has not
456	already been supplied. With the two-tree KHT test no random numbers
457	are used.
458	<P>
459	In either the KHT or the SH test the program
460	prints out a table of the log-likelihoods of each tree, the differences of
461	each from the highest one, the variance of that quantity as determined by
462	the log-likelihood differences at individual sites, and a conclusion as to
463	whether that tree is or is not significantly worse than the best one. However
464	the test is not available if we assume that there
465	is autocorrelation of rates at neighboring positions (option A) and is not
466	done in those cases.
467	<P>
468	The branch lengths printed out are scaled in terms of expected numbers of
469	amino acid substitutions, scaled so that the average rate of
470	change, averaged over all the positions analyzed, is set to 1.0
471	if there are multiple categories of positions. This means that whether or not
472	there are multiple categories of positions, the expected fraction of change
473	for very small branches is equal to the branch length. Of course,
474	when a branch is twice as
475	long this does not mean that there will be twice as much net change expected
476	along it, since some of the changes occur in the same position and overlie or
477	even reverse each
478	other. The branch length estimates here are in terms of the expected
479	underlying numbers of changes. That means that a branch of length 0.26
480	is 26 times as long as one which would show a 1% difference between
481	the amino acid sequences at the beginning and end of the branch. But we
482	would not expect the sequences at the beginning and end of the branch to be
483	26% different, as there would be some overlaying of changes.
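<P>
The gap between branch length and observed difference can be illustrated with a deliberately simplified model. The Python sketch below assumes that all 20 amino acids are equally frequent and that every change is equally likely. This is an assumption made only for the illustration; it is not the Jones-Taylor-Thornton or Dayhoff model that the program uses, so the numbers are indicative only. The function name is hypothetical.

```python
import math


def expected_difference(branch_length, states=20):
    """Expected fraction of positions that differ between the two ends of a
    branch, under an equal-rates model with `states` equally frequent states.
    branch_length is the expected number of changes per position."""
    k = float(states)
    return (k - 1.0) / k * (1.0 - math.exp(-k * branch_length / (k - 1.0)))


for d in (0.01, 0.26, 1.0):
    print(d, round(expected_difference(d), 4))
```

Under this simplified model a branch of length 0.26 gives an expected difference of roughly 0.23 rather than 0.26, while for very short branches the two quantities nearly coincide.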
484	<P>
485	Confidence limits on the branch lengths are
486	also given. Of course a
487	negative value of the branch length is meaningless, and a confidence
488	limit overlapping zero simply means that the branch length is not necessarily
489	significantly different from zero. Because of limitations of the numerical
490	algorithm, branch length estimates of zero will often print out as small
491	numbers such as 0.00001. If you see a branch length that small, it is really
492	estimated to be of zero length.
493	<P>
494	Another possible source of confusion is the existence of negative values for
495	the log likelihood. This is not really a problem; the log likelihood is not a
496	probability but the logarithm of a probability. When it is
497	negative it simply means that the corresponding probability is less
498	than one (since we are seeing its logarithm). The log likelihood is
499	maximized by being made more positive: -30.23 is worse than -29.14.
500	<P>
501	At the end of the output, if the R option is in effect with multiple
502	HMM rates, the program will print a list of what amino acid position
503	categories contributed the most to the final likelihood. This combination of
504	HMM rate categories need not have contributed a majority of the likelihood,
505	just a plurality. Still, it will be helpful as a view of where the
506	program infers that the higher and lower rates are. Note that the
507	use in this calculation of the prior probabilities of different rates,
508	and the average patch length, gives this inference a "smoothed"
509	appearance: some other combination of rates might make a greater
510	contribution to the likelihood, but be discounted because it conflicts
511	with this prior information. See the example output below to see
512	what this printout of rate categories looks like.
513	A second list will also be printed out, showing for each position which
514	rate accounted for the highest fraction of the likelihood. If the fraction
515	of the likelihood accounted for is less than 95%, a dot is printed instead.
516	<P>
517	Option 3 in the menu controls whether the tree is printed out into
518	the output file. This is on by default, and usually you will want to
519	leave it this way. However for runs with multiple data sets such as
520	bootstrapping runs, you will primarily be interested in the trees
521	which are written onto the output tree file, rather than the trees
522	printed on the output file. To keep the output file from becoming too
523	large, it may be wisest to use option 3 to prevent trees being
524	printed onto the output file.
525	<P>
526	Option 4 in the menu controls whether the tree estimated by the program
527	is written onto a tree file. The default name of this output tree file
528	is "outtree". If the U option is in effect, all the user-defined
529	trees are written to the output tree file.
530	<P>
531	Option 5 in the menu controls whether ancestral states are estimated
532	at each node in the tree. If it is in effect, a table of ancestral
533	sequences is printed out (including the sequences in the tip species which
534	are the input sequences).
535	The symbol printed out is for the amino acid which accounts for the
536	largest fraction of the likelihood at that position.
537	In that table, if a position has an amino acid which
538	accounts for more than 95% of the likelihood, its symbol is printed in capital
539	letters (W rather than w). One limitation of the current
540	version of the program is that when there are multiple HMM rates
541	(option R) the reconstructed amino acids are based on only the single
542	assignment of rates to positions which accounts for the largest amount of the
543	likelihood. Thus the assessment of 95% of the likelihood, in tabulating
544	the ancestral states, refers to 95% of the likelihood that is accounted
545	for by that particular combination of rates.
546	<P>
547	<H2>PROGRAM CONSTANTS</H2>
548	<P>
549	The constants defined at the beginning of the program include "maxtrees",
550	the maximum number of user trees that can be processed. It is small (100)
551	at present to save some further memory but the cost of increasing it
552	is not very great. Other constants
553	include "maxcategories", the maximum number of position
554	categories, "namelength", the length of species names in
555	characters, and three others, "smoothings", "iterations", and "epsilon", that
556	help "tune" the algorithm and define the compromise between execution speed and
557	the quality of the branch lengths found by iteratively maximizing the
558	likelihood. Reducing iterations and smoothings, and increasing epsilon, will
559	result in faster execution but a worse result. These values
560	will not usually have to be changed.
561	<P>
562	The program spends most of its time doing real arithmetic.
563	The algorithm, with separate and independent computations
564	occurring for each pattern, lends itself readily to parallel processing.
565	<P>
566	<H2>PAST AND FUTURE OF THE PROGRAM</H2>
567	<P>
568	This program was derived in version 3.6 by Lucas Mix from DNAML,
569	with which it shares
570	many of its data structures and much of its strategy.
571	<P>
572	<HR>
573	<P>
574	<H3>TEST DATA SET</H3>
575	<P>
576	(Note that although these may look like DNA sequences, they are being
577	treated as protein sequences consisting entirely of alanine, cysteine,
578	glycine, and threonine).
579	<P>
580	<TABLE><TR><TD BGCOLOR=white>
581	<PRE>
582	    5   13
583	Alpha     AACGTGGCCAAAT
584	Beta      AAGGTCGCCAAAC
585	Gamma     CATTTCGTCACAA
586	Delta     GGTATTTCGGCCT
587	Epsilon   GGGATCTCGGCCC
588	</PRE>
589	</TD></TR></TABLE>
590	<P>
591	<HR>
592	<H3>CONTENTS OF OUTPUT FILE (with all numerical options on)</H3>
593	<P>
594	(It was run with HMM rates having gamma-distributed rates
595	approximated by 5 rate categories,
596	with coefficient of variation of rates 1.0, and with patch length
597	parameter = 1.5. Two user-defined rate categories were used, one for
598	the first 6 positions, the other for the last 7, with rates 1.0 : 2.0.
599	Weights were used, with sites 1 and 13 given weight 0, and all others
600	weight 1.)
601	<P>
602	<TABLE><TR><TD BGCOLOR=white>
603	<PRE>
604
605	Amino acid sequence Maximum Likelihood method, version 3.6a3
606
607	5 species, 13 sites
608
609	Site categories are:
610
611	1111112222 222
612
613
614	Sites are weighted as follows:
615
616	0111111111 111
617
618	Jones-Taylor-Thornton model of amino acid change
619
620
621	Name Sequences
622	---- ---------
623
624	Alpha AACGTGGCCA AAT
625	Beta ..G..C.... ..C
626	Gamma C.TT.C.T.. C.A
627	Delta GGTA.TT.GG CC.
628	Epsilon GGGA.CT.GG CCC
629
630
631
632	Discrete approximation to gamma distributed rates
633	Coefficient of variation of rates = 1.000000 (alpha = 1.000000)
634
635	States in HMM Rate of change Probability
636
637	1 0.264 0.522
638	2 1.413 0.399
639	3 3.596 0.076
640	4 7.086 0.0036
641	5 12.641 0.000023
642
643
644
645	Site category Rate of change
646
647	1 1.000
648	2 2.000
649
650
651
652	  +Beta
653	  |
654	  |                                       +Epsilon
655	  |         +-----------------------------3
656	  1---------2                             +-------------------Delta
657	  |         |
658	  |         +--------------------------Gamma
659	  |
660	  +-----------------Alpha
661
662
663	remember: this is an unrooted tree!
664
665	Ln Likelihood = -121.49044
666
667	 Between        And            Length      Approx. Confidence Limits
668	 -------        ---            ------      ------- ---------- ------
669
670	    1          Alpha           60.18362    (     zero,   135.65380) **
671	    1          Beta             0.00010    (     zero,    infinity)
672	    1             2            32.56292    (     zero,    96.08019) *
673	    2             3           141.85557    (     zero,   304.10906) **
674	    3          Epsilon          0.00010    (     zero,    infinity)
675	    3          Delta           68.68682    (     zero,   151.95402) **
676	    2          Gamma           89.79037    (     zero,   198.93830) **
677
678	* = significantly positive, P < 0.05
679	** = significantly positive, P < 0.01
680
681	Combination of categories that contributes the most to the likelihood:
682
683	1122121111 112
684
685	Most probable category at each site if > 0.95 probability ("." otherwise)
686
687	....1..... ...
688
689	Probable sequences at interior nodes:
690
691	node Reconstructed sequence (caps if > 0.95)
692
693	1 .AGGTCGCCA AAC
694	Beta AAGGTCGCCA AAC
695	2 .AggTCGCCA CAC
696	3 .GGATCTCGG CCC
697	Epsilon GGGATCTCGG CCC
698	Delta GGTATTTCGG CCT
699	Gamma CATTTCGTCA CAA
700	Alpha AACGTGGCCA AAT
701
702	</PRE>
703	</TD></TR></TABLE>
704	</BODY>
705	</HTML>


source: trunk/GDE/PHYLIP/doc/proml.html
