Context Navigation

restml.html

Visit:

Last change on this file was 2176, checked in by westram, 22 years ago
* empty log message *
Property svn:eol-style set to `native` Property svn:keywords set to `Author Date Id Revision`
File size: 20.0 KB

Line
1	<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 3.2 Final//EN">
2	<HTML>
3	<HEAD>
4	<TITLE>restml</TITLE>
5	<META NAME="description" CONTENT="restml">
6	<META NAME="keywords" CONTENT="restml">
7	<META NAME="resource-type" CONTENT="document">
8	<META NAME="distribution" CONTENT="global">
9	<META HTTP-EQUIV="Content-Type" CONTENT="text/html; charset=iso-8859-1">
10	</HEAD>
11	<BODY BGCOLOR="#ccffff">
12	<DIV ALIGN=RIGHT>
13	version 3.6
14	</DIV>
15	<P>
16	<DIV ALIGN=CENTER>
17	<H1>RESTML -- Restriction sites Maximum Likelihood program</H1>
18	</DIV>
19	<P>
20	© Copyright 1986-2000 by The University of
21	Washington. Written by Joseph Felsenstein. Permission is granted to copy
22	this document provided that no fee is charged for it and that this copyright
23	notice is not removed.
24	<P>
25	This program implements a maximum likelihood method for restriction sites
26	data (not restriction fragment data). This program is one of the slowest
27	programs in this package, and can be very tedious to run. It is
28	possible to have the program search for the maximum likelihood tree.
29	It will be more practical for some users (those that do not have
30	fast machines) to use the U (User Tree)
31	option, which takes less run time, optimizing branch lengths and computing
32	likelihoods for particular tree topologies suggested by the user. The
33	model used here is essentially identical to that used by Smouse and Li (1987)
34	who give explicit expressions for computing the likelihood for three-species
35	trees. It does not place prior probabilities on trees as they do. The present
36	program extends their approach to multiple species by
37	a technique which, while it does not give explicit expressions for likelihoods,
38	does enable their computation and the iterative improvement of branch
39	lengths. It also allows for multiple restriction enzymes. The algorithm
40	has been described in a paper (Felsenstein, 1992). Another relevant
41	paper is that of DeBry and Slade (1985).
42	<P>
43	The assumptions of the present model are:
44	<P>
45	<OL>
46	<LI>Each restriction site evolves independently.
47	<LI>Different lineages evolve independently.
48	<LI>Each site undergoes substitution at an expected rate which we specify.
49	<LI>Substitutions consist of replacement of a nucleotide by one of the
50	other three nucleotides, chosen at random.
51	</OL>
52	<P>
53	Note that if the existing base is, say, an A, the chance of it being
54	replaced by a G is 1/3, and so is the chance that it is replaced by
55	a T. This means that there can be no difference in the (expected)
56	rate of transitions and transversions. Users who are upset at this
57	might ponder the fact that a version allowing different rates of
58	transitions and transversions would run an estimated 16 times
59	slower. If it also allowed for unequal frequencies of the four bases,
60	it would run about 300,000 times slower! For the moment, until a better
61	method is available, I guess I'll stick with this one!
62	<P>
63	<H2>INPUT FORMAT AND OPTIONS</H2>
64	<P>
65	Subject to these assumptions, the program is an approximately
66	correct maximum likelihood method. The
67	input is fairly standard, with one addition. As usual the first line of the
68	file gives the number of species and the number of sites, but there is also
69	a third number, which is the number of different restriction enzymes that were
70	used to detect the restriction sites. Thus a data set with 10 species and
71	35 different sites, representing digestion with 4 different enzymes, would
72	have the first line of the data file look like this:
73	<P>
74	<PRE>
75	10 35 4
76	</PRE>
77	<P>
78	The first line of the data file will also contain a letter W following
79	these numbers (and separated from them by a space) if the Weights option
80	is being used. As with all programs using the weights option, a line or lines
81	must then follow, before the data, with the weights for each site.
82	<P>
83	The site data are in standard form. Each species starts with a species name
84	whose maximum length is given by the constant "nmlngth"
85	(whose value in the
86	program as distributed is 10 characters). The name should, as usual, be padded
87	out to that length with blanks if necessary. The sites data then follows, one
88	character per site (any blanks will
89	be skipped and ignored). Like the DNA and protein sequence data, the
90	restriction sites data may be either in the "interleaved" form or the
91	"sequential" form. Note that if you are analyzing restriction sites
92	data with the programs DOLLOP or MIX or other discrete character
93	programs, at the moment those programs do not use the "aligned" or
94	"interleaved" data format. Therefore you may want to avoid that format
95	when you have restriction sites data that you will want to feed into
96	those programs.
97	<P>
98	The presence of a site is indicated by a "+" and the absence by a "-". I have
99	also allowed the use of "1" and "0" as synonyms for "+" and "-", for
100	compatibility with MIX and DOLLOP which do not allow "+" and "-". If the
101	presence of
102	the site is unknown (for example, if the DNA containing it has been deleted so
103	that one
104	does not know whether it would have contained the site) then the state "?" can
105	be used to indicate that the state of this site is unknown.
106	<P>
107	User-defined trees may follow the
108	data in the usual way. The trees must be unrooted, which means that at their
109	base they must have a trifurcation.
110	<P>
111	The options are selected by a menu, which looks like this:
112	<P>
113	<TABLE><TR><TD BGCOLOR=white>
114	<PRE>
115
116	Restriction site Maximum Likelihood method, version 3.6
117
118	Settings for this run:
119	U Search for best tree? Yes
120	A Are all sites detected? No
121	S Speedier but rougher analysis? Yes
122	G Global rearrangements? No
123	J Randomize input order of sequences? No. Use input order
124	L Site length? 6
125	O Outgroup root? No, use as outgroup species 1
126	M Analyze multiple data sets? No
127	I Input sequences interleaved? Yes
128	0 Terminal type (IBM PC, ANSI, none)? (none)
129	1 Print out the data at start of run No
130	2 Print indications of progress of run Yes
131	3 Print out tree Yes
132	4 Write out trees onto tree file? Yes
133
134	Y to accept these or type the letter for one to change
135	</PRE>
136	</TD></TR></TABLE>
137	<P>
138	The U, J, O, M, and 0 options are the usual ones, described in the main
139	documentation file. The user trees for option U are read from a file whose
140	default name is <TT>intree</TT>. The I option selects between Interleaved and
141	Sequential input data formats, and is described in the documentation file for
142	the molecular sequences programs.
143	<P>
144	The G
145	(global search) option causes, after the last species is added to the tree,
146	each possible group to be removed and re-added. This improves the result,
147	since the position of every species is reconsidered. It approximately triples
148	the run-time of the program.
149	<P>
150	The two options specific to this program are the A, and L options. The L
151	(Length) option allows the user to specify the length in bases of the
152	restriction sites. Allowed values are 1 to 8 (the
153	constant "maxcutter" in file <TT>phylip.h</TT>
154	controls the maximum allowed value). At the
155	moment the program assumes that all sites have the same length (for example,
156	that all enzymes are 6-base-cutters). The default value for this parameter is
157	6, which will be used if the L option is not invoked. A desirable future
158	development for the package would be allowing the L parameter to be different
159	for every site. It would also be desirable to allow for ambiguities in the
160	recognition site, since some enzymes recognize 2 or 4 sequences. Both of these
161	would require fairly complicated programming or else slower execution times.
162	<P>
163	The A (All) option specifies that all sites are detected, even those for which
164	all of the species have the recognition sequence absent (character state
165	"-"). The default condition is that it is assumed that such sites will not
166	occur in the data. The likelihood computed when the A option is not used is
167	the probability of the pattern of sites given that tree and conditional on the
168	pattern not being all absences. This will be realistic for most data, except
169	for cases in which the data are extracted from sites data for a larger number
170	of species, in which case some of the site positions could have all absences in
171	the subset of species. In such cases an effective way of analyzing the data
172	would be to omit those sites and not use the A option, as such positions, even
173	if not absolutely excluded, are nevertheless less likely than random to have
174	been incorporated in the data set.
175	<P>
176	The W (Weights) option, which is invoked in the input file rather than in
177	the menu, allows the user to select a subset of sites to
178	be analyzed. It is invoked in the usual way, except that only weights
179	0 and 1 are allowed. If the W option is not used, all sites will be
180	analyzed. If the Weights option is used, there must be a W in the first
181	line of the input file.
182	<P>
183	<H2>OUTPUT FORMAT</H2>
184	<P>
185	The output starts by giving the number of species, and the number of sites. If
186	the default condition is used instead of the A option the program states that
187	it is assuming that sites absent in all species have been omitted. The value
188	of the site length (6 bases, for example) is also given.
189	<P>
190	If option 2 (print out data) has been selected,
191	there then follow the restriction site sequences, printed in
192	groups of ten sites. The trees found are printed as an unrooted
193	tree topology (possibly rooted by outgroup if so requested). The
194	internal nodes are numbered arbitrarily for the sake of
195	identification. The number of trees evaluated so far and the log
196	likelihood of the tree are also given.
197	<P>
198	A table is printed
199	showing the length of each tree segment, as well as (very) rough confidence
200	limits on the length. As with DNAML, if a confidence limit is
201	negative, this indicates that rearrangement of the tree in that region
202	is not excluded, while if both limits are positive, rearrangement is
203	still not necessarily excluded because the variance calculation on which
204	the confidence limits are based results in an underestimate, which makes
205	the confidence limits too narrow.
206	<P>
207	In addition to the confidence limits,
208	the program performs a crude Likelihood Ratio Test (LRT) for each
209	branch of the tree. The program computes the ratio of likelihoods with and
210	without this branch length forced to zero length. This done by comparing the
211	likelihoods changing only that branch length. A truly correct LRT would
212	force that branch length to zero and also allow the other branch lengths to
213	adjust to that. The result would be a likelihood ratio closer to 1. Therefore
214	the present LRT will err on the side of being too significant.
215	<P>
216	One should also
217	realize that if you are looking not at a previously-chosen branch but at all
218	branches, that you are seeing the results of multiple tests. With 20 tests,
219	one is expected to reach significance at the P = .05 level purely by
220	chance. You should therefore use a much more conservative significance level,
221	such as .05 divided by the number of tests. The significance of these tests
222	is shown by printing asterisks next to
223	the confidence interval on each branch length. It is important to keep
224	in mind that both the confidence limits and the tests
225	are very rough and approximate, and probably indicate more significance than
226	they should. Nevertheless, maximum likelihood is one of the few methods that
227	can give you any indication of its own error; most other methods simply fail to
228	warn the user that there is any error! (In fact, whole philosophical schools
229	of taxonomists exist whose main point seems to be that there isn't any
230	error, that the "most parsimonious" tree is the best tree by definition and
231	that's that).
232	<P>
233	The log likelihood printed out with the final tree can be used to perform
234	various likelihood ratio tests. Remember that testing one tree topology
235	against another is not a simple matter, because two different tree topologies
236	are not hypotheses that are
237	nested one within the other. If the trees differ by only one branch
238	swap, it seems to be conservative to test the difference between their
239	likelihoods with one degree of freedom, but other than that little is known and
240	more work on this is needed.
241	<P>
242	If the U (User Tree) option is used and more than one tree is supplied,
243	and the program is not told to assume autocorrelation between the
244	rates at different sites, the
245	program also performs a statistical test of each of these trees against the
246	one with highest likelihood. If there are two user trees, the test
247	done is one which is due to Kishino and Hasegawa (1989), a version
248	of a test originally introduced by Templeton (1983). In this
249	implementation it uses the mean and variance of
250	log-likelihood differences between trees, taken across sites. If the two
251	trees' means are more than 1.96 standard deviations different
252	then the trees are
253	declared significantly different. This use of the empirical variance of
254	log-likelihood differences is more robust and nonparametric than the
255	classical likelihood ratio test, and may to some extent compensate for the
256	any lack of realism in the model underlying this program.
257	<P>
258	If there are more than two trees, the test done is an extension of
259	the KHT test, due to Shimodaira and Hasegawa (1999). They pointed out
260	that a correction for the number of trees was necessary, and they
261	introduced a resampling method to make this correction. In the version
262	used here the variances and covariances of the sum of log likelihoods across
263	sites are computed for all pairs of trees. To test whether the
264	difference between each tree and the best one is larger than could have
265	been expected if they all had the same expected log-likelihood,
266	log-likelihoods for all trees are sampled with these covariances and equal
267	means (Shimodaira and Hasegawa's "least favorable hypothesis"),
268	and a P value is computed from the fraction of times the difference between
269	the tree's value and the highest log-likelihood exceeds that actually
270	observed. Note that this sampling needs random numbers, and so the
271	program will prompt the user for a random number seed if one has not
272	already been supplied. With the two-tree KHT test no random numbers
273	are used.
274	<P>
275	In either the KHT or the SH test the program
276	prints out a table of the log-likelihoods of each tree, the differences of
277	each from the highest one, the variance of that quantity as determined by
278	the log-likelihood differences at individual sites, and a conclusion as to
279	whether that tree is or is not significantly worse than the best one.
280	<P>
281	The branch lengths printed out are scaled in terms of expected numbers of
282	base substitutions, not counting replacements of a base by itself. Of course,
283	when a branch is twice as long this does not mean that there will be twice as
284	much net change expected along it, since some of the changes occur in the same
285	site and overlie or even reverse each other. Confidence limits on the branch
286	lengths are also given. Of course a negative value of the branch length is
287	meaningless, and a confidence limit overlapping zero simply means that the
288	branch length is not necessarily significantly different from zero. Because of
289	limitations of the numerical algorithm, branch length estimates of zero will
290	often print out as small numbers such as 0.00001. If you see a branch length
291	that small, it is really estimated to be of zero length.
292	<P>
293	Another possible source of confusion is the existence of negative values for
294	the log likelihood. This is not really a problem; the log likelihood is not a
295	probability but the logarithm of a probability, and since probabilities never
296	exceed 1.0 this logarithm will typically be negative. The log likelihood is
297	maximized by being made more positive: -30.23 is worse than -29.14. The log
298	likelihood will not always be negative since a combinatorial constant has been
299	left out of the expression for the likelihood. This does not affect the tree
300	found or the likelihood ratios (or log likelihood differences) between trees.
301	<P>
302	<H2>THE ALGORITHM</H2>
303	<P>
304	The program uses a Newton-Raphson algorithm to update one branch length at a
305	time. This is faster than the EM algorithm which was described in my
306	paper on restriction sites maximum likelihood (Felsenstein, 1992). The
307	likelihood that is being
308	maximized is the same one used by Smouse and Li (1987) extended for multiple
309	species.
310	moving down on the likelihood surface. You may have to "tune" the value of
311	extrapol to suit your data.
312	<P>
313	<H2>PROGRAM CONSTANTS</H2>
314	<P>
315	The constants include "maxcutter" (set in <TT>phylip.h</TT>),
316	the maximum length of an enzyme
317	recognition site. The memory used by the program will be approximately
318	proportional to this value, which is 8 in the distribution copy.
319	The program also uses constants
320	"iterations" and "smoothings", and decreasing "epsilon". Reducing
321	"iterations" and "smoothings" or increasing "epsilon"
322	will result in faster execution but a worse result. These values will
323	not usually have to be changed.
324	<P>
325	The program spends most of its time doing real arithmetic. The algorithm, with
326	separate and independent computations
327	occurring at each site, lends itself readily to parallel processing.
328	<P>
329	A feature of the algorithm is that it saves time by recognizing sites at which
330	the pattern of presence/absence is the same, and does that computation only
331	once. Thus if we have only four species but a large number of sites, there are
332	only about (ignoring ambiguous bases) 16 different patterns of presence/absence
333	(2 x 2 x 2 x 2) that can occur. The program automatically counts
334	occurrences of each and does the computation for each pattern only once, so
335	that it only needs to do as much computation as would be needed with at most
336	16 sites, even though the number of sites is actually much larger. Thus
337	the program will run very effectively with few species and many sites.
338	<P>
339	<H2>PAST AND FUTURE OF THE PROGRAM</H2>
340	<P>
341	This program was developed by modifying DNAML version 3.1 and also adding
342	some of the modifications that were added to DNAML version 3.2, with which
343	it shares many of its data structures and much of its strategy. Version
344	3.6 changed from EM iterations of branch lengths, which involved arbitrary
345	extrapolation factors, to the Newton-Raphson algorithm, which improved the
346	speed of the program (though only from "very slow" to "slow").
347	<P>
348	There are a number of obvious directions in which the program needs to be
349	modified in the future. Extension to allow for different rates of transition
350	and transversion is straightforward, but would slow down the program
351	considerably, as I have mentioned above. I have not included in the program
352	any provision for saving and printing out
353	multiple trees tied for highest likelihood, in part because an exact tie is
354	unlikely.
355	<P>
356	<HR>
357	<P>
358	<H3>TEST DATA SET</H3>
359	<P>
360	<TABLE><TR><TD BGCOLOR=white>
361	<PRE>
362	5 13 2
363	Alpha ++-+-++--+++-
364	Beta ++++--+--+++-
365	Gamma -+--+-++-+-++
366	Delta ++-+----++---
367	Epsilon ++++----++---
368	</PRE>
369	</TD></TR></TABLE>
370	<P>
371	<HR>
372	<P>
373	<H3>CONTENTS OF OUTPUT FILE (if all numerical options are on)</H3>
374	<P>
375	<TABLE><TR><TD BGCOLOR=white>
376	<PRE>
377
378	Restriction site Maximum Likelihood method, version 3.6
379
380	5 Species, 13 Sites, 2 Enzymes
381
382	Recognition sequences all 6 bases long
383
384	Sites absent from all species are assumed to have been omitted
385
386
387	Name Sites
388	---- -----
389
390	Alpha ++-+-++--+ ++-
391	Beta ++++--+--+ ++-
392	Gamma -+--+-++-+ -++
393	Delta ++-+----++ ---
394	Epsilon ++++----++ ---
395
396
397
398
399
400	+----Gamma
401	\|
402	\| +Epsilon
403	\| +--3
404	1--2 +Delta
405	\| \|
406	\| +Beta
407	\|
408	+Alpha
409
410
411	remember: this is an unrooted tree!
412
413	Ln Likelihood = -40.34358
414
415
416	Between And Length Approx. Confidence Limits
417	------- --- ------ ------- ---------- ------
418	1 Gamma 0.10813 ( 0.01154, 0.21901) **
419	1 2 0.01156 ( zero, 0.04578)
420	2 3 0.05885 ( zero, 0.12697) **
421	3 Epsilon 0.00100 ( zero, 0.00617)
422	3 Delta 0.01460 ( zero, 0.05036)
423	2 Beta 0.00100 ( zero, infinity)
424	1 Alpha 0.01310 ( zero, 0.04806)
425
426	* = significantly positive, P < 0.05
427	** = significantly positive, P < 0.01
428
429
430	</PRE>
431	</TD></TR></TABLE>
432	</BODY>
433	</HTML>

Note: See TracBrowser for help on using the repository browser.

Context Navigation

source: trunk/GDE/PHYLIP/doc/restml.html

Download in other formats: