source: trunk/GDE/PHYLIP/doc/restml.html

Last change on this file was 2176, checked in by westram, 21 years ago

* empty log message *

  • Property svn:eol-style set to native
  • Property svn:keywords set to Author Date Id Revision
File size: 20.0 KB
Line 
1<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 3.2 Final//EN">
2<HTML>
3<HEAD>
4<TITLE>restml</TITLE>
5<META NAME="description" CONTENT="restml">
6<META NAME="keywords" CONTENT="restml">
7<META NAME="resource-type" CONTENT="document">
8<META NAME="distribution" CONTENT="global">
9<META HTTP-EQUIV="Content-Type" CONTENT="text/html; charset=iso-8859-1">
10</HEAD>
11<BODY BGCOLOR="#ccffff">
12<DIV ALIGN=RIGHT>
13version 3.6
14</DIV>
15<P>
16<DIV ALIGN=CENTER>
17<H1>RESTML -- Restriction sites Maximum Likelihood program</H1>
18</DIV>
19<P>
20&#169; Copyright 1986-2000 by The University of
21Washington.  Written by Joseph Felsenstein.  Permission is granted to copy
22this document provided that no fee is charged for it and that this copyright
23notice is not removed.
24<P>
25This program implements a maximum likelihood method for restriction sites
26data (not restriction fragment data).  This program is one of the slowest
27programs in this package, and can be very tedious to run.  It is
28possible to have the program search for the maximum likelihood tree.
29It will be more practical for some users (those that do not have
30fast machines) to use the U (User Tree)
31option, which takes less run time, optimizing branch lengths and computing
32likelihoods for particular tree topologies suggested by the user.  The
33model used here is essentially identical to that used by Smouse and Li (1987)
34who give explicit expressions for computing the likelihood for three-species
35trees.  It does not place prior probabilities on trees as they do.  The present
36program extends their approach to multiple species by
37a technique which, while it does not give explicit expressions for likelihoods,
38does enable their computation and the iterative improvement of branch
39lengths.  It also allows for multiple restriction enzymes.  The algorithm
40has been described in a paper (Felsenstein, 1992).  Another relevant
41paper is that of DeBry and Slade (1985).
42<P>
43The assumptions of the present model are:
44<P>
45<OL>
46<LI>Each restriction site evolves independently.
47<LI>Different lineages evolve independently.
48<LI>Each site undergoes substitution at an expected rate which we specify.
49<LI>Substitutions consist of replacement of a nucleotide by one of the
50other three nucleotides, chosen at random.
51</OL>
52<P>
53Note that if the existing base is, say, an A, the chance of it being
54replaced by a G is 1/3, and so is the chance that it is replaced by
55a T.  This means that there can be no difference in the (expected)
56rate of transitions and transversions.  Users who are upset at this
57might ponder the fact that a version allowing different rates of
58transitions and transversions would run an estimated 16 times
59slower.  If it also allowed for unequal frequencies of the four bases,
60it would run about 300,000 times slower!  For the moment, until a better
61method is available, I guess I'll stick with this one!
62<P>
63<H2>INPUT FORMAT AND OPTIONS</H2>
64<P>
65Subject to these assumptions, the program is an approximately
66correct maximum likelihood method.  The
67input is fairly standard, with one addition.  As usual the first line of the
68file gives the number of species and the number of sites, but there is also
69a third number, which is the number of different restriction enzymes that were
70used to detect the restriction sites.  Thus a data set with 10 species and
7135 different sites, representing digestion with 4 different enzymes, would
72have the first line of the data file look like this:
73<P>
74<PRE>
75   10   35    4
76</PRE>
77<P>
78The first line of the data file will also contain a letter W following
79these numbers (and separated from them by a space) if the Weights option
80is being used.  As with all programs using the weights option, a line or lines
81must then follow, before the data, with the weights for each site.
82<P>
83The site data are in standard form.  Each species starts with a species name
84whose maximum length is given by the constant "nmlngth"
85(whose value in the
86program as distributed is 10 characters).  The name should, as usual, be padded
87out to that length with blanks if necessary.  The sites data then follows, one
88character per site (any blanks will
89be skipped and ignored).  Like the DNA and protein sequence data, the
90restriction sites data may be either in the "interleaved" form or the
91"sequential" form.  Note that if you are analyzing restriction sites
92data with the programs DOLLOP or MIX or other discrete character
93programs, at the moment those programs do not use the "aligned" or
94"interleaved" data format.  Therefore you may want to avoid that format
95when you have restriction sites data that you will want to feed into
96those programs.
97<P>
98The presence of a site is indicated by a "+" and the absence by a "-".  I have
99also allowed the use of "1" and "0" as synonyms for "+" and "-", for
100compatibility with MIX and DOLLOP which do not allow "+" and "-".  If the
101presence of
102the site is unknown (for example, if the DNA containing it has been deleted so
103that one
104does not know whether it would have contained the site) then the state "?" can
105be used to indicate that the state of this site is unknown.
106<P>
107User-defined trees may follow the
108data in the usual way.  The trees must be unrooted, which means that at their
109base they must have a trifurcation.
110<P>
111The options are selected by a menu, which looks like this:
112<P>
113<TABLE><TR><TD BGCOLOR=white>
114<PRE>
115
116Restriction site Maximum Likelihood method, version 3.6
117
118Settings for this run:
119  U                 Search for best tree?  Yes
120  A               Are all sites detected?  No
121  S        Speedier but rougher analysis?  Yes
122  G                Global rearrangements?  No
123  J   Randomize input order of sequences?  No. Use input order
124  L                          Site length?  6
125  O                        Outgroup root?  No, use as outgroup species  1
126  M           Analyze multiple data sets?  No
127  I          Input sequences interleaved?  Yes
128  0   Terminal type (IBM PC, ANSI, none)?  (none)
129  1    Print out the data at start of run  No
130  2  Print indications of progress of run  Yes
131  3                        Print out tree  Yes
132  4       Write out trees onto tree file?  Yes
133
134  Y to accept these or type the letter for one to change
135</PRE>
136</TD></TR></TABLE>
137<P>
138The U, J, O, M, and 0 options are the usual ones, described in the main
139documentation file.  The user trees for option U are read from a file whose
140default name is <TT>intree</TT>.  The I option selects between Interleaved and
141Sequential input data formats, and is described in the documentation file for
142the molecular sequences programs.
143<P>
144The G
145(global search) option causes, after the last species is added to the tree,
146each possible group to be removed and re-added.  This improves the result,
147since the position of every species is reconsidered.  It approximately triples
148the run-time of the program.
149<P>
150The two options specific to this program are the A, and L options.  The L
151(Length) option allows the user to specify the length in bases of the
152restriction sites.  Allowed values are 1 to 8 (the
153constant "maxcutter" in file <TT>phylip.h</TT>
154controls the maximum allowed value).  At the
155moment the program assumes that all sites have the same length (for example,
156that all enzymes are 6-base-cutters).  The default value for this parameter is
1576, which will be used if the L option is not invoked.  A desirable future
158development for the package would be allowing the L parameter to be different
159for every site.  It would also be desirable to allow for ambiguities in the
160recognition site, since some enzymes recognize 2 or 4 sequences.  Both of these
161would require fairly complicated programming or else slower execution times.
162<P>
163The A (All) option specifies that all sites are detected, even those for which
164all of the species have the recognition sequence absent (character state
165"-").  The default condition is that it is assumed that such sites will not
166occur in the data.  The likelihood computed when the A option is not used is
167the probability of the pattern of sites given that tree and conditional on the
168pattern not being all absences.  This will be realistic for most data, except
169for cases in which the data are extracted from sites data for a larger number
170of species, in which case some of the site positions could have all absences in
171the subset of species.  In such cases an effective way of analyzing the data
172would be to omit those sites and not use the A option, as such positions, even
173if not absolutely excluded, are nevertheless less likely than random to have
174been incorporated in the data set.
175<P>
176The W (Weights) option, which is invoked in the input file rather than in
177the menu, allows the user to select a subset of sites to
178be analyzed.  It is invoked in the usual way, except that only weights
1790 and 1 are allowed.  If the W option is not used, all sites will be
180analyzed.  If the Weights option is used, there must be a W in the first
181line of the input file.
182<P>
183<H2>OUTPUT FORMAT</H2>
184<P>
185The output starts by giving the number of species, and the number of sites.  If
186the default condition is used instead of the A option the program states that
187it is assuming that sites absent in all species have been omitted.  The value
188of the site length (6 bases, for example) is also given.
189<P>
190If option 2 (print out data) has been selected,
191there then follow the restriction site sequences, printed in
192groups of ten sites.  The trees found are printed as an unrooted
193tree topology (possibly rooted by outgroup if so requested).  The
194internal nodes are numbered arbitrarily for the sake of
195identification.  The number of trees evaluated so far and the log
196likelihood of the tree are also given.
197<P>
198A table is printed
199showing the length of each tree segment, as well as (very) rough confidence
200limits on the length.  As with DNAML, if a confidence limit is
201negative, this indicates that rearrangement of the tree in that region
202is not excluded, while if both limits are positive, rearrangement is
203still not necessarily excluded because the variance calculation on which
204the confidence limits are based results in an underestimate, which makes
205the confidence limits too narrow.
206<P>
207In addition to the confidence limits,
208the program performs a crude Likelihood Ratio Test (LRT) for each
209branch of the tree.  The program computes the ratio of likelihoods with and
210without this branch length forced to zero length.  This done by comparing the
211likelihoods changing only that branch length.  A truly correct LRT would
212force that branch length to zero and also allow the other branch lengths to
213adjust to that.  The result would be a likelihood ratio closer to 1.  Therefore
214the present LRT will err on the side of being too significant.
215<P>
216One should also
217realize that if you are looking not at a previously-chosen branch but at all
218branches, that you are seeing the results of multiple tests.  With 20 tests,
219one is expected to reach significance at the P = .05 level purely by
220chance.  You should therefore use a much more conservative significance level,
221such as .05 divided by the number of tests.  The significance of these tests
222is shown by printing asterisks next to
223the confidence interval on each branch length.  It is important to keep
224in mind that both the confidence limits and the tests
225are very rough and approximate, and probably indicate more significance than
226they should.  Nevertheless, maximum likelihood is one of the few methods that
227can give you any indication of its own error; most other methods simply fail to
228warn the user that there is any error!  (In fact, whole philosophical schools
229of taxonomists exist whose main point seems to be that there isn't any
230error, that the "most parsimonious" tree is the best tree by definition and
231that's that).
232<P>
233The log likelihood printed out with the final tree can be used to perform
234various likelihood ratio tests.  Remember that testing one tree topology
235against another is not a simple matter, because two different tree topologies
236are not hypotheses that are
237nested one within the other.  If the trees differ by only one branch
238swap, it seems to be conservative to test the difference between their
239likelihoods with one degree of freedom, but other than that little is known and
240more work on this is needed.
241<P>
242If the U (User Tree) option is used and more than one tree is supplied,
243and the program is not told to assume autocorrelation between the
244rates at different sites, the
245program also performs a statistical test of each of these trees against the
246one with highest likelihood.   If there are two user trees, the test
247done is one which is due to Kishino and Hasegawa (1989), a version
248of a test originally introduced by Templeton (1983).  In this
249implementation it uses the mean and variance of
250log-likelihood differences between trees, taken across sites.  If the two
251trees' means are more than 1.96 standard deviations different
252then the trees are
253declared significantly different.  This use of the empirical variance of
254log-likelihood differences is more robust and nonparametric than the
255classical likelihood ratio test, and may to some extent compensate for the
256any lack of realism in the model underlying this program.
257<P>
258If there are more than two trees, the test done is an extension of
259the KHT test, due to Shimodaira and Hasegawa (1999).  They pointed out
260that a correction for the number of trees was necessary, and they
261introduced a resampling method to make this correction.  In the version
262used here the variances and covariances of the sum of log likelihoods across
263sites are computed for all pairs of trees.  To test whether the
264difference between each tree and the best one is larger than could have
265been expected if they all had the same expected log-likelihood,
266log-likelihoods for all trees are sampled with these covariances and equal
267means (Shimodaira and Hasegawa's "least favorable hypothesis"),
268and a P value is computed from the fraction of times the difference between
269the tree's value and the highest log-likelihood exceeds that actually
270observed.  Note that this sampling needs random numbers, and so the
271program will prompt the user for a random number seed if one has not
272already been supplied.  With the two-tree KHT test no random numbers
273are used.
274<P>
275In either the KHT or the SH test the program
276prints out a table of the log-likelihoods of each tree, the differences of
277each from the highest one, the variance of that quantity as determined by
278the log-likelihood differences at individual sites, and a conclusion as to
279whether that tree is or is not significantly worse than the best one.
280<P>
281The branch lengths printed out are scaled in terms of expected numbers of
282base substitutions, not counting replacements of a base by itself.  Of course,
283when a branch is twice as long this does not mean that there will be twice as
284much net change expected along it, since some of the changes occur in the same
285site and overlie  or even reverse each other.  Confidence limits on the branch
286lengths are also given.  Of course a negative value of the branch length is
287meaningless, and a confidence limit overlapping zero simply means that the
288branch length is not necessarily significantly different from zero.  Because of
289limitations of the numerical algorithm, branch length estimates of zero will
290often print out as small numbers such as 0.00001.  If you see a branch length
291that small, it is really estimated to be of zero length.
292<P>
293Another possible source of confusion is the existence of negative values for
294the log likelihood.  This is not really a problem; the log likelihood is not a
295probability but the logarithm of a probability, and since probabilities never
296exceed 1.0 this logarithm will typically be negative.  The log likelihood is
297maximized by being made more positive: -30.23 is worse than -29.14.  The log
298likelihood will not always be negative since a combinatorial constant has been
299left out of the expression for the likelihood.  This does not affect the tree
300found or the likelihood ratios (or log likelihood differences) between trees.
301<P>
302<H2>THE ALGORITHM</H2>
303<P>
304The program uses a Newton-Raphson algorithm to update one branch length at a
305time.  This is faster than the EM algorithm which was described in my
306paper on restriction sites maximum likelihood (Felsenstein, 1992).  The
307likelihood that is being
308maximized is the same one used by Smouse and Li (1987) extended for multiple
309species. 
310moving down on the likelihood surface.  You may have to "tune" the value of
311extrapol to suit your data.
312<P>
313<H2>PROGRAM CONSTANTS</H2>
314<P>
315The constants include "maxcutter" (set in <TT>phylip.h</TT>),
316the maximum length of an enzyme
317recognition site.  The memory used by the program will be approximately
318proportional to this value, which is 8 in the distribution copy.
319The program also uses constants
320"iterations" and "smoothings", and decreasing "epsilon".  Reducing
321"iterations" and "smoothings" or increasing "epsilon"
322will result in faster execution but a worse result.  These values will
323not usually have to be changed. 
324<P>
325The program spends most of its time doing real arithmetic.  The algorithm, with
326separate and independent computations
327occurring at each site, lends itself readily to parallel processing.
328<P>
329A feature of the algorithm is that it saves time by recognizing sites at which
330the pattern of presence/absence is the same, and does that computation only
331once.  Thus if we have only four species but a large number of sites, there are
332only about (ignoring ambiguous bases) 16 different patterns of presence/absence
333(2 x 2 x 2 x 2) that can occur.  The program automatically counts
334occurrences of each and does the computation for each pattern only once, so
335that it only needs to do as much computation as would be needed with at most
33616 sites, even though the number of sites is actually much larger.  Thus
337the program will run very effectively with few species and many sites.
338<P>
339<H2>PAST AND FUTURE OF THE PROGRAM</H2>
340<P>
341This program was developed by modifying DNAML version 3.1 and also adding
342some of the modifications that were added to DNAML version 3.2, with which
343it shares many of its data structures and much of its strategy.   Version
3443.6 changed from EM iterations of branch lengths, which involved arbitrary
345extrapolation factors, to the Newton-Raphson algorithm, which improved the
346speed of the program (though only from "very slow" to "slow").
347<P>
348There are a number of obvious directions in which the program needs to be
349modified in the future.  Extension to allow for different rates of transition
350and transversion is straightforward, but would slow down the program
351considerably, as I have mentioned above.  I have not included in the program
352any provision for saving and printing out
353multiple trees tied for highest likelihood, in part because an exact tie is
354unlikely.
355<P>
356<HR>
357<P>
358<H3>TEST DATA SET</H3>
359<P>
360<TABLE><TR><TD BGCOLOR=white>
361<PRE>
362   5   13   2
363Alpha     ++-+-++--+++-
364Beta      ++++--+--+++-
365Gamma     -+--+-++-+-++
366Delta     ++-+----++---
367Epsilon   ++++----++---
368</PRE>
369</TD></TR></TABLE>
370<P>
371<HR>
372<P>
373<H3>CONTENTS OF OUTPUT FILE (if all numerical options are on)</H3>
374<P>
375<TABLE><TR><TD BGCOLOR=white>
376<PRE>
377
378Restriction site Maximum Likelihood method, version 3.6
379
380   5 Species,   13 Sites,   2 Enzymes
381
382  Recognition sequences all 6 bases long
383
384Sites absent from all species are assumed to have been omitted
385
386
387Name            Sites
388----            -----
389
390Alpha        ++-+-++--+ ++-
391Beta         ++++--+--+ ++-
392Gamma        -+--+-++-+ -++
393Delta        ++-+----++ ---
394Epsilon      ++++----++ ---
395
396
397
398
399
400  +----Gamma     
401  | 
402  |     +Epsilon   
403  |  +--3 
404  1--2  +Delta     
405  |  | 
406  |  +Beta     
407  | 
408  +Alpha     
409
410
411remember: this is an unrooted tree!
412
413Ln Likelihood =   -40.34358
414
415 
416 Between        And            Length      Approx. Confidence Limits
417 -------        ---            ------      ------- ---------- ------
418   1          Gamma           0.10813     (  0.01154,     0.21901) **
419   1             2            0.01156     (     zero,     0.04578)
420   2             3            0.05885     (     zero,     0.12697) **
421   3          Epsilon         0.00100     (     zero,     0.00617)
422   3          Delta           0.01460     (     zero,     0.05036)
423   2          Beta            0.00100     (     zero,    infinity)
424   1          Alpha           0.01310     (     zero,     0.04806)
425
426     *  = significantly positive, P < 0.05
427     ** = significantly positive, P < 0.01
428
429
430</PRE>
431</TD></TR></TABLE>
432</BODY>
433</HTML>
Note: See TracBrowser for help on using the repository browser.