1 | <!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 3.2 Final//EN"> |
---|
2 | <HTML> |
---|
3 | <HEAD> |
---|
4 | <TITLE>restml</TITLE> |
---|
5 | <META NAME="description" CONTENT="restml"> |
---|
6 | <META NAME="keywords" CONTENT="restml"> |
---|
7 | <META NAME="resource-type" CONTENT="document"> |
---|
8 | <META NAME="distribution" CONTENT="global"> |
---|
9 | <META HTTP-EQUIV="Content-Type" CONTENT="text/html; charset=iso-8859-1"> |
---|
10 | </HEAD> |
---|
11 | <BODY BGCOLOR="#ccffff"> |
---|
12 | <DIV ALIGN=RIGHT> |
---|
13 | version 3.6 |
---|
14 | </DIV> |
---|
15 | <P> |
---|
16 | <DIV ALIGN=CENTER> |
---|
17 | <H1>RESTML -- Restriction sites Maximum Likelihood program</H1> |
---|
18 | </DIV> |
---|
19 | <P> |
---|
20 | © Copyright 1986-2000 by The University of |
---|
21 | Washington. Written by Joseph Felsenstein. Permission is granted to copy |
---|
22 | this document provided that no fee is charged for it and that this copyright |
---|
23 | notice is not removed. |
---|
24 | <P> |
---|
25 | This program implements a maximum likelihood method for restriction sites |
---|
26 | data (not restriction fragment data). This program is one of the slowest |
---|
27 | programs in this package, and can be very tedious to run. It is |
---|
28 | possible to have the program search for the maximum likelihood tree. |
---|
29 | It will be more practical for some users (those that do not have |
---|
30 | fast machines) to use the U (User Tree) |
---|
31 | option, which takes less run time, optimizing branch lengths and computing |
---|
32 | likelihoods for particular tree topologies suggested by the user. The |
---|
33 | model used here is essentially identical to that used by Smouse and Li (1987) |
---|
34 | who give explicit expressions for computing the likelihood for three-species |
---|
35 | trees. It does not place prior probabilities on trees as they do. The present |
---|
36 | program extends their approach to multiple species by |
---|
37 | a technique which, while it does not give explicit expressions for likelihoods, |
---|
38 | does enable their computation and the iterative improvement of branch |
---|
39 | lengths. It also allows for multiple restriction enzymes. The algorithm |
---|
40 | has been described in a paper (Felsenstein, 1992). Another relevant |
---|
41 | paper is that of DeBry and Slade (1985). |
---|
42 | <P> |
---|
43 | The assumptions of the present model are: |
---|
44 | <P> |
---|
45 | <OL> |
---|
46 | <LI>Each restriction site evolves independently. |
---|
47 | <LI>Different lineages evolve independently. |
---|
48 | <LI>Each site undergoes substitution at an expected rate which we specify. |
---|
49 | <LI>Substitutions consist of replacement of a nucleotide by one of the |
---|
50 | other three nucleotides, chosen at random. |
---|
51 | </OL> |
---|
52 | <P> |
---|
53 | Note that if the existing base is, say, an A, the chance of it being |
---|
54 | replaced by a G is 1/3, and so is the chance that it is replaced by |
---|
55 | a T. This means that there can be no difference in the (expected) |
---|
56 | rate of transitions and transversions. Users who are upset at this |
---|
57 | might ponder the fact that a version allowing different rates of |
---|
58 | transitions and transversions would run an estimated 16 times |
---|
59 | slower. If it also allowed for unequal frequencies of the four bases, |
---|
60 | it would run about 300,000 times slower! For the moment, until a better |
---|
61 | method is available, I guess I'll stick with this one! |
---|
62 | <P> |
---|
63 | <H2>INPUT FORMAT AND OPTIONS</H2> |
---|
64 | <P> |
---|
65 | Subject to these assumptions, the program is an approximately |
---|
66 | correct maximum likelihood method. The |
---|
67 | input is fairly standard, with one addition. As usual the first line of the |
---|
68 | file gives the number of species and the number of sites, but there is also |
---|
69 | a third number, which is the number of different restriction enzymes that were |
---|
70 | used to detect the restriction sites. Thus a data set with 10 species and |
---|
71 | 35 different sites, representing digestion with 4 different enzymes, would |
---|
72 | have the first line of the data file look like this: |
---|
73 | <P> |
---|
74 | <PRE> |
---|
75 | 10 35 4 |
---|
76 | </PRE> |
---|
77 | <P> |
---|
78 | The first line of the data file will also contain a letter W following |
---|
79 | these numbers (and separated from them by a space) if the Weights option |
---|
80 | is being used. As with all programs using the weights option, a line or lines |
---|
81 | must then follow, before the data, with the weights for each site. |
---|
82 | <P> |
---|
83 | The site data are in standard form. Each species starts with a species name |
---|
84 | whose maximum length is given by the constant "nmlngth" |
---|
85 | (whose value in the |
---|
86 | program as distributed is 10 characters). The name should, as usual, be padded |
---|
87 | out to that length with blanks if necessary. The sites data then follows, one |
---|
88 | character per site (any blanks will |
---|
89 | be skipped and ignored). Like the DNA and protein sequence data, the |
---|
90 | restriction sites data may be either in the "interleaved" form or the |
---|
91 | "sequential" form. Note that if you are analyzing restriction sites |
---|
92 | data with the programs DOLLOP or MIX or other discrete character |
---|
93 | programs, at the moment those programs do not use the "aligned" or |
---|
94 | "interleaved" data format. Therefore you may want to avoid that format |
---|
95 | when you have restriction sites data that you will want to feed into |
---|
96 | those programs. |
---|
97 | <P> |
---|
98 | The presence of a site is indicated by a "+" and the absence by a "-". I have |
---|
99 | also allowed the use of "1" and "0" as synonyms for "+" and "-", for |
---|
100 | compatibility with MIX and DOLLOP which do not allow "+" and "-". If the |
---|
101 | presence of |
---|
102 | the site is unknown (for example, if the DNA containing it has been deleted so |
---|
103 | that one |
---|
104 | does not know whether it would have contained the site) then the state "?" can |
---|
105 | be used to indicate that the state of this site is unknown. |
---|
106 | <P> |
---|
107 | User-defined trees may follow the |
---|
108 | data in the usual way. The trees must be unrooted, which means that at their |
---|
109 | base they must have a trifurcation. |
---|
110 | <P> |
---|
111 | The options are selected by a menu, which looks like this: |
---|
112 | <P> |
---|
113 | <TABLE><TR><TD BGCOLOR=white> |
---|
114 | <PRE> |
---|
115 | |
---|
116 | Restriction site Maximum Likelihood method, version 3.6 |
---|
117 | |
---|
118 | Settings for this run: |
---|
119 | U Search for best tree? Yes |
---|
120 | A Are all sites detected? No |
---|
121 | S Speedier but rougher analysis? Yes |
---|
122 | G Global rearrangements? No |
---|
123 | J Randomize input order of sequences? No. Use input order |
---|
124 | L Site length? 6 |
---|
125 | O Outgroup root? No, use as outgroup species 1 |
---|
126 | M Analyze multiple data sets? No |
---|
127 | I Input sequences interleaved? Yes |
---|
128 | 0 Terminal type (IBM PC, ANSI, none)? (none) |
---|
129 | 1 Print out the data at start of run No |
---|
130 | 2 Print indications of progress of run Yes |
---|
131 | 3 Print out tree Yes |
---|
132 | 4 Write out trees onto tree file? Yes |
---|
133 | |
---|
134 | Y to accept these or type the letter for one to change |
---|
135 | </PRE> |
---|
136 | </TD></TR></TABLE> |
---|
137 | <P> |
---|
138 | The U, J, O, M, and 0 options are the usual ones, described in the main |
---|
139 | documentation file. The user trees for option U are read from a file whose |
---|
140 | default name is <TT>intree</TT>. The I option selects between Interleaved and |
---|
141 | Sequential input data formats, and is described in the documentation file for |
---|
142 | the molecular sequences programs. |
---|
143 | <P> |
---|
144 | The G |
---|
145 | (global search) option causes, after the last species is added to the tree, |
---|
146 | each possible group to be removed and re-added. This improves the result, |
---|
147 | since the position of every species is reconsidered. It approximately triples |
---|
148 | the run-time of the program. |
---|
149 | <P> |
---|
150 | The two options specific to this program are the A, and L options. The L |
---|
151 | (Length) option allows the user to specify the length in bases of the |
---|
152 | restriction sites. Allowed values are 1 to 8 (the |
---|
153 | constant "maxcutter" in file <TT>phylip.h</TT> |
---|
154 | controls the maximum allowed value). At the |
---|
155 | moment the program assumes that all sites have the same length (for example, |
---|
156 | that all enzymes are 6-base-cutters). The default value for this parameter is |
---|
157 | 6, which will be used if the L option is not invoked. A desirable future |
---|
158 | development for the package would be allowing the L parameter to be different |
---|
159 | for every site. It would also be desirable to allow for ambiguities in the |
---|
160 | recognition site, since some enzymes recognize 2 or 4 sequences. Both of these |
---|
161 | would require fairly complicated programming or else slower execution times. |
---|
162 | <P> |
---|
163 | The A (All) option specifies that all sites are detected, even those for which |
---|
164 | all of the species have the recognition sequence absent (character state |
---|
165 | "-"). The default condition is that it is assumed that such sites will not |
---|
166 | occur in the data. The likelihood computed when the A option is not used is |
---|
167 | the probability of the pattern of sites given that tree and conditional on the |
---|
168 | pattern not being all absences. This will be realistic for most data, except |
---|
169 | for cases in which the data are extracted from sites data for a larger number |
---|
170 | of species, in which case some of the site positions could have all absences in |
---|
171 | the subset of species. In such cases an effective way of analyzing the data |
---|
172 | would be to omit those sites and not use the A option, as such positions, even |
---|
173 | if not absolutely excluded, are nevertheless less likely than random to have |
---|
174 | been incorporated in the data set. |
---|
175 | <P> |
---|
176 | The W (Weights) option, which is invoked in the input file rather than in |
---|
177 | the menu, allows the user to select a subset of sites to |
---|
178 | be analyzed. It is invoked in the usual way, except that only weights |
---|
179 | 0 and 1 are allowed. If the W option is not used, all sites will be |
---|
180 | analyzed. If the Weights option is used, there must be a W in the first |
---|
181 | line of the input file. |
---|
182 | <P> |
---|
183 | <H2>OUTPUT FORMAT</H2> |
---|
184 | <P> |
---|
185 | The output starts by giving the number of species, and the number of sites. If |
---|
186 | the default condition is used instead of the A option the program states that |
---|
187 | it is assuming that sites absent in all species have been omitted. The value |
---|
188 | of the site length (6 bases, for example) is also given. |
---|
189 | <P> |
---|
190 | If option 2 (print out data) has been selected, |
---|
191 | there then follow the restriction site sequences, printed in |
---|
192 | groups of ten sites. The trees found are printed as an unrooted |
---|
193 | tree topology (possibly rooted by outgroup if so requested). The |
---|
194 | internal nodes are numbered arbitrarily for the sake of |
---|
195 | identification. The number of trees evaluated so far and the log |
---|
196 | likelihood of the tree are also given. |
---|
197 | <P> |
---|
198 | A table is printed |
---|
199 | showing the length of each tree segment, as well as (very) rough confidence |
---|
200 | limits on the length. As with DNAML, if a confidence limit is |
---|
201 | negative, this indicates that rearrangement of the tree in that region |
---|
202 | is not excluded, while if both limits are positive, rearrangement is |
---|
203 | still not necessarily excluded because the variance calculation on which |
---|
204 | the confidence limits are based results in an underestimate, which makes |
---|
205 | the confidence limits too narrow. |
---|
206 | <P> |
---|
207 | In addition to the confidence limits, |
---|
208 | the program performs a crude Likelihood Ratio Test (LRT) for each |
---|
209 | branch of the tree. The program computes the ratio of likelihoods with and |
---|
210 | without this branch length forced to zero length. This done by comparing the |
---|
211 | likelihoods changing only that branch length. A truly correct LRT would |
---|
212 | force that branch length to zero and also allow the other branch lengths to |
---|
213 | adjust to that. The result would be a likelihood ratio closer to 1. Therefore |
---|
214 | the present LRT will err on the side of being too significant. |
---|
215 | <P> |
---|
216 | One should also |
---|
217 | realize that if you are looking not at a previously-chosen branch but at all |
---|
218 | branches, that you are seeing the results of multiple tests. With 20 tests, |
---|
219 | one is expected to reach significance at the P = .05 level purely by |
---|
220 | chance. You should therefore use a much more conservative significance level, |
---|
221 | such as .05 divided by the number of tests. The significance of these tests |
---|
222 | is shown by printing asterisks next to |
---|
223 | the confidence interval on each branch length. It is important to keep |
---|
224 | in mind that both the confidence limits and the tests |
---|
225 | are very rough and approximate, and probably indicate more significance than |
---|
226 | they should. Nevertheless, maximum likelihood is one of the few methods that |
---|
227 | can give you any indication of its own error; most other methods simply fail to |
---|
228 | warn the user that there is any error! (In fact, whole philosophical schools |
---|
229 | of taxonomists exist whose main point seems to be that there isn't any |
---|
230 | error, that the "most parsimonious" tree is the best tree by definition and |
---|
231 | that's that). |
---|
232 | <P> |
---|
233 | The log likelihood printed out with the final tree can be used to perform |
---|
234 | various likelihood ratio tests. Remember that testing one tree topology |
---|
235 | against another is not a simple matter, because two different tree topologies |
---|
236 | are not hypotheses that are |
---|
237 | nested one within the other. If the trees differ by only one branch |
---|
238 | swap, it seems to be conservative to test the difference between their |
---|
239 | likelihoods with one degree of freedom, but other than that little is known and |
---|
240 | more work on this is needed. |
---|
241 | <P> |
---|
242 | If the U (User Tree) option is used and more than one tree is supplied, |
---|
243 | and the program is not told to assume autocorrelation between the |
---|
244 | rates at different sites, the |
---|
245 | program also performs a statistical test of each of these trees against the |
---|
246 | one with highest likelihood. If there are two user trees, the test |
---|
247 | done is one which is due to Kishino and Hasegawa (1989), a version |
---|
248 | of a test originally introduced by Templeton (1983). In this |
---|
249 | implementation it uses the mean and variance of |
---|
250 | log-likelihood differences between trees, taken across sites. If the two |
---|
251 | trees' means are more than 1.96 standard deviations different |
---|
252 | then the trees are |
---|
253 | declared significantly different. This use of the empirical variance of |
---|
254 | log-likelihood differences is more robust and nonparametric than the |
---|
255 | classical likelihood ratio test, and may to some extent compensate for the |
---|
256 | any lack of realism in the model underlying this program. |
---|
257 | <P> |
---|
258 | If there are more than two trees, the test done is an extension of |
---|
259 | the KHT test, due to Shimodaira and Hasegawa (1999). They pointed out |
---|
260 | that a correction for the number of trees was necessary, and they |
---|
261 | introduced a resampling method to make this correction. In the version |
---|
262 | used here the variances and covariances of the sum of log likelihoods across |
---|
263 | sites are computed for all pairs of trees. To test whether the |
---|
264 | difference between each tree and the best one is larger than could have |
---|
265 | been expected if they all had the same expected log-likelihood, |
---|
266 | log-likelihoods for all trees are sampled with these covariances and equal |
---|
267 | means (Shimodaira and Hasegawa's "least favorable hypothesis"), |
---|
268 | and a P value is computed from the fraction of times the difference between |
---|
269 | the tree's value and the highest log-likelihood exceeds that actually |
---|
270 | observed. Note that this sampling needs random numbers, and so the |
---|
271 | program will prompt the user for a random number seed if one has not |
---|
272 | already been supplied. With the two-tree KHT test no random numbers |
---|
273 | are used. |
---|
274 | <P> |
---|
275 | In either the KHT or the SH test the program |
---|
276 | prints out a table of the log-likelihoods of each tree, the differences of |
---|
277 | each from the highest one, the variance of that quantity as determined by |
---|
278 | the log-likelihood differences at individual sites, and a conclusion as to |
---|
279 | whether that tree is or is not significantly worse than the best one. |
---|
280 | <P> |
---|
281 | The branch lengths printed out are scaled in terms of expected numbers of |
---|
282 | base substitutions, not counting replacements of a base by itself. Of course, |
---|
283 | when a branch is twice as long this does not mean that there will be twice as |
---|
284 | much net change expected along it, since some of the changes occur in the same |
---|
285 | site and overlie or even reverse each other. Confidence limits on the branch |
---|
286 | lengths are also given. Of course a negative value of the branch length is |
---|
287 | meaningless, and a confidence limit overlapping zero simply means that the |
---|
288 | branch length is not necessarily significantly different from zero. Because of |
---|
289 | limitations of the numerical algorithm, branch length estimates of zero will |
---|
290 | often print out as small numbers such as 0.00001. If you see a branch length |
---|
291 | that small, it is really estimated to be of zero length. |
---|
292 | <P> |
---|
293 | Another possible source of confusion is the existence of negative values for |
---|
294 | the log likelihood. This is not really a problem; the log likelihood is not a |
---|
295 | probability but the logarithm of a probability, and since probabilities never |
---|
296 | exceed 1.0 this logarithm will typically be negative. The log likelihood is |
---|
297 | maximized by being made more positive: -30.23 is worse than -29.14. The log |
---|
298 | likelihood will not always be negative since a combinatorial constant has been |
---|
299 | left out of the expression for the likelihood. This does not affect the tree |
---|
300 | found or the likelihood ratios (or log likelihood differences) between trees. |
---|
301 | <P> |
---|
302 | <H2>THE ALGORITHM</H2> |
---|
303 | <P> |
---|
304 | The program uses a Newton-Raphson algorithm to update one branch length at a |
---|
305 | time. This is faster than the EM algorithm which was described in my |
---|
306 | paper on restriction sites maximum likelihood (Felsenstein, 1992). The |
---|
307 | likelihood that is being |
---|
308 | maximized is the same one used by Smouse and Li (1987) extended for multiple |
---|
309 | species. |
---|
310 | moving down on the likelihood surface. You may have to "tune" the value of |
---|
311 | extrapol to suit your data. |
---|
312 | <P> |
---|
313 | <H2>PROGRAM CONSTANTS</H2> |
---|
314 | <P> |
---|
315 | The constants include "maxcutter" (set in <TT>phylip.h</TT>), |
---|
316 | the maximum length of an enzyme |
---|
317 | recognition site. The memory used by the program will be approximately |
---|
318 | proportional to this value, which is 8 in the distribution copy. |
---|
319 | The program also uses constants |
---|
320 | "iterations" and "smoothings", and decreasing "epsilon". Reducing |
---|
321 | "iterations" and "smoothings" or increasing "epsilon" |
---|
322 | will result in faster execution but a worse result. These values will |
---|
323 | not usually have to be changed. |
---|
324 | <P> |
---|
325 | The program spends most of its time doing real arithmetic. The algorithm, with |
---|
326 | separate and independent computations |
---|
327 | occurring at each site, lends itself readily to parallel processing. |
---|
328 | <P> |
---|
329 | A feature of the algorithm is that it saves time by recognizing sites at which |
---|
330 | the pattern of presence/absence is the same, and does that computation only |
---|
331 | once. Thus if we have only four species but a large number of sites, there are |
---|
332 | only about (ignoring ambiguous bases) 16 different patterns of presence/absence |
---|
333 | (2 x 2 x 2 x 2) that can occur. The program automatically counts |
---|
334 | occurrences of each and does the computation for each pattern only once, so |
---|
335 | that it only needs to do as much computation as would be needed with at most |
---|
336 | 16 sites, even though the number of sites is actually much larger. Thus |
---|
337 | the program will run very effectively with few species and many sites. |
---|
338 | <P> |
---|
339 | <H2>PAST AND FUTURE OF THE PROGRAM</H2> |
---|
340 | <P> |
---|
341 | This program was developed by modifying DNAML version 3.1 and also adding |
---|
342 | some of the modifications that were added to DNAML version 3.2, with which |
---|
343 | it shares many of its data structures and much of its strategy. Version |
---|
344 | 3.6 changed from EM iterations of branch lengths, which involved arbitrary |
---|
345 | extrapolation factors, to the Newton-Raphson algorithm, which improved the |
---|
346 | speed of the program (though only from "very slow" to "slow"). |
---|
347 | <P> |
---|
348 | There are a number of obvious directions in which the program needs to be |
---|
349 | modified in the future. Extension to allow for different rates of transition |
---|
350 | and transversion is straightforward, but would slow down the program |
---|
351 | considerably, as I have mentioned above. I have not included in the program |
---|
352 | any provision for saving and printing out |
---|
353 | multiple trees tied for highest likelihood, in part because an exact tie is |
---|
354 | unlikely. |
---|
355 | <P> |
---|
356 | <HR> |
---|
357 | <P> |
---|
358 | <H3>TEST DATA SET</H3> |
---|
359 | <P> |
---|
360 | <TABLE><TR><TD BGCOLOR=white> |
---|
361 | <PRE> |
---|
362 | 5 13 2 |
---|
363 | Alpha ++-+-++--+++- |
---|
364 | Beta ++++--+--+++- |
---|
365 | Gamma -+--+-++-+-++ |
---|
366 | Delta ++-+----++--- |
---|
367 | Epsilon ++++----++--- |
---|
368 | </PRE> |
---|
369 | </TD></TR></TABLE> |
---|
370 | <P> |
---|
371 | <HR> |
---|
372 | <P> |
---|
373 | <H3>CONTENTS OF OUTPUT FILE (if all numerical options are on)</H3> |
---|
374 | <P> |
---|
375 | <TABLE><TR><TD BGCOLOR=white> |
---|
376 | <PRE> |
---|
377 | |
---|
378 | Restriction site Maximum Likelihood method, version 3.6 |
---|
379 | |
---|
380 | 5 Species, 13 Sites, 2 Enzymes |
---|
381 | |
---|
382 | Recognition sequences all 6 bases long |
---|
383 | |
---|
384 | Sites absent from all species are assumed to have been omitted |
---|
385 | |
---|
386 | |
---|
387 | Name Sites |
---|
388 | ---- ----- |
---|
389 | |
---|
390 | Alpha ++-+-++--+ ++- |
---|
391 | Beta ++++--+--+ ++- |
---|
392 | Gamma -+--+-++-+ -++ |
---|
393 | Delta ++-+----++ --- |
---|
394 | Epsilon ++++----++ --- |
---|
395 | |
---|
396 | |
---|
397 | |
---|
398 | |
---|
399 | |
---|
400 | +----Gamma |
---|
401 | | |
---|
402 | | +Epsilon |
---|
403 | | +--3 |
---|
404 | 1--2 +Delta |
---|
405 | | | |
---|
406 | | +Beta |
---|
407 | | |
---|
408 | +Alpha |
---|
409 | |
---|
410 | |
---|
411 | remember: this is an unrooted tree! |
---|
412 | |
---|
413 | Ln Likelihood = -40.34358 |
---|
414 | |
---|
415 | |
---|
416 | Between And Length Approx. Confidence Limits |
---|
417 | ------- --- ------ ------- ---------- ------ |
---|
418 | 1 Gamma 0.10813 ( 0.01154, 0.21901) ** |
---|
419 | 1 2 0.01156 ( zero, 0.04578) |
---|
420 | 2 3 0.05885 ( zero, 0.12697) ** |
---|
421 | 3 Epsilon 0.00100 ( zero, 0.00617) |
---|
422 | 3 Delta 0.01460 ( zero, 0.05036) |
---|
423 | 2 Beta 0.00100 ( zero, infinity) |
---|
424 | 1 Alpha 0.01310 ( zero, 0.04806) |
---|
425 | |
---|
426 | * = significantly positive, P < 0.05 |
---|
427 | ** = significantly positive, P < 0.01 |
---|
428 | |
---|
429 | |
---|
430 | </PRE> |
---|
431 | </TD></TR></TABLE> |
---|
432 | </BODY> |
---|
433 | </HTML> |
---|