1 | <!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 3.2 Final//EN"> |
---|
2 | <HTML> |
---|
3 | <HEAD> |
---|
4 | <TITLE>restdist</TITLE> |
---|
5 | <META NAME="description" CONTENT="restdist"> |
---|
6 | <META NAME="keywords" CONTENT="restdist"> |
---|
7 | <META NAME="resource-type" CONTENT="document"> |
---|
8 | <META NAME="distribution" CONTENT="global"> |
---|
9 | <META HTTP-EQUIV="Content-Type" CONTENT="text/html; charset=iso-8859-1"> |
---|
10 | </HEAD> |
---|
11 | <BODY BGCOLOR="#ccffff"> |
---|
12 | <DIV ALIGN=RIGHT> |
---|
13 | version 3.6 |
---|
14 | </DIV> |
---|
15 | <P> |
---|
16 | <DIV ALIGN=CENTER> |
---|
17 | <H1>RESTDIST -- Program to compute distance matrix<BR>from restriction sites or fragments</H1> |
---|
18 | </DIV> |
---|
19 | <P> |
---|
20 | © Copyright 2000-2002 by the University of |
---|
21 | Washington. Written by Joseph Felsenstein. Permission is granted to copy |
---|
22 | this document provided that no fee is charged for it and that this copyright |
---|
23 | notice is not removed. |
---|
24 | <P> |
---|
25 | Restdist reads the same restriction sites format as RESTML and |
---|
26 | computes a restriction sites distance. It can also compute a restriction |
---|
27 | fragments distance. The original restriction fragments and restriction |
---|
28 | sites distance methods were introduced by Nei and Li (1979). Their original |
---|
29 | <! ???? CHANGE NEXT LINE TO methods are WHEN NEI/LI R.S. DISTANCE INCLUDED > |
---|
30 | method for restriction fragments is |
---|
31 | also available in this program, although its default methods |
---|
32 | are my modifications of the original Nei and Li methods. |
---|
33 | <P> |
---|
34 | These two distances assume that the restriction sites are accidental byproducts |
---|
35 | of random change of nucleotide sequences. For my restriction sites distance |
---|
36 | the DNA sequences are assumed to be changing according to the Kimura |
---|
37 | 2-parameter model of DNA change (Kimura, 1980). The user can set the |
---|
38 | transition/transversion rate for the model. For my restriction fragments |
---|
39 | distance there is |
---|
40 | there is an implicit assumption of a Jukes-Cantor (1969) model of change, |
---|
41 | The user can also set the |
---|
42 | parameter of a correction for unequal rates of evolution between sites in |
---|
43 | the DNA sequences, using a Gamma distribution of rates among sites. |
---|
44 | The Jukes-Cantor model is also implicit in the restriction fragments |
---|
45 | distance of Nei and Li(1979). It |
---|
46 | does not allow us to correct for a Gamma distribution of rates among |
---|
47 | sites. |
---|
48 | <P> |
---|
49 | <H2>Restriction Sites Distance</H2> |
---|
50 | <P> |
---|
51 | The restriction sites distances use data coded for the presence of absence of |
---|
52 | individual restriction sites (usually as + and - or 0 and 1). My |
---|
53 | distance is based on the proportion, out of all sites observed in one species |
---|
54 | or the other, which are present in both species. This is done to correct for |
---|
55 | the ascertainment of sites, for the fact that we are not aware of many sites |
---|
56 | because they do not appear in any species. |
---|
57 | <P> |
---|
58 | My distance starts by computing from the particular pair of species the fraction |
---|
59 | <PRE> |
---|
60 | n<SUB>++</SUB> |
---|
61 | f = --------------------- |
---|
62 | n<SUB>++</SUB> + <SUP>1</SUP>/<SUB>2</SUB> (n<SUB>+- </SUB>+ n<SUB>-+</SUB>) |
---|
63 | </PRE> |
---|
64 | where <I>n<SUB>++</SUB></I> is the number of sites contained in both species, |
---|
65 | <I>n<SUB>+-</SUB></I> is the number of sites contained in the first of the |
---|
66 | two species but not in the second, and <I>n<SUB>-+</SUB></I> is the |
---|
67 | number of sites contained in the second of the two species but not in the |
---|
68 | first. This is the fraction of sites that are present in one species which are |
---|
69 | present in both. Since the number of sites present in the two species will |
---|
70 | often differ, the denominator is the average of the number of sites found in |
---|
71 | the two species. |
---|
72 | <P> |
---|
73 | If each restriction site is <I>s</I> nucleotides long, the probability |
---|
74 | that a restriction site is present in the other species, given that it is |
---|
75 | present in a species, is |
---|
76 | <PRE> |
---|
77 | Q<SUP>s</SUP>, |
---|
78 | </PRE> |
---|
79 | where <I>Q</I> is the probability that a nucleotide has no net change as one |
---|
80 | goes from the one species to the other. It may have changed in between; we |
---|
81 | are interested in the probability that that nucleotide site is in the same base |
---|
82 | in both species, irrespective of what has happened in between. |
---|
83 | <P> |
---|
84 | The distance is then computed by finding the branch length of a two-species |
---|
85 | tree (connecting these two species with a single branch) such |
---|
86 | that <I>Q</I> equals the <I>s</I>-th root of <I>f</I>. For this the |
---|
87 | program computes <I>Q</I> for various values of branch length, iterating them |
---|
88 | by a Newton-Raphson algorithm until the two quantities are equal. |
---|
89 | <P> |
---|
90 | The resulting distance should be numerically close to the original |
---|
91 | restriction sites distance of Nei and Li (1979). It is inspired by theirs, |
---|
92 | but theirs differs by implicitly assuming a symmetric Jukes-Cantor (1969) |
---|
93 | model of nucleotide change, and theirs does not include a correction for |
---|
94 | Gamma distribution of rate of change among nucleotide sites. |
---|
95 | <P> |
---|
96 | <H2>Restriction Fragments Distance</H2> |
---|
97 | <P> |
---|
98 | For restriction fragments data we use a different distance. If we |
---|
99 | average over all restriction fragment lengths, each at its own |
---|
100 | expected frequency, the probability that the fragment will still be |
---|
101 | in existence after a certain amount of branch length, we must take into |
---|
102 | account the probability that the two restriction sites at the ends of |
---|
103 | the fragment do not mutate, and the probability that no new |
---|
104 | restriction site occurs within the fragment in that amount of branch |
---|
105 | length. The result for a restriction site length of <I>s</I> is: |
---|
106 | <PRE> |
---|
107 | Q<SUP>2s</SUP> |
---|
108 | f = -------- |
---|
109 | 2 - Q<SUP>s</SUP> |
---|
110 | </PRE> |
---|
111 | (The details of the derivation will be given in my forthcoming book |
---|
112 | <I>Inferring Phylogenies</I> (to be published by Sinauer Associates |
---|
113 | in 2001). |
---|
114 | Given the observed fraction of restriction sites retained, <I>f</I>, |
---|
115 | we can solve a quadratic equation from the above expression for |
---|
116 | <I>Q<SUP>s</SUP></I>. That makes it easy to obtain a value of <I>Q</I>, |
---|
117 | and the branch length can then be estimated by adjusting it so the |
---|
118 | probability of a base not changing is equal to that value. |
---|
119 | <P> |
---|
120 | Alternatively, if we use the Nei and Li (1979) restriction fragments |
---|
121 | distance, this involves solving for <I>g</I> in the nonlinear |
---|
122 | equation |
---|
123 | <PRE> |
---|
124 | g = [ f (3 - 2g) ]<SUP><SUP>1</SUP>/<SUB>4</SUB></SUP> |
---|
125 | </PRE> |
---|
126 | and then the distance is given by |
---|
127 | <PRE> |
---|
128 | d = - (<SUP>2</SUP>/<SUB>r</SUB>) log<SUB>e</SUB>(g) |
---|
129 | </PRE> |
---|
130 | where <I>r</I> is the length of the restriction site. |
---|
131 | <P> |
---|
132 | Comparing these two restriction fragments distances in a case |
---|
133 | where their underlying DNA model is the same (which is when the |
---|
134 | transition/transversion ratio of the modified model is set to |
---|
135 | 0.5), you will find |
---|
136 | that they are very close to each other, differing very little at |
---|
137 | small distances, with the modified distance become smaller than |
---|
138 | the Nei/Li distance at larger distances. It will therefore matter |
---|
139 | very little which one you use. |
---|
140 | <P> |
---|
141 | <H2>A Comment About RAPDs and AFLPs</H2> |
---|
142 | <P> |
---|
143 | Although these distances are designed for restriction sites |
---|
144 | and restriction fragments data, they can be applied to RAPD and |
---|
145 | AFLP data as well. RAPD (Randomly Amplified Polymorphic DNA) |
---|
146 | and AFLP (Amplified Fragment Length Polymorphism) data consist |
---|
147 | of presence or absence of individual bands on a gel. The bands |
---|
148 | are segments of DNA with PCR primers at each end. These |
---|
149 | primers are defined sequences of known length (often about |
---|
150 | 10 nucleotides each). For AFLPs the reolevant length is the primer |
---|
151 | length, plus three nucleotides. Mutation in these sequences makes them no |
---|
152 | longer be primers, just as in the case of restriction sites. |
---|
153 | Thus a pair of 10-nucleotide primers will behave much the same |
---|
154 | as a 20-nucleotide restriction site. You can use the restriction |
---|
155 | sites distance as the distance between RAPD or AFLP patterns if you |
---|
156 | set the proper value for the total length of the site to the |
---|
157 | total length of the primers (plus 6 in the case of AFLPs). |
---|
158 | Of course there are many possible sources of noise in these data, |
---|
159 | including confusing fragments of similar length for each other |
---|
160 | and having primers near each other in the genome, and these are |
---|
161 | not taken into account in the statistical model used here. |
---|
162 | <P> |
---|
163 | <H2>INPUT FORMAT AND OPTIONS</H2> |
---|
164 | <P> |
---|
165 | The input is fairly standard, with one addition. As usual the first line of |
---|
166 | the file gives the number of species and the number of sites, but there is also |
---|
167 | a third number, which is the number of different restriction enzymes that were |
---|
168 | used to detect the restriction sites. Thus a data set with 10 species and |
---|
169 | 35 different sites, representing digestion with 4 different enzymes, would |
---|
170 | have the first line of the data file look like this: |
---|
171 | <P> |
---|
172 | <PRE> |
---|
173 | 10 35 4 |
---|
174 | </PRE> |
---|
175 | <P> |
---|
176 | The site data are in standard form. Each species starts with a species name |
---|
177 | whose maximum length is given by the constant "nmlngth" |
---|
178 | (whose value in the |
---|
179 | program as distributed is 10 characters). The name should, as usual, be padded |
---|
180 | out to that length with blanks if necessary. The sites data then follows, one |
---|
181 | character per site (any blanks will |
---|
182 | be skipped and ignored). Like the DNA and protein sequence data, the |
---|
183 | restriction sites data may be either in the "interleaved" form or the |
---|
184 | "sequential" form. Note that if you are analyzing restriction sites |
---|
185 | data with the programs DOLLOP or MIX or other discrete character |
---|
186 | programs, at the moment those programs do not use the "aligned" or |
---|
187 | "interleaved" data format. Therefore you may want to avoid that format |
---|
188 | when you have restriction sites data that you will want to feed into |
---|
189 | those programs. |
---|
190 | <P> |
---|
191 | The presence of a site is indicated by a "+" and the absence by a "-". I have |
---|
192 | also allowed the use of "1" and "0" as synonyms for "+" and "-", for |
---|
193 | compatibility with MIX and DOLLOP which do not allow "+" and "-". If the |
---|
194 | presence of |
---|
195 | the site is unknown (for example, if the DNA containing it has been deleted so |
---|
196 | that one |
---|
197 | does not know whether it would have contained the site) then the state "?" can |
---|
198 | be used to indicate that the state of this site is unknown. |
---|
199 | <P> |
---|
200 | The options are selected using an interactive menu. The menu looks like this: |
---|
201 | <P> |
---|
202 | <TABLE><TR><TD BGCOLOR=white> |
---|
203 | <PRE> |
---|
204 | |
---|
205 | Restriction site or fragment distances, version 3.6a3 |
---|
206 | |
---|
207 | Settings for this run: |
---|
208 | R Restriction sites or fragments? Sites |
---|
209 | G Gamma distribution of rates among sites? No |
---|
210 | T Transition/transversion ratio? 2.000000 |
---|
211 | S Site length? 6.0 |
---|
212 | L Form of distance matrix? Square |
---|
213 | M Analyze multiple data sets? No |
---|
214 | I Input sequences interleaved? Yes |
---|
215 | 0 Terminal type (IBM PC, ANSI, none)? (none) |
---|
216 | 1 Print out the data at start of run? No |
---|
217 | 2 Print indications of progress of run? Yes |
---|
218 | |
---|
219 | Y to accept these or type the letter for one to change |
---|
220 | |
---|
221 | </PRE> |
---|
222 | </TD></TR></TABLE> |
---|
223 | <P> |
---|
224 | The user either types "Y" (followed, of course, by a carriage-return) |
---|
225 | if the settings shown are to be accepted, or the letter or digit corresponding |
---|
226 | to an option that is to be changed. |
---|
227 | <P> |
---|
228 | The R option toggles between a restriction sites distance, which |
---|
229 | is the default setting, and a restriction fragments distance. In |
---|
230 | the latter case, another option appears, the N (Nei/Li) option. |
---|
231 | This allows the user to choose the original Nei and Li (1979) |
---|
232 | restriction fragments distance rather than my modified Nei/Li |
---|
233 | distance, which is the default. |
---|
234 | <P> |
---|
235 | If the G (Gamma distribution) option is selected, the user will be |
---|
236 | asked to supply the coefficient of variation of the rate of substitution |
---|
237 | among sites. This is different from the parameters used by Nei and Jin, who |
---|
238 | introduced Gamma distribution of rates in DNA distances, but |
---|
239 | related to their parameters: their parameter <EM>a</EM> is also known as |
---|
240 | "alpha", the shape parameter of the Gamma distribution. It is |
---|
241 | related to the coefficient of variation by |
---|
242 | <P> |
---|
243 | CV = 1 / a<SUP>1/2</SUP> |
---|
244 | <P> |
---|
245 | or |
---|
246 | <P> |
---|
247 | a = 1 / (CV)<SUP>2</SUP> |
---|
248 | <P> |
---|
249 | (their parameter <EM>b</EM> is absorbed here by the requirement that time is scaled so |
---|
250 | that the mean rate of evolution is 1 per unit time, which means that <EM>a = b</EM>). |
---|
251 | As we consider cases in which the rates are less variable we should set <EM>a</EM> |
---|
252 | larger and larger, as <EM>CV</EM> gets smaller and smaller. |
---|
253 | <P> |
---|
254 | The Gamma distribution option is not available when using the |
---|
255 | original Nei/Li restriction fragments distance. |
---|
256 | <P> |
---|
257 | The T option is the Transition/transversion option. The user is prompted for |
---|
258 | a real number greater than 0.0, as the expected ratio of transitions to |
---|
259 | transversions. Note |
---|
260 | that this is the resulting expected ratio of transitions to transversions. |
---|
261 | The default value of the T parameter if you do not use the T |
---|
262 | option is 2.0. The T option is not available when you choose the original |
---|
263 | Nei/Li restriction fragment distance, which assumes a Jukes-Cantor (1969) |
---|
264 | model of DNA change, for which the transition/transversion ratio is |
---|
265 | in effect fixed at 0.5. |
---|
266 | <P> |
---|
267 | The S option selects the site length. This is set to a default |
---|
268 | value of 6. It can be set to any positive integer. While in |
---|
269 | the RESTML program there is an upper limit on the restriction |
---|
270 | site length (set by memory limitations), in RESTDIST there is |
---|
271 | no effective limit on the size of the restriction sites. A value |
---|
272 | of 20, which might be appropriate in many cases for RAPD or AFLP |
---|
273 | data, is typically not practical in RESTML, but it is useable in |
---|
274 | RESTDIST. |
---|
275 | <P> |
---|
276 | Option L specifies that the output file will have a square matrix |
---|
277 | of distances. It can be used to change to lower-triangular |
---|
278 | data matrices. This will usually not be |
---|
279 | necessary, but if the distance matrices are going to be very |
---|
280 | large, this alternative can reduce their size by half. The |
---|
281 | programs which are to use them should then of course be informed |
---|
282 | that they can expect lower-triangular distance matrices. |
---|
283 | <P> |
---|
284 | The M, I, and 0 options are the usual Multiple data set, |
---|
285 | Interleaved input, and screen terminal type options. These are |
---|
286 | described in the main documentation file. |
---|
287 | <P> |
---|
288 | Option 1 specifies that the input data will be written out |
---|
289 | on the output file before the distances. This is off by |
---|
290 | default. If it is done, it will make the output file unusable |
---|
291 | as input to our distance matrix programs. |
---|
292 | <P> |
---|
293 | Option 2 turns off or on the indications of the progress of the run. |
---|
294 | The program prints out a row of dots (".") indicating the |
---|
295 | calculation of individual distances. Since the distance matrix |
---|
296 | is symmetrical, the program only computes the distances for the |
---|
297 | upper triangle of the distance matrix, and then duplicates |
---|
298 | the distance to the other corner of the matrix. Thus the rows of |
---|
299 | dots start out of full length, and then egt shorter and shorter. |
---|
300 | <P> |
---|
301 | <H2>OUTPUT FORMAT</H2> |
---|
302 | <P> |
---|
303 | The output file contains on its first line the number of species. The |
---|
304 | distance matrix is then printed in standard |
---|
305 | form, with each species starting on a new line with the species name, followed |
---|
306 | by the distances to the species in order. These continue onto a new line |
---|
307 | after every nine distances. If the L option is used, the matrix or distances |
---|
308 | is in lower triangular form, so that only the distances to the other species |
---|
309 | that precede each species are printed. Otherwise the distance matrix is square |
---|
310 | with zero distances on the diagonal. In general the format of the distance |
---|
311 | matrix is such that it can serve as input to any of the distance matrix |
---|
312 | programs. |
---|
313 | <P> |
---|
314 | If the option to print out the data is selected, the output file will |
---|
315 | precede the data by more complete information on the input and the menu |
---|
316 | selections. The output file begins by giving the number of species and the |
---|
317 | number of characters. |
---|
318 | <P> |
---|
319 | The distances printed out are scaled in terms of expected numbers of |
---|
320 | substitutions per DNA site, counting both transitions and transversions but not |
---|
321 | replacements of a base by itself, and scaled so that the average rate of |
---|
322 | change, averaged over all sites analyzed, is set to 1.0. Thus when the |
---|
323 | G option is used, the rate of change at one site may be higher than |
---|
324 | at another, but their mean is expected to be 1. |
---|
325 | <P> |
---|
326 | <H2>PROGRAM CONSTANTS</H2> |
---|
327 | <P> |
---|
328 | The constants available to be changed are "initialv" and |
---|
329 | "iterationsr". The constant "initialv" is the starting |
---|
330 | value of the distance in the iterations. This will typically not need to |
---|
331 | be changed. The constant "iterationsr" is the number of |
---|
332 | times that the Newton-Raphson method which is used to solve the |
---|
333 | equations for the distances is iterated. The program can be |
---|
334 | speeded up by reducing the number of iterations from the default |
---|
335 | value of 20, but at the possible risk of computing the distance |
---|
336 | less accurately. |
---|
337 | <P> |
---|
338 | <H2>FUTURE OF THE PROGRAM</H2> |
---|
339 | <P> |
---|
340 | The present program does not compute the original distance of Nei and Li (1979) |
---|
341 | for restriction sites (though it does have an option to compute their original |
---|
342 | distance for restriction fragments). I hope to add their restriction |
---|
343 | sites distance in the near future. |
---|
344 | <P> |
---|
345 | <HR> |
---|
346 | <P> |
---|
347 | <H3>TEST DATA SET</H3> |
---|
348 | <P> |
---|
349 | <TABLE><TR><TD BGCOLOR=white> |
---|
350 | <PRE> |
---|
351 | 5 13 2 |
---|
352 | Alpha ++-+-++--+++- |
---|
353 | Beta ++++--+--+++- |
---|
354 | Gamma -+--+-++-+-++ |
---|
355 | Delta ++-+----++--- |
---|
356 | Epsilon ++++----++--- |
---|
357 | </PRE> |
---|
358 | </TD></TR></TABLE> |
---|
359 | <P> |
---|
360 | <HR> |
---|
361 | <H3>CONTENTS OF OUTPUT FILE (with all numerical options on)</H3> |
---|
362 | <P> |
---|
363 | (Note that when the options for displaying the input data are turned off, |
---|
364 | the output is in a form suitable for use as an input file in the distance |
---|
365 | matrix programs).<P> |
---|
366 | <P> |
---|
367 | <TABLE><TR><TD BGCOLOR=white> |
---|
368 | <PRE> |
---|
369 | |
---|
370 | 5 Species, 13 Sites |
---|
371 | |
---|
372 | Name Sites |
---|
373 | ---- ----- |
---|
374 | |
---|
375 | Alpha ++-+-++--+ ++- |
---|
376 | Beta ++++--+--+ ++- |
---|
377 | Gamma -+--+-++-+ -++ |
---|
378 | Delta ++-+----++ --- |
---|
379 | Epsilon ++++----++ --- |
---|
380 | |
---|
381 | |
---|
382 | Alpha 0.0000 0.0224 0.1077 0.0688 0.0826 |
---|
383 | Beta 0.0224 0.0000 0.1077 0.0688 0.0442 |
---|
384 | Gamma 0.1077 0.1077 0.0000 0.1765 0.1925 |
---|
385 | Delta 0.0688 0.0688 0.1765 0.0000 0.0197 |
---|
386 | Epsilon 0.0826 0.0442 0.1925 0.0197 0.0000 |
---|
387 | </PRE> |
---|
388 | </TD></TR></TABLE> |
---|
389 | </BODY> |
---|
390 | </HTML> |
---|