source: trunk/GDE/PHYLIP/doc/restdist.html

Last change on this file was 2176, checked in by westram, 21 years ago

* empty log message *

  • Property svn:eol-style set to native
  • Property svn:keywords set to Author Date Id Revision
File size: 17.1 KB
Line 
1<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 3.2 Final//EN">
2<HTML>
3<HEAD>
4<TITLE>restdist</TITLE>
5<META NAME="description" CONTENT="restdist">
6<META NAME="keywords" CONTENT="restdist">
7<META NAME="resource-type" CONTENT="document">
8<META NAME="distribution" CONTENT="global">
9<META HTTP-EQUIV="Content-Type" CONTENT="text/html; charset=iso-8859-1">
10</HEAD>
11<BODY BGCOLOR="#ccffff">
12<DIV ALIGN=RIGHT>
13version 3.6
14</DIV>
15<P>
16<DIV ALIGN=CENTER>
17<H1>RESTDIST -- Program to compute distance matrix<BR>from restriction sites or fragments</H1>
18</DIV>
19<P>
20&#169; Copyright 2000-2002 by the University of
21Washington.  Written by Joseph Felsenstein.  Permission is granted to copy
22this document provided that no fee is charged for it and that this copyright
23notice is not removed.
24<P>
25Restdist reads the same restriction sites format as RESTML and
26computes a restriction sites distance.  It can also compute a restriction
27fragments distance.  The original restriction fragments and restriction
28sites distance methods were introduced by Nei and Li (1979).  Their original
29<! ???? CHANGE NEXT LINE TO methods are WHEN NEI/LI R.S. DISTANCE INCLUDED >
30method for restriction fragments is
31also available in this program, although its default methods
32are my modifications of the original Nei and Li methods.
33<P>
34These two distances assume that the restriction sites are accidental byproducts
35of random change of nucleotide sequences.  For my restriction sites distance
36the DNA sequences are assumed to be changing according to the Kimura
372-parameter model of DNA change (Kimura, 1980).   The user can set the
38transition/transversion rate for the model.  For my restriction fragments
39distance there is
40there is an implicit assumption of a Jukes-Cantor (1969) model of change,
41The user can also set the
42parameter of a correction for unequal rates of evolution between sites in
43the DNA sequences, using a Gamma distribution of rates among sites.
44The Jukes-Cantor model is also implicit in the restriction fragments
45distance of Nei and Li(1979).  It
46does not allow us to correct for a Gamma distribution of rates among
47sites.
48<P>
49<H2>Restriction Sites Distance</H2>
50<P>
51The restriction sites distances use data coded for the presence of absence of
52individual restriction sites (usually as  + and - or 0 and 1).  My
53distance is based on the proportion, out of all sites observed in one species
54or the other, which are present in both species.  This is done to correct for
55the ascertainment of sites, for the fact that we are not aware of many sites
56because they do not appear in any species.
57<P>
58My distance starts by computing from the particular pair of species the fraction
59<PRE>
60                 n<SUB>++</SUB>
61   f =  ---------------------
62         n<SUB>++</SUB> + <SUP>1</SUP>/<SUB>2</SUB> (n<SUB>+- </SUB>+ n<SUB>-+</SUB>)
63</PRE>
64where <I>n<SUB>++</SUB></I> is the number of sites contained in both species,
65<I>n<SUB>+-</SUB></I> is the number of sites contained in the first of the
66two species but not in the second, and <I>n<SUB>-+</SUB></I> is the
67number of sites contained in the second of the two species but not in the
68first.  This is the fraction of sites that are present in one species which are
69present in both.  Since the number of sites present in the two species will
70often differ, the denominator is the average of the number of sites found in
71the two species.
72<P>
73If each restriction site is <I>s</I> nucleotides long, the probability
74that a restriction site is present in the other species, given that it is
75present in a species, is
76<PRE>
77      Q<SUP>s</SUP>,
78</PRE>
79where <I>Q</I> is the probability that a nucleotide has no net change as one
80goes from the one species to the other.  It may have changed in between; we
81are interested in the probability that that nucleotide site is in the same base
82in both species, irrespective of what has happened in between.
83<P>
84The distance is then computed by finding the branch length of a two-species
85tree (connecting these two species with a single branch) such
86that <I>Q</I> equals the <I>s</I>-th root of <I>f</I>.  For this the
87program computes <I>Q</I> for various values of branch length, iterating them
88by a Newton-Raphson algorithm until the two quantities are equal.
89<P>
90The resulting distance should be numerically close to the original
91restriction sites distance of Nei and Li (1979).  It is inspired by theirs,
92but theirs differs by implicitly assuming a symmetric Jukes-Cantor (1969)
93model of nucleotide change, and theirs does not include a correction for
94Gamma distribution of rate of change among nucleotide sites.
95<P>
96<H2>Restriction Fragments Distance</H2>
97<P>
98For restriction fragments data we use a different distance.  If we
99average over all restriction fragment lengths, each at its own
100expected frequency, the probability that the fragment will still be
101in existence after a certain amount of branch length, we must take into
102account the probability that the two restriction sites at the ends of
103the fragment do not mutate, and the probability that no new
104restriction site occurs within the fragment in that amount of branch
105length.  The result for a restriction site length of <I>s</I> is:
106<PRE>
107                Q<SUP>2s</SUP>
108          f = --------
109               2 - Q<SUP>s</SUP>
110</PRE>
111(The details of the derivation will be given in my forthcoming book
112<I>Inferring Phylogenies</I> (to be published by Sinauer Associates
113in 2001).
114Given the observed fraction of restriction sites retained, <I>f</I>,
115we can solve a quadratic equation from the above expression for
116<I>Q<SUP>s</SUP></I>.  That makes it easy to obtain a value of <I>Q</I>,
117and the branch length can then be estimated by adjusting it so the
118probability of a base not changing is equal to that value.
119<P>
120Alternatively, if we use the Nei and Li (1979) restriction fragments
121distance, this involves solving for <I>g</I> in the nonlinear
122equation
123<PRE>
124       g  =  [ f (3 - 2g) ]<SUP><SUP>1</SUP>/<SUB>4</SUB></SUP>
125</PRE>
126and then the distance is given by
127<PRE>
128       d  =  - (<SUP>2</SUP>/<SUB>r</SUB>) log<SUB>e</SUB>(g)
129</PRE>
130where <I>r</I> is the length of the restriction site.
131<P>
132Comparing these two restriction fragments distances in a case
133where their underlying DNA model is the same (which is when the
134transition/transversion ratio of the modified model is set to
1350.5), you will find
136that they are very close to each other, differing very little at
137small distances, with the modified distance become smaller than
138the Nei/Li distance at larger distances.  It will therefore matter
139very little which one you use.
140<P>
141<H2>A Comment About RAPDs and AFLPs</H2>
142<P>
143Although these distances are designed for restriction sites
144and restriction fragments data, they can be applied to RAPD and
145AFLP data as well.  RAPD (Randomly Amplified Polymorphic DNA)
146and AFLP (Amplified Fragment Length Polymorphism) data consist
147of presence or absence of individual bands on a gel.  The bands
148are segments of DNA with PCR primers at each end.  These
149primers are defined sequences of known length (often about
15010 nucleotides each). For AFLPs the reolevant length is the primer
151length, plus three nucleotides.  Mutation in these sequences makes them no
152longer be primers, just as in the case of restriction sites.
153Thus a pair of 10-nucleotide primers will behave much the same
154as a 20-nucleotide restriction site.  You can use the restriction
155sites distance as the distance between RAPD or AFLP patterns if you
156set the proper value for the total length of the site to the
157total length of the primers (plus 6 in the case of AFLPs).
158Of course there are many possible sources of noise in these data,
159including confusing fragments of similar length for each other
160and having primers near each other in the genome, and these are
161not taken into account in the statistical model used here.
162<P>
163<H2>INPUT FORMAT AND OPTIONS</H2>
164<P>
165The input is fairly standard, with one addition.  As usual the first line of
166the file gives the number of species and the number of sites, but there is also
167a third number, which is the number of different restriction enzymes that were
168used to detect the restriction sites.  Thus a data set with 10 species and
16935 different sites, representing digestion with 4 different enzymes, would
170have the first line of the data file look like this:
171<P>
172<PRE>
173   10   35    4
174</PRE>
175<P>
176The site data are in standard form.  Each species starts with a species name
177whose maximum length is given by the constant "nmlngth"
178(whose value in the
179program as distributed is 10 characters).  The name should, as usual, be padded
180out to that length with blanks if necessary.  The sites data then follows, one
181character per site (any blanks will
182be skipped and ignored).  Like the DNA and protein sequence data, the
183restriction sites data may be either in the "interleaved" form or the
184"sequential" form.  Note that if you are analyzing restriction sites
185data with the programs DOLLOP or MIX or other discrete character
186programs, at the moment those programs do not use the "aligned" or
187"interleaved" data format.  Therefore you may want to avoid that format
188when you have restriction sites data that you will want to feed into
189those programs.
190<P>
191The presence of a site is indicated by a "+" and the absence by a "-".  I have
192also allowed the use of "1" and "0" as synonyms for "+" and "-", for
193compatibility with MIX and DOLLOP which do not allow "+" and "-".  If the
194presence of
195the site is unknown (for example, if the DNA containing it has been deleted so
196that one
197does not know whether it would have contained the site) then the state "?" can
198be used to indicate that the state of this site is unknown.
199<P>
200The options are selected using an interactive menu.  The menu looks like this:
201<P>
202<TABLE><TR><TD BGCOLOR=white>
203<PRE>
204
205Restriction site or fragment distances, version 3.6a3
206
207Settings for this run:
208  R           Restriction sites or fragments?  Sites
209  G  Gamma distribution of rates among sites?  No
210  T            Transition/transversion ratio?  2.000000
211  S                              Site length?  6.0
212  L                  Form of distance matrix?  Square
213  M               Analyze multiple data sets?  No
214  I              Input sequences interleaved?  Yes
215  0       Terminal type (IBM PC, ANSI, none)?  (none)
216  1       Print out the data at start of run?  No
217  2     Print indications of progress of run?  Yes
218
219  Y to accept these or type the letter for one to change
220
221</PRE>
222</TD></TR></TABLE>
223<P>
224The user either types "Y" (followed, of course, by a carriage-return)
225if the settings shown are to be accepted, or the letter or digit corresponding
226to an option that is to be changed.
227<P>
228The R option toggles between a restriction sites distance, which
229is the default setting, and a restriction fragments distance.  In
230the latter case, another option appears, the N (Nei/Li) option.
231This allows the user to choose the original Nei and Li (1979)
232restriction fragments distance rather than my modified Nei/Li
233distance, which is the default.
234<P>
235If the G (Gamma distribution) option is selected, the user will be
236asked to supply the coefficient of variation of the rate of substitution
237among sites.  This is different from the parameters used by Nei and Jin, who
238introduced Gamma distribution of rates in DNA distances, but
239related to their parameters: their parameter <EM>a</EM> is also known as
240"alpha", the shape parameter of the Gamma distribution.  It is
241related to the coefficient of variation by
242<P>
243&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;CV = 1 / a<SUP>1/2</SUP>
244<P>
245or
246<P>
247&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;a = 1 / (CV)<SUP>2</SUP>
248<P>
249(their parameter <EM>b</EM> is absorbed here by the requirement that time is scaled so
250that the mean rate of evolution is 1 per unit time, which means that <EM>a = b</EM>).
251As we consider cases in which the rates are less variable we should set <EM>a</EM>
252larger and larger, as <EM>CV</EM> gets smaller and smaller.
253<P>
254The Gamma distribution option is not available when using the
255original Nei/Li restriction fragments distance.
256<P>
257The T option is the Transition/transversion option.  The user is prompted for
258a real number greater than 0.0, as the expected ratio of transitions to
259transversions.  Note
260that this is the resulting expected ratio of transitions to transversions.
261The default value of the T parameter if you do not use the T
262option is 2.0.  The T option is not available when you choose the original
263Nei/Li restriction fragment distance, which assumes a Jukes-Cantor (1969)
264model of DNA change, for which the transition/transversion ratio is
265in effect fixed at 0.5.
266<P>
267The S option selects the site length.  This is set to a default
268value of 6.  It can be set to any positive integer.  While in
269the RESTML program there is an upper limit on the restriction
270site length (set by memory limitations), in RESTDIST there is
271no effective limit on the size of the restriction sites.  A value
272of 20, which might be appropriate in many cases for RAPD or AFLP
273data, is typically not practical in RESTML, but it is useable in
274RESTDIST.
275<P>
276Option L specifies that the output file will have a square matrix
277of distances.  It can be used to change to lower-triangular
278data matrices.  This will usually not be
279necessary, but if the distance matrices are going to be very
280large, this alternative can reduce their size by half.  The
281programs which are to use them should then of course be informed
282that they can expect lower-triangular distance matrices.
283<P>
284The M, I, and 0 options are the usual Multiple data set,
285Interleaved input, and screen terminal type options.  These are
286described in the main documentation file.
287<P>
288Option 1 specifies that the input data will be written out
289on the output file before the distances.  This is off by
290default.  If it is done, it will make the output file unusable
291as input to our distance matrix programs.
292<P>
293Option 2 turns off or on the indications of the progress of the run.
294The program prints out a row of dots (".") indicating the
295calculation of individual distances.  Since the distance matrix
296is symmetrical, the program only computes the distances for the
297upper triangle of the distance matrix, and then duplicates
298the distance to the other corner of the matrix.  Thus the rows of
299dots start out of full length, and then egt shorter and shorter.
300<P>
301<H2>OUTPUT FORMAT</H2>
302<P>
303The output file contains on its first line the number of species. The
304distance matrix is then printed in standard
305form, with each species starting on a new line with the species name, followed
306by the distances to the species in order.  These continue onto a new line
307after every nine distances.  If the L option is used, the matrix or distances
308is in lower triangular form, so that only the distances to the other species
309that precede each species are printed.  Otherwise the distance matrix is square
310with zero distances on the diagonal.  In general the format of the distance
311matrix is such that it can serve as input to any of the distance matrix
312programs.
313<P>
314If the option to print out the data is selected, the output file will
315precede the data by more complete information on the input and the menu
316selections.  The output file begins by giving the number of species and the
317number of characters.
318<P>
319The distances printed out are scaled in terms of expected numbers of
320substitutions per DNA site, counting both transitions and transversions but not
321replacements of a base by itself, and scaled so that the average rate of
322change, averaged over all sites analyzed, is set to 1.0.  Thus when the
323G option is used, the rate of change at one site may be higher than
324at another, but their mean is expected to be 1.
325<P>
326<H2>PROGRAM CONSTANTS</H2>
327<P>
328The constants available to be changed are "initialv" and
329"iterationsr".  The constant "initialv" is the starting
330value of the distance in the iterations.  This will typically not need to
331be changed.  The constant "iterationsr" is the number of
332times that the Newton-Raphson method which is used to solve the
333equations for the distances is iterated.  The program can be
334speeded up by reducing the number of iterations from the default
335value of 20, but at the possible risk of computing the distance
336less accurately.
337<P>
338<H2>FUTURE OF THE PROGRAM</H2>
339<P>
340The present program does not compute the original distance of Nei and Li (1979)
341for restriction sites (though it does have an option to compute their original
342distance for restriction fragments).   I hope to add their restriction
343sites distance in the near future.
344<P>
345<HR>
346<P>
347<H3>TEST DATA SET</H3>
348<P>
349<TABLE><TR><TD BGCOLOR=white>
350<PRE>
351   5   13   2
352Alpha     ++-+-++--+++-
353Beta      ++++--+--+++-
354Gamma     -+--+-++-+-++
355Delta     ++-+----++---
356Epsilon   ++++----++---
357</PRE>
358</TD></TR></TABLE>
359<P>
360<HR>
361<H3>CONTENTS OF OUTPUT FILE (with all numerical options on)</H3>
362<P>
363(Note that when the options for displaying the input data are turned off,
364the output is in a form suitable for use as an input file in the distance
365matrix programs).<P>
366<P>
367<TABLE><TR><TD BGCOLOR=white>
368<PRE>
369
370    5 Species,   13 Sites
371
372Name            Sites
373----            -----
374
375Alpha        ++-+-++--+ ++-
376Beta         ++++--+--+ ++-
377Gamma        -+--+-++-+ -++
378Delta        ++-+----++ ---
379Epsilon      ++++----++ ---
380
381
382Alpha       0.0000  0.0224  0.1077  0.0688  0.0826
383Beta        0.0224  0.0000  0.1077  0.0688  0.0442
384Gamma       0.1077  0.1077  0.0000  0.1765  0.1925
385Delta       0.0688  0.0688  0.1765  0.0000  0.0197
386Epsilon     0.0826  0.0442  0.1925  0.0197  0.0000
387</PRE>
388</TD></TR></TABLE>
389</BODY>
390</HTML>
Note: See TracBrowser for help on using the repository browser.