source: branches/stable/GDE/PHYLIP/doc/dnadist.html

Last change on this file was 2176, checked in by westram, 21 years ago

* empty log message *

  • Property svn:eol-style set to native
  • Property svn:keywords set to Author Date Id Revision
File size: 23.9 KB
Line 
1<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 3.2 Final//EN">
2<HTML>
3<HEAD>
4<TITLE>dnadist</TITLE>
5<META NAME="description" CONTENT="dnadist">
6<META NAME="keywords" CONTENT="dnadist">
7<META NAME="resource-type" CONTENT="document">
8<META NAME="distribution" CONTENT="global">
9<META HTTP-EQUIV="Content-Type" CONTENT="text/html; charset=iso-8859-1">
10</HEAD>
11<BODY BGCOLOR="#ccffff">
12<DIV ALIGN=RIGHT>
13version 3.6
14</DIV>
15<P>
16<DIV ALIGN=CENTER>
17<H1>DNADIST -- Program to compute distance matrix<BR>from nucleotide sequences</H1>
18</DIV>
19<P>
20&#169; Copyright 1986-2002 by the University of
21Washington.  Written by Joseph Felsenstein.  Permission is granted to copy
22this document provided that no fee is charged for it and that this copyright
23notice is not removed.
24<P>
25This program uses nucleotide sequences to compute a distance matrix, under
26four different models of nucleotide substitution.  It can also
27compute a table of similarity between the nucleotide sequences.
28The distance for each
29pair of species estimates the total branch length between the two species, and
30can be used in the distance matrix programs FITCH, KITSCH or NEIGHBOR.  This
31is an alternative to use of the sequence data itself in the maximum likelihood
32program DNAML or the parsimony program DNAPARS.
33<P>
34The program reads in nucleotide sequences and writes an output file containing
35the distance matrix, or else a table of similarity between sequences.
36The four models of nucleotide substitution are those
37of Jukes and Cantor (1969), Kimura (1980), the F84 model (Kishino and Hasegawa,
381989; Felsenstein and Churchill, 1996), and the model underlying the
39LogDet distance (Barry and Hartigan, 1987; Lake, 1994;
40Steel, 1994; Lockhart et. al., 1994).
41All except the LogDet distance can be made to allow for
42for unequal rates of substitution at different sites, as Jin and Nei
43(1990) did for the Jukes-Cantor model.
44The program correctly takes
45into account a variety of sequence ambiguities, although in cases where they
46exist it can be slow.
47<P>
48Jukes and Cantor's (1969) model assumes that there is independent change at
49all sites, with equal probability.  Whether a base changes is independent of
50its identity, and when it changes there is an equal probability of ending up
51with each of the other three bases.  Thus the transition probability matrix
52(this is a technical term from probability theory and has nothing to do with
53transitions as opposed to transversions) for a short period of time dt is:
54<P>
55<PRE>
56              To:    A        G        C        T
57                   ---------------------------------
58               A  | 1-3a      a         a       a
59       From:   G  |  a       1-3a       a       a
60               C  |  a        a        1-3a     a
61               T  |  a        a         a      1-3a
62</PRE>
63<P>
64where <EM>a</EM> is <EM>u&nbsp;dt</EM>, the product of the rate of substitution per unit time (<EM>u</EM>)
65and the length <EM>dt</EM> of the time interval.  For longer periods of time this
66implies that the probability that two sequences will differ at a given site
67is:
68<P>
69&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;p   =    <SUP>3</SUP>/<SUB>4</SUB> ( 1   -    e<SUP>- 4/3 u t</SUP>)
70<P>
71and hence that if we observe <EM>p</EM>, we can compute an estimate of the branch
72length <EM>ut</EM> by inverting this to get
73<P>
74&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;ut   =   - <SUP>3</SUP>/<SUB>4</SUB> log<SUB>e</SUB> ( 1  -  <SUP>4</SUP>/<SUB>3</SUB> p )
75<P>
76The Kimura "2-parameter" model is almost as symmetric as this, but allows
77for a difference between transition and transversion rates.  Its transition
78probability matrix for a short interval of time is:
79<P>
80<PRE>
81              To:     A        G        C        T
82                   ---------------------------------
83               A  | 1-a-2b     a         b       b
84       From:   G  |   a      1-a-2b      b       b
85               C  |   b        b       1-a-2b    a
86               T  |   b        b         a     1-a-2b
87</PRE>
88<P>
89<P>
90where <EM>a</EM> is <EM>u dt</EM>, the product of the rate of transitions per unit time and <EM>dt</EM> 
91is the length <EM>dt</EM> of the time interval, and <EM>b</EM> is <EM>v dt</EM>, the product of half the
92rate of transversions (i.e., the rate of a specific transversion)
93and the length dt of the time interval.
94<P>
95The F84 model incorporates different rates of transition
96and transversion, but also allowing for different frequencies of the four
97nucleotides.  It is the model which is used in DNAML, the maximum likelihood
98nucelotide sequence phylogenies program in this package.  You will find the
99model described in the document for that program.  The transition
100probabilities for this model are given by Kishino and Hasegawa (1989),
101and further explained in a paper by me and Gary Churchill (1996).
102<P>
103The LogDet distance allows a fairly general model of substitution.  It
104computes the distance from the determinant of the empirically observed
105matrix of joint probabilities of nucleotides in the two species.  An
106explanation of it is available in the chapter by Swofford et, al. (1996).
107<P>
108The first three models are closely related.  The DNAML model reduces to
109Kimura's
110two-parameter model if we assume that the equilibrium frequencies of the four
111bases are equal.  The Jukes-Cantor model in turn is a special case of the
112Kimura 2-parameter model where a = b.  Thus each model is a special case of
113the ones that follow it, Jukes-Cantor being a special case of both of the
114others.
115<P>
116The Jin and Nei (1990) correction for variation in rate of evolution from
117site to site can be adapted to all of the first three models.
118It assumes that the rate of substitution varies from site to site
119according to a gamma distribution, with a coefficient of variation that
120is specified by the user.  The user is asked for it when choosing this option
121in the menu.
122<P>
123Each distance that is calculated is an estimate, from that particular pair of
124species, of the divergence time between those two species.  For the Jukes-
125Cantor model, the estimate is computed using the formula for <EM>ut</EM> given above,
126as long as the nucleotide symbols in the two sequences are all either A, C, G,
127T, U, N, X, ?, or - (the latter four indicate a deletion or an unknown
128nucleotide.  This estimate is a maximum likelihood estimate for that
129model.  For the Kimura 2-parameter model, with only these nucleotide symbols,
130formulas special to that estimate are also computed.  These are also,
131in effect, computing the maximum likelihood estimate for that model.  In
132the Kimura case it depends on the
133observed sequences only through the sequence length and the observed number of
134transition and transversion differences between those two sequences.   The
135calculation in that case is a maximum likelihood estimate and will differ
136somewhat from the estimate obtained from the formulas in Kimura's original
137paper.  That formula was also a maximum likelihood estimate, but with the
138transition/transversion ratio estimated empirically, separately for each pair
139of sequences.  In the present case, one overall preset transition/transversion
140ratio is used which makes the computations harder but achieves greater
141consistency between different comparisons.
142<P>
143For the
144F84 model, or for any of the models where one or both sequences contain at
145least one of the other ambiguity codons such as Y, R, etc., a maximum
146likelihood calculation is also done using code which was originally written
147for DNAML.  Its disadvantage is that it is slow.  The resulting
148distance is in effect a maximum likelihood estimate of the divergence time
149(total branch length between) the two sequences.  However the present
150program will be much faster than versions earlier than 3.5, because I have
151speeded up the iterations.
152<P>
153The LogDet model computes the distance from the determinant of the
154matrix of co-occurrence of nucleotides in the two species, according to
155the formula
156<PRE>
157   D  = - <SUP>1</SUP>/<SUB>4</SUB>(log<SUB>e</SUB>(|F|) - <SUP>1</SUP>/<SUB>2</SUB>log<SUB>e</SUB>(f<SUB>A</SUB><SUP>1</SUP>f<SUB>C</SUB><SUP>1</SUP>f<SUB>G</SUB><SUP>1</SUP>f<SUB>T</SUB><SUP>1</SUP>f<SUB>A</SUB><SUP>2</SUP>f<SUB>C</SUB><SUP>2</SUP>f<SUB>G</SUB><SUP>2</SUP>f<SUB>T</SUB><SUP>2</SUP>))
158</PRE>
159Where <EM>F</EM> is a matrix whose <EM>(i,j)</EM> element is the fraction
160of sites at which base <EM>i</EM> occurs in one species and base  <EM>j</EM>
161occurs in the other.  f<SUB>j</SUB><SUP>i</SUP> is the fraction of sites at
162which species <EM>i</EM> has base <EM>j</EM>.
163The LogDet distance cannot cope with ambiguity codes.  It must have
164completely defined sequences.  One limitation of the LogDet distance is
165that it may be infinite sometimes, if there are too many changes between
166certain pairs of nucleotides.  This can be particularly noticeable with
167distances computed from bootstrapped sequences.
168<P>
169Note that there is an
170assumption that we are looking at all sites, including those
171that have not changed at all.  It is important not to restrict attention
172to some sites based on whether or not they have changed; doing that
173would bias the distances by making them too large, and that in turn
174would cause the distances
175to misinterpret the meaning of those sites that
176had changed.
177<P>
178For all of these
179distance methods, the program allows us to specify
180that "third position" bases have a different rate of substitution than first and
181second positions, that introns have a different rate than exons, and so on.  The
182Categories option which does this allows us to make up to
1839 categories of sites and specify different rates of change for them.
184<P>
185In addition to the four distance calculations, the program can also
186compute a table of similarities between nucleotide sequences.  These values
187are the fractions of sites identical between the sequences.
188The diagonal values are 1.0000.  No attempt is made to count similarity
189of nonidentical nucleotides, so that no credit is given for having
190(for example) different purines at corresponding
191sites in the two sequences.  This option has been requested by many
192users, who need it for descriptive purposes.  It is not intended that
193the table be used for inferring the tree.
194<P>
195<H2>INPUT FORMAT AND OPTIONS</H2>
196<P>
197Input is fairly standard, with one addition.  As usual the first line of the
198file gives the number of species and the number of sites.
199<P>
200Next come the species data.  Each
201sequence starts on a new line, has a ten-character species name
202that must be blank-filled to be of that length, followed immediately
203by the species data in the one-letter code.  The sequences must either
204be in the "interleaved" or "sequential" formats
205described in the Molecular Sequence Programs document.  The I option
206selects between them.  The sequences can have internal
207blanks in the sequence but there must be no extra blanks at the end of the
208terminated line.  Note that a blank is not a valid symbol for a deletion --
209neither is dot (".").
210<P>
211The options are selected using an interactive menu.  The menu looks like this:
212<P>
213<TABLE><TR><TD BGCOLOR=white>
214<PRE>
215
216Nucleic acid sequence Distance Matrix program, version 3.6a3
217
218Settings for this run:
219  D  Distance (F84, Kimura, Jukes-Cantor, LogDet)?  F84
220  G          Gamma distributed rates across sites?  No
221  T                 Transition/transversion ratio?  2.0
222  C            One category of substitution rates?  Yes
223  W                         Use weights for sites?  No
224  F                Use empirical base frequencies?  Yes
225  L                       Form of distance matrix?  Square
226  M                    Analyze multiple data sets?  No
227  I                   Input sequences interleaved?  Yes
228  0            Terminal type (IBM PC, ANSI, none)?  (none)
229  1             Print out the data at start of run  No
230  2           Print indications of progress of run  Yes
231
232  Y to accept these or type the letter for one to change
233
234</PRE>
235</TD></TR></TABLE>
236<P>
237The user either types "Y" (followed, of course, by a carriage-return)
238if the settings shown are to be accepted, or the letter or digit corresponding
239to an option that is to be changed.
240<P>
241The D option selects one of the four distance methods, or the
242similarity table.  It toggles among the
243five methods. The default method, if none is specified, is the F84
244model.
245<P>
246If the G (Gamma distribution) option is selected, the user will be
247asked to supply the coefficient of variation of the rate of substitution
248among sites.  This is different from the parameters used by Nei and Jin but
249related to them:  their parameter <EM>a</EM> is also known as "alpha",
250the shape parameter of the Gamma distribution.  It is
251related to the coefficient of variation by
252<P>
253&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;CV = 1 / a<SUP>1/2</SUP>
254<P>
255or
256<P>
257&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;a = 1 / (CV)<SUP>2</SUP>
258<P>
259(their parameter <EM>b</EM> is absorbed here by the requirement that time is scaled so
260that the mean rate of evolution is 1 per unit time, which means that <EM>a = b</EM>).
261As we consider cases in which the rates are less variable we should set <EM>a</EM>
262larger and larger, as <EM>CV</EM> gets smaller and smaller.
263<P>
264The F (Frequencies) option appears when the Maximum Likelihood distance is
265selected.  This distance requires that the program be provided with the
266equilibrium frequencies of the four bases A, C, G, and T (or U).  Its default
267setting is one which may save users much time.  If you
268want to use the empirical frequencies of the bases, observed in the input
269sequences, as the base frequencies, you simply use the default setting of
270the F option.  These empirical
271frequencies are not really the maximum likelihood estimates of the base
272frequencies, but they will often be close to those values (what they are is
273maximum likelihood estimates under a "star" or "explosion" phylogeny).
274If you change the setting of the F option you will be prompted for the
275frequencies of the four bases.  These must add to 1 and are to be typed on
276one line separated by blanks, not commas.
277<P>
278The T option in this program does not stand for Threshold,
279but instead is the Transition/transversion option.  The user is prompted for
280a real number greater than 0.0, as the expected ratio of transitions to
281transversions.  Note
282that this is not the ratio of the first to the second kinds of events,
283but the resulting expected ratio of transitions to transversions.  The exact
284relationship between these two quantities depends on the frequencies in the
285base pools.  The default value of the T parameter if you do not use the T
286option is 2.0.
287<P>
288The C option allows user-defined rate categories.  The user is prompted
289for the number of user-defined rates, and for the rates themselves,
290which cannot be negative but can be zero.  These numbers, which must be
291nonnegative (some could be 0),
292are defined relative to each other, so that if rates for three categories
293are set to 1 : 3 : 2.5 this would have the same meaning as setting them
294to 2 : 6 : 5.
295The assignment of rates to
296sites is then made by reading a file whose default name is "categories".
297It should contain a string of digits 1 through 9.  A new line or a blank
298can occur after any character in this string.  Thus the categories file
299might look like this:
300<P>
301<PRE>
302122231111122411155
3031155333333444
304</PRE>
305<P>
306The L option specifies that the output file is to have the distance
307matrix in lower triangular form.
308<P>
309The W (Weights) option is invoked in the usual way, with only weights 0
310and 1 allowed.  It selects a set of sites to be analyzed, ignoring the
311others.  The sites selected are those with weight 1.  If the W option is
312not invoked, all sites are analyzed.
313The Weights (W) option
314takes the weights from a file whose default name is "weights".  The weights
315follow the format described in the main documentation file.
316<P>
317The M (multiple data sets) option will ask you whether you want to
318use multiple sets of weights (from the weights file) or multiple data sets
319from the input file.
320The ability to use a single data set with multiple weights means that
321much less disk space will be used for this input data.  The bootstrapping
322and jackknifing tool Seqboot has the ability to create a weights file with
323multiple weights.  Note also that when we use multiple weights for
324bootstrapping we can also then maintain different rate categories for
325different sites in a meaningful way.  You should not use the multiple
326data sets option without using multiple weights, you should not at the
327same time use the user-defined rate categories option (option C).
328<P>
329The options 0 is the usual one. It is described in the
330main documentation file of this package.  Option I is the same as in
331other molecular sequence programs and is described in the documentation file
332for the sequence programs.
333<P>
334<H2>OUTPUT FORMAT</H2>
335<P>
336As the
337distances are computed, the program prints on your screen or terminal
338the names of the species in turn,
339followed by one dot (".") for each other species for which the distance to
340that species has been computed.  Thus if there are ten species, the first
341species name is printed out, followed by nine dots, then on the next line
342the next species name is printed out followed by eight dots, then the
343next followed by seven dots, and so on.  The pattern of dots should form
344a triangle.  When the distance matrix has been written out to the output
345file, the user is notified of that.
346<P>
347The output file contains on its first line the number of species. The
348distance matrix is then printed in standard
349form, with each species starting on a new line with the species name, followed
350by the distances to the species in order.  These continue onto a new line
351after every nine distances.  If the L option is used, the matrix or distances
352is in lower triangular form, so that only the distances to the other species
353that precede each species are printed.  Otherwise the distance matrix is square
354with zero distances on the diagonal.  In general the format of the distance
355matrix is such that it can serve as input to any of the distance matrix
356programs.
357<P>
358If the option to print out the data is selected, the output file will
359precede the data by more complete information on the input and the menu
360selections.  The output file begins by giving the number of species and the
361number of characters, and the identity of the distance measure that is
362being used.
363<P>
364If the C (Categories) option is used a table of the relative rates of
365expected substitution at each category of sites is printed, and a listing
366of the categories each site is in.
367<P>
368There will then follow the equilibrium frequencies of the four bases.
369If the Jukes-Cantor or Kimura distances are used, these will necessarily be
3700.25 : 0.25 : 0.25 : 0.25.  The output then shows
371the transition/transversion ratio that was specified or
372used by default.  In the case of the Jukes-Cantor distance this will always
373be 0.5.  The transition-transversion parameter (as opposed to the ratio)
374is also printed out: this is used within the program and can be ignored.
375There then follow the data sequences, with the base sequences printed in
376groups of ten bases along the lines of the Genbank and EMBL formats.
377<P>
378The distances printed out are scaled in terms of expected numbers of
379substitutions, counting both transitions and transversions but not
380replacements of a base by itself, and scaled so that the average rate of
381change, averaged over all sites analyzed, is set to 1.0
382if there are multiple categories of sites.  This means that whether or not
383there are multiple categories of sites, the expected fraction of change
384for very small branches is equal to the branch length.  Of course,
385when a branch is twice as
386long this does not mean that there will be twice as much net change expected
387along it, since some of the changes may occur in the same site and overlie or
388even reverse each
389other.  The branch lengths estimates here are in terms of the expected
390underlying numbers of changes.  That means that a branch of length 0.26
391is 26 times as long as one which would show a 1% difference between
392the nucleotide sequences at the beginning and end of the branch.  But we
393would not expect the sequences at the beginning and end of the branch to be
39426% different, as there would be some overlaying of changes.
395<P>
396One problem that can arise is that two or more of the species can be so
397dissimilar that the distance between them would have to be infinite, as
398the likelihood rises indefinitely as the estimated divergence time
399increases.  For example, with the Jukes-Cantor model, if the two sequences
400differ in 75% or more of their positions then the estimate of dovergence
401time would be infinite.  Since there is no way to represent an infinite
402distance in the output file, the program regards this as an error, issues an
403error message indicating which pair of species are causing the problem, and
404stops.  It might be that, had it continued running, it would have also run
405into the same problem with other pairs of species.  If the Kimura
406distance is being used there may be no error message; the program may
407simply give a large distance value (it is iterating towards
408infinity and the value is just where the iteration stopped).  Likewise
409some maximum likelihood estimates may also become large for the same
410reason (the sequences showing more divergence than is expected even with
411infinite branch length).  I hope in the future to add more warning
412messages that would alert the user the this.
413<P>
414If the similarity table is selected, the table that is produced is not
415in a format that can be used as input to the distance matrix programs.
416it has a heading, and the species names are also put at the tops of the
417columns of the table (or rather, the first 8 characters of each species
418name is there, the other two characters omitted to save space).  There
419is not an option to put the table into a format that can be read by
420the distance matrix programs, nor is there one to make it into a table
421of fractions of difference by subtracting the similarity values from 1.
422This is done deliberately to make it more difficult for the use to
423use these values to construct trees.  The similarity values are
424not corrected for multiple changes, and their use to construct trees
425(even after converting them to fractions of difference) would be
426wrong, as it would lead to severe conflict between the distant
427pairs of sequences and the close pairs of sequences.
428<P>
429<H2>PROGRAM CONSTANTS</H2>
430<P>
431The constants that are available to be changed by the user at the beginning
432of the program include
433"maxcategories", the maximum number of site
434categories, "iterations", which controls the number of times
435the program iterates the EM algorithm that is used to do the maximum
436likelihood distance, "namelength", the length of species names in
437characters, and "epsilon", a parameter which controls the accuracy of the
438results of the iterations which estimate the distances.  Making "epsilon"
439smaller will increase run times but result in more decimal places of
440accuracy.  This should not be necessary.
441<P>
442The program spends most of its time doing real arithmetic.
443The algorithm, with separate and independent computations
444occurring for each pattern, lends itself readily to parallel processing.
445<P>
446<HR>
447<P>
448<H3>TEST DATA SET</H3>
449<P>
450<TABLE><TR><TD BGCOLOR=white>
451<PRE>
452   5   13
453Alpha     AACGTGGCCACAT
454Beta      AAGGTCGCCACAC
455Gamma     CAGTTCGCCACAA
456Delta     GAGATTTCCGCCT
457Epsilon   GAGATCTCCGCCC
458</PRE>
459</TD></TR></TABLE>
460<P>
461<HR>
462<H3>CONTENTS OF OUTPUT FILE (with all numerical options on)</H3>
463<P>
464(Note that when the options for displaying the input data are turned off,
465the output is in a form suitable for use as an input file in the distance
466matrix programs).<P>
467<P>
468<TABLE><TR><TD BGCOLOR=white>
469<PRE>
470
471Nucleic acid sequence Distance Matrix program, version 3.6a3
472
473 5 species,  13  sites
474
475  F84 Distance
476
477Transition/transversion ratio =   2.000000
478
479Name            Sequences
480----            ---------
481
482Alpha        AACGTGGCCA CAT
483Beta         AAGGTCGCCA CAC
484Gamma        CAGTTCGCCA CAA
485Delta        GAGATTTCCG CCT
486Epsilon      GAGATCTCCG CCC
487
488
489
490Empirical Base Frequencies:
491
492   A       0.24615
493   C       0.36923
494   G       0.21538
495  T(U)     0.16923
496
497Alpha       0.0000  0.3039  0.8575  1.1589  1.5429
498Beta        0.3039  0.0000  0.3397  0.9135  0.6197
499Gamma       0.8575  0.3397  0.0000  1.6317  1.2937
500Delta       1.1589  0.9135  1.6317  0.0000  0.1659
501Epsilon     1.5429  0.6197  1.2937  0.1659  0.0000
502</PRE>
503</TD></TR></TABLE>
504</BODY>
505</HTML>
Note: See TracBrowser for help on using the repository browser.