source: trunk/GDE/PHYLIP/doc/gendist.html

Last change on this file was 2176, checked in by westram, 21 years ago

1<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 3.2 Final//EN">
2<HTML>
3<HEAD>
4<TITLE>gendist</TITLE>
5<META NAME="description" CONTENT="gendist">
6<META NAME="keywords" CONTENT="gendist">
7<META NAME="resource-type" CONTENT="document">
8<META NAME="distribution" CONTENT="global">
9<META HTTP-EQUIV="Content-Type" CONTENT="text/html; charset=iso-8859-1">
10</HEAD>
11<BODY BGCOLOR="#ccffff">
12<DIV ALIGN=RIGHT>
13version 3.6
14</DIV>
15<P>
16<DIV ALIGN=CENTER>
17<H1>GENDIST - Compute genetic distances from gene frequencies</H1>
18</DIV>
19<P>
20&#169; Copyright 1986-2002 by the University of
21Washington.  Written by Joseph Felsenstein.  Permission is granted to copy
22this document provided that no fee is charged for it and that this copyright
23notice is not removed.
24<P>
25This program computes any one of three measures of genetic distance from a set
26of gene frequencies in different populations (or species).  The three are
27Nei's genetic distance (Nei, 1972), Cavalli-Sforza's chord measure
28(Cavalli-Sforza and Edwards, 1967) and Reynolds, Weir, and Cockerham's (1983) genetic
29distance.  These are written to an output file in a format that can be read by
30the distance matrix phylogeny programs FITCH and KITSCH.
31<P>
32The three measures have somewhat different assumptions.  All assume that all
33differences between populations arise from genetic drift.  Nei's distance is
34formulated for an infinite isoalleles model of mutation, in which there is a
35rate of neutral mutation and each mutation is to a completely new allele.  It
36is assumed that all loci have the same rate of neutral mutation, and that the
37genetic variability initially in the population is at equilibrium between
38mutation and genetic drift, with the effective population size of each
39population remaining constant.
40<P>
41Nei's distance is:
42<P>
43<PRE>
44                                      __  __
45                                      \   \
46                                      /_  /_  p<SUB>1mi</SUB>   p<SUB>2mi</SUB>
47                                       m   i
48           D  =  - ln  <FONT SIZE=+3>(</FONT> ------------------------------------- <FONT SIZE=+3>)</FONT>.
49                           __  __              __  __             
50                           \   \               \   \
51                         [ /_  /_  p<SUB>1mi</SUB><SUP>2</SUP>]<SUP><SUP>1</SUP>/<SUB>2</SUB></SUP>   [ /_  /_  p<SUB>2mi</SUB><SUP>2</SUP>]<SUP><SUP>1</SUP>/<SUB>2</SUB></SUP>     
52                            m   i                m   i
53</PRE>
54<P>
55where <EM>m</EM> is summed over loci, <EM>i</EM> over alleles at the <EM>m</EM>-th locus, and where
56<P>
57&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;p<SUB>1mi</SUB>
58<P>
59is the frequency of the <EM>i</EM>-th allele at the <EM>m</EM>-th locus in population 1. 
60Subject to the above assumptions, Nei's genetic distance is expected, for a
61sample of sufficiently many equivalent loci, to rise linearly with time.
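<P>
As an illustration of the formula above, here is a minimal Python sketch
(not part of PHYLIP; the function names <EM>nei_distance</EM> and
<EM>expand</EM> are invented for this example).  It assumes each population
is given as a list of loci, each locus being a list of all its allele
frequencies.  The two populations are the first two rows of the test data
set at the end of this document, where every locus has two alleles and one
frequency is omitted.

```python
import math

def nei_distance(pop1, pop2):
    """Nei's distance between two populations.

    Each population is a list of loci; each locus is a list holding the
    frequencies of ALL alleles at that locus.
    """
    # Numerator: sum over loci m and alleles i of p1mi * p2mi
    cross = sum(p * q for l1, l2 in zip(pop1, pop2) for p, q in zip(l1, l2))
    # Denominator terms: sums of squared frequencies in each population
    self1 = sum(p * p for locus in pop1 for p in locus)
    self2 = sum(q * q for locus in pop2 for q in locus)
    return -math.log(cross / math.sqrt(self1 * self2))

def expand(freqs):
    """Restore the omitted allele for two-allele loci (hypothetical helper)."""
    return [[p, 1.0 - p] for p in freqs]

european = expand([0.2868, 0.5684, 0.4422, 0.4286, 0.3828,
                   0.7285, 0.6386, 0.0205, 0.8055, 0.5043])
african = expand([0.1356, 0.4840, 0.0602, 0.0397, 0.5977,
                  0.9675, 0.9511, 0.0600, 0.7582, 0.6207])

print("%.4f" % nei_distance(european, african))
```

The printed value can be compared with the European/African entry of the
test set output below.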
62<P>
63The other two genetic distances assume that there is no mutation, and that all
64gene frequency changes are by genetic drift alone.  However, they do not
65assume that population sizes have remained constant and equal in all
66populations.  They cope with changing population size by having expectations
67that rise linearly not with time, but with the sum over time of 1/N, where N
68is the effective population size.  Thus if population size doubles, genetic
69drift will be taking place more slowly, and the genetic distance will be
70expected to be rising only half as fast with respect to time.  These two genetic
71distances are different estimators of the same quantity under the same model.
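<P>
The dependence on the sum of 1/N can be made concrete with a few lines of
Python.  This is only an arithmetic illustration; the function name
<EM>cumulated_drift</EM> and the population sizes are made up for the
example.

```python
def cumulated_drift(sizes):
    """Sum of 1/N over generations, given one effective size per generation."""
    return sum(1.0 / n for n in sizes)

# Fifty generations at N = 1000 versus fifty generations at N = 2000
small = cumulated_drift([1000] * 50)
large = cumulated_drift([2000] * 50)

# Doubling the population size halves the drift accumulated in the same time
print(large / small)
```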
72<P>
73Cavalli-Sforza's chord distance is given by
74<P>
75<PRE>
76                   __              __                     __
77                   \               \                      \
78     D<SUP>2</SUP>    =    4  /_  [  1   -    /_   p<SUB>1mi</SUB><SUP><SUP>1</SUP>/<SUB>2</SUB></SUP> p <SUB>2mi</SUB><SUP><SUP>1</SUP>/<SUB>2</SUB></SUP>]  /  /_  (a<SUB>m</SUB>  - 1)
79                    m               i                        m
80</PRE>
81<P>
82<P>
83where m indexes the loci, where i is summed over the alleles at the m-th
84locus, and where a<SUB>m</SUB> is the number of alleles at the m-th locus.  It can be
85shown that this distance always satisfies the triangle
86inequality.  Note that as given here it is divided by the number of degrees
87of freedom, the sum of the numbers of alleles minus one.  The quantity which
88is expected to rise linearly with amount of
89genetic drift (sum of <EM>1/N</EM> over time) is <EM>D</EM> squared, the quantity computed
90above, and that is what is written out into the distance matrix.
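<P>
A minimal Python sketch of the chord measure as written above, under the
same assumption that every locus lists all of its allele frequencies.  The
function name <EM>chord_distance_squared</EM> is invented for this example
and is not taken from the program source.

```python
import math

def chord_distance_squared(pop1, pop2):
    """D squared for the chord measure, divided by the degrees of freedom."""
    total = 0.0
    degrees_of_freedom = 0
    for locus1, locus2 in zip(pop1, pop2):
        # 1 minus the sum over alleles of sqrt(p1mi) * sqrt(p2mi)
        total += 1.0 - sum(math.sqrt(p * q) for p, q in zip(locus1, locus2))
        # a_m - 1 for this locus
        degrees_of_freedom += len(locus1) - 1
    return 4.0 * total / degrees_of_freedom

# Two populations fixed for different alleles at a single two-allele locus
print(chord_distance_squared([[1.0, 0.0]], [[0.0, 1.0]]))
```

With these inputs the square-root products are all zero, so the result is
4 divided by the single degree of freedom.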
91<P>
92Reynolds, Weir, and Cockerham's (1983) genetic distance is
93<P>
94<PRE>
95
96                       __   __
97                       \    \
98                       /_   /_  [ p<SUB>1mi</SUB>     -  p<SUB>2mi</SUB>]<SUP>2</SUP>
99                        m    i                 
100       D<SUP>2</SUP>     =      --------------------------------------
101                         __              __
102                         \               \
103                      2  /_   [  1   -   /_  p<SUB>1mi</SUB>    p<SUB>2mi</SUB> ]
104                          m               i
105</PRE>
106<P>
107<P>
108where the notation is as before and D<SUP>2</SUP> is the quantity that is
109expected to rise linearly with cumulated genetic drift.
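<P>
The Reynolds, Weir, and Cockerham formula can be sketched in the same way.
Again this is an illustration only, with an invented function name, and it
assumes all allele frequencies are present at each locus.

```python
def reynolds_distance_squared(pop1, pop2):
    """D squared for the Reynolds, Weir, and Cockerham distance."""
    numerator = 0.0
    denominator = 0.0
    for locus1, locus2 in zip(pop1, pop2):
        # Sum over alleles of (p1mi - p2mi) squared
        numerator += sum((p - q) ** 2 for p, q in zip(locus1, locus2))
        # 1 minus the sum over alleles of p1mi * p2mi
        denominator += 1.0 - sum(p * q for p, q in zip(locus1, locus2))
    # Note: the denominator is zero if both populations are fixed for the
    # same allele at every locus; this sketch does not handle that case.
    return numerator / (2.0 * denominator)

# Two populations fixed for different alleles at a single two-allele locus
print(reynolds_distance_squared([[1.0, 0.0]], [[0.0, 1.0]]))
```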
110<P>
111Having computed one of these genetic distances, one which you feel is
112appropriate to the biology of the situation, you can use it as the input to
113the programs FITCH, KITSCH or NEIGHBOR.  Keep in mind that the statistical
114model in
115those programs implicitly assumes that the distances in the input table have
116independent errors.  For any measure of genetic distance this will not be true,
117as bursts of random genetic drift, or sampling events in drawing the sample of
118individuals from each population, cause fluctuations of gene frequency that
119affect many distances simultaneously.  While this is not expected to bias the
120estimate of the phylogeny, it does mean that the weighing of evidence from all
121the different distances in the table will not be done with maximal
122efficiency.  One issue is which value of the P
123(Power) parameter should be used.  This depends on how the variance of a
124distance rises with its expectation.  For Cavalli-Sforza's chord distance, and
125for the Reynolds et al. distance it can be shown that the variance of the
126distance will be proportional to the square of its expectation; this suggests
127a value of 2 for <EM>P</EM>, which is the default value for FITCH and KITSCH
128(there is no P option in NEIGHBOR).
129<P>
130If you think that the pure genetic drift model is appropriate, and are thus
131tempted to use the Cavalli-Sforza or Reynolds et al. distances, you might
132consider using the maximum likelihood program CONTML instead.  It will
133correctly weigh the evidence in that case.  Like those genetic distances, it
134uses approximations that break down as loci start to drift all the way to
135fixation.  Although Nei's distance will not break down in that case, it
136makes other assumptions about equality of substitution rates at all loci and
137constancy of population sizes.
138<P>
139The most important thing to remember is that genetic distance is not an
140abstract, idealized measure of "differentness".  It is an estimate of a
141parameter (time or cumulated inverse effective population size) of the
142model which is thought to have generated the differences we see.  As an
143estimate, it has statistical properties that can be assessed, and we should
144never have to choose between genetic distances based on their aesthetic
145properties, or on the personal prestige of their originators.  Considering them
146as estimates
147focuses us on the questions which genetic distances are intended to answer,
148for if there are none there is no reason to compute them.  For further
149perspective on genetic distances, I recommend my own paper evaluating
150Reynolds, Weir, and Cockerham (1983), and the material in Nei's book (Nei,
1511987).
152<P>
153<H2>INPUT FORMAT</H2>
154<P>
155The input to this program is standard and is as described in the Gene
156Frequencies and Continuous
157Characters Programs documentation file above.  It consists of the number of
158populations (or species), the number of loci,
159and after that a line containing the numbers of alleles at each of the
160loci.   Then the gene frequencies follow in standard format.
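<P>
To make the layout concrete, here is a rough Python sketch of a reader for
this kind of input.  It is not the program's own parser.  It assumes a name
field of 10 characters, whitespace-separated frequencies that may continue
onto following lines (as in the test data set below), and, unless
<EM>all_alleles</EM> is set, one omitted allele per locus.  The function
name and the sample data are invented for the example.

```python
def read_gene_frequencies(text, all_alleles=False, name_length=10):
    """Return {population name: list of loci, each with all allele frequencies}."""
    lines = iter(text.splitlines())
    n_pops, n_loci = (int(tok) for tok in next(lines).split())
    allele_counts = [int(tok) for tok in next(lines).split()]
    assert len(allele_counts) == n_loci
    # Number of frequencies actually present in the file for each locus
    per_locus = [a if all_alleles else a - 1 for a in allele_counts]
    needed = sum(per_locus)
    populations = {}
    for _ in range(n_pops):
        line = next(lines)
        name = line[:name_length].strip()
        values = [float(tok) for tok in line[name_length:].split()]
        while len(values) < needed:          # data continue on the next line
            values.extend(float(tok) for tok in next(lines).split())
        loci, start = [], 0
        for count in per_locus:
            locus = values[start:start + count]
            if not all_alleles:              # restore the omitted allele
                locus.append(1.0 - sum(locus))
            loci.append(locus)
            start += count
        populations[name] = loci
    return populations

sample = """    2    2
2 3
Alpha      0.25 0.10
0.60
Beta       0.50 0.20 0.30
"""
data = read_gene_frequencies(sample)
print(data["Alpha"])
```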
161<P>
162The options are selected using a menu:
163<P>
164<TABLE><TR><TD BGCOLOR=white>
165<PRE>
166
167Genetic Distance Matrix program, version 3.6a3
168
169Settings for this run:
170  A   Input file contains all alleles at each locus?  One omitted at each locus
171  N                        Use Nei genetic distance?  Yes
172  C                Use Cavalli-Sforza chord measure?  No
173  R                   Use Reynolds genetic distance?  No
174  L                         Form of distance matrix?  Square
175  M                      Analyze multiple data sets?  No
176  0              Terminal type (IBM PC, ANSI, none)?  (none)
177  1            Print indications of progress of run?  Yes
178
179  Y to accept these or type the letter for one to change
180
181</PRE>
182</TD></TR></TABLE>
183<P>
184The A (All alleles) option is described in the Gene Frequencies and
185Continuous Characters Programs documentation file.  As with CONTML, it is
186the signal that all alleles are represented in the gene frequency input,
187without one being left out per locus.  C, N, and R are the signals to
188use the Cavalli-Sforza, Nei, or Reynolds et. al. genetic distances
189respectively.  The Nei distance is the default, and it will be computed
190if none of these options is explicitly invoked.   The L option is the signal
191that the distance matrix is to be written out in Lower triangular form.
192The M option is the usual Multiple Data Sets option, useful for
193doing bootstrap analyses with the distance matrix programs.   It allows
194multiple data sets, but does not allow multiple sets of weights (since
195there is no provision for weighting in this program).
196<P>
197<H2>OUTPUT FORMAT</H2>
198<P>
199The output file simply contains on its first line the number of species (or
200populations).  Each
201species (or population) starts a new line, with its name printed out
202first, and then and there are up to nine
203genetic distances printed on each line, in the standard format used as input
204by the distance matrix programs.  The output, in its default form, is
205ready to be used in the distance matrix programs.
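<P>
The square output layout can be imitated with a short Python sketch.  The
field widths used here (5 characters for the count, 10 for the name, 8 per
distance with 4 decimals) are inferred from the test set output at the end
of this document rather than taken from the program source, and the way
continuation lines are started after nine distances is an assumption.

```python
def format_distance_matrix(names, matrix, per_line=9, name_length=10):
    """Format a square distance matrix in a PHYLIP-like layout (sketch)."""
    lines = ["%5d" % len(names)]
    for name, row in zip(names, matrix):
        # Name padded (or truncated) to a fixed-width field
        line = name.ljust(name_length)[:name_length]
        for k, distance in enumerate(row):
            # Start a new line after every per_line distances
            if k > 0 and k % per_line == 0:
                lines.append(line)
                line = ""
            line += "%8.4f" % distance
        lines.append(line)
    return "\n".join(lines) + "\n"

print(format_distance_matrix(["European", "African"],
                             [[0.0, 0.0780], [0.0780, 0.0]]), end="")
```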
206<P>
207<H2>CONSTANTS</H2>
208<P>
209The constants
210available to be changed by the user if the program is recompiled are
211"namelength", the length of a species name, set to 10 in the distribution,
212and "epsilon", which defines a small quantity that is used when checking
213whether allele frequencies at a locus sum to more than one: if all
214alleles are input (option A) and the sum differs from 1 by more than epsilon,
215or if not all alleles are input and the sum is greater than 1 by more
216than epsilon, the program will see this as an error and stop.  You may
217find this causes difficulties if your gene frequencies have been rounded.
218I have tried to keep epsilon from being too small to prevent such problems.
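<P>
The check described above amounts to the following logic, shown here as a
Python sketch.  The value given to <EM>EPSILON</EM> is a placeholder chosen
for the example, not necessarily the value compiled into the program, and
the function name is invented.

```python
EPSILON = 0.02  # placeholder value for illustration only

def locus_sum_is_acceptable(freqs, all_alleles, epsilon=EPSILON):
    """Return True if the frequencies read for one locus pass the check."""
    total = sum(freqs)
    if all_alleles:
        # Option A: the sum must be within epsilon of 1
        return abs(total - 1.0) <= epsilon
    # One allele omitted: the sum may fall short of 1 but must not exceed
    # it by more than epsilon
    return total <= 1.0 + epsilon

# Frequencies rounded to two decimals still pass with this tolerance
print(locus_sum_is_acceptable([0.33, 0.33, 0.33], all_alleles=True))
```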
219<P>
220<H2>RUN TIMES</H2>
221<P>
222The program is quite fast and the user should effectively never be limited by
223the amount of time it takes.  All that the program has to do is read in the
224gene frequency data and then evaluate a genetic distance formula for each
225pair of species.  This should require an amount of
226effort proportional to the total number of alleles over loci, and to the
227square of the number of populations.
228<P>
229<H2>FUTURE OF THIS PROGRAM</H2>
230<P>
231The main change that will be made to this program in the future is to add
232provisions for taking into account the sample size for each population.  The
233genetic distance formulas have been modified by their inventors to correct for
234the inaccuracy of the estimate of the genetic distances, which on the whole
235should artificially increase the distance between populations by a small
236amount dependent on the sample sizes.  The main difficulty with doing this is
237that I have not yet settled on a format for putting the sample size in the
238input data along with the gene frequency data for a species or population.
239<P>
240I may also include other distance measures, but only if I think their use is
241justified.  There are many very arbitrary genetic distances, and I am
242reluctant to include most of them.
243<P>
244<HR>
245<P>
246<H3>TEST DATA SET</H3>
247<P>
248<TABLE><TR><TD BGCOLOR=white>
249<PRE>
250    5    10
2512 2 2 2 2 2 2 2 2 2
252European   0.2868 0.5684 0.4422 0.4286 0.3828 0.7285 0.6386 0.0205
2530.8055 0.5043
254African    0.1356 0.4840 0.0602 0.0397 0.5977 0.9675 0.9511 0.0600
2550.7582 0.6207
256Chinese    0.1628 0.5958 0.7298 1.0000 0.3811 0.7986 0.7782 0.0726
2570.7482 0.7334
258American   0.0144 0.6990 0.3280 0.7421 0.6606 0.8603 0.7924 0.0000
2590.8086 0.8636
260Australian 0.1211 0.2274 0.5821 1.0000 0.2018 0.9000 0.9837 0.0396
2610.9097 0.2976
262</PRE>
263</TD></TR></TABLE>
264<P>
265<HR>
266<P>
267<H3>TEST SET OUTPUT</H3>
268<P>
269<TABLE><TR><TD BGCOLOR=white>
270<PRE>
271    5
272European    0.0000  0.0780  0.0807  0.0668  0.1030
273African     0.0780  0.0000  0.2347  0.1050  0.2273
274Chinese     0.0807  0.2347  0.0000  0.0539  0.0633
275American    0.0668  0.1050  0.0539  0.0000  0.1348
276Australian  0.1030  0.2273  0.0633  0.1348  0.0000
277</PRE>
278</TD></TR></TABLE>
279<P>
280</BODY>
281</HTML>