1 | <!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 3.2 Final//EN"> |
---|
2 | <HTML> |
---|
3 | <HEAD> |
---|
4 | <TITLE>gendist</TITLE> |
---|
5 | <META NAME="description" CONTENT="gendist"> |
---|
6 | <META NAME="keywords" CONTENT="gendist"> |
---|
7 | <META NAME="resource-type" CONTENT="document"> |
---|
8 | <META NAME="distribution" CONTENT="global"> |
---|
9 | <META HTTP-EQUIV="Content-Type" CONTENT="text/html; charset=iso-8859-1"> |
---|
10 | </HEAD> |
---|
11 | <BODY BGCOLOR="#ccffff"> |
---|
12 | <DIV ALIGN=RIGHT> |
---|
13 | version 3.6 |
---|
14 | </DIV> |
---|
15 | <P> |
---|
16 | <DIV ALIGN=CENTER> |
---|
17 | <H1>GENDIST - Compute genetic distances from gene frequencies</H1> |
---|
18 | </DIV> |
---|
19 | <P> |
---|
20 | © Copyright 1986-2002 by the University of |
---|
21 | Washington. Written by Joseph Felsenstein. Permission is granted to copy |
---|
22 | this document provided that no fee is charged for it and that this copyright |
---|
23 | notice is not removed. |
---|
24 | <P> |
---|
25 | This program computes any one of three measures of genetic distance from a set |
---|
26 | of gene frequencies in different populations (or species). The three are |
---|
27 | Nei's genetic distance (Nei, 1972), Cavalli-Sforza's chord measure (Cavalli- |
---|
28 | Sforza and Edwards, 1967) and Reynolds, Weir, and Cockerham's (1983) genetic |
---|
29 | distance. These are written to an output file in a format that can be read by |
---|
30 | the distance matrix phylogeny programs FITCH and KITSCH. |
---|
31 | <P> |
---|
32 | The three measures have somewhat different assumptions. All assume that all |
---|
33 | differences between populations arise from genetic drift. Nei's distance is |
---|
34 | formulated for an infinite isoalleles model of mutation, in which there is a |
---|
rate of neutral mutation and each mutant is to a completely new allele. It
---|
36 | is assumed that all loci have the same rate of neutral mutation, and that the |
---|
37 | genetic variability initially in the population is at equilibrium between |
---|
38 | mutation and genetic drift, with the effective population size of each |
---|
39 | population remaining constant. |
---|
40 | <P> |
---|
41 | Nei's distance is: |
---|
42 | <P> |
---|
43 | <PRE> |
---|
D = -\ln\left( \frac{\sum_m \sum_i p_{1mi}\, p_{2mi}}{\left[ \sum_m \sum_i p_{1mi}^2 \right]^{1/2} \left[ \sum_m \sum_i p_{2mi}^2 \right]^{1/2}} \right)
---|
53 | </PRE> |
---|
54 | <P> |
---|
55 | where <EM>m</EM> is summed over loci, <EM>i</EM> over alleles at the <EM>m</EM>-th locus, and where |
---|
56 | <P> |
---|
57 | p<SUB>1mi</SUB> |
---|
58 | <P> |
---|
59 | is the frequency of the <EM>i</EM>-th allele at the <EM>m</EM>-th locus in population 1. |
---|
60 | Subject to the above assumptions, Nei's genetic distance is expected, for a |
---|
61 | sample of sufficiently many equivalent loci, to rise linearly with time. |
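<P>
As a rough illustration of the formula above, here is a minimal Python sketch.
It is not the program's own code: the function names <EM>nei_distance</EM> and
<EM>complete</EM> are hypothetical, and each population is assumed to be a list
of loci, each locus a list holding the frequencies of all its alleles. The
frequencies are the European and African rows of the test data set below, with
the omitted second allele at each locus restored.

```python
import math

def nei_distance(pop1, pop2):
    """Nei's distance between two populations.

    Each population is a list of loci; each locus is a list of the
    frequencies of all alleles at that locus (an assumed data layout).
    """
    # Numerator: sum over loci m and alleles i of p1mi * p2mi
    cross = sum(p * q for l1, l2 in zip(pop1, pop2) for p, q in zip(l1, l2))
    # Denominator terms: sums of squared frequencies within each population
    self1 = sum(p * p for locus in pop1 for p in locus)
    self2 = sum(q * q for locus in pop2 for q in locus)
    return -math.log(cross / math.sqrt(self1 * self2))

def complete(freqs):
    """Expand two-allele loci given as one frequency each into [p, 1 - p]."""
    return [[p, 1.0 - p] for p in freqs]

european = complete([0.2868, 0.5684, 0.4422, 0.4286, 0.3828,
                     0.7285, 0.6386, 0.0205, 0.8055, 0.5043])
african = complete([0.1356, 0.4840, 0.0602, 0.0397, 0.5977,
                    0.9675, 0.9511, 0.0600, 0.7582, 0.6207])

print(round(nei_distance(european, african), 4))
```

The printed value should be close to the European-African entry in the test
set output below, which was computed with the default Nei distance.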
---|
62 | <P> |
---|
The other two genetic distances assume that there is no mutation, and that all
gene frequency changes are by genetic drift alone. However, they do not
assume that population sizes have remained constant and equal in all
populations. They cope with changing population size by having expectations
that rise linearly not with time, but with the sum over time of 1/N, where N
is the effective population size. Thus if population size doubles, genetic
drift will be taking place more slowly, and the genetic distance will be
expected to be rising only half as fast with respect to time. The two
distances are different estimators of the same quantity under the same model.
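<P>
The effect of population size on the sum of 1/N can be seen in a small Python
sketch. The function name <EM>cumulative_drift</EM> and the population sizes
are made up for illustration; only the quantity being summed comes from the
text above.

```python
def cumulative_drift(sizes):
    """Sum of 1/N over generations.

    sizes holds the effective population size in each generation.
    """
    return sum(1.0 / n for n in sizes)

# 200 generations at a constant size, and at twice that size
constant = cumulative_drift([1000] * 200)
doubled = cumulative_drift([2000] * 200)

# A population that doubles halfway through the same period
changing = cumulative_drift([1000] * 100 + [2000] * 100)

print(constant, doubled, changing)
```

Doubling the size throughout halves the accumulated drift, and the population
that doubles halfway through falls between the two constant-size cases.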
---|
72 | <P> |
---|
73 | Cavalli-Sforza's chord distance is given by |
---|
74 | <P> |
---|
75 | <PRE> |
---|
D^2 = \frac{4 \sum_m \left[ 1 - \sum_i p_{1mi}^{1/2}\, p_{2mi}^{1/2} \right]}{\sum_m (a_m - 1)}
---|
80 | </PRE> |
---|
81 | <P> |
---|
82 | <P> |
---|
where <EM>m</EM> indexes the loci, where <EM>i</EM> is summed over the alleles at the <EM>m</EM>-th
locus, and where a<SUB>m</SUB> is the number of alleles at the <EM>m</EM>-th locus. It can be
shown that this distance always satisfies the triangle
inequality. Note that as given here it is divided by the number of degrees
of freedom, the sum over loci of the number of alleles minus one. The quantity which
is expected to rise linearly with amount of
genetic drift (sum of <EM>1/N</EM> over time) is <EM>D</EM> squared, the quantity computed
above, and that is what is written out into the distance matrix.
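<P>
A minimal Python sketch of the chord measure as defined above follows. The
function name <EM>chord_distance_squared</EM> is hypothetical, and populations
are assumed to be lists of loci, each locus listing the frequencies of all of
its alleles. It returns <EM>D</EM> squared, the quantity described as being
written to the distance matrix.

```python
import math

def chord_distance_squared(pop1, pop2):
    """Cavalli-Sforza chord measure D^2, scaled by degrees of freedom."""
    shortfall = 0.0          # sum over loci of [1 - sum_i sqrt(p1mi * p2mi)]
    degrees_of_freedom = 0   # sum over loci of (a_m - 1)
    for l1, l2 in zip(pop1, pop2):
        shortfall += 1.0 - sum(math.sqrt(p * q) for p, q in zip(l1, l2))
        degrees_of_freedom += len(l1) - 1
    return 4.0 * shortfall / degrees_of_freedom

# One two-allele locus: identical populations, then populations fixed
# for different alleles (the largest value the formula can give)
same = chord_distance_squared([[0.5, 0.5]], [[0.5, 0.5]])
opposite = chord_distance_squared([[1.0, 0.0]], [[0.0, 1.0]])
print(same, opposite)
```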
---|
91 | <P> |
---|
92 | Reynolds, Weir, and Cockerham's (1983) genetic distance is |
---|
93 | <P> |
---|
94 | <PRE> |
---|
D^2 = \frac{\sum_m \sum_i \left[ p_{1mi} - p_{2mi} \right]^2}{2 \sum_m \left[ 1 - \sum_i p_{1mi}\, p_{2mi} \right]}
---|
105 | </PRE> |
---|
106 | <P> |
---|
107 | <P> |
---|
108 | where the notation is as before and D<SUP>2</SUP> is the quantity that is |
---|
109 | expected to rise linearly with cumulated genetic drift. |
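<P>
The Reynolds, Weir, and Cockerham formula above can be sketched in Python in
the same way. The function name <EM>reynolds_distance_squared</EM> is
hypothetical and the data layout (a list of loci, each a list of all allele
frequencies) is an assumption. Note that the denominator is zero when both
populations are fixed for the same allele at every locus, a case this sketch
does not handle.

```python
def reynolds_distance_squared(pop1, pop2):
    """Reynolds, Weir, and Cockerham D^2 for two populations."""
    numerator = 0.0    # sum over loci and alleles of (p1mi - p2mi)^2
    denominator = 0.0  # sum over loci of [1 - sum_i p1mi * p2mi]
    for l1, l2 in zip(pop1, pop2):
        numerator += sum((p - q) ** 2 for p, q in zip(l1, l2))
        denominator += 1.0 - sum(p * q for p, q in zip(l1, l2))
    return numerator / (2.0 * denominator)

# One two-allele locus: identical populations, then populations fixed
# for different alleles
same = reynolds_distance_squared([[0.5, 0.5]], [[0.5, 0.5]])
opposite = reynolds_distance_squared([[1.0, 0.0]], [[0.0, 1.0]])
print(same, opposite)
```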
---|
110 | <P> |
---|
111 | Having computed one of these genetic distances, one which you feel is |
---|
112 | appropriate to the biology of the situation, you can use it as the input to |
---|
113 | the programs FITCH, KITSCH or NEIGHBOR. Keep in mind that the statistical |
---|
114 | model in |
---|
115 | those programs implicitly assumes that the distances in the input table have |
---|
116 | independent errors. For any measure of genetic distance this will not be true, |
---|
117 | as bursts of random genetic drift, or sampling events in drawing the sample of |
---|
118 | individuals from each population, cause fluctuations of gene frequency that |
---|
119 | affect many distances simultaneously. While this is not expected to bias the |
---|
120 | estimate of the phylogeny, it does mean that the weighing of evidence from all |
---|
121 | the different distances in the table will not be done with maximal |
---|
efficiency. One issue is which value of the P
(Power) parameter should be used. This depends on how the variance of a
distance rises with its expectation. For Cavalli-Sforza's chord distance, and
for the Reynolds et al. distance, it can be shown that the variance of the
distance will be proportional to the square of its expectation; this suggests
a value of 2 for <EM>P</EM>, which is the default value for FITCH and KITSCH
(there is no P option in NEIGHBOR).
---|
129 | <P> |
---|
130 | If you think that the pure genetic drift model is appropriate, and are thus |
---|
tempted to use the Cavalli-Sforza or Reynolds et al. distances, you might
---|
132 | consider using the maximum likelihood program CONTML instead. It will |
---|
133 | correctly weigh the evidence in that case. Like those genetic distances, it |
---|
134 | uses approximations that break down as loci start to drift all the way to |
---|
135 | fixation. Although Nei's distance will not break down in that case, it |
---|
136 | makes other assumptions about equality of substitution rates at all loci and |
---|
137 | constancy of population sizes. |
---|
138 | <P> |
---|
139 | The most important thing to remember is that genetic distance is not an |
---|
140 | abstract, idealized measure of "differentness". It is an estimate of a |
---|
141 | parameter (time or cumulated inverse effective population size) of the |
---|
142 | model which is thought to have generated the differences we see. As an |
---|
143 | estimate, it has statistical properties that can be assessed, and we should |
---|
144 | never have to choose between genetic distances based on their aesthetic |
---|
145 | properties, or on the personal prestige of their originators. Considering them |
---|
146 | as estimates |
---|
147 | focuses us on the questions which genetic distances are intended to answer, |
---|
148 | for if there are none there is no reason to compute them. For further |
---|
149 | perspective on genetic distances, I recommend my own paper evaluating |
---|
150 | Reynolds, Weir, and Cockerham (1983), and the material in Nei's book (Nei, |
---|
151 | 1987). |
---|
152 | <P> |
---|
153 | <H2>INPUT FORMAT</H2> |
---|
154 | <P> |
---|
155 | The input to this program is standard and is as described in the Gene |
---|
156 | Frequencies and Continuous |
---|
157 | Characters Programs documentation file above. It consists of the number of |
---|
158 | populations (or species), the number of loci, |
---|
159 | and after that a line containing the numbers of alleles at each of the |
---|
160 | loci. Then the gene frequencies follow in standard format. |
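<P>
To make the layout concrete, here is a rough Python sketch of reading such a
file. It is only an approximation of the format: it assumes whitespace-separated
values and population names without embedded blanks, whereas the authoritative
description is in the Gene Frequencies and Continuous Characters Programs
documentation. The function name <EM>parse_gene_frequencies</EM>, its
<EM>all_alleles</EM> argument, and the cut-down sample are all hypothetical.

```python
def parse_gene_frequencies(text, all_alleles=False):
    """Read populations, loci, allele counts, then frequencies.

    Returns a dict mapping population name to a list of loci, each a
    list of all allele frequencies. When all_alleles is False, one
    allele per locus is taken to be omitted and is restored as
    1 minus the sum of the others.
    """
    tokens = text.split()
    n_pops, n_loci = int(tokens[0]), int(tokens[1])
    allele_counts = [int(t) for t in tokens[2:2 + n_loci]]
    pos = 2 + n_loci
    populations = {}
    for _ in range(n_pops):
        name = tokens[pos]
        pos += 1
        loci = []
        for a in allele_counts:
            k = a if all_alleles else a - 1  # frequencies present in the file
            freqs = [float(t) for t in tokens[pos:pos + k]]
            pos += k
            if not all_alleles:
                freqs.append(1.0 - sum(freqs))
            loci.append(freqs)
        populations[name] = loci
    return populations

# Cut-down sample: two populations, three two-allele loci, one allele omitted
sample = """2 3
2 2 2
European 0.2868 0.5684 0.4422
African 0.1356 0.4840 0.0602
"""
populations = parse_gene_frequencies(sample)
print(populations["European"][0])
```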
---|
161 | <P> |
---|
162 | The options are selected using a menu: |
---|
163 | <P> |
---|
164 | <TABLE><TR><TD BGCOLOR=white> |
---|
165 | <PRE> |
---|
166 | |
---|
167 | Genetic Distance Matrix program, version 3.6a3 |
---|
168 | |
---|
169 | Settings for this run: |
---|
170 | A Input file contains all alleles at each locus? One omitted at each locus |
---|
171 | N Use Nei genetic distance? Yes |
---|
172 | C Use Cavalli-Sforza chord measure? No |
---|
173 | R Use Reynolds genetic distance? No |
---|
174 | L Form of distance matrix? Square |
---|
175 | M Analyze multiple data sets? No |
---|
176 | 0 Terminal type (IBM PC, ANSI, none)? (none) |
---|
177 | 1 Print indications of progress of run? Yes |
---|
178 | |
---|
179 | Y to accept these or type the letter for one to change |
---|
180 | |
---|
181 | </PRE> |
---|
182 | </TD></TR></TABLE> |
---|
183 | <P> |
---|
184 | The A (All alleles) option is described in the Gene Frequencies and |
---|
185 | Continuous Characters Programs documentation file. As with CONTML, it is |
---|
186 | the signal that all alleles are represented in the gene frequency input, |
---|
187 | without one being left out per locus. C, N, and R are the signals to |
---|
use the Cavalli-Sforza, Nei, or Reynolds et al. genetic distances
---|
189 | respectively. The Nei distance is the default, and it will be computed |
---|
190 | if none of these options is explicitly invoked. The L option is the signal |
---|
191 | that the distance matrix is to be written out in Lower triangular form. |
---|
192 | The M option is the usual Multiple Data Sets option, useful for |
---|
193 | doing bootstrap analyses with the distance matrix programs. It allows |
---|
194 | multiple data sets, but does not allow multiple sets of weights (since |
---|
195 | there is no provision for weighting in this program). |
---|
196 | <P> |
---|
197 | <H2>OUTPUT FORMAT</H2> |
---|
198 | <P> |
---|
The output file simply contains on its first line the number of species (or
populations). Each species (or population) starts a new line, with its name
printed out first, and then there are up to nine
genetic distances printed on each line, in the standard format used as input
by the distance matrix programs. The output, in its default form, is
ready to be used in the distance matrix programs.
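<P>
A square matrix in roughly this layout could be produced as in the following
Python sketch. The name field of 10 characters follows the "namelength"
constant mentioned below, and the limit of nine distances per line follows the
description above; the numeric field widths and the function name
<EM>format_square_matrix</EM> are assumptions, not the program's exact output
format.

```python
def format_square_matrix(names, matrix, per_line=9, name_width=10):
    """Format a square distance matrix as text.

    The first line gives the number of populations. Each population
    starts a new line with its name, followed by at most per_line
    distances per line.
    """
    lines = ["%5d" % len(names)]
    for name, row in zip(names, matrix):
        cells = ["%8.4f" % d for d in row]
        chunks = [cells[i:i + per_line] for i in range(0, len(cells), per_line)]
        lines.append(name.ljust(name_width) + "".join(chunks[0]))
        for chunk in chunks[1:]:
            # Rows with more than per_line distances continue on later lines
            lines.append("".join(chunk))
    return "\n".join(lines) + "\n"

text = format_square_matrix(["European", "African"],
                            [[0.0, 0.0780], [0.0780, 0.0]])
print(text)
```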
---|
206 | <P> |
---|
207 | <H2>CONSTANTS</H2> |
---|
208 | <P> |
---|
The constants
available to be changed by the user if the program is recompiled are
"namelength", the length of a species name, set to 10 in the distribution,
and "epsilon", which defines a small quantity that is used when checking
whether allele frequencies at a locus sum to more than one: if all
alleles are input (option A) and the sum differs from 1 by more than epsilon,
or if not all alleles are input and the sum is greater than 1 by more
than epsilon, the program will see this as an error and stop. You may
find this causes difficulties if your gene frequencies have been rounded.
I have tried to keep epsilon from being too small to prevent such problems.
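<P>
The check described above amounts to the following rule, shown here as a
Python sketch. The function name <EM>locus_sum_ok</EM> is hypothetical, and
the epsilon of 0.02 used in the examples is only an illustrative value, not
necessarily the one set in the distribution.

```python
def locus_sum_ok(freqs, all_alleles, epsilon):
    """Return True if the frequencies at one locus pass the sum check.

    With all alleles present (option A) the sum must be within epsilon
    of 1; with one allele omitted it must not exceed 1 by more than
    epsilon.
    """
    total = sum(freqs)
    if all_alleles:
        return abs(total - 1.0) <= epsilon
    return total <= 1.0 + epsilon

# Rounded thirds sum to 0.999, which passes with an illustrative epsilon
rounded_ok = locus_sum_ok([0.333, 0.333, 0.333], True, 0.02)
# With one allele omitted, a sum of 1.1 is rejected
too_large = locus_sum_ok([0.6, 0.5], False, 0.02)
print(rounded_ok, too_large)
```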
---|
219 | <P> |
---|
220 | <H2>RUN TIMES</H2> |
---|
221 | <P> |
---|
The program is quite fast and the user should effectively never be limited by
the amount of time it takes. All that the program has to do is read in the
gene frequency data and then, for each pair of species, evaluate a genetic
distance formula. This should require an amount of
effort proportional to the total number of alleles over loci, and to the
square of the number of populations.
---|
228 | <P> |
---|
229 | <H2>FUTURE OF THIS PROGRAM</H2> |
---|
230 | <P> |
---|
231 | The main change that will be made to this program in the future is to add |
---|
232 | provisions for taking into account the sample size for each population. The |
---|
233 | genetic distance formulas have been modified by their inventors to correct for |
---|
234 | the inaccuracy of the estimate of the genetic distances, which on the whole |
---|
235 | should artificially increase the distance between populations by a small |
---|
236 | amount dependent on the sample sizes. The main difficulty with doing this is |
---|
237 | that I have not yet settled on a format for putting the sample size in the |
---|
238 | input data along with the gene frequency data for a species or population. |
---|
239 | <P> |
---|
240 | I may also include other distance measures, but only if I think their use is |
---|
241 | justified. There are many very arbitrary genetic distances, and I am |
---|
242 | reluctant to include most of them. |
---|
243 | <P> |
---|
244 | <HR> |
---|
245 | <P> |
---|
246 | <H3>TEST DATA SET</H3> |
---|
247 | <P> |
---|
248 | <TABLE><TR><TD BGCOLOR=white> |
---|
249 | <PRE> |
---|
250 | 5 10 |
---|
251 | 2 2 2 2 2 2 2 2 2 2 |
---|
252 | European 0.2868 0.5684 0.4422 0.4286 0.3828 0.7285 0.6386 0.0205 |
---|
253 | 0.8055 0.5043 |
---|
254 | African 0.1356 0.4840 0.0602 0.0397 0.5977 0.9675 0.9511 0.0600 |
---|
255 | 0.7582 0.6207 |
---|
256 | Chinese 0.1628 0.5958 0.7298 1.0000 0.3811 0.7986 0.7782 0.0726 |
---|
257 | 0.7482 0.7334 |
---|
258 | American 0.0144 0.6990 0.3280 0.7421 0.6606 0.8603 0.7924 0.0000 |
---|
259 | 0.8086 0.8636 |
---|
260 | Australian 0.1211 0.2274 0.5821 1.0000 0.2018 0.9000 0.9837 0.0396 |
---|
261 | 0.9097 0.2976 |
---|
262 | </PRE> |
---|
263 | </TD></TR></TABLE> |
---|
264 | <P> |
---|
265 | <HR> |
---|
266 | <P> |
---|
267 | <H3>TEST SET OUTPUT</H3> |
---|
268 | <P> |
---|
269 | <TABLE><TR><TD BGCOLOR=white> |
---|
270 | <PRE> |
---|
271 | 5 |
---|
272 | European 0.0000 0.0780 0.0807 0.0668 0.1030 |
---|
273 | African 0.0780 0.0000 0.2347 0.1050 0.2273 |
---|
274 | Chinese 0.0807 0.2347 0.0000 0.0539 0.0633 |
---|
275 | American 0.0668 0.1050 0.0539 0.0000 0.1348 |
---|
276 | Australian 0.1030 0.2273 0.0633 0.1348 0.0000 |
---|
277 | </PRE> |
---|
278 | </TD></TR></TABLE> |
---|
279 | <P> |
---|
280 | </BODY> |
---|
281 | </HTML> |
---|