gendist.html

Last change on this file was 2176, checked in by westram, 22 years ago
* empty log message *
Property svn:eol-style set to `native` Property svn:keywords set to `Author Date Id Revision`
File size: 12.9 KB

Line
1	<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 3.2 Final//EN">
2	<HTML>
3	<HEAD>
4	<TITLE>gendist</TITLE>
5	<META NAME="description" CONTENT="gendist">
6	<META NAME="keywords" CONTENT="gendist">
7	<META NAME="resource-type" CONTENT="document">
8	<META NAME="distribution" CONTENT="global">
9	<META HTTP-EQUIV="Content-Type" CONTENT="text/html; charset=iso-8859-1">
10	</HEAD>
11	<BODY BGCOLOR="#ccffff">
12	<DIV ALIGN=RIGHT>
13	version 3.6
14	</DIV>
15	<P>
16	<DIV ALIGN=CENTER>
17	<H1>GENDIST - Compute genetic distances from gene frequencies</H1>
18	</DIV>
19	<P>
20	© Copyright 1986-2002 by the University of
21	Washington. Written by Joseph Felsenstein. Permission is granted to copy
22	this document provided that no fee is charged for it and that this copyright
23	notice is not removed.
24	<P>
25	This program computes any one of three measures of genetic distance from a set
26	of gene frequencies in different populations (or species). The three are
27	Nei's genetic distance (Nei, 1972), Cavalli-Sforza's chord measure (Cavalli-
28	Sforza and Edwards, 1967) and Reynolds, Weir, and Cockerham's (1983) genetic
29	distance. These are written to an output file in a format that can be read by
30	the distance matrix phylogeny programs FITCH and KITSCH.
31	<P>
32	The three measures have somewhat different assumptions. All assume that all
33	differences between populations arise from genetic drift. Nei's distance is
34	formulated for an infinite isoalleles model of mutation, in which there is a
35	rate of neutral mutation and each mutation is to a completely new allele. It
36	is assumed that all loci have the same rate of neutral mutation, and that the
37	genetic variability initially in the population is at equilibrium between
38	mutation and genetic drift, with the effective population size of each
39	population remaining constant.
40	<P>
41	Nei's distance is:
42	<P>
43	<PRE>
44	                         __ __
45	                         \  \
46	                         /_ /_ p<SUB>1mi</SUB> p<SUB>2mi</SUB>
47	                          m  i
48	   D = - ln <FONT SIZE=+3>(</FONT> ------------------------------------- <FONT SIZE=+3>)</FONT>.
49	                 __ __             __ __
50	                 \  \              \  \
51	               [ /_ /_ p<SUB>1mi</SUB><SUP>2</SUP>]<SUP><SUP>1</SUP>/<SUB>2</SUB></SUP> [ /_ /_ p<SUB>2mi</SUB><SUP>2</SUP>]<SUP><SUP>1</SUP>/<SUB>2</SUB></SUP>
52	                  m  i              m  i
53	</PRE>
54	<P>
55	where <EM>m</EM> is summed over loci, <EM>i</EM> over alleles at the <EM>m</EM>-th locus, and where
56	<P>
57	p<SUB>1mi</SUB>
58	<P>
59	is the frequency of the <EM>i</EM>-th allele at the <EM>m</EM>-th locus in population 1.
60	Subject to the above assumptions, Nei's genetic distance is expected, for a
61	sample of sufficiently many equivalent loci, to rise linearly with time.
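<P>
As a minimal sketch of the formula above (not the code used in GENDIST itself), Nei's distance could be computed in Python as follows. The function names "nei_distance" and "with_omitted_allele" are hypothetical, and the sketch assumes that each population is supplied as a list of loci, each locus being a list of allele frequencies. The two frequency lists are the European and African rows of the test data set at the end of this document, which has two alleles per locus with one allele omitted.

```python
import math


def nei_distance(pop1, pop2):
    """Nei's genetic distance between two populations.

    pop1 and pop2 are lists of loci; each locus is a list of allele
    frequencies, with the alleles in the same order in both populations.
    """
    # Numerator: sum over loci m and alleles i of p1mi * p2mi.
    cross = sum(p * q
                for locus1, locus2 in zip(pop1, pop2)
                for p, q in zip(locus1, locus2))
    # Denominator terms: sums of squared frequencies in each population.
    self1 = sum(p * p for locus in pop1 for p in locus)
    self2 = sum(q * q for locus in pop2 for q in locus)
    return -math.log(cross / math.sqrt(self1 * self2))


def with_omitted_allele(freqs):
    """Expand one frequency per two-allele locus into [p, 1 - p]."""
    return [[p, 1.0 - p] for p in freqs]


# European and African rows from the test data set (one allele omitted).
european = with_omitted_allele([0.2868, 0.5684, 0.4422, 0.4286, 0.3828,
                                0.7285, 0.6386, 0.0205, 0.8055, 0.5043])
african = with_omitted_allele([0.1356, 0.4840, 0.0602, 0.0397, 0.5977,
                               0.9675, 0.9511, 0.0600, 0.7582, 0.6207])

d = nei_distance(european, african)
print(f"{d:.4f}")
```

The printed value can be compared with the European/African entry of the test set output at the end of this document.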
62	<P>
63	The other two genetic distances assume that there is no mutation, and that all
64	gene frequency changes are by genetic drift alone. However, they do not
65	assume that population sizes have remained constant and equal in all
66	populations. They cope with changing population size by having expectations
67	that rise linearly not with time, but with the sum over time of 1/N, where N
68	is the effective population size. Thus if population size doubles, genetic
69	drift will be taking place more slowly, and the genetic distance will be
70	expected to be rising only half as fast with respect to time. Both genetic
71	distances are different estimators of the same quantity under the same model.
72	<P>
73	Cavalli-Sforza's chord distance is given by
74	<P>
75	<PRE>
76	       __       __                    __
77	       \        \                     \
78	D<SUP>2</SUP> = 4 /_ [ 1 - /_ p<SUB>1mi</SUB><SUP><SUP>1</SUP>/<SUB>2</SUB></SUP> p<SUB>2mi</SUB><SUP><SUP>1</SUP>/<SUB>2</SUB></SUP>] / /_ (a<SUB>m</SUB> - 1)
79	        m        i                     m
80	</PRE>
81	<P>
82	<P>
83	where m indexes the loci, where i is summed over the alleles at the m-th
84	locus, and where a<SUB>m</SUB> is the number of alleles at the m-th locus. It can be
85	shown that this distance always satisfies the triangle
86	inequality. Note that as given here it is divided by the number of degrees
87	of freedom, the sum of the numbers of alleles minus one. The quantity which
88	is expected to rise linearly with amount of
89	genetic drift (sum of <EM>1/N</EM> over time) is <EM>D</EM> squared, the quantity computed
90	above, and that is what is written out into the distance matrix.
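<P>
The chord measure can be sketched in the same way. This is an illustration only, under the same assumed data layout (a list of loci, each a list of allele frequencies, with every allele present), and the function name "chord_distance_squared" is hypothetical. It returns D squared, the quantity described above as being written to the distance matrix.

```python
import math


def chord_distance_squared(pop1, pop2):
    """Cavalli-Sforza chord measure (D squared) between two populations.

    pop1 and pop2 are lists of loci; each locus lists the frequencies of
    all of its alleles, in the same order in both populations.
    """
    # Sum over loci of 1 minus the sum over alleles of sqrt(p1mi) * sqrt(p2mi).
    numerator = sum(
        1.0 - sum(math.sqrt(p) * math.sqrt(q) for p, q in zip(locus1, locus2))
        for locus1, locus2 in zip(pop1, pop2)
    )
    # Degrees of freedom: the sum over loci of (number of alleles - 1).
    degrees_of_freedom = sum(len(locus) - 1 for locus in pop1)
    return 4.0 * numerator / degrees_of_freedom


# Two populations fixed for different alleles at a single two-allele locus.
fixed_a = [[1.0, 0.0]]
fixed_b = [[0.0, 1.0]]
print(chord_distance_squared(fixed_a, fixed_b))
```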
91	<P>
92	Reynolds, Weir, and Cockerham's (1983) genetic distance is
93	<P>
94	<PRE>
95
96	                __ __
97	                \  \
98	                /_ /_ [ p<SUB>1mi</SUB> - p<SUB>2mi</SUB>]<SUP>2</SUP>
99	                 m  i
100	   D<SUP>2</SUP> = --------------------------------------
101	                __       __
102	                \        \
103	              2 /_ [ 1 - /_ p<SUB>1mi</SUB> p<SUB>2mi</SUB> ]
104	                 m        i
105	</PRE>
106	<P>
107	<P>
108	where the notation is as before and D<SUP>2</SUP> is the quantity that is
109	expected to rise linearly with cumulated genetic drift.
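<P>
A corresponding sketch for the Reynolds, Weir, and Cockerham distance follows, again assuming the list-of-loci layout and using a hypothetical function name, "reynolds_distance_squared". Note that the denominator is zero when both populations are fixed for the same allele at every locus; the sketch does not guard against that case.

```python
def reynolds_distance_squared(pop1, pop2):
    """Reynolds, Weir, and Cockerham distance (D squared).

    pop1 and pop2 are lists of loci; each locus lists the frequencies of
    all of its alleles, in the same order in both populations.
    """
    # Numerator: sum over loci and alleles of squared frequency differences.
    numerator = sum((p - q) ** 2
                    for locus1, locus2 in zip(pop1, pop2)
                    for p, q in zip(locus1, locus2))
    # Denominator: twice the sum over loci of 1 minus the sum of p1mi * p2mi.
    denominator = 2.0 * sum(
        1.0 - sum(p * q for p, q in zip(locus1, locus2))
        for locus1, locus2 in zip(pop1, pop2)
    )
    return numerator / denominator


# Two populations fixed for different alleles at a single two-allele locus.
fixed_a = [[1.0, 0.0]]
fixed_b = [[0.0, 1.0]]
print(reynolds_distance_squared(fixed_a, fixed_b))
```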
110	<P>
111	Having computed one of these genetic distances, one which you feel is
112	appropriate to the biology of the situation, you can use it as the input to
113	the programs FITCH, KITSCH or NEIGHBOR. Keep in mind that the statistical
114	model in
115	those programs implicitly assumes that the distances in the input table have
116	independent errors. For any measure of genetic distance this will not be true,
117	as bursts of random genetic drift, or sampling events in drawing the sample of
118	individuals from each population, cause fluctuations of gene frequency that
119	affect many distances simultaneously. While this is not expected to bias the
120	estimate of the phylogeny, it does mean that the weighing of evidence from all
121	the different distances in the table will not be done with maximal
122	efficiency. One issue is which value of the P
123	(Power) parameter should be used. This depends on how the variance of a
124	distance rises with its expectation. For Cavalli-Sforza's chord distance, and
125	for the Reynolds et al. distance it can be shown that the variance of the
126	distance will be proportional to the square of its expectation; this suggests
127	a value of 2 for <EM>P</EM>, which is the default value for FITCH and KITSCH
128	(there is no P option in NEIGHBOR).
129	<P>
130	If you think that the pure genetic drift model is appropriate, and are thus
131	tempted to use the Cavalli-Sforza or Reynolds et al. distances, you might
132	consider using the maximum likelihood program CONTML instead. It will
133	correctly weigh the evidence in that case. Like those genetic distances, it
134	uses approximations that break down as loci start to drift all the way to
135	fixation. Although Nei's distance will not break down in that case, it
136	makes other assumptions about equality of substitution rates at all loci and
137	constancy of population sizes.
138	<P>
139	The most important thing to remember is that genetic distance is not an
140	abstract, idealized measure of "differentness". It is an estimate of a
141	parameter (time or cumulated inverse effective population size) of the
142	model which is thought to have generated the differences we see. As an
143	estimate, it has statistical properties that can be assessed, and we should
144	never have to choose between genetic distances based on their aesthetic
145	properties, or on the personal prestige of their originators. Considering them
146	as estimates
147	focuses us on the questions which genetic distances are intended to answer,
148	for if there are none there is no reason to compute them. For further
149	perspective on genetic distances, I recommend my own paper evaluating
150	Reynolds, Weir, and Cockerham (1983), and the material in Nei's book (Nei,
151	1987).
152	<P>
153	<H2>INPUT FORMAT</H2>
154	<P>
155	The input to this program is standard and is as described in the Gene
156	Frequencies and Continuous
157	Characters Programs documentation file above. It consists of the number of
158	populations (or species), the number of loci,
159	and after that a line containing the numbers of alleles at each of the
160	loci. Then the gene frequencies follow in standard format.
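<P>
To make the layout concrete, here is a rough sketch of a reader for this input format. It is not the parser used by the program. It assumes that the name occupies the first 10 characters of each population's first line (the "namelength" constant described below), that frequencies may continue onto following lines as in the test data set, and that when one allele per locus is omitted its frequency is one minus the sum of the others. The names "parse_gene_frequencies" and "sample" are hypothetical.

```python
def parse_gene_frequencies(text, all_alleles=False, name_length=10):
    """Parse gene frequency input into {population name: list of loci}."""
    lines = text.splitlines()
    n_pops, n_loci = (int(tok) for tok in lines[0].split())
    allele_counts = [int(tok) for tok in lines[1].split()]
    if len(allele_counts) != n_loci:
        raise ValueError("allele-count line does not match the number of loci")

    # Frequencies given per locus: all of them, or all but one.
    per_locus = [a if all_alleles else a - 1 for a in allele_counts]
    needed = sum(per_locus)

    populations = {}
    i = 2
    for _ in range(n_pops):
        name = lines[i][:name_length].strip()
        values = [float(tok) for tok in lines[i][name_length:].split()]
        i += 1
        # Keep reading continuation lines until this population is complete.
        while len(values) < needed:
            values.extend(float(tok) for tok in lines[i].split())
            i += 1

        loci, start = [], 0
        for count in per_locus:
            locus = values[start:start + count]
            if not all_alleles:
                # Restore the omitted allele's frequency.
                locus.append(1.0 - sum(locus))
            loci.append(locus)
            start += count
        populations[name] = loci
    return populations


# A small made-up input: 2 populations, 2 loci with 2 and 3 alleles.
sample = (
    "    2    2\n"
    "2 3\n"
    "Alpha      0.25 0.10\n"
    "0.20\n"
    "Beta       0.50 0.30 0.30\n"
)
pops = parse_gene_frequencies(sample)
print(pops["Alpha"])
```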
161	<P>
162	The options are selected using a menu:
163	<P>
164	<TABLE><TR><TD BGCOLOR=white>
165	<PRE>
166
167	Genetic Distance Matrix program, version 3.6a3
168
169	Settings for this run:
170	  A   Input file contains all alleles at each locus?  One omitted at each locus
171	  N                        Use Nei genetic distance?  Yes
172	  C                Use Cavalli-Sforza chord measure?  No
173	  R                   Use Reynolds genetic distance?  No
174	  L                         Form of distance matrix?  Square
175	  M                      Analyze multiple data sets?  No
176	  0              Terminal type (IBM PC, ANSI, none)?  (none)
177	  1            Print indications of progress of run?  Yes
178
179	Y to accept these or type the letter for one to change
180
181	</PRE>
182	</TD></TR></TABLE>
183	<P>
184	The A (All alleles) option is described in the Gene Frequencies and
185	Continuous Characters Programs documentation file. As with CONTML, it is
186	the signal that all alleles are represented in the gene frequency input,
187	without one being left out per locus. C, N, and R are the signals to
188	use the Cavalli-Sforza, Nei, or Reynolds et al. genetic distances
189	respectively. The Nei distance is the default, and it will be computed
190	if none of these options is explicitly invoked. The L option is the signal
191	that the distance matrix is to be written out in Lower triangular form.
192	The M option is the usual Multiple Data Sets option, useful for
193	doing bootstrap analyses with the distance matrix programs. It allows
194	multiple data sets, but does not allow multiple sets of weights (since
195	there is no provision for weighting in this program).
196	<P>
197	<H2>OUTPUT FORMAT</H2>
198	<P>
199	The output file simply contains on its first line the number of species (or
200	populations). Each
201	species (or population) starts a new line, with its name printed out
202	first, and then there are up to nine
203	genetic distances printed on each line, in the standard format used as input
204	by the distance matrix programs. The output, in its default form, is
205	ready to be used in the distance matrix programs.
206	<P>
207	<H2>CONSTANTS</H2>
208	<P>
209	The constants
210	available to be changed by the user if the program is recompiled are
211	"namelength", the length of a species name, set to 10 in the distribution,
212	and "epsilon", which defines a small quantity that is used when checking
213	whether allele frequencies at a locus sum to more than one: if all
214	alleles are input (option A) and the sum differs from 1 by more than epsilon,
215	or if not all alleles are input and the sum is greater than 1 by more
216	than epsilon, the program will see this as an error and stop. You may
217	find this causes difficulties if your gene frequencies have been rounded.
218	I have tried to keep epsilon from being too small to prevent such problems.
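<P>
The check described above can be sketched as follows. This only illustrates the rule as stated here; the function name "locus_sum_ok" is hypothetical, and the epsilon values in the example calls are arbitrary, not the value compiled into the program.

```python
def locus_sum_ok(freqs, all_alleles, epsilon):
    """Return True if the frequencies at one locus pass the sum check.

    With all alleles present (option A) the sum must be within epsilon
    of 1; with one allele omitted the sum may not exceed 1 by more than
    epsilon.
    """
    total = sum(freqs)
    if all_alleles:
        return abs(total - 1.0) <= epsilon
    return total <= 1.0 + epsilon


# Three alleles rounded to four decimals sum to 0.9999, not exactly 1.
rounded = [0.3333, 0.3333, 0.3333]
print(locus_sum_ok(rounded, all_alleles=True, epsilon=0.001))  # tolerant
print(locus_sum_ok(rounded, all_alleles=True, epsilon=1e-6))   # too strict
```

The second call shows how a very small epsilon would reject frequencies that are merely rounded.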
219	<P>
220	<H2>RUN TIMES</H2>
221	<P>
222	The program is quite fast and the user should effectively never be limited by
223	the amount of time it takes. All that the program has to do is read in the
224	gene frequency data and then, for each pair of species, evaluate a genetic
225	distance formula. This should require an amount of
226	effort proportional to the total number of alleles over loci, and to the
227	square of the number of populations.
228	<P>
229	<H2>FUTURE OF THIS PROGRAM</H2>
230	<P>
231	The main change that will be made to this program in the future is to add
232	provisions for taking into account the sample size for each population. The
233	genetic distance formulas have been modified by their inventors to correct for
234	the inaccuracy of the estimate of the genetic distances, which on the whole
235	should artificially increase the distance between populations by a small
236	amount dependent on the sample sizes. The main difficulty with doing this is
237	that I have not yet settled on a format for putting the sample size in the
238	input data along with the gene frequency data for a species or population.
239	<P>
240	I may also include other distance measures, but only if I think their use is
241	justified. There are many very arbitrary genetic distances, and I am
242	reluctant to include most of them.
243	<P>
244	<HR>
245	<P>
246	<H3>TEST DATA SET</H3>
247	<P>
248	<TABLE><TR><TD BGCOLOR=white>
249	<PRE>
250	    5   10
251	2 2 2 2 2 2 2 2 2 2
252	European   0.2868 0.5684 0.4422 0.4286 0.3828 0.7285 0.6386 0.0205
253	0.8055 0.5043
254	African    0.1356 0.4840 0.0602 0.0397 0.5977 0.9675 0.9511 0.0600
255	0.7582 0.6207
256	Chinese    0.1628 0.5958 0.7298 1.0000 0.3811 0.7986 0.7782 0.0726
257	0.7482 0.7334
258	American   0.0144 0.6990 0.3280 0.7421 0.6606 0.8603 0.7924 0.0000
259	0.8086 0.8636
260	Australian 0.1211 0.2274 0.5821 1.0000 0.2018 0.9000 0.9837 0.0396
261	0.9097 0.2976
262	</PRE>
263	</TD></TR></TABLE>
264	<P>
265	<HR>
266	<P>
267	<H3>TEST SET OUTPUT</H3>
268	<P>
269	<TABLE><TR><TD BGCOLOR=white>
270	<PRE>
271	    5
272	European   0.0000 0.0780 0.0807 0.0668 0.1030
273	African    0.0780 0.0000 0.2347 0.1050 0.2273
274	Chinese    0.0807 0.2347 0.0000 0.0539 0.0633
275	American   0.0668 0.1050 0.0539 0.0000 0.1348
276	Australian 0.1030 0.2273 0.0633 0.1348 0.0000
277	</PRE>
278	</TD></TR></TABLE>
279	<P>
280	</BODY>
281	</HTML>

source: trunk/GDE/PHYLIP/doc/gendist.html