Context Navigation

restdist.html

Visit:

Last change on this file was 2176, checked in by westram, 22 years ago
* empty log message *
Property svn:eol-style set to `native` Property svn:keywords set to `Author Date Id Revision`
File size: 17.1 KB

Line
1	<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 3.2 Final//EN">
2	<HTML>
3	<HEAD>
4	<TITLE>restdist</TITLE>
5	<META NAME="description" CONTENT="restdist">
6	<META NAME="keywords" CONTENT="restdist">
7	<META NAME="resource-type" CONTENT="document">
8	<META NAME="distribution" CONTENT="global">
9	<META HTTP-EQUIV="Content-Type" CONTENT="text/html; charset=iso-8859-1">
10	</HEAD>
11	<BODY BGCOLOR="#ccffff">
12	<DIV ALIGN=RIGHT>
13	version 3.6
14	</DIV>
15	<P>
16	<DIV ALIGN=CENTER>
17	<H1>RESTDIST -- Program to compute distance matrix<BR>from restriction sites or fragments</H1>
18	</DIV>
19	<P>
20	© Copyright 2000-2002 by the University of
21	Washington. Written by Joseph Felsenstein. Permission is granted to copy
22	this document provided that no fee is charged for it and that this copyright
23	notice is not removed.
24	<P>
25	Restdist reads the same restriction sites format as RESTML and
26	computes a restriction sites distance. It can also compute a restriction
27	fragments distance. The original restriction fragments and restriction
28	sites distance methods were introduced by Nei and Li (1979). Their original
29	<! ???? CHANGE NEXT LINE TO methods are WHEN NEI/LI R.S. DISTANCE INCLUDED >
30	method for restriction fragments is
31	also available in this program, although its default methods
32	are my modifications of the original Nei and Li methods.
33	<P>
34	These two distances assume that the restriction sites are accidental byproducts
35	of random change of nucleotide sequences. For my restriction sites distance
36	the DNA sequences are assumed to be changing according to the Kimura
37	2-parameter model of DNA change (Kimura, 1980). The user can set the
38	transition/transversion rate for the model. For my restriction fragments
39	distance there is
40	there is an implicit assumption of a Jukes-Cantor (1969) model of change,
41	The user can also set the
42	parameter of a correction for unequal rates of evolution between sites in
43	the DNA sequences, using a Gamma distribution of rates among sites.
44	The Jukes-Cantor model is also implicit in the restriction fragments
45	distance of Nei and Li(1979). It
46	does not allow us to correct for a Gamma distribution of rates among
47	sites.
48	<P>
49	<H2>Restriction Sites Distance</H2>
50	<P>
51	The restriction sites distances use data coded for the presence of absence of
52	individual restriction sites (usually as + and - or 0 and 1). My
53	distance is based on the proportion, out of all sites observed in one species
54	or the other, which are present in both species. This is done to correct for
55	the ascertainment of sites, for the fact that we are not aware of many sites
56	because they do not appear in any species.
57	<P>
58	My distance starts by computing from the particular pair of species the fraction
59	<PRE>
60	n<SUB>++</SUB>
61	f = ---------------------
62	n<SUB>++</SUB> + <SUP>1</SUP>/<SUB>2</SUB> (n<SUB>+- </SUB>+ n<SUB>-+</SUB>)
63	</PRE>
64	where <I>n<SUB>++</SUB></I> is the number of sites contained in both species,
65	<I>n<SUB>+-</SUB></I> is the number of sites contained in the first of the
66	two species but not in the second, and <I>n<SUB>-+</SUB></I> is the
67	number of sites contained in the second of the two species but not in the
68	first. This is the fraction of sites that are present in one species which are
69	present in both. Since the number of sites present in the two species will
70	often differ, the denominator is the average of the number of sites found in
71	the two species.
72	<P>
73	If each restriction site is <I>s</I> nucleotides long, the probability
74	that a restriction site is present in the other species, given that it is
75	present in a species, is
76	<PRE>
77	Q<SUP>s</SUP>,
78	</PRE>
79	where <I>Q</I> is the probability that a nucleotide has no net change as one
80	goes from the one species to the other. It may have changed in between; we
81	are interested in the probability that that nucleotide site is in the same base
82	in both species, irrespective of what has happened in between.
83	<P>
84	The distance is then computed by finding the branch length of a two-species
85	tree (connecting these two species with a single branch) such
86	that <I>Q</I> equals the <I>s</I>-th root of <I>f</I>. For this the
87	program computes <I>Q</I> for various values of branch length, iterating them
88	by a Newton-Raphson algorithm until the two quantities are equal.
89	<P>
90	The resulting distance should be numerically close to the original
91	restriction sites distance of Nei and Li (1979). It is inspired by theirs,
92	but theirs differs by implicitly assuming a symmetric Jukes-Cantor (1969)
93	model of nucleotide change, and theirs does not include a correction for
94	Gamma distribution of rate of change among nucleotide sites.
95	<P>
96	<H2>Restriction Fragments Distance</H2>
97	<P>
98	For restriction fragments data we use a different distance. If we
99	average over all restriction fragment lengths, each at its own
100	expected frequency, the probability that the fragment will still be
101	in existence after a certain amount of branch length, we must take into
102	account the probability that the two restriction sites at the ends of
103	the fragment do not mutate, and the probability that no new
104	restriction site occurs within the fragment in that amount of branch
105	length. The result for a restriction site length of <I>s</I> is:
106	<PRE>
107	Q<SUP>2s</SUP>
108	f = --------
109	2 - Q<SUP>s</SUP>
110	</PRE>
111	(The details of the derivation will be given in my forthcoming book
112	<I>Inferring Phylogenies</I> (to be published by Sinauer Associates
113	in 2001).
114	Given the observed fraction of restriction sites retained, <I>f</I>,
115	we can solve a quadratic equation from the above expression for
116	<I>Q<SUP>s</SUP></I>. That makes it easy to obtain a value of <I>Q</I>,
117	and the branch length can then be estimated by adjusting it so the
118	probability of a base not changing is equal to that value.
119	<P>
120	Alternatively, if we use the Nei and Li (1979) restriction fragments
121	distance, this involves solving for <I>g</I> in the nonlinear
122	equation
123	<PRE>
124	g = [ f (3 - 2g) ]<SUP><SUP>1</SUP>/<SUB>4</SUB></SUP>
125	</PRE>
126	and then the distance is given by
127	<PRE>
128	d = - (<SUP>2</SUP>/<SUB>r</SUB>) log<SUB>e</SUB>(g)
129	</PRE>
130	where <I>r</I> is the length of the restriction site.
131	<P>
132	Comparing these two restriction fragments distances in a case
133	where their underlying DNA model is the same (which is when the
134	transition/transversion ratio of the modified model is set to
135	0.5), you will find
136	that they are very close to each other, differing very little at
137	small distances, with the modified distance become smaller than
138	the Nei/Li distance at larger distances. It will therefore matter
139	very little which one you use.
140	<P>
141	<H2>A Comment About RAPDs and AFLPs</H2>
142	<P>
143	Although these distances are designed for restriction sites
144	and restriction fragments data, they can be applied to RAPD and
145	AFLP data as well. RAPD (Randomly Amplified Polymorphic DNA)
146	and AFLP (Amplified Fragment Length Polymorphism) data consist
147	of presence or absence of individual bands on a gel. The bands
148	are segments of DNA with PCR primers at each end. These
149	primers are defined sequences of known length (often about
150	10 nucleotides each). For AFLPs the reolevant length is the primer
151	length, plus three nucleotides. Mutation in these sequences makes them no
152	longer be primers, just as in the case of restriction sites.
153	Thus a pair of 10-nucleotide primers will behave much the same
154	as a 20-nucleotide restriction site. You can use the restriction
155	sites distance as the distance between RAPD or AFLP patterns if you
156	set the proper value for the total length of the site to the
157	total length of the primers (plus 6 in the case of AFLPs).
158	Of course there are many possible sources of noise in these data,
159	including confusing fragments of similar length for each other
160	and having primers near each other in the genome, and these are
161	not taken into account in the statistical model used here.
162	<P>
163	<H2>INPUT FORMAT AND OPTIONS</H2>
164	<P>
165	The input is fairly standard, with one addition. As usual the first line of
166	the file gives the number of species and the number of sites, but there is also
167	a third number, which is the number of different restriction enzymes that were
168	used to detect the restriction sites. Thus a data set with 10 species and
169	35 different sites, representing digestion with 4 different enzymes, would
170	have the first line of the data file look like this:
171	<P>
172	<PRE>
173	10 35 4
174	</PRE>
175	<P>
176	The site data are in standard form. Each species starts with a species name
177	whose maximum length is given by the constant "nmlngth"
178	(whose value in the
179	program as distributed is 10 characters). The name should, as usual, be padded
180	out to that length with blanks if necessary. The sites data then follows, one
181	character per site (any blanks will
182	be skipped and ignored). Like the DNA and protein sequence data, the
183	restriction sites data may be either in the "interleaved" form or the
184	"sequential" form. Note that if you are analyzing restriction sites
185	data with the programs DOLLOP or MIX or other discrete character
186	programs, at the moment those programs do not use the "aligned" or
187	"interleaved" data format. Therefore you may want to avoid that format
188	when you have restriction sites data that you will want to feed into
189	those programs.
190	<P>
191	The presence of a site is indicated by a "+" and the absence by a "-". I have
192	also allowed the use of "1" and "0" as synonyms for "+" and "-", for
193	compatibility with MIX and DOLLOP which do not allow "+" and "-". If the
194	presence of
195	the site is unknown (for example, if the DNA containing it has been deleted so
196	that one
197	does not know whether it would have contained the site) then the state "?" can
198	be used to indicate that the state of this site is unknown.
199	<P>
200	The options are selected using an interactive menu. The menu looks like this:
201	<P>
202	<TABLE><TR><TD BGCOLOR=white>
203	<PRE>
204
205	Restriction site or fragment distances, version 3.6a3
206
207	Settings for this run:
208	R Restriction sites or fragments? Sites
209	G Gamma distribution of rates among sites? No
210	T Transition/transversion ratio? 2.000000
211	S Site length? 6.0
212	L Form of distance matrix? Square
213	M Analyze multiple data sets? No
214	I Input sequences interleaved? Yes
215	0 Terminal type (IBM PC, ANSI, none)? (none)
216	1 Print out the data at start of run? No
217	2 Print indications of progress of run? Yes
218
219	Y to accept these or type the letter for one to change
220
221	</PRE>
222	</TD></TR></TABLE>
223	<P>
224	The user either types "Y" (followed, of course, by a carriage-return)
225	if the settings shown are to be accepted, or the letter or digit corresponding
226	to an option that is to be changed.
227	<P>
228	The R option toggles between a restriction sites distance, which
229	is the default setting, and a restriction fragments distance. In
230	the latter case, another option appears, the N (Nei/Li) option.
231	This allows the user to choose the original Nei and Li (1979)
232	restriction fragments distance rather than my modified Nei/Li
233	distance, which is the default.
234	<P>
235	If the G (Gamma distribution) option is selected, the user will be
236	asked to supply the coefficient of variation of the rate of substitution
237	among sites. This is different from the parameters used by Nei and Jin, who
238	introduced Gamma distribution of rates in DNA distances, but
239	related to their parameters: their parameter <EM>a</EM> is also known as
240	"alpha", the shape parameter of the Gamma distribution. It is
241	related to the coefficient of variation by
242	<P>
243	CV = 1 / a<SUP>1/2</SUP>
244	<P>
245	or
246	<P>
247	a = 1 / (CV)<SUP>2</SUP>
248	<P>
249	(their parameter <EM>b</EM> is absorbed here by the requirement that time is scaled so
250	that the mean rate of evolution is 1 per unit time, which means that <EM>a = b</EM>).
251	As we consider cases in which the rates are less variable we should set <EM>a</EM>
252	larger and larger, as <EM>CV</EM> gets smaller and smaller.
253	<P>
254	The Gamma distribution option is not available when using the
255	original Nei/Li restriction fragments distance.
256	<P>
257	The T option is the Transition/transversion option. The user is prompted for
258	a real number greater than 0.0, as the expected ratio of transitions to
259	transversions. Note
260	that this is the resulting expected ratio of transitions to transversions.
261	The default value of the T parameter if you do not use the T
262	option is 2.0. The T option is not available when you choose the original
263	Nei/Li restriction fragment distance, which assumes a Jukes-Cantor (1969)
264	model of DNA change, for which the transition/transversion ratio is
265	in effect fixed at 0.5.
266	<P>
267	The S option selects the site length. This is set to a default
268	value of 6. It can be set to any positive integer. While in
269	the RESTML program there is an upper limit on the restriction
270	site length (set by memory limitations), in RESTDIST there is
271	no effective limit on the size of the restriction sites. A value
272	of 20, which might be appropriate in many cases for RAPD or AFLP
273	data, is typically not practical in RESTML, but it is useable in
274	RESTDIST.
275	<P>
276	Option L specifies that the output file will have a square matrix
277	of distances. It can be used to change to lower-triangular
278	data matrices. This will usually not be
279	necessary, but if the distance matrices are going to be very
280	large, this alternative can reduce their size by half. The
281	programs which are to use them should then of course be informed
282	that they can expect lower-triangular distance matrices.
283	<P>
284	The M, I, and 0 options are the usual Multiple data set,
285	Interleaved input, and screen terminal type options. These are
286	described in the main documentation file.
287	<P>
288	Option 1 specifies that the input data will be written out
289	on the output file before the distances. This is off by
290	default. If it is done, it will make the output file unusable
291	as input to our distance matrix programs.
292	<P>
293	Option 2 turns off or on the indications of the progress of the run.
294	The program prints out a row of dots (".") indicating the
295	calculation of individual distances. Since the distance matrix
296	is symmetrical, the program only computes the distances for the
297	upper triangle of the distance matrix, and then duplicates
298	the distance to the other corner of the matrix. Thus the rows of
299	dots start out of full length, and then egt shorter and shorter.
300	<P>
301	<H2>OUTPUT FORMAT</H2>
302	<P>
303	The output file contains on its first line the number of species. The
304	distance matrix is then printed in standard
305	form, with each species starting on a new line with the species name, followed
306	by the distances to the species in order. These continue onto a new line
307	after every nine distances. If the L option is used, the matrix or distances
308	is in lower triangular form, so that only the distances to the other species
309	that precede each species are printed. Otherwise the distance matrix is square
310	with zero distances on the diagonal. In general the format of the distance
311	matrix is such that it can serve as input to any of the distance matrix
312	programs.
313	<P>
314	If the option to print out the data is selected, the output file will
315	precede the data by more complete information on the input and the menu
316	selections. The output file begins by giving the number of species and the
317	number of characters.
318	<P>
319	The distances printed out are scaled in terms of expected numbers of
320	substitutions per DNA site, counting both transitions and transversions but not
321	replacements of a base by itself, and scaled so that the average rate of
322	change, averaged over all sites analyzed, is set to 1.0. Thus when the
323	G option is used, the rate of change at one site may be higher than
324	at another, but their mean is expected to be 1.
325	<P>
326	<H2>PROGRAM CONSTANTS</H2>
327	<P>
328	The constants available to be changed are "initialv" and
329	"iterationsr". The constant "initialv" is the starting
330	value of the distance in the iterations. This will typically not need to
331	be changed. The constant "iterationsr" is the number of
332	times that the Newton-Raphson method which is used to solve the
333	equations for the distances is iterated. The program can be
334	speeded up by reducing the number of iterations from the default
335	value of 20, but at the possible risk of computing the distance
336	less accurately.
337	<P>
338	<H2>FUTURE OF THE PROGRAM</H2>
339	<P>
340	The present program does not compute the original distance of Nei and Li (1979)
341	for restriction sites (though it does have an option to compute their original
342	distance for restriction fragments). I hope to add their restriction
343	sites distance in the near future.
344	<P>
345	<HR>
346	<P>
347	<H3>TEST DATA SET</H3>
348	<P>
349	<TABLE><TR><TD BGCOLOR=white>
350	<PRE>
351	5 13 2
352	Alpha ++-+-++--+++-
353	Beta ++++--+--+++-
354	Gamma -+--+-++-+-++
355	Delta ++-+----++---
356	Epsilon ++++----++---
357	</PRE>
358	</TD></TR></TABLE>
359	<P>
360	<HR>
361	<H3>CONTENTS OF OUTPUT FILE (with all numerical options on)</H3>
362	<P>
363	(Note that when the options for displaying the input data are turned off,
364	the output is in a form suitable for use as an input file in the distance
365	matrix programs).<P>
366	<P>
367	<TABLE><TR><TD BGCOLOR=white>
368	<PRE>
369
370	5 Species, 13 Sites
371
372	Name Sites
373	---- -----
374
375	Alpha ++-+-++--+ ++-
376	Beta ++++--+--+ ++-
377	Gamma -+--+-++-+ -++
378	Delta ++-+----++ ---
379	Epsilon ++++----++ ---
380
381
382	Alpha 0.0000 0.0224 0.1077 0.0688 0.0826
383	Beta 0.0224 0.0000 0.1077 0.0688 0.0442
384	Gamma 0.1077 0.1077 0.0000 0.1765 0.1925
385	Delta 0.0688 0.0688 0.1765 0.0000 0.0197
386	Epsilon 0.0826 0.0442 0.1925 0.0197 0.0000
387	</PRE>
388	</TD></TR></TABLE>
389	</BODY>
390	</HTML>

Note: See TracBrowser for help on using the repository browser.

Context Navigation

source: trunk/GDE/PHYLIP/doc/restdist.html

Download in other formats: