	1	<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 3.2 Final//EN">
	2	<HTML>
	3	<HEAD>
	4	<TITLE>restdist</TITLE>
	5	<META NAME="description" CONTENT="restdist">
	6	<META NAME="keywords" CONTENT="restdist">
	7	<META NAME="resource-type" CONTENT="document">
	8	<META NAME="distribution" CONTENT="global">
	9	<META HTTP-EQUIV="Content-Type" CONTENT="text/html; charset=iso-8859-1">
	10	</HEAD>
	11	<BODY BGCOLOR="#ccffff">
	12	<DIV ALIGN=RIGHT>
	13	version 3.6
	14	</DIV>
	15	<P>
	16	<DIV ALIGN=CENTER>
	17	<H1>RESTDIST -- Program to compute distance matrix<BR>from restriction sites or fragments</H1>
	18	</DIV>
	19	<P>
	20	© Copyright 2000-2002 by the University of
	21	Washington. Written by Joseph Felsenstein. Permission is granted to copy
	22	this document provided that no fee is charged for it and that this copyright
	23	notice is not removed.
	24	<P>
	25	Restdist reads the same restriction sites format as RESTML and
	26	computes a restriction sites distance. It can also compute a restriction
	27	fragments distance. The original restriction fragments and restriction
	28	sites distance methods were introduced by Nei and Li (1979). Their original
	29	<! ???? CHANGE NEXT LINE TO methods are WHEN NEI/LI R.S. DISTANCE INCLUDED >
	30	method for restriction fragments is
	31	also available in this program, although its default methods
	32	are my modifications of the original Nei and Li methods.
	33	<P>
	34	These two distances assume that the restriction sites are accidental byproducts
	35	of random change of nucleotide sequences. For my restriction sites distance
	36	the DNA sequences are assumed to be changing according to the Kimura
	37	2-parameter model of DNA change (Kimura, 1980). The user can set the
	38	transition/transversion rate for the model. For my restriction fragments
	39	distance
	40	there is an implicit assumption of a Jukes-Cantor (1969) model of change.
	41	The user can also set the
	42	parameter of a correction for unequal rates of evolution between sites in
	43	the DNA sequences, using a Gamma distribution of rates among sites.
	44	The Jukes-Cantor model is also implicit in the restriction fragments
	45	distance of Nei and Li (1979). It
	46	does not allow us to correct for a Gamma distribution of rates among
	47	sites.
	48	<P>
	49	<H2>Restriction Sites Distance</H2>
	50	<P>
	51	The restriction sites distances use data coded for the presence or absence of
	52	individual restriction sites (usually as + and - or 0 and 1). My
	53	distance is based on the proportion, out of all sites observed in one species
	54	or the other, which are present in both species. This is done to correct for
	55	the ascertainment of sites, for the fact that we are not aware of many sites
	56	because they do not appear in any species.
	57	<P>
	58	My distance starts by computing from the particular pair of species the fraction
	59	<PRE>
	60	                 n<SUB>++</SUB>
	61	   f =  ---------------------
	62	        n<SUB>++</SUB> + <SUP>1</SUP>/<SUB>2</SUB> (n<SUB>+-</SUB> + n<SUB>-+</SUB>)
	63	</PRE>
	64	where <I>n<SUB>++</SUB></I> is the number of sites contained in both species,
	65	<I>n<SUB>+-</SUB></I> is the number of sites contained in the first of the
	66	two species but not in the second, and <I>n<SUB>-+</SUB></I> is the
	67	number of sites contained in the second of the two species but not in the
	68	first. This is the fraction of sites that are present in one species which are
	69	present in both. Since the number of sites present in the two species will
	70	often differ, the denominator is the average of the number of sites found in
	71	the two species.
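<P>
As a rough illustration of the fraction defined above, the following Python
sketch counts the three kinds of sites for one pair of species. It is not taken
from the program's source: the function name <TT>site_fraction</TT> and the
decision to skip any site scored "?" in either species are assumptions made
for this example.

```python
def site_fraction(a, b):
    """Return f = n++ / (n++ + (n+- + n-+) / 2) for two site strings."""
    present = {"+", "1"}
    absent = {"-", "0"}
    n_pp = n_pm = n_mp = 0
    for x, y in zip(a, b):
        if x in present and y in present:
            n_pp += 1          # site found in both species
        elif x in present and y in absent:
            n_pm += 1          # site found only in the first species
        elif x in absent and y in present:
            n_mp += 1          # site found only in the second species
        # sites absent from both, or unknown in either, are not counted
    denom = n_pp + 0.5 * (n_pm + n_mp)
    if denom == 0:
        raise ValueError("no sites observed in either species")
    return n_pp / denom


# Alpha and Beta from the test data set: 7 shared sites, 1 + 1 unshared
f = site_fraction("++-+-++--+++-", "++++--+--+++-")
print(f)  # 0.875
```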
	72	<P>
	73	If each restriction site is <I>s</I> nucleotides long, the probability
	74	that a restriction site is present in the other species, given that it is
	75	present in a species, is
	76	<PRE>
	77	Q<SUP>s</SUP>,
	78	</PRE>
	79	where <I>Q</I> is the probability that a nucleotide has no net change as one
	80	goes from the one species to the other. It may have changed in between; we
	81	are interested in the probability that that nucleotide site is in the same base
	82	in both species, irrespective of what has happened in between.
	83	<P>
	84	The distance is then computed by finding the branch length of a two-species
	85	tree (connecting these two species with a single branch) such
	86	that <I>Q</I> equals the <I>s</I>-th root of <I>f</I>. For this the
	87	program computes <I>Q</I> for various values of branch length, iterating them
	88	by a Newton-Raphson algorithm until the two quantities are equal.
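<P>
The iteration just described can be sketched in a few lines of Python. To keep
the example short it assumes a Jukes-Cantor model, for which
<I>Q</I> = 1/4 + 3/4 exp(-4<I>t</I>/3) at branch length <I>t</I>, rather than
the Kimura 2-parameter model that the program uses for this distance. The
function name and the starting value of 0.1 are assumptions of this sketch; the
default of 20 iterations follows the "iterationsr" constant described under
PROGRAM CONSTANTS.

```python
import math


def jc_branch_length(f, s, initial_v=0.1, iterations=20):
    """Find t such that Q(t) equals the s-th root of f (Jukes-Cantor Q)."""
    target = f ** (1.0 / s)
    if not 0.25 < target <= 1.0:
        raise ValueError("no finite branch length gives this fraction")
    t = initial_v
    for _ in range(iterations):
        decay = math.exp(-4.0 * t / 3.0)
        q = 0.25 + 0.75 * decay      # probability of no net change
        dq = -decay                  # derivative of Q with respect to t
        t -= (q - target) / dq       # Newton-Raphson update
    return t


t = jc_branch_length(0.875, 6)
```

Under Jukes-Cantor the answer is also available in closed form,
<I>t</I> = -(3/4) log<SUB>e</SUB>((4<I>Q</I> - 1)/3), which gives a convenient
check on the iteration.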
	89	<P>
	90	The resulting distance should be numerically close to the original
	91	restriction sites distance of Nei and Li (1979). It is inspired by theirs,
	92	but theirs differs by implicitly assuming a symmetric Jukes-Cantor (1969)
	93	model of nucleotide change, and theirs does not include a correction for
	94	Gamma distribution of rate of change among nucleotide sites.
	95	<P>
	96	<H2>Restriction Fragments Distance</H2>
	97	<P>
	98	For restriction fragments data we use a different distance. If we
	99	average over all restriction fragment lengths, each at its own
	100	expected frequency, the probability that the fragment will still be
	101	in existence after a certain amount of branch length, we must take into
	102	account the probability that the two restriction sites at the ends of
	103	the fragment do not mutate, and the probability that no new
	104	restriction site occurs within the fragment in that amount of branch
	105	length. The result for a restriction site length of <I>s</I> is:
	106	<PRE>
	107	          Q<SUP>2s</SUP>
	108	   f =  --------
	109	         2 - Q<SUP>s</SUP>
	110	</PRE>
	111	(The details of the derivation will be given in my forthcoming book
	112	<I>Inferring Phylogenies</I>, to be published by Sinauer Associates
	113	in 2001.)
	114	Given the observed fraction of restriction sites retained, <I>f</I>,
	115	we can solve a quadratic equation from the above expression for
	116	<I>Q<SUP>s</SUP></I>. That makes it easy to obtain a value of <I>Q</I>,
	117	and the branch length can then be estimated by adjusting it so the
	118	probability of a base not changing is equal to that value.
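<P>
Writing <I>x</I> for <I>Q<SUP>s</SUP></I>, the expression above rearranges to
<I>x</I><SUP>2</SUP> + <I>f x</I> - 2<I>f</I> = 0, whose root between 0 and 1
is <I>x</I> = (-<I>f</I> + sqrt(<I>f</I><SUP>2</SUP> + 8<I>f</I>)) / 2. The
Python sketch below carries out that step and then takes the <I>s</I>-th root.
The function name is hypothetical, and the conversion from <I>Q</I> to a branch
length is left out.

```python
import math


def q_from_fragment_fraction(f, s):
    """Solve f = x**2 / (2 - x) for x = Q**s, then return Q."""
    if not 0.0 <= f <= 1.0:
        raise ValueError("f must lie between 0 and 1")
    # positive root of x**2 + f*x - 2*f = 0
    x = (-f + math.sqrt(f * f + 8.0 * f)) / 2.0
    return x ** (1.0 / s)


q = q_from_fragment_fraction(0.5, 6)
```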
	119	<P>
	120	Alternatively, if we use the Nei and Li (1979) restriction fragments
	121	distance, this involves solving for <I>g</I> in the nonlinear
	122	equation
	123	<PRE>
	124	g = [ f (3 - 2g) ]<SUP><SUP>1</SUP>/<SUB>4</SUB></SUP>
	125	</PRE>
	126	and then the distance is given by
	127	<PRE>
	128	d = - (<SUP>2</SUP>/<SUB>r</SUB>) log<SUB>e</SUB>(g)
	129	</PRE>
	130	where <I>r</I> is the length of the restriction site.
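<P>
One simple way to solve the nonlinear equation for <I>g</I> is to substitute
repeatedly into its right-hand side. The Python sketch below does this,
starting from <I>g</I> = <I>f</I><SUP>1/4</SUP>, and then applies the distance
formula. The function name, the starting value, the tolerance and the
iteration cap are all choices made for this example and are not taken from the
program.

```python
import math


def nei_li_fragment_distance(f, r, tol=1e-12, max_iter=100):
    """Solve g = (f * (3 - 2g)) ** (1/4), then return d = -(2/r) * ln(g)."""
    if not 0.0 < f <= 1.0:
        raise ValueError("f must be greater than 0 and at most 1")
    g = f ** 0.25                        # starting guess
    for _ in range(max_iter):
        g_new = (f * (3.0 - 2.0 * g)) ** 0.25
        converged = abs(g_new - g) < tol
        g = g_new
        if converged:
            break
    return -(2.0 / r) * math.log(g)


d = nei_li_fragment_distance(0.5, 6)
```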
	131	<P>
	132	Comparing these two restriction fragments distances in a case
	133	where their underlying DNA model is the same (which is when the
	134	transition/transversion ratio of the modified model is set to
	135	0.5), you will find
	136	that they are very close to each other, differing very little at
	137	small distances, with the modified distance becoming smaller than
	138	the Nei/Li distance at larger distances. It will therefore matter
	139	very little which one you use.
	140	<P>
	141	<H2>A Comment About RAPDs and AFLPs</H2>
	142	<P>
	143	Although these distances are designed for restriction sites
	144	and restriction fragments data, they can be applied to RAPD and
	145	AFLP data as well. RAPD (Randomly Amplified Polymorphic DNA)
	146	and AFLP (Amplified Fragment Length Polymorphism) data consist
	147	of presence or absence of individual bands on a gel. The bands
	148	are segments of DNA with PCR primers at each end. These
	149	primers are defined sequences of known length (often about
	150	10 nucleotides each). For AFLPs the relevant length is the primer
	151	length, plus three nucleotides. Mutation in these sequences means they can no
	152	longer act as primers, just as in the case of restriction sites.
	153	Thus a pair of 10-nucleotide primers will behave much the same
	154	as a 20-nucleotide restriction site. You can use the restriction
	155	sites distance as the distance between RAPD or AFLP patterns if you
	156	set the proper value for the total length of the site to the
	157	total length of the primers (plus 6 in the case of AFLPs).
	158	Of course there are many possible sources of noise in these data,
	159	including confusing fragments of similar length for each other
	160	and having primers near each other in the genome, and these are
	161	not taken into account in the statistical model used here.
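<P>
The arithmetic for choosing the site length in these cases is small enough to
write down directly. In this Python sketch the function name and its
<TT>aflp</TT> flag are hypothetical; the result is the number one would enter
for the site length option.

```python
def effective_site_length(primer_length, aflp=False):
    """Total site length for a pair of primers of equal length."""
    total = 2 * primer_length   # two primers, one at each end of the band
    if aflp:
        total += 6              # three extra nucleotides per primer
    return total


rapd_length = effective_site_length(10)             # 20
aflp_length = effective_site_length(10, aflp=True)  # 26
```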
	162	<P>
	163	<H2>INPUT FORMAT AND OPTIONS</H2>
	164	<P>
	165	The input is fairly standard, with one addition. As usual the first line of
	166	the file gives the number of species and the number of sites, but there is also
	167	a third number, which is the number of different restriction enzymes that were
	168	used to detect the restriction sites. Thus a data set with 10 species and
	169	35 different sites, representing digestion with 4 different enzymes, would
	170	have the first line of the data file look like this:
	171	<P>
	172	<PRE>
	173	10 35 4
	174	</PRE>
	175	<P>
	176	The site data are in standard form. Each species starts with a species name
	177	whose maximum length is given by the constant "nmlngth"
	178	(whose value in the
	179	program as distributed is 10 characters). The name should, as usual, be padded
	180	out to that length with blanks if necessary. The sites data then follows, one
	181	character per site (any blanks will
	182	be skipped and ignored). Like the DNA and protein sequence data, the
	183	restriction sites data may be either in the "interleaved" form or the
	184	"sequential" form. Note that if you are analyzing restriction sites
	185	data with the programs DOLLOP or MIX or other discrete character
	186	programs, at the moment those programs do not use the "aligned" or
	187	"interleaved" data format. Therefore you may want to avoid that format
	188	when you have restriction sites data that you will want to feed into
	189	those programs.
	190	<P>
	191	The presence of a site is indicated by a "+" and the absence by a "-". I have
	192	also allowed the use of "1" and "0" as synonyms for "+" and "-", for
	193	compatibility with MIX and DOLLOP which do not allow "+" and "-". If the
	194	presence of
	195	the site is unknown (for example, if the DNA containing it has been deleted so
	196	that one
	197	does not know whether it would have contained the site) then the state "?" can
	198	be used to indicate that the state of this site is unknown.
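<P>
To make the layout concrete, here is a minimal Python sketch of a reader for
the "sequential" form only. It assumes names occupy exactly 10 columns (the
distributed value of "nmlngth") and that a species' sites may continue on
following lines. The function and variable names are inventions of this
example, not part of the program.

```python
NAME_LENGTH = 10  # value of "nmlngth" in the program as distributed


def read_sequential_sites(text):
    """Return (number of enzymes, {species name: site string})."""
    lines = [ln for ln in text.splitlines() if ln.strip()]
    n_species, n_sites, n_enzymes = (int(tok) for tok in lines[0].split()[:3])
    species = {}
    pos = 1
    for _ in range(n_species):
        name = lines[pos][:NAME_LENGTH].strip()
        sites = "".join(lines[pos][NAME_LENGTH:].split())  # blanks are skipped
        pos += 1
        while len(sites) < n_sites:                        # continuation lines
            sites += "".join(lines[pos].split())
            pos += 1
        if len(sites) != n_sites or set(sites) - set("+-01?"):
            raise ValueError("bad site data for " + name)
        species[name] = sites
    return n_enzymes, species


rows = [("Alpha", "++-+-++--+++-"), ("Beta", "++++--+--+++-"),
        ("Gamma", "-+--+-++-+-++"), ("Delta", "++-+----++---"),
        ("Epsilon", "++++----++---")]
sample = "5 13 2\n" + "".join(n.ljust(NAME_LENGTH) + s + "\n" for n, s in rows)
n_enzymes, species = read_sequential_sites(sample)
```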
	199	<P>
	200	The options are selected using an interactive menu. The menu looks like this:
	201	<P>
	202	<TABLE><TR><TD BGCOLOR=white>
	203	<PRE>
	204
	205	Restriction site or fragment distances, version 3.6a3
	206
	207	Settings for this run:
	208	  R           Restriction sites or fragments?  Sites
	209	  G  Gamma distribution of rates among sites?  No
	210	  T            Transition/transversion ratio?  2.000000
	211	  S                              Site length?  6.0
	212	  L                  Form of distance matrix?  Square
	213	  M               Analyze multiple data sets?  No
	214	  I              Input sequences interleaved?  Yes
	215	  0       Terminal type (IBM PC, ANSI, none)?  (none)
	216	  1       Print out the data at start of run?  No
	217	  2     Print indications of progress of run?  Yes
	218
	219	  Y to accept these or type the letter for one to change
	220
	221	</PRE>
	222	</TD></TR></TABLE>
	223	<P>
	224	The user either types "Y" (followed, of course, by a carriage-return)
	225	if the settings shown are to be accepted, or the letter or digit corresponding
	226	to an option that is to be changed.
	227	<P>
	228	The R option toggles between a restriction sites distance, which
	229	is the default setting, and a restriction fragments distance. In
	230	the latter case, another option appears, the N (Nei/Li) option.
	231	This allows the user to choose the original Nei and Li (1979)
	232	restriction fragments distance rather than my modified Nei/Li
	233	distance, which is the default.
	234	<P>
	235	If the G (Gamma distribution) option is selected, the user will be
	236	asked to supply the coefficient of variation of the rate of substitution
	237	among sites. This is different from the parameters used by Nei and Jin, who
	238	introduced Gamma distribution of rates in DNA distances, but
	239	related to their parameters: their parameter <EM>a</EM> is also known as
	240	"alpha", the shape parameter of the Gamma distribution. It is
	241	related to the coefficient of variation by
	242	<P>
	243	CV = 1 / a<SUP>1/2</SUP>
	244	<P>
	245	or
	246	<P>
	247	a = 1 / (CV)<SUP>2</SUP>
	248	<P>
	249	(their parameter <EM>b</EM> is absorbed here by the requirement that time is scaled so
	250	that the mean rate of evolution is 1 per unit time, which means that <EM>a = b</EM>).
	251	As we consider cases in which the rates are less variable we should set <EM>a</EM>
	252	larger and larger, as <EM>CV</EM> gets smaller and smaller.
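<P>
The two relations between the coefficient of variation and <EM>a</EM> translate
directly into code. The function names in this Python sketch are hypothetical.

```python
import math


def alpha_from_cv(cv):
    """Shape parameter a of the Gamma distribution: a = 1 / CV**2."""
    return 1.0 / (cv * cv)


def cv_from_alpha(a):
    """Coefficient of variation: CV = 1 / sqrt(a)."""
    return 1.0 / math.sqrt(a)


a = alpha_from_cv(2.0)    # 0.25: highly variable rates
cv = cv_from_alpha(4.0)   # 0.5: less variable rates
```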
	253	<P>
	254	The Gamma distribution option is not available when using the
	255	original Nei/Li restriction fragments distance.
	256	<P>
	257	The T option is the Transition/transversion option. The user is prompted for
	258	a real number greater than 0.0, as the expected ratio of transitions to
	259	transversions. Note
	260	that this is the expected ratio that results from the model, not the ratio of the rates of the two kinds of change.
	261	The default value of the T parameter if you do not use the T
	262	option is 2.0. The T option is not available when you choose the original
	263	Nei/Li restriction fragment distance, which assumes a Jukes-Cantor (1969)
	264	model of DNA change, for which the transition/transversion ratio is
	265	in effect fixed at 0.5.
	266	<P>
	267	The S option selects the site length. This is set to a default
	268	value of 6. It can be set to any positive integer. While in
	269	the RESTML program there is an upper limit on the restriction
	270	site length (set by memory limitations), in RESTDIST there is
	271	no effective limit on the size of the restriction sites. A value
	272	of 20, which might be appropriate in many cases for RAPD or AFLP
	273	data, is typically not practical in RESTML, but it is useable in
	274	RESTDIST.
	275	<P>
	276	Option L specifies that the output file will have a square matrix
	277	of distances. It can be used to change to lower-triangular
	278	data matrices. This will usually not be
	279	necessary, but if the distance matrices are going to be very
	280	large, this alternative can reduce their size by half. The
	281	programs which are to use them should then of course be informed
	282	that they can expect lower-triangular distance matrices.
	283	<P>
	284	The M, I, and 0 options are the usual Multiple data set,
	285	Interleaved input, and screen terminal type options. These are
	286	described in the main documentation file.
	287	<P>
	288	Option 1 specifies that the input data will be written out
	289	on the output file before the distances. This is off by
	290	default. If it is done, it will make the output file unusable
	291	as input to our distance matrix programs.
	292	<P>
	293	Option 2 turns off or on the indications of the progress of the run.
	294	The program prints out a row of dots (".") indicating the
	295	calculation of individual distances. Since the distance matrix
	296	is symmetrical, the program only computes the distances for the
	297	upper triangle of the distance matrix, and then duplicates
	298	the distance to the other corner of the matrix. Thus the rows of
	299	dots start out at full length, and then get shorter and shorter.
	300	<P>
	301	<H2>OUTPUT FORMAT</H2>
	302	<P>
	303	The output file contains on its first line the number of species. The
	304	distance matrix is then printed in standard
	305	form, with each species starting on a new line with the species name, followed
	306	by the distances to the species in order. These continue onto a new line
	307	after every nine distances. If the L option is used, the matrix of distances
	308	is in lower triangular form, so that only the distances to the other species
	309	that precede each species are printed. Otherwise the distance matrix is square
	310	with zero distances on the diagonal. In general the format of the distance
	311	matrix is such that it can serve as input to any of the distance matrix
	312	programs.
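<P>
The layout described above can be sketched as a small Python writer. The
column widths (a 10-character name followed by distances printed to four
decimal places in 8 columns) and the unindented continuation lines are
assumptions of this example, as is the function name.

```python
def format_distance_matrix(names, dist, lower=False, per_line=9):
    """Return the matrix as text, square or lower-triangular."""
    lines = ["%5d" % len(names)]             # first line: number of species
    for i, name in enumerate(names):
        # lower-triangular form keeps only distances to preceding species
        row = dist[i][:i] if lower else dist[i]
        cells = ["%8.4f" % d for d in row]
        lines.append(name[:10].ljust(10) + "".join(cells[:per_line]))
        # continue on a new line after every nine distances
        for k in range(per_line, len(cells), per_line):
            lines.append("".join(cells[k:k + per_line]))
    return "\n".join(lines) + "\n"


names = ["Alpha", "Beta"]
dist = [[0.0, 0.0224], [0.0224, 0.0]]
square = format_distance_matrix(names, dist)
triangle = format_distance_matrix(names, dist, lower=True)
```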
	313	<P>
	314	If the option to print out the data is selected, the output file will
	315	precede the data by more complete information on the input and the menu
	316	selections. The output file begins by giving the number of species and the
	317	number of characters.
	318	<P>
	319	The distances printed out are scaled in terms of expected numbers of
	320	substitutions per DNA site, counting both transitions and transversions but not
	321	replacements of a base by itself, and scaled so that the average rate of
	322	change, averaged over all sites analyzed, is set to 1.0. Thus when the
	323	G option is used, the rate of change at one site may be higher than
	324	at another, but their mean is expected to be 1.
	325	<P>
	326	<H2>PROGRAM CONSTANTS</H2>
	327	<P>
	328	The constants available to be changed are "initialv" and
	329	"iterationsr". The constant "initialv" is the starting
	330	value of the distance in the iterations. This will typically not need to
	331	be changed. The constant "iterationsr" is the number of
	332	times that the Newton-Raphson method which is used to solve the
	333	equations for the distances is iterated. The program can be
	334	speeded up by reducing the number of iterations from the default
	335	value of 20, but at the possible risk of computing the distance
	336	less accurately.
	337	<P>
	338	<H2>FUTURE OF THE PROGRAM</H2>
	339	<P>
	340	The present program does not compute the original distance of Nei and Li (1979)
	341	for restriction sites (though it does have an option to compute their original
	342	distance for restriction fragments). I hope to add their restriction
	343	sites distance in the near future.
	344	<P>
	345	<HR>
	346	<P>
	347	<H3>TEST DATA SET</H3>
	348	<P>
	349	<TABLE><TR><TD BGCOLOR=white>
	350	<PRE>
	351	5 13 2
	352	Alpha     ++-+-++--+++-
	353	Beta      ++++--+--+++-
	354	Gamma     -+--+-++-+-++
	355	Delta     ++-+----++---
	356	Epsilon   ++++----++---
	357	</PRE>
	358	</TD></TR></TABLE>
	359	<P>
	360	<HR>
	361	<H3>CONTENTS OF OUTPUT FILE (with all numerical options on)</H3>
	362	<P>
	363	(Note that when the options for displaying the input data are turned off,
	364	the output is in a form suitable for use as an input file in the distance
	365	matrix programs).<P>
	366	<P>
	367	<TABLE><TR><TD BGCOLOR=white>
	368	<PRE>
	369
	370	5 Species, 13 Sites
	371
	372	Name Sites
	373	---- -----
	374
	375	Alpha ++-+-++--+ ++-
	376	Beta ++++--+--+ ++-
	377	Gamma -+--+-++-+ -++
	378	Delta ++-+----++ ---
	379	Epsilon ++++----++ ---
	380
	381
	382	Alpha       0.0000  0.0224  0.1077  0.0688  0.0826
	383	Beta        0.0224  0.0000  0.1077  0.0688  0.0442
	384	Gamma       0.1077  0.1077  0.0000  0.1765  0.1925
	385	Delta       0.0688  0.0688  0.1765  0.0000  0.0197
	386	Epsilon     0.0826  0.0442  0.1925  0.0197  0.0000
	387	</PRE>
	388	</TD></TR></TABLE>
	389	</BODY>
	390	</HTML>
