Context Navigation

dnadist.html

Visit:

Last change on this file was 2176, checked in by westram, 22 years ago
* empty log message *
Property svn:eol-style set to `native` Property svn:keywords set to `Author Date Id Revision`
File size: 23.9 KB

Line
1	<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 3.2 Final//EN">
2	<HTML>
3	<HEAD>
4	<TITLE>dnadist</TITLE>
5	<META NAME="description" CONTENT="dnadist">
6	<META NAME="keywords" CONTENT="dnadist">
7	<META NAME="resource-type" CONTENT="document">
8	<META NAME="distribution" CONTENT="global">
9	<META HTTP-EQUIV="Content-Type" CONTENT="text/html; charset=iso-8859-1">
10	</HEAD>
11	<BODY BGCOLOR="#ccffff">
12	<DIV ALIGN=RIGHT>
13	version 3.6
14	</DIV>
15	<P>
16	<DIV ALIGN=CENTER>
17	<H1>DNADIST -- Program to compute distance matrix<BR>from nucleotide sequences</H1>
18	</DIV>
19	<P>
20	© Copyright 1986-2002 by the University of
21	Washington. Written by Joseph Felsenstein. Permission is granted to copy
22	this document provided that no fee is charged for it and that this copyright
23	notice is not removed.
24	<P>
25	This program uses nucleotide sequences to compute a distance matrix, under
26	four different models of nucleotide substitution. It can also
27	compute a table of similarity between the nucleotide sequences.
28	The distance for each
29	pair of species estimates the total branch length between the two species, and
30	can be used in the distance matrix programs FITCH, KITSCH or NEIGHBOR. This
31	is an alternative to use of the sequence data itself in the maximum likelihood
32	program DNAML or the parsimony program DNAPARS.
33	<P>
34	The program reads in nucleotide sequences and writes an output file containing
35	the distance matrix, or else a table of similarity between sequences.
36	The four models of nucleotide substitution are those
37	of Jukes and Cantor (1969), Kimura (1980), the F84 model (Kishino and Hasegawa,
38	1989; Felsenstein and Churchill, 1996), and the model underlying the
39	LogDet distance (Barry and Hartigan, 1987; Lake, 1994;
40	Steel, 1994; Lockhart et. al., 1994).
41	All except the LogDet distance can be made to allow for
42	for unequal rates of substitution at different sites, as Jin and Nei
43	(1990) did for the Jukes-Cantor model.
44	The program correctly takes
45	into account a variety of sequence ambiguities, although in cases where they
46	exist it can be slow.
47	<P>
48	Jukes and Cantor's (1969) model assumes that there is independent change at
49	all sites, with equal probability. Whether a base changes is independent of
50	its identity, and when it changes there is an equal probability of ending up
51	with each of the other three bases. Thus the transition probability matrix
52	(this is a technical term from probability theory and has nothing to do with
53	transitions as opposed to transversions) for a short period of time dt is:
54	<P>
55	<PRE>
56	To: A G C T
57	---------------------------------
58	A \| 1-3a a a a
59	From: G \| a 1-3a a a
60	C \| a a 1-3a a
61	T \| a a a 1-3a
62	</PRE>
63	<P>
64	where <EM>a</EM> is <EM>u dt</EM>, the product of the rate of substitution per unit time (<EM>u</EM>)
65	and the length <EM>dt</EM> of the time interval. For longer periods of time this
66	implies that the probability that two sequences will differ at a given site
67	is:
68	<P>
69	p = <SUP>3</SUP>/<SUB>4</SUB> ( 1 - e<SUP>- 4/3 u t</SUP>)
70	<P>
71	and hence that if we observe <EM>p</EM>, we can compute an estimate of the branch
72	length <EM>ut</EM> by inverting this to get
73	<P>
74	ut = - <SUP>3</SUP>/<SUB>4</SUB> log<SUB>e</SUB> ( 1 - <SUP>4</SUP>/<SUB>3</SUB> p )
75	<P>
76	The Kimura "2-parameter" model is almost as symmetric as this, but allows
77	for a difference between transition and transversion rates. Its transition
78	probability matrix for a short interval of time is:
79	<P>
80	<PRE>
81	To: A G C T
82	---------------------------------
83	A \| 1-a-2b a b b
84	From: G \| a 1-a-2b b b
85	C \| b b 1-a-2b a
86	T \| b b a 1-a-2b
87	</PRE>
88	<P>
89	<P>
90	where <EM>a</EM> is <EM>u dt</EM>, the product of the rate of transitions per unit time and <EM>dt</EM>
91	is the length <EM>dt</EM> of the time interval, and <EM>b</EM> is <EM>v dt</EM>, the product of half the
92	rate of transversions (i.e., the rate of a specific transversion)
93	and the length dt of the time interval.
94	<P>
95	The F84 model incorporates different rates of transition
96	and transversion, but also allowing for different frequencies of the four
97	nucleotides. It is the model which is used in DNAML, the maximum likelihood
98	nucelotide sequence phylogenies program in this package. You will find the
99	model described in the document for that program. The transition
100	probabilities for this model are given by Kishino and Hasegawa (1989),
101	and further explained in a paper by me and Gary Churchill (1996).
102	<P>
103	The LogDet distance allows a fairly general model of substitution. It
104	computes the distance from the determinant of the empirically observed
105	matrix of joint probabilities of nucleotides in the two species. An
106	explanation of it is available in the chapter by Swofford et, al. (1996).
107	<P>
108	The first three models are closely related. The DNAML model reduces to
109	Kimura's
110	two-parameter model if we assume that the equilibrium frequencies of the four
111	bases are equal. The Jukes-Cantor model in turn is a special case of the
112	Kimura 2-parameter model where a = b. Thus each model is a special case of
113	the ones that follow it, Jukes-Cantor being a special case of both of the
114	others.
115	<P>
116	The Jin and Nei (1990) correction for variation in rate of evolution from
117	site to site can be adapted to all of the first three models.
118	It assumes that the rate of substitution varies from site to site
119	according to a gamma distribution, with a coefficient of variation that
120	is specified by the user. The user is asked for it when choosing this option
121	in the menu.
122	<P>
123	Each distance that is calculated is an estimate, from that particular pair of
124	species, of the divergence time between those two species. For the Jukes-
125	Cantor model, the estimate is computed using the formula for <EM>ut</EM> given above,
126	as long as the nucleotide symbols in the two sequences are all either A, C, G,
127	T, U, N, X, ?, or - (the latter four indicate a deletion or an unknown
128	nucleotide. This estimate is a maximum likelihood estimate for that
129	model. For the Kimura 2-parameter model, with only these nucleotide symbols,
130	formulas special to that estimate are also computed. These are also,
131	in effect, computing the maximum likelihood estimate for that model. In
132	the Kimura case it depends on the
133	observed sequences only through the sequence length and the observed number of
134	transition and transversion differences between those two sequences. The
135	calculation in that case is a maximum likelihood estimate and will differ
136	somewhat from the estimate obtained from the formulas in Kimura's original
137	paper. That formula was also a maximum likelihood estimate, but with the
138	transition/transversion ratio estimated empirically, separately for each pair
139	of sequences. In the present case, one overall preset transition/transversion
140	ratio is used which makes the computations harder but achieves greater
141	consistency between different comparisons.
142	<P>
143	For the
144	F84 model, or for any of the models where one or both sequences contain at
145	least one of the other ambiguity codons such as Y, R, etc., a maximum
146	likelihood calculation is also done using code which was originally written
147	for DNAML. Its disadvantage is that it is slow. The resulting
148	distance is in effect a maximum likelihood estimate of the divergence time
149	(total branch length between) the two sequences. However the present
150	program will be much faster than versions earlier than 3.5, because I have
151	speeded up the iterations.
152	<P>
153	The LogDet model computes the distance from the determinant of the
154	matrix of co-occurrence of nucleotides in the two species, according to
155	the formula
156	<PRE>
157	D = - <SUP>1</SUP>/<SUB>4</SUB>(log<SUB>e</SUB>(\|F\|) - <SUP>1</SUP>/<SUB>2</SUB>log<SUB>e</SUB>(f<SUB>A</SUB><SUP>1</SUP>f<SUB>C</SUB><SUP>1</SUP>f<SUB>G</SUB><SUP>1</SUP>f<SUB>T</SUB><SUP>1</SUP>f<SUB>A</SUB><SUP>2</SUP>f<SUB>C</SUB><SUP>2</SUP>f<SUB>G</SUB><SUP>2</SUP>f<SUB>T</SUB><SUP>2</SUP>))
158	</PRE>
159	Where <EM>F</EM> is a matrix whose <EM>(i,j)</EM> element is the fraction
160	of sites at which base <EM>i</EM> occurs in one species and base <EM>j</EM>
161	occurs in the other. f<SUB>j</SUB><SUP>i</SUP> is the fraction of sites at
162	which species <EM>i</EM> has base <EM>j</EM>.
163	The LogDet distance cannot cope with ambiguity codes. It must have
164	completely defined sequences. One limitation of the LogDet distance is
165	that it may be infinite sometimes, if there are too many changes between
166	certain pairs of nucleotides. This can be particularly noticeable with
167	distances computed from bootstrapped sequences.
168	<P>
169	Note that there is an
170	assumption that we are looking at all sites, including those
171	that have not changed at all. It is important not to restrict attention
172	to some sites based on whether or not they have changed; doing that
173	would bias the distances by making them too large, and that in turn
174	would cause the distances
175	to misinterpret the meaning of those sites that
176	had changed.
177	<P>
178	For all of these
179	distance methods, the program allows us to specify
180	that "third position" bases have a different rate of substitution than first and
181	second positions, that introns have a different rate than exons, and so on. The
182	Categories option which does this allows us to make up to
183	9 categories of sites and specify different rates of change for them.
184	<P>
185	In addition to the four distance calculations, the program can also
186	compute a table of similarities between nucleotide sequences. These values
187	are the fractions of sites identical between the sequences.
188	The diagonal values are 1.0000. No attempt is made to count similarity
189	of nonidentical nucleotides, so that no credit is given for having
190	(for example) different purines at corresponding
191	sites in the two sequences. This option has been requested by many
192	users, who need it for descriptive purposes. It is not intended that
193	the table be used for inferring the tree.
194	<P>
195	<H2>INPUT FORMAT AND OPTIONS</H2>
196	<P>
197	Input is fairly standard, with one addition. As usual the first line of the
198	file gives the number of species and the number of sites.
199	<P>
200	Next come the species data. Each
201	sequence starts on a new line, has a ten-character species name
202	that must be blank-filled to be of that length, followed immediately
203	by the species data in the one-letter code. The sequences must either
204	be in the "interleaved" or "sequential" formats
205	described in the Molecular Sequence Programs document. The I option
206	selects between them. The sequences can have internal
207	blanks in the sequence but there must be no extra blanks at the end of the
208	terminated line. Note that a blank is not a valid symbol for a deletion --
209	neither is dot (".").
210	<P>
211	The options are selected using an interactive menu. The menu looks like this:
212	<P>
213	<TABLE><TR><TD BGCOLOR=white>
214	<PRE>
215
216	Nucleic acid sequence Distance Matrix program, version 3.6a3
217
218	Settings for this run:
219	D Distance (F84, Kimura, Jukes-Cantor, LogDet)? F84
220	G Gamma distributed rates across sites? No
221	T Transition/transversion ratio? 2.0
222	C One category of substitution rates? Yes
223	W Use weights for sites? No
224	F Use empirical base frequencies? Yes
225	L Form of distance matrix? Square
226	M Analyze multiple data sets? No
227	I Input sequences interleaved? Yes
228	0 Terminal type (IBM PC, ANSI, none)? (none)
229	1 Print out the data at start of run No
230	2 Print indications of progress of run Yes
231
232	Y to accept these or type the letter for one to change
233
234	</PRE>
235	</TD></TR></TABLE>
236	<P>
237	The user either types "Y" (followed, of course, by a carriage-return)
238	if the settings shown are to be accepted, or the letter or digit corresponding
239	to an option that is to be changed.
240	<P>
241	The D option selects one of the four distance methods, or the
242	similarity table. It toggles among the
243	five methods. The default method, if none is specified, is the F84
244	model.
245	<P>
246	If the G (Gamma distribution) option is selected, the user will be
247	asked to supply the coefficient of variation of the rate of substitution
248	among sites. This is different from the parameters used by Nei and Jin but
249	related to them: their parameter <EM>a</EM> is also known as "alpha",
250	the shape parameter of the Gamma distribution. It is
251	related to the coefficient of variation by
252	<P>
253	CV = 1 / a<SUP>1/2</SUP>
254	<P>
255	or
256	<P>
257	a = 1 / (CV)<SUP>2</SUP>
258	<P>
259	(their parameter <EM>b</EM> is absorbed here by the requirement that time is scaled so
260	that the mean rate of evolution is 1 per unit time, which means that <EM>a = b</EM>).
261	As we consider cases in which the rates are less variable we should set <EM>a</EM>
262	larger and larger, as <EM>CV</EM> gets smaller and smaller.
263	<P>
264	The F (Frequencies) option appears when the Maximum Likelihood distance is
265	selected. This distance requires that the program be provided with the
266	equilibrium frequencies of the four bases A, C, G, and T (or U). Its default
267	setting is one which may save users much time. If you
268	want to use the empirical frequencies of the bases, observed in the input
269	sequences, as the base frequencies, you simply use the default setting of
270	the F option. These empirical
271	frequencies are not really the maximum likelihood estimates of the base
272	frequencies, but they will often be close to those values (what they are is
273	maximum likelihood estimates under a "star" or "explosion" phylogeny).
274	If you change the setting of the F option you will be prompted for the
275	frequencies of the four bases. These must add to 1 and are to be typed on
276	one line separated by blanks, not commas.
277	<P>
278	The T option in this program does not stand for Threshold,
279	but instead is the Transition/transversion option. The user is prompted for
280	a real number greater than 0.0, as the expected ratio of transitions to
281	transversions. Note
282	that this is not the ratio of the first to the second kinds of events,
283	but the resulting expected ratio of transitions to transversions. The exact
284	relationship between these two quantities depends on the frequencies in the
285	base pools. The default value of the T parameter if you do not use the T
286	option is 2.0.
287	<P>
288	The C option allows user-defined rate categories. The user is prompted
289	for the number of user-defined rates, and for the rates themselves,
290	which cannot be negative but can be zero. These numbers, which must be
291	nonnegative (some could be 0),
292	are defined relative to each other, so that if rates for three categories
293	are set to 1 : 3 : 2.5 this would have the same meaning as setting them
294	to 2 : 6 : 5.
295	The assignment of rates to
296	sites is then made by reading a file whose default name is "categories".
297	It should contain a string of digits 1 through 9. A new line or a blank
298	can occur after any character in this string. Thus the categories file
299	might look like this:
300	<P>
301	<PRE>
302	122231111122411155
303	1155333333444
304	</PRE>
305	<P>
306	The L option specifies that the output file is to have the distance
307	matrix in lower triangular form.
308	<P>
309	The W (Weights) option is invoked in the usual way, with only weights 0
310	and 1 allowed. It selects a set of sites to be analyzed, ignoring the
311	others. The sites selected are those with weight 1. If the W option is
312	not invoked, all sites are analyzed.
313	The Weights (W) option
314	takes the weights from a file whose default name is "weights". The weights
315	follow the format described in the main documentation file.
316	<P>
317	The M (multiple data sets) option will ask you whether you want to
318	use multiple sets of weights (from the weights file) or multiple data sets
319	from the input file.
320	The ability to use a single data set with multiple weights means that
321	much less disk space will be used for this input data. The bootstrapping
322	and jackknifing tool Seqboot has the ability to create a weights file with
323	multiple weights. Note also that when we use multiple weights for
324	bootstrapping we can also then maintain different rate categories for
325	different sites in a meaningful way. You should not use the multiple
326	data sets option without using multiple weights, you should not at the
327	same time use the user-defined rate categories option (option C).
328	<P>
329	The options 0 is the usual one. It is described in the
330	main documentation file of this package. Option I is the same as in
331	other molecular sequence programs and is described in the documentation file
332	for the sequence programs.
333	<P>
334	<H2>OUTPUT FORMAT</H2>
335	<P>
336	As the
337	distances are computed, the program prints on your screen or terminal
338	the names of the species in turn,
339	followed by one dot (".") for each other species for which the distance to
340	that species has been computed. Thus if there are ten species, the first
341	species name is printed out, followed by nine dots, then on the next line
342	the next species name is printed out followed by eight dots, then the
343	next followed by seven dots, and so on. The pattern of dots should form
344	a triangle. When the distance matrix has been written out to the output
345	file, the user is notified of that.
346	<P>
347	The output file contains on its first line the number of species. The
348	distance matrix is then printed in standard
349	form, with each species starting on a new line with the species name, followed
350	by the distances to the species in order. These continue onto a new line
351	after every nine distances. If the L option is used, the matrix or distances
352	is in lower triangular form, so that only the distances to the other species
353	that precede each species are printed. Otherwise the distance matrix is square
354	with zero distances on the diagonal. In general the format of the distance
355	matrix is such that it can serve as input to any of the distance matrix
356	programs.
357	<P>
358	If the option to print out the data is selected, the output file will
359	precede the data by more complete information on the input and the menu
360	selections. The output file begins by giving the number of species and the
361	number of characters, and the identity of the distance measure that is
362	being used.
363	<P>
364	If the C (Categories) option is used a table of the relative rates of
365	expected substitution at each category of sites is printed, and a listing
366	of the categories each site is in.
367	<P>
368	There will then follow the equilibrium frequencies of the four bases.
369	If the Jukes-Cantor or Kimura distances are used, these will necessarily be
370	0.25 : 0.25 : 0.25 : 0.25. The output then shows
371	the transition/transversion ratio that was specified or
372	used by default. In the case of the Jukes-Cantor distance this will always
373	be 0.5. The transition-transversion parameter (as opposed to the ratio)
374	is also printed out: this is used within the program and can be ignored.
375	There then follow the data sequences, with the base sequences printed in
376	groups of ten bases along the lines of the Genbank and EMBL formats.
377	<P>
378	The distances printed out are scaled in terms of expected numbers of
379	substitutions, counting both transitions and transversions but not
380	replacements of a base by itself, and scaled so that the average rate of
381	change, averaged over all sites analyzed, is set to 1.0
382	if there are multiple categories of sites. This means that whether or not
383	there are multiple categories of sites, the expected fraction of change
384	for very small branches is equal to the branch length. Of course,
385	when a branch is twice as
386	long this does not mean that there will be twice as much net change expected
387	along it, since some of the changes may occur in the same site and overlie or
388	even reverse each
389	other. The branch lengths estimates here are in terms of the expected
390	underlying numbers of changes. That means that a branch of length 0.26
391	is 26 times as long as one which would show a 1% difference between
392	the nucleotide sequences at the beginning and end of the branch. But we
393	would not expect the sequences at the beginning and end of the branch to be
394	26% different, as there would be some overlaying of changes.
395	<P>
396	One problem that can arise is that two or more of the species can be so
397	dissimilar that the distance between them would have to be infinite, as
398	the likelihood rises indefinitely as the estimated divergence time
399	increases. For example, with the Jukes-Cantor model, if the two sequences
400	differ in 75% or more of their positions then the estimate of dovergence
401	time would be infinite. Since there is no way to represent an infinite
402	distance in the output file, the program regards this as an error, issues an
403	error message indicating which pair of species are causing the problem, and
404	stops. It might be that, had it continued running, it would have also run
405	into the same problem with other pairs of species. If the Kimura
406	distance is being used there may be no error message; the program may
407	simply give a large distance value (it is iterating towards
408	infinity and the value is just where the iteration stopped). Likewise
409	some maximum likelihood estimates may also become large for the same
410	reason (the sequences showing more divergence than is expected even with
411	infinite branch length). I hope in the future to add more warning
412	messages that would alert the user the this.
413	<P>
414	If the similarity table is selected, the table that is produced is not
415	in a format that can be used as input to the distance matrix programs.
416	it has a heading, and the species names are also put at the tops of the
417	columns of the table (or rather, the first 8 characters of each species
418	name is there, the other two characters omitted to save space). There
419	is not an option to put the table into a format that can be read by
420	the distance matrix programs, nor is there one to make it into a table
421	of fractions of difference by subtracting the similarity values from 1.
422	This is done deliberately to make it more difficult for the use to
423	use these values to construct trees. The similarity values are
424	not corrected for multiple changes, and their use to construct trees
425	(even after converting them to fractions of difference) would be
426	wrong, as it would lead to severe conflict between the distant
427	pairs of sequences and the close pairs of sequences.
428	<P>
429	<H2>PROGRAM CONSTANTS</H2>
430	<P>
431	The constants that are available to be changed by the user at the beginning
432	of the program include
433	"maxcategories", the maximum number of site
434	categories, "iterations", which controls the number of times
435	the program iterates the EM algorithm that is used to do the maximum
436	likelihood distance, "namelength", the length of species names in
437	characters, and "epsilon", a parameter which controls the accuracy of the
438	results of the iterations which estimate the distances. Making "epsilon"
439	smaller will increase run times but result in more decimal places of
440	accuracy. This should not be necessary.
441	<P>
442	The program spends most of its time doing real arithmetic.
443	The algorithm, with separate and independent computations
444	occurring for each pattern, lends itself readily to parallel processing.
445	<P>
446	<HR>
447	<P>
448	<H3>TEST DATA SET</H3>
449	<P>
450	<TABLE><TR><TD BGCOLOR=white>
451	<PRE>
452	5 13
453	Alpha AACGTGGCCACAT
454	Beta AAGGTCGCCACAC
455	Gamma CAGTTCGCCACAA
456	Delta GAGATTTCCGCCT
457	Epsilon GAGATCTCCGCCC
458	</PRE>
459	</TD></TR></TABLE>
460	<P>
461	<HR>
462	<H3>CONTENTS OF OUTPUT FILE (with all numerical options on)</H3>
463	<P>
464	(Note that when the options for displaying the input data are turned off,
465	the output is in a form suitable for use as an input file in the distance
466	matrix programs).<P>
467	<P>
468	<TABLE><TR><TD BGCOLOR=white>
469	<PRE>
470
471	Nucleic acid sequence Distance Matrix program, version 3.6a3
472
473	5 species, 13 sites
474
475	F84 Distance
476
477	Transition/transversion ratio = 2.000000
478
479	Name Sequences
480	---- ---------
481
482	Alpha AACGTGGCCA CAT
483	Beta AAGGTCGCCA CAC
484	Gamma CAGTTCGCCA CAA
485	Delta GAGATTTCCG CCT
486	Epsilon GAGATCTCCG CCC
487
488
489
490	Empirical Base Frequencies:
491
492	A 0.24615
493	C 0.36923
494	G 0.21538
495	T(U) 0.16923
496
497	Alpha 0.0000 0.3039 0.8575 1.1589 1.5429
498	Beta 0.3039 0.0000 0.3397 0.9135 0.6197
499	Gamma 0.8575 0.3397 0.0000 1.6317 1.2937
500	Delta 1.1589 0.9135 1.6317 0.0000 0.1659
501	Epsilon 1.5429 0.6197 1.2937 0.1659 0.0000
502	</PRE>
503	</TD></TR></TABLE>
504	</BODY>
505	</HTML>

Note: See TracBrowser for help on using the repository browser.

Context Navigation

source: trunk/GDE/PHYLIP/doc/dnadist.html

Download in other formats: