Context Navigation

distance.html

Visit:

Last change on this file was 2176, checked in by westram, 22 years ago
* empty log message *
Property svn:eol-style set to `native` Property svn:keywords set to `Author Date Id Revision`
File size: 16.2 KB

Line
1	<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 3.2 Final//EN">
2	<HTML>
3	<HEAD>
4	<TITLE>distance</TITLE>
5	<META NAME="description" CONTENT="distance">
6	<META NAME="keywords" CONTENT="distance">
7	<META NAME="resource-type" CONTENT="document">
8	<META NAME="distribution" CONTENT="global">
9	<META HTTP-EQUIV="Content-Type" CONTENT="text/html; charset=iso-8859-1">
10	</HEAD>
11	<BODY BGCOLOR="#ccffff">
12	<DIV ALIGN=RIGHT>
13	version 3.6
14	</DIV>
15	<P>
16	<DIV ALIGN=CENTER>
17	<H1>Distance matrix programs</H1>
18	</DIV>
19	<P>
20	© Copyright 1986-2000 by the University of Washington. Written by Joseph
21	Felsenstein. Permission is granted to copy
22	this document provided that no fee is charged for it and that this copyright
23	notice is not removed.
24	<P>
25	The programs FITCH, KITSCH, and NEIGHBOR are for dealing with data which
26	comes in the form of a matrix of pairwise distances between all pairs of taxa,
27	such as distances based on molecular sequence data,
28	gene frequency genetic distances, amounts of DNA hybridization, or
29	immunological distances. In analyzing these
30	data, distance matrix programs implicitly assume that:
31	<OL>
32	<LI>Each distance is measured independently from the others: no item of
33	data contributes to more than one distance.
34	<LI>The distance between each pair of taxa is drawn from a distribution
35	with an expectation which is the sum of values (in effect amounts of evolution)
36	along the tree from one tip to the other. The variance of the distribution is
37	proportional to a power <EM>p</EM> of the expectation.
38	</OL>
39	<P>
40	These assumptions can be traced in the least squares methods of programs
41	FITCH and KITSCH but it is not quite so easy to see them in operation in the
42	Neighbor-Joining method of NEIGHBOR, where the independence assumptions is less
43	obvious.
44	<P>
45	THESE TWO ASSUMPTIONS ARE DUBIOUS IN MOST CASES: independence will not be
46	expected to be true in most kinds of data, such as genetic distances from gene
47	frequency data. For genetic distance data in which pure genetic drift without
48	mutation can be assumed to be the mechanism of change CONTML may be more
49	appropriate. However, FITCH, KITSCH, and NEIGHBOR will not give positively
50	misleading results (they will not make a statistically inconsistent estimate)
51	provided that additivity holds, which it will if the distance is computed from
52	the original data by a method which corrects for reversals and parallelisms in
53	evolution. If additivity is not expected to hold, problems are more severe. A
54	short discussion of these matters will be found in a review article of mine
55	(1984a). For detailed, if sometimes irrelevant, controversy see the papers by
56	Farris (1981, 1985, 1986) and myself (1986, 1988b).
57	<P>
58	For genetic distances from gene frequencies, FITCH, KITSCH, and NEIGHBOR
59	may be appropriate if a neutral mutation model can be assumed and Nei's genetic
60	distance is used, or if pure drift can be assumed and either Cavalli-Sforza's
61	chord measure or Reynolds, Weir, and Cockerham's (1983) genetic distance is
62	used. However, in the latter case (pure drift) CONTML should be better.
63	<P>
64	Restriction site and restriction fragment data can be treated by distance
65	matrix methods if a distance such as that of Nei and Li (1979) is used.
66	Distances of this sort can be computed in PHYLIp by the program RESTDIST.
67	<P>
68	For nucleic acid sequences, the distances computed in DNADIST allow
69	correction for multiple hits (in different ways) and should allow one to
70	analyse the data under the presumption of additivity. <I>In all of these cases
71	independence will not be expected to hold</I>. DNA hybridization and
72	immunological distances may be additive and independent if transformed properly
73	and if (and only if) the standards against which each value is measured are
74	independent. (This is rarely exactly true).
75	<P>
76	FITCH and the Neighbor-Joining option of NEIGHBOR fit a tree which has the
77	branch lengths unconstrained. KITSCH and the UPGMA option of NEIGHBOR, by
78	contrast, assume that an "evolutionary clock" is valid, according to which the
79	true branch lengths from the root of the tree to each tip are the same: the
80	expected amount of evolution in any lineage is proportional to elapsed time.
81	<P>
82	The input format for distance data is straightforward. The first line of
83	the input file contains the number of species. There follows species data,
84	starting, as with all other programs, with a species name. The species name is
85	ten characters long, and must be padded out with blanks if shorter. For each
86	species there then follows a set of distances to all the other species (options
87	selected in the programs' menus
88	allow the distance matrix to be upper or lower triangular or square).
89	The distances can continue to a new line after any of them. If the matrix
90	is lower-triangular, the diagonal entries (the distances from a species to
91	itself) will not be read by the programs. If they are included anyway, they
92	will be ignored by the programs, except for the case where one of them
93	starts a new line, in which case the program will mistake it for a species
94	name and get very confused.
95	<P>
96	For example, here is a sample input matrix, with a square matrix:
97	<P>
98	<TABLE><TR><TD BGCOLOR=white>
99	<PRE>
100	5
101	Alpha 0.000 1.000 2.000 3.000 3.000
102	Beta 1.000 0.000 2.000 3.000 3.000
103	Gamma 2.000 2.000 0.000 3.000 3.000
104	Delta 3.000 3.000 0.000 0.000 1.000
105	Epsilon 3.000 3.000 3.000 1.000 0.000
106	</PRE>
107	</TD></TR></TABLE>
108	<P>
109	and here is a sample lower-triangular input matrix with distances continuing
110	to new lines as needed:
111	<P>
112	<TABLE><TR><TD BGCOLOR=white>
113	<PRE>
114	14
115	Mouse
116	Bovine 1.7043
117	Lemur 2.0235 1.1901
118	Tarsier 2.1378 1.3287 1.2905
119	Squir Monk 1.5232 1.2423 1.3199 1.7878
120	Jpn Macaq 1.8261 1.2508 1.3887 1.3137 1.0642
121	Rhesus Mac 1.9182 1.2536 1.4658 1.3788 1.1124 0.1022
122	Crab-E.Mac 2.0039 1.3066 1.4826 1.3826 0.9832 0.2061 0.2681
123	BarbMacaq 1.9431 1.2827 1.4502 1.4543 1.0629 0.3895 0.3930 0.3665
124	Gibbon 1.9663 1.3296 1.8708 1.6683 0.9228 0.8035 0.7109 0.8132
125	0.7858
126	Orang 2.0593 1.2005 1.5356 1.6606 1.0681 0.7239 0.7290 0.7894
127	0.7140 0.7095
128	Gorilla 1.6664 1.3460 1.4577 1.5935 0.9127 0.7278 0.7412 0.8763
129	0.7966 0.5959 0.4604
130	Chimp 1.7320 1.3757 1.7803 1.7119 1.0635 0.7899 0.8742 0.8868
131	0.8288 0.6213 0.5065 0.3502
132	Human 1.7101 1.3956 1.6661 1.7599 1.0557 0.6933 0.7118 0.7589
133	0.8542 0.5612 0.4700 0.3097 0.2712
134	</PRE>
135	</TD></TR></TABLE>
136	<P>
137	Note that the name "Mouse" in this matrix must be padded out by blanks to the
138	full length of 10 characters.
139	<P>
140	In general the distances are assumed to all be present: at the moment
141	there is only one way we can have missing entries in the distance matrix. If
142	the S option (which allows the user to specify the degree of replication of
143	each distance) is invoked, with some of the entries having degree of
144	replication zero, if the U (User Tree) option is in effect, and if the tree
145	being examined is such that every branch length can be estimated from the data,
146	it will be possible to solve for the branch lengths and sum of squares when
147	there is some missing data. You may not get away with this if the U option is
148	not in effect, as a tree may be tried on which the program will calculate a
149	branch length by dividing zero by zero, and get upset.
150	<P>
151	The present version of NEIGHBOR does allow the Subreplication option to be
152	used and the number of replicates to be in the input file, but it actally does
153	nothing with this information except read it in. It makes use of the average
154	distances in the cells of the input data matrix. This means that you cannot use
155	the S option to treat zero cells. We hope to modify NEIGHBOR in the future to
156	allow Subreplication. Of course the U (User tree) option is not available in
157	NEIGHBOR in any case.
158	<P>
159	The present versions of FITCH and KITSCH will do much better on missing
160	values than did previous versions, but you will still have to be careful about
161	them. Nevertheless you might (just) be able to explore relevant alternative
162	tree topologies one at a time using the U option when there is missing data.
163	<P>
164	Alternatively, if the missing values in one cell always correspond to a
165	cell with non-missing values on the opposite side of the main diagonal (i.e.,
166	if D(i,j) missing implies that D(j,i) is not missing), then use of the S option
167	will always be sufficient to cope with missing values. When it is used, the
168	missing distances should be entered as if present (any number can be
169	used) and the degree of replication for them should be given as 0.
170	<P>
171	Note that the algorithm for searching among topologies in FITCH and KITSCH
172	is the same one used in other programs, so that it is necessary to try
173	different orders of species in the input data. The J (Jumble) menu option may
174	be sufficient for most purposes.
175	<P>
176	The programs FITCH and KITSCH carry out the method of Fitch and Margoliash
177	(1967) for fitting trees to distance matrices. They also are able to carry out
178	the least squares method of Cavalli-Sforza and Edwards (1967), plus a variety
179	of other methods of the same family (see the discussion of the P option below).
180	They can also be set to use the Minimum Evolution method (Nei and Rzhetsky,
181	1993; Kidd and Sgaramella-Zonta, 1971).
182	<P>
183	The objective of these methods is to find that tree which minimizes
184	<P>
185	<PRE>
186	__ __
187	\ \ n<SUB>ij</SUB> ( D<SUB>ij</SUB> - d<SUB>ij</SUB>)<SUP>2</SUP>
188	Sum of squares = /_ /_ ------------------
189	i j D<SUB>ij</SUB><SUP>p</SUP>
190	</PRE>
191	<P>
192	(the symbol made up of \, / and _ characters is of course a summation sign)
193	where <EM>D</EM> is the observed distance between species <EM>i</EM> and
194	<EM>j</EM> and <EM>d</EM> is the expected
195	distance, computed as the sum of the lengths (amounts of evolution) of the
196	segments of the tree from species <EM>i</EM> to species <EM>j</EM>. The
197	quantity <EM>n</EM> is the number
198	of times each distance has been replicated. In simple cases this is taken to
199	be one, but the user can, as an option, specify the degree of replication for
200	each distance. The distance is then assumed to be a mean of those replicates.
201	The power <EM>P</EM> is what distinguished the various methods. For the Fitch-
202	Margoliash method, which is the default method with this program, <EM>P</EM>
203	is 2.0. For the Cavalli-Sforza and Edwards least squares method it should be set to 0
204	(so that the denominator is always 1). An intermediate method is also
205	available in which P is 1.0, and any other value of P, such as 4.0 or -2.3, can
206	also be used. This generates a whole family of methods.
207	<P>
208	The P (Power) option is not available in the Neighbor-Joining program
209	NEIGHBOR. Implicitly, in this program P is 0.0 (though it is hard to prove
210	this). The UPGMA option of NEIGHBOR will assign the same branch lengths to the
211	particular tree topology that it finds as will KITSCH when given the same tree
212	and Power = 0.0.
213	<P>
214	All these methods make the assumptions of additivity and independent
215	errors. The difference between the methods is how they weight departures of
216	observed from expected. In effect, these methods differ in how they assume
217	that the variance of measurement of a distance will rise as a function of the
218	expected value of the distance.
219	<P>
220	These methods assume that the variance of the measurement error is
221	proportional to the <EM>P</EM>-th power of the expectation (hence the standard deviation
222	will be proportional to the <EM>P/2</EM>-th power of the expectation). If
223	you have
224	reason to think that the measurement error of a distance is the same for small
225	distances as it is for large, then you should set <EM>P=0</EM> and use the
226	least squares
227	method, but if you have reason to think that the relative (percentage) error is
228	more nearly constant than the absolute error, you should use <EM>P=2</EM>,
229	the Fitch-Margoliash method. In between, P=1 would be appropriate if the
230	sizes of the errors were proportional to the square roots of the expected
231	distance.
232	<P>
233	One question which arises frequently is what the units of branch length are
234	in the resulting trees. In general, they are not time but units of
235	distance. Thus if two species have a distance 0.3 between them, they will
236	tend to be separated by branches whose total length is about 0.3. In the
237	case of DNA distances, for example, the unit of branch length will be
238	subsxtitutions per base. (In the case of protein distances, it will be
239	amino acid substitutions per amino acid posiiton.
240	tend to be sd
241	<P>
242	<H2>OPTIONS</H2>
243	<P>
244	Here are the options available in all three programs. They are selected using
245	the menu of options.
246	<P>
247	<DL COMPACT>
248	<DT>U</DT> <DD>the User tree option.
249	The trees in FITCH are regarded as unrooted, and are
250	specified with a trifurcation (three-way split) at their base:
251	e. g.:
252	<P>
253	((A,B),C,(D,E));
254	<P>
255	while in KITSCH they are to be regarded as rooted and have a
256	bifurcation at the base:
257	<P>
258	((A,B),(C,(D,E)));
259	<P>
260	Be careful not to move User trees from FITCH to KITSCH without
261	changing their form appropriately (you can use RETREE to do
262	this). User trees are not available in NEIGHBOR. In FITCH if
263	you specify the branch lengths on one or more branches, you can
264	select the L (use branch Lengths) option to avoid having those
265	branches iterated, so that the tree is evaluated with their
266	lengths fixed.
267	</DD>
268	<P>
269	<DT>P</DT> <DD>indicates that you are going to set the Power (P in the above
270	formula). The default value is 2 (the Fitch-Margoliash method).
271	The power, a real number such as 1.0, is prompted for by the
272	programs. This option is not available in NEIGHBOR.</DD>
273	<P>
274	<DT>-</DT> <DD>indicates that negative segment lengths are to be allowed in the
275	tree (default is to require that all branch lengths be
276	nonnegative). This option is not available in NEIGHBOR.</DD>
277	<P>
278	<DT>O</DT> <DD>is the usual Outgroup option, available in FITCH and NEIGHBOR but
279	not in KITSCH, nor when the UPGMA option of NEIGHBOR is used.</DD>
280	<P>
281	<DT>L</DT> <DD>indicates that the distance matrix is input in Lower-triangular
282	form (the lower-left half of the distance matrix only, without
283	the zero diagonal elements).</DD>
284	<P>
285	<DT>R</DT> <DD>indicates that the distance matrix is input in uppeR-triangular
286	form (the upper-right half of the distance matrix only, without
287	the zero diagonal elements).</DD>
288	<P>
289	<DT>S</DT> <DD>is the Subreplication option. It informs the program that after
290	each distance will be provided an integer indicating that the
291	distance is a mean of that many replicates. There is no
292	auxiliary information, but the presence of the S option indicates
293	that the data will be in a different form. Each distance must be
294	followed by an integer indicating the number of replicates, so
295	that a line of data looks like this:
296	<P>
297	<PRE>
298	Delta 3.00 5 3.21 3 1.84 9
299	</PRE>
300	<P>
301	the 5, 3, and 9 being the number of times the measurement was
302	replicated. When the number of replicates is zero, a distance
303	value must still be provided, although its vale will not afect
304	the result. This option is not available in NEIGHBOR.</DD>
305	<P>
306	<DT>G</DT><DD>is the usual Global branch-swapping option. It is available in
307	FITCH and KITSCH but is not relevant to NEIGHBOR.</DD>
308	<P>
309	<DT>J</DT> <DD>indicates the usual J (Jumble) option for entering species in a
310	random order. In FITCH and KITSCH if you do multiple jumbles in
311	one run the program will print out the best tree found overall.</DD>
312	<P>
313	<DT>M</DT> <DD>is the usal Multiple data sets option, available in all of these
314	programs. It allows us (when the output tree file is analyzed in
315	CONSENSE) to do a bootstrap (or delete-half-jackknife) analysis
316	with the distance matrix{ programs.</DD>
317	</DL>
318	<P>
319	The numerical options are the usual ones and should be clear from the
320	menu.
321	<P>
322	Note that when the options L or R are used one of the species, the first
323	or last one, will have its name on an otherwise empty line. Even so, the name
324	should be padded out to full length with blanks. Here is a sample lower-
325	triangular data set.
326	<P>
327	<TABLE><TR><TD BGCOLOR=white>
328	<PRE>
329	5
330	Alpha
331	Beta 1.00
332	Gamma 3.00 3.00
333	Delta 3.00 3.00 2.00
334	Epsilon 3.00 3.00 2.00 1.00
335	</PRE>
336	</TD><TD> <--- note: five blanks should follow the name "Alpha"<BR>
337	<BR><BR><BR></TD></TR></TABLE>
338	<P>
339	Be careful if you are using lower- or upper-triangular trees to make the
340	corresponding selection from the menu (L or R), as the
341	program may get horribly confused otherwise, <I>but it still gives a result even
342	though the result is then meaningless</I>. With the menu option selected all
343	should be well.
344	</BODY>
345	</HTML>

Note: See TracBrowser for help on using the repository browser.

Context Navigation

source: trunk/GDE/PHYLIP/doc/distance.html

Download in other formats: