1 | <!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 3.2 Final//EN"> |
---|
2 | <HTML> |
---|
3 | <HEAD> |
---|
4 | <TITLE>distance</TITLE> |
---|
5 | <META NAME="description" CONTENT="distance"> |
---|
6 | <META NAME="keywords" CONTENT="distance"> |
---|
7 | <META NAME="resource-type" CONTENT="document"> |
---|
8 | <META NAME="distribution" CONTENT="global"> |
---|
9 | <META HTTP-EQUIV="Content-Type" CONTENT="text/html; charset=iso-8859-1"> |
---|
10 | </HEAD> |
---|
11 | <BODY BGCOLOR="#ccffff"> |
---|
12 | <DIV ALIGN=RIGHT> |
---|
13 | version 3.6 |
---|
14 | </DIV> |
---|
15 | <P> |
---|
16 | <DIV ALIGN=CENTER> |
---|
17 | <H1>Distance matrix programs</H1> |
---|
18 | </DIV> |
---|
19 | <P> |
---|
20 | © Copyright 1986-2000 by the University of Washington. Written by Joseph |
---|
21 | Felsenstein. Permission is granted to copy |
---|
22 | this document provided that no fee is charged for it and that this copyright |
---|
23 | notice is not removed. |
---|
24 | <P> |
---|
25 | The programs FITCH, KITSCH, and NEIGHBOR are for dealing with data which |
---|
26 | comes in the form of a matrix of pairwise distances between all pairs of taxa, |
---|
27 | such as distances based on molecular sequence data, |
---|
28 | gene frequency genetic distances, amounts of DNA hybridization, or |
---|
29 | immunological distances. In analyzing these |
---|
30 | data, distance matrix programs implicitly assume that: |
---|
31 | <OL> |
---|
32 | <LI>Each distance is measured independently from the others: no item of |
---|
33 | data contributes to more than one distance. |
---|
34 | <LI>The distance between each pair of taxa is drawn from a distribution |
---|
35 | with an expectation which is the sum of values (in effect amounts of evolution) |
---|
36 | along the tree from one tip to the other. The variance of the distribution is |
---|
37 | proportional to a power <EM>p</EM> of the expectation. |
---|
38 | </OL> |
---|
39 | <P> |
---|
40 | These assumptions can be traced in the least squares methods of programs |
---|
41 | FITCH and KITSCH but it is not quite so easy to see them in operation in the |
---|
42 | Neighbor-Joining method of NEIGHBOR, where the independence assumptions is less |
---|
43 | obvious. |
---|
44 | <P> |
---|
45 | THESE TWO ASSUMPTIONS ARE DUBIOUS IN MOST CASES: independence will not be |
---|
46 | expected to be true in most kinds of data, such as genetic distances from gene |
---|
47 | frequency data. For genetic distance data in which pure genetic drift without |
---|
48 | mutation can be assumed to be the mechanism of change CONTML may be more |
---|
49 | appropriate. However, FITCH, KITSCH, and NEIGHBOR will not give positively |
---|
50 | misleading results (they will not make a statistically inconsistent estimate) |
---|
51 | provided that additivity holds, which it will if the distance is computed from |
---|
52 | the original data by a method which corrects for reversals and parallelisms in |
---|
53 | evolution. If additivity is not expected to hold, problems are more severe. A |
---|
54 | short discussion of these matters will be found in a review article of mine |
---|
55 | (1984a). For detailed, if sometimes irrelevant, controversy see the papers by |
---|
56 | Farris (1981, 1985, 1986) and myself (1986, 1988b). |
---|
57 | <P> |
---|
58 | For genetic distances from gene frequencies, FITCH, KITSCH, and NEIGHBOR |
---|
59 | may be appropriate if a neutral mutation model can be assumed and Nei's genetic |
---|
60 | distance is used, or if pure drift can be assumed and either Cavalli-Sforza's |
---|
61 | chord measure or Reynolds, Weir, and Cockerham's (1983) genetic distance is |
---|
62 | used. However, in the latter case (pure drift) CONTML should be better. |
---|
63 | <P> |
---|
64 | Restriction site and restriction fragment data can be treated by distance |
---|
65 | matrix methods if a distance such as that of Nei and Li (1979) is used. |
---|
66 | Distances of this sort can be computed in PHYLIp by the program RESTDIST. |
---|
67 | <P> |
---|
68 | For nucleic acid sequences, the distances computed in DNADIST allow |
---|
69 | correction for multiple hits (in different ways) and should allow one to |
---|
70 | analyse the data under the presumption of additivity. <I>In all of these cases |
---|
71 | independence will not be expected to hold</I>. DNA hybridization and |
---|
72 | immunological distances may be additive and independent if transformed properly |
---|
73 | and if (and only if) the standards against which each value is measured are |
---|
74 | independent. (This is rarely exactly true). |
---|
75 | <P> |
---|
76 | FITCH and the Neighbor-Joining option of NEIGHBOR fit a tree which has the |
---|
77 | branch lengths unconstrained. KITSCH and the UPGMA option of NEIGHBOR, by |
---|
78 | contrast, assume that an "evolutionary clock" is valid, according to which the |
---|
79 | true branch lengths from the root of the tree to each tip are the same: the |
---|
80 | expected amount of evolution in any lineage is proportional to elapsed time. |
---|
81 | <P> |
---|
82 | The input format for distance data is straightforward. The first line of |
---|
83 | the input file contains the number of species. There follows species data, |
---|
84 | starting, as with all other programs, with a species name. The species name is |
---|
85 | ten characters long, and must be padded out with blanks if shorter. For each |
---|
86 | species there then follows a set of distances to all the other species (options |
---|
87 | selected in the programs' menus |
---|
88 | allow the distance matrix to be upper or lower triangular or square). |
---|
89 | The distances can continue to a new line after any of them. If the matrix |
---|
90 | is lower-triangular, the diagonal entries (the distances from a species to |
---|
91 | itself) will not be read by the programs. If they are included anyway, they |
---|
92 | will be ignored by the programs, except for the case where one of them |
---|
93 | starts a new line, in which case the program will mistake it for a species |
---|
94 | name and get very confused. |
---|
95 | <P> |
---|
96 | For example, here is a sample input matrix, with a square matrix: |
---|
97 | <P> |
---|
98 | <TABLE><TR><TD BGCOLOR=white> |
---|
99 | <PRE> |
---|
100 | 5 |
---|
101 | Alpha 0.000 1.000 2.000 3.000 3.000 |
---|
102 | Beta 1.000 0.000 2.000 3.000 3.000 |
---|
103 | Gamma 2.000 2.000 0.000 3.000 3.000 |
---|
104 | Delta 3.000 3.000 0.000 0.000 1.000 |
---|
105 | Epsilon 3.000 3.000 3.000 1.000 0.000 |
---|
106 | </PRE> |
---|
107 | </TD></TR></TABLE> |
---|
108 | <P> |
---|
109 | and here is a sample lower-triangular input matrix with distances continuing |
---|
110 | to new lines as needed: |
---|
111 | <P> |
---|
112 | <TABLE><TR><TD BGCOLOR=white> |
---|
113 | <PRE> |
---|
114 | 14 |
---|
115 | Mouse |
---|
116 | Bovine 1.7043 |
---|
117 | Lemur 2.0235 1.1901 |
---|
118 | Tarsier 2.1378 1.3287 1.2905 |
---|
119 | Squir Monk 1.5232 1.2423 1.3199 1.7878 |
---|
120 | Jpn Macaq 1.8261 1.2508 1.3887 1.3137 1.0642 |
---|
121 | Rhesus Mac 1.9182 1.2536 1.4658 1.3788 1.1124 0.1022 |
---|
122 | Crab-E.Mac 2.0039 1.3066 1.4826 1.3826 0.9832 0.2061 0.2681 |
---|
123 | BarbMacaq 1.9431 1.2827 1.4502 1.4543 1.0629 0.3895 0.3930 0.3665 |
---|
124 | Gibbon 1.9663 1.3296 1.8708 1.6683 0.9228 0.8035 0.7109 0.8132 |
---|
125 | 0.7858 |
---|
126 | Orang 2.0593 1.2005 1.5356 1.6606 1.0681 0.7239 0.7290 0.7894 |
---|
127 | 0.7140 0.7095 |
---|
128 | Gorilla 1.6664 1.3460 1.4577 1.5935 0.9127 0.7278 0.7412 0.8763 |
---|
129 | 0.7966 0.5959 0.4604 |
---|
130 | Chimp 1.7320 1.3757 1.7803 1.7119 1.0635 0.7899 0.8742 0.8868 |
---|
131 | 0.8288 0.6213 0.5065 0.3502 |
---|
132 | Human 1.7101 1.3956 1.6661 1.7599 1.0557 0.6933 0.7118 0.7589 |
---|
133 | 0.8542 0.5612 0.4700 0.3097 0.2712 |
---|
134 | </PRE> |
---|
135 | </TD></TR></TABLE> |
---|
136 | <P> |
---|
137 | Note that the name "Mouse" in this matrix must be padded out by blanks to the |
---|
138 | full length of 10 characters. |
---|
139 | <P> |
---|
140 | In general the distances are assumed to all be present: at the moment |
---|
141 | there is only one way we can have missing entries in the distance matrix. If |
---|
142 | the S option (which allows the user to specify the degree of replication of |
---|
143 | each distance) is invoked, with some of the entries having degree of |
---|
144 | replication zero, if the U (User Tree) option is in effect, and if the tree |
---|
145 | being examined is such that every branch length can be estimated from the data, |
---|
146 | it will be possible to solve for the branch lengths and sum of squares when |
---|
147 | there is some missing data. You may not get away with this if the U option is |
---|
148 | not in effect, as a tree may be tried on which the program will calculate a |
---|
149 | branch length by dividing zero by zero, and get upset. |
---|
150 | <P> |
---|
151 | The present version of NEIGHBOR does allow the Subreplication option to be |
---|
152 | used and the number of replicates to be in the input file, but it actally does |
---|
153 | nothing with this information except read it in. It makes use of the average |
---|
154 | distances in the cells of the input data matrix. This means that you cannot use |
---|
155 | the S option to treat zero cells. We hope to modify NEIGHBOR in the future to |
---|
156 | allow Subreplication. Of course the U (User tree) option is not available in |
---|
157 | NEIGHBOR in any case. |
---|
158 | <P> |
---|
159 | The present versions of FITCH and KITSCH will do much better on missing |
---|
160 | values than did previous versions, but you will still have to be careful about |
---|
161 | them. Nevertheless you might (just) be able to explore relevant alternative |
---|
162 | tree topologies one at a time using the U option when there is missing data. |
---|
163 | <P> |
---|
164 | Alternatively, if the missing values in one cell always correspond to a |
---|
165 | cell with non-missing values on the opposite side of the main diagonal (i.e., |
---|
166 | if D(i,j) missing implies that D(j,i) is not missing), then use of the S option |
---|
167 | will always be sufficient to cope with missing values. When it is used, the |
---|
168 | missing distances should be entered as if present (any number can be |
---|
169 | used) and the degree of replication for them should be given as 0. |
---|
170 | <P> |
---|
171 | Note that the algorithm for searching among topologies in FITCH and KITSCH |
---|
172 | is the same one used in other programs, so that it is necessary to try |
---|
173 | different orders of species in the input data. The J (Jumble) menu option may |
---|
174 | be sufficient for most purposes. |
---|
175 | <P> |
---|
176 | The programs FITCH and KITSCH carry out the method of Fitch and Margoliash |
---|
177 | (1967) for fitting trees to distance matrices. They also are able to carry out |
---|
178 | the least squares method of Cavalli-Sforza and Edwards (1967), plus a variety |
---|
179 | of other methods of the same family (see the discussion of the P option below). |
---|
180 | They can also be set to use the Minimum Evolution method (Nei and Rzhetsky, |
---|
181 | 1993; Kidd and Sgaramella-Zonta, 1971). |
---|
182 | <P> |
---|
183 | The objective of these methods is to find that tree which minimizes |
---|
184 | <P> |
---|
185 | <PRE> |
---|
186 | __ __ |
---|
187 | \ \ n<SUB>ij</SUB> ( D<SUB>ij</SUB> - d<SUB>ij</SUB>)<SUP>2</SUP> |
---|
188 | Sum of squares = /_ /_ ------------------ |
---|
189 | i j D<SUB>ij</SUB><SUP>p</SUP> |
---|
190 | </PRE> |
---|
191 | <P> |
---|
192 | (the symbol made up of \, / and _ characters is of course a summation sign) |
---|
193 | where <EM>D</EM> is the observed distance between species <EM>i</EM> and |
---|
194 | <EM>j</EM> and <EM>d</EM> is the expected |
---|
195 | distance, computed as the sum of the lengths (amounts of evolution) of the |
---|
196 | segments of the tree from species <EM>i</EM> to species <EM>j</EM>. The |
---|
197 | quantity <EM>n</EM> is the number |
---|
198 | of times each distance has been replicated. In simple cases this is taken to |
---|
199 | be one, but the user can, as an option, specify the degree of replication for |
---|
200 | each distance. The distance is then assumed to be a mean of those replicates. |
---|
201 | The power <EM>P</EM> is what distinguished the various methods. For the Fitch- |
---|
202 | Margoliash method, which is the default method with this program, <EM>P</EM> |
---|
203 | is 2.0. For the Cavalli-Sforza and Edwards least squares method it should be set to 0 |
---|
204 | (so that the denominator is always 1). An intermediate method is also |
---|
205 | available in which P is 1.0, and any other value of P, such as 4.0 or -2.3, can |
---|
206 | also be used. This generates a whole family of methods. |
---|
207 | <P> |
---|
208 | The P (Power) option is not available in the Neighbor-Joining program |
---|
209 | NEIGHBOR. Implicitly, in this program P is 0.0 (though it is hard to prove |
---|
210 | this). The UPGMA option of NEIGHBOR will assign the same branch lengths to the |
---|
211 | particular tree topology that it finds as will KITSCH when given the same tree |
---|
212 | and Power = 0.0. |
---|
213 | <P> |
---|
214 | All these methods make the assumptions of additivity and independent |
---|
215 | errors. The difference between the methods is how they weight departures of |
---|
216 | observed from expected. In effect, these methods differ in how they assume |
---|
217 | that the variance of measurement of a distance will rise as a function of the |
---|
218 | expected value of the distance. |
---|
219 | <P> |
---|
220 | These methods assume that the variance of the measurement error is |
---|
221 | proportional to the <EM>P</EM>-th power of the expectation (hence the standard deviation |
---|
222 | will be proportional to the <EM>P/2</EM>-th power of the expectation). If |
---|
223 | you have |
---|
224 | reason to think that the measurement error of a distance is the same for small |
---|
225 | distances as it is for large, then you should set <EM>P=0</EM> and use the |
---|
226 | least squares |
---|
227 | method, but if you have reason to think that the relative (percentage) error is |
---|
228 | more nearly constant than the absolute error, you should use <EM>P=2</EM>, |
---|
229 | the Fitch-Margoliash method. In between, P=1 would be appropriate if the |
---|
230 | sizes of the errors were proportional to the square roots of the expected |
---|
231 | distance. |
---|
232 | <P> |
---|
233 | One question which arises frequently is what the units of branch length are |
---|
234 | in the resulting trees. In general, they are not time but units of |
---|
235 | distance. Thus if two species have a distance 0.3 between them, they will |
---|
236 | tend to be separated by branches whose total length is about 0.3. In the |
---|
237 | case of DNA distances, for example, the unit of branch length will be |
---|
238 | subsxtitutions per base. (In the case of protein distances, it will be |
---|
239 | amino acid substitutions per amino acid posiiton. |
---|
240 | tend to be sd |
---|
241 | <P> |
---|
242 | <H2>OPTIONS</H2> |
---|
243 | <P> |
---|
244 | Here are the options available in all three programs. They are selected using |
---|
245 | the menu of options. |
---|
246 | <P> |
---|
247 | <DL COMPACT> |
---|
248 | <DT>U</DT> <DD>the User tree option. |
---|
249 | The trees in FITCH are regarded as unrooted, and are |
---|
250 | specified with a trifurcation (three-way split) at their base: |
---|
251 | e. g.: |
---|
252 | <P> |
---|
253 | ((A,B),C,(D,E)); |
---|
254 | <P> |
---|
255 | while in KITSCH they are to be regarded as rooted and have a |
---|
256 | bifurcation at the base: |
---|
257 | <P> |
---|
258 | ((A,B),(C,(D,E))); |
---|
259 | <P> |
---|
260 | Be careful not to move User trees from FITCH to KITSCH without |
---|
261 | changing their form appropriately (you can use RETREE to do |
---|
262 | this). User trees are not available in NEIGHBOR. In FITCH if |
---|
263 | you specify the branch lengths on one or more branches, you can |
---|
264 | select the L (use branch Lengths) option to avoid having those |
---|
265 | branches iterated, so that the tree is evaluated with their |
---|
266 | lengths fixed. |
---|
267 | </DD> |
---|
268 | <P> |
---|
269 | <DT>P</DT> <DD>indicates that you are going to set the Power (P in the above |
---|
270 | formula). The default value is 2 (the Fitch-Margoliash method). |
---|
271 | The power, a real number such as 1.0, is prompted for by the |
---|
272 | programs. This option is not available in NEIGHBOR.</DD> |
---|
273 | <P> |
---|
274 | <DT>-</DT> <DD>indicates that negative segment lengths are to be allowed in the |
---|
275 | tree (default is to require that all branch lengths be |
---|
276 | nonnegative). This option is not available in NEIGHBOR.</DD> |
---|
277 | <P> |
---|
278 | <DT>O</DT> <DD>is the usual Outgroup option, available in FITCH and NEIGHBOR but |
---|
279 | not in KITSCH, nor when the UPGMA option of NEIGHBOR is used.</DD> |
---|
280 | <P> |
---|
281 | <DT>L</DT> <DD>indicates that the distance matrix is input in Lower-triangular |
---|
282 | form (the lower-left half of the distance matrix only, without |
---|
283 | the zero diagonal elements).</DD> |
---|
284 | <P> |
---|
285 | <DT>R</DT> <DD>indicates that the distance matrix is input in uppeR-triangular |
---|
286 | form (the upper-right half of the distance matrix only, without |
---|
287 | the zero diagonal elements).</DD> |
---|
288 | <P> |
---|
289 | <DT>S</DT> <DD>is the Subreplication option. It informs the program that after |
---|
290 | each distance will be provided an integer indicating that the |
---|
291 | distance is a mean of that many replicates. There is no |
---|
292 | auxiliary information, but the presence of the S option indicates |
---|
293 | that the data will be in a different form. Each distance must be |
---|
294 | followed by an integer indicating the number of replicates, so |
---|
295 | that a line of data looks like this: |
---|
296 | <P> |
---|
297 | <PRE> |
---|
298 | Delta 3.00 5 3.21 3 1.84 9 |
---|
299 | </PRE> |
---|
300 | <P> |
---|
301 | the 5, 3, and 9 being the number of times the measurement was |
---|
302 | replicated. When the number of replicates is zero, a distance |
---|
303 | value must still be provided, although its vale will not afect |
---|
304 | the result. This option is not available in NEIGHBOR.</DD> |
---|
305 | <P> |
---|
306 | <DT>G</DT><DD>is the usual Global branch-swapping option. It is available in |
---|
307 | FITCH and KITSCH but is not relevant to NEIGHBOR.</DD> |
---|
308 | <P> |
---|
309 | <DT>J</DT> <DD>indicates the usual J (Jumble) option for entering species in a |
---|
310 | random order. In FITCH and KITSCH if you do multiple jumbles in |
---|
311 | one run the program will print out the best tree found overall.</DD> |
---|
312 | <P> |
---|
313 | <DT>M</DT> <DD>is the usal Multiple data sets option, available in all of these |
---|
314 | programs. It allows us (when the output tree file is analyzed in |
---|
315 | CONSENSE) to do a bootstrap (or delete-half-jackknife) analysis |
---|
316 | with the distance matrix{ programs.</DD> |
---|
317 | </DL> |
---|
318 | <P> |
---|
319 | The numerical options are the usual ones and should be clear from the |
---|
320 | menu. |
---|
321 | <P> |
---|
322 | Note that when the options L or R are used one of the species, the first |
---|
323 | or last one, will have its name on an otherwise empty line. Even so, the name |
---|
324 | should be padded out to full length with blanks. Here is a sample lower- |
---|
325 | triangular data set. |
---|
326 | <P> |
---|
327 | <TABLE><TR><TD BGCOLOR=white> |
---|
328 | <PRE> |
---|
329 | 5 |
---|
330 | Alpha |
---|
331 | Beta 1.00 |
---|
332 | Gamma 3.00 3.00 |
---|
333 | Delta 3.00 3.00 2.00 |
---|
334 | Epsilon 3.00 3.00 2.00 1.00 |
---|
335 | </PRE> |
---|
336 | </TD><TD> <--- note: five blanks should follow the name "Alpha"<BR> |
---|
337 | <BR><BR><BR></TD></TR></TABLE> |
---|
338 | <P> |
---|
339 | Be careful if you are using lower- or upper-triangular trees to make the |
---|
340 | corresponding selection from the menu (L or R), as the |
---|
341 | program may get horribly confused otherwise, <I>but it still gives a result even |
---|
342 | though the result is then meaningless</I>. With the menu option selected all |
---|
343 | should be well. |
---|
344 | </BODY> |
---|
345 | </HTML> |
---|