| 1 | <!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 3.2 Final//EN"> |
|---|
| 2 | <HTML> |
|---|
| 3 | <HEAD> |
|---|
| 4 | <TITLE>distance</TITLE> |
|---|
| 5 | <META NAME="description" CONTENT="distance"> |
|---|
| 6 | <META NAME="keywords" CONTENT="distance"> |
|---|
| 7 | <META NAME="resource-type" CONTENT="document"> |
|---|
| 8 | <META NAME="distribution" CONTENT="global"> |
|---|
| 9 | <META HTTP-EQUIV="Content-Type" CONTENT="text/html; charset=iso-8859-1"> |
|---|
| 10 | </HEAD> |
|---|
| 11 | <BODY BGCOLOR="#ccffff"> |
|---|
| 12 | <DIV ALIGN=RIGHT> |
|---|
| 13 | version 3.6 |
|---|
| 14 | </DIV> |
|---|
| 15 | <P> |
|---|
| 16 | <DIV ALIGN=CENTER> |
|---|
| 17 | <H1>Distance matrix programs</H1> |
|---|
| 18 | </DIV> |
|---|
| 19 | <P> |
|---|
| 20 | © Copyright 1986-2000 by the University of Washington. Written by Joseph |
|---|
| 21 | Felsenstein. Permission is granted to copy |
|---|
| 22 | this document provided that no fee is charged for it and that this copyright |
|---|
| 23 | notice is not removed. |
|---|
| 24 | <P> |
|---|
| 25 | The programs FITCH, KITSCH, and NEIGHBOR are for dealing with data which |
|---|
| 26 | comes in the form of a matrix of pairwise distances between all pairs of taxa, |
|---|
| 27 | such as distances based on molecular sequence data, |
|---|
| 28 | gene frequency genetic distances, amounts of DNA hybridization, or |
|---|
| 29 | immunological distances. In analyzing these |
|---|
| 30 | data, distance matrix programs implicitly assume that: |
|---|
| 31 | <OL> |
|---|
| 32 | <LI>Each distance is measured independently from the others: no item of |
|---|
| 33 | data contributes to more than one distance. |
|---|
| 34 | <LI>The distance between each pair of taxa is drawn from a distribution |
|---|
| 35 | with an expectation which is the sum of values (in effect amounts of evolution) |
|---|
| 36 | along the tree from one tip to the other. The variance of the distribution is |
|---|
| 37 | proportional to a power <EM>p</EM> of the expectation. |
|---|
| 38 | </OL> |
|---|
| 39 | <P> |
|---|
| 40 | These assumptions can be traced in the least squares methods of programs |
|---|
| 41 | FITCH and KITSCH but it is not quite so easy to see them in operation in the |
|---|
| 42 | Neighbor-Joining method of NEIGHBOR, where the independence assumptions is less |
|---|
| 43 | obvious. |
|---|
| 44 | <P> |
|---|
| 45 | THESE TWO ASSUMPTIONS ARE DUBIOUS IN MOST CASES: independence will not be |
|---|
| 46 | expected to be true in most kinds of data, such as genetic distances from gene |
|---|
| 47 | frequency data. For genetic distance data in which pure genetic drift without |
|---|
| 48 | mutation can be assumed to be the mechanism of change CONTML may be more |
|---|
| 49 | appropriate. However, FITCH, KITSCH, and NEIGHBOR will not give positively |
|---|
| 50 | misleading results (they will not make a statistically inconsistent estimate) |
|---|
| 51 | provided that additivity holds, which it will if the distance is computed from |
|---|
| 52 | the original data by a method which corrects for reversals and parallelisms in |
|---|
| 53 | evolution. If additivity is not expected to hold, problems are more severe. A |
|---|
| 54 | short discussion of these matters will be found in a review article of mine |
|---|
| 55 | (1984a). For detailed, if sometimes irrelevant, controversy see the papers by |
|---|
| 56 | Farris (1981, 1985, 1986) and myself (1986, 1988b). |
|---|
| 57 | <P> |
|---|
| 58 | For genetic distances from gene frequencies, FITCH, KITSCH, and NEIGHBOR |
|---|
| 59 | may be appropriate if a neutral mutation model can be assumed and Nei's genetic |
|---|
| 60 | distance is used, or if pure drift can be assumed and either Cavalli-Sforza's |
|---|
| 61 | chord measure or Reynolds, Weir, and Cockerham's (1983) genetic distance is |
|---|
| 62 | used. However, in the latter case (pure drift) CONTML should be better. |
|---|
| 63 | <P> |
|---|
| 64 | Restriction site and restriction fragment data can be treated by distance |
|---|
| 65 | matrix methods if a distance such as that of Nei and Li (1979) is used. |
|---|
| 66 | Distances of this sort can be computed in PHYLIp by the program RESTDIST. |
|---|
| 67 | <P> |
|---|
| 68 | For nucleic acid sequences, the distances computed in DNADIST allow |
|---|
| 69 | correction for multiple hits (in different ways) and should allow one to |
|---|
| 70 | analyse the data under the presumption of additivity. <I>In all of these cases |
|---|
| 71 | independence will not be expected to hold</I>. DNA hybridization and |
|---|
| 72 | immunological distances may be additive and independent if transformed properly |
|---|
| 73 | and if (and only if) the standards against which each value is measured are |
|---|
| 74 | independent. (This is rarely exactly true). |
|---|
| 75 | <P> |
|---|
| 76 | FITCH and the Neighbor-Joining option of NEIGHBOR fit a tree which has the |
|---|
| 77 | branch lengths unconstrained. KITSCH and the UPGMA option of NEIGHBOR, by |
|---|
| 78 | contrast, assume that an "evolutionary clock" is valid, according to which the |
|---|
| 79 | true branch lengths from the root of the tree to each tip are the same: the |
|---|
| 80 | expected amount of evolution in any lineage is proportional to elapsed time. |
|---|
| 81 | <P> |
|---|
| 82 | The input format for distance data is straightforward. The first line of |
|---|
| 83 | the input file contains the number of species. There follows species data, |
|---|
| 84 | starting, as with all other programs, with a species name. The species name is |
|---|
| 85 | ten characters long, and must be padded out with blanks if shorter. For each |
|---|
| 86 | species there then follows a set of distances to all the other species (options |
|---|
| 87 | selected in the programs' menus |
|---|
| 88 | allow the distance matrix to be upper or lower triangular or square). |
|---|
| 89 | The distances can continue to a new line after any of them. If the matrix |
|---|
| 90 | is lower-triangular, the diagonal entries (the distances from a species to |
|---|
| 91 | itself) will not be read by the programs. If they are included anyway, they |
|---|
| 92 | will be ignored by the programs, except for the case where one of them |
|---|
| 93 | starts a new line, in which case the program will mistake it for a species |
|---|
| 94 | name and get very confused. |
|---|
| 95 | <P> |
|---|
| 96 | For example, here is a sample input matrix, with a square matrix: |
|---|
| 97 | <P> |
|---|
| 98 | <TABLE><TR><TD BGCOLOR=white> |
|---|
| 99 | <PRE> |
|---|
| 100 | 5 |
|---|
| 101 | Alpha 0.000 1.000 2.000 3.000 3.000 |
|---|
| 102 | Beta 1.000 0.000 2.000 3.000 3.000 |
|---|
| 103 | Gamma 2.000 2.000 0.000 3.000 3.000 |
|---|
| 104 | Delta 3.000 3.000 0.000 0.000 1.000 |
|---|
| 105 | Epsilon 3.000 3.000 3.000 1.000 0.000 |
|---|
| 106 | </PRE> |
|---|
| 107 | </TD></TR></TABLE> |
|---|
| 108 | <P> |
|---|
| 109 | and here is a sample lower-triangular input matrix with distances continuing |
|---|
| 110 | to new lines as needed: |
|---|
| 111 | <P> |
|---|
| 112 | <TABLE><TR><TD BGCOLOR=white> |
|---|
| 113 | <PRE> |
|---|
| 114 | 14 |
|---|
| 115 | Mouse |
|---|
| 116 | Bovine 1.7043 |
|---|
| 117 | Lemur 2.0235 1.1901 |
|---|
| 118 | Tarsier 2.1378 1.3287 1.2905 |
|---|
| 119 | Squir Monk 1.5232 1.2423 1.3199 1.7878 |
|---|
| 120 | Jpn Macaq 1.8261 1.2508 1.3887 1.3137 1.0642 |
|---|
| 121 | Rhesus Mac 1.9182 1.2536 1.4658 1.3788 1.1124 0.1022 |
|---|
| 122 | Crab-E.Mac 2.0039 1.3066 1.4826 1.3826 0.9832 0.2061 0.2681 |
|---|
| 123 | BarbMacaq 1.9431 1.2827 1.4502 1.4543 1.0629 0.3895 0.3930 0.3665 |
|---|
| 124 | Gibbon 1.9663 1.3296 1.8708 1.6683 0.9228 0.8035 0.7109 0.8132 |
|---|
| 125 | 0.7858 |
|---|
| 126 | Orang 2.0593 1.2005 1.5356 1.6606 1.0681 0.7239 0.7290 0.7894 |
|---|
| 127 | 0.7140 0.7095 |
|---|
| 128 | Gorilla 1.6664 1.3460 1.4577 1.5935 0.9127 0.7278 0.7412 0.8763 |
|---|
| 129 | 0.7966 0.5959 0.4604 |
|---|
| 130 | Chimp 1.7320 1.3757 1.7803 1.7119 1.0635 0.7899 0.8742 0.8868 |
|---|
| 131 | 0.8288 0.6213 0.5065 0.3502 |
|---|
| 132 | Human 1.7101 1.3956 1.6661 1.7599 1.0557 0.6933 0.7118 0.7589 |
|---|
| 133 | 0.8542 0.5612 0.4700 0.3097 0.2712 |
|---|
| 134 | </PRE> |
|---|
| 135 | </TD></TR></TABLE> |
|---|
| 136 | <P> |
|---|
| 137 | Note that the name "Mouse" in this matrix must be padded out by blanks to the |
|---|
| 138 | full length of 10 characters. |
|---|
| 139 | <P> |
|---|
| 140 | In general the distances are assumed to all be present: at the moment |
|---|
| 141 | there is only one way we can have missing entries in the distance matrix. If |
|---|
| 142 | the S option (which allows the user to specify the degree of replication of |
|---|
| 143 | each distance) is invoked, with some of the entries having degree of |
|---|
| 144 | replication zero, if the U (User Tree) option is in effect, and if the tree |
|---|
| 145 | being examined is such that every branch length can be estimated from the data, |
|---|
| 146 | it will be possible to solve for the branch lengths and sum of squares when |
|---|
| 147 | there is some missing data. You may not get away with this if the U option is |
|---|
| 148 | not in effect, as a tree may be tried on which the program will calculate a |
|---|
| 149 | branch length by dividing zero by zero, and get upset. |
|---|
| 150 | <P> |
|---|
| 151 | The present version of NEIGHBOR does allow the Subreplication option to be |
|---|
| 152 | used and the number of replicates to be in the input file, but it actally does |
|---|
| 153 | nothing with this information except read it in. It makes use of the average |
|---|
| 154 | distances in the cells of the input data matrix. This means that you cannot use |
|---|
| 155 | the S option to treat zero cells. We hope to modify NEIGHBOR in the future to |
|---|
| 156 | allow Subreplication. Of course the U (User tree) option is not available in |
|---|
| 157 | NEIGHBOR in any case. |
|---|
| 158 | <P> |
|---|
| 159 | The present versions of FITCH and KITSCH will do much better on missing |
|---|
| 160 | values than did previous versions, but you will still have to be careful about |
|---|
| 161 | them. Nevertheless you might (just) be able to explore relevant alternative |
|---|
| 162 | tree topologies one at a time using the U option when there is missing data. |
|---|
| 163 | <P> |
|---|
| 164 | Alternatively, if the missing values in one cell always correspond to a |
|---|
| 165 | cell with non-missing values on the opposite side of the main diagonal (i.e., |
|---|
| 166 | if D(i,j) missing implies that D(j,i) is not missing), then use of the S option |
|---|
| 167 | will always be sufficient to cope with missing values. When it is used, the |
|---|
| 168 | missing distances should be entered as if present (any number can be |
|---|
| 169 | used) and the degree of replication for them should be given as 0. |
|---|
| 170 | <P> |
|---|
| 171 | Note that the algorithm for searching among topologies in FITCH and KITSCH |
|---|
| 172 | is the same one used in other programs, so that it is necessary to try |
|---|
| 173 | different orders of species in the input data. The J (Jumble) menu option may |
|---|
| 174 | be sufficient for most purposes. |
|---|
| 175 | <P> |
|---|
| 176 | The programs FITCH and KITSCH carry out the method of Fitch and Margoliash |
|---|
| 177 | (1967) for fitting trees to distance matrices. They also are able to carry out |
|---|
| 178 | the least squares method of Cavalli-Sforza and Edwards (1967), plus a variety |
|---|
| 179 | of other methods of the same family (see the discussion of the P option below). |
|---|
| 180 | They can also be set to use the Minimum Evolution method (Nei and Rzhetsky, |
|---|
| 181 | 1993; Kidd and Sgaramella-Zonta, 1971). |
|---|
| 182 | <P> |
|---|
| 183 | The objective of these methods is to find that tree which minimizes |
|---|
| 184 | <P> |
|---|
| 185 | <PRE> |
|---|
| 186 | __ __ |
|---|
| 187 | \ \ n<SUB>ij</SUB> ( D<SUB>ij</SUB> - d<SUB>ij</SUB>)<SUP>2</SUP> |
|---|
| 188 | Sum of squares = /_ /_ ------------------ |
|---|
| 189 | i j D<SUB>ij</SUB><SUP>p</SUP> |
|---|
| 190 | </PRE> |
|---|
| 191 | <P> |
|---|
| 192 | (the symbol made up of \, / and _ characters is of course a summation sign) |
|---|
| 193 | where <EM>D</EM> is the observed distance between species <EM>i</EM> and |
|---|
| 194 | <EM>j</EM> and <EM>d</EM> is the expected |
|---|
| 195 | distance, computed as the sum of the lengths (amounts of evolution) of the |
|---|
| 196 | segments of the tree from species <EM>i</EM> to species <EM>j</EM>. The |
|---|
| 197 | quantity <EM>n</EM> is the number |
|---|
| 198 | of times each distance has been replicated. In simple cases this is taken to |
|---|
| 199 | be one, but the user can, as an option, specify the degree of replication for |
|---|
| 200 | each distance. The distance is then assumed to be a mean of those replicates. |
|---|
| 201 | The power <EM>P</EM> is what distinguished the various methods. For the Fitch- |
|---|
| 202 | Margoliash method, which is the default method with this program, <EM>P</EM> |
|---|
| 203 | is 2.0. For the Cavalli-Sforza and Edwards least squares method it should be set to 0 |
|---|
| 204 | (so that the denominator is always 1). An intermediate method is also |
|---|
| 205 | available in which P is 1.0, and any other value of P, such as 4.0 or -2.3, can |
|---|
| 206 | also be used. This generates a whole family of methods. |
|---|
| 207 | <P> |
|---|
| 208 | The P (Power) option is not available in the Neighbor-Joining program |
|---|
| 209 | NEIGHBOR. Implicitly, in this program P is 0.0 (though it is hard to prove |
|---|
| 210 | this). The UPGMA option of NEIGHBOR will assign the same branch lengths to the |
|---|
| 211 | particular tree topology that it finds as will KITSCH when given the same tree |
|---|
| 212 | and Power = 0.0. |
|---|
| 213 | <P> |
|---|
| 214 | All these methods make the assumptions of additivity and independent |
|---|
| 215 | errors. The difference between the methods is how they weight departures of |
|---|
| 216 | observed from expected. In effect, these methods differ in how they assume |
|---|
| 217 | that the variance of measurement of a distance will rise as a function of the |
|---|
| 218 | expected value of the distance. |
|---|
| 219 | <P> |
|---|
| 220 | These methods assume that the variance of the measurement error is |
|---|
| 221 | proportional to the <EM>P</EM>-th power of the expectation (hence the standard deviation |
|---|
| 222 | will be proportional to the <EM>P/2</EM>-th power of the expectation). If |
|---|
| 223 | you have |
|---|
| 224 | reason to think that the measurement error of a distance is the same for small |
|---|
| 225 | distances as it is for large, then you should set <EM>P=0</EM> and use the |
|---|
| 226 | least squares |
|---|
| 227 | method, but if you have reason to think that the relative (percentage) error is |
|---|
| 228 | more nearly constant than the absolute error, you should use <EM>P=2</EM>, |
|---|
| 229 | the Fitch-Margoliash method. In between, P=1 would be appropriate if the |
|---|
| 230 | sizes of the errors were proportional to the square roots of the expected |
|---|
| 231 | distance. |
|---|
| 232 | <P> |
|---|
| 233 | One question which arises frequently is what the units of branch length are |
|---|
| 234 | in the resulting trees. In general, they are not time but units of |
|---|
| 235 | distance. Thus if two species have a distance 0.3 between them, they will |
|---|
| 236 | tend to be separated by branches whose total length is about 0.3. In the |
|---|
| 237 | case of DNA distances, for example, the unit of branch length will be |
|---|
| 238 | subsxtitutions per base. (In the case of protein distances, it will be |
|---|
| 239 | amino acid substitutions per amino acid posiiton. |
|---|
| 240 | tend to be sd |
|---|
| 241 | <P> |
|---|
| 242 | <H2>OPTIONS</H2> |
|---|
| 243 | <P> |
|---|
| 244 | Here are the options available in all three programs. They are selected using |
|---|
| 245 | the menu of options. |
|---|
| 246 | <P> |
|---|
| 247 | <DL COMPACT> |
|---|
| 248 | <DT>U</DT> <DD>the User tree option. |
|---|
| 249 | The trees in FITCH are regarded as unrooted, and are |
|---|
| 250 | specified with a trifurcation (three-way split) at their base: |
|---|
| 251 | e. g.: |
|---|
| 252 | <P> |
|---|
| 253 | ((A,B),C,(D,E)); |
|---|
| 254 | <P> |
|---|
| 255 | while in KITSCH they are to be regarded as rooted and have a |
|---|
| 256 | bifurcation at the base: |
|---|
| 257 | <P> |
|---|
| 258 | ((A,B),(C,(D,E))); |
|---|
| 259 | <P> |
|---|
| 260 | Be careful not to move User trees from FITCH to KITSCH without |
|---|
| 261 | changing their form appropriately (you can use RETREE to do |
|---|
| 262 | this). User trees are not available in NEIGHBOR. In FITCH if |
|---|
| 263 | you specify the branch lengths on one or more branches, you can |
|---|
| 264 | select the L (use branch Lengths) option to avoid having those |
|---|
| 265 | branches iterated, so that the tree is evaluated with their |
|---|
| 266 | lengths fixed. |
|---|
| 267 | </DD> |
|---|
| 268 | <P> |
|---|
| 269 | <DT>P</DT> <DD>indicates that you are going to set the Power (P in the above |
|---|
| 270 | formula). The default value is 2 (the Fitch-Margoliash method). |
|---|
| 271 | The power, a real number such as 1.0, is prompted for by the |
|---|
| 272 | programs. This option is not available in NEIGHBOR.</DD> |
|---|
| 273 | <P> |
|---|
| 274 | <DT>-</DT> <DD>indicates that negative segment lengths are to be allowed in the |
|---|
| 275 | tree (default is to require that all branch lengths be |
|---|
| 276 | nonnegative). This option is not available in NEIGHBOR.</DD> |
|---|
| 277 | <P> |
|---|
| 278 | <DT>O</DT> <DD>is the usual Outgroup option, available in FITCH and NEIGHBOR but |
|---|
| 279 | not in KITSCH, nor when the UPGMA option of NEIGHBOR is used.</DD> |
|---|
| 280 | <P> |
|---|
| 281 | <DT>L</DT> <DD>indicates that the distance matrix is input in Lower-triangular |
|---|
| 282 | form (the lower-left half of the distance matrix only, without |
|---|
| 283 | the zero diagonal elements).</DD> |
|---|
| 284 | <P> |
|---|
| 285 | <DT>R</DT> <DD>indicates that the distance matrix is input in uppeR-triangular |
|---|
| 286 | form (the upper-right half of the distance matrix only, without |
|---|
| 287 | the zero diagonal elements).</DD> |
|---|
| 288 | <P> |
|---|
| 289 | <DT>S</DT> <DD>is the Subreplication option. It informs the program that after |
|---|
| 290 | each distance will be provided an integer indicating that the |
|---|
| 291 | distance is a mean of that many replicates. There is no |
|---|
| 292 | auxiliary information, but the presence of the S option indicates |
|---|
| 293 | that the data will be in a different form. Each distance must be |
|---|
| 294 | followed by an integer indicating the number of replicates, so |
|---|
| 295 | that a line of data looks like this: |
|---|
| 296 | <P> |
|---|
| 297 | <PRE> |
|---|
| 298 | Delta 3.00 5 3.21 3 1.84 9 |
|---|
| 299 | </PRE> |
|---|
| 300 | <P> |
|---|
| 301 | the 5, 3, and 9 being the number of times the measurement was |
|---|
| 302 | replicated. When the number of replicates is zero, a distance |
|---|
| 303 | value must still be provided, although its vale will not afect |
|---|
| 304 | the result. This option is not available in NEIGHBOR.</DD> |
|---|
| 305 | <P> |
|---|
| 306 | <DT>G</DT><DD>is the usual Global branch-swapping option. It is available in |
|---|
| 307 | FITCH and KITSCH but is not relevant to NEIGHBOR.</DD> |
|---|
| 308 | <P> |
|---|
| 309 | <DT>J</DT> <DD>indicates the usual J (Jumble) option for entering species in a |
|---|
| 310 | random order. In FITCH and KITSCH if you do multiple jumbles in |
|---|
| 311 | one run the program will print out the best tree found overall.</DD> |
|---|
| 312 | <P> |
|---|
| 313 | <DT>M</DT> <DD>is the usal Multiple data sets option, available in all of these |
|---|
| 314 | programs. It allows us (when the output tree file is analyzed in |
|---|
| 315 | CONSENSE) to do a bootstrap (or delete-half-jackknife) analysis |
|---|
| 316 | with the distance matrix{ programs.</DD> |
|---|
| 317 | </DL> |
|---|
| 318 | <P> |
|---|
| 319 | The numerical options are the usual ones and should be clear from the |
|---|
| 320 | menu. |
|---|
| 321 | <P> |
|---|
| 322 | Note that when the options L or R are used one of the species, the first |
|---|
| 323 | or last one, will have its name on an otherwise empty line. Even so, the name |
|---|
| 324 | should be padded out to full length with blanks. Here is a sample lower- |
|---|
| 325 | triangular data set. |
|---|
| 326 | <P> |
|---|
| 327 | <TABLE><TR><TD BGCOLOR=white> |
|---|
| 328 | <PRE> |
|---|
| 329 | 5 |
|---|
| 330 | Alpha |
|---|
| 331 | Beta 1.00 |
|---|
| 332 | Gamma 3.00 3.00 |
|---|
| 333 | Delta 3.00 3.00 2.00 |
|---|
| 334 | Epsilon 3.00 3.00 2.00 1.00 |
|---|
| 335 | </PRE> |
|---|
| 336 | </TD><TD> <--- note: five blanks should follow the name "Alpha"<BR> |
|---|
| 337 | <BR><BR><BR></TD></TR></TABLE> |
|---|
| 338 | <P> |
|---|
| 339 | Be careful if you are using lower- or upper-triangular trees to make the |
|---|
| 340 | corresponding selection from the menu (L or R), as the |
|---|
| 341 | program may get horribly confused otherwise, <I>but it still gives a result even |
|---|
| 342 | though the result is then meaningless</I>. With the menu option selected all |
|---|
| 343 | should be well. |
|---|
| 344 | </BODY> |
|---|
| 345 | </HTML> |
|---|