[2176] | 1 | <!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 3.2 Final//EN"> |
---|
| 2 | <HTML> |
---|
| 3 | <HEAD> |
---|
| 4 | <TITLE>restdist</TITLE> |
---|
| 5 | <META NAME="description" CONTENT="restdist"> |
---|
| 6 | <META NAME="keywords" CONTENT="restdist"> |
---|
| 7 | <META NAME="resource-type" CONTENT="document"> |
---|
| 8 | <META NAME="distribution" CONTENT="global"> |
---|
| 9 | <META HTTP-EQUIV="Content-Type" CONTENT="text/html; charset=iso-8859-1"> |
---|
| 10 | </HEAD> |
---|
| 11 | <BODY BGCOLOR="#ccffff"> |
---|
| 12 | <DIV ALIGN=RIGHT> |
---|
| 13 | version 3.6 |
---|
| 14 | </DIV> |
---|
| 15 | <P> |
---|
| 16 | <DIV ALIGN=CENTER> |
---|
| 17 | <H1>RESTDIST -- Program to compute distance matrix<BR>from restriction sites or fragments</H1> |
---|
| 18 | </DIV> |
---|
| 19 | <P> |
---|
| 20 | © Copyright 2000-2002 by the University of |
---|
| 21 | Washington. Written by Joseph Felsenstein. Permission is granted to copy |
---|
| 22 | this document provided that no fee is charged for it and that this copyright |
---|
| 23 | notice is not removed. |
---|
| 24 | <P> |
---|
| 25 | Restdist reads the same restriction sites format as RESTML and |
---|
| 26 | computes a restriction sites distance. It can also compute a restriction |
---|
| 27 | fragments distance. The original restriction fragments and restriction |
---|
| 28 | sites distance methods were introduced by Nei and Li (1979). Their original |
---|
| 29 | <! ???? CHANGE NEXT LINE TO methods are WHEN NEI/LI R.S. DISTANCE INCLUDED > |
---|
| 30 | method for restriction fragments is |
---|
| 31 | also available in this program, although its default methods |
---|
| 32 | are my modifications of the original Nei and Li methods. |
---|
| 33 | <P> |
---|
| 34 | These two distances assume that the restriction sites are accidental byproducts |
---|
| 35 | of random change of nucleotide sequences. For my restriction sites distance |
---|
| 36 | the DNA sequences are assumed to be changing according to the Kimura |
---|
| 37 | 2-parameter model of DNA change (Kimura, 1980). The user can set the |
---|
| 38 | transition/transversion rate for the model. For my restriction fragments |
---|
| 39 | distance there is |
---|
| 40 | there is an implicit assumption of a Jukes-Cantor (1969) model of change, |
---|
| 41 | The user can also set the |
---|
| 42 | parameter of a correction for unequal rates of evolution between sites in |
---|
| 43 | the DNA sequences, using a Gamma distribution of rates among sites. |
---|
| 44 | The Jukes-Cantor model is also implicit in the restriction fragments |
---|
| 45 | distance of Nei and Li(1979). It |
---|
| 46 | does not allow us to correct for a Gamma distribution of rates among |
---|
| 47 | sites. |
---|
| 48 | <P> |
---|
| 49 | <H2>Restriction Sites Distance</H2> |
---|
| 50 | <P> |
---|
| 51 | The restriction sites distances use data coded for the presence of absence of |
---|
| 52 | individual restriction sites (usually as + and - or 0 and 1). My |
---|
| 53 | distance is based on the proportion, out of all sites observed in one species |
---|
| 54 | or the other, which are present in both species. This is done to correct for |
---|
| 55 | the ascertainment of sites, for the fact that we are not aware of many sites |
---|
| 56 | because they do not appear in any species. |
---|
| 57 | <P> |
---|
| 58 | My distance starts by computing from the particular pair of species the fraction |
---|
| 59 | <PRE> |
---|
| 60 | n<SUB>++</SUB> |
---|
| 61 | f = --------------------- |
---|
| 62 | n<SUB>++</SUB> + <SUP>1</SUP>/<SUB>2</SUB> (n<SUB>+- </SUB>+ n<SUB>-+</SUB>) |
---|
| 63 | </PRE> |
---|
| 64 | where <I>n<SUB>++</SUB></I> is the number of sites contained in both species, |
---|
| 65 | <I>n<SUB>+-</SUB></I> is the number of sites contained in the first of the |
---|
| 66 | two species but not in the second, and <I>n<SUB>-+</SUB></I> is the |
---|
| 67 | number of sites contained in the second of the two species but not in the |
---|
| 68 | first. This is the fraction of sites that are present in one species which are |
---|
| 69 | present in both. Since the number of sites present in the two species will |
---|
| 70 | often differ, the denominator is the average of the number of sites found in |
---|
| 71 | the two species. |
---|
| 72 | <P> |
---|
| 73 | If each restriction site is <I>s</I> nucleotides long, the probability |
---|
| 74 | that a restriction site is present in the other species, given that it is |
---|
| 75 | present in a species, is |
---|
| 76 | <PRE> |
---|
| 77 | Q<SUP>s</SUP>, |
---|
| 78 | </PRE> |
---|
| 79 | where <I>Q</I> is the probability that a nucleotide has no net change as one |
---|
| 80 | goes from the one species to the other. It may have changed in between; we |
---|
| 81 | are interested in the probability that that nucleotide site is in the same base |
---|
| 82 | in both species, irrespective of what has happened in between. |
---|
| 83 | <P> |
---|
| 84 | The distance is then computed by finding the branch length of a two-species |
---|
| 85 | tree (connecting these two species with a single branch) such |
---|
| 86 | that <I>Q</I> equals the <I>s</I>-th root of <I>f</I>. For this the |
---|
| 87 | program computes <I>Q</I> for various values of branch length, iterating them |
---|
| 88 | by a Newton-Raphson algorithm until the two quantities are equal. |
---|
| 89 | <P> |
---|
| 90 | The resulting distance should be numerically close to the original |
---|
| 91 | restriction sites distance of Nei and Li (1979). It is inspired by theirs, |
---|
| 92 | but theirs differs by implicitly assuming a symmetric Jukes-Cantor (1969) |
---|
| 93 | model of nucleotide change, and theirs does not include a correction for |
---|
| 94 | Gamma distribution of rate of change among nucleotide sites. |
---|
| 95 | <P> |
---|
| 96 | <H2>Restriction Fragments Distance</H2> |
---|
| 97 | <P> |
---|
| 98 | For restriction fragments data we use a different distance. If we |
---|
| 99 | average over all restriction fragment lengths, each at its own |
---|
| 100 | expected frequency, the probability that the fragment will still be |
---|
| 101 | in existence after a certain amount of branch length, we must take into |
---|
| 102 | account the probability that the two restriction sites at the ends of |
---|
| 103 | the fragment do not mutate, and the probability that no new |
---|
| 104 | restriction site occurs within the fragment in that amount of branch |
---|
| 105 | length. The result for a restriction site length of <I>s</I> is: |
---|
| 106 | <PRE> |
---|
| 107 | Q<SUP>2s</SUP> |
---|
| 108 | f = -------- |
---|
| 109 | 2 - Q<SUP>s</SUP> |
---|
| 110 | </PRE> |
---|
| 111 | (The details of the derivation will be given in my forthcoming book |
---|
| 112 | <I>Inferring Phylogenies</I> (to be published by Sinauer Associates |
---|
| 113 | in 2001). |
---|
| 114 | Given the observed fraction of restriction sites retained, <I>f</I>, |
---|
| 115 | we can solve a quadratic equation from the above expression for |
---|
| 116 | <I>Q<SUP>s</SUP></I>. That makes it easy to obtain a value of <I>Q</I>, |
---|
| 117 | and the branch length can then be estimated by adjusting it so the |
---|
| 118 | probability of a base not changing is equal to that value. |
---|
| 119 | <P> |
---|
| 120 | Alternatively, if we use the Nei and Li (1979) restriction fragments |
---|
| 121 | distance, this involves solving for <I>g</I> in the nonlinear |
---|
| 122 | equation |
---|
| 123 | <PRE> |
---|
| 124 | g = [ f (3 - 2g) ]<SUP><SUP>1</SUP>/<SUB>4</SUB></SUP> |
---|
| 125 | </PRE> |
---|
| 126 | and then the distance is given by |
---|
| 127 | <PRE> |
---|
| 128 | d = - (<SUP>2</SUP>/<SUB>r</SUB>) log<SUB>e</SUB>(g) |
---|
| 129 | </PRE> |
---|
| 130 | where <I>r</I> is the length of the restriction site. |
---|
| 131 | <P> |
---|
| 132 | Comparing these two restriction fragments distances in a case |
---|
| 133 | where their underlying DNA model is the same (which is when the |
---|
| 134 | transition/transversion ratio of the modified model is set to |
---|
| 135 | 0.5), you will find |
---|
| 136 | that they are very close to each other, differing very little at |
---|
| 137 | small distances, with the modified distance become smaller than |
---|
| 138 | the Nei/Li distance at larger distances. It will therefore matter |
---|
| 139 | very little which one you use. |
---|
| 140 | <P> |
---|
| 141 | <H2>A Comment About RAPDs and AFLPs</H2> |
---|
| 142 | <P> |
---|
| 143 | Although these distances are designed for restriction sites |
---|
| 144 | and restriction fragments data, they can be applied to RAPD and |
---|
| 145 | AFLP data as well. RAPD (Randomly Amplified Polymorphic DNA) |
---|
| 146 | and AFLP (Amplified Fragment Length Polymorphism) data consist |
---|
| 147 | of presence or absence of individual bands on a gel. The bands |
---|
| 148 | are segments of DNA with PCR primers at each end. These |
---|
| 149 | primers are defined sequences of known length (often about |
---|
| 150 | 10 nucleotides each). For AFLPs the reolevant length is the primer |
---|
| 151 | length, plus three nucleotides. Mutation in these sequences makes them no |
---|
| 152 | longer be primers, just as in the case of restriction sites. |
---|
| 153 | Thus a pair of 10-nucleotide primers will behave much the same |
---|
| 154 | as a 20-nucleotide restriction site. You can use the restriction |
---|
| 155 | sites distance as the distance between RAPD or AFLP patterns if you |
---|
| 156 | set the proper value for the total length of the site to the |
---|
| 157 | total length of the primers (plus 6 in the case of AFLPs). |
---|
| 158 | Of course there are many possible sources of noise in these data, |
---|
| 159 | including confusing fragments of similar length for each other |
---|
| 160 | and having primers near each other in the genome, and these are |
---|
| 161 | not taken into account in the statistical model used here. |
---|
| 162 | <P> |
---|
| 163 | <H2>INPUT FORMAT AND OPTIONS</H2> |
---|
| 164 | <P> |
---|
| 165 | The input is fairly standard, with one addition. As usual the first line of |
---|
| 166 | the file gives the number of species and the number of sites, but there is also |
---|
| 167 | a third number, which is the number of different restriction enzymes that were |
---|
| 168 | used to detect the restriction sites. Thus a data set with 10 species and |
---|
| 169 | 35 different sites, representing digestion with 4 different enzymes, would |
---|
| 170 | have the first line of the data file look like this: |
---|
| 171 | <P> |
---|
| 172 | <PRE> |
---|
| 173 | 10 35 4 |
---|
| 174 | </PRE> |
---|
| 175 | <P> |
---|
| 176 | The site data are in standard form. Each species starts with a species name |
---|
| 177 | whose maximum length is given by the constant "nmlngth" |
---|
| 178 | (whose value in the |
---|
| 179 | program as distributed is 10 characters). The name should, as usual, be padded |
---|
| 180 | out to that length with blanks if necessary. The sites data then follows, one |
---|
| 181 | character per site (any blanks will |
---|
| 182 | be skipped and ignored). Like the DNA and protein sequence data, the |
---|
| 183 | restriction sites data may be either in the "interleaved" form or the |
---|
| 184 | "sequential" form. Note that if you are analyzing restriction sites |
---|
| 185 | data with the programs DOLLOP or MIX or other discrete character |
---|
| 186 | programs, at the moment those programs do not use the "aligned" or |
---|
| 187 | "interleaved" data format. Therefore you may want to avoid that format |
---|
| 188 | when you have restriction sites data that you will want to feed into |
---|
| 189 | those programs. |
---|
| 190 | <P> |
---|
| 191 | The presence of a site is indicated by a "+" and the absence by a "-". I have |
---|
| 192 | also allowed the use of "1" and "0" as synonyms for "+" and "-", for |
---|
| 193 | compatibility with MIX and DOLLOP which do not allow "+" and "-". If the |
---|
| 194 | presence of |
---|
| 195 | the site is unknown (for example, if the DNA containing it has been deleted so |
---|
| 196 | that one |
---|
| 197 | does not know whether it would have contained the site) then the state "?" can |
---|
| 198 | be used to indicate that the state of this site is unknown. |
---|
| 199 | <P> |
---|
| 200 | The options are selected using an interactive menu. The menu looks like this: |
---|
| 201 | <P> |
---|
| 202 | <TABLE><TR><TD BGCOLOR=white> |
---|
| 203 | <PRE> |
---|
| 204 | |
---|
| 205 | Restriction site or fragment distances, version 3.6a3 |
---|
| 206 | |
---|
| 207 | Settings for this run: |
---|
| 208 | R Restriction sites or fragments? Sites |
---|
| 209 | G Gamma distribution of rates among sites? No |
---|
| 210 | T Transition/transversion ratio? 2.000000 |
---|
| 211 | S Site length? 6.0 |
---|
| 212 | L Form of distance matrix? Square |
---|
| 213 | M Analyze multiple data sets? No |
---|
| 214 | I Input sequences interleaved? Yes |
---|
| 215 | 0 Terminal type (IBM PC, ANSI, none)? (none) |
---|
| 216 | 1 Print out the data at start of run? No |
---|
| 217 | 2 Print indications of progress of run? Yes |
---|
| 218 | |
---|
| 219 | Y to accept these or type the letter for one to change |
---|
| 220 | |
---|
| 221 | </PRE> |
---|
| 222 | </TD></TR></TABLE> |
---|
| 223 | <P> |
---|
| 224 | The user either types "Y" (followed, of course, by a carriage-return) |
---|
| 225 | if the settings shown are to be accepted, or the letter or digit corresponding |
---|
| 226 | to an option that is to be changed. |
---|
| 227 | <P> |
---|
| 228 | The R option toggles between a restriction sites distance, which |
---|
| 229 | is the default setting, and a restriction fragments distance. In |
---|
| 230 | the latter case, another option appears, the N (Nei/Li) option. |
---|
| 231 | This allows the user to choose the original Nei and Li (1979) |
---|
| 232 | restriction fragments distance rather than my modified Nei/Li |
---|
| 233 | distance, which is the default. |
---|
| 234 | <P> |
---|
| 235 | If the G (Gamma distribution) option is selected, the user will be |
---|
| 236 | asked to supply the coefficient of variation of the rate of substitution |
---|
| 237 | among sites. This is different from the parameters used by Nei and Jin, who |
---|
| 238 | introduced Gamma distribution of rates in DNA distances, but |
---|
| 239 | related to their parameters: their parameter <EM>a</EM> is also known as |
---|
| 240 | "alpha", the shape parameter of the Gamma distribution. It is |
---|
| 241 | related to the coefficient of variation by |
---|
| 242 | <P> |
---|
| 243 | CV = 1 / a<SUP>1/2</SUP> |
---|
| 244 | <P> |
---|
| 245 | or |
---|
| 246 | <P> |
---|
| 247 | a = 1 / (CV)<SUP>2</SUP> |
---|
| 248 | <P> |
---|
| 249 | (their parameter <EM>b</EM> is absorbed here by the requirement that time is scaled so |
---|
| 250 | that the mean rate of evolution is 1 per unit time, which means that <EM>a = b</EM>). |
---|
| 251 | As we consider cases in which the rates are less variable we should set <EM>a</EM> |
---|
| 252 | larger and larger, as <EM>CV</EM> gets smaller and smaller. |
---|
| 253 | <P> |
---|
| 254 | The Gamma distribution option is not available when using the |
---|
| 255 | original Nei/Li restriction fragments distance. |
---|
| 256 | <P> |
---|
| 257 | The T option is the Transition/transversion option. The user is prompted for |
---|
| 258 | a real number greater than 0.0, as the expected ratio of transitions to |
---|
| 259 | transversions. Note |
---|
| 260 | that this is the resulting expected ratio of transitions to transversions. |
---|
| 261 | The default value of the T parameter if you do not use the T |
---|
| 262 | option is 2.0. The T option is not available when you choose the original |
---|
| 263 | Nei/Li restriction fragment distance, which assumes a Jukes-Cantor (1969) |
---|
| 264 | model of DNA change, for which the transition/transversion ratio is |
---|
| 265 | in effect fixed at 0.5. |
---|
| 266 | <P> |
---|
| 267 | The S option selects the site length. This is set to a default |
---|
| 268 | value of 6. It can be set to any positive integer. While in |
---|
| 269 | the RESTML program there is an upper limit on the restriction |
---|
| 270 | site length (set by memory limitations), in RESTDIST there is |
---|
| 271 | no effective limit on the size of the restriction sites. A value |
---|
| 272 | of 20, which might be appropriate in many cases for RAPD or AFLP |
---|
| 273 | data, is typically not practical in RESTML, but it is useable in |
---|
| 274 | RESTDIST. |
---|
| 275 | <P> |
---|
| 276 | Option L specifies that the output file will have a square matrix |
---|
| 277 | of distances. It can be used to change to lower-triangular |
---|
| 278 | data matrices. This will usually not be |
---|
| 279 | necessary, but if the distance matrices are going to be very |
---|
| 280 | large, this alternative can reduce their size by half. The |
---|
| 281 | programs which are to use them should then of course be informed |
---|
| 282 | that they can expect lower-triangular distance matrices. |
---|
| 283 | <P> |
---|
| 284 | The M, I, and 0 options are the usual Multiple data set, |
---|
| 285 | Interleaved input, and screen terminal type options. These are |
---|
| 286 | described in the main documentation file. |
---|
| 287 | <P> |
---|
| 288 | Option 1 specifies that the input data will be written out |
---|
| 289 | on the output file before the distances. This is off by |
---|
| 290 | default. If it is done, it will make the output file unusable |
---|
| 291 | as input to our distance matrix programs. |
---|
| 292 | <P> |
---|
| 293 | Option 2 turns off or on the indications of the progress of the run. |
---|
| 294 | The program prints out a row of dots (".") indicating the |
---|
| 295 | calculation of individual distances. Since the distance matrix |
---|
| 296 | is symmetrical, the program only computes the distances for the |
---|
| 297 | upper triangle of the distance matrix, and then duplicates |
---|
| 298 | the distance to the other corner of the matrix. Thus the rows of |
---|
| 299 | dots start out of full length, and then egt shorter and shorter. |
---|
| 300 | <P> |
---|
| 301 | <H2>OUTPUT FORMAT</H2> |
---|
| 302 | <P> |
---|
| 303 | The output file contains on its first line the number of species. The |
---|
| 304 | distance matrix is then printed in standard |
---|
| 305 | form, with each species starting on a new line with the species name, followed |
---|
| 306 | by the distances to the species in order. These continue onto a new line |
---|
| 307 | after every nine distances. If the L option is used, the matrix or distances |
---|
| 308 | is in lower triangular form, so that only the distances to the other species |
---|
| 309 | that precede each species are printed. Otherwise the distance matrix is square |
---|
| 310 | with zero distances on the diagonal. In general the format of the distance |
---|
| 311 | matrix is such that it can serve as input to any of the distance matrix |
---|
| 312 | programs. |
---|
| 313 | <P> |
---|
| 314 | If the option to print out the data is selected, the output file will |
---|
| 315 | precede the data by more complete information on the input and the menu |
---|
| 316 | selections. The output file begins by giving the number of species and the |
---|
| 317 | number of characters. |
---|
| 318 | <P> |
---|
| 319 | The distances printed out are scaled in terms of expected numbers of |
---|
| 320 | substitutions per DNA site, counting both transitions and transversions but not |
---|
| 321 | replacements of a base by itself, and scaled so that the average rate of |
---|
| 322 | change, averaged over all sites analyzed, is set to 1.0. Thus when the |
---|
| 323 | G option is used, the rate of change at one site may be higher than |
---|
| 324 | at another, but their mean is expected to be 1. |
---|
| 325 | <P> |
---|
| 326 | <H2>PROGRAM CONSTANTS</H2> |
---|
| 327 | <P> |
---|
| 328 | The constants available to be changed are "initialv" and |
---|
| 329 | "iterationsr". The constant "initialv" is the starting |
---|
| 330 | value of the distance in the iterations. This will typically not need to |
---|
| 331 | be changed. The constant "iterationsr" is the number of |
---|
| 332 | times that the Newton-Raphson method which is used to solve the |
---|
| 333 | equations for the distances is iterated. The program can be |
---|
| 334 | speeded up by reducing the number of iterations from the default |
---|
| 335 | value of 20, but at the possible risk of computing the distance |
---|
| 336 | less accurately. |
---|
| 337 | <P> |
---|
| 338 | <H2>FUTURE OF THE PROGRAM</H2> |
---|
| 339 | <P> |
---|
| 340 | The present program does not compute the original distance of Nei and Li (1979) |
---|
| 341 | for restriction sites (though it does have an option to compute their original |
---|
| 342 | distance for restriction fragments). I hope to add their restriction |
---|
| 343 | sites distance in the near future. |
---|
| 344 | <P> |
---|
| 345 | <HR> |
---|
| 346 | <P> |
---|
| 347 | <H3>TEST DATA SET</H3> |
---|
| 348 | <P> |
---|
| 349 | <TABLE><TR><TD BGCOLOR=white> |
---|
| 350 | <PRE> |
---|
| 351 | 5 13 2 |
---|
| 352 | Alpha ++-+-++--+++- |
---|
| 353 | Beta ++++--+--+++- |
---|
| 354 | Gamma -+--+-++-+-++ |
---|
| 355 | Delta ++-+----++--- |
---|
| 356 | Epsilon ++++----++--- |
---|
| 357 | </PRE> |
---|
| 358 | </TD></TR></TABLE> |
---|
| 359 | <P> |
---|
| 360 | <HR> |
---|
| 361 | <H3>CONTENTS OF OUTPUT FILE (with all numerical options on)</H3> |
---|
| 362 | <P> |
---|
| 363 | (Note that when the options for displaying the input data are turned off, |
---|
| 364 | the output is in a form suitable for use as an input file in the distance |
---|
| 365 | matrix programs).<P> |
---|
| 366 | <P> |
---|
| 367 | <TABLE><TR><TD BGCOLOR=white> |
---|
| 368 | <PRE> |
---|
| 369 | |
---|
| 370 | 5 Species, 13 Sites |
---|
| 371 | |
---|
| 372 | Name Sites |
---|
| 373 | ---- ----- |
---|
| 374 | |
---|
| 375 | Alpha ++-+-++--+ ++- |
---|
| 376 | Beta ++++--+--+ ++- |
---|
| 377 | Gamma -+--+-++-+ -++ |
---|
| 378 | Delta ++-+----++ --- |
---|
| 379 | Epsilon ++++----++ --- |
---|
| 380 | |
---|
| 381 | |
---|
| 382 | Alpha 0.0000 0.0224 0.1077 0.0688 0.0826 |
---|
| 383 | Beta 0.0224 0.0000 0.1077 0.0688 0.0442 |
---|
| 384 | Gamma 0.1077 0.1077 0.0000 0.1765 0.1925 |
---|
| 385 | Delta 0.0688 0.0688 0.1765 0.0000 0.0197 |
---|
| 386 | Epsilon 0.0826 0.0442 0.1925 0.0197 0.0000 |
---|
| 387 | </PRE> |
---|
| 388 | </TD></TR></TABLE> |
---|
| 389 | </BODY> |
---|
| 390 | </HTML> |
---|