1 | <!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 3.2 Final//EN"> |
---|
2 | <HTML> |
---|
3 | <HEAD> |
---|
4 | <TITLE>seqboot</TITLE> |
---|
5 | <META NAME="description" CONTENT="seqboot"> |
---|
6 | <META NAME="keywords" CONTENT="seqboot"> |
---|
7 | <META NAME="resource-type" CONTENT="document"> |
---|
8 | <META NAME="distribution" CONTENT="global"> |
---|
9 | <META HTTP-EQUIV="Content-Type" CONTENT="text/html; charset=iso-8859-1"> |
---|
10 | </HEAD> |
---|
11 | <BODY BGCOLOR="#ccffff"> |
---|
12 | <DIV ALIGN=RIGHT> |
---|
13 | version 3.6 |
---|
14 | </DIV> |
---|
15 | <P> |
---|
16 | <DIV ALIGN=CENTER> |
---|
17 | <H1>SEQBOOT -- Bootstrap, Jackknife, or Permutation Resampling<BR> |
---|
18 | of Molecular Sequence, Restriction Site,<BR> |
---|
19 | Gene Frequency or Character Data</H1> |
---|
20 | </DIV> |
---|
21 | <P> |
---|
22 | © Copyright 1991-2002 by the University of Washington. |
---|
23 | Written by Joseph Felsenstein. Permission is granted to copy |
---|
24 | this document provided that no fee is charged for it and that this copyright |
---|
25 | notice is not removed. |
---|
26 | <P> |
---|
27 | SEQBOOT is a general bootstrapping and data set translation tool. It is intended to allow you to |
---|
28 | generate multiple data sets that are resampled versions of the input data |
---|
29 | set. Since almost all programs in the package can analyze these multiple |
---|
30 | data sets, this allows almost anything in this package to be bootstrapped, |
---|
31 | jackknifed, or permuted. SEQBOOT can handle molecular sequences, |
---|
32 | binary characters, restriction sites, or gene frequencies. It |
---|
33 | can also convert data sets between Sequential and Interleaved |
---|
34 | format, and into NEXUS and a new XML sequence alignment format. |
---|
35 | <P> |
---|
36 | To carry out a bootstrap (or jackknife, or permutation test) with some method |
---|
37 | in the package, you may need to use three programs. First, you need to run |
---|
38 | SEQBOOT to take the original data set and produce a large number of |
---|
39 | bootstrapped or jackknifed data |
---|
40 | sets (somewhere between 100 and 1000 is usually adequate). |
---|
41 | Then you need to find the phylogeny estimate for |
---|
42 | each of these, using the particular method of interest. For example, if |
---|
43 | you were using DNAPARS you would first run SEQBOOT and make a file with 100 |
---|
44 | bootstrapped data sets. Then you would give this file the proper name to |
---|
45 | have it be the input file for DNAPARS. Running DNAPARS with the M (Multiple |
---|
46 | Data Sets) menu choice and informing it to expect 100 data sets, you |
---|
47 | would generate a big output file as well as a treefile with the trees from |
---|
48 | the 100 data sets. This treefile could be renamed so that it would serve |
---|
49 | as the input for CONSENSE. When CONSENSE is run the majority rule consensus |
---|
50 | tree will result, showing the outcome of the analysis. |
---|
51 | <P> |
---|
52 | This may sound tedious, but the run of CONSENSE is fast, and that of |
---|
53 | SEQBOOT is fairly fast, so that it will not actually take any longer than |
---|
54 | a run of a single bootstrap program with the same original data and the same |
---|
55 | number of replicates. This is not very hard and allows bootstrapping on many of |
---|
56 | the methods in |
---|
57 | this package. The same steps are necessary with all of them. Doing things |
---|
58 | this way, some of the intermediate files (the tree file from the DNAPARS |
---|
59 | run, for example) can be used to summarize the results of the bootstrap in |
---|
60 | other ways than the majority rule consensus method does. |
---|
61 | <P> |
---|
62 | If you are using the Distance Matrix programs, you will have to add one extra |
---|
63 | step to this, calculating distance matrices from each of the replicate data |
---|
64 | sets, using DNADIST or GENDIST. So (for example) you would run SEQBOOT, then |
---|
65 | run DNADIST using the output of SEQBOOT as its input, then run (say) NEIGHBOR |
---|
66 | using the output of DNADIST as its input, and then run CONSENSE using the |
---|
67 | tree file from NEIGHBOR as its input. |
---|
68 | <P> |
---|
69 | The following resampling methods are available: |
---|
70 | <UL> |
---|
71 | <LI><B>The bootstrap.</B> Bootstrapping was invented by Bradley Efron in 1979, |
---|
72 | and its use in phylogeny estimation was introduced by me (Felsenstein, 1985b; |
---|
73 | see also Penny and Hendy, 1985). |
---|
74 | It involves creating a new data set by sampling <I>N</I> characters randomly |
---|
75 | with replacement, so that the resulting data set has the same size as the |
---|
76 | original, but some characters have been left out and others are duplicated. |
---|
77 | The random variation of the results from analyzing these bootstrapped |
---|
78 | data sets can be shown statistically to be typical of the variation that |
---|
79 | you would get from collecting new data sets. The method assumes that the |
---|
80 | characters evolve independently, an assumption that may not be realistic |
---|
81 | for many kinds of data. |
---|
82 | <P> |
---|
83 | <LI><B>Block-bootstrapping.</B> One pattern of departure from independence |
---|
84 | of character evolution is correlation of evolution in adjacent characters. |
---|
85 | When this is thought to have occurred, we can correct for it by sampling, |
---|
86 | not individual characters, but blocks of adjacent characters. This is |
---|
87 | called a block bootstrap and was introduced by Künsch (1989). If the |
---|
88 | correlations are believed to extend over some number of characters, you |
---|
89 | choose a block size, <I>B</I>, that is larger than this, and choose |
---|
90 | <I>N/B</I> blocks of size <I>B</I>. In its implementation here the |
---|
91 | block bootstrap "wraps around" at the end of the characters (so that if a |
---|
92 | block starts in the last <I>B-1</I> characters, it continues by wrapping |
---|
93 | around to the first character after it reaches the last character). Note also |
---|
94 | that if you have a DNA sequence data set of an exon of a coding region, you |
---|
95 | can ensure that equal numbers of first, second, and third coding positions |
---|
96 | are sampled by using the block bootstrap with <I>B = 3</I>. |
---|
97 | <P> |
---|
98 | <LI><B>Delete-half-jackknifing.</B> This alternative to the bootstrap involves |
---|
99 | sampling a random half of the characters, and including them in the data |
---|
100 | but dropping the others. The resulting data sets are half the size of the |
---|
101 | original, and no characters are duplicated. The random variation from |
---|
102 | doing this should be very similar to that obtained from the bootstrap. |
---|
103 | The method is advocated by Wu (1986). It was mentioned by me in my |
---|
104 | bootstrapping paper (Felsenstein, 1985b), and has been available for many |
---|
105 | years in this program as an option. Jackknifing is advocated by |
---|
106 | Farris et al. (1996) but as deleting a fraction 1/e (1/2.71828). This |
---|
107 | retains too many characters and will lead to overconfidence in the |
---|
108 | resulting groups. |
---|
109 | <P> |
---|
110 | <LI><B>Permuting species within characters.</B> This method of resampling (well, OK, |
---|
111 | it may not be best to call it resampling) was introduced by Archie (1989) |
---|
112 | and Faith (1990; see also Faith and Cranston, 1991). It involves permuting the |
---|
113 | columns of the data matrix |
---|
114 | separately. This produces data matrices that have the same number and kinds |
---|
115 | of characters but no taxonomic structure. It is used for different purposes |
---|
116 | than the bootstrap, as it tests not the variation around an estimated tree |
---|
117 | but the hypothesis that there is no taxonomic structure in the data: if |
---|
118 | a statistic such as number of steps is significantly smaller in the actual |
---|
119 | data than it is in replicates that are permuted, then we can argue that there |
---|
120 | is some taxonomic structure in the data (though perhaps it might be just a |
---|
121 | pair of sibling species). |
---|
122 | </UL> |
---|
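The four resampling schemes listed above can be sketched in a few lines of Python. This is only an illustration of the sampling rules as described here, not SEQBOOT's own code. The function names are hypothetical, the use of integer division when <I>N</I> is not a multiple of <I>B</I> is an assumption, and the order of the sampled columns carries no meaning.

```python
import random


def bootstrap_columns(n, rng):
    """Regular bootstrap: draw n column indices with replacement."""
    return [rng.randrange(n) for _ in range(n)]


def block_bootstrap_columns(n, block, rng):
    """Block bootstrap: draw n // block blocks of adjacent columns.

    A block that runs past the last column wraps around to the first.
    Using n // block when n is not a multiple of block is an assumption.
    """
    cols = []
    for _ in range(n // block):
        start = rng.randrange(n)
        cols.extend((start + k) % n for k in range(block))
    return cols


def jackknife_columns(n, rng):
    """Delete-half jackknife: keep a random half, with no duplicates.

    For odd n, keep (n+1)/2 or (n-1)/2 columns with equal probability.
    """
    keep = n // 2
    if n % 2 == 1 and rng.random() < 0.5:
        keep += 1
    return sorted(rng.sample(range(n), keep))


def permute_within_columns(rows, rng):
    """Permute species within each character (column) independently."""
    columns = [list(col) for col in zip(*rows)]
    for col in columns:
        rng.shuffle(col)
    return ["".join(chars) for chars in zip(*columns)]


def take_columns(rows, cols):
    """Build a resampled data matrix from the chosen column indices."""
    return ["".join(row[c] for c in cols) for row in rows]


if __name__ == "__main__":
    rng = random.Random(4333)
    data = ["AACAAC", "AACCCC", "ACCAAC", "CCACCA", "CCAAAC"]
    print(take_columns(data, bootstrap_columns(6, rng)))
    print(take_columns(data, block_bootstrap_columns(6, 3, rng)))
    print(take_columns(data, jackknife_columns(6, rng)))
    print(permute_within_columns(data, rng))
```

The random number generator here is Python's, so the replicates it produces will not match those that SEQBOOT writes for the same seed.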
123 | <P> |
---|
124 | The data input file is of standard form for molecular sequences (either in |
---|
125 | interleaved or sequential form), restriction sites, gene frequencies, or |
---|
126 | binary morphological characters. |
---|
127 | <P> |
---|
128 | When the program runs it first asks you for a random number seed. This should |
---|
129 | be an integer greater than zero (and probably less than 32767) that is |
---|
130 | of the form 4n+1, that is, it leaves a remainder of 1 when divided by 4. This |
---|
131 | can be judged by looking at the last two digits of the integer (for instance |
---|
132 | 7651 is not of form 4n+1 as 51, when divided by 4, leaves the remainder 3). |
---|
133 | The random number seed is used to start the random number generator. |
---|
134 | If the random number seed is not odd, the program will request it again. |
---|
135 | Any odd number can be used, but may result in a random number sequence that |
---|
136 | repeats itself after less than the full one billion numbers. Usually this |
---|
137 | is not a problem. As the random numbers appear to be unpredictable, |
---|
138 | there is no such thing as a "good" seed -- the numbers produced from one |
---|
139 | seed are indistinguishable from those produced by another, and it is |
---|
140 | not true that the numbers produced from one seed (say 4533) are similar to |
---|
141 | those produced from a nearby seed (say 4537). |
---|
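The rules for the seed can be checked before starting a run. The helper below is a hypothetical sketch, not part of the package. It only restates the description above: an even seed is asked for again, any odd seed is accepted, and a seed of the form 4n+1 is the recommended kind. Treating zero or negative seeds as rejected is an assumption.

```python
def seed_status(seed):
    """Classify a candidate random number seed.

    Returns "rejected" for seeds the program would ask for again,
    "recommended" for seeds of the form 4n+1, and "accepted" for other
    odd seeds, which may give a shorter random number sequence.
    """
    if seed <= 0 or seed % 2 == 0:
        # Rejecting nonpositive seeds is an assumption of this sketch.
        return "rejected"
    if seed % 4 == 1:
        return "recommended"
    return "accepted"


if __name__ == "__main__":
    for candidate in (4533, 4537, 7651, 100):
        print(candidate, seed_status(candidate))
```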
142 | <P> |
---|
143 | Then the program shows you a menu to allow you to choose options. The menu |
---|
144 | looks like this: |
---|
145 | <P> |
---|
146 | <TABLE><TR><TD BGCOLOR=white> |
---|
147 | <PRE> |
---|
148 | |
---|
149 | Bootstrapping algorithm, version 3.6a3 |
---|
150 | |
---|
151 | Settings for this run: |
---|
152 | D Sequence, Morph, Rest., Gene Freqs? Molecular sequences |
---|
153 | J Bootstrap, Jackknife, Permute, Rewrite? Bootstrap |
---|
154 | B Block size for block-bootstrapping? 1 (regular bootstrap) |
---|
155 | R How many replicates? 100 |
---|
156 | W Read weights of characters? No |
---|
157 | C Read categories of sites? No |
---|
158 | F Write out data sets or just weights? Data sets |
---|
159 | I Input sequences interleaved? Yes |
---|
160 | 0 Terminal type (IBM PC, ANSI, none)? (none) |
---|
161 | 1 Print out the data at start of run No |
---|
162 | 2 Print indications of progress of run Yes |
---|
163 | |
---|
164 | Y to accept these or type the letter for one to change |
---|
165 | |
---|
166 | </PRE> |
---|
167 | </TD></TR></TABLE> |
---|
168 | <P> |
---|
169 | The user selects options by typing one of the letters in the left column, |
---|
170 | and continues to do so until all options are correctly set. Then the |
---|
171 | program can be run by typing Y. |
---|
172 | <P> |
---|
173 | It is important to select the correct data type (the D selection). Each |
---|
174 | time D is typed the program will change data type, proceeding successively |
---|
175 | through Molecular Sequences, Discrete Morphological Characters, Restriction |
---|
176 | Sites, and Gene Frequencies. Some of these will cause additional entries |
---|
177 | to appear in the menu. If Molecular Sequences or Restriction Sites settings |
---|
178 | are chosen, the I (Interleaved) |
---|
179 | option appears in the menu (and as Molecular Sequences are also the default, |
---|
180 | it therefore appears in the first menu). It is the usual |
---|
181 | I option discussed in the Molecular Sequences document file and in the main |
---|
182 | documentation files for the package, and is on by default. |
---|
183 | <P> |
---|
184 | If the Restriction Sites option is chosen the menu option E appears, which |
---|
185 | asks whether the input file contains a third number on the first line of |
---|
186 | the file, for the number of restriction enzymes used to detect these sites. |
---|
187 | This is necessary because data sets for RESTML need this third number, but |
---|
188 | other programs do not, and SEQBOOT needs to know what to expect. |
---|
189 | <P> |
---|
190 | If the Gene Frequencies option is chosen a menu option A appears which allows |
---|
191 | the user to specify that all alleles at each locus are in the input file. |
---|
192 | The default setting is that one allele is absent at each locus. |
---|
193 | <P> |
---|
194 | The J option allows the user to select Bootstrapping, Delete-Half-Jackknifing, |
---|
195 | or the Archie-Faith permutation of species within characters. It changes |
---|
196 | successively among these three each time J is typed. |
---|
197 | <P> |
---|
198 | The B option selects the Block Bootstrap. When you select option B the program |
---|
199 | will ask you to enter the block length. When the block length is 1, |
---|
200 | this means that we are doing regular bootstrapping rather than |
---|
201 | block-bootstrapping. |
---|
202 | <P> |
---|
203 | The R option allows the user to set the number of replicate data sets. |
---|
204 | This defaults to 100. Most statisticians would be happiest with 1000 to |
---|
205 | 10,000 replicates in a bootstrap, but 100 gives a rough picture. You |
---|
206 | will have to decide this based on how long a running time you are willing to |
---|
207 | tolerate. |
---|
208 | <P> |
---|
209 | The W (Weights) option allows weights to be read |
---|
210 | from a file whose default name is "weights". The weights |
---|
211 | follow the format described in the main documentation file. |
---|
212 | Weights can only be 0 or 1, and act to select |
---|
213 | the characters (or sites) that will be used in the resampling, the others |
---|
214 | being ignored and always omitted from the output data sets. |
---|
215 | <B>Note:</B> At present, if you use W together with the F (just weights) |
---|
216 | option, you write a file of weights, but with only weights for the |
---|
217 | sites that had input weights of 1, the others being omitted. Thus if |
---|
218 | you had 100 characters, and gave 60 of them weights of 1, when you |
---|
219 | produce the output weights these will only have 60 weights, not 100. |
---|
220 | Thus they could only be used together with a data file that had been |
---|
221 | edited to remove the sites that you gave 0 weights to. This is |
---|
222 | clumsy and we need to correct it. |
---|
223 | <P> |
---|
224 | The C (Categories) option can be used with molecular sequence programs to |
---|
225 | allow assignment of sites or amino acid positions to user-defined rate |
---|
226 | categories. The assignment of rates to |
---|
227 | sites is then made by reading a file whose default name is "categories". |
---|
228 | It should contain a string of digits 1 through 9. A new line or a blank |
---|
229 | can occur after any character in this string. Thus the categories file |
---|
230 | might look like this: |
---|
231 | <P> |
---|
232 | <PRE> |
---|
233 | 122231111122411155 |
---|
234 | 1155333333444 |
---|
235 | </PRE> |
---|
236 | <P> |
---|
237 | The only use of the Categories information in SEQBOOT is that the categories |
---|
238 | are sampled along with the sites (or amino acid positions) and are |
---|
239 | written out onto a file whose default name is "outcategories", |
---|
240 | which has one set of categories information for each bootstrap |
---|
241 | or jackknife replicate. |
---|
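As an illustration of how categories travel with the sampled sites, the following sketch reads a categories string and resamples it with the same column indices that were chosen for the sites. The function names are hypothetical, and the layout of the "outcategories" file is not reproduced here.

```python
def read_categories(text):
    """Parse a categories string: digits 1 through 9, with blanks or
    new lines allowed after any character."""
    cats = []
    for ch in text:
        if ch.isspace():
            continue
        if ch not in "123456789":
            raise ValueError("category must be a digit 1-9, got %r" % ch)
        cats.append(int(ch))
    return cats


def resample_categories(cats, cols):
    """Pick the category of each sampled site, in the sampled order."""
    return [cats[c] for c in cols]


if __name__ == "__main__":
    cats = read_categories("122231111122411155\n1155333333444\n")
    print(len(cats), "categories read")
    print(resample_categories(cats, [0, 0, 4, 12, 30]))
```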
242 | <P> |
---|
243 | The F option is a particularly important one. It determines whether to |
---|
244 | produce multiple data sets or multiple sets of weights. If your |
---|
245 | data set is large, a file with (say) 1000 such data sets can be very |
---|
246 | large and may use up too much space on your system. If you choose |
---|
247 | the F option, the program will instead produce a weights file with |
---|
248 | multiple sets of weights. The default name of this file is "outweights". |
---|
249 | Except for some programs that cannot handle multiple sets of |
---|
250 | weights, |
---|
251 | the programs have an M (multiple data sets) option that asks the |
---|
252 | user whether to use multiple data sets or multiple sets of weights. |
---|
253 | If the latter is selected when running those programs, they |
---|
254 | read one data set, but analyze it multiple times, each time reading a new |
---|
255 | set of weights. As both bootstrapping and jackknifing can be thought of |
---|
256 | as reweighting the characters, this accomplishes the same thing (the |
---|
257 | multiple weights option is not available for Archie/Faith permutation). |
---|
258 | As the file with multiple sets of weights is much smaller than a file with |
---|
259 | multiple data sets, this can be an attractive way to save file space. |
---|
260 | When multiple sets of weights is chosen, they reflect the sampling as |
---|
261 | well as any set of weights that was read in, so that you can use |
---|
262 | SEQBOOT's W option as well. |
---|
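To see why a set of weights can stand in for a resampled data set, note that a bootstrap replicate is fully described by how many times each character was drawn. The sketch below computes such counts. The function name is hypothetical and the actual layout of the "outweights" file is not shown. Unlike the current program, as noted under the W option, this sketch keeps a zero entry for characters whose input weight was 0.

```python
import random
from collections import Counter


def bootstrap_weights(input_weights, rng):
    """Return one bootstrap replicate as per-character counts.

    Only characters with input weight 1 can be drawn. The result has
    one entry per original character (an assumption of this sketch).
    """
    eligible = [i for i, w in enumerate(input_weights) if w == 1]
    draws = Counter(rng.choice(eligible) for _ in eligible)
    return [draws.get(i, 0) for i in range(len(input_weights))]


if __name__ == "__main__":
    rng = random.Random(4333)
    print(bootstrap_weights([1, 1, 0, 1, 1, 1], rng))
```

A delete-half jackknife replicate can be described the same way, with every count being either 0 or 1.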
263 | <P> |
---|
264 | The 0 (Terminal type) option is the usual one. |
---|
265 | <P> |
---|
266 | <H2>Input File</H2> |
---|
267 | <P> |
---|
268 | The data files read by SEQBOOT are the standard ones for the various kinds of |
---|
269 | data. For molecular sequences the sequences may be either interleaved or |
---|
270 | sequential, and similarly for restriction sites. Restriction sites data |
---|
271 | may either have or not have the third argument, the number of restriction |
---|
272 | enzymes used. Discrete morphological |
---|
273 | characters are always assumed to be in sequential format. Gene frequencies |
---|
274 | data start with the number of species and the number of loci, and then |
---|
275 | follow that by a line with the number of alleles at each locus. The data for |
---|
276 | each locus may either have one entry for each allele, or omit one allele at |
---|
277 | each locus. The details of the formats are given in the main documentation |
---|
278 | file, and in the documentation files for the groups of programs. |
---|
279 | <P> |
---|
280 | The only option that can be present in the |
---|
281 | input file is F (Factors), and that only in the case of |
---|
282 | binary (0,1) characters. The Factors |
---|
283 | option allows us to specify that groups of binary characters represent |
---|
284 | one multistate character. When sampling is done they will be sampled or |
---|
285 | omitted together, and when permutations of species are done they will all |
---|
286 | have the same permutation, as would happen if they really were just one |
---|
287 | column in the data matrix. For further description of the F (Factors) option |
---|
288 | see the Discrete Characters Programs documentation file. |
---|
289 | <P> |
---|
290 | <H2>Output</H2> |
---|
291 | <P> |
---|
292 | The output file will contain the data sets generated by the resampling |
---|
293 | process. Note that, when Gene Frequencies data is used or when |
---|
294 | Discrete Morphological characters with the Factors option are used, |
---|
295 | the number of characters in each data set may vary. It may also vary |
---|
296 | if there are an odd number of characters or sites and the Delete-Half-Jackknife |
---|
297 | resampling method is used, for then there will be a 50% chance of choosing |
---|
298 | (n+1)/2 characters and a 50% chance of choosing (n-1)/2 characters. |
---|
299 | <P> |
---|
300 | The order of species in the data sets in the output file will vary |
---|
301 | randomly. This is a precaution to keep any result that is sensitive to |
---|
302 | the input order of species |
---|
303 | from showing up repeatedly in the programs that analyze these data |
---|
304 | and thus appearing to have evidence in its favor. |
---|
305 | <P> |
---|
306 | The numerical options 1 and 2 in the menu also affect the output file. |
---|
307 | If 1 is chosen (it is off by default) the program will print the original |
---|
308 | input data set on the output file before the resampled data sets. I cannot |
---|
309 | actually see why anyone would want to do this. Option 2 toggles the |
---|
310 | feature (on by default) that prints out up to 20 times during the resampling |
---|
311 | process a notification that the program has completed a certain number of |
---|
312 | data sets. Thus if 100 resampled data sets are being produced, every 5 |
---|
313 | data sets a line is printed saying which data set has just been completed. |
---|
314 | This option should be turned off if the program is running in background and |
---|
315 | silence is desirable. At the end of execution the program will always (whatever |
---|
316 | the setting of option 2) print |
---|
317 | a couple of lines saying that output has been written to the output file. |
---|
318 | <P> |
---|
319 | <H2>Size and Speed</H2> |
---|
320 | <P> |
---|
321 | The program runs moderately quickly, though more slowly when the Permutation |
---|
322 | resampling method is used than with the others. |
---|
323 | <P> |
---|
324 | <H2>Future</H2> |
---|
325 | <P> |
---|
326 | I hope in the future to include code to pass on the Ancestors |
---|
327 | option from the input file (for use in programs MIX and DOLLOP) |
---|
328 | to the output file, a serious |
---|
329 | omission in the current version. |
---|
330 | <P> |
---|
331 | <HR> |
---|
332 | <P> |
---|
333 | <H3>TEST DATA SET</H3> |
---|
334 | <P> |
---|
335 | <TABLE><TR><TD BGCOLOR=white> |
---|
336 | <PRE> |
---|
337 | 5 6 |
---|
338 | Alpha AACAAC |
---|
339 | Beta AACCCC |
---|
340 | Gamma ACCAAC |
---|
341 | Delta CCACCA |
---|
342 | Epsilon CCAAAC |
---|
343 | </PRE> |
---|
344 | </TD></TR></TABLE> |
---|
345 | <P> |
---|
346 | <HR> |
---|
347 | <P> |
---|
348 | <H3>CONTENTS OF OUTPUT FILE</H3> |
---|
349 | <P> |
---|
350 | (If Replicates are set to 10 and seed to 4333) |
---|
351 | <P> |
---|
352 | <TABLE><TR><TD BGCOLOR=white> |
---|
353 | <PRE> |
---|
354 | 5 6 |
---|
355 | Alpha ACAAAC |
---|
356 | Beta ACCCCC |
---|
357 | Gamma ACAAAC |
---|
358 | Delta CACCCA |
---|
359 | Epsilon CAAAAC |
---|
360 | 5 6 |
---|
361 | Alpha AAAACC |
---|
362 | Beta AACCCC |
---|
363 | Gamma CCAACC |
---|
364 | Delta CCCCAA |
---|
365 | Epsilon CCAACC |
---|
366 | 5 6 |
---|
367 | Alpha ACAAAC |
---|
368 | Beta ACCCCC |
---|
369 | Gamma CCAAAC |
---|
370 | Delta CACCCA |
---|
371 | Epsilon CAAAAC |
---|
372 | 5 6 |
---|
373 | Alpha ACCAAA |
---|
374 | Beta ACCCCC |
---|
375 | Gamma ACCAAA |
---|
376 | Delta CAACCC |
---|
377 | Epsilon CAAAAA |
---|
378 | 5 6 |
---|
379 | Alpha ACAAAC |
---|
380 | Beta ACCCCC |
---|
381 | Gamma ACAAAC |
---|
382 | Delta CACCCA |
---|
383 | Epsilon CAAAAC |
---|
384 | 5 6 |
---|
385 | Alpha AAAACA |
---|
386 | Beta AAAACC |
---|
387 | Gamma AAACCA |
---|
388 | Delta CCCCAC |
---|
389 | Epsilon CCCCAA |
---|
390 | 5 6 |
---|
391 | Alpha AAACCC |
---|
392 | Beta CCCCCC |
---|
393 | Gamma AAACCC |
---|
394 | Delta CCCAAA |
---|
395 | Epsilon AAACCC |
---|
396 | 5 6 |
---|
397 | Alpha AAAACC |
---|
398 | Beta AACCCC |
---|
399 | Gamma AAAACC |
---|
400 | Delta CCCCAA |
---|
401 | Epsilon CCAACC |
---|
402 | 5 6 |
---|
403 | Alpha AAAAAC |
---|
404 | Beta AACCCC |
---|
405 | Gamma CCAAAC |
---|
406 | Delta CCCCCA |
---|
407 | Epsilon CCAAAC |
---|
408 | 5 6 |
---|
409 | Alpha AACCAC |
---|
410 | Beta AACCCC |
---|
411 | Gamma AACCAC |
---|
412 | Delta CCAACA |
---|
413 | Epsilon CCAAAC |
---|
414 | </PRE> |
---|
415 | </TD></TR></TABLE> |
---|
416 | <P> |
---|
417 | </BODY> |
---|
418 | </HTML> |
---|