source: trunk/GDE/PHYLIP/doc/seqboot.html

Last change on this file was 2176, checked in by westram, 21 years ago

1<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 3.2 Final//EN">
2<HTML>
3<HEAD>
4<TITLE>seqboot</TITLE>
5<META NAME="description" CONTENT="seqboot">
6<META NAME="keywords" CONTENT="seqboot">
7<META NAME="resource-type" CONTENT="document">
8<META NAME="distribution" CONTENT="global">
9<META HTTP-EQUIV="Content-Type" CONTENT="text/html; charset=iso-8859-1">
10</HEAD>
11<BODY BGCOLOR="#ccffff">
12<DIV ALIGN=RIGHT>
13version 3.6
14</DIV>
15<P>
16<DIV ALIGN=CENTER>
17<H1>SEQBOOT -- Bootstrap, Jackknife, or Permutation Resampling<BR>
18of Molecular Sequence, Restriction Site,<BR>
19Gene Frequency or Character Data</H1>
20</DIV>
21<P>
22&#169; Copyright 1991-2002 by the University of Washington.
23Written by Joseph Felsenstein.  Permission is granted to copy
24this document provided that no fee is charged for it and that this copyright
25notice is not removed.
26<P>
27SEQBOOT is a general bootstrapping and data set translation tool.  It is intended to allow you to
28generate multiple data sets that are resampled versions of the input data
29set.  Since almost all programs in the package can analyze these multiple
30data sets, this allows almost anything in this package to be bootstrapped,
31jackknifed, or permuted.  SEQBOOT can handle molecular sequences,
32binary characters, restriction sites, or gene frequencies.  It
33can also convert data sets between Sequential and Interleaved
34format, and into NEXUS and a new XML sequence alignment format.
35<P>
36To carry out a bootstrap (or jackknife, or permutation test) with some method
37in the package, you may need to use three programs.  First, you need to run
38SEQBOOT to take the original data set and produce a large number of
39bootstrapped or jackknifed data
40sets (somewhere between 100 and 1000 is usually adequate).
41Then you need to find the phylogeny estimate for
42each of these, using the particular method of interest.  For example, if
43you were using DNAPARS you would first run SEQBOOT and make a file with 100
44bootstrapped data sets.  Then you would give this file the proper name to
45have it be the input file for DNAPARS.  Running DNAPARS with the M (Multiple
46Data Sets) menu choice and informing it to expect 100 data sets, you
47would generate a big output file as well as a treefile with the trees from
48the 100 data sets.  This treefile could be renamed so that it would serve
49as the input for CONSENSE.  When CONSENSE is run the majority rule consensus
50tree will result, showing the outcome of the analysis.
51<P>
52This may sound tedious, but the run of CONSENSE is fast, and that of
53SEQBOOT is fairly fast, so that it will not actually take any longer than
54a run of a single bootstrap program with the same original data and the same
55number of replicates.  This is not very hard and allows bootstrapping on many of
56the methods in
57this package.  The same steps are necessary with all of them.  Doing things
58this way, some of the intermediate files (the tree file from the DNAPARS
59run, for example) can be used to summarize the results of the bootstrap in
60other ways than the majority rule consensus method does.
61<P>
62If you are using the Distance Matrix programs, you will have to add one extra
63step to this, calculating distance matrices from each of the replicate data
64sets, using DNADIST or GENDIST.  So (for example) you would run SEQBOOT, then
65run DNADIST using the output of SEQBOOT as its input, then run (say) NEIGHBOR
66using the output of DNADIST as its input, and then run CONSENSE using the
67tree file from NEIGHBOR as its input.
68<P>
69The resampling methods available are four:
70<UL>
71<LI><B>The bootstrap.</B>  Bootstrapping was invented by Bradley Efron in 1979,
72and its use in phylogeny estimation was introduced by me (Felsenstein, 1985b;
73see also Penny and Hendy, 1985).
74It involves creating a new data set by sampling <I>N</I> characters randomly
75with replacement, so that the resulting data set has the same size as the
76original, but some characters have been left out and others are duplicated.
77The random variation of the results from analyzing these bootstrapped
78data sets can be shown statistically to be typical of the variation that
79you would get from collecting new data sets.  The method assumes that the
80characters evolve independently, an assumption that may not be realistic
81for many kinds of data.
82<P>
83<LI><B>Block-bootstrapping.</B>  One pattern of departure from independence
84of character evolution is correlation of evolution in adjacent characters.
85When this is thought to have occurred, we can correct for it by sampling,
86not individual characters, but blocks of adjacent characters.  This is
87called a block bootstrap and was introduced by K&uuml;nsch (1989).  If the
88correlations are believed to extend over some number of characters, you
89choose a block size, <I>B</I>, that is larger than this, and choose
90<I>N/B</I> blocks of size <I>B</I>.  In its implementation here the
91block bootstrap "wraps around" at the end of the characters (so that if a
92block starts in the last&nbsp; <I>B-1</I> characters, it continues by wrapping
93around to the first character after it reaches the last character).  Note also
94that if you have a DNA sequence data set of an exon of a coding region, you
95can ensure that equal numbers of first, second, and third coding positions
96are sampled by using the block bootstrap with <I>B = 3</I>.
97<P>
98<LI><B>Delete-half-jackknifing</B>.  This alternative to the bootstrap involves
99sampling a random half of the characters, and including them in the data
100but dropping the others.  The resulting data sets are half the size of the
101original, and no characters are duplicated.  The random variation from
102doing this should be very similar to that obtained from the bootstrap.
103The method is advocated by Wu (1986).  It was mentioned by me in my
104bootstrapping paper (Felsenstein, 1985b), and has been available for many
105years in this program as an option.  Jackknifing is advocated by
106Farris et al. (1996) but as deleting a fraction 1/e (1/2.71828).  This
107retains too many characters and will lead to overconfidence in the
108resulting groups.
109<P>
110<LI><B>Permuting species within characters.</B>  This method of resampling (well, OK,
111it may not be best to call it resampling) was introduced by Archie (1989)
112and Faith (1990; see also Faith and Cranston, 1991).  It involves permuting the
113columns of the data matrix
114separately.  This produces data matrices that have the same number and kinds
115of characters but no taxonomic structure.  It is used for different purposes
116than the bootstrap, as it tests not the variation around an estimated tree
117but the hypothesis that there is no taxonomic structure in the data: if
118a statistic such as number of steps is significantly smaller in the actual
119data than it is in replicates that are permuted, then we can argue that there
120is some taxonomic structure in the data (though perhaps it might be just a
121pair of sibling species).
122</UL>
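<P>
The four methods listed above can be sketched in a few lines of Python.
This is only an illustration of the ideas, not SEQBOOT's own code: the function
names are hypothetical, Python's random number generator differs from the one
in SEQBOOT (so the same seed will not reproduce SEQBOOT's output), and the block
bootstrap sketch assumes that the number of characters is a multiple of the
block size.

```python
import random

def bootstrap_indices(n, rng):
    """Ordinary bootstrap: draw n character positions with replacement."""
    return [rng.randrange(n) for _ in range(n)]

def block_bootstrap_indices(n, block, rng):
    """Block bootstrap: draw n // block blocks of adjacent positions.

    A block that runs past the last character wraps around to the first.
    Assumption: n is a multiple of block (other cases are not handled here).
    """
    picked = []
    for _ in range(n // block):
        start = rng.randrange(n)
        picked.extend((start + k) % n for k in range(block))
    return picked

def delete_half_jackknife_indices(n, rng):
    """Keep a random half of the positions, with no duplicates.

    For odd n the kept half is (n-1)/2 or (n+1)/2 with equal probability.
    """
    keep = n // 2
    if n % 2 == 1 and rng.random() < 0.5:
        keep += 1
    return sorted(rng.sample(range(n), keep))

def permute_within_characters(rows, rng):
    """Shuffle the species separately within each column of the matrix."""
    columns = [list(col) for col in zip(*rows)]
    for col in columns:
        rng.shuffle(col)
    return ["".join(chars) for chars in zip(*columns)]

def take_columns(rows, indices):
    """Build a resampled data matrix from the chosen column positions."""
    return ["".join(row[i] for i in indices) for row in rows]

# The test data set from this document, one string per species.
rows = ["AACAAC", "AACCCC", "ACCAAC", "CCACCA", "CCAAAC"]
rng = random.Random(4333)

boot = take_columns(rows, bootstrap_indices(6, rng))
codon = block_bootstrap_indices(6, 3, rng)
jack = take_columns(rows, delete_half_jackknife_indices(6, rng))
perm = permute_within_characters(rows, rng)
print(boot, codon, jack, perm, sep="\n")
```

With a block size of 3 and a number of sites divisible by 3, every block
contains exactly one first, one second, and one third codon position, which is
why the three positions end up equally represented.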
123<P>
124The data input file is of standard form for molecular sequences (either in
125interleaved or sequential form), restriction sites, gene frequencies, or
126binary morphological characters.
127<P>
128When the program runs it first asks you for a random number seed.  This should
129be an integer greater than zero (and probably less than 32767) and which is
130of the form 4n+1, that is, it leaves a remainder of 1 when divided by 4.  This
131can be judged by looking at the last two digits of the integer (for instance
1327651 is not of form 4n+1 as 51, when divided by 4, leaves the remainder 3).
133The random number seed is used to start the random number generator.
134If the random number seed is not odd, the program will request it again.
135Any odd number can be used, but may result in a random number sequence that
136repeats itself after less than the full one billion numbers.  Usually this
137is not a problem.  As the random numbers appear to be unpredictable,
138there is no such thing as a "good" seed -- the numbers produced from one
139seed are indistinguishable from those produced by another, and it is
140not true that the numbers produced from one seed (say 4533) are similar to
141those produced from a nearby seed (say 4537).
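<P>
The rules for the seed can be checked before running the program.  The
following is a minimal sketch in Python; the function name and the returned
labels are hypothetical, and treating a non-positive seed as an error is an
assumption of the sketch (the text above only says that the seed should be
greater than zero).

```python
def classify_seed(seed):
    """Classify a candidate random number seed by the rules described above."""
    if seed <= 0:
        # Assumption: the documentation only says the seed should be positive.
        raise ValueError("seed should be an integer greater than zero")
    if seed % 2 == 0:
        return "even: the program will ask again"
    if seed % 4 == 1:
        return "4n+1: recommended form"
    return "odd: usable, but the sequence may repeat sooner"

for candidate in (4333, 4533, 7651, 1000):
    print(candidate, classify_seed(candidate))
```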
142<P>
143Then the program shows you a menu to allow you to choose options.  The menu
144looks like this:
145<P>
146<TABLE><TR><TD BGCOLOR=white>
147<PRE>
148
149Bootstrapping algorithm, version 3.6a3
150
151Settings for this run:
152  D      Sequence, Morph, Rest., Gene Freqs?  Molecular sequences
153  J  Bootstrap, Jackknife, Permute, Rewrite?  Bootstrap
154  B      Block size for block-bootstrapping?  1 (regular bootstrap)
155  R                     How many replicates?  100
156  W              Read weights of characters?  No
157  C                Read categories of sites?  No
158  F     Write out data sets or just weights?  Data sets
159  I             Input sequences interleaved?  Yes
160  0      Terminal type (IBM PC, ANSI, none)?  (none)
161  1       Print out the data at start of run  No
162  2     Print indications of progress of run  Yes
163
164  Y to accept these or type the letter for one to change
165
166</PRE>
167</TD></TR></TABLE>
168<P>
169The user selects options by typing one of the letters in the left column,
170and continues to do so until all options are correctly set.  Then the
171program can be run by typing Y.
172<P>
173It is important to select the correct data type (the D selection).  Each
174time D is typed the program will change data type, proceeding successively
175through Molecular Sequences, Discrete Morphological Characters, Restriction
176Sites, and Gene Frequencies.  Some of these will cause additional entries
177to appear in the menu.  If Molecular Sequences or Restriction Sites settings
178and chosen the I (Interleaved)
179option appears in the menu (and as Molecular Sequences are also the default,
180it therefore appears in the first menu).  It is the usual
181I option discussed in the Molecular Sequences document file and in the main
182documentation files for the package, and is on by default.
183<P>
184If the Restriction Sites option is chosen the menu option E appears, which
185asks whether the input file contains a third number on the first line of
186the file, for the number of restriction enzymes used to detect these sites.
187This is necessary because data sets for RESTML need this third number, but
188other programs do not, and SEQBOOT needs to know what to expect.
189<P>
190If the Gene Frequencies option is chosen an menu option A appears which allows
191the user to specify that all alleles at each locus are in the input file.
192The default setting is that one allele is absent at each locus.
193<P>
194The J option allows the user to select Bootstrapping, Delete-Half-Jackknifing,
195or the Archie-Faith permutation of species within characters.  It changes
196successively among these three each time J is typed.
197<P>
198The B option selects the Block Bootstrap.  When you select option B the program
199will ask you to enter the block length.  When the block length is 1,
200this means that we are doing regular bootstrapping rather than
201block-bootstrapping.
202<P>
203The R option allows the user to set the number of replicate data sets.
204This defaults to 100.  Most statisticians would be happiest with 1000 to
20510,000 replicates in a bootstrap, but 100 gives a rough picture.  You
206will have to decide this based on how long a running time you are willing to
207tolerate.
208<P>
209The W (Weights) option allows weights to be read
210from a file whose default name is "weights".  The weights
211follow the format described in the main documentation file.
212Weights can only be 0 or 1, and act to select
213the characters (or sites) that will be used in the resampling, the others
214being ignored and always omitted from the output data sets.
215<B>Note:</B> At present, if you use W together with the F (just weights)
216option, you write a file of weights, but with only weights for the
217sites that had input weights of 1, the others being omitted.  Thus if
218you had 100 characters, and gave 60 of them weights of 1, when you
219produce the output weights these will only have 60 weights, not 100.
220Thus they could only be used together with a data file that had been
221edited to remove the sites that you gave 0 weights to.  This is
222clumsy and we need to correct it.
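<P>
Until that is corrected, one possible workaround is to pad the shortened
weights back to full length, putting a zero at every position that had an
input weight of 0.  The sketch below assumes the weights have already been read
into Python lists of integers; the function name is hypothetical, and reading
and writing the actual weights files is not shown.

```python
def expand_weights(input_weights, reduced_weights):
    """Re-insert zeros where the input weight was 0.

    input_weights   -- the 0/1 weights given with the W option, one per site
    reduced_weights -- one output weight for each site whose input weight was 1
    """
    if sum(input_weights) != len(reduced_weights):
        raise ValueError("reduced weights do not match the number of sites kept")
    remaining = iter(reduced_weights)
    return [next(remaining) if w == 1 else 0 for w in input_weights]

# Five sites, three of them kept; the kept sites were sampled 2, 0 and 1 times.
print(expand_weights([1, 0, 1, 1, 0], [2, 0, 1]))
```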
223<P>
224The C (Categories) option can be used with molecular sequence programs to
225allow assignment of sites or amino acid positions to user-defined rate
226categories.  The assignment of rates to
227sites is then made by reading a file whose default name is "categories".
228It should contain a string of digits 1 through 9.  A new line or a blank
229can occur after any character in this string.  Thus the categories file
230might look like this:
231<P>
232<PRE>
233122231111122411155
2341155333333444
235</PRE>
236<P>
237The only use of the Categories information in SEQBOOT is that they
238are sampled along with the sites (or amino acid positions) and are
239written out onto a file whose default name is "outcategories",
240which has one set of categories information for each bootstrap
241or jackknife replicate.
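<P>
As a rough illustration, the categories can be read while ignoring blanks and
line breaks, and then resampled with the same site positions that were chosen
for the sequences.  The function names here are hypothetical, and the exact
layout that SEQBOOT writes to "outcategories" is not reproduced.

```python
def read_categories(text):
    """Read a categories string, ignoring blanks and new lines."""
    digits = [ch for ch in text if not ch.isspace()]
    bad = [ch for ch in digits if ch not in "123456789"]
    if bad:
        raise ValueError("categories must be digits 1 through 9")
    return [int(ch) for ch in digits]

def resample_categories(categories, indices):
    """Carry each sampled site's category along with the site."""
    return "".join(str(categories[i]) for i in indices)

# The example categories file shown above.
categories = read_categories("122231111122411155\n1155333333444\n")
print(len(categories))
# If sites 0, 0 and 4 were sampled, their categories follow them.
print(resample_categories(categories, [0, 0, 4]))
```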
242<P>
243The F option is a particularly important one.  It is used whether to
244produce multiple output files or multiple weights.  If your
245data set is large, a file with (say) 1000 such data sets can be very
246large and may use up too much space on your system.  If you choose
247the F option, the program will instead produce a weights file with
248multiple sets of weights.  The default name of this file is "outweights".
249Except for some programs that cannot handle multiple sets of
250weights,
251the programs have an M (multiple data sets) option that asks the
252user whether to use multiple data sets or multiple sets of weights.
253If the latter is selected when running those programs, they
254read one data set, but analyze it multiple times, each time reading a new
255set of weights.  As both bootstrapping and jackknifing can be thought of
256as reweighting the characters, this accomplishes the same thing (the
257multiple weights option is not available for Archie/Faith permutation).
258As the file with multiple sets of weights is much smaller than a file with
259multiple data sets, this can be an attractive way to save file space.
260When multiple sets of weights is chosen, they reflect the sampling as
261well as any set of weights that was read in, so that you can use
262SEQBOOT's W option as well.
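<P>
The sense in which resampling is a reweighting can be seen by counting how
often each site was drawn.  In this sketch the function name is hypothetical,
the combination with weights read by the W option is not shown, and the
encoding used in the "outweights" file is not reproduced.

```python
from collections import Counter

def weights_from_sample(n, indices):
    """Turn a list of sampled site positions into one weight per site."""
    counts = Counter(indices)
    return [counts.get(i, 0) for i in range(n)]

# A bootstrap sample of 6 sites: site 1 was left out, site 3 drawn twice.
print(weights_from_sample(6, [0, 2, 3, 3, 4, 5]))
# A delete-half jackknife sample gives only weights of 0 and 1.
print(weights_from_sample(6, [1, 2, 5]))
```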
263<P>
264The 0 (Terminal type) option is the usual one.
265<P>
266<H2>Input File</H2>
267<P>
268The data files read by SEQBOOT are the standard ones for the various kinds of
269data.  For molecular sequences the sequences may be either interleaved or
270sequential, and similarly for restriction sites.  Restriction sites data
271may either have or not have the third argument, the number of restriction
272enzymes used.  Discrete morphological
273characters are always assumed to be in sequential format.  Gene frequencies
274data start with the number of species and the number of loci, and then
275follow that by a line with the number of alleles at each locus.  The data for
276each locus may either have one entry for each allele, or omit one allele at
277each locus.  The details of the formats are given in the main documentation
278file, and in the documentation files for the groups of programs.
279<P>
280The only option that can be present in the
281input file is F (Factors), the latter only in the case of
282binary (0,1) characters.  The Factors
283option allows us to specify that groups of binary characters represent
284one multistate character.  When sampling is done they will be sampled or
285omitted together, and when permutations of species are done they will all
286have the same permutation, as would happen if they really were just one
287column in the data matrix.  For futher description of the F (Factors) option
288see the Discrete Characters Programs documentation file.
289<P>
290<H2>Output</H2>
291<P>
292The output file will contain the data sets generated by the resampling
293process.  Note that, when Gene Frequencies data is used or when
294Discrete Morphological characters with the Factors option are used,
295the number of characters in each data set may vary.  It may also vary
296if there are an odd number of characters or sites and the Delete-Half-Jackknife
297resampling method is used, for then there will be a 50% chance of choosing
298(n+1)/2 characters and a 50% chance of choosing (n-1)/2 characters.
299<P>
300The order of species in the data sets in the output file will vary
301randomly.  This is a precaution to help the programs that analyze these data
302avoid any result which is sensitive to
303the input order of species from showing up repeatedly
304and thus appearing to have evidence in its favor.
305<P>
306The numerical options 1 and 2 in the menu also affect the output file.
307If 1 is chosen (it is off by default) the program will print the original
308input data set on the output file before the resampled data sets.  I cannot
309actually see why anyone would want to do this.  Option 2 toggles the
310feature (on by default) that prints out up to 20 times during the resampling
311process a notification that the program has completed a certain number of
312data sets.  Thus if 100 resampled data sets are being produced, every 5
313data sets a line is printed saying which data set has just been completed.
314This option should be turned off if the program is running in background and
315silence is desirable.  At the end of execution the program will always (whatever
316the setting of option 2) print
317a couple of lines saying that output has been written to the output file.
318<P>
319<H2>Size and Speed</H2>
320<P>
321The program runs moderately quickly, though more slowly when the Permutation
322resampling method is used than with the others.
323<P>
324<H2>Future</H2>
325<P>
326I hope in the future to include code to pass on the Ancestors
327option from the input file (for use in programs MIX and DOLLOP)
328to the output file, a serious
329omission in the current version.
330<P>
331<HR>
332<P>
333<H3>TEST DATA SET</H3>
334<P>
335<TABLE><TR><TD BGCOLOR=white>
336<PRE>
337    5    6
338Alpha     AACAAC
339Beta      AACCCC
340Gamma     ACCAAC
341Delta     CCACCA
342Epsilon   CCAAAC
343</PRE>
344</TD></TR></TABLE>
345<P>
346<HR>
347<P>
348<H3>CONTENTS OF OUTPUT FILE</H3>
349<P>
350(If Replicates are set to 10 and seed to 4333)
351<P>
352<TABLE><TR><TD BGCOLOR=white>
353<PRE>
354    5     6
355Alpha     ACAAAC
356Beta      ACCCCC
357Gamma     ACAAAC
358Delta     CACCCA
359Epsilon   CAAAAC
360    5     6
361Alpha     AAAACC
362Beta      AACCCC
363Gamma     CCAACC
364Delta     CCCCAA
365Epsilon   CCAACC
366    5     6
367Alpha     ACAAAC
368Beta      ACCCCC
369Gamma     CCAAAC
370Delta     CACCCA
371Epsilon   CAAAAC
372    5     6
373Alpha     ACCAAA
374Beta      ACCCCC
375Gamma     ACCAAA
376Delta     CAACCC
377Epsilon   CAAAAA
378    5     6
379Alpha     ACAAAC
380Beta      ACCCCC
381Gamma     ACAAAC
382Delta     CACCCA
383Epsilon   CAAAAC
384    5     6
385Alpha     AAAACA
386Beta      AAAACC
387Gamma     AAACCA
388Delta     CCCCAC
389Epsilon   CCCCAA
390    5     6
391Alpha     AAACCC
392Beta      CCCCCC
393Gamma     AAACCC
394Delta     CCCAAA
395Epsilon   AAACCC
396    5     6
397Alpha     AAAACC
398Beta      AACCCC
399Gamma     AAAACC
400Delta     CCCCAA
401Epsilon   CCAACC
402    5     6
403Alpha     AAAAAC
404Beta      AACCCC
405Gamma     CCAAAC
406Delta     CCCCCA
407Epsilon   CCAAAC
408    5     6
409Alpha     AACCAC
410Beta      AACCCC
411Gamma     AACCAC
412Delta     CCAACA
413Epsilon   CCAAAC
414</PRE>
415</TD></TR></TABLE>
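<P>
One simple sanity check on bootstrapped output is that every column of a
replicate is a column of the original data.  The sketch below applies that
check to the first replicate shown above.  The function name is hypothetical,
and the check assumes that the species appear in the same order in both data
sets, as they do in this listing.

```python
def is_resample_of(original, replicate):
    """True if every column of the replicate occurs in the original data."""
    source_columns = set(zip(*original))
    return all(column in source_columns for column in zip(*replicate))

# Test data set: Alpha, Beta, Gamma, Delta, Epsilon.
original = ["AACAAC", "AACCCC", "ACCAAC", "CCACCA", "CCAAAC"]
# First replicate from the output file above, in the same species order.
first_replicate = ["ACAAAC", "ACCCCC", "ACAAAC", "CACCCA", "CAAAAC"]

print(is_resample_of(original, first_replicate))
```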
416<P>
417</BODY>
418</HTML>