seqboot.html

Last change on this file was 2176, checked in by westram, 22 years ago
* empty log message *
Property svn:eol-style set to `native` Property svn:keywords set to `Author Date Id Revision`
File size: 18.4 KB

Line
1	<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 3.2 Final//EN">
2	<HTML>
3	<HEAD>
4	<TITLE>seqboot</TITLE>
5	<META NAME="description" CONTENT="seqboot">
6	<META NAME="keywords" CONTENT="seqboot">
7	<META NAME="resource-type" CONTENT="document">
8	<META NAME="distribution" CONTENT="global">
9	<META HTTP-EQUIV="Content-Type" CONTENT="text/html; charset=iso-8859-1">
10	</HEAD>
11	<BODY BGCOLOR="#ccffff">
12	<DIV ALIGN=RIGHT>
13	version 3.6
14	</DIV>
15	<P>
16	<DIV ALIGN=CENTER>
17	<H1>SEQBOOT -- Bootstrap, Jackknife, or Permutation Resampling<BR>
18	of Molecular Sequence, Restriction Site,<BR>
19	Gene Frequency or Character Data</H1>
20	</DIV>
21	<P>
22	© Copyright 1991-2002 by the University of Washington.
23	Written by Joseph Felsenstein. Permission is granted to copy
24	this document provided that no fee is charged for it and that this copyright
25	notice is not removed.
26	<P>
27	SEQBOOT is a general bootstrapping and data set translation tool. It is intended to allow you to
28	generate multiple data sets that are resampled versions of the input data
29	set. Since almost all programs in the package can analyze these multiple
30	data sets, this allows almost anything in this package to be bootstrapped,
31	jackknifed, or permuted. SEQBOOT can handle molecular sequences,
32	binary characters, restriction sites, or gene frequencies. It
33	can also convert data sets between Sequential and Interleaved
34	format, and into NEXUS and a new XML sequence alignment format.
35	<P>
36	To carry out a bootstrap (or jackknife, or permutation test) with some method
37	in the package, you may need to use three programs. First, you need to run
38	SEQBOOT to take the original data set and produce a large number of
39	bootstrapped or jackknifed data
40	sets (somewhere between 100 and 1000 is usually adequate).
41	Then you need to find the phylogeny estimate for
42	each of these, using the particular method of interest. For example, if
43	you were using DNAPARS you would first run SEQBOOT and make a file with 100
44	bootstrapped data sets. Then you would give this file the proper name to
45	have it be the input file for DNAPARS. Running DNAPARS with the M (Multiple
46	Data Sets) menu choice and informing it to expect 100 data sets, you
47	would generate a big output file as well as a treefile with the trees from
48	the 100 data sets. This treefile could be renamed so that it would serve
49	as the input for CONSENSE. When CONSENSE is run the majority rule consensus
50	tree will result, showing the outcome of the analysis.
51	<P>
52	This may sound tedious, but the run of CONSENSE is fast, and that of
53	SEQBOOT is fairly fast, so that it will not actually take any longer than
54	a run of a single bootstrap program with the same original data and the same
55	number of replicates. This is not very hard and allows bootstrapping on many of
56	the methods in
57	this package. The same steps are necessary with all of them. When things are done
58	this way, some of the intermediate files (the tree file from the DNAPARS
59	run, for example) can be used to summarize the results of the bootstrap in
60	other ways than the majority rule consensus method does.
61	<P>
62	If you are using the Distance Matrix programs, you will have to add one extra
63	step to this, calculating distance matrices from each of the replicate data
64	sets, using DNADIST or GENDIST. So (for example) you would run SEQBOOT, then
65	run DNADIST using the output of SEQBOOT as its input, then run (say) NEIGHBOR
66	using the output of DNADIST as its input, and then run CONSENSE using the
67	tree file from NEIGHBOR as its input.
68	<P>
69	Four resampling methods are available:
70	<UL>
71	<LI><B>The bootstrap.</B> Bootstrapping was invented by Bradley Efron in 1979,
72	and its use in phylogeny estimation was introduced by me (Felsenstein, 1985b;
73	see also Penny and Hendy, 1985).
74	It involves creating a new data set by sampling <I>N</I> characters randomly
75	with replacement, so that the resulting data set has the same size as the
76	original, but some characters have been left out and others are duplicated.
77	The random variation of the results from analyzing these bootstrapped
78	data sets can be shown statistically to be typical of the variation that
79	you would get from collecting new data sets. The method assumes that the
80	characters evolve independently, an assumption that may not be realistic
81	for many kinds of data.
82	<P>
83	<LI><B>Block-bootstrapping.</B> One pattern of departure from independence
84	of character evolution is correlation of evolution in adjacent characters.
85	When this is thought to have occurred, we can correct for it by sampling,
86	not individual characters, but blocks of adjacent characters. This is
87	called a block bootstrap and was introduced by Künsch (1989). If the
88	correlations are believed to extend over some number of characters, you
89	choose a block size, <I>B</I>, that is larger than this, and choose
90	<I>N/B</I> blocks of size <I>B</I>. In its implementation here the
91	block bootstrap "wraps around" at the end of the characters (so that if a
92	block starts in the last <I>B-1</I> characters, it continues by wrapping
93	around to the first character after it reaches the last character). Note also
94	that if you have a DNA sequence data set of an exon of a coding region, you
95	can ensure that equal numbers of first, second, and third coding positions
96	are sampled by using the block bootstrap with <I>B = 3</I>.
97	<P>
98	<LI><B>Delete-half-jackknifing</B>. This alternative to the bootstrap involves
99	sampling a random half of the characters, and including them in the data
100	but dropping the others. The resulting data sets are half the size of the
101	original, and no characters are duplicated. The random variation from
102	doing this should be very similar to that obtained from the bootstrap.
103	The method is advocated by Wu (1986). It was mentioned by me in my
104	bootstrapping paper (Felsenstein, 1985b), and has been available for many
105	years in this program as an option. Jackknifing is advocated by
106	Farris et al. (1996) but as deleting a fraction 1/e (1/2.71828). This
107	retains too many characters and will lead to overconfidence in the
108	resulting groups.
109	<P>
110	<LI><B>Permuting species within characters.</B> This method of resampling (well, OK,
111	it may not be best to call it resampling) was introduced by Archie (1989)
112	and Faith (1990; see also Faith and Cranston, 1991). It involves permuting the
113	columns of the data matrix
114	separately. This produces data matrices that have the same number and kinds
115	of characters but no taxonomic structure. It is used for different purposes
116	than the bootstrap, as it tests not the variation around an estimated tree
117	but the hypothesis that there is no taxonomic structure in the data: if
118	a statistic such as number of steps is significantly smaller in the actual
119	data than it is in replicates that are permuted, then we can argue that there
120	is some taxonomic structure in the data (though perhaps it might be just a
121	pair of sibling species).
122	</UL>
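<P>
The four schemes above can be sketched in a few lines of Python. This is
only an illustration under stated assumptions, not SEQBOOT's own code: the
function names are invented here, the block bootstrap draws <I>N/B</I>
blocks rounded down when <I>B</I> does not divide <I>N</I>, and the
jackknife keeps the retained characters in their original order. The
example data are the five sequences of the test data set at the end of this
document.
<P>
```python
import random


def block_bootstrap(n_chars, block_size, rng):
    """Column indices for one replicate; block_size 1 is the regular bootstrap.

    Assumption: n_chars // block_size blocks are drawn. Each block starts at a
    random character and wraps around past the last character.
    """
    cols = []
    for _ in range(n_chars // block_size):
        start = rng.randrange(n_chars)
        cols.extend((start + k) % n_chars for k in range(block_size))
    return cols


def delete_half_jackknife(n_chars, rng):
    """Indices of a random half of the characters, none duplicated.

    With an odd number of characters, (n-1)/2 or (n+1)/2 are kept with
    equal probability.
    """
    keep = n_chars // 2
    if n_chars % 2 == 1 and rng.random() < 0.5:
        keep += 1
    return sorted(rng.sample(range(n_chars), keep))


def take_columns(rows, cols):
    """Build the resampled sequences from the chosen column indices."""
    return ["".join(row[j] for j in cols) for row in rows]


def permute_within_characters(rows, rng):
    """Shuffle the species separately within each character (column)."""
    n_chars = len(rows[0])
    columns = []
    for j in range(n_chars):
        col = [row[j] for row in rows]
        rng.shuffle(col)
        columns.append(col)
    return ["".join(columns[j][i] for j in range(n_chars))
            for i in range(len(rows))]


if __name__ == "__main__":
    rng = random.Random(4333)
    rows = ["AACAAC", "AACCCC", "ACCAAC", "CCACCA", "CCAAAC"]
    print(take_columns(rows, block_bootstrap(6, 1, rng)))
    print(take_columns(rows, block_bootstrap(6, 3, rng)))
    print(take_columns(rows, delete_half_jackknife(6, rng)))
    print(permute_within_characters(rows, rng))
```
<P>
Python's random number generator differs from the one in SEQBOOT, so the
replicates printed will not match the sample output shown below even with
the same seed.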
123	<P>
124	The data input file is of standard form for molecular sequences (either in
125	interleaved or sequential form), restriction sites, gene frequencies, or
126	binary morphological characters.
127	<P>
128	When the program runs it first asks you for a random number seed. This should
129	be an integer greater than zero (and probably less than 32767) that is
130	of the form 4n+1, that is, it leaves a remainder of 1 when divided by 4. This
131	can be judged by looking at the last two digits of the integer (for instance
132	7651 is not of form 4n+1 as 51, when divided by 4, leaves the remainder 3).
133	The random number seed is used to start the random number generator.
134	If the random number seed is not odd, the program will request it again.
135	Any odd number can be used, but one that is not of the form 4n+1 may result in a random number sequence that
136	repeats itself after fewer than the full one billion numbers. Usually this
137	is not a problem. As the random numbers appear to be unpredictable,
138	there is no such thing as a "good" seed -- the numbers produced from one
139	seed are indistinguishable from those produced by another, and it is
140	not true that the numbers produced from one seed (say 4533) are similar to
141	those produced from a nearby seed (say 4537).
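<P>
The seed rules above amount to two remainder tests. The following Python
sketch (the function name and the wording of its messages are invented for
illustration) classifies a candidate seed, assuming it is a positive
integer. Because 100 is divisible by 4, testing the last two digits gives
the same remainder as testing the whole number.
<P>
```python
def seed_status(seed):
    """Classify a positive integer seed by the rules described in the text."""
    if seed % 2 == 0:
        return "rejected: not odd, the program asks again"
    if seed % 4 == 1:
        return "preferred: of the form 4n+1"
    return "accepted: odd, but not of the form 4n+1"


if __name__ == "__main__":
    for seed in (7651, 4533, 4537, 1000):
        # The last two digits (seed % 100) leave the same remainder mod 4.
        assert (seed % 100) % 4 == seed % 4
        print(seed, seed_status(seed))
```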
142	<P>
143	Then the program shows you a menu to allow you to choose options. The menu
144	looks like this:
145	<P>
146	<TABLE><TR><TD BGCOLOR=white>
147	<PRE>
148
149	Bootstrapping algorithm, version 3.6a3
150
151	Settings for this run:
152	D Sequence, Morph, Rest., Gene Freqs? Molecular sequences
153	J Bootstrap, Jackknife, Permute, Rewrite? Bootstrap
154	B Block size for block-bootstrapping? 1 (regular bootstrap)
155	R How many replicates? 100
156	W Read weights of characters? No
157	C Read categories of sites? No
158	F Write out data sets or just weights? Data sets
159	I Input sequences interleaved? Yes
160	0 Terminal type (IBM PC, ANSI, none)? (none)
161	1 Print out the data at start of run No
162	2 Print indications of progress of run Yes
163
164	Y to accept these or type the letter for one to change
165
166	</PRE>
167	</TD></TR></TABLE>
168	<P>
169	The user selects options by typing one of the letters in the left column,
170	and continues to do so until all options are correctly set. Then the
171	program can be run by typing Y.
172	<P>
173	It is important to select the correct data type (the D selection). Each
174	time D is typed the program will change data type, proceeding successively
175	through Molecular Sequences, Discrete Morphological Characters, Restriction
176	Sites, and Gene Frequencies. Some of these will cause additional entries
177	to appear in the menu. If Molecular Sequences or Restriction Sites settings
178	are chosen the I (Interleaved)
179	option appears in the menu (and as Molecular Sequences are also the default,
180	it therefore appears in the first menu). It is the usual
181	I option discussed in the Molecular Sequences document file and in the main
182	documentation files for the package, and is on by default.
183	<P>
184	If the Restriction Sites option is chosen the menu option E appears, which
185	asks whether the input file contains a third number on the first line of
186	the file, for the number of restriction enzymes used to detect these sites.
187	This is necessary because data sets for RESTML need this third number, but
188	other programs do not, and SEQBOOT needs to know what to expect.
189	<P>
190	If the Gene Frequencies option is chosen a menu option A appears which allows
191	the user to specify that all alleles at each locus are in the input file.
192	The default setting is that one allele is absent at each locus.
193	<P>
194	The J option allows the user to select Bootstrapping, Delete-Half-Jackknifing,
195	or the Archie-Faith permutation of species within characters. It changes
196	successively among these three each time J is typed.
197	<P>
198	The B option selects the Block Bootstrap. When you select option B the program
199	will ask you to enter the block length. When the block length is 1,
200	this means that we are doing regular bootstrapping rather than
201	block-bootstrapping.
202	<P>
203	The R option allows the user to set the number of replicate data sets.
204	This defaults to 100. Most statisticians would be happiest with 1000 to
205	10,000 replicates in a bootstrap, but 100 gives a rough picture. You
206	will have to decide this based on how long a running time you are willing to
207	tolerate.
208	<P>
209	The W (Weights) option allows weights to be read
210	from a file whose default name is "weights". The weights
211	follow the format described in the main documentation file.
212	Weights can only be 0 or 1, and act to select
213	the characters (or sites) that will be used in the resampling, the others
214	being ignored and always omitted from the output data sets.
215	<B>Note:</B> At present, if you use W together with the F (just weights)
216	option, you write a file of weights, but with only weights for the
217	sites that had input weights of 1, the others being omitted. Thus if
218	you had 100 characters, and gave 60 of them weights of 1, when you
219	produce the output weights these will only have 60 weights, not 100.
220	Thus they could only be used together with a data file that had been
221	edited to remove the sites that you gave 0 weights to. This is
222	clumsy and we need to correct it.
223	<P>
224	The C (Categories) option can be used with molecular sequence programs to
225	allow assignment of sites or amino acid positions to user-defined rate
226	categories. The assignment of rates to
227	sites is then made by reading a file whose default name is "categories".
228	It should contain a string of digits 1 through 9. A new line or a blank
229	can occur after any character in this string. Thus the categories file
230	might look like this:
231	<P>
232	<PRE>
233	122231111122411155
234	1155333333444
235	</PRE>
236	<P>
237	The only use of the Categories information in SEQBOOT is that they
238	are sampled along with the sites (or amino acid positions) and are
239	written out onto a file whose default name is "outcategories",
240	which has one set of categories information for each bootstrap
241	or jackknife replicate.
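<P>
As a hedged illustration of how the categories travel with the sites, the
Python sketch below reads a categories string of the kind shown above and
resamples it with the same column indices used for the sequences. The
function names are hypothetical, and the layout of the real "outcategories"
file is not reproduced here.
<P>
```python
def parse_categories(text):
    """Return the category digits, ignoring blanks and new lines."""
    digits = [ch for ch in text if not ch.isspace()]
    for ch in digits:
        if ch not in "123456789":
            raise ValueError("category must be a digit 1 through 9: " + repr(ch))
    return digits


def resample_categories(categories, cols):
    """Pick the category of each sampled site, in sampled order."""
    return "".join(categories[j] for j in cols)


if __name__ == "__main__":
    cats = parse_categories("122231111122411155\n1155333333444\n")
    print(len(cats))  # number of sites covered by the categories file
    # A made-up sample of five sites, with site 0 drawn twice.
    print(resample_categories(cats, [0, 0, 4, 12, 30]))
```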
242	<P>
243	The F option is a particularly important one. It determines whether the program
244	writes multiple data sets or multiple sets of weights. If your
245	data set is large, a file with (say) 1000 such data sets can be very
246	large and may use up too much space on your system. If you choose
247	the F option, the program will instead produce a weights file with
248	multiple sets of weights. The default name of this file is "outweights".
249	Except for some programs that cannot handle multiple sets of
250	weights,
251	the programs have an M (multiple data sets) option that asks the
252	user whether to use multiple data sets or multiple sets of weights.
253	If the latter is selected when running those programs, they
254	read one data set, but analyze it multiple times, each time reading a new
255	set of weights. As both bootstrapping and jackknifing can be thought of
256	as reweighting the characters, this accomplishes the same thing (the
257	multiple weights option is not available for Archie/Faith permutation).
258	As the file with multiple sets of weights is much smaller than a file with
259	multiple data sets, this can be an attractive way to save file space.
260	When multiple sets of weights is chosen, they reflect the sampling as
261	well as any set of weights that was read in, so that you can use
262	SEQBOOT's W option as well.
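<P>
The equivalence between resampling and reweighting can be seen in a short
Python sketch. It assumes no input weights were read, uses invented
function names, and does not attempt to reproduce the layout of the
"outweights" file. A bootstrap weight is the number of times a site was
drawn, and a jackknife weight is 1 for a retained site and 0 for a dropped
one.
<P>
```python
import random
from collections import Counter


def bootstrap_weights(n_chars, rng):
    """Weight of each site = how often it was drawn in n_chars draws."""
    counts = Counter(rng.randrange(n_chars) for _ in range(n_chars))
    return [counts[j] for j in range(n_chars)]


def jackknife_weights(n_chars, rng):
    """Weight 1 for a randomly retained half of the sites, 0 for the rest."""
    keep = n_chars // 2
    if n_chars % 2 == 1 and rng.random() < 0.5:
        keep += 1
    kept = set(rng.sample(range(n_chars), keep))
    return [1 if j in kept else 0 for j in range(n_chars)]


def columns_from_weights(weights):
    """Expand weights back into the list of sampled column indices."""
    return [j for j, w in enumerate(weights) for _ in range(w)]


if __name__ == "__main__":
    rng = random.Random(4333)
    w = bootstrap_weights(6, rng)
    print(w, columns_from_weights(w))
    print(jackknife_weights(6, rng))
```
<P>
One list of weights per replicate takes far less room than a full copy of
the data, which is the saving described above.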
263	<P>
264	The 0 (Terminal type) option is the usual one.
265	<P>
266	<H2>Input File</H2>
267	<P>
268	The data files read by SEQBOOT are the standard ones for the various kinds of
269	data. For molecular sequences the sequences may be either interleaved or
270	sequential, and similarly for restriction sites. Restriction sites data
271	may either have or not have the third argument, the number of restriction
272	enzymes used. Discrete morphological
273	characters are always assumed to be in sequential format. Gene frequencies
274	data start with the number of species and the number of loci, and then
275	follow that by a line with the number of alleles at each locus. The data for
276	each locus may either have one entry for each allele, or omit one allele at
277	each locus. The details of the formats are given in the main documentation
278	file, and in the documentation files for the groups of programs.
279	<P>
280	The only option that can be present in the
281	input file is F (Factors), and that only in the case of
282	binary (0,1) characters. The Factors
283	option allows us to specify that groups of binary characters represent
284	one multistate character. When sampling is done they will be sampled or
285	omitted together, and when permutations of species are done they will all
286	have the same permutation, as would happen if they really were just one
287	column in the data matrix. For further description of the F (Factors) option
288	see the Discrete Characters Programs documentation file.
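<P>
A minimal Python sketch of sampling with Factors follows. It makes several
assumptions that are not taken from SEQBOOT itself: the factors are given
as one label per binary character, adjacent characters with the same label
form one multistate character, and a bootstrap replicate draws as many
groups as there are groups. The function names are hypothetical.
<P>
```python
import random
from itertools import groupby


def factor_groups(labels):
    """Split character indices into runs of adjacent equal labels."""
    groups = []
    pos = 0
    for _, run in groupby(labels):
        size = len(list(run))
        groups.append(list(range(pos, pos + size)))
        pos += size
    return groups


def bootstrap_by_factor(labels, rng):
    """Sample whole groups with replacement, keeping each group together."""
    groups = factor_groups(labels)
    cols = []
    for _ in range(len(groups)):
        cols.extend(rng.choice(groups))
    return cols


if __name__ == "__main__":
    labels = "1122213"  # seven binary characters in four groups
    print(factor_groups(labels))
    print(bootstrap_by_factor(labels, random.Random(4333)))
```
<P>
Because the groups differ in size, the number of binary characters in a
replicate produced this way can differ from the original number.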
289	<P>
290	<H2>Output</H2>
291	<P>
292	The output file will contain the data sets generated by the resampling
293	process. Note that, when Gene Frequencies data is used or when
294	Discrete Morphological characters with the Factors option are used,
295	the number of characters in each data set may vary. It may also vary
296	if there is an odd number of characters or sites and the Delete-Half-Jackknife
297	resampling method is used, for then there will be a 50% chance of choosing
298	(n+1)/2 characters and a 50% chance of choosing (n-1)/2 characters.
299	<P>
300	The order of species in the data sets in the output file will vary
301	randomly. This is a precaution to keep any result that is sensitive to
302	the input order of species, in the programs that analyze these data,
303	from showing up repeatedly
304	and thus appearing to have evidence in its favor.
305	<P>
306	The numerical options 1 and 2 in the menu also affect the output file.
307	If 1 is chosen (it is off by default) the program will print the original
308	input data set on the output file before the resampled data sets. I cannot
309	actually see why anyone would want to do this. Option 2 toggles the
310	feature (on by default) that prints out up to 20 times during the resampling
311	process a notification that the program has completed a certain number of
312	data sets. Thus if 100 resampled data sets are being produced, every 5
313	data sets a line is printed saying which data set has just been completed.
314	This option should be turned off if the program is running in background and
315	silence is desirable. At the end of execution the program will always (whatever
316	the setting of option 2) print
317	a couple of lines saying that output has been written to the output file.
318	<P>
319	<H2>Size and Speed</H2>
320	<P>
321	The program runs moderately quickly, though more slowly when the Permutation
322	resampling method is used than with the others.
323	<P>
324	<H2>Future</H2>
325	<P>
326	I hope in the future to include code to pass on the Ancestors
327	option from the input file (for use in programs MIX and DOLLOP)
328	to the output file, a serious
329	omission in the current version.
330	<P>
331	<HR>
332	<P>
333	<H3>TEST DATA SET</H3>
334	<P>
335	<TABLE><TR><TD BGCOLOR=white>
336	<PRE>
337	5 6
338	Alpha AACAAC
339	Beta AACCCC
340	Gamma ACCAAC
341	Delta CCACCA
342	Epsilon CCAAAC
343	</PRE>
344	</TD></TR></TABLE>
345	<P>
346	<HR>
347	<P>
348	<H3>CONTENTS OF OUTPUT FILE</H3>
349	<P>
350	(If Replicates are set to 10 and seed to 4333)
351	<P>
352	<TABLE><TR><TD BGCOLOR=white>
353	<PRE>
354	5 6
355	Alpha ACAAAC
356	Beta ACCCCC
357	Gamma ACAAAC
358	Delta CACCCA
359	Epsilon CAAAAC
360	5 6
361	Alpha AAAACC
362	Beta AACCCC
363	Gamma CCAACC
364	Delta CCCCAA
365	Epsilon CCAACC
366	5 6
367	Alpha ACAAAC
368	Beta ACCCCC
369	Gamma CCAAAC
370	Delta CACCCA
371	Epsilon CAAAAC
372	5 6
373	Alpha ACCAAA
374	Beta ACCCCC
375	Gamma ACCAAA
376	Delta CAACCC
377	Epsilon CAAAAA
378	5 6
379	Alpha ACAAAC
380	Beta ACCCCC
381	Gamma ACAAAC
382	Delta CACCCA
383	Epsilon CAAAAC
384	5 6
385	Alpha AAAACA
386	Beta AAAACC
387	Gamma AAACCA
388	Delta CCCCAC
389	Epsilon CCCCAA
390	5 6
391	Alpha AAACCC
392	Beta CCCCCC
393	Gamma AAACCC
394	Delta CCCAAA
395	Epsilon AAACCC
396	5 6
397	Alpha AAAACC
398	Beta AACCCC
399	Gamma AAAACC
400	Delta CCCCAA
401	Epsilon CCAACC
402	5 6
403	Alpha AAAAAC
404	Beta AACCCC
405	Gamma CCAAAC
406	Delta CCCCCA
407	Epsilon CCAAAC
408	5 6
409	Alpha AACCAC
410	Beta AACCCC
411	Gamma AACCAC
412	Delta CCAACA
413	Epsilon CCAAAC
414	</PRE>
415	</TD></TR></TABLE>
416	<P>
417	</BODY>
418	</HTML>