main.html

Last change on this file was 2176, checked in by westram, 22 years ago
* empty log message *
Property svn:eol-style set to `native` Property svn:keywords set to `Author Date Id Revision`
File size: 233.0 KB

1	<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 3.2 Final//EN">
2	<HTML>
3	<HEAD>
4	<TITLE>main</TITLE>
5	<META NAME="description" CONTENT="main">
6	<META NAME="keywords" CONTENT="PHYLIP, main, documentation">
7	<META NAME="resource-type" CONTENT="document">
8	<META NAME="distribution" CONTENT="global">
9	<META HTTP-EQUIV="Content-Type" CONTENT="text/html; charset=iso-8859-1">
10	</HEAD>
11	<BODY BGCOLOR="#ccffff">
12	<P>
13	<DIV ALIGN="CENTER">
14	<H1>PHYLIP</H1>
15	<H2>Phylogeny Inference Package</H2>
16	<P>
17	<IMG SRC="phylip.gif" ALT="PHYLIP Logo">
18	<P>
19	<H3>Version 3.6(alpha3)</H3>
20	<P>
21	<H3>July, 2002</H3>
22	<P>
23	<H2>by Joseph Felsenstein</H2>
24	<P>
25	<BR>
26	<TABLE>
27	<TR><TD>
28	<FONT SIZE="+2">
29	Department of Genome Sciences<BR>
30	University of Washington<BR>
31	Box 357730<BR>
32	Seattle, WA   98195-7730<BR>
33	USA
34	</FONT>
35	</TD></TR>
36	</TABLE>
37	<H2>E-mail address: <TT>joe@gs.washington.edu</TT></H2>
38	</DIV>
39	<P>
40	<DIV ALIGN="CENTER">
41	<A NAME="contents"><HR><P></A>
42	<H2>Contents of this document</H2></DIV>
43	<P>
44	<BR>
45	<A HREF="#contents">Contents of this document</A>
46	<BR>
47	<A HREF="#description">A Brief Description of the Programs</A>
48	<BR>
49	<A HREF="#copyright">Copyright Notice for PHYLIP</A>
50	<BR>
51	<A HREF="#documentation">The Documentation Files and How to Read Them</A>
52	<BR>
53	<A HREF="#programs">What The Programs Do</A>
54	<BR>
55	<A HREF="#running">Running the Programs</A>
56	<BR>
57	A word about input files
58	<BR>
59	Running the programs on a Windows machine
60	<BR>
61	Running the programs on a Macintosh
62	<BR>
63	Running the programs on a Unix system
64	<BR>
65	Running the programs in MSDOS
66	<BR>
67	Running the programs in background or under control of a command file
68	<BR>
69	<A HREF="#inputfiles">Preparing Input Files</A>
70	<BR>
71	Input and output files
72	<BR>
73	Data file format
74	<BR>
75	<A HREF="#menu">The Menu</A>
76	<BR>
77	<A HREF="#outputfile">The Output File</A>
78	<BR>
79	<A HREF="#treefile">The Tree File</A>
80	<BR>
81	<A HREF="#options">The Options and How To Invoke Them</A>
82	<BR>
83	Common options in the menu
84	<BR>
85	The <TT>U</TT> (User tree) option
86	<BR>
87	The <TT>G</TT> (Global) option
88	<BR>
89	The <TT>J</TT> (Jumble) option
90	<BR>
91	The <TT>O</TT> (Outgroup) option
92	<BR>
93	The <TT>T</TT> (Threshold) option
94	<BR>
95	The <TT>M</TT> (Multiple data sets) option
96	<BR>
97	The <TT>W</TT> (Weights) option
98	<BR>
99	The option to write out the trees into a tree file
100	<BR>
101	The (<TT>0</TT>) terminal type option
102	<BR>
103	<A HREF="#algorithm">The Algorithm for Constructing Trees</A>
104	<BR>
105	Local Rearrangements
106	<BR>
107	Global Rearrangements
108	<BR>
109	Multiple Jumbles
110	<BR>
111	Saving multiple tied trees
112	<BR>
113	Strategy for Finding the Best Tree
114	<BR>
115	<A HREF="#warning">A Warning on Interpreting Results</A>
116	<BR>
117	<A HREF="#speed">Relative Speed of Different Programs and Machines</A>
118	<BR>
119	Relative speed of the different programs
120	<BR>
121	Speed with different numbers of species
122	<BR>
123	Relative speed of different machines
124	<BR>
125	<A HREF="#comments">General Comments on Adapting the Package to Different Computer Systems</A>
126	<BR>
127	<A HREF="#compiling">Compiling the programs</A>
128	<BR>
129	Unix and Linux
130	<BR>
131	Macintosh PowerMacs
132	<BR>
133	Compiling with Metrowerks Codewarrior
134	<BR>
135	On Windows systems
136	<BR>
137	Compiling with Microsoft Visual C++
138	<BR>
139	Compiling with Borland C++
140	<BR>
141	Compiling with Metrowerks Codewarrior for Windows
142	<BR>
143	Compiling with Cygnus Gnu C++
144	<BR>
145	VMS VAX systems
146	<BR>
147	Parallel computers
148	<BR>
149	Other computer systems
150	<BR>
151	<A HREF="#FAQ">Frequently Asked Questions</A>
152	<BR>
153	How to make it do various things
154	<BR>
155	Background information needed:
156	<BR>
157	Questions about distribution and citation:
158	<BR>
159	Questions about documentation
160	<BR>
161	Additional Frequently Asked Questions, or: "Why didn't it occur to you to ...
162	<BR>
163	(Fortunately) obsolete questions
164	<BR>
165	<A HREF="#newfeatures">New Features in This Version</A>
166	<BR>
167	<A HREF="#future">Coming Attractions, Future Plans</A>
168	<BR>
169	<A HREF="#endorsements">Endorsements</A>
170	<BR>
171	From the pages of <I>Cladistics</I>
172	<BR>
173	... and in the pages of other journals:
174	<BR>
175	<A HREF="#references">References for the Documentation Files</A>
176	<BR>
177	<A HREF="#credits">Credits</A>
178	<BR>
179	<A HREF="#otherprograms">Other Phylogeny Programs Available Elsewhere</A>
180	<BR>
181	PAUP*
182	<BR>
183	MacClade
184	<BR>
185	MEGA
186	<BR>
187	MOLPHY
188	<BR>
189	PAML
190	<BR>
191	TREE-PUZZLE
192	<BR>
193	DAMBE
194	<BR>
195	Hennig86
196	<BR>
197	RnA
198	<BR>
199	NONA
200	<BR>
201	TNT
202	<BR>
203	<A HREF="#helpme">How You Can Help Me</A>
204	<BR>
205	<A HREF="#trouble">In Case of Trouble</A>
206	<P>
207	<A NAME="description"><HR><P></A>
208	<DIV ALIGN="CENTER">
209	<H2>A Brief Description of the Programs</H2></DIV>
210	<P>
211	<TT>PHYLIP</TT>, the Phylogeny Inference Package, is a package of programs for
212	inferring phylogenies (evolutionary trees). It has been distributed since
213	1980, and has over 10,000 registered users, making it the most widely
214	distributed package of phylogeny programs. It is available free, from
215	its web site:
216	<P>
217	<DIV ALIGN="CENTER">
218	<FONT SIZE=+2><A HREF="http://evolution.gs.washington.edu/phylip.html">
219	<TT>http://evolution.gs.washington.edu/phylip.html</TT></A></FONT>
220
221	</DIV>
222	<P>
223	<TT>PHYLIP</TT> is available as source code in C, and also as executables for
224	some common computer systems. It can infer phylogenies by parsimony,
225	compatibility, distance matrix methods, and likelihood. It can also
226	compute consensus trees, compute distances between trees, draw trees,
227	resample data sets by bootstrapping or jackknifing, edit trees, and
228	compute distance matrices. It can handle data that are nucleotide
229	sequences, protein sequences, gene frequencies, restriction sites,
230	restriction fragments, distances, discrete characters, and continuous
231	characters.
232	<P>
233	<BR>
234	<A NAME="copyright"><HR><P></A>
235	<DIV ALIGN=CENTER>
236	<TABLE BORDER=4 WIDTH=80%><TR><TD ALIGN=LEFT>
237	<DIV ALIGN="CENTER">
238	<H2>Copyright Notice for PHYLIP</H2></DIV>
239	<P>
240	The following copyright notice is intended to cover all source code, all
241	documentation, and all executable programs of the PHYLIP package.
242	<P>
243	© Copyright 1980-2002. University of Washington and Joseph Felsenstein. All
244	rights reserved. Permission is granted to reproduce, perform, and modify
245	these programs and documentation files. Permission is granted to distribute
246	or provide access to these
247	programs provided that this copyright notice is not removed, the programs are
248	not integrated with or called by any product or service that generates
249	revenue, and that your distribution of these materials program are free.
250	Any modified
251	versions of these materials that are distributed or accessible shall indicate
252	that they are based on these program. Institutions of higher education are
253	granted permission to distribute this material to their students and staff
254	for a fee to recover distribution costs. Permission requests for any other
255	distribution of this program should be directed to <TT>license@u.washington.edu</TT>.
256	<BR>
257	</TD></TR></TABLE></DIV>
258
259	<BR>
260	<A NAME="documentation"><HR><P></A>
261	<DIV ALIGN="CENTER">
262	<H2>The Documentation Files and How to Read Them</H2></DIV>
263	<P>
264	<TT>PHYLIP</TT> comes with an extensive set of documentation files. These
265	include the main documentation file (this one), which you should read
266	fairly completely. In addition there are files for groups of programs,
267	including ones for the <A HREF="sequence.html">molecular sequence</A>
268	programs, the <A HREF="distance.html">distance matrix</A>
269	programs, the
270	<A HREF="contchar.html">gene frequency and continuous characters</A>
271	programs, the <A HREF="discrete.html">discrete characters</A> programs,
272	and the <A HREF="draw.html">tree drawing</A> programs. Finally,
273	each program has its own documentation file. References for the
274	documentation files are all gathered together in this main documentation
275	file. A good strategy is to:
276	<OL>
277	<LI>Read this main documentation file.
278	<LI>Tentatively decide which programs are of interest to you.
279	<LI>Read the documentation files for the groups of programs that
280	contain those.
281	<LI>Read the documentation files for those individual programs.
282	</OL>
283	<P>
284	<A NAME="programs"><HR><P></A>
285	<DIV ALIGN="CENTER">
286	<H2>What The Programs Do</H2></DIV>
287	<P>
288	Here is a short description of each of the programs. For more detailed
289	discussion you should definitely read the documentation file for the
290	individual program and the documentation file for the group of programs
291	it is in. In this list the name of each program is a link which will
292	take you to the documentation file for that program. Note that there is no
293	program in the PHYLIP package called PHYLIP.
294	<DL>
295	<DT><STRONG><A HREF="protpars.html">PROTPARS</A></STRONG>
296	<DD>Estimates phylogenies from protein sequences (input using the
297	standard one-letter code for amino acids) using the parsimony method, in
298	a variant which counts only those nucleotide changes that change the amino
299	acid, on the assumption that silent changes are more easily accomplished.
300	<DT><STRONG><A HREF="dnapars.html">DNAPARS</A></STRONG>
301	<DD>Estimates phylogenies by the parsimony method using nucleic acid
302	sequences. Allows use of the full IUB ambiguity codes, and estimates
303	ancestral nucleotide states. Gaps treated as a fifth nucleotide state.
304	Can use 0/1 weights, reconstruct ancestral states, and infer branch
305	lengths.
306	<DT><STRONG><A HREF="dnamove.html">DNAMOVE</A></STRONG>
307	<DD>Interactive construction of phylogenies from nucleic acid
308	sequences, with their evaluation by parsimony and compatibility and the
309	display of reconstructed ancestral bases. This can be used to find
310	parsimony or compatibility estimates by hand.
311	<DT><STRONG><A HREF="dnapenny.html">DNAPENNY</A></STRONG>
312	<DD>Finds all most parsimonious phylogenies for nucleic acid
313	sequences by branch-and-bound search. This may not be practical (depending
314	on the data) for more than 15 species or so.
315	<DT><STRONG><A HREF="dnacomp.html">DNACOMP</A></STRONG>
316	<DD>Estimates phylogenies from nucleic acid sequence data using
317	the compatibility criterion, which searches for the largest number of sites
318	which could have all states (nucleotides) uniquely evolved on the same
319	tree. Compatibility is particularly appropriate when sites vary greatly in
320	their rates of evolution, but we do not know in advance which are the less
321	reliable ones.
322	<DT><STRONG><A HREF="dnainvar.html">DNAINVAR</A></STRONG>
323	<DD>For nucleic acid sequence data on four species, computes
324	Lake's and Cavender's phylogenetic invariants, which test alternative tree
325	topologies. The program also tabulates the frequencies of occurrence of the
326	different nucleotide patterns. Lake's invariants are the method which he
327	calls "evolutionary parsimony".
328	<DT><STRONG><A HREF="dnaml.html">DNAML</A></STRONG>
329	<DD>Estimates phylogenies from nucleotide sequences by maximum
330	likelihood. The model employed allows for unequal expected frequencies of
331	the four nucleotides, for unequal rates of transitions and transversions,
332	and for different (prespecified) rates of change in different categories of
333	sites, with the program inferring which sites have which rates. It also
334	allows different rates of change at known sites.
335	<DT><STRONG><A HREF="dnamlk.html">DNAMLK</A></STRONG>
336	<DD>Same as DNAML but assumes a molecular clock. The use of the
337	two programs together permits a likelihood ratio test of the
338	molecular clock hypothesis to be made.
339	<DT><STRONG><A HREF="proml.html">PROML</A></STRONG>
340	<DD>Estimates phylogenies from protein amino acid sequences by maximum
341	likelihood. The PAM or JTTF models can be employed. The program
342	can allow for different (prespecified) rates of change in different
343	categories of amino acid positions, with the program inferring which
344	positions have which rates. It also allows different rates of change
345	at known sites.
346	<DT><STRONG><A HREF="promlk.html">PROMLK</A></STRONG>
347	<DD>Same as PROML but assumes a molecular clock. The use of the
348	two programs together permits a likelihood ratio test of the
349	molecular clock hypothesis to be made.
350	<DT><STRONG><A HREF="dnadist.html">DNADIST</A></STRONG>
351	<DD>Computes four different distances between species from nucleic
352	acid sequences. The distances can then be used in the distance matrix
353	programs. The distances are the Jukes-Cantor formula, one based on Kimura's
354	2-parameter method, Jin and Nei's distance which allows for rate variation
355	from site to site, and a maximum likelihood method using the model employed
356	in DNAML. The last of these methods can be very slow.
357	<DT><STRONG><A HREF="protdist.html">PROTDIST</A></STRONG>
358	<DD>Computes a distance measure for protein sequences, using
359	maximum likelihood estimates based on the Dayhoff PAM matrix, Kimura's 1983
360	approximation to it, or a model based on the genetic code plus a
361	constraint on changing to a different category of amino acid. Rate
362	variation from site to site is also allowed. The
363	distances can be used in the distance matrix programs.
364	<DT><STRONG><A HREF="restdist.html">RESTDIST</A></STRONG>
365	<DD>Distances calculated from restriction sites data or
366	restriction fragments data. The restriction sites option is the one to
367	use to also make distances for RAPDs or AFLPs.
368	<DT><STRONG><A HREF="restml.html">RESTML</A></STRONG>
369	<DD>Estimation of phylogenies by maximum likelihood using
370	restriction sites data (not restriction fragments but presence/absence of
371	individual sites). It employs the Jukes-Cantor symmetrical model of
372	nucleotide change, which does not allow for differences of rate between
373	transitions and transversions. This program is <I>very</I> slow.
374	<DT><STRONG><A HREF="seqboot.html">SEQBOOT</A></STRONG>
375	<DD>Reads in a data set, and produces multiple data sets from
376	it by bootstrap resampling. Since most programs in the current version of
377	the package allow processing of multiple data sets, this can be used
378	together with the consensus tree program CONSENSE to do bootstrap (or
379	delete-half-jackknife) analyses with most of the methods in this package.
380	This program also allows the Archie/Faith technique of permutation of
381	species within characters. It can also rewrite a data set to convert
382	it between the PHYLIP Interleaved and Sequential forms, and into
383	a preliminary version of a new XML sequence alignment format
384	which is under development.
385	<DT><STRONG><A HREF="fitch.html">FITCH</A></STRONG>
386	<DD>Estimates phylogenies from distance matrix data under the
387	"additive tree model" according to which the distances are expected to
388	equal the sums of branch lengths between the species. Uses the
389	Fitch-Margoliash criterion and some related least squares criteria. Does
390	not assume an evolutionary clock. This program will be useful with
391	distances computed from molecular sequences, restriction sites or fragments
392	distances, with DNA hybridization measurements, and with genetic distances
393	computed from gene frequencies.
394	<DT><STRONG><A HREF="kitsch.html">KITSCH</A></STRONG>
395	<DD>Estimates phylogenies from distance matrix data under the
396	"ultrametric" model which is the same as the additive tree model except
397	that an evolutionary clock is assumed. The Fitch-Margoliash criterion and
398	other least squares criteria are assumed. This program will be useful with
399	distances computed from molecular sequences, restriction sites or
400	fragments distances, with distances from DNA hybridization measurements,
401	and with genetic distances computed from gene frequencies.
402	<DT><STRONG><A HREF="neighbor.html">NEIGHBOR</A></STRONG>
403	<DD>An implementation by Mary Kuhner and John Yamato of Saitou and
404	Nei's "Neighbor Joining Method," and of the UPGMA (Average Linkage
405	clustering) method. Neighbor Joining is a distance matrix method producing
406	an unrooted tree without the assumption of a clock. UPGMA does assume a
407	clock. The branch lengths are not optimized by the least squares criterion
408	but the methods are very fast and thus can handle much larger data sets.
409	<DT><STRONG><A HREF="contml.html">CONTML</A></STRONG>
410	<DD>Estimates phylogenies from gene frequency data by maximum
411	likelihood under a model in which all divergence is due to genetic drift in
412	the absence of new mutations. Does not assume a molecular clock. An
413	alternative method of analyzing this data is to compute Nei's genetic
414	distance and use one of the distance matrix programs.
415	This program can also do maximum likelihood analysis of continuous
416	characters that evolve by a Brownian Motion model, but it assumes that
417	the characters evolve at equal rates and in an uncorrelated fashion, so
418	that it does not take into account the usual correlations of characters.
419	<DT><STRONG><A HREF="gendist.html">GENDIST</A></STRONG>
420	<DD>Computes one of three different genetic distance formulas
421	from gene frequency data. The formulas are Nei's genetic distance, the
422	Cavalli-Sforza chord measure, and the genetic distance of Reynolds et al.
423	The first is appropriate for data in which new mutations occur in an
424	infinite isoalleles neutral mutation model, the latter two for a model
425	without mutation and with pure genetic drift. The distances are written to
426	a file in a format appropriate for input to the distance matrix programs.
427	<DT><STRONG><A HREF="contrast.html">CONTRAST</A></STRONG>
428	<DD>Reads a tree from a tree file, and a data set with continuous
429	characters data, and produces the independent contrasts for those
430	characters, for use in any multivariate statistics package. Will also
431	produce covariances, regressions and correlations between characters for
432	those contrasts. Can also correct for within-species sampling variation
433	when individual phenotypes are available within a population.
434	<DT><STRONG><A HREF="pars.html">PARS</A></STRONG>
435	<DD>Multistate discrete-characters parsimony method. Up to 8 states
436	(as well as "<TT>?</TT>") are allowed. Cannot do Camin-Sokal or Dollo Parsimony.
437	Can reconstruct ancestral states, use character weights, and infer branch
438	lengths.
439	<DT><STRONG><A HREF="mix.html">MIX</A></STRONG>
440	<DD>Estimates phylogenies by some parsimony methods for discrete
441	character data with two states (0 and 1). Allows use of the
442	Wagner parsimony method, the Camin-Sokal parsimony method, or arbitrary
443	mixtures of these. Also reconstructs ancestral states and allows weighting
444	of characters (does not infer branch lengths).
445	<DT><STRONG><A HREF="move.html">MOVE</A></STRONG>
446	<DD>Interactive construction of phylogenies from discrete character
447	data with two states (0 and 1). Evaluates parsimony and compatibility
448	criteria for those phylogenies and displays reconstructed states throughout
449	the tree. This can be used to find parsimony or compatibility estimates by
450	hand.
451	<DT><STRONG><A HREF="penny.html">PENNY</A></STRONG>
452	<DD>Finds all most parsimonious phylogenies for discrete-character
453	data with two states, for the Wagner, Camin-Sokal, and mixed parsimony
454	criteria using the branch-and-bound method of exact search. May be
455	impractical (depending on the data) for more than 10-11 species.
456	<DT><STRONG><A HREF="dollop.html">DOLLOP</A></STRONG>
457	<DD>Estimates phylogenies by the Dollo or polymorphism parsimony
458	criteria for discrete character data with two states (0 and 1). Also
459	reconstructs ancestral states and allows weighting of characters. Dollo
460	parsimony is particularly appropriate for restriction sites data; with
461	ancestor states specified as unknown it may be appropriate for restriction
462	fragments data.
463	<DT><STRONG><A HREF="dolmove.html">DOLMOVE</A></STRONG>
464	<DD>Interactive construction of phylogenies from discrete
465	character data with two states (0 and 1) using the Dollo or polymorphism
466	parsimony criteria. Evaluates parsimony and compatibility criteria for
467	those phylogenies and displays reconstructed states throughout the tree.
468	This can be used to find parsimony or compatibility estimates by hand.
469	<DT><STRONG><A HREF="dolpenny.html">DOLPENNY</A></STRONG>
470	<DD>Finds all most parsimonious phylogenies for
471	discrete-character data with two states, for the Dollo or polymorphism
472	parsimony criteria using the branch-and-bound method of exact search. May
473	be impractical (depending on the data) for more than 10-11 species.
474	<DT><STRONG><A HREF="clique.html">CLIQUE</A></STRONG>
475	<DD>Finds the largest clique of mutually compatible characters, and
476	the phylogeny which they recommend, for discrete character data with two
477	states. The largest clique (or all cliques within a given size range of
478	the largest one) are found by a very fast branch and bound search method.
479	The method does not allow for missing data. For such cases the <TT>T</TT>
480	(Threshold) option of PARS or MIX may be a useful alternative.
481	Compatibility methods are particularly useful when some characters are of
482	poor quality and the rest of good quality, but when it is not known in
483	advance which ones are which.
484	<DT><STRONG><A HREF="factor.html">FACTOR</A></STRONG>
485	<DD>Takes discrete multistate data with character state trees and
486	produces the corresponding data set with two states (0 and 1). Written by
487	Christopher Meacham. This program was formerly used to accommodate
488	multistate characters in MIX, but this is less necessary now that PARS is
489	available.
490	<DT><STRONG><A HREF="drawgram.html">DRAWGRAM</A></STRONG>
491	<DD>Plots rooted phylogenies, cladograms, and phenograms in a
492	wide variety of user-controllable formats. The program is interactive and
493	allows previewing of the tree on PC or Macintosh graphics screens,
494	and Tektronix or Digital graphics terminals. Final output can be
495	to a file formatted for one of the drawing programs, on
496	a laser printer (such as Postscript or PCL-compatible printers),
497	on graphics screens or terminals, on pen plotters (Hewlett-Packard or
498	Houston Instruments) or on dot matrix printers capable of graphics
499	(Epson, Okidata, Imagewriter, or Toshiba).
500	<DT><STRONG><A HREF="drawtree.html">DRAWTREE</A></STRONG>
501	<DD>Similar to DRAWGRAM but plots unrooted phylogenies.
502	<DT><STRONG><A HREF="treedist.html">TREEDIST</A></STRONG>
503	<DD>Computes the Robinson-Foulds symmetric difference distance
504	between trees, which allows for differences in tree topology (but does not
505	use branch lengths).
506	<DT><STRONG><A HREF="consense.html">CONSENSE</A></STRONG>
507	<DD>Computes consensus trees by the majority-rule consensus tree
508	method, which also allows one to easily find the strict consensus tree.
509	Is not able to compute the Adams consensus tree. Trees are input in a tree
510	file in standard nested-parenthesis notation, which is produced by many of
511	the tree estimation programs in the package. This program can be used as
512	the final step in doing bootstrap analyses for many of the methods in the
513	package.
514	<DT><STRONG><A HREF="retree.html">RETREE</A></STRONG>
515	<DD>Reads in a tree (with branch lengths if necessary) and allows
516	you to reroot the tree, to flip branches, to change species names and
517	branch lengths, and then write the result out. Can be used to convert
518	between rooted and unrooted trees, and to write the tree into a
519	preliminary version of a new XML tree file format which is under
520	development.
521	</DL>
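<P>
As an illustration of the bootstrap resampling mentioned under SEQBOOT above,
here is a minimal sketch in Python. It is not SEQBOOT's own code, and the
function name <TT>bootstrap_alignment</TT> and the dictionary layout of the
alignment are assumptions made only for this example. It draws sites (columns)
with replacement and applies the same draw to every species, so that the
columns stay aligned.

```python
import random

def bootstrap_alignment(sequences, rng):
    """Return one bootstrap replicate of an alignment.

    `sequences` maps species names to equal-length strings. Sites (columns)
    are drawn with replacement, and the same draw is used for every species.
    """
    lengths = {len(seq) for seq in sequences.values()}
    if len(lengths) != 1:
        raise ValueError("all sequences must have the same length")
    n_sites = lengths.pop()
    # One random column index per site of the replicate.
    columns = [rng.randrange(n_sites) for _ in range(n_sites)]
    return {name: "".join(seq[i] for i in columns)
            for name, seq in sequences.items()}

# Example call with a fixed seed so that the replicate is reproducible:
#   replicate = bootstrap_alignment({"A": "ACGT", "B": "AGGT"}, random.Random(1))
```

Each replicate would then be analyzed like an original data set, and the
resulting trees summarized with a consensus tree program such as CONSENSE.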
522	<P>
523	<A NAME="running"><HR><P></A>
524	<DIV ALIGN="CENTER">
525	<H2>Running the Programs</H2></DIV>
526	<P>
527	This section assumes that you have obtained PHYLIP as compiled executables
528	(for Windows, Macintosh, or DOS), or have obtained the source code
529	and compiled it yourself (for Linux, Unix, or OpenVMS). For machines for
530	which compiled executables are available, there will usually be no need for
531	you to have a compiler or compile the programs yourself. This section
532	describes how to run the programs. Later in this document we will
533	discuss how to download and install PHYLIP (in case you are somehow
534	reading this without yet having done that). Normally you will only read
535	this document after downloading and installing PHYLIP.
536	<P>
537	<H3>A word about input files.</H3>
538	<P>
539	For all of these types of machines, it is
540	important to have the input files for the programs (typically data files)
541	prepared in advance. They can be prepared in any editor, but it is important
542	that they be saved in Text Only ("flat ASCII") format, not in the format that
543	word processors such as Microsoft Word want to write. It is up to you to read
544	the PHYLIP documentation files which describe the file formats that are
545	needed. There is a partial description in the next section of this document.
546	The input files can also be obtained by running a program that
547	produces output files in PHYLIP format (some of the PHYLIP programs do, and so do
548	programs by others, including sequence alignment programs such as ClustalW and
549	sequence format conversion programs such as Readseq). There is <I>not</I> any
550	input file editor available in any program in PHYLIP (you should <I>not</I>
551	simply start running one of the programs and then expect to click a mouse
552	somewhere to start creating a data file).
553	<P>
554	When they start running, the programs look first for input files with
555	particular names (such as <TT>infile</TT>, <TT>treefile</TT>, <TT>intree</TT>, or <TT>fontfile</TT>).
556	Exactly which file names they look for varies a bit from program to program,
557	and you should read the documentation file for the particular program to
558	find out. If you have files with those names the programs will use them
559	and not ask you for the file name. If they do not find files of those
560	names, the programs will say that they cannot find a file of that name, and
561	ask you to type in the file name.
562	For example, if DNAML looks
563	for the file <TT>infile</TT> and does not find one of that name,
564	it prints the message:
565	<P>
566	<TABLE><TR><TD BGCOLOR=white>
567	<TT>dnaml: can't find input file "infile"<BR>
568	Please enter a new file name></TT>
569	</TD></TR></TABLE>
570	<P><I>This does not mean that an error
571	has occurred.</I> All you need to do is to type in the name of the file.
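<P>
The lookup behavior just described can be sketched as follows. This is a
minimal Python illustration, not the code the programs actually use: the
function name <TT>resolve_input_file</TT>, its parameters, and the assumption
that the program keeps asking until an existing file is named are all
hypothetical.

```python
import os

def resolve_input_file(default_name, ask, directory="."):
    """Return the path of an existing input file.

    Tries `default_name` first. If no such file exists, a message is printed
    and `ask` (a function taking a prompt and returning a string, such as
    the built-in input) is called until an existing file is named.
    """
    name = default_name
    while not os.path.isfile(os.path.join(directory, name)):
        print(f'can\'t find input file "{name}"')
        name = ask("Please enter a new file name> ")
    return os.path.join(directory, name)

# Interactive use would look like:
#   path = resolve_input_file("infile", input)
```

Because the typed name is joined to the directory, a relative path such as
<TT>../myfile.dna</TT> works as a response too.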
572	<P>
573	The program looks for the input files in the same directory that the
574	program is in (a directory is the same thing as a "folder"). In Windows, Linux, Unix, or MSDOS, if you are asked for the
575	file name you can type in the path to the file, as part of the name (thus,
576	if the file is in the directory above the current one, you can type in
577	a file name such as <TT>../myfile.dna</TT>). If you do not know what a
578	"directory" is, or what "above" means, then you are a member of the new
579	generation who just clicks the mouse and assumes that a list of file names
580	will magically appear. (Typically members of this generation have no idea
581	where the files are on their system, and accumulate enormous amounts of
582	unnecessary clutter in their file systems.) In this case you should ask
583	someone to explain directories to you.
584	<P>
585	<H3>Running the programs on a Windows machine.</H3>
586	<P>
587	Double-click on the icon for
588	the program. A window should open with a menu in it. Further dialog with the
589	program occurs
590	by typing on the keyboard in response to what you see in the window. The
591	programs can be interrupted either by typing Control-C (which means to
592	press down on the <TT>Ctrl</TT> key while typing the letter <TT>C</TT>), or by using
593	the mouse to open the <TT>File</TT> menu in the upper-left corner of the program's
594	window area and then select <TT>Quit</TT>. Other than this, most PHYLIP programs
595	make no use of the mouse. The tree-drawing programs Drawtree and Drawgram
596	do allow use of the mouse to select some options.
597	<P>
598	<H3>Running the programs on a Macintosh.</H3>
599	<P>
600	Double-click on the icon for
601	the program. A window should open. Further dialog with the program occurs
602	by typing on the keyboard in response to what you see in the window. The
603	programs can be interrupted by using
604	the mouse to open the <TT>File</TT> menu in the upper-left corner of the program's
605	window area and then select <TT>Quit</TT>. Alternatively, you can use the
606	Command-Q key combination.
607	<P>
608	When you use Quit, the program will ask you whether you want to save
609	a file whose name is the program name (often followed by <TT>.out</TT>). For
610	example, if you are using DNAML it will ask you if you want to save file
611	<TT>Dnaml.out</TT>. This file is simply a record of everything that was
612	displayed in the program window, and you usually will not want to save it.
613	Pressing the <TT>Enter</TT> key or selecting the Do Not Save button with
614	the mouse will keep this from being saved.
615	<P>
616	If you encounter memory limitations on a Macintosh, and determine that
617	this is not due to a problem with the format of the input file, as it
618	often will be, you may be able to solve it by raising the limits of the
619	stack and heap sizes of the program. To do this click on the program
620	and then select <TT>Get Info</TT> from the Finder <TT>File</TT> menu.
621	This will open a window which can be made to show the memory limits
622	of the program. These can be changed by selecting them and typing in
623	larger numbers. This may relieve nagging memory problems. If it does
624	not, consult your local documentation and suspect problems with your
625	input file format.
626	<P>
627	<H3>Running the programs on a Unix system.</H3>
628	<P>
629	Type the name of the program
630	in lower-case letters (such as <TT>dnaml</TT>). To interrupt the program while
631	it is running, type Control-C (which means to press down on the <TT>Ctrl</TT> key
632	while typing the letter <TT>C</TT>).
633	<P>
634	<H3>Running the programs in MSDOS.</H3>
635	<P>
636	Type the name of the program
637	in lower-case letters (such as <TT>dnaml</TT>). To interrupt the program while
638	it is running, type Control-C (which means to press down on the <TT>Ctrl</TT> key
639	while typing the letter <TT>C</TT>).
640	<P>
641	<H3>Running the programs in background or under control of a command file</H3>
642	<P>
643	In running the programs, you may sometimes want to put them in background
644	so you can proceed with other work. On systems with a windowing environment
645	they can be put in their own window, and commands like the Unix and Linux
646	<TT>nice</TT> command used to make
647	them have lower priority so that they do not interfere with interactive
648	applications in other windows. This part of the discussion will
649	assume either a Windows system or a Unix or Linux system. I will
650	note when the commands work on one of these systems but not the other.
651	Running jobs in background on Macintosh systems is an arcane art into whose
652	mysteries I have not been initiated (or perhaps no one has been initiated).
653	<P>
654	If there is no windowing
655	environment, on a Unix or Linux system you will want to use an
656	ampersand (<TT>&</TT>) after the command file name when invoking it to put the
657	job in the background. You will have to put all the responses to the
658	interactive menu of the program into a file and tell the background job
659	to take its input from that file.
On Windows systems there is no <TT>&</TT> or <TT>nice</TT> command,
but input and output redirection and command files work fine, with the sole
difference that a file of commands must have a name ending in
<TT>.BAT</TT>, such as <TT>FOOFILE.BAT</TT>.
664	<P>
For example: suppose you want to run DNAPARS in background, taking its
input data from a file called <TT>sequences.dat</TT>, putting its interactive
output into a file called <TT>screenout</TT>, and using a file called <TT>input</TT> as
the place to store the interactive input. The file <TT>input</TT> need only
contain two lines:
670	<P>
671	<TABLE><TR><TD bgcolor=white>
672	<PRE>
673	sequences.dat
674	Y
675	</PRE>
676	</TD></TR></TABLE>
677	<P>
which is what you would have typed to run the program interactively, in
response to the program's request for an input file name if it did not
find a file named <TT>infile</TT>, and in response to the menu.
681	<P>
682	To run the program in background, in Unix or Linux you would simply give the command:
683	<P>
684	<TT>dnapars < input > screenout &
685	</TT>
686	<P>
This runs the program with input responses coming from <TT>input</TT> and
interactive output being put into the file <TT>screenout</TT>. The usual output
file and tree file will also be created by this run. Keep that in mind:
if you run any other PHYLIP program from the same directory while
this one is running in background, you may overwrite the output file from
one program with that from the other!
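<P>
The same run can be started from a script. What follows is only a sketch in
Python, which is not part of PHYLIP: the function names
<TT>response_text</TT>, <TT>run_in_background</TT> and <TT>start_dnapars</TT>
are made up for this illustration, and it assumes that an executable called
<TT>dnapars</TT> can be found on the search path.
<P>
```python
import subprocess


def response_text(responses):
    """Return the keyboard responses as file contents, one response per line."""
    return "".join(response + "\n" for response in responses)


def run_in_background(program, response_file, screen_file):
    """Start `program < response_file > screen_file` and return at once.

    The returned Popen object can be polled, or waited on with .wait().
    """
    with open(response_file) as stdin, open(screen_file, "w") as stdout:
        # The child process gets its own copies of the two open files,
        # so ours can be closed as soon as it has been started.
        return subprocess.Popen([program], stdin=stdin, stdout=stdout)


def start_dnapars():
    """The example from the text: sequences.dat, then Y to accept the menu."""
    with open("input", "w") as handle:
        handle.write(response_text(["sequences.dat", "Y"]))
    # Assumption: an executable named dnapars is on the search path.
    return run_in_background("dnapars", "input", "screenout")
```
<P>
Calling <TT>start_dnapars()</TT> returns immediately; the warning about
two programs overwriting each other's output files applies here just as it
does to the shell command.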
693	<P>
If you wanted to give the program lower priority, so that it would
not interfere with other work, and you have Berkeley Unix type job control
facilities in your Unix or Linux (and you usually do), you can use the
<TT>nice</TT> command (shown here in its C shell form; Bourne-type shells such
as <TT>bash</TT> write the increment as <TT>nice -n 10</TT>):
698	<P>
699	<TT>nice +10 dnapars < input > screenout &
700	</TT>
701	<P>
702	which lowers the priority of the run. To also time the run and put the
703	timing at the end of <TT>screenout</TT>, you can do this:
704	<P>
705	<TT>nice +10 ( time dnapars < input ) >& screenout &
706	</TT>
707	<P>
708	which I will not attempt to explain.
709	<P>
710	On Unix or Linux systems
711	you may also want to explore putting the interactive output into the
712	null file <TT>/dev/null</TT> so as to not be bothered with it (but then you
713	cannot look at it to see why something went wrong). If you have problems
714	with creating output files that are too large, you may want to
715	explore carefully the turning off of options in the programs you run.
716	<P>
717	If you are doing several runs in one, as for example when you do a
718	bootstrap analysis using SEQBOOT, DNAPARS (say), and CONSENSE, you
719	can use an editor to create a "command file" with these commands:
720	<P>
721	<TABLE><TR><TD bgcolor=white>
722	<PRE>
723	seqboot < input1 > screenout
724	mv outfile infile
725	dnapars < input2 >> screenout
726	mv outtree intree
727	consense < input3 >> screenout
728	</PRE>
729	</TD></TR></TABLE>
730	<P>
This is the Unix or Linux version -- in the MSDOS version, the renaming
of files and the appending of output to the file <TT>screenout</TT> are
handled differently.
734	<P>
735	On Unix or Linux the command file might be named something like
736	<TT>foofile</TT>, and on Windows systems might be named <TT>foofile.bat</TT>.
737	<P>
738	On Unix or Linux the command file must be given
739	execute permission by using the command <TT>chmod +x foofile</TT> followed
740	by the command <TT>rehash</TT>. The job that <TT>foofile</TT> describes
741	can be run in background on Unix or Linux by giving the command
742	<P>
743	<TT>foofile &</TT>
744	<P>
745	On Windows systems it can be run by
746	clicking on the icon of the command file. Its icon will have a little gear
747	symbol.
748	<P>
749	Note that you must also have the interactive input
750	commands for SEQBOOT (including the random number seed), DNAPARS, and
751	CONSENSE in the separate files <TT>input1</TT>, <TT>input2</TT>, and <TT>input3</TT>.
Note that when PHYLIP programs attempt to open a new output file (such as
<TT>outfile</TT>, <TT>outtree</TT>, or <TT>plotfile</TT>), if they see
754	a file of that name already in existence they will ask you if you want to
755	overwrite it, and offer alternatives including writing to another file,
756	appending information to that file, or quitting the program without writing to
757	the file. This means that in writing batch files it is important to know
758	whether there will be a prompt of this sort. You must know in advance
759	whether the file will exist. You may want to put in your batch file a
760	command that tests for the existence of a pre-existing output file and
761	if so, removes it. You might even want to put in a command that creates a
762	file of that name, so that you can be sure it is there! Either way,
763	you will then know whether to put into your file of keyboard responses the
764	proper response to the inquiry about overwriting that output file.
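<P>
One way to know in advance that no overwrite prompt will appear is to remove
leftover output files before each run. The following Python sketch shows the
idea; the function name <TT>clear_old_outputs</TT> is hypothetical, and the
list of file names is simply the three default names mentioned above.
<P>
```python
import os

# Default output file names mentioned in the text.
OUTPUT_NAMES = ("outfile", "outtree", "plotfile")


def clear_old_outputs(directory=".", names=OUTPUT_NAMES):
    """Delete leftover output files so that no overwrite prompt will appear.

    Returns the names of the files that were actually removed.
    """
    removed = []
    for name in names:
        path = os.path.join(directory, name)
        if os.path.isfile(path):
            os.remove(path)
            removed.append(name)
    return removed
```
<P>
With the old files gone, the file of keyboard responses needs no answer to
the overwrite question. If an earlier step in the same batch job is expected
to leave an <TT>outfile</TT> behind on purpose, it should of course be renamed
before this is called.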
765	<P>
766	<A NAME="inputfiles"><HR><P></A>
767	<DIV ALIGN="CENTER">
768	<H2>Preparing Input Files</H2></DIV>
769	<P>
770	The input files for PHYLIP programs must be prepared separately - there is
771	no data editor within PHYLIP. You can use a word processor (or text
772	editor) to prepare them yourself, or you can use a program that produces
773	a PHYLIP-format output. Sequence alignment programs such as ClustalW
774	commonly have an option to produce PHYLIP files as output, and some
775	other phylogeny programs, such as MacClade and TreeView, are capable of
776	producing a PHYLIP-format file.
777	<P>
778	The format of the input files is discussed below, and you should also
779	read the other PHYLIP documentation relevant to the particular type of
780	data that you are using, and the particular programs you want to run, as
781	there will be more details there.
782	<P>
783	It is very important that the input files be in "Text Only" or "flat
784	ASCII" format. This means that they contain only printable ASCII/ISO
785	characters, and not any unprintable characters. Many word processors such
786	as Microsoft Word save their files in a format that contains unprintable
787	characters, unless you tell them not to. For Microsoft Word you can
788	select <TT>Save As</TT> from its <TT>File</TT> menu, and choose <TT>Text Only</TT>
as the file format. This can also be done in the WordPad utility in Windows.
790	Other word processors will have equivalent
791	options. Text editors such as the <TT>vi</TT> and <TT>emacs</TT> editors on
792	Unix and Linux, Windows Notepad, the <TT>SimpleText</TT> editor in MacOS, or the <TT>pico</TT>
793	editor that comes with the <TT>pine</TT>
794	mailer program, produce their files in Text Only format and should not
795	cause any trouble.
796	<P>
797	<H3>Input and output files</H3>
798	<P>
799	For most of the PHYLIP programs, information comes from a series of
800	input files, and ends up in a series of output files:
801	<P>
802	<DIV ALIGN="CENTER">
803	<TABLE>
804	<TR><TD>
805	<PRE>
                   -------------------
                  |                   |
infile ---------> |                   |
                  |                   |
intree ---------> |                   | -----------> outfile
                  |                   |
weights --------> |      program      | -----------> outtree
                  |                   |
categories -----> |                   | -----------> plotfile
                  |                   |
fontfile -------> |                   |
                  |                   |
                   -------------------
819	</PRE>
820	</TD></TR>
821	</TABLE>
822	</DIV><P></P>
823
824	<P>
825	The programs interact with the user by presenting a menu. Aside from the
826	user's choices from the menu, they read
827	all other input from files. These files have default names. The program
828	will try to find a file of that name - if it does not, it will ask the
829	user to supply the name of that file.
830	Input data such as DNA sequences
831	comes from a file whose default name is <TT>infile</TT>. If the user
832	supplies a tree, this is in a file whose default name is <TT>intree</TT>.
833	Values of weights for the characters are in <TT>weights</TT>, and the
tree plotting programs need some digitized fonts which are supplied in
835	<TT>fontfile</TT> (all these are default names).
836	<P>
For example, if DNAML looks
838	for the file <TT>infile</TT> and does not find one of that name,
839	it prints the message:
840	<P>
841	<TABLE><TR><TD BGCOLOR=white>
842	<TT>dnaml: can't find input file "infile"<BR>
843	Please enter a new file name></TT>
844	</TD></TR></TABLE>
845	<P>
846	This simply means that it wants you to type in the name of the
847	input file.
848	<P>
Two programs in the package work differently, according to an older ("Old
Style") system. These are <TT>CLIQUE</TT> and <TT>FACTOR</TT>. The information on ancestral
851	states is supplied in the data file whose
852	default name is <TT>infile</TT>, and for <TT>FACTOR</TT> the Factors
853	information is written into the output file rather than being put into a
854	separate file called <TT>factors</TT>. See the <A HREF="clique.html">documentation
855	page for <TT>CLIQUE</TT></A>
856	and the <A HREF="factor.html">documentation page for FACTOR</A>
857	for information on these differences. By the time of the final 3.6
858	release we hope to have these last Old Style programs converted to the new
859	system.
860	<P>
861	<H3>Data file format</H3>
862	<P>
863	I have tried to adhere to a rather stereotyped input and output
864	format. For the parsimony, compatibility and maximum likelihood programs,
865	excluding the distance matrix methods, the simplest version of the input
866	data file looks something like this:
867	<P>
868	<TABLE><TR><TD BGCOLOR=white>
869	<PRE>
870	6 13
871	Archaeopt CGATGCTTAC CGC
872	HesperorniCGTTACTCGT TGT
873	BaluchitheTAATGTTAAT TGT
874	B. virginiTAATGTTCGT TGT
875	BrontosaurCAAAACCCAT CAT
876	B.subtilisGGCAGCCAAT CAC
</PRE>
</TD></TR></TABLE>
879	<P>
880	The first line of the input file contains the number of species and the
881	number of characters (in this case sites). These are in free format, separated
882	by blanks. The information for each species follows, starting with a
883	ten-character species name (which can include blanks and some punctuation
884	marks), and continuing with the characters for that species. The name should
885	be on the same line as the first character of the data for that species.
886	(I will use the term "species" for the tips of the trees, recognizing
887	that in some cases these will actually be populations or individual gene
888	sequences).
889	<P>
890	The name should be ten characters in length, filled out to the full
891	ten characters by blanks if shorter. Any printable ASCII/ISO character is
892	allowed in the name, except for parentheses ("<TT>(</TT>" and "<TT>)</TT>"), square
893	brackets ("<TT>[</TT>" and "<TT>]</TT>"), colon ("<TT>:</TT>"), semicolon ("<TT>;</TT>") and comma ("<TT>,</TT>").
894	If you forget to extend the names to ten characters in length by blanks,
895	the program will get out of synchronization with the contents of the data
896	file, and an error message will result.
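<P>
These rules for names are easy to check before a data file is written. Here
is a small Python sketch of such a check; the function name
<TT>phylip_name</TT> is hypothetical, and refusing names longer than ten
characters (rather than shortening them) is a choice made for this
illustration, not something the programs themselves do.
<P>
```python
# Characters that the text above says may not appear in a species name.
FORBIDDEN = set("()[]:;,")


def phylip_name(name, width=10):
    """Return `name` filled out with blanks to exactly `width` characters."""
    bad = sorted(FORBIDDEN.intersection(name))
    if bad:
        raise ValueError("characters not allowed in a species name: " + " ".join(bad))
    if len(name) > width:
        raise ValueError("name longer than %d characters: %r" % (width, name))
    return name.ljust(width)
```
<P>
For example, <TT>phylip_name("Human")</TT> gives <TT>Human</TT> followed by
five blanks, so that the data for that species starts in the eleventh column.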
897	<P>
898	In the
899	discrete-character programs, DNA sequence programs and protein sequence
900	programs the characters are each a
901	single letter or digit, sometimes separated by blanks. In
902	the continuous-characters programs they are real numbers with decimal points,
903	separated by blanks:
904	<P>
905	<TT>Latimeria 2.03 3.457 100.2 0.0 -3.7</TT>
906	<P>
907	The conventions about continuing the data beyond one line per species are
908	different between the molecular sequence programs and the others. The
909	molecular sequence programs can take the data in "aligned" or "interleaved"
910	format, in which we first have some lines giving the first part of each of the
911	sequences, then some
912	lines giving the next part of each, and so on. Thus the sequences might
913	look like this:
914	<P>
915	<TABLE><TR><TD BGCOLOR=white>
916	<PRE>
917	6 39
918	Archaeopt CGATGCTTAC CGCCGATGCT
919	HesperorniCGTTACTCGT TGTCGTTACT
920	BaluchitheTAATGTTAAT TGTTAATGTT
921	B. virginiTAATGTTCGT TGTTAATGTT
922	BrontosaurCAAAACCCAT CATCAAAACC
923	B.subtilisGGCAGCCAAT CACGGCAGCC
924
925	TACCGCCGAT GCTTACCGC
926	CGTTGTCGTT ACTCGTTGT
927	AATTGTTAAT GTTAATTGT
928	CGTTGTTAAT GTTCGTTGT
929	CATCATCAAA ACCCATCAT
930	AATCACGGCA GCCAATCAC
931	</PRE>
932	</TD></TR></TABLE>
933	<P>
934	Note that in these sequences we have a blank every
935	ten sites to make them easier to read: any such blanks are allowed. The blank
936	line which separates the two groups of lines (the ones
937	containing sites 1-20 and ones containing sites 21-39) may or may not
938	be present, but if it is, it should be a line of zero length and not contain
939	any extra blank
940	characters (this is because of a limitation of the current versions
941	of the programs). It is important that the number of sites in each
942	group be the same for all species (i.e., it will not be possible to run
943	the programs successfully if the first species line contains 20 bases, but
944	the first line for the second species contains 21 bases).
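<P>
As an illustration of the interleaved layout, here is a Python sketch that
writes sequences in that form. The function name
<TT>write_interleaved</TT> and its parameters are made up for this example;
it assumes that all sequences are already aligned to the same length.
<P>
```python
def write_interleaved(records, sites_per_line=20, group=10):
    """Format (name, sequence) pairs as interleaved PHYLIP text."""
    nsites = len(records[0][1])
    if any(len(seq) != nsites for _, seq in records):
        raise ValueError("all sequences must have the same number of sites")
    lines = ["%d %d" % (len(records), nsites)]
    for start in range(0, nsites, sites_per_line):
        if start > 0:
            lines.append("")  # separator of zero length, with no stray blanks
        for name, seq in records:
            chunk = seq[start:start + sites_per_line]
            # A blank every `group` sites, purely to make the file readable.
            spaced = " ".join(chunk[i:i + group] for i in range(0, len(chunk), group))
            # Only the first group of lines carries the ten-character names.
            prefix = name.ljust(10)[:10] if start == 0 else ""
            lines.append(prefix + spaced)
    return "\n".join(lines) + "\n"
```
<P>
Because every species is cut at the same positions, each group of lines
automatically contains the same number of sites for all species.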
945	<P>
946	Alternatively, an option can be selected in the menu to take the data in
947	"sequential" format, with all of the data for the first species,
948	then all of the characters for the next species, and so on. This is also
949	the way that the discrete characters programs and the gene frequencies
950	and quantitative characters programs want to read the data. They do not
951	allow the interleaved format.
952	<P>
In the sequential format, the character data can run on to a new line at any
time (except in the middle of a species name or, in the case of the continuous
character and distance matrix programs, in the middle of a real number).
Thus it is legal to have:
957	<P>
958	<TT>Archaeopt 001100
959	<BR>
960	1101
961	<BR>
962	</TT>
963	<P>
964	or even:
965	<P>
966	<TT>Archaeopt
967	<BR>
968	0011001101
969	<BR>
970	</TT>
971
972	<P>
973	though note that the <I>full</I> ten characters of the species name <I>must</I>
974	then be present: in the above case there must be a blank after the "t". In all
975	cases it is possible to put internal blanks between any of the character
976	values, so that
977	<P>
978	<TT>Archaeopt 0011001101 0111011100
979	</TT>
980	<P>
981	is allowed.
982	<P>
983	Note that you can convert molecular sequence data between the interleaved
984	and the sequential data formats by using the Rewrite option of the D
985	menu item in SEQBOOT.
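<P>
To show how the two layouts relate, here is a Python sketch of the same
conversion done outside of PHYLIP. It is not how SEQBOOT does it; the function
names <TT>read_interleaved</TT> and <TT>to_sequential</TT> are hypothetical,
and the sketch assumes that the first line holds only the two numbers, that
names occupy exactly ten characters, and that blank lines occur only between
groups.
<P>
```python
def read_interleaved(text):
    """Parse interleaved PHYLIP text into a list of (name, sequence) pairs."""
    lines = text.splitlines()
    nspecies, nsites = (int(x) for x in lines[0].split()[:2])
    body = [ln for ln in lines[1:] if ln.strip()]  # drop separator lines
    names, seqs = [], []
    for i, line in enumerate(body):
        if i < nspecies:
            names.append(line[:10])  # fixed-width name field
            seqs.append(line[10:].replace(" ", ""))
        else:
            # Later groups have no names; lines cycle through the species.
            seqs[i % nspecies] += line.replace(" ", "")
    for name, seq in zip(names, seqs):
        if len(seq) != nsites:
            raise ValueError("%s: expected %d sites, found %d"
                             % (name.strip(), nsites, len(seq)))
    return list(zip(names, seqs))


def to_sequential(records):
    """Write (name, sequence) pairs with all data for each species together."""
    lines = ["%d %d" % (len(records), len(records[0][1]))]
    for name, seq in records:
        lines.append(name.ljust(10)[:10] + seq)
    return "\n".join(lines) + "\n"
```
<P>
The check on the number of sites catches the most common accident, a file
that has fallen out of synchronization because a name was not filled out to
ten characters.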
986	<P>
987	If you make an error in the format of the input file, the programs can
988	sometimes detect that
989	they have been fed an illegal character or illegal numerical value and issue
990	an error message such as <TT>BAD CHARACTER STATE:</TT>, often printing out the
991	bad value, and sometimes the number of the species and character in which it
992	occurred. The program will then stop shortly after. One of the things which
993	can lead to a bad value is the omission of something earlier in the file, or
the insertion of something superfluous, which causes the reading of the file to
995	get out of synchronization. The program then starts reading things it
996	didn't expect, and concludes that they are in error. So if you see this error
997	message, you may also want
998	to look for the earlier problem that may have led to the program becoming
999	confused about what it is reading.
1000	<P>
1001	Some options are described below, but you should also read the documentation
1002	for the groups of the programs and for the individual programs.
1003	<BR>
1004	<P>
1005	<A NAME="menu"><HR><P></A>
1006	<H3>The Menu</H3>
1007	<P>
1008	The menu is straightforward. It typically looks like this (this one is for
1009	DNAPARS):
1010	<P>
1011	<TABLE><TR><TD BGCOLOR=white>
1012	<PRE>
DNA parsimony algorithm, version 3.6

Setting for this run:
  U                 Search for best tree?  Yes
  S                        Search option?  More thorough search
  V              Number of trees to save?  100
  J   Randomize input order of sequences?  No. Use input order
  O                        Outgroup root?  No, use as outgroup species 1
  T              Use Threshold parsimony?  No, use ordinary parsimony
  N           Use Transversion parsimony?  No, count all steps
  W                       Sites weighted?  No
  M           Analyze multiple data sets?  No
  I          Input sequences interleaved?  Yes
  0   Terminal type (IBM PC, ANSI, none)?  (none)
  1    Print out the data at start of run  No
  2  Print indications of progress of run  Yes
  3                        Print out tree  Yes
  4          Print out steps in each site  No
  5  Print sequences at all nodes of tree  No
  6       Write out trees onto tree file?  Yes

  Y to accept these or type the letter for one to change
1035	</PRE>
1036	</TD></TR></TABLE>
1037	<P>
1038	If you want to accept the default settings (they are shown in the above case)
1039	you can simply type <TT>Y</TT> followed by pressing on the <TT>Enter</TT> key.
1040	If you want to change any of the options, you should type the letter
1041	shown to the left of its entry in the menu. For example, to set a threshold
1042	type <TT>T</TT>. Lower-case letters will also work. For many of the options
1043	the program will ask for supplementary information, such as the value of
1044	the threshold.
1045	<P>
1046	Note the <TT>Terminal type</TT> entry, which you will find on all menus. It
1047	allows you to specify which type of terminal your screen is. The options
1048	are an IBM PC screen, an ANSI standard terminal, or <TT>none</TT>.
1049	Choosing zero (<TT>0</TT>) toggles
1050	among these three options in cyclical order, changing each time the <TT>0</TT>
1051	option is chosen. If one of them is right for your terminal the screen will be
1052	cleared before the menu is displayed. If none works, the <TT>none</TT> option
1053	should probably be chosen. The programs should start with a terminal option
1054	appropriate for your computer, but if they do not, you can change the
1055	terminal type manually. This is particularly important in program RETREE
1056	where a tree is displayed on the screen - if the terminal type is set to the
1057	wrong value, the tree can look very strange.
1058	<P>
1059	The other numbered options control which information the program will
1060	display on your screen or on the output files. The option to <TT>Print
1061	indications of progress of run</TT> will show information such as the names of
1062	the species as they are successively added to the tree, and the
1063	progress of rearrangements. You will usually want to see these as
1064	reassurance that the program is running and to help you estimate how long
1065	it will take. But if you are running the program "in background" as can be
1066	done on multitasking and multiuser systems, and do not have the
1067	program running in its own window, you may want to turn this option off so
1068	that it does not disturb your use of the computer while the program is
1069	running.
1070	<P>
1071	<A NAME="outputfile"><HR><P></A>
1072	<H2>The Output File</H2>
1073	<BR>
1074	<P>
1075	Most of the programs write their output onto a file called (usually) <TT>outfile</TT>, and a representation of the trees found onto a file called
1076	<TT>outtree</TT>.
1077	<P>
1078	The exact contents of the output file vary from program to program and also
1079	depend on which menu options you have selected. For many programs, if you
1080	select all possible output information, the output will consist of
1081	(1) the name of the program and its
1082	version number, (2) some of the input information printed out, and (3) a series of
1083	phylogenies, some with associated information indicating how much change
1084	there was in each character or on each part of the tree. A typical rooted tree
1085	looks like this:
1086	<P>
1087	<TABLE><TR><TD BGCOLOR=white>
1088	<PRE>
                                     +-------------------Gibbon
        +----------------------------2
        !                            !      +------------------Orang
        !                            +------4
        !                                   !  +---------Gorilla
  +-----3                                   +--6
  !     !                                      !    +---------Chimp
  !     !                                      +----5
--1     !                                           +-----Human
  !     !
  !     +-----------------------------------------------Mouse
  !
  +------------------------------------------------Bovine
1102	</PRE>
1103	</TD></TR></TABLE>
1104	<P>
1105	The interpretation of the tree is fairly straightforward: it "grows"
1106	from left to right. The numbers at the forks are arbitrary and are used (if
1107	present) merely to identify the forks. For many of the programs the tree
1108	produced is unrooted. Rooted and unrooted trees are printed in nearly the
1109	same form, but the unrooted ones are accompanied by the
1110	warning message:
1111	<P>
1112	<TT> remember: this is an unrooted tree!
1113	</TT>
1114	<P>
1115	to indicate that this is an unrooted tree and to warn against
taking the position of its root too seriously. (Mathematicians still call
1117	an unrooted tree a tree, though some systematists unfortunately use the term
1118	"network" for an unrooted tree. This conflicts with standard mathematical
1119	usage, which reserves the name "network" for a completely different kind of
1120	graph). The root of this tree could be anywhere, say on the line leading
1121	immediately to <TT>Mouse</TT>. As an exercise,
1122	see if you can tell whether the following tree is or is not a different
1123	one from the above:
1124	<P>
1125	<TABLE><TR><TD BGCOLOR=white>
1126	<PRE>
             +-----------------------------------------------Mouse
             !
   +---------4                                   +------------------Orang
   !         !                            +------3
   !         !                            !      !       +---------Chimp
---6         +----------------------------1      !  +----2
   !                                      !      +--5    +-----Human
   !                                      !         !
   !                                      !         +---------Gorilla
   !                                      !
   !                                      +-------------------Gibbon
   !
   +-------------------------------------------Bovine

  remember: this is an unrooted tree!
1142	</PRE>
1143	</TD></TR></TABLE>
1144	<P>
1145	(it is <I>not</I> different). It is <I>important</I> also to realize that the
1146	lengths of the segments of the printed tree may not be significant: some
1147	may actually represent branches of zero length, in the sense that there is no
1148	evidence that
1149	those branches are nonzero in length. Some of the diagrams of trees attempt
1150	to print branches approximately proportional to estimated
1151	branch lengths, while in others the lengths are purely conventional and
1152	are presented just to make the topology visible. You will have to look closely
1153	at the documentation that accompanies each program to see what it presents
1154	and what is known about the lengths of the branches on the tree. The above
1155	tree attempts to represent branch lengths approximately in the diagram. But
1156	even in those cases, some of the smaller branches are likely to be
1157	artificially lengthened to make the tree topology clearer. Here is what
1158	a tree from DNAPARS looks like, when no attempt is made to make the
1159	lengths of branches in the diagram proportional to estimated branch
1160	lengths:
1161	<P>
1162	<TABLE><TR><TD BGCOLOR=white>
1163	<PRE>
                 +--Human
              +--5
           +--4  +--Chimp
           !  !
        +--3  +-----Gorilla
        !  !
     +--2  +--------Orang
     !  !
  +--1  +-----------Gibbon
  !  !
--6  +--------------Mouse
  !
  +-----------------Bovine

  remember: this is an unrooted tree!
1179	</PRE>
1180	</TD></TR></TABLE>
1181	<P>
1182	When a tree has branch lengths, it will be accompanied by a table showing
1183	for each branch the numbers (or names) of the nodes at each end of the
1184	branch, and the length of that branch. For the first tree shown above,
1185	the corresponding table is:
1186	<P>
1187	<TABLE><TR><TD BGCOLOR=white>
1188	<PRE>
 Between       And            Length      Approx. Confidence Limits
 -------       ---            ------      ------- ---------- ------

   1           Bovine         0.90216     (  0.50346,   1.30086) **
   1           Mouse          0.79240     (  0.42191,   1.16297) **
   1           2              0.48553     (  0.16602,   0.80496) **
   2           3              0.12113     (     zero,   0.24676) *
   3           4              0.04895     (     zero,   0.12668)
   4           5              0.07459     (  0.00735,   0.14180) **
   5           Human          0.10563     (  0.04234,   0.16889) **
   5           Chimp          0.17158     (  0.09765,   0.24553) **
   4           Gorilla        0.15266     (  0.07468,   0.23069) **
   3           Orang          0.30368     (  0.18735,   0.41999) **
   2           Gibbon         0.33636     (  0.19264,   0.48009) **

     *  = significantly positive, P < 0.05
     ** = significantly positive, P < 0.01
1206	</PRE>
1207	</TD></TR></TABLE>
1208	<P>
1209	Ignoring the asterisks and the approximate confidence limits, which will be
1210	described in the documentation file for DNAML, we can see that the table
1211	gives a more precise idea of what the lengths of all the branches are.
1212	Similar tables exist in distance matrix and likelihood programs, as well
1213	as in the parsimony programs DNAPARS and PARS.
1214	<P>
1215	Some of the parsimony programs in the package can print out a table
1216	of the number of steps that different characters (or sites) require on
1217	the tree. This table may not be obvious at first. A typical example looks like
1218	this:
1219	<P>
1220	<TABLE><TR><TD BGCOLOR=white>
1221	<PRE>
 steps in each site:
         0   1   2   3   4   5   6   7   8   9
     *-----------------------------------------
    0!       2   2   2   2   1   1   2   2   1
   10!   1   2   3   1   1   1   1   1   1   2
   20!   1   2   2   1   2   2   1   1   1   2
   30!   1   2   1   1   1   2   1   3   1   1
   40!   1
1230	</PRE>
1231	</TD></TR></TABLE>
1232	<P>
1233	The numbers across the top and down the side indicate which site
1234	is being referred to. Thus site 23 is column "3" of row "20"
1235	and has 1 step in this case.
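<P>
The way the table is read can be spelled out in a few lines of Python. This
is only a sketch: the function name <TT>parse_steps_table</TT> is
hypothetical, and it assumes that the only missing entry is the one under
column "0" of row "0" (there being no site 0), so that a row shorter than ten
values can occur only at the start or the end.
<P>
```python
def parse_steps_table(text):
    """Return {site_number: steps} from a 'steps in each site' table."""
    steps = {}
    for line in text.splitlines():
        if "!" not in line:
            continue  # heading, column labels, or the ruled line
        label, _, rest = line.partition("!")
        base = int(label)
        values = [int(v) for v in rest.split()]
        # Row 0 has no entry under column 0, since sites are numbered from 1.
        first = 1 if base == 0 else base
        for offset, value in enumerate(values):
            steps[first + offset] = value
    return steps
```
<P>
Applied to the table above, the entry for site 23 comes from row "20",
offset 3, just as described.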
1236	<P>
1237	There are many other kinds of information that can appear in the
output file. They vary from program to program, and we leave their
1239	description to the documentation files for the specific programs.
1240	<P>
1241	<A NAME="treefile"><HR><P></A>
1242	<H2>The Tree File</H2>
1243	<P>
1244	In output from most programs,
1245	a representation of the tree is also written into the tree file
1246	<TT>outtree</TT>. The tree is specified by nested pairs
1247	of parentheses, enclosing
1248	names and separated by commas. We will describe how this works
1249	below. If there are any blanks in the names,
1250	these must be replaced by the underscore character "<TT>_</TT>". Trailing blanks
1251	in the name may be omitted. The pattern of the parentheses indicates
1252	the pattern of the tree by having each pair of parentheses enclose all
1253	the members of a monophyletic group. The tree file could look like this:
1254	<P>
1255	<TT>((Mouse,Bovine),(Gibbon,(Orang,(Gorilla,(Chimp,Human)))));
1256	</TT>
1257	<P>
1258	In this tree the first fork separates the lineage leading to
1259	<TT>Mouse</TT> and <TT>Bovine</TT> from the lineage leading to the rest. Within the
1260	latter group there is a fork separating <TT>Gibbon</TT> from the rest, and so on.
1261	The entire tree is enclosed in an outermost pair of parentheses. The tree ends
1262	with a semicolon. In some programs such as DNAML, FITCH, and CONTML,
the tree will be unrooted. An unrooted tree should have a
three-way split at its bottommost fork,
with three groups separated by two commas:
1266	<P>
1267	<TT>(A,(B,(C,D)),(E,F));
1268	</TT>
1269	<P>
1270	Here the three groups at the bottom node are <TT>A</TT>, <TT>(B,C,D)</TT>, and
1271	<TT>(E,F)</TT>. The single three-way split corresponds to one of the interior
1272	nodes of the unrooted tree (it can be any interior node of the tree). The
1273	remaining forks are encountered as you move out from that first node.
Some of the newer programs are able to tolerate these other forks being
multifurcations (multi-way splits).
You should check the documentation files
for the particular programs you are using to see which of these forms
you can expect the user tree to be in. Note that many of the programs
1279	that actually estimate an unrooted tree (such as DNAPARS) produce trees in the
1280	treefile in rooted form! This is done for reasons of arbitrary internal bookkeeping. The placement of the root is arbitrary. We are working toward
1281	having all programs be able to read all trees, whether rooted or unrooted,
1282	multifurcating or bifurcating, and having them do the right thing with
1283	them. But this is a long-term goal and it is not yet achieved.
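<P>
Whether a tree file holds a tree in rooted form or in unrooted form can be
seen from the number of groups at its bottommost fork: two for a rooted
bifurcating tree, three for the unrooted form described above. Here is a
Python sketch that splits the outermost pair of parentheses into its groups.
The function name <TT>top_level_groups</TT> is hypothetical, and the sketch
assumes that nothing follows the outermost closing parenthesis except
the semicolon.
<P>
```python
def top_level_groups(tree):
    """Split the outermost pair of parentheses of a tree into its groups."""
    text = "".join(tree.split()).rstrip(";")
    if not (text.startswith("(") and text.endswith(")")):
        raise ValueError("tree must be enclosed in an outermost pair of parentheses")
    groups, depth, current = [], 0, []
    for ch in text[1:-1]:
        if ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
        if ch == "," and depth == 0:
            # A comma outside all inner parentheses separates two groups.
            groups.append("".join(current))
            current = []
        else:
            current.append(ch)
    groups.append("".join(current))
    return groups
```
<P>
For <TT>(A,(B,(C,D)),(E,F));</TT> this gives three groups, and for the
Mouse and Bovine tree shown earlier it gives two.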
1284	<P>
1285	For programs that infer branch lengths, these are given in the trees in the
1286	tree file as real numbers following a colon, and placed immediately
1287	after the group descended from that branch. Here is a typical tree
1288	with branch lengths:
1289	<P>
1290	<TT>((cat:47.14069,(weasel:18.87953,((dog:25.46154,(raccoon:19.19959,<BR>
1291	bear:6.80041):0.84600):3.87382,(sea_lion:11.99700,<BR>
1292	seal:12.00300):7.52973):2.09461):20.59201):25.0,monkey:75.85931);
1293	</TT>
1294	<P>
1295	Note that the tree may continue to a new line at any time except in the
1296	middle of a name or the middle of a branch length, although in trees
1297	written to the tree file this will only be done after a comma.
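<P>
A tree in this notation can be read by a short recursive routine. The Python
sketch below is an illustration only, not the code used in the PHYLIP
programs; the names <TT>parse_newick</TT> and <TT>leaf_names</TT> are
hypothetical. It handles names, branch lengths and line breaks, and turns
underscores in names back into blanks, but it makes no attempt at the other
features of the full standard.
<P>
```python
def parse_newick(text):
    """Parse one tree into nested dicts with 'name', 'length', 'children'."""
    s = "".join(text.split())  # line breaks and blanks carry no meaning
    pos = 0

    def peek():
        return s[pos] if pos < len(s) else ""

    def token():
        """Read characters up to the next punctuation mark."""
        nonlocal pos
        start = pos
        while pos < len(s) and s[pos] not in "(),:;":
            pos += 1
        return s[start:pos]

    def node():
        nonlocal pos
        children = []
        if peek() == "(":
            pos += 1  # consume "("
            children.append(node())
            while peek() == ",":
                pos += 1
                children.append(node())
            if peek() != ")":
                raise ValueError("expected ')' at position %d" % pos)
            pos += 1
        name = token().replace("_", " ")  # underscores stand for blanks
        length = None
        if peek() == ":":
            pos += 1
            length = float(token())
        return {"name": name, "length": length, "children": children}

    root = node()
    if peek() != ";":
        raise ValueError("tree must end with a semicolon")
    return root


def leaf_names(tree):
    """List the names at the tips, in the order they appear in the file."""
    if not tree["children"]:
        return [tree["name"]]
    return [name for child in tree["children"] for name in leaf_names(child)]
```
<P>
Each group's branch length ends up in the <TT>length</TT> entry of the node
for that group, which is the sense in which the number "follows" the group
in the file.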
1298	<P>
1299	These representations of trees are a subset of the standard adopted
1300	on 24 June 1986 at the annual meetings of the Society for the Study of
1301	Evolution by an informal committee (its final session in Newick's
1302	lobster restaurant - hence its name, the Newick standard)
1303	consisting of Wayne Maddison (author of MacClade), David Swofford (PAUP),
1304	F. James Rohlf (NTSYS-PC), Chris Meacham (COMPROB and the original
1305	PHYLIP tree drawing programs), James Archie,
1306	William H.E. Day, and me. This standard is a generalization of
1307	PHYLIP's format, itself based on a well-known representation of trees in
1308	terms of parenthesis patterns which is due to the famous mathematician
1309	Arthur Cayley, and which has been around for over a century. The
standard is now employed by most phylogeny computer programs but unfortunately
has yet to be given a formal published description. Other
1312	descriptions by me and by Gary Olsen can be accessed using the Web at:
1313	<P>
1314	<DIV ALIGN="CENTER">
1315	<FONT SIZE=+2><A HREF="http://evolution.gs.washington.edu/phylip/newicktree.html">
1316	<TT>http://evolution.gs.washington.edu/phylip/newicktree.html</TT></A></FONT>
1317	</DIV>
1318	<P>
1319	<A NAME="options"><HR><P></A>
1320	<H2>The Options and How To Invoke Them</H2>
1321	<P>
1322	Most of the programs allow various options that alter the amount of
1323	information the program is provided or what is done with the
1324	information. Options are selected in the menu.
1325	<P>
1326	<H3>Common options in the menu</H3>
1327	<P>
1328	A number of the options from the menu, the <TT>U</TT> (User tree), <TT>G</TT> (Global),
1329	<TT>J</TT> (Jumble), <TT>O</TT> (Outgroup), <TT>W</TT> (Weights),
1330	<TT>T</TT> (Threshold), <TT>M</TT> (multiple data sets), and the tree output options, are used
1331	so widely that it is best to discuss them in this document.
1332	<P>
1333	<B>The <TT>U</TT> (User tree) option.</B> This option toggles between the default
1334	setting, which allows the program to search for the best tree, and the
1335	User tree setting, which reads a tree or trees ("user trees") from the input
1336	tree file and evaluates them. The input tree file's
1337	default name is <TT>intree</TT>. In a few cases the trees should
1338	be preceded by a line giving the number of trees:
1339	<P>
1340	<TABLE><TR><TD BGCOLOR=white>
1341	<PRE>
1342	3
1343	((Alligator,Bear),((Cow,(Dog,Elephant)),Ferret));
1344	((Alligator,Bear),(((Cow,Dog),Elephant),Ferret));
1345	((Alligator,Bear),((Cow,Dog),(Elephant,Ferret)));
1346	</PRE>
1347	</TD></TR></TABLE>
1348	<P>
1349	while in most cases the initial line with the number of trees is not
1350	required. This is an inconsistency in the programs that we are intending
1351	to eliminate soon. Some programs require rooted trees, some unrooted
1352	trees, and some can handle multifurcating trees. You should read
1353	the documentation for the particular program to find out which it
1354	requires. Program RETREE can be used to convert trees among
1355	these forms (on saving a tree from RETREE, you are asked whether
1356	you want it to be rooted or unrooted).
1357	<P>
1358	In using the user tree option, check the pattern of parentheses
1359	carefully. The programs do not always detect
1360	whether the tree makes sense, and if it does not there will probably be
1361	a crash (hopefully, but not inevitably, with an error message indicating
1362	the nature of the problem). Trees written out by programs are
1363	typically in the proper form.
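<P>
A quick check of the parenthesis pattern can be made before a user tree is given to a program. The following is a minimal sketch in Python and is not part of PHYLIP: the function name <TT>check_tree_parentheses</TT> is hypothetical, and it assumes one tree per string with no quoted labels that themselves contain parentheses.

```python
def check_tree_parentheses(tree):
    """Return a list of problems found in one tree string (empty if none)."""
    problems = []
    depth = 0
    for position, char in enumerate(tree):
        if char == "(":
            depth += 1
        elif char == ")":
            depth -= 1
            if depth == -1:
                # A closing parenthesis with no matching opening one.
                problems.append("unmatched ')' at position %d" % position)
                depth = 0
    if depth != 0:
        problems.append("%d unclosed '('" % depth)
    if not tree.rstrip().endswith(";"):
        problems.append("tree does not end with ';'")
    return problems
```

An empty list means only that the parentheses balance and the tree ends with a semicolon; it does not show that the tree is of the form (rooted, unrooted, or multifurcating) that a particular program requires.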
1364	<P>
Some of the programs require that the user trees be preceded by a line with the
1366	number of user trees. Some require that they <EM>not</EM> be preceded by
1367	this line, and many can tolerate either. I have tried to note for
1368	each of these programs which of these forms of the user tree file
1369	is appropriate. We hope to bring all programs to the same user tree file
1370	format as soon as possible.
1371	<P>
1372	<B>The <TT>G</TT> (Global) option.</B> In the programs which construct trees (except for
1373	NEIGHBOR, the "...PENNY" programs and CLIQUE, and of course
1374	the "...MOVE" programs where you construct the trees yourself),
1375	after all species have been added to the tree a rearrangements phase
1376	ensues. In most of these programs the rearrangements are automatically
1377	global, which in this case means that subtrees will be removed from the tree
1378	and put back on in all possible ways so as to have a better chance of
1379	finding a better tree. Since this can be time consuming (it roughly
1380	triples the time taken for a run) it is left as an option in some of the
programs, specifically CONTML, FITCH, DNAML, and DNAMLK. In these programs
1382	the G menu option toggles between the default of local rearrangement and
1383	global rearrangement. The rearrangements are explained more below.
1384	<P>
1385	<B>The <TT>J</TT> (Jumble) option.</B> In most of the tree construction programs
1386	(except for the "...PENNY" programs and CLIQUE), the exact
1387	details of the search of different trees depend on the order of input of
species. In these programs the <TT>J</TT> option enables you to tell the program to use
a random number generator to choose the input order of species. This option is
toggled on and off by selecting option <TT>J</TT> in the menu. The program will then
prompt you for a "seed" for the random number generator. The seed should be an integer
between 1 and 32767, and should be of the form 4<I>n</I>+1,
which means that it must give a remainder of 1 when divided by 4. This can be
judged by looking at the last two digits of the number. Each different seed
leads to a different sequence of addition of species. By simply changing the
random number seed and re-running the programs one can look for other, and
better, trees. If the seed entered is not odd, the program will not proceed,
but will prompt for another seed.
1401	<P>
1402	The Jumble option also causes the program to ask you how many times you
1403	want to restart the process. If you answer 10, the program will
1404	try ten different orders of species in constructing the trees, and the
1405	results printed out will reflect this entire search process (that is,
1406	the best trees found among all 10 runs will be printed out, not the
1407	best trees from each individual run).
1408	<P>
1409	Some people have asked what are good values of the random number seed.
1410	The random number seed is used to start a process of choosing "random"
1411	(actually pseudorandom) numbers, which behave as if they were
unpredictably randomly chosen between 0 and 2<SUP>32</SUP>-1 (which is
4,294,967,295). You could put in the number 133 and find that the
1414	next random number was 1,876,973,009. As they are effectively
1415	unpredictable, there is no such thing as a choice that is better than
1416	any other, provided that the numbers are of the form 4<I>n</I>+1. However
1417	if you re-use a random number seed, the sequence of random numbers
1418	that result will be the same as before, resulting in exactly the same
1419	series of choices, which may not be what you want.
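<P>
The recommended form of the seed can be checked with simple arithmetic. This is a minimal sketch in Python, not the test that the programs themselves apply (they are described above only as rejecting seeds that are not odd); both function names are hypothetical. The second function shows why the last two digits are enough: 100 is divisible by 4, so the remainder depends only on those digits.

```python
def is_valid_jumble_seed(seed):
    """True if seed is an integer from 1 to 32767 of the form 4n+1."""
    return seed in range(1, 32768) and seed % 4 == 1


def last_two_digits_rule(seed):
    """The same remainder test, using only the last two decimal digits."""
    return (seed % 100) % 4 == 1
```

For example, 133 is 4 x 33 + 1, so it passes both checks, while 7 (which is odd but leaves a remainder of 3) does not.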
1420	<P>
1421	<B>The <TT>O</TT> (Outgroup) option.</B> This specifies which species is to be used
1422	to root the tree by having it become the outgroup. This option is
1423	toggled on and off by choosing <TT>O</TT> in the menu (the alphabetic
1424	character <TT>O</TT>, not the digit <TT>0</TT>). When it is on, the program will
1425	then prompt for the
1426	number of the outgroup (the species being taken in the numerical order that
1427	they occur in the input file). Responding by typing <TT>6</TT> and then an
1428	<TT>Enter</TT> character indicates that the sixth species in the data
1429	is the outgroup. Outgroup-rooting will not be attempted if the
1430	data have already established a root for the tree from some other
1431	consideration, and may not be if it is a user-defined tree,
1432	despite your invoking the option. Thus programs such as DOLLOP that
1433	produce only rooted trees do not allow the Outgroup option. It is also
1434	not available in KITSCH, DNAMLK, or CLIQUE. When it is used, the tree as
1435	printed out is still listed as being an
1436	unrooted tree, though the outgroup is connected to the bottommost node
1437	so that it is easy to visually convert the tree into rooted form.
1438	<P>
<B>The <TT>T</TT> (Threshold) option.</B> This sets a threshold for the
parsimony programs such that if the
number of steps counted in a character is higher than the threshold, it
will be taken to be the threshold value rather than the actual number of
steps. The default is a threshold so high that it will never be
surpassed (in which case the steps will simply be counted). The <TT>T</TT>
1445	menu option toggles on and off asking the user to
1446	supply a threshold. The use of thresholds to obtain methods intermediate
1447	between parsimony and compatibility methods is described in my 1981b paper.
1448	When the T option is in force, the program
1449	will prompt for the numerical threshold value. This will be a positive
1450	real number greater than 1. In programs MIX, MOVE, PENNY, PROTPARS,
1451	DNAPARS, DNAMOVE, and DNAPENNY, do not use threshold values less
1452	than or equal to 1.0, as they have no meaning and lead to a tree which
1453	depends only on considerations such as the input order of species and not at
1454	all on the character state data! In programs DOLLOP, DOLMOVE, and DOLPENNY
1455	the threshold should never be 0.0 or less, for the same
1456	reason. The <TT>T</TT> option is an
1457	important and underutilized one: it is, for example, the only way in this
1458	package (except for program DNACOMP) to do a compatibility analysis when there
1459	are missing data. It is a method of de-weighting characters that evolve
1460	rapidly. I wish more people were aware of its properties.
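<P>
The effect of a threshold on the total that is minimized can be sketched as follows. This is an illustration in Python under simple assumptions, not code from the parsimony programs: the function name is hypothetical, and the per-character step counts are taken as already computed for one tree.

```python
def thresholded_steps(steps_per_character, threshold=None):
    """Total steps, with each character's count capped at the threshold."""
    if threshold is None:
        # Default behaviour: the threshold is never reached.
        return sum(steps_per_character)
    return sum(min(steps, threshold) for steps in steps_per_character)
```

With step counts of 1, 2, 5 and 1 the plain total is 9, while a threshold of 2 gives 1 + 2 + 2 + 1 = 6, so the rapidly evolving third character contributes no more than any character that reaches the threshold.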
1461	<P>
1462	<B>The <TT>M</TT> (Multiple data sets) option.</B> In menu programs there is an
1463	<TT>M</TT> menu
1464	option which allows one to toggle on the multiple data sets option. The
1465	program will ask you how many data sets it should expect. The data sets
1466	have the same format as the first data set. Here is a (very small) input file
1467	with two five-species data sets:
1468	<P>
1469	<TABLE><TR><TD bgcolor=white>
1470	<PRE>
5 6
Alpha     CCACCA
Beta      CCAAAA
Gamma     CAACCA
Delta     AACAAC
Epsilon   AACCCA
5 6
Alpha     CACACA
Beta      CCAACC
Gamma     CAACAC
Delta     GCCTGG
Epsilon   TGCAAT
1483	</PRE>
1484	</TD></TR></TABLE>
1485	<P>
1486	The main use of this option will be to allow all of the methods in these
1487	programs to be bootstrapped. Using the program SEQBOOT one can take any
1488	DNA, protein, restriction sites, gene frequency or binary character data set and
1489	make multiple data sets by bootstrapping. Trees can be produced for all of
1490	these using the <TT>M</TT> option. They will be written on the tree output file if
1491	that option is left in force. Then the program CONSENSE can be used with
1492	that tree file as its input file. The result is a majority rule consensus
1493	tree which can be used to make confidence intervals. The present version
1494	of the package allows, with the use of SEQBOOT and CONSENSE and the M option,
1495	bootstrapping of many of the methods in the package.
1496	<P>
1497	Programs DNAML, DNAPARS and PARS can also take multiple weights
1498	instead of multiple data sets. They can then do bootstrapping by
1499	reading in one data set, together with a file of weights that show how
1500	the characters (or sites) are reweighted in each bootstrap sample. Thus a
1501	site that is omitted in a bootstrap sample has effectively been given
1502	weight 0, while a site that has been duplicated has effectively been
1503	given weight 2. SEQBOOT has a menu selection to produce the file of
1504	weights information automatically, instead of producing a file of
1505	multiple data sets.
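<P>
The correspondence between a bootstrap sample and a set of weights can be sketched as follows. This is an illustration in Python of the idea only, not the method SEQBOOT uses internally; the function name is hypothetical and the seed 133 is arbitrary.

```python
import random
from collections import Counter


def bootstrap_weights(n_sites, rng):
    """Draw n_sites sites with replacement; return one weight per site."""
    draws = Counter(rng.randrange(n_sites) for _ in range(n_sites))
    # A site never drawn gets weight 0; a site drawn twice gets weight 2.
    return [draws[site] for site in range(n_sites)]


rng = random.Random(133)
weights = bootstrap_weights(6, rng)
```

Whatever sample is drawn, the weights always sum to the number of sites, since each draw adds 1 to exactly one site.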
1506	<P>
1507	<B>The <TT>W</TT> (Weights) option</B>. This signals the program that, in
1508	addition to the data set, you want to read in a series of weights that
1509	tell how many times each character is to be counted. If the weight
1510	for a character is zero (<TT>0</TT>) then that character is in effect to
1511	be omitted when the tree is evaluated. If it is (<TT>1</TT>) the
1512	character is to be counted once. Some programs allow weights greater than
1513	1 as well. These have the effect that the character is counted as
1514	if it were present that many times, so that a weight of 4 means that the
1515	character is counted 4 times.
1516	The values 0-9 give weights 0 through 9, and the
1517	values A-Z give weights 10 through 35. By use of the weights we can
1518	give overwhelming weight to some characters, and drop others from the
analysis. In the molecular sequence programs only two values of the
weights, 0 or 1, are allowed.
1521	<P>
1522	The weights are used to analyze subsets of the characters, and also can be
1523	used for resampling of the data as in bootstrap and jackknife resampling.
1524	For those programs that allow weights to be greater than 1, they can also
1525	be used to emphasize information from some characters more strongly than
1526	others. Of course, you must have some rationale for doing this.
1527	<P>
1528	The weights are provided as a sequence of digits. Thus they might be
1529	<P>
1530	<TT>10011111100010100011110001100</TT>
1531	<P>
1532	The weights are to be provided in an input file
1533	whose default name is <TT>weights</TT>. In programs such as SEQBOOT
1534	that can also output a file of weights, the input weights have a default
file name of <TT>inweights</TT>, and the output weights file has a default
name of <TT>outweights</TT>.
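<P>
The coding of weights as single characters (<TT>0</TT>-<TT>9</TT> for 0 through 9, <TT>A</TT>-<TT>Z</TT> for 10 through 35) can be decoded as in the following sketch. It is written in Python for illustration and is not taken from the programs; the names are hypothetical, and it assumes upper-case letters and a single line of weights.

```python
import string

# "0".."9" followed by "A".."Z": the position of a symbol is its weight.
WEIGHT_SYMBOLS = string.digits + string.ascii_uppercase


def decode_weights(line):
    """Convert a line of weight characters into a list of integers 0..35."""
    return [WEIGHT_SYMBOLS.index(symbol) for symbol in line.strip()]


weights = decode_weights("10011111100010100011110001100")
kept_characters = [i + 1 for i, w in enumerate(weights) if w != 0]
```

For the example weights shown above, this gives 29 weights, of which 15 are nonzero, so 15 characters would be kept in the analysis.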
1537	<P>
1538	Weights can be used to analyze different subsets of characters (by weighting
1539	the rest as zero). Alternatively, in the discrete characters programs
1540	they can be used to force a certain
1541	group to appear on the phylogeny (in effect confining consideration to only
1542	phylogenies containing that group). This is done by adding an imaginary
1543	character that has <TT>1</TT>'s for the members of the group, and <TT>0</TT>'s
1544	for all the
1545	other species. That imaginary character is then given the highest weight
1546	possible: the result will be that any phylogeny that does not contain that
1547	group will be penalized by such a heavy amount that it will not (except in
1548	the most unusual circumstances) be considered. Of course, the new character
1549	brings extra steps to the tree, but the number of these can be calculated
1550	in advance and subtracted out of the total when reporting the results. This
1551	use of weights is an important one, and one sadly ignored
1552	by many users who could profit from it. In the case of molecular sequences
1553	we cannot use weights this way, so that to force a given group to appear we
1554	have to add a large extra segment of sites to the molecule, with (say) A's
1555	for that group and C's for every other species.
1556	<P>
1557	<B>The option to write out the trees into a tree file</B>. This specifies that you
1558	want the program to write
1559	out the tree not only on its usual output, but also onto a file in
1560	nested-parenthesis notation (as described above). This option is sufficiently
1561	useful that it is turned on by default in all programs that allow it. You
1562	can optionally turn it off if you wish, by typing the appropriate number
1563	from the menu (it varies from program to program). This option is useful for
1564	creating tree files that can be directly read into the programs, including
1565	the consensus tree and tree distance programs, and the tree plotting programs.
1566	<P>
1567	The output tree file has a default name of <TT>outtree</TT>.
1568	<P>
<B>The (<TT>0</TT>) terminal type option</B>. (This is the digit <TT>0</TT>, not
1570	the alphabetic character <TT>O</TT>). The program will default to
1571	one particular assumption about your terminal (except in the case of
1572	Macintoshes, the default will be an ANSI compatible terminal). You can
1573	alternatively select it to be either an IBM PC, or nothing.
1574	This affects the ability of the programs to clear the screen when they
1575	display their menus, and the graphics characters used to display trees
1576	in the programs DNAMOVE, MOVE, DOLMOVE, and RETREE. If you are running an
1577	MSDOS system and have the ANSI.SYS driver installed in your CONFIG.SYS
1578	file, you may find that the screen clears correctly even with the default
1579	setting of ANSI.
1580	<P>
1581	<A NAME="algorithm"><HR><P></A>
1582	<DIV ALIGN="CENTER">
1583	<H2>The Algorithm for Constructing Trees</H2></DIV>
1584	<P>
1585	All of the programs except FACTOR, DNADIST, GENDIST, DNAINVAR, SEQBOOT,
1586	CONTRAST, RETREE, and the plotting and
1587	consensus tree programs act to construct an estimate of a phylogeny. MOVE,
1588	DOLMOVE, and DNAMOVE let you construct it yourself by hand. All of
1589	the rest but NEIGHBOR, the "...PENNY" programs and CLIQUE make use of
1590	a common approach involving additions and rearrangements. They are
1591	trying to minimize or maximize some quantity over the space of all
1592	possible evolutionary trees. Each program contains a part that, given
1593	the topology of the tree, evaluates the quantity that is being minimized
1594	or maximized. The straightforward approach would be to evaluate all
1595	possible tree topologies one after another and pick the one which,
1596	according to the criterion being used, is best. This would not be
1597	possible for more than a small number of species, since the number of
1598	possible tree topologies is enormous. A review of the literature on the
counting of evolutionary trees will be found in one of my papers
1600	(Felsenstein, 1978a).
1601	<P>
1602	Since we cannot search all topologies, these programs are not
1603	guaranteed to always find the best tree, although they seem to do quite
1604	well in practice. The strategy they employ is as follows: the species
1605	are taken in the order in which they appear in the input file. The
1606	first two (in some programs the first three) are taken and a tree
1607	constructed containing only those. There is only one possible topology for
1608	this tree. Then the next species is taken, and we consider where it
1609	might be added to the tree. If the initial tree is (say) a rooted tree
1610	with two species and we want the resulting three-species tree to be a
1611	bifurcating tree, there are only three places where we could add the
1612	third species. Each of these is tried, and each time the resulting tree is
1613	evaluated according to the criterion. The best one is chosen to be the
1614	basis for further operations. Now we consider adding the fourth
1615	species, again at each of the five possible places that would result in
1616	a bifurcating tree. Again, the best of these is accepted.
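<P>
The counts of three and five places follow a simple pattern, which also shows why evaluating every topology is impractical. The sketch below, in Python, is an illustration only (the function names are hypothetical): a rooted bifurcating tree with <I>k</I> species has 2<I>k</I>-1 places where one more species can be attached, and multiplying these counts together gives the standard number of rooted bifurcating topologies.

```python
def places_to_add(species_already_in_tree):
    """Places where one more species can join a rooted bifurcating tree."""
    return 2 * species_already_in_tree - 1


def rooted_topologies(n_species):
    """Number of rooted bifurcating topologies for n_species labelled tips."""
    count = 1
    for k in range(2, n_species):
        # Adding species k+1 to any tree of k species multiplies the count.
        count *= places_to_add(k)
    return count
```

For 10 species the product 3 x 5 x ... x 17 is already 34,459,425, whereas the sequential addition strategy evaluates only 3 + 5 + ... + 17 = 80 placements before any rearrangements.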
1617	<P>
1618	<H3>Local Rearrangements</H3>
1619	<P>
1620	The process continues in this manner, with one important exception. After
1621	each species is added, and before the next
1622	is added, a number of rearrangements of the tree are tried, in an effort
1623	to improve it. The algorithms move through the tree, making all
1624	possible local rearrangements of the tree. A local rearrangement involves an
1625	internal segment of the tree in the following manner. Each internal
1626	segment of the tree is of this form (where T1, T2, and T3 are subtrees
1627	- parts of the tree that can contain further forks and tips):
1628	<P>
1629	<PRE>
         T1      T2       T3
          \      /        /
           \    /        /
            \  /        /
             \/        /
              *       /
               *     /
                *   /
                 * /
                  *
                  !
                  !
1642	</PRE>
1643	<P>
1644	the segment we are discussing being indicated by the asterisks. A local
1645	rearrangement consists of switching the subtrees T1 and T3 or T2 and T3,
1646	so as to obtain one of the following:
1647	<P>
1648	<PRE>
         T3       T2      T1     T1       T3      T2
          \       /       /       \       /       /
           \     /       /         \     /       /
            \   /       /           \   /       /
             \ /       /             \ /       /
              \       /               \       /
               \     /                 \     /
                \   /                   \   /
                 \ /                     \ /
                  !                       !
                  !                       !
                  !                       !
1661	</PRE>
1662	<P>
1663	Each time a local rearrangement is successful in finding a better tree,
1664	the new arrangement is accepted. The phase of local rearrangements does
1665	not end until the program can traverse the entire tree, attempting local
1666	rearrangements, without finding any that improve the tree.
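<P>
The two alternatives produced by a local rearrangement can be written down directly if a fork is represented as nested pairs. This is a minimal sketch in Python, not the data structure used in the programs; the function name is hypothetical, and T1, T2 and T3 may be tip names or further nested pairs.

```python
def local_rearrangements(fork):
    """Given ((T1, T2), T3), return the two alternatives across the segment."""
    (t1, t2), t3 = fork
    return [((t3, t2), t1),   # T1 and T3 switched
            ((t1, t3), t2)]   # T2 and T3 switched
```

A search would evaluate both alternatives with the criterion in use and accept one only if it is better than the current arrangement.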
1667	<P>
1668	This strategy of adding species and making local rearrangements will look
1669	at about  (n-1)x(2n-3)  different topologies, though if
1670	rearrangements are frequently successful the number may be larger. I
1671	have been describing the strategy when rooted trees are being
1672	considered. For unrooted trees there is a precisely similar strategy,
1673	though the first tree constructed may be a three-species tree and the
1674	rearrangements may not start until after the addition of the fifth
1675	species.
1676	<P>
1677	Though we are not guaranteed to have found the best tree topology,
we are guaranteed that no nearby topology (i.e., none accessible by a
1679	single local rearrangement) is better. In this sense we have reached a
1680	local optimum of our criterion. Note that the whole process is
1681	dependent on the order in which the species are present in the input
1682	file. We can try to find a different and better solution by reordering
1683	the species in the input file and running the program again (or, more
1684	easily, by using the <TT>J</TT> option). If none of
1685	these attempts finds a better solution, then we have some indication
1686	that we may have found the best topology, though we can never be certain
1687	of this.
1688	<P>
1689	Note also that a new topology is never accepted unless it is better
1690	than the previous one, so that the rearrangement process can never fall
1691	into an endless loop. This is also the way ties in our criterion are
1692	resolved, namely by sticking with the tree found first. However, the tree
1693	construction programs other than CLIQUE, CONTML, FITCH,
1694	and DNAML do keep a record of all trees found that are tied with the best one
1695	found. This gives you some immediate idea of which parts of the tree can be
1696	altered without affecting the quality of the result.
1697	<P>
1698
1699	<H3>Global Rearrangements</H3>
1700	<P>
1701	A feature of most of the programs, such as PROTPARS, DNAPARS,
1702	DNACOMP, DNAML, DNAMLK, RESTML, KITSCH, FITCH, CONTML, MIX, and DOLLOP,
1703	is "global" optimization of the tree. In four of these (CONTML,
1704	FITCH, DNAML and DNAMLK) this is an option, <TT>G</TT>. In the others it
1705	automatically applies. When
1706	it is present there is an additional stage to the search for the best tree.
Each possible subtree is removed from the tree and added back in
all possible places. This process continues until all subtrees can be removed
and added again without any improvement in the tree. The purpose of this
extra rearrangement is to make it less likely that one or more species get
"stuck" in a suboptimal region of the space of all possible trees. The use of
global optimization approximately triples the run time,
1713	which is why I have left it as an option in some of the slower programs.
1714	<P>
1715	What PHYLIP calls "global" rearrangements are more properly called
SPR (subtree pruning and regrafting) by Swofford et al. (1996) as distinct
1717	from the NNI (nearest neighbor interchange) rearrangements that PHYLIP
1718	also uses, and the TBR (tree bisection and reconnection) rearrangements
1719	that it does not use.
1720	<P>
1721	The programs doing global optimization print out a dot "<TT>.</TT>" after each group is
1722	removed and re-added to the tree, to give the user some sign that the
1723	rearrangements are proceeding. A new line of dots is started whenever a new
1724	round of global rearrangements is started following an improvement in the
1725	tree. On the line before the dots are printed there is printed a bar of
1726	the form "!---------------!" to show how many dots
1727	to expect. The dots will
1728	not be printed out at a uniform rate, but the later dots, which represent
1729	removal of larger groups from the tree and trying them consequently in fewer
1730	places, will print out more quickly. With some compilers each row of dots may
1731	not be printed out until it is complete.
1732	<P>
1733	It should be noted that PENNY, DOLPENNY, DNAPENNY and CLIQUE use a more
1734	sophisticated strategy of "depth-first search" with a "branch and bound"
1735	search method that guarantees that all
1736	of the best trees will be found. In the case
1737	of PENNY, DOLPENNY and DNAPENNY there can be a considerable sacrifice of
1738	computer time if the number of species is greater than about ten: it is a
1739	matter for you to consider whether it is worth it for you to guarantee finding
1740	all the most parsimonious trees, and that depends on how much free computer
1741	time you have! CLIQUE finds all largest cliques, and does so without undue
1742	burning of computer time. Although all of these problems that have been
1743	investigated fall into the
1744	category of "NP-hard" problems that in effect do not have a rapid solution,
1745	the cases that cause this trouble for the largest-cliques algorithm in
1746	CLIQUE apparently are not biologically realistic and do not occur in actual
1747	data.
1748	<P>
1749
1750	<H3>Multiple Jumbles</H3>
1751	<P>
1752	As just mentioned, for most of these programs the search depends on the order
1753	in which the species are entered into the tree. Using the <TT>J</TT> (Jumble)
option you can supply a random number seed which will allow the program to put
the species in a random order. Jumbling can be
1756	done multiple times. For example, if you tell the program to do it
1757	10 times, it will go through the tree-building process 10 times, each with a
1758	different random order of adding species. It will keep a record of the trees
1759	tied for best over the whole process. In other words, it does not just
1760	record the best trees from each of the 10 runs, but records the best ones
1761	overall. Of course this is slow, taking 10 times longer than a single run.
1762	But it does give us a much greater chance of finding all of the most
1763	parsimonious trees. In the terminology of Maddison (1991) it
can find different "islands" of trees. The present algorithms do not
guarantee that we will find all trees in a given "island" from a single run, so
1766	multiple runs also help explore those "islands" that are found.
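<P>
The bookkeeping described here, in which ties are kept over the whole set of runs rather than run by run, can be sketched as follows. This is an outline in Python under stated assumptions, not the programs' own code: <TT>build_tree</TT> and <TT>score</TT> are hypothetical functions supplied by the caller, lower scores are taken to be better (as with parsimony), and Python's own random number generator stands in for the one used in PHYLIP.

```python
import random


def jumble_search(species, build_tree, score, seed, times):
    """Repeat a search with shuffled input orders; keep all trees tied for best."""
    rng = random.Random(seed)
    results = []
    for _ in range(times):
        order = list(species)
        rng.shuffle(order)            # a new random order of addition
        tree = build_tree(order)
        results.append((score(tree), tree))
    best_score = min(value for value, _ in results)
    best_trees = []
    for value, tree in results:
        # Keep each distinct tree that ties the best score over all runs.
        if value == best_score and tree not in best_trees:
            best_trees.append(tree)
    return best_score, best_trees
```

A run that finds only trees worse than the overall best contributes nothing to the final list, which is the behavior described above.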
1767	<P>
1768	<H3>Saving multiple tied trees</H3>
1769	<P>
1770	For the parsimony and compatibility programs, one can have a perfect tie
1771	between two or more trees. In these programs these trees are all
1772	saved. For the newer parsimony programs such as DNAPARS and PARS,
1773	global rearrangement is carried out on all of these tied trees. This can
1774	be turned off in the menu.
1775	<P>
1776	For trees with criteria which are real numbers, such as the distance
1777	matrix programs FITCH and KITSCH, and the likelihood programs DNAML,
1778	DNAMLK, CONTML, and RESTML, it is difficult to get an exact tie between
1779	trees. Consequently these programs save only the single best tree
1780	(even though the others may be only a tiny bit worse).
1781	<P>
1782	<H3>Strategy for Finding the Best Tree</H3>
1783	<P>
In practice, <I>it is advisable to use the Jumble option and specify that it
be done many times (as many as ten)</I>, so that many different orderings
of the input species are evaluated.
1789	<P>
1790	People who want a magic "black box" program whose results they do
1791	not have to question (or think about) often are upset that these
1792	programs give results that are dependent on the order in which the species
1793	are entered in the data. To me this property is an advantage, for it
1794	permits you to try different searches for better trees, simply by
1795	varying the input order of species. If you do not use the multiple Jumble
1796	option, but do multiple individual runs instead, you
1797	can easily decide which to pay most attention to - the one or ones that
1798	are best according to the criterion employed (for example, with parsimony,
1799	the one out of the runs that results in the tree with the fewest changes).
1800	<P>
1801	In practice, in a single run, it usually seems best to put species that are
1802	likely to be sources of confusion in the topology last, as by the time they are
1803	added the arrangement of the earlier species will have stabilized into a
good configuration, and then the last few species will be fitted into
1805	that topology. There will be less chance this way of a poor initial
1806	topology that would affect all subsequent parts of the search. However,
1807	a variety of arrangements of the input order of species should be tried,
1808	as can be done if the <TT>J</TT> option is used,
1809	and no species should be kept in a fixed place in the order of input.
1810	Note that the results of the "...PENNY" programs and CLIQUE
1811	are not sensitive to the input order of species, and NEIGHBOR is only
slightly sensitive to it, so that multiple Jumbling is not possible
1813	with those programs. Note also that with global search, which
1814	is standard in many programs and in others is an
1815	option, each group (including
1816	each individual species) will be removed and re-added in all possible
1817	positions, so that a species causing confusion will have more chance of moving
1818	to a new location than it would without global rearrangement.
1819	<P>
1820	<A NAME="warning"><HR><P></A>
1821	<DIV ALIGN="CENTER">
1822	<H2>A Warning on Interpreting Results</H2></DIV>
1823	<P>
1824	Probably the most important thing to keep in mind while running any of the
1825	parsimony or compatibility programs is not
1826	to overinterpret the result. Many users treat the set of most parsimonious
1827	trees as if it were a confidence interval. If a group appears in all of the
1828	most parsimonious trees then they treat it as well established. Unfortunately
1829	<I>the confidence interval on phylogenies appears to be much
1830	larger than the set of all most parsimonious trees</I> (Felsenstein, 1985b).
1831	Likewise, variation of result among different methods will not be a good
1832	indicator of the size of the confidence interval. Consider a simple data set
1833	in which, out of 100 binary characters, 51 recommend the unrooted tree
1834	<TT>((A,B),(C,D))</TT> and 49 the tree <TT>((A,D),(B,C))</TT>. Many different
1835	methods will all give the same result on
1836	such a data set: they will estimate the tree as <TT>((A,B),(C,D))</TT>.
1837	Nevertheless it is
1838	clear that the 51:49 margin by which this tree is favored is not statistically
1839	significantly different from 50:50. So <I>consistency among different methods
1840	is a poor guide to statistical significance</I>.
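<P>
The 51:49 example can be checked with an exact calculation. The sketch below, in Python, uses a two-sided sign test as one simple way to make the point; this choice of test and the function name are assumptions made for illustration, and the test treats the 100 characters as independent.

```python
from fractions import Fraction
from math import comb


def two_sided_sign_test(successes, trials):
    """Exact two-sided binomial P value for a 50:50 null hypothesis."""
    observed = abs(2 * successes - trials)
    # Count every outcome at least as far from an even split as the data.
    extreme = sum(comb(trials, k) for k in range(trials + 1)
                  if abs(2 * k - trials) >= observed)
    return Fraction(extreme, 2 ** trials)


p_value = float(two_sided_sign_test(51, 100))
```

Every outcome other than an exact 50:50 split is at least as extreme as 51:49, so the P value is 1 minus the probability of exactly 50 out of 100, or about 0.92, far from any conventional significance level.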
1841	<P>
1842	<A NAME="speed"><HR><P></A>
1843	<DIV ALIGN="CENTER">
1844	<H2>Relative Speed of Different<BR>
1845	Programs and Machines</H2></DIV>
1846	<P>
1847	<H3>Relative speed of the different programs</H3>
1848	<P>
1849	C compilers differ in efficiency of the code they generate,
1850	and some deal with some features of the language better than with
1851	others. Thus a program which is unusually fast on one computer may be
1852	unusually slow on another. Nevertheless, as a rough guide to relative
1853	execution speeds, I have tested the programs on three data sets, each of
1854	which has 10 species and 40 characters. The first is an imaginary one
1855	in which all characters are compatible - ("The Willi Hennig Memorial
1856	Data Set" as J. S. Farris once called ones like it). The second is the binary
1857	recoded form of the fossil horses data set of Camin and Sokal (1965).
1858	The third data set has data that is completely random: 10 species and 20
1859	characters that have a 50% chance that each character state is <TT>0</TT> or
1860	<TT>1</TT> (or <TT>A</TT> or <TT>G</TT>). The data sets thus range from a completely
compatible one in which there is no homoplasy (parallelism or convergence),
1862	through the horses data set, which requires 29 steps where the possible
1863	minimum number would be 20, to the random data set, which requires 49 steps.
1864	We can thus see how this increasing messiness of the data affects running
1865	times. The three data sets have all had 20 sites of <TT>A</TT>'s added to the
1866	end of each sequence, so as to prevent likelihood or distance matrix programs
1867	from having infinite branch lengths (the test data sets used for timing
previous versions of PHYLIP were the same except that they lacked these
1869	20 extra sites).
1870	<P>
1871	Here are the nucleotide sequence versions of the three data sets:
1872	<P>
1873	<TABLE><TR><TD BGCOLOR=white>
1874	<PRE>
1875	10 40
1876	A CACACACAAAAAAAAAAACAAAAAAAAAAAAAAAAAAAAA
1877	B CACACAACAAAAAAAAAACAAAAAAAAAAAAAAAAAAAAA
1878	C CACAACAAAAAAAAAAAACAAAAAAAAAAAAAAAAAAAAA
1879	D CAACAAAACAAAAAAAAACAAAAAAAAAAAAAAAAAAAAA
1880	E CAACAAAAACAAAAAAAACAAAAAAAAAAAAAAAAAAAAA
1881	F ACAAAAAAAACACACAAAACAAAAAAAAAAAAAAAAAAAA
1882	G ACAAAAAAAACACAACAAACAAAAAAAAAAAAAAAAAAAA
1883	H ACAAAAAAAACAACAAAAACAAAAAAAAAAAAAAAAAAAA
1884	I ACAAAAAAAAACAAAACAACAAAAAAAAAAAAAAAAAAAA
1885	J ACAAAAAAAAACAAAAACACAAAAAAAAAAAAAAAAAAAA
1886	</PRE>
1887	</TD></TR></TABLE>
1888	<P>
1889	<TABLE><TR><TD BGCOLOR=white>
1890	<PRE>
1891	10 40
1892	MesohippusAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA
1893	HypohippusAAACCCCCCCAAAAAAAAACAAAAAAAAAAAAAAAAAAAA
1894	ArchaeohipCAAAAAAAAAAAAAAAACACAAAAAAAAAAAAAAAAAAAA
1895	ParahippusCAAACAACAACAAAAAAAACAAAAAAAAAAAAAAAAAAAA
1896	MerychippuCCAACCACCACCCCACACCCAAAAAAAAAAAAAAAAAAAA
1897	M. secunduCCAACCACCACCCACACCCCAAAAAAAAAAAAAAAAAAAA
Nannipus  CCAACCACAACCCCACACCCAAAAAAAAAAAAAAAAAAAA
NeohippariCCAACCCCCCCCCCACACCCAAAAAAAAAAAAAAAAAAAA
Calippus  CCAACCACAACCCACACCCCAAAAAAAAAAAAAAAAAAAA
1901	PliohippusCCCACCCCCCCCCACACCCCAAAAAAAAAAAAAAAAAAAA
1902	</PRE>
1903	</TD></TR></TABLE>
1904	<P>
1905	<TABLE><TR><TD BGCOLOR=white>
1906	<PRE>
1907	10 40
A         CACACAACCAAACAAACCACAAAAAAAAAAAAAAAAAAAA
B         AAACCACACACACAAACCCAAAAAAAAAAAAAAAAAAAAA
C         ACAAAACCAAACCACCCACAAAAAAAAAAAAAAAAAAAAA
D         AAAAACACAACACACCAAACAAAAAAAAAAAAAAAAAAAA
E         AAACAACCACACACAACCAAAAAAAAAAAAAAAAAAAAAA
F         CCCAAACACCCCCAAAAAACAAAAAAAAAAAAAAAAAAAA
G         ACACCCCCACACCCACCAACAAAAAAAAAAAAAAAAAAAA
H         AAAACAACAACCACCCCACCAAAAAAAAAAAAAAAAAAAA
I         ACACAACAACACAAACAACCAAAAAAAAAAAAAAAAAAAA
J         CCAAAAACACCCAACCCAACAAAAAAAAAAAAAAAAAAAA
1918	</PRE>
1919	</TD></TR></TABLE>
1920	<P>
1921	Here are the timings of many of the version 3.6 programs on these three data
1922	sets as run after being compiled by Gnu C and run on a
1923	266 MHz Pentium MMX computer under Linux.
1924	<P>
1925	<DIV ALIGN="CENTER">
1926	<TABLE CELLPADDING=3 BORDER="1">
1927	<TR><TD ALIGN="LEFT"> </TD>
1928	<TD ALIGN="RIGHT">Hennigian Data</TD>
1929	<TD ALIGN="RIGHT">Horses Data</TD>
1930	<TD ALIGN="RIGHT">Random Data</TD>
1931	</TR>
1932	<TR><TD ALIGN="LEFT">PROTPARS</TD>
1933	<TD ALIGN="RIGHT">0.133</TD>
1934	<TD ALIGN="RIGHT">0.167</TD>
1935	<TD ALIGN="RIGHT">0.308</TD>
1936	</TR>
1937	<TR><TD ALIGN="LEFT">DNAPARS</TD>
1938	<TD ALIGN="RIGHT">0.163</TD>
1939	<TD ALIGN="RIGHT">0.191</TD>
1940	<TD ALIGN="RIGHT">0.573</TD>
1941	</TR>
1942	<TR><TD ALIGN="LEFT">DNAPENNY</TD>
1943	<TD ALIGN="RIGHT">0.300</TD>
1944	<TD ALIGN="RIGHT">0.196</TD>
1945	<TD ALIGN="RIGHT">36.68</TD>
1946	</TR>
1947	<TR><TD ALIGN="LEFT">DNACOMP</TD>
1948	<TD ALIGN="RIGHT">0.081</TD>
1949	<TD ALIGN="RIGHT">0.073</TD>
1950	<TD ALIGN="RIGHT">0.127</TD>
1951	</TR>
1952	<TR><TD ALIGN="LEFT">DNAML</TD>
1953	<TD ALIGN="RIGHT">2.19</TD>
1954	<TD ALIGN="RIGHT">2.53</TD>
1955	<TD ALIGN="RIGHT">2.73</TD>
1956	</TR>
1957	<TR><TD ALIGN="LEFT">DNAMLK</TD>
1958	<TD ALIGN="RIGHT">5.40</TD>
1959	<TD ALIGN="RIGHT">6.13</TD>
1960	<TD ALIGN="RIGHT">7.21</TD>
1961	</TR>
1962	<TR><TD ALIGN="LEFT">PROML</TD>
1963	<TD ALIGN="RIGHT">44.79</TD>
1964	<TD ALIGN="RIGHT">90.46</TD>
1965	<TD ALIGN="RIGHT">68.49</TD>
1966	</TR>
1967	<TR><TD ALIGN="LEFT">PROMLK</TD>
1968	<TD ALIGN="RIGHT">171.01</TD>
1969	<TD ALIGN="RIGHT">183.61</TD>
1970	<TD ALIGN="RIGHT">239.34</TD>
1971	</TR>
1977	<TR><TD ALIGN="LEFT">DNAINVAR</TD>
1978	<TD ALIGN="RIGHT">0.002</TD>
1979	<TD ALIGN="RIGHT">0.002</TD>
1980	<TD ALIGN="RIGHT">0.002</TD>
1981	</TR>
1982	<TR><TD ALIGN="LEFT">DNADIST</TD>
1983	<TD ALIGN="RIGHT">0.029</TD>
1984	<TD ALIGN="RIGHT">0.024</TD>
1985	<TD ALIGN="RIGHT">0.033</TD>
1986	</TR>
1987	<TR><TD ALIGN="LEFT">PROTDIST</TD>
1988	<TD ALIGN="RIGHT">1.095</TD>
1989	<TD ALIGN="RIGHT">1.089</TD>
1990	<TD ALIGN="RIGHT">1.107</TD>
1991	</TR>
1992	<TR><TD ALIGN="LEFT">RESTML</TD>
1993	<TD ALIGN="RIGHT">3.55</TD>
1994	<TD ALIGN="RIGHT">3.18</TD>
1995	<TD ALIGN="RIGHT">5.15</TD>
1996	</TR>
1997	<TR><TD ALIGN="LEFT">RESTDIST</TD>
1998	<TD ALIGN="RIGHT">0.012</TD>
1999	<TD ALIGN="RIGHT">0.010</TD>
2000	<TD ALIGN="RIGHT">0.010</TD>
2001	</TR>
2002	<TR><TD ALIGN="LEFT">FITCH</TD>
2003	<TD ALIGN="RIGHT">0.20</TD>
2004	<TD ALIGN="RIGHT">0.31</TD>
2005	<TD ALIGN="RIGHT">0.24</TD>
2006	</TR>
2007	<TR><TD ALIGN="LEFT">KITSCH</TD>
2008	<TD ALIGN="RIGHT">0.055</TD>
2009	<TD ALIGN="RIGHT">0.061</TD>
2010	<TD ALIGN="RIGHT">0.058</TD>
2011	</TR>
2012	<TR><TD ALIGN="LEFT">NEIGHBOR</TD>
2013	<TD ALIGN="RIGHT">0.003</TD>
2014	<TD ALIGN="RIGHT">0.004</TD>
2015	<TD ALIGN="RIGHT">0.005</TD>
2016	</TR>
2017	<TR><TD ALIGN="LEFT">CONTML</TD>
2018	<TD ALIGN="RIGHT">0.380</TD>
2019	<TD ALIGN="RIGHT">0.368</TD>
2020	<TD ALIGN="RIGHT">0.396</TD>
2021	</TR>
2022	<TR><TD ALIGN="LEFT">GENDIST</TD>
2023	<TD ALIGN="RIGHT">0.008</TD>
2024	<TD ALIGN="RIGHT">0.009</TD>
2025	<TD ALIGN="RIGHT">0.008</TD>
2026	</TR>
2027	<TR><TD ALIGN="LEFT">PARS</TD>
2028	<TD ALIGN="RIGHT">0.201</TD>
2029	<TD ALIGN="RIGHT">0.263</TD>
2030	<TD ALIGN="RIGHT">0.729</TD>
2031	</TR>
2032	<TR><TD ALIGN="LEFT">MIX</TD>
2033	<TD ALIGN="RIGHT">0.064</TD>
2034	<TD ALIGN="RIGHT">0.078</TD>
2035	<TD ALIGN="RIGHT">0.123</TD>
2036	</TR>
2037	<TR><TD ALIGN="LEFT">PENNY</TD>
2038	<TD ALIGN="RIGHT">0.038</TD>
2039	<TD ALIGN="RIGHT">0.087</TD>
2040	<TD ALIGN="RIGHT">15.93</TD>
2041	</TR>
2042	<TR><TD ALIGN="LEFT">DOLLOP</TD>
2043	<TD ALIGN="RIGHT">0.134</TD>
2044	<TD ALIGN="RIGHT">0.141</TD>
2045	<TD ALIGN="RIGHT">0.233</TD>
2046	</TR>
2047	<TR><TD ALIGN="LEFT">DOLPENNY</TD>
2048	<TD ALIGN="RIGHT">0.051</TD>
2049	<TD ALIGN="RIGHT">0.241</TD>
2050	<TD ALIGN="RIGHT">101.29</TD>
2051	</TR>
2052	<TR><TD ALIGN="LEFT">CLIQUE</TD>
2053	<TD ALIGN="RIGHT">0.010</TD>
2054	<TD ALIGN="RIGHT">0.015</TD>
2055	<TD ALIGN="RIGHT">0.020</TD>
2056	</TR>
2057	</TABLE>
2058	</DIV>
2059
2060	<P>
2061	<BR>
2062
2063	<P>
2064	In all cases the programs were run under the default options without compiler
2065	switches, except as
2066	specified here. The
2067	data sets used for the discrete characters programs have <TT>0</TT>'s and <TT>1</TT>'s
2068	instead of <TT>A</TT>'s and <TT>C</TT>'s. For CONTML the <TT>A</TT>'s and <TT>C</TT>'s
2069	were made into <TT>0.0</TT>'s and <TT>1.0</TT>'s and considered as 40 2-allele loci.
2070	For the distance programs 10 x 10 distance matrices were
2071	computed from the three data sets.
2072	For the restriction sites programs <TT>A</TT> and <TT>C</TT> were changed into
2073	<TT>+</TT> and <TT>-</TT>. It does not
2074	make much sense to benchmark MOVE, DOLMOVE, or DNAMOVE, although when there
2075	are many characters and many species the response time after each
2076	alteration of the tree should be proportional to the product of the number of
2077	species and the number of characters. For DNAML and DNAMLK the frequencies
2078	of the four bases were
2079	set to be equal rather than determined empirically as is the default. For
2080	RESTML the number of enzymes was set to 1.
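<P>
The recodings described above are simple character substitutions. The sketch below shows one way they could be done; the function names are hypothetical, and pairing <TT>A</TT> with the first symbol of each pair (<TT>0</TT>, <TT>+</TT>, <TT>0.0</TT>) is an assumption based on the order in which the text lists them.

```python
# Recode the two-state nucleotide test data for the other program families.
# Assumption: A maps to the first symbol of each pair and C to the second.

def recode_discrete(seq):
    """A/C to 0/1, for the discrete characters programs."""
    return seq.translate(str.maketrans("AC", "01"))

def recode_restriction(seq):
    """A/C to +/-, for the restriction sites programs."""
    return seq.translate(str.maketrans("AC", "+-"))

def recode_contml(seq):
    """A/C to 0.0/1.0, one value per 2-allele locus, for CONTML."""
    return [0.0 if base == "A" else 1.0 for base in seq]

# Species A of the first (Hennigian) data set, copied from the listing above.
species_a = "CACACACAAAAAAAAAAACAAAAAAAAAAAAAAAAAAAAA"

print(recode_discrete(species_a))
print(recode_restriction(species_a))
print(recode_contml(species_a)[:4])
```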
2081	<P>
2082	In most cases, the benchmark was made more accurate by analyzing 10 data
2083	sets using the <TT>M</TT> (Multiple data sets) option and dividing the resulting
2084	time by 10. Times were determined as user times using the Linux <TT>time</TT>
2085	command. Several patterns are apparent from these timings. The programs (MIX,
2086	DOLLOP, CONTML, FITCH, KITSCH, PROTPARS, DNAPARS, DNACOMP, DNAML, DNAMLK,
2087	and RESTML) that use the above-described addition strategy have
2088	run times that do not depend strongly on the messiness of the data. The only
2089	exception to this is that if a data set such as the Random data requires
2090	extra rounds of global rearrangements it takes longer. The
2091	programs differ greatly in run time: the likelihood programs RESTML, DNAML and
2092	CONTML are quite a bit slower than the others. The protein sequence parsimony
2093	program, which has to do a considerable amount of bookkeeping to keep track of
2094	which amino acids can mutate to each other, is also relatively slow.
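<P>
The timing procedure described above (user time for a run over several data sets, divided by the number of data sets) can be sketched as follows. This is not how the benchmarks were actually taken: they used the Linux <TT>time</TT> command, whereas this sketch swaps in Python's <TT>resource</TT> module to read the same kind of user CPU time for a child process. The function name is hypothetical, and the command shown is only a stand-in so that the sketch runs on any Unix system.

```python
import resource
import subprocess
import sys

def user_seconds_per_data_set(command, n_data_sets, stdin_text=""):
    """Run command once; return its user CPU time divided by n_data_sets."""
    before = resource.getrusage(resource.RUSAGE_CHILDREN).ru_utime
    subprocess.run(command, input=stdin_text, text=True,
                   stdout=subprocess.DEVNULL, check=True)
    after = resource.getrusage(resource.RUSAGE_CHILDREN).ru_utime
    return (after - before) / n_data_sets

# Stand-in command (assumption): replace it with the program to be timed,
# passing the answers to its menu through stdin_text.
per_set = user_seconds_per_data_set([sys.executable, "-c", "pass"],
                                    n_data_sets=10)
print("user seconds per data set:", round(per_set, 4))
```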
2095	<P>
2096	Another class of algorithms includes PENNY, DOLPENNY, DNAPENNY and CLIQUE.
2097	These are branch-and-bound methods: in principle they should have execution
2098	times that rise exponentially with the number of species and/or
2099	characters, and they might be much more sensitive to messy data. This is
2100	apparent with PENNY, DOLPENNY, and DNAPENNY, which go from being reasonably
2101	fast with clean data to very slow with messy data. DOLPENNY is particularly
2102	slow on messy data - this is because this algorithm cannot make use of some of
2103	the lower-bound calculations that are possible with DNAPENNY and PENNY. CLIQUE
2104	is very fast on all
2105	data sets. Although in theory it should bog down if the number of cliques in
2106	the data is very large, that does not happen with random data, which in
2107	fact has few cliques and those small ones. Apparently the "worst-case"
2108	data sets that cause exponential run time are much rarer for CLIQUE than for
2109	the other branch-and-bound methods.
2110	<P>
2111	NEIGHBOR is quite fast compared to FITCH and KITSCH, and should make it
2112	possible to run much larger cases, although the results are expected to be
2113	a bit rougher than with those programs.
2114	<BR>
2115	<P>
2116	<H3>Speed with different numbers of species</H3>
2117	<P>
2118	How will the speed depend on the number of species and the number
2119	of characters? For the sequential-addition algorithms, the speed should
2120	be proportional to somewhere between the cube of the number of species and
2121	the square of the number of species, and to the number
2122	of characters. Thus a case that has, instead of 10 species and 20
2123	characters, 20 species and 50 characters would take (in the cubic case)
2124	2 x 2 x 2 x 2.5 = 20
2125	times as long. This implies that cases with more than 20 species will
2126	be slow, and cases with more than 40 species <I>very</I> slow. This places a
2127	premium on working on small subproblems rather than just dumping a whole
2128	large data set into the programs.
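<P>
The arithmetic in this paragraph can be written as a one-line rule of thumb. The sketch below is only that rule, with a hypothetical function name; it ignores fixed overheads and extra rounds of rearrangement.

```python
def relative_runtime(species_ratio, character_ratio, species_exponent=3):
    """Factor by which run time grows when both dimensions are scaled up."""
    return species_ratio ** species_exponent * character_ratio

# From 10 species and 20 characters to 20 species and 50 characters.
cubic = relative_runtime(20 / 10, 50 / 20, species_exponent=3)
quadratic = relative_runtime(20 / 10, 50 / 20, species_exponent=2)
print(cubic, quadratic)   # prints 20.0 10.0
```
<P>
The cubic case reproduces the factor of 20 given above; the quadratic case, the other end of the stated range, gives a factor of 10.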
2129	<P>
2130	An exception to these rules will be some of the DNA programs that use an
2131	aliasing device to save execution time. In these programs execution time
2132	will not necessarily increase in proportion to the number of sites,
2133	as sites that show the same pattern of nucleotides will be detected
2134	as identical and the calculations for them will be done only once, so that
2135	repeated patterns add little execution time. This is particularly
2136	likely to happen with few species and many sites, or with data sets that have
2137	small amounts of evolutionary divergence.
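<P>
The idea behind aliasing can be illustrated by counting distinct site patterns. The sketch below is not the code used in the DNA programs; it only shows, on a small made-up alignment, how many columns would need separate calculation. The function name and the alignment are hypothetical.

```python
from collections import Counter

def site_pattern_counts(sequences):
    """Map each distinct column (site pattern) to the number of sites showing it."""
    return Counter(zip(*sequences))

# Made-up alignment: 3 species, 5 sites.
alignment = ["AACAA",
             "AACAC",
             "CAAAC"]

patterns = site_pattern_counts(alignment)
print(len(alignment[0]), "sites,", len(patterns), "distinct patterns")
# prints: 5 sites, 4 distinct patterns
```
<P>
Here sites 2 and 4 share the pattern <TT>AAA</TT>, so a per-site quantity would be computed once for that pattern and weighted by 2.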
2138	<P>
2139	For programs FITCH and KITSCH, the distance matrix is square, so
2140	that when we double the number of species we also double the number of
2141	"characters", so that running times will go up as the fourth power of
2142	the number of species rather than the third power. Thus a 20-species
2143	case with FITCH is expected to run sixteen times more slowly than a 10-species
2144	case.
2145	<P>
2146	For programs like PENNY and CLIQUE the run times will rise faster
2147	than the cube of the number of species (in fact, they can rise faster
2148	than any power since these algorithms are not guaranteed to work in
2149	polynomial time). In practice, PENNY will frequently bog down above 11
2150	species, while CLIQUE easily deals with larger numbers.
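<P>
One way to see why no power of the number of species is enough is to count the trees that an exhaustive search would have to consider. The number of unrooted bifurcating trees for <I>n</I> species is the product 1 x 3 x 5 x ... x (2<I>n</I>-5). The sketch below computes it; the function name is hypothetical, and this is only a worst-case picture, since branch-and-bound normally discards most of these trees without examining them.

```python
def unrooted_bifurcating_trees(n_species):
    """Number of unrooted bifurcating trees: 1 * 3 * 5 * ... * (2n - 5)."""
    count = 1
    for odd in range(3, 2 * n_species - 4, 2):
        count *= odd
    return count

for n in (5, 10, 11, 20):
    print(n, unrooted_bifurcating_trees(n))
# 10 species already give 2027025 trees
```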
2151	<P>
2152	For NEIGHBOR the speed should vary only as the square of the number of
2153	species, so a case twice as large will take only four times as long. This
2154	will make it an attractive alternative to FITCH and KITSCH for large data
2155	sets.
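<P>
To make the contrast concrete, the sketch below extrapolates the 10-species timings for the Hennigian data in the table above (0.20 seconds for FITCH, 0.003 seconds for NEIGHBOR) using the fourth-power and square rules. The names are hypothetical, and the results are rough guides only, since the rules ignore fixed overheads.

```python
BASE_SPECIES = 10
BASE_SECONDS = {"FITCH": 0.20, "NEIGHBOR": 0.003}   # Hennigian data, 10 species
EXPONENT = {"FITCH": 4, "NEIGHBOR": 2}              # growth rules from the text

def extrapolated_seconds(program, n_species):
    """Scale the 10-species timing by the program's power of the species count."""
    scale = n_species / BASE_SPECIES
    return BASE_SECONDS[program] * scale ** EXPONENT[program]

for n in (20, 40, 80):
    print(n,
          round(extrapolated_seconds("FITCH", n), 3),
          round(extrapolated_seconds("NEIGHBOR", n), 3))
```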
2156	<P>
2157	<B>Note:</B> If you are unsure of how long a program will take, try it first on
2158	a few species, then work your way up until you get a feel for the speed
2159	and for what size of problem you can afford to run.
2160	<P>
2161	Execution time is not the most important criterion for a program,
2162	particularly as computer time gets much cheaper than your time or a
2163	programmer's time. With workstations on which background jobs can be run
2164	all night, execution speed is not overwhelmingly relevant. Some of us have been
2165	conditioned by an earlier era of computing to consider execution speed
2166	paramount. But ease of use, ease of adaptation to your computer system,
2167	and ease of modification are much more important in practice, and in
2168	these respects I think these programs are adequate. Only if you are
2169	engaged in 1960's style mainframe computing, or if you have very large
2170	amounts of data is minimization of execution
2171	time paramount.
2172	<P>
2173	Nevertheless it would have been nice to have made the programs
2174	faster. The present speeds are a compromise between speed and
2175	effectiveness: by making them slower and trying more rearrangements in the
2176	trees, or by enumerating all possible trees, I could have made the programs
2177	more likely to find the best tree. By trying fewer rearrangements I
2178	could have speeded them up, but at the cost of finding worse trees. I
2179	could also have speeded them up by writing critical sections in assembly
2180	language, but this would have sacrificed ease of distribution to new
2181	computer systems. There are also some options included in these programs that
2182	make it
2183	harder to adopt some of the economies of bookkeeping that make other programs
2184	faster. However, to some extent I have simply made the decision not to spend
2185	time trying to speed up program bookkeeping when there were new likelihood and
2186	statistical methods to be developed.
2187	<BR>
2188	<P>
2189	<H3>Relative speed of different machines</H3>
2190	<P>
2191	It is interesting to compare different machines using DNAPARS as the
2192	standard task. One can rate a machine on the DNAPARS benchmark by summing the
2193	times for all three of the data sets. Here are relative total timings over
2194	all three data sets (done with various versions of DNAPARS) for some machines,
2195	taking a Pentium MMX 266 notebook computer running Linux with gcc as the
2196	standard. Benchmarks from versions 3.4 and 3.5 of the program are
2197	included (respectively the Pascal and C versions); their timings are in
2198	parentheses. They are compared only with each other and are scaled to the
2199	rest of the timings using the joint runs on the 386SX and the Pentium MMX 266.
2200	This use of separate standards is necessary not
2201	because of different languages but because different versions of the package
2202	are being compared. Thus, the "Time" is the ratio of the Total to that for
2203	the Pentium, adjusted by the scalings of machines using 3.4 and 3.5 when
2204	appropriate. The Relative Speed is the reciprocal of the Time.
2205	<P>
2206	<DIV ALIGN="CENTER">
2207	<TABLE CELLPADDING=3 BORDER="1">
2208	<TR><TD ALIGN="LEFT"><B>Machine</B></TD>
2209	<TD ALIGN="LEFT"><B>Operating<BR>System</B></TD>
2210	<TD ALIGN="LEFT"><B>Compiler</B></TD>
2211	<TD ALIGN="LEFT"><B>Total</B></TD>
2212	<TD ALIGN="LEFT"><B>Time</B></TD>
2213	<TD ALIGN="LEFT"><B>Relative<BR>Speed</B></TD>
2214	</TR>
2215	<TR><TD ALIGN="LEFT">Toshiba T1100+</TD>
2216	<TD ALIGN="LEFT">MSDOS</TD>
2217	<TD ALIGN="LEFT">Turbo Pascal 3.01A</TD>
2218	<TD ALIGN="LEFT">(269)</TD>
2219	<TD ALIGN="LEFT">1758.2</TD>
2220	<TD ALIGN="LEFT">0.0005688</TD>
2221	</TR>
2222	<TR><TD ALIGN="LEFT">Apple Mac Plus</TD>
2223	<TD ALIGN="LEFT">MacOS</TD>
2224	<TD ALIGN="LEFT">Lightspeed Pascal 2</TD>
2225	<TD ALIGN="LEFT">(175.84)</TD>
2226	<TD ALIGN="LEFT">1149.3</TD>
2227	<TD ALIGN="LEFT">0.0008701</TD>
2228	</TR>
2229	<TR><TD ALIGN="LEFT">Toshiba T1100+</TD>
2230	<TD ALIGN="LEFT">MSDOS</TD>
2231	<TD ALIGN="LEFT">Turbo Pascal 5.0</TD>
2232	<TD ALIGN="LEFT">(162)</TD>
2233	<TD ALIGN="LEFT">1058.9</TD>
2234	<TD ALIGN="LEFT">0.0009443</TD>
2235	</TR>
2236	<TR><TD ALIGN="LEFT">Macintosh Classic</TD>
2237	<TD ALIGN="LEFT">MacOS</TD>
2238	<TD ALIGN="LEFT">Think Pascal 3</TD>
2239	<TD ALIGN="LEFT">(160)</TD>
2240	<TD ALIGN="LEFT">1045.8</TD>
2241	<TD ALIGN="LEFT">0.0009562</TD>
2242	</TR>
2243	<TR><TD ALIGN="LEFT">Macintosh Classic</TD>
2244	<TD ALIGN="LEFT">MacOS</TD>
2245	<TD ALIGN="LEFT">Think C</TD>
2246	<TD ALIGN="LEFT">(43.0)</TD>
2247	<TD ALIGN="LEFT">795.6</TD>
2248	<TD ALIGN="LEFT">0.0012569</TD>
2249	</TR>
2250	<TR><TD ALIGN="LEFT">IBM PS2/60</TD>
2251	<TD ALIGN="LEFT">MSDOS</TD>
2252	<TD ALIGN="LEFT">Turbo Pascal 5.0</TD>
2253	<TD ALIGN="LEFT">(58.76)</TD>
2254	<TD ALIGN="LEFT">384.00</TD>
2255	<TD ALIGN="LEFT">0.002604</TD>
2256	</TR>
2257	<TR><TD ALIGN="LEFT">80286 (12 MHz)</TD>
2258	<TD ALIGN="LEFT">MSDOS</TD>
2259	<TD ALIGN="LEFT">Turbo Pascal 5.0</TD>
2260	<TD ALIGN="LEFT">(47.09)</TD>
2261	<TD ALIGN="LEFT">307.77</TD>
2262	<TD ALIGN="LEFT">0.003249</TD>
2263	</TR>
2264	<TR><TD ALIGN="LEFT">Apple Mac IIcx</TD>
2265	<TD ALIGN="LEFT">MacOS</TD>
2266	<TD ALIGN="LEFT">Think Pascal 3</TD>
2267	<TD ALIGN="LEFT">(42)</TD>
2268	<TD ALIGN="LEFT">274.44</TD>
2269	<TD ALIGN="LEFT">0.003644</TD>
2270	</TR>
2271	<TR><TD ALIGN="LEFT">Apple Mac SE/30</TD>
2272	<TD ALIGN="LEFT">MacOS</TD>
2273	<TD ALIGN="LEFT">Think Pascal 3</TD>
2274	<TD ALIGN="LEFT">(42)</TD>
2275	<TD ALIGN="LEFT">274.44</TD>
2276	<TD ALIGN="LEFT">0.003644</TD>
2277	</TR>
2278	<TR><TD ALIGN="LEFT">Apple Mac IIcx</TD>
2279	<TD ALIGN="LEFT">MacOS</TD>
2280	<TD ALIGN="LEFT">Lightspeed Pascal 2</TD>
2281	<TD ALIGN="LEFT">(39.84)</TD>
2282	<TD ALIGN="LEFT">260.44</TD>
2283	<TD ALIGN="LEFT">0.003840</TD>
2284	</TR>
2285	<TR><TD ALIGN="LEFT">Apple Mac IIcx</TD>
2286	<TD ALIGN="LEFT">MacOS</TD>
2287	<TD ALIGN="LEFT">Lightspeed Pascal 2#</TD>
2288	<TD ALIGN="LEFT">(39.69)</TD>
2289	<TD ALIGN="LEFT">259.33</TD>
2290	<TD ALIGN="LEFT">0.003856</TD>
2291	</TR>
2292	<TR><TD ALIGN="LEFT">Zenith Z386 (16MHz)</TD>
2293	<TD ALIGN="LEFT">MSDOS</TD>
2294	<TD ALIGN="LEFT">Turbo Pascal 5.0</TD>
2295	<TD ALIGN="LEFT">(38.27)</TD>
2296	<TD ALIGN="LEFT">256.67</TD>
2297	<TD ALIGN="LEFT">0.003896</TD>
2298	</TR>
2299	<TR><TD ALIGN="LEFT">Macintosh SE/30</TD>
2300	<TD ALIGN="LEFT">MacOS</TD>
2301	<TD ALIGN="LEFT">Think C</TD>
2302	<TD ALIGN="LEFT">(13.6)</TD>
2303	<TD ALIGN="LEFT">251.56</TD>
2304	<TD ALIGN="LEFT">0.003975</TD>
2305	</TR>
2306	<TR><TD ALIGN="LEFT">386SX (16 MHz)</TD>
2307	<TD ALIGN="LEFT">MSDOS</TD>
2308	<TD ALIGN="LEFT">Turbo Pascal 6.0</TD>
2309	<TD ALIGN="LEFT">(34)</TD>
2310	<TD ALIGN="LEFT">222.41</TD>
2311	<TD ALIGN="LEFT">0.004496</TD>
2312	</TR>
2313	<TR><TD ALIGN="LEFT">386SX (16 MHz)</TD>
2314	<TD ALIGN="LEFT">MSDOS</TD>
2315	<TD ALIGN="LEFT">Microsoft Quick C</TD>
2316	<TD ALIGN="LEFT">(12.01)</TD>
2317	<TD ALIGN="LEFT">222.41</TD>
2318	<TD ALIGN="LEFT">0.004496</TD>
2319	</TR>
2320	<TR><TD ALIGN="LEFT">Sequent S-81</TD>
2321	<TD ALIGN="LEFT">DYNIX</TD>
2322	<TD ALIGN="LEFT">Silicon Valley Pascal</TD>
2323	<TD ALIGN="LEFT">(13.0)</TD>
2324	<TD ALIGN="LEFT">84.89</TD>
2325	<TD ALIGN="LEFT">0.011780</TD>
2326	</TR>
2327	<TR><TD ALIGN="LEFT">VAX 11/785</TD>
2328	<TD ALIGN="LEFT">Unix</TD>
2329	<TD ALIGN="LEFT">Berkeley Pascal</TD>
2330	<TD ALIGN="LEFT">(11.9)</TD>
2331	<TD ALIGN="LEFT">77.77</TD>
2332	<TD ALIGN="LEFT">0.012857</TD>
2333	</TR>
2334	<TR><TD ALIGN="LEFT">80486-33</TD>
2335	<TD ALIGN="LEFT">MSDOS</TD>
2336	<TD ALIGN="LEFT">Turbo Pascal 6.0</TD>
2337	<TD ALIGN="LEFT">(11.46)</TD>
2338	<TD ALIGN="LEFT">74.89</TD>
2339	<TD ALIGN="LEFT">0.013353</TD>
2340	</TR>
2341	<TR><TD ALIGN="LEFT">Sun 3/60</TD>
2342	<TD ALIGN="LEFT">SunOS</TD>
2343	<TD ALIGN="LEFT">Sun C</TD>
2344	<TD ALIGN="LEFT">(3.93)</TD>
2345	<TD ALIGN="LEFT">72.67</TD>
2346	<TD ALIGN="LEFT">0.013761</TD>
2347	</TR>
2348	<TR><TD ALIGN="LEFT">NeXT Cube (68030)</TD>
2349	<TD ALIGN="LEFT">Mach</TD>
2350	<TD ALIGN="LEFT">Gnu C</TD>
2351	<TD ALIGN="LEFT">(2.608)</TD>
2352	<TD ALIGN="LEFT">48.256</TD>
2353	<TD ALIGN="LEFT">0.02072</TD>
2354	</TR>
2355	<TR><TD ALIGN="LEFT">Sequent S-81</TD>
2356	<TD ALIGN="LEFT">DYNIX</TD>
2357	<TD ALIGN="LEFT">Sequent Symmetry C</TD>
2358	<TD ALIGN="LEFT">(2.604)</TD>
2359	<TD ALIGN="LEFT">48.182</TD>
2360	<TD ALIGN="LEFT">0.02075</TD>
2361	</TR>
2362	<TR><TD ALIGN="LEFT">VAXstation 3500</TD>
2363	<TD ALIGN="LEFT">Unix</TD>
2364	<TD ALIGN="LEFT">Berkeley Pascal</TD>
2365	<TD ALIGN="LEFT">(7.3)</TD>
2366	<TD ALIGN="LEFT">47.777</TD>
2367	<TD ALIGN="LEFT">0.02093</TD>
2368	</TR>
2369	<TR><TD ALIGN="LEFT">Sequent S-81</TD>
2370	<TD ALIGN="LEFT">DYNIX</TD>
2371	<TD ALIGN="LEFT">Berkeley Pascal</TD>
2372	<TD ALIGN="LEFT">(5.6)</TD>
2373	<TD ALIGN="LEFT">36.600</TD>
2374	<TD ALIGN="LEFT">0.02732</TD>
2375	</TR>
2376	<TR><TD ALIGN="LEFT">Unisys 7000/40</TD>
2377	<TD ALIGN="LEFT">Unix</TD>
2378	<TD ALIGN="LEFT">Berkeley Pascal</TD>
2379	<TD ALIGN="LEFT">(5.24)</TD>
2380	<TD ALIGN="LEFT">34.244</TD>
2381	<TD ALIGN="LEFT">0.02920</TD>
2382	</TR>
2383	<TR><TD ALIGN="LEFT">VAX 8600</TD>
2384	<TD ALIGN="LEFT">VMS</TD>
2385	<TD ALIGN="LEFT">DEC VAX Pascal</TD>
2386	<TD ALIGN="LEFT">(3.96)</TD>
2387	<TD ALIGN="LEFT">25.889</TD>
2388	<TD ALIGN="LEFT">0.03863</TD>
2389	</TR>
2390	<TR><TD ALIGN="LEFT">Sun SPARC IPX</TD>
2391	<TD ALIGN="LEFT">SunOS</TD>
2392	<TD ALIGN="LEFT">Gnu C version 2.1</TD>
2393	<TD ALIGN="LEFT">(1.28)</TD>
2394	<TD ALIGN="LEFT">23.689</TD>
2395	<TD ALIGN="LEFT">0.04221</TD>
2396	</TR>
2397	<TR><TD ALIGN="LEFT">VAX 6000-530</TD>
2398	<TD ALIGN="LEFT">VMS</TD>
2399	<TD ALIGN="LEFT">DEC C</TD>
2400	<TD ALIGN="LEFT">(0.858)</TD>
2401	<TD ALIGN="LEFT">15.867</TD>
2402	<TD ALIGN="LEFT">0.06303</TD>
2403	</TR>
2404	<TR><TD ALIGN="LEFT">VAXstation 4000</TD>
2405	<TD ALIGN="LEFT">VMS</TD>
2406	<TD ALIGN="LEFT">DEC C</TD>
2407	<TD ALIGN="LEFT">(0.809)</TD>
2408	<TD ALIGN="LEFT">14.978</TD>
2409	<TD ALIGN="LEFT">0.06677</TD>
2410	</TR>
2411	<TR><TD ALIGN="LEFT">IBM RS/6000 540</TD>
2412	<TD ALIGN="LEFT">AIX</TD>
2413	<TD ALIGN="LEFT">XLP Pascal</TD>
2414	<TD ALIGN="LEFT">(2.276)</TD>
2415	<TD ALIGN="LEFT">14.866</TD>
2416	<TD ALIGN="LEFT">0.06726</TD>
2417	</TR>
2418	<TR><TD ALIGN="LEFT">NeXTstation(040/25)</TD>
2419	<TD ALIGN="LEFT">Mach</TD>
2420	<TD ALIGN="LEFT">Gnu C</TD>
2421	<TD ALIGN="LEFT">(0.75)</TD>
2422	<TD ALIGN="LEFT">13.867</TD>
2423	<TD ALIGN="LEFT">0.07212</TD>
2424	</TR>
2425	<TR><TD ALIGN="LEFT">Sun SPARC IPX</TD>
2426	<TD ALIGN="LEFT">SunOS</TD>
2427	<TD ALIGN="LEFT">Sun C</TD>
2428	<TD ALIGN="LEFT">(0.68)</TD>
2429	<TD ALIGN="LEFT">12.580</TD>
2430	<TD ALIGN="LEFT">0.07951</TD>
2431	</TR>
2432	<TR><TD ALIGN="LEFT">486DX (33 MHz)</TD>
2433	<TD ALIGN="LEFT">Linux</TD>
2434	<TD ALIGN="LEFT">Gnu C #</TD>
2435	<TD ALIGN="LEFT">(0.63)</TD>
2436	<TD ALIGN="LEFT">11.666</TD>
2437	<TD ALIGN="LEFT">0.08571</TD>
2438	</TR>
2439	<TR><TD ALIGN="LEFT">Sun SPARCstation-1</TD>
2440	<TD ALIGN="LEFT">Unix</TD>
2441	<TD ALIGN="LEFT">Sun Pascal</TD>
2442	<TD ALIGN="LEFT">(1.7)</TD>
2443	<TD ALIGN="LEFT">11.111</TD>
2444	<TD ALIGN="LEFT">0.09000</TD>
2445	</TR>
2446	<TR><TD ALIGN="LEFT">DECstation 5000/200</TD>
2447	<TD ALIGN="LEFT">Unix</TD>
2448	<TD ALIGN="LEFT">DEC Ultrix C</TD>
2449	<TD ALIGN="LEFT">(0.45)</TD>
2450	<TD ALIGN="LEFT">8.333</TD>
2451	<TD ALIGN="LEFT">0.12000</TD>
2452	</TR>
2453	<TR><TD ALIGN="LEFT">Sun SPARC 1+</TD>
2454	<TD ALIGN="LEFT">SunOS</TD>
2455	<TD ALIGN="LEFT">Sun C</TD>
2456	<TD ALIGN="LEFT">(0.40)</TD>
2457	<TD ALIGN="LEFT">7.400</TD>
2458	<TD ALIGN="LEFT">0.13513</TD>
2459	</TR>
2460	<TR><TD ALIGN="LEFT">DECstation 3100</TD>
2461	<TD ALIGN="LEFT">Unix</TD>
2462	<TD ALIGN="LEFT">DEC Ultrix Pascal</TD>
2463	<TD ALIGN="LEFT">(0.77)</TD>
2464	<TD ALIGN="LEFT">5.022</TD>
2465	<TD ALIGN="LEFT">0.1991</TD>
2466	</TR>
2467	<TR><TD ALIGN="LEFT">IBM 3090-300E</TD>
2468	<TD ALIGN="LEFT">AIX</TD>
2469	<TD ALIGN="LEFT">Metaware High C</TD>
2470	<TD ALIGN="LEFT">(0.27)</TD>
2471	<TD ALIGN="LEFT">5.000</TD>
2472	<TD ALIGN="LEFT">0.2000</TD>
2473	</TR>
2474	<TR><TD ALIGN="LEFT">DECstation 5000/125</TD>
2475	<TD ALIGN="LEFT">Unix</TD>
2476	<TD ALIGN="LEFT">DEC Ultrix C</TD>
2477	<TD ALIGN="LEFT">(0.267)</TD>
2478	<TD ALIGN="LEFT">4.933</TD>
2479	<TD ALIGN="LEFT">0.2027</TD>
2480	</TR>
2481	<TR><TD ALIGN="LEFT">DECstation 5000/200</TD>
2482	<TD ALIGN="LEFT">Unix</TD>
2483	<TD ALIGN="LEFT">DEC Ultrix C</TD>
2484	<TD ALIGN="LEFT">(0.256)</TD>
2485	<TD ALIGN="LEFT">4.733</TD>
2486	<TD ALIGN="LEFT">0.2113</TD>
2487	</TR>
2488	<TR><TD ALIGN="LEFT">Sun SPARC 4/50</TD>
2489	<TD ALIGN="LEFT">SunOS</TD>
2490	<TD ALIGN="LEFT">Sun C</TD>
2491	<TD ALIGN="LEFT">(0.249)</TD>
2492	<TD ALIGN="LEFT">4.607</TD>
2493	<TD ALIGN="LEFT">0.2171</TD>
2494	</TR>
2495	<TR><TD ALIGN="LEFT">DEC 3000/400 AXP</TD>
2496	<TD ALIGN="LEFT">Unix</TD>
2497	<TD ALIGN="LEFT">DEC C</TD>
2498	<TD ALIGN="LEFT">(0.224)</TD>
2499	<TD ALIGN="LEFT">4.144</TD>
2500	<TD ALIGN="LEFT">0.2413</TD>
2501	</TR>
2502	<TR><TD ALIGN="LEFT">DECstation 5000/240</TD>
2503	<TD ALIGN="LEFT">Unix</TD>
2504	<TD ALIGN="LEFT">DEC Ultrix C</TD>
2505	<TD ALIGN="LEFT">(0.1889)</TD>
2506	<TD ALIGN="LEFT">3.496</TD>
2507	<TD ALIGN="LEFT">0.2861</TD>
2508	</TR>
2509	<TR><TD ALIGN="LEFT">SGI Iris R4000</TD>
2510	<TD ALIGN="LEFT">Unix</TD>
2511	<TD ALIGN="LEFT">SGI C</TD>
2512	<TD ALIGN="LEFT">(0.184)</TD>
2513	<TD ALIGN="LEFT">3.404</TD>
2514	<TD ALIGN="LEFT">0.2937</TD>
2515	</TR>
2516	<TR><TD ALIGN="LEFT">IBM 3090-300E</TD>
2517	<TD ALIGN="LEFT">VM</TD>
2518	<TD ALIGN="LEFT">Pascal VS</TD>
2519	<TD ALIGN="LEFT">(0.464)</TD>
2520	<TD ALIGN="LEFT">3.022</TD>
2521	<TD ALIGN="LEFT">0.3309</TD>
2522	</TR>
2523	<TR><TD ALIGN="LEFT">DECstation 5000/200</TD>
2524	<TD ALIGN="LEFT">Unix</TD>
2525	<TD ALIGN="LEFT">DEC Ultrix Pascal</TD>
2526	<TD ALIGN="LEFT">(0.39)</TD>
2527	<TD ALIGN="LEFT">2.533</TD>
2528	<TD ALIGN="LEFT">0.3947</TD>
2529	</TR>
2530	<TR><TD ALIGN="LEFT">Pentium 120</TD>
2531	<TD ALIGN="LEFT">Linux</TD>
2532	<TD ALIGN="LEFT">Gnu C</TD>
2533	<TD ALIGN="LEFT">1.848</TD>
2534	<TD ALIGN="LEFT">1.994</TD>
2535	<TD ALIGN="LEFT">0.5016</TD>
2536	</TR>
2537	<TR><TD ALIGN="LEFT">Pentium Pro 180</TD>
2538	<TD ALIGN="LEFT">Linux</TD>
2539	<TD ALIGN="LEFT">Gnu C</TD>
2540	<TD ALIGN="LEFT">1.009</TD>
2541	<TD ALIGN="LEFT">1.088</TD>
2542	<TD ALIGN="LEFT">0.9191</TD>
2543	</TR>
2544	<TR><TD ALIGN="LEFT">Pentium 266 MMX</TD>
2545	<TD ALIGN="LEFT">Linux</TD>
2546	<TD ALIGN="LEFT">Gnu C (PHYLIP 3.5)</TD>
2547	<TD ALIGN="LEFT">(0.054)</TD>
2548	<TD ALIGN="LEFT">1.0</TD>
2549	<TD ALIGN="LEFT">1.0</TD>
2550	</TR>
2551	<TR><TD ALIGN="LEFT">Pentium 266 MMX</TD>
2552	<TD ALIGN="LEFT">Linux</TD>
2553	<TD ALIGN="LEFT">Gnu C</TD>
2554	<TD ALIGN="LEFT">0.927</TD>
2555	<TD ALIGN="LEFT">1.0</TD>
2556	<TD ALIGN="LEFT">1.0</TD>
2557	</TR>
2558	<TR><TD ALIGN="LEFT">Pentium 200</TD>
2559	<TD ALIGN="LEFT">Linux</TD>
2560	<TD ALIGN="LEFT">Gnu C</TD>
2561	<TD ALIGN="LEFT">0.853</TD>
2562	<TD ALIGN="LEFT">0.9202</TD>
2563	<TD ALIGN="LEFT">1.0867</TD>
2564	</TR>
2565	<TR><TD ALIGN="LEFT">SGI PowerChallenge</TD>
2566	<TD ALIGN="LEFT">Irix</TD>
2567	<TD ALIGN="LEFT">Gnu C</TD>
2568	<TD ALIGN="LEFT">0.844</TD>
2569	<TD ALIGN="LEFT">0.9297</TD>
2570	<TD ALIGN="LEFT">1.0756</TD>
2571	</TR>
2572	<TR><TD ALIGN="LEFT">DEC Alpha 400 4/233</TD>
2573	<TD ALIGN="LEFT">DUNIX</TD>
2574	<TD ALIGN="LEFT">Digital C (cc -fast)</TD>
2575	<TD ALIGN="LEFT">0.730</TD>
2576	<TD ALIGN="LEFT">0.7875</TD>
2577	<TD ALIGN="LEFT">1.2699</TD>
2578	</TR>
2579	<TR><TD ALIGN="LEFT">Pentium II 500</TD>
2580	<TD ALIGN="LEFT">Linux</TD>
2581	<TD ALIGN="LEFT">Gnu C</TD>
2582	<TD ALIGN="LEFT">0.368</TD>
2583	<TD ALIGN="LEFT">0.4053</TD>
2584	<TD ALIGN="LEFT">2.467</TD>
2585	</TR>
2586	<TR><TD ALIGN="LEFT">Compaq/Digital Alpha 500au</TD>
2587	<TD ALIGN="LEFT">DUNIX</TD>
2588	<TD ALIGN="LEFT">Digital C (cc -fast)</TD>
2589	<TD ALIGN="LEFT">0.167</TD>
2590	<TD ALIGN="LEFT">0.1805</TD>
2591	<TD ALIGN="LEFT">5.541</TD>
2592	</TR>
2593	</TABLE>
2594	</DIV>
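<P>
The Time and Relative Speed columns can be recomputed from the Total column. The sketch below does this for one unparenthesized entry and one parenthesized (version 3.5) entry, each divided by the Pentium 266 MMX total for the same version. The function and constant names are hypothetical; the extra rescaling of the version 3.4 Pascal timings through the 386SX runs is not shown.

```python
STANDARD_CURRENT = 0.927   # Pentium 266 MMX total, unparenthesized entries
STANDARD_V35 = 0.054       # Pentium 266 MMX total, PHYLIP 3.5 (parenthesized)

def time_and_speed(total, standard):
    """Return (Time, Relative Speed) for a machine's Total."""
    time_ratio = total / standard
    return time_ratio, 1.0 / time_ratio

p120_time, p120_speed = time_and_speed(1.848, STANDARD_CURRENT)   # Pentium 120
dx_time, dx_speed = time_and_speed(0.63, STANDARD_V35)            # 486DX (33 MHz)

print(round(p120_time, 3), round(p120_speed, 4))
print(round(dx_time, 3), round(dx_speed, 5))
```
<P>
The results should agree with the corresponding table rows to within rounding in the last digit.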
2595	<P>
2596	This benchmark not only reflects integer performance of these machines
2597	(as DNAPARS has few floating-point operations) but also the efficiency
2598	of the compilers. Some of the machines (the DEC 3000/400 AXP
2599	and the IBM RS/6000, in particular) are much faster than this benchmark
2600	would indicate. The numerical programs benchmark below gives them a
2601	fairer test. The speed of the Compaq/Digital Alpha 500au is exaggerated because,
2602	although its compiles are optimized for that processor, the Pentium
2603	compiles are not similarly optimized.
2604	<P>
2605	Note that parallel machines like the Sequent and the SGI PowerChallenge are not
2606	really as slow as indicated by the data here, as these runs did nothing to take
2607	advantage of their parallelism.
2608	<P>
2609	These benchmarks have now extended over 13 years, and in the DNAPARS
2610	benchmark they extend over a range of 8000-fold in speed!
2611	The experience of our laboratory, which seems typical, is that
2612	computer power grows by a factor of about 1.85 per year. This is
2613	roughly consistent with these benchmarks.
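<P>
The comparison between the yearly growth factor and the observed range is a matter of compounding. The sketch below computes both directions, using only the figures quoted above (1.85 per year, 13 years, 8000-fold); the variable names are hypothetical.

```python
annual_factor = 1.85
years = 13
observed_range = 8000

# Range predicted by compounding the yearly factor over the whole period.
predicted_range = annual_factor ** years

# Yearly factor that would be needed to produce the observed range.
implied_factor = observed_range ** (1 / years)

print(round(predicted_range), round(implied_factor, 2))
```
<P>
Compounding 1.85 over 13 years gives a range of roughly 3000-fold, and an 8000-fold range corresponds to a yearly factor of about 2.0, so the two agree in order of magnitude.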
2614	<P>
2615	For a picture of speeds for a more numerically intensive program,
2616	here are benchmarks using DNAML, with the Pentium MMX 266
2617	as the standard. Some of the timings, the ones in parentheses, are
2618	using PHYLIP version 3.5, and those are compared to that version run on
2619	the Pentium 266. Runs using the PHYLIP 3.4 Pascal version are adjusted
2620	using the 386SX timings where both were run. Numbers are
2621	total run times (total user time in the case of Unix) over all three data sets.
2622	<P>
2623	<DIV ALIGN="CENTER">
2624	<TABLE CELLPADDING=3 BORDER="1">
2625	<TR><TD ALIGN="LEFT"><B>Machine</B></TD>
2626	<TD ALIGN="LEFT"><B>Operating<BR>System</B></TD>
2627	<TD ALIGN="LEFT"><B>Compiler</B></TD>
2628	<TD ALIGN="RIGHT"><B>Seconds</B></TD>
2629	<TD ALIGN="LEFT"><B>Time</B></TD>
2630	<TD ALIGN="RIGHT"><B>Relative<BR>Speed</B></TD>
2631	</TR>
2632	<TR><TD ALIGN="LEFT">386SX 16 MHz</TD>
2633	<TD ALIGN="LEFT">PCDOS</TD>
2634	<TD ALIGN="LEFT">Turbo Pascal 6</TD>
2635	<TD ALIGN="RIGHT">(7826)</TD>
2636	<TD ALIGN="LEFT"> 181.18</TD>
2637	<TD ALIGN="RIGHT">0.005519</TD>
2638	</TR>
2639	<TR><TD ALIGN="LEFT">386SX 16 MHz</TD>
2640	<TD ALIGN="LEFT">PCDOS</TD>
2641	<TD ALIGN="LEFT">Quick C</TD>
2642	<TD ALIGN="RIGHT">(6549.79)</TD>
2643	<TD ALIGN="LEFT"> 181.18</TD>
2644	<TD ALIGN="RIGHT">0.005519</TD>
2645	</TR>
2646	<TR><TD ALIGN="LEFT">Compudyne 486DX/33</TD>
2647	<TD ALIGN="LEFT">Linux</TD>
2648	<TD ALIGN="LEFT">Gnu C</TD>
2649	<TD ALIGN="RIGHT">(1599.9)</TD>
2650	<TD ALIGN="LEFT"> 44.26</TD>
2651	<TD ALIGN="RIGHT">0.022595</TD>
2652	</TR>
2653	<TR><TD ALIGN="LEFT">SUN Sparcstation 1+</TD>
2654	<TD ALIGN="LEFT">SunOS</TD>
2655	<TD ALIGN="LEFT">Sun C</TD>
2656	<TD ALIGN="RIGHT">(1402.8)</TD>
2657	<TD ALIGN="LEFT"> 38.805</TD>
2658	<TD ALIGN="RIGHT">0.025770</TD>
2659	</TR>
2660	<TR><TD ALIGN="LEFT">Everex STEP 386/20</TD>
2661	<TD ALIGN="LEFT">PCDOS</TD>
2662	<TD ALIGN="LEFT">Turbo Pascal 5.5</TD>
2663	<TD ALIGN="RIGHT">(1440.8)</TD>
2664	<TD ALIGN="LEFT"> 33.356</TD>
2665	<TD ALIGN="RIGHT"> 0.029980</TD>
2666	</TR>
2667	<TR><TD ALIGN="LEFT">486DX/33</TD>
2668	<TD ALIGN="LEFT">PCDOS</TD>
2669	<TD ALIGN="LEFT">Turbo C++</TD>
2670	<TD ALIGN="RIGHT">(1107.2)</TD>
2671	<TD ALIGN="LEFT"> 30.628</TD>
2672	<TD ALIGN="RIGHT">0.032650</TD>
2673	</TR>
2674	<TR><TD ALIGN="LEFT">Compudyne 486DX/33</TD>
2675	<TD ALIGN="LEFT">PCDOS</TD>
2676	<TD ALIGN="LEFT">Waterloo C/386</TD>
2677	<TD ALIGN="RIGHT">(1045.78)</TD>
2678	<TD ALIGN="LEFT"> 28.929</TD>
2679	<TD ALIGN="RIGHT">0.034567</TD>
2680	</TR>
2681	<TR><TD ALIGN="LEFT">Sun SPARCstation IPX</TD>
2682	<TD ALIGN="LEFT">SunOS</TD>
2683	<TD ALIGN="LEFT">Gnu C</TD>
2684	<TD ALIGN="RIGHT"> (960.2)</TD>
2685	<TD ALIGN="LEFT"> 26.562</TD>
2686	<TD ALIGN="RIGHT">0.037648</TD>
2687	</TR>
2688	<TR><TD ALIGN="LEFT">NeXTstation(68040/25)</TD>
2689	<TD ALIGN="LEFT">Mach</TD>
2690	<TD ALIGN="LEFT">Gnu C</TD>
2691	<TD ALIGN="RIGHT"> (916.6)</TD>
2692	<TD ALIGN="LEFT"> 25.355</TD>
2693	<TD ALIGN="RIGHT">0.039439</TD>
2694	</TR>
2695	<TR><TD ALIGN="LEFT">486DX/33</TD>
2696	<TD ALIGN="LEFT">PCDOS</TD>
2697	<TD ALIGN="LEFT">Waterloo C/386</TD>
2698	<TD ALIGN="RIGHT"> (861.0)</TD>
2699	<TD ALIGN="LEFT"> 23.817</TD>
2700	<TD ALIGN="RIGHT">0.041986</TD>
2701	</TR>
2702	<TR><TD ALIGN="LEFT">Sun SPARCstation IPX</TD>
2703	<TD ALIGN="LEFT">SunOS</TD>
2704	<TD ALIGN="LEFT">Sun C</TD>
2705	<TD ALIGN="RIGHT"> (787.7)</TD>
2706	<TD ALIGN="LEFT"> 21.790</TD>
2707	<TD ALIGN="RIGHT">0.045893</TD>
2708	</TR>
2709	<TR><TD ALIGN="LEFT">486DX/33</TD>
2710	<TD ALIGN="LEFT">PCDOS</TD>
2711	<TD ALIGN="LEFT">Gnu C</TD>
2712	<TD ALIGN="RIGHT"> (650.9)</TD>
2713	<TD ALIGN="LEFT"> 18.006</TD>
2714	<TD ALIGN="RIGHT">0.05554</TD>
2715	</TR>
2716	<TR><TD ALIGN="LEFT">VAX 6000-530</TD>
2717	<TD ALIGN="LEFT">VMS</TD>
2718	<TD ALIGN="LEFT">DEC C</TD>
2719	<TD ALIGN="RIGHT"> (637.0)</TD>
2720	<TD ALIGN="LEFT"> 17.621</TD>
2721	<TD ALIGN="RIGHT">0.05675</TD>
2722	</TR>
2723	<TR><TD ALIGN="LEFT">DECstation 5000/200</TD>
2724	<TD ALIGN="LEFT">Unix</TD>
2725	<TD ALIGN="LEFT">DEC Ultrix RISC C</TD>
2726	<TD ALIGN="RIGHT"> (423.3)</TD>
2727	<TD ALIGN="LEFT"> 11.710</TD>
2728	<TD ALIGN="RIGHT">0.08540</TD>
2729	</TR>
2730	<TR><TD ALIGN="LEFT">IBM 3090-300E</TD>
2731	<TD ALIGN="LEFT">AIX</TD>
2732	<TD ALIGN="LEFT">Metaware High C</TD>
2733	<TD ALIGN="RIGHT"> (201.8)</TD>
2734	<TD ALIGN="LEFT"> 5.582</TD>
2735	<TD ALIGN="RIGHT">0.17914</TD>
2736	</TR>
2737	<TR><TD ALIGN="LEFT">Convex C240/1024</TD>
2738	<TD ALIGN="LEFT">Unix</TD>
2739	<TD ALIGN="LEFT">C</TD>
2740	<TD ALIGN="RIGHT"> (101.6)</TD>
2741	<TD ALIGN="LEFT"> 2.8105</TD>
2742	<TD ALIGN="RIGHT">0.35581</TD>
2743	</TR>
2744	<TR><TD ALIGN="LEFT">DEC 3000/400 AXP</TD>
2745	<TD ALIGN="LEFT">Unix</TD>
2746	<TD ALIGN="LEFT">DEC C</TD>
2747	<TD ALIGN="RIGHT"> (98.29)</TD>
2748	<TD ALIGN="LEFT"> 2.7189</TD>
2749	<TD ALIGN="RIGHT">0.36779</TD>
2750	</TR>
2751	<TR><TD ALIGN="LEFT">Pentium 120</TD>
2752	<TD ALIGN="LEFT">Linux</TD>
2753	<TD ALIGN="LEFT">Gnu C</TD>
2754	<TD ALIGN="RIGHT">25.26</TD>
2755	<TD ALIGN="LEFT">3.3906</TD>
2756	<TD ALIGN="RIGHT">0.29493</TD>
2757	</TR>
2758	<TR><TD ALIGN="LEFT">Pentium Pro 180</TD>
2759	<TD ALIGN="LEFT">Linux</TD>
2760	<TD ALIGN="LEFT">Gnu C</TD>
2761	<TD ALIGN="RIGHT">18.88</TD>
2762	<TD ALIGN="LEFT">2.5342</TD>
2763	<TD ALIGN="RIGHT">0.3946</TD>
2764	</TR>
2765	<TR><TD ALIGN="LEFT">Pentium 200</TD>
2766	<TD ALIGN="LEFT">Linux</TD>
2767	<TD ALIGN="LEFT">Gnu C</TD>
2768	<TD ALIGN="RIGHT">16.51</TD>
2769	<TD ALIGN="LEFT">2.2161</TD>
2770	<TD ALIGN="RIGHT">0.4512</TD>
2771	</TR>
2772	<TR><TD ALIGN="LEFT">SGI PowerChallenge</TD>
2773	<TD ALIGN="LEFT">IRIX</TD>
2774	<TD ALIGN="LEFT">Gnu C</TD>
2775	<TD ALIGN="RIGHT">12.446</TD>
2776	<TD ALIGN="LEFT">1.6706</TD>
2777	<TD ALIGN="RIGHT">0.5985</TD>
2778	</TR>
2779	<TR><TD ALIGN="LEFT">Pentium MMX 266</TD>
2780	<TD ALIGN="LEFT">Linux</TD>
2781	<TD ALIGN="LEFT">Gnu C (PHYLIP 3.5)</TD>
2782	<TD ALIGN="RIGHT">(36.15)</TD>
2783	<TD ALIGN="LEFT"> 1.0</TD>
2784	<TD ALIGN="RIGHT"> 1.0</TD>
2785	</TR>
2786	<TR><TD ALIGN="LEFT">DEC Alpha 400 4/233</TD>
2787	<TD ALIGN="LEFT">Linux</TD>
2788	<TD ALIGN="LEFT">Gnu C (cc -fast)</TD>
2789	<TD ALIGN="RIGHT">8.0418</TD>
2790	<TD ALIGN="LEFT">1.0792</TD>
2791	<TD ALIGN="RIGHT">0.9266</TD>
2792	</TR>
2793	<TR><TD ALIGN="LEFT">Pentium MMX 266</TD>
2794	<TD ALIGN="LEFT">Linux</TD>
2795	<TD ALIGN="LEFT">Gnu C</TD>
2796	<TD ALIGN="RIGHT">7.45</TD>
2797	<TD ALIGN="LEFT"> 1.0</TD>
2798	<TD ALIGN="RIGHT"> 1.0</TD>
2799	</TR>
2800	<TR><TD ALIGN="LEFT">Pentium II 500</TD>
2801	<TD ALIGN="LEFT">Linux</TD>
2802	<TD ALIGN="LEFT">Gnu C</TD>
2803	<TD ALIGN="RIGHT">6.02</TD>
2804	<TD ALIGN="LEFT"> 0.8081</TD>
2805	<TD ALIGN="RIGHT"> 1.2375</TD>
2806	</TR>
2807	<TR><TD ALIGN="LEFT">Compaq/Digital Alpha 500au</TD>
2808	<TD ALIGN="LEFT">Linux</TD>
2809	<TD ALIGN="LEFT">Gnu C (cc -fast)</TD>
2810	<TD ALIGN="RIGHT">0.9383</TD>
2811	<TD ALIGN="LEFT"> 0.1259</TD>
2812	<TD ALIGN="RIGHT">7.940</TD>
2813	</TR>
2814	</TABLE>
2815	</DIV>
2816	<P>
2817	As before, the parallel machines such as the Convex and the SGI PowerChallenge
2818	were only run using one processor, which does not take into account the
2819	gain that could be obtained by parallelizing the programs. The speed of the
2820	Compaq/Digital Alpha 500au is exaggerated because it was compiled in a way
2821	optimized for its processor, while the Pentium compiles were not.
2822	<P>
2823	You are invited to send me figures for your machine for
2824	inclusion in future tables. Use the data sets above and compute the total
2825	times for DNAPARS and for DNAML for the three data sets (setting the
2826	frequencies of the four bases to 0.25 each for the DNAML runs). Be sure to
2827	tell me the name and version of your compiler, and the version of PHYLIP you
2828	tested.
2829	If the times are too small to be measured accurately, obtain the times
2830	for ten data sets (the Multiple data sets option) and divide by 10.
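<P>
The last two columns of the table above can be reproduced with a few lines of C.
This is only a sketch and not part of PHYLIP: the function names are invented
for illustration, and it assumes that the relative time is the measured time
divided by the time on the reference machine (the Pentium MMX 266 row, which
shows 1.0) and that the relative speed is its reciprocal.

```c
#include <stdio.h>

/* Average time per data set when total_seconds was measured over
   n_datasets data sets (for example 10 with the Multiple data sets option).
   n_datasets is assumed to be positive. */
double time_per_dataset(double total_seconds, int n_datasets)
{
    return total_seconds / n_datasets;
}

/* Time relative to the reference machine (the reference itself gives 1.0). */
double relative_time(double seconds, double reference_seconds)
{
    return seconds / reference_seconds;
}

/* Speed relative to the reference machine: the reciprocal of relative time. */
double relative_speed(double seconds, double reference_seconds)
{
    return reference_seconds / seconds;
}

/* Print one row: machine name, time, relative time, relative speed. */
void print_row(const char *machine, double seconds, double reference_seconds)
{
    printf("%-28s %8.4f %8.4f %8.4f\n", machine, seconds,
           relative_time(seconds, reference_seconds),
           relative_speed(seconds, reference_seconds));
}
```

<P>
For example, <TT>print_row("Pentium II 500", 6.02, 7.45)</TT> prints a relative
time of 0.8081 and a relative speed of 1.2375, the values shown in that row of
the table.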
2831	<P>
2832	<A NAME="comments"><HR><P></A>
2833	<DIV ALIGN="CENTER">
2834	<H2>General Comments on Adapting<BR>
2835	the Package to Different Computer Systems</H2></DIV>
2836	<P>
2837	In the sections following you will find instructions on how to adapt the
2838	programs to different computers and compilers. The programs should compile
without alteration on most versions of C. They use the "malloc"
or "calloc" library function to allocate memory, so the upper limits on how many
species, sites, or characters they can handle are set by the system memory
available to that memory-allocation function.
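<P>
As a rough illustration of that allocation pattern, a species-by-sites table
could be allocated at run time as sketched below, so that the only limit is
whether <TT>calloc</TT> can obtain the memory. This is not the actual PHYLIP
code: the type <TT>seq_table</TT> and both function names are hypothetical.

```c
#include <stdlib.h>

/* Hypothetical container: one row of characters per species. */
typedef struct {
    long species;
    long sites;
    char **rows;
} seq_table;

/* Allocate a zero-filled species x sites table. Returns NULL if the
   dimensions are not positive or if memory runs out. */
seq_table *seq_table_new(long species, long sites)
{
    seq_table *t;
    long i;

    if (species <= 0 || sites <= 0)
        return NULL;
    t = malloc(sizeof *t);
    if (t == NULL)
        return NULL;
    t->species = species;
    t->sites = sites;
    t->rows = calloc((size_t)species, sizeof *t->rows);
    if (t->rows == NULL) {
        free(t);
        return NULL;
    }
    for (i = 0; i < species; i++) {
        t->rows[i] = calloc((size_t)sites, sizeof **t->rows);
        if (t->rows[i] == NULL) {
            /* Undo the partial allocation before reporting failure. */
            while (i-- > 0)
                free(t->rows[i]);
            free(t->rows);
            free(t);
            return NULL;
        }
    }
    return t;
}

/* Release everything allocated by seq_table_new. */
void seq_table_free(seq_table *t)
{
    long i;

    if (t == NULL)
        return;
    for (i = 0; i < t->species; i++)
        free(t->rows[i]);
    free(t->rows);
    free(t);
}
```

<P>
A caller would check the result of <TT>seq_table_new</TT> for <TT>NULL</TT>
and report a memory allocation error if the request could not be satisfied.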
2843	<P>
2844	In the document file for each program, I have supplied a small
2845	input example, and the output it produces, to help you check whether the
2846	programs are running properly.
2847	<P>
2848	<DIV ALIGN=CENTER>
2849	<A NAME="compiling"><HR><P></A>
2850	<H2>Compiling the programs</H2>
2851	</DIV>
2852	<P>
2853	If you have not been able to get executables for PHYLIP, you should be
2854	able to make your own. This is easy under Unix and Linux, but more
2855	difficult if you have a Macintosh or a Windows system. If you have the
latter, we strongly recommend you download and use the PowerMac and
2857	Windows executables that we distribute. If you do that, you will not need
2858	to have any compiler or to do any compiling. I get a certain number of
2859	inquiries each year from confused users who are not sure what a compiler
2860	is but think they need one. After downloading the executables they
2861	contact me and complain that they did not find a compiler included in the
2862	package, and would I please e-mail them the compiler. What they really
2863	need to do is use the executables and forget about compiling them.
2864	<P>
2865	Some users may also need to compile the programs in order to modify them.
2866	The instructions below will help with this.
2867	<P>
2868	I will discuss how to compile PHYLIP using one of a number of widely-used
2869	compilers. After these I will comment on compiling PHYLIP on other, less
2870	widely-used systems.
2871	<P>
2872	<H3>Unix and Linux</H3>
2873	<P>
2874	In Unix and Linux (which is Unix in all important functional respects, if
2875	not in all
2876	legal respects) it is easy to compile PHYLIP yourself, which is why we have
2877	generally not bothered to distribute executables for Unix. Unix (and Linux)
2878	systems generally have a C compiler and have the <TT>make</TT> utility. We
2879	distribute with the PHYLIP source code a Unix-compatible <TT>Makefile</TT>.
2880	<P>
2881	After you have finished unpacking the Documentation and Source Code
2882	archive, you will find that you have created a directory <TT>phylip</TT>
2883	in which there are three
2884	subdirectories, called <TT>exe</TT>, <TT>src</TT>, and <TT>doc</TT>.
2885	There is also an HTML web page, <TT>phylip.html</TT>. The <TT>exe</TT>
2886	directory
2887	will be empty, <TT>src</TT> contains the source code files, including the
2888	<TT>Makefile</TT>. Directory <TT>doc</TT> contains the documentation files.
2889	<P>
2890	Enter the <TT>src</TT> directory. Before you compile, you will want to
2891	look at the makefile and see whether you want to alter the compilation
2892	command. There are careful instructions in the Makefile telling you how to
2893	do this. To compile all the programs just type:
2894	<P>
2895	<TT>make install</TT>
2896	<P>
2897	You will then see the compiling commands as they happen, with
2898	occasional warning messages. If these are warnings, rather than errors,
2899	they are not too serious. A typical warning would be like this:
2900	<P>
2901	<TT>dnaml.c:1204: warning: static declaration for re_move follows non-static</TT>
2902	<P>
2903	After a time the compiler will finish compiling. If you have done a
2904	<TT>make install</TT> the system will then move the executables into the
2905	<TT>exe</TT> subdirectory and also save space by erasing all the relocatable
2906	object files that were produced in the process. You should be left with
2907	useable executables in the <TT>exe</TT> directory, and the <TT>src</TT>
2908	directory should be as before. To run the executables, go into the
2909	<TT>exe</TT> directory and type the program name (say <TT>dnaml</TT>).
2910	The names of the
2911	executables will be the same as the names of the C programs, but without the
2912	<TT>.c</TT> suffix. Thus <TT>dnaml.c</TT> compiles to make an executable called <TT>dnaml</TT>.
2913	<P>
2914	A typical Unix or Linux installation would put the directory <TT>phylip</TT>
2915	in <TT>/usr/local</TT>. The name of the executables directory <TT>EXEDIR</TT>
2916	could be changed to be <TT>/usr/local/bin</TT>, so that the <TT>make install</TT>
2917	command puts the executables there. If the users have <TT>/usr/local/bin</TT>
2918	in their paths, the programs would be found when their names are typed.
2919	The font files <TT>font1</TT> through <TT>font6</TT> could also be
2920	placed there. A batch script containing the lines
2921	<P>
2922	<PRE>
2923	ln -s /usr/local/bin/font1 font1
2924	ln -s /usr/local/bin/font2 font2
2925	ln -s /usr/local/bin/font3 font3
2926	ln -s /usr/local/bin/font4 font4
2927	ln -s /usr/local/bin/font5 font5
2928	ln -s /usr/local/bin/font6 font6
2929	</PRE>
2930	<P>
2931	could be used to establish links in the user's working directory so that
2932	Drawtree and Drawgram would find these font files when users
2933	type a name such as <TT>font1</TT> when the program asks
2934	them for a font file name. The
2935	documentation web pages are in subdirectory <TT>doc</TT> of the
2936	main PHYLIP directory, except for one, <TT>phylip.html</TT> which is
2937	in the main PHYLIP directory. It has a table of all of the documentation
2938	pages, including this one. If users create a bookmark to that page
2939	it can be used to access all of the other documentation pages.
2940	<P>
2941	To compile just one program, such as DNAML, type:
2942	<P>
2943	<TT>make dnaml</TT>
2944	<P>
2945	After this compilation, <TT>dnaml</TT> will be in the <TT>src</TT>
subdirectory. So will some relocatable object code files that
2947	were used to create the executable. These have names ending in
2948	<TT>.o</TT> - they can safely be deleted.
2949	<P>
2950	If you have problems with the compilation command, you can edit the
2951	<TT>Makefile</TT>. It has careful explanations at its front of how you
2952	might want to do so. For example, you might want to change the C
2953	compiler name <TT>cc</TT> to the name of the Gnu C compiler, <TT>gcc</TT>.
2954	This can be done by removing the comment character <TT>#</TT> from the
2955	front of one line, and placing it at the front of a nearby line.
2956	How to do so should be clear from the material at the beginning of the
2957	<TT>Makefile</TT>. We have included sample lines for using the <TT>gcc</TT>
2958	compiler and for using the Cygwin Gnu C++ environment on Windows, as
2959	well as the default of <TT>cc</TT>.
2960	<P>
2961	Some older C compilers (notably the Berkeley C compiler which is
2962	included free with some Sun systems) do not adhere to the ANSI C
2963	standard (because they were written before it was set down).
2964	They have trouble with the function prototypes which are in
2965	our programs. We have included an <TT>#ifndef</TT> preprocessor
2966	command to eliminate the problem, if you use the switch <TT>-DOLDC</TT>
2967	when compiling. Thus with these compilers you need only use this in
2968	your C flags (in the Makefile) and compilers such as Berkeley C
2969	will cause no trouble.
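<P>
The general shape of such a guard can be sketched as follows. The function
<TT>branch_sum</TT> is a made-up example and is not taken from the PHYLIP
source, whose actual use of <TT>OLDC</TT> may differ in detail.

```c
#ifndef OLDC
/* ANSI C compilers see a full function prototype. */
double branch_sum(double left, double right);
#else
/* Pre-ANSI compilers (compiled with -DOLDC) see a declaration
   without parameter types. */
double branch_sum();
#endif

#ifndef OLDC
double branch_sum(double left, double right)
#else
double branch_sum(left, right)
double left, right;
#endif
{
    /* The body is the same for both kinds of compiler. */
    return left + right;
}
```

<P>
Compiled normally, the ANSI branch is used. Compiled with <TT>-DOLDC</TT>
in the C flags, the preprocessor selects the older declaration style instead.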
2970	<P>
2971	<H3>Macintosh PowerMacs</H3>
2972	<P>
2973	<B>Compiling with Metrowerks Codewarrior on Macintosh PowerMacs...</B>
2974	<P>
2975	We shall assume that you have a recent version of the Metrowerks
2976	Codewarrior C++
2977	compiler. This description, and the project files that we provide,
2978	assume Codewarrior 5.3. We also assume some familiarity with
2979	the use of the Codewarrior compiler and its Integrated Development
2980	Environment (IDE).
2981	<P>
2982	Start with our <TT>src</TT> directory (folder) that contains the C source
2983	code files such as <TT>dnaml.c</TT> and also the Codewarrior resource
2984	files such as <TT>dnaml.rsrc</TT>, which are provided by us.
2985	<P>
2986	<B>Creating the project file.</B> We will use DnaML as our example.
2987	We have provided a full set of project files in the
2988	self-extracting Macintosh archive.
2989	<EM>If you have them then you do not need
2990	to do the items on the following list:</EM>
2991	<OL>
<LI>Start up the Codewarrior IDE.
2993	<LI>Create a new project file by choosing <TT>New...</TT> on the <TT>File</TT>
2994	menu.
2995	<LI>Type in the project name <TT>dnaml.proj</TT>
2996	<LI>On the Project menu on the left side of the <TT>New</TT> window, double-click on <TT>MacOS C/C++ Stationery</TT>
2997	<LI>In the <TT>New project</TT> window that opens, click on the triangle
2998	to the left of <TT>Standard Console</TT>.
2999	<LI>Move the slider at the right of the window down until you reach
3000	<TT>SIOUX-WASTE</TT>
3001	<LI>Click on the triangle to the left of <TT>SIOUX-WASTE</TT>. This opens
3002	another list of choices below.
3003	<LI>Click on the menu item <TT>SIOUX-WASTE C PPC</TT>. Press the <TT>OK</TT> button. After a bit a window <TT>dnaml.proj</TT> will open.
3004	<LI>Click on the triangle to the left of the <TT>Sources</TT> menu item. A
3005	template item called <TT>HelloWorld.c</TT> will open.
3006	<LI>Select <TT>HelloWorld.c</TT>.
3007	<LI>Open the <TT>Edit</TT> menu at the top of the Mac screen and select
3008	<TT>Clear</TT>. A box will open asking if you want to remove <TT>HelloWorld.c</TT> from the project.
3009	<LI>Select <TT>OK</TT>.
3010	<LI>If the <TT>dnaml.c</TT> file came from the self-extracting Macintosh
archive that we distribute, it should show a yellow-and-black-striped Metrowerks
3012	icon (if not, as when you get it from some other form of our distribution,
3013	you may have to pass it through a program like Microsoft Word, making
3014	sure to save it as a Text Only file, to get
3015	Metrowerks to be able to see it as a potential source code file).
3016	<LI>Drag the <TT>dnaml.c</TT> file onto the <TT>Sources</TT> item in your
3017	<TT>dnaml.proj</TT> window.
3018	<LI>Drop it onto Sources so that it appears under the <TT>Sources</TT> choice.
3019	This may take a few tries -- if it appears above <TT>Sources</TT> grab it
3020	and move it again.
3021	<LI>Now add the other files that must be compiled with <TT>dnaml.c</TT>.
3022	These can be identified by looking at our <TT>Makefile</TT> -- for DnaML
3023	they are <TT>seq.c</TT>, <TT>phylip.c</TT>, <TT>seq.h</TT>, and <TT>phylip.h</TT>. Each of them needs to be added to the project file in the same way that
3024	<TT>dnaml.c</TT> was.
3025	<LI>Drag <TT>dnaml.rsrc</TT> into <TT>Sources</TT> in the same way. It
3026	doesn't matter whether it appears before or after <TT>dnaml.c</TT>.
3027	<LI>Go to the <TT>Edit</TT> menu and select the <TT>PPC Std C SIOUX-WASTE Settings</TT> item. A window of that name will then open.
3028	<LI>Under the <TT>Target</TT> item you will see a <TT>PPC Target</TT> item.
3029	Select it. A <TT>PPC Target</TT> window will open to the right.
3030	<LI>Change the name in the <TT>File Name</TT> box to be <TT>PHYLIP</TT>
3031	<LI>Change the <TT>????</TT> in the <TT>Creator</TT> box to (say) <TT>PHYD</TT>
3032	<LI>Change the <TT>Preferred Heap Size</TT> to <TT>1024</TT>.
3033	<! need to add selections of PPC Processor here >
3034	<! ditto for Global Optimization >
3035	<LI>Under <TT>Language Settings</TT> in the left-hand menu of the window,
3036	select <TT>C/C++ Language</TT>. A window called <TT>C/C++ Language</TT>
3037	will open to the immediate right.
3038	<LI>Click on <TT>Require Function Prototypes</TT> to deselect that setting.
3039	<LI>Click on the <TT>Save</TT> button at the lower-right of the project
3040	settings window.
3041	<LI>Close the <TT>PPC Std C SIOUX-WASTE Settings</TT> window using the usual
3042	box in the upper-left corner.
3043	<LI>On your Desktop you should now find a folder <TT>PHYLIP</TT>.
3044	If it has a
3045	file called <TT>HelloWorld.c</TT> you may want to discard that file.
3046	<LI>In that <TT>PHYLIP</TT> folder you will find a file <TT>dnaml.proj</TT>.
<LI>Double-click on that project file. If Metrowerks is not already open,
3048	it should open now.
3049	<LI>If a window called <TT>Project Messages</TT> opens and there is a
3050	complaint in it about access paths being wrong, you should fix these by
3051	selecting the <TT>Reset project entry paths</TT> item in the <TT>Project</TT>
3052	menu.
<LI>Select the <TT>Make</TT> item in the <TT>Project</TT> menu.
3055	</OL>
<B>Compiling a program once its resource file is available.</B>
3057	If the resource files are all available (as they should be), you did not need
3058	to do any of the above. Usually users will have no need to compile
3059	the programs, but occasionally they may want to change a setting or
3060	add a feature. In that case the Metrowerks Codewarrior compiler can be
3061	used. We have provided support for compiling the programs in its
3062	most recent version, version 5.3. The following discussion will
3063	assume that you have obtained and installed the compiler.
3064	<P>
3065	You should find in the source code directory
3066	<TT>src</TT> a subdirectory called <TT>mac</TT> which contains the
Metrowerks Codewarrior compiler "project files" (with names ending in
<TT>.proj</TT>), as well as the resource files (which end in <TT>.rsrc</TT>)
for each program. You can get into this subdirectory, activate the
3070	Metrowerks compiler, and open the appropriate project file. To
3071	compile the program, simply make sure that the project file is an
3072	active window, and type <TT>Command-M</TT> (which is to say, hold down
3073	the <TT>Command</TT> key while typing <TT>M</TT>). Alternatively,
pull down the <TT>Project</TT> menu and select <TT>Make</TT>. The
3075	program should then compile, possibly with ignorable warning messages.
3076	<P>
3077	<H3>Windows systems</H3>
3078	<P>
3079	<B>Compiling with Microsoft Visual C++</B>
3080	<P>
Microsoft Visual C++ is used to compile the executables we distribute
for Windows. It can compile using a Makefile. We have supplied this
in the source code distribution as <TT>Makefile.msvc</TT>.
3084	You will need to preserve the Unix Makefile by renaming it to, say,
3085	<TT>Makefile.unix</TT>, then make a copy of <TT>Makefile.msvc</TT>
3086	and call it <TT>Makefile</TT>.
3087	<P>
3088	<B>Setting the path.</B>
Before using <TT>nmake</TT> you will need to have the paths
set properly. For this, use the Start menu to open a Command Prompt or
DOS Prompt window first. To set the path, type<BR>
3092	<PRE>
3093	set MSVC=Path
3094	</PRE>
3095	where Path is where Microsoft Visual Studio is installed
3096	(e.g. it might be in <TT>c:\Microsoft Visual Studio</TT>).
However, the path you type should not have any spaces in it.
This means that you may have to use the directory's
DOS filename. In general, to get a DOS name you take the first six letters of
the directory name and follow them by <TT>~1</TT>. For example,
<TT>Microsoft Visual Studio</TT> will have the DOS name
<TT>Micros~1</TT>, and <TT>Program Files</TT> will be <TT>Progra~1</TT>.
Depending on what other
files are in the directory, the DOS name may be the first six letters followed
by <TT>~2</TT>, <TT>~3</TT>, <TT>~4</TT>, etc. (e.g. <TT>Micros~3</TT> or <TT>Progra~5</TT>).
It may take some
experimentation to figure it out. With older versions of Windows (pre-Windows 2000)
it may be possible to just right-click on the directory icon and select
Properties to get the DOS name.
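<P>
The rule of thumb above can be written down as a small C function that produces
candidate DOS names to try. It is only a sketch: the function name
<TT>dos_name_guess</TT> is hypothetical, skipping spaces is an assumption made
here, and the name Windows really assigned may differ, so the result is a
guess to be checked rather than an answer.

```c
#include <stdio.h>
#include <stddef.h>

/* Write a candidate DOS name for long_name into out: up to six
   non-space characters of the long name, then '~' and index.
   out_size should be at least 9 for an index from 1 to 9. */
void dos_name_guess(const char *long_name, int index,
                    char *out, size_t out_size)
{
    char stem[7];
    size_t n = 0;

    while (*long_name != '\0' && n < 6) {
        if (*long_name != ' ')      /* assumption: spaces are dropped */
            stem[n++] = *long_name;
        long_name++;
    }
    stem[n] = '\0';
    snprintf(out, out_size, "%s~%d", stem, index);
}
```

<P>
Calling it with <TT>"Microsoft Visual Studio"</TT> and index 1 gives
<TT>Micros~1</TT>. If that name does not work, the indices 2, 3, and so on
give the other candidates described above. On command prompts of the
Windows NT family, <TT>dir /x</TT> lists the DOS names directly, which avoids
the guessing.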
3110	<P>
3111	Once you have set MSVC, type
3112	<PRE>
3113	PATH=%PATH%;%MSVC%\VC98\bin
3114	</PRE>
3115	Then the Makefile will need to be edited. The line
3116	<PRE>
3117	MSVCPATH=c:\Micros~1\VC98
3118	</PRE>
will need to be changed so that
it points to wherever Microsoft Visual Studio is installed, followed by
<TT>\VC98</TT>.
3122	<P>
3123	<B>Using the Makefile</B>. The Makefile is invoked using the
3124	<TT>nmake</TT> command. If you simply type <TT>nmake</TT> you
will get a list of possible <TT>nmake</TT> commands. For example,
to compile a single program such as <TT>Dnaml</TT> but not
install it, type <TT>nmake dnaml</TT>. To compile and install all
programs type <TT>nmake install</TT>. We have supplied all the
3129	support files and icons needed for the compilations. They are
3130	in subdirectory <TT>msvc</TT> of the main source code
3131	directory.
3132	<P>
3133	<B>Compiling with Borland C++</B>
3134	<P>
Borland C++ can be downloaded for free from Inprise (Borland)
(see their site
<A HREF="http://www.borland.com">http://www.borland.com</A>).
It can compile using a Makefile. We have supplied this
in the source code distribution as <TT>Makefile.bcc</TT>.
3140	You will need to preserve the Unix Makefile by renaming it to, say,
3141	<TT>Makefile.unix</TT>, then make a copy of <TT>Makefile.bcc</TT>
3142	and call it <TT>Makefile</TT>. The Makefile is invoked using the
3143	<TT>make</TT> command. If you simply type <TT>make</TT> you
3144	will get a list of possible <TT>make</TT> commands. For example,
3145	to compile a single program such as <TT>Dnaml</TT> but not
3146	install it, type <TT>make dnaml</TT>. To compile and install all
3147	programs type <TT>make install</TT>. We have supplied all the
support files and icons needed for the compilations. They
3149	are in subdirectory <TT>bcc</TT> of the main source code
3150	directory. We have had to supply a complete
3151	second set of the resource files with names <TT>*.brc</TT>
3152	because Borland resource files have a minor incompatibility
3153	with Microsoft Visual C++ resource files.
3154	<P>
3155	If this does not work the <TT>PATH</TT> may need to be set manually.
3156	This can be done by opening a Command or DOS window using the Start
3157	menu. To set the path, type
3158	<PRE>
3159	set BORLAND=Path
3160	</PRE>
where <TT>Path</TT> is where Borland is installed, such as
3162	<TT>C:\Progra~1\Borland</TT>.
3163	Then type
3164	<PRE>
3165	PATH=%PATH%;%BORLAND%\CBUILD~1\Bin
3166	</PRE>
3167	<P>
3168	<B>Compiling with Metrowerks Codewarrior for Windows</B>
3169	<P>
3170	As with Macintosh systems, Metrowerks Codewarrior requires
3171	you to have project files for each program you compile.
3172	For Metrowerks Codewarrior for Windows we are not providing the projects
3173	themselves, but we are providing
projects which have been exported as XML files. To open one of these,
do not use File/Open; instead choose the menu option File/Import Project.
Metrowerks will then ask you for the project name.
Type in the name of the program (e.g. dnaml). Once this is done Metrowerks will
treat it as a regular project file.
3180	<P>
3181	We have supplied a complete set of these XML project files in the
3182	source code distribution. They are in subdirectory <TT>metro</TT>
3183	of the main source code directory. This is supplied with the
3184	source code distribution for Windows (it is not in the source
3185	code distributions for other platforms).
3194	<P>
3195	To compile the program
3196	pull down the <TT>Project</TT> menu and select <TT>Make</TT>. The
3197	program should then compile, possibly with ignorable warning messages.
3198	<P>
3199	For the moment we are not giving here the details of
3200	how to create these projects yourself -- you usually will not need
3201	to, as you have the project files we have supplied.
3202	<P>
3203	<B>Compiling with Cygnus Gnu C++</B>
3204	<P>
3205	Cygnus Solutions (now a part of Red Hat, Inc.) has adapted the Gnu C compiler
3206	to Windows systems and
3207	provided an environment, CygWin, which mimics Unix for compiling.
3208	This is available for purchase from them, and they also make it
3209	available to be downloaded for free. The download is large. To get it, go
3210	to <A HREF="http://sources.redhat.com/cygwin/download.html">their download site</A> at
3211	<CODE>http://sources.redhat.com/cygwin/download.html</CODE> and follow the
3212	instructions there. It is a bit
3213	difficult to figure out how to download it -- you need to download
3214	their <TT>setup.exe</TT> program and then it will download the rest
3215	when it is run. You will need a lot of disk space for it.
3216	<P>
3217	Once you have
3218	installed the free Cygnus environment and the associated Gnu C compiler
3219	on your Windows system, compiling PHYLIP is essentially identical to
3220	what one does for Unix or Linux. In PHYLIP's <TT>src</TT> directory,
3221	change the name of our Unix <TT>Makefile</TT> to something like
3222	<TT>Makefile.unx</TT> (so as to keep it around). There is a special
3223	Makefile for the Cygwin
3224	compiler called <TT>Makefile.cyg</TT>. Make a copy of it called
3225	<TT>Makefile</TT>.
3226	<P>
3227	This Makefile should contain a compiling command:
3228	<P>
3229	<TT>CC = gcc</TT>
3230	<P>
3231	Now enter the Cygwin environment (which you can do using the Windows
<TT>Start</TT> menu and its <TT>Programs</TT> menu item). There should be
3233	a <TT>Cygnus</TT> menu choice within that submenu, which you can use to
3234	start the Cygnus environment. This puts you in an imitation of a Unix
3235	shell.
3236	<P>
3237	On entering the CygWin environment you will find yourself in one of the
3238	subdirectories of the CygWin directory. Change to the directory where the
PHYLIP programs have been put, for example by issuing the command
3240	<P>
3241	<TT>cd c:/phylip</TT><BR>
3242	<BR>
3243	You should then be able to compile PHYLIP
3244	by issuing the appropriate make command, such as <TT>make install</TT>.
3245	If you have modified one of our source code files such as <TT>dnaml.c</TT>,
3246	it would be wise to
3247	have saved the original version of it first as, say, <TT>dnaml.c0</TT>.
3248	To associate an icon with a program (say DnaML), you need an icon
file (say <TT>dna.ico</TT>) which contains the icon in standard format.
3250	There should also be a file called <TT>dnaml.rc</TT> which contains the single
3251	line:
3252	<P>
3253	<TT>dnaml ICON "dna.ico"</TT>
3254	<P>
3255	We have provided a subdirectory <TT>icons</TT> in the <TT>src</TT>
3256	subdirectory, containing a full set of icons and a full set of resource
3257	files (<TT>*.rc</TT>).
3258	Our Cygwin Makefile will automatically invoke them.
3259	<P>
3260	<H3>VMS VAX systems</H3>
3261	<P>
3262	We have not tried to compile version 3.6 on an OpenVMS system but the
3263	following instructions should work.
3264	On the OpenVMS operating system with DEC VAX VMS C the programs will compile
3265	without alteration. The commands for compiling a typical program
3266	(DNAPARS, which depends on the separately compiled files <TT>phylip.c</TT>
3267	and <TT>seq.c</TT>) are:
3268	<P>
3269	<TT>$ DEFINE LNK$LIBRARY SYS$LIBRARY:VAXCRTL
3270	<BR>
3271	$ CC DNAPARS.C
3272	<BR>
3273	$ CC PHYLIP.C
3274	<BR>
3275	$ CC SEQ.C
3276	<BR>
3277	$ LINK DNAPARS,PHYLIP,SEQ
3278	<BR>
3279	</TT>
3280	<P>
3281	Once you use this <TT>$ DEFINE</TT> statement during a given interactive session,
3282	you need not repeat it again as the symbol <TT>LNK$LIBRARY</TT> is thereafter
3283	properly defined. The compilation process leaves a file <TT>DNAPARS.OBJ</TT>
3284	in your directory: this can
3285	be discarded. The executable program is named <TT>DNAPARS.EXE</TT>. To run the program
3286	one then uses the command:
3287	<P>
3288	<TT>$ R DNAPARS</TT>
3289	<P>
3290	The compiler defaults to the filenames <TT>INFILE.</TT>, <TT>OUTFILE.</TT>, and
3291	<TT>TREEFILE.</TT>.
3292	If the input file <TT>INFILE.</TT> does not exist the program will prompt you to
3293	type in its name. Note that some commands on VMS such as <TT>TYPE OUTFILE</TT>
3294	will fail because the name of the file that it will attempt to type out will be not
<TT>OUTFILE.</TT> but <TT>OUTFILE.LIS</TT>. To get it to type the right file you
3296	would have to instead issue the command <TT>TYPE OUTFILE.</TT>.
3297	<P>
3298	When you are
3299	using the interactive previewing feature of DRAWGRAM (or DRAWTREE) on
3300	a Tektronix or DEC ReGIS compatible terminal, you will want before
3301	running the program to have issued the command:
3302	<P>
3303	<TT>$ SET TERM/NOWRAP/ESCAPE</TT>
3304	<P>
3305	so that you do not run into trouble from the VMS line length limit of
3306	255 characters or the filtering of escape characters.
3307	<P>
3308	To know which files to compile together, look at the entries in the
3309	<TT>Makefile</TT>.
3310	<P>
3311	VMS systems are rapidly disappearing, so we will not devote much
effort to getting PHYLIP working on them.
3313	<P>
3314	<H3>Parallel computers</H3>
3315	<P>
3316	As parallel computers become more common, the issue of how to compile
3317	PHYLIP for them has become more pressing. People have been compiling
3318	PHYLIP for vector machines and parallel machines for many years. We
3319	have not made a version for parallel machines because there is still
3320	no standard parallel programming environment on such machines (or rather,
3321	there are many standards, so that one cannot find one that makes
3322	a parallel execution version of PHYLIP practical). However the
3323	MPI Message Passing Interface is spreading rapidly, and we will
3324	probably support it in future versions of PHYLIP.
3325	<P>
3326	Although the underlying algorithms of most programs,
3327	which treat sites independently, should be amenable to vector and
3328	parallel processors,
3329	there are details of the code which might best be changed.
3330	In certain of the programs (<TT>Dnaml</TT>, <TT>Dnamlk</TT>,
3331	<TT>Proml</TT>, <TT>Promlk</TT>) I have put a special
3332	comment statement next to the loops in the program where
3333	the program will spend most of its time, and which are the places
3334	most likely to benefit from parallelization. This comment statement is:<BR>
3335	<PRE>
3336	/* parallelize here */
3337	</PRE>
3338	In particular
3339	within these innermost loops of the programs there are often scalar quantities
3340	that are used for temporary bookkeeping. These quantities, such as
3341	<TT>sum1, sum2, zz, z1, yy, y1, aa, bb, cc, sum,</TT> and <TT>denom</TT> in procedure makenewv
3342	of DNAML (and similar quantities in procedure nuview) are there to
3343	minimize the number of array references. For vectorizing and parallelizing
3344	compilers it will
3345	be better to replace them by arrays so that processing can occur
3346	simultaneously.
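<P>
The kind of change meant here can be sketched with a deliberately simple loop
over sites. The arithmetic is a placeholder and is not the likelihood
computation in <TT>makenewv</TT> or <TT>nuview</TT>. The function names are
hypothetical, and whether the array form actually helps depends on the
compiler and machine.

```c
#include <stddef.h>

/* Scalar style: sum and denom are reused on every pass. This saves
   array references, but all iterations share the same temporaries. */
void site_terms_scalar(const double *x, const double *y,
                       double *out, size_t sites)
{
    double sum, denom;
    size_t i;

    for (i = 0; i < sites; i++) {   /* parallelize here */
        sum = x[i] + y[i];
        denom = 1.0 + x[i] * y[i];
        out[i] = sum / denom;
    }
}

/* Array style: each site has its own slot in the scratch arrays sum
   and denom (length sites, supplied by the caller), so no temporary
   is shared between sites. */
void site_terms_array(const double *x, const double *y,
                      double *sum, double *denom,
                      double *out, size_t sites)
{
    size_t i;

    for (i = 0; i < sites; i++) {   /* parallelize here */
        sum[i] = x[i] + y[i];
        denom[i] = 1.0 + x[i] * y[i];
        out[i] = sum[i] / denom[i];
    }
}
```

<P>
Both functions give the same results. The second trades extra memory for
iterations that do not depend on one another through shared scalars.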
3347	<P>
3348	If you succeed in making a parallel version of PHYLIP we would like to
3349	know how you did it. In particular, if you can prepare a web page which
3350	describes how to do it for your computer system, we would like to have it
3351	for inclusion in our PHYLIP web pages. Please e-mail it to me. We hope to
3352	have a set of pages that give detailed instructions on how to make parallel
versions of PHYLIP on various kinds of machines. Alternatively, if we
3354	are given your modified version of the program we may be able to
3355	figure out how to make modifications to our source code to allow
3356	users to compile the program in a way which makes those modifications.
3357	<P>
3358	<H3>Other computer systems</H3>
3359	<P>
3360	As you can see from the variety of different systems on which these
3361	programs have been successfully run, there are no serious
3362	incompatibility problems with most computer systems. PHYLIP in various
3363	past Pascal versions has also been compiled on 8080 and Z80 CP/M Systems, Apple
3364	II systems running UCSD Pascal, a variety of minicomputer systems such as
3365	DEC PDP-11's and HP 1000's, on 1970's era mainframes such as CDC
3366	Cyber systems, and so on. In a later era
3367	it was also compiled on IBM 370 mainframes, and of course on DOS and
3368	Windows systems and on Macintosh and PowerMacintosh systems.
3369	We have gradually
3370	accumulated experience on a wider variety of C compilers. If you succeed in
3371	compiling the C version of PHYLIP on a different machine or a different
3372	compiler, I would like to
3373	hear the details so that I can consider including the instructions in a future version
3374	of this manual.
3375	<P>
3376	<DIV ALIGN="CENTER">
3377	<A NAME="FAQ"><HR><P></A>
3378	<H2>Frequently Asked Questions</H2></DIV>
3379	<P>
3380	This set of Frequently Asked Questions, and their answers, is from the
3381	PHYLIP web site. A more up-to-date version can be found there, at:
3382	<P>
3383	<DIV ALIGN="CENTER">
3384	<A HREF="http://evolution.gs.washington.edu/phylip/faq.html">
3385	<TT>http://evolution.gs.washington.edu/phylip/faq.html</TT></A></DIV>
3386	<P>
3387	<DL>
<DT><STRONG>"It doesn't work! <I>It doesn't work!!</I> It says <TT>can't find infile.</TT>"</STRONG>
3389	<DD>Actually, it's working just fine. Many of the programs look for an input file called <TT>infile</TT>,
3390	and if one of that name is not present in the current directory, they then ask
3391	you to type in the name of the input file. That's all that it's doing. This
3392	is done so that
3393	you can get the program to read the file without you having to type in its
3394	name, by making a copy of your input file and calling it <TT>infile</TT>.
3395	If you don't do that, then the program issues this message. It looks
3396	alarming, but really all that it is trying to do is to get you to type in
3397	the name of the input file. Try giving it the name of the input file.
<DT><STRONG>"The program reads my data file and then says it has
a memory allocation error!"</STRONG>
3400	<DD>This is what tends to happen if there is a problem with the format of the data
3401	file, so that the programs get confused and think they need to set aside memory
3402	for 1,000,000 species or so. The result is a "memory allocation error". Check the data file format against the documentation:
3403	make sure that the data files have <I>not</I> been saved in the format of
3404	your word processor (such as Microsoft Word) but in a "flat ASCII" or "text only"
3405	mode. Note that adding memory to your computer is <I>not</I> the
3406	way to solve this problem -- you probably have plenty of memory
3407	to run the program once the data file is in the correct format.
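<P>
A quick check of the file before running a program can catch this kind of problem. The following is only a sketch and is not part of PHYLIP: the function name <TT>check_phylip_header</TT> is hypothetical, and it assumes that the first line of the data file should start with two positive integers (as in the sequence and discrete character input formats) and that a text-only file holds nothing but printable ASCII and whitespace.
<PRE>
```python
import string

# Bytes that may appear in a plain text file: printable ASCII plus whitespace.
TEXT_BYTES = set(string.printable.encode("ascii"))

def check_phylip_header(path):
    """Return a list of problems; an empty list means the file looks plausible."""
    problems = []
    with open(path, "rb") as handle:
        raw = handle.read()
    if any(byte not in TEXT_BYTES for byte in raw):
        problems.append("non-text bytes found: was the file saved as Text Only?")
    lines = raw.decode("latin-1").splitlines()
    fields = lines[0].split()[:2] if lines else []
    if len(fields) != 2 or not all(f.isdecimal() and int(f) != 0 for f in fields):
        problems.append("first line should start with two positive integers")
    return problems
```
</PRE>
An empty result does not prove that the file is correct; it only rules out the two most common causes of the error described above.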
3408	<DT><STRONG>"On our Macintosh, larger data files fail to run."</STRONG>
3409	<DD>We have set the memory allowances on the Macintosh executables
3410	to be generous, but not too big. You therefore may need to
increase them. Use the <TT>Get Info</TT> item on the Finder <TT>File</TT>
menu while the program is selected.
<DT><STRONG>"I opened the program but I don't see where to create
a data file!"</STRONG>
<DD>The programs (there is more than one) use data
files that have been created outside of the program. They do not have any
data editor within them. You can create a data file by using an editor,
such as Microsoft Word, EMACS, vi, SimpleText, Notepad, etc. But be sure
<I>not</I> to save the file in Microsoft Word's own format. It should be saved in
Text Only format. You can use the documentation files, including the examples
at the end of those files, to figure out the format of the input file.
Documentation files such as <TT>main.html</TT>, <TT>sequence.html</TT>,
<TT>distance.html</TT> and many others should be consulted. Many users
create their data files by having their alignment program (such as
ClustalW) output its alignments in PHYLIP format. Many alignment programs
have options to do that.
3427	<DT><STRONG>"I ran PHYLIP, and all it did was say it was extracting a bunch of files!"</STRONG>
3428	<DD>
3429	There is no executable program
3430	named <TT>PHYLIP</TT> in the PHYLIP package! But in some cases
3431	(especially the Windows distribution) there is a file called
3432	<TT>phylip.exe</TT>.
3433	That file is an archive of documentation and source code. Once you have
3434	run it and extracted the files in it, so that they are in the directory,
3435	running it again will just do the extraction again, which is unnecessary.
3436	Similarly for the archive files for the Windows executables, which
3437	have names like <TT>phylipwx.exe</TT> and <TT>phylipwy.exe</TT>.
3438	They are run only once to extract their contents.
3439	<DT><STRONG>"One program makes an output file and then the next program crashes while reading it!"</STRONG>
3440	<DD>Did you rename the file? If a program makes a file called <TT>outfile</TT>, and then the
3441	next program is told to use <TT>outfile</TT> as its input file, terrible things will
3442	happen. The second program first opens <TT>outfile</TT> as an output file, thus
3443	erasing it. When it then tries to read from this empty <TT>outfile</TT>
3444	a psychological
3445	crisis ensues. The solution is simply to rename <TT>outfile</TT> before trying to
3446	use it as an input file.
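<P>
The renaming step can of course be done by hand. If you drive the programs from a script, a small helper along the following lines may serve; it is a sketch only, and the function name <TT>keep_output</TT> and the example file name are hypothetical.
<PRE>
```python
import os

def keep_output(directory, new_name, old_name="outfile"):
    """Rename old_name to new_name inside directory and return the new path."""
    if new_name == old_name:
        raise ValueError("choose a name that the next program will not overwrite")
    source = os.path.join(directory, old_name)
    target = os.path.join(directory, new_name)
    if os.path.exists(target):
        # Refuse to clobber the results of an earlier run.
        raise FileExistsError(target)
    os.rename(source, target)
    return target
```
</PRE>
For example, <TT>keep_output(".", "dnadist.out")</TT> would move <TT>outfile</TT> out of the way, and the next program can then be given <TT>dnadist.out</TT> as its input file name.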
3447	<DT><STRONG>"I make a file called infile and then the program can't find it!"</STRONG>
3448	<DD>Let me guess. You are using Windows, right? You made your file in Word or
3449	in Notepad or WordPad, right? If you made a file in one of these editors, and
3450	saved it, not in Word format, but in Text Only format, then you were doing the
3451	right thing. But when you told the operating system to save the file as
3452	<TT>infile</TT>, it actually didn't. It saved it as
3453	<TT>infile.txt</TT>. Then just to make
3454	life harder for you, the operating system is set up by default to not show
3455	that three-letter extension to the file name. Next to its icon it will show
3456	the name <TT>infile</TT>. So you think, quite reasonably, that
3457	there is a file called <TT>infile</TT>. But there isn't a file of that
3458	name, so the program, quite reasonably, can't find a file called
3459	<TT>infile</TT>. If you want to check what the actual file name is, use
3460	the <TT>Properties</TT>
3461	menu item of the <TT>File</TT> item on your folder (in Windows versions, anyway).
3462	You should be able to get the program to work by telling it that the file name
3463	is <TT>INFILE.TXT</TT>.
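<P>
If you would rather not dig through the folder settings, a few lines of script will show the real file names. This is a sketch under the assumption that the file sits in the directory you name; the function name <TT>find_infile_candidates</TT> is hypothetical.
<PRE>
```python
import os

def find_infile_candidates(directory, wanted="infile"):
    """List files whose name, ignoring case and any extension, equals wanted."""
    matches = []
    for name in sorted(os.listdir(directory)):
        stem, extension = os.path.splitext(name)
        if stem.lower() == wanted.lower():
            matches.append(name)
    return matches
```
</PRE>
Whatever name it reports (for instance <TT>infile.txt</TT>) is the name to type when the program asks for the input file.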
<DT><STRONG>"Consense gives weird branch lengths! How do I
3465	get more reasonable ones?"</STRONG>
3466	<DD>Consense gives branch lengths which are simply the numbers of replicates
3467	that support the branch. This is not a good reflection of how long those
3468	branches are estimated to be. The best way to put better branch lengths on a
3469	consensus tree is to use it as a User Tree in a program that will estimate
3470	branch lengths for it. You may need to convert it to being an unrooted tree,
3471	using Retree, first. If the original program you were using was a parsimony
3472	program, which does not estimate branch lengths, you may instead have to make
3473	some distances between your species (using, for example, DnaDist), and use
3474	Fitch to put branch lengths on the user tree. Here is the sequence of
3475	steps you should go through:
3476	<OL>
3477	<LI>Take the tree and use Retree to make sure it is Unrooted (just
3478	read it into Retree and then save it, specifying Unrooted)
3479	<LI>Use the unrooted tree as a User Tree (option <TT>U</TT>) in one of
3480	our programs (such as Fitch or DnaML). If you use Fitch, you also
3481	need to use one of the distance programs such as DnaDist to
3482	compute a set of distances to serve as its input.
3483	<LI>Specify that the branch lengths
3484	of the tree are not to be used but should be re-estimated. This
3485	is actually the default.
3486	</OL>
3487	<DT><STRONG>"DrawTree (or DrawGram) doesn't work: it can't find the font file!"</STRONG>
3488	<DD>Six font files, called <TT>font1</TT> through <TT>font6</TT>, are
3489	distributed with the executables
3490	(and with the source code too). The program looks for a copy of one of them
3491	called <TT>fontfile</TT>. If you haven't made such a copy called
3492	<TT>fontfile</TT> it then asks
3493	you for the name of the font file. If they are in the current directory, just
3494	type one of <TT>font1</TT> through <TT>font6</TT>. The reason for
3495	having the program look for <TT>fontfile</TT>
3496	is so that you can copy your favorite font file, call the copy
3497	<TT>fontfile</TT>,
3498	and then it will be found automatically without you having to type the name of
3499	the font file each time.
3500	<DT><STRONG>"Can DrawGram draw a scale beside the tree? Print the branch lengths as numbers?"</STRONG>
3501	<DD>It can't do either of these. Doing so would make the program more complex, and
3502	it is not obvious how to fit the branch length numbers into a tree that has
3503	many very short internal branches. If you want these scales or numbers,
3504	choose an output plot file format (such as Postscript, PICT or PCX) that can be read by
3505	a drawing program such as Adobe Illustrator, Freehand, Canvas, CorelDraw,
3506	or MacDraw.
3507	Then you can add the scales and branch length numbers yourself by hand. Note
3508	the menu option in DrawTree and DrawGram that specifies the tree size to be
3509	a given number of centimeters per unit branch length.
3510	<DT><STRONG>"How can I get DrawGram or DrawTree to print the bootstrap values
3511	next to the branches?"</STRONG>
3512	<DD>When you do bootstrapping and use Consense, it prints the bootstrap
3513	values in its output file (both in a table of sets, and on the diagram
3514	of the tree which it makes). These are also in the output tree file of
3515	Consense. There they are in place of branch lengths. So to get them to
3516	be on the output of DrawGram or DrawTree, you must write the tree in the
3517	format of a drawing program and use it to put the values in by hand, as
3518	mentioned in the answer to the previous question.
3519	<DT><STRONG>"I have an HP Laserjet and can't get DrawGram to print on it"</STRONG>
3520	<DD>DRAWGRAM and DRAWTREE produce a plot file (called <TT>plotfile</TT>): they
3521	do not send it to the printer. It is up to you to get the plot file to
3522	the printer. If you are running Windows or DOS this can probably be done
3523	with the MSDOS command <TT>COPY/B PLOTFILE PRN:</TT>, unless your printer
3524	is a networked printer. The <TT>/B</TT>
3525	is important. If it is omitted the copy command will strip off the
3526	highest bit of each byte, which can cause the printing to fail or produce
3527	garbage.
3528	<DT><STRONG>"DNAML won't read the treefile that is produced by DNAPARS!"</STRONG>
3529	<DD>That's because the DnaPars tree file is a rooted tree, and DnaML wants an
3530	unrooted tree. Try using Retree to change the file to be an unrooted tree
3531	file.</DD>
3532	<DT><STRONG>"In bootstrapping, SEQBOOT makes too large a file"</STRONG>
3533	<DD>If there are 1000 bootstrap replicates, it will make a file
3534	1000 times as long as your original data set. But for many methods
3535	there is another way that uses much less file space. You can use
3536	SEQBOOT to make a file of multiple sets of weights, and use those
3537	together with the original data set to do bootstrapping.
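<P>
The reason weights can stand in for whole data sets is that a bootstrap replicate which draws a site twice is the same as the original data with weight 2 on that site, and a site that is never drawn simply gets weight 0. The sketch below illustrates that idea only; it is not SEQBOOT's code and does not write SEQBOOT's file format, and the function name <TT>bootstrap_weights</TT> is hypothetical.
<PRE>
```python
import random

def bootstrap_weights(n_sites, rng):
    """Draw n_sites sites with replacement and count how often each was drawn."""
    weights = [0] * n_sites
    for _ in range(n_sites):
        weights[rng.randrange(n_sites)] += 1
    return weights
```
</PRE>
Each call, for example <TT>bootstrap_weights(500, random.Random(1))</TT>, yields one set of weights whose total equals the number of sites, so storing a replicate takes one number per site rather than a full copy of the data.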
3538	<DT><STRONG>"In bootstrapping, the output file gets too big."</STRONG>
3539	<DD> When running a program such as NEIGHBOR or DNAPARS with multiple data
3540	sets (or multiple weights) for purposes of bootstrapping,
3541	the output file is usually not needed, as it
3542	is the output tree file that is used next. You can use the menu
3543	of the program to turn off the writing of trees into the
3544	output file. The trees will still be written into the tree file.
3545	<DT><STRONG>"Why doesn't NEIGHBOR read my DNA sequences correctly?"</STRONG>
3546	<DD>Because it wants
3547	to have as input a distance matrix, not sequences. You have to use DNADIST to
3548	make the distance matrix first.
3549	<P>
3550	<H3>How to make it do various things</H3>
3551	<P>
3552	<DT><STRONG>"How do I bootstrap?"</STRONG>
3553	<DD>The general method of bootstrapping
3554	involves running SEQBOOT to make multiple bootstrapped data sets out of your
3555	one data set, then running one of the tree-making programs with the Multiple
3556	data sets option to analyze them all, then running CONSENSE to make a majority
3557	rule consensus tree from the resulting tree file. Read the documentation of
3558	SEQBOOT to get further information. Before, only parsimony methods could be
3559	bootstrapped. With this new system almost any of the tree-making methods in
3560	the package can be bootstrapped. It is somewhat more tedious but you will find
3561	it much more rewarding.
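<P>
To make the first step concrete, here is a minimal sketch of what making one bootstrapped data set means: sites (columns) are sampled with replacement until the replicate has as many sites as the original. It is an illustration only, not SEQBOOT's implementation, and the function name <TT>bootstrap_replicate</TT> and the dictionary representation of the alignment are assumptions made for the example.
<PRE>
```python
import random

def bootstrap_replicate(alignment, rng):
    """Resample alignment columns with replacement; alignment maps name to sequence."""
    lengths = {len(seq) for seq in alignment.values()}
    if len(lengths) != 1:
        raise ValueError("all sequences must have the same length")
    n_sites = lengths.pop()
    # The same sampled column indices are used for every species.
    columns = [rng.randrange(n_sites) for _ in range(n_sites)]
    return {name: "".join(seq[i] for i in columns)
            for name, seq in alignment.items()}
```
</PRE>
Every column of the replicate is an intact column of the original alignment, which is why the species stay aligned with one another in each replicate.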
3562	<DT><STRONG>"How do I specify a multi-species outgroup
3563	with your parsimony programs?"</STRONG>
3564	<DD>It's not a feature but is not too hard to do in many of the programs. In
3565	parsimony programs like MIX, for which the W (Weights) and A (Ancestral states)
3566	options are available, and weights can be larger than 1, all you need to do is:
3567	<DL COMPACT>
3568	<DT><STRONG>(a)</STRONG>
3569	<DD>In MIX, make up an extra character with states 0 for all the outgroups
3570	and 1 for all the ingroups. If using DNAPARS the ingroup can have (say)
3571	<TT>G</TT> and the outgroup <TT>A</TT>.
3572	<DT><STRONG>(b)</STRONG>
3573	<DD>Assign this character an enormous weight (such as <TT>Z</TT> for 35) using the W
3574	option, all other characters getting weight 1, or whatever weight they had
3575	before.
3576	<DT><STRONG>(c)</STRONG>
<DD>If it is available, use the A (Ancestral states) option to designate that
3578	for that new character the state found in the outgroup is the ancestral
3579	state.
3580	<DT><STRONG>(d)</STRONG>
3581	<DD>In MIX do not use the O (Outgroup) option.
3582	<DT><STRONG>(e)</STRONG>
3583	<DD>After the tree is found, the designated ingroup should have been held
3584	together by the fake character. The tree will be rooted somewhere in the
3585	outgroup (the program may or may not have a preference for one place in
3586	the outgroup over another). Make sure that you subtract from the total
3587	number of steps on the tree all steps in the new character.
3588	</DL>
3589	<P>
3590	In programs like DNAPARS, you cannot use this method as weights of sites
cannot be greater than 1. But you can do an analogous trick, by adding a
3592	largish number of extra sites to the data, with one nucleotide state ("A")
3593	for the ingroup and another ("G") for the outgroup. You will then have to
3594	use RETREE to manually reroot the tree in the desired place.
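<P>
Steps (a) and (b) for a 0/1 data set can be sketched as follows. This is an illustration under stated assumptions, not PHYLIP code: the function name <TT>add_outgroup_character</TT> and the dictionary representation of the data are hypothetical, and the weights are returned as a plain string of weight symbols that you would still have to supply to the program in whatever form its W option expects.
<PRE>
```python
def add_outgroup_character(data, outgroup, base_weights=None):
    """Append a fake 0/1 character and return (new_data, weights).

    data maps species name to a string of 0/1 states; outgroup is a set of names.
    """
    n_chars = len(next(iter(data.values())))
    if base_weights is None:
        base_weights = "1" * n_chars          # every real character gets weight 1
    new_data = {name: states + ("0" if name in outgroup else "1")
                for name, states in data.items()}
    return new_data, base_weights + "Z"       # Z stands for weight 35
```
</PRE>
Remember step (e): the steps counted in the fake character must be subtracted from the total number of steps that the program reports.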
3595	<DT><STRONG>"How do I force certain groups to remain monophyletic in your
3596	parsimony programs?"</STRONG>
3597	<DD>By the same method as in the previous question, using multiple fake characters, any number of
3598	groups of species can be forced to be monophyletic. In MOVE, DOLMOVE, and
3599	DNAMOVE you can specify whatever outgroups you want without going to this
3600	trouble.
3601	<DT><STRONG>"How can I reroot one of the trees written out by PHYLIP?"</STRONG>
3602	<DD>Use the program
3603	RETREE. But keep in mind whether the tree inferred by the original program was
3604	already rooted, or whether you are free to reroot it.
3605	<DT><STRONG>"What do I do about deletions and insertions in my sequences?"</STRONG>
3606	<DD>The
3607	molecular sequence programs will accept sequences that have gaps (the "<TT>-</TT>"
3608	character). They do various things with them, mostly not optimal. DNAPARS
3609	counts "gap" as if it were a fifth nucleotide state (in addition to A, C, G,
3610	and T). Each site counts one change when a gap arises or disappears. The
3611	disadvantage of this treatment is that a long gap will be overweighted, with
3612	one event per gapped site. So a gap of 10 nucleotides will count as being as
much evidence as 10 single site nucleotide substitutions. If there are no
overlapping gaps, one way to correct this is to recode the first site in the
3615	gap as "<TT>-</TT>" but make all the others be "<TT>?</TT>" so the gap only counts as one event.
3616	Other programs such as DNAML and DNADIST count gaps as equivalent to unknown
3617	nucleotides (or unknown amino acids) on the grounds that we don't know what
3618	would be there if something were there. This completely leaves out the
3619	information from the presence or absence of the gap itself, but does not bias
3620	the gapped sequence to be close to or far from other gapped or ungapped
3621	sequences.
3622	So it is not necessary to remove gapped regions from your
3623	sequences, unless the presence of gaps indicates that the region is
3624	badly aligned.
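<P>
The recoding suggested above for DNAPARS (first site of a gap kept as "<TT>-</TT>", the rest changed to "<TT>?</TT>") is easy to automate. The following sketch treats one sequence at a time, so it is appropriate only in the case described above where gaps do not overlap; the function name <TT>recode_gaps</TT> is hypothetical.
<PRE>
```python
import re

def recode_gaps(sequence):
    """Keep the first '-' of each run of gaps and turn the rest into '?'."""
    return re.sub(r"-+",
                  lambda run: "-" + "?" * (len(run.group()) - 1),
                  sequence)
```
</PRE>
With this recoding a gap of any length contributes a single gap event, and the remaining gapped sites are treated as unknown.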
3625	<DT><STRONG>"How can I produce distances for my data set which
3626	has 0's and 1's?"</STRONG>
3627	<DD>You can't do it in a simple and general
3628	way, for a straightforward reason. Distance methods must correct the
3629	distances for superimposed changes. Unless we know specifically how to
3630	do this for your particular characters, we cannot accomplish the
3631	correction. There are many formulas we could use, but we can't choose
3632	among them without much more information. There are issues of superimposed
3633	changes, as well as heterogeneity of rates of change in different
3634	characters. Thus we have not provided a distance program for 0/1 data.
3635	It is up to you to figure out what is an appropriate stochastic model
3636	for your data and to find the right distance formulas.
3637	<DT><STRONG>"I have RFLP fragment data: which programs should I
3638	use?"</STRONG>
<DD>This is a more difficult question than you may imagine.
Here is a quick tour of the issues:
<UL><LI>You can code fragments as 0 and 1 and use a parsimony program. It is
3642	not obvious in advance whether 0 or 1 is ancestral, though it is likely that
3643	change in one direction is more likely than change in the other for each
3644	fragment. One can use either Wagner parsimony (programs <TT>MIX</TT>,
3645	<TT>PENNY</TT> or <TT>MOVE</TT>) or use Dollo parsimony
3646	(<TT>DOLLOP, DOLPENNY</TT> or <TT>DOLMOVE</TT>)
3647	with the ancestral states all set as unknown ("<TT>?</TT>").
3648	<LI>You can use a distance matrix method using the RFLP distance of Nei and
3649	Li (1979). Their restriction fragment distance is available in our
3650	program RestDist.
3651	<LI>You should be very hesitant to bootstrap RFLP's. The individual
3652	fragments do not evolve independently: a single nucleotide substitution
3653	can eliminate one fragment and create two (or vice versa).
3654	</UL>
3655	For restriction <I>sites</I> (rather than fragments) life is a bit
3656	easier: they evolve nearly independently so bootstrapping is possible
3657	and <TT>RESTML</TT> can be used. Also directionality of change
3658	is less ambiguous when parsimony is used.
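<P>
For the first of the approaches listed above, coding fragments as 0 and 1, a sketch of the bookkeeping might look like this. It assumes that fragments in different species have already been judged to be "the same" when their recorded sizes are equal, which real gel data will rarely satisfy without some binning; the function name <TT>code_fragments</TT> and the data layout are hypothetical.
<PRE>
```python
def code_fragments(fragments_by_species):
    """Turn sets of fragment sizes into 0/1 strings over a shared, sorted column order."""
    all_sizes = sorted(set().union(*fragments_by_species.values()))
    coded = {name: "".join("1" if size in present else "0" for size in all_sizes)
             for name, present in fragments_by_species.items()}
    return all_sizes, coded
```
</PRE>
The returned list of sizes records which fragment each column of the 0/1 data stands for.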
3659	<DT><STRONG>"Why don't your parsimony programs print out branch lengths?"</STRONG>
3660	<DD>Well, DNAPARS and PARS can. The others have not yet been upgraded to the
3661	same level. The longer answer is that it is because
3662	there are problems defining the branch lengths. If you look closely at the
3663	reconstructions of the states of the hypothetical ancestral nodes for almost
3664	any data set and almost any parsimony method you will find some ambiguous
3665	states on those nodes. There is then usually an ambiguity as to which branch
3666	the change is actually on. Other parsimony programs resolve this in one or
3667	another arbitrary fashion, sometimes with the user specifying how (for example,
3668	methods that push the changes up the tree as far as possible or down it as far
3669	as possible). Our older programs leave it to the user to do this. In
3670	DNAPARS and PARS we use an algorithm discovered by Hochbaum and Pathria (1997)
3671	(and independently by Wayne Maddison) to compute branch lengths that average
3672	over all possible placements of the changes. But these branch lengths, as
nice as they are, do not correct for multiple superimposed changes. Few
3674	programs available from others currently correct the branch lengths for
3675	multiple changes of state that may have overlain each other. One possible way
3676	to get branch lengths with nucleotide sequence data is to take the tree
3677	topology that you got, use RETREE to convert it to be unrooted, prepare a
3678	distance matrix from your data using DNADIST, and then use FITCH with that tree
3679	as User Tree and see what branch lengths it estimates.
3680	<DT><STRONG>"Why can't your programs handle unordered multistate characters?"</STRONG>
3681	<DD>In this 3.6 release there is a program PARS which does parsimony for
unordered multistate characters with up to 8 states, plus <TT>?</TT>. The
other discrete characters parsimony programs can only handle two states,
3684	<TT>0</TT> and <TT>1</TT>.
3685	This is mostly because I have not yet had time to modify them to do so - the
3686	modifications would have to be extensive. Ultimately I hope to get these done.
3687	If you have four or fewer states and need a feature that is not in PARS,
3688	you could recode your states to look like nucleotides
3689	and use the parsimony programs in the molecular sequence section of PHYLIP, or
3690	you could use one of the excellent parsimony programs produced by others.
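<P>
The recoding of states as nucleotides mentioned above can be sketched as follows. This is an illustration only: the function name <TT>recode_as_nucleotides</TT> and the data layout are hypothetical, and which state is mapped to which nucleotide is arbitrary.
<PRE>
```python
def recode_as_nucleotides(data):
    """Map up to four distinct states onto A, C, G, T; '?' stays '?'."""
    states = sorted({s for row in data.values() for s in row if s != "?"})
    if len(states) > 4:
        raise ValueError("more than four states cannot be coded as nucleotides")
    table = dict(zip(states, "ACGT"))
    table["?"] = "?"
    return {name: "".join(table[s] for s in row) for name, row in data.items()}
```
</PRE>
Because the states are unordered, the particular assignment of states to nucleotides should not matter to a parsimony method that counts every nucleotide change equally.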
3691	<P>
3692	<H3>Background information needed:</H3>
3693	<P>
3694	<DT><STRONG>"What file format do I use for the sequences?"<BR>
3695	"How do I use the programs? I can't find any documentation!"</STRONG>
3696	<DD>These are discussed in the documentation files. Do you have them? If you
3697	have a copy of this page you probably do. They are
3698	in a separate archive from the executables (they are in the Documentation and
3699	Sources archives, which you should definitely fetch). Input file formats
3700	are discussed in <TT>main.html</TT>, in <TT>sequence.html</TT>, <TT>distance.html</TT>,
3701	<TT>contchar.html</TT>, <TT>discrete.html</TT>, and the documentation files for the
3702	individual programs.
<DT><STRONG>"Where can I find out how to infer
phylogenies?"</STRONG>
3705	<DD>There are few books yet. For molecular data you could use one of these:
3706	<UL>
3707	<LI> Graur, D. and W.-H. Li. 2000. <EM>Fundamentals of Molecular
3708	Evolution.</EM> Sinauer Associates, Sunderland, Massachusetts. (or the earlier edition
3709	by Li and Graur).
3710	<LI> Page, R. D. P. and E. C. Holmes. 1998. <EM>Molecular Evolution:
3711	A Phylogenetic Approach.</EM> Blackwell, Oxford.
3712	<LI> Nei, M. and S. Kumar. 2000. <EM>Molecular Evolution and
3713	Phylogenetics.</EM> Oxford University Press, Oxford.
3714	<LI> Li, W.-H. 1999. <EM>Molecular Evolution.</EM> Sinauer Associates,
3715	Sunderland, Massachusetts.
3716	</UL>
3717	In addition, one of these three review articles may help:
3718	<UL><LI>Swofford, D. L., G. J. Olsen, P. J. Waddell, and D. M. Hillis. 1996.
3719	Phylogenetic inference. pp. 407-514 in <I>Molecular Systematics</I>, 2nd ed.,
3720	ed. D. M. Hillis, C. Moritz, and B. K. Mable. Sinauer Associates, Sunderland,
3721	Massachusetts.
3722	<LI>Felsenstein, J. 1988. Phylogenies from molecular sequences: inference and
3723	reliability. <I>Annual Review of Genetics</I> <B>22:</B> 521-565.
3724	<LI>Felsenstein, J. 1988. Phylogenies and quantitative
3725	characters. <I>Annual Review of Ecology and Systematics</I> <B>19:</B> 445-471.
3726	</UL>
3727	My own book on phylogenies is due to be published in late 2002. It
3728	will be called "Inferring Phylogenies". For information on whether it has
3729	been published you should check the
3730	<A HREF="http://www.sinauer.com">Sinauer Associates web site</A>.
3731	<P>
3732	<H3>Questions about distribution and citation:</H3>
3733	<P>
3734	<DT><STRONG>"If I copied PHYLIP from a friend without you knowing, should I try
3735	to keep you from finding out?"</STRONG>
3736	<DD>No. It is to your advantage and mine for you to
3737	let me know. If you did not get PHYLIP "officially" from me or from someone
3738	authorized by me, but copied a friend's version, you are not in my database of
3739	users. You may also have an old version which has since been
3740	substantially improved. I don't mind you "bootlegging"
3741	PHYLIP (it's free anyway), but
3742	you should realize that you may have copied an outdated version. If you are reading this
3743	Web page,
you can get the latest version just as quickly over the Internet.
3745	It will help both of us if you get
3746	onto my mailing list. If you are on it, then I will give your name to other
3747	nearby users when they ask for the names of nearby users, and they are urged to contact you and
3748	update your copy. (I benefit by getting a better feel for how many
3749	distributions there have been, and having a better mailing list to use to give
3750	other users local people to contact). Use the registration form which
3751	can be accessed through our web site's registration page.
3752	<DT><STRONG>"How do I make a citation to the PHYLIP package in the paper I am
3753	writing?"</STRONG>
3754	<DD>One way is like this:
3755	<P>
3756	Felsenstein, J. 2002. PHYLIP (Phylogeny Inference Package) version 3.6a3.
3757	<I>Distributed by the author. Department of Genome Sciences, University of
3758	Washington, Seattle.</I>
3759	<P>
3760	or if the editor for whom you are writing insists that the citation must be to
3761	a printed publication, you could cite a notice for version 3.2 published in
3762	Cladistics:
3763	<P>
3764	Felsenstein, J. 1989. PHYLIP - Phylogeny Inference Package (Version 3.2).
3765	<I>Cladistics</I> <B>5:</B> 164-166.
3766	<BR>
3767	<P>
3768	For a while a printed version of the PHYLIP documentation was available and one
3769	could cite that. This is no longer true. Other than that, this is difficult,
3770	because I have never written a paper announcing PHYLIP! My 1985b paper in
3771	Evolution on the bootstrap method contains a
3772	one-paragraph Appendix describing the availability of this package, and that
3773	can also be cited as a reference for the package, although it was
3774	distributed since 1980 while the bootstrap paper is 1985. A paper on PHYLIP
3775	is needed mostly to give people something to cite, as word-of-mouth, references
3776	in other people's papers, and electronic newsgroup postings have spread the
3777	word about PHYLIP's existence quite effectively.
3778	<DT><STRONG>"Can I make copies of PHYLIP available to the students in
3779	my class?"</STRONG>
3780	<DD>Generally, yes. Read the Copyright notice near the front of
3781	this main documentation page. If you charge money for PHYLIP,
3782	or use it in a service for which you charge money, you will need
3783	to negotiate a royalty. But you can make it freely available
3784	and you do not need to get any special permission from us to do so.
3785	<DT><STRONG>"How many copies of PHYLIP have been distributed?"</STRONG>
3786	<DD>On
3787	27 September, 1996 we reached 5,000 registered installations worldwide.
3788	(By now we are well over 15,000 but have lost count for
3789	the moment). Of course there are
3790	many more people who have got copies from friends. PHYLIP is the most widely
3791	distributed phylogeny package. (This situation may reverse itself rapidly
3792	once PAUP* is fully released. During the years it was in full distribution,
PAUP was ahead in phylogenies published, and the availability of distance and
likelihood methods in PAUP* is making it very popular.)
3795	In recent years magnetic tape distribution and e-mail distribution of
3796	PHYLIP have disappeared,
3797	and there has been a big decrease of diskette distributions (down to only
3798	one or two per year). But all this has
3799	been more than offset by, first, an explosion of distributions by anonymous ftp
over the Internet, and then a bigger explosion of World Wide Web distributions and
3801	registrations (about 6 registrations per day at the moment).
3802	<P>
3803	<H3>Questions about documentation</H3>
3804	<P>
3805	<DT><STRONG>"Where can I get a printed version of the PHYLIP documents?"</STRONG>
3806	<DD>For the
3807	moment, you can only get a printed version by printing it yourself. For
3808	versions 3.1 to 3.3 a printed version was sold by Christopher Meacham and Tom
3809	Duncan, then at the University Herbarium of the University of California at
3810	Berkeley. But they have had to discontinue this as it was too much work. You
3811	should be able to print out the documentation files on almost any printer and
3812	make yourself a printed version of whichever of them you need.
3813	<DT><STRONG>"Why have I been dropped from your newsletter mailing list?"</STRONG>
3814	<DD>You haven't.
3815	The newsletter was dropped. It simply was too hard to mail it out to such a
3816	large mailing list. The last issue of the newsletter was Number 9 in May,
3817	1987. The Listserver News Bulletins that we tried for a while have also been dropped
3818	as too hard to keep up to date. I am hoping that our World Wide Web site will take their place.
3819	</DL>
3820	<P>
3821	<DIV ALIGN="CENTER">
<H3>Additional Frequently Asked Questions, or:<BR>
3823	"Why didn't it occur to you to ...</H3></DIV>
3824	<DL>
<DT><STRONG>... allow the options to be set on the command line?"</STRONG>
3826	<DD>We could in Unix and Linux, or somewhat differently in Windows. But
3827	there are so many options that this would be difficult, especially
3828	when the options require additional information to be supplied such as
3829	rates of evolution for many categories of sites. You may be asking this
3830	question because you want to automate the operation of PHYLIP programs
3831	using batch files (command files) to run in background. If that is the
3832	issue, see the section of this main documentation page on
3833	"Running the programs in background or under control of a command file".
3834	It explains how to set the options using input redirection and a file
3835	that has the menu responses as keystrokes.
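<P>
The same input redirection can be done from a script. The sketch below runs a program with its standard input taken from a file of responses; it is an illustration only, the function name <TT>run_with_responses</TT> is hypothetical, and the responses that belong in the file depend entirely on the menu of the program being run.
<PRE>
```python
import subprocess

def run_with_responses(command, response_file):
    """Run command with its standard input taken from response_file; return its output."""
    with open(response_file, "rb") as responses:
        finished = subprocess.run(command, stdin=responses,
                                  stdout=subprocess.PIPE, check=True)
    return finished.stdout.decode("latin-1")
```
</PRE>
Here <TT>command</TT> is a list such as the path to one PHYLIP executable, and the response file holds, one per line, exactly what you would otherwise have typed at the keyboard.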
3836	<DT><STRONG>... write these programs in Pascal?"</STRONG>
3837	<DD>These programs started out
3838	in Pascal in 1980. In 1993 we released both Pascal and C versions. The
3839	present version (3.6) and
3840	future versions will be C-only. I make fewer mistakes in Pascal and do
3841	like the language better than C, but C has overtaken Pascal and Pascal
3842	compilers are starting to be hard to find on some machines. Also C is a
bit better standardized, which makes the number of modifications a user
has to make to adapt the programs to their system much smaller.
3845	<DT><STRONG>... write these programs in Java?"</STRONG>
3846	<DD>Well, we might. It is not completely clear which of two contenders,
3847	C++ and Java, will become more widespread, and which one will gradually
3848	fade away. Whichever one is more successful, we will probably want to use
3849	for future versions of PHYLIP. As the C compilers that are used to
3850	compile PHYLIP are usually also able to compile C++, we will be moving in
3851	that direction, but with constant worrying about whether to convert PHYLIP
3852	to Java instead.</DD>
<DT><STRONG>... forget about all those inferior systems and just develop PHYLIP for Unix?"</STRONG>
3854	<DD>This is self-answering, since the same people first said I should
3855	just develop it for Apple II's, then for CP/M Z-80's, then for IBM PCDOS,
3856	then for Macintoshes or for Sun
3857	workstations, and then for Windows. If I had listened to them and done any one of these, I would
3858	have had a very hard time adapting the package to any of the other ones once
3859	these folks changed their mind (and most of them did)!
3860	<DT><STRONG>... write these programs in PROLOG
3861	(or Ada, or Modula-2, or SIMULA, or BCPL, or PL/I, or APL, or LISP)?"</STRONG>
3862	<DD>These are all languages I have considered. All
3863	have advantages, but they are not really widespread (as are C and C++).
3864	<DT><STRONG>... include in the package a program to do the Distance Wagner method (or
3865	successive approximations character weighting,
3866	or transformation series analysis)?"</STRONG>
3867	<DD>In most cases where I have not
3868	included other methods, it is because I decided that they had no substantial
3869	advantages over methods that were included (such as the programs FITCH,
3870	KITSCH, NEIGHBOR, the <TT>T</TT> option of MIX and DOLLOP, and the "<TT>?</TT>" ancestral
3871	states option of the discrete characters parsimony programs).
3872	<DT><STRONG>... include in the package ordination methods and more
3873	clustering algorithms?"</STRONG>
3874	<DD>Because this is <I>not</I> a clustering package, it's a
3875	package for phylogeny estimation. Those are different tasks with different
3876	objectives and mostly different methods. Mary Kuhner and Jon Yamato have,
3877	however,
3878	included in NEIGHBOR an option for UPGMA clustering, which will be very
3879	similar to KITSCH in results.
3880	<DT><STRONG>... include in the package a program to do nucleotide sequence
3881	alignment?"</STRONG>
3882	<DD>Well, yes, I should
3883	have, and this is scheduled to be in future releases. But multiple sequence
3884	alignment programs, in the era after Sankoff, Morel, and Cedergren's 1973
3885	classic paper, need to use substantial computer horsepower to estimate the
3886	alignment and the tree together (but see Karl Nicholas's program
3887	<TT>GeneDoc</TT> or Ward Wheeler and David Gladstein's <TT>MALIGN</TT>, as
3888	well as more approximate methods of tree-based alignment used in
3889	<TT>ClustalW</TT> or <TT>TreeAlign</TT>).
3890	</DL>
3891	<P>
3892	<DIV ALIGN="CENTER">
3893	<H3>(Fortunately) obsolete questions</H3></DIV>
3894	<P>
3895	(The following four questions, once
3896	common, have finally disappeared, I am pleased to report).
3897	<H4>"Why didn't it occur to you to ...</H4></DIV>
3898	<DL>
3899	<DT><STRONG>... let me log in to your computer in Seattle
3900	and copy the files out over a phone line?"</STRONG>
3901	<DD>No thanks. It would cost you for a lot of
3902	long-distance telephone time, plus a half hour of my time and yours in which
3903	I had to explain to you how to log in and do the copying.
3904	<DT><STRONG>... send me a listing of your program?"</STRONG>
3905	<DD>Damn it, it's not "a program",
3906	it's 35 programs, in a great many files. What were you
3907	thinking of doing, having 1800-line programs typed in by slaves at your
3908	end? If you were going to go to all that trouble why not try network
3909	transfer? Once you have the files you can print out all the
3910	listings you want to and add them to the huge stack of printed output in
3911	the corner of your office.
3912	<DT><STRONG>... write a magnetic tape in our computer center's favorite format
3913	(inverted Lithuanian EBCDIC at 998 bpi)?"</STRONG>
3914	<DD>Because the ANSI standard
3915	format is the most widely used one, and even though your computer center
3916	may pretend it can't read a tape written this way, if you sniff around
3917	you will find a utility to read it. It's just a <I>lot</I> easier for me to
3918	let you do that work. If I tried to put the tape into your format, I
3919	would probably get it wrong anyway.
3920	<DT><STRONG>... give us a version of these in FORTRAN?"</STRONG>
3921	<DD>Because the
3922	programs are <I>far</I> easier to write and debug in C or Pascal, and cannot
3923	easily be
3924	rewritten into FORTRAN (they make extensive use of recursive calls and
3925	of records and pointers). In any case, C is widely available. If you don't
3926	have a C compiler or don't know
3927	how to use it, you are going to have to learn a language like C or
3928	Pascal sooner or later, and the sooner the better.
3929	</DL>
3930	<P>
3931	<A NAME="newfeatures"><HR><P></A>
3932	<DIV ALIGN="CENTER">
3933	<H2>New Features in This Version</H2></DIV>
3934	<P>
3935	Version 3.6 has many new features:
3936	<UL><LI>Faster (well, less slow) likelihood programs.
3937	<LI>The DNA and protein likelihood and distance programs allow
3938	for rate variation between sites using a gamma distribution of
3939	rates among sites, or using a gamma distribution plus a given
3940	fraction of sites which are assumed invariant.
3941	<LI>A new multistate discrete characters parsimony program, PARS, that
3942	handles unordered multistate characters.
3943	<LI>The DNAPARS and PARS parsimony programs can infer multifurcating
3944	trees, which sensibly reduces the number of tied trees they find.
3945	<LI>A new protein sequence likelihood program, <TT>PROML</TT>,
3946	and also a version, <TT>PROMLK</TT>, which assumes a molecular clock.
3947	<LI>A new restriction sites and restriction fragments distance program,
3948	<TT>RESTDIST</TT>, that can also be used to compute distances for RAPD and
3949	AFLP data. It also allows for gamma-distributed rate variation among
3950	DNA sites.
3951	<LI>In the DNA likelihood programs, you can now specify different
3952	categories of rates of change (such as rates for first, second, and
3953	third positions of a coding sequence) and assign them to specific sites.
3954	This is in addition to the ability of the program to use the Hidden Markov
3955	Model mechanism to allow rates of change to vary across sites in a way that
3956	does not ask you to assign which rate goes with which site.
3957	<LI>The input files for many of the programs are now
3958	simpler, in that they do not contain options information such as specification
3959	of weights and categories. That information is now provided in separate
3960	files with default names such as <TT>weights</TT> and <TT>categories</TT>.
3961	<LI>The DNA likelihood programs can now evaluate multifurcating
3962	user trees (option <TT>U</TT>).
3963	<LI>All programs that read in user-defined trees now do so from a separate
3964	file, whose default name is <TT>intree</TT>, rather than requiring them to
3965	be in the input file as before.
3966	<LI>The DNA likelihood programs can infer the sequence at ancestral
3967	nodes in the interior of the tree.
3968	<LI>DNAPARS can now do transversion parsimony.
3969	<LI>The bootstrapping program SEQBOOT now can, instead of producing a
3970	large file containing multiple data sets, be asked instead
3971	to produce a weights file with multiple sets of weights. Many
3972	programs in this release can analyze those multiple weights together with
3973	the original data set, which saves disk space.
3974	<LI>The bootstrapping program SEQBOOT can pass weights and categories
3975	information through to a multiple weights file or a multiple categories
3976	file.
3977	<LI>SEQBOOT can also convert sequence files from Interleaved to
3978	Sequential form, or back.
3979	<LI>SEQBOOT can also write a sequence data file into a preliminary version of
3980	a new XML format which is being defined for sequence alignments,
3981	for use by programs that need XML input
3982	(none of the current PHYLIP programs yet need this format, but it
3983	will be useful in the future).
3984	<LI>RETREE can now write trees out into a preliminary version of a new XML tree
3985	file format which is in the process of being defined.
3986	<LI>The Kishino-Hasegawa-Templeton (KHT) test which compares user-defined
3987	trees (option U) is now joined by the Shimodaira-Hasegawa (SH) test
3988	(Shimodaira and Hasegawa, 1999) which corrects for comparisons among
3989	multiple tests. This avoids a statistical problem with multiple user trees.
3990	<LI>CONTRAST can now carry out an analysis that takes into account
3991	within-species variation, according to a model similar (but not
3992	identical) to that introduced by Michael Lynch (1990).
3993	<LI>A new program, TREEDIST, computes the Robinson-Foulds symmetric
3994	difference distance among trees. This measures the number of branches in
3995	the trees that are present in one but not the other.
3996	<LI>FITCH and KITSCH now have an option to make trees by the
3997	minimum evolution distance matrix method.
3998	<LI>The protein parsimony program PROTPARS now allows you to choose among
3999	a number of different genetic codes such as mitochondrial codes.
4000	<LI>The consensus tree program CONSENSE
4001	can compute the M<SUB>l</SUB> family of consensus tree methods, which
4002	generalize the Majority Rule consensus tree method. It can
4003	also compute our extended Majority Rule consensus (which is
4004	Majority Rule with some additional groups added to resolve the
4005	tree more completely), and it can also compute the original
4006	Majority Rule consensus tree method which does not add these
4007	extra groups. It can also
4008	compute the Strict consensus.
4009	<LI>The tree-drawing programs DRAWGRAM and DRAWTREE have a number of new
4010	options of kinds of file they can produce, including Windows Bitmap files,
4011	files for the Idraw and FIG X windows drawing programs, the POV ray-tracer,
4012	and even VRML Virtual Reality Markup Language files that will enable you
4013	to wander around the tree using a VRML plugin for your browser, such as
4014	Cosmo Player.
4015	<LI>DRAWTREE now uses my new Equal Daylight Algorithm to draw unrooted
4016	trees. This gives a much better-looking tree. Of course, competing programs
4017	such as TREEVIEW and PAUP draw trees that look just as good - because they
4018	too have started to use my method (with my encouragement). DRAWTREE also
4019	can use another algorithm, the n-body method.
4020	<LI>The tree-drawing programs can now produce trees across multiple
4021	pages, which is handy for looking at trees with very large numbers
4022	of tips, and for producing giant diagrams by pasting together
4023	multiple sheets of paper.
4024	</UL>
4025	<P>
4026	There are many more, lesser features added as well.
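<P>
To make the symmetric difference mentioned in the TREEDIST item above more
concrete, here is a minimal sketch in Python. It is not the code used in
TREEDIST. It assumes, purely for illustration, that each tree is given as
nested tuples of tip names rather than as a tree file, that both trees have
the same tips, and that the trees are treated as unrooted. The function names
are hypothetical. Each interior branch is identified by the set of tips on one
side of it, and the distance is the number of such branches found in exactly
one of the two trees.

```python
def tips(tree):
    """Return the frozenset of tip names under a nested-tuple tree."""
    if isinstance(tree, str):
        return frozenset([tree])
    return frozenset().union(*(tips(child) for child in tree))


def splits(tree):
    """Collect the interior branches of a tree as sets of tip names."""
    all_tips = tips(tree)
    anchor = min(all_tips)  # fixed tip used to choose one side of each branch
    found = set()

    def visit(node):
        if isinstance(node, str):
            return
        below = tips(node)
        # Always record the side of the branch that lacks the anchor tip
        side = all_tips - below if anchor in below else below
        # Ignore the root and branches that lead to a single tip
        if len(side) not in (0, 1) and len(all_tips - side) not in (0, 1):
            found.add(side)
        for child in node:
            visit(child)

    visit(tree)
    return found


def symmetric_difference(tree_a, tree_b):
    """Count interior branches present in one tree but not the other."""
    return len(splits(tree_a) ^ splits(tree_b))
```

For the two five-species trees <TT>((A,B),(C,D),E)</TT> and
<TT>((A,C),(B,D),E)</TT>, each tree has two interior branches and none is
shared, so this sketch gives a distance of 4.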
4027	<P>
4028	<A NAME="future"><HR><P></A>
4029	<DIV ALIGN="CENTER">
4030	<H2>Coming Attractions, Future Plans</H2></DIV>
4031	<P>
4032	There are some obvious deficiencies in this version. Some of these
4033	holes will be filled in the next few releases (leading to version
4034	4.0). They include:
4035	<OL>
4036	<LI>A program to align molecular sequences on a predefined User Tree may
4037	ultimately be included. This will allow alignment and phylogeny
4038	reconstruction to proceed iteratively by successive runs of two programs, one
4039	aligning on a tree and the other finding a better tree based on that alignment.
4040	In the shorter run a simple two-sequence alignment program may be included.
4041	<LI>An interactive "likelihood explorer" for DNA sequences will be written.
4042	This will allow, either with or without the assumption of a molecular
4043	clock, trees to be varied interactively so that the user can get a much
4044	better feel for the shape of the likelihood surface. Likelihood will be
4045	able to be plotted against branch lengths for any branch.
4046	<LI>If possible we will find some way of correcting for purine/pyrimidine
4047	richness variations among species, within the framework of the maximum
4048	likelihood programs. That the maximum likelihood programs do not allow
4049	for base composition variation is their major limitation at the moment.
4050	<LI>The Hidden Markov Model (regional rates) option of DNAML and DNAMLK will
4051	be generalized to allow
4052	for rates at sites to gradually change as one moves along the tree,
4053	in an attempt to implement Fitch and Markowitz's (1970) notion of "covarions".
4054	<LI>Obviously we need to start thinking about a more visual mouse/windows
4055	interface, but only if that can be used on X windows, Macintoshes, and
4056	Windows.
4057	<LI>Program PENNY and its relatives will be improved so as to run faster
4058	and find all most parsimonious trees more quickly.
4059	<LI>A more sophisticated compatibility program should be included, if I can
4060	find one.
4061	<LI>An "evolutionary clock" version of CONTML will be done, and the same
4062	may also be done for RESTML.
4063	<LI>We are gradually generalizing the tree structures in the programs to
4064	infer multifurcating trees as well as bifurcating ones.
4065	We should be able to have any program read any tree and know what to do
4066	with it, without the user having to fret about whether an unrooted tree was
4067	fed to a program that needs a rooted tree.
4068	<LI>We are economizing on the size of the source code, and enforcing some
4069	standardization of it, by putting frequently used routines in separate
4070	files which can be linked into various programs. This will enforce
4071	a rather complete standardization of our code.
4072	<LI>We will move our code to an object-oriented
4073	language, most likely C++. One could describe the language that version
4074	3.4 was written in as "Pascal", version 3.5 as "Pascal written in C",
4075	version 3.6 as "C written in C", and maybe version 4.0 as "C++ written
4076	in C" and then 4.1 as "C++ written in C++". At least that scenario
4077	is one possibility.
4078	</OL>
4079	<P>
4080	Much of the future development of the package will be in the DNA and protein
4081	likelihood programs and the distance matrix programs. This is for several
4082	reasons. First, I am more interested in those problems. Second, collection of
4083	molecular data is increasing rapidly, and those programs have the most promise
4084	for future development
4085	for those data.
4086	<P>
4087	<A NAME="endorsements"><HR><P></A>
4088	<DIV ALIGN="CENTER">
4089	<H2>Endorsements</H2></DIV>
4090	<P>
4091	Here are some comments people have made in print about PHYLIP. Explanatory
4092	material in square brackets is my own. The comments fall naturally into two groups:
4093	<P>
4094	<H3>From the pages of <I>Cladistics</I>:</H3>
4095	<P>
4096	<BLOCKQUOTE>
4097	"Under no circumstances can we recommend PHYLIP/WAG [their name for the
4098	Wagner parsimony option of MIX]."
4099	<DIV ALIGN="RIGHT">
4100	Luckow, M. and R. A. Pimentel (1985)
4101	</DIV>
4102	</BLOCKQUOTE>
4103	<P>
4104	<BLOCKQUOTE>
4105	"PHYLIP has not proven very effective in implementing parsimony (Luckow and
4106	Pimentel, 1985)."
4107	<DIV ALIGN="RIGHT">
4108	J. Carpenter (1987a)
4109	</DIV>
4110	</BLOCKQUOTE>
4111	<P>
4112	<BLOCKQUOTE>
4113	"... PHYLIP. This is the computer program where every newsletter concerning
4114	it is mostly bug-catching, some of which have been put there by previous
4115	corrections. As Platnick (1987) documents, through dint of much labor useful
4116	results may be attained with this program, but I would suggest an
4117	easier way: FORMAT b:"
4118	<DIV ALIGN="RIGHT">
4119	J. Carpenter (1987b)
4120	</DIV>
4121	</BLOCKQUOTE>
4122	<P>
4123	<BLOCKQUOTE>
4124	"PHYLIP is bug-infested and both less effective and orders of
4125	magnitude slower than other programs ...."
4126	<DIV ALIGN="RIGHT">
4127	"T. N. Nayenizgani" [J. S. Farris] (1990)
4128	</DIV>
4129	</BLOCKQUOTE>
4130	<P>
4131	<BLOCKQUOTE>
4132	"Hennig86 [by J. S. Farris] provides such substantial improvements over
4133	previously available programs (for both mainframes and microcomputers) that
4134	it should now become the tool of choice for practising systematists."
4135	<DIV ALIGN="RIGHT">
4136	N. Platnick (1989)
4137	</DIV>
4138	</BLOCKQUOTE>
4139	<P>
4140	<H3>... and in the pages of other journals:</H3>
4141	<P>
4142	<BLOCKQUOTE>
4143	"The availability, within PHYLIP of distance, compatibility, maximum likelihood,
4144	and generalized `invariants' algorithms (Cavender and Felsenstein, 1987) sets
4145	it apart from other packages .... One of the strengths of PHYLIP is its
4146	documentation ...."
4147	<DIV ALIGN="RIGHT">
4148	Michael J. Sanderson (1990)
4149	</DIV>
4150	<EM>(Sanderson also criticizes PHYLIP for slowness and inflexibility of its
4151	parsimony algorithms, and compliments other packages on their strengths).</EM>
4152	</BLOCKQUOTE>
4153	<P>
4154	<BLOCKQUOTE>
4155	"This package of programs has gradually become a basic necessity to anyone
4156	working seriously on various aspects of phylogenetic inference .... The package
4157	includes more programs than any other known phylogeny package. But it is not
4158	just a collection of cladistic and related programs. The package has great
4159	value added to the whole, and for this it is unique and of extreme
4160	importance .... its various strengths are in the great array of methods
4161	provided ...."
4162	<DIV ALIGN="RIGHT">
4163	Bernard R. Baum (1989)
4164	</DIV>
4165	</BLOCKQUOTE>
4166	<P>
4167	(note also W. Fink's critical remarks (1986) on version 2.8 of PHYLIP).
4168	<P>
4169	<A NAME="references"><HR><P></A>
4170	<DIV ALIGN="CENTER">
4171	<H2>References for the Documentation Files</H2></DIV>
4172	<P>
4173	In the documentation files that follow I frequently refer to papers
4174	in the literature. In order to centralize the references they are given
4175	in this section. The chapter by David Swofford,
4176	Gary Olsen, Peter Waddell, and David Hillis
4177	(1996) is also an excellent review of the issues in phylogeny
4178	reconstruction.
4179	If you want to find further papers beyond these, my
4180	Quarterly Review of Biology review of 1982 and my Annual Review of Genetics
4181	review of 1988 list many further references.
4182	<P>
4183	Adams, E. N. 1972. Consensus techniques and the comparison of
4184	taxonomic trees. <I>Systematic Zoology</I> <B>21:</B> 390-397.
4185	<P>
4186	Adams, E. N. 1986. N-trees as nestings: complexity, similarity, and
4187	consensus. <I>Journal of Classification</I> <B>3:</B> 299-317.
4188	<P>
4189	Archie, J. W. 1989. A randomization test for phylogenetic information in
4190	systematic data. <I>Systematic Zoology</I> <B>38:</B> 219-252.
4191	<P>
4192	Barry, D., and J. A. Hartigan. 1987. Statistical analysis of hominoid
4193	molecular evolution. <I>Statistical Science</I> <B>2:</B> 191-210.
4194	<P>
4195	Baum, B. R. 1989. PHYLIP: Phylogeny Inference Package. Version 3.2. (Software
4196	review). <I>Quarterly Review of Biology</I> <B>64:</B> 539-541.
4197	<P>
4198	Bron, C., and J. Kerbosch. 1973. Algorithm 457: Finding all cliques
4199	of an undirected graph. <I>Communications of the Association for Computing Machinery</I> <B>16:</B> 575-577.
4200	<P>
4201	Camin, J. H., and R. R. Sokal. 1965. A method for deducing branching
4202	sequences in phylogeny. <I>Evolution</I> <B>19:</B> 311-326.
4203	<P>
4204	Carpenter, J. 1987a. A report on the Society for the Study of Evolution
4205	workshop "Computer Programs for Inferring Phylogenies". <I>Cladistics</I> <B>3:</B>
4206	363-375.
4207	<P>
4208	Carpenter, J. 1987b. Cladistics of cladists. <I>Cladistics</I> <B>3:</B> 363-375.
4209	<P>
4210	Cavalli-Sforza, L. L., and A. W. F. Edwards. 1967. Phylogenetic
4211	analysis: models and estimation procedures. <I>Evolution</I> <B>21:</B> 550-570
4212	(also <I>American Journal of Human Genetics</I> <B>19:</B> 233-257).
4213	<P>
4214	Cavender, J. A. and J. Felsenstein. 1987. Invariants of phylogenies in a
4215	simple case with discrete states. <I>Journal of Classification</I> <B>4:</B> 57-71.
4216	<P>
4217	Churchill, G. A. 1989. Stochastic models for heterogeneous DNA sequences.
4218	<I>Bulletin of Mathematical Biology</I> <B>51:</B> 79-94.
4219	<P>
4220	Conn, E. E. and P. K. Stumpf. 1963. <I>Outlines of Biochemistry.</I> John Wiley
4221	and Sons, New York.
4222	<P>
4223	Day, W. H. E. 1983. Computationally difficult parsimony problems in
4224	phylogenetic systematics. <I>Journal of Theoretical Biology</I> <B>103:</B>
4225	429-438.
4226	<P>
4227	Dayhoff, M. O. and R. V. Eck. 1968. <I>Atlas of Protein Sequence
4228	and Structure 1967-1968.</I> National Biomedical Research Foundation,
4229	Silver Spring, Maryland.
4230	<P>
4231	Dayhoff, M. O., R. M. Schwartz, and B. C. Orcutt. 1979. A model of
4232	evolutionary change in proteins. pp. 345-352 in <I>Atlas of
4233	Protein Sequence and Structure, volume 5, supplement 3, 1978,</I> ed.
4234	M. O. Dayhoff. National Biomedical Research Foundation, Silver Spring,
4235	Maryland.
4236	<P>
4237	Dayhoff, M. O. 1979. <I>Atlas of Protein Sequence and Structure, Volume 5,
4238	Supplement 3, 1978.</I> National Biomedical Research Foundation, Washington, D.C.
4239	<P>
4240	DeBry, R. W. and N. A. Slade. 1985. Cladistic analysis of restriction
4241	endonuclease cleavage maps within a maximum-likelihood framework.
4242	<I>Systematic Zoology</I> <B>34:</B> 21-34.
4243	<P>
4244	Dempster, A. P., N. M. Laird, and D. B. Rubin. 1977. Maximum
4245	likelihood from incomplete data via the EM algorithm. <I>Journal of the Royal Statistical Society B</I> <B>39:</B> 1-38.
4246	<P>
4247	Eck, R. V., and M. O. Dayhoff. 1966. <I>Atlas of Protein Sequence and
4248	Structure 1966.</I> National Biomedical Research Foundation, Silver
4249	Spring, Maryland.
4250	<P>
4251	Edwards, A. W. F., and L. L. Cavalli-Sforza. 1964. Reconstruction of
4252	evolutionary trees. pp. 67-76 in <I>Phenetic and Phylogenetic
4253	Classification,</I> ed. V. H. Heywood and J. McNeill. Systematics
4254	Association Volume No. 6. Systematics Association, London.
4255	<P>
4256	Estabrook, G. F., C. S. Johnson, Jr., and F. R. McMorris. 1976a. A
4257	mathematical foundation for the analysis of character
4258	compatibility. <I>Mathematical Biosciences</I> <B>23:</B> 181-187.
4259	<P>
4260	Estabrook, G. F., C. S. Johnson, Jr., and F. R. McMorris. 1976b. An
4261	algebraic analysis of cladistic characters. <I>Discrete Mathematics</I> <B>16:</B> 141-147.
4262	<P>
4263	Estabrook, G. F., F. R. McMorris, and C. A. Meacham. 1985. Comparison of
4264	undirected phylogenetic trees based on subtrees of four evolutionary units.
4265	<I>Systematic Zoology</I> <B>34:</B> 193-200.
4266	<P>
4267	Faith, D. P. 1990. Chance marsupial relationships. <I>Nature</I> <B>345:</B> 393-394.
4268	<P>
4269	Faith, D. P. and P. S. Cranston. 1991. Could a cladogram this short have
4270	arisen by chance alone?: On permutation tests for cladistic
4271	structure. <I>Cladistics</I> <B>7:</B> 1-28.
4272	<P>
4273	Farris, J. S. 1977. Phylogenetic analysis under Dollo's Law. <I>Systematic Zoology</I> <B>26:</B> 77-88.
4274	<P>
4275	Farris, J. S. 1978a. Inferring phylogenetic trees from chromosome
4276	inversion data. <I>Systematic Zoology</I> <B>27:</B> 275-284.
4277	<P>
4278	Farris, J. S. 1981. Distance data in phylogenetic analysis. pp. 3-23
4279	in <I>Advances in Cladistics: Proceedings of the first meeting of the
4280	Willi Hennig Society,</I> ed. V. A. Funk and D. R. Brooks. New York
4281	Botanical Garden, Bronx, New York.
4282	<P>
4283	Farris, J. S. 1983. The logical basis of phylogenetic analysis. pp. 1-47
4284	in <I>Advances in Cladistics, Volume 2, Proceedings of the Second Meeting of
4285	the Willi Hennig Society.</I> ed. Norman I. Platnick and V. A. Funk. Columbia
4286	University Press, New York.
4287	<P>
4288	Farris, J. S. 1985. Distance data revisited. <I>Cladistics</I> <B>1:</B> 67-85.
4289	<P>
4290	Farris, J. S. 1986. Distances and statistics. <I>Cladistics</I> <B>2:</B> 144-157.
4291	<P>
4292	Farris, J. S. ["T. N. Nayenizgani"]. 1990. The systematics association
4293	enters its golden years (review of <I>Prospects in Systematics</I>, ed. D.
4294	Hawksworth). <I>Cladistics</I> <B>6:</B> 307-314.
4295	<P>
4296	Felsenstein, J. 1973a. Maximum likelihood and minimum-steps methods
4297	for estimating evolutionary trees from data on discrete characters.
4298	<I>Systematic Zoology</I> <B>22:</B> 240-249.
4299	<P>
4300	Felsenstein, J. 1973b. Maximum-likelihood estimation of evolutionary
4301	trees from continuous characters. <I>American Journal of Human Genetics</I> <B>25:</B>
4302	471-492.
4303	<P>
4304	Felsenstein, J. 1978a. The number of evolutionary trees. <I>Systematic Zoology</I> <B>27:</B> 27-33.
4305	<P>
4306	Felsenstein, J. 1978b. Cases in which parsimony and compatibility
4307	methods will be positively misleading. <I>Systematic Zoology</I> <B>27:</B>
4308	401-410.
4309	<P>
4310	Felsenstein, J. 1979. Alternative methods of phylogenetic inference
4311	and their interrelationship. <I>Systematic Zoology</I> <B>28:</B> 49-62.
4312	<P>
4313	Felsenstein, J. 1981a. Evolutionary trees from DNA sequences: a
4314	maximum likelihood approach. <I>Journal of Molecular Evolution</I> <B>17:</B> 368-376.
4315	<P>
4316	Felsenstein, J. 1981b. A likelihood approach to character weighting
4317	and what it tells us about parsimony and compatibility. <I>Biological Journal of the Linnean Society</I> <B>16:</B> 183-196.
4318	<P>
4319	Felsenstein, J. 1981c. Evolutionary trees from gene frequencies and
4320	quantitative characters: finding maximum likelihood estimates.
4321	<I>Evolution</I> <B>35:</B> 1229-1242.
4322	<P>
4323	Felsenstein, J. 1982. Numerical methods for inferring evolutionary
4324	trees. <I>Quarterly Review of Biology</I> <B>57:</B> 379-404.
4325	<P>
4326	Felsenstein, J. 1983b. Parsimony in systematics: biological and
4327	statistical issues. <I>Annual Review of Ecology and Systematics</I> <B>14:</B> 313-333.
4328	<P>
4329	Felsenstein, J. 1984a. Distance methods for inferring phylogenies: a
4330	justification. <I>Evolution</I> <B>38:</B> 16-24.
4331	<P>
4332	Felsenstein, J. 1984b. The statistical approach to inferring
4333	evolutionary trees and what it tells us about parsimony and
4334	compatibility. pp. 169-191 in: <I>Cladistics: Perspectives in the
4335	Reconstruction of Evolutionary History,</I> edited by T. Duncan and T. F.
4336	Stuessy. Columbia University Press, New York.
4337	<P>
4338	Felsenstein, J. 1985a. Confidence limits on phylogenies with a molecular
4339	clock. <I>Systematic Zoology</I> <B>34:</B> 152-161.
4340	<P>
4341	Felsenstein, J. 1985b. Confidence limits on phylogenies: an approach
4342	using the bootstrap. <I>Evolution</I> <B>39:</B> 783-791.
4343	<P>
4344	Felsenstein, J. 1985c. Phylogenies from gene frequencies: a statistical
4345	problem. <I>Systematic Zoology</I> <B>34:</B> 300-311.
4346	<P>
4347	Felsenstein, J. 1985d. Phylogenies and the comparative method. <I>American Naturalist</I> <B>125:</B> 1-12.
4348	<P>
4349	Felsenstein, J. 1986. Distance methods: a reply to Farris. <I>Cladistics</I> <B>2:</B>
4350	130-144.
4351	<P>
4352	Felsenstein, J. and E. Sober. 1986. Parsimony and likelihood: an
4353	exchange. <I>Systematic Zoology</I> <B>35:</B> 617-626.
4354	<P>
4355	Felsenstein, J. 1988a. Phylogenies and quantitative characters. <I>Annual Review of Ecology and Systematics</I> <B>19:</B> 445-471.
4356	<P>
4357	Felsenstein, J. 1988b. Phylogenies from molecular sequences: inference and
4358	reliability. <I>Annual Review of Genetics</I> <B>22:</B> 521-565.
4359	<P>
4360	Felsenstein, J. 1992. Phylogenies from restriction sites, a
4361	maximum likelihood approach. <I>Evolution</I> <B>46:</B> 159-173.
4362	<P>
4363	Felsenstein, J. and G. A. Churchill. 1996.
4364	A hidden Markov model approach to variation among sites in rate of evolution.
4365	<I>Molecular Biology and Evolution</I> <B>13:</B> 93-104.
4366	<P>
4367	Fink, W. L. 1986. Microcomputers and phylogenetic analysis. <I>Science</I> <B>234:</B> 1135-1139.
4368	<P>
4369	Fitch, W. M., and E. Markowitz. 1970. An improved method for determining
4370	codon variability in a gene and its application to the rate of fixation of
4371	mutations in evolution. <I>Biochemical Genetics</I> <B>4:</B> 579-593.
4372	<P>
4373	Fitch, W. M., and E. Margoliash. 1967. Construction of phylogenetic
4374	trees. <I>Science</I> <B>155:</B> 279-284.
4375	<P>
4376	Fitch, W. M. 1971. Toward defining the course of evolution: minimum
4377	change for a specified tree topology. <I>Systematic Zoology</I> <B>20:</B> 406-416.
4378	<P>
4379	Fitch, W. M. 1975. Toward finding the tree of maximum parsimony. pp. 189-230
4380	in <I>Proceedings of the Eighth International Conference on Numerical Taxonomy,</I>
4381	ed. G. F. Estabrook. W. H. Freeman, San Francisco.
4382	<P>
4387	George, D. G., L. T. Hunt, and W. C. Barker. 1988. Current methods in
4388	sequence comparison and analysis. pp. 127-149 in <I>Macromolecular Sequencing
4389	and Synthesis,</I> ed. D. H. Schlesinger. Alan R. Liss, New York.
4390	<P>
4391	Gomberg, D. 1966. "Bayesian" post-diction in an evolution process.
4392	unpublished manuscript: University of Pavia, Italy.
4393	<P>
4394	Graham, R. L., and L. R. Foulds. 1982. Unlikelihood that minimal
4395	phylogenies for a realistic biological study can be constructed in
4396	reasonable computational time. <I>Mathematical Biosciences</I> <B>60:</B> 133-142.
4397	<P>
4398	Hasegawa, M. and T. Yano. 1984a. Maximum likelihood method of phylogenetic
4399	inference from DNA sequence data. <I>Bulletin of the Biometric Society of Japan</I> No. 5: 1-7.
4400	<P>
4401	Hasegawa, M. and T. Yano. 1984b. Phylogeny and classification of
4402	Hominoidea as inferred from DNA sequence data. <I>Proceedings of the Japan Academy</I> <B>60 B:</B> 389-392.
4403	<P>
4404	Hasegawa, M., Y. Iida, T. Yano, F. Takaiwa, and M. Iwabuchi. 1985a.
4405	Phylogenetic relationships among eukaryotic kingdoms as inferred from
4406	ribosomal RNA sequences. <I>Journal of Molecular Evolution</I> <B>22:</B> 32-38.
4407	<P>
4408	Hasegawa, M., H. Kishino, and T. Yano. 1985b. Dating of the human-ape
4409	splitting by a molecular clock of mitochondrial DNA. <I>Journal of Molecular
4410	Evolution</I> <B>22:</B> 160-174.
4411	<P>
4412	Hendy, M. D., and D. Penny. 1982. Branch and bound algorithms to
4413	determine minimal evolutionary trees. <I>Mathematical Biosciences</I> <B>59:</B> 277-290.
4414	<P>
4415	Higgins, D. G. and P. M. Sharp. 1989. Fast and sensitive
4416	multiple sequence alignments on a microcomputer. <I>Computer Applications in the Biosciences (CABIOS)</I> <B>5:</B> 151-153.
4417	<P>
4418	Hochbaum, D. S. and A. Pathria. 1997. Path costs in evolutionary
4419	tree reconstruction. <I>Journal of Computational Biology</I> <B>4:</B> 163-175.
4420	<P>
4421	Holmquist, R., M. M. Miyamoto, and M. Goodman. 1988. Higher-primate
4422	phylogeny - why can't we decide? <I>Molecular Biology and Evolution</I> <B>5:</B> 201-216.
4423	<P>
4424	Inger, R. F. 1967. The development of a phylogeny of frogs.
4425	<I>Evolution</I> <B>21:</B> 369-384.
4426	<P>
4427	Jin, L. and M. Nei. 1990. Limitations of the evolutionary parsimony method
4428	of phylogenetic analysis. <I>Molecular Biology and Evolution</I> <B>7:</B> 82-102.
4429	<P>
4430	Jones, D. T., W. R. Taylor and J. M. Thornton. 1992. The rapid generation of
4431	mutation data matrices from protein sequences. <I>Computer Applications
4432	in the Biosciences (CABIOS)</I> <B>8:</B> 275-282.
4433	<P>
4434	Jukes, T. H. and C. R. Cantor. 1969. Evolution of protein molecules. pp.
4435	21-132 in <I>Mammalian Protein Metabolism,</I> ed. H. N. Munro. Academic Press, New
4436	York.
4437	<P>
4438	Kidd, K. K. and L. A. Sgaramella-Zonta. 1971. Phylogenetic analysis: concepts
4439	and methods. <I>American Journal of Human Genetics</I> <B>23:</B> 235-252.
4440	<P>
4441	Kim, J. and M. A. Burgman. 1988. Accuracy of phylogenetic-estimation
4442	methods using simulated allele-frequency data. <I>Evolution</I> <B>42:</B> 596-602.
4443	<P>
4444	Kimura, M. 1980. A simple model for estimating evolutionary rates of base
4445	substitutions through comparative studies of nucleotide sequences. <I>Journal of Molecular Evolution</I> <B>16:</B> 111-120.
4446	<P>
4447	Kimura, M. 1983. <I>The Neutral Theory of Molecular Evolution.</I> Cambridge
4448	University Press, Cambridge.
4449	<P>
4450	Kingman, J. F. C. 1982a. The coalescent. <I>Stochastic Processes and Their Applications</I> <B>13:</B> 235-248.
4451	<P>
4452	Kingman, J. F. C. 1982b. On the genealogy of large populations. <I>Journal of Applied Probability</I> <B>19A:</B> 27-43.
4453	<P>
4454	Kishino, H. and M. Hasegawa. 1989. Evaluation of the maximum likelihood
4455	estimate of the evolutionary tree topologies from DNA sequence data, and the
4456	branching order in Hominoidea. <I>Journal of Molecular Evolution</I> <B>29:</B> 170-179.
4457	<P>
4458	Kluge, A. G., and J. S. Farris. 1969. Quantitative phyletics and the
4459	evolution of anurans. <I>Systematic Zoology</I> <B>18:</B> 1-32.
4460	<P>
4461	Kuhner, M. K. and J. Felsenstein. 1994. A simulation comparison of
4462	phylogeny algorithms under equal and unequal evolutionary rates.
4463	<I>Molecular Biology and Evolution</I> <B>11:</B> 459-468 (Erratum <B>12:</B> 525, 1995).
4464	<P>
4465	Künsch, H. R. 1989. The jackknife and the bootstrap for general stationary
4466	observations. <I>Annals of Statistics</I> <B>17:</B> 1217-1241.
4467	<P>
4468	Lake, J. A. 1987. A rate-independent technique for analysis of nucleic acid
4469	sequences: evolutionary parsimony. <I>Molecular Biology and Evolution</I> <B>4:</B> 167-191.
4470	<P>
4471	Lake, J. A. 1994. Reconstructing evolutionary trees from DNA and protein
4472	sequences: paralinear distances.
4473	<I>Proceedings of the National Academy of Sciences, USA</I> <B>91:</B> 1455-1459.
4474	<P>
4475	Le Quesne, W. J. 1969. A method of selection of characters in
4476	numerical taxonomy. <I>Systematic Zoology</I> <B>18:</B> 201-205.
4477	<P>
4478	Le Quesne, W. J. 1974. The uniquely evolved character concept and its
4479	cladistic application. <I>Systematic Zoology</I> <B>23:</B> 513-517.
4480	<P>
4481	Lewis, H. R., and C. H. Papadimitriou. 1978. The efficiency of
4482	algorithms. <I>Scientific American</I> <B>238:</B> 96-109 (January issue).
4483	<P>
4484	Lockhart, P. J., M. A. Steel, M. D. Hendy, and D. Penny. 1994.
4485	Recovering evolutionary trees under a more realistic model of sequence
4486	evolution. <I>Molecular Biology and Evolution</I> <B>11:</B> 605-612.
4487	<P>
4488	López-Martínez, N.; Álvarez-Sierra,
4489	M. A. &amp; García Moreno, E. 1986. Paleontología y
4490	Bioestratigrafía
4491	(Micromamíferos) del Mioceno medio-superior del Sector Central de
4492	la Cuenca del Duero. <I>Stvdia Geologica Salmanticensia</I>
4493	<B>22:</B> 146-191.
4494	<P>
4495	Luckow, M. and D. Pimentel. 1985. An empirical comparison of
4496	numerical Wagner computer programs. <I>Cladistics</I> <B>1:</B> 47-66.
4497	<P>
4498	Lynch, M. 1990. Methods for the analysis of comparative data in evolutionary
4499	biology. <I>Evolution</I> <B>45:</B> 1065-1080.
4500	<P>
4501	Maddison, D. R. 1991. The discovery and importance of multiple islands of
4502	most-parsimonious trees. <I>Systematic Zoology</I> <B>40:</B> 315-328.
4503	<P>
4504	Margush, T. and F. R. McMorris. 1981. Consensus n-trees. <I>Bulletin of Mathematical Biology</I> <B>43:</B> 239-244.
4505	<P>
4506	Nelson, G. 1979. Cladistic analysis and synthesis: principles and definitions,
4507	with a historical note on Adanson's <I>Familles des Plantes</I>
4508	(1763-1764). <I>Systematic Zoology</I> <B>28:</B> 1-21.
4509	<P>
4510	Nei, M. 1972. Genetic distance between populations. <I>American Naturalist</I> <B>106:</B> 283-292.
4511	<P>
4512	Nei, M. and W.-H. Li. 1979. Mathematical model for studying genetic variation
4513	in terms of restriction endonucleases. <I>Proceedings of the National Academy of Sciences, USA</I> <B>76:</B> 5269-5273.
4514	<P>
4515	Page, R. D. M. 1989. Comments on component-compatibility in historical
4516	biogeography. <I>Cladistics</I> <B>5:</B> 167-182.
4517	<P>
4518	Penny, D. and M. D. Hendy. 1985. Testing methods of evolutionary tree
4519	construction. <I>Cladistics</I> <B>1:</B> 266-278.
4520	<P>
4521	Platnick, N. 1987. An empirical comparison of microcomputer parsimony
4522	programs. <I>Cladistics</I> <B>3:</B> 121-144.
4523	<P>
4524	Platnick, N. 1989. An empirical comparison of microcomputer parsimony
4525	programs. II. <I>Cladistics</I> <B>5:</B> 145-161.
4526	<P>
4527	Reynolds, J. B., B. S. Weir, and C. C. Cockerham. 1983. Estimation of the
4528	coancestry coefficient: basis for a short-term genetic
4529	distance. <I>Genetics</I> <B>105:</B> 767-779.
4530	<P>
4531	Robinson, D. F. and L. R. Foulds. 1981. Comparison of phylogenetic trees.
4532	<I>Mathematical Biosciences</I> <B>53:</B> 131-147.
4533	<P>
4534	Rohlf, F. J. and M. C. Wooten. 1988. Evaluation of the restricted maximum
4535	likelihood method for estimating phylogenetic trees using simulated allele-
4536	frequency data. <I>Evolution</I> <B>42:</B> 581-595.
4537	<P>
4538	Rzhetsky, A., and M. Nei. 1992. Statistical properties of the ordinary
4539	least-squares, generalized least-squares, and minimum-evolution methods
4540	of phylogenetic inference. <I>Journal of Molecular Evolution</I> <B>35:</B>
4541	367-375.
4542	<P>
4543	Saitou, N. and M. Nei. 1987. The neighbor-joining method: a new method for
4544	reconstructing phylogenetic trees. <I>Molecular Biology and Evolution</I> <B>4:</B> 406-425.
4545	<P>
4546	Sanderson, M. J. 1990. Flexible phylogeny reconstruction: a review of
4547	phylogenetic inference packages using parsimony. <I>Systematic Zoology</I> <B>39:</B> 414-420.
4548	<P>
4549	Sankoff, D. D., C. Morel, and R. J. Cedergren. 1973. Evolution of 5S RNA and
4550	the nonrandomness of base replacement. <I>Nature New Biology</I> <B>245:</B> 232-234.
4551	<P>
4552	Shimodaira, H. and M. Hasegawa. 1999. Multiple comparisons of log-likelihoods
4553	with applications to phylogenetic inference. <I>Molecular Biology and
4554	Evolution</I> <B>16:</B> 1114-1116.
4555	<P>
4559	Smouse, P. E. and W.-H. Li. 1987. Likelihood analysis of mitochondrial
4560	restriction-cleavage patterns for the human-chimpanzee-gorilla trichotomy.
4561	<I>Evolution</I> <B>41:</B> 1162-1176.
4562	<P>
4563	Sober, E. 1983a. Parsimony in systematics: philosophical issues. <I>Annual Review of Ecology and Systematics</I> <B>14:</B> 335-357.
4564	<P>
4565	Sober, E. 1983b. A likelihood justification of parsimony. <I>Cladistics</I> <B>1:</B> 209-233.
4566	<P>
4567	Sober, E. 1988. <I>Reconstructing the Past: Parsimony, Evolution,
4568	and Inference.</I> MIT Press, Cambridge, Massachusetts.
4569	<P>
4570	Sokal, R. R., and P. H. A. Sneath. 1963. <I>Principles of Numerical
4571	Taxonomy.</I> W. H. Freeman, San Francisco.
4572	<P>
4573	Steel, M. A. 1994. Recovering a tree from the Markov leaf colourations
4574	it generates under a Markov model. <I>Applied Mathematics Letters</I>
4575	<B>7:</B> 19-23.
4576	<P>
4577	Studier, J. A. and K. J. Keppler. 1988. A note on the neighbor-joining
4578	algorithm of Saitou and Nei. <I>Molecular Biology and Evolution</I> <B>5:</B> 729-731.
4579	<P>
4580	Swofford, D. L. and G. J. Olsen. 1990. Phylogeny reconstruction. Chapter
4581	11, pages 411-501 in <I>Molecular Systematics,</I> ed. D. M. Hillis and C. Moritz.
4582	Sinauer Associates, Sunderland, Massachusetts.
4583	<P>
4584	Swofford, D. L., G. J. Olsen, P. J. Waddell, and D. M. Hillis. 1996.
4585	Phylogenetic inference. pp. 407-514 in <I>Molecular Systematics</I>, 2nd ed.,
4586	ed. D. M. Hillis, C. Moritz, and B. K. Mable. Sinauer Associates, Sunderland,
4587	Massachusetts.
4588	<P>
4589	Templeton, A. R. 1983. Phylogenetic inference from restriction endonuclease
4590	cleavage site maps with particular reference to the evolution of humans and the
4591	apes. <I>Evolution</I> <B>37:</B> 221-244.
4592	<P>
4593	Thompson, E. A. 1975. <I>Human Evolutionary Trees.</I> Cambridge University
4594	Press, Cambridge.
4595	<P>
4596	Wu, C. F. J. 1986. Jackknife, bootstrap and other resampling plans in
4597	regression analysis. <I>Annals of Statistics</I> <B>14:</B> 1261-1295.
4598	<P>
4599	Yang, Z. 1993. Maximum-likelihood estimation of phylogeny from DNA sequences
4600	when substitution rates differ over sites. <I>Molecular Biology and
4601	Evolution</I> <B>10:</B> 1396-1401.
4602	<P>
4603	Yang, Z. 1994. Maximum likelihood phylogenetic estimation from DNA sequences
4604	with variable rates over sites: approximate methods. <I>Journal of Molecular
4605	Evolution</I> <B>39:</B> 306-314.
4606	<P>
4607	Yang, Z. 1995. A space-time process model for the evolution of DNA sequences.
4608	<I>Genetics</I> <B>139:</B> 993-1005.
4609	<P>
4610	<DIV ALIGN="CENTER">
4611	<H2>Credits</H2></DIV>
4612	<P>
4613	Over the years various granting agencies have contributed to the
4614	support of the PHYLIP project (at first without knowing it). They are:
4615	<P>
4616	<TABLE CELLPADDING=3 BORDER="1">
4617	<TR><TD ALIGN="LEFT">Years</TD>
4618	<TD ALIGN="LEFT">Agency</TD>
4619	<TD ALIGN="LEFT">Grant or Contract Number</TD>
4620	</TR>
4621	<TR><TD ALIGN="LEFT">1999-2002</TD>
4622	<TD ALIGN="LEFT">NSF</TD>
4623	<TD ALIGN="LEFT">BIR-9527687</TD>
4624	</TR>
4625	<TR><TD ALIGN="LEFT">1999-2002</TD>
4626	<TD ALIGN="LEFT">NIH NIGMS</TD>
4627	<TD ALIGN="LEFT">R01 GM51929-04</TD>
4628	</TR>
4629	<TR><TD ALIGN="LEFT">1999-2001</TD>
4630	<TD ALIGN="LEFT">NIH NIMH</TD>
4631	<TD ALIGN="LEFT">R01 HG01989-01</TD>
4632	</TR>
4633	<TR><TD ALIGN="LEFT">1995-1999</TD>
4634	<TD ALIGN="LEFT">NIH NIGMS</TD>
4635	<TD ALIGN="LEFT">R01 GM51929-01</TD>
4636	</TR>
4637	<TR><TD ALIGN="LEFT">1992-1995 </TD>
4638	<TD ALIGN="LEFT">National Science Foundation</TD>
4639	<TD ALIGN="LEFT">DEB-9207558</TD>
4640	</TR>
4641	<TR><TD ALIGN="LEFT">1992-1994</TD>
4642	<TD ALIGN="LEFT">NIH NIGMS Shannon Award</TD>
4643	<TD ALIGN="LEFT">2 R55 GM41716-04</TD>
4644	</TR>
4645	<TR><TD ALIGN="LEFT">
4646	1989-1992</TD>
4647	<TD ALIGN="LEFT">NIH NIGMS</TD>
4648	<TD ALIGN="LEFT">1 R01-GM41716-01</TD>
4649	</TR>
4650	<TR><TD ALIGN="LEFT">
4651	1990-1992</TD>
4652	<TD ALIGN="LEFT">National Science Foundation</TD>
4653	<TD ALIGN="LEFT">BSR-8918333</TD>
4654	</TR>
4655	<TR><TD ALIGN="LEFT">
4656	1987-1990</TD>
4657	<TD ALIGN="LEFT">National Science Foundation</TD>
4658	<TD ALIGN="LEFT">BSR-8614807</TD>
4659	</TR>
4660	<TR><TD ALIGN="LEFT">1979-1987</TD>
4661	<TD ALIGN="LEFT">U.S. Department of Energy</TD>
4662	<TD ALIGN="LEFT">DE-AM06-76RLO2225 TA DE-AT06-76EV71005</TD>
4663	</TR>
4664	</TABLE>
4665	<P>
4666	I am particularly grateful to program administrators William Moore,
4667	Irene Eckstrand, Peter Arzberger, and Conrad Istock, who have
4668	gone beyond the call of duty to make sure that PHYLIP continued.
4669	<P>
4670	Booby prizes for funding are awarded to:
4671	<UL><LI>The people at the U.S. Department of Energy who, in 1987, decided they
4672	were "not interested in phylogenies",
4673	<LI>The members of the Systematics Panel of NSF who twice (in 1989 and 1992)
4674	positively recommended that my applications <I>not</I> be funded. I am very
4675	grateful to program director William Moore for courageously overruling
4676	their decision the first time. The 1992 NSF Systematics Panel could claim
4677	no credit for PHYLIP whatsoever.
4678	<LI>The members of the 1992 Genetics Study Section of NIH who rated my
4679	proposal in the 53rd percentile (I don't know if that's 53rd from
4680	the top or the bottom, but does it matter?), thus denying it funding. I am,
4681	however, grateful to the NIGMS administrators, especially Irene Eckstrand,
4682	who supported giving me
4683	a "Shannon award" partially funding my work for a period in spite of this
4684	rating.
4685	</UL>
4686	<P>
4687	The original Camin-Sokal parsimony program and the polymorphism parsimony
4688	program were written by me in 1977 and 1978. They were Pascal versions of
4689	earlier FORTRAN programs I wrote in 1966 and 1967 using the same algorithm to
4690	infer phylogenies under the Camin-Sokal and polymorphism parsimony
4691	criteria. Harvey Motulsky worked for me as a programmer in 1971 and wrote
4692	FORTRAN programs to carry out the Camin-Sokal, Dollo, and polymorphism
4693	methods (he is known these days as the author of the scientific
4694	graphing package GraphPad). But most of the early work on PHYLIP other than my own was by Jerry
4695	Shurman and Mark Moehring. Jerry Shurman worked for me in the summers of
4696	1979 and 1980, and Mark Moehring worked for me in the summers of 1980 and
4697	1981. Both wrote original versions of many of the other programs, based on
4698	the original versions of my Camin-Sokal parsimony program and POLYM. These
4699	formed the basis of Version 1 of the Package, first distributed in October,
4700	1980.
4701	<P>
4702	Version 2, released in the spring of 1982, involved a fairly complete rewrite
4703	by me of many of those programs. For version 3.3, Hisashi Horino
4704	reworked some parts of the programs CLIQUE and CONSENSE
4705	to make their output more comprehensible, and added some code to the
4706	tree-drawing programs DRAWGRAM and DRAWTREE as well. He also worked on
4707	some of the DRAWTREE and DRAWGRAM driver code.
4708	<P>
4709	My more recent part-time programmers Akiko Fuseki, Sean Lamont,
4710	Andrew Keeffe, Daniel Yek, Dan Fineman, Patrick Colacurcio,
4711	Mike Palczewski, and Doug Buxton gave
4712	me substantial help with the current release, and their excellent work is
4713	greatly appreciated. Akiko in particular did much of the hard work of adding
4714	new features and changing old ones in the 3.4 and 3.5 releases,
4715	centralized many of the C routines in support files, and is responsible for the
4716	new versions of DNAPARS and PARS. Andrew
4717	prepared the Macintosh version, wrote RETREE, added the ray-tracing
4718	and PICT code to the DRAW programs and has since done much other work. Sean
4719	was central to the conversion to
4720	C, and tested it extensively. My postdoctoral fellow
4721	Mary Kuhner and her associate Jon Yamato created NEIGHBOR, the
4722	neighbor-joining and UPGMA program, for the current release, for which I am
4723	also grateful (Naruya Saitou and Li Jin kindly encouraged us to use some of the
4724	code from their own implementation of this method).
4725	<P>
4726	I am very grateful to over 200
4727	users for algorithmic suggestions, complaints about features (or lack of
4728	features), and information about the behavior of their operating systems
4729	and compilers. A list of some of their names can be found on the credits page
4730	on the PHYLIP web site.
4731	<P>
4732	A major contribution to this package has been made by others
4733	writing programs or parts of programs. Chris Meacham contributed the
4734	important program FACTOR, long demanded by users, and the even more
4735	important ones PLOTREE and PLOTGRAM. Important parts of the code in
4736	DRAWGRAM and DRAWTREE were taken over from those two programs.
4737	Kent Fiala wrote
4738	the function "reroot" to do outgroup-rooting, which was an essential part of many
4739	programs in earlier versions. Someone at the Western Australia Institute of
4740	Technology suggested the name PHYLIP (by writing it on the label on the
4741	outside of a magnetic tape), but they all seem to deny having done
4742	so (and I've lost the relevant letter).
4743	<P>
4744	The distribution of the package also owes much to Buz Wilson and Willem Ellis,
4745	who put a lot of effort into the early distributions of the PCDOS and
4746	Macintosh versions respectively. Christopher Meacham and Tom Duncan for three
4747	versions distributed a printed version of these documentation files (they are no
4748	longer able to do so), and I am
4749	very grateful to them for those efforts. William H.E. Day and F. James Rohlf
4750	have been very helpful in setting up the listserver news bulletin service which
4751	succeeded the PHYLIP newsletter for a time.
4752	<P>
4753	I also wish to thank the people who have made computer resources available to
4754	me, mostly in the loan of use of microcomputers. These include Jeremy
4755	Field, Clem Furlong, Rick Garber, Dan Jacobson, Rochelle Kochin, Monty Slatkin,
4756	Jim Archie, Jim Thomas, and George Gilchrist.
4757	<P>
4758	I should also note the computers used to develop this package.
4759	These include a CDC 6400, two DECSystem 1090s, my trusty old SOL-20, my
4760	old Osborne-1, a VAX 11/780, a VAX 8600, a MicroVAX I, a DECstation
4761	3100, my old Toshiba 1100+, my
4762	DECstation 5000/200, a DECstation 5000/125, a Compudyne 486DX/33, a
4763	Trinity Genesis 386SX, a Zenith Z386, a Mac Classic, a DEC Alphastation 400
4764	4/233, a Pentium 120, a Pentium 200, a PowerMac 6100, and a Macintosh G3.
4765	(One of the reasons
4766	we have been successful in achieving compatibility between different computer
4767	systems is that I have had to run them myself under so many different operating
4768	systems and compilers).
4769	<P>
4770	<A NAME="otherprograms"><HR><P></A>
4771	<DIV ALIGN="CENTER">
4772	<H2>Other Phylogeny Programs Available Elsewhere</H2></DIV>
4773	<P>
4774	A comprehensive list of phylogeny programs is maintained at the PHYLIP
4775	web site on the Phylogeny Programs pages:
4776	<P>
4777	<DIV ALIGN="CENTER">
4778	<FONT SIZE=+2><A HREF="http://evolution.gs.washington.edu/phylip/software.html">
4779	<TT>http://evolution.gs.washington.edu/phylip/software.html</TT></A></FONT></DIV>
4780	<P>
4781	Here we will simply mention some of the major general-purpose programs. For
4782	many more and much more, see those web pages.
4783	<P>
4784	<B>PAUP*</B>   A comprehensive program with parsimony, likelihood, and
4785	distance matrix methods. It competes with PHYLIP to be responsible for
4786	the most trees published. Written by David Swofford and distributed by
4787	Sinauer Associates of Sunderland, Massachusetts.
4788	It is described in web pages for
4789	<A HREF="http://www.sinauer.com/detail.php?id=8060">the Macintosh version,</A>
4790	<A HREF="http://www.sinauer.com/detail.php?id=8079">the Windows version,</A>
4791	and
4792	<A HREF="http://www.sinauer.com/detail.php?id=8044">the Unix/OpenVMS version.</A>
4793	Current prices are $100 for the Macintosh version, $85 for the
4794	Windows version, and $150 for Unix versions for many kinds of workstations.
4795	<P>
4796	<B>MacClade</B>   An interactive Macintosh and PowerMac program to
4797	rearrange trees and watch the changes in the fit of the trees to
4798	data as judged by parsimony. MacClade has a great many features including
4799	a spreadsheet data editor and many different descriptive statistics
4800	for different kinds of data. It is particularly designed to export and
4801	import data to and from PAUP*.
4802	MacClade is available for $100 from Sinauer Associates, of Sunderland,
4803	Massachusetts. It is described in a web page at
4804	<A HREF="http://www.sinauer.com/detail.php?id=4707">
4805	<TT>http://www.sinauer.com/detail.php?id=4707</TT></A>.
4806	MacClade is also described on its <A HREF="http://phylogeny.arizona.edu/macclade/macclade.html">
4807	Web page</A>, at <CODE>http://phylogeny.arizona.edu/macclade/macclade.html</CODE>.
4809	<P>
4810	<B>MEGA</B>   A Windows and DOS program by Sudhir Kumar of Arizona State University
4811	(written together with Koichiro Tamura and Masatoshi Nei while he was a
4812	student in Nei's lab at Pennsylvania
4813	State University). It can carry out parsimony and distance matrix methods
4814	for DNA sequence data. Version 2.1 for Windows
4815	can be downloaded from <A HREF="http://www.megasoftware.net">
4816	the MEGA web site</A>
4817	at <TT>http://www.megasoftware.net</TT>.
4818	<P>
4819	<B>PAML</B>   Ziheng Yang of the Department of Genetics and Biometry at
4820	University College, London has written this package of programs to
4821	carry out likelihood analysis of DNA and protein sequence data. PAML is
4822	particularly strong in the options for coping with variability of rates
4823	of evolution from site to site, though it is less able than some other
4824	packages to search effectively for the best tree. It is available as
4825	C source code and as PowerMac and Windows executables from its web site at
4826	<A HREF="http://abacus.gene.ucl.ac.uk/software/paml.html">
4827	<TT>http://abacus.gene.ucl.ac.uk/software/paml.html</TT></A>.
4828	<P>
4829	<B>TREE-PUZZLE</B>   This package by Korbinian Strimmer and Arndt von Haeseler
4830	was begun when they were at the Universität München in Germany.
4831	TREE-PUZZLE can carry out likelihood
4832	methods for DNA and protein data, searching by the strategy of
4833	"quartet puzzling" which they invented. It can also compute distances.
4834	It superimposes trees estimated
4835	from many quartets of species. TREE-PUZZLE is available for Unix, Macintoshes,
4836	or Windows from their web site at
4837	<A HREF="http://www.tree-puzzle.de/"><TT>http://www.tree-puzzle.de/</TT></A>.
4838	<P>
4839	<B>DAMBE</B>    A package written by Xuhua Xia, then of the
4840	Department of
4841	Ecology and Biodiversity of the University of Hong Kong.
4842	Its initials stand for Data Analysis in Molecular Biology and Evolution.
4843	DAMBE is a general-purpose package for DNA and protein sequence phylogenies.
4844	It can read and
4845	convert a number of file formats, has many features for
4846	descriptive statistics, can compute a number of commonly-used
4847	distance matrix measures, and can infer phylogenies by parsimony, distance,
4848	or likelihood methods, including bootstrapping and jackknifing. A number
4849	of kinds of statistical tests of trees are available, and it
4850	can also display phylogenies. DAMBE includes a copy of ClustalW as well.
4851	DAMBE consists of Windows95 executables. It is available from its
4852	web site at <A HREF="http://web.hku.hk/~xxia/software/software.htm">
4853	<CODE>http://web.hku.hk/~xxia/software/software.htm</CODE></A>.
4854	Xia has now moved to the Department of Biology of the University of Ottawa,
4855	Canada, and I suspect the DAMBE web site will soon follow him there.
4856	<P>
4857	<B>MOLPHY</B>   A package of programs for carrying out likelihood analysis
4858	of DNA and protein data, written by Jun Adachi and Masami Hasegawa of the
4859	Institute of Statistical Mathematics in Tokyo, Japan. The source code
4860	is available from them at
4861	<A HREF="http://www.ism.ac.jp/software/ismlib/softother.e.html">
4862	the MOLPHY web site</A> at
4863	<CODE>http://www.ism.ac.jp/software/ismlib/softother.e.html</CODE>, and
4864	Windows executables are available from Russell Malmberg's web site at
4865	<A HREF="http://dogwood.botany.uga.edu/malmberg/software.html">
4866	<TT>http://dogwood.botany.uga.edu/malmberg/software.html</TT></A>.
4867	<P>
4868	<B>Hennig86</B>   A fast parsimony program by J. S. Farris of the
4869	Naturhistoriska Riksmuseet in Stockholm, Sweden for discrete character
4870	data (it can handle DNA if its states are recoded to be digits).
4871	Reputed to be faster than PAUP*.
4872	The program is distributed as an executable and costs $50, plus $5
4873	mailing costs ($10 outside of the U.S.). The user's name should be stated,
4874	as copies are personalized as a copy-protection measure. It is
4875	distributed by Arnold Kluge, Amphibians and Reptiles, Museum of Zoology,
4876	University of
4877	Michigan, Ann Arbor, Michigan 48109-1079, U.S.A. (<TT>akluge@umich.edu</TT>) and
4878	by Diana Lipscomb at George Washington University (<TT>BIODL@gwuvm.gwu.edu</TT>).
4879	<P>
4880	<B>RnA</B>   J. S. Farris's very fast program which uses parsimony
4881	to carry out jackknifing resampling of DNA sequence data. This would be
4882	nearly equivalent in properties to bootstrapping if the jackknifing were
4883	sampling random halves of the data, but Farris prefers to have each
4884	jackknife sample delete a fraction 1/<I>e</I> of the data, which will give
4885	most groups too much support (he would disagree with this statement).
4886	RnA is available from Arnold Kluge, Amphibians and Reptiles, Museum of Zoology,
4887	University of
4888	Michigan, Ann Arbor, Michigan 48109-1079, U.S.A. (<TT>akluge@umich.edu</TT>)
4889	and Diana Lipscomb
4890	at George Washington University (<TT>BIODL@gwuvm.gwu.edu</TT>) who may be
4891	contacted for details. The cost is about $30 US.
4892	<P>
4893	<B>NONA</B>   Pablo Goloboff, of the Instituto Miguel Lillo in
4894	Tucuman, Argentina has written these very fast parsimony programs, capable
4895	of some relevant forms of weighted parsimony, which can handle either
4896	DNA sequence data or discrete characters. It is available as shareware
4897	from <A HREF="http://www.cladistics.com/aboutNona.htm">
4898	<TT>http://www.cladistics.com/aboutNona.htm</TT></A>.
4899	There is a 30 day free trial, after which
4900	NONA must be purchased separately by sending a check for $40.00
4901	either directly to the author, or to: James M. Carpenter, Attn: NONA,
4902	Division of Invertebrate Zoology, American Museum of Natural History,
4903	Central Park West at 79th Street, New York, NY 10024.
4904	<P>
4905	<B>TNT</B> This program, by Pablo Goloboff, J. S. Farris, and Kevin Nixon,
4906	is for searching large data sets for most parsimonious trees.
4907	The authors are respectively at the Instituto Miguel Lillo in Tucuman,
4908	Argentina, the Naturhistoriska Riksmuseet in Stockholm, Sweden, and the
4909	Hortorium, Cornell University, Ithaca, New York.
4910	TNT is described
4911	as faster than other methods, though not faster than NONA for small to
4912	medium data sets. Its distribution status is somewhat uncertain. The site
4913	<A HREF="http://www.cladistics.com/aboutTNT.html">
4914	<TT>http://www.cladistics.com/aboutTNT.html</TT></A>
4915	describes it as unavailable,
4916	while the web site <A HREF="http://www.cladistics.com/webtnt.html">
4917	<TT>http://www.cladistics.com/webtnt.html</TT></A> makes a beta version
4918	available for download. The program downloaded is free but needs a password to
4919	function, which the user should obtain from Pablo Goloboff (see the latter
4920	web page for details).
4921	<P>
4922	These are only a few of the more than 194 different phylogeny packages that
4923	are now available (as of January, 2001 - the number keeps increasing). The
4924	others are described (and web links and ftp addresses provided) at my
4925	Phylogeny Programs web pages at the address given above.
4926	<P>
4927	<A NAME="helpme"><HR><P></A>
4928	<DIV ALIGN="CENTER">
4929	<H2>How You Can Help Me</H2></DIV>
4930	<P>
4931	Simply let me know of any problems you have had adapting the
4932	programs to your computer. I can often make "transparent" changes that, by
4933	making the code avoid the wilder, woolier, and less standard parts of
4934	C, not only help others who have your machine but even improve the
4935	chance of the programs functioning on new machines. I would like fairly
4936	detailed information on what gave trouble, on what operating system,
4937	machine, and (if relevant) compiler, and what had to be done to make the
4938	programs work. I am sometimes able to do some over-the-telephone
4939	trouble-shooting, particularly
4940	if I don't have to pay for the call, but electronic mail is the best
4941	way for me to be asked about problems, as you can include your
4942	input and output files so I can see what is going on (please do <EM>not</EM>
4943	send them as Attachments, but as part of the body of a message). I'd really
4944	like these programs to be
4945	able to run with only routine changes on <I>absolutely everything</I>, down to
4946	and possibly including the Amana Touchmatic Radarange Microwave Oven
4947	which was an Intel 8080 system (in fact, early versions of this package did
4948	run successfully on Intel 8080 systems running the CP/M operating system).
4949	A PalmPilot version is contemplated too.
4950	<P>
4951	I would also like to know timings of programs from the package, when
4952	run on the three test input files provided above, for various computer and
4953	compiler combinations, so that I can provide this information in the
4954	section on speeds of this document.
4955	<P>
4956	For the phylogeny plotting programs DRAWGRAM and DRAWTREE,
4957	I am particularly interested in knowing what has to be done
4958	to adapt them for other graphic file formats.
4959	<P>
4960	You can also be helpful to PHYLIP users in your part of the world by
4961	helping them get the latest version of PHYLIP from our web site
4962	and by helping them with any
4963	problems they may have in getting PHYLIP working on their data.
4964	<P>
4965	Your help is appreciated. I am always happy to hear suggestions
4966	for features and programs that ought to be incorporated in the package,
4967	but please do not be upset if I turn out to have already considered the
4968	particular possibility you suggest and decided against it.
4969	<P>
4970	<A NAME="trouble"><HR><P></A>
4971	<DIV ALIGN="CENTER">
4972	<H2>In Case of Trouble</H2></DIV>
4973	<P>
4974	<I>Read The (documentation) Files Meticulously</I> ("RTFM"). If that doesn't solve the
4975	problem, please check the Frequently Asked Questions web page at the
4976	PHYLIP web site:
4977	<P>
4978	<FONT SIZE=+2>
4979	<TT><A HREF="http://evolution.gs.washington.edu/phylip/faq.html">
4980	http://evolution.gs.washington.edu/phylip/faq.html</A></TT></FONT>
4981	<P>
4982	and the PHYLIP Bugs web page at that site:
4983	<P>
4984	<FONT SIZE=+2>
4985	<TT><A HREF="http://evolution.gs.washington.edu/phylip/bugs.html">
4986	http://evolution.gs.washington.edu/phylip/bugs.html</A></TT></FONT>
4987	<P>
4988	If none of these answers your question, get in touch with me. My electronic mail address
4989	is given below. If you do ask about a problem, please specify the program
4990	name, version of the package, computer operating system, and
4991	send me your data file so I can test the problem. Do <I>not</I>
4992	send your data file as an e-mail Attachment but instead
4993	as the body of a message. I read the e-mail on a Unix system, which makes
4994	it impossible to read some formats of attachments without
4995	running around to other machines and moving the files there. This
4996	is one of my least favorite activities, so please do not use attachments.
4997	Also it will help if you
4998	have the relevant output and documentation files so that you
4999	can refer to them in any correspondence. I can also be reached by telephone
5000	by calling me in my office:
5001	+1-(206)-543-0150, or at home: +1-(206)-526-9057 (how's <I>that</I> for user
5002	support!). If I cannot be reached at either place, a message can be left at
5003	the office of
5004	the Department of Genome Sciences, (206)-221-7377, but I strongly prefer that I not
5005	call you, as in any phone consultation the least you can do is pay the phone
5006	bill. Better yet, use electronic mail.
5007	<P>
5008	Particularly if you are in a part of the world distant from me, you may also
5009	want to try to get in touch with other users of PHYLIP nearby. I can also,
5010	if requested, provide a list of nearby users.
5011	<P>
5012	<DIV ALIGN="RIGHT">
5013	<TABLE><TR><TD ALIGN=LEFT>
5014	Joe Felsenstein<BR>
5015	Department of Genome Sciences<BR>
5016	University of Washington<BR>
5017	Box 357730<BR>
5018	Seattle, Washington 98195-7730, U.S.A.
5019	</TD></TR></TABLE>
5020	</DIV>
5021	<P>
5022	Electronic mail address:      <TT>joe@gs.washington.edu</TT>
5023	<BR><HR>
5024	</BODY>
5025	</HTML>
