source: tags/arb-6.0/GDE/PHYLIP/doc/kitsch.html

Last change on this file was 2176, checked in by westram, 20 years ago

1<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 3.2 Final//EN">
2<HTML>
3<HEAD>
4<TITLE>kitsch</TITLE>
5<META NAME="description" CONTENT="kitsch">
6<META NAME="keywords" CONTENT="kitsch">
7<META NAME="resource-type" CONTENT="document">
8<META NAME="distribution" CONTENT="global">
9<META HTTP-EQUIV="Content-Type" CONTENT="text/html; charset=iso-8859-1">
10</HEAD>
11<BODY BGCOLOR="#ccffff">
12<DIV ALIGN=RIGHT>
13version 3.6
14</DIV>
15<P>
16<DIV ALIGN=CENTER>
17<H1>KITSCH -- Fitch-Margoliash and Least Squares Methods<BR>
18with Evolutionary Clock</H1>
19</DIV>
20<P>
21&#169; Copyright 1986-2002 by the University of
22Washington.  Written by Joseph Felsenstein.  Permission is granted to copy
23this document provided that no fee is charged for it and that this copyright
24notice is not removed.
25<P>
26This program carries out the Fitch-Margoliash and Least Squares methods,
27plus a variety of others of the same family, with the assumption that all
28tip species are contemporaneous, and that there is an evolutionary clock
29(in effect, a molecular clock).  This means that branches of the tree cannot
30be of arbitrary length, but are constrained so that the total
31length from the root of
32the tree to any species is the same.  The quantity minimized is the same
33weighted sum of squares described in the Distance Matrix Methods documentation
34file.
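<P>
As a rough illustration of the quantity being minimized, the sketch below computes
the weighted sum of squares for a given power <EM>P</EM>.  It assumes a square
matrix and sums over every ordered pair of distinct species, as the double sum in
the printed output suggests; the function name and the three-species numbers are
made up for illustration and are not part of KITSCH.

```python
def weighted_sum_of_squares(observed, expected, power=2.0):
    """Sum over all pairs i != j of (obs - exp)^2 / obs^power.

    Assumes every off-diagonal observed distance is positive,
    otherwise the weight 1 / obs^power is undefined.
    """
    total = 0.0
    n = len(observed)
    for i in range(n):
        for j in range(n):
            if i != j:
                diff = observed[i][j] - expected[i][j]
                total += diff * diff / observed[i][j] ** power
    return total


# Hypothetical three-species data (not the test data set).
observed = [[0.00, 0.30, 0.50],
            [0.30, 0.00, 0.56],
            [0.50, 0.56, 0.00]]
# Distances implied by a clocklike tree ((A,B),C) with node
# heights 0.15 (A,B) and 0.265 (root): 2 * height for each pair.
expected = [[0.00, 0.30, 0.53],
            [0.30, 0.00, 0.53],
            [0.53, 0.53, 0.00]]

ss = weighted_sum_of_squares(observed, expected, power=2.0)   # Fitch-Margoliash weighting
ss0 = weighted_sum_of_squares(observed, expected, power=0.0)  # unweighted least squares
```

<P>
With <EM>P = 0.0</EM> every pair gets the same weight, while with the default
<EM>P = 2.0</EM> errors on small distances count for more than the same absolute
error on large ones.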
35<P>
36The options are set using the menu:
37<P>
38<TABLE><TR><TD BGCOLOR=white>
39<PRE>
40
41Fitch-Margoliash method with contemporary tips, version 3.6a3
42
43Settings for this run:
44  D      Method (F-M, Minimum Evolution)?  Fitch-Margoliash
45  U                 Search for best tree?  Yes
46  P                                Power?  2.00000
47  -      Negative branch lengths allowed?  No
48  L         Lower-triangular data matrix?  No
49  R         Upper-triangular data matrix?  No
50  S                        Subreplicates?  No
51  J     Randomize input order of species?  No. Use input order
52  M           Analyze multiple data sets?  No
53  0   Terminal type (IBM PC, ANSI, none)?  (none)
54  1    Print out the data at start of run  No
55  2  Print indications of progress of run  Yes
56  3                        Print out tree  Yes
57  4       Write out trees onto tree file?  Yes
58
59  Y to accept these or type the letter for one to change
60
61</PRE>
62</TD></TR></TABLE>
63<P>
64Most of the options are described in the Distance Matrix Programs documentation
65file.
66<P>
The D (methods) option allows choice between the Fitch-Margoliash
criterion and the Minimum Evolution method (Kidd and Sgaramella-Zonta, 1971;
Rzhetsky and Nei, 1993).  Minimum Evolution (not to be confused with
parsimony) uses the Fitch-Margoliash criterion to fit branch lengths to each
topology, but then chooses topologies based on their total branch length
(rather than the goodness-of-fit sum of squares).  There is no
constraint on negative branch lengths in the Minimum Evolution method;
it sometimes gives rather strange results, because it can favor solutions
that have large negative branch lengths, as these reduce the total
sum of branch lengths!
77<P>
78Note that the User Trees (used by option U) must be
79rooted trees (with a bifurcation at their base).  If you take a user
80tree from FITCH and try to evaluate it in KITSCH, it must first be
81rooted.  This can be done using RETREE.  Of the options
82available in FITCH, the O option is
83not available, as KITSCH estimates a rooted tree which cannot be
84rerooted, and the G option is not
85available, as global rearrangement is the default condition anyway.  It
86is also not possible to specify that specific branch lengths of a user tree
87be retained when it is read into KITSCH, unless all of them are present.  In
88that case the tree should be properly clocklike.  Readers who wonder why
89we have not provided the feature of holding some of the user tree branch
90lengths constant while iterating others are invited to tell us how they
91would do it.  As you consider particular possible patterns of branch
92lengths you will find that the matter is not at all simple.
93<P>
If you use a User Tree (option U) with branch lengths with KITSCH, and the
tree is not clocklike, when two branch lengths give conflicting positions
for a node, KITSCH will use the first of them and ignore the other.  Thus
the user tree:
<P>
<PRE>
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;((A:0.1,B:0.2):0.4,(C:0.06,D:0.01):43);
</PRE>
<P>
is nonclocklike, so it will be treated as if it were actually the tree:
<P>
<PRE>
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;((A:0.1,B:0.1):0.4,(C:0.06,D:0.06):0.44);
</PRE>
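<P>
Whether a user tree is clocklike can be checked by comparing root-to-tip path
lengths.  The sketch below is only an illustration: it encodes the two trees above
by hand as nested (subtree, length) pairs rather than parsing the tree file, and
it takes the last branch of the adjusted tree to be 0.44, the value that makes
every path equal 0.5 (an assumption on our part).  The names
<CODE>tip_depths</CODE> and <CODE>is_clocklike</CODE> are hypothetical.

```python
import math


def tip_depths(node, depth=0.0, out=None):
    """Collect root-to-tip path lengths from nested (subtree, length) pairs."""
    if out is None:
        out = {}
    subtree, length = node
    depth += length
    if isinstance(subtree, str):
        out[subtree] = depth          # a tip: subtree is the species name
    else:
        for child in subtree:         # an interior node: subtree is a list of children
            tip_depths(child, depth, out)
    return out


def is_clocklike(root, tol=1e-9):
    """True if all tips lie at the same distance from the root."""
    depths = list(tip_depths(root).values())
    return math.isclose(max(depths), min(depths), abs_tol=tol)


# ((A:0.1,B:0.2):0.4,(C:0.06,D:0.01):43);  the root itself has length 0
user_tree = ([([("A", 0.1), ("B", 0.2)], 0.4),
              ([("C", 0.06), ("D", 0.01)], 43.0)], 0.0)

# The same tree after the first branch length at each node is kept,
# with the last branch assumed to be 0.44.
adjusted_tree = ([([("A", 0.1), ("B", 0.1)], 0.4),
                  ([("C", 0.06), ("D", 0.06)], 0.44)], 0.0)

user_ok = is_clocklike(user_tree)
adjusted_ok = is_clocklike(adjusted_tree)
```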
108<P>
109The input is exactly the same as described in the Distance Matrix Methods
110documentation file.  The output is a rooted tree, together with the sum of
111squares, the number of tree topologies searched, and, if the power P is at
112its default value of 2.0, the Average Percent Standard Deviation is also
113supplied.  The lengths of the branches of the tree are given in a table,
114that also shows for each branch the time at the upper end of the
115branch.  "Time" here really means cumulative branch length from the root, going
116upwards (on the printed diagram, rightwards).  For each branch, the
117"time" given is for the node at the right (upper) end of the branch.   It
118is important to realize that the branch lengths are not exactly proportional to
119the lengths drawn on the printed tree diagram!  In particular, short
120branches are exaggerated in the length on that diagram so that they are
121more visible.
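<P>
The relation between the two columns of the branch table can be reproduced by
accumulating lengths from the root.  The sketch below does this for the table
in the test set output at the end of this document; the function name
<CODE>node_heights</CODE> is hypothetical, and the results agree with the
printed column only up to the rounding of the printed lengths.

```python
def node_heights(branches, root="1"):
    """Cumulative branch length from the root to the far end of each branch."""
    children = {}
    for parent, child, length in branches:
        children.setdefault(parent, []).append((child, length))
    heights = {root: 0.0}
    stack = [root]
    while stack:
        parent = stack.pop()
        for child, length in children.get(parent, []):
            heights[child] = heights[parent] + length
            stack.append(child)
    return heights


# (from, to, length) rows copied from the test set output table.
branches = [
    ("6", "Human", 0.13460), ("5", "6", 0.02836), ("6", "Chimp", 0.13460),
    ("4", "5", 0.07638), ("5", "Gorilla", 0.16296), ("3", "4", 0.06639),
    ("4", "Orang", 0.23933), ("2", "3", 0.42923), ("3", "Gibbon", 0.30572),
    ("1", "2", 0.07790), ("2", "Mouse", 0.73495), ("1", "Bovine", 0.81285),
]

heights = node_heights(branches)
tips = ["Human", "Chimp", "Gorilla", "Orang", "Gibbon", "Mouse", "Bovine"]
tip_heights = [heights[name] for name in tips]
```

<P>
Because of the clock constraint, every tip ends up at essentially the same
height, which is the value printed for all seven species in the table.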
122<P>
The method may be considered as providing an estimate of the
phylogeny.  Alternatively, it can be considered as a phenetic clustering of
the tip species.  This method minimizes an objective function, the sum of
squares,
not only setting the levels of the clusters so as to do so, but rearranging
the hierarchy of clusters to try to find alternative clusterings that
give a lower overall sum of squares.  When the power option P is set to a
value of <EM>P = 0.0</EM>, so that we are minimizing a simple sum of squares
of the differences between the observed distance matrix and the expected one,
the method is very close in spirit to the Unweighted Pair Group Method with
Arithmetic Mean (UPGMA), also called Average-Linkage Clustering.  If the topology of
the tree is fixed and there turn out to be no branches of negative length, its
result should be the same as UPGMA in that case.  But since it tries
alternative topologies and (unless
the - option, which allows negative branch lengths, is set) it combines nodes
that otherwise could result in a reversal
of levels, it is possible for it to give a different, and better, result than
simple sequential clustering.  Of course UPGMA itself is available as an
option in program NEIGHBOR.
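<P>
For comparison, here is a minimal sketch of textbook UPGMA, the simple sequential
clustering referred to above.  It is not the code used by NEIGHBOR or KITSCH; the
function name <CODE>upgma</CODE> and its return format are our own, and the
matrix is the test data set given at the end of this document.

```python
from itertools import combinations


def upgma(names, dist):
    """Return the merges made by UPGMA as (members_a, members_b, height)."""
    clusters = {i: [names[i]] for i in range(len(names))}
    d = {frozenset((i, j)): dist[i][j]
         for i, j in combinations(range(len(names)), 2)}
    merges = []
    next_id = len(names)
    while len(clusters) > 1:
        # Join the two clusters with the smallest average distance.
        a, b = min(combinations(sorted(clusters), 2),
                   key=lambda pair: d[frozenset(pair)])
        height = d[frozenset((a, b))] / 2.0
        size_a, size_b = len(clusters[a]), len(clusters[b])
        # Distance from the new cluster to each other cluster is the
        # size-weighted mean, i.e. the average over all member pairs.
        for c in clusters:
            if c not in (a, b):
                d[frozenset((next_id, c))] = (
                    size_a * d[frozenset((a, c))]
                    + size_b * d[frozenset((b, c))]) / (size_a + size_b)
        merges.append((clusters[a], clusters[b], height))
        merged = clusters[a] + clusters[b]
        del clusters[a], clusters[b]
        clusters[next_id] = merged
        next_id += 1
    return merges


names = ["Bovine", "Mouse", "Gibbon", "Orang", "Gorilla", "Chimp", "Human"]
dist = [
    [0.0000, 1.6866, 1.7198, 1.6606, 1.5243, 1.6043, 1.5905],
    [1.6866, 0.0000, 1.5232, 1.4841, 1.4465, 1.4389, 1.4629],
    [1.7198, 1.5232, 0.0000, 0.7115, 0.5958, 0.6179, 0.5583],
    [1.6606, 1.4841, 0.7115, 0.0000, 0.4631, 0.5061, 0.4710],
    [1.5243, 1.4465, 0.5958, 0.4631, 0.0000, 0.3484, 0.3083],
    [1.6043, 1.4389, 0.6179, 0.5061, 0.3484, 0.0000, 0.2692],
    [1.5905, 1.4629, 0.5583, 0.4710, 0.3083, 0.2692, 0.0000],
]
merges = upgma(names, dist)
```

<P>
On this data set the first join is Chimp with Human at half their distance,
0.1346.  Later joins need not have the same heights as the KITSCH fit, since the
test run shown below uses the default power of 2.0 rather than 0.0.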
141<P>
The U (User Tree) option requires a bifurcating tree, unlike FITCH, which
requires an unrooted tree with a trifurcation at its base.  Thus a rooted
tree of five species might be written:
145<P>
146&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;((D,E),(C,(A,B)));
147<P>
148If a tree with a trifurcation at the base is by mistake fed into the U option
149of KITSCH then some of its species (the entire rightmost furc, in fact) will be
150ignored and too small a tree read in.  This should result in an error message
151and the program should stop.  It is important to understand the
152difference between the User Tree formats for KITSCH and FITCH.  You may want
153to use RETREE to convert a user tree that is suitable for FITCH into one
154suitable for KITSCH or vice versa.
155<P>
156An important use of this method will be to do a formal statistical test of
157the evolutionary clock hypothesis.  This can be done by comparing the sums
158of squares achieved by FITCH and by KITSCH, BUT SOME CAVEATS ARE
159NECESSARY.  First, the assumption is that the observed distances are truly
160independent, that no original data item contributes to more than one of them
161(not counting the two reciprocal distances from i to j and from j to i).  THIS
162WILL NOT HOLD IF THE DISTANCES ARE OBTAINED FROM GENE FREQUENCIES, FROM
163MORPHOLOGICAL CHARACTERS, OR FROM MOLECULAR SEQUENCES.  It may be invalid even
164for immunological distances and levels of DNA hybridization, provided that the
165use of common standard for all members of a row or column allows an error in
166the measurement of the standard to affect all these distances
167simultaneously.  It will also be invalid if the numbers have been collected in
168experimental groups, each measured by taking differences from a common standard
169which itself is measured with error.  Only if the numbers in different cells
170are measured from independent standards can we depend on the statistical
171model.  The details of the test and the assumptions are discussed in my review
172paper on distance methods (Felsenstein, 1984a).  For further and sometimes
173irrelevant controversy on these matters see the papers by Farris (1981,
1741985, 1986) and myself (Felsenstein, 1986, 1988b).
175<P>
A second caveat is that the distances must be expected to rise linearly with
time, not according to any other curve.  Thus it may be necessary to transform
the distances to achieve an expected linearity.  If the distances have an upper
limit beyond which they could not go, this is a signal that linearity may
not hold.  It is also VERY important to choose the power <EM>P</EM> at a value
that results in the standard deviation of the variation of the observed from the
expected distances being proportional to the <EM>P/2</EM>-th power of the
expected distance.
183<P>
184To carry out the test, fit the same data with both FITCH and KITSCH,
185and record the two sums of squares.  If the topology has turned out the
186same, we have <EM>N = n(n-1)/2</EM> distances which have been fit with
187<EM>2n-3</EM>
188parameters in FITCH, and with <EM>n-1</EM> parameters in KITSCH.  Then the
189difference between <EM>S(K)</EM> and <EM>S(F)</EM> has <EM>d<SUB>1</SUB> = n-2</EM>
190degrees of freedom.  It is
191statistically independent of the value of <EM>S(F)</EM>, which has
192<EM>d<SUB>2</SUB> = N-(2n-3)</EM>
193degrees of freedom.  The ratio of mean squares
194<P>
195<PRE>
196&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; [S(K)-S(F)]/d<SUB>1</SUB>
197&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;----------------
198&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;     S(F)/d<SUB>2</SUB>
199</PRE>
200<P>
should, under the
evolutionary clock, have an F distribution with <EM>n-2</EM> and
<EM>N-(2n-3)</EM> degrees of
freedom respectively.  The clock hypothesis is rejected if the F ratio falls in
the upper tail (say the upper 5%) of this distribution.  If the S (subreplication)
option is in
effect, the above degrees of freedom must be modified by noting that
<EM>N</EM> is not <EM>n(n-1)/2</EM> but is the sum of the numbers of replicates of all
cells in the distance matrix read in, which may be either square or
triangular.  A further explanation of the
statistical test of the clock is given in a paper of mine (Felsenstein, 1986).
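<P>
The arithmetic of the test can be sketched as follows.  The function name
<CODE>clock_f_ratio</CODE> is hypothetical, and the FITCH sum of squares used in
the example is a made-up number chosen only to show the calculation; it is not a
result from the test data.

```python
def clock_f_ratio(ss_kitsch, ss_fitch, n_species, n_distances=None):
    """Return (F, d1, d2) for the clock test described above.

    n_distances defaults to n(n-1)/2; with subreplication it should be
    the total number of replicates over all cells of the matrix read in.
    """
    if n_distances is None:
        n_distances = n_species * (n_species - 1) // 2
    d1 = n_species - 2                          # (2n-3) - (n-1)
    d2 = n_distances - (2 * n_species - 3)      # residual d.f. of the FITCH fit
    f_ratio = ((ss_kitsch - ss_fitch) / d1) / (ss_fitch / d2)
    return f_ratio, d1, d2


# 7 species; S(K) = 0.107 is the KITSCH value in the test output,
# S(F) = 0.05 is a hypothetical FITCH value for the same topology.
f_ratio, d1, d2 = clock_f_ratio(ss_kitsch=0.107, ss_fitch=0.05, n_species=7)
```

<P>
The resulting ratio is then compared with the upper tail of the F distribution
with <EM>d<SUB>1</SUB></EM> and <EM>d<SUB>2</SUB></EM> degrees of freedom, for
example from a printed table, or with <CODE>scipy.stats.f.sf(f_ratio, d1, d2)</CODE>
if SciPy is available.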
212<P>
213The program uses a similar tree construction method to the other programs
214in the package and, like them, is not guaranteed to give the best-fitting
215tree.  The assignment of the branch lengths for a given topology is a
216least squares fit, subject to the constraints against negative branch lengths,
217and should not be able to be improved upon.  KITSCH runs more quickly than
218FITCH.
219<P>
The constant
available for modification at the beginning of the program is
"epsilon", which defines a small quantity needed in
some of the calculations.  There is no feature for saving multiple trees
tied for best,
because exact ties are not expected, except in cases where it should be
obvious from the tree printed out what is the nature of the tie (as when an
interior branch is of length zero).
228<P>
229<HR>
230<P>
231<H3>TEST DATA SET</H3>
232<P>
233<TABLE><TR><TD BGCOLOR=white>
234<PRE>
235    7
236Bovine      0.0000  1.6866  1.7198  1.6606  1.5243  1.6043  1.5905
237Mouse       1.6866  0.0000  1.5232  1.4841  1.4465  1.4389  1.4629
238Gibbon      1.7198  1.5232  0.0000  0.7115  0.5958  0.6179  0.5583
239Orang       1.6606  1.4841  0.7115  0.0000  0.4631  0.5061  0.4710
240Gorilla     1.5243  1.4465  0.5958  0.4631  0.0000  0.3484  0.3083
241Chimp       1.6043  1.4389  0.6179  0.5061  0.3484  0.0000  0.2692
242Human       1.5905  1.4629  0.5583  0.4710  0.3083  0.2692  0.0000
243</PRE>
244</TD></TR></TABLE>
245<P>
246<HR>
247<P>
248<H3>TEST SET OUTPUT FILE (with all numerical options on)</H3>
249<P>
250<TABLE><TR><TD BGCOLOR=white>
251<PRE>
252
253   7 Populations
254
255Fitch-Margoliash method with contemporary tips, version 3.6a3
256
257                  __ __             2
258                  \  \   (Obs - Exp)
259Sum of squares =  /_ /_  ------------
260                                2
261                   i  j      Obs
262
263negative branch lengths not allowed
264
265
266Name                       Distances
267----                       ---------
268
269Bovine        0.00000   1.68660   1.71980   1.66060   1.52430   1.60430
270              1.59050
271Mouse         1.68660   0.00000   1.52320   1.48410   1.44650   1.43890
272              1.46290
273Gibbon        1.71980   1.52320   0.00000   0.71150   0.59580   0.61790
274              0.55830
275Orang         1.66060   1.48410   0.71150   0.00000   0.46310   0.50610
276              0.47100
277Gorilla       1.52430   1.44650   0.59580   0.46310   0.00000   0.34840
278              0.30830
279Chimp         1.60430   1.43890   0.61790   0.50610   0.34840   0.00000
280              0.26920
281Human         1.59050   1.46290   0.55830   0.47100   0.30830   0.26920
282              0.00000
283
284
285                                           +-------Human     
286                                         +-6
287                                    +----5 +-------Chimp     
288                                    !    !
289                                +---4    +---------Gorilla   
290                                !   !
291       +------------------------3   +--------------Orang     
292       !                        !
293  +----2                        +------------------Gibbon   
294  !    !
295--1    +-------------------------------------------Mouse     
296  !
297  +------------------------------------------------Bovine   
298
299
300Sum of squares =      0.107
301
302Average percent standard deviation =   5.16213
303
304From     To            Length          Height
305----     --            ------          ------
306
307   6   Human           0.13460         0.81285
308   5      6            0.02836         0.67825
309   6   Chimp           0.13460         0.81285
310   4      5            0.07638         0.64990
311   5   Gorilla         0.16296         0.81285
312   3      4            0.06639         0.57352
313   4   Orang           0.23933         0.81285
314   2      3            0.42923         0.50713
315   3   Gibbon          0.30572         0.81285
316   1      2            0.07790         0.07790
317   2   Mouse           0.73495         0.81285
318   1   Bovine          0.81285         0.81285
319
320</PRE>
321</TD></TR></TABLE>
322</BODY>
323</HTML>