source: trunk/GDE/PHYLIP/doc/dnainvar.html

Last change on this file was 2176, checked in by westram, 21 years ago

* empty log message *

  • Property svn:eol-style set to native
  • Property svn:keywords set to Author Date Id Revision
File size: 16.2 KB
Line 
1<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 3.2 Final//EN">
2<HTML>
3<HEAD>
4<TITLE>dnainvar</TITLE>
5<META NAME="description" CONTENT="dnainvar">
6<META NAME="keywords" CONTENT="dnainvar">
7<META NAME="resource-type" CONTENT="document">
8<META NAME="distribution" CONTENT="global">
9<META HTTP-EQUIV="Content-Type" CONTENT="text/html; charset=iso-8859-1">
10</HEAD>
11<BODY BGCOLOR="#ccffff">
12<DIV ALIGN=RIGHT>
13version 3.6
14</DIV>
15<P>
16<DIV ALIGN=CENTER>
17<H1>DNAINVAR -- Program to compute Lake's and Cavender's<BR>
18phylogenetic invariants from nucleotide sequences</H1>
19</DIV>
20<P>
21&#169; Copyright 1986-2002 by the University of Washington.
22Written by Joseph Felsenstein.  Permission is granted to copy
23this document provided that no fee is charged for it and that this copyright
24notice is not removed.
25<P>
26This program reads in nucleotide sequences for four species and computes
27the phylogenetic invariants discovered by James Cavender (Cavender and
28Felsenstein, 1987) and James Lake (1987).  Lake's method is also
29called by him "evolutionary parsimony".  I prefer Cavender's more
30mathematically precise term "invariants", as the method bears somewhat
31more relationship to likelihood methods than to parsimony.  The invariants
32are mathematical
33formulas (in the present case linear or quadratic) in the EXPECTED
34frequencies of site patterns which are zero for all trees of a given
35tree topology, irrespective of branch lengths.  We can consider at a given
36site that if there
37are no ambiguities, we could have for four species the nucleotide patterns
38(considering the same site across all four species) AAAA, AAAC, AAAG, ...
39through TTTT, 256 patterns in all.
40<P>
41The invariants are formulas in the expected pattern frequencies, not
42the observed pattern frequencies.  When they are computed using the
43observed pattern frequencies, we will
44usually find that they are not precisely zero even when the model is correct
45and we have the correct tree topology.  Only as the number of nucleotides
46scored becomes infinite will the observed pattern frequencies approach their
47expectations; otherwise, we must do a statistical test of the invariants.
48<P>
49Some explanation of invariants will be found in the above papers, and also
50in my recent review article on statistical aspects of inferring phylogenies
51(Felsenstein, 1988b).  Although invariants have some important advantages,
52their validity also depends on symmetry assumptions that may not be satisfied.
53In the discussion below suppose that the possible unrooted phylogenies are
54I: ((A,B),(C,D)),  II: ((A,C),(B,D)), and III: ((A,D),(B,C)).
55<P>
56<H2>Lake's Invariants, Their Testing and Assumptions</H2>
57<P>
58Lake's invariants are fairly simple to describe: the patterns involved are
59only those in which there are two purines and two pyrimidines at a site.
60Thus a site with AACT would affect the invariants, but a site with AAGG would
61not.  Let us use (as Lake does)
62the symbols 1, 2, 3, and 4, with the proviso that 1 and 2
63are either both of the purines or both of the pyrimidines; 3 and 4 are the
64other two nucleotides.  Thus 1 and 2 always differ by a transition; so do
653 and 4.  Lake's invariants, expressed in terms of expected frequencies, are
66the three quantities:
67<P>
68(1)&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;         P(1133) + P(1234) - P(1134) - P(1233),
69<P>
70(2)&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;         P(1313) + P(1324) - P(1314) - P(1323),
71<P>
72(3)&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;         P(1331) + P(1342) - P(1341) - P(1332),
73<P>
74He showed that invariants (2) and (3) are zero under Topology I, (1) and (3)
75are zero under topology II, and (1) and (2) are zero under Topology III.  If,
76for example, we see a site with pattern ACGC, we can start by setting 1=A.
77Then 2 must be G.  We can then set 3=C (so that 4 is T).  Thus its pattern
78type, making those substitutions, is 1323.  P(1323) is the expected
79probability of the type of pattern which includes ACGC, TGAG, GTAT, etc.
80<P>
81Lake's invariants are easily tested with observed frequencies.  For example,
82the first of them is a test of whether there are as many sites of types 1133
83and 1234 as there are of types 1134 and 1233; this is easily tested with a
84chi-square test or, as in this program, with an exact binomial test.  Note
85that with several invariants to test, we risk overestimating the significance
86of results if we simply accept the nominal 95% levels of significance
87(Li and Guoy, 1990).
88<P>
89Lake's invariants assume that each site is evolving independently, and that
90starting from any base a transversion is equally likely to end up at each of
91the two possible bases (thus, an A undergoing a transversion is equally likely
92to end up as a C or a T, and similarly for the other four bases from which one
93could start.  Interestingly, Lake's results do not assume that rates of
94evolution are the same at all sites.  The result that the total of 1133 and 1234
95is expected to be the same as the total of 1134 and 1233 is unaffected by the
96fact that we may have aggregated the counts over classes of sites evolving at
97different rates.
98<P>
99<H2>Cavender's Invariants, Their Testing and Assumptions</H2>
100<P>
101Cavender's invariants (Cavender and Felsenstein, 1987) are for the case of
102a character with two states.  In the nucleic acid case we can classify
103nucleotides into two states, R and Y (Purine and Pyrimidine) and then use the
104two-state results.  Cavender starts, as before, with the pattern frequencies.
105Coding purines as R and pyrimidines as Y, the patterns types are RRRR, RRRY,
106and so on until YYYY, a total of 16 types.  Cavender found quadratic functions
107of the expected frequencies of these 16 types that were expected to be zero
108under a given phylogeny, irrespective of branch lengths.  Two invariants
109(called K and L) were found for each tree topology.  The L invariants are
110particularly easy to understand.  If we have the tree topology ((A,B),(C,D)),
111then in the case of two symmetric states, the event that A and B have the same
112state should be independent of whether C and D have the same state, as the
113events determining these happen in different parts of the tree.  We can set
114up a contingency table:
115<P>
116<PRE>
117                                 C = D         C =/= D
118                           ------------------------------
119                          |
120                   A = B  |   YYYY, YYRR,     YYYR, YYRY,
121                          |   RRRR, RRYY      RRYR, RRRY
122                          |
123                 A =/= B  |   YRYY, YRRR,     YRYR, YRRY,
124                          |   RYYY, RYRR      RYYR, RYRY
125</PRE>
126<P>
127and we expect that the events C = D and A = B will be independent.  Cavender's
128L invariant for this tree topology is simply the negative of the crossproduct
129difference,
130<P>
131&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; P(A=/=B and C=D) P(A=B and C=/=D) - P(A=B and C=D) P(A=/=B and C=/=D).
132<P>
133One of these L invariants is defined for each of the three tree topologies. 
134They can obviously be tested simply by doing a chi-square test on the
135contingency table.  The one corresponding to the correct topology should be
136statistically indistinguishable from zero.  Again, there is a possible
137multiple tests problem if all three are tested at a nominal value of 95%.
138<P>
139The K invariants are differences between the L invariants.  When one of the
140tables is expected to have crossproduct difference zero, the other two are
141expected to be nonzero, and also to be equal.  So the difference of their
142crossproduct differences can be taken; this is the K invariant.  It is not
143so easily tested.
144<P>
145The assumptions of Cavender's invariants are different from those of
146Lake's.  One obviously need not assume anything about the frequencies of, or
147transitions among, the two different purines or the two different
148pyrimidines.  However one does need to assume independent events at each site,
149and one needs to assume that the Y and R states are symmetric, that the
150probability per unit time that a Y changes into an R is the same as the
151probability that an R changes into a Y, so that we expect equal frequencies
152of the two states.  There is also an assumption that all sites are changing
153between these two states at the same expected rate.  This assumption is not
154needed for Lake's invariants, since expectations of sums are equal to
155sums of expectations, but for Cavender's it is, since products of expectations
156are not equal to expectations of products.
157<P>
158It is helpful to have both sorts of invariants available; with further work we
159may appreciate what other invaraints there are for various models of nucleic
160acid change.
161<P>
162<H2>INPUT FORMAT</H2>
163<P>
164The input data for DNAINVAR is standard.  The first line of the input file
165contains the
166number of species (which must always be 4 for this version of DNAINVAR)
167and the number of sites.
168<P>
169Next come the species data.  Each
170sequence starts on a new line, has a ten-character species name
171that must be blank-filled to be of that length, followed immediately
172by the species data in the one-letter code.  The sequences must either
173be in the "interleaved" or "sequential" formats
174described in the Molecular Sequence Programs document.  The I option
175selects between them.  The sequences can have internal
176blanks in the sequence but there must be no extra blanks at the end of the
177terminated line.  Note that a blank is not a valid symbol for a deletion.
178<P>
179The options are selected using an interactive menu.  The menu looks like this:
180<P>
181<TABLE><TR><TD BGCOLOR=white>
182<PRE>
183
184Nucleic acid sequence Invariants method, version 3.6a3
185
186Settings for this run:
187  W                       Sites weighted?  No
188  M           Analyze multiple data sets?  No
189  I          Input sequences interleaved?  Yes
190  0   Terminal type (IBM PC, ANSI, none)?  (none)
191  1    Print out the data at start of run  No
192  2  Print indications of progress of run  Yes
193  3      Print out the counts of patterns  Yes
194  4              Print out the invariants  Yes
195
196  Y to accept these or type the letter for one to change
197
198</PRE>
199</TD></TR></TABLE>
200<P>
201The user either types "Y" (followed, of course, by a carriage-return)
202if the settings shown are to be accepted, or the letter or digit corresponding
203to an option that is to be changed.
204<P>
205The options W, M and 0 are the usual ones.  They are described in the
206main documentation file of this package.  Option I is the same as in
207other molecular sequence programs and is described in the documentation file
208for the sequence programs.
209<P>
210<H2>OUTPUT FORMAT</H2>
211<P>
212The output consists first (if option 1 is selected) of a reprinting of the
213input data, then (if option 2 is on) tables
214of observed pattern frequencies and pattern type frequencies.  A table will
215be printed out, in alphabetic order AAAA through TTTT of all the patterns
216that appear among the sites and the number of times each appears.  This table
217will be invaluable for computation of any other invariants.  There follows
218another table, of pattern types, using the 1234 notation, in numerical
219order 1111 through 1234, of the number of times each type of pattern appears.
220In this computation all sites at which there are any ambiguities or deletions
221are omitted.  Cavender's invariants could actually be computed from sites
222that have only Y or R ambiguities; this will be done in the next release of
223this program.
224<P>
225If option 3 is on the invariants are then printed out,
226together with their statistical
227tests.  For Lake's invariants the two sums which are expected to be equal are
228printed out, and then the result of an one-tailed exact binomial test which
229tests whether the difference is expected to be this positive or more.  The P
230level is given (but remember the multiple-tests problem!).
231<P>
232For Cavender's L invariants the contingency tables are given.  Each is tested
233with a one-tailed chi-square test.  It is possible that the expected numbers
234in some categories could be too small for valid use of this test; the program
235does not check for this.  It is also possible that the chi-square could be
236significant but in the wrong direction; this is not tested in the current
237version of the program.  To check it beware of a chi-square greater than 3.841
238but with a positive invariant.  The invariants themselves are computed, as the
239difference of cross-products.  Their absolute magnitudes are not important,
240but which one is closest to zero may be indicative.  Significantly nonzero
241invariants should be negative if the model is valid.  The K invariants, which
242are simply differences among the L invariants, are also printed out without
243any test on them being conducted.   Note that it is possible to use the
244bootstrap utility SEQBOOT to create multiple data sets, and from the
245output from sunning all of these get the empirical variability of these
246quadratic invariants.
247<P>
248<H2>PROGRAM CONSTANTS</H2>
249<P>
250The constants
251that are defined at the beginning of the program include
252"maxsp",
253which must always be 4 and should not be changed.
254<P>
255The program is very fast, as it has rather little work to do; these methods
256are just a little bit beyond the reach of hand tabulation.  Execution speed
257should never be a limiting factor.
258<P>
259<H2>FUTURE OF THE PROGRAM</H2>
260<P>
261In a future version I hope to allow for Y and R codes in the calculation of
262the Cavender invariants, and to check for significantly negative cross-product
263differences in them, which would indicate violation of the model.  By
264then there should be more known about invariants for larger number of species,
265and any such advances will also be incorporated.
266<P>
267<HR>
268<P>
269<H3>TEST DATA SET</H2>
270<P>
271<TABLE><TR><TD BGCOLOR=white>
272<PRE>
273   4   13
274Alpha     AACGTGGCCAAAT
275Beta      AAGGTCGCCAAAC
276Gamma     CATTTCGTCACAA
277Delta     GGTATTTCGGCCT
278</PRE>
279</TD></TR></TABLE>
280<P>
281<HR>
282<H3>TEST SET OUTPUT (run with all numerical options turned on)</H3>
283<P>
284<TABLE><TR><TD BGCOLOR=white>
285<PRE>
286
287Nucleic acid sequence Invariants method, version 3.6a3
288
289 4 species,  13  sites
290
291Name            Sequences
292----            ---------
293
294Alpha        AACGTGGCCA AAT
295Beta         ..G..C.... ..C
296Gamma        C.TT.C.T.. C.A
297Delta        GGTA.TT.GG CC.
298
299
300
301   Pattern   Number of times
302
303     AAAC         1
304     AAAG         2
305     AACC         1
306     AACG         1
307     CCCG         1
308     CCTC         1
309     CGTT         1
310     GCCT         1
311     GGGT         1
312     GGTA         1
313     TCAT         1
314     TTTT         1
315
316
317Symmetrized patterns (1, 2 = the two purines  and  3, 4 = the two pyrimidines
318                  or  1, 2 = the two pyrimidines  and  3, 4 = the two purines)
319
320     1111         1
321     1112         2
322     1113         3
323     1121         1
324     1132         2
325     1133         1
326     1231         1
327     1322         1
328     1334         1
329
330Tree topologies (unrooted):
331
332    I:  ((Alpha,Beta),(Gamma,Delta))
333   II:  ((Alpha,Gamma),(Beta,Delta))
334  III:  ((Alpha,Delta),(Beta,Gamma))
335
336
337Lake's linear invariants
338 (these are expected to be zero for the two incorrect tree topologies.
339  This is tested by testing the equality of the two parts
340  of each expression using a one-sided exact binomial test.
341  The null hypothesis is that the first part is no larger than the second.)
342
343 Tree                             Exact test P value    Significant?
344
345   I      1    -     0   =     1       0.5000               no
346   II     0    -     0   =     0       1.0000               no
347   III    0    -     0   =     0       1.0000               no
348
349
350Cavender's quadratic invariants (type L) using purines vs. pyrimidines
351 (these are expected to be zero, and thus have a nonsignificant
352  chi-square, for the correct tree topology)
353They will be misled if there are substantially
354different evolutionary rate between sites, or
355different purine:pyrimidine ratios from 1:1.
356
357  Tree I:
358
359   Contingency Table
360
361      2     8
362      1     2
363
364   Quadratic invariant =             4.0
365
366   Chi-square =    0.23111 (not significant)
367
368
369  Tree II:
370
371   Contingency Table
372
373      1     5
374      1     6
375
376   Quadratic invariant =            -1.0
377
378   Chi-square =    0.01407 (not significant)
379
380
381  Tree III:
382
383   Contingency Table
384
385      1     2
386      6     4
387
388   Quadratic invariant =             8.0
389
390   Chi-square =    0.66032 (not significant)
391
392
393
394
395Cavender's quadratic invariants (type K) using purines vs. pyrimidines
396 (these are expected to be zero for the correct tree topology)
397They will be misled if there are substantially
398different evolutionary rate between sites, or
399different purine:pyrimidine ratios from 1:1.
400No statistical test is done on them here.
401
402  Tree I:              -9.0
403  Tree II:              4.0
404  Tree III:             5.0
405
406</PRE>
407</TD></TR></TABLE>
408</BODY>
409</HTML>
Note: See TracBrowser for help on using the repository browser.