1 | <!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 3.2 Final//EN"> |
---|
2 | <HTML> |
---|
3 | <HEAD> |
---|
4 | <TITLE>dnainvar</TITLE> |
---|
5 | <META NAME="description" CONTENT="dnainvar"> |
---|
6 | <META NAME="keywords" CONTENT="dnainvar"> |
---|
7 | <META NAME="resource-type" CONTENT="document"> |
---|
8 | <META NAME="distribution" CONTENT="global"> |
---|
9 | <META HTTP-EQUIV="Content-Type" CONTENT="text/html; charset=iso-8859-1"> |
---|
10 | </HEAD> |
---|
11 | <BODY BGCOLOR="#ccffff"> |
---|
12 | <DIV ALIGN=RIGHT> |
---|
13 | version 3.6 |
---|
14 | </DIV> |
---|
15 | <P> |
---|
16 | <DIV ALIGN=CENTER> |
---|
17 | <H1>DNAINVAR -- Program to compute Lake's and Cavender's<BR> |
---|
18 | phylogenetic invariants from nucleotide sequences</H1> |
---|
19 | </DIV> |
---|
20 | <P> |
---|
21 | © Copyright 1986-2002 by the University of Washington. |
---|
22 | Written by Joseph Felsenstein. Permission is granted to copy |
---|
23 | this document provided that no fee is charged for it and that this copyright |
---|
24 | notice is not removed. |
---|
25 | <P> |
---|
26 | This program reads in nucleotide sequences for four species and computes |
---|
27 | the phylogenetic invariants discovered by James Cavender (Cavender and |
---|
28 | Felsenstein, 1987) and James Lake (1987). Lake's method is also |
---|
29 | called by him "evolutionary parsimony". I prefer Cavender's more |
---|
30 | mathematically precise term "invariants", as the method bears somewhat |
---|
31 | more relationship to likelihood methods than to parsimony. The invariants |
---|
32 | are mathematical |
---|
33 | formulas (in the present case linear or quadratic) in the EXPECTED |
---|
34 | frequencies of site patterns which are zero for all trees of a given |
---|
35 | tree topology, irrespective of branch lengths. We can consider at a given |
---|
36 | site that if there |
---|
37 | are no ambiguities, we could have for four species the nucleotide patterns |
---|
38 | (considering the same site across all four species) AAAA, AAAC, AAAG, ... |
---|
39 | through TTTT, 256 patterns in all. |
---|
40 | <P> |
---|
41 | The invariants are formulas in the expected pattern frequencies, not |
---|
42 | the observed pattern frequencies. When they are computed using the |
---|
43 | observed pattern frequencies, we will |
---|
44 | usually find that they are not precisely zero even when the model is correct |
---|
45 | and we have the correct tree topology. Only as the number of nucleotides |
---|
46 | scored becomes infinite will the observed pattern frequencies approach their |
---|
47 | expectations; otherwise, we must do a statistical test of the invariants. |
---|
48 | <P> |
---|
49 | Some explanation of invariants will be found in the above papers, and also |
---|
50 | in my recent review article on statistical aspects of inferring phylogenies |
---|
51 | (Felsenstein, 1988b). Although invariants have some important advantages, |
---|
52 | their validity also depends on symmetry assumptions that may not be satisfied. |
---|
53 | In the discussion below suppose that the possible unrooted phylogenies are |
---|
54 | I: ((A,B),(C,D)), II: ((A,C),(B,D)), and III: ((A,D),(B,C)). |
---|
55 | <P> |
---|
56 | <H2>Lake's Invariants, Their Testing and Assumptions</H2> |
---|
57 | <P> |
---|
58 | Lake's invariants are fairly simple to describe: the patterns involved are |
---|
59 | only those in which there are two purines and two pyrimidines at a site. |
---|
60 | Thus a site with AACT would affect the invariants, but a site with AAGG would |
---|
61 | not. Let us use (as Lake does) |
---|
62 | the symbols 1, 2, 3, and 4, with the proviso that 1 and 2 |
---|
63 | are either both of the purines or both of the pyrimidines; 3 and 4 are the |
---|
64 | other two nucleotides. Thus 1 and 2 always differ by a transition; so do |
---|
65 | 3 and 4. Lake's invariants, expressed in terms of expected frequencies, are |
---|
66 | the three quantities: |
---|
67 | <P> |
---|
68 | (1) P(1133) + P(1234) - P(1134) - P(1233), |
---|
69 | <P> |
---|
70 | (2) P(1313) + P(1324) - P(1314) - P(1323), |
---|
71 | <P> |
---|
72 | (3) P(1331) + P(1342) - P(1341) - P(1332), |
---|
73 | <P> |
---|
74 | He showed that invariants (2) and (3) are zero under Topology I, (1) and (3) |
---|
75 | are zero under topology II, and (1) and (2) are zero under Topology III. If, |
---|
76 | for example, we see a site with pattern ACGC, we can start by setting 1=A. |
---|
77 | Then 2 must be G. We can then set 3=C (so that 4 is T). Thus its pattern |
---|
78 | type, making those substitutions, is 1323. P(1323) is the expected |
---|
79 | probability of the type of pattern which includes ACGC, TGAG, GTAT, etc. |
---|
80 | <P> |
---|
81 | Lake's invariants are easily tested with observed frequencies. For example, |
---|
82 | the first of them is a test of whether there are as many sites of types 1133 |
---|
83 | and 1234 as there are of types 1134 and 1233; this is easily tested with a |
---|
84 | chi-square test or, as in this program, with an exact binomial test. Note |
---|
85 | that with several invariants to test, we risk overestimating the significance |
---|
86 | of results if we simply accept the nominal 95% levels of significance |
---|
87 | (Li and Guoy, 1990). |
---|
88 | <P> |
---|
89 | Lake's invariants assume that each site is evolving independently, and that |
---|
90 | starting from any base a transversion is equally likely to end up at each of |
---|
91 | the two possible bases (thus, an A undergoing a transversion is equally likely |
---|
92 | to end up as a C or a T, and similarly for the other four bases from which one |
---|
93 | could start. Interestingly, Lake's results do not assume that rates of |
---|
94 | evolution are the same at all sites. The result that the total of 1133 and 1234 |
---|
95 | is expected to be the same as the total of 1134 and 1233 is unaffected by the |
---|
96 | fact that we may have aggregated the counts over classes of sites evolving at |
---|
97 | different rates. |
---|
98 | <P> |
---|
99 | <H2>Cavender's Invariants, Their Testing and Assumptions</H2> |
---|
100 | <P> |
---|
101 | Cavender's invariants (Cavender and Felsenstein, 1987) are for the case of |
---|
102 | a character with two states. In the nucleic acid case we can classify |
---|
103 | nucleotides into two states, R and Y (Purine and Pyrimidine) and then use the |
---|
104 | two-state results. Cavender starts, as before, with the pattern frequencies. |
---|
105 | Coding purines as R and pyrimidines as Y, the patterns types are RRRR, RRRY, |
---|
106 | and so on until YYYY, a total of 16 types. Cavender found quadratic functions |
---|
107 | of the expected frequencies of these 16 types that were expected to be zero |
---|
108 | under a given phylogeny, irrespective of branch lengths. Two invariants |
---|
109 | (called K and L) were found for each tree topology. The L invariants are |
---|
110 | particularly easy to understand. If we have the tree topology ((A,B),(C,D)), |
---|
111 | then in the case of two symmetric states, the event that A and B have the same |
---|
112 | state should be independent of whether C and D have the same state, as the |
---|
113 | events determining these happen in different parts of the tree. We can set |
---|
114 | up a contingency table: |
---|
115 | <P> |
---|
116 | <PRE> |
---|
117 | C = D C =/= D |
---|
118 | ------------------------------ |
---|
119 | | |
---|
120 | A = B | YYYY, YYRR, YYYR, YYRY, |
---|
121 | | RRRR, RRYY RRYR, RRRY |
---|
122 | | |
---|
123 | A =/= B | YRYY, YRRR, YRYR, YRRY, |
---|
124 | | RYYY, RYRR RYYR, RYRY |
---|
125 | </PRE> |
---|
126 | <P> |
---|
127 | and we expect that the events C = D and A = B will be independent. Cavender's |
---|
128 | L invariant for this tree topology is simply the negative of the crossproduct |
---|
129 | difference, |
---|
130 | <P> |
---|
131 | P(A=/=B and C=D) P(A=B and C=/=D) - P(A=B and C=D) P(A=/=B and C=/=D). |
---|
132 | <P> |
---|
133 | One of these L invariants is defined for each of the three tree topologies. |
---|
134 | They can obviously be tested simply by doing a chi-square test on the |
---|
135 | contingency table. The one corresponding to the correct topology should be |
---|
136 | statistically indistinguishable from zero. Again, there is a possible |
---|
137 | multiple tests problem if all three are tested at a nominal value of 95%. |
---|
138 | <P> |
---|
139 | The K invariants are differences between the L invariants. When one of the |
---|
140 | tables is expected to have crossproduct difference zero, the other two are |
---|
141 | expected to be nonzero, and also to be equal. So the difference of their |
---|
142 | crossproduct differences can be taken; this is the K invariant. It is not |
---|
143 | so easily tested. |
---|
144 | <P> |
---|
145 | The assumptions of Cavender's invariants are different from those of |
---|
146 | Lake's. One obviously need not assume anything about the frequencies of, or |
---|
147 | transitions among, the two different purines or the two different |
---|
148 | pyrimidines. However one does need to assume independent events at each site, |
---|
149 | and one needs to assume that the Y and R states are symmetric, that the |
---|
150 | probability per unit time that a Y changes into an R is the same as the |
---|
151 | probability that an R changes into a Y, so that we expect equal frequencies |
---|
152 | of the two states. There is also an assumption that all sites are changing |
---|
153 | between these two states at the same expected rate. This assumption is not |
---|
154 | needed for Lake's invariants, since expectations of sums are equal to |
---|
155 | sums of expectations, but for Cavender's it is, since products of expectations |
---|
156 | are not equal to expectations of products. |
---|
157 | <P> |
---|
158 | It is helpful to have both sorts of invariants available; with further work we |
---|
159 | may appreciate what other invaraints there are for various models of nucleic |
---|
160 | acid change. |
---|
161 | <P> |
---|
162 | <H2>INPUT FORMAT</H2> |
---|
163 | <P> |
---|
164 | The input data for DNAINVAR is standard. The first line of the input file |
---|
165 | contains the |
---|
166 | number of species (which must always be 4 for this version of DNAINVAR) |
---|
167 | and the number of sites. |
---|
168 | <P> |
---|
169 | Next come the species data. Each |
---|
170 | sequence starts on a new line, has a ten-character species name |
---|
171 | that must be blank-filled to be of that length, followed immediately |
---|
172 | by the species data in the one-letter code. The sequences must either |
---|
173 | be in the "interleaved" or "sequential" formats |
---|
174 | described in the Molecular Sequence Programs document. The I option |
---|
175 | selects between them. The sequences can have internal |
---|
176 | blanks in the sequence but there must be no extra blanks at the end of the |
---|
177 | terminated line. Note that a blank is not a valid symbol for a deletion. |
---|
178 | <P> |
---|
179 | The options are selected using an interactive menu. The menu looks like this: |
---|
180 | <P> |
---|
181 | <TABLE><TR><TD BGCOLOR=white> |
---|
182 | <PRE> |
---|
183 | |
---|
184 | Nucleic acid sequence Invariants method, version 3.6a3 |
---|
185 | |
---|
186 | Settings for this run: |
---|
187 | W Sites weighted? No |
---|
188 | M Analyze multiple data sets? No |
---|
189 | I Input sequences interleaved? Yes |
---|
190 | 0 Terminal type (IBM PC, ANSI, none)? (none) |
---|
191 | 1 Print out the data at start of run No |
---|
192 | 2 Print indications of progress of run Yes |
---|
193 | 3 Print out the counts of patterns Yes |
---|
194 | 4 Print out the invariants Yes |
---|
195 | |
---|
196 | Y to accept these or type the letter for one to change |
---|
197 | |
---|
198 | </PRE> |
---|
199 | </TD></TR></TABLE> |
---|
200 | <P> |
---|
201 | The user either types "Y" (followed, of course, by a carriage-return) |
---|
202 | if the settings shown are to be accepted, or the letter or digit corresponding |
---|
203 | to an option that is to be changed. |
---|
204 | <P> |
---|
205 | The options W, M and 0 are the usual ones. They are described in the |
---|
206 | main documentation file of this package. Option I is the same as in |
---|
207 | other molecular sequence programs and is described in the documentation file |
---|
208 | for the sequence programs. |
---|
209 | <P> |
---|
210 | <H2>OUTPUT FORMAT</H2> |
---|
211 | <P> |
---|
212 | The output consists first (if option 1 is selected) of a reprinting of the |
---|
213 | input data, then (if option 2 is on) tables |
---|
214 | of observed pattern frequencies and pattern type frequencies. A table will |
---|
215 | be printed out, in alphabetic order AAAA through TTTT of all the patterns |
---|
216 | that appear among the sites and the number of times each appears. This table |
---|
217 | will be invaluable for computation of any other invariants. There follows |
---|
218 | another table, of pattern types, using the 1234 notation, in numerical |
---|
219 | order 1111 through 1234, of the number of times each type of pattern appears. |
---|
220 | In this computation all sites at which there are any ambiguities or deletions |
---|
221 | are omitted. Cavender's invariants could actually be computed from sites |
---|
222 | that have only Y or R ambiguities; this will be done in the next release of |
---|
223 | this program. |
---|
224 | <P> |
---|
225 | If option 3 is on the invariants are then printed out, |
---|
226 | together with their statistical |
---|
227 | tests. For Lake's invariants the two sums which are expected to be equal are |
---|
228 | printed out, and then the result of an one-tailed exact binomial test which |
---|
229 | tests whether the difference is expected to be this positive or more. The P |
---|
230 | level is given (but remember the multiple-tests problem!). |
---|
231 | <P> |
---|
232 | For Cavender's L invariants the contingency tables are given. Each is tested |
---|
233 | with a one-tailed chi-square test. It is possible that the expected numbers |
---|
234 | in some categories could be too small for valid use of this test; the program |
---|
235 | does not check for this. It is also possible that the chi-square could be |
---|
236 | significant but in the wrong direction; this is not tested in the current |
---|
237 | version of the program. To check it beware of a chi-square greater than 3.841 |
---|
238 | but with a positive invariant. The invariants themselves are computed, as the |
---|
239 | difference of cross-products. Their absolute magnitudes are not important, |
---|
240 | but which one is closest to zero may be indicative. Significantly nonzero |
---|
241 | invariants should be negative if the model is valid. The K invariants, which |
---|
242 | are simply differences among the L invariants, are also printed out without |
---|
243 | any test on them being conducted. Note that it is possible to use the |
---|
244 | bootstrap utility SEQBOOT to create multiple data sets, and from the |
---|
245 | output from sunning all of these get the empirical variability of these |
---|
246 | quadratic invariants. |
---|
247 | <P> |
---|
248 | <H2>PROGRAM CONSTANTS</H2> |
---|
249 | <P> |
---|
250 | The constants |
---|
251 | that are defined at the beginning of the program include |
---|
252 | "maxsp", |
---|
253 | which must always be 4 and should not be changed. |
---|
254 | <P> |
---|
255 | The program is very fast, as it has rather little work to do; these methods |
---|
256 | are just a little bit beyond the reach of hand tabulation. Execution speed |
---|
257 | should never be a limiting factor. |
---|
258 | <P> |
---|
259 | <H2>FUTURE OF THE PROGRAM</H2> |
---|
260 | <P> |
---|
261 | In a future version I hope to allow for Y and R codes in the calculation of |
---|
262 | the Cavender invariants, and to check for significantly negative cross-product |
---|
263 | differences in them, which would indicate violation of the model. By |
---|
264 | then there should be more known about invariants for larger number of species, |
---|
265 | and any such advances will also be incorporated. |
---|
266 | <P> |
---|
267 | <HR> |
---|
268 | <P> |
---|
269 | <H3>TEST DATA SET</H2> |
---|
270 | <P> |
---|
271 | <TABLE><TR><TD BGCOLOR=white> |
---|
272 | <PRE> |
---|
273 | 4 13 |
---|
274 | Alpha AACGTGGCCAAAT |
---|
275 | Beta AAGGTCGCCAAAC |
---|
276 | Gamma CATTTCGTCACAA |
---|
277 | Delta GGTATTTCGGCCT |
---|
278 | </PRE> |
---|
279 | </TD></TR></TABLE> |
---|
280 | <P> |
---|
281 | <HR> |
---|
282 | <H3>TEST SET OUTPUT (run with all numerical options turned on)</H3> |
---|
283 | <P> |
---|
284 | <TABLE><TR><TD BGCOLOR=white> |
---|
285 | <PRE> |
---|
286 | |
---|
287 | Nucleic acid sequence Invariants method, version 3.6a3 |
---|
288 | |
---|
289 | 4 species, 13 sites |
---|
290 | |
---|
291 | Name Sequences |
---|
292 | ---- --------- |
---|
293 | |
---|
294 | Alpha AACGTGGCCA AAT |
---|
295 | Beta ..G..C.... ..C |
---|
296 | Gamma C.TT.C.T.. C.A |
---|
297 | Delta GGTA.TT.GG CC. |
---|
298 | |
---|
299 | |
---|
300 | |
---|
301 | Pattern Number of times |
---|
302 | |
---|
303 | AAAC 1 |
---|
304 | AAAG 2 |
---|
305 | AACC 1 |
---|
306 | AACG 1 |
---|
307 | CCCG 1 |
---|
308 | CCTC 1 |
---|
309 | CGTT 1 |
---|
310 | GCCT 1 |
---|
311 | GGGT 1 |
---|
312 | GGTA 1 |
---|
313 | TCAT 1 |
---|
314 | TTTT 1 |
---|
315 | |
---|
316 | |
---|
317 | Symmetrized patterns (1, 2 = the two purines and 3, 4 = the two pyrimidines |
---|
318 | or 1, 2 = the two pyrimidines and 3, 4 = the two purines) |
---|
319 | |
---|
320 | 1111 1 |
---|
321 | 1112 2 |
---|
322 | 1113 3 |
---|
323 | 1121 1 |
---|
324 | 1132 2 |
---|
325 | 1133 1 |
---|
326 | 1231 1 |
---|
327 | 1322 1 |
---|
328 | 1334 1 |
---|
329 | |
---|
330 | Tree topologies (unrooted): |
---|
331 | |
---|
332 | I: ((Alpha,Beta),(Gamma,Delta)) |
---|
333 | II: ((Alpha,Gamma),(Beta,Delta)) |
---|
334 | III: ((Alpha,Delta),(Beta,Gamma)) |
---|
335 | |
---|
336 | |
---|
337 | Lake's linear invariants |
---|
338 | (these are expected to be zero for the two incorrect tree topologies. |
---|
339 | This is tested by testing the equality of the two parts |
---|
340 | of each expression using a one-sided exact binomial test. |
---|
341 | The null hypothesis is that the first part is no larger than the second.) |
---|
342 | |
---|
343 | Tree Exact test P value Significant? |
---|
344 | |
---|
345 | I 1 - 0 = 1 0.5000 no |
---|
346 | II 0 - 0 = 0 1.0000 no |
---|
347 | III 0 - 0 = 0 1.0000 no |
---|
348 | |
---|
349 | |
---|
350 | Cavender's quadratic invariants (type L) using purines vs. pyrimidines |
---|
351 | (these are expected to be zero, and thus have a nonsignificant |
---|
352 | chi-square, for the correct tree topology) |
---|
353 | They will be misled if there are substantially |
---|
354 | different evolutionary rate between sites, or |
---|
355 | different purine:pyrimidine ratios from 1:1. |
---|
356 | |
---|
357 | Tree I: |
---|
358 | |
---|
359 | Contingency Table |
---|
360 | |
---|
361 | 2 8 |
---|
362 | 1 2 |
---|
363 | |
---|
364 | Quadratic invariant = 4.0 |
---|
365 | |
---|
366 | Chi-square = 0.23111 (not significant) |
---|
367 | |
---|
368 | |
---|
369 | Tree II: |
---|
370 | |
---|
371 | Contingency Table |
---|
372 | |
---|
373 | 1 5 |
---|
374 | 1 6 |
---|
375 | |
---|
376 | Quadratic invariant = -1.0 |
---|
377 | |
---|
378 | Chi-square = 0.01407 (not significant) |
---|
379 | |
---|
380 | |
---|
381 | Tree III: |
---|
382 | |
---|
383 | Contingency Table |
---|
384 | |
---|
385 | 1 2 |
---|
386 | 6 4 |
---|
387 | |
---|
388 | Quadratic invariant = 8.0 |
---|
389 | |
---|
390 | Chi-square = 0.66032 (not significant) |
---|
391 | |
---|
392 | |
---|
393 | |
---|
394 | |
---|
395 | Cavender's quadratic invariants (type K) using purines vs. pyrimidines |
---|
396 | (these are expected to be zero for the correct tree topology) |
---|
397 | They will be misled if there are substantially |
---|
398 | different evolutionary rate between sites, or |
---|
399 | different purine:pyrimidine ratios from 1:1. |
---|
400 | No statistical test is done on them here. |
---|
401 | |
---|
402 | Tree I: -9.0 |
---|
403 | Tree II: 4.0 |
---|
404 | Tree III: 5.0 |
---|
405 | |
---|
406 | </PRE> |
---|
407 | </TD></TR></TABLE> |
---|
408 | </BODY> |
---|
409 | </HTML> |
---|