Context Navigation

dnainvar.html

Visit:

Last change on this file was 2176, checked in by westram, 22 years ago
* empty log message *
Property svn:eol-style set to `native` Property svn:keywords set to `Author Date Id Revision`
File size: 16.2 KB

Line
1	<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 3.2 Final//EN">
2	<HTML>
3	<HEAD>
4	<TITLE>dnainvar</TITLE>
5	<META NAME="description" CONTENT="dnainvar">
6	<META NAME="keywords" CONTENT="dnainvar">
7	<META NAME="resource-type" CONTENT="document">
8	<META NAME="distribution" CONTENT="global">
9	<META HTTP-EQUIV="Content-Type" CONTENT="text/html; charset=iso-8859-1">
10	</HEAD>
11	<BODY BGCOLOR="#ccffff">
12	<DIV ALIGN=RIGHT>
13	version 3.6
14	</DIV>
15	<P>
16	<DIV ALIGN=CENTER>
17	<H1>DNAINVAR -- Program to compute Lake's and Cavender's<BR>
18	phylogenetic invariants from nucleotide sequences</H1>
19	</DIV>
20	<P>
21	© Copyright 1986-2002 by the University of Washington.
22	Written by Joseph Felsenstein. Permission is granted to copy
23	this document provided that no fee is charged for it and that this copyright
24	notice is not removed.
25	<P>
26	This program reads in nucleotide sequences for four species and computes
27	the phylogenetic invariants discovered by James Cavender (Cavender and
28	Felsenstein, 1987) and James Lake (1987). Lake's method is also
29	called by him "evolutionary parsimony". I prefer Cavender's more
30	mathematically precise term "invariants", as the method bears somewhat
31	more relationship to likelihood methods than to parsimony. The invariants
32	are mathematical
33	formulas (in the present case linear or quadratic) in the EXPECTED
34	frequencies of site patterns which are zero for all trees of a given
35	tree topology, irrespective of branch lengths. We can consider at a given
36	site that if there
37	are no ambiguities, we could have for four species the nucleotide patterns
38	(considering the same site across all four species) AAAA, AAAC, AAAG, ...
39	through TTTT, 256 patterns in all.
40	<P>
41	The invariants are formulas in the expected pattern frequencies, not
42	the observed pattern frequencies. When they are computed using the
43	observed pattern frequencies, we will
44	usually find that they are not precisely zero even when the model is correct
45	and we have the correct tree topology. Only as the number of nucleotides
46	scored becomes infinite will the observed pattern frequencies approach their
47	expectations; otherwise, we must do a statistical test of the invariants.
48	<P>
49	Some explanation of invariants will be found in the above papers, and also
50	in my recent review article on statistical aspects of inferring phylogenies
51	(Felsenstein, 1988b). Although invariants have some important advantages,
52	their validity also depends on symmetry assumptions that may not be satisfied.
53	In the discussion below suppose that the possible unrooted phylogenies are
54	I: ((A,B),(C,D)), II: ((A,C),(B,D)), and III: ((A,D),(B,C)).
55	<P>
56	<H2>Lake's Invariants, Their Testing and Assumptions</H2>
57	<P>
58	Lake's invariants are fairly simple to describe: the patterns involved are
59	only those in which there are two purines and two pyrimidines at a site.
60	Thus a site with AACT would affect the invariants, but a site with AAGG would
61	not. Let us use (as Lake does)
62	the symbols 1, 2, 3, and 4, with the proviso that 1 and 2
63	are either both of the purines or both of the pyrimidines; 3 and 4 are the
64	other two nucleotides. Thus 1 and 2 always differ by a transition; so do
65	3 and 4. Lake's invariants, expressed in terms of expected frequencies, are
66	the three quantities:
67	<P>
68	(1)      P(1133) + P(1234) - P(1134) - P(1233),
69	<P>
70	(2)      P(1313) + P(1324) - P(1314) - P(1323),
71	<P>
72	(3)      P(1331) + P(1342) - P(1341) - P(1332),
73	<P>
74	He showed that invariants (2) and (3) are zero under Topology I, (1) and (3)
75	are zero under topology II, and (1) and (2) are zero under Topology III. If,
76	for example, we see a site with pattern ACGC, we can start by setting 1=A.
77	Then 2 must be G. We can then set 3=C (so that 4 is T). Thus its pattern
78	type, making those substitutions, is 1323. P(1323) is the expected
79	probability of the type of pattern which includes ACGC, TGAG, GTAT, etc.
80	<P>
81	Lake's invariants are easily tested with observed frequencies. For example,
82	the first of them is a test of whether there are as many sites of types 1133
83	and 1234 as there are of types 1134 and 1233; this is easily tested with a
84	chi-square test or, as in this program, with an exact binomial test. Note
85	that with several invariants to test, we risk overestimating the significance
86	of results if we simply accept the nominal 95% levels of significance
87	(Li and Guoy, 1990).
88	<P>
89	Lake's invariants assume that each site is evolving independently, and that
90	starting from any base a transversion is equally likely to end up at each of
91	the two possible bases (thus, an A undergoing a transversion is equally likely
92	to end up as a C or a T, and similarly for the other four bases from which one
93	could start. Interestingly, Lake's results do not assume that rates of
94	evolution are the same at all sites. The result that the total of 1133 and 1234
95	is expected to be the same as the total of 1134 and 1233 is unaffected by the
96	fact that we may have aggregated the counts over classes of sites evolving at
97	different rates.
98	<P>
99	<H2>Cavender's Invariants, Their Testing and Assumptions</H2>
100	<P>
101	Cavender's invariants (Cavender and Felsenstein, 1987) are for the case of
102	a character with two states. In the nucleic acid case we can classify
103	nucleotides into two states, R and Y (Purine and Pyrimidine) and then use the
104	two-state results. Cavender starts, as before, with the pattern frequencies.
105	Coding purines as R and pyrimidines as Y, the patterns types are RRRR, RRRY,
106	and so on until YYYY, a total of 16 types. Cavender found quadratic functions
107	of the expected frequencies of these 16 types that were expected to be zero
108	under a given phylogeny, irrespective of branch lengths. Two invariants
109	(called K and L) were found for each tree topology. The L invariants are
110	particularly easy to understand. If we have the tree topology ((A,B),(C,D)),
111	then in the case of two symmetric states, the event that A and B have the same
112	state should be independent of whether C and D have the same state, as the
113	events determining these happen in different parts of the tree. We can set
114	up a contingency table:
115	<P>
116	<PRE>
117	C = D C =/= D
118	------------------------------
119	\|
120	A = B \| YYYY, YYRR, YYYR, YYRY,
121	\| RRRR, RRYY RRYR, RRRY
122	\|
123	A =/= B \| YRYY, YRRR, YRYR, YRRY,
124	\| RYYY, RYRR RYYR, RYRY
125	</PRE>
126	<P>
127	and we expect that the events C = D and A = B will be independent. Cavender's
128	L invariant for this tree topology is simply the negative of the crossproduct
129	difference,
130	<P>
131	P(A=/=B and C=D) P(A=B and C=/=D) - P(A=B and C=D) P(A=/=B and C=/=D).
132	<P>
133	One of these L invariants is defined for each of the three tree topologies.
134	They can obviously be tested simply by doing a chi-square test on the
135	contingency table. The one corresponding to the correct topology should be
136	statistically indistinguishable from zero. Again, there is a possible
137	multiple tests problem if all three are tested at a nominal value of 95%.
138	<P>
139	The K invariants are differences between the L invariants. When one of the
140	tables is expected to have crossproduct difference zero, the other two are
141	expected to be nonzero, and also to be equal. So the difference of their
142	crossproduct differences can be taken; this is the K invariant. It is not
143	so easily tested.
144	<P>
145	The assumptions of Cavender's invariants are different from those of
146	Lake's. One obviously need not assume anything about the frequencies of, or
147	transitions among, the two different purines or the two different
148	pyrimidines. However one does need to assume independent events at each site,
149	and one needs to assume that the Y and R states are symmetric, that the
150	probability per unit time that a Y changes into an R is the same as the
151	probability that an R changes into a Y, so that we expect equal frequencies
152	of the two states. There is also an assumption that all sites are changing
153	between these two states at the same expected rate. This assumption is not
154	needed for Lake's invariants, since expectations of sums are equal to
155	sums of expectations, but for Cavender's it is, since products of expectations
156	are not equal to expectations of products.
157	<P>
158	It is helpful to have both sorts of invariants available; with further work we
159	may appreciate what other invaraints there are for various models of nucleic
160	acid change.
161	<P>
162	<H2>INPUT FORMAT</H2>
163	<P>
164	The input data for DNAINVAR is standard. The first line of the input file
165	contains the
166	number of species (which must always be 4 for this version of DNAINVAR)
167	and the number of sites.
168	<P>
169	Next come the species data. Each
170	sequence starts on a new line, has a ten-character species name
171	that must be blank-filled to be of that length, followed immediately
172	by the species data in the one-letter code. The sequences must either
173	be in the "interleaved" or "sequential" formats
174	described in the Molecular Sequence Programs document. The I option
175	selects between them. The sequences can have internal
176	blanks in the sequence but there must be no extra blanks at the end of the
177	terminated line. Note that a blank is not a valid symbol for a deletion.
178	<P>
179	The options are selected using an interactive menu. The menu looks like this:
180	<P>
181	<TABLE><TR><TD BGCOLOR=white>
182	<PRE>
183
184	Nucleic acid sequence Invariants method, version 3.6a3
185
186	Settings for this run:
187	W Sites weighted? No
188	M Analyze multiple data sets? No
189	I Input sequences interleaved? Yes
190	0 Terminal type (IBM PC, ANSI, none)? (none)
191	1 Print out the data at start of run No
192	2 Print indications of progress of run Yes
193	3 Print out the counts of patterns Yes
194	4 Print out the invariants Yes
195
196	Y to accept these or type the letter for one to change
197
198	</PRE>
199	</TD></TR></TABLE>
200	<P>
201	The user either types "Y" (followed, of course, by a carriage-return)
202	if the settings shown are to be accepted, or the letter or digit corresponding
203	to an option that is to be changed.
204	<P>
205	The options W, M and 0 are the usual ones. They are described in the
206	main documentation file of this package. Option I is the same as in
207	other molecular sequence programs and is described in the documentation file
208	for the sequence programs.
209	<P>
210	<H2>OUTPUT FORMAT</H2>
211	<P>
212	The output consists first (if option 1 is selected) of a reprinting of the
213	input data, then (if option 2 is on) tables
214	of observed pattern frequencies and pattern type frequencies. A table will
215	be printed out, in alphabetic order AAAA through TTTT of all the patterns
216	that appear among the sites and the number of times each appears. This table
217	will be invaluable for computation of any other invariants. There follows
218	another table, of pattern types, using the 1234 notation, in numerical
219	order 1111 through 1234, of the number of times each type of pattern appears.
220	In this computation all sites at which there are any ambiguities or deletions
221	are omitted. Cavender's invariants could actually be computed from sites
222	that have only Y or R ambiguities; this will be done in the next release of
223	this program.
224	<P>
225	If option 3 is on the invariants are then printed out,
226	together with their statistical
227	tests. For Lake's invariants the two sums which are expected to be equal are
228	printed out, and then the result of an one-tailed exact binomial test which
229	tests whether the difference is expected to be this positive or more. The P
230	level is given (but remember the multiple-tests problem!).
231	<P>
232	For Cavender's L invariants the contingency tables are given. Each is tested
233	with a one-tailed chi-square test. It is possible that the expected numbers
234	in some categories could be too small for valid use of this test; the program
235	does not check for this. It is also possible that the chi-square could be
236	significant but in the wrong direction; this is not tested in the current
237	version of the program. To check it beware of a chi-square greater than 3.841
238	but with a positive invariant. The invariants themselves are computed, as the
239	difference of cross-products. Their absolute magnitudes are not important,
240	but which one is closest to zero may be indicative. Significantly nonzero
241	invariants should be negative if the model is valid. The K invariants, which
242	are simply differences among the L invariants, are also printed out without
243	any test on them being conducted. Note that it is possible to use the
244	bootstrap utility SEQBOOT to create multiple data sets, and from the
245	output from sunning all of these get the empirical variability of these
246	quadratic invariants.
247	<P>
248	<H2>PROGRAM CONSTANTS</H2>
249	<P>
250	The constants
251	that are defined at the beginning of the program include
252	"maxsp",
253	which must always be 4 and should not be changed.
254	<P>
255	The program is very fast, as it has rather little work to do; these methods
256	are just a little bit beyond the reach of hand tabulation. Execution speed
257	should never be a limiting factor.
258	<P>
259	<H2>FUTURE OF THE PROGRAM</H2>
260	<P>
261	In a future version I hope to allow for Y and R codes in the calculation of
262	the Cavender invariants, and to check for significantly negative cross-product
263	differences in them, which would indicate violation of the model. By
264	then there should be more known about invariants for larger number of species,
265	and any such advances will also be incorporated.
266	<P>
267	<HR>
268	<P>
269	<H3>TEST DATA SET</H2>
270	<P>
271	<TABLE><TR><TD BGCOLOR=white>
272	<PRE>
273	4 13
274	Alpha AACGTGGCCAAAT
275	Beta AAGGTCGCCAAAC
276	Gamma CATTTCGTCACAA
277	Delta GGTATTTCGGCCT
278	</PRE>
279	</TD></TR></TABLE>
280	<P>
281	<HR>
282	<H3>TEST SET OUTPUT (run with all numerical options turned on)</H3>
283	<P>
284	<TABLE><TR><TD BGCOLOR=white>
285	<PRE>
286
287	Nucleic acid sequence Invariants method, version 3.6a3
288
289	4 species, 13 sites
290
291	Name Sequences
292	---- ---------
293
294	Alpha AACGTGGCCA AAT
295	Beta ..G..C.... ..C
296	Gamma C.TT.C.T.. C.A
297	Delta GGTA.TT.GG CC.
298
299
300
301	Pattern Number of times
302
303	AAAC 1
304	AAAG 2
305	AACC 1
306	AACG 1
307	CCCG 1
308	CCTC 1
309	CGTT 1
310	GCCT 1
311	GGGT 1
312	GGTA 1
313	TCAT 1
314	TTTT 1
315
316
317	Symmetrized patterns (1, 2 = the two purines and 3, 4 = the two pyrimidines
318	or 1, 2 = the two pyrimidines and 3, 4 = the two purines)
319
320	1111 1
321	1112 2
322	1113 3
323	1121 1
324	1132 2
325	1133 1
326	1231 1
327	1322 1
328	1334 1
329
330	Tree topologies (unrooted):
331
332	I: ((Alpha,Beta),(Gamma,Delta))
333	II: ((Alpha,Gamma),(Beta,Delta))
334	III: ((Alpha,Delta),(Beta,Gamma))
335
336
337	Lake's linear invariants
338	(these are expected to be zero for the two incorrect tree topologies.
339	This is tested by testing the equality of the two parts
340	of each expression using a one-sided exact binomial test.
341	The null hypothesis is that the first part is no larger than the second.)
342
343	Tree Exact test P value Significant?
344
345	I 1 - 0 = 1 0.5000 no
346	II 0 - 0 = 0 1.0000 no
347	III 0 - 0 = 0 1.0000 no
348
349
350	Cavender's quadratic invariants (type L) using purines vs. pyrimidines
351	(these are expected to be zero, and thus have a nonsignificant
352	chi-square, for the correct tree topology)
353	They will be misled if there are substantially
354	different evolutionary rate between sites, or
355	different purine:pyrimidine ratios from 1:1.
356
357	Tree I:
358
359	Contingency Table
360
361	2 8
362	1 2
363
364	Quadratic invariant = 4.0
365
366	Chi-square = 0.23111 (not significant)
367
368
369	Tree II:
370
371	Contingency Table
372
373	1 5
374	1 6
375
376	Quadratic invariant = -1.0
377
378	Chi-square = 0.01407 (not significant)
379
380
381	Tree III:
382
383	Contingency Table
384
385	1 2
386	6 4
387
388	Quadratic invariant = 8.0
389
390	Chi-square = 0.66032 (not significant)
391
392
393
394
395	Cavender's quadratic invariants (type K) using purines vs. pyrimidines
396	(these are expected to be zero for the correct tree topology)
397	They will be misled if there are substantially
398	different evolutionary rate between sites, or
399	different purine:pyrimidine ratios from 1:1.
400	No statistical test is done on them here.
401
402	Tree I: -9.0
403	Tree II: 4.0
404	Tree III: 5.0
405
406	</PRE>
407	</TD></TR></TABLE>
408	</BODY>
409	</HTML>

Note: See TracBrowser for help on using the repository browser.

Context Navigation

source: trunk/GDE/PHYLIP/doc/dnainvar.html

Download in other formats: