dnacomp.html

Last change on this file was 2176, checked in by westram, 22 years ago
* empty log message *
Property svn:eol-style set to `native` Property svn:keywords set to `Author Date Id Revision`
File size: 11.5 KB

Line
1	<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 3.2 Final//EN">
2	<HTML>
3	<HEAD>
4	<TITLE>dnacomp</TITLE>
5	<META NAME="description" CONTENT="dnacomp">
6	<META NAME="keywords" CONTENT="dnacomp">
7	<META NAME="resource-type" CONTENT="document">
8	<META NAME="distribution" CONTENT="global">
9	<META HTTP-EQUIV="Content-Type" CONTENT="text/html; charset=iso-8859-1">
10	</HEAD>
11	<BODY BGCOLOR="#ccffff">
12	<DIV ALIGN=RIGHT>
13	version 3.6
14	</DIV>
15	<P>
16	<DIV ALIGN=CENTER>
17	<H1>DNACOMP -- DNA Compatibility Program</H1>
18	</DIV>
19	<P>
20	© Copyright 1986-2002 by the University of
21	Washington. Written by Joseph Felsenstein. Permission is granted to copy
22	this document provided that no fee is charged for it and that this copyright
23	notice is not removed.
24	<P>
25	</EM>
26	This program implements the compatibility method for DNA sequence
27	data. For a four-state character without a character-state tree, as in
28	DNA sequences, the usual clique theorems cannot be applied. The
29	approach taken in this program is to directly evaluate each tree
30	topology by counting how many substitutions are needed in each site,
31	comparing this to the minimum number that might be needed (one less than
32	the number of bases observed at that site), and then evaluating the
33	number of sites which achieve the minimum number. This is the
34	evaluation of the tree (the number of compatible sites), and the
35	topology is chosen so as to maximize that number.
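<P>
The evaluation described above can be sketched in a few lines of Python.
This is only an illustration, not the code of the program: it assumes a
rooted, strictly bifurcating tree written as nested pairs, uses Fitch's set
method as the way of counting substitutions, and handles only unambiguous
bases. The names fitch_steps and compatible_sites are invented for the
example. The sequences and the tree are those of the test data set given at
the end of this document.

```python
# Sketch only: score one fixed tree by counting compatible sites.
# Assumption: Fitch's set method is used to count substitutions.

def fitch_steps(tree, site):
    """Return (state set, steps) for a nested-pair tree at one site.

    `tree` is a species name (a tip) or a pair (left, right);
    `site` maps each species name to its base at this site.
    """
    if isinstance(tree, str):
        return {site[tree]}, 0
    left_set, left_steps = fitch_steps(tree[0], site)
    right_set, right_steps = fitch_steps(tree[1], site)
    common = left_set & right_set
    if common:
        # The two subtrees can share a state: no extra change needed.
        return common, left_steps + right_steps
    # No shared state: one more substitution is required here.
    return left_set | right_set, left_steps + right_steps + 1


def compatible_sites(tree, sequences):
    """Return one True/False flag per site: does it reach the minimum?"""
    nsites = len(next(iter(sequences.values())))
    flags = []
    for i in range(nsites):
        site = {name: seq[i] for name, seq in sequences.items()}
        _, steps = fitch_steps(tree, site)
        minimum = len(set(site.values())) - 1  # one less than bases observed
        flags.append(steps == minimum)
    return flags


sequences = {
    "Alpha": "AACGUGGCCAAAU",
    "Beta": "AAGGUCGCCAAAC",
    "Gamma": "CAUUUCGUCACAA",
    "Delta": "GGUAUUUCGGCCU",
    "Epsilon": "GGGAUCUCGGCCC",
}
tree = ("Alpha", ("Beta", ("Gamma", ("Delta", "Epsilon"))))
flags = compatible_sites(tree, sequences)
print(sum(flags))  # 11
```

A search for the best tree would repeat this evaluation over candidate
topologies and keep the one with the largest count; that search is not shown.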
36	<P>
37	Compatibility methods originated with Le Quesne's (1969) suggestion that
38	one ought to look for trees supported by the largest number of perfectly
39	fitting (compatible) characters. Fitch (1975) showed by counterexample that
40	one could not use the pairwise compatibility methods used in CLIQUE to
41	discover the largest clique of jointly compatible characters.
42	<P>
43	The assumptions of this method are similar to those of CLIQUE. In
44	a paper in the Biological Journal of the Linnean Society (1981b)
45	I discuss this matter extensively. In effect, the assumptions are that:
46	<OL>
47	<LI>Each character evolves independently.
48	<LI>Different lineages evolve independently.
49	<LI>The ancestral base at each site is unknown.
50	<LI>The rates of change in most sites over the time spans involved
51	in the divergence of the group are very small.
52	<LI>A few of the sites have very high rates of change.
53	<LI>We do not know in advance which are the high and which the low
54	rate sites.
55	</OL>
56	<P>
57	That these are the assumptions of compatibility methods has been documented
58	in a series of papers of mine (1973a, 1978b, 1979, 1981b,
59	1983b, 1988b). For an opposing
60	view, which holds that arguments such as mine are invalid
61	and that parsimony (and perhaps compatibility) methods make no substantive
62	assumptions such as these, see the papers by Farris (1983) and Sober (1983a,
63	1983b, 1988), but also read the exchange between Felsenstein and Sober (1986).
64	<P>
65	There is, however, some reason to believe that the present criterion is not the
66	proper way to correct for the presence of some sites with high rates of
67	change in nucleotide sequence data. It can be argued that sites showing more
68	than two nucleotide states, even if those are compatible with the other sites,
69	are also candidates for sites with high rates of change. It might then be more
70	proper to use DNAPARS with the Threshold option with a threshold value of 2.
71	<P>
72	Change from an occupied site to a gap is counted as one
73	change. Reversion from a gap to an occupied site is allowed and is also
74	counted as one change. Note that this in effect assumes that a gap
75	N bases long is N separate events. This may be an overcorrection. When
76	we have nonoverlapping gaps, we could instead code a gap as a
77	single event by changing all but the first "-" in the gap into "?" characters.
78	In this way only the first base of the gap causes the program to infer a
79	change.
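<P>
The recoding suggested above is easy to automate. The following is a small
sketch, with recode_gaps as an invented name; it rewrites a single sequence
and does not check that gaps in different species are nonoverlapping, which
remains up to the user.

```python
import re


def recode_gaps(sequence):
    """Keep the first "-" of each run of gaps and turn the rest into "?"."""
    return re.sub(r"-+", lambda run: "-" + "?" * (len(run.group()) - 1), sequence)


print(recode_gaps("AC---GT-A"))  # AC-??GT-A
```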
80	<P>
81	The input data is standard. The first line of the input file contains the
82	number of species and the number of sites.
83	<P>
84	Next come the species data. Each
85	sequence starts on a new line, has a ten-character species name
86	that must be blank-filled to be of that length, followed immediately
87	by the species data in the one-letter code. The sequences must be in
88	either the "interleaved" or the "sequential" format
89	described in the Molecular Sequence Programs document. The I option
90	selects between them. The sequences can have internal
91	blanks in the sequence but there must be no extra blanks at the end of the
92	terminated line. Note that a blank is not a valid symbol for a deletion.
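<P>
As an illustration of the layout just described, here is a sketch of a reader
for the simplest case only: sequential format with each sequence on a single
line. The function read_sequential and its name_length parameter are
assumptions made for the example (the default of 10 mirrors the ten-character
name); interleaved input and sequences that continue on further lines are not
handled.

```python
def read_sequential(text, name_length=10):
    """Parse a simple sequential-format data set into {name: bases}."""
    lines = [line for line in text.splitlines() if line.strip()]
    nspecies, nsites = (int(field) for field in lines[0].split()[:2])
    sequences = {}
    for line in lines[1:1 + nspecies]:
        name = line[:name_length].strip()                # blank-filled name
        bases = line[name_length:].replace(" ", "")      # internal blanks allowed
        if len(bases) != nsites:
            raise ValueError(f"{name}: expected {nsites} sites, found {len(bases)}")
        sequences[name] = bases
    return sequences


rows = [("Alpha", "AACGUGGCCA AAU"), ("Beta", "AAGGUCGCCA AAC")]
text = "    2   13\n" + "\n".join(f"{name:<10}{seq}" for name, seq in rows)
parsed = read_sequential(text)
print(parsed["Beta"])  # AAGGUCGCCAAAC
```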
93	<P>
94	The options are selected using an interactive menu. The menu looks like this:
95	<P>
96	<TABLE><TR><TD BGCOLOR=white>
97	<PRE>
98
99	DNA compatibility algorithm, version 3.6a3
100
101	Settings for this run:
102	U Search for best tree? Yes
103	J Randomize input order of sequences? No. Use input order
104	O Outgroup root? No, use as outgroup species 1
105	W Sites weighted? No
106	M Analyze multiple data sets? No
107	I Input sequences interleaved? Yes
108	0 Terminal type (IBM PC, ANSI, none)? (none)
109	1 Print out the data at start of run No
110	2 Print indications of progress of run Yes
111	3 Print out tree Yes
112	4 Print steps & compatibility at sites No
113	5 Print sequences at all nodes of tree No
114	6 Write out trees onto tree file? Yes
115
116	Are these settings correct? (type Y or the letter for one to change)
117
118	</PRE>
119	</TD></TR></TABLE>
120	<P>
121	The user either types "Y" (followed, of course, by a carriage-return)
122	if the settings shown are to be accepted, or the letter or digit corresponding
123	to an option that is to be changed.
124	<P>
125	The options U, J, O, W, M, and 0 are the usual ones. They are described in the
126	main documentation file of this package. Option I is the same as in
127	other molecular sequence programs and is described in the documentation file
128	for the sequence programs.
129	<P>
130	The O (outgroup) option has no effect if the U (user-defined tree) option
131	is in effect. The user-defined trees (option U) fed in must be strictly
132	bifurcating, with a two-way split at their base.
133	<P>
134	The interpretation of weights (option W) in the case of a compatibility method
135	is that they count how many times the character (in this case the site) is
136	counted in the analysis. Thus a character can be dropped from the
137	analysis by assigning it zero weight. On the other hand, giving it a
138	weight of 5 means that in any clique it is in, it is counted as 5
139	characters when the size of the clique is evaluated. Generally, weights
140	other than 0 or 1 do not have much meaning when dealing with DNA sequences.
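<P>
In other words, the score of a tree becomes the sum of the weights of its
compatible sites. A minimal sketch, in which weighted_compatibility is an
invented name and the flags and weights are made-up values:

```python
def weighted_compatibility(flags, weights):
    """Sum the weights of the sites flagged as compatible."""
    return sum(weight for ok, weight in zip(flags, weights) if ok)


flags = [True, True, False, True]   # compatibility of four sites with a tree
weights = [1, 0, 5, 5]              # site 2 dropped, sites 3 and 4 counted 5 times
print(weighted_compatibility(flags, weights))  # 6
```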
141	<P>
142	Output is standard: if option 1 is toggled on, the data is printed out,
143	with the convention that "." means "the same as in the first species".
144	Then comes a list of equally parsimonious trees, and (if option 4 is
145	toggled on) a table of the
146	number of changes of state required in each character. If option 5 is toggled
147	on, a table is printed
148	out after each tree, showing for each branch whether there are known to be
149	changes in the branch, and what the states are inferred to have been at the
150	top end of the branch. If the inferred state is a "?" or one of the IUB
151	ambiguity symbols, there will be multiple
152	equally-parsimonious assignments of states; the user must work these out for
153	themselves by hand. A "?" in the reconstructed states means that in
154	addition to one or more bases, a gap may or may not be present. If
155	option 6 is left in its default state the trees
156	found will be written to a tree file, so that they are available to be used
157	in other programs.
158	<P>
159	If the U (User Tree) option is used and more than one tree is supplied,
160	the program also performs a statistical test of each of these trees against the
161	one with the highest compatibility. If there are two user trees, the test
162	done is one which is due to Kishino and Hasegawa (1989), a version
163	of a test originally introduced by Templeton (1983). In this
164	implementation it uses the mean and variance of weighted
165	compatibility differences between trees, taken across sites. If the two
166	trees' compatibilities are more than 1.96 standard deviations different then
167	the trees are declared significantly different.
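<P>
A two-tree comparison of this kind might be sketched as follows. The text
above says only that the mean and variance of the per-site differences are
used; the particular variance estimate below (the usual paired-sites form,
n/(n-1) times the sum of squared deviations) is an assumption, as is the name
paired_sites_test. The inputs are the already weighted per-site
compatibilities of the two trees, and at least two sites are required.

```python
import math


def paired_sites_test(compat_a, compat_b, z=1.96):
    """Compare two trees by their per-site compatibility differences.

    Returns (total difference, its estimated standard deviation,
    whether the difference exceeds z standard deviations).
    """
    diffs = [a - b for a, b in zip(compat_a, compat_b)]
    n = len(diffs)
    total = sum(diffs)
    mean = total / n
    # Assumed estimate of the variance of the summed difference.
    variance = n / (n - 1) * sum((d - mean) ** 2 for d in diffs)
    sd = math.sqrt(variance)
    return total, sd, abs(total) > z * sd


total, sd, significant = paired_sites_test([1, 1, 1, 0], [0, 0, 0, 0])
print(total, significant)  # 3 True
```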
168	<P>
169	If there are more than two trees, the test done is an extension of
170	the KHT test, due to Shimodaira and Hasegawa (1999). They pointed out
171	that a correction for the number of trees was necessary, and they
172	introduced a resampling method to make this correction. In the version
173	used here the variances and covariances of the sum of weighted
174	compatibilities of sites are computed for all pairs of trees. To
175	test whether the
176	difference between each tree and the best one is larger than could have
177	been expected if they all had the same expected compatibility,
178	compatibilities for all trees are sampled with these covariances and equal
179	means (Shimodaira and Hasegawa's "least favorable hypothesis"),
180	and a P value is computed from the fraction of times the difference between
181	the tree's value and the highest compatibility exceeds that actually
182	observed. Note that this sampling needs random numbers, and so the
183	program will prompt the user for a random number seed if one has not
184	already been supplied. With the two-tree KHT test no random numbers
185	are used.
186	<P>
187	In either the KHT or the SH test the program
188	prints out a table of the compatibility of each tree, the differences of
189	each from the highest one, the variance of that quantity as determined by
190	the compatibility differences at individual sites, and a conclusion as to
191	whether that tree is or is not significantly worse than the best one.
192	<P>
193	The algorithm is a straightforward modification of DNAPARS, but with
194	some extra machinery added to calculate, as each species is added, how
195	many base changes are the minimum which could be required at that site. The
196	program runs fairly quickly.
197	<P>
198	The constants
199	which can be changed at the beginning of the program are:
200	the name length "nmlngth",
201	"maxtrees", the maximum number of trees which the program will store for output,
202	and "maxuser",
203	the maximum number of user trees that can be used in the paired sites test.
204	<P>
205	<HR><H3>TEST DATA SET</H3>
206	<P>
207	<TABLE><TR><TD BGCOLOR=white>
208	<PRE>
209	    5   13
210	Alpha     AACGUGGCCAAAU
211	Beta      AAGGUCGCCAAAC
212	Gamma     CAUUUCGUCACAA
213	Delta     GGUAUUUCGGCCU
214	Epsilon   GGGAUCUCGGCCC
215	</PRE>
216	</TD></TR></TABLE>
217	<P>
218	<H3>CONTENTS OF OUTPUT FILE (if all numerical options are turned on)</H3>
219	<P>
220	<TABLE><TR><TD BGCOLOR=white>
221	<PRE>
222
223	DNA compatibility algorithm, version 3.6a3
224
225	5 species, 13 sites
226
227	Name         Sequences
228	----         ---------
229
230	Alpha        AACGUGGCCA AAU
231	Beta         AAGGUCGCCA AAC
232	Gamma        CAUUUCGUCA CAA
233	Delta        GGUAUUUCGG CCU
234	Epsilon      GGGAUCUCGG CCC
235
236
237
238	One most parsimonious tree found:
239
240
241
242
243	              +--Epsilon
244	           +--4
245	        +--3  +--Delta
246	        !  !
247	     +--2  +-----Gamma
248	     !  !
249	     1  +--------Beta
250	     !
251	     +-----------Alpha
252
253	remember: this is an unrooted tree!
254
255
256	total number of compatible sites is 11.0
257
258	steps in each site:
259	         0   1   2   3   4   5   6   7   8   9
260	     *-----------------------------------------
261	    0|       2   1   3   2   0   2   1   1   1
262	   10|   1   1   1   3
263
264	compatibility (Y or N) of each site with this tree:
265
266	      0123456789
267	     *----------
268	   0 ! YYNYYYYYY
269	  10 !YYYN
270
271	From   To          Any Steps?   State at upper node
272
273	       1                        AABGTSGCCA AAY
274	   1   2           maybe        AABGTCGCCA AAY
275	   2   3           yes          VAKDTCGCCA CAY
276	   3   4           yes          GGKATCTCGG CCY
277	   4   Epsilon     maybe        GGGATCTCGG CCC
278	   4   Delta       yes          GGTATTTCGG CCT
279	   3   Gamma       yes          CATTTCGTCA CAA
280	   2   Beta        maybe        AAGGTCGCCA AAC
281	   1   Alpha       maybe        AACGTGGCCA AAT
282
283
284	</PRE>
285	</TD></TR></TABLE>
286	</BODY>
287	</HTML>
