sequence.html

Last change on this file was 2176, checked in by westram, 22 years ago
1	<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 3.2 Final//EN">
2	<HTML>
3	<HEAD>
4	<TITLE>sequence</TITLE>
5	<META NAME="description" CONTENT="sequence">
6	<META NAME="keywords" CONTENT="sequence">
7	<META NAME="resource-type" CONTENT="document">
8	<META NAME="distribution" CONTENT="global">
9	<META HTTP-EQUIV="Content-Type" CONTENT="text/html; charset=iso-8859-1">
10	</HEAD>
11	<BODY BGCOLOR="#ccffff">
12	<DIV ALIGN=RIGHT>
13	version 3.6
14	</DIV>
15	<DIV ALIGN=CENTER>
16	<H1>Molecular Sequence Programs</H1>
17	</DIV>
18	<P>
19	(c) Copyright 1986-2000 by The University of
20	Washington. Written by Joseph Felsenstein. Permission is granted to copy
21	this document provided that no fee is charged for it and that this copyright
22	notice is not removed.
23	<P>
24	These programs estimate phylogenies from protein
25	sequence or nucleic acid sequence data. PROTPARS uses a parsimony method
26	intermediate between Eck and Dayhoff's
27	method (1966) of allowing transitions between all amino acids and counting
28	those, and Fitch's (1971) method of counting the number of nucleotide changes
29	that would be needed to evolve the protein sequence. DNAPARS uses the
30	parsimony method allowing changes between all bases
31	and counting the number of those. DNAMOVE is an interactive parsimony
32	program allowing the user to rearrange trees by hand and see where
33	character states change. DNAPENNY
34	uses the branch-and-bound method to search for all most
35	parsimonious trees in the nucleic acid sequence case. DNACOMP
36	adapts to nucleotide sequences the compatibility (largest clique)
37	approach. DNAINVAR does not directly estimate a phylogeny, but computes Lake's
38	(1987) and Cavender's (Cavender and Felsenstein, 1987) phylogenetic invariants,
39	which are quantities whose values depend on the phylogeny. DNAML does a
40	maximum likelihood estimate of the phylogeny (Felsenstein, 1981a). DNAMLK
41	is similar to DNAML but assumes a molecular clock. DNADIST
42	computes distance measures between pairs of species from nucleotide sequences,
43	distances that can then be used by the distance matrix programs FITCH and
44	KITSCH. RESTML does a maximum likelihood estimate from restriction
45	sites data. SEQBOOT allows you to read in a data set and then produce
46	multiple data sets from it by bootstrapping, delete-half jackknifing, or
47	by permuting within sites. This
48	then allows most of these methods to be bootstrapped or jackknifed, and
49	for the Permutation Tail Probability Test of Archie (1989) and Faith and
50	Cranston (1991) to be carried out.
51	<P>
52	The input and output format for RESTML is described in
53	its document files. In general its input format is similar to
54	those described here, except that the one-letter codes for restriction sites
55	are specific to that program and are described in that document file. Since
56	the input formats for the eight DNA sequence and two protein sequence
57	programs apply to more than one program, they are described here. Their
58	input formats are standard, making use of the IUPAC standards.
60	<P>
61	<H2>INTERLEAVED AND SEQUENTIAL FORMATS</H2>
62	<P>
63	The sequences can continue over multiple lines; when this is done the
64	sequences must be either in "interleaved" format, similar to the
65	output of alignment programs, or "sequential" format. These are
66	described in the main document file. In sequential format all
67	of one sequence is given, possibly on multiple lines, before the next starts.
68	In interleaved format the first part of the file should contain the first
69	part of each of the sequences, then possibly a line containing nothing
70	but a carriage-return character, then the second part of each sequence,
71	and so on. Only the first parts of the sequences should be preceded by
72	names. Here is a hypothetical example of interleaved format:
73	<P>
74	<TABLE><TR><TD BGCOLOR=white>
75	<PRE>
76	5 42
77	Turkey    AAGCTNGGGC ATTTCAGGGT
78	Salmo gairAAGCCTTGGC AGTGCAGGGT
79	H. SapiensACCGGTTGGC CGTTCAGGGT
80	Chimp     AAACCCTTGC CGTTACGCTT
81	Gorilla   AAACCCTTGC CGGTACGCTT
82
83	GAGCCCGGGC AATACAGGGT AT
84	GAGCCGTGGC CGGGCACGGT AT
85	ACAGGTTGGC CGTTCAGGGT AA
86	AAACCGAGGC CGGGACACTC AT
87	AAACCATTGC CGGTACGCTT AA
88	</PRE>
89	</TD></TR></TABLE>
90	<P>
91	while in sequential format the same sequences would be:
92	<P>
93	<TABLE><TR><TD BGCOLOR=white>
94	<PRE>
95	5 42
96	Turkey    AAGCTNGGGC ATTTCAGGGT
97	GAGCCCGGGC AATACAGGGT AT
98	Salmo gairAAGCCTTGGC AGTGCAGGGT
99	GAGCCGTGGC CGGGCACGGT AT
100	H. SapiensACCGGTTGGC CGTTCAGGGT
101	ACAGGTTGGC CGTTCAGGGT AA
102	Chimp     AAACCCTTGC CGTTACGCTT
103	AAACCGAGGC CGGGACACTC AT
104	Gorilla   AAACCCTTGC CGGTACGCTT
105	AAACCATTGC CGGTACGCTT AA
106	</PRE>
107	</TD></TR></TABLE>
108	<P>
109	Note, of course, that a portion of a sequence like this:
110	<P>
111	300 AAGCGTGAAC GTTGTACTAA TRCAG
112	<P>
113	is perfectly legal, assuming that the species name has gone before, and is
114	filled out to full length by blanks. The above
115	digits and blanks will be ignored, the sequence being taken as starting
116	at the first base symbol (in this case an A). This should enable you to
117	use output from many multiple-sequence alignment programs with only
118	minimal editing.
119	<P>
120	In interleaved format
121	the present versions of the programs may sometimes have difficulties with the
122	blank lines between groups of lines, and if so you might want to retype
123	those lines, making sure that they have only a carriage-return and no blank
124	characters on them, or you may perhaps have to eliminate them. The symptoms
125	of this problem are that the programs complain that the sequences are not
126	properly aligned, and you can find no other cause for this complaint.
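<P>
The interleaved layout described above can be read with a few lines of code.
The following Python sketch is not part of the package: the function name
<TT>parse_interleaved</TT> and its <TT>name_width</TT> parameter are
hypothetical, and it assumes that nothing (such as user trees) follows the
sequence data. It takes the name from the first 10 characters of each line
in the first block, and ignores blanks and digits in the sequence data.

```python
EXAMPLE = """\
5 42
Turkey    AAGCTNGGGC ATTTCAGGGT
Salmo gairAAGCCTTGGC AGTGCAGGGT
H. SapiensACCGGTTGGC CGTTCAGGGT
Chimp     AAACCCTTGC CGTTACGCTT
Gorilla   AAACCCTTGC CGGTACGCTT

GAGCCCGGGC AATACAGGGT AT
GAGCCGTGGC CGGGCACGGT AT
ACAGGTTGGC CGTTCAGGGT AA
AAACCGAGGC CGGGACACTC AT
AAACCATTGC CGGTACGCTT AA
"""


def parse_interleaved(text, name_width=10):
    """Parse interleaved sequence text into (names, sequences).

    Hypothetical helper for illustration; assumes only sequence data
    follows the first line.
    """
    lines = text.splitlines()
    # The first line holds the number of species and the number of sites.
    n_species, n_sites = (int(tok) for tok in lines[0].split()[:2])
    # Blank separator lines between blocks carry no information.
    body = [ln for ln in lines[1:] if ln.strip()]
    names, seqs = [], []
    for i, line in enumerate(body):
        if i < n_species:
            # Only the first block of lines is preceded by names.
            names.append(line[:name_width].strip())
            seqs.append("")
            line = line[name_width:]
        # Blanks and numerical digits in the sequence are ignored.
        cleaned = "".join(c for c in line if not (c.isspace() or c.isdigit()))
        seqs[i % n_species] += cleaned
    if any(len(s) != n_sites for s in seqs):
        raise ValueError("sequences are not properly aligned")
    return names, seqs


names, seqs = parse_interleaved(EXAMPLE)
```

Concatenating each species' pieces in order gives exactly the content of the
sequential format, so the same function could serve as the first step of a
format converter.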
127	<P>
128	<H2>INPUT FOR THE DNA SEQUENCE PROGRAMS</H2>
129	<P>
130	The input format for the DNA sequence programs is
131	standard: the data have A's, G's, C's and T's (or U's). The first line of the
132	input file contains the number of species and the number of sites. As
133	with the other programs, options information may follow this. Following this,
134	each species starts on a new line. The first 10
135	characters of that line are the species name. There then follows
136	the base sequence of that species, each character
137	being one of the letters A, B, C, D, G, H, K, M, N, O, R, S, T, U, V,
138	W, X, Y, ?, or - (a period was also previously allowed but it is no longer
139	allowed, because it sometimes is used in different senses in other
140	programs). Blanks will be ignored, and so will numerical
141	digits. This allows GENBANK and EMBL sequence entries to be read with
142	minimum editing.
143	<P>
144	These characters can be either upper or lower case. The algorithms
145	convert all input characters to upper case (which is how they
146	are treated). The characters constitute the IUPAC (IUB) nucleic acid code
147	plus some slight
148	extensions. They enable input of nucleic acid sequences taking full account
149	of any ambiguities in the sequence.
150	<P>
151	<DIV ALIGN=CENTER>
152	<TABLE BORDER=0>
153	<TR><TD ALIGN=LEFT><B>Symbol</B><TD><TD><B>Meaning</B></TD><TD></TD></TR>
154	<TR><TD></TD><TD></TD></TR>
155	<TR><TD>A<TD><TD>Adenine</TD><TD></TD></TR>
156	<TR><TD>G<TD><TD>Guanine</TD><TD></TD></TR>
157	<TR><TD>C<TD><TD>Cytosine</TD><TD></TD></TR>
158	<TR><TD>T<TD><TD>Thymine</TD><TD></TD></TR>
159	<TR><TD>U<TD><TD>Uracil </TD><TD></TD></TR>
160	<TR><TD>Y<TD><TD>pYrimidine<TD><TD>(C or T)</TD></TR>
161	<TR><TD>R<TD><TD>puRine<TD><TD>(A or G)</TD></TR>
162	<TR><TD>W<TD><TD>"Weak"<TD><TD>(A or T)</TD></TR>
163	<TR><TD>S<TD><TD>"Strong"<TD><TD>(C or G)</TD></TR>
164	<TR><TD>K<TD><TD>"Keto"<TD><TD>(T or G)</TD></TR>
165	<TR><TD>M<TD><TD>"aMino"<TD><TD>(C or A)</TD></TR>
166	<TR><TD>B<TD><TD>not A<TD><TD>(C or G or T)</TD></TR>
167	<TR><TD>D<TD><TD>not C<TD><TD>(A or G or T)</TD></TR>
168	<TR><TD>H<TD><TD>not G<TD><TD>(A or C or T)</TD></TR>
169	<TR><TD>V<TD><TD>not T<TD><TD>(A or C or G)</TD></TR>
170	<TR><TD>X,N,?<TD><TD>unknown<TD><TD>(A or C or G or T)</TD></TR>
171	<TR><TD>O<TD><TD>deletion</TD><TD></TD></TR>
172	<TR><TD>-<TD><TD>deletion</TD><TD></TD></TR>
173	</TABLE>
174	</DIV>
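<P>
The table above translates directly into a lookup table. The Python sketch
below is illustrative only: the names <TT>IUPAC</TT> and <TT>base_set</TT>
are hypothetical, U is treated as equivalent to T (an assumption made here
for convenience), and the two deletion symbols are mapped to a single "-"
state.

```python
# Each symbol maps to the set of states it may stand for.
IUPAC = {
    "A": "A", "G": "G", "C": "C", "T": "T",
    "U": "T",                      # assumption: U handled like T
    "Y": "CT", "R": "AG", "W": "AT", "S": "CG", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG",
    "X": "ACGT", "N": "ACGT", "?": "ACGT",
    "O": "-", "-": "-",            # both symbols denote a deletion
}


def base_set(symbol):
    """Return the set of states for one symbol (upper or lower case)."""
    try:
        return frozenset(IUPAC[symbol.upper()])
    except KeyError:
        # For example, a period is no longer an allowed symbol.
        raise ValueError("not a valid sequence symbol: %r" % symbol)
```

For example, <TT>base_set("y")</TT> gives the two pyrimidines C and T, while
<TT>base_set(".")</TT> raises an error.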
175	<P>
176	<H2>INPUT FOR THE PROTEIN SEQUENCE PROGRAMS</H2>
177	<P>
178	The input for the protein sequence programs is fairly standard. The first
179	line contains the
180	number of species and the number of amino acid positions (counting any
181	stop codons that you want to include). These are followed on the same line
182	by the options. The only options which
183	need information in the input file are U (User Tree) and W (Weights). They are
184	as described in the main documentation file. If the W (Weights) option is
185	used there must be a W in the first line of the input file.
186	<P>
187	Next come the species data. Each
188	sequence starts on a new line, has a ten-character species name
189	that must be blank-filled to be of that length, followed immediately
190	by the species data in the one-letter code. The sequences must
191	be in either the "interleaved" or the "sequential" format. The I option
192	selects between them. The sequences can have internal
193	blanks, but there must be no extra blanks at the end of a
194	line. Note that a blank is not a valid symbol for a deletion.
195	<P>
196	The protein sequences are given by the one-letter code used by
197	the late Margaret Dayhoff's group in the Atlas of Protein Sequences,
198	and consistent with the IUB standard abbreviations.
199	In the present version it is:
200	<P>
201	<DIV ALIGN=CENTER>
202	<TABLE>
203	<TR><TD ALIGN=CENTER><B>Symbol</B></TD><TD ALIGN=CENTER><B>Stands for</B></TD></TR>
204	<TR><TD ALIGN=CENTER></TD><TD ALIGN=CENTER></TD></TR>
205	<TR><TD ALIGN=CENTER>A</TD><TD ALIGN=CENTER>ala</TD></TR>
206	<TR><TD ALIGN=CENTER>B</TD><TD ALIGN=CENTER>asx</TD></TR>
207	<TR><TD ALIGN=CENTER>C</TD><TD ALIGN=CENTER>cys</TD></TR>
208	<TR><TD ALIGN=CENTER>D</TD><TD ALIGN=CENTER>asp</TD></TR>
209	<TR><TD ALIGN=CENTER>E</TD><TD ALIGN=CENTER>glu</TD></TR>
210	<TR><TD ALIGN=CENTER>F</TD><TD ALIGN=CENTER>phe</TD></TR>
211	<TR><TD ALIGN=CENTER>G</TD><TD ALIGN=CENTER>gly</TD></TR>
212	<TR><TD ALIGN=CENTER>H</TD><TD ALIGN=CENTER>his</TD></TR>
213	<TR><TD ALIGN=CENTER>I</TD><TD ALIGN=CENTER>ileu</TD></TR>
214	<TR><TD ALIGN=CENTER>J</TD><TD ALIGN=CENTER>(not used)</TD></TR>
215	<TR><TD ALIGN=CENTER>K</TD><TD ALIGN=CENTER>lys</TD></TR>
216	<TR><TD ALIGN=CENTER>L</TD><TD ALIGN=CENTER>leu</TD></TR>
217	<TR><TD ALIGN=CENTER>M</TD><TD ALIGN=CENTER>met</TD></TR>
218	<TR><TD ALIGN=CENTER>N</TD><TD ALIGN=CENTER>asn</TD></TR>
219	<TR><TD ALIGN=CENTER>O</TD><TD ALIGN=CENTER>(not used)</TD></TR>
220	<TR><TD ALIGN=CENTER>P</TD><TD ALIGN=CENTER>pro</TD></TR>
221	<TR><TD ALIGN=CENTER>Q</TD><TD ALIGN=CENTER>gln</TD></TR>
222	<TR><TD ALIGN=CENTER>R</TD><TD ALIGN=CENTER>arg</TD></TR>
223	<TR><TD ALIGN=CENTER>S</TD><TD ALIGN=CENTER>ser</TD></TR>
224	<TR><TD ALIGN=CENTER>T</TD><TD ALIGN=CENTER>thr</TD></TR>
225	<TR><TD ALIGN=CENTER>U</TD><TD ALIGN=CENTER>(not used)</TD></TR>
226	<TR><TD ALIGN=CENTER>V</TD><TD ALIGN=CENTER>val</TD></TR>
227	<TR><TD ALIGN=CENTER>W</TD><TD ALIGN=CENTER>trp</TD></TR>
228	<TR><TD ALIGN=CENTER>X</TD><TD ALIGN=CENTER>unknown amino acid</TD></TR>
229	<TR><TD ALIGN=CENTER>Y</TD><TD ALIGN=CENTER>tyr</TD></TR>
230	<TR><TD ALIGN=CENTER>Z</TD><TD ALIGN=CENTER>glx</TD></TR>
231	<TR><TD ALIGN=CENTER>*</TD><TD ALIGN=CENTER>nonsense (stop)</TD></TR>
232	<TR><TD ALIGN=CENTER>?</TD><TD ALIGN=CENTER>unknown amino acid or deletion</TD></TR>
233	<TR><TD ALIGN=CENTER>-</TD><TD ALIGN=CENTER>deletion</TD></TR>
234	</TABLE>
235	</DIV>
236	<P>
237	where "nonsense" and "unknown" mean, respectively, a nonsense (chain
238	termination) codon and an amino acid whose identity has not been
239	determined. The state "asx" means "either asn or asp",
240	the state "glx" means "either gln or glu", and the state "deletion"
241	means that alignment studies indicate a deletion has happened in the
242	ancestry of this position, so that it is no longer present. Note that
243	if two polypeptide chains are being used that are of different length
244	owing to one terminating before the other, they can be coded as (say)
245	<PRE>
246	HIINMA*????
247	HIPNMGVWABT
248	</PRE>
249	since after the stop codon we do not definitely know that
250	there has been a deletion, and do not know what amino acid would
251	have been there. If DNA studies tell us that there is
252	DNA sequence in that region, then we could use "X" rather than "?". Note
253	that "X" means an unknown amino acid, but definitely an amino acid,
254	while "?" could mean either that or a deletion. Otherwise one will usually
255	want to use "?" after a stop codon, if one does not know what amino acid is
256	there. If the DNA sequence has been observed there, one probably ought to
257	resist putting in the amino acids that this DNA would code for, and one should
258	use "X" instead, because under the assumptions implicit in either the
259	parsimony or the distance
260	methods, changes to any noncoding sequence are much easier than
261	changes in a coding region that change the amino acid.
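<P>
The padding of a terminated chain shown above can be automated. This is a
minimal Python sketch, assuming the shorter chain already ends with its stop
codon symbol; the function name <TT>pad_terminated</TT> and its parameters
are hypothetical and do not correspond to anything in the programs
themselves.

```python
def pad_terminated(seq, length, fill="?"):
    """Pad a chain ending in '*' out to `length` positions.

    Use fill="?" when a deletion cannot be ruled out, or fill="X" when
    DNA sequence is known to be present in that region.
    """
    if fill not in ("?", "X"):
        raise ValueError("fill must be '?' or 'X'")
    if not seq.endswith("*"):
        raise ValueError("sequence does not end with a stop codon")
    if len(seq) > length:
        raise ValueError("sequence is longer than the requested length")
    return seq + fill * (length - len(seq))


padded = pad_terminated("HIINMA*", len("HIPNMGVWABT"))
```

Here <TT>padded</TT> is the first line of the example above.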
262	<P>
263	Here are the same one-letter codes tabulated the other way 'round:
264	<P>
265	<DIV ALIGN=CENTER>
266	<TABLE>
267	<TR><TD ALIGN=CENTER><B>Amino acid</B></TD><TD ALIGN=CENTER><B>One-letter code</B></TD></TR>
268	<TR><TD ALIGN=CENTER></TD><TD ALIGN=CENTER></TD></TR>
269	<TR><TD ALIGN=CENTER>ala</TD><TD ALIGN=CENTER>A</TD></TR>
270	<TR><TD ALIGN=CENTER>arg</TD><TD ALIGN=CENTER>R</TD></TR>
271	<TR><TD ALIGN=CENTER>asn</TD><TD ALIGN=CENTER>N</TD></TR>
272	<TR><TD ALIGN=CENTER>asp</TD><TD ALIGN=CENTER>D</TD></TR>
273	<TR><TD ALIGN=CENTER>asx</TD><TD ALIGN=CENTER>B</TD></TR>
274	<TR><TD ALIGN=CENTER>cys</TD><TD ALIGN=CENTER>C</TD></TR>
275	<TR><TD ALIGN=CENTER>gln</TD><TD ALIGN=CENTER>Q</TD></TR>
276	<TR><TD ALIGN=CENTER>glu</TD><TD ALIGN=CENTER>E</TD></TR>
277	<TR><TD ALIGN=CENTER>gly</TD><TD ALIGN=CENTER>G</TD></TR>
278	<TR><TD ALIGN=CENTER>glx</TD><TD ALIGN=CENTER>Z</TD></TR>
279	<TR><TD ALIGN=CENTER>his</TD><TD ALIGN=CENTER>H</TD></TR>
280	<TR><TD ALIGN=CENTER>ileu</TD><TD ALIGN=CENTER>I</TD></TR>
281	<TR><TD ALIGN=CENTER>leu</TD><TD ALIGN=CENTER>L</TD></TR>
282	<TR><TD ALIGN=CENTER>lys</TD><TD ALIGN=CENTER>K</TD></TR>
283	<TR><TD ALIGN=CENTER>met</TD><TD ALIGN=CENTER>M</TD></TR>
284	<TR><TD ALIGN=CENTER>phe</TD><TD ALIGN=CENTER>F</TD></TR>
285	<TR><TD ALIGN=CENTER>pro</TD><TD ALIGN=CENTER>P</TD></TR>
286	<TR><TD ALIGN=CENTER>ser</TD><TD ALIGN=CENTER>S</TD></TR>
287	<TR><TD ALIGN=CENTER>thr</TD><TD ALIGN=CENTER>T</TD></TR>
288	<TR><TD ALIGN=CENTER>trp</TD><TD ALIGN=CENTER>W</TD></TR>
289	<TR><TD ALIGN=CENTER>tyr</TD><TD ALIGN=CENTER>Y</TD></TR>
290	<TR><TD ALIGN=CENTER>val</TD><TD ALIGN=CENTER>V</TD></TR>
291	<TR><TD ALIGN=CENTER>deletion</TD><TD ALIGN=CENTER>-</TD></TR>
292	<TR><TD ALIGN=CENTER>nonsense (stop)</TD><TD ALIGN=CENTER>*</TD></TR>
293	<TR><TD ALIGN=CENTER>unknown amino acid</TD><TD ALIGN=CENTER>X</TD></TR>
294	<TR><TD ALIGN=CENTER>unknown (incl. deletion)</TD><TD ALIGN=CENTER>?</TD></TR>
295	</TABLE>
296	</DIV>
297	<P>
298	<H2>THE OPTIONS</H2>
299	<P>
300	The programs allow options chosen from their menus. Many of these are as described in the
301	main documentation file, particularly the options J, O, U, T, W,
302	and Y (although T has a different meaning in the programs DNAML and
303	DNADIST than in the others).
304	<P>
305	The U option indicates that
306	user-defined trees are provided at the end of the input file. This
307	happens in the usual way, except that for PROTPARS, DNAPARS, DNACOMP, and
308	DNAMLK, the trees must be strictly
309	bifurcating, containing only two-way splits, e.g.: ((A,B),(C,(D,E)));. For
310	DNAML and RESTML the tree must have a trifurcation at its base,
311	e.g.: ((A,B),C,(D,E));. The
312	root of the tree may in those cases be placed arbitrarily, since the trees
313	needed are actually unrooted, though they look different when printed out. The
314	program RETREE should enable you to reroot the trees without having to
315	hand-edit or retype them. For
316	DNAMOVE the U option is not available (although
317	there is an equivalent feature which uses rooted user trees).
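<P>
Whether a user tree has the required shape can be checked before running a
program. The Python sketch below is a rough illustration, not code from the
package: the function name <TT>split_sizes</TT> is hypothetical, and it
assumes well-formed trees whose species names contain no parentheses or
commas.

```python
def split_sizes(tree):
    """Return the number of branches at each fork, the base of the tree last."""
    sizes, stack = [], []
    for ch in tree:
        if ch == "(":
            stack.append(1)         # a new fork starts with one branch
        elif ch == ",":
            stack[-1] += 1          # each comma adds a branch to the open fork
        elif ch == ")":
            sizes.append(stack.pop())
    return sizes


bifurcating = split_sizes("((A,B),(C,(D,E)));")
trifurcating = split_sizes("((A,B),C,(D,E));")
```

A strictly bifurcating tree gives only 2's, while a tree suitable for DNAML
or RESTML gives a 3 as its last entry and 2's elsewhere.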
318	<P>
319	A feature of the nucleotide sequence programs other than DNAMOVE
320	is that they save time and computer memory space by recognizing sites
321	at which the pattern of bases is the same, and doing their computation only
322	once. Thus if we have only four species but a large number of sites, there
323	are (ignoring ambiguous bases) only about 256 different patterns of
324	nucleotides (4 x 4 x 4 x 4) that can occur. The programs automatically
325	count how many occurrences there are of each and then only need to do as much
326	computation as would be
327	needed with 256 sites, even though the number of sites is actually much
328	larger. If there are ambiguities (such as Y or R nucleotides), these are also
329	handled correctly, and do not cause trouble. The programs store the full
330	sequences but reserve other space for bookkeeping only for the distinct
331	patterns. This saves space. Thus the programs will run very effectively
332	with few species and many sites. On larger numbers of species,
333	if rates of evolution are small, many of the sites will be invariant
334	(such as having all A's) and thus will mostly have one of four patterns. The
335	programs will in this way automatically avoid doing duplicate
336	computations for such sites.
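<P>
The idea of computing once per distinct site pattern can be sketched in a
few lines of Python. This illustrates the bookkeeping only and is not the
programs' actual implementation; the function name <TT>site_patterns</TT>
and the sample data are hypothetical.

```python
from collections import Counter


def site_patterns(seqs):
    """Count how often each column (site pattern) occurs in aligned sequences."""
    # zip(*seqs) walks the alignment site by site, yielding one tuple of
    # bases per site; identical tuples are tallied together.
    return Counter(zip(*seqs))


# Four species, four sites: sites 1, 2 and 4 share the same pattern.
patterns = site_patterns(["AAGA", "AAGA", "CCTC", "CCTC"])
```

Any per-site quantity then needs to be computed only once per key of
<TT>patterns</TT> and weighted by its count.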
337	</BODY>
338	</HTML>
