source: trunk/GDE/PHYLIP/doc/sequence.html

Last change on this file was 2176, checked in by westram, 21 years ago

1<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 3.2 Final//EN">
2<HTML>
3<HEAD>
4<TITLE>sequence</TITLE>
5<META NAME="description" CONTENT="sequence">
6<META NAME="keywords" CONTENT="sequence">
7<META NAME="resource-type" CONTENT="document">
8<META NAME="distribution" CONTENT="global">
9<META HTTP-EQUIV="Content-Type" CONTENT="text/html; charset=iso-8859-1">
10</HEAD>
11<BODY BGCOLOR="#ccffff">
12<DIV ALIGN=RIGHT>
13version 3.6
14</DIV>
15<DIV ALIGN=CENTER>
16<H1>Molecular Sequence Programs</H1>
17</DIV>
18<P>
19(c) Copyright 1986-2000 by The University of
20Washington.  Written by Joseph Felsenstein.  Permission is granted to copy
21this document provided that no fee is charged for it and that this copyright
22notice is not removed.
23<P>
24These programs estimate phylogenies from protein
25sequence or nucleic acid sequence data.  PROTPARS uses a parsimony method
26intermediate between Eck and Dayhoff's
27method (1966) of allowing transitions between all amino acids and counting
28those, and Fitch's (1971) method of counting the number of nucleotide changes
29that would be needed to evolve the protein sequence.  DNAPARS uses the
30parsimony method allowing changes between all bases
31and counting the number of those.  DNAMOVE is an interactive parsimony
32program allowing the user to rearrange trees by hand and see where
33character states change.  DNAPENNY
34uses the branch-and-bound method to search for all most
35parsimonious trees in the nucleic acid sequence case.  DNACOMP
36adapts to nucleotide sequences the compatibility (largest clique)
37approach.  DNAINVAR does not directly estimate a phylogeny, but computes Lake's
38(1987) and Cavender's (Cavender and Felsenstein, 1987) phylogenetic invariants,
39which are quantities whose values depend on the phylogeny.  DNAML does a
40maximum likelihood estimate of the phylogeny (Felsenstein, 1981a).  DNAMLK
41is similar to DNAML but assumes a molecular clock.  DNADIST
42computes distance measures between pairs of species from nucleotide sequences,
43distances that can then be used by the distance matrix programs FITCH and
44KITSCH. RESTML does a maximum likelihood estimate from restriction
45sites data.   SEQBOOT allows you to read in a data set and then produce
46multiple data sets from it by bootstrapping, delete-half jackknifing, or
47by permuting within sites.  This
48then allows most of these methods to be bootstrapped or jackknifed, and
49for the Permutation Tail Probability Test of Archie (1989) and Faith and
50Cranston (1991) to be carried out.
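<P>
The three kinds of resampling mentioned for SEQBOOT can be pictured with a short Python sketch.  It is only an illustration and not SEQBOOT's own code: the function names, the use of Python's random module, and the representation of a data set as a list of equal-length strings (one per species) are all assumptions made here.

```python
import random

def bootstrap_sites(seqs, rng):
    """Draw as many sites as the original has, sampling with replacement."""
    nsites = len(seqs[0])
    picks = [rng.randrange(nsites) for _ in range(nsites)]
    return ["".join(s[i] for i in picks) for s in seqs]

def jackknife_half(seqs, rng):
    """Keep a random half of the sites, sampled without replacement."""
    nsites = len(seqs[0])
    keep = sorted(rng.sample(range(nsites), nsites // 2))
    return ["".join(s[i] for i in keep) for s in seqs]

def permute_within_sites(seqs, rng):
    """Shuffle the states among species separately at each site."""
    columns = [list(col) for col in zip(*seqs)]
    for col in columns:
        rng.shuffle(col)
    return ["".join(row) for row in zip(*columns)]
```

Each function returns one resampled data set; calling it repeatedly with the same random.Random object would give the multiple data sets that the other programs then analyze.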
51<P>
52The input and output format for RESTML is described in
53its document file.  In general its input format is similar to
54those described here, except that the one-letter codes for restriction sites
55are specific to that program and are described in that document file.  Since
56the input formats for the eight DNA sequence and two protein sequence
57programs apply to more than one program, they are described here.  Their
58input formats are standard, making use of the IUPAC standards.
59<P>
60<H2>INTERLEAVED AND SEQUENTIAL FORMATS</H2>
62<P>
63The sequences can continue over multiple lines; when this is done the
64sequences must be either in "interleaved" format, similar to the
65output of alignment programs, or "sequential" format.  These are
66described in the main document file.  In sequential format all
67of one sequence is given, possibly on multiple lines, before the next starts.
68In interleaved format the first part of the file should contain the first
69part of each of the sequences, then possibly a line containing nothing
70but a carriage-return character, then the second part of each sequence,
71and so on.  Only the first parts of the sequences should be preceded by
72names.  Here is a hypothetical example of interleaved format:
73<P>
74<TABLE><TR><TD BGCOLOR=white>
75<PRE>
76  5    42
77Turkey    AAGCTNGGGC ATTTCAGGGT
78Salmo gairAAGCCTTGGC AGTGCAGGGT
79H. SapiensACCGGTTGGC CGTTCAGGGT
80Chimp     AAACCCTTGC CGTTACGCTT
81Gorilla   AAACCCTTGC CGGTACGCTT
82
83GAGCCCGGGC AATACAGGGT AT
84GAGCCGTGGC CGGGCACGGT AT
85ACAGGTTGGC CGTTCAGGGT AA
86AAACCGAGGC CGGGACACTC AT
87AAACCATTGC CGGTACGCTT AA
88</PRE>
89</TD></TR></TABLE>
90<P>
91while in sequential format the same sequences would be:
92<P>
93<TABLE><TR><TD BGCOLOR=white>
94<PRE>
95  5    42
96Turkey    AAGCTNGGGC ATTTCAGGGT
97GAGCCCGGGC AATACAGGGT AT
98Salmo gairAAGCCTTGGC AGTGCAGGGT
99GAGCCGTGGC CGGGCACGGT AT
100H. SapiensACCGGTTGGC CGTTCAGGGT
101ACAGGTTGGC CGTTCAGGGT AA
102Chimp     AAACCCTTGC CGTTACGCTT
103AAACCGAGGC CGGGACACTC AT
104Gorilla   AAACCCTTGC CGGTACGCTT
105AAACCATTGC CGGTACGCTT AA
106</PRE>
107</TD></TR></TABLE>
108<P>
109Note, of course, that a portion of a sequence like this:
110<P>
111   300   AAGCGTGAAC GTTGTACTAA TRCAG
112<P>
113is perfectly legal, assuming that the species name has gone before, and is
114filled out to full length by blanks.  The above
115digits and blanks will be ignored, the sequence being taken as starting
116at the first base symbol (in this case an A).  This should enable you to
117use output from many multiple-sequence alignment programs with only
118minimal editing.
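<P>
To make the two layouts concrete, here is a minimal Python sketch of readers for them.  It is not the code the programs use.  The function names are hypothetical, and the sketch assumes a ten-character name field, that blanks and digits in the sequence part are ignored (as described above), and that any options on the first line can be disregarded.

```python
def clean(chunk):
    """Drop blanks and digits from a piece of sequence text."""
    return "".join(c for c in chunk if not (c.isspace() or c.isdigit()))

def parse_header(line):
    """Return (number of species, number of sites) from the first line."""
    nspecies, nsites = line.split()[:2]
    return int(nspecies), int(nsites)

def read_interleaved(text):
    lines = text.splitlines()
    nspecies, nsites = parse_header(lines[0])
    # Blank separator lines between groups carry no information.
    body = [ln for ln in lines[1:] if ln.strip()]
    names, seqs = [], []
    for i, ln in enumerate(body):
        if i < nspecies:
            # Only the first group of lines carries the species names.
            names.append(ln[:10].strip())
            seqs.append(clean(ln[10:]))
        else:
            seqs[i % nspecies] += clean(ln)
    if any(len(s) != nsites for s in seqs):
        raise ValueError("sequences are not properly aligned")
    return dict(zip(names, seqs))

def read_sequential(text):
    lines = [ln for ln in text.splitlines() if ln.strip()]
    nspecies, nsites = parse_header(lines[0])
    result, pos = {}, 1
    for _ in range(nspecies):
        name = lines[pos][:10].strip()
        seq = clean(lines[pos][10:])
        pos += 1
        # Keep reading lines until this species has all of its sites.
        while len(seq) < nsites:
            seq += clean(lines[pos])
            pos += 1
        result[name] = seq
    return result
```

Given the same data in either layout, both functions return the same mapping from species name to full sequence.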
119<P>
120In interleaved format
121the present versions of the programs may sometimes have difficulties with the
122blank lines between groups of lines, and if so you might want to retype
123those lines, making sure that they have only a carriage-return and no blank
124characters on them, or you may perhaps have to eliminate them.  The symptoms
125of this problem are that the programs complain that the sequences are not
126properly aligned, and you can find no other cause for this complaint.
127<P>
128<H2>INPUT FOR THE DNA SEQUENCE PROGRAMS</H2>
129<P>
130The input format for the DNA sequence programs is
131standard: the data have A's, G's, C's and T's (or U's).  The first line of the
132input file contains the number of species and the number of sites.  As
133with the other programs, options information may follow this.  Following this,
134each species starts on a new line.  The first 10
135characters of that line are the species name.  There then follows
136the base sequence of that species, each character
137being one of the letters A, B, C, D, G, H, K, M, N, O, R, S, T, U, V,
138W, X, Y, ?, or - (a period was also previously allowed but it is no longer
139allowed, because it sometimes is used in different senses in other
140programs).  Blanks will be ignored, and so will numerical
141digits.  This allows GENBANK and EMBL sequence entries to be read with
142minimum editing.
143<P>
144These characters can be either upper or lower case.  The programs
145convert all input characters to upper case, which is how they
146are then treated.  The characters constitute the IUPAC (IUB) nucleic acid code
147plus some slight
148extensions.  They enable input of nucleic acid sequences taking full account
149of any ambiguities in the sequence.
150<P>
151<DIV ALIGN=CENTER>
152<TABLE BORDER=0>
153<TR><TD ALIGN=LEFT><B>Symbol</B></TD><TD><B>Meaning</B></TD><TD></TD></TR>
154<TR><TD></TD><TD></TD><TD></TD></TR>
155<TR><TD>A</TD><TD>Adenine</TD><TD></TD></TR>
156<TR><TD>G</TD><TD>Guanine</TD><TD></TD></TR>
157<TR><TD>C</TD><TD>Cytosine</TD><TD></TD></TR>
158<TR><TD>T</TD><TD>Thymine</TD><TD></TD></TR>
159<TR><TD>U</TD><TD>Uracil</TD><TD></TD></TR>
160<TR><TD>Y</TD><TD>pYrimidine</TD><TD>(C or T)</TD></TR>
161<TR><TD>R</TD><TD>puRine</TD><TD>(A or G)</TD></TR>
162<TR><TD>W</TD><TD>"Weak"</TD><TD>(A or T)</TD></TR>
163<TR><TD>S</TD><TD>"Strong"</TD><TD>(C or G)</TD></TR>
164<TR><TD>K</TD><TD>"Keto"</TD><TD>(T or G)</TD></TR>
165<TR><TD>M</TD><TD>"aMino"</TD><TD>(C or A)</TD></TR>
166<TR><TD>B</TD><TD>not A</TD><TD>(C or G or T)</TD></TR>
167<TR><TD>D</TD><TD>not C</TD><TD>(A or G or T)</TD></TR>
168<TR><TD>H</TD><TD>not G</TD><TD>(A or C or T)</TD></TR>
169<TR><TD>V</TD><TD>not T</TD><TD>(A or C or G)</TD></TR>
170<TR><TD>X,N,?</TD><TD>unknown</TD><TD>(A or C or G or T)</TD></TR>
171<TR><TD>O</TD><TD>deletion</TD><TD></TD></TR>
172<TR><TD>-</TD><TD>deletion</TD><TD></TD></TR>
173</TABLE>
174</DIV>
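<P>
The table above can be written down as a lookup from each symbol to the set of states it allows.  The following Python sketch is one way to do that and is not taken from the programs themselves.  Treating U as equivalent to T, and representing a deletion as a separate state written "-", are assumptions of this sketch, as are the names IUPAC, base_set and compatible.

```python
# Each symbol maps to the states it may stand for.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "U": "T",                       # assumption: uracil is treated like thymine
    "Y": "CT", "R": "AG", "W": "AT", "S": "CG", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG",
    "X": "ACGT", "N": "ACGT", "?": "ACGT",
    "O": "-", "-": "-",             # assumption: deletion kept as its own state
}

def base_set(symbol):
    """Return the set of states allowed by one symbol (either case)."""
    return frozenset(IUPAC[symbol.upper()])

def compatible(a, b):
    """True if two symbols could stand for the same state."""
    return bool(base_set(a) & base_set(b))
```

For example, compatible("R", "A") is true because a purine may be an A, while compatible("Y", "R") is false because no base is both a pyrimidine and a purine.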
175<P>
176<H2>INPUT FOR THE PROTEIN SEQUENCE PROGRAMS</H2>
177<P>
178The input for the protein sequence programs is fairly standard.  The first
179line contains the
180number of species and the number of amino acid positions (counting any
181stop codons that you want to include).  These are followed on the same line
182by the options.  The only options which
183need information in the input file are U (User Tree) and W (Weights).  They are
184as described in the main documentation file.  If the W (Weights) option is
185used there must be a W in the first line of the input file.
186<P>
187Next come the species data.  Each
188sequence starts on a new line, has a ten-character species name
189that must be blank-filled to be of that length, followed immediately
190by the species data in the one-letter code.  The sequences must be in
191either the "interleaved" or the "sequential" format.  The I option
192selects between them.  The sequences can have internal
193blanks but there must be no extra blanks at the end of a
194line.  Note that a blank is not a valid symbol for a deletion.
195<P>
196The protein sequences are given by the one-letter code used by
197the late Margaret Dayhoff's group in the Atlas of Protein Sequences,
198and consistent with the IUB standard abbreviations.
199In the present version it is:
200<P>
201<DIV ALIGN=CENTER>
202<TABLE>
203<TR><TD ALIGN=CENTER><B>Symbol</B></TD><TD ALIGN=CENTER><B>Stands for</B></TD></TR>
204<TR><TD ALIGN=CENTER></TD><TD ALIGN=CENTER></TD></TR>
205<TR><TD ALIGN=CENTER>A</TD><TD ALIGN=CENTER>ala</TD></TR>
206<TR><TD ALIGN=CENTER>B</TD><TD ALIGN=CENTER>asx</TD></TR>
207<TR><TD ALIGN=CENTER>C</TD><TD ALIGN=CENTER>cys</TD></TR>
208<TR><TD ALIGN=CENTER>D</TD><TD ALIGN=CENTER>asp</TD></TR>
209<TR><TD ALIGN=CENTER>E</TD><TD ALIGN=CENTER>glu</TD></TR>
210<TR><TD ALIGN=CENTER>F</TD><TD ALIGN=CENTER>phe</TD></TR>
211<TR><TD ALIGN=CENTER>G</TD><TD ALIGN=CENTER>gly</TD></TR>
212<TR><TD ALIGN=CENTER>H</TD><TD ALIGN=CENTER>his</TD></TR>
213<TR><TD ALIGN=CENTER>I</TD><TD ALIGN=CENTER>ileu</TD></TR>
214<TR><TD ALIGN=CENTER>J</TD><TD ALIGN=CENTER>(not used)</TD></TR>
215<TR><TD ALIGN=CENTER>K</TD><TD ALIGN=CENTER>lys</TD></TR>
216<TR><TD ALIGN=CENTER>L</TD><TD ALIGN=CENTER>leu</TD></TR>
217<TR><TD ALIGN=CENTER>M</TD><TD ALIGN=CENTER>met</TD></TR>
218<TR><TD ALIGN=CENTER>N</TD><TD ALIGN=CENTER>asn</TD></TR>
219<TR><TD ALIGN=CENTER>O</TD><TD ALIGN=CENTER>(not used)</TD></TR>
220<TR><TD ALIGN=CENTER>P</TD><TD ALIGN=CENTER>pro</TD></TR>
221<TR><TD ALIGN=CENTER>Q</TD><TD ALIGN=CENTER>gln</TD></TR>
222<TR><TD ALIGN=CENTER>R</TD><TD ALIGN=CENTER>arg</TD></TR>
223<TR><TD ALIGN=CENTER>S</TD><TD ALIGN=CENTER>ser</TD></TR>
224<TR><TD ALIGN=CENTER>T</TD><TD ALIGN=CENTER>thr</TD></TR>
225<TR><TD ALIGN=CENTER>U</TD><TD ALIGN=CENTER>(not used)</TD></TR>
226<TR><TD ALIGN=CENTER>V</TD><TD ALIGN=CENTER>val</TD></TR>
227<TR><TD ALIGN=CENTER>W</TD><TD ALIGN=CENTER>trp</TD></TR>
228<TR><TD ALIGN=CENTER>X</TD><TD ALIGN=CENTER>unknown amino acid</TD></TR>
229<TR><TD ALIGN=CENTER>Y</TD><TD ALIGN=CENTER>tyr</TD></TR>
230<TR><TD ALIGN=CENTER>Z</TD><TD ALIGN=CENTER>glx</TD></TR>
231<TR><TD ALIGN=CENTER>*</TD><TD ALIGN=CENTER>nonsense (stop)</TD></TR>
232<TR><TD ALIGN=CENTER>?</TD><TD ALIGN=CENTER>unknown amino acid or deletion</TD></TR>
233<TR><TD ALIGN=CENTER>-</TD><TD ALIGN=CENTER>deletion</TD></TR>
234</TABLE>
235</DIV>
236<P>
237where "nonsense" and "unknown" mean respectively a nonsense (chain
238termination) codon and an amino acid whose identity has not been
239determined.  The state "asx" means "either asn or asp",
240the state "glx" means "either gln or glu", and the state "deletion"
241means that alignment studies indicate a deletion has happened in the
242ancestry of this position, so that it is no longer present.  Note that
243if two polypeptide chains are being used that are of different length
244owing to one terminating before the other, they can be coded as (say)
245<PRE>
246             HIINMA*????
247             HIPNMGVWABT
248</PRE>
249since after the stop codon we do not definitely know that
250there has been a deletion, and do not know what amino acid would
251have been there.  If DNA studies tell us that there is
252DNA sequence in that region, then we could use "X" rather than "?".  Note
253that "X" means an unknown amino acid, but definitely an amino acid,
254while "?" could mean either that or a deletion.  Otherwise one will usually
255want to use "?" after a stop codon, if one does not know what amino acid is
256there.  If the DNA sequence has been observed there, one probably ought to
257resist putting in the amino acids that this DNA would code for, and one should
258use "X" instead, because under the assumptions implicit in either the
259parsimony or the distance
260methods, changes to any noncoding sequence are much easier than
261changes in a coding region that change the amino acid.
262<P>
263Here are the same one-letter codes tabulated the other way 'round:
264<P>
265<DIV ALIGN=CENTER>
266<TABLE>
267<TR><TD ALIGN=CENTER><B>Amino acid</B></TD><TD ALIGN=CENTER><B>One-letter code</B></TD></TR>
268<TR><TD ALIGN=CENTER></TD><TD ALIGN=CENTER></TD></TR>
269<TR><TD ALIGN=CENTER>ala</TD><TD ALIGN=CENTER>A</TD></TR>
270<TR><TD ALIGN=CENTER>arg</TD><TD ALIGN=CENTER>R</TD></TR>
271<TR><TD ALIGN=CENTER>asn</TD><TD ALIGN=CENTER>N</TD></TR>
272<TR><TD ALIGN=CENTER>asp</TD><TD ALIGN=CENTER>D</TD></TR>
273<TR><TD ALIGN=CENTER>asx</TD><TD ALIGN=CENTER>B</TD></TR>
274<TR><TD ALIGN=CENTER>cys</TD><TD ALIGN=CENTER>C</TD></TR>
275<TR><TD ALIGN=CENTER>gln</TD><TD ALIGN=CENTER>Q</TD></TR>
276<TR><TD ALIGN=CENTER>glu</TD><TD ALIGN=CENTER>E</TD></TR>
277<TR><TD ALIGN=CENTER>gly</TD><TD ALIGN=CENTER>G</TD></TR>
278<TR><TD ALIGN=CENTER>glx</TD><TD ALIGN=CENTER>Z</TD></TR>
279<TR><TD ALIGN=CENTER>his</TD><TD ALIGN=CENTER>H</TD></TR>
280<TR><TD ALIGN=CENTER>ileu</TD><TD ALIGN=CENTER>I</TD></TR>
281<TR><TD ALIGN=CENTER>leu</TD><TD ALIGN=CENTER>L</TD></TR>
282<TR><TD ALIGN=CENTER>lys</TD><TD ALIGN=CENTER>K</TD></TR>
283<TR><TD ALIGN=CENTER>met</TD><TD ALIGN=CENTER>M</TD></TR>
284<TR><TD ALIGN=CENTER>phe</TD><TD ALIGN=CENTER>F</TD></TR>
285<TR><TD ALIGN=CENTER>pro</TD><TD ALIGN=CENTER>P</TD></TR>
286<TR><TD ALIGN=CENTER>ser</TD><TD ALIGN=CENTER>S</TD></TR>
287<TR><TD ALIGN=CENTER>thr</TD><TD ALIGN=CENTER>T</TD></TR>
288<TR><TD ALIGN=CENTER>trp</TD><TD ALIGN=CENTER>W</TD></TR>
289<TR><TD ALIGN=CENTER>tyr</TD><TD ALIGN=CENTER>Y</TD></TR>
290<TR><TD ALIGN=CENTER>val</TD><TD ALIGN=CENTER>V</TD></TR>
291<TR><TD ALIGN=CENTER>deletion</TD><TD ALIGN=CENTER>-</TD></TR>
292<TR><TD ALIGN=CENTER>nonsense (stop)</TD><TD ALIGN=CENTER>*</TD></TR>
293<TR><TD ALIGN=CENTER>unknown amino acid</TD><TD ALIGN=CENTER>X</TD></TR>
294<TR><TD ALIGN=CENTER>unknown (incl. deletion)</TD><TD ALIGN=CENTER>?</TD></TR>
295</TABLE>
296</DIV>
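<P>
As an illustration of the coding rules above, here is a small Python sketch that checks a protein sequence against the one-letter code and pads a chain that ends at a stop codon.  The names AMINO_CODES, invalid_symbols and pad_after_stop are hypothetical, and accepting lower-case letters is an assumption of this sketch, not something stated for the protein programs.

```python
# The one-letter code tabulated above; J, O and U are not used.
AMINO_CODES = {
    "A": "ala", "B": "asx", "C": "cys", "D": "asp", "E": "glu",
    "F": "phe", "G": "gly", "H": "his", "I": "ileu", "K": "lys",
    "L": "leu", "M": "met", "N": "asn", "P": "pro", "Q": "gln",
    "R": "arg", "S": "ser", "T": "thr", "V": "val", "W": "trp",
    "X": "unknown amino acid", "Y": "tyr", "Z": "glx",
    "*": "nonsense (stop)",
    "?": "unknown amino acid or deletion",
    "-": "deletion",
}

def invalid_symbols(seq):
    """List (position, character) for every symbol outside the code.

    Internal blanks are skipped; they are neither errors nor deletions.
    """
    return [(i, c) for i, c in enumerate(seq)
            if c != " " and c.upper() not in AMINO_CODES]

def pad_after_stop(seq, length, dna_observed=False):
    """Fill a chain that ends early out to the full alignment length.

    Uses "?" by default, or "X" when DNA is known to be present there.
    """
    fill = "X" if dna_observed else "?"
    return seq + fill * (length - len(seq))
```

With these definitions pad_after_stop("HIINMA*", 11) gives the first line of the two-chain example shown earlier.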
297<P>
298<H2>THE OPTIONS</H2>
299<P>
300The programs allow options chosen from their menus.  Many of these are as described in the
301main documentation file, particularly the options J, O, U, T, W,
302and Y (although T has a different meaning in the programs DNAML and
303DNADIST than in the others).
304<P>
305The U option indicates that
306user-defined trees are provided at the end of the input file.  This
307happens in the usual way, except that for PROTPARS, DNAPARS, DNACOMP, and
308DNAMLK, the trees must be strictly
309bifurcating, containing only two-way splits, e.g.: ((A,B),(C,(D,E)));.  For
310DNAML and RESTML each tree must have a trifurcation at its base,
311e.g.: ((A,B),C,(D,E));.  The
312root of the tree may in those cases be placed arbitrarily, since the trees
313needed are actually unrooted, though they look different when printed out.  The
314program RETREE should enable you to reroot the trees without having to
315hand-edit or retype them.  For
316DNAMOVE the U option is not available (although
317there is an equivalent feature which uses rooted user trees).
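<P>
Whether a user tree has the required shape can be checked before running a program.  The Python sketch below counts the branches at each fork of a tree written in the parenthesis notation used above.  It is only a rough check under stated assumptions: species names contain no parentheses or commas, and the function names are hypothetical, not part of any program in the package.

```python
def fork_degrees(tree):
    """Return the number of branches at each fork, with the basal fork last."""
    degrees, stack = [], []
    for ch in tree:
        if ch == "(":
            stack.append(1)         # a new fork starts with one branch
        elif ch == ",":
            stack[-1] += 1          # each comma adds a branch to the open fork
        elif ch == ")":
            degrees.append(stack.pop())
    return degrees

def is_strictly_bifurcating(tree):
    """True if every fork, including the basal one, is a two-way split."""
    degrees = fork_degrees(tree)
    return bool(degrees) and all(d == 2 for d in degrees)

def has_basal_trifurcation(tree):
    """True if the basal fork is three-way and all other forks are two-way."""
    degrees = fork_degrees(tree)
    return (bool(degrees) and degrees[-1] == 3
            and all(d == 2 for d in degrees[:-1]))
```

The first example tree above passes is_strictly_bifurcating, and the second passes has_basal_trifurcation.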
318<P>
319A feature of the nucleotide sequence programs other than DNAMOVE
320is that they save time and computer memory space by recognizing sites
321at which the pattern of bases is the same, and doing their computation only
322once.  Thus if we have only four species but a large number of sites, there
323are (ignoring ambiguous bases) only about 256 different patterns of
324nucleotides (4 x 4 x 4 x 4) that can occur.  The programs automatically
325count how many occurrences there are of each and then need to do only as much
326computation as would be
327needed with 256 sites, even though the number of sites is actually much
328larger.  If there are ambiguities (such as Y or R nucleotides), these are also
329handled correctly, and do not cause trouble.  The programs store the full
330sequences but reserve other space for bookkeeping only for the distinct
331patterns.  This saves space.  Thus the programs will run very effectively
332with few species and many sites.  On larger numbers of species,
333if rates of evolution are small, many of the sites will be invariant
334(such as having all A's) and thus will mostly have one of four patterns.  The
335programs will in this way automatically avoid doing duplicate
336computations for such sites.
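<P>
The idea of collapsing identical sites can be shown in a few lines of Python.  This is a sketch of the principle only, not the bookkeeping the programs actually use; the function name site_patterns and the representation of the data as a list of aligned strings are assumptions made for the illustration.

```python
from collections import Counter

def site_patterns(seqs):
    """Count how often each column (site pattern) occurs in an alignment.

    seqs is a list of equal-length strings, one per species.  Each key of
    the result is a tuple holding the state of every species at one site.
    """
    return Counter(zip(*seqs))

def weighted_total(seqs, score):
    """Apply a per-site score once per distinct pattern, weighted by count."""
    return sum(count * score(pattern)
               for pattern, count in site_patterns(seqs).items())
```

A per-site quantity then needs to be computed only once for each distinct pattern and multiplied by the number of sites showing it, as weighted_total does for any scoring function passed to it.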
337</BODY>
338</HTML>