source: trunk/GDE/PHYLIP/doc/protdist.html

Last change on this file was 2176, checked in by westram, 21 years ago

* empty log message *

  • Property svn:eol-style set to native
  • Property svn:keywords set to Author Date Id Revision
File size: 18.8 KB
Line 
1<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 3.2 Final//EN">
2<HTML>
3<HEAD>
4<TITLE>protdist</TITLE>
5<META NAME="description" CONTENT="protdist">
6<META NAME="keywords" CONTENT="protdist">
7<META NAME="resource-type" CONTENT="document">
8<META NAME="distribution" CONTENT="global">
9<META HTTP-EQUIV="Content-Type" CONTENT="text/html; charset=iso-8859-1">
10</HEAD>
11<BODY BGCOLOR="#ccffff">
12<DIV ALIGN=RIGHT>
13version 3.6
14</DIV>
15<P>
16<DIV ALIGN=CENTER>
17<H1>PROTDIST -- Program to compute distance matrix<BR>
18from protein sequences</H1>
19</DIV>
20<P>
21&#169; Copyright 1993, 2000-2002 by the University of Washington.  Permission
22is granted to copy
23this document provided that no fee is charged for it and that this copyright
24notice is not removed.
25<P>
26This program uses protein sequences to compute a distance matrix, under
27four different models of amino acid replacement. It can also
28compute a table of similarity between the amino acid sequences.
29The distance for each
30pair of species estimates the total branch length between the two species, and
31can be used in the distance matrix programs FITCH, KITSCH or NEIGHBOR.  This
32is an alternative to use of the sequence data itself in the
33parsimony program PROTPARS.
34<P>
35The program reads in protein sequences and writes an output file containing
36the distance matrix or similarity table.  The four models of amino acid
37substitution are
38one which is based on the Jones, Taylor and Thornton (1992) model of
39amino acid change, one based on the PAM matrixes of Margaret Dayhoff, one due to
40Kimura (1983) which approximates it based simply on the fraction of
41similar amino acids, and one based on a model in which the amino acids are
42divided up into groups, with change occurring based on the genetic code
43but with greater difficulty of changing between groups.  The program correctly
44takes into account a variety of sequence ambiguities.
45<P>
46The four methods are:
47<P>
48(1) The Dayhoff PAM matrix.  This uses Dayhoff's PAM 001 matrix from
49Dayhoff (1979), page 348.  The PAM model is an empirical one that
50scales probabilities of change from one amino acid to another in
51terms of a unit which is an expected 1% change between two amino acid
52sequences.  The PAM 001 matrix is used to make a transition probability
53matrix which allows prediction of the probability of changing from any
54one amino acid to any other, and also predicts equilibrium amino acid
55composition.  The program assumes that these probabilities are correct
56and bases its computations of distance on them.  The distance that is
57computed is scaled in units of expected fraction of amino acids
58changed.  This is a unit of 100 PAM's.
59<P>
60(2) The Jones-Taylor-Thornton model.  This is similar to the Dayhoff
61PAM model, except that it is based on a recounting of the number of
62observed changes in amino acids by Jones, Taylor, and Thornton (1992).
63They used a much larger sample of protein sequences than did Dayhoff.
64The distance is scaled in units of the expected fraction of amino acids
65changed (100 PAM's).  Because its sample is so much larger this
66model is to be preferred over the original Dayhoff PAM model.  It
67is the default model in this program.
68<P>
69(3) Kimura's distance.  This is a rough-and-ready distance formula for
70approximating PAM distance by simply measuring the fraction of amino
71acids, p, that differs between two sequences and computing the
72distance as (Kimura, 1983)
73<P>
74&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;D   =   - log<SUB>e</SUB>  ( 1 - p - 0.2 p<SUP>2</SUP> ).
75<P>
76This is very quick to do but has some obvious limitations.  It does not
77take into account which amino acids differ or to what amino acids
78they change, so some information is lost.  The units of the distance
79measure are fraction of amino acids differing, as also in the case of
80the PAM distance.  If the fraction of amino acids differing gets larger
81than 0.8541 the distance becomes infinite.
82<P>
83(4) The Categories distance.  This is my own concoction.  I imagined
84a nucleotide sequence changing according to Kimura's 2-parameter model,
85with the exception that some changes of amino acids are less likely than
86others.  The amino acids are grouped into a series of categories.  Any
87base change that does not change which category the amino acid is in is
88allowed, but if an amino acid changes category this is allowed only a
89certain fraction of the time.  The fraction is called the "ease" and
90there is a parameter for it, which is 1.0 when all changes are allowed
91and near 0.0 when changes between categories are nearly impossible.
92<P>
93In this option I have allowed the user to select the Transition/Transversion
94ratio, which of several genetic codes to use, and which categorization
95of amino acids to use.  There are three of them, a somewhat random sample:
96<P>
97<DL COMPACT>
98<DT>(a)</DT> <DD>The George-Hunt-Barker (1988) classification of amino acids,</DD>
99<DT>(b)</DT> <DD>A classification provided by my colleague Ben Hall when I asked him for one,</DD>
100<DT>(c)</DT> <DD>One I found in an old "baby biochemistry" book (Conn and Stumpf, 1963),
101which contains most of the biochemistry I was ever taught, and all that I ever
102learned.</DD>
103</DL>
104<P>
105Interestingly enough, all of them are consisten with the same linear
106ordering of amino acids, which they divide up in different ways.  For the
107Categories model I have set as default the George/Hunt/Barker classification
108with the "ease" parameter set to 0.457 which is approximately the value
109implied by the empirical rates in the Dayhoff PAM matrix.
110<P>
111The method uses, as I have noted, Kimura's (1980) 2-parameter model of DNA
112change.  The Kimura "2-parameter" model allows
113for a difference between transition and transversion rates.  Its transition
114probability matrix for a short interval of time is:
115<P>
116<PRE>
117              To:     A        G        C        T
118                   ---------------------------------
119               A  | 1-a-2b     a         b       b
120       From:   G  |   a      1-a-2b      b       b
121               C  |   b        b       1-a-2b    a
122               T  |   b        b         a     1-a-2b
123</PRE>
124<P>
125where <EM>a</EM> is <EM>u dt</EM>, the product of the rate of transitions per unit time and <EM>dt</EM> 
126is the length <EM>dt</EM> of the time interval, and <EM>b</EM> is <EM>v dt</EM>, the product of half the
127rate of transversions (i.e., the rate of a specific transversion)
128and the length dt of the time interval.
129<P>
130Each distance that is calculated is an estimate, from that particular pair of
131species, of the divergence time between those two species.  The Kimura
132distance is straightforward to compute.  The other two are considerably
133slower, and they look at all positions, and find that distance which
134makes the likelihood highest.  This likelihood is in effect the length of
135the internal branch in a two-species tree that connects these two
136species.  Its likelihood is just the product, under the model, of the
137probabilities of each position having the (one or) two amino acids that
138are actually found.  This is fairly slow to compute.
139<P>
140The computation proceeds from an eigenanalysis (spectral decomposition)
141of the transition probability matrix.  In the case of the PAM 001 matrix
142the eigenvalues and eigenvectors are precomputed and are hard-coded
143into the program in over 400 statements.  In the case of the Categories
144model the program computes the eigenvalues and eigenvectors itself, which
145will add a delay.  But the delay is independent of the number of species
146as the calculation is done only once, at the outset.
147<P>
148The actual algorithm for estimating the distance is in both cases a
149bisection algorithm which tries to find the point at which the derivative
150os the likelihood is zero.  Some of the kinds of ambiguous amino acids
151like "glx" are correctly taken into account.  However, gaps are treated
152as if they are unkown nucleotides, which means those positions get dropped
153from that particular comparison.  However, they are not dropped from the
154whole analysis.  You need not eliminate regions containing gaps, as long
155as you are reasonably sure of the alignment there.
156<P>
157Note that there is an
158assumption that we are looking at all positions, including those
159that have not changed at all.  It is important not to restrict attention
160to some positions based on whether or not they have changed; doing that
161would bias the distances by making them too large, and that in turn
162would cause the distances
163to misinterpret the meaning of those positions that
164had changed.
165<P>
166The program can now correct distances for unequal rates of change at different
167amino acid positions.   This correction, which was introduced for DNA
168sequences by Jin and Nei (1990), assumes that the distribution of rates
169of change among amino acid positions follows a Gamma distribution.  The
170user is asked for the value of a parameter that determines the amount of
171variation of rates among amino acid positions.  Instead of the more
172widely-known coefficient alpha, PROTDIST uses the coefficient of variation
173(ratio of the standard deviation to the mean) of rates among amino acid
174positions.  .  So if there is 20% variation in rates, the CV is
175is 0.20.  The square of the C.V. is also the reciprocal of the
176better-known "shape parameter", alpha, of the Gamma
177distribution, so in this case the shape parameter alpha = 1/(0.20*0.20)
178= 25.  If you want to achieve a particular value of alpha, such as 10,
179you will want to use a CV of 1/sqrt(100) = 1/10 = 0.1.
180<P>
181In addition to the four distance calculations, the program can also
182compute a table of similarities between amino acid sequences.  These values
183are the fractions of amino acid positions identical between the sequences.
184The diagonal values are 1.0000.  No attempt is made to count similarity
185of nonidentical amino acids, so that no credit is given for having
186(for example) different hydrophobic amino acids at the corresponding
187positions in the two sequences.  This option has been requested by many
188users, who need it for descriptive purposes.  It is not intended that
189the table be used for inferring the tree.
190<P>
191<H2>INPUT FORMAT AND OPTIONS</H2>
192<P>
193Input is fairly standard, with one addition.  As usual the first line of the
194file gives the number of species and the number of sites.  There follows the
195character W if the Weights option is being used.
196<P>
197Next come the species data.  Each
198sequence starts on a new line, has a ten-character species name
199that must be blank-filled to be of that length, followed immediately
200by the species data in the one-letter code.  The sequences must either
201be in the "interleaved" or "sequential" formats
202described in the Molecular Sequence Programs document.  The I option
203selects between them.  The sequences can have internal
204blanks in the sequence but there must be no extra blanks at the end of the
205terminated line.  Note that a blank is not a valid symbol for a deletion.
206<P>
207After that are the lines (if any) containing the information for the
208W option, as described below.
209<P>
210The options are selected using an interactive menu.  The menu looks like this:
211<P>
212<TABLE><TR><TD BGCOLOR=white>
213<PRE>
214
215Protein distance algorithm, version 3.6a3
216
217Settings for this run:
218  P     Use JTT, PAM, Kimura or categories model?  Jones-Taylor-Thornton matrix
219  G  Gamma distribution of rates among positions?  No
220  C           One category of substitution rates?  Yes
221  W                    Use weights for positions?  No
222  M                   Analyze multiple data sets?  No
223  I                  Input sequences interleaved?  Yes
224  0                 Terminal type (IBM PC, ANSI)?  (none)
225  1            Print out the data at start of run  No
226  2          Print indications of progress of run  Yes
227
228Are these settings correct? (type Y or the letter for one to change)
229</PRE>
230</TD></TR></TABLE>
231<P>
232The user either types "Y" (followed, of course, by a carriage-return)
233if the settings shown are to be accepted, or the letter or digit corresponding
234to an option that is to be changed.
235<P>
236The G option chooses Gamma distributed rates of evolution across amino
237acid psoitions.  The program will pronmpt you for the Coefficient of Variation
238of rates.  As is noted above, thi is 1/sqrt(alpha) if alpha is the more
239familiar "shape coefficient" of the Gamma distribution.  If the G option
240is not selected, the program defaults to having no variation of rates
241among sites.
242<P>
243The options M and 0 are the usual ones.  They are described in the
244main documentation file of this package.  Option I is the same as in
245other molecular sequence programs and is described in the documentation file
246for the sequence programs.
247<P>
248The P option selects one of the four distance methods, or the
249similarity table.  It toggles among these
250five methods. The default method, if none is specified, is the
251Jones-Taylor-Thornton model.  If the Categories distance is selected
252another menu option, T, will appear allowing the user
253to supply the Transition/Transversion ratio that should be assumed
254at the underlying DNA level, and another one, C, which allows the
255user to select among various nuclear and mitochondrial genetic codes.i
256The transition/transversion ratio can be any number from 0.5 upwards.
257<P>
258The W (Weights) option is invoked in the usual way, with only weights 0
259and 1 allowed.  It selects a set of sites to be analyzed, ignoring the
260others.  The sites selected are those with weight 1.  If the W option is
261not invoked, all sites are analyzed.
262<P>
263<H2>OUTPUT FORMAT</H2>
264<P>
265As the
266distances are computed, the program prints on your screen or terminal
267the names of the species in turn,
268followed by one dot (".") for each other species for which the distance to
269that species has been computed.  Thus if there are ten species, the first
270species name is printed out, followed by one dot, then on the next line
271the next species name is printed out followed by two dots, then the
272next followed by three dots, and so on.  The pattern of dots should form
273a triangle.  When the distance matrix has been written out to the output
274file, the user is notified of that.
275<P>
276The output file contains on its first line the number of species. The
277distance matrix is then printed in standard
278form, with each species starting on a new line with the species name, followed
279by the distances to the species in order.  These continue onto a new line
280after every nine distances.  The distance matrix is square
281with zero distances on the diagonal.  In general the format of the distance
282matrix is such that it can serve as input to any of the distance matrix
283programs.
284<P>
285If the similarity table is selected, the table that is produced is not
286in a format that can be used as input to the distance matrix programs.
287it has a heading, and the species names are also put at the tops of the
288columns of the table (or rather, the first 8 characters of each species
289name is there, the other two characters omitted to save space).  There
290is not an option to put the table into a format that can be read by
291the distance matrix programs, nor is there one to make it into a table
292of fractions of difference by subtracting the similarity values from 1.
293This is done deliberately to make it more difficult for the use to
294use these values to construct trees.  The similarity values are
295not corrected for multiple changes, and their use to construct trees
296(even after converting them to fractions of difference) would be
297wrong, as it would lead to severe conflict between the distant
298pairs of sequences and the close pairs of sequences.
299<P>
300If the option to print out the data is selected, the output file will
301precede the data by more complete information on the input and the menu
302selections.  The output file begins by giving the number of species and the
303number of characters, and the identity of the distance measure that is
304being used.
305<P>
306In the Categories model of substitution,
307the distances printed out are scaled in terms of expected numbers of
308substitutions, counting both transitions and transversions but not
309replacements of a base by itself, and scaled so that the average rate of
310change is set to 1.0.  For the Dayhoff PAM and Kimura models the
311distance are scaled in terms of the expected numbers of amino acid
312substitutions per site.  Of course, when a branch is twice as
313long this does not mean that there will be twice as much net change expected
314along it, since some of the changes may occur in the same site and overlie or
315even reverse each
316other.  The branch lengths estimates here are in terms of the expected
317underlying numbers of changes.  That means that a branch of length 0.26
318is 26 times as long as one which would show a 1% difference between
319the protein (or nucleotide) sequences at the beginning and end of the
320branch.  But we
321would not expect the sequences at the beginning and end of the branch to be
32226% different, as there would be some overlaying of changes.
323<P>
324One problem that can arise is that two or more of the species can be so
325dissimilar that the distance between them would have to be infinite, as
326the likelihood rises indefinitely as the estimated divergence time
327increases.  For example, with the Kimura model, if the two sequences
328differ in 85.41% or more of their positions then the estimate of divergence
329time would be infinite.  Since there is no way to represent an infinite
330distance in the output file, the program regards this as an error, issues a
331warning message indicating which pair of species are causing the problem, and
332computes a distance of -1.0.
333<P>
334<H2>PROGRAM CONSTANTS</H2>
335<P>
336The constants that are available to be changed by the user at the beginning
337of the program include
338"namelength", the length of species names in
339characters, and "epsilon", a parameter which controls the accuracy of the
340results of the iterations which estimate the distances.  Making "epsilon"
341smaller will increase run times but result in more decimal places of
342accuracy.  This should not be necessary.
343<P>
344The program spends most of its time doing real arithmetic.  Any software or
345hardware changes that speed up that arithmetic will speed it up by a nearly
346proportional amount.
347<P>
348<HR>
349<P>
350<H3>TEST DATA SET</H3>
351<P>
352(Note that although these may look like DNA sequences, they are being
353treated as protein sequences consisting entirely of alanine, cystine,
354glycine, and threonine).
355<P>
356<TABLE><TR><TD BGCOLOR=white>
357<PRE>
358   5   13
359Alpha     AACGTGGCCACAT
360Beta      AAGGTCGCCACAC
361Gamma     CAGTTCGCCACAA
362Delta     GAGATTTCCGCCT
363Epsilon   GAGATCTCCGCCC
364</PRE>
365</TD></TR></TABLE>
366<P>
367<HR>
368<P>
369<H3>CONTENTS OF OUTPUT FILE (with all numerical options on )</H3>
370<P>
371(Note that when the numerical options are not on, the output file produced is
372in the correct format to be used as an input file in the distance matrix
373programs).
374<P>
375<TABLE><TR><TD BGCOLOR=white>
376<PRE>
377
378  Jones-Taylor-Thornton model distance
379
380Name            Sequences
381----            ---------
382
383Alpha        AACGTGGCCA CAT
384Beta         ..G..C.... ..C
385Gamma        C.GT.C.... ..A
386Delta        G.GA.TT..G .C.
387Epsilon      G.GA.CT..G .CC
388
389
390
391Alpha       0.0000  0.3304  0.6257  1.0320  1.3541
392Beta        0.3304  0.0000  0.3756  1.0963  0.6776
393Gamma       0.6257  0.3756  0.0000  0.9758  0.8616
394Delta       1.0320  1.0963  0.9758  0.0000  0.2267
395Epsilon     1.3541  0.6776  0.8616  0.2267  0.0000
396</PRE>
397</TD></TR></TABLE>
398</BODY>
399</HTML>
Note: See TracBrowser for help on using the repository browser.