Context Navigation

protdist.html

Visit:

Last change on this file was 2176, checked in by westram, 22 years ago
* empty log message *
Property svn:eol-style set to `native` Property svn:keywords set to `Author Date Id Revision`
File size: 18.8 KB

Line
1	<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 3.2 Final//EN">
2	<HTML>
3	<HEAD>
4	<TITLE>protdist</TITLE>
5	<META NAME="description" CONTENT="protdist">
6	<META NAME="keywords" CONTENT="protdist">
7	<META NAME="resource-type" CONTENT="document">
8	<META NAME="distribution" CONTENT="global">
9	<META HTTP-EQUIV="Content-Type" CONTENT="text/html; charset=iso-8859-1">
10	</HEAD>
11	<BODY BGCOLOR="#ccffff">
12	<DIV ALIGN=RIGHT>
13	version 3.6
14	</DIV>
15	<P>
16	<DIV ALIGN=CENTER>
17	<H1>PROTDIST -- Program to compute distance matrix<BR>
18	from protein sequences</H1>
19	</DIV>
20	<P>
21	© Copyright 1993, 2000-2002 by the University of Washington. Permission
22	is granted to copy
23	this document provided that no fee is charged for it and that this copyright
24	notice is not removed.
25	<P>
26	This program uses protein sequences to compute a distance matrix, under
27	four different models of amino acid replacement. It can also
28	compute a table of similarity between the amino acid sequences.
29	The distance for each
30	pair of species estimates the total branch length between the two species, and
31	can be used in the distance matrix programs FITCH, KITSCH or NEIGHBOR. This
32	is an alternative to use of the sequence data itself in the
33	parsimony program PROTPARS.
34	<P>
35	The program reads in protein sequences and writes an output file containing
36	the distance matrix or similarity table. The four models of amino acid
37	substitution are
38	one which is based on the Jones, Taylor and Thornton (1992) model of
39	amino acid change, one based on the PAM matrixes of Margaret Dayhoff, one due to
40	Kimura (1983) which approximates it based simply on the fraction of
41	similar amino acids, and one based on a model in which the amino acids are
42	divided up into groups, with change occurring based on the genetic code
43	but with greater difficulty of changing between groups. The program correctly
44	takes into account a variety of sequence ambiguities.
45	<P>
46	The four methods are:
47	<P>
48	(1) The Dayhoff PAM matrix. This uses Dayhoff's PAM 001 matrix from
49	Dayhoff (1979), page 348. The PAM model is an empirical one that
50	scales probabilities of change from one amino acid to another in
51	terms of a unit which is an expected 1% change between two amino acid
52	sequences. The PAM 001 matrix is used to make a transition probability
53	matrix which allows prediction of the probability of changing from any
54	one amino acid to any other, and also predicts equilibrium amino acid
55	composition. The program assumes that these probabilities are correct
56	and bases its computations of distance on them. The distance that is
57	computed is scaled in units of expected fraction of amino acids
58	changed. This is a unit of 100 PAM's.
59	<P>
60	(2) The Jones-Taylor-Thornton model. This is similar to the Dayhoff
61	PAM model, except that it is based on a recounting of the number of
62	observed changes in amino acids by Jones, Taylor, and Thornton (1992).
63	They used a much larger sample of protein sequences than did Dayhoff.
64	The distance is scaled in units of the expected fraction of amino acids
65	changed (100 PAM's). Because its sample is so much larger this
66	model is to be preferred over the original Dayhoff PAM model. It
67	is the default model in this program.
68	<P>
69	(3) Kimura's distance. This is a rough-and-ready distance formula for
70	approximating PAM distance by simply measuring the fraction of amino
71	acids, p, that differs between two sequences and computing the
72	distance as (Kimura, 1983)
73	<P>
74	D = - log<SUB>e</SUB> ( 1 - p - 0.2 p<SUP>2</SUP> ).
75	<P>
76	This is very quick to do but has some obvious limitations. It does not
77	take into account which amino acids differ or to what amino acids
78	they change, so some information is lost. The units of the distance
79	measure are fraction of amino acids differing, as also in the case of
80	the PAM distance. If the fraction of amino acids differing gets larger
81	than 0.8541 the distance becomes infinite.
82	<P>
83	(4) The Categories distance. This is my own concoction. I imagined
84	a nucleotide sequence changing according to Kimura's 2-parameter model,
85	with the exception that some changes of amino acids are less likely than
86	others. The amino acids are grouped into a series of categories. Any
87	base change that does not change which category the amino acid is in is
88	allowed, but if an amino acid changes category this is allowed only a
89	certain fraction of the time. The fraction is called the "ease" and
90	there is a parameter for it, which is 1.0 when all changes are allowed
91	and near 0.0 when changes between categories are nearly impossible.
92	<P>
93	In this option I have allowed the user to select the Transition/Transversion
94	ratio, which of several genetic codes to use, and which categorization
95	of amino acids to use. There are three of them, a somewhat random sample:
96	<P>
97	<DL COMPACT>
98	<DT>(a)</DT> <DD>The George-Hunt-Barker (1988) classification of amino acids,</DD>
99	<DT>(b)</DT> <DD>A classification provided by my colleague Ben Hall when I asked him for one,</DD>
100	<DT>(c)</DT> <DD>One I found in an old "baby biochemistry" book (Conn and Stumpf, 1963),
101	which contains most of the biochemistry I was ever taught, and all that I ever
102	learned.</DD>
103	</DL>
104	<P>
105	Interestingly enough, all of them are consisten with the same linear
106	ordering of amino acids, which they divide up in different ways. For the
107	Categories model I have set as default the George/Hunt/Barker classification
108	with the "ease" parameter set to 0.457 which is approximately the value
109	implied by the empirical rates in the Dayhoff PAM matrix.
110	<P>
111	The method uses, as I have noted, Kimura's (1980) 2-parameter model of DNA
112	change. The Kimura "2-parameter" model allows
113	for a difference between transition and transversion rates. Its transition
114	probability matrix for a short interval of time is:
115	<P>
116	<PRE>
117	To: A G C T
118	---------------------------------
119	A \| 1-a-2b a b b
120	From: G \| a 1-a-2b b b
121	C \| b b 1-a-2b a
122	T \| b b a 1-a-2b
123	</PRE>
124	<P>
125	where <EM>a</EM> is <EM>u dt</EM>, the product of the rate of transitions per unit time and <EM>dt</EM>
126	is the length <EM>dt</EM> of the time interval, and <EM>b</EM> is <EM>v dt</EM>, the product of half the
127	rate of transversions (i.e., the rate of a specific transversion)
128	and the length dt of the time interval.
129	<P>
130	Each distance that is calculated is an estimate, from that particular pair of
131	species, of the divergence time between those two species. The Kimura
132	distance is straightforward to compute. The other two are considerably
133	slower, and they look at all positions, and find that distance which
134	makes the likelihood highest. This likelihood is in effect the length of
135	the internal branch in a two-species tree that connects these two
136	species. Its likelihood is just the product, under the model, of the
137	probabilities of each position having the (one or) two amino acids that
138	are actually found. This is fairly slow to compute.
139	<P>
140	The computation proceeds from an eigenanalysis (spectral decomposition)
141	of the transition probability matrix. In the case of the PAM 001 matrix
142	the eigenvalues and eigenvectors are precomputed and are hard-coded
143	into the program in over 400 statements. In the case of the Categories
144	model the program computes the eigenvalues and eigenvectors itself, which
145	will add a delay. But the delay is independent of the number of species
146	as the calculation is done only once, at the outset.
147	<P>
148	The actual algorithm for estimating the distance is in both cases a
149	bisection algorithm which tries to find the point at which the derivative
150	os the likelihood is zero. Some of the kinds of ambiguous amino acids
151	like "glx" are correctly taken into account. However, gaps are treated
152	as if they are unkown nucleotides, which means those positions get dropped
153	from that particular comparison. However, they are not dropped from the
154	whole analysis. You need not eliminate regions containing gaps, as long
155	as you are reasonably sure of the alignment there.
156	<P>
157	Note that there is an
158	assumption that we are looking at all positions, including those
159	that have not changed at all. It is important not to restrict attention
160	to some positions based on whether or not they have changed; doing that
161	would bias the distances by making them too large, and that in turn
162	would cause the distances
163	to misinterpret the meaning of those positions that
164	had changed.
165	<P>
166	The program can now correct distances for unequal rates of change at different
167	amino acid positions. This correction, which was introduced for DNA
168	sequences by Jin and Nei (1990), assumes that the distribution of rates
169	of change among amino acid positions follows a Gamma distribution. The
170	user is asked for the value of a parameter that determines the amount of
171	variation of rates among amino acid positions. Instead of the more
172	widely-known coefficient alpha, PROTDIST uses the coefficient of variation
173	(ratio of the standard deviation to the mean) of rates among amino acid
174	positions. . So if there is 20% variation in rates, the CV is
175	is 0.20. The square of the C.V. is also the reciprocal of the
176	better-known "shape parameter", alpha, of the Gamma
177	distribution, so in this case the shape parameter alpha = 1/(0.20*0.20)
178	= 25. If you want to achieve a particular value of alpha, such as 10,
179	you will want to use a CV of 1/sqrt(100) = 1/10 = 0.1.
180	<P>
181	In addition to the four distance calculations, the program can also
182	compute a table of similarities between amino acid sequences. These values
183	are the fractions of amino acid positions identical between the sequences.
184	The diagonal values are 1.0000. No attempt is made to count similarity
185	of nonidentical amino acids, so that no credit is given for having
186	(for example) different hydrophobic amino acids at the corresponding
187	positions in the two sequences. This option has been requested by many
188	users, who need it for descriptive purposes. It is not intended that
189	the table be used for inferring the tree.
190	<P>
191	<H2>INPUT FORMAT AND OPTIONS</H2>
192	<P>
193	Input is fairly standard, with one addition. As usual the first line of the
194	file gives the number of species and the number of sites. There follows the
195	character W if the Weights option is being used.
196	<P>
197	Next come the species data. Each
198	sequence starts on a new line, has a ten-character species name
199	that must be blank-filled to be of that length, followed immediately
200	by the species data in the one-letter code. The sequences must either
201	be in the "interleaved" or "sequential" formats
202	described in the Molecular Sequence Programs document. The I option
203	selects between them. The sequences can have internal
204	blanks in the sequence but there must be no extra blanks at the end of the
205	terminated line. Note that a blank is not a valid symbol for a deletion.
206	<P>
207	After that are the lines (if any) containing the information for the
208	W option, as described below.
209	<P>
210	The options are selected using an interactive menu. The menu looks like this:
211	<P>
212	<TABLE><TR><TD BGCOLOR=white>
213	<PRE>
214
215	Protein distance algorithm, version 3.6a3
216
217	Settings for this run:
218	P Use JTT, PAM, Kimura or categories model? Jones-Taylor-Thornton matrix
219	G Gamma distribution of rates among positions? No
220	C One category of substitution rates? Yes
221	W Use weights for positions? No
222	M Analyze multiple data sets? No
223	I Input sequences interleaved? Yes
224	0 Terminal type (IBM PC, ANSI)? (none)
225	1 Print out the data at start of run No
226	2 Print indications of progress of run Yes
227
228	Are these settings correct? (type Y or the letter for one to change)
229	</PRE>
230	</TD></TR></TABLE>
231	<P>
232	The user either types "Y" (followed, of course, by a carriage-return)
233	if the settings shown are to be accepted, or the letter or digit corresponding
234	to an option that is to be changed.
235	<P>
236	The G option chooses Gamma distributed rates of evolution across amino
237	acid psoitions. The program will pronmpt you for the Coefficient of Variation
238	of rates. As is noted above, thi is 1/sqrt(alpha) if alpha is the more
239	familiar "shape coefficient" of the Gamma distribution. If the G option
240	is not selected, the program defaults to having no variation of rates
241	among sites.
242	<P>
243	The options M and 0 are the usual ones. They are described in the
244	main documentation file of this package. Option I is the same as in
245	other molecular sequence programs and is described in the documentation file
246	for the sequence programs.
247	<P>
248	The P option selects one of the four distance methods, or the
249	similarity table. It toggles among these
250	five methods. The default method, if none is specified, is the
251	Jones-Taylor-Thornton model. If the Categories distance is selected
252	another menu option, T, will appear allowing the user
253	to supply the Transition/Transversion ratio that should be assumed
254	at the underlying DNA level, and another one, C, which allows the
255	user to select among various nuclear and mitochondrial genetic codes.i
256	The transition/transversion ratio can be any number from 0.5 upwards.
257	<P>
258	The W (Weights) option is invoked in the usual way, with only weights 0
259	and 1 allowed. It selects a set of sites to be analyzed, ignoring the
260	others. The sites selected are those with weight 1. If the W option is
261	not invoked, all sites are analyzed.
262	<P>
263	<H2>OUTPUT FORMAT</H2>
264	<P>
265	As the
266	distances are computed, the program prints on your screen or terminal
267	the names of the species in turn,
268	followed by one dot (".") for each other species for which the distance to
269	that species has been computed. Thus if there are ten species, the first
270	species name is printed out, followed by one dot, then on the next line
271	the next species name is printed out followed by two dots, then the
272	next followed by three dots, and so on. The pattern of dots should form
273	a triangle. When the distance matrix has been written out to the output
274	file, the user is notified of that.
275	<P>
276	The output file contains on its first line the number of species. The
277	distance matrix is then printed in standard
278	form, with each species starting on a new line with the species name, followed
279	by the distances to the species in order. These continue onto a new line
280	after every nine distances. The distance matrix is square
281	with zero distances on the diagonal. In general the format of the distance
282	matrix is such that it can serve as input to any of the distance matrix
283	programs.
284	<P>
285	If the similarity table is selected, the table that is produced is not
286	in a format that can be used as input to the distance matrix programs.
287	it has a heading, and the species names are also put at the tops of the
288	columns of the table (or rather, the first 8 characters of each species
289	name is there, the other two characters omitted to save space). There
290	is not an option to put the table into a format that can be read by
291	the distance matrix programs, nor is there one to make it into a table
292	of fractions of difference by subtracting the similarity values from 1.
293	This is done deliberately to make it more difficult for the use to
294	use these values to construct trees. The similarity values are
295	not corrected for multiple changes, and their use to construct trees
296	(even after converting them to fractions of difference) would be
297	wrong, as it would lead to severe conflict between the distant
298	pairs of sequences and the close pairs of sequences.
299	<P>
300	If the option to print out the data is selected, the output file will
301	precede the data by more complete information on the input and the menu
302	selections. The output file begins by giving the number of species and the
303	number of characters, and the identity of the distance measure that is
304	being used.
305	<P>
306	In the Categories model of substitution,
307	the distances printed out are scaled in terms of expected numbers of
308	substitutions, counting both transitions and transversions but not
309	replacements of a base by itself, and scaled so that the average rate of
310	change is set to 1.0. For the Dayhoff PAM and Kimura models the
311	distance are scaled in terms of the expected numbers of amino acid
312	substitutions per site. Of course, when a branch is twice as
313	long this does not mean that there will be twice as much net change expected
314	along it, since some of the changes may occur in the same site and overlie or
315	even reverse each
316	other. The branch lengths estimates here are in terms of the expected
317	underlying numbers of changes. That means that a branch of length 0.26
318	is 26 times as long as one which would show a 1% difference between
319	the protein (or nucleotide) sequences at the beginning and end of the
320	branch. But we
321	would not expect the sequences at the beginning and end of the branch to be
322	26% different, as there would be some overlaying of changes.
323	<P>
324	One problem that can arise is that two or more of the species can be so
325	dissimilar that the distance between them would have to be infinite, as
326	the likelihood rises indefinitely as the estimated divergence time
327	increases. For example, with the Kimura model, if the two sequences
328	differ in 85.41% or more of their positions then the estimate of divergence
329	time would be infinite. Since there is no way to represent an infinite
330	distance in the output file, the program regards this as an error, issues a
331	warning message indicating which pair of species are causing the problem, and
332	computes a distance of -1.0.
333	<P>
334	<H2>PROGRAM CONSTANTS</H2>
335	<P>
336	The constants that are available to be changed by the user at the beginning
337	of the program include
338	"namelength", the length of species names in
339	characters, and "epsilon", a parameter which controls the accuracy of the
340	results of the iterations which estimate the distances. Making "epsilon"
341	smaller will increase run times but result in more decimal places of
342	accuracy. This should not be necessary.
343	<P>
344	The program spends most of its time doing real arithmetic. Any software or
345	hardware changes that speed up that arithmetic will speed it up by a nearly
346	proportional amount.
347	<P>
348	<HR>
349	<P>
350	<H3>TEST DATA SET</H3>
351	<P>
352	(Note that although these may look like DNA sequences, they are being
353	treated as protein sequences consisting entirely of alanine, cystine,
354	glycine, and threonine).
355	<P>
356	<TABLE><TR><TD BGCOLOR=white>
357	<PRE>
358	5 13
359	Alpha AACGTGGCCACAT
360	Beta AAGGTCGCCACAC
361	Gamma CAGTTCGCCACAA
362	Delta GAGATTTCCGCCT
363	Epsilon GAGATCTCCGCCC
364	</PRE>
365	</TD></TR></TABLE>
366	<P>
367	<HR>
368	<P>
369	<H3>CONTENTS OF OUTPUT FILE (with all numerical options on )</H3>
370	<P>
371	(Note that when the numerical options are not on, the output file produced is
372	in the correct format to be used as an input file in the distance matrix
373	programs).
374	<P>
375	<TABLE><TR><TD BGCOLOR=white>
376	<PRE>
377
378	Jones-Taylor-Thornton model distance
379
380	Name Sequences
381	---- ---------
382
383	Alpha AACGTGGCCA CAT
384	Beta ..G..C.... ..C
385	Gamma C.GT.C.... ..A
386	Delta G.GA.TT..G .C.
387	Epsilon G.GA.CT..G .CC
388
389
390
391	Alpha 0.0000 0.3304 0.6257 1.0320 1.3541
392	Beta 0.3304 0.0000 0.3756 1.0963 0.6776
393	Gamma 0.6257 0.3756 0.0000 0.9758 0.8616
394	Delta 1.0320 1.0963 0.9758 0.0000 0.2267
395	Epsilon 1.3541 0.6776 0.8616 0.2267 0.0000
396	</PRE>
397	</TD></TR></TABLE>
398	</BODY>
399	</HTML>

Note: See TracBrowser for help on using the repository browser.

Context Navigation

source: branches/stable/GDE/PHYLIP/doc/protdist.html

Download in other formats: