clustal_help

Last change on this file was 2, checked in by oldcode, 25 years ago
Initial revision

1	ClustalV - Multiple sequence alignment by Des Higgins
2
3	ClustalV is an order-independent multiple sequence alignment algorithm.
4	It first determines which sequences are similar to each other, and aligns
5	those sequences first. This helps reduce alignment bias based on the
6	early alignment of poor sequences.
7
8
9	Be warned that all sequences are returned in upper case, and all
10	extended information under Get info.. will be lost. For this reason,
11	the results are placed in a new GDE window.
12
13	The following is the help file that Higgins provides.
14
15	---
16
17	This is the on-line help file for CLUSTAL V.
18
19	It should be named or defined as: clustalv_hlp
20
21	>>HELP<< 1 General help for CLUSTAL V
22	CLUSTAL V is a general purpose multiple alignment program for DNA or proteins.
23
24	SEQUENCE INPUT: all sequences must be in 1 file, one after another. 3 formats
25	are automatically recognised: NBRF/PIR, EMBL/SWISSPROT or Pearson (Fasta).
26	All non-alphabetic characters (spaces, digits, punctuation marks) are ignored
27	except "-" which is used to indicate a GAP. Upper or lower case is allowed.
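The input rule above (letters and "-" are kept, everything else is ignored) can be sketched in a few lines of Python. The function name `clean_sequence` is made up for illustration and is not part of CLUSTAL V.

```python
def clean_sequence(raw: str) -> str:
    """Keep ASCII letters (either case) and the gap character '-'.

    Spaces, digits, punctuation and line breaks are dropped, following
    the SEQUENCE INPUT rule in the help text. Hypothetical helper.
    """
    return "".join(
        ch for ch in raw
        if ch == "-" or (ch.isascii() and ch.isalpha())
    )


if __name__ == "__main__":
    print(clean_sequence("  1 acgt-ACGT 10\n 11 tt..aa"))
```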
28
29
30	To do a MULTIPLE ALIGNMENT on a set of sequences, use item 1 from this menu to
31	INPUT them; go to menu item 2 to do the multiple alignment.
32
33
34	PROFILE ALIGNMENTS (menu item 3) are used to align 2 alignments. Use this to
35	add a new sequence to an old alignment. GAPS in the old alignments are
36	indicated using the "-" character. PROFILES can be input as PIR format files.
37
38
39	PHYLOGENETIC TREES (menu item 4) can be calculated from old alignments (read in
40	in PIR format with "-" characters to indicate gaps) OR after a multiple
41	alignment while the alignment is still in memory.
42	>>HELP<< 2 Help for multiple alignments
43
44	If you have already loaded sequences, use menu item 1 to do the complete
45	multiple alignment. You will be prompted for 2 output files: 1 for the
46	alignment itself; another to store a dendrogram that describes the similarity
47	of the sequences to each other.
48
49	Multiple alignments are carried out in 3 stages (automatically done from menu
50	item 1 ... multiple alignments NOW):
51
52	1) all sequences are compared to each other (pairwise alignments);
53
54	2) a dendrogram (like a phylogenetic tree) is constructed, describing the
55	approximate groupings of the sequences by similarity (stored in a file).
56
57	3) the final multiple alignment is carried out, using the dendrogram as a guide.
58
59
60	PAIRWISE ALIGNMENT parameters control the speed/sensitivity of the initial
61	alignments.
62
63	MULTIPLE ALIGNMENT parameters control the gaps in the final multiple alignments.
64
65
66
67
68	You can skip the first stages (pairwise alignments; dendrogram) by using an
69	old dendrogram file (menu item 3); or you can just produce the dendrogram
70	with no final multiple alignment (menu item 2).
71
72
73	OUTPUT FORMAT: Menu item 6 (format options) allows you to choose between 4
74	different alignment formats (CLUSTAL, GCG, NBRF/PIR and PHYLIP).
75
76
77	>>HELP<< 3 Help for pairwise alignment parameters
78
79	A similarity score is calculated between every pair of sequences and these are
80	used to construct the dendrogram which guides the final multiple alignment.
81
82	These similarity scores are calculated from fast, approximate, global align-
83	ments, which are controlled by 4 parameters. 2 techniques are used to make
84	these alignments very fast: 1) only exactly matching fragments (k-tuples) are
85	considered; 2) only the 'best' diagonals (the ones with most k-tuple matches)
86	are used.
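As a rough illustration of the two techniques just described, the Python sketch below counts exact k-tuple matches on each diagonal of an imaginary dot-matrix plot and picks the diagonals with the most matches. It is a simplified sketch, not the program's actual code; the names `diagonal_counts` and `best_diagonals` are invented here.

```python
from collections import Counter, defaultdict


def diagonal_counts(a: str, b: str, k: int) -> Counter:
    """Count exact k-tuple matches per diagonal (diagonal = i - j)."""
    # Index every k-tuple of b by its start position.
    positions = defaultdict(list)
    for j in range(len(b) - k + 1):
        positions[b[j:j + k]].append(j)

    counts = Counter()
    for i in range(len(a) - k + 1):
        # .get() avoids adding empty entries to the index.
        for j in positions.get(a[i:i + k], ()):
            counts[i - j] += 1
    return counts


def best_diagonals(a: str, b: str, k: int, top: int) -> list:
    """Return the `top` diagonals with the most k-tuple matches."""
    return [d for d, _ in diagonal_counts(a, b, k).most_common(top)]


if __name__ == "__main__":
    print(diagonal_counts("ACGTTT", "ACGT", 2))
    print(best_diagonals("ACGTTT", "ACGT", 2, top=1))
```

A larger `k` shrinks the index and the number of hits (faster, less sensitive); a larger `top` keeps more diagonals (slower, more sensitive).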
87
88
89	K-TUPLE SIZE: This is the size of exactly matching fragment that is used.
90	INCREASE for speed (max= 2 for proteins; 4 for DNA), DECREASE for sensitivity.
91	For longer sequences (e.g. > 300 residues) you may need to increase the default.
92
93
94	GAP PENALTY: This is a penalty for each gap in the fast alignments. It has
95	little effect on the speed or sensitivity.
96
97
98
99
100
101
102	TOP DIAGONALS: The number of k-tuple matches on each diagonal (in an imaginary
103	dot-matrix plot) is calculated. Only the best ones (with most matches) are
104	used in the alignment. This parameter specifies how many. Decrease for speed;
105	increase for sensitivity.
106
107
108	DIAGONAL WINDOW: This is the number of diagonals around each of the 'best'
109	diagonals that will be used. Decrease for speed; increase for sensitivity.
110
111
112	SCORING METHOD = PERCENTAGE or ABSOLUTE: This controls whether the similarity
113	scores are calculated as raw alignment scores (number of k-tuple matches minus a
114	gap penalty for every gap) (ABSOLUTE) or as the alignment score divided by the
115	length of the shorter sequence (PERCENTAGE).
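The two scoring methods can be written out as a small Python function. This is only a sketch of the description above: the function and argument names are invented, and the PERCENTAGE value is returned as the plain ratio described in the text (whether the program scales it further is not stated here).

```python
def similarity_score(matches: int, gaps: int, gap_penalty: int,
                     len_a: int, len_b: int,
                     percentage: bool = True) -> float:
    """Score a fast pairwise alignment.

    ABSOLUTE:   k-tuple matches minus a penalty for every gap.
    PERCENTAGE: the ABSOLUTE score divided by the shorter length.
    """
    absolute = matches - gap_penalty * gaps
    if not percentage:
        return float(absolute)
    return absolute / min(len_a, len_b)


if __name__ == "__main__":
    print(similarity_score(30, 2, 3, 50, 40, percentage=False))
    print(similarity_score(30, 2, 3, 50, 40, percentage=True))
```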
116
117
118
119	>>HELP<< 4 Help for multiple alignment parameters
120	These parameters control the final multiple alignment. There are 2 gap penalty
121	parameters and 1 for whether transitions (A <--> G or C <--> T) are weighted in
122	DNA alignments. The default weight matrix for protein alignments is a PAM250
123	matrix, converted to distances.
124
125	GAP PENALTY (FIXED): This is a penalty for opening up a gap. Decrease it
126	and you will encourage gaps of all sizes. TERMINAL GAPS are penalised (same as
127	internal ones). BEWARE: if you make this too small (+/- 5 or so), the program
128	will prefer to align each sequence opposite a long gap.
129
130	GAP PENALTY (VARYING): This penalty is incurred for every item in a gap. This
131	penalises long gaps more. Increase this and gaps will get shorter. BEWARE:
132	if you make this too small (+/- 5 or so), the program will prefer to align each
133	sequence opposite a long gap.
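Taken together, the two penalties describe a cost that grows with gap length. The sketch below assumes the simple form "fixed penalty once per gap, plus the varying penalty for every item in it", with both defaults set to 10 as stated in the weight matrix help; `gap_cost` is a made-up name.

```python
def gap_cost(length: int, fixed: int = 10, varying: int = 10) -> int:
    """Cost of one gap of `length` positions (sketch, assumed form)."""
    if length <= 0:
        return 0
    return fixed + varying * length


if __name__ == "__main__":
    # One gap of 4 versus two gaps of 2 with the default penalties.
    print(gap_cost(4))        # one long gap
    print(2 * gap_cost(2))    # two short gaps pay the fixed part twice
```

Raising `fixed` makes opening any gap more expensive; raising `varying` makes long gaps more expensive relative to short ones.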
134
135	TRANSITIONS = WEIGHTED or UNWEIGHTED: With UNWEIGHTED transitions identical
136	bases in a DNA alignment have a DISTANCE of 0; different ones have a distance
137	of 10. If transitions are WEIGHTED then A vs G and C vs T will have a distance
138	of 5 (less distant than A vs C,T or C vs A,G).
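The DNA distances quoted above (0, 5 and 10) can be expressed as a small lookup in Python. The function name is invented for illustration.

```python
# Transitions: purine <-> purine (A, G) and pyrimidine <-> pyrimidine (C, T).
TRANSITIONS = {frozenset("AG"), frozenset("CT")}


def base_distance(x: str, y: str, weighted: bool = True) -> int:
    """Distance between two DNA bases as described in the help text."""
    x, y = x.upper(), y.upper()
    if x == y:
        return 0
    if weighted and frozenset((x, y)) in TRANSITIONS:
        return 5
    return 10


if __name__ == "__main__":
    for pair in ("AA", "AG", "CT", "AC"):
        print(pair, base_distance(*pair), base_distance(*pair, weighted=False))
```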
139	>>HELP<< 5 Help for output format options.
140	Four output formats are offered. You can choose more than one (or all four if
141	you wish). NBRF/PIR format is ESPECIALLY USEFUL. Alignments that are written
142	in this format can be used again as input (for calculating phylogenetic trees;
143	profile alignments; general input).
144
145	CLUSTAL format output is a self explanatory alignment format. It shows the
146	sequences aligned in blocks.
147
148	GCG output can be used by any of the GCG programs that can work on multiple
149	alignments (e.g. PRETTY, PROFILEMAKE, PLOTALIGN). It is the same as the GCG
150	.msf format files (multiple sequence file); new in version 7 of GCG.
151
152	PHYLIP format output can be used for input to the PHYLIP package of Joe
153	Felsenstein. This is an extremely widely used package for doing every
154	imaginable form of phylogenetic analysis (MUCH more than the modest
155	introduction offered by this program).
156
157	NBRF/PIR: this is the same as the standard PIR format with ONE ADDITION. Gap
158	characters "-" are used to indicate the positions of gaps in the multiple
159	alignment. These files can be re-used as input in any part of clustal that
160	allows sequences (or alignments or profiles) to be read in.
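For readers who want to reuse such files outside the program, here is a minimal Python reader. It assumes the usual NBRF/PIR layout, which this help file does not spell out: a header line such as ">P1;name", one description line, then sequence lines ending with "*". The function name `read_pir` is hypothetical.

```python
def read_pir(text: str) -> dict:
    """Read NBRF/PIR records, keeping '-' as the gap character.

    Assumed layout per record: '>XX;name', a description line, then
    sequence lines terminated by '*'. Returns {name: sequence}.
    """
    records = {}
    lines = iter(text.splitlines())
    for line in lines:
        if not line.startswith(">"):
            continue
        name = line.partition(";")[2].strip()
        next(lines, "")  # skip the description line
        chunks = []
        for seq_line in lines:
            chunks.append(seq_line)
            if "*" in seq_line:
                break
        seq = "".join(chunks).split("*", 1)[0]
        records[name] = "".join(c for c in seq if c.isalpha() or c == "-")
    return records


if __name__ == "__main__":
    example = ">P1;seq1\nfirst\nAC-DE\nFG*\n>P1;seq2\nsecond\nACQDEFG*\n"
    print(read_pir(example))
```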
161	>>HELP<< 6 Help for profile alignments
162
163	By PROFILE ALIGNMENT, we mean the alignment of two old alignments. One of the
164	alignments can be a single sequence.
165
166	The profiles should be in PIR format (one of the 4 output formats produced by
167	this program). This is the same as standard NBRF/PIR format, with 1 addition:
168	gap characters are indicated by "-".
169
170	The alignment method produces a global, optimal alignment using an amino acid
171	weight matrix (PAM250 is default) and 2 gap penalty parameters.
172
173	Profile alignments allow you to store alignments of your favourite sequences (as
174	long as they are in PIR format) and add new sequences to them in small bunches
175	at a time. One of the 2 profiles can simply be a single sequence.
176
177
178
179	>>HELP<< 7 Help for phylogenetic trees
180	Before calculating a tree, you must have an alignment in memory. This can be
181	input in NBRF/PIR format or you should have just carried out a full multiple
182	alignment and the alignment is still in memory.
183
184	The method used is the NJ (Neighbour Joining) method of Saitou and Nei. First
185	you calculate distances (percent divergence) between all pairs of sequences from
186	a multiple alignment; second you apply the NJ method to the distance matrix.
187
188	EXCLUDE POSITIONS WITH GAPS? If you choose this option, any alignment positions
189	where ANY of the sequences have a gap will be ignored. This guarantees that
190	the distances will be 'metric'. Also, it means that 'like' will be compared to
191	'like' in all distances. The disadvantage is that you may throw away much of
192	the data if there are many gaps.
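The distance step and the effect of excluding gapped positions can be sketched in Python as follows. This is an illustration only: the function name is invented, and the handling of gaps when the option is off (skipping positions where either sequence of the pair has a gap) is an assumption, not something the help text states.

```python
def percent_divergence(alignment, i, j, exclude_gap_columns=False):
    """Percent of compared positions at which sequences i and j differ.

    alignment: list of equal-length strings with '-' for gaps.
    exclude_gap_columns=True drops every column where ANY sequence
    has a gap, so all pairs are compared over the same positions.
    """
    n = len(alignment[0])
    if exclude_gap_columns:
        columns = [c for c in range(n)
                   if all(seq[c] != "-" for seq in alignment)]
    else:
        # Assumption: ignore only positions gapped in this pair.
        columns = [c for c in range(n)
                   if alignment[i][c] != "-" and alignment[j][c] != "-"]
    if not columns:
        return 0.0
    diffs = sum(alignment[i][c] != alignment[j][c] for c in columns)
    return 100.0 * diffs / len(columns)


if __name__ == "__main__":
    aln = ["ACGT-A", "ACCTTA", "A-GTTA"]
    print(percent_divergence(aln, 0, 1))
    print(percent_divergence(aln, 0, 1, exclude_gap_columns=True))
```

In the example, excluding gapped columns leaves 4 of the 6 positions, which shows the trade-off mentioned above: comparable distances, but less data.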
193
194	CORRECT FOR MULTIPLE SUBSTITUTIONS? For small divergence (say <10%) this
195	option makes little difference. For greater divergence, this option corrects
196	for the fact that observed distances underestimate actual evolutionary dist-
197	ances. This is because, as sequences diverge, more than one substitution will
198	happen at many sites. However, you only see one difference when you look at the
199	present day sequences. Therefore, this option has the effect of stretching
200	branch lengths in trees (especially long branches). The corrections used here
201	(for DNA or proteins) are both due to Motoo Kimura.
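The help text does not give the correction formulas. The Python sketch below assumes the commonly cited Kimura forms: for proteins, K = -ln(1 - D - D^2/5) with D the observed fraction of differing sites; for DNA, the two-parameter formula with P the fraction of transitions and Q the fraction of transversions. Treat both as assumptions about what the program applies.

```python
import math


def kimura_protein(d: float) -> float:
    """Assumed Kimura protein correction: -ln(1 - D - D^2/5)."""
    inner = 1.0 - d - 0.2 * d * d
    if inner <= 0.0:
        raise ValueError("divergence too large for this correction")
    return -math.log(inner)


def kimura_dna(p: float, q: float) -> float:
    """Assumed Kimura two-parameter distance.

    p: observed fraction of transitions, q: fraction of transversions.
    """
    a = 1.0 - 2.0 * p - q
    b = 1.0 - 2.0 * q
    if a <= 0.0 or b <= 0.0:
        raise ValueError("divergence too large for this correction")
    return -0.5 * math.log(a * math.sqrt(b))


if __name__ == "__main__":
    for d in (0.05, 0.25, 0.50):
        print(d, round(kimura_protein(d), 4))
```

The corrected value is always at least the observed one, and the difference grows with divergence, which is the "stretching" of long branches described above.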
202
203	To calculate a tree, use option 4 (DRAW TREE NOW). This gives an UNROOTED
204	tree and all branch lengths. The root of the tree can only be inferred by
205	using an outgroup (a sequence that you are certain branches at the outside
206	of the tree .... certain on biological grounds) OR if you assume a degree
207	of constancy in the 'molecular clock', you can place the root along the
208	longest branch.
209
210	BOOTSTRAPPING is a method for deriving confidence values for the groupings in
211	a tree (first adapted for trees by Joe Felsenstein). It involves making N
212	random samples of sites from the alignment (N should be LARGE, e.g. 500 - 1000);
213	drawing N trees (1 from each sample) and counting how many times each grouping
214	from the original tree occurs in the sample trees. For a group to be consid-
215	ered significant at the 5% level (p <= 0.05) it should occur in at least 95% of
216	the sample trees. You must supply a seed number for the random number generator.
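The resampling step can be sketched in Python as below. It assumes the standard bootstrap, that is, sampling alignment columns with replacement until the sample has as many columns as the original; tree building itself is left out, and all names are invented for illustration.

```python
import random


def bootstrap_alignments(alignment, n_samples, seed):
    """Yield n_samples resampled alignments (columns drawn with replacement)."""
    rng = random.Random(seed)  # the seed makes the run repeatable
    n_cols = len(alignment[0])
    for _ in range(n_samples):
        cols = [rng.randrange(n_cols) for _ in range(n_cols)]
        yield ["".join(seq[c] for c in cols) for seq in alignment]


def is_significant(count, n_samples, percent=95):
    """True if a grouping occurs in at least `percent`% of sample trees."""
    return count * 100 >= percent * n_samples


if __name__ == "__main__":
    for sample in bootstrap_alignments(["ACGT", "AGGT"], 3, seed=42):
        print(sample)
    print(is_significant(950, 1000), is_significant(949, 1000))
```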
217	>>HELP<< 8 Help for choosing protein weight matrix
218	For protein alignments, you use a weight matrix to determine the similarity of
219	non-identical amino acids. For example, Tyr aligned with Phe is usually judged
220	to be 'better' than Tyr aligned with Pro.
221
222
223
224	There are three 'in-built' weight matrices offered:
225
226
227	1) PAM 100 and 2) PAM 250. These are from the work of M. Dayhoff and are often
228	simply called Dayhoff matrices. The PAM 250 matrix is the most commonly used
229	and is the default in most protein comparison packages. It is claimed that
230	a PAM 100 matrix is more sensitive in many cases, so we have included it
231	here.
232
233
234	3) Identity matrix. This matrix just scores identical residues.
235
236
237
238
239
240	You can also input your own matrix. If so then be careful: 1) follow the
241	instructions on format below; 2) watch the gap penalty parameters (the default
242	values may not be appropriate). Conservative substitutions will not be
243	indicated in alignments.
244
245	The values in a new weight matrix must be integers and the scores should be
246	similarities. You can use negative as well as positive values if you wish.
247
248
249	INPUT FORMAT: The lower triangle of a 20x20 matrix of values is read in, in free
250	format, row by row. The diagonal must be included. Using the 1 letter code,
251	the order of amino acids in the matrix is: CSTPAGNDEQHRKMILVFYW. Separate
252	the values by spaces (not commas). You can put the values on as many lines
253	as you like as long as they are in the right order.
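To check a matrix file before giving it to the program, the format above can be read with a short Python sketch. It follows only what is stated here (integers, whitespace separated, lower triangle with diagonal, rows in the order CSTPAGNDEQHRKMILVFYW); the function name is hypothetical.

```python
ORDER = "CSTPAGNDEQHRKMILVFYW"


def parse_lower_triangle(text: str) -> dict:
    """Parse a lower-triangle weight matrix into {(aa1, aa2): score}.

    Values may be spread over any number of lines as long as they
    are in row-by-row order. Raises ValueError on a wrong count or
    on non-integer values.
    """
    values = [int(tok) for tok in text.split()]
    n = len(ORDER)
    expected = n * (n + 1) // 2  # 210 values for a 20x20 matrix
    if len(values) != expected:
        raise ValueError(f"expected {expected} values, got {len(values)}")

    matrix = {}
    it = iter(values)
    for r in range(n):
        for c in range(r + 1):
            v = next(it)
            matrix[ORDER[r], ORDER[c]] = v
            matrix[ORDER[c], ORDER[r]] = v  # fill the upper triangle too
    return matrix


if __name__ == "__main__":
    # Build an identity matrix in the expected text layout.
    rows = [" ".join("1" if c == r else "0" for c in range(r + 1))
            for r in range(20)]
    m = parse_lower_triangle("\n".join(rows))
    print(m["C", "C"], m["W", "C"], len(m))
```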
254
255
256	GAP PENALTIES: The default gap penalty parameters work fine with a PAM 250
257	matrix. The range of PAM 250 values is 0 to 25 (when rescaled to be positive)
258	and the default gap penalties are 10 each. Very approximately, the best gap
259	penalty settings are 2/5 the maximum weight matrix score.
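The rule of thumb above can be checked against the quoted defaults with a line of arithmetic, written here as a tiny Python helper (the name is invented): 2/5 of the maximum PAM 250 score of 25 gives 10, which matches the default penalties.

```python
def suggested_gap_penalty(max_score: int) -> int:
    """Very approximate starting point: 2/5 of the maximum matrix score."""
    return round(2 * max_score / 5)


if __name__ == "__main__":
    print(suggested_gap_penalty(25))
```

For a user-supplied matrix, the same helper gives a first guess to start tuning from, not a recommended final value.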
260	>>HELP<< 9 Help for command line parameters
261	DATA (sequences)
262
263	/INFILE=file.ext :input sequences.
264	/PROFILE1=file.ext and /PROFILE2=file.ext :profiles (old alignment).
265
266	VERBS (do things)
267
268	/HELP or /CHECK :list the command line params.
269	/ALIGN :do full multiple alignment
270	/TREE :calculate NJ tree.
271	/BOOTSTRAP(=n) :bootstrap a NJ tree (n= number of bootstraps; def. = 1000).
272
273	PARAMETERS (set things)
274
275	*Pairwise alignments:*
276	/KTUP=n      :word size                  /TOPDIAGS=n  :number of best diags.
277	/WINDOW=n    :window around best diags.  /PAIRGAP=n   :gap penalty
278
279	*Multiple alignments:*
280	/FIXEDGAP=n  :fixed length gap pen.      /FLOATGAP=n  :variable length gap pen.
281	/MATRIX=     :PAM100 or ID or file name. /TYPE=p or d :type is prot. or DNA
282	/OUTPUT=     :GCG or PHYLIP or PIR.      /TRANSIT     :transitions not weighted.
283
284	*Trees:*                                 /SEED        :seed number for bootstraps.
285	/KIMURA      :use Kimura's correction.   /TOSSGAPS    :ignore positions with gaps.
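When driving the program from a script, the switches listed above can be assembled into an argument list. The Python sketch below only builds the list and does not run anything. The executable name `clustalv` and the helper name `build_command` are assumptions; the switches themselves are taken from the list above.

```python
def build_command(infile, align=True, ktup=None, fixedgap=None,
                  floatgap=None, output=None, executable="clustalv"):
    """Assemble a command line from the documented switches.

    `executable` is an assumed name; adjust it to your installation.
    Only switches that are explicitly requested are added.
    """
    args = [executable, f"/INFILE={infile}"]
    if align:
        args.append("/ALIGN")
    if ktup is not None:
        args.append(f"/KTUP={ktup}")
    if fixedgap is not None:
        args.append(f"/FIXEDGAP={fixedgap}")
    if floatgap is not None:
        args.append(f"/FLOATGAP={floatgap}")
    if output is not None:
        args.append(f"/OUTPUT={output}")
    return args


if __name__ == "__main__":
    print(" ".join(build_command("seqs.pir", ktup=2, output="PHYLIP")))
```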
286
