clustalw_help

Last change on this file was 10842, checked in by westram, 12 years ago
reintegrates 'help' into 'trunk': adds: log:branches/help@10647:10841 log:branches/helptest@10704:10720
Property svn:eol-style set to `native` Property svn:keywords set to `Author Date Id Revision`
File size: 32.1 KB

Rev	Line
[176]	1
[1754]	2	This is the on-line help file for CLUSTAL W (version 1.83).
[176]	3
	4	It should be named or defined as: clustalw_help
	5	except with MSDOS in which case it should be named CLUSTALW.HLP
	6
	7	For full details of usage and algorithms, please read the CLUSTALW.DOC file.
	8
	9
	10	Toby Gibson EMBL, Heidelberg, Germany.
	11	Des Higgins UCC, Cork, Ireland.
	12	Julie Thompson IGBMC, Strasbourg, France.
	13
	14
	15
[1754]	16	>>NEW <<
	17
	18	Fasta output
	19	===========
	20
	21	Sequences can be read or written with a range specified. The command line
	22	syntax for range specification is flexible. You can use any one of the
	23	following forms.
	24
	25	-range=n:m
	26	-range=n-m
	27	-range="n m"
	28
	29	where n is the starting position and m is the length of the sequence range.
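
The three spellings of the range can be normalised by a small parser. The Python sketch below is illustrative only: `parse_range` is a hypothetical helper, not part of CLUSTAL W, and it assumes that n is the start position and m the length, both non-negative integers.

```python
import re

def parse_range(value):
    """Parse 'n:m', 'n-m' or 'n m' into a (start, length) tuple.

    Hypothetical helper for illustration; not CLUSTAL W's own parser.
    """
    # One separator character (colon, hyphen or space) between two integers.
    match = re.fullmatch(r"\s*(\d+)\s*[:\- ]\s*(\d+)\s*", value)
    if match is None:
        raise ValueError(f"unrecognised range: {value!r}")
    return int(match.group(1)), int(match.group(2))

for text in ("10:50", "10-50", "10 50"):
    print(text, "->", parse_range(text))
```

All three inputs give the same (start, length) pair, which is the point of accepting several spellings.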
	30
	31	Range and range numbers.
	32	=======================
	33
	34	Include range numbers in the output.
	35
	36	-seqno_range=on/off
	37
	38	The sequence range will be appended to the names of the sequences.
	39
	40
	41	PIM: Percentage Identity Matrix
	42	===============================
	43
	44
	45
[176]	46	>>HELP 1 << General help for CLUSTAL W (1.81)
	47
	48	Clustal W is a general purpose multiple alignment program for DNA or proteins.
	49
	50	SEQUENCE INPUT: all sequences must be in 1 file, one after another.
	51	7 formats are automatically recognised: NBRF-PIR, EMBL-SWISSPROT,
	52	Pearson (Fasta), Clustal (*.aln), GCG-MSF (Pileup), GCG9-RSF and GDE flat file.
	53	All non-alphabetic characters (spaces, digits, punctuation marks) are ignored
	54	except "-" which is used to indicate a GAP ("." in MSF-RSF).
	55
	56	To do a MULTIPLE ALIGNMENT on a set of sequences, use item 1 from this menu to
	57	INPUT them; go to menu item 2 to do the multiple alignment.
	58
	59	PROFILE ALIGNMENTS (menu item 3) are used to align 2 alignments. Use this to
	60	add a new sequence to an old alignment, or to use secondary structure to guide
	61	the alignment process. GAPS in the old alignments are indicated using the "-"
	62	character. PROFILES can be input in ANY of the allowed formats; just
	63	use "-" (or "." for MSF-RSF) for each gap position.
	64
	65	PHYLOGENETIC TREES (menu item 4) can be calculated from old alignments (read in
	66	with "-" characters to indicate gaps) OR after a multiple alignment while the
	67	alignment is still in memory.
	68
	69
	70	The program tries to automatically recognise the different file formats used
	71	and to guess whether the sequences are amino acid or nucleotide. This is not
	72	always foolproof.
	73
	74	FASTA and NBRF-PIR formats are recognised by having a ">" as the first
	75	character in the file.
	76
	77	EMBL-Swiss Prot formats are recognised by the letters
	78	ID at the start of the file (the token for the entry name field).
	79
	80	CLUSTAL format is recognised by the word CLUSTAL at the beginning of the file.
	81
	82	GCG-MSF format is recognised by one of the following:
	83	- the word PileUp at the start of the file.
	84	- the word !!AA_MULTIPLE_ALIGNMENT or !!NA_MULTIPLE_ALIGNMENT
	85	at the start of the file.
	86	- the word MSF on the first line of the file, and the characters ..
	87	at the end of this line.
	88
	89	GCG-RSF format is recognised by the word !!RICH_SEQUENCE at the beginning of
	90	the file.
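
The recognition rules above can be summarised as a chain of prefix checks. The Python sketch below is a rough illustration, not the program's actual code: `guess_format` and its return labels are hypothetical, it ignores leading whitespace as a simplifying assumption, and GDE flat files are not covered because no rule for them is given here.

```python
def guess_format(text):
    """Guess the input format from the start of a file's contents.

    Illustrative only; the labels returned are arbitrary names.
    """
    stripped = text.lstrip()
    first_line = stripped.splitlines()[0] if stripped else ""
    if stripped.startswith(">"):
        # FASTA and NBRF-PIR share the same first character.
        return "FASTA or NBRF-PIR"
    if stripped.startswith("ID"):
        return "EMBL-SWISSPROT"
    if stripped.startswith("CLUSTAL"):
        return "CLUSTAL"
    if stripped.startswith("!!RICH_SEQUENCE"):
        return "GCG-RSF"
    if (stripped.startswith("PileUp")
            or stripped.startswith("!!AA_MULTIPLE_ALIGNMENT")
            or stripped.startswith("!!NA_MULTIPLE_ALIGNMENT")
            or ("MSF" in first_line and first_line.rstrip().endswith(".."))):
        return "GCG-MSF"
    return "unknown"

print(guess_format(">seq1\nACGT\n"))
```

Because FASTA and NBRF-PIR both start with ">", a second check on the header line would be needed to tell them apart.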
	91
	92
	93	If 85% or more of the characters in the sequence are from A,C,G,T,U or N, the
	94	sequence will be assumed to be nucleotide. This works in 97.3% of cases
	95	but watch out!
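
The 85% rule can be written in a few lines. This Python sketch is an illustration under stated assumptions: `looks_like_nucleotide` is a hypothetical name, and non-alphabetic characters are skipped before counting, following the statement above that they are ignored on input.

```python
def looks_like_nucleotide(sequence, threshold=0.85):
    """Return True if at least `threshold` of the letters are A, C, G, T, U or N."""
    letters = [c for c in sequence.upper() if c.isalpha()]
    if not letters:
        return False
    hits = sum(1 for c in letters if c in "ACGTUN")
    return hits / len(letters) >= threshold

print(looks_like_nucleotide("ACGT-ACGT-NN"))
print(looks_like_nucleotide("MKVLAAGIVG"))
```

A short protein rich in Ala, Cys, Gly and Thr would be misclassified by this rule, which is why the warning above applies.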
	96
	97	>>HELP 2 << Help for multiple alignments
	98
	99	If you have already loaded sequences, use menu item 1 to do the complete
	100	multiple alignment. You will be prompted for 2 output files: 1 for the
	101	alignment itself; another to store a dendrogram that describes the similarity
	102	of the sequences to each other.
	103
	104	Multiple alignments are carried out in 3 stages (automatically done from menu
	105	item 1 ...Do complete multiple alignments now):
	106
	107	1) all sequences are compared to each other (pairwise alignments);
	108
	109	2) a dendrogram (like a phylogenetic tree) is constructed, describing the
	110	approximate groupings of the sequences by similarity (stored in a file).
	111
	112	3) the final multiple alignment is carried out, using the dendrogram as a guide.
	113
	114
	115	PAIRWISE ALIGNMENT parameters control the speed-sensitivity of the initial
	116	alignments.
	117
	118	MULTIPLE ALIGNMENT parameters control the gaps in the final multiple alignments.
	119
	120
	121	RESET GAPS (menu item 7) will remove any new gaps introduced into the sequences
	122	during multiple alignment if you wish to change the parameters and try again.
	123	This only takes effect just before you do a second multiple alignment. You
	124	can make phylogenetic trees after alignment whether or not this is ON.
	125	If you turn this OFF, the new gaps are kept even if you do a second multiple
	126	alignment. This allows you to iterate the alignment gradually. Sometimes, the
	127	alignment is improved by a second or third pass.
	128
	129	SCREEN DISPLAY (menu item 8) can be used to send the output alignments to the
	130	screen as well as to the output file.
	131
	132	You can skip the first stages (pairwise alignments; dendrogram) by using an
	133	old dendrogram file (menu item 3); or you can just produce the dendrogram
	134	with no final multiple alignment (menu item 2).
	135
	136
	137	OUTPUT FORMAT: Menu item 9 (format options) allows you to choose from 7
[1754]	138	different alignment formats (CLUSTAL, GCG, NBRF-PIR, PHYLIP, GDE, NEXUS, and FASTA).
[176]	139
	140
	141	>>HELP 3 << Help for pairwise alignment parameters
	142	A distance is calculated between every pair of sequences and these are used to
	143	construct the dendrogram which guides the final multiple alignment. The scores
	144	are calculated from separate pairwise alignments. These can be calculated using
	145	2 methods: dynamic programming (slow but accurate) or by the method of Wilbur
	146	and Lipman (extremely fast but approximate).
	147
	148	You can choose between the 2 alignment methods using menu option 8. The
	149	slow-accurate method is fine for short sequences but will be VERY SLOW for
	150	many (e.g. >100) long (e.g. >1000 residue) sequences.
	151
	152	SLOW-ACCURATE alignment parameters:
	153	These parameters do not have any effect on the speed of the alignments.
	154	They are used to give initial alignments which are then rescored to give percent
	155	identity scores. These % scores are the ones which are displayed on the
	156	screen. The scores are converted to distances for the trees.
	157
	158	1) Gap Open Penalty: the penalty for opening a gap in the alignment.
	159	2) Gap extension penalty: the penalty for extending a gap by 1 residue.
	160	3) Protein weight matrix: the scoring table which describes the similarity
	161	of each amino acid to each other.
	162	4) DNA weight matrix: the scores assigned to matches and mismatches
	163	(including IUB ambiguity codes).
	164
	165
	166	FAST-APPROXIMATE alignment parameters:
	167
[6136]	168	These similarity scores are calculated from fast, approximate, global alignments,
	169	which are controlled by 4 parameters. 2 techniques are used to make
[176]	170	these alignments very fast: 1) only exactly matching fragments (k-tuples) are
	171	considered; 2) only the 'best' diagonals (the ones with most k-tuple matches)
	172	are used.
	173
	174	K-TUPLE SIZE: This is the size of exactly matching fragment that is used.
	175	INCREASE for speed (max= 2 for proteins; 4 for DNA), DECREASE for sensitivity.
	176	For longer sequences (e.g. >1000 residues) you may need to increase the default.
	177
	178	GAP PENALTY: This is a penalty for each gap in the fast alignments. It has
	179	little effect on the speed or sensitivity except for extreme values.
	180
	181	TOP DIAGONALS: The number of k-tuple matches on each diagonal (in an imaginary
	182	dot-matrix plot) is calculated. Only the best ones (with most matches) are
	183	used in the alignment. This parameter specifies how many. Decrease for speed;
	184	increase for sensitivity.
	185
	186	WINDOW SIZE: This is the number of diagonals around each of the 'best'
	187	diagonals that will be used. Decrease for speed; increase for sensitivity.
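
The idea behind K-TUPLE SIZE and TOP DIAGONALS can be sketched as follows. This Python example is a simplified illustration and not the Wilbur and Lipman implementation used by the program: `best_diagonals` is a hypothetical function, a diagonal is labelled by i - j (the offset between matching start positions), and the WINDOW SIZE step is left out.

```python
from collections import Counter

def best_diagonals(seq_a, seq_b, ktuple=2, top_diags=3):
    """Count exact k-tuple matches per diagonal and return the best ones.

    Returns a list of (diagonal, match_count) pairs, most matches first.
    """
    # Index every k-tuple of seq_b by its start positions.
    positions = {}
    for j in range(len(seq_b) - ktuple + 1):
        positions.setdefault(seq_b[j:j + ktuple], []).append(j)
    # Each exact match between seq_a[i:] and seq_b[j:] lies on diagonal i - j.
    counts = Counter()
    for i in range(len(seq_a) - ktuple + 1):
        for j in positions.get(seq_a[i:i + ktuple], []):
            counts[i - j] += 1
    return counts.most_common(top_diags)

print(best_diagonals("TTACGTAC", "ACGTAC", ktuple=3, top_diags=1))
```

A larger k-tuple produces fewer index hits and so runs faster, but misses matches that are interrupted by a single substitution.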
	188
	189
	190	>>HELP 4 << Help for multiple alignment parameters
	191
	192	These parameters control the final multiple alignment. This is the core of the
	193	program and the details are complicated. To fully understand the use of the
	194	parameters and the scoring system, you will have to refer to the documentation.
	195
	196	Each step in the final multiple alignment consists of aligning two alignments
	197	or sequences. This is done progressively, following the branching order in
	198	the GUIDE TREE. The basic parameters to control this are two gap penalties and
[6141]	199	the scores for various identical-non-identical residues.
[176]	200
[10842]	201	1) and 2) The GAP PENALTIES are set by menu items 1 and 2. These control the
[176]	204	cost of opening up every new gap and the cost of every item in a gap.
	205	Increasing the gap opening penalty will make gaps less frequent. Increasing
	206	the gap extension penalty will make gaps shorter. Terminal gaps are not
	207	penalised.
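
The effect of the two penalties can be seen with a simple affine cost. The Python sketch below assumes the plain form "opening penalty plus extension penalty per gap position"; the position-specific adjustments that the program applies are not modelled, the numeric defaults are illustrative values only, and `gap_cost` is a hypothetical name.

```python
def gap_cost(length, gap_open=10.0, gap_ext=0.2, terminal=False):
    """Cost of one gap of `length` positions under a simple affine model."""
    if terminal or length == 0:
        # Terminal gaps are not penalised.
        return 0.0
    return gap_open + gap_ext * length

# One gap of length 2 is cheaper than two separate gaps of length 1,
# because the opening penalty is paid only once.
print(gap_cost(2), 2 * gap_cost(1))
```

Raising `gap_open` widens this difference, which is why a higher opening penalty makes gaps less frequent.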
	208
	209	3) The DELAY DIVERGENT SEQUENCES switch delays the alignment of the most
	210	distantly related sequences until after the most closely related sequences have
	211	been aligned. The setting shows the percent identity level required to delay
	212	the addition of a sequence; sequences that are less identical than this level
	213	to any other sequences will be aligned later.
	214
	215
	216
	217	4) The TRANSITION WEIGHT gives transitions (A <--> G or C <--> T
	218	i.e. purine-purine or pyrimidine-pyrimidine substitutions) a weight between 0
	219	and 1; a weight of zero means that the transitions are scored as mismatches,
	220	while a weight of 1 gives the transitions the match score. For distantly related
	221	DNA sequences, the weight should be near to zero; for closely related sequences
	222	it can be useful to assign a higher score.
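
A minimal sketch of how a transition weight could enter the scoring is given below. Only the two end points come from the description above (weight 0 gives the mismatch score, weight 1 gives the match score); the linear interpolation between them, the function name `dna_pair_score` and the match and mismatch values are assumptions made for illustration.

```python
PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}

def dna_pair_score(x, y, transition_weight=0.5, match=1.0, mismatch=0.0):
    """Score one aligned pair of bases, giving transitions partial credit."""
    x, y = x.upper(), y.upper()
    if x == y:
        return match
    # A transition stays within the purines or within the pyrimidines.
    is_transition = {x, y} <= PURINES or {x, y} <= PYRIMIDINES
    if is_transition:
        return mismatch + transition_weight * (match - mismatch)
    return mismatch

print(dna_pair_score("A", "G"), dna_pair_score("A", "C"))
```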
	223
	224
	225	5) PROTEIN WEIGHT MATRIX leads to a new menu where you are offered a choice of
	226	weight matrices. The default for proteins in version 1.8 is the PAM series
	227	derived by Gonnet and colleagues. Note, a series is used! The actual matrix
	228	that is used depends on how similar the sequences to be aligned at this
	229	alignment step are. Different matrices work differently at each evolutionary
	230	distance.
	231
	232	6) DNA WEIGHT MATRIX leads to a new menu where a single matrix (not a series)
	233	can be selected. The default is the matrix used by BESTFIT for comparison of
	234	nucleic acid sequences.
	235
	236	Further help is offered in the weight matrix menu.
	237
	238
	239	7) In the weight matrices, you can use negative as well as positive values if
	240	you wish, although the matrix will be automatically adjusted to all positive
	241	scores, unless the NEGATIVE MATRIX option is selected.
	242
	243	8) PROTEIN GAP PARAMETERS displays a menu allowing you to set some Gap Penalty
	244	options which are only used in protein alignments.
	245
	246
	247	>>HELP A << Help for protein gap parameters.
	248	1) RESIDUE SPECIFIC PENALTIES are amino acid specific gap penalties that reduce
	249	or increase the gap opening penalties at each position in the alignment or
	250	sequence. See the documentation for details. As an example, positions that
	251	are rich in glycine are more likely to have an adjacent gap than positions that
	252	are rich in valine.
	253
[10842]	254	2) and 3) HYDROPHILIC GAP PENALTIES are used to increase the chances of a gap within
[176]	257	a run (5 or more residues) of hydrophilic amino acids; these are likely to
	258	be loop or random coil regions where gaps are more common. The residues that
	259	are "considered" to be hydrophilic are set by menu item 3.
	260
	261	4) GAP SEPARATION DISTANCE tries to decrease the chances of gaps being too
	262	close to each other. Gaps that are less than this distance apart are penalised
	263	more than other gaps. This does not prevent close gaps; it makes them less
	264	frequent, promoting a block-like appearance of the alignment.
	265
	266	5) END GAP SEPARATION treats end gaps just like internal gaps for the purposes
	267	of avoiding gaps that are too close (set by GAP SEPARATION DISTANCE above).
	268	If you turn this off, end gaps will be ignored for this purpose. This is
	269	useful when you wish to align fragments where the end gaps are not biologically
	270	meaningful.
	271	>>HELP 5 << Help for output format options.
	272
	273	Six output formats are offered. You can choose any (or all 6 if you wish).
	274
	275	CLUSTAL format output is a self explanatory alignment format. It shows the
	276	sequences aligned in blocks. It can be read in again at a later date to
	277	(for example) calculate a phylogenetic tree or add a new sequence with a
	278	profile alignment.
	279
	280	GCG output can be used by any of the GCG programs that can work on multiple
	281	alignments (e.g. PRETTY, PROFILEMAKE, PLOTALIGN). It is the same as the GCG
	282	.msf format files (multiple sequence file); new in version 7 of GCG.
	283
	284	PHYLIP format output can be used for input to the PHYLIP package of Joe
	285	Felsenstein. This is an extremely widely used package for doing every
[6136]	286	imaginable form of phylogenetic analysis (MUCH more than the modest
	287	introduction offered by this program).
[176]	288
	289	NBRF-PIR: this is the same as the standard PIR format with ONE ADDITION. Gap
	290	characters "-" are used to indicate the positions of gaps in the multiple
	291	alignment. These files can be re-used as input in any part of clustal that
	292	allows sequences (or alignments or profiles) to be read in.
	293
	294	GDE: this is the flat file format used by the GDE package of Steven Smith.
	295
	296	NEXUS: the format used by several phylogeny programs, including PAUP and
	297	MacClade.
	298
	299	GDE OUTPUT CASE: sequences in GDE format may be written in either upper or
	300	lower case.
	301
	302	CLUSTALW SEQUENCE NUMBERS: residue numbers may be added to the end of the
	303	alignment lines in clustalw format.
	304
	305	OUTPUT ORDER is used to control the order of the sequences in the output
	306	alignments. By default, the order corresponds to the order in which the
	307	sequences were aligned (from the guide tree-dendrogram), thus automatically
	308	grouping closely related sequences. This switch can be used to set the order
	309	to the same as the input file.
[1754]	310
[176]	311	PARAMETER OUTPUT: This option allows you to save all your parameter settings
	312	in a parameter file. This file can be used subsequently to rerun Clustal W
	313	using the same parameters.
	314
	315	>>HELP 6 << Help for profile and structure alignments
	316
	317	By PROFILE ALIGNMENT, we mean alignment using existing alignments. Profile
	318	alignments allow you to store alignments of your favourite sequences and add
	319	new sequences to them in small bunches at a time. A profile is simply an
	320	alignment of one or more sequences (e.g. an alignment output file from CLUSTAL
	321	W). Each input can be a single sequence. One or both sets of input sequences
	322	may include secondary structure assignments or gap penalty masks to guide the
	323	alignment.
	324
	325	The profiles can be in any of the allowed input formats with "-" characters
	326	used to specify gaps (except for MSF-RSF where "." is used).
	327
	328	You have to specify the 2 profiles by choosing menu items 1 and 2 and giving
	329	2 file names. Then Menu item 3 will align the 2 profiles to each other.
	330	Secondary structure masks in either profile can be used to guide the alignment.
	331
	332	Menu item 4 will take the sequences in the second profile and align them to
	333	the first profile, 1 at a time. This is useful to add some new sequences to
	334	an existing alignment, or to align a set of sequences to a known structure.
	335	In this case, the second profile would not be pre-aligned.
	336
	337
	338	The alignment parameters can be set using menu items 5, 6 and 7. These are
	339	EXACTLY the same parameters as used by the general, automatic multiple
	340	alignment procedure. The general multiple alignment procedure is simply a
	341	series of profile alignments. Carrying out a series of profile alignments on
	342	larger and larger groups of sequences, allows you to manually build up a
	343	complete alignment, if necessary editing intermediate alignments.
	344
	345	SECONDARY STRUCTURE OPTIONS. Menu Option 0 allows you to set 2D structure
	346	parameters. If a solved structure is available, it can be used to guide the
	347	alignment by raising gap penalties within secondary structure elements, so
	348	that gaps will preferentially be inserted into unstructured surface loops.
	349	Alternatively, a user-specified gap penalty mask can be supplied directly.
	350
	351	A gap penalty mask is a series of numbers between 1 and 9, one per position in
	352	the alignment. Each number specifies how much the gap opening penalty is to be
	353	raised at that position (raised by multiplying the basic gap opening penalty
	354	by the number) i.e. a mask figure of 1 at a position means no change
	355	in gap opening penalty; a figure of 4 means that the gap opening penalty is
	356	four times greater at that position, making gaps 4 times harder to open.
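
The multiplication described above is easy to show directly. In this Python sketch, `masked_gap_open` is a hypothetical helper and the base penalty of 10.0 is an illustrative value, not a statement about the program's default.

```python
def masked_gap_open(base_penalty, mask):
    """Return the gap opening penalty at each position of a gap penalty mask."""
    if not all(ch in "123456789" for ch in mask):
        raise ValueError("mask must contain only the digits 1-9")
    # Each mask digit multiplies the basic gap opening penalty.
    return [base_penalty * int(ch) for ch in mask]

print(masked_gap_open(10.0, "1124"))
```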
	357
	358	The format for gap penalty masks and secondary structure masks is explained
	359	in the help under option 0 (secondary structure options).
	360	>>HELP B << Help for secondary structure - gap penalty masks
	361
	362	The use of secondary structure-based penalties has been shown to improve the
	363	accuracy of multiple alignment. Therefore CLUSTAL W now allows gap penalty
	364	masks to be supplied with the input sequences. The masks work by raising gap
	365	penalties in specified regions (typically secondary structure elements) so that
	366	gaps are preferentially opened in the less well conserved regions (typically
	367	surface loops).
	368
	369	Options 1 and 2 control whether the input secondary structure information or
	370	gap penalty masks will be used.
	371
	372	Option 3 controls whether the secondary structure and gap penalty masks should
	373	be included in the output alignment.
	374
	375	Options 4 and 5 provide the value for raising the gap penalty at core Alpha
	376	Helical (A) and Beta Strand (B) residues. In CLUSTAL format, capital residues
	377	denote the A and B core structure notation. The basic gap penalties are
	378	multiplied by the amount specified.
	379
	380	Option 6 provides the value for the gap penalty in Loops. By default this
	381	penalty is not raised. In CLUSTAL format, loops are specified by "." in the
	382	secondary structure notation.
	383
	384	Option 7 provides the value for setting the gap penalty at the ends of
	385	secondary structures. Ends of secondary structures are observed to grow
	386	and-or shrink in related structures. Therefore by default these are given
	387	intermediate values, lower than the core penalties. All secondary structure
	388	read in as lower case in CLUSTAL format gets the reduced terminal penalty.
	389
	390	Options 8 and 9 specify the range of structure termini for the intermediate
	391	penalties. In the alignment output, these are indicated as lower case.
	392	For Alpha Helices, by default, the range spans the end helical turn. For
	393	Beta Strands, the default range spans the end residue and the adjacent loop
	394	residue, since sequence conservation often extends beyond the actual H-bonded
	395	Beta Strand.
	396
	397	CLUSTAL W can read the masks from SWISS-PROT, CLUSTAL or GDE format input
	398	files. For many 3-D protein structures, secondary structure information is
	399	recorded in the feature tables of SWISS-PROT database entries. You should
	400	always check that the assignments are correct - some are quite inaccurate.
	401	CLUSTAL W looks for SWISS-PROT HELIX and STRAND assignments e.g.
	402
	403	FT HELIX 100 115
	404	FT STRAND 118 119
	405
	406	The structure and penalty masks can also be read from CLUSTAL alignment format
	407	as comment lines beginning "!SS_" or "!GM_" e.g.
	408
	409	!SS_HBA_HUMA ..aaaAAAAAAAAAAaaa.aaaAAAAAAAAAAaaaaaaAaaa.........aaaAAAAAA
	410	!GM_HBA_HUMA 112224444444444222122244444444442222224222111111111222444444
	411	HBA_HUMA VLSPADKTNVKAAWGKVGAHAGEYGAEALERMFLSFPTTKTYFPHFDLSHGSAQVKGHGK
	412
	413	Note that the mask itself is a set of numbers between 1 and 9 each of which is
	414	assigned to the residue(s) in the same column below.
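
Reading the two kinds of mask line from a CLUSTAL format file could look like the Python sketch below. It is a simplified illustration: `read_masks` is a hypothetical function, and it assumes that each line holds a name and one mask string separated by whitespace, with no spaces inside the mask.

```python
def read_masks(lines):
    """Collect secondary structure (!SS_) and gap penalty (!GM_) masks by sequence name."""
    structure, penalty = {}, {}
    for line in lines:
        fields = line.split()
        if len(fields) < 2:
            continue
        name, data = fields[0], fields[1]
        if name.startswith("!SS_"):
            structure[name[4:]] = data
        elif name.startswith("!GM_"):
            penalty[name[4:]] = data
    return structure, penalty

example = [
    "!SS_HBA_HUMA ..aaaAAA",
    "!GM_HBA_HUMA 11222444",
    "HBA_HUMA     VLSPADKT",
]
print(read_masks(example))
```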
	415
	416	In GDE flat file format, the masks are specified as text and the names must
	417	begin with "SS_" or "GM_".
	418
	419	Either a structure or penalty mask or both may be used. If both are included in
	420	an alignment, the user will be asked which is to be used.
	421
	422	>>HELP C << Help for secondary structure - gap penalty mask output options
	423
	424	The options in this menu let you choose whether or not to include the masks
	425	in the CLUSTAL W output alignments. Showing both is useful for understanding
	426	how the masks work. The secondary structure information is itself very useful
	427	in judging the alignment quality and in seeing how residue conservation
	428	patterns vary with secondary structure.
	429
	430
	431	>>HELP 7 << Help for phylogenetic trees
	432
	433	1) Before calculating a tree, you must have an ALIGNMENT in memory. This can be
	434	input in any format, or it can be the result of a full multiple alignment that
	435	you have just carried out and that is still in memory.
	436
	437
	438	************* Remember YOU MUST ALIGN THE SEQUENCES FIRST!!!! *************
	439
	440
	441	The method used is the NJ (Neighbour Joining) method of Saitou and Nei. First
	442	you calculate distances (percent divergence) between all pairs of sequences from
	443	a multiple alignment; second you apply the NJ method to the distance matrix.
	444
	445	2) EXCLUDE POSITIONS WITH GAPS? With this option, any alignment positions where
	446	ANY of the sequences have a gap will be ignored. This means that 'like' will be
	447	compared to 'like' in all distances, which is highly desirable. It also
	448	automatically throws away the most ambiguous parts of the alignment, which are
	449	concentrated around gaps (usually). The disadvantage is that you may throw away
	450	much of the data if there are many gaps (which is why it is difficult for us to
	451	make it the default).
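
The distance calculation and the effect of excluding gapped positions can be sketched in Python as follows. This is an illustration, not the program's code: both function names are hypothetical, and the treatment of gaps when the option is off (skipping positions where either sequence of the pair has a gap) is an assumption.

```python
def gapped_columns(alignment):
    """Columns in which ANY sequence of the alignment has a gap."""
    return frozenset(col for col in range(len(alignment[0]))
                     if any(seq[col] == "-" for seq in alignment))

def percent_divergence(seq_a, seq_b, skip_columns=frozenset()):
    """Percentage of compared positions at which two aligned sequences differ."""
    compared = differences = 0
    for col, (a, b) in enumerate(zip(seq_a, seq_b)):
        if col in skip_columns or a == "-" or b == "-":
            continue
        compared += 1
        if a != b:
            differences += 1
    return 100.0 * differences / compared if compared else 0.0

alignment = ["ACGTAC", "AC-TAC", "ACGTTC"]
skip = gapped_columns(alignment)
print(percent_divergence(alignment[0], alignment[2]))        # all 6 columns compared
print(percent_divergence(alignment[0], alignment[2], skip))  # gapped column 2 ignored
```

With the option on, every pair is compared over the same set of columns, which is the 'like with like' property mentioned above.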
	452
	453
	454
	455	3) CORRECT FOR MULTIPLE SUBSTITUTIONS? For small divergence (say <10%) this
	456	option makes no difference. For greater divergence, it corrects for the fact
	457	that observed distances underestimate actual evolutionary distances. This is
	458	because, as sequences diverge, more than one substitution will happen at many
	459	sites. However, you only see one difference when you look at the present day
	460	sequences. Therefore, this option has the effect of stretching branch lengths
	461	in trees (especially long branches). The corrections used here (for DNA or
	462	proteins) are both due to Motoo Kimura. See the documentation for details.
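
As an illustration of what such a correction does, the Python sketch below uses one commonly cited form of Kimura's correction for protein distances, d = -ln(1 - D - 0.2 D^2), where D is the observed fraction of differing sites. Whether the program uses exactly this expression, and how it handles very large D, is not stated in this help file, so treat the formula and the name `kimura_protein_distance` as assumptions and see the documentation for the definitive method.

```python
import math

def kimura_protein_distance(observed):
    """Correct an observed fraction of differences (0..1) for multiple substitutions."""
    inside = 1.0 - observed - 0.2 * observed ** 2
    if inside <= 0.0:
        # The sequences are too divergent for the correction to be defined.
        raise ValueError("observed distance too large to correct")
    return -math.log(inside)

for d in (0.05, 0.30, 0.60):
    print(d, round(kimura_protein_distance(d), 4))
```

The corrected value stays close to the observed one for small divergence and grows faster for larger divergence, which is the branch-stretching effect described above.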
	463
	464	Where possible, this option should be used. However, for VERY divergent
	465	sequences, the distances cannot be reliably corrected. You will be warned if
	466	this happens. Even if none of the distances in a data set exceed the reliable
	467	threshold, if you bootstrap the data, some of the bootstrap distances may
	468	randomly exceed the safe limit.
	469
	470	4) To calculate a tree, use option 4 (DRAW TREE NOW). This gives an UNROOTED
	471	tree and all branch lengths. The root of the tree can only be inferred by
	472	using an outgroup (a sequence that you are certain branches at the outside
	473	of the tree .... certain on biological grounds) OR if you assume a degree
	474	of constancy in the 'molecular clock', you can place the root in the 'middle'
	475	of the tree (roughly equidistant from all tips).
	476
	477	5) TOGGLE PHYLIP BOOTSTRAP POSITIONS
	478	By default, the bootstrap values are correctly placed on the tree branches of
	479	the phylip format output tree. The toggle allows them to be placed on the
	480	nodes, which is incorrect, but some display packages (e.g. TreeTool, TreeView
	481	and Phylowin) only support node labelling but not branch labelling. Care
	482	should be taken to note which branches and labels go together.
	483
	484	6) OUTPUT FORMATS: four different formats are allowed. None of these displays
	485	the tree visually. Useful display programs accepting PHYLIP format include
	486	NJplot (from Manolo Gouy and supplied with Clustal W), TreeView (Mac-PC), and
	487	PHYLIP itself - OR get the PHYLIP package and use the tree drawing facilities
	488	there. (Get the PHYLIP package anyway if you are interested in trees). The
	489	NEXUS format can be read into PAUP or MacClade.
	490
	491	>>HELP 8 << Help for choosing a weight matrix
	492
	493	For protein alignments, you use a weight matrix to determine the similarity of
	494	non-identical amino acids. For example, Tyr aligned with Phe is usually judged
	495	to be 'better' than Tyr aligned with Pro.
	496
	497	There are three 'in-built' series of weight matrices offered. Each consists of
	498	several matrices which work differently at different evolutionary distances. To
	499	see the exact details, read the documentation. Crudely, we store several
	500	matrices in memory, spanning the full range of amino acid distance (from almost
	501	identical sequences to highly divergent ones). For very similar sequences, it
	502	is best to use a strict weight matrix which only gives a high score to
	503	identities and the most favoured conservative substitutions. For more divergent
	504	sequences, it is appropriate to use "softer" matrices which give a high score
	505	to many other frequent substitutions.
	506
	507	1) BLOSUM (Henikoff). These matrices appear to be the best available for
	508	carrying out database similarity (homology) searches. The matrices used are:
	509	Blosum 80, 62, 45 and 30. (BLOSUM was the default in earlier Clustal W
	510	versions)
	511
	512	2) PAM (Dayhoff). These have been extremely widely used since the late '70s.
	513	We use the PAM 20, 60, 120 and 350 matrices.
	514
	515	3) GONNET. These matrices were derived using almost the same procedure as the
	516	Dayhoff one (above) but are much more up to date and are based on a far larger
	517	data set. They appear to be more sensitive than the Dayhoff series. We use the
	518	GONNET 80, 120, 160, 250 and 350 matrices. This series is the default for
	519	Clustal W version 1.8.
	520
	521	We also supply an identity matrix which gives a score of 1.0 to two identical
	522	amino acids and a score of zero otherwise. This matrix is not very useful.
	523	Alternatively, you can read in your own (just one matrix, not a series).
	524
	525	A new matrix can be read from a file on disk, if the filename consists only
	526	of lower case characters. The values in the new weight matrix must be integers
	527	and the scores should be similarities. You can use negative as well as positive
	528	values if you wish, although the matrix will be automatically adjusted to all
	529	positive scores.
	530
	531
	532
	533	For DNA, a single matrix (not a series) is used. Two hard-coded matrices are
	534	available:
	535
	536
	537	1) IUB. This is the default scoring matrix used by BESTFIT for the comparison
	538	of nucleic acid sequences. X's and N's are treated as matches to any IUB
	539	ambiguity symbol. All matches score 1.9; all mismatches for IUB symbols score 0.
	540
	541
	542	2) CLUSTALW(1.6). The previous system used by Clustal W, in which matches score
	543	1.0 and mismatches score 0. All matches for IUB symbols also score 0.
	544
	545	INPUT FORMAT: The format used for a new matrix is the same as that of the BLAST program.
	546	Any lines beginning with a # character are assumed to be comments. The first
	547	non-comment line should contain a list of amino acids in any order, using the
	548	1 letter code, followed by a * character. This should be followed by a square
	549	matrix of integer scores, with one row and one column for each amino acid. The
	550	last row and column of the matrix (corresponding to the * character) contain
	551	the minimum score over the whole matrix.
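
A reader for this kind of file might look like the Python sketch below. It is a minimal illustration under stated assumptions: `parse_matrix` is a hypothetical function, the toy matrix is made up, and rows are accepted both with and without a leading residue letter because BLAST-style files commonly repeat the letter at the start of each row.

```python
def parse_matrix(text):
    """Parse a BLAST-style weight matrix into a {(row, column): score} dictionary."""
    rows = [line.split() for line in text.splitlines()
            if line.strip() and not line.lstrip().startswith("#")]
    symbols = rows[0]
    if symbols[-1] != "*":
        raise ValueError("symbol list must end with '*'")
    if len(rows) - 1 != len(symbols):
        raise ValueError("matrix is not square")
    scores = {}
    for row_symbol, row in zip(symbols, rows[1:]):
        # Drop the optional row label before reading the integer scores.
        values = row[1:] if row[0] == row_symbol else row
        if len(values) != len(symbols):
            raise ValueError("matrix is not square")
        for col_symbol, value in zip(symbols, values):
            scores[row_symbol, col_symbol] = int(value)
    return scores

toy = """# toy two-residue matrix, for illustration only
A  R  *
A  4 -1 -1
R -1  5 -1
* -1 -1 -1
"""
print(parse_matrix(toy)["A", "R"])
```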
	552
	553	>>HELP 9 << Help for command line parameters
	554	DATA (sequences)
	555
	556	-INFILE=file.ext :input sequences.
	557	-PROFILE1=file.ext and -PROFILE2=file.ext :profiles (old alignment).
	558
	559
	560	VERBS (do things)
	561
	562	-OPTIONS :list the command line parameters
	563	-HELP or -CHECK :outline the command line params.
	564	-ALIGN :do full multiple alignment
	565	-TREE :calculate NJ tree.
	566	-BOOTSTRAP(=n) :bootstrap a NJ tree (n= number of bootstraps; def. = 1000).
	567	-CONVERT :output the input sequences in a different file format.
	568
	569
	570	PARAMETERS (set things)
	571
	572	*General settings:*
	573	-INTERACTIVE :read command line, then enter normal interactive menus
	574	-QUICKTREE :use FAST algorithm for the alignment guide tree
	575	-TYPE= :PROTEIN or DNA sequences
	576	-NEGATIVE :protein alignment with negative values in matrix
	577	-OUTFILE= :sequence alignment file name
	578	-OUTPUT= :GCG, GDE, PHYLIP, PIR or NEXUS
	579	-OUTORDER= :INPUT or ALIGNED
	580	-CASE :LOWER or UPPER (for GDE output only)
	581	-SEQNOS= :OFF or ON (for Clustal output only)
[1754]	582	-SEQNO_RANGE=:OFF or ON (NEW: for all output formats)
	583	-RANGE=m,n :sequence range to write starting m to m+n.
[176]	584
	585	*Fast Pairwise Alignments:*
	586	-KTUPLE=n :word size
	587	-TOPDIAGS=n :number of best diags.
	588	-WINDOW=n :window around best diags.
	589	-PAIRGAP=n :gap penalty
	590	-SCORE :PERCENT or ABSOLUTE
	591
	592
	593	*Slow Pairwise Alignments:*
	594	-PWMATRIX= :Protein weight matrix=BLOSUM, PAM, GONNET, ID or filename
	595	-PWDNAMATRIX= :DNA weight matrix=IUB, CLUSTALW or filename
	596	-PWGAPOPEN=f :gap opening penalty
	597	-PWGAPEXT=f :gap extension penalty
	598
	599
	600	*Multiple Alignments:*
	601	-NEWTREE= :file for new guide tree
	602	-USETREE= :file for old guide tree
	603	-MATRIX= :Protein weight matrix=BLOSUM, PAM, GONNET, ID or filename
	604	-DNAMATRIX= :DNA weight matrix=IUB, CLUSTALW or filename
	605	-GAPOPEN=f :gap opening penalty
	606	-GAPEXT=f :gap extension penalty
	607	-ENDGAPS :no end gap separation pen.
	608	-GAPDIST=n :gap separation pen. range
	609	-NOPGAP :residue-specific gaps off
	610	-NOHGAP :hydrophilic gaps off
	611	-HGAPRESIDUES= :list hydrophilic res.
	612	-MAXDIV=n :% ident. for delay
	613	-TYPE= :PROTEIN or DNA
	614	-TRANSWEIGHT=f :transitions weighting
	615
	616
	617	*Profile Alignments:*
	618	-PROFILE :Merge two alignments by profile alignment
	619	-NEWTREE1= :file for new guide tree for profile1
	620	-NEWTREE2= :file for new guide tree for profile2
	621	-USETREE1= :file for old guide tree for profile1
	622	-USETREE2= :file for old guide tree for profile2
	623
	624
	625	*Sequence to Profile Alignments:*
	626	-SEQUENCES :Sequentially add profile2 sequences to profile1 alignment
	627	-NEWTREE= :file for new guide tree
	628	-USETREE= :file for old guide tree
	629
	630
	631	*Structure Alignments:*
	632	-NOSECSTR1 :do not use secondary structure-gap penalty mask for profile 1
	633	-NOSECSTR2 :do not use secondary structure-gap penalty mask for profile 2
	634	-SECSTROUT=STRUCTURE or MASK or BOTH or NONE :output in alignment file
	635	-HELIXGAP=n :gap penalty for helix core residues
	636	-STRANDGAP=n :gap penalty for strand core residues
	637	-LOOPGAP=n :gap penalty for loop regions
	638	-TERMINALGAP=n :gap penalty for structure termini
	639	-HELIXENDIN=n :number of residues inside helix to be treated as terminal
	640	-HELIXENDOUT=n :number of residues outside helix to be treated as terminal
	641	-STRANDENDIN=n :number of residues inside strand to be treated as terminal
	642	-STRANDENDOUT=n:number of residues outside strand to be treated as terminal
	643
	644
	645	*Trees:*
	646	-OUTPUTTREE=nj OR phylip OR dist OR nexus
	647	-SEED=n :seed number for bootstraps.
	648	-KIMURA :use Kimura's correction.
	649	-TOSSGAPS :ignore positions with gaps.
	650	-BOOTLABELS=node OR branch :position of bootstrap values in tree display
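
When the program is driven from a script, the options listed above can be assembled into an argument list. The Python sketch below only builds the list and does not run anything. The executable name `clustalw`, the file names and the penalty values are assumptions for illustration; the option names are taken from the lists above.

```python
def clustalw_command(infile, outfile, output_format="PHYLIP",
                     gap_open=None, gap_ext=None, executable="clustalw"):
    """Build an argument list for a full multiple alignment run."""
    args = [executable, f"-INFILE={infile}", "-ALIGN",
            f"-OUTPUT={output_format}", f"-OUTFILE={outfile}"]
    # Optional parameters are added only when the caller sets them.
    if gap_open is not None:
        args.append(f"-GAPOPEN={gap_open}")
    if gap_ext is not None:
        args.append(f"-GAPEXT={gap_ext}")
    return args

print(" ".join(clustalw_command("seqs.fasta", "seqs.phy", gap_open=10, gap_ext=0.2)))
```

The resulting list can be passed to `subprocess.run` if the program is installed under that name.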
	651
	652	>>HELP 0 << Help for tree output format options
	653
[10842]	654	Four output formats are offered:
	655	1) Clustal,
	656	2) Phylip,
	657	3) Just the distances
	658	4) Nexus
[176]	659
	660	None of these formats displays the results graphically. Many packages can
	661	display trees in the PHYLIP format (format 2 below). It can also be imported into
	662	the PHYLIP programs RETREE, DRAWTREE and DRAWGRAM for graphical display.
	663	NEXUS format trees can be read by PAUP and MacClade.
	664
[10842]	665	1) Clustal format output.
[176]	666
[10842]	667	This format is verbose and lists all of the distances between the sequences and
	668	the number of alignment positions used for each. The tree is described at the
	669	end of the file. It lists the sequences that are joined at each alignment step
	670	and the branch lengths. After two sequences are joined, the pair is referred to later
	671	as a NODE. The number of a NODE is the number of the lowest sequence in that
	672	NODE.
	673
[176]	674	2) Phylip format output.
	675
[10842]	676	This format is the New Hampshire format, used by many phylogenetic analysis
	677	packages. It consists of a series of nested parentheses, describing the
	678	branching order, with the sequence names and branch lengths. It can be used by
	679	the RETREE, DRAWGRAM and DRAWTREE programs of the PHYLIP package to see the
	680	trees graphically. This is the same format used during multiple alignment for
	681	the guide trees.
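
To show what the nested-parentheses format looks like, the Python sketch below pulls the sequence names and their branch lengths out of a small tree string. The tree and the helper `leaf_branch_lengths` are made up for illustration. The regular expression handles only simple names without quotes or embedded punctuation and is not a full New Hampshire parser.

```python
import re

def leaf_branch_lengths(newick):
    """Return {sequence_name: branch_length} for the leaves of a tree string."""
    # A leaf follows "(" or "," and has the form name:length.
    pattern = re.compile(r"[(,]\s*([^(),:;\s]+)\s*:\s*([-+0-9.eE]+)")
    return {name: float(length) for name, length in pattern.findall(newick)}

tree = "(seqA:0.10,seqB:0.20,(seqC:0.05,seqD:0.07):0.30);"
print(leaf_branch_lengths(tree))
```

The length 0.30 after the closing parenthesis belongs to the internal branch joining seqC and seqD, so it is not reported as a leaf.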
	682
	683	Use this format with NJplot (Manolo Gouy), supplied with Clustal W. Some other
	684	packages that can read and display New Hampshire format are TreeView (Mac/PC),
	685	TreeTool (UNIX), and Phylowin.
[176]	686
	687	3) The distances only.
	688
[10842]	689	This format just outputs a matrix of all the pairwise distances in a format
	690	that can be used by the Phylip package. It used to be useful when one could not
	691	produce distances from protein sequences in the Phylip package but is now
	692	redundant (Protdist of Phylip 3.5 now does this).
[176]	693
[10842]	694	4) NEXUS FORMAT TREE.
	695
	696	This format is used by several popular phylogeny programs,
	697	including PAUP and MacClade. The format is described fully in:
	698	Maddison, D. R., D. L. Swofford and W. P. Maddison. 1997.
	699	NEXUS: an extensible file format for systematic information.
	700	Systematic Biology 46:590-621.
	701
[176]	702	5) TOGGLE PHYLIP BOOTSTRAP POSITIONS
	703
[10842]	704	By default, the bootstrap values are placed on the nodes of the phylip format
	705	output tree. This is inaccurate as the bootstrap values should be associated
	706	with the tree branches and not the nodes. However, this format can be read and
	707	displayed by TreeTool, TreeView and Phylowin. An option is available to
	708	correctly place the bootstrap values on the branches with which they are
	709	associated.
	710
