FASTA.help

Last change on this file was 2, checked in by oldcode, 25 years ago
Initial revision
Property svn:eol-style set to `native` Property svn:keywords set to `Author Date Id Revision`
File size: 44.8 KB

5	FASTA.DOC Release 1.6
6
7
8
9	COPYRIGHT NOTICE
10
11	Copyright 1988, 1991, 1992 by William R. Pearson and the
12	University of Virginia. All rights reserved. The FASTA program
13	and documentation may not be sold or incorporated into a
14	commercial product, in whole or in part, without written consent
15	of William R. Pearson and the University of Virginia. For
16	further information regarding permission for use or reproduction,
17	please contact: William R. Wilkerson, Assistant Provost for
18	Research, University of Virginia, P.O. Box 9025, Charlottesville,
19	VA 22906-9025, (804) 924-6853
20
21
22	The FASTA program package
23
24	Introduction
25
26	This documentation describes version 1.6c of the FASTA
27	program package (see W. R. Pearson and D. J. Lipman (1988),
28	"Improved Tools for Biological Sequence Analysis", PNAS 85:2444-
29	2448, and W. R. Pearson (1990) "Rapid and Sensitive Sequence
30	Comparison with FASTP and FASTA" Methods in Enzymology 183:63-
31	98). Version 1.6 is the first release for the IBM-PC and
32	Macintosh since version 1.4 (version 1.5 was distributed only via
33	ftp to unix machines). Version 1.6 has a large number of
34	improvements over versions 1.4 and 1.5, including the ability to
35	search libraries in several different formats in the same run,
36	more robust algorithms for aligning sequences along a band, and
37	additional, rigorous (but slow) programs for sequence searching,
38	statistical analysis, and local sequence alignment. In addition,
39	several additional options are included. Programs that are new
40	with version 1.6 are highlighted in italics.
41
42
43	Although there are a large number of programs in this package,
44	they belong to three groups:
45
46
47	Library search programs: FASTA, TFASTA, SSEARCH
48
49	Local homology programs: LFASTA, PLFASTA, LALIGN, PLALIGN
50
51	Statistical significance: RDF2, RELATE, RSS
52
53
54	In addition, there are several programs for other sequence
55	analysis tasks:
56
57
58	ALIGN - global alignment of two sequences (no limit on gaps).
59
60	EXTRACTP, SINDEX - programs to index (SINDEX) and extract sequences
74	from a protein sequence database.
75
76	EXTRACTN - program to extract sequences from the GenBank floppy disk
77	format data base.
78
79
80	In addition, I have included several programs for protein
81	sequence analysis, including a Kyte-Doolittle hydropathicity
82	plotting program (GREASE, TGREASE), and a secondary structure
83	prediction package (GARNIER).
84
85	The FASTA sequence comparison programs on this disk are
86	improved versions of the FASTP program, originally described in
87	Science (Lipman and Pearson, (1985) Science 227:1435-1441). We
88	have made several improvements. First, the library search
89	programs use a more sensitive method for the initial comparison
90	of two sequences which allows the scores of several similar
91	regions to be combined. As a result, a library search now
92	reports three scores: initn (the new initial
93	score which may include several similar regions), init1 (the old
94	fastp initial score from the best initial region), and opt (the
95	old fastp optimized score allowing gaps in a 32 residue wide
96	band).
97
98	These programs have also been modified to become "universal"
99	(hence FAST-A, for FAST-All, as opposed to FAST-P (protein) or
100	FAST-N (nucleotides)); by changing the environment variable
101	SMATRIX, the programs can be used to search protein sequences,
102	DNA sequences, or whatever you like. By default, FASTA, LFASTA,
103	and the RDF programs automatically recognize protein and DNA
104	sequences. Sequences are first read as amino acids, and then
105	converted to nucleotides if the sequence is greater than 85%
106	A,C,G,T (the '-n' option can be used to indicate DNA sequences).
107	TFASTA compares protein sequences to a translated DNA sequence.
108	Alternative scoring matrices can also be used. In addition to
109	the PAM250 matrix for proteins, matrices based on simple
110	identities or the genetic code can also be used for sequence
111	comparisons or evaluation of significance. Several different
112	protein sequence matrices have been included; instructions for
113	constructing your own scoring matrix are included in the file
114	FORMAT.DOC.
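
As an illustration of the automatic protein/DNA recognition described
above, here is a minimal Python sketch. Only the "greater than 85%
A,C,G,T" rule and the existence of the '-n' override come from the
text; the function name guess_sequence_type, the case handling, and
the choice to count only alphabetic characters are assumptions, not
the behavior of the FASTA source code.

```python
def guess_sequence_type(sequence, threshold=0.85, force_dna=False):
    """Return "DNA" or "protein" for a raw sequence string.

    Assumptions (not taken from the FASTA source): only letters are
    counted, case is ignored, and an empty sequence counts as protein.
    """
    if force_dna:  # plays the role of the '-n' command line option
        return "DNA"
    letters = [c for c in sequence.upper() if c.isalpha()]
    if not letters:
        return "protein"
    acgt = sum(1 for c in letters if c in "ACGT")
    # The text says "greater than 85%", so the comparison is strict.
    return "DNA" if acgt / len(letters) > threshold else "protein"
```

A short protein that happens to be rich in A, C, G and T would be
misclassified by any rule of this kind, which is presumably why the
'-n' option exists.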
115
116
117	The remainder of this document is divided into three sections:
118	(1) a brief history of the changes to the FASTA package; (2) a
119	guide to installing the programs and databases; (3) a guide to
120	using the FASTA programs. The programs are very easy to use, so
121	if you are using them on a machine that is administered by
122	someone else, you may want to skip to section (3) to learn how to
123	use the programs, and then read section (1) to look at some of
124	the more recent changes. If you are installing the programs on
125	your own machine, you will need to read section (2) carefully.
126
139	1. Revision History
140
141	1.1. Changes with version 1.6
142
143	FASTA version 1.6 uses a new method for calculating optimal
144	scores in a band (the optimization or last step in the FASTA
145	algorithm). In addition, it uses a linear-space method for
146	calculating the actual alignments. The FASTA package also
147	includes four new programs:
148
149	SSEARCH a program to search a sequence database using the
150	rigorous Smith-Waterman algorithm (this program is
151	about 100-fold slower than FASTA with ktup=2 for
152	proteins).
153
154	RSS a version of RDF2 that uses a rigorous Smith-Waterman
155	calculation to score similarities.
156
157	LALIGN A rigorous local sequence alignment program that will
158	display the N-best local alignments (N=10 by default).
159
160	PLALIGN a version of lalign that plots the local alignments.
161
162	The LALIGN/PLALIGN programs incorporate the "sim" algorithm
163	described by Huang and Miller (1991) Adv. Appl. Math. 12:337-357.
164	The SSEARCH and RSS programs incorporate algorithms described by
165	Huang, Hardison, and Miller (1990) CABIOS 6:373-381.
166
167	LFASTA and PLFASTA now calculate a different number of local
168	similarities; they now behave more like LALIGN/PLALIGN. Since
169	local alignments of identical sequences produce "mirror-image"
170	alignments, lalign and lfasta consider only one-half of the
171	potential alignments between sequences from identical file names.
172	Thus
173
174	lfasta mchu.aa mchu.aa
175
176	displays only two alignments; with earlier versions of the
177	program, it would have displayed five, including the identity
178	alignment. PLFASTA does display five alignments; when two
179	identical filenames are given, it draws the identity alignment,
180	calculates the two unique local alignments, draws them, and draws
181	their mirror images. LFASTA/PLFASTA and LALIGN/PLALIGN use the
182	filenames, rather than the actual sequences, to determine whether
183	sequences are identical; you can "trick" the programs into
184	behaving the old way by putting the same sequence in two
185	different files.
186
187	1.2. Changes with version 1.5
188
189	FASTA version 1.5 includes a number of substantial revisions
190	to improve the performance and sensitivity of the program. It is
191	now possible to tell the program to optimize all of the initn
205	scores greater than a threshold. The threshold is set at the
206	same value as the old FASTA cutoff score (approximately 0.5
207	standard deviations above the mean for average length sequences).
208	For highest sensitivity, you can use the -c 1 option to set the
209	threshold to 1. (This will slow the search down about 5-fold).
210	Alternatively, you can tell FASTA to sort the results by the
211	init1, rather than the initn, score by using the -1 option.
212	FASTA -1 ... will report the results the way the older FASTP
213	program did. A comparison of the performance of FASTA in this,
214	its slowest mode, with the standard FASTA and the Smith-Waterman
215	algorithm has been published in Genomics (1991) 11:635-650.
216
217	A new method has been provided for selecting libraries. In
218	the past, one could enter the name of a sequence file to be
219	searched or a single letter that would specify a library from the
220	list included in the $FASTLIBS file. Now, you can specify a set
221	of library files with a string of letters preceded by a '%'.
222	Thus, if the FASTLIBS file has the lines:
223
224
225	Genbank 70 primates$1P/seqlib/gbpri.seq
226	Genbank 70 rodents$1R/seqlib/gbrod.seq
227	Genbank 70 other mammals$1M/seqlib/gbmam.seq
228	Genbank 70 vertebrates $1B/seqlib/gbvrt.seq
229
230
231	Then the string: "%PRMB" would tell FASTA to search the four
232	libraries listed above. The %PRMB string can be entered either
233	on the command line or when the program asks for a filename or
234	library letter.
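
The library selection just described can be sketched in Python. This
is an illustration only: parse_fastlibs and resolve_selection are
hypothetical names, and the treatment of malformed lines and of
unknown letters (a KeyError here) is an assumption. The line layout
(description, '$', a 0/1 type digit, a selection letter, then the file
name) follows the FASTLIBS examples in this document. Libraries are
keyed by type as well as letter, because the same letter may be used
for a protein and a DNA library.

```python
def parse_fastlibs(lines):
    """Map (is_dna, letter) to (description, file_spec) for each line."""
    libraries = {}
    for line in lines:
        line = line.rstrip("\n")
        if "$" not in line:
            continue  # skip blank or malformed lines (an assumption)
        description, rest = line.split("$", 1)
        if len(rest) < 3:
            continue  # need a type digit, a letter and a file name
        is_dna = rest[0] == "1"  # 0 = protein library, 1 = DNA library
        letter = rest[1]
        libraries[(is_dna, letter)] = (description.strip(), rest[2:])
    return libraries


def resolve_selection(selection, libraries, dna):
    """Expand "%PRMB", a single letter, or a file name into file specs."""
    if selection.startswith("%"):
        letters = selection[1:]
    elif len(selection) == 1:
        letters = selection
    else:
        return [selection]  # anything longer is taken as a file name
    return [libraries[(dna, letter)][1] for letter in letters]
```

With the four Genbank lines shown above, resolving "%PRMB" for a DNA
search yields the four gb*.seq file names in the order P, R, M, B.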
235
236	FASTA 1.5 also provides additional flexibility for specifying
237	the number of results and alignments to be displayed with the -Q
238	(quiet) option. The -b number option allows you to specify the
239	number of sequence scores to show when the search is finished.
240	Thus
241
242
243	FASTA -b 100 ...
244
245
246	tells the program to display the top 100 sequence scores. In the
247	past, if you displayed 100 scores (in -Q mode), you would also
248	have to store 100 alignments. The -d option allows you to limit the
249	number of alignments shown. FASTA -b 100 -d 20 would show 100
250	scores and 20 alignments.
251
252	The old CUTOFF parameter is no longer used. The program
253	stores the best 2000 (IBM-PC, MAC) or 6000 (Unix, VMS) scores and
254	then throws out the lowest 25%, stores the next 500 (1500) scores better
255	than the threshold determined when the first scores were
256	discarded, and repeats the process as the library is scanned. As
257	a result, the best 1500 - 2000 (4500 - 6000) scores are saved.
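
The score-saving scheme described above can be sketched as follows.
This is a simplified Python model, not the FASTA implementation: the
name scan_scores is hypothetical, and it is an assumption that the
threshold becomes the lowest score retained after each pruning step
and that scores equal to the threshold are discarded.

```python
def scan_scores(scores, capacity=2000, drop_fraction=0.25):
    """Keep roughly the best `capacity` scores while scanning a library.

    Returns (saved scores, highest first; final threshold or None).
    """
    kept = []
    threshold = None  # no threshold until the buffer fills once
    for score in scores:
        if threshold is not None and score <= threshold:
            continue  # not better than the current threshold
        kept.append(score)
        if len(kept) == capacity:
            # Buffer full: throw out the lowest 25% and raise the threshold.
            kept.sort(reverse=True)
            kept = kept[:capacity - int(capacity * drop_fraction)]
            threshold = kept[-1]
    return sorted(kept, reverse=True), threshold
```

With capacity=2000 the buffer always holds between 1500 and 2000
scores once it has filled, which matches the range quoted above.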
258
271	The old cut-off parameter was also used to set the joining
272	threshold for the calculation of the initn score from initial
273	regions. This joining threshold can now be set with the -g
274	option or with the GAPCUT parameter.
275
276	Finally, FASTA can provide a complete list of all of the
277	sequences and scores calculated to a file with the -r (results)
278	option. FASTA -r results.out ... creates a file with a list of
279	scores for every sequence in the library. The list is not
280	sorted, and only includes those scores calculated during the
281	initial scan of the library (the optimized score is not
282	calculated unless the -o option is used).
283
284	2. Installing the FASTA package
285
286	2.1. Installing the programs
287
288	2.1.1. IBM-PC/DOS version
289
290	For the IBM-PC/DOS version, the FASTA source code disk
291	contains the complete source code to all of the programs on the
292	other disks. The programs were compiled with Borland's Turbo
293	'C++', using Borland's MAKE utility. The graphics programs
294	(PLFASTA, TGREASE) use the graphics device drivers supplied with
295	the Turbo 'C' V2.0 package. Also included are the documentation
296	files PROGRAMS.DOC and FORMAT.DOC. You do not need any of the
297	files on the source code disk to run the programs. The files on
298	this disk are identical to the UNIX and VMS versions that run on
299	larger machines. Also included is the code to compile
300	ALIGN0.EXE. ALIGN0 is the same as ALIGN, but does not penalize
301	for end-gaps.
302
303	If you have the DOS or Macintosh version of the FASTA
304	package, to install the programs you should:
305
306	(1) Make a new directory (folder) for the FASTA programs. This
307	need not be the same as the directory for your sequence
308	databases.
309
310	(2) Copy the files from the FASTA source disk to the new
311	directory.
312
313	(3) (DOS only) Edit your AUTOEXEC.BAT file to (a) modify your
314	PATH command to include the FASTA directory and (b) add the
315	line:
316
317	set FASTLIBS=c:\yourfastadirectory\fastgbs
318
319	On the Macintosh, you may need to edit the "environment"
320	file and change the line that reads:
321
322	FASTLIBS=fastgbs
323
336	to indicate the full directory path for the fastgbs file,
337	for example:
338
339	FASTLIBS=Q105:FASTA:fastgbs
340
341
342	(4) Finally, you will need to edit the fastgbs file. This is
343	usually the most confusing part of the installation. An
344	example of this file is shown below; to customize this file
345	for your machine, you will need to change the file names
346	from those provided in the fastgbs file to ones that reflect
347	the directory names and file names you use on your machine.
348	This is explained in more detail below. In addition, some
349	entries in the fastgbs file refer to other files of file
350	names. These files of file names (as opposed to actual
351	database files) may also need to be edited.
352
353	2.1.2. Unix version
354
355	The FASTA distribution comes with several makefiles that
356	can be used to compile the FASTA programs. Over the years, as
357	AT&T Unix System V and BSD Unix have converged, these files have
358	become very similar. To begin with, I recommend using the
359	standard Makefile. There are two values in the makefile that
360	should be checked against the values used on your system: the HZ
361	value, which is the frequency in ticks per second used by the
362	times() system call and can usually be found by running:
363
364	grep HZ /usr/include/sys/*
365
366	and the functions available to return random numbers. If you
367	have a rand48() function that returns a 32-bit random number, use
368	it and use the lines:
369
370	NRAND=nrand48
371	RANFLG= -DRAND32
372
373	If not, you will need to use the rand() function call and
374	determine whether it returns a 16-bit or a 32-bit value. These
375	functions are used by RDF2 and RSS. If you have problems
376	compiling the programs, you may want to examine the makefile.unx
377	and makefile.sun files, to look for differences. I have tried to
378	use very standard unix functions in these programs, and they have
379	been successfully compiled, with very small changes to the
380	Makefile, on Suns (SunOS 4.1), IBM RS/6000s (AIX), and MIPS
381	machines (under the BSD environment).
382
383	2.2. Installing the libraries
384
385	2.2.1. The NBRF protein sequence library
386
387	The FASTA program package does not include any protein or
388	DNA sequence libraries. You can obtain the PIR protein sequence
402	database from:
403
404	National Biomedical Research Foundation
405	Georgetown University Medical Center
406	3900 Reservoir Rd, N.W.
407	Washington, D.C. 20007
408
409	In addition, this database is available via anonymous ftp from
410	the host "ftp.bchs.uh.edu". It is available in two formats, VMS
411	and CODATA format. The "VMS" format (library type 5 below) can
412	be searched much faster, can be easily reformatted for use by the
413	"BLAST" rapid searching program, and is compatible with the
414	Genetics Computer Group package of programs. The CODATA format
415	is used by the EUGENE/MBIR computing package from Baylor (library
416	type 2).
417
418	(DOS/Macintosh users) The SINDEX and EXTRACTP programs now
419	allow you to index a file in one subdirectory, and then move the
420	library without having to remake the index. When you type:
421	SINDEX @prot.nam, two index files are created: PROT.IXX and
422	PROT.INX. PROT.IXX is a binary file that cannot be edited; it
423	contains the offsets into the library files for each of the
424	sequence entries. PROT.INX looks exactly like the original
425	PROT.NAM file, and can be edited. However, you cannot change the
426	order of the library files in PROT.INX. What you can do is
427	change the first line, which indicates the directory where the
428	library files can be found. The index in PROT.IXX might tell
429	EXTRACTP to find the entry LCBO at offset 123,456 in the PROT.3
430	file. If you changed the PROT.3 line in PROT.INX to PROT.4, LCBO
431	would not be extracted properly. However, if you decide to move
432	your library files from disk /usr/tmp to disk /usr/lib, you can
433	edit PROT.INX to reflect this change.
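
The idea behind the index files can be illustrated with a short Python
sketch. It is not the SINDEX/EXTRACTP implementation: the real
PROT.IXX layout is not described here, so this example keeps the index
in memory, and the names build_index and extract are hypothetical. It
assumes library files in which each entry starts with a '>' line whose
first word is the identifier. Because each index entry records only a
file number and a byte offset, the directory can change freely, but
the order of the files cannot.

```python
import os
import tempfile


def build_index(paths):
    """Map each identifier to (file number, byte offset of its '>' line)."""
    index = {}
    for file_no, path in enumerate(paths):
        with open(path, "rb") as handle:
            offset = handle.tell()
            line = handle.readline()
            while line:
                fields = line[1:].split()
                if line.startswith(b">") and fields:
                    index[fields[0].decode("ascii")] = (file_no, offset)
                offset = handle.tell()
                line = handle.readline()
    return index


def extract(index, paths, ident):
    """Seek straight to an entry and return its text, header included."""
    file_no, offset = index[ident]
    with open(paths[file_no], "rb") as handle:
        handle.seek(offset)
        lines = [handle.readline()]        # the '>' header line
        for line in handle:
            if line.startswith(b">"):      # start of the next entry
                break
            lines.append(line)
    return b"".join(lines).decode("ascii")


if __name__ == "__main__":
    # Two tiny made-up library files; the sequences are placeholders.
    directory = tempfile.mkdtemp()
    paths = [os.path.join(directory, name) for name in ("prot.0", "prot.1")]
    with open(paths[0], "wb") as handle:
        handle.write(b">LCBO first\nAAA\nCCC\n>HBHU second\nGGG\n")
    with open(paths[1], "wb") as handle:
        handle.write(b">CCHU third\nTTT\n")
    index = build_index(paths)
    print(index["HBHU"])
    print(extract(index, paths, "CCHU"), end="")
```

Moving the library corresponds to rebuilding the list of paths with a
new directory while keeping the index unchanged; swapping two files in
the list would make the stored offsets point into the wrong file.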
434
435	EXTRACTP has also been updated to use the new indexing
436	scheme. To extract sequences from a multi-file library that you
437	made with SINDEX @prot.nam, type: EXTRACTP @prot.nam, or set the
438	environment variable AABANK=@prot.nam. Then enter the protein
439	sequence identifiers as before. Remember, if you move the
440	library into a different directory, you will need to copy both
441	the .IXX and .INX files to use EXTRACTP. You can test EXTRACTP
442	by trying to extract the PIR sequences LCBO, HBHU, or CCHU. If
443	you do not get an error message, the sequences were successfully
444	extracted. They are automatically saved to a file with the name
445	"sequence.aa". So "LCBO" would be found in "lcbo.aa". When you
446	need to extract a sequence from the NEW.LIB library, you will
447	have to set AABANK=new.lib.
448
449	2.2.2. The GENBANK DNA sequence library
450
451	FASTA, TFASTA, and EXTRACTN search and extract sequences
452	from the GENBANK DNA sequence library in its compressed, floppy
453	disk format. This library is available from:
454
467	GENBANK
468	c/o Intelligenetics
469	700 E. El Camino Real
470	Mountain View, CA 94040
471	(415) 962-7300
472
473	(The GBANN program was used to extract DNA sequence annotations.
474	Unfortunately, GBANN has not been updated since release 63.0 of
475	GENBANK, when some changes in the annotation files were made.
476	GBANN no longer works.)
477
478	The GenBank DNA sequence library is also available via
479	anonymous FTP from genbank.bio.net.
480
481	2.2.3. The EMBL CD-ROM libraries
482
483	The European Molecular Biology Laboratory (EMBL) is
484	distributing a CD-ROM that contains both the complete EMBL DNA
485	sequence database (which should be essentially identical to the
486	GenBank DNA sequence database) and the SWISS-PROT protein
487	sequence database. SWISS-PROT is derived from the NBRF Protein
488	sequence database with additions from the EMBL DNA sequence
489	database. This CD-ROM is a "best-buy," since it provides both
490	DNA and protein sequence libraries. It is available from:
491
492
493	EMBL Data Library
494	Meyerhofstr. 1
495	D-6900 Heidelberg
496	Germany
497	+49 6221 387258
498	Email: SOFTWARE@EMBL-Heidelberg.DE
499
500
501
502	In addition, the SWISS-PROT protein sequence database is
503	available via anonymous FTP from the hosts genbank.bio.net and
504	ncbi.nlm.nih.gov.
505
506	2.3. Finding the libraries: FASTLIBS
507
508	FASTA and TFASTA use the environment variable FASTLIBS to
509	find the protein and DNA sequence libraries. The FASTLIBS
510	variable contains the name of a file that has the actual
511	filenames of the libraries. The FASTGBS file is an example of
512	a file that can be referred to by FASTLIBS. To use the FASTGBS
513	file, type:
514
515	setenv FASTLIBS /usr/lib/fasta/fastgbs (BSD UNIX)
516	or
517	FASTLIBS=/usr/lib/fasta/fastgbs; export FASTLIBS (SysV UNIX)
518
519	Then edit the FASTGBS file to indicate where the protein and DNA
533	sequence libraries can be found. If you have a hard disk and
534	your protein sequence library is kept in the file
535	/usr/lib/seq/aabank.lib and your Genbank DNA sequence library is kept
536	in the directory: /usr/lib/genbank, then fastgbs might contain:
537
538	NBRF Protein$0P/usr/lib/seq/aabank.lib 0
539	SWISS PROT 10$0S/usr/lib/vmspir/swiss.seq 5
540	GB Primate$1P@/usr/lib/genbank/gpri.nam
541	GB Rodent$1R@/usr/lib/genbank/grod.nam
542	GB Mammal$1M@/usr/lib/genbank/gmammal.nam
543	^   1    ^^^^             4             ^ ^
544	          23                             (5)
545
546	The first line of this file says that a copy of the NBRF
547	protein sequence database (which is a protein database) is kept
548	in the file /usr/lib/seq/aabank.lib and can be selected by typing
549	"P" on the command line or when the database menu is presented.
550
551	Note that there are 4 or 5 fields in the lines in fastgbs.
552	The first field is the description of the library which will be
553	displayed by FASTA; it ends with a '$'. The second field (1
554	character), is a 0 if the library is a protein library and 1 if
555	it is a DNA library. The third field (1 character) is the
556	character to be typed to select the library.
557
558	The fourth field is the name of the library file. In the
559	example above, the /usr/lib/seq/aabank.lib file contains the
560	entire protein sequence library. However the DNA library file
561	names are preceded by a '@', because these files (gpri.nam,
562	grod.nam, gmammal.nam) do not contain the sequences; instead they
563	contain the names of the files which contain the sequences. This is done
564	because the GENBANK DNA database is broken down into a large
565	number of smaller files. In order to search the entire primate
566	database, you must search more than a dozen files.
567
568	In addition, an optional fifth field can be used to specify
569	the format of the library file. Alternatively, you can specify
570	the library format in a file of file names (a file preceded by an
571	'@'). This field must be separated from the file name by a space
572	character (' '). In the example above, the
573	aabank.lib file is in Pearson/FASTA format, the swiss.seq
574	file is in PIR/VMS format (from the EMBL CD-ROM), and the DNA
575	sequences are in compressed GenBank format. No file type number
576	is included for the Genbank files, because it is included in the
577	file of filenames (see below). Currently, FASTA can read the
578	following formats:
579
580	0 Pearson/FASTA (>SEQID - comment/sequence)
581	1 Uncompressed Genbank (LOCUS/DEFINITION/ORIGIN)
582	2 NBRF CODATA (ENTRY/SEQUENCE)
583	3 EMBL/SWISS-PROT (ID/DE/SQ)
584	4 Intelligenetics (;comment/SEQID/sequence)
585	5 NBRF/PIR VMS (>P1;SEQID/comment/sequence)
599	9 Compressed Genbank Floppy format
600
601	(In the near future, I hope to support the BLAST formats.) In
602	particular, this version will work with the EMBL and PIR VMS
603	formats that are distributed on the EMBL CD-ROM. The latter
604	format (PIR VMS) is much faster to search than EMBL format. If a
605	library format is not specified, for example, because you are
606	just comparing two sequences, Pearson/FASTA (format 0) is used by
607	default. To change this default, you may set the LIBTYPE
608	environment variable to a number. For example,
609
610	setenv LIBTYPE 1
611
612	would cause the program to use the GenBank LOCUS format by
613	default for libraries (or the second sequence file), but the
614	Pearson/FASTA format would still be used for the query sequence.
615
616	You can specify a group of library files by putting a '@'
617	symbol before a file that contains a list of file names to be
618	searched. For example, if @gpri.nam is in the fastgbs file, the
619	file "gpri.nam" might contain the lines:
620
621	</usr/lib/genbank
622	>glocus.idx
623	gpri1.seq
624	gpri2.seq
625	gpri12.seq
626
627	In this case, the line beginning with a '<' indicates the
628	directory the files will be found in. The line beginning with a
629	'>' indicates the index file; this is only used for the GENBANK
630	compressed DNA database. The remaining lines name the actual
631	sequence files. So the first sequence file to be searched would
632	be:
633
634	/usr/lib/genbank/gpri1.seq
635
636	The notation "<PIRNAQ:" might be used under the VAX/VMS operating
637	system. Under UNIX, the trailing '/' is left off, so the library
638	directory might be written as "</usr/seqlib". In addition, when
639	using the floppy disk version of GENBANK, annotation files are
640	also required. These files (*.ano) should be placed in the same
641	directory as the *.seq files.
642
643	With version 1.4 of the FASTA package, the FASTA and TFASTA
644	programs can search a library composed of different files in
645	different sequence formats. For example, you may wish to search
646	the Genbank files (which are in compressed floppy format) and the
647	EMBL DNA sequence database on CD-ROM. To do this, you simply
648	list the names and filetypes of the files to be searched in a
649	file of filenames. For example, to search the mammalian portion
650	of Genbank, the unannotated portion of Genbank, and the
651	unannotated portion of the EMBL library, you could use the file:
652
665	</usr/lib/DNA
666	>glocus.idx
667	gpri1.seq 9
668	gpri2.seq 9
669	...
670	gpri9.seq 9
671	# (this '#' causes the program to display the size of the library)
672	grod1.seq 9
673	...
674	gmam1.seq 9
675	...
676	guna1.seq 9
677	...
678	unanno.seq 5
679	#
680
681
682	You do not need to include library format numbers if you
683	only use the Pearson/FASTA version of the PIR protein
684	sequence library and the Genbank DNA database on floppy
685	disks. If no library type is specified, the program
686	assumes that type 0 is being used (unless you have set
687	LIBTYPE). However, if the program sees an index file line
688	(e.g. ">glocus.idx"), it assumes that the files are in
689	Genbank floppy disk format (type 9).
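
The rules for a file of file names can be summarized in a small Python
sketch. It is an illustration under stated assumptions, not the FASTA
parser: parse_name_file and join_path are hypothetical names, '#'
lines are simply skipped, the default_format argument stands in for
the LIBTYPE setting, and the rule for joining a directory to a file
name (no separator after a trailing ':' or '/') is a guess based on
the "</usr/lib/genbank", "<PIRNAQ:" and "<B:" examples.

```python
def join_path(directory, name):
    """Join directory and file name; "B:" and "PIRNAQ:" need no separator."""
    if not directory or directory.endswith((":", "/")):
        return directory + name
    return directory + "/" + name


def parse_name_file(lines, default_format=0):
    """Return (directory, index file, [(path, format number), ...])."""
    directory = ""
    index_file = None
    entries = []
    for raw in lines:
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # '#' asks for a library summary; skipped in this sketch
        if line.startswith("<"):
            directory = line[1:]
        elif line.startswith(">"):
            index_file = line[1:]
        else:
            fields = line.split()
            if len(fields) > 1 and fields[1].isdigit():
                fmt = int(fields[1])   # explicit library format number
            elif index_file is not None:
                fmt = 9                # index line seen: Genbank floppy format
            else:
                fmt = default_format   # type 0 unless LIBTYPE says otherwise
            entries.append((join_path(directory, fields[0]), fmt))
    return directory, index_file, entries
```

Applied to the gpri.nam example above, the first entry comes out as
/usr/lib/genbank/gpri1.seq with format 9, because the index line
precedes the file names.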
690
691
692	Although FASTA works best when the libraries are saved on a
693	hard disk, this is not required. If you do not have a hard disk,
694	you could refer to the protein database files by making a file
695	"prot.nam" with the lines:
696
697	<B:
698	prot.0
699	prot.1
700	...
701	prot.6
702	# (print library summary)
703	new.0
704	...
705
706	The FASTA program would then look for the files on the B: drive,
707	and when it did not find them, it would allow you to replace the
708	diskette in the drive.
709
710
711	Test the setup by running FASTA. Enter the sequence file
712	'MUSPLFM.AA' when the program requests it (this file is included
713	with the programs). The program should then ask you to select a
714	protein sequence library. Alternatively, if you run the TFASTA
715	program and use the MUSPLFM.AA query sequence, the program should
716	show you a selection of DNA sequence libraries. Once the fastgbs
717	file has been set up correctly, you can set FASTLIBS=fastgbs in
731	your AUTOEXEC.BAT file, and you will not need to remember where
732	the libraries are kept or how they are named.
733
734	The EXTRACTN program extracts DNA sequences or annotations
735	from the GENBANK DNA sequence library in the compressed floppy
736	disk format. To tell EXTRACTN where to find the DNA sequence
737	library and index files, set the environment variable GBLIB.
738
739	setenv GBLIB /usr/lib/genbank
740
741
742	FASTA and TFASTA must open a large number of files when
743	searching and reporting the results of a GENBANK floppy disk
744	format library search. You may have problems with the large
745	number of files under DOS on IBM-PC's (Unix and VMS users will
746	not have these problems). If you are going to search the GENBANK
747	floppy disk format DNA sequence library under DOS, you should add
748	the line:
749
750	FILES=16
751
752	to your CONFIG.SYS file. (Typically this is already done for
753	programs like Windows or WordPerfect.)
754
796	3. Using the FASTA Package
797
798	3.1. Overview
799
800	The FASTA sequence comparison programs all require similar
801	information: the name of a query sequence file, a library file,
802	and the ktup parameter. All of the programs can accept arguments
803	on the command line, or they will prompt for the file names and
804	ktup value.
805
806	To use FASTA, simply type:
807
808	FASTA
809	and you will be prompted for:
810	the name of the test sequence file
811	the name of the library file
812	and whether you want ktup = 1 or 2. (or 1 to 6 for DNA sequences)
813
814	ktup of 2 is about 5 times faster than ktup = 1.
815	For a 200 aa sequence against a 10,000,000 aa
816	library, the program takes about 30 min with
817	ktup = 2, 150 min with ktup = 1, on a 12 MHz 286
818	IBM-PC.
819
820
821	The program can also be run by typing
822
823	FASTA test.aa /lib/bigfile.lib ktup (1 or 2)
824
825
826	Included with the package are the test files, MUSPLFM.AA,
827	LCBO.AA, MCHU.AA and BOVPRL.SEQ. To make certain that
828	everything is working, you can try:
829
830	fasta musplfm.aa lcbo.aa
831	and
832	tfasta musplfm.aa bovprl.seq
833
834	To test the local similarity programs LFASTA and PLFASTA, try:
835
836	lfasta mchu.aa mchu.aa
837	and
838	plfasta mchu.aa mchu.aa (use this only on an IBM-PC with graphics
839	or on a Tektronix terminal under UNIX or VMS)
840
841	MCHU (calmodulin) has four duplicated calcium binding sites that
842	are clearly detected by LFASTA. For a more complicated example,
843	try MWRTC1.aa, myosin heavy chain.
844
845	3.2. Sequence files
846
847	The FASTA programs know about three kinds of sequence files
848	(four under VMS): (1) plain sequence files that can only be used
862	as query sequences or for LFASTA, RDF2, and ALIGN. (2) Standard
863	library files. These are the same as plain sequence files, except that each
864	sequence is preceded by a comment line with a '>' in the first
865	column. (3) distributed sequence libraries (this is a broad class
866	that includes the NBRF/PIR VMS and blocked ascii formats, Genbank
867	flat-file format, EMBL flat-file format, and Intelligenetics
868	format). All of the files that you create should be of type (1)
869	or (2). Type (2) files (ones with a '>' comment line) can be used as query or library
870	sequence files by all of the programs.
871
872	I have included several sample test files, *.AA. The first
873	line may begin with a '>' or ';' followed by a comment. The
874	text after ';' in other lines will be ignored. Spaces and
875	tabs (and anything else that is not an amino-acid code) are
876	ignored.
877
878	Library files should have the form:
879
880	>Sequence name and identifier
881	A F A S Y T .... actual sequence.
882	F S S .... second line of sequence.
883	>Next sequence name and identifier
884
885	This is the form of the PROT.* files supplied with the floppy disk
886	version of the PIR protein sequence library. You can also build
887	your own library by concatenating several sequence files. Just
888	be sure that each sequence is preceded by a line beginning with a
889	'>' with a sequence name.
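
For readers who want to check a home-made library, the layout above
can be read with a few lines of Python. This is a sketch, not the
reader used by the FASTA programs: read_library is a hypothetical
name, and keeping only alphabetic characters (converted to upper
case) is an assumption standing in for "anything that is not an
amino-acid code is ignored". Text after a ';' on a sequence line is
dropped, as described for the sample files.

```python
def read_library(text):
    """Return a list of (title, sequence) pairs from '>' format text."""
    records = []
    title, residues = None, []
    for line in text.splitlines():
        if line.startswith(">"):
            if title is not None:
                records.append((title, "".join(residues)))
            title, residues = line[1:].strip(), []
        elif title is not None:
            body = line.split(";", 1)[0]  # ignore text after ';'
            residues.extend(c for c in body.upper() if c.isalpha())
    if title is not None:
        records.append((title, "".join(residues)))
    return records
```

Lines that appear before the first '>' are ignored by this sketch, so
a plain sequence file without a title line would yield no records.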
890
891	The test file should not have lines longer than 120
892	characters, and sequences entered with word processors should use
893	a document mode, with normal carriage returns at the end of
894	lines.
895
896	Program Summary
897
898	3.3. Sequence search programs
899
900	FASTA universal sequence comparison. Defaults to comparing
901	protein sequences; if the sequences are > 85% A+C+G+T
902	or the -n option is used, a DNA sequence is assumed.
903
904	TFASTA Search DNA library for a protein sequence by
905	translating the DNA sequence to protein in all six
906	frames (three forward frames with the -3 command line
907	option). TFASTA with ktup=2 is about as fast as a DNA
908	FASTA with ktup=4, and is substantially more sensitive.
909	(also reads the GENBANK library)
910
911	SSEARCH Universal sequence comparison using the Smith-Waterman
912	algorithm (T. F. Smith and M. S. Waterman (1981) J.
913	Mol. Biol. 147:195-197). This program uses code
914	developed by Huang and Miller (X. Huang, R. C.
928	Hardison, W. Miller (1990) CABIOS 6:373-381) for
929	calculating the local similarity score and code from
930	the ALIGN program (see below) for calculating the local
931	alignment. SSEARCH is about 100 times slower than
932	FASTA with ktup=2 (for proteins). It should never be
933	used to search an entire protein sequence library, but
934	can be used to search several hundred sequences.
935
936	ALIGN optimal global alignment of two sequences with no
937	short-cuts. This program is a slightly modified
938	version of one taken from E. Myers and W. Miller. The
939	algorithm is described in E. Myers and W. Miller,
940	"Optimal Alignments in Linear Space" (CABIOS (1988)
941	4:11-17).
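
The FASTA entry above says that a query is treated as DNA when it is
more than 85% A+C+G+T or when -n is given. A rough Python sketch of
that rule follows. The function name looks_like_dna and the
force_dna parameter (standing in for -n) are hypothetical, and
counting only alphabetic characters is an assumption; the actual
test inside the program may differ in detail.

```python
def looks_like_dna(sequence, force_dna=False, threshold=0.85):
    """Return True if the sequence would be treated as DNA.

    force_dna mimics the -n option; threshold is the 85% A+C+G+T rule.
    """
    if force_dna:
        return True
    residues = [ch for ch in sequence.upper() if ch.isalpha()]
    if not residues:
        return False
    acgt = sum(ch in "ACGT" for ch in residues)
    # "more than 85%" is read here as a strict inequality.
    return acgt / len(residues) > threshold


print(looks_like_dna("ACGTACGTAC"))
print(looks_like_dna("MWRTCGPPYT"))
print(looks_like_dna("NNRYACGTNN", force_dna=True))
```

The last call shows why -n is useful: a query made up largely of
ambiguity codes falls below the threshold and would otherwise be
read as protein.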
942
943	3.4. Local similarity programs
944
945	LFASTA local similarity searches showing local alignments.
946	The algorithm used to calculate the local alignment in
947	a band has been improved (Chao, Pearson, and Miller,
948	submitted).
949
950	PLFASTA local similarity searches with plot output (on the IBM,
951	this program requires that the environment variable
952	BGIDIR be set).
953
954	PCLFASTA (unix only) local similarity searches with plot output
955	using pic commands.
956
957	LALIGN Calculates the N-best local alignments using a rigorous
958	algorithm. (N=10 by default.) The algorithm was
959	developed by Huang and Miller (X. Huang and W. Miller
960	(1991) Adv. Appl. Math. 12:337-357), which is a
961	linear-space version of an algorithm described by M. S.
962	Waterman and M. Eggert (J. Mol. Biol. 197:723-728).
963	Like SSEARCH, LALIGN is rigorous, but also very slow.
964
965	PLALIGN A version of LALIGN that plots its output to a screen
966	or to a Tektronix terminal emulator.
967
968	3.5. Statistical Significance
969
970	RDF2 improved version of RDF program with all three scoring
971	methods (now includes local, or window, shuffle
972	routine)
973
974	RSS A version of RDF2 that uses the rigorous Smith-Waterman
975	calculation used by SSEARCH. RSS should provide a more
976	rigorous test of the statistical significance of a
977	similarity score.
978
979	RELATE significance program described by Dayhoff (Atlas of
980	Protein Sequence and Structure, Vol. 5, Supplement 3).
994	Each chunk of 25 residues in one sequence is compared
995	to every 25 residue fragment of the second sequence.
996	Sequences which are genuinely related will have a large
997	number of scores greater than 3 standard deviations
998	above the mean score of all of the comparisons.
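
The segment comparison described for RELATE can be sketched in
Python as follows. This is an illustration of the idea only, under
several assumptions: all overlapping 25-residue windows of both
sequences are compared, each pair is scored by counting identities
(RELATE itself uses a scoring matrix), and the function names are
hypothetical.

```python
from statistics import mean, pstdev


def windows(seq, size):
    """Return every overlapping fragment of the given size."""
    return [seq[i:i + size] for i in range(len(seq) - size + 1)]


def segment_scores(seq1, seq2, size=25):
    """Score each window of seq1 against each window of seq2.

    Assumption: the score is the number of identical positions.
    """
    return [sum(a == b for a, b in zip(w1, w2))
            for w1 in windows(seq1, size)
            for w2 in windows(seq2, size)]


def count_high_scores(scores, n_sd=3.0):
    """Count scores more than n_sd standard deviations above the mean."""
    mu = mean(scores)
    sd = pstdev(scores)
    if sd == 0:
        return 0
    return sum(s > mu + n_sd * sd for s in scores)


scores = segment_scores("MWRTCGPPYTMWRTCGPPYTMWRTCGPPYT",
                        "MWKSCGYPYTMWKSCGYPYTMWKSCGYPYT")
print(len(scores), count_high_scores(scores))
```

For two sequences of length 30 there are 6 x 6 = 36 window pairs;
real proteins give thousands of comparisons, and it is the number of
outlying scores among them that RELATE reports.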
999
1000	3.6. Other analysis programs
1001
1002	AACOMP calculate the amino acid composition and molecular
1003	weight of a sequence.
1004
1005	BESTSCOR calculate the best self-comparison score.
1006
1007	GREASE Kyte-Doolittle hydropathicity profile
1008
1009	TGREASE graphic plot of Kyte-Doolittle profile
1010
1011	FROMGB convert from GenBank LOCUS format (also used by the
1012	IBI-Pustell programs) to Pearson/FASTA format.
1013
1014	GARNIER A secondary structure prediction program using the
1015	method of Garnier, Osguthorpe, and Robson, J. Mol.
1016	Biol., (1978) 120:97-120.
1017
1018	3.7. Searching for keywords
1019
1020	FINDP (DOS, Macintosh only) Searches the protein sequence
1021	library title lines (or the aabank.nam file created by
1022	SINDEX) for a list of key words. For example:
1023
1024	FINDP aabank.nam trypsin
1025
1026	will search the file of title lines and report all
1027	lines with the word "trypsin" in them. You can search
1028	for several words at once, by putting several words on
1029	the line. Normally, FINDP (and FINDN) ignore upper and
1030	lower case. If you would like to search for a specific
1031	case, e.g. Trypsin but not chymotrypsin, use the -l
1032	option:
1033
1034	FINDP aabank.nam -l Trypsin
1035
1036
1037	FINDN Searches the GENBANK *.ano annotation files for words.
1038	FINDN can search a specific file, or a list of
1039	annotation files. For example, if the file GPRIA.NAM
1040	contains the lines:
1041
1042	gpri1.ano
1043	gpri2.ano
1044	gpri3.ano
1045	...
1046	then
1047
1060	FINDN @gpria.nam trypsin
1061
1062	would search all of the files. FINDN also uses "-l" to
1063	preserve upper/lower case distinctions.
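
The keyword matching described for FINDP and FINDN amounts to a
substring search over title lines. A small Python sketch follows;
the function name find_titles and the sample titles are made up for
illustration, and reporting a line when it contains any one of the
words is an assumption, since the text above does not say whether
several words are combined with "and" or "or".

```python
def find_titles(lines, words, match_case=False):
    """Return the lines containing at least one of the given words.

    match_case=True plays the role of the -l option.
    """
    keys = list(words) if match_case else [w.lower() for w in words]
    hits = []
    for line in lines:
        text = line if match_case else line.lower()
        if any(key in text for key in keys):
            hits.append(line)
    return hits


# Made-up title lines, standing in for the contents of aabank.nam.
titles = [
    "Trypsin precursor - bovine",
    "chymotrypsin A - human",
    "Cytochrome c - horse",
]

print(find_titles(titles, ["trypsin"]))
print(find_titles(titles, ["Trypsin"], match_case=True))
```

The first call reports both protease lines; the second, like
"FINDP aabank.nam -l Trypsin", reports only the line with a capital
T.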
1064
1065	3.8. Options
1066
1067	These programs have a number of output options, which are
1068	invoked by the environment variables LINLEN, SHOWALL, and MARKX.
1069	Alternatively, these values can be controlled by command line
1070	options. The number of sequence residues per output line is now
1071	adjustable by setting the environment variable LINLEN, or the
1072	command line option -w. LINLEN is normally 60; to change it, set
1073	LINLEN=80 before running the program or add -w 80 to the command
1074	line. LINLEN can be set up to 200. SHOWALL (-a) determines
1075	whether all, or just a portion, of the aligned sequences are
1076	displayed. Previously, FASTP would show the entire length of
1077	both sequences in an alignment while FASTN would only show the
1078	portions of the two sequences that overlapped. Now the default is
1079	to show only the overlap between the two sequences; to show
1080	complete sequences, set SHOWALL=1 or use the -a option on the
1081	command line.
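
The way a setting such as the line length might be resolved from the
default, the LINLEN environment variable, and the -w option can be
sketched in Python as below. This is an illustration rather than the
programs' own code: the function name is hypothetical, it assumes
that a command line option takes precedence over the environment
variable, and clamping larger values to 200 is an assumption about
how the limit is enforced.

```python
import os


def resolve_line_length(argv, environ, default=60, maximum=200):
    """Pick the output line length from -w, then LINLEN, then the default."""
    value = default
    if "LINLEN" in environ:
        value = int(environ["LINLEN"])
    if "-w" in argv:
        i = argv.index("-w")
        if i + 1 < len(argv):
            # The command line option wins over the environment.
            value = int(argv[i + 1])
    # Assumption: values above the documented limit are clamped.
    return min(value, maximum)


print(resolve_line_length(["-w", "80", "-a"], dict(os.environ)))
```

Passing the environment as an argument, rather than reading
os.environ inside the function, keeps the rule easy to check with
different settings.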
1082
1083	The differences between the two aligned sequences can be
1084	highlighted in three different ways by changing the environment
1085	variable MARKX or the -m option. Normally (MARKX=0) the program
1086	uses ':' to denote identities and '.' to denote conservative
1087	replacements. If MARKX=1, the program will not mark identities;
1088	instead conservative replacements are denoted by a 'x' and non-
1089	conservative substitutions by a 'X'. If MARKX=2, the residues in
1090	the second sequence are only shown if they are different from the
1091	first. Thus the three options are:
1092
1093
1094	MARKX=0 (default)    MARKX=1       MARKX=2
1095
1096	MWRTCGPPYT           MWRTCGPPYT    MWRTCGPPYT
1097	::..:: :::             xx  X       ..KS..Y...
1098	MWKSCGYPYT           MWKSCGYPYT
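
The three marking styles can be reproduced with a short Python
sketch. The function name mark is hypothetical, and so is the
CONSERVATIVE table: it lists only the two replacement pairs needed
for the example above, whereas the programs decide what is
conservative from the scoring matrix.

```python
# Hypothetical table: just the pairs needed for the example.
CONSERVATIVE = {frozenset("KR"), frozenset("ST")}


def mark(seq1, seq2, markx=0):
    """Return the line printed under seq1 for the given MARKX value."""
    out = []
    for a, b in zip(seq1, seq2):
        same = a == b
        cons = frozenset((a, b)) in CONSERVATIVE
        if markx == 0:
            # ':' identity, '.' conservative, blank otherwise
            out.append(":" if same else "." if cons else " ")
        elif markx == 1:
            # identities unmarked, 'x' conservative, 'X' non-conservative
            out.append(" " if same else "x" if cons else "X")
        elif markx == 2:
            # second sequence shown only where it differs
            out.append("." if same else b)
        else:
            raise ValueError("MARKX must be 0, 1, or 2")
    return "".join(out)


for m in (0, 1, 2):
    print("MARKX=%d |%s|" % (m, mark("MWRTCGPPYT", "MWKSCGYPYT", m)))
```

With MARKX=0 or 1 the returned line sits between the two sequences;
with MARKX=2 it replaces the second sequence.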
1099
1100
1101	3.9. Command line options
1102
1103	It is now possible to specify several options on the
1104	command line, instead of using environment variables. The
1105	command line options are preceded by a dash; the following
1106	options are available:
1107
1108	-a same as SHOWALL=1
1109
1110	-b number of sequence scores to be shown on output
1111
1125	-c # threshold score for optimization (OPTCUT). Set "-c 1"
1126	and "-o" to optimize every sequence in a database.
1127	(This slows the program down about 5-fold).
1128
1129	-d # number of alignments to be reported by default. (Used
1130	in conjunction with -Q).
1131
1132	-f identical match score from scoring matrix in the scan
1133	for initial regions. (default for protein) (PAMFACT=1)
1134
1135	-g # Threshold for joining init1 segments to build an initn
1136	score (GAPCUT).
1137
1138	-k use constant score in scan for initial regions (like
1139	old fastp, fastn, default for DNA) (PAMFACT=0)
1140
1141	-l file location of library menu file (FASTLIBS)
1142
1143	-m # MARKX = # (0, 1, 2)
1144
1145	-n Force the query sequence to be treated as a DNA
1146	sequence. This is particularly useful for query
1147	sequences that contain a large number of ambiguous
1148	residues, e.g. transcription factor binding sites.
1149
1150	-o optimize all scores greater than OPTCUT. If '-c' is
1151	not specified, OPTCUT will be calculated from the
1152	length of the sequence and the ktup setting, as the old
1153	CUTOFF value used to be.
1154
1155	-Q quiet - does not prompt for any input. Writes scores
1156	and alignments to the terminal or standard output file.
1157
1158	-r file save a results summary line for every sequence in the
1159	sequence library. The summary line includes the
1160	sequence identifier, superfamily number (if available)
1161	position in the library, and the similarity scores
1162	calculated. This option can be used to evaluate the
1163	sensitivity and selectivity of different search
1164	strategies (see W. R. Pearson (1991) Genomics 11:635-
1165	650.)
1166
1167	-s file SMATRIX is read from file. Several SMATRIX files are
1168	provided with the standard distribution. For protein
1169	sequences: codaa.mat - based on minimum mutation
1170	matrix; idnaa.mat - identity matrix; idpaa.mat -
1171	identity matrix for mismatches, but identical matches
1172	weighted according to the PAM250 matrix; pam250.mat -
1173	the PAM250 matrix developed by Dayhoff et al (Atlas of
1174	Protein Sequence and Structure, vol. 5, suppl. 3,
1175	1978); pam120.mat - a PAM120 matrix. The SMATRIX also
1176	specifies the penalties for the first residue in a gap
1177	and additional residues in a gap; FASTA, the other
1191	alignment programs, and the SMATRIX files use -12 and
1192	-4. Currently, to change the -12, -4 gap penalties, the
1193	SMATRIX file must be edited.
1194
1195	-v (LINEVAL) values used for line styles in plfasta
1196
1197	-w # line length (width) = number (<200)
1198
1199	-x specifies offsets for the beginning of the query and
1200	library sequence. For example, if you are comparing
1201	upstream regions for two genes, and the first sequence
1202	contains 500 nt of upstream sequence while the second
1203	contains 300 nt of upstream sequence, you might try:
1204
1205	fasta -x "-500 -300" seq1.nt seq2.nt
1206
1207	If the -x option is not used, FASTA assumes numbering
1208	starts with 1. This option will not work properly with
1209	the translated library sequence with tfasta. (You
1210	should double check to be certain the negative
1211	numbering works properly.)
1212
1213	-1 sort output by init1 score (as FASTP used to do).
1214
1215	-3 (TFASTA only) translate only three forward frames
1216
1217
1218	For example:
1219
1220	fasta -w 80 -a seq1.aa seq2.aa
1221
1222	would compare the sequence in seq1.aa to that in seq2.aa and
1223	display the results with 80 residues on an output line, showing
1224	all of the residues in both sequences. Be sure to enter the
1225	options before entering the file names, or just enter the options
1226	on the command line, and the program will prompt for the file
1227	names.
1228
1229	Not all of these options are appropriate for all of the
1230	programs. The options above are used by FASTA and TFASTA. RELATE
1231	uses the -s option, ALIGN uses the -w, -m, and -s options, and
1232	the RDF2 programs use -c, -f, -k, and -s.
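
The -s entry above gives gap penalties of -12 for the first residue
in a gap and -4 for each additional residue. The Python sketch below
shows how such penalties enter a Smith-Waterman style local score.
It is not the code used by SSEARCH or the other programs: the
function name is hypothetical, and the match and mismatch values
(+5, -4) are placeholders for the scores that would normally come
from an SMATRIX file.

```python
def local_score(s1, s2, match=5, mismatch=-4, gap_first=-12, gap_next=-4):
    """Best local alignment score with separate first/additional gap costs."""
    n, m = len(s1), len(s2)
    neg = float("-inf")
    # H: best score of an alignment ending at (i, j)
    # E, F: best score ending in a gap in s1 or in s2, respectively
    H = [[0] * (m + 1) for _ in range(n + 1)]
    E = [[neg] * (m + 1) for _ in range(n + 1)]
    F = [[neg] * (m + 1) for _ in range(n + 1)]
    best = 0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            # Open a new gap (first residue) or extend an existing one.
            E[i][j] = max(H[i][j - 1] + gap_first, E[i][j - 1] + gap_next)
            F[i][j] = max(H[i - 1][j] + gap_first, F[i - 1][j] + gap_next)
            sub = match if s1[i - 1] == s2[j - 1] else mismatch
            H[i][j] = max(0, H[i - 1][j - 1] + sub, E[i][j], F[i][j])
            best = max(best, H[i][j])
    return best


print(local_score("ABCDEFGH", "ABCDEFGH"))    # no gap
print(local_score("ABCDEFGH", "ABCDXEFGH"))   # one-residue gap
print(local_score("ABCDEFGH", "ABCDXXEFGH"))  # two-residue gap
```

With these placeholder scores a gap of n residues costs
12 + 4(n - 1), so a one-residue gap is worth bridging only when the
matches on the far side add more than 12 to the score.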
1233
1234	4. Environment variable summary
1235
1236	Environment variables allow you to set search parameters
1237	that will be used frequently when you run a program; for example,
1238	if you prefer to use the PAM120 scoring matrix, you might "set
1239	SMATRIX=120." Command line parameters, if used, always override
1240	environment variable settings. The following environment
1241	variables are used by this program:
1242
1256	AABANK the file name of the default sequence library.
1257
1258	FASTLIBS the location of the file which contains the list of
1259	library files to be searched.
1260
1261	GAPCUT threshold used for joining init1 regions in the second
1262	step of FASTA. Normally set based on sequence length
1263	and ktup.
1264
1265	GBLIB the directory where the EXTRACTN files and glocus.idx
1266	are found.
1267
1268	LIBTYPE used to specify the format of the library sequence for
1269	FASTA and TFASTA.
1270
1271	LINLEN output line length - can go up to 200
1272
1273	LINEVAL used by plfasta to determine the relationship between
1274	line style and similarity score (-v). This should be a
1275	string of three numbers, e.g. "200 100 50"
1276
1277	MARKX symbol for denoting matches, mismatches. Note that this
1278	symbol is only used across the optimized local region;
1279	sequences that are outside this region are not marked.
1280
1281	OPTCUT Set the threshold to be used for optimization in a band
1282	around the best initial region. Normally the OPTCUT
1283	value is calculated from the length of the sequence and
1284	the ktup value (for a 200 residue sequence, it is about
1285	28). If OPTCUT=1, every sequence in the database will
1286	be optimized. This is the most sensitive option.
1287
1288	PAMFACT This version of fasta uses a more sensitive method for
1289	identifying initial regions. Instead of using a
1290	constant factor (fact) for each match in a ktup, it
1291	uses the scoring matrix (PAM) scores. While this works
1292	well for protein sequences, it has not been as
1293	carefully tested for DNA sequences, so by default, this
1294	modification is used for proteins but not for DNA. The
1295	-f 1 option forces this option on. -f 0 forces it off.
1296	Setting the PAMFACT environment variable to 1 forces
1297	the option on; PAMFACT=0 turns it off.
1298
1299	SHOWALL on output, show the complete sequence instead of just
1300	the overlap of the two aligned sequences.
1301
1302	SMATRIX alternative scoring matrix file.
1303
1304	TEKPLOT (IBM-PC only, Unix and VMS versions generate Tektronix
1305	graphics by default) Generate Tektronix output.
1306	Normally, PLFASTA and TGREASE plot graphs using the
1307	Turbo C graphics library. Unfortunately, often these
1308	plots cannot be printed out without special programs.
1322	(I have used GRAFPLUS, from Jewell Technologies, (206)
1323	937-1081, $50, successfully.) However, if you set
1324	TEKPLOT=1, Tektronix graphics commands will be used.
1325	Tektronix commands can be used together with the
1326	PLOTDEV program, available from Microplot Systems, 1897
1327	Red Fern Dr. Columbus, OH, 43229, (614) 882-4786, for
1328	$40, which also allows you to print out graphics on the
1329	screen.
1330
1331
1332	As always, please inform me of bugs as soon as possible.
1333
1334	William R. Pearson
1335	Department of Biochemistry
1336	Box 440, Jordan Hall
1337	U. of Virginia
1338	Charlottesville, VA
1339
1340	wrp@virginia.EDU
1341	wrp@virginia.BITNET