5FASTA.DOC                                        Release 1.6
6
7
8
9                        COPYRIGHT NOTICE
10
11Copyright 1988, 1991, 1992 by William R. Pearson and the
12University of Virginia.  All rights reserved. The FASTA program
13and documentation may not be sold or incorporated into a
14commercial product, in whole or in part, without written consent
15of William R. Pearson and the University of Virginia.  For
16further information regarding permission for use or reproduction,
17please contact: William R. Wilkerson, Assistant Provost for
18Research, University of Virginia, P.O. Box 9025, Charlottesville,
19VA 22906-9025, (804) 924-6853
20
21
22The FASTA program package
23
24Introduction
25
     This documentation describes version 1.6c of the FASTA
program package (see W. R. Pearson and D. J. Lipman (1988),
"Improved Tools for Biological Sequence Analysis", PNAS
85:2444-2448, and W. R. Pearson (1990) "Rapid and Sensitive
Sequence Comparison with FASTP and FASTA", Methods in Enzymology
183:63-98).  Version 1.6 is the first release for the IBM-PC and
Macintosh since version 1.4 (version 1.5 was distributed only via
ftp to unix machines).  Version 1.6 has a large number of
improvements over versions 1.4 and 1.5, including the ability to
search libraries in several different formats in the same run,
more robust algorithms for aligning sequences along a band, and
additional, rigorous (but slow) programs for sequence searching,
statistical analysis, and local sequence alignment.  Several new
options are also included.  Programs that are new with version
1.6 are highlighted in italics.
41
42
43Although there are a large number of programs in this package,
44they belong to three groups:
45
46
47    Library search programs: FASTA, TFASTA, SSEARCH
48
49    Local homology programs: LFASTA, PLFASTA, LALIGN, PLALIGN
50
51    Statistical significance: RDF2, RELATE, RSS
52
53
54In addition, there are several programs for other sequence
55analysis tasks:
56
57
58    ALIGN - global alignment of two sequences (no limit on gaps).
59
    EXTRACTP, SINDEX - programs to index (SINDEX) and extract
    sequences from a protein sequence database.

    EXTRACTN - program to extract sequences from the GenBank floppy
    disk format data base.
78
79
80In addition, I have included several programs for protein
81sequence analysis, including a Kyte-Doolittle hydropathicity
82plotting program (GREASE, TGREASE), and a secondary structure
83prediction package (GARNIER).
84
85     The FASTA sequence comparison programs on this disk are
86improved versions of the FASTP program, originally described in
87Science (Lipman and Pearson, (1985) Science 227:1435-1441).  We
88have made several improvements.  First, the library search
89programs use a more sensitive method for the initial comparison
90of two sequences which allows the scores of several similar
91regions to be combined.  As a result, the results of a library
92search are now given with three scores, initn (the new initial
93score which may include several similar regions), init1 (the old
94fastp initial score from the best initial region), and opt (the
95old fastp optimized score allowing gaps in a 32 residue wide
96band).
97
98     These programs have also been modified to become "universal"
(hence FAST-A, for FAST-All, as opposed to FAST-P (protein) or
100FAST-N (nucleotides)); by changing the environment variable
101SMATRIX, the programs can be used to search protein sequences,
102DNA sequences, or whatever you like.  By default, FASTA, LFASTA,
103and the RDF programs automatically recognize protein and DNA
104sequences.  Sequences are first read as amino acids, and then
105converted to nucleotides if the sequence is greater than 85%
106A,C,G,T (the '-n' option can be used to indicate DNA sequences).
107TFASTA compares protein sequences to a translated DNA sequence.
108Alternative scoring matrices can also be used.  In addition to
109the PAM250 matrix for proteins, matrices based on simple
110identities or the genetic code can also be used for sequence
111comparisons or evaluation of significance.  Several different
112protein sequence matrices have been included; instructions for
113constructing your own scoring matrix are included in the file
114FORMAT.DOC.
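
     The automatic protein/DNA recognition described above can
be sketched in a few lines of Python.  This is only an
illustration of the 85% rule, not the code used by the programs;
the function name looks_like_dna and the decision to count only
alphabetic characters as residues are assumptions made for the
example.

```python
def looks_like_dna(sequence, threshold=0.85):
    """Return True if more than `threshold` of the residues are A, C, G or T.

    Assumption: only alphabetic characters count as residues, and
    ambiguity codes such as N or U are not treated as nucleotides.
    """
    residues = [c for c in sequence.upper() if c.isalpha()]
    if not residues:
        return False
    acgt = sum(1 for c in residues if c in "ACGT")
    return acgt / len(residues) > threshold


if __name__ == "__main__":
    print(looks_like_dna("ACGTACGTAC"))   # a nucleotide-only string
    print(looks_like_dna("MKVLAAGIVG"))   # a short peptide
```

A sequence that fails this test is left as protein, which is why
the '-n' option is needed for DNA sequences with unusual
composition.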
115
116
117The remainder of this document is divided into three sections:
118(1) a brief history of the changes to the FASTA package; (2) A
119guide to installing the programs and databases; (3) A guide to
120using the FASTA programs. The programs are very easy to use, so
121if you are using them on a machine that is administered by
122someone else, you may want to skip to section (3) to learn how to
123use the programs, and then read section (1) to look at some of
124the more recent changes.  If you are installing the programs on
125your own machine, you will need to read section (2) carefully.
126
1391.  Revision History
140
1411.1.  Changes with version 1.6
142
143     FASTA version 1.6 uses a new method for calculating optimal
144scores in a band (the optimization or last step in the FASTA
145algorithm). In addition, it uses a linear-space method for
146calculating the actual alignments.  The FASTA package also
147includes four new programs:
148
SSEARCH   a program to search a sequence database using the
          rigorous Smith-Waterman algorithm (this program is
          about 100-fold slower than FASTA with ktup=2 for
          proteins).
153
154RSS       a version of RDF2 that uses a rigorous Smith-Waterman
155          calculation to score similarities
156
157LALIGN    A rigorous local sequence alignment program that will
158          display the N-best local alignments (N=10 by default).
159
160PLALIGN   a version of lalign that plots the local alignments.
161
162     The LALIGN/PLALIGN programs incorporate the "sim" algorithm
163described by Huang and Miller (1991) Adv. Appl. Math. 12:337-357.
164The SSEARCH and RSS programs incorporate algorithms described by
165Huang, Hardison, and Miller (1990) CABIOS 6:373-381.
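
     For readers unfamiliar with the Smith-Waterman calculation
that SSEARCH and RSS rely on, the sketch below computes a local
alignment score in Python.  It is a textbook version and not the
Huang and Miller code used in the package: the match, mismatch
and gap values are placeholders chosen for the example (the
programs use a scoring matrix such as PAM250), and gaps are
charged a simple per-residue penalty.

```python
def smith_waterman_score(a, b, match=2, mismatch=-1, gap=-2):
    """Best local alignment score of strings a and b.

    Assumed toy scoring: +match for identical residues, mismatch
    otherwise, and a fixed penalty for every gapped position.
    Only two rows of the score matrix are kept in memory.
    """
    previous = [0] * (len(b) + 1)
    best = 0
    for i in range(1, len(a) + 1):
        current = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            pair = match if a[i - 1] == b[j - 1] else mismatch
            current[j] = max(
                0,                       # start a new local alignment
                previous[j - 1] + pair,  # extend along the diagonal
                previous[j] + gap,       # gap in b
                current[j - 1] + gap,    # gap in a
            )
            best = max(best, current[j])
        previous = current
    return best


if __name__ == "__main__":
    print(smith_waterman_score("XXACGTXX", "ACGT"))
```

Every cell of the comparison matrix is visited, which is the
reason the rigorous programs are so much slower than FASTA.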
166
167     LFASTA and PLFASTA now calculate a different number of local
168similarities; they now behave more like LALIGN/PLALIGN.  Since
169local alignments of identical sequences produce "mirror-image"
170alignments, lalign and lfasta consider only one-half of the
171potential alignments between sequences from identical file names.
172Thus
173
174    lfasta mchu.aa mchu.aa
175
176Displays only two alignments, with earlier versions of the
177program, it would have displayed five, including the identity
178alignment.  PLFASTA does display five alignments; when two
179identical filenames are given, it draws the identity alignment,
180calculates the two unique local alignments, draws them, and draws
181their mirror images. LFASTA/PLFASTA and LALIGN/PLALIGN use the
182filenames, rather than the actual sequences, to determine whether
183sequences are identical; you can "trick" the programs into
184behaving the old way by putting the same sequence in two
185different files.
186
1871.2.  Changes with version 1.5
188
189     FASTA version 1.5 includes a number of substantial revisions
190to improve the performance and sensitivity of the program.  It is
191now possible to tell the program to optimize all of the initn
205scores greater than a threshold.  The threshold is set at the
206same value as the old FASTA cutoff score (approximately 0.5
207standard deviations above the mean for average length sequences).
208For highest sensitivity, you can use the -c 1 option to set the
209threshold to 1.  (This will slow the search down about 5-fold).
210Alternatively, you can tell FASTA to sort the results by the
211init1, rather than the initn, score by using the -1 option.
212FASTA -1 ...  will report the results the way the older FASTP
213program did.  A comparison of the performance of FASTA in this,
214its slowest mode, with the standard FASTA and the Smith-Waterman
215algorithm has been published in Genomics (1991) 11:635-650.
216
217     A new method has been provided for selecting libraries. In
218the past, one could enter the name of a sequence file to be
219searched or a single letter that would specify a library from the
220list included in the $FASTLIBS file. Now, you can specify a set
221of library files with a string of letters preceded by a '%'.
222Thus, if the FASTLIBS file has the lines:
223
224
225    Genbank 70 primates$1P/seqlib/gbpri.seq
226    Genbank 70 rodents$1R/seqlib/gbrod.seq
227    Genbank 70 other mammals$1M/seqlib/gbmam.seq
228    Genbank 70 vertebrates $1B/seqlib/gbvrt.seq
229
230
231Then the string: "%PRMB" would tell FASTA to search the four
232libraries listed above.  The %PRMB string can be entered either
233on the command line or when the program asks for a filename or
234library letter.
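
     The way a '%' string selects several libraries can be
illustrated with a short Python sketch.  It assumes the FASTLIBS
line layout shown in the example above (description, '$', one
type digit, one selection letter, then the file name); the
function names are invented for the illustration and are not
part of the package.

```python
def parse_fastlibs_line(line):
    """Return (selection letter, file name) for one FASTLIBS line."""
    _description, rest = line.rstrip("\n").split("$", 1)
    # rest[0] is the library type digit, rest[1] the selection letter.
    return rest[1], rest[2:].strip()


def select_libraries(fastlibs_lines, choice):
    """Expand a string such as "%PRMB" into a list of file names."""
    if not choice.startswith("%"):
        raise ValueError("expected a string of letters preceded by '%'")
    by_letter = dict(parse_fastlibs_line(l) for l in fastlibs_lines if "$" in l)
    return [by_letter[letter] for letter in choice[1:]]


if __name__ == "__main__":
    lines = [
        "Genbank 70 primates$1P/seqlib/gbpri.seq",
        "Genbank 70 rodents$1R/seqlib/gbrod.seq",
        "Genbank 70 other mammals$1M/seqlib/gbmam.seq",
        "Genbank 70 vertebrates $1B/seqlib/gbvrt.seq",
    ]
    for name in select_libraries(lines, "%PRMB"):
        print(name)
```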
235
236     FASTA1.5 also provides additional flexibility for specifying
237the number of results and alignments to be displayed with the -Q
238(quiet) option.  The -b number option allows you to specify the
239number of sequence scores to show when the search is finished.
240Thus
241
242
243    FASTA -b 100 ...
244
245
246tells the program to display the top 100 sequence scores. In the
247past, if you displayed 100 scores (in -Q mode), you would also
have to store 100 alignments. The -d option allows you to limit the
249number of alignments shown.  FASTA -b 100 -d 20 would show 100
250scores and 20 alignments.
251
     The old CUTOFF parameter is no longer used.  The program
stores the best 2000 (IBM-PC, MAC) or 6000 (Unix, VMS) scores,
throws out the lowest 25%, stores the next 500 (1500) scores that
are better than the threshold determined when the first scores
were discarded, and repeats the process as the library is scanned.
As a result, the best 1500 - 2000 (4500 - 6000) scores are saved.
271The old cut-off parameter was also used to set the joining
272threshold for the calculation of the initn score from initial
273regions.  This joining threshold can now be set with the -g
274option or with the GAPCUT parameter.
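
     The score-saving scheme that replaces CUTOFF can be pictured
with the following Python sketch.  It is one reading of the
description above, not the program's code: the function name,
the use of a plain list, and the choice of the lowest surviving
score as the new threshold are assumptions.

```python
def retain_best(scores, capacity=2000, drop_fraction=0.25):
    """Keep roughly the best `capacity` scores from a stream.

    When the buffer fills, the lowest `drop_fraction` of the scores
    is discarded and the lowest surviving score becomes the
    threshold; later scores are stored only if they beat it.
    """
    kept = []
    threshold = None
    for score in scores:
        if threshold is not None and score <= threshold:
            continue                      # not better than the threshold
        kept.append(score)
        if len(kept) >= capacity:
            kept.sort(reverse=True)
            n_keep = capacity - int(capacity * drop_fraction)
            kept = kept[:n_keep]          # throw out the lowest 25%
            threshold = kept[-1]
    return sorted(kept, reverse=True)


if __name__ == "__main__":
    best = retain_best(range(10000))
    print(len(best), best[0])
```

With the default capacity of 2000, between 1500 and 2000 scores
survive, as stated in the text.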
275
276     Finally, FASTA can provide a complete list of all of the
277sequences and scores calculated to a file with the -r (results)
278option.  FASTA -r results.out ... creates a file with a list of
279scores for every sequence in the library.  The list is not
280sorted, and only includes those scores calculated during the
281initial scan of the library (the optimized score is not
282calculated unless the -o option is used).
283
2842.  Installing the FASTA package
285
2862.1.  Installing the programs
287
2882.1.1.  IBM-PC/DOS version
289
290     For the IBM-PC/DOS version, the FASTA source code disk
291contains the complete source code to all of the programs on the
292other disks.  The programs were compiled with Borland's Turbo
293'C++', using Borland's MAKE utility.  The graphics programs
294(PLFASTA, TGREASE) use the graphics device drivers supplied with
295the Turbo 'C' V2.0 package.  Also included are the documentation
296files PROGRAMS.DOC and FORMAT.DOC.  You do not need any of the
297files the source code disk to run the programs.  The files on
298this disk are identical to the UNIX and VMS versions that run on
299larger machines.  Also included is the code to compile
300ALIGN0.EXE.  ALIGN0 is the same as ALIGN, but does not penalize
301for end-gaps.
302
303     If you have the DOS or Macintosh version of the FASTA
304package, to install the programs you should:
305
306(1)  Make a new directory (folder) for the FASTA programs.  This
307     need not be the same as the directory for your sequence
308     databases.
309
310(2)  Copy the files from the FASTA source disk to the new
311     directory.
312
313(3)  (DOS only) Edit your AUTOEXEC.BAT file to (a) modify your
314     PATH command to include the FASTA directory and (b) add the
315     line:
316
317         set FASTLIBS=c:\yourfastadirectory\fastgbs
318
319     On the Macintosh, you may need to edit the "environment"
320     file and change the line that reads:
321
322         FASTLIBS=fastgbs
323
336     to indicate the full directory path for the fastgbs file,
337     for example:
338
339         FASTLIBS=Q105:FASTA:fastgbs
340
341
342(4)  Finally, you will need to edit the fastgbs file.  This is
343     usually the most confusing part of the installation.  An
344     example of this file is shown below; to customize this file
345     for your machine, you will need to change the file names
346     from those provided in the fastgbs file to ones that reflect
347     the directory names and file names you use on your machine.
348     This is explained in more detail below.  In addition, some
349     entries in the fastgbs file refer to other files of file
350     names.  These files of file names (as opposed to actual
351     database files) may also need to be edited.
352
3532.1.2.  Unix version
354
355     The FASTA distribution comes with several makefile's that
356can be used to compile the FASTA programs.  Over the years, as
357ATT Unix System 5 and BSD unix have converged, these files have
358become very similar. To begin with, I recommend using the
359standard Makefile.  There are two values in the makefile that
360should be checked against the values used on your system: the HZ
361value, which is the frequency in ticks per second used by the
362times() system call, this value can usually be found by running:
363
364    grep HZ /usr/include/sys/*
365
366and the functions available to return random numbers.  If you
367have a rand48() function that returns a 32-bit random number, use
368it and use the lines:
369
370    NRAND=nrand48
371    RANFLG= -DRAND32
372
373If not, you will need to use the rand() function call and
374determine whether it returns a 16-bit or a 32-bit value.  These
375functions are used by RDF2 and RSS.  If you have problems
376compiling the programs, you may want to examine the makefile.unx
377and makefile.sun files, to look for differences.  I have tried to
378use very standard unix functions in these programs, and they have
379been successfully compiled, with very small changes to the
380Makefile, on Sun's (Sun OS 4.1), IBM RS/6000's (AIX), and MIPS
381machines (under the BSD environment).
382
3832.2.  Installing the libraries
384
3852.2.1.  The NBRF protein sequence library
386
387     The FASTA program package does not include any protein or
388DNA sequence libraries.  You can obtain the PIR protein sequence
402database from:
403
404    National  Biomedical Research Foundation
405    Georgetown  University  Medical  Center
406    3900 Reservoir Rd, N.W.
407    Washington, D.C. 20007
408
409In addition, this database is available via anonymous ftp from
410the host "ftp.bchs.uh.edu". It is available in two formats, VMS
411and CODATA format.  The "VMS" format (library type 5 below) can
412be searched much faster, can be easily reformatted for use by the
413"BLAST" rapid searching program, and is compatible with the
414Genetics Computer Group package of programs.  The CODATA format
415is used by the EUGENE/MBIR computing package from Baylor (library
416type 2).
417
418     (DOS/Macintosh users) The SINDEX and EXTRACTP programs now
419allow you to index a file in one subdirectory, and then move the
420library without having to remake the index.  When you type:
421SINDEX @prot.nam, two index files are created: PROT.IXX and
422PROT.INX.  PROT.IXX is a binary file that cannot be edited; it
423contains the offsets into the library files for each of the
424sequence entries.  PROT.INX looks exactly like the original
425PROT.NAM file, and can be edited.  However, you cannot change the
426order of the library files in PROT.INX.  What you can do is
427change the first line, which indicates the directory where the
428library files can be found.  The index in PROT.IXX might tell
429EXTRACTP to find the entry LCBO at offset 123,456 in the PROT.3
430file.  If you changed the PROT.3 line in PROT.INX to PROT.4, LCBO
431would not be extracted properly.  However, if you decide to move
432your library files from disk /usr/tmp to disk /usr/lib, you can
433edit PROT.INX to reflect this change.
434
435     EXTRACTP has also been updated to use the new indexing
436scheme.  To extract sequences from a multi-file library that you
437made with SINDEX @prot.nam, type: EXTRACTP @prot.nam, or set the
438environment variable AABANK=@prot.nam.  Then enter the protein
439sequence identifiers as before.  Remember, if you move the
440library into a different directory, you will need to copy both
441the *.IXX and *.INX files to use EXTRACTP.  You can test EXTRACTP
442by trying to extract the PIR sequences LCBO, HBHU, or CCHU.  If
443you do not get an error message, the sequences were successfully
444extracted.  They are automatically saved to a file with the name
445"sequence.aa".  So "LCBO" would be found in "lcbo.aa". When you
446need to extract a sequence from the NEW.LIB library, you will
447have to set AABANK=new.lib.
448
4492.2.2.  The GENBANK DNA sequence library
450
451     FASTA, TFASTA, and EXTRACTN search and extract sequences
452from the GENBANK DNA sequence library in its compressed, floppy
453disk format.  This library is available from:
454
467    GENBANK
468    c/o Intelligenetics
469    700 E. El Camino Real
470    Mountain View, CA  94040
471    (415) 962-7300
472
(The GBANN program was used to extract DNA sequence annotations.
Unfortunately, GBANN has not been updated since release 63.0 of
GENBANK, when some changes in the annotation files were made, and
it no longer works.)
477
478     The GenBank DNA sequence library is also available via
479anonymous FTP from genbank.bio.net.
480
4812.2.3.  The EMBL CD-ROM libraries
482
483     The European Molecular Biology Laboratory (EMBL) is
484distributing a CD-ROM that contains both the complete EMBL DNA
485sequence database (which should be essentially identical to the
486GenBank DNA sequence database) and the SWISS-PROT protein
487sequence database. SWISS-PROT is derived from the NBRF Protein
488sequence database with additions from the EMBL DNA sequence
489database.  This CD-ROM is a "best-buy," since it provides both
490DNA and protein sequence libraries.  It is available from:
491
492
493    EMBL Data Library
494    Meyerhofstr. 1
495    D-6900 Heidelberg
496    Germany
497    +49 6221 387258
498    Email: SOFTWARE@EMBL-Heidelberg.DE
499
500
501
502     In addition, the SWISS-PROT protein sequence database is
503available via anonymous FTP from the hosts genbank.bio.net and
504ncbi.nlm.nih.gov.
505
5062.3.  Finding the libraries: FASTLIBS
507
508     FASTA and TFASTA use the environment variable FASTLIBS to
509find the protein and DNA sequence libraries.  The FASTLIBS
510variable contains the name of a file that has the actual
filenames of the libraries.  The FASTGBS file is an example of
512a file that can be referred to by FASTLIBS. To use the FASTGBS
513file, type:
514
515    setenv FASTLIBS /usr/lib/fasta/fastgbs (BSD UNIX)
516    or
517    FASTLIBS=/usr/lib/fasta/fastgbs; export FASTLIBS (SysV UNIX)
518
519Then edit the FASTGBS file to indicate where the protein and DNA
sequence libraries can be found.  If you have a hard disk and
your protein sequence library is kept in the file
/usr/lib/seq/aabank.lib and your Genbank DNA sequence library is
kept in the directory /usr/lib/genbank, then fastgbs might contain:
537
538    NBRF Protein$0P/usr/lib/seq/aabank.lib 0
539    SWISS PROT 10$0S/usr/lib/vmspir/swiss.seq 5
540    GB Primate$1P@/usr/lib/genbank/gpri.nam
541    GB Rodent$1R@/usr/lib/genbank/grod.nam
542    GB Mammal$1M@/usr/lib/genbank/gmammal.nam
543    ^   1    ^^^^       4                   ^     ^
544              23                             (5)
545
The first line of this file says that there is a copy of the NBRF
protein sequence database (which is a protein database) in the
file /usr/lib/seq/aabank.lib, and that it can be selected by
typing "P" on the command line or when the database menu is
presented.
550
551     Note that there are 4 or 5 fields in the lines in fastgbs.
552The first field is the description of the library which will be
553displayed by FASTA; it ends with a '$'.  The second field (1
554character), is a 0 if the library is a protein library and 1 if
555it is a DNA library.  The third field (1 character) is the
556character to be typed to select the library.
557
     The fourth field is the name of the library file.  In the
example above, the /usr/lib/seq/aabank.lib file contains the
entire protein sequence library.  However, the DNA library file
names are preceded by a '@', because these files (gpri.nam,
grod.nam, gmammal.nam) do not contain the sequences; instead they
contain the names of the files which contain the sequences.  This
is done because the GENBANK DNA database is broken down into a
large number of smaller files.  In order to search the entire
primate database, you must search more than a dozen files.
567
     In addition, an optional fifth field can be used to specify
the format of the library file.  This field must be separated
from the file name by a space character (' ').  Alternatively,
you can specify the library format in a file of file names (a
file preceded by an '@').  In the example above, the aabank.lib
file is in Pearson/FASTA format, the swiss.seq file is in PIR/VMS
format (from the EMBL CD-ROM), and the DNA sequences are in
compressed GenBank format.  No file type number
576is included for the Genbank files, because it is included in the
577file of filenames (see below).  Currently, FASTA can read the
578following formats:
579
580    0 Pearson/FASTA (>SEQID - comment/sequence)
581    1 Uncompressed Genbank (LOCUS/DEFINITION/ORIGIN)
582    2 NBRF CODATA (ENTRY/SEQUENCE)
583    3 EMBL/SWISS-PROT (ID/DE/SQ)
584    4 Intelligenetics (;comment/SEQID/sequence)
585    5 NBRF/PIR VMS (>P1;SEQID/comment/sequence)
599    9 Compressed Genbank Floppy format
600
601(In the near future, I hope to support the BLAST formats.) In
602particular, this version will work with the EMBL and PIR VMS
603formats that are distributed on the EMBL CD-ROM. The latter
604format (PIR VMS) is much faster to search than EMBL format.  If a
605library format is not specified, for example, because you are
606just comparing two sequences, Pearson/FASTA (format 0) is used by
607default.  To change this default, you may set the LIBTYPE
608environment variable to a number.  For example,
609
610    setenv LIBTYPE 1
611
612would cause the program to use the GenBank LOCUS format by
613default for libraries (or the second sequence file), but the
614Pearson/FASTA format would still be used for the query sequence.
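
     As an aid to checking a fastgbs file by hand, the field
layout described above can be written out as a small Python
parser.  This is a sketch under stated assumptions, not the
parsing code used by FASTA: the names LibraryEntry and
parse_fastgbs_line are invented, file names are assumed to
contain no spaces, and the default_format argument stands in for
the LIBTYPE setting.

```python
from dataclasses import dataclass
from typing import Optional

FORMAT_NAMES = {
    0: "Pearson/FASTA",
    1: "Uncompressed Genbank",
    2: "NBRF CODATA",
    3: "EMBL/SWISS-PROT",
    4: "Intelligenetics",
    5: "NBRF/PIR VMS",
    9: "Compressed Genbank Floppy",
}


@dataclass
class LibraryEntry:
    description: str            # field 1, ends at the '$'
    is_dna: bool                # field 2: 0 = protein, 1 = DNA
    letter: str                 # field 3: character typed to select
    path: str                   # field 4, without any leading '@'
    is_file_of_names: bool      # True if field 4 began with '@'
    format_code: Optional[int]  # field 5, or None if left to the list file


def parse_fastgbs_line(line, default_format=0):
    description, rest = line.rstrip("\n").split("$", 1)
    fields = rest[2:].split()
    name = fields[0]
    is_list = name.startswith("@")
    if is_list:
        name = name[1:]
    if len(fields) > 1:
        fmt = int(fields[1])
    elif is_list:
        fmt = None              # format is given in the file of file names
    else:
        fmt = default_format
    return LibraryEntry(description, rest[0] == "1", rest[1], name, is_list, fmt)


if __name__ == "__main__":
    entry = parse_fastgbs_line("SWISS PROT 10$0S/usr/lib/vmspir/swiss.seq 5")
    print(entry.letter, entry.path, FORMAT_NAMES[entry.format_code])
```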
615
616     You can specify a group of library files by putting a '@'
617symbol before a file that contains a list of file names to be
618searched.  For example, if @gpri.nam is in the fastgbs file, the
619file "gpri.nam" might contain the lines:
620
621    </usr/lib/genbank
622    >glocus.idx
623    gpri1.seq
624    gpri2.seq
625    gpri12.seq
626
627In this case, the line beginning with a '<' indicates the
628directory the files will be found in.  The line beginning with a
629'>' indicates the index file; this is only used for the GENBANK
630compressed DNA database.  The remaining lines name the actual
631sequence files.  So the first sequence file to be searched would
632be:
633
634    /usr/lib/genbank/gpri1.seq
635
636The notation "<PIRNAQ:" might be used under the VAX/VMS operating
637system. Under UNIX, the trailing '/' is left off, so the library
638directory might be written as "</usr/seqlib".  In addition, when
639using the floppy disk version of GENBANK, annotation files are
640also required. These files (*.ano) should be placed in the same
641directory as the *.seq files.
642
643     With version 1.4 of the FASTA package, the FASTA and TFASTA
644programs can search a library composed of different files in
645different sequence formats.  For example, you may wish to search
646the Genbank files (which are in compressed floppy format) and the
647EMBL DNA sequence database on CD-ROM.  To do this, you simply
648list the names and filetypes of the files to be searched in a
649file of filenames.  For example, to search the mammalian portion
650of Genbank, the unannotated portion of Genbank, and the
651unannotated portion of the EMBL library, you could use the file:
652
665    </usr/lib/DNA
666    >glocus.idx
667    gpri1.seq 9
668    gpri2.seq 9
669    ...
670    gpri9.seq 9
671    #  (this '#' causes the program to display the size of the library)
672    grod1.seq 9
673    ...
674    gmam1.seq 9
675    ...
676    guna1.seq 9
677    ...
678    unanno.seq 5
679    #
680
681
    You do not need to include library format numbers if you
    only use the Pearson/FASTA version of the PIR protein
    sequence library and the Genbank DNA database on floppy
    disks.  If no library type is specified, the program
    assumes that type 0 is being used (unless you have set
    LIBTYPE).  However, if the program sees an index file line
    (e.g. ">glocus.idx"), it assumes that the files are in
    Genbank floppy disk format (type 9).
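
     The rules for a file of file names can be summarized in a
short Python sketch.  It follows the description above but is
not taken from the FASTA source: the function name is invented,
only the Unix convention of joining the directory and file name
with a '/' is handled (a VMS prefix such as "<PIRNAQ:" would be
joined without it), and '#' lines are simply skipped rather than
printing a library summary.

```python
def parse_file_of_names(lines, default_format=0):
    """Return (directory, index_file, [(path, format), ...])."""
    directory = ""
    index_file = None
    files = []
    for raw in lines:
        line = raw.strip()
        if not line or line.startswith("#"):
            continue                      # blank line or summary marker
        if line.startswith("<"):
            directory = line[1:]          # directory for the files
        elif line.startswith(">"):
            index_file = line[1:]         # GENBANK compressed index
        else:
            fields = line.split()
            if len(fields) > 1:
                fmt = int(fields[1])      # explicit library format
            elif index_file is not None:
                fmt = 9                   # index line implies floppy format
            else:
                fmt = default_format      # stands in for LIBTYPE
            path = f"{directory}/{fields[0]}" if directory else fields[0]
            files.append((path, fmt))
    return directory, index_file, files


if __name__ == "__main__":
    example = ["</usr/lib/genbank", ">glocus.idx", "gpri1.seq", "gpri2.seq"]
    for path, fmt in parse_file_of_names(example)[2]:
        print(path, fmt)
```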
690
691
692     Although FASTA works best when the libraries are saved on a
693hard disk, this is not required.  If you do not have a hard disk,
694you could refer to the protein database files by making a file
695"prot.nam" with the lines:
696
697    <B:
698    prot.0
699    prot.1
700    ...
701    prot.6
702    #       (print library summary)
703    new.0
704    ...
705
The FASTA program would then look for the files on the B: drive,
and when it did not find them, it would allow you to replace the
diskette in the drive.
709
710
711     Test the setup by running FASTA.  Enter the sequence file
712'MUSPLFM.AA' when the program requests it (this file is included
713with the programs).  The program should then ask you to select a
714protein sequence library.  Alternatively, if you run the TFASTA
715program and use the MUSPLFM.AA query sequence, the program should
716show you a selection of DNA sequence libraries.  Once the fastgbs
717file has been set up correctly, you can set FASTLIBS=fastgbs in
731your AUTOEXEC.BAT file, and you will not need to remember where
732the libraries are kept or how they are named.
733
734     The EXTRACTN program extracts DNA sequences or annotations
735from the GENBANK DNA sequence library in the compressed floppy
736disk format. To tell EXTRACTN where to find the DNA sequence
737library and index files, set the environment variable GBLIB.
738
739    setenv GBLIB /usr/lib/genbank
740
741
742     FASTA and TFASTA must open a large number of files when
743searching and reporting the results of a GENBANK floppy disk
744format library search.  You may have problems with the large
745number of files under DOS on IBM-PC's (Unix and VMS users will
746not have these problems).  If you are going to search the GENBANK
747floppy disk format DNA sequence library under DOS, you should add
748the line:
749
750    FILES=16
751
752to your CONFIG.SYS file.  (Typically this is already done for
753programs like Windows or WordPerfect.)
754
7963.  Using the FASTA Package
797
7983.1.  Overview
799
     The FASTA sequence comparison programs all require similar
information: the name of a query sequence file, the name of a
library file, and the ktup parameter.  All of the programs can
accept arguments on the command line, or they will prompt for the
file names and ktup value.
805
806To use FASTA, simply type:
807
808    FASTA
809    and you will be prompted for :
810         the name of the test sequence file
811         the name of the library file
812         and whether you want ktup = 1 or 2. (or 1 to 6 for DNA sequences)
813
814             ktup of 2 is about 5 times faster than ktup = 1.
815             For  a  200  aa sequence against a 10,000,000 aa
816             library, the program takes  about  30  min  with
817             ktup = 2, 150 min with ktup = 1, on a 12 Mhz 286
818             IBM-PC.
819
820
821The program can also be run by typing
822
823    FASTA test.aa /lib/bigfile.lib ktup (1 or 2)
824
825
826Included with the package are the test files, MUSPLFM.AA,
827LCBO.AA, MCHU.AA and BOVPRL.SEQ.  To check to make certain that
828everything is working, you can try:
829
830    fasta musplfm.aa lcbo.aa
831    and
832    tfasta musplfm.aa bovprl.seq
833
834To test the local similarity programs LFASTA and PLFASTA, try:
835
836    lfasta mchu.aa mchu.aa
837    and
838    plfasta mchu.aa mchu.aa (use this only on an IBM-PC with graphics
839    or on a Tektronix terminal under UNIX or VMS)
840
841MCHU (calmodulin) has four duplicated calcium binding sites that
842are clearly detected by LFASTA.  For a more complicated example,
843try MWRTC1.aa, myosin heavy chain.
844
8453.2.  Sequence files
846
847     The FASTA programs know about three kinds of sequence files
848(four under VMS): (1) plain sequence files that can only be used
as query sequences or for LFASTA, RDF2, and ALIGN; (2) standard
library files, which are the same as plain sequence files except
that each sequence is preceded by a comment line with a '>' in
the first column; (3) distributed sequence libraries (this is a
broad class that includes the NBRF/PIR VMS and blocked ascii
formats, Genbank flat-file format, EMBL flat-file format, and
Intelligenetics format).  All of the files that you create should
be of type (1) or (2).  Type (2) files (ones with a '>' comment
line) can be used as query or library sequence files by all of
the programs.
871
     I have included several sample test files, *.AA.  The first
line may begin with a '>' or ';' followed by a comment.  The text
after a ';' in other lines will be ignored.  Spaces and tabs (and
anything else that is not an amino-acid code) are ignored.
877
878     Library files should have the form:
879
880    >Sequence name and identifier
881    A F A S Y T .... actual sequence.
882    F S S       .... second line of sequence.
883    >Next sequence name and identifier
884
885This is the form of the PROT.* supplied with the floppy disk
886version of the PIR protein sequence library. You can also build
887your own library by concatenating several sequence files.  Just
888be sure that each sequence is preceded by a line beginning with a
889'>' with a sequence name.
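
     The library layout described above can be read with a few
lines of code.  The following is only a minimal sketch in Python,
not the parser used by the FASTA programs.  The function name
read_library is hypothetical, and the sketch assumes that every
letter is a residue code and that residues should be upper-cased;
the real programs use their own alphabets.

```python
def read_library(lines):
    """Parse '>'-style library text into (title, sequence) pairs.

    Assumption: any alphabetic character is kept as a residue code;
    spaces, tabs, digits, and punctuation are dropped.
    """
    entries = []
    title, residues = None, []
    for line in lines:
        line = line.rstrip("\r\n")
        if line.startswith(">"):
            # A new title line closes the previous entry, if any.
            if title is not None:
                entries.append((title, "".join(residues)))
            title, residues = line[1:].strip(), []
        elif title is not None:
            # Ignore text after ';', then keep letters only.
            body = line.split(";", 1)[0]
            residues.append("".join(ch for ch in body if ch.isalpha()).upper())
    if title is not None:
        entries.append((title, "".join(residues)))
    return entries
```

Because each entry starts at its own '>' line, a file made by
concatenating several such files parses the same way.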
890
891     The test file should not have lines longer than 120
892characters, and sequences entered with word processors should use
893a document mode, with normal carriage returns at the end of
894lines.
895
896Program Summary
897
8983.3.  Sequence search programs
899
900FASTA     universal sequence comparison. Defaults to comparing
901          protein sequences; if the sequences are > 85% A+C+G+T
902          or the -n option is used, a DNA sequence is assumed.
903
904TFASTA    Search DNA library for a protein sequence by
905          translating the DNA sequence to protein in all six
906          frames (three forward frames with the -3 command line
907          option). TFASTA with ktup=2 is about as fast as a DNA
908          FASTA with ktup=4, and is substantially more sensitive.
909          (also reads the GENBANK library)
910
911SSEARCH   Universal sequence comparison using the Smith-Waterman
912          algorithm ( T. F. Smith and M. S. Waterman (1981) J.
913          Mol. Biol. 147:195-197).  This program uses code
914          developed by Huang and Miller (X. Huang, R. C.
928          Hardison, W. Miller (1990) CABIOS 6:373-381) for
929          calculating the local similarity score and code from
930          the ALIGN program (see below) for calculating the local
          alignment.  SSEARCH is about 100 times slower than
932          FASTA with ktup=2 (for proteins).  It should never be
933          used to search an entire protein sequence library, but
934          can be used to search several hundred sequences.
935
936ALIGN     optimal global alignment of two sequences with no
937          short-cuts.  This program is a slightly modified
938          version of one taken from E.  Myers and W. Miller. The
939          algorithm is described in E. Myers and W.  Miller,
940          "Optimal Alignments in Linear Space" (CABIOS (1988)
941          4:11-17).
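
     Two ideas from the FASTA and TFASTA entries above can be
illustrated in Python: guessing that a sequence is DNA when more
than 85% of it is A+C+G+T, and translating a DNA sequence in six
frames (or only the three forward frames, as with -3).  This is a
rough sketch, not the programs' own code.  The function names are
hypothetical, the standard genetic code is assumed, and codons
containing anything other than A, C, G, or T are translated as
'X' by assumption.

```python
BASES = "TCAG"
# Standard genetic code, codons ordered TTT, TTC, TTA, TTG, TCT, ...
AMINO = ("FFLLSSSSYY**CC*W" "LLLLPPPPHHQQRRRR"
         "IIIMTTTTNNKKSSRR" "VVVVAAAADDEEGGGG")
CODON_TABLE = {a + b + c: AMINO[16 * i + 4 * j + k]
               for i, a in enumerate(BASES)
               for j, b in enumerate(BASES)
               for k, c in enumerate(BASES)}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def looks_like_dna(seq, threshold=0.85):
    """True if more than `threshold` of the residues are A, C, G, or T."""
    seq = seq.upper()
    if not seq:
        return False
    acgt = sum(seq.count(base) for base in "ACGT")
    return acgt / len(seq) > threshold


def translate(dna):
    """Translate complete codons from position 0; unknown codons give 'X'."""
    return "".join(CODON_TABLE.get(dna[i:i + 3], "X")
                   for i in range(0, len(dna) - 2, 3))


def six_frames(dna, forward_only=False):
    """Return three forward frames, then three reverse-complement frames."""
    dna = dna.upper()
    strands = [dna]
    if not forward_only:
        strands.append(dna.translate(COMPLEMENT)[::-1])
    return [translate(strand[offset:])
            for strand in strands for offset in range(3)]
```

A protein query could then be compared against each translated
frame in turn, which is the general idea behind a translated
search.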
942
9433.4.  Local similarity programs
944
945LFASTA    local similarity searches showing local alignments.
946          The algorithm used to calculate the local alignment in
947          a band has been improved (Chao, Pearson, and Miller,
948          submitted).
949
950PLFASTA   local similarity searches with plot output (on the IBM,
951          this program requires that the environment variable
952          BGIDIR be set).
953
954PCLFASTA  (unix only) local similarity searches with plot output
955          using pic commands.
956
957LALIGN    Calculates the N-best local alignments using a rigorous
958          algorithm.  (N=10 by default.) The algorithm was
959          developed by Huang and Miller (X.  Huang and W.  Miller
960          (1991) Adv. Appl. Math. 12:337-357), which is a
961          linear-space version of an algorithm described by M. S.
962          Waterman and M. Eggert (J.  Mol. Biol. 197:723-728).
963          Like SSEARCH, LALIGN is rigorous, but also very slow.
964
965PLALIGN   A version of LALIGN that plots its output to a screen
966          or to a Tektronix terminal emulator.
967
9683.5.  Statistical Significance
969
970RDF2      improved version of RDF program with all three scoring
971          methods (now includes local, or window, shuffle
972          routine)
973
974RSS       A version of RDF2 that uses the rigorous Smith-Waterman
975          calculation used by SSEARCH.  RSS should provide a more
976          rigorous test of the statistical significance of a
977          similarity score.
978
979RELATE    significance program described by Dayhoff (Atlas of
980          Protein Sequence and Structure, Vol. 5, Supplement 3).
994          Each chunk of 25 residues in one sequence is compared
995          to every 25 residue fragment of the second sequence.
996          Sequences which are genuinely related will have a large
997          number of scores greater than 3 standard deviations
998          above the mean score of all of the comparisons.
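
     The fragment comparison described for RELATE can be sketched
as follows.  This is an illustration of the idea only, under
several assumptions: fragments are taken at every position of
both sequences, each pair is scored by counting identical
residues (RELATE itself uses a scoring matrix), and the standard
deviation is the population value.  All names are hypothetical.

```python
from statistics import mean, pstdev


def fragments(seq, size):
    """All overlapping fragments of length `size`."""
    return [seq[i:i + size] for i in range(len(seq) - size + 1)]


def identity_score(a, b):
    """Number of positions at which two equal-length fragments agree."""
    return sum(x == y for x, y in zip(a, b))


def count_high_scores(seq1, seq2, size=25, n_sd=3.0):
    """Count fragment pairs scoring more than n_sd SDs above the mean.

    Returns (count, mean score, standard deviation).
    """
    scores = [identity_score(f1, f2)
              for f1 in fragments(seq1, size)
              for f2 in fragments(seq2, size)]
    if not scores:
        return 0, 0.0, 0.0
    mu, sd = mean(scores), pstdev(scores)
    if sd == 0:
        return 0, mu, sd
    cutoff = mu + n_sd * sd
    return sum(s > cutoff for s in scores), mu, sd
```

With very few fragment pairs no score can lie 3 standard
deviations above the mean, so the count is only meaningful for
sequences much longer than the fragment size.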
999
10003.6.  Other analysis programs
1001
1002AACOMP    calculate the amino acid composition and molecular
1003          weight of a sequence.
1004
1005BESTSCOR  calculate the best self-comparison score.
1006
1007GREASE    Kyte-Doolittle hydropathicity profile
1008
1009TGREASE   graphic plot of Kyte-Doolittle profile
1010
1011FROMGB    convert from GenBank LOCUS format (also used by the
1012          IBI-Pustell programs) to Pearson/FASTA format.
1013
GARNIER   A secondary structure prediction program using the
          method of Garnier, Osguthorpe, and Robson (J. Mol.
          Biol. (1978) 120:97-120).
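
     As an illustration of the kind of calculation GREASE
performs, here is a minimal Python sketch of a Kyte-Doolittle
hydropathy profile.  It is not the GREASE source.  The function
name and the window length of 7 are assumptions, and a residue
outside the 20 standard amino acids raises a KeyError.

```python
# Kyte-Doolittle hydropathy values for the 20 standard amino acids.
KD = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def hydropathy_profile(seq, window=7):
    """Mean hydropathy over each window of `window` residues."""
    values = [KD[residue] for residue in seq.upper()]
    return [sum(values[i:i + window]) / window
            for i in range(len(values) - window + 1)]
```

A sequence shorter than the window gives an empty profile.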
1017
10183.7.  Searching for keywords
1019
1020FINDP     (DOS, Macintosh only) Searches the protein sequence
1021          library title lines (or the aabank.nam file created by
1022          SINDEX) for a list of key words.  For example:
1023
1024              FINDP aabank.nam trypsin
1025
1026          will search the file of title lines and report all
1027          lines with the word "trypsin" in them.  You can search
1028          for several words at once, by putting several words on
1029          the line.  Normally, FINDP (and FINDN) ignore upper and
1030          lower case.  If you would like to search for a specific
1031          case, e.g. Trypsin but not chymotrypsin, use the -l
1032          option:
1033
1034              FINDP aabank.nam -l Trypsin
1035
1036
1037FINDN     Searches the GENBANK *.ano annotation files for words.
1038          FINDN can search a specific file, or a list of
1039          annotation files.  For example, if the file GPRIA.NAM
1040          contains the lines:
1041
1042              gpri1.ano
1043              gpri2.ano
1044              gpri3.ano
1045              ...
1046              then
1047
1060              FINDN @gpria.nam trypsin
1061
1062          would search all of the files.  FINDN also uses "-l" to
1063          preserve upper/lower case distinctions.
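
     The keyword search can be pictured with a short Python
sketch.  This is not FINDP or FINDN itself.  The function name is
hypothetical, and the sketch assumes that a line is reported when
it contains any one of the words given; the text above does not
say whether all words must match.

```python
def find_titles(lines, words, case_sensitive=False):
    """Return the lines that contain at least one of `words`.

    case_sensitive=True plays the role of the -l option.
    """
    hits = []
    for line in lines:
        haystack = line if case_sensitive else line.lower()
        for word in words:
            needle = word if case_sensitive else word.lower()
            if needle in haystack:
                hits.append(line)
                break
    return hits
```

With case_sensitive=True, a search for "Trypsin" skips a title
that contains only "chymotrypsin", as in the -l example above.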
1064
10653.8.  Options
1066
     These programs have a number of output options, which are
invoked by the environment variables LINLEN, SHOWALL, and MARKX.
Alternatively, these values can be controlled by command line
options.  The number of sequence residues per output line is now
adjustable by setting the environment variable LINLEN, or the
command line option -w.  LINLEN is normally 60; to change it, set
LINLEN=80 before running the program or add -w 80 to the command
line.  LINLEN can be set up to 200.  SHOWALL (-a) determines
whether all, or just a portion, of the aligned sequences are
displayed.  Previously, FASTP would show the entire length of
both sequences in an alignment while FASTN would only show the
portions of the two sequences that overlapped.  Now the default is
to show only the overlap between the two sequences; to show
complete sequences, set SHOWALL=1, or use the -a option on the
command line.
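
     The way a line length might be chosen from these settings
can be sketched in Python.  This is only an illustration, not the
programs' own logic.  It uses the default of 60 and the limit of
200 given above, and it lets the command line value win over the
environment variable, as stated in the environment variable
summary.  The function name is hypothetical, and clamping an
out-of-range value (rather than rejecting it) is an assumption.

```python
import os


def resolve_linlen(cli_width=None, environ=None, default=60, maximum=200):
    """Pick the output line length: -w value, else LINLEN, else default."""
    environ = os.environ if environ is None else environ
    if cli_width is not None:
        value = int(cli_width)           # command line option -w
    elif environ.get("LINLEN"):
        value = int(environ["LINLEN"])   # environment variable
    else:
        value = default
    # Assumption: out-of-range values are clamped to 1..maximum.
    return max(1, min(value, maximum))
```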
1082
     The differences between the two aligned sequences can be
highlighted in three different ways by changing the environment
variable MARKX or the -m option.  Normally (MARKX=0) the program
uses ':' to denote identities and '.' to denote conservative
replacements.  If MARKX=1, the program will not mark identities;
instead conservative replacements are denoted by an 'x' and
non-conservative substitutions by an 'X'.  If MARKX=2, the residues
in the second sequence are only shown if they are different from
the first.  Thus the three options are:
1092
1093
1094    MARKX=0 (default)       MARKX=1        MARKX=2
1095
1096            MWRTCGPPYT     MWRTCGPPYT     MWRTCGPPYT
1097            ::..:: :::       xx  X        ..KS..Y...
1098            MWKSCGYPYT     MWKSCGYPYT
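
     The three marking styles can be reproduced with a small
Python sketch.  It is not the programs' own code.  In the real
programs the scoring matrix decides which replacements count as
conservative; here the caller supplies them as a list of residue
pairs, which is an assumption made for illustration, and the
function name is hypothetical.

```python
def mark_alignment(first, second, markx, conservative):
    """Return the marker line for two aligned, equal-length sequences.

    `conservative` is an iterable of two-letter pairs such as "RK".
    For markx=2 the result is the second sequence with identical
    positions shown as '.'.
    """
    pairs = {frozenset(pair) for pair in conservative}
    out = []
    for a, b in zip(first, second):
        same = a == b
        cons = frozenset((a, b)) in pairs
        if markx == 0:
            out.append(":" if same else "." if cons else " ")
        elif markx == 1:
            out.append(" " if same else "x" if cons else "X")
        elif markx == 2:
            out.append("." if same else b)
        else:
            raise ValueError("markx must be 0, 1, or 2")
    return "".join(out)
```

Treating R/K and T/S as conservative pairs, the two sequences in
the table above give the same marker lines as shown there.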
1099
1100
11013.9.  Command line options
1102
1103     It is now possible to specify  several options on the
1104command line, instead of using environment variables.  The
1105command line options are preceded by a dash; the following
1106options are available:
1107
1108-a        same as showall=1
1109
1110-b        number of sequence scores to be shown on output
1111
1125-c #      threshold score for optimization (OPTCUT).  Set "-c 1"
1126          and "-o" to optimize every sequence in a database.
1127          (This slows the program down about 5-fold).
1128
1129-d #      number of alignments to be reported by default. (Used
1130          in conjunction with -Q).
1131
-f        use the identical match score from the scoring matrix
          in the scan for initial regions (default for protein)
          (PAMFACT=1)
1134
1135-g #      Threshold for joining init1 segments to build an initn
1136          score (GAPCUT).
1137
1138-k        use constant score in scan for initial regions (like
1139          old fastp, fastn, default for DNA) (PAMFACT=0)
1140
1141-l file   location of library menu file (FASTLIBS)
1142
1143-m #      MARKX = # (0, 1, 2)
1144
1145-n        Force the query sequence to be treated as a DNA
1146          sequence.  This is particularly useful for query
1147          sequences that contain a large number of ambiguous
1148          residues, e.g. transcription factor binding sites.
1149
1150-o        optimize all scores greater than OPTCUT.  If '-c' is
1151          not specified, OPTCUT will be calculated from the
1152          length of the sequence and the ktup setting, as the old
1153          CUTOFF value used to be.
1154
1155-Q        quiet - does not prompt for any input.  Writes scores
1156          and alignments to the terminal or standard output file.
1157
1158-r file   save a results summary line for every sequence in the
1159          sequence library.  The summary line includes the
1160          sequence identifier, superfamily number (if available)
1161          position in the library, and the similarity scores
1162          calculated.  This option can be used to evaluate the
1163          sensitivity and selectivity of different search
1164          strategies (see W. R. Pearson (1991) Genomics 11:635-
1165          650.)
1166
1167-s file   SMATRIX is read from file.  Several SMATRIX files are
1168          provided with the standard distribution.  For protein
1169          sequences: codaa.mat - based on minimum mutation
1170          matrix; idnaa.mat - identity matrix; idpaa.mat -
1171          identity matrix for mismatches, but identical matches
1172          weighted according to the PAM250 matrix; pam250.mat -
1173          the PAM250 matrix developed by Dayhoff et al (Atlas of
1174          Protein Sequence and Structure, vol. 5, suppl. 3,
1175          1978); pam120.mat - a PAM120 matrix.  The SMATRIX also
1176          specifies the penalties for the first residue in a gap
1177          and additional residues in a gap; FASTA, the other
1191          alignment programs, and the SMATRIX files use -12 and
1192          -4. Currently, to change the -12, -4 gap penalties, the
1193          SMATRIX file must be edited.
1194
1195-v        (LINEVAL) values used for line styles in plfasta
1196
1197-w #      line length (width) = number (<200)
1198
1199-x        specifies offsets for the beginning of the query and
1200          library sequence.  For example, if you are comparing
1201          upstream regions for two genes, and the first sequence
1202          contains 500 nt of upstream sequence while the second
1203          contains 300 nt of upstream sequence, you might try:
1204
1205              fasta -x "-500 -300" seq1.nt seq2.nt
1206
1207          If the -x option is not used, FASTA assumes numbering
1208          starts with 1.  This option will not work properly with
1209          the translated library sequence with tfasta.  (You
1210          should double check to be certain the negative
1211          numbering works properly.)
1212
1213-1        sort output by init1 score (as FASTP used to do).
1214
1215-3        (TFASTA only) translate only three forward frames
1216
1217
1218For example:
1219
    fasta -w 80 -a seq1.aa seq2.aa
1221
1222would compare the sequence in seq1.aa to that in seq2.aa and
1223display the results with 80 residues on an output line, showing
1224all of the residues in both sequences.  Be sure to enter the
1225options before entering the file names, or just enter the options
1226on the command line, and the program will prompt for the file
1227names.
1228
     Not all of these options are appropriate for all of the
programs.  The options above are used by FASTA and TFASTA; RELATE
uses the -s option, ALIGN uses the -w, -m, and -s options, and
the RDF2 programs use -c, -f, -k, and -s.
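
     The gap penalties mentioned under the -s option (-12 for the
first residue in a gap, -4 for each additional residue) imply a
simple cost for a gap of a given length.  The following Python
lines are only a worked illustration of that arithmetic; the
function name is hypothetical.

```python
def gap_score(length, first=-12, additional=-4):
    """Score contribution of one gap spanning `length` residues."""
    if length <= 0:
        return 0
    return first + additional * (length - 1)
```

For example, a gap of three residues costs -12 + 2 * (-4) = -20.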
1233
12344.  Environment variable summary
1235
1236     Environment variables allow you to set search parameters
1237that will be used frequently when you run a program; for example,
1238if you prefer to use the PAM120 scoring matrix, you might "set
1239SMATRIX=120."  Command line parameters, if used, always override
1240environment variable settings. The following environment
1241variables are used by this program:
1242
1256AABANK    the file name  of the default sequence library.
1257
1258FASTLIBS  the location of the file which contains the list of
1259          library files to be searched.
1260
1261GAPCUT    threshold used for joining init1 regions in the second
1262          step of FASTA.  Normally set based on sequence length
1263          and ktup.
1264
1265GBLIB     the directory where the EXTRACTN files and glocus.idx
1266          are found.
1267
1268LIBTYPE   used to specify the format of the library sequence for
1269          FASTA and TFASTA.
1270
1271LINLEN    output line length - can go up to 200
1272
1273LINEVAL   used by plfasta to determine the relationship between
1274          line style and similarity score (-v).  This should be a
1275          string of three numbers, e.g.  "200 100 50"
1276
1277MARKX     symbol for denoting matches, mismatches. Note that this
1278          symbol is only used across the optimized local region;
1279          sequences that are outside this region are not marked.
1280
1281OPTCUT    Set the threshold to be used for optimization in a band
1282          around the best initial region.  Normally the OPTCUT
1283          value is calculated from the length of the sequence and
1284          the ktup value (for a 200 residue sequence, it is about
1285          28).  If OPTCUT=1, every sequence in the database will
1286          be optimized.  This is the most sensitive option.
1287
1288PAMFACT   This version of fasta uses a more sensitive method for
1289          identifying initial regions. Instead of using a
1290          constant factor (fact) for each match in a ktup, it
1291          uses the scoring matrix (PAM) scores.  While this works
1292          well for protein sequences, it has not been as
1293          carefully tested for DNA sequences, so by default, this
1294          modification is used for proteins but not for DNA.  The
1295          -f 1 option forces this option on. -f 0 forces it off.
1296          Setting the PAMFACT environment variable to 1 forces
1297          the option on; PAMFACT=0 turns it off.
1298
1299SHOWALL   on output, show the complete sequence instead of just
1300          the overlap of the two aligned sequences.
1301
1302SMATRIX   alternative scoring matrix file.
1303
1304TEKPLOT   (IBM-PC only, Unix and VMS versions generate Tektronix
1305          graphics by default) Generate Tektronix output.
1306          Normally, PLFASTA and TGREASE plot graphs using the
1307          Turbo C graphics library.  Unfortunately, often these
1308          plots cannot be printed out without special programs.
          (I have used GRAFPLUS, from Jewell Technologies, (206)
          937-1081, $50, successfully.)  However, if you set
          TEKPLOT=1, Tektronix graphics commands will be used.
          Tektronix commands can be used together with the
          PLOTDEV program, available from Microplot Systems, 1897
          Red Fern Dr., Columbus, OH, 43229, (614) 882-4786, for
          $40, which also allows you to print out graphics on the
          screen.
1330
1331
1332As always, please inform me of bugs as soon as possible.
1333
1334William R. Pearson
1335Department of Biochemistry
1336Box 440, Jordan Hall
1337U. of Virginia
1338Charlottesville, VA
1339
1340wrp@virginia.EDU
1341wrp@virginia.BITNET