1 | |
---|
2 | |
---|
3 | |
---|
4 | |
---|
5 | FASTA.DOC Release 1.6 |
---|
6 | |
---|
7 | |
---|
8 | |
---|
9 | COPYRIGHT NOTICE |
---|
10 | |
---|
11 | Copyright 1988, 1991, 1992 by William R. Pearson and the |
---|
12 | University of Virginia. All rights reserved. The FASTA program |
---|
13 | and documentation may not be sold or incorporated into a |
---|
14 | commercial product, in whole or in part, without written consent |
---|
15 | of William R. Pearson and the University of Virginia. For |
---|
16 | further information regarding permission for use or reproduction, |
---|
17 | please contact: William R. Wilkerson, Assistant Provost for |
---|
18 | Research, University of Virginia, P.O. Box 9025, Charlottesville, |
---|
19 | VA 22906-9025, (804) 924-6853 |
---|
20 | |
---|
21 | |
---|
22 | The FASTA program package |
---|
23 | |
---|
24 | Introduction |
---|
25 | |
---|
26 | This documentation describes version 1.6c of the FASTA |
---|
27 | program package (see W. R. Pearson and D. J. Lipman (1988), |
---|
28 | "Improved Tools for Biological Sequence Analysis", PNAS 85:2444- |
---|
29 | 2448, and W. R. Pearson (1990) "Rapid and Sensitive Sequence |
---|
30 | Comparison with FASTP and FASTA" Methods in Enzymology 183:63- |
---|
31 | 98). Version 1.6 is the first release for the IBM-PC and |
---|
32 | Macintosh since version 1.4 (version 1.5 was distributed only via |
---|
33 | ftp to unix machines). Version 1.6 has a large number of |
---|
34 | improvements over versions 1.4 and 1.5, including the ability to |
---|
35 | search libraries in several different formats in the same run, |
---|
36 | more robust algorithms for aligning sequences along a band, and |
---|
37 | additional, rigorous (but slow) programs for sequence searching, |
---|
38 | statistical analysis, and local sequence alignment. In addition, |
---|
39 | several additional options are included. Programs that are new |
---|
40 | with version 1.6 are highlighted in italics. |
---|
41 | |
---|
42 | |
---|
43 | Although there are a large number of programs in this package, |
---|
44 | they belong to three groups: |
---|
45 | |
---|
46 | |
---|
47 | Library search programs: FASTA, TFASTA, SSEARCH |
---|
48 | |
---|
49 | Local homology programs: LFASTA, PLFASTA, LALIGN, PLALIGN |
---|
50 | |
---|
51 | Statistical significance: RDF2, RELATE, RSS |
---|
52 | |
---|
53 | |
---|
54 | In addition, there are several programs for other sequence |
---|
55 | analysis tasks: |
---|
56 | |
---|
57 | |
---|
58 | ALIGN - global alignment of two sequences (no limit on gaps). |
---|
59 | |
---|
60 | EXTRACTP, SINDEX - programs to index (SINDEX) and extract sequences |
---|
74 | from a protein sequence database. |
---|
75 | |
---|
76 | EXTRACTN - programs to extract sequences from the GenBank floppy disk |
---|
77 | format data base. |
---|
78 | |
---|
79 | |
---|
80 | In addition, I have included several programs for protein |
---|
81 | sequence analysis, including a Kyte-Doolittle hydropathicity |
---|
82 | plotting program (GREASE, TGREASE), and a secondary structure |
---|
83 | prediction package (GARNIER). |
---|
84 | |
---|
85 | The FASTA sequence comparison programs on this disk are |
---|
86 | improved versions of the FASTP program, originally described in |
---|
87 | Science (Lipman and Pearson, (1985) Science 227:1435-1441). We |
---|
88 | have made several improvements. First, the library search |
---|
89 | programs use a more sensitive method for the initial comparison |
---|
90 | of two sequences which allows the scores of several similar |
---|
91 | regions to be combined. As a result, the results of a library |
---|
92 | search are now given with three scores, initn (the new initial |
---|
93 | score which may include several similar regions), init1 (the old |
---|
94 | fastp initial score from the best initial region), and opt (the |
---|
95 | old fastp optimized score allowing gaps in a 32 residue wide |
---|
96 | band). |
---|
97 | |
---|
98 | These programs have also been modified to become "universal" |
---|
99 | (hence FAST-A, for FAST-All, as opposed to FAST-P (protein) or |
---|
100 | FAST-N (nucleotides)); by changing the environment variable |
---|
101 | SMATRIX, the programs can be used to search protein sequences, |
---|
102 | DNA sequences, or whatever you like. By default, FASTA, LFASTA, |
---|
103 | and the RDF programs automatically recognize protein and DNA |
---|
104 | sequences. Sequences are first read as amino acids, and then |
---|
105 | converted to nucleotides if the sequence is greater than 85% |
---|
106 | A,C,G,T (the '-n' option can be used to indicate DNA sequences). |
---|
107 | TFASTA compares protein sequences to a translated DNA sequence. |
---|
108 | Alternative scoring matrices can also be used. In addition to |
---|
109 | the PAM250 matrix for proteins, matrices based on simple |
---|
110 | identities or the genetic code can also be used for sequence |
---|
111 | comparisons or evaluation of significance. Several different |
---|
112 | protein sequence matrices have been included; instructions for |
---|
113 | constructing your own scoring matrix are included in the file |
---|
114 | FORMAT.DOC. |
---|
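The automatic recognition rule described above (treat a sequence as DNA when more than 85% of it is A, C, G, or T) can be illustrated with a short sketch. This is not the code used by FASTA; the function name looks_like_dna and the decision to count only alphabetic characters are assumptions made for the example.

```python
def looks_like_dna(sequence, threshold=0.85):
    """Return True if more than `threshold` of the residues are A, C, G, or T.

    Hypothetical helper: only alphabetic characters are counted, so digits,
    spaces, and line breaks in the input are ignored.
    """
    residues = [c for c in sequence.upper() if c.isalpha()]
    if not residues:
        return False
    acgt = sum(1 for c in residues if c in "ACGT")
    return acgt / len(residues) > threshold


print(looks_like_dna("ACGTACGTTTGACCA"))    # every residue is A, C, G, or T
print(looks_like_dna("MKVLAAGIVGLCAKRT"))   # mostly other amino acid codes
```

A short peptide that happens to be rich in A, C, G, and T would be misclassified by a rule of this kind, which is presumably why the '-n' option exists to state the sequence type explicitly.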
115 | |
---|
116 | |
---|
117 | The remainder of this document is divided into three sections: |
---|
118 | (1) a brief history of the changes to the FASTA package; (2) a |
---|
119 | guide to installing the programs and databases; (3) a guide to |
---|
120 | using the FASTA programs. The programs are very easy to use, so |
---|
121 | if you are using them on a machine that is administered by |
---|
122 | someone else, you may want to skip to section (3) to learn how to |
---|
123 | use the programs, and then read section (1) to look at some of |
---|
124 | the more recent changes. If you are installing the programs on |
---|
125 | your own machine, you will need to read section (2) carefully. |
---|
126 | |
---|
139 | 1. Revision History |
---|
140 | |
---|
141 | 1.1. Changes with version 1.6 |
---|
142 | |
---|
143 | FASTA version 1.6 uses a new method for calculating optimal |
---|
144 | scores in a band (the optimization or last step in the FASTA |
---|
145 | algorithm). In addition, it uses a linear-space method for |
---|
146 | calculating the actual alignments. The FASTA package also |
---|
147 | includes four new programs: |
---|
148 | |
---|
149 | SSEARCH a program to search a sequence database using the |
---|
150 | rigorous Smith-Waterman algorithm (this program is |
---|
151 | about 100-fold slower than FASTA with ktup=2 for |
---|
152 | proteins). |
---|
153 | |
---|
154 | RSS a version of RDF2 that uses a rigorous Smith-Waterman |
---|
155 | calculation to score similarities |
---|
156 | |
---|
157 | LALIGN A rigorous local sequence alignment program that will |
---|
158 | display the N-best local alignments (N=10 by default). |
---|
159 | |
---|
160 | PLALIGN a version of lalign that plots the local alignments. |
---|
161 | |
---|
162 | The LALIGN/PLALIGN programs incorporate the "sim" algorithm |
---|
163 | described by Huang and Miller (1991) Adv. Appl. Math. 12:337-357. |
---|
164 | The SSEARCH and RSS programs incorporate algorithms described by |
---|
165 | Huang, Hardison, and Miller (1990) CABIOS 6:373-381. |
---|
166 | |
---|
167 | LFASTA and PLFASTA now calculate a different number of local |
---|
168 | similarities; they now behave more like LALIGN/PLALIGN. Since |
---|
169 | local alignments of identical sequences produce "mirror-image" |
---|
170 | alignments, lalign and lfasta consider only one-half of the |
---|
171 | potential alignments between sequences from identical file names. |
---|
172 | Thus |
---|
173 | |
---|
174 | lfasta mchu.aa mchu.aa |
---|
175 | |
---|
176 | displays only two alignments; earlier versions of the |
---|
177 | program would have displayed five, including the identity |
---|
178 | alignment. PLFASTA does display five alignments; when two |
---|
179 | identical filenames are given, it draws the identity alignment, |
---|
180 | calculates the two unique local alignments, draws them, and draws |
---|
181 | their mirror images. LFASTA/PLFASTA and LALIGN/PLALIGN use the |
---|
182 | filenames, rather than the actual sequences, to determine whether |
---|
183 | sequences are identical; you can "trick" the programs into |
---|
184 | behaving the old way by putting the same sequence in two |
---|
185 | different files. |
---|
186 | |
---|
187 | 1.2. Changes with version 1.5 |
---|
188 | |
---|
189 | FASTA version 1.5 includes a number of substantial revisions |
---|
190 | to improve the performance and sensitivity of the program. It is |
---|
191 | now possible to tell the program to optimize all of the initn |
---|
205 | scores greater than a threshold. The threshold is set at the |
---|
206 | same value as the old FASTA cutoff score (approximately 0.5 |
---|
207 | standard deviations above the mean for average length sequences). |
---|
208 | For highest sensitivity, you can use the -c 1 option to set the |
---|
209 | threshold to 1. (This will slow the search down about 5-fold). |
---|
210 | Alternatively, you can tell FASTA to sort the results by the |
---|
211 | init1, rather than the initn, score by using the -1 option. |
---|
212 | FASTA -1 ... will report the results the way the older FASTP |
---|
213 | program did. A comparison of the performance of FASTA in this, |
---|
214 | its slowest mode, with the standard FASTA and the Smith-Waterman |
---|
215 | algorithm has been published in Genomics (1991) 11:635-650. |
---|
216 | |
---|
217 | A new method has been provided for selecting libraries. In |
---|
218 | the past, one could enter the name of a sequence file to be |
---|
219 | searched or a single letter that would specify a library from the |
---|
220 | list included in the $FASTLIBS file. Now, you can specify a set |
---|
221 | of library files with a string of letters preceded by a '%'. |
---|
222 | Thus, if the FASTLIBS file has the lines: |
---|
223 | |
---|
224 | |
---|
225 | Genbank 70 primates$1P/seqlib/gbpri.seq |
---|
226 | Genbank 70 rodents$1R/seqlib/gbrod.seq |
---|
227 | Genbank 70 other mammals$1M/seqlib/gbmam.seq |
---|
228 | Genbank 70 vertebrates $1B/seqlib/gbvrt.seq |
---|
229 | |
---|
230 | |
---|
231 | Then the string: "%PRMB" would tell FASTA to search the four |
---|
232 | libraries listed above. The %PRMB string can be entered either |
---|
233 | on the command line or when the program asks for a filename or |
---|
234 | library letter. |
---|
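The '%' selection can be sketched as follows, using the four FASTLIBS lines shown above. This is only an illustration of the lookup, not the program's own code; the names library_table and select_libraries are hypothetical, and the sketch assumes that the two characters after the '$' are the library type and the selection letter.

```python
FASTLIBS_TEXT = """\
Genbank 70 primates$1P/seqlib/gbpri.seq
Genbank 70 rodents$1R/seqlib/gbrod.seq
Genbank 70 other mammals$1M/seqlib/gbmam.seq
Genbank 70 vertebrates $1B/seqlib/gbvrt.seq
"""


def library_table(fastlibs_text):
    """Map each selection letter to the file name that follows it."""
    table = {}
    for line in fastlibs_text.splitlines():
        if "$" not in line:
            continue
        _description, rest = line.split("$", 1)
        if len(rest) < 3:
            continue
        # rest[0] is the library type digit, rest[1] the selection letter.
        table[rest[1]] = rest[2:].strip()
    return table


def select_libraries(choice, table):
    """Expand a '%' string into file names; otherwise use a single letter."""
    letters = choice[1:] if choice.startswith("%") else choice[:1]
    return [table[letter] for letter in letters]


table = library_table(FASTLIBS_TEXT)
print(select_libraries("%PRMB", table))
```

The libraries come back in the order the letters were typed, so "%BP" would list the vertebrate file before the primate file in this sketch.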
235 | |
---|
236 | FASTA 1.5 also provides additional flexibility for specifying |
---|
237 | the number of results and alignments to be displayed with the -Q |
---|
238 | (quiet) option. The -b number option allows you to specify the |
---|
239 | number of sequence scores to show when the search is finished. |
---|
240 | Thus |
---|
241 | |
---|
242 | |
---|
243 | FASTA -b 100 ... |
---|
244 | |
---|
245 | |
---|
246 | tells the program to display the top 100 sequence scores. In the |
---|
247 | past, if you displayed 100 scores (in -Q mode), you would also |
---|
248 | have stored 100 alignments. The -d option allows you to limit the |
---|
249 | number of alignments shown. FASTA -b 100 -d 20 would show 100 |
---|
250 | scores and 20 alignments. |
---|
251 | |
---|
252 | The old CUTOFF parameter is no longer used. The program |
---|
253 | stores the best 2000 (IBM-PC, MAC) or 6000 (Unix, VMS) scores and |
---|
254 | then throws out the lowest 25%, stores the next 500 (1500) better |
---|
255 | than the threshold determined when the first scores were |
---|
256 | discarded, and repeats the process as the library is scanned. As |
---|
257 | a result, the best 1500 - 2000 (4500 - 6000) scores are saved. |
---|
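The score-saving scheme described above can be sketched in a few lines. This is an illustration under stated assumptions, not the FASTA source: the function name keep_best_scores is hypothetical, the scores are plain numbers rather than per-sequence records, and the threshold is taken to be the best of the scores just discarded.

```python
import random


def keep_best_scores(scores, capacity=2000):
    """Keep roughly the best `capacity` scores from a stream, pruning in batches."""
    kept = []
    threshold = None                      # no cutoff until the buffer first fills
    discard = max(1, capacity // 4)       # the lowest 25% is thrown out each time
    for score in scores:
        if threshold is not None and score <= threshold:
            continue                      # not better than the current cutoff
        kept.append(score)
        if len(kept) >= capacity:
            kept.sort(reverse=True)
            threshold = kept[-discard]    # best of the scores about to be dropped
            del kept[-discard:]
    return kept


# Demonstration with 10000 distinct made-up scores in random order.
scores = list(range(10000))
random.Random(0).shuffle(scores)
kept = keep_best_scores(scores, capacity=2000)
print(len(kept), min(kept))
```

With a capacity of 2000 the buffer always holds between 1500 and 2000 scores once it has filled, which matches the range quoted above; with 6000 it holds between 4500 and 6000.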
258 | |
---|
271 | The old cut-off parameter was also used to set the joining |
---|
272 | threshold for the calculation of the initn score from initial |
---|
273 | regions. This joining threshold can now be set with the -g |
---|
274 | option or with the GAPCUT parameter. |
---|
275 | |
---|
276 | Finally, FASTA can provide a complete list of all of the |
---|
277 | sequences and scores calculated to a file with the -r (results) |
---|
278 | option. FASTA -r results.out ... creates a file with a list of |
---|
279 | scores for every sequence in the library. The list is not |
---|
280 | sorted, and only includes those scores calculated during the |
---|
281 | initial scan of the library (the optimized score is not |
---|
282 | calculated unless the -o option is used). |
---|
283 | |
---|
284 | 2. Installing the FASTA package |
---|
285 | |
---|
286 | 2.1. Installing the programs |
---|
287 | |
---|
288 | 2.1.1. IBM-PC/DOS version |
---|
289 | |
---|
290 | For the IBM-PC/DOS version, the FASTA source code disk |
---|
291 | contains the complete source code to all of the programs on the |
---|
292 | other disks. The programs were compiled with Borland's Turbo |
---|
293 | 'C++', using Borland's MAKE utility. The graphics programs |
---|
294 | (PLFASTA, TGREASE) use the graphics device drivers supplied with |
---|
295 | the Turbo 'C' V2.0 package. Also included are the documentation |
---|
296 | files PROGRAMS.DOC and FORMAT.DOC. You do not need any of the |
---|
297 | files on the source code disk to run the programs. The files on |
---|
298 | this disk are identical to the UNIX and VMS versions that run on |
---|
299 | larger machines. Also included is the code to compile |
---|
300 | ALIGN0.EXE. ALIGN0 is the same as ALIGN, but does not penalize |
---|
301 | for end-gaps. |
---|
302 | |
---|
303 | If you have the DOS or Macintosh version of the FASTA |
---|
304 | package, to install the programs you should: |
---|
305 | |
---|
306 | (1) Make a new directory (folder) for the FASTA programs. This |
---|
307 | need not be the same as the directory for your sequence |
---|
308 | databases. |
---|
309 | |
---|
310 | (2) Copy the files from the FASTA source disk to the new |
---|
311 | directory. |
---|
312 | |
---|
313 | (3) (DOS only) Edit your AUTOEXEC.BAT file to (a) modify your |
---|
314 | PATH command to include the FASTA directory and (b) add the |
---|
315 | line: |
---|
316 | |
---|
317 | set FASTLIBS=c:\yourfastadirectory\fastgbs |
---|
318 | |
---|
319 | On the Macintosh, you may need to edit the "environment" |
---|
320 | file and change the line that reads: |
---|
321 | |
---|
322 | FASTLIBS=fastgbs |
---|
323 | |
---|
336 | to indicate the full directory path for the fastgbs file, |
---|
337 | for example: |
---|
338 | |
---|
339 | FASTLIBS=Q105:FASTA:fastgbs |
---|
340 | |
---|
341 | |
---|
342 | (4) Finally, you will need to edit the fastgbs file. This is |
---|
343 | usually the most confusing part of the installation. An |
---|
344 | example of this file is shown below; to customize this file |
---|
345 | for your machine, you will need to change the file names |
---|
346 | from those provided in the fastgbs file to ones that reflect |
---|
347 | the directory names and file names you use on your machine. |
---|
348 | This is explained in more detail below. In addition, some |
---|
349 | entries in the fastgbs file refer to other files of file |
---|
350 | names. These files of file names (as opposed to actual |
---|
351 | database files) may also need to be edited. |
---|
352 | |
---|
353 | 2.1.2. Unix version |
---|
354 | |
---|
355 | The FASTA distribution comes with several makefiles that |
---|
356 | can be used to compile the FASTA programs. Over the years, as |
---|
357 | ATT Unix System 5 and BSD unix have converged, these files have |
---|
358 | become very similar. To begin with, I recommend using the |
---|
359 | standard Makefile. There are two values in the makefile that |
---|
360 | should be checked against the values used on your system: the HZ |
---|
361 | value, which is the frequency in ticks per second used by the |
---|
362 | times() system call and can usually be found by running: |
---|
363 | |
---|
364 | grep HZ /usr/include/sys/* |
---|
365 | |
---|
366 | and the functions available to return random numbers. If you |
---|
367 | have a rand48() function that returns a 32-bit random number, use |
---|
368 | it and use the lines: |
---|
369 | |
---|
370 | NRAND=nrand48 |
---|
371 | RANFLG= -DRAND32 |
---|
372 | |
---|
373 | If not, you will need to use the rand() function call and |
---|
374 | determine whether it returns a 16-bit or a 32-bit value. These |
---|
375 | functions are used by RDF2 and RSS. If you have problems |
---|
376 | compiling the programs, you may want to examine the makefile.unx |
---|
377 | and makefile.sun files, to look for differences. I have tried to |
---|
378 | use very standard unix functions in these programs, and they have |
---|
379 | been successfully compiled, with very small changes to the |
---|
380 | Makefile, on Suns (Sun OS 4.1), IBM RS/6000s (AIX), and MIPS |
---|
381 | machines (under the BSD environment). |
---|
382 | |
---|
383 | 2.2. Installing the libraries |
---|
384 | |
---|
385 | 2.2.1. The NBRF protein sequence library |
---|
386 | |
---|
387 | The FASTA program package does not include any protein or |
---|
388 | DNA sequence libraries. You can obtain the PIR protein sequence |
---|
402 | database from: |
---|
403 | |
---|
404 | National Biomedical Research Foundation |
---|
405 | Georgetown University Medical Center |
---|
406 | 3900 Reservoir Rd, N.W. |
---|
407 | Washington, D.C. 20007 |
---|
408 | |
---|
409 | In addition, this database is available via anonymous ftp from |
---|
410 | the host "ftp.bchs.uh.edu". It is available in two formats, VMS |
---|
411 | and CODATA format. The "VMS" format (library type 5 below) can |
---|
412 | be searched much faster, can be easily reformatted for use by the |
---|
413 | "BLAST" rapid searching program, and is compatible with the |
---|
414 | Genetics Computer Group package of programs. The CODATA format |
---|
415 | is used by the EUGENE/MBIR computing package from Baylor (library |
---|
416 | type 2). |
---|
417 | |
---|
418 | (DOS/Macintosh users) The SINDEX and EXTRACTP programs now |
---|
419 | allow you to index a file in one subdirectory, and then move the |
---|
420 | library without having to remake the index. When you type: |
---|
421 | SINDEX @prot.nam, two index files are created: PROT.IXX and |
---|
422 | PROT.INX. PROT.IXX is a binary file that cannot be edited; it |
---|
423 | contains the offsets into the library files for each of the |
---|
424 | sequence entries. PROT.INX looks exactly like the original |
---|
425 | PROT.NAM file, and can be edited. However, you cannot change the |
---|
426 | order of the library files in PROT.INX. What you can do is |
---|
427 | change the first line, which indicates the directory where the |
---|
428 | library files can be found. The index in PROT.IXX might tell |
---|
429 | EXTRACTP to find the entry LCBO at offset 123,456 in the PROT.3 |
---|
430 | file. If you changed the PROT.3 line in PROT.INX to PROT.4, LCBO |
---|
431 | would not be extracted properly. However, if you decide to move |
---|
432 | your library files from disk /usr/tmp to disk /usr/lib, you can |
---|
433 | edit PROT.INX to reflect this change. |
---|
434 | |
---|
435 | EXTRACTP has also been updated to use the new indexing |
---|
436 | scheme. To extract sequences from a multi-file library that you |
---|
437 | made with SINDEX @prot.nam, type: EXTRACTP @prot.nam, or set the |
---|
438 | environment variable AABANK=@prot.nam. Then enter the protein |
---|
439 | sequence identifiers as before. Remember, if you move the |
---|
440 | library into a different directory, you will need to copy both |
---|
441 | the *.IXX and *.INX files to use EXTRACTP. You can test EXTRACTP |
---|
442 | by trying to extract the PIR sequences LCBO, HBHU, or CCHU. If |
---|
443 | you do not get an error message, the sequences were successfully |
---|
444 | extracted. They are automatically saved to a file with the name |
---|
445 | "sequence.aa". So "LCBO" would be found in "lcbo.aa". When you |
---|
446 | need to extract a sequence from the NEW.LIB library, you will |
---|
447 | have to set AABANK=new.lib. |
---|
448 | |
---|
449 | 2.2.2. The GENBANK DNA sequence library |
---|
450 | |
---|
451 | FASTA, TFASTA, and EXTRACTN search and extract sequences |
---|
452 | from the GENBANK DNA sequence library in its compressed, floppy |
---|
453 | disk format. This library is available from: |
---|
454 | |
---|
467 | GENBANK |
---|
468 | c/o Intelligenetics |
---|
469 | 700 E. El Camino Real |
---|
470 | Mountain View, CA 94040 |
---|
471 | (415) 962-7300 |
---|
472 | |
---|
473 | (The GBANN program was used to extract DNA sequence annotations. |
---|
474 | Unfortunately, GBANN has not been updated since release 63.0 of |
---|
475 | GENBANK, when some changes in the annotation files were made. |
---|
476 | GBANN no longer works.) |
---|
477 | |
---|
478 | The GenBank DNA sequence library is also available via |
---|
479 | anonymous FTP from genbank.bio.net. |
---|
480 | |
---|
481 | 2.2.3. The EMBL CD-ROM libraries |
---|
482 | |
---|
483 | The European Molecular Biology Laboratory (EMBL) is |
---|
484 | distributing a CD-ROM that contains both the complete EMBL DNA |
---|
485 | sequence database (which should be essentially identical to the |
---|
486 | GenBank DNA sequence database) and the SWISS-PROT protein |
---|
487 | sequence database. SWISS-PROT is derived from the NBRF Protein |
---|
488 | sequence database with additions from the EMBL DNA sequence |
---|
489 | database. This CD-ROM is a "best-buy," since it provides both |
---|
490 | DNA and protein sequence libraries. It is available from: |
---|
491 | |
---|
492 | |
---|
493 | EMBL Data Library |
---|
494 | Meyerhofstr. 1 |
---|
495 | D-6900 Heidelberg |
---|
496 | Germany |
---|
497 | +49 6221 387258 |
---|
498 | Email: SOFTWARE@EMBL-Heidelberg.DE |
---|
499 | |
---|
500 | |
---|
501 | |
---|
502 | In addition, the SWISS-PROT protein sequence database is |
---|
503 | available via anonymous FTP from the hosts genbank.bio.net and |
---|
504 | ncbi.nlm.nih.gov. |
---|
505 | |
---|
506 | 2.3. Finding the libraries: FASTLIBS |
---|
507 | |
---|
508 | FASTA and TFASTA use the environment variable FASTLIBS to |
---|
509 | find the protein and DNA sequence libraries. The FASTLIBS |
---|
510 | variable contains the name of a file that has the actual |
---|
511 | filenames of the libraries. The FASTGBS file is an example of |
---|
512 | a file that can be referred to by FASTLIBS. To use the FASTGBS |
---|
513 | file, type: |
---|
514 | |
---|
515 | setenv FASTLIBS /usr/lib/fasta/fastgbs (BSD UNIX) |
---|
516 | or |
---|
517 | FASTLIBS=/usr/lib/fasta/fastgbs; export FASTLIBS (SysV UNIX) |
---|
518 | |
---|
519 | Then edit the FASTGBS file to indicate where the protein and DNA |
---|
533 | sequence libraries can be found. If you have a hard disk and |
---|
534 | your protein sequence library is kept in the file |
---|
535 | /usr/lib/seq/aabank.lib and your Genbank DNA sequence library is kept |
---|
536 | in the directory: /usr/lib/genbank, then fastgbs might contain: |
---|
537 | |
---|
538 | NBRF Protein$0P/usr/lib/seq/aabank.lib 0 |
---|
539 | SWISS PROT 10$0S/usr/lib/vmspir/swiss.seq 5 |
---|
540 | GB Primate$1P@/usr/lib/genbank/gpri.nam |
---|
541 | GB Rodent$1R@/usr/lib/genbank/grod.nam |
---|
542 | GB Mammal$1M@/usr/lib/genbank/gmammal.nam |
---|
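543 | Fields: (1) description, ending with '$'; (2) library type; |
---|
544 | (3) selection letter; (4) file name; (5) optional format number |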
---|
545 | |
---|
546 | The first line of this file says that there is a copy of the NBRF |
---|
547 | protein sequence database (which is a protein database) in the |
---|
548 | file /usr/lib/seq/aabank.lib that can be selected by typing "P" on |
---|
549 | the command line or when the database menu is presented. |
---|
550 | |
---|
551 | Note that there are 4 or 5 fields in the lines in fastgbs. |
---|
552 | The first field is the description of the library which will be |
---|
553 | displayed by FASTA; it ends with a '$'. The second field (1 |
---|
554 | character), is a 0 if the library is a protein library and 1 if |
---|
555 | it is a DNA library. The third field (1 character) is the |
---|
556 | character to be typed to select the library. |
---|
557 | |
---|
558 | The fourth field is the name of the library file. In the |
---|
559 | example above, the /usr/lib/seq/aabank.lib file contains the |
---|
560 | entire protein sequence library. However the DNA library file |
---|
561 | names are preceded by a '@', because these files (gpri.nam, |
---|
562 | grod.nam, gmammal.nam) do not contain the sequences; instead they |
---|
563 | contain the names of the files that hold the sequences. This is done |
---|
564 | because the GENBANK DNA database is broken down into a large |
---|
565 | number of smaller files. In order to search the entire primate |
---|
566 | database, you must search more than a dozen files. |
---|
567 | |
---|
568 | In addition, an optional fifth field can be used to specify |
---|
569 | the format of the library file. Alternatively, you can specify |
---|
570 | the library format in a file of file names (a file preceded by an |
---|
571 | '@'). This field must be separated from the file name by a space |
---|
572 | character (' '). In the example above, the |
---|
573 | aabank.lib file is in Pearson/FASTA format, while the swiss.seq |
---|
574 | file is in PIR/VMS format (from the EMBL CD-ROM), while the DNA |
---|
575 | sequences are in compressed GenBank format. No file type number |
---|
576 | is included for the Genbank files, because it is included in the |
---|
577 | file of filenames (see below). Currently, FASTA can read the |
---|
578 | following formats: |
---|
579 | |
---|
580 | 0 Pearson/FASTA (>SEQID - comment/sequence) |
---|
581 | 1 Uncompressed Genbank (LOCUS/DEFINITION/ORIGIN) |
---|
582 | 2 NBRF CODATA (ENTRY/SEQUENCE) |
---|
583 | 3 EMBL/SWISS-PROT (ID/DE/SQ) |
---|
584 | 4 Intelligenetics (;comment/SEQID/sequence) |
---|
585 | 5 NBRF/PIR VMS (>P1;SEQID/comment/sequence) |
---|
599 | 9 Compressed Genbank Floppy format |
---|
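The field layout of a fastgbs line (description, '$', library type, selection letter, file name, optional format number) can be sketched as a small parser. The names LibraryEntry and parse_fastgbs_line are hypothetical, and the sketch assumes that a trailing number separated by a space is always the format field; it is an illustration of the layout, not the code FASTA uses.

```python
from typing import NamedTuple, Optional


class LibraryEntry(NamedTuple):
    description: str            # text shown in the library menu
    is_dna: bool                # True when the type digit is 1
    letter: str                 # character typed to select the library
    filename: str               # library file, or file of file names
    is_file_of_names: bool      # True when the name was preceded by '@'
    lib_format: Optional[int]   # format number, or None if not given


def parse_fastgbs_line(line):
    """Split one fastgbs line into its four or five fields."""
    description, rest = line.rstrip("\n").split("$", 1)
    kind, letter, target = rest[0], rest[1], rest[2:].strip()
    lib_format = None
    name, _sep, last = target.rpartition(" ")
    if name and last.isdigit():
        # Optional fifth field: a format number after a space.
        target, lib_format = name.strip(), int(last)
    is_list = target.startswith("@")
    filename = target[1:] if is_list else target
    return LibraryEntry(description, kind == "1", letter, filename, is_list, lib_format)


print(parse_fastgbs_line("NBRF Protein$0P/usr/lib/seq/aabank.lib 0"))
print(parse_fastgbs_line("GB Primate$1P@/usr/lib/genbank/gpri.nam"))
```

Returning None for a missing format number leaves the caller free to apply the default (format 0, or the value of LIBTYPE) separately.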
600 | |
---|
601 | (In the near future, I hope to support the BLAST formats.) In |
---|
602 | particular, this version will work with the EMBL and PIR VMS |
---|
603 | formats that are distributed on the EMBL CD-ROM. The latter |
---|
604 | format (PIR VMS) is much faster to search than EMBL format. If a |
---|
605 | library format is not specified, for example, because you are |
---|
606 | just comparing two sequences, Pearson/FASTA (format 0) is used by |
---|
607 | default. To change this default, you may set the LIBTYPE |
---|
608 | environment variable to a number. For example, |
---|
609 | |
---|
610 | setenv LIBTYPE 1 |
---|
611 | |
---|
612 | would cause the program to use the GenBank LOCUS format by |
---|
613 | default for libraries (or the second sequence file), but the |
---|
614 | Pearson/FASTA format would still be used for the query sequence. |
---|
615 | |
---|
616 | You can specify a group of library files by putting a '@' |
---|
617 | symbol before a file that contains a list of file names to be |
---|
618 | searched. For example, if @gpri.nam is in the fastgbs file, the |
---|
619 | file "gpri.nam" might contain the lines: |
---|
620 | |
---|
621 | </usr/lib/genbank |
---|
622 | >glocus.idx |
---|
623 | gpri1.seq |
---|
624 | gpri2.seq |
---|
625 | gpri12.seq |
---|
626 | |
---|
627 | In this case, the line beginning with a '<' indicates the |
---|
628 | directory the files will be found in. The line beginning with a |
---|
629 | '>' indicates the index file; this is only used for the GENBANK |
---|
630 | compressed DNA database. The remaining lines name the actual |
---|
631 | sequence files. So the first sequence file to be searched would |
---|
632 | be: |
---|
633 | |
---|
634 | /usr/lib/genbank/gpri1.seq |
---|
635 | |
---|
636 | The notation "<PIRNAQ:" might be used under the VAX/VMS operating |
---|
637 | system. Under UNIX, the trailing '/' is left off, so the library |
---|
638 | directory might be written as "</usr/seqlib". In addition, when |
---|
639 | using the floppy disk version of GENBANK, annotation files are |
---|
640 | also required. These files (*.ano) should be placed in the same |
---|
641 | directory as the *.seq files. |
---|
642 | |
---|
643 | With version 1.4 of the FASTA package, the FASTA and TFASTA |
---|
644 | programs can search a library composed of different files in |
---|
645 | different sequence formats. For example, you may wish to search |
---|
646 | the Genbank files (which are in compressed floppy format) and the |
---|
647 | EMBL DNA sequence database on CD-ROM. To do this, you simply |
---|
648 | list the names and filetypes of the files to be searched in a |
---|
649 | file of filenames. For example, to search the mammalian portion |
---|
650 | of Genbank, the unannotated portion of Genbank, and the |
---|
651 | unannotated portion of the EMBL library, you could use the file: |
---|
652 | |
---|
665 | </usr/lib/DNA |
---|
666 | >glocus.idx |
---|
667 | gpri1.seq 9 |
---|
668 | gpri2.seq 9 |
---|
669 | ... |
---|
670 | gpri9.seq 9 |
---|
671 | # (this '#' causes the program to display the size of the library) |
---|
672 | grod1.seq 9 |
---|
673 | ... |
---|
674 | gmam1.seq 9 |
---|
675 | ... |
---|
676 | guna1.seq 9 |
---|
677 | ... |
---|
678 | unanno.seq 5 |
---|
679 | # |
---|
680 | |
---|
681 | |
---|
682 | You do not need to include library format numbers if you |
---|
683 | only use the Pearson/FASTA version of the PIR protein |
---|
684 | sequence library and the Genbank DNA database on floppy |
---|
685 | disks. If no library type is specified, the program |
---|
686 | assumes that type 0 is being used (unless you have set |
---|
687 | LIBTYPE). However, if the program sees an index file line |
---|
688 | (e.g. ">glocus.idx"), it assumes that the files are in |
---|
689 | Genbank floppy disk format (type 9). |
---|
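The rules for a file of file names ('<' for the directory, '>' for the index file, '#' for a library summary, and an optional format number after each name) can be sketched as follows. The names join_library_path and read_file_of_names are hypothetical, and the way the directory and file name are joined (a '/' is added unless the directory already ends in '/', '\' or ':') is an assumption made so that the UNIX, DOS, and VMS examples all produce sensible paths.

```python
def join_library_path(directory, name):
    """Join directory and file name; the separator rule is an assumption."""
    if not directory or directory[-1] in "/\\:":
        return directory + name
    return directory + "/" + name


def read_file_of_names(text, default_format=0):
    """Return (index_file, [(path, format), ...]) for a file of file names."""
    directory = ""
    index_file = None
    entries = []
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        if line.startswith("<"):
            directory = line[1:]          # where the library files are found
        elif line.startswith(">"):
            index_file = line[1:]         # index file implies GenBank floppy format
            default_format = 9
        elif line.startswith("#"):
            continue                      # summary marker, not a file name
        else:
            fields = line.split()
            has_format = len(fields) > 1 and fields[1].isdigit()
            fmt = int(fields[1]) if has_format else default_format
            entries.append((join_library_path(directory, fields[0]), fmt))
    return index_file, entries


GPRI_NAM = """\
</usr/lib/genbank
>glocus.idx
gpri1.seq
gpri2.seq
gpri12.seq
"""

index_file, entries = read_file_of_names(GPRI_NAM)
print(index_file)
for path, fmt in entries:
    print(path, fmt)
```

In this sketch an explicit number on a line always overrides the default, so a type 5 file can follow type 9 files in the same list.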
690 | |
---|
691 | |
---|
692 | Although FASTA works best when the libraries are saved on a |
---|
693 | hard disk, this is not required. If you do not have a hard disk, |
---|
694 | you could refer to the protein database files by making a file |
---|
695 | "prot.nam" with the lines: |
---|
696 | |
---|
697 | <B: |
---|
698 | prot.0 |
---|
699 | prot.1 |
---|
700 | ... |
---|
701 | prot.6 |
---|
702 | # (print library summary) |
---|
703 | new.0 |
---|
704 | ... |
---|
705 | |
---|
706 | The FASTA program would then look for the files on the B: drive, |
---|
707 | and when it did not find them, it would allow you to replace the |
---|
708 | diskette in the drive. |
---|
709 | |
---|
710 | |
---|
711 | Test the setup by running FASTA. Enter the sequence file |
---|
712 | 'MUSPLFM.AA' when the program requests it (this file is included |
---|
713 | with the programs). The program should then ask you to select a |
---|
714 | protein sequence library. Alternatively, if you run the TFASTA |
---|
715 | program and use the MUSPLFM.AA query sequence, the program should |
---|
716 | show you a selection of DNA sequence libraries. Once the fastgbs |
---|
717 | file has been set up correctly, you can set FASTLIBS=fastgbs in |
---|
731 | your AUTOEXEC.BAT file, and you will not need to remember where |
---|
732 | the libraries are kept or how they are named. |
---|
733 | |
---|
734 | The EXTRACTN program extracts DNA sequences or annotations |
---|
735 | from the GENBANK DNA sequence library in the compressed floppy |
---|
736 | disk format. To tell EXTRACTN where to find the DNA sequence |
---|
737 | library and index files, set the environment variable GBLIB. |
---|
738 | |
---|
739 | setenv GBLIB /usr/lib/genbank |
---|
740 | |
---|
741 | |
---|
742 | FASTA and TFASTA must open a large number of files when |
---|
743 | searching and reporting the results of a GENBANK floppy disk |
---|
744 | format library search. You may have problems with the large |
---|
745 | number of files under DOS on IBM-PCs (Unix and VMS users will |
---|
746 | not have these problems). If you are going to search the GENBANK |
---|
747 | floppy disk format DNA sequence library under DOS, you should add |
---|
748 | the line: |
---|
749 | |
---|
750 | FILES=16 |
---|
751 | |
---|
752 | to your CONFIG.SYS file. (Typically this is already done for |
---|
753 | programs like Windows or WordPerfect.) |
---|
754 | |
---|
796 | 3. Using the FASTA Package |
---|
797 | |
---|
798 | 3.1. Overview |
---|
799 | |
---|
800 | The FASTA sequence comparison programs all require similar |
---|
801 | information: the name of a query sequence file, a library file, |
---|
802 | and the ktup parameter. All of the programs can accept arguments |
---|
803 | on the command line, or they will prompt for the file names and |
---|
804 | ktup value. |
---|
805 | |
---|
806 | To use FASTA, simply type: |
---|
807 | |
---|
808 | FASTA |
---|
809 | and you will be prompted for: |
---|
810 | the name of the test sequence file |
---|
811 | the name of the library file |
---|
812 | and whether you want ktup = 1 or 2 (or 1 to 6 for DNA sequences). |
---|
813 | |
---|
814 | ktup of 2 is about 5 times faster than ktup = 1. |
---|
815 | For a 200 aa sequence against a 10,000,000 aa |
---|
816 | library, the program takes about 30 min with |
---|
817 | ktup = 2, 150 min with ktup = 1, on a 12 MHz 286 |
---|
818 | IBM-PC. |
---|
819 | |
---|
820 | |
---|
821 | The program can also be run by typing |
---|
822 | |
---|
823 | FASTA test.aa /lib/bigfile.lib ktup (1 or 2) |
---|
824 | |
---|
825 | |
---|
826 | Included with the package are the test files, MUSPLFM.AA, |
---|
827 | LCBO.AA, MCHU.AA and BOVPRL.SEQ. To check to make certain that |
---|
828 | everything is working, you can try: |
---|
829 | |
---|
830 | fasta musplfm.aa lcbo.aa |
---|
831 | and |
---|
832 | tfasta musplfm.aa bovprl.seq |
---|
833 | |
---|
834 | To test the local similarity programs LFASTA and PLFASTA, try: |
---|
835 | |
---|
836 | lfasta mchu.aa mchu.aa |
---|
837 | and |
---|
838 | plfasta mchu.aa mchu.aa (use this only on an IBM-PC with graphics |
---|
839 | or on a Tektronix terminal under UNIX or VMS) |
---|
840 | |
---|
841 | MCHU (calmodulin) has four duplicated calcium binding sites that |
---|
842 | are clearly detected by LFASTA. For a more complicated example, |
---|
843 | try MWRTC1.aa, myosin heavy chain. |
---|
844 | |
---|
845 | 3.2. Sequence files |
---|
846 | |
---|
847 | The FASTA programs know about three kinds of sequence files |
---|
848 | (four under VMS): (1) plain sequence files that can only be used |
---|
862 | as query sequences or for LFASTA, RDF2, and ALIGN; (2) standard |
---|
863 | library files, which are like plain sequence files except that each |
---|
864 | sequence is preceded by a comment line with a '>' in the first |
---|
865 | column; and (3) distributed sequence libraries (a broad class |
---|
866 | that includes the NBRF/PIR VMS and blocked ascii formats, Genbank |
---|
867 | flat-file format, EMBL flat-file format, and Intelligenetics |
---|
868 | format). All of the files that you create should be of type (1) |
---|
869 | or (2). Type (2) files (ones with a '>' comment line before each |
---|
870 | sequence) can be used as query or library sequence files by all of the programs. |
---|
871 | |
---|
872 | I have included several sample test files, *.AA. The first |
---|
873 | line may begin with a '>' or ';' followed by a comment. The |
---|
874 | text after ';' in other lines will be ignored. Spaces and |
---|
875 | tabs (and anything else that is not an amino-acid code) are |
---|
876 | ignored. |
---|
877 | |
---|
878 | Library files should have the form: |
---|
879 | |
---|
880 | >Sequence name and identifier |
---|
881 | A F A S Y T .... actual sequence. |
---|
882 | F S S .... second line of sequence. |
---|
883 | >Next sequence name and identifier |
---|
884 | |
---|
885 | This is the form of the PROT.* files supplied with the floppy disk |
---|
886 | version of the PIR protein sequence library. You can also build |
---|
887 | your own library by concatenating several sequence files. Just |
---|
888 | be sure that each sequence is preceded by a line beginning with a |
---|
889 | '>' with a sequence name. |
---|
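The library layout described above can be read with a few lines of Python. This is a sketch under stated assumptions, not the reader used by the FASTA programs: the name read_library is hypothetical, the set of letters accepted as residues (the 20 standard amino acids plus B, Z and X) is an assumption, and so is applying the "text after ';' is ignored" rule to library files as well as to plain sequence files.

```python
# Letters accepted as residues (assumed set for this sketch).
RESIDUES = set("ACDEFGHIKLMNPQRSTVWYBZX")


def read_library(lines):
    """Return a list of (title, sequence) pairs from '>'-style library lines."""
    records = []
    title, parts = None, []
    for raw in lines:
        line = raw.rstrip("\r\n")
        if line.startswith(">"):
            # A '>' line closes the previous sequence and starts a new one.
            if title is not None:
                records.append((title, "".join(parts)))
            title, parts = line[1:].strip(), []
        elif title is not None:
            line = line.split(";", 1)[0]          # drop text after ';'
            # Spaces, tabs, digits and other non-residue characters are skipped.
            parts.append("".join(c for c in line.upper() if c in RESIDUES))
    if title is not None:
        records.append((title, "".join(parts)))
    return records


library = [
    ">SEQ1 first test sequence",
    "A F A S Y T",
    "F S S   ; text after ';' is dropped",
    ">SEQ2 second test sequence",
    "MWRTC1GPP YT",
]
records = read_library(library)
```

Because every sequence carries its own '>' line, reading the concatenation of several such files gives the same records as reading the files one after another, which is why a library can be built by simple concatenation.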
890 | |
---|
891 | The test file should not have lines longer than 120 |
---|
892 | characters, and sequences entered with word processors should use |
---|
893 | a document mode, with normal carriage returns at the end of |
---|
894 | lines. |
---|
895 | |
---|
896 | Program Summary |
---|
897 | |
---|
898 | 3.3. Sequence search programs |
---|
899 | |
---|
900 | FASTA universal sequence comparison. Defaults to comparing |
---|
901 | protein sequences; if the sequences are > 85% A+C+G+T |
---|
902 | or the -n option is used, a DNA sequence is assumed. |
---|
903 | |
---|
904 | TFASTA Search DNA library for a protein sequence by |
---|
905 | translating the DNA sequence to protein in all six |
---|
906 | frames (three forward frames with the -3 command line |
---|
907 | option). TFASTA with ktup=2 is about as fast as a DNA |
---|
908 | FASTA with ktup=4, and is substantially more sensitive. |
---|
909 | (also reads the GENBANK library) |
---|
910 | |
---|
911 | SSEARCH Universal sequence comparison using the Smith-Waterman |
---|
912 | algorithm (T. F. Smith and M. S. Waterman (1981) J. |
---|
913 | Mol. Biol. 147:195-197). This program uses code |
---|
914 | developed by Huang and Miller (X. Huang, R. C. |
---|
928 | Hardison, W. Miller (1990) CABIOS 6:373-381) for |
---|
929 | calculating the local similarity score and code from |
---|
930 | the ALIGN program (see below) for calculating the local |
---|
931 | alignment. SSEARCH is about 100 times slower than |
---|
932 | FASTA with ktup=2 (for proteins). It should never be |
---|
933 | used to search an entire protein sequence library, but |
---|
934 | can be used to search several hundred sequences. |
---|
935 | |
---|
936 | ALIGN optimal global alignment of two sequences with no |
---|
937 | short-cuts. This program is a slightly modified |
---|
938 | version of one taken from E. Myers and W. Miller. The |
---|
939 | algorithm is described in E. Myers and W. Miller, |
---|
940 | "Optimal Alignments in Linear Space" (CABIOS (1988) |
---|
941 | 4:11-17). |
---|
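The rule FASTA uses to decide between protein and DNA (more than 85% A+C+G+T, or the -n option) can be sketched in Python as follows. The function name looks_like_dna is hypothetical, and the details of the count (which characters are included, and how much of the sequence is examined) are assumptions; only the 85% threshold and the effect of -n come from the description above.

```python
def looks_like_dna(sequence, threshold=0.85, force_dna=False):
    """Guess whether a query is DNA from its A+C+G+T content.

    force_dna stands in for the -n command line option.
    """
    if force_dna:
        return True
    letters = [c for c in sequence.upper() if c.isalpha()]
    if not letters:
        return False
    acgt = sum(c in "ACGT" for c in letters)
    return acgt / len(letters) > threshold


protein_guess = looks_like_dna("MWRTCGPPYT")            # 4 of 10 letters are A/C/G/T
dna_guess = looks_like_dna("ACGTNNACGTACGTACGTAC")      # 18 of 20 letters are A/C/G/T
ambiguous_guess = looks_like_dna("ACGTACNNNN")          # 6 of 10 letters are A/C/G/T
forced_guess = looks_like_dna("ACGTACNNNN", force_dna=True)
```

The last two calls show why -n is useful: a DNA query with many ambiguous residues falls below the threshold and would otherwise be treated as protein.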
942 | |
---|
943 | 3.4. Local similarity programs |
---|
944 | |
---|
945 | LFASTA local similarity searches showing local alignments. |
---|
946 | The algorithm used to calculate the local alignment in |
---|
947 | a band has been improved (Chao, Pearson, and Miller, |
---|
948 | submitted). |
---|
949 | |
---|
950 | PLFASTA local similarity searches with plot output (on the IBM, |
---|
951 | this program requires that the environment variable |
---|
952 | BGIDIR be set). |
---|
953 | |
---|
954 | PCLFASTA (unix only) local similarity searches with plot output |
---|
955 | using pic commands. |
---|
956 | |
---|
957 | LALIGN Calculates the N-best local alignments using a rigorous |
---|
958 | algorithm. (N=10 by default.) The algorithm was |
---|
959 | developed by Huang and Miller (X. Huang and W. Miller |
---|
960 | (1991) Adv. Appl. Math. 12:337-357), which is a |
---|
961 | linear-space version of an algorithm described by M. S. |
---|
962 | Waterman and M. Eggert (J. Mol. Biol. 197:723-728). |
---|
963 | Like SSEARCH, LALIGN is rigorous, but also very slow. |
---|
964 | |
---|
965 | PLALIGN A version of LALIGN that plots its output to a screen |
---|
966 | or to a Tektronix terminal emulator. |
---|
967 | |
---|
968 | 3.5. Statistical Significance |
---|
969 | |
---|
970 | RDF2 improved version of RDF program with all three scoring |
---|
971 | methods (now includes local, or window, shuffle |
---|
972 | routine) |
---|
973 | |
---|
974 | RSS A version of RDF2 that uses the rigorous Smith-Waterman |
---|
975 | calculation used by SSEARCH. RSS should provide a more |
---|
976 | rigorous test of the statistical significance of a |
---|
977 | similarity score. |
---|
978 | |
---|
979 | RELATE significance program described by Dayhoff (Atlas of |
---|
980 | Protein Sequence and Structure, Vol. 5, Supplement 3). |
---|
994 | Each chunk of 25 residues in one sequence is compared |
---|
995 | to every 25 residue fragment of the second sequence. |
---|
996 | Sequences which are genuinely related will have a large |
---|
997 | number of scores greater than 3 standard deviations |
---|
998 | above the mean score of all of the comparisons. |
---|
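The segment comparison idea behind RELATE can be illustrated with a small Python sketch. It is not the RELATE program: the function names are hypothetical, the score is a simple count of identical positions rather than a scoring matrix, and every overlapping fragment of both sequences is used, all of which are assumptions made to keep the example short. Only the 25-residue span and the 3 standard deviation criterion come from the description above.

```python
from statistics import mean, pstdev


def segment_scores(seq1, seq2, span=25):
    """Score every span-residue fragment of seq1 against every one of seq2."""
    scores = []
    for i in range(len(seq1) - span + 1):
        a = seq1[i:i + span]
        for j in range(len(seq2) - span + 1):
            b = seq2[j:j + span]
            # Identity score: number of positions where the fragments agree.
            scores.append(sum(x == y for x, y in zip(a, b)))
    return scores


def count_high_scores(scores, n_sd=3.0):
    """Count scores more than n_sd standard deviations above the mean."""
    mu = mean(scores)
    sd = pstdev(scores)
    if sd == 0:
        return 0
    return sum(s > mu + n_sd * sd for s in scores)


small = segment_scores("ABCDE", "ABCDE", span=3)   # tiny span for illustration
high = count_high_scores([0] * 99 + [10])          # one clear outlier
```

With real sequences, a large count from count_high_scores would correspond to the "large number of scores greater than 3 standard deviations above the mean" that marks genuinely related sequences.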
999 | |
---|
1000 | 3.6. Other analysis programs |
---|
1001 | |
---|
1002 | AACOMP calculate the amino acid composition and molecular |
---|
1003 | weight of a sequence. |
---|
1004 | |
---|
1005 | BESTSCOR calculate the best self-comparison score. |
---|
1006 | |
---|
1007 | GREASE Kyte-Doolittle hydropathicity profile |
---|
1008 | |
---|
1009 | TGREASE graphic plot of Kyte-Doolittle profile |
---|
1010 | |
---|
1011 | FROMGB convert from GenBank LOCUS format (also used by the |
---|
1012 | IBI-Pustell programs) to Pearson/FASTA format. |
---|
1013 | |
---|
1014 | GARNIER A secondary structure prediction program using the |
---|
1015 | method of Garnier, Osguthorpe, and Robson, J. Mol. |
---|
1016 | Biol., (1978) 120:97-120. |
---|
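A Kyte-Doolittle hydropathicity profile of the kind GREASE reports is a moving average of per-residue hydropathy values. The Python sketch below uses the published Kyte-Doolittle scale; the function name hydropathy_profile and the window length of 7 are assumptions, and the window and output format actually used by GREASE are not stated here.

```python
# Kyte-Doolittle hydropathy values for the 20 standard amino acids.
KD = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def hydropathy_profile(sequence, window=7):
    """Return the mean hydropathy of each window along the sequence.

    Residues outside the 20 standard codes raise KeyError in this sketch.
    """
    values = [KD[c] for c in sequence.upper()]
    return [
        sum(values[i:i + window]) / window
        for i in range(len(values) - window + 1)
    ]


profile = hydropathy_profile("MWRTCGPPYT")   # 10 residues, window 7
```

Positive averages indicate hydrophobic stretches and negative averages hydrophilic ones; a sequence shorter than the window gives an empty profile.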
1017 | |
---|
1018 | 3.7. Searching for keywords |
---|
1019 | |
---|
1020 | FINDP (DOS, Macintosh only) Searches the protein sequence |
---|
1021 | library title lines (or the aabank.nam file created by |
---|
1022 | SINDEX) for a list of key words. For example: |
---|
1023 | |
---|
1024 | FINDP aabank.nam trypsin |
---|
1025 | |
---|
1026 | will search the file of title lines and report all |
---|
1027 | lines with the word "trypsin" in them. You can search |
---|
1028 | for several words at once, by putting several words on |
---|
1029 | the line. Normally, FINDP (and FINDN) ignore upper and |
---|
1030 | lower case. If you would like to search for a specific |
---|
1031 | case, e.g. Trypsin but not chymotrypsin, use the -l |
---|
1032 | option: |
---|
1033 | |
---|
1034 | FINDP aabank.nam -l Trypsin |
---|
1035 | |
---|
1036 | |
---|
1037 | FINDN Searches the GENBANK *.ano annotation files for words. |
---|
1038 | FINDN can search a specific file, or a list of |
---|
1039 | annotation files. For example, if the file GPRIA.NAM |
---|
1040 | contains the lines: |
---|
1041 | |
---|
1042 | gpri1.ano |
---|
1043 | gpri2.ano |
---|
1044 | gpri3.ano |
---|
1045 | ... |
---|
1046 | then |
---|
1047 | |
---|
1060 | FINDN @gpria.nam trypsin |
---|
1061 | |
---|
1062 | would search all of the files. FINDN also uses "-l" to |
---|
1063 | preserve upper/lower case distinctions. |
---|
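The keyword matching described for FINDP and FINDN can be sketched in Python. The function name find_titles is hypothetical, and two points are assumptions: that a line is reported when it contains any one of the words, and that matching is by substring. The case_sensitive argument stands in for the -l option.

```python
def find_titles(lines, words, case_sensitive=False):
    """Return the lines that contain at least one of the words."""
    hits = []
    for line in lines:
        haystack = line if case_sensitive else line.lower()
        for word in words:
            needle = word if case_sensitive else word.lower()
            if needle in haystack:
                hits.append(line)
                break                      # report each line only once
    return hits


titles = [
    "Trypsin precursor - bovine",
    "chymotrypsin A - bovine",
    "Calmodulin - human",
]
any_case = find_titles(titles, ["trypsin"])
exact_case = find_titles(titles, ["Trypsin"], case_sensitive=True)
```

Here any_case contains both the trypsin and the chymotrypsin lines, while exact_case keeps only the line with a capital "Trypsin", which is the distinction the -l option is meant to provide.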
1064 | |
---|
1065 | 3.8. Options |
---|
1066 | |
---|
1067 | These programs have a number of output options, which are |
---|
1068 | invoked by the environment variables LINLEN, SHOWALL, and MARKX. |
---|
1069 | Alternatively, these values can be controlled by command line |
---|
1070 | options. The number of sequence residues per output line is now |
---|
1071 | adjustable by setting the environment variable LINLEN, or the |
---|
1072 | command line option -w. LINLEN is normally 60; to change it, set |
---|
1073 | LINLEN=80 before running the program or add -w 80 to the command |
---|
1074 | line. LINLEN can be set up to 200. SHOWALL (-a) determines |
---|
1075 | whether all, or just a portion, of the aligned sequences are |
---|
1076 | displayed. Previously, FASTP would show the entire length of |
---|
1077 | both sequences in an alignment while FASTN would only show the |
---|
1078 | portions of the two sequences that overlapped. Now the default is |
---|
1079 | to show only the overlap between the two sequences; to show |
---|
1080 | complete sequences, set SHOWALL=1, or use the -a option on the |
---|
1081 | command line. |
---|
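The way a setting such as the output line length is chosen can be sketched in Python. The function name resolve_linlen is hypothetical. The default of 60, the LINLEN variable, the -w option and the limit of 200 come from the text above, and the rule that the command line wins over the environment is the one stated in the environment variable summary (section 4); clamping larger values to 200 is an assumption.

```python
def resolve_linlen(argv, environ, default=60, maximum=200):
    """Pick the output line length from -w, then LINLEN, then the default."""
    value = default
    if "LINLEN" in environ:
        value = int(environ["LINLEN"])
    if "-w" in argv:
        position = argv.index("-w")
        if position + 1 < len(argv):
            value = int(argv[position + 1])   # command line overrides LINLEN
    return min(value, maximum)                # assumed handling of the limit


plain = resolve_linlen([], {})
from_env = resolve_linlen([], {"LINLEN": "80"})
from_cli = resolve_linlen(["-w", "100", "-a"], {"LINLEN": "80"})
```

In practice the second argument would be the process environment (os.environ in Python) and the first the program's argument list.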
1082 | |
---|
1083 | The differences between the two aligned sequences can be |
---|
1084 | highlighted in three different ways by changing the environment |
---|
1085 | variable MARKX or the -m option. Normally (MARKX=0) the program |
---|
1086 | uses ':' to denote identities and '.' to denote conservative |
---|
1087 | replacements. If MARKX=1, the program will not mark identities; |
---|
1088 | instead conservative replacements are denoted by an 'x' and non- |
---|
1089 | conservative substitutions by an 'X'. If MARKX=2, the residues in |
---|
1090 | the second sequence are only shown if they are different from the |
---|
1091 | first. Thus the three options are: |
---|
1092 | |
---|
1093 | |
---|
1094 | MARKX=0 (default)       MARKX=1                 MARKX=2 |
---|
1095 | |
---|
1096 | MWRTCGPPYT              MWRTCGPPYT              MWRTCGPPYT |
---|
1097 | ::..:: :::                xx  X                 ..KS..Y... |
---|
1098 | MWKSCGYPYT              MWKSCGYPYT |
---|
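The three marking styles can be reproduced with a short Python sketch. The function name mark_line is hypothetical, and so is the way conservative replacements are supplied: in the programs they depend on the scoring matrix, whereas here they are passed in as an explicit set of residue pairs (R/K and T/S, chosen to match the example above). Gaps are not handled.

```python
def mark_line(seq1, seq2, markx=0, conservative=frozenset()):
    """Build the line shown under seq1 for MARKX = 0, 1 or 2."""
    out = []
    for a, b in zip(seq1, seq2):
        identical = a == b
        similar = frozenset((a, b)) in conservative
        if markx == 0:
            # ':' identity, '.' conservative replacement, blank otherwise
            out.append(":" if identical else "." if similar else " ")
        elif markx == 1:
            # identities unmarked, 'x' conservative, 'X' non-conservative
            out.append(" " if identical else "x" if similar else "X")
        elif markx == 2:
            # second sequence shown only where it differs from the first
            out.append("." if identical else b)
        else:
            raise ValueError("markx must be 0, 1 or 2")
    return "".join(out)


CONSERVATIVE = {frozenset("RK"), frozenset("TS")}   # assumed for this example
first, second = "MWRTCGPPYT", "MWKSCGYPYT"
line0 = mark_line(first, second, 0, CONSERVATIVE)
line1 = mark_line(first, second, 1, CONSERVATIVE)
line2 = mark_line(first, second, 2, CONSERVATIVE)
```

For MARKX=0 and MARKX=1 the returned line is printed between the two sequences; for MARKX=2 it takes the place of the second sequence.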
1099 | |
---|
1100 | |
---|
1101 | 3.9. Command line options |
---|
1102 | |
---|
1103 | It is now possible to specify several options on the |
---|
1104 | command line, instead of using environment variables. The |
---|
1105 | command line options are preceded by a dash; the following |
---|
1106 | options are available: |
---|
1107 | |
---|
1108 | -a same as showall=1 |
---|
1109 | |
---|
1110 | -b number of sequence scores to be shown on output |
---|
1111 | |
---|
1125 | -c # threshold score for optimization (OPTCUT). Set "-c 1" |
---|
1126 | and "-o" to optimize every sequence in a database. |
---|
1127 | (This slows the program down about 5-fold). |
---|
1128 | |
---|
1129 | -d # number of alignments to be reported by default. (Used |
---|
1130 | in conjunction with -Q). |
---|
1131 | |
---|
1132 | -f identical match score from scoring matrix in the scan |
---|
1133 | for initial regions. (default for protein) (PAMFACT=1) |
---|
1134 | |
---|
1135 | -g # Threshold for joining init1 segments to build an initn |
---|
1136 | score (GAPCUT). |
---|
1137 | |
---|
1138 | -k use constant score in scan for initial regions (like |
---|
1139 | old fastp, fastn, default for DNA) (PAMFACT=0) |
---|
1140 | |
---|
1141 | -l file location of library menu file (FASTLIBS) |
---|
1142 | |
---|
1143 | -m # MARKX = # (0, 1, 2) |
---|
1144 | |
---|
1145 | -n Force the query sequence to be treated as a DNA |
---|
1146 | sequence. This is particularly useful for query |
---|
1147 | sequences that contain a large number of ambiguous |
---|
1148 | residues, e.g. transcription factor binding sites. |
---|
1149 | |
---|
1150 | -o optimize all scores greater than OPTCUT. If '-c' is |
---|
1151 | not specified, OPTCUT will be calculated from the |
---|
1152 | length of the sequence and the ktup setting, as the old |
---|
1153 | CUTOFF value used to be. |
---|
1154 | |
---|
1155 | -Q quiet - does not prompt for any input. Writes scores |
---|
1156 | and alignments to the terminal or standard output file. |
---|
1157 | |
---|
1158 | -r file save a results summary line for every sequence in the |
---|
1159 | sequence library. The summary line includes the |
---|
1160 | sequence identifier, superfamily number (if available), |
---|
1161 | position in the library, and the similarity scores |
---|
1162 | calculated. This option can be used to evaluate the |
---|
1163 | sensitivity and selectivity of different search |
---|
1164 | strategies (see W. R. Pearson (1991) Genomics 11:635- |
---|
1165 | 650.) |
---|
1166 | |
---|
1167 | -s file SMATRIX is read from file. Several SMATRIX files are |
---|
1168 | provided with the standard distribution. For protein |
---|
1169 | sequences: codaa.mat - based on minimum mutation |
---|
1170 | matrix; idnaa.mat - identity matrix; idpaa.mat - |
---|
1171 | identity matrix for mismatches, but identical matches |
---|
1172 | weighted according to the PAM250 matrix; pam250.mat - |
---|
1173 | the PAM250 matrix developed by Dayhoff et al (Atlas of |
---|
1174 | Protein Sequence and Structure, vol. 5, suppl. 3, |
---|
1175 | 1978); pam120.mat - a PAM120 matrix. The SMATRIX also |
---|
1176 | specifies the penalties for the first residue in a gap |
---|
1177 | and additional residues in a gap; FASTA, the other |
---|
1191 | alignment programs, and the SMATRIX files use -12 and |
---|
1192 | -4. Currently, to change the -12, -4 gap penalties, the |
---|
1193 | SMATRIX file must be edited. |
---|
1194 | |
---|
1195 | -v (LINEVAL) values used for line styles in plfasta |
---|
1196 | |
---|
1197 | -w # line length (width) = number (<200) |
---|
1198 | |
---|
1199 | -x specifies offsets for the beginning of the query and |
---|
1200 | library sequence. For example, if you are comparing |
---|
1201 | upstream regions for two genes, and the first sequence |
---|
1202 | contains 500 nt of upstream sequence while the second |
---|
1203 | contains 300 nt of upstream sequence, you might try: |
---|
1204 | |
---|
1205 | fasta -x "-500 -300" seq1.nt seq2.nt |
---|
1206 | |
---|
1207 | If the -x option is not used, FASTA assumes numbering |
---|
1208 | starts with 1. This option will not work properly with |
---|
1209 | the translated library sequence with tfasta. (You |
---|
1210 | should double check to be certain the negative |
---|
1211 | numbering works properly.) |
---|
1212 | |
---|
1213 | -1 sort output by init1 score (as FASTP used to do). |
---|
1214 | |
---|
1215 | -3 (TFASTA only) translate only three forward frames |
---|
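The effect of the -3 option on TFASTA can be pictured with a Python sketch of six-frame translation. This is an illustration only: the function names are hypothetical, the standard genetic code is used, and writing 'X' for codons that contain ambiguous bases is an assumption. The forward_only argument stands in for -3.

```python
BASES = "TCAG"
# Standard genetic code, codons ordered TTT, TTC, TTA, TTG, TCT, ... GGG.
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def translate(dna):
    """Translate complete codons; unknown codons become 'X'."""
    return "".join(
        CODON_TABLE.get(dna[i:i + 3], "X") for i in range(0, len(dna) - 2, 3)
    )


def reverse_complement(dna):
    return dna.translate(COMPLEMENT)[::-1]


def frames(dna, forward_only=False):
    """Return three forward frames, plus three reverse frames unless forward_only."""
    dna = dna.upper()
    strands = [dna] if forward_only else [dna, reverse_complement(dna)]
    return [translate(s[offset:]) for s in strands for offset in range(3)]


six = frames("ATGTGGTAA")
three = frames("ATGTGGTAA", forward_only=True)
```

Each library sequence is then searched as three or six protein sequences, which is why restricting the search to the forward frames reduces the work to be done.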
1216 | |
---|
1217 | |
---|
1218 | For example: |
---|
1219 | |
---|
1220 | fasta -w 80 -a seq1.aa seq2.aa |
---|
1221 | |
---|
1222 | would compare the sequence in seq1.aa to that in seq2.aa and |
---|
1223 | display the results with 80 residues on an output line, showing |
---|
1224 | all of the residues in both sequences. Be sure to enter the |
---|
1225 | options before entering the file names, or just enter the options |
---|
1226 | on the command line, and the program will prompt for the file |
---|
1227 | names. |
---|
1228 | |
---|
1229 | Not all of these options are appropriate for all of the |
---|
1230 | programs. The options above are used by FASTA and TFASTA. RELATE |
---|
1231 | uses the -s option, ALIGN uses the -w, -m, and -s options, and |
---|
1232 | the RDF2 programs use -c, -f, -k, and -s. |
---|
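The gap penalties mentioned under the -s option (-12 and -4) can be turned into a small worked example. The Python sketch below assumes that -12 is charged for the first residue in a gap and -4 for each additional residue, which is how the description of the SMATRIX file reads; the function name gap_penalty is hypothetical.

```python
def gap_penalty(length, first=-12, additional=-4):
    """Total penalty for a gap of the given length in residues."""
    if length <= 0:
        return 0
    return first + additional * (length - 1)


one_residue = gap_penalty(1)     # first residue only
three_residues = gap_penalty(3)  # first residue plus two additional residues
```

Under this reading a single long gap costs less than several short ones covering the same number of residues: one gap of three residues scores -20, while three separate one-residue gaps score -36.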
1233 | |
---|
1234 | 4. Environment variable summary |
---|
1235 | |
---|
1236 | Environment variables allow you to set search parameters |
---|
1237 | that will be used frequently when you run a program; for example, |
---|
1238 | if you prefer to use the PAM120 scoring matrix, you might "set |
---|
1239 | SMATRIX=120." Command line parameters, if used, always override |
---|
1240 | environment variable settings. The following environment |
---|
1241 | variables are used by this program: |
---|
1242 | |
---|
1256 | AABANK the file name of the default sequence library. |
---|
1257 | |
---|
1258 | FASTLIBS the location of the file which contains the list of |
---|
1259 | library files to be searched. |
---|
1260 | |
---|
1261 | GAPCUT threshold used for joining init1 regions in the second |
---|
1262 | step of FASTA. Normally set based on sequence length |
---|
1263 | and ktup. |
---|
1264 | |
---|
1265 | GBLIB the directory where the EXTRACTN files and glocus.idx |
---|
1266 | are found. |
---|
1267 | |
---|
1268 | LIBTYPE used to specify the format of the library sequence for |
---|
1269 | FASTA and TFASTA. |
---|
1270 | |
---|
1271 | LINLEN output line length - can go up to 200 |
---|
1272 | |
---|
1273 | LINEVAL used by plfasta to determine the relationship between |
---|
1274 | line style and similarity score (-v). This should be a |
---|
1275 | string of three numbers, e.g. "200 100 50" |
---|
1276 | |
---|
1277 | MARKX symbol for denoting matches, mismatches. Note that this |
---|
1278 | symbol is only used across the optimized local region; |
---|
1279 | sequences that are outside this region are not marked. |
---|
1280 | |
---|
1281 | OPTCUT Set the threshold to be used for optimization in a band |
---|
1282 | around the best initial region. Normally the OPTCUT |
---|
1283 | value is calculated from the length of the sequence and |
---|
1284 | the ktup value (for a 200 residue sequence, it is about |
---|
1285 | 28). If OPTCUT=1, every sequence in the database will |
---|
1286 | be optimized. This is the most sensitive option. |
---|
1287 | |
---|
1288 | PAMFACT This version of fasta uses a more sensitive method for |
---|
1289 | identifying initial regions. Instead of using a |
---|
1290 | constant factor (fact) for each match in a ktup, it |
---|
1291 | uses the scoring matrix (PAM) scores. While this works |
---|
1292 | well for protein sequences, it has not been as |
---|
1293 | carefully tested for DNA sequences, so by default, this |
---|
1294 | modification is used for proteins but not for DNA. The |
---|
1295 | -f 1 option forces this option on. -f 0 forces it off. |
---|
1296 | Setting the PAMFACT environment variable to 1 forces |
---|
1297 | the option on; PAMFACT=0 turns it off. |
---|
1298 | |
---|
1299 | SHOWALL on output, show the complete sequence instead of just |
---|
1300 | the overlap of the two aligned sequences. |
---|
1301 | |
---|
1302 | SMATRIX alternative scoring matrix file. |
---|
1303 | |
---|
1304 | TEKPLOT (IBM-PC only, Unix and VMS versions generate Tektronix |
---|
1305 | graphics by default) Generate Tektronix output. |
---|
1306 | Normally, PLFASTA and TGREASE plot graphs using the |
---|
1307 | Turbo C graphics library. Unfortunately, often these |
---|
1308 | plots cannot be printed out without special programs. |
---|
1322 | (I have used GRAFPLUS, from Jewell Technologies, (206) |
---|
1323 | 937-1081, $50, successfully.) However, if you set |
---|
1324 | TEKPLOT=1, Tektronix graphics commands will be used. |
---|
1325 | Tektronix commands can be used together with the |
---|
1326 | PLOTDEV program, available from Microplot Systems, 1897 |
---|
1327 | Red Fern Dr. Columbus, OH, 43229, (614) 882-4786, for |
---|
1328 | $40, which also allows you to print out graphics on the |
---|
1329 | screen. |
---|
1330 | |
---|
1331 | |
---|
1332 | As always, please inform me of bugs as soon as possible. |
---|
1333 | |
---|
1334 | William R. Pearson |
---|
1335 | Department of Biochemistry |
---|
1336 | Box 440, Jordan Hall |
---|
1337 | U. of Virginia |
---|
1338 | Charlottesville, VA |
---|
1339 | |
---|
1340 | wrp@virginia.EDU |
---|
1341 | wrp@virginia.BITNET |
---|
1342 | |
---|