| * ReadSeq.Help -- 30 Dec 92
| *
| * Reads and writes nucleic/protein sequences in various
| * formats. Data files may have multiple sequences.
| *
| * Copyright 1990 by d.g.gilbert
| * biology dept., indiana university, bloomington, in 47405
| * e-mail: gilbertd@bio.indiana.edu
| *
| * This program may be freely copied and used by anyone.
| * Developers are encouraged to incorporate parts in their
| * programs, rather than devise their own private sequence
| * format.
| *
| * This should compile and run with any ANSI C compiler.
| * Please advise me of any bugs, additions or corrections.

Readseq is particularly useful as it automatically detects many
sequence formats, and interconverts among them.

Formats which readseq currently understands:

  * IG/Stanford, used by Intelligenetics and others
  * GenBank/GB, genbank flatfile format
  * NBRF format
  * EMBL, EMBL flatfile format
  * GCG, single sequence format of GCG software
  * DNAStrider, for common Mac program
  * Fitch format, limited use
  * Pearson/Fasta, a common format used by Fasta programs and others
  * Zuker format, limited use. Input only.
  * Olsen, format printed by Olsen VMS sequence editor. Input only.
  * Phylip3.2, sequential format for Phylip programs
  * Phylip, interleaved format for Phylip programs (v3.3, v3.4)
  * Plain/Raw, sequence data only (no name, document, numbering)
  + MSF multi sequence format used by GCG software
  + PAUP's multiple sequence (NEXUS) format
  + PIR/CODATA format used by PIR
  + ASN.1 format used by NCBI
  + Pretty print with various options for nice looking output. Output only.

See the included "Formats" file for detail on file formats.


Example usage:
  readseq
      -- for interactive use

  readseq my.1st.seq my.2nd.seq -all -format=genbank -output=my.gb
      -- convert all of two input files to one genbank format output file

  readseq my.seq -all -form=pretty -nameleft=3 -numleft -numright -numtop -match
      -- output to standard output a file in a pretty format

  readseq my.seq -item=9,8,3,2 -degap -CASE -rev -f=msf -out=my.rev
      -- select 4 items from input, degap, reverse, and uppercase them

  cat *.seq | readseq -pipe -all -format=asn > bunch-of.asn
      -- pipe a bunch of data thru readseq, converting all to asn


The brief usage of readseq is as follows. The "[]" denote
optional parts of the syntax:

    readseq -help
  readSeq (27Dec92), multi-format molbio sequence reader.
  usage: readseq [-options] in.seq > out.seq
   options
    -a[ll]             select All sequences
    -c[aselower]       change to lower case
    -C[ASEUPPER]       change to UPPER CASE
    -degap[=-]         remove gap symbols
    -i[tem=2,3,4]      select Item number(s) from several
    -l[ist]            List sequences only
    -o[utput=]out.seq  redirect Output
    -p[ipe]            Pipe (command line, <stdin, >stdout)
    -r[everse]         change to Reverse-complement
    -v[erbose]         Verbose progress
    -f[ormat=]#        Format number for output, or
    -f[ormat=]Name     Format name for output:
         |  1. IG/Stanford        10. Olsen (in-only)
         |  2. GenBank/GB         11. Phylip3.2
         |  3. NBRF               12. Phylip
         |  4. EMBL               13. Plain/Raw
         |  5. GCG                14. PIR/CODATA
         |  6. DNAStrider         15. MSF
         |  7. Fitch              16. ASN.1
         |  8. Pearson/Fasta      17. PAUP
         |  9. Zuker              18. Pretty (out-only)

   Pretty format options:
    -wid[th]=#                 sequence line width
    -tab=#                     left indent
    -col[space]=#              column space within sequence line on output
    -gap[count]                count gap chars in sequence numbers
    -nameleft, -nameright[=#]  name on left/right side [=max width]
    -nametop                   name at top/bottom
    -numleft, -numright        seq index on left/right side
    -numtop, -numbot           index on top/bottom
    -match[=.]                 use match base for 2..n species
    -inter[line=#]             blank line(s) between sequence blocks


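As a rough illustration of what the -degap, -CASE and -rev options listed above do to a sequence, here is a minimal Python sketch. It is not taken from readseq's C source. The function names are made up for this example, and the complement table covers only A/C/G/T/U/N, which is an assumption; readseq's own treatment of ambiguity codes is not described in this file.

```python
# Illustrative sketch only, not readseq's implementation.
# Complement table for plain nucleotide symbols (U pairs with A).
COMPLEMENT = str.maketrans("ACGTUNacgtun", "TGCAANtgcaan")


def degap(seq, gap="-"):
    """Remove gap symbols, as -degap[=-] is described as doing."""
    return seq.replace(gap, "")


def reverse_complement(seq):
    """Complement every base, then reverse the string, as for -r[everse]."""
    return seq.translate(COMPLEMENT)[::-1]


def process(seq, gap="-"):
    """Degap, reverse-complement and uppercase a sequence (like -degap -rev -CASE)."""
    return reverse_complement(degap(seq, gap)).upper()
```

For example, process("ac-gt-a") first drops the two gap symbols and then returns the upper-case reverse-complement of "acgta".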
Notes:

In use, readseq will respond to command line arguments, or to
interactive use. Command line arguments cannot be combined
but must each follow a switch character (-). In this release,
the command line options are now words, with an equals (=)
to separate parameter(s) from the command. Unlike most Unix
programs, readseq does not allow a space between a command and
its parameter (this is to preserve compatibility with VMS).
The command line syntax of the earlier versions is still
supported.

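The word=value rule can be pictured with a small Python sketch. This is not readseq's parser: split_args is a hypothetical helper, the older -f2 style syntax is deliberately not handled, and treating every argument without a leading "-" as an input file is an assumption based on the usage line.

```python
# Hypothetical illustration of the "-name=value" rule, not readseq's own code.
def split_args(argv):
    """Separate input file names from -name=value switches."""
    files, options = [], {}
    for arg in argv:
        if arg.startswith("-"):
            # Everything after the first "=" is the parameter;
            # bare switches such as -all get an empty value.
            name, _, value = arg[1:].partition("=")
            options[name] = value
        else:
            files.append(arg)
    return files, options
```

With this sketch, "-format=genbank" yields the option format with value genbank, while "-format genbank" (with a space) leaves format without a value and makes "genbank" look like a second input file, which is why the space is not allowed.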
See the file Formats for details of the sequence formats which
are supported by readseq. The auto-detection feature of
readseq which distinguishes these formats looks for some of the
unique keywords and symbols that are found in each format. It
is not infallible at this, though it attempts to exclude unknown
formats. In general, if you feed to readseq a sequence file that
you know is one of these common formats, you are okay. If you feed
it data that might be oddball formats, or non-sequence data,
you might well get garbage results. Also, different developers
are always thinking up minor twists on these common formats
(like PAUP requiring a blank line between blocks of Phylip format,
or IG adding form feeds between sequences), which may cause hassles.

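To give a feel for keyword-based detection, here is a minimal Python sketch that guesses a format from the first non-blank line. It is not readseq's detection code, which is more involved. The guess_format name is made up, and the handful of signatures used (LOCUS for GenBank, "ID" plus three spaces for EMBL, #NEXUS for PAUP, ">" for Pearson/Fasta, ">XX;" for NBRF) are only the well-known leading markers of those formats.

```python
# Illustrative sketch only; readseq's real auto-detection is not shown in this file.
def guess_format(text):
    """Return a format name guessed from the first non-blank line, or None."""
    for line in text.splitlines():
        if not line.strip():
            continue  # skip leading blank lines
        if line.startswith("LOCUS"):
            return "GenBank/GB"
        if line.startswith("ID   "):
            return "EMBL"
        if line.startswith("#NEXUS"):
            return "PAUP"
        # NBRF entries start with ">", a two-letter type code, then ";"
        if line.startswith(">") and line[3:4] == ";":
            return "NBRF"
        if line.startswith(">"):
            return "Pearson/Fasta"
        return None  # first real line matched nothing known
    return None
```

A sketch like this is fallible in the same way the text describes: a Fasta header that happens to have ";" as its fourth character would be taken for NBRF.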
In general, output supports only minimal subsets of each format
needed for sequence data exchanges. Features, descriptions
and other format-unique information are discarded.

The pretty format requires additional options to generate a
nice output. Try the various pretty options to see what you like.
Pretty format is OUTPUT only; readseq cannot read a Pretty format
file.

Readseq is NOT optimized for LARGE files. It generally makes several
reads thru each input file (one per sequence output at present; a future
version may optimize this). It should handle input and output files
and sequences of any size, but will slow down quite a bit for very large
(multi-megabyte) files. It is NOT recommended for converting
databanks or large subsets thereof. It is primarily directed at the
small files that researchers use to maintain their personal data, which
they frequently need to interconvert for the various analysis programs
which so frequently require a special format.

Users of Olsen multi sequence editor (VMS): the Olsen format
here is produced with the print command:
    print/out=some.file
Use Genbank output from readseq to produce a format that this
editor can read, and use the command
    load/genbank some.file
Dan Davison has a VMS program that will convert to/from the
Olsen native binary data format. E-mail davison@uh.edu

Warning: Phylip format input is now supported (30Dec92), however the
auto-detection of Phylip format is very probabilistic and messy,
especially distinguishing sequential from interleaved versions. It
is not recommended that one use readseq to convert files from Phylip
format to others unless essential.


This program is available thru Internet gopher, as

    gopher ftp.bio.indiana.edu
    browse into the IUBio-Software+Data/molbio/readseq/ folder
    select the readseq.shar document

Or thru anonymous FTP in this manner:
    my_computer> ftp ftp.bio.indiana.edu (or IP address 129.79.224.25)
    username: anonymous
    password: my_username@my_computer
    ftp> cd molbio/readseq
    ftp> get readseq.shar
    ftp> bye

readseq.shar is a Unix shell archive of the readseq files.
This file can be edited by any text editor to reconstitute the
original files, for those who do not have a Unix system or an
Unshar program. Read the top of this .shar file for further
instructions.

There are also pre-compiled executables for the following computers:
Silicon Graphics Iris, Sparc (Sun Sparcstation & clones), VMS-Vax,
Macintosh. Use binary ftp to transfer these, except Macintosh. The
Mac version is just the command-line program in a window, not very
handy.

C source files:
    readseq.c ureadseq.c ureadasn.c ureadseq.h

Document files:
    Readme (this doc)
    Formats (description of sequence file formats)
    add.gdemenu (GDE program users can add this to the .GDEmenu file)
    Stdfiles -- test sequence files
    Makefile -- Unix make file
    Make.com -- VMS make file
    *.std -- files for testing validity of readseq


Recent changes (see also readseq.c for all history of changes):

4 May 92

  + added 32 bit CRC checksum as alternative to GCG 6.5bit checksum

Aug 92

  = fixed Olsen format input to handle files w/ more sequences,
    not to mess up when more than one seq has same identifier,
    and to convert number masks to symbols.
  = IG format fix to understand ^L

30 Dec 92

  * revised command-line & interactive interface. Suggested form is now

      readseq infile -format=genbank -output=outfile -item=1,3,4 ...

    but remains compatible with prior commandlines:

      readseq infile -f2 -ooutfile -i3 ...

  + added GCG MSF multi sequence file format
  + added PIR/CODATA format
  + added NCBI ASN.1 sequence file format
  + added Pretty, multi sequence pretty output (only)
  + added PAUP multi seq format
  + added degap option
  + added Gary Williams (GWW, G.Williams@CRC.AC.UK) reverse-complement option.
  + added support for reading Phylip formats (interleave & sequential)
  * string fixes, dropped need for compiler flags NOSTR, FIXTOUPPER, NEEDSTRCASECMP
  * changed 32bit checksum to default, -DSMALLCHECKSUM for GCG version