|
| * ReadSeq.Help -- 30 Dec 92
| *
| * Reads and writes nucleic/protein sequences in various
| * formats. Data files may have multiple sequences.
| *
| * Copyright 1990 by d.g.gilbert
| * biology dept., indiana university, bloomington, in 47405
| * e-mail: gilbertd@bio.indiana.edu
| *
| * This program may be freely copied and used by anyone.
| * Developers are encouraged to incorporate parts in their
| * programs, rather than devise their own private sequence
| * format.
| *
| * This should compile and run with any ANSI C compiler.
| * Please advise me of any bugs, additions or corrections.
|
Readseq is particularly useful as it automatically detects many
sequence formats, and interconverts among them.

Formats which readseq currently understands:

  * IG/Stanford, used by Intelligenetics and others
  * GenBank/GB, genbank flatfile format
  * NBRF format
  * EMBL, EMBL flatfile format
  * GCG, single sequence format of GCG software
  * DNAStrider, for common Mac program
  * Fitch format, limited use
  * Pearson/Fasta, a common format used by Fasta programs and others
  * Zuker format, limited use. Input only.
  * Olsen, format printed by Olsen VMS sequence editor. Input only.
  * Phylip3.2, sequential format for Phylip programs
  * Phylip, interleaved format for Phylip programs (v3.3, v3.4)
  * Plain/Raw, sequence data only (no name, document, numbering)
  + MSF multi sequence format used by GCG software
  + PAUP's multiple sequence (NEXUS) format
  + PIR/CODATA format used by PIR
  + ASN.1 format used by NCBI
  + Pretty print with various options for nice looking output. Output only.

See the included "Formats" file for detail on file formats.


Example usage:
  readseq
      -- for interactive use

  readseq my.1st.seq my.2nd.seq -all -format=genbank -output=my.gb
      -- convert all of two input files to one genbank format output file

  readseq my.seq -all -form=pretty -nameleft=3 -numleft -numright -numtop -match
      -- output to standard output a file in a pretty format

  readseq my.seq -item=9,8,3,2 -degap -CASE -rev -f=msf -out=my.rev
      -- select 4 items from input, degap, reverse, and uppercase them

  cat *.seq | readseq -pipe -all -format=asn > bunch-of.asn
      -- pipe a bunch of data thru readseq, converting all to asn

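As a rough illustration of what the -degap, -rev and -CASE options in the
examples above do to a sequence, here is a minimal Python sketch. It is not
readseq's own code (readseq is written in C). The function names degap,
reverse_complement and convert are hypothetical, and handling only the bases
A, C, G, T, U and N is an assumption made to keep the example short.

```python
# Hypothetical helpers illustrating -degap, -rev and -CASE; not readseq source.

# Complement table for plain DNA/RNA bases (assumption: no IUPAC ambiguity
# codes other than N are handled here).
_COMPLEMENT = str.maketrans("ACGTUNacgtun", "TGCAANtgcaan")


def degap(seq, gap="-"):
    """Remove every occurrence of the gap symbol (default '-')."""
    return seq.replace(gap, "")


def reverse_complement(seq):
    """Complement each base, then reverse the whole sequence."""
    return seq.translate(_COMPLEMENT)[::-1]


def convert(seq):
    """Degap, reverse-complement and uppercase one sequence."""
    return reverse_complement(degap(seq)).upper()


if __name__ == "__main__":
    print(convert("ac-gt-n"))  # prints NACGT
```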

The brief usage of readseq is as follows. The "[]" denote
optional parts of the syntax:

    readseq -help
  readSeq (27Dec92), multi-format molbio sequence reader.
  usage: readseq [-options] in.seq > out.seq
   options
      -a[ll]             select All sequences
      -c[aselower]       change to lower case
      -C[ASEUPPER]       change to UPPER CASE
      -degap[=-]         remove gap symbols
      -i[tem=2,3,4]      select Item number(s) from several
      -l[ist]            List sequences only
      -o[utput=]out.seq  redirect Output
      -p[ipe]            Pipe (command line, <stdin, >stdout)
      -r[everse]         change to Reverse-complement
      -v[erbose]         Verbose progress
      -f[ormat=]#        Format number for output, or
      -f[ormat=]Name     Format name for output:
          |  1. IG/Stanford        10. Olsen (in-only)
          |  2. GenBank/GB         11. Phylip3.2
          |  3. NBRF               12. Phylip
          |  4. EMBL               13. Plain/Raw
          |  5. GCG                14. PIR/CODATA
          |  6. DNAStrider         15. MSF
          |  7. Fitch              16. ASN.1
          |  8. Pearson/Fasta      17. PAUP
          |  9. Zuker              18. Pretty (out-only)

   Pretty format options:
      -wid[th]=#                 sequence line width
      -tab=#                     left indent
      -col[space]=#              column space within sequence line on output
      -gap[count]                count gap chars in sequence numbers
      -nameleft, -nameright[=#]  name on left/right side [=max width]
      -nametop                   name at top/bottom
      -numleft, -numright        seq index on left/right side
      -numtop, -numbot           index on top/bottom
      -match[=.]                 use match base for 2..n species
      -inter[line=#]             blank line(s) between sequence blocks


Notes:

Readseq accepts either command line arguments or interactive
input. Command line arguments cannot be combined; each must
follow its own switch character (-). In this release, the
command line options are now words, with an equals sign (=)
to separate parameter(s) from the command. You cannot put a
space between a command and its parameter, as is usual for
Unix programs (this is to preserve compatibility with VMS).
The command line syntax of the earlier versions is still
supported.

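The two command line styles described above can be told apart mechanically.
The Python sketch below is only an illustration, not readseq's actual parser.
The function name split_option is hypothetical, and the rule that a compact
value for -f or -i must start with a digit is an assumption, chosen so that
word options such as -inter are not misread as "-i" plus a value.

```python
# Hypothetical sketch of splitting readseq-style switches; not readseq source.

def split_option(token):
    """Return (name, value) for one command line switch.

    '-format=genbank' -> ('format', 'genbank')   word form, '=' separator
    '-f2'             -> ('f', '2')              older compact form
    '-all'            -> ('all', None)           plain flag
    """
    if len(token) < 2 or not token.startswith("-"):
        raise ValueError("not a switch: %r" % token)
    body = token[1:]

    # Word form: everything before '=' is the name, the rest is the value.
    name, sep, value = body.partition("=")
    if sep:
        return name, value

    # Older compact form (assumed rule): -f# and -i# take digits,
    # -o takes the output file name directly after the letter.
    letter, rest = body[0], body[1:]
    if rest and letter in "fi" and rest[0].isdigit():
        return letter, rest
    if rest and letter == "o":
        return letter, rest

    # Anything else is treated as a flag without a value.
    return body, None


if __name__ == "__main__":
    for tok in ["-format=genbank", "-f2", "-ooutfile", "-i3", "-all"]:
        print(tok, split_option(tok))
```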
See the file Formats for details of the sequence formats which
are supported by readseq. The auto-detection feature of
readseq, which distinguishes these formats, looks for some of the
unique keywords and symbols that are found in each format. It
is not infallible at this, though it attempts to exclude unknown
formats. In general, if you feed readseq a sequence file that
you know is one of these common formats, you are okay. If you feed
it data that might be in an oddball format, or non-sequence data,
you might well get garbage results. Also, different developers
are always thinking up minor twists on these common formats
(like PAUP requiring a blank line between blocks of Phylip format,
or IG adding form feeds between sequences), which may cause hassles.

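To make the idea of keyword-based detection concrete, the following Python
sketch guesses a format from the first non-blank line of a file. It is a
simplified illustration, not the detection logic in readseq. The function
name guess_format is hypothetical, and the choice of markers (LOCUS for
GenBank, an "ID" line for EMBL, #NEXUS for PAUP, a two-letter code plus ";"
after ">" for NBRF, a bare ">" for Pearson/Fasta) is an assumption covering
only a few of the supported formats.

```python
# Hypothetical keyword sniffing for a few formats; not readseq's detection code.

def guess_format(text):
    """Guess a sequence format from the first non-blank line of `text`."""
    for line in text.splitlines():
        if not line.strip():
            continue  # skip leading blank lines
        if line.upper().startswith("#NEXUS"):
            return "PAUP"
        if line.startswith("LOCUS"):
            return "GenBank/GB"
        if line.startswith("ID   "):
            return "EMBL"
        if line.startswith(">"):
            # NBRF headers look like '>P1;name' or '>DL;name'; a plain '>'
            # followed by a name is treated as Pearson/Fasta.
            if len(line) > 3 and line[3] == ";":
                return "NBRF"
            return "Pearson/Fasta"
        break  # first non-blank line matched nothing we know
    return "unknown"


if __name__ == "__main__":
    print(guess_format(">seq1 test\nACGT\n"))  # prints Pearson/Fasta
```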
In general, output supports only minimal subsets of each format
needed for sequence data exchanges. Features, descriptions
and other format-unique information are discarded.

The pretty format requires additional options to generate a
nice output. Try the various pretty options to see what you like.
Pretty format is OUTPUT only; readseq cannot read a Pretty format
file.

Readseq is NOT optimized for LARGE files. It generally makes several
reads thru each input file (one per sequence output at present; a future
version may optimize this). It should handle input and output files
and sequences of any size, but will slow down quite a bit for very large
(multi-megabyte) files. It is NOT recommended for converting
databanks or large subsets thereof. It is primarily directed at the
small files that researchers use to maintain their personal data, which
they frequently need to interconvert for the various analysis programs
that so often require a special format.

Users of Olsen multi sequence editor (VMS): The Olsen format
here is produced with the print command:
    print/out=some.file
Use Genbank output from readseq to produce a format that this
editor can read, and use the command
    load/genbank some.file
Dan Davison has a VMS program that will convert to/from the
Olsen native binary data format. E-mail davison@uh.edu

Warning: Phylip format input is now supported (30Dec92), however the
auto-detection of Phylip format is very probabilistic and messy,
especially distinguishing sequential from interleaved versions. It
is not recommended that one use readseq to convert files from Phylip
format to others unless essential.


This program is available thru Internet gopher, as

  gopher ftp.bio.indiana.edu
  browse into the IUBio-Software+Data/molbio/readseq/ folder
  select the readseq.shar document

Or thru anonymous FTP in this manner:
  my_computer> ftp ftp.bio.indiana.edu (or IP address 129.79.224.25)
  username: anonymous
  password: my_username@my_computer
  ftp> cd molbio/readseq
  ftp> get readseq.shar
  ftp> bye

readseq.shar is a Unix shell archive of the readseq files.
This file can be edited by any text editor to reconstitute the
original files, for those who do not have a Unix system or an
Unshar program. Read the top of this .shar file for further
instructions.

There are also pre-compiled executables for the following computers:
Silicon Graphics Iris, Sparc (Sun Sparcstation & clones), VMS-Vax,
Macintosh. Use binary ftp to transfer these, except Macintosh. The
Mac version is just the command-line program in a window, not very
handy.

C source files:
  readseq.c ureadseq.c ureadasn.c ureadseq.h

Document files:
  Readme (this doc)
  Formats (description of sequence file formats)
  add.gdemenu (GDE program users can add this to the .GDEmenu file)
  Stdfiles -- test sequence files
  Makefile -- Unix make file
  Make.com -- VMS make file
  *.std -- files for testing validity of readseq


Recent changes (see also readseq.c for all history of changes):

4 May 92

  + added 32 bit CRC checksum as alternative to GCG 6.5bit checksum

Aug 92

  = fixed Olsen format input to handle files w/ more sequences,
    not to mess up when more than one seq has same identifier,
    and to convert number masks to symbols.
  = IG format fix to understand ^L

30 Dec 92

  * revised command-line & interactive interface. Suggested form is now

      readseq infile -format=genbank -output=outfile -item=1,3,4 ...

    but remains compatible with prior commandlines:

      readseq infile -f2 -ooutfile -i3 ...

  + added GCG MSF multi sequence file format
  + added PIR/CODATA format
  + added NCBI ASN.1 sequence file format
  + added Pretty, multi sequence pretty output (only)
  + added PAUP multi seq format
  + added degap option
  + added Gary Williams (GWW, G.Williams@CRC.AC.UK) reverse-complement option.
  + added support for reading Phylip formats (interleave & sequential)
  * string fixes, dropped need for compiler flags NOSTR, FIXTOUPPER, NEEDSTRCASECMP
  * changed 32bit checksum to default, -DSMALLCHECKSUM for GCG version

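For readers unfamiliar with the two kinds of checksum named in the change
list, here is a small Python sketch. The GCG-style function follows the
commonly described position-weighted sum modulo 10000. The 32 bit function
uses zlib.crc32 from the Python standard library, which is an assumption and
may not match the CRC variant compiled into readseq. Both function names are
hypothetical.

```python
# Illustrative checksums; values are not guaranteed to match readseq output.
import zlib


def gcg_checksum(seq):
    """Position-weighted sum of uppercased characters, modulo 10000.

    Weights cycle 1..57, as commonly described for GCG files.
    """
    total = 0
    for i, ch in enumerate(seq):
        total += (i % 57 + 1) * ord(ch.upper())
    return total % 10000


def crc32_checksum(seq):
    """Standard CRC-32 of the uppercased sequence (assumed variant)."""
    return zlib.crc32(seq.upper().encode("ascii")) & 0xFFFFFFFF


if __name__ == "__main__":
    print(gcg_checksum("ACGT"))             # prints 748
    print("%08X" % crc32_checksum("ACGT"))  # 32 bit value as 8 hex digits
```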