Commandline Reference

SINA adds sequences to an existing multiple sequence alignment (MSA). It can also execute a homology search based on the computed alignment and generate per-sequence classifications from the search results.

Synopsis

.. program:: sina

sina [options] :option:`-i` <unaligned> :option:`-r` <reference> :option:`-o` <aligned>


sina [ :option:`-h` | :option:`--help` | :option:`--help-all` | :option:`--version` | :option:`--has-cli-vers` ]


Description

:program:`sina` aligns the sequences in the file <unaligned> to match the alignment in <reference> and places the aligned sequences in the file <aligned>.


Please refer to the publication for details on the algorithm(s).

General Options

.. program:: sina
.. option:: -h, --help
   Displays brief command line description.
.. option:: -H, --help-all
   Displays full command line description. If in doubt, refer to this
   as it will always be in sync with your installation of SINA.
.. option:: -V, --version
   Shows the version of SINA.
.. option:: -i filename, --in=filename (-)
   Specifies the file containing the input sequences. Allowable file
   formats are ARB (see :ref:`ARB Options`) and FASTA (see :ref:`FASTA
   Options`, optionally gzipped). The format will be selected based on
   the file name unless overridden with :option:`--intype`.
   Special file names: "``-``" (dash) will read sequences from
   standard input.  "``:``" (colon) will connect to a running ARB
   database (SINA must be a child process of ARB).
.. option:: -o filename [filename [...]], --out=filename (-)
   Specifies the output file(s) to which the aligned sequences and/or
   meta data will be written. Allowable file formats are ARB (see
   :ref:`ARB Options`) and FASTA (see :ref:`FASTA Options`, optionally
   gzipped) and CSV (see :ref:`CSV Options`). The format will be
   selected based on the file name unless overridden with
   :option:`--outtype`. Specifying multiple names or specifying the
   option multiple times will output all data to each file.
   See :option:`-i` for special filenames ``-`` and ``:``. Use ``-o
   /dev/null`` to disable output.
   Not specifying this option at all will write sequences to
   ``stdout`` if the input is FASTA format. When reading from an ARB
   database, output is written back to the source database (specified
   with :option:`-i`).
.. option:: -r filename, --db=filename
   Specifies the file containing the reference alignment. This file
   must be in ARB format.
   To convert a reference alignment from FASTA to ARB format, run::

       sina -i reference.fasta --prealigned -o reference.arb
.. option:: -t [all], --turn [=all]
   Enables turn check stage. Sequences not oriented in accordance with
   the reference database will be reverse complemented as needed.
   If ``all`` is specified, sequences will also be tested for being
   only reversed or only complemented (this should only be necessary
   if your data was mishandled).
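
The orientations examined by the turn check can be illustrated with a short Python sketch. This is not SINA's implementation: the helper names (``reverse_complement``, ``orientations``) are hypothetical, ambiguity codes are ignored, and the way SINA scores each orientation against the reference database is not shown.

```python
# Complement table for unambiguous bases; output is written as RNA (U, not T).
COMPLEMENT = str.maketrans("ACGUTacgut", "UGCAAugcaa")


def reverse(seq):
    """Return the sequence read back to front."""
    return seq[::-1]


def complement(seq):
    """Return the base-wise complement of the sequence."""
    return seq.translate(COMPLEMENT)


def reverse_complement(seq):
    """Return the reverse complement of the sequence."""
    return complement(reverse(seq))


def orientations(seq, check_all=False):
    """Candidate orientations a turn check could compare (illustrative only).

    With check_all=True, the "only reversed" and "only complemented"
    variants are included as well, mirroring ``--turn all``.
    """
    candidates = {"as-is": seq, "reverse-complement": reverse_complement(seq)}
    if check_all:
        candidates["reverse"] = reverse(seq)
        candidates["complement"] = complement(seq)
    return candidates
```

Calling ``orientations(seq, check_all=True)`` yields four candidates, of which the one agreeing best with the reference would be kept.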
.. option:: -S, --search
   Enables the search stage. See :ref:`Search & Classify` below for
   more information.
.. option:: -P, --prealigned
   Disables the alignment stage. This is useful if you have already
   aligned sequences you wish to pass directly into the search stage,
   or if you want to use SINA to convert between any of its supported
   file formats.
.. option:: -v, --verbose
   Increase logging verbosity. Can be specified multiple times.
.. option:: -q, --quiet
   Decrease logging verbosity. Can be specified multiple times.
.. option:: --log-file=filename
   Specify log file. The output written to the log file will always be
   verbose and is not affected by using :option:`-v` or :option:`-q`.
.. option:: --meta-fmt=[none|header|comment|csv]
   Configures how meta data (such as alignment score or sequence classification results) are to be exported.
   **none**
     No output other than in the log is generated.
   **header**
     Appends meta data as ``[key=value]`` pairs to the FASTA header line.
   **comment**
     Appends meta data as ``; key: value`` lines between the
     FASTA header and the sequence data.
   **csv**
     Writes meta data into a CSV side car file.
   .. deprecated:: 1.7.0
      Use ``-o output.csv`` instead.
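
The difference between the **header** and **comment** styles can be sketched in Python as follows. The function name ``fasta_record`` is hypothetical, and the exact whitespace SINA places between the ``[key=value]`` pairs is an assumption; only the general shape of the output is taken from the descriptions above.

```python
def fasta_record(name, sequence, meta, meta_fmt="header"):
    """Format one FASTA record with meta data (illustrative sketch).

    meta is a dict of field name -> value. The single space before each
    ``[key=value]`` pair is an assumption, not SINA's documented layout.
    """
    header = ">" + name
    comment_lines = []
    if meta_fmt == "header":
        # Append the pairs to the header line itself.
        header += "".join(f" [{key}={value}]" for key, value in meta.items())
    elif meta_fmt == "comment":
        # Emit one "; key: value" line between header and sequence.
        comment_lines = [f"; {key}: {value}" for key, value in meta.items()]
    elif meta_fmt != "none":
        raise ValueError(f"unsupported meta format: {meta_fmt}")
    return "\n".join([header, *comment_lines, sequence]) + "\n"
```

With ``meta_fmt="none"`` the record contains only the header and the sequence.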
.. option::  -f fields, --fields=fields
   Configures the set of fields written to the output file. See
   :ref:`Field Reference` for a description of the available
   fields.
.. option:: -p, --threads (automatic)
   Override automatic detection of the number of threads used by
   SINA. This is usually only necessary if you need to constrain SINA
   to a lower number of threads. According to the Intel engineers
   whose *Threaded Building Blocks* library does the thread number
   detection for SINA, the only reason to use this parameter should be
   scalability testing.
.. option:: --num-pts (1)
   Set the maximum number of ARB PT server instances used by SINA. See
   also :option:`--fs-engine` below. If you are using the
   **pt-server** engine, this setting will be the limiting factor in
   your throughput. Be aware, however, that each PT server will occupy
   additional system memory. Choosing too high a value may cause SINA
   to fail with out-of-memory errors.
.. option:: --add-relatives=n (0)
   Add up to *n* reference sequences for each query sequence to the
   output file. If `Search & Classify`_ is enabled via :option:`--search`, the
   reference sequences are selected from the search result. Otherwise,
   they are selected from the query's alignment reference set.
   If the source set is smaller than *n*, no further sequences are
   added to the output. Sequences already included are skipped, but
   count towards the *n* of the query sequence.

Reference Selection Options

These options configure how the set of reference sequences used during alignment is selected from the configured reference database.

.. program:: sina
.. option:: --fs-engine=[internal|pt-server]
   Selects the search engine used to find closely related reference
   sequences for the alignment stage.
   **pt-server**
     Uses the ARB PT server to execute the k-mer search. The ARB
     PT server is a truncated suffix trie implemented as part of the
     ARB package.
   **internal**
     Uses an internal k-mer search implementation.
.. option:: --fs-kmer-len=k (10)
   Set the size of *k* for the reference search. For SSU rRNA
   sequences, the default of 10 is a good value. For different
   sequence types, different values may perform better. For 5S, for
   example, 6 has been shown to be more effective.
.. option:: --fs-min=n (15)
   Set the minimum number of reference sequences used for each query.
.. option:: --fs-max=n (40)
   Set the maximum number of reference sequences used for each query.
.. option:: --fs-msc=n (0.7)
   Set the minimum similarity reference sequences are required to have
   with the query sequence. This affects the range between
   :option:`--fs-min` and :option:`--fs-max`.
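
One plausible reading of how :option:`--fs-min`, :option:`--fs-max` and :option:`--fs-msc` interact is sketched below in Python. This is an assumption made for illustration, not SINA's actual selection code: the best ``fs_min`` candidates are always taken, and further candidates are added up to ``fs_max`` only while their similarity stays at or above ``fs_msc``. The function name ``select_references`` is hypothetical.

```python
def select_references(candidates, fs_min=15, fs_max=40, fs_msc=0.7):
    """Pick reference sequences for one query (illustrative sketch).

    candidates: iterable of (name, similarity) pairs from the k-mer search.
    Returns the selected pairs, best first.
    """
    ranked = sorted(candidates, key=lambda pair: pair[1], reverse=True)
    selected = []
    for name, similarity in ranked:
        if len(selected) >= fs_max:
            break  # never exceed the maximum
        if len(selected) >= fs_min and similarity < fs_msc:
            break  # minimum reached; remaining candidates are too distant
        selected.append((name, similarity))
    return selected
```

Under this reading, lowering ``fs_msc`` moves the typical reference set size towards ``fs_max``, while raising it moves the size towards ``fs_min``.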
.. option:: --fs-req=n (1)
   Set the minimum number of reference sequences that must be found in
   order to attempt alignment. If fewer sequences than indicated here
   are found, the respective query sequence will be discarded.
.. option:: --fs-req-full=n (1)
   Set the minimum number of *full length* (see
   :option:`--fs-full-len`) reference sequences that must be included
   in the selected reference set. The search will proceed regardless
   of other settings until this setting has been satisfied. If it
   cannot be satisfied by any sequence in the reference database, the
   query sequence will be discarded.
   This setting exists to ensure that the entire length of the query
   sequence will be covered in the presence of partial sequences
   contained within your reference database.
   **Note:**
     If you are working with sequences other than 16S, you need to
     adjust this value or the value of :option:`--fs-full-len`
     accordingly. In particular when working with short reference
     sequences, this setting may prevent any acceptable reference
     sequences from being found, leading to no sequences being aligned.
.. option:: --fs-full-len=n (1400)
   Set the minimum length a sequence is required to have to be
   considered *full length*.
.. option:: --fs-req-gaps=n (10)
   Set the minimum number of gaps a reference sequence is required to
   contain to be considered. This setting ensures that unaligned
   sequences contained within the reference database are not used as
   reference (this may happen when SINA is used from within ARB).
.. option::  --fs-min-len=n (150)
   Set the minimum length reference sequences are required to
   have. Sequences shorter than this will not be included in the
   selection.
   **Note:**
     If you are working with particularly short reference sequences,
     you will need to lower this setting to allow any reference
     sequences to be found.

Search & Classify Options

When enabled via :option:`--search`, SINA will execute a homology search. Unlike most homology search tools, SINA uses the inferred multiple sequence alignment to determine the similarity of each query with the reference sequences, rather than computing pairwise optimal alignments. The similarity values will therefore be generally lower than the results of a pairwise alignment based homology search.


Based on the search results, SINA can be instructed to compute a lowest common ancestor (LCA) based classification of the input sequences. For this, your reference database must include a field containing taxonomic classifications for each reference sequence. The field contents must be in the format ``Domain;Phylum;...``. SINA will compute query classifications as the deepest classification shared by at least the fraction :option:`--lca-quorum` of the search result.

.. program:: sina
.. option:: --search-db=filename (=db)
   Specify an alternate reference database to use for search and
   classify. This can be useful if you have a specially curated
   alignment reference, but wish to search a larger set of sequences
   for classification purposes.
.. option:: --search-engine=[internal|pt-server]
   Override the value of :option:`--fs-engine` for use within the
   search module.
.. option:: --search-min-sim=id (0.7)
   The minimum fractional identity each result sequence must have with
   the query.
.. option:: --search-max-result=n (10)
   The maximum number of search results to return for each query sequence.
.. option:: --lca-fields=names
   Enables the classification stage. The parameter *names* must be a
   colon or comma separated list of field names in the search database
   containing the classification reference data. Use
   :option:`--arb-list-fields` to show a list of the fields available
   in a given ARB database. When using a SILVA ARB database as
   reference, the fields `tax_slv`, `tax_embl` (renamed to
   `tax_embl_ebi_ena` in newer releases) and `tax_ltp` contain the
   reference classifications according to the SILVA, EMBL-EBI/ENA and
   LTP taxonomies, respectively. When using a SILVA SSU ARB database,
   the fields `tax_gg` (only older databases) and `tax_rdp` are
   available additionally, containing the reference classifications
   according to Greengenes and RDP II, respectively. Newer SILVA
   databases also contain the field `tax_gtdb` containing
   classifications from the Genome Taxonomy Database.
.. option:: --lca-quorum=fraction (0.7)
   Sets the fraction of the search result that must share the same
   classification. Using the default parameters
   :option:`--search-max-result`\=10 and :option:`--lca-quorum`\=0.7, this
   means that the deepest classification shared by 7 out of the top 10
   search results is chosen for the query sequence.
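
The quorum rule can be illustrated with a minimal Python sketch. It is not SINA's implementation: the function name ``lca_classify`` is hypothetical, and tie handling and trailing separators are simplified. The taxonomy strings are assumed to follow the ``Domain;Phylum;...`` format described above.

```python
from collections import Counter


def lca_classify(taxonomies, quorum=0.7):
    """Return the deepest taxonomy path shared by at least `quorum` of the hits.

    taxonomies: list of "Domain;Phylum;..." strings, one per search result.
    """
    paths = [[rank for rank in tax.split(";") if rank] for tax in taxonomies]
    result = []
    depth = 0
    while True:
        # Count how the hits agreeing with the current path continue one level deeper.
        counts = Counter(
            tuple(path[:depth + 1])
            for path in paths
            if len(path) > depth and path[:depth] == result
        )
        if not counts:
            break  # no hit has a deeper rank
        best, votes = counts.most_common(1)[0]
        if votes / len(paths) < quorum:
            break  # not enough of the search result agrees at this depth
        result = list(best)
        depth += 1
    return ";".join(result)
```

With ten search results and the default quorum of 0.7, a rank is accepted as long as at least seven of the ten results carry it.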

Advanced Options

.. option:: --show-conf
   Print the values of all configuration options (including defaults)
   at startup.
.. option::  --intype=[auto|arb|fasta] (auto)
   Set the file format for :option:`--in`. If set to *auto* (default),
   the type is selected based on the file extension.
.. option::  --outtype=[auto|arb|fasta|csv|none] (auto)
   Set the file format for :option:`--out`. If set to **auto**
   (default), the type is selected based on the file extension. The
   option can be specified multiple times. It applies to all files
   listed in the next :option:`--out` option. If no output files are
   specified and this option is set to **none**, no output is produced
   at all.
.. option::  --preserve-order
   Preserve the order of the input sequences in the output.
.. option::  --max-in-flight=n (2 * number of CPU cores)
   Set the maximum number of sequences "in flight", i.e. processed in
   parallel.
.. option::  --has-cli-vers=cliversion
   Verify that this version of SINA supports the CLI version
   **cliversion**. Exits immediately with exit code 0 if true and 1 if
   false.
.. option::  --no-align
   Backwards compatibility alias for :option:`--prealigned`.

Logging Options

.. option:: --show-diff
   Show differences between the inferred alignment and the original
   alignment. Requires either aligned sequences to be passed into sina
   via :option:`--in` or that a database with matching names is
   specified using :option:`--orig-db`.
   .. todo:: Fix text below
   This flag enables visualization of alignment differences. This
   feature allows you to quickly assess where your alignment differs
   from the one SINA computed. By also showing you the alignment of
   the reference sequences used for aligning the sequence, you can get
   an idea of why SINA came to its conclusions.  Many cases of
   "sub-optimal" alignment can be attributed to inconsistent
   alignment of the reference sequences.  To fix such problems, you
   could either correct the alignment of the reference sequences or
   add your corrected sequence to the reference alignment.
   Alignment difference visualization requires that the input
   sequences be already aligned in a way compatible with the used
   reference alignment. For positions at which the original alignment
   and the alignment computed by SINA differ, output as shown below
   will be printed to the log::

     Dumping pos 1121 through 1141:
     ---------  4 14 16-17 21 24
     G-C-AGUC-  40 <---(%% ORIG %%)
     GCA--GUC-  41 <---(## NEW ##)
     GCA-AGUC-  0-3 5-13 15 18-20 22-23 25-27 29-39
     GCAA-GUC-  28

   In this case, the bases ``C`` and ``A`` were placed
   in different columns than in the original alignment. The original
   alignment is marked with ``<---(%% ORIG %%)``. The new
   alignment is marked with ``<---(## NEW ##)``. The
   numbers to the right of the alignment excerpt indicate the indices
   of the sequences in the alignment reference (field
.. option:: --show-dist
   Show distance to original alignment.
   .. todo:: describe values generated
.. option::  --orig-db=filename
   Specify a database containing the original alignments for use with
   :option:`--show-dist` and :option:`--show-diff`. The sequence names
   in the input file and in the reference database must match exactly.
.. option::  --colors
   Use ANSI codes to show alignments dumped by :option:`--show-diff`
   in color.

ARB Options

These options configure behavior supported only by the ARB backend for input and output sequences.

.. option:: --arb-list-fields=FILE
   Show the per-sequence meta-data fields available in the specified
   ARB database and exit.
.. option:: --markcopied
   Set *Mark* on sequences copied from the reference.
   .. todo: This feature is broken after reimplementing copy
.. option:: --markaligned
   Set *Mark* on sequences updated or added by alignment stage.
.. option:: --prot-level=n (4)
   Set the *protection level* to use when writing sequences to the
   output database.
.. option:: --select-file=filename
   Instead of iterating over the entire input database, process only
   the sequences listed in *filename*. The names must match the ARB
   *name* field and be separated by newlines. Use "**-**" to read from
   standard input.
.. option:: --select-step=n (1)
   Process only every *n*\th sequence. Can be combined with
   :option:`--select-file` and :option:`--select-skip`.
.. option:: --select-skip=n (0)
   Do not process the first *n* sequences. Can be combined with
   :option:`--select-file` and :option:`--select-step`.

FASTA Options

These options configure behavior supported only by the FASTA backend for input and output sequences.

.. option:: --line-length=n (0)
   Output sequences using at most *n* characters per line. Set to 0 to
   place the entire output sequence on one line.
.. option:: --min-idty=id
   Exclude sequences sharing less than *id* fractional identity with
   any of the alignment reference sequences from the output. Implies
   :option:`--calc-idty`.
.. option:: --fasta-write-dna
   Write output sequences as DNA, rather than the default
   RNA. (I.e. use T and t rather than U and u).
.. option:: --fasta-write-dots
   Use dots (".") rather than dashes ("-") for gaps that indicate
   missing data rather than an actual insertion/deletion. Most often,
   those are only the terminal gaps at the ends of the alignment.
   .. todo:: Check whether internal dots are handled correctly.
.. option:: --fasta-idx=n
   Only process sequences *starting* within the *n*\th block of bytes
   within the input FASTA file. The first block has index 0.
.. option:: --fasta-block=size
   Sets the size in bytes for the blocks used by :option:`--fasta-idx`.
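
The block selection performed by :option:`--fasta-idx` and :option:`--fasta-block` can be pictured with the following Python sketch. It is an illustration under simplifying assumptions (the whole file is held in memory, and a record starts at every ``>`` at the beginning of a line); the function name ``records_in_block`` is hypothetical and this is not how SINA reads the file.

```python
def records_in_block(data, fasta_idx, fasta_block):
    """Return the FASTA records whose header starts inside the selected block.

    data: the raw bytes of the FASTA file.
    fasta_idx: zero-based block index.
    fasta_block: block size in bytes.
    """
    start = fasta_idx * fasta_block
    end = start + fasta_block
    # Byte offsets of all record headers: '>' at file start or after a newline.
    offsets = [
        i for i in range(len(data))
        if data[i:i + 1] == b">" and (i == 0 or data[i - 1:i] == b"\n")
    ]
    records = []
    for n, offset in enumerate(offsets):
        if start <= offset < end:
            # A record runs to the next header, even if that lies in a later block.
            stop = offsets[n + 1] if n + 1 < len(offsets) else len(data)
            records.append(data[offset:stop].decode())
    return records
```

Because a record belongs to the block in which it *starts*, running one job per block index processes every sequence exactly once, which makes these options useful for splitting a large input file across independent jobs.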

CSV Options

These options configure the CSV output backend.

.. option:: --csv-crlf
   Enables RFC4180 compliant CSV output using CRLF as line
   separator. This was the default behavior for the old, deprecated
   CSV writer enabled using :option:`--meta-fmt`.
.. option:: --csv-sep=sep (,)
   Specifies the string used to separate fields. May be multiple
   characters. By default, a comma is used when writing to STDOUT or
   files ending in ``.csv`` or ``.csv.gz`` and a ``TAB`` character is
   used when writing to files ending in ``.tsv`` or ``.tsv.gz``.

.. option:: --csv-id=id (name)
   Specifies the field name used for the always present ID
   column. For sequences read from FASTA, this is the first word of
   the header. For sequences read from ARB, this is the ``name``
   field. By default, the column name is ``name``.

Alignment Options

.. option:: --realign
   Forces computing the alignment of query sequences even if a
   reference sequence containing the exact sequence was found. Without
   this flag, SINA will copy the alignment from the reference
   sequence.
.. option:: --overhang=[attach|remove|edge] (attach)
   Configures how unaligned bases at the edge of the alignment (overhanging bases) should be handled.
   **attach**
     Overhang bases will be placed next to the last aligned base consecutively.
   **remove**
     Overhang bases will be deleted.
     .. todo:: This feature appears to be broken.
   **edge**
     Overhang bases will be placed next to the outer edge of the alignment.
.. option:: --lowercase=[none|original|unaligned]
   Configures which bases should be written using lower case characters.
   **none**
     All bases will use upper case characters
   **original**
     All bases will be written using the case they had in the input data.
   **unaligned**
     Aligned bases will be written in upper case; unaligned bases will
     be written in lower case. This serves to mark sections of the
     query sequences that could not be aligned because they were
     insertions (internal or edge) with respect to any of the
     reference sequences.
.. option:: --insertion=[shift|forbid|remove]
   Configures how the alignment width is preserved.
   **shift**
     The alignment is executed without constraining insertion
     sizes. Insertions for which insufficient columns exist between
     the adjoining aligned bases are force fitted into the alignment
     using NAST. That is, the minimum number of aligned bases to the
     left and right of the insertion are moved to accommodate the
     insertion.
     This mode will add warnings to the log for each sequence in which
     aligned bases had to be moved.
   **forbid**
     The alignment is executed using a scoring scheme disallowing
     insertions for which insufficient columns exist in the alignment.
     This mode causes fewer "misalignments" than the **shift** mode as
     it computes the best alignment under the constraint that no
     columns may be added to the alignment. However, it will not show
     if the computed alignment suffered from a lack of empty columns.
   **remove**
     The alignment is executed without constraining insertion
     sizes. Insertions larger than the number of columns between the
     adjoining aligned bases are truncated.
     While this mode yields the most accurate alignment for sequences
     with large insertions, it should be used with care as it modifies
     the original sequence.
.. option:: --fs-no-graph
   Instructs SINA to use a profile vector instead of a DAG to perform
   the alignment. That is, the base frequencies for all selected
   reference sequences are collected into a vector and the query is
   aligned to this vector weighting the alignment scores according to
   the respective frequencies.
   This feature was added in response to the requests of a reviewer of
   the original SINA publication and only intended to demonstrate that
   the DAG/POA approach is superior to the profile vector approach. Do
   not use this other than for testing.
.. option:: --fs-weight=weight (1)
   Adjust the weight factor for the frequency at which a node was
   observed in the reference alignment. Use 0 to disable weighting.
   This feature prefers the more common placement for bases with
   inconsistent alignment in the reference database.
.. option:: --match-score=n (2)
   Configures the score given for a match (should be positive).
.. option:: --mismatch-score=n (-1)
   Configures the score given for a mismatch (should be negative).
.. option:: --pen-gap=n (5)
   Configures the penalty subtracted from the score for opening a gap
   (should be positive).
.. option:: --pen-gapext=n
   Configures the penalty subtracted from the score for extending a
   gap (should be positive).
.. option:: --debug-graph
   Writes the DAG computed from the reference sequences for each query
   sequence to disk in dot format.
.. option:: --use-subst-matrix
   Weights the match and mismatch scores according to the overall base
   frequencies observed in the database.
   This feature is experimental and does not currently improve the
   results.
.. option:: --write-used-rels
   Writes the names of the alignment reference sequences into the
   field `used_rels`. This option allows using the ARB *mark used
   rels* feature to highlight the reference sequences used to align a
   given query sequence.
.. option:: --calc-idty
   Computes the highest similarity the aligned query sequence has with
   any of the sequences in the alignment reference set. The value is
   written to the field *align_ident_slv*.

Advanced Reference Selection Options

.. option:: --ptdb=filename
   Alias of :option:`--db` for backwards compatibility.
.. option:: --ptport=port_or_socket (:/tmp/sina_pt_<pid>)
   Configures the port or socket on which the ARB PT server for the
   reference alignment is expected or started. To use a TCP port,
   specify *<hostname>*:*<port>*. If *<hostname>* is not `localhost`,
   the PT server must be launched externally. To use a Unix socket,
   specify `:<filename>`.
   When :option:`--num-pts` is greater than 1, the port names of the
   additional PT servers are generated by appending the respective
   number. Using port numbers greater than or equal to 10000 will
   therefore not work.
   By default, the file `/tmp/sina_pt_<pid>` is used, where `<pid>` is
   replaced by the process ID of the SINA instance.
.. option:: --fs-kmer-no-fast
   Use all k-mers occurring in the query sequence in the search. By
   default, only k-mers starting with an A are used for extra
   performance.
.. option:: --fs-kmer-mm=n (0)
   Allow k-mer matches in the reference database to contain *n*
   mismatches. This feature is only supported by the **pt-server**
   search engine and requires substantial additional compute time (in
   particular for *n* > 1).
.. option:: --fs-kmer-norel
   Use absolute (number of shared k-mers) match scores in the k-mer
   search rather than relative (number of shared k-mers divided by
   length of reference sequence) match scores.
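
The effect of :option:`--fs-kmer-no-fast` and :option:`--fs-kmer-norel` on the k-mer score can be sketched in Python. The sketch assumes that k-mers are counted as a set (each distinct k-mer once); whether SINA counts repeated k-mers separately is not stated here. The function names ``kmers`` and ``kmer_score`` are hypothetical.

```python
def kmers(seq, k=10, only_a=False):
    """Return the set of k-mers in seq, optionally only those starting with A."""
    found = {seq[i:i + k] for i in range(len(seq) - k + 1)}
    if only_a:
        return {kmer for kmer in found if kmer.startswith("A")}
    return found


def kmer_score(query, reference, k=10, no_fast=False, norel=False):
    """Score a reference against a query by shared k-mers (illustrative sketch).

    no_fast: use all query k-mers instead of only those starting with A.
    norel: return the absolute count instead of dividing by reference length.
    """
    query_kmers = kmers(query, k, only_a=not no_fast)
    shared = len(query_kmers & kmers(reference, k))
    return shared if norel else shared / len(reference)
```

Relative scores keep long reference sequences from being preferred merely because they contain more k-mers.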
.. option:: --fs-msc-max=id (2)
   Overriding all other options, reference sequences having a
   similarity with the query higher than this value are excluded from
   the alignment reference.
   This option artificially increases the difficulty of the alignment
   by increasing the distance of a query to any reference found in the
   database. Its purpose is to generate a sufficiently
   large *N* of test cases for statistical analysis of SINA's accuracy
   for sequences distant to the reference alignment.
.. option:: --fs-leave-query-out
   Excludes sequences from the alignment reference sharing the same
   name as the respective query sequence. (For testing and evaluation).
.. option:: --gene-start=n (0)
   Sets the beginning of the gene within the reference alignment. See
   :option:`--fs-cover-gene`.
.. option:: --gene-end=n (0)
   Sets the end of the gene within the reference alignment. See
   :option:`--fs-cover-gene`.
.. option:: --fs-cover-gene=n (0)
   Similar to :option:`--fs-req-full`, this option requires a total of
   *n* sequences to cover each of the beginning and the end of the gene
   within the alignment. This option is more precise than
   :option:`--fs-req-full`, but requires that the column numbers for
   the range in which the full gene is expected be specified via
   :option:`--gene-start` and :option:`--gene-end`.
.. option:: --filter=name
   Chooses an *ARB posvar filter* to use for weighting alignment
   positions by their variability.
.. option:: --auto-filter-field=name
   Configures a database field from which the value of
   :option:`--filter` is determined by majority vote among the selected
   reference sequences. Since the filters are usually computed at
   domain level, this approach is usually sufficient to select an
   appropriate filter. For SILVA databases, the field `tax_slv` contains
   appropriate data.
.. option:: --auto-filter-threshold arg
   Sets the minimum quorum required for automatic filter
   selection. See :option:`--lca-quorum` for information on how the
   value is interpreted.
.. option:: --fs-oldmatch
   Use the pre-1.6.0 implementation for composing the alignment
   family. Requires :option:`--fs-engine` = ``pt-server``.

Advanced Search & Classify Options

.. option:: --search-port=port_or_socket (:/tmp/sina_pt2_<pid>)
   See :option:`--ptport`. This option sets the port for the search
   database. It is only used if :option:`--search-db` is specified and
   its value differs from the one given by :option:`--db`.
.. option:: --search-all
   Calculate the similarity of the query sequences with **all**
   reference sequences. Normally, SINA will only calculate the
   similarity for the sequences returned by a k-mer based similarity
   search. See also :option:`--search-kmer-candidates`.
.. option:: --search-no-fast
   See :option:`--fs-kmer-no-fast`. This option configures the same
   behavior for the search stage.
.. option:: --search-kmer-candidates=n (1000)
   Configures the number of candidate reference sequences retrieved
   from the k-mer based search. For each candidate, the MSA based
   similarity is calculated and the search result is based on these
   numbers. A value for *n* one or two orders of magnitude larger than
   :option:`--search-max-result` is usually quite sufficient.
.. option:: --search-kmer-len=n (10)
   See :option:`--fs-kmer-len`. Sets *k* for the kmer based candidate
   search.
.. option:: --search-kmer-mm arg
   See :option:`--fs-kmer-mm`. Sets the number of allowed mismatches
   within each kmer. Only available with the **pt-server** search
   engine.
.. option:: --search-kmer-norel
   See :option:`--fs-kmer-norel`. Configures the candidate search to
   use absolute rather than length-relative scores for ordering the
   results.
.. option:: --search-ignore-super
   Omit reference sequences of which the query is an exact sub-string
   from the result. Useful for testing and evaluation of the
   classification feature.
.. option:: --search-copy-fields=fields
   Specifies a (colon or comma separated) list of meta-data fields to
   be copied from each search result sequence into the output
   sequence. In the output sequence, the field names will each be
   prefixed with `copy_<acc>_` where `<acc>` is the value of the *acc*
   field in the reference.
   Use :option:`--arb-list-fields` for listing the fields available in
   a given ARB database.
.. option:: --search-iupac=[pessimistic|optimistic|exact] (optimistic)
   Configures how ambiguous bases are matched when computing the
   scores for the search results.
   **pessimistic**
     Ambiguous bases do not match anything because they *could* always
     be a mismatch.
   **optimistic**
     Ambiguous bases are considered matches if a match with the other
     (potentially also ambiguous base) is possible. That is, `N` will
     match everything, including `Y`.
   **exact**
     Matches on character level. `N` matches exactly `N`.
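
The three matching modes can be expressed compactly in Python using the standard IUPAC nucleotide ambiguity codes. This is a sketch of the semantics described above, not SINA's code; the function name ``bases_match`` is hypothetical and ``T`` is treated as ``U`` for simplicity.

```python
# Standard IUPAC nucleotide codes mapped to the bases they may represent.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "U": "U",
    "R": "AG", "Y": "CU", "S": "CG", "W": "AU", "K": "GU", "M": "AC",
    "B": "CGU", "D": "AGU", "H": "ACU", "V": "ACG", "N": "ACGU",
}


def bases_match(a, b, mode="optimistic"):
    """Decide whether two (possibly ambiguous) bases count as a match."""
    a = a.upper().replace("T", "U")
    b = b.upper().replace("T", "U")
    if mode == "exact":
        return a == b  # plain character comparison
    set_a, set_b = set(IUPAC[a]), set(IUPAC[b])
    if mode == "optimistic":
        return bool(set_a & set_b)  # a match is possible
    if mode == "pessimistic":
        # Only unambiguous, identical bases are certain matches.
        return len(set_a) == 1 and set_a == set_b
    raise ValueError(f"unknown mode: {mode}")
```

For example, ``N`` against ``Y`` is a match in **optimistic** mode only, while ``N`` against ``N`` is a match in every mode except **pessimistic**.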
.. option:: --search-correction=[none|jc] (none)
   Apply distance correction to search result scores.
   **none**
     Leave score unmodified.
   **jc**
     Apply Jukes-Cantor correction.
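
For reference, the textbook Jukes-Cantor correction maps an observed fraction of differing sites *p* to an estimated distance *d* = -3/4 ln(1 - 4/3 *p*). The Python sketch below implements only this formula; how SINA converts between similarity scores and distances when applying the correction is not described here and is left out. The function name is hypothetical.

```python
import math


def jukes_cantor_distance(p):
    """Return the Jukes-Cantor corrected distance for an observed difference p.

    p is the fraction of differing sites (1 - fractional identity). The
    correction is undefined for p >= 0.75, where the logarithm diverges.
    """
    if not 0 <= p < 0.75:
        raise ValueError("p must be in the range [0, 0.75)")
    return -0.75 * math.log(1 - 4 * p / 3)
```

The corrected distance is always at least as large as the observed difference, since it accounts for multiple substitutions at the same site.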
.. option:: --search-cover=[abs|query|target|min|max|avg|overlap|all|nogap] (query)
   Compute sequence similarity as the number of matches divided by
   **abs**
     the number 1: yields the absolute number of matching bases
   **query**
     the length of the query sequence. Yields the fraction of the
     query covered by the reference sequence.
   **target**
     the length of the target sequence. Yields the fraction of the
     result sequence covered by the query sequence.
   **min**
     the length of the shorter of the sequences compared.
   **max**
     the length of the longer of the sequences compared.
   **avg**
     the average length of the two sequences compared.
   **nogap**
     the number of columns in which both sequences have bases. Yields
     the equivalent of *matches / (matches+mismatches)*.
   **all**
     the number of columns in which either sequence has a
     base. Similar to **nogap**, but does not ignore indel events.
   **overlap**
     the length of the overlapping portion of the two sequences.
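
The denominators listed above can be compared with a small Python sketch operating on two rows of the same alignment. It is an illustration only: the function name ``similarity`` is hypothetical, matching is done on plain characters (see :option:`--search-iupac` for the real options), and **overlap** is omitted because its exact definition is not spelled out here.

```python
GAPS = "-."


def similarity(query, target, cover="query"):
    """Compute matches divided by the denominator selected via `cover`.

    query, target: aligned sequences of equal length (gaps as '-' or '.').
    """
    if len(query) != len(target):
        raise ValueError("sequences must be aligned to the same width")
    columns = list(zip(query.upper(), target.upper()))
    query_len = sum(q not in GAPS for q, _ in columns)
    target_len = sum(t not in GAPS for _, t in columns)
    both = sum(q not in GAPS and t not in GAPS for q, t in columns)
    either = sum(q not in GAPS or t not in GAPS for q, t in columns)
    matches = sum(q == t and q not in GAPS for q, t in columns)
    denominators = {
        "abs": 1,
        "query": query_len,
        "target": target_len,
        "min": min(query_len, target_len),
        "max": max(query_len, target_len),
        "avg": (query_len + target_len) / 2,
        "nogap": both,
        "all": either,
    }
    return matches / denominators[cover]
```

For a short query aligned against a full-length reference, **query** typically gives a much higher value than **target** or **max**, which is why the choice of denominator matters when comparing values against a :option:`--search-min-sim` threshold.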
.. option:: --search-filter-lowercase
   Ignore lowercase bases when scoring result sequences. This can be
   used in conjunction with :option:`--lowercase`\=unaligned to ignore
   unaligned bases during the search and classification stage.
.. only:: man
   .. include:: fields.rst