source: trunk/GDE/SINA/builddir/doc/source/fields.rst

Last change on this file was 19170, checked in by westram, 2 years ago
  • sina source
    • unpack + remove tarball
    • no longer ignore sina builddir.

Field Reference
===============

.. only:: man

   Field Reference
   ---------------

.. program:: field

SINA generates a number of named metadata values for each processed sequence. By default, all values are written to ARB output files, whereas for CSV output each metadata field must be requested explicitly using :option:`-f/--fields <sina -f>`. The fields generated by SINA and the fields typically present in ARB databases are described below.

Basic Fields
------------

The fields below are standard ARB metadata fields that describe each sequence.

.. option:: name

   The ``Id`` of the sequence. In FASTA input and output, this
   value is mapped to the first word of the header line.

.. option:: full_name

   The textual description of the sequence. In FASTA input and output,
   this value is mapped to all but the first word of the header line.
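
   As a rough illustration of this mapping (a sketch, not SINA's own
   code), the snippet below splits a FASTA header line into
   :option:`name` and :option:`full_name`. The helper
   ``split_fasta_header`` is hypothetical, and treating any run of
   whitespace as the word separator is an assumption.

   .. code-block:: python

      def split_fasta_header(header: str) -> tuple[str, str]:
          """Split a FASTA header line into (name, full_name)."""
          # Drop the leading '>' marker and surrounding whitespace.
          text = header.lstrip(">").strip()
          # First word -> name; everything after it -> full_name.
          parts = text.split(None, 1)
          name = parts[0] if parts else ""
          full_name = parts[1] if len(parts) > 1 else ""
          return name, full_name

      # Made-up header, for illustration only.
      print(split_fasta_header(">seq1 uncultured bacterium clone X"))
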

.. option:: acc

   The sequence accession number. This field is relevant for ARB
   input/output because, together with :option:`start`, it defines the
   unique identity of the sequence when regenerating sequence names. For
   sequences read from FASTA and written to ARB, SINA will generate a
   pseudo-accession consisting of ``ARB_`` followed by an 8-character
   hexadecimal CRC32 checksum. This matches the behavior of ARB during
   FASTA import.
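
   The shape of such a pseudo-accession can be sketched in Python as
   follows. This description does not state which bytes the checksum
   covers or whether the hexadecimal digits are upper or lower case;
   hashing the ASCII-encoded sequence string and using upper case are
   assumptions of this sketch, and ``pseudo_accession`` is a
   hypothetical helper. Its results may therefore differ from what SINA
   or ARB actually produce.

   .. code-block:: python

      import zlib

      def pseudo_accession(sequence: str) -> str:
          # Assumption: the checksum covers the ASCII-encoded sequence data.
          crc = zlib.crc32(sequence.encode("ascii")) & 0xFFFFFFFF
          # "ARB_" followed by the checksum as 8 zero-padded hex digits.
          return f"ARB_{crc:08X}"

      print(pseudo_accession("ACGUACGUACGU"))
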

.. option:: version

   The version part of the accession number reference.

.. option:: start

   The start position of the gene sequence within the sequence
   referenced by the accession number.

.. option:: stop

   The stop position of the gene sequence within the sequence
   referenced by the accession number.

SINA specific fields
--------------------

.. option:: align_quality_slv

   The alignment "quality": the alignment score, normalized to remove
   weighting effects and scaled to an integer between 0 and 100. If the
   alignment for the sequence was copied from an identical match to a
   reference sequence, the value is set to 100.

.. option:: align_cutoff_head_slv

   The number of unaligned basepairs at the beginning of the sequence.

.. option:: align_cutoff_tail_slv

   The number of unaligned basepairs at the end of the sequence.

.. option:: aligned_slv

   The time and date at which the sequence was aligned.

.. option:: align_startpos_slv

   The position of the first base of the sequence within the reference
   alignment.

.. option:: align_stoppos_slv

   The position of the last base of the sequence within the reference
   alignment.

.. option:: align_ident_slv

   The highest fractional identity of the aligned sequence with any of
   the used reference sequences. The value is computed using
   optimistic IUPAC comparison (N matches anything) over the
   overlapping region of each pair of sequences.
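
   One way to picture an optimistic IUPAC comparison is sketched below
   in Python. It is not taken from SINA's source: the helper names, the
   gap characters ``-`` and ``.``, the treatment of ``T`` as ``U``, and
   the way gapped columns enter the denominator are all assumptions
   made for illustration.

   .. code-block:: python

      # IUPAC nucleotide codes mapped to the bases they may stand for.
      IUPAC = {
          "A": "A", "C": "C", "G": "G", "U": "U", "T": "U",
          "R": "AG", "Y": "CU", "S": "CG", "W": "AU", "K": "GU", "M": "AC",
          "B": "CGU", "D": "AGU", "H": "ACU", "V": "ACG", "N": "ACGU",
      }
      GAPS = "-."

      def optimistic_match(a: str, b: str) -> bool:
          # Two codes match if they share at least one possible base.
          return bool(set(IUPAC[a.upper()]) & set(IUPAC[b.upper()]))

      def fractional_identity(query: str, ref: str) -> float:
          # Both strings are rows of the same alignment (equal length).
          if len(query) != len(ref):
              raise ValueError("rows must come from the same alignment")

          def bounds(row):
              cols = [i for i, c in enumerate(row) if c not in GAPS]
              return (cols[0], cols[-1]) if cols else (0, -1)

          (q_lo, q_hi), (r_lo, r_hi) = bounds(query), bounds(ref)
          # Overlapping region: columns covered by both sequences.
          lo, hi = max(q_lo, r_lo), min(q_hi, r_hi)
          matches = compared = 0
          for q, r in zip(query[lo:hi + 1], ref[lo:hi + 1]):
              if q in GAPS and r in GAPS:
                  continue  # column is empty in both rows
              compared += 1
              if q not in GAPS and r not in GAPS and optimistic_match(q, r):
                  matches += 1
          return matches / compared if compared else 0.0

      def highest_identity(query: str, references: list[str]) -> float:
          # The field reports the best value over all used references.
          return max(fractional_identity(query, r) for r in references)
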

.. option:: nuc_gene_slv

   The number of basepairs aligned within the gene. (Currently not
   computed.)

.. option:: align_bp_score_slv

   A score indicating the average binding strength of basepairs
   aligned into helix regions. Each pair of bases aligned to opposing
   sides of a helix specified in the reference database is assigned a
   score (``AG`` = 0.5, ``AU`` = 1.1, ``CG`` = 1.5, ``GG`` = 0.4,
   ``GU`` = 0.9). The sum of these scores is divided by the number of
   helix positions with bases on either side and multiplied by 100.
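
   The arithmetic can be sketched in Python as shown below. The pair
   scores come from the description above; everything else is an
   assumption of this sketch: the hypothetical ``bp_score`` helper
   takes one ``(left, right)`` tuple per helix position, pairs not
   listed above score 0, and "bases on either side" is read as "at
   least one side carries a base".

   .. code-block:: python

      # Pair scores as listed in the field description (keys sorted).
      PAIR_SCORES = {"AG": 0.5, "AU": 1.1, "CG": 1.5, "GG": 0.4, "GU": 0.9}
      GAPS = "-."

      def bp_score(helix_pairs) -> float:
          total = 0.0
          counted = 0
          for left, right in helix_pairs:
              if left in GAPS and right in GAPS:
                  continue  # no base on either side: position not counted
              counted += 1
              if left in GAPS or right in GAPS:
                  continue  # only one side present: contributes no score
              # Order-independent lookup, e.g. ("U", "A") -> "AU".
              key = "".join(sorted((left.upper(), right.upper())))
              total += PAIR_SCORES.get(key, 0.0)
          return 100.0 * total / counted if counted else 0.0

      print(bp_score([("G", "C"), ("A", "U"), ("G", "U")]))
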

.. option:: align_family_slv

   The reference sequences used to align the query sequence. Each
   reference is listed as ``ACC.START:SCORE`` where ``ACC`` and
   ``START`` are the contents of the reference sequence's respective
   :option:`acc` and :option:`start` fields and ``SCORE`` is the score
   assigned by the sequence search engine (ARB PT server or internal
   kmer search).
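
   A single ``ACC.START:SCORE`` entry could be taken apart as in the
   following Python sketch. The helper ``parse_family_entry`` is
   hypothetical, the example value is made up, and interpreting
   ``START`` as an integer and ``SCORE`` as a floating-point number is
   an assumption.

   .. code-block:: python

      def parse_family_entry(entry: str) -> tuple[str, int, float]:
          # Split from the right so that dots inside ACC are preserved.
          ref, score = entry.rsplit(":", 1)
          acc, start = ref.rsplit(".", 1)
          return acc, int(start), float(score)

      # Made-up entry, for illustration only.
      print(parse_family_entry("AB000001.1:0.95"))
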

.. option:: align_log_slv

   A log of events that occurred during the alignment of a query sequence.

.. option:: align_filter_slv

   The weighting filter selected for the query sequence, if any.

.. option:: nearest_slv

   The results from the sequence search. Available only when the
   search stage is enabled (:option:`-S/--search <sina --search>`).
   Each matched sequence is given as ``ACC.VERSION.START.STOP~SCORE``
   where ``ACC``, ``VERSION``, ``START``, and ``STOP`` are the
   contents of the matched sequence's respective :option:`acc`,
   :option:`version`, :option:`start` and :option:`stop` fields and
   ``SCORE`` is the score calculated according to the search settings.
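
   For downstream processing, a value of this field might be parsed
   along the lines of the Python sketch below. ``NearestHit`` and
   ``parse_nearest`` are hypothetical names; that multiple matches are
   separated by whitespace and that ``START`` and ``STOP`` are integers
   are assumptions, since only the format of a single entry is
   described here.

   .. code-block:: python

      from typing import NamedTuple

      class NearestHit(NamedTuple):
          acc: str
          version: str
          start: int
          stop: int
          score: float

      def parse_nearest(value: str) -> list[NearestHit]:
          hits = []
          # Assumption: entries are separated by whitespace.
          for entry in value.split():
              ref, score = entry.rsplit("~", 1)
              # Split from the right so that dots inside ACC are preserved.
              acc, version, start, stop = ref.rsplit(".", 3)
              hits.append(
                  NearestHit(acc, version, int(start), int(stop), float(score))
              )
          return hits

      # Made-up value, for illustration only.
      for hit in parse_nearest("AB000001.1.1.1500~0.98 AB000002.2.10.1400~0.95"):
          print(hit.acc, hit.score)
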

SILVA taxonomy fields
---------------------

The SILVA SSU and LSU databases in ARB format contain taxonomic metadata suitable for generating taxonomic assignments using the :option:`--lca-fields <sina --lca-fields>` option. Each of the following fields contains the taxonomic assignment as a "materialized path" (Domain; Phylum; ...). The ``_name`` field contains the sequence name assigned by the respective taxonomy.
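
A materialized path of this kind can be broken into its ranks with a few lines of Python. This is only a sketch: ``split_taxonomy_path`` is a hypothetical helper, the example path is illustrative, and tolerating a trailing semicolon is an assumption about how such values may be written.

.. code-block:: python

   def split_taxonomy_path(path: str) -> list[str]:
       # Ranks are separated by ';'. Empty parts, such as the one
       # after a trailing ';', are dropped.
       return [rank.strip() for rank in path.split(";") if rank.strip()]

   # Illustrative path, not taken from a real database entry.
   print(split_taxonomy_path("Bacteria; Proteobacteria; Gammaproteobacteria;"))
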

.. option:: tax_slv

   The `SILVA <https://www.arb-silva.de>`_ taxonomy.

.. option:: tax_embl

   The `EMBL-EBI/ENA <https://www.ebi.ac.uk/ena>`_ taxonomy. Note that
   the name was changed to :option:`tax_embl_ebi_ena` in newer releases
   of SILVA.

.. option:: tax_embl_ebi_ena

   The `EMBL-EBI/ENA <https://www.ebi.ac.uk/ena>`_ taxonomy.

.. option:: tax_ltp

   The `Living Tree Project (LTP)
   <https://www.arb-silva.de/projects/living-tree>`_ taxonomy.

.. option:: tax_gg

   The Greengenes taxonomy. (Discontinued)

.. option:: tax_rdp

   The `RDP II <https://rdp.cme.msu.edu/>`_ taxonomy.

.. option:: tax_gtdb

   The `Genome Taxonomy Database <https://gtdb.ecogenomic.org/>`_ taxonomy.

Additional standard ARB fields
------------------------------

.. option:: ali_16s/data

   The actual sequence alignment. This field type always has the form
   ``ali_<name>/data``, with ``<name>`` indicating the alignment (ARB
   databases may contain multiple alignments).

.. option:: ARB_color

   A number indicating the color in which the sequence should be
   highlighted in ARB.

.. option:: used_rels

   The :option:`names <name>` of the reference sequences used during
   alignment, separated by spaces. This field is generated only if
   :option:`--write-used-rels <sina --write-used-rels>` is given. It
   allows selecting the reference sequences via a special menu item
   from within ARB.

.. option:: nuc

   The length of the sequence.