source: trunk/HELP_SOURCE/source/sina_main.hlp

Last change on this file was 19532, checked in by westram, 6 weeks ago
#       main topics:
UP      arb.hlp
UP      glossary.hlp

#       sub topics:
#SUB     subtopic.hlp

# format described in ../help.readme

TITLE           Graph Aligner (SINA)

OCCURRENCE      ARB Editor -> Edit -> Prototypical Graph Aligner

DESCRIPTION     SINA is an alternative to the integrated aligners.
                It has been developed for the SILVA project.

                Unlike the integrated aligners, SINA
                 - uses aligned sequences from the reference database as references to align the selected sequences.
                 - employs full dynamic programming to create the alignment.
                 - considers all selected relatives at once, instead of falling back to
                   less similar sequences only if the current sequence is missing bases (e.g.
                   because it is a partial sequence).

SECTION         SINA documentation online

                LINK{https://sina.readthedocs.io/_/downloads/en/stable/pdf/}

                Note: parts of this document were taken from the above SINA documentation.

SECTION         SINA version

                Please note that this documentation applies to the modified SINA version 1.7.2
                which is (optionally) delivered with arb.

                     Modifications applied:
                       * fix build with arb7 + gcc 7.5
                       * define CLI interface "ARB7.1" (allows arb to detect SINA 1.7.2)
                       * fixed some error messages (clarity; do NOT show WHOLE alignment; explicitly show bad character)
                       * new option '--dont-expect-start' (allows working with databases not containing 'start' fields)

                ARB still supports SINA version 1.3.
                If there is no option to select the reference database, you are either using version 1.3
                or you are trying to use a (newer) SINA version which does not yet support the "ARB7.1" CLI interface.

                In case you are using version 1.3, you may want to use the old, corresponding version of this help page,
                which may be accessed via LINK{http://bugs.arb-home.de/browser/trunk/HELP_SOURCE/source/sina_main.hlp?rev=18781}

SECTION         OPTIONS

                Select the sequences to be aligned as usual ("Current Species", "Selected
                Species" or "Marked Species").

                Select a PT-Server or SINA kmer-search:

                       You may select a PT-Server that will be used for reference search.
                       Make sure it is up to date and contains all sequences you want to be considered
                       as references.

                       Alternatively select '-undefined-' to use the SINA-internal engine for reference search.
                       SINA will maintain an index file (.sidx) that will be stored next to the reference database file.
                       It will automatically be updated after the reference database has changed.

                Select a reference database:

                       The sequences from the reference database will be used as references when aligning the
                       sequences in the current database.

                       Normally you will want to use the current database as reference database.
                       This can be done in two ways:
                            * select 'Last saved' to use the last saved state of the current database.
                            * select 'Current' to use the loaded database. This is currently not possible
                              with PT-Server.

                       Alternatively you may specify any other database as reference
                       database via 'Explicit as [selected]'. This could e.g. be the same
                       database used to build the PT-Server.

                           This allows you e.g. to use a high quality database containing
                           only type strains as reference while working on small, specialized databases.

                           It will also avoid polluting the set of references with the state of your
                           current working dataset. So you don't have to fear that some badly aligned sequence
                           from your working set will be used as a reference.

                       The effective reference database path will be shown in the input field below.

                       Important note:

                           The previously integrated SINA version 1.3 always aligned versus
                           the state of the referenced sequences in the CURRENT DATABASE.

                           SINA version 1.7 always aligns versus the state of the referenced sequences
                           in the REFERENCE DATABASE (which may, but does not have to be the same!).

                           When using SINA version 1.7 with the PT-Server, please MAKE SURE
                           you specify the same database used to calculate the PT-Server as
                           reference database.

                           Follow these steps:

                                  1. save the reference database (e.g. as ref.arb)
                                  2. start arb on ref.arb and calculate a PT-Server
                                  3. go back to your working database, open the SINA window,
                                     specify the calculated PT-Server AND specify the saved
                                     ref.arb as reference database.
                                  4. restart at 1. whenever you need to update your references

                           If SINA detects any inconsistencies between the PT-Server and the reference
                           database, it will silently try to update the PT-Server database.


                       Excluded references:

                           Some sequences will not be used as references:
                             * sequences with fewer than 10 gaps are considered not aligned and will
                               not be used as references.
                             * if "Realign" (see advanced options) is checked, a sequence will never
                               be used as a reference for itself.

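                           The two exclusion rules above can be pictured as a simple filter.
                           The following Python lines are only an illustration and not SINA's
                           actual code: the function name, the data layout and the assumption
                           that '-' and '.' both count as gaps are made up for this sketch.

```python
def usable_references(query_name, references, realign=True, min_gaps=10):
    """Return the names of reference sequences that pass both exclusion rules.

    references: dict mapping sequence name -> aligned sequence string.
    Gap characters are assumed to be '-' and '.' (an assumption of this sketch).
    """
    usable = []
    for name, aligned in references.items():
        gaps = aligned.count("-") + aligned.count(".")
        if gaps < min_gaps:
            continue  # fewer than 10 gaps: considered not aligned
        if realign and name == query_name:
            continue  # with "Realign", a sequence is never its own reference
        usable.append(name)
    return usable


refs = {
    "seqA": "ACGU" + "-" * 12 + "ACGU",  # 12 gaps: usable
    "seqB": "ACGUACGUACGU",              # no gaps: considered not aligned
    "query": "AC" + "." * 10 + "GU",     # the sequence being aligned
}
print(usable_references("query", refs))
print(usable_references("query", refs, realign=False))
```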
                           Some options define additional requirements for the chosen
                           reference sequence set (see options below, esp. advanced options).

                Decide what to do with possible overhang ("Overhang placement").
                If your sequence extends beyond the
                reference sequences on either side of the alignment, those bases cannot be
                aligned properly. Three ways of handling this situation are supported:

                    "keep attached"

                        Just leave them dangling, directly attached to the last base that
                        could be aligned properly.

                    "move to edge"

                        Move them out to the very beginning and end of
                        the alignment. This allows you to easily spot sequences
                        with overhang, and decide what to do yourself. Recommended,
                        but only if you check your sequences after alignment!

                    "remove"

                        Automatically remove these bases.

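                    The effect of the three overhang modes can be sketched on a toy
                    alignment row. This is a hypothetical illustration only: the function
                    place_overhang() and its arguments do not exist in SINA or ARB, and the
                    sketch assumes that the overhang fits into the free columns.

```python
def place_overhang(core, left, right, mode):
    """Place unaligned overhang bases around an aligned core.

    core  : aligned row of full alignment width, '-' marks empty columns
    left  : unaligned bases preceding the first aligned base
    right : unaligned bases following the last aligned base
    mode  : "keep attached", "move to edge" or "remove"
    """
    if mode == "remove":
        return core  # overhang bases are simply dropped
    aligned_columns = [i for i, c in enumerate(core) if c != "-"]
    first, last = aligned_columns[0], aligned_columns[-1]
    if len(left) > first or len(right) > len(core) - 1 - last:
        raise ValueError("overhang does not fit into the alignment")
    row = list(core)
    if mode == "keep attached":
        # directly next to the outermost properly aligned bases
        row[first - len(left):first] = left
        row[last + 1:last + 1 + len(right)] = right
    elif mode == "move to edge":
        # at the very beginning and the very end of the alignment
        row[0:len(left)] = left
        row[len(core) - len(right):] = right
    else:
        raise ValueError("unknown mode: " + mode)
    return "".join(row)


core = "-----ACGU-----"
for mode in ("keep attached", "move to edge", "remove"):
    print(mode.ljust(14), place_overhang(core, "gg", "cc", mode))
```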
                Choose "Handling of unmappable insertions":

                    Configures how the alignment width is preserved.

                    "Shift surrounding bases"

                        The alignment is executed without constraining insertion sizes.
                        Insertions for which insufficient columns exist between the
                        adjoining aligned bases are force-fitted into the alignment
                        using NAST. That is, the minimum number of aligned bases to the
                        left and right of the insertion are moved to accommodate the
                        insertion. This mode will add warnings to the log for each
                        sequence in which aligned bases had to be moved.

                    "Forbid during DP alignment"

                        The alignment is executed using a scoring scheme disallowing insertions
                        for which insufficient columns exist in the alignment. This mode
                        causes fewer “misalignments” than the shift mode as it computes the best
                        alignment under the constraint that no columns may be added to the
                        alignment. However, it will not show if the computed alignment suffered
                        from a lack of empty columns.

                    "Delete bases"

                        The alignment is executed without constraining insertion sizes. Insertions
                        larger than the number of columns between the adjoining aligned bases are
                        truncated. While this mode yields the most accurate alignment for
                        sequences with large insertions, it should be used with care as it
                        modifies the original sequence.

                Choose "Character Case":

                    Configures which bases should be written using lower case characters.

                    "Do not modify"

                        All bases will be written using the case they had in the input data.

                    "Show unaligned bases as lower case"

                        Aligned bases will be written in upper case; unaligned bases will be
                        written in lower case. This serves to mark sections of the query
                        sequences that could not be aligned because they were insertions
                        (internal or edge) with respect to any of the reference sequences.

                    "Uppercase all"

                        All bases will use upper case characters.

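                    As a rough illustration of the three "Character Case" modes, the
                    following Python sketch rewrites the bases of one sequence. The
                    function apply_case() and the per-base list of "was aligned" flags are
                    assumptions made for this example; they are not part of SINA.

```python
def apply_case(bases, aligned_flags, mode):
    """Return the bases with their case adjusted according to the chosen mode.

    bases         : the bases of the query sequence (gaps already removed)
    aligned_flags : one boolean per base, True if the base could be aligned
    """
    if mode == "Do not modify":
        return bases
    if mode == "Uppercase all":
        return bases.upper()
    if mode == "Show unaligned bases as lower case":
        return "".join(
            base.upper() if aligned else base.lower()
            for base, aligned in zip(bases, aligned_flags)
        )
    raise ValueError("unknown mode: " + mode)


flags = [True, False, True, True]
print(apply_case("acGu", flags, "Show unaligned bases as lower case"))
```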
                Define "Family conservation weight": (default 1)

                    Adjust the weight factor for the frequency at which a node was observed
                    in the reference alignment. Use 0 to disable weighting. This feature
                    prefers the more common placement for bases with inconsistent alignment
                    in the reference database.

                Define "Size of full-length sequences": (default 1400)

                    Set the minimum length a reference sequence is required to have
                    to be considered full length.
                    See also "Minimal number of full length sequences" in
                    ADVANCED OPTIONS below.

                Select a "Protection Level" higher than that of the sequences if you want the
                alignment software to actually modify the bases. Choose a lower protection
                level to execute a "dry run", not changing anything. Note that sequences
                with a protection level of zero will always be changed.

SECTION         Verbosity

                All output will be printed to the console that opens when you start SINA.

                Several options allow you to change the verbosity:

                    When "Show changed sections" is checked, SINA shows differences
                    between the inferred alignment and the original alignment.

                    That output will be colorized if "color bases" is checked.

                    Check "Show statistic" to show the distance to the original alignment.

SECTION         TRICKS

                If you want to see how the alignment that would be produced by the graph
                aligner differs from your current alignment, and why the program would
                act that way, you can set the protection level to "0" and the Logging level
                to "debug". The output on the console will now include all differing sections
                of the alignment and the matching parts of the reference sequences.

SECTION         ADVANCED OPTIONS

                Select the "Show advanced options" button at the top to gain access to
                the you-may-now-shoot-yourself-in-the-foot-severely dialog window.

                Don't be surprised if the graph aligner crashes after you have entered silly
                values here. No sanity check of your options is done.

                Pos.Var:

                        Select a positional variability filter. If possible, use the filter
                        appropriate for the type of sequences you want aligned. Positional
                        variability statistics will be considered when placing the individual bases.

                Field used for automatic filter selection:

                        Configures a database field whose values are used to determine the
                        positional variability filter by majority vote among the selected
                        reference sequences.
                        Since the filters are usually computed at domain level, this approach is usually
                        sufficient to select an appropriate filter.
                        For SILVA databases, the field 'tax_slv' contains appropriate data.


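                        A majority vote over a database field can be sketched as follows.
                        This is an illustration under assumptions, not SINA's implementation:
                        it assumes that the field holds a ';'-separated taxonomy path whose
                        first entry is the domain, and the function name majority_domain()
                        is made up for this example.

```python
from collections import Counter


def majority_domain(field_values):
    """Return the most frequent first path entry among the given field values.

    field_values: field contents of the selected reference sequences,
                  e.g. "Bacteria;Firmicutes;Bacilli". Empty values are ignored.
    """
    domains = [v.split(";")[0].strip() for v in field_values if v and v.strip()]
    if not domains:
        return None  # no usable field content, no automatic choice possible
    return Counter(domains).most_common(1)[0][0]


values = [
    "Bacteria;Firmicutes;Bacilli",
    "Bacteria;Proteobacteria",
    "Archaea;Euryarchaeota",
    "",
]
print(majority_domain(values))
```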
                Turn check:

                        If selected (default), sequences will be automatically reversed
                        and/or complemented if this is likely to improve the alignment.


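                        The orientations that a turn check has to consider can be listed
                        with a few lines of Python. The sketch only generates the candidate
                        orientations of an RNA sequence; how SINA decides which of them is
                        likely to improve the alignment is not shown, and the function name
                        orientations() is hypothetical.

```python
# RNA complement table; 'T' is treated like 'U' (assumption of this sketch).
COMPLEMENT = str.maketrans("ACGUTacgut", "UGCAAugcaa")


def orientations(seq):
    """Return the four candidate orientations of a sequence."""
    complemented = seq.translate(COMPLEMENT)
    return {
        "unchanged": seq,
        "reversed": seq[::-1],
        "complemented": complemented,
        "reversed and complemented": complemented[::-1],
    }


for name, candidate in orientations("AACG").items():
    print(name.ljust(26), candidate)
```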
                Realign:

                        If selected, the sequence itself is excluded from the result of
                        the executed PT-Server family search. If deselected, the alignment
                        of an identical sequence found by the PT-Server is copied.


# @@@ add option to mark references or aligned, then document here.
# (Copy and) mark sequence used as reference:
#
#       Mark the sequences that were used as a reference during alignment.
#       This allows you to easily load them into the editor to review the
#       decisions made by the graph aligner.
#       If you also selected the "Load reference" option, sequences will be
#       copied into your current database prior to being marked.


                Gap insertion/extension penalties: (default is 5/2)

                        You can change the penalties associated with opening and extending
                        gaps.

                Match/mismatch scores: (default is 2/-1)

                        Configures the scores given for a match (should be positive) and
                        a mismatch (should be negative).

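                        To get a feeling for how the match/mismatch scores and the gap
                        penalties interact, the following sketch scores a given pairwise
                        alignment using the default values. It is a simplified illustration
                        and not the scoring scheme implemented in SINA (which aligns against
                        a graph of several references). It assumes that the first column of
                        a gap costs the insertion penalty and each further column the
                        extension penalty.

```python
def score_alignment(query, reference, match=2, mismatch=-1,
                    gap_open=5, gap_extend=2):
    """Score two aligned rows of equal length; '-' marks a gap."""
    if len(query) != len(reference):
        raise ValueError("aligned rows must have the same length")
    score = 0
    gap_side = None  # which row currently has an open gap, if any
    for q, r in zip(query, reference):
        if q == "-" and r == "-":
            continue  # column empty in both rows, nothing to score
        if q == "-" or r == "-":
            side = "query" if q == "-" else "reference"
            # first gap column pays the opening penalty, later ones the extension
            score -= gap_extend if side == gap_side else gap_open
            gap_side = side
        else:
            score += match if q == r else mismatch
            gap_side = None
    return score


print(score_alignment("ACGU", "ACCU"))      # three matches, one mismatch
print(score_alignment("ACG--U", "ACGAAU"))  # four matches, one gap of length 2
```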
                Family search min/min_score/max: (default 40/0.7/40)

                        The first value tells the graph aligner how many sequences it should
                        always try to use. The second value determines the minimal identity
                        with the target sequence that additional reference sequences must have.
                        The third value sets the maximal number of sequences to be used
                        as references.

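                        One possible reading of these three values is sketched below: the
                        best 'min' candidates are always taken, further candidates are only
                        added while their identity reaches 'min_score', and never more than
                        'max' sequences are used. The function select_family() and the
                        (name, identity) tuples are assumptions of this example; SINA's
                        actual selection logic may differ in detail.

```python
def select_family(candidates, fs_min=40, min_score=0.7, fs_max=40):
    """Select reference sequences from (name, identity) search results."""
    ranked = sorted(candidates, key=lambda c: c[1], reverse=True)
    selected = ranked[:fs_min]  # always try to use the best fs_min hits
    for name, identity in ranked[fs_min:fs_max]:
        if identity < min_score:
            break  # remaining hits are even less similar
        selected.append((name, identity))
    return selected


hits = [("a", 0.95), ("b", 0.60), ("c", 0.80),
        ("d", 0.75), ("e", 0.72), ("f", 0.50)]
# small values are used here so that the effect is visible on six hits
print([name for name, _ in select_family(hits, fs_min=2, min_score=0.74, fs_max=4)])
```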
                Minimal number of full length sequences: (default 1)

                        Set the minimum number of full length (see "Size of full-length sequences"
                        setting above) reference sequences that must be included in the selected
                        reference set. The search will proceed regardless of other settings until
                        this setting has been satisfied. If it cannot be satisfied by any
                        sequence in the reference database, the query sequence will be discarded.
                        This setting exists to ensure that the entire length of the query sequence
                        will be covered in the presence of partial sequences contained within your
                        reference database.

                Family search oligo length/mismatches: (default 10/0)

                        The first value sets the size of k for the reference search (size of kmer).
                        For SSU rRNA sequences, the default of 10 is a good value.
                        For different sequence types, different values may perform better.
                        For 5S, for example, 6 has been shown to be more effective.

                        The second value allows k-mer matches in the reference database to contain n mismatches.
                        This feature is only supported by the PT-Server search engine and
                        requires substantial additional compute time (in particular for n > 1).

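                        The meaning of the oligo length k and of the allowed mismatches n
                        can be illustrated with a naive k-mer comparison. The sketch below
                        is deliberately simple and slow; it is not how the PT-Server or the
                        SINA-internal search work, and all function names are made up.

```python
def kmers(seq, k=10):
    """Return all overlapping k-mers (oligos of length k) of a sequence."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def hamming(a, b):
    """Number of positions at which two equally long oligos differ."""
    return sum(x != y for x, y in zip(a, b))


def count_matching_kmers(query, reference, k=10, mismatches=0):
    """Count distinct query k-mers found in the reference with <= n mismatches."""
    reference_kmers = set(kmers(reference, k))
    count = 0
    for oligo in set(kmers(query, k)):
        if any(hamming(oligo, ref) <= mismatches for ref in reference_kmers):
            count += 1
    return count


# k=4 is used only to keep the toy example short
print(count_matching_kmers("ACGUACGA", "ACGUACGU", k=4, mismatches=0))
print(count_matching_kmers("ACGUACGA", "ACGUACGU", k=4, mismatches=1))
```

                        Comparing every query k-mer with every reference k-mer, as done
                        here for n > 0, hints at why allowing mismatches costs additional
                        compute time.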
                Minimal reference sequence length: (default 150)

                        Set the minimum length reference sequences are required to have.
                        Sequences shorter than this will not be included in the selection.

                        Note: If you are working with particularly short reference sequences, you
                        will need to lower this setting to allow any reference sequences to be
                        found.

                Alignment bounds: (default 0/0)

                        These values set the beginning and the end of the gene within the reference
                        alignment.
                        See "Number of references required to touch bounds" for more information.

                Number of references required to touch bounds: (default: 0)

                        Similar to "Minimal number of full length sequences", this option requires a
                        total of n sequences to cover each of the beginning and the end of the gene
                        within the alignment.

                        This option is more precise than "Minimal number of full length sequences", but
                        requires that the column numbers for the range in which the full gene is
                        expected be specified via "Alignment bounds" (see above).

                Save used references in 'used_rels': (default is off)

                        Writes the names of the alignment reference sequences into the field 'used_rels'.
                        This option allows using LINK{markbyref.hlp} to highlight the reference
                        sequences used to align a given query sequence.

                Store highest identity in 'align_ident_slv': (default is off)

                        Computes the highest similarity the aligned query sequence has with any of
                        the sequences in the alignment reference set. The value is written to the
                        field 'align_ident_slv'.

                Disable fast search: (default is to use fast search)

                        Use all k-mers occurring in the query sequence in the search. By default,
                        only k-mers starting with an A are used for extra performance.

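                        The difference between fast and full search can be shown by
                        listing the k-mers that take part in the search. This is a minimal
                        sketch, assuming plain upper case sequences; the function name
                        search_kmers() is hypothetical.

```python
def search_kmers(query, k=10, fast=True):
    """Return the set of query k-mers that would be used for the search."""
    all_kmers = {query[i:i + k] for i in range(len(query) - k + 1)}
    if fast:
        # fast search: only k-mers starting with an A are used
        return {oligo for oligo in all_kmers if oligo.startswith("A")}
    return all_kmers


# k=4 is used only to keep the toy example short
print(sorted(search_kmers("ACGUACGA", k=4, fast=True)))
print(sorted(search_kmers("ACGUACGA", k=4, fast=False)))
```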
                Score search results by absolute oligo match count: (default is off)

                        Use absolute (number of shared k-mers) match scores in the kmer search
                        rather than relative (number of shared k-mers divided by length of reference
                        sequence) match scores.

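                        The two kinds of match scores can be compared on a small example.
                        The sketch follows the definitions given above (shared k-mers, and
                        shared k-mers divided by the length of the reference sequence); the
                        function names are made up and the code is not taken from SINA.

```python
def kmer_set(seq, k):
    """Return the set of distinct k-mers of a sequence."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def match_score(query, reference, k=10, absolute=False):
    """Score a reference by the k-mers it shares with the query."""
    shared = len(kmer_set(query, k) & kmer_set(reference, k))
    if absolute:
        return shared                # number of shared k-mers
    return shared / len(reference)   # relative to the reference length


query = "ACGUACGA"
for reference in ("ACGUACGU", "ACGUAC"):
    # k=4 is used only to keep the toy example short
    print(reference,
          match_score(query, reference, k=4, absolute=True),
          match_score(query, reference, k=4))
```

                        In this toy example the longer reference shares more k-mers with
                        the query, while both references get the same relative score.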
                Suppress warnings about missing 'start' field: (default is off)

                        This option suppresses warnings about missing 'start' fields and allows you to
                        use SINA with databases not using the 'start' field without getting flooded with
                        warnings.

                SINA command: (default "arb_sina.sh")

                        If arb has problems finding the SINA binary for whatever reason, you may
                        specify an explicit path here.
                        Please note that doing so will stop a fat-tarball installation from working!

NOTES           SINA automatically decides the number of threads being used.

                When recording macros acting on the SINA window, problems
                with the used macro IDs may occur ("XSINA" vs. "SINA" prefix), with
                ARB complaining about unknown macro IDs (e.g.
                something like "Unknown action 'SINA/CURR_PT_SERVER' in macro").
                This is caused by the way the SINA window toggles
                between 'normal' and 'advanced' options.
                To avoid this, toggle between advanced and normal at the start of your macro.
                If problems persist, manually correct the action prefixes in your macro.

WARNINGS        When using SINA 1.3 you have to make sure that the alignment selected
                in LINK{ad_align.hlp} is the same alignment as used in the ARB_EDIT4 instance.
                Starting with SINA 1.7.2 it will always use the same alignment as the editor.

BUGS            No bugs known