source: trunk/HELP_SOURCE/source/sina_main.hlp

Last change on this file was 19348, checked in by westram, 2 years ago
  • emphasize reference- and ptserver-database have to be the same database!
1#Please insert up references in the next lines (line starts with keyword UP)
2UP      arb.hlp
3UP      glossary.hlp
4
5#Please insert subtopic references  (line starts with keyword SUB)
6#SUB    subtopic.hlp
7
8# Hypertext links in helptext can be added like this: LINK{ref.hlp|http://add|bla@domain}
9
10#************* Title of helpfile !! and start of real helpfile ********
11TITLE           Graph Aligner (SINA)
12
13OCCURRENCE      ARB Editor -> Edit -> Prototypical Graph Aligner
14
15DESCRIPTION     SINA is an alternative to the integrated aligners.
16                It has been developed for the SILVA project.
17
18                Unlike the integrated aligners, SINA
19                 - uses aligned sequences from the reference database as reference to align the selected sequences.
20                 - employs full dynamic programming to create the alignment.
21                 - considers all selected relatives at once, instead of falling back to
22                   less similar sequences only if the current sequence is missing bases (e.g.
23                   because it is a partial sequence).
24
25SECTION         SINA documentation online
26
27                LINK{https://sina.readthedocs.io/_/downloads/en/stable/pdf/}
28
29                Note: parts of this document were taken from the above SINA documentation.
30
31SECTION         SINA version
32
33                Please note this documentation applies to the modified SINA version 1.7.2
34                which (optionally) is delivered with arb.
35
36                     Modifications applied:
37                       * fix build with arb7 + gcc 7.5
38                       * define CLI interface "ARB7.1" (allows arb to detect SINA 1.7.2)
39                       * fixed some error messages (clarity; do NOT show WHOLE alignment; explicitly show bad character)
40                       * new option '--dont-expect-start' (allows working with databases that do not contain 'start' fields)
41
42                ARB still supports SINA version 1.3.
43                If there is no option to select the reference database, you are either using version 1.3
44                or you are trying to use a (newer) SINA version which does not yet support the "ARB7.1"-CLI-interface.
45
46                In case you are using version 1.3, you may want to use the old, corresponding version of this help page,
47                which may be accessed via LINK{http://bugs.arb-home.de/browser/trunk/HELP_SOURCE/source/sina_main.hlp?rev=18781}
48
49SECTION         OPTIONS
50
51                Select the sequences to be aligned as usual ("Current Species", "Selected
52                Species" or "Marked Species").
53
54                Select a PT-Server or SINA kmer-search:
55
56                       You may select a PT-Server that will be used for reference search.
57                       Make sure it is up to date and contains all sequences you want to be considered
58                       as reference.
59
60                       Alternatively select '-undefined-' to use the sina-internal engine for reference search.
61                       SINA will maintain an index file (.sidx) that will be stored next to the reference database file.
62                       It will automatically be updated after the reference database has changed.
63
64                Select a reference database:
65
66                       The sequences from the reference database will be used as references when aligning the
67                       sequences in the current database.
68
69                       Normally you will want to use the current database as reference database.
70                       This can be done in two ways:
71                            * select 'Last saved' to use the last saved state of the current database.
72                            * select 'Current' to use the loaded database. This is currently not possible
73                              with PT-Server.
74
75                       Alternatively you may specify any other database as reference
76                       database via 'Explicit as [selected]'. This could e.g. be the same
77                       database used to build the PT-Server.
78
79                           This allows you, for example, to use a high-quality database containing
80                           only type strains as reference while working on small, specialized databases.
81
82                           It also avoids polluting the set of references with the state of your
83                           current working dataset, so you don't have to worry that some badly aligned
84                           sequence from your working set will be used as a reference.
85
86                       The effective reference database path will be shown in the input field below.
87
88                       Important note:
89
90                           The previously integrated SINA version 1.3 always aligned versus
91                           the state of the referenced sequences in the CURRENT DATABASE.
92
93                           SINA version 1.7 always aligns versus the state of the referenced sequences
94                           in the REFERENCE DATABASE (which may, but does not have to be the same!).
95
96                           When using SINA version 1.7 with the PT-Server, please MAKE SURE
97                           you specify the same database used to calculate the PT-Server as
98                           reference database.
99
100                           Follow these steps:
101
102                                  1. save the reference database (e.g. as ref.arb)
103                                  2. start arb on ref.arb and calculate a PT-Server
104                                  3. go back to your working database, open the sina window,
105                                     specify the calculated PT-Server AND specify the saved
106                                     ref.arb as reference database.
107                                  4. restart at 1. whenever you need to update your references
108
109                           If SINA detects any inconsistencies between the PT-Server and the reference
110                           database, it will silently try to update the PT-Server database.
111
112
113                       Excluded references:
114
115                           Some sequences will not be used as references:
116                             * sequences with fewer than 10 gaps are considered not aligned and will
117                               not be used as references.
118                             * if "Realign" (see advanced options) is checked, a sequence will never
119                               be used as a reference for itself.
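
                The two exclusion rules above can be illustrated with a minimal Python
                sketch. It is not SINA code: the function names are hypothetical, and
                treating both '-' and '.' as gap characters is an assumption.

```python
# Illustrative only; names are hypothetical and not part of SINA or ARB.
GAP_CHARS = "-."  # assumption: both characters denote gaps in an aligned row


def count_gaps(aligned_seq: str) -> int:
    """Count gap characters in one aligned sequence row."""
    return sum(aligned_seq.count(c) for c in GAP_CHARS)


def usable_as_reference(name: str, aligned_seq: str, query_name: str,
                        realign: bool, min_gaps: int = 10) -> bool:
    """Apply the two exclusion rules described in the help text."""
    # Rule 1: fewer than 10 gaps means the sequence is considered unaligned.
    if count_gaps(aligned_seq) < min_gaps:
        return False
    # Rule 2: with "Realign" checked, a sequence never serves as its own reference.
    if realign and name == query_name:
        return False
    return True
```

                For example, usable_as_reference("refA", "AC" + "-" * 10 + "GT", "query1", True)
                accepts the row, while a row without any gaps is rejected.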
120
121                           Some options define additional requirements for the chosen
122                           reference sequence set (see options below, esp. advanced options).
123
124                Decide what to do with possible overhang ("Overhang placement").
125                If your sequence extends beyond the
126                reference sequences on either side of the alignment, those bases cannot be
127                aligned properly. Three ways of handling this situation are supported:
128
129                    "keep attached"
130
131                        just leave them dangling, directly attached to the last base that
132                        could be aligned properly
133
134                    "move to edge"
135
136                        move them out to the very beginning and end of
137                        the alignment. This allows you to easily spot sequences
138                        with overhang, and decide what to do yourself. Recommended,
139                        but only if you check your sequences after alignment!
140
141                    "remove"
142
143                        automatically remove these bases.
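
                The three overhang modes can be pictured with a small Python sketch.
                This is a simplified model under stated assumptions, not SINA's
                implementation: the function name and parameters are hypothetical, the
                aligned core is given as a string that starts at a known column, and
                '-' is used for every empty column.

```python
# Illustrative only; a simplified model of the three overhang modes.
def place_overhang(core: str, start: int, left: str, right: str,
                   width: int, mode: str) -> str:
    """Build one alignment row of `width` columns.

    core  : properly aligned part (may contain gaps), beginning at column `start`
    left  : unalignable bases preceding the core
    right : unalignable bases following the core
    """
    if mode not in ("keep attached", "move to edge", "remove"):
        raise ValueError("unknown mode: " + mode)
    end = start + len(core)
    if start < 0 or end > width:
        raise ValueError("core does not fit into the alignment")
    if mode == "remove":
        left, right = "", ""  # drop the overhanging bases entirely
    if len(left) > start or len(right) > width - end:
        raise ValueError("not enough columns for the overhang")

    row = ["-"] * width
    row[start:end] = core
    if mode == "move to edge":
        # push the overhang out to the first and last columns
        row[:len(left)] = left
        row[width - len(right):] = right
    else:
        # leave the overhang directly attached to the aligned core
        row[start - len(left):start] = left
        row[end:end + len(right)] = right
    return "".join(row)
```

                With core "AC-GT" at column 4, overhangs "gg" and "tt" and a width of
                12, "keep attached" gives "--ggAC-GTtt-", "move to edge" gives
                "gg--AC-GT-tt" and "remove" gives "----AC-GT---".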
144
145                Choose "Handling of unmappable insertions":
146
147                    Configures how the alignment width is preserved.
148
149                    "Shift surrounding bases"
150
151                        The alignment is executed without constraining insertion sizes.
152                        Insertions for which insufficient columns exist between the
153                        adjoining aligned bases are force fitted into the alignment
154                        using NAST. That is, the minimum number of aligned bases to the
155                        left and right of the insertion are moved to accommodate the
156                        insertion.  This mode will add warnings to the log for each
157                        sequence in which aligned bases had to be moved.
158
159                    "Forbid during DP alignment"
160
161                        The alignment is executed using a scoring scheme disallowing insertions
162                        for which insufficient columns exist in the alignment.  This mode
163                        causes fewer "misalignments" than the shift mode as it computes the best
164                        alignment under the constraint that no columns may be added to the
165                        alignment. However, it will not show if the computed alignment suffered
166                        from a lack of empty columns.
167
168                    "Delete bases"
169
170                        The alignment is executed without constraining insertion sizes. Insertions
171                        larger than the number of columns between the adjoining aligned bases are
172                        truncated.  While this mode yields the most accurate alignment for
173                        sequences with large insertions, it should be used with care as it
174                        modifies the original sequence.
175
176                Choose "Character Case":
177
178                    Configures which bases should be written using lower case characters.
179
180                    "Do not modify"
181
182                        All bases will be written using the case they had in the input data.
183
184                    "Show unaligned bases as lower case"
185
186                        Aligned bases will be written in upper case; unaligned bases will be
187                        written in lower case. This serves to mark sections of the query
188                        sequences that could not be aligned because they were insertions
189                        (internal or edge) with respect to any of the reference sequences.
190
191                    "Uppercase all"
192
193                        All bases will use upper case characters.
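
                As a rough illustration of the three "Character Case" modes, the
                following Python sketch applies them to a string of bases. The function
                and the per-base `aligned` flags are hypothetical; SINA determines
                internally which bases count as aligned.

```python
# Illustrative only; `aligned` is a hypothetical per-base flag list.
def apply_case(bases: str, aligned: list, mode: str) -> str:
    """Return `bases` with the case rules of the chosen mode applied."""
    if mode == "Do not modify":
        return bases
    if mode == "Uppercase all":
        return bases.upper()
    if mode == "Show unaligned bases as lower case":
        if len(bases) != len(aligned):
            raise ValueError("need one flag per base")
        # aligned bases become upper case, unaligned ones lower case
        return "".join(b.upper() if is_aligned else b.lower()
                       for b, is_aligned in zip(bases, aligned))
    raise ValueError("unknown mode: " + mode)
```

                For instance, apply_case("acGT", [True, True, False, False],
                "Show unaligned bases as lower case") yields "ACgt".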
194
195                Define "Family conservation weight": (default 1)
196
197                    Adjust the weight factor for the frequency at which a node was observed
198                    in the reference alignment. Use 0 to disable weighting.  This feature
199                    prefers the more common placement for bases with inconsistent alignment
200                    in the reference database.
201
202                Define "Size of full-length sequences": (default 1400)
203
204                    Set the minimum length a reference sequence is required to have
205                    to be considered full length.
206                    See also "Minimal number of full length sequences" in
207                    ADVANCED OPTIONS below.
208
209                Select a "Protection Level" higher than that of the sequences if you want the
210                alignment software to actually modify the bases. Choose a lower protection
211                level to execute a "dry run", not changing anything. Note that sequences
212                with a protection level of zero will always be changed.
213
214SECTION         Verbosity
215
216                All output will be printed to the console that opens when you start sina.
217
218                Several options allow you to change the noisiness:
219
220                    When "Show changed sections" is checked, sina shows differences
221                    between the inferred alignment and the original alignment.
222
223                    That output will be colorized if "color bases" is checked.
224
225                    Check "Show statistic" to
226                    show the distance to the original alignment.
227
228SECTION         TRICKS
229
230                If you want to see how the alignment that would be produced by the graph
231                aligner differs from your current alignment, and why the program would
232                act that way, you can set the protection level to "0" and the Logging level
233                to "debug". The output on the console will now include all differing sections
234                of the alignment and the matching parts of the reference sequences.
235
236SECTION         ADVANCED OPTIONS
237
238                Select the "Show advanced options" button at the top to gain access to
239                the you-may-now-shoot-yourself-in-the-foot-severely dialog window.
240
241                Don't be surprised if the graph aligner crashes after you enter silly
242                values here. No sanity check of your options is done.
243
244                Pos.Var:
245
246                        Select a positional variability filter. If possible, use the filter
247                        appropriate for the type of sequences you want aligned. Positional
248                        variability statistics will be considered when placing the individual bases.
249
250                Field used for automatic filter selection:
251
252                        Configures the database field from which the positional variability
253                        filter is selected, by majority vote among the selected
254                        reference sequences.
255                        Since the filters are usually computed at domain level, this approach is usually
256                        sufficient to select an appropriate filter.
257                        For SILVA databases, the field 'tax_slv' contains appropriate data.
258
259
260                Turn check:
261
262                        If selected (default), sequences will be automatically reversed
263                        and/or complemented if this will likely improve the alignment.
264
265
266                Realign:
267
268                        If selected, the sequence itself is excluded from the result of
269                        the executed PT-Server family search. If deselected, the alignment
270                        of an identical sequence found by the PT-Server is copied.
271
272
273# @@@ add option to mark references or aligned, then document here.
274# (Copy and) mark sequence used as reference:
275#
276#       Mark the sequences that were used as a reference during alignment.
277#       This allows you to easily load them into the editor to review the
278#       decisions made by the graph aligner.
279#       If you also selected the "Load reference" option, sequences will be
280#       copied into your current database prior to being marked.
281
282
283                Gap insertion/extension penalties: (default is 5/2)
284
285                        You can change the penalties associated with opening and extending
286                        gaps.
287
288                Match/mismatch scores: (default is 2/-1)
289
290                        Configures the scores given for a match (should be positive) and
291                        a mismatch (should be negative).
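
                To give a feeling for how these four numbers interact, the Python
                sketch below scores an already given pairwise alignment. It does not
                reproduce SINA's dynamic programming, and the gap model is an
                assumption: the first column of a gap costs the insertion penalty and
                every further column costs the extension penalty.

```python
# Illustrative only; the gap cost model is an assumption, not SINA's definition.
def score_alignment(query: str, ref: str, match: int = 2, mismatch: int = -1,
                    gap_open: int = 5, gap_extend: int = 2) -> int:
    """Score two equally long aligned rows ('-' marks a gap)."""
    if len(query) != len(ref):
        raise ValueError("rows must have the same length")
    score = 0
    in_gap = False
    for q, r in zip(query, ref):
        if q == "-" and r == "-":
            continue  # column empty in both rows: contributes nothing
        if q == "-" or r == "-":
            # simplification: gaps in either row are treated as one gap run
            score -= gap_extend if in_gap else gap_open
            in_gap = True
        else:
            score += match if q == r else mismatch
            in_gap = False
    return score
```

                With the defaults, four matching bases score 8, and a two-column gap
                costs 5 + 2 = 7 under the assumed model.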
292
293                Family search min/min_score/max: (default 40/0.7/40)
294
295                        The first value tells the graph aligner how many sequences it should
296                        always try to use. The second value determines the minimal identity
297                        with the target sequence that additional reference sequences must have.
298                        The third value selects the maximal number of sequences to be used
299                        as references.
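
                One way to read the three values is sketched below in Python. The
                function and parameter names are hypothetical and the selection logic
                is only an interpretation of the description above, not SINA's code:
                the best `fs_min` hits are always taken, further hits only while their
                score reaches `min_score`, and never more than `fs_max` in total.

```python
# Illustrative only; an interpretation of the min/min_score/max description.
def select_references(hits, fs_min=40, min_score=0.7, fs_max=40):
    """hits: iterable of (name, score) pairs; returns the selected names."""
    ranked = sorted(hits, key=lambda hit: hit[1], reverse=True)
    selected = []
    for name, score in ranked:
        if len(selected) >= fs_max:
            break  # never exceed the maximum
        if len(selected) < fs_min or score >= min_score:
            selected.append(name)
        else:
            break  # remaining hits score even lower
    return selected
```

                With hits scoring 0.9, 0.8, 0.5 and 0.4, the call
                select_references(hits, fs_min=1, min_score=0.7, fs_max=3) keeps the
                first two, while fs_min=3 keeps the first three.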
300
301                Minimal number of full length sequences: (default 1)
302
303                        Set the minimum number of full length (see "Size of full-length sequences"
304                        setting above) reference sequences that must be included in the selected
305                        reference set. The search will proceed regardless of other settings until
306                        this setting has been satisfied.  If it cannot be satisfied by any
307                        sequence in the reference database, the query sequence will be discarded.
308                        This setting exists to ensure that the entire length of the query sequence
309                        will be covered in the presence of partial sequences contained within your
310                        reference database.
311
312                Family search oligo length/mismatches: (default 10/0)
313
314                        The first value sets the size of k for the reference search (size of kmer).
315                        For SSU rRNA sequences, the default of 10 is a good value.
316                        For different sequence types, different values may perform better.
317                        For 5S, for example, 6 has been shown to be more effective.
318
319                        The second value allows k-mer matches in the reference database to contain n mismatches.
320                        This feature is only supported by the pt-server search engine and
321                        requires substantial additional compute time (in particular for n > 1).
322
323                Minimal reference sequence length: (default 150)
324
325                        Set the minimum length reference sequences are required to have.
326                        Sequences shorter than this will not be included in the selection.
327
328                        Note: If you are working with particularly short reference sequences, you
329                        will need to lower this setting to allow any reference sequences to be
330                        found.
331
332                Alignment bounds: (default 0/0)
333
334                        These values set the beginning and the end of the gene within the reference
335                        alignment.
336                        See "Number of references required to touch bounds" for more information.
337
338                Number of references required to touch bounds: (default: 0)
339
340                        Similar to "Minimal number of full length sequences", this option requires
341                        n sequences each to cover the beginning and the end of the gene
342                        within the alignment.
343
344                        This option is more precise than "Minimal number of full length sequences", but
345                        requires that the column numbers for the range in which the full gene is
346                        expected be specified via "Alignment bounds" (see above).
347
348                Save used references in 'used_rels': (default is off)
349
350                        Writes the names of the alignment reference sequences into the field used_rels.
351                        This option allows using LINK{markbyref.hlp} to highlight the reference
352                        sequences used to align a given query sequence.
353
354                Store highest identity in 'align_ident_slv': (default is off)
355
356                        Computes the highest similarity the aligned query sequence has with any of
357                        the sequences in the alignment reference set. The value is written to the
358                        field 'align_ident_slv'.
359
360                Disable fast search: (default is to use fast search)
361
362                        Use all k-mers occurring in the query sequence in the search. By default,
363                        only k-mers starting with an A are used for extra performance.
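
                The difference between the two modes can be sketched in a few lines of
                Python. The function is hypothetical and only mirrors the description
                above (all k-mers versus only those starting with an A); it says
                nothing about how SINA stores or looks up k-mers.

```python
# Illustrative only; mirrors the textual description of fast search.
def query_kmers(seq: str, k: int = 10, fast: bool = True) -> set:
    """Return the set of k-mers of `seq` that would be used for the search."""
    seq = seq.upper()
    kmers = {seq[i:i + k] for i in range(len(seq) - k + 1)}
    if fast:
        # fast search: keep only k-mers that start with an A
        kmers = {kmer for kmer in kmers if kmer.startswith("A")}
    return kmers
```

                For the toy sequence "ACGTAC" and k=3, disabling fast search uses
                ACG, CGT, GTA and TAC, whereas fast search uses only ACG.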
364
365                Score search results by absolute oligo match count: (default is off)
366
367                        Use absolute (number of shared k-mers) match scores in the kmer search
368                        rather than relative (number of shared k-mers divided by length of reference
369                        sequence) match scores.
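
                A minimal Python sketch of the two scoring variants follows. The
                helper names are hypothetical, and counting each distinct k-mer once
                is an assumption made to keep the example short.

```python
# Illustrative only; distinct k-mers are counted once (an assumption).
def kmer_set(seq: str, k: int) -> set:
    """All distinct k-mers of a sequence."""
    seq = seq.upper()
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def match_score(query: str, reference: str, k: int = 10,
                absolute: bool = False) -> float:
    """Absolute: number of shared k-mers. Relative: divided by reference length."""
    shared = len(kmer_set(query, k) & kmer_set(reference, k))
    if absolute:
        return shared
    if not reference:
        raise ValueError("empty reference sequence")
    return shared / len(reference)
```

                For query "ACGTAC" and reference "ACGTTT" with k=3, two k-mers are
                shared, so the absolute score is 2 and the relative score is 2/6.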
370
371                Suppress warnings about missing 'start' field: (default is off)
372
373                        This option suppresses warnings about missing 'start' fields and allows you to
374                        use SINA with databases that lack the 'start' field without being flooded with
375                        warnings.
376
377                SINA command: (default "arb_sina.sh")
378
379                        If arb has problems finding the sina binary for whatever reason, you may
380                        specify an explicit path here.
381                        Please note, doing so will stop a fat-tarball-installation from working!
382
383NOTES           SINA automatically decides the number of threads being used.
384
385                When recording macros acting on the SINA window, problems
386                with the used macro-IDs may occur ("XSINA" vs. "SINA" prefix), and
387                ARB may complain about unknown macro ids (e.g.
388                something like "Unknown action 'SINA/CURR_PT_SERVER' in macro").
389                This is caused by the way the sina window toggles
390                between 'normal' and 'advanced' options.
391                To avoid this, toggle between advanced and normal at the start of your macro.
392                If problems persist, manually correct the action prefixes in your macro.
393
394WARNINGS        When using SINA 1.3 you have to make sure that the alignment selected
395                in LINK{ad_align.hlp} is the same alignment as used in the ARB_EDIT4 instance.
396                Starting with SINA 1.7.2 it will always use the same alignment as the editor.
397
398BUGS            No bugs known