1 | # main topics: |
---|
2 | UP arb.hlp |
---|
3 | UP glossary.hlp |
---|
4 | |
---|
5 | # sub topics: |
---|
6 | #SUB subtopic.hlp |
---|
7 | |
---|
8 | # format described in ../help.readme |
---|
9 | |
---|
10 | TITLE Graph Aligner (SINA) |
---|
11 | |
---|
12 | OCCURRENCE ARB Editor -> Edit -> Prototypical Graph Aligner |
---|
13 | |
---|
14 | DESCRIPTION SINA is an alternative to the integrated aligners. |
---|
15 | It has been developed for the SILVA project. |
---|
16 | |
---|
17 | Unlike the integrated aligners, SINA |
---|
18 | - uses aligned sequences from the reference database as reference to align the selected sequences. |
---|
19 | - employs full dynamic programming to create the alignment. |
---|
20 | - considers all selected relatives at once, instead of falling back to |
---|
21 | less similar sequences only if the current sequence is missing bases (e.g. |
---|
22 | because it is a partial sequence). |
---|
23 | |
---|
24 | SECTION SINA documentation online |
---|
25 | |
---|
26 | LINK{https://sina.readthedocs.io/_/downloads/en/stable/pdf/} |
---|
27 | |
---|
28 | Note: parts of this document were taken from the above SINA documentation. |
---|
29 | |
---|
30 | SECTION SINA version |
---|
31 | |
---|
32 | Please note this documentation applies to the modified SINA version 1.7.2 |
---|
33 | which is (optionally) delivered with arb. |
---|
34 | |
---|
35 | Modifications applied: |
---|
36 | * fix build with arb7 + gcc 7.5 |
---|
37 | * define CLI interface "ARB7.1" (allows arb to detect SINA 1.7.2) |
---|
38 | * fixed some error messages (clarity; do NOT show WHOLE alignment; explicitly show bad character) |
---|
39 | * new option '--dont-expect-start' (allows working with databases not containing 'start'-fields) |
---|
40 | |
---|
41 | ARB still supports SINA version 1.3. |
---|
42 | If there is no option to select the reference database, you are either using version 1.3 |
---|
43 | or you are trying to use a (newer) SINA version which does not yet support the "ARB7.1"-CLI-interface. |
---|
44 | |
---|
45 | In case you are using version 1.3, you may want to use the old, corresponding version of this help page |
---|
46 | which may be accessed via LINK{http://bugs.arb-home.de/browser/trunk/HELP_SOURCE/source/sina_main.hlp?rev=18781} |
---|
47 | |
---|
48 | SECTION OPTIONS |
---|
49 | |
---|
50 | Select the sequences to be aligned as usual ("Current Species", "Selected |
---|
51 | Species" or "Marked Species"). |
---|
52 | |
---|
53 | Select a PT-Server or SINA kmer-search: |
---|
54 | |
---|
55 | You may select a PT-Server that will be used for reference search. |
---|
56 | Make sure it is up to date and contains all sequences you want to be considered |
---|
57 | as reference. |
---|
58 | |
---|
59 | Alternatively select '-undefined-' to use the sina-internal engine for reference search. |
---|
60 | SINA will maintain an index file (.sidx) that will be stored next to the reference database file. |
---|
61 | It will automatically be updated after the reference database has changed. |
---|
62 | |
---|
63 | Select a reference database: |
---|
64 | |
---|
65 | The sequences from the reference database will be used as references when aligning the |
---|
66 | sequences in the current database. |
---|
67 | |
---|
68 | Normally you will want to use the current database as reference database. |
---|
69 | This can be done in two ways: |
---|
70 | * select 'Last saved' to use the last saved state of the current database. |
---|
71 | * select 'Current' to use the loaded database. This is currently not possible |
---|
72 | with PT-Server. |
---|
73 | |
---|
74 | Alternatively you may specify any other database as reference |
---|
75 | database via 'Explicit as [selected]'. This could e.g. be the same |
---|
76 | database used to build the PT-Server. |
---|
77 | |
---|
78 | This allows you, e.g., to use a high quality database containing |
---|
79 | only type strains as reference while working on small, specialized databases. |
---|
80 | |
---|
81 | It will also avoid polluting the set of references with the state of your |
---|
82 | current working dataset. So you don't have to fear that some badly aligned sequence |
---|
83 | from your working set will be used as a reference. |
---|
84 | |
---|
85 | The effective reference database path will be shown in the input field below. |
---|
86 | |
---|
87 | Important note: |
---|
88 | |
---|
89 | The previously integrated SINA version 1.3 always aligned versus |
---|
90 | the state of the referenced sequences in the CURRENT DATABASE. |
---|
91 | |
---|
92 | SINA version 1.7 always aligns versus the state of the referenced sequences |
---|
93 | in the REFERENCE DATABASE (which may, but does not have to be the same!). |
---|
94 | |
---|
95 | When using SINA version 1.7 with the PT-Server, please MAKE SURE |
---|
96 | you specify the same database used to calculate the PT-Server as |
---|
97 | reference database. |
---|
98 | |
---|
99 | Follow these steps: |
---|
100 | |
---|
101 | 1. save the reference database (e.g. as ref.arb) |
---|
102 | 2. start arb on ref.arb and calculate a PT-Server |
---|
103 | 3. go back to your working database, open the sina window, |
---|
104 | specify the calculated PT-Server AND specify the saved |
---|
105 | ref.arb as reference database. |
---|
106 | 4. restart at 1. whenever you need to update your references |
---|
107 | |
---|
108 | If SINA detects any inconsistencies between the PT-Server and the reference |
---|
109 | database, it will silently try to update the PT-Server database. |
---|
110 | |
---|
111 | |
---|
112 | Excluded references: |
---|
113 | |
---|
114 | Some sequences will not be used as references: |
---|
115 | * sequences with fewer than 10 gaps are considered not aligned and will |
---|
116 | not be used as references. |
---|
117 | * if "Realign" (see advanced options) is checked, a sequence will never |
---|
118 | be used as a reference for itself. |
---|
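
The two exclusion rules above can be sketched in a few lines of Python. This is only an illustration, not SINA's actual code: the function name, the data layout (a dict of name to aligned sequence) and the assumption that both '-' and '.' count as gap characters are hypothetical.

```python
GAP_CHARS = "-."  # assumption: both characters denote gaps in the alignment
MIN_GAPS = 10     # threshold quoted in the text above


def usable_references(candidates, query_name, realign):
    """Return the names of candidates that pass the two exclusion rules.

    candidates: dict mapping sequence name -> aligned sequence string
    """
    usable = []
    for name, aligned in candidates.items():
        gap_count = sum(aligned.count(ch) for ch in GAP_CHARS)
        if gap_count < MIN_GAPS:
            continue  # fewer than 10 gaps: treated as not aligned
        if realign and name == query_name:
            continue  # "Realign": never use a sequence as its own reference
        usable.append(name)
    return usable


candidates = {
    "seqA": "ACGU" + "-" * 12 + "ACGU",  # 12 gaps: usable
    "seqB": "ACGUACGUACGU--",            # only 2 gaps: excluded
    "query": "AC" + "." * 10 + "GU",     # 10 gaps: usable unless realigning
}
print(usable_references(candidates, "query", realign=True))   # ['seqA']
print(usable_references(candidates, "query", realign=False))  # ['seqA', 'query']
```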
119 | |
---|
120 | Some options define additional requirements for the chosen |
---|
121 | reference sequence set (see options below, esp. advanced options). |
---|
122 | |
---|
123 | Decide what to do with possible overhang ("Overhang placement"). |
---|
124 | If your sequence extends beyond the |
---|
125 | reference sequences on either side of the alignment, those bases cannot be |
---|
126 | aligned properly. Three options of handling this situation are supported: |
---|
127 | |
---|
128 | "keep attached" |
---|
129 | |
---|
130 | just leave them dangling, directly attached to the last base that |
---|
131 | could be aligned properly |
---|
132 | |
---|
133 | "move to edge" |
---|
134 | |
---|
135 | move them out to the very beginning and end of |
---|
136 | the alignment. This allows you to easily spot sequences |
---|
137 | with overhang, and decide what to do yourself. Recommended, |
---|
138 | but only if you check your sequences after alignment! |
---|
139 | |
---|
140 | "remove" |
---|
141 | |
---|
142 | automatically remove these bases. |
---|
143 | |
---|
144 | Choose "Handling of unmappable insertions": |
---|
145 | |
---|
146 | Configures how the alignment width is preserved. |
---|
147 | |
---|
148 | "Shift surrounding bases" |
---|
149 | |
---|
150 | The alignment is executed without constraining insertion sizes. |
---|
151 | Insertions for which insufficient columns exist between the |
---|
152 | adjoining aligned bases are force fitted into the alignment |
---|
153 | using NAST. That is, the minimum number of aligned bases to the |
---|
154 | left and right of the insertion are moved to accommodate the |
---|
155 | insertion. This mode will add warnings to the log for each |
---|
156 | sequence in which aligned bases had to be moved. |
---|
157 | |
---|
158 | "Forbid during DP alignment" |
---|
159 | |
---|
160 | The alignment is executed using a scoring scheme disallowing insertions |
---|
161 | for which insufficient columns exist in the alignment. This mode |
---|
162 | causes fewer "misalignments" than the shift mode as it computes the best |
---|
163 | alignment under the constraint that no columns may be added to the |
---|
164 | alignment. However, it will not show if the computed alignment suffered |
---|
165 | from a lack of empty columns. |
---|
166 | |
---|
167 | "Delete bases" |
---|
168 | |
---|
169 | The alignment is executed without constraining insertion sizes. Insertions |
---|
170 | larger than the number of columns between the adjoining aligned bases are |
---|
171 | truncated. While this mode yields the most accurate alignment for |
---|
172 | sequences with large insertions, it should be used with care as it |
---|
173 | modifies the original sequence. |
---|
174 | |
---|
175 | Choose "Character Case": |
---|
176 | |
---|
177 | Configures which bases should be written using lower case characters. |
---|
178 | |
---|
179 | "Do not modify" |
---|
180 | |
---|
181 | All bases will be written using the case they had in the input data. |
---|
182 | |
---|
183 | "Show unaligned bases as lower case" |
---|
184 | |
---|
185 | Aligned bases will be written in upper case; unaligned bases will be |
---|
186 | written in lower case. This serves to mark sections of the query |
---|
187 | sequences that could not be aligned because they were insertions |
---|
188 | (internal or edge) with respect to any of the reference sequences. |
---|
189 | |
---|
190 | "Uppercase all" |
---|
191 | |
---|
192 | All bases will use upper case characters. |
---|
193 | |
---|
194 | Define "Family conservation weight": (default 1) |
---|
195 | |
---|
196 | Adjust the weight factor for the frequency at which a node was observed |
---|
197 | in the reference alignment. Use 0 to disable weighting. This feature |
---|
198 | prefers the more common placement for bases with inconsistent alignment |
---|
199 | in the reference database. |
---|
200 | |
---|
201 | Define "Size of full-length sequences": (default 1400) |
---|
202 | |
---|
203 | Set the minimum length a reference sequence is required to have |
---|
204 | to be considered full length. |
---|
205 | See also "Minimal number of full length sequences" in |
---|
206 | ADVANCED OPTIONS below. |
---|
207 | |
---|
208 | Select a "Protection Level" higher than that of the sequences if you want the |
---|
209 | alignment software to actually modify the bases. Choose a lower protection |
---|
210 | level to execute a "dry run", not changing anything. Note that sequences |
---|
211 | with a protection level of zero will always be changed. |
---|
212 | |
---|
213 | SECTION Verbosity |
---|
214 | |
---|
215 | All output will be printed to the console that opens when you start sina. |
---|
216 | |
---|
217 | Several options allow you to change the noisiness: |
---|
218 | |
---|
219 | When "Show changed sections" is checked, sina shows differences |
---|
220 | between the inferred alignment and the original alignment. |
---|
221 | |
---|
222 | That output will be colorized if "color bases" is checked. |
---|
223 | |
---|
224 | Check "Show statistic" to |
---|
225 | show the distance to the original alignment. |
---|
226 | |
---|
227 | SECTION TRICKS |
---|
228 | |
---|
229 | If you want to see how the alignment that would be produced by the graph |
---|
230 | aligner differs from your current alignment, and why the program would |
---|
231 | act that way, you can set the protection level to "0" and the Logging level |
---|
232 | to "debug". The output on the console will now include all differing sections |
---|
233 | of the alignment and the matching parts of the reference sequences. |
---|
234 | |
---|
235 | SECTION ADVANCED OPTIONS |
---|
236 | |
---|
237 | Select the "Show advanced options" button at the top to gain access to |
---|
238 | the you-may-now-shoot-yourself-in-the-foot-severely dialog window. |
---|
239 | |
---|
240 | Don't be surprised if the graph aligner crashes after you have entered silly |
---|
241 | values here. No sanity check of your options is done. |
---|
242 | |
---|
243 | Pos.Var: |
---|
244 | |
---|
245 | Select a positional variability filter. If possible, use the filter |
---|
246 | appropriate for the type of sequences you want aligned. Positional |
---|
247 | variability statistics will be considered when placing the individual bases. |
---|
248 | |
---|
249 | Field used for automatic filter selection: |
---|
250 | |
---|
251 | Configures the database field from which the positional |
---|
252 | variability filter is determined by majority vote among the selected |
---|
253 | reference sequences. |
---|
254 | Since the filters are usually computed at domain level, this approach is usually |
---|
255 | sufficient to select an appropriate filter. |
---|
256 | For SILVA databases, the field 'tax_slv' contains appropriate data. |
---|
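
The majority vote described above could look roughly like the following Python sketch. It is not taken from SINA. It assumes (hypothetically) that the field holds a semicolon-separated taxonomy path whose first element is the domain; the function name and the example values are made up for illustration.

```python
from collections import Counter


def pick_filter_key(field_values):
    """Return the most common first taxonomy level among the references.

    field_values: one field value per selected reference sequence;
    empty or missing values are ignored.
    """
    # Assumption: the domain is the first element of a ';'-separated path.
    domains = [v.split(";")[0].strip() for v in field_values if v]
    if not domains:
        return None  # no usable field content: no filter can be chosen
    return Counter(domains).most_common(1)[0][0]


values = [
    "Bacteria;Proteobacteria;",
    "Bacteria;Firmicutes;",
    "Archaea;Euryarchaeota;",
    "",
]
print(pick_filter_key(values))  # Bacteria
```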
257 | |
---|
258 | |
---|
259 | Turn check: |
---|
260 | |
---|
261 | If selected (default), sequences will be automatically reversed |
---|
262 | and/or complemented if this will likely improve the alignment. |
---|
263 | |
---|
264 | |
---|
265 | Realign: |
---|
266 | |
---|
267 | If selected, the sequence itself is excluded from the result of |
---|
268 | the executed PT-Server family search. If deselected, the alignment |
---|
269 | of an identical sequence found by the PT-Server is copied. |
---|
270 | |
---|
271 | |
---|
272 | # @@@ add option to mark references or aligned, then document here. |
---|
273 | # (Copy and) mark sequence used as reference: |
---|
274 | # |
---|
275 | # Mark the sequences that were used as a reference during alignment. |
---|
276 | # This allows you to easily load them into the editor to review the |
---|
277 | # decisions made by the graph aligner. |
---|
278 | # If you also selected the "Load reference" option, sequences will be |
---|
279 | # copied into your current database prior to being marked. |
---|
280 | |
---|
281 | |
---|
282 | Gap insertion/extension penalties: (default is 5/2) |
---|
283 | |
---|
284 | You can change the penalties associated with opening and extending |
---|
285 | gaps. |
---|
286 | |
---|
287 | Match/mismatch scores: (default is 2/-1) |
---|
288 | |
---|
289 | Configures the scores given for a match (should be positive) and |
---|
290 | a mismatch (should be negative). |
---|
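
To get a feeling for how the default penalties and scores interact, the following Python sketch scores an already aligned pair of sequences. It is a simplification: SINA aligns against a graph built from several references, not against a single sequence, and the assumption that the first gap column costs only the opening penalty (with each further column costing the extension penalty) is made here for illustration only.

```python
def alignment_score(query, reference, match=2, mismatch=-1,
                    gap_open=5, gap_extend=2):
    """Score two equally long, gapped strings with affine gap penalties."""
    if len(query) != len(reference):
        raise ValueError("aligned strings must have equal length")
    score = 0
    gap_in = None  # row that currently has an open gap: "q", "r" or None
    for q, r in zip(query, reference):
        if q == "-" and r == "-":
            continue  # column empty in both rows: ignored
        if q == "-" or r == "-":
            row = "q" if q == "-" else "r"
            # Assumption: opening costs gap_open, each further column gap_extend.
            score -= gap_extend if gap_in == row else gap_open
            gap_in = row
        else:
            score += match if q == r else mismatch
            gap_in = None
    return score


print(alignment_score("ACGU", "ACCU"))      # 3 matches, 1 mismatch: 5
print(alignment_score("ACG--U", "ACGAAU"))  # 4 matches, one gap of length 2: 1
```

With the defaults, a single mismatch (-1) is much cheaper than opening a gap (-5), so raising the gap penalties further favours mismatches over insertions and deletions.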
291 | |
---|
292 | Family search min/min_score/max: (default 40/0.7/40) |
---|
293 | |
---|
294 | The first value tells the graph aligner how many sequences it should |
---|
295 | try to always use. The second value determines the minimal identity |
---|
296 | with the target sequence additional reference sequences should have. |
---|
297 | The third value selects the maximal number of sequences to be used |
---|
298 | as a reference. |
---|
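
One possible reading of the three values is sketched below in Python. This is an interpretation of the description above, not SINA's implementation; the function and parameter names are chosen for illustration, and the small numbers in the example replace the defaults only to keep it readable.

```python
def select_references(hits, fs_min=40, fs_min_score=0.7, fs_max=40):
    """Pick reference names from search hits.

    hits: list of (name, identity) pairs, identity in the range 0..1
    """
    ranked = sorted(hits, key=lambda hit: hit[1], reverse=True)
    selected = ranked[:fs_min]  # always try to use this many, whatever the score
    for name, identity in ranked[fs_min:]:
        if len(selected) >= fs_max or identity < fs_min_score:
            break  # upper limit reached or remaining hits too dissimilar
        selected.append((name, identity))
    return [name for name, _ in selected]


hits = [("a", 0.95), ("b", 0.60), ("c", 0.80),
        ("d", 0.75), ("e", 0.72), ("f", 0.50)]
print(select_references(hits, fs_min=2, fs_min_score=0.7, fs_max=3))
# ['a', 'c', 'd']
print(select_references(hits[:2], fs_min=2, fs_min_score=0.7, fs_max=3))
# ['a', 'b']  ('b' is used despite its low identity because of the minimum)
```

With the defaults (40/0.7/40) minimum and maximum coincide, so the identity threshold only takes effect once the maximum is raised above the minimum.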
299 | |
---|
300 | Minimal number of full length sequences: (default 1) |
---|
301 | |
---|
302 | Set the minimum number of full length (see "Size of full-length sequences" |
---|
303 | setting above) reference sequences that must be included in the selected |
---|
304 | reference set. The search will proceed regardless of other settings until |
---|
305 | this setting has been satisfied. If it cannot be satisfied by any |
---|
306 | sequence in the reference database, the query sequence will be discarded. |
---|
307 | This setting exists to ensure that the entire length of the query sequence |
---|
308 | will be covered in the presence of partial sequences contained within your |
---|
309 | reference database. |
---|
310 | |
---|
311 | Family search oligo length/mismatches: (default 10/0) |
---|
312 | |
---|
313 | The first value sets the size of k for the reference search (size of kmer). |
---|
314 | For SSU rRNA sequences, the default of 10 is a good value. |
---|
315 | For different sequence types, different values may perform better. |
---|
316 | For 5S, for example, 6 has been shown to be more effective. |
---|
317 | |
---|
318 | The second value allows k-mer matches in the reference database to contain n mismatches. |
---|
319 | This feature is only supported by the pt-server search engine and |
---|
320 | requires substantial additional compute time (in particular for n > 1). |
---|
321 | |
---|
322 | Minimal reference sequence length: (default 150) |
---|
323 | |
---|
324 | Set the minimum length reference sequences are required to have. |
---|
325 | Sequences shorter than this will not be included in the selection. |
---|
326 | |
---|
327 | Note: If you are working with particularly short reference sequences, you |
---|
328 | will need to lower this setting to allow any reference sequences to be |
---|
329 | found. |
---|
330 | |
---|
331 | Alignment bounds: (default 0/0) |
---|
332 | |
---|
333 | These values set the beginning and the end of the gene within the reference |
---|
334 | alignment. |
---|
335 | See "Number of references required to touch bounds" for more information. |
---|
336 | |
---|
337 | Number of references required to touch bounds: (default: 0) |
---|
338 | |
---|
339 | Similar to "Minimal number of full length sequences", this option requires a |
---|
340 | total of n sequences to cover each of the beginning and the end of the gene |
---|
341 | within the alignment. |
---|
342 | |
---|
343 | This option is more precise than "Minimal number of full length sequences", but |
---|
344 | requires that the column numbers for the range in which the full gene is |
---|
345 | expected be specified via "Alignment bounds" (see above). |
---|
346 | |
---|
347 | Save used references in 'used_rels': (default is off) |
---|
348 | |
---|
349 | Writes the names of the alignment reference sequences into the field used_rels. |
---|
350 | This option allows using LINK{markbyref.hlp} to highlight the reference |
---|
351 | sequences used to align a given query sequence. |
---|
352 | |
---|
353 | Store highest identity in 'align_ident_slv': (default is off) |
---|
354 | |
---|
355 | Computes the highest similarity the aligned query sequence has with any of |
---|
356 | the sequences in the alignment reference set. The value is written to the |
---|
357 | field 'align_ident_slv'. |
---|
358 | |
---|
359 | Disable fast search: (default is to use fast search) |
---|
360 | |
---|
361 | Use all k-mers occurring in the query sequence in the search. By default, |
---|
362 | only k-mers starting with an A are used for extra performance. |
---|
363 | |
---|
364 | Score search results by absolute oligo match count: (default is off) |
---|
365 | |
---|
366 | Use absolute (number of shared k-mers) match scores in the kmer search |
---|
367 | rather than relative (number of shared k-mers divided by length of reference |
---|
368 | sequence) match scores. |
---|
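
The difference between fast and full search, and between absolute and relative scores, can be sketched in Python as follows. This is a toy model, not the SINA search engine: it counts distinct shared k-mers (an assumption, the real engine may count differently), and all function names are made up for illustration. A small k is used in the example only to keep it readable.

```python
def kmers(seq, k=10, fast=True):
    """Set of k-mers in seq; in fast mode only those starting with 'A'."""
    found = {seq[i:i + k] for i in range(len(seq) - k + 1)}
    if fast:
        found = {kmer for kmer in found if kmer.startswith("A")}
    return found


def match_score(query, reference, k=10, fast=True, absolute=False):
    """Absolute: shared k-mers; relative: shared k-mers / reference length."""
    shared = len(kmers(query, k, fast) & kmers(reference, k, fast))
    if absolute:
        return shared
    return shared / len(reference)


query, reference = "ACGUACGA", "ACGUAC"
print(match_score(query, reference, k=3, fast=False, absolute=True))  # 4
print(match_score(query, reference, k=3, fast=True, absolute=True))   # 1
print(match_score(query, reference, k=3, fast=False))                 # 4/6
```

The relative score penalizes long references that share only a small part with the query, whereas the absolute score does not take the reference length into account.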
369 | |
---|
370 | Suppress warnings about missing 'start' field: (default is off) |
---|
371 | |
---|
372 | This option suppresses warnings about missing 'start' fields, so that you can |
---|
373 | use sina with databases lacking the 'start' field without getting flooded with |
---|
374 | warnings. |
---|
375 | |
---|
376 | SINA command: (default "arb_sina.sh") |
---|
377 | |
---|
378 | If arb has problems finding the sina binary for whatever reason, you may |
---|
379 | specify an explicit path here. |
---|
380 | Please note, doing so will stop a fat-tarball-installation from working! |
---|
381 | |
---|
382 | NOTES SINA automatically decides the number of threads being used. |
---|
383 | |
---|
384 | When recording macros acting on the SINA window, problems |
---|
385 | with the macro IDs used may occur ("XSINA" vs. "SINA" prefix), and |
---|
386 | ARB may complain about unknown macro IDs (e.g. |
---|
387 | something like "Unknown action 'SINA/CURR_PT_SERVER' in macro"). |
---|
388 | This is caused by the way the sina window toggles |
---|
389 | between 'normal' and 'advanced' options. |
---|
390 | To avoid this, toggle between advanced and normal at the start of your macro. |
---|
391 | If problems persist, manually correct the action prefixes in your macro. |
---|
392 | |
---|
393 | WARNINGS When using SINA 1.3 you have to make sure that the alignment selected |
---|
394 | in LINK{ad_align.hlp} is the same alignment as used in the ARB_EDIT4 instance. |
---|
395 | Starting with SINA 1.7.2 it will always use the same alignment as the editor. |
---|
396 | |
---|
397 | BUGS No bugs known |
---|