1 | #Please insert up references in the next lines (line starts with keyword UP) |
---|
2 | UP arb.hlp |
---|
3 | UP glossary.hlp |
---|
4 | |
---|
5 | #Please insert subtopic references (line starts with keyword SUB) |
---|
6 | #SUB subtopic.hlp |
---|
7 | |
---|
8 | # Hypertext links in helptext can be added like this: LINK{ref.hlp|http://add|bla@domain} |
---|
9 | |
---|
10 | #************* Title of helpfile !! and start of real helpfile ******** |
---|
11 | TITLE Graph Aligner (SINA) |
---|
12 | |
---|
13 | OCCURRENCE ARB Editor -> Edit -> Prototypical Graph Aligner |
---|
14 | |
---|
15 | DESCRIPTION SINA is an alternative to the integrated aligners. |
---|
16 | It has been developed for the SILVA project. |
---|
17 | |
---|
18 | Unlike the integrated aligners, SINA |
---|
19 | - uses aligned sequences from the reference database as reference to align the selected sequences. |
---|
20 | - employs full dynamic programming to create the alignment. |
---|
21 | - considers all selected relatives at once, instead of falling back to |
---|
22 | less similar sequences only if the current sequence is missing bases (e.g. |
---|
23 | because it is a partial sequence). |
---|
24 | |
---|
25 | SECTION SINA documentation online |
---|
26 | |
---|
27 | LINK{https://sina.readthedocs.io/_/downloads/en/stable/pdf/} |
---|
28 | |
---|
29 | Note: parts of this document were taken from the above SINA documentation. |
---|
30 | |
---|
31 | SECTION SINA version |
---|
32 | |
---|
33 | Please note this documentation applies to the modified SINA version 1.7.2 |
---|
34 | which (optionally) is delivered with arb. |
---|
35 | |
---|
36 | Modifications applied: |
---|
37 | * fix build with arb7 + gcc 7.5 |
---|
38 | * define CLI interface "ARB7.1" (allows arb to detect SINA 1.7.2) |
---|
39 | * fixed some error messages (clarity; do NOT show WHOLE alignment; explicitly show bad character) |
---|
40 | * new option '--dont-expect-start' (allows working with databases not containing 'start'-fields) |
---|
41 | |
---|
42 | ARB still supports SINA version 1.3. |
---|
43 | If there is no option to select the reference database, you are either using version 1.3 |
---|
44 | or you are trying to use a (newer) SINA version which does not yet support the "ARB7.1" CLI interface. |
---|
45 | |
---|
46 | In case you are using version 1.3, you may want to use the old, corresponding version of this help page |
---|
47 | which may be accessed via LINK{http://bugs.arb-home.de/browser/trunk/HELP_SOURCE/source/sina_main.hlp?rev=18781} |
---|
48 | |
---|
49 | SECTION OPTIONS |
---|
50 | |
---|
51 | Select the sequences to be aligned as usual ("Current Species", "Selected |
---|
52 | Species" or "Marked Species"). |
---|
53 | |
---|
54 | Select a PT-Server or SINA kmer-search: |
---|
55 | |
---|
56 | You may select a PT-Server that will be used for reference search. |
---|
57 | Make sure it is up to date and contains all sequences you want to be considered |
---|
58 | as reference. |
---|
59 | |
---|
60 | Alternatively select '-undefined-' to use the sina-internal engine for reference search. |
---|
61 | SINA will maintain an index file (.sidx) that will be stored next to the reference database file. |
---|
62 | It will automatically be updated whenever the reference database changes. |
---|
63 | |
---|
64 | Select a reference database: |
---|
65 | |
---|
66 | The sequences from the reference database will be used as references when aligning the |
---|
67 | sequences in the current database. |
---|
68 | |
---|
69 | Normally you will want to use the current database as reference database. |
---|
70 | This can be done in two ways: |
---|
71 | * select 'Last saved' to use the last saved state of the current database. |
---|
72 | * select 'Current' to use the loaded database. This is currently not possible |
---|
73 | with PT-Server. |
---|
74 | |
---|
75 | Alternatively you may specify any other database as reference |
---|
76 | database via 'Explicit as [selected]'. This could e.g. be the same |
---|
77 | database used to build the PT-Server. |
---|
78 | |
---|
79 | This allows you, for example, to use a high quality database containing |
---|
80 | only type strains as reference while working on small, specialized databases. |
---|
81 | |
---|
82 | It will also avoid polluting the set of references with the state of your |
---|
83 | current working dataset. So you don't have to fear that some badly aligned sequence |
---|
84 | from your working set will be used as a reference. |
---|
85 | |
---|
86 | The effective reference database path will be shown in the input field below. |
---|
87 | |
---|
88 | Important note: |
---|
89 | |
---|
90 | The previously integrated SINA version 1.3 always aligned versus |
---|
91 | the state of the referenced sequences in the CURRENT DATABASE. |
---|
92 | |
---|
93 | SINA version 1.7 always aligns versus the state of the referenced sequences |
---|
94 | in the REFERENCE DATABASE (which may, but does not have to be the same!). |
---|
95 | |
---|
96 | When using SINA version 1.7 with the PT-Server, please MAKE SURE |
---|
97 | you specify the same database used to calculate the PT-Server as |
---|
98 | reference database. |
---|
99 | |
---|
100 | Follow these steps: |
---|
101 | |
---|
102 | 1. save the reference database (e.g. as ref.arb) |
---|
103 | 2. start arb on ref.arb and calculate a PT-Server |
---|
104 | 3. go back to your working database, open the sina window, |
---|
105 | specify the calculated PT-Server AND specify the saved |
---|
106 | ref.arb as reference database. |
---|
107 | 4. restart at 1. whenever you need to update your references |
---|
108 | |
---|
109 | If SINA detects any inconsistencies between the PT-Server and the reference |
---|
110 | database, it will silently try to update the PT-Server database. |
---|
111 | |
---|
112 | |
---|
113 | Excluded references: |
---|
114 | |
---|
115 | Some sequences will not be used as references: |
---|
116 | * sequences with fewer than 10 gaps are considered not aligned and will |
---|
117 | not be used as references. |
---|
118 | * if "Realign" (see advanced options) is checked, a sequence will never |
---|
119 | be used as a reference for itself. |
---|
120 | |
---|
121 | Some options define additional requirements for the chosen |
---|
122 | reference sequence set (see options below, esp. advanced options). |
---|
123 | |
---|
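The two exclusion rules above can be sketched as a small filter. This is only an illustration and not SINA's actual code: the function name, its parameters, and the choice of '-' and '.' as gap characters are assumptions made for this example.

```python
# Hypothetical sketch of the "Excluded references" rules; not taken from SINA.
GAP_CHARS = "-."  # assumption: both characters count as gaps


def is_usable_reference(name, aligned_seq, query_name, realign=False, min_gaps=10):
    """Return True if the sequence may serve as a reference for the query."""
    gaps = sum(aligned_seq.count(c) for c in GAP_CHARS)
    if gaps < min_gaps:
        # fewer than 10 gaps: the sequence is considered not aligned
        return False
    if realign and name == query_name:
        # with "Realign" checked, a sequence is never its own reference
        return False
    return True
```

With these assumptions, a sequence holding only 9 gap characters is rejected, and a query is rejected as its own reference only when realign is True.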
124 | Decide what to do with possible overhang ("Overhang placement"). |
---|
125 | If your sequence extends beyond the |
---|
126 | reference sequences on either side of the alignment, those bases cannot be |
---|
127 | aligned properly. Three options of handling this situation are supported: |
---|
128 | |
---|
129 | "keep attached" |
---|
130 | |
---|
131 | just leave them dangling, directly attached to the last base that |
---|
132 | could be aligned properly |
---|
133 | |
---|
134 | "move to edge" |
---|
135 | |
---|
136 | move them out to the very beginning and end of |
---|
137 | the alignment. This allows you to easily spot sequences |
---|
138 | with overhang, and decide what to do yourself. Recommended, |
---|
139 | but only if you check your sequences after alignment! |
---|
140 | |
---|
141 | "remove" |
---|
142 | |
---|
143 | automatically remove these bases. |
---|
144 | |
---|
145 | Choose "Handling of unmappable insertions": |
---|
146 | |
---|
147 | Configures how the alignment width is preserved. |
---|
148 | |
---|
149 | "Shift surrounding bases" |
---|
150 | |
---|
151 | The alignment is executed without constraining insertion sizes. |
---|
152 | Insertions for which insufficient columns exist between the |
---|
153 | adjoining aligned bases are force fitted into the alignment |
---|
154 | using NAST. That is, the minimum number of aligned bases to the |
---|
155 | left and right of the insertion are moved to accommodate the |
---|
156 | insertion. This mode will add warnings to the log for each |
---|
157 | sequence in which aligned bases had to be moved. |
---|
158 | |
---|
159 | "Forbid during DP alignment" |
---|
160 | |
---|
161 | The alignment is executed using a scoring scheme disallowing insertions |
---|
162 | for which insufficient columns exist in the alignment. This mode |
---|
163 | causes fewer "misalignments" than the shift mode as it computes the best |
---|
164 | alignment under the constraint that no columns may be added to the |
---|
165 | alignment. However, it will not show if the computed alignment suffered |
---|
166 | from a lack of empty columns. |
---|
167 | |
---|
168 | "Delete bases" |
---|
169 | |
---|
170 | The alignment is executed without constraining insertion sizes. Insertions |
---|
171 | larger than the number of columns between the adjoining aligned bases are |
---|
172 | truncated. While this mode yields the most accurate alignment for |
---|
173 | sequences with large insertions, it should be used with care as it |
---|
174 | modifies the original sequence. |
---|
175 | |
---|
176 | Choose "Character Case": |
---|
177 | |
---|
178 | Configures which bases should be written using lower case characters. |
---|
179 | |
---|
180 | "Do not modify" |
---|
181 | |
---|
182 | All bases will be written using the case they had in the input data. |
---|
183 | |
---|
184 | "Show unaligned bases as lower case" |
---|
185 | |
---|
186 | Aligned bases will be written in upper case; unaligned bases will be |
---|
187 | written in lower case. This serves to mark sections of the query |
---|
188 | sequences that could not be aligned because they were insertions |
---|
189 | (internal or edge) with respect to any of the reference sequences. |
---|
190 | |
---|
191 | "Uppercase all" |
---|
192 | |
---|
193 | All bases will use upper case characters. |
---|
194 | |
---|
195 | Define "Family conservation weight": (default 1) |
---|
196 | |
---|
197 | Adjust the weight factor for the frequency at which a node was observed |
---|
198 | in the reference alignment. Use 0 to disable weighting. This feature |
---|
199 | prefers the more common placement for bases with inconsistent alignment |
---|
200 | in the reference database. |
---|
201 | |
---|
202 | Define "Size of full-length sequences": (default 1400) |
---|
203 | |
---|
204 | Set the minimum length a reference sequence is required to have |
---|
205 | to be considered full length. |
---|
206 | See also "Minimal number of full length sequences" in |
---|
207 | ADVANCED OPTIONS below. |
---|
208 | |
---|
209 | Select a "Protection Level" higher than that of the sequences if you want the |
---|
210 | alignment software to actually modify the bases. Choose a lower protection |
---|
211 | level to execute a "dry run", not changing anything. Note that sequences |
---|
212 | with a protection level of zero will always be changed. |
---|
213 | |
---|
214 | SECTION Verbosity |
---|
215 | |
---|
216 | All output will be printed to the console that opens when you start sina. |
---|
217 | |
---|
218 | Several options allow you to change the noisiness: |
---|
219 | |
---|
220 | When "Show changed sections" is checked, sina shows differences |
---|
221 | between the inferred alignment and the original alignment. |
---|
222 | |
---|
223 | That output will be colorized if "color bases" is checked. |
---|
224 | |
---|
225 | Check "Show statistic" to |
---|
226 | show the distance to the original alignment. |
---|
227 | |
---|
228 | SECTION TRICKS |
---|
229 | |
---|
230 | If you want to see how the alignment that would be produced by the graph |
---|
231 | aligner differs from your current alignment, and why the program would |
---|
232 | act that way, you can set the protection level to "0" and the Logging level |
---|
233 | to "debug". The output on the console will now include all differing sections |
---|
234 | of the alignment and the matching parts of the reference sequences. |
---|
235 | |
---|
236 | SECTION ADVANCED OPTIONS |
---|
237 | |
---|
238 | Select the "Show advanced options" Button at the top to gain access to |
---|
239 | the you-may-now-shoot-yourself-in-the-foot-severely dialog window. |
---|
240 | |
---|
241 | Don't be surprised if the graph aligner crashes after you enter silly |
---|
242 | values here. No sanity check of your options is done. |
---|
243 | |
---|
244 | Pos.Var: |
---|
245 | |
---|
246 | Select a positional variability filter. If possible, use the filter |
---|
247 | appropriate for the type of sequences you want aligned. Positional |
---|
248 | variability statistics will be considered when placing the individual bases. |
---|
249 | |
---|
250 | Field used for automatic filter selection: |
---|
251 | |
---|
252 | Selects a database field whose value is used to choose the positional |
---|
253 | variability filter by majority vote among the selected |
---|
254 | reference sequences. |
---|
255 | Since the filters are usually computed at domain level, this approach is usually |
---|
256 | sufficient to select an appropriate filter. |
---|
257 | For SILVA databases, the field 'tax_slv' contains appropriate data. |
---|
258 | |
---|
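The majority vote described above might look roughly like the following sketch. It is not SINA's implementation: the function name is hypothetical, and reducing each field value to its first semicolon-separated component (the domain in a typical 'tax_slv' path) is an assumption made for this example.

```python
from collections import Counter


def pick_filter_key(field_values):
    """Return the most frequent leading component of the given field values.

    field_values: one string per selected reference sequence, e.g. the
    content of 'tax_slv'. Empty values are ignored. Returns None if no
    usable value is present.
    """
    # assumption: the domain is the first ';'-separated component
    domains = [v.split(";")[0] for v in field_values if v]
    if not domains:
        return None
    return Counter(domains).most_common(1)[0][0]
```

For a reference set in which two values start with "Bacteria" and one starts with "Archaea", this sketch returns "Bacteria".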
259 | |
---|
260 | Turn check: |
---|
261 | |
---|
262 | If selected (default), sequences will be automatically reversed |
---|
263 | and/or complemented if this will likely improve the alignment. |
---|
264 | |
---|
265 | |
---|
266 | Realign: |
---|
267 | |
---|
268 | If selected, the sequence itself is excluded from the result of |
---|
269 | the executed PT-Server family search. If deselected, the alignment |
---|
270 | of an identical sequence found by the PT-Server is copied. |
---|
271 | |
---|
272 | |
---|
273 | # @@@ add option to mark references or aligned, then document here. |
---|
274 | # (Copy and) mark sequence used as reference: |
---|
275 | # |
---|
276 | # Mark the sequences that were used as a reference during alignment. |
---|
277 | # This allows you to easily load them into the editor to review the |
---|
278 | # decisions made by the graph aligner. |
---|
279 | # If you also selected the "Load reference" option, sequences will be |
---|
280 | # copied into your current database prior to being marked. |
---|
281 | |
---|
282 | |
---|
283 | Gap insertion/extension penalties: (default is 5/2) |
---|
284 | |
---|
285 | You can change the penalties associated with opening and extending |
---|
286 | gaps. |
---|
287 | |
---|
288 | Match/mismatch scores: (default is 2/-1) |
---|
289 | |
---|
290 | Configures the scores given for a match (should be positive) and |
---|
291 | a mismatch (should be negative). |
---|
292 | |
---|
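To show how the four values interact, here is a minimal sketch that scores an already aligned pair of sequences. It is not SINA's scoring code: the function name is hypothetical, and it assumes that the first column of a gap costs the insertion penalty and every further column of the same gap costs the extension penalty.

```python
def score_alignment(a, b, match=2, mismatch=-1, gap_open=5, gap_extend=2):
    """Score two aligned strings of equal length; '-' marks a gap."""
    if len(a) != len(b):
        raise ValueError("aligned sequences must have equal length")
    score = 0
    in_gap = False
    for x, y in zip(a, b):
        if x == "-" or y == "-":
            # assumption: opening costs gap_open, each further column gap_extend
            score -= gap_extend if in_gap else gap_open
            in_gap = True
        else:
            score += match if x == y else mismatch
            in_gap = False
    return score
```

Under these assumptions and the default values, a two-column gap costs 5 + 2 = 7, which outweighs the 4 points gained from two matches. The sketch does not distinguish which of the two sequences holds the gap.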
293 | Family search min/min_score/max: (default 40/0.7/40) |
---|
294 | |
---|
295 | The first value tells the graph aligner how many sequences it should |
---|
296 | always try to use. The second value sets the minimal identity |
---|
297 | with the target sequence that additional reference sequences must have. |
---|
298 | The third value selects the maximal number of sequences to be used |
---|
299 | as a reference. |
---|
300 | |
---|
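One possible reading of how the three values combine is sketched below. This is an illustration under stated assumptions, not SINA's selection code: the function and parameter names are hypothetical, and the hits are assumed to arrive as (name, identity) pairs.

```python
def select_references(hits, fs_min=40, fs_min_score=0.7, fs_max=40):
    """Pick reference names from (name, identity) pairs.

    The best fs_min hits are always taken. Further hits are added only
    while their identity reaches fs_min_score. The result never holds
    more than fs_max names.
    """
    ranked = sorted(hits, key=lambda h: h[1], reverse=True)
    selected = []
    for name, score in ranked:
        if len(selected) >= fs_max:
            break
        if len(selected) < fs_min or score >= fs_min_score:
            selected.append(name)
        else:
            # hits are sorted, so all remaining ones score lower as well
            break
    return selected
```

With the defaults (40/0.7/40) the first and third value coincide, so in this reading the identity threshold only takes effect once the first value is set lower than the third.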
301 | Minimal number of full length sequences: (default 1) |
---|
302 | |
---|
303 | Set the minimum number of full length (see "Size of full-length sequences" |
---|
304 | setting above) reference sequences that must be included in the selected |
---|
305 | reference set. The search will proceed regardless of other settings until |
---|
306 | this setting has been satisfied. If it cannot be satisfied by any |
---|
307 | sequence in the reference database, the query sequence will be discarded. |
---|
308 | This setting exists to ensure that the entire length of the query sequence |
---|
309 | will be covered in the presence of partial sequences contained within your |
---|
310 | reference database. |
---|
311 | |
---|
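The full-length requirement can be pictured as an extra pass over the search results. The sketch below is an assumption-laden illustration, not SINA's code: the names are hypothetical, and it assumes that only full-length hits are appended while the search continues.

```python
def extend_until_full_length(ranked_hits, selected, full_len=1400, min_full=1):
    """Add full-length references until min_full of them are selected.

    ranked_hits: (name, length) pairs in search order, covering every
    name in selected. Returns the extended list of names, or None if the
    requirement cannot be met (the query would then be discarded).
    """
    lengths = dict(ranked_hits)
    chosen = list(selected)
    n_full = sum(1 for name in chosen if lengths[name] >= full_len)
    for name, length in ranked_hits:
        if n_full >= min_full:
            break
        if name not in chosen and length >= full_len:
            chosen.append(name)
            n_full += 1
    return chosen if n_full >= min_full else None
```

A return value of None corresponds to the case above in which no sequence in the reference database can satisfy the setting.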
312 | Family search oligo length/mismatches: (default 10/0) |
---|
313 | |
---|
314 | The first value sets the size of k for the reference search (size of kmer). |
---|
315 | For SSU rRNA sequences, the default of 10 is a good value. |
---|
316 | For different sequence types, different values may perform better. |
---|
317 | For 5S, for example, 6 has been shown to be more effective. |
---|
318 | |
---|
319 | The second value allows k-mer matches in the reference database to contain n mismatches. |
---|
320 | This feature is only supported by the pt-server search engine and |
---|
321 | requires substantial additional compute time (in particular for n > 1). |
---|
322 | |
---|
323 | Minimal reference sequence length: (default 150) |
---|
324 | |
---|
325 | Set the minimum length reference sequences are required to have. |
---|
326 | Sequences shorter than this will not be included in the selection. |
---|
327 | |
---|
328 | Note: If you are working with particularly short reference sequences, you |
---|
329 | will need to lower this setting to allow any reference sequences to be |
---|
330 | found. |
---|
331 | |
---|
332 | Alignment bounds: (default 0/0) |
---|
333 | |
---|
334 | These values set the beginning and the end of the gene within the reference |
---|
335 | alignment. |
---|
336 | See "Number of references required to touch bounds" for more information. |
---|
337 | |
---|
338 | Number of references required to touch bounds: (default: 0) |
---|
339 | |
---|
340 | Similar to "Minimal number of full length sequences", this option requires |
---|
341 | n sequences to cover the beginning and n sequences to cover the end of the gene |
---|
342 | within the alignment. |
---|
343 | |
---|
344 | This option is more precise than "Minimal number of full length sequences", but |
---|
345 | requires that the column numbers for the range in which the full gene is |
---|
346 | expected be specified via "Alignment bounds" (see above). |
---|
347 | |
---|
348 | Save used references in 'used_rels': (default is off) |
---|
349 | |
---|
350 | Writes the names of the alignment reference sequences into the field used_rels. |
---|
351 | This option allows using LINK{markbyref.hlp} to highlight the reference |
---|
352 | sequences used to align a given query sequence. |
---|
353 | |
---|
354 | Store highest identity in 'align_ident_slv': (default is off) |
---|
355 | |
---|
356 | Computes the highest similarity the aligned query sequence has with any of |
---|
357 | the sequences in the alignment reference set. The value is written to the |
---|
358 | field 'align_ident_slv'. |
---|
359 | |
---|
360 | Disable fast search: (default is to use fast search) |
---|
361 | |
---|
362 | Use all k-mers occurring in the query sequence in the search. By default, |
---|
363 | only k-mers starting with an A are used for extra performance. |
---|
364 | |
---|
365 | Score search results by absolute oligo match count: (default is off) |
---|
366 | |
---|
367 | Use absolute (number of shared k-mers) match scores in the kmer search |
---|
368 | rather than relative (number of shared k-mers divided by length of reference |
---|
369 | sequence) match scores. |
---|
370 | |
---|
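The difference between the two scoring modes, and the effect of the fast search, can be sketched as follows. This is an illustration only and not SINA's search engine: the function names are hypothetical, and counting each distinct shared k-mer once is an assumption made for this example.

```python
def kmers(seq, k=10, fast=False):
    """Return the set of k-mers in seq; with fast=True keep only those starting with 'A'."""
    all_kmers = {seq[i:i + k] for i in range(len(seq) - k + 1)}
    if fast:
        return {km for km in all_kmers if km.startswith("A")}
    return all_kmers


def match_score(query, reference, k=10, fast=False, absolute=False):
    """Score a reference by the k-mers it shares with the query."""
    # assumption: each distinct shared k-mer is counted once
    shared = len(kmers(query, k, fast) & kmers(reference, k))
    if absolute:
        return shared
    # relative score: shared k-mers divided by the reference length
    return shared / len(reference)
```

In this sketch the relative score favours short references that share the same number of k-mers with the query, while the absolute score ignores the reference length.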
371 | Suppress warnings about missing 'start' field: (default is off) |
---|
372 | |
---|
373 | This option suppresses warnings about missing 'start' fields, so that |
---|
374 | sina can be used with databases lacking the 'start' field without being flooded with |
---|
375 | warnings. |
---|
376 | |
---|
377 | SINA command: (default "arb_sina.sh") |
---|
378 | |
---|
379 | If arb has problems finding the sina binary for whatever reason, you may |
---|
380 | specify an explicit path here. |
---|
381 | Please note, doing so will stop a fat-tarball-installation from working! |
---|
382 | |
---|
383 | NOTES SINA automatically decides the number of threads being used. |
---|
384 | |
---|
385 | When recording macros that act on the SINA window, the recorded |
---|
386 | macro IDs may carry the wrong prefix ("XSINA" vs. "SINA"), and |
---|
387 | ARB may complain about unknown macro IDs (e.g. |
---|
388 | "Unknown action 'SINA/CURR_PT_SERVER' in macro"). |
---|
389 | This is caused by the way the sina window toggles |
---|
390 | between 'normal' and 'advanced' options. |
---|
391 | To avoid this, toggle between advanced and normal at the start of your macro. |
---|
392 | If problems persist, manually correct the action prefixes in your macro. |
---|
393 | |
---|
394 | WARNINGS When using SINA 1.3 you have to make sure that the alignment selected |
---|
395 | in LINK{ad_align.hlp} is the same alignment as used in the ARB_EDIT4 instance. |
---|
396 | Starting with SINA 1.7.2 it will always use the same alignment as the editor. |
---|
397 | |
---|
398 | BUGS No bugs known |
---|