| 1 | # main topics: |
|---|
| 2 | UP arb.hlp |
|---|
| 3 | UP glossary.hlp |
|---|
| 4 | |
|---|
| 5 | # sub topics: |
|---|
| 6 | #SUB subtopic.hlp |
|---|
| 7 | |
|---|
| 8 | # format described in ../help.readme |
|---|
| 9 | |
|---|
| 10 | TITLE Graph Aligner (SINA) |
|---|
| 11 | |
|---|
| 12 | OCCURRENCE ARB Editor -> Edit -> Prototypical Graph Aligner |
|---|
| 13 | |
|---|
| 14 | DESCRIPTION SINA is an alternative to the integrated aligners. |
|---|
| 15 | It has been developed for the SILVA project. |
|---|
| 16 | |
|---|
| 17 | Unlike the integrated aligners, SINA |
|---|
| 18 | - uses aligned sequences from the reference database as reference to align the selected sequences. |
|---|
| 19 | - employs full dynamic programming to create the alignment. |
|---|
| 20 | - considers all selected relatives at once, instead of falling back to |
|---|
| 21 | less similar sequences only if the current sequence is missing bases (e.g. |
|---|
| 22 | because it is a partial sequence). |
|---|
| 23 | |
|---|
| 24 | SECTION SINA documentation online |
|---|
| 25 | |
|---|
| 26 | LINK{https://sina.readthedocs.io/_/downloads/en/stable/pdf/} |
|---|
| 27 | |
|---|
| 28 | Note: parts of this document were taken from the above SINA documentation. |
|---|
| 29 | |
|---|
| 30 | SECTION SINA version |
|---|
| 31 | |
|---|
| 32 | Please note this documentation applies to the modified SINA version 1.7.2 |
|---|
| 33 | which (optionally) is delivered with arb. |
|---|
| 34 | |
|---|
| 35 | Modifications applied: |
|---|
| 36 | * fix build with arb7 + gcc 7.5 |
|---|
| 37 | * define CLI interface "ARB7.1" (allows arb to detect SINA 1.7.2) |
|---|
| 38 | * fixed some error messages (clarity; do NOT show WHOLE alignment; explicitly show bad character) |
|---|
| 39 | * new option '--dont-expect-start' (allows working with databases not containing 'start'-fields) |
|---|
| 40 | |
|---|
| 41 | ARB still supports SINA version 1.3. |
|---|
| 42 | If there is no option to select the reference database, you are either using version 1.3 |
|---|
| 43 | or you are trying to use a (newer) SINA version which does not yet support the "ARB7.1"-CLI-interface. |
|---|
| 44 | |
|---|
| 45 | In case you are using version 1.3, you may want to use the corresponding older version of this help page, |
|---|
| 46 | which may be accessed via LINK{http://bugs.arb-home.de/browser/trunk/HELP_SOURCE/source/sina_main.hlp?rev=18781} |
|---|
| 47 | |
|---|
| 48 | SECTION OPTIONS |
|---|
| 49 | |
|---|
| 50 | Select the sequences to be aligned as usual ("Current Species", "Selected |
|---|
| 51 | Species" or "Marked Species"). |
|---|
| 52 | |
|---|
| 53 | Select a PT-Server or SINA kmer-search: |
|---|
| 54 | |
|---|
| 55 | You may select a PT-Server that will be used for reference search. |
|---|
| 56 | Make sure it is up to date and contains all sequences you want to be considered |
|---|
| 57 | as reference. |
|---|
| 58 | |
|---|
| 59 | Alternatively select '-undefined-' to use the sina-internal engine for reference search. |
|---|
| 60 | SINA will maintain an index file (.sidx) that will be stored next to the reference database file. |
|---|
| 61 | It will be updated automatically whenever the reference database changes. |
|---|
| 62 | |
|---|
| 63 | Select a reference database: |
|---|
| 64 | |
|---|
| 65 | The sequences from the reference database will be used as references when aligning the |
|---|
| 66 | sequences in the current database. |
|---|
| 67 | |
|---|
| 68 | Normally you will want to use the current database as reference database. |
|---|
| 69 | This can be done in two ways: |
|---|
| 70 | * select 'Last saved' to use the last saved state of the current database. |
|---|
| 71 | * select 'Current' to use the loaded database. This is currently not possible |
|---|
| 72 | with PT-Server. |
|---|
| 73 | |
|---|
| 74 | Alternatively you may specify any other database as reference |
|---|
| 75 | database via 'Explicit as [selected]'. This could e.g. be the same |
|---|
| 76 | database used to build the PT-Server. |
|---|
| 77 | |
|---|
| 78 | This allows you, for example, to use a high-quality database containing |
|---|
| 79 | only type strains as reference while working on small, specialized databases. |
|---|
| 80 | |
|---|
| 81 | It will also avoid polluting the set of references with the state of your |
|---|
| 82 | current working dataset. So you don't have to fear that some badly aligned sequence |
|---|
| 83 | from your working set will be used as a reference. |
|---|
| 84 | |
|---|
| 85 | The effective reference database path will be shown in the input field below. |
|---|
| 86 | |
|---|
| 87 | Important note: |
|---|
| 88 | |
|---|
| 89 | The previously integrated SINA version 1.3 always aligned versus |
|---|
| 90 | the state of the referenced sequences in the CURRENT DATABASE. |
|---|
| 91 | |
|---|
| 92 | SINA version 1.7 always aligns versus the state of the referenced sequences |
|---|
| 93 | in the REFERENCE DATABASE (which may, but does not have to be the same!). |
|---|
| 94 | |
|---|
| 95 | When using SINA version 1.7 with the PT-Server, please MAKE SURE |
|---|
| 96 | you specify the same database used to calculate the PT-Server as |
|---|
| 97 | reference database. |
|---|
| 98 | |
|---|
| 99 | Follow these steps: |
|---|
| 100 | |
|---|
| 101 | 1. save the reference database (e.g. as ref.arb) |
|---|
| 102 | 2. start arb on ref.arb and calculate a PT-Server |
|---|
| 103 | 3. go back to your working database, open the sina window, |
|---|
| 104 | specify the calculated PT-Server AND specify the saved |
|---|
| 105 | ref.arb as reference database. |
|---|
| 106 | 4. restart at 1. whenever you need to update your references |
|---|
| 107 | |
|---|
| 108 | If SINA detects any inconsistencies between the PT-Server and the reference |
|---|
| 109 | database, it will silently try to update the PT-Server database. |
|---|
| 110 | |
|---|
| 111 | |
|---|
| 112 | Excluded references: |
|---|
| 113 | |
|---|
| 114 | Some sequences will not be used as references: |
|---|
| 115 | * sequences with fewer than 10 gaps are considered not aligned and will |
|---|
| 116 | not be used as references. |
|---|
| 117 | * if "Realign" (see advanced options) is checked, a sequence will never |
|---|
| 118 | be used as a reference for itself. |
|---|
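The two exclusion rules above can be sketched as a small filter. This is only an illustration, not SINA's actual code: the function name, its parameters, and the choice of '-' and '.' as gap characters are assumptions made for the example.

```python
def usable_as_reference(ref_name, aligned_seq, query_name, realign, min_gaps=10):
    """Return True if a sequence passes the two exclusion rules (hypothetical helper).

    Assumption: gaps are written as '-' or '.' in the aligned sequence data.
    """
    # Rule 1: fewer than `min_gaps` gap characters means "not aligned".
    gap_count = aligned_seq.count("-") + aligned_seq.count(".")
    if gap_count < min_gaps:
        return False
    # Rule 2: with "Realign" checked, a sequence is never its own reference.
    if realign and ref_name == query_name:
        return False
    return True
```

For instance, a sequence containing only 9 gap characters would be rejected by this sketch regardless of the "Realign" setting.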
| 119 | |
|---|
| 120 | Some options define additional requirements for the chosen |
|---|
| 121 | reference sequence set (see the options below, especially the advanced options). |
|---|
| 122 | |
|---|
| 123 | Decide what to do with possible overhang ("Overhang placement"). |
|---|
| 124 | If your sequence extends beyond the |
|---|
| 125 | reference sequences on either side of the alignment, those bases cannot be |
|---|
| 126 | aligned properly. Three options of handling this situation are supported: |
|---|
| 127 | |
|---|
| 128 | "keep attached" |
|---|
| 129 | |
|---|
| 130 | just leave them dangling, directly attached to the last base that |
|---|
| 131 | could be aligned properly |
|---|
| 132 | |
|---|
| 133 | "move to edge" |
|---|
| 134 | |
|---|
| 135 | move them out to the very beginning and end of |
|---|
| 136 | the alignment. This allows you to easily spot sequences |
|---|
| 137 | with overhang, and decide what to do yourself. Recommended, |
|---|
| 138 | but only if you check your sequences after alignment! |
|---|
| 139 | |
|---|
| 140 | "remove" |
|---|
| 141 | |
|---|
| 142 | automatically remove these bases. |
|---|
| 143 | |
|---|
| 144 | Choose "Handling of unmappable insertions": |
|---|
| 145 | |
|---|
| 146 | Configures how the alignment width is preserved. |
|---|
| 147 | |
|---|
| 148 | "Shift surrounding bases" |
|---|
| 149 | |
|---|
| 150 | The alignment is executed without constraining insertion sizes. |
|---|
| 151 | Insertions for which insufficient columns exist between the |
|---|
| 152 | adjoining aligned bases are force fitted into the alignment |
|---|
| 153 | using NAST. That is, the minimum number of aligned bases to the |
|---|
| 154 | left and right of the insertion are moved to accommodate the |
|---|
| 155 | insertion. This mode will add warnings to the log for each |
|---|
| 156 | sequence in which aligned bases had to be moved. |
|---|
| 157 | |
|---|
| 158 | "Forbid during DP alignment" |
|---|
| 159 | |
|---|
| 160 | The alignment is executed using a scoring scheme disallowing insertions |
|---|
| 161 | for which insufficient columns exist in the alignment. This mode |
|---|
| 162 | causes fewer "misalignments" than the shift mode as it computes the best |
|---|
| 163 | alignment under the constraint that no columns may be added to the |
|---|
| 164 | alignment. However, it will not show if the computed alignment suffered |
|---|
| 165 | from a lack of empty columns. |
|---|
| 166 | |
|---|
| 167 | "Delete bases" |
|---|
| 168 | |
|---|
| 169 | The alignment is executed without constraining insertion sizes. Insertions |
|---|
| 170 | larger than the number of columns between the adjoining aligned bases are |
|---|
| 171 | truncated. While this mode yields the most accurate alignment for |
|---|
| 172 | sequences with large insertions, it should be used with care as it |
|---|
| 173 | modifies the original sequence. |
|---|
| 174 | |
|---|
| 175 | Choose "Character Case": |
|---|
| 176 | |
|---|
| 177 | Configures which bases should be written using lower case characters. |
|---|
| 178 | |
|---|
| 179 | "Do not modify" |
|---|
| 180 | |
|---|
| 181 | All bases will be written using the case they had in the input data. |
|---|
| 182 | |
|---|
| 183 | "Show unaligned bases as lower case" |
|---|
| 184 | |
|---|
| 185 | Aligned bases will be written in upper case; unaligned bases will be |
|---|
| 186 | written in lower case. This serves to mark sections of the query |
|---|
| 187 | sequences that could not be aligned because they were insertions |
|---|
| 188 | (internal or edge) with respect to any of the reference sequences. |
|---|
| 189 | |
|---|
| 190 | "Uppercase all" |
|---|
| 191 | |
|---|
| 192 | All bases will use upper case characters. |
|---|
| 193 | |
|---|
| 194 | Define "Family conservation weight": (default 1) |
|---|
| 195 | |
|---|
| 196 | Adjust the weight factor for the frequency at which a node was observed |
|---|
| 197 | in the reference alignment. Use 0 to disable weighting. This feature |
|---|
| 198 | prefers the more common placement for bases with inconsistent alignment |
|---|
| 199 | in the reference database. |
|---|
| 200 | |
|---|
| 201 | Define "Size of full-length sequences": (default 1400) |
|---|
| 202 | |
|---|
| 203 | Set the minimum length a reference sequence is required to have |
|---|
| 204 | to be considered full length. |
|---|
| 205 | See also "Minimal number of full length sequences" in |
|---|
| 206 | ADVANCED OPTIONS below. |
|---|
| 207 | |
|---|
| 208 | Select a "Protection Level" higher than that of the sequences if you want the |
|---|
| 209 | alignment software to actually modify the bases. Choose a lower protection |
|---|
| 210 | level to execute a "dry run", not changing anything. Note that sequences |
|---|
| 211 | with a protection level of zero will always be changed. |
|---|
| 212 | |
|---|
| 213 | SECTION Verbosity |
|---|
| 214 | |
|---|
| 215 | All output will be printed to the console that opens when you start sina. |
|---|
| 216 | |
|---|
| 217 | Several options allow you to change the noisiness: |
|---|
| 218 | |
|---|
| 219 | When "Show changed sections" is checked, sina shows differences |
|---|
| 220 | between the inferred alignment and the original alignment. |
|---|
| 221 | |
|---|
| 222 | That output will be colorized if "color bases" is checked. |
|---|
| 223 | |
|---|
| 224 | Check "Show statistic" to |
|---|
| 225 | show the distance to the original alignment. |
|---|
| 226 | |
|---|
| 227 | SECTION TRICKS |
|---|
| 228 | |
|---|
| 229 | If you want to see how the alignment that would be produced by the graph |
|---|
| 230 | aligner differs from your current alignment, and why the program would |
|---|
| 231 | act that way, you can set the protection level to "0" and the Logging level |
|---|
| 232 | to "debug". The output on the console will now include all differing sections |
|---|
| 233 | of the alignment and the matching parts of the reference sequences. |
|---|
| 234 | |
|---|
| 235 | SECTION ADVANCED OPTIONS |
|---|
| 236 | |
|---|
| 237 | Select the "Show advanced options" Button at the top to gain access to |
|---|
| 238 | the you-may-now-shoot-yourself-in-the-foot-severely dialog window. |
|---|
| 239 | |
|---|
| 240 | Don't be surprised if the graph aligner crashes after you have entered silly |
|---|
| 241 | values here. No sanity check of your options is done. |
|---|
| 242 | |
|---|
| 243 | Pos.Var: |
|---|
| 244 | |
|---|
| 245 | Select a positional variability filter. If possible, use the filter |
|---|
| 246 | appropriate for the type of sequences you want aligned. Positional |
|---|
| 247 | variability statistics will be considered when placing the individual bases. |
|---|
| 248 | |
|---|
| 249 | Field used for automatic filter selection: |
|---|
| 250 | |
|---|
| 251 | Configures the database field from which the positional |
|---|
| 252 | variability filter is determined by majority vote among the selected |
|---|
| 253 | reference sequences. |
|---|
| 254 | Since the filters are usually computed at domain level, this approach is usually |
|---|
| 255 | sufficient to select an appropriate filter. |
|---|
| 256 | For SILVA databases, the field 'tax_slv' contains appropriate data. |
|---|
| 257 | |
|---|
| 258 | |
|---|
| 259 | Turn check: |
|---|
| 260 | |
|---|
| 261 | If selected (default), sequences will be automatically reversed |
|---|
| 262 | and/or complemented if this will likely improve the alignment. |
|---|
| 263 | |
|---|
| 264 | |
|---|
| 265 | Realign: |
|---|
| 266 | |
|---|
| 267 | If selected, the sequence itself is excluded from the result of |
|---|
| 268 | the executed PT-Server family search. If deselected, the alignment |
|---|
| 269 | of an identical sequence found by the PT-Server is copied. |
|---|
| 270 | |
|---|
| 271 | |
|---|
| 272 | # @@@ add option to mark references or aligned, then document here. |
|---|
| 273 | # (Copy and) mark sequence used as reference: |
|---|
| 274 | # |
|---|
| 275 | # Mark the sequences that were used as a reference during alignment. |
|---|
| 276 | # This allows you to easily load them into the editor to review the |
|---|
| 277 | # decisions made by the graph aligner. |
|---|
| 278 | # If you also selected the "Load reference" option, sequences will be |
|---|
| 279 | # copied into your current database prior to being marked. |
|---|
| 280 | |
|---|
| 281 | |
|---|
| 282 | Gap insertion/extension penalties: (default is 5/2) |
|---|
| 283 | |
|---|
| 284 | You can change the penalties associated with opening and extending |
|---|
| 285 | gaps. |
|---|
| 286 | |
|---|
| 287 | Match/mismatch scores: (default is 2/-1) |
|---|
| 288 | |
|---|
| 289 | Configures the scores given for a match (should be positive) and |
|---|
| 290 | a mismatch (should be negative). |
|---|
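To illustrate how the four scoring values interact, here is a minimal sketch that scores an already gapped pair of sequences. It assumes a simple affine gap model in which the first gap column costs the insertion penalty and each following gap column costs the extension penalty. How SINA combines these values internally is not described here, so the function and its behavior are assumptions for illustration only.

```python
def score_alignment(query, ref, match=2.0, mismatch=-1.0, gap_open=5.0, gap_extend=2.0):
    """Score two equally long gapped strings (hypothetical helper, assumed affine gaps)."""
    if len(query) != len(ref):
        raise ValueError("both rows must have the same number of columns")
    score = 0.0
    in_gap = False  # simplification: does not track which row carries the gap
    for q, r in zip(query, ref):
        if q == "-" and r == "-":
            continue  # column is empty in both rows, nothing to score
        if q == "-" or r == "-":
            # first gap column pays the insertion penalty, later ones the extension penalty
            score -= gap_extend if in_gap else gap_open
            in_gap = True
        else:
            score += match if q == r else mismatch
            in_gap = False
    return score
```

With the defaults, four matching bases score 8, while a two-column gap costs 5 + 2 = 7, so raising the penalties makes the aligner less willing to open gaps.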
| 291 | |
|---|
| 292 | Family search min/min_score/max: (default 40/0.7/40) |
|---|
| 293 | |
|---|
| 294 | The first value tells the graph aligner how many sequences it should |
|---|
| 295 | try to always use. The second value determines the minimal identity |
|---|
| 296 | with the target sequence that additional reference sequences should have. |
|---|
| 297 | The third value selects the maximal number of sequences to be used |
|---|
| 298 | as a reference. |
|---|
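One possible reading of the three family search values is sketched below. The function name, the (name, score) input format, and the exact stopping rule are assumptions for illustration. SINA's real selection also honours the other requirements described on this page (e.g. full-length sequences).

```python
def select_family(hits, fs_min=40, fs_min_score=0.7, fs_max=40):
    """Pick reference names from (name, score) search hits (hypothetical helper)."""
    ranked = sorted(hits, key=lambda hit: hit[1], reverse=True)
    chosen = []
    for name, score in ranked:
        if len(chosen) >= fs_max:
            break  # never use more than the maximum
        if len(chosen) < fs_min or score >= fs_min_score:
            # below the minimum count every hit is taken;
            # beyond it only hits that are similar enough
            chosen.append(name)
        else:
            break  # remaining hits score even lower
    return chosen
```

With the defaults (40/0.7/40) the minimum equals the maximum, so in this sketch the identity threshold only matters once the first value is lowered below the third.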
| 299 | |
|---|
| 300 | Minimal number of full length sequences: (default 1) |
|---|
| 301 | |
|---|
| 302 | Set the minimum number of full length (see "Size of full-length sequences" |
|---|
| 303 | setting above) reference sequences that must be included in the selected |
|---|
| 304 | reference set. The search will proceed regardless of other settings until |
|---|
| 305 | this setting has been satisfied. If it cannot be satisfied by any |
|---|
| 306 | sequence in the reference database, the query sequence will be discarded. |
|---|
| 307 | This setting exists to ensure that the entire length of the query sequence |
|---|
| 308 | will be covered in the presence of partial sequences contained within your |
|---|
| 309 | reference database. |
|---|
| 310 | |
|---|
| 311 | Family search oligo length/mismatches: (default 10/0) |
|---|
| 312 | |
|---|
| 313 | The first value sets the size of k for the reference search (size of kmer). |
|---|
| 314 | For SSU rRNA sequences, the default of 10 is a good value. |
|---|
| 315 | For different sequence types, different values may perform better. |
|---|
| 316 | For 5S, for example, 6 has been shown to be more effective. |
|---|
| 317 | |
|---|
| 318 | The second value allows k-mer matches in the reference database to contain n mismatches. |
|---|
| 319 | This feature is only supported by the pt-server search engine and |
|---|
| 320 | requires substantial additional compute time (in particular for n > 1). |
|---|
| 321 | |
|---|
| 322 | Minimal reference sequence length: (default 150) |
|---|
| 323 | |
|---|
| 324 | Set the minimum length reference sequences are required to have. |
|---|
| 325 | Sequences shorter than this will not be included in the selection. |
|---|
| 326 | |
|---|
| 327 | Note: If you are working with particularly short reference sequences, you |
|---|
| 328 | will need to lower this setting to allow any reference sequences to be |
|---|
| 329 | found. |
|---|
| 330 | |
|---|
| 331 | Alignment bounds: (default 0/0) |
|---|
| 332 | |
|---|
| 333 | These values set the beginning and the end of the gene within the reference |
|---|
| 334 | alignment. |
|---|
| 335 | See "Number of references required to touch bounds" for more information. |
|---|
| 336 | |
|---|
| 337 | Number of references required to touch bounds: (default: 0) |
|---|
| 338 | |
|---|
| 339 | Similar to "Minimal number of full length sequences", this option requires |
|---|
| 340 | the beginning and the end of the gene within the alignment to each be |
|---|
| 341 | covered by a total of n sequences. |
|---|
| 342 | |
|---|
| 343 | This option is more precise than "Minimal number of full length sequences", but |
|---|
| 344 | requires that the column numbers for the range in which the full gene is |
|---|
| 345 | expected be specified via "Alignment bounds" (see above). |
|---|
| 346 | |
|---|
| 347 | Save used references in 'used_rels': (default is off) |
|---|
| 348 | |
|---|
| 349 | Writes the names of the alignment reference sequences into the field used_rels. |
|---|
| 350 | This option allows using LINK{markbyref.hlp} to highlight the reference |
|---|
| 351 | sequences used to align a given query sequence. |
|---|
| 352 | |
|---|
| 353 | Store highest identity in 'align_ident_slv': (default is off) |
|---|
| 354 | |
|---|
| 355 | Computes the highest similarity the aligned query sequence has with any of |
|---|
| 356 | the sequences in the alignment reference set. The value is written to the |
|---|
| 357 | field 'align_ident_slv'. |
|---|
| 358 | |
|---|
| 359 | Disable fast search: (default is to use fast search) |
|---|
| 360 | |
|---|
| 361 | Use all k-mers occurring in the query sequence in the search. By default, |
|---|
| 362 | only k-mers starting with an A are used for extra performance. |
|---|
| 363 | |
|---|
| 364 | Score search results by absolute oligo match count: (default is off) |
|---|
| 365 | |
|---|
| 366 | Use absolute (number of shared k-mers) match scores in the kmer search |
|---|
| 367 | rather than relative (number of shared k-mers divided by length of reference |
|---|
| 368 | sequence) match scores. |
|---|
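The difference between the two score types can be shown with a small sketch. It follows the definitions given above (shared k-mers versus shared k-mers divided by reference length). The helper names are made up for this example, and counting distinct k-mers via sets is an assumption, not a statement about SINA's implementation.

```python
def kmers(seq, k=10):
    """Return the set of distinct k-mers occurring in seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def match_scores(query, reference, k=10):
    """Return (absolute, relative) k-mer match scores (hypothetical helper)."""
    shared = len(kmers(query, k) & kmers(reference, k))
    absolute = shared
    # relative score: shared k-mers divided by the reference sequence length
    relative = shared / len(reference) if reference else 0.0
    return absolute, relative
```

Under these definitions a long reference needs more shared k-mers than a short one to reach the same relative score, whereas absolute scoring ignores the reference length.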
| 369 | |
|---|
| 370 | Suppress warnings about missing 'start' field: (default is off) |
|---|
| 371 | |
|---|
| 372 | This option suppresses warnings about missing 'start' fields and allows |
|---|
| 373 | using SINA with databases that lack the 'start' field without getting flooded with |
|---|
| 374 | warnings. |
|---|
| 375 | |
|---|
| 376 | SINA command: (default "arb_sina.sh") |
|---|
| 377 | |
|---|
| 378 | If arb has problems finding the sina binary for whatever reason, you may |
|---|
| 379 | specify an explicit path here. |
|---|
| 380 | Please note, doing so will stop a fat-tarball-installation from working! |
|---|
| 381 | |
|---|
| 382 | NOTES SINA automatically decides the number of threads being used. |
|---|
| 383 | |
|---|
| 384 | When recording macros acting on the SINA window, problems |
|---|
| 385 | with the macro IDs used may occur ("XSINA" vs. "SINA" prefix), with |
|---|
| 386 | ARB complaining about unknown macro IDs (e.g. |
|---|
| 387 | something like "Unknown action 'SINA/CURR_PT_SERVER' in macro"). |
|---|
| 388 | This is caused by the way the sina window toggles |
|---|
| 389 | between 'normal' and 'advanced' options. |
|---|
| 390 | To avoid this, toggle between advanced and normal at the start of your macro. |
|---|
| 391 | If problems persist, manually correct the action prefixes in your macro. |
|---|
| 392 | |
|---|
| 393 | WARNINGS When using SINA 1.3 you have to make sure that the alignment selected |
|---|
| 394 | in LINK{ad_align.hlp} is the same alignment as used in the ARB_EDIT4 instance. |
|---|
| 395 | Starting with SINA 1.7.2 it will always use the same alignment as the editor. |
|---|
| 396 | |
|---|
| 397 | BUGS No bugs known |
|---|