sina_main.hlp

Last change on this file was 19348, checked in by westram, 3 years ago
emphasize reference- and ptserver-database have to be the same database!
File size: 19.7 KB

1	#Please insert up references in the next lines (line starts with keyword UP)
2	UP arb.hlp
3	UP glossary.hlp
4
5	#Please insert subtopic references (line starts with keyword SUB)
6	#SUB subtopic.hlp
7
8	# Hypertext links in helptext can be added like this: LINK{ref.hlp|http://add|bla@domain}
9
10	#*********** Title of helpfile !! and start of real helpfile ******
11	TITLE Graph Aligner (SINA)
12
13	OCCURRENCE ARB Editor -> Edit -> Prototypical Graph Aligner
14
15	DESCRIPTION SINA is an alternative to the integrated aligners.
16	It has been developed for the SILVA project.
17
18	Unlike the integrated aligners, SINA
19	- uses aligned sequences from the reference database as reference to align the selected sequences.
20	- employs full dynamic programming to create the alignment.
21	- considers all selected relatives at once, instead of falling back to
22	less similar sequences only if the current sequence is missing bases (e.g.
23	because it is a partial sequence).
24
25	SECTION SINA documentation online
26
27	LINK{https://sina.readthedocs.io/_/downloads/en/stable/pdf/}
28
29	Note: parts of this document were taken from the above SINA documentation.
30
31	SECTION SINA version
32
33	Please note this documentation applies to the modified SINA version 1.7.2
34	which (optionally) is delivered with arb.
35
36	Modifications applied:
37	* fix build with arb7 + gcc 7.5
38	* define CLI interface "ARB7.1" (allows arb to detect SINA 1.7.2)
39	* fixed some error messages (clarity; do NOT show WHOLE alignment; explicitly show bad character)
40	* new option '--dont-expect-start' (allows working with databases not containing 'start'-fields)
41
42	ARB still supports SINA version 1.3.
43	If there is no option to select the reference database, you are either using version 1.3
44	or trying to use a (newer) SINA version which does not yet support the "ARB7.1" CLI.
45
46	If you are using version 1.3, you may want to consult the old, corresponding version of this help page,
47	which may be accessed via LINK{http://bugs.arb-home.de/browser/trunk/HELP_SOURCE/source/sina_main.hlp?rev=18781}
48
49	SECTION OPTIONS
50
51	Select the sequences to be aligned as usual ("Current Species", "Selected
52	Species" or "Marked Species").
53
54	Select a PT-Server or SINA kmer-search:
55
56	You may select a PT-Server that will be used for reference search.
57	Make sure it is up to date and contains all sequences you want to be considered
58	as reference.
59
60	Alternatively select '-undefined-' to use the sina-internal engine for reference search.
61	SINA will maintain an index file (.sidx) that will be stored next to the reference database file.
62	It is updated automatically whenever the reference database changes.
63
64	Select a reference database:
65
66	The sequences from the reference database will be used as references when aligning the
67	sequences in the current database.
68
69	Normally you will want to use the current database as reference database.
70	This can be done in two ways:
71	* select 'Last saved' to use the last saved state of the current database.
72	* select 'Current' to use the loaded database. This is currently not possible
73	with PT-Server.
74
75	Alternatively you may specify any other database as reference
76	database via 'Explicit as [selected]'. This could e.g. be the same
77	database used to build the PT-Server.
78
79	This allows you, for example, to use a high quality database containing
80	only type strains as reference while working on small, specialized databases.
81
82	It also avoids polluting the set of references with the state of your
83	current working dataset, so you don't have to fear that some badly aligned sequence
84	from your working set will be used as a reference.
85
86	The effective reference database path will be shown in the input field below.
87
88	Important note:
89
90	The previously integrated SINA version 1.3 always aligned versus
91	the state of the referenced sequences in the CURRENT DATABASE.
92
93	SINA version 1.7 always aligns versus the state of the referenced sequences
94	in the REFERENCE DATABASE (which may, but does not have to, be the same!).
95
96	When using SINA version 1.7 with the PT-Server, please MAKE SURE
97	you specify the same database used to calculate the PT-Server as
98	reference database.
99
100	Follow these steps:
101
102	1. save the reference database (e.g. as ref.arb)
103	2. start arb on ref.arb and calculate a PT-Server
104	3. go back to your working database, open the sina window,
105	specify the calculated PT-Server AND specify the saved
106	ref.arb as reference database.
107	4. restart at 1. whenever you need to update your references
108
109	If SINA detects any inconsistencies between the PT-Server and the reference
110	database, it will silently try to update the PT-Server database.
111
112
113	Excluded references:
114
115	Some sequences will not be used as references:
116	* sequences with fewer than 10 gaps are considered not aligned and will
117	not be used as references.
118	* if "Realign" (see advanced options) is checked, a sequence will never
119	be used as a reference for itself.
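
The two exclusion rules above can be sketched in Python. This is an illustration only, not SINA's actual code: the helper name usable_references, the dict layout and the use of '-' and '.' as gap characters are assumptions made for this example.

```python
MIN_GAPS = 10  # threshold from the rule above: fewer gaps means "not aligned"

def usable_references(candidates, query_name, realign):
    """Filter candidate references (hypothetical helper, not a SINA function).

    candidates: dict mapping sequence name -> aligned sequence string.
    """
    usable = {}
    for name, aligned in candidates.items():
        # Assumption: both '-' and '.' count as gap characters.
        gaps = aligned.count("-") + aligned.count(".")
        if gaps < MIN_GAPS:
            continue  # considered unaligned, never a reference
        if realign and name == query_name:
            continue  # with "Realign", a sequence is not its own reference
        usable[name] = aligned
    return usable

refs = {
    "SeqA": "AC" + "-" * 12 + "GU",  # 12 gaps
    "SeqB": "ACGU--ACGU",            # only 2 gaps
    "SeqC": "." * 10 + "ACGU",       # exactly 10 gaps
}
print(sorted(usable_references(refs, "SeqC", realign=True)))   # ['SeqA']
print(sorted(usable_references(refs, "SeqC", realign=False)))  # ['SeqA', 'SeqC']
```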
120
121	Some options define additional requirements for the chosen
122	reference sequence set (see options below, esp. advanced options).
123
124	Decide what to do with possible overhang ("Overhang placement").
125	If your sequence extends beyond the
126	reference sequences on either side of the alignment, those bases cannot be
127	aligned properly. Three options of handling this situation are supported:
128
129	"keep attached"
130
131	just leave them dangling, directly attached to the last base that
132	could be aligned properly
133
134	"move to edge"
135
136	move them out to the very beginning and end of
137	the alignment. This allows you to easily spot sequences
138	with overhang, and decide what to do yourself. Recommended,
139	but only if you check your sequences after alignment!
140
141	"remove"
142
143	automatically remove these bases.
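
The three overhang modes can be pictured with a small Python sketch. It is a simplified model and not SINA's implementation: the function place_overhang, its parameters and the mode labels "attach", "edge" and "remove" are hypothetical, and the aligned part of the query is treated as one block placed at a known column.

```python
def place_overhang(width, core_start, core, left, right, mode):
    """Return one alignment row of length `width` (hypothetical helper).

    core:       aligned part of the query, placed at column `core_start`
    left/right: bases extending beyond the references on either side
    mode:       "attach", "edge" or "remove" (labels chosen for this sketch)
    """
    core_end = core_start + len(core)
    if len(left) > core_start or len(right) > width - core_end:
        raise ValueError("overhang does not fit into the alignment")
    row = ["-"] * width
    row[core_start:core_end] = core
    if mode == "remove":
        return "".join(row)  # overhanging bases are dropped
    if mode == "attach":
        left_at, right_at = core_start - len(left), core_end
    elif mode == "edge":
        left_at, right_at = 0, width - len(right)
    else:
        raise ValueError("unknown mode: " + mode)
    row[left_at:left_at + len(left)] = left
    row[right_at:right_at + len(right)] = right
    return "".join(row)

# Overhang is written in lower case here only to make it visible.
for mode in ("attach", "edge", "remove"):
    print(mode, place_overhang(16, 5, "ACGUAC", "gg", "uu", mode))
# attach ---ggACGUACuu---
# edge gg---ACGUAC---uu
# remove -----ACGUAC-----
```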
144
145	Choose "Handling of unmappable insertions":
146
147	Configures how the alignment width is preserved.
148
149	"Shift surrounding bases"
150
151	The alignment is executed without constraining insertion sizes.
152	Insertions for which insufficient columns exist between the
153	adjoining aligned bases are force fitted into the alignment
154	using NAST. That is, the minimum number of aligned bases to the
155	left and right of the insertion are moved to accommodate the
156	insertion. This mode will add warnings to the log for each
157	sequence in which aligned bases had to be moved.
158
159	"Forbid during DP alignment"
160
161	The alignment is executed using a scoring scheme disallowing insertions
162	for which insufficient columns exist in the alignment. This mode
163	causes fewer "misalignments" than the shift mode as it computes the best
164	alignment under the constraint that no columns may be added to the
165	alignment. However, it will not show if the computed alignment suffered
166	from a lack of empty columns.
167
168	"Delete bases"
169
170	The alignment is executed without constraining insertion sizes. Insertions
171	larger than the number of columns between the adjoining aligned bases are
172	truncated. While this mode yields the most accurate alignment for
173	sequences with large insertions, it should be used with care as it
174	modifies the original sequence.
175
176	Choose "Character Case":
177
178	Configures which bases should be written using lower case characters.
179
180	"Do not modify"
181
182	All bases will be written using the case they had in the input data.
183
184	"Show unaligned bases as lower case"
185
186	Aligned bases will be written in upper case; unaligned bases will be
187	written in lower case. This serves to mark sections of the query
188	sequences that could not be aligned because they were insertions
189	(internal or edge) with respect to any of the reference sequences.
190
191	"Uppercase all"
192
193	All bases will use upper case characters.
194
195	Define "Family conservation weight": (default 1)
196
197	Adjust the weight factor for the frequency at which a node was observed
198	in the reference alignment. Use 0 to disable weighting. This feature
199	prefers the more common placement for bases with inconsistent alignment
200	in the reference database.
201
202	Define "Size of full-length sequences": (default 1400)
203
204	Set the minimum length a reference sequence is required to have
205	to be considered full length.
206	See also "Minimal number of full length sequences" in
207	ADVANCED OPTIONS below.
208
209	Select a "Protection Level" higher than that of the sequences if you want the
210	alignment software to actually modify the bases. Choose a lower protection
211	level to execute a "dry run", not changing anything. Note that sequences
212	with a protection level of zero will always be changed.
213
214	SECTION Verbosity
215
216	All output will be printed to the console that opens when you start sina.
217
218	Several options allow you to change the noisiness:
219
220	When "Show changed sections" is checked, sina shows differences
221	between the inferred alignment and the original alignment.
222
223	That output will be colorized if "color bases" is checked.
224
225	Check "Show statistic" to
226	show the distance to original alignment.
227
228	SECTION TRICKS
229
230	If you want to see how the alignment that would be produced by the graph
231	aligner differs from your current alignment, and why the program would
232	act that way, you can set the protection level to "0" and the Logging level
233	to "debug". The output on the console will now include all differing sections
234	of the alignment and the matching parts of the reference sequences.
235
236	SECTION ADVANCED OPTIONS
237
238	Select the "Show advanced options" Button at the top to gain access to
239	the you-may-now-shoot-yourself-in-the-foot-severely dialog window.
240
241	Don't be surprised if the graph aligner crashes after you entered silly
242	values here. No sanity check of your options is done.
243
244	Pos.Var:
245
246	Select a positional variability filter. If possible, use the filter
247	appropriate for the type of sequences you want aligned. Positional
248	variability statistics will be considered when placing the individual bases.
249
250	Field used for automatic filter selection:
251
252	Configures the database field from which the positional variability
253	filter is determined, by majority vote among the selected
254	reference sequences.
255	Since the filters are usually computed at domain level, this approach is usually
256	sufficient to select an appropriate filter.
257	For SILVA databases, the field 'tax_slv' contains appropriate data.
258
259
260	Turn check:
261
262	If selected (default), sequences will be automatically reversed
263	and/or complemented if this will likely improve the alignment.
264
265
266	Realign:
267
268	If selected, the sequence itself is excluded from the result of
269	the executed PT-Server family search. If deselected, the alignment
270	of an identical sequence found by the PT-Server is copied.
271
272
273	# @@@ add option to mark references or aligned, then document here.
274	# (Copy and) mark sequence used as reference:
275	#
276	# Mark the sequences that were used as a reference during alignment.
277	# This allows you to easily load them into the editor to review the
278	# decisions made by the graph aligner.
279	# If you also selected the "Load reference" option, sequences will be
280	# copied into your current database prior to being marked.
281
282
283	Gap insertion/extension penalties: (default is 5/2)
284
285	You can change the penalties associated with opening and extending
286	gaps.
287
288	Match/mismatch scores: (default is 2/-1)
289
290	Configures the scores given for a match (should be positive) and
291	a mismatch (should be negative).
292
293	Family search min/min_score/max: (default 40/0.7/40)
294
295	The first value tells the graph aligner how many sequences it should
296	try to always use. The second value determines the minimal identity
297	with the target sequence additional reference sequences should have.
298	The third value selects the maximal number of sequences to be used
299	as a reference.
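
One way to read the three values is shown in the Python sketch below. It is not taken from the SINA sources: the function select_family and its parameter names are hypothetical, and each search hit is assumed to carry a single score that is compared with the min_score value.

```python
def select_family(hits, fs_min=40, fs_min_score=0.7, fs_max=40):
    """Pick reference names from search hits (hypothetical helper).

    hits: list of (name, score) pairs, higher score = more similar.
    """
    ranked = sorted(hits, key=lambda hit: hit[1], reverse=True)
    selected = []
    for name, score in ranked:
        if len(selected) >= fs_max:
            break  # never use more than the maximum
        # The first fs_min hits are taken regardless of their score;
        # further hits must reach the minimal score.
        if len(selected) < fs_min or score >= fs_min_score:
            selected.append(name)
    return selected

hits = [("r4", 0.6), ("r1", 0.95), ("r3", 0.8), ("r5", 0.5), ("r2", 0.9)]
print(select_family(hits, fs_min=2, fs_min_score=0.7, fs_max=4))  # ['r1', 'r2', 'r3']
print(select_family(hits, fs_min=4, fs_min_score=0.7, fs_max=4))  # ['r1', 'r2', 'r3', 'r4']
```

With the defaults 40/0.7/40 the first and third value are equal, so in this reading the score threshold has no effect unless the maximum is raised above the minimum.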
300
301	Minimal number of full length sequences: (default 1)
302
303	Set the minimum number of full length (see "Size of full-length sequences"
304	setting above) reference sequences that must be included in the selected
305	reference set. The search will proceed regardless of other settings until
306	this setting has been satisfied. If it cannot be satisfied by any
307	sequence in the reference database, the query sequence will be discarded.
308	This setting exists to ensure that the entire length of the query sequence
309	will be covered in the presence of partial sequences contained within your
310	reference database.
311
312	Family search oligo length/mismatches: (default 10/0)
313
314	The first value sets the size of k for the reference search (size of kmer).
315	For SSU rRNA sequences, the default of 10 is a good value.
316	For different sequence types, different values may perform better.
317	For 5S, for example, 6 has been shown to be more effective.
318
319	The second value allows k-mer matches in the reference database to contain n mismatches.
320	This feature is only supported by the pt-server search engine and
321	requires substantial additional compute time (in particular for n > 1).
322
323	Minimal reference sequence length: (default 150)
324
325	Set the minimum length reference sequences are required to have.
326	Sequences shorter than this will not be included in the selection.
327
328	Note: If you are working with particularly short reference sequences, you
329	will need to lower this setting to allow any reference sequences to be
330	found.
331
332	Alignment bounds: (default 0/0)
333
334	These values set the beginning and the end of the gene within the reference
335	alignment.
336	See "Number of references required to touch bounds" for more information.
337
338	Number of references required to touch bounds: (default: 0)
339
340	Similar to "Minimal number of full length sequences", this option requires
341	n sequences each to cover the beginning and the end of the gene
342	within the alignment.
343
344	This option is more precise than "Minimal number of full length sequences", but
345	requires that the column numbers for the range in which the full gene is
346	expected be specified via "Alignment bounds" (see above).
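
The check described here might look roughly like the Python sketch below. It is an interpretation, not SINA code: the function bounds_satisfied is hypothetical, each reference is reduced to the columns of its first and last base, and "touching" a bound is assumed to mean that the reference starts at or before the start bound or ends at or after the end bound.

```python
def bounds_satisfied(references, gene_start, gene_end, required):
    """Check the bounds requirement (hypothetical helper).

    references: list of (first_base_column, last_base_column) pairs.
    """
    touch_start = sum(1 for first, _ in references if first <= gene_start)
    touch_end = sum(1 for _, last in references if last >= gene_end)
    return touch_start >= required and touch_end >= required

refs = [(10, 900), (0, 1500), (5, 1490)]
print(bounds_satisfied(refs, gene_start=8, gene_end=1495, required=1))  # True
print(bounds_satisfied(refs, gene_start=8, gene_end=1495, required=2))  # False
```

In the example two references reach the start bound but only one reaches the end bound, so a requirement of 2 is not met.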
347
348	Save used references in 'used_rels': (default is off)
349
350	Writes the names of the alignment reference sequences into the field used_rels.
351	This option allows using LINK{markbyref.hlp} to highlight the reference
352	sequences used to align a given query sequence.
353
354	Store highest identity in 'align_ident_slv': (default is off)
355
356	Computes the highest similarity the aligned query sequence has with any of
357	the sequences in the alignment reference set. The value is written to the
358	field 'align_ident_slv'.
359
360	Disable fast search: (default is to use fast search)
361
362	Use all k-mers occurring in the query sequence in the search. By default,
363	only k-mers starting with an A are used for extra performance.
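
The effect of the fast search on the set of query k-mers can be illustrated in Python. The function query_kmers is a hypothetical helper written for this example; only the rule "k-mers starting with an A" is taken from the text above.

```python
def query_kmers(sequence, k=10, fast=True):
    """List the k-mers of a query used for the search (hypothetical helper)."""
    kmers = [sequence[i:i + k] for i in range(len(sequence) - k + 1)]
    if fast:
        # Fast search: keep only k-mers whose first base is an A.
        kmers = [kmer for kmer in kmers if kmer.startswith("A")]
    return kmers

print(query_kmers("ACGAUAG", k=3, fast=False))  # ['ACG', 'CGA', 'GAU', 'AUA', 'UAG']
print(query_kmers("ACGAUAG", k=3, fast=True))   # ['ACG', 'AUA']
```

For sequences with a roughly even base composition this leaves about a quarter of the k-mers, which is where the speedup comes from.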
364
365	Score search results by absolute oligo match count: (default is off)
366
367	Use absolute (number of shared k-mers) match scores in the kmer search
368	rather than relative (number of shared k-mers divided by length of reference
369	sequence) match scores.
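
The difference between the two scoring modes can be shown with a short Python sketch. The helpers kmer_set and match_score are hypothetical, and counting each distinct k-mer only once is an assumption of this example.

```python
def kmer_set(sequence, k):
    """Distinct k-mers of a sequence (assumption: duplicates count once)."""
    return {sequence[i:i + k] for i in range(len(sequence) - k + 1)}

def match_score(query, reference, k=10, absolute=False):
    """Absolute or relative k-mer match score (hypothetical helper)."""
    shared = len(kmer_set(query, k) & kmer_set(reference, k))
    if absolute:
        return shared
    return shared / len(reference)  # relative to the reference length

query = "ACGUACGG"
long_ref = "ACGUACGGAAAAUUUU"  # contains the whole query
short_ref = "UACGG"            # contains only its end

print(match_score(query, long_ref, k=4, absolute=True))    # 5
print(match_score(query, short_ref, k=4, absolute=True))   # 2
print(match_score(query, long_ref, k=4, absolute=False))   # 0.3125
print(match_score(query, short_ref, k=4, absolute=False))  # 0.4
```

In this example the absolute score ranks the long reference first, while the relative score ranks the short reference first.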
370
371	Suppress warnings about missing 'start' field: (default is off)
372
373	This option suppresses warnings about missing 'start' fields and allows you to
374	use SINA with databases that lack the 'start' field without getting flooded with
375	warnings.
376
377	SINA command: (default "arb_sina.sh")
378
379	If arb has problems finding the sina binary for whatever reason, you may
380	specify an explicit path here.
381	Please note, doing so will stop a fat-tarball-installation from working!
382
383	NOTES SINA automatically decides the number of threads being used.
384
385	When recording macros that act on the SINA window, problems
386	with the macro IDs used may occur ("XSINA" vs. "SINA" prefix), causing
387	ARB to complain about unknown macro IDs (e.g.
388	something like "Unknown action 'SINA/CURR_PT_SERVER' in macro").
389	This is caused by the way the SINA window toggles
390	between 'normal' and 'advanced' options.
391	To avoid this, toggle between advanced and normal at the start of your macro.
392	If problems persist, manually correct the action prefixes in your macro.
393
394	WARNINGS When using SINA 1.3 you have to make sure that the alignment selected
395	in LINK{ad_align.hlp} is the same alignment as used in the ARB_EDIT4 instance.
396	Starting with SINA 1.7.2 it will always use the same alignment as the editor.
397
398	BUGS No bugs known
