Opened 10 years ago
Last modified 3 years ago
#601 assigned misbehavior
Adding Short Sequences to SILVA_NR_119 does not have reliable placement in tree
Reported by: | guest | Owned by: | westram |
---|---|---|---|
Priority: | major | Milestone: | |
Component: | ARB_PARSIMONY | Version: | arb-6.x |
Keywords: | Quick Add Marked | Cc: | CONNON@… |
Description
Using ARB 6.0.2 on Red Hat Linux 6.
What I have: I have 10 short 253 bp sequences that have been checked for chimeras using UCHIME, SLAYER and PINTAIL in Mothur and are shown to be clean. I imported these 10 into ARB, aligned them with the fast aligner, and then checked them by hand to clean up the sequence ends (i.e. bring them in from the outer edges).
Problem with the Quick Add Marked tool: I add them to the tree with the "quick add marked parsimony" tool and then check their location using the store_full_taxonomy feature. Since the placement does not look right, I remove them from the tree and add them back one at a time, and they go into different places, some extremely different, such as Alpha- versus Gammaproteobacteria. I was worried that the low-quality sequences (i.e. the red seqs) flagged in SILVA_NR_119 could be the problem, so I removed all of these low-quality seqs from the tree and repeated the "quick add" with my 10 seqs. Again, when I add them all together I get one set of taxonomies, and if I add them one at a time the placement in the tree changes compared to adding them together as a set of 10.
How is the add marked species to tree tool working? Is it adding one from the list of 10, changing the tree each time before adding the next seq, OR is it adding each of the 10 sequences to the tree independent of the other 9? It seems that the sequences should be going to the same place in the tree no matter if added one at a time or as a group of 10. I have never seen this happen in all the years I have been using this feature. In the past the placement was always robust and the same. Over the last year I have seen this "instability" several times but just ignored it, thinking I had done something wrong. I have spent 3 days adding and removing these 10 sequences from the SILVA_119 tree in various orders to see if I could determine whether my sequences were actually chimeras, but I see no pattern that would lead me to believe they are chimeras. It would help in my troubleshooting if I could be given some documentation on how the quick add feature handles batches of sequences as compared to adding one at a time.
It is critical that we can rely on our sequences being placed correctly in the tree, so this is something that I am willing to work on more.
See attached fasta file of the 10 seqs. Note: The seq names were generated using a taxonomic ID feature in the QIIME pipeline against the SILVA111 database, additionally modified to remove all low-scoring sequences (i.e. red ones as determined by the SILVA folks, plus more that we deemed low quality). It was the difference between the taxonomies determined by the QIIME pipeline and by ARB that brought the problem to my attention. Thank you, Stephanie Connon, Arb since 1996
Attachments (2)
Change History (8)
Changed 10 years ago by guest
comment:1 Changed 10 years ago by guest
My email address is CONNON@…
comment:2 Changed 10 years ago by epruesse
- Cc CONNON@… added
comment:3 in reply to: ↑ description Changed 10 years ago by westram
- Owner changed from devel to westram
- Status changed from new to accepted
Hello Stephanie,
What I have: I have 10 short 253 bp sequences […]
ARB parsimony will not perform well if you add partial sequences, especially if they are that short. Adding them the way you normally add full-length sequences will count the missing bps (compared with a full-length seq) as deletions. Doing so often results in trees where all partials (or several clusters of partials) fall into separate subtrees, because the distance between two partial sequences is smaller than the distance between each partial and the nearest full-length sequence. Also (as you've experienced), the insertion order of partial sequences has a much higher impact on the resulting topology than it has for full-length sequences.
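The effect described above can be illustrated with a toy mismatch count. This is only a sketch under simple assumptions (one column per character, `-` for a missing position, every difference costing 1); the function names are made up for illustration and it is not how ARB_PARSIMONY computes its costs.

```python
def distance_gaps_as_deletions(a, b):
    """Count differing columns; a gap against a base counts as a difference."""
    return sum(1 for x, y in zip(a, b) if x != y)


def distance_overlap_only(a, b):
    """Count differences only in columns where both sequences have a base."""
    return sum(1 for x, y in zip(a, b) if x != "-" and y != "-" and x != y)


# Hypothetical aligned toy data: one full-length sequence, two short partials.
full = "ACGTACGTACGTACGTACGT"
partial_1 = "--------ACGT--------"  # identical to `full` in the covered region
partial_2 = "--------ACGA--------"  # one real difference in the covered region

# Missing positions counted as deletions: partials look far from `full`
# but very close to each other.
print(distance_gaps_as_deletions(partial_1, full))       # 16
print(distance_gaps_as_deletions(partial_1, partial_2))  # 1

# Restricted to the overlapping columns, partial_1 is identical to `full`.
print(distance_overlap_only(partial_1, full))            # 0
print(distance_overlap_only(partial_1, partial_2))       # 1
```

With the first measure the two partials are each other's nearest neighbour even though one of them matches the full-length sequence exactly, which is the clustering behaviour described above.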
There is a special function to place partial sequences into trees:
- mark the partial species to add
- start ARB_PARSIMONY interactively
- select ARB_PARSIMONY/Tree/Add species to tree/Add marked partial species
See also http://help.arb-home.de/pa_partial.html
Please give that feature a try and report back whether it leads to better results or not.
How is the add marked species to tree tool working? Is it adding one from the list of 10, changing the tree each time before adding the next seq
Yes. The insertion order depends on the weighted amount of base positions (after filtering). Species with higher base counts are inserted first (as their placement will be "more determined"). When base counts are equal, there is no comprehensible order.
OR is it adding each of the 10 sequences to the tree independent of the other 9?
No.
The 'Add marked partial species' method adds ALL partial species independently to the tree. Therefore the insertion order only has an effect if 3 or more partial sequences get attached to the same full-length sequence.
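As a rough illustration of that ordering rule, the sketch below sorts sequences by a weighted count of non-gap positions in the columns that pass a filter. The helper names, the filter and the weights are all hypothetical; the actual weighting used by ARB is not shown in this ticket.

```python
def weighted_base_count(seq, column_filter, weights):
    """Sum the weights of filtered-in columns where the sequence has a base."""
    return sum(
        w
        for ch, keep, w in zip(seq, column_filter, weights)
        if keep and ch not in "-."
    )


def insertion_order(seqs, column_filter, weights):
    """Return sequence names, highest weighted base count first."""
    return sorted(
        seqs,
        key=lambda name: weighted_base_count(seqs[name], column_filter, weights),
        reverse=True,
    )


# Hypothetical toy alignment, filter (last column excluded) and column weights.
seqs = {"s1": "AC--GT", "s2": "ACGTGT", "s3": "A----T"}
column_filter = [True, True, True, True, True, False]
weights = [1, 1, 2, 2, 1, 1]

print(insertion_order(seqs, column_filter, weights))  # ['s2', 's1', 's3']
```

Python's sort is stable, so ties in this sketch keep their input order; that is a property of the sketch only and says nothing about how ARB orders sequences with equal base counts.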
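The difference between the two modes can be sketched with a toy nearest-neighbour placement. This is an assumption-laden stand-in for real parsimony insertion: a "placement" here is just the name of the closest sequence, gaps are counted as differences, and all names are hypothetical. It also ignores the case of several partials attaching to the same full-length sequence.

```python
def mismatches(a, b):
    """Count differing columns; a gap against a base counts as a difference."""
    return sum(1 for x, y in zip(a, b) if x != y)


def nearest(query, candidates):
    """Name of the candidate with the fewest mismatches to the query."""
    return min(candidates, key=lambda name: mismatches(query, candidates[name]))


def place_sequentially(queries, references):
    """Each placed query joins the pool that later queries are compared against."""
    pool = dict(references)
    placement = {}
    for name, seq in queries.items():
        placement[name] = nearest(seq, pool)
        pool[name] = seq
    return placement


def place_independently(queries, references):
    """Every query is compared against the original references only."""
    return {name: nearest(seq, references) for name, seq in queries.items()}


# Hypothetical toy data: two full-length references, two short partials.
references = {"ref_A": "ACGTACGTACGT", "ref_B": "ACGTTTTTACGT"}
q1 = "----ACGT----"  # matches ref_A in the covered region
q2 = "----TTTT----"  # matches ref_B in the covered region

print(place_sequentially({"q1": q1, "q2": q2}, references))
# {'q1': 'ref_A', 'q2': 'q1'}
print(place_sequentially({"q2": q2, "q1": q1}, references))
# {'q2': 'ref_B', 'q1': 'q2'}
print(place_independently({"q1": q1, "q2": q2}, references))
# {'q1': 'ref_A', 'q2': 'ref_B'}
```

In the sequential variant the second partial always lands next to the first one, so the result depends on the order; in the independent variant each partial ends up at the reference it actually matches.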
Stephanie Connon, Arb since 1996
Nice, nearly as long as I have worked on ARB
Ralf
comment:4 Changed 10 years ago by westram
Testing blocked by #608
comment:5 Changed 7 years ago by westram
- Status changed from accepted to assigned
comment:6 Changed 3 years ago by westram
- Version changed from arb-6.0 to arb-6.x
Unify tickets for all 6.x versions.
fasta file of Stephanie Connon's 10 test seqs