Opened 4 years ago

Last modified 3 months ago

#601 assigned misbehavior

Adding Short Sequences to SILVA_NR_119 does not have reliable placement in tree

Reported by: guest
Owned by: westram
Priority: major
Milestone:
Component: ARB_PARSIMONY
Version: arb-6.0
Keywords: Quick Add Marked
Cc: CONNON@…

Description

Using ARB 6.0.2 on Red Hat Linux 6.

What I have: I have 10 short 253 bp sequences that have been checked for chimeras using UCHIME, SLAYER and PINTAIL in Mothur and are shown to be clean. I imported these 10 into ARB, aligned them with the fast aligner, and then checked them by hand to clean up the sequence ends (i.e. bring them in from the outer edges).

Problem with the Quick Add Marked tool: I add them to the tree with the "quick add marked parsimony" tool and then check their location using the store_full_taxonomy feature. Because the placement does not look right, I remove them from the tree and add them back one at a time, and they land in different places, some extremely different, such as Alpha- versus Gammaproteobacteria. I was worried that the low-quality sequences (i.e. the red seqs) flagged in the SILVA_NR_119 tree could be the problem, so I removed all of them from the tree and repeated the "quick add" with my 10 seqs. Again, when I add them all together I get one set of taxonomies, and when I add them one at a time the placement in the tree changes compared with adding them together as a set of 10.

How is the add marked species to tree tool working? Is it adding one from the list of 10 changing the tree each time before adding the next seq OR is it adding each of the 10 sequences to the tree independent of the other 9? It seems to me that the sequences should go to the same place in the tree whether they are added one at a time or as a group of 10. I have never seen this happen in all the years I have been using this feature. In the past the placement was always robust and the same. Over the last year I have seen this "instability" several times but just ignored it, thinking I had done something wrong. I have spent 3 days adding and removing these 10 sequences from the SILVA_119 tree in various orders to see if I could determine whether my sequences were actually chimeras, but I see no pattern that would lead me to believe they are. It would help my troubleshooting if I could be given some documentation on how the quick add feature handles batches of sequences compared with adding them one at a time.

It is critical that we can rely on our sequences to be placed correctly in the tree so this is something that I am willing to work on more.

See the attached fasta file of the 10 seqs. Note: The seq names were generated using a taxonomic ID feature in the QIIME pipeline against the SILVA111 database, additionally modified to remove all low-scoring sequences (i.e. the red ones as determined by the SILVA folks, plus more that we deemed low quality). It was the difference between the taxonomies determined by the QIIME pipeline and those from ARB that brought the problem to my attention. Thank you, Stephanie Connon, Arb user since 1996

Attachments (2)

Connon_QuickAddPars_BugReport.fasta (3.0 KB) - added by guest 4 years ago.
fasta file of Stephanie Connon's 10 test seqs
Connon_QuickAddTaxonomies_usingSilva119Arb.xlsx (17.7 KB) - added by guest 4 years ago.
Click the 3 tabs to see how I added seqs to the tree and the tree placement


Change History (7)

Changed 4 years ago by guest

fasta file of Stephanie Connon's 10 test seqs

Changed 4 years ago by guest

Click the 3 tabs to see how I added seqs to the tree and the tree placement

comment:1 Changed 4 years ago by guest

My email address is CONNON@…

comment:2 Changed 4 years ago by epruesse

  • Cc CONNON@… added

comment:3 in reply to: ↑ description Changed 4 years ago by westram

  • Owner changed from devel to westram
  • Status changed from new to accepted

Hello Stephanie,

What I have: I have 10 short 253 bp sequences […]

ARB parsimony will not perform well if you add partial sequences, especially if they are that short. Adding them the way you normally add full-length sequences counts the missing base pairs (compared with a full-length sequence) as deletions. This often results in trees where all partials (or several clusters of partials) fall into separate subtrees, because the distance between two partial sequences is smaller than the distance between each partial and the nearest full-length sequence. Also (as you've experienced), the insertion order of partial sequences has a much higher impact on the resulting topology than it has for full-length sequences.
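To illustrate the effect described above, here is a minimal Python sketch. It is not ARB's parsimony code; the function `mismatch_count`, its `gaps_count` switch, and the toy alignment are hypothetical and only show why two partials look closer to each other than to a full-length sequence when missing columns are scored as deletions.

```python
def mismatch_count(a, b, gaps_count=True):
    """Count differing columns between two aligned sequences of equal length.

    With gaps_count=True, a gap ('-') opposite a base is scored as a
    difference (a deletion). With gaps_count=False, any column that
    contains a gap in either sequence is skipped.
    """
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    diff = 0
    for x, y in zip(a, b):
        if not gaps_count and (x == "-" or y == "-"):
            continue
        if x != y:
            diff += 1
    return diff


# Toy alignment (hypothetical data): one full-length and two partial sequences.
full      = "ACGTACGTACGTACGTACGT"
partial_a = "--------ACGTACGA----"
partial_b = "--------ACGAACGT----"

# Missing columns scored as deletions: the partials look close to each other
# and far from the full-length sequence.
print(mismatch_count(partial_a, partial_b))   # partial vs partial
print(mismatch_count(full, partial_a))        # full-length vs partial

# Only overlapping columns compared: the full-length sequence is the closer one.
print(mismatch_count(full, partial_a, gaps_count=False))
```

In this toy case the two partials differ in 2 columns, while each partial differs from the full-length sequence in 13 columns once the 12 missing columns are counted as deletions. Restricting the comparison to the overlapping columns leaves a single difference.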

There is a special function to place partial sequences into trees:

  • mark the partial species to add
  • start ARB_PARSIMONY interactively
  • select ARB_PARSIMONY/Tree/Add species to tree/Add marked partial species

See also http://help.arb-home.de/pa_partial.html

Please give that feature a try and report back whether it leads to better results or not.

How is the add marked species to tree tool working? Is it adding one from the list of 10 changing the tree each time before adding the next seq

Yes. The insertion order depends on the weighted number of base positions (after filtering). Species with higher base counts are inserted first (as their placement will be "more determined"). Species with equal base counts are inserted in no comprehensible order.
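As a rough sketch of that ordering rule (not ARB's actual implementation), the Python snippet below sorts sequences by a weighted base count. The names `weighted_base_count` and `insertion_order`, the per-column `weights` list (with weight 0 standing in for a filtered-out column), and the example data are all assumptions made for illustration.

```python
def weighted_base_count(seq, weights):
    """Sum the weights of all columns where the sequence has a base.

    '-' and '.' are treated as "no base"; a weight of 0 models a
    column that is removed by the filter.
    """
    return sum(w for ch, w in zip(seq, weights) if ch not in "-.")


def insertion_order(seqs, weights):
    """Return sequence names, highest weighted base count first."""
    return sorted(
        seqs,
        key=lambda name: weighted_base_count(seqs[name], weights),
        reverse=True,
    )


# Hypothetical aligned sequences of 8 columns each.
seqs = {
    "s1": "--ACGT--",
    "s2": "ACGTACGT",
    "s3": "-CGTAC--",
}
weights = [1] * 8  # every column passes the filter with weight 1

print(insertion_order(seqs, weights))
```

Note that Python's sort is stable, so ties in this sketch keep their input order. That is a property of the sketch only and says nothing about how ARB orders species with equal base counts.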

OR is it adding each of the 10 sequences to the tree independent of the other 9?

no.

The 'Add marked partial species' method adds ALL partial species independently to the tree. Therefore the insertion order only has an effect if 3 or more partial sequences get attached to the same full-length sequence.
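The idea of independent placement can be sketched as follows. This is a simplified stand-in that picks the nearest full-length sequence by mismatches in the overlapping columns; ARB itself places species by parsimony in the tree, and the functions `overlap_mismatches` and `place_independently` and the example data are hypothetical.

```python
def overlap_mismatches(partial, full):
    """Count mismatches only in columns where both sequences have a base."""
    return sum(
        1
        for p, f in zip(partial, full)
        if p != "-" and f != "-" and p != f
    )


def place_independently(partials, fulls):
    """Attach each partial to its closest full-length sequence.

    Every partial is compared against the full-length sequences only,
    never against other partials, so the result cannot depend on the
    order in which the partials are processed.
    """
    placement = {}
    for name, seq in partials.items():
        placement[name] = min(
            fulls, key=lambda f: overlap_mismatches(seq, fulls[f])
        )
    return placement


# Hypothetical full-length references and partial query sequences.
fulls = {
    "alpha": "ACGTACGTACGT",
    "gamma": "ACGTTTGTACGA",
}
partials = {
    "p1": "----ACGT----",
    "p2": "----TTGT----",
}

print(place_independently(partials, fulls))

# Processing the partials in reverse order gives the same assignments.
reversed_partials = dict(reversed(list(partials.items())))
print(place_independently(reversed_partials, fulls))
```

The sketch does not model the case mentioned above where several partials attach to the same full-length sequence, which is where insertion order can still matter.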

Stephanie Connon, Arb user since 1996

Nice - nearly as long as I have been working on ARB :)

Ralf

comment:4 Changed 4 years ago by westram

testing blocked by #608

comment:5 Changed 3 months ago by westram

  • Status changed from accepted to assigned