Opened 15 years ago

Closed 15 years ago

#153 closed defect (fixed)

Problem with the ARB Newick Tree Exporter

Reported by: fog Owned by: westram
Priority: major Milestone:
Component: ARB_NTREE Version: release_20071207
Keywords: Cc: pablo.yarza@…

Description (last modified by westram)

Newick exports cannot be processed by external programs. Reported by several users; checked and written up by Pablo Yarza:

  1. Commas to separate fields
    • Occurrence: Tree→ Tree admin→ Export.
    • Options: Newick format. NDS. branch lengths and group names.
    • Problems:
      • Most of the time, the output file can be opened with drawgram without problems. However, there are some underlying issues.
      • In the output file, fields are separated by (, ). drawgram appears to treat this as 3 distinct branches of equal length starting from the same node, instead of 1 single branch carrying additional information (name, accession number, gene length).
    • Example: ('Pseudomonas sp., AF12345, 1493':0.0161
    • Possible solution: when using NDS, do not use (, ) to separate fields. Use ( ) or (_). The Newick standard automatically converts the underscore to a ( ).
  2. Single quotes adjoining branch labels generate confusion.
    • Occurrence: Tree→ Tree admin→ Export.
    • Options: Newick format. NDS. branch lengths and group names.
    • Problems:
      • It seems that the ARB Newick exporter writes single quotes around a label only when it finds at least one space in the string.
      • When only one field is printed and it contains spaces, single quotes are printed as well. This is often the case for the "full_name" field, but not for the "name" or "acc" fields.
      • The program drawtree also prints the quotes, which may be undesirable or confusing for some users (e.g. in taxonomy, a species name between quotes indicates that the name was not validly published).
    • Notes:
      • We recently received complaints regarding the Newick file provided by the LTP. Those users parsed the Newick file that we provide and claimed that our tree contained more species than it actually has. They use single quotes as an identifier for every single sequence, which sounds reasonable.
      • However, when you also export group names (as we do with our trees), those containing more than one word (e.g. the group: Gammaproteobacteria and Betaproteobacteria) are printed between single quotes as well.
  • A possible solution would be to not use quoting.
  3. Square brackets
    • Occurrence: Tree→ Tree admin→ Export.
    • Options: Newick format.
    • Problems:
      • ARB allows you to enter/modify some information about your tree, through the tree administrator.
      • When exporting the tree, ARB seems to automatically do the following:
        1. ALWAYS begins the file by writing an opening square bracket ([)
        2. Writes the information of your tree
        3. If ARB does not find any closing square bracket (]), it writes one at the end. So, even if your tree contains no information, ARB will export ([]) as the header of the output file.
        4. If ARB finds one or more closing brackets (]) ANYWHERE in the text, it writes only one, ALWAYS at the end of the header. This causes problems when the tree information itself contains brackets.
        5. Below, on a new line, ARB writes the tree in Newick format.

For example, suppose that my tree contains the following additional information:

This is my tree [February 20 2009].
[created as copy of 'tree_test1']

Now, if you export the tree, the header of the newick file will contain:

[This is my tree [February 20 2009.
[created as copy of 'tree_test1']

At least for drawgram, every opening square bracket in the header must have a matching closing square bracket. As you can see in the example above, there are three ([) but only one (]). Two closing square brackets are missing, so drawgram will never read this file.
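The mismatch in the example header can be checked mechanically. The following is a minimal Python sketch, not ARB code: the function names `bracket_counts` and `is_well_formed` are hypothetical, and it assumes that "well-formed" means every ([) is closed by a later (]).

```python
def bracket_counts(comment):
    """Return (number of '[', number of ']') found in a tree comment."""
    return comment.count("["), comment.count("]")


def is_well_formed(comment):
    """True if every '[' has a later matching ']' and no ']' comes too early."""
    depth = 0
    for ch in comment:
        if ch == "[":
            depth += 1
        elif ch == "]":
            depth -= 1
            if depth < 0:  # a closing bracket with nothing open
                return False
    return depth == 0


# The exported header from the example above.
header = "[This is my tree [February 20 2009.\n[created as copy of 'tree_test1']"
print(bracket_counts(header))  # (3, 1)
print(is_well_formed(header))  # False
```

Running such a check on the tree comment before export would flag the header above as one that drawgram is expected to reject.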

  • Possible solution: ALWAYS write the same number of ([) as (]).
  4. Colons, semicolons, commas and brackets
  • Occurrence: Tree→ Tree admin→ Export.
  • Options: Newick format.
  • Problems:
    • All these symbols can alter the structure of the newick file.
    • Even though a branch label is flanked by quotes in the Newick file, the tree gets corrupted if the label contains any of these symbols:
      • Colon: specifies the branch length
      • Comma: separates members of the same node
      • Semicolon: ends the Newick file
      • Brackets (parentheses): define a node

They are present in many fields such as "fullname_ltp", "tax_rdp_name", "journal", "tax_embl", "date", etc., and can certainly be present in personal fields as well.

  • Example:
    'Moraxella (subgen. Moraxella Lwoff 1939) lacunata, D64049, Moraxellaceae':0.00350,

In this case, the presence of brackets makes drawgram crash when trying to open the newick tree.

  • Possible solutions:
    • In the case of brackets, perhaps ARB should delete this symbol from the field while exporting.
    • The remaining symbols could be replaced by neutral symbols that do not affect the structure of the file, like ( ), (_), (#), (%), etc.
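To find which labels are affected before exporting, one could scan each label for the symbols listed above. This is a small Python sketch, not part of ARB: the name `structural_chars_in` is hypothetical, and the character set is an assumption assembled from the symbols named in this ticket.

```python
# Characters with structural meaning in a Newick file, as named in this ticket
# (assumed set; not taken from ARB source code).
STRUCTURAL_CHARS = set("()[]:;,")


def structural_chars_in(label):
    """Return the structural characters in a label, in order of first appearance."""
    found = []
    for ch in label:
        if ch in STRUCTURAL_CHARS and ch not in found:
            found.append(ch)
    return found


label = "Moraxella (subgen. Moraxella Lwoff 1939) lacunata, D64049, Moraxellaceae"
print(structural_chars_in(label))  # ['(', ')', ',']
```

An empty result means the label contains none of the listed symbols.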

Change History (4)

comment:1 Changed 15 years ago by westram

  • Description modified (diff)
  • Owner changed from devel to westram

comment:2 Changed 15 years ago by westram

see also Ticket #148

comment:3 Changed 15 years ago by westram

  • 3. is fixed by [6028]
  • ARB now expects that the tree comment is well-formed (contains the same number of '[' and ']')

comment:4 Changed 15 years ago by westram

  • Resolution set to fixed
  • Status changed from new to closed

Hi Pablo,

  • fixed by [6066]
  • regarding your individual requests
    1. Commas to separate fields
      • may now replace all special characters ("()[]:;," plus currently used quote) by '_'
    2. Single quotes adjoining branch labels generate confusion.
      • may use double quotes or no quotes now
    3. already fixed
    4. Colons, semicolons, commas and brackets
      • see 1.
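The behaviour described in this comment can be sketched roughly as follows. This is an illustration in Python, not the code from [6066]: the function `export_label` and its parameters are hypothetical, and only the character set "()[]:;," plus the active quote character comes from the comment above.

```python
SPECIAL_CHARS = "()[]:;,"  # character set quoted in the comment above


def export_label(label, quote="", replace_special=True):
    """Format a label for export.

    quote: "" (no quotes), "'" (single quotes) or '"' (double quotes).
    replace_special: if True, replace special characters and the quote
    character currently in use by '_'.
    """
    if quote not in ("", "'", '"'):
        raise ValueError("quote must be empty, a single quote or a double quote")
    if replace_special:
        problem = set(SPECIAL_CHARS + quote)
        label = "".join("_" if ch in problem else ch for ch in label)
    return quote + label + quote


print(export_label("Pseudomonas sp., AF12345, 1493"))
# Pseudomonas sp._ AF12345_ 1493
print(export_label("Pseudomonas sp., AF12345, 1493", quote='"'))
# "Pseudomonas sp._ AF12345_ 1493"
```

Including the active quote character in the replaced set means a quoted label can never contain its own delimiter.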

I'll build a version later, you will find it here
