Opened 16 years ago
Closed 16 years ago
#153 closed defect (fixed)
Problem with the ARB Newick Tree Exporter
Reported by: | fog | Owned by: | westram |
---|---|---|---|
Priority: | major | Milestone: | |
Component: | ARB_NTREE | Version: | release_20071207 |
Keywords: | | Cc: | pablo.yarza@… |
Description (last modified by westram)
Newick exports cannot be processed by external programs. Reported by several users; checked and written up by Pablo Yarza:
- Commas to separate fields
- Occurrence: Tree→ Tree admin→ Export.
- Options: Newick format. NDS. branch lengths and group names.
- Problems:
- Most of the time, the output file can be opened with drawgram without problems. However, there are some underlying issues.
- In the output file, fields are separated by (, ). drawgram apparently interprets this as 3 distinct branches of equal length starting from the same node, instead of as 1 single branch carrying some extra information (name, accession number, gene length).
- Example: ('Pseudomonas sp., AF12345, 1493'.0161
- Possible solution: when using NDS, do not use (, ) to separate fields. Use ( ) or (_). Under the Newick standard, an underscore in an unquoted label is automatically read as a ( ).
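To illustrate why commas inside a label are misread, here is a minimal Python sketch. It assumes a reader that ignores quoting and splits sibling nodes on every comma, which is how drawgram appears to behave according to this report. `split_siblings` and `join_fields` are hypothetical helpers written for this example; they are not ARB or PHYLIP functions.

```python
def split_siblings(node_body: str) -> list[str]:
    """Split the inside of a '(...)' node on commas, ignoring quotes.

    Assumption: this mimics a quote-unaware reader. It is not a full
    Newick parser.
    """
    return [part.strip() for part in node_body.split(",")]


def join_fields(fields: list[str], sep: str = "_") -> str:
    """Join NDS fields with a separator that has no structural meaning."""
    return sep.join(field.strip() for field in fields)


fields = ["Pseudomonas sp.", "AF12345", "1493"]

comma_label = ", ".join(fields)   # how the fields are joined in the report
safe_label = join_fields(fields)  # proposed alternative

print(split_siblings(comma_label))  # ['Pseudomonas sp.', 'AF12345', '1493']
print(split_siblings(safe_label))   # ['Pseudomonas sp._AF12345_1493']
```

With the comma separator the single label is seen as three siblings; with the underscore it stays one label. Note that "Pseudomonas sp." still contains a space, so the quoting issue described in the next item is a separate matter.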
- Single quotes adjoining branch labels generate confusion.
- Occurrence: Tree→ Tree admin→ Export.
- Options: Newick format. NDS. branch lengths and group names.
- Problems:
- It seems that the ARB Newick exporter writes single quotes around a label only when it finds at least one space in the string.
- When only one field is printed and it contains spaces, single quotes are printed as well. This is often the case for the "full_name" field, but not for the "name" or "acc" fields.
- The program drawtree also prints the quotes, which might be undesirable or confusing for some users (e.g. in taxonomy, a species name between quotes is used to specify that the name was not validly published).
- Notes:
- We recently received complaints regarding the Newick file provided by the LTP. Those people parsed the Newick file that we provide and claimed that our tree contained more species than it actually has. They use single quotes as an identifier for every single sequence, which sounds reasonable.
- However, when you also export group names (as we do with our trees), those containing more than one word (e.g. the group: Gammaproteobacteria and Betaproteobacteria) are printed between single quotes as well.
- A possible solution would be not to use quoting.
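The over-counting described in the notes can be reproduced with a small sketch. The tree string below is made up for illustration, and the two regular expressions are only a rough approximation of what an external script might do; they are not a real Newick parser.

```python
import re

# Hypothetical tree: three quoted species plus one quoted multi-word group name.
newick = (
    "(('Escherichia coli':0.01,'Neisseria meningitidis':0.02)"
    "'Gammaproteobacteria and Betaproteobacteria':0.05,"
    "'Thermotoga maritima':0.30);"
)

# Naive approach from the report: every quoted string counts as one sequence.
quoted = re.findall(r"'([^']*)'", newick)

# A quoted label that directly follows ')' names an internal node (a group).
groups = re.findall(r"\)'([^']*)'", newick)

print(len(quoted))                # 4
print(len(groups))                # 1
print(len(quoted) - len(groups))  # 3
```

The naive count reports four "species" although the tree has only three leaves, because the quoted group name is counted too.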
- Square brackets
- Occurrence : Tree→ Tree admin→ Export.
- Options: Newick format.
- Problems:
- ARB allows you to enter/modify some information about your tree, through the administrator.
- When exporting the tree, ARB seems to automatically do the following:
- ALWAYS begins the file by writing an opening square bracket ([)
- Writes the information of your tree
- If ARB does not find any closing square bracket (]), it writes one at the end. So, even if your tree contains no information, ARB will export ([]) as the header of the output file.
- If ARB finds one or more closing brackets (]) ANYWHERE in the text, it writes only one, ALWAYS at the end of the header. This causes problems when the information of the tree actually contains brackets.
- Below, on a new line, ARB writes the tree in Newick format.
For example, suppose that my tree contains the following additional information:
This is my tree [February 20 2009]. [created as copy of 'tree_test1']
Now, if you export the tree, the header of the Newick file will contain:
[This is my tree [February 20 2009. [created as copy of 'tree_test1']
At least for drawgram, every opening square bracket in the header must have a matching closing square bracket. As you can see in the example above, there are three ([) but only one (]). So two closing square brackets are missing, and drawgram will never be able to read this file.
- Possible solution: ALWAYS write the same number of ([) as (]).
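The header behavior described above can be sketched in a few lines of Python. `arb_style_header` only mimics what this report observes (it is an assumption, not ARB's actual code), and `safe_header` is one hypothetical way to keep the brackets balanced, not necessarily the fix that was applied.

```python
def arb_style_header(remark: str) -> str:
    """Mimic the reported behavior: drop every ']' from the remark,
    then wrap the result in a single '[' ... ']' pair."""
    return "[" + remark.replace("]", "") + "]"


def brackets_balanced(text: str) -> bool:
    """Return True if every '[' has a matching ']' and none closes early."""
    depth = 0
    for ch in text:
        if ch == "[":
            depth += 1
        elif ch == "]":
            depth -= 1
            if depth < 0:
                return False
    return depth == 0


def safe_header(remark: str) -> str:
    """Hypothetical fix: turn inner square brackets into parentheses so
    that only the outer comment pair remains."""
    inner = remark.replace("[", "(").replace("]", ")")
    return "[" + inner + "]"


remark = "This is my tree [February 20 2009]. [created as copy of 'tree_test1']"

print(arb_style_header(remark))
# [This is my tree [February 20 2009. [created as copy of 'tree_test1']
print(brackets_balanced(arb_style_header(remark)))  # False
print(brackets_balanced(safe_header(remark)))       # True
```

A check like `brackets_balanced` could also be run on an exported file before passing it to drawgram.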
- Colons, semicolons, commas and brackets
- Occurrence : Tree→ Tree admin→ Export.
- Options: Newick format.
- Problems:
- All these symbols can alter the structure of the newick file.
- Even though the label of a branch is flanked by quotes in the Newick file, the file gets corrupted if the label contains any of these symbols.
- Colon: to specify the branch length
- Commas: to separate members of the same node
- Semicolon: to end the Newick file
- Brackets: to define a node
They are present in many fields such as "fullname_ltp", "tax_rdp_name", "journal", "tax_embl", "date", etc., and can certainly be present in personal fields.
- Example:
'Moraxella (subgen. Moraxella Lwoff 1939) lacunata, D64049, Moraxellaceae'.00350,
In this case, the presence of brackets makes drawgram crash when trying to open the Newick file.
- Possible solutions:
- In the case of brackets, perhaps ARB should delete these symbols from the field while exporting.
- The remaining symbols could be replaced by neutral symbols that do not affect the structure of the file, such as ( ), (_), (%), etc.
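A minimal sketch of the replacement idea, using the character set that comment:4 of this ticket lists ("()[]:;," plus the quote character in use). `sanitize_label` is a hypothetical helper for illustration and is not the code from the actual fix.

```python
# Characters with structural meaning in Newick, as listed in this ticket.
SPECIAL = "()[]:;,"


def sanitize_label(label: str, quote: str = "'", replacement: str = "_") -> str:
    """Replace structural characters (and the quote character in use)
    with a neutral replacement character."""
    table = str.maketrans({ch: replacement for ch in SPECIAL + quote})
    return label.translate(table)


label = "Moraxella (subgen. Moraxella Lwoff 1939) lacunata, D64049, Moraxellaceae"
print(sanitize_label(label))
# Moraxella _subgen. Moraxella Lwoff 1939_ lacunata_ D64049_ Moraxellaceae
```

The sanitized label no longer contains anything a reader could mistake for tree structure. It still contains spaces, so whether to quote it remains a separate choice.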
Change History (4)
comment:1 Changed 16 years ago by westram
- Description modified (diff)
- Owner changed from devel to westram
comment:2 Changed 16 years ago by westram
see also Ticket #148
comment:3 Changed 16 years ago by westram
comment:4 Changed 16 years ago by westram
- Resolution set to fixed
- Status changed from new to closed
Hi Pablo,
- fixed by [6066]
- regarding your individual requests:
1. Commas to separate fields
- may now replace all special characters ("()[]:;," plus currently used quote) by '_'
2. Single quotes adjoining branch labels generate confusion.
- may use double quotes or no quotes now
3. Square brackets
- already fixed
4. Colons, semicolons, commas and brackets
- see 1. (Commas to separate fields)
I'll build a version later, you will find it here
Newick format: