
#604 closed enhancement (discarded)

arb_parsimony should support multi cores

Reported by:	westram	Owned by:	westram
Priority:	major	Milestone:	wishlist
Component:	ARB_PARSIMONY	Version:	SVN
Keywords:		Cc:

Description (last modified by westram)

Possible parallel execution:

~~distance calculation (general):~~
- ~~divide alignment into regions; calculate regions asynchronously~~
  tested. no gain. far too fine-grained!
inserting a batch of species:
- ~~implement #643 (detect best insert positions for all species in one traversal)~~
- ~~calculate parsimony value asynchronously (using Cxx11 futures)~~
  implemented (as part of class CombinableSeq)
- insert_species_into_tree performs all combines in a loop
  - needs a generalization of CombinableSeq which also works for protein sequences
  - for async execution it has to create multiple instances of CombinableSeq
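
As an illustration of the futures idea listed above, here is a minimal sketch, not ARB code. `StateSets`, `fitch_combine` and `score_candidates_async` are hypothetical names, and the bitmask-per-column layout is an assumption standing in for CombinableSeq. Each task gets its own scratch buffer, which mirrors the point that async execution needs multiple instances.

```cpp
#include <cstddef>
#include <cstdint>
#include <future>
#include <vector>

// Hypothetical stand-in for a nucleotide sequence: one bitmask per column
// (bit 0 = A, bit 1 = C, bit 2 = G, bit 3 = T). This is NOT ARB's CombinableSeq.
using StateSets = std::vector<std::uint8_t>;

// Fitch-style combine of two sequences of equal length.
// Writes the ancestral state sets to 'out' and returns the mutation count.
long fitch_combine(const StateSets& a, const StateSets& b, StateSets& out) {
    out.resize(a.size());
    long mutations = 0;
    for (std::size_t i = 0; i < a.size(); ++i) {
        const std::uint8_t common = a[i] & b[i];
        if (common) {
            out[i] = common;        // shared state: no mutation needed
        } else {
            out[i] = a[i] | b[i];   // disjoint states: one mutation
            ++mutations;
        }
    }
    return mutations;
}

// Score one added sequence against several candidate neighbours.
// Every task owns its scratch buffer, so no state is shared between tasks.
std::vector<long> score_candidates_async(const StateSets& added,
                                         const std::vector<StateSets>& candidates) {
    std::vector<std::future<long>> pending;
    pending.reserve(candidates.size());
    for (const StateSets& cand : candidates) {
        pending.push_back(std::async(std::launch::async, [&added, &cand] {
            StateSets scratch;
            return fitch_combine(added, cand, scratch);
        }));
    }
    std::vector<long> costs;
    costs.reserve(pending.size());
    for (auto& task : pending) costs.push_back(task.get());  // wait in submission order
    return costs;
}
```

Calling `score_candidates_async(added, candidates)` returns one parsimony cost per candidate. One task per combine is a very fine granularity, so on short sequences the task overhead can easily outweigh the work itself.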

Change History (15)

comment:1 follow-ups: ↓ 2 ↓ 3 Changed 11 years ago by epruesse

Brief benchmark:

On the SSU Ref NR 111 the below script removes 313 sequences from the tree and adds them again. Whether running 1 or 8 instances of ARB in parallel on a 16-core Xeon, arb_progress estimates 18 minutes for adding the sequences.

So we're not memory bound here.

According to perf, 85% of the time is spent in combine(). My guess is that it's the loop over the sequence width. Since this one works "by character", SIMD by SSE might be an easier option than "SIMT" by OpenMP or explicit threading.

#!/usr/bin/perl
use strict;
use warnings;

use lib "$ENV{'ARBHOME'}/lib/";
use ARB;

my $gb_main = ARB::open(":","r");
if (not $gb_main) {
  my $error = ARB::await_error();
  die "$error";
}

# recording started @ Thu Sep 25 06:26:55 2014
BIO::remote_action($gb_main,'ARB_NT','species_search');
BIO::remote_awar($gb_main,'ARB_NT','tmp/dbquery_spec/key_0','full_name');
BIO::remote_awar($gb_main,'ARB_NT','tmp/dbquery_spec/query_0','pseudo*');
BIO::remote_awar($gb_main,'ARB_NT','tmp/dbquery_spec/query_1','*9');
BIO::remote_awar($gb_main,'ARB_NT','tmp/dbquery_spec/operator_1','and');
BIO::remote_action($gb_main,'ARB_NT','SPECIES_QUERY/SEARCH_spec');
BIO::remote_action($gb_main,'ARB_NT','SPECIES_QUERY/MARK_LISTED_UNMARK_REST');
BIO::remote_action($gb_main,'ARB_NT','SPECIES_QUERY/CLOSE');

BIO::remote_action($gb_main,'ARB_NT','ARB_NT/tree_remove_marked');
BIO::remote_action($gb_main,'ARB_NT','arb_pars_quick');
BIO::remote_action($gb_main,'ARB_PARS','PARS_PROPS/SELECT_FILTER');
BIO::remote_awar($gb_main,'ARB_PARS','tmp/pars/filter/subname',' datapos_var_ssuref:bacteria');
BIO::remote_awar($gb_main,'ARB_PARS','tmp/pars/filter/cancel','1234567.0-=');
BIO::remote_action($gb_main,'ARB_PARS','FILTER_SELECT_646076448/CLOSE');
BIO::remote_action($gb_main,'ARB_PARS','PARS_PROPS/GO');
# recording stopped @ Thu Sep 25 06:30:23 2014
ARB::close($gb_main);

comment:2 in reply to: ↑ 1 Changed 11 years ago by westram

Replying to epruesse:

According to perf, 85% of the time is spent in combine().

That is good news. Thank you for benchmarking!

comment:3 in reply to: ↑ 1 ; follow-ups: ↓ 4 ↓ 6 Changed 11 years ago by westram

Owner changed from devel to westram
Status changed from new to assigned

Replying to epruesse:

According to perf, 85% of the time is spent in combine(). My guess is that it's the loop over the sequence width. Since this one works "by character", SIMD by SSE might be an easier option than "SIMT" by OpenMP or explicit threading.

loop optimization was implemented with log:trunk@12967-12970,12973,12975
- speedup 2x-4x
need some automated check whether loop optimization really happens (otherwise this will regress)
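
For readers unfamiliar with what such a loop optimization looks like, here is a minimal sketch of a per-column loop written so that a compiler has a chance to auto-vectorize it. It is not the code from the changesets above: `combine_branch_free` is a hypothetical name and the one-bitmask-per-column layout is an assumption. Whether the loop is actually vectorized depends on compiler version and flags; with GCC, building with `-O3 -fopt-info-vec` reports which loops were vectorized.

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical per-column combine over 'width' alignment columns.
// Each byte is a bitmask of possible states for that column.
// The body avoids data-dependent branches so it can map onto SIMD selects.
long combine_branch_free(const std::uint8_t* a, const std::uint8_t* b,
                         std::uint8_t* out, std::size_t width) {
    long mutations = 0;
    for (std::size_t i = 0; i < width; ++i) {
        const std::uint8_t common  = a[i] & b[i];   // states shared by both inputs
        const std::uint8_t unified = a[i] | b[i];   // fallback if nothing is shared
        const bool mismatch = (common == 0);
        out[i] = mismatch ? unified : common;       // select instead of if/else
        mutations += mismatch;                      // adds 0 or 1 per column
    }
    return mutations;
}
```

The design choice is to compute both candidates and select between them, which keeps every iteration independent and uniform. If `out` may overlap `a` or `b`, the compiler may have to add runtime alias checks or give up on vectorizing.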

comment:4 in reply to: ↑ 3 ; follow-up: ↓ 5 Changed 11 years ago by epruesse

Replying to westram:

need some automated check whether loop optimization really happens (otherwise this will regress)

I'd already considered this, but found no nice way to do it. Hence the comment in the file. The loop needs only AP_filter, so I think it's unlikely to change randomly. My hope would also be for GCC's vectorizer to get smarter over time, making this more robust.

Options:

enable "-fopt-info" and grep output for "LOOP VECTORIZED"
- would have to do this during build, awkward
check binary for SSE instructions
- doesn't really work, tons of things are using SSE, memset among others
use intrinsics to hand-code SSE (can steal from GCC code)
- lots of extra code, not easy to maintain, needs switches to detect support
do performance tests to check for degradation
- would be good for all of ARB, but requires an isolated test platform

comment:5 in reply to: ↑ 4 Changed 11 years ago by westram

Replying to epruesse:

I'd already considered this, but found no nice way to do it. Hence the comment in the file. The loop needs only AP_filter, so I think it's unlikely to change randomly. My hope would also be for GCC's vectorizer to get smarter over time, making this more robust.

enable "-fopt-info" and grep output for "LOOP VECTORIZED"
would have to do this during build, awkward

That's my plan. I agree that it's unlikely for the involved code to regress, but I hope you agree it's normal that unlikely things happen during development.

comment:6 in reply to: ↑ 3 Changed 10 years ago by westram

need some automated check whether loop optimization really happens (otherwise this will regress)

implemented by [13443]

comment:7 Changed 9 years ago by westram

Description modified (diff)
Milestone set to wishlist2016

comment:8 Changed 9 years ago by westram

Priority changed from normal to major

comment:9 Changed 9 years ago by westram

Milestone changed from wishlist2016 to wishlist

Milestone renamed

comment:10 Changed 8 years ago by westram

Version set to SVN

comment:11 Changed 8 years ago by westram

Milestone changed from wishlist to r17q4

comment:12 Changed 8 years ago by westram

Status changed from assigned to accepted

comment:13 Changed 8 years ago by westram

Status changed from accepted to _started

comment:14 Changed 7 years ago by westram

Description modified (diff)
Milestone changed from r17q4 to wishlist
Status changed from _started to assigned

comment:15 Changed 6 years ago by westram

Resolution set to discarded
Status changed from assigned to closed

tested optimizations:

asynchronous execution of combine itself: far too fine-granular
asynchronous insertion of all added species (at one insert position):
- uses much memory
- still too fine-granular; runtime approx. 2000%

conclusion:

optimization performance will not benefit from parallel execution
species insertion may benefit, but only
- by increasing the memory footprint and
- by complicating the arb-pars-code.
Insertion is not that time critical since #643.

⇒ implementation was discarded.
