Opened 10 years ago

Closed 5 years ago

#604 closed enhancement (discarded)

arb_parsimony should support multi cores

Reported by: westram
Owned by: westram
Priority: major
Milestone: wishlist
Component: ARB_PARSIMONY
Version: SVN
Keywords:
Cc:

Description (last modified by westram)

Possible parallel execution:

  • distance calculation (general):
    • divide alignment into regions; calculate regions asynchronously
      (./) tested. no gain. far too fine-grained!
  • inserting a batch of species:
    • implement #643 (detect best insert positions for all species in one traversal) (./)
    • calculate parsimony value asynchronously (using Cxx11 futures)
      (./) implemented (as part of class CombinableSeq)
    • insert_species_into_tree performs all combines in a loop
      • needs a generalization of CombinableSeq which also works for protein sequences
      • for async execution it has to create multiple instances of CombinableSeq
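
A minimal sketch of the futures idea, for illustration only: each candidate insert position is scored in its own std::async task and the results are collected through std::future. The names Seq, mutation_count and score_candidates_async are hypothetical and do not come from ARB; CombinableSeq itself is not shown. The cost function assumes a Fitch-style bitset per alignment column, which is an assumption and not something this ticket spells out.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <future>
#include <vector>

// Hypothetical sequence type: one bitset of possible bases per column.
using Seq = std::vector<std::uint8_t>;

// Hypothetical cost: number of columns where the two state sets are disjoint.
long mutation_count(const Seq& a, const Seq& b) {
    assert(a.size() == b.size());
    long mutations = 0;
    for (std::size_t i = 0; i < a.size(); ++i) {
        mutations += ((a[i] & b[i]) == 0);
    }
    return mutations;
}

// Score 'added' against every candidate; one asynchronous task per candidate.
std::vector<long> score_candidates_async(const Seq& added,
                                         const std::vector<Seq>& candidates) {
    std::vector<std::future<long>> pending;
    pending.reserve(candidates.size());
    for (const Seq& candidate : candidates) {
        // The captured references stay valid: every future is consumed
        // (or destroyed, which also waits) before this function returns.
        pending.push_back(std::async(std::launch::async, [&added, &candidate] {
            return mutation_count(added, candidate);
        }));
    }

    std::vector<long> scores;
    scores.reserve(pending.size());
    for (std::future<long>& f : pending) {
        scores.push_back(f.get());  // blocks until that task has finished
    }
    return scores;
}
```

One task per combine is a very fine granularity, so the sketch shows the mechanism rather than a recommended design. Older toolchains may need -pthread when linking.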

Change History (15)

comment:1 follow-ups: ↓ 2 ↓ 3 Changed 10 years ago by epruesse

Brief benchmark:

On SSU Ref NR 111, the script below removes 313 sequences from the tree and adds them again. Running anywhere from 1 to 8 instances of ARB on a 16-core Xeon, arb_progress estimates 18 minutes for adding the sequences in every case.

The estimate does not grow with the number of instances, so we're not memory bound here.

According to perf, 85% of the time is spent in combine(). My guess is that it's the loop over the sequence width. Since this one works "by character", SIMD by SSE might be an easier option than "SIMT" by OpenMP or explicit threading.

#!/usr/bin/perl
use strict;
use warnings;

use lib "$ENV{'ARBHOME'}/lib/";
use ARB;

my $gb_main = ARB::open(":","r");
if (not $gb_main) {
  my $error = ARB::await_error();
  die "$error";
}

# recording started @ Thu Sep 25 06:26:55 2014
BIO::remote_action($gb_main,'ARB_NT','species_search');
BIO::remote_awar($gb_main,'ARB_NT','tmp/dbquery_spec/key_0','full_name');
BIO::remote_awar($gb_main,'ARB_NT','tmp/dbquery_spec/query_0','pseudo*');
BIO::remote_awar($gb_main,'ARB_NT','tmp/dbquery_spec/query_1','*9');
BIO::remote_awar($gb_main,'ARB_NT','tmp/dbquery_spec/operator_1','and');
BIO::remote_action($gb_main,'ARB_NT','SPECIES_QUERY/SEARCH_spec');
BIO::remote_action($gb_main,'ARB_NT','SPECIES_QUERY/MARK_LISTED_UNMARK_REST');
BIO::remote_action($gb_main,'ARB_NT','SPECIES_QUERY/CLOSE');

BIO::remote_action($gb_main,'ARB_NT','ARB_NT/tree_remove_marked');
BIO::remote_action($gb_main,'ARB_NT','arb_pars_quick');
BIO::remote_action($gb_main,'ARB_PARS','PARS_PROPS/SELECT_FILTER');
BIO::remote_awar($gb_main,'ARB_PARS','tmp/pars/filter/subname',' datapos_var_ssuref:bacteria');
BIO::remote_awar($gb_main,'ARB_PARS','tmp/pars/filter/cancel','1234567.0-=');
BIO::remote_action($gb_main,'ARB_PARS','FILTER_SELECT_646076448/CLOSE');
BIO::remote_action($gb_main,'ARB_PARS','PARS_PROPS/GO');
# recording stopped @ Thu Sep 25 06:30:23 2014
ARB::close($gb_main);
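
The kind of per-character loop suspected above can be sketched as follows. This is not ARB's combine(); it is a hypothetical stand-in that assumes a Fitch-style representation with one bitset of possible bases per alignment column (for example A=1, C=2, G=4, T=8). The loop body is kept branch-free over plain arrays so that a compiler's auto-vectorizer has a chance to turn it into SIMD code; whether that actually happens depends on the compiler and its flags.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical stand-in for a combine step; names are illustrative only.
// Writes the parent's state sets into 'parent' and returns the number of
// columns where the two children share no base (one mutation per column).
long combine_sketch(const std::vector<std::uint8_t>& left,
                    const std::vector<std::uint8_t>& right,
                    std::vector<std::uint8_t>& parent) {
    assert(left.size() == right.size());
    const std::size_t n = left.size();
    parent.resize(n);

    // Raw pointers make the independence of the iterations easy to see.
    const std::uint8_t* l = left.data();
    const std::uint8_t* r = right.data();
    std::uint8_t*       p = parent.data();

    long mutations = 0;
    for (std::size_t i = 0; i < n; ++i) {
        const std::uint8_t common = static_cast<std::uint8_t>(l[i] & r[i]);
        const std::uint8_t either = static_cast<std::uint8_t>(l[i] | r[i]);
        const bool disjoint = (common == 0);
        p[i] = disjoint ? either : common;  // union if disjoint, else intersection
        mutations += disjoint;              // count without branching
    }
    return mutations;
}
```

Each iteration reads and writes only column i, which is what makes this "by character" loop a SIMD candidate rather than a threading candidate.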

comment:2 in reply to: ↑ 1 Changed 10 years ago by westram

Replying to epruesse:

According to perf, 85% of the time is spent in combine().

That is good news :) Thank you for benchmarking!

comment:3 in reply to: ↑ 1 ; follow-ups: ↓ 4 ↓ 6 Changed 10 years ago by westram

  • Owner changed from devel to westram
  • Status changed from new to assigned

Replying to epruesse:

According to perf, 85% of the time is spent in combine(). My guess is that it's the loop over the sequence width. Since this one works "by character", SIMD by SSE might be an easier option than "SIMT" by OpenMP or explicit threading.

  • loop optimization was implemented with log:trunk@12967-12970,12973,12975
    • speedup 2x-4x
  • need some automated check whether loop optimization really happens (otherwise this will regress)

comment:4 in reply to: ↑ 3 ; follow-up: ↓ 5 Changed 10 years ago by epruesse

Replying to westram:

  • need some automated check whether loop optimization really happens (otherwise this will regress)

I'd already considered this, but found no nice way to do it. Hence the comment in the file. The loop needs only AP_filter, so I think it's unlikely to change randomly. My hope would also be for GCC's vectorizer to get smarter over time, making this more robust.

Options:

  1. enable "-fopt-info" and grep output for "LOOP VECTORIZED"
    • would have to do this during build, awkward
  2. check the binary for SSE instructions
    • doesn't really work, tons of things are using SSE, memset among others
  3. use intrinsics to hand-code SSE (can steal from GCC code)
    • lots of extra code, not easy to maintain, needs switches to detect support
  4. do performance tests to check for degradation
    • would be good for all of ARB, but requires an isolated test platform
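
To make the maintenance cost of option 3 concrete, here is a hedged sketch of what a hand-coded SSE2 variant might look like. It is not taken from ARB or GCC; the function name and the Fitch-style bitset-per-column representation are assumptions. The `#if defined(__SSE2__)` guard is the kind of support switch mentioned above, and the scalar loop handles both the tail and builds without SSE2.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>
#if defined(__SSE2__)
#include <emmintrin.h>  // SSE2 intrinsics
#endif

// Hypothetical hand-vectorized combine; returns the number of columns whose
// child state sets are disjoint and writes the parent state sets.
long combine_sse2_sketch(const std::vector<std::uint8_t>& left,
                         const std::vector<std::uint8_t>& right,
                         std::vector<std::uint8_t>& parent) {
    assert(left.size() == right.size());
    const std::size_t n = left.size();
    parent.resize(n);
    const std::uint8_t* l = left.data();
    const std::uint8_t* r = right.data();
    std::uint8_t*       p = parent.data();

    long mutations = 0;
    std::size_t i = 0;
#if defined(__SSE2__)
    const __m128i zero = _mm_setzero_si128();
    for (; i + 16 <= n; i += 16) {  // 16 columns per iteration
        const __m128i a = _mm_loadu_si128(reinterpret_cast<const __m128i*>(l + i));
        const __m128i b = _mm_loadu_si128(reinterpret_cast<const __m128i*>(r + i));
        const __m128i common   = _mm_and_si128(a, b);
        const __m128i either   = _mm_or_si128(a, b);
        const __m128i disjoint = _mm_cmpeq_epi8(common, zero);  // 0xFF where empty
        // Where disjoint, 'common' is 0, so OR-ing in the masked union selects it.
        const __m128i out = _mm_or_si128(common, _mm_and_si128(either, disjoint));
        _mm_storeu_si128(reinterpret_cast<__m128i*>(p + i), out);
        // One mask bit per column; count the set bits portably.
        unsigned mask = static_cast<unsigned>(_mm_movemask_epi8(disjoint));
        for (; mask != 0; mask &= mask - 1) ++mutations;
    }
#endif
    for (; i < n; ++i) {  // scalar tail, and full fallback without SSE2
        const std::uint8_t common = static_cast<std::uint8_t>(l[i] & r[i]);
        const bool disjoint = (common == 0);
        p[i] = disjoint ? static_cast<std::uint8_t>(l[i] | r[i]) : common;
        mutations += disjoint;
    }
    return mutations;
}
```

Even this small kernel needs two code paths that must be kept in agreement, which illustrates the "lots of extra code" concern.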

comment:5 in reply to: ↑ 4 Changed 10 years ago by westram

Replying to epruesse:

I'd already considered this, but found no nice way to do it. Hence the comment in the file. The loop needs only AP_filter, so I think it's unlikely to change randomly. My hope would also be for GCC's vectorizer to get smarter over time, making this more robust.

  1. enable "-fopt-info" and grep output for "LOOP VECTORIZED"
    • would have to do this during build, awkward

That's my plan. I agree that it's unlikely for the involved code to regress, but I hope you agree it's normal that unlikely things happen during development. ;-)

comment:6 in reply to: ↑ 3 Changed 10 years ago by westram

  • need some automated check whether loop optimization really happens (otherwise this will regress)

implemented by [13443]

comment:7 Changed 9 years ago by westram

  • Description modified (diff)
  • Milestone set to wishlist2016

comment:8 Changed 9 years ago by westram

  • Priority changed from normal to major

comment:9 Changed 9 years ago by westram

  • Milestone changed from wishlist2016 to wishlist

Milestone renamed

comment:10 Changed 8 years ago by westram

  • Version set to SVN

comment:11 Changed 8 years ago by westram

  • Milestone changed from wishlist to r17q4

comment:12 Changed 7 years ago by westram

  • Status changed from assigned to accepted

comment:13 Changed 7 years ago by westram

  • Status changed from accepted to _started

comment:14 Changed 7 years ago by westram

  • Description modified (diff)
  • Milestone changed from r17q4 to wishlist
  • Status changed from _started to assigned

comment:15 Changed 5 years ago by westram

  • Resolution set to discarded
  • Status changed from assigned to closed

tested optimizations:

  • asynchronous execution of combine itself: far too fine-grained
  • asynchronous insertion of all added species (at one insert position):
    • uses a lot of memory
    • still too fine-grained; runtime approx. 2000% :-(

conclusion:

  • tree optimization performance will not benefit from parallel execution
  • species insertion may benefit, but only
    • by increasing the memory footprint and
    • by complicating the arb-pars code.
    Insertion is not that time-critical since #643.

⇒ implementation was discarded.
