Opened 4 years ago

Last modified 8 months ago

#604 assigned enhancement

arb_parsimony should support multi cores

Reported by: westram
Owned by: westram
Priority: major
Milestone: wishlist
Component: ARB_PARSIMONY
Version: SVN
Keywords:
Cc:

Description (last modified by westram)

Possible parallel execution:

  • distance calculation (general):
    • divide alignment into regions; calculate regions asynchronously
      (./) tested. no gain. far too fine-grained!
  • inserting a batch of species:
    • implement #643 (detect best insert positions for all species in one traversal) (./)
    • calculate parsimony value asynchronously (using Cxx11 futures)
      (./) implemented (as part of class CombinableSeq)
    • insert_species_into_tree performs all combines in a loop
      • needs a generalization of CombinableSeq which also works for protein sequences
      • for async execution it has to create multiple instances of CombinableSeq

Change History (14)

comment:1 follow-ups: Changed 4 years ago by epruesse

Brief benchmark:

On the SSU Ref NR 111, the script below removes 313 sequences from the tree and adds them again. With 1-8 instances of ARB running on a 16-core Xeon, arb_progress estimates 18 minutes for adding the sequences.

So we're not memory bound here.

According to perf, 85% of the time is spent in combine(). My guess is that it's the loop over the sequence width. Since this one works "by character", SIMD by SSE might be an easier option than "SIMT" by OpenMP or explicit threading.

#!/usr/bin/perl
use strict;
use warnings;

use lib "$ENV{'ARBHOME'}/lib/";
use ARB;

my $gb_main = ARB::open(":","r");
if (not $gb_main) {
  my $error = ARB::await_error();
  die "$error";
}

# recording started @ Thu Sep 25 06:26:55 2014
BIO::remote_action($gb_main,'ARB_NT','species_search');
BIO::remote_awar($gb_main,'ARB_NT','tmp/dbquery_spec/key_0','full_name');
BIO::remote_awar($gb_main,'ARB_NT','tmp/dbquery_spec/query_0','pseudo*');
BIO::remote_awar($gb_main,'ARB_NT','tmp/dbquery_spec/query_1','*9');
BIO::remote_awar($gb_main,'ARB_NT','tmp/dbquery_spec/operator_1','and');
BIO::remote_action($gb_main,'ARB_NT','SPECIES_QUERY/SEARCH_spec');
BIO::remote_action($gb_main,'ARB_NT','SPECIES_QUERY/MARK_LISTED_UNMARK_REST');
BIO::remote_action($gb_main,'ARB_NT','SPECIES_QUERY/CLOSE');

BIO::remote_action($gb_main,'ARB_NT','ARB_NT/tree_remove_marked');
BIO::remote_action($gb_main,'ARB_NT','arb_pars_quick');
BIO::remote_action($gb_main,'ARB_PARS','PARS_PROPS/SELECT_FILTER');
BIO::remote_awar($gb_main,'ARB_PARS','tmp/pars/filter/subname',' datapos_var_ssuref:bacteria');
BIO::remote_awar($gb_main,'ARB_PARS','tmp/pars/filter/cancel','1234567.0-=');
BIO::remote_action($gb_main,'ARB_PARS','FILTER_SELECT_646076448/CLOSE');
BIO::remote_action($gb_main,'ARB_PARS','PARS_PROPS/GO');
# recording stopped @ Thu Sep 25 06:30:23 2014
ARB::close($gb_main);

comment:2 in reply to: ↑ 1 Changed 4 years ago by westram

Replying to epruesse:

According to perf, 85% of the time is spent in combine().

That is good news :) Thank you for benchmarking!

comment:3 in reply to: ↑ 1 ; follow-ups: Changed 4 years ago by westram

  • Owner changed from devel to westram
  • Status changed from new to assigned

Replying to epruesse:

According to perf, 85% of the time is spent in combine(). My guess is that it's the loop over the sequence width. Since this one works "by character", SIMD by SSE might be an easier option than "SIMT" by OpenMP or explicit threading.

  • loop optimization was implemented with log:trunk@12967-12970,12973,12975
    • speedup 2x-4x
  • need some automated check whether loop optimization really happens (otherwise this will regress)
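For illustration, a "by character" combine loop of the kind discussed here could look like the sketch below. This is an assumption about the general shape of such a loop, not the code from the changesets above: `combine_columns` is a hypothetical name, and the columns are assumed to be stored as base-set bitmasks. The body avoids early exits and function calls, which is the form auto-vectorizers tend to handle; whether a given compiler actually vectorizes it depends on its version and flags (e.g. `-O3` with GCC).

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical Fitch-style combine over base-set bitmasks (one byte per
// alignment column). Writes the combined set for each column to `out` and
// returns the number of columns that required a mutation.
long combine_columns(const std::uint8_t* a, const std::uint8_t* b,
                     std::uint8_t* out, std::size_t n) {
    long mutations = 0;
    for (std::size_t i = 0; i < n; ++i) {
        const std::uint8_t common = static_cast<std::uint8_t>(a[i] & b[i]);
        const std::uint8_t either = static_cast<std::uint8_t>(a[i] | b[i]);
        const bool mismatch = (common == 0);
        // Intersection if the sets overlap, union (plus one mutation) if not.
        out[i] = mismatch ? either : common;
        mutations += mismatch;
    }
    return mutations;
}
```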

comment:4 in reply to: ↑ 3 ; follow-up: Changed 4 years ago by epruesse

Replying to westram:

  • need some automated check whether loop optimization really happens (otherwise this will regress)

I'd already considered this, but found no nice way to do it. Hence the comment in the file. The loop needs only AP_filter, so I think it's unlikely to change randomly. My hope would also be for GCC's vectorizer to get smarter over time, making this more robust.

Options:

  1. enable "-fopt-info" and grep output for "LOOP VECTORIZED"
    • would have to do this during build, awkward
  2. check binary for SSE instructions
    • doesn't really work, tons of things are using SSE, memset among others
  3. use intrinsics to hand-code SSE (can steal from GCC code)
    • lots of extra code, not easy to maintain, needs switches to detect support
  4. do performance tests to check for degradation
    • would be good for all of ARB, but requires an isolated test platform
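To make the trade-off in option 3 concrete, here is a rough sketch of a hand-coded SSE2 variant with a scalar fallback. `count_mismatches` is a hypothetical function that only counts non-intersecting columns; it is not taken from ARB or GCC. The `__SSE2__` guard is the kind of "switch to detect support" mentioned above (GCC and Clang define that macro when SSE2 is enabled; other compilers take the scalar path here).

```cpp
#include <cstddef>
#include <cstdint>
#if defined(__SSE2__)
#include <emmintrin.h>  // SSE2 intrinsics
#endif

// Hypothetical helper: count the columns whose base-set bitmasks do not
// intersect, i.e. the columns that would cost one mutation.
long count_mismatches(const std::uint8_t* a, const std::uint8_t* b,
                      std::size_t n) {
    long count = 0;
    std::size_t i = 0;
#if defined(__SSE2__)
    const __m128i zero = _mm_setzero_si128();
    for (; i + 16 <= n; i += 16) {
        // Unaligned loads of 16 columns from each input.
        const __m128i va = _mm_loadu_si128(reinterpret_cast<const __m128i*>(a + i));
        const __m128i vb = _mm_loadu_si128(reinterpret_cast<const __m128i*>(b + i));
        // Bytes become 0xFF where the intersection is empty.
        const __m128i empty = _mm_cmpeq_epi8(_mm_and_si128(va, vb), zero);
        // One mask bit per column; count the set bits.
        unsigned mask = static_cast<unsigned>(_mm_movemask_epi8(empty));
        while (mask != 0) {
            mask &= mask - 1;  // clear the lowest set bit
            ++count;
        }
    }
#endif
    // Scalar tail, and the complete loop when SSE2 is not available.
    for (; i < n; ++i) {
        count += (a[i] & b[i]) == 0;
    }
    return count;
}
```

Even this small helper needs two code paths and a feature guard, which illustrates the maintenance cost named in option 3.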

comment:5 in reply to: ↑ 4 Changed 4 years ago by westram

Replying to epruesse:

I'd already considered this, but found no nice way to do it. Hence the comment in the file. The loop needs only AP_filter, so I think it's unlikely to change randomly. My hope would also be for GCC's vectorizer to get smarter over time, making this more robust.

  1. enable "-fopt-info" and grep output for "LOOP VECTORIZED"
    • would have to do this during build, awkward

That's my plan. I agree that it's unlikely for the involved code to regress, but I hope you agree it's normal that unlikely things happen during development. ;-)

comment:6 in reply to: ↑ 3 Changed 4 years ago by westram

  • need some automated check whether loop optimization really happens (otherwise this will regress)

implemented by [13443]

comment:7 Changed 3 years ago by westram

  • Description modified (diff)
  • Milestone set to wishlist2016

comment:8 Changed 3 years ago by westram

  • Priority changed from normal to major

comment:9 Changed 3 years ago by westram

  • Milestone changed from wishlist2016 to wishlist

Milestone renamed

comment:10 Changed 22 months ago by westram

  • Version set to SVN

comment:11 Changed 21 months ago by westram

  • Milestone changed from wishlist to r17q4

comment:12 Changed 13 months ago by westram

  • Status changed from assigned to accepted

comment:13 Changed 13 months ago by westram

  • Status changed from accepted to _started

comment:14 Changed 8 months ago by westram

  • Description modified (diff)
  • Milestone changed from r17q4 to wishlist
  • Status changed from _started to assigned