Opened 10 years ago
Closed 5 years ago
#604 closed enhancement (discarded)
arb_parsimony should support multi cores
Reported by: | westram | Owned by: | westram |
---|---|---|---|
Priority: | major | Milestone: | wishlist |
Component: | ARB_PARSIMONY | Version: | SVN |
Keywords: | Cc: |
Description (last modified by westram)
Possible parallel execution:
- distance calculation (general):
  - divide alignment into regions; calculate regions asynchronously
    - tested. no gain. far too fine-grained!
- inserting a batch of species:
  - implement #643 (detect best insert positions for all species in one traversal)
  - calculate parsimony value asynchronously (using Cxx11 futures)
    - implemented (as part of class CombinableSeq)
    - insert_species_into_tree performs all combines in a loop
    - needs a generalization of CombinableSeq which also works for protein sequences
    - for async execution it has to create multiple instances of CombinableSeq
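The region-splitting idea from the description can be sketched with C++11 futures as follows. This is only an illustration under stated assumptions: sequences are plain strings, a simple mismatch count stands in for ARB's real distance or parsimony calculation, and the names count_mismatches and count_mismatches_async are hypothetical and do not exist in ARB.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <future>
#include <string>
#include <vector>

// Count mismatching columns in [begin, end) of two aligned sequences.
// Stand-in for the real per-region calculation (assumption).
static std::size_t count_mismatches(const std::string& a, const std::string& b,
                                    std::size_t begin, std::size_t end) {
    std::size_t mismatches = 0;
    for (std::size_t i = begin; i < end; ++i) {
        mismatches += (a[i] != b[i]);
    }
    return mismatches;
}

// Split the alignment into at most `regions` chunks and evaluate each chunk
// in its own asynchronous task. Hypothetical helper, not part of ARB.
std::size_t count_mismatches_async(const std::string& a, const std::string& b,
                                   std::size_t regions) {
    assert(a.size() == b.size());
    assert(regions > 0);

    const std::size_t width = a.size();
    const std::size_t chunk = (width + regions - 1) / regions; // rounded up

    std::vector<std::future<std::size_t>> parts;
    for (std::size_t begin = 0; begin < width; begin += chunk) {
        const std::size_t end = std::min(begin + chunk, width);
        parts.push_back(std::async(std::launch::async, [&a, &b, begin, end] {
            return count_mismatches(a, b, begin, end);
        }));
    }

    // Collect the partial results; get() blocks until each task has finished.
    std::size_t total = 0;
    for (auto& part : parts) total += part.get();
    return total;
}
```

Each task here does very little work per column, so the cost of launching and joining tasks can easily exceed the work itself, which is consistent with the "far too fine-grained" observation. Depending on the toolchain, linking may require -pthread.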
Change History (15)
comment:1 follow-ups: ↓ 2 ↓ 3 Changed 10 years ago by epruesse
comment:2 in reply to: ↑ 1 Changed 10 years ago by westram
Replying to epruesse:
According to perf, 85% of the time is spent in combine().
That is good news. Thank you for benchmarking!
comment:3 in reply to: ↑ 1 ; follow-ups: ↓ 4 ↓ 6 Changed 10 years ago by westram
- Owner changed from devel to westram
- Status changed from new to assigned
Replying to epruesse:
According to perf, 85% of the time is spent in combine(). My guess is that it's the loop over the sequence width. Since this one works "by character", SIMD by SSE might be an easier option than "SIMT" by OpenMP or explicit threading.
- loop optimization was implemented with log:trunk@12967-12970,12973,12975
- speedup 2x-4x
- need some automated check whether loop optimization really happens (otherwise this will regress)
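To illustrate what a per-character loop that a compiler can auto-vectorize might look like, here is a minimal sketch of a Fitch-style combine step. It is not ARB's actual combine(): the bitmask encoding (A=1, C=2, G=4, T=8), the type BaseSet, and the function combine_fitch are assumptions made for this example, and weights and AP_filter handling are left out.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Assumption: each alignment column holds a bitmask of possible bases
// (A=1, C=2, G=4, T=8). This is not necessarily ARB's internal layout.
using BaseSet = std::uint8_t;

// Combine two child sequences into their parent (Fitch parsimony) and
// return the number of columns that required a mutation.
std::size_t combine_fitch(const std::vector<BaseSet>& left,
                          const std::vector<BaseSet>& right,
                          std::vector<BaseSet>& parent) {
    assert(left.size() == right.size());
    const std::size_t width = left.size();
    parent.resize(width);

    const BaseSet* l = left.data();
    const BaseSet* r = right.data();
    BaseSet*       p = parent.data();

    std::size_t mutations = 0;
    // Flat loop over the sequence width with no early exits and no function
    // calls, so the body can be compiled to selects instead of branches.
    for (std::size_t i = 0; i < width; ++i) {
        const BaseSet common   = static_cast<BaseSet>(l[i] & r[i]);
        const BaseSet either   = static_cast<BaseSet>(l[i] | r[i]);
        const bool    mismatch = (common == 0);
        p[i] = mismatch ? either : common; // union if no common base, else intersection
        mutations += mismatch;
    }
    return mutations;
}
```

Whether a given compiler actually vectorizes such a loop depends on its version and flags, which is why an automated check is worth having.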
comment:4 in reply to: ↑ 3 ; follow-up: ↓ 5 Changed 10 years ago by epruesse
Replying to westram:
- need some automated check whether loop optimization really happens (otherwise this will regress)
I'd already considered this, but found no nice way to do it. Hence the comment in the file. The loop needs only AP_filter, so I think it's unlikely to change randomly. My hope would also be for GCC's vectorizer to get smarter over time, making this more robust.
Options:
- enable "-fopt-info" and grep output for "LOOP VECTORIZED"
- would have to do this during build, awkward
- check binary for SSE instructions
- doesn't really work, tons of things are using SSE, memset among others
- use intrinsics to hand-code SSE (can steal from GCC code)
- lots of extra code, not easy to maintain, needs switches to detect support
- do performance tests to check for degradation
- would be good for all of ARB, but requires an isolated test platform
comment:5 in reply to: ↑ 4 Changed 10 years ago by westram
Replying to epruesse:
I'd already considered this, but found no nice way to do it. Hence the comment in the file. The loop needs only AP_filter, so I think it's unlikely to change randomly. My hope would also be for GCC's vectorizer to get smarter over time, making this more robust.
- enable "-fopt-info" and grep output for "LOOP VECTORIZED"
- would have to do this during build, awkward
That's my plan. I agree that it's unlikely for the involved code to regress, but I hope you agree it's normal that unlikely things happen during development.
comment:6 in reply to: ↑ 3 Changed 10 years ago by westram
- need some automated check whether loop optimization really happens (otherwise this will regress)
implemented by [13443]
comment:7 Changed 9 years ago by westram
- Description modified (diff)
- Milestone set to wishlist2016
comment:8 Changed 9 years ago by westram
- Priority changed from normal to major
comment:9 Changed 9 years ago by westram
- Milestone changed from wishlist2016 to wishlist
Milestone renamed
comment:10 Changed 8 years ago by westram
- Version set to SVN
comment:11 Changed 8 years ago by westram
- Milestone changed from wishlist to r17q4
comment:12 Changed 7 years ago by westram
- Status changed from assigned to accepted
comment:13 Changed 7 years ago by westram
- Status changed from accepted to _started
comment:14 Changed 6 years ago by westram
- Description modified (diff)
- Milestone changed from r17q4 to wishlist
- Status changed from _started to assigned
comment:15 Changed 5 years ago by westram
- Resolution set to discarded
- Status changed from assigned to closed
tested optimizations:
- asynchronous execution of combine itself: far too fine-grained
- asynchronous insertion of all added species (at one insert position):
- uses much memory
- still too fine-grained; runtime approx. 2000%
conclusion:
- optimization performance will not benefit from parallel execution
- species insertion may benefit, but only at the cost of
- a larger memory footprint and
- more complicated arb-pars code.
⇒ implementation was discarded.
Brief benchmark:
On the SSU Ref NR 111, the script below removes 313 sequences from the tree and adds them again. With 1-8 instances of ARB running on a 16-core Xeon, arb_progress estimates 18 minutes for adding the sequences.
So we're not memory bound here.
According to perf, 85% of the time is spent in combine(). My guess is that it's the loop over the sequence width. Since this one works "by character", SIMD by SSE might be an easier option than "SIMT" by OpenMP or explicit threading.