#665 closed enhancement (implemented)
save/load database zip-compressed
Reported by: | westram | Owned by: | westram |
---|---|---|---|
Priority: | normal | Milestone: | arb7.0 |
Component: | Library (DB) | Version: | |
Keywords: | | Cc: | cquast@… |
Description
- open pipe instead of file
- read: FILE *in = popen("gunzip < zipped.arb", "r");
- write: FILE *out = popen("gzip > zipped.arb", "w");
- same for bzip2/bunzip2
- reader should autodetect, i.e. try normal, then gunzip, then bunzip2
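The autodetection in the last bullet could also be done by inspecting the first bytes of the file instead of trying each reader in turn. The following is only a minimal sketch of that alternative, not ARB code: the names `compression_t`, `detect_compression` and `decompressor_for` are hypothetical. It relies on the gzip magic bytes `0x1f 0x8b` and the bzip2 signature `BZh`.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical names for illustration; not taken from the ARB sources. */
typedef enum { COMP_NONE, COMP_GZIP, COMP_BZIP2 } compression_t;

/* Classify a file by its leading "magic" bytes.
 * head points to the first len bytes read from the file. */
compression_t detect_compression(const unsigned char *head, size_t len) {
    assert(head != NULL || len == 0);
    if (len >= 2 && head[0] == 0x1f && head[1] == 0x8b) return COMP_GZIP;
    if (len >= 3 && memcmp(head, "BZh", 3) == 0)         return COMP_BZIP2;
    return COMP_NONE; /* treat everything else as an uncompressed file */
}

/* Name of the tool to pipe the file through, or NULL to read it directly. */
const char *decompressor_for(compression_t c) {
    switch (c) {
        case COMP_GZIP:  return "gunzip";
        case COMP_BZIP2: return "bunzip2";
        default:         return NULL;
    }
}
```

A reader would fread() a few bytes, call `detect_compression`, and then either rewind the plain file or reopen it through popen() with the returned tool. An uncompressed file that happens to start with one of these byte sequences would be misclassified, so falling back to a plain read on decompression failure may still be needed.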
Change History (16)
comment:1 Changed 9 years ago by westram
- Status changed from new to accepted
comment:2 follow-up: ↓ 7 Changed 9 years ago by epruesse
comment:3 follow-up: ↓ 6 Changed 9 years ago by epruesse
- If piping, I'd use pigz over gzip. pigz uses zlib and pthreads to implement parallel zip/unzip in gz format.
- bzip2 doesn't gain too much in terms of compression to justify the time when saving an arb file.
comment:4 Changed 9 years ago by epruesse
Another alternative would be using liblz4: https://github.com/Cyan4973/lz4
According to their tests:
lib | setting | ratio | write MB/s | read MB/s |
---|---|---|---|---|
zlib | -1 | 2.7 | 59 | 250 |
zlib | -6 | 3.0 | 18 | 270 |
lz4 | | 2.1 | 385 | 1850 |
So
- much faster
- reasonable compression
- doesn't work as drop-in replacement for open/close/read/write
comment:5 Changed 9 years ago by epruesse
Another benchmark (archivers, not libs, but the numbers won't be much different):
bzip2 at ~1 minute for a 445 MB file is too slow for "save". LZMA is barely fast enough, with good compression; lz4 is nicely fast but needs twice as much space as the LZMA-compressed data. zlib sits in the middle.
I'd probably go with zlib. It's reasonably fast, reasonably good at compressing data, and by far the easiest to implement, especially given that it will open uncompressed files transparently.
comment:6 in reply to: ↑ 3 ; follow-up: ↓ 8 Changed 9 years ago by westram
Replying to epruesse:
- bzip2 doesn't gain too much in terms of compression to justify the time when saving an arb file.
I cannot confirm (tested with SSURef_NR99_119_SILVA_14_07_14_opt.arb saved as ascii)
compression | size |
---|---|
none | ~26Gb |
gzip | 1.4Gb |
bzip2 | 212Mb |
comment:7 in reply to: ↑ 2 Changed 9 years ago by westram
Replying to epruesse:
You could also simply use gzopen, gzclose, gzread and gzwrite instead of open, close, read and write.
I don't see the advantage of using something special inside the ARB code when I can use a pipe, which will use 2 cores automatically. My save test showed arb_2_ascii at ~40% CPU and gzip at ~100%.
pigz sounds nice, but doesn't seem to be common; at least it's not installed on my machines.
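For illustration, the pipe-based save described in this comment might look roughly like the sketch below. It is not the ARB implementation: `build_pipe_command` and `open_compressed_writer` are hypothetical helpers, the quoting is deliberately simplistic (paths containing a single quote are rejected), and the chosen tool (gzip, bzip2, pigz, ...) is assumed to be on the PATH. Because the compressor runs as a separate process, it can use a second core while the caller keeps producing data.

```c
#define _POSIX_C_SOURCE 200809L /* for popen()/pclose() */
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: format "<tool> > '<path>'" into buf.
 * Returns 0 on success, -1 if buf is too small or if path contains a
 * single quote (which would break the simple quoting used here). */
int build_pipe_command(char *buf, size_t size, const char *tool, const char *path) {
    assert(buf != NULL && tool != NULL && path != NULL);
    if (strchr(path, '\'') != NULL) return -1;
    int n = snprintf(buf, size, "%s > '%s'", tool, path);
    return (n < 0 || (size_t)n >= size) ? -1 : 0;
}

/* Open a write stream whose output is compressed by an external process.
 * The caller must close the stream with pclose(), not fclose(). */
FILE *open_compressed_writer(const char *tool, const char *path) {
    char cmd[4096];
    if (build_pipe_command(cmd, sizeof cmd, tool, path) != 0) return NULL;
    return popen(cmd, "w");
}
```

A caller would write to the returned stream with fwrite() or fprintf() and then check the result of pclose(), since a failing compressor only shows up in its exit status.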
comment:8 in reply to: ↑ 6 ; follow-up: ↓ 11 Changed 9 years ago by epruesse
I cannot confirm (tested with SSURef_NR99_119_SILVA_14_07_14_opt.arb saved as ascii)
ASCII? Why would anyone use that besides for debugging? The "save" is horribly slow and the files, even with compression, are enormous.
comment:9 Changed 9 years ago by epruesse
BTW: If this is for scripted use of ARB, there really isn't a need to mess with the source at all — bash has "process substitution" for exactly this purpose:
Instead of
    some_writing_tool ascii.arb
    some_reading_tool ascii.arb
use
    some_writing_tool >(gzip -c > ascii.arb.gz)
    some_reading_tool <(gunzip -c ascii.arb.gz)
It's commonly used in bioinformatics to deal with the large files and multitude of compressed formats, so most people will know about it.
It would be *very* neat to have this in ARB for the normal arb files, though, to cut down on the time needed to save with slower disks (e.g. NFS).
comment:10 follow-up: ↓ 12 Changed 9 years ago by epruesse
Stats on compressing SSURef_NR99_123_SILVA_12_07_15_opt.arb
tool | setting | compr | decompr | size |
---|---|---|---|---|
xz | -9 | 102.27 | 12.49 | 187988 |
xz | -5 | 39.88 | 13.29 | 206384 |
rar | -m5 | 26.14 | 3.60 | 216856 |
rar | -m3 | 21.79 | 3.68 | 217768 |
xz | -1 | 11.80 | 14.06 | 231208 |
bzip2 | -9 | 61.31 | 25.20 | 241592 |
bzip2 | -5 | 60.37 | 26.99 | 252964 |
rar | -m1 | 7.90 | 4.21 | 263992 |
lz4 | -9 | 12.94 | 0.48 | 278144 |
lz4 | -5 | 9.88 | 0.48 | 279584 |
gzip | -9 | 32.88 | 1.70 | 279960 |
pigz | -9 | 6.85 | 2.03 | 280092 |
gzip | -5 | 14.09 | 1.70 | 285956 |
pigz | -5 | 3.66 | 2.11 | 286080 |
bzip2 | -1 | 59.25 | 26.74 | 290992 |
pigz | -1 | 2.44 | 2.25 | 308392 |
gzip | -1 | 9.24 | 1.92 | 308964 |
lz4 | -1 | 1.77 | 0.5 | 325752 |
comment:11 in reply to: ↑ 8 Changed 9 years ago by westram
Replying to epruesse:
I cannot confirm (tested with SSURef_NR99_119_SILVA_14_07_14_opt.arb saved as ascii)
ASCII? Why would anyone use that besides for debugging? The "save" is horribly slow and the files, even with compression, are enormous.
Christian needs streamed saving for the SILVA pipeline to reduce the memory footprint on the cluster (see #666). Using bzip2 on ASCII results in a file size similar to the gzipped binary database (212Mb vs 228Mb). Runtime is less important in that case.
I'm not sure whether saving the binary format directly to a zip stream works at all: the save code might rely on random access, which a pipe cannot provide. I have to check the code first.
I just thought that being able to save with extra compression would be a nice feature in general, and therefore decided to add it to ARBDB.
comment:12 in reply to: ↑ 10 Changed 9 years ago by epruesse
With NR 123 saved as ascii:
tool | setting | comp | decomp | size |
---|---|---|---|---|
xz | 9 | 620.01 | 47.01 | 196180 |
xz | 5 | 381.52 | 50.32 | 274776 |
bzip2 | 9 | 497.02 | 213.04 | 281344 |
bzip2 | 5 | 468.42 | 210.29 | 331704 |
rar | 9 | 455.12 | 49.37 | 334048 |
rar | 5 | 424.41 | 47.59 | 354356 |
xz | 1 | 88.03 | 63.21 | 451404 |
bzip2 | 1 | 438.41 | 210.23 | 549260 |
lz4 | 9 | 670.03 | 15.95 | 1065356 |
lz4 | 5 | 232.53 | 17.70 | 1207972 |
rar | 1 | 215.39 | 81.65 | 1213740 |
gzip | 9 | 1910.64 | 44.70 | 1575136 |
pigz | 9 | 459.23 | 47.31 | 1586660 |
gzip | 5 | 297.64 | 50.79 | 1605916 |
pigz | 5 | 78.88 | 53.69 | 1617504 |
lz4 | 1 | 39.86 | 19.51 | 1754660 |
gzip | 1 | 134.04 | 31.19 | 1922608 |
pigz | 1 | 43.36 | 32.91 | 1932976 |
comment:13 Changed 9 years ago by westram
- Status changed from accepted to _started
comment:14 Changed 9 years ago by westram
- Resolution set to implemented
- Status changed from _started to closed
by [14556]
comment:15 Changed 9 years ago by westram
- Milestone set to arb6.1
mark changes that got fixed after arb 6.0.x
You could also simply use gzopen, gzclose, gzread and gzwrite instead of open, close, read and write.
From the docs of gzread:
See /usr/include/zlib.h for documentation.