Opened 9 years ago

Closed 9 years ago

Last modified 3 years ago

#665 closed enhancement (implemented)

save/load database zip-compressed

Reported by: westram
Owned by: westram
Priority: normal
Milestone: arb7.0
Component: Library (DB)
Version:
Keywords:
Cc: cquast@…

Description

  • open pipe instead of file
    • read: FILE *in = popen("gunzip < zipped.arb", "r");
    • write: FILE *out = popen("gzip > zipped.arb", "w");
    • same for bzip2/bunzip2
  • reader should autodetect, i.e. try normal, then gunzip, then bunzip2
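A minimal sketch of the reading side, under stated assumptions: instead of trying each decompressor in turn as proposed above, it peeks at the leading magic bytes (0x1f 0x8b for gzip, "BZh" for bzip2) and only then opens a pipe. The names comp_kind, detect_compression and open_db_for_reading are hypothetical and not taken from the ARB source.

```c
#define _POSIX_C_SOURCE 200809L
#include <stdio.h>
#include <string.h>

/* Compression kinds this sketch distinguishes (hypothetical names). */
typedef enum { COMP_NONE, COMP_GZIP, COMP_BZIP2 } comp_kind;

/* Guess the compression from the first bytes of the file.
 * gzip streams start with 0x1f 0x8b, bzip2 streams with "BZh".
 * Unreadable or short files are reported as COMP_NONE. */
comp_kind detect_compression(const char *path) {
    unsigned char magic[3] = {0, 0, 0};
    FILE *in = fopen(path, "rb");
    if (!in) return COMP_NONE;
    size_t got = fread(magic, 1, sizeof(magic), in);
    fclose(in);
    if (got >= 2 && magic[0] == 0x1f && magic[1] == 0x8b) return COMP_GZIP;
    if (got >= 3 && memcmp(magic, "BZh", 3) == 0) return COMP_BZIP2;
    return COMP_NONE;
}

/* Open `path` for reading, through a decompressor pipe if needed.
 * *is_pipe tells the caller whether to close with pclose() or fclose(). */
FILE *open_db_for_reading(const char *path, int *is_pipe) {
    char cmd[4096];
    const char *tool = NULL;
    switch (detect_compression(path)) {
        case COMP_GZIP:  tool = "gunzip";  break;
        case COMP_BZIP2: tool = "bunzip2"; break;
        case COMP_NONE:  break;
    }
    *is_pipe = (tool != NULL);
    if (!tool) return fopen(path, "rb");
    /* Naive quoting: this breaks if path contains a single quote. */
    int n = snprintf(cmd, sizeof(cmd), "%s < '%s'", tool, path);
    if (n < 0 || (size_t)n >= sizeof(cmd)) return NULL;
    return popen(cmd, "r");
}
```

A caller would read from the returned stream as usual and then call pclose() when is_pipe is set, fclose() otherwise. Checking the magic bytes avoids starting a decompressor just to find out that it rejects the file.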

Change History (16)

comment:1 Changed 9 years ago by westram

  • Status changed from new to accepted

comment:2 follow-up: Changed 9 years ago by epruesse

You could also simply use gzopen, gzclose, gzread and gzwrite instead of open, close, read and write.

#include <zlib.h>

gzFile file = gzopen(path, mode);   /* e.g. mode "rb" for reading */
char buf[4096];                     /* must be writable: gzread fills it */
int len;
while ((len = gzread(file, buf, sizeof(buf))) > 0) {
 ...
}
gzclose(file);

From the docs of gzread:

If the input file is not in gzip format, gzread copies the given number of bytes into the buffer directly from the file.

See /usr/include/zlib.h for documentation.

Last edited 9 years ago by westram

comment:3 follow-up: Changed 9 years ago by epruesse

  • If piping, I'd use pigz over gzip. pigz uses zlib and pthreads to implement parallel zip/unzip in gz format.
  • bzip2 doesn't gain too much in terms of compression to justify the time when saving an arb file.

comment:4 Changed 9 years ago by epruesse

Another alternative would be using liblz4: https://github.com/Cyan4973/lz4

According to their tests:

lib   setting  ratio  write MB/s  read MB/s
zlib  -1       2.7    59          250
zlib  -6       3.0    18          270
lz4            2.1    385         1850

So

  • much faster
  • reasonable compression
  • doesn't work as drop-in replacement for open/close/read/write

Last edited 9 years ago by westram

comment:5 Changed 9 years ago by epruesse

Another benchmark (archivers, not libs, but the numbers won't be much different):

http://catchchallenger.first-world.info/wiki/Quick_Benchmark:_Gzip_vs_Bzip2_vs_LZMA_vs_XZ_vs_LZ4_vs_LZO

bzip2, at ~1 minute for a 445 MB file, is too slow for "save". LZMA is barely fast enough but compresses well; lz4 is nicely fast but needs twice as much space as the LZMA-compressed data. zlib falls in the middle.

I'd probably go with zlib. It's reasonably fast, reasonably good at compressing data, and by far the easiest to implement, especially given that it will open uncompressed files transparently.

Last edited 9 years ago by epruesse

comment:6 in reply to: ↑ 3 ; follow-up: Changed 9 years ago by westram

Replying to epruesse:

  • bzip2 doesn't gain too much in terms of compression to justify the time when saving an arb file.

I cannot confirm (tested with SSURef_NR99_119_SILVA_14_07_14_opt.arb saved as ascii)

compression  size
none         ~26Gb
gzip         1.4Gb
bzip2        212Mb

Last edited 9 years ago by westram

comment:7 in reply to: ↑ 2 Changed 9 years ago by westram

Replying to epruesse:

You could also simply use gzopen, gzclose, gzread and gzwrite instead of open, close, read and write.

I don't see an advantage in using something special inside the ARB code when I can use a pipe, which will use 2 cores automatically. My save test showed arb_2_ascii at ~40% CPU and gzip at ~100%.

pigz sounds nice, but doesn't seem to be common; at least it's not installed on my machines.
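The pipe-based save described above could be sketched roughly as follows. The compressor is passed in as a command name, so "gzip", "bzip2" or (where installed) "pigz" could be swapped without code changes. The function names open_compressed_writer and save_compressed are hypothetical, and the real ARBDB code may be structured quite differently.

```c
#define _POSIX_C_SOURCE 200809L
#include <stdio.h>

/* Open a pipe that feeds everything written to it into `compressor`,
 * whose output is redirected to `path`. Close the result with pclose().
 * The compressor runs as a separate process, so it can use another core. */
FILE *open_compressed_writer(const char *compressor, const char *path) {
    char cmd[4096];
    /* Naive quoting: this breaks if path contains a single quote. */
    int n = snprintf(cmd, sizeof(cmd), "%s > '%s'", compressor, path);
    if (n < 0 || (size_t)n >= sizeof(cmd)) return NULL;
    return popen(cmd, "w");
}

/* Write one buffer through the compressor; returns 0 on success. */
int save_compressed(const char *compressor, const char *path,
                    const void *data, size_t len) {
    FILE *out = open_compressed_writer(compressor, path);
    if (!out) return -1;
    size_t written = fwrite(data, 1, len, out);
    int status = pclose(out);   /* flushes and waits for the compressor to exit */
    return (written == len && status == 0) ? 0 : -1;
}
```

For example, save_compressed("gzip", "zipped.arb", data, len) would produce a gzip-compressed file, provided gzip is on the PATH.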

comment:8 in reply to: ↑ 6 ; follow-up: Changed 9 years ago by epruesse

I cannot confirm (tested with SSURef_NR99_119_SILVA_14_07_14_opt.arb saved as ascii)

Ascii? Why would anyone use that besides for debugging? The "save" is horribly slow and the files are enormous even with compression.

comment:9 Changed 9 years ago by epruesse

BTW: If this is for scripted use of ARB, there really isn't a need to mess with the source at all — bash has "process substitution" for exactly this purpose:

Instead of

  some_writing_tool ascii.arb
  some_reading_tool ascii.arb

use

  some_writing_tool >(gzip -c > ascii.arb.gz)
  some_reading_tool <(gunzip -c ascii.arb.gz)

It's commonly used in bioinformatics to deal with the large files and multitude of compressed formats, so most people will know about it.

It would be *very* neat to have this in ARB for the normal arb files, though, to cut down on the time needed to save with slower disks (e.g. NFS).

comment:10 follow-up: Changed 9 years ago by epruesse

Stats on compressing SSURef_NR99_123_SILVA_12_07_15_opt.arb

tool   setting  compr   decompr  size
xz     -9       102.27  12.49    187988
xz     -5       39.88   13.29    206384
rar    -m5      26.14   3.60     216856
rar    -m3      21.79   3.68     217768
xz     -1       11.80   14.06    231208
bzip2  -9       61.31   25.20    241592
bzip2  -5       60.37   26.99    252964
rar    -m1      7.90    4.21     263992
lz4    -9       12.94   0.48     278144
lz4    -5       9.88    0.48     279584
gzip   -9       32.88   1.70     279960
pigz   -9       6.85    2.03     280092
gzip   -5       14.09   1.70     285956
pigz   -5       3.66    2.11     286080
bzip2  -1       59.25   26.74    290992
pigz   -1       2.44    2.25     308392
gzip   -1       9.24    1.92     308964
lz4    -1       1.77    0.5      325752

Last edited 9 years ago by westram

comment:11 in reply to: ↑ 8 Changed 9 years ago by westram

Replying to epruesse:

I cannot confirm (tested with SSURef_NR99_119_SILVA_14_07_14_opt.arb saved as ascii)

Ascii? Why would anyone use that besides for debugging? The "save" is horribly slow and the files even with compression enormous.

Christian needs streamed saving for the SILVA pipeline to reduce the memory footprint on the cluster (see #666). Using bzip2 on ascii results in a file size similar to that of a gzipped binary database (212Mb vs 228Mb). Runtime is less important in that case.

I'm not sure whether saving binary format directly to a zip-stream works at all: the save code might use random access, which a stream cannot provide. I have to check the code first.

I just thought that being able to save with extra compression would be a nice feature in general, and therefore decided to add it to ARBDB.
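To illustrate the random-access concern: a stream obtained from popen() is a pipe and cannot be repositioned, so any save code that seeks back (for example to patch a header) would fail on it. A minimal check, assuming a POSIX system; stream_is_seekable is a hypothetical helper, not part of ARB.

```c
#define _POSIX_C_SOURCE 200809L
#include <stdio.h>
#include <unistd.h>

/* Returns 1 if the stream supports repositioning, 0 otherwise.
 * On a pipe, lseek() fails (errno is set to ESPIPE). */
int stream_is_seekable(FILE *stream) {
    return lseek(fileno(stream), 0, SEEK_CUR) != (off_t)-1;
}
```

A regular file opened with fopen() reports 1 here, while the result of popen("gzip > zipped.arb", "w") reports 0. Code paths that call fseek() or rewind() on the output would therefore need to be reworked before writing to a compressor pipe.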

comment:12 in reply to: ↑ 10 Changed 9 years ago by epruesse

With NR 123 saved as ascii:

tool   setting  comp     decomp  size
xz     9        620.01   47.01   196180
xz     5        381.52   50.32   274776
bzip2  9        497.02   213.04  281344
bzip2  5        468.42   210.29  331704
rar    9        455.12   49.37   334048
rar    5        424.41   47.59   354356
xz     1        88.03    63.21   451404
bzip2  1        438.41   210.23  549260
lz4    9        670.03   15.95   1065356
lz4    5        232.53   17.70   1207972
rar    1        215.39   81.65   1213740
gzip   9        1910.64  44.70   1575136
pigz   9        459.23   47.31   1586660
gzip   5        297.64   50.79   1605916
pigz   5        78.88    53.69   1617504
lz4    1        39.86    19.51   1754660
gzip   1        134.04   31.19   1922608
pigz   1        43.36    32.91   1932976

Last edited 9 years ago by westram

comment:13 Changed 9 years ago by westram

  • Status changed from accepted to _started

comment:14 Changed 9 years ago by westram

  • Resolution set to implemented
  • Status changed from _started to closed

by [14556]

comment:15 Changed 9 years ago by westram

  • Milestone set to arb6.1

mark changes that got fixed after arb 6.0.x

comment:16 Changed 3 years ago by westram

  • Milestone changed from arb6.1 to arb7.0

Milestone renamed
