
#665 closed enhancement (implemented)

save/load database zip-compressed

Reported by:	westram	Owned by:	westram
Priority:	normal	Milestone:	arb7.0
Component:	Library (DB)	Version:
Keywords:		Cc:	cquast@…

Description

open pipe instead of file
- read: FILE *in = popen("gunzip < zipped.arb", "r");
- write: FILE *out = popen("gzip > zipped.arb", "w");
- same for bzip2/bunzip2
reader should autodetect, i.e. try normal, then gunzip, then bunzip2
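
One way the autodetection could be sketched is to look at the first bytes of the file instead of trying each decompressor in turn: gzip streams start with the bytes 0x1f 0x8b and bzip2 streams start with "BZh". The names below (Compression, detect_compression, decompressor_for, open_maybe_compressed) are hypothetical and not taken from the ARB sources, and the single-quote shell quoting is an assumption that paths contain no single quotes.

```c
#define _POSIX_C_SOURCE 200809L  /* for popen() */
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/* Hypothetical type, for illustration only. */
typedef enum { COMP_NONE, COMP_GZIP, COMP_BZIP2 } Compression;

/* Classify a file by its leading magic bytes. */
static Compression detect_compression(const unsigned char *head, size_t len) {
    assert(head != NULL || len == 0);
    if (len >= 2 && head[0] == 0x1f && head[1] == 0x8b) return COMP_GZIP;
    if (len >= 3 && head[0] == 'B' && head[1] == 'Z' && head[2] == 'h') return COMP_BZIP2;
    return COMP_NONE;
}

/* Tool to pipe through, or NULL when the file can be read directly. */
static const char *decompressor_for(Compression c) {
    switch (c) {
        case COMP_GZIP:  return "gunzip";
        case COMP_BZIP2: return "bunzip2";
        default:         return NULL;
    }
}

/* Open path for reading. On success *is_pipe tells the caller whether
 * to finish with pclose() (1) or fclose() (0). Returns NULL on failure. */
static FILE *open_maybe_compressed(const char *path, int *is_pipe) {
    unsigned char head[3];
    FILE *in = fopen(path, "rb");
    if (!in) return NULL;

    size_t got = fread(head, 1, sizeof(head), in);
    const char *tool = decompressor_for(detect_compression(head, got));
    if (!tool) {               /* uncompressed: reuse the open file */
        rewind(in);
        *is_pipe = 0;
        return in;
    }
    fclose(in);

    /* Assumption: path contains no single quote characters. */
    char cmd[4096];
    int n = snprintf(cmd, sizeof(cmd), "%s < '%s'", tool, path);
    if (n < 0 || (size_t)n >= sizeof(cmd)) return NULL;
    *is_pipe = 1;
    return popen(cmd, "r");
}
```

A caller would read from the returned stream as usual and then close it with pclose() or fclose() depending on *is_pipe.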

Change History (16)

comment:1 Changed 10 years ago by westram

Status changed from new to accepted

comment:2 follow-up: ↓ 7 Changed 10 years ago by epruesse

You could also simply use gzopen, gzclose, gzread and gzwrite instead of open, close, read and write.

#include <zlib.h>

gzFile file = gzopen(path, mode);  /* renamed: FILE is the stdio type name */
char buf[4096];                    /* must be writable, so not const */
int len;
while ((len = gzread(file, buf, sizeof(buf))) > 0) {
 ...
}
gzclose(file);

From the docs of gzread:

If the input file is not in gzip format, gzread copies the given number of bytes into the buffer directly from the file.

See /usr/include/zlib.h for documentation.


comment:3 follow-up: ↓ 6 Changed 10 years ago by epruesse

If piping, I'd use pigz over gzip. pigz uses zlib and pthreads to implement parallel zip/unzip in gz format.
bzip2 doesn't gain too much in terms of compression to justify the time when saving an arb file.

comment:4 Changed 10 years ago by epruesse

Another alternative would be using liblz4: https://github.com/Cyan4973/lz4

According to their tests:

lib	setting	ratio	write MB/s	read MB/s
zlib	-1	2.7	59	250
zlib	-6	3.0	18	270
lz4		2.1	385	1850

much faster
reasonable compression
doesn't work as drop-in replacement for open/close/read/write


comment:5 Changed 10 years ago by epruesse

Another benchmark (archivers, not libs, but the numbers won't be much different):

http://catchchallenger.first-world.info/wiki/Quick_Benchmark:_Gzip_vs_Bzip2_vs_LZMA_vs_XZ_vs_LZ4_vs_LZO

bzip2 at ~1 minute for a 445 MB file is too slow for "save". LZMA is barely fast enough, with good compression; lz4 is nicely fast but needs twice as much space as the LZMA-compressed data. zlib falls in the middle.

I'd probably go with zlib. It's reasonably fast, reasonably good at compressing data and by far the easiest to implement, especially given that it will open uncompressed files transparently.


comment:6 in reply to: ↑ 3 ; follow-up: ↓ 8 Changed 10 years ago by westram

Replying to epruesse:

bzip2 doesn't gain too much in terms of compression to justify the time when saving an arb file.

I cannot confirm (tested with SSURef_NR99_119_SILVA_14_07_14_opt.arb saved as ascii)

compression	size
none	~26GB
gzip	1.4GB
bzip2	212MB


comment:7 in reply to: ↑ 2 Changed 10 years ago by westram

Replying to epruesse:

You could also simply use gzopen, gzclose, gzread and gzwrite instead of open, close, read and write.

I don't see an advantage in using something special inside the ARB code when I can use a pipe, which will use 2 cores automatically. My save test showed arb_2_ascii at ~40% CPU and gzip at ~100%.

pigz sounds nice, but doesn't seem to be common; at least it's not installed on my machines.
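
If pigz were wanted where available, a pipe-based writer could fall back to gzip when it is missing. The following is only a sketch under assumptions: pick_gzip_tool and open_compressed_writer are hypothetical names, the availability check relies on the POSIX shell builtin `command -v`, and the path is assumed to contain no single quote characters.

```c
#define _POSIX_C_SOURCE 200809L  /* for popen() */
#include <assert.h>
#include <stddef.h>
#include <stdio.h>
#include <stdlib.h>

/* Return "pigz" if the shell can find it on PATH, otherwise "gzip". */
static const char *pick_gzip_tool(void) {
    int status = system("command -v pigz >/dev/null 2>&1");
    return status == 0 ? "pigz" : "gzip";
}

/* Open a write pipe that compresses everything written into path.
 * The caller finishes with pclose(). Returns NULL on failure. */
static FILE *open_compressed_writer(const char *path) {
    assert(path != NULL);
    char cmd[4096];
    /* Assumption: path contains no single quote characters. */
    int n = snprintf(cmd, sizeof(cmd), "%s > '%s'", pick_gzip_tool(), path);
    if (n < 0 || (size_t)n >= sizeof(cmd)) return NULL;
    return popen(cmd, "w");
}
```

Because the compressor runs as a separate process, the writing tool and the compressor can still occupy different cores, as in the save test above.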

comment:8 in reply to: ↑ 6 ; follow-up: ↓ 11 Changed 10 years ago by epruesse

I cannot confirm (tested with SSURef_NR99_119_SILVA_14_07_14_opt.arb saved as ascii)

Ascii? Why would anyone use that besides for debugging? The "save" is horribly slow and the files, even with compression, are enormous.

comment:9 Changed 10 years ago by epruesse

BTW: If this is for scripted use of ARB, there really isn't a need to mess with the source at all — bash has "process substitution" for exactly this purpose:

Instead of

  some_writing_tool ascii.arb
  some_reading_tool ascii.arb

use

  some_writing_tool >(gzip -c > ascii.arb.gz)
  some_reading_tool <(gunzip -c ascii.arb.gz)

It's commonly used in bioinformatics to deal with the large files and multitude of compressed formats, so most people will know about it.

It would be *very* neat to have this in ARB for the normal arb files, though, to cut down on the time needed to save with slower disks (e.g. NFS).

comment:10 follow-up: ↓ 12 Changed 10 years ago by epruesse

Stats on compressing SSURef_NR99_123_SILVA_12_07_15_opt.arb

tool	setting	compr	decompr	size
xz	-9	102.27	12.49	187988
xz	-5	39.88	13.29	206384
rar	-m5	26.14	3.60	216856
rar	-m3	21.79	3.68	217768
xz	-1	11.80	14.06	231208
bzip2	-9	61.31	25.20	241592
bzip2	-5	60.37	26.99	252964
rar	-m1	7.90	4.21	263992
lz4	-9	12.94	0.48	278144
lz4	-5	9.88	0.48	279584
gzip	-9	32.88	1.70	279960
pigz	-9	6.85	2.03	280092
gzip	-5	14.09	1.70	285956
pigz	-5	3.66	2.11	286080
bzip2	-1	59.25	26.74	290992
pigz	-1	2.44	2.25	308392
gzip	-1	9.24	1.92	308964
lz4	-1	1.77	0.5	325752


comment:11 in reply to: ↑ 8 Changed 10 years ago by westram

Replying to epruesse:

I cannot confirm (tested with SSURef_NR99_119_SILVA_14_07_14_opt.arb saved as ascii)

Ascii? Why would anyone use that besides for debugging? The "save" is horribly slow and the files, even with compression, are enormous.

Christian needs streamed saving for the SILVA pipeline to reduce the memory footprint on the cluster (see #666). Using bzip2 on ascii results in a file size similar to that of the gzipped binary database (212MB vs 228MB). Runtime is less important in that case.

I'm not sure whether saving binary format directly to a zip-stream works at all - there might be random access. Have to check the code first.

Just thought having the possibility to save with extra compression would be a nice feature in general and therefore decided to add the feature to ARBDB.

comment:12 in reply to: ↑ 10 Changed 10 years ago by epruesse

With NR 123 saved as ascii:

tool	setting	comp	decomp	size
xz	9	620.01	47.01	196180
xz	5	381.52	50.32	274776
bzip2	9	497.02	213.04	281344
bzip2	5	468.42	210.29	331704
rar	9	455.12	49.37	334048
rar	5	424.41	47.59	354356
xz	1	88.03	63.21	451404
bzip2	1	438.41	210.23	549260
lz4	9	670.03	15.95	1065356
lz4	5	232.53	17.70	1207972
rar	1	215.39	81.65	1213740
gzip	9	1910.64	44.70	1575136
pigz	9	459.23	47.31	1586660
gzip	5	297.64	50.79	1605916
pigz	5	78.88	53.69	1617504
lz4	1	39.86	19.51	1754660
gzip	1	134.04	31.19	1922608
pigz	1	43.36	32.91	1932976


comment:13 Changed 10 years ago by westram

Status changed from accepted to _started

comment:14 Changed 10 years ago by westram

Resolution set to implemented
Status changed from _started to closed

by [14556]

comment:15 Changed 10 years ago by westram

Milestone set to arb6.1

mark changes that got fixed after arb 6.0.x

comment:16 Changed 4 years ago by westram

Milestone changed from arb6.1 to arb7.0

Milestone renamed
