Opened 10 years ago

Closed 9 years ago

Last modified 8 years ago

#2349 closed enhancement (fixed)

CELL raster format: make ZLIB level 3 standard compression instead of RLE

Reported by: neteler
Owned by: grass-dev@…
Priority: critical
Milestone: 7.2.0
Component: Raster
Version: svn-trunk
Keywords: compression, r.compress, null
Cc:
CPU: Unspecified
Platform: All

Description

At present, integer maps (CELL) are still compressed with RLE. This leads to a huge waste of disk space for large datasets.

Proposal: make ZLIB, level 3 the standard compression.

At present we can enable the environment variable GRASS_INT_ZLIB, but it will use the default ZLIB level 6 compression, which is too CPU intensive. So (user) control over this is important.
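The size side of the level 3 versus level 6 trade-off can be explored with Python's standard zlib module. This is only a sketch: the synthetic row below is an assumption standing in for CELL data, not a real GRASS raster, and it measures compressed size only, not CPU time.

```python
import zlib


def compressed_sizes(data: bytes, levels=(1, 3, 6, 9)) -> dict:
    """Return {level: compressed size in bytes} for each zlib level."""
    return {level: len(zlib.compress(data, level)) for level in levels}


# Synthetic stand-in for integer raster data: 4-byte big-endian values
# with runs of 50 equal cells. Real maps will behave differently.
row = b"".join(value.to_bytes(4, "big") for value in range(200) for _ in range(50))

sizes = compressed_sizes(row)
for level, size in sizes.items():
    print(f"level {level}: {size} bytes ({100 * size / len(row):.2f}% of {len(row)})")
```

Running the same comparison on an exported row of a real map, together with a timer, would show whether the extra CPU cost of level 6 pays off for a given dataset.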

BTW: Manual of r.compress updated in r60814, needs to be backported.

Attachments (2)

zlib_default.diff (5.6 KB) - added by neteler 10 years ago.
bundle patch for relbr7
compressed_nulls.diff (11.0 KB) - added by glynn 10 years ago.
implement compressed nulls

Change History (55)

comment:1 by neteler, 10 years ago

The default ZLIB compression level (usually 6) is hardcoded in:

lib/gis/flate.c, line 330

However, the compression actually in use is still RLE.

in reply to:  description ; comment:2 by glynn, 10 years ago

Replying to neteler:

At present, integer maps (CELL) are still compressed with RLE. This leads to a huge waste of disk space for large datasets.

Proposal: make ZLIB, level 3 the standard compression.

Is GRASS_INT_ZLIB support now old enough that it can be taken for granted?

At present we can enable the environment variable GRASS_INT_ZLIB, but it will use the default ZLIB level 6 compression, which is too CPU intensive. So (user) control over this is important.

The current behaviour is that setting GRASS_INT_ZLIB to anything (even an empty string) will enable zlib compression at the hard-coded level. One option is to parse the value as an integer and use the result as the compression level. However, it's possible that people are currently using e.g. GRASS_INT_ZLIB=1 to enable it with the existing default level.

Another option is to add another environment variable for the level.

Aside: if there are still systems out there using the historical limit of 4096 bytes of memory for the combination of environment variables and arguments (argv), we might want to think about making GRASS less greedy with respect to environment variables.

in reply to:  2 ; comment:3 by neteler, 10 years ago

Replying to glynn:

Replying to neteler:

At present, integer maps (CELL) are still compressed with RLE. This leads to a huge waste of disk space for large datasets.

Proposal: make ZLIB, level 3 the standard compression.

Is GRASS_INT_ZLIB support now old enough that it can be taken for granted?

I hope so. I am not aware of any negative reports.

At present we can enable the environment variable GRASS_INT_ZLIB, but it will use the default ZLIB level 6 compression, which is too CPU intensive. So (user) control over this is important.

The current behaviour is that setting GRASS_INT_ZLIB to anything (even an empty string) will enable zlib compression at the hard-coded level.

Exactly.

One option is to parse the value as an integer and use the result as the compression level. However, it's possible that people are currently using e.g. GRASS_INT_ZLIB=1 to enable it with the existing default level.

Another option is to add another environment variable for the level.

Yes, a new GRASS_ZLIBLEVEL may be less invasive.

Aside: if there are still systems out there using the historical limit of 4096 bytes of memory for the combination of environment variables and arguments (argv), we might want to think about making GRASS less greedy with respect to environment variables.

You mean the number and/or the length?

in reply to:  3 comment:4 by glynn, 10 years ago

Replying to neteler:

Aside: if there are still systems out there using the historical limit of 4096 bytes of memory for the combination of environment variables and arguments (argv), we might want to think about making GRASS less greedy with respect to environment variables.

You mean the number and/or the length?

Mainly the number.

in reply to:  description ; comment:5 by glynn, 10 years ago

Replying to neteler:

Proposal: make ZLIB, level 3 the standard compression.

r61380 implements the following behaviour:

  • zlib compression is the default. Set GRASS_INT_ZLIB=0 to use RLE compression.
  • The compression level can be set via GRASS_ZLIB_LEVEL, whose value should be an integer between 0 and 9. If not set (or if the value cannot be parsed as an integer), zlib's default compression level will be used (lib/gis/flate.c:333, if a different default is preferred).
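The two rules above can be sketched as a small resolver. This is an illustration in Python, not the C code from r61380; the function name is hypothetical, and two details are assumptions because the ticket does not state them: a GRASS_INT_ZLIB value that is not an integer is treated as "zlib stays enabled", and a level outside 0..9 falls back to zlib's default.

```python
import os

Z_DEFAULT_COMPRESSION = -1  # zlib's sentinel for "use the library default level"


def resolve_cell_compression(env=os.environ):
    """Return (method, level) for CELL maps from environment-style settings."""
    # Rule 1: zlib is the default; GRASS_INT_ZLIB=0 selects RLE instead.
    try:
        use_zlib = int(env.get("GRASS_INT_ZLIB", "1")) != 0
    except ValueError:
        use_zlib = True  # assumption: unparsable values leave zlib enabled
    if not use_zlib:
        return ("rle", None)

    # Rule 2: GRASS_ZLIB_LEVEL should be an integer between 0 and 9;
    # if unset or unparsable, zlib's default level is used.
    try:
        level = int(env["GRASS_ZLIB_LEVEL"])
    except (KeyError, ValueError):
        return ("zlib", Z_DEFAULT_COMPRESSION)
    if not 0 <= level <= 9:
        level = Z_DEFAULT_COMPRESSION  # assumption: out-of-range uses the default
    return ("zlib", level)


print(resolve_cell_compression({"GRASS_ZLIB_LEVEL": "3"}))
print(resolve_cell_compression({"GRASS_INT_ZLIB": "0"}))
```

Passing a plain dict instead of os.environ keeps the sketch easy to try without touching the real session environment.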

comment:6 by neteler, 10 years ago

Noting this here so it is not forgotten: raster/r.compress/r.compress.html needs to be updated.

in reply to:  6 comment:7 by glynn, 10 years ago

Replying to neteler:

raster/r.compress/r.compress.html needs to be updated

Done in r61500.

comment:8 by neteler, 10 years ago

Keywords: compression added

In case of a backport to relbr7, these should be the changes needed:

r61380 + r61420 + r61422 + r61500

comment:9 by neteler, 10 years ago

Priority: critical → blocker

Any objections to backport?

by neteler, 10 years ago

Attachment: zlib_default.diff added

bundle patch for relbr7

in reply to:  5 comment:10 by neteler, 10 years ago

Replying to glynn:

  • The compression level can be set via GRASS_ZLIB_LEVEL, whose value should be an integer between 0 and 9. If not set (or if the value cannot be parsed as an integer), zlib's default compression level will be used (lib/gis/flate.c:333, if a different default is preferred).

In relbranch7 there is currently:

Z_DEFAULT_COMPRESSION = 1 - "gives the best compromise between speed and compression" as per r61424.

Perhaps zlib compression 1 should be adopted for trunk?

in reply to:  8 comment:11 by neteler, 10 years ago

Keywords: r.compress added

Replying to neteler:

In case of a backport to relbr7, these should be the changes needed:

r61380 + r61420 + r61422 + r61500

Backported to relbr7 in r61797.

It remains open which default ZLIB level should be used; see comment:10.

comment:12 by neteler, 10 years ago

It seems that the NULL file is not compressed (file "cell_misc/null"). According to

http://lists.osgeo.org/pipermail/grass-user/2010-January/054216.html

it is one bit (null/non-null) for each cell. It looks like this:

[neteler@giscluster modis_lst_reconstructed_europe_daily]$ hexdump cell_misc/lst_2002_196_average/null | head
0000000 ffff ffff ffff ffff ffff ffff ffff ffff
*
0000ad0 ffff ffff ffff ffe0 ffff ffff ffff ffff
0000ae0 ffff ffff ffff ffff ffff ffff ffff ffff
*
00015a0 ffff ffff ffff ffff ffff ffff e0ff ffff
00015b0 ffff ffff ffff ffff ffff ffff ffff ffff
*
0002080 ffff ffff ffe0 ffff ffff ffff ffff ffff
0002090 ffff ffff ffff ffff ffff ffff ffff ffff
...

For our 12829 daily min/average/max MODIS LST maps covering Europe, the null files consume a lot of space:

[neteler@giscluster cell_misc]$ du -hs .
621G	.

Question: Would it be possible to also compress the null files, even with just a weak compression?
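To get a feel for how compressible such a bitmap is, here is a small Python sketch that packs one flag per cell into bytes and runs zlib over the result. The bit convention (1 = null, most significant bit first) and the map dimensions are assumptions made for illustration; this is not the lib/raster code.

```python
import zlib


def pack_null_bits(flags):
    """Pack a sequence of 0/1 flags into bytes, 8 cells per byte, MSB first."""
    out = bytearray((len(flags) + 7) // 8)
    for i, flag in enumerate(flags):
        if flag:
            out[i // 8] |= 0x80 >> (i % 8)
    return bytes(out)


# Hypothetical map: 200 rows of 1003 cells, all null except three cells per row.
cols, rows = 1003, 200
row_flags = [1] * cols
for col in (10, 500, 900):
    row_flags[col] = 0

bitmap = pack_null_bits(row_flags) * rows
packed = zlib.compress(bitmap, 1)  # even the weakest level helps a lot here
print(f"raw: {len(bitmap)} bytes, zlib level 1: {len(packed)} bytes")
```

Note that a row width that is not a multiple of 8 leaves padding bits in the last byte of each row, so three leftover cells pack to a byte like e0.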

in reply to:  12 ; comment:13 by glynn, 10 years ago

Replying to neteler:

It seems that the NULL file is not compressed (file "cell_misc/null"). According to

http://lists.osgeo.org/pipermail/grass-user/2010-January/054216.html

it is one bit (null/non-null) for each cell.

Question: Would it be possible to also compress the null files, even with just a weak compression?

Being uncompressed, the null files don't contain an index. The offset to the beginning of a given row is obtained by multiplying the row number by the number of bytes per row (which is just the number of columns divided by 8, rounded upwards).
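The offset rule can be written out directly. A minimal Python sketch with hypothetical function names; the dimensions in the example are the rows/cols values from the r.info -g output quoted in comment:19.

```python
def null_row_bytes(cols: int) -> int:
    """Bytes per row: one bit per cell, rounded up to a whole byte."""
    return (cols + 7) // 8


def null_row_offset(row: int, cols: int) -> int:
    """Byte offset of a row in the uncompressed null file (no index needed)."""
    return row * null_row_bytes(cols)


rows, cols = 200000, 240000
total = rows * null_row_bytes(cols)
print(null_row_bytes(cols), "bytes per row")
print(total, "bytes in total,", round(total / 2**30, 2), "GiB")
```

For that region the total comes to 6,000,000,000 bytes, which is consistent with the 5.6G null file shown in the directory listings in this ticket.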

The main issue is likely to be the need to support both formats. We need to

  • Be able to read and write the uncompressed format, for compatibility with existing versions of GRASS.
  • Be able to distinguish between compressed and uncompressed formats on read.
  • Provide some mechanism (i.e. yet another environment variable) to indicate which format to use on write.

in reply to:  13 comment:14 by neteler, 10 years ago

Milestone: 7.0.0 → 8.0.0
Priority: blocker → critical
Version: svn-releasebranch70 → svn-trunk

Replying to glynn: ...

The main issue is likely to be the need to support both formats. We need to

... at this point raster format changes may go into GRASS GIS 8.

comment:15 by neteler, 10 years ago

Keywords: null added
Milestone: 8.0.0 → 7.1.0

Back to this topic: on our system I found > 1.7 TB of NULL files in a single location, all uncompressed.

What about having a "null2" file which is compressed and has an index? If it is present, use it; otherwise fall back to the well-known uncompressed null file format.

For backward compatibility, r.null could be extended to convert from compressed null2 to uncompressed null (similar to v.build for the new spatial index in G7).

in reply to:  15 ; comment:16 by glynn, 10 years ago

Replying to neteler:

Back to this topic: on our system I found > 1.7 TB of NULL files in a single location, all uncompressed.

How large are the null files compared to the cell/fcell files?

What about having a "null2" file which is compressed and has an index? If it is present, use it; otherwise fall back to the well-known uncompressed null file format.

That's probably not a great deal of work, but as with any such change, we need to consider the migration strategy. If we just start creating compressed null files, mapsets will cease to be usable with older versions.

in reply to:  16 comment:17 by neteler, 10 years ago

Replying to glynn:

How large are the null files compared to the cell/fcell files?

  • With MODIS LST data, the null files are between 1.7 and 7.6 times larger than the cell files (we store the LST maps in deg C * 100 as integer to save disk space).
  • With 100k random points, the null file is 7.1 times larger than the fcell map
  • With the EU 25m DEM, the null file is much smaller than the derived aspect map (17% of the DEM fcell file)

What about having a "null2" file which is compressed and has an index? If it is present, use it; otherwise fall back to the well-known uncompressed null file format.

That's probably not a great deal of work, but as with any such change, we need to consider the migration strategy. If we just start creating compressed null files, mapsets will cease to be usable with older versions.

Right, but this could be covered with an addon/new script in G6 and earlier (as v.build does for vector data).

by glynn, 10 years ago

Attachment: compressed_nulls.diff added

implement compressed nulls

in reply to:  15 comment:18 by glynn, 10 years ago

Replying to neteler:

What about having a "null2" file which is compressed and has an index? If it is present, use it; otherwise fall back to the well-known uncompressed null file format.

Please test attachment:compressed_nulls.diff

Note that r.null and r.support will also require some changes, as they assume that the file consists of nothing but the null data (i.e. no index).

comment:19 by neteler, 10 years ago

Thanks for the patch. I applied it to my local copy of relbr70. Testing with this map

GRASS 7.0.1svn (eu_laea):~/releasebranch_7_0 > r.info -g eu_dem25_rand100k
north=6000000
south=1000000
east=8000000
west=2000000
nsres=25
ewres=25
rows=200000
cols=240000
cells=48000000000
datatype=FCELL
ncats=0

I created a new one using the 100k random points FCELL map sampled earlier from the EU 25m DEM:

 r.mapcalc "new = eu_dem25_rand100k"

Using the patch, I see that a NULL file is no longer generated. Expectedly, since the other modules are not updated yet, I cannot read that map:

GRASS 7.0.1svn (eu_laea):~/releasebranch_7_0 > r.univar new
ERROR: Raster map <new@PERMANENT>: format field in header file invalid

Next test with INT conversion:

GRASS 7.0.1svn (eu_laea):~/releasebranch_7_0 > r.mapcalc "new_int = round(eu_dem25_rand100k)"

GRASS 7.0.1svn (eu_laea):~/releasebranch_7_0 > r.univar new_int
0%

... the map is readable and comes with a NULL file, but that file is still about 6 GB, as before (so no compression occurred).

Are my tests wrong or am I missing anything else?

in reply to:  19 comment:20 by glynn, 10 years ago

Replying to neteler:

Are my tests wrong or am I missing anything else?

First, it shouldn't be necessary to recompile anything except lib/raster (but you should probably run "make clean" in that directory first, as the patch changes the R__ and fileinfo structures which are used throughout the library).

Second, you need to use

export GRASS_COMPRESS_NULLS=1

in order for compressed null files to be generated (cell_misc/<name>/null2).

Reading should automatically use the compressed null file if it exists or the raw null file if it doesn't.

If you don't set GRASS_COMPRESS_NULLS, everything should continue to work as before (including r.null and r.support). If it doesn't, that needs to be fixed now (i.e. before the changes are committed; issues which only arise when setting GRASS_COMPRESS_NULLS can be dealt with later).

comment:21 by neteler, 10 years ago

I tried again but don't see a difference in the NULL file size:

export GRASS_COMPRESS_NULLS=1
g.region raster=eu_dem25_rand100k -p
r.mapcalc "new_int = round(eu_dem25_rand100k)"
echo $GRASS_COMPRESS_NULLS
1

Verification:

root@grass ~ # l grassdata/eu_laea/PERMANENT/cell_misc/*
grassdata/eu_laea/PERMANENT/cell_misc/eu_dem25_rand100k:
total 5.6G
-rw-r--r-- 1 root root   53 Aug 15  2014 f_format
-rw-r--r-- 1 root root    5 Aug 15  2014 f_quant
-rw-r--r-- 1 root root   16 Aug 15  2014 f_range
-rw-r--r-- 1 root root 5.6G Aug 15  2014 null

grassdata/eu_laea/PERMANENT/cell_misc/new_int:
total 5.6G
-rw-r--r-- 1 root root 5.6G May 14 19:03 null
-rw-r--r-- 1 root root    9 May 14 19:03 range

in reply to:  21 comment:22 by glynn, 10 years ago

Replying to neteler:

I tried again but don't see a difference in the NULL file size:

Check that the patch applied successfully, that compilation was successful, and that r.mapcalc is using the new libgrass_raster.so, not e.g. an installed version.

The first two can be confirmed by running "strings" on the library and confirming that GRASS_COMPRESS_NULLS appears in the output. The last can be confirmed with e.g.

ldd $(which r.mapcalc)

comment:23 by neteler, 10 years ago

After cleaning up well on that machine, it is now finally working (sorry for the delay).

Test 1: mostly no-data cells

r.univar eu_dem25_rand100k
total null and non-null cells: 48000000000
total null cells: 47999900000
...

r.mapcalc "new_int = round(eu_dem25_rand100k)"

ls -la grassdata/eu_laea/PERMANENT/cell_misc/new_int:
total 5.7G
-rw-r--r-- 1 root root 5.6G May 14 19:03 null
-rw-r--r-- 1 root root  32M May 16 14:42 null2
-rw-r--r-- 1 root root    9 May 16 14:42 range

... that is less than 1% of the original null file size!

Test 2: 54% no-data cells

r.univar MOD11A1.A2000101.LST_Day_1km.reconstruct
total null and non-null cells: 415290645
total null cells: 226949285

r.info -g MOD11A1.A2000101.LST_Day_1km.reconstruct | grep datatype
datatype=CELL

r.mapcalc "modis_new_int = MOD11A1.A2000101.LST_Day_1km.reconstruct"


# original:
ls -la grassdata/eu_laea/PERMANENT/cell_misc/MOD11A1.A2000101.LST_Day_1km.reconstruct
total 50M
drwxr-xr-x 2 root root 4.0K Apr  8  2014 .
drwxr-xr-x 6 root root 4.0K May 17 14:14 ..
-rw-r--r-- 1 root root  50M Apr  8  2014 null
-rw-r--r-- 1 root root   12 Apr  8  2014 range

# new compressed null2:
ls -la grassdata/eu_laea/PERMANENT/cell_misc/modis_new_int
total 1.6M
drwxr-xr-x 2 root root 4.0K May 17 14:14 .
drwxr-xr-x 6 root root 4.0K May 17 14:14 ..
-rw-r--r-- 1 root root 1.6M May 17 14:14 null2
-rw-r--r-- 1 root root   12 May 17 14:14 range

... that is 3% of the original null file size.
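The two ratios can be rechecked from the listings. A trivial Python sketch; the inputs are the rounded sizes printed by ls above, so the results are approximate.

```python
def percent_of_original(new_size: float, old_size: float) -> float:
    """Size of the new file as a percentage of the old one (same units)."""
    return 100.0 * new_size / old_size


# Test 1: 32M null2 versus 5.6G null (both expressed in MiB).
test1 = percent_of_original(32, 5.6 * 1024)
# Test 2: 1.6M null2 versus 50M null.
test2 = percent_of_original(1.6, 50)
print(f"test 1: {test1:.2f}%  test 2: {test2:.2f}%")
```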

in reply to:  23 ; comment:24 by glynn, 10 years ago

Replying to neteler:

After cleaning up well on that machine, it is now finally working (sorry for the delay).

Committed in r65272, with one minor change: before writing out the null file, both cell_misc/null and cell_misc/null2 are deleted, not just the file which is being replaced.

This avoids the situation where overwriting a map with null compression disabled would leave any compressed null file in place, resulting in the map having both a new uncompressed null file and an old compressed null file, with the new file being ignored (a compressed null file takes precedence).

Actually, the precedence should probably be changed. If someone uses an older version of GRASS to overwrite a map which already has a compressed null file, newer versions of GRASS will use the (stale) compressed null file. It would be more robust to use the absence of an uncompressed null file to dictate the use of the compressed file, rather than just the presence of a compressed file.
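The more robust precedence proposed here can be sketched as follows. This is an illustration in Python with a hypothetical function name, not the lib/raster code; it uses the file names null and null2 from this discussion.

```python
import os
import tempfile


def choose_null_file(map_dir):
    """Return (kind, path) for the null file a reader should use.

    The compressed file is used only when no uncompressed file exists, so a
    map overwritten by an older GRASS version (which writes only "null")
    never falls back to a stale "null2".
    """
    raw = os.path.join(map_dir, "null")
    compressed = os.path.join(map_dir, "null2")
    if os.path.exists(raw):
        return ("raw", raw)
    if os.path.exists(compressed):
        return ("compressed", compressed)
    return (None, None)  # no null file at all


with tempfile.TemporaryDirectory() as map_dir:
    open(os.path.join(map_dir, "null2"), "wb").close()
    print(choose_null_file(map_dir)[0])  # only the compressed file exists
    open(os.path.join(map_dir, "null"), "wb").close()
    print(choose_null_file(map_dir)[0])  # an uncompressed file now wins
```

The rule relies on the writer deleting both files before writing a new one, as described for r65272, so that a freshly written compressed file is never shadowed by an old uncompressed one.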

in reply to:  24 comment:25 by glynn, 10 years ago

Replying to glynn:

Actually, the precedence should probably be changed. If someone uses an older version of GRASS to overwrite a map which already has a compressed null file, newer versions of GRASS will use the (stale) compressed null file. It would be more robust to use the absence of an uncompressed null file to dictate the use of the compressed file, rather than just the presence of a compressed file.

Done in r65273.

comment:26 by neteler, 10 years ago

Thanks, I can do more testing now.

What is the best way to obtain null2 files after setting the new env var? Using r.mapcalc I would ruin all metadata of my raster maps.

Would r.compress -u followed by r.compress do the job? From a user's point of view this might be the best solution.

in reply to:  26 ; comment:27 by glynn, 10 years ago

Replying to neteler:

Would r.compress -u followed by r.compress do the job? From a user's point of view this might be the best solution.

It would probably work. A better (but still simple) option would be to add a flag to r.compress so that it will process an already-compressed map (rather than assuming it doesn't need to do anything).

Ultimately we need something which will just update the null file, probably an enhancement to r.null.

Note that, at present, "r.null -c" or "r.support -n" will generate a broken map if GRASS_COMPRESS_NULLS is set, as they will write compressed data to the null2 file but won't write the row pointers.
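Why missing row pointers break a compressed file can be shown with a toy layout. The format below (row count, offset table, then zlib-compressed rows) is an assumption invented for this sketch and is not the GRASS on-disk format; the point is only that compressed rows have variable length, so a reader cannot multiply the row number by a fixed row size and needs the offsets instead.

```python
import struct
import zlib


def write_indexed_rows(rows):
    """Serialise rows as: row count, offset table, compressed row data."""
    blobs = [zlib.compress(row) for row in rows]
    header_size = 4 + 8 * (len(rows) + 1)
    offsets = [header_size]
    for blob in blobs:
        offsets.append(offsets[-1] + len(blob))  # end of one row = start of next
    header = struct.pack(">I", len(rows)) + struct.pack(f">{len(offsets)}Q", *offsets)
    return header + b"".join(blobs)


def read_row(buf, row):
    """Decompress a single row using the offset table (the row pointers)."""
    (count,) = struct.unpack_from(">I", buf, 0)
    if not 0 <= row < count:
        raise IndexError(row)
    start, end = struct.unpack_from(">2Q", buf, 4 + 8 * row)
    return zlib.decompress(buf[start:end])


rows = [b"\xff" * 30, b"\xff" * 29 + b"\xe0", b"\x00" * 30]
buf = write_indexed_rows(rows)
print(read_row(buf, 1) == rows[1])
```

A writer that emits only the compressed rows and skips the table leaves the reader with no way to locate row boundaries, which is the failure mode described for r.null -c and r.support -n.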

Actually, those modules are already on shaky ground because they're passing an invalid map index as the "fd" parameter to Rast__write_null_bits().

in reply to:  27 ; comment:28 by neteler, 10 years ago

Replying to glynn:

Ultimately we need something which will just update the null file, probably an enhancement to r.null.

Yes - this approach would save quite some CPU time.

Note that, at present, "r.null -c" or "r.support -n" will generate a broken map if GRASS_COMPRESS_NULLS is set, as they will write compressed data to the null2 file but won't write the row pointers.

Actually, those modules are already on shaky ground because they're passing an invalid map index as the "fd" parameter to Rast__write_null_bits().

I hope you are willing to implement this, looking forward to it. Thanks.

in reply to:  28 ; comment:29 by glynn, 9 years ago

Replying to neteler:

I hope you are willing to implement this, looking forward to it. Thanks.

I'm working on it. In the meantime, don't use compressed nulls, because they aren't actually working yet. At the point that the row pointers are written out, the null file has already been closed.

in reply to:  29 ; comment:30 by glynn, 9 years ago

Replying to glynn:

In the meantime, don't use compressed nulls, because they aren't actually working yet. At the point that the row pointers are written out, the null file has already been closed.

This should be fixed in r65323, along with the issue that a compressed null file was being ignored (resulting in the zero-is-null compatibility behaviour).

in reply to:  30 ; comment:31 by neteler, 9 years ago

Replying to glynn:

Replying to glynn:

In the meantime, don't use compressed nulls, because they aren't actually working yet. At the point that the row pointers are written out, the null file has already been closed.

This should be fixed in r65323, along with the issue that a compressed null file was being ignored (resulting in the zero-is-null compatibility behaviour).

Is there any snippet which should be backported?

in reply to:  31 ; comment:32 by glynn, 9 years ago

Replying to neteler:

Is there any snippet which should be backported?

I'd suggest not including this feature in any release version until it has received more thorough testing. If it has already been backported in some form, that should be reverted. For now, it should exist only in trunk.

in reply to:  32 ; comment:33 by neteler, 9 years ago

Replying to glynn:

For now, it should exist only in trunk.

Yes, this is the case, no related backports done.

I am just not sure how to test it (I would need the improved r.null/r.compress functionality to not ruin all metadata). Or do you prefer to have tests with newly generated data first? I can do that as well.

in reply to:  33 ; comment:34 by glynn, 9 years ago

Replying to neteler:

I am just not sure how to test it (I would need the improved r.null/r.compress functionality to not ruin all metadata). Or do you prefer to have tests with newly generated data first? I can do that as well.

I added some support for the improved r.null/r.compress functionality to lib/raster, but it was amongst the things broken by r65348 (the tempfile changes), so this needs to wait until that gets fixed or reverted.

in reply to:  34 ; comment:35 by martinl, 9 years ago

Replying to glynn:

Replying to neteler:

I added some support for the improved r.null/r.compress functionality to lib/raster, but it was amongst the things broken by r65348 (the tempfile changes), so this needs to wait until that gets fixed or reverted.

By "fix" you mean to ignore GRASS_TMPDIR_MAPSET when working with raster data?

in reply to:  35 comment:36 by glynn, 9 years ago

Replying to martinl:

By "fix" you mean to ignore GRASS_TMPDIR_MAPSET when working with raster data?

Yes. I don't particularly want to do anything with lib/raster while that's outstanding (the tempfile changes broke Rast__close_null(); while a workaround would be trivial, that's symptomatic of the sort of thing that happens when two people are working on the same code concurrently).

comment:37 by annakrat, 9 years ago

Please see #2689, which is probably related.

in reply to:  34 ; comment:38 by glynn, 9 years ago

Replying to glynn:

I added some support for the improved r.null/r.compress functionality to lib/raster

With r65489, r65490, and r65491, r.null and r.support should work with compressed nulls. r.null now has a -z flag to re-create the null file using the current compression setting (i.e. compress/decompress).

comment:39 by sprice, 9 years ago

Probably related to #2750, where I've implemented LZ4, LZ4HC, and ZSTD support in GRASS for up to 3x speedup in I/O bound modules.

Setting GRASS_COMPRESS_NULLS is slightly slower than leaving it unset when using the original ZLIB, but faster when using LZ4. LZ4 does a better job compressing null files than ZLIB, in addition to the speedup.

comment:40 by wenzeslaus, 9 years ago

I've executed tests with GRASS_COMPRESS_NULLS=1 and they worked well. Here is what I was running:

export GRASS_COMPRESS_NULLS=1
grass71 /grassdata/nc_basic/user1/ --exec python -m grass.gunittest.main --location nc_basic --location-type nc
export GRASS_INT_ZLIB=0
grass71 /grassdata/nc_basic/user1/ --exec python -m grass.gunittest.main ...

However, I think that our test coverage for this is quite low, especially for the usage of NULLs.

in reply to:  38 comment:41 by wenzeslaus, 9 years ago

Replying to glynn:

Replying to glynn:

I added some support for the improved r.null/r.compress functionality to lib/raster

With r65489, r65490, and r65491, r.null and r.support should work with compressed nulls. r.null now has a -z flag to re-create the null file using the current compression setting (i.e. compress/decompress).

This ticket seems complete to me as far as the code is concerned. Is that right? Were there any other tests besides comment:40?

comment:42 by neteler, 9 years ago

Milestone: 7.1.0 → 7.0.3

See #2750 for a new compression scheme. Ideally this ticket is handled together with #2750.

The NULL compression should definitely go into relbranch70.

in reply to:  38 comment:43 by neteler, 9 years ago

Replying to glynn:

Replying to glynn:

I added some support for the improved r.null/r.compress functionality to lib/raster

With r65489, r65490, and r65491, r.null and r.support should work with compressed nulls. r.null now has a -z flag to re-create the null file using the current compression setting (i.e. compress/decompress).

For the record, here are all the changesets related to this ticket which I could identify (for a future backport, needed also for a #2750 backport):

r65272, r65273, r65322, r65323, r65489, r65490, r65491, r65775

comment:44 by martinl, 9 years ago

Milestone: 7.0.3 → 7.0.4

comment:45 by martinl, 9 years ago

Should be backported at the beginning of release cycle.

comment:46 by martinl, 9 years ago

AFAIU the backport is not planned, right? Can we close this ticket?

in reply to:  46 comment:47 by neteler, 9 years ago

Milestone: 7.0.4 → 7.2.0
Resolution: → fixed
Status: new → closed

Replying to martinl:

AFAIU the backport is not planned, right? Can we close this ticket?

Closing. Thanks for the great contributions and to all testers.

comment:48 by neteler, 9 years ago

Milestone: 7.2.0

Milestone deleted

comment:49 by neteler, 9 years ago

Milestone: 7.1.0

comment:50 by neteler, 9 years ago

Milestone: 7.1.0 → 7.2.0

Milestone renamed

comment:51 by neteler, 8 years ago

As per

... the default ZLIB compression level was invalid, causing ZLIB to not compress at all. Fixed in r69387 + r69388

comment:52 by neteler, 8 years ago

For the record - concerning NULL file compression:

At present cell_misc/null is uncompressed by default.

With the environment variable GRASS_COMPRESS_NULLS=1, NULL compression is activated for the session for newly created raster maps; and

export GRASS_COMPRESS_NULLS=1
r.null -z raster_map

generates a new compressed file cell_misc/nullcmpr and removes the old uncompressed cell_misc/null file for existing maps.

Note: At least GRASS GIS 7.2 is needed to read a raster map with compressed null file.

comment:53 by neteler, 8 years ago

r.compress manual updated in r69403, r69404.
