#2349 closed enhancement (fixed)
CELL raster format: make ZLIB level 3 standard compression instead of RLE
Reported by: | neteler | Owned by: | |
---|---|---|---|
Priority: | critical | Milestone: | 7.2.0 |
Component: | Raster | Version: | svn-trunk |
Keywords: | compression, r.compress, null | Cc: | |
CPU: | Unspecified | Platform: | All |
Description
At present, integer maps (CELL) are still compressed with RLE. This leads to a huge waste of disk space when it comes to large data.
Proposal: make ZLIB, level 3 the standard compression.
At present we can enable the environment variable GRASS_INT_ZLIB, but it will use the default ZLIB level 6 compression, which is too CPU intensive. So (user) control over this is important.
BTW: Manual of r.compress updated in r60814, needs to be backported.
Attachments (2)
Change History (55)
comment:1 by , 10 years ago
follow-up: 3 comment:2 by , 10 years ago
Replying to neteler:
At present, integer maps (CELL) are still compressed with RLE. This leads to a huge waste of disk space when it comes to large data.
Proposal: make ZLIB, level 3 the standard compression.
Is GRASS_INT_ZLIB support now old enough that it can be taken for granted?
At present we can enable the environment variable GRASS_INT_ZLIB, but it will use the default ZLIB level 6 compression, which is too CPU intensive. So (user) control over this is important.
The current behaviour is that setting GRASS_INT_ZLIB to anything (even an empty string) will enable zlib compression at the hard-coded level. One option is to parse the value as an integer and use the result as the compression level. However, it's possible that people are currently using e.g. GRASS_INT_ZLIB=1 to enable it with the existing default level.
Another option is to add another environment variable for the level.
Aside: if there are still systems out there using the historical limit of 4096 bytes of memory for the combination of environment variables and arguments (argv), we might want to think about making GRASS less greedy with respect to environment variables.
follow-up: 4 comment:3 by , 10 years ago
Replying to glynn:
Replying to neteler:
At present, integer maps (CELL) are still compressed with RLE. This leads to a huge waste of disk space when it comes to large data.
Proposal: make ZLIB, level 3 the standard compression.
Is GRASS_INT_ZLIB support now old enough that it can be taken for granted?
I hope yes. I am not aware of negative reports.
At present we can enable the environment variable GRASS_INT_ZLIB, but it will use the default ZLIB level 6 compression, which is too CPU intensive. So (user) control over this is important.
The current behaviour is that setting GRASS_INT_ZLIB to anything (even an empty string) will enable zlib compression at the hard-coded level.
Exactly.
One option is to parse the value as an integer and use the result as the compression level. However, it's possible that people are currently using e.g. GRASS_INT_ZLIB=1 to enable it with the existing default level.
Another option is to add another environment variable for the level.
Yes, a new GRASS_ZLIBLEVEL may be less invasive.
Aside: if there are still systems out there using the historical limit of 4096 bytes of memory for the combination of environment variables and arguments (argv), we might want to think about making GRASS less greedy with respect to environment variables.
You mean the number and/or the length?
comment:4 by , 10 years ago
Replying to neteler:
Aside: if there are still systems out there using the historical limit of 4096 bytes of memory for the combination of environment variables and arguments (argv), we might want to think about making GRASS less greedy with respect to environment variables.
You mean the number and/or the length?
Mainly the number.
follow-up: 10 comment:5 by , 10 years ago
Replying to neteler:
Proposal: make ZLIB, level 3 the standard compression.
r61380 implements the following behaviour:
- zlib compression is the default. Set GRASS_INT_ZLIB=0 to use RLE compression.
- The compression level can be set via GRASS_ZLIB_LEVEL, whose value should be an integer between 0 and 9. If not set (or if the value cannot be parsed as an integer), zlib's default compression level will be used (lib/gis/flate.c:333, if a different default is preferred).
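The behaviour listed above can be sketched in Python as follows. This is only an illustration of the decision logic, not the code in lib/gis/flate.c. The function name `select_compression` is hypothetical, and two details are assumptions that the comment does not spell out: that only a value parsing to the integer 0 selects RLE, and that an out-of-range level falls back to zlib's default.

```python
import zlib


def select_compression(env):
    """Return (method, level) for CELL maps from an environment mapping.

    Hypothetical helper mirroring the rules listed for r61380.
    """
    # zlib is the default; GRASS_INT_ZLIB=0 switches back to RLE.
    # Assumption: any other value (or an unset variable) keeps zlib.
    method = "zlib"
    raw_flag = env.get("GRASS_INT_ZLIB")
    if raw_flag is not None:
        try:
            if int(raw_flag) == 0:
                method = "rle"
        except ValueError:
            pass  # unparsable flag: keep the zlib default (assumption)

    # GRASS_ZLIB_LEVEL should be an integer between 0 and 9.
    level = zlib.Z_DEFAULT_COMPRESSION  # -1 means "let zlib decide"
    raw_level = env.get("GRASS_ZLIB_LEVEL")
    if raw_level is not None:
        try:
            parsed = int(raw_level)
            if 0 <= parsed <= 9:
                level = parsed
            # Assumption: out-of-range values fall back to the default.
        except ValueError:
            pass  # not an integer: use zlib's default level

    return method, level
```

For instance, `select_compression({"GRASS_ZLIB_LEVEL": "3"})` selects zlib at level 3, while `select_compression({"GRASS_INT_ZLIB": "0"})` selects RLE.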
follow-up: 7 comment:6 by , 10 years ago
Noting this here so it is not forgotten: raster/r.compress/r.compress.html needs to be updated.
comment:7 by , 10 years ago
follow-up: 11 comment:8 by , 10 years ago
Keywords: | compression added |
---|
comment:10 by , 10 years ago
Replying to glynn:
- The compression level can be set via GRASS_ZLIB_LEVEL, whose value should be an integer between 0 and 9. If not set (or if the value cannot be parsed as an integer), zlib's default compression level will be used (lib/gis/flate.c:333, if a different default is preferred).
In relbranch7 there is currently:
Z_DEFAULT_COMPRESSION = 1 - "gives the best compromise between speed and compression" as per r61424.
Perhaps zlib compression 1 should be adopted for trunk?
comment:11 by , 10 years ago
Keywords: | r.compress added |
---|
follow-up: 13 comment:12 by , 10 years ago
It seems that the NULL file is not compressed (file "cell_misc/null"). According to
http://lists.osgeo.org/pipermail/grass-user/2010-January/054216.html
it is one bit (null/non-null) for each cell. It looks like this:
[neteler@giscluster modis_lst_reconstructed_europe_daily]$ hexdump cell_misc/lst_2002_196_average/null | head
0000000 ffff ffff ffff ffff ffff ffff ffff ffff
*
0000ad0 ffff ffff ffff ffe0 ffff ffff ffff ffff
0000ae0 ffff ffff ffff ffff ffff ffff ffff ffff
*
00015a0 ffff ffff ffff ffff ffff ffff e0ff ffff
00015b0 ffff ffff ffff ffff ffff ffff ffff ffff
*
0002080 ffff ffff ffe0 ffff ffff ffff ffff ffff
0002090 ffff ffff ffff ffff ffff ffff ffff ffff
...
For our 12829 daily min/average/max MODIS LST maps covering Europe, the null files consume a lot of space:
[neteler@giscluster cell_misc]$ du -hs .
621G .
Question: Would it be possible to also compress the null files, even with just a weak compression?
follow-up: 14 comment:13 by , 10 years ago
Replying to neteler:
It seems that the NULL file is not compressed (file "cell_misc/null"). According to
http://lists.osgeo.org/pipermail/grass-user/2010-January/054216.html
it is one bit (null/non-null) for each cell.
Question: Would it be possible to compress also the null files, even with just a weak compression?
Being uncompressed, the null files don't contain an index. The offset to the beginning of a given row is obtained by multiplying the row number by the number of bytes per row (which is just the number of columns divided by 8, rounded upwards).
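The offset rule can be written out as a short Python sketch. The function names are hypothetical and rows are assumed to be numbered from 0. The dimensions used at the end are those of the eu_dem25_rand100k map whose r.info output appears in comment:19.

```python
def null_bytes_per_row(cols):
    """Bytes per row in the uncompressed null file: one bit per cell."""
    return (cols + 7) // 8  # columns divided by 8, rounded upwards


def null_row_offset(row, cols):
    """Byte offset of a row (0-based, an assumption) in the null file."""
    return row * null_bytes_per_row(cols)


def null_file_size(rows, cols):
    """Total size in bytes of the uncompressed null file."""
    return rows * null_bytes_per_row(cols)


# Dimensions of the eu_dem25_rand100k map quoted in this ticket.
size = null_file_size(200000, 240000)
print(size, "bytes =", round(size / 2**30, 1), "GiB")
```

With 240000 columns each row takes 30000 bytes, so 200000 rows come to 6,000,000,000 bytes, roughly 5.6 GiB, which agrees with the 5.6G null file shown in the directory listings in this ticket.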
The main issue is likely to be the need to support both formats. We need to
- Be able to read and write the uncompressed format, for compatibility with existing versions of GRASS.
- Be able to distinguish between compressed and uncompressed formats on read.
- Provide some mechanism (i.e. yet another environment variable) to indicate which format to use on write.
comment:14 by , 10 years ago
Milestone: | 7.0.0 → 8.0.0 |
---|---|
Priority: | blocker → critical |
Version: | svn-releasebranch70 → svn-trunk |
Replying to glynn: ...
The main issue is likely to be the need to support both formats. We need to
... at this point raster format changes may go into GRASS GIS 8.
follow-ups: 16 18 comment:15 by , 10 years ago
Keywords: | null added |
---|---|
Milestone: | 8.0.0 → 7.1.0 |
Back to this topic: on our system I found > 1.7TB of NULL files in a single location, all uncompressed.
What about having a "null2" file which is compressed and has an index? If it is present, use it; otherwise fall back to the well-known uncompressed null file format.
For backward compatibility, r.null could be extended to convert from compressed null2 to uncompressed null (similar to v.build for the new spatial index in G7).
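To illustrate what "compressed and with index" could mean, here is a minimal Python sketch that compresses each bitmask row separately with zlib and keeps a list of row offsets so a single row can be read back without touching the others. The layout and the function names are hypothetical and are not the on-disk format GRASS uses. The default level of 3 simply follows the level proposed in this ticket.

```python
import zlib


def compress_null_rows(rows, level=3):
    """Compress each null-bitmask row separately and record row offsets.

    Returns (offsets, blob); offsets has len(rows) + 1 entries so that
    row i occupies blob[offsets[i]:offsets[i + 1]].
    """
    offsets = [0]
    chunks = []
    for row in rows:
        packed = zlib.compress(row, level)
        chunks.append(packed)
        offsets.append(offsets[-1] + len(packed))
    return offsets, b"".join(chunks)


def read_null_row(offsets, blob, row):
    """Decompress a single row using the index; no other row is touched."""
    return zlib.decompress(blob[offsets[row]:offsets[row + 1]])


# Synthetic mask: 1000 rows of 30000 bytes, almost entirely 0xff,
# loosely modelled on the hexdump shown earlier in this ticket.
rows = [b"\xff" * 30000 for _ in range(1000)]
rows[500] = b"\xff" * 100 + b"\xe0" + b"\xff" * 29899

offsets, blob = compress_null_rows(rows)
raw_size = sum(len(r) for r in rows)
print("raw bytes:", raw_size, "compressed bytes:", len(blob))
assert read_null_row(offsets, blob, 500) == rows[500]
```

Compressing row by row costs some compression ratio compared with compressing the whole file at once, but it preserves the random access by row that the uncompressed format gets from its fixed row length.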
follow-up: 17 comment:16 by , 10 years ago
Replying to neteler:
back to this topic: Here on our system I found > 1.7TB of NULL files in a single location, all uncompressed.
How large are the null files compared to the cell/fcell files?
What about having a "null2" file which is compressed and with index. If present, fine, otherwise use the uncompressed well known null file format?
That's probably not a great deal of work, but as with any such change, we need to consider the migration strategy. If we just start creating compressed null files, mapsets will cease to be usable with older versions.
comment:17 by , 10 years ago
Replying to glynn:
How large are the null files compared to the cell/fcell files?
- With MODIS LST data, the null files are between 1.7 and 7.6 times larger than the cell files (we store the LST maps in deg C * 100 as integer to save disk space).
- With 100k random points, the null file is 7.1 times larger than the fcell map
- With the EU 25m DEM, the null file is way smaller than the derived aspect map (17% of the DEM fcell file)
What about having a "null2" file which is compressed and with index. If present, fine, otherwise use the uncompressed well known null file format?
That's probably not a great deal of work, but as with any such change, we need to consider the migration strategy. If we just start creating compressed null files, mapsets will cease to be usable with older versions.
Right, but this could be covered with an addon/new script in G6 and earlier (as v.build does for vector data).
comment:18 by , 10 years ago
Replying to neteler:
What about having a "null2" file which is compressed and with index. If present, fine, otherwise use the uncompressed well known null file format?
Please test attachment:compressed_nulls.diff
Note that r.null and r.support will also require some changes, as they assume that the file consists of nothing but the null data (i.e. no index).
follow-up: 20 comment:19 by , 10 years ago
Thanks for the patch. I applied it to my local copy of relbr70. Testing with this map
GRASS 7.0.1svn (eu_laea):~/releasebranch_7_0 > r.info -g eu_dem25_rand100k
north=6000000
south=1000000
east=8000000
west=2000000
nsres=25
ewres=25
rows=200000
cols=240000
cells=48000000000
datatype=FCELL
ncats=0
I created a new one using the 100k random points FCELL map sampled earlier from the EU 25m DEM:
r.mapcalc "new = eu_dem25_rand100k"
Using the patch, I see that a NULL file is no longer generated. Expectedly, since the other modules are not updated yet, I cannot read that map:
GRASS 7.0.1svn (eu_laea):~/releasebranch_7_0 > r.univar new
ERROR: Raster map <new@PERMANENT>: format field in header file invalid
Next test with INT conversion:
GRASS 7.0.1svn (eu_laea):~/releasebranch_7_0 > r.mapcalc "new_int = round(eu_dem25_rand100k)"
GRASS 7.0.1svn (eu_laea):~/releasebranch_7_0 > r.univar new_int
0%
... the map is readable and comes with the NULL file, but that is still about 6GB, as before (so no compression occurred).
Are my tests wrong or am I missing anything else?
comment:20 by , 10 years ago
Replying to neteler:
Are my tests wrong or am I missing anything else?
First, it shouldn't be necessary to recompile anything except lib/raster (but you should probably run "make clean" in that directory first, as the patch changes the R__ and fileinfo structures which are used throughout the library).
Second, you need to use
export GRASS_COMPRESS_NULLS=1
in order for compressed null files to be generated (cell_misc/<name>/null2).
Reading should automatically use the compressed null file if it exists or the raw null file if it doesn't.
If you don't set GRASS_COMPRESS_NULLS, everything should continue to work as before (including r.null and r.support). If it doesn't, that needs to be fixed now (i.e. before the changes are committed; issues which only arise when setting GRASS_COMPRESS_NULLS can be dealt with later).
follow-up: 22 comment:21 by , 10 years ago
I tried again but don't see a difference in the NULL file size:
export GRASS_COMPRESS_NULLS=1
g.region raster=eu_dem25_rand100k -p
r.mapcalc "new_int = round(eu_dem25_rand100k)"
echo $GRASS_COMPRESS_NULLS
1
Verification:
root@grass ~ # l grassdata/eu_laea/PERMANENT/cell_misc/*
grassdata/eu_laea/PERMANENT/cell_misc/eu_dem25_rand100k:
total 5.6G
-rw-r--r-- 1 root root 53 Aug 15 2014 f_format
-rw-r--r-- 1 root root 5 Aug 15 2014 f_quant
-rw-r--r-- 1 root root 16 Aug 15 2014 f_range
-rw-r--r-- 1 root root 5.6G Aug 15 2014 null
grassdata/eu_laea/PERMANENT/cell_misc/new_int:
total 5.6G
-rw-r--r-- 1 root root 5.6G May 14 19:03 null
-rw-r--r-- 1 root root 9 May 14 19:03 range
comment:22 by , 10 years ago
Replying to neteler:
I tried again but don't see a difference in the NULL file size:
Check that the patch applied successfully, that compilation was successful, and that r.mapcalc is using the new libgrass_raster.so, not e.g. an installed version.
The first two can be confirmed by running "strings" on the library and confirming that GRASS_COMPRESS_NULLS appears in the output. The last can be confirmed with e.g.
ldd $(which r.mapcalc)
follow-up: 24 comment:23 by , 10 years ago
After cleaning up well on that machine, it is now finally working (sorry for the delay).
Test 1: mostly no-data cells
r.univar eu_dem25_rand100k
total null and non-null cells: 48000000000
total null cells: 47999900000
...
r.mapcalc "new_int = round(eu_dem25_rand100k)"
ls -la grassdata/eu_laea/PERMANENT/cell_misc/new_int:
total 5.7G
-rw-r--r-- 1 root root 5.6G May 14 19:03 null
-rw-r--r-- 1 root root 32M May 16 14:42 null2
-rw-r--r-- 1 root root 9 May 16 14:42 range
... that is less than 1% of the original null file size!
Test 2: 54% no-data cells
r.univar MOD11A1.A2000101.LST_Day_1km.reconstruct
total null and non-null cells: 415290645
total null cells: 226949285
r.info -g MOD11A1.A2000101.LST_Day_1km.reconstruct | grep datatype
datatype=CELL
r.mapcalc "modis_new_int = MOD11A1.A2000101.LST_Day_1km.reconstruct"
# original:
ls -la grassdata/eu_laea/PERMANENT/cell_misc/MOD11A1.A2000101.LST_Day_1km.reconstruct
total 50M
drwxr-xr-x 2 root root 4.0K Apr 8 2014 .
drwxr-xr-x 6 root root 4.0K May 17 14:14 ..
-rw-r--r-- 1 root root 50M Apr 8 2014 null
-rw-r--r-- 1 root root 12 Apr 8 2014 range
# new compressed null2:
ls -la grassdata/eu_laea/PERMANENT/cell_misc/modis_new_int
total 1.6M
drwxr-xr-x 2 root root 4.0K May 17 14:14 .
drwxr-xr-x 6 root root 4.0K May 17 14:14 ..
-rw-r--r-- 1 root root 1.6M May 17 14:14 null2
-rw-r--r-- 1 root root 12 May 17 14:14 range
... that is 3% of the original null file size.
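The two percentages can be checked with a few lines of Python. The sizes are the rounded, human-readable values from the listings in the two tests, so the results are approximate, and the helper name `ratio_percent` is hypothetical. The sketch assumes that ls reports sizes in powers of 1024.

```python
def ratio_percent(compressed, original):
    """Size of the compressed file as a percentage of the original."""
    return 100.0 * compressed / original


MIB = 2**20
GIB = 2**30

# Test 1: 32M null2 versus 5.6G null (mostly no-data cells).
test1 = ratio_percent(32 * MIB, 5.6 * GIB)
# Test 2: 1.6M null2 versus 50M null (54% no-data cells).
test2 = ratio_percent(1.6 * MIB, 50 * MIB)

print(f"Test 1: {test1:.2f}%  Test 2: {test2:.2f}%")
```

This gives roughly 0.56% for Test 1 and 3.2% for Test 2, consistent with "less than 1%" and "3%".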
follow-up: 25 comment:24 by , 10 years ago
Replying to neteler:
After cleaning up well on that machine, it is now finally working (sorry for the delay).
Committed in r65272, with one minor change: before writing out the null file, both cell_misc/null and cell_misc/null2 are deleted, not just the file which is being replaced.
This avoids the situation where overwriting a map with null compression disabled would leave any compressed null file in place, resulting in the map having both a new uncompressed null file and an old compressed null file, with the new file being ignored (a compressed null file takes precedence).
Actually, the precedence should probably be changed. If someone uses an older version of GRASS to overwrite a map which already has a compressed null file, newer versions of GRASS will use the (stale) compressed null file. It would be more robust to use the absence of an uncompressed null file to dictate the use of the compressed file, rather than just the presence of a compressed file.
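The cleanup on write and the proposed read precedence can be sketched in Python as follows. The file names null and null2 come from this ticket; the function names are hypothetical, the payloads are placeholders rather than real null data, and this is not the lib/raster implementation.

```python
import os
import tempfile


def write_null_file(map_dir, data, compressed):
    """Write a null file after removing both variants, as in r65272."""
    for name in ("null", "null2"):
        path = os.path.join(map_dir, name)
        if os.path.exists(path):
            os.remove(path)
    target = "null2" if compressed else "null"
    with open(os.path.join(map_dir, target), "wb") as f:
        f.write(data)
    return target


def null_file_to_read(map_dir):
    """Pick the null file to read under the proposed precedence rule."""
    if os.path.exists(os.path.join(map_dir, "null")):
        return "null"  # an uncompressed file wins, even if null2 exists
    if os.path.exists(os.path.join(map_dir, "null2")):
        return "null2"
    return None  # no null file at all


with tempfile.TemporaryDirectory() as map_dir:
    write_null_file(map_dir, b"placeholder", compressed=True)
    # Simulate an older GRASS overwriting the map: it writes "null"
    # and knows nothing about "null2", which is left behind.
    with open(os.path.join(map_dir, "null"), "wb") as f:
        f.write(b"\xff\xff")
    print(null_file_to_read(map_dir))
```

In the simulated case the stale null2 file is ignored and the freshly written null file is chosen, which is the more robust behaviour argued for above.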
comment:25 by , 10 years ago
Replying to glynn:
Actually, the precedence should probably be changed. If someone uses an older version of GRASS to overwrite a map which already has a compressed null file, newer versions of GRASS will use the (stale) compressed null file. It would be more robust to use the absence of an uncompressed null file to dictate the use of the compressed file, rather than just the presence of a compressed file.
Done in r65273.
follow-up: 27 comment:26 by , 10 years ago
Thanks, I can do more testing now.
What is the best way to generate null2 files after setting the new env var? Using r.mapcalc I would ruin all metadata of my raster maps.
Would r.compress -u followed by r.compress do the job? From a user's point of view this might be the best solution.
follow-up: 28 comment:27 by , 10 years ago
Replying to neteler:
Would r.compress -u followed by r.compress do the job? From a user's point of view this might be the best solution.
It would probably work. A better (but still simple) option would be to add a flag to r.compress so that it will process an already-compressed map (rather than assuming it doesn't need to do anything).
Ultimately we need something which will just update the null file, probably an enhancement to r.null.
Note that, at present, "r.null -c" or "r.support -n" will generate a broken map if GRASS_COMPRESS_NULLS is set, as they will write compressed data to the null2 file but won't write the row pointers.
Actually, those modules are already on shaky ground because they're passing an invalid map index as the "fd" parameter to Rast__write_null_bits().
follow-up: 29 comment:28 by , 10 years ago
Replying to glynn:
Ultimately we need something which will just update the null file, probably an enhancement to r.null.
Yes - this approach would save quite some CPU time.
Note that, at present, "r.null -c" or "r.support -n" will generate a broken map if GRASS_COMPRESS_NULLS is set, as they will write compressed data to the null2 file but won't write the row pointers.
Actually, those modules are already on shaky ground because they're passing an invalid map index as the "fd" parameter to Rast__write_null_bits().
I hope you are willing to implement this, looking forward to it. Thanks.
follow-up: 30 comment:29 by , 10 years ago
Replying to neteler:
I hope you are willing to implement this, looking forward to it. Thanks.
I'm working on it. In the meantime, don't use compressed nulls, because they aren't actually working yet. At the point that the row pointers are written out, the null file has already been closed.
follow-up: 31 comment:30 by , 10 years ago
Replying to glynn:
In the meantime, don't use compressed nulls, because they aren't actually working yet. At the point that the row pointers are written out, the null file has already been closed.
This should be fixed in r65323, along with the issue that a compressed null file was being ignored (resulting in the zero-is-null compatibility behaviour).
follow-up: 32 comment:31 by , 9 years ago
Replying to glynn:
Replying to glynn:
In the meantime, don't use compressed nulls, because they aren't actually working yet. At the point that the row pointers are written out, the null file has already been closed.
This should be fixed in r65323, along with the issue that a compressed null file was being ignored (resulting in the zero-is-null compatibility behaviour).
Is there any snippet which should be backported?
follow-up: 33 comment:32 by , 9 years ago
Replying to neteler:
Is there any snippet which should be backported?
I'd suggest not including this feature in any release version until it has received more thorough testing. If it has already been backported in some form, that should be reverted. For now, it should exist only in trunk.
follow-up: 34 comment:33 by , 9 years ago
Replying to glynn:
For now, it should exist only in trunk.
Yes, this is the case, no related backports done.
I am just not sure how to test it (I would need the improved r.null/r.compress functionality to not ruin all metadata). Or do you prefer to have tests with newly generated data first? I can do that as well.
follow-ups: 35 38 comment:34 by , 9 years ago
Replying to neteler:
I am just not sure how to test it (I would need the improved r.null/r.compress functionality to not ruin all metadata). Or do you prefer to have tests with newly generated data first? I can do that as well.
I added some support for the improved r.null/r.compress functionality to lib/raster, but it was amongst the things broken by r65348 (the tempfile changes), so this needs to wait until that gets fixed or reverted.
follow-up: 36 comment:35 by , 9 years ago
Replying to glynn:
Replying to neteler:
I added some support for the improved r.null/r.compress functionality to lib/raster, but it was amongst the things broken by r65348 (the tempfile changes), so this needs to wait until that gets fixed or reverted.
by "fix" you mean to ignore GRASS_TMPDIR_MAPSET when working with raster data?
comment:36 by , 9 years ago
Replying to martinl:
by "fix" you mean to ignore GRASS_TMPDIR_MAPSET when working with raster data?
Yes. I don't particularly want to do anything with lib/raster while that's outstanding (the tempfile changes broke Rast__close_null(); while a workaround would be trivial, that's symptomatic of the sort of thing that happens when two people are working on the same code concurrently).
Changed in r65436: GRASS_TMPDIR_MAPSET is now accepted only by the vector library.
follow-ups: 41 43 comment:38 by , 9 years ago
Replying to glynn:
I added some support for the improved r.null/r.compress functionality to lib/raster
With r65489, r65490, and r65491, r.null and r.support should work with compressed nulls. r.null now has a -z flag to re-create the null file using the current compression setting (i.e. compress/decompress).
comment:39 by , 9 years ago
Probably related to #2750, where I've implemented LZ4, LZ4HC, and ZSTD support in GRASS for up to 3x speedup in I/O bound modules.
Using GRASS_COMPRESS_NULLS is slightly slower than leaving null files uncompressed when using the original ZLIB, but faster when using LZ4. LZ4 does a better job compressing null files than ZLIB, in addition to the speedup.
comment:40 by , 9 years ago
I've executed tests with GRASS_COMPRESS_NULLS=1 and they worked well. Here is what I was running:
export GRASS_COMPRESS_NULLS=1
grass71 /grassdata/nc_basic/user1/ --exec python -m grass.gunittest.main --location nc_basic --location-type nc
export GRASS_INT_ZLIB=0
grass71 /grassdata/nc_basic/user1/ --exec python -m grass.gunittest.main ...
However, I think that our test coverage is quite low for this, especially for the usage of NULLs.
comment:41 by , 9 years ago
Replying to glynn:
Replying to glynn:
I added some support for the improved r.null/r.compress functionality to lib/raster
With r65489, r65490, and r65491, r.null and r.support should work with compressed nulls. r.null now has a -z flag to re-create the null file using the current compression setting (i.e. compress/decompress).
This ticket seems complete to me as far as the code is concerned. Is that right? Were there some other tests besides comment:40?
comment:42 by , 9 years ago
Milestone: | 7.1.0 → 7.0.3 |
---|
comment:43 by , 9 years ago
Replying to glynn:
Replying to glynn:
I added some support for the improved r.null/r.compress functionality to lib/raster
With r65489, r65490, and r65491, r.null and r.support should work with compressed nulls. r.null now has a -z flag to re-create the null file using the current compression setting (i.e. compress/decompress).
For the record, here are all changesets related to this ticket which I could identify (for a future backport, needed also for a #2750 backport):
r65272, r65273, r65322, r65323, r65489, r65490, r65491, r65775
comment:44 by , 9 years ago
Milestone: | 7.0.3 → 7.0.4 |
---|
follow-up: 47 comment:46 by , 9 years ago
AFAIU the backport is not planned, right? Can we close this ticket?
comment:47 by , 9 years ago
Milestone: | 7.0.4 → 7.2.0 |
---|---|
Resolution: | → fixed |
Status: | new → closed |
Replying to martinl:
AFAIU the backport is not planned, right? Can we close this ticket?
Closing. Thanks for the great contributions and to all testers.
comment:49 by , 9 years ago
Milestone: | → 7.1.0 |
---|
comment:51 by , 8 years ago
As per
- https://lists.osgeo.org/pipermail/grass-dev/2016-September/082092.html
- https://lists.osgeo.org/pipermail/grass-dev/2016-September/082101.html
... the default ZLIB compression level was invalid, causing ZLIB to not compress at all. Fixed in r69387 + r69388
comment:52 by , 8 years ago
For the record - concerning NULL file compression:
At present, cell_misc/null is uncompressed by default.
Using the environment variable GRASS_COMPRESS_NULLS=1, NULL compression is activated for the session for newly created raster maps; and
export GRASS_COMPRESS_NULLS=1
r.null -z raster_map
generates a new compressed file cell_misc/nullcmpr and removes the old uncompressed cell_misc/null file for existing maps.
Note: At least GRASS GIS 7.2 is needed to read a raster map with compressed null file.
The default ZLIB compression level (usually 6) is hardcoded in lib/gis/flate.c, line 330.
However, the compression currently in use is still RLE.