Opened 16 years ago
Last modified 6 years ago
#335 new task
export floats and doubles with correct precision
Reported by: | hamish | Owned by: | |
---|---|---|---|
Priority: | critical | Milestone: | 7.2.4 |
Component: | Default | Version: | svn-trunk |
Keywords: | precision | Cc: | |
CPU: | All | Platform: | All |
Description
see http://thread.gmane.org/gmane.comp.gis.grass.devel/29622/
this started with changing %lf to %.15g in lib/db/dbmi_base/valuefmt.c.
High priority: r.what, r.univar's sum_str, r.info's fancy report min=/max=, and r.colors. Also 'g.region -g' -> lib/gis/wind_format.c:
```c
static int format_double(double value, char *buf)
{
    sprintf(buf, "%.8f", value);
    G_trim_decimal(buf);
    return 0;
}
```
I notice that raster/r.mapcalc/expression.c uses %.8g for a double.
for lat/lon, %.8f is approximately 1 mm (1852*60*1e-8 m, i.e. better than RTK GPS); for meter- or feet-based units it's very, very small.
I guess we may as well do this properly, i.e. split off FCELL values to something less precise (%.6g, %f, ?). Maybe new G_format_double() and G_format_float() functions for that?
```
$ svngrep -r '%\.[1-9][0-9][fg]' * | cut -f1 -d: | uniq
display/d.rast.edit/edit.c
display/d.zoom/set.c
general/g.setproj/main.c
general/g.setproj/get_num.c
general/g.setproj/get_deg.c
lib/proj/convert.c
lib/proj/get_proj.c
lib/vector/rtree/gammavol.c
lib/vector/Vlib/intersect.c
lib/gis/quant_io.c
lib/gis/gislib.dox
lib/gis/proj3.c
lib/gis/color_write.c
lib/gis/cats.c
lib/g3d/g3dkeys.c
lib/g3d/writeascii.c
lib/g3d/filecompare.c
lib/g3d/g3dcats.c
lib/db/dbmi_base/datetime.c
lib/db/dbmi_base/valuefmt.c
ps/ps.map/ps_fclrtbl.c
raster/r.colors/rules.c
raster/r.stats/raw_stats.c
raster/r.univar2/stats.c
raster/r.reclass/main.c
raster/r.recode/read_rules.c
raster/r.info/main.c
raster/r.quant/read_rules.c
raster/r.distance/report.c
raster/r.cats/main.c
raster/r.external/main.c
raster/r.what/main.c
raster/r.statistics/o_average.c
raster/r.average/main.c
scripts/r.in.srtm/r.in.srtm
vector/v.to.points/main.c
vector/v.segment/main.c
vector/v.label.sa/main.c
vector/v.what.rast/main.c
vector/v.to.db/update.c
vector/v.to.db/report.c
vector/v.kernel/main.c
vector/v.in.ascii/points.c
```
the list is long and manual review is needed for each one.
Hamish
Change History (46)
comment:1 by , 16 years ago
comment:2 by , 16 years ago
Priority: minor → critical
comment:3 by , 16 years ago
v.univar too,
```sh
g.copy vect=archsites,tmp_arch
v.db.addcol tmp_arch column='x double, y double'
v.to.db tmp_arch option=coor columns=x,y
eval `v.univar -g tmp_arch column=x type=point | grep -w mean`
X_MEAN=$mean
eval `v.univar -g tmp_arch column=y type=point | grep -w mean`
Y_MEAN=$mean
echo "Center of point cloud: $X_MEAN, $Y_MEAN"
```

Center of point cloud: 598326, 4.92146e+06
would be nice for the y coordinate to stay intact (i.e. preserve 3 digits to the right of the decimal point, for mm precision)
Hamish
follow-up: 5 comment:4 by , 15 years ago
'g.region -g' for lat/lon fixed for G_format_*() in r37725,6,7.
comment:5 by , 15 years ago
comment:6 by , 15 years ago
Hi,
is there any reason not to change the following to %.15g?
raster/r.colors/rules.c:

```c
47:    sprintf(minstr, "%.25f", (double)min);
48:    sprintf(maxstr, "%.25f", (double)max);
```
AFAIK %.25f for a double (DCELL) just reports noise. It causes out-of-range white areas on a raster map, because the 0% and 100% color rules are no longer == the actual floating point value (there's a 50% chance each one falls on the outside of the range).
also, in lib/gis/color_write.c:
```c
#define PRECISION 30
#define THRESHOLD .0000000000000000000000000000005  /* .5 * 10^(-30) */
...

static int format_min(char *str, double dval)
{
    double dtmp;

    sprintf(str, "%.*f", PRECISION, dval);
    G_trim_decimal(str);
    sscanf(str, "%lf", &dtmp);
    if (dtmp != dval) {  /* if no zeros after decimal point were trimmed */
        /* because precision is probably higher than PRECISION */
        sprintf(str, "%.*f", PRECISION, dval - THRESHOLD);
    }
    return 0;
}
```
(format_max() is similar but adds the spillover threshold; apparently this doesn't have the desired effect, as 5e-31 is swept away in the noise anyway?)
I blindly guess the number originally came from an attempt to be a wee bit bigger than 1e-32 (it dates back to CVS rev 1.1, so there's no help from the commit history).
i.e. can we always assume DCELL|double to be the same bit depth, and %.15g the most we'll ever be able to squeeze from it?
Hamish
comment:7 by , 15 years ago
oblig "What Every Computer Scientist Should Know About Floating-Point Arithmetic", by David Goldberg: http://docs.sun.com/source/806-3568/ncg_goldberg.html
comment:8 by , 15 years ago
on grass-dev Glynn wrote:
We can assume that it always refers to IEEE 754 double precision format. Systems where this isn't the case are extremely uncommon, and will almost certainly fail to support GRASS without a great deal of additional effort.
And regardless of the precision of "double", I seriously doubt that anyone has actual data which is accurate to 15 significant digits (equivalent to specifying the circumference of the earth to within 40 nanometres, or the height of Everest to within 9 picometres).
r.colors fixed in trunk with r38750, r38752. (see also #720)
Hamish
comment:9 by , 15 years ago
r.info's data range (both fancy and shell style) updated in devbr6 and trunk to use %.7g for floats and %.15g for doubles.
Hamish
comment:10 by , 15 years ago
r.what's cell data output updated in devbr6 and trunk to use %.7g for floats and %.15g for doubles.
comment:11 by , 15 years ago
CPU: Unspecified → All
Milestone: 6.4.0 → 6.4.1
Platform: Unspecified → All
Version: unspecified → svn-develbranch6
good progress made on the critical cases, deferring the rest 'til 6.4.1.
comment:13 by , 15 years ago
Milestone: → 6.4.1
comment:14 by , 14 years ago
with respect to raster/r.stats/raw_stats.c (and so r.out.xyz), the _is_fp() array-filling in main.c needs to be changed to fill the array with the map type, instead of just categorical-or-FP. The array is already of type int, so this shouldn't be too hard. Then raw_stats.c can output %.7g or %.15g depending on FCELL or DCELL (see r.info), instead of the current %.10f for all.
Hamish
comment:15 by , 14 years ago
r.what.color in the dev branches changed to %.15g for FP maps, but that should be broken down by FCELL & DCELL, and FCELL should use %.7g.
Hamish
comment:16 by , 14 years ago
r.reclass done in trunk and devbr6 with r45696,7. Should be backported to relbr64 after some period of testing.
Hamish
comment:17 by , 13 years ago
something to keep in mind as we work through these: GRASS's processing modules can be very(!) useful for processing imagery from scanning electron microscopes, where a typical pixel resolution might be 1 micron. aka the map units aren't necessarily in degrees, meters, or some flavor of feet.
at the other end of things, for star catalog az,el,distance point data projected from a lat/lon import, the map units might be parsecs or light-years. (I suspect r.mapcalc et al. would also be highly useful for processing large telescope images)
Hamish
comment:18 by , 12 years ago
follow-up: 21 comment:19 by , 12 years ago
Please backport to 6.4 since not much testing happens in 6.5 (but in 6.4.svn). (I see that many backports are missing!).
comment:20 by , 12 years ago
Milestone: 6.4.1 → 6.4.3
follow-up: 23 comment:21 by , 12 years ago
Replying to neteler:
> Please backport to 6.4 since not much testing happens in 6.5 (but in 6.4.svn). (I see that many backports are missing!).
Glynn: Binary to text conversions must use %.9g (float) and %.17g (double) in order to preserve fp values in a binary-decimal-binary round-trip, e.g. r.out.ascii followed by r.in.ascii.
The modules [r|v].out.ascii need to be added to the list.
Markus M
follow-up: 25 comment:23 by , 12 years ago
Replying to mmetz:
> Glynn: Binary to text conversions must use %.9g (float) and %.17g (double) in order to preserve fp values in a binary-decimal-binary round-trip, e.g. r.out.ascii followed by r.in.ascii.
take care that preserving the b-d-b round trip, on a single platform, is not the only task or consideration. For the r.out.ascii + r.in.ascii round trip it may well be appropriate, but while conceding that point I'd argue that r.*.bin or r.*.mat would be a better choice in those cases.
%.15,7g was chosen because it's perfectly reproducible and doesn't introduce useless .0000000000000001 crap into the data files, which G_trim_decimal() can't handle. For things like r.univar, r.info, and v.db.select the output is at least in part for human consumption; there's no practical need to expose the FP noise. The main thing for us to focus on, I think, is all the remaining lossy %f and meaningless %.56f type stuff in the code, not splitting hairs over preserving quanta finer than GRASS_EPSILON.
wrt lib/gis/color_write.c, look closely & you'll see there is a +/- GRASS_EPSILON adjustment on the range to ensure that the rounding exceeds the range, and so you shouldn't ever get white spots at the peaks and pits.
best, Hamish
(this is something I hope python cleans up with 3.0)
comment:24 by , 12 years ago
I wrote:
> look closely & you'll see there is a +/- GRASS_EPSILON adjustment on the range
oops, I used it wrong, you already fixed it.
(2^1023.9999999999999 - 2^1023.9999999999998) > GRASS_EPSILON
Hamish
comment:25 by , 12 years ago
Replying to hamish:
> Replying to mmetz:
>> Glynn: Binary to text conversions must use %.9g (float) and %.17g (double) in order to preserve fp values in a binary-decimal-binary round-trip, e.g. r.out.ascii followed by r.in.ascii.
> take care that preserving the b-d-b round trip, on a single platform, is not the only task or consideration. For the r.out.ascii + r.in.ascii round trip it may well be appropriate, but while conceding that point I'd argue that r.*.bin or r.*.mat would be a better choice in those cases.
Please add this information to the manuals if you think this is crucial.
> %.15,7g was chosen because it's perfectly reproducible
No, it is not. %.17,9g is reproducible. See IEEE 754 standard.
> and doesn't introduce useless .0000000000000001 crap
On what platform do you get this crap? %.17,9g does not produce this crap.
> into the data files which G_trim_decimal() can't handle.
G_trim_decimal() is not needed for %.17,9g.
> The main thing for us to focus on I think is all the remaining lossy %f and meaningless %.56f type stuff in the code
Sure, therefore %.17,9g.
> not splitting hairs over preserving quanta finer than GRASS_EPSILON.
Huh? This thread is not about GRASS_EPSILON. GRASS_EPSILON is 1.0e-15. A perfectly valid range of double fp data is e.g. 1.0e-200 to 1.0e-300, way smaller than GRASS_EPSILON. You do not want to corrupt these data.
BTW, I found that fp calculation errors are rather in the range of 1.0e-4 to 1.0e-8.
Markus M
follow-up: 27 comment:26 by , 12 years ago
>> %.15,7g was chosen because it's perfectly reproducible
> No, it is not. %.17,9g is reproducible. See IEEE 754 standard.
I didn't mean reproducible as far as the binary stored value round trip is concerned, I meant reproducible as far as getting the same ascii result on two different hardware platforms.
for example, getting r.md5sum (an addon to aid the test suite) to give the same (portable) result when checksumming r.out.ascii output where the data is stored as floating point and/or not strictly real (2.5000000). My understanding, for what it's worth, is that the least significant digits can flicker depending on the CPU arch, perhaps the compiler, and the programming language's implementation too.
I can't remember which right now, but one of the ogr/gdal/proj4 input streams was introducing .00000000001 style artifacts; it's probably worth digging that up as a real-world test case, as there are probably two classes of this problem with the same symptom: one to do with GRASS's use of %f and atof() in places, and another to do with echoing input values which were malformed before they came into GRASS.
Another interesting corner-case example to try with this is the earthquake demo from the grass-promo/tutorials/ svn, where the heavily logarithmic data scale within a single dataset really stretches what fits well in the IEEE FP model, and so something more linear like magnitude is used to store it instead of raw exajoules.
> A perfectly valid range of double fp data is e.g. 1.0e-200 to 1.0e-300, way smaller than GRASS_EPSILON. You do not want to corrupt these data.
(see the following post with the 2^1024 example, where I corrected myself about that after seeing how you fixed the error in my earlier attempt to make sure the color min/max rule was bigger than the data range)
regards, Hamish
follow-up: 28 comment:27 by , 12 years ago
Replying to hamish:
>> No, it is not. %.17,9g is reproducible. See IEEE 754 standard.
> I didn't mean reproducible as far as the binary stored value round-trip was concerned, I meant reproducible as far as getting the same ascii result on two different hardware platforms.
All common platforms use IEEE-754 representation, so I wouldn't expect differences due to hardware.
> My understanding, for what it is, is that the least sig. digits can flicker depending on the CPU arch, perhaps the compiler, and programming language's implementation too.
Any differences are due to software, not hardware. The main reasons for differences are:
- Whether the software produces the closest decimal value to the actual binary value, or the shortest decimal value which would convert to the actual binary value (those two aren't necessarily the same).
- The rounding mode used in the event of a tie. E.g. 3.0/16=0.1875; if that is displayed with 3 fractional digits, should the result be 0.187 (round toward zero, round toward negative infinity) or 0.188 (round toward positive infinity, round toward nearest even value)?
- Bugs. Many programmers don't understand the details of floating-point, and don't particularly care whether the least significant digits are "correct" (particularly as that would require deciding which flavour of "correct" you actually want).
follow-up: 29 comment:28 by , 12 years ago
Milestone: 6.4.4 → 6.4.3
Replying to glynn:
> Replying to hamish:
>> Replying to mmetz:
>>> No, it is not. %.17,9g is reproducible. See IEEE 754 standard.
>> I didn't mean reproducible as far as the binary stored value round-trip was concerned, I meant reproducible as far as getting the same ascii result on two different hardware platforms.
> All common platforms use IEEE-754 representation, so I wouldn't expect differences due to hardware.
Should r55122 be reverted then?
comment:29 by , 12 years ago
Replying to neteler:
> Replying to glynn:
>> Replying to hamish:
>>> Replying to mmetz:
>>>> No, it is not. %.17,9g is reproducible. See IEEE 754 standard.
>>> I didn't mean reproducible as far as the binary stored value round-trip was concerned, I meant reproducible as far as getting the same ascii result on two different hardware platforms.
>> All common platforms use IEEE-754 representation, so I wouldn't expect differences due to hardware.
> Should r55122 be reverted then?
Reverted in r55183.
Markus M
follow-up: 32 comment:31 by , 11 years ago
Replying to hamish:
> to be continued ..
Hamish, what exactly needs to be continued here ? Is it fixing those modules on your list that haven't been fixed, yet ? Or is it the discussion on how to handle decimals ?
Moritz
comment:32 by , 10 years ago
Milestone: 6.4.4 → 6.4.5
Replying to mlennert:
> Hamish, what exactly needs to be continued here ?
the systematic audit and repair of modules printf'ing double and single precision FP variables with an inappropriate %.*g or %.*f: replacing arbitrary fixed precision with a precision appropriate to the data type.
> Is it fixing those modules on your list that haven't been fixed, yet ?
yes, exactly. Spin boxes and tables (e.g. in the rectifier) throughout the wxGUI are related, but that probably should have its own ticket.
> Or is it the discussion on how to handle decimals ?
I doubt that has an end, but at least we can all agree that %.52f is meaningless and %f is too lossy.
The question of whether to output values with slight rounding, user-selectable precision, or try for exact ascii representation of IEEE double FP is, I think, a case by case question.
e.g. %.15g will avoid outputting coordinates like 278.700000000000045 44.8999999999999986, which are human-ugly and bloat the ascii filesize a lot, even though they are reversible to the same IEEE double FP values.
fwiw I think we can achieve good defaults so user-selectable precision is rarely needed (success is if no one ever thinks to ask for it, since the right thing was done automatically), and there are times, like the colr/ file, when a %.15g max and min range rounded outwards by 1 epsilon step can tidy up the out-of-range trouble which happens from time to time.
the bug is of high priority as data input/output needs to be flawless, or of such minor jitter that in most cases a nanometer here or there won't harm the overall result.
regards, Hamish
comment:34 by , 9 years ago
Milestone: → 6.4.6
comment:35 by , 9 years ago
This is probably relevant for g7 as well, no ?
So probably the Version and Milestone settings should be pushed up to trunk in order to raise the visibility of this issue.
comment:36 by , 9 years ago
Milestone: 6.4.6 → 7.0.3
Version: svn-develbranch6 → svn-trunk
bumped to trunk
comment:39 by , 9 years ago
Milestone: 7.0.4 → 7.0.5
comment:40 by , 8 years ago
Milestone: 7.0.5 → 7.0.6
comment:41 by , 8 years ago
Milestone: 7.0.6 → 7.2.1
comment:42 by , 8 years ago
Milestone: 7.2.1 → 7.2.2
comment:45 by , 7 years ago
Milestone: → 7.2.4
IMHO this needs to be fixed before 6.4.0 is released (hopefully end of the year with release candidates starting as soon as vector undo is implemented in the new digitizer).
Markus