Opened 13 years ago
Closed 13 years ago
#1005 closed defect (fixed)
[raster] Problem with ST_SummaryStats(text, text) results
Reported by: | pracine | Owned by: | Bborie Park |
---|---|---|---|
Priority: | medium | Milestone: | PostGIS 2.0.0 |
Component: | raster | Version: | master |
Keywords: | Cc: |
Description
I can observe some problems with the numbers returned by the coverage variant of ST_SummaryStats
1) The sum is the sum of the last tile only. Not of the whole coverage.
2) For a raster coverage of two rasters created like this:
--First test raster CREATE OR REPLACE FUNCTION ST_TestRaster1() RETURNS raster AS $$ DECLARE newrast raster := ST_AddBand(ST_MakeEmptyRaster(2, 2, 0, 0, 1, 1, 0, 0, -1), '32BUI'::text, 0, -1); BEGIN newrast := ST_SetValue(newrast, 1, 1, 1); newrast := ST_SetValue(newrast, 1, 2, 3); newrast := ST_SetValue(newrast, 2, 1, 3); newrast := ST_SetValue(newrast, 2, 2, 1); RETURN newrast; END; $$ LANGUAGE 'plpgsql'; --Second test raster CREATE OR REPLACE FUNCTION ST_TestRaster2() RETURNS raster AS $$ DECLARE newrast raster := ST_AddBand(ST_MakeEmptyRaster(2, 2, 2, 2, 1, 1, 0, 0, -1), '32BUI'::text, 0, -1); BEGIN newrast := ST_SetValue(newrast, 1, 1, 1); newrast := ST_SetValue(newrast, 1, 2, 2); newrast := ST_SetValue(newrast, 2, 1, 3); newrast := ST_SetValue(newrast, 2, 2, 4); RETURN newrast; END; $$ LANGUAGE 'plpgsql'; --Create table DROP TABLE sumstattest; CREATE TABLE sumstattest AS SELECT ST_TestRaster2() rast UNION ALL SELECT ST_TestRaster1() rast;
SELECT * FROM st_summarystats('sumstattest', 'rast');
should return 1.089724736 not 1.06066017770915.
Result verified with the stdev.p function in Excel.
Change History (27)
comment:1 by , 13 years ago
follow-up: 4 comment:2 by , 13 years ago
Thanks for catching the error with the sum as I forgot to collect the sum for coverages. My next revision will fix that.
As for the standard deviation of a coverage, I do confirm the issue. To do this correctly will require going over the coverage table twice, once for the coverage's mean and the second for the mean's standard deviation. So, do we want to be accurate at the expense of speed or be close at the expense of accuracy? Maybe we need to add an option flagging whether or not to get the accurate standard deviation for a coverage?
comment:3 by , 13 years ago
Fixed the ST_SummaryStats for coverage tables bug regarding the value for sum in r7340.
comment:4 by , 13 years ago
Replying to dustymugs:
As for the standard deviation of a coverage, I do confirm the issue. To do this correctly will require going over the coverage table twice, once for the coverage's mean and the second for the mean's standard deviation. So, do we want to be accurate at the expense of speed or be close at the expense of accuracy? Maybe we need to add an option flagging whether or not to get the accurate standard deviation for a coverage?
1) We must be clear about that in the doc and we should provide (a link to?) the method used to get the approximative value
2) I agree with the option. Default should be the approximate value.
I thought the method you found (one pass stddev) provided accurate result. I thought we added the sum to be able to get an accurate stddev. How does it work for one tile? Is the stddev accurate for one tile?
comment:5 by , 13 years ago
The one pass standard deviation is accurate for any single tile. I'm going to do some refactoring today to see if I can adapt the one pass standard deviation to process a coverage table though. If I can get it working, we won't need flagging or anything of the sort.
The sum was added to get the true mean instead of a weighted mean.
comment:6 by , 13 years ago
Owner: | changed from | to
---|---|
Status: | new → assigned |
comment:7 by , 13 years ago
Interesting that:
SELECT (ss).* FROM (SELECT st_summarystats(rast) ss FROM srtm_22_03_tiled_100x100 ) foo
takes 32 seconds on my machine and:
SELECT * FROM st_summarystats('srtm_22_03_tiled_100x100', 'rast');
takes only 6 seconds. How is this possible? We are counting the same number of pixels here.
follow-up: 11 comment:8 by , 13 years ago
Pierre,
Are you running on 8.4. I think things might have changed in 9.0. check out my article - at the end see my compare between 8.4/9.0 -- but I never got a chance to further test to confirm.
http://www.postgresonline.com/journal/archives/201-returns-table.html
comment:9 by , 13 years ago
I suspect in your first one, the planner has decided not to materialize the subselect.
Interesting to see what would happen if you did something like:
SELECT (ss).* FROM (SELECT st_summarystats(rast) ss FROM srtm_22_03_tiled_100x100 OFFSET 0) foo
comment:11 by , 13 years ago
Replying to robe:
Pierre,
Are you running on 8.4. I think things might have changed in 9.0. check out my article - at the end see my compare between 8.4/9.0 -- but I never got a chance to further test to confirm.
http://www.postgresonline.com/journal/archives/201-returns-table.html
Ok,
SELECT (ST_SummaryStats(rast)).* FROM srtm_22_03_tiled_100x100
still takes 32 seconds and strangely:
SELECT (st_summarystats('srtm_22_03_tiled_100x100', 'rast')).*
now takes also 32 seconds. So the fastest/shorter query is still
SELECT * FROM st_summarystats('srtm_22_03_tiled_100x100', 'rast');
That's on my old 8.4
comment:12 by , 13 years ago
how big is your dataset. of course as you mentioned before its calling the function for each each column in the bad case.
comment:13 by , 13 years ago
It's one 6000x6000 SRTM elevation file tiled into 100x100 tiles (3600 tiles). It's my favorite test dataset.
comment:14 by , 13 years ago
Fixed the standard deviation component of ST_SummaryStats for coverage tables bug in r7342.
comment:16 by , 13 years ago
Yes. It uses the same one-pass standard deviation calculation method that is used for each individual tile. Your example above comes to the correct value.
comment:17 by , 13 years ago
Pierre,
Okay I'm seeing similar behavior on PostgreSQL 9.1 for ST_SummaryStats and ST_Histogram. So I think it has to do with if you materialize vs. not.
So here is a single tile raster I have with 1,517,411 pixels.
-- This takes 1,147ms -- SELECT (f).* FROM (SELECT (ST_SummaryStats(rast,1)) As f from philly_map) As foo;
--but this takes 214 ms
SELECT (f).* FROM (SELECT (ST_SummaryStats(rast,1)) As f from philly_map OFFSET 0) As foo;
NOTE in the second -- I used the OFFSET 0 hack to force materialization of the subselect as described here -- http://www.postgresonline.com/journal/archives/113-How-to-force-PostgreSQL-to-use-a-pre-calculated-value.html
-- The histogram case is tempermental if it materializes -- or not since it is dependent on number of rows
-- Takes 294 ms returns 22 rows (with the offset takes 306ms -- so about the same SELECT (f).* FROM (SELECT ST_Histogram(rast,1) As f from philly_map) As foo;
So I guess short answer -- if we are going to be doing wrapper stuff, we better put in an OFFSET 0 in there just to be safe. Haven't tested on 8.4, but as I recall way back, its the same story.
comment:18 by , 13 years ago
So I understand that Bborie should try inserting OFFSET 0 in the wrappers before getting rid of them?
comment:19 by , 13 years ago
I'd rather get rid of the ST_SummaryStats wrappers:
ST_Count ST_Sum ST_Mean ST_Stddev ST_MinMax
I don't think adding OFFSET 0 to any of the functions will make any difference as the OFFSET 0 applies to the query calling ST_SummaryStats rather than within ST_SummaryStats.
comment:20 by , 13 years ago
I kind of like ST_Count since its a basic measure of how big your tile is. The others I think if people get to that level of granularity they should just run summary stats.
So I vote to keep ST_Count but get rid of the other wrappers. Besides ST_Count has a short-circuit.
comment:21 by , 13 years ago
Sounds good to me. Pierre, thoughts? If you say yes, I'll eliminate the following functions and keep ST_Count:
ST_Sum ST_Mean ST_Stddev ST_MinMax
comment:22 by , 13 years ago
I would remove ST_Count as well. In the case mentioned by robe, it is very easy to do ST_Height() * ST_Width() instead. The only remaining interest for ST_Count() is that is tells you of many pixel have a true value and in this case we are already at the lower level of granularity.
comment:23 by , 13 years ago
I disagree -- I find myself using count a lot (more so than summarystats) and hate to part with it. I can see myself typing ST_Height()*ST_Width() a lot which is too verbose and if I have tiled rasters even more annoying as I'd have to do sum(st_height*st_width) etc or resort to ST_SummaryStats. Most of the time I just care about the count of nodata vs. with nodata. The fact it returns a simple number is sweeter than ST_SummaryStats and easier to register in the mind.
comment:25 by , 13 years ago
I've removed the following wrappers in r7375.
ST_Sum
ST_Mean
ST_Stddev
ST_MinMax
comment:26 by , 13 years ago
Regina and/or Pierre, if you both feel that this ticket has been resolved, please close this ticket. Thanks!
comment:27 by , 13 years ago
Resolution: | → fixed |
---|---|
Status: | assigned → closed |
Result from:
are correct.