Opened 13 years ago
Closed 12 years ago
#1828 closed defect (fixed)
Estimate returned by geography_gist_selectivity results in slow query plan for ST_DWithin
Reported by: | realityexists | Owned by: | pramsey |
---|---|---|---|
Priority: | medium | Milestone: | PostGIS 2.0.2 |
Component: | postgis | Version: | 2.0.x |
Keywords: | Cc: |
Description
I've run into a query performance problem that appears to be due to the && operator selectivity estimate (the geography_gist_selectivity function).
I have a large, complex query which joins the results of several set-returning functions with some tables and filters them by calling another function, which involves 3 ST_DWithin(geog) calls. This used to run in about 10 seconds until I changed the functions to allow them to be inlined. (They previously had "SET search_path FROM current", which prevented inlining.) Now the query doesn't return in 10 minutes! If I again prevent the filtering function from being inlined (e.g. by adding STRICT or SECURITY DEFINER or SET) the time goes down to 20 seconds. If I do the same to one of the set-returning functions it goes down to 15 seconds. It seems to change the query plan at the top level: without inlining it picks a Hash Join or Merge Join (fast), but with inlining it picks a Nested Loop (slow).
I asked about this on pgsql-general http://archives.postgresql.org/pgsql-general/2012-05/msg00370.php and Tom Lane suggested that it might be due to a PostGIS selectivity function returning a poor estimate. When the function is not inlined it uses the default estimate, which happens to be good in this case. If I hack postgis.sql to remove the RESTRICT and JOIN clauses, i.e.
CREATE OPERATOR && (
LEFTARG = geography, RIGHTARG = geography, PROCEDURE = geography_overlaps, COMMUTATOR = '&&'
);
then the query runs in 5 seconds, which seems to support this theory.
I can reproduce the problem with the attached simplified test case, running on PostgreSQL 9.1.3 with PostGIS 2.0.0. Run initialise.sql, then query.sql. Note that the query is initially fast (140 ms), but after running ANALYZE the query plan changes from Hash Join to Nested Loop and it takes 15000 ms. If you then delete the table statistics again it goes back to the fast plan. If I mark test_points_are_near as STRICT it uses the fast plan. If I remove one of the ST_DWithin calls it uses the fast plan (see plans.txt).
Attachments (9)
Change History (69)
by , 13 years ago
Attachment: | initialise.sql added |
---|
by , 13 years ago
comment:1 by , 13 years ago
comment:3 by , 13 years ago
Removing the RESTRICT and JOIN causes the planner to fall back to the default selectivity estimate, which is basically "yeah, it's really selective!" in all cases. That turns out to be good for your particular query, less good for others. The problem is that somewhere, somehow our estimator is underestimating the selectivity of your particular query parameters, either because the estimate calculations are wrong, or perhaps because the stats gathered are wrong, or perhaps because of something special/interesting about your geometries themselves.
by , 13 years ago
comment:4 by , 13 years ago
Thanks, what you wrote confirms my understanding of it. EXPLAIN ANALYZE output attached.
comment:5 by , 13 years ago
So, yeah, the planner expects one row to pop out of your spatial filter and instead 6000 do, which throws off the rest of the plan. Can you strip down the query to just the spatial table and filter and see if EXPLAIN ANALYZE still gets things that wrong? Because if you can get it down to one table and one piece of simple SQL we'll have something to work with, repairwise.
comment:6 by , 13 years ago
No, I tried that, but it's fast then. This is the simplest repro I've been able to come up with (and even that took about 5 hours!)
E.g. this query
select id from _test_pos where __test_points_are_near('POINT(7 7)', pos);
takes 80 ms and the plan is
Seq Scan on _test_pos  (cost=0.00..10728.00 rows=1 width=4) (actual time=13.492..56.078 rows=5118 loops=1)
  Filter: ((('0101000020E61000000000000000001C400000000000001C40'::geography && _st_expand(pos, 300000::double precision)) AND (pos && '0101000020E61000000000000000001C400000000000001C40'::geography) AND _st_dwithin('0101000020E61000000000000000001C400000000000001C40'::geography, pos, 300000::double precision, true)) OR (('0101000020E61000000000000000001C400000000000001C40'::geography && _st_expand(pos, 400000::double precision)) AND (pos && '0101000020E61000000000000000001C400000000000001C40'::geography) AND _st_dwithin('0101000020E61000000000000000001C400000000000001C40'::geography, pos, 400000::double precision, true)))
Total runtime: 61.388 ms
comment:7 by , 13 years ago
Yeah, this is good though, the point is not the speed, it's the bad estimate. The spatial stuff comes up fast enough, but because the estimate is bad, it lands in the middle of a bad plan which is itself very slow. Your explain shows the behavior we need to diagnose: estimated number of rows is 1, actual rows is 5118. Can you provide the table you are running this SQL against? (And further, cut down the query so it doesn't use a function anymore but still displays the bad estimate?)
comment:8 by , 13 years ago
Sure, in that case the query can be as simple as:
SELECT id FROM _test_pos WHERE ST_DWithin(pos, 'POINT(7 7)', 300000)
The table:
CREATE TABLE _test_pos
(
    id serial,
    pos geography(Point,4326)
);

WITH data AS
(
    SELECT generate_series(1, 20000)::float / 1000 AS x
)
INSERT INTO _test_pos(pos)
SELECT ST_MakePoint(x, x)::geography FROM data;
comment:9 by , 13 years ago
Confirmed, that's a lovely little test that shows a pretty huge selectivity misestimate. No matter what the search radius, small or large, the selectivity estimate of rows returned is 1.
comment:10 by , 13 years ago
Is this still based upon the code I originally wrote? I've had some ideas about improving it for some time, but obviously I now don't have access to same variety of data sets that I once did. If the above is a fairly concrete test case then I'm happy to take a look.
comment:11 by , 13 years ago
Yes, it's still yours. It's a synthetic case, but it's not unreasonable.
comment:12 by , 13 years ago
Okay - if you assign it to me in trac, I'll consider that officially handing over the (olympic?) torch.
comment:13 by , 13 years ago
Owner: | changed from | to
---|
How does the geography estimator differ from the geometry one ? Does the geometry estimator work fine in this case ?
comment:14 by , 13 years ago
The geometry one seems to have the same problem:
Seq Scan on _test_pos_geom (cost=0.00..5517.00 rows=1 width=4) (actual time=3.342..14.989 rows=1415 loops=1)
comment:15 by , 12 years ago
How many points does the table have ?
mcayland: are you planning to look at this anytime soon ? Is it still the code originating in the 1.0.0 times ?
There should be a perl script to test it under utils.
comment:16 by , 12 years ago
So, it is 20000 rows. Estimate is 1 row if there's no index. The prepare.sql script creates no index. Is it intentional ? Our analyzer is not even called in that case.
comment:17 by , 12 years ago
rewind, I take it back. The estimator is called anyway. For geometry the query must be changed to :
SELECT count(id) FROM _test_pos WHERE geom && ST_Buffer('SRID=4326;POINT(7 7)', 1);
Here's some debugging from the selectivity estimator:
NOTICE: [geometry_estimate.c:geometry_gist_sel_2d:653] geometry_gist_sel called
NOTICE: [geometry_estimate.c:estimate_selectivity:446] histogram has 126 cols, 127 rows
NOTICE: [geometry_estimate.c:estimate_selectivity:447] histogram geosize is 19.999001x19.999001
NOTICE: [geometry_estimate.c:estimate_selectivity:591] search_box overlaps 182.000000 cells
NOTICE: [geometry_estimate.c:estimate_selectivity:592] avg feat overlaps 1.000300 cells
NOTICE: [geometry_estimate.c:estimate_selectivity:604] SUM(ov_histo_cells)=0.096025
NOTICE: [geometry_estimate.c:estimate_selectivity:605] gain=0.999700
NOTICE: [geometry_estimate.c:estimate_selectivity:606] selectivity=0.095996
NOTICE: [geometry_estimate.c:geometry_gist_sel_2d:754] returning computed value: 0.095996
comment:18 by , 12 years ago
The estimate for the geometry version is actually pretty good. Estimated: 1920, actual: 2001.
comment:19 by , 12 years ago
And now for geography:
strk=# SELECT count(id) FROM _test_pos WHERE ST_DWithin(pos, 'POINT(7 7)', 300000);
NOTICE: [geography_estimate.c:geography_gist_selectivity:464] geography_gist_selectivity called
NOTICE: [geography_estimate.c:estimate_selectivity:163] histogram has 12 unitsx, 34 unitsy, 36 unitsz
NOTICE: [geography_estimate.c:estimate_selectivity:164] histogram geosize is 0.116978x0.321376x0.342003
NOTICE: [geography_estimate.c:estimate_selectivity:401] search_box overlaps 1.000000 cells
NOTICE: [geography_estimate.c:estimate_selectivity:402] avg feat overlaps 1.000000 cells
NOTICE: [geography_estimate.c:estimate_selectivity:414] SUM(ov_histo_cells)=0.000000
NOTICE: [geography_estimate.c:estimate_selectivity:415] gain=1.000000
NOTICE: [geography_estimate.c:estimate_selectivity:416] selectivity=0.000000
NOTICE: [geography_estimate.c:geography_gist_selectivity:572] returning computed value: 0.000000
NOTICE: [geography_estimate.c:geography_gist_selectivity:464] geography_gist_selectivity called
NOTICE: [geography_estimate.c:geography_gist_selectivity:507] no variable argument ? - returning default selectivity
 count
-------
  3839
(1 row)
The "search_box overlaps 1 cells" is suspicious, isn't it ?
comment:20 by , 12 years ago
Just a final note, then I'll leave this to Mark: we want to make automated tests for estimation !!
comment:21 by , 12 years ago
Milestone: | PostGIS 2.0.1 → PostGIS 2.0.2 |
---|
Mark, I'm moving this to 2.0.2. Move back if you will get to it in the next day and it's an easy fix.
comment:22 by , 12 years ago
It would be great to get this fixed soon, because we currently have to work around it by removing the RESTRICT and JOIN clauses for &&, which not only requires a manual hack, but may negatively impact performance in other cases.
comment:23 by , 12 years ago
One of the delays on this is that my dev environment is set up for 9.0 and so WITH... INSERT doesn't work. I've used the following as a substitute:
CREATE OR REPLACE FUNCTION create1828() RETURNS text AS $$
DECLARE
cnt int; x float;
BEGIN
FOR x IN SELECT generate_series(1, 20000)::float / 1000 LOOP
    -- RAISE NOTICE 'x is %', x;
    INSERT INTO _test_pos(pos) SELECT ST_MakePoint(x, x)::geography;
END LOOP;
RETURN 'done';
END;
$$ LANGUAGE 'plpgsql';
Now that I have the actual sample data, I'll have a poke around and see if I can figure out what is happening.
comment:24 by , 12 years ago
Okay I've finally figured out what's going on here. The key to solving this was to look at the debug output for the geography plan:
postgis20=# explain analyze SELECT id FROM _test_pos WHERE ST_DWithin(pos, 'POINT(7 7)', 300000);
NOTICE: [geography_estimate.c:geography_gist_selectivity:464] geography_gist_selectivity called
NOTICE: [geography_estimate.c:geography_gist_selectivity:529] requested search box is : 0.985147863137998 0.120960947799834 0.121869343405147, 0.985147863137998 0.120960947799834 0.121869343405147
NOTICE: [geography_estimate.c:geography_gist_selectivity:557] 16013 read from stats
NOTICE: [geography_estimate.c:geography_gist_selectivity:559] histo: xmin,ymin,zmin: 0.883022,0.000017,0.000017
NOTICE: [geography_estimate.c:geography_gist_selectivity:560] histo: xmax,ymax: 1.000000,0.321394,0.342020
NOTICE: [geography_estimate.c:geography_gist_selectivity:561] histo: unitsx: 12.000000
NOTICE: [geography_estimate.c:geography_gist_selectivity:562] histo: unitsy: 34.000000
NOTICE: [geography_estimate.c:geography_gist_selectivity:563] histo: unitsz: 36.000000
NOTICE: [geography_estimate.c:geography_gist_selectivity:564] histo: avgFeatureCoverage: 0.000000
NOTICE: [geography_estimate.c:geography_gist_selectivity:565] histo: avgFeatureCells: 1.000000
NOTICE: [geography_estimate.c:estimate_selectivity:163] histogram has 12 unitsx, 34 unitsy, 36 unitsz
NOTICE: [geography_estimate.c:estimate_selectivity:164] histogram geosize is 0.116978x0.321376x0.342003
NOTICE: [geography_estimate.c:estimate_selectivity:348] [10,12,12] cell val 0.026949999853969
NOTICE: [geography_estimate.c:estimate_selectivity:350] [10,12,12] AOI 0.000000000000000
NOTICE: [geography_estimate.c:estimate_selectivity:352] [10,12,12] gain 0.000000000000000
NOTICE: [geography_estimate.c:estimate_selectivity:357] [10,12,12] adding 0.000000000000000 to value
NOTICE: [geography_estimate.c:estimate_selectivity:401] search_box overlaps 1.000000 cells
NOTICE: [geography_estimate.c:estimate_selectivity:402] avg feat overlaps 1.000000 cells
NOTICE: [geography_estimate.c:estimate_selectivity:414] SUM(ov_histo_cells)=0.000000
NOTICE: [geography_estimate.c:estimate_selectivity:415] gain=1.000000
NOTICE: [geography_estimate.c:estimate_selectivity:416] selectivity=0.000000
NOTICE: [geography_estimate.c:geography_gist_selectivity:572] returning computed value: 0.000000
NOTICE: [geography_estimate.c:geography_gist_selectivity:464] geography_gist_selectivity called
NOTICE: [geography_estimate.c:geography_gist_selectivity:507] no variable argument ? - returning default selectivity
                                  QUERY PLAN
----------------------------------------------------------------------------------------------------------
 Seq Scan on _test_pos  (cost=0.00..5578.00 rows=1 width=4) (actual time=17.973..32.506 rows=3839 loops=1)
   Filter: ((pos && '0101000020E61000000000000000001C400000000000001C40'::geography) AND ('0101000020E61000000000000000001C400000000000001C40'::geography && _st_expand(pos, 300000::double precision)) AND _st_dwithin(pos, '0101000020E61000000000000000001C400000000000001C40'::geography, 300000::double precision, true))
 Total runtime: 32.734 ms
(3 rows)

postgis20=# select st_astext('0101000020E61000000000000000001C400000000000001C40'::geography);
 st_astext
------------
 POINT(7 7)
(1 row)
The problem here is that the original query point is being passed directly into the first part of the filter rather than the *expanded* version, which is why the selectivity is always tiny when the distance is large.
Poking around in geography_expand() I see this comment:
/*
** geography_expand(GSERIALIZED *g) returns *GSERIALIZED
**
** warning, this tricky little function does not expand the
** geometry at all, just re-writes bounding box value to be
** a bit bigger. only useful when passing the result along to
** an index operator (&&)
*/
So our ST_Expand(geog, distance) is returning the same original geography but with an inflated bounding box. BUT in geography_gist_selectivity() we do this:
/* Convert coordinates to 3D geodesic */
FLAGS_SET_GEODETIC(search_box.flags, 1);
if (!lwgeom_calculate_gbox_geodetic(geometry, &search_box))
{
    POSTGIS_DEBUG(3, " search box cannot be calculated");
    PG_RETURN_FLOAT8(0.0);
}
So what happens is that when we generate our search box, lwgeom_calculate_gbox_geodetic() iterates through all of the pointarrays within the geography to calculate the bounding box for search - but because geography_expand() didn't alter the geography (just its bounding box) we therefore end up back with our unexpanded input geography once again.
I think we have two options here:
1) Change geography_gist_selectivity() to directly pull out the BBOX within the geography instead of recalculating it
2) Write a proper ST_Expand(geography) function
I'm inclined to lean towards 2) because with 1) in place, the query plan will still show the original geography being passed into the && clause which is the confusion that got us here in the first place.
comment:25 by , 12 years ago
Are you saying that ST_AsText(ST_Expand('POINT(0 0)'::geography)) would return 'POINT(0 0)' ? If that's the case I'd consider it a bad bug indeed.
That said, wouldn't it be faster just to always use the cached bbox rather than recomputing it ? Or is there any special need to recompute for geography ? Or is it the old double vs. float story ?
comment:26 by , 12 years ago
And just to confirm this:
postgis20=# select _st_expand('0101000020E61000000000000000001C400000000000001C40'::geography, 8000);
                     _st_expand
----------------------------------------------------
 0101000020E61000000000000000001C400000000000001C40
(1 row)
I guess that's why there is no official ST_Expand(geography, distance) function. Paul - I think we need some input from you on this one.
comment:27 by , 12 years ago
Ok, if it's an "internal" one I've no problem with what it does (although it does sound dangerous) -- Yep, Paul's word is needed to know how to proceed (I think using the cache would be good)
comment:28 by , 12 years ago
strk: I recall complaining about this very thing to pramsey. I thought the issue was that there is no way to represent the bounding box for geography as a geography (at least that was the case in 1.5), so you really couldn't output it without doing a lengthy computation. That may have changed, though, with our new underlying format.
What cached bbox are you talking about? Certainly not the one for POINT(0 0), since that would be wrong.
by , 12 years ago
Attachment: | boxcalc.patch added |
---|
Pick the box out of the serialization rather than calculating it from scratch.
comment:31 by , 12 years ago
OK, my "fix" is attached. It fixes the problem Mark identified, which is that we recalculate the bounding box from scratch during the statistics check, but it doesn't seem to fix the actual selectivity issue yet.
comment:32 by , 12 years ago
The patch does fix things, in that the query box now has a non-zero extent... however the estimate is still way off.
postgis20=# explain analyze SELECT id FROM _test_pos WHERE ST_DWithin(pos, 'POINT(7 7)'::geography, 300000);
NOTICE: requested search box is : 0.938059508800507 0.0738726407289505 0.0747810378670692, 1.03223621845245 0.168049246072769 0.168957650661469
                                  QUERY PLAN
----------------------------------------------------------------------------------------------------------
 Seq Scan on _test_pos  (cost=0.00..5578.00 rows=1 width=4) (actual time=17.239..67.544 rows=3839 loops=1)
   Filter: ((pos && '0101000020E61000000000000000001C400000000000001C40'::geography) AND ('0101000020E61000000000000000001C400000000000001C40'::geography && _st_expand(pos, 300000::double precision)) AND _st_dwithin(pos, '0101000020E61000000000000000001C400000000000001C40'::geography, 300000::double precision, true))
by , 12 years ago
Attachment: | boxcalc3.patch added |
---|
Also pull the cached box in the stats calculation stage
comment:33 by , 12 years ago
With my patches in place, I'd say there's something else going on...
I make a call to run the function...
postgis20=# explain analyze SELECT id FROM _test_pos WHERE ST_DWithin(pos, 'POINT(7 7)'::geography, 300000);
geography_gist_selectivity is called, great. It pulls the geog stats, runs the calculations, and returns a DECENT SELECTIVITY NUMBER, 0.281997. But then, look at the very bottom, it's called a SECOND TIME, and this time with arguments that make it return the DEFAULT selectivity.
NOTICE: [geography_estimate.c:geography_gist_selectivity:495] geography_gist_selectivity called
NOTICE: requested search box is : 0.938059508800507 0.0738726407289505 0.0747810378670692, 1.03223621845245 0.168049246072769 0.168957650661469
NOTICE: [geography_estimate.c:geography_gist_selectivity:563] requested search box is : 0.938059508800507 0.0738726407289505 0.0747810378670692, 1.03223621845245 0.168049246072769 0.168957650661469
NOTICE: [geography_estimate.c:geography_gist_selectivity:591] 16013 read from stats
NOTICE: [geography_estimate.c:geography_gist_selectivity:593] histo: xmin,ymin,zmin: 0.883022,0.000017,0.000017
NOTICE: [geography_estimate.c:geography_gist_selectivity:594] histo: xmax,ymax: 1.000000,0.321394,0.342020
NOTICE: [geography_estimate.c:geography_gist_selectivity:595] histo: unitsx: 12.000000
NOTICE: [geography_estimate.c:geography_gist_selectivity:596] histo: unitsy: 34.000000
NOTICE: [geography_estimate.c:geography_gist_selectivity:597] histo: unitsz: 36.000000
NOTICE: [geography_estimate.c:geography_gist_selectivity:598] histo: avgFeatureCoverage: 0.000000
NOTICE: [geography_estimate.c:geography_gist_selectivity:599] histo: avgFeatureCells: 1.000000
NOTICE: [geography_estimate.c:estimate_selectivity:196] histogram has 12 unitsx, 34 unitsy, 36 unitsz
NOTICE: [geography_estimate.c:estimate_selectivity:197] histogram geosize is 0.116978x0.321376x0.342003
NOTICE: [geography_estimate.c:estimate_selectivity:381] [5,7,7] cell val 0.000000000000000
NOTICE: [geography_estimate.c:estimate_selectivity:383] [5,7,7] AOI 0.000000007526458
NOTICE: [geography_estimate.c:estimate_selectivity:385] [5,7,7] gain 0.009150018534462
.........
NOTICE: [geography_estimate.c:estimate_selectivity:381] [11,17,17] cell val 0.000000000000000
NOTICE: [geography_estimate.c:estimate_selectivity:383] [11,17,17] AOI 0.000000532528078
NOTICE: [geography_estimate.c:estimate_selectivity:385] [11,17,17] gain 0.647401718907906
NOTICE: [geography_estimate.c:estimate_selectivity:390] [11,17,17] adding 0.000000000000000 to value
NOTICE: [geography_estimate.c:estimate_selectivity:434] search_box overlaps 847.000000 cells
NOTICE: [geography_estimate.c:estimate_selectivity:435] avg feat overlaps 1.000000 cells
NOTICE: [geography_estimate.c:estimate_selectivity:447] SUM(ov_histo_cells)=0.281997
NOTICE: [geography_estimate.c:estimate_selectivity:448] gain=1.000000
NOTICE: [geography_estimate.c:estimate_selectivity:449] selectivity=0.281997
NOTICE: [geography_estimate.c:geography_gist_selectivity:609] returning computed value: 0.281997
NOTICE: [geography_estimate.c:geography_gist_selectivity:495] geography_gist_selectivity called
NOTICE: [geography_estimate.c:geography_gist_selectivity:538] no variable argument ? - returning default selectivity
                                  QUERY PLAN
----------------------------------------------------------------------------------------------------------
 Seq Scan on _test_pos  (cost=0.00..5578.00 rows=1 width=4) (actual time=16.487..66.990 rows=3839 loops=1)
   Filter: ((pos && '0101000020E61000000000000000001C400000000000001C40'::geography) AND ('0101000020E61000000000000000001C400000000000001C40'::geography && _st_expand(pos, 300000::double precision)) AND _st_dwithin(pos, '0101000020E61000000000000000001C400000000000001C40'::geography, 300000::double precision, true))
 Total runtime: 67.736 ms
(3 rows)
Our double-sided expand trick is not doing what we want, it's not picking the good selectivity and discarding the other, it's picking the other.
comment:34 by , 12 years ago
And so if you change the estimate for <const> && <const> to 1.0 (geography_estimate.c:540), you get the correct estimate on our overall DWithin query.
/*
 * We are working on two constants..
 * TODO: check if expression is true,
 * returned set would be either
 * the whole or none.
 */
if ( ! IsA(self, Var) )
{
    POSTGIS_DEBUG(3, " no variable argument ? - returning default selectivity");
    PG_RETURN_FLOAT8(1.0);
}
comment:35 by , 12 years ago
Right - the reason there are 2 calls to the function is that there are 2 && operators in the rewritten query. The first is <col> && <const> (for which we can at least attempt an estimate) and the second is <const> && expand(<col>), which we can't do much about because the function is effectively a black box to the planner, hence we bail out returning the default estimate.
I'm thinking that the reason we are underestimating is that in 3 dimensions we intersect a larger number of potential histogram boxes within 3D space, but they are mostly empty due to the curvature of the spheroid, which drags avg_feat_cells unusually low. This will need some testing to verify, though. Therefore for geography I think we either need to determine how to alter the mathematics to handle this, or possibly even consider converting the inputs to 2D and using something similar to the existing geometry code.
comment:36 by , 12 years ago
The problem is that for <const> && <const> we're returning DEFAULT_SELECTIVITY, which is very high. So the <const> && <const> term is washing out the other one, the one we care about. As the comment in the code block notes, the <const> && <const> case should return either 0 or 1, depending on the actual evaluation of the statement, but not something in between. I'm wondering if returning 1 might actually be the path of "less harm" in a temporary hack, rather than returning DEFAULT_SELECTIVITY.
comment:37 by , 12 years ago
shouldn't the histogram be a cube for 3d ops ?
another thought is: could we make expand() inlined to allow for better estimation ?
comment:38 by , 12 years ago
Bear in mind that the comment block is misleading - we're not bailing out because we have <const> && <const>: there is no such clause in the generated SQL. If you look closer at the code beforehand, the reason we're dropping into that section is that we have something that's not <var> && <const> or <const> && <var>, which in this case is <const> && <func>, created by the _st_expand(pos, 300000) clause.
The only reason this works for your case above is that it just so happens that 300000 covers the entire histogram area, and so a selectivity of 1.0 is an appropriate value. If you change this to, say, 300, then the selectivity will become inaccurate once again.
I wonder if we can create a slightly modified _st_expand() function with different prototypes for different arguments, e.g.
_st_expand(const, distance) -> returns the geography with expanded box as it (should) do at the moment
_st_expand(col, distance) -> uses the histogram stats to return the bounding box for the given column and then expand it by distance
If we mark this second function as IMMUTABLE then it should get folded into a constant at plan time, and then we can use this to set search_box by detecting this <const> && <const> condition which might stand a chance of returning something slightly meaningful.
comment:39 by , 12 years ago
can't we return "dunno" from the estimator, for a <const> && <const>, so that it's not considered by the planner ?
comment:40 by , 12 years ago
sorry, for the <const> && <func> (worth an update of that comment and the addition of another block for the <const> && <func> with a more meaningful TODO paragraph)
comment:41 by , 12 years ago
I don't think your second _st_expand(col, distance) example makes sense, since the expanded version is not necessarily the extent of the whole column, it's the extent of the things that pass the other filters in the query.
comment:42 by , 12 years ago
That's not quite true: if you have a query of the form <foo> && <bar> AND <bat> && <baz> then both pairs <foo> && <bar> and <bat> && <baz> will be passed to the selectivity function by the planner in order to find out which one is the cheapest, i.e. the one with the lowest cost will be executed first and then the other clauses will be applied as subsequent AND filters to the resulting dataset. In other words, the selectivity itself specifies the execution order. This is why hardcoding it as a low constant for <foo> && <bar> in old versions of PostGIS forced the GiST index to be used all of the time: the default value was so small compared to any of the inbuilt estimates.
Hence if the result of <const> && _st_expand(pos, 300000) returns a lower value than <const> && <var> then it should be chosen first in preference to the <const> && <var> clause. Hardcoding the latter to 1.0 is effectively telling the planner that this second clause is likely to return the entire table, and hence effectively forces <const> && <var> to look like the most favourable clause to execute first. So thinking about this particular case again, I think you are right in that you could easily argue that for an ST_DWithin() query <const> && <var> is always going to be better (smaller) than <const> && <expanded var>, but bear in mind that changing this is going to have an unknown knock-on effect for existing applications using && within custom queries, as _st_expand() isn't always the target function.
Upon reflection, perhaps the least invasive fix in terms of application performance is to teach the geography estimator specifically about <const> && _st_expand() using an OID function lookup, and hence return a selectivity that is slightly higher (and hence less attractive) than the plain old <const> && <var> selectivity.
comment:43 by , 12 years ago
I don't think this problem is exclusively about geography, the same code can be found in the geometry estimate section, and there was a demonstration higher in the ticket that it affects geometry too.
comment:44 by , 12 years ago
I guess you're right about it affecting geometry as well, but couldn't find the demonstration (which comment?).
Teaching _st_expand() to the estimator sounds like an interesting project.
Any chance to reuse more code between geometry and geography, to do less work ? Also, I think it's _very_ important to add some form of regression testing for this.
comment:45 by , 12 years ago
I'm thinking: do we really need to have both VAR && Expand(QUERY) _and_ QUERY && Expand(VAR) ? Are we sure that's conceptually correct to do for ST_DWithin ?
comment:47 by , 12 years ago
pramsey: and this one. Sounds like realityexists really wants this, and he's been such a good tester that we should reward him by fixing this or giving him a medal.
comment:48 by , 12 years ago
OK, back looking at this *particular* problem, avoiding thinking about handling <func>, just trying to get the selectivity in this case working better...
First question: can we just avoid returning an answer when we are in a "don't know" situation? Answer: No. The caller is expecting a valid number, and will error out if we don't deliver (plancat.c:1033):
if (result < 0.0 || result > 1.0)
    elog(ERROR, "invalid restriction selectivity: %f", result);
Second question: surely the PgSQL planner has to deal with unknowns! What do they do? Answer: Well, there's an example one caller up from plancat.c, in clausesel.c:683:
else if (is_funcclause(clause))
{
    /*
     * This is not an operator, so we guess at the selectivity. THIS IS A
     * HACK TO GET V4 OUT THE DOOR. FUNCS SHOULD BE ABLE TO HAVE
     * SELECTIVITIES THEMSELVES. -- JMH 7/9/92
     */
    s1 = (Selectivity) 0.3333333;
}
So, they return a mid-range value when they don't know, presumably to let other extreme values one way or the other weigh in first.
comment:49 by , 12 years ago
Resolution: | → fixed |
---|---|
Status: | new → closed |
comment:50 by , 12 years ago
What about a simple EXPLAIN ANALYZE based test ? I think we can force the analyzer to get the whole dataset in the histogram so that the expected number of results is predictable (rather than based on a smaller sample)
comment:52 by , 12 years ago
Resolution: | fixed |
---|---|
Status: | closed → reopened |
Tried r10728 on Linux (32-bit), but it seems to be completely broken: whenever VACUUM runs (manually or automatically) PostgreSQL crashes. I don't have a repro ready for this, but attaching the backtrace and debug log, hopefully that helps.
by , 12 years ago
Attachment: | backtrace-r10728.txt added |
---|
by , 12 years ago
Attachment: | log-r10728.txt added |
---|
comment:53 by , 12 years ago
comment:54 by , 12 years ago
Sorry, it still crashes, though at a different point in my code (and in yours)
comment:55 by , 12 years ago
Can you give me a test table that does it? My little ones are working. Sorry about this!
comment:56 by , 12 years ago
Here you go:
CREATE TABLE _test_point AS
  SELECT 'POINT(180 50)'::geography AS pos;
ANALYZE _test_point;
comment:57 by , 12 years ago
I made a lot of careless mistakes in that sequence, but the final problem ended up being quite subtle. In r10731 I didn't fix it, I just reverted it. It's tricky and sensitive. The geodetic bounding box of a point, when converted to float and stored in the gserialized, is slightly larger than a point. For a one-entry table, that means the histogram is not just a single box, but something bigger. Math explodes, unpleasantness ensues.
comment:58 by , 12 years ago
OK, tested on r10731 on 32-bit Linux. It doesn't crash any more - good start! A simplified version of my original query is now twice as fast with the filtering function inlined compared to not inlined. The real, original query is still about 20% slower with inlining than without. That's a big improvement on before (when it didn't return at all), but I guess it's still largely a matter of luck.
I can't compare the absolute times before and after, because my "main" Postgres installation (on Windows x64) is still at r10577. I'm holding off upgrading that until the "180/-180" issue in #2066 is fixed.
comment:59 by , 12 years ago
Owner: | changed from | to
---|---|
Status: | reopened → new |
comment:60 by , 12 years ago
Resolution: | → fixed |
---|---|
Status: | new → closed |
I'm closing this out, I think the fix is as good as we're going to get on stable branch. Doing a function-based analyze I leave as an exercise to brighter folks.
Something unrelated I noticed while investigating this: line 1551 in lwgeodetic.c says
Is that meant to be
?