Opened 15 years ago
Last modified 15 years ago
#230 closed defect
Put in costs in _ST and possibly other functions — at Version 14
Reported by: | nicklas | Owned by: | robe |
---|---|---|---|
Priority: | medium | Milestone: | PostGIS 1.5.0 |
Component: | postgis | Version: | 1.4 |
Keywords: | st_dwithin st_expand bbox | Cc: |
Description (last modified by )
I'm revising this since this seems to be an issue mostly because we don't have COSTS assigned to our _ST functions, so the planner in certain cases (like with large geometries), chooses to do a table scan instead of an index scan.
Since costs are only supported in PostgreSQL 8.3+ and we are also going to need typmod support for geography. I'm wondering if we want to just push this to 1.5 and also make PostgreSQL 8.3+ a requirement for 1.5 instead of introducing messy conditional code to not put costs in if 8.2 or lower.
using dataset from states.backup as example
the bbox-comparasion doesn't seem to always happen before the _st_dwithin calculations. To isolate the problem I used _st_dwithin instead of st_dwithin and played with the bbox-comparasions in the query instead. then, if I run:
select a.state from us.states a, us.states b where b.state = 'Hawaii' and a.state != 'Hawaii' and a.the_geom && b.the_geom; or select a.state from us.states a, us.states b where b.state = 'Hawaii' and a.state != 'Hawaii' and st_expand(a.the_geom, 0) && b.the_geom;
it tells in 30 ms that there is no hit, just as expected the same happens if I try:
select a.state from us.states a, us.states b where b.state = 'Hawaii' and a.state != 'Hawaii' and a.the_geom && b.the_geom and _st_dwithin(a.the_geom, b.the_geom, 0);
fast answer that there is no hit.
BUT
if I run
select a.state from us.states a, us.states b where b.state = 'Hawaii' and a.state != 'Hawaii' and st_expand(a.the_geom, 0) && b.the_geom and _st_dwithin(a.the_geom, b.the_geom, 0);
then, the fun is over. It starts running for 220000 ms before I get the answer. My guess is that in this case _st_dwithin is trigged before the bbox-comparation so it makes the distance calculations to every one of the other polygons in the dataset. It behaves the same if I run:
select a.state from us.states a, us.states b where b.state = 'Hawaii' and a.state != 'Hawaii' and st_dwithin(a.the_geom, b.the_geom,0)
where the function is supposed to handle the bbox-comparasion.
/Nicklas
Change History (15)
by , 15 years ago
Attachment: | states.backup added |
---|
comment:1 by , 15 years ago
Milestone: | postgis 1.5.0 → postgis 1.4.1 |
---|---|
Version: | → 1.4 |
comment:2 by , 15 years ago
Owner: | changed from | to
---|---|
Status: | new → assigned |
Nicklas, I don't have proof of this, but I suspect this problem only happens with really large geometries and also if you don't have a limit or your limit reaches a certain threshold.
I'm putting the blame right square on this st_expand(geometry, double precision) overloaded function. I suspect for large geometries -- this is passing the huge geometry down the throat of the C layer (when in theory it should just be reading the bounding box and expanding that) in (and also some how messing up the planners favor of an index). Well that's just a theory. I haven't fiddled with costs and so forth.
If you look at our slides -- we some how managed to escape this horrible fate you have demonstrated, but I did confirm your results happen even on 1.3)
See slide 19 all use indexes except middle one which we wouldn't expect to -- strange huh. Only difference is we use a limit and also we simplify in later cases to reduce the size of the geometry.
http://www.bostongis.com/downloads/oscon2009/Oscon2009_PostGISTips.pdf
comment:3 by , 15 years ago
I have been looking some more on this. I don't think that it is st_expand that uses all the power, but when we get over some border the planner decides to do the _st_dwithin before bbox comparasion.
both _st_dwithin and st_expand has cost 1. I guess that makes it a little randomly wich one is calculated first. Is there a reason for that. I tried to give _st_dwithin cost 100 and that seems to solve the problem.
/Nicklas
comment:4 by , 15 years ago
Well the main issue is that costs were introduced in 8.3 so they aren't supported in 8.2. That is one reason we haven't applied costs. So we'll need ot incorporate it in such as fashion as to not apply to 8.2.
The other issue is that we just haven't gotten around to applying costs to all the functions. I think we can take an easy rule of thumb and just set them all to 100 except for the operators though. Then apply say 200 for more intensive like ST_Distance.
comment:5 by , 15 years ago
Yes, in cases like this a distinction would be valuable.
The thing is here that st_distance has cost 100 but that doesn't help the planner to decide when to use _st_dwithin.
/Nicklas
comment:6 by , 15 years ago
It wouldn't. _St_Dwithin and ST_Distance are completely different functions as far as PostgreSQL is concerned. Prior to 1.3 ST_DWithin did use ST_Distance, but it doesn't anymore.
comment:7 by , 15 years ago
sorry, wrong from me. I meant st_dwithin now has cost 100, but _st_dwithin has cost 1.
I didn't mean to put in st_distance
/Nicklas
comment:8 by , 15 years ago
Why would it? Because its written in SQL, the function is transparent (sql functions are generally inlined). See slides -- the link to the index call is part of ST_DWithin. You will also notice that if you look at the pgadmin planner -- the ST_Dwithin call is split into 2 -- the index match and the _ST_DWithin with an extra superfluous EXPAND that never uses an index).
The extra expand is simply to make the function symettric. Even though the answers to expand are identical -- one is indexable and one is not (and only at run-time would you know which one is and which one isn't) so the planner will choose to test the indexable one first and never test the rest of the function if the index call fails.
comment:9 by , 15 years ago
I don't know if I understand you right Regina, but my suggestion comes from this: I use the last example above with st_dwithin
with cost 1 at _st_dwithin, the explain of the query looks like this. It puts _st_dwithin before the bbox comparasions:
"Nested Loop (cost=0.00..4.50 rows=1 width=9)" " Join Filter: (_st_dwithin(a.the_geom, b.the_geom, 0::double precision) AND (a.the_geom && st_expand(b.the_geom, 0::double precision)) AND (b.the_geom && st_expand(a.the_geom, 0::double precision)))" " -> Seq Scan on states b (cost=0.00..1.66 rows=1 width=442568)" " Filter: ((state)::text = 'Hawaii'::text)" " -> Seq Scan on states a (cost=0.00..1.66 rows=52 width=442577)" " Filter: ((a.state)::text <> 'Hawaii'::text)"
But if I change the cost of _st_dwithin to 100 the same exlpanation looks like this:
"Nested Loop (cost=0.00..10.21 rows=1 width=9)" " Join Filter: ((b.the_geom && st_expand(a.the_geom, 0::double precision)) AND _st_dwithin(a.the_geom, b.the_geom, 0::double precision))" " -> Seq Scan on states b (cost=0.00..1.66 rows=1 width=442568)" " Filter: ((state)::text = 'Hawaii'::text)" " -> Index Scan using us_idx_states_the_geom on states a (cost=0.00..8.27 rows=1 width=442577)" " Index Cond: (a.the_geom && st_expand(b.the_geom, 0::double precision))" " Filter: ((a.state)::text <> 'Hawaii'::text)"
Now it puts _st_dwithin after the bounding-box comparasion and the query runs in 16ms instead of very very long time.
/Nicklas
comment:10 by , 15 years ago
Exactly observe ( that part is coming from ST_DWithin -- if you look at the definition of ST_DWithin -- it has 2 expand calls - this is one of them)
" Index Cond: (a.the_geom && st_expand(b.the_geom, 0::double precision))" " Filter: ((a.state)::text <> 'Hawaii'::text)
The join filter part has the portion of ST_DWithin that doesn't use an index (see how the second expand that is pushed to index is missing)
" Join Filter: ((b.the_geom && st_expand(a.the_geom, 0::double
precision)) AND _st_dwithin(a.the_geom, b.the_geom, 0::double precision))"
The planner has split the function into its constituent parts. So lesson learned putting cost on transparent SQL functions that use others is not effective.
comment:11 by , 15 years ago
Ok I've learned a lot about how to understand the explain. Thanks :-)
But I still don't get what the downside is to put cost 100 to _st_dwithin. The it starts using the index and it all looks fine... ,not ??
I will look through your slides to learn more.
Thanks Nicklas
comment:12 by , 15 years ago
nothing wrong with that I can think of. In fact I think all the _st functions should have costs 100 sounds as good as any.
comment:14 by , 15 years ago
Description: | modified (diff) |
---|---|
Milestone: | postgis 1.4.1 → postgis 1.5.0 |
Summary: | st_expand seems to affect the execution order wich affects st_dwithin → Put in costs in _ST and possibly other functions |
Dataset from ticket 137