#230 closed defect (fixed)
Put in costs in _ST and possibly other functions
| Reported by: | nicklas | Owned by: | robe |
|---|---|---|---|
| Priority: | medium | Milestone: | PostGIS 1.5.0 |
| Component: | postgis | Version: | 1.4 |
| Keywords: | st_dwithin st_expand bbox | Cc: | |
Description
I'm revising this since this seems to be an issue mostly because we don't have COSTS assigned to our _ST functions, so the planner in certain cases (like with large geometries), chooses to do a table scan instead of an index scan.
Since costs are only supported in PostgreSQL 8.3+, and we are also going to need typmod support for geography, I'm wondering if we want to just push this to 1.5 and make PostgreSQL 8.3+ a requirement for 1.5, instead of introducing messy conditional code that omits costs on 8.2 and lower.
Using the dataset from states.backup as an example:
The bbox comparison doesn't always seem to happen before the _st_dwithin calculations. To isolate the problem I used _st_dwithin instead of st_dwithin and played with the bbox comparisons in the query instead. Then, if I run:
select a.state from us.states a, us.states b
where b.state = 'Hawaii' and a.state != 'Hawaii'
  and a.the_geom && b.the_geom;

or

select a.state from us.states a, us.states b
where b.state = 'Hawaii' and a.state != 'Hawaii'
  and st_expand(a.the_geom, 0) && b.the_geom;
it tells me in 30 ms that there is no hit, just as expected. The same happens if I try:
select a.state from us.states a, us.states b
where b.state = 'Hawaii' and a.state != 'Hawaii'
  and a.the_geom && b.the_geom
  and _st_dwithin(a.the_geom, b.the_geom, 0);
A fast answer that there is no hit.
BUT
if I run
select a.state from us.states a, us.states b
where b.state = 'Hawaii' and a.state != 'Hawaii'
  and st_expand(a.the_geom, 0) && b.the_geom
  and _st_dwithin(a.the_geom, b.the_geom, 0);
then the fun is over: it runs for 220000 ms before I get the answer. My guess is that in this case _st_dwithin is triggered before the bbox comparison, so it performs the distance calculation against every one of the other polygons in the dataset. It behaves the same if I run:
select a.state from us.states a, us.states b
where b.state = 'Hawaii' and a.state != 'Hawaii'
  and st_dwithin(a.the_geom, b.the_geom, 0);

where the function is supposed to handle the bbox comparison itself.
/Nicklas
Attachments (1)
Change History (28)
by , 15 years ago
Attachment: states.backup added
comment:1 by , 15 years ago
Milestone: postgis 1.5.0 → postgis 1.4.1
Version: → 1.4
comment:2 by , 15 years ago
Owner: changed from | to
Status: new → assigned
Nicklas, I don't have proof of this, but I suspect this problem only happens with really large geometries and also if you don't have a limit or your limit reaches a certain threshold.
I'm putting the blame squarely on this st_expand(geometry, double precision) overloaded function. I suspect that for large geometries this passes the huge geometry down the throat of the C layer (when in theory it should just read the bounding box and expand that), and also somehow messes up the planner's preference for an index. Well, that's just a theory; I haven't fiddled with costs and so forth.
If you look at our slides, we somehow managed to escape this horrible fate you have demonstrated, but I did confirm that your results happen even on 1.3.
See slide 19: all use indexes except the middle one, which we wouldn't expect to -- strange, huh? The only difference is that we use a limit, and in the later cases we also simplify to reduce the size of the geometry.
http://www.bostongis.com/downloads/oscon2009/Oscon2009_PostGISTips.pdf
comment:3 by , 15 years ago
I have been looking some more at this. I don't think it is st_expand that uses all the power; rather, when we get past some threshold the planner decides to do _st_dwithin before the bbox comparison.
Both _st_dwithin and st_expand have cost 1. I guess that makes it somewhat random which one is calculated first. Is there a reason for that? I tried giving _st_dwithin cost 100 and that seems to solve the problem.
/Nicklas
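For anyone wanting to verify the cost-1 claim on their own install, the assigned costs are visible in the system catalog on PostgreSQL 8.3+. A read-only sketch (needs a live database with PostGIS loaded):

```sql
-- Inspect the planner cost currently assigned to each function (8.3+ only).
SELECT proname, procost
FROM pg_proc
WHERE proname IN ('_st_dwithin', 'st_expand', 'st_dwithin');
```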
comment:4 by , 15 years ago
Well, the main issue is that costs were introduced in 8.3, so they aren't supported in 8.2. That is one reason we haven't applied costs. So we'll need to incorporate it in such a fashion as to not apply them on 8.2.
The other issue is that we just haven't gotten around to applying costs to all the functions. I think we can take an easy rule of thumb and just set them all to 100, except for the operators. Then apply, say, 200 for the more intensive ones like ST_Distance.
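As a sketch of that rule of thumb (the signatures match the stock definitions, but the exact values here are illustrative, not taken from any commit):

```sql
-- PostgreSQL 8.3+ only: blanket defaults for the non-operator functions.
ALTER FUNCTION _ST_DWithin(geometry, geometry, float8) COST 100;
ALTER FUNCTION _ST_Intersects(geometry, geometry) COST 100;
-- Heavier weight for the more intensive ones:
ALTER FUNCTION ST_Distance(geometry, geometry) COST 200;
```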
comment:5 by , 15 years ago
Yes, in cases like this a distinction would be valuable.
The thing here is that st_distance has cost 100, but that doesn't help the planner decide when to use _st_dwithin.
/Nicklas
comment:6 by , 15 years ago
It wouldn't. _ST_DWithin and ST_Distance are completely different functions as far as PostgreSQL is concerned. Prior to 1.3, ST_DWithin did use ST_Distance, but it doesn't anymore.
comment:7 by , 15 years ago
Sorry, my mistake. I meant that st_dwithin now has cost 100, but _st_dwithin has cost 1.
I didn't mean to bring st_distance into it.
/Nicklas
comment:8 by , 15 years ago
Why would it? Because it's written in SQL, the function is transparent (SQL functions are generally inlined). See the slides -- the index call is part of ST_DWithin. You will also notice, if you look at the pgAdmin planner, that the ST_DWithin call is split into two: the index match, and the _ST_DWithin with an extra superfluous EXPAND that never uses an index.
The extra expand is simply there to make the function symmetric. Even though the answers to the two expands are identical, one is indexable and one is not (and only at run time would you know which is which), so the planner will choose to test the indexable one first and never test the rest of the function if the index call fails.
comment:9 by , 15 years ago
I don't know if I understand you right, Regina, but my suggestion comes from this: I use the last example above with st_dwithin.
With cost 1 on _st_dwithin, the EXPLAIN of the query looks like this. It puts _st_dwithin before the bbox comparisons:
Nested Loop  (cost=0.00..4.50 rows=1 width=9)
  Join Filter: (_st_dwithin(a.the_geom, b.the_geom, 0::double precision) AND (a.the_geom && st_expand(b.the_geom, 0::double precision)) AND (b.the_geom && st_expand(a.the_geom, 0::double precision)))
  ->  Seq Scan on states b  (cost=0.00..1.66 rows=1 width=442568)
        Filter: ((state)::text = 'Hawaii'::text)
  ->  Seq Scan on states a  (cost=0.00..1.66 rows=52 width=442577)
        Filter: ((a.state)::text <> 'Hawaii'::text)
But if I change the cost of _st_dwithin to 100, the same plan looks like this:
Nested Loop  (cost=0.00..10.21 rows=1 width=9)
  Join Filter: ((b.the_geom && st_expand(a.the_geom, 0::double precision)) AND _st_dwithin(a.the_geom, b.the_geom, 0::double precision))
  ->  Seq Scan on states b  (cost=0.00..1.66 rows=1 width=442568)
        Filter: ((state)::text = 'Hawaii'::text)
  ->  Index Scan using us_idx_states_the_geom on states a  (cost=0.00..8.27 rows=1 width=442577)
        Index Cond: (a.the_geom && st_expand(b.the_geom, 0::double precision))
        Filter: ((a.state)::text <> 'Hawaii'::text)
Now it puts _st_dwithin after the bounding-box comparison, and the query runs in 16 ms instead of taking a very, very long time.
/Nicklas
comment:10 by , 15 years ago
Exactly. Observe that this part is coming from ST_DWithin -- if you look at the definition of ST_DWithin, it has two expand calls; this is one of them:
Index Cond: (a.the_geom && st_expand(b.the_geom, 0::double precision))
Filter: ((a.state)::text <> 'Hawaii'::text)
The Join Filter part has the portion of ST_DWithin that doesn't use an index (note how the second expand, the one pushed to the index, is missing):
Join Filter: ((b.the_geom && st_expand(a.the_geom, 0::double precision)) AND _st_dwithin(a.the_geom, b.the_geom, 0::double precision))
The planner has split the function into its constituent parts. So, lesson learned: putting a cost on transparent SQL functions that call other functions is not effective.
comment:11 by , 15 years ago
Ok I've learned a lot about how to understand the explain. Thanks :-)
But I still don't get what the downside is of putting cost 100 on _st_dwithin. Then it starts using the index and it all looks fine... no?
I will look through your slides to learn more.
Thanks Nicklas
comment:12 by , 15 years ago
Nothing wrong with that that I can think of. In fact, I think all the _ST functions should have costs; 100 sounds as good as any.
comment:14 by , 15 years ago
Description: modified (diff)
Milestone: postgis 1.4.1 → postgis 1.5.0
Summary: st_expand seems to affect the execution order which affects st_dwithin → Put in costs in _ST and possibly other functions
comment:15 by , 15 years ago
Is there a problem for PostgreSQL 8.2 if we put costs in? Aren't the costs there now, just all set to cost 1? Will PostgreSQL 8.2 installations be affected by diverging costs?
/Nicklas
comment:16 by , 15 years ago
I don't think PostgreSQL 8.2 understands function costs. That feature was introduced in 8.3, and the planner was changed accordingly to look for it. So if we put costs in, I'm pretty sure the .sql files won't load in 8.2, though I haven't verified that.
comment:17 by , 15 years ago
OK, I see.
So cost 1 in PostgreSQL 8.3 and 8.4 is just the default value? That would explain where cost 1 comes from in 8.3/8.4 installations, since it is not mentioned in postgis.sql.in.c.
Couldn't we put the costs in a separate SQL file, with every function represented like:
ALTER FUNCTION _ST_DWithin(geometry,geometry,float8) cost 100;
I can see some advantages of that.
1) during installation, just one check for the PostgreSQL version, and if 8.3 or above --> run costs.sql after postgis.sql
2) users could easily move customized costs from one installation to another.
3) we could put new recommended costs in the wiki for example, between releases.
Especially in the beginning, I guess there will be some calibration of the recommended costs. Points 2 and 3 above don't, of course, depend on us distributing costs.sql with the installation, but that way the user has one place to play with the costs and a recommended way of storing customizations, by editing costs.sql.
Wouldn't it also be easier to try something like Kevin's suggestion of deriving costs from statistics, if they were held separately?
/Nicklas
comment:18 by , 15 years ago
Nicklas, very good idea. I like the idea of keeping the costs in a separate file. That way, if people decide they want different costs for their functions, they can simply not run that file. However, I still want to get rid of 8.2, just from a keeping-my-sanity point of view. I would like it so that at any point in time we only need to test at most two versions of PostgreSQL for new features. Trying to remember what we can and can't do across two versions is enough without having to worry about a third. Now we have 8.3 and 8.4, and 8.5 has already started its cycle -- so we'll want to start testing 8.5 in probably another 3-4 months (or less).
I'm not saying we won't support 8.2; we just shouldn't support it for 1.5 and beyond. I still expect we will support 1.4 and 1.3 for years to come, but that support will just be security fixes and fixes for major bugs that could corrupt data.
comment:19 by , 15 years ago
Sounds fair, and with the costs in a separate file we can still avoid issues like the cause of this ticket for people running 1.3 and 1.4 on PostgreSQL 8.3 or above. And the people wanting the new stuff in 1.5 probably want the new stuff in PostgreSQL 8.4 too.
/Nicklas
comment:20 by , 15 years ago
Since the geography type will require 8.3 minimum anyway, let's just add the costs directly into the .sql file. I agree that we should start by simply adding costs to the functions within any indexable ST_* functions -- perhaps a cost of 100 for the more expensive functions -- and then do some testing to see what happens.
ATB,
Mark.
comment:21 by , 15 years ago
Mark, by .sql file do you mean postgis.sql? Isn't it much easier to change the costs if they are in a separate file?
/Nicklas
comment:22 by , 15 years ago
Yes, correct. The aim should be for us to pick defaults that "just work" for the majority of people rather than complicating matters by having to tune individual installations.
I know from my work on the statistics side of PostGIS that it is a pointless exercise trying to perfect our estimates so that they are accurate to reality; this is mainly because the existing routines in PostgreSQL are only estimates too. All we need to do is ensure that the more expensive functions are weighted heavier than their cheaper equivalents by a reasonable number of units, so that given a dead heat in terms of planner cost, the planner will always choose the cheaper function first.
ATB,
Mark.
comment:23 by , 15 years ago
Resolution: → fixed
Status: assigned → closed
Committed to trunk at r4757
comment:24 by , 15 years ago
Paul, setting all the function costs to 100 is a great start, but shouldn't this ticket remain open? I think the point of this is to inform PostgreSQL of cost estimates for every _ST function so it knows how to order them in the planner.
It's much more costly to use buffer than to compute the centroid of a geometry. So if I have "WHERE ST_IsValid(ST_Buffer(geom)) AND ST_Distance(ST_Centroid(geom), 'POINT(0 0)') > 10", I would want the planner to automatically evaluate the centroid calculation first, and THEN compute the buffer only if the first half of the condition succeeds.
For this, we'll need functional estimates on every function that appropriately estimate the cost.
http://postgis.refractions.net/pipermail/postgis-devel/2009-July/006109.html
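Kevin's example would translate into per-function estimates along these lines (the numbers are placeholders chosen to show relative weighting, not values from r4757):

```sql
-- Cheap: centroid is a simple computation over the vertices.
ALTER FUNCTION ST_Centroid(geometry) COST 25;
-- Expensive: buffer constructs a whole new geometry.
ALTER FUNCTION ST_Buffer(geometry, float8) COST 1000;
-- With these, the planner can reorder the WHERE clause to evaluate the
-- centroid/distance test before the buffer/validity test.
```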
comment:25 by , 15 years ago
The proximate defect flagged by the ticket has been solved with this commit. If you want an enhancement ticket for a future version, make up a new one in a reasonable milestone (and bear in mind you will be the one doing the implementation :)
comment:26 by , 15 years ago
I agree with Kevin on this. Paul I think you have gone a bit overboard.
Why can't we just set the obvious ones that have caused problems:
_ST_DWithin, _ST_Intersects (and the other predicates)? This is really more of a situational problem when indexes are involved, as the lower costs sometimes make the index operation not happen.
I'm also very concerned about the rate at which you are closing things off :). I should be excited, but all I can think is "Paul, stay away from the caffeine." When are we going to have time to test all this?
comment:27 by , 15 years ago
+1 from me; we definitely need to ensure the original test cases work as defined, as it's a common use case.
I also notice in your commit you've bumped the minimum PostgreSQL version to 8.3 in the documentation, although isn't this actually a requirement of incorporating the new geography typmod code as well as adding costs to functions? I think any change that requires a minimum PostgreSQL version change should incur a polite email to -devel for feedback rather than just committing blind.
Also, if we do bump the minimum version, you need to update the minimum version check in configure.ac.
ATB,
Mark.
Dataset from ticket 137