Opened 21 months ago

Closed 21 months ago

Last modified 21 months ago

#5337 closed defect (invalid)

Row estimate for gist geometry index is more than 500 times smaller than actual count

Reported by: Björn Harrtell Owned by: pramsey
Priority: medium Milestone: PostGIS 3.4.0
Component: postgis Version: 3.2.x
Keywords: Cc:

Description

I have a table that contains ~30 million mixed geometries of points, linestrings and polygons, mostly small but some few exceptions that are large. Data is not uniformly distributed.

Counting rows within a small bbox (~5% of full extent) where real density is high results in 1.552.761 rows. Row estimate when looking at the explained plan is 2.819.

The table is recently analyzed and I've attempted to adjust with set statistics on the geometry column to 1000 instead of default 100 hoping that could improve the estimate but it has little to no effect.

Data is open at https://datafordeler.dk/dataoversigt/geodanmark-vektor/geodanmark-vektor-brugerdefineret-filudtraek/ but unfortunately it is in Danish only (I think?) and requires a cumbersome process to register and download. I might be able to provide a table copy if anyone is interested to look into this issue.

postgis_full_version: POSTGIS="3.2.3 0" [EXTENSION] PGSQL="130" GEOS="3.9.2-CAPI-1.14.3" PROJ="9.0.1" LIBXML="2.9.4" LIBJSON="0.12.1" LIBPROTOBUF="1.2.1" WAGYU="0.5.0 (Internal)" (core procs from "3.1.1 aaf4c79" need upgrade)

Change History (7)

comment:1 by Björn Harrtell, 21 months ago

Note to self might want to see if _postgis_selectivity or _postgis_stats can provide some insight (see https://git.osgeo.org/gitea/postgis/postgis/src/branch/master/postgis/postgis.sql.in#L548)

Last edited 21 months ago by Björn Harrtell (previous) (diff)

comment:2 by Björn Harrtell, 21 months ago

Output from _postgis_stats:

{"ndims":2,"size":[110,88],"extent":{"min":[440050,6.0488e+06],"max":[886366,6.40309e+06]},"table_features":28186628,"sample_features":30000,"not_null_features":30000,"histogram_features":29936,"histogram_cells":9680,"cells_covered":29936}

Seems reasonable to me.

comment:3 by Björn Harrtell, 21 months ago

Also tried _postgis_selectivity with the same bbox as mentioned in description which is POLYGON ((716905.5 6168346.1, 716905.5 6185703.38, 741648.19 6185703.38, 741648.19 6168346.1, 716905.5 6168346.1)) and I get 0.055185432031709364. Assuming that means 5.5% of 28 million that seems to be a good estimate, it doesn't make sense to me why the row estimate is 2819 rows (-> Bitmap Heap Scan on geodkobjekt g (cost=68.27..5625.92 rows=2819 width=472)).

Last edited 21 months ago by Björn Harrtell (previous) (diff)

comment:4 by Björn Harrtell, 21 months ago

Mabye this is simply me doing things the wrong way. The bad row estimate is when using static geometry from a CTE. If I supply the geometry as literal I get a good estimate!

With geometry from joined single row CTE:

-> Bitmap Heap Scan on geodkobjekt g (cost=68.27..5625.92 rows=2819 width=472)

Recheck Cond: (geometri && (st_setsrid((st_extent(geodkobjekt.geometri))::geometry, 25832))) -> Bitmap Index Scan on geodkobjekt_geometri_idx (cost=0.00..67.56 rows=2819 width=0)

Index Cond: (geometri && (st_setsrid((st_extent(geodkobjekt.geometri))::geometry, 25832)))

With literal geometry:

-> Parallel Bitmap Heap Scan on geodkobjekt g (cost=36847.48..1506295.63 rows=648121 width=0)

Recheck Cond: (geometri && '0103000020E8640000010000000500000000000000D3E0254166666686C687574100000000D3E0254185EB51D8B998574114AE476120A2264185EB51D8B998574114AE476120A2264166666686C687574100000000D3E0254166666686C6875741'::geometry) -> Bitmap Index Scan on geodkobjekt_geometri_idx (cost=0.00..36458.60 rows=1555491 width=0)

Index Cond: (geometri && '0103000020E8640000010000000500000000000000D3E0254166666686C687574100000000D3E0254185EB51D8B998574114AE476120A2264185EB51D8B998574114AE476120A2264166666686C687574100000000D3E0254166666686C6875741'::geometry)

comment:5 by Björn Harrtell, 21 months ago

I've also tried rewriting the query to use a single row subquery in the WHERE clause instead of the CTE join, giving the same bad result as the CTE variant.

comment:6 by komzpa, 21 months ago

Resolution: invalid
Status: newclosed

Any dynamically created geometry will not be known at planning time and thus cannot be used in estimation. (st_setsrid((st_extent(geodkobjekt.geometri))::geometry, 25832)) is a dynamically created geometry. Typical way to deal with that is to either create the table really in separate query before this one instead of inline CTE, or to rewrite your query to be a direct JOIN on the tables without proxy calculations (which you can still add on top of the JOIN if needed).

comment:7 by Björn Harrtell, 21 months ago

Thanks @kompza. I had somehow become ignorant of this fact.

Last edited 21 months ago by Björn Harrtell (previous) (diff)
Note: See TracTickets for help on using tickets.