Opened 4 years ago
Closed 3 years ago
#4720 closed defect (fixed)
Memory leak when using Geography
Reported by: | kthujvu | Owned by: | Algunenano |
---|---|---|---|
Priority: | medium | Milestone: | PostGIS PostgreSQL |
Component: | postgis | Version: | 3.0.x |
Keywords: | geography memory | Cc: |
Description
A query with a single-line subquery involving Geography for each row ends up OOMing.
I loaded https://www.naturalearthdata.com/http//www.naturalearthdata.com/download/110m/cultural/ne_110m_admin_0_countries.zip using Geography using shp2pgsql -c -D -i -I -G ne_110m_admin_0_countries.shp > ne110.sql
, then psql ... -f ne110.sql
.
There is an spatial index ne_110m_admin_0_countries_geog_idx
.
In the queries below I create a grid of points using WITH, the problem also occurs if I have that data in a static table.
This query is eating all memory on my system:
WITH latitudes AS ( SELECT generate_series AS latitude FROM generate_series(-90, 90, 0.1) ), longitudes AS ( SELECT generate_series AS longitude FROM generate_series(-180, 180, 0.1) ), points AS ( SELECT ST_SetSRID(ST_Point(longitude, latitude), 4326)::geography AS geog FROM latitudes CROSS JOIN longitudes ) SELECT geog, ( SELECT name FROM ne_110m_admin_0_countries AS ne ORDER BY p.geog <-> ne.geog LIMIT 1 ) FROM points AS p ;
After 10 minutes: >2G and growing about 8M per second
This does not happen if I use Geometry for the points and cast the Geography data in the NE table to Geometry:
WITH latitudes AS ( SELECT generate_series AS latitude FROM generate_series(-90, 90, 0.1) ), longitudes AS ( SELECT generate_series AS longitude FROM generate_series(-180, 180, 0.1) ), points AS ( SELECT ST_SetSRID(ST_Point(longitude, latitude), 4326) AS geom FROM latitudes CROSS JOIN longitudes ) SELECT geom, ( SELECT name FROM ne_110m_admin_0_countries AS ne ORDER BY p.geom <-> ne.geog::geometry LIMIT 1 ) FROM points AS p ;
After 10 minutes: stable 107M
This does happen when I use the existing Geography data and Geometry for the point grid:
WITH latitudes AS ( SELECT generate_series AS latitude FROM generate_series(-90, 90, 0.1) ), longitudes AS ( SELECT generate_series AS longitude FROM generate_series(-180, 180, 0.1) ), points AS ( SELECT ST_SetSRID(ST_Point(longitude, latitude), 4326) AS geom FROM latitudes CROSS JOIN longitudes ) SELECT geom, ( SELECT name FROM ne_110m_admin_0_countries AS ne ORDER BY p.geom <-> ne.geog LIMIT 1 ) FROM points AS p ;
After ~5 minutes: >500MB and growing
It does not happen when I cast the existing Geography to Geometry and use Geography for the point data:
WITH latitudes AS ( SELECT generate_series AS latitude FROM generate_series(-90, 90, 0.1) ), longitudes AS ( SELECT generate_series AS longitude FROM generate_series(-180, 180, 0.1) ), points AS ( SELECT ST_SetSRID(ST_Point(longitude, latitude), 4326)::geography AS geog FROM latitudes CROSS JOIN longitudes ) SELECT geog, ( SELECT name FROM ne_110m_admin_0_countries AS ne ORDER BY p.geog <-> ne.geog::geometry LIMIT 1 ) FROM points AS p ;
After 10 minutes: stable 103M
It does not happen if I use the volatile data for both sides of the spatial query:
WITH latitudes AS ( SELECT generate_series AS latitude FROM generate_series(-90, 90, 0.1) ), longitudes AS ( SELECT generate_series AS longitude FROM generate_series(-180, 180, 0.1) ), points AS ( SELECT ST_SetSRID(ST_Point(longitude, latitude), 4326)::geography AS geog FROM latitudes CROSS JOIN longitudes ) SELECT geog, ( SELECT geog FROM points AS p2 ORDER BY p.geog <-> p2.geog LIMIT 1 ) FROM points AS p ;
After 10 minutes: stable 110M
PostgreSQL 12.2 (Debian 12.2-2.pgdg100+1) on x86_64-pc-linux-gnu, compiled by gc c (Debian 8.3.0-6) 8.3.0, 64-bit
POSTGIS="3.0.1 ec2a9aa" [EXTENSION] PGSQL="120" GEOS="3.7.1-CAPI-1.11.1 27a5e771 " PROJ="Rel. 5.2.0, September 15th, 2018" GDAL="GDAL 2.4.0, released 2018/12/14" LIBXML="2.9.4" LIBJSON="0.12.1" LIBPROTOBUF="1.3.1" WAGYU="0.4.3 (Internal)" TOPO LOGY RASTER
Change History (20)
comment:1 by , 4 years ago
comment:2 by , 4 years ago
I also cannot reproduce it with the 50m dataset.
Here is the full "test suite" I used, uses docker (sorry):
docker run --name "postgis" -p 25432:5432 -d -t kartoza/postgis:12.1 docker exec -it postgis bash # then in the container itself: apt update apt install postgis # for shp2pgsql, update is needed first for the package to be discoverable apt install unzip wget https://www.naturalearthdata.com/http//www.naturalearthdata.com/download/110m/cultural/ne_110m_admin_0_countries.zip wget https://www.naturalearthdata.com/http//www.naturalearthdata.com/download/50m/cultural/ne_50m_admin_0_countries.zip unzip ne_50m_admin_0_countries.zip shp2pgsql -c -D -i -I -G ne_50m_admin_0_countries.shp | PGPASSWORD=docker psql -h localhost -U docker gis unzip ne_110m_admin_0_countries.zip shp2pgsql -c -D -i -I -G ne_110m_admin_0_countries.shp | PGPASSWORD=docker psql -h localhost -U docker gis
Then in
PGPASSWORD=docker psql -h localhost -U docker gis
I tested the following query:
WITH latitudes AS ( SELECT generate_series AS latitude FROM generate_series(-90, 90, 0.1) ), longitudes AS ( SELECT generate_series AS longitude FROM generate_series(-180, 180, 0.1) ), points AS ( SELECT ST_SetSRID(ST_Point(longitude, latitude), 4326)::geography AS geog FROM latitudes CROSS JOIN longitudes ) SELECT geog, ( SELECT name FROM ne_50m_admin_0_countries AS ne ORDER BY p.geog <-> ne.geog LIMIT 1 ) FROM points AS p ;
That query on the ne_50m_admin_0_countries table: 250MB RAM (RES) usage after 30 minutes
Same query with ne_110m_admin_0_countries: > 2G RAM (RES) and growing ~8MB/s after 15 minutes
comment:3 by , 4 years ago
Milestone: | → PostGIS 3.0.3 |
---|
comment:4 by , 4 years ago
I can reproduce it with the 110m query when loading the data from the shp as a geography.
The setup operation (to create the points relation) takes ~5 MB and mantains the memory stable, so the issue comes from the last JOIN.
Since the only way I know to leak memory from Postgis is in indexes (since those don't work with the same Memory context limits) I tried the same query disabling indexscan (set enable_indexscan = false;
) and with that the issue does not appear.
I've had a look at the offending function (gserialized_gist_geog_distance
) but I haven't seen anywhere were a leak might happen, so I don't really know if it's the index leaking memory or the function finishing faster and caching some results. It needs a more in depth investigation.
comment:5 by , 4 years ago
I also see this behaviour if I load the column as a geometry and have an index over that geometry, even If I remove geography out of the query completely:
WITH latitudes AS ( SELECT generate_series AS latitude FROM generate_series(-90, 90, 0.1) ), longitudes AS ( SELECT generate_series AS longitude FROM generate_series(-180, 180, 0.1) ), points AS ( SELECT ST_SetSRID(ST_Point(longitude, latitude), 4326)::geometry AS geog FROM latitudes CROSS JOIN longitudes ) SELECT geog, ( SELECT name FROM ne_110m_admin_0_countries AS ne ORDER BY p.geog <-> ne.the_geom LIMIT 1 ) FROM points AS p ;
Plan:
Nested Loop (cost=0.01..3203541.08 rows=1000000 width=64) -> Function Scan on generate_series (cost=0.00..10.00 rows=1000 width=32) -> Function Scan on generate_series generate_series_1 (cost=0.00..10.00 rows=1000 width=32) SubPlan 1 -> Limit (cost=0.27..3.05 rows=1 width=17) -> Index Scan using ne_110m_admin_0_countries_the_geom_idx on ne_110m_admin_0_countries ne (cost=0.27..492.07 ro ws=177 width=17) Order By: (the_geom <-> st_setsrid(st_point((generate_series_1.generate_series)::double precision, (generate _series.generate_series)::double precision), 4326)) (7 rows)
OTOH, if I use ST_Distance instead of <->
, which doesn't use indexes, then the memory is really stable. Same if I use <->
but remove indexscan
as a possible subplan.
So it looks like an issue with the index subplans, where either Postgis or Postgres, or both :D, leaking memory or caching way more things.
comment:6 by , 4 years ago
As an extra test, I've tested several other index operators (replacing <-> in the query) and only <->
and <#>
seem to exhibit this behaviour, but I don't know if it might have something to do with the ordering or maybe the recheck operations which I don't know how it works exactly.
comment:7 by , 4 years ago
Just reading through I'm not seeing any places where the <-> operator or the gserialized_gist_distance_2d() function fails to free allocated memory. It seems quite careful about that, in fact. Index support functions are one of the few places you can leak memory easily, since they don't clean up their memory contexts as often as function calls.
comment:8 by , 4 years ago
One thing I've been thinking about and want to review is about the recheck
flag. I think that when recheck is marked we end up calling the non-gist function (ST_Distance in this case), so it might be that ST_Distance needs to be "leak-free".
I'm going to have a look at how this recheck thingy works to see if that offers any explanations.
comment:9 by , 4 years ago
Ok, so gist calls <-> (geometry, geometry) on recheck, and that is defined as:
CREATE OPERATOR <-> ( LEFTARG = geometry, RIGHTARG = geometry, PROCEDURE = geometry_distance_centroid, COMMUTATOR = '<->' );
Following the thread:
CREATE OR REPLACE FUNCTION geometry_distance_centroid(geom1 geometry, geom2 geometry) RETURNS float8 AS 'MODULE_PATHNAME', 'ST_Distance' LANGUAGE 'c' IMMUTABLE STRICT PARALLEL SAFE _COST_MEDIUM;
So ST_Distance is being called in the index.
To simplify things and make sure only index related functions are being used from Postgis, I decided to create a new table to use that instead of the CTE:
CREATE TABLE example_points AS WITH latitudes AS ( SELECT generate_series AS latitude FROM generate_series(-90, 90, 0.1) ), longitudes AS ( SELECT generate_series AS longitude FROM generate_series(-180, 180, 0.1) ), points AS ( SELECT ST_SetSRID(ST_Point(longitude, latitude), 4326)::geometry AS geog FROM latitudes CROSS JOIN longitudes ) Select * from points;
Then the query is simplified to:
SELECT geog, ( SELECT name FROM ne_110m_admin_0_countries AS ne ORDER BY p.geog <-> ne.the_geom LIMIT 1 ) FROM example_points AS p ;
With the following plan:
QUERY PLAN ---------------------------------------------------------------------------------------------------------------------------------------------- Seq Scan on example_points p (cost=0.00..18181019.39 rows=6485383 width=64) SubPlan 1 -> Limit (cost=0.14..2.79 rows=1 width=17) -> Index Scan using ne_110m_admin_0_countries_the_geom_idx on ne_110m_admin_0_countries ne (cost=0.14..468.49 rows=177 width=17) Order By: (the_geom <-> p.geog) (5 rows)
If I run this on current REL_12_STABLE postgres and master postgis the 2 inputs into Postgis that I see are:
- 97.94% ExecSubPlan ▒ - 97.22% ExecLimit ▒ - 96.22% ExecScan ▒ - 76.51% IndexNextWithReorder ▒ - 50.83% index_getnext_slot ▒ - 49.76% gistgettuple ▒ - 46.69% gistScanPage ▒ - 26.41% FunctionCall5Coll ▒ - 24.93% gserialized_gist_distance_2d
And:
- 98.37% ExecScan ▒ - 98.13% ExecInterpExpr ▒ - 97.94% ExecSubPlan ▒ - 97.22% ExecLimit ▒ - 96.22% ExecScan ▒ - 76.51% IndexNextWithReorder ▒ - 22.71% ExecInterpExpr ▒ - 22.51% ST_Distance
So if anything is leaking on our side it has to be related to the implementation of either gserialized_gist_distance_2d or ST_Distance.
After around 60 seconds running the query, the memory used by the backend in charge of the query was 3.7 GB and growing.
To compare with something, although I don't know how useful / comparable that is, I changed the implementation of gserialized_gist_distance_2d to return 0 and recheck = true without doing any actual work, and ST_Distance to return 0 without any work either. After 60 seconds running the query, the memory reserved by the process was growing but much smaller: 120 MB.
The next thing I've tested was to never recheck (always return false) and rerun the query. When I do this, after 60 seconds the memory is on 50 MB, also growing. But for some reason ST_Distance
is still being called (way less times as it only represents 6.84% of the time.
So I'm 100% sure this has something to do with recheck now, but I don't know if this is on our side (ST_Distance) or in Postgres side.
comment:10 by , 4 years ago
I've modified ST_Distance to create its own memory context (per call) and that removes the memory leak, so the problem is definitely there.
Since the function itself has 2 clear parts (geometry deserialization and lwgeom_mindistance2d call) I decided to narrow down the context change to the function call (after generating the geometries from the serialized buffers), and with that the leak was also gone, so lwgeom_mindistance2d might be allocating memory and not freeing it, which is odd since it's tested in liblwgeom under valgrind.
comment:11 by , 4 years ago
Owner: | changed from | to
---|
comment:12 by , 4 years ago
Milestone: | PostGIS 3.0.3 → PostGIS 3.0.4 |
---|
comment:14 by , 4 years ago
Hm, I don't see any valgrind tests of liblwgeom in the ci scripts anymore. Not that I'm seeing any problems running manually.
==98887== LEAK SUMMARY: ==98887== definitely lost: 0 bytes in 0 blocks ==98887== indirectly lost: 0 bytes in 0 blocks ==98887== possibly lost: 0 bytes in 0 blocks
comment:15 by , 3 years ago
Any hopes of having this dealt with in next couple of days or should we push to 3.0.5?
comment:16 by , 3 years ago
Milestone: | PostGIS 3.0.4 → PostGIS Fund Me |
---|
It needs further investigation at both PG and Postgis level
comment:17 by , 3 years ago
I did investigation of this problem and discovered that memory leaks on PG side. Thread: https://www.postgresql.org/message-id/flat/CAHqSB9gECMENBQmpbv5rvmT3HTaORmMK3Ukg73DsX5H7EJV7jw%40mail.gmail.com
comment:18 by , 3 years ago
Fix is pushed in Postgres and backported to all supported versions https://github.com/postgres/postgres/commit/3f74daa8dfc634132ff7270212c57da5e316b001
comment:19 by , 3 years ago
A big thanks to all of you for this thorough investigation and fix. You are awesome!
comment:20 by , 3 years ago
Milestone: | PostGIS Fund Me → PostGIS PostgreSQL |
---|---|
Resolution: | → fixed |
Status: | new → closed |
@kalenik thanks for fixing.
I'm afraid I'm not seeing this effect... at least not at anything remotely the level you describe. My memory use under both PostGIS 3.0 and 3.1 (master) is climbing only veeeery slowly (14M usage after 4 mintes). It *might* be a leak, but it's not the gusher you're seeing. Only difference in my test is using the 50m dataset instead of 110m. I will attempt with your exact data later.