Opened 7 years ago

Closed 7 years ago

#3904 closed defect (fixed)

PostgreSQL 11 crashing on PostGIS 2.5 trunk

Reported by: robe Owned by: pramsey
Priority: blocker Milestone: PostGIS 2.5.0
Component: postgis Version: 2.4.x
Keywords: Cc:

Description

Since this pg11 commit Debbie's pg 11 runs have been failing. I haven't checked PostGIS 2.4 since we don't test 11 on 2.4

Summary

    Add pg_noinline macro to c.h. (details)
    Improve sys/catcache performance. (details)

Commit a0247e7a11bb9f5fd55694b594a3906b7bd05881 by andres

Add pg_noinline macro to c.h.
Forcing a function not to be inlined can be useful if it's the slow-path
of a performance critical function, or should be visible in profiles to
allow for proper cost attribution.
Author: Andres Freund Discussion:
https://postgr.es/m/20170914061207.zxotvyopetm7lrrp@alap3.anarazel.de

The file was modified	src/include/c.h
Commit 141fd1b66ce6e3d10518d66d4008bd368f1505fd by andres

Improve sys/catcache performance.
The following are the individual improvements: 1) Avoidance of
FunctionCallInfo based function calls, replaced by
  more efficient functions with a native C argument interface. 2) Don't
extract columns from a cache entry's tuple whenever matching
  entries - instead store them as a Datum array. This also allows to
  get rid of having to build dummy tuples for negative & list
  entries, and of a hack for dealing with cstring vs. text weirdness. 3)
Reorder members of catcache.h struct, so imortant entries are more
  likely to be on one cacheline. 4) Allowing the compiler to specialize
critical SearchCatCache for a
  specific number of attributes allows to unroll loops and avoid
  other nkeys dependant initialization. 5) Only initializing the ScanKey
when necessary, i.e. catcache misses,
  greatly reduces cache unnecessary cpu cache misses. 6) Split of the
cache-miss case from the hash lookup, reducing stack
  allocations etc in the common case. 7) CatCTup and their corresponding
heaptuple are allocated in one
  piece.
This results in making cache lookups themselves roughly three times as 
fast - full-system benchmarks obviously improve less than that.
I've also evaluated further techniques:
- replace open coded hash with simplehash - the list walk right now
 shows up in profiles. Unfortunately it's not easy to do so safely as
 an entry's memory location can change at various times, which
 doesn't work well with the refcounting and cache invalidation.
- Cacheline-aligning CatCTup entries - helps some with performance,
 but the win isn't big and the code for it is ugly, because the
 tuples have to be freed as well.
- add more proper functions, rather than macros for
 SearchSysCacheCopyN etc., but right now they don't show up in
 profiles.
The reason the macro wrapper for syscache.c/h have to be changed, rather
than just catcache, is that doing otherwise would require exposing the
SysCache array to the outside.  That might be a good idea anyway, but
it's for another day.
Author: Andres Freund Reviewed-By: Robert Haas Discussion:
https://postgr.es/m/20170914061207.zxotvyopetm7lrrp@alap3.anarazel.de

The file was modified	src/tools/pgindent/typedefs.list
The file was modified	src/include/utils/syscache.h
The file was modified	src/backend/utils/cache/catcache.c
The file was modified	src/include/utils/catcache.h
The file was modified	src/backend/utils/cache/syscache.c

The postgis regress log looks like this:

PostgreSQL 11devel on x86_64-pc-linux-gnu, compiled by gcc (Debian 5.2.1-22) 5.2.1 20151010, 64-bit
  Postgis 2.5.0dev - r15979 - 2017-10-13 22:01:30
  scripts 2.5.0dev r15979
  GEOS: 3.7.0dev-CAPI-1.11.0 a51a06b1
  PROJ: Rel. 4.9.2, 08 September 2015
  SFCGAL: 1.3.0

Running tests

 loader/Point .............. ok 
 loader/PointM .............. ok 
 loader/PointZ .............. ok 
 loader/MultiPoint .............. ok 
 loader/MultiPointM .............. ok 
 loader/MultiPointZ .............. ok 
 loader/Arc .............. ok 
 loader/ArcM .............. ok 
 loader/ArcZ .............. ok 
 loader/Polygon .............. ok 
 loader/PolygonM .............. ok 
 loader/PolygonZ .............. ok 
 loader/TSTPolygon ......... ok 
 loader/TSIPolygon ......... ok 
 loader/TSTIPolygon ......... ok 
 loader/PointWithSchema ..... ok 
 loader/NoTransPoint ......... ok 
 loader/NotReallyMultiPoint ......... ok 
 loader/MultiToSinglePoint ......... ok 
 loader/ReprojectPts ........ ok 
 loader/ReprojectPtsGeog ........ ok 
 loader/Latin1 .... ok 
 loader/Latin1-implicit .... ok 
 loader/mfile .... ok 
 dumper/literalsrid ....... ok 
 dumper/realtable ....... ok 
 affine .. ok 
 bestsrid .. ok 
 binary .. ok 
 boundary .. ok 
 cluster .. ok 
 concave_hull .. ok 
 ctors .. ok 
 curvetoline .. ok 
 dump .. ok 
 dumppoints .. ok 
 empty .. ok 
 estimatedextent .. failed (diff expected obtained: /var/lib/jenkins/workspace/postgis/regress_pgdev/tmp/2_4_pg11.0w64/test_38_diff)
-----------------------------------------------------------------------------
--- estimatedextent_expected	2017-01-01 03:28:16.191591396 +0000
+++ /var/lib/jenkins/workspace/postgis/regress_pgdev/tmp/2_4_pg11.0w64/test_38_out	2017-10-13 22:03:16.164629222 +0000
@@ -1,41 +1,4 @@
-WARNING:  stats for "t.g" do not exist
-#877.1|
-WARNING:  stats for "t.g" do not exist
-#877.2|
-WARNING:  ST_Estimated_Extent signature was deprecated in 2.1.0. Please use ST_EstimatedExtent
-WARNING:  stats for "t.g" do not exist
-#877.2.deprecated|
-WARNING:  stats for "t.g" do not exist
-#877.3||||
-#877.4|-10.15000|20.15000|-50.40000|30.40000
-#877.5|-10.15000|20.15000|-50.40000|30.40000
-WARNING:  stats for "c1.g" do not exist
-#3391.1||||
-WARNING:  stats for "c2.g" do not exist
-#3391.2||||
-WARNING:  stats for "p.g" do not exist
-#3391.3||||
-#3391.4|0.00|1.00|0.00|1.00
-WARNING:  stats for "c2.g" do not exist
-#3391.5||||
-#3391.6|0.00|1.00|0.00|1.00
-#3391.7|0.00|1.00|0.00|1.00
-#3391.8|-1.00|0.00|-1.00|0.00
-#3391.9|-1.01|1.01|-1.01|1.01
-#3391.10|0.00|1.00|0.00|1.00
-#3391.11|-1.00|0.00|-1.00|0.00
-#3391.12|-1.01|2.02|-1.01|2.02
-WARNING:  stats for "p.g" do not exist
-#3391.13||||
-WARNING:  stats for "p.g" do not exist
-#3391.14||||
-WARNING:  stats for "c1.g" do not exist
-#3391.15||||
-WARNING:  stats for "c1.g" do not exist
-#3391.16||||
-#3391.17|0.00|1.00|0.00|1.00
-WARNING:  stats for "p.g" do not exist
-#3391.18||||
-#3391.19|0.00|1.00|0.00|1.00
-#3391.20|0.00|1.00|0.00|1.00
-NOTICE:  drop cascades to 2 other objects
+server closed the connection unexpectedly
+	This probably means the server terminated abnormally
+	before or while processing the request.
+connection to server was lost
-----------------------------------------------------------------------------
 forcecurve .. failed (diff expected obtained: /var/lib/jenkins/workspace/postgis/regress_pgdev/tmp/2_4_pg11.0w64/test_39_diff)
-----------------------------------------------------------------------------
--- forcecurve_expected	2017-01-01 03:28:18.503591901 +0000
+++ /var/lib/jenkins/workspace/postgis/regress_pgdev/tmp/2_4_pg11.0w64/test_39_out	2017-10-13 22:03:16.192628299 +0000
@@ -1,12 +1 @@

Then a lot of the rest is just the pg11 trying to get up and then crashing again

-ndovm1|{1,2,3,4,5,8}
-ndovm2|{1,2,4,6,7}
-#3777|GEOMETRYCOLLECTION EMPTY|1
-#3777|LINESTRING(0 0,0 1)|1
-#3777|POINT EMPTY|1
-#3777|POINT(0 0)|3
-#3777|POINT(0 1)|1
-#3777.1|GEOMETRYCOLLECTION EMPTY|1
-#3777.1|LINESTRING(0 0,0 1)|1
-#3777.1|POINT EMPTY|1
-#3777.1|POINT(0 0)|3
-#3777.1|POINT(0 1)|1
+psql: FATAL:  the database system is in recovery mode
-----------------------------------------------------------------------------
 orientation .. failed (diff expected obtained: /var/lib/jenkins/workspace/postgis/regress_pgdev/tmp/2_4_pg11.0w64/test_54_diff)
-----------------------------------------------------------------------------
--- orientation_expected	2017-03-17 00:52:28.667011501 +0000
+++ /var/lib/jenkins/workspace/postgis/regress_pgdev/tmp/2_4_pg11.0w64/test_54_out	2017-10-13 22:03:16.276625529 +0000
@@ -1,36 +1 @@
-1|t
-2|t
:
:
-pgcast_06|POLYGON((0 0,0 1,1 1,1 0,0 0))
+psql: FATAL:  the database system is in recovery mode
-----------------------------------------------------------------------------
 out_geography .. failed (diff expected obtained: /var/lib/jenkins/workspace/postgis/regress_pgdev/tmp/2_4_pg11.0w64/test_56_diff)
-----------------------------------------------------------------------------
--- out_geography_expected	2017-01-01 03:28:16.555591475 +0000
+++ /var/lib/jenkins/workspace/postgis/regress_pgdev/tmp/2_4_pg11.0w64/test_56_out	2017-10-13 22:03:16.284625266 +0000
@@ -1,68 +1 @@

I suspect this might be something that needs fixing upstream so we'll have to come up with an example not involving PostGIS that exercises the mess.

Change History (4)

comment:1 by robe, 7 years ago

I've taken 11 out of our 2.5 stream until the issue in 11 is resolved. If it resolves after upstream changes, the nightly PostgreSQL dev git / PostGIS test build will catch it.

comment:2 by pramsey, 7 years ago

So, the crash was coming in a SearchSysCache2(STATRELATT, ...) call.

https://github.com/postgis/postgis/blob/f1f6de48cd205e620e56aca8157080d7a34a6888/postgis/gserialized_estimate.c#L946

Which is odd because... STATRELATT hasn't been a syscache key since before PgSQL 9.0 (that's as far as the source I have available to me goes back). The valid key is STATRELATTINH and it's a 3-parameter key. So, this fix might be applicable to all our other branches too, since the 3-parameter STATRELATTINH key goes back as far as any version we support.

comment:3 by pramsey, 7 years ago

In 16175:

Change syscache stats lookup to use STATRELATTINH key and three parameters in all cases. References #3904

comment:4 by pramsey, 7 years ago

Resolution: fixed
Status: newclosed

In 16176:

Change syscache stats lookup to use STATRELATTINH key and three parameters in all cases for 2.4 branch. Closes #3904

Note: See TracTickets for help on using tickets.