Opened 8 years ago

Closed 8 years ago

#3720 closed defect (fixed)

geobuf regress error on debbie and winnie

Reported by: robe Owned by: Björn Harrtell
Priority: medium Milestone: PostGIS 2.4.0
Component: postgis Version: master
Keywords: Cc:

Description

Just noticed this on debbie. I'm rerunning to confirm it's still an issue.

 geobuf .. failed (diff expected obtained: /var/lib/jenkins/workspace/postgis/tmp/2_4_pg9.6w64/test_117_diff)
-----------------------------------------------------------------------------
--- geobuf_expected	2017-02-26 18:37:40.110686758 +0000
+++ /var/lib/jenkins/workspace/postgis/tmp/2_4_pg9.6w64/test_117_out	2017-02-26 18:41:54.722689185 +0000
@@ -3,9 +3,18 @@
 dF9mbG9hdBgAIjoKOAoICAIaBAICAgJqBgoEdGVzdGoCGAFqAiABagUKAzEuMWoJEZqZmZmZmfE/
 cgoAAAEBAgIDAwQE
 T3|GAAiEAoOCgwIBBoIAAAAAgIAAAE=
-T4|GAAiHAoaChgIBBICBAQaEAAAAAoKAAAJAgIAAgIAAAE=
+T4|GAAirwEKrAEKqQEIBBICBAQaoAH///////////8B////////////Af///////////wH/////////
+//8B////////////Af///////////wH///////////8B////////////Af///////////wH/////
+//////8B////////////Af///////////wH///////////8B////////////Af///////////wH/
+//////////8B
 T5|GAAiEAoOCgwIARoIFFA8EycTFBM=
-T6|GAAiGgoYChYIAxICAwQaDhQUFBQTKFBQExMUExMT
-T7|GAAiJgokCiIIBRIGAgEDAgUDGhZQUCcKMh0oRhMJACcoCR4ePCgTCQAU
+T6|GAAimwEKmAEKlQEIAxICAwQajAH///////////8B////////////Af///////////wH/////////
+//8B////////////Af///////////wH///////////8B////////////Af///////////wH/////
+//////8B////////////Af///////////wH///////////8B////////////AQ==
+T7|GAAi7wEK7AEK6QEIBRIGAgEDAgUDGtwB////////////Af///////////wH///////////8B////
+////////Af///////////wH///////////8B////////////Af///////////wH///////////8B
+////////////Af///////////wH///////////8B////////////Af///////////wH/////////
+//8B////////////Af///////////wH///////////8B////////////Af///////////wH/////
+//////8B////////////AQ==
 T8|GAAiGAoWChQIBiIGCAAaAggMIggIAhoECAwGCA==
 T9|EAMYACILCgkKBwgAGgMCBAY=
-----------------------------------------------------------------------------

Winnie has got other issues but that might be because I'm in middle of upgrading her GDAL among other things.

Change History (31)

comment:1 by robe, 8 years ago

Owner: changed from pramsey to Björn Harrtell

comment:2 by robe, 8 years ago

Okay still a problem. Started erroring at r15317

comment:3 by robe, 8 years ago

Summary: geobuf regress error on debbiegeobuf regress error on debbie and winnie

just checked. Winnie shows the same error. Perhaps you forgot a regress file commit?

comment:4 by Björn Harrtell, 8 years ago

Looks like I'm getting unexpected output, cause unknown but will investigate further.

comment:5 by Björn Harrtell, 8 years ago

Wierd, the expected output "GAAiHAoaChgIBBICBAQaEAAAAAoKAAAJAgIAAgIAAAE=" seems reliably produced locally and also on Debbie except for 9.6. Using 9.6 locally it's not that simple.

Last edited 8 years ago by Björn Harrtell (previous) (diff)

comment:6 by Björn Harrtell, 8 years ago

Ah, Debbie gives up on fail so what I'm seeing at Debbie are old successful builds for < 9.6.

comment:7 by robe, 8 years ago

yes.

comment:8 by Björn Harrtell, 8 years ago

Now I really need to get to know Debbie better. Must be something different to my local setup and the Travis setup. Could revert ST_AsGeobuf but if it's ok I'll leave it in while I try to reproduce the problem. If I can't reproduce/fix in a few days I'll revert.

comment:9 by Björn Harrtell, 8 years ago

One guess is that this is a protobuf related dependency problem, so I'd like to know specifically which packages/sources for protobuf stuff are being used at Debbie.

comment:10 by Björn Harrtell, 8 years ago

Might be my code but still suspect protobuf dep version. A perhaps good lead is that it looks like the problem only occurs for geometries with several linestrings/rings like the T4 and T6 tests. The T5 test passes ok.

comment:11 by robe, 8 years ago

Would it be possible to spit out the version as part of configure like we have for others or is that really tricky.

 -------------- Dependencies -------------- 
  GEOS config:          /var/lib/jenkins/workspace/geos/rel-3.7.0devw64/bin/geos-config
  GEOS version:         3.7.0dev
  GDAL config:          /var/lib/jenkins/workspace/gdal/rel-2.1w64/bin/gdal-config
  GDAL version:         2.1.3
  SFCGAL config:        /usr/bin/sfcgal-config
  SFCGAL version:       1.3.0
  PostgreSQL config:    /var/lib/jenkins/workspace/pg/rel/pg9.6w64/bin/pg_config
  PostgreSQL version:   PostgreSQL 9.6.2
  PROJ4 version:        49
  Libxml2 config:       /usr/bin/xml2-config
  Libxml2 version:      2.9.4
  JSON-C support:       yes
  protobuf-c support:   yes
  PCRE support:         yes
  PostGIS debug level:  0
  Perl:                 /usr/bin/perl

I forget which apt packages I installed on debbie - it's all a fog.

I know on winnie, since I had to compile that from scratch, I used: protobuf-2.6.1 and protobuf-c-1.1.1

Though winnie is on windows, she's exhibiting the same error, so presume it must be an issue with that combination as well.

comment:12 by robe, 8 years ago

OOps I misspoke about winnie, forgot I did twice, and used newer second time.

So for winnie I used protobuf 3.2.0 and protobuf-c 1.2.1

output from debbie:

protoc --version
libprotoc 3.0.0

How do I get the other details or is that good enough?

Last edited 8 years ago by robe (previous) (diff)

comment:13 by Björn Harrtell, 8 years ago

Good enough for now. I will look into getting version numbers out in the dependency summary. It's a bit messy as the C lib has a transitive dependency on the C++ lib. I've got libprotobuf-c 1.2.1 and libprotobuf 3.0.0 on my system so it seems you got the same at Debbie and I should look probably elsewhere to explain the problem.

comment:14 by Björn Harrtell, 8 years ago

It would be good to know if the unexpected output from the test (which looks like garbage) is identical for successive builds and/or between debbie and winnie. If random I'm probably writing outside past allocated memory, if identical I'm even more confused.

comment:15 by Björn Harrtell, 8 years ago

Ok so I got a new build from Debbie just now, still erroring. The output is:

geobuf .. failed (diff expected obtained: /var/lib/jenkins/workspace/postgis/tmp/2_4_pg9.6w64/test_117_diff)
-----------------------------------------------------------------------------
--- geobuf_expected	2017-02-26 18:37:40.110686758 +0000
+++ /var/lib/jenkins/workspace/postgis/tmp/2_4_pg9.6w64/test_117_out	2017-02-28 20:46:04.812781870 +0000
@@ -3,9 +3,18 @@
 dF9mbG9hdBgAIjoKOAoICAIaBAICAgJqBgoEdGVzdGoCGAFqAiABagUKAzEuMWoJEZqZmZmZmfE/
 cgoAAAEBAgIDAwQE
 T3|GAAiEAoOCgwIBBoIAAAAAgIAAAE=
-T4|GAAiHAoaChgIBBICBAQaEAAAAAoKAAAJAgIAAgIAAAE=
+T4|GAAirwEKrAEKqQEIBBICBAQaoAH///////////8B////////////Af///////////wH/////////
+//8B////////////Af///////////wH///////////8B////////////Af///////////wH/////
+//////8B////////////Af///////////wH///////////8B////////////Af///////////wH/
+//////////8B
 T5|GAAiEAoOCgwIARoIFFA8EycTFBM=
-T6|GAAiGgoYChYIAxICAwQaDhQUFBQTKFBQExMUExMT
-T7|GAAiJgokCiIIBRIGAgEDAgUDGhZQUCcKMh0oRhMJACcoCR4ePCgTCQAU
+T6|GAAimwEKmAEKlQEIAxICAwQajAH///////////8B////////////Af///////////wH/////////
+//8B////////////Af///////////wH///////////8B////////////Af///////////wH/////
+//////8B////////////Af///////////wH///////////8B////////////AQ==
+T7|GAAi7wEK7AEK6QEIBRIGAgEDAgUDGtwB////////////Af///////////wH///////////8B////
+////////Af///////////wH///////////8B////////////Af///////////wH///////////8B
+////////////Af///////////wH///////////8B////////////Af///////////wH/////////
+//8B////////////Af///////////wH///////////8B////////////Af///////////wH/////
+//////8B////////////AQ==
 T8|GAAiGAoWChQIBiIGCAAaAggMIggIAhoECAwGCA==
 T9|EAMYACILCgkKBwgAGgMCBAY=
-----------------------------------------------------------------------------

comment:16 by Björn Harrtell, 8 years ago

Diff result:

< +++ /var/lib/jenkins/workspace/postgis/tmp/2_4_pg9.6w64/test_117_out	2017-02-26 18:41:54.722689185 +0000
---
> +++ /var/lib/jenkins/workspace/postgis/tmp/2_4_pg9.6w64/test_117_out	2017-02-28 20:46:04.812781870 +0000

I.e. only time differs... Should mean this is no accidental memory garbage problem. Also means I have no clue what is causing this strange output. It does not even look base64 encoded.

comment:17 by Björn Harrtell, 8 years ago

I see 2_4_pg9.6w64. I thought this was Debbie builds but perhaps I'm looking at Winnie builds here? I would be less surprised if there is something strange going on in Windows-land only.

comment:18 by Björn Harrtell, 8 years ago

Installed a clean Debian Jessie (8.7.1), protobuf 3.0.2 and protobuf-c 1.2.1 then built postgis svn-trunk and ran tests. Cannot reproduce issue. :(

The only diff I can spot from the logs is that Debbie has GCC 5 and PG 9.6.2 but I use Jessie defaults (GCC 4.9.2 and PG 9.4.10). On my Debian Stretch I got GCC 6.3.0 and PG 9.6.2.

We also have passing build&test at Travis and GitLab pipeline both using slightly different systems and package versions.

I'm not sure how to proceed.

comment:19 by Björn Harrtell, 8 years ago

As there seems to be no short term solution I'm considering reverting ST_AsGeobuf.

@robe, as I haven't found a way to reproduce the issue is it possible to setup debbie/winnie to build a branch which hopefully can be fixed at some point?

comment:20 by robe, 8 years ago

Björn Harrtell,

If you can't replicate on your end, can you disable the failing test for now. I don't think it's worthwhile reverting ST_AsGeoBuf.

Unfortunately I'm tied up for a bit, probably next two weeks, so might not be able to setup a branch before then for testing. The only thing I can think of that both Debbie and Winnie have in common is they both use PostgreSQL compiled with cassert on. I don't think the zlib part is important since I think I have on on winnie.

So I use this command on debbie looks like this to build postgres

./configure --prefix=${WORKSPACE}/rel/pg${PG_VER}w${OS_BUILD} --with-pgport=${PGPORT} --enable-cassert --enable-debug --without-zlib

enable-debug and cassert picks up issues you might not normally see with memory not being properly cleaned.

strk - any chance you can try replicating. You have cassert enabled too right. So you might be seeing the same issue.

comment:21 by robe, 8 years ago

One more thing. I didn't check to see, do you have cunit tests for geobuf. cunit tests are helpful for removing PostgreSQL out of the mix as a culprit.

We have even had in past where postgres cassert logic had a bug in it that we reported upstream to fix. So I wouldn't rule out an issue with upstream.

comment:22 by Björn Harrtell, 8 years ago

Unfortunately I have not been able to make cunit tests for ST_AsGeobuf (or ST_AsMVT). To me, it would significantly increase the complexity of the implementation. I know Paul would love to have cunit tests and I hate to disappoint so this has not been an easy/lazy decision.

Thanks for the info about your build on Debbie/Winnie, I'll investigate. Memory allocation is definitely an area where I might need to check and double check the logic.

Last edited 8 years ago by Björn Harrtell (previous) (diff)

comment:23 by Björn Harrtell, 8 years ago

Reproduced the error after building and running tests with PostgreSQL 9.6.2 --enable-cassert --enable-debug.

comment:24 by Björn Harrtell, 8 years ago

Very strange. I get this wierd output on T4 which encodes a POLYGON((0 0,0 5,5 5,5 0,0 0), (1 1,1 2,2 2,2 1,1 1)). If I remove the inner ring output seems ok. If I keep the ring but force the encoding to only consider the first ring output is wierd but different than the first case. Almost as if the LWGEOM have different semantics if there are multiple rings.

comment:25 by strk, 8 years ago

Sounds like a memory error to me. Not easy to spot because of pooled memory allocation (which is why we love unit tests). Try valgrinding PostgreSQL and see what you get. It happened more than once to me that the problem was in a PG_FREE_IF_COPY(<wrong_argument_here>)

comment:26 by Björn Harrtell, 8 years ago

In 15321:

Clone LWGEOM in aggregation step
References #3720

comment:27 by Björn Harrtell, 8 years ago

I'll have to learn valgrind sometime. And get better at writing unit testable C code. In the meantime I've pinpointed the problem. Because of the nature of Geobuf I need to analyze all input geometries and encode them as the final aggregation step based on the max precision found when analyzing. So I save LWGEOM instances from lwgeom_from_gserialized in each aggregation step and it seems it the memory allocated in this process is going out of scope before the final aggregation step. I don't understand why it only manifest itself on geometries with multiple rings, and how the output can be perfectly consistent.

The problem disappears if I make a lwgeom_clone_deep before saving it in the aggregation context. This is surprising to me as it uses the exact same allocation method as the lwgeom_from_gserialized function. I'm not really happy with this fix so I'll keep looking into this issue.

It's possible I'm doing something wrong regarding the MemoryContext handling that is important in aggregration functions but I can't I understand it fully.

comment:28 by strk, 8 years ago

lwgeom_from_gserialized keeps references to the memory allocated by PG_DETOAST_DATUM, which *might* or might *not* allocate new memory depending on the size of the geometry (thus multi-ring vs. single-ring might make a difference). If you want to be sure to get a copy (assuming you switched the context before calling the macro) you can use PG_DETOAST_DATUM_COPY instead.

That said, I dunno if the whole context switching is done correctly as I see the transition function switching to the aggregate context and never switching back. See for comparison pgis_geometry_accum_transfn

comment:29 by Björn Harrtell, 8 years ago

It seems I can replace lwgeom_clone_deep with PG_DETOAST_DATUM_COPY.

I'm mostly confused about the MemoryContext, initially I tried to do something similar I've seen with the switching back and forth but could not make it work. There is something in the fundamental concept of it I do not understand.

comment:30 by Björn Harrtell, 8 years ago

In 15322:

Use PG_DETOAST_DATUM_COPY instead of clone deep
References #3720

comment:31 by Björn Harrtell, 8 years ago

Resolution: fixed
Status: newclosed
Version: 2.3.xtrunk

Closing this now as Debbie is happy.

Note: See TracTickets for help on using tickets.