Opened 3 years ago

Closed 3 years ago

#4990 closed defect (fixed)

getfacecontainingpoint fails on i386

Reported by: Bas Couwenberg Owned by: strk
Priority: medium Milestone: PostGIS 3.2.0
Component: topology Version: master
Keywords: Cc:

Description

The Debian package build for 3.2.0~alpha1 failed on i386 due to a test failure:

 ./topology/test/regress/getfacecontainingpoint .. failed (diff expected obtained: /tmp/pgis_reg/test_167_diff)
-----------------------------------------------------------------------------
--- ./topology/test/regress/getfacecontainingpoint_expected	2021-09-11 03:34:25.000000000 +0000
+++ /tmp/pgis_reg/test_167_out	2021-09-11 07:38:02.046054220 +0000
@@ -21,21 +21,4 @@
 ERROR:  Two or more faces found
 ERROR:  Two or more faces found
 ERROR:  Two or more faces found
-t1|1
-t2|1
-t3|9
-t4|2
-t5|2
-t6|0
-t7|11
-t8|4
-t9|1
-t10|10
-t11|10
-t12|10
-t13|10
-t14|1
-t15|12
-t16|15
-t17|16
-t18|19
+ERROR:  Closest point is first point of closest segment, unexpectedly
-----------------------------------------------------------------------------

Full buildlog

Attachments (1)

wrongResult.png (2.5 KB ) - added by strk 3 years ago.

Download all attachments as: .zip

Change History (31)

comment:1 by Bas Couwenberg, 3 years ago

Component: postgistopology
Owner: changed from pramsey to strk

comment:2 by strk, 3 years ago

Can you provide a way for me to access that i386 machine for testing ?

comment:3 by Bas Couwenberg, 3 years ago

Getting access to the buildd is not possible, getting access to a porterbox is possible, but not easy for non-Debian people.

You can reproduce the issue in an i386 chroot or container.

comment:4 by robe, 3 years ago

If it's a 32-bit issue, I'm surprised this didn't trigger a failure in bessie32 (freebsd 12 32-bit https://debbie.postgis.net/view/PostGIS/job/PostGIS_Worker_Run/label=bessie32/2879/consoleFull running GEOS 3.9.1 and PostgreSQL 12) or berrie (which is a 32-bit rasbian pi https://debbie.postgis.net/view/PostGIS/job/PostGIS_Worker_Run/label=berrie/2879/consoleFull running 3.7.1-CAPI-1.11.1 and PostgreSQL 12 ) ci bots.

Which version of GEOS are you running? strk does this function rely on GEOS. I didn't check. Just wanted to rule that out.

comment:5 by robe, 3 years ago

Sorry I should have looked at the logs -- so this is a

Failing build

PostgreSQL 13.4 (Debian 13.4-3) on i686-pc-linux-gnu, compiled by gcc (Debian 10.3.0-8) 10.3.0, 32-bit
  Postgis 3.2.0alpha1 - (e357e0a) - 2021-09-11 05:49:54
  scripts 3.2.0alpha1 e357e0a
  raster scripts 3.2.0alpha1 e357e0a
  GEOS: 3.9.1-CAPI-1.14.2
  PROJ: 7.2.1
  SFCGAL: 1.3.10
  GDAL: GDAL 3.3.2, released 2021/09/01

berrie (success). I'm pretty sure the 3.2.0alpha1 was success too but this is the latest run of 3.2 branch

15:58:38 PostgreSQL 12.6 on aarch64-unknown-linux-gnu, compiled by gcc (Raspbian 8.3.0-6+rpi1) 8.3.0, 32-bit
15:58:38   Postgis 3.2.0dev - (d807a0f) - 2021-09-20 19:49:37
15:58:38   scripts 3.2.0dev d807a0f
15:58:38   raster scripts 3.2.0dev d807a0f
15:58:38   GEOS: 3.7.1-CAPI-1.11.1 27a5e771
15:58:38   PROJ: Rel. 5.2.0, September 15th, 2018
15:58:38   GDAL: GDAL 2.4.0, released 2018/12/14

bessie32 (success). Also pretty sure 3.2.0alpha1 passed and latest run is passing

15:52:32 PostgreSQL 12.7 on i386-portbld-freebsd12.2, compiled by gcc10 (FreeBSD Ports Collection) 10.2.0, 32-bit
15:52:32   Postgis 3.2.0dev - (d807a0f) - 2021-09-17 21:07:45
15:52:32   scripts 3.2.0dev d807a0f
15:52:32   raster scripts 3.2.0dev d807a0f
15:52:32   GEOS: 3.9.1-CAPI-1.14.2
15:52:32   PROJ: 7.2.1
15:52:32   SFCGAL: 1.3.8
15:52:32   GDAL: GDAL 3.2.1, released 2020/12/29

comment:6 by strk, 3 years ago

Bas I don't think I can run a i386 Docker image on an x86_64 host system, right ? Do you have access on a 32bit host to run that image on and let me in ?

Regina can you install my ssh key on bessie32 for me to try to reproduce the problem there ?

ecdsa-sha2-nistp521 AAAAE2VjZHNhLXNoYTItbmlzdHA1MjEAAAAIbmlzdHA1MjEAAACFBAGrZLLAGEQ8PUopPyt/fTH0WV52z5U3yJ/Ki8TbO6q+BR2monBTz96xutXIeZZmFfoCcP9GgqyNm9+fH+cq1TdZwgAwgHdGtYhOdlM4anCzozYp2ydrWr33SWW8E/cflj8HiN2YcyKwZdVhH803g7iYu+1RDDItkpgp/JhRIK9bXHoRuA== strk@c19

comment:7 by strk, 3 years ago

The build log submitted by Bas is using GEOS-3.9.1 and PostgreSQL 13.4 (Debian 13.4-3)

comment:8 by strk, 3 years ago

The only idea I have about this problem is one of alignment. I'm using getPoint2d_cp for accessing the points, which is known to only work when the pointlist is aligned to the double boundary, which is documented to happen when the POINTARRAY comes from deserialization.

Chances are that we are NOT ensuring double alignment when creating geometries like shortestLine, and this is making the code unstable. Enabling debugging by uncommenting the POSTGIS_DEBUG_LEVEL in liblwgeom/lwgeom_topo.c should reveal this.

I can send a candidate patch but I need a way to reproduce it to test it.

comment:9 by strk, 3 years ago

Merge request https://gitlab.com/postgis/postgis/-/merge_requests/58 contains a partial change to avoid unaligned reads to at least the computed shortestLine. I think there are more possible unaligned reads BUT they could all be made against unserialized geoms which should be safe.

Bas can you test that ?

I wonder if we should provide alignment macros to ensure all our LWGEOMs are aligned no matter what... At least I say PostgreSQL source code making use of such macros for their code

comment:10 by Bas Couwenberg, 3 years ago

Bas I don't think I can run a i386 Docker image on an x86_64 host system, right ?

I don't use Docker, but their docs suggests you can:

https://docs.docker.com/desktop/multi-arch/

Also, the GitLab runner used by the Debian packaging CI is x86_64, but runs the i386 docker image just fine.

Do you have access on a 32bit host to run that image on and let me in ?

No, I only have amd64 systems with i386 chroots.

You can create those chroots on your own system:

# On a Debian system
sudo cowbuilder --create \
                --distribution=sid \
                --architecture=i386 \
                --basepath=/var/cache/pbuilder/base-sid-i386.cow

# On an Ubuntu system
sudo apt-get install debian-archive-keyring
sudo cowbuilder --create \
                --distribution=sid \
                --architecture=i386 \
                --basepath=/var/cache/pbuilder/base-sid-i386.cow \
                --mirror http://deb.debian.org/debian/ \
                --debootstrapopts "--keyring=/usr/share/keyrings/debian-archive-keyring.gpg"

# After creating the chroot
sudo cowbuilder --login \
                --basepath=/var/cache/pbuilder/base-sid-i386.cow

# Within the chroot session after login
sed -i 's/^#deb-src/deb-src/g' /etc/apt/sources.list
apt-get build-dep postgis
apt-get source postgis
cd postgis-*
<install vim/emacs/nano/etc>
<hack/test/repeat>

https://debian-gis-team.pages.debian.net/policy/packaging.html#git-pbuilder

Merge request ​https://gitlab.com/postgis/postgis/-/merge_requests/58 contains a partial change to avoid unaligned reads to at least the computed shortestLine. I think there are more possible unaligned reads BUT they could all be made against unserialized geoms which should be safe.

Still fails with those changes applied to 3.2.0-alpha1:

 ./topology/test/regress/getfacecontainingpoint .. failed (diff expected obtained: /tmp/pgis_reg/test_167_diff)
-----------------------------------------------------------------------------
--- ./topology/test/regress/getfacecontainingpoint_expected	2021-09-23 03:55:19.000000000 +0000
+++ /tmp/pgis_reg/test_167_out	2021-09-23 04:27:47.149083888 +0000
@@ -21,21 +21,4 @@
 ERROR:  Two or more faces found
 ERROR:  Two or more faces found
 ERROR:  Two or more faces found
-t1|1
-t2|1
-t3|9
-t4|2
-t5|2
-t6|0
-t7|11
-t8|4
-t9|1
-t10|10
-t11|10
-t12|10
-t13|10
-t14|1
-t15|12
-t16|15
-t17|16
-t18|19
+ERROR:  Closest point is first point of closest segment, unexpectedly
-----------------------------------------------------------------------------

https://salsa.debian.org/debian-gis-team/postgis/-/jobs/1978768/raw

https://salsa.debian.org/debian-gis-team/postgis/-/pipelines/293026

The package build on the buildd likewise failed:

https://buildd.debian.org/status/fetch.php?pkg=postgis&arch=i386&ver=3.2.0%7Ealpha1%2Bdfsg-1%7Eexp2&stamp=1632372717&raw=0

https://buildd.debian.org/status/package.php?p=postgis&suite=experimental

Last edited 3 years ago by Bas Couwenberg (previous) (diff)

comment:11 by strk, 3 years ago

I must be still missing something (on Ubuntu):

I: Running in no-targz mode
W: cgroups are not available on the host, not using them.
I: Distribution is sid.
I: Current time: Thu Sep 23 18:29:33 CEST 2021
I: pbuilder-time-stamp: 1632414573
I: Building the build environment
I: running debootstrap
/usr/sbin/debootstrap
E: specified keyring file (/usr/share/keyrings/debian-
 archive-keyring.gpg) not found
E: debootstrap failed

comment:12 by strk, 3 years ago

Forget the previous comment, I had a spurious blank on the path to the keyring

comment:13 by strk, 3 years ago

Out of space, can the chroot be resized ?

root@c19:/postgis-3.1.4+dfsg# df -h
Filesystem      Size  Used Avail Use% Mounted on
tmpfs           7.8G     0  7.8G   0% /dev/shm

comment:14 by strk, 3 years ago

I've more space under /home but trying to login into a chroot under /home fail due to attempts to create hard-link (which don't work cross-device):

cp: cannot create hard link '/var/cache/pbuilder/build/cow.29706/sbin/getty' to '/home/strk/job/2021/nibio/getFaceContainingPoint-in-C/base-sid-i386.cow/sbin/getty': Invalid cross-device link
E: Failed cowcopy.

comment:16 by Bas Couwenberg, 3 years ago

The chroot is only about 1 GB, the aptcache consumes most of the disk space.

Try adding --aptcache /home/strk/aptcache --buildplace /home/strk/build if your /var is starved of free space.

Alternatively send me your bank details and I'll wire your some money for a new disk.

comment:17 by strk, 3 years ago

Alternatively send me your bank details and I'll wire your some money for a new disk.

Funny :) I've plenty of space under /home but when I create the chroot in there I get hard-link errors upon --login

comment:18 by Bas Couwenberg, 3 years ago

Funny

I'm actually serious. Disk space for some pbuilder chroots is cheap.

This works for me on Debian unstable:

mkdir -p ~/pbuilder/{aptcache,build}
sudo cowbuilder --create \
                --distribution sid \
                --architecture i386 \
                --basepath ~/pbuilder/base-sid-i386.cow \
                --aptcache ~/pbuilder/aptcache \
                --buildplace ~/pbuilder/build
sudo cowbuilder --login \
                --basepath ~/pbuilder/base-sid-i386.cow \
                --aptcache ~/pbuilder/aptcache \
                --buildplace ~/pbuilder/build

With /home & /var (amongst others) on different partitions.

Last edited 3 years ago by Bas Couwenberg (previous) (diff)

comment:19 by strk, 3 years ago

That did it, thank you. I can reproduce now:

PostgreSQL 13.4 (Debian 13.4-3) on i686-pc-linux-gnu, compiled by gcc (Debian 10.3.0-8) 10.3.0, 32-bit
  Postgis 3.2.0dev - (0) - 2021-09-24 06:12:15
  scripts 3.2.0dev 0
  GEOS: 3.9.1-CAPI-1.14.2
  PROJ: 7.2.1

Running tests

 topology/test/regress/getfacecontainingpoint .. failed (diff expected obtained: /tmp/pgis_reg/test_1_diff)
-----------------------------------------------------------------------------
--- topology/test/regress/getfacecontainingpoint_expected       2021-09-24 06:02:21.148830927 +0000
+++ /tmp/pgis_reg/test_1_out    2021-09-24 06:21:41.275122456 +0000
@@ -21,21 +21,4 @@
 ERROR:  Two or more faces found
 ERROR:  Two or more faces found
 ERROR:  Two or more faces found
-t1|1
-t2|1
-t3|9
-t4|2
-t5|2
-t6|0
-t7|11
-t8|4
-t9|1
-t10|10
-t11|10
-t12|10
-t13|10
-t14|1
-t15|12
-t16|15
-t17|16
-t18|19
+ERROR:  Closest point is first point of closest segment, unexpectedly
-----------------------------------------------------------------------------
 uninstall .. ok (4831)

comment:20 by strk, 3 years ago

The failing test is t9, that is:

select getfacecontainingpoint('city_data', 'POINT(7.1 31)');

Here's the run with debugging on:

DEBUG:  [lwgeom_topo.c:lwt_GetFaceContainingPoint:7097] Shortest line to closest edge 26: (null)
DEBUG:  [lwgeom_topo.c:lwt_GetFaceContainingPoint:7103] Closest point on closest edge: POINT(7 31)
DEBUG:  [lwgeom_topo.c:lwt_GetFaceContainingPoint:7209] Closest point is NOT a node
DEBUG:  [lwgeom_topo.c:lwt_GetFaceContainingPoint:7236] Closest segment to edge 26 is 0
DEBUG:  [lwgeom_topo.c:lwt_GetFaceContainingPoint:7239] Closest segment to edge 26 is LINESTRING(0 5.31406e-315, 5.32539e-315 5.31406e-315)
ERROR:  Closest point is first point of closest segment, unexpectedly

The "closest segment" is clearly bogus, and it's probably because of the unaligned memory. I'm on the right track to fix that, but keep thinking we could be nice and ALWAYS align the pointarray memory to the double boundary

comment:21 by strk, 3 years ago

Wait, the "Shortest line to closest edge 26: (null)" message is actually the culprit! That's an unchecked NULL return from lwgeom_closest_line !

comment:22 by strk, 3 years ago

False alarm, it's the LWDEBUGGF macro being bogus, I'll fix that

comment:23 by strk, 3 years ago

The macro was good, the *caller* of the macro was not (and no compiler warnings by default, which we should fix -- #5002).

After fixing the the call, with [9042ee9fae5f7350cd3b2cd6798fb24e093719c7/git], this is the debugging output:

DEBUG:  [lwgeom_topo.c:lwt_GetFaceContainingPoint:7097] Shortest line to closest edge 26: LINESTRING(7 31,7.1 31)
DEBUG:  [lwgeom_topo.c:lwt_GetFaceContainingPoint:7103] Closest point on closest edge: POINT(7 31)
DEBUG:  [lwgeom_topo.c:lwt_GetFaceContainingPoint:7209] Closest point is NOT a node
DEBUG:  [lwgeom_topo.c:lwt_GetFaceContainingPoint:7236] Closest segment to edge 26 is 0
DEBUG:  [lwgeom_topo.c:lwt_GetFaceContainingPoint:7239] Closest segment to edge 26 is LINESTRING(0 5.31406e-315, 5.32539e-315 5.31406e-315)

My PR was fixing the alignment-requiring reads of shortestLine but the problem seems to be in closestSegment...

by strk, 3 years ago

Attachment: wrongResult.png added

comment:24 by strk, 3 years ago

Ok the weird looking Closest Segment coordinates were also due to bogus format string, which was fixed with [1baf9cc491087af450abd31e161ffbd81753a51c/git].

The correct debugging is now:

DEBUG:  [lwgeom_topo.c:lwt_GetFaceContainingPoint:7092] Closest edge is 26: LINESTRING(4 31,7 31,7 34,4 34,4 31)
DEBUG:  [lwgeom_topo.c:lwt_GetFaceContainingPoint:7099] Shortest line to closest edge 26: LINESTRING(7 31,7.1 31)
DEBUG:  [lwgeom_topo.c:lwt_GetFaceContainingPoint:7105] Closest point on closest edge: POINT(7 31)
DEBUG:  [lwgeom_topo.c:lwt_GetFaceContainingPoint:7211] Closest point is NOT a node
DEBUG:  [lwgeom_topo.c:lwt_GetFaceContainingPoint:7238] Closest segment on edge 26 is 1
DEBUG:  [lwgeom_topo.c:lwt_GetFaceContainingPoint:7241] Closest segment on edge 26 is LINESTRING(7 31, 7 34)
ERROR:  Closest point is first point of closest segment, unexpectedly

Here's an image of the problem case:

The debugging lines tell us that there's something wrong with reading the coordinates of the "closest segment", because segment 0 should be the horizontal one (4 31, 7 31), not the vertical one (7 31, 7 34)!

It's interesting to note that the "closest segment" is now reported to be 1 instead of 0. This may be due to ptarray_closest_segment_2d returning an uint32 as an int. Enabling full warnings might really help here..

comment:25 by strk, 3 years ago

Ok something is definitely fishy here. Adding a debug line in ptarray.c changes results so that the closest segment is reported as being 0 instead of 1. With just this patch:

diff --git a/liblwgeom/ptarray.c b/liblwgeom/ptarray.c
index 7994d60dd..fe23282ea 100644
--- a/liblwgeom/ptarray.c
+++ b/liblwgeom/ptarray.c
@@ -29,7 +29,7 @@
 #include <stdlib.h>
 
 #include "../postgis_config.h"
-/*#define POSTGIS_DEBUG_LEVEL 4*/
+#define POSTGIS_DEBUG_LEVEL 4
 #include "liblwgeom_internal.h"
 #include "lwgeom_log.h"
 
@@ -1308,6 +1308,7 @@ ptarray_closest_segment_2d(const POINTARRAY *pa, const POINT2D *qp, double *dist
                double dist_sqr;
                end = getPoint2d_cp(pa, t);
                dist_sqr = distance2d_sqr_pt_seg(qp, start, end);
+               LWDEBUGF(3, "Sqr distance from seg %d: %g", t-1, dist_sqr);
 
                if (dist_sqr < mindist)
                {

I guess it's time to valgrind the liblwgeom side of this

comment:26 by strk, 3 years ago

I've handled to get debugging output of a failing run by moving the DEBUG to *after* the less-than comparison. This means that the less-then comparison works differently IF the dist_sqr is printed in a debug or not. This very much points to some compiler optimization, using more bits in some cases for computations. Anyway, the debug shows this:

DEBUG:  [lwgeom_topo.c:lwt_GetFaceContainingPoint:7092] Closest edge 26: LINESTRING(4 31,7 31,7 34,4 34,4 31)
DEBUG:  [lwgeom_topo.c:lwt_GetFaceContainingPoint:7098] Shortest line to closest edge 26: LINESTRING(7 31,7.1 31)
DEBUG:  [lwgeom_topo.c:lwt_GetFaceContainingPoint:7104] Closest point on closest edge: POINT(7 31)
DEBUG:  [lwgeom_topo.c:lwt_GetFaceContainingPoint:7210] Closest point is NOT a node
DEBUG:  [ptarray.c:ptarray_closest_segment_2d:1316] Sqr distance from seg 0: 0.009999999999999929084504302
DEBUG:  [ptarray.c:ptarray_closest_segment_2d:1316] Sqr distance from seg 1: 0.009999999999999929084504302
DEBUG:  [lwgeom_topo.c:lwt_GetFaceContainingPoint:7237] Closest segment on edge 26 is 1

That's a 25 significant digit figure, printed as all EQUAL digit, and still finding the second number being < than the first one ! Maybe this is a place to use DBL_EPSILON macro, although I don't feel like we should be doing such things :(

Paul: is DBL_EPSILON macro yours ?

comment:27 by strk, 3 years ago

I've refactored the function to also support the case in which the *next* segment is reported, instead of the *current* when the closest point is actually the vertex connecting the two (this is what's happening, on the 32bit system, when the moon is at some given position...).

The refactored code is in https://gitlab.com/postgis/postgis/-/merge_requests/59 and seems to pass on 32bit systems too -- and btw it should also be slightly faster

comment:28 by strk, 3 years ago

NOTE: the root cause of this bug is closest_point_on_segment returning 1 instead of 0 for a query point which is colllinear to the first horizontal segment followed by a vertical segment. Such test has been pushed in commit [b6cce01cdaf6d31169d73871c374cae332f66992/git] and that test is still failing, on the 32bit system. May be worth another ticket to deal with it, if we want to consider it a bug (

Version 0, edited 3 years ago by strk (next)

comment:29 by Bas Couwenberg, 3 years ago

Confirmed fixed on i386 with the changes from MR-59:

https://salsa.debian.org/debian-gis-team/postgis/-/jobs/2009131/raw

The pipeline still failed due to #4994 however.

comment:30 by Sandro Santilli <strk@…>, 3 years ago

Resolution: fixed
Status: newclosed

In ee69242/git:

Refactor getFaceContainingPoint code

Reduce constructive LWGEOM calls and supports glitches in
closest_segment2d sometimes returning the next segment when
the query point is closest to the vertex connecting the segments.

Fixes #4990 (exception on 32bit)

Note: See TracTickets for help on using tickets.