Opened 3 years ago
Closed 3 years ago
#4990 closed defect (fixed)
getfacecontainingpoint fails on i386
Reported by: | Bas Couwenberg | Owned by: | strk |
---|---|---|---|
Priority: | medium | Milestone: | PostGIS 3.2.0 |
Component: | topology | Version: | master |
Keywords: | Cc: |
Description
The Debian package build for 3.2.0~alpha1 failed on i386 due to a test failure:
./topology/test/regress/getfacecontainingpoint .. failed (diff expected obtained: /tmp/pgis_reg/test_167_diff) ----------------------------------------------------------------------------- --- ./topology/test/regress/getfacecontainingpoint_expected 2021-09-11 03:34:25.000000000 +0000 +++ /tmp/pgis_reg/test_167_out 2021-09-11 07:38:02.046054220 +0000 @@ -21,21 +21,4 @@ ERROR: Two or more faces found ERROR: Two or more faces found ERROR: Two or more faces found -t1|1 -t2|1 -t3|9 -t4|2 -t5|2 -t6|0 -t7|11 -t8|4 -t9|1 -t10|10 -t11|10 -t12|10 -t13|10 -t14|1 -t15|12 -t16|15 -t17|16 -t18|19 +ERROR: Closest point is first point of closest segment, unexpectedly -----------------------------------------------------------------------------
Attachments (1)
Change History (31)
comment:1 by , 3 years ago
Component: | postgis → topology |
---|---|
Owner: | changed from | to
comment:2 by , 3 years ago
comment:3 by , 3 years ago
Getting access to the buildd is not possible, getting access to a porterbox is possible, but not easy for non-Debian people.
You can reproduce the issue in an i386 chroot or container.
comment:4 by , 3 years ago
If it's a 32-bit issue, I'm surprised this didn't trigger a failure in bessie32 (freebsd 12 32-bit https://debbie.postgis.net/view/PostGIS/job/PostGIS_Worker_Run/label=bessie32/2879/consoleFull running GEOS 3.9.1 and PostgreSQL 12) or berrie (which is a 32-bit rasbian pi https://debbie.postgis.net/view/PostGIS/job/PostGIS_Worker_Run/label=berrie/2879/consoleFull running 3.7.1-CAPI-1.11.1 and PostgreSQL 12 ) ci bots.
Which version of GEOS are you running? strk does this function rely on GEOS. I didn't check. Just wanted to rule that out.
comment:5 by , 3 years ago
Sorry I should have looked at the logs -- so this is a
Failing build
PostgreSQL 13.4 (Debian 13.4-3) on i686-pc-linux-gnu, compiled by gcc (Debian 10.3.0-8) 10.3.0, 32-bit Postgis 3.2.0alpha1 - (e357e0a) - 2021-09-11 05:49:54 scripts 3.2.0alpha1 e357e0a raster scripts 3.2.0alpha1 e357e0a GEOS: 3.9.1-CAPI-1.14.2 PROJ: 7.2.1 SFCGAL: 1.3.10 GDAL: GDAL 3.3.2, released 2021/09/01
berrie (success). I'm pretty sure the 3.2.0alpha1 was success too but this is the latest run of 3.2 branch
15:58:38 PostgreSQL 12.6 on aarch64-unknown-linux-gnu, compiled by gcc (Raspbian 8.3.0-6+rpi1) 8.3.0, 32-bit 15:58:38 Postgis 3.2.0dev - (d807a0f) - 2021-09-20 19:49:37 15:58:38 scripts 3.2.0dev d807a0f 15:58:38 raster scripts 3.2.0dev d807a0f 15:58:38 GEOS: 3.7.1-CAPI-1.11.1 27a5e771 15:58:38 PROJ: Rel. 5.2.0, September 15th, 2018 15:58:38 GDAL: GDAL 2.4.0, released 2018/12/14
bessie32 (success). Also pretty sure 3.2.0alpha1 passed and latest run is passing
15:52:32 PostgreSQL 12.7 on i386-portbld-freebsd12.2, compiled by gcc10 (FreeBSD Ports Collection) 10.2.0, 32-bit 15:52:32 Postgis 3.2.0dev - (d807a0f) - 2021-09-17 21:07:45 15:52:32 scripts 3.2.0dev d807a0f 15:52:32 raster scripts 3.2.0dev d807a0f 15:52:32 GEOS: 3.9.1-CAPI-1.14.2 15:52:32 PROJ: 7.2.1 15:52:32 SFCGAL: 1.3.8 15:52:32 GDAL: GDAL 3.2.1, released 2020/12/29
comment:6 by , 3 years ago
Bas I don't think I can run a i386 Docker image on an x86_64 host system, right ? Do you have access on a 32bit host to run that image on and let me in ?
Regina can you install my ssh key on bessie32 for me to try to reproduce the problem there ?
ecdsa-sha2-nistp521 AAAAE2VjZHNhLXNoYTItbmlzdHA1MjEAAAAIbmlzdHA1MjEAAACFBAGrZLLAGEQ8PUopPyt/fTH0WV52z5U3yJ/Ki8TbO6q+BR2monBTz96xutXIeZZmFfoCcP9GgqyNm9+fH+cq1TdZwgAwgHdGtYhOdlM4anCzozYp2ydrWr33SWW8E/cflj8HiN2YcyKwZdVhH803g7iYu+1RDDItkpgp/JhRIK9bXHoRuA== strk@c19
comment:7 by , 3 years ago
The build log submitted by Bas is using GEOS-3.9.1 and PostgreSQL 13.4 (Debian 13.4-3)
comment:8 by , 3 years ago
The only idea I have about this problem is one of alignment. I'm using getPoint2d_cp
for accessing the points, which is known to only work when the pointlist is aligned to the double boundary, which is documented to happen when the POINTARRAY comes from deserialization.
Chances are that we are NOT ensuring double alignment when creating geometries like shortestLine, and this is making the code unstable. Enabling debugging by uncommenting the POSTGIS_DEBUG_LEVEL in liblwgeom/lwgeom_topo.c should reveal this.
I can send a candidate patch but I need a way to reproduce it to test it.
comment:9 by , 3 years ago
Merge request https://gitlab.com/postgis/postgis/-/merge_requests/58 contains a partial change to avoid unaligned reads to at least the computed shortestLine. I think there are more possible unaligned reads BUT they could all be made against unserialized geoms which should be safe.
Bas can you test that ?
I wonder if we should provide alignment macros to ensure all our LWGEOMs are aligned no matter what... At least I say PostgreSQL source code making use of such macros for their code
comment:10 by , 3 years ago
Bas I don't think I can run a i386 Docker image on an x86_64 host system, right ?
I don't use Docker, but their docs suggests you can:
Also, the GitLab runner used by the Debian packaging CI is x86_64, but runs the i386 docker image just fine.
Do you have access on a 32bit host to run that image on and let me in ?
No, I only have amd64 systems with i386 chroots.
You can create those chroots on your own system:
# On a Debian system sudo cowbuilder --create \ --distribution=sid \ --architecture=i386 \ --basepath=/var/cache/pbuilder/base-sid-i386.cow # On an Ubuntu system sudo apt-get install debian-archive-keyring sudo cowbuilder --create \ --distribution=sid \ --architecture=i386 \ --basepath=/var/cache/pbuilder/base-sid-i386.cow \ --mirror http://deb.debian.org/debian/ \ --debootstrapopts "--keyring=/usr/share/keyrings/debian-archive-keyring.gpg" # After creating the chroot sudo cowbuilder --login \ --basepath=/var/cache/pbuilder/base-sid-i386.cow # Within the chroot session after login sed -i 's/^#deb-src/deb-src/g' /etc/apt/sources.list apt-get build-dep postgis apt-get source postgis cd postgis-* <install vim/emacs/nano/etc> <hack/test/repeat>
https://debian-gis-team.pages.debian.net/policy/packaging.html#git-pbuilder
Merge request https://gitlab.com/postgis/postgis/-/merge_requests/58 contains a partial change to avoid unaligned reads to at least the computed shortestLine. I think there are more possible unaligned reads BUT they could all be made against unserialized geoms which should be safe.
Still fails with those changes applied to 3.2.0-alpha1:
./topology/test/regress/getfacecontainingpoint .. failed (diff expected obtained: /tmp/pgis_reg/test_167_diff) ----------------------------------------------------------------------------- --- ./topology/test/regress/getfacecontainingpoint_expected 2021-09-23 03:55:19.000000000 +0000 +++ /tmp/pgis_reg/test_167_out 2021-09-23 04:27:47.149083888 +0000 @@ -21,21 +21,4 @@ ERROR: Two or more faces found ERROR: Two or more faces found ERROR: Two or more faces found -t1|1 -t2|1 -t3|9 -t4|2 -t5|2 -t6|0 -t7|11 -t8|4 -t9|1 -t10|10 -t11|10 -t12|10 -t13|10 -t14|1 -t15|12 -t16|15 -t17|16 -t18|19 +ERROR: Closest point is first point of closest segment, unexpectedly -----------------------------------------------------------------------------
https://salsa.debian.org/debian-gis-team/postgis/-/jobs/1978768/raw
https://salsa.debian.org/debian-gis-team/postgis/-/pipelines/293026
The package build on the buildd likewise failed:
https://buildd.debian.org/status/package.php?p=postgis&suite=experimental
comment:11 by , 3 years ago
I must be still missing something (on Ubuntu):
I: Running in no-targz mode W: cgroups are not available on the host, not using them. I: Distribution is sid. I: Current time: Thu Sep 23 18:29:33 CEST 2021 I: pbuilder-time-stamp: 1632414573 I: Building the build environment I: running debootstrap /usr/sbin/debootstrap E: specified keyring file (/usr/share/keyrings/debian- archive-keyring.gpg) not found E: debootstrap failed
comment:12 by , 3 years ago
Forget the previous comment, I had a spurious blank on the path to the keyring
comment:13 by , 3 years ago
Out of space, can the chroot be resized ?
root@c19:/postgis-3.1.4+dfsg# df -h Filesystem Size Used Avail Use% Mounted on tmpfs 7.8G 0 7.8G 0% /dev/shm
comment:14 by , 3 years ago
I've more space under /home but trying to login into a chroot under /home fail due to attempts to create hard-link (which don't work cross-device):
cp: cannot create hard link '/var/cache/pbuilder/build/cow.29706/sbin/getty' to '/home/strk/job/2021/nibio/getFaceContainingPoint-in-C/base-sid-i386.cow/sbin/getty': Invalid cross-device link E: Failed cowcopy.
comment:15 by , 3 years ago
comment:16 by , 3 years ago
The chroot is only about 1 GB, the aptcache consumes most of the disk space.
Try adding --aptcache /home/strk/aptcache --buildplace /home/strk/build
if your /var
is starved of free space.
Alternatively send me your bank details and I'll wire your some money for a new disk.
comment:17 by , 3 years ago
Alternatively send me your bank details and I'll wire your some money for a new disk.
Funny :) I've plenty of space under /home but when I create the chroot in there I get hard-link errors upon --login
comment:18 by , 3 years ago
Funny
I'm actually serious. Disk space for some pbuilder chroots is cheap.
This works for me on Debian unstable:
mkdir -p ~/pbuilder/{aptcache,build} sudo cowbuilder --create \ --distribution sid \ --architecture i386 \ --basepath ~/pbuilder/base-sid-i386.cow \ --aptcache ~/pbuilder/aptcache \ --buildplace ~/pbuilder/build sudo cowbuilder --login \ --basepath ~/pbuilder/base-sid-i386.cow \ --aptcache ~/pbuilder/aptcache \ --buildplace ~/pbuilder/build
With /home
& /var
(amongst others) on different partitions.
comment:19 by , 3 years ago
That did it, thank you. I can reproduce now:
PostgreSQL 13.4 (Debian 13.4-3) on i686-pc-linux-gnu, compiled by gcc (Debian 10.3.0-8) 10.3.0, 32-bit Postgis 3.2.0dev - (0) - 2021-09-24 06:12:15 scripts 3.2.0dev 0 GEOS: 3.9.1-CAPI-1.14.2 PROJ: 7.2.1 Running tests topology/test/regress/getfacecontainingpoint .. failed (diff expected obtained: /tmp/pgis_reg/test_1_diff) ----------------------------------------------------------------------------- --- topology/test/regress/getfacecontainingpoint_expected 2021-09-24 06:02:21.148830927 +0000 +++ /tmp/pgis_reg/test_1_out 2021-09-24 06:21:41.275122456 +0000 @@ -21,21 +21,4 @@ ERROR: Two or more faces found ERROR: Two or more faces found ERROR: Two or more faces found -t1|1 -t2|1 -t3|9 -t4|2 -t5|2 -t6|0 -t7|11 -t8|4 -t9|1 -t10|10 -t11|10 -t12|10 -t13|10 -t14|1 -t15|12 -t16|15 -t17|16 -t18|19 +ERROR: Closest point is first point of closest segment, unexpectedly ----------------------------------------------------------------------------- uninstall .. ok (4831)
comment:20 by , 3 years ago
The failing test is t9, that is:
select getfacecontainingpoint('city_data', 'POINT(7.1 31)');
Here's the run with debugging on:
DEBUG: [lwgeom_topo.c:lwt_GetFaceContainingPoint:7097] Shortest line to closest edge 26: (null) DEBUG: [lwgeom_topo.c:lwt_GetFaceContainingPoint:7103] Closest point on closest edge: POINT(7 31) DEBUG: [lwgeom_topo.c:lwt_GetFaceContainingPoint:7209] Closest point is NOT a node DEBUG: [lwgeom_topo.c:lwt_GetFaceContainingPoint:7236] Closest segment to edge 26 is 0 DEBUG: [lwgeom_topo.c:lwt_GetFaceContainingPoint:7239] Closest segment to edge 26 is LINESTRING(0 5.31406e-315, 5.32539e-315 5.31406e-315) ERROR: Closest point is first point of closest segment, unexpectedly
The "closest segment" is clearly bogus, and it's probably because of the unaligned memory. I'm on the right track to fix that, but keep thinking we could be nice and ALWAYS align the pointarray memory to the double boundary
comment:21 by , 3 years ago
Wait, the "Shortest line to closest edge 26: (null)" message is actually the culprit! That's an unchecked NULL return from lwgeom_closest_line !
comment:23 by , 3 years ago
The macro was good, the *caller* of the macro was not (and no compiler warnings by default, which we should fix -- #5002).
After fixing the the call, with [9042ee9fae5f7350cd3b2cd6798fb24e093719c7/git], this is the debugging output:
DEBUG: [lwgeom_topo.c:lwt_GetFaceContainingPoint:7097] Shortest line to closest edge 26: LINESTRING(7 31,7.1 31) DEBUG: [lwgeom_topo.c:lwt_GetFaceContainingPoint:7103] Closest point on closest edge: POINT(7 31) DEBUG: [lwgeom_topo.c:lwt_GetFaceContainingPoint:7209] Closest point is NOT a node DEBUG: [lwgeom_topo.c:lwt_GetFaceContainingPoint:7236] Closest segment to edge 26 is 0 DEBUG: [lwgeom_topo.c:lwt_GetFaceContainingPoint:7239] Closest segment to edge 26 is LINESTRING(0 5.31406e-315, 5.32539e-315 5.31406e-315)
My PR was fixing the alignment-requiring reads of shortestLine but the problem seems to be in closestSegment...
by , 3 years ago
Attachment: | wrongResult.png added |
---|
comment:24 by , 3 years ago
Ok the weird looking Closest Segment coordinates were also due to bogus format string, which was fixed with [1baf9cc491087af450abd31e161ffbd81753a51c/git].
The correct debugging is now:
DEBUG: [lwgeom_topo.c:lwt_GetFaceContainingPoint:7092] Closest edge is 26: LINESTRING(4 31,7 31,7 34,4 34,4 31) DEBUG: [lwgeom_topo.c:lwt_GetFaceContainingPoint:7099] Shortest line to closest edge 26: LINESTRING(7 31,7.1 31) DEBUG: [lwgeom_topo.c:lwt_GetFaceContainingPoint:7105] Closest point on closest edge: POINT(7 31) DEBUG: [lwgeom_topo.c:lwt_GetFaceContainingPoint:7211] Closest point is NOT a node DEBUG: [lwgeom_topo.c:lwt_GetFaceContainingPoint:7238] Closest segment on edge 26 is 1 DEBUG: [lwgeom_topo.c:lwt_GetFaceContainingPoint:7241] Closest segment on edge 26 is LINESTRING(7 31, 7 34) ERROR: Closest point is first point of closest segment, unexpectedly
Here's an image of the problem case:
The debugging lines tell us that there's something wrong with reading the coordinates of the "closest segment", because segment 0 should be the horizontal one (4 31, 7 31), not the vertical one (7 31, 7 34)!
It's interesting to note that the "closest segment" is now reported to be 1 instead of 0. This may be due to ptarray_closest_segment_2d returning an uint32 as an int. Enabling full warnings might really help here..
comment:25 by , 3 years ago
Ok something is definitely fishy here. Adding a debug line in ptarray.c changes results so that the closest segment is reported as being 0 instead of 1. With just this patch:
diff --git a/liblwgeom/ptarray.c b/liblwgeom/ptarray.c index 7994d60dd..fe23282ea 100644 --- a/liblwgeom/ptarray.c +++ b/liblwgeom/ptarray.c @@ -29,7 +29,7 @@ #include <stdlib.h> #include "../postgis_config.h" -/*#define POSTGIS_DEBUG_LEVEL 4*/ +#define POSTGIS_DEBUG_LEVEL 4 #include "liblwgeom_internal.h" #include "lwgeom_log.h" @@ -1308,6 +1308,7 @@ ptarray_closest_segment_2d(const POINTARRAY *pa, const POINT2D *qp, double *dist double dist_sqr; end = getPoint2d_cp(pa, t); dist_sqr = distance2d_sqr_pt_seg(qp, start, end); + LWDEBUGF(3, "Sqr distance from seg %d: %g", t-1, dist_sqr); if (dist_sqr < mindist) {
I guess it's time to valgrind the liblwgeom side of this
comment:26 by , 3 years ago
I've handled to get debugging output of a failing run by moving the DEBUG to *after* the less-than comparison. This means that the less-then comparison works differently IF the dist_sqr
is printed in a debug or not. This very much points to some compiler optimization, using more bits in some cases for computations. Anyway, the debug shows this:
DEBUG: [lwgeom_topo.c:lwt_GetFaceContainingPoint:7092] Closest edge 26: LINESTRING(4 31,7 31,7 34,4 34,4 31) DEBUG: [lwgeom_topo.c:lwt_GetFaceContainingPoint:7098] Shortest line to closest edge 26: LINESTRING(7 31,7.1 31) DEBUG: [lwgeom_topo.c:lwt_GetFaceContainingPoint:7104] Closest point on closest edge: POINT(7 31) DEBUG: [lwgeom_topo.c:lwt_GetFaceContainingPoint:7210] Closest point is NOT a node DEBUG: [ptarray.c:ptarray_closest_segment_2d:1316] Sqr distance from seg 0: 0.009999999999999929084504302 DEBUG: [ptarray.c:ptarray_closest_segment_2d:1316] Sqr distance from seg 1: 0.009999999999999929084504302 DEBUG: [lwgeom_topo.c:lwt_GetFaceContainingPoint:7237] Closest segment on edge 26 is 1
That's a 25 significant digit figure, printed as all EQUAL digit, and still finding the second number being < than the first one ! Maybe this is a place to use DBL_EPSILON macro, although I don't feel like we should be doing such things :(
Paul: is DBL_EPSILON macro yours ?
comment:27 by , 3 years ago
I've refactored the function to also support the case in which the *next* segment is reported, instead of the *current* when the closest point is actually the vertex connecting the two (this is what's happening, on the 32bit system, when the moon is at some given position...).
The refactored code is in https://gitlab.com/postgis/postgis/-/merge_requests/59 and seems to pass on 32bit systems too -- and btw it should also be slightly faster
comment:28 by , 3 years ago
NOTE: the root cause of this bug is closest_point_on_segment returning 1 instead of 0 for a query point which is colllinear to the first horizontal segment followed by a vertical segment. Such test has been pushed in commit [b6cce01cdaf6d31169d73871c374cae332f66992/git] and that test is still failing, on the 32bit system. May be worth another ticket to deal with it, if we want to consider it a bug
comment:29 by , 3 years ago
Confirmed fixed on i386 with the changes from MR-59:
https://salsa.debian.org/debian-gis-team/postgis/-/jobs/2009131/raw
The pipeline still failed due to #4994 however.
Can you provide a way for me to access that i386 machine for testing ?