Opened 5 years ago
Last modified 16 months ago
#4684 new defect
concurrent topology construction routine may result in invalid topology
Reported by: | Lars Aksel Opsahl | Owned by: | strk |
---|---|---|---|
Priority: | medium | Milestone: | PostGIS Fund Me |
Component: | topology | Version: | 3.0.x |
Keywords: | reliability | Cc: |
Description
When using topology.TopoGeo_addLinestring we some times end with invalid topology.
It's seem to happen more often when running i multithread environment than in single threaded environment. Running the same code on the same data set may produce different results each time. The result of this are lost surface or that two surfaces are merged together as one. The edges are in most cases represented in the the edge table the data about left_face and right_face wrong. I many case we right_face and left_face are equal and different from 0, that means an dangling egde, but it's not because we can construct a Polygon based on the edges inside the face where spatially located.
There is more detailed information here. https://github.com/larsop/resolve-overlap-and-gap/issues/7
There is also an related bug here https://trac.osgeo.org/postgis/ticket/4681#ticket
Attachments (2)
Change History (38)
comment:1 by , 5 years ago
comment:2 by , 5 years ago
So far I have only tested in a multi thread environment .
If I read you correct I have to be able produce an invalid Topology by running in a single thread ?
Can I run in partly in multiple threads by doing it this way ?
1) Break the input up into many grid cell 2) Insert lines lines that are inside each cell and with min a distance like tolerance to cell border 3) Insert the rest of the lines in a single thread
Lars
comment:3 by , 4 years ago
If the broken input NEVER intersects the grid cell borders then your solution *should* work, based on the fact that NO RING (and thus no face) would ever cross a cell.
Just make sure the topology is valid BEFORE step 3.
comment:4 by , 4 years ago
I have been testing a quite bit on this branch https://github.com/larsop/resolve-overlap-and-gap/tree/test_not_cut_cellborder_crossing_edges_issues_7 where I follow the the steps above and I have not been able to make any with faces with null geometries running step 3 in single thread.
So it's seems to be correct what you are saying here.
Just to confirm this I this I changed step 3 to run parallel https://github.com/larsop/resolve-overlap-and-gap/commit/57ef8b9db710da6c8c023d60b8b617f202502e4b and then I got 67 faces with null geo of 308413 faces.
I also tried to creating blocking area based on mbr in the face table https://github.com/larsop/resolve-overlap-and-gap/commit/b564bdb6d3e77cf064e11cb4418fc41ab5fcddf3 but that did not help very much.
comment:5 by , 4 years ago
Component: | postgis → topology |
---|---|
Owner: | changed from | to
Summary: | Topology construction routine results invalid topology → concurrent topology construction routine may result in invalid topology |
There might be a way, using PostgreSQL locks, to reduce the chances (or completely avoid) that such invalidities are introduced in the topology. Serializing access to the topology primitives could be overkill becuase it would prevent the 3 steps approach above. What we want to avoid is that more than ONE transaction at the time works on the same edges. Row level locks might help with that: https://www.postgresql.org/docs/10/explicit-locking.html#LOCKING-ROWS
comment:6 by , 4 years ago
This sounds like a good idea.
So if I have a set of lines to add, how do I find the edge row and maybe face rows to lock ?
(I have tried to block other threads based on input geometry and faces, but I never got it to work 100%, but if remove this blocking totally in step 3 https://github.com/larsop/resolve-overlap-and-gap/commit/874f843c58921c5e2fca18240aef60f15bfc6b58#diff-0ec63b6826e9888305fd67d50bf428e1totally the server core dumps quite fast)
comment:7 by , 4 years ago
So if I have a set of lines to add, how do I find the edge row and maybe face rows to lock ?
Not easy to tell. When adding ONE linestring, records that will possibly be updated are:
[topo.edge_data]
- edges that will intersect the line (easy to get)
- edges of any face that will be affected by the insertion (hard to find)
[topo.face]
- faces that will be split by the insertion (hard to find)
[topo.node]
- isolated nodes that are within tolerance distance from the line
A pessimistic approach might lock:
- EVERY FACE whos MBR intersects the input line
- EVERY EDGE having any of those faces on its right or left side
- EVERY ISOLATED NODE within tolerance distance from the input line
comment:9 by , 4 years ago
I have been testing the pessimistic approach using this guide line from Sandro
EVERY FACE whos MBR intersects the input line, EVERY EDGE having any of those faces on its right or left side, EVERY ISOLATED NODE within tolerance distance from the input line ,
On this branch https://github.com/larsop/resolve-overlap-and-gap/tree/trac_osgeo_org_postgis_ticket_4684 and what ever I tried I ended up with st_getFaceGeometry that was null.
Here I https://github.com/larsop/resolve-overlap-and-gap/commit/7d71dbe9a71f099d2c4a8ec116ba49f7e749dcfb tested with row-level and no change in blocking area and we get null geometry and it seems to increase with the number of parallel threads.
Here I https://github.com/larsop/resolve-overlap-and-gap/commit/8ef48c02587c5aff0f60e4a05339ca82619f9446 extend the blocking area and I also did row level lock and that seems to generate less errors, but I still get cases where st_getFaceGeometry is null, but fewer.
It seems like more deadlocks are increasing the cases off null geometries from st_getFaceGeometry.
I will do some more checking around deadlocks and test rollback when that happens
comment:10 by , 4 years ago
Here https://github.com/larsop/resolve-overlap-and-gap/commit/dc81279b969b7da75848b1e53f79729beb746469 I did a test where I catch deadlock and does rollback, but I still got cases where st_getFaceGeometry was null.
comment:11 by , 4 years ago
When https://github.com/larsop/resolve-overlap-and-gap/commit/d6eccb1c5bd7a1345a81eea65707bb179eab6355 testing with bigger blocking area and also use this area blocking more rows in edge and face tables, I had case where I got no deadlocks , but still got cases where st_getFaceGeometry was null.
comment:12 by , 4 years ago
It looks like my pessimistic approach was not pessimistic enough, as it forgot to deal with the hard step of finding:
- edges of any face that will be affected by the insertion (hard to find)
Let's consider this simple starting condition:
- Topology has a single face, bound by a single closed edge
- A second edge goes around the existing face but is NOT closed
____ / \ / \ / o \ E2 / / \ \ / / \ E1 \ / / \ \ / / F1 \ \ / / \ \ / `---------' \ / F0 \ / \ `----o o--------'
Now, two separate transactions want to add each a new edge. Let's call them E3 and E4:
____ / \ / \ / o \ E2 / / \ \ / / \ E1 \ / / \ \ / / F1 \ \ / / \ \ / `---------' \ / F0 \ / \ `----o---(E3)---o--------' | | `---(E4)---'
Both transactions will want to update E2 labeling and edge linking but they will be conflicting. Shall E2 be connected to E3 or E4 ? Whichever transaction is committed last will decide, leaving effectively the topology in an invalid state.
Scenarios like the above should be used for testing row-level locking. Eventually we might want to implement row-level locking in PostGIS Topology core.
How to solve the above ? We need to ALSO lock E2, but neither E3 nor E4 overlap any faces bound by E2, so the recipe given before won't catch this case.
E3 intersects E2 though, so that's another thing to check:
- Lock all edges intersecting the incoming input line
- Lock all faces found on both sides of the above edges
- Lock all edges having any of the above faces on their side
We need manageable data to make this ticket workable so please if you have a way use scenarios like this one.
Note I filed https://github.com/larsop/resolve-overlap-and-gap/issues/16 to discuss the code you're working on to try these locking strategies.
by , 4 years ago
Attachment: | db_topo_rowlevel_lock.zip added |
---|
A file to show how row level lock can be used to used avoid topo errors
comment:13 by , 4 years ago
A test case showing how to use row level lock to avoid Topology errors .
Info from the db_topo_rowlevel_lock_readme.txt in file db_topo_rowlevel_lock.zip.
# Create a empty database
psql postgres -c'create database t1'
# Add the code needed for testing
t1 -f db_topo_rowlevel_lock_init.sql
# Test with no rowlevel locking, should return something like this
psql t1 -f db_topo_rowlevel_lock_test_fail.sql ?column? | error | id1 | id2 ------------+------------------+-----+----- validation | face within face | 1 | 2 validation | face within face | 2 | 1 validation | face within face | 1 | 3 validation | face within face | 2 | 3 (4 rows)
# Test with rowlevel locking , no topop errors should be found
psql t1 -f db_topo_rowlevel_lock_test_ok.sql
comment:14 by , 4 years ago
https://trac.osgeo.org/postgis/attachment/ticket/4684/db_topo_bug_4684.zip was not showing bug in Postgis Topology bug it was bug was related to missing primary key , I have now been running 60 interations with no errors. The new correct SQL is uploaded.
So the only problem we here are when multiple threads working with related spatial areas, which are expected.
comment:15 by , 4 years ago
Milestone: | → PostGIS 3.0.2 |
---|
comment:16 by , 4 years ago
Milestone: | PostGIS 3.0.2 → PostGIS 3.0.3 |
---|
comment:17 by , 4 years ago
Milestone: | PostGIS 3.0.3 → PostGIS 3.0.4 |
---|
comment:18 by , 4 years ago
While thinking about how to include row-level locking in core I found another case of parallel insertion of objects creating invalid topology which cannot be resolved by row-level locking primitive tables. It is the case of parallel insertion of a face and an isolated node or line insode the face created by the other process. Simplest scenario:
- Process1 inserts a closed line in the empty topology
- Process2 inserts a point inside the closed line (non visible by its transaction)
Using SQL:
- P0: select createtopology('isoinface');
- P1: begin; select topogeo_addpoint('isoinface', 'POINT(5 5)');
- P2: begin: select topogeo_addlinestring('isoinface', 'LINESTRING(0 0,0 10,10 10,10 0,0 0)');
- P1: commit;
- P2: commit;
At the end of the above process, we'll have th isoinface
topology containing 1 face, 1 edge and 2 nodes. The ISOLATED node (the one inserted by P1) will have a wrong containing_face
value:
=# select n.node_id, st_astext(n.geom) g, n.containing_face, f.face_id computed_containing_face FROM isoinface.node n left outer join isoinface.face f ON (f.face_id != 0 and ST_Contains(ST_GetFaceGeometry('isoinface', f.face_id), n.geom)); node_id | g | containing_face | computed_containing_face ---------+------------+-----------------+-------------------------- 1 | POINT(5 5) | 0 | 1 2 | POINT(0 0) | | (2 rows)
This kind of invalidity is not currently detected by ValidateTopology function (see #3233)
comment:19 by , 4 years ago
Same kind of invalidity (wrong containment info) happens for isolated edges:
Using SQL:
- P0: select createtopology('isoinface');
- P1: begin; select topogeo_addpoint('isoinface', 'LINESTRING(4 5, 6 5)');
- P2: begin: select topogeo_addlinestring('isoinface', 'LINESTRING(0 0,0 10,10 10,10 0,0 0)');
- P1: commit;
- P2: commit;
At the end of the above process, we'll have th isoinface
topology containing 1 face, 2 edge and 3 nodes. The ISOLATED edge (the one inserted by P1) will have a wrong left_face
and right_face
values:
=# select e.edge_id, st_astext(e.geom) g, e.left_face, e.right_face, f.face_id as containing_face from isoinface.edge e left outer join isoinface.face f on (f.face_id != 0 and ST_Contains(ST_GetFaceGeometry('isoinface', f.face_id), e.geom)); edge_id | g | left_face | right_face | containing_face ---------+-------------------------------------+-----------+------------+----------------- 2 | LINESTRING(0 0,0 10,10 10,10 0,0 0) | 0 | 2 | 3 | LINESTRING(4 4,6 6) | 0 | 0 | 2 (2 rows)
This kind of invalidity is also not currently detected by ValidateTopology function (see #4830)
comment:20 by , 4 years ago
For the two cases above there's no row to be locked because the problem is with rows yet to be visible by the other process. The only way to ensure objects in an area would be visible by other transaction would be to acquire a lock on an *area*, reguardless of objects already existing in the area. We could think of a special "spatial_locks" table to be created in each topology where each process would insert locks for "areas", obtainable only for areas not intersecting any other already locked area.
comment:21 by , 4 years ago
If we go the "locking area" approach, the problem of figuring how the extent of the area to be locked remains, as such area would be affected by existing primitives in the topology (ie: adding a small edge closing the last gap in a large ring)
comment:22 by , 4 years ago
I had big problems to get spatial locks that worked with out errors and OK performance so that's way I ended up using grid on grid where I added lines crossing cell borders in a controlled way by merging grid cells and at end I merged the last cells in a single thread. But this solution where add cell crossing lines as final step only works when pumping data into a empty layer. When doing locks I tried use a meta table where each thread updated the spatial to be effected,
Yes so I agree with you that we need a kind of spatial lock for updating.
comment:23 by , 4 years ago
In order to have a clearer picture of what's going wrong, I'v efixed #3233 (ValidateTopology check on node's containing_face) - This only affects topologies with isolated nodes, so not necessarely something you can test, but you could maybe run some performance comparison as I bet ValidateTopology will now be much slower.
comment:24 by , 4 years ago
I've now also committed a fix for #4830 (ValidateTopology check on edge left/right faces)
comment:25 by , 3 years ago
I'm thinking for a start we could make it so that every routine possibly editing the topology would lock the topology.topology record for UPDATE. It would be the safest bet for a start.
comment:26 by , 3 years ago
Merge request https://gitlab.com/postgis/postgis/-/merge_requests/49 stubs a test for one case of concurrent editing topology corruption
comment:27 by , 3 years ago
One thing to consider would be to expose an explicit topology locking routine, so that users who know they are going to perform multiple edits in a single transaction could obtain the lock once, rather than relying on internal locking, which would need to be checked once for each and every single edit (consider adding a single line may trigger multiple inserts and updates)
comment:28 by , 3 years ago
I've pushed a script to see the effects of concurrent topology population. Interesting, in some cases you also get deadlocks, like in this output:
$ ./check_multiprocess_topology_population.sh ERROR: SQL/MM Spatial exception - geometry crosses edge 1 ERROR: Edge changed disposition around start node 1 ERROR: Side-location conflict: new edge starts in face 1 and ends in face 2 ERROR: SQL/MM Spatial exception - geometry crosses edge 5 ERROR: deadlock detected DETAIL: Process 174600 waits for ShareLock on transaction 3178504; blocked by process 174602. Process 174602 waits for ShareLock on transaction 3178502; blocked by process 174600. HINT: See server log for query details. CONTEXT: while updating tuple (1,3) in relation "edge_data" SQL statement "UPDATE "mt_topo".edge_data SET next_right_edge= -16, abs_next_right_edge= 16 WHERE start_node = 5 AND next_right_edge= -5 AND abs_next_right_edge= 5 AND edge_id != 16" Topology mt_topo (id 697, SRID 0, precision 0) 11 nodes, 17 edges, 9 faces, 0 topogeoms in 0 layers
comment:29 by , 3 years ago
Merge request https://gitlab.com/postgis/postgis/-/merge_requests/51 takes the approach of advisory locks
comment:30 by , 3 years ago
It looks like another effective approach to prevent invalidities would be to ensure every transaction possibly modifying a topology runs with SERIALIZABLE TRANSACTION ISOLATION LEVEL. In that case it looks like all possible invalidities will be prevented by a serialization_failure exception.
comment:31 by , 3 years ago
I just learnt that SERIALIZABLE tracking is "pessimistic", that is it may give false conflicts (error out more than needed). This may result in actually slower performance than forced serialization via locks. A test running 16 concurrent transactions each adding a polygon to a topology completes in 1 second when serialized while can take 10 times as much when using SERIALIZABLE TRANSACTION ISOLATION LEVEL and re-trying after sleeping for a random amount of time up to 1 second. Some transactions are re-attempted up to 10 times before succeeding:
[strk@c19:/usr/local/src/postgis/postgis(topology-advisory-lock)] time topology/test/check_multiprocess_topology_population.sh 2>&1 | grep WARNING WARNING: 15917 succeeded on attempt 1 WARNING: 15950 succeeded on attempt 1 WARNING: 16045 succeeded on attempt 1 WARNING: 16046 succeeded on attempt 2 WARNING: 16025 succeeded on attempt 3 WARNING: 16040 succeeded on attempt 3 WARNING: 15940 succeeded on attempt 4 WARNING: 15994 succeeded on attempt 4 WARNING: 16000 succeeded on attempt 4 WARNING: 16005 succeeded on attempt 6 WARNING: 16011 succeeded on attempt 5 WARNING: 16044 succeeded on attempt 6 WARNING: 15924 succeeded on attempt 6 WARNING: 15999 succeeded on attempt 8 WARNING: 15996 succeeded on attempt 9 WARNING: 16028 succeeded on attempt 9 real 0m5.392s user 0m0.729s sys 0m0.168s
comment:32 by , 3 years ago
strk you have plans to do anything with this in the next couple of days or should I push to next milestone?
comment:33 by , 3 years ago
Milestone: | PostGIS 3.0.4 → PostGIS 3.3.0 |
---|
Pushed forward. At this stage I think the best thing to do would be just documenting the risks of concurrently invoking topology editing functions.
comment:34 by , 3 years ago
Keywords: | reliability added |
---|
comment:35 by , 2 years ago
Milestone: | PostGIS 3.3.0 → PostGIS 3.4.0 |
---|
comment:36 by , 16 months ago
Milestone: | PostGIS 3.4.0 → PostGIS Fund Me |
---|
First of all: it is _expected_ that concurrent processes (threads would result in backends, so processes) may end up creating an invalid topology. This is because the topology primitive tables DO NOT have proper integrity constraints ensuring topology validity. They do have partial constraints but not full-coverage of constraints. This was a design choice, initially, to avoid an excessive cost on topology population when the caller knows what he's doing.
We *might* be able to change this now, counting on the ability of caller to disable (temporary?) those constraints. Anyway this is the current situation.
The fact that you end up with different results is ALSO due to this multi-process approach. Depending on what gets into the primitive tables first determines the behaviour of the next operation.
So my question is: can you get topology.TopoGeo_addLinestring to create an invalid topology when being run by a backend having exclusive write access to the database ? If you can NOT, then this would be an INVALID ticket.