#5667 closed enhancement (fixed)
Performance Postgis Topology when adding million lines by using TopoGeo_AddLineString
Reported by: | Lars Aksel Opsahl | Owned by: | strk |
---|---|---|---|
Priority: | medium | Milestone: | PostGIS 3.5.0 |
Component: | topology | Version: | 3.4.x |
Keywords: | performance | Cc: | Lars Aksel Opsahl |
Description
Is is difficult to add new method for TopoGeo_AddLineString(
TopoGeo_AddLineString(varchar atopology, geometry aline, float8 tolerance)
with new method like this
TopoGeo_AddLineString(varchar atopology, geometry[] lines, float8 tolerance)
so we can add multiple lines in a single call ?
Can a method like this make it easier to increase performance and reduce the number off calls like the one below
total_min | avg_ms | calls | query 8.63888982881786 | 0.009448244050939514 | 54860288 | SELECT 1 FROM ONLY "gronn_test_03_test_topo"."face" x WHERE "face_id" OPERATOR(pg_catalog.=) $1 FOR KEY SHARE OF x 9.254862599314976 | 0.010973764762885281 | 50601755 | SELECT 1 FROM ONLY "gronn_test_03_test_topo"."node" x WHERE "node_id" OPERATOR(pg_catalog.=) $1 FOR KEY SHARE OF x 16.16802178720142 | 0.01995163239587184 | 48621651 | SELECT id,srid,precision,null::geometry FROM topology.topology WHERE name = $1::varchar
Change History (26)
comment:1 by , 9 months ago
follow-up: 4 comment:2 by , 9 months ago
Keywords: | performance added |
---|
I agree about having more batch-loading functions, in general. Such function could be taking a generic geometry and was indeed forseen already: https://git.osgeo.org/gitea/postgis/postgis/src/branch/master/topology/sql/populate.sql.in#L725
It was planned at least since 2.0.0: https://git.osgeo.org/gitea/postgis/postgis/src/tag/2.0.0/topology/sql/populate.sql.in.c#L1057
In the planned form it had no return, while the implemented functions each return a set of identifiers of primitives corresponding to the incoming lines. Returning primitives helps in building TopoGeometry objects.
Your idea of taking an array of lines might help with matching back such a sets of primitives to the input geometry they would belong to, so maybe the return code of your suggested signature could be a TopoElementArray[]
follow-up: 5 comment:3 by , 9 months ago
Keywords: | performance removed |
---|
That was not many I think we have a ratio off about 70 call for each call line add latest time i checked. I the C code it seems like that info could be added as parameter so it could be one to one .
comment:4 by , 9 months ago
Replying to strk:
Your idea of taking an array of lines might help with matching back such a sets of primitives to the input geometry they would belong to, so maybe the return code of your suggested signature could be a TopoElementArray[]
Yes that would be perfect.
comment:5 by , 9 months ago
Replying to Lars Aksel Opsahl:
That was not many I think we have a ratio off about 70 call for each call line add latest time i checked. I the C code it seems like that info could be added as parameter so it could be one to one .
I may have been wrong there with ratio, maybe more 1 to 20
total_min | avg_ms | calls | query 78.72177638407608 | 0.02124369923476904 | 222339176 | SELECT id,srid,precision,$2::geometry FROM topology.topology WHERE name = $1::varchar 39.22709295360309 | 0.012033813981130253 | 195584341 | SELECT $2 FROM ONLY "gronn_test_03_test_topo"."face" x WHERE "face_id" OPERATOR(pg_catalog.=) $1 FOR KEY SHARE OF x 41.32200506109425 | 0.01378028794816383 | 179917888 | SELECT $2 FROM ONLY "gronn_test_03_test_topo"."node" x WHERE "node_id" OPERATOR(pg_catalog.=) $1 FOR KEY SHARE OF x 4910.569903556195 | 26.160878027420043 | 11262397 | SELECT ARRAY(SELECT topology.TopoGeo_addLinestring(_topology_name, new_line, _snap_tolerance))
comment:6 by , 9 months ago
20 queries of topology.topology every single TopoGeo_addLineString is unexpected, I cannot reproduce it with POSTGIS="3.5.0dev 3.4.0rc1-926-g799a63a81"
Also the query in your first report started with:
SELECT id,srid,precision,null::geometry
while the second started with:
SELECT id,srid,precision,$2::geometry
The first one I recognize as coming from the cb_loadTopologyByName
C function which is the one I see called once per TopoGeo_addLineString
call. In the first report we had 1.12 calls of that one for every call of TopoGeo_addLineString
.
The second one I cannot find in the source tree of current PostGIS master branch (git grep 'id,srid,precision'
) so chances are it is coming from client code and would still be there if PostGIS implemented an internal cache to bring down those PostGIS queries down to 1 per statement.
What is the full version of postgis you are using ?
comment:7 by , 9 months ago
select version(); version ------------------------------------------------------------------------------------------------------------------------------------- PostgreSQL 12.6 (Ubuntu 12.6-0ubuntu0.20.04.1) on x86_64-pc-linux-gnu, compiled by gcc (Ubuntu 9.3.0-17ubuntu1~20.04) 9.3.0, 64-bit (1 row) vroom5.int.nibio.no postgres@rog_01=# SELECT PostGIS_Full_Version(); postgis_full_version ----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- POSTGIS="3.4.0 0874ea3" [EXTENSION] PGSQL="120" GEOS="3.12.1-CAPI-1.18.1" (compiled against GEOS 3.10.2) SFCGAL="1.3.7" PROJ="8.2.0 NETWORK_ENABLED=OFF URL_ENDPOINT=https://cdn.proj.org USER_WRITABLE_DIRECTORY=/tmp/proj DATABASE_PATH=/usr/share/proj/proj.db" LIBXML="2.9.10" LIBJSON="0.13.1" LIBPROTOBUF="1.3.3" WAGYU="0.5.0 (Internal)" TOPOLOGY (1 row)
follow-up: 9 comment:8 by , 9 months ago
I cannot find the second query in the source tree of PostGIS-3.4.0 0874ea3
either:
[strk@c23:/usr/local/src/postgis/postgis-3.4((HEAD detached at 3.4.0))] git rev-parse HEAD; git grep 'srid,precision' 0874ea342af5392e3cd9f4e6157ef08648c9d2d8 topology/postgis_topology.c: "SELECT id,srid,precision,null::geometry " topology/test/regress/copytopology.sql:SELECT srid,precision FROM topology.topology WHERE name = 'CITY_data_UP_down'; topology/test/regress/createtopology.sql:SELECT name,srid,precision,hasz from topology.topology topology/test/regress/topologysummary.sql:INSERT INTO topology.topology (id,name,srid,precision,hasz)
Are you sure that query is not coming from code external to PostGIS ? I'm trying to reproduce the problem of multiple queries to topology.topology statements being triggered by a single call to TopoGeo_addLineString but I've not been able so far.
follow-up: 10 comment:9 by , 9 months ago
Replying to strk:
I cannot find the second query in the source tree of
PostGIS-3.4.0 0874ea3
You mean this ?
SELECT $2 FROM ONLY "gronn_test_03_test_topo"."face" x WHERE "face_id" OPERATOR(pg_catalog.=) $1 FOR KEY SHARE OF x
Yes this is a call triggered by the system before doing a update and not directly issued by any user call if I understand correctly.
follow-up: 11 comment:10 by , 9 months ago
I mean this:
SELECT id,srid,precision,$2::geometry FROM topology.topology WHERE name = $1::varchar
comment:11 by , 9 months ago
Replying to strk:
I mean this:
SELECT id,srid,precision,$2::geometry FROM topology.topology WHERE name = $1::varchar
I thought this come from the below code but I see that's is different
"SELECT id,srid,precision,null::geometry " "FROM topology.topology WHERE name = $1::varchar";
And I can not find any code like this i may own code.
comment:12 by , 9 months ago
For the record: I've filed #5668 to implement a version of TopoGeo_addLineString
taking a topology.topology record.
I'd still look for the query having a parameter for the geometry field so a downstream ticket would be useful for me to look at, or at least reproduce.
comment:13 by , 9 months ago
Still for the record: I've filed #5669 to implement a generic batch-loading function
comment:14 by , 9 months ago
I believe there is a 1:1 ratio between calls to TopoGeo_addLinestring
and queries to topology.topology table. That same query also serves the purpose of fetching the OID of the geometry type, which is why the null::geometry part is in there.
Implementation of a cache would result in a single query per session, or per transaction, while taking a topology.topology record would still miss the geometry oid extraction that would then need to be still performed on a 1:1 ratio.
For this reason I'm thinking the best performance improvement here would be to implement a cache or the batch-loading function which would have similar effects. The cache would not require any change in the client code, which makes the performance improvement transparent to the users.
comment:15 by , 9 months ago
I've configured my PostgreSQL-14 cluster to preload the pg_stat_statements
module and created the pg_stat_statements
extension in a test database. Then I've created a topology with a single segment edge, reset the stats with SELECT pg_stat_statements_reset()
and then used TopoGeo_addLineString
to add a single-segment line crossing the existing line.
These are the statements (including nested) resulting form that simple operation:
n | query ----+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- 11 | SELECT $2 FROM ONLY "t"."node" x WHERE "node_id" OPERATOR(pg_catalog.=) $1 FOR KEY SHARE OF x 2 | SELECT edge_id,start_node,end_node,left_face,right_face,geom FROM "t".edge_data ORDER BY geom <-> $1 ASC LIMIT $2 12 | SELECT $2 FROM ONLY "t"."face" x WHERE "face_id" OPERATOR(pg_catalog.=) $1 FOR KEY SHARE OF x 1 | SELECT edge_id,start_node,end_node,left_face,right_face,next_left_edge,next_right_edge,geom FROM "t".edge_data WHERE edge_id IN ($1) 3 | INSERT INTO "t".node (node_id,containing_face,geom) VALUES (DEFAULT,$1,$2::geometry) RETURNING node_id 1 | UPDATE "t".edge_data SET end_node= $1,next_left_edge= $2, abs_next_left_edge= $3,geom=$4::geometry WHERE edge_id = $5 3 | SELECT node_id,containing_face,geom FROM "t".node WHERE geom && $1::geometry 3 | SELECT edge_id,geom FROM "t".edge_data WHERE ST_DWithin($1::geometry, geom, $2) 3 | SELECT EXISTS ( SELECT $1 FROM "t".node WHERE ST_Equals(geom, $2::geometry)) 3 | SELECT nextval($1) 3 | SELECT edge_id,start_node,end_node,left_face,right_face,next_left_edge,next_right_edge,geom FROM "t".edge WHERE geom && $1::geometry 1 | SELECT r.element_id, r.topogeo_id, r.layer_id, r.element_type FROM "t".relation r , topology.layer l WHERE l.topology_id = $1 AND l.level = $2 AND l.layer_id = r.layer_id AND r.element_id IN ( $3, $4) AND r.element_type = $5 3 | INSERT INTO "t".edge_data (edge_id,start_node,end_node,left_face,right_face,next_left_edge, abs_next_left_edge,next_right_edge, abs_next_right_edge,geom) VALUES ($1,$2,$3,$4,$5,$6,$7,$8,$9,$10::geometry) 2 | SELECT EXISTS ( SELECT $1 FROM "t".edge_data WHERE ST_Within($2::geometry, geom)) 5 | SELECT edge_id,start_node,end_node,left_face,right_face,next_left_edge,next_right_edge,geom FROM "t".edge_data WHERE start_node IN ($1) OR end_node IN ($2) 4 | SELECT node_id,geom FROM "t".node WHERE ST_DWithin(geom, $1::geometry, $2) 1 | UPDATE "t".edge_data SET next_right_edge= $1, abs_next_right_edge= $2 WHERE start_node = $3 AND next_right_edge= $4 AND abs_next_right_edge= $5 AND edge_id != $6 4 | SELECT edge_id,geom FROM "t".edge WHERE geom && $1::geometry 1 | select topogeo_addlinestring($1, $2) 8 | SELECT $2 FROM ONLY "t"."edge_data" x WHERE "edge_id" OPERATOR(pg_catalog.=) $1 FOR KEY SHARE OF x 2 | UPDATE "t".node SET containing_face =$1::int WHERE node_id = $2 1 | UPDATE "t".edge_data SET next_left_edge= $1, abs_next_left_edge= $2 WHERE end_node= $3 AND next_left_edge= $4 AND abs_next_left_edge= $5 AND edge_id != $6 2 | SELECT node_id,containing_face,geom FROM "t".node WHERE node_id IN ($1,$2) 1 | SELECT id,srid,precision,null::geometry FROM topology.topology WHERE name = $1::varchar 1 | UPDATE "t".edge_data SET next_right_edge= $1, abs_next_right_edge= $2 WHERE edge_id = $3 1 | UPDATE "t".edge_data SET next_left_edge= $1, abs_next_left_edge= $2 WHERE edge_id = $3 (26 rows)
The query I've used to extract the information:
select calls n, query from pg_stat_statements where dbid = ( select oid from pg_database where datname = current_database() ) and query not like '%pg_stat_statements%';
comment:16 by , 9 months ago
There is a topology/test/perf/TopoGeo_addLinestring.sql
script that was written specifically to analyze performance of that function.
That script calls TopoGeo_addLinestring
30752 times and there are exactly 30752 queries to the topology.topology table (so ratio of 1/1).
I should also mention that the time cost of the topology.topology queries sums up to 142ms over the 44212ms of the statement resulting in those 30752 linestring additions, so we are talking about the 0.3% of the total operation. There are bigger fishes to tackle.
Biggest fish confirms being the query that looks for newly formed ringsas shown by a query to pg_stat_statements ordered by total execution time:
n | tot_ms | avg_ms | q --------+--------+--------+------------------------------------------------------------------------------------------------------------------------- 1 | 44212 | 44212 | SELECT count(*) FROM ( SELECT topology.TopoGeo_addLinestring($1, g) FROM topoperf.case_full_coverage_no_holes ) foo 2883 | 3480 | 1 | WITH RECURSIVE edgering AS ( SELECT $1 as signed_edge_id, edge_id, next_left_edge, next_right_edge FROM "topoperf".edge 141824 | 1818 | 0 | SELECT $2 FROM ONLY "topoperf"."face" x WHERE "face_id" OPERATOR(pg_catalog.=) $1 FOR KEY SHARE OF x 46624 | 1783 | 0 | SELECT edge_id,start_node,end_node,left_face,right_face,next_left_edge,next_right_edge,geom FROM "topoperf".edge WHERE 900 | 1688 | 2 | WITH newedges(edge_id,left_face) AS ( VALUES ($1,$2),($3,$4),($5,$6),($7,$8),($9,$10),($11,$12),($13,$14),($15,$16),($1 46624 | 1557 | 0 | SELECT edge_id,geom FROM "topoperf".edge WHERE geom && $1::geometry 46624 | 1427 | 0 | SELECT node_id,containing_face,geom FROM "topoperf".node WHERE geom && $1::geometry 900 | 1147 | 1 | WITH newedges(edge_id,right_face) AS ( VALUES ($1,$2),($3,$4),($5,$6),($7,$8),($9,$10),($11,$12),($13,$14),($15,$16),($ 15870 | 1144 | 0 | UPDATE "topoperf".edge_data SET next_left_edge= $1, abs_next_left_edge= $2 WHERE edge_id = $3 61504 | 1081 | 0 | SELECT node_id,geom FROM "topoperf".node WHERE ST_DWithin(geom, $1::geometry, $2) 15872 | 1036 | 0 | INSERT INTO "topoperf".edge_data (edge_id,start_node,end_node,left_face,right_face,next_left_edge, abs_next_left_edge,n 961 | 585 | 1 | SELECT edge_id,left_face,right_face,geom FROM "topoperf".edge_data WHERE ( left_face = ANY($1) OR right_face = ANY ($1) 14912 | 542 | 0 | SELECT edge_id,start_node,end_node,left_face,right_face,geom FROM "topoperf".edge_data ORDER BY geom <-> $1 ASC LIMIT $ 14912 | 475 | 0 | INSERT INTO "topoperf".node (node_id,containing_face,geom) VALUES (DEFAULT,$1,$2::geometry) RETURNING node_id 14912 | 386 | 0 | SELECT EXISTS ( SELECT $1 FROM "topoperf".edge_data WHERE ST_Within($2::geometry, geom)) 126912 | 338 | 0 | SELECT $2 FROM ONLY "topoperf"."node" x WHERE "node_id" OPERATOR(pg_catalog.=) $1 FOR KEY SHARE OF x 14912 | 322 | 0 | UPDATE "topoperf".node SET containing_face =$1::int WHERE node_id = $2 14912 | 311 | 0 | SELECT EXISTS ( SELECT $1 FROM "topoperf".node WHERE ST_Equals(geom, $2::geometry)) 46654 | 248 | 0 | SELECT edge_id,start_node,end_node,left_face,right_face,next_left_edge,next_right_edge,geom FROM "topoperf".edge_data W 14912 | 220 | 0 | SELECT edge_id,geom FROM "topoperf".edge_data WHERE ST_DWithin($1::geometry, geom, $2) 1 | 170 | 170 | CREATE TABLE topoperf.case_full_coverage_no_holes AS SELECT ST_Subdivide(ST_Segmentize(ST_Boundary(geom), 0.5), 5) g FR 30752 | 142 | 0 | SELECT id,srid,precision,$2::geometry FROM topology.topology WHERE name = $1::varchar
comment:17 by , 9 months ago
Speaking of the big fish (finding newly create edge rings) I would mention that some internal function to exist to allow postponing this step to after all linework is noded and is found to be an effective way to speed-up batch topology creation. See https://git.osgeo.org/gitea/postgis/postgis/pulls/28#issuecomment-12359
comment:18 by , 9 months ago
Keywords: | performance added |
---|---|
Summary: | Performance Postgis Topolgy when adding millions by using TopoGeo_AddLineString → Performance Postgis Topology when adding million lines by using TopoGeo_AddLineString |
comment:19 by , 9 months ago
The most recent work on postponing face detection to after all lines are loaded is in ST_CreateTopoGeo, via #5670 - that work requires computing *all* faces at the end, while for the case at hand (adding a bunch of lines into an existing topology) we would probably want to only compute new faces within a given bounding box, which is ticketed as #917
comment:20 by , 8 months ago
The for key share of x command is related to foreign key constraints, see https://dba.stackexchange.com/questions/271167/what-does-key-share-of-in-a-postgresql-statement-mean
comment:21 by , 8 months ago
Initial draft of a TopoGeo_AddGeometry(topo, geom)
function returning void is here:
https://git.osgeo.org/gitea/postgis/postgis/pulls/178
comment:22 by , 8 months ago
For the sake of this ticket a version of the function taking ONLY lines (multilinestring, for example) would be preferred, to further get performance improvement opportunities.
What I'm not fully happy with is the API as the TopoGeo_addXXX
family of functions all return the identifiers of the primitives added by the function while we mentioned we might save on that cost when adding million of lines...
comment:23 by , 8 months ago
I've settled for void TopoGeo_LoadGeometry
for now so we have a naming convention to distinguish between functions that return primitive identifiers and functions that do not. See https://git.osgeo.org/gitea/postgis/postgis/pulls/190
comment:24 by , 8 months ago
One issue with void , is that we in some cases use the edge id list to compute area created and then remove edges below given limit.
This may be done other ways also since we know the spatial lines that are added.
comment:26 by , 8 months ago
Milestone: | → PostGIS 3.5.0 |
---|
How many calls to TopoGeo_AddLineString resulted in over 48 million selects to the topology.topology table ? That query comes from the C topology library that takes care of loading the topology. I guess some cache could be put in place to avoid those repeated calls.