#3971 closed defect (fixed)
bias in kmeans
Reported by: | tilt | Owned by: | pramsey |
---|---|---|---|
Priority: | medium | Milestone: | PostGIS 2.4.3 |
Component: | postgis | Version: | 2.4.x |
Keywords: | Cc: |
Description
When running ST_ClusterKmeans on a large amount (>100) of clusters it becomes clear that there is a uneven distribution in the clustering, even when the points are evenly distributed.
Consider the following query:
WITH points AS ( SELECT (ST_DumpPoints(ST_generatePoints(ST_MakeEnvelope(0,0,1000,1000),100000))).geom geom ) SELECT ST_ClusterKMeans(geom,1000) over () AS cid, geom FROM points;
This will generate the following clusters:
Obviously, clusters on the lowleft, uppright diagonal are smaller then clusters further from this diagonal which seems to be originating from the seeding algorithm.
Original post on this: https://lists.osgeo.org/pipermail/postgis-devel/2017-December/026775.html
Attachments (3)
Change History (15)
comment:1 by , 7 years ago
comment:2 by , 7 years ago
comment:3 by , 7 years ago
Can you please also visualizer results as ST_ConvexHull(ST_Collect(geom)) polygons, so there's no visual concavity caused by overlapping points?
Also, try measuring ST_MaxDistance(ST_Collect(geom)) for clusters. Is it similar for all clusters?
comment:4 by , 7 years ago
comment:6 by , 7 years ago
Given the rather large disparity in number of points per cluster, and the fact that the clusters appear to have more-or-less equal sizing, is it possible your uniform input is not actually uniform? What are the counts against a uniform grid of the area?
comment:7 by , 7 years ago
hey @pramsey, a grid of 45x45 points with k=81 (supposedly should cluster into 9x9 clusters of 5x5) clusters in a way that smallest cluster is 4 points without the PR applied and with smallest cluster of 12 with PR applied. I think it's a step forward, and further improvements are going to come from tweaking algorithm to escape local minimums, if that's needed.
comment:8 by , 7 years ago
I like your PR fine, I just find the graphics attached to this ticket confusing. If the input points are uniform and the cluster boundaries are more-or-less the same size, I'd expect the counts of points within the clusters to be much more similar than they are.
comment:10 by , 7 years ago
Cc: | removed |
---|
comment:12 by , 7 years ago
Hi,
I'm hitting a heap-buffer-overflow error because of these changes
==1977==ERROR: AddressSanitizer: heap-buffer-overflow on address 0x61f000002900 at pc 0x7f9825df18cd bp 0x7fff856d5be0 sp 0x7fff856d5bd0 READ of size 8 at 0x61f000002900 thread T0 #0 0x7f9825df18cc in lwkmeans_pt_distance /home/raul/dev/public/postgis/liblwgeom/lwkmeans.c:13 #1 0x7f9825df27f0 in lwgeom_cluster_2d_kmeans /home/raul/dev/public/postgis/liblwgeom/lwkmeans.c:173 #2 0x557ad2966598 in test_kmeans /home/raul/dev/public/postgis/liblwgeom/cunit/cu_algorithm.c:1339 #3 0x7f9825881087 in run_single_test /tmp/yaourt-tmp-raul/aur-cunit/src/CUnit-2.1-3/CUnit/Sources/Framework/TestRun.c:991 #4 0x7f98258812bd in run_single_suite /tmp/yaourt-tmp-raul/aur-cunit/src/CUnit-2.1-3/CUnit/Sources/Framework/TestRun.c:876 #5 0x7f9825881736 in CU_run_all_tests /tmp/yaourt-tmp-raul/aur-cunit/src/CUnit-2.1-3/CUnit/Sources/Framework/TestRun.c:367 #6 0x557ad29bf35c in main /home/raul/dev/public/postgis/liblwgeom/cunit/cu_tester.c:173 #7 0x7f982519af49 in __libc_start_main (/usr/lib/libc.so.6+0x20f49) #8 0x557ad295c999 in _start (/home/raul/dev/public/postgis/liblwgeom/cunit/.libs/lt-cu_tester+0x3b999)
It appears it's passing a void **
instead of a void *
, so I've tried to fix it by changing the line to:
distances[j] += config.distance_method(config.objs[j], (Pointer)¢ers_raw[i-1]);
But that reduces the cluster number from 12 to 7 so it doesn't pass the test.
Fix: https://github.com/postgis/postgis/pull/181
@tilt do you have any idea how to describe the test for this? I'm thinking of making a stable grid of identical clusters and see if kmeans converges to have min(count_pt_in_cluster) = max(count_pt_in_cluster) but haven't fabricated such case that would currently fail yet.