Opened 7 years ago

Closed 7 years ago

Last modified 7 years ago

#3971 closed defect (fixed)

bias in kmeans

Reported by: tilt Owned by: pramsey
Priority: medium Milestone: PostGIS 2.4.3
Component: postgis Version: 2.4.x
Keywords: Cc:

Description

When running ST_ClusterKmeans on a large amount (>100) of clusters it becomes clear that there is a uneven distribution in the clustering, even when the points are evenly distributed.

Consider the following query:

WITH points AS (
    SELECT
(ST_DumpPoints(ST_generatePoints(ST_MakeEnvelope(0,0,1000,1000),100000))).geom
geom
)
SELECT ST_ClusterKMeans(geom,1000) over () AS cid, geom
FROM points;

This will generate the following clusters: http://lists.osgeo.org/pipermail/postgis-devel/attachments/20171228/0512497e/attachment-0001.png

Obviously, clusters on the lowleft, uppright diagonal are smaller then clusters further from this diagonal which seems to be originating from the seeding algorithm.

Original post on this: https://lists.osgeo.org/pipermail/postgis-devel/2017-December/026775.html

Attachments (3)

new_result.jpg (102.8 KB ) - added by tilt 7 years ago.
New kmeans result
new_result_spotty.jpg (79.7 KB ) - added by tilt 7 years ago.
Spottiness in new result in clustersize
new_result_convexhull.jpg (174.3 KB ) - added by tilt 7 years ago.
convexhull coloured by numpts

Download all attachments as: .zip

Change History (15)

comment:1 by komzpa, 7 years ago

Fix: https://github.com/postgis/postgis/pull/181

@tilt do you have any idea how to describe the test for this? I'm thinking of making a stable grid of identical clusters and see if kmeans converges to have min(count_pt_in_cluster) = max(count_pt_in_cluster) but haven't fabricated such case that would currently fail yet.

by tilt, 7 years ago

Attachment: new_result.jpg added

New kmeans result

by tilt, 7 years ago

Attachment: new_result_spotty.jpg added

Spottiness in new result in clustersize

comment:2 by tilt, 7 years ago

New result looks good compared to first bias: https://trac.osgeo.org/postgis/attachment/ticket/3971/new_result.jpg

There is some interesting difference in number of points per cluster though: https://trac.osgeo.org/postgis/attachment/ticket/3971/new_result_spotty.jpg

lightblue < 40 pts/cluster, darkblue > 180 pts/cluster

This seems a high difference considering the original points were randomly distributed.

comment:3 by komzpa, 7 years ago

Can you please also visualizer results as ST_ConvexHull(ST_Collect(geom)) polygons, so there's no visual concavity caused by overlapping points?

Also, try measuring ST_MaxDistance(ST_Collect(geom)) for clusters. Is it similar for all clusters?

by tilt, 7 years ago

Attachment: new_result_convexhull.jpg added

convexhull coloured by numpts

comment:4 by tilt, 7 years ago

Using ST_MininumBoundingCircle(ST_Collect(geom)) instead (since maxdistance needs 2 geometries). Minimum radius: ~15 Maximum radius: ~25

Convexhull (coloured by numpts) is here: https://trac.osgeo.org/postgis/attachment/ticket/3971/new_result_convexhull.jpg

comment:5 by komzpa, 7 years ago

Hi, MaxDistance can accept both geometries as the same.

comment:6 by pramsey, 7 years ago

Given the rather large disparity in number of points per cluster, and the fact that the clusters appear to have more-or-less equal sizing, is it possible your uniform input is not actually uniform? What are the counts against a uniform grid of the area?

comment:7 by komzpa, 7 years ago

hey @pramsey, a grid of 45x45 points with k=81 (supposedly should cluster into 9x9 clusters of 5x5) clusters in a way that smallest cluster is 4 points without the PR applied and with smallest cluster of 12 with PR applied. I think it's a step forward, and further improvements are going to come from tweaking algorithm to escape local minimums, if that's needed.

comment:8 by pramsey, 7 years ago

I like your PR fine, I just find the graphics attached to this ticket confusing. If the input points are uniform and the cluster boundaries are more-or-less the same size, I'd expect the counts of points within the clusters to be much more similar than they are.

comment:9 by komzpa, 7 years ago

Resolution: fixed
Status: newclosed

In 16228:

Change KMeans init algorithm to one with less skew.

Closes #3971
Closes https://github.com/postgis/postgis/pull/181

comment:10 by komzpa, 7 years ago

Cc: me@… removed

comment:11 by komzpa, 7 years ago

In 16230:

NEWS entry for KMeans init change.

Closes #3971

comment:12 by Algunenano, 7 years ago

Hi,

I'm hitting a heap-buffer-overflow error because of these changes

    ==1977==ERROR: AddressSanitizer: heap-buffer-overflow on address 0x61f000002900 at pc 0x7f9825df18cd bp 0x7fff856d5be0 sp 0x7fff856d5bd0
    READ of size 8 at 0x61f000002900 thread T0
        #0 0x7f9825df18cc in lwkmeans_pt_distance /home/raul/dev/public/postgis/liblwgeom/lwkmeans.c:13
        #1 0x7f9825df27f0 in lwgeom_cluster_2d_kmeans /home/raul/dev/public/postgis/liblwgeom/lwkmeans.c:173
        #2 0x557ad2966598 in test_kmeans /home/raul/dev/public/postgis/liblwgeom/cunit/cu_algorithm.c:1339
        #3 0x7f9825881087 in run_single_test /tmp/yaourt-tmp-raul/aur-cunit/src/CUnit-2.1-3/CUnit/Sources/Framework/TestRun.c:991
        #4 0x7f98258812bd in run_single_suite /tmp/yaourt-tmp-raul/aur-cunit/src/CUnit-2.1-3/CUnit/Sources/Framework/TestRun.c:876
        #5 0x7f9825881736 in CU_run_all_tests /tmp/yaourt-tmp-raul/aur-cunit/src/CUnit-2.1-3/CUnit/Sources/Framework/TestRun.c:367
        #6 0x557ad29bf35c in main /home/raul/dev/public/postgis/liblwgeom/cunit/cu_tester.c:173
        #7 0x7f982519af49 in __libc_start_main (/usr/lib/libc.so.6+0x20f49)
        #8 0x557ad295c999 in _start (/home/raul/dev/public/postgis/liblwgeom/cunit/.libs/lt-cu_tester+0x3b999)

It appears it's passing a void ** instead of a void *, so I've tried to fix it by changing the line to:

distances[j] += config.distance_method(config.objs[j], (Pointer)&centers_raw[i-1]);

But that reduces the cluster number from 12 to 7 so it doesn't pass the test.

Note: See TracTickets for help on using tickets.