Hierarchical clustering based on a union of ellipsoids

Example of two spirals

using GeometricClusterAnalysis
using Plots
using Random
using Statistics

Data generation

Parameters

nsignal = 2000 # number of signal points
nnoise = 400   # number of outliers
dim = 2       # dimension of the data
σ = 0.5  # standard deviation for the additive noise

Data generation

rng = MersenneTwister(123)
data = noisy_nested_spirals(rng, nsignal, nnoise, σ, dim)
npoints = size(data.points, 2)
println("The dataset contains $npoints points, of dimension $dim.")
The dataset contains 2400 points, of dimension 2.

Data display

plot(data, aspect_ratio = 1)
Example block output

Computation of the union of ellipsoids with the kPLM function

Parameters

k = 20        # number of nearest neighbors
c = 30        # number of ellipsoids
iter_max = 20 # maximum number of iterations of the algorithm kPLM
nstart = 10;  # number of initializations of the algorithm kPLM

Method

We impose no constraints on the ellipsoids, that is, no constraints on the eigenvalues of the matrices defining the ellipsoids.

function f_Σ!(Σ) end
f_Σ! (generic function with 1 method)

The parameter indexed_by_r2 = true is the default in the function kplm. We should not modify it here, since some birth times are negative.

These negative values come from the logarithm of the determinant of matrices whose determinant is smaller than 1. The problem can be avoided by adding constraints to the matrices with the argument "f_Σ!"; in particular, forcing the eigenvalues to be no smaller than 1 works.
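As an illustration, a constrained variant along these lines could clamp the eigenvalues from below (a minimal sketch; f_Σ_clamped! is a hypothetical name, and we assume the constraint function receives a symmetric matrix that it may modify in place):

```julia
using LinearAlgebra

# Hypothetical variant of f_Σ! enforcing eigenvalues ≥ 1: decompose the
# matrix, clamp the spectrum from below, and rebuild it in place.
function f_Σ_clamped!(Σ)
    F = eigen(Symmetric(Σ))
    λ = max.(F.values, 1.0)                 # eigenvalues no smaller than 1
    Σ .= F.vectors * Diagonal(λ) * F.vectors'
    return Σ
end
```

With such a constraint, log(det(Σ)) stays nonnegative, so no birth time can become negative.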

Application of the method

distance_function = kplm(rng, data.points, k, c, nsignal, iter_max, nstart, f_Σ!)
KpResult{Float64}:
k = 20
centers = [
     [12.09106880600502, 35.512231849941706]
     [11.10520495644539, -2.312099964879074]
     [31.175588395021816, 9.828570610333857]
     [18.71285759525761, -33.45543114227322]
     [-28.578498377359935, -4.390107974557133]
     [-25.972760286726256, -16.490838307431552]
     [-12.020589483335518, 5.687191365519236]
     [-23.12024469693999, 11.478240217735193]
     [-14.239571475472095, -12.358578620993404]
     [3.5497049470821413, 19.076151157010045]
     [9.600768508612669, 14.334870661394383]
     [0.6540923296266847, -35.903051591486765]
     [13.351471186728437, 3.9121753393088134]
     [12.738703142064521, 8.913453932763328]
     [23.80283427786567, 26.180667519677634]
     [-15.943381171133062, -29.301310948924073]
     [-26.887619578163637, 5.32742487673712]
     [22.514812503349418, -16.78207867672038]
     [29.569820840371726, -6.064799836803726]
     [1.0193713291756967, 38.71327198695241]
     [-6.967111726104882, 8.61449353816522]
     [-15.713918326602439, -0.6508444028152078]
     [-6.35428857987379, -21.073641491198124]
     [-11.047852997167988, 20.094091964326843]
     [-17.452687666674855, 16.88506531820494]
     [1.8715692010545144, -23.792470974279972]
     [15.107330695764665, -21.562989560883665]
     [9.299431945474161, -23.536239968723898]
     [-3.5686771173016005, 20.920829360238933]
 ]
colors = [3, 1, 19, 23, 9, 19, 18, 21, 9, 7  …  0, 0, 0, 20, 0, 0, 0, 0, 0, 0]
cost = 3.1573758464749813

Clustering based on the persistence of the union of ellipsoids filtration

Matrix of distances

This matrix contains the birth times of the ellipsoids on its diagonal, and the intersection times of pairs of ellipsoids in its lower triangular part.

distance_matrix = build_distance_matrix(distance_function)
29×29 Matrix{Float64}:
    1.95953   Inf         Inf       …    Inf         Inf       Inf
  133.698      1.25378    Inf            Inf         Inf       Inf
   19.4736    80.9993      2.27055       Inf         Inf       Inf
  320.764    264.044      42.1753        Inf         Inf       Inf
  278.687    397.1      2084.0           Inf         Inf       Inf
  366.351    209.601    1412.99     …    Inf         Inf       Inf
  236.471    360.066     424.592         Inf         Inf       Inf
  104.216    702.299     463.291         Inf         Inf       Inf
  419.088     86.2906    638.357         Inf         Inf       Inf
  297.584     49.3773     91.4927        Inf         Inf       Inf
    ⋮                               ⋱                          
  319.283    130.823     242.778    …    Inf         Inf       Inf
  186.038    254.46      905.515         Inf         Inf       Inf
 2720.54      67.2223    171.524         Inf         Inf       Inf
  146.79     342.63      266.896         Inf         Inf       Inf
   96.3251   668.731     260.749         Inf         Inf       Inf
 1468.17      69.5428    141.187    …    Inf         Inf       Inf
  787.89      95.6592     91.4626         1.05758    Inf       Inf
  864.089     78.1591    104.547          3.64918     1.21095  Inf
  141.487    129.419     127.53        1020.68     1282.04      0.843565
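To make the format concrete, here is a small hypothetical matrix of the same shape (the values are made up, not taken from the run above):

```julia
# Diagonal entries are birth times; lower-triangular entries are the times
# at which the corresponding pair of ellipsoids intersects.
D = [1.0  Inf  Inf;
     5.0  2.0  Inf;
     9.0  4.0  3.0]

births = [D[i, i] for i in axes(D, 1)]   # birth time of each ellipsoid
t12 = D[2, 1]                            # time at which ellipsoids 1 and 2 intersect
```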

Selection of parameter "threshold"

We draw a persistence diagram based on the filtration of the union of ellipsoids.

Each point corresponds to the connected component of an ellipsoid. The birth time corresponds to the time at which the ellipsoid, and thus the component, appears. The death time corresponds to the time at which the component merges with a component that was born before.

The top left point corresponds to the ellipsoid that appeared first; since no component was born before it, it never merges, and its death time is $\infty$.

The parameter "threshold" aims at removing ellipsoids born after time "threshold". Such ellipsoids are considered as irrelevant. This may be due to a bad initialisation of the algorithm that creates ellipsoids in bad directions with respect to the data.

hc1 = hierarchical_clustering_lem(
    distance_matrix,
    infinity = Inf,
    threshold = Inf,
    store_colors = false,
    store_timesteps = false,
)

plot(hc1)
Example block output

We consider that ellipsoids born after time "threshold = 4" were not relevant.

Selection of parameter "infinity"

We then have to select the parameter "infinity". Connected components whose lifetime is larger than "infinity" are the components that we do not want to die.

hc2 = hierarchical_clustering_lem(
    distance_matrix,
    infinity = Inf,
    threshold = 4,
    store_colors = false,
    store_timesteps = false,
)

plot(hc2)
Example block output

We select "infinity = 15". Since there are clearly two connected components that have a lifetime much larger than others. This lifetime is larger than 15, whereas the lifetime of others is smaller than 15.

Clustering

hc3 = hierarchical_clustering_lem(
    distance_matrix,
    infinity = 10,
    threshold = 4,
    store_colors = true,
    store_timesteps = true,
)

plot(hc3)
Example block output

Getting the number of components, colors of ellipsoids and times of evolution of the clustering

Number of ellipsoids

nellipsoids = length(hc3.startup_indices)
# Ellipsoids colors
ellipsoids_colors = hc3.saved_colors
# Times at which components are born or die
timesteps = hc3.timesteps;

Note: ellipsoids_colors[i] contains the labels of the ellipsoids just before time timesteps[i].

Example:

  • ellipsoids_colors[1] contains only the label 0

Moreover, if there are 2 connected components in the remaining clustering:

  • ellipsoids_colors[end - 1] = ellipsoids_colors[end] contains 2 different labels
  • ellipsoids_colors[end - 2] contains 3 different labels
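The convention can be checked on a toy label history (hypothetical values, not the ones computed above):

```julia
# toy_colors[i] holds the labels just before the i-th timestep;
# label 0 means "not born yet" or "discarded".
toy_colors = [
    [0, 0, 0, 0],   # before the first timestep, nothing is born
    [1, 0, 0, 0],
    [1, 2, 3, 0],   # three components alive
    [1, 1, 2, 0],   # two components remain ...
    [1, 1, 2, 0],   # ... and the last two entries coincide
]

# number of distinct nonzero labels in a labeling
nlabels(v) = length(unique(filter(!=(0), v)))
```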

Using a parameter "threshold" not equal to $\infty$ erases some ellipsoids. Therefore we need to recompute the labels of the data points with respect to the remaining ellipsoids.

remain_indices = hc3.startup_indices
points_colors, distances = subcolorize(data.points, npoints, distance_function, remain_indices)
([3, 1, 19, 23, 9, 19, 18, 21, 9, 7  …  4, 3, 12, 20, 6, 20, 8, 4, 19, 6], [2.315749981293533, 3.816132565039391, 2.880162047151902, 3.2194811169533812, 3.8947497741374466, 3.5901592704888947, 2.4333564312555938, 0.9319242279474274, 5.661826288583886, 0.5896412635063679  …  62.41752856613242, 146.8749249743046, 69.0763291826481, 7.858653442362352, 161.2110981757169, 142.60508756077564, 19.539032004720067, 337.23702532359215, 105.42048180700176, 112.87606117457861])

Removing outliers

Selection of the number of outliers

idxs = zeros(Int, npoints)
sortperm!(idxs, distances, rev = false)
costs = cumsum(distances[idxs])
plot( costs, legend = false, xlabel = "number of points", ylabel = "cost")
Example block output

We choose "nsignal", the number of points at which there is a slope change in the cost curve.

We set the label of the (npoints - nsignal) points with largest cost to 0. These points are considered as outliers.

nsignal = 2100

points_colors[idxs[(nsignal+1):end]] .= 0
300-element view(::Vector{Int64}, [2337, 2207, 2054, 2382, 2246, 2211, 2136, 2056, 2112, 2232  …  2332, 2204, 2245, 2329, 2137, 2034, 2075, 2150, 2037, 2389]) with eltype Int64:
 0
 0
 0
 0
 0
 0
 0
 0
 0
 0
 ⋮
 0
 0
 0
 0
 0
 0
 0
 0
 0

Animation - Clustering result

ellipsoids_frames, points_frames, μ, ω, Σ, sq_time = create_ellipsoids_animation(
    distance_function, timesteps, ellipsoids_colors, points_colors, distances, remain_indices )

nframes = length(points_frames)

anim = @animate for i in [1:nframes; Iterators.repeated(nframes, 30)...]
    ellipsoids(data.points, ellipsoids_frames[i], points_frames[i], μ, ω, Σ, sq_time[i]; markersize = 3)
    xlims!(-60, 60)
    ylims!(-60, 60)
end

gif(anim, "anim_kplm.gif", fps = 5)
Example block output