Hierarchical clustering based on a union of ellipsoids

Example of two spirals

using GeometricClusterAnalysis
using Plots
using Random
using Statistics

Data generation

Parameters

nsignal = 2000 # number of signal points
nnoise = 400   # number of outliers
dim = 2       # dimension of the data
σ = 0.5  # standard deviation for the additive noise

Data generation

rng = MersenneTwister(123)
data = noisy_nested_spirals(rng, nsignal, nnoise, σ, dim)
npoints = size(data.points, 2)
println("The dataset contains $npoints points, of dimension $dim.")
The dataset contains 2400 points, of dimension 2.

Data display

plot(data, aspect_ratio = 1)
Example block output

Computation of the union of ellipsoids with the kPLM function

Parameters

k = 20        # number of nearest neighbors
c = 30        # number of ellipsoids
iter_max = 20 # maximum number of iterations of the algorithm kPLM
nstart = 10;  # number of initializations of the algorithm kPLM

Method

We impose no constraints on the ellipsoids, that is, no constraints on the eigenvalues of the matrices defining the ellipsoids.

function f_Σ!(Σ) end
f_Σ! (generic function with 1 method)

The parameter indexed_by_r2 = true is the default in the function kplm. We should not modify it here, since some birth times are negative.

These negative values come from the logarithm of the determinant of matrices whose determinant is smaller than 1. The problem can be avoided by adding constraints to the matrices with the argument "f_Σ!"; in particular, forcing the eigenvalues to be no smaller than 1 works.
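As an illustration, a constrained variant along these lines could clamp the eigenvalues from below (a minimal sketch; f_Σ_clamped! is a hypothetical name, and we assume the constraint function receives a symmetric matrix that it may modify in place):

```julia
using LinearAlgebra

# Hypothetical variant of f_Σ! enforcing eigenvalues ≥ 1: decompose the
# matrix, clamp the spectrum from below, and rebuild it in place.
function f_Σ_clamped!(Σ)
    F = eigen(Symmetric(Σ))
    λ = max.(F.values, 1.0)                 # eigenvalues no smaller than 1
    Σ .= F.vectors * Diagonal(λ) * F.vectors'
    return Σ
end
```

With such a constraint, log(det(Σ)) stays nonnegative, so no birth time can become negative.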

Application of the method

distance_function = kplm(rng, data.points, k, c, nsignal, iter_max, nstart, f_Σ!)
KpResult{Float64}:
k = 20
centers = [
     [12.09106880600502, 35.512231849941706]
     [11.10520495644539, -2.312099964879074]
     [31.175588395021816, 9.828570610333857]
     [18.71285759525761, -33.45543114227322]
     [-28.578498377359935, -4.390107974557133]
     [-25.972760286726256, -16.490838307431552]
     [-12.020589483335518, 5.687191365519236]
     [-23.12024469693999, 11.478240217735193]
     [-14.239571475472095, -12.358578620993404]
     [3.5497049470821413, 19.076151157010045]
     [9.600768508612669, 14.334870661394383]
     [0.6540923296266847, -35.903051591486765]
     [13.351471186728437, 3.9121753393088134]
     [12.738703142064521, 8.913453932763328]
     [23.80283427786567, 26.180667519677634]
     [-15.943381171133062, -29.301310948924073]
     [-26.887619578163637, 5.32742487673712]
     [22.514812503349418, -16.78207867672038]
     [29.569820840371726, -6.064799836803726]
     [1.0193713291756967, 38.71327198695241]
     [-6.967111726104882, 8.61449353816522]
     [-15.713918326602439, -0.6508444028152078]
     [-6.35428857987379, -21.073641491198124]
     [-11.047852997167988, 20.094091964326843]
     [-17.452687666674855, 16.88506531820494]
     [1.8715692010545144, -23.792470974279972]
     [15.107330695764665, -21.562989560883665]
     [9.299431945474161, -23.536239968723898]
     [-3.5686771173016005, 20.920829360238933]
 ]
colors = [3, 1, 19, 23, 9, 19, 18, 21, 9, 7  …  0, 0, 0, 20, 0, 0, 0, 0, 0, 0]
cost = 3.1573758464749813

Clustering based on the persistence of the union of ellipsoids filtration

Matrix of distances

This matrix contains the birth times of the ellipsoids on its diagonal, and the intersection times of pairs of ellipsoids in its lower triangular part.

distance_matrix = build_distance_matrix(distance_function)
29×29 Matrix{Float64}:
    1.95953   Inf         Inf       …    Inf         Inf       Inf
  133.698      1.25378    Inf            Inf         Inf       Inf
   19.4736    80.9993      2.27055       Inf         Inf       Inf
  320.764    264.044      42.1753        Inf         Inf       Inf
  278.687    397.1      2084.0           Inf         Inf       Inf
  366.351    209.601    1412.99     …    Inf         Inf       Inf
  236.471    360.066     424.592         Inf         Inf       Inf
  104.216    702.299     463.291         Inf         Inf       Inf
  419.088     86.2906    638.357         Inf         Inf       Inf
  297.584     49.3773     91.4927        Inf         Inf       Inf
    ⋮                               ⋱                          
  319.283    130.823     242.778    …    Inf         Inf       Inf
  186.038    254.46      905.515         Inf         Inf       Inf
 2720.54      67.2223    171.524         Inf         Inf       Inf
  146.79     342.63      266.896         Inf         Inf       Inf
   96.3251   668.731     260.749         Inf         Inf       Inf
 1468.17      69.5428    141.187    …    Inf         Inf       Inf
  787.89      95.6592     91.4626         1.05758    Inf       Inf
  864.089     78.1591    104.547          3.64918     1.21095  Inf
  141.487    129.419     127.53        1020.68     1282.04      0.843565
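To make the format concrete, here is a small hypothetical matrix of the same shape (the values are made up, not taken from the run above):

```julia
# Diagonal entries are birth times; lower-triangular entries are the times
# at which the corresponding pair of ellipsoids intersects.
D = [1.0  Inf  Inf;
     5.0  2.0  Inf;
     9.0  4.0  3.0]

births = [D[i, i] for i in axes(D, 1)]   # birth time of each ellipsoid
t12 = D[2, 1]                            # time at which ellipsoids 1 and 2 intersect
```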

Selection of parameter "threshold"

We draw a persistence diagram based on the filtration of the union of ellipsoids.

Each point corresponds to the connected component of an ellipsoid. The birth time corresponds to the time at which the ellipsoid, and thus the component, appears. The death time corresponds to the time at which the component merges with a component that was born before.

The top left point corresponds to the ellipsoid that appeared first; since no component was born before it, it never merges, and its death time is $\infty$.

The parameter "threshold" aims at removing ellipsoids born after time "threshold". Such ellipsoids are considered as irrelevant. This may be due to a bad initialisation of the algorithm that creates ellipsoids in bad directions with respect to the data.

hc1 = hierarchical_clustering_lem(
    distance_matrix,
    infinity = Inf,
    threshold = Inf,
    store_colors = false,
    store_timesteps = false,
)

plot(hc1)
Example block output

We consider that ellipsoids born after time "threshold = 4" were not relevant.

Selection of parameter "infinity"

We then have to select the parameter "infinity". Connected components whose lifetime is larger than "infinity" are the components that we do not want to die.

hc2 = hierarchical_clustering_lem(
    distance_matrix,
    infinity = Inf,
    threshold = 4,
    store_colors = false,
    store_timesteps = false,
)

plot(hc2)
Example block output

We select "infinity = 15". Since there are clearly two connected components that have a lifetime much larger than others. This lifetime is larger than 15, whereas the lifetime of others is smaller than 15.

Clustering

hc3 = hierarchical_clustering_lem(
    distance_matrix,
    infinity = 10,
    threshold = 4,
    store_colors = true,
    store_timesteps = true,
)

plot(hc3)
Example block output

Getting the number of components, colors of ellipsoids and times of evolution of the clustering

Number of ellipsoids

nellipsoids = length(hc3.startup_indices)
# Ellipsoids colors
ellipsoids_colors = hc3.saved_colors
# Times at which components are born or die
timesteps = hc3.timesteps;

Note: ellipsoids_colors[i] contains the labels of the ellipsoids just before time timesteps[i].

Example:

  • ellipsoids_colors[1] contains only the label 0

Moreover, if there are 2 connected components in the remaining clustering:

  • ellipsoids_colors[end - 1] = ellipsoids_colors[end] contains 2 different labels
  • ellipsoids_colors[end - 2] contains 3 different labels
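The convention can be checked on a toy label history (hypothetical values, not the ones computed above):

```julia
# toy_colors[i] holds the labels just before the i-th timestep;
# label 0 means "not born yet" or "discarded".
toy_colors = [
    [0, 0, 0, 0],   # before the first timestep, nothing is born
    [1, 0, 0, 0],
    [1, 2, 3, 0],   # three components alive
    [1, 1, 2, 0],   # two components remain ...
    [1, 1, 2, 0],   # ... and the last two entries coincide
]

# number of distinct nonzero labels in a labeling
nlabels(v) = length(unique(filter(!=(0), v)))
```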

Using a parameter "threshold" not equal to $\infty$ erases some ellipsoids. Therefore we need to recompute the labels of the data points with respect to the remaining ellipsoids.

remain_indices = hc3.startup_indices
points_colors, distances = subcolorize(data.points, npoints, distance_function, remain_indices)
([3, 1, 19, 23, 9, 19, 18, 21, 9, 7  …  4, 3, 12, 20, 6, 20, 8, 4, 19, 6], [2.315749981293533, 3.816132565039391, 2.880162047151902, 3.2194811169533812, 3.8947497741374466, 3.5901592704888947, 2.4333564312555938, 0.9319242279474274, 5.661826288583886, 0.5896412635063679  …  62.41752856613242, 146.8749249743046, 69.0763291826481, 7.858653442362352, 161.2110981757169, 142.60508756077564, 19.539032004720067, 337.23702532359215, 105.42048180700176, 112.87606117457861])

Removing outliers

Selection of the number of outliers

idxs = zeros(Int, npoints)
sortperm!(idxs, distances, rev = false)
costs = cumsum(distances[idxs])
plot( costs, legend = false, xlabel = "number of points", ylabel = "cost")
Example block output

We choose "nsignal", the number of points at which there is a slope change in the cost curve.

We set the label of the (npoints - nsignal) points with largest cost to 0. These points are considered as outliers.

nsignal = 2100

points_colors[idxs[(nsignal+1):end]] .= 0
300-element view(::Vector{Int64}, [2337, 2207, 2054, 2382, 2246, 2211, 2136, 2056, 2112, 2232  …  2332, 2204, 2245, 2329, 2137, 2034, 2075, 2150, 2037, 2389]) with eltype Int64:
 0
 0
 0
 0
 0
 0
 0
 0
 0
 0
 ⋮
 0
 0
 0
 0
 0
 0
 0
 0
 0

Animation - Clustering result

ellipsoids_frames, points_frames, μ, ω, Σ, sq_time = create_ellipsoids_animation(
    distance_function, timesteps, ellipsoids_colors, points_colors, distances, remain_indices )

nframes = length(points_frames)

anim = @animate for i in [1:nframes; Iterators.repeated(nframes, 30)...]
    ellipsoids(data.points, ellipsoids_frames[i], points_frames[i], μ, ω, Σ, sq_time[i]; markersize = 3)
    xlims!(-60, 60)
    ylims!(-60, 60)
end

gif(anim, "anim_kplm.gif", fps = 5)
Example block output