Hierarchical clustering based on a union of ellipsoids
Example of two spirals
using GeometricClusterAnalysis
using Plots
using Statistics
Data generation
Parameters
nsignal = 2000 # number of signal points
nnoise = 400 # number of outliers
dim = 2 # dimension of the data
σ = 0.5 # standard deviation for the additive noise
0.5
Data generation
rng = MersenneTwister(123)
data = noisy_nested_spirals(rng, nsignal, nnoise, σ, dim)
npoints = size(data.points, 2)
println("The dataset contains $npoints points, of dimension $dim.")
The dataset contains 2400 points, of dimension 2.
Data display
plot(data, aspect_ratio = 1)
Computation of the union of ellipsoids with the kPLM function
Parameters
k = 20 # number of nearest neighbors
c = 30 # number of ellipsoids
iter_max = 20 # maximum number of iterations of the algorithm kPLM
nstart = 10; # number of initializations of the algorithm kPLM
Method
We decide to make no constraints on the ellipsoids, that is, no constraints on the eigenvalues of the matrices directing the ellipsoids.
function f_Σ!(Σ) end
f_Σ! (generic function with 1 method)
The parameter indexed_by_r2 = true
is the default parameter in the function kplm. We should not modify it since some birth times are negative.
It is due to the logarithm of the determinant of some matrices that are negative. This problem can be solved by adding constraints to the matrices, with the argument "f_Σ!". In particular, forcing eigenvalues to be non smaller than 1 works.
Application of the method
distance_function = kplm(rng, data.points, k, c, nsignal, iter_max, nstart, f_Σ!)
KpResult{Float64}:
k = 20
centers = [
[12.09106880600502, 35.512231849941706]
[11.10520495644539, -2.312099964879074]
[31.175588395021816, 9.828570610333857]
[18.71285759525761, -33.45543114227322]
[-28.578498377359935, -4.390107974557133]
[-25.972760286726256, -16.490838307431552]
[-12.020589483335518, 5.687191365519236]
[-23.12024469693999, 11.478240217735193]
[-14.239571475472095, -12.358578620993404]
[3.5497049470821413, 19.076151157010045]
[9.600768508612669, 14.334870661394383]
[0.6540923296266847, -35.903051591486765]
[13.351471186728437, 3.9121753393088134]
[12.738703142064521, 8.913453932763328]
[23.80283427786567, 26.180667519677634]
[-15.943381171133062, -29.301310948924073]
[-26.887619578163637, 5.32742487673712]
[22.514812503349418, -16.78207867672038]
[29.569820840371726, -6.064799836803726]
[1.0193713291756967, 38.71327198695241]
[-6.967111726104882, 8.61449353816522]
[-15.713918326602439, -0.6508444028152078]
[-6.35428857987379, -21.073641491198124]
[-11.047852997167988, 20.094091964326843]
[-17.452687666674855, 16.88506531820494]
[1.8715692010545144, -23.792470974279972]
[15.107330695764665, -21.562989560883665]
[9.299431945474161, -23.536239968723898]
[-3.5686771173016005, 20.920829360238933]
]
colors = [3, 1, 19, 23, 9, 19, 18, 21, 9, 7 … 0, 0, 0, 20, 0, 0, 0, 0, 0, 0]
cost = 3.1573758464749813
Clustering based on the persistence of the union of ellipsoids filtration
Matrix of distances
This is a matrix that contains the birth times of the ellipsoids in the diagonal, and the intersecting times of pairs of ellipsoids in the lower left triangular part of the matrix.
distance_matrix = build_distance_matrix(distance_function)
29×29 Matrix{Float64}:
1.95953 Inf Inf … Inf Inf Inf
133.698 1.25378 Inf Inf Inf Inf
19.4736 80.9993 2.27055 Inf Inf Inf
320.764 264.044 42.1753 Inf Inf Inf
278.687 397.1 2084.0 Inf Inf Inf
366.351 209.601 1412.99 … Inf Inf Inf
236.471 360.066 424.592 Inf Inf Inf
104.216 702.299 463.291 Inf Inf Inf
419.088 86.2906 638.357 Inf Inf Inf
297.584 49.3773 91.4927 Inf Inf Inf
⋮ ⋱
319.283 130.823 242.778 … Inf Inf Inf
186.038 254.46 905.515 Inf Inf Inf
2720.54 67.2223 171.524 Inf Inf Inf
146.79 342.63 266.896 Inf Inf Inf
96.3251 668.731 260.749 Inf Inf Inf
1468.17 69.5428 141.187 … Inf Inf Inf
787.89 95.6592 91.4626 1.05758 Inf Inf
864.089 78.1591 104.547 3.64918 1.21095 Inf
141.487 129.419 127.53 1020.68 1282.04 0.843565
Selection of parameter "threshold"
We draw a persistence diagram based on the filtration of the union of ellipsoids.
Each point corresponds to the connected component of an ellipsoid. The birth time corresponds to the time at which the ellipsoid, and thus the component, appears. The death time corresponds to the time at which the component merges with a component that was born before.
The top left point corresponds to the ellipsoid that appeared first and therefore never merges with a component born before. Its death time is $\infty$.
The parameter "threshold" aims at removing ellipsoids born after time "threshold". Such ellipsoids are considered as irrelevant. This may be due to a bad initialisation of the algorithm that creates ellipsoids in bad directions with respect to the data.
hc1 = hierarchical_clustering_lem(
distance_matrix,
infinity = Inf,
threshold = Inf,
store_colors = false,
store_timesteps = false,
)
plot(hc1)
We consider that ellipsoids born after time "threshold = 4" were not relevant.
Selection of parameter "infinity"
We then have to select parameter "infinity". Connected components which lifetime is larger than "infinity" are components that we want not to die.
hc2 = hierarchical_clustering_lem(
distance_matrix,
infinity = Inf,
threshold = 4,
store_colors = false,
store_timesteps = false,
)
plot(hc2)
We select "infinity = 15". Since there are clearly two connected components that have a lifetime much larger than others. This lifetime is larger than 15, whereas the lifetime of others is smaller than 15.
Clustering
hc3 = hierarchical_clustering_lem(
distance_matrix,
infinity = 10,
threshold = 4,
store_colors = true,
store_timesteps = true,
)
plot(hc3)
Getting the number of components, colors of ellipsoids and times of evolution of the clustering
Number of ellipsoids
nellipsoids = length(hc3.startup_indices)
# Ellispoids colors
ellipsoids_colors = hc3.saved_colors
# Time at which a component borns or dies
timesteps = hc3.timesteps;
Note : ellipsoids_colors[i]
contains the labels of the ellipsoids just before the time timesteps[i]
Example :
ellipsoids_colors[1]
contains only 0 labels
Moreover, if there are 2 connexed components in the remaining clustering :
ellipsoids_colors[end - 1] = ellipsoids_colors[end]
contains 2 different labelsellipsoids_colors[end - 2]
contains 3 different labels
Using a parameter threshold not equal to $\infty$ erases some ellipsoids. Therefore we need to compute new labels of the data points, with respect to the new ellipsoids.
remain_indices = hc3.startup_indices
points_colors, distances = subcolorize(data.points, npoints, distance_function, remain_indices)
([3, 1, 19, 23, 9, 19, 18, 21, 9, 7 … 4, 3, 12, 20, 6, 20, 8, 4, 19, 6], [2.315749981293533, 3.816132565039391, 2.880162047151902, 3.2194811169533812, 3.8947497741374466, 3.5901592704888947, 2.4333564312555938, 0.9319242279474274, 5.661826288583886, 0.5896412635063679 … 62.41752856613242, 146.8749249743046, 69.0763291826481, 7.858653442362352, 161.2110981757169, 142.60508756077564, 19.539032004720067, 337.23702532359215, 105.42048180700176, 112.87606117457861])
Removing outliers
Selection of the number of outliers
idxs = zeros(Int, npoints)
sortperm!(idxs, distances, rev = false)
costs = cumsum(distances[idxs])
plot( costs, legend = false, xlabel = "number of points", ylabel = "cost")
We choose "nsignal", the number of points at which there is a slope change in the cost curve.
We set the label of the (npoints - nsignal) points with largest cost to 0. These points are considered as outliers.
nsignal = 2100
points_colors[idxs[(nsignal+1):end]] .= 0
300-element view(::Vector{Int64}, [2337, 2207, 2054, 2382, 2246, 2211, 2136, 2056, 2112, 2232 … 2332, 2204, 2245, 2329, 2137, 2034, 2075, 2150, 2037, 2389]) with eltype Int64:
0
0
0
0
0
0
0
0
0
0
⋮
0
0
0
0
0
0
0
0
0
Animation - Clustering result
ellipsoids_frames, points_frames, μ, ω, Σ, sq_time = create_ellipsoids_animation(
distance_function, timesteps, ellipsoids_colors, points_colors, distances, remain_indices )
nframes = length(points_frames)
anim = @animate for i in [1:nframes; Iterators.repeated(nframes, 30)...]
ellipsoids(data.points, ellipsoids_frames[i], points_frames[i], μ, ω, Σ, sq_time[i]; markersize = 3)
xlims!(-60, 60)
ylims!(-60, 60)
end
gif(anim, "anim_kplm.gif", fps = 5)