Hierarchical clustering based on a union of ellipsoids
Example of two spirals
using GeometricClusterAnalysis
using Plots
using Statistics
Data generation
Parameters
nsignal = 2000 # number of signal points
nnoise = 400 # number of outliers
dim = 2 # dimension of the data
σ = 0.5 # standard deviation for the additive noise
0.5
Data generation
rng = MersenneTwister(123)
data = noisy_nested_spirals(rng, nsignal, nnoise, σ, dim)
npoints = size(data.points, 2)
println("The dataset contains $npoints points, of dimension $dim.")
The dataset contains 2400 points, of dimension 2.
Data display
plot(data, aspect_ratio = 1)
Computation of the union of ellipsoids with the kPLM function
Parameters
k = 20 # number of nearest neighbors
c = 30 # number of ellipsoids
iter_max = 20 # maximum number of iterations of the algorithm kPLM
nstart = 10; # number of initializations of the algorithm kPLM
Method
We decide to make no constraints on the ellipsoids, that is, no constraints on the eigenvalues of the matrices directing the ellipsoids.
function f_Σ!(Σ) end
f_Σ! (generic function with 1 method)
The parameter indexed_by_r2 = true
is the default parameter in the function kplm. We should not modify it since some birth times are negative.
It is due to the logarithm of the determinant of some matrices that are negative. This problem can be solved by adding constraints to the matrices, with the argument "f_Σ!". In particular, forcing eigenvalues to be non smaller than 1 works.
Application of the method
distance_function = kplm(rng, data.points, k, c, nsignal, iter_max, nstart, f_Σ!)
KpResult{Float64}:
k = 20
centers = [
[-10.860789972992192, 19.970795164085693]
[16.00475015537982, -21.31496930064836]
[6.1023299556013875, 17.54903931651983]
[-19.746254787024657, 14.952669606314805]
[25.528986764859425, -13.442439423421728]
[30.42997054009674, -2.7391180438203344]
[0.7032262204928013, -35.949058375193076]
[11.421353634115135, 11.907429341324686]
[11.51676220731877, -35.75246460501173]
[-14.482680166287082, 2.3007061478146347]
[0.6920337333034114, 38.97007370882358]
[-18.789931171009325, -27.31774602168406]
[9.209749635838328, 36.75954117490916]
[-28.361493915674316, -6.781154021454974]
[-8.414604469538876, -19.543901828005655]
[2.5711264420104674, -23.632892618222193]
[16.069480513820515, 33.31690276725144]
[-24.981864843657377, -19.018210713096064]
[-26.57699510377622, 4.784311668419593]
[30.046156034346293, 14.288544217032054]
[22.513951973689547, -31.928160418986252]
[13.18486557024591, 3.740197937097277]
[10.700645437236728, -2.9912087693354663]
[-1.061970003081628, 20.598676632091724]
[-15.593824428314335, -8.948208114255388]
[-7.707035110095129, 8.445814680999092]
[-10.09171953613707, -33.03908975834024]
[23.38721194276153, 26.554085130187104]
]
colors = [15, 28, 25, 0, 6, 20, 16, 20, 17, 16 … 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
cost = 3.1055741238750003
Clustering based on the persistence of the union of ellipsoids filtration
Matrix of distances
This is a matrix that contains the birth times of the ellipsoids in the diagonal, and the intersecting times of pairs of ellipsoids in the lower left triangular part of the matrix.
distance_matrix = build_distance_matrix(distance_function)
28×28 Matrix{Float64}:
0.997921 Inf Inf … Inf Inf Inf
2219.19 0.794625 Inf Inf Inf Inf
12.6782 270.956 0.532123 Inf Inf Inf
3.40809 1487.47 31.4601 Inf Inf Inf
506.898 4.5751 149.438 Inf Inf Inf
184.65 10.4047 93.1143 … Inf Inf Inf
1234.97 42.1397 1216.95 Inf Inf Inf
29.949 145.532 3.38581 Inf Inf Inf
3615.26 106.89 584.439 Inf Inf Inf
35.0316 261.5 46.3986 Inf Inf Inf
⋮ ⋱ ⋮
102.619 33.3004 92.1652 Inf Inf Inf
2024.69 184.388 376.066 … Inf Inf Inf
76.0247 90.6043 14.0412 Inf Inf Inf
310.086 201.926 51.493 Inf Inf Inf
3.21365 414.258 3.7872 Inf Inf Inf
44.4365 49.6346 182.182 Inf Inf Inf
137.021 1465.96 46.1403 … 0.644328 Inf Inf
627.77 42.4105 2557.85 439.845 1.07454 Inf
80.3238 86.1381 172.663 187.794 1126.33 2.07069
Selection of parameter "threshold"
We draw a persistence diagram based on the filtration of the union of ellipsoids.
Each point corresponds to the connected component of an ellipsoid. The birth time corresponds to the time at which the ellipsoid, and thus the component, appears. The death time corresponds to the time at which the component merges with a component that was born before.
The top left point corresponds to the ellipsoid that appeared first and therefore never merges with a component born before. Its death time is $\infty$.
The parameter "threshold" aims at removing ellipsoids born after time "threshold". Such ellipsoids are considered as irrelevant. This may be due to a bad initialisation of the algorithm that creates ellipsoids in bad directions with respect to the data.
hc1 = hierarchical_clustering_lem(
distance_matrix,
infinity = Inf,
threshold = Inf,
store_colors = false,
store_timesteps = false,
)
plot(hc1)
We consider that ellipsoids born after time "threshold = 4" were not relevant.
Selection of parameter "infinity"
We then have to select parameter "infinity". Connected components which lifetime is larger than "infinity" are components that we want not to die.
hc2 = hierarchical_clustering_lem(
distance_matrix,
infinity = Inf,
threshold = 4,
store_colors = false,
store_timesteps = false,
)
plot(hc2)
We select "infinity = 15". Since there are clearly two connected components that have a lifetime much larger than others. This lifetime is larger than 15, whereas the lifetime of others is smaller than 15.
Clustering
hc3 = hierarchical_clustering_lem(
distance_matrix,
infinity = 10,
threshold = 4,
store_colors = true,
store_timesteps = true,
)
plot(hc3)
Getting the number of components, colors of ellipsoids and times of evolution of the clustering
Number of ellipsoids
nellipsoids = length(hc3.startup_indices)
# Ellispoids colors
ellipsoids_colors = hc3.saved_colors
# Time at which a component borns or dies
timesteps = hc3.timesteps;
Note : ellipsoids_colors[i]
contains the labels of the ellipsoids just before the time timesteps[i]
Example :
ellipsoids_colors[1]
contains only 0 labels
Moreover, if there are 2 connexed components in the remaining clustering :
ellipsoids_colors[end - 1] = ellipsoids_colors[end]
contains 2 different labelsellipsoids_colors[end - 2]
contains 3 different labels
Using a parameter threshold not equal to $\infty$ erases some ellipsoids. Therefore we need to compute new labels of the data points, with respect to the new ellipsoids.
remain_indices = hc3.startup_indices
points_colors, distances = subcolorize(data.points, npoints, distance_function, remain_indices)
([15, 28, 25, 15, 6, 20, 16, 20, 17, 16 … 24, 15, 6, 11, 14, 6, 9, 24, 11, 1], [5.056903762672652, 4.538266417572684, 1.8430166100578418, 9.5536893016661, 6.059080912769491, 2.722670242635671, 4.836228149273284, 5.564645694927842, 1.4374791563411444, 2.3310649964686 … 29.172850822469016, 300.49653109603753, 26.556311888134914, 17.633627530538348, 161.3582506246063, 144.65867510604483, 73.39341913055875, 152.89693367584908, 35.17630745652872, 227.94678809591284])
Removing outliers
Selection of the number of outliers
idxs = zeros(Int, npoints)
sortperm!(idxs, distances, rev = false)
costs = cumsum(distances[idxs])
plot( costs, legend = false, xlabel = "number of points", ylabel = "cost")
We choose "nsignal", the number of points at which there is a slope change in the cost curve.
We set the label of the (npoints - nsignal) points with largest cost to 0. These points are considered as outliers.
nsignal = 2100
points_colors[idxs[(nsignal+1):end]] .= 0
300-element view(::Vector{Int64}, [2240, 2297, 2119, 2058, 2069, 2061, 2131, 2280, 2053, 2374 … 2338, 2019, 2215, 2365, 2266, 2044, 2235, 2008, 2205, 2368]) with eltype Int64:
0
0
0
0
0
0
0
0
0
0
⋮
0
0
0
0
0
0
0
0
0
Animation - Clustering result
ellipsoids_frames, points_frames, μ, ω, Σ, sq_time = create_ellipsoids_animation(
distance_function, timesteps, ellipsoids_colors, points_colors, distances, remain_indices )
nframes = length(points_frames)
anim = @animate for i in [1:nframes; Iterators.repeated(nframes, 30)...]
ellipsoids(data.points, ellipsoids_frames[i], points_frames[i], μ, ω, Σ, sq_time[i]; markersize = 3)
xlims!(-60, 60)
ylims!(-60, 60)
end
gif(anim, "anim_kplm.gif", fps = 5)
