Two-dimensional data from the Poisson distribution

Generation of variables from a mixture of Poisson distributions

We generate a second sample of 950 points in $\mathcal{R}^2$. The two coordinates of each point are independent. They are sampled according to a Poisson distribution with parameter $10$, $20$ or$40$ (for every point, the parameter is chosen with probability $\frac13$). Then, a sample of 50 outliers is generated according to the Uniform distribution on $[0,120]\times[0,120]$. We denote by x the sample containing these 1000 points.

using GeometricClusterAnalysis
import GeometricClusterAnalysis: sample_poisson, sample_outliers, performance
using Plots
using Random

n = 1000
n_outliers = 50
d = 2

rng = MersenneTwister(1)
lambdas =  [10,20,40]
proba = [1/3,1/3,1/3]
points, labels = sample_poisson(rng, n - n_outliers, d, lambdas, proba)

outliers = sample_outliers(rng, n_outliers, d; scale = 120)
x = hcat(points, outliers)
labels_true = vcat(labels, zeros(Int, n_outliers))

scatter( x[1,:], x[2,:], c = labels_true, palette = :rainbow)

Data clustering on an example

In order to cluster the data, we will use the following parameters.

k = 3
α = 0.03
maxiter = 50
nstart = 50

Using the classical algorithm : Trimmed $k$-means

(Cuesta-Albertos et al., 1997)

Firstly, we use our algorithm trimmed_bregman_clustering with the squared Euclidean distance euclidean.

tb_kmeans = trimmed_bregman_clustering( rng, x, k, α, euclidean, maxiter, nstart )
println("k-means : $(tb_kmeans.centers)")

k-means : [40.983088267977884 20.580502670919078 10.071428571428571; 40.56059919818226 20.760109777804345 9.970238095238095]

We see three clusters with the same diameter. So, multiple outliers are assigned to a cluster of points associated to the mean $(10,10)$. This cluster was actually supposted to have a diameter smaller than for points generated from Poisson distributions with means $(20,20)$ and $(40,40)$.

plot(tb_kmeans)

Bregman divergence selection for the Poisson distribution

When using the Bregman divergence associated to the Poisson distribution, the clusters have various diameters. These diameters are well suited for the data. Moreover, the estimators tB_Poisson$centers of the means are better..

tb_poisson = trimmed_bregman_clustering( rng, x, k, α, poisson, maxiter, nstart )
println("poisson : $(tb_poisson.centers)")

poisson : [20.14988009856211 40.950637235466054 9.884735202492212; 20.257415080497523 40.37321167956688 9.71651090342679]

plot(tb_poisson)

Performance comparison

We measure the performance of the two clustering methods (with the squared Euclidean distance and with the Bregman divergence associated to the Poisson distribution), using the normalised mutual information (NMI).

import Clustering: mutualinfo
println("k-means : $(mutualinfo( tb_kmeans.cluster, labels_true, normed = true ))")
println("poisson : $(mutualinfo( tb_poisson.cluster, labels_true, normed = true ))")

k-means : 0.8481027382130856
poisson : 0.8753602251104734

The normalised mutual information is larger for the Bregman divergence associated to the Poisson distribution. This illustrates the fact that, for this example, using the correct divergence improves the clustering : the performance is better than for a classical trimmed k-means.

Performance measurement

In order to ensure that the method with the correct Bregman divergence outperforms Trimmed k-means, we replicate the experiment replications times.

In particular, we replicate the algorithm trimmed_bregman_clustering, replications times, on samples of size $n = 1000$, on data generated according to the aforementionned procedure.

The function performance does it.

sample_generator = (rng, n) -> sample_poisson(rng, n, d, lambdas, proba)
outliers_generator = (rng, n) -> sample_outliers(rng, n, d; scale = 120)

nmi_kmeans, _, _ = performance(n, n_outliers, k, α, sample_generator, outliers_generator, euclidean)
nmi_poisson, _, _ = performance(n, n_outliers, k, α, sample_generator, outliers_generator, poisson)

([0.8831555050177946, 0.8424465145396753, 0.8708006443576987, 0.8738926068592313, 0.8452830664564249, 0.8309342220916809, 0.8741016321062142, 0.8470718893079344, 0.8476803575255522, 0.8787996459490521  …  0.8746630533541553, 0.877400081947486, 0.8672016591873275, 0.8660710816698721, 0.8735827091118009, 0.8683511483019623, 0.8400178766290994, 0.8745556186299602, 0.8628873755156736, 0.8604254185215433], 0.8586688157310582, 0.002621721405614691)

The boxplots show the NMI on the two different methods. The method using the Bregman divergence associated to the Poisson distribution outperfoms the Trimmed k-means method.

using StatsPlots

boxplot( ones(100), nmi_kmeans, label = "kmeans" )
boxplot!( fill(2, 100), nmi_poisson, label = "poisson" )

Selection of the parameters $k$ and $\alpha$

We still use the dataset x.

vect_k = collect(1:5)
vect_α = sort([((0:2)./50)...,((1:4)./5)...])

rng = MersenneTwister(42)
nstart = 5

params_risks = select_parameters_nonincreasing(rng, vect_k, vect_α, x, poisson, maxiter, nstart)

plot(; title = "select parameters")
for (i,k) in enumerate(vect_k)
   plot!( vect_α, params_risks[i, :], label ="k=$k", markershape = :circle )
end
xlabel!("alpha")
ylabel!("NMI")

According to the graph, the risk decreases from 1 to 2 clusters, and as well from 2 to 3 clusters. However, there is no gain in terms of risk from 3 to 4 clusters or from 4 to 5 clusters. Indeed, the curves with parameters $k = 3$, $k = 4$ and $k = 5$ are very close. So we will cluster the data into $k = 3$ clusters.

The curve with parameter $k = 3$ strongly decreases, with a slope that is stable around $\alpha = 0.04$.

For more details about the selection of the parameter $\alpha$, we may focus on the curve $k = 3$. We may increase the nstart parameter and focus on small values of $\alpha$.

vec_k = [3]
vec_α = collect(0:15) ./ 200
params_risks = select_parameters_nonincreasing(rng, vec_k, vec_α, x, poisson, maxiter, nstart)

plot(vec_α, params_risks[1, :], markershape = :circle)

There is no strong modification of the slope. Although the slope is stable after $\alpha = 0.04$. Therefore, we select the parameter $\alpha = 0.04$.

k, α = 3, 0.04
tb = trimmed_bregman_clustering( rng, x, k, α, poisson, maxiter, nstart )
plot(tb)