Two-dimensional data from the Poisson distribution
Generation of variables from a mixture of Poisson distributions
We generate a second sample of 950 points in $\mathbb{R}^2$. The two coordinates of each point are independent, each sampled from a Poisson distribution with parameter $10$, $20$ or $40$ (for every point, the parameter is chosen with probability $\frac13$). Then, a sample of 50 outliers is generated from the uniform distribution on $[0,120]\times[0,120]$. We denote by x the sample containing these 1000 points.
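The generation just described can be sketched in plain Julia. The functions rand_poisson and sample_mixture below are illustrative stand-ins for the package helpers sample_poisson and sample_outliers used next, not their actual implementation:

```julia
using Random

# Knuth's multiplication sampler for a Poisson(λ) variate.
# Fine for the moderate λ used here (exp(-40) is still representable).
function rand_poisson(rng, λ)
    L = exp(-λ)
    k, p = 0, 1.0
    while p > L
        k += 1
        p *= rand(rng)
    end
    return k - 1
end

# Mixture sample: each point picks one of the parameters uniformly at random,
# then draws its two coordinates independently from Poisson(λ).
function sample_mixture(rng, n, lambdas)
    labels = rand(rng, 1:length(lambdas), n)
    points = [rand_poisson(rng, lambdas[labels[j]]) for i in 1:2, j in 1:n]
    points, labels
end

rng = MersenneTwister(1)
points, labels = sample_mixture(rng, 950, [10, 20, 40])
outliers = 120 .* rand(rng, 2, 50)   # uniform outliers on [0,120]×[0,120]
data = hcat(points, outliers)
size(data)                           # (2, 1000)
```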
using GeometricClusterAnalysis
import GeometricClusterAnalysis: sample_poisson, sample_outliers, performance
using Plots
using Random
n = 1000
n_outliers = 50
d = 2
rng = MersenneTwister(1)
lambdas = [10,20,40]
proba = [1/3,1/3,1/3]
points, labels = sample_poisson(rng, n - n_outliers, d, lambdas, proba)
outliers = sample_outliers(rng, n_outliers, d; scale = 120)
x = hcat(points, outliers)
labels_true = vcat(labels, zeros(Int, n_outliers))
scatter( x[1,:], x[2,:], c = labels_true, palette = :rainbow)
Data clustering on an example
In order to cluster the data, we will use the following parameters.
k = 3
α = 0.03
maxiter = 50
nstart = 50
Using the classical algorithm: Trimmed $k$-means
(Cuesta-Albertos et al., 1997)
First, we use our algorithm trimmed_bregman_clustering with the squared Euclidean distance euclidean.
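The trimming idea of Cuesta-Albertos et al. can be pictured as Lloyd iterations in which, before the centers are updated, the $\lceil \alpha n \rceil$ points farthest from their nearest center are discarded. A minimal, hypothetical sketch (not the package's implementation, which handles general Bregman divergences and multiple restarts):

```julia
using Random, Statistics

# Minimal trimmed k-means sketch: Lloyd iterations in which the ⌈αn⌉ points
# farthest from their nearest center are trimmed before recomputing centers.
function trimmed_kmeans_sketch(rng, x, k, α; maxiter = 20)
    d, n = size(x)
    ntrim = ceil(Int, α * n)
    centers = x[:, randperm(rng, n)[1:k]]              # random initialisation
    assign, keep = zeros(Int, n), trues(n)
    for _ in 1:maxiter
        dists = [sum(abs2, x[:, j] .- centers[:, c]) for c in 1:k, j in 1:n]
        assign = [argmin(view(dists, :, j)) for j in 1:n]
        best = [dists[assign[j], j] for j in 1:n]      # distance to own center
        cutoff = partialsort(best, n - ntrim)          # trimming threshold
        keep = best .<= cutoff
        for c in 1:k                                   # update on kept points only
            sel = keep .& (assign .== c)
            any(sel) && (centers[:, c] = vec(mean(x[:, sel], dims = 2)))
        end
    end
    centers, assign, keep
end

rng = MersenneTwister(0)
x = hcat(randn(rng, 2, 45), randn(rng, 2, 45) .+ 8, 30 .* rand(rng, 2, 10))
centers, assign, keep = trimmed_kmeans_sketch(rng, x, 2, 0.1)
```

With $\alpha = 0.1$ and $n = 100$, the 10 points farthest from their nearest center are excluded from the center updates, which is what protects the means from the uniform outliers.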
tb_kmeans = trimmed_bregman_clustering( rng, x, k, α, euclidean, maxiter, nstart )
println("k-means : $(tb_kmeans.centers)")
k-means : [40.38294722784871 10.537657975809905 20.464713649163052; 40.354190104912355 10.271275601080664 20.829711544491296]
The three clusters have roughly the same diameter, so many outliers are assigned to the cluster of points centred around the mean $(10,10)$. This cluster should actually have a smaller diameter than the clusters of points generated from the Poisson distributions with means $(20,20)$ and $(40,40)$.
plot(tb_kmeans)
Bregman divergence selection for the Poisson distribution
When using the Bregman divergence associated with the Poisson distribution, the clusters have different diameters, well suited to the data. Moreover, the estimators tb_poisson.centers of the means are better.
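For reference, the Bregman divergence generated by $\phi(t) = t\log t - t$ (the Poisson case) is $d_\phi(x,y)=\sum_i x_i\log(x_i/y_i) - x_i + y_i$, just as the squared Euclidean distance is the divergence generated by $\phi(t)=t^2$. A hand-rolled sketch of this formula (the package's poisson may use different conventions):

```julia
# x·log(x/y) with the convention 0·log(0/y) = 0
xlogxy(x, y) = iszero(x) ? zero(float(y)) : x * log(x / y)

# Coordinatewise Bregman divergence for the generator φ(t) = t log t − t:
# d(x, y) = Σᵢ [ xᵢ log(xᵢ/yᵢ) − xᵢ + yᵢ ]
poisson_divergence(x, y) = sum(xlogxy.(x, y) .- x .+ y)

poisson_divergence([4.0], [2.0])   # 4log2 − 2 ≈ 0.77
```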
tb_poisson = trimmed_bregman_clustering( rng, x, k, α, poisson, maxiter, nstart )
println("poisson : $(tb_poisson.centers)")
poisson : [20.126198864736047 10.324096110361406 40.33912028708773; 20.34891612932911 10.04219126366854 40.249302368137506]
plot(tb_poisson)
Performance comparison
We measure the performance of the two clustering methods (with the squared Euclidean distance and with the Bregman divergence associated with the Poisson distribution), using the normalised mutual information (NMI).
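As a reminder, the NMI is the mutual information between the two labelings divided by a normalising entropy term. The sketch below normalises by $\sqrt{H_1 H_2}$, one common convention; Clustering.jl's mutualinfo may use a different normalisation:

```julia
# Normalised mutual information between two labelings, normalised by √(H₁H₂).
function nmi_sketch(a, b)
    n = length(a)
    va, vb = unique(a), unique(b)
    pa = [count(==(x), a) / n for x in va]   # marginal frequencies of a
    pb = [count(==(y), b) / n for y in vb]   # marginal frequencies of b
    mi = 0.0
    for (i, x) in enumerate(va), (j, y) in enumerate(vb)
        pxy = count(t -> a[t] == x && b[t] == y, 1:n) / n
        pxy > 0 && (mi += pxy * log(pxy / (pa[i] * pb[j])))
    end
    ha = -sum(p * log(p) for p in pa)        # entropies of the marginals
    hb = -sum(p * log(p) for p in pb)
    mi / sqrt(ha * hb)
end

nmi_sketch([1, 1, 2, 2], [1, 1, 2, 2])   # identical labelings → ≈ 1.0
```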
import Clustering: mutualinfo
println("k-means : $(mutualinfo( tb_kmeans.cluster, labels_true, normed = true ))")
println("poisson : $(mutualinfo( tb_poisson.cluster, labels_true, normed = true ))")
k-means : 0.8137090523241111
poisson : 0.8259135879044363
The normalised mutual information is larger for the Bregman divergence associated with the Poisson distribution. This illustrates the fact that, for this example, using the correct divergence improves the clustering: the performance is better than for a classical trimmed $k$-means.
Performance measurement
In order to check that the method with the correct Bregman divergence outperforms trimmed $k$-means, we replicate the experiment replications times. In particular, we run the algorithm trimmed_bregman_clustering, replications times, on samples of size $n = 1000$ generated according to the aforementioned procedure. The function performance does this.
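Roughly, such a helper replicates the pipeline "sample → cluster → compare with the true labels" and summarises the scores. A hypothetical sketch, where sample_fn, cluster_fn and nmi_fn stand in for the data generators, trimmed_bregman_clustering and mutualinfo:

```julia
using Random, Statistics

# Replicate "sample → cluster → score" and summarise the NMI values:
# returns all scores, their mean, and the standard error of the mean.
function replicate_nmi(rng, replications, sample_fn, cluster_fn, nmi_fn)
    scores = Float64[]
    for _ in 1:replications
        x, labels_true = sample_fn(rng)
        push!(scores, nmi_fn(cluster_fn(rng, x), labels_true))
    end
    scores, mean(scores), std(scores) / sqrt(replications)
end

# Toy stand-ins: the "clustering" recovers the labels exactly.
sample_fn  = rng -> (randn(rng, 2, 6), [1, 1, 2, 2, 3, 3])
cluster_fn = (rng, x) -> [1, 1, 2, 2, 3, 3]
nmi_fn     = (a, b) -> (a == b ? 1.0 : 0.0)
scores, m, se = replicate_nmi(MersenneTwister(0), 20, sample_fn, cluster_fn, nmi_fn)
```

This matches the shape of the output above: a vector of per-replication scores followed by summary statistics.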
sample_generator = (rng, n) -> sample_poisson(rng, n, d, lambdas, proba)
outliers_generator = (rng, n) -> sample_outliers(rng, n, d; scale = 120)
nmi_kmeans, _, _ = performance(n, n_outliers, k, α, sample_generator, outliers_generator, euclidean)
nmi_poisson, _, _ = performance(n, n_outliers, k, α, sample_generator, outliers_generator, poisson)
([0.8356598819265075, 0.8522312505329743, 0.8591345518407613, 0.8610637292766185, 0.8745705543779904, 0.8617018594300161, 0.896725093980184, 0.8339466059687569, 0.8455702227968579, 0.8543486895823693 … 0.8425314667867644, 0.8252535684443999, 0.8517394503721635, 0.8517282945103648, 0.8403440272336288, 0.8618119879135011, 0.8505247390669495, 0.845877014737236, 0.8617574721117149, 0.8514048231924788], 0.8563155121836071, 0.0033341253272724507)
The boxplots show the NMI for the two methods. The method using the Bregman divergence associated with the Poisson distribution outperforms the trimmed $k$-means method.
using StatsPlots
boxplot( ones(100), nmi_kmeans, label = "kmeans" )
boxplot!( fill(2, 100), nmi_poisson, label = "poisson" )
Selection of the parameters $k$ and $\alpha$
We still use the dataset x.
vect_k = collect(1:5)
vect_α = sort([((0:2)./50)...,((1:4)./5)...])
rng = MersenneTwister(42)
nstart = 5
params_risks = select_parameters_nonincreasing(rng, vect_k, vect_α, x, poisson, maxiter, nstart)
plot(; title = "select parameters")
for (i,k) in enumerate(vect_k)
plot!( vect_α, params_risks[i, :], label ="k=$k", markershape = :circle )
end
xlabel!("alpha")
ylabel!("risk")
According to the graph, the risk decreases from 1 to 2 clusters, and again from 2 to 3 clusters. However, there is no gain in risk from 3 to 4 clusters or from 4 to 5 clusters: the curves for $k = 3$, $k = 4$ and $k = 5$ are very close. So we cluster the data into $k = 3$ clusters.
The curve for $k = 3$ decreases strongly, with a slope that stabilises around $\alpha = 0.04$.
For a finer selection of the parameter $\alpha$, we focus on the curve $k = 3$: we increase the nstart parameter and look at small values of $\alpha$.
vec_k = [3]
vec_α = collect(0:15) ./ 200
params_risks = select_parameters_nonincreasing(rng, vec_k, vec_α, x, poisson, maxiter, nstart)
plot(vec_α, params_risks[1, :], markershape = :circle)
There is no strong change in the slope, although it stabilises after $\alpha = 0.04$. Therefore, we select the parameter $\alpha = 0.04$.
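This "slope stabilises" criterion can be automated with a small heuristic: scan the risk curve and pick the first $\alpha$ after which the decrease becomes negligible. A hypothetical sketch (select_alpha is not part of the package):

```julia
# Pick the smallest α after which the risk curve stops decreasing by more
# than a fraction `tol` of the initial risk (the "stable slope" criterion).
function select_alpha(αs, risks; tol = 1e-2)
    drops = risks[1:end-1] .- risks[2:end]          # decrease at each step
    i = findfirst(<(tol * abs(risks[1])), drops)    # first nearly-flat step
    isnothing(i) ? αs[end] : αs[i]
end

select_alpha([0.0, 0.01, 0.02, 0.03, 0.04], [10.0, 5.0, 2.0, 1.9, 1.89])  # → 0.03
```

In practice one would still inspect the curve, since a fixed tolerance cannot distinguish a genuine elbow from noise in the risk estimates.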
k, α = 3, 0.04
tb = trimmed_bregman_clustering( rng, x, k, α, poisson, maxiter, nstart )
plot(tb)