One-dimensional data from the Poisson distribution

Generation of variables from a mixture of Poisson distributions

The function sample_poisson generates random variables according to a mixture of $k$ Poisson distributions in dimension $d$. The parameters are given in the $k \times d$ matrix lambdas, and the probabilities of the mixture components in the vector proba.
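
For intuition, here is a minimal sketch of what such a sampler can look like. This is a hypothetical re-implementation for illustration only, not the package's actual code; it assumes Distributions.jl is available, and sample_poisson_sketch is an invented name.

using Distributions, Random

# Hypothetical sketch of a mixture-of-Poissons sampler: draw a component label
# with probabilities `proba`, then draw each of the d coordinates from a
# Poisson distribution with the corresponding rate. Returns a d × n matrix of
# points together with the vector of component labels.
function sample_poisson_sketch(rng, n, d, lambdas, proba)
    labels = [rand(rng, Categorical(proba)) for _ in 1:n]
    points = [float(rand(rng, Poisson(lambdas[labels[j], i]))) for i in 1:d, j in 1:n]
    return points, labels
end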

The function sample_outliers generates random variables uniformly on the hypercube $[0,L]^d$. We will use it to generate outliers.
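
Schematically, such a sampler amounts to rescaling uniform draws (again a hypothetical sketch with an invented name, not the package's code):

using Random

# Hypothetical sketch of a uniform sampler on [0, L]^d, returned as a d × n
# matrix so that it can be concatenated with the Poisson sample via hcat.
sample_outliers_sketch(rng, n, d; scale = 1.0) = scale .* rand(rng, d, n)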

We generate a first sample of 950 points from the mixture of Poisson distributions with parameters $10$, $20$ and $40$, each component having probability $\frac13$. Then, we generate 50 outliers from the uniform distribution on $[0,120]$. We denote by x the resulting sample.

using GeometricClusterAnalysis
import GeometricClusterAnalysis: sample_poisson, sample_outliers, performance
using Plots
using Random

n = 1000
n_outliers = 50
d = 1

rng = MersenneTwister(1)
lambdas = [10, 20, 40]
proba = [1/3, 1/3, 1/3]
points, labels = sample_poisson(rng, n - n_outliers, d, lambdas, proba)

outliers = sample_outliers(rng, n_outliers, d; scale = 120)

x = hcat(points, outliers)
labels_true = vcat(labels, zeros(Int, n_outliers))
scatter( x[1,:], c = labels_true, palette = :rainbow)
[Figure: scatter plot of the sample x, coloured by the true labels]

Data clustering on an example

In order to cluster the data, we will use the following parameters.

k = 3 # Number of clusters in the clustering
alpha = 0.04 # Proportion of outliers
maxiter = 50 # Maximal number of iterations
nstart = 20 # Number of initialisations of the algorithm (the best result is kept)
20

Using the classical algorithm: Trimmed $k$-means (Cuesta-Albertos et al., 1997)

First, we use our algorithm trimmed_bregman_clustering with the squared Euclidean distance euclidean.
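
As a reminder, trimmed $k$-means (a standard formulation, not specific to this package) looks for $k$ centers while discarding a proportion $\alpha$ of the points, that is, it minimises

$\min_{c_1,\dots,c_k} \; \min_{|S| = \lceil (1-\alpha) n \rceil} \; \sum_{i \in S} \min_{1 \le j \le k} \lVert x_i - c_j \rVert^2,$

so that the $\lfloor \alpha n \rfloor$ points furthest from the centers (ideally the outliers) do not contribute to the criterion.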

tb_kmeans = trimmed_bregman_clustering(rng, x, k, alpha, euclidean, maxiter, nstart)
tb_kmeans.centers
1×3 Matrix{Float64}:
 22.116  40.9987  10.3412

This method corresponds to the Trimmed $k$-means of (Cuesta-Albertos et al., 1997).

We see three clusters with the same diameter. In particular, the cluster centered at $10$ also captures points from the cluster centered at $20$. Consequently, the estimates tb_kmeans.centers of the three means are not very satisfactory: they are all larger than the true means $10$, $20$ and $40$.

plot(tb_kmeans)
[Figure: clustering obtained with trimmed $k$-means]

Bregman divergence selection for the Poisson distribution

When using the Bregman divergence associated to the Poisson distribution, the clusters have different diameters, well suited to the data. Moreover, the estimates tb_poisson.centers of the means are better.
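
Recall the standard definitions (independent of this package): the Bregman divergence generated by a strictly convex function $\phi$ is $d_\phi(x, y) = \phi(x) - \phi(y) - \langle \nabla\phi(y), x - y \rangle$. The choice $\phi(u) = \lVert u \rVert^2$ gives the squared Euclidean distance, while $\phi(u) = u \log u$ gives the divergence associated to the Poisson distribution, which in dimension $d$ reads

$d_{\mathrm{Poisson}}(x, y) = \sum_{i=1}^{d} \left( x_i \log\frac{x_i}{y_i} - x_i + y_i \right).$

This divergence measures deviations relative to the scale of the center, which is why clusters with larger means may have larger diameters.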

tb_poisson = trimmed_bregman_clustering(rng, x, k, alpha, poisson, maxiter, nstart)
tb_poisson.centers
1×3 Matrix{Float64}:
 40.7604  20.6136  9.71071
plot(tb_poisson)
[Figure: clustering obtained with the Poisson Bregman divergence]

Performance comparison

We measure the performance of the two clustering methods (with the squared Euclidean distance and with the Bregman divergence associated to the Poisson distribution), using the normalised mutual information (NMI).
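
The mutual information $I(U, V)$ between two clusterings $U$ and $V$ quantifies how much knowing one reduces the uncertainty about the other. Several normalisations exist; a common one divides by the average of the entropies,

$\mathrm{NMI}(U, V) = \frac{2 \, I(U, V)}{H(U) + H(V)},$

so that the NMI lies in $[0, 1]$, with $1$ for identical clusterings up to relabelling.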

For the Trimmed k-means (that is, with the squared Euclidean distance):

import Clustering: mutualinfo

println(mutualinfo(labels_true, tb_kmeans.cluster, normed = true))
0.641631775173803

For the trimmed clustering method with the Bregman divergence associated to the Poisson distribution:

println(mutualinfo(labels_true, tb_poisson.cluster, normed = true))
0.6826165935516633

The normalised mutual information is larger for the Bregman divergence associated to the Poisson distribution. This illustrates that, on this example, using the correct divergence improves the clustering: the performance is better than with the classical trimmed $k$-means.

Performance measurement

In order to ensure that the method with the correct Bregman divergence outperforms trimmed $k$-means, we replicate the experiment replications times.

In particular, we run the algorithm trimmed_bregman_clustering replications times, on samples of size $n = 1200$ (including $200$ outliers) generated according to the aforementioned procedure.

The function performance does it.
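
Schematically, such a replication loop can be written as follows. This is a sketch under assumptions, not the package's implementation: performance_sketch is an invented name, and the exact return values of performance may differ.

using Random, Statistics
import Clustering: mutualinfo

# Hypothetical sketch of a replication loop: draw a fresh contaminated sample,
# cluster it with the requested Bregman divergence, and record the NMI.
function performance_sketch(n, n_outliers, k, alpha, sample_generator,
                            outliers_generator, bregman;
                            maxiter = 100, nstart = 10, replications = 100,
                            rng = MersenneTwister(1))
    nmi = zeros(replications)
    for r in 1:replications
        points, labels = sample_generator(rng, n - n_outliers)
        outliers = outliers_generator(rng, n_outliers)
        x = hcat(points, outliers)
        labels_true = vcat(labels, zeros(Int, n_outliers))
        tb = trimmed_bregman_clustering(rng, x, k, alpha, bregman, maxiter, nstart)
        nmi[r] = mutualinfo(labels_true, tb.cluster, normed = true)
    end
    # Each run's NMI, plus its mean and standard error (illustrative summary).
    return nmi, mean(nmi), std(nmi) / sqrt(replications)
end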

sample_generator = (rng, n) -> sample_poisson(rng, n, d, lambdas, proba)
outliers_generator = (rng, n) -> sample_outliers(rng, n, d; scale = 120)
#3 (generic function with 1 method)

The default values of performance are maxiter = 100, nstart = 10 and replications = 100.

n = 1200
n_outliers = 200
k = 3
alpha = 0.1
nmi_kmeans, _, _ = performance(n, n_outliers, k, alpha, sample_generator, outliers_generator, euclidean)
nmi_poisson, _, _ = performance(n, n_outliers, k, alpha, sample_generator, outliers_generator, poisson)
([0.5839398784033235, 0.6202318416861812, 0.6101178140792, 0.6374897347566113, 0.6540671777145892, 0.6083343936925494, 0.6304838453433942, 0.6306156841785864, 0.6336810374826111, 0.6527637354246719  …  0.6248676870772997, 0.6257423603470843, 0.6304698035282263, 0.6376606605106416, 0.6270816412278291, 0.6273014539714742, 0.6195508045521951, 0.6298634762917615, 0.6471578693973343, 0.6209459002804606], 0.6300521037537219, 0.003665377094835531)

The boxplots show the NMI for the two methods. The method using the Bregman divergence associated to the Poisson distribution outperforms the trimmed $k$-means method.

using StatsPlots

boxplot( ones(100), nmi_kmeans, label = "kmeans" )
boxplot!( fill(2, 100), nmi_poisson, label = "poisson" )
[Figure: boxplots of the NMI for the two methods]

Selection of the parameters $k$ and $\alpha$

We still use the dataset x.

vect_k = collect(1:5)
vect_alpha = sort([((0:2)./50)...,((1:4)./5)...])

params_risks = select_parameters_nonincreasing(rng, vect_k, vect_alpha, x, poisson, maxiter, nstart)

plot(; title = "select parameters")
for (i,k) in enumerate(vect_k)
   plot!( vect_alpha, params_risks[i, :], label ="k=$k", markershape = :circle )
end
xlabel!("alpha")
ylabel!("NMI")
[Figure: risk as a function of alpha, one curve per k in 1:5]

According to the graph, the risk decreases from 1 to 2 clusters, and again from 2 to 3 clusters. However, there is no gain in terms of risk from 3 to 4 clusters, nor from 4 to 5 clusters: the curves for $k = 3$, $k = 4$ and $k = 5$ are very close. So we cluster the data into $k = 3$ clusters.

The curve for $k = 3$ decreases sharply, and its slope stabilises around $\alpha = 0.04$.

To refine the selection of the parameter $\alpha$, we focus on the curve $k = 3$ and on small values of $\alpha$, adjusting the nstart parameter if needed.

vect_k = [3]
vect_alpha = collect(0:15) ./ 200
params_risks = select_parameters_nonincreasing(rng, vect_k, vect_alpha, x, poisson, maxiter, 5)

plot(vect_alpha, params_risks[1, :], markershape = :circle)
[Figure: risk for $k = 3$ as a function of alpha]

There is no sharp change of slope, but the curve flattens after $\alpha = 0.03$. Therefore, we select the parameter $\alpha = 0.03$.

The clustering obtained with the parameters k and alpha selected by this heuristic is the following.

k, alpha = 3, 0.03
tb_poisson = trimmed_bregman_clustering( rng, x, k, alpha, poisson, maxiter, nstart )
tb_poisson.centers
1×3 Matrix{Float64}:
 41.0419  20.6136  9.57292
plot( tb_poisson )
[Figure: final clustering with $k = 3$ and $\alpha = 0.03$]