Functions
GeometricClusterAnalysis.build_distance_matrix
— Method build_distance_matrix(result; indexed_by_r2)

Distance matrix for the graph filtration.

- `indexed_by_r2 = true` always works.
- `indexed_by_r2 = false` requires the elements of `weights` to be non-negative.
- `indexed_by_r2 = false` is intended for the sub-level sets of the square root of non-negative power functions: the k-PDTM or the k-PLM (when the determinants of the matrices are forced to be 1).
GeometricClusterAnalysis.build_distance_matrix_power_function_buchet
— Method build_distance_matrix_power_function_buchet(birth, means)

Auxiliary function for the power distance: `a` and `b` are two vectors, `c` and `d` two scalars.
GeometricClusterAnalysis.colorize!
— Method colorize!(colors, μ, ω, points, k, nsignal, centers, Σ)

Auxiliary function which, given k centers, computes the "new twisted distances" from every point of P to every center, and colors each point with the color of the nearest center. The "distance" to a center is the squared Mahalanobis norm with respect to the local mean around that center, plus a weight depending on a local variance around the center, to which we add $\log(\det(Σ))$.

We often use the `mahalanobis` function: `mahalanobis(x, c, Σ)` computes the squared Mahalanobis norm $(p-c)^T Σ^{-1}(p-c)$ for every point $p$, column of $x$.
GeometricClusterAnalysis.compute_dists!
— Method compute_dists!(dists, center, points, Σ)
GeometricClusterAnalysis.compute_threshold_infinity
— Method compute_threshold_infinity(
dist_func,
distance_matrix,
nb_means_removed,
nb_clusters
)
GeometricClusterAnalysis.create_ellipsoids_animation
— Method create_ellipsoids_animation(
distance_function,
timesteps,
ellipsoids_colors,
points_colors,
distances,
remain_indices
)
GeometricClusterAnalysis.distance_matrix_tomato
— Method distance_matrix_tomato(graph, birth)

- `birth` : distance to measure
- `graph` : matrix that contains 0 and 1; $graph_{i,j} = 1$ if $i$ and $j$ are neighbours
GeometricClusterAnalysis.dtm
— Method dtm(x, m0; r)

Distance-to-measure function evaluated at each point.
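For reference, the usual distance to measure of Chazal, Cohen-Steiner and Mérigot is recalled below; presumably `dtm` implements it, with the keyword `r` generalizing the exponent (the classical DTM corresponds to $r = 2$):

```latex
d_{\mu, m_0}(x) = \left( \frac{1}{k} \sum_{i=1}^{k} \lVert x - X_{(i)}(x) \rVert^{r} \right)^{1/r},
\qquad k = \lceil m_0 \, n \rceil,
```

where $X_{(i)}(x)$ denotes the $i$-th nearest sample point to $x$ among the $n$ points of the sample.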
GeometricClusterAnalysis.dtm_filtration
— Method dtm_filtration(points, m0; infinity, threshold)
GeometricClusterAnalysis.euclidean
— Method euclidean(x, y)

Squared Euclidean distance.
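Concretely, for two vectors of $\mathbb{R}^d$:

```latex
\mathrm{euclidean}(x, y) = \lVert x - y \rVert^2 = \sum_{i=1}^{d} (x_i - y_i)^2
```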
GeometricClusterAnalysis.fourteen_segments
— Method fourteen_segments(rng, n, σ, d)
GeometricClusterAnalysis.graph_nn
— Method graph_nn(points, k)
Nearest neighbours graph
- k : number of nearest neighbours to link to
GeometricClusterAnalysis.graph_radius
— Method graph_radius(points, r)
Rips graph with radius r
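That is, with the adjacency convention used by `distance_matrix_tomato`, two points are linked exactly when they lie at distance at most $r$:

```latex
graph_{i,j} = \begin{cases} 1 & \text{if } \lVert x_i - x_j \rVert \le r, \\ 0 & \text{otherwise.} \end{cases}
```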
GeometricClusterAnalysis.hierarchical_clustering_lem
— Method hierarchical_clustering_lem(
distance_matrix;
infinity,
threshold,
store_colors,
store_timesteps
)
- `distance_matrix` : $(r_{i,j})_{i,j}$ where $r_{i,j}$ is the time $r$ at which components $i$ and $j$ merge; $r_{i,i}$ is the birth time of component $i$.
- `c` : number of components
- `infinity` : components whose lifetime is larger than `infinity` never die
- `threshold` : centers born after `threshold` are removed
- `store_colors = true` : in `saved_colors`, we store all configurations of colors, for every step.
- `store_timesteps = true` : in `saved_timesteps`, we store all timesteps.

It is possible to select `infinity` and `threshold` after running the algorithm with `infinity = Inf` and `threshold = Inf`. For this, we look at the persistence diagram of the components (x-axis: birth; y-axis: death).
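The two-pass workflow just described can be sketched as follows; the chosen values and the field names `birth` and `death` are illustrative assumptions, not the package API:

```julia
# First pass: no component ever dies and no center is removed.
hc = hierarchical_clustering_lem(distance_matrix; infinity = Inf, threshold = Inf)

# Inspect the persistence diagram (x-axis: birth, y-axis: death) to choose
# values separating long-lived components from noise, e.g. with
# scatter(hc.birth, hc.death)   # assumed field names

# Second pass with the selected parameters (illustrative values).
hc = hierarchical_clustering_lem(distance_matrix; infinity = 15.0, threshold = 10.0)
```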
GeometricClusterAnalysis.infinity_symbol
— Method infinity_symbol(
npoints,
α,
σ,
dimension,
noise_min,
noise_max
)
GeometricClusterAnalysis.infinity_symbol
— Method infinity_symbol(
nsignal,
nnoise,
σ,
dimension,
noise_min,
noise_max
)
GeometricClusterAnalysis.infinity_symbol
— Method infinity_symbol(
rng,
npoints,
α,
σ,
dimension,
noise_min,
noise_max
)
GeometricClusterAnalysis.infinity_symbol
— Method infinity_symbol(
rng,
nsignal,
nnoise,
σ,
dimension,
noise_min,
noise_max
)
GeometricClusterAnalysis.k_witnessed_distance
— Method k_witnessed_distance(points, k, c, nsignal)
GeometricClusterAnalysis.kpdtm
— Method kpdtm(
rng,
points,
k,
c,
nsignal,
iter_max,
nstart,
first_centers
)
GeometricClusterAnalysis.kpdtm
— Method kpdtm(rng, points, k, c, α, iter_max, nstart)
GeometricClusterAnalysis.kpdtm
— Method kpdtm(rng, points, k, c, nsignal, iter_max, nstart)
GeometricClusterAnalysis.kplm
— Method kplm(
rng,
points,
k,
ncenters,
nsignal,
iter_max,
nstart,
f_Σ!,
first_centers
)
GeometricClusterAnalysis.kplm
— Method kplm(rng, points, k, ncenters, α, iter_max, nstart, f_Σ!)
GeometricClusterAnalysis.kplm
— Method kplm(
rng,
points,
k,
ncenters,
nsignal,
iter_max,
nstart,
f_Σ!
)
GeometricClusterAnalysis.mahalanobis
— Method mahalanobis(x, μ, Σ; inverted = false)
Returns the squared Mahalanobis distance of all rows in x and the vector μ = center with respect to Σ = cov. This is (for vector x) defined as
\[D^2 = (x - \mu)' \Sigma^{-1} (x - \mu)\]
- `x` : vector or matrix of data with, say, `p` columns.
- `μ` : mean vector of the distribution, or second data vector of length `p` or recyclable to that length.
- `Σ` : covariance matrix ($p \times p$) of the distribution.
- `inverted` : if true, `Σ` is supposed to contain the inverse of the covariance matrix.
GeometricClusterAnalysis.meanvar!
— Method meanvar!(μ, ω, points, centers, k)
GeometricClusterAnalysis.noisy_circles
— Method noisy_circles(rng, n; r1, r2, noise)
GeometricClusterAnalysis.noisy_fourteen_segments
— Method noisy_fourteen_segments(npoints, α, σ, d)
GeometricClusterAnalysis.noisy_fourteen_segments
— Method noisy_fourteen_segments(nsignal, nnoise, σ, d)
GeometricClusterAnalysis.noisy_fourteen_segments
— Method noisy_fourteen_segments(rng, npoints, α, σ, d)

- `npoints` : total number of points
- `α` : fraction of outliers
GeometricClusterAnalysis.noisy_fourteen_segments
— Method noisy_fourteen_segments(rng, nsignal, nnoise, σ, d)

- `nsignal` : number of signal points
- `nnoise` : number of additional outliers sampled to generate noise

Signal points are $X = Y + Z$ with $Y$ uniform on the 14 segments and $Z$ normal with mean 0 and covariance matrix $σ I_d$ (with $I_d$ the identity matrix of $R^d$). So $d$ is the dimension of the data and $σ$ the standard deviation of the additive Gaussian noise. When $d > 2$, $Y_i = 0$ for $i \geq 2$, with the notation $Y = (Y_i)_{i=1..d}$.
GeometricClusterAnalysis.noisy_moons
— Method noisy_moons(rng, n; r1, r2, noise)
GeometricClusterAnalysis.noisy_nested_spirals
— Method noisy_nested_spirals(npoints, α, σ, dimension)
GeometricClusterAnalysis.noisy_nested_spirals
— Method noisy_nested_spirals(nsignal, nnoise, σ, dimension)
GeometricClusterAnalysis.noisy_nested_spirals
— Method noisy_nested_spirals(rng, npoints, α, σ, dimension)
GeometricClusterAnalysis.noisy_nested_spirals
— Method noisy_nested_spirals(rng, nsignal, nnoise, σ, dimension)

- `nsignal` : number of signal points
- `nnoise` : number of additional outliers

Signal points are $x = y + z$ with

- $y$ uniform on the two nested spirals
- $z$ normal with mean 0 and covariance matrix $\sigma I_d$ (with $I_d$ the identity matrix of $R^d$)

$d$ is the dimension of the data and $\sigma$ the standard deviation of the additive Gaussian noise. When $d > 2$, $y_i = 0$ for $i \geq 2$, with the notation $y = (y_i)_{i=1..d}$.
GeometricClusterAnalysis.noisy_three_curves
— Method noisy_three_curves(npoints, α, sigma, d)
GeometricClusterAnalysis.noisy_three_curves
— Method noisy_three_curves(rng, npoints, α, sigma, d)
GeometricClusterAnalysis.noisy_three_curves
— Method noisy_three_curves(rng, nsignal, nnoise, sigma, d)

- `nsignal` : number of signal points
- `nnoise` : number of additional outliers

Signal points are $x = y + z$ with

- $y$ uniform on the 3 curves
- $z$ normal with mean 0 and covariance matrix $\sigma I_d$ (with $I_d$ the identity matrix of $R^d$)

$d$ is the dimension of the data and $\sigma$ the standard deviation of the additive Gaussian noise. When $d > 2$, $y_i = 0$ for $i \geq 2$, with the notation $y = (y_i)_{i=1..d}$.
GeometricClusterAnalysis.performance
— Function performance(n, n_outliers, k, alpha, generator, outliers_generator,
    bregman, maxiter = 100, nstart = 10, replications = 100)

The `generator` function generates points; it returns the points (the sample) and the labels (the true labels of the points).

- `n` : total number of points
- `n_outliers` : number of points generated as outliers among these `n` points
GeometricClusterAnalysis.poisson
— Method poisson(x, y)
Bregman divergence associated with the Poisson distribution
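For reference, the Bregman divergence generated by $\phi(t) = t \log t$, the one associated with the Poisson distribution, is, for positive $x$ and $y$:

```latex
d_\phi(x, y) = x \log\frac{x}{y} - x + y
```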
GeometricClusterAnalysis.power_function_buchet
— Method power_function_buchet(points, m0; infinity, threshold)
GeometricClusterAnalysis.recolor
— Method recolor(points, centers, k, nsignal)
GeometricClusterAnalysis.return_color
— Method return_color(label_points, colors_in, startup_indices)

- `colors_in` : labels of the centers output by the algorithm.
- `label_points` : labels of the points, with values in the same set as the startup indices.

If `label_points[j] = 3`, we look for the center whose startup index is 3. If, for instance, it is the 5th center, then `color[j] = col[5]`.
GeometricClusterAnalysis.sample_outliers
— Method sample_outliers(rng, n_outliers, d; scale = 1)
GeometricClusterAnalysis.sample_poisson
— Method sample_poisson(rng, n, d, lambdas, proba)
GeometricClusterAnalysis.select_parameters
— Method select_parameters(
rng,
vk,
valpha,
x,
bregman,
maxiter,
nstart
)
Initial centers are set randomly.

- `k` : numbers of centers
- `α` : trimming values
GeometricClusterAnalysis.select_parameters_nonincreasing
— Method select_parameters_nonincreasing(
rng,
vk,
valpha,
x,
bregman,
maxiter,
nstart
)
We force the risk curve to be nonincreasing in alpha, using the optimal centers from the previous alpha.

- `k` is a number or a vector containing the different values of k
- `alpha` is a number or a vector containing the different values of alpha
- with `force_decreasing = false`, all starts are random.
GeometricClusterAnalysis.subcolorize
— Method subcolorize(points, nsignal, result, startup_indices)

Auxiliary function that, given the point cloud, the number of signal points, the result of `kpdtm` or `kplm`, and the startup indices of the hclust method, computes the "new twisted distances" from every point of P to every center whose index is among the startup indices, and associates each point with its nearest center.
GeometricClusterAnalysis.trimmed_bregman_clustering
— Method trimmed_bregman_clustering(
rng,
x,
k,
α,
bregman,
maxiter,
nstart
)
- `n` : number of points
- `d` : dimension

Input:

- `x` : sample of `n` points in $R^d$ (matrix of size `n` $\times$ `d`)
- `α` : proportion of trimmed points, considered as outliers; they are given the label 0
- `k` : number of centers
- `bregman` : function of two numbers or vectors named x and y, which returns their Bregman divergence.
- `maxiter` : maximum number of iterations allowed.
- `nstart` : if `centers` is a number, it is the number of different initializations of the algorithm. Only the best result is kept.

Output:

- `centers` : matrix of size `d` $\times$ `k` whose columns are the centers of the clusters
- `cluster` : vector of integers in `1:k` indicating the index of the cluster to which each point (row) of `x` is associated.
- `risk` : average of the divergences of the points of `x` from their associated center.
- `divergence` : vector of divergences of the points of `x` from their nearest center in `centers`.
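A hedged usage sketch of the inputs and outputs listed above; the data, the parameter values, and the assumption that the result exposes those outputs as fields are illustrative, not guaranteed by the package:

```julia
using Random

rng = MersenneTwister(2022)
x = randn(rng, 300, 2)      # n × d sample (illustrative data)

k, α = 3, 0.05              # 3 clusters, trim 5% of the points as outliers
maxiter, nstart = 100, 10
result = trimmed_bregman_clustering(rng, x, k, α, euclidean, maxiter, nstart)

# result.centers : d × k matrix of cluster centers   (assumed field access)
# result.cluster : labels, with 0 marking trimmed points
```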
GeometricClusterAnalysis.trimmed_bregman_clustering
— Method trimmed_bregman_clustering(
rng,
x,
centers,
α,
bregman,
maxiter
)
- n : number of points
- d : dimension
Input:

- `x` : sample of `n` points in $R^d$ (matrix of size `n` $\times$ `d`)
- `centers` : initial centers
- `alpha` : proportion of trimmed points, considered as outliers; they are given the label 0
- `bregman` : function of two numbers or vectors named x and y, which returns their Bregman divergence.
- `maxiter` : maximum number of iterations allowed.

Output:

- `centers` : matrix of size `d` $\times$ `k` whose columns are the centers of the clusters
- `risk` : average of the divergences of the points of `x` from their associated center.