Functions

GeometricClusterAnalysis.build_distance_matrix (Method)
build_distance_matrix(result; indexed_by_r2)

Distance matrix for the graph filtration

  • indexed_by_r2 = true always works
  • indexed_by_r2 = false requires the elements of weights to be non-negative.
  • indexed_by_r2 = false is for the sub-level sets of the square root of non-negative power functions: the k-PDTM or the k-PLM (when the determinants of the matrices are forced to be 1)
source
GeometricClusterAnalysis.colorize! (Method)
colorize!(colors, μ, ω, points, k, nsignal, centers, Σ)

Auxiliary function which, given k centers, computes the "new twisted distances" from every point of P to every center, and colors each point with the color of its nearest center. The "distance" to a center is the squared Mahalanobis norm at the local mean around that center, plus a weight depending on a local variance around that center, to which we add $\log(\det(Σ))$.

We often use the mahalanobis function: mahalanobis(x, c, Σ) computes the squared Mahalanobis norm $(p-c)^T Σ^{-1}(p-c)$ for every point $p$ given as a column of $x$.

source
GeometricClusterAnalysis.hierarchical_clustering_lem (Method)
hierarchical_clustering_lem(
    distance_matrix;
    infinity,
    threshold,
    store_colors,
    store_timesteps
)
  • distance_matrix : matrix $(r_{i,j})_{i,j}$, where $r_{i,j}$ is the time $r$ at which components $i$ and $j$ merge
  • $r_{i,i}$ : birth time of component $i$.
  • c : number of components
  • infinity : components whose lifetime is larger than infinity never die
  • threshold : centers born after threshold are removed
  • store_colors = true : all color configurations, at every step, are stored in saved_colors.
  • store_timesteps = true : all timesteps are stored in saved_timesteps.

It is possible to select infinity and threshold after running the algorithm with infinity = Inf and threshold = Inf. To do so, look at the persistence diagram of the components (x-axis: birth; y-axis: death).
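As an illustration of the distance-matrix convention described above, here is a hypothetical 3-component matrix (not output of the package):

```julia
# Hypothetical symmetric matrix for 3 components:
# diagonal entries r[i, i] are birth times,
# off-diagonal entries r[i, j] are merge times.
r = [0.0 1.5 3.0;
     1.5 0.5 2.0;
     3.0 2.0 1.0]

births = [r[i, i] for i in 1:3]    # birth time of each component

# Components 1 and 2 merge at time r[1, 2] = 1.5; the younger
# component (the one with the larger birth time) dies there, so
lifetime_2 = r[1, 2] - births[2]   # lifetime of component 2
```

A component whose lifetime exceeds infinity would never die; a component born after threshold would be removed.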

source
GeometricClusterAnalysis.mahalanobis (Method)
mahalanobis( x, μ, Σ; inverted = false)

Returns the squared Mahalanobis distance between each row of x and the vector μ = center, with respect to Σ = cov. For a vector x, this is defined as

\[D^2 = (x - \mu)' \Sigma^{-1} (x - \mu)\]

  • x : vector or matrix of data with, say, p columns.
  • μ : mean vector of the distribution or second data vector of length p or recyclable to that length.
  • Σ : covariance matrix p x p of the distribution.
  • inverted : if true, Σ is assumed to contain the inverse of the covariance matrix.
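A minimal sketch of this formula (the name mahalanobis_sq is illustrative; this is not the package implementation):

```julia
using LinearAlgebra

# Squared Mahalanobis distance of each row of x to the center μ:
# D² = (x - μ)' Σ⁻¹ (x - μ).
function mahalanobis_sq(x::AbstractMatrix, μ::AbstractVector,
                        Σ::AbstractMatrix; inverted = false)
    Σinv = inverted ? Σ : inv(Σ)
    return [dot(row .- μ, Σinv * (row .- μ)) for row in eachrow(x)]
end

x = [2.0 0.0; 0.0 1.0]
Σ = [4.0 0.0; 0.0 1.0]
mahalanobis_sq(x, [0.0, 0.0], Σ)  # → [1.0, 1.0]
```

With inverted = true the same result is obtained by passing inv(Σ) directly, which avoids recomputing the inverse for every call.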
source
GeometricClusterAnalysis.noisy_fourteen_segments (Method)
noisy_fourteen_segments(rng, nsignal, nnoise, σ, d)
  • nsignal : number of signal points
  • nnoise : number of additional outliers, sampled to generate noise

Signal points are $X = Y + Z$ with

  • $Y$ uniform on the 14 segments
  • $Z$ normal with mean 0 and covariance matrix $σ * I_d$ (with $I_d$ the identity matrix of $R^d$)

So $d$ is the dimension of the data and $σ$ the standard deviation of the additive Gaussian noise. When $d > 2$, $Y_i = 0$ for $i \geq 2$, with the notation $Y = (Y_i)_{i=1..d}$.
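The $X = Y + Z$ sampling scheme above can be sketched generically as follows (the curve parametrization, the noise-scale convention with per-coordinate standard deviation σ, and the name noisy_curve are all illustrative, not the package's actual segments):

```julia
using Random

# Illustrative X = Y + Z sampler: Y uniform on a curve traced in the
# first two coordinates, Z additive Gaussian noise with standard
# deviation σ in every coordinate. Columns of the result are points.
function noisy_curve(rng, nsignal, σ, d)
    t = 2π .* rand(rng, nsignal)
    y = zeros(d, nsignal)
    y[1, :] .= cos.(t)
    y[2, :] .= sin.(t)
    return y .+ σ .* randn(rng, d, nsignal)
end

x = noisy_curve(MersenneTwister(0), 100, 0.1, 3)
size(x)  # (3, 100)
```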

source
GeometricClusterAnalysis.noisy_nested_spirals (Method)
noisy_nested_spirals(rng, nsignal, nnoise, σ, dimension)
  • nsignal : number of signal points
  • nnoise : number of additional outliers

Signal points are $x = y+z$ with

  • $y$ uniform on the two nested spirals
  • $z$ normal with mean 0 and covariance matrix $\sigma * I_d$ (with $I_d$ the identity matrix of $R^d$)

Here $d$ is the dimension of the data and $\sigma$ the standard deviation of the additive Gaussian noise. When $d > 2$, $y_i = 0$ for $i \geq 2$, with the notation $y = (y_i)_{i=1..d}$.

source
GeometricClusterAnalysis.noisy_three_curves (Method)
noisy_three_curves(rng, nsignal, nnoise, sigma, d)
  • nsignal : number of signal points
  • nnoise : number of additional outliers

Signal points are $x = y+z$ with

  • $y$ uniform on the 3 curves
  • $z$ normal with mean 0 and covariance matrix $\sigma * I_d$ (with $I_d$ the identity matrix of $R^d$)

Here $d$ is the dimension of the data and $\sigma$ the standard deviation of the additive Gaussian noise. When $d > 2$, $y_i = 0$ for $i \geq 2$, with the notation $y = (y_i)_{i=1..d}$.

source
GeometricClusterAnalysis.performance (Function)
performance(n, n_outliers, k, alpha, generator, outliers_generator, 
            bregman, maxiter = 100, nstart = 10, replications = 100)

The generator function generates points; it returns the points (the sample) and the labels (the true labels of the points).

  • n : total number of points
  • n_outliers : number of points, among these n points, generated as outliers
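The expected shape of a generator can be sketched as follows (two_blobs is a hypothetical example, not a package export):

```julia
using Random

# Hypothetical generator: points drawn around two centers in the
# plane, returned together with their true labels.
function two_blobs(rng, n)
    labels = rand(rng, 1:2, n)
    centers = [-2.0 2.0;            # one center per column
                0.0 0.0]
    points = centers[:, labels] .+ randn(rng, 2, n)
    return points, labels
end

points, labels = two_blobs(MersenneTwister(42), 10)
size(points)  # (2, 10)
```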
source
GeometricClusterAnalysis.return_color (Method)
return_color(label_points, colors_in, startup_indices)
  • colors_in : labels of the centers returned by the algorithm.
  • label_points : labels of the points, taking values in the same set as the startup indices.

If label_points[j] = 3, then we look for the center whose startup index is 3. If, for instance, it is the 5th center, then color[j] = colors_in[5]
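This lookup can be sketched like this (return_color_sketch is illustrative, not the package function):

```julia
# Map each point label to the color of the center whose startup
# index equals that label.
function return_color_sketch(label_points, colors_in, startup_indices)
    position = Dict(idx => pos for (pos, idx) in enumerate(startup_indices))
    return [colors_in[position[l]] for l in label_points]
end

# Label 3 matches the 3rd startup index, hence the 3rd color:
return_color_sketch([3, 7], [10, 20, 30], [7, 5, 3])  # → [30, 10]
```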

source
GeometricClusterAnalysis.select_parameters_nonincreasing (Method)
select_parameters_nonincreasing(
    rng,
    vk,
    valpha,
    x,
    bregman,
    maxiter,
    nstart
)

We force the risk curve to be nonincreasing in alpha by reusing the optimal centers from the previous alpha.

  • k is a number or a vector containing the different values of k
  • alpha is a number or a vector containing the different values of alpha
  • with force_decreasing = false, all initializations are random.
source
GeometricClusterAnalysis.subcolorize (Method)
subcolorize(points, nsignal, result, startup_indices)

Auxiliary function that, given the point cloud, the number of signal points, the result of kpdtm or kplm, and the startup indices of the hclust method, computes the "new twisted distances" from every point of P to every center whose index is among the startup indices. Each point is assigned to its nearest center.

source
GeometricClusterAnalysis.trimmed_bregman_clustering (Method)
trimmed_bregman_clustering(
    rng,
    x,
    k,
    α,
    bregman,
    maxiter,
    nstart
)
  • n : number of points
  • d : dimension

Input :

  • x : sample of n points in $R^d$ - matrix of size n $\times$ d
  • α : proportion of trimmed points, i.e. points considered as outliers; they are given the label 0
  • k : number of centers
  • bregman : function of two numbers or vectors, named x and y, returning their Bregman divergence.
  • maxiter: maximum number of iterations allowed.
  • nstart: if centers is a number, it is the number of different initializations of the algorithm. Only the best result is kept.

Output :

  • centers: matrix of size d × k whose columns represent the centers of the clusters
  • cluster: vector of integers in 1:k indicating the index of the cluster to which each point (row) of x is assigned.
  • risk: average of the divergences of the points of x from their associated centers.
  • divergence: the vector of divergences of the points of x from their nearest center in centers.
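The bregman argument is any such two-argument divergence; a minimal example is the squared Euclidean divergence (euclidean_sq is an illustrative name, not a package export):

```julia
# Squared Euclidean distance: the Bregman divergence generated by
# the potential x ↦ ‖x‖². Works on numbers and on vectors.
euclidean_sq(x, y) = sum((x .- y) .^ 2)

euclidean_sq([1.0, 2.0], [0.0, 0.0])  # → 5.0
```

Any other Bregman divergence (e.g. one adapted to Poisson-distributed data) can be passed in the same way, as long as it accepts two arguments of the same shape.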
source
GeometricClusterAnalysis.trimmed_bregman_clustering (Method)
trimmed_bregman_clustering(
    rng,
    x,
    centers,
    α,
    bregman,
    maxiter
)
  • n : number of points
  • d : dimension

Input :

  • x : sample of n points in $R^d$ - matrix of size n $\times$ d
  • centers : initial centers
  • α : proportion of trimmed points, i.e. points considered as outliers; they are given the label 0
  • bregman : function of two numbers or vectors, named x and y, returning their Bregman divergence.
  • maxiter: maximum number of iterations allowed.

Output :

  • centers: matrix of size d × k whose columns represent the centers of the clusters
  • risk: average of the divergences of the points of x from their associated centers.
source