Functions

GeometricClusterAnalysis.build_distance_matrix (Method)
build_distance_matrix(result; indexed_by_r2)

Distance matrix for the graph filtration

  • indexed_by_r2 = true always works
  • indexed_by_r2 = false requires the elements of weights to be non-negative.
  • indexed_by_r2 = false is for the sub-level sets of the square root of non-negative power functions: the k-PDTM or the k-PLM (when the determinants of the matrices are forced to be 1)
source
GeometricClusterAnalysis.colorize! (Method)
colorize!(colors, μ, ω, points, k, nsignal, centers, Σ)

Auxiliary function which, given k centers, computes the "new twisted distances" from every point of P to every center, and colors each point with the color of its nearest center. The "distance" to a center is the squared Mahalanobis norm at the local mean around that center, plus a weight depending on a local variance around that center, to which we add $\log(\det(Σ))$.

We often use the mahalanobis function: mahalanobis(x, c, Σ) computes the squared Mahalanobis norm $(p-c)^T Σ^{-1}(p-c)$ for every point $p$ given as a column of $x$.

source
GeometricClusterAnalysis.hierarchical_clustering_lem (Method)
hierarchical_clustering_lem(
    distance_matrix;
    infinity,
    threshold,
    store_colors,
    store_timesteps
)
  • distance_matrix : matrix $(r_{i,j})_{i,j}$, where $r_{i,j}$ is the time $r$ at which components $i$ and $j$ merge
  • $r_{i,i}$ : birth time of component $i$.
  • c : number of components
  • infinity : components whose lifetime is larger than infinity never die
  • threshold : centers born after threshold are removed
  • store_colors = true : all color configurations, at every step, are stored in saved_colors.
  • store_timesteps = true : all timesteps are stored in saved_timesteps.

It is possible to select infinity and threshold after running the algorithm with infinity = Inf and threshold = Inf. To do so, look at the persistence diagram of the components (x-axis: birth; y-axis: death).
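As an illustration of the distance-matrix convention described above, here is a hypothetical 3-component matrix (not output of the package):

```julia
# Hypothetical symmetric matrix for 3 components:
# diagonal entries r[i, i] are birth times,
# off-diagonal entries r[i, j] are merge times.
r = [0.0 1.5 3.0;
     1.5 0.5 2.0;
     3.0 2.0 1.0]

births = [r[i, i] for i in 1:3]    # birth time of each component

# Components 1 and 2 merge at time r[1, 2] = 1.5; the younger
# component (the one with the larger birth time) dies there, so
lifetime_2 = r[1, 2] - births[2]   # lifetime of component 2
```

A component whose lifetime exceeds infinity would never die; a component born after threshold would be removed.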

source
GeometricClusterAnalysis.mahalanobis (Method)
mahalanobis( x, μ, Σ; inverted = false)

Returns the squared Mahalanobis distance between each row of x and the vector μ = center, with respect to Σ = cov. For a vector x, this is defined as

\[D^2 = (x - \mu)' \Sigma^{-1} (x - \mu)\]

  • x : vector or matrix of data with, say, p columns.
  • μ : mean vector of the distribution or second data vector of length p or recyclable to that length.
  • Σ : covariance matrix p x p of the distribution.
  • inverted : if true, Σ is assumed to contain the inverse of the covariance matrix.
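A minimal sketch of this formula (the name mahalanobis_sq is illustrative; this is not the package implementation):

```julia
using LinearAlgebra

# Squared Mahalanobis distance of each row of x to the center μ:
# D² = (x - μ)' Σ⁻¹ (x - μ).
function mahalanobis_sq(x::AbstractMatrix, μ::AbstractVector,
                        Σ::AbstractMatrix; inverted = false)
    Σinv = inverted ? Σ : inv(Σ)
    return [dot(row .- μ, Σinv * (row .- μ)) for row in eachrow(x)]
end

x = [2.0 0.0; 0.0 1.0]
Σ = [4.0 0.0; 0.0 1.0]
mahalanobis_sq(x, [0.0, 0.0], Σ)  # → [1.0, 1.0]
```

With inverted = true the same result is obtained by passing inv(Σ) directly, which avoids recomputing the inverse for every call.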
source
GeometricClusterAnalysis.noisy_fourteen_segments (Method)
noisy_fourteen_segments(rng, nsignal, nnoise, σ, d)
  • nsignal : number of signal points
  • nnoise : number of additional outliers, sampled to generate noise

Signal points are $X = Y + Z$ with

  • $Y$ uniform on the 14 segments
  • $Z$ normal with mean 0 and covariance matrix $σ * I_d$ (with $I_d$ the identity matrix of $R^d$)

So $d$ is the dimension of the data and $σ$ the standard deviation of the additive Gaussian noise. When $d > 2$, $Y_i = 0$ for $i \geq 2$, with the notation $Y = (Y_i)_{i=1..d}$.
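The $X = Y + Z$ sampling scheme above can be sketched generically as follows (the curve parametrization, the noise-scale convention with per-coordinate standard deviation σ, and the name noisy_curve are all illustrative, not the package's actual segments):

```julia
using Random

# Illustrative X = Y + Z sampler: Y uniform on a curve traced in the
# first two coordinates, Z additive Gaussian noise with standard
# deviation σ in every coordinate. Columns of the result are points.
function noisy_curve(rng, nsignal, σ, d)
    t = 2π .* rand(rng, nsignal)
    y = zeros(d, nsignal)
    y[1, :] .= cos.(t)
    y[2, :] .= sin.(t)
    return y .+ σ .* randn(rng, d, nsignal)
end

x = noisy_curve(MersenneTwister(0), 100, 0.1, 3)
size(x)  # (3, 100)
```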

source
GeometricClusterAnalysis.noisy_nested_spirals (Method)
noisy_nested_spirals(rng, nsignal, nnoise, σ, dimension)
  • nsignal : number of signal points
  • nnoise : number of additional outliers

Signal points are $x = y+z$ with

  • $y$ uniform on the two nested spirals
  • $z$ normal with mean 0 and covariance matrix $\sigma * I_d$ (with $I_d$ the identity matrix of $R^d$)

Here $d$ is the dimension of the data and $\sigma$ the standard deviation of the additive Gaussian noise. When $d > 2$, $y_i = 0$ for $i \geq 2$, with the notation $y = (y_i)_{i=1..d}$.

source
GeometricClusterAnalysis.noisy_three_curves (Method)
noisy_three_curves(rng, nsignal, nnoise, sigma, d)
  • nsignal : number of signal points
  • nnoise : number of additional outliers

Signal points are $x = y+z$ with

  • $y$ uniform on the 3 curves
  • $z$ normal with mean 0 and covariance matrix $\sigma * I_d$ (with $I_d$ the identity matrix of $R^d$)

Here $d$ is the dimension of the data and $\sigma$ the standard deviation of the additive Gaussian noise. When $d > 2$, $y_i = 0$ for $i \geq 2$, with the notation $y = (y_i)_{i=1..d}$.

source
GeometricClusterAnalysis.performance (Function)
performance(n, n_outliers, k, alpha, generator, outliers_generator, 
            bregman, maxiter = 100, nstart = 10, replications = 100)

The generator function generates points; it returns the points (the sample) and the labels (the true labels of the points).

  • n : total number of points
  • n_outliers : number of points, among these n points, generated as outliers
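The expected shape of a generator can be sketched as follows (two_blobs is a hypothetical example, not a package export):

```julia
using Random

# Hypothetical generator: points drawn around two centers in the
# plane, returned together with their true labels.
function two_blobs(rng, n)
    labels = rand(rng, 1:2, n)
    centers = [-2.0 2.0;            # one center per column
                0.0 0.0]
    points = centers[:, labels] .+ randn(rng, 2, n)
    return points, labels
end

points, labels = two_blobs(MersenneTwister(42), 10)
size(points)  # (2, 10)
```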
source
GeometricClusterAnalysis.return_color (Method)
return_color(label_points, colors_in, startup_indices)
  • colors_in : labels of the centers returned by the algorithm.
  • label_points : labels of the points, taking values in the same set as the startup indices.

If label_points[j] = 3, then we look for the center whose startup index is 3. If, for instance, it is the 5th center, then color[j] = colors_in[5]
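This lookup can be sketched like this (return_color_sketch is illustrative, not the package function):

```julia
# Map each point label to the color of the center whose startup
# index equals that label.
function return_color_sketch(label_points, colors_in, startup_indices)
    position = Dict(idx => pos for (pos, idx) in enumerate(startup_indices))
    return [colors_in[position[l]] for l in label_points]
end

# Label 3 matches the 3rd startup index, hence the 3rd color:
return_color_sketch([3, 7], [10, 20, 30], [7, 5, 3])  # → [30, 10]
```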

source
GeometricClusterAnalysis.select_parameters_nonincreasing (Method)
select_parameters_nonincreasing(
    rng,
    vk,
    valpha,
    x,
    bregman,
    maxiter,
    nstart
)

We force the risk curve to be nonincreasing in alpha by reusing the optimal centers from the previous alpha.

  • k is a number or a vector containing the different values of k
  • alpha is a number or a vector containing the different values of alpha
  • with force_decreasing = false, all initializations are random.
source
GeometricClusterAnalysis.subcolorize (Method)
subcolorize(points, nsignal, result, startup_indices)

Auxiliary function that, given the point cloud, the number of signal points, the result of kpdtm or kplm, and the startup indices of the hclust method, computes the "new twisted distances" from every point of P to every center whose index is among the startup indices. Each point is assigned to its nearest center.

source
GeometricClusterAnalysis.trimmed_bregman_clustering (Method)
trimmed_bregman_clustering(
    rng,
    x,
    k,
    α,
    bregman,
    maxiter,
    nstart
)
  • n : number of points
  • d : dimension

Input :

  • x : sample of n points in $R^d$ - matrix of size n $\times$ d
  • α : proportion of trimmed points, i.e. points considered as outliers; they are given the label 0
  • k : number of centers
  • bregman : function of two numbers or vectors, named x and y, returning their Bregman divergence.
  • maxiter: maximum number of iterations allowed.
  • nstart: if centers is a number, it is the number of different initializations of the algorithm. Only the best result is kept.

Output :

  • centers: matrix of size d × k whose columns represent the centers of the clusters
  • cluster: vector of integers in 1:k indicating the index of the cluster to which each point (row) of x is assigned.
  • risk: average of the divergences of the points of x from their associated centers.
  • divergence: the vector of divergences of the points of x from their nearest center in centers.
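The bregman argument is any such two-argument divergence; a minimal example is the squared Euclidean divergence (euclidean_sq is an illustrative name, not a package export):

```julia
# Squared Euclidean distance: the Bregman divergence generated by
# the potential x ↦ ‖x‖². Works on numbers and on vectors.
euclidean_sq(x, y) = sum((x .- y) .^ 2)

euclidean_sq([1.0, 2.0], [0.0, 0.0])  # → 5.0
```

Any other Bregman divergence (e.g. one adapted to Poisson-distributed data) can be passed in the same way, as long as it accepts two arguments of the same shape.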
source
GeometricClusterAnalysis.trimmed_bregman_clustering (Method)
trimmed_bregman_clustering(
    rng,
    x,
    centers,
    α,
    bregman,
    maxiter
)
  • n : number of points
  • d : dimension

Input :

  • x : sample of n points in $R^d$ - matrix of size n $\times$ d
  • centers : initial centers
  • α : proportion of trimmed points, i.e. points considered as outliers; they are given the label 0
  • bregman : function of two numbers or vectors, named x and y, returning their Bregman divergence.
  • maxiter: maximum number of iterations allowed.

Output :

  • centers: matrix of size d × k whose columns represent the centers of the clusters
  • risk: average of the divergences of the points of x from their associated centers.
source