Application to the clustering of authors' texts

The data extracted from the texts are stored in a variable df. The commands used to load and display the data are the following.

using CategoricalArrays
using DataFrames
using DelimitedFiles
using GeometricClusterAnalysis
using MultivariateStats
using Plots
using Random
import Clustering: mutualinfo

rng = MersenneTwister(2022)

table = readdlm(joinpath("assets", "textes.txt"))

# The first row of the file holds the text names, the first column the words.
df = DataFrame(
    hcat(table[2:end, 1], table[2:end, 2:end]),
    vec(vcat("authors", table[1, 1:end-1])),
    makeunique = true,
)
first(df, 10)
10×210 DataFrame
(Truncated output: each row is a word (be, have, say, do, go, come, see, know, make, man, ...), the first column, named authors, holds the words, and each remaining column is a text excerpt labelled by its author: Mark Twain, Charles Dickens, Nathaniel Hawthorne, Sir Arthur Conan Doyle, God, Obama. The entries are word counts.)

The following transposed version will be more convenient.

dft = DataFrame(
    [[names(df)[2:end]]; collect.(eachrow(df[:, 2:end]))],
    [:column; Symbol.(axes(df, 1))],
)
# Name the columns after the words stored in the first column of df.
rename!(dft, String.(vcat("authors", values(df[:, 1]))))
first(dft, 10)
10×51 DataFrame
(Truncated output: after transposition, each row is a text; the authors column holds the text name (Mark Twain, Mark Twain_1, ...) and the remaining 50 columns are the words (be, have, say, do, go, come, see, know, make, man, look, take, ...). The entries are word counts.)

We add a labels column containing the authors' names:

transform!(dft, "authors" => ByRow(x -> first(split(x, "_"))) => "labels")
first(dft, 10)
10×52 DataFrame
(Truncated output: same as above, with an extra labels column containing the author name, e.g. "Mark Twain".)

We compute a Principal Component Analysis (PCA):

X = Matrix{Float64}(df[!, 2:end])   # 50 word counts (rows) × 209 texts (columns)
X_labels = dft[!, :labels]

pca = fit(PCA, X; maxoutdim = 50)   # each column of X is one observation
X_pca = predict(pca, X)
37×209 Matrix{Float64}:
 -60.7414      9.83193   -77.8121    …  165.516     164.319      132.117
 144.626     146.345     141.665          4.83876     4.48532    -40.2968
   2.86075    26.2327    -21.3985        62.1869     60.0662     120.151
  69.6778     63.8413     78.3785        73.6832     57.7492      76.9475
 -24.8752    -12.6836    -32.0202       -28.9607    -22.7359     -45.6323
 -11.4354     -6.21062   -23.8029    …   32.7799     23.1818      44.5347
   8.94384    45.751      13.0142       -35.717     -27.565      -37.1215
  40.1841     14.2104      7.5561         7.79548     9.50864      6.38706
 -40.341     -28.3526    -16.3179        -8.18071    -1.23962     -2.80065
 -22.6938    -12.4824      3.27572        5.16248    11.1488       1.68726
   ⋮                                 ⋱                           
  -6.05136    -1.54905    -3.57877       -0.506974    0.447435    -0.537165
  -9.70702     8.41345     0.581103       3.45939    -2.3016       3.3244
  -0.810259   -0.595373   -2.07314   …    3.25579    -0.0607717   -0.296643
   9.3398     -0.637202   -6.41379        7.87453    -0.0905966   -2.50238
  -6.50688    -4.10144    -3.60888       -0.795095    4.65839      6.11248
  -0.394297    3.78447     1.90339        2.05996    -2.68061      1.92713
   2.0695     -0.447788    0.94063        1.51751    -3.63181      0.969622
   4.2428      5.01362    -4.58153   …    6.10315     7.88316     -2.04012
  -1.61565    -0.938709   -5.43101       -1.10219     2.87962      2.44519

Recoding labels for the linear discriminant analysis:

Y_labels = recode(
    X_labels,
    "Obama" => 1,
    "God" => 2,
    "Mark Twain" => 3,
    "Charles Dickens" => 4,
    "Nathaniel Hawthorne" => 5,
    "Sir Arthur Conan Doyle" => 6,
)

lda = fit(MulticlassLDA, X_pca, Y_labels; outdim=20)
points = predict(lda, X_pca)
20×209 Matrix{Float64}:
  0.0425391   0.100762    -0.0443951  …   0.0635746    0.0537748
 -0.0476023  -0.0335414   -0.0323627     -0.187445    -0.226327
  0.466552    0.428397     0.50515        0.41138      0.335156
  0.179813    0.279297     0.0988312      0.0381078    0.00635588
  0.165531    0.14778      0.251933      -0.492498    -0.643294
 -0.0169923   0.0517767   -0.0640841  …   0.0380897   -0.104371
 -0.15124    -0.0906387   -0.0796038      0.00310871   0.0282575
  0.0328233  -0.00191108  -0.081439       0.105777    -0.0338196
 -0.0304685   0.0716366    0.0276993      0.00296056   0.00174264
 -0.0733328  -0.131337     0.0478864     -0.00354468   0.0555406
  0.103659   -0.12559     -0.0123341  …  -0.027959     0.0108726
 -0.0140485  -0.0992294    0.0163172     -0.0278225   -0.0321585
  0.0107203  -0.112733    -0.0628497     -0.00171432  -0.015996
  0.0342722   0.137548     0.0161871     -0.00247884  -0.0916099
  0.063174   -0.0192288   -0.0520502      0.0195034   -0.0356987
 -0.0133432   0.00223252   0.124153   …   0.0235802    0.00187892
 -0.040238   -0.0548456    0.034812       0.00448765  -0.0806262
 -0.19582     0.0492696   -0.112352      -0.00700629  -0.0305518
  0.123881   -0.00719929   0.0167019     -0.0289716    0.0996145
 -0.221213    0.00749935  -0.0475717      0.0174744   -0.0196127

A function to display the clusterings:

function plot_clustering(points, cluster, true_labels; axis = 1:2)

    # One marker shape per cluster; 0 (the trimmed points) is a circle.
    pairs = Dict(1 => :rtriangle, 2 => :diamond, 3 => :square, 4 => :ltriangle,
                 5 => :star, 6 => :pentagon, 0 => :circle)

    shapes = replace(cluster, pairs...)

    # Marker shape encodes the estimated cluster, color the true label.
    p = scatter(points[1, :], points[2, :]; markershape = shapes,
                markercolor = true_labels, label = "")

    authors = ["Obama", "God", "Twain", "Dickens",
               "Hawthorne", "Conan Doyle"]

    # Dummy points used only to populate the legend, keeping the axes fixed.
    xl, yl = xlims(p), ylims(p)
    for (s, a) in zip(values(pairs), authors)
        scatter!(p, [1], markershape = s, markercolor = "blue", label = a, xlims = xl, ylims = yl)
    end
    for c in keys(pairs)
        scatter!(p, [1], markershape = :circle, markercolor = c, label = c, xlims = xl, ylims = yl)
    end
    plot!(p, xlabel = "PC1", ylabel = "PC2", legend = :outertopright)

    return p

end
plot_clustering (generic function with 1 method)

Data clustering

To cluster the data, we use the following parameters. The true proportion of outliers is 20/209, since 15 texts were extracted from the Bible and 5 from speeches by Obama.

k = 4
alpha = 20/209
maxiter = 50
nstart = 50
50

We apply the classical trimmed $k$-means algorithm (Cuesta-Albertos et al., 1997).
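For reference, trimmed $k$-means looks for centers $c_1, \dots, c_k$ minimising the $k$-means risk computed on the best-fitting $(1 - \alpha)$-fraction of the sample (a standard formulation from the literature, stated here as background):

$$\min_{c_1, \dots, c_k} \ \min_{|S| = \lceil (1 - \alpha) n \rceil} \ \sum_{i \in S} \min_{1 \le j \le k} \| x_i - c_j \|^2.$$

The function trimmed_bregman_clustering generalises this objective by replacing the squared Euclidean distance with a Bregman divergence.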

tb_kmeans = trimmed_bregman_clustering(rng, points, k, alpha, euclidean, maxiter, nstart)

plot_clustering(tb_kmeans.points, tb_kmeans.cluster, Y_labels)
(Figure: trimmed k-means clustering of the texts; marker shapes indicate the clusters, colors the true authors.)

Using the Bregman divergence associated with the Poisson distribution
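Recall that the Bregman divergence generated by $\phi(u) = u \log u - u$ is

$$d_\phi(u, v) = \sum_i \left( u_i \log \frac{u_i}{v_i} - u_i + v_i \right),$$

with the convention $0 \log 0 = 0$ (this standard formula is assumed to match the poisson divergence used below). It is only defined for nonnegative data, which is why we first shift the points coordinate-wise: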

# Subtract the row-wise minimum so that all entries are nonnegative.
function standardize!(points)
    points .-= minimum(points, dims = 2)
end

standardize!(points)
20×209 Matrix{Float64}:
 0.371239   0.429461   0.284305   0.351883  …  0.406523  0.392274  0.382475
 0.445381   0.459442   0.460621   0.392174     0.291369  0.305539  0.266656
 0.753697   0.715542   0.792295   0.614593     0.731444  0.698525  0.622301
 0.676151   0.775636   0.59517    0.562054     0.531942  0.534446  0.502694
 0.817917   0.800165   0.904319   0.862279     0.059263  0.159888  0.00909161
 0.234037   0.302806   0.186945   0.0       …  0.235471  0.289119  0.146658
 0.0601579  0.120759   0.131794   0.245968     0.158027  0.214507  0.239655
 0.389972   0.355237   0.275709   0.286407     0.430856  0.462925  0.323329
 0.18849    0.290595   0.246658   0.267849     0.260358  0.221919  0.220701
 0.139707   0.0817024  0.260926   0.248796     0.21737   0.209495  0.26858
 0.295311   0.0660611  0.179317   0.224351  …  0.162675  0.163692  0.202524
 0.159309   0.0741285  0.189675   0.167506     0.197067  0.145535  0.141199
 0.206475   0.0830217  0.132905   0.0          0.17484   0.19404   0.179758
 0.240473   0.343749   0.222388   0.282089     0.210068  0.203722  0.114591
 0.299407   0.217004   0.184183   0.129202     0.24267   0.255736  0.200534
 0.163483   0.179059   0.300979   0.200116  …  0.186113  0.200406  0.178705
 0.14086    0.126252   0.21591    0.165868     0.174994  0.185586  0.100472
 0.0        0.245089   0.0834676  0.206437     0.134483  0.188814  0.165268
 0.302945   0.171865   0.195766   0.231095     0.184692  0.150092  0.278678
 0.0        0.228712   0.173641   0.355764     0.260298  0.238687  0.2016
tb_poisson = trimmed_bregman_clustering(rng, points, k, alpha, poisson, maxiter, nstart)

plot_clustering(points, tb_poisson.cluster, Y_labels)
(Figure: trimmed clustering with the Poisson divergence; marker shapes indicate the clusters, colors the true authors.)

By using the Bregman divergence associated with the Poisson distribution, we see that the clustering method performs well with the parameters k = 4 and alpha = 20/209. Indeed, the outliers are precisely the texts from the Bible and from the Obama speeches, and the other texts are mostly well clustered.
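As a quick sanity check (a hypothetical sketch: it assumes that, as in the plot_clustering function above, the trimmed points carry the label 0 in the cluster field), we can count the texts flagged as outliers:

# Number of texts flagged as outliers; with alpha = 20/209 we expect
# about 20 of the 209 texts to be trimmed (label 0, by assumption).
n_outliers = sum(tb_poisson.cluster .== 0)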

Performance comparison

We measure the performance of the two clustering methods (the one based on the squared Euclidean distance and the one based on the Bregman divergence associated with the Poisson distribution) using the normalised mutual information (NMI).
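For two labellings $U$ and $V$, the NMI rescales the mutual information $I(U, V)$ by the entropies $H(U)$ and $H(V)$ so that it lies in $[0, 1]$. A common convention, assumed here to be the one computed by mutualinfo with normed = true, is

$$\mathrm{NMI}(U, V) = \frac{2 \, I(U, V)}{H(U) + H(V)}.$$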

True labelling, in which the texts from the Bible and the Obama speeches share the same label:

true_labels = copy(Y_labels)
true_labels[Y_labels .== 2] .= 1
15-element view(::Vector{Int64}, [190, 191, 192, 193, 194, 195, 196, 197, 198, 199, 200, 201, 202, 203, 204]) with eltype Int64:
 1
 1
 1
 1
 1
 1
 1
 1
 1
 1
 1
 1
 1
 1
 1

For trimmed $k$-means:

mutualinfo(true_labels, tb_kmeans.cluster, normed = true)
1.0

For the trimmed clustering with the Bregman divergence associated with the Poisson distribution:

mutualinfo(true_labels, tb_poisson.cluster, normed = true)
0.9372246668789784

Although on this particular run the trimmed $k$-means clustering reaches a normalised mutual information of 1.0 (the scores vary from one run to another because of the random initialisation; see the remark about nstart below), the Bregman divergence associated with the Poisson distribution is better suited to this data set, and choosing the right Bregman divergence generally improves the clustering compared to the classical trimmed $k$-means algorithm. Indeed, the number of occurrences of a word in a text of fixed length, written by a given author, can be modelled by a Poisson random variable. Assuming independence between the occurrence counts of different words is not realistic; however, since we only consider the 50 most frequent words, we accept this approximation. In the sequel, we use the Bregman divergence associated with the Poisson distribution.
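As a quick illustration of this modelling assumption (a hypothetical sketch: the rates below are made up, and Distributions is an additional dependency not used elsewhere on this page), we can simulate the counts of three words in three texts by the same author:

using Distributions

# Hypothetical mean occurrence rates for three words of one author.
rates = [310.0, 150.0, 95.0]
# One Poisson count per word and per text: a 3 × 3 matrix of counts.
counts = [rand(rng, Poisson(λ)) for λ in rates, _ in 1:3]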

Selecting the parameters $k$ and $\alpha$

We display the risk curves as functions of $k$ and $\alpha$. In practice, it is important to carry out this step, since we are not supposed to know the number of clusters or the proportion of outliers in advance.

vect_k = collect(1:6)
vect_alpha = [(1:5)./50; [0.15,0.25,0.75,0.85,0.9]]
nstart = 20

rng = MersenneTwister(20)

params_risks = select_parameters(rng, vect_k, vect_alpha, points, poisson, maxiter, nstart)

plot(; title = "select parameters")
for (i,k) in enumerate(vect_k)
   plot!( vect_alpha, params_risks[i, :], label ="k=$k", markershape = :circle )
end
xlabel!("alpha")
ylabel!("NMI")
(Figure: risk as a function of alpha, one curve per value of k.)

In order to select the parameters k and alpha, we focus on the different possible values of alpha. For alpha larger than roughly 0.15, we gain a lot going from 1 to 2 groups and from 2 to 3 groups. Therefore, we may choose k = 3, with alpha around 0.15, corresponding to the change of slope of the curve k = 3.

For alpha smaller than 0.15, we gain a lot going from 1 to 2 groups, from 2 to 3 groups and from 3 to 4 groups. However, we do not gain, in terms of risk, going from 4 to 5 groups or from 5 to 6 groups: the curves associated with the parameters $k = 4$, $k = 5$ and $k = 6$ are very close. So we cluster the data in $k = 4$ groups.

The curve associated with the parameter $k = 4$ decreases strongly, with a slope that stabilises around $\alpha = 0.1$.

Finally, since there is a jump in the slope of the curve $k = 6$, we may also choose the parameters $k = 6$ and $\alpha = 0$, that is, without trimming any outliers.

Note that, since the method is initialised with random centers, the curves representing the risk as a function of $k$ and $\alpha$ can vary quite strongly from one run to another. Consequently, the comments above do not necessarily match the displayed figure. For more robustness, we should increase the value of nstart, at the cost of a longer execution time. These curves for the selection of the parameters k and alpha are mostly indicative.
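One way to make them more stable (a sketch, under the assumption, consistent with the usage above, that select_parameters returns the matrix of risks indexed by k and alpha) is to average the risks over several independent runs:

# Average the risk curves over several runs to smooth out the
# variability due to the random initialisation.
nruns = 5
mean_risks = zeros(length(vect_k), length(vect_alpha))
for r in 1:nruns
    mean_risks .+= select_parameters(MersenneTwister(100 + r), vect_k,
                                     vect_alpha, points, poisson, maxiter, nstart)
end
mean_risks ./= nruns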

Finally, here are the three clusterings obtained with these three choices of the parameters.

maxiter = 50
nstart = 50
tb = trimmed_bregman_clustering(rng, points, 3, 0.15, poisson, maxiter, nstart)
plot_clustering(points, tb.cluster, Y_labels)
(Figure: clustering with k = 3 and alpha = 0.15.)

The texts by Twain, the texts from the Bible and the Obama speeches are considered as outliers.

tb = trimmed_bregman_clustering(rng, points, 4, 0.1, poisson, maxiter, nstart)
plot_clustering(points, tb.cluster, Y_labels)
(Figure: clustering with k = 4 and alpha = 0.1.)

The texts from the Bible and the Obama speeches are considered as outliers.

tb = trimmed_bregman_clustering(rng, points, 6, 0.0, poisson, maxiter, nstart)
plot_clustering(points, tb.cluster, Y_labels)
(Figure: clustering with k = 6 and alpha = 0.)

We obtain 6 groups, corresponding to the texts of the 4 authors, the texts from the Bible, and the Obama speeches.