Application to the clustering of authors' texts

The data extracted from the texts are stored in a variable df. The commands used to load and display the data are the following.

using CategoricalArrays
using DataFrames
using DelimitedFiles
using GeometricClusterAnalysis
using MultivariateStats
using Plots
using Random
import Clustering: mutualinfo

rng = MersenneTwister(2022)

table = readdlm(joinpath("assets", "textes.txt"))

# The first row of the file holds the text names, the first column the words.
df = DataFrame(
    hcat(table[2:end, 1], table[2:end, 2:end]),
    vec(vcat("authors", table[1, 1:end-1])),
    makeunique = true,
)
first(df, 10)
10×210 DataFrame
(Truncated output: each row is a word (be, have, say, do, go, come, see, know, make, man, ...), the first column, named authors, holds the words, and each remaining column is a text excerpt labelled by its author: Mark Twain, Charles Dickens, Nathaniel Hawthorne, Sir Arthur Conan Doyle, God, Obama. The entries are word counts.)

The following transposed version will be more convenient.

dft = DataFrame(
    [[names(df)[2:end]]; collect.(eachrow(df[:, 2:end]))],
    [:column; Symbol.(axes(df, 1))],
)
# Name the columns after the words stored in the first column of df.
rename!(dft, String.(vcat("authors", values(df[:, 1]))))
first(dft, 10)
10×51 DataFrame
(Truncated output: after transposition, each row is a text; the authors column holds the text name (Mark Twain, Mark Twain_1, ...) and the remaining 50 columns are the words (be, have, say, do, go, come, see, know, make, man, look, take, ...). The entries are word counts.)

We add a labels column containing the authors' names:

transform!(dft, "authors" => ByRow(x -> first(split(x, "_"))) => "labels")
first(dft, 10)
10×52 DataFrame
(Truncated output: same as above, with an extra labels column containing the author name, e.g. "Mark Twain".)

We compute a Principal Component Analysis (PCA):

X = Matrix{Float64}(df[!, 2:end])   # 50 word counts (rows) × 209 texts (columns)
X_labels = dft[!, :labels]

pca = fit(PCA, X; maxoutdim = 50)   # each column of X is one observation
X_pca = predict(pca, X)
37×209 Matrix{Float64}:
 -60.7414      9.83193   -77.8121    …  165.516     164.319      132.117
 144.626     146.345     141.665          4.83876     4.48532    -40.2968
   2.86075    26.2327    -21.3985        62.1869     60.0662     120.151
  69.6778     63.8413     78.3785        73.6832     57.7492      76.9475
 -24.8752    -12.6836    -32.0202       -28.9607    -22.7359     -45.6323
 -11.4354     -6.21062   -23.8029    …   32.7799     23.1818      44.5347
   8.94384    45.751      13.0142       -35.717     -27.565      -37.1215
  40.1841     14.2104      7.5561         7.79548     9.50864      6.38706
 -40.341     -28.3526    -16.3179        -8.18071    -1.23962     -2.80065
 -22.6938    -12.4824      3.27572        5.16248    11.1488       1.68726
   ⋮                                 ⋱                           
  -6.05136    -1.54905    -3.57877       -0.506974    0.447435    -0.537165
  -9.70702     8.41345     0.581103       3.45939    -2.3016       3.3244
  -0.810259   -0.595373   -2.07314   …    3.25579    -0.0607717   -0.296643
   9.3398     -0.637202   -6.41379        7.87453    -0.0905966   -2.50238
  -6.50688    -4.10144    -3.60888       -0.795095    4.65839      6.11248
  -0.394297    3.78447     1.90339        2.05996    -2.68061      1.92713
   2.0695     -0.447788    0.94063        1.51751    -3.63181      0.969622
   4.2428      5.01362    -4.58153   …    6.10315     7.88316     -2.04012
  -1.61565    -0.938709   -5.43101       -1.10219     2.87962      2.44519

Recoding labels for the linear discriminant analysis:

Y_labels = recode(
    X_labels,
    "Obama" => 1,
    "God" => 2,
    "Mark Twain" => 3,
    "Charles Dickens" => 4,
    "Nathaniel Hawthorne" => 5,
    "Sir Arthur Conan Doyle" => 6,
)

lda = fit(MulticlassLDA, X_pca, Y_labels; outdim=20)
points = predict(lda, X_pca)
20×209 Matrix{Float64}:
  0.0425391   0.100762    -0.0443951  …   0.0635746    0.0537748
 -0.0476023  -0.0335414   -0.0323627     -0.187445    -0.226327
  0.466552    0.428397     0.50515        0.41138      0.335156
  0.179813    0.279297     0.0988312      0.0381078    0.00635588
  0.165531    0.14778      0.251933      -0.492498    -0.643294
 -0.0169923   0.0517767   -0.0640841  …   0.0380897   -0.104371
 -0.15124    -0.0906387   -0.0796038      0.00310871   0.0282575
  0.0328233  -0.00191108  -0.081439       0.105777    -0.0338196
 -0.0304685   0.0716366    0.0276993      0.00296056   0.00174264
 -0.0733328  -0.131337     0.0478864     -0.00354468   0.0555406
  0.103659   -0.12559     -0.0123341  …  -0.027959     0.0108726
 -0.0140485  -0.0992294    0.0163172     -0.0278225   -0.0321585
  0.0107203  -0.112733    -0.0628497     -0.00171432  -0.015996
  0.0342722   0.137548     0.0161871     -0.00247884  -0.0916099
  0.063174   -0.0192288   -0.0520502      0.0195034   -0.0356987
 -0.0133432   0.00223252   0.124153   …   0.0235802    0.00187892
 -0.040238   -0.0548456    0.034812       0.00448765  -0.0806262
 -0.19582     0.0492696   -0.112352      -0.00700629  -0.0305518
  0.123881   -0.00719929   0.0167019     -0.0289716    0.0996145
 -0.221213    0.00749935  -0.0475717      0.0174744   -0.0196127

A function to display the clusterings:

function plot_clustering(points, cluster, true_labels; axis = 1:2)

    # One marker shape per cluster; 0 (the trimmed points) is a circle.
    pairs = Dict(1 => :rtriangle, 2 => :diamond, 3 => :square, 4 => :ltriangle,
                 5 => :star, 6 => :pentagon, 0 => :circle)

    shapes = replace(cluster, pairs...)

    # Marker shape encodes the estimated cluster, color the true label.
    p = scatter(points[1, :], points[2, :]; markershape = shapes,
                markercolor = true_labels, label = "")

    authors = ["Obama", "God", "Twain", "Dickens",
               "Hawthorne", "Conan Doyle"]

    # Dummy points used only to populate the legend, keeping the axes fixed.
    xl, yl = xlims(p), ylims(p)
    for (s, a) in zip(values(pairs), authors)
        scatter!(p, [1], markershape = s, markercolor = "blue", label = a, xlims = xl, ylims = yl)
    end
    for c in keys(pairs)
        scatter!(p, [1], markershape = :circle, markercolor = c, label = c, xlims = xl, ylims = yl)
    end
    plot!(p, xlabel = "PC1", ylabel = "PC2", legend = :outertopright)

    return p

end
plot_clustering (generic function with 1 method)

Data clustering

To cluster the data, we use the following parameters. The true proportion of outliers is 20/209, since 15 texts were extracted from the Bible and 5 from speeches by Obama.

k = 4
alpha = 20/209
maxiter = 50
nstart = 50
50

We apply the classical trimmed $k$-means algorithm (Cuesta-Albertos et al., 1997).
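For reference, trimmed $k$-means looks for centers $c_1, \dots, c_k$ minimising the $k$-means risk computed on the best-fitting $(1 - \alpha)$-fraction of the sample (a standard formulation from the literature, stated here as background):

$$\min_{c_1, \dots, c_k} \ \min_{|S| = \lceil (1 - \alpha) n \rceil} \ \sum_{i \in S} \min_{1 \le j \le k} \| x_i - c_j \|^2.$$

The function trimmed_bregman_clustering generalises this objective by replacing the squared Euclidean distance with a Bregman divergence.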

tb_kmeans = trimmed_bregman_clustering(rng, points, k, alpha, euclidean, maxiter, nstart)

plot_clustering(tb_kmeans.points, tb_kmeans.cluster, Y_labels)
(Figure: trimmed k-means clustering of the texts; marker shapes indicate the clusters, colors the true authors.)

Using the Bregman divergence associated with the Poisson distribution
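Recall that the Bregman divergence generated by $\phi(u) = u \log u - u$ is

$$d_\phi(u, v) = \sum_i \left( u_i \log \frac{u_i}{v_i} - u_i + v_i \right),$$

with the convention $0 \log 0 = 0$ (this standard formula is assumed to match the poisson divergence used below). It is only defined for nonnegative data, which is why we first shift the points coordinate-wise: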

# Subtract the row-wise minimum so that all entries are nonnegative.
function standardize!(points)
    points .-= minimum(points, dims = 2)
end

standardize!(points)
20×209 Matrix{Float64}:
 0.371239   0.429461   0.284305   0.351883  …  0.406523  0.392274  0.382475
 0.445381   0.459442   0.460621   0.392174     0.291369  0.305539  0.266656
 0.753697   0.715542   0.792295   0.614593     0.731444  0.698525  0.622301
 0.676151   0.775636   0.59517    0.562054     0.531942  0.534446  0.502694
 0.817917   0.800165   0.904319   0.862279     0.059263  0.159888  0.00909161
 0.234037   0.302806   0.186945   0.0       …  0.235471  0.289119  0.146658
 0.0601579  0.120759   0.131794   0.245968     0.158027  0.214507  0.239655
 0.389972   0.355237   0.275709   0.286407     0.430856  0.462925  0.323329
 0.18849    0.290595   0.246658   0.267849     0.260358  0.221919  0.220701
 0.139707   0.0817024  0.260926   0.248796     0.21737   0.209495  0.26858
 0.295311   0.0660611  0.179317   0.224351  …  0.162675  0.163692  0.202524
 0.159309   0.0741285  0.189675   0.167506     0.197067  0.145535  0.141199
 0.206475   0.0830217  0.132905   0.0          0.17484   0.19404   0.179758
 0.240473   0.343749   0.222388   0.282089     0.210068  0.203722  0.114591
 0.299407   0.217004   0.184183   0.129202     0.24267   0.255736  0.200534
 0.163483   0.179059   0.300979   0.200116  …  0.186113  0.200406  0.178705
 0.14086    0.126252   0.21591    0.165868     0.174994  0.185586  0.100472
 0.0        0.245089   0.0834676  0.206437     0.134483  0.188814  0.165268
 0.302945   0.171865   0.195766   0.231095     0.184692  0.150092  0.278678
 0.0        0.228712   0.173641   0.355764     0.260298  0.238687  0.2016
tb_poisson = trimmed_bregman_clustering(rng, points, k, alpha, poisson, maxiter, nstart)

plot_clustering(points, tb_poisson.cluster, Y_labels)
(Figure: trimmed clustering with the Poisson divergence; marker shapes indicate the clusters, colors the true authors.)

By using the Bregman divergence associated with the Poisson distribution, we see that the clustering method performs well with the parameters k = 4 and alpha = 20/209. Indeed, the outliers are precisely the texts from the Bible and from the Obama speeches, and the other texts are mostly well clustered.
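As a quick sanity check (a hypothetical sketch: it assumes that, as in the plot_clustering function above, the trimmed points carry the label 0 in the cluster field), we can count the texts flagged as outliers:

# Number of texts flagged as outliers; with alpha = 20/209 we expect
# about 20 of the 209 texts to be trimmed (label 0, by assumption).
n_outliers = sum(tb_poisson.cluster .== 0)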

Performance comparison

We measure the performance of the two clustering methods (the one based on the squared Euclidean distance and the one based on the Bregman divergence associated with the Poisson distribution) using the normalised mutual information (NMI).
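For two labellings $U$ and $V$, the NMI rescales the mutual information $I(U, V)$ by the entropies $H(U)$ and $H(V)$ so that it lies in $[0, 1]$. A common convention, assumed here to be the one computed by mutualinfo with normed = true, is

$$\mathrm{NMI}(U, V) = \frac{2 \, I(U, V)}{H(U) + H(V)}.$$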

True labelling, in which the texts from the Bible and the Obama speeches share the same label:

true_labels = copy(Y_labels)
true_labels[Y_labels .== 2] .= 1
15-element view(::Vector{Int64}, [190, 191, 192, 193, 194, 195, 196, 197, 198, 199, 200, 201, 202, 203, 204]) with eltype Int64:
 1
 1
 1
 1
 1
 1
 1
 1
 1
 1
 1
 1
 1
 1
 1

For trimmed $k$-means:

mutualinfo(true_labels, tb_kmeans.cluster, normed = true)
1.0

For the trimmed clustering with the Bregman divergence associated with the Poisson distribution:

mutualinfo(true_labels, tb_poisson.cluster, normed = true)
0.9372246668789784

Although on this particular run the trimmed $k$-means clustering reaches a normalised mutual information of 1.0 (the scores vary from one run to another because of the random initialisation; see the remark about nstart below), the Bregman divergence associated with the Poisson distribution is better suited to this data set, and choosing the right Bregman divergence generally improves the clustering compared to the classical trimmed $k$-means algorithm. Indeed, the number of occurrences of a word in a text of fixed length, written by a given author, can be modelled by a Poisson random variable. Assuming independence between the occurrence counts of different words is not realistic; however, since we only consider the 50 most frequent words, we accept this approximation. In the sequel, we use the Bregman divergence associated with the Poisson distribution.
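As a quick illustration of this modelling assumption (a hypothetical sketch: the rates below are made up, and Distributions is an additional dependency not used elsewhere on this page), we can simulate the counts of three words in three texts by the same author:

using Distributions

# Hypothetical mean occurrence rates for three words of one author.
rates = [310.0, 150.0, 95.0]
# One Poisson count per word and per text: a 3 × 3 matrix of counts.
counts = [rand(rng, Poisson(λ)) for λ in rates, _ in 1:3]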

Selecting the parameters $k$ and $\alpha$

We display the risk curves as functions of $k$ and $\alpha$. In practice, it is important to carry out this step, since we are not supposed to know the number of clusters or the proportion of outliers in advance.

vect_k = collect(1:6)
vect_alpha = [(1:5)./50; [0.15,0.25,0.75,0.85,0.9]]
nstart = 20

rng = MersenneTwister(20)

params_risks = select_parameters(rng, vect_k, vect_alpha, points, poisson, maxiter, nstart)

plot(; title = "select parameters")
for (i,k) in enumerate(vect_k)
   plot!( vect_alpha, params_risks[i, :], label ="k=$k", markershape = :circle )
end
xlabel!("alpha")
ylabel!("NMI")
(Figure: risk as a function of alpha, one curve per value of k.)

In order to select the parameters k and alpha, we focus on the different possible values of alpha. For alpha larger than roughly 0.15, we gain a lot going from 1 to 2 groups and from 2 to 3 groups. Therefore, we may choose k = 3, with alpha around 0.15, corresponding to the change of slope of the curve k = 3.

For alpha smaller than 0.15, we gain a lot going from 1 to 2 groups, from 2 to 3 groups and from 3 to 4 groups. However, we do not gain, in terms of risk, going from 4 to 5 groups or from 5 to 6 groups: the curves associated with the parameters $k = 4$, $k = 5$ and $k = 6$ are very close. So we cluster the data in $k = 4$ groups.

The curve associated with the parameter $k = 4$ decreases strongly, with a slope that stabilises around $\alpha = 0.1$.

Finally, since there is a jump in the slope of the curve $k = 6$, we may also choose the parameters $k = 6$ and $\alpha = 0$, that is, without trimming any outliers.

Note that, since the method is initialised with random centers, the curves representing the risk as a function of $k$ and $\alpha$ can vary quite strongly from one run to another. Consequently, the comments above do not necessarily match the displayed figure. For more robustness, we should increase the value of nstart, at the cost of a longer execution time. These curves for the selection of the parameters k and alpha are mostly indicative.
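One way to make them more stable (a sketch, under the assumption, consistent with the usage above, that select_parameters returns the matrix of risks indexed by k and alpha) is to average the risks over several independent runs:

# Average the risk curves over several runs to smooth out the
# variability due to the random initialisation.
nruns = 5
mean_risks = zeros(length(vect_k), length(vect_alpha))
for r in 1:nruns
    mean_risks .+= select_parameters(MersenneTwister(100 + r), vect_k,
                                     vect_alpha, points, poisson, maxiter, nstart)
end
mean_risks ./= nruns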

Finally, here are the three clusterings obtained with these three choices of the parameters.

maxiter = 50
nstart = 50
tb = trimmed_bregman_clustering(rng, points, 3, 0.15, poisson, maxiter, nstart)
plot_clustering(points, tb.cluster, Y_labels)
(Figure: clustering with k = 3 and alpha = 0.15.)

The texts by Twain, the texts from the Bible and the Obama speeches are considered as outliers.

tb = trimmed_bregman_clustering(rng, points, 4, 0.1, poisson, maxiter, nstart)
plot_clustering(points, tb.cluster, Y_labels)
(Figure: clustering with k = 4 and alpha = 0.1.)

The texts from the Bible and the Obama speeches are considered as outliers.

tb = trimmed_bregman_clustering(rng, points, 6, 0.0, poisson, maxiter, nstart)
plot_clustering(points, tb.cluster, Y_labels)
(Figure: clustering with k = 6 and alpha = 0.)

We obtain 6 groups, corresponding to the texts of the 4 authors, the texts from the Bible, and the Obama speeches.