Application to the clustering of authors' texts
The data extracted from the texts are stored in the variable `df`. The commands used for loading and displaying the data are the following.
using CategoricalArrays
using DataFrames
using DelimitedFiles
using GeometricClusterAnalysis
using MultivariateStats
using Plots
using Random
import Clustering: mutualinfo
rng = MersenneTwister(2022)
table = readdlm(joinpath("assets", "textes.txt"))
df = DataFrame(
hcat(table[2:end, 1], table[2:end, 2:end]),
vec(vcat("authors", table[1, 1:end-1])),
makeunique = true,
)
first(df, 10)
10×210 DataFrame (columns truncated). The first column, named `authors`, contains the words; each of the 209 remaining columns contains the counts of these words in one text, labelled by its author (Mark Twain, Charles Dickens, Nathaniel Hawthorne, Sir Arthur Conan Doyle, God, Obama).

Row | authors | Mark Twain | Mark Twain_1 | ⋯ | Obama_4
---|---|---|---|---|---
1 | be | 435 | 365 | ⋯ | 283
2 | have | 112 | 101 | ⋯ | 277
3 | say | 153 | 120 | ⋯ | 15
4 | do | 136 | 122 | ⋯ | 106
5 | go | 108 | 97 | ⋯ | 26
6 | come | 26 | 63 | ⋯ | 17
7 | see | 42 | 71 | ⋯ | 16
8 | know | 35 | 50 | ⋯ | 23
9 | make | 48 | 41 | ⋯ | 68
10 | man | 41 | 27 | ⋯ | 2
The following transposed version will be more convenient.
dft = DataFrame(
[[names(df)[2:end]]; collect.(eachrow(df[:, 2:end]))],
[:column; Symbol.(axes(df, 1))],
)
rename!(dft, String.(vcat("authors", values(df[:, 1]))))
first(dft, 10)
10×51 DataFrame (columns truncated): one row per text, with the `authors` column followed by the counts of the 50 most frequent words (all Int64).

Row | authors | be | have | say | do | go | ⋯ | sir | last
---|---|---|---|---|---|---|---|---|---
1 | Mark Twain | 435 | 112 | 153 | 136 | 108 | ⋯ | 3 | 10
2 | Mark Twain_1 | 365 | 101 | 120 | 122 | 97 | ⋯ | 0 | 5
3 | Mark Twain_2 | 461 | 94 | 135 | 167 | 98 | ⋯ | 3 | 7
4 | Mark Twain_3 | 472 | 104 | 118 | 168 | 88 | ⋯ | 10 | 3
5 | Mark Twain_4 | 398 | 109 | 117 | 128 | 81 | ⋯ | 0 | 11
6 | Mark Twain_5 | 456 | 89 | 164 | 179 | 76 | ⋯ | 2 | 13
7 | Mark Twain_6 | 490 | 112 | 177 | 181 | 106 | ⋯ | 2 | 17
8 | Mark Twain_7 | 507 | 85 | 181 | 219 | 85 | ⋯ | 1 | 13
9 | Mark Twain_8 | 414 | 95 | 142 | 193 | 90 | ⋯ | 3 | 17
10 | Mark Twain_9 | 427 | 112 | 65 | 95 | 36 | ⋯ | 4 | 10
We add the `labels` column containing the authors' names.
transform!(dft, "authors" => ByRow(x -> first(split(x, "_"))) => "labels")
first(dft, 10)
10×52 DataFrame (columns truncated): the same rows with the new `labels` column appended.

Row | authors | be | have | ⋯ | sir | last | labels
---|---|---|---|---|---|---|---
1 | Mark Twain | 435 | 112 | ⋯ | 3 | 10 | Mark Twain
2 | Mark Twain_1 | 365 | 101 | ⋯ | 0 | 5 | Mark Twain
⋮ | ⋮ | ⋮ | ⋮ | ⋯ | ⋮ | ⋮ | ⋮
10 | Mark Twain_9 | 427 | 112 | ⋯ | 4 | 10 | Mark Twain
We compute a Principal Component Analysis (PCA).
# variables (the 50 word counts) in rows, observations (the 209 texts) in columns,
# as expected by MultivariateStats
X = Matrix{Float64}(df[!, 2:end])
X_labels = dft[!, :labels]
pca = fit(PCA, X; maxoutdim = 50)
X_pca = predict(pca, X)
37×209 Matrix{Float64}:
-60.7414 9.83193 -77.8121 … 165.516 164.319 132.117
144.626 146.345 141.665 4.83876 4.48532 -40.2968
2.86075 26.2327 -21.3985 62.1869 60.0662 120.151
69.6778 63.8413 78.3785 73.6832 57.7492 76.9475
-24.8752 -12.6836 -32.0202 -28.9607 -22.7359 -45.6323
-11.4354 -6.21062 -23.8029 … 32.7799 23.1818 44.5347
8.94384 45.751 13.0142 -35.717 -27.565 -37.1215
40.1841 14.2104 7.5561 7.79548 9.50864 6.38706
-40.341 -28.3526 -16.3179 -8.18071 -1.23962 -2.80065
-22.6938 -12.4824 3.27572 5.16248 11.1488 1.68726
⋮ ⋱
-6.05136 -1.54905 -3.57877 -0.506974 0.447435 -0.537165
-9.70702 8.41345 0.581103 3.45939 -2.3016 3.3244
-0.810259 -0.595373 -2.07314 … 3.25579 -0.0607717 -0.296643
9.3398 -0.637202 -6.41379 7.87453 -0.0905966 -2.50238
-6.50688 -4.10144 -3.60888 -0.795095 4.65839 6.11248
-0.394297 3.78447 1.90339 2.05996 -2.68061 1.92713
2.0695 -0.447788 0.94063 1.51751 -3.63181 0.969622
4.2428 5.01362 -4.58153 … 6.10315 7.88316 -2.04012
-1.61565 -0.938709 -5.43101 -1.10219 2.87962 2.44519
We recode the labels for the linear discriminant analysis:
Y_labels = recode(
X_labels,
"Obama" => 1,
"God" => 2,
"Mark Twain" => 3,
"Charles Dickens" => 4,
"Nathaniel Hawthorne" => 5,
"Sir Arthur Conan Doyle" => 6,
)
lda = fit(MulticlassLDA, X_pca, Y_labels; outdim=20)
points = predict(lda, X_pca)
20×209 Matrix{Float64}:
0.0425391 0.100762 -0.0443951 … 0.0635746 0.0537748
-0.0476023 -0.0335414 -0.0323627 -0.187445 -0.226327
0.466552 0.428397 0.50515 0.41138 0.335156
0.179813 0.279297 0.0988312 0.0381078 0.00635588
0.165531 0.14778 0.251933 -0.492498 -0.643294
-0.0169923 0.0517767 -0.0640841 … 0.0380897 -0.104371
-0.15124 -0.0906387 -0.0796038 0.00310871 0.0282575
0.0328233 -0.00191108 -0.081439 0.105777 -0.0338196
-0.0304685 0.0716366 0.0276993 0.00296056 0.00174264
-0.0733328 -0.131337 0.0478864 -0.00354468 0.0555406
0.103659 -0.12559 -0.0123341 … -0.027959 0.0108726
-0.0140485 -0.0992294 0.0163172 -0.0278225 -0.0321585
0.0107203 -0.112733 -0.0628497 -0.00171432 -0.015996
0.0342722 0.137548 0.0161871 -0.00247884 -0.0916099
0.063174 -0.0192288 -0.0520502 0.0195034 -0.0356987
-0.0133432 0.00223252 0.124153 … 0.0235802 0.00187892
-0.040238 -0.0548456 0.034812 0.00448765 -0.0806262
-0.19582 0.0492696 -0.112352 -0.00700629 -0.0305518
0.123881 -0.00719929 0.0167019 -0.0289716 0.0996145
-0.221213 0.00749935 -0.0475717 0.0174744 -0.0196127
We define a function to display the clustering results:
function plot_clustering(points, cluster, true_labels; axis = 1:2)
    # marker shape encodes the estimated cluster, marker color the true label
    pairs = Dict(1 => :rtriangle, 2 => :diamond, 3 => :square, 4 => :ltriangle,
                 5 => :star, 6 => :pentagon, 0 => :circle)
    shapes = replace(cluster, pairs...)
    p = scatter(points[axis[1], :], points[axis[2], :]; markershape = shapes,
                markercolor = true_labels, label = "")
    authors = ["Obama", "God", "Twain", "Dickens", "Hawthorne", "Conan Doyle"]
    xl, yl = xlims(p), ylims(p)
    # dummy one-point series used only to build the legend entries
    for (s, a) in zip(values(pairs), authors)
        scatter!(p, [1], markershape = s, markercolor = "blue", label = a, xlims = xl, ylims = yl)
    end
    for c in keys(pairs)
        scatter!(p, [1], markershape = :circle, markercolor = c, label = c, xlims = xl, ylims = yl)
    end
    plot!(p, xlabel = "PC1", ylabel = "PC2", legend = :outertopright)
    return p
end
plot_clustering (generic function with 1 method)
Data clustering
To cluster the data, we will use the following parameters. The true proportion of outliers is 20/209, since 15 + 5 texts were extracted from the Bible or from a speech by Obama.
k = 4
alpha = 20/209
maxiter = 50
nstart = 50
50
We apply the classical trimmed $k$-means algorithm (Cuesta-Albertos et al., 1997).
tb_kmeans = trimmed_bregman_clustering(rng, points, k, alpha, euclidean, maxiter, nstart)
plot_clustering(tb_kmeans.points, tb_kmeans.cluster, Y_labels)
We now use the Bregman divergence associated with the Poisson distribution. Since this divergence is only defined for nonnegative coordinates, we first shift the points so that every coordinate becomes nonnegative.
function standardize!(points)
    # subtract the per-coordinate minimum so that all entries become nonnegative
    points .-= minimum(points, dims = 2)
end
standardize!(points)
20×209 Matrix{Float64}:
0.371239 0.429461 0.284305 0.351883 … 0.406523 0.392274 0.382475
0.445381 0.459442 0.460621 0.392174 0.291369 0.305539 0.266656
0.753697 0.715542 0.792295 0.614593 0.731444 0.698525 0.622301
0.676151 0.775636 0.59517 0.562054 0.531942 0.534446 0.502694
0.817917 0.800165 0.904319 0.862279 0.059263 0.159888 0.00909161
0.234037 0.302806 0.186945 0.0 … 0.235471 0.289119 0.146658
0.0601579 0.120759 0.131794 0.245968 0.158027 0.214507 0.239655
0.389972 0.355237 0.275709 0.286407 0.430856 0.462925 0.323329
0.18849 0.290595 0.246658 0.267849 0.260358 0.221919 0.220701
0.139707 0.0817024 0.260926 0.248796 0.21737 0.209495 0.26858
0.295311 0.0660611 0.179317 0.224351 … 0.162675 0.163692 0.202524
0.159309 0.0741285 0.189675 0.167506 0.197067 0.145535 0.141199
0.206475 0.0830217 0.132905 0.0 0.17484 0.19404 0.179758
0.240473 0.343749 0.222388 0.282089 0.210068 0.203722 0.114591
0.299407 0.217004 0.184183 0.129202 0.24267 0.255736 0.200534
0.163483 0.179059 0.300979 0.200116 … 0.186113 0.200406 0.178705
0.14086 0.126252 0.21591 0.165868 0.174994 0.185586 0.100472
0.0 0.245089 0.0834676 0.206437 0.134483 0.188814 0.165268
0.302945 0.171865 0.195766 0.231095 0.184692 0.150092 0.278678
0.0 0.228712 0.173641 0.355764 0.260298 0.238687 0.2016
tb_poisson = trimmed_bregman_clustering(rng, points, k, alpha, poisson, maxiter, nstart)
plot_clustering(points, tb_poisson.cluster, Y_labels)
Using the Bregman divergence associated with the Poisson distribution, the clustering method performs well with the parameters $k = 4$ and $\alpha = 20/209$: the detected outliers are indeed the texts from the Bible and from the Obama speech, and the remaining texts are mostly well clustered.
Performance comparison
We measure the performance of the two clustering methods (the one based on the squared Euclidean distance and the one based on the Bregman divergence associated with the Poisson distribution) using the normalised mutual information (NMI).
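As a quick aside (not part of the original analysis), note that the NMI is invariant under a relabelling of the clusters: two identical partitions with permuted labels reach the maximal score of 1.

# same partition of five points, with permuted cluster labels
mutualinfo([1, 1, 2, 2, 3], [3, 3, 1, 1, 2], normed = true)
# ≈ 1.0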
Ground-truth labelling, in which the texts from the Bible and from the Obama speech share the same label:
true_labels = copy(Y_labels)
true_labels[Y_labels .== 2] .= 1
15-element view(::Vector{Int64}, [190, 191, 192, 193, 194, 195, 196, 197, 198, 199, 200, 201, 202, 203, 204]) with eltype Int64:
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
For the trimmed $k$-means:
mutualinfo(true_labels, tb_kmeans.cluster, normed = true)
1.0
For the trimmed clustering with the Bregman divergence associated with the Poisson distribution:
mutualinfo(true_labels, tb_poisson.cluster, normed = true)
0.9372246668789784
The normalised mutual information is larger for the Bregman divergence associated with the Poisson distribution. This illustrates the fact that using the appropriate Bregman divergence improves the clustering, compared with the classical trimmed $k$-means algorithm. Indeed, the number of occurrences of a word in a text of fixed length, written by a given author, can be modelled by a Poisson random variable. Assuming independence between the occurrence counts of the different words is not realistic, but since we only consider the 50 most frequent words, we accept this approximation. In what follows we therefore use the Bregman divergence associated with the Poisson distribution.
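For reference, here is a minimal sketch (not taken from the package) of the Bregman divergence generated by $x \mapsto x \log x$, the divergence naturally associated with the Poisson distribution, summed over coordinates; the `poisson` divergence exported by GeometricClusterAnalysis is assumed to implement an equivalent formula, and the helper name below is purely illustrative.

# Poisson Bregman divergence, d(x, y) = Σᵢ ( xᵢ log(xᵢ / yᵢ) - xᵢ + yᵢ ),
# with the convention 0 log 0 = 0 (hypothetical helper, for illustration only)
function poisson_divergence_sketch(x, y)
    d(u, v) = u > 0 ? u * log(u / v) - u + v : v
    return sum(d.(x, y))
end

poisson_divergence_sketch([3.0, 0.0, 2.0], [2.5, 0.5, 2.0])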
Selecting the parameters $k$ and $\alpha$
We display the risk curves as functions of $k$ and $\alpha$. In practice, it is important to carry out this step, since in general we know neither the number of clusters nor the number of outliers in advance.
vect_k = collect(1:6)
vect_alpha = [(1:5)./50; [0.15,0.25,0.75,0.85,0.9]]
nstart = 20
rng = MersenneTwister(20)
params_risks = select_parameters(rng, vect_k, vect_alpha, points, poisson, maxiter, nstart)
plot(; title = "select parameters")
for (i,k) in enumerate(vect_k)
plot!( vect_alpha, params_risks[i, :], label ="k=$k", markershape = :circle )
end
xlabel!("alpha")
ylabel!("risk")
In order to select the parameters $k$ and $\alpha$, we focus on the different possible values of $\alpha$.

For $\alpha$ not smaller than 0.15, we gain a lot by going from 1 to 2 groups and then from 2 to 3 groups. Therefore, we choose $k = 3$ and $\alpha$ of order 0.15, which corresponds to the change of slope of the curve $k = 3$.
For $\alpha$ smaller than 0.15, we gain a lot by going from 1 to 2 groups, from 2 to 3 groups and from 3 to 4 groups. However, we do not gain, in terms of risk, by going from 4 to 5 groups or from 5 to 6 groups: the curves associated with the parameters $k = 4$, $k = 5$ and $k = 6$ are very close. So we cluster the data into $k = 4$ groups.
The curve associated with the parameter $k = 4$ decreases sharply, with a slope that stabilises around $\alpha = 0.1$.
Finally, since there is a jump in the slope of the curve $k = 6$, we could also choose the parameter $k = 6$ with $\alpha = 0$, that is, without considering any outliers.
Note that, since our method is initialised with random centers, the curves representing the risk as a function of $k$ and $\alpha$ can vary quite strongly from one run to another. Consequently, the comments above do not necessarily correspond to the displayed figure. For more robustness, we should increase the value of nstart, at the cost of a longer execution time. These curves for the selection of the parameters $k$ and $\alpha$ are mostly indicative.
Finally, here are the three clusterings obtained for three choices of the pair of parameters.
maxiter = 50
nstart = 50
tb = trimmed_bregman_clustering(rng, points, 3, 0.15, poisson, maxiter, nstart)
plot_clustering(points, tb.cluster, Y_labels)
The texts by Twain, the texts from the Bible and the Obama speech are considered as outliers.
tb = trimmed_bregman_clustering(rng, points, 4, 0.1, poisson, maxiter, nstart)
plot_clustering(points, tb.cluster, Y_labels)
The texts from the Bible and the Obama speech are considered as outliers.
tb = trimmed_bregman_clustering(rng, points, 6, 0.0, poisson, maxiter, nstart)
plot_clustering(points, tb.cluster, Y_labels)
We obtain 6 groups, corresponding to the texts of the four authors, to the texts from the Bible, and to the Obama speech.