JADT 2022. 4.3. Contiguity constrained clustering with local stopping

4.3. Contiguity constrained clustering with local stopping

As a stopping rule, we adopt the one proposed by Legendre et al. (1985) and Legendre and Legendre (2012). At each clustering stage, these works apply a permutation test to check whether the two nodes that are candidates for merger are homogeneous (merger allowed) or heterogeneous (merger not allowed).
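The logic of such a test can be sketched in base R. This is a minimal illustration under stated assumptions, not the procedure implemented in LexCHCca: the function name `perm_merge_test`, the simulated coordinates, and the test statistic (squared distance between the two group centroids) are all hypothetical choices made for this example, and Legendre et al. (1985) may use a different statistic.

```r
# Permutation test for the homogeneity of two candidate clusters (illustrative sketch).
# x: numeric matrix of coordinates (rows = documents); g: labels with two groups.
perm_merge_test <- function(x, g, n_perm = 999) {
  x <- as.matrix(x)
  g <- as.factor(g)
  # Assumed test statistic: squared distance between the two group centroids.
  stat <- function(labels) {
    c1 <- colMeans(x[labels == levels(g)[1], , drop = FALSE])
    c2 <- colMeans(x[labels == levels(g)[2], , drop = FALSE])
    sum((c1 - c2)^2)
  }
  observed <- stat(g)
  # Reference distribution: the same statistic after shuffling the labels.
  permuted <- replicate(n_perm, stat(sample(g)))
  # The observed arrangement is counted as one of the permutations.
  p_value <- (sum(permuted >= observed) + 1) / (n_perm + 1)
  list(statistic = observed, p.value = p_value)
}

set.seed(1)
a <- matrix(rnorm(20, mean = 0), ncol = 2)  # 10 simulated documents around (0, 0)
b <- matrix(rnorm(20, mean = 3), ncol = 2)  # 10 simulated documents around (3, 3)
res <- perm_merge_test(rbind(a, b), rep(c("A", "B"), each = 10))

alpha <- 0.05
# The merger is allowed only when the test does not reject homogeneity.
merge_allowed <- res$p.value > alpha
```

With two well-separated simulated groups, the p-value falls below α and the merger is refused, which is what produces a discontinuity between periods.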

4.3.1. Dendrograms depending on the critical p-value (denoted α)

Depending on the critical p-value (denoted α) chosen for this test, different dendrograms are obtained (Figure 2).

The number of discontinuities detected in the corpus, and hence the number of clusters that constitute time periods, depends on the chosen α value. For α=0.01 (Figure 2.a), two main clusters are obtained: one from the first speech to 11.Rj11 and one from 12.Sa16 to the end.

cl.TEST.0.01 <- LexCHCca(resCA, nb.clust="auto", cut.test=TRUE, alpha.test=0.01, description=FALSE, graph = FALSE)
plot(cl.TEST.0.01,tree.barplot=FALSE , title="Figure 2.a. HCCC with stopping local rule with alpha 0.01", type="tree")

If we choose α=0.05, three periods are identified (Figure 2.b). The first comprises the first five speeches, from 1.Su79 to 5.Gz89; the second, the speeches from 6.Gz93 to 11.Rj11; and the third, 12.Sa16 and the subsequent ones.

cl.TEST.0.05 <- LexCHCca(resCA, nb.clust="auto", cut.test=TRUE, alpha.test=0.05, description=FALSE, graph = FALSE)
plot(cl.TEST.0.05,tree.barplot=FALSE , title="Figure 2.b. HCCC with stopping local rule with alpha 0.05", type="tree")

When α is increased to 0.15 (Figure 2.c), the first period is split into two, while the other periods remain unchanged. Note that the shape of the dendrogram varies with the α value. Legendre et al. (1985) perform several analyses on non-textual data, varying α between 0.01 and 0.25: a small α allows more fusions and therefore leads to larger groups. In general, the number of clusters increases with α, although exceptions can be found.

cl.TEST.0.15 <- LexCHCca(resCA, nb.clust="auto", cut.test=TRUE, alpha.test=0.15, description=FALSE, graph = FALSE)
plot(cl.TEST.0.15,tree.barplot=FALSE , title="Figure 2.c. HCCC with stopping local rule with alpha 0.15", type="tree")

cl.TEST.0.3 <- LexCHCca(resCA, nb.clust="auto", cut.test=TRUE, alpha.test=0.3, description=FALSE, graph = FALSE)
plot(cl.TEST.0.3,tree.barplot=FALSE , title="Figure 2.d. HCCC with stopping local rule with alpha 0.3", type="tree")
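The relation between α and the number of fusions can be illustrated with a small worked example. The p-values below are hypothetical and are not taken from the corpus; the example only applies the decision rule (merger allowed when the p-value exceeds α) to a fixed set of candidate mergers.

```r
# Hypothetical p-values of five candidate mergers (not results from the corpus).
p_values <- c(0.004, 0.03, 0.08, 0.20, 0.60)

# Critical values to compare.
alphas <- c(0.01, 0.05, 0.15, 0.25)

# A merger is allowed when its p-value exceeds alpha (homogeneity not rejected).
merges_allowed <- sapply(alphas, function(a) sum(p_values > a))
names(merges_allowed) <- alphas

merges_allowed
```

The count of allowed mergers decreases from 4 to 1 as α grows from 0.01 to 0.25, so fewer fusions take place and more clusters remain. In the actual procedure the p-values are not fixed, since each test depends on the mergers made before it, which is one plausible reason why exceptions to this tendency occur.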

4.3.2. Lexical content of each cluster-period

After selecting α=0.05, and thus the partition into clusters, the lexical content of each cluster-period is identified by looking for the over-represented and under-represented words in each period.

Saving the LexCHCca description for α=0.05:

cl.TEST.0.05.descr <- LexCHCca(resCA, nb.clust="auto", cut.test=TRUE, alpha.test=0.05, description=TRUE, graph = FALSE)
names(cl.TEST.0.05.descr)

cl.TEST.0.05.descr$description$des.word
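One common way to decide whether a word is over-represented or under-represented in a period is a hypergeometric test on its frequency. The sketch below shows that idea in base R on a toy table. The counts, the words and the helper name `char_words` are hypothetical, and it is an assumption here, not a confirmed fact, that the description returned by LexCHCca relies on exactly this test.

```r
# Toy word-by-period frequency table (hypothetical counts).
freq <- matrix(c(30,  5,  4,
                 10, 12, 11,
                  2, 20,  3),
               nrow = 3, byrow = TRUE,
               dimnames = list(c("crisis", "europe", "reform"),
                               c("period1", "period2", "period3")))

# Hypergeometric p-values for each word in one period.
char_words <- function(freq, period) {
  n_total  <- sum(freq)            # total number of occurrences in the corpus
  n_period <- sum(freq[, period])  # number of occurrences in the period
  n_word   <- rowSums(freq)        # total count of each word
  k        <- freq[, period]       # count of each word inside the period
  # P(X >= k): small values point to over-representation.
  p_over  <- phyper(k - 1, n_word, n_total - n_word, n_period, lower.tail = FALSE)
  # P(X <= k): small values point to under-representation.
  p_under <- phyper(k, n_word, n_total - n_word, n_period)
  data.frame(word = rownames(freq), count = k, p.over = p_over, p.under = p_under)
}

res <- char_words(freq, "period1")
res
```

In this toy table, "crisis" occurs far more often in period1 than its overall frequency would suggest, and "reform" far less often, so the first receives a small `p.over` and the second a small `p.under`.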