JADT 2022. 4.3. Contiguity constrained clustering with local stopping

4.3. Contiguity constrained clustering with local stopping

As a stopping rule, we adopt the one proposed in Legendre et al. (1985) and Legendre and Legendre (2012). These works propose applying a permutation test at each clustering stage to check whether two nodes that are candidates for merger are homogeneous (merger allowed) or heterogeneous (merger not allowed).
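The logic of such a test can be illustrated with a minimal base R sketch. It is not the statistic of Legendre et al. (1985), nor the code run inside LexCHCca: it assumes, for illustration only, that heterogeneity is measured by the mean distance between the documents of the two candidate nodes, and that node labels are permuted among those documents. The function name perm_merge_test and the toy coordinates are hypothetical.

```r
# Hypothetical helper: permutation test for the homogeneity of two candidate nodes.
# x: numeric matrix of document coordinates (rows = documents)
# g: vector of node labels (exactly two distinct values), one per row of x
perm_merge_test <- function(x, g, n_perm = 999, seed = 1) {
  set.seed(seed)
  d <- as.matrix(dist(x))                  # Euclidean distances between documents
  labs <- unique(g)
  between_mean <- function(lab) {          # mean distance between the two nodes
    mean(d[which(lab == labs[1]), which(lab == labs[2])])
  }
  obs  <- between_mean(g)
  perm <- replicate(n_perm, between_mean(sample(g)))   # shuffle the node labels
  # proportion of permutations at least as extreme as the observed statistic
  p <- (sum(perm >= obs - 1e-12) + 1) / (n_perm + 1)
  list(statistic = obs, p.value = p)
}

# Toy example (invented coordinates): two well-separated nodes of five documents
set.seed(42)
x_toy <- rbind(matrix(rnorm(10, mean = 0, sd = 0.3), ncol = 2),
               matrix(rnorm(10, mean = 3, sd = 0.3), ncol = 2))
g_toy <- rep(c("A", "B"), each = 5)
res_toy <- perm_merge_test(x_toy, g_toy)

alpha <- 0.05
res_toy$p.value <= alpha   # TRUE means "heterogeneous": the merger is not allowed
```

In this sketch a small p-value signals that the two nodes are further apart than random relabelling would suggest, so the merger is refused whenever the p-value does not exceed α.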

4.3.1. Dendrograms depending on the critical p-value (denoted α)

Depending on the critical p-value (denoted α) chosen for this test, different dendrograms are obtained (Figure 2).

The number of discontinuities detected in the corpus, and hence the number of clusters that constitute time periods, depends on the chosen α value. For α=0.01, two main clusters are obtained (Figure 2.a): the first runs from the first speech until 11.Rj11 and the second from 12.Sa16 to the end.

cl.TEST.0.01 <- LexCHCca(resCA, nb.clust="auto", cut.test=TRUE, alpha.test=0.01, description=FALSE, graph = FALSE)
plot(cl.TEST.0.01,tree.barplot=FALSE , title="Figure 2.a. HCCC with stopping local rule with alpha 0.01", type="tree")

If we choose α=0.05, three periods are identified (Figure 2.b). The first is composed of the first five speeches from 1.Su79 to 5.Gz89, the second of the speeches from 6.Gz93 until 11.Rj11 and the third of 12.Sa16 and the subsequent ones.

cl.TEST.0.05 <- LexCHCca(resCA, nb.clust="auto", cut.test=TRUE, alpha.test=0.05, description=FALSE, graph = FALSE)
plot(cl.TEST.0.05,tree.barplot=FALSE , title="Figure 2.b. HCCC with stopping local rule with alpha 0.05", type="tree")

When α is increased to 0.15 (Figure 2.c), the first period is split into two while the others remain unchanged. Note that the shape of the dendrograms varies as the α value changes. Legendre et al. (1985) perform several analyses on non-textual data, varying the α value between 0.01 and 0.25; a small α allows more fusions and therefore leads to larger groups. In general, the relationship between the α value and the number of clusters follows this tendency, although exceptions can be found.

cl.TEST.0.15 <- LexCHCca(resCA, nb.clust="auto", cut.test=TRUE, alpha.test=0.15, description=FALSE, graph = FALSE)
plot(cl.TEST.0.15,tree.barplot=FALSE , title="Figure 2.c. HCCC with stopping local rule with alpha 0.15", type="tree")

cl.TEST.0.3 <- LexCHCca(resCA, nb.clust="auto", cut.test=TRUE, alpha.test=0.3, description=FALSE, graph = FALSE)
plot(cl.TEST.0.3,tree.barplot=FALSE , title="Figure 2.d. HCCC with stopping local rule with alpha 0.3", type="tree")
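To make the role of α more concrete, the following base R sketch runs a toy chronological clustering with a local stopping rule for several α values. It is only an illustration under simplifying assumptions: adjacent segments are compared by their mean inter-document distance, the test is a plain label permutation test rather than the one of Legendre et al. (1985), and the data are invented. The names perm_p and chrono_clusters are hypothetical and do not belong to Xplortext.

```r
# Permutation p-value for the mean distance between two adjacent segments a and b
# (d: full distance matrix; a, b: integer vectors of document indices)
perm_p <- function(d, a, b, n_perm = 499) {
  idx  <- c(a, b)
  stat <- function(first) mean(d[first, setdiff(idx, first)])
  obs  <- stat(a)
  perm <- replicate(n_perm, stat(sample(idx, length(a))))
  (sum(perm >= obs - 1e-12) + 1) / (n_perm + 1)
}

# Contiguity-constrained agglomeration: only adjacent segments may merge,
# and a merger is allowed only if the test does not reject homogeneity.
chrono_clusters <- function(x, alpha, seed = 1) {
  set.seed(seed)
  d    <- as.matrix(dist(x))
  segs <- as.list(seq_len(nrow(x)))          # each document starts as a segment
  repeat {
    if (length(segs) == 1) break
    gaps <- sapply(seq_len(length(segs) - 1),
                   function(i) mean(d[segs[[i]], segs[[i + 1]]]))
    merged <- FALSE
    for (i in order(gaps)) {                 # try the closest adjacent pair first
      if (perm_p(d, segs[[i]], segs[[i + 1]]) > alpha) {
        segs[[i]]     <- c(segs[[i]], segs[[i + 1]])
        segs[[i + 1]] <- NULL
        merged <- TRUE
        break
      }
    }
    if (!merged) break                       # every candidate merger was rejected
  }
  rep(seq_along(segs), lengths(segs))        # segment label of each document
}

# Toy chronological series (invented): 12 documents on one axis, three regimes
set.seed(7)
x <- matrix(c(rnorm(4, 0, 0.2), rnorm(4, 2, 0.2), rnorm(4, 4, 0.2)), ncol = 1)

# Number of periods found for each alpha value
sapply(c(0.01, 0.05, 0.15, 0.3), function(a) max(chrono_clusters(x, a)))
```

In this sketch a merger is accepted when the p-value exceeds α, so a smaller α makes rejection harder and tends to produce fewer, larger periods, in line with the tendency described above.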

4.3.2. Lexical content of each cluster-period

After selecting α=0.05, and thus the partition into clusters, the lexical content of each cluster-period has to be identified by looking for the over-represented and under-represented words in each period.

Saving the description for LexCHCca for α=0.05

cl.TEST.0.05.descr <- LexCHCca(resCA, nb.clust="auto", cut.test=TRUE, alpha.test=0.05, description=TRUE, graph = FALSE)
names(cl.TEST.0.05.descr)

cl.TEST.0.05.descr$description$des.word
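As an illustration of what over-represented and under-represented mean, the base R sketch below tests each word in each period against a hypergeometric model, a common choice for characteristic words. Whether LexCHCca performs exactly this computation is an assumption here, not something stated above. The count table is invented, and the names counts, char_words and res_words are hypothetical.

```r
# Toy counts (invented): rows = words, columns = periods
counts <- matrix(c(30,  5,  4,
                    6, 28,  7,
                   10, 12, 11),
                 nrow = 3, byrow = TRUE,
                 dimnames = list(c("w1", "w2", "w3"), c("P1", "P2", "P3")))

char_words <- function(counts) {
  N  <- sum(counts)                    # total number of occurrences in the corpus
  nw <- rowSums(counts)                # total occurrences of each word
  np <- colSums(counts)                # size of each period
  # one row per (word, period); word varies fastest, as in as.vector(counts)
  res <- expand.grid(word = rownames(counts), period = colnames(counts),
                     stringsAsFactors = FALSE)
  k <- as.vector(counts)               # observed count of the word in the period
  m <- unname(nw[res$word])
  n <- unname(np[res$period])
  res$count   <- k
  # X ~ hypergeometric: n occurrences drawn from N, of which m are the word
  res$p_over  <- phyper(k - 1, m, N - m, n, lower.tail = FALSE)  # P(X >= k)
  res$p_under <- phyper(k, m, N - m, n)                          # P(X <= k)
  res
}

res_words <- char_words(counts)
# Words flagged as over- or under-represented at the 5% level
subset(res_words, p_over < 0.05 | p_under < 0.05)
```

A small p_over marks a word used more often in a period than its overall frequency would suggest, and a small p_under marks the opposite.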

 

Looking at the most characteristic words of each of the 3 periods, we obtain:

  • Period 1.Su79 until 5.Gz89

This first period is characterised by words related to the transition (from dictatorship to democracy), the Socialists' arrival in power, the structure of the State of the Autonomous Regions, foreign policy, the consolidation of Spain in the European Union and the entry into NATO. In particular, the following words are overused in this period: politics, freedom, problem(s), inflation, unemployment, crisis, balance of payments, infrastructure. The last of these is used in expressions such as "infrastructure development", "infrastructure policy", "infrastructure investments" and "infrastructure upgrading".

  • Period 6.Gz93 until 11.Rj11

In this second period, we find many words related to economic growth and development, such as reform(s), activity, competitiveness, efficient, stability, boost, liberalisation of enterprises and so on. In particular, the economic crisis gives rise to many words, such as euro(s), million, financial, debt, deficit, public spending, profit, austerity, economic and so on.

  • Period 12.Sa16 until 16.Sa20

This last period is the most complex. Note that this cluster appears not only in this chronological clustering but also when hierarchical clustering is performed without restriction, and with the contiguity restriction but without stopping rules. This indicates that pronounced thematic differences exist between this last period and the former ones.
Particular themes are the need to obtain the votes of a sufficient number of members of parliament (vote(s), agreement, dialogue, pact and so on), territorial integrity and the problem posed by Catalonia.

New themes are emerging, such as climate change, corruption crimes, democratic regeneration and the question of equality, with words such as exclusion, equality, diversity and disability. These themes characterize the entire period.

CA previously showed that the two speeches by Rajoy in this period, on the one hand, and the three by Sánchez, on the other, are strongly opposed on the third axis (Figure 1). Thus, they present clear lexical differences on which we must also focus if we want to capture all elements of the corpus structure. Rajoy’s speeches insist on words such as investiture, elections, Partido Popular, Spain, Spaniards, while Sánchez's speeches mention current challenges through the words digital, digitalisation, ecological, as well as concerns more specific to left-wing parties, such as workers, poverty, precariousness, vulnerable, inequality, gender and women. Note that this opposition is not visible in the clustering results, because the vocabulary that unites the whole period is very important and well differentiated from that of the former periods. This fact highlights the complementarity between clustering and CA: the former captures the consolidated phenomena well, while CA is able to detect finer and more subtle aspects.