3. Local stopping rules in contiguity constrained clustering
Local stopping rules (Milligan & Cooper, 1985; Gordon, 1999; Arbelaitz et al., 2013) have been applied in HC for different purposes, such as determining the number of clusters or checking their stability. HCCC can group together nodes that are chronologically close but heterogeneous. It is therefore necessary to determine whether fusing two nodes on the basis of their contiguity leads to a heterogeneous new node, in the sense that their lexical contents differ significantly.
To prevent inversions when applying contiguity-constrained aggregation methods, it is preferable to use complete-linkage aggregation (Nielsen, 2016). To avoid building heterogeneous clusters, we adopt the solution proposed by Legendre et al. (1985) and Legendre and Legendre (2012) under the name of chronological clustering.
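As an illustration of the aggregation step alone, without any stopping rule, the following minimal Python sketch merges only clusters that are adjacent in the chronological order and measures the height of each merge with complete linkage. The function names and the input format (a square matrix of pairwise distances between documents, indexed in chronological order) are assumptions made for this example and are not taken from the cited works.

```python
def complete_linkage(dist, a, b):
    """Largest pairwise distance between the members of clusters a and b."""
    return max(dist[i][j] for i in a for j in b)


def contiguous_complete_linkage(dist):
    """Agglomerate chronologically ordered items, merging only neighbours.

    dist is a square matrix (list of lists) of pairwise distances.
    Returns the merges as (left_cluster, right_cluster, height) tuples,
    in the order in which they were performed.
    """
    clusters = [[i] for i in range(len(dist))]
    merges = []
    while len(clusters) > 1:
        # Only clusters that are adjacent in the sequence are candidates.
        heights = [
            complete_linkage(dist, clusters[k], clusters[k + 1])
            for k in range(len(clusters) - 1)
        ]
        # Pick the adjacent pair with the smallest complete-linkage height
        # (the leftmost pair in case of ties).
        k = min(range(len(heights)), key=heights.__getitem__)
        merges.append((clusters[k], clusters[k + 1], heights[k]))
        clusters[k:k + 2] = [clusters[k] + clusters[k + 1]]
    return merges
```

Run on its own, this procedure always produces a complete hierarchy, since every pair of neighbouring clusters is eventually merged.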
This method was originally designed to detect discontinuities in multi-species time series. In this algorithm, the fusion of two contiguous nodes is stopped if, according to a permutation test computed on the distances between all the elements of the two groups, the relationship between their contents appears to be random. The result is a partition of the corpus into homogeneous groups of contiguous documents.
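The general logic of a permutation-based stopping rule can be sketched as follows. This is a simplified stand-in, not the test of Legendre et al. (1985): the statistic (mean between-group distance), the significance level `alpha`, the number of permutations and all function names are assumptions chosen for illustration. Under these assumptions, a fusion is refused when the observed between-group distance is unusually large compared with random reassignments of the pooled elements to two groups of the same sizes, that is, when the two nodes differ significantly.

```python
import random


def between_mean(dist, a, b):
    """Mean distance between the members of group a and those of group b."""
    return sum(dist[i][j] for i in a for j in b) / (len(a) * len(b))


def fusion_p_value(dist, a, b, n_perm=999, seed=0):
    """One-sided permutation p-value for the separation of groups a and b.

    The pooled elements are repeatedly reassigned at random to two groups
    of the original sizes. The p-value is the share of arrangements whose
    between-group mean distance is at least as large as the observed one.
    """
    rng = random.Random(seed)
    observed = between_mean(dist, a, b)
    pooled = list(a) + list(b)
    hits = 1  # the observed arrangement counts as one permutation
    for _ in range(n_perm):
        rng.shuffle(pooled)
        perm_a, perm_b = pooled[:len(a)], pooled[len(a):]
        # Small tolerance so that ties are not lost to rounding error.
        if between_mean(dist, perm_a, perm_b) >= observed - 1e-12:
            hits += 1
    return hits / (n_perm + 1)


def allow_fusion(dist, a, b, alpha=0.05, n_perm=999, seed=0):
    """Authorise the fusion unless the groups differ significantly."""
    return fusion_p_value(dist, a, b, n_perm, seed) > alpha
```

In a contiguity-constrained procedure, a check of this kind would be applied to each candidate pair of adjacent nodes before merging them; pairs that fail the check remain separate, which is why the resulting hierarchy is incomplete.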
A complete hierarchy is not built because some regroupings are not authorised. However, it is interesting to study the sub-dendrograms (separate parts of the tree) to follow how documents come together. In some cases, inversions are observed in these dendrograms, even though the complete-linkage aggregation method has been used. Note that the dendrograms obtained with and without the stopping rule differ, and not only because of the discontinuities.