JADT 2022

This script will be presented at JADT2022 http://jadt2022.vadistat.org/

Order constrained clustering with local stopping rules. Application in textual analysis

Ramón Alvarez-Esteban University of León – ramon.alvarez@unileon.es

Mónica Bécue-Bertaut Universitat Politècnica de Catalunya – monica.becue@upc.edu

The documents of a corpus can be related to an order variable, for example chronology, and this order may explain part of the variability in the vocabulary. When the vocabulary is strongly related to order, correspondence analysis (CA) displays the trajectory of the documents on the first factorial axes. Hierarchical contiguity constrained clustering (HCCC) methods, starting from the document coordinates on these axes, then group documents only if they are contiguous. However, because order is over-emphasized, HCCC may group documents whose content is heterogeneous. To obtain a partition that respects thematic differences, we propose to include in the algorithm a rule that stops the fusion of two nodes when they are judged heterogeneous, that is, when a discontinuity has been detected. HCCC with a local stopping rule is applied to Spanish political speeches to illustrate how it works in textual analysis.

Keywords: Hierarchical contiguity constrained clustering, local stopping rule, textual clustering.
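
The idea summarized in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names (centroid, ward_cost, hccc_local_stop), the Ward-type merge cost and, above all, the fixed threshold max_cost used as the heterogeneity test are assumptions made only for illustration. The abstract does not specify the actual stopping rule. The input plays the role of the document coordinates on the first CA axes, listed in their natural order.

```python
def centroid(points):
    """Mean point of a non-empty list of coordinate vectors."""
    n = len(points)
    return [sum(p[j] for p in points) / n for j in range(len(points[0]))]


def ward_cost(a, b):
    """Increase in within-cluster inertia if clusters a and b were merged."""
    ca, cb = centroid(a), centroid(b)
    dist2 = sum((x - y) ** 2 for x, y in zip(ca, cb))
    return len(a) * len(b) / (len(a) + len(b)) * dist2


def hccc_local_stop(coords, max_cost):
    """Contiguity constrained agglomeration with a local stopping rule.

    coords   : coordinate vectors of the documents, in their given order.
    max_cost : hypothetical threshold standing in for the heterogeneity
               test; a fusion costing more than this is refused.
    Returns the clusters as lists of document indices.
    """
    clusters = [[i] for i in range(len(coords))]  # each document starts alone
    blocked = set()  # boundaries where a discontinuity was detected,
                     # identified by the first document of the right cluster
    while True:
        # Only neighbouring clusters are candidates (contiguity constraint),
        # and only across boundaries that have not been blocked.
        candidates = [
            (ward_cost([coords[i] for i in clusters[k]],
                       [coords[i] for i in clusters[k + 1]]), k)
            for k in range(len(clusters) - 1)
            if clusters[k + 1][0] not in blocked
        ]
        if not candidates:
            break
        cost, k = min(candidates)
        if cost > max_cost:
            # Local stop: this fusion is refused for good, but the
            # remaining boundaries are still examined.
            blocked.add(clusters[k + 1][0])
        else:
            clusters[k:k + 2] = [clusters[k] + clusters[k + 1]]
    return clusters


docs = [[0.0], [0.1], [0.2], [5.0], [5.1], [5.2]]
print(hccc_local_stop(docs, max_cost=1.0))  # [[0, 1, 2], [3, 4, 5]]
```

In this toy example the jump between the third and fourth documents is treated as a discontinuity, so the two groups are never merged, whereas a very large max_cost lets the hierarchy proceed to a single cluster as in unconstrained stopping.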

1. Introduction

2. Cluster analysis and contiguity constrained clustering methods

3. Local stopping rules in contiguity constrained clustering

4. Application to a corpus of Spanish investiture speeches

4.1. The corpus

4.2. Correspondence analysis

4.3. Contiguity constrained clustering with local stopping

5. Conclusions

6. Bibliography

Appendix I

Appendix II