4.1. The corpus
To show the relevance of the local stopping rules for chronological clustering, we use the 16 investiture speeches delivered by 7 candidates for Prime Minister (Suárez, Calvo-Sotelo, Felipe González, Aznar, Zapatero, Rajoy and Sánchez) from December 1978 to the present day. In each speech, the candidate presents his government programme; the deputies then vote to grant or deny their confidence.
First, we remove all objects in memory.
Secondly, we download the file containing the 16 speeches in RData format and inspect its structure.
Given the SpanishDisc16 data frame object, the best way to understand the data structure is to use str().
rm(list=ls())                    # Remove all objects in memory
# load("SpanishDisc16.RData")    # Alternative: load from a local copy of the file
con <- url('https://xplortext.unileon.es/wp-content/uploads/2022/06/SpanishDisc16..rdata') # Create connection
load(con)    # Load the data
close(con)   # Close the connection
str(SpanishDisc16)
'data.frame': 16 obs. of 8 variables:
$ chronology : num 1 2 3 4 5 6 7 8 9 10 ...
$ acronym : chr "Su79" "CS81" "Gz82" "Gz86" ...
$ name : chr "Suárez" "CalvoSotelo" "González" "González" ...
$ politicparty: chr "UCD" "UCD" "PSOE" "PSOE" ...
$ year : num 1979 1981 1982 1986 1989 ...
$ legislatura : chr "I" "I" "II" "III" ...
$ result : chr "YES" "YES" "YES" "YES" ...
$ text : Factor w/ 16 levels "buenas tardes ya, señorías: en 1979, 40 años atrás, se celebró el primer debate de investidura en esta Cámara. "| __truncated__,..: 2 11 4 10 7 6 3 14 9 8 ...
Here, a speech is identified by its rank (from 1 to 16), two letters summarizing the candidate's name and, finally, the last two digits of the year. So we have a small number of documents with a similar structure, which eases the presentation of the methodology. The texts of the speeches have been downloaded from the Diario de Sesiones del Congreso de los Diputados (Journal of Sessions of the Congress of Deputies).
Their writing has been unified (e.g., IAE and Impuesto de Actividades Económicas).
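For reference, the document names that appear later in the TextData summaries ("1.Su79", "2.CS81", ...) combine the rank and the acronym; they can be reproduced directly from the data frame columns:

```r
# Combine rank (chronology) and acronym to reproduce the document names
paste(SpanishDisc16$chronology, SpanishDisc16$acronym, sep = ".")
# First element: "1.Su79"
```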
We present 16 investiture speeches (rows) and 8 variables (columns):
SpanishDisc16$chronology
The rank of the speech, defined as a number (from 1 to 16).
SpanishDisc16$acronym
A four-character label (name of the politician and year), useful to represent the speech in tables and graphs.
SpanishDisc16$name
Name of the politician, defined as a factor. There are 7 different politicians (Aznar, CalvoSotelo, González, Rajoy, Sánchez, Suárez and Zapatero) and 16 investiture speeches.
options(width = 300)
table(SpanishDisc16$name)
Aznar CalvoSotelo González Rajoy Sánchez Suárez Zapatero
2 1 4 3 3 1 2
SpanishDisc16$politicparty
Name of the political party, defined as a factor. There are 3 different political parties (PP, PSOE, UCD).
table(SpanishDisc16$politicparty)
PP PSOE UCD
5 9 2
SpanishDisc16$year
The year, defined as a numerical variable, ranging from 1979 to 2019.
table(SpanishDisc16$year)
1979 1981 1982 1986 1989 1993 1996 2000 2004 2008 2011 2016 2019
1 1 1 1 1 1 1 1 1 1 1 3 2
SpanishDisc16$legislatura
From 1979 to 2022 there have been 13 legislative periods (I to XIII) in Spain.
table(SpanishDisc16$legislatura)
I II III IV IX V VI VII VIII X XI XII XIII
2 1 1 1 1 1 1 1 1 1 1 2 2
SpanishDisc16$result
"YES" if the candidate became President of the Government, "NO" in the opposite case.
SpanishDisc16$text
The text of the speech, defined as a factor with 16 levels (one per speech).
Pretreatment
The capital letters introduced in the corpus at the moment of its capture by the Diario de Sesiones del Congreso de los Diputados carry semantic information. To preserve this information, capital letters at the beginning of sentences have been manually removed in the database; those that remain serve, in general, to differentiate homographs. Thus, "Gobierno" (the Government) is differentiated from "gobierno" (I govern). It will therefore be necessary to specify in the script that capital letters are to be kept.
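To see why lowercasing must be avoided here, note that a blanket tolower() would merge the very homographs the corpus deliberately distinguishes:

```r
# Lowercasing collapses "Gobierno" (the Government) into "gobierno" (I govern)
tolower(c("Gobierno", "gobierno"))
# Both elements become "gobierno"; hence lower = FALSE is used in TextData below
```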
Loading Xplortext package
library(Xplortext)
Loading required package: FactoMineR
Loading required package: ggplot2
Loading required package: tm
Loading required package: NLP
Attaching package: 'NLP'
The following object is masked from 'package:ggplot2': annotate
Building TextData object
Before doing any analysis we need to construct an object of the TextData class.
Each row of the source base is considered a source document.
The TextData function builds the working documents-by-words table that is submitted to the analysis. In this case we will not use any information related to
contextual variables. A non-aggregate table (direct analysis) is defined by default (var.agg=NULL).
Working documents are built from the non-empty source documents: DocTerm is a non-aggregate lexical table with as many rows as non-empty source documents and as many columns as selected words.
We can get the arguments of the TextData function by executing:
args(TextData)
function (base, var.text = NULL, var.agg = NULL, context.quali = NULL,
context.quanti = NULL, selDoc = "ALL", lower = TRUE, remov.number = TRUE,
lminword = 1, Fmin = Dmin, Dmin = 1, Fmax = Inf, stop.word.tm = FALSE,
idiom = "en", stop.word.user = NULL, segment = FALSE, sep.weak = "default",
sep.strong = "[()¿?./:¡!=;{}]…", seg.nfreq = 10, seg.nfreq2 = 10,
seg.nfreq3 = 10, graph = FALSE)
Computing results before threshold. Initial corpus description
TD <- TextData(SpanishDisc16, var.text=c(8), var.agg=NULL, Fmin=1, Dmin=1,
               idiom="es", lower=FALSE, remov.number=FALSE,
               stop.word.tm=FALSE, stop.word.user=NULL, graph=FALSE)
- In the previous command, we have chosen the eighth variable as the textual variable: text
- A direct analysis is defined by var.agg=NULL (the default)
- The selection of words uses the following arguments:
- Minimum length of a word to be selected (lminword=1)
- Minimum frequency of a word to be selected (Fmin=1)
- A word has to be used in at least 1 source document to be selected (Dmin=1)
- Maximum frequency of a word to be selected (Fmax=Inf by default)
- idiom="es": the declared idiom for the textual column(s) is Spanish (see IETF language codes in the NLP package)
- lower=FALSE: the corpus is not converted into lowercase
- remov.number=FALSE: numbers are not removed from the corpus
- stop.word.tm=FALSE: no stoplist is provided by the tm package in accordance with the idiom
- stop.word.user=NULL: no stop word list is provided by the user
- segment=FALSE: repeated segments are not selected (the default)
To obtain a summary of the TextData object for 16 documents and 10 words we use the summary function (summary.TextData). Run help(summary.TextData) for more detail.
options(width = 300)
summary(TD, ndoc=16, nword=10, info=FALSE)
TextData summary
Before After
Documents 16.00 16.00
Occurrences 162796.00 162796.00
Words 12178.00 12178.00
Mean-length 10174.75 10174.75
Statistics for the documents
DocName Occurrences DistinctWords PctLength MeanLength100 Occurrences DistinctWords PctLength MeanLength100
             before        before    before        before       after         after     after         after
1 15.Sa19 16240 3287 9.98 159.61 16240 3287 9.98 159.61
2 16.Sa20 15358 3219 9.43 150.94 15358 3219 9.43 150.94
3 12.Sa16 13501 2782 8.29 132.69 13501 2782 8.29 132.69
4 1.Su79 12137 2824 7.46 119.29 12137 2824 7.46 119.29
5 4.Gz86 11330 2069 6.96 111.35 11330 2069 6.96 111.35
6 7.Az96 10240 2351 6.29 100.64 10240 2351 6.29 100.64
7 13.Rj16 10019 2260 6.15 98.47 10019 2260 6.15 98.47
8 11.Rj11 9745 2379 5.99 95.78 9745 2379 5.99 95.78
9 3.Gz82 9415 2526 5.78 92.53 9415 2526 5.78 92.53
10 10.Zp08 8817 2278 5.42 86.66 8817 2278 5.42 86.66
11 2.CS81 8267 2174 5.08 81.25 8267 2174 5.08 81.25
12 8.Az00 8259 1987 5.07 81.17 8259 1987 5.07 81.17
13 6.Gz93 8134 2049 5.00 79.94 8134 2049 5.00 79.94
14 9.Zp04 7882 2021 4.84 77.47 7882 2021 4.84 77.47
15 5.Gz89 7591 1811 4.66 74.61 7591 1811 4.66 74.61
16 14.Rj16 5861 1565 3.60 57.60 5861 1565 3.60 57.60
Index of the 10 most frequent words
Word Frequency N.Documents
1 de 11791 16
2 la 7939 16
3 y 5609 16
4 que 5551 16
5 en 4865 16
6 el 4482 16
7 a 3513 16
8 los 3183 16
9 las 2275 16
10 del 1855 16
To obtain a barplot of the 25 documents with the greatest length (in this case only 16) we use the plot function (plot.TextData); run help(plot.TextData) for more detail. We add the frequencies and a vertical line (vline) representing the average speech size.
pl2<- plot(TD,sel="doc",title="Documents with higher length", xtitle="Document length", col.fill="slategray1", theme=theme_classic(), freq=-800, vline= TRUE)
pl2
Stopwords
The stopwords are suppressed, and only the words with a frequency of at least 10 that appear in at least two speeches are retained. The size of the corpus is 162,796 occurrences before applying the selection and 55,850 occurrences after. A total of 1,558 distinct words are retained.
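The two thresholds act on the documents-by-words table: Fmin is a minimum total frequency and Dmin a minimum number of documents. Their logic can be sketched on a toy matrix (plain base R, not Xplortext code; the matrix and word names are invented for illustration):

```r
# Toy documents-by-words matrix: 3 documents (rows) x 4 words (columns)
dtm <- matrix(c(6, 0, 1, 12,
                5, 0, 1, 10,
                4, 2, 1,  9),
              nrow = 3, byrow = TRUE,
              dimnames = list(NULL, c("w1", "w2", "w3", "w4")))
keep <- colSums(dtm) >= 10 &      # total frequency >= Fmin (here 10)
        colSums(dtm > 0) >= 2     # present in >= Dmin documents (here 2)
colnames(dtm)[keep]               # "w1" "w4" pass both thresholds
```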
To define the list of stopwords that must be removed from the documents we build an object named "swu":
swu <- c("consiguiente", "ello", "hacia", "punto", "si", "Sus", "vista", "A", "B", "C", "D", "E", "F", "a", "b", "c", "d")
The tm package has a list of words that can be used as stopwords depending on the language (English, en, by default). These words can be retrieved with the stopwords command: tm::stopwords(kind=idiom)
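For example, the Spanish list applied when stop.word.tm=TRUE can be inspected directly (the exact contents and length depend on the installed tm version, so no output is shown here):

```r
library(tm)
head(stopwords(kind = "es"))    # first entries of tm's Spanish stopword list
length(stopwords(kind = "es"))  # size of the list (version-dependent)
```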
TD <- TextData(SpanishDisc16, var.text=c(8), Fmin=10, Dmin=2, idiom="es", lower=FALSE,
remov.number=TRUE, stop.word.tm=TRUE, stop.word.user=swu, graph=FALSE)
summary(TD, ndoc=16, nword=10, info=FALSE)
TextData summary
Before After
Documents 16.00 16.00
Occurrences 162796.00 55850.00
Words 12178.00 1558.00
Mean-length 10174.75 3490.62
Statistics for the documents
DocName Occurrences DistinctWords PctLength MeanLength100 Occurrences DistinctWords PctLength MeanLength100
             before        before    before        before       after         after     after         after
1 15.Sa19 16240 3287 9.98 159.61 5402 1174 9.67 154.76
2 16.Sa20 15358 3219 9.43 150.94 5116 1178 9.16 146.56
3 12.Sa16 13501 2782 8.29 132.69 4506 1046 8.07 129.09
4 1.Su79 12137 2824 7.46 119.29 4166 1073 7.46 119.35
5 4.Gz86 11330 2069 6.96 111.35 3901 936 6.98 111.76
6 7.Az96 10240 2351 6.29 100.64 3817 1070 6.83 109.35
7 13.Rj16 10019 2260 6.15 98.47 3494 1017 6.26 100.10
8 11.Rj11 9745 2379 5.99 95.78 3320 996 5.94 95.11
9 8.Az00 8259 1987 5.07 81.17 3077 959 5.51 88.15
10 10.Zp08 8817 2278 5.42 86.66 3019 975 5.41 86.49
11 3.Gz82 9415 2526 5.78 92.53 2985 994 5.34 85.51
12 6.Gz93 8134 2049 5.00 79.94 2925 962 5.24 83.80
13 2.CS81 8267 2174 5.08 81.25 2809 925 5.03 80.47
14 9.Zp04 7882 2021 4.84 77.47 2690 875 4.82 77.06
15 5.Gz89 7591 1811 4.66 74.61 2652 880 4.75 75.97
16 14.Rj16 5861 1565 3.60 57.60 1971 724 3.53 56.47
Index of the 10 most frequent words
Word Frequency N.Documents
1 Gobierno 712 16
2 señorías 684 16
3 España 631 16
4 política 517 16
5 Estado 321 16
6 sociedad 303 16
7 país 302 16
8 social 302 16
9 españoles 290 15
10 años 283 16
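A companion barplot of the most frequent words can be drawn with the same plot method used above for the documents, assuming plot.TextData accepts sel="word" analogously to sel="doc" (see help(plot.TextData) to confirm the available options):

```r
# Barplot of the most frequent words of the thresholded corpus
plot(TD, sel = "word", title = "Most frequent words",
     xtitle = "Word frequency", col.fill = "slategray1",
     theme = theme_classic())
```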