JADT 2022

4.1. The corpus

To show the relevance of local stopping rules in chronological clustering, we use the 16 investiture speeches given by 7 candidates for Prime Minister (Suárez, Calvo-Sotelo, Felipe González, Aznar, Zapatero, Rajoy and Sánchez) from December 1978 to the present day. In these speeches, the candidates present their government programme; the deputies then vote to grant or deny their confidence.

First, we remove all objects in memory.
Second, we download the file containing the 16 speeches in RData format and examine its structure.

Given the SpanishDisc16 data frame, the best way to understand the data structure is to use str().

rm(list=ls())                  # remove all objects in memory
# load("SpanishDisc16.RData")  # alternative: load from a local file
con <- url('https://xplortext.unileon.es/wp-content/uploads/2022/06/SpanishDisc16..rdata') # create connection
load(con)   # load the data
close(con)  # close the connection
str(SpanishDisc16)
'data.frame': 16 obs. of 8 variables:
$ chronology : num 1 2 3 4 5 6 7 8 9 10 ...
$ acronym : chr "Su79" "CS81" "Gz82" "Gz86" ...
$ name : chr "Suárez" "CalvoSotelo" "González" "González" ...
$ politicparty: chr "UCD" "UCD" "PSOE" "PSOE" ...
$ year : num 1979 1981 1982 1986 1989 ...
$ legislatura : chr "I" "I" "II" "III" ...
$ result : chr "YES" "YES" "YES" "YES" ...
$ text : Factor w/ 16 levels "buenas tardes ya, señorías: en 1979, 40 años atrás, se celebró el primer debate de investidura en esta Cámara. "| __truncated__,..: 2 11 4 10 7 6 3 14 9 8 ...


Here, a speech is identified by its rank (from 1 to 16), two letters summarizing the candidate's name and, finally, the last two digits of the year. We thus have a small number of documents with a similar structure, which eases the presentation of the methodology. The texts of the speeches have been downloaded from the Diario de Sesiones del Congreso de los Diputados (Journal of Sessions of the Congress of Deputies).
Their writing has been unified (e.g., IAE and Impuesto de Actividades Económicas are reduced to a single form).
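As a minimal sketch (not code from the corpus construction itself, and with accents dropped for simplicity), labels of this form can be built in base R:

```r
# Hypothetical illustration: build labels such as "1.Su79" from the
# speech rank, the first two letters of the candidate's name and the
# last two digits of the year (accents dropped for this toy example).
rank <- c(1, 15)
name <- c("Suarez", "Sanchez")
year <- c(1979, 2019)
acronym <- paste0(rank, ".", substr(name, 1, 2), sprintf("%02d", year %% 100))
acronym
# "1.Su79"  "15.Sa19"
```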

The data frame contains 16 investiture speeches (rows) and 8 variables (columns):

SpanishDisc16$chronology

The rank of the speech, defined as a number.

SpanishDisc16$acronym

A four-character label (two letters of the politician's name and two digits of the year), useful to represent the speech in tables and graphs.

SpanishDisc16$name

Name of the politician, defined as a factor. There are 7 different politicians (Aznar, CalvoSotelo, González, Rajoy, Sánchez, Suárez and Zapatero) and 16 investiture speeches.

options(width = 300)
table(SpanishDisc16$name)

      Aznar CalvoSotelo    González       Rajoy     Sánchez      Suárez    Zapatero 
          2           1           4           3           3           1           2 

SpanishDisc16$politicparty

Name of the political party, defined as a factor. There are 3 different political parties (PP, PSOE, UCD).

table(SpanishDisc16$politicparty)

  PP PSOE  UCD 
   5    9    2 

SpanishDisc16$year

The year, defined as a numerical variable ranging from 1979 to 2019.

table(SpanishDisc16$year)

1979 1981 1982 1986 1989 1993 1996 2000 2004 2008 2011 2016 2019 
   1    1    1    1    1    1    1    1    1    1    1    3    2 

SpanishDisc16$legislatura

The legislative period. From 1979 to 2022 there have been 13 legislative periods (I to XIII) in Spain.

table(SpanishDisc16$legislatura)

   I   II  III   IV   IX    V   VI  VII VIII    X   XI  XII XIII 
   2    1    1    1    1    1    1    1    1    1    1    2    2 

SpanishDisc16$result

"YES" if the candidate became President of the Government, "NO" in the opposite case.

table(SpanishDisc16$result)

SpanishDisc16$text

Text of the speech, stored as a factor with 16 levels.

Pretreatment

In order to preserve the semantic information conveyed by the capital letters introduced in the corpus at the moment of its capture from the “Diario de Sesiones del Congreso de los Diputados”, capital letters at the beginning of sentences have been manually removed from the database. Those which are preserved serve, in general, to differentiate homographs. Thus, “Gobierno” (the Government) is differentiated from “gobierno” (I govern). It will be necessary to specify in the script that capital letters are to be kept.
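A small toy illustration (not part of the package) of why lowercasing would lose this distinction:

```r
# Converting to lowercase would merge the homographs "Gobierno"
# (the Government) and "gobierno" (I govern) into a single form.
words <- c("Gobierno", "gobierno", "Gobierno")
table(words)           # the two distinct forms are kept
table(tolower(words))  # lowercasing collapses them into one form
```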

Loading Xplortext package

library(Xplortext)
Loading required package: FactoMineR
Loading required package: ggplot2
Loading required package: tm
Loading required package: NLP
Attaching package: 'NLP'
The following object is masked from 'package:ggplot2':
annotate

Building TextData object

Before doing any analysis we need to construct an object of TextData class.

Each row of the source base is considered a source document.
The TextData function builds the working documents-by-words table submitted to the analysis. In this case we will not use any information related to contextual variables. A non-aggregate table (direct analysis) is defined by default (var.agg=NULL).

A working document is built from each non-empty source document. DocTerm is a non-aggregate lexical table with as many rows as non-empty source documents and as many columns as selected words.
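As a toy sketch of what such a lexical table looks like (base R only; this is not the actual Xplortext internals, and the two mini-documents are invented):

```r
# Toy documents-by-words table: rows are documents, columns are words,
# cells are word frequencies.
docs <- c(d1 = "el Gobierno de España", d2 = "el Gobierno y la sociedad")
tokens <- strsplit(docs, " ")
DocTerm <- table(
  doc  = rep(names(tokens), lengths(tokens)),
  word = unlist(tokens)
)
DocTerm  # 2 rows (documents) x 7 columns (distinct words)
```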

We can get the arguments of the TextData function by executing:

args(TextData)
function (base, var.text = NULL, var.agg = NULL, context.quali = NULL, 
    context.quanti = NULL, selDoc = "ALL", lower = TRUE, remov.number = TRUE, 
    lminword = 1, Fmin = Dmin, Dmin = 1, Fmax = Inf, stop.word.tm = FALSE, 
    idiom = "en", stop.word.user = NULL, segment = FALSE, sep.weak = "default", 
    sep.strong = "[()¿?./:¡!=;{}]…", seg.nfreq = 10, seg.nfreq2 = 10, 
    seg.nfreq3 = 10, graph = FALSE) 

Computing results before threshold. Initial corpus description

TD <- TextData(SpanishDisc16, var.text=c(8), var.agg=NULL, Fmin=1, Dmin=1, idiom="es", lower=FALSE, remov.number=FALSE, stop.word.tm=FALSE, stop.word.user=NULL, graph=FALSE)
  • In the previous command, we have chosen the eighth variable as the textual variable: text
  • A direct analysis is defined by var.agg=NULL (the default)
  • The selection of words uses the following arguments:
  • Minimum length of a word to be selected (lminword= 1)
  • Minimum frequency of a word to be selected (Fmin= 1)
  • A word has to be used in at least 1 source document to be selected (Dmin=1)
  • Maximum frequency of a word to be selected (Fmax= Inf by default)
  • idiom = “es”: Declared idiom for the textual column(s) is: es.
    (See IETF language in package NLP)
  • lower=FALSE: The corpus is not converted into lowercase
  • remov.number=FALSE: Numbers are not removed from the corpus
  • stop.word.tm=FALSE: The stoplist of the tm package for the declared idiom is not used
  • stop.word.user: Stop word list is not provided by the user
  • segment = FALSE: Repeated segments are not selected (segment= FALSE by default)

To obtain a summary of the TextData object for 16 documents and 10 words we use the summary function (summary.TextData). Run help(summary.TextData) for more detail.

options(width = 300)
summary(TD, ndoc=16, nword=10, info=FALSE)
TextData summary

               Before     After
Documents       16.00     16.00
Occurrences 162796.00 162796.00
Words        12178.00  12178.00
Mean-length  10174.75  10174.75

Statistics for the documents
                                                                                                                
   DocName Occurrences DistinctWords PctLength Mean Length100 Occurrences DistinctWords PctLength Mean Length100
                before        before    before         before       after         after     after          after
1  15.Sa19       16240          3287      9.98         159.61       16240          3287      9.98         159.61
2  16.Sa20       15358          3219      9.43         150.94       15358          3219      9.43         150.94
3  12.Sa16       13501          2782      8.29         132.69       13501          2782      8.29         132.69
4   1.Su79       12137          2824      7.46         119.29       12137          2824      7.46         119.29
5   4.Gz86       11330          2069      6.96         111.35       11330          2069      6.96         111.35
6   7.Az96       10240          2351      6.29         100.64       10240          2351      6.29         100.64
7  13.Rj16       10019          2260      6.15          98.47       10019          2260      6.15          98.47
8  11.Rj11        9745          2379      5.99          95.78        9745          2379      5.99          95.78
9   3.Gz82        9415          2526      5.78          92.53        9415          2526      5.78          92.53
10 10.Zp08        8817          2278      5.42          86.66        8817          2278      5.42          86.66
11  2.CS81        8267          2174      5.08          81.25        8267          2174      5.08          81.25
12  8.Az00        8259          1987      5.07          81.17        8259          1987      5.07          81.17
13  6.Gz93        8134          2049      5.00          79.94        8134          2049      5.00          79.94
14  9.Zp04        7882          2021      4.84          77.47        7882          2021      4.84          77.47
15  5.Gz89        7591          1811      4.66          74.61        7591          1811      4.66          74.61
16 14.Rj16        5861          1565      3.60          57.60        5861          1565      3.60          57.60

Index of the  10  most frequent words
   Word Frequency N.Documents
1   de      11791          16
2   la       7939          16
3   y        5609          16
4   que      5551          16
5   en       4865          16
6   el       4482          16
7   a        3513          16
8   los      3183          16
9   las      2275          16
10  del      1855          16

To obtain a barplot of the 25 documents with the greatest length (in this case only 16) we use the plot function (plot.TextData). Use help(plot.TextData) for more detail. We add frequencies and a vertical line (vline) representing the average speech size.

pl2 <- plot(TD, sel="doc", title="Documents with higher length", xtitle="Document length",
            col.fill="slategray1", theme=theme_classic(), freq=-800, vline=TRUE)
pl2


Stopwords

The stopwords are suppressed and only the words with a frequency over 9 that appear in at least two speeches are retained. The size of the corpus is 162,796 occurrences before applying the selection and 55,850 occurrences after. A total of 1,558 distinct words are retained.
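The Fmin/Dmin selection rule can be sketched on a toy word-by-document matrix (hypothetical counts, not the actual corpus): a word is kept if its total frequency reaches Fmin and it occurs in at least Dmin documents.

```r
# Toy word-by-document frequency matrix (invented counts).
m <- matrix(c(12, 0, 0,    # w1: frequent but in a single document
               5, 6, 4,    # w2: spread across all documents
               9, 1, 0),   # w3: total 10, in two documents
            nrow = 3, byrow = TRUE,
            dimnames = list(c("w1", "w2", "w3"), c("d1", "d2", "d3")))
Fmin <- 10; Dmin <- 2
keep <- rowSums(m) >= Fmin & rowSums(m > 0) >= Dmin
rownames(m)[keep]
# "w2" "w3"   (w1 fails the Dmin=2 condition)
```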

To define the list of stopwords that must be removed from the documents, we build an object named "swu":

swu <- c("consiguiente", "ello", "hacia", "punto", "si", "Sus", "vista", "A", "B", "C", "D", "E", "F", "a", "b", "c", "d")

The tm package has a list of words that can be used as stopwords depending on the language (by default English, "en"). These words can be retrieved using the stopwords function: tm::stopwords(kind=idiom)
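For instance, the Spanish list can be inspected as follows (the exact contents depend on the installed tm version):

```r
library(tm)
# Retrieve the Spanish stopword list shipped with the tm package.
sw <- tm::stopwords(kind = "es")
head(sw)     # first few Spanish stopwords
length(sw)   # size of the list
```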

TD <- TextData(SpanishDisc16, var.text=c(8), Fmin=10, Dmin=2, idiom="es", lower=FALSE,
               remov.number=TRUE, stop.word.tm=TRUE, stop.word.user=swu, graph=FALSE)
summary(TD, ndoc=16, nword=10, info=FALSE)

TextData summary

               Before    After
Documents       16.00    16.00
Occurrences 162796.00 55850.00
Words        12178.00  1558.00
Mean-length  10174.75  3490.62

Statistics for the documents
                                                                                                                
   DocName Occurrences DistinctWords PctLength Mean Length100 Occurrences DistinctWords PctLength Mean Length100
                before        before    before         before       after         after     after          after
1  15.Sa19       16240          3287      9.98         159.61        5402          1174      9.67         154.76
2  16.Sa20       15358          3219      9.43         150.94        5116          1178      9.16         146.56
3  12.Sa16       13501          2782      8.29         132.69        4506          1046      8.07         129.09
4   1.Su79       12137          2824      7.46         119.29        4166          1073      7.46         119.35
5   4.Gz86       11330          2069      6.96         111.35        3901           936      6.98         111.76
6   7.Az96       10240          2351      6.29         100.64        3817          1070      6.83         109.35
7  13.Rj16       10019          2260      6.15          98.47        3494          1017      6.26         100.10
8  11.Rj11        9745          2379      5.99          95.78        3320           996      5.94          95.11
9   8.Az00        8259          1987      5.07          81.17        3077           959      5.51          88.15
10 10.Zp08        8817          2278      5.42          86.66        3019           975      5.41          86.49
11  3.Gz82        9415          2526      5.78          92.53        2985           994      5.34          85.51
12  6.Gz93        8134          2049      5.00          79.94        2925           962      5.24          83.80
13  2.CS81        8267          2174      5.08          81.25        2809           925      5.03          80.47
14  9.Zp04        7882          2021      4.84          77.47        2690           875      4.82          77.06
15  5.Gz89        7591          1811      4.66          74.61        2652           880      4.75          75.97
16 14.Rj16        5861          1565      3.60          57.60        1971           724      3.53          56.47

Index of the  10  most frequent words
        Word Frequency N.Documents
1  Gobierno        712          16
2  señorías        684          16
3  España          631          16
4  política        517          16
5  Estado          321          16
6  sociedad        303          16
7  país            302          16
8  social          302          16
9  españoles       290          15
10 años            283          16