4.1. The corpus

To show the relevance of the local stopping rules in chronological clustering, we use the 16 investiture speeches given by 7 candidates for Prime Minister (Suárez, Calvo-Sotelo, Felipe González, Aznar, Zapatero, Rajoy and Sánchez) between 1979 and 2019. In their speeches, the candidates present their government programme; the deputies then grant or deny their confidence.

First, we remove all objects in memory.
Second, we download the file containing the 16 speeches in RData format and inspect its structure.

Given the SpanishDisc16 data frame object, the best way to understand the data structure is to use str().

rm(list=ls())                  # Remove all objects in memory
# load("SpanishDisc16.RData")  # Alternative: load from a local file
con <- url('https://xplortext.unileon.es/wp-content/uploads/2022/06/SpanishDisc16..rdata') # Create connection
load(con)   # Load the data
close(con)  # Close the connection
str(SpanishDisc16)
'data.frame': 16 obs. of 8 variables:
$ chronology : num 1 2 3 4 5 6 7 8 9 10 ...
$ acronym : chr "Su79" "CS81" "Gz82" "Gz86" ...
$ name : chr "Suárez" "CalvoSotelo" "González" "González" ...
$ politicparty: chr "UCD" "UCD" "PSOE" "PSOE" ...
$ year : num 1979 1981 1982 1986 1989 ...
$ legislatura : chr "I" "I" "II" "III" ...
$ result : chr "YES" "YES" "YES" "YES" ...
$ text : Factor w/ 16 levels "buenas tardes ya, señorías: en 1979, 40 años atrás, se celebró el primer debate de investidura en esta Cámara. "| __truncated__,..: 2 11 4 10 7 6 3 14 9 8 ...


Here, a speech is identified by its rank (from 1 to 16), two letters summarizing the name of the candidate and, finally, the last two digits of the year. We thus have a small number of documents with a similar structure, which eases the presentation of the methodology. The texts of the speeches have been downloaded from the Diario de Sesiones del Congreso de los Diputados (Journal of Sessions of the Congress of Deputies).
Their writing has been unified (e.g., IAE and Impuesto de Actividades Económicas).
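
As an illustration only, the shape of these labels can be sketched in R. Note that the two letters are not always the first two characters of the name (the corpus uses "CS" for CalvoSotelo and "Gz" for González), so this naive reconstruction is only approximate:

# Hedged sketch: combine two letters of the name with the last two digits
# of the year; e.g. this yields "Ca81" where the corpus uses "CS81"
paste0(substr(SpanishDisc16$name, 1, 2), substr(SpanishDisc16$year, 3, 4))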

We present the 16 investiture speeches (rows) and 8 variables (columns):

SpanishDisc16$chronology

The rank of the speech (from 1 to 16), defined as a number.

SpanishDisc16$acronym

Label with four characters (two letters of the politician's name and the last two digits of the year), useful to represent the speech in tables and graphs.

SpanishDisc16$name

Name of the politician, defined as a character variable. There are 7 different politicians (Aznar, CalvoSotelo, González, Rajoy, Suárez, Zapatero and Sánchez) and 16 investiture speeches.

options(width = 300)
table(SpanishDisc16$name)

SpanishDisc16$politicparty

Name of the political party, defined as a character variable. There are 3 different political parties (PP, PSOE, UCD).

table(SpanishDisc16$politicparty)

SpanishDisc16$year

The year, defined as a numerical variable, ranges from 1979 to 2019.

table(SpanishDisc16$year)

SpanishDisc16$legislatura

The legislative period, defined as a character variable. The corpus spans 13 legislative periods (I to XIII) in Spain.

table(SpanishDisc16$legislatura)

SpanishDisc16$result

"YES" if the candidate became President of the Government, "NO" in the opposite case.

table(SpanishDisc16$result)

SpanishDisc16$text

The text of the speech, stored as a factor with 16 levels (one per speech).

Pretreatment

The capital letters present in the corpus, as captured from the Diario de Sesiones del Congreso de los Diputados, carry semantic information. To preserve it, capital letters at the beginning of sentences have been manually removed from the database; those which remain serve, in general, to differentiate homographs. Thus, "Gobierno" (the government) is distinguished from "gobierno" (I govern). It will therefore be necessary to specify in the script that capital letters are to be kept.
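
As a quick illustrative check (not part of the workflow itself), we can count the occurrences of both forms in the first speech, assuming the data frame loaded above:

txt <- as.character(SpanishDisc16$text[1])
# Count whole-word occurrences of each homograph in the first speech
c(Gobierno = lengths(regmatches(txt, gregexpr("\\bGobierno\\b", txt))),
  gobierno = lengths(regmatches(txt, gregexpr("\\bgobierno\\b", txt))))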

Loading Xplortext package

library(Xplortext)

Building TextData object

Before doing any analysis, we need to construct an object of the TextData class.

Each row of the source base is considered a source document. The TextData function builds the working documents-by-words table that is submitted to the analysis. In this case we will not use any information related to contextual variables. A non-aggregate table (direct analysis) is defined by default (var.agg=NULL).

Working documents are built from the non-empty source documents. DocTerm is a non-aggregate lexical table with as many rows as non-empty source documents and as many columns as selected words.

We can get the arguments of the TextData function by executing:

args(TextData)

Computing results before threshold. Initial corpus description

TD <- TextData(SpanishDisc16, var.text=c(8), var.agg=NULL, Fmin=1, Dmin=1, idiom="es", lower=FALSE, remov.number=FALSE, stop.word.tm=FALSE, stop.word.user=NULL, graph=FALSE)
  • In the previous command, we have chosen the eighth variable (text) as the textual variable (var.text=c(8))
  • A direct analysis is performed, as defined by var.agg=NULL (the default)
  • The selection of words uses the following arguments:
    • Minimum length of a word to be selected (lminword=1)
    • Minimum frequency of a word to be selected (Fmin=1)
    • A word has to be used in at least 1 source document to be selected (Dmin=1)
    • Maximum frequency of a word to be selected (Fmax=Inf by default)
  • idiom="es": the declared idiom for the textual column(s) is Spanish (es)
    (see IETF language codes in the NLP package)
  • lower=FALSE: the corpus is not converted into lowercase
  • remov.number=FALSE: numbers are not removed from the corpus
  • stop.word.tm=FALSE: no stoplist is provided by the tm package in accordance with the idiom
  • stop.word.user=NULL: no stopword list is provided by the user
  • segment=FALSE: repeated segments are not selected (the default)
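
Before summarizing, it can be helpful to inspect the object just built. A minimal sketch, assuming (as described above) that the lexical table is stored in the DocTerm component of the TextData object:

str(TD, max.level=1)  # top-level components of the TextData object
dim(TD$DocTerm)       # number of documents and number of selected words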

To obtain a summary of the TextData object for 16 documents and 10 words we use the summary function (summary.TextData). Run help(summary.TextData) for more detail.

options(width = 300)
summary(TD, ndoc=16, nword=10, info=FALSE)

To obtain a barplot of the 25 documents with the highest length (in this case there are only 16) we use the plot function (plot.TextData). Use help(plot.TextData) for more detail. We add the frequencies and a vertical line (vline) representing the average speech size.

plot(TD,sel="doc",title="Documents with higher length", xtitle="Document length", col.fill="slategray1", theme=theme_classic(), freq=-800, vline= TRUE)
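
A similar barplot can be drawn for the most frequent words; the following sketch assumes that plot.TextData accepts sel="word", mirroring the sel="doc" call above:

plot(TD, sel="word", title="Most frequent words", xtitle="Word frequency",
     col.fill="slategray1", theme=theme_classic())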


Stopwords

The stopwords are suppressed and only the words with a frequency over 9, appearing in at least two speeches, are retained. The size of the corpus is 162,796 occurrences before applying the selection and 55,850 occurrences after. A total of 1,558 distinct words are retained.

To define the list of stopwords that must be removed from the documents, we build an object named "swu":

The tm package has a list of words that can be used as stopwords depending on the language (by default English, en). These words can be retrieved using the stopwords command: tm::stopwords(kind=idiom)
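
For instance, the first entries of the Spanish stoplist can be inspected as follows:

head(tm::stopwords(kind="es"), 10)  # first 10 Spanish stopwords from tm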

swu <- c("consiguiente", "ello", "hacia", "punto", "si", "Sus", "vista",
         "A", "B", "C", "D", "E", "F", "a", "b", "c", "d")
TD <- TextData(SpanishDisc16, var.text=c(8), Fmin=10, Dmin=2, idiom="es", lower=FALSE,
               remov.number=TRUE, stop.word.tm=TRUE, stop.word.user=swu, graph=FALSE)
summary(TD, ndoc=16, nword=10, info=FALSE)