Spanish Discourses

2. TextData object


Previous syntax

First of all, we must load the Xplortext package in memory.

library(Xplortext)

Building textual and contextual tables. Direct analysis

Before doing any analysis, we need to construct an object of class TextData.

Each row of the source base is considered a source document. The TextData function builds the working documents-by-words table that is submitted to the analysis. In this case we will not use any information from contextual variables. A non-aggregate table (direct analysis) is defined by default (var.agg=NULL).

Working documents are built from the non-empty source documents. DocTerm is a non-aggregate lexical table with as many rows as non-empty source documents and as many columns as selected words.

We can get the arguments of the TextData function by executing: args(TextData)

function (base, var.text = NULL, var.agg = NULL, context.quali = NULL, context.quanti = NULL, selDoc = "ALL", lower = TRUE, remov.number = TRUE, lminword = 1, Fmin = Dmin, Dmin = 1, Fmax = Inf, stop.word.tm = FALSE, idiom = "en", stop.word.user = NULL, segment = FALSE, sep.weak = "(['?]|[[:punct:]]|[[:space:]]|[[:cntrl:]])+", sep.strong = "[()¿?./:¡!=+;{}-]", seg.nfreq = 10, seg.nfreq2 = 10, seg.nfreq3 = 10, graph = FALSE)
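As an aside, the default sep.weak regular expression can be explored with base R alone. The example below is illustrative only (the sentence is invented, not part of the corpus); it shows how any run of punctuation, spaces or control characters acts as a single word separator:

```r
# Default weak separator used by TextData (see args above)
sep.weak <- "(['?]|[[:punct:]]|[[:space:]]|[[:cntrl:]])+"

# A run of punctuation/space/control characters splits the text into words
tokens <- unlist(strsplit("El Gobierno, hoy; presenta (de nuevo) su programa.",
                          sep.weak))
tokens
# "El" "Gobierno" "hoy" "presenta" "de" "nuevo" "su" "programa"
```

Note that a trailing separator (the final period) does not produce an empty token, since strsplit drops trailing empty strings.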

Initial description of the corpus

TD <- TextData(SpanishDisc, var.text=c(1), Fmin=1, Dmin=1, idiom="es", lower=FALSE, remov.number=FALSE, stop.word.tm=FALSE, stop.word.user=NULL, graph=FALSE)
  • In the previous command, we have chosen as textual variable: text (column 1, var.text=c(1))
  • There is a direct analysis defined by var.agg=NULL
  • There are 0 qualitative contextual variable(s) defined by context.quali=NULL
  • There are 0 quantitative contextual variable(s) defined by context.quanti=NULL
  • There are 0 empty documents saved in res.TD$remov.docs
  • The corpus is not converted into lowercase: lower=FALSE
  • Numbers are not removed from the corpus: remov.number=FALSE
  • The selection of words uses the following arguments:
    • Minimum length of a word to be selected (lminword= 1)
    • Minimum frequency of a word to be selected (Fmin= 1)
    • A word has to be used in at least 1 source document(s) to be selected (Dmin= 1)
    • Maximum frequency of a word to be selected (Fmax= Inf)
    • The stoplist provided by the tm package in accordance with the idiom is not used (stop.word.tm=FALSE)
    • Stop word list is not provided by the user
    • Declared idiom for the textual column(s) is: es (see IETF language codes in the NLP package)
  • Repeated segments are not selected (segment= FALSE)
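The resulting object can also be inspected directly. A minimal sketch, assuming (as described above) that the lexical table is stored in the DocTerm component of the TextData object; it requires the Xplortext package and the SpanishDisc data:

```r
library(Xplortext)

# Same call as above: direct analysis, no word filtering
TD <- TextData(SpanishDisc, var.text=c(1), Fmin=1, Dmin=1, idiom="es",
               lower=FALSE, remov.number=FALSE, graph=FALSE)

# DocTerm: rows = non-empty source documents, columns = selected words
dim(TD$DocTerm)
```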

To obtain a summary of the TextData object for 11 documents and 10 words, we use the summary function (summary.TextData). Use help(summary.TextData) for more details.

summary(TD, ndoc=11, nword=10)

To obtain a barplot of the 25 documents with the greatest length (in this case only 11), we use the plot function (plot.TextData). Use help(plot.TextData) for more details.

plot(TD,sel="doc",title="Documents with higher length", xtitle="Document length")

Adding frequencies and a vertical line (vline) representing the average speech length:

plot(TD,sel="doc",title="Documents with higher length", xtitle="Document length", col.fill="slategray1", theme=theme_classic(), freq=-800, vline= TRUE)

Stopwords

To define the list of stopwords that must be removed from the documents, we build an object named “swu”:

swu <- c("consiguiente", "ello", "hacia", "punto", "Señorías", "si", "Sus", "vista", "A", "B", "C", "D", "E", "F", "a", "b", "c", "d")

 

The tm package provides a list of words that can be used as stopwords depending on the language (by default English, en). These words can be retrieved using the stopwords command: tm::stopwords(kind=idiom)
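As a sketch (it requires the tm package, and the exact list returned depends on the installed tm version), the Spanish stoplist can be retrieved and combined with the user-defined list, which is effectively what the TextData call below does:

```r
library(tm)

# Spanish stopwords shipped with tm
tm.es <- tm::stopwords(kind = "es")
head(tm.es)

# The full set of words removed is the union of both lists
swu <- c("consiguiente", "ello", "hacia", "punto")  # excerpt of the user list
all.stop <- unique(c(tm.es, swu))
```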

The new TextData object without tm stopwords and without user stopwords is:

TD <- TextData(SpanishDisc, var.text=c(1), Fmin=1, Dmin=1, idiom="es", lower=FALSE, remov.number=TRUE, stop.word.tm=TRUE, stop.word.user=swu, graph=FALSE)
summary(TD, ndoc=11, nword=10)

 

To obtain a barplot of the 25 non-stopwords with the highest frequency:

plot(TD,sel="word",title="Words frequency without stopwords", xtitle="Word frequency")