2. TextData object
Previous syntax
First of all, we must load the Xplortext package into memory.
library(Xplortext)
## Loading required package: FactoMineR
## Loading required package: ggplot2
Building textual and contextual tables. Direct analysis
Before doing any analysis, we need to construct an object of class TextData.
Each row of the source base is considered a source document. The TextData function builds the working documents-by-words table that is submitted to the analysis. In this case we will not use any information related to contextual variables. A non-aggregate table (direct analysis) is defined by default (var.agg=NULL).
Only the non-empty source documents are kept as working documents. DocTerm is a non-aggregate lexical table with as many rows as non-empty source documents and as many columns as selected words.
We can get the arguments of the TextData function by executing args(TextData):
function (base, var.text = NULL, var.agg = NULL, context.quali = NULL, context.quanti = NULL, selDoc = "ALL", lower = TRUE, remov.number = TRUE, lminword = 1, Fmin = Dmin, Dmin = 1, Fmax = Inf, stop.word.tm = FALSE, idiom = "en", stop.word.user = NULL, segment = FALSE, sep.weak = "(['?]|[[:punct:]]|[[:space:]]|[[:cntrl:]])+", sep.strong = "[()¿?./:¡!=+;{}-]", seg.nfreq = 10, seg.nfreq2 = 10, seg.nfreq3 = 10, graph = FALSE)
Initial description of the corpus
TD <- TextData(SpanishDisc, var.text=c(1), Fmin=1, Dmin=1, idiom="es", lower=FALSE, remov.number=FALSE, stop.word.tm=FALSE, stop.word.user=NULL, graph=FALSE)
- In the previous command, we have chosen column 1 (text) as the textual variable
- A direct analysis is defined by var.agg=NULL
- There are no qualitative contextual variables (context.quali=NULL)
- There are no quantitative contextual variables (context.quanti=NULL)
- There are 0 empty documents, saved in res.TD$remov.docs
- The corpus is not converted into lowercase (lower=FALSE)
- Numbers are not removed from the corpus (remov.number=FALSE)
- The selection of words uses the following arguments:
- Minimum length of a word to be selected (lminword=1)
- Minimum frequency of a word to be selected (Fmin=1)
- A word has to be used in at least 1 source document to be selected (Dmin=1)
- Maximum frequency of a word to be selected (Fmax=Inf)
- No stoplist from the tm package is used for this idiom (stop.word.tm=FALSE)
- No stopword list is provided by the user (stop.word.user=NULL)
- The declared idiom for the textual column(s) is es (see IETF language codes in package NLP)
- Repeated segments are not selected (segment=FALSE)
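The word-selection arguments above can be combined to prune the vocabulary. A minimal sketch, assuming the SpanishDisc data set is already loaded; the thresholds (lminword=2, Fmin=10, Dmin=3) are illustrative choices, not values used in this tutorial:

```r
library(Xplortext)
# Hypothetical stricter selection: keep only words of at least 2 characters,
# used at least 10 times in total and present in at least 3 documents
TD.strict <- TextData(SpanishDisc, var.text = c(1), idiom = "es",
                      lminword = 2, Fmin = 10, Dmin = 3,
                      graph = FALSE)
# Inspect how much the vocabulary shrank
summary(TD.strict, nword = 5)
```

Raising Fmin and Dmin discards rare words, which typically shortens the columns of DocTerm substantially while barely changing the document statistics.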
To obtain a summary of the TextData object for 11 documents and 10 words, we use the summary function (summary.TextData). Use help(summary.TextData) for more details.
summary(TD, ndoc=11, nword=10)
TextData summary
Before After
Documents 11.00 11.00
Occurrences 101967.00 101967.00
Words 9416.00 9416.00
Mean-length 9269.73 9269.73
Statistics for the documents
DocName Occurrences DistinctWords PctLength Mean Length100 Occurrences
before before before before after
1 Su79 12149 2825 11.91 131.06 12149
2 CS81 8274 2172 8.11 89.26 8274
3 Gz82 9427 2529 9.25 101.70 9427
4 Gz86 11344 2076 11.13 122.38 11344
5 Gz89 7592 1814 7.45 81.90 7592
6 Gz93 8141 2048 7.98 87.82 8141
7 Az96 10251 2352 10.05 110.59 10251
8 Az00 8287 1993 8.13 89.40 8287
9 Zp04 7882 2019 7.73 85.03 7882
10 Zp08 8833 2295 8.66 95.29 8833
11 Rj11 9787 2404 9.60 105.58 9787
DistinctWords PctLength Mean Length100
after after after
1 2825 11.91 131.06
2 2172 8.11 89.26
3 2529 9.25 101.70
4 2076 11.13 122.38
5 1814 7.45 81.90
6 2048 7.98 87.82
7 2352 10.05 110.59
8 1993 8.13 89.40
9 2019 7.73 85.03
10 2295 8.66 95.29
11 2404 9.60 105.58
Index of the 10 most frequent words
Word Frequency N.Documents
1 de 7837 11
2 la 5154 11
3 y 3438 11
4 que 3312 11
5 en 3165 11
6 el 2861 11
7 los 2106 11
8 a 2094 11
9 las 1559 11
10 del 1261 11
To obtain a barplot of the 25 documents with the greatest length (in this case only 11), we use the plot function (plot.TextData). Use help(plot.TextData) for more details.
plot(TD,sel="doc",title="Documents with higher length", xtitle="Document length")
Adding frequencies and a vertical line (vline) representing the average speech size:
plot(TD,sel="doc",title="Documents with higher length", xtitle="Document length", col.fill="slategray1", theme=theme_classic(), freq=-800, vline= TRUE)
Stopwords
To define the list of stopwords that must be removed from the documents, we build an object named “swu”:
swu <- c("consiguiente", "ello", "hacia", "punto", "Señorías", "si", "Sus", "vista", "A", "B", "C", "D", "E", "F", "a", "b", "c", "d")
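Before applying the user list, it can be useful to see how it relates to the stoplist shipped with tm. A quick sketch (assuming the tm package is installed and that it accepts the "es" language code, as used below):

```r
# User stopwords already covered by tm's Spanish stoplist
intersect(swu, tm::stopwords(kind = "es"))
# User stopwords NOT in tm's list -- these are the ones
# that justify providing stop.word.user in addition to stop.word.tm
setdiff(swu, tm::stopwords(kind = "es"))
```

Note that the comparison is case-sensitive, so capitalized entries such as "Señorías" will never match the lowercase tm list.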
The tm package has a list of words that can be used as stopwords depending on the language (by default English, "en"). These words can be retrieved using the stopwords command: tm::stopwords(kind="es")
[1] "de" "la" "que" "el"
[5] "en" "y" "a" "los"
[9] "del" "se" "las" "por"
[13] "un" "para" "con" "no"
[17] "una" "su" "al" "lo"
[21] "como" "más" "pero" "sus"
[25] "le" "ya" "o" "este"
[29] "sí" "porque" "esta" "entre"
[33] "cuando" "muy" "sin" "sobre"
[37] "también" "me" "hasta" "hay"
[41] "donde" "quien" "desde" "todo"
[45] "nos" "durante" "todos" "uno"
[49] "les" "ni" "contra" "otros"
[53] "ese" "eso" "ante" "ellos"
[57] "e" "esto" "mí" "antes"
[61] "algunos" "qué" "unos" "yo"
[65] "otro" "otras" "otra" "él"
[69] "tanto" "esa" "estos" "mucho"
[73] "quienes" "nada" "muchos" "cual"
[77] "poco" "ella" "estar" "estas"
[81] "algunas" "algo" "nosotros" "mi"
[85] "mis" "tú" "te" "ti"
[89] "tu" "tus" "ellas" "nosotras"
[93] "vosotros" "vosotras" "os" "mío"
[97] "mía" "míos" "mías" "tuyo"
[101] "tuya" "tuyos" "tuyas" "suyo"
[105] "suya" "suyos" "suyas" "nuestro"
[109] "nuestra" "nuestros" "nuestras" "vuestro"
[113] "vuestra" "vuestros" "vuestras" "esos"
[117] "esas" "estoy" "estás" "está"
[121] "estamos" "estáis" "están" "esté"
[125] "estés" "estemos" "estéis" "estén"
[129] "estaré" "estarás" "estará" "estaremos"
[133] "estaréis" "estarán" "estaría" "estarías"
[137] "estaríamos" "estaríais" "estarían" "estaba"
[141] "estabas" "estábamos" "estabais" "estaban"
[145] "estuve" "estuviste" "estuvo" "estuvimos"
[149] "estuvisteis" "estuvieron" "estuviera" "estuvieras"
[153] "estuviéramos" "estuvierais" "estuvieran" "estuviese"
[157] "estuvieses" "estuviésemos" "estuvieseis" "estuviesen"
[161] "estando" "estado" "estada" "estados"
[165] "estadas" "estad" "he" "has"
[169] "ha" "hemos" "habéis" "han"
[173] "haya" "hayas" "hayamos" "hayáis"
[177] "hayan" "habré" "habrás" "habrá"
[181] "habremos" "habréis" "habrán" "habría"
[185] "habrías" "habríamos" "habríais" "habrían"
[189] "había" "habías" "habíamos" "habíais"
[193] "habían" "hube" "hubiste" "hubo"
[197] "hubimos" "hubisteis" "hubieron" "hubiera"
[201] "hubieras" "hubiéramos" "hubierais" "hubieran"
[205] "hubiese" "hubieses" "hubiésemos" "hubieseis"
[209] "hubiesen" "habiendo" "habido" "habida"
[213] "habidos" "habidas" "soy" "eres"
[217] "es" "somos" "sois" "son"
[221] "sea" "seas" "seamos" "seáis"
[225] "sean" "seré" "serás" "será"
[229] "seremos" "seréis" "serán" "sería"
[233] "serías" "seríamos" "seríais" "serían"
[237] "era" "eras" "éramos" "erais"
[241] "eran" "fui" "fuiste" "fue"
[245] "fuimos" "fuisteis" "fueron" "fuera"
[249] "fueras" "fuéramos" "fuerais" "fueran"
[253] "fuese" "fueses" "fuésemos" "fueseis"
[257] "fuesen" "siendo" "sido" "tengo"
[261] "tienes" "tiene" "tenemos" "tenéis"
[265] "tienen" "tenga" "tengas" "tengamos"
[269] "tengáis" "tengan" "tendré" "tendrás"
[273] "tendrá" "tendremos" "tendréis" "tendrán"
[277] "tendría" "tendrías" "tendríamos" "tendríais"
[281] "tendrían" "tenía" "tenías" "teníamos"
[285] "teníais" "tenían" "tuve" "tuviste"
[289] "tuvo" "tuvimos" "tuvisteis" "tuvieron"
[293] "tuviera" "tuvieras" "tuviéramos" "tuvierais"
[297] "tuvieran" "tuviese" "tuvieses" "tuviésemos"
[301] "tuvieseis" "tuviesen" "teniendo" "tenido"
[305] "tenida" "tenidos" "tenidas" "tened"
The new TextData object, built without the tm stopwords and without the user stopwords, is:
TD <- TextData(SpanishDisc, var.text=c(1), Fmin=1, Dmin=1, idiom="es", lower=FALSE, remov.number=TRUE, stop.word.tm=TRUE, stop.word.user=swu, graph=FALSE)
summary(TD, ndoc=11, nword=10)
TextData summary
Before After
Documents 11.00 11.00
Occurrences 101967.00 49603.00
Words 9416.00 9088.00
Mean-length 9269.73 4509.36
Statistics for the documents
DocName Occurrences DistinctWords PctLength Mean Length100 Occurrences
before before before before after
1 Su79 12149 2825 11.91 131.06 6105
2 CS81 8274 2172 8.11 89.26 4101
3 Gz82 9427 2529 9.25 101.70 4577
4 Gz86 11344 2076 11.13 122.38 5174
5 Gz89 7592 1814 7.45 81.90 3593
6 Gz93 8141 2048 7.98 87.82 3967
7 Az96 10251 2352 10.05 110.59 5146
8 Az00 8287 1993 8.13 89.40 4114
9 Zp04 7882 2019 7.73 85.03 3829
10 Zp08 8833 2295 8.66 95.29 4281
11 Rj11 9787 2404 9.60 105.58 4716
DistinctWords PctLength Mean Length100
after after after
1 2660 12.31 135.38
2 2020 8.27 90.94
3 2373 9.23 101.50
4 1927 10.43 114.74
5 1660 7.24 79.68
6 1899 8.00 87.97
7 2210 10.37 114.12
8 1849 8.29 91.23
9 1872 7.72 84.91
10 2125 8.63 94.94
11 2213 9.51 104.58
Index of the 10 most frequent words
Word Frequency N.Documents
1 política 411 11
2 Gobierno 403 11
3 España 320 11
4 Estado 235 11
5 sociedad 218 11
6 social 185 11
7 años 183 11
8 empleo 162 11
9 ciudadanos 158 11
10 sistema 158 11
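The filtered lexical table itself is stored in the DocTerm component of the TextData object (the documents-by-words table described at the start of this section). A short sketch of how it might be inspected, assuming DocTerm supports the usual matrix accessors:

```r
# Rows = non-empty working documents, columns = selected words
dim(TD$DocTerm)
# Names of the 11 working documents
rownames(TD$DocTerm)
```

The column count should match the "Words ... After" figure reported by summary(TD).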
To obtain a barplot with the 25 non-stopwords with the highest frequency:
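By analogy with the document barplot above, a sketch of the call (assuming sel="word" is the word-selection counterpart of sel="doc" in plot.TextData; check help(plot.TextData) for the exact argument values):

```r
plot(TD, sel = "word", title = "Most frequent words",
     xtitle = "Word frequency", col.fill = "slategray1")
```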