2. TextData object
Previous syntax
First of all, we must load the Xplortext package into memory.
library(Xplortext)
## Loading required package: FactoMineR
## Loading required package: ggplot2
Building textual and contextual tables. Direct analysis
Before doing any analysis we need to construct an object of class TextData.
Each row of the source base is considered a source document. The TextData function builds the working documents-by-words table that is submitted to the analysis. In this case we will not use any information related to contextual variables. A non-aggregate table (direct analysis) is defined by default (var.agg=NULL).
The working documents are built from the non-empty source documents. DocTerm is a non-aggregate lexical table with as many rows as non-empty source documents and as many columns as selected words.
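As a toy illustration of the kind of table stored in DocTerm, a small documents-by-words cross-tabulation can be built in base R (the two miniature documents below are invented purely for this example):

```r
# Two toy "source documents" (invented for illustration only)
docs <- c(d1 = "la política social", d2 = "la sociedad y la política")

# Split each document into words and cross-tabulate documents by words
words <- strsplit(docs, " ")
tab <- table(rep(names(docs), lengths(words)), unlist(words))
tab  # rows = documents, columns = words, cells = word frequencies
```

Each cell gives the frequency of a word in a document; TextData builds the analogous table for the real corpus, after applying the word-selection arguments.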
We can get the arguments of the TextData function by executing: args(TextData)
function (base, var.text = NULL, var.agg = NULL, context.quali = NULL, context.quanti = NULL, selDoc = "ALL", lower = TRUE, remov.number = TRUE, lminword = 1, Fmin = Dmin, Dmin = 1, Fmax = Inf, stop.word.tm = FALSE, idiom = "en", stop.word.user = NULL, segment = FALSE, sep.weak = "(['?]|[[:punct:]]|[[:space:]]|[[:cntrl:]])+", sep.strong = "[()¿?./:¡!=+;{}-]", seg.nfreq = 10, seg.nfreq2 = 10, seg.nfreq3 = 10, graph = FALSE)
Initial description of the corpus
TD <- TextData(SpanishDisc, var.text=c(1), Fmin=1, Dmin=1, idiom="es", lower=FALSE, remov.number=FALSE, stop.word.tm=FALSE, stop.word.user=NULL, graph=FALSE)
- In the previous command, we have chosen one textual variable: text
- A direct analysis is defined by var.agg=NULL
- There are no qualitative contextual variables (context.quali=NULL)
- There are no quantitative contextual variables (context.quanti=NULL)
- There are 0 empty documents, saved in res.TD$remov.docs
- The corpus is not converted into lowercase (lower=FALSE)
- Numbers are not removed from the corpus (remov.number=FALSE)
- The selection of words uses the following arguments:
- Minimum length of a word to be selected (lminword=1)
- Minimum frequency of a word to be selected (Fmin=1)
- A word has to be used in at least 1 source document to be selected (Dmin=1)
- Maximum frequency of a word to be selected (Fmax=Inf)
- The stoplist of the tm package for the declared idiom is not used (stop.word.tm=FALSE)
- No stopword list is provided by the user (stop.word.user=NULL)
- The declared idiom for the textual column(s) is "es" (see IETF language codes in package NLP)
- Repeated segments are not selected (segment=FALSE)
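The components of the returned object can be inspected directly. A minimal sketch, assuming the list components documented in help(TextData) (DocTerm is the lexical table described above, and remov.docs was mentioned in the summary):

```r
# List the components stored in the TextData object
names(TD)

# DocTerm is the documents-by-words lexical table submitted to the analysis:
# 11 non-empty documents by as many columns as selected words
dim(TD$DocTerm)
```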
To obtain a summary of the TextData object for the 11 documents and the 10 most frequent words, we use the summary function (summary.TextData). Use help(summary.TextData) for more details.
summary(TD, ndoc=11, nword=10)
TextData summary
              Before     After
Documents      11.00     11.00
Occurrences 101967.00 101967.00
Words        9416.00   9416.00
Mean-length  9269.73   9269.73

Statistics for the documents (before and after word selection; identical here, since no words were removed)
   DocName Occurrences DistinctWords PctLength MeanLength100
1     Su79       12149          2825     11.91        131.06
2     CS81        8274          2172      8.11         89.26
3     Gz82        9427          2529      9.25        101.70
4     Gz86       11344          2076     11.13        122.38
5     Gz89        7592          1814      7.45         81.90
6     Gz93        8141          2048      7.98         87.82
7     Az96       10251          2352     10.05        110.59
8     Az00        8287          1993      8.13         89.40
9     Zp04        7882          2019      7.73         85.03
10    Zp08        8833          2295      8.66         95.29
11    Rj11        9787          2404      9.60        105.58

Index of the 10 most frequent words
   Word Frequency N.Documents
1    de      7837          11
2    la      5154          11
3     y      3438          11
4   que      3312          11
5    en      3165          11
6    el      2861          11
7   los      2106          11
8     a      2094          11
9   las      1559          11
10  del      1261          11
To obtain a bar plot of the 25 documents with the highest length (in this case only 11), we use the plot function (plot.TextData). Use help(plot.TextData) for more details.
plot(TD,sel="doc",title="Documents with higher length", xtitle="Document length")
Adding frequencies and a vertical line (vline) representing the average speech length:
plot(TD,sel="doc",title="Documents with higher length", xtitle="Document length", col.fill="slategray1", theme=theme_classic(), freq=-800, vline= TRUE)
Stopwords
To define the list of stopwords that must be removed from the documents, we build an object named "swu":
swu <- c("consiguiente", "ello", "hacia", "punto", "Señorías", "si", "Sus", "vista", "A", "B", "C", "D", "E", "F", "a", "b", "c", "d")
The tm package has a list of words that can be used as stopwords, depending on the language (by default English, "en"). These words can be retrieved with the stopwords command: tm::stopwords(kind="es")
##   [1] "de"         "la"         "que"        "el"
##   [5] "en"         "y"          "a"          "los"
##   [9] "del"        "se"         "las"        "por"
##  [13] "un"         "para"       "con"        "no"
##  [17] "una"        "su"         "al"         "lo"
## ... (output truncated: the Spanish stoplist contains 308 words in all,
## ending with "teniendo" "tenido" "tenida" "tenidos" "tenidas" "tened")
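The user list and the tm list can be compared or combined before being passed to TextData. The sketch below re-uses the swu vector defined above; it only manipulates character vectors, so any overlap between the two lists is harmless:

```r
# User-defined stopwords (as defined above)
swu <- c("consiguiente", "ello", "hacia", "punto", "Señorías", "si", "Sus",
         "vista", "A", "B", "C", "D", "E", "F", "a", "b", "c", "d")

# The tm Spanish stoplist
tm.stops <- tm::stopwords(kind = "es")

# Words of swu already covered by the tm list
intersect(swu, tm.stops)

# Full set of words that will be discarded when both lists are used
all.stops <- unique(c(tm.stops, swu))
length(all.stops)
```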
The new TextData object, with numbers, tm stopwords and user stopwords removed, is built as follows:
TD <- TextData(SpanishDisc, var.text=c(1), Fmin=1, Dmin=1, idiom="es", lower=FALSE, remov.number=TRUE, stop.word.tm=TRUE, stop.word.user=swu, graph=FALSE)
summary(TD, ndoc=11, nword=10)
TextData summary
              Before     After
Documents      11.00     11.00
Occurrences 101967.00  49603.00
Words        9416.00   9088.00
Mean-length  9269.73   4509.36

Statistics for the documents
   DocName Occurrences DistinctWords PctLength MeanLength100
                before        before    before        before
1     Su79       12149          2825     11.91        131.06
2     CS81        8274          2172      8.11         89.26
3     Gz82        9427          2529      9.25        101.70
4     Gz86       11344          2076     11.13        122.38
5     Gz89        7592          1814      7.45         81.90
6     Gz93        8141          2048      7.98         87.82
7     Az96       10251          2352     10.05        110.59
8     Az00        8287          1993      8.13         89.40
9     Zp04        7882          2019      7.73         85.03
10    Zp08        8833          2295      8.66         95.29
11    Rj11        9787          2404      9.60        105.58

   Occurrences DistinctWords PctLength MeanLength100
         after         after     after         after
1         6105          2660     12.31        135.38
2         4101          2020      8.27         90.94
3         4577          2373      9.23        101.50
4         5174          1927     10.43        114.74
5         3593          1660      7.24         79.68
6         3967          1899      8.00         87.97
7         5146          2210     10.37        114.12
8         4114          1849      8.29         91.23
9         3829          1872      7.72         84.91
10        4281          2125      8.63         94.94
11        4716          2213      9.51        104.58

Index of the 10 most frequent words
         Word Frequency N.Documents
1    política       411          11
2    Gobierno       403          11
3      España       320          11
4      Estado       235          11
5    sociedad       218          11
6      social       185          11
7        años       183          11
8      empleo       162          11
9  ciudadanos       158          11
10    sistema       158          11
To obtain a bar plot of the 25 non-stopwords with the highest frequency:
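A sketch of the corresponding call, assuming that sel="word" selects words in plot.TextData, analogously to sel="doc" above (see help(plot.TextData) for the exact arguments):

```r
plot(TD, sel = "word", title = "25 most frequent words", xtitle = "Word frequency")
```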