2. TextData object
Previous syntax
First of all, we must load the Xplortext package into memory.
library(Xplortext)
## Loading required package: FactoMineR
## Loading required package: ggplot2
Building textual and contextual tables. Direct analysis
Before doing any analysis we need to construct an object of class TextData.
Each row of the source base is considered a source document. The TextData function builds the working documents-by-words table that is submitted to the analysis. In this case we will not use any information related to contextual variables. A non-aggregate table (direct analysis) is defined by default (var.agg=NULL).
The working documents are built from the non-empty source documents. DocTerm is a non-aggregate lexical table with as many rows as non-empty source documents and as many columns as selected words.
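As a toy illustration of the kind of table stored in DocTerm, a small documents-by-words cross-tabulation can be built in base R (the two miniature documents below are invented purely for this example):

```r
# Two toy "source documents" (invented for illustration only)
docs <- c(d1 = "la política social", d2 = "la sociedad y la política")

# Split each document into words and cross-tabulate documents by words
words <- strsplit(docs, " ")
tab <- table(rep(names(docs), lengths(words)), unlist(words))
tab  # rows = documents, columns = words, cells = word frequencies
```

Each cell gives the frequency of a word in a document; TextData builds the analogous table for the real corpus, after applying the word-selection arguments.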
We can get the arguments of the TextData function by executing: args(TextData)
function (base, var.text = NULL, var.agg = NULL, context.quali = NULL, context.quanti = NULL, selDoc = "ALL", lower = TRUE, remov.number = TRUE, lminword = 1, Fmin = Dmin, Dmin = 1, Fmax = Inf, stop.word.tm = FALSE, idiom = "en", stop.word.user = NULL, segment = FALSE, sep.weak = "(['?]|[[:punct:]]|[[:space:]]|[[:cntrl:]])+", sep.strong = "[()¿?./:¡!=+;{}-]", seg.nfreq = 10, seg.nfreq2 = 10, seg.nfreq3 = 10, graph = FALSE)
Initial description of the corpus
TD <- TextData(SpanishDisc, var.text=c(1), Fmin=1, Dmin=1, idiom="es", lower=FALSE, remov.number=FALSE, stop.word.tm=FALSE, stop.word.user=NULL, graph=FALSE)
- In the previous command, we have chosen one textual variable: text
- A direct analysis is defined by var.agg=NULL
- There are no qualitative contextual variables (context.quali=NULL)
- There are no quantitative contextual variables (context.quanti=NULL)
- There are 0 empty documents, saved in res.TD$remov.docs
- The corpus is not converted into lowercase (lower=FALSE)
- Numbers are not removed from the corpus (remov.number=FALSE)
- The selection of words uses the following arguments:
- Minimum length of a word to be selected (lminword=1)
- Minimum frequency of a word to be selected (Fmin=1)
- A word has to be used in at least 1 source document to be selected (Dmin=1)
- Maximum frequency of a word to be selected (Fmax=Inf)
- The stoplist of the tm package for the declared idiom is not used (stop.word.tm=FALSE)
- No stopword list is provided by the user (stop.word.user=NULL)
- The declared idiom for the textual column(s) is "es" (see IETF language codes in package NLP)
- Repeated segments are not selected (segment=FALSE)
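The components of the returned object can be inspected directly. A minimal sketch, assuming the list components documented in help(TextData) (DocTerm is the lexical table described above, and remov.docs was mentioned in the summary):

```r
# List the components stored in the TextData object
names(TD)

# DocTerm is the documents-by-words lexical table submitted to the analysis:
# 11 non-empty documents by as many columns as selected words
dim(TD$DocTerm)
```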
To obtain a summary of the TextData object for the 11 documents and the 10 most frequent words, we use the summary function (summary.TextData). Use help(summary.TextData) for more details.
summary(TD, ndoc=11, nword=10)
TextData summary
              Before     After
Documents      11.00     11.00
Occurrences 101967.00 101967.00
Words        9416.00   9416.00
Mean-length  9269.73   9269.73

Statistics for the documents (before and after word selection; identical here, since no words were removed)
   DocName Occurrences DistinctWords PctLength MeanLength100
1     Su79       12149          2825     11.91        131.06
2     CS81        8274          2172      8.11         89.26
3     Gz82        9427          2529      9.25        101.70
4     Gz86       11344          2076     11.13        122.38
5     Gz89        7592          1814      7.45         81.90
6     Gz93        8141          2048      7.98         87.82
7     Az96       10251          2352     10.05        110.59
8     Az00        8287          1993      8.13         89.40
9     Zp04        7882          2019      7.73         85.03
10    Zp08        8833          2295      8.66         95.29
11    Rj11        9787          2404      9.60        105.58

Index of the 10 most frequent words
   Word Frequency N.Documents
1    de      7837          11
2    la      5154          11
3     y      3438          11
4   que      3312          11
5    en      3165          11
6    el      2861          11
7   los      2106          11
8     a      2094          11
9   las      1559          11
10  del      1261          11
To obtain a bar plot of the 25 documents with the highest length (in this case only 11), we use the plot function (plot.TextData). Use help(plot.TextData) for more details.
plot(TD,sel="doc",title="Documents with higher length", xtitle="Document length")
Adding frequencies and a vertical line (vline) representing the average speech length:
plot(TD,sel="doc",title="Documents with higher length", xtitle="Document length", col.fill="slategray1", theme=theme_classic(), freq=-800, vline= TRUE)
Stopwords
To define the list of stopwords that must be removed from the documents, we build an object named "swu":
swu <- c("consiguiente", "ello", "hacia", "punto", "Señorías", "si", "Sus", "vista", "A", "B", "C", "D", "E", "F", "a", "b", "c", "d")
The tm package has a list of words that can be used as stopwords, depending on the language (by default English, "en"). These words can be retrieved with the stopwords command: tm::stopwords(kind="es")
##   [1] "de"         "la"         "que"        "el"
##   [5] "en"         "y"          "a"          "los"
##   [9] "del"        "se"         "las"        "por"
##  [13] "un"         "para"       "con"        "no"
##  [17] "una"        "su"         "al"         "lo"
## ... (output truncated: the Spanish stoplist contains 308 words in all,
## ending with "teniendo" "tenido" "tenida" "tenidos" "tenidas" "tened")
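The user list and the tm list can be compared or combined before being passed to TextData. The sketch below re-uses the swu vector defined above; it only manipulates character vectors, so any overlap between the two lists is harmless:

```r
# User-defined stopwords (as defined above)
swu <- c("consiguiente", "ello", "hacia", "punto", "Señorías", "si", "Sus",
         "vista", "A", "B", "C", "D", "E", "F", "a", "b", "c", "d")

# The tm Spanish stoplist
tm.stops <- tm::stopwords(kind = "es")

# Words of swu already covered by the tm list
intersect(swu, tm.stops)

# Full set of words that will be discarded when both lists are used
all.stops <- unique(c(tm.stops, swu))
length(all.stops)
```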
The new TextData object, with numbers, tm stopwords and user stopwords removed, is built as follows:
TD <- TextData(SpanishDisc, var.text=c(1), Fmin=1, Dmin=1, idiom="es", lower=FALSE, remov.number=TRUE, stop.word.tm=TRUE, stop.word.user=swu, graph=FALSE)
summary(TD, ndoc=11, nword=10)
TextData summary
              Before     After
Documents      11.00     11.00
Occurrences 101967.00  49603.00
Words        9416.00   9088.00
Mean-length  9269.73   4509.36

Statistics for the documents
   DocName Occurrences DistinctWords PctLength MeanLength100
                before        before    before        before
1     Su79       12149          2825     11.91        131.06
2     CS81        8274          2172      8.11         89.26
3     Gz82        9427          2529      9.25        101.70
4     Gz86       11344          2076     11.13        122.38
5     Gz89        7592          1814      7.45         81.90
6     Gz93        8141          2048      7.98         87.82
7     Az96       10251          2352     10.05        110.59
8     Az00        8287          1993      8.13         89.40
9     Zp04        7882          2019      7.73         85.03
10    Zp08        8833          2295      8.66         95.29
11    Rj11        9787          2404      9.60        105.58

   Occurrences DistinctWords PctLength MeanLength100
         after         after     after         after
1         6105          2660     12.31        135.38
2         4101          2020      8.27         90.94
3         4577          2373      9.23        101.50
4         5174          1927     10.43        114.74
5         3593          1660      7.24         79.68
6         3967          1899      8.00         87.97
7         5146          2210     10.37        114.12
8         4114          1849      8.29         91.23
9         3829          1872      7.72         84.91
10        4281          2125      8.63         94.94
11        4716          2213      9.51        104.58

Index of the 10 most frequent words
         Word Frequency N.Documents
1    política       411          11
2    Gobierno       403          11
3      España       320          11
4      Estado       235          11
5    sociedad       218          11
6      social       185          11
7        años       183          11
8      empleo       162          11
9  ciudadanos       158          11
10    sistema       158          11
To obtain a bar plot of the 25 non-stopwords with the highest frequency:
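A sketch of the corresponding call, assuming that sel="word" selects words in plot.TextData, analogously to sel="doc" above (see help(plot.TextData) for the exact arguments):

```r
plot(TD, sel = "word", title = "25 most frequent words", xtitle = "Word frequency")
```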