Unsuitable characters when importing texts

Some typical problems when importing texts in Windows, Linux and Mac

1.- Description of the problem

Sometimes when you open a text file you see garbled or unexpected characters.

For example, a file saved in UTF-8 format on a Mac may not be read correctly on Windows.

Some editors (for example, Notepad) can still read it without any problem, because the editor detects (and converts between) different encodings.

To see the "locale categories" that R is using on our computer, we can use the Sys.getlocale function:

Sys.getlocale(category = "LC_ALL")
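
As a complement, here is a minimal sketch of how to look only at the part of the locale that governs character encoding. It assumes nothing beyond base R; l10n_info() is a base function, and the variable names ctype and info are purely illustrative.

```r
# Locale category that controls how characters are interpreted
ctype <- Sys.getlocale(category = "LC_CTYPE")
print(ctype)

# Summary of the encoding of the current locale
info <- l10n_info()
print(info$MBCS)          # TRUE if the locale uses a multi-byte encoding
print(info[["UTF-8"]])    # TRUE if the locale is a UTF-8 locale
print(info[["Latin-1"]])  # TRUE if the locale is a Latin-1 locale
```

The printed values depend on the operating system and its language settings, so they will differ from one computer to another.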

The answer differs between Mac, Linux and Windows, and also depends on the locale (language and encoding) configured on the computer.

To change the locale you can use the Sys.setlocale function in R. The locale names it accepts are OS-dependent, for example:

## Not run:
Sys.setlocale("LC_TIME", "de")           # Solaris: details are OS-dependent
Sys.setlocale("LC_TIME", "de_DE")        # Many Unix-alikes
Sys.setlocale("LC_TIME", "de_DE.UTF-8")  # Linux, macOS, other Unix-alikes
Sys.setlocale("LC_TIME", "de_DE.utf8")   # some Linux versions
Sys.setlocale("LC_TIME", "German")       # Windows
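
Because most locale names are OS-dependent, a portable way to try this out is with the "C" locale, which is accepted on every platform. The following is only a sketch: the variable names old_time and month_c and the date used are illustrative. It changes LC_TIME temporarily and then restores the original value.

```r
# Remember the current setting so that it can be restored afterwards
old_time <- Sys.getlocale("LC_TIME")

# "C" is a locale name that is accepted on every platform
Sys.setlocale("LC_TIME", "C")
month_c <- format(as.Date("2020-03-01"), "%B")  # month name in the "C" locale
print(month_c)

# Restore the original setting
Sys.setlocale("LC_TIME", old_time)
```

Restoring the previous value keeps the rest of the session unaffected by the experiment.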

My advice is not to change these options, but instead to adapt the text to the encoding used by our computer and operating system.

2.- How to prepare the text outside R

If the input file is, for example, in Mac UTF-8 format, it can be saved in UTF-8 (Windows format) or ANSI format using the Notepad editor.

This new file can be read directly from R using the read.csv2 function, as in the following examples:

C2 <- read.csv2("C:/Xplortext/ANSI.csv", sep = ";", header = TRUE, row.names = 1)
C3 <- read.csv2("C:/Xplortext/UTF8.csv", sep = ";", header = TRUE, row.names = 1)
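
If the file cannot be re-saved beforehand, one alternative is to declare its encoding when reading it. The sketch below is an illustration, not part of the original workflow: the data, the temporary file and the object name C4 are hypothetical. It relies on the encoding argument that read.csv2 passes on to read.table.

```r
# Hypothetical example data: an id column and a text column with accents
lines <- c("id;text", "1;caf\u00e9", "2;ni\u00f1o")

# Write the file as UTF-8 bytes, regardless of the current locale
tmp <- tempfile(fileext = ".csv")
con <- file(tmp, open = "wb")
writeLines(enc2utf8(lines), con, useBytes = TRUE)
close(con)

# Declare the encoding when reading, so that the strings are marked as UTF-8
C4 <- read.csv2(tmp, row.names = 1, encoding = "UTF-8",
                stringsAsFactors = FALSE)
print(Encoding(C4$text))
```

read.csv2 also accepts a fileEncoding argument, which re-encodes the file into the encoding of the current locale while it is being read. Which of the two is more convenient depends on the platform.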

You can then save your data frame in the format of your operating system using the save function.

3.- How to prepare the text inside R

In some cases it is not easy to clean special characters out of the text; for example, tweets may contain icons.

In the Monica Bécue book:

you can download the scripts of chapters in:

The last example in this book is an application using Twitter data:

You can find two files in R data format depending on the operating system used:

Windows format:

Mac & Linux format:

The Windows file uses latin1 encoding, and the Mac & Linux file uses UTF-8.

Note that a file in Mac UTF-8 format may not be read correctly on Windows.

Special characters are frequently read as blanks (or as sequences that include invisible blanks). When the text is then processed with TextData (Xplortext package), the affected words are split in two.

This type of error appears frequently when applying the tolower function (for example, through the tm package).

To convert characters we use the iconv function from the R base package.

The names of the available encodings are platform-dependent (Windows, Linux, Mac).

All R platforms support "" (for the encoding of the current locale), "latin1" and "UTF-8". That is to say, when the quotes are left empty, the encoding of our computer is used.

The iconv() function from the {base} package converts text from one encoding to another:
iconv(x, from = "", to = "", sub = NA, mark = TRUE, toRaw = FALSE)
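
To illustrate the sub and toRaw arguments in that signature, here is a small sketch. The string and the variable names are made up for the example, and the exact handling of non-convertible characters can vary with the iconv implementation of the platform.

```r
x <- "caf\u00e9"  # "café", stored as UTF-8

# A character with no ASCII equivalent is documented to give NA by default
res_default <- iconv(x, from = "UTF-8", to = "ASCII")
print(res_default)

# With sub, non-convertible bytes are replaced instead of producing NA
res_sub <- iconv(x, from = "UTF-8", to = "ASCII", sub = "?")
print(res_sub)

# toRaw = TRUE returns the converted bytes as a list of raw vectors
res_raw <- iconv(x, from = "UTF-8", to = "latin1", toRaw = TRUE)
print(res_raw[[1]])
```

Inspecting the raw bytes is a convenient way to confirm which encoding a string really has, independently of how the console displays it.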

For example:

1. Load the RData file from the console with load(), or read a csv file. Be aware, however, that the data are not yet ready to be processed, because they contain characters that cannot be read correctly by tm, Xplortext, quanteda or other packages.

2. If base$text is the text to convert, for example from latin1 to UTF-8, create a temporary object:
A <- iconv(base$text, 'latin1', 'UTF-8')
The iconvlist() function lists all the available encodings.

3. Check that the encoding of the "A" object is correct.

4. Replace the text:
base$text <- A

5. Save the new text:
save(base, file = "newbase.RData")
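
The steps above can be sketched end to end as follows. This is only an illustration under stated assumptions: the small data frame stands in for the loaded data, its contents are invented, and a temporary file replaces newbase.RData. The check in step 3 uses validUTF8() and Encoding() from base R, which is one possible way to verify the result.

```r
# Step 1 (stand-in): hypothetical data whose text is stored as latin1 bytes
base <- data.frame(id = 1:2, stringsAsFactors = FALSE)
base$text <- c("caf\xe9 con leche", "ma\xf1ana")

# Step 2: convert from latin1 to UTF-8 into a temporary object
A <- iconv(base$text, "latin1", "UTF-8")

# Step 3: check the result (no NA values, valid UTF-8, encoding marks)
stopifnot(!anyNA(A), all(validUTF8(A)))
print(Encoding(A))

# Step 4: replace the text
base$text <- A

# Step 5: save the new text (a temporary file is used in this sketch)
out <- tempfile(fileext = ".RData")
save(base, file = out)
```

An NA in A would mean that iconv could not convert that element, which usually indicates that the from encoding was not the right one.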

There are some packages that can help with this problem, for example: