Encoding headaches, emoticons, and R’s handling of UTF-8/16

Posted on February 5, 2015 by Kenneth Benoit

I was recently asked for help by a colleague (@kmmunger) whose code was choking while cleaning tokenized texts from Twitter data. The tweets were in the JSON format returned by the Twitter API, in what we thought was UTF-8 encoding. It turns out that these tweets used some emoticons from the nosebleed section of the Unicode maps, and these were not being read properly into R, where quanteda was being used to process the text.
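To make the "nosebleed section" concrete, here is a minimal base R sketch of how a single emoticon above U+FFFF is represented. The choice of U+1F600 and the variable names are illustrative assumptions, not taken from the original tweets, and this shows only the representation of such a character, not the fix described in the full post.

```r
# Illustrative only: U+1F600 is an assumed example character,
# not one taken from the tweets discussed in the post.
emoji <- "\U0001F600"

# The code point lies above U+FFFF, outside the Basic Multilingual Plane.
code_point <- utf8ToInt(emoji)          # 128512, i.e. 0x1F600

# In UTF-8 the character occupies four bytes.
n_bytes <- nchar(emoji, type = "bytes") # 4

# In UTF-16 it needs a surrogate pair: two 16-bit units, four bytes.
utf16 <- iconv(emoji, from = "UTF-8", to = "UTF-16BE", toRaw = TRUE)[[1]]

print(code_point)
print(n_bytes)
print(utf16)                            # d8 3d de 00
```

Code that assumes one 16-bit unit per character, or that mislabels the input encoding, tends to break on exactly this kind of character.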


How to install the R package topicmodels on OS X

Posted on February 4, 2015 by Kenneth Benoit

Many people have reported problems when attempting to install the R package topicmodels on OS X Mavericks or Yosemite. The problem is that binaries have not yet been built for these versions of OS X, so you need additional software installed in order to build the package from source. Once built from source, however, the package seems to work fine.
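As a rough sketch, the source build can be requested from within R as shown below. This excerpt does not name the additional software, so the prerequisites here are assumptions: a working compiler toolchain, plus the GNU Scientific Library, which topicmodels lists as a system requirement.

```r
# Assumption: a compiler toolchain and the GNU Scientific Library (GSL)
# are already installed; the excerpt above does not name the prerequisites.
if (!requireNamespace("topicmodels", quietly = TRUE)) {
  # type = "source" asks R to compile the package rather than fetch a binary.
  install.packages("topicmodels", type = "source")
}

# Load the package to confirm that the build succeeded.
library(topicmodels)
```

If compilation fails, the error output usually names the missing header or library, which is the place to start looking.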
