What text analysis software is available for Stata?

Posted on July 14, 2015 by Kenneth Benoit

A lot of text analysis packages exist for R, such as quanteda, tm, qdap, and koRpus. But these are only useful if you are proficient in R programming. What about users of alternative statistical packages, such as Stata?

It turns out that recent versions of Stata have made huge strides in this area. Version 13 introduced a new “long string” data type, strL, that can be of virtually unlimited length. Combined with a variety of string-handling tools (such as tokenize), this gives Stata many built-in functions and tools for manipulating text. As of version 14, Stata (finally!) supports Unicode text.
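As a rough illustration of what this looks like in practice, here is a minimal sketch that stores a sentence in a strL variable and splits it into words with tokenize. The variable name doc and the sample sentence are made up for the example, and it assumes Stata 13 or later.

```stata
* Minimal sketch: the variable name "doc" and the sample text are hypothetical
clear
set obs 1

* Store the text in a long-string (strL) variable
generate strL doc = "Text analysis in Stata is improving"

* Copy the text of the first observation into a local macro
local text = doc[1]

* tokenize splits the string on spaces and stores the pieces
* in the numbered local macros `1', `2', `3', ...
tokenize `"`text'"'

* Show the first token and the total number of words
display "`1'"
display wordcount(doc[1])
```

This only handles a single short document. For a whole corpus you would loop over observations, and very long texts may be easier to process with string functions applied directly to the strL variable than with macros.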

Recently there have been two promising new developments on the Stata text analysis front.

The best-of-breed consumer text analysis package, WordStat/QDAMiner, now works with Stata. See the announcement from the Provalis page. Previously, I recommended this package (and have taught with it) to anyone who has text analysis needs but is not comfortable programming. Now it could be worth another look for programmers too. QDAMiner is fantastic if you have qualitative text analysis needs (like general annotation or applying content analysis codes to text) and also works superbly with dictionary-based applications.

There is a relatively new text analysis package for Stata called txttool; see the article here. I have not tried it in action, but the article makes it sound quite good for basic stuff. This can be installed easily using

. net install txttool

from Stata.

If I ever get around to it, I will fix up my own creaky, ancient Stata package “Wordscores”, originally written for Stata 7.0. Miraculously, it still works, but of course quanteda is better: the comparison is approximately that of a spaceship to a bicycle.