Text as Data Approach

Treating text “as data” is a subtle but radical departure from classical methods of using qualitative, interpretative tools (such as reading!) to determine the “meaning” of the text. It is also different from studying the features of the text as objects of study in their own right, although an eclectic view of text as data might stretch to include this.

When the text is simply data, by contrast, our focus of interest is not on the texts or the features of the texts themselves, but on what the patterns in these features can tell us about something else: usually the author, process, or system that produced the texts. Analyzing textual data then becomes no different from analyzing a survey dataset, for instance, where the concern is not to interpret each respondent’s vector of data, but rather to analyze patterns of responses from a larger group of respondents for what they can tell us about various questions of interest. Whether your approach is motivated by a social science question or by data mining for marketing analytics, you are still interested in aggregate patterns and relationships, using the patterns found in the texts to inform you about the process that generated them.
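
The survey analogy can be made concrete with a small sketch. The Python example below is illustrative only: the three toy texts, the whitespace tokenizer, and all variable names are assumptions invented here, not part of any particular package. It turns each text into a row of feature counts, much like one respondent's row in a survey dataset, and then computes an aggregate pattern across the rows.

```python
from collections import Counter

# Hypothetical toy corpus; the texts are invented for illustration.
texts = {
    "doc1": "taxes must fall and spending must fall",
    "doc2": "spending on health must rise",
    "doc3": "health and taxes",
}


def tokenize(text):
    """Lowercase and split on whitespace (a deliberately naive tokenizer)."""
    return text.lower().split()


# Count features (here, word types) within each document.
counts = {name: Counter(tokenize(text)) for name, text in texts.items()}

# The sorted vocabulary defines the columns of the matrix.
features = sorted(set().union(*counts.values()))

# One row per document, one column per feature,
# analogous to respondents by survey items.
dfm = {name: [c[f] for f in features] for name, c in counts.items()}

# An aggregate pattern: the total frequency of each feature in the corpus.
totals = {f: sum(row[i] for row in dfm.values()) for i, f in enumerate(features)}
```

No single row is interpreted on its own here. The analysis works on the matrix as a whole, which is the sense in which the text has become data.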

Traditional content analysis schemes for analyzing texts, whether computer-assisted or not, involve human determination of segments of a text through an interpretative process. Pure “text as data” approaches can replace the human interpretative element of the analysis with statistical models, usually to estimate latent traits or classes thought to generate the texts. This is the basic approach taken by scaling models and by topic models, for example.

The advantages of treating text in this manner are numerous. First, if text can be analyzed using statistical methods, then we have access to the large corpus of knowledge about stochastic processes, models of uncertainty, functional forms, and other common elements from the statistical analyst’s repertoire. These issues may be more complicated for natural language, especially when it comes to building generative models, but they are still the same class of problems familiar from quantitative data analysis generally. Second, by removing the need for qualitative judgment as part of the analytic procedure, the application of quantitative methods to textual data means we are no longer limited to languages that we understand.

The distinction between removing qualitative judgment from the analytic procedure and removing human judgment altogether is important, since it would be impossible to analyze text or any other data without human judgment. All supervised learning methods begin with some human judgments (for a training set), and even unsupervised methods involve human selection of the data. And no analytic procedure can produce useful results if these cannot be made meaningful in some way for human interpretation. Treating text as data, however, means that the steps in between can be performed without human judgment, just as the core part of a statistical analysis is done by computerized implementation of algorithmic and mathematical methods.
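
To illustrate where human judgment enters and where it does not, here is a minimal Python sketch of a supervised workflow. It is an assumption-laden toy rather than a recommended method: the hand-labeled examples, the labels "left" and "right", and the choice of a multinomial Naive Bayes classifier with add-one smoothing are all made up for this example. Human judgment supplies the training labels at the start and interprets the prediction at the end, while everything in between is mechanical.

```python
import math
from collections import Counter, defaultdict

# Human judgment enters here: a hand-labeled training set (invented examples).
training = [
    ("cut taxes and cut spending", "right"),
    ("lower taxes for business", "right"),
    ("raise spending on health", "left"),
    ("more spending on welfare and health", "left"),
]


def tokenize(text):
    """Lowercase and split on whitespace (a deliberately naive tokenizer)."""
    return text.lower().split()


def train(labeled_texts):
    """Count words per class and documents per class; collect the vocabulary."""
    word_counts = defaultdict(Counter)
    class_counts = Counter()
    for text, label in labeled_texts:
        class_counts[label] += 1
        word_counts[label].update(tokenize(text))
    vocab = {w for c in word_counts.values() for w in c}
    return word_counts, class_counts, vocab


def predict(text, word_counts, class_counts, vocab):
    """Return the class with the highest log posterior (add-one smoothing)."""
    n_docs = sum(class_counts.values())
    scores = {}
    for label in class_counts:
        total = sum(word_counts[label].values())
        score = math.log(class_counts[label] / n_docs)
        for w in tokenize(text):
            if w in vocab:  # ignore words never seen in training
                score += math.log(
                    (word_counts[label][w] + 1) / (total + len(vocab))
                )
        scores[label] = score
    return max(scores, key=scores.get)


# The steps in between: estimation and prediction need no human reading.
model = train(training)
prediction = predict("taxes on business", *model)
```

Nothing in `train` or `predict` depends on anyone understanding the words, so the same procedure would run unchanged on text in a language the analyst cannot read.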