How do you convert letters into numbers?

Representation in textual analysis

In their article “Text as Data,” Gentzkow, Kelly, and Taddy (2019) argued that any quantitative textual analysis requires three key steps, which can be summarized as follows:

  • choose a representation: we transform textual data into numerical data. Formally, we transform $C$, a corpus composed of documents, into an array $W$, which associates each textual value with a numerical value.

  • apply a measure: the aim is then to apply a measure $f$ to $W$ to estimate an unknown outcome of interest $Y$.

  • interpretation: use the result in a descriptive or causal analysis.

In this post, we will focus on the first step, representation, and find out what it means to transform textual data into numbers.
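In this notation, representation is a mapping from the corpus to the array, and the measure maps the array to an estimate of the outcome. Writing the representation step as a function $r$ (a symbol introduced here for convenience; the rest follows the list above), the pipeline is:

$$W = r(C), \qquad \hat{Y} = f(W)$$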

One example

We’ll first use a standard text analysis metric, tf-idf, and then deconstruct the various steps involved in transforming a raw text into a table of quantifiable numerical values. If you’re not familiar with tf-idf, read this post first.
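As a quick reminder, tf-idf comes in several variants; the one consistent with the values reported below uses the relative term frequency and a natural-log inverse document frequency:

$$\text{tf-idf}(t, d) = \frac{n_{t,d}}{\sum_{t'} n_{t',d}} \times \ln \frac{N}{N_t}$$

where $n_{t,d}$ is the number of occurrences of term $t$ in document $d$, $N$ is the number of documents in the corpus, and $N_t$ is the number of documents containing $t$.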

Let’s use a simple hypothetical corpus of two documents $d_1$ and $d_2$.

doc_id text
text1 This is the first document
text2 This document is the second document

and let’s calculate the tf-idf for each term:

doc_id term tf_idf
text1 document 0.0000000
text1 first 0.1386294
text1 is 0.0000000
text1 the 0.0000000
text1 this 0.0000000
text2 document 0.0000000
text2 is 0.0000000
text2 second 0.1155245
text2 the 0.0000000
text2 this 0.0000000
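The table above can be reproduced with a minimal, self-contained Python sketch, using the tf-idf variant stated earlier (lowercased whitespace tokenization is an assumption here, not something the original analysis necessarily used):

```python
import math
from collections import Counter

docs = {
    "text1": "This is the first document",
    "text2": "This document is the second document",
}

# Step 1: tokenization (here: lowercase + split on whitespace)
tokens = {doc_id: text.lower().split() for doc_id, text in docs.items()}

# Step 2: count each token's occurrences per document
counts = {doc_id: Counter(toks) for doc_id, toks in tokens.items()}

# Document frequency: in how many documents does each term appear?
df = Counter()
for c in counts.values():
    df.update(c.keys())

# Step 3: tf-idf with tf = relative frequency, idf = ln(N / df)
n_docs = len(docs)
for doc_id, c in counts.items():
    for term in sorted(c):
        tf = c[term] / len(tokens[doc_id])
        idf = math.log(n_docs / df[term])
        print(f"{doc_id} {term} {tf * idf:.7f}")
```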

To compute the tf-idf, we made significant transformations of our textual data:

  • First, we split each document into the words that compose it.
  • Second, we associated each word with a numerical value: its number of occurrences in the document.
  • Only after these two steps were we able to actually apply our tf-idf measure.

These three steps in the data transformation can be summarized in the table below:

term (Step 1) n (Step 2) tf_idf (Step 3)
document 1 0.0000000
first 1 0.1386294
is 1 0.0000000
the 1 0.0000000
this 1 0.0000000
document 2 0.0000000
is 1 0.0000000
second 1 0.1155245
the 1 0.0000000
this 1 0.0000000

The second step here is a key one. A computer doesn’t understand natural language, and texts cannot be automatically processed as such. It is necessary to convert, somehow, textual values into numerical values so that our corpus can be quantitatively analyzed. This is what is called representation.

We calculated the frequency of each token within each document, and used this frequency to calculate our tf-idf measure. We didn’t apply our measure directly to the raw text; we applied it to a new table (let’s call it $W$) which associates each token with a numerical value: its frequency within each document.

doc_id token n
text1 this 1
text1 is 1
text1 the 1
text1 first 1
text1 document 1
text2 this 1
text2 document 2
text2 is 1
text2 the 1
text2 second 1
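The post doesn’t show how this table is built; a minimal Python sketch (again assuming lowercase whitespace tokenization) would be:

```python
from collections import Counter

docs = {
    "text1": "This is the first document",
    "text2": "This document is the second document",
}

# Table W: one row per (document, token) pair with its count
for doc_id, text in docs.items():
    for token, n in Counter(text.lower().split()).items():
        print(doc_id, token, n)
```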

In our corpus $C$, the documents were raw text. In our new table $W$, the documents are now a series of numerical values on which we can carry out some measure (for example, the tf-idf).

To illustrate this, we can slightly transform the presentation of $W$ so that the rows are now the documents, and the columns the unique tokens of our corpus.

doc_id this is the first document second
text1 1 1 1 1 1 0
text2 1 1 1 0 2 1

The columns represent the vocabulary $V$ of the corpus, i.e. the set of unique tokens in our corpus $C$. Each document (text1 and text2) is represented by a vector of the same length (the size of the vocabulary), and the $i$-th value of this vector is the frequency of the $i$-th token of the vocabulary within the document. In our example, document 1 is represented by the vector $(1,1,1,1,1,0)$ and document 2 by the vector $(1,1,1,0,2,1)$.
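This pivot from the long table to document vectors is easy to sketch in Python (the fixed vocabulary order below simply mirrors the columns of the table above):

```python
from collections import Counter

docs = {
    "text1": "This is the first document",
    "text2": "This document is the second document",
}

counts = {doc_id: Counter(text.lower().split()) for doc_id, text in docs.items()}

# Vocabulary V: the unique tokens of the corpus, ordered as in the table above
vocab = ["this", "is", "the", "first", "document", "second"]

# Each document becomes a vector of length |V|
for doc_id, c in counts.items():
    print(doc_id, [c[token] for token in vocab])
# text1 [1, 1, 1, 1, 1, 0]
# text2 [1, 1, 1, 0, 2, 1]
```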

This representation of the corpus is called the bag-of-words representation. The metaphor refers to the documents in the corpus: each document was a rich but unquantifiable text; it now becomes a “bag” into which we have thrown its tokens, enabling quantitative analysis.

The knowledge we lost in representation

References

Gentzkow, Matthew, Bryan Kelly, and Matt Taddy. 2019. “Text as Data.” Journal of Economic Literature 57 (3): 535–74.

Thomas Delcey
Assistant Professor in Economics