How do you convert letters into numbers?
Representation in textual analysis
In their article “Text as Data” (Gentzkow, Kelly, and Taddy 2019), the authors argue that any quantitative analysis of text requires three key steps, which can be summarized as follows:
- choose a representation: we transform textual data into numerical data. Formally, we transform $C$, a corpus composed of documents, into an array $W$, which associates textual values with numerical values.
- apply a measure: the aim is then to apply a measure $f$ to $W$ to estimate an unknown quantity of interest $Y$.
- interpret: use the result in a descriptive or causal analysis.
In this post, we will focus on the first step, the representation, and find out what it means to transform textual data into numbers.
An example
We’ll first use a standard text analysis metric, tf-idf, and then deconstruct the various steps involved in transforming a raw text into a table of quantifiable numerical values. If you’re not familiar with tf-idf, read this post first.
Let’s use a simple hypothetical corpus of two documents $d_1$ and $d_2$.
doc_id | text |
---|---|
text1 | This is the first document |
text2 | This document is the second document |
and let’s calculate the tf-idf for each term:
doc_id | term | tf_idf |
---|---|---|
text1 | document | 0.0000000 |
text1 | first | 0.1386294 |
text1 | is | 0.0000000 |
text1 | the | 0.0000000 |
text1 | this | 0.0000000 |
text2 | document | 0.0000000 |
text2 | is | 0.0000000 |
text2 | second | 0.1155245 |
text2 | the | 0.0000000 |
text2 | this | 0.0000000 |
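The exact values depend on the tf-idf variant used. The table above is consistent with tf defined as a term’s count divided by the document’s length, and idf as the natural logarithm of the number of documents divided by the number of documents containing the term. Here is a minimal Python sketch, assuming lowercasing and whitespace tokenization, that reproduces these values:

```python
import math
from collections import Counter

corpus = {
    "text1": "This is the first document",
    "text2": "This document is the second document",
}

# Step 1: tokenize each document (lowercase, split on whitespace)
tokens = {doc_id: text.lower().split() for doc_id, text in corpus.items()}

# Step 2: count the occurrences of each token within each document
counts = {doc_id: Counter(toks) for doc_id, toks in tokens.items()}

# Step 3: tf-idf, with tf = count / document length and
# idf = ln(number of documents / number of documents containing the term)
n_docs = len(corpus)
doc_freq = Counter(term for c in counts.values() for term in c)

for doc_id, c in counts.items():
    doc_len = sum(c.values())
    for term in sorted(c):
        tf_idf = (c[term] / doc_len) * math.log(n_docs / doc_freq[term])
        print(f"{doc_id}  {term:<10} {tf_idf:.7f}")
```

Running this prints, among the zeros, `0.1386294` for “first” in text1 and `0.1155245` for “second” in text2, matching the table above.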
To estimate the tf-idf, we made significant transformations to our textual data:
- First, we split each document into the words that compose it.
- Second, we associated each word with a numerical value: its number of occurrences in the document.
- Only after these two steps were we able to actually apply our tf-idf measure.
These three steps of data transformation can be summarized in the table below:
Step 1: token | Step 2: count | Step 3: tf-idf |
---|---|---|
document | 1 | 0.0000000 |
first | 1 | 0.1386294 |
is | 1 | 0.0000000 |
the | 1 | 0.0000000 |
this | 1 | 0.0000000 |
document | 2 | 0.0000000 |
is | 1 | 0.0000000 |
second | 1 | 0.1155245 |
the | 1 | 0.0000000 |
this | 1 | 0.0000000 |
The second step here is a key one. A computer doesn’t understand natural language, so texts cannot be automatically processed as such. We need to convert, somehow, textual values into numerical values so that our corpus can be analyzed quantitatively. This is what is called representation.
We calculated the frequency of each token within each document, and used this frequency to calculate our tf-idf measure. We didn’t apply our measure directly to the raw text; we applied it to a new table (let’s call it $W$) which associates each token with a numerical value: its frequency within each document.
doc_id | token | n |
---|---|---|
text1 | this | 1 |
text1 | is | 1 |
text1 | the | 1 |
text1 | first | 1 |
text1 | document | 1 |
text2 | this | 1 |
text2 | document | 2 |
text2 | is | 1 |
text2 | the | 1 |
text2 | second | 1 |
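Again assuming lowercasing and whitespace tokenization, here is a minimal sketch of how this long-format table $W$ could be built:

```python
from collections import Counter

corpus = {
    "text1": "This is the first document",
    "text2": "This document is the second document",
}

# W in "long" format: one row per (doc_id, token) pair, where
# n is the token's number of occurrences within that document
W = [
    (doc_id, token, n)
    for doc_id, text in corpus.items()
    for token, n in Counter(text.lower().split()).items()
]
for row in W:
    print(row)  # ('text1', 'this', 1) ... ('text2', 'document', 2)
```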
In our corpus $C$, the documents were raw text. In our new table $W$, each document is now a series of numerical values on which we can compute a measure (for example, tf-idf).
To illustrate this, we can slightly transform the presentation of $W$ so that the rows are now the documents, and the columns the unique tokens of our corpus.
doc_id | this | is | the | first | document | second |
---|---|---|---|---|---|---|
text1 | 1 | 1 | 1 | 1 | 1 | 0 |
text2 | 1 | 1 | 1 | 0 | 2 | 1 |
The columns represent the vocabulary $V$ of the corpus, i.e. the set of unique tokens in our corpus $C$. Each document (text1 and text2) is represented by a vector of the same length (the size of the vocabulary), and the $i$-th value of this vector represents the frequency of the $i$-th token of the vocabulary within the document. In our example, document 1 is represented by the vector $(1,1,1,1,1,0)$ and document 2 by the vector $(1,1,1,0,2,1)$.
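As a minimal sketch, assuming the same lowercasing and whitespace tokenization as above, here is how these document vectors could be built, with the vocabulary ordered by first appearance so that it matches the table:

```python
from collections import Counter

corpus = {
    "text1": "This is the first document",
    "text2": "This document is the second document",
}
tokens = {doc_id: text.lower().split() for doc_id, text in corpus.items()}

# Vocabulary V: the unique tokens of the corpus, in order of first appearance
vocab = list(dict.fromkeys(tok for toks in tokens.values() for tok in toks))

# Each document becomes a vector of length |V|: the i-th value is the
# frequency of the i-th vocabulary token within the document
vectors = {}
for doc_id, toks in tokens.items():
    c = Counter(toks)
    vectors[doc_id] = [c[term] for term in vocab]

print(vocab)             # ['this', 'is', 'the', 'first', 'document', 'second']
print(vectors["text1"])  # [1, 1, 1, 1, 1, 0]
print(vectors["text2"])  # [1, 1, 1, 0, 2, 1]
```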
This representation of the corpus is called the bag-of-words representation. The metaphor refers to the documents in the corpus: each document was a rich but unquantifiable text, and now becomes a “bag” into which we have thrown a set of tokens, enabling its quantitative analysis.
The knowledge we lost in representation
References
Gentzkow, Matthew, Bryan Kelly, and Matt Taddy. 2019. “Text as Data.” Journal of Economic Literature 57 (3): 535–74.