Text exploration with the tf-idf

In the history of economics, it can be difficult to qualitatively explore the documents of a corpus when it is too large. A very simple method for exploring a large corpus is to use the tf-idf measure, which highlights the most frequent and most specific words of a document.

The tf-idf is a powerful tool for exploring the differences between the documents of your corpus. For instance, Claveau et al. (2021) used the tf-idf to identify the topics discussed by the various communities in the field of philosophy of economics.1

Source: Philosophy of Economics?: Three Decades of Bibliometric History (Claveau et al. 2021)

In the figure above, Claveau et al. (2021) identify the terms in each document (the y axis) and extract those with the highest tf-idf (x axis). Rather than laboriously analyzing hundreds of documents, this method enables them to quickly identify the main research topics discussed in philosophy of economics over the past decades. For example, the second community is mainly interested in critical realism and heterodoxy. This shows that the opposition between heterodox and orthodox approaches in economics has been an important source of philosophical discussion on model realism.

So how does the tf-idf work, and how can we use it? In this post, we’ll take a step-by-step look at this measure and implement it in R using real data from the American Economic Review and the Journal of Finance.

In a corpus $C$ of $N$ documents, the tf-idf of a term $t$ in a document $d$ is defined by the following formula:

$$\frac{f_{t,d}}{\sum\limits_{t' \in d} f_{t', d}} \times \log{\left(\frac{N}{|\{d\in C : t \in d\}|}\right)}$$

The tf-idf is the combination of two distinct formulas: the term frequency (left), and the inverse document frequency (right). We will explain this formula step by step using a hypothetical corpus of two documents with very simple texts:

doc_id  text
1       This is the first document
2       This document is the second document

Let’s start with the left side of the formula, the term frequency.

The term frequency

The term frequency is simply the number of times a term appears in a document. Let’s calculate the frequency of each word in the two documents of our corpus $C$:

doc_id  word      n
1       this      1
1       is        1
1       the       1
1       first     1
1       document  1
2       this      1
2       document  2
2       is        1
2       the       1
2       second    1
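
If you want to reproduce these counts yourself, here is a minimal sketch in R using the dplyr and tidytext packages (the same packages used in the implementation section below); the corpus and word_counts names are just illustrative choices:

library(dplyr)
library(tidytext)

# the hypothetical two-document corpus
corpus <- tibble(doc_id = c(1, 2),
                 text = c("This is the first document",
                          "This document is the second document"))

# split each text into lowercased words and count occurrences per document
word_counts <- corpus %>%
    unnest_tokens(output = word, input = text) %>%
    count(doc_id, word)

word_counts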

It’s easy to see that the absolute frequency is a poor indicator because it doesn’t take the length of the documents into account. If a document contains more words than another, it is more likely to contain any given word, irrespective of the topic of the text itself.

In our example, the text of document 2 is slightly longer, so the word frequencies of the two documents cannot be compared without normalization. One way to normalize is to divide the absolute frequency by the sum of the frequencies of all words in the document (that is, the total number of words in the document):

$$TF(d,t) = \frac{f_{t,d}}{\sum\limits_{t' \in d} f_{t', d}}$$

doc_id  word      n  total  TF
1       this      1  5      0.2000000
1       is        1  5      0.2000000
1       the       1  5      0.2000000
1       first     1  5      0.2000000
1       document  1  5      0.2000000
2       this      1  6      0.1666667
2       document  2  6      0.3333333
2       is        1  6      0.1666667
2       the       1  6      0.1666667
2       second    1  6      0.1666667

As the second document is made up of more words, each of its words has a relatively smaller weight than the words in the first, shorter document.
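
As an illustration, this normalization can be computed from the word counts above; the sketch below assumes the word_counts table created in the previous snippet:

# term frequency: divide each count by the total number of words in the document
tf_table <- word_counts %>%
    group_by(doc_id) %>%
    mutate(total = sum(n),      # number of words in the document
           TF = n / total) %>%  # normalized (relative) frequency
    ungroup()

tf_table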

While the relative frequency is a useful indicator for understanding the content of a text, it remains a poor indicator for understanding the differences between documents. Indeed, the term frequency tends to emphasize common words used in all documents (for example, in our corpus, “is” or “the”). Because these words are used everywhere, they provide no interesting information on the peculiarities of each document.

For instance, in our corpus, the word “first” is the only one that is specific to the first document. Therefore, it gives more information about the first document than the other terms in that document. We’d like to find a measure that gives more weight to “first”. This is what the idf measure does.

The idf

The idf is the ratio of the number of documents in the corpus $C$, denoted $N$, to the number of documents containing the term $t$:

$$idf(t) = \frac{N}{|\{d\in C : t\in d\}|}$$

doc_id  word      TF         idf
1       this      0.2000000  1
1       is        0.2000000  1
1       the       0.2000000  1
1       first     0.2000000  2
1       document  0.2000000  1
2       this      0.1666667  1
2       document  0.3333333  1
2       is        0.1666667  1
2       the       0.1666667  1
2       second    0.1666667  2
The idf is a measure of the specificity of a word in a document relative to other documents. In our example, the idf of the words “first” and “second”, which are specific to one document each, is higher than that of the other words, which are found in both documents.
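
The raw idf can be obtained by hand in the same way; this sketch reuses the hypothetical word_counts table from the earlier snippets and adds the idf to each row:

# number of documents in the corpus
N <- n_distinct(word_counts$doc_id)

# idf: ratio of N to the number of documents containing each word
idf_table <- word_counts %>%
    group_by(word) %>%
    mutate(idf = N / n_distinct(doc_id)) %>%
    ungroup()

idf_table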

The tf-idf

Hence, the tf-idf, which combines the tf and the idf, gives more weight to words that appear several times in a document and to words that are specific to a document compared to the other documents. It is therefore a measure of both the importance and the specificity of a word within a document relative to the rest of the corpus.

Note that it’s common practice to apply the logarithm to the idf measure, and with it we finally recover our original formula!

$$\frac{f_{t,d}}{\sum\limits_{t' \in d} f_{t', d}} \times \log{\left(\frac{N}{|\{d\in C : t \in d\}|}\right)}$$

doc_id  word      TF         idf (log)  tf-idf
1       this      0.2000000  0.0000000  0.0000000
1       is        0.2000000  0.0000000  0.0000000
1       the       0.2000000  0.0000000  0.0000000
1       first     0.2000000  0.6931472  0.1386294
1       document  0.2000000  0.0000000  0.0000000
2       this      0.1666667  0.0000000  0.0000000
2       document  0.3333333  0.0000000  0.0000000
2       is        0.1666667  0.0000000  0.0000000
2       the       0.1666667  0.0000000  0.0000000
2       second    0.1666667  0.6931472  0.1155245

With the logarithm applied to the idf, words shared by all documents (an idf value of 1) get a tf-idf of 0, while words specific to a subset of documents get a positive value.2
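
Combining the two pieces by hand reproduces the table above; the sketch below reuses the hypothetical word_counts table and the document count N from the previous snippets:

# tf-idf by hand: term frequency times the log of the inverse document frequency
tfidf_by_hand <- word_counts %>%
    group_by(doc_id) %>%
    mutate(TF = n / sum(n)) %>%                    # term frequency
    group_by(word) %>%
    mutate(idf = log(N / n_distinct(doc_id))) %>%  # log-idf
    ungroup() %>%
    mutate(tf_idf = TF * idf)

tfidf_by_hand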

R implementation

In practice, you don’t need to follow the steps above to calculate the tf-idf of your documents. This measure is already implemented in most programming languages used by scientists. In R, for example, the tidytext package (which follows the tidyverse principles) computes the tf-idf with the bind_tf_idf function and returns the same result as our hand calculation:

library(dplyr)
library(tidytext)

# our hypothetical corpus
corpus <- tibble(doc_id = c(1, 2), text = c("This is the first document",
    "This document is the second document"))

# tokenization
list_tokens <- corpus %>%
    unnest_tokens(output = token, input = text)

# estimate tf-idf
list_tokens %>%
    group_by(doc_id) %>%
    count(token) %>%
    ungroup() %>%
    bind_tf_idf(token, doc_id, n)
## # A tibble: 10 × 6
##    doc_id token        n    tf   idf tf_idf
##     <dbl> <chr>    <int> <dbl> <dbl>  <dbl>
##  1      1 document     1 0.2   0      0    
##  2      1 first        1 0.2   0.693  0.139
##  3      1 is           1 0.2   0      0    
##  4      1 the          1 0.2   0      0    
##  5      1 this         1 0.2   0      0    
##  6      2 document     2 0.333 0      0    
##  7      2 is           1 0.167 0      0    
##  8      2 second       1 0.167 0.693  0.116
##  9      2 the          1 0.167 0      0    
## 10      2 this         1 0.167 0      0
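
On a real corpus such as the American Economic Review or Journal of Finance articles mentioned in the introduction, you would typically not read the full table but extract, for each document, the terms with the highest tf-idf, as in the Claveau et al. (2021) figure. A possible sketch with dplyr’s slice_max (the choice of 5 terms per document is arbitrary):

# keep the 5 terms with the highest tf-idf in each document
list_tokens %>%
    group_by(doc_id) %>%
    count(token) %>%
    ungroup() %>%
    bind_tf_idf(token, doc_id, n) %>%
    group_by(doc_id) %>%
    slice_max(tf_idf, n = 5, with_ties = FALSE) %>%
    ungroup()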

References

Claveau, François, Alexandre Truc, Olivier Santerre, and Luis Mireles-Flores. 2021. “Philosophy of Economics?: Three Decades of Bibliometric History.” In The Routledge Handbook of Philosophy of Economics, 151–68. Routledge.


  1. Communities are identified by the authors using network and bibliometric analysis. ↩︎

  2. There are several practical justifications for using the logarithm function. One of them is that it keeps the idf on the same “scale” as the tf. Indeed, unlike the tf, the idf can increase rapidly as $N$ increases. ↩︎

Thomas Delcey
Assistant Professor in Economics