#### SCRIPT FOR CHOOSING K ####
# Load packages and data
source(here::here("scripts","paths_and_packages.R"))
source(here::here("scripts", "producing_results", "_functions_for_tm.R"))
documents <- readRDS(here(clean_corpus_path, "documents_no_duplicates.rds"))
authors <- readRDS(here(clean_corpus_path, "authors_no_duplicates.rds"))
full_text <- readRDS(here(clean_corpus_path, "full_text.rds")) %>%
  mutate(id = str_remove_all(id, "_[:digit:]{4}\\.txt$"))
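# quick illustration of the id cleaning above, on a hypothetical file name (not from the corpus)
str_remove_all("dupont_1998.txt", "_[:digit:]{4}\\.txt$") # -> "dupont"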
### JOINING TABLES ###
# select relevant author information
authors_info_to_join <- authors %>%
  select(id_document,
         authors,
         gender)
# join tables
documents_with_authors <- documents %>%
  left_join(authors_info_to_join, by = c("id" = "id_document")) %>%
  # create dummy variables has_female and is_varia
  group_by(id) %>%
  mutate(has_female = as.integer(any(gender == "F")),
         is_varia = ifelse(type == "varia", 1, 0)) %>%
  # keep only the first author (or first title) for ids duplicated by the join
  slice(1) %>%
  ungroup() %>%
  # drop author-level columns and remove remaining duplicates
  select(id, authors, title, abstract_fr, year, has_female, is_varia, type) %>%
  distinct()
# join full text
full_text_filtered <- documents_with_authors %>%
  filter(type %in% c("varia", "numéro spécial")) %>%
  left_join(full_text, by = "id") %>%
  # drop rows with missing covariates
  filter(!is.na(has_female), !is.na(is_varia))
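# quick sanity check (illustrative only, not required by the pipeline):
# distribution of the two dummy covariates among the documents kept
full_text_filtered %>% count(is_varia, has_female)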
#### PRE-CLEANING ####
# using data.table for efficiency
# Convert to data.table if not already
setDT(full_text_filtered)
# Select relevant columns
df_text <- full_text_filtered[, .(id, authors, title, abstract_fr, text, year, has_female, is_varia)]
# Clean abstracts
df_text[, abstract_fr := str_squish(abstract_fr)]
df_text[, abstract_fr := str_trim(str_remove_all(abstract_fr, "^[Rr]ésumé"))]
df_text[, abstract_fr := str_remove_all(abstract_fr, "Classification JEL.*$")]
df_text[, abstract_fr := str_remove_all(abstract_fr, "JEL [Cc]ode(s)?.*$")]
df_text[, abstract_fr := str_remove_all(abstract_fr, "JEL [Cc]lassification.*$")]
df_text[, abstract_fr := str_remove_all(abstract_fr, "(JEL : D11, L13, Q42.)|(Classification jel : A14, B10, F13)")]
df_text[, abstract_fr := fifelse(is.na(abstract_fr), "", abstract_fr)]
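# illustration of the JEL-stripping step above on a made-up abstract (hypothetical text, not from the corpus)
str_remove_all("Résumé Nous étudions les prix. Classification JEL : D11, L13.",
               "Classification JEL.*$")
# -> "Résumé Nous étudions les prix. "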
# Clean texts
df_text[, text := str_squish(text)]
# remove the journal name ("Revue économique") from the body text, including mis-encoded variants
df_text[, text := str_remove_all(text, "Revue économique")]
df_text[, text := str_remove_all(text, "Revue Economique")]
df_text[, text := str_remove_all(text, "REVUE ECONOMIQUE")]
df_text[, text := str_remove_all(text, "REVUE CONOMIQUE")]
# remove vol + number
df_text[, text := str_remove_all(text, "vol. [0-9]+")]
# remove reference lists, bibliographies and appendices (drop everything after the section heading)
df_text[, text := str_remove_all(text, regex("Références bibliographiques.*", ignore_case = TRUE))]
df_text[, text := str_remove_all(text, regex("REFERENCES BIBLIOGRAPHIQUES.*", ignore_case = TRUE))]
df_text[, text := str_remove_all(text, regex("Notes bibliographiques.*", ignore_case = TRUE))]
df_text[, text := str_remove_all(text, regex("Bibliographie .*", ignore_case = TRUE))]
df_text[, text := str_remove_all(text, regex("Bibliography .*", ignore_case = TRUE))]
df_text[, text := str_remove_all(text, regex("APPENDIX .*", ignore_case = TRUE))]
# repair French words whose accented characters were lost during text extraction
df_text[, text := str_replace_all(text, "fa on", "façon")]
df_text[, text := str_replace_all(text, regex("fran ais", ignore_case = TRUE), "français")]
df_text[, text := str_replace_all(text, regex("fran e", ignore_case = TRUE), "France")]
df_text[, text := str_replace_all(text, " uvre ", " oeuvre ")]
df_text[, text := str_replace_all(text, "e ment ", "ement ")]
df_text[, text := str_replace_all(text, " tion ", "tion ")]
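# illustration of the word re-joining rules above on a made-up extraction artefact
str_replace_all("la popula tion active", " tion ", "tion ") # -> "la population active"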
# Construct final text column
df_text[, text := tolower(paste(title, ".", abstract_fr, text)), by = id]
saveRDS(df_text, here(private_data_path, "df_full_text.rds"), compress = TRUE)
#### TOKENIZATION ####
# df_text <- readRDS(here(private_data_path, "df_full_text.rds"))
# Tokenize each text into words while keeping the document ID
df_tokens <- df_text %>%
  select(id, text) %>%
  mutate(tokens = tokenizers::tokenize_words(text,
                                             lowercase = TRUE,
                                             strip_punct = TRUE,   # drop punctuation
                                             strip_numeric = TRUE, # drop numbers
                                             simplify = FALSE)) %>%
  unnest(tokens) %>% # expand token lists into one row per token
  group_by(id) %>%
  mutate(token_id = row_number()) %>%
  ungroup() %>%
  rename(token = tokens) %>% # rename column for clarity
  select(id, token, token_id)
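# quick look at the one-row-per-token structure produced above (illustrative only, not required downstream)
head(df_tokens)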
#### STOPWORDS ####
# build the stopword list from several French and English sources
stop_words <- bind_rows(get_stopwords(language = "fr", source = "stopwords-iso"),
                        get_stopwords(language = "fr", source = "snowball"),
                        get_stopwords(language = "en", source = "stopwords-iso"),
                        get_stopwords(language = "en", source = "snowball")) %>%
  distinct(word) %>%
  pull(word)
custom_stop_words <- c(
  "faire", "faut", "résumé", "article",
  "analyse", "analyser", "analysons", "analysent",
  "approche", "étude", "étudie", "étudions", "étudient",
  "montrons", "montrer", "montre", "montrent",
  "permettre", "permet", "permettent",
  "proposer", "propose", "proposons", "proposent",
  "utiliser", "mettre",
  "présente", "présentons", "présenter",
  "role", "rôle"
)
stop_words <- c(stop_words, custom_stop_words)
# use data.table for efficiency
# Convert to data.table if not already
setDT(df_tokens)
# strip elided articles and pronouns (l', d', qu', etc.) and remaining punctuation from tokens
df_tokens[, token := str_remove_all(token, "^(.*qu|[mjldscn])[\u0027\u2019\u2032\u0060]")]
df_tokens[, token := str_remove_all(token, "[[:punct:]]")]
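# illustration of the contraction-stripping regex above on made-up tokens
str_remove_all(c("l'économie", "qu'elle", "d'abord", "lorsqu'il"),
               "^(.*qu|[mjldscn])[\u0027\u2019\u2032\u0060]")
# -> "économie" "elle" "abord" "il"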
# once tokens are cleaned, remove stopwords, tokens containing non-Latin characters or digits, and tokens without any letters
df_tokens <- df_tokens[!token %in% stop_words]
df_tokens <- df_tokens[!str_detect(token, "[^\\p{Latin}]")]
# df_tokens <- df_tokens[!str_detect(token, "[\u0370-\u03FF]")]
df_tokens <- df_tokens[!str_detect(token, "^.*\\d+.*$")]
df_tokens <- df_tokens[str_detect(token, "[[:letter:]]")]
#### BIGRAMS ####
# Create bigrams by pairing each token with the next remaining token of the same document
df_tokens <- df_tokens[order(id, token_id)] # ensure correct order
df_tokens[, bigram := ifelse(token_id < shift(token_id, type = "lead"),
                             paste(token, shift(token, type = "lead"), sep = "_"),
                             NA),
          by = .(id)]
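# toy illustration of the shift()-based bigram construction above (hypothetical tokens, not from the corpus)
toy <- data.table(id = "doc1", token = c("taux", "intérêt", "réel"), token_id = 1:3)
toy[, bigram := ifelse(token_id < shift(token_id, type = "lead"),
                       paste(token, shift(token, type = "lead"), sep = "_"),
                       NA),
    by = .(id)]
toy # bigrams: "taux_intérêt", "intérêt_réel", then NA for the last token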
# drop NA bigrams and count occurrences; keep only bigrams appearing more than 20 times within a document
bigram_counts <- df_tokens[!is.na(bigram)]
bigram_counts <- bigram_counts[, .N, by = .(id, token, bigram)]
bigram_counts <- bigram_counts[N > 20]
# Split bigram into word_1 and word_2
bigram_counts[, c("word_1", "word_2") := tstrsplit(bigram, "_", fixed = TRUE)]
# Remove bigrams where either word is a stopword
bigram_counts <- bigram_counts[!(word_1 %in% stop_words | word_2 %in% stop_words)]
# Assign a unique window ID to each bigram row; the context window here is just the bigram itself
bigram_counts[, window_id := .I] # .I is the row number
# Convert to long format (similar to pivot_longer)
bigram_long <- melt(bigram_counts,
                    id.vars = "window_id",
                    measure.vars = c("word_1", "word_2"),
                    variable.name = "rank",
                    value.name = "word") %>%
  as.data.table()
# Calculate PMI values
# Count occurrences of each word
word_prob <- df_tokens[, .N, by = token]
total_tokens <- sum(word_prob$N)
word_prob[, prob := N / total_tokens]
# Count occurrences of each bigram
bigram_prob <- df_tokens[!is.na(bigram), .N, by = bigram]
total_bigrams <- sum(bigram_prob$N)
bigram_prob[, prob := N / total_bigrams]
# Merge word_1 probabilities into bigram table and rename
bigram_counts <- merge(bigram_counts, word_prob[, .(token, prob)], by.x = "word_1", by.y = "token", all.x = TRUE)
setnames(bigram_counts, "prob", "prob_word_1")
# Merge word_2 probabilities into bigram table and rename
bigram_counts <- merge(bigram_counts, word_prob[, .(token, prob)], by.x = "word_2", by.y = "token", all.x = TRUE)
setnames(bigram_counts, "prob", "prob_word_2")
# merge bigram probabilities into bigram table and rename
bigram_counts <- merge(bigram_counts, bigram_prob, by = "bigram")
setnames(bigram_counts, "prob", "prob_bigram")
# compute pointwise mutual information: pmi = log2( P(bigram) / (P(word_1) * P(word_2)) )
bigram_counts[, pmi := log2(prob_bigram / (prob_word_1 * prob_word_2))]
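# worked toy example of the PMI formula above (made-up probabilities):
# if P(word_1) = 0.01, P(word_2) = 0.02 and P(bigram) = 0.001,
# then pmi = log2(0.001 / (0.01 * 0.02)) = log2(5) ≈ 2.32,
# i.e. the two words co-occur five times more often than expected under independence
log2(0.001 / (0.01 * 0.02))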
# keep only bigrams with pmi > 0, i.e. words co-occurring more often than expected under independence
bigram_to_keep <- bigram_counts[pmi > 0]
bigram_to_keep[, keep_bigram := TRUE]
# Replace the two tokens of each kept bigram by the bigram itself in the token list
df_tokens_final <- df_tokens %>%
  left_join(bigram_to_keep, by = c("id", "token", "bigram")) %>%
  mutate(token = if_else(keep_bigram, bigram, token, missing = token),
         token = if_else(lag(keep_bigram), lag(bigram), token, missing = token),
         token_id = if_else(lag(keep_bigram), token_id - 1, token_id, missing = token_id)) %>%
  distinct(id, token_id, token)
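# illustrative check: tokens that were merged into bigrams now contain an underscore
df_tokens_final %>%
  filter(str_detect(token, "_")) %>%
  head()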
# save in term-list format
term_list <- df_tokens_final %>%
  rename(term = token)
saveRDS(term_list, here(private_data_path, "term_list_FULL_TEXT.rds"))
#### PREPROCESSING ####
term_list <- readRDS(here(private_data_path, "term_list_FULL_TEXT.rds"))
# create stm objects with different vocabulary-pruning thresholds
corpora_in_dfm <- list()
corpora_in_stm <- list()
thresholds <- c(1, 5, 10, 15, 20, 30)
for (i in seq_along(thresholds)) {
  # terms appearing in no more than `thresholds[i]` documents are dropped
  term_to_remove <- term_list %>%
    distinct(id, term) %>%
    count(term, name = "frequency") %>%
    filter(frequency <= thresholds[i]) %>%
    distinct(term)
  # remove rare terms
  terms_list_filtered <- term_list %>%
    filter(!term %in% term_to_remove$term)
  # cast the filtered term list into a document-feature matrix
  corpus_in_dfm <- terms_list_filtered %>%
    add_count(term, id) %>%
    cast_dfm(id, term, n)
  threshold <- as.character(thresholds[[i]])
  # dfm object
  corpora_in_dfm[[threshold]] <- corpus_in_dfm
  # stm object with covariates
  metadata <- terms_list_filtered %>%
    distinct(id) %>%
    left_join(df_text, by = "id") %>%
    mutate(year = as.integer(year),
           has_female = as.factor(has_female),
           is_varia = as.factor(is_varia),
           has_female = relevel(has_female, ref = "0"),
           is_varia = relevel(is_varia, ref = "0")) %>%
    select(id, text, authors, title, abstract_fr, year, has_female, is_varia) %>%
    distinct()
  corpus_in_stm <- quanteda::convert(corpus_in_dfm, to = "stm", docvars = metadata)
  corpora_in_stm[[threshold]] <- corpus_in_stm
}
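# quick comparison (illustrative only): vocabulary size kept under each pruning threshold
sapply(corpora_in_dfm, quanteda::nfeat)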
saveRDS(corpora_in_stm, here(private_data_path, "corpora_in_stm_FULL_TEXT.rds"))