Topic Modelling with BERTopic in Python | by Petr Korab | Apr, 2024

Hands-on tutorial on modeling political statements with a topic model

Petr Korab
Towards Data Science
Photo by Harryarts on Freepik

Topic modeling (i.e., topic identification in a corpus of text data) has developed quickly since the Latent Dirichlet Allocation (LDA) model was published. This classic topic model, however, does not capture the relationships between words well because it is based on the statistical concept of a bag of words. The more recent embedding-based models Top2Vec and BERTopic address this drawback by exploiting pre-trained language models to generate topics.
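
The bag-of-words limitation is easy to see in a few lines of plain Python. The sketch below is only an illustration, not part of LDA or BERTopic: the helper `bag_of_words` and the two example sentences are made up for this purpose. It shows that two sentences with opposite meanings become indistinguishable once word order is discarded.

```python
from collections import Counter


def bag_of_words(text: str) -> Counter:
    """Count word occurrences, ignoring word order (hypothetical helper)."""
    return Counter(text.lower().split())


# Two made-up sentences: same words, opposite meaning.
sentence_a = "the senator attacked the president"
sentence_b = "the president attacked the senator"

# The sentences differ, but their bag-of-words representations do not.
print(sentence_a == sentence_b)                              # False
print(bag_of_words(sentence_a) == bag_of_words(sentence_b))  # True
```

Embedding-based models work from the full sentence rather than from word counts, so they can in principle tell such sentences apart.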

In this article, we’ll use Maarten Grootendorst’s BERTopic to identify the terms representing topics in political speech transcripts. It outperforms most traditional and modern topic models in topic modeling metrics on various corpora and has been used in companies, academia (Chagnon, 2024), and the public sector. We’ll explore:

  • how to effectively preprocess data
  • how to create a bigram topic model
  • how to explore the most frequent terms over time.
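
The three steps listed above can be sketched roughly as follows. This is a minimal outline under stated assumptions, not the article's actual code: the helper names `preprocess` and `fit_bigram_topics`, the cleaning rules, and the choice of `nr_bins=10` are all illustrative. It assumes `bertopic` and `scikit-learn` are installed, that `docs` is a list of speech texts, and that `timestamps` is a list of dates or numbers of the same length.

```python
import re


def preprocess(text: str) -> str:
    """Lowercase, keep letters only, collapse whitespace (illustrative rules)."""
    text = text.lower()
    text = re.sub(r"[^a-z\s]", " ", text)  # replace digits and punctuation
    return re.sub(r"\s+", " ", text).strip()


def fit_bigram_topics(docs, timestamps, nr_bins=10):
    """Fit BERTopic with bigram topic terms and compute terms over time.

    Hypothetical helper. The heavy imports are local so that `preprocess`
    stays usable without BERTopic installed.
    """
    from bertopic import BERTopic
    from sklearn.feature_extraction.text import CountVectorizer

    # ngram_range=(2, 2) restricts topic terms to two-word phrases.
    vectorizer = CountVectorizer(ngram_range=(2, 2), stop_words="english")
    topic_model = BERTopic(vectorizer_model=vectorizer)

    cleaned = [preprocess(doc) for doc in docs]
    topics, _ = topic_model.fit_transform(cleaned)

    # Topic representations recomputed per time bin.
    over_time = topic_model.topics_over_time(
        docs=cleaned, timestamps=timestamps, topics=topics, nr_bins=nr_bins
    )
    return topic_model, topics, over_time
```

Fitting needs a reasonably large corpus and downloads a pre-trained embedding model on first use, so a call such as `fit_bigram_topics(speeches, dates)` is best tried on the full dataset rather than a handful of sentences.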

As an example, we’ll use the Empoliticon: Political Speeches-Context & Emotion dataset, released under the…
