Pushing the Limits of the Two-Tower Model | by Samuel Flender | Dec, 2023

Where the assumptions behind the two-tower model break — and how to go beyond

Towards Data Science
(Image created by the author using AI)

Two-tower models are among the most common architectural choices in modern recommender systems. The key idea is to have one tower that learns relevance, and a second, shallow tower that learns observational biases such as position bias.

In this post, we’ll take a closer look at two assumptions behind two-tower models, in particular:

  • the factorization assumption, i.e. the hypothesis that we can simply multiply the probabilities computed by the two towers (or add their logits), and
  • the positional independence assumption, i.e. the hypothesis that the only variable that determines position bias is the position of the item itself, and not the context in which it is impressed.

We’ll see where both of these assumptions break, and how to go beyond these limitations with newer algorithms such as the MixEM model, the Dot Product model, and XPA.
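To make the factorization assumption concrete, here is a minimal sketch (not from the article; all numbers are made up) of the two ways of combining the towers' outputs that the list above mentions: adding logits, and multiplying probabilities.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Hypothetical logits for a single impression:
relevance_logit = 1.2    # main tower: "how relevant is this item?"
position_logit = -0.8    # shallow tower: "how likely was this slot seen?"

# Additive-logit form of the factorization assumption: the click logit
# is simply the sum of the two towers' logits.
p_click_additive = sigmoid(relevance_logit + position_logit)

# Alternative multiplicative form ("multiply the probabilities"):
# P(click) = P(relevant) * P(observed).
p_click_product = sigmoid(relevance_logit) * sigmoid(position_logit)

# The two forms are not numerically identical -- they are two different
# ways of encoding the same independence assumption between the towers.
```

Either way, the position tower only modulates the click probability; at serving time it is dropped, so ranking depends on the relevance tower alone.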

Let’s start with a very brief reminder.

Two-tower models: the story so far

The primary objective for the ranking models in recommender systems is relevance: we want the model to predict the best possible piece of content given the context. Here, context simply means everything that we’ve learned about the user, for example from their previous engagement or search histories, depending on the application.

However, ranking models usually exhibit certain observation biases, that is, the tendency of users to engage more or less with an impression depending on how it was presented to them. The most prominent observation bias is position bias — the tendency of users to engage more with items that are shown first.

The key idea in two-tower models is to train two “towers”, that is, two neural networks, in parallel: the main tower for learning relevance, and…
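The joint training described above can be sketched end-to-end on toy data. This is a deliberately minimal illustration under assumptions of my own (a linear relevance tower, a per-position bias tower, additive logits), not the architectures discussed in the post:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy setup (entirely made up): 4 context/item features, 5 display positions.
n, d, n_pos = 2000, 4, 5
X = rng.normal(size=(n, d))
pos = rng.integers(0, n_pos, size=n)

# Simulated clicks: relevance driven by features, observation bias
# decaying with position.
true_w = np.array([1.0, -2.0, 0.5, 1.5])
logits_true = X @ true_w - 0.6 * pos
y = (rng.random(n) < 1.0 / (1.0 + np.exp(-logits_true))).astype(float)

# Two "towers", kept minimal: a linear relevance tower (w) and a
# per-position bias tower (b), combined by adding their logits.
w = np.zeros(d)
b = np.zeros(n_pos)
lr = 0.5

for _ in range(2000):
    logits = X @ w + b[pos]                  # additive combination
    pred = 1.0 / (1.0 + np.exp(-logits))
    err = pred - y                           # gradient of log loss w.r.t. logits
    w -= lr * (X.T @ err) / n
    b -= lr * np.bincount(pos, weights=err, minlength=n_pos) / n

# At serving time the position tower is dropped: rank by X @ w alone.
serving_scores = X @ w
```

After training, `b` should decrease with position (top slots get the largest bias), while `w` recovers the relevance signal that is actually used for ranking.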
