Where the assumptions behind the two-tower model architecture break — and how to go beyond
Two-tower models are among the most common architectural design choices in modern recommender systems. The key idea is to have one tower that learns relevance and a second, shallow tower that learns observational biases such as position bias.
In this post, we’ll take a closer look at two assumptions behind two-tower models:
- the factorization assumption, i.e. the hypothesis that we can simply multiply the probabilities computed by the two towers (or add their logits), and
- the positional independence assumption, i.e. the hypothesis that the only variable that determines position bias is the position of the item itself, and not the context in which it is impressed.
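To make the two assumptions concrete, here is a minimal Python sketch. It is an illustration under stated assumptions, not a reference implementation: the function names, the example logits, and the per-position bias table are all hypothetical. It shows both variants of the factorization (multiplying the two towers' probabilities, or adding their logits), and it encodes positional independence by making the bias a function of position alone.

```python
import math


def sigmoid(x: float) -> float:
    """Logistic function mapping a logit to a probability."""
    return 1.0 / (1.0 + math.exp(-x))


def click_prob_multiplicative(relevance_logit: float, bias_logit: float) -> float:
    """Factorization, variant 1: multiply the two towers' probabilities."""
    return sigmoid(relevance_logit) * sigmoid(bias_logit)


def click_prob_additive(relevance_logit: float, bias_logit: float) -> float:
    """Factorization, variant 2: add the two towers' logits, then squash."""
    return sigmoid(relevance_logit + bias_logit)


# Positional independence: the bias is looked up by position only and ignores
# the user, the item, and the rest of the page. These values are made up.
POSITION_BIAS_LOGITS = [2.0, 1.0, 0.0, -1.0]

# Hypothetical output of the relevance tower for one (user, item) pair.
relevance_logit = 0.5

for position, bias_logit in enumerate(POSITION_BIAS_LOGITS, start=1):
    p_mul = click_prob_multiplicative(relevance_logit, bias_logit)
    p_add = click_prob_additive(relevance_logit, bias_logit)
    print(f"position {position}: multiplicative={p_mul:.3f}, additive={p_add:.3f}")
```

Note that the two variants are not numerically identical: with both logits at zero, the multiplicative form gives 0.25 while the additive form gives 0.5. What they share is the structural assumption that relevance and bias are computed separately and only combined at the very end.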
We’ll see where both of these assumptions break, and how to go beyond these limitations with newer algorithms such as the MixEM model, the Dot Product model, and XPA.
Let’s start with a very brief reminder.
Two-tower models: the story so far
The primary learning objective for the ranking models in recommender systems is relevance: we want the model to predict the best possible piece of content given the context. Here, context simply means everything that we’ve learned about the user, for example from their previous engagement or search histories, depending on the application.
However, the engagement data that ranking models are trained on usually exhibits certain observational biases, that is, the tendency for users to engage more or less with an impression depending on how it was presented to them. The most prominent observational bias is position bias: the tendency of users to engage more with items that are shown first.
The key idea in two-tower models is to train two “towers”, that is, neural networks, in parallel, the main tower for learning relevance, and…