An analysis of the intuition behind the notion of Key, Query, and Value in the Transformer architecture and why it is used.
Recent years have seen the Transformer architecture make waves in the field of natural language processing (NLP), achieving state-of-the-art results on a variety of tasks including machine translation, language modeling, and text summarization, as well as in other domains of AI such as vision, speech, and reinforcement learning.
Vaswani et al. (2017) first introduced the Transformer in their paper “Attention Is All You Need”, replacing recurrent connections with a self-attention mechanism that lets the model focus selectively on specific portions of input sequences.
In particular, previous sequence models, such as recurrent encoder-decoder models, were limited in their ability to capture long-term dependencies and to parallelize computation. In fact, right before the Transformer paper came out in 2017, state-of-the-art performance on most NLP tasks was obtained by RNNs with an attention mechanism on top, so attention existed before Transformers. By keeping the attention mechanism on its own, generalizing it to multi-head attention, and dropping the RNN part, the Transformer architecture resolves these issues: multiple independent attention heads can process the whole sequence in parallel.
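To make the mechanism we will be discussing concrete, here is a minimal sketch of scaled dot-product attention, the building block that multi-head attention repeats in parallel. This is an illustrative NumPy implementation with toy shapes, not the full Transformer layer (no learned projections, masking, or multiple heads):

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Compute softmax(Q K^T / sqrt(d_k)) V for one attention head."""
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                 # pairwise similarity scores
    scores -= scores.max(axis=-1, keepdims=True)    # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)  # row-wise softmax
    return weights @ V                              # weighted sum of values

# Toy example: 3 tokens, dimension 4 (arbitrary illustrative sizes)
rng = np.random.default_rng(0)
Q = rng.normal(size=(3, 4))
K = rng.normal(size=(3, 4))
V = rng.normal(size=(3, 4))
out = scaled_dot_product_attention(Q, K, V)
print(out.shape)  # (3, 4): one output vector per query token
```

Note that nothing here is recurrent: every output row is computed from all tokens at once, which is what makes the computation parallelizable.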
In this post, we will go over one detail of this architecture, namely the Queries, Keys, and Values, and try to make sense of the intuition behind this part.
Note that this post assumes you are already familiar with some basic concepts in NLP and deep learning, such as embeddings, linear (dense) layers, and in general how a simple neural network works.
First, let’s start understanding what the attention mechanism is trying to achieve. And for the sake of simplicity, let’s start with a simple case of sequential data to understand what problem exactly…