How to efficiently outperform GPT-3.5 and Llama 2 70B
Most recent large language models (LLMs) use very similar neural architectures. For instance, Falcon, Mistral, and Llama 2 all rely on a similar combination of self-attention and MLP modules.
In contrast, Mistral AI, which also created Mistral 7B, just released a new LLM with a significantly different architecture: Mixtral-8x7B, a sparse mixture of 8 expert models.
In total, Mixtral contains 46.7B parameters, yet thanks to its architecture it runs efficiently on consumer hardware. Inference with Mixtral-8x7B is significantly faster than with other models of similar size, while it outperforms them on most tasks.
In this article, I explain what a sparse mixture of experts is and why it is faster for inference than a standard model. Then, we will see how to use and fine-tune Mixtral-8x7B on consumer hardware.
I have implemented a notebook demonstrating QLoRA fine-tuning and inference with Mixtral-8x7B here:
Get the notebook (#32)
A sparse mixture of experts (SMoE) is a type of neural network architecture designed to improve the efficiency and scalability of traditional models. The concept of a mixture of experts was introduced to allow a model to learn different parts of the input space using specialized “expert” sub-networks. In Mixtral, there are 8 expert sub-networks.
Note that the “8x7B” in the name of the model is slightly misleading. The model has a total of 46.7B parameters, which is almost 10B fewer than the 56B that “8x7B” would suggest. Indeed, Mixtral-8x7B is not a 56B-parameter model since several modules, such as the ones for self-attention, are shared across the 8 expert sub-networks.
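To make this concrete, here is a minimal PyTorch sketch of a sparse MoE feed-forward block in which a router sends each token to the top 2 of 8 expert MLPs, as Mixtral does. The class name, hidden sizes, and activation below are illustrative assumptions, not Mixtral's exact implementation:

import torch
import torch.nn as nn
import torch.nn.functional as F

class SparseMoEBlock(nn.Module):
    # Illustrative sparse mixture-of-experts feed-forward block:
    # a linear router scores 8 expert MLPs per token, keeps the top 2,
    # and combines their outputs weighted by the renormalized router scores.
    def __init__(self, hidden_size=1024, ffn_size=4096, num_experts=8, top_k=2):
        super().__init__()
        self.top_k = top_k
        self.router = nn.Linear(hidden_size, num_experts, bias=False)
        self.experts = nn.ModuleList([
            nn.Sequential(
                nn.Linear(hidden_size, ffn_size),
                nn.SiLU(),
                nn.Linear(ffn_size, hidden_size),
            )
            for _ in range(num_experts)
        ])

    def forward(self, x):
        # x: (num_tokens, hidden_size)
        router_logits = self.router(x)  # (num_tokens, num_experts)
        weights, selected = torch.topk(router_logits, self.top_k, dim=-1)
        weights = F.softmax(weights, dim=-1)  # renormalize over the 2 selected experts
        out = torch.zeros_like(x)
        for slot in range(self.top_k):
            for e, expert in enumerate(self.experts):
                mask = selected[:, slot] == e  # tokens routed to expert e in this slot
                if mask.any():
                    out[mask] += weights[mask, slot].unsqueeze(-1) * expert(x[mask])
        return out

# Example: 5 tokens with a hidden size of 1024
tokens = torch.randn(5, 1024)
print(SparseMoEBlock()(tokens).shape)  # torch.Size([5, 1024])

Only the 2 selected expert MLPs run for each token, which is why inference uses far fewer FLOPs than a dense 46.7B-parameter model, even though all the expert weights must still be kept in memory.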
If you load and print the model with Transformers, the structure of the model is easier to understand:
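A minimal loading sketch looks like the following; it assumes you have enough memory to hold the full checkpoint, whereas on consumer hardware you would typically add a quantization configuration and a device map:

from transformers import AutoModelForCausalLM

# Load Mixtral-8x7B and print its module structure.
# Note: the full checkpoint is very large; on consumer GPUs you would
# typically pass a 4-bit quantization config and device_map="auto".
model = AutoModelForCausalLM.from_pretrained("mistralai/Mixtral-8x7B-v0.1")
print(model)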
MixtralForCausalLM(…