Mixtral-8x7B: Understanding and Running the Sparse Mixture of Experts

How to efficiently outperform GPT-3.5 and Llama 2 70B

Benjamin Marie
Towards Data Science

Most of the recent large language models (LLMs) use very similar neural architectures. For instance, the Falcon, Mistral, and Llama 2 models use a similar combination of self-attention and MLP modules.

In December 2023, Mistral AI, which also created Mistral 7B, released a new LLM with a significantly different architecture: Mixtral-8x7B, a sparse mixture of 8 expert models.

In total, Mixtral contains 46.7B parameters. Yet, thanks to its architecture, Mixtral-8x7B can run efficiently on consumer hardware. Inference with Mixtral-8x7B is indeed significantly faster than with other models of similar size, while outperforming them on most benchmarks.

In this article, I explain what a sparse mixture of experts is and why it is faster for inference than a standard dense model. Then, we will see how to use and fine-tune Mixtral-8x7B on consumer hardware.

I have implemented a notebook demonstrating QLoRA fine-tuning and inference with Mixtral-8x7B here:

Get the notebook (#32)


A sparse mixture of experts (SMoE) is a type of neural network architecture designed to improve the efficiency and scalability of traditional dense models. The concept of a mixture of experts was introduced to allow a model to learn different parts of the input space using specialized “expert” sub-networks. In Mixtral, there are 8 expert sub-networks.
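To make the idea concrete, below is a minimal, simplified sketch of such a layer in PyTorch: a small router scores the 8 experts for every token, keeps only the top 2, and combines their outputs. This is only an illustration of the mechanism, not Mixtral's actual implementation (Mixtral's experts are gated SwiGLU MLPs, and the class name and dimensions below are illustrative assumptions):

import torch
import torch.nn as nn
import torch.nn.functional as F

class SparseMoELayer(nn.Module):
    # Illustrative sparse MoE feed-forward block with top-2 routing
    def __init__(self, hidden_size=4096, ffn_size=14336, num_experts=8, top_k=2):
        super().__init__()
        self.top_k = top_k
        # The router produces one score per expert for each token
        self.router = nn.Linear(hidden_size, num_experts, bias=False)
        # 8 expert MLPs (simplified: Mixtral's experts are gated SwiGLU MLPs)
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(hidden_size, ffn_size), nn.SiLU(), nn.Linear(ffn_size, hidden_size))
            for _ in range(num_experts)
        ])

    def forward(self, x):  # x: (num_tokens, hidden_size)
        logits = self.router(x)                         # (num_tokens, num_experts)
        weights, idx = torch.topk(logits, self.top_k)   # keep the 2 highest-scoring experts per token
        weights = F.softmax(weights, dim=-1)            # renormalize over the selected experts
        out = torch.zeros_like(x)
        for k in range(self.top_k):
            for e, expert in enumerate(self.experts):
                mask = idx[:, k] == e                   # tokens whose k-th choice is expert e
                if mask.any():
                    out[mask] += weights[mask, k, None] * expert(x[mask])
        return out

# Toy usage with small dimensions, just to check the shapes:
layer = SparseMoELayer(hidden_size=16, ffn_size=32)
print(layer(torch.randn(5, 16)).shape)  # torch.Size([5, 16])

Because only 2 of the 8 experts are evaluated for each token, the layer computes far fewer FLOPs per token than a dense layer holding the same total number of parameters, which is why inference remains fast despite the 46.7B parameters.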

Note that the “8x7B” in the name of the model is slightly misleading. The model has a total of 46.7B parameters, which is almost 10B fewer than what 8x7B would suggest. Indeed, Mixtral-8x7B is not a 56B parameter model, since several modules, such as the ones for self-attention, are shared among the 8 expert sub-networks.
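A quick back-of-the-envelope calculation shows where the 46.7B figure comes from. The numbers below assume the published Mixtral configuration (32 layers, hidden size 4,096, expert feed-forward size 14,336, 8 experts, 8 key-value heads, 32,000-token vocabulary) and ignore the small router and normalization weights:

# Approximate parameter count for Mixtral-8x7B (rounded, for intuition only)
layers, hidden, ffn, experts, vocab = 32, 4096, 14336, 8, 32000

expert_params = layers * experts * 3 * hidden * ffn                # 3 projection matrices per expert MLP
attn_params = layers * (2 * hidden * hidden + 2 * hidden * 1024)   # q/o are full-size; k/v use 8 KV heads x 128 dims (GQA)
embed_params = 2 * vocab * hidden                                  # input embeddings + LM head

total = expert_params + attn_params + embed_params
print(f"experts: {expert_params/1e9:.1f}B, shared: {(attn_params + embed_params)/1e9:.1f}B, total: {total/1e9:.1f}B")
# -> experts: 45.1B, shared: 1.6B, total: 46.7B

In other words, roughly 45B parameters sit in the expert MLPs, while the shared self-attention layers and embeddings account for only about 1.6B, which is why the total stays well below 8x7B = 56B.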

If you load and print the model with Transformers, the structure of the model is easier to understand:

MixtralForCausalLM(…
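For reference, here is a minimal sketch of how such a printout can be obtained. It assumes the mistralai/Mixtral-8x7B-v0.1 checkpoint from the Hugging Face Hub and loads it with 4-bit quantization (bitsandbytes) so that it fits on consumer hardware:

import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

# Quantize to 4-bit on the fly to reduce memory usage
bnb_config = BitsAndBytesConfig(load_in_4bit=True, bnb_4bit_compute_dtype=torch.float16)

model = AutoModelForCausalLM.from_pretrained(
    "mistralai/Mixtral-8x7B-v0.1",
    quantization_config=bnb_config,
    device_map="auto",
)
print(model)  # prints the MixtralForCausalLM module tree, including the per-layer MoE blocks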
