Mixtral-8x7B: Understanding and Running the Sparse Mixture of Experts

How to efficiently outperform GPT-3.5 and Llama 2 70B

Benjamin Marie
Towards Data Science

Most of the recent large language models (LLMs) use very similar neural architectures. For instance, the Falcon, Mistral, and Llama 2 models use a similar combination of self-attention and MLP modules.

In December 2023, Mistral AI, which also created Mistral 7B, released a new LLM with a significantly different architecture: Mixtral-8x7B, a sparse mixture of 8 expert models.

In total, Mixtral contains 46.7B parameters. Yet, thanks to its architecture, Mixtral-8x7B can run efficiently on consumer hardware. Inference with Mixtral-8x7B is indeed significantly faster than with other models of similar size, while outperforming them on most benchmarks.

In this article, I explain what a sparse mixture of experts is and why it is faster for inference than a standard dense model. Then, we will see how to use and fine-tune Mixtral-8x7B on consumer hardware.

I have implemented a notebook demonstrating QLoRA fine-tuning and inference with Mixtral-8x7B here:

Get the notebook (#32)


A sparse mixture of experts (SMoE) is a type of neural network architecture designed to improve the efficiency and scalability of traditional dense models. The concept of a mixture of experts was introduced to allow a model to learn different parts of the input space using specialized “expert” sub-networks. In Mixtral, there are 8 expert sub-networks.
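To make the idea concrete, below is a minimal, simplified sketch of such a layer in PyTorch: a small router scores the 8 experts for every token, keeps only the top 2, and combines their outputs. This is only an illustration of the mechanism, not Mixtral's actual implementation (Mixtral's experts are gated SwiGLU MLPs, and the class name and dimensions below are illustrative assumptions):

import torch
import torch.nn as nn
import torch.nn.functional as F

class SparseMoELayer(nn.Module):
    # Illustrative sparse MoE feed-forward block with top-2 routing
    def __init__(self, hidden_size=4096, ffn_size=14336, num_experts=8, top_k=2):
        super().__init__()
        self.top_k = top_k
        # The router produces one score per expert for each token
        self.router = nn.Linear(hidden_size, num_experts, bias=False)
        # 8 expert MLPs (simplified: Mixtral's experts are gated SwiGLU MLPs)
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(hidden_size, ffn_size), nn.SiLU(), nn.Linear(ffn_size, hidden_size))
            for _ in range(num_experts)
        ])

    def forward(self, x):  # x: (num_tokens, hidden_size)
        logits = self.router(x)                         # (num_tokens, num_experts)
        weights, idx = torch.topk(logits, self.top_k)   # keep the 2 highest-scoring experts per token
        weights = F.softmax(weights, dim=-1)            # renormalize over the selected experts
        out = torch.zeros_like(x)
        for k in range(self.top_k):
            for e, expert in enumerate(self.experts):
                mask = idx[:, k] == e                   # tokens whose k-th choice is expert e
                if mask.any():
                    out[mask] += weights[mask, k, None] * expert(x[mask])
        return out

# Toy usage with small dimensions, just to check the shapes:
layer = SparseMoELayer(hidden_size=16, ffn_size=32)
print(layer(torch.randn(5, 16)).shape)  # torch.Size([5, 16])

Because only 2 of the 8 experts are evaluated for each token, the layer computes far fewer FLOPs per token than a dense layer holding the same total number of parameters, which is why inference remains fast despite the 46.7B parameters.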

Note that the “8x7B” in the name of the model is slightly misleading. The model has a total of 46.7B parameters, which is almost 10B fewer than what 8x7B would suggest. Indeed, Mixtral-8x7B is not a 56B parameter model, since several modules, such as the ones for self-attention, are shared among the 8 expert sub-networks.
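A quick back-of-the-envelope calculation shows where the 46.7B figure comes from. The numbers below assume the published Mixtral configuration (32 layers, hidden size 4,096, expert feed-forward size 14,336, 8 experts, 8 key-value heads, 32,000-token vocabulary) and ignore the small router and normalization weights:

# Approximate parameter count for Mixtral-8x7B (rounded, for intuition only)
layers, hidden, ffn, experts, vocab = 32, 4096, 14336, 8, 32000

expert_params = layers * experts * 3 * hidden * ffn                # 3 projection matrices per expert MLP
attn_params = layers * (2 * hidden * hidden + 2 * hidden * 1024)   # q/o are full-size; k/v use 8 KV heads x 128 dims (GQA)
embed_params = 2 * vocab * hidden                                  # input embeddings + LM head

total = expert_params + attn_params + embed_params
print(f"experts: {expert_params/1e9:.1f}B, shared: {(attn_params + embed_params)/1e9:.1f}B, total: {total/1e9:.1f}B")
# -> experts: 45.1B, shared: 1.6B, total: 46.7B

In other words, roughly 45B parameters sit in the expert MLPs, while the shared self-attention layers and embeddings account for only about 1.6B, which is why the total stays well below 8x7B = 56B.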

If you load and print the model with Transformers, the structure of the model is easier to understand:

MixtralForCausalLM(…
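For reference, here is a minimal sketch of how such a printout can be obtained. It assumes the mistralai/Mixtral-8x7B-v0.1 checkpoint from the Hugging Face Hub and loads it with 4-bit quantization (bitsandbytes) so that it fits on consumer hardware:

import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

# Quantize to 4-bit on the fly to reduce memory usage
bnb_config = BitsAndBytesConfig(load_in_4bit=True, bnb_4bit_compute_dtype=torch.float16)

model = AutoModelForCausalLM.from_pretrained(
    "mistralai/Mixtral-8x7B-v0.1",
    quantization_config=bnb_config,
    device_map="auto",
)
print(model)  # prints the MixtralForCausalLM module tree, including the per-layer MoE blocks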
