Which Quantization Method is Right for You? (GPTQ vs. GGUF vs. AWQ) | by Maarten Grootendorst | Nov, 2023

Exploring Pre-Quantized Large Language Models

Maarten Grootendorst
Towards Data Science

Throughout the last year, we have seen the Wild West of Large Language Models (LLMs). The pace at which new technology and models were released was astounding! As a result, we have many different standards and ways of working with LLMs.

In this article, we will explore one such topic, namely loading your local LLM through several (quantization) standards. With sharding, quantization, and different saving and compression strategies, it is not easy to know which method is suitable for you.

Throughout the examples, we will use Zephyr 7B, a fine-tuned variant of Mistral 7B that was trained with Direct Preference Optimization (DPO).

🔥 TIP: After each example of loading an LLM, it is advised to restart your notebook to prevent OutOfMemory errors. Loading multiple LLMs requires significant RAM/VRAM. You can reset memory by deleting the models and resetting your cache like so:

# Delete any models previously created
del model, tokenizer, pipe

# Empty VRAM cache
import torch
torch.cuda.empty_cache()
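
A plain `del` raises a `NameError` when one of those variables was never created in the current session. As a minimal sketch, the same cleanup could be wrapped in a small helper. Note that `free_memory` and its default variable names are hypothetical and not part of any library; the helper only clears the CUDA cache when PyTorch and a GPU are actually available.

```python
import gc


def free_memory(namespace, names=("model", "tokenizer", "pipe")):
    """Remove the given variables from `namespace` and release cached VRAM.

    Hypothetical helper: the default names are assumptions about what
    your notebook defined.
    """
    removed = []
    for name in names:
        if name in namespace:
            del namespace[name]
            removed.append(name)

    # Collect now-unreferenced objects so their memory can be released
    gc.collect()

    # Clear the CUDA cache only if PyTorch and a GPU are present
    try:
        import torch
        if torch.cuda.is_available():
            torch.cuda.empty_cache()
    except ImportError:
        pass

    return removed
```

In a notebook, calling `free_memory(globals())` removes whichever of the three names exist and returns the list of names it deleted.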

You can also follow along with the Colab Notebook to make sure everything works as intended.

The most straightforward, and vanilla, way of loading your LLM is through 🤗 Transformers. HuggingFace has created a large suite of packages that allow us to do amazing things with LLMs!

We will start by installing HuggingFace, among others, from its main branch to support newer models:

# Latest HF transformers version for Mistral-like models
pip install git+https://github.com/huggingface/transformers.git
pip install accelerate bitsandbytes xformers
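
Before moving on, it can help to confirm which of these packages actually ended up in your environment. The following is a rough sketch using only the standard library; `installed_versions` is a hypothetical helper name, and the package list simply mirrors the install commands above.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_versions(packages):
    """Map each package name to its installed version, or None if missing."""
    found = {}
    for name in packages:
        try:
            found[name] = version(name)
        except PackageNotFoundError:
            found[name] = None
    return found


# Package names taken from the pip commands above
print(installed_versions(["transformers", "accelerate", "bitsandbytes", "xformers"]))
```

A value of `None` indicates that the package is not installed in the environment your notebook is running in.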

After installation, we can use the following pipeline to easily load our LLM:

from torch import bfloat16
from transformers import pipeline

# Load in your LLM without any compression tricks
pipe = pipeline(
    "text-generation",
    model="HuggingFaceH4/zephyr-7b-",
    torch_dtype=bfloat16,
    device_map="auto"
)
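
To get a feel for why loading without any compression tricks is demanding, here is a back-of-the-envelope estimate of the memory needed for the weights alone. It is a sketch under stated assumptions: the parameter count is rounded to 7 billion, `weight_memory_gb` is a hypothetical helper, and activations, the KV cache, and framework overhead are all ignored.

```python
def weight_memory_gb(n_params, bits_per_param):
    """Approximate memory for the weights alone, in gigabytes (1e9 bytes)."""
    return n_params * bits_per_param / 8 / 1e9


n_params = 7e9  # assumption: roughly 7 billion parameters

# bfloat16 stores each parameter in 16 bits
print(weight_memory_gb(n_params, 16))  # 14.0

# For comparison, a hypothetical 4-bit representation of the same weights
print(weight_memory_gb(n_params, 4))   # 3.5
```

Under these assumptions, the bfloat16 weights alone take about 14 GB, which is why running several of these examples in one session can quickly exhaust RAM/VRAM.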
