One of the common challenges faced with the **deployment **of **large language models (LLMs)** while achieving **low-latency completions** (**inferences**) is the **size of the LLMs**. The **size of LLM** throws challenges in terms of **compute**, **storage**, and **memory requirements**. And, the solution to this is to **optimize the LLM **deployment by taking advantage of model compression techniques that aim to reduce the size of the model. In this blog, we will look into three different optimization techniques namely **pruning**, **quantization,** and **distillation** along with their examples. These techniques help model load quickly while enabling reduced latency during LLM inference. They reduce the resource requirements for the compute, storage, and memory. You might want to check out the book **Generative AI on AWS** to learn how to apply this technique on the **AWS cloud**.

The following diagram represents different optimization techniques for LLM inference such as Pruning, Quantization, and Distillation.

Let’s learn about the LLM inference optimization techniques in detail in the following sections with the help of examples.

## Pruning – Eliminate Parameters

Pruning is a technique that aims to reduce the model size of LLM by removing the weights that contribute minimally to the output. This is based on the observation that not all parameters used in LLM are equally important for making predictions. **By identifying and eliminating these low-impact parameters, pruning reduces the model’s size and the number of computations required during inference**, leading to faster and more efficient performance. The following diagram represents the pruned LLM after some weights have been removed.

There are **various strategies for pruning**, including **magnitude-based pruning** (**unstructured pruning**), where weights with the smallest absolute values are pruned, and **structured pruning, which removes entire channels or filters** based on their importance. Pruning can be applied iteratively, with cycles of pruning followed by fine-tuning to recover any lost performance, resulting in a compact model that retains much of the original model’s accuracy.

The above-mentioned approaches (**structured and unstructured pruning**) require **retraining**; however, there are post-training pruning methods as well. These methods are typically referred to as **one-shot pruning methods**. These methods can do pruning **without retraining**. One such method of post-training pruning is called SparseGPT. This technique has been found to achieve pruning of magnitude to at least **50% sparsity in one-shot**, without any retraining, at minimal loss of accuracy. The following code sample from the SparseGPT pruning library demonstrates how pruning is achieved for the LLaMA and Llama 2 models.

```
target_sparsity_ratio = 0.5
# Prune each layer using the given sparsity ratio
for layer_name in layers:
gpts[layer_name].fasterprune(
target_sparsity_ratio,
)
gpts[layer_name].free() # free the zero'd out memory
```

## Quantization – Model Weights Precision Conversion

In the Quantization technique, the model’s weights are converted from high precision (e.g., 32-bit) to lower precision (e.g., 16-bit). This not only reduces the model’s memory footprint but also the compute requirements by working with a smaller number of representations. With large LLMs, it’s common to reduce the precision further to 8 bits to increase inference performance. The popular method of quantization is reducing the precision of a model’s weights and activations after it has already been trained, as opposed to applying quantization during the training process itself. This method is also called **post-training quantization (PTQ)**. The PTQ method is a popular option for optimizing models for inference because it doesn’t require retraining the model from scratch with quantization-aware techniques.

There are a variety of **post-training quantization methods**, including GPT post-training quantization (GPTQ). Check out this paper for the details – GPTQ: Accurate Post-Training Quantization for Generative Pre-trained Transformers.

## Distillation – Statistical Method for Training Smaller Model

**Distillation **is an LLM model optimization technique for inference that helps reduce the model size thereby reducing the number of computations. It **uses statistical methods to train a smaller student model on a larger teacher model**. The result is a student model that retains a high percentage of the teacher’s model accuracy but uses a much smaller number of parameters. The student model is then deployed for inference.

The teacher model is often a generative pre-trained / foundation or a fine-tuned LLM. During the distillation training process, the student model learns to statistically replicate the behavior of the teacher model. Both the teacher and student models generate completions from a prompt-based training dataset. A loss function is calculated by comparing the two completions and calculating the KL divergence between the teacher and student output distributions. The loss is then minimized during the distillation process using backpropagation to improve the student model’s ability to match the teacher model’s predicted next-token probability distribution.

A popular distilled student model is DistilBERT from Hugging Face. DistilBERT was trained from the larger BERT teacher model and is an order of magnitude smaller than BERT, yet it retains approximately 97% of the accuracy of the original BERT model.

## Conclusion

Each of these optimization techniques—**pruning**, **quantization**, and **distillation**—offers a pathway to **optimizing LLMs for inference**, making them more accessible for deployment in resource-constrained environments. The choice of technique(s) depends on the specific requirements and constraints of the deployment scenario, such as the acceptable trade-off between accuracy and computational efficiency, the hardware available for inference, and the specific tasks the LLM is being used for. Often, a combination of these techniques is employed to achieve an optimal balance.

- Invoke Python ML Models from Other Applications – Examples - September 18, 2024
- Principal Component Analysis (PCA) & Feature Extraction – Examples - September 17, 2024
- Content-based Recommender System: Python Example - September 17, 2024

## Leave a Reply