Efficient Deep Learning: Unleashing the Power of Model Compression


Accelerate model inference speed in production

Marcello Politi

Introduction

When a Machine Learning model is deployed into production, there are often requirements to be met that are not taken into account in the prototyping phase of the model. For example, the model in production will have to handle lots of requests from different users running the product. So you will want to optimize, for instance, latency and/or throughput.

  • Latency: the time it takes for a task to get done, like how long it takes to load a webpage after you click a link. It’s the waiting time between starting something and seeing the result.
  • Throughput: how many requests a system can handle in a certain amount of time (a rough way to measure both is sketched right after this list).
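As a rough illustration of the difference, here is a minimal sketch in plain Python showing how you might measure both numbers; predict_fn, batch, and n_runs are just placeholders for your own serving code.

```python
import time
import numpy as np

def measure_latency_and_throughput(predict_fn, batch, n_runs=100):
    """Rough latency/throughput estimate for a prediction function."""
    predict_fn(batch)  # warm-up run so one-time costs don't skew the timings

    latencies = []
    for _ in range(n_runs):
        start = time.perf_counter()
        predict_fn(batch)
        latencies.append(time.perf_counter() - start)

    avg_latency = float(np.mean(latencies))   # seconds per batch of requests
    throughput = len(batch) / avg_latency     # samples served per second
    return avg_latency, throughput
```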

This means that the Machine Learning model has to be very fast at making its predictions, and there are various techniques for increasing the speed of model inference. Let’s look at the most important ones in this article.

There are techniques that aim to make models smaller, which is why they are called model compression techniques, while others focus on making models faster at inference and thus fall under the field of model optimization.
But often making models smaller also helps with inference speed, so the line separating these two fields of study is very blurred.

Low Rank Factorization

This is the first method we will look at, and it is being studied a lot; in fact, many papers concerning it have come out recently.

The basic idea is to replace the matrices of a neural network (the matrices representing the layers of the network) with matrices that have a lower dimensionality, although it would be more correct to talk about tensors, because we can often have matrices of more than 2 dimensions. In this way we will have fewer network parameters and faster inference.
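As a minimal sketch of this idea, assuming PyTorch and using a truncated SVD to pick the low-rank factors (the function name, the chosen rank, and the layer sizes below are purely illustrative):

```python
import torch
import torch.nn as nn

def factorize_linear(layer: nn.Linear, rank: int) -> nn.Sequential:
    """Approximate a Linear layer's weight W (out x in) with two smaller
    matrices of rank `rank`, using a truncated SVD: W ≈ (U * S) @ Vt."""
    W = layer.weight.data                          # shape: (out_features, in_features)
    U, S, Vt = torch.linalg.svd(W, full_matrices=False)
    U, S, Vt = U[:, :rank], S[:rank], Vt[:rank, :]

    # First layer maps in_features -> rank, second maps rank -> out_features
    first = nn.Linear(layer.in_features, rank, bias=False)
    second = nn.Linear(rank, layer.out_features, bias=layer.bias is not None)
    first.weight.data = Vt                         # (rank, in_features)
    second.weight.data = U * S                     # (out_features, rank)
    if layer.bias is not None:
        second.bias.data = layer.bias.data
    return nn.Sequential(first, second)

# Example: a 1024x1024 layer (~1M parameters) becomes a rank-64 pair (~131k parameters)
layer = nn.Linear(1024, 1024)
compressed = factorize_linear(layer, rank=64)
```

How good the approximation is depends on how quickly the singular values of the original weight matrix decay, which is why the rank is usually tuned (or the factorized model fine-tuned) to recover accuracy.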

A trivial case in a CNN is replacing 3×3 convolutions with 1×1 convolutions. Such techniques are used by networks such as SqueezeNet.
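To make the size difference concrete, here is a hedged PyTorch sketch comparing the two kernel sizes and a simplified SqueezeNet-style "fire" block (the channel numbers are illustrative, not the exact configuration from the paper):

```python
import torch
import torch.nn as nn

# A 3x3 convolution vs a 1x1 convolution over the same 256 channels:
conv3x3 = nn.Conv2d(256, 256, kernel_size=3, padding=1)  # 256*256*3*3 ≈ 590k weights
conv1x1 = nn.Conv2d(256, 256, kernel_size=1)             # 256*256*1*1 ≈ 65k weights

# SqueezeNet-style "fire" block: squeeze the channels with a 1x1 convolution,
# then expand with a mix of cheap 1x1 and a few 3x3 convolutions.
class Fire(nn.Module):
    def __init__(self, in_ch, squeeze_ch, expand_ch):
        super().__init__()
        self.squeeze = nn.Conv2d(in_ch, squeeze_ch, kernel_size=1)
        self.expand1x1 = nn.Conv2d(squeeze_ch, expand_ch, kernel_size=1)
        self.expand3x3 = nn.Conv2d(squeeze_ch, expand_ch, kernel_size=3, padding=1)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        x = self.relu(self.squeeze(x))
        return torch.cat([self.relu(self.expand1x1(x)),
                          self.relu(self.expand3x3(x))], dim=1)
```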
