Introduction
When a Machine Learning model is deployed to production, it often has to meet requirements that were not considered during the prototyping phase. For example, the model in production will have to handle many requests from different users of the product, so you will typically want to optimize for latency and/or throughput.
- Latency: the time it takes for a task to complete, like how long it takes a webpage to load after you click a link. It is the waiting time between starting something and seeing the result.
- Throughput: the number of requests a system can handle in a given amount of time.
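To make the difference concrete, here is a minimal sketch that measures both quantities for a hypothetical `predict` function (the function and the timings are placeholders, not a real model):

```python
import time

def predict(x):
    # Hypothetical stand-in for a deployed model's inference call.
    time.sleep(0.01)  # pretend inference takes ~10 ms
    return x * 2

requests = list(range(100))

start = time.perf_counter()
latencies = []
for r in requests:
    t0 = time.perf_counter()
    predict(r)
    latencies.append(time.perf_counter() - t0)
total = time.perf_counter() - start

print(f"avg latency: {sum(latencies) / len(latencies) * 1000:.1f} ms")
print(f"throughput:  {len(requests) / total:.1f} requests/s")
```

Latency is measured per request, while throughput is a property of the whole system over a window of time; batching, for instance, can raise throughput while making individual latencies worse.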
This means the Machine Learning model has to be very fast at making its predictions. There are various techniques for increasing the speed of model inference, and in this article we will look at the most important ones.
Some techniques aim to make models smaller, which is why they are called model compression techniques, while others focus on making models faster at inference and thus fall under the field of model optimization.
But making models smaller often also improves inference speed, so the line separating these two fields of study is quite blurred.
Low Rank Factorization
This is the first method we will look at, and it is an active area of research: many papers about it have come out recently.
The basic idea is to replace the matrices of a neural network (the matrices representing the layers of the network) with lower-rank factors, although it would be more correct to talk about tensors, because the weights often have more than 2 dimensions. For example, an m×n weight matrix can be approximated by the product of an m×r matrix and an r×n matrix; for small r this has far fewer parameters. In this way we get a smaller network and faster inference.
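As a concrete sketch (the layer sizes and the rank here are made-up assumptions, and the layer is a plain PyTorch linear layer), one way to apply this is to take the SVD of a weight matrix, keep only the top-r singular values, and replace the original layer with two smaller ones:

```python
import torch
import torch.nn as nn

# Hypothetical example: factorize the weight of a 1024x1024 linear layer.
layer = nn.Linear(1024, 1024, bias=False)
W = layer.weight.data              # shape (1024, 1024) -> ~1.05M parameters

rank = 64                          # chosen rank, a tunable assumption
U, S, Vh = torch.linalg.svd(W, full_matrices=False)
A = U[:, :rank] * S[:rank]         # (1024, 64)
B = Vh[:rank, :]                   # (64, 1024)

# Replace the single layer with two smaller ones so that A @ B ~ W.
factored = nn.Sequential(
    nn.Linear(1024, rank, bias=False),
    nn.Linear(rank, 1024, bias=False),
)
factored[0].weight.data = B
factored[1].weight.data = A

x = torch.randn(1, 1024)
# Approximation error: small only if W is genuinely close to low rank
# (trained weights are often closer to this than a random init).
print((layer(x) - factored(x)).abs().max())
print(W.numel(), "->", A.numel() + B.numel())  # 1,048,576 -> 131,072
```

In practice the rank is a trade-off: lower ranks mean fewer parameters and faster matrix multiplications, but a coarser approximation of the original weights, which is why factorized models are usually fine-tuned afterwards to recover accuracy.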
A simple case in a CNN is replacing 3×3 convolutions with 1×1 convolutions. Such techniques are used by networks such as SqueezeNet.
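SqueezeNet's fire module actually combines 1×1 "squeeze" convolutions with parallel 1×1 and 3×3 "expand" convolutions; the following is a simplified sketch (the channel counts are made up) just to show how introducing 1×1 convolutions cuts the parameter count compared with a single large 3×3 convolution:

```python
import torch.nn as nn

# Hypothetical sizes: 256 input channels, 256 output channels.
conv3x3 = nn.Conv2d(256, 256, kernel_size=3, padding=1, bias=False)

# SqueezeNet-style idea: "squeeze" the channels with a cheap 1x1 convolution
# before a much smaller 3x3 "expand", instead of one big 3x3 block.
squeeze_expand = nn.Sequential(
    nn.Conv2d(256, 32, kernel_size=1, bias=False),             # squeeze
    nn.Conv2d(32, 256, kernel_size=3, padding=1, bias=False),  # expand
)

count = lambda m: sum(p.numel() for p in m.parameters())
print(count(conv3x3))         # 256*256*3*3 = 589,824 parameters
print(count(squeeze_expand))  # 256*32 + 32*256*3*3 = 81,920 parameters
```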