Pros and cons of different techniques
More and more companies are actively using artificial intelligence (AI) in their business, and, slowly but surely, more models are being brought into production. When making the step towards production, inference time starts to play an important role. When a model is external user-facing, you typically want inference times in the millisecond range, and no longer than a few seconds. Google recommends staying under 1.3 seconds to preserve a feeling of responsiveness. In practice, a lot of machine learning (ML) models take much longer. For batch processes it is fine if the model takes hours, but anything that needs to be served in real time needs to be quick enough to be worth deploying. Not to mention that longer inference times mean higher costs if you are running your models on cloud hardware!
So how do you speed up your inference time? For the purposes of this article, we define inference time as the time between sending data to the model's API and receiving the output back. There are many different ways to optimize your inference time, each with their own advantages and disadvantages, so let's walk through the ones most often applied in practice.
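Before optimizing anything, it helps to measure where you stand. The sketch below is a minimal, hypothetical helper (the function name and the stand-in workload are made up for illustration) that times repeated calls to an inference function and reports the mean and 95th-percentile latency in milliseconds:

```python
import statistics
import time

def measure_latency(infer_fn, payload, runs=100):
    """Time repeated calls to an inference function; report summary stats in ms."""
    samples = []
    for _ in range(runs):
        start = time.perf_counter()
        infer_fn(payload)
        samples.append((time.perf_counter() - start) * 1000.0)
    samples.sort()
    return {
        "mean_ms": statistics.mean(samples),
        "p95_ms": samples[int(0.95 * (len(samples) - 1))],
    }

# Stand-in for a real model call (e.g. an HTTP request to your model's API):
stats = measure_latency(lambda x: sum(v * v for v in x), list(range(1000)), runs=50)
print(f"mean: {stats['mean_ms']:.3f} ms, p95: {stats['p95_ms']:.3f} ms")
```

In a real setup, `infer_fn` would wrap the actual request to your deployed model, so the measurement includes serialization and network overhead, matching the definition of inference time above.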
Converting to ONNX
You can use ONNX to make your models faster, but what exactly is ONNX? ONNX stands for "Open Neural Network Exchange" and is an open representation format for machine learning models. It allows for portability: an ONNX model can run anywhere. Popular tools like PyTorch and TensorFlow can import and export ONNX models. This is a great feature on its own, but the added benefit is that you can choose the runtime. This means you can optimize execution for your purposes, based on the model's needs and the way it is run.
At UbiOps, we ran a test with a TensorFlow model which became 200% faster after conversion to ONNX! This speed increase even made it possible to run the model on a simple CPU, instead of an expensive GPU. You can check out the full article for all the details and a code example.
Pros:
- Easy to work with
- Same code no matter what hardware you want to work with
- Great performance gains

Cons:
- Does not work with obscure or custom-made ML libraries
Using TensorRT
When simple CPUs aren't fast enough, GPUs come into play. GPUs can compute certain workloads much faster than any regular processor ever could, but even then it's important to optimize your code to get the most out of that GPU. TensorRT is an NVIDIA framework that can help you with that – as long as you're using NVIDIA GPUs.
NVIDIA TensorRT is an SDK for high-performance deep learning inference built on top of CUDA. It is able to optimize ML models on many different levels, from model data type quantization to GPU memory optimizations.
These optimization techniques do, however, come with a cost: reduced accuracy. That said, for many applications the performance increase hugely outweighs the loss in accuracy – but that depends on your use case.
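To make the accuracy trade-off concrete, here is a small, self-contained sketch of symmetric INT8 quantization – the kind of data-type reduction TensorRT can apply – showing the rounding error it introduces. The helper names and example weights are made up for illustration:

```python
def quantize_int8(values):
    """Symmetric linear quantization of floats to signed 8-bit integers."""
    scale = max(abs(v) for v in values) / 127.0  # map the largest magnitude to 127
    q = [max(-128, min(127, round(v / scale))) for v in values]
    return q, scale

def dequantize(q, scale):
    return [x * scale for x in q]

weights = [0.913, -0.442, 0.0037, 0.251, -0.778]
q, scale = quantize_int8(weights)
restored = dequantize(q, scale)

# The round trip is close, but not exact: this gap is the accuracy cost
max_err = max(abs(w - r) for w, r in zip(weights, restored))
print(q)        # small integers instead of 32-bit floats
print(max_err)  # for in-range values, at most scale / 2
```

Storing and multiplying 8-bit integers instead of 32-bit floats is what buys the speed; the per-weight error of up to half a quantization step is what it costs.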
TensorRT was behind NVIDIA's wins across all performance tests in the industry-standard MLPerf Inference benchmark. It works well for computer vision, automatic speech recognition, natural language understanding (BERT), text-to-speech, and recommender systems.
At UbiOps, we ran a benchmark comparing the ResNet152-v2 ONNX model on two types of NVIDIA GPUs, with and without TensorRT. We made 1,000 classifications with the model and plotted the average start-up time and the actual inference time (see below).
Even though TensorRT significantly increased the cold-start time of the model, the actual inference time was 2–3 times as fast. For the full details of this benchmark, check out the blog post.
Pros:
- Good performance gains
- Works with all major ML frameworks

Cons:
- Only works with NVIDIA GPUs
- Increases start-up times considerably
- Decreases the accuracy of the model
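One low-effort way to try TensorRT is through ONNX Runtime's execution providers, which can fall back to plain CUDA or CPU when TensorRT is unavailable. The helper below is a hypothetical sketch (the provider names are real ONNX Runtime identifiers; actually creating the session requires the `onnxruntime-gpu` package and an NVIDIA GPU, so that part is shown as a comment):

```python
# Preference order: TensorRT first, then plain CUDA, then CPU
PREFERENCE = ["TensorrtExecutionProvider", "CUDAExecutionProvider", "CPUExecutionProvider"]

def pick_providers(available):
    """Keep only the providers that are actually installed, in preference order."""
    chosen = [p for p in PREFERENCE if p in available]
    return chosen or ["CPUExecutionProvider"]

# On a machine with onnxruntime-gpu and an NVIDIA GPU, usage looks like:
#
#   import onnxruntime as ort
#   providers = pick_providers(ort.get_available_providers())
#   session = ort.InferenceSession("model.onnx", providers=providers)
#
# ONNX Runtime then hands supported subgraphs to TensorRT automatically.

print(pick_providers(["CUDAExecutionProvider", "CPUExecutionProvider"]))
```

This lets you benchmark the same ONNX model with and without TensorRT by changing nothing but the provider list.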
Choosing the right hardware (CPU, GPU, IPU)
Hardware is an important factor when it comes to deploying ML models. Small models typically do fine on CPUs, but, with the recent advancements in the field, GPUs are far more common. A model like ChatGPT simply can't run on a CPU, as the inference time would be way too high. We recently deployed Stable Diffusion, and when running it on a 32 GB CPU instance it took about half an hour to process a single request! But even if you know that a standard CPU won't cut it for your use case, how do you pick the right alternative?
There are many different GPUs available on most clouds, ranging from T4 instances to NVIDIA A100s. Recently, Intelligence Processing Units (IPUs) from Graphcore have also entered the market. So which one will help you get your inference time down the most? Let's quickly compare GPUs and IPUs.
GPU benefits and downsides
GPUs are Graphics Processing Units and, as the name suggests, they are optimized for graphics calculations. Since graphics calculations are mathematically quite similar to deep learning calculations, GPUs tend to lend themselves well to deep learning too. In a nutshell:
Benefits:
- They are optimized for parallel processing
- They work well with big batches
- Certain GPUs, like NVIDIA T4s, are specialized for inference

Downsides:
- Each type of GPU has its own quirks; it takes some manual testing to find the one that works best for you
- They use a lot of energy – more energy means higher costs and a bigger carbon footprint
- They are built for general-purpose compute, not specifically optimized for ML
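The batching point can be illustrated with a simple latency model: each GPU call carries a fixed launch-and-transfer overhead plus a per-item compute cost, so larger batches amortize the overhead. The numbers below are purely illustrative, not measurements:

```python
def per_item_latency_ms(batch_size, overhead_ms=5.0, per_item_ms=0.2):
    """Toy model: fixed per-call overhead amortized across the batch."""
    return (overhead_ms + per_item_ms * batch_size) / batch_size

for batch in (1, 8, 64, 256):
    print(f"batch {batch:>3}: {per_item_latency_ms(batch):.3f} ms per item")
```

With these made-up numbers, a single-item call pays the full 5 ms overhead, while at batch size 256 the per-item cost approaches the raw 0.2 ms compute time – which is why GPU inference is often served in batches when latency budgets allow it.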
IPU benefits and downsides
IPUs are Intelligence Processing Units. They are developed by Graphcore and are made specifically for ML applications. In a nutshell:
Benefits:
- Efficient massive compute parallelism
- Very large memory bandwidth
- Specialized for graph processing (so perfect for deep learning)

Downsides:
- Currently only available on Gcore and not on other cloud providers
- Less availability
- Only suitable for TensorFlow and PyTorch models
Reducing the number of operations
You can also reduce your model's inference time by optimizing the model itself. When optimizing an ML model, you typically want to reduce the number of operations that have to be performed on any input data. There are many different techniques you can use to achieve this, like pooling, model pruning, or separable convolutions. Going into the details of those techniques would make this article very lengthy, though, so I'll leave them for a part-two article!
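As a small taste of one of these techniques, here is a self-contained sketch of magnitude-based weight pruning: weights close to zero are dropped, so the corresponding multiply-adds can be skipped at inference time. A real implementation would use your framework's utilities (e.g. `torch.nn.utils.prune` in PyTorch); the function and example weights below are made up to show the core idea:

```python
def prune_by_magnitude(weights, sparsity=0.5):
    """Zero out roughly the fraction `sparsity` of weights with the smallest magnitude."""
    ranked = sorted(weights, key=abs)
    threshold = abs(ranked[int(sparsity * len(weights)) - 1])
    return [0.0 if abs(w) <= threshold else w for w in weights]

weights = [0.91, -0.02, 0.45, 0.004, -0.67, 0.08, -0.33, 0.01]
pruned = prune_by_magnitude(weights, sparsity=0.5)

kept = sum(1 for w in pruned if w != 0.0)
print(pruned)
print(f"multiply-adds remaining: {kept}/{len(weights)}")
```

The zeroed weights only translate into actual speed-ups when the runtime or hardware can exploit the sparsity (e.g. via sparse kernels or structured pruning), which is part of what makes this topic worth its own article.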
There are various techniques and strategies you can employ to improve the inference times of your models. I would recommend converting your model to ONNX whenever you can, since it's an almost guaranteed speed-up and doesn't take much effort. The benefits of the other techniques are less clear-cut and depend more on your specific use case. Luckily, tools like TensorRT are quick to set up, so testing them is easy. And with MLOps platforms like UbiOps, switching between CPU, GPU, or IPU instances can be done with the push of a button.