Introduction
Optimizing inference is typically a machine learning (ML) engineer’s task, but in many cases it ends up in the hands of data scientists. Whether you’re a data scientist deploying models as a hobby or working on a team that lacks engineers, at some point you will probably have to learn about inference optimization.
To get data scientists started, we compiled a list of the most used large language model (LLM) inference performance metrics and optimization techniques that NVIDIA, Databricks, Anyscale, and other AI experts recommend. We also included links to resources that you can use to put these techniques into practice and in the process build better, faster, and more efficient LLM-based applications that make users happy and managers happier.
What is LLM inference and how does it work?
Inference and instance types for LLMs
Inference essentially refers to running a trained model on new data and producing an output. For an LLM, that implies taking an input, i.e., a prompt, and generating an output, i.e., a response. When building customer-facing applications, engineers consider how many users will be prompting a model, how often, and how long inference will take for each request before building custom architecture. Data scientists commonly achieve similar results by leveraging pre-defined cloud instances and employing model optimization techniques.
To run a model, it must be spun up on an instance. In the context of ML, an instance is the computational environment within which a model is deployed and run. Instances have varying characteristics based on their architecture and their hardware components such as processing power, memory capacity, and network connectivity. Due to their large size, LLMs typically require state-of-the-art instance types featuring GPUs with a lot of memory. With the right expertise, instances can be built on premise, otherwise they are most often sourced in the cloud.
LLM text generation phases
LLMs process inputs as sequences of tokens, which you can roughly think of as words or fragments of words. Different LLMs take different approaches to tokenization, i.e., the process of breaking text up into tokens, which can affect the LLM’s ability to infer context. Some LLMs require longer or shorter inputs than others to deliver results of similar quality. For all LLMs, though, tokenization is fundamental to how they work, as it lays the groundwork for understanding, interpreting, and generating text.
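To make tokenization concrete, here is a minimal sketch using the Hugging Face transformers library (assuming it is installed); the GPT-2 tokenizer is used purely as an example, and other models will split the same text differently.

```python
from transformers import AutoTokenizer

# Load a tokenizer; "gpt2" is used here purely as an illustration.
tokenizer = AutoTokenizer.from_pretrained("gpt2")

text = "Inference optimization makes LLMs faster."
tokens = tokenizer.tokenize(text)    # subword pieces the tokenizer produces
token_ids = tokenizer.encode(text)   # the integer IDs the model actually consumes

print(tokens)
print(token_ids)
print(f"{len(text)} characters became {len(token_ids)} tokens")
```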
Text generation in LLMs is composed of two phases: the prefill phase and the decode phase. Here’s a very high-level explanation of what occurs during these two phases:
- In the prefill phase, the LLM runs computations over the input tokens it has been given and produces the first predicted output token. Prefill occurs only once per request, and the computation can be performed for all input tokens in parallel. As such, this phase makes efficient use of a GPU’s computing capabilities.
- In the decode phase, the LLM takes the token predicted during prefill and generates the remaining output tokens one by one. Decoding is repeated for each subsequent output token until a stop token is generated, completing the response. The problem with this phase is that it can’t be parallelized across output tokens, so it underutilizes expensive GPUs. (A minimal code sketch of both phases follows below.)
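Here is a minimal sketch of the two phases, again assuming the transformers library and using GPT-2 only for illustration: prefill happens in a single forward pass over the whole prompt, while decoding loops one token at a time and reuses the cache from previous steps. A production loop would also stop when an end-of-sequence token appears.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

input_ids = tokenizer("The two phases of LLM text generation are", return_tensors="pt").input_ids

with torch.no_grad():
    # Prefill: one forward pass over all input tokens at once (parallel, compute-heavy).
    out = model(input_ids, use_cache=True)
    past = out.past_key_values
    next_token = out.logits[:, -1, :].argmax(dim=-1, keepdim=True)
    generated = [next_token]

    # Decode: one token at a time, reusing the KV cache (sequential, memory-heavy).
    # A real loop would also check for the end-of-sequence token.
    for _ in range(20):
        out = model(next_token, past_key_values=past, use_cache=True)
        past = out.past_key_values
        next_token = out.logits[:, -1, :].argmax(dim=-1, keepdim=True)
        generated.append(next_token)

print(tokenizer.decode(torch.cat(generated, dim=-1)[0]))
```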
Compute-bound vs. memory-bound inference
Inference speed depends heavily on the characteristics of the instance a model is running on and on the model itself. A model, or a phase of a model, that is computationally demanding is limited by different factors than one that requires a lot of data to be moved back and forth between memory and the processor. On the hardware side, computing speed and memory availability are the key limiters of inference speed. Cases where inference speed is limited by one or the other are referred to as compute-bound or memory-bound inference.
Compute-bound inference
Compute-bound inference is when inference speed is limited by the computing speed of an instance. The type of processing unit being used by an instance, e.g., CPU or GPU, will determine the maximum speed at which calculations can be made. A model may be deployed with the most cutting-edge software optimization and request batching techniques, but it can only run as fast as a processor can calculate. At the same time, the type of calculation required by a model will affect its ability to make use of a processor’s compute speed.
The prefill phase of an LLM is usually compute-bound, since the main factor that affects its speed is the processing capability of the GPU on which it’s running. Prefill can be processed in parallel across all tokens in an input, meaning that the full computing speed of an instance can be used. This is especially effective on GPUs, which are optimized for parallel processing.
Memory-bound inference
Memory-bound inference, on the other hand, is when inference speed is limited by the available memory or memory bandwidth of an instance. Different processors have different data transfer speeds. Instances can be built with varying amounts of random-access memory (RAM). Models also vary in size, as do their inputs and outputs. LLMs require notoriously large amounts of memory and memory bandwidth: not only is a lot of data involved, but that data has to be moved between memory and the processor, usually multiple times.
The decode phase is generally considered to be memory-bound. It involves a round of sequential computation for each output token. In most cases, key-value (KV) caching is used, which saves the GPU from making redundant calculations by storing intermediate data after each token prediction. Speed is therefore limited by the time it takes for the cached data from the prefill and previous decode steps to be loaded from memory. A faster GPU won’t help much unless it also has higher memory bandwidth. Databricks argues that memory bandwidth is actually a more useful metric for inference speed than computational performance.
Identifying your bottleneck
If your inference speed is lacking, it’s important to investigate where your bottleneck is. If you don’t identify it, you may opt for the wrong solution and either see little performance gain or incur pointless costs. For example, if you switch from running an LLM on an NVIDIA A100 with 80 GB of memory to an H100 with the same amount of memory, but your limiter is memory capacity, then you will be spending a lot more money for very little improvement.
There are methods to assess whether an instance type is optimal for a given model. Baseten, in its guide to LLM inference and performance, recommends comparing the operations per byte (ops:byte) ratio of a processor to the arithmetic intensity of a model, both measured in operations per byte. If the arithmetic intensity of the model is below the ops:byte ratio of the processor, you are dealing with a memory-bound system.
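As a rough, back-of-the-envelope illustration of that comparison (the GPU figures below are approximate published specs for an A100 80GB SXM card, and the model’s arithmetic intensity is a simplified estimate for batch-size-1 decoding):

```python
# Back-of-the-envelope check: is decoding memory-bound on this GPU?
# The numbers below are approximate figures for an NVIDIA A100 80GB SXM card.
peak_flops = 312e12        # FP16 tensor-core throughput, operations per second (approx.)
memory_bandwidth = 2.0e12  # HBM bandwidth, bytes per second (approx.)

ops_to_byte = peak_flops / memory_bandwidth  # roughly 150+ operations per byte

# During batch-size-1 decoding, each 2-byte (FP16) weight is read once and used
# for roughly two operations (a multiply and an add), so arithmetic intensity is ~1.
model_arithmetic_intensity = 2 / 2  # operations per byte (rough estimate)

if model_arithmetic_intensity < ops_to_byte:
    print("Memory-bound: the GPU spends most of its time waiting on memory.")
else:
    print("Compute-bound: the GPU's arithmetic throughput is the limiter.")
```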
How do you measure inference performance?
It’s all about speed. Whether measuring the amount of time it takes for a user to see the first output token, or the number of tokens that can be processed in a given time frame, it all comes down to how fast users will get responses to their prompts. Here are a few commonly used metrics for LLM inference performance:
Time to first token
Time to first token (TTFT) refers to the time it takes for an LLM to generate the first token of its response. For real-time applications, such as chatbots or virtual assistants, a low TTFT is crucial.
Time per output token
Time per output token (TPOT) represents the average duration required by an LLM to generate each token in its output sequence. This metric is also very important for real-time applications, as a high TPOT will cause users to wait longer for responses.
Latency
Latency is the total amount of time it takes for a full output to be generated. When referring to an LLM on its own, latency can be calculated based on TTFT, TPOT, and the expected token length of the output. When referring to an application, however, latency might also include time required for data preprocessing and post-processing tasks. Needless to say, latency gives you a useful overall view of how fast an LLM or LLM-based application can run.
Throughput
Throughput measures how many tokens an LLM system can output per second across all requests coming into the system. A higher throughput indicates that your system can handle workloads more quickly than one with a lower throughput. While a shorter TPOT may lead to a higher throughput, this metric is also influenced by system-level factors, such as whether workloads are processed concurrently or sequentially.
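These metrics are straightforward to measure if your serving stack streams tokens back to the client. The sketch below assumes a generic Python iterable of streamed tokens (the actual client API will vary by serving stack); throughput is then computed across all requests at the system level.

```python
import time

def measure_request(stream):
    """Measure TTFT, TPOT, and latency for a single streamed response.

    `stream` is assumed to be any iterable that yields output tokens as the
    model generates them (the exact client API will vary by serving stack).
    """
    start = time.perf_counter()
    token_times = []
    for _ in stream:
        token_times.append(time.perf_counter())

    ttft = token_times[0] - start                   # time to first token
    latency = token_times[-1] - start               # total generation time
    n_tokens = len(token_times)
    tpot = (latency - ttft) / max(n_tokens - 1, 1)  # average time per output token

    # Note: latency is approximately TTFT + TPOT * (n_tokens - 1).
    return {"ttft": ttft, "tpot": tpot, "latency": latency, "tokens": n_tokens}

# Throughput is measured across the whole system, e.g.:
# throughput = total_output_tokens_across_all_requests / wall_clock_seconds
```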
Now that we have covered the metrics used to measure LLM inference performance, let’s explore how performance can be optimized.
How to optimize LLM inference performance
In this article, we will be specifically talking about LLM inference optimization techniques. For general inference optimization, we recommend reading: “How to optimize the inference time of your machine learning model”.
Each LLM inference performance optimization technique leads to improvements in a specific metric. Therefore, data scientists looking to optimize LLM inference performance will need to consider which metric is most important for their use case. That will allow them to pick the technique that makes the most sense for them. For example, when batching, a smaller batch size may favor latency at the expense of throughput. So, in this case, it’s important to determine what the average expected throughput or maximum acceptable latency is for an application before configuring batch size. Cost is also a useful consideration to keep in mind: how much does each workload cost to run on an ML system?
The following is a list of LLM inference optimization techniques. To simplify things, we grouped them into two categories: model optimization and inference optimization. Model optimization techniques aim to achieve better performance by reducing the memory load of a model or by modifying its compute utilization. Inference optimization techniques aim to make the best use of the resources available in an instance type by managing data and operations.
Model optimization
Quantization
Quantization is a way of compressing a model’s weights and activations by reducing their precision. By converting to a lower precision, e.g., from 16-bit to 8-bit weights, you effectively reduce the amount of data that needs to be processed and moved through memory. This reduces memory usage and speeds up computation, and it can also make KV caches more efficient. By now, quantization is a standard technique for speeding up LLM inference and getting by with cheaper instance types.
The downside of quantization is that it can degrade model quality: accuracy can suffer as a result of using lower-precision weights, and the effect varies with model architecture and size. Quantization should therefore be implemented with care.
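As a hedged example, here is one common way to apply 8-bit quantization at load time using the transformers and bitsandbytes libraries (this assumes a CUDA GPU and both packages installed; facebook/opt-1.3b is used purely as an example model):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_id = "facebook/opt-1.3b"  # any causal LM from the Hub; used here as an example

# Load weights in 8-bit instead of 16-bit, roughly halving memory use.
quant_config = BitsAndBytesConfig(load_in_8bit=True)

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=quant_config,
    device_map="auto",  # places layers on the available GPU(s)
)

inputs = tokenizer("Quantization trades a little accuracy for", return_tensors="pt").to(model.device)
print(tokenizer.decode(model.generate(**inputs, max_new_tokens=20)[0]))
```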
Compression
Compression encompasses various methods such as sparsity or model distillation, aimed at reducing the size of a model while preserving its functionality. Compressed models consume less memory and bandwidth, resulting in faster inference.
Sparsity
Sparsity involves identifying and removing unnecessary parameters or activations in a model, typically through techniques like pruning or regularization. Sparse models require fewer computations, leading to improved inference performance.
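A minimal sketch of unstructured magnitude pruning on a single linear layer, using PyTorch’s built-in pruning utilities:

```python
import torch
from torch import nn
from torch.nn.utils import prune

layer = nn.Linear(1024, 1024)

# Zero out the 30% of weights with the smallest magnitude (L1 unstructured pruning).
prune.l1_unstructured(layer, name="weight", amount=0.3)
prune.remove(layer, "weight")  # make the pruning permanent

sparsity = (layer.weight == 0).float().mean().item()
print(f"Weight sparsity: {sparsity:.0%}")
```

Keep in mind that zeroed weights alone don’t speed up dense matrix kernels; real gains require structured sparsity patterns or a runtime that can actually exploit the zeros.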
Distillation
Distillation is a method for streamlining model complexity and computational load. By training a compact student model to emulate the behavior of a larger teacher model, distillation condenses knowledge into a more agile representation. This reduction in model size translates to decreased memory consumption during inference, improving both latency and throughput. The computational efficiency gained from distilling a model accelerates inference speed, reducing both TTFT and TPOT. However, it’s crucial to balance this compression against loss of accuracy to make sure overall model performance doesn’t suffer.
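A minimal sketch of the classic distillation objective in PyTorch, blending a temperature-softened match to the teacher’s distribution with the ordinary cross-entropy loss (the temperature and weighting values here are arbitrary assumptions):

```python
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, T=2.0, alpha=0.5):
    """Blend cross-entropy on the labels with a term that pulls the student's
    token distribution toward the teacher's softened distribution.

    Shapes (flattened over the batch): logits are (N, vocab_size), labels are (N,).
    """
    soft_targets = F.softmax(teacher_logits / T, dim=-1)
    soft_student = F.log_softmax(student_logits / T, dim=-1)
    kd = F.kl_div(soft_student, soft_targets, reduction="batchmean") * (T * T)
    ce = F.cross_entropy(student_logits, labels)
    return alpha * kd + (1 - alpha) * ce
```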
Attention mechanism optimization
Refining the attention mechanisms in LLMs can profoundly impact inference performance, particularly TPOT (and therefore latency). The computational overhead of attention can be reduced through a wide range of techniques such as structured, kernelized, multi-head, multi-query, grouped-query, and flash attention. Most of these work by minimizing the amount of memory needed or reducing the number of memory reads and writes during attention computation. This mainly speeds up the memory-bound operations of an LLM’s decode phase, resulting in a shorter TPOT.
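For instance, PyTorch 2.x exposes a fused attention operation that dispatches to a memory-efficient, FlashAttention-style kernel when the hardware and dtypes allow it, avoiding materializing the full attention matrix. The sketch below assumes a CUDA GPU and uses toy tensor shapes:

```python
import torch
import torch.nn.functional as F

# Toy shapes: (batch, heads, sequence length, head dimension)
q = torch.randn(1, 16, 1024, 64, device="cuda", dtype=torch.float16)
k = torch.randn(1, 16, 1024, 64, device="cuda", dtype=torch.float16)
v = torch.randn(1, 16, 1024, 64, device="cuda", dtype=torch.float16)

# Fused attention: computed blockwise, so the full attention matrix is never
# written out to GPU memory, cutting memory reads and writes.
out = F.scaled_dot_product_attention(q, k, v, is_causal=True)
print(out.shape)
```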
Inference optimization
KV caching
Key-value, or KV, caching is a pivotal technique for optimizing LLM inference that involves strategically managing data retrieval and reuse. During decoding, the key and value tensors computed for previous tokens are cached so they can be retrieved rather than recomputed for every new token, mitigating redundant calculations. KV caching significantly impacts latency and throughput by minimizing the time spent on data retrieval and computation. With KV caching, LLMs can make better use of available hardware resources, resulting in improved scalability. It’s important to note, however, that the effectiveness of KV caching depends heavily on the available memory and data transfer speed of your instance.
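Conceptually, the cache is just the key and value tensors of every previous token, appended to at each decode step. The shapes and loop below are a simplified illustration, not a real attention implementation:

```python
import torch

# A minimal sketch of what a KV cache stores for one attention layer.
# Shapes: (batch, heads, seq_len, head_dim)
cache_k = torch.empty(1, 16, 0, 64)
cache_v = torch.empty(1, 16, 0, 64)

def decode_step(new_k, new_v):
    """Append this step's key/value to the cache instead of recomputing
    keys and values for every previous token."""
    global cache_k, cache_v
    cache_k = torch.cat([cache_k, new_k], dim=2)
    cache_v = torch.cat([cache_v, new_v], dim=2)
    # Attention for the new token then runs against cache_k / cache_v.
    return cache_k, cache_v

for _ in range(5):  # five decode steps
    k, v = torch.randn(1, 16, 1, 64), torch.randn(1, 16, 1, 64)
    decode_step(k, v)

print(cache_k.shape)  # the cache grows by one position per generated token
```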
Operator fusion
Operator fusion trims memory access overhead and enhances cache utilization by combining multiple operations into a single kernel within a computational graph. Like KV caching, this optimization eliminates wasted work: it reduces the number of kernel launches and memory round trips a processor needs to make, reducing inference time and increasing throughput.
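As an illustration, PyTorch 2.x’s torch.compile can fuse chains of elementwise operations like the toy function below into a single kernel, so the input tensor is read from memory once instead of once per operation (a sketch, assuming PyTorch 2.x):

```python
import torch

def gelu_bias(x, bias):
    # Three elementwise operations that a fusing compiler can merge
    # into a single kernel, so x is read from memory only once.
    return torch.nn.functional.gelu(x + bias) * 0.5

fused = torch.compile(gelu_bias)  # traces the function and fuses the graph

x = torch.randn(4096, 4096)
bias = torch.randn(4096)
out = fused(x, bias)
```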
Parallelization
Parallelization encompasses many techniques, such as speculative inference, speculative sampling, assisted generation, blockwise parallel decoding, pipeline parallelism, tensor parallelism, and sequence parallelism, all of which aim to make better use of parallel processing. Speculative inference and sampling allow for preemptive token generation, reducing latency, while assisted generation employs auxiliary models to enhance throughput. Blockwise parallel decoding predicts blocks of output tokens in parallel rather than one token at a time, and pipeline parallelism divides inference into sequential stages to minimize idle time and improve throughput. Tensor and sequence parallelism distribute different levels of computation across hardware. Together, these techniques improve inference speed and resource utilization for LLM workflows.
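As one concrete example, recent versions of the transformers library support assisted generation, a form of speculative decoding in which a small draft model proposes tokens and the large model verifies them in a single forward pass. The model pairing below (facebook/opt-1.3b and facebook/opt-125m) is just an example of two models that share a tokenizer:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Assisted generation: a small draft model proposes tokens,
# the large model verifies them, reducing the number of slow decode steps.
tokenizer = AutoTokenizer.from_pretrained("facebook/opt-1.3b")
model = AutoModelForCausalLM.from_pretrained("facebook/opt-1.3b")
assistant = AutoModelForCausalLM.from_pretrained("facebook/opt-125m")

inputs = tokenizer("Parallelization techniques for LLM inference include", return_tensors="pt")
outputs = model.generate(**inputs, assistant_model=assistant, max_new_tokens=30)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```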
Batching
Batching is a popular method for optimizing throughput and latency. A range of techniques exist: traditional batching, in-flight batching, continuous batching, dynamic batching, etc. Whichever technique you opt for, the idea remains the same: by concurrently processing multiple input sequences during inference, batching harnesses the efficiency of matrix operations, allowing a higher number of tokens to be processed per unit of time. Selecting an appropriate batch size can be difficult, though, since trade-offs must be made between latency and throughput: smaller batch sizes favor lower latency but potentially reduce overall throughput, while larger batch sizes may improve throughput but increase latency. Batch size selection should be tailored to the specific use case and its inherent requirements, balancing the need for fast prompt responses with the desire to maximize overall throughput.
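A minimal sketch of dynamic batching: requests accumulate in a queue, and the serving loop waits briefly to fill a batch before running the model. The run_model and dispatch calls are hypothetical placeholders for your own inference and response code:

```python
import queue
import time

request_queue = queue.Queue()  # incoming prompts land here, e.g. from a web server

MAX_BATCH_SIZE = 8       # larger batches raise throughput but also latency
MAX_WAIT_SECONDS = 0.05  # how long to wait for more requests before running

def collect_batch():
    """Gather up to MAX_BATCH_SIZE requests, waiting at most MAX_WAIT_SECONDS."""
    batch = [request_queue.get()]  # block until at least one request arrives
    deadline = time.monotonic() + MAX_WAIT_SECONDS
    while len(batch) < MAX_BATCH_SIZE:
        remaining = deadline - time.monotonic()
        if remaining <= 0:
            break
        try:
            batch.append(request_queue.get(timeout=remaining))
        except queue.Empty:
            break
    return batch

# Serving loop (sketch): each batch is run through the model in one forward pass.
# while True:
#     prompts = collect_batch()
#     responses = run_model(prompts)  # hypothetical batched inference call
#     dispatch(responses)             # hypothetical: return results to callers
```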
Conclusion
Optimizing inference performance is a great way to improve the efficiency and effectiveness of LLM-based applications, not to mention an excellent exercise for data scientists wanting to push their ML engineering skills further. By leveraging a combination of model and inference optimization techniques, data scientists can significantly enhance both the speed and scalability of LLM-based systems. From quantization and compression to attention mechanism optimization and parallelization, a wide array of strategies are available to address various performance bottlenecks and improve metrics such as TTFT, TPOT, latency, and throughput.
However, it’s crucial to carefully assess the trade-offs involved in each optimization technique and tailor them to the specific requirements and constraints of the application at hand. With the right approach to optimization, LLM-based applications can deliver faster, more responsive, and more cost-effective solutions.