For users of GenAI models, especially large language models (LLMs), inference costs remain one of the largest costs of using GenAI for business operations.
What is inference? In the GenAI field, inference is the process of generating data from a prompt: an already trained model is applied to new input to produce new output. Unlike the training phase, inference does not update the weights of the model. Every time you send a prompt to an already trained chatbot, you are running an inference.
In a recent LinkedIn post, Michael Verdi outlines three ways inference costs can be reduced. In this article we will expand on these three points and offer two more. The major ways to reduce inference costs are:
- Use open-source models
- Use smaller models
- Use fine-tuning and RAG techniques to improve the accuracy of your model
- Use more optimized hardware
- Use better-optimized software and AI libraries
Open-source models
Closed-source models carry a clear extra cost that open-source ones don’t: the cost of the intellectual property. The exact size of that cost is hard to determine, as you would have to know the details of the model provider’s pricing plan. Open-source models are free to download and therefore do not charge a premium for the intellectual property. You will still need to pay for the infrastructure needed to run the model, but the cost of the IP will not be a concern.
When it comes to quality, open-source models have reached a level similar to proprietary models. Here is a table comparing the benchmark results of Mixtral 8x7B, GPT-3.5 and GPT-4: one prominent open-source model versus two prominent closed-source models.
As we can see, Mixtral 8x7B and GPT-3.5 both perform well, but Mixtral 8x7B slightly outperforms GPT-3.5 on both benchmarks. GPT-4, however, blows both out of the water. It is important to remember that Mixtral has 20 times fewer parameters than GPT-4. While the benchmarks seem to seal the deal for GPT-4, open-source models are generally the better choice for various reasons outside the scope of this article. We have written an article comparing closed-source versus open-source models.
If you want to learn about the performance benchmarks, MMLU and ARC-Challenge, read our article about which LLM to choose for your use case. Overall, while proprietary models often outperform open-source ones on performance metrics, open-source models offer much better data security and lower costs.
Smaller models
While a large number of parameters is linked with greater model accuracy, it does not necessarily make a model more efficient. The number of parameters of a given model is a direct indicator of its inference cost. In general, tiny models are under 2B, small are between 2B and 7B, medium models are from 7B to 20B and large models are higher than 50B and require extensive hardware to run. Here is a performance comparison table between models which are around 2B parameters.
As we can see from the table, all of these models, despite their small number of parameters, perform well on a variety of benchmarks. In general, you do not need top-tier performance on the level of GPT-4. Performance, especially if the model is used in a specific industry, will depend mostly on the fine-tuning stage.
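To make the link between parameter count and cost more concrete, here is a rough back-of-the-envelope sketch. It assumes that the weights dominate memory use and ignores activations and the KV cache, so the figures are lower bounds rather than measurements. The function name and the example sizes are illustrative and not taken from any benchmark.

```python
def weight_memory_gb(num_params: float, bytes_per_param: int = 2) -> float:
    """Approximate memory needed just to hold the weights, in GB (10^9 bytes)."""
    return num_params * bytes_per_param / 1e9


# Illustrative model sizes, stored as 16-bit floats (2 bytes per parameter).
for name, params in [("2B", 2e9), ("7B", 7e9), ("50B", 50e9)]:
    print(f"{name} model: roughly {weight_memory_gb(params):.0f} GB of weights")
```

Under these assumptions, a model with 25 times more parameters needs 25 times more memory for its weights alone, which is what pushes large models onto more expensive hardware.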
Improve and customize models using RAG and fine-tuning
Another very useful way of reducing inference costs is to increase model performance in a customized domain. If your model responds to prompts accurately, there will be less need for re-prompting, which reduces overall cost. We will discuss three ways of doing this: fine-tuning, RAG and prompt engineering.
Fine-tuning
Fine-tuning is a process in which a model is re-trained on new data, either to specialize it for a certain industry or topic or to customize it. It is a broad field which encompasses any training process carried out on a pre-trained model.
Pre-trained models such as Mistral-7B and Llama-3-8B have been trained on the basics of language and general reasoning and come out of the box as completion models. It is often useful to customize these models on personal or company data, allowing the model to gain the knowledge necessary to answer questions specific to your industry and company.
We have an article explaining how we fine-tuned Mistral 7B on our company documentation. In it, we used a type of parameter-efficient fine-tuning (PEFT) called low-rank adaptation (LoRA) to efficiently and quickly fine-tune Mistral. Here is the result:
Prompt: “How could Harry Potter use UbiOps to defeat Voldemort?”
Response: “Harry Potter could use UbiOps to build a predictive model to anticipate Voldemort’s next move based on historical data. He could also use UbiOps to automate the deployment of spells or potions, allowing him to focus on strategic planning. Additionally, UbiOps could help Harry and his friends analyze large amounts of data from various sources, such as the Ministry of Magic or Hogwarts Library, to gain insights and make informed decisions.”
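To give a feel for why LoRA counts as parameter-efficient, the sketch below counts trainable parameters for a single weight matrix. LoRA freezes the original matrix and trains two small low-rank matrices in its place. The layer dimensions and the rank are illustrative assumptions, not the settings used in our fine-tuning article.

```python
def full_params(d_in: int, d_out: int) -> int:
    """Parameters updated when the whole d_out x d_in weight matrix is trained."""
    return d_in * d_out


def lora_trainable_params(d_in: int, d_out: int, rank: int) -> int:
    """Parameters in the two LoRA matrices: (d_out x rank) and (rank x d_in)."""
    return rank * (d_in + d_out)


# Hypothetical layer size and rank, chosen only for illustration.
d_in, d_out, rank = 4096, 4096, 8
full = full_params(d_in, d_out)
lora = lora_trainable_params(d_in, d_out, rank)
print(f"Full fine-tuning: {full:,} trainable parameters")
print(f"LoRA (rank {rank}): {lora:,} trainable parameters")
print(f"LoRA trains {lora / full:.2%} of the layer's parameters")
```

With these example values, LoRA trains well under 1% of the parameters in the layer, which is why it is faster and cheaper than full fine-tuning.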
RAG
Retrieval-augmented generation (RAG) is a type of software infrastructure which encodes the prompt and uses that encoding to query a vector database, which in turn returns information on topics relating to the prompt.
As shown in the diagram, the user’s question is encoded, the information needed to answer it is retrieved from a database, and that information is added as context to the original prompt. The context and the prompt are then given to the LLM, which can use the context to answer the question. RAG is essentially a type of prompt engineering, but it has become so effective that it deserves its own section. It is a good way to reduce model hallucination.
What is model hallucination? It is when the model generates incorrect answers confidently. RAG is a safeguard which makes sure that the information in the database will be passed to the model when relevant, making it less likely for it to spew false information.
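The retrieval flow described in this section can be sketched in a few lines. This is a toy illustration under loud assumptions: the `embed` function is a bag-of-words stand-in for a real embedding model, a plain Python list stands in for the vector database, and all function names and documents are hypothetical.

```python
import math
import re
from collections import Counter


def embed(text: str) -> Counter:
    """Toy embedding: word counts (a stand-in for a real embedding model)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[token] * b[token] for token in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def retrieve(question: str, documents: list[str], k: int = 1) -> list[str]:
    """Return the k documents most similar to the question."""
    q = embed(question)
    ranked = sorted(documents, key=lambda d: cosine(q, embed(d)), reverse=True)
    return ranked[:k]


def build_prompt(question: str, documents: list[str], k: int = 1) -> str:
    """Prepend the retrieved context to the original question."""
    context = "\n".join(retrieve(question, documents, k))
    return f"Context:\n{context}\n\nQuestion: {question}\nAnswer:"


# Hypothetical knowledge base.
docs = [
    "Batching combines several requests into one forward pass.",
    "Quantization stores model weights with fewer bits.",
    "Fine-tuning re-trains a model on new data.",
]
print(build_prompt("What does quantization do to weights?", docs))
```

The resulting string is what would be sent to the LLM: the model sees the retrieved passage alongside the question instead of having to rely on what it memorized during training.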
Prompt engineering
Prompt engineering is the process of modifying and clarifying a prompt so that the LLM returns more accurate and precise information. A clearer prompt reduces the chance of the LLM misunderstanding the request, so it generates more accurate results.
This process is done at inference time and therefore does not incur any extra training costs. There are a variety of prompt engineering techniques. Here is an example of few-shot prompt engineering taken from the Prompt Engineering Guide:
Prompt: This is awesome! // Positive
This is bad! // Negative
Wow that movie was rad! // Positive
What a horrible show! //
Response: Negative
As we can see from the prompt, we have given the model extra context and examples before asking it to solve a problem. This type of prompt engineering has been shown to increase model accuracy.
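A few-shot prompt like the one above can be assembled programmatically. The sketch below is one minimal way to do it; the helper name `few_shot_prompt` is hypothetical, and the `//` separator simply mirrors the example.

```python
def few_shot_prompt(examples: list[tuple[str, str]], query: str) -> str:
    """Build a few-shot prompt: labeled examples first, then the unlabeled query."""
    lines = [f"{text} // {label}" for text, label in examples]
    lines.append(f"{query} //")  # left open for the model to complete
    return "\n".join(lines)


examples = [
    ("This is awesome!", "Positive"),
    ("This is bad!", "Negative"),
    ("Wow that movie was rad!", "Positive"),
]
prompt = few_shot_prompt(examples, "What a horrible show!")
print(prompt)
```

Keeping the examples in a list makes it easy to swap them per use case without rewriting the prompt by hand.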
Use better hardware
A more optimized hardware infrastructure, either provided by a third party or on-premise, is vital to reducing your overall inference costs. Newer GPUs have special components which make them very efficient at GenAI tasks. For instance, Nvidia’s CUDA cores are built for parallel workloads and its Tensor Cores accelerate matrix multiplication, making these GPUs ideal for GenAI use.
We will now perform a test, measuring the inference speed of Gemma 2B on three different hardware tiers: pure CPU, a Tesla T4 GPU and an Ada Lovelace L4 GPU. We will compare the speeds and pricing to gauge which tier gives the most bang for your buck.
As we can see, while the CPU instances are much cheaper than the T4 and L4 instances, the difference in price does not match the difference in performance. It is therefore better to opt for more advanced hardware when possible. However, we do see that the T4 is the best option for running Gemma 2B: the L4 is not worth the price given its meager speed increase.
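One way to compare hardware tiers on equal footing is to convert hourly price and generation speed into a cost per million generated tokens. The sketch below shows the arithmetic. The prices and speeds are placeholder values chosen for illustration only; they are not the measurements from our test, and the tier names are hypothetical.

```python
def cost_per_million_tokens(price_per_hour: float, tokens_per_second: float) -> float:
    """Cost of generating one million tokens at a given hourly price and speed."""
    tokens_per_hour = tokens_per_second * 3600
    return price_per_hour / tokens_per_hour * 1_000_000


# Placeholder (price per hour, tokens per second) pairs, NOT measured values.
tiers = {
    "cpu": (0.10, 2.0),
    "mid-range GPU": (0.50, 40.0),
    "high-end GPU": (1.00, 50.0),
}
for name, (price, speed) in tiers.items():
    cost = cost_per_million_tokens(price, speed)
    print(f"{name}: {cost:.2f} per million tokens")
```

The cheapest instance per hour is not necessarily the cheapest per token, and neither is the fastest one. Plugging in your own measured speeds and your provider's prices shows which tier wins for your model.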
Optimization techniques
There are many optimization techniques you can use to reduce your inference costs. While a lot of focus is given to hardware and GPUs, the software which loads your model onto the hardware and passes it prompts is also crucial. We will discuss two ways to use software tools and libraries to potentially lower your inference costs: batching and quantization.
Batching
Batching is the process of combining several requests so that the model processes them together. It can achieve very impressive results and reduce inference costs by a significant amount. We wrote an article comparing the difference in speed between batched and non-batched requests using the vLLM library. We processed 150 requests, batched and non-batched, on the same hardware and compared the results.
As is apparent, batching reduces inference time by a factor of around 42. This will in turn reduce costs as you can generate information in a significantly shorter amount of time, especially in high load use cases.
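The intuition behind the speed-up can be sketched with a toy cost model. It assumes that every forward pass pays a fixed overhead (for example, reading the weights from memory) plus a small cost per request, so grouping requests amortizes the overhead. This is simplified static batching with made-up timings; it does not reproduce vLLM's scheduler or the factor measured above, and both function names are hypothetical.

```python
import math


def make_batches(requests: list[str], batch_size: int) -> list[list[str]]:
    """Split a list of requests into consecutive groups of at most batch_size."""
    return [requests[i:i + batch_size] for i in range(0, len(requests), batch_size)]


def estimated_time(num_requests: int, batch_size: int,
                   fixed_cost_s: float, per_request_cost_s: float) -> float:
    """Toy model: each forward pass pays fixed_cost_s, each request per_request_cost_s."""
    num_passes = math.ceil(num_requests / batch_size)
    return num_passes * fixed_cost_s + num_requests * per_request_cost_s


requests = [f"prompt {i}" for i in range(150)]
print(len(make_batches(requests, 50)), "batches of up to 50 requests")

# Made-up timings, for illustration only.
unbatched = estimated_time(150, batch_size=1, fixed_cost_s=1.0, per_request_cost_s=0.01)
batched = estimated_time(150, batch_size=150, fixed_cost_s=1.0, per_request_cost_s=0.01)
print(f"Unbatched: {unbatched:.1f}s, batched: {batched:.1f}s")
```

In this model the benefit grows with the ratio of fixed overhead to per-request cost, which is why batching pays off most under high load.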
Quantization
Quantization is a process which changes the data type in which weights are stored and reduces their precision. If weights are stored as 8-bit integers instead of 16-bit floats, this significantly reduces the memory requirements of running inferences. This does come at the cost of speed, but it allows you to run models on cheaper but still very effective hardware. It is a trade-off between memory and speed.
We will now perform a simple test. We will run Gemma 2B with its weights stored as non-quantized 16-bit floats, as quantized 8-bit and 4-bit integers, and as upscaled 32-bit floats, and compare the speed and memory usage. Consult the torch documentation for a comprehensive list of all storage types.
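To illustrate what reducing precision means in practice, here is a minimal sketch of one simple scheme, symmetric absmax quantization to 8-bit integers. It is an assumption-laden toy: it works on a short Python list rather than real tensors, and production libraries use more elaborate methods. The function names and weight values are hypothetical.

```python
def quantize_int8(weights: list[float]) -> tuple[list[int], float]:
    """Map floats to integers in [-127, 127] using a single shared scale."""
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / 127 if max_abs else 1.0
    return [round(w / scale) for w in weights], scale


def dequantize(quantized: list[int], scale: float) -> list[float]:
    """Recover approximate float values from the stored integers."""
    return [q * scale for q in quantized]


# Hypothetical weights, for illustration only.
weights = [0.12, -0.5, 0.33, 0.0, 0.49]
quantized, scale = quantize_int8(weights)
restored = dequantize(quantized, scale)
max_error = max(abs(w - r) for w, r in zip(weights, restored))

print("Quantized:", quantized)
print(f"Largest rounding error: {max_error:.5f}")
```

Each weight now fits in one byte instead of two or four, at the price of a small rounding error per weight. That error is the precision loss mentioned above.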
Conclusion
To conclude, we have outlined five main ways to reduce inference costs in this article: use open-source models, use smaller models, improve your models to make them more accurate, use better and more up-to-date hardware, and use better-optimized software tools and libraries. We think that by focusing on these five areas, you can begin to drastically reduce your inference costs.