When Google released the paper “Attention Is All You Need”, which introduced a new neural network architecture for sequence modelling called the Transformer, the foundation for Large Language Models (LLMs) was established. The architecture is based on a self-attention mechanism that lets a model weigh every part of its input against every other part and so capture context, which revolutionised the field of Natural Language Processing (NLP). LLMs are able to understand, summarise, and generate text. This is achieved by training an LLM, which typically contains billions of parameters (weights), on massive datasets, which makes the development and deployment of these kinds of models challenging. These challenges brought along a set of dedicated practices in the Machine Learning Operations (MLOps) community: Large Language Model Operations (LLMOps), a set of workflows and practices to streamline the process of bringing LLMs into production.
This page will dive deeper into the operational management of LLMs in production environments, by explaining what LLMs are in more detail, explaining the differences between MLOps and LLMOps, providing insight into what LLMOps is, sharing techniques and methods that you can use for LLMOps, and showing where to find platforms and other solutions that you can use to tackle LLMOps challenges. One of the platforms that will be discussed is UbiOps, which you can use to easily deploy and scale your LLM.
LLMs are a type of foundation model: a large-scale ML model that is pre-trained on vast amounts of data. LLMs are able to perform a wide range of NLP tasks to understand, summarise and generate new content. Examples of LLMs are ChatGPT from OpenAI and Llama from Meta. These models are able to do this by using several novel deep learning techniques and training on massive datasets. An LLM usually contains billions of parameters, hence the name Large Language Model.
One thing to know about LLMs is that they are based on the Transformer architecture, a type of neural network that is able to track relationships in sequential data (like words in a sentence, or a line of code) by learning context, and thus meaning. A Transformer model can have different architectures and comprises an encoder, a decoder, or both, depending on the specific task the Transformer is used for. The encoder maps an input sequence to a sequence of continuous representations. The decoder then uses the encoder output, together with the decoder output that was generated at the previous time step, to generate an output sequence.
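To make the idea of "tracking relationships" a bit more concrete, here is a minimal sketch of scaled dot-product self-attention in plain Python. It is a simplification, not how a production Transformer is implemented: each token vector serves as its own query, key, and value (real models learn separate projection matrices), and the two-dimensional "embeddings" are made-up numbers.

```python
import math

def softmax(scores):
    """Turn raw scores into positive weights that sum to 1."""
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(vectors):
    """Scaled dot-product self-attention without learned projections.

    Assumption: every token vector is used as query, key, and value.
    Returns the context-mixed vectors and the attention weights.
    """
    dim = len(vectors[0])
    outputs, all_weights = [], []
    for query in vectors:
        # Similarity of this token to every token, scaled by sqrt(dim).
        scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(dim)
                  for key in vectors]
        weights = softmax(scores)
        # New representation: weighted average of all token vectors.
        mixed = [sum(w * value[i] for w, value in zip(weights, vectors))
                 for i in range(dim)]
        outputs.append(mixed)
        all_weights.append(weights)
    return outputs, all_weights

# Toy 2-dimensional vectors for three tokens (made-up numbers).
tokens = ["the", "cat", "sat"]
vectors = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]
outputs, weights = self_attention(vectors)
for token, row in zip(tokens, weights):
    print(token, [round(w, 2) for w in row])
```

Each printed row shows how much one token "attends" to every token in the sequence. Tokens with similar vectors receive higher weights, which is the mechanism that lets the model build context-aware representations.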
The ability to understand and generate language makes LLMs very well suited for:
- Customer service applications in the form of chatbots (question answering)
- Content creation in the form of writing articles, social media posts, etc.
- Software development, by reviewing and helping to generate code
- Sentiment analysis
- Text classification
While most companies working with AI have used MLOps to apply machine learning, incorporating the capabilities of an LLM into your product brings a new set of challenges, which will be discussed in the next chapter.
MLOps vs LLMOps
MLOps focuses on bringing ML (Machine Learning) models into production, while maintaining and monitoring them in a reliable and efficient way. LLMOps focuses specifically on LLMs in the areas where ML, DevOps, and Data Engineering overlap. As such, the stages of MLOps and LLMOps are mostly the same, but the usage of a foundation model changes the emphasis of some stages:
Training, deploying, and refining LLMs requires a significant amount of computational resources. GPUs are often used for these processes, and they can be difficult to access, not to mention expensive. The main cost to consider with LLMs is that of inference.
What is LLMOps?
LLMOps focuses on closing the gap between the development of LLMs and the deployment of these models. This includes:
- Selection of a foundation model
- Adaptation to your own use case
- Evaluation of the model
- Deployment of the model
- Monitoring of the deployed model
Let’s dive into each of these steps.
Selection of a pre-trained LLM model
Foundation models are generally made available to the public as proprietary models or as open-source models. Proprietary models are models that you can’t download yourself, and are only accessible using a UI or via Application Programming Interface (API) calls. For example, if you want to make use of OpenAI’s GPT-4 model, you need to get a subscription. Only then can you access the model using its API. Proprietary models are usually larger than open-source models and often perform better. The fact that they are available off-the-shelf makes them easy to use as well. Proprietary models have a few downsides: a lack of flexibility for adaptation, and higher costs due to having to use them through APIs. In some specific cases laws or regulations prevent you from working with proprietary models, due to restrictions on what kind of data you are allowed to send to external proprietary LLMs. Examples of proprietary models are: OpenAI’s GPT-3.5 & GPT-4 models, PaLM 2 from Google, Claude v1 from Anthropic, and Cohere.
The other option is to make use of open-source models. These models are often hosted and organised on platforms like Hugging Face. While these models are generally smaller than proprietary models, they do offer more flexibility and are usually cheaper to use. Being open-source also grants users the ability to deploy these models in their own data centres, which is particularly useful when working with sensitive data. The most powerful open-source model available at the time of writing is the LLaMA 2-70B model from Meta, which contains approximately 70 billion parameters. Other examples of open-source models are: Stable Diffusion by Stability AI, BLOOM by BigScience, and Flan-T5 by Google. OpenAI also provides some open-source models like Point-E, Whisper, and Jukebox.
In short, proprietary models are only available through a subscription or license. While these models are easy to use, and in most cases have more parameters (which often translates into better performance) than open-source models, they can in some cases be limited in how you can use them. This is not the case with open-source models, which are free to access and can be used and adapted with far fewer restrictions. These models can be harder to set up than proprietary models though.
Adaptation to your own use case
When an LLM doesn’t have the answer to a prompt, it may hallucinate. This means that the model will start to generate false information. Hallucinations can be deviations from contextual logic, external facts or a combination of both. These hallucinations can often appear to be true due to the fact that LLMs are designed to produce fluent, coherent text. To reduce the risk of this you can apply several techniques:
- Prompt engineering: this method involves tweaking prompts to increase the likelihood of responses that match your expectations. The prompt is the input that is provided to the LLM. You can add context, or use a specific tone, style, or format to “guide” the model to generate the desired and contextually relevant response. Prompt engineering helps you to get the most out of your LLM since a desired output structure can be achieved by simply providing effective prompts. When prompt engineering has been applied optimally it can also reduce the inference time, which in turn may help with reducing cost. However, even after implementing prompt engineering, a model still may not be able to provide consistent or correct output. Prompt engineering requires a lot of trial and error before achieving the required results, so domain knowledge is a must to craft effective prompts for different models and different tasks. You can also make use of external embeddings. Embeddings are numerical representations of sentences, phrases, or words (usually stored in a vector database) that capture their context and meaning. A separate embedding model, usually connected to the LLM via an API, maps the input of a user to a point in a high-dimensional space, where each dimension captures a part of the input, so that inputs with a similar meaning end up close together. Dogs and puppies are more similar than dogs and cats for example, and dogs and cats are more similar than cats and cars. An LLM can then use these mappings to better understand user inputs. Let’s say that you want to build a chatbot for movie recommendations. You could use an embedding API that is connected to a database containing movie summaries, retrieve the summaries that are most similar to the user’s prompt, and add them to the prompt before it is fed to the LLM. This would help to improve the response of your chatbot, given the context. Note that when using external embeddings, the weights and parameters of the LLM itself are not altered, which is why it’s considered to be a form of prompt engineering.
- Fine-tuning: this method involves training a model (i.e. changing some of its pre-trained weights) on how to respond to prompts, reducing the need to provide highly specific prompts. This can be achieved by re-training a pre-trained LLM on a smaller, specialised, labelled dataset to improve its performance in a particular domain or to adapt it to a particular task. The idea is that you use new data to update some of the parameters of the LLM for new settings or repurpose it for new applications. Let’s look at an example of fine-tuning a different type of pre-trained model, in this case a Convolutional Neural Network (CNN). A CNN is generally used to classify images or detect objects in them. Suppose our CNN was trained on tens of thousands of pictures of passenger cars in an urban setting. Perhaps you want this model to be able to recognise trucks as well, so that your model can be used on highways too. You can choose to retrain the entire model on trucks and cars, but cars and trucks have a lot of visual features in common so it would be more efficient to provide the existing model with a smaller training set of trucks. This data set may only contain a few hundred or a few thousand images. After several epochs of training, your CNN will be optimised for the new application. Under the hood, the fine-tuning process has updated the model’s parameters to match the distribution of the new dataset. This technique can be applied to LLMs as well, fine-tuning models for tasks like identifying symptoms of diseases, predicting stock prices based on financial news, or sentiment analysis in product reviews.
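The embedding idea from the prompt engineering bullet can be sketched in a few lines of Python. This is only an illustration: the three-dimensional vectors are hand-made stand-ins for what a real embedding model would return, the meaning of each dimension is invented, and `most_similar` and `build_prompt` are hypothetical helper names.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors (1.0 = same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hand-made vectors standing in for real embeddings.
# Invented dimensions: "is an animal", "is canine", "is a machine".
embeddings = {
    "dog":   [0.9, 0.9, 0.0],
    "puppy": [0.9, 0.8, 0.0],
    "cat":   [0.9, 0.1, 0.0],
    "car":   [0.0, 0.0, 1.0],
}

def most_similar(query_vector, store):
    """Return the key in `store` whose vector is closest to the query."""
    return max(store, key=lambda name: cosine_similarity(query_vector, store[name]))

def build_prompt(question, context):
    """Put retrieved context in front of the user's question."""
    return f"Context: {context}\nQuestion: {question}\nAnswer:"

dog_puppy = cosine_similarity(embeddings["dog"], embeddings["puppy"])
dog_cat = cosine_similarity(embeddings["dog"], embeddings["cat"])
cat_car = cosine_similarity(embeddings["cat"], embeddings["car"])
print(round(dog_puppy, 3), round(dog_cat, 3), round(cat_car, 3))

# Look up the nearest stored item for a new query vector.
store = {name: vec for name, vec in embeddings.items() if name != "puppy"}
nearest = most_similar(embeddings["puppy"], store)
print(build_prompt("What should I feed it?", f"The user is talking about a {nearest}."))
```

With these toy vectors, dogs and puppies score higher than dogs and cats, which in turn score higher than cats and cars. In a real application the vectors would come from an embedding model and the lookup would be handled by a vector database, but the LLM itself stays untouched in both cases.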
For a better understanding of each of the methods described above, we can compare an LLM to a chef who wants to make pasta bolognese. The chef generally knows how to make the dish, but to help the chef improve their cooking we can do three things:
- We can send them to a pasta cooking course to expand their knowledge on making pasta bolognese (fine-tuning);
- We can add clearer instructions to the recipe to better guide them through the process (prompt-engineering);
- Or we can help the chef by arranging the ingredients in a more intuitive way for them (embeddings).
New alternatives are emerging nearly every day to adapt LLMs to specific use cases. One of them that deserves mentioning is prompt tuning, which combines prompt-engineering with fine-tuning. With prompt tuning a smaller, lightweight, fine-tuned model is placed before the pre-trained LLM. When a user enters a prompt the smaller model generates prompts (soft prompts) that are unreadable for people. These soft prompts are then combined with the user’s original prompt (hard prompt) and sent to the LLM. The foundation model is not altered in any way, making prompt-tuning a low-cost, efficient way for adaptation to downstream tasks.
ML models are usually validated on a hold-out validation set, calculating metrics to indicate model performance. This can be tricky to do for LLMs since responses themselves aren’t necessarily always “good” or “bad”. The performance of an LLM can be determined by looking at features like language fluency, coherence, speech recognition, context comprehension, fact-based accuracy, or the model’s ability to produce relevant and meaningful answers that are appropriate and valuable in a given context.
The deployment of an LLM can be challenging due to the size of the models. In essence, two things are required to deploy an LLM in production: an API so people have access to your model, and a UI for people to interact with your model. Due to the size and complexity of LLMs there are some things to consider before deploying one:
- Costs: when you’re making use of open-source LLMs, or building one on your own, you’re going to have to figure out how to host the model. As previously explained these models require significant computational resources and memory in order to run efficiently. In most cases it is advised to use GPUs or TPUs to reduce the latency of your model. Whether you choose to host your model locally or in the cloud, these resources can be expensive.
- Inference speed: when you’re planning on using LLMs in a live web application, you need to consider inference speed. If you’re planning to use an LLM for a chatbot, for example, the inference speed needs to be fast enough or people will quickly lose interest in using your application. It is advised to determine the desired inference speed before bringing the model into production, so it’s easier to determine what kind of hardware you’ll need.
- Security: protecting user data and preventing the misuse of a model is vital, especially when working with sensitive data. You can overcome this obstacle in different ways: encryption, data anonymization, and access controls are security measures you should consider when deploying your LLMs.
- Infrastructure: an LLM needs to have proper infrastructure in order to function effectively. The following points should be considered:
- A model can be deployed on-premise or in the cloud. “On-prem” deployments are usually preferred for applications where data security is important, while cloud-based deployments are flexible and easier to scale up when necessary.
- Selecting the right hardware that will provide the desired performance is crucial. This includes processing power, storage capacity, and memory.
- Scaling: choosing the right inference option to make sure your model is able to scale with potential demand.
- Reducing memory utilisation and latency while enhancing computational efficiency of a model can be achieved through model compression, quantisation or pruning. Without proper resource optimisation, you may end up accumulating a lot of unnecessary costs.
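To get a feeling for why hardware selection and quantisation matter, you can do a quick back-of-the-envelope calculation of the memory needed just to hold a model’s weights. The sketch below assumes the usual storage sizes per parameter for each numeric format and counts 1 GB as 10^9 bytes. It deliberately ignores everything else that needs memory at inference time (activations, caches, framework overhead), so treat the numbers as a lower bound.

```python
# Bytes needed to store one parameter in each numeric format.
BYTES_PER_PARAMETER = {"float32": 4, "float16": 2, "int8": 1, "int4": 0.5}

def weight_memory_gb(num_parameters, dtype):
    """Memory for the weights alone, in gigabytes (1 GB = 1e9 bytes)."""
    return num_parameters * BYTES_PER_PARAMETER[dtype] / 1e9

params_70b = 70e9  # a 70-billion-parameter model, like LLaMA 2-70B
for dtype in BYTES_PER_PARAMETER:
    print(f"{dtype:>8}: {weight_memory_gb(params_70b, dtype):.0f} GB")
```

For a 70-billion-parameter model this gives 280 GB in 32-bit floating point, 140 GB in 16-bit, 70 GB in 8-bit, and 35 GB in 4-bit. Halving the precision halves the memory footprint, which can be the difference between needing several GPUs and fitting on fewer.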
A quick, easy and cost efficient way of deploying your models is to make use of UbiOps, a platform for running, training, scaling, and managing AI models, alongside Streamlit, an open-source Python framework for quickly building and sharing interactive web applications. With the combined power of both these platforms, you can deploy LLMs with a front-end in no time.
Like with ML models, it’s essential to monitor the performance of an LLM. You will need to be able to identify when your model starts to hallucinate. There are a number of reasons why a model can start hallucinating: training bias, overfitting, and bad prompts being a few of them. You can reduce the impact of this by setting up an anomaly detection system that can flag unusual patterns in responses, or by adding a moderation layer that uses a reliable source to cross-check facts. You can also identify bad responses by monitoring user feedback.
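As a very simple illustration of what "flagging unusual patterns" could look like, the sketch below marks responses whose length is a statistical outlier compared to a baseline of known-good responses. This is an assumption-heavy example: response length is only a crude proxy, the threshold of three standard deviations is an arbitrary choice, and the function name is hypothetical. It will not detect hallucinations by itself, but it shows the general shape of an anomaly check.

```python
import statistics

def flag_unusual_lengths(baseline_lengths, new_lengths, threshold=3.0):
    """Return the indices of new responses whose length is an outlier.

    A response is flagged when its length lies more than `threshold`
    standard deviations away from the mean of the baseline responses.
    The baseline needs at least two different values.
    """
    mean = statistics.mean(baseline_lengths)
    stdev = statistics.stdev(baseline_lengths)
    flagged = []
    for index, length in enumerate(new_lengths):
        z_score = abs(length - mean) / stdev
        if z_score > threshold:
            flagged.append(index)
    return flagged

baseline = [48, 52, 50, 47, 55, 51, 49, 53]  # word counts of known-good answers
incoming = [50, 54, 400, 46]                 # the third answer is unusually long
flagged = flag_unusual_lengths(baseline, incoming)
print(flagged)  # [2]
```

In practice you would track richer signals than length, such as sentiment, topic, or similarity to reference answers, and route flagged responses to a human or to a moderation layer.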
The previous chapter explained the steps involved in each phase of LLMOps. Now it’s time to explore how to carry out these steps in your own business, as well as some tools and platforms you can use.
1. Picking a foundation model
As explained earlier, pre-trained large language models come in two flavours: you can either opt for a proprietary model or an open-source model. The choice of which to use comes down to a number of things, such as their accessibility, the number of parameters they use, their “fine-tunability”, and the type of training data that was used to pre-train them. The training data influences the performance of the model on certain tasks. An LLM that was trained on scientific text will perform better on questions about physics than an LLM that was trained on general text. You can use this information to decide which LLM fits your company’s needs best. The main differences between most LLMs are the number of parameters they have and the data they were trained on. OpenAI didn’t release how many parameters their GPT-4 model uses, but it is rumoured to be 1.5 trillion parameters. The most powerful open-source LLM at the time of writing this article is Meta’s LLaMA 2-70B model, with 70 billion parameters.
Choosing a foundation model depends on the type of application you want to build with it, the resources that your company has, and even the data you want to process with the model. Therefore, we’ve listed the most popular (at time of writing) LLMs in a table below, along with their providers, whether they’re open-source or not, their number of parameters, their “fine-tunability”, and what data they were trained on. You can use this table to help you decide which LLM fits your needs best. Keep in mind that not all available LLMs are listed here, as new ones are released every day, so it’s still advised to do your own research before picking a foundation model.
2. Adapt your foundation model to your own use-case
Two of the main techniques to adapt an LLM to downstream tasks are fine-tuning and prompt engineering. Granted there are alternatives to these methods, and as is the case with many other things in the LLMOps field, new ways of adapting your foundation models are emerging every day (like prompt-tuning). We’ll focus on fine-tuning and prompt-engineering for this article, since these are the most proven techniques at the time of writing this article.
Before deciding whether to use fine-tuning or prompt-engineering you’ll need to consider three things: performance, costs, and data availability. If available data is scarce, an easy and quick way to get started is prompt-engineering. Do keep in mind that the maximum input token length can limit the number of examples you can include in your prompt. Fine-tuning doesn’t have a limit on how many examples you can use to fine-tune a model. The number of examples you’ll need for fine-tuning depends on the task. As a rule of thumb, a noticeable difference in a model’s performance can be expected once you use hundreds of examples or more. A 2021 study by Le Scao and Rush concluded that a prompt is often worth hundreds of examples. The general trend is that as you increase the number of examples, fine-tuning will produce a better performing model than prompting.
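The effect of the maximum input length on prompt-engineering can be sketched as follows. The helper below adds as many examples to a few-shot prompt as fit within a token budget. Note the assumptions: tokens are approximated by counting whitespace-separated words (a real tokenizer will give different counts), and the function names and the `Input:`/`Output:` layout are illustrative choices, not a standard.

```python
def count_tokens(text):
    """Rough token count: whitespace-separated words (an approximation)."""
    return len(text.split())

def build_few_shot_prompt(instruction, examples, query, max_tokens):
    """Add as many (input, output) examples as fit within max_tokens.

    Returns the prompt and the number of examples that were included.
    """
    query_block = f"Input: {query}\nOutput:"
    used = count_tokens(instruction) + count_tokens(query_block)
    parts = [instruction]
    included = 0
    for example_input, example_output in examples:
        block = f"Input: {example_input}\nOutput: {example_output}"
        cost = count_tokens(block)
        if used + cost > max_tokens:
            break  # the budget is full; remaining examples are dropped
        parts.append(block)
        used += cost
        included += 1
    parts.append(query_block)
    return "\n\n".join(parts), included

instruction = "Classify the sentiment of each review as positive or negative."
examples = [
    ("Great battery life", "positive"),
    ("Stopped working after a week", "negative"),
    ("Exactly what I needed", "positive"),
]
prompt, included = build_few_shot_prompt(
    instruction, examples, "The screen is too dim", max_tokens=30
)
print(included)
print(prompt)
```

With a budget of 30 "tokens" only the first example fits. Raising the budget lets more examples through, which is exactly the trade-off described above: a prompt can only carry as many examples as the context window allows, whereas a fine-tuning dataset has no such limit.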
The two main benefits that come with fine-tuning your model are that you can get a better performance out of your model, by providing it with more use-case specific examples in the form of a dataset, and the fact that you can reduce the cost of a prediction: the more relevant content your LLM is able to use, the fewer instructions (i.e., costs) you’ll have to put in your prompt. However, there are a number of situations when fine-tuning is not advised:
- Models that are only available through API don’t always provide the option to fine-tune the model, or only in a limited manner.
- Fine-tuning a model can require a lot of data, which may not always be available. This can of course depend on the task of the application you want to use it for.
- When the data changes frequently, like in news-related applications.
- When the application that you want to use your model for is context-sensitive and dynamic. Fine-tuning a separate model on each user’s data in order to customise the output for every individual user is generally not feasible.
Several fine-tuning techniques can be used to adapt a pre-trained foundation model to your own use case. Remember that fine-tuning is about adjusting the weights and parameters of a model to improve its performance on a specific task. In most cases updating the knowledge of an LLM is enough to adapt it to your own use case, which can be achieved by applying unsupervised fine-tuning. Updating the knowledge of a model can be done by using an unstructured dataset, e.g. scientific papers or articles. The goal here is to provide the model with enough tokens to be representative of a desired domain.
Occasionally, updating the knowledge of an LLM is not enough and the behaviour of the model must be modified instead. This can be done by applying supervised fine-tuning, which works by supplying the model with a dataset that contains a collection of prompts and their corresponding responses. These datasets can be created manually by users, or generated by other LLMs.
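A supervised fine-tuning dataset is often stored as one JSON object per line (JSONL). The sketch below shows what preparing such a file could look like. The field names `prompt` and `response` are an assumption for illustration: fine-tuning tools and providers each expect their own schema, so check the documentation of the tool you use.

```python
import json

def to_jsonl(records):
    """Serialise prompt/response pairs as one JSON object per line."""
    lines = []
    for record in records:
        # Reject incomplete pairs early; they would only add noise.
        if not record.get("prompt") or not record.get("response"):
            raise ValueError(f"Incomplete record: {record!r}")
        lines.append(json.dumps(record, ensure_ascii=False))
    return "\n".join(lines)

records = [
    {"prompt": "Summarise: The meeting moved to Friday.",
     "response": "The meeting is now on Friday."},
    {"prompt": "Translate to French: Good morning",
     "response": "Bonjour"},
]
jsonl_text = to_jsonl(records)
print(jsonl_text)
```

Each line can be read back independently with `json.loads`, which makes it easy to stream large datasets and to append new examples over time.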
As explained above, prompt engineering involves different techniques to tweak an input to get a desired output. There are several techniques that you can apply for prompt engineering. Prompt Engineering Guide created a guide that explains all the techniques and lists a number of tools that you can use for prompt engineering.
Below are various techniques, gathered by Prompt Engineering Guide, with a short explanation of how they work and supporting papers for each technique. You can find a more extensive explanation of the techniques, examples of prompts that could be used, and their output by clicking on their names.
3. Evaluating your model’s performance
At the time of writing this article there is no set standard to evaluate LLMs, but there are several methods that you can use to measure coherence, language fluency, or other performance metrics mentioned above:
- Perplexity: which quantifies how well the model predicts a sample of text, i.e., how well the model can predict the next word based on the prior content. The lower the score, the better the ability of the model to predict the next word.
- Human evaluation: where human evaluators assess the quality of the LLM’s response. The quality of the output can be based on a number of criteria like: the relevance, the coherence, the fluency, and the overall quality.
- BLEU (Bilingual Evaluation Understudy): which compares the generated output with one or more reference translations and measures the similarity. The BLEU score is calculated by multiplying the brevity penalty (which penalises outputs that are shorter than the reference) by the geometric mean of the n-gram precisions (which measure how many word sequences in the output also appear in the reference). The BLEU score ranges between zero and one, where zero means no overlap between the generated and reference translations, and one is a perfect match.
- ROUGE (Recall-Oriented Understudy for Gisting Evaluation): which evaluates the quality of summaries by comparing the generated summary with a reference summary to calculate precision, recall, and F1-score. The ROUGE score can vary between zero and one, as with the BLEU score. As a rough rule of thumb, a ROUGE score higher than 0.5 is considered very good, although this depends on the ROUGE variant and the task.
- Diversity: which analyses metrics such as the n-gram diversity or the semantic similarity between generated responses to determine the variety and uniqueness. The higher the diversity score is, the more diverse and unique the outputs are.
Note that the metrics mentioned above are just some examples that you can use to evaluate your model. Depending on your use case, you might want to use other metrics.
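To show how the automatic metrics above work mechanically, here is a simplified, self-contained sketch of perplexity, BLEU, ROUGE-1, and an n-gram diversity score. These are teaching versions under stated assumptions: texts are already split into word tokens, BLEU uses a single reference and n-grams up to length 2 (the standard setup uses up to 4 and smoothing), and the diversity score is the share of unique n-grams. For real evaluations you would normally use an established library instead.

```python
import math
from collections import Counter

def ngrams(tokens, n):
    """All consecutive n-token sequences in a token list."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def perplexity(token_probabilities):
    """exp of the average negative log-probability assigned to each token."""
    nll = -sum(math.log(p) for p in token_probabilities)
    return math.exp(nll / len(token_probabilities))

def bleu(candidate, reference, max_n=2):
    """Brevity penalty times the geometric mean of clipped n-gram precisions."""
    log_precisions = []
    for n in range(1, max_n + 1):
        cand_counts = Counter(ngrams(candidate, n))
        ref_counts = Counter(ngrams(reference, n))
        # Clip each n-gram count by how often it occurs in the reference.
        overlap = sum(min(c, ref_counts[g]) for g, c in cand_counts.items())
        total = sum(cand_counts.values())
        if total == 0 or overlap == 0:
            return 0.0
        log_precisions.append(math.log(overlap / total))
    if len(candidate) > len(reference):
        brevity_penalty = 1.0
    else:
        brevity_penalty = math.exp(1 - len(reference) / len(candidate))
    return brevity_penalty * math.exp(sum(log_precisions) / max_n)

def rouge_1(candidate, reference):
    """Unigram-overlap precision, recall, and F1."""
    overlap = sum((Counter(candidate) & Counter(reference)).values())
    precision = overlap / len(candidate)
    recall = overlap / len(reference)
    f1 = 0.0 if overlap == 0 else 2 * precision * recall / (precision + recall)
    return precision, recall, f1

def distinct_n(responses, n=2):
    """Share of n-grams across all responses that are unique."""
    grams = [g for tokens in responses for g in ngrams(tokens, n)]
    return len(set(grams)) / len(grams)

reference = "the cat sat on the mat".split()
candidate = "the cat is on the mat".split()

print(round(perplexity([0.25, 0.25, 0.25, 0.25]), 2))
print(round(bleu(candidate, reference), 3))
print([round(x, 3) for x in rouge_1(candidate, reference)])
print(round(distinct_n([candidate, reference]), 2))
```

A model that assigns probability 0.25 to every token has a perplexity of 4: it is as uncertain as if it were choosing between four equally likely words. The candidate sentence differs from the reference in one word, so it scores well below 1 on BLEU while keeping a fairly high ROUGE-1.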
4. Deploying your Large Language Model
After you’ve finished adapting your LLM to your use case, you are ready to deploy it and bring it into production. The process of deploying depends on where you got your model from. In order to use an LLM for your own application you need to:
- Expose it via an API: this makes it possible for people to interact with your model (in some cases the API is provided by a platform like UbiOps).
- Create a User Interface (UI): like a chatbot, command-line tool, or web interface.
Proprietary models like GPT-4 already have an exposed API that you can use. In those cases you just need to create a UI for people to interact with. This is not the case for open-source LLMs. For open-source models you can use a platform like UbiOps, which exposes your LLM with an API and takes care of auto scaling. Or follow these steps:
- Select a programming framework that is suitable for deploying LLMs
- Set up a UI
- (Optional) Expose your model through an API. As mentioned before, this only needs to be done when you are going to make use of an open-source model.
- Make sure the model accepts users’ inputs.
- In some cases the generated output needs to be post-processed to make it more user-friendly or coherent.
- Set up monitoring tools to keep track of how your model is performing, and how it is used.
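To give an idea of what exposing a model through an API involves, here is a minimal sketch that uses only Python’s standard library. It is not how UbiOps or any other platform does it. The `generate` function is a placeholder for a real model call, the request format (a JSON body with a `prompt` field) is an assumption, and a production service would add authentication, logging, batching, and a proper web framework.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer

def generate(prompt):
    """Placeholder for a real model call (hypothetical stand-in)."""
    return f"Echo: {prompt}"

def handle_payload(raw_body):
    """Validate a JSON request body and return (status_code, response_dict)."""
    try:
        payload = json.loads(raw_body)
    except ValueError:  # covers malformed JSON and undecodable bytes
        return 400, {"error": "body must be valid JSON"}
    prompt = payload.get("prompt") if isinstance(payload, dict) else None
    if not isinstance(prompt, str) or not prompt.strip():
        return 400, {"error": "field 'prompt' must be a non-empty string"}
    # Post-process the raw output before returning it to the user.
    return 200, {"response": generate(prompt).strip()}

class LLMHandler(BaseHTTPRequestHandler):
    """Accepts POST requests and answers with the model output as JSON."""

    def do_POST(self):
        length = int(self.headers.get("Content-Length", 0))
        status, body = handle_payload(self.rfile.read(length))
        encoded = json.dumps(body).encode("utf-8")
        self.send_response(status)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(encoded)))
        self.end_headers()
        self.wfile.write(encoded)

def make_server(port=8000):
    """Create (but do not start) an HTTP server bound to localhost."""
    return HTTPServer(("127.0.0.1", port), LLMHandler)
```

Calling `make_server().serve_forever()` starts the service locally, after which a UI (or a tool like curl) can send a POST request with a body such as `{"prompt": "Hello"}`. Keeping the validation in `handle_payload` separate from the HTTP handler makes it easy to test the input handling and post-processing steps without starting a server.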
5. Monitor your deployed model’s performance
Typically two types of metrics are used to monitor performance: performance metrics & quality metrics. Performance metrics can give you an insight into how efficient and capable your model is. Some examples of performance metrics are:
- Tokens Per Second (TPS): which represents the number of tokens your model can generate in a second.
- Queries Per Second (QPS): which gives an insight into the number of queries your model processes in a second.
- Latency: which shows you how long it takes for your model to return a response to a user’s request.
Quality metrics are focused on the quality of responses, and are measured via monitoring:
- The quality of the responses themselves, i.e. the readability, understandability, and how well written the responses are.
- Whether your LLM is responding with relevant content, i.e. whether the responses fit the topics expected by the application.
- Whether your LLM is responding in the right tone: in some cases certain prompts might cause your LLM to change its sentiment when this isn’t expected.
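The performance metrics listed above can be computed from a simple request log. The sketch below assumes each log record stores the latency of a request and the number of tokens that were generated. The field names, the `summarise` helper, and the choice to divide tokens by the summed generation time are all assumptions for illustration. Monitoring tools may define these metrics slightly differently.

```python
import statistics

def summarise(request_log, window_seconds):
    """Compute throughput and latency figures from request records.

    Each record is a dict with 'latency_s' (seconds until the response
    was returned) and 'output_tokens' (number of generated tokens).
    """
    latencies = [record["latency_s"] for record in request_log]
    total_tokens = sum(record["output_tokens"] for record in request_log)
    return {
        # Queries handled per second of wall-clock time in the window.
        "queries_per_second": len(request_log) / window_seconds,
        # Tokens generated per second of time spent answering requests.
        "tokens_per_second": total_tokens / sum(latencies),
        "mean_latency_s": statistics.mean(latencies),
        "max_latency_s": max(latencies),
    }

# Four requests observed during a 10-second window (made-up numbers).
log = [
    {"latency_s": 0.5, "output_tokens": 50},
    {"latency_s": 1.0, "output_tokens": 100},
    {"latency_s": 1.5, "output_tokens": 150},
    {"latency_s": 1.0, "output_tokens": 100},
]
summary = summarise(log, window_seconds=10)
for name, value in summary.items():
    print(f"{name}: {value:.2f}")
```

For this toy log the model handled 0.4 queries per second and generated 100 tokens per second, with a mean latency of 1 second. Tracking these values over time, rather than as a single snapshot, is what lets you spot a degrading deployment.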
This chapter will focus on where you can find the tools and platforms that can help you with each step of LLMOps, complemented with examples that have already been proven to be effective.
1. Where to get your foundation model?
Most proprietary models are accessible through their own developers. As such you can get these models on their websites. In most cases you’ll be charged for how many tokens are processed. OpenAI defines tokens as pieces of words, where 1000 tokens is roughly equal to 750 words. Note that tokens for both the input and the output are charged.
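You can use the rule of thumb above to make a rough estimate of what a request will cost. In the sketch below the prices per 1000 tokens are hypothetical placeholders, not the actual prices of any provider, and the word-to-token conversion is only an approximation, so always check your provider’s pricing page and tokenizer for real numbers.

```python
WORDS_PER_1000_TOKENS = 750  # rule of thumb: 1000 tokens is about 750 words

def estimate_tokens(word_count):
    """Approximate the number of tokens for a given number of words."""
    return round(word_count * 1000 / WORDS_PER_1000_TOKENS)

def estimate_cost(input_words, output_words, price_in_per_1k, price_out_per_1k):
    """Estimate the cost of one request; input and output are both charged."""
    input_tokens = estimate_tokens(input_words)
    output_tokens = estimate_tokens(output_words)
    return (input_tokens / 1000 * price_in_per_1k
            + output_tokens / 1000 * price_out_per_1k)

# Hypothetical prices per 1000 tokens, for illustration only.
cost = estimate_cost(
    input_words=1500, output_words=750,
    price_in_per_1k=0.01, price_out_per_1k=0.03,
)
print(f"Estimated cost per request: {cost:.2f}")
```

A prompt of 1500 words is about 2000 tokens and a response of 750 words is about 1000 tokens, so with these placeholder prices one request would cost 0.05. Multiplying that by the expected number of requests per day gives a first impression of your inference budget.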
Open-source models are widely available on Hugging Face and GitHub. From these websites you can download the models locally or store them in a cloud bucket, where you can fine-tune them if necessary and start deploying them.
2. Adapt your model to downstream tasks
Some companies that offer proprietary models also offer ways to fine-tune their models. OpenAI for example offers functionality to fine-tune some of their models, using their API, for an additional fee. For open-source models you can use several Python libraries, such as PyTorch or AutoTrain from Hugging Face.
For prompt engineering you can use several packages and/or plug-ins that can provide the extra context from a user’s prompt. You can find these packages, like LangChain for example, on websites like GitHub. The Prompt Engineering Guide, mentioned above, also lists dozens of tools that you can use for prompt engineering.
There isn’t a set procedure for evaluating LLMs, but there are open-source tools that can help. OpenAI released an evaluation tool on GitHub that you can use to benchmark and evaluate LLMs, and there are other tools as well, such as EvidentlyAI.
In 2023 the paper “A Survey on Evaluation of Large Language Models” was published, which reviews and lists several techniques that you can use to evaluate LLMs. The paper also has an official GitHub page that’s updated more frequently and lists several tools and benchmarks that you can use for evaluation. It is definitely worth a read!
If you’re going to make use of a proprietary model this step is already taken care of. The inference of a model is performed via the model provider’s API. The only thing you need to do in this case is connect your application to the API provided by the company that owns the model.
For open-source models you could opt for one of the major cloud providers, but if you’re working with sensitive data this might not be an option for you since running your model on an on-premise installation is often preferred in these types of situations. The lack of customization and flexibility are also reasons to avoid using one of those solutions. Many cloud providers also require you to reserve compute resources which could result in you paying for more than you actually end up using.
If you want to be able to deploy your LLMs yourself and integrate them easily in your products and services, you can choose to work with a platform like UbiOps. It helps you deploy off-the-shelf models easily and run them in the cloud or on your own infrastructure. UbiOps equips your LLM with a scalable API endpoint, which makes it easy to combine or integrate it in a product or service. Instant on-demand access is provided to powerful GPUs with automatic scaling and scale-to-zero, which means you don’t pay for idle time.
You can follow this guide for an example of how you can deploy LLaMA 2 on UbiOps with a customizable front-end in under 15 minutes.
For the monitoring of LLMs you can use open-source tools like the previously mentioned EvidentlyAI, or LangKit from WhyLabs. WhyLabs also offers a platform for monitoring your LLM’s performance. Another platform that you can use for monitoring is Arize, which you can also use to detect problematic prompts and/or responses.
Combining different tools can add a lot of complexity to your workflow. Luckily, some of the tools mentioned above can be combined fairly easily. UbiOps easily integrates with WhyLabs and Arize, simplifying the integration process of LLMOps into your day-to-day operations.
And that is everything you need to know about LLMOps!
This article will be updated every month with new insights, information and techniques. If there is any information missing on this page, don’t hesitate to contact us so we can add it as soon as possible.
What to read next?
If you’re curious to read more about LLMOps, and other important matters for LLMOps you can have a look at the following articles: