LLMOps is MLOps for LLMs. To explain what that means, we should start by explaining what MLOps is. MLOps is DevOps for ML-based software! So we should start the explanation at DevOps.
DevOps exists to solve some of the challenges that arise in the software development and deployment process. It consists of methods that make sure software can be developed quickly, with flexibility, quality assurance and efficiency in mind. MLOps expands on these methods to solve issues that arise with ML specifically, like model drift. Finally, LLMOps expands on MLOps to solve issues that arise with LLMs specifically. With LLMs, many of the challenges of working with ML remain, but the size of the models adds new ones, such as inference latency and cost.
MLOps: applying machine learning in business
MLOps is a set of practices that aims to streamline the development, deployment, and maintenance of ML models in a production environment. These practices span the end-to-end creation of an ML application, so you can imagine there’s a lot to take into consideration. But not to worry! We’ll help you understand. Check out our detailed article about MLOps: Here’s What You Need to Know About MLOps!
Briefly, MLOps can be divided into four sections. In the Project Initiation phase, the goal is to clearly define the problem to which an ML tool is the solution. The Feature Engineering pipeline takes incoming data and molds it into features that the ML model can use. The Experimentation pipeline is for the optimisation of the ML model. Finally, the automated ML Workflow pipeline refines ML models and can be triggered for a number of reasons. When triggered, it will use previously unseen data to (re-)optimize the ML model.
LLMOps: MLOps for LLMs
LLMOps further builds onto the four cornerstones of MLOps by adding practices specific to the use of LLMs. It involves the practices and strategies for effectively using, deploying, and managing large language models within various applications. LLMOps covers tasks such as fine-tuning models for specific tasks, optimizing inference performance across multiple different metrics, monitoring model behavior, and ensuring ongoing model relevance and accuracy. It is also more focused on data volume and quality.
How LLMOps works
So what does that look like in more detail? What do you need to start with LLMOps, how should you go about setting it up, and what would the result look like in practice?
What are the components of LLMOps?
The components you need to get started with LLMOps do not really differ from what you need for MLOps, so for a detailed description check out these articles about getting started with MLOps. The main difference between LLMOps and MLOps is the use of a foundation model (FM), which needs to be adapted to your own use case with one of the adaptation techniques described in this article, such as fine-tuning or prompt engineering.
In general, the LLMOps cycle can be divided into the following phases:
- Selection of a FM
- Adaptation to your own use case
- Evaluation
- Deployment
- Monitoring
The workflow for LLMOps can be roughly visualized as follows:
The FM you need depends on the type of application you want to use the model for, taking into consideration the performance, size and compatibility of the model. FMs are either open-source or proprietary. Open-source models are usually smaller than their proprietary counterparts, but offer more customization and the ability to self-host the model. Proprietary models like GPT-4 are usually only accessible through APIs, and charge you for both the prompt tokens and the sampled tokens. To give you an idea of the costs, at the time of writing this article OpenAI’s GPT-4 starts from $0.03/1k prompt tokens and $0.06/1k sampled tokens.
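To make the pricing model concrete, here is a minimal sketch of a cost estimate for a proprietary, token-priced model. It assumes $0.03 per 1,000 prompt tokens and $0.06 per 1,000 sampled tokens, based on the GPT-4 figures mentioned in this article. The function name and the workload numbers are hypothetical and only serve as an illustration; actual prices change over time and differ per model.

```python
# Assumed prices in USD per 1,000 tokens (illustrative; check current pricing).
PROMPT_PRICE_PER_1K = 0.03
SAMPLED_PRICE_PER_1K = 0.06


def estimate_request_cost(prompt_tokens: int, sampled_tokens: int) -> float:
    """Return the estimated cost in USD of a single API request."""
    prompt_cost = prompt_tokens / 1000 * PROMPT_PRICE_PER_1K
    sampled_cost = sampled_tokens / 1000 * SAMPLED_PRICE_PER_1K
    return prompt_cost + sampled_cost


# Hypothetical workload: 10,000 requests per day, each with
# 1,500 prompt tokens and 500 sampled tokens.
cost_per_request = estimate_request_cost(1500, 500)
daily_cost = 10_000 * cost_per_request
print(f"Per request: ${cost_per_request:.3f}, per day: ${daily_cost:.2f}")
```

Because sampled tokens are priced higher than prompt tokens in this example, limiting the length of the generated output tends to reduce cost more than shortening the prompt by the same number of tokens.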
In most cases you need to adapt the pre-trained model to your own use case by applying prompt engineering, fine-tuning, or an alternative like prompt-tuning, to improve its performance on the task you want to use it for. Which technique you choose depends, again, on your use case. In most cases prompt engineering (like few-shot prompting or RAG) is the most cost-effective option and will give you the desired performance.
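As an illustration of few-shot prompting, the sketch below builds a prompt from an instruction, a handful of labeled examples and a new input. The sentiment classification task, the `Review:` / `Sentiment:` layout and the function name are assumptions made for this example. The resulting string would be sent to whichever LLM you have selected; that call is left out on purpose.

```python
from typing import List, Tuple


def build_few_shot_prompt(
    instruction: str,
    examples: List[Tuple[str, str]],
    query: str,
) -> str:
    """Combine an instruction, labeled examples and a new input into one prompt."""
    parts = [instruction, ""]
    for text, label in examples:
        parts.append(f"Review: {text}")
        parts.append(f"Sentiment: {label}")
        parts.append("")
    # The final block is left open so the model completes the label.
    parts.append(f"Review: {query}")
    parts.append("Sentiment:")
    return "\n".join(parts)


# Hypothetical labeled examples for a sentiment classification task.
examples = [
    ("The delivery was fast and the product works great.", "positive"),
    ("It broke after two days and support never replied.", "negative"),
]
prompt = build_few_shot_prompt(
    "Classify the sentiment of each review as positive or negative.",
    examples,
    "Setup took five minutes and everything just worked.",
)
print(prompt)
```

No model weights are changed with this approach: adapting the model to a new task means editing the instruction and the examples, which is why it is usually cheaper than fine-tuning.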
Evaluating the performance of an LLM can be difficult, since the traditional metrics used in ML might not capture the ability of the model to interpret human language. You can set up task-specific benchmark datasets, and use several metrics to determine the model’s effectiveness. Examples of metrics that you can use for this are the ROUGE and BLEU scores.
The model can be deployed on a cloud-based platform or on-premise. Which option fits best depends on your data security requirements and available resources.
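To give an idea of what such a metric measures, here is a simplified sketch of ROUGE-1, which counts the overlap in single words (unigrams) between a reference text and a model output. It assumes lowercasing and whitespace tokenization, and it leaves out the stemming and other preprocessing that established ROUGE implementations apply, so its scores will not match those tools exactly. The function name and the example sentences are hypothetical.

```python
from collections import Counter
from typing import Dict


def rouge1_scores(reference: str, candidate: str) -> Dict[str, float]:
    """Compute simplified ROUGE-1 precision, recall and F1 from unigram overlap."""
    ref_tokens = reference.lower().split()
    cand_tokens = candidate.lower().split()
    # Counter intersection keeps, per word, the minimum count in both texts.
    overlap = sum((Counter(ref_tokens) & Counter(cand_tokens)).values())
    recall = overlap / len(ref_tokens) if ref_tokens else 0.0
    precision = overlap / len(cand_tokens) if cand_tokens else 0.0
    if precision + recall == 0:
        f1 = 0.0
    else:
        f1 = 2 * precision * recall / (precision + recall)
    return {"precision": precision, "recall": recall, "f1": f1}


scores = rouge1_scores(
    reference="the cat sat on the mat",
    candidate="the cat lay on the mat",
)
print(scores)
```

Here five of the six words overlap, so precision, recall and F1 all come out at 5/6. A word-overlap score like this says nothing about whether the output is factually correct or fluent, which is one reason additional metrics and human feedback are used as well.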
MLOps vs LLMOps
While there is significant overlap between MLOps and LLMOps, the use of LLMs shifts the emphasis of some of the tasks. Latency, for example, is an important factor for MLOps, but a crucial one for LLMOps. Low latency can significantly reduce the time needed for the experimentation phase, and is also a key factor for the user experience when the model goes to production. The table below shows other key differences between MLOps and LLMOps:
Task | MLOps | LLMOps |
Data management | New data needs to be sourced, wrangled, cleaned, and labeled. | Requires a lot of data, which needs to be diverse and representative. When prompt engineering is required, sufficient examples need to be provided. |
Experimentation | Improving ML performance by trying out different model architectures and creating new features. | The LLM’s ability to effectively learn feature representations from raw data makes feature engineering less important. Instead, the emphasis during the experimentation phase lies on prompt engineering and fine-tuning to make the model perform well for a specific task. |
Evaluation | The model’s performance is evaluated on a hold-out validation set. Metrics like accuracy and F1 score are used to determine the model’s performance. | LLMs also need to be assessed on robustness, interpretability, and fairness. Therefore, a broader set of metrics is used, like the BLEU score or ROUGE score. In some cases human feedback is used to evaluate the performance of an LLM. |
Deployment | The deployment phase focuses on staging, split testing (A/B testing), and versioning (rollback). | Robust tools need to be used for the management of training data, the training process, and versioning of models. Systems that detect model drift are also important for keeping track of misaligned inputs and for handling adversarial attacks. |
Monitoring | Tracking live metrics to monitor prediction quality. | Monitoring the LLM’s output for issues such as potential biases and ethical problems, in order to track the model’s performance. |
Main cost | Data collection & model training. | Inference |
LLMOps for fine-tuning models
The key parts of LLMOps are:
- Project Initiation: clearly define the problem that needs to be solved by an LLM, and select which, and what type (open-source/proprietary) LLM you will use.
- Data preparation: prepare a labeled dataset that can be used to fine-tune your chosen LLM for your task, or collect enough examples to apply prompt engineering.
- Adaptation to your own use case: use fine-tuning, prompt engineering, or an alternative to adjust the LLM to your needs.
- Monitoring: feedback loops that track model performance using various quality measures.
- Model serving: deployment of the model to production where it can be used by your users.
These parts are all related to each other. The data from your data source(s) is used to create a dataset, or prompt examples that the chosen LLM can be fine tuned with. Then the model can be exported and deployed to your model serving component. From there, the model and data source(s) are closely monitored. Every time the model drifts too much or the data changes too much, the model needs to be updated again by fine-tuning it on a new dataset, or providing new prompt examples.
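The update trigger described above can be sketched as a simple check on a monitored quality score. This is a minimal sketch under stated assumptions: the quality score (for example an evaluation metric or a user feedback rating scaled to 0-1), the baseline value, the `max_drop` threshold and the function name are all hypothetical, and a production setup would typically also track changes in the input data itself.

```python
from statistics import mean
from typing import List


def needs_update(
    recent_scores: List[float],
    baseline: float,
    max_drop: float = 0.05,
) -> bool:
    """Return True when average recent quality fell more than max_drop below baseline."""
    if not recent_scores:
        # Without recent measurements there is no evidence of drift.
        return False
    return baseline - mean(recent_scores) > max_drop


# Hypothetical quality score measured right after deployment.
baseline_score = 0.80

stable_week = [0.79, 0.78, 0.81]
drifting_week = [0.70, 0.72, 0.74]

print(needs_update(stable_week, baseline_score))
print(needs_update(drifting_week, baseline_score))
```

When the check fires, the model is updated by fine-tuning it on a new dataset or by providing new prompt examples, after which a new baseline is measured.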
LLMOps visualization
Advantages of LLMOps
Adopting LLMs into your organization with LLMOps comes with several key advantages for business:
- Improved Results: In most cases, LLMs provide better results than other ML models for language related tasks.
- Efficient Deployment: LLMOps streamlines the process of deploying LLMs into production environments.
- Scalability: LLMOps enables seamless scaling of LLMs to handle varying workloads, ensuring consistent performance in the most cost-effective manner as demand changes.
- Cost Efficiency: By optimizing resource allocation and usage, LLMOps helps organizations reduce the costs associated with operating large language models where possible.
- Continuous Improvement: Through feedback loops, LLMOps ensures continuous improvement by iterating on models and enhancing their performance over time.
LLMOps with UbiOps
With UbiOps, it is possible to make use of state-of-the-art GPUs and CPUs to accelerate your AI, all in the cloud. When you deploy an LLM to UbiOps, we create a microservice with its own API endpoint. This means you can integrate our processors into your data pipelines and don’t have to buy your own expensive hardware. You only pay for the computational power you use, so there are no unexpected costs and you can scale to zero whenever you want.
Are you interested in how that works? Check out our documentation or book a demo with us!