MLOps: must-know information for getting started with operationalizing machine learning
In recent years, businesses have increasingly been trying to apply machine learning (ML) to their data. This has led to improved efficiency and cost savings for those who succeeded, because ML can be used to automate routine tasks and gain insights from data that would otherwise be difficult or time consuming to find.
Obviously, this makes ML an interesting investment. Unfortunately, only a minority succeed. In 2019, VentureBeat reported that around 90% of machine learning models never make it into production. What can you do to make yourself part of the group that succeeds? The answer to that question lies in understanding Machine Learning Operations (MLOps).
This article summarises the contents of this thorough paper (Kreuzberger, Kühl, Hirschl) on MLOps, and presents some key challenges in machine learning deployment covered in this paper (Paleyes, Urma, Lawrence). Our goal is to help take your MLOps knowledge from limited to something you can use to get started with machine learning in your organisation. So, to kick things off…
What is MLOps?
MLOps is a set of practices that aims to streamline the development, deployment, and maintenance of machine learning models in a production environment. These practices span the end-to-end creation of a machine learning application, so you can imagine there’s a lot to take into consideration. But not to worry! We’ll help you understand.
Why do you need MLOps?
You might be wondering why it’s necessary to turn the development of machine learning models into such a complex task. Why isn’t simply coding a neural network and feeding it data enough? If you take that approach, you will quickly run into issues of consistency, scalability, and, in bigger organisations, miscommunication between departments. A good understanding of MLOps will have a great positive impact on your machine learning projects:
- Improved collaboration: MLOps helps bridge the gap between data scientists and software engineers.
- Improved model quality: MLOps ensures ML models are developed with clean data sets, proper testing, validation, and quality control.
- Increased efficiency: MLOps automates many of the manual processes in the deployment of ML models.
- Scalability: MLOps helps to ensure models can be deployed and maintained at scale, even if the use of your application is rapidly growing.
DevOps vs MLOps
It’s possible that you’re already familiar with DevOps, a set of practices for creating serviceable and maintainable codebases at a fast pace. In many ways, DevOps and MLOps are similar. I mean, even the terms bear a clear resemblance to each other! DevOps and MLOps both focus on improving collaboration and communication between development and operations teams, and on automating processes to increase efficiency and reduce errors.
However, you cannot simply apply all your DevOps knowledge to machine learning and call it MLOps. You could see MLOps as an extension of DevOps, because it adds processes to DevOps that address challenges specific to machine learning. One example is that with MLOps, the continuous integration element needs to be set up so that developers see the datasets as little as possible. The more often a dataset is inspected and tuned against, the less reliable it becomes as an independent check on the model, so limiting exposure is key to preventing overfitted models.
What do you need for MLOps in your organisation?
To successfully bring your idea for a machine learning tool into production, you need to follow a few principles:
- Continuous integration / continuous delivery (CI/CD): develop your application in such a way that you are able to continuously develop it further in a collaboration of multiple developers, without taking it offline ever again!
- Workflow orchestration: coordinate and prioritise tasks.
- Reproducibility: be able to run the same experiment multiple times and get the same results.
- Versioning: keep track of changes in source code and datasets, and log metadata for all models.
- Feedback loops: multiple feedback loops are present in an MLOps workflow, including continuous monitoring of all components to detect errors and unwanted changes. That information is then sent back into the system.
- Continuous ML training and evaluation: through the support of feedback loops, frequently retrain your ML model based on new data and evaluate the performance of new models compared to previous versions.
To follow these principles, it’s important to use some technical components and assign some key responsibilities. Do you want to get an even better understanding of the requirements of MLOps? Then read our list of 10 important requirements!
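To make the reproducibility and versioning principles listed above a little more concrete, here is a minimal Python sketch. It is an illustration under stated assumptions, not a prescribed implementation: the function names, the "score" calculation, and the metadata fields are all hypothetical, and a seeded shuffle stands in for real model training. The idea is simply that fixing the random seed and logging a fingerprint of the dataset lets you rerun an experiment and check that you get the same result.

```python
import hashlib
import json
import random


def dataset_fingerprint(rows):
    """Return a short, stable hash that identifies the exact dataset contents."""
    payload = json.dumps(rows, sort_keys=True).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()[:12]


def run_experiment(rows, seed, learning_rate):
    """Hypothetical 'training' run: a seeded shuffle stands in for real training."""
    rng = random.Random(seed)  # local generator, so the seed fully controls randomness
    shuffled = rows[:]
    rng.shuffle(shuffled)
    score = round(sum(shuffled[: len(shuffled) // 2]) * learning_rate, 6)
    # Metadata logged alongside the result, so the run can be traced and repeated.
    return {
        "dataset_version": dataset_fingerprint(rows),
        "seed": seed,
        "learning_rate": learning_rate,
        "score": score,
    }


data = [3, 1, 4, 1, 5, 9, 2, 6]
first = run_experiment(data, seed=42, learning_rate=0.1)
second = run_experiment(data, seed=42, learning_rate=0.1)
print(first == second)  # same data, seed and settings give the same record
```

In a real project, a version control system and a metadata store would typically take over the job of the dictionary returned here.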
What technical components do you need?
By technical components, we mean specific tools with which to develop your machine learning application, store its data, and maintain it. There are many tools you could possibly use, but we give a general overview of what your selection of tools needs to be able to do, along with an example for each component.
Arguably the most important component is the continuous integration / continuous delivery component. This was already listed above as one of the core principles of MLOps. This component can take the form of a source code repository. GitHub, for example, is an online platform for software developers to collaborate and keep track of changes in their code. It is widely used and has many features that help you with version control (another one of the key principles of MLOps) and CI/CD. Via GitHub Actions, you can take care of building, testing, delivering and deploying changes. Automation like this might look like a big time investment, but if you make that investment early on, it will save you a lot of time and headaches down the line.
Workflow orchestration component
Next is the workflow orchestration component. Via directed acyclic graphs, this component determines the task orchestration of the ML workflow. Stated more simply: these graphs represent the order in which certain tasks are performed. Check out this article about different pipeline orchestrators. By using this software, you will be assisted in workflow orchestration, reproducibility of your ML system, and in the continuous monitoring and evaluation of the ML model.
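As a rough illustration of how a directed acyclic graph determines task order, the sketch below uses `graphlib.TopologicalSorter` from Python's standard library (Python 3.9 or later). The task names are hypothetical, and a real orchestrator does much more (scheduling, retries, logging), so treat this only as a picture of the underlying idea.

```python
from graphlib import TopologicalSorter

# Each task maps to the set of tasks that must finish before it can start.
# The task names are made up for this example.
workflow = {
    "extract_data": set(),
    "validate_data": {"extract_data"},
    "preprocess_data": {"extract_data"},
    "train_model": {"validate_data", "preprocess_data"},
    "evaluate_model": {"train_model"},
    "register_model": {"evaluate_model"},
}


def execution_order(dag):
    """Return one valid order in which the tasks can be run."""
    return list(TopologicalSorter(dag).static_order())


order = execution_order(workflow)
print(order)
```

Because `validate_data` and `preprocess_data` only depend on `extract_data`, either may come first; the graph only guarantees that every task runs after the tasks it depends on.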
Finally, here are a couple of things you need to think about, although you will likely not need many separate pieces of software for them. These are: model training infrastructure, a model registry, metadata stores, a model serving component, and a monitoring component. What these components essentially boil down to is that you need some sort of cloud storage, and some software with which you can run your ML model in the cloud. For these purposes there are many different tools available. On the cloud storage end of things, you can consider Amazon Web Services, Google Cloud Storage, Microsoft Azure, Neptune.ai, and the list goes on. On the model execution side, consider UbiOps and Kubeflow, depending on your Kubernetes knowledge.
What responsibilities are there in MLOps?
There is a set of responsibilities that need to be covered for a successful ML model deployment. Ideally, each would be covered by one person. But we understand that many people that are just starting out do not have the resources to do that. In that case, it is possible to combine responsibilities, or outsource them to third parties. This can be done creatively! Take the skills of your coworkers and yourself into consideration when dividing tasks.
Firstly, a business stakeholder will need to define the business goal to be achieved with ML. Then, a data scientist will need to translate the business problem into an ML problem and take care of model engineering. A data engineer will need to manage data and feature engineering pipelines. Oftentimes, in smaller organisations the data scientist and data engineer will be the same person.
A software engineer will turn the ML problem into a well-engineered product, while a DevOps engineer bridges the gap between this development and the operations and applies DevOps principles like CI/CD and monitoring. Finally, an MLOps engineer performs the interdisciplinary task of managing the ML workflow and monitoring model and ML infrastructure. To reiterate, these tasks can be combined! The tasks of the DevOps and MLOps engineers can be performed by the same person. Or, the software engineer could also be the DevOps engineer. Let the people in your organisation determine who does what if you need to merge roles. The most important thing is that these responsibilities are discussed and divided.
How should you organise MLOps in your organisation?
MLOps practices can be categorised into four sections of development:
- Project initiation
- The feature engineering pipeline
- The experimentation pipeline
- The automated ML workflow pipeline
Developing these parts, and connecting them in the right way, will yield a viable machine learning application that is maintainable and serviceable.
Starting a project is always exciting! In this development phase, the goal is to clearly define the problem to which a machine learning tool is the solution, and what data is necessary to solve it. So the business stakeholder, and the data scientist and data engineer will work together to design the ML system that will be developed. In addition, data will be collected to prepare for the initial data analysis.
Here, it is important to become familiar with the data you have, so check the distribution of the data and its quality. Finally, the data has to be cleaned and labelled. This means that incomplete data points are removed from the set, and every remaining data point has a corresponding target attribute. This target attribute is the value your ML model will learn to predict.
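The cleaning and labelling step can be sketched in a few lines of Python. This is a minimal example with made-up records and hypothetical field names (`age`, `income`, `churned`); it only shows the idea of dropping incomplete data points and checking how the target attribute is distributed.

```python
from collections import Counter


def clean_and_label(records, feature_names, target_name):
    """Keep only records that have every feature and a target value."""
    required = feature_names + [target_name]
    return [
        record
        for record in records
        if all(record.get(name) is not None for name in required)
    ]


# Made-up example data; "churned" is the target attribute.
raw_records = [
    {"age": 34, "income": 52000, "churned": "no"},
    {"age": None, "income": 48000, "churned": "yes"},  # missing feature value
    {"age": 45, "income": 61000},                      # missing target attribute
    {"age": 29, "income": 39000, "churned": "yes"},
]

cleaned = clean_and_label(raw_records, ["age", "income"], "churned")
label_distribution = Counter(record["churned"] for record in cleaned)
print(len(cleaned), dict(label_distribution))
```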
Feature engineering pipeline
The feature engineering pipeline is a sequence of operations which takes incoming data and processes it so that it can be used by the ML model. For a large part, this pipeline does automatically what you do by hand in the project initiation phase. It starts with data extraction, which entails collecting data from all your data sources. After that comes data pre-processing: the pipeline cleans the data, assigns it the correct labels, and uses basic features of the data to calculate more advanced features. The data scientist and data engineer together define how to do this and what these advanced features should be.
What’s important here is that the feature engineering pipeline is automatic. It should take raw data as input, and then output data that can be fed to the ML model, without any further human intervention. This makes it easier later on to optimise the model further. The pipeline can be tweaked to calculate new features, and new data can be processed to be ready for the model without having to manually go through the new data each time.
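A minimal sketch of such an automatic pipeline is shown below. The data sources, field names, and the derived feature (`avg_order_value`) are assumptions made up for the example; the point is that raw data goes in, model-ready features come out, and no manual step sits in between.

```python
def extract(sources):
    """Data extraction: collect records from every source into one list."""
    return [record for source in sources for record in source]


def preprocess(records):
    """Data pre-processing: drop incomplete records and derive a new feature."""
    features = []
    for record in records:
        if record.get("total_spent") is None or not record.get("num_orders"):
            continue  # skip incomplete records and avoid dividing by zero
        features.append({
            "num_orders": record["num_orders"],
            "total_spent": record["total_spent"],
            # Derived ("advanced") feature calculated from two basic ones.
            "avg_order_value": record["total_spent"] / record["num_orders"],
        })
    return features


def feature_pipeline(sources):
    """Run extraction and pre-processing back to back, without manual steps."""
    return preprocess(extract(sources))


# Two hypothetical data sources.
crm_export = [{"num_orders": 4, "total_spent": 200.0}]
web_shop = [
    {"num_orders": 0, "total_spent": 0.0},
    {"num_orders": 2, "total_spent": 30.0},
]

model_ready = feature_pipeline([crm_export, web_shop])
print(model_ready)
```

Adding a new feature later means changing `preprocess` once, after which all new data is processed the same way.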
The experimentation pipeline is for the optimisation of the ML model. Data from the feature engineering pipeline and feedback from the model serving component go in, and then this pipeline will experiment with different algorithms and hyperparameters to find settings that yield the best performance.
The data that goes in is analysed and split into a training set and a validation set, and then different models are trained and validated to find the optimal model for this data. The resulting model is then exported to the CI/CD component, which can pass it on to the automated ML workflow pipeline.
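To give a feel for what experimenting with hyperparameters means, the sketch below runs a very small grid search. Everything in it is an assumption chosen to keep the example short: the data is made up, the model is a one-dimensional ridge regression with a closed-form solution, and the only hyperparameter is the penalty strength. A real experimentation pipeline would compare far richer models, but the loop of "train on one part of the data, score on a held-out part, keep the best setting" is the same.

```python
import random


def split(rows, validation_fraction, seed):
    """Shuffle with a fixed seed, then split into training and validation rows."""
    rng = random.Random(seed)
    shuffled = rows[:]
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * (1 - validation_fraction))
    return shuffled[:cut], shuffled[cut:]


def fit(train, penalty):
    """Fit y = w * x with an L2 penalty (closed-form one-dimensional ridge)."""
    numerator = sum(x * y for x, y in train)
    denominator = sum(x * x for x, _ in train) + penalty
    return numerator / denominator


def mean_squared_error(w, rows):
    return sum((y - w * x) ** 2 for x, y in rows) / len(rows)


def search(rows, penalties, seed=0):
    """Try every penalty and return the best (error, penalty, weight) plus all results."""
    train, validation = split(rows, validation_fraction=0.25, seed=seed)
    results = []
    for penalty in penalties:
        w = fit(train, penalty)
        results.append((mean_squared_error(w, validation), penalty, w))
    return min(results), results


# Made-up data: y is roughly 2 * x plus a little noise.
noise = [0.3, -0.2, 0.1, -0.4, 0.2, 0.0, -0.1, 0.4, -0.3, 0.1, 0.2, -0.2]
data = [(float(x), 2.0 * x + n) for x, n in zip(range(1, 13), noise)]

best, results = search(data, penalties=[0.0, 0.1, 1.0, 10.0])
print("best penalty:", best[1], "validation error:", best[0])
```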
Automated ML workflow pipeline
The automated ML workflow pipeline refines ML models and can be triggered for a number of reasons. When triggered, it will use previously unseen data to further optimise the ML model. It will do the data preparation, split the data into a training set and a validation set, and then fine-tune the hyperparameters by retraining the model.
The hyperparameters it initially uses are the ones that the experimentation pipeline found to be optimal. But if we already found optimal settings in the experimentation pipeline, why do we need this pipeline? There can be a number of reasons why you would want to retrain a model without completely redoing thorough experimentation. For instance, if the monitoring component finds that the performance of the model is decreasing, or your model is suffering from concept drift, this can be solved by fine-tuning instead of a complete model overhaul.
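One simple way to picture such a trigger is a check on recent monitoring scores. The sketch below is an assumption, not a standard rule: the function name, the tolerance of 0.05, the window of three measurements, and the scores themselves are all hypothetical values picked for illustration.

```python
def should_retrain(recent_scores, baseline_score, tolerance=0.05, window=3):
    """Return True when the average of the last `window` scores drops
    more than `tolerance` below the baseline score."""
    if len(recent_scores) < window:
        return False  # not enough measurements to judge yet
    recent_average = sum(recent_scores[-window:]) / window
    return recent_average < baseline_score - tolerance


# Made-up monitoring results, e.g. accuracy measured on fresh production data.
stable = [0.91, 0.90, 0.92, 0.91]
drifting = [0.91, 0.88, 0.84, 0.80]

print(should_retrain(stable, baseline_score=0.91))    # False
print(should_retrain(drifting, baseline_score=0.91))  # True
```

Averaging over a window rather than reacting to a single bad measurement keeps one noisy data point from triggering an unnecessary retraining run.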
Putting it all together
The start of a new ML project has linear progression, where you will clearly define a goal and what type of machine learning model you will use to achieve it. Then you will collect data, determine what data features will be used in the ML model and build a pipeline that can prepare this data for you. You will also build an experimentation pipeline that you can use to optimise your model after making changes to it.
Once you have a model up and running, the automated ML workflow pipeline will continuously be used to refine the model. It is in periodic communication with the CI/CD component, which in turn delivers the best model you currently have to the model serving component.
The results from the model serving component are used to relay important information back to the feature engineering pipeline and experimentation pipeline, which means the changes you make to the ML model are based directly on what happens with your model in practice.
All of this can best be visualised with the flowchart below. If you would like to know in more detail how everything connects together, I highly recommend you check out the MLOps flowchart from this paper, which the flowchart below is based on.
Challenges you will face with MLOps
As MLOps has so many moving parts, you can imagine there are a lot of challenges to overcome. This paper covers many of them, but we want to highlight two challenges that we find especially interesting here.
Training ML models and serving them can become incredibly expensive. The cost of the high-end hardware and electricity required can be eye-watering. With MLOps you will continuously be retraining models after making changes. These models run best on high-end GPUs like the Nvidia A100, which are expensive. Luckily, there are tools to reduce these costs. For example, it’s easy to deploy ML models on UbiOps, which allows you to work with GPUs and CPUs on demand. This way you only pay when the model is actually active, which can decrease cloud costs tremendously.
Machine learning models are subject to new types of security attacks. These attacks can target the model itself, but also your data or data sources. We have written about security in machine learning before! Adversarial machine learning, a class of attacks aimed specifically at ML systems, is a rapidly growing problem. The best ways to avoid becoming a victim of these attacks are to be very careful about which third-party or open source tools you use and which software packages you include in your code, and to keep an eye on cybersecurity newsletters to stay up to date on what’s going on in the field.
MLOps is a set of practices that streamlines the development, deployment, and maintenance of machine learning models in a production environment. While it looks very complex at first glance, the core ideas are not. These core ideas are: clearly define the goal of your ML model, determine what data is necessary to achieve that goal, and then build the ML model. Then, you can use the results to develop it further to make it more accurate.
We covered what technical components are necessary to get your ML model up and running, what responsibilities need to be taken care of, and how you should use pipelines to create feedback loops that will allow you to continuously improve your ML solution. Finally, we discussed two examples of challenges in MLOps, and how you could deal with them.