What I learned about Machine Learning Infrastructure

Machine Learning Infrastructure vs traditional IT Infrastructure

Two years ago I had heard of DevOps and knew bits and pieces of how AI works: you take data from somewhere, train a model, and once you’re satisfied you feed it new data and work with the results. In reality, it’s not that simple. On the contrary, it’s extremely complex, and there are a thousand different flavours (and opinions) on how things should be done. That last point in particular is something I learned. To be more precise, ML systems have all of the “problems” that traditional software has, plus an additional set of ML-specific problems.

The actual physical infrastructure (hardware) is not that different. It’s mainly the processes and the way the hardware is used that differ, which calls for new tooling to support them. Let’s start with a brief review of what IT infrastructure, the overarching term, actually is.

IT infrastructure is the set of components required to operate and manage IT environments. It can be deployed within a cloud computing environment, e.g. Google Cloud, or within an organization’s own facilities. The basic components are hardware, software, and networking (Red Hat, 2021).


So what makes it different for Machine Learning?

The main reason is that ML systems are dynamic in a way that traditional software is not. I use Google’s definition of “ML systems” as given in one of their papers: ML code is only a small part of the larger picture. Besides the code itself, ML systems devote a lot of resources to data collection, data verification, feature extraction, and serving and hosting infrastructure.




Figure 1: Hidden Technical Debt in Machine Learning Systems, Sculley et al., 2015.



Machine learning models (especially training them) require a lot of computing power. CPUs, the go-to hardware for computation, are often no longer sufficient. You’ll need GPUs instead (yes, graphics cards, which are also ‘processing units’ but are better suited to heavy processing tasks). As one source puts it: “Tasks that take minutes with smaller training sets may now take more hours — in some cases weeks — when datasets get larger” (Dsouza, 2020).

Software level

When it comes to the software level, there are many different tools out there, all trying to eliminate technical debt, decrease development and maintenance time, and lower costs. But what’s all the fuss about?

Most importantly, an ML system is “live” rather than static: it changes over time, so the owner has to stay on top of it. The input data may change, the model itself may change (with automatic retraining), and thus the outcomes will change; the load on the system may change as well. That means the traditional DevOps practices that apply to conventional software development are not sufficient on their own. Traditional software deals with code, while ML systems deal with code + data + models (Datarevenue, 2021). That translates into (many) different software products being developed to manage all the components and their life cycles.
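To make the “code + data + models” point a bit more concrete, here is a minimal sketch in Python. It is not taken from any particular tool: it simply assumes that a release is identified by a code version, the training data and the model settings, and the function name and example values are hypothetical.

```python
import hashlib
import json


def release_fingerprint(code_version, training_rows, hyperparameters):
    """Return one hash that changes when the code, the data, or the model settings change."""
    payload = json.dumps(
        {"code": code_version, "data": training_rows, "params": hyperparameters},
        sort_keys=True,  # stable key order, so equal inputs give equal hashes
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


# Hypothetical example values, for illustration only.
baseline = release_fingerprint("v1.4.0", [[1.0, 0], [2.5, 1]], {"max_depth": 3})
same_again = release_fingerprint("v1.4.0", [[1.0, 0], [2.5, 1]], {"max_depth": 3})
new_data = release_fingerprint("v1.4.0", [[1.0, 0], [2.5, 1], [4.0, 1]], {"max_depth": 3})

print(baseline == same_again)  # identical code, data and settings
print(baseline == new_data)    # same code, but the data changed
```

The code version alone would not have revealed the second change, which is the reason data and models need their own tracking next to the code.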

If you want to read more about why ML systems are more difficult to build and maintain, have a look at this paper by Google Research: Hidden technical debt in machine learning systems.

As can be seen in figure 1, the actual ML code is only a small part of the entire system. The required surrounding infra is vast and complex.

Secondly, building ML systems is simply relatively new. Just over 20% of enterprises have had models running in production for more than 2 years, and only 53% of projects make it from prototype to production (Gartner, 2020).

And we haven’t even talked about the challenge of scaling up once you have a couple of models live. New territory means many pioneers paving the way to create best-in-class solutions. Take a look at Aparna’s article on ML Infrastructure tools, or just glance over the image below (Matt Turck, 2020) to get an idea of the huge (and growing) landscape out there for data & AI.


Figure 2: Data & AI overview 2020 (Matt Turck, 2020)


So what are those processes that make it so difficult?

Besides the software code that must be developed for your entire Machine Learning system, you need to cope with the inflow and outflow of data and with a constantly changing model, and thus you need a way to monitor both. Moreover, you probably need some kind of CI/CD (continuous integration, continuous deployment) and version control, and the actual ‘serving’ infrastructure has to be managed. Lastly, don’t forget compliance. See figure 3 for a visual representation of most of the processes and mutual dependencies involved with ML systems. This figure is basically figure 1, but shown from a lifecycle perspective.



Figure 3: The core steps in a typical ML workflow, taken from ML-ops.org.


ML (Machine Learning) Monitoring

Look again at figure 3, in the top right. In the ideal situation you have such a ‘model decay trigger’. It should nudge you when something changes in the model’s performance. You could then first review the incoming data for large changes. For example, if the model was trained on winter data but it’s now summer, the model will behave very differently. If you determine that the input data is indeed different, you should retrain the model, evaluate it, test it and eventually deploy it again. Lastly, you monitor the newly deployed model to make sure everything is ok.
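As a minimal sketch of such a trigger, the check below compares incoming data with the training data for a single numeric feature. It measures how far the live mean has moved from the training mean, in units of the training standard deviation. The function names, the threshold of 2.0 and the temperature values are assumptions made up for illustration; a real setup would typically look at many features and use more robust statistics.

```python
from statistics import mean, stdev


def drift_score(training_values, live_values):
    """Shift of the live mean from the training mean, in training standard deviations."""
    spread = stdev(training_values)
    if spread == 0:
        raise ValueError("training values have no spread; cannot scale the shift")
    return abs(mean(live_values) - mean(training_values)) / spread


def should_trigger_review(training_values, live_values, threshold=2.0):
    """Return True when the incoming data has moved far enough to warrant a review."""
    return drift_score(training_values, live_values) > threshold


# Made-up temperatures (degrees Celsius) mirroring the winter/summer example.
winter_temperatures = [-2.0, 0.5, 1.0, 3.0, -1.5, 2.0, 0.0, 1.5]
summer_temperatures = [21.0, 24.5, 19.0, 26.0, 23.0, 22.5]

print(should_trigger_review(winter_temperatures, summer_temperatures))
print(should_trigger_review(winter_temperatures, winter_temperatures))
```

The first call compares winter training data with summer input and trips the trigger; the second compares the training data with itself, so the score is zero and nothing fires.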

CI/CD for machine learning
Ideally you automate the monitoring, retraining and deployment. This is also called “CI/CD” (continuous integration and continuous delivery/continuous deployment), which, when in place, supports a low-risk and low-error development and release cycle. Continuous integration means that all developers merge their code changes often into a central repository, so that everyone works with the same code. Continuous delivery/deployment means that the release process of the model (from development to deployment) is automated.
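The automated release step could look roughly like the sketch below: train a candidate, evaluate it, and deploy it only if it scores better than the model that is currently live. This is an illustration under assumptions rather than how any specific CI/CD product works; `run_release_pipeline`, its parameters and the stand-in steps are all hypothetical.

```python
def run_release_pipeline(train, evaluate, deploy, live_score, min_gain=0.0):
    """Train a candidate, evaluate it, and deploy it only if it beats the live model."""
    candidate = train()
    candidate_score = evaluate(candidate)
    promoted = candidate_score > live_score + min_gain
    if promoted:
        deploy(candidate)
    return {"promoted": promoted, "candidate_score": candidate_score}


# Stand-in steps so the sketch runs on its own; real steps would train,
# score and ship an actual model.
deployed_models = []
result = run_release_pipeline(
    train=lambda: "candidate-model",
    evaluate=lambda model: 0.91,
    deploy=deployed_models.append,
    live_score=0.88,
)
print(result, deployed_models)
```

Passing the steps in as functions keeps the gate itself small and testable, and the `min_gain` margin is one way to avoid redeploying for improvements that are within noise.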

Model version control
Perhaps you deployed a new model and then find out the previous one was better. So you’ll probably want some type of version control that allows for rollback to older versions. But that previous version was built by your predecessor and stored somewhere; the question is where. And once you have found it, you have difficulty getting it to run on your local PC to train and test it. Ideally you want to prevent all of that too.
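A minimal in-memory sketch of that idea: every model that gets registered is kept together with its metadata, and rolling back just moves the “live” pointer to the previous version. The class, its methods and the example metadata are hypothetical; a real registry would persist the artifacts and record the environment needed to run them.

```python
from dataclasses import dataclass, field


@dataclass
class ModelRegistry:
    versions: list = field(default_factory=list)  # every registered model, oldest first
    live_index: int = -1                          # position of the live version, -1 if none

    def register(self, artifact, metadata):
        """Store a new version, make it live, and return its 1-based version number."""
        self.versions.append({"artifact": artifact, "metadata": metadata})
        self.live_index = len(self.versions) - 1
        return self.live_index + 1

    def rollback(self):
        """Make the previous version live again and return its version number."""
        if self.live_index <= 0:
            raise RuntimeError("no earlier version to roll back to")
        self.live_index -= 1
        return self.live_index + 1

    def live(self):
        """Return the entry that is currently live."""
        if self.live_index < 0:
            raise RuntimeError("no model has been registered yet")
        return self.versions[self.live_index]


registry = ModelRegistry()
registry.register("model-a", {"trained_by": "predecessor", "python": "3.9"})
registry.register("model-b", {"trained_by": "me", "python": "3.11"})
registry.rollback()
print(registry.live())
```

Note that a rollback does not delete anything: the newer version stays in the list, so you can still inspect why it performed worse.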

Serving infrastructure management
Every model uses a different amount of resources (compute, memory) and has a different data throughput. The infrastructure must be able to handle such differences. You don’t see that in figure 3, but all these processes require computational resources and memory, and thus a certain infrastructure.
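A rough back-of-the-envelope sketch of why this matters: given how much memory one copy of a model needs and how many requests per second one copy can handle, you can estimate how many copies (replicas) and how much memory a given load requires. All names and numbers below are made up for illustration, and the calculation ignores things like bursts, batching and start-up time.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class ModelProfile:
    name: str
    memory_gb: float                        # memory needed by one replica
    requests_per_second_per_replica: float  # sustained throughput of one replica


def replicas_needed(profile, expected_rps):
    """Smallest number of replicas that covers the expected load (at least one)."""
    return max(1, math.ceil(expected_rps / profile.requests_per_second_per_replica))


def total_memory_gb(profile, expected_rps):
    """Memory needed across all replicas for the expected load."""
    return replicas_needed(profile, expected_rps) * profile.memory_gb


# Two hypothetical models with very different footprints.
small_model = ModelProfile("small-classifier", memory_gb=0.5, requests_per_second_per_replica=200.0)
large_model = ModelProfile("large-language-model", memory_gb=16.0, requests_per_second_per_replica=5.0)

print(replicas_needed(small_model, 450), total_memory_gb(small_model, 450))
print(replicas_needed(large_model, 12), total_memory_gb(large_model, 12))
```

Even in this toy example the second model needs far more memory to serve a much lower load, which is the kind of difference the serving infrastructure has to absorb.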

AI audits
If you’re based in Europe you will soon have to comply with the EU Artificial Intelligence rules and perform certain compliance checks. Functionality such as an audit trail would then come in handy, as would reports and functionality to correctly assess the risks of AI models. Compliance processes are not yet shown in figure 3, as clear rules and regulations don’t exist yet, but they are not going to make the processes easier.

This entire process is complex to set up and requires frequent changes at the core of your ML software, which makes it very different from, and more complex than, conventional software.


Wrapping up

Long story short, I found out that there’s a lot more to an ML system than having data, having a model, and feeding it the data. The industry has yet to form a set of standards for managing ML systems that prevents vendor lock-in and creates a successful solution for everyone. At UbiOps we specifically focus on the deployment step, where we strive to be the best at solving the set of challenges around serving, scaling and managing live Machine Learning workloads.


Wouter Hollander – Key account & partnerships manager at UbiOps
