Slurm for running AI and ML workloads?

18 August 2022 · Blog, Functionality

We sometimes get the question: “We currently use Slurm for orchestrating AI and ML workloads – how is UbiOps different?”

Although Slurm and UbiOps are two worlds apart, in this article we explain in more detail the differences between Slurm, Kubernetes and UbiOps. If you don't have time for the full read (about 3 minutes), the table below summarizes the comparison.

| | Slurm | Kubernetes | Kubernetes + UbiOps |
| --- | :---: | :---: | :---: |
| Granular control over GPU/CPU resources | x | x | x |
| Optimized use of GPU and CPU resources | x | | x |
| Job scheduler (CRON-like) | x | | x |
| Job priority levels | x | | x |
| Microservices-based approach | | x | x |
| Real-time model inferencing | | x | x |
| Built-in user interface | | | x |
| Support for training and inferencing jobs | | | x |
| Dashboarding & monitoring | | | x |
| Version control for jobs and models | | | x |
| API management | | | x |
| Easy access to real-time logs | | | x |
| Data pipelines with reusable components | | | x |
| Run reproducibility and metadata management | | | x |
| Portability of workloads between clusters (multi/hybrid cloud) | | | x |

From traditional HPC workloads to AI and ML workloads      


Thanks to advancements in computing capabilities, storage of high-quality data, and widely distributed open-source frameworks, more research organizations and businesses implement Artificial Intelligence (AI) and Machine Learning (ML) to drive research and innovation. From computer vision (CV) to natural language processing (NLP) and time series analysis, AI has found its way into most business innovations and state-of-the-art research programs.

Organizations with a background in traditional High-Performance Computing (HPC) are confronted with a new reality of i) Cloud enablement and ii) AI and ML adoption in research and business processes. Currently, these organizations typically use Slurm as the go-to scheduling tool to orchestrate AI and ML workloads at scale on HPC infrastructure.

Slurm (Simple Linux Utility for Resource Management) is a widely used open-source scheduler for managing the distributed, batch-oriented workloads typical of HPC. Slurm is designed as a highly scalable, fault-tolerant, and self-contained workload manager and job scheduler. It supports Linux-based clusters, offers flexible scheduling options, and integrates well with common frameworks – for example Cray's Application Level Placement Scheduler (ALPS), which enables you to manage runtimes and launch applications at scale.

Four pressing issues when using Slurm for AI and ML workloads


Compared to traditional HPC workloads, researchers and data science practitioners face new challenges when using Slurm as their go-to tool for orchestrating ML and AI workloads at scale. Below we outline some of the most pressing pain points AI/ML adopters face with Slurm today.

It’s not easy to use Slurm    

To schedule and run a job with Slurm, you define it in a bash script in which you specify, among other things, the job steps and tasks, the amount of RAM, the number of cores, and the types of nodes the job requires. The main interface to the Slurm workload manager is the command line. However, most researchers in AI and ML don't have a computer science background and are used to working in interactive Python environments such as Jupyter Notebook, which make their lives much easier because they can run their Python code directly. As a result, many researchers struggle to get used to running their AI and ML workloads through Slurm.
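To give an impression of what such a script involves, here is a minimal sketch of a Slurm batch script; the partition name, paths and resource values are placeholders for illustration, not recommendations:

```shell
#!/bin/bash
#SBATCH --job-name=train-model     # name shown in the queue
#SBATCH --partition=gpu            # placeholder partition name
#SBATCH --nodes=1                  # number of nodes
#SBATCH --ntasks=1                 # number of tasks (processes)
#SBATCH --cpus-per-task=8          # CPU cores per task
#SBATCH --mem=32G                  # RAM per node
#SBATCH --gres=gpu:1               # request one GPU
#SBATCH --time=04:00:00            # wall-clock limit (hh:mm:ss)
#SBATCH --output=train_%j.out      # stdout/stderr file (%j = job id)

# The actual work: activate an environment and run a training script.
# Both paths below are placeholders.
source venv/bin/activate
python train.py --epochs 50
```

The script is submitted with `sbatch train.sh`; `squeue` shows its place in the queue and `scancel <jobid>` cancels it.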

Training of ML models on Slurm is tricky

Once you start training ML models, you want to know how your training experiment is progressing and whether it is converging, so access to logs and metrics during training is essential. However, getting useful log output from Slurm can be tricky: for most users, a failed job ends in an error log file that only says the job failed, with no further information. Moreover, without access to live logs and metrics it is impossible to tell when to cancel a training job that is not converging – a capability that is essential in many cases, and whose absence leads to inefficient use of resources.

No support for model deployment and inference

Most AI and ML practitioners train models with one goal: eventually bringing the trained model into production. Running an ML model in production requires infrastructure capable of load balancing, auto-scaling, versioning and more. Slurm does not support this, so it requires setting up and managing a very different infrastructure, for instance one based on Kubernetes. Slurm is fit for running one-time jobs, not for continuous jobs and workloads that require a microservices-based infrastructure. Yet many AI and ML models will be embedded into larger applications, from mobile to web-based, both for model (re)training and for inferencing purposes.

No dashboarding & monitoring

The main interface with Slurm is the command line. Slurm doesn’t have any GUI or dashboard functionality by default. Doing data science or ML experiments requires keeping track of live training performance, past experiment results and easy access to logs and code output. Also, working with a team requires easy reproducibility of experiments and sharing of results.

The Rise of Kubernetes for AI and ML workloads


Kubernetes is the go-to platform for running and managing flexible, containerized workloads and microservices, and a very popular open-source orchestration solution for container-based workloads. With Kubernetes' kube-scheduler, workloads can be managed effectively in ways similar to traditional HPC clustering methods, though Kubernetes alone does not offer all the scheduling capabilities of Slurm – such as batch scheduling.

Kubernetes is based on clusters of worker nodes (either physical or virtual servers) controlled by a control plane. Each node can host a group of pods, each wrapping one or more containers. Pods share the resources of their node and live in a local network that enables them to communicate with each other while still keeping workloads and applications isolated. This makes Kubernetes a great orchestration tool for running and scaling machine learning workloads as well.

Compared to Slurm, Kubernetes enables you to better manage cloud-native technologies and container-based microservices which can be scaled more dynamically than traditional applications.

In summary: Slurm works well for batch workloads and optimizing resource usage based on a queue of jobs. Kubernetes is meant for use in ‘modern’ cloud-native microservice architectures. It optimizes scheduling and scaling live services based on available resources. So, two quite different worlds.

Introduction to UbiOps as an alternative to train and deploy AI and ML     


UbiOps takes care of turning code (models, scripts etc.) into microservices and orchestrating these with the use of Kubernetes.

UbiOps is a platform with a lot of different features for running and managing data science and ML workloads. Besides running models live, UbiOps also has the option to schedule workloads, which means it creates a request at a certain time based on a CRON schedule.
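As a rough sketch, creating such a CRON-triggered request schedule with the UbiOps Python client could look like the following; the project name, deployment name and token are placeholders, and the exact class and method names should be verified against the current client reference:

```python
import ubiops  # the UbiOps Python client library

# Placeholder credentials and names -- replace with your own.
configuration = ubiops.Configuration()
configuration.api_key["Authorization"] = "Token <YOUR_API_TOKEN>"
api = ubiops.CoreApi(ubiops.ApiClient(configuration))

# Run the 'train-model' deployment every weekday at 08:00 (CRON syntax).
schedule = ubiops.ScheduleCreate(
    name="nightly-training",
    object_type="deployment",
    object_name="train-model",
    schedule="0 8 * * 1-5",
    enabled=True,
)
api.request_schedules_create(project_name="my-project", data=schedule)
```

UbiOps then fires a request against the deployment on that schedule, with no batch script or queue management on the user's side.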

UbiOps for researchers  


UbiOps offers a user-friendly way to manage and run different types of jobs and services through a simple, intuitive GUI or API client. Where Slurm focuses on running one-time jobs, UbiOps also manages the lifecycle of (live) models and is built for low-latency, repeatable inference calls and training runs, with much more metadata about the models and requests maintained inside the UbiOps platform.

Thanks to its scale-to-zero functionality, UbiOps distributes workloads dynamically based on activity: idle deployments do not consume resources, which frees up space for other workloads. Its multi-cloud and hybrid-cloud capability provides a way to lift and shift workloads between different clusters, optimizing resource availability.

About UbiOps     


UbiOps is a software company headquartered in the Netherlands with an office in the United States. Its MLOps platform enables data analytics teams to run, manage and scale AI and machine learning workloads on demand. UbiOps is used by research organizations and universities, large enterprises and governments.