Autoscaling your ML workloads

Autoscaling is an essential MLOps tool that allocates resources to your workloads automatically. Many ML tasks see highly variable workloads and traffic, and if your use case falls into that category, you need autoscaling.

Without autoscaling, you are left with two options:

  • Spend a lot of money over-provisioning resources to handle peak traffic, even though you may not need them 90% of the time.
  • Spend a lot of time manually scaling resources up and down to match traffic, which quickly becomes error prone, not to mention time consuming.
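To make the over-provisioning cost concrete, here is a small back-of-the-envelope calculation (all numbers are hypothetical, purely for illustration):

```python
# Hypothetical numbers: a $2/hour GPU instance, a peak load that needs
# 10 instances, but an average need of only 2 instances.
hourly_rate = 2.0
hours_per_month = 730
peak_instances = 10
avg_instances = 2

# Provisioning for peak means paying for all 10 instances around the clock.
always_on_cost = peak_instances * hourly_rate * hours_per_month

# An autoscaled setup pays (roughly) for the average number of instances.
autoscaled_cost = avg_instances * hourly_rate * hours_per_month

print(f"Always on: ${always_on_cost:.0f}/month, autoscaled: ${autoscaled_cost:.0f}/month")
# Always on: $14600/month, autoscaled: $2920/month
```

In this (made-up) scenario, paying for peak capacity all month costs five times as much as scaling with demand.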

Having autoscaling features that optimize resources while saving time and money sounds a lot better, right? There are, however, two distinct types of autoscaling to choose from, and which one fits best depends on your particular use case.

Scheduled scaling versus dynamic scaling

The two types of autoscaling are scheduled scaling and dynamic scaling. Scheduled scaling is exactly what it sounds like: a predefined schedule dictates when more resources need to be spun up, and when they need to be spun down again. This is handy when traffic to your models is variable but very predictable. For example, if you have an ML model that only needs to run on the weekends and you know exactly how much data will pass through it, scheduled scaling would suffice.
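A minimal sketch of such a schedule-based policy might look like this (the weekend-only schedule and the instance counts are hypothetical, standing in for whatever schedule your workload follows):

```python
from datetime import datetime

# Hypothetical schedule: scale up on weekends, minimal capacity on weekdays.
SCHEDULE = {
    "weekend": {"min_instances": 2, "max_instances": 10},
    "weekday": {"min_instances": 1, "max_instances": 1},
}

def desired_capacity(now: datetime) -> dict:
    """Return the instance range for the given moment, based on a fixed schedule."""
    key = "weekend" if now.weekday() >= 5 else "weekday"  # Mon=0 .. Sun=6
    return SCHEDULE[key]
```

In practice you would hand such a schedule to your platform (e.g. AWS scheduled scaling actions or a Kubernetes CronJob) rather than evaluate it yourself, but the idea is the same: capacity follows the clock, not the traffic.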

Dynamic scaling, on the other hand, works well when traffic to your models is very unpredictable. With dynamic scaling, compute resources are scaled up and down based on incoming traffic: if there is a sudden surge of requests to your model, it is scaled up, and when it goes unused for a certain amount of time it is scaled back down again.

This optimizes your use of compute resources, which in turn minimizes your costs, while also ensuring high availability of your models: there will always be enough resources to handle the incoming traffic. Unless your model receives a perfectly constant level of traffic 24/7, dynamic scaling is always worth looking into.
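The core of a dynamic scaling policy can be sketched in a few lines of Python (the requests-per-instance ratio, the instance cap, and the idle timeout are all illustrative; real autoscalers add cooldowns and smoothing on top of this):

```python
import math

class DynamicScaler:
    """Toy dynamic-scaling policy: one instance per 100 req/s, capped at a
    maximum, scaling to zero after an idle timeout. Illustrative only."""

    def __init__(self, requests_per_instance=100, max_instances=10, idle_timeout=300):
        self.requests_per_instance = requests_per_instance
        self.max_instances = max_instances
        self.idle_timeout = idle_timeout  # seconds of no traffic before scale-to-zero
        self.last_traffic_at = None

    def desired_instances(self, request_rate: float, now: float) -> int:
        if request_rate > 0:
            self.last_traffic_at = now
            needed = math.ceil(request_rate / self.requests_per_instance)
            return min(needed, self.max_instances)
        # No traffic: keep one instance warm until the idle timeout expires,
        # then scale to zero.
        if self.last_traffic_at is None or now - self.last_traffic_at >= self.idle_timeout:
            return 0
        return 1
```

The same decision loop runs inside any dynamic autoscaler: measure load, compute the needed capacity, and fall back to zero (or a minimum) once traffic stops.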

Challenges related to autoscaling

There are many ways to introduce autoscaling into your workflow. Most cloud providers offer some form of autoscaling, like EC2 scheduled scaling at AWS; you could dive into the nitty-gritty of Kubernetes; or you could go for a specialized MLOps tool that includes autoscaling. When navigating this landscape of options, it is important to look at what your particular use case needs and whether these potential solutions meet your criteria. They may all be forms of autoscaling, but they can differ significantly in terms of:

  • Scale-out time: some tools can take multiple minutes to scale up, while your use case may need very fast scale-out times.
  • Availability of resources: maybe you need a specific type of GPU for your workloads, and different tools support different GPU types. Many cloud providers also struggle to supply enough GPUs, so if you need to scale out to many GPUs, this is definitely something to keep in mind!
  • Costs: an obvious factor to look at.
  • Maintenance overhead: one that is often forgotten! You can of course go for a full DIY solution built with open-source software, but do you actually have the necessary people and skills on board to set it up properly and maintain it in the long run?
  • Customizability: how much control do you need over the scaling settings?

Which of these factors matter most will always differ on a case-by-case basis. No tool will score a perfect ten in all of these categories, so evaluate how to make the trade-offs that best fit your needs.

Autoscaling at UbiOps

Autoscaling is an integral part of the UbiOps platform. Every deployment automatically comes with dynamic scaling, including scale to zero. We allow you to change the scaling settings as you wish, so you stay in charge of the range within which we can scale, and whether or not scale to zero should be turned on. This way we provide you with full control over how we scale while still making sure everything happens automatically.

At UbiOps we focus on GPU scaling in particular. GPU scaling is generally more difficult than CPU scaling, mainly because of the scarcity of GPUs and the typically longer spin-up times of GPU instances. We partner with GPU-focused cloud providers to guarantee that enough GPUs are available for our users and that they can scale up as much as they need to. We are also continuously working to further speed up our scale-out times. In addition, UbiOps is part of the NVIDIA Inception program, which keeps us up to date with new GPU features NVIDIA releases so that we can make them available to our users as well.


Do you need autoscaling and would you like to see if UbiOps fits your needs? Test it out for free today!
