UbiOps version 2.31.0¶
Client library version
Python client library version for this release: 4.6.0
CLI version for this release: 2.23.0
On the 17th of October 2024 we released new functionality and made improvements to our UbiOps SaaS product. An overview of the changes is given below.
Support for streaming request responses¶
We added support for streaming request responses to make it easier for you to leverage GenAI models with UbiOps. To make your deployment compatible with request streaming, you simply need to use streaming_update in your deployment's request method. Other use cases for streaming include streaming request statuses and information to build progress bars (a minimal sketch of this is included at the end of this section).
Below you can see an example deployment template for deploying Gemma to UbiOps with streaming support.
import os
from huggingface_hub import login
from transformers import AutoTokenizer, AutoModelForCausalLM, TextIteratorStreamer
from threading import Thread


class Deployment:
    def __init__(self, base_directory, context):
        # Log in to Hugging Face
        token = os.environ["HF_TOKEN"]
        login(token=token)

        # Download Gemma from Hugging Face
        model_id = os.environ.get("MODEL_ID", "google/gemma-2-2b-it")
        self.tokenizer = AutoTokenizer.from_pretrained(model_id)
        self.model = AutoModelForCausalLM.from_pretrained(model_id)

        # You can change the system prompt by adding an environment variable to your deployment (version)
        self.system_prompt = os.environ.get(
            "SYSTEM_PROMPT",
            "You are a friendly chatbot who always responds in the style of a pirate",
        )

    def request(self, data, context):
        user_prompt = data
        streaming_callback = context["streaming_update"]

        # Prepare the chat prompt with the system message and user input
        chat = [{"role": "user", "content": f"{self.system_prompt} \n {user_prompt}"}]
        print("Applied chat: \n", chat)

        prompt = self.tokenizer.apply_chat_template(
            chat, tokenize=False, add_generation_prompt=True
        )
        inputs = self.tokenizer(
            prompt, add_special_tokens=False, return_tensors="pt"
        )

        streamer = TextIteratorStreamer(self.tokenizer, skip_prompt=True)
        generation_kwargs = dict(inputs, streamer=streamer, max_new_tokens=256)

        # The TextIteratorStreamer requires a thread which we start here
        thread = Thread(target=self.model.generate, kwargs=generation_kwargs)
        thread.start()

        generated_text = ""
        for new_text in streamer:
            # We use the streaming_callback from UbiOps to send partial updates
            streaming_callback(new_text)
            generated_text += new_text

        return generated_text
In this example we use the built-in TextIteratorStreamer from Hugging Face and the streaming_update callback from UbiOps to pass our stream to the UbiOps API. Once we have the full response from the LLM, we return it in our return statement. For all the details on this example, please see our Gemma with streaming support tutorial.
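Besides token streaming, the same mechanism can be used for the progress-update use case mentioned above. Below is a minimal, hypothetical sketch (not part of the release itself) that only relies on the streaming_update callback from the request context; the step loop, messages, and plain-string return value are placeholders for your own processing logic and output definition.

import time


class Deployment:
    def __init__(self, base_directory, context):
        pass

    def request(self, data, context):
        # The callback provided by UbiOps for sending partial updates
        streaming_callback = context["streaming_update"]

        total_steps = 5
        for step in range(1, total_steps + 1):
            time.sleep(1)  # placeholder for a unit of actual work
            # Each partial update can be used client-side to render a progress bar
            streaming_callback(f"Processed step {step}/{total_steps}")

        # The final result is still returned as the regular request result
        return "All steps completed"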
Improved speed and reliability for request handling¶
We've made optimizations behind the scenes to improve how quickly UbiOps can handle requests. You should expect to see even lower latency and faster creation of batch requests.
As part of this effort to reduce latency, we had to make a few concessions around listing requests. Requests are now listed per status, which means that requests with different statuses can no longer be viewed in a single overview.
You can still list requests by individual status (such as completed, failed, pending, and processing). However, please note that requests with processing and pending statuses will now only be available at the version level and can no longer be listed at the project level. This change is also reflected in the WebApp. Not showing processing and pending requests at the project level allowed us to reduce the latency overhead of request handling.
Ability to cancel pipeline requests and to cancel in bulk¶
It was already possible to cancel a deployment request, but it's now also possible to cancel a pipeline request. After cancelling, we ensure that all the pipeline object requests that are part of the pipeline are cancelled as well.
In addition, we added support for bulk cancellation of requests, which allows you to cancel multiple requests at once. We know that sometimes a simple scripting error, or a deployment that is exposed to the outside world, can result in a large number of unintended requests that eat up your computing credits. In such a scenario, you can now cancel all the pending or processing deployment/pipeline requests at once. Simply navigate to your deployment or pipeline version, go to the requests overview, and filter on either pending or processing. A button will appear for cancelling everything in the list.
Decide which scaling algorithm is used¶
UbiOps scales instances of your deployments up and down based on incoming request traffic. We have two different scaling algorithms available for you to choose from: default or moderate.
The default scaling algorithm is the best pick for most scenarios. It scales up "aggressively": when new requests enter the queue and the maximum number of instances has not been reached yet, it will scale up a new instance immediately. This works well in the following cases:
- The cold start time is relatively short
- You expect sudden bursts in traffic and want your deployment to scale out quickly to work through the queue as fast as possible
The moderate algorithm is slightly more moderate in scaling up, as the name suggests. It looks at the historic cold start time of your deployment and at the current queue size to determine whether it's worthwhile to scale up a new instance. When you assign this algorithm to a newly created deployment, there won't be any metrics yet on the deployment's average cold start time. Therefore the algorithm will scale similarly to the default algorithm in the beginning. Over time, when there is more data on the average cold start time and average request duration of your deployment, the moderate scaling algorithm will adapt its strategy to the usage of your model. This algorithm works well in the following cases:
- The cold start time is a lot longer than the average request duration.
- The traffic to your model is very stable and you do not need quick burst scaling.
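If you prefer to set the scaling algorithm programmatically rather than through the WebApp, a minimal sketch with the Python client could look like the example below. It assumes the deployment version field is called scaling_strategy and accepts the values default and moderate, and uses hypothetical project, deployment, and version names; check the client library reference for the exact field name in your client version.

import ubiops

configuration = ubiops.Configuration(api_key={"Authorization": "Token <YOUR_API_TOKEN>"})
api_client = ubiops.ApiClient(configuration)
core_api = ubiops.CoreApi(api_client)

# Switch an existing deployment version to the moderate scaling algorithm
# (the field name `scaling_strategy` is an assumption, verify it against the client reference)
core_api.deployment_versions_update(
    project_name="my-project",        # hypothetical project name
    deployment_name="my-deployment",  # hypothetical deployment name
    version="v1",                     # hypothetical version name
    data=ubiops.DeploymentVersionUpdate(scaling_strategy="moderate"),
)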
Ability to enforce two-factor authentication as admin¶
We know security is very important and that small things like two-factor authentication (2FA) can make a big difference. If you're an organization admin, you can now enforce 2FA for everyone in your organization's team. Simply head over to your team page on organization level, and click the button "Enforce 2FA for all" to enable this functionality.
Save your pipeline layout¶
You can now adjust your pipeline layouts in the WebApp to your liking! We noticed that our auto-layouts weren't always optimal for all the different pipeline setups you are using. So now whenever you edit and save your pipeline, we will also save your layout.
Miscellaneous & deprecations¶
We also made some miscellaneous changes and improvements:
- It is now possible to list pipeline requests at project level in the UbiOps WebApp.
- Notification groups have been deprecated in favor of the more versatile webhooks. See our migration guide for more information.
- Cancelled requests will now have the status failed instead of cancelled. The error message will still indicate whether the request was actively cancelled or whether it failed for another reason.
- We improved the error logging in the case of an invalid ubiops.yaml.
- The success field has been removed from responses of the request endpoints. It was already deprecated before.
- The origin field of requests no longer contains request schedule information.