Request concurrency per instance¶
The most common method to scale workloads in UbiOps is to scale the number of instances. This increases a deployment's throughput by increasing the number of compute instances that run replicas of your code.
However, sometimes you may want to handle multiple requests concurrently within a single instance: for example, when your application is I/O-bound and a single process therefore cannot use the instance's resources efficiently, or when you want to share a model or another object in an instance's memory between multiple requests, as in LLM server set-ups.
To do so, you can configure the Request Concurrency per instance. This setting, which defaults to 1, controls how many processes are started in each instance. Each of these processes runs a copy of the deployment code and handles requests independently.

Configuring Request Concurrency¶
You can configure the Request Concurrency on deployment versions: either via the WebApp, by changing the Request Concurrency setting, or via the API, Client Libraries, or CLI, by setting the instance_processes parameter.
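For example, with the UbiOps Python client library the setting could be updated along these lines. This is a minimal sketch: the API token, project, deployment, and version names below are placeholders.

import ubiops

# Authenticate with an API token (placeholder value)
configuration = ubiops.Configuration()
configuration.api_key["Authorization"] = "Token <YOUR_API_TOKEN>"

api_client = ubiops.ApiClient(configuration)
core_api = ubiops.CoreApi(api_client)

# Allow 4 concurrent processes in each instance of this deployment version
core_api.deployment_versions_update(
    project_name="my-project",
    deployment_name="my-deployment",
    version="v1",
    data=ubiops.DeploymentVersionUpdate(instance_processes=4),
)
api_client.close()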
Identifying the processes¶
Sometimes it can be useful to identify the processes in an instance from your code, for example when you want to run some code in only the first process, such as starting a service in the background.
You can do so by reading the process_id key of the context parameter that is passed to the __init__ method of your deployment. This process_id is an integer index of the process, ranging from zero up to the number of processes minus one.
For example:
class Deployment:
    def __init__(self, base_directory, context):
        if context["process_id"] == 0:
            # Do something only in one of the processes
            pass
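As a concrete illustration of this pattern, the first process could start a shared inference server in the background while every process forwards its requests to it. The sketch below is hypothetical: the server command, port, and endpoints (my_llm_server, /health, /generate) are stand-ins for whatever server you actually run, and it assumes the requests package is available in the deployment environment.

import subprocess
import time

import requests


class Deployment:
    def __init__(self, base_directory, context):
        if context["process_id"] == 0:
            # Only the first process launches the shared server
            # (hypothetical command; replace with your own server start-up)
            subprocess.Popen(["python", "-m", "my_llm_server", "--port", "8000"])

        # Every process waits until the shared server accepts connections
        for _ in range(60):
            try:
                requests.get("http://localhost:8000/health", timeout=1)
                break
            except requests.exceptions.RequestException:
                time.sleep(1)

    def request(self, data):
        # Each process handles its own requests, but they all share
        # the single server started by process 0
        response = requests.post("http://localhost:8000/generate", json=data)
        return {"output": response.json()}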
Tutorial¶
UbiOps provides a tutorial that explains how you can use this concept to set up a high-throughput LLM server with the UbiOps Client Library: