Concurrency limits on Services¶
Concurrency limits control the number of requests to a service that can be in progress at the same time. This is useful for managing the capacity of a service, for example for LLM models or other HTTP-based apps.
A concurrency limit is a cap on the number of HTTP requests allowed to be "in flight" on a service at any given time. Rate limits, which cap the number of requests made to a service per minute, are available as well.
Concurrency limits are available on two levels:
- a total limit on the service, for all requests made to it combined
- a per-user limit, with both a default limit per user available as well as configurable limits for individual users
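The two-level model above can be sketched as a pair of counting semaphores: one for the service as a whole and one per user, where an individual user's limit overrides the default. This is a conceptual illustration of the limiting model, not UbiOps' actual implementation:

```python
import threading

class TwoLevelLimiter:
    """Conceptual sketch of a total + per-user concurrency limit."""

    def __init__(self, total_limit, default_user_limit, user_limits=None):
        self.total = threading.Semaphore(total_limit)
        self.default_user_limit = default_user_limit
        self.user_limits = user_limits or {}  # per-user overrides
        self.user_semaphores = {}
        self.lock = threading.Lock()

    def _user_semaphore(self, user):
        with self.lock:
            if user not in self.user_semaphores:
                limit = self.user_limits.get(user, self.default_user_limit)
                self.user_semaphores[user] = threading.Semaphore(limit)
            return self.user_semaphores[user]

    def try_acquire(self, user):
        """Return None on success, or the limit that was hit ("user"/"total")."""
        user_sem = self._user_semaphore(user)
        if not user_sem.acquire(blocking=False):
            return "user"    # would map to HTTP 429, user limit reached
        if not self.total.acquire(blocking=False):
            user_sem.release()
            return "total"   # would map to HTTP 429, total limit reached
        return None          # request may proceed

    def release(self, user):
        """Call when a request finishes, freeing both slots."""
        self.total.release()
        self._user_semaphore(user).release()
```

Note that a request must pass both checks: a user with spare capacity is still rejected once the service-wide limit is saturated.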
When a concurrency limit is reached, users receive an HTTP 429 Too Many Requests response code. The response body indicates whether the total limit or the per-user limit was reached.
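Since a 429 signals a temporarily saturated service rather than a permanent failure, clients typically retry with backoff. A minimal sketch, assuming only that the response object exposes a `status_code` attribute (as a `requests.Response` does):

```python
import time

def call_with_backoff(make_request, max_retries=5, base_delay=1.0):
    """Retry a request while the service answers 429 Too Many Requests.

    `make_request` is a zero-argument callable returning a response object
    with a `status_code` attribute. The delay doubles on every retry.
    """
    for attempt in range(max_retries):
        response = make_request()
        if response.status_code != 429:
            return response
        time.sleep(base_delay * 2 ** attempt)  # back off before retrying
    return response  # still 429 after all retries; let the caller decide
```

For example, `make_request` could be `lambda: requests.post(url, json=payload, headers=headers)` for whichever endpoint and token you use to reach the service.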
Configuring concurrency limits¶
Concurrency limits can be configured through the WebApp, the API, the client libraries, and the CLI.
In the WebApp, the total concurrency limit for all users and the default limit per user are configurable in the Authentication and limits section of a service.
Additionally, you can set limits for individual users on the Concurrency limits tab of the service. Limits configured for a (service) user apply to the token used; if a user has multiple tokens, the limits apply separately per token.
User level concurrency limits are only available for services that require authentication, because they are applied per token. The total concurrency limit on a service is available for all services, both with and without authentication.
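When configuring limits programmatically, the request boils down to the project, service, user, and the new limit value. The endpoint path and payload field below are illustrative placeholders only, not the documented API; consult the API reference for the actual endpoints and parameters:

```python
def user_limit_request(project, service, user, limit):
    """Build a (hypothetical) request for setting a per-user limit.

    The URL and the "limit" payload field are illustrative placeholders,
    not the real API; check the API reference for the documented call.
    """
    return {
        "method": "PATCH",
        "url": (
            f"https://api.example.com/v2/projects/{project}"
            f"/services/{service}/concurrency-limits/{user}"
        ),
        "json": {"limit": limit},
    }
```

The same request shape applies to the total and default per-user limits, only at the service level instead of per user.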
Request distribution and consistency guarantees¶
The concurrency limit applies to the service and is independent of the number of deployment instance replicas that back it. For example, a concurrency limit of 100 on a service backed by 5 deployment replicas means that each replica will receive at most approximately 20 concurrent requests at any given time.
The total concurrency limit on a service is strongly consistent and is never exceeded. However, because of load balancing, the asynchronous distribution of requests, and varying request durations, UbiOps cannot guarantee that every replica receives an exactly equal share of concurrent requests. In the example above, some replicas might receive 21 requests while others receive 19. On a service with a single replica, the per-replica limit is guaranteed.
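The guarantee above can be illustrated with a toy simulation: assign the maximum number of in-flight requests to replicas at random, as a simplified stand-in for load balancing. The total always equals the limit exactly, while individual replicas deviate from the even share:

```python
import random

def distribute(total_limit, replicas, seed=0):
    """Toy simulation: spread `total_limit` in-flight requests over replicas.

    A simplified stand-in for load balancing, illustrating that the total
    is exact while per-replica counts may deviate from total/replicas.
    """
    rng = random.Random(seed)
    counts = [0] * replicas
    for _ in range(total_limit):
        counts[rng.randrange(replicas)] += 1  # pick a replica at random
    return counts
```

Running `distribute(100, 5)` yields five counts that sum to exactly 100, but typically not five equal values of 20.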