Multi-LLM Routing
Learn how to set up and configure multiple LLM / embedding models for use within a single component.
New in 3.11
Multi-LLM routing is available starting with version 3.11.0.
Overview
The Router for LLMs and embeddings is a feature that allows you to distribute and load balance requests across multiple LLM and embedding deployments. It uses LiteLLM under the hood to implement the routing logic.
Configuration
To enable the Router for LLMs and embeddings, add the router key to your model group configuration in your endpoints.yml file.
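For example, a model group with a single Azure deployment and the Router enabled could look roughly like this (the group id, deployment name, and endpoint are placeholders, and the surrounding keys such as model_groups, provider, and deployment are assumed to match your provider configuration):

```yaml
model_groups:
  - id: azure_gpt_4
    models:
      # a single Azure OpenAI deployment (placeholder values)
      - provider: azure
        deployment: gpt-4-deployment-1
        api_base: https://my-endpoint-1.openai.azure.com
    router:
      # distribute requests using the default shuffle-based strategy
      routing_strategy: simple-shuffle
```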
The above example demonstrates a router with just one model. You can add multiple models to the models key in the model group configuration. There is no limit to the number of models that you can add to a model group.
warning
It is recommended that all model deployments in a model group use the same underlying model. For example, defining different versions of gpt-4, or a mix of gpt-4 and gpt-3.5-turbo, is not recommended because the component relying on the model group will not receive consistent outputs from the different underlying models. The routing feature is designed to distribute requests across different deployments of the same model.
The routing_strategy key defines the routing strategy that the Router will use to distribute requests. The following routing strategies are available without any additional configuration:
- simple-shuffle: The Router will shuffle the requests and distribute them based on RPM (requests per minute) or weight.
- least-busy: The Router will distribute requests to the deployment with the least number of ongoing requests.
- latency-based-routing: The Router will distribute requests to the deployment with the lowest response time.
The following routing strategies require an additional Redis setup because they rely on the caching mechanism:
- cost-based-routing: The Router will distribute requests to the deployment with the lowest cost.
- usage-based-routing: The Router will distribute requests to the deployment with the lowest TPM (tokens per minute) usage.
Refer to LiteLLM's routing strategy documentation for more information on the routing strategies.
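As a sketch, switching a model group to latency-based routing only requires changing the routing_strategy value (deployment names and endpoints below are placeholders):

```yaml
model_groups:
  - id: azure_gpt_4
    models:
      - provider: azure
        deployment: gpt-4-deployment-1
        api_base: https://my-endpoint-1.openai.azure.com
      - provider: azure
        deployment: gpt-4-deployment-2
        api_base: https://my-endpoint-2.openai.azure.com
    router:
      # route each request to the deployment with the lowest observed response time
      routing_strategy: latency-based-routing
```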
Additional configuration parameters for the Router
The Router can be configured with additional parameters like cooldown_time, num_retries, and allowed_fails to fine-tune the routing logic.
Refer to LiteLLM's routing configuration documentation for more information on these configuration parameters.
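A sketch of these settings in the router section of a model group (the values are illustrative, and the exact semantics of each parameter follow LiteLLM's router configuration):

```yaml
    # inside a model group, alongside the models key
    router:
      routing_strategy: least-busy
      cooldown_time: 30   # seconds a failing deployment is taken out of rotation
      num_retries: 3      # number of times a failed request is retried
      allowed_fails: 1    # failures tolerated before a deployment is cooled down
```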
Configuring the Router for multiple model groups
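A configuration with two independent model groups could look roughly as follows (group ids, deployment names, and endpoints are placeholders; the key layout is assumed to match your provider configuration):

```yaml
model_groups:
  - id: azure_gpt_4
    models:
      - provider: azure
        deployment: gpt-4-deployment-1
        api_base: https://my-endpoint-1.openai.azure.com
      - provider: azure
        deployment: gpt-4-deployment-2
        api_base: https://my-endpoint-2.openai.azure.com
    router:
      routing_strategy: least-busy
  - id: azure_gpt_35_turbo
    models:
      - provider: azure
        deployment: gpt-35-turbo-deployment-1
        api_base: https://my-endpoint-1.openai.azure.com
      - provider: azure
        deployment: gpt-35-turbo-deployment-2
        api_base: https://my-endpoint-2.openai.azure.com
    router:
      routing_strategy: simple-shuffle
```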
The above example demonstrates a configuration with two model groups. Each model group has two models and uses the least-busy and simple-shuffle routing strategies respectively. The router settings for each model group are defined under the router key in the model group configuration and are independent of each other.
Configuring the Router for embeddings
While the examples above demonstrate the configuration for LLMs, the Router can also be configured for embeddings. All of the routing settings and strategies mentioned above can be used for embeddings as well.
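For example, a model group of Azure embedding deployments could be routed in the same way (deployment names and endpoints are placeholders):

```yaml
model_groups:
  - id: azure_embeddings
    models:
      - provider: azure
        deployment: text-embedding-deployment-1
        api_base: https://my-endpoint-1.openai.azure.com
      - provider: azure
        deployment: text-embedding-deployment-2
        api_base: https://my-endpoint-2.openai.azure.com
    router:
      routing_strategy: least-busy
```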
Configuring the Router for other providers
The Router can be configured for other providers as well. The configuration is similar to the Azure provider configuration demonstrated above. Refer to the LLM and embeddings provider documentation for more information on provider-specific configuration.
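As an illustration, a model group using the OpenAI provider might look like this (the provider, model, and api_key keys are assumed to follow the provider documentation; values are placeholders):

```yaml
model_groups:
  - id: openai_gpt_4
    models:
      - provider: openai
        model: gpt-4
        api_key: ${OPENAI_API_KEY_1}   # placeholder environment variable
      - provider: openai
        model: gpt-4
        api_key: ${OPENAI_API_KEY_2}   # e.g. a second account or project
    router:
      routing_strategy: simple-shuffle
```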
Configuring the Router for self-hosted deployments
The Router can also be configured for self-hosted deployments. The configuration is similar to the Azure provider configuration demonstrated above. Some examples for different providers are shown below:
vLLM
The use_chat_completions_endpoint parameter is used to enable or disable the chat completions endpoint for the model. This parameter is optional and is set to true by default. For more information, refer to the LLM configuration documentation. With the Router enabled, the use_chat_completions_endpoint parameter should be set as a router-level setting and not at the model level.
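A sketch for two vLLM servers serving the same model, with use_chat_completions_endpoint set at the router level (the provider value, model name, and URLs are assumptions; adjust them to your deployment):

```yaml
model_groups:
  - id: vllm_llama
    models:
      - provider: self-hosted
        model: meta-llama/Llama-3.1-8B-Instruct
        api_base: http://vllm-host-1:8000/v1
      - provider: self-hosted
        model: meta-llama/Llama-3.1-8B-Instruct
        api_base: http://vllm-host-2:8000/v1
    router:
      routing_strategy: least-busy
      # router-level setting, not repeated per model
      use_chat_completions_endpoint: true
```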
llama.cpp
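A similar sketch for llama.cpp servers exposing their OpenAI-compatible HTTP endpoint (the provider value, model alias, and URLs are assumptions):

```yaml
model_groups:
  - id: llama_cpp_model
    models:
      - provider: self-hosted
        model: my-llama-model            # alias served by the llama.cpp server
        api_base: http://llama-cpp-host-1:8080/v1
      - provider: self-hosted
        model: my-llama-model
        api_base: http://llama-cpp-host-2:8080/v1
    router:
      routing_strategy: simple-shuffle
```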
Ollama
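And a sketch for two Ollama instances (the provider value, model name, and URLs are assumptions; check the provider documentation for the exact keys):

```yaml
model_groups:
  - id: ollama_llama3
    models:
      - provider: ollama
        model: llama3.1
        api_base: http://ollama-host-1:11434
      - provider: ollama
        model: llama3.1
        api_base: http://ollama-host-2:11434
    router:
      routing_strategy: least-busy
```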
Connecting via a proxy
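If your traffic has to go through a proxy, one approach is to point api_base at the proxy URL (the URL below is hypothetical):

```yaml
model_groups:
  - id: proxied_gpt_4
    models:
      - provider: openai
        model: gpt-4
        api_base: https://my-llm-proxy.example.com/v1   # hypothetical proxy endpoint
    router:
      routing_strategy: simple-shuffle
```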
Connecting via LiteLLM proxy
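A LiteLLM proxy exposes an OpenAI-compatible API, so one possible setup is to treat it as an OpenAI-style endpoint (the host, port, model name, and API key below are placeholders):

```yaml
model_groups:
  - id: litellm_proxy_gpt_4
    models:
      - provider: openai                      # the proxy speaks the OpenAI API
        model: gpt-4                          # model name as configured on the proxy
        api_base: http://litellm-proxy.example.com:4000
        api_key: ${LITELLM_PROXY_API_KEY}     # placeholder environment variable
    router:
      routing_strategy: simple-shuffle
```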
Caching
The Router caches the responses from the deployments to improve response time and reduce the load on the deployments. The cache settings can be configured through the cache_responses key in the router configuration.
Caching uses in-memory storage by default, which should not be used in production. When using the Router, caching should be enabled with persistent storage such as Redis.
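A sketch of a router with response caching backed by Redis (deployment details and connection values are placeholders):

```yaml
model_groups:
  - id: azure_gpt_4
    models:
      - provider: azure
        deployment: gpt-4-deployment-1
        api_base: https://my-endpoint-1.openai.azure.com
    router:
      routing_strategy: simple-shuffle
      cache_responses: true
      # persistent cache storage instead of the in-memory default
      redis_host: localhost
      redis_port: 6379
      redis_password: ${REDIS_PASSWORD}
```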
Configuration for Redis-based routing strategies
To use the cost-based-routing or usage-based-routing strategies, you need to define the Redis connection settings under the router key in your endpoints.yml file.
The redis_host, redis_port, and redis_password keys define the connection settings for the Redis server. The Redis connection can also be configured through the redis_url key.
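A sketch of a usage-based routing setup with the Redis connection defined under the router key (deployment details and connection values are placeholders; redis_url is shown as a commented-out alternative):

```yaml
model_groups:
  - id: azure_gpt_4
    models:
      - provider: azure
        deployment: gpt-4-deployment-1
        api_base: https://my-endpoint-1.openai.azure.com
      - provider: azure
        deployment: gpt-4-deployment-2
        api_base: https://my-endpoint-2.openai.azure.com
    router:
      routing_strategy: usage-based-routing
      redis_host: localhost
      redis_port: 6379
      redis_password: ${REDIS_PASSWORD}
      # alternatively, configure the connection with a single URL:
      # redis_url: redis://:password@localhost:6379
```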