
Multi-LLM Routing

Learn how to set up and configure multiple LLM and embedding models for use within a single component.

New in 3.11

Multi-LLM routing is available starting with version 3.11.0.

Overview

The Router for LLMs and embeddings is a feature that allows you to distribute and load-balance requests across multiple LLM and embedding deployments. This feature uses LiteLLM under the hood to implement the routing logic.

Configuration

To enable the Router for LLMs and embeddings, add the router key to your model group configuration in your endpoints.yml file.

endpoints.yml
model_groups:
  - id: azure_llm_deployments
    models:
      - provider: azure
        deployment: rasa-gpt-4
        api_base: https://azure-deployment/
        api_version: "2024-02-15-preview"
        api_key: ${MY_AZURE_API_KEY}
    router:
      routing_strategy: simple-shuffle

The above example demonstrates a router with just one model. You can add multiple models to the models key in the model group configuration; there is no limit to the number of models that a model group can contain.

endpoints.yml
model_groups:
  - id: azure_llm_deployments
    models:
      - provider: azure
        deployment: gpt-4-instance-france
        api_base: https://azure-deployment-france/
        api_version: "2024-02-15-preview"
        api_key: ${MY_AZURE_API_KEY_FRANCE}
      - provider: azure
        deployment: gpt-4-instance-canada
        api_base: https://azure-deployment-canada/
        api_version: "2024-02-15-preview"
        api_key: ${MY_AZURE_API_KEY_CANADA}
    router:
      routing_strategy: simple-shuffle
warning

The model deployments in a model group should use the same underlying model. For example, defining different versions of gpt-4, or a mix of gpt-4 and gpt-3.5-turbo, is not recommended because the component relying on the model group will not receive consistent outputs from the different underlying models. The routing feature is designed to distribute requests across different deployments of the same model.

The routing_strategy key defines the routing strategy that the Router will use to distribute requests.

The following routing strategies are available without any additional configuration:

  • simple-shuffle: The Router shuffles the requests and distributes them based on RPM (requests per minute) or weight.
  • least-busy: The Router distributes requests to the deployment with the fewest ongoing requests.
  • latency-based-routing: The Router distributes requests to the deployment with the lowest response time (see the example sketch below).

The following routing strategies require an additional Redis setup, as they rely on a caching mechanism:

  • cost-based-routing: The Router distributes requests to the deployment with the lowest cost.
  • usage-based-routing: The Router distributes requests to the deployment with the lowest TPM (tokens per minute) usage.

Refer to LiteLLM's routing strategy documentation for more information on the routing strategies.
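
For example, a model group that routes requests based on observed latency could look like the following sketch. The deployment names and endpoints are reused from the examples above for illustration:

endpoints.yml
model_groups:
  - id: azure_llm_deployments
    models:
      - provider: azure
        deployment: gpt-4-instance-france
        api_base: https://azure-deployment-france/
        api_version: "2024-02-15-preview"
        api_key: ${MY_AZURE_API_KEY_FRANCE}
      - provider: azure
        deployment: gpt-4-instance-canada
        api_base: https://azure-deployment-canada/
        api_version: "2024-02-15-preview"
        api_key: ${MY_AZURE_API_KEY_CANADA}
    router:
      routing_strategy: latency-based-routing # requests go to the deployment with the lowest response time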

Additional configuration parameters for the Router

The Router can be configured with additional parameters such as cooldown_time, num_retries, and allowed_fails to fine-tune the routing logic. Refer to LiteLLM's routing configuration documentation for more information on these parameters.

endpoints.yml
router:
  routing_strategy: simple-shuffle
  cooldown_time: 10 # in seconds - time to wait before retrying a failed request
  allowed_fails: 2 # number of allowed failures before the deployment is put into cooldown
  num_retries: 3 # number of retries

Configuring the Router for multiple model groups

endpoints.yml
model_groups:
  - id: azure_gpt4_deployments
    models:
      - provider: azure
        deployment: gpt-4-instance-france
        api_base: https://azure-deployment-france/
        api_version: "2024-02-15-preview"
        api_key: ${MY_AZURE_API_KEY_FRANCE}
      - provider: azure
        deployment: gpt-4-instance-canada
        api_base: https://azure-deployment-canada/
        api_version: "2024-02-15-preview"
        api_key: ${MY_AZURE_API_KEY_CANADA}
    router:
      routing_strategy: least-busy
  - id: azure_gpt35_turbo_deployments
    models:
      - provider: azure
        deployment: gpt-35-instance-france
        api_base: https://azure-deployment-france/
        api_version: "2024-02-15-preview"
        api_key: ${MY_AZURE_API_KEY_FRANCE}
      - provider: azure
        deployment: gpt-35-instance-canada
        api_base: https://azure-deployment-canada/
        api_version: "2024-02-15-preview"
        api_key: ${MY_AZURE_API_KEY_CANADA}
    router:
      routing_strategy: simple-shuffle

The above example demonstrates a configuration with two model groups. Each model group has two models; the first group uses the least-busy routing strategy and the second uses simple-shuffle. The router settings for each model group are defined under the router key in that group's configuration and are independent of each other.
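
Once the model groups are defined, a component refers to a group by its id. The sketch below assumes that a component's llm configuration in config.yml accepts the group id through a model_group key; the component name is illustrative, so check the component documentation for the exact keys:

config.yml
pipeline:
  - name: SingleStepLLMCommandGenerator # illustrative component
    llm:
      model_group: azure_gpt4_deployments # id of the model group defined in endpoints.yml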

Configuring the Router for embeddings

While the examples above demonstrate the configuration for LLMs, the Router can also be configured for embeddings. All the routing settings and strategies mentioned above can be used for embeddings as well.

endpoints.yml
model_groups:
  - id: azure_embeddings_deployments
    models:
      - provider: azure
        deployment: text-embeddings-instance-france
        api_base: https://azure-deployment-embeddings-france/
        api_version: "2024-02-15-preview"
        api_key: ${MY_AZURE_API_KEY_FRANCE}
      - provider: azure
        deployment: text-embeddings-instance-canada
        api_base: https://azure-deployment-embeddings-canada/
        api_version: "2024-02-15-preview"
        api_key: ${MY_AZURE_API_KEY_CANADA}
    router:
      routing_strategy: simple-shuffle

Configuring the Router for other providers

The Router can be configured for other providers as well. The configuration is similar to the Azure provider configuration demonstrated above. Refer to the LLM and embeddings provider documentation for more information on provider-specific configuration.
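
As an illustration of a non-Azure setup, the sketch below spreads gpt-4 requests across two OpenAI accounts by giving each model entry its own API key. It assumes the openai provider accepts a per-model api_key the same way the azure provider does, and the environment variable names are hypothetical:

endpoints.yml
model_groups:
  - id: openai_gpt4_deployments
    models:
      - provider: openai
        model: gpt-4
        api_key: ${OPENAI_API_KEY_ACCOUNT_1} # hypothetical environment variable
      - provider: openai
        model: gpt-4
        api_key: ${OPENAI_API_KEY_ACCOUNT_2} # hypothetical environment variable
    router:
      routing_strategy: simple-shuffle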

Configuring the Router for self-hosted deployments

The Router can also be configured for self-hosted deployments. The configuration for self-hosted deployments is similar to the Azure provider configuration demonstrated above. Some examples for different providers are shown below:

vLLM

endpoints.yml
model_groups:
  - id: vllm_deployments
    models:
      - provider: self-hosted
        model: meta-llama/Meta-Llama-3-8B # model name
        api_base: "http://localhost:8000/v1" # hosted on port 8000
        # use_chat_completions_endpoint: false # cannot be set here when using the router
      - provider: self-hosted
        model: meta-llama/Meta-Llama-3-8B # model name
        api_base: "http://localhost:8001/v1" # hosted on port 8001
    router:
      routing_strategy: least-busy
      use_chat_completions_endpoint: false

The use_chat_completions_endpoint parameter enables or disables the chat completions endpoint for the model. This parameter is optional and is set to true by default. For more information, refer to the LLM configuration documentation.

With the Router enabled, the use_chat_completions_endpoint parameter should be set as a router-level setting and not at the model level.

llama.cpp

endpoints.yml
model_groups:
  - id: llamacpp_deployments
    models:
      - provider: self-hosted
        model: ggml-org/Meta-Llama-3.1-8B-Instruct-Q4_0-GGUF # model name
        api_base: "http://localhost:8080/v1" # hosted on port 8080
      - provider: self-hosted
        model: ggml-org/Meta-Llama-3.1-8B-Instruct-Q4_0-GGUF # model name
        api_base: "http://localhost:8081/v1" # hosted on port 8081
    router:
      routing_strategy: least-busy

Ollama

endpoints.yml
model_groups:
  - id: ollama_deployments
    models:
      - provider: ollama
        model: llama3.1 # model name
        api_base: "http://localhost:11434" # hosted on port 11434
      - provider: ollama
        model: llama3.1 # model name
        api_base: "http://localhost:11435" # hosted on port 11435
    router:
      routing_strategy: least-busy

Connecting via a proxy

If your self-hosted deployments are reachable only through a proxy server, set api_base to the proxy URL for each deployment:

endpoints.yml
model_groups:
  - id: vllm_deployments
    models:
      - provider: self-hosted
        model: meta-llama/Meta-Llama-3-8B # model name
        api_base: "http://your-proxy-url-1" # URL of the proxy server
      - provider: self-hosted
        model: meta-llama/Meta-Llama-3-8B # model name
        api_base: "http://your-proxy-url-2" # URL of the proxy server
    router:
      routing_strategy: least-busy

Connecting via LiteLLM proxy

To route requests through a running LiteLLM proxy server, use the litellm_proxy provider:

endpoints.yml
model_groups:
  - id: litellm_proxy_deployments
    models:
      - provider: litellm_proxy # provider name
        model: gpt-4-instance-1 # model name
        api_base: "http://localhost:4000" # URL of the litellm proxy server
      - provider: litellm_proxy # provider name
        model: gpt-4-instance-2 # model name
        api_base: "http://localhost:4000" # URL of the litellm proxy server
    router:
      routing_strategy: least-busy

Caching

The Router can cache responses from the deployments to improve response times and reduce the load on the deployments. Caching is configured through the cache_responses key in the router configuration.

endpoints.yml
router:
  routing_strategy: simple-shuffle
  cache_responses: true

Caching uses in-memory storage by default, which should not be used in production. When using the Router in production, enable caching with persistent storage such as Redis.
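
A sketch combining response caching with Redis, assuming the redis_* connection keys described in the next section also apply to the response cache:

endpoints.yml
router:
  routing_strategy: simple-shuffle
  cache_responses: true
  redis_host: localhost # assumption: the same Redis connection keys configure the cache
  redis_port: 6379
  redis_password: ${REDIS_PASSWORD}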

Configuration for Redis-based routing strategies

To use the cost-based-routing or usage-based-routing strategies, you need to define the Redis connection settings under the router key in your endpoints.yml file.

endpoints.yml
model_groups:
  - id: azure_llm_deployments
    models:
      - provider: azure
        deployment: rasa-gpt-4
        api_base: https://azure-deployment/
        api_version: "2024-02-15-preview"
        api_key: ${MY_AZURE_API_KEY}
    router:
      routing_strategy: cost-based-routing
      redis_host: localhost # can also be set through the environment - ${REDIS_HOST}
      redis_port: 6379 # can also be set through the environment - ${REDIS_PORT}
      redis_password: ${REDIS_PASSWORD}

The redis_host, redis_port, and redis_password keys define the connection settings for the Redis server. Alternatively, the connection can be configured through the redis_url key.

endpoints.yml
router:
  routing_strategy: cost-based-routing
  redis_url: "redis://:mypassword@host:port"

Configuring the Router for multiple model groups with Redis

endpoints.yml
model_groups:
  - id: azure_gpt4_deployments
    models:
      - provider: azure
        deployment: gpt-4-instance-france
        api_base: https://azure-deployment-france/
        api_version: "2024-02-15-preview"
        api_key: ${MY_AZURE_API_KEY_FRANCE}
      - provider: azure
        deployment: gpt-4-instance-canada
        api_base: https://azure-deployment-canada/
        api_version: "2024-02-15-preview"
        api_key: ${MY_AZURE_API_KEY_CANADA}
    router:
      routing_strategy: least-busy
      redis_host: localhost
      redis_port: 6379
      redis_password: ${REDIS_PASSWORD}
  - id: azure_gpt35_turbo_deployments
    models:
      - provider: azure
        deployment: gpt-35-instance-france
        api_base: https://azure-deployment-france/
        api_version: "2024-02-15-preview"
        api_key: ${MY_AZURE_API_KEY_FRANCE}
      - provider: azure
        deployment: gpt-35-instance-canada
        api_base: https://azure-deployment-canada/
        api_version: "2024-02-15-preview"
        api_key: ${MY_AZURE_API_KEY_CANADA}
    router:
      routing_strategy: simple-shuffle
      redis_host: localhost
      redis_port: 6379
      redis_password: ${REDIS_PASSWORD}