
How do I configure multiple model deployments and routing in Rasa?

Starting from Rasa 3.11, you can configure LLMs and embeddings in CALM without retraining your assistant. This guide shows you how to:

  • decouple LLM and embedding configurations from components
  • adapt model settings to different environments
  • enable load balancing for multiple model deployments
  • use separate credentials for multiple deployments from the same provider
  • use separate configurations for different environments

important

Starting from Rasa 3.11, direct LLM and embedding configurations inside components are deprecated. Define all clients in endpoints.yml under the model_groups key. Mixing approaches can lead to errors and is not supported.
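For reference, a deprecated pre-3.11 configuration defined the client directly inside the component in config.yml, roughly like this (a sketch; exact keys vary by component and version):

config.yml
pipeline:
  - name: SingleStepLLMCommandGenerator
    # Deprecated: client configuration embedded in the component
    llm:
      provider: openai
      model: gpt-4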

Decoupling LLM and embedding configurations from components

To decouple configurations, define your LLM and embedding client configurations in endpoints.yml under the model_groups key. Each model group should have a unique ID and define its associated models.

endpoints.yml
model_groups:
  - id: gpt-4-primary # Unique ID for the LLM deployment
    models:
      - provider: openai
        model: gpt-4
        timeout: 7
        temperature: 0.0
  - id: text-embedding-3-small-primary # Unique ID for the embedding deployment
    models:
      - provider: openai
        model: text-embedding-3-small

In config.yml, use the model_group key to reference the appropriate model group defined in endpoints.yml.

config.yml
pipeline:
  - name: SingleStepLLMCommandGenerator
    llm:
      model_group: gpt-4-primary # Reference the model group ID
    flow_retrieval:
      embeddings:
        model_group: text-embedding-3-small-primary # Reference the model group ID

Run the following command to train with the updated configurations:

rasa train --config config.yml --endpoints endpoints.yml

Once trained, the components will reference model configurations from endpoints.yml. Updating endpoints.yml doesn't require retraining.
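For example, if you later tune a value such as timeout in endpoints.yml, restarting the assistant with the updated file is enough; there is no need to run rasa train again:

rasa run --endpoints endpoints.yml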

warning

A model group that is referenced but not defined in endpoints.yml will cause errors. Ensure that every model group referenced by a component (e.g., gpt-4-primary) has a corresponding definition under model_groups in endpoints.yml.

Adapting model settings to different environments

You can configure your assistant to work across multiple environments, such as dev, staging, and prod, without retraining. Use the ${...} syntax to dynamically set values from environment variables for keys within the model group.

Supported keys:

  • api_base
  • api_version
  • deployment (specific to Azure OpenAI)
  • aws_access_key_id (specific to AWS Bedrock)
  • aws_secret_access_key (specific to AWS Bedrock)
  • aws_session_token (specific to AWS Bedrock)
  • aws_region_name (specific to AWS Bedrock)
endpoints.yml
model_groups:
  - id: gpt-4-primary # Unique ID for the LLM deployment
    models:
      - provider: azure
        deployment: ${AZURE_DEPLOYMENT_GPT4} # Dynamically set Azure deployment name
        api_base: ${AZURE_API_BASE_GPT4} # Dynamically set API base URL
        api_key: ${AZURE_API_KEY_GPT4} # Dynamically set API key
  - id: text-embedding-3-small-primary # Unique ID for the embedding deployment
    models:
      - provider: azure
        deployment: ${AZURE_DEPLOYMENT_EMBED_SMALL} # Dynamically set Azure deployment name
        api_base: ${AZURE_API_BASE_EMBED_SMALL} # Dynamically set API base URL
        api_key: ${AZURE_API_KEY_EMBED_SMALL} # Dynamically set API key
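At startup, these placeholders are resolved from environment variables. For example, you might export dev values like this before starting the assistant (all values below are placeholders):

export AZURE_DEPLOYMENT_GPT4="gpt-4-dev"
export AZURE_API_BASE_GPT4="https://dev-resource.openai.azure.com"
export AZURE_API_KEY_GPT4="<your-api-key>"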

Enabling load balancing for multiple model deployments

To distribute requests across multiple LLM or embedding deployments, update endpoints.yml as follows:

  1. Define multiple models in a model_group.
  2. Add a router with a routing_strategy to control how requests are distributed. You can find the available strategies here (TODO_ADD_LINK).
endpoints.yml
model_groups:
  - id: load-balanced-gpt-4 # Unique group ID for load balancing across multiple GPT-4 deployments
    models:
      # Azure GPT-4 deployment in France
      - provider: azure
        deployment: azure-deployment-france
        api_base: https://api.azure-france.example.com
        api_version: 2024-08-01-preview
        api_key: ${AZURE_API_KEY_FRANCE}
        timeout: 7
        temperature: 0.0
      # Azure GPT-4 deployment in the US
      - provider: azure
        deployment: azure-deployment-us
        api_base: https://api.azure-us.example.com
        api_version: 2024-08-01-preview
        api_key: ${AZURE_API_KEY_US}
        timeout: 7
        temperature: 0.0
      # OpenAI GPT-4 deployment
      - provider: openai
        model: gpt-4
        api_key: ${OPENAI_API_KEY}
        timeout: 7
        temperature: 0.0
    # Router configuration to distribute requests
    router:
      routing_strategy: least-busy # Route requests to the least busy deployment
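Components reference a load-balanced model group exactly like a single-model group, so the command generator from earlier could point at it unchanged:

config.yml
pipeline:
  - name: SingleStepLLMCommandGenerator
    llm:
      model_group: load-balanced-gpt-4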

Using separate credentials for multiple deployments from the same provider across components

When different components need to use separate deployments of the same provider, you can define individual model groups in endpoints.yml. This allows each deployment to use its own set of credentials, such as API keys, and specify its own API base URL.

To achieve this, use the ${...} syntax to reference environment variables for credentials.

For example, a rephraser might use gpt-3.5-turbo deployed in France with one set of credentials, while a SingleStepLLMCommandGenerator uses gpt-4 deployed in Switzerland. Another deployment, like text-embedding-3-small, might run on servers in the US.

endpoints.yml
model_groups:
  # Model group with Azure GPT-3.5-turbo deployment in France, used by the rephraser
  - id: gpt-3.5-rephraser # Unique ID for the gpt-3.5-turbo deployment
    models:
      - provider: azure
        deployment: ${AZURE_DEPLOYMENT_GPT_3_5_TURBO_FRANCE}
        api_base: ${AZURE_API_BASE_GPT_3_5_TURBO_FRANCE}
        api_key: ${AZURE_API_KEY_GPT_3_5_TURBO_FRANCE}
        timeout: 7
        temperature: 0.3
  # Model group with Azure GPT-4 deployment in Switzerland
  - id: gpt-4-primary
    models:
      - provider: azure
        deployment: ${AZURE_DEPLOYMENT_GPT_4_SWISS}
        api_base: ${AZURE_API_BASE_GPT_4_SWISS}
        api_key: ${AZURE_API_KEY_GPT_4_SWISS}
        timeout: 7
  # Model group with Azure text-embedding-3-small deployment in the US
  - id: text-embedding-3-small-primary
    models:
      - provider: azure
        deployment: ${AZURE_DEPLOYMENT_EMBED_SMALL_US}
        api_base: ${AZURE_API_BASE_EMBED_SMALL_US}
        api_key: ${AZURE_API_KEY_EMBED_SMALL_US}
        timeout: 7

Reference the appropriate model group from each component:

config.yml
pipeline:
  - name: SingleStepLLMCommandGenerator
    llm:
      model_group: gpt-4-primary
    flow_retrieval:
      embeddings:
        model_group: text-embedding-3-small-primary
endpoints.yml
nlg:
  type: rephraser
  llm:
    model_group: gpt-3.5-rephraser
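Each model group then resolves its own credentials from the environment, for example (placeholder values):

export AZURE_API_KEY_GPT_3_5_TURBO_FRANCE="<france-api-key>"
export AZURE_API_KEY_GPT_4_SWISS="<switzerland-api-key>"
export AZURE_API_KEY_EMBED_SMALL_US="<us-api-key>"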

Using separate configurations for different environments

When you need to adapt deployment setups for different environments, such as dev and prod, you can use separate endpoints.yml files. This approach keeps your config.yml consistent while adjusting the runtime configuration to meet environment-specific needs.

For instance, you might use a single deployment in dev for simplicity, while in prod, you can leverage multiple deployments with routing to handle higher traffic loads.

Create separate endpoints.yml files for each environment, such as endpoints.dev.yml and endpoints.prod.yml.

endpoints.dev.yml
model_groups:
  - id: gpt-4
    models:
      - provider: azure
        deployment: ${AZURE_DEPLOYMENT_GPT_4_SWISS}
        api_base: ${AZURE_API_BASE_GPT_4_SWISS}
        api_key: ${AZURE_API_KEY_GPT_4_SWISS}
        timeout: 7
  - id: text-embedding-3-small
    models:
      - provider: azure
        deployment: ${AZURE_DEPLOYMENT_TEXT_EMBEDDINGS_3_SMALL_SWISS}
        api_base: ${AZURE_API_BASE_TEXT_EMBEDDINGS_3_SMALL_SWISS}
        api_key: ${AZURE_API_KEY_TEXT_EMBEDDINGS_3_SMALL_SWISS}
        timeout: 7
endpoints.prod.yml
model_groups:
  - id: gpt-4
    models:
      - provider: azure
        deployment: ${AZURE_DEPLOYMENT_GPT_4_SWISS}
        api_base: ${AZURE_API_BASE_GPT_4_SWISS}
        api_key: ${AZURE_API_KEY_GPT_4_SWISS}
        timeout: 7
      - provider: azure
        deployment: ${AZURE_DEPLOYMENT_GPT_4_FRANCE}
        api_base: ${AZURE_API_BASE_GPT_4_FRANCE}
        api_key: ${AZURE_API_KEY_GPT_4_FRANCE}
        timeout: 7
    router:
      routing_strategy: least-busy
  - id: text-embedding-3-small
    models:
      - provider: azure
        deployment: ${AZURE_DEPLOYMENT_TEXT_EMBEDDINGS_3_SMALL_SWISS}
        api_base: ${AZURE_API_BASE_TEXT_EMBEDDINGS_3_SMALL_SWISS}
        api_key: ${AZURE_API_KEY_TEXT_EMBEDDINGS_3_SMALL_SWISS}
        timeout: 7
      - provider: azure
        deployment: ${AZURE_DEPLOYMENT_TEXT_EMBEDDINGS_3_SMALL_FRANCE}
        api_base: ${AZURE_API_BASE_TEXT_EMBEDDINGS_3_SMALL_FRANCE}
        api_key: ${AZURE_API_KEY_TEXT_EMBEDDINGS_3_SMALL_FRANCE}
        timeout: 7
    router:
      routing_strategy: least-busy

It doesn't matter which endpoints file you use to train the assistant, as long as endpoints.dev.yml and endpoints.prod.yml define the same model groups with identical unique IDs. When running the assistant, specify the endpoints file for the target environment.
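For example, after training once, you might start the assistant with the environment-specific file:

# Development
rasa run --endpoints endpoints.dev.yml

# Production
rasa run --endpoints endpoints.prod.yml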