How do I configure multiple model deployments and routing in Rasa?
Starting from Rasa 3.11, you can configure LLMs and embeddings in CALM without retraining your assistant. This guide shows you how to:
- Decouple model configurations from components
- Adapt model settings for different environments
- Enable load balancing for multiple model deployments
- Use separate credentials for multiple deployments from the same provider across components
- Use separate configurations for different environments
important
Starting from Rasa 3.11, direct LLM and embedding configurations inside components are deprecated. Define all clients in endpoints.yml under the model_groups key. Mixing approaches can lead to errors and is not supported.
Decoupling LLM and embedding configurations from components
To decouple configurations, define your LLM and embedding client configurations in endpoints.yml under the model_groups key. Each model group should have a unique ID and define its associated models.
Use the model_group key to reference the appropriate model group defined in endpoints.yml.
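For illustration, a minimal endpoints.yml could define a model group like this; the group ID gpt-4-primary is reused from the warning below, and the provider settings are placeholders:

```yaml
model_groups:
  - id: gpt-4-primary        # unique ID referenced from the component configuration
    models:
      - provider: openai     # placeholder provider
        model: gpt-4         # placeholder model name
```

A component in config.yml can then point at that group; this sketch assumes the llm.model_group key:

```yaml
pipeline:
  - name: SingleStepLLMCommandGenerator
    llm:
      model_group: gpt-4-primary   # references the group defined in endpoints.yml
```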
Run the following command to train with the updated configurations:
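```bash
rasa train
```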
Once trained, the components will reference model configurations from endpoints.yml. Updating endpoints.yml doesn't require retraining.
warning
A missing model group in endpoints.yml will cause errors. Ensure that all components referencing a model group (e.g., gpt-4-primary) have a corresponding definition in endpoints.yml.
Adapting model settings to different environments
You can configure your assistant to work across multiple environments, such as dev, staging, and prod, without retraining. Use the ${...} syntax to dynamically set values from environment variables for keys within the model group.
Supported keys:
- api_base
- api_version
- deployment (specific to Azure OpenAI)
- aws_access_key_id (specific to AWS Bedrock)
- aws_secret_access_key (specific to AWS Bedrock)
- aws_session_token (specific to AWS Bedrock)
- aws_region_name (specific to AWS Bedrock)
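Putting this together, a model group might pull environment-specific values from environment variables like this; the example is a sketch for an Azure OpenAI setup, and the environment variable names are placeholders:

```yaml
model_groups:
  - id: gpt-4-primary
    models:
      - provider: azure                     # placeholder provider
        deployment: ${AZURE_DEPLOYMENT}     # resolved from the environment at runtime
        api_base: ${AZURE_API_BASE}
        api_version: ${AZURE_API_VERSION}
```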
Enabling load balancing for multiple model deployments
To distribute requests across multiple LLM or embedding deployments, update endpoints.yml as follows:
- Define multiple models in a model_group.
- Add a router with a routing_strategy to control how requests are distributed. You can find the available strategies here (TODO_ADD_LINK).
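As a sketch, a group with two deployments and a router could look like the following; the deployment names, environment variables, and the simple-shuffle strategy value are illustrative:

```yaml
model_groups:
  - id: gpt-4-primary
    models:
      - provider: azure
        deployment: gpt-4-deployment-1     # first deployment (illustrative)
        api_base: ${AZURE_API_BASE_1}
      - provider: azure
        deployment: gpt-4-deployment-2     # second deployment (illustrative)
        api_base: ${AZURE_API_BASE_2}
    router:
      routing_strategy: simple-shuffle     # one possible strategy; see the linked list
```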
Using separate credentials for multiple deployments from the same provider across components
When different components need to use separate deployments of the same provider, you can define individual model groups in endpoints.yml. This allows each deployment to use its own set of credentials, such as API keys, and to specify its own API base URL.
To achieve this, use the ${...} syntax to reference environment variables for credentials.
For example, a rephraser might use gpt-3.5-turbo deployed in France with one set of credentials, while a SingleStepLLMCommandGenerator uses gpt-4 deployed in Switzerland. Another deployment, like text-embedding-3-small, might run on servers in the US.
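A corresponding endpoints.yml could define one model group per deployment, each with its own credentials pulled from environment variables; the group IDs and variable names below are illustrative:

```yaml
model_groups:
  - id: gpt-3-5-france
    models:
      - provider: openai
        model: gpt-3.5-turbo
        api_base: ${OPENAI_API_BASE_FRANCE}
        api_key: ${OPENAI_API_KEY_FRANCE}
  - id: gpt-4-switzerland
    models:
      - provider: openai
        model: gpt-4
        api_base: ${OPENAI_API_BASE_SWITZERLAND}
        api_key: ${OPENAI_API_KEY_SWITZERLAND}
  - id: embeddings-us
    models:
      - provider: openai
        model: text-embedding-3-small
        api_base: ${OPENAI_API_BASE_US}
        api_key: ${OPENAI_API_KEY_US}
```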
Reference the appropriate model group from each component.
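For example, a config.yml might reference the groups above like this; the sketch assumes the llm and embeddings model_group keys, and the group names match the illustrative endpoints.yml shown earlier:

```yaml
pipeline:
  - name: SingleStepLLMCommandGenerator
    llm:
      model_group: gpt-4-switzerland       # gpt-4 deployment in Switzerland
    flow_retrieval:
      embeddings:
        model_group: embeddings-us         # embedding deployment in the US
```

The rephraser would point its own llm configuration at gpt-3-5-france in the same way.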
Using separate configurations for different environments
When you need to adapt deployment setups for different environments, such as dev and prod, you can use separate endpoints.yml files. This approach keeps your config.yml consistent while adjusting the runtime configuration to meet environment-specific needs.
For instance, you might use a single deployment in dev for simplicity, while in prod, you can leverage multiple deployments with routing to handle higher traffic loads.
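For illustration, the two files might differ only in the models and routing inside the same group; all values below are placeholders:

```yaml
# endpoints.dev.yml -- one deployment, no routing
model_groups:
  - id: gpt-4-primary
    models:
      - provider: openai
        model: gpt-4
```

```yaml
# endpoints.prod.yml -- same group ID, multiple deployments with routing
model_groups:
  - id: gpt-4-primary
    models:
      - provider: openai
        model: gpt-4
        api_base: ${PROD_API_BASE_1}
      - provider: openai
        model: gpt-4
        api_base: ${PROD_API_BASE_2}
    router:
      routing_strategy: simple-shuffle   # illustrative strategy
```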
Create separate endpoints.yml files for each environment, such as endpoints.dev.yml and endpoints.prod.yml.
Which endpoints file you use for training the assistant doesn't matter, as long as both endpoints.dev.yml and endpoints.prod.yml define the same model groups with identical unique IDs. When running the assistant, specify the appropriate endpoints file for the environment.
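For example, assuming the standard --endpoints flag of the Rasa CLI, you could point the assistant at the production file when starting it:

```bash
rasa run --endpoints endpoints.prod.yml
```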