How do I configure multiple model deployments and routing in Rasa?
Starting from Rasa 3.11, you can configure LLMs and embeddings in CALM without retraining your assistant. This guide shows you how to:
- Decouple model configurations from components
- Adapt model settings for different environments
- Enable load balancing for multiple model deployments
- Use separate credentials for multiple deployments from the same provider across components
- Use separate configurations for different environments
important
Starting from Rasa 3.11, direct LLM and embedding configurations inside components are deprecated. Define all clients in endpoints.yml under the model_groups key. Mixing approaches can lead to errors and is not supported.
Decoupling LLM and embedding configurations from components
To decouple configurations, define your LLM and embedding client configurations in endpoints.yml under the model_groups key. Each model group should have a unique ID and define its associated models.
Use the model_group key to reference the appropriate model group defined in endpoints.yml.
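For illustration, a minimal endpoints.yml could define a model group like this; the group ID gpt-4-primary is reused from the warning below, and the provider settings are placeholders:

```yaml
model_groups:
  - id: gpt-4-primary        # unique ID referenced from the component configuration
    models:
      - provider: openai     # placeholder provider
        model: gpt-4         # placeholder model name
```

A component in config.yml can then point at that group; this sketch assumes the llm.model_group key:

```yaml
pipeline:
  - name: SingleStepLLMCommandGenerator
    llm:
      model_group: gpt-4-primary   # references the group defined in endpoints.yml
```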
Run the following command to train with the updated configurations:
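```bash
rasa train
```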
Once trained, the components will reference model configurations from endpoints.yml. Updating endpoints.yml doesn't require retraining.
warning
A missing model group in endpoints.yml will cause errors. Ensure that all components referencing a model group (e.g., gpt-4-primary) have a corresponding definition in endpoints.yml.
Adapting model settings to different environments
You can configure your assistant to work across multiple environments, such as dev, staging, and prod, without retraining. Use the ${...} syntax to dynamically set values from environment variables for keys within the model group.
Supported keys:
- api_base
- api_version
- deployment (specific to Azure OpenAI)
- aws_access_key_id (specific to AWS Bedrock)
- aws_secret_access_key (specific to AWS Bedrock)
- aws_session_token (specific to AWS Bedrock)
- aws_region_name (specific to AWS Bedrock)
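Putting this together, a model group might pull environment-specific values from environment variables like this; the example is a sketch for an Azure OpenAI setup, and the environment variable names are placeholders:

```yaml
model_groups:
  - id: gpt-4-primary
    models:
      - provider: azure                     # placeholder provider
        deployment: ${AZURE_DEPLOYMENT}     # resolved from the environment at runtime
        api_base: ${AZURE_API_BASE}
        api_version: ${AZURE_API_VERSION}
```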
Enabling load balancing for multiple model deployments
To distribute requests across multiple LLM or embedding deployments, update endpoints.yml as follows:
- Define multiple models in a model_group.
- Add a router with a routing_strategy to control how requests are distributed. You can find the available strategies here (TODO_ADD_LINK).
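As a sketch, a group with two deployments and a router could look like the following; the deployment names, environment variables, and the simple-shuffle strategy value are illustrative:

```yaml
model_groups:
  - id: gpt-4-primary
    models:
      - provider: azure
        deployment: gpt-4-deployment-1     # first deployment (illustrative)
        api_base: ${AZURE_API_BASE_1}
      - provider: azure
        deployment: gpt-4-deployment-2     # second deployment (illustrative)
        api_base: ${AZURE_API_BASE_2}
    router:
      routing_strategy: simple-shuffle     # one possible strategy; see the linked list
```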
Using separate credentials for multiple deployments from the same provider across components
When different components need to use separate deployments of the same provider, you can define individual model groups in endpoints.yml. This allows each deployment to use its own set of credentials, such as API keys, and to specify its own API base URL.
To achieve this, use the ${...} syntax to reference environment variables for credentials.
For example, a rephraser might use gpt-3.5-turbo deployed in France with one set of credentials, while a SingleStepLLMCommandGenerator uses gpt-4 deployed in Switzerland. Another deployment, like text-embedding-3-small, might run on servers in the US.
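A corresponding endpoints.yml could define one model group per deployment, each with its own credentials pulled from environment variables; the group IDs and variable names below are illustrative:

```yaml
model_groups:
  - id: gpt-3-5-france
    models:
      - provider: openai
        model: gpt-3.5-turbo
        api_base: ${OPENAI_API_BASE_FRANCE}
        api_key: ${OPENAI_API_KEY_FRANCE}
  - id: gpt-4-switzerland
    models:
      - provider: openai
        model: gpt-4
        api_base: ${OPENAI_API_BASE_SWITZERLAND}
        api_key: ${OPENAI_API_KEY_SWITZERLAND}
  - id: embeddings-us
    models:
      - provider: openai
        model: text-embedding-3-small
        api_base: ${OPENAI_API_BASE_US}
        api_key: ${OPENAI_API_KEY_US}
```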
Reference the appropriate model group from each component.
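For example, a config.yml might reference the groups above like this; the sketch assumes the llm and embeddings model_group keys, and the group names match the illustrative endpoints.yml shown earlier:

```yaml
pipeline:
  - name: SingleStepLLMCommandGenerator
    llm:
      model_group: gpt-4-switzerland       # gpt-4 deployment in Switzerland
    flow_retrieval:
      embeddings:
        model_group: embeddings-us         # embedding deployment in the US
```

The rephraser would point its own llm configuration at gpt-3-5-france in the same way.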
Using separate configurations for different environments
When you need to adapt deployment setups for different environments, such as dev and prod, you can use separate endpoints.yml files. This approach keeps your config.yml consistent while adjusting the runtime configuration to meet environment-specific needs.
For instance, you might use a single deployment in dev for simplicity, while in prod, you can leverage multiple deployments with routing to handle higher traffic loads.
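For illustration, the two files might differ only in the models and routing inside the same group; all values below are placeholders:

```yaml
# endpoints.dev.yml -- one deployment, no routing
model_groups:
  - id: gpt-4-primary
    models:
      - provider: openai
        model: gpt-4
```

```yaml
# endpoints.prod.yml -- same group ID, multiple deployments with routing
model_groups:
  - id: gpt-4-primary
    models:
      - provider: openai
        model: gpt-4
        api_base: ${PROD_API_BASE_1}
      - provider: openai
        model: gpt-4
        api_base: ${PROD_API_BASE_2}
    router:
      routing_strategy: simple-shuffle   # illustrative strategy
```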
Create separate endpoints.yml files for each environment, such as endpoints.dev.yml and endpoints.prod.yml.
Which endpoints file you use for training the assistant doesn't matter, as long as both endpoints.dev.yml and endpoints.prod.yml define the same model groups with identical unique IDs. When running the assistant, specify the appropriate endpoints file for the environment.
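For example, assuming the standard --endpoints flag of the Rasa CLI, you could point the assistant at the production file when starting it:

```bash
rasa run --endpoints endpoints.prod.yml
```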