Assertions reference
For more information on E2E testing and assertions, see the E2E testing product documentation.
Assertions are a beta feature in Rasa Pro 3.10.0. The feature is subject to change and may not be fully stable.
To enable the feature, set the environment variable RASA_PRO_BETA_E2E_ASSERTIONS to true in your testing environment:
export RASA_PRO_BETA_E2E_ASSERTIONS=true
Installation and Configuration Prerequisites
To evaluate generative assistant responses for relevance and factual accuracy in your end-to-end tests, install the optional dependency mlflow.
This dependency uses LLM (Large Language Model) evaluation to assess the relevance and factual accuracy of the Rasa Pro assistant's generative responses. The evaluating LLM is also referred to as an "LLM-as-Judge" model because it assesses another model's output. In Rasa Pro's use case, the LLM-as-Judge model evaluates whether the generative response is relevant to the provided input, or whether it is factually accurate in relation to the provided or extracted ground-truth text.
You can install the dependency using the following commands:
pip install rasa-pro[mlflow]
# or, if you are using Poetry, either of the following:
poetry add "rasa-pro[mlflow]"
poetry add rasa-pro -E mlflow
Generative Response LLM Judge Configuration
At the moment, this functionality only supports OpenAI models as the LLM Judge model. By default, the LLM Judge uses OpenAI's gpt-4o-mini model to benefit from its long context window. If you want to use a different model, you can configure the LLM Judge in the conftest.yml file, a new testing configuration file added in Rasa Pro 3.10. It is automatically discovered by Rasa Pro as long as it is placed in the root directory of your assistant project.
llm_as_judge:
  api_type: openai
  model: "gpt-4-0613"
Assertion Types
Assertions allow you to check events like flows starting, or to confirm that a generative response is relevant or grounded, among others.
If a user step contains assertions, the older step types bot: ... or utter: ... are ignored within that same step; rely on the bot_uttered assertion to check the response instead.
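For example, instead of checking the response with a bot: step, the same check is written as an assertion (a minimal sketch; the utterance text is illustrative):
# instead of:
# - user: "I want to book a flight"
#   bot: "Where would you like to fly to?"
# write:
- user: "I want to book a flight"
  assertions:
    - bot_uttered:
        text_matches: "Where would you like to fly to?"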
Below is a comprehensive list of assertion types you can use in your E2E tests. These allow you to verify everything from flow status to the factual grounding of a generative response.
Flow Started Assertion
flow_started checks if the flow with the provided id was started.
test_cases:
  - test_case: flight_booking
    steps:
      - user: "I want to book a flight"
        assertions:
          - flow_started: "flight_booking"
Flow Completed Assertion
flow_completed checks if the flow with the provided id was completed. Optionally, you can specify a flow_step_id if you want to confirm the final flow step.
test_cases:
  - test_case: flight_booking
    steps:
      - user: "What is the average cost of a flight from New York to San Francisco?"
        assertions:
          - flow_completed:
              flow_id: "pattern_search"
              flow_step_id: "action_trigger_search"
Flow Cancelled Assertion
flow_cancelled checks if the flow with the provided id was cancelled. You can also specify a flow_step_id if needed.
test_cases:
  - test_case: flight_booking
    steps:
      ... # other user steps
      - user: "Wait, I changed my mind, I don't want to book a flight."
        assertions:
          - flow_cancelled:
              flow_id: "flight_booking"
              flow_step_id: "make_payment"
Pattern Clarification Contains Assertion
pattern_clarification_contains checks if the clarification (repair) pattern was triggered and returned the expected flow names. This assertion must list all flow names that you expect the pattern to suggest.
test_cases:
  - test_case: flight_booking
    steps:
      - user: "make booking"
        assertions:
          - pattern_clarification_contains:
              - "flight booking"
              - "hotel booking"
Slot Was Set Assertion
slot_was_set checks if the slot(s) with the provided name were filled with the provided value. Match each slot's type in your domain (e.g. use boolean, integer, or float values, without quotes, as appropriate); see the sketch after the example below.
test_cases:
  - test_case: flight_booking
    steps:
      - user: "I want to book a flight from New York to San Francisco"
        assertions:
          - slot_was_set:
              - name: "origin"
                value: "New York"
              - name: "destination"
                value: "San Francisco"
Slot Was Not Set Assertion
slot_was_not_set checks if a slot was not filled. If you specify a value, it checks that the slot was not filled with that value.
test_cases:
  - test_case: flight_booking
    steps:
      - user: "I want to book a flight to San Francisco."
        assertions:
          - slot_was_not_set:
              - name: "origin"
          - slot_was_not_set:
              - name: "destination"
                value: "New York"
If only name is provided, the test confirms the slot's value remains None (or uninitialized).
Action Executed Assertion
action_executed checks if the specified action was triggered.
test_cases:
  - test_case: flight_booking
    steps:
      - user: "Book me a flight from New York to San Francisco tomorrow first thing in the morning."
        assertions:
          - action_executed: "action_book_flight"
Bot Uttered Assertion
bot_uttered checks if the bot's last utterance matches the provided pattern, buttons, and/or domain response name. Use text_matches for the utterance text, which can be a plain string or a regex.
test_cases:
  - test_case: flight_booking
    steps:
      - user: "I want to book a flight"
        assertions:
          - bot_uttered:
              utter_name: utter_ask_destination
              text_matches: "Where would you like to fly to?"
              buttons:
                - title: "New York"
                  payload: "/SetSlots(destination=New York)"
                - title: "San Francisco"
                  payload: "/SetSlots(destination=San Francisco)"
When asserting buttons, list them in the same order as defined in your domain file or custom action.
Bot Did Not Utter Assertion
bot_did_not_utter checks that the bot's utterance does not match the provided pattern, buttons, or domain response name.
test_cases:
  - test_case: flight_booking
    steps:
      - user: "I want to book a flight"
        assertions:
          - bot_did_not_utter:
              utter_name: utter_ask_payment
              text_matches: "How would you like to pay?"
              buttons:
                - title: "Credit Card"
                  payload: "/set_payment_method{'method': 'credit_card'}"
                - title: "PayPal"
                  payload: "/set_payment_method{'method': 'paypal'}"
Generative Response Is Relevant Assertion
generative_response_is_relevant checks if the bot's generative response is relevant to the user's message. The threshold (0–1) sets how strictly the system's relevance score is compared. The LLM Judge model scores a response from 1 to 5, and that score is mapped to the 0–1 range (for instance, if the score is mapped proportionally, a 4 out of 5 corresponds to 0.8 and would fail a 0.90 threshold).
test_cases:
  - test_case: flight_booking
    steps:
      - user: "What times are the flights from New York to San Francisco tomorrow?"
        assertions:
          - generative_response_is_relevant:
              threshold: 0.90
You can also specify utter_name if you want to check a specific domain response event:
- user: "Actually, I want to amend flight date to next week."
  assertions:
    - generative_response_is_relevant:
        threshold: 0.90
        utter_name: utter_ask_correction_confirmation
Generative Response Is Grounded Assertion
generative_response_is_grounded checks if the bot's generative response is factually accurate given a ground-truth reference. Like the relevance check, it uses a 0–1 threshold.
test_cases:
  - test_case: flight_booking
    steps:
      - user: "What is the average cost of a flight from New York to San Francisco?"
        assertions:
          - generative_response_is_grounded:
              threshold: 0.90
              ground_truth: "The average cost of a flight from New York to San Francisco is $500."
If the correct factual source is available in the response metadata (e.g. from an Enterprise Search lookup or a rephrased domain response), the test runner can extract it automatically when you don't provide ground_truth directly.
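In that case, the assertion can omit ground_truth entirely; a minimal sketch relying on automatic extraction from the response metadata:
test_cases:
  - test_case: flight_booking
    steps:
      - user: "What is the average cost of a flight from New York to San Francisco?"
        assertions:
          - generative_response_is_grounded:
              threshold: 0.90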