Version: Latest

Testing Your Assistant

Rasa Pro lets you validate and test your Assistant end-to-end.

Validating Data

Data validation verifies that no mistakes or major inconsistencies appear in your domain, NLU data, or story data. To validate your data, have your CI run this command:

rasa data validate

If you pass a max_history value to one or more policies in your config.yml file, provide the smallest of those values to the validator using the --max-history flag:

rasa data validate --max-history <max_history>

If data validation results in errors, training a model can also fail or yield bad performance, so it's always good to run this check before training a model. If you include the --fail-on-warnings flag, this step will also fail on warnings, which indicate more minor issues.
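For example, to make the validation step in your CI fail on warnings as well as errors:

rasa data validate --fail-on-warnings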

To read more about the validator and all of the available options, see the documentation for rasa data validate.

End-To-End Testing

New in 3.5

You can now use end-to-end testing to test your assistant as a whole, including dialogue management and custom actions.

End-to-end (E2E) testing is an enhanced and comprehensive CLI-based testing tool that allows you to test conversation scenarios with different pre-configured contexts, execute custom actions, verify response texts or names, and assert when slots are filled or not.

End-to-end testing is not limited to testing only the NLU or the dialogue model and allows you to design effective acceptance or integration tests. The main features of end-to-end testing are:

  • integration with the action server: you can execute custom actions in your tests; the prerequisite is to start the action server in the background.
  • test parametrization (e.g. different user profiles or other external factors): you can define multiple test fixtures with different pre-filled slots and re-use them in your tests.
  • verifying response texts or names: you can assert that the bot response text (including interpolated responses with slot values and conditional response variations) or utter name is as expected.
  • asserting that the bot either sets the slot value as expected or does not set the slot, or that the value is different from a specified value in the test step.
New Beta Feature in 3.10

To help you get started faster with end-to-end testing, we have introduced a new command rasa data convert e2e that converts your existing sample conversations into end-to-end test cases. You can learn more about this feature in the End-To-End Test Conversion section.

The resulting end-to-end test cases leverage the assertion format, which allows you to evaluate the relevance and factual accuracy of generative responses among other Rasa event-based assertions. You can learn more about this feature in the E2E testing with assertions section.

These features are in beta (experimental) and may change in future Rasa Pro versions.

Testing Best Practices

When writing end-to-end tests, it is important to not use personally identifiable information (PII) or sensitive data in your test cases. We recommend using generic names, locations, dates, and other placeholders to avoid any potential data privacy issues.

We also highly recommend not using a production action server connected to production external services for end-to-end testing. Instead, use a full stack test environment with separate development or staging instances of your action server and other external services. Alternatively, you can also mock the external services to avoid any unintended side effects.

How to write test cases

To write test cases, you need to create a YAML file inside the tests directory of your project. The name of the file should be e2e_test_cases.yml. You can also create a subdirectory inside the tests directory and place your test case YAML files there. These files will be automatically discovered and run by Rasa Pro; however, you need to provide the path to the subdirectory as a positional argument to the rasa test e2e command.
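For example, a project might be laid out as follows (the subdirectory and file names are illustrative), and the subdirectory is then passed to the command explicitly:

tests/
├── e2e_test_cases.yml
└── integration/
    ├── booking_tests.yml
    └── payment_tests.yml

rasa test e2e tests/integration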

Each input file must contain the test_cases key. The value of this key is a list of test cases. Each test case must include a name given to the test_case key and a list of test steps given to the steps key. A step can be either one of the following:

  • user: refers to a user's message.
  • bot: denotes a textual response generated by the bot.
  • utter: refers to a bot response name as defined in the domain file.
  • slot_was_set: indicates the successful setting of a slot, and depending on how it's defined:
    • if a slot name is provided, e.g. my_slot: checks that a slot with the given name is set.
    • if a key-value pair is provided, e.g. my_slot: value: checks that a slot with the given name is set with the expected value.
  • slot_was_not_set: specifies that a slot was not set, and depending on the definition:
    • if a slot name is provided, e.g. my_slot: checks that a slot with the given name is not set.
    • if a key-value pair is provided, e.g. my_slot: value: checks that a slot with the given name is not set or that its value differs from the expected value.

The following example illustrates how user, bot, utter and slot_was_set steps can be used in a test case:

tests.yml
test_cases:
- test_case: user books a restaurant # name of the test case must be provided and be unique
  steps:
  - user: I want to book a table for 4 at Xaigon for 8pm tonight
  - slot_was_set: # slot_was_set/slot_was_not_set can contain multiple slot names or key-value pairs
    - book_restaurant_name_of_restaurant: Xaigon
    - book_restaurant_number_of_people: "4"
    - book_restaurant_date: tonight
    - book_restaurant_time: 8pm
  - utter: utter_restaurant_available # utter is used for predefined domain utterances
  - utter: utter_ask_book_restaurant_reservation_name
  - user: Emil
  - slot_was_set:
    - book_restaurant_reservation_name: Emil
  - utter: utter_ask_book_restaurant_confirmation
  - user: yes
  - bot: Thank you for booking the table at Xaigon for 4 people at 8pm tonight. # bot is used to match textual responses

In each of the steps after the user step, you can specify multiple expected events. Any additional events that are found are ignored and do not cause the test to fail.

Slots

Slots can be specified as a list of either string values (representing the slot name) or slot name and slot value pairs. If the slot is specified as a key-value pair, the value is also compared. If the slot step contains only the slot name, it is asserted that the slot was either set (when using a slot_was_set step) or not set (when using a slot_was_not_set step).

- slot_was_set:
  - book_restaurant_name_of_restaurant: Xaigon # check that the slot is set and the value is Xaigon
  - book_restaurant_number_of_people: "4"
  - book_restaurant_time # check that the slot is set, regardless of the value with which it is set
- slot_was_not_set:
  - book_restaurant_reservation_name # check that the slot is not set (no SlotSet event found)
  - book_restaurant_reservation_date: "2023-11-11" # check that the slot either is not set or the value is not 2023-11-11

You do not need to specify all slot events in the slot_was_set or slot_was_not_set steps; you can check only the subset of slots you care about. You can think of slot_was_not_set as the inverse of slot_was_set: when you specify a slot_was_not_set step, the test runner looks for the absence of the SlotSet event in the tracker store, but only for that particular user step, not globally (see order of test steps and events).

Fixtures for Pre-Filled Slots

Using fixtures is an optional feature that enables the pre-filling of slots, ensuring specific context before individual test cases are run. The fixtures key at the top level of your test case configuration consists of a list of fixture names, each of which must be unique. These fixture names correspond to sets of slot key-value pairs. When a particular test case needs predefined slot values, you can reference the fixture name within the test case definition by adding it to the fixtures key.

Consider the following example, which includes a test case file with fixtures and two test cases that leverage these fixtures:

fixture-tests.yml
fixtures:
- premium: # name of the fixture must be provided and be unique
  - membership_type: premium # every fixture can contain multiple slot key-value pairs
  - logged_in: True
- standard:
  - logged_in: False
  - membership_type: standard
test_cases:
- test_case: "test_premium_booking"
  fixtures:
  - premium # re-use the name of the fixture provided in the fixtures section
  steps:
  - user: "Hi!"
  - bot: "Welcome back! How can I help you?"
  - user: "I want to book a trip."
  - utter: utter_ask_location
  - user: "I would like to travel to Lisbon."
  - slot_was_set:
    - location: "Lisbon"
  - utter: utter_ask_date
  - user: "I would like to travel on 22nd of June."
  - slot_was_set:
    - travel_date: "2023-06-22"
  - bot: "Great! I will book your trip to Lisbon on 22nd of June."
  - bot: "You saved 20% by being a premium member."
- test_case: "test_anonymous_booking"
  fixtures:
  - standard
  steps:
  - user: "Hi!"
  - bot: "Hey! How can I help you?"
  - user: "I want to book a trip."
  - utter: utter_ask_location
  - user: "I would like to travel to Paris."
  - slot_was_set:
    - location: "Paris"
  - utter: utter_ask_date
  - user: "I would like to travel on 2nd of April."
  - slot_was_set:
    - travel_date: "2023-04-02"
  - bot: "Great! I will book your trip to Paris on 2nd of April."
  - bot: "You can also choose to save 20% by becoming a premium member."

The slots defined in fixtures are set after the action_session_start action and before the test runner executes the first step.

Metadata on User Messages

New in 3.9

End-to-End testing now supports adding metadata on user messages.

This feature is only relevant if you use custom connectors with your assistant, pass extra information from your front end to your custom actions via the metadata key of the user message, and want to test the conversation flow based on the metadata provided at runtime. Please see the Metadata on messages section of the custom connectors documentation for more information.

Using metadata is an optional feature that enables the testing of interactions that are dynamically influenced by external metadata such as API response headers, middleware communications, or other contextual information.

The metadata key at the top level of your test case configuration consists of a list of metadata names, each of which must be unique. These metadata names correspond to key-value pairs of metadata. When all user steps in a test case need predefined metadata, you can reference the metadata name within the test case definition by adding it to the metadata key. In addition to this, you can also use the metadata key in each user step to provide additional metadata. This will merge with the test case level metadata before being passed to the UserMessage object.

note

In case of a conflict between the metadata provided in the user step and that of the test case during a merge operation, the user step metadata takes precedence and will override the one provided by the test case.

Consider the following example, which includes a test case file with metadata and two test cases that leverage these metadata:

metadata-tests.yml
metadata:
- user_info:
    language: English
    location: Europe
- device_info:
    os: linux
test_cases:
- test_case: "test_standard_booking"
  metadata: user_info
  steps:
  - user: "Hi!"
  - utter: "utter_greet"
  - user: "I would like to book a trip."
  - bot: "Where would you like to travel?"
  - user: "I want to travel to Lisbon."
    metadata: device_info
  - bot: "Your trip to Lisbon has been booked."
  - bot: "You saved 15% by booking with a standard membership."
  - bot: "Upgrade for more savings."
- test_case: "test_mood_great"
  steps:
  - user: "Hi!"
    metadata: user_info
  - bot: "Hey! How are you?"
  - user: "I am feeling wonderful."
  - bot: "Great, carry on!"

In the above example, all user steps in the test_standard_booking test case will have only the metadata user_info with the exception of the third user step, which will have the metadata device_info in addition to the user_info metadata. Also, only the first user step in the test_mood_great test case will have the user_info metadata, while other user steps have no metadata.

Stubbing Custom Actions

New Beta Feature in 3.10

You can now stub custom actions in your test cases to simulate the execution of a custom action without actually running the action server.

This feature is in beta (experimental) and may change in future Rasa Pro versions. To enable the feature, you must set the environment variable RASA_PRO_BETA_STUB_CUSTOM_ACTION to true in your testing environment.
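For example, in a bash-like shell you could enable the flag before running the tests:

export RASA_PRO_BETA_STUB_CUSTOM_ACTION=true
rasa test e2e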

We welcome your feedback on this feature through the Customer Office team.

You can stub regular custom actions in your test cases by defining the stub_custom_actions key at the top level of your test case file. This allows you to simulate the execution of a custom action without actually running the action server. The stub_custom_actions key consists of a dictionary of custom action names that you want to stub. The value of each custom action name is a dictionary of expected events and responses. This represents what the custom action would return and must follow the same format that the action server would return.

For example:

stub-tests.yml
stub_custom_actions:
  check_balance:
    events:
    - event: slot
      name: account_balance
      value: 1000
    responses:
    - text: "Your account balance is 1000."

When a custom action is stubbed, the test runner will not make a call to the action server but will instead look for the stub implementation. Note that a test run must stub all custom actions that are called in the test cases if this feature is used. If you'd like to stub only select custom actions, we recommend you create a separate test case file and run these custom actions separately with a development action server instance.
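A minimal sketch of that split, assuming hypothetical file names: keep fully stubbed scenarios in one file and run them with the beta flag enabled, then run the scenarios that exercise real custom actions against a development action server in a separate invocation.

# stubbed scenarios: no action server needed
export RASA_PRO_BETA_STUB_CUSTOM_ACTION=true
rasa test e2e tests/stubbed_test_cases.yml

# scenarios exercising real custom actions against a development action server
rasa run actions &
rasa test e2e tests/action_server_test_cases.yml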

You can define multiple stubs for the same custom action. Please follow the naming convention of test_case_id::action_name for the custom action name to differentiate between these stubs.

For example:

stub-multiple-tests.yml
stub_custom_actions:
  test_account_balance_is_positive::check_balance:
    events:
    - event: slot
      name: account_balance
      value: 1000
    responses:
    - text: "Your account balance is 1000."
  test_account_balance_is_empty::check_balance:
    events:
    - event: slot
      name: account_balance
      value: 0
    responses:
    - text: "Your account balance is empty."
info

The current 3.10 beta version of the stubbing feature does not support the stubbing of slot validation custom actions.

How to run the tests

To run the end-to-end tests locally or in the CI pipeline, use the rasa test e2e command. The command takes the following arguments:

  • positional argument for the path to the test cases file or directory containing the test cases: rasa test e2e <path>. If unspecified, the default path is tests/e2e_test_cases.yml.
  • optional argument for the trained model: --model <path>
  • optional argument for retrieving the trained model from remote storage: --remote-storage <remote-storage-location>
  • optional argument for the endpoints.yml file: --endpoints <path>
  • optional argument for stopping the test run at first failure: rasa test e2e --fail-fast
  • optional argument for exporting the test results to the e2e_results.yml file: rasa test e2e -o
  • optional argument for creating a coverage report: rasa test e2e --coverage-report
  • optional argument for specifying the output directory for the coverage report: rasa test e2e --coverage-output-path
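For example, a typical CI invocation combining several of these options could look like the following (the model and test paths are illustrative):

rasa test e2e tests/ --model models/latest.tar.gz --endpoints endpoints.yml --fail-fast -o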

Testing custom actions

To test custom actions, there are three prerequisites:

  • The action server must be running in the background.
  • The endpoints.yml file must contain the action server configuration.
  • The action side effects must result in a SlotSet, BotUttered or UserUttered event.

When initiating the E2E test, a check is performed to determine if the bot configuration specifies an action server endpoint in the endpoints.yml file. If an endpoint is defined, a health-check is performed by calling the /health endpoint. In case the action server is not responding, the test run exits with a status code of 1 and an error message.
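A minimal action server entry in endpoints.yml looks like the following sketch (port 5055 is the default used by rasa run actions; adjust the URL to your setup):

endpoints.yml
action_endpoint:
  url: "http://localhost:5055/webhook"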

You can start the action server in the background before running the tests:

rasa run actions &
rasa test e2e

How to interpret the output

By default, the results are always printed to stdout and the command will exit with exit code 0 (if all tests passed) or 1 (in case of test failures).

The output structure is inspired by pytest:

  • Failed test cases will be stacked in the order of their completion, each failed test highlighting the difference in identified mismatches similar to git diff:
    • Test steps that passed have no colouring or prefix (user steps are used as anchors for the diff and are always considered passed).
    • Expected test steps that failed are preceded by a + prefix (green in colour), while actual messages are preceded by a - prefix (red in colour).
  • The short test summary includes a list of every failed test case name and file location in a new line.

If the -o flag is specified in the command, the results are also written to the tests/e2e_results.yml file, which will contain a list of test results with the following keys:

  • name: the name of the test case
  • pass_status: the status of the test case, either True or False
  • expected_steps: the expected test steps
  • difference: a list of differences between the expected and actual test steps
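As a rough illustration, an entry for a failing test case could have the following shape (the nesting of expected_steps and the wording of the difference entries are assumptions here, not verbatim tool output):

- name: user books a restaurant
  pass_status: False
  expected_steps:
  - user: I want to book a table for 4 at Xaigon for 8pm tonight
  - utter: utter_restaurant_available
  difference:
  - "expected bot utterance utter_restaurant_available was not found"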

Limitations

End-to-end testing is a powerful tool for conducting thorough assessments of your conversational AI system. However, it does come with certain limitations that are important to consider. These limitations are as follows:

Dependency on events and the tracker store

End-to-end testing heavily relies on the availability of specific event types in the tracker store. In particular, it requires the presence of events such as BotUttered, UserUttered, and SlotSet to execute tests effectively. If your test scenario involves actions or events that do not generate these specific events, the testing algorithm is not able to evaluate them.

Order of test steps and events

It is essential to structure your test cases to closely mimic real conversations to avoid potential issues. The test runner works by running the user steps and capturing the events generated by the bot from the tracker store, after which it compares those events with the expected events, whether they are bot, utter, or slot test steps. Avoid creating test cases with multiple consecutive user steps followed by bot events, as only the events created after the last user step will be evaluated.

faulty-tests.yml
test_cases:
- test_case: user checks their balance and doesn't ask for anything else
  steps:
  - user: Show my balance
  - user: no
  - utter: utter_current_balance
  - utter: utter_can_do_something_else
  - utter: utter_noworries
- test_case: user checks their balance and then asks about transactions
  steps:
  - user: Show my balance
  - user: Show my transactions please
  - utter: utter_current_balance
  - utter: utter_can_do_something_else
  - utter: utter_transactions

Note that the order of the bot, utter and slot steps which follow a user step is not important. The test case will pass as long as the bot, utter and slot events are executed after the user step.

Testing the start of a conversation

The evaluation of actual events against the defined expected test steps begins after the action_session_start action, and it's advisable to start the test with a user step. However, it is possible to test events before the first user utterance when action_session_start has been customized.

test_cases:
- test_case: user books a restaurant
  steps:
  - utter: utter_welcome # action_session_start is modified to also utter_welcome
  - user: I want to book a table for 4 at Xaigon for 8pm tonight
  ...

Testing the accuracy of Enterprise Search

End-to-end testing prior to Rasa Pro v3.10 did not support testing the accuracy of the Enterprise Search Policy. This was because the test runner matched bot responses against expected responses exactly, so even slight differences in the bot response resulted in a test failure.

With the introduction of the new assertion format in Rasa Pro v3.10, you can now evaluate the relevance and factual accuracy of generative responses issued by Enterprise Search. You can learn more about this feature in the E2E testing with assertions documentation section.

E2E Test Coverage Report

New in 3.10

You can now run diagnostics on your end-to-end (E2E) test suite to evaluate how comprehensively it covers your assistant's functionality.

This feature is currently released in a beta version. The feature might change in the future. If you want to enable this beta feature, set the environment variable RASA_PRO_BETA_FINE_TUNING_RECIPE=true.

To generate a coverage report, use the --coverage-report option with the rasa test e2e command. You can also specify an output directory for the report with the --coverage-output-path argument. By default, the coverage results are printed to stdout and saved in the e2e_coverage_results directory within your assistant's folder. The report calculates coverage separately for both passing and failing tests.

rasa test e2e <path-to-test-cases> --coverage-report
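To control where the report files are written, combine it with the output path option (the directory name is illustrative):

rasa test e2e <path-to-test-cases> --coverage-report --coverage-output-path coverage_results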

The specific output artifacts are explained below.

Flow Coverage

The flow coverage report shows the percentage of flow steps that are covered by the E2E tests. Here's an example output:

Flow Name                                             Coverage   Num Steps   Missing Steps   Line Numbers
data/flows/order_pizza.yml::order_pizza               80.00%     5           1               [15-22]
data/flows/add_card.yml::add_card                     75.00%     4           1               [10-10]
data/flows/add_contact.yml::add_contact               25.00%     4           3               [22-35, 27-28, 31-32]
data/flows/authenticate_user.yml::authenticate_user   0.00%      5           5               [11-13, 23-24, 16-24, 20-21, 14-15]
data/flows/check_balance.yml::check_balance           0.00%      2           2               [7-7, 6-6]
Total                                                 40.00%     20          12

The Coverage column shows the percentage of steps covered by tests, while the Num Steps column indicates the total steps in each flow. The Missing Steps column lists steps not covered by tests, and Line Numbers indicates where these missing steps are located in the flow files.

important

From the above example, it is evident that the majority of the steps in data/flows/add_contact.yml::add_contact are not well tested, and none of the steps of data/flows/authenticate_user.yml::authenticate_user and data/flows/check_balance.yml::check_balance are tested. The next logical step is to add E2E tests that cover these flows.

As mentioned, the above analysis is done for both passing and failing tests, and the results are also written to the coverage_report_for_passed_tests.csv and coverage_report_for_failed_tests.csv files, respectively.

Command Coverage

The tool also outputs the coverage of commands the dialogue understanding module triggers when running the E2E tests. The output is a histogram for passing and failing tests separately in commands_histogram_for_passed_tests.png and commands_histogram_for_failed_tests.png.

Passing / failing tests

Each passing and failing test is added to passed.yml and failed.yml respectively.

By using these tools, you can ensure your assistant is thoroughly tested and any gaps in coverage are identified and addressed.