Dialogue Understanding Tests
Dialogue Understanding Tests are currently a beta feature. To enable the feature, set the
environment variable RASA_PRO_BETA_DIALOGUE_UNDERSTANDING_TEST to true.
Dialogue Understanding Tests (DUT) are designed to evaluate the command prediction accuracy of a chatbot's dialogue understanding module. Rather than merely assessing whether a chatbot behaves as expected in end-to-end (E2E) tests, these tests delve deeper into understanding why a chatbot may not be performing as anticipated. They aim to identify discrepancies between the expected and predicted commands during a conversation, providing insights into potential pitfalls in the command generator's operation.
The primary focus of DUTs is the command generator, a core component responsible for interpreting user input and orchestrating the chatbot's subsequent actions. When updates are made to the command generator — such as switching to a different LLM or tweaking the prompt — DUTs offer a structured approach to evaluate how accurately these changes affect command predictions.
Dialogue Understanding Tests only work for CALM-based assistants!
How Dialogue Understanding Tests work
To run DUTs, you first need to write test cases. Each test case consists of a sequence of interactions that simulate a conversation with the chatbot, broken down into user inputs, expected commands, and bot responses. Each step of a test case is evaluated independently: the commands predicted by the dialogue understanding module are compared with the expected commands defined in the test case. The DUT framework evaluates each step even if a previous test step failed, which allows for a more detailed analysis of the chatbot's performance. After all test cases have been executed, a detailed report is generated, including metrics such as accuracy, precision, recall, and f1-score for all commands.
Defining a Dialogue Understanding Test Case
A test case is structured as a sequence of interactions between a user and the chatbot, specifying both inputs and expected outcomes. Each test case is composed of multiple steps, and each step consists of:
- User Utterance: The input message from the user.
- Commands: All expected commands that are generated in response to the user's message.
- Bot Response(s): All expected responses from the bot, which can be either a direct textual response or a reference to a predefined bot response template.
Here is a sample test case:
test_cases:
  - test_case: user_adds_contact_to_their_list
    steps:
      - user: I want to add someone to my contact list
        commands:
          - start flow add_contact
      - utter: utter_ask_add_contact_handle
      - user: it's @barts
        commands:
          - set slot handle @barts
      - bot: "What is the name of the contact you want to add?"
      - user: just Bart
        commands:
          - set slot name Bart
The list of commands and the list of bot responses need to be complete!
Fixtures, Metadata, and Stubbing Custom Actions known from end-to-end (E2E) tests are supported as well. They behave the same as in E2E tests.
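For illustration, here is a minimal sketch of a test case that pre-fills a slot via a fixture, assuming the same YAML structure as in E2E tests; the fixture name and slot are hypothetical:
fixtures:
  - premium_membership:
      membership_type: premium

test_cases:
  - test_case: premium_user_adds_contact
    fixtures:
      - premium_membership
    steps:
      - user: I want to add someone to my contact list
        commands:
          - start flow add_contact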
Explanation of the Commands
Commands are the core components that direct the chatbot's actions in response to a user's input. Each command represents an operation or decision the bot should make. For details on the commands, see the key dialogue understanding commands.
The syntax of the commands matches the domain specific language (DSL) used in the prompt template of the
Command Generator.
The DSL of the commands depends on the Command Generator used.
The SingleStepLLMCommandGenerator uses the V1 DSL (deprecated), the CompactLLMCommandGenerator uses the V2 DSL, and the
SearchReadyLLMCommandGenerator uses the V3 DSL.
Make sure to use the correct DSL in your DUTs.
| Command | DSL V1 (deprecated) | DSL V2 | DSL V3 |
|---|---|---|---|
| StartFlow Command | StartFlow(flow_name) | start flow flow_name | start flow flow_name |
| SetSlot Command | SetSlot(slot_name, slot_value) | set slot slot_name slot_value | set slot slot_name slot_value |
| CancelFlow Command | CancelFlow() | cancel flow | cancel flow |
| Clarify Command | Clarify(options) | disambiguate flows options | disambiguate flows options |
| ChitChat Command | ChitChat() | offtopic reply | - |
| SearchAndReply Command | SearchAndReply() | provide info | search and reply |
| HumanHandoff Command | HumanHandoff() | hand over | - |
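For example, the same two commands expressed in the deprecated V1 DSL and in the V2 DSL, using the flow and slot names from the examples on this page:
# V1 DSL (SingleStepLLMCommandGenerator, deprecated)
commands:
  - StartFlow(transfer_money)
  - SetSlot(transfer_money_recipient, Tre)

# V2 DSL (CompactLLMCommandGenerator)
commands:
  - start flow transfer_money
  - set slot transfer_money_recipient Tre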
Custom Commands
It is also possible to use custom commands in the test cases.
See the Custom Commands section for more information.
When running the DUTs, you can pass the custom commands as an argument (--additional-commands).
You can also update the default command set by removing default commands (--remove-default-commands) and
replacing them with custom commands.
rasa test du <path-to-test-cases> --additional-commands my_module.CustomCancelFlow --remove-default-commands CancelFlow
The --additional-commands argument takes a list of custom command classes separated by spaces.
Like the --additional-commands argument, the --remove-default-commands argument takes a list of default
command classes separated by spaces.
The class name alone is sufficient for the --remove-default-commands argument because the default commands
are already known by the DUT framework.
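For example, a run that removes two default commands and registers custom replacements could look like this (the module and class names are hypothetical):
rasa test du dialogue_understanding_tests \
  --additional-commands my_module.CustomCancelFlow my_module.CustomHumanHandoff \
  --remove-default-commands CancelFlow HumanHandoff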
Explanation of placeholder_generated_answer
The placeholder_generated_answer is used in scenarios where a bot response is dynamically generated, such as
when the bot retrieves information from an external knowledge base.
In such cases, you may not know the exact wording of the bot's response ahead of time.
This placeholder should be used in the test case where a specific bot response is expected but may vary due
to external dynamic content or search results.
It signals that the exact bot utterance is not fixed, yet the test case recognizes and accepts a dynamically
generated response in its place.
Here is an example test case that uses placeholder_generated_answer:
test_cases:
  - test_case: user asks a knowledge question during flow
    steps:
      - user: I want to send some money to Tre
        commands:
          - start flow transfer_money
          - set slot transfer_money_recipient Tre
      - utter: utter_ask_transfer_money_amount_of_money
      - user: btw, are these transfers free of charge?
        commands:
          - search and reply
      - utter: placeholder_generated_answer
      - utter: utter_ask_transfer_money_amount_of_money
      - user: great, 50$ then
        commands:
          - set slot transfer_money_amount_of_money 50
      - utter: utter_ask_transfer_money_final_confirmation
      - user: yes
        commands:
          - set slot transfer_money_final_confirmation True
      - utter: utter_transfer_complete
Running Dialogue Understanding Tests
To run DUTs, execute the following command:
rasa test du <path-to-test-cases>
By default, the test cases are expected to be located in the dialogue_understanding_tests directory.
Execute rasa test du --help to see the available options for running DUTs.
In order to execute any custom action that is needed by the DUTs, you need to
either start the action server in the background before running the tests via rasa run actions or use
Stubbing Custom Actions.
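For example, assuming your custom actions live in the default actions package, you can start the action server in the background and then run the tests:
# start the action server in the background
rasa run actions &
# run the Dialogue Understanding Tests from the default directory
rasa test du dialogue_understanding_tests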
Criteria for Test Case Success
A test case is considered to have passed if all the expected commands match the predicted commands at
each step. The expected and predicted commands are considered identical if their types and arguments
exactly match, with the order of the commands being irrelevant.
To compare two commands we use the __eq__ method of the commands.
There's an exception for the Clarify command:
When defining a Clarify command in a Dialogue Understanding Test, you can either leave the
command's options empty or specify a list of options.
If you provide a list of options, the predicted Clarify command must include the exact same list
to match the expected command. If you leave the options list empty, the predicted Clarify command
can have any list of options.
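As an illustration, here is a sketch of the two variants, assuming the V2 DSL with space-separated flow names and two hypothetical flows, block_card and replace_card; the exact spelling of a Clarify command without options may differ in your setup:
# the predicted Clarify command must list exactly these options
- user: I need something done with my card
  commands:
    - disambiguate flows block_card replace_card

# the predicted Clarify command may have any list of options
- user: I need something done with my card
  commands:
    - disambiguate flows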
Dialogue Understanding Test Output
The output is logged to the console and saved in a detailed report file in a structured format for later analysis or record-keeping.
The following information is present in the output:
- Overall accuracy across all test cases and user utterances.
- Macro, micro, and weighted average F1 scores of the predicted commands.
- Number of passed and failed test cases.
- Number of passed and failed user utterances.
- Test case names of failed and passed test cases.
- A detailed diff of expected vs. predicted commands for each failed user message in a failed test case. A test case can have multiple failed user messages. The command generators listed in the output are the ones that generated the predicted commands. "LLM output for Command Generator" refers to the actual LLM output, which is then parsed into commands.
Example of a failed test case diff:
------------- test_case: <file-path>::user_adds_contact_to_their_list -------------
Number of failed steps: 1
== failure starting at user message 'it's @barts'.
-- COMMAND GENERATOR(s) --
CompactLLMCommandGenerator
-- CONVERSATION --
user: I want to add someone to my contact list
bot: What's the handle of the user you want to add?
user: it's @barts
-- EXPECTED -- | -- PREDICTED --
set slot handle @barts | set slot name @barts
-- LLM output for CompactLLMCommandGenerator --
set slot name @barts
---
- Command metrics for each command type, including the total count, true positives (tp), false positives (fp), false negatives (fn), precision, recall, and f1-score.
Example of command metrics:
start flow (total count: 10):
  tp: 10 fp: 0 fn: 0
  precision: 1.00
  recall   : 1.00
  f1       : 1.00
- Total latency of processing a full user message.
- Latency and prompt token metrics for the LLM-based Command Generator used.
If you start the DUTs with the --output-prompt flag, you will also see the prompt that
returned the predicted commands.
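For example:
rasa test du dialogue_understanding_tests --output-prompt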
Useful Scripts
We have written two scripts that help you get started with DUTs. The scripts can be found here.
Converting end-to-end Tests to DUTs
To convert end-to-end (E2E) tests into DUTs you can use a standalone Python script:
python convert_e2e_tests_to_du_tests.py <path-to-e2e-tests>
The script has the following parameters:
- <path-to-e2e-tests>: The path to your existing E2E test cases (can be a single file or a directory).
- --output-folder <output-folder>: The path where the converted test cases will be saved. The default is dialogue_understanding_tests.
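For example, converting E2E tests from a hypothetical tests/e2e folder into the default output folder:
python convert_e2e_tests_to_du_tests.py tests/e2e --output-folder dialogue_understanding_tests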
After running the script, the output folder structure will look like this:
<output-folder>
|-- ready
| |-- test_case_a.yml
| |-- test_case_b.yml
| |-- ...
|-- to_review
| |-- test_case_c.yml
| |-- test_case_d.yml
| |-- ...
Test cases that end up in ready are converted from E2E test cases that passed. In most cases, no further action is needed. However, the LLM might sometimes have predicted an obsolete command that is cleaned up during processing; such commands need to be deleted manually from the test cases. Test cases in to_review may require manual intervention because the corresponding E2E test failed. Review these cases to ensure that the converted test cases are correct and that the list of commands and bot responses is complete.
Converting DUTs from one DSL to another
If you need to transform your commands from one DSL format to another
(for instance, updating StartFlow(flow_name) to start flow_name or SetSlot(slot_name, slot_value) to set slot_name slot_value),
you can use a standalone Python script:
python convert_dut_dsl.py --dut-tests-dir <path> --output-dir <path> --dsl-mappings <path>
The script has the following required parameters:
- --dut-tests-dir <path>: The directory (relative or absolute) containing your existing Dialogue Understanding Tests (DUT). The script will look for .yaml or .yml files within this folder (and subfolders).
- --output-dir <path>: The directory where transformed files will be saved. The folder structure from your dut-tests-dir is preserved.
- --dsl-mappings <path>: The YAML file defining your DSL mapping rules.
The YAML file containing the mappings must adhere to the following format:
- from_dsl_regex: A regular expression (string) used to match the old DSL command. Must include any necessary anchors (like ^ and $) and capturing groups ( ... ) for dynamic parts.
- to_dsl_pattern: A string that contains placeholders like {1}, {2}, etc. Each placeholder corresponds to a capturing group in from_dsl_regex, in order of appearance.
- input_separators: Optional list of separators of the captured groups that can be replaced with the output_separator.
- output_separator: Output separator to replace separators from the list of input_separators in the captured group.
mappings:
  - from_dsl_regex: "^StartFlow\\(([^)]*)\\)$"
    to_dsl_pattern: "start {1}"
  - from_dsl_regex: "^SetSlot\\(([^,]+),\\s*(.*)\\)$"
    to_dsl_pattern: "set {1} {2}"
  - from_dsl_regex: "Clarify\\(([\"'a-zA-Z0-9_, ]*)\\)"
    to_dsl_pattern: "clarify {1}"
    input_separators:
      - ","
      - " "
    output_separator: " "
  # ... add more mappings here
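To make the placeholder semantics concrete, here is a minimal Python sketch of how a single mapping rule could be applied to one command. It is an illustration only, not the script's actual implementation:
import re

# one mapping rule, taken from the example above
from_dsl_regex = r"^SetSlot\(([^,]+),\s*(.*)\)$"
to_dsl_pattern = "set {1} {2}"

def apply_mapping(command: str) -> str:
    match = re.match(from_dsl_regex, command)
    if match is None:
        return command  # leave commands that do not match untouched
    # {1}, {2}, ... refer to the capturing groups in order of appearance
    return to_dsl_pattern.format("", *match.groups())

print(apply_mapping("SetSlot(transfer_money_recipient, Tre)"))
# -> set transfer_money_recipient Tre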