Testing your assistant requires you to write test conversations, which include the user messages and the conversation history. The format is a modified version of the one used to specify stories in your training data.
Here are some examples:
- Custom Actions
- Forms Happy Path
- Forms Unhappy Path
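As a minimal sketch of the test conversation format (the story name, intent, and response name below are hypothetical, not part of Rasa itself):

```bash
# Write a minimal test conversation; each user step records the literal
# message text together with the intent it should be classified as.
mkdir -p tests
cat > tests/test_stories.yml <<'EOF'
stories:
- story: greet the user (happy path)
  steps:
  - user: |
      hi there!
    intent: greet
  - action: utter_greet
EOF
```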
By default, Rasa Open Source saves test conversations to `tests/test_stories.yml`.
You can test your assistant against them by running:
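```bash
rasa test
```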
The command will always load all stories from any story files whose names start with `test_`, e.g. `test_stories.yml`. Your story test file names should always start with `test_` for this detection to work.
Testing Custom Actions
Custom Actions are not executed as part of test conversations. If your custom
actions append any events to the conversation, this has to be reflected in your test conversation
(e.g. by adding
slot_was_set events to your test conversation).
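For instance, here is a sketch of a test conversation that accounts for a custom action setting a slot (the intent, action, and slot names are hypothetical):

```bash
# Append a test story whose custom action step is followed by the
# slot_was_set event that the action would emit at runtime.
cat >> tests/test_stories.yml <<'EOF'
- story: custom action sets a slot
  steps:
  - user: |
      I live in Berlin
    intent: inform_city
  - action: action_store_city
  - slot_was_set:
    - city: Berlin
EOF
```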
To test the code of your custom actions, you should write unit tests for them and include these tests in your CI/CD pipeline.
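For example, a hypothetical CI step (the use of pytest and the test path are assumptions, not part of Rasa itself):

```bash
# Run the unit tests for your custom action code before training or deploying.
pip install pytest
python -m pytest tests/unit/
```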
If you have any questions or problems, please share them with us in the dedicated testing section on our forum!
Evaluating an NLU Model
If you are looking to tune the hyperparameters of your NLU model, check out this tutorial.
A standard technique in machine learning is to keep some data separate as a test set. You can split your NLU training data into train and test sets using:
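```bash
# By default, the split is written to the train_test_split/ directory.
rasa data split nlu
```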
If you've done this, you can see how well your NLU model predicts the test cases:
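```bash
# The path assumes the default output location of `rasa data split nlu`.
rasa test nlu --nlu train_test_split/test_data.yml
```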
If you don't want to create a separate test set, you can still estimate how well your model generalizes using cross-validation. To do this, add the `--cross-validation` flag:
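```bash
# data/nlu.yml is the default location of your NLU training data.
rasa test nlu --nlu data/nlu.yml --cross-validation
```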
You can find the full list of options in the CLI documentation on `rasa test`.
Comparing NLU Pipelines
If you pass multiple pipeline configurations (or a folder containing them) to the CLI, Rasa will run a comparison between the pipelines:
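```bash
# config_1.yml and config_2.yml are placeholders for your own pipeline configs.
rasa test nlu --nlu data/nlu.yml --config config_1.yml config_2.yml
```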
The command in the example above will create a train/test split from your data, then train each pipeline multiple times with 0, 25, 50, 70 and 90% of your intent data excluded from the training set. The models are then evaluated on the test set and the f1-score for each exclusion percentage is recorded. This process runs three times (i.e. with 3 test sets in total) and then a graph is plotted using the means and standard deviations of the f1-scores.
The f1-score graph, along with all train/test sets, the trained models, and the classification and error reports, will be saved into a folder called `nlu_comparison_results/`.

Intent Classification

The `rasa test` script will produce a report, confusion matrix, and confidence histogram for your model.
The report logs precision, recall and f1 measure for each intent and entity, as well as providing an overall average. You can save these reports as JSON files using the `--report` argument.
The confusion matrix shows you which intents are mistaken for others;
any samples which have been incorrectly predicted are logged and saved to a file called
errors.json for easier debugging.
The histogram that the script produces allows you to visualise the confidence distribution for all predictions, with the volume of correct and incorrect predictions being displayed by blue and red bars respectively. Improving the quality of your training data will move the blue histogram bars to the right and the red histogram bars to the left of the plot.
Response Selection

The evaluation script will produce a combined report for all response selector models in your pipeline. The report logs precision, recall and f1 measure for each response, as well as providing an overall average. You can save these reports as JSON files using the `--report` argument.
Entity Extraction

Only trainable entity extractors, such as the `CRFEntityExtractor`, will be evaluated by the `rasa test` script. If you use pretrained extractors like the `DucklingHTTPExtractor`, Rasa Open Source will not include these in the evaluation.
rasa test reports recall, precision, and f1 measure for each entity type that
your trainable entity extractors are trained to recognize.
incorrect entity annotations
If any of your entities are incorrectly annotated, your evaluation may fail. One common problem
is that an entity cannot stop or start inside a token.
For example, if you have an example for a `name` entity like `[Brian](name)'s house`, this is only valid if your tokenizer splits `Brian's` into multiple tokens.
To evaluate entity extraction we apply a simple tag-based approach. We don't consider BILOU tags, but only the
entity type tags on a per token basis. For a location entity like “near Alexanderplatz” we
expect the labels `LOC LOC` instead of the BILOU-based `B-LOC L-LOC`.
Our approach is more lenient when it comes to evaluation, as it rewards partial extraction and does not punish the splitting of entities. For example, given the aforementioned entity “near Alexanderplatz” and a system that extracts “Alexanderplatz”, our approach rewards the extraction of “Alexanderplatz” and punishes the missed out word “near”.
The BILOU-based approach, however, would label this as a complete failure since it expects Alexanderplatz
to be labeled as a last token in an entity (
L-LOC) instead of a single token entity (
U-LOC). Note also that
a split extraction of “near” and “Alexanderplatz” would get full scores on our approach and zero on the BILOU-based one.
Here's a comparison between the two scoring mechanisms for the phrase “near Alexanderplatz tonight”:
| extracted | Simple tags (score) | BILOU tags (score) |
|---|---|---|
| `[near Alexanderplatz](loc) [tonight](time)` | loc loc time (3) | B-loc L-loc U-time (3) |
| `[near](loc) [Alexanderplatz](loc) [tonight](time)` | loc loc time (3) | U-loc U-loc U-time (1) |
| `near [Alexanderplatz](loc) [tonight](time)` | O loc time (2) | O U-loc U-time (1) |
| `[near](loc) Alexanderplatz [tonight](time)` | loc O time (2) | U-loc O U-time (1) |
| `[near Alexanderplatz tonight](loc)` | loc loc loc (2) | B-loc I-loc L-loc (1) |
Evaluating a Dialogue Model
You can evaluate your trained dialogue model on a set of test stories by using the test script:
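```bash
# Evaluate the dialogue model on your test stories and write the
# evaluation output to the results/ directory.
rasa test core --stories test_stories.yml --out results
```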
This will print any failed stories to `results/failed_test_stories.yml`.
We count a story as failed if at least one of the actions
was predicted incorrectly.
The test script will also save a confusion matrix to a file called
results/story_confmat.pdf. For each action in your domain, the confusion
matrix shows how often the action was correctly predicted and how often an
incorrect action was predicted instead.
Comparing Policy Configurations
To choose a configuration for your dialogue model, or to choose hyperparameters for a specific policy, you want to measure how well your dialogue model will generalize to conversations it hasn't seen before. Especially in the beginning of a project, when you don't have a lot of real conversations to use to train your bot, you may not want to exclude some to use as a test set.
Rasa Open Source has some scripts to help you choose and fine-tune your policy configuration. Once you are happy with it, you can then train your final configuration on your full data set.
To do this, you first have to train models for your different configurations. Create two (or more) config files including the policies you want to compare, and then provide them to the train script to train your models:
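```bash
# config_1.yml and config_2.yml are placeholders for the policy configs you
# want to compare; the percentages match the exclusion steps described below.
rasa train core -c config_1.yml config_2.yml \
  --out comparison_models --runs 3 --percentages 0 5 25 50 70 95
```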
For each config file provided, Rasa Open Source will train dialogue models with 0, 5, 25, 50, 70 and 95% of your training stories excluded from the training data. This is done for multiple runs to ensure consistent results.
Once this script has finished, you can pass multiple models to the test script to compare the models you just trained:
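```bash
# comparison_models is the output directory from the training step above;
# stories_folder is a placeholder for your training or test story files.
rasa test core -m comparison_models --stories stories_folder \
  --out comparison_results --evaluate-model-directory
```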
This will evaluate each of the models on the provided stories (can be either training or test set) and plot some graphs to show you which policy performs best. By evaluating on the full set of stories, you can measure how well your model predicts the held-out stories. To compare single policies, create config files containing only one policy each.
This training process can take a long time, so we'd suggest letting it run somewhere in the background where it can't be interrupted.