Version: 2.0.x

Testing Your Assistant

Rasa Open Source lets you test dialogues end-to-end by running through test stories. You can also test the dialogue management and the message processing (NLU) separately.

End-to-end Testing

Testing your assistant requires you to write test stories, which include the user messages and the conversation history. The format is a modified version of the one used to specify stories in your training data.

Here are some examples:

stories:
- story: A basic end-to-end test
  steps:
  - user: |
      hello
    intent: greet
  - action: utter_ask_howcanhelp
  - user: |
      show me [chinese]{"entity": "cuisine"} restaurants
    intent: inform
  - action: utter_ask_location
  - user: |
      in [Paris]{"entity": "location"}
    intent: inform
  - action: utter_ask_price

By default Rasa Open Source saves test stories to tests/test_stories.yml. You can test your assistant against them by running:

rasa test

The command will always load all stories from any story files with filenames starting with test_, e.g. test_stories.yml. Your story test file names should always start with test_ for this detection to work.

Testing Custom Actions

Custom Actions are not executed as part of test stories. If your custom actions append any events to the conversation, this has to be reflected in your test story (e.g. by adding slot_was_set events to your test story).

To test the code of your custom actions, you should write unit tests for them and include these tests in your CI/CD pipeline.
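
As a minimal sketch of such a unit test, the example below assumes the logic of a custom action has been factored into a plain function so that it can be tested without a running action server. The function name `slot_events_for_cuisine`, the `cuisine` slot, and the dictionary shape of the event are hypothetical and chosen only for illustration; they are not part of the Rasa API.

```python
from typing import Any, Dict, List, Optional


def slot_events_for_cuisine(cuisine: Optional[str]) -> List[Dict[str, Any]]:
    """Hypothetical action logic: normalise the cuisine and emit one slot event.

    Events are plain dicts here; the "event"/"name"/"value" keys are an
    assumed shape used only for this sketch.
    """
    if not cuisine:
        return []
    return [{"event": "slot", "name": "cuisine", "value": cuisine.strip().lower()}]


def test_sets_normalised_cuisine() -> None:
    # Surrounding whitespace and capitalisation should not end up in the slot.
    events = slot_events_for_cuisine("  Chinese ")
    assert events == [{"event": "slot", "name": "cuisine", "value": "chinese"}]


def test_missing_cuisine_emits_no_events() -> None:
    # Without an extracted cuisine the action should not touch the slot.
    assert slot_events_for_cuisine(None) == []
```

Test functions written this way can be collected by a runner such as pytest in your CI/CD pipeline. Any slot event the real action appends would still need a matching slot_was_set step in the test story.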

If you have any questions or problems, please share them with us in the dedicated testing section on our forum.

Evaluating an NLU Model

Hyperparameter tuning: if you are looking to tune the hyperparameters of your NLU model, check out this tutorial.

A standard technique in machine learning is to keep some data separate as a test set. You can split your NLU training data into train and test sets using:

rasa data split nlu

If you've done this, you can see how well your NLU model predicts the test cases:

rasa test nlu --nlu train_test_split/test_data.yml

If you don't want to create a separate test set, you can still estimate how well your model generalises using cross-validation. To do this, add the flag --cross-validation:

rasa test nlu --nlu data/nlu.yml --cross-validation
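
To illustrate the idea behind cross-validation, here is a minimal sketch in plain Python. It is not how Rasa Open Source implements the flag: the helper `k_fold_splits` is hypothetical, and the split is a simple round-robin assignment that does not balance intents across folds.

```python
from typing import List, Sequence, Tuple


def k_fold_splits(
    examples: Sequence[str], folds: int
) -> List[Tuple[List[str], List[str]]]:
    """Return (train, test) pairs in which every example is held out exactly once."""
    if folds < 2 or folds > len(examples):
        raise ValueError("folds must be between 2 and the number of examples")
    splits = []
    for fold in range(folds):
        # Examples whose index falls into this fold form the test set.
        test = [ex for i, ex in enumerate(examples) if i % folds == fold]
        train = [ex for i, ex in enumerate(examples) if i % folds != fold]
        splits.append((train, test))
    return splits


examples = ["hi", "hello", "bye", "see you", "thanks", "cheers"]
for train, test in k_fold_splits(examples, folds=3):
    print(f"train on {len(train)} examples, evaluate on {test}")
```

A model is trained and evaluated once per fold, and the scores are averaged, so no separate test file is needed.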

You can find the full list of options in the CLI documentation on rasa test.

Comparing NLU Pipelines

By passing multiple pipeline configurations (or a folder containing them) to the CLI, Rasa will run a comparison between the pipelines:

rasa test nlu --config config_1.yml config_2.yml \
--nlu data/nlu.yml --runs 3 --percentages 0 25 50 70 90

The command in the example above will create a train/test split from your data, then train each pipeline multiple times with 0, 25, 50, 70 and 90% of your intent data excluded from the training set. The models are then evaluated on the test set and the f1-score for each exclusion percentage is recorded. This process runs three times (i.e. with 3 test sets in total) and then a graph is plotted using the means and standard deviations of the f1-scores.
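
The aggregation step described above can be sketched as follows. This is an illustration only: `summarise_runs` is a hypothetical helper, the f1-scores are made-up toy values, and the use of the population standard deviation is an assumption rather than a statement about what Rasa computes.

```python
from statistics import mean, pstdev
from typing import Dict, List, Tuple


def summarise_runs(
    f1_by_percentage: Dict[int, List[float]]
) -> Dict[int, Tuple[float, float]]:
    """Map each exclusion percentage to (mean f1, standard deviation) over runs."""
    return {
        percentage: (mean(scores), pstdev(scores))
        for percentage, scores in f1_by_percentage.items()
    }


# Made-up f1-scores from three runs, for illustration only.
toy_scores = {0: [0.90, 0.92, 0.94], 50: [0.80, 0.85, 0.90]}
for percentage, (avg, spread) in summarise_runs(toy_scores).items():
    print(f"{percentage}% excluded: f1 = {avg:.3f} +/- {spread:.3f}")
```

Plotting the mean per exclusion percentage, with the standard deviation as an error band, shows how quickly each pipeline degrades as training data is removed.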

The f1-score graph, along with all train/test sets, the trained models, classification and error reports, will be saved into a folder called nlu_comparison_results.

Intent Classification

The rasa test script will produce a report, confusion matrix, and confidence histogram for your model.

The report logs precision, recall and f1 measure for each intent and entity, as well as providing an overall average. You can save these reports as JSON files using the --report argument.
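
For readers unfamiliar with these metrics, the sketch below shows how per-intent precision, recall and f1 can be derived from gold and predicted labels. It is a self-contained illustration, not the code behind the report; the helper `per_intent_scores` and the toy labels are hypothetical.

```python
from collections import Counter
from typing import Dict, List


def per_intent_scores(
    gold: List[str], predicted: List[str]
) -> Dict[str, Dict[str, float]]:
    """Compute precision, recall and f1 for every intent seen in either list."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for g, p in zip(gold, predicted):
        if g == p:
            tp[g] += 1
        else:
            fp[p] += 1  # p was predicted but is wrong
            fn[g] += 1  # g should have been predicted but was missed
    report = {}
    for intent in sorted(set(gold) | set(predicted)):
        predicted_count = tp[intent] + fp[intent]
        gold_count = tp[intent] + fn[intent]
        precision = tp[intent] / predicted_count if predicted_count else 0.0
        recall = tp[intent] / gold_count if gold_count else 0.0
        total = precision + recall
        f1 = 2 * precision * recall / total if total else 0.0
        report[intent] = {"precision": precision, "recall": recall, "f1": f1}
    return report


gold = ["greet", "greet", "inform", "inform"]
predicted = ["greet", "inform", "inform", "inform"]
for intent, scores in per_intent_scores(gold, predicted).items():
    print(intent, {name: round(value, 3) for name, value in scores.items()})
```

In this toy data one greet message is mistaken for inform, which lowers the recall of greet and the precision of inform.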

The confusion matrix shows you which intents are mistaken for others; any samples which have been incorrectly predicted are logged and saved to a file called errors.json for easier debugging.

The histogram that the script produces allows you to visualise the confidence distribution for all predictions, with the volume of correct and incorrect predictions being displayed by blue and red bars respectively. Improving the quality of your training data will move the blue histogram bars to the right and the red histogram bars to the left of the plot.

Response Selection

The evaluation script will produce a combined report for all response selector models in your pipeline.

The report logs precision, recall and f1 measure for each response, as well as providing an overall average. You can save these reports as JSON files using the --report argument.

Entity Extraction

Only trainable entity extractors, such as the DIETClassifier and CRFEntityExtractor, will be evaluated by the rasa test script. If you use pretrained extractors like the DucklingHTTPExtractor, Rasa Open Source will not include these in the evaluation.

rasa test reports recall, precision, and f1 measure for each entity type that your trainable entity extractors are trained to recognize.

Incorrect entity annotations: if any of your entities are incorrectly annotated, your evaluation may fail. One common problem is that an entity cannot stop or start inside a token. For example, if you have an example for a name entity like [Brian](name)'s house, this is only valid if your tokenizer splits Brian's into multiple tokens.

Entity Scoring

To evaluate entity extraction we apply a simple tag-based approach. We don't consider BILOU tags, but only the entity type tags on a per-token basis. For a location entity like “near Alexanderplatz” we expect the labels LOC LOC instead of the BILOU-based B-LOC L-LOC.

Our approach is more lenient when it comes to evaluation, as it rewards partial extraction and does not punish the splitting of entities. For example, given the aforementioned entity “near Alexanderplatz” and a system that extracts “Alexanderplatz”, our approach rewards the extraction of “Alexanderplatz” and punishes the missed out word “near”.

The BILOU-based approach, however, would label this as a complete failure since it expects Alexanderplatz to be labeled as a last token in an entity (L-LOC) instead of a single token entity (U-LOC). Note also that a split extraction of “near” and “Alexanderplatz” would get full scores on our approach and zero on the BILOU-based one.

Here's a comparison between the two scoring mechanisms for the phrase “near Alexanderplatz tonight”:

| extracted | Simple tags (score) | BILOU tags (score) |
|---|---|---|
| [near Alexanderplatz](loc) [tonight](time) | loc loc time (3) | B-loc L-loc U-time (3) |
| [near](loc) [Alexanderplatz](loc) [tonight](time) | loc loc time (3) | U-loc U-loc U-time (1) |
| near [Alexanderplatz](loc) [tonight](time) | O loc time (2) | O U-loc U-time (1) |
| [near](loc) Alexanderplatz [tonight](time) | loc O time (2) | U-loc O U-time (1) |
| [near Alexanderplatz tonight](loc) | loc loc loc (2) | B-loc I-loc L-loc (1) |
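
The scores in the comparison above can be reproduced with a short sketch that counts, per token, how many predicted tags match the expected ones. This is an illustration of the two scoring schemes, not the evaluation code itself; the span representation and all function names are hypothetical.

```python
from typing import List, Tuple

# (first token index, last token index inclusive, entity type)
Span = Tuple[int, int, str]

TOKENS = ["near", "Alexanderplatz", "tonight"]
GOLD: List[Span] = [(0, 1, "loc"), (2, 2, "time")]


def simple_tags(spans: List[Span], n_tokens: int) -> List[str]:
    """Tag each token with its entity type only, or O if it is outside any entity."""
    tags = ["O"] * n_tokens
    for start, end, label in spans:
        for i in range(start, end + 1):
            tags[i] = label
    return tags


def bilou_tags(spans: List[Span], n_tokens: int) -> List[str]:
    """Tag each token with a BILOU prefix that encodes its position in the entity."""
    tags = ["O"] * n_tokens
    for start, end, label in spans:
        if start == end:
            tags[start] = f"U-{label}"
            continue
        tags[start] = f"B-{label}"
        for i in range(start + 1, end):
            tags[i] = f"I-{label}"
        tags[end] = f"L-{label}"
    return tags


def matching_tokens(gold: List[str], predicted: List[str]) -> int:
    """Count the tokens whose predicted tag equals the gold tag."""
    return sum(g == p for g, p in zip(gold, predicted))


def scores(predicted: List[Span]) -> Tuple[int, int]:
    """Return (simple tag score, BILOU tag score) for one extraction."""
    n = len(TOKENS)
    simple = matching_tokens(simple_tags(GOLD, n), simple_tags(predicted, n))
    bilou = matching_tokens(bilou_tags(GOLD, n), bilou_tags(predicted, n))
    return simple, bilou


# Split extraction: [near](loc) [Alexanderplatz](loc) [tonight](time)
print(scores([(0, 0, "loc"), (1, 1, "loc"), (2, 2, "time")]))
```

The split extraction keeps a full simple-tag score because every token still carries the right entity type, while the BILOU score drops because both loc tokens get a U- prefix instead of B- and L-.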

Evaluating a Dialogue Model

You can evaluate your trained dialogue model on a set of test stories by using the test script:

rasa test core --stories test_stories.yml --out results

This will print any failed stories to results/failed_test_stories.yml. We count a story as failed if at least one of the actions was predicted incorrectly.

The test script will also save a confusion matrix to a file called results/story_confmat.pdf. For each action in your domain, the confusion matrix shows how often the action was correctly predicted and how often an incorrect action was predicted instead.
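
The failure criterion and the confusion matrix can be pictured with a small sketch. It assumes each story has already been reduced to a list of expected actions and a list of predicted actions; the helpers `story_failed` and `confusion_counts` are hypothetical and do not reflect how the test script is implemented.

```python
from collections import Counter
from typing import Dict, List, Tuple


def story_failed(expected: List[str], predicted: List[str]) -> bool:
    """A story fails as soon as one predicted action differs from the expected one."""
    return expected != predicted


def confusion_counts(
    expected: List[str], predicted: List[str]
) -> Dict[Tuple[str, str], int]:
    """Count how often each (expected action, predicted action) pair occurs."""
    return Counter(zip(expected, predicted))


expected = ["utter_ask_howcanhelp", "utter_ask_location", "utter_ask_price"]
predicted = ["utter_ask_howcanhelp", "utter_ask_price", "utter_ask_price"]

print("failed:", story_failed(expected, predicted))
for (want, got), count in sorted(confusion_counts(expected, predicted).items()):
    print(f"expected {want}, predicted {got}: {count}")
```

Pairs where the two action names are equal correspond to the diagonal of the confusion matrix; all other pairs are mispredictions.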

Comparing Policy Configurations

To choose a configuration for your dialogue model, or to choose hyperparameters for a specific policy, you want to measure how well your dialogue model will generalize to conversations it hasn't seen before. Especially at the beginning of a project, when you have few real conversations to train your bot on, you may not want to set some of them aside as a test set.

Rasa Open Source has some scripts to help you choose and fine-tune your policy configuration. Once you are happy with it, you can then train your final configuration on your full data set.

To do this, you first have to train models for your different configurations. Create two (or more) config files including the policies you want to compare, and then provide them to the train script to train your models:

rasa train core -c config_1.yml config_2.yml \
--out comparison_models --runs 3 --percentages 0 5 25 50 70 95

For each config file provided, Rasa Open Source will train dialogue models with 0, 5, 25, 50, 70 and 95% of your training stories excluded from the training data. This is done for multiple runs to ensure consistent results.

Once this script has finished, you can pass multiple models to the test script to compare the models you just trained:

rasa test core -m comparison_models --stories stories_folder \
--out comparison_results --evaluate-model-directory

This will evaluate each of the models on the provided stories (either the training or the test set) and plot graphs showing which policy performs best. Because each model was trained with a share of the stories excluded, evaluating on the full set of stories measures how well the model predicts the stories held out from its training. To compare single policies, create config files containing only one policy each.


This training process can take a long time, so we'd suggest letting it run somewhere in the background where it can't be interrupted.