Evaluating Models


If you are looking to tune the hyperparameters of your NLU model, check out this tutorial.

Evaluating an NLU Model

A standard technique in machine learning is to keep some data separate as a test set. You can split your NLU training data into train and test sets using:

rasa data split nlu

If you’ve done this, you can see how well your NLU model predicts the test cases using this command:

rasa test nlu -u test_set.md --model models/nlu-20180323-145833.tar.gz

If you don’t want to create a separate test set, you can still estimate how well your model generalises using cross-validation. To do this, add the flag --cross-validation:

rasa test nlu -u data/nlu.md --config config.yml --cross-validation
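The idea behind cross-validation can be sketched in a few lines of plain Python. This is only an illustration of the general technique, not Rasa's implementation: the function name `k_fold_splits` and the `seed` parameter are hypothetical, and the real command may split the data differently (for example, stratified by intent).

```python
import random


def k_fold_splits(examples, folds=10, seed=0):
    """Yield (train, test) pairs; each example lands in exactly one test fold."""
    shuffled = list(examples)
    random.Random(seed).shuffle(shuffled)  # fixed seed keeps the split reproducible
    for i in range(folds):
        # Every folds-th example, starting at offset i, forms the test fold.
        test = shuffled[i::folds]
        train = [ex for j, ex in enumerate(shuffled) if j % folds != i]
        yield train, test


if __name__ == "__main__":
    data = [f"example {n}" for n in range(20)]
    for train, test in k_fold_splits(data, folds=5):
        print(len(train), "training examples,", len(test), "test examples")
```

Each model is trained on the `train` part and scored on the `test` part, and the scores are averaged over all folds.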

The full list of options for the script is:

usage: rasa test nlu [-h] [-v] [-vv] [--quiet] [-m MODEL] [-u NLU]
                     [--report [REPORT]] [--successes [SUCCESSES]]
                     [--errors ERRORS] [--histogram HISTOGRAM]
                     [--confmat CONFMAT] [-c CONFIG [CONFIG ...]]
                     [--cross-validation] [-f FOLDS] [-r RUNS]
                     [-p PERCENTAGES [PERCENTAGES ...]]

optional arguments:
  -h, --help            show this help message and exit
  -m MODEL, --model MODEL
                        Path to a trained Rasa model. If a directory is
                        specified, it will use the latest model in this
                        directory. (default: models)
  -u NLU, --nlu NLU     File or folder containing your NLU data. (default:
  --report [REPORT]     Output path to save the intent/entity metrics report.
                        (default: None)
  --successes [SUCCESSES]
                        Output path to save successful predictions. (default:
  --errors ERRORS       Output path to save model errors. (default:
  --histogram HISTOGRAM
                        Output path for the confidence histogram. (default:
  --confmat CONFMAT     Output path for the confusion matrix plot. (default:
  -c CONFIG [CONFIG ...], --config CONFIG [CONFIG ...]
                        Model configuration file. If a single file is passed
                        and cross validation mode is chosen, cross-validation
                        is performed, if multiple configs or a folder of
                        configs are passed, models will be trained and
                        compared directly. (default: None)

Python Logging Options:
  -v, --verbose         Be verbose. Sets logging level to INFO. (default:
  -vv, --debug          Print lots of debugging statements. Sets logging level
                        to DEBUG. (default: None)
  --quiet               Be quiet! Sets logging level to WARNING. (default:

Cross Validation:
  --cross-validation    Switch on cross validation mode. Any provided model
                        will be ignored. (default: False)
  -f FOLDS, --folds FOLDS
                        Number of cross validation folds (cross validation
                        only). (default: 10)

Comparison Mode:
  -r RUNS, --runs RUNS  Number of comparison runs to make. (default: 3)
  -p PERCENTAGES [PERCENTAGES ...], --percentages PERCENTAGES [PERCENTAGES ...]
                        Percentages of training data to exclude during
                        comparison. (default: [0, 25, 50, 75, 90])

Comparing NLU Pipelines

By passing multiple pipeline configurations (or a folder containing them) to the CLI, Rasa will run a comparative examination between the pipelines.

$ rasa test nlu --config pretrained_embeddings_spacy.yml supervised_embeddings.yml \
  --nlu data/nlu.md --runs 3 --percentages 0 25 50 70 90

The command in the example above will create a train/test split from your data, then train each pipeline multiple times with 0, 25, 50, 70 and 90% of your intent data excluded from the training set. The models are then evaluated on the test set and the f1-score for each exclusion percentage is recorded. This process runs three times (i.e. with 3 test sets in total) and then a graph is plotted using the means and standard deviations of the f1-scores.
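The aggregation step described above can be illustrated as follows. This is a minimal sketch, not Rasa's code: the function name `summarise_runs` is hypothetical, the f1 values in the usage example are made up purely for illustration, and the use of the population standard deviation (`pstdev`) rather than the sample standard deviation is an assumption.

```python
from statistics import mean, pstdev


def summarise_runs(f1_by_run):
    """Return (mean, standard deviation) of the f1-score per exclusion percentage.

    f1_by_run holds one list per run; each list contains one f1-score per
    exclusion percentage, in the same order for every run.
    """
    per_percentage = zip(*f1_by_run)  # regroup the scores by exclusion percentage
    return [(mean(scores), pstdev(scores)) for scores in per_percentage]


if __name__ == "__main__":
    percentages = [0, 25, 50, 70, 90]
    # Made-up scores for three runs, for illustration only.
    runs = [
        [0.90, 0.88, 0.80, 0.70, 0.50],
        [0.92, 0.85, 0.82, 0.68, 0.45],
        [0.91, 0.87, 0.78, 0.72, 0.55],
    ]
    for percentage, (avg, std) in zip(percentages, summarise_runs(runs)):
        print(f"{percentage:>2}% excluded: f1 = {avg:.3f} +/- {std:.3f}")
```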

The f1-score graph - along with all train/test sets, the trained models, classification and error reports - will be saved into a folder called nlu_comparison_results.

Intent Classification

The evaluation script will produce a report, confusion matrix, and confidence histogram for your model.

The report logs precision, recall and f1 measure for each intent and entity, as well as providing an overall average. You can save these reports as JSON files using the --report argument.
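For readers who want to see how these three metrics relate, here is a small self-contained sketch that computes them per intent from gold and predicted labels. It uses the standard definitions; the function name `per_label_scores` and the layout of the returned dictionary are assumptions made for this example and do not mirror the format of the JSON report.

```python
from collections import Counter


def per_label_scores(gold, predicted):
    """Compute precision, recall and f1 for every label in gold or predicted."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for g, p in zip(gold, predicted):
        if g == p:
            tp[g] += 1
        else:
            fp[p] += 1  # p was predicted but is wrong
            fn[g] += 1  # g should have been predicted but was missed

    report = {}
    for label in sorted(set(gold) | set(predicted)):
        predicted_count = tp[label] + fp[label]
        gold_count = tp[label] + fn[label]
        precision = tp[label] / predicted_count if predicted_count else 0.0
        recall = tp[label] / gold_count if gold_count else 0.0
        total = precision + recall
        f1 = 2 * precision * recall / total if total else 0.0
        report[label] = {"precision": precision, "recall": recall, "f1": f1}
    return report


if __name__ == "__main__":
    gold = ["greet", "greet", "inform", "inform"]
    predicted = ["greet", "inform", "inform", "inform"]
    for label, scores in per_label_scores(gold, predicted).items():
        print(label, scores)
```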

The confusion matrix shows you which intents are mistaken for others; any samples which have been incorrectly predicted are logged and saved to a file called errors.json for easier debugging.
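Conceptually, the confusion matrix and the error log come from the same pass over the predictions. The sketch below shows that idea; the function name `confusion_and_errors` and the dictionary keys used for the error entries are hypothetical and are not meant to reproduce the actual layout of errors.json.

```python
from collections import Counter


def confusion_and_errors(examples):
    """Count (gold, predicted) pairs and collect the misclassified examples.

    examples is an iterable of (text, gold_intent, predicted_intent) triples.
    """
    confusion = Counter()
    errors = []
    for text, gold, predicted in examples:
        confusion[(gold, predicted)] += 1
        if gold != predicted:
            # Hypothetical record layout, chosen for this illustration only.
            errors.append({"text": text, "intent": gold, "predicted": predicted})
    return confusion, errors


if __name__ == "__main__":
    results = [
        ("hello", "greet", "greet"),
        ("hi there", "greet", "inform"),
        ("show me chinese restaurants", "inform", "inform"),
    ]
    confusion, errors = confusion_and_errors(results)
    for (gold, predicted), count in sorted(confusion.items()):
        print(f"gold={gold} predicted={predicted}: {count}")
    print(errors)
```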

The histogram that the script produces allows you to visualise the confidence distribution for all predictions, with the volume of correct and incorrect predictions being displayed by blue and red bars respectively. Improving the quality of your training data will move the blue histogram bars to the right and the red histogram bars to the left of the plot.
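The data behind such a histogram is simply two sets of bin counts, one for correct and one for incorrect predictions. A minimal sketch, assuming confidences in the range 0 to 1 and ten equally wide bins (both the function name `confidence_histogram` and the bin count are assumptions for this example):

```python
def confidence_histogram(predictions, bins=10):
    """Bin prediction confidences separately for correct and incorrect predictions.

    predictions is an iterable of (confidence, is_correct) pairs with
    confidence between 0 and 1.
    """
    correct = [0] * bins
    incorrect = [0] * bins
    for confidence, is_correct in predictions:
        # A confidence of exactly 1.0 is placed in the last bin.
        index = min(int(confidence * bins), bins - 1)
        if is_correct:
            correct[index] += 1
        else:
            incorrect[index] += 1
    return correct, incorrect


if __name__ == "__main__":
    sample = [(0.95, True), (1.0, True), (0.55, True), (0.15, False)]
    correct, incorrect = confidence_histogram(sample)
    print("correct:  ", correct)
    print("incorrect:", incorrect)
```

With better training data, counts in `correct` should concentrate in the high-confidence bins and counts in `incorrect` in the low-confidence bins.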


A confusion matrix will only be created if you are evaluating a model on a test set. In cross-validation mode, the confusion matrix will not be generated.


If any of your entities are incorrectly annotated, your evaluation may fail. One common problem is that an entity cannot stop or start inside a token. For example, if you have an example for a name entity like [Brian](name)'s house, this is only valid if your tokenizer splits Brian's into multiple tokens. A whitespace tokenizer would not work in this case.
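You can check for this kind of problem yourself before running the evaluation. The following sketch tests whether an annotated entity span starts and ends on token boundaries under plain whitespace tokenization. It is a stand-alone illustration, not part of Rasa: the function name `entity_aligns_with_tokens` is hypothetical, and `str.split()` is used as a stand-in for a whitespace tokenizer.

```python
def entity_aligns_with_tokens(text, start, end):
    """Return True if the character span [start, end) matches token boundaries.

    Tokens are obtained by splitting on whitespace, which approximates a
    whitespace tokenizer.
    """
    token_starts = set()
    token_ends = set()
    position = 0
    for token in text.split():
        token_start = text.index(token, position)
        token_end = token_start + len(token)
        token_starts.add(token_start)
        token_ends.add(token_end)
        position = token_end
    return start in token_starts and end in token_ends


if __name__ == "__main__":
    text = "Brian's house"
    # "Brian" covers characters 0-5, but the whitespace token is "Brian's" (0-7).
    print(entity_aligns_with_tokens(text, 0, 5))  # False
    print(entity_aligns_with_tokens(text, 0, 7))  # True
```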

Entity Extraction

The CRFEntityExtractor is the only entity extractor which you train using your own data, and so is the only one that will be evaluated. If you use the spaCy or duckling pre-trained entity extractors, Rasa NLU will not include these in the evaluation.

Rasa NLU will report recall, precision, and f1 measure for each entity type that CRFEntityExtractor is trained to recognize.

Entity Scoring

To evaluate entity extraction we apply a simple tag-based approach. We don’t consider BILOU tags, but only the entity type tags on a per-token basis. For a location entity like “near Alexanderplatz” we expect the labels LOC LOC instead of the BILOU-based B-LOC L-LOC.

Our approach is more lenient when it comes to evaluation, as it rewards partial extraction and does not punish the splitting of entities. For example, given the aforementioned entity “near Alexanderplatz” and a system that extracts only “Alexanderplatz”, our approach rewards the extraction of “Alexanderplatz” and punishes the missed word “near”. The BILOU-based approach, however, would label this as a complete failure, since it expects Alexanderplatz to be labeled as the last token in an entity (L-LOC) instead of a single-token entity (U-LOC). Note also that a split extraction of “near” and “Alexanderplatz” would get full scores with our approach and zero with the BILOU-based one.

Here’s a comparison between the two scoring mechanisms for the phrase “near Alexanderplatz tonight”:

Extracted                                          Simple tags (score)   BILOU tags (score)
[near Alexanderplatz](loc) [tonight](time)         loc loc time (3)      B-loc L-loc U-time (3)
[near](loc) [Alexanderplatz](loc) [tonight](time)  loc loc time (3)      U-loc U-loc U-time (1)
near [Alexanderplatz](loc) [tonight](time)         O loc time (2)        O U-loc U-time (1)
[near](loc) Alexanderplatz [tonight](time)         loc O time (2)        U-loc O U-time (1)
[near Alexanderplatz tonight](loc)                 loc loc loc (2)       B-loc I-loc L-loc (1)
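The scores in this comparison can be reproduced with a short script. This is a sketch written for this page, not the code Rasa runs: entities are given as hypothetical `(start_token, end_token, label)` spans over the three tokens “near”, “Alexanderplatz” and “tonight”, and the score is the number of tokens whose tag matches the gold annotation.

```python
def simple_tags(n_tokens, spans):
    """Tag each token with its entity type only, or "O" if it is outside any entity."""
    tags = ["O"] * n_tokens
    for start, end, label in spans:
        for i in range(start, end):
            tags[i] = label
    return tags


def bilou_tags(n_tokens, spans):
    """Tag tokens with BILOU prefixes: Begin, Inside, Last, Outside, Unit."""
    tags = ["O"] * n_tokens
    for start, end, label in spans:
        if end - start == 1:
            tags[start] = "U-" + label
        else:
            tags[start] = "B-" + label
            tags[end - 1] = "L-" + label
            for i in range(start + 1, end - 1):
                tags[i] = "I-" + label
    return tags


def score(tagger, n_tokens, gold, extracted):
    """Count the tokens whose extracted tag equals the gold tag."""
    gold_tags = tagger(n_tokens, gold)
    extracted_tags = tagger(n_tokens, extracted)
    return sum(g == e for g, e in zip(gold_tags, extracted_tags))


if __name__ == "__main__":
    # Gold annotation: [near Alexanderplatz](loc) [tonight](time)
    gold = [(0, 2, "loc"), (2, 3, "time")]
    extractions = [
        [(0, 2, "loc"), (2, 3, "time")],
        [(0, 1, "loc"), (1, 2, "loc"), (2, 3, "time")],
        [(1, 2, "loc"), (2, 3, "time")],
        [(0, 1, "loc"), (2, 3, "time")],
        [(0, 3, "loc")],
    ]
    for extracted in extractions:
        print(score(simple_tags, 3, gold, extracted),
              score(bilou_tags, 3, gold, extracted))
```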

Evaluating a Core Model

You can evaluate your trained model on a set of test stories by using the evaluate script:

rasa test core --stories test_stories.md --out results

This will print the failed stories to results/failed_stories.md. We count any story as failed if at least one of the actions was predicted incorrectly.
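The failure criterion is strict: a single wrong action is enough. As a minimal illustration (the function name `failed_stories` and the input layout are hypothetical and unrelated to Rasa's internal data structures):

```python
def failed_stories(stories):
    """Return the names of stories with at least one incorrectly predicted action.

    stories maps a story name to a list of (expected_action, predicted_action) pairs.
    """
    return [
        name
        for name, steps in stories.items()
        if any(expected != predicted for expected, predicted in steps)
    ]


if __name__ == "__main__":
    results = {
        "story 1": [("utter_ask_howcanhelp", "utter_ask_howcanhelp"),
                    ("utter_ask_location", "utter_ask_location")],
        "story 2": [("utter_ask_howcanhelp", "utter_ask_howcanhelp"),
                    ("utter_ask_location", "utter_ask_price")],
    }
    print(failed_stories(results))  # ['story 2']
```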

In addition, this will save a confusion matrix to a file called results/story_confmat.pdf. For each action in your domain, the confusion matrix shows how often the action was correctly predicted and how often an incorrect action was predicted instead.

The full list of options for the script is:

usage: rasa test core [-h] [-v] [-vv] [--quiet] [-m MODEL [MODEL ...]]
                      [-s STORIES] [--max-stories MAX_STORIES] [--out OUT]
                      [--e2e] [--endpoints ENDPOINTS]
                      [--fail-on-prediction-errors] [--url URL]

optional arguments:
  -h, --help            show this help message and exit
  -m MODEL [MODEL ...], --model MODEL [MODEL ...]
                        Path to a pre-trained model. If it is a 'tar.gz' file
                        that model file will be used. If it is a directory,
                        the latest model in that directory will be used. If
                        multiple 'tar.gz' files are provided, all those models
                        will be compared. (default: [None])
  -s STORIES, --stories STORIES
                        File or folder containing your test stories. (default:
  --max-stories MAX_STORIES
                        Maximum number of stories to test on. (default: None)
  --out OUT             Output path for any files created during the
                        evaluation. (default: results)
  --e2e, --end-to-end   Run an end-to-end evaluation for combined action and
                        intent prediction. Requires a story file in end-to-end
                        format. (default: False)
  --endpoints ENDPOINTS
                        Configuration file for the connectors as a yml file.
                        (default: None)
  --fail-on-prediction-errors
                        If a prediction error is encountered, an exception is
                        thrown. This can be used to validate stories during
                        tests, e.g. on travis. (default: False)
  --url URL             If supplied, downloads a story file from a URL and
                        trains on it. Fetches the data by sending a GET
                        request to the supplied URL. (default: None)

Python Logging Options:
  -v, --verbose         Be verbose. Sets logging level to INFO. (default:
  -vv, --debug          Print lots of debugging statements. Sets logging level
                        to DEBUG. (default: None)
  --quiet               Be quiet! Sets logging level to WARNING. (default:

Comparing Policies

To choose a specific policy configuration, or to choose hyperparameters for a specific policy, you want to measure how well Rasa Core will generalise to conversations which it hasn’t seen before. Especially at the beginning of a project, you rarely have many real conversations to train your bot on, so you don’t want to throw some away to use as a test set.

Rasa Core has some scripts to help you choose and fine-tune your policy configuration. Once you are happy with it, you can then train your final configuration on your full data set. To do this, you first have to train models for your different policies. Create two (or more) config files including the policies you want to compare (containing only one policy each), and then use the compare mode of the train script to train your models:

$ rasa train core -c config_1.yml config_2.yml \
  -d domain.yml -s stories_folder --out comparison_models --runs 3 \
  --percentages 0 5 25 50 70 95

For each policy configuration provided, Rasa Core will be trained multiple times with 0, 5, 25, 50, 70 and 95% of your training stories excluded from the training data. This is done for multiple runs to ensure consistent results.
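To get a feel for how much training this involves, the sketch below enumerates one training job per combination of run, configuration and exclusion percentage. It is an illustration only: the function name `training_plan` is hypothetical, the story count of 100 is made up, and rounding the number of excluded stories down is an assumption about behaviour that this page does not specify.

```python
from itertools import product


def training_plan(configs, runs, percentages, total_stories):
    """List one (run, config, percentage, stories_kept) entry per model to train."""
    plan = []
    for run, config, percentage in product(range(1, runs + 1), configs, percentages):
        excluded = total_stories * percentage // 100  # rounding down is an assumption
        plan.append((run, config, percentage, total_stories - excluded))
    return plan


if __name__ == "__main__":
    plan = training_plan(
        configs=["config_1.yml", "config_2.yml"],
        runs=3,
        percentages=[0, 5, 25, 50, 70, 95],
        total_stories=100,  # made-up story count
    )
    print(len(plan), "models to train")
    for job in plan[:6]:
        print(job)
```

With two configurations, three runs and six percentages, this amounts to 36 training jobs, which is why the comparison can take a long time.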

Once this script has finished, you can use the evaluate script in compare mode to evaluate the models you just trained:

$ rasa test core -m comparison_models/<model-1>.tar.gz comparison_models/<model-2>.tar.gz \
  --stories stories_folder --out comparison_results

This will evaluate each of the models on the stories you pass in and plot some graphs to show you which policy performs best. Because the comparison models were trained with some stories excluded, evaluating on the full set of stories measures how well Rasa Core predicts the held-out stories.

If you’re not sure which policies to compare, we’d recommend trying out the EmbeddingPolicy and the KerasPolicy to see which one works better for you.


This training process can take a long time, so we’d suggest letting it run somewhere in the background where it can’t be interrupted.

End-to-End Evaluation

Rasa lets you evaluate dialogues end-to-end, running through test conversations and making sure that both NLU and Core make correct predictions.

To do this, you need some stories in the end-to-end format, which includes both the NLU output and the original text. Here is an example:

## end-to-end story 1
* greet: hello
   - utter_ask_howcanhelp
* inform: show me [chinese](cuisine) restaurants
   - utter_ask_location
* inform: in [Paris](location)
   - utter_ask_price

If you’ve saved end-to-end stories as a file called e2e_stories.md, you can evaluate your model against them by running:

$ rasa test --stories e2e_stories.md --e2e


Make sure your model file in models is a combined Core and NLU model. If it does not contain an NLU model, Core will use the default RegexInterpreter.