Evaluating and Testing


If you’re looking to evaluate both Rasa NLU and Rasa Core predictions combined, take a look at the section on end-to-end evaluation.

Evaluating a Trained Model

You can evaluate your trained model on a set of test stories by using the evaluate script:

$ python -m rasa_core.evaluate default --core models/dialogue \
  --stories test_stories.md -o results

This will write the failed stories to results/failed_stories.md. We count a story as failed if at least one of its actions was predicted incorrectly.

In addition, this will save a confusion matrix to a file called results/story_confmat.pdf. The confusion matrix shows, for each action in your domain, how often that action was predicted, and how often an incorrect action was predicted instead.
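The failure rule and the confusion matrix counts can be illustrated with a few lines of plain Python. This is only a sketch of the idea, not the code the evaluate script runs: the `stories` dictionary, the story names, and both helper functions are hypothetical and exist only for this example.

```python
from collections import Counter

# Hypothetical evaluation data: for each story, a list of
# (expected_action, predicted_action) pairs.
stories = {
    "greet_and_search": [
        ("utter_ask_howcanhelp", "utter_ask_howcanhelp"),
        ("utter_ask_location", "utter_ask_price"),  # wrong prediction
    ],
    "search_by_location": [
        ("utter_ask_location", "utter_ask_location"),
        ("utter_ask_price", "utter_ask_price"),
    ],
}


def failed_stories(stories):
    """Return the names of stories with at least one incorrect action."""
    return [
        name
        for name, steps in stories.items()
        if any(expected != predicted for expected, predicted in steps)
    ]


def confusion_counts(stories):
    """Count how often each (expected, predicted) action pair occurs."""
    return Counter(pair for steps in stories.values() for pair in steps)


failed = failed_stories(stories)
confusion = confusion_counts(stories)
print(failed)
print(confusion[("utter_ask_location", "utter_ask_price")])
```

Each off-diagonal entry of the confusion matrix corresponds to one `(expected, predicted)` pair whose two actions differ, so a single such pair is enough to mark the whole story as failed.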

The full list of options for the script is:

usage: evaluate.py default [-h] [-m MAX_STORIES] [-u NLU] [-o OUTPUT] [--e2e]
                           [--endpoints ENDPOINTS]
                           [--fail_on_prediction_errors] [--core CORE]
                           (-s STORIES | --url URL) [-v] [-vv] [--quiet]

optional arguments:
  -h, --help            show this help message and exit
  -m MAX_STORIES, --max_stories MAX_STORIES
                        maximum number of stories to test on
  -u NLU, --nlu NLU     nlu model to run with the server. None for regex
  -o OUTPUT, --output OUTPUT
                        output path for the any files created from the
  --e2e, --end-to-end   Run an end-to-end evaluation for combined action and
                        intent prediction. Requires a story file in end-to-end
  --endpoints ENDPOINTS
                        Configuration file for the connectors as a yml file
  --fail_on_prediction_errors
                        If a prediction error is encountered, an exception is
                        thrown. This can be used to validate stories during
                        tests, e.g. on travis.
  --core CORE           Path to a pre-trained core model directory
  -s STORIES, --stories STORIES
                        File or folder containing stories
  --url URL             If supplied, downloads a story file from a URL and
                        trains on it. Fetches the data by sending a GET
                        request to the supplied URL.
  -v, --verbose         Be verbose. Sets logging level to INFO
  -vv, --debug          Print lots of debugging statements. Sets logging level
                        to DEBUG
  --quiet               Be quiet! Sets logging level to WARNING

End-to-end evaluation of Rasa NLU and Core

Say your bot uses a dialogue model in combination with a Rasa NLU model to parse user messages, and you would like to evaluate how the two models perform together on whole dialogues. The evaluate script lets you evaluate dialogues end-to-end, combining Rasa NLU intent predictions with Rasa Core action predictions. You can activate this feature with the --e2e option in the rasa_core.evaluate module.

The story format used for end-to-end evaluation is slightly different to the standard Rasa Core stories, as you’ll have to include the user messages in natural language instead of just their intent. The format for the user messages is * <intent>:<Rasa NLU example>. The NLU part follows the markdown syntax for Rasa NLU training data.
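To make the user-message format concrete, the sketch below splits a line of the form `* <intent>:<Rasa NLU example>` into its two parts. It is not Rasa's own story reader; `parse_e2e_user_line` is a hypothetical helper written for this illustration, and it assumes the first colon on the line separates the intent from the example text.

```python
def parse_e2e_user_line(line):
    """Split '* <intent>: <text>' into (intent, text).

    Returns None for lines that are not end-to-end user messages.
    Assumption: the first colon separates the intent from the text, so
    standard story lines carrying JSON entities are not handled here.
    """
    stripped = line.strip()
    if not stripped.startswith("*"):
        return None  # action lines, headers, blank lines
    body = stripped[1:].strip()
    intent, separator, text = body.partition(":")
    if not separator:
        return None  # no colon: not in end-to-end format
    return intent.strip(), text.strip()


print(parse_e2e_user_line("* greet: hello"))
print(parse_e2e_user_line("* inform: in [Paris](location)"))
print(parse_e2e_user_line("   - utter_ask_howcanhelp"))
```

The entity annotation in the second example is left untouched in the text part, since it belongs to the Rasa NLU markdown syntax rather than to the story structure.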

Here’s an example of what an end-to-end story file may look like:

## end-to-end story 1
* greet: hello
   - utter_ask_howcanhelp
* inform: show me [chinese](cuisine) restaurants
   - utter_ask_location
* inform: in [Paris](location)
   - utter_ask_price

## end-to-end story 2

If you’ve saved these stories under e2e_stories.md, the full end-to-end evaluation command is this:

$ python -m rasa_core.evaluate default --core models/dialogue \
  --nlu models/nlu/current \
  --stories e2e_stories.md --e2e


Make sure you specify an NLU model to load with the dialogue model using the --nlu option of rasa_core.evaluate. If you do not specify an NLU model, Rasa Core will load the default RegexInterpreter.

Comparing Policies

To choose a specific policy, or to choose hyperparameters for a specific policy, you want to measure how well Rasa Core will generalise to conversations which it hasn’t seen before. Especially in the beginning of a project, you do not have a lot of real conversations to use to train your bot, so you don’t just want to throw some away to use as a test set.

Rasa Core has some scripts to help you choose and fine-tune your policy. Once you are happy with it, you can then train your final policy on your full data set. To do this, you first have to train models for your different policies. Create two (or more) policy config files of the policies you want to compare (containing only one policy each), and then use the compare mode of the train script to train your models:

$ python -m rasa_core.train compare -c policy_config1.yml policy_config2.yml \
  -d domain.yml -s stories_folder -o comparison_models --runs 3 --percentages \
  0 5 25 50 70 90 95

For each policy configuration provided, Rasa Core will be trained multiple times with 0, 5, 25, 50, 70, 90 and 95% of your training stories excluded from the training data. This is done for multiple runs, to ensure consistent results.
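As a rough feel for what these settings imply, the sketch below works out how many stories remain for training at each exclusion percentage, and how many models get trained per policy configuration. The total of 200 stories is made up, `stories_kept` is a hypothetical helper, and rounding the excluded count down is an assumption; the train script may round differently.

```python
def stories_kept(total_stories, percentages):
    """Map each exclusion percentage to the number of stories left for training."""
    kept = {}
    for pct in percentages:
        # Assumption: the number of excluded stories is rounded down.
        excluded = total_stories * pct // 100
        kept[pct] = total_stories - excluded
    return kept


# Values taken from the command above; 200 stories is a made-up total.
percentages = [0, 5, 25, 50, 70, 90, 95]
runs = 3

kept = stories_kept(200, percentages)
models_per_policy = len(percentages) * runs

for pct, count in kept.items():
    print(f"{pct:>2}% excluded -> {count} stories used for training")
print(f"{models_per_policy} models trained per policy configuration")
```

With seven percentages and three runs, each policy configuration is trained 21 times, which is why the comparison can take a long time.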

Once this script has finished, you can now use the evaluate script in compare mode to evaluate the models you just trained:

$ python -m rasa_core.evaluate compare --stories stories_folder \
  --core comparison_models \
  -o comparison_results

This will evaluate each of the models on the training set, and plot some graphs to show you which policy is best. By evaluating on the full set of stories, you can measure how well Rasa Core is predicting the held-out stories.

If you’re not sure which policies to compare, we’d recommend trying out the EmbeddingPolicy and the KerasPolicy to see which one works better for you.


This training process can take a long time, so we’d suggest letting it run somewhere in the background where it can’t be interrupted.

Evaluating stories over HTTP

Rasa Core’s server lets you retrieve evaluations for the currently loaded model. Say your Rasa Core server is running locally on port 5005, and your story evaluation file is saved at eval_stories.md. The command to post stories to the server for evaluation is this:

$ curl --data-binary @eval_stories.md "localhost:5005/evaluate" | python -m json.tool

If you would like to evaluate end-to-end stories, you may do so by adding the e2e=true query parameter:

$ curl --data-binary @eval_stories.md "localhost:5005/evaluate?e2e=true" | python -m json.tool
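The same requests can be sent from Python using only the standard library. This is a minimal sketch that mirrors the curl commands above: `build_evaluate_request` and `evaluate_stories` are hypothetical helpers, the host is assumed to be `http://localhost:5005`, and the response is assumed to be JSON because the curl output is piped through `json.tool`.

```python
import json
import urllib.request


def build_evaluate_request(stories_md, host="http://localhost:5005", e2e=False):
    """Build a POST request carrying the story file contents as the body."""
    url = host + "/evaluate"
    if e2e:
        url += "?e2e=true"  # same query parameter as in the curl example
    return urllib.request.Request(
        url, data=stories_md.encode("utf-8"), method="POST"
    )


def evaluate_stories(path, e2e=False):
    """Post a story file to a running server and return the parsed JSON reply."""
    with open(path, encoding="utf-8") as story_file:
        request = build_evaluate_request(story_file.read(), e2e=e2e)
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read().decode("utf-8"))


# Building the request does not contact the server, so this runs offline.
request = build_evaluate_request("## story\n* greet\n   - utter_greet\n", e2e=True)
print(request.get_method(), request.full_url)
```

Calling `evaluate_stories("eval_stories.md")` would then send the file to the server, provided a Rasa Core server is actually running at the assumed address.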
