Version: Master/Unreleased

Training Data Format

This page provides an overview of the different types of training data that go into a Rasa assistant and how this training data is structured.

Overview

Rasa Open Source uses YAML as a unified and extendable way to manage all training data, including NLU data, stories and rules.

With the YAML format, training data can be split over any number of YAML files, and every file can contain any kind of data. At training time, the training data parser reads the top-level keys in each file to decide what kind of data each section contains.
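As a rough sketch of what such a split might look like, the two fragments below keep NLU data in one file and conversation data in another. The file names data/nlu.yml and data/conversations.yml are assumptions chosen for illustration; the parser relies on the top-level keys, not on the file names.

```yaml
# data/nlu.yml (hypothetical file name)
version: "2.0"
nlu:
- intent: greet
  examples: |
    - hey
    - hi
```

```yaml
# data/conversations.yml (hypothetical file name)
# A single file may hold several top-level keys, here stories and rules.
version: "2.0"
stories:
- story: greet the user
  steps:
  - intent: greet
  - action: utter_greet
rules:
- rule: Greet user
  steps:
  - intent: greet
  - action: utter_greet
```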

The domain uses the same YAML format as the training data and can also be split across multiple files or combined in one file. The domain includes the definitions for responses and forms. See the documentation for the domain for information on how to format your domain file.

Legacy Formats

Looking for Rasa Open Source 1.x data formats? They are now deprecated, but you can still find the documentation for Markdown NLU data and Markdown stories.

High-Level Structure

Each file can contain one or more keys with corresponding training data, as long as no key appears more than once in the same file. The available keys are:

  • version
  • nlu
  • stories
  • rules

All YAML training data files should specify the version key so that they are parsed correctly. If a file does not specify a version key, Rasa assumes it uses the latest training data format specification supported by the installed version of Rasa Open Source. Training data files whose version is newer than the installed version of Rasa Open Source supports are skipped. Currently, the latest training data format specification for Rasa 2.x is 2.0.

Example

Here's a short example which keeps all training data in a single file:

version: "2.0"
nlu:
- intent: greet
  examples: |
    - Hey
    - Hi
    - hey there [Sara](name)
- intent: faq/language
  examples: |
    - What language do you speak?
    - Do you only handle english?
stories:
- story: greet and faq
  steps:
  - intent: greet
  - action: utter_greet
  - intent: faq
  - action: utter_faq
rules:
- rule: Greet user
  steps:
  - intent: greet
  - action: utter_greet

If you want to specify your test stories, you need to put them into a separate file:

tests/test_stories.yml
stories:
- story: greet and ask language
  steps:
  - user: |
      hey
    intent: greet
  - action: utter_greet
  - user: |
      what language do you speak
    intent: faq/language
  - action: utter_faq

Test stories use the same format as the story training data, with an additional user key that holds the text of each user message.

NLU Training Data

NLU training data consists of example user utterances categorized by intent. Training examples can also include entities. Entities are structured pieces of information that can be extracted from a user's message. You can also add extra information such as regular expressions and lookup tables to your training data to help the model identify intents and entities correctly.

NLU training data is defined under the nlu key. Items that can be added under this key are:

  • Training examples grouped by intent, optionally with annotated entities:

nlu:
- intent: check_balance
  examples: |
    - What's my [credit](account) balance?
    - What's the balance on my [credit card account]{"entity":"account","value":"credit"}

  • Synonyms:

nlu:
- synonym: credit
  examples: |
    - credit card account
    - credit account

  • Regular expressions:

nlu:
- regex: account_number
  examples: |
    - \d{10,12}

  • Lookup tables:

nlu:
- lookup: banks
  examples: |
    - JPMC
    - Comerica
    - Bank of America

Training Examples

Training examples are grouped by intent and listed under the examples key. Usually, you'll list one example per line as follows:

nlu:
- intent: greet
  examples: |
    - hey
    - hi
    - whats up

However, it's also possible to use an extended format if you have a custom NLU component and need metadata for your examples:

nlu:
- intent: greet
  examples:
  - text: |
      hi
    metadata:
      sentiment: neutral
  - text: |
      hey there!

The metadata key can contain arbitrary key-value data that stays with an example and is accessible by the components in the NLU pipeline. In the example above, the sentiment of the example could be used by a custom component in the pipeline for sentiment analysis.

You can also specify this metadata at the intent level:

nlu:
- intent: greet
  metadata:
    sentiment: neutral
  examples:
  - text: |
      hi
  - text: |
      hey there!

In this case, the content of the metadata key is passed on to every intent example.

If you want to specify retrieval intents, then your NLU examples will look as follows:

nlu:
- intent: chitchat/ask_name
  examples: |
    - What is your name?
    - May I know your name?
    - What do people call you?
    - Do you have a name for yourself?
- intent: chitchat/ask_weather
  examples: |
    - What's the weather like today?
    - Does it look sunny outside today?
    - Oh, do you mind checking the weather for me please?
    - I like sunny days in Berlin.

All retrieval intents have a suffix that identifies a particular response key for your assistant. In the above example, ask_name and ask_weather are the suffixes. The suffix is separated from the retrieval intent name by a / delimiter.

Special meaning of /

As shown in the above examples, the / symbol is reserved as a delimiter to separate retrieval intents from their associated response keys. Make sure not to use it in the name of your intents.
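As a sketch of how the retrieval intents above might be wired up, the fragments below assume that responses live in the domain under names of the form utter_<retrieval intent>/<response key>, and that a rule maps the base intent chitchat to the action utter_chitchat. The response texts, the rule name, and the file names are hypothetical.

```yaml
# domain.yml (fragment; response texts are made up for illustration)
responses:
  utter_chitchat/ask_name:
  - text: "My name is Sara."
  utter_chitchat/ask_weather:
  - text: "I can't check the weather, but I hope it's sunny where you are."
```

```yaml
# data/rules.yml (fragment)
rules:
- rule: Respond to chitchat
  steps:
  # The base intent is used here, without the /<response key> suffix.
  - intent: chitchat
  - action: utter_chitchat
```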

Entities

Entities are structured pieces of information that can be extracted from a user's message.

Entities are annotated in training examples with, at minimum, the entity's name. In addition to the entity name, you can annotate an entity with synonyms, roles, or groups.

In training examples, entity annotation would look like this:

nlu:
- intent: check_balance
  examples: |
    - how much do I have on my [savings](account) account
    - how much money is in my [checking]{"entity": "account"} account
    - What's the balance on my [credit card account]{"entity":"account","value":"credit"}

The full possible syntax for annotating an entity is:

[<entity-text>]{"entity": "<entity name>", "role": "<role name>", "group": "<group name>", "value": "<entity synonym>"}

The keywords role, group, and value are optional in this notation. The value keyword refers to synonyms. To understand what the labels role and group are for, see the section on entity roles and groups.
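To illustrate the optional role and group keywords, here is a minimal sketch. The intents (book_flight, order_pizza), entities (city, size, topping), and the role and group values are hypothetical and only show where each keyword goes in the annotation.

```yaml
nlu:
- intent: book_flight
  examples: |
    # role distinguishes two entities of the same type in one message
    - I want to fly from [Berlin]{"entity": "city", "role": "departure"} to [San Francisco]{"entity": "city", "role": "destination"}
- intent: order_pizza
  examples: |
    # group ties together entities that belong to the same item
    - Give me a [small]{"entity": "size", "group": "1"} pizza with [mushrooms]{"entity": "topping", "group": "1"} and a [large]{"entity": "size", "group": "2"} [pepperoni]{"entity": "topping", "group": "2"}
```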

Synonyms

Synonyms provide a way to normalize your training data by mapping an extracted entity to a value other than the literal text extracted. Synonyms can be defined in the format:

nlu:
- synonym: credit
  examples: |
    - credit card account
    - credit account

Synonyms can also be defined in-line in your training examples by specifying the value of the entity:

nlu:
- intent: check_balance
  examples: |
    - how much do I have on my [credit card account]{"entity": "account", "value": "credit"}
    - how much do I owe on my [credit account]{"entity": "account", "value": "credit"}

Read more about synonyms on the NLU Training Data page.

Regular Expressions

You can use regular expressions to improve intent classification and entity extraction in combination with the RegexFeaturizer and RegexEntityExtractor components in the pipeline.

The format for defining a regular expression is as follows:

nlu:
- regex: account_number
  examples: |
    - \d{10,12}

Here account_number is the name of the regular expression. When the regular expression is used as a feature for the RegexFeaturizer, its name does not matter. When using the RegexEntityExtractor, the name of the regular expression should match the name of the entity you want to extract.

Read more about when and how to use regular expressions with each component on the NLU Training Data page.

Lookup Tables

Lookup tables are lists of words used to generate case-insensitive regular expression patterns. The format is as follows:

nlu:
- lookup: banks
  examples: |
    - JPMC
    - Bank of America

When you supply a lookup table in your training data, the contents of that table are combined into one large regular expression. This regex is used to check each training example to see if it contains matches for entries in the lookup table.

Lookup table regexes are processed identically to the regular expressions directly specified in the training data and can be used either with the RegexFeaturizer or with the RegexEntityExtractor. The name of the lookup table is subject to the same constraints as the name of a regex feature.

Read more about using lookup tables on the NLU Training Data page.
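The naming constraint for the RegexEntityExtractor can be sketched as follows: the regex and the lookup table are each named after the entity they should extract. The entity names account_number and bank, the intent inform, and the example utterances are assumptions made for illustration.

```yaml
nlu:
# Named after the entity it should extract
- regex: account_number
  examples: |
    - \d{10,12}
# Named after the entity it should extract
- lookup: bank
  examples: |
    - JPMC
    - Comerica
    - Bank of America
- intent: inform
  examples: |
    - my account number is [1234567891](account_number)
    - I bank with [Comerica](bank)
```

If the patterns are only used as features for the RegexFeaturizer, the names could be chosen freely.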

Conversation Training Data

Stories and rules are both representations of conversations between a user and a conversational assistant. They are used to train the dialogue management model. Stories are used to train a machine learning model to identify patterns in conversations and generalize to unseen conversation paths. Rules describe small pieces of conversations that should always follow the same path and are used to train the RulePolicy.

Stories

Stories are composed of:

  • story: The story's name. The name is arbitrary and not used in training; you can use it as a human-readable reference for the story.
  • metadata: Arbitrary and optional. It is not used in training; you can use it to store relevant information about the story, such as the author.
  • a list of steps: The user messages and actions that make up the story

For example:

stories:
- story: Greet the user
  metadata:
    author: Somebody
    key: value
  steps:
  # list of steps
  - intent: greet
  - action: utter_greet

Each step can be one of the following:

User Messages

All user messages are specified with the intent: key and an optional entities: key.

While writing stories, you do not have to deal with the specific contents of the messages that the users send. Instead, you can take advantage of the output from the NLU pipeline, which lets you use the combination of an intent and entities to refer to all the possible messages the users can send to mean the same thing.

User messages follow the format:

stories:
- story: user message structure
  steps:
  - intent: intent_name  # Required
    entities:  # Optional
    - entity_name: entity_value

For example, to represent the sentence I want to check my credit balance, where credit is an entity:

stories:
- story: story with entities
  steps:
  - intent: account_balance
    entities:
    - account_type: credit
  - action: action_credit_account_balance

It is important to include the entities here as well because the policies learn to predict the next action based on a combination of both the intent and entities (you can, however, change this behavior using the use_entities attribute).
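As a hedged sketch of the use_entities attribute mentioned above, the domain fragment below assumes it is set per intent in the intents section of the domain. Which entities to list is a design choice; the values here are purely illustrative.

```yaml
# domain.yml (fragment)
intents:
# Only account_type is taken into account for this intent
- account_balance:
    use_entities:
    - account_type
# No entities are taken into account for this intent
- greet:
    use_entities: []
```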

Actions

All actions executed by the bot are specified with the action: key followed by the name of the action. While writing stories, you will encounter two types of actions:

  1. Responses: start with utter_ and send a specific message to the user. For example:

stories:
- story: story with a response
  steps:
  - intent: greet
  - action: utter_greet

  2. Custom actions: start with action_, run arbitrary code and send any number of messages (or none). For example:

stories:
- story: story with a custom action
  steps:
  - intent: feedback
  - action: action_store_feedback

Forms

A form is a specific kind of custom action that contains the logic to loop over a set of required slots and ask the user for this information. You define a form in the forms section in your domain. Once defined, the happy path for a form should be specified as a rule, but interruptions of forms or other "unhappy paths" should be included in stories so that the model can generalize to unseen conversation sequences. As a step in a story, a form takes the following basic format:

stories:
- story: story with a form
  steps:
  - intent: find_restaurant
  - action: restaurant_form  # Activate the form
  - active_loop: restaurant_form  # This form is currently active
  - active_loop: null  # Form complete, no form is active
  - action: utter_restaurant_found

The action step activates the form and begins looping over the required slots. The active_loop: restaurant_form step indicates that there is a currently active form. Much like a slot_was_set step, a form step doesn't set a form to active but indicates that it should already be activated. In the same way, the active_loop: null step indicates that no form should be active before the subsequent steps are taken.

A form can be interrupted and remain active; in this case the interruption should come after the action: <form to activate> step and be followed by the active_loop: <active form> step. An interruption of a form could look like this:

stories:
- story: interrupted food
  steps:
  - intent: request_restaurant
  - action: restaurant_form
  - intent: chitchat
  - action: utter_chitchat
  - active_loop: restaurant_form
  - active_loop: null
  - action: utter_slots_values
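For comparison, the happy path of the same form could be written as rules, as recommended at the start of this section. This is a minimal sketch that assumes one rule to activate the form and one to handle its completion; the rule names are hypothetical.

```yaml
rules:
- rule: Activate restaurant form
  steps:
  - intent: request_restaurant
  - action: restaurant_form
  - active_loop: restaurant_form

- rule: Submit restaurant form
  condition:
  # Only applies while the form is active
  - active_loop: restaurant_form
  steps:
  # The form runs and then deactivates itself
  - action: restaurant_form
  - active_loop: null
  - slot_was_set:
    - requested_slot: null
  # Action to run once the form is complete
  - action: utter_slots_values
```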

Slots

A slot event is specified under the key slot_was_set: with the slot name and optionally the slot's value.

Slots act as the bot's memory. Slots are set by entities or by custom actions and referenced by stories in slot_was_set steps. For example:

stories:
- story: story with a slot
  steps:
  - intent: celebrate_bot
  - slot_was_set:
    - feedback_value: positive
  - action: utter_yay

This means the story requires that the current value for the feedback_value slot be positive for the conversation to continue as specified.

Whether or not you need to include the slot's value depends on the slot type and whether the value can or should influence the dialogue. If the value doesn't matter, list only the slot's name:

stories:
- story: story with a slot
  steps:
  - intent: greet
  - slot_was_set:
    - name
  - action: utter_greet_user_by_name

How slots work

Stories do not set slots. The slot must be set by an entity or custom action before the slot_was_set step.
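The slots referenced in the stories above have to be declared in the domain. The fragment below is a sketch under assumptions: it guesses that feedback_value is a categorical slot with the values positive and negative (so its value can matter to the dialogue) and that name is a text slot (where only whether it is set matters).

```yaml
# domain.yml (fragment; slot types and values are assumptions)
slots:
  feedback_value:
    type: categorical
    values:
    - positive
    - negative
  name:
    type: text
```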

Checkpoints

Checkpoints are specified with the checkpoint: key, either at the beginning or the end of a story.

Checkpoints are ways to connect stories together. They can be either the first or the last step in a story. If they are the last step in a story, that story will be connected to each other story that starts with the checkpoint of the same name when the model is trained. Here is an example of a story that ends with a checkpoint, and one that starts with the same checkpoint:

stories:
- story: story_with_a_checkpoint_1
  steps:
  - intent: greet
  - action: utter_greet
  - checkpoint: greet_checkpoint
- story: story_with_a_checkpoint_2
  steps:
  - checkpoint: greet_checkpoint
  - intent: book_flight
  - action: action_book_flight
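Conceptually, connecting these two stories at greet_checkpoint should yield training data similar to the single story below. This is only an illustration of the effect, not a description of how the stories are joined internally; the story name is hypothetical.

```yaml
stories:
- story: checkpoint stories joined (illustrative)
  steps:
  - intent: greet
  - action: utter_greet
  # greet_checkpoint connected the two stories here
  - intent: book_flight
  - action: action_book_flight
```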

Checkpoints at the beginning of stories can also be conditional on slots being set, for example:

stories:
- story: story_with_a_conditional_checkpoint
  steps:
  - checkpoint: greet_checkpoint
    # This checkpoint should only apply if slots are set to the specified value
    slot_was_set:
    - context_scenario: holiday
    - holiday_name: thanksgiving
  - intent: greet
  - action: utter_greet_thanksgiving

Checkpoints can help simplify your training data and reduce redundancy in it, but do not overuse them. Using lots of checkpoints can quickly make your stories hard to understand. It makes sense to use them if a sequence of steps is repeated often in different stories, but stories without checkpoints are easier to read and write.

OR statement

or steps are ways to handle multiple intents the same way, without writing a separate story for each intent. For example, if you ask the user to confirm something, you might want to treat the affirm and thanks intents in the same way. Stories with or steps will be converted into multiple separate stories at training time. For example, the following story would be converted to two stories at training time:

stories:
- story: story with OR
  steps:
  - intent: signup_newsletter
  - action: utter_ask_confirm
  - or:
    - intent: affirm
    - intent: thanks
  - action: action_signup_newsletter
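As a sketch of that conversion, the story above should be roughly equivalent to writing the following two stories by hand. The story names are hypothetical and only serve to tell the two variants apart.

```yaml
stories:
- story: story with OR (affirm variant)
  steps:
  - intent: signup_newsletter
  - action: utter_ask_confirm
  - intent: affirm
  - action: action_signup_newsletter
- story: story with OR (thanks variant)
  steps:
  - intent: signup_newsletter
  - action: utter_ask_confirm
  - intent: thanks
  - action: action_signup_newsletter
```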

Just like checkpoints, OR statements can be useful, but if you are using a lot of them, it is probably better to restructure your domain and/or intents.

Don't overuse

Overusing these features (both checkpoints and OR statements) will slow down training.

Rules

Rules are listed under the rules key and look similar to stories. A rule also has a steps key, which contains a list of the same steps as stories do. Rules can additionally contain the conversation_start and condition keys. These are used to specify conditions under which the rule should apply.

A rule with a condition looks like this:

rules:
- rule: Only say `hey` when the user provided a name
  condition:
  - slot_was_set:
    - user_provided_name: true
  steps:
  - intent: greet
  - action: utter_greet

For more information about writing rules, see Rules.
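A rule restricted to the beginning of a conversation might look like the sketch below. It assumes the key is spelled conversation_start and takes a boolean value; the rule name is hypothetical.

```yaml
rules:
- rule: Say `hello` only when the conversation starts with a greeting
  # Assumed spelling of the key; the rule applies only at the start of a conversation
  conversation_start: true
  steps:
  - intent: greet
  - action: utter_greet
```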

Test Conversations

Test conversations use stories to test your bot. The tests check whether each message is classified correctly and whether the action predictions are correct.

Test stories use the same format as stories, except that user message steps can include a user key to specify the actual text and entity annotations of the user message. Here's an example of a test conversation:

stories:
- story: A basic end-to-end test
  steps:
  - user: |
      hey
    intent: greet
  - action: utter_ask_howcanhelp
  - user: |
      show me [chinese]{"entity": "cuisine"} restaurants
    intent: inform
  - action: utter_ask_location
  - user: |
      in [Paris]{"entity": "location"}
    intent: inform
  - action: utter_ask_price

Running the tests can be done using the CLI:

rasa test

If you want to know more about testing, head over to Testing Your Assistant.