This is documentation for Rasa & Rasa Pro Documentation v2.x, which is no longer actively maintained.
For up-to-date documentation, see the latest version (3.x).
Training Data Format
This page describes the different types of training data that go into a Rasa assistant and how this training data is structured.
Rasa Open Source uses YAML as a unified and extendable way to manage all training data, including NLU data, stories and rules.
You can split the training data over any number of YAML files, and each file can contain any combination of NLU data, stories, and rules. The training data parser determines the training data type using top level keys.
The domain uses the same YAML format as the training data and can also be split across multiple files or combined in one file. The domain includes the definitions for responses and forms. See the documentation for the domain for information on how to format your domain file.
Looking for Rasa Open Source 1.x data formats? They are now deprecated, but you can still find the documentation for markdown NLU data and markdown stories.
Each file can contain one or more keys with corresponding training data. One file can contain multiple keys, but each key can only appear once in a single file. The available keys are:
- version
- nlu
- stories
- rules
You should specify the version key in all YAML training data files.
If you don't specify a version key in your training data file, Rasa
will assume you are using the latest training data format specification supported
by the version of Rasa Open Source you have installed.
Training data files with a Rasa Open Source version greater than the version you have
installed on your machine will be skipped.
Currently, the latest training data format specification for Rasa 2.x is 2.0.
Here's a short example which keeps all training data in a single file:
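A minimal sketch of such a single file is shown here. The intent, story, rule, and response names (greet, faq/language, utter_greet, utter_faq) are illustrative assumptions, not names that Rasa requires.

```yaml
version: "2.0"

nlu:
- intent: greet            # illustrative intent name
  examples: |
    - Hey
    - Hi
    - hey there [Sara](name)

- intent: faq/language     # retrieval intent faq with response key language
  examples: |
    - What language do you speak?
    - Do you only handle english?

stories:
- story: greet and faq
  steps:
  - intent: greet
  - action: utter_greet
  - intent: faq
  - action: utter_faq

rules:
- rule: Greet user
  steps:
  - intent: greet
  - action: utter_greet
```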
To specify your test stories, you need to put them into a separate file:
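As a sketch, a separate test story file might look like the following. The file name tests/test_stories.yml and the intent and action names are assumptions chosen for illustration.

```yaml
# tests/test_stories.yml (assumed file name)
stories:
- story: greet and ask language
  steps:
  - user: |
      hey
    intent: greet
  - action: utter_greet
  - user: |
      what language do you speak
    intent: faq/language
  - action: utter_faq
```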
Test stories use the same format as the story training data and should be placed in a separate file with the prefix test_.
As shown in the above examples, the user and examples keys are followed by the | (pipe) symbol. In YAML, | identifies multi-line strings with preserved indentation. This helps to keep special symbols like ", ' and others still available in the training examples.
NLU Training Data
NLU training data consists of example user utterances categorized by intent. Training examples can also include entities. Entities are structured pieces of information that can be extracted from a user's message. You can also add extra information such as regular expressions and lookup tables to your training data to help the model identify intents and entities correctly.
NLU training data is defined under the nlu key. Items that can be added under this key are:
- Training examples grouped by user intent, optionally with annotated entities
- Synonyms
- Regular expressions
- Lookup tables
Training examples are grouped by intent and listed under the examples key. Usually, you'll list one example per line as follows:
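A minimal sketch of the one-example-per-line layout, assuming an intent named greet:

```yaml
nlu:
- intent: greet
  examples: |
    - hey
    - hi
    - whats up
```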
However, it's also possible to use an extended format if you have a custom NLU component and need metadata for your examples:
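The extended format could be sketched as below. Each example becomes a mapping with a text key and an optional metadata key. The sentiment field is an assumed, user-defined piece of metadata, not something Rasa interprets on its own.

```yaml
nlu:
- intent: greet
  examples:
  - text: |
      hi
    metadata:
      sentiment: neutral   # arbitrary key-value data for a custom component
  - text: |
      hey there!
```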
The metadata key can contain arbitrary key-value data that is tied to an example and accessible by the components in the NLU pipeline. In the example above, the sentiment metadata could be used by a custom component in the pipeline for sentiment analysis.
You can also specify this metadata at the intent level:
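A sketch of intent-level metadata, again using an assumed sentiment field:

```yaml
nlu:
- intent: greet
  metadata:
    sentiment: neutral   # applies to every example listed below
  examples:
  - text: |
      hi
  - text: |
      hey there!
```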
In this case, the content of the metadata key is passed to every intent example.
If you want to specify retrieval intents, then your NLU examples will look as follows:
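A sketch of retrieval intent examples, assuming a retrieval intent named chitchat with two response keys:

```yaml
nlu:
- intent: chitchat/ask_name
  examples: |
    - What is your name?
    - May I know your name?
    - What do people call you?

- intent: chitchat/ask_weather
  examples: |
    - What's the weather like today?
    - Does it look sunny outside today?
    - Oh, do you mind checking the weather for me please?
```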
All retrieval intents have a suffix added to them which identifies a particular response key for your assistant. In the above example, ask_name and ask_weather are the suffixes. The suffix is separated from the retrieval intent name by a / delimiter.
Special meaning of /
As shown in the above examples, the / symbol is reserved as a delimiter to separate retrieval intents from their associated response keys. Make sure not to use it in the name of your intents.
Entities are structured pieces of information that can be extracted from a user's message.
Entities are annotated in training examples with the entity's name. In addition to the entity name, you can annotate an entity with synonyms, roles, or groups.
In training examples, entity annotation would look like this:
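A sketch of annotated examples, assuming an intent check_balance and an entity account. The short form puts the entity name in parentheses. The long form uses a JSON-style mapping and can also carry a synonym value.

```yaml
nlu:
- intent: check_balance
  examples: |
    - how much do I have on my [savings](account) account
    - how much money is in my [checking]{"entity": "account"} account
    - What's the balance on my [credit card account]{"entity":"account","value":"credit"}
```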
The full possible syntax for annotating an entity is:
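The general form is given as a comment in the sketch below, followed by one usage example. The intent book_flight, the entity city, and the role names are hypothetical.

```yaml
# General form of an annotation inside an example line:
#   [<entity-text>]{"entity": "<entity name>", "role": "<role name>", "group": "<group name>", "value": "<entity synonym>"}
nlu:
- intent: book_flight   # hypothetical intent
  examples: |
    - fly from [Berlin]{"entity": "city", "role": "departure"} to [San Francisco]{"entity": "city", "role": "destination"}
```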
The keywords role, group, and value are optional in this notation. The value field refers to synonyms. To understand what the labels role and group are for, see the section on entity roles and groups.
Synonyms normalize your training data by mapping an extracted entity to a value other than the literal text extracted. You can define synonyms using the format:
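A sketch of a synonym definition, assuming the target value credit. Any extracted entity whose text matches one of the listed examples is mapped to credit.

```yaml
nlu:
- synonym: credit
  examples: |
    - credit card account
    - credit account
```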
You can also define synonyms in-line in your training examples by specifying the value of the entity:
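A sketch of in-line synonyms, assuming the same account entity and credit value as before:

```yaml
nlu:
- intent: check_balance
  examples: |
    - how much do I have on my [credit card account]{"entity": "account", "value": "credit"}
    - how much do I owe on my [credit account]{"entity": "account", "value": "credit"}
```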
Read more about synonyms on the NLU Training Data page.
You can use regular expressions to improve intent classification and entity extraction using the RegexFeaturizer and RegexEntityExtractor components.
The format for defining a regular expression is as follows:
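A sketch of a regular expression entry. The pattern, which matches runs of 10 to 12 digits, is an assumed example.

```yaml
nlu:
- regex: account_number
  examples: |
    - \d{10,12}
```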
Here, account_number is the name of the regular expression. When used as features for the RegexFeaturizer, the name of the regular expression does not matter. When using the RegexEntityExtractor, the name of the regular expression should match the name of the entity you want to extract.
Read more about when and how to use regular expressions with each component on the NLU Training Data page.
Lookup tables are lists of words used to generate case-insensitive regular expression patterns. The format is as follows:
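A sketch of a lookup table, assuming a table named banks with a few illustrative entries:

```yaml
nlu:
- lookup: banks
  examples: |
    - JPMC
    - Comerica
    - Bank of America
```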
When you supply a lookup table in your training data, the contents of that table are combined into one large regular expression. This regex is used to check each training example to see if it contains matches for entries in the lookup table.
Lookup table regexes are processed identically to the regular expressions directly specified in the training data and can be used either with the RegexFeaturizer or with the RegexEntityExtractor. The name of the lookup table is subject to the same constraints as the name of a regex feature.
Read more about using lookup tables on the NLU Training Data page.
Conversation Training Data
Stories and rules are both representations of conversations between a user and a conversational assistant. They are used to train the dialogue management model. Stories are used to train a machine learning model to identify patterns in conversations and generalize to unseen conversation paths. Rules describe small pieces of conversations that should always follow the same path and are used to train the RulePolicy.
Stories are composed of:
- story: The story's name. The name is arbitrary and not used in training; you can use it as a human-readable reference for the story.
- metadata: Arbitrary and optional. It is not used in training; you can use it to store relevant information about the story, such as the author.
- a list of steps: The user messages and actions that make up the story.
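The three parts listed above could be sketched as follows. The metadata fields and the intent and action names are assumptions for illustration.

```yaml
stories:
- story: Greet the user
  metadata:              # optional, not used in training
    author: Somebody
    key: value
  steps:
  - intent: greet
  - action: utter_greet
```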
Each step can be one of the following:
- A user message, represented by intent and entities.
- An or statement, which includes two or more user messages under it.
- A bot action.
- A form.
- A slot was set event.
- A checkpoint, which connects the story to another story.
All user messages are specified with the intent: key and an optional entities: key.
While writing stories, you do not have to deal with the specific contents of the messages that the users send. Instead, you can take advantage of the output from the NLU pipeline, which uses a combination of an intent and entities to refer to all possible messages the users can send with the same meaning.
User messages follow the format:
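A schematic sketch of that format. intent_name, entity_name, entity_value, and action_name are placeholders, not real identifiers.

```yaml
stories:
- story: user message structure
  steps:
  - intent: intent_name            # required
    entities:                      # optional
    - entity_name: entity_value
  - action: action_name
```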
For example, to represent the sentence "I want to check my credit balance", where credit is an entity:
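One way to write this step, assuming an intent account_balance, an entity account_type, and a custom action action_credit_account_balance (all hypothetical names):

```yaml
stories:
- story: story with entities
  steps:
  - intent: account_balance
    entities:
    - account_type: credit
  - action: action_credit_account_balance
```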
It is important to include the entities here as well because the policies learn to predict the next action based on a combination of both the intent and entities (you can, however, change this behavior using the use_entities attribute).
All actions executed by the bot are specified with the action: key followed by the name of the action.
While writing stories, you will encounter two types of actions:
- Responses: start with utter_ and send a specific message to the user.
- Custom actions: start with action_, run arbitrary code and send any number of messages (or none).
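Both action types can be sketched in stories as follows. The names utter_greet and action_store_feedback are illustrative and would need to be defined in your own domain.

```yaml
stories:
- story: story with a response
  steps:
  - intent: greet
  - action: utter_greet              # response: sends a predefined message

- story: story with a custom action
  steps:
  - intent: feedback
  - action: action_store_feedback    # custom action: runs your own code
```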
A form is a specific kind of custom action that contains the logic to loop over a set of required slots and ask the user for this information. You define a form in the forms section in your domain.
Once defined, you should specify the happy path
for a form as a rule. You should include interruptions of forms or
other "unhappy paths" in stories so that the model can
generalize to unseen conversation sequences.
As a step in a story, a form takes the following format:
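A sketch of form steps in a story, assuming a form named restaurant_form defined in the domain:

```yaml
stories:
- story: story with a form
  steps:
  - intent: find_restaurant
  - action: restaurant_form         # activate the form
  - active_loop: restaurant_form    # this form is currently active
  - active_loop: null               # form complete, no form is active
  - action: utter_restaurant_found
```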
The action step activates the form and begins looping over the required slots. The active_loop: <form name> step indicates that there is a currently active form. Much like a slot_was_set step, a form step doesn't set a form to active but indicates that it should already be activated. In the same way, the active_loop: null step indicates that no form should be active before the subsequent steps are taken.
A form can be interrupted and remain active; in this case the interruption should come after the
action: <form to activate> step and be followed by the
active_loop: <active form> step.
An interruption of a form could look like this:
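A sketch of an interrupted form, assuming a form restaurant_form and a chitchat intent. The interruption sits between the activating action step and the active_loop step.

```yaml
stories:
- story: interrupted food
  steps:
  - intent: request_restaurant
  - action: restaurant_form         # activate the form
  - intent: chitchat                # interruption while the form is active
  - action: utter_chitchat
  - active_loop: restaurant_form    # the form remains active
  - active_loop: null
  - action: utter_slots_values
```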
A slot event is specified under the key slot_was_set: with the slot name and optionally the slot's value.
Slots act as the bot's memory. Slots are set by entities or by custom actions and referenced by stories in slot_was_set steps. For example:
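A sketch of a slot_was_set step with a value, assuming a slot named feedback_value:

```yaml
stories:
- story: story with a slot
  steps:
  - intent: celebrate_bot
  - slot_was_set:
    - feedback_value: positive   # the slot must already hold this value
  - action: utter_yay
```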
This means the story requires the slot's current value to be positive for the conversation to continue as specified.
Whether or not you need to include the slot's value depends on the slot type and whether the value can or should influence the dialogue. If the value doesn't matter, list only the slot's name:
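A sketch with only the slot name listed, assuming a slot called name:

```yaml
stories:
- story: story with a slot
  steps:
  - intent: greet
  - slot_was_set:
    - name                       # any value will do; only the slot name is listed
  - action: utter_greet_user_by_name
```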
How slots work
Stories do not set slots. The slot must be set by an entity or custom action before the slot_was_set step.
Checkpoints are specified with the checkpoint: key, either at the beginning or the end of a story.
Checkpoints are ways to connect stories together. They can be either the first or the last step in a story. If they are the last step in a story, that story will be connected to each other story that starts with the checkpoint of the same name when the model is trained. Here is an example of a story that ends with a checkpoint, and one that starts with the same checkpoint:
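A sketch of two stories joined by a checkpoint. The checkpoint name greet_checkpoint and the intent and action names are illustrative.

```yaml
stories:
- story: story_with_a_checkpoint_1
  steps:
  - intent: greet
  - action: utter_greet
  - checkpoint: greet_checkpoint   # ends here

- story: story_with_a_checkpoint_2
  steps:
  - checkpoint: greet_checkpoint   # continues from the story above
  - intent: book_flight
  - action: action_book_flight
```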
Checkpoints at the beginning of stories can also be conditional on slots being set, for example:
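A sketch of a conditional checkpoint, assuming slots named context_scenario and holiday_name:

```yaml
stories:
- story: story_with_a_conditional_checkpoint
  steps:
  - checkpoint: greet_checkpoint
    # The checkpoint only applies if the slots hold the specified values
    slot_was_set:
    - context_scenario: holiday
    - holiday_name: thanksgiving
  - intent: greet
  - action: utter_greet_thanksgiving
```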
Checkpoints can help simplify your training data and reduce redundancy in it, but do not overuse them. Using lots of checkpoints can quickly make your stories hard to understand. It makes sense to use them if a sequence of steps is repeated often in different stories, but stories without checkpoints are easier to read and write.
An or step is a way to handle multiple intents the same way, without writing a separate story for each intent. For example, if you ask the user to confirm something, you might want to treat the affirm and thankyou intents in the same way. Stories with or steps will be converted into multiple separate stories at training time. For example, the following story would be converted to two stories at training time:
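A sketch of a story with an or step, assuming intents named affirm and thankyou:

```yaml
stories:
- story: story with OR
  steps:
  - intent: signup_newsletter
  - action: utter_ask_confirm
  - or:
    - intent: affirm
    - intent: thankyou
  - action: action_signup_newsletter
```

At training time this is treated like two stories that are identical except for the third step, one with affirm and one with thankyou.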
Just like checkpoints, OR statements can be useful, but if you are using a lot of them, it is probably better to restructure your domain and/or intents.
Overusing these features (both checkpoints and OR statements) will slow down training.
Rules are listed under the rules key and look similar to stories. A rule also has a steps key, which contains a list of the same steps as stories do. Rules can additionally contain the conversation_start and condition keys. These are used to specify conditions under which the rule should apply.
A rule with a condition looks like this:
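A sketch of a rule with a condition, assuming a slot named user_provided_name:

```yaml
rules:
- rule: Only say `hey` when the user provided a name
  condition:
  - slot_was_set:
    - user_provided_name: true
  steps:
  - intent: greet
  - action: utter_greet
```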
For more information about writing rules, see Rules.
Test stories check both whether a message is classified correctly and whether the action predictions are correct.
Test stories use the same format as stories, except that user message steps can include a user key to specify the actual text and entity annotations of the user message. Here's an example of a test story:
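A sketch of a test story with an annotated user message. The intents, the cuisine entity, and the response names are illustrative.

```yaml
stories:
- story: A basic end-to-end test
  steps:
  - user: |
      hey
    intent: greet
  - action: utter_ask_howcanhelp
  - user: |
      show me [chinese]{"entity": "cuisine"} restaurants
    intent: inform
  - action: utter_ask_location
```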
You can run the tests using the rasa test command.
If you want to know more about testing, head over to Testing Your Assistant.
New in 2.2
End-to-end training is an experimental feature. We introduce experimental features to get feedback from our community, so we encourage you to try it out! However, the functionality might be changed or removed in the future. If you have feedback (positive or negative) please share it with us on the Rasa Forum.
With end-to-end training, you do not have to deal with the specific intents of the messages that are extracted by the NLU pipeline. Instead, you can put the text of the user message directly in the stories by using the user key.
These end-to-end user messages follow the format:
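A schematic sketch of that format, where action_name is a placeholder:

```yaml
stories:
- story: user message structure
  steps:
  - user: the actual text of the user message
  - action: action_name
```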
In addition, you can add entity tags that can be extracted
by the TED Policy.
The syntax for entity tags is the same as in
the NLU training data.
For example, the following story contains the user utterance "I can always go for sushi". By using the syntax from the NLU training data, [sushi](cuisine), you can mark sushi as an entity of type cuisine.
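A sketch of such a story, assuming a response named utter_suggest_cuisine:

```yaml
stories:
- story: story with entities
  steps:
  - user: I can always go for [sushi](cuisine)
  - action: utter_suggest_cuisine
```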
Similarly, you can put bot utterances directly in the stories by using the bot key followed by the text that you want your bot to say.
A story with only a bot utterance might look like this:
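A sketch in which the bot side is written as literal text and the user side still uses an intent. The intent greet and the entity name are illustrative.

```yaml
stories:
- story: story with an end-to-end response
  steps:
  - intent: greet
    entities:
    - name: Ivan
  - bot: Hello, a person with a name!
```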
You can also have a mixed end-to-end story:
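A sketch of a mixed story that combines intents, actions, and literal user and bot text. All names and utterances are illustrative.

```yaml
stories:
- story: full end-to-end story
  steps:
  - intent: greet
    entities:
    - name: Ivan
  - bot: Hello, a person with a name!
  - intent: search_restaurant
  - action: utter_suggest_cuisine
  - user: I can always go for [sushi](cuisine)
  - bot: Personally, I prefer pizza, but sure let's search sushi restaurants
  - action: utter_suggest_cuisine
  - user: Have a beautiful day!
  - action: utter_goodbye
```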
Rasa end-to-end training is fully integrated with the standard Rasa approach. This means that you can have mixed stories with some steps defined by actions or intents and other steps defined directly by user messages or bot responses.