August 15th, 2019
Customizing Training Data Importing
- Rasa 1.2 makes training data importing customizable
- You can now implement your own importer by extending the TrainingDataImporter interface
- This blog post gives you a quick example of how to do so
As an open-source framework for contextual AI assistants, Rasa has a diverse community all over the world. Individual developers, small teams, and large enterprises are deploying Rasa in a variety of different settings and infrastructures. Delivering an out-of-the-box solution for every use case is tough. At Rasa we want to leverage the expertise of developers, giving them the opportunity to custom-tailor Rasa for their use cases. Hence, we are building Rasa as a modular framework with customizable plug-and-play components. For example, you can add support for new platforms by adding a custom connector, or build your own NLU pipeline component.
The recently released Rasa 1.2 makes another step towards a fully customizable framework whose architecture can adapt to your individual requirements: you can now customize the way Rasa imports training data to train the model.
Why is this useful?
- Support data from different sources: FTP, your company's file server, end-to-end encrypted data fetching - you name it.
- Support different data formats: you write your stories in Excel or model them as a graph? No problem - write a parser and hit the train button.
- Support different strategies: change the way Rasa collects training data - e.g. generate data on the fly.
How to Use It
Rasa comes with two different importers out of the box:
- RasaFileImporter: the default importer in Rasa. It will use the arguments supplied through the command line interface to load data from the files at the given paths.
- MultiProjectImporter: an experimental importer which allows you to split up your assistant into multiple sub-projects. Work on each project independently, then use the MultiProjectImporter to import the training data of all selected projects at training time.
If you want to use the default importer, you don't have to change anything. But what if you want to configure another one?
It's as easy as adding an importers section to your Rasa configuration file and listing the name (in case the importer is already included in Rasa) or the full module path of each importer:

... # More Rasa configuration
importers:
- name: RasaFileImporter
- name: path.to.your.CustomImporter
Why a list of importers? Rasa automatically combines the data of the given importers at training time. In the given example, Rasa would combine the training data of both the RasaFileImporter and the CustomImporter.
How to Implement Your Own Importer
Every importer has to satisfy the interface of the TrainingDataImporter class. This class provides the Rasa configuration, domain, NLU data, and stories. To demonstrate how you can implement your own importer, this blog post will guide you through the implementation of a data importer which loads data from a GitHub repository. For the sake of simplicity it's a very naive implementation which assumes that the repository follows the default Rasa project layout. Let's start:
What you need:
- Rasa >= 1.2.0 installed
- Git on your machine
- The PyGithub package installed
You can find the complete source code here.
The TrainingDataImporter Interface
Start by building a class which extends the TrainingDataImporter interface. You have to implement four methods:
get_stories: returns the training data to train the dialogue model
get_nlu_data: provides the training data to train the NLU model
get_domain: provides the domain of your assistant
get_config: provides configuration for the dialogue and NLU model training
Each function is async, which means you can use Python's asyncio module to work with modern IO frameworks and speed up the data loading.
Implementing the Interface
Getting Files From GitHub
Start by connecting to the GitHub repository. An easy way to do so is to use the PyGithub library:
When the importer is loaded, Rasa passes in the file paths which the user provided through the command line interface (training_data_paths). In this case they are not required, so you can safely ignore them. The parameter you need is repository, which you can later specify in the importers section of the configuration file.
After connecting to the repository, the importer searches for the stories and NLU files in the data folder of the repository, and stores any found files in a temporary directory on your machine. Storing the files on disk makes parsing a bit easier because you can reuse functions which are already part of Rasa, and it also gives you better performance when you need the actual content of the files later. Finally, use a Rasa function called get_core_nlu_files to separate stories from NLU files:
get_stories returns a StoryGraph which contains the parsed training stories. It also receives a couple of parameters which you can pass on to the Rasa StoryFileReader. If you are implementing your own parsing algorithm, you can decide to ignore these parameters.

Since you already collected the story files in a list, you can simply point the StoryFileReader to it. The StoryFileReader will read the files and parse their content into StorySteps. The reader does not return a plain list of StorySteps, but a coroutine. To get the result of this coroutine, you have to apply await to it. If you are using a custom parser which is not asynchronous, you don't need to apply await. Finally, wrap the list of stories in a StoryGraph and return it.
get_nlu_data returns the NLU training data as a TrainingData object. The method receives a language parameter which you can use to distinguish between training data for multiple languages. There is a handy function in rasa.importers.utils which makes reading the NLU data very short:
Now you have to get the Domain of the assistant. For simplicity, it is assumed that the domain is stored as YAML in a file called domain.yml in the repository. With this assumption, you can pull the file from GitHub, parse its content as text, and load the domain from it.
The last missing piece is the model configuration, which is a simple Python dictionary. It defines your dialogue policies and NLU components. Similar to the domain file, the assumption is that the configuration is stored in a file called config.yml:
Training with the GitImporter
Now it's time to take the importer for a test drive. Put the source code in a file called git_importer.py in your Rasa project directory. Then add these lines to your configuration file:
importers:
- name: "git_importer.GitImporter"
  repository: "rasahq/rasa-demo"
This will get the training data from the Rasa demo repository, but you can also put in any other public repository that follows the default Rasa project layout.
Finally, you can simply execute rasa train to train a bot with the training data from the GitHub repository (since rasa-demo is a complex bot, training might take a while 😀).
Where to Go From Here
Rasa is a framework for makers. Being developers ourselves, we acknowledge the need to customize and extend software for your individual use case. You know what's required for your use case; it's our job to make customizing Rasa as quick and easy as possible. As of now, you can already plug the following components into Rasa:
- NLU components
- Dialogue policies
- Input / output channels
- Custom actions
- Tracker stores
- Event brokers
- Training Data Importers 🎉
Is this list missing anything you need to use Rasa in your environment? Then please create a feature request on GitHub and we will discuss the details with you in the issue.
This tutorial gave you an overview of how to use the new TrainingDataImporter interface to implement your own data importer. We are stoked to see where you take this. Please share your experiences, implementations, and ideas on the Rasa forum so that we can take Rasa to the next level - together.