June 29th, 2020
How to make a Custom Printer Component in Rasa NLU
Vincent Warmerdam
Rasa offers many useful components to build a digital assistant, but sometimes you may want to write your own. This document is part of a series where we will create increasingly complex components from scratch. In this document we will keep it relatively simple: we're going to create a Printer component that will print all the features that the pipeline has created. This will allow us to appreciate how components add features in the pipeline, and it will also serve as a nice introduction to creating custom components.
Note! This tutorial was made with Rasa version 1.10.0 in mind. It won't work in Rasa 3.0 because the computational backend has changed. To learn more, you may appreciate this blogpost.
Example Project
You can clone the repository found here if you'd like to be able to run the same code. The repository contains a relatively small Rasa project; we're only dealing with four intents and one entity. Here are some of the files in the project:
data/nlu.md
## intent:greet
- hey
- hello
...
## intent:goodbye
- bye
- goodbye
...
## intent:bot_challenge
- are you a bot?
- are you a human?
...
## intent:talk_code
- i want to talk about [python](proglang)
- Code to ask yes/no question in [javascript](proglang)
...
data/stories.md
## just code
* talk_code
  - utter_proglang

## check bot
* bot_challenge
  - utter_i_am_bot
* goodbye
  - utter_goodbye

## hello and code
* greet
  - utter_greet
* talk_code{"proglang": "python"}
  - utter_proglang
Once we call rasa train on the command line, these files will serve as training data for our machine learning pipeline. You can see the definition of this pipeline in the config.yml file.
config.yml
language: en

pipeline:
- name: WhitespaceTokenizer
- name: CountVectorsFeaturizer
- name: CountVectorsFeaturizer
  analyzer: char_wb
  min_ngram: 1
  max_ngram: 4
- name: LexicalSyntacticFeaturizer
- name: DIETClassifier
  epochs: 20

policies:
- name: MemoizationPolicy
- name: TEDPolicy
- name: MappingPolicy
The goal of this tutorial is to add our own component to this file.
Printing Context
Let's make a component that will help with debugging. The goal of the component will be to print all the information available at a certain point in the pipeline. With it, our new pipeline may look something like this:
config.yml
language: en

pipeline:
- name: WhitespaceTokenizer
- name: printer.Printer
  alias: after tokenizer
- name: CountVectorsFeaturizer
- name: printer.Printer
  alias: after 1st cv
- name: CountVectorsFeaturizer
  analyzer: char_wb
  min_ngram: 1
  max_ngram: 4
- name: printer.Printer
  alias: after 2nd cv
- name: LexicalSyntacticFeaturizer
- name: printer.Printer
  alias: after lexical syntactic featurizer
- name: DIETClassifier
  epochs: 20
- name: printer.Printer
  alias: after diet classifier

policies:
- name: MemoizationPolicy
- name: TEDPolicy
- name: MappingPolicy
Let's note a few things.
- We've added new steps with the name printer.Printer. This is a custom component that we'll create.
- We've placed the printer.Printer component after each featurization step. The goal is that this component prints what information is created in each step.
- We've also placed the printer.Printer component after the DIETClassifier step. This should allow us to directly see the model output.
- The custom component takes an argument alias that allows us to give it an extra name. This means that the component that we'll create needs to be able to read in parameters passed in config.yml; a minimal sketch of how such a parameter reaches the component follows below.
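Before we write the component, it may help to see how that parameter travels. Here's a minimal sketch in plain Python (no Rasa involved; all variable names are made up for illustration) of how a component's defaults get merged with the values from config.yml:

# Illustration only: Rasa merges a component's `defaults` dict with the
# values set in config.yml and exposes the result as `self.component_config`.
defaults = {"alias": None}                      # used when config.yml sets nothing
from_config_yml = {"alias": "after tokenizer"}  # parsed out of config.yml
component_config = {**defaults, **from_config_yml}
print(component_config["alias"])                # -> after tokenizer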
Making the printer.Printer Component
The schematic below shows the lifecycle of components in Rasa.
Our own custom component will be a Python object, and it will need to implement some of the methods that you see in the diagram. We will create a new file called printer.py in the project directory to put the new Printer component in. Note that this filename is also how config.yml is able to find the printer.Printer component: printer is the module and Printer is the class inside it. To get started writing the component, I took the example from the documentation and made some changes to it.
printer.py
import typing
from typing import Any, Optional, Text, Dict, List, Type

from rasa.nlu.components import Component
from rasa.nlu.config import RasaNLUModelConfig
from rasa.nlu.training_data import Message, TrainingData
from rasa.nlu.tokenizers.tokenizer import Token

if typing.TYPE_CHECKING:
    from rasa.nlu.model import Metadata


def _is_list_tokens(v):
    """
    This is a helper function.
    It checks if `v` is a list of tokens.
    If so, we need to print differently.
    """
    if isinstance(v, List):
        if len(v) > 0:
            if isinstance(v[0], Token):
                return True
    return False


class Printer(Component):
    @classmethod
    def required_components(cls) -> List[Type[Component]]:
        return []

    defaults = {"alias": None}
    language_list = None

    def __init__(self, component_config: Optional[Dict[Text, Any]] = None) -> None:
        super().__init__(component_config)

    def train(
        self,
        training_data: TrainingData,
        config: Optional[RasaNLUModelConfig] = None,
        **kwargs: Any,
    ) -> None:
        pass

    def process(self, message: Message, **kwargs: Any) -> None:
        if self.component_config['alias']:
            print("\n")
            print(self.component_config['alias'])
        for k, v in message.data.items():
            if _is_list_tokens(v):
                print(f"{k}: {[t.text for t in v]}")
            else:
                print(f"{k}: {v.__repr__()}")

    def persist(self, file_name: Text, model_dir: Text) -> Optional[Dict[Text, Any]]:
        pass

    @classmethod
    def load(
        cls,
        meta: Dict[Text, Any],
        model_dir: Optional[Text] = None,
        model_metadata: Optional["Metadata"] = None,
        cached_component: Optional["Component"] = None,
        **kwargs: Any,
    ) -> "Component":
        """Load this component from file."""
        if cached_component:
            return cached_component
        else:
            return cls(meta)
Most of the code in this file is exactly the same as what you will find in the documentation. Let's observe a few things here.
- We've created a Printer object that inherits from rasa.nlu.components.Component.
- This component does not depend on other components. You can confirm this by looking at the required_components method. If this component were a CountVectorizer then it would depend on tokens being present, and this method is where you would specify that (see the sketch after this list).
- Right after this method we declare defaults = {"alias": None}. This sets the default value for the alias setting that we can set in config.yml.
- Right after this statement we declare language_list = None. This means that the component does not depend on a language. It's important to note that some components only work for certain languages. For example, the ConveRTFeaturizer will only work for the English language.
- The load, persist and train methods are untouched and are also not relevant for this component. Since we're merely printing, there's no need for a training phase or a phase where we load/store everything we've trained on disk.
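For contrast with our empty required_components, here's a minimal sketch of how a component that does depend on tokens could declare that. The MyFeaturizer name is hypothetical; the Tokenizer import matches Rasa 1.10.

from typing import List, Type

from rasa.nlu.components import Component
from rasa.nlu.tokenizers.tokenizer import Tokenizer


class MyFeaturizer(Component):
    """Hypothetical component that needs tokens before it can run."""

    @classmethod
    def required_components(cls) -> List[Type[Component]]:
        # Rasa uses this to validate that some tokenizer appears
        # earlier in the pipeline than this component.
        return [Tokenizer]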
The main change that we've made is in the process method, which we'll zoom in on below.
def process(self, message: Message, **kwargs: Any) -> None:
    if self.component_config['alias']:
        print("\n")
        print(self.component_config['alias'])
    for k, v in message.data.items():
        if _is_list_tokens(v):
            print(f"{k}: {[t.text for t in v]}")
        else:
            print(f"{k}: {v.__repr__()}")
The process method of the Component object is where all the logic gets applied. In our case this is where all the printing happens. We can access all the available data by inspecting the message that the method receives. In particular, we peek inside message.data and iterate over all the items, printing each one.
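If you'd like to poke at the component without training a full pipeline, something along these lines should work in Rasa 1.10 when run from the project directory; the alias value and the empty token list are placeholders for this sketch.

from rasa.nlu.training_data import Message

from printer import Printer

# Construct the component by hand, the way the pipeline otherwise would.
printer = Printer({"alias": "manual test"})

# Build a bare Message and call process() on it directly.
msg = Message("hello there")
msg.set("tokens", [])  # a tokenizer would normally fill this in
printer.process(msg)   # prints the alias plus everything in msg.data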
See the Effect
If you now train this system you should be able to see the effect. Let's train and run it.
> rasa train
> rasa shell
When you now talk to the assistant you'll see extra printed lines appear. When you type hello there you should see a message printed by each printer.Printer component in our pipeline. We'll go over all of them.
After the Tokenizer
This is the information that we see right after tokenisation. Note that the alias setting is printed here.
after tokenizer
intent: {'name': None, 'confidence': 0.0}
entities: []
tokens: ['hello', 'there', '__CLS__']
Also note that we have three tokens. The __CLS__ token is a special token that summarises the entire sentence.
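If you're curious where these tokens come from, you can run just the tokenizer on a hand-made message; a small sketch against Rasa 1.10:

from rasa.nlu.tokenizers.whitespace_tokenizer import WhitespaceTokenizer
from rasa.nlu.training_data import Message

msg = Message("hello there")
WhitespaceTokenizer().process(msg)
# The tokenizer appends the __CLS__ summary token at the end.
print([t.text for t in msg.get("tokens")])  # ['hello', 'there', '__CLS__']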
After the first CountVectorsFeaturizer
We now see that there are some sparse text features that have been added.
after 1st cv
intent: {'name': None, 'confidence': 0.0}
entities: []
tokens: ['hello', 'there', '__CLS__']
text_sparse_features: <3x272 sparse matrix of type '<class 'numpy.int64'>'
with 4 stored elements in COOrdinate format>
Note the size of the sparse matrix: we keep track of 272 features for each of the three tokens, one of which is the __CLS__ token.
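If the repr alone is too terse, you could inspect the matrix itself. Here's a hypothetical variant of the process method that does so; the shape comment reflects the output above.

def process(self, message: Message, **kwargs: Any) -> None:
    feats = message.get("text_sparse_features")
    print(feats.shape)      # (3, 272) here: one row per token, incl. __CLS__
    print(feats.toarray())  # densify the scipy sparse matrix to see raw counts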
After the second CountVectorsFeaturizer
We now see that more sparse text features have been added. Because the settings specify character n-grams of length 1 through 4, this featurizer adds another 2309 features for each token (the matrix grows from 272 to 2581 columns); a rough illustration of these n-grams follows after the output below.
after 2nd cv
intent: {'name': None, 'confidence': 0.0}
entities: []
tokens: ['hello', 'there', '__CLS__']
text_sparse_features: <3x2581 sparse matrix of type '<class 'numpy.longlong'>'
with 80 stored elements in COOrdinate format>
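To build some intuition for what these character n-grams look like, here's a rough illustration using scikit-learn's CountVectorizer, which is what Rasa's CountVectorsFeaturizer wraps under the hood; the example sentence is just for demonstration.

from sklearn.feature_extraction.text import CountVectorizer

cv = CountVectorizer(analyzer="char_wb", ngram_range=(1, 4))
cv.fit(["hello there"])
# char_wb pads words with spaces, so you'll see n-grams like ' h', 'hel', 'llo '
print(sorted(cv.vocabulary_))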
After the LexicalSyntacticFeaturizer
The LexicalSyntacticFeaturizer adds another 24 features per token.
after lexical syntactic featurizer
intent: {'name': None, 'confidence': 0.0}
entities: []
tokens: ['hello', 'there', '__CLS__']
text_sparse_features: <3x2605 sparse matrix of type '<class 'numpy.float64'>'
with 112 stored elements in COOrdinate format>
Note that the features for the __CLS__ token at this point are the sum of the sparse features of all the other tokens. Since all the features are sparse, this is a reasonable way to summarise the features of all the words into a set of features that represents the entire utterance.
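You could check this claim numerically with another hypothetical variant of the process method:

import numpy as np

def process(self, message: Message, **kwargs: Any) -> None:
    dense = message.get("text_sparse_features").toarray()
    # The last row belongs to __CLS__; it should equal the sum of the others.
    print(np.allclose(dense[-1], dense[:-1].sum(axis=0)))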
After the DIETClassifier
All the sparse features went into the DIETClassifier and this produced some output. You can confirm that the pipeline now actually produces an intent.
after diet classifier
intent: {'name': 'greet', 'confidence': 0.9849509000778198}
entities: []
tokens: ['hello', 'there', '__CLS__']
text_sparse_features: <3x2605 sparse matrix of type '<class 'numpy.float64'>'
with 112 stored elements in COOrdinate format>
intent_ranking: [{'name': 'greet', 'confidence': 0.9849509000778198}, {'name': 'talk_code', 'confidence': 0.008203224278986454}, {'name': 'goodbye', 'confidence': 0.005775876808911562}, {'name': 'bot_challenge', 'confidence': 0.0010700082639232278}]
If you were now to utter i want to talk about python you should see similar lines being printed, but at the end you will also see that entities have been detected.
Conclusion
So what have we seen in this guide?
- We've seen how to create a custom component that can read in settings from config.yml.
- We've seen what features the component receives by looking at the output from the printer.Printer.
- We've seen that the Rasa components continuously add information to the message that is passed through the pipeline.
You may want to think twice about using this in production though. The printer.Printer is great when you're writing custom components because you can see the effect on the messages. The downside is that every time you add a printer.Printer to the pipeline you'll need to call rasa train again to see the effects. All the print statements might also flood your logs, so it's best to keep this component for local development.
Feel free to use the example project found here to start playing around with this custom component.