NLU Components
NLU-based assistants
This section refers to building NLU-based assistants. If you are working with Conversational AI with Language Models (CALM), this content may not apply to you.
Tokenizers
Tokenizers split text into tokens. If you want to split intents into multiple labels, e.g. for predicting multiple intents or for modeling hierarchical intent structure, use the following flags with any tokenizer:
- intent_tokenization_flag indicates whether to tokenize intent labels or not. Set it to True so that intent labels are tokenized.
- intent_split_symbol sets the delimiter string used to split the intent labels. The default is an underscore (_).
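As a rough illustration of what the two flags control, the following Python sketch splits an intent label on the configured symbol. The function name split_intent_label and the example labels are hypothetical and are not part of Rasa's API.

```python
def split_intent_label(label, intent_tokenization_flag=False, intent_split_symbol="_"):
    """Return the tokens of an intent label (illustrative helper, not Rasa code)."""
    if not intent_tokenization_flag:
        # Without the flag, the label is kept as a single token.
        return [label]
    return label.split(intent_split_symbol)


print(split_intent_label("feedback+positive", True, "+"))
print(split_intent_label("ask_weather"))
```

With the flag enabled and "+" as the split symbol, a label such as feedback+positive yields two tokens, which is what multi-intent or hierarchical setups rely on.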
WhitespaceTokenizer
-
Short
Tokenizer using whitespaces as a separator
-
Outputs
tokens for user messages, responses (if present), and intents (if specified)
-
Requires
Nothing
-
Description
Creates a token for every whitespace-separated character sequence.
Any character not in a-zA-Z0-9_#@& will be substituted with whitespace before splitting on whitespace if the character fulfills any of the following conditions:
- the character follows a whitespace: " !word" → "word"
- the character precedes a whitespace: "word! " → "word"
- the character is at the beginning of the string: "!word" → "word"
- the character is at the end of the string: "word!" → "word"
Note that: "wo!rd" → "wo!rd"
In addition, any character not in a-zA-Z0-9_#@&.~:\/?[]()!$*+,;=- will be substituted with whitespace before splitting on whitespace if the character is not between numbers:
- "twenty{one" → "twenty", "one" ("{" is not between numbers)
- "20{1" → "20{1" ("{" is between numbers)
Note that:
- "name@example.com" → "name@example.com"
- "10,000.1" → "10,000.1"
- "1 - 2" → "1", "2"
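The substitution rules above can be approximated with two regular expressions. This is a minimal sketch using only Python's re module, not the component's actual implementation. The function name whitespace_tokenize is hypothetical, and \w is used as a stand-in for a-zA-Z0-9_ (it also matches non-ASCII letters).

```python
import re

# Characters outside the first set (a-zA-Z0-9_#@&), excluding whitespace.
EDGE = r"[^\w#@&\s]+"
# A character outside the second, larger set, excluding whitespace.
INNER = r"[^\w.~:/?#\[\]()@!$&*+,;=\s-]"


def whitespace_tokenize(text):
    """Illustrative approximation of the rules described above."""
    # Rule 1: replace characters that touch whitespace or a string boundary.
    text = re.sub(rf"(?<!\S){EDGE}|{EDGE}(?!\S)", " ", text)
    # Rule 2: replace remaining characters unless they sit between two digits.
    text = re.sub(rf"(?<![0-9]){INNER}|{INNER}(?![0-9])", " ", text)
    return text.split()


for example in ["word!", "wo!rd", "twenty{one", "20{1", "1 - 2"]:
    print(repr(example), whitespace_tokenize(example))
```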
-
Configuration
config.yml
pipeline:
- name: "WhitespaceTokenizer"
  # Flag to check whether to split intents
  "intent_tokenization_flag": False
  # Symbol on which intent should be split
  "intent_split_symbol": "_"
  # Regular expression to detect tokens
  "token_pattern": None
Featurizers
Text featurizers fall into two categories: sparse featurizers and dense featurizers. Sparse featurizers return feature vectors in which most entries are missing values, e.g. zeros. Because such vectors would normally take up a lot of memory, we store them as sparse features, which keep only the non-zero values and their positions in the vector. This saves a lot of memory and makes it possible to train on larger datasets.
All featurizers can return two different kinds of features: sequence features and sentence features.
The sequence features are a matrix of size (number-of-tokens x feature-dimension).
The matrix contains a feature vector for every token in the sequence.
This allows us to train sequence models.
The sentence features are represented by a matrix of size (1 x feature-dimension).
It contains the feature vector for the complete utterance.
The sentence features can be used in any bag-of-words model.
The corresponding classifier can therefore decide what kind of features to use.
Note: The feature-dimension for sequence and sentence features does not have to be the same.
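To make the shapes and the sparse storage concrete, here is a small sketch in plain Python. The vocabulary, the one-hot encoding, and the choice to sum token vectors into the sentence vector are illustrative assumptions. Real featurizers compute these matrices differently.

```python
def to_sparse(dense_vector):
    """Keep only the non-zero values, keyed by their position in the vector."""
    return {i: v for i, v in enumerate(dense_vector) if v != 0}


vocabulary = ["book", "a", "flight", "hotel", "to", "berlin"]
tokens = ["book", "a", "flight"]

# Sequence features: one vector per token, (number-of-tokens x feature-dimension).
sequence_features = [
    [1 if word == token else 0 for word in vocabulary] for token in tokens
]

# Sentence features: one vector for the whole utterance, (1 x feature-dimension).
sentence_features = [[sum(column) for column in zip(*sequence_features)]]

print(len(sequence_features), "x", len(sequence_features[0]))
print(len(sentence_features), "x", len(sentence_features[0]))
print(to_sparse(sentence_features[0]))
```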
LanguageModelFeaturizer
-
Short
Creates a vector representation of user message and response (if specified) using a pre-trained language model.
-
Outputs
dense_features for user messages and responses
-
Type
Dense featurizer
-
Description
Creates features for entity extraction, intent classification, and response selection. Uses a pre-trained language model to compute vector representations of input text.
Note: Please make sure that you use a language model which is pre-trained on the same language corpus as that of your training data.
-
Configuration
Include a Tokenizer component before this component.
You should specify what language model to load via the parameter model_name. See the table below for the currently supported language models. The weights to be loaded can be specified by the additional parameter model_weights. If left empty, it uses the default model weights listed in the table.
+----------------+--------------+-------------------------+
| Language Model | Parameter | Default value for |
| | "model_name" | "model_weights" |
+----------------+--------------+-------------------------+
| BERT | bert | rasa/LaBSE |
+----------------+--------------+-------------------------+
| GPT | gpt | openai-gpt |
+----------------+--------------+-------------------------+
| GPT-2 | gpt2 | gpt2 |
+----------------+--------------+-------------------------+
| XLNet | xlnet | xlnet-base-cased |
+----------------+--------------+-------------------------+
| DistilBERT | distilbert | distilbert-base-uncased |
+----------------+--------------+-------------------------+
| RoBERTa | roberta | roberta-base |
+----------------+--------------+-------------------------+
| camemBERT | camembert | camembert-base |
+----------------+--------------+-------------------------+
Apart from the default pretrained model weights, further models can be used from HuggingFace models provided the following conditions are met (the mentioned files can be found in the "Files and versions" section of the model website):
- The model architecture is one of the supported language models (check that the model_type in config.json is listed in the table's column model_name).
- The model has pretrained TensorFlow weights (check that the file tf_model.h5 exists; at this time Safetensors are not supported).
- The model uses the default tokenizer (config.json should not contain a custom tokenizer_class setting).
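The three conditions can be checked mechanically once you have a model's config.json contents and its file listing. The helper below is a hypothetical sketch, and unmet_conditions is not a Rasa or HuggingFace function. It assumes the config is already parsed into a dict and the file names are given as a list.

```python
# Values from the "model_name" column of the table above.
SUPPORTED_MODEL_NAMES = {
    "bert", "gpt", "gpt2", "xlnet", "distilbert", "roberta", "camembert",
}


def unmet_conditions(config, file_names):
    """Return a list of reasons why the weights would not be usable."""
    problems = []
    if config.get("model_type") not in SUPPORTED_MODEL_NAMES:
        problems.append("unsupported model_type")
    if "tf_model.h5" not in file_names:
        problems.append("no TensorFlow weights (tf_model.h5)")
    if "tokenizer_class" in config:
        problems.append("custom tokenizer_class")
    return problems


print(unmet_conditions({"model_type": "bert"}, ["config.json", "tf_model.h5"]))
print(unmet_conditions({"model_type": "t5"}, ["config.json", "model.safetensors"]))
```

An empty list means all three conditions from the list above are met.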
Note: While the LaBSE weights are loaded by default for the bert architecture, offering a multi-lingual model trained on 112 languages (see our tutorial and the original paper), we now recommend using the MiniLM model for better performance.
The LaBSE weights can still serve as a baseline for initial testing and development. After establishing this baseline, we strongly encourage exploring optimization with MiniLM to improve your assistant's effectiveness, before trying to optimize this component with other weights/architectures.
The following configuration loads the language model BERT with rasa/LaBSE weights:
config.yml
pipeline:
- name: LanguageModelFeaturizer
  # Name of the language model to use
  model_name: "bert"
  # Pre-Trained weights to be loaded
  model_weights: "rasa/LaBSE"
  # An optional path to a directory from which
  # to load pre-trained model weights.
  # If the requested model is not found in the
  # directory, it will be downloaded and
  # cached in this directory for future use.
  # The default value of `cache_dir` can be
  # set using the environment variable
  # `TRANSFORMERS_CACHE`, as per the
  # Transformers library.
  cache_dir: null
For enhanced performance, we recommend the sentence-transformers/all-MiniLM-L6-v2 weights:
config.yml
pipeline:
- name: LanguageModelFeaturizer
  model_name: "bert"
  model_weights: "sentence-transformers/all-MiniLM-L6-v2"
  cache_dir: null
RegexFeaturizer
-
Short
Creates a vector representation of user message using regular expressions.
-
Outputs
sparse_features for user messages and tokens.pattern
-
Requires
tokens
-
Type
Sparse featurizer
-
Description
Creates features for entity extraction and intent classification. During training the RegexFeaturizer creates a list of regular expressions defined in the training data format. For each regex, a feature will be set marking whether this expression was found in the user message or not. All features will later be fed into an intent classifier / entity extractor to simplify classification (assuming the classifier has learned during the training phase that this set feature indicates a certain intent / entity). Regex features for entity extraction are currently only supported by the CRFEntityExtractor.
-
Configuration
Make the featurizer case insensitive by adding the case_sensitive: False option, the default being case_sensitive: True.
To correctly process languages such as Chinese that don't use whitespace for word separation, the user needs to add the use_word_boundaries: False option, the default being use_word_boundaries: True.
config.yml
pipeline:
- name: "RegexFeaturizer"
  # Text will be processed with case sensitive as default
  "case_sensitive": True
  # use match word boundaries for lookup table
  "use_word_boundaries": True
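The effect of the two options can be sketched with Python's re module. The helpers lookup_to_regex and regex_features are hypothetical names, and the way the lookup table is turned into a pattern is an assumption made for illustration, not the featurizer's actual code.

```python
import re


def lookup_to_regex(entries, use_word_boundaries=True):
    """Turn lookup table entries into one alternation pattern."""
    escaped = [re.escape(entry) for entry in entries]
    if use_word_boundaries:
        return r"(\b" + r"\b|\b".join(escaped) + r"\b)"
    return "(" + "|".join(escaped) + ")"


def regex_features(text, patterns, case_sensitive=True):
    """One binary feature per pattern: 1 if it is found in the text, else 0."""
    flags = 0 if case_sensitive else re.IGNORECASE
    return [1 if re.search(pattern, text, flags) else 0 for pattern in patterns]


patterns = [r"\b[0-9]{5}\b", lookup_to_regex(["pizza", "sushi"])]
print(regex_features("Deliver PIZZA to 10115", patterns))
print(regex_features("Deliver PIZZA to 10115", patterns, case_sensitive=False))

# Chinese text has no spaces, so \b never falls between the characters.
print(regex_features("我想去北京玩", [lookup_to_regex(["北京"])]))
print(regex_features("我想去北京玩", [lookup_to_regex(["北京"], use_word_boundaries=False)]))
```

The last two calls show why use_word_boundaries: False is needed for languages without whitespace between words: the boundary-wrapped pattern does not match inside a continuous run of characters.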
CountVectorsFeaturizer
-
Short
Creates bag-of-words representation of user messages, intents, and responses.
-
Outputs
sparse_features for user messages, intents, and responses
-
Requires
tokens
-
Type
Sparse featurizer
-
Description
Creates features for intent classification and response selection. Creates bag-of-words representation of user message, intent, and response using sklearn's CountVectorizer. All tokens which consist only of digits (e.g. 123 and 99 but not a123d) will be assigned to the same feature.
-
Configuration
See sklearn's CountVectorizer docs for detailed description of the configuration parameters.
This featurizer can be configured to use word or character n-grams, using the analyzer configuration parameter. By default analyzer is set to word so word token counts are used as features. If you want to use character n-grams, set analyzer to char or char_wb. The lower and upper boundaries of the n-grams can be configured via the parameters min_ngram and max_ngram. By default both of them are set to 1. By default the featurizer takes the lemma of a word instead of the word directly if it is available. You can disable this behavior by setting use_lemma to False.
Note: Option char_wb creates character n-grams only from text inside word boundaries; n-grams at the edges of words are padded with space. This option can be used to create Subword Semantic Hashing.
Note: For character n-grams do not forget to increase min_ngram and max_ngram parameters. Otherwise the vocabulary will contain only single letters.
Handling Out-Of-Vocabulary (OOV) words:
Note: Enabled only if analyzer is word.
Since the training is performed on limited vocabulary data, it cannot be guaranteed that during prediction an algorithm will not encounter an unknown word (a word that was not seen during training). In order to teach an algorithm how to treat unknown words, some words in the training data can be substituted by the generic word OOV_token. In this case, during prediction all unknown words will be treated as this generic word OOV_token.
For example, one might create a separate intent outofscope in the training data containing messages with different numbers of OOV_tokens and maybe some additional general words. Then an algorithm will likely classify a message with unknown words as this intent outofscope.
You can either set the OOV_token or a list of words OOV_words:
- OOV_token sets a keyword for unseen words; if the training data contains OOV_token as words in some messages, during prediction the words that were not seen during training will be substituted with the provided OOV_token; if OOV_token=None (default behavior), words that were not seen during training will be ignored during prediction time;
- OOV_words sets a list of words to be treated as OOV_token during training; if a list of words that should be treated as Out-Of-Vocabulary is known, it can be set to OOV_words instead of manually changing it in the training data or using a custom preprocessor.
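The substitution behavior described by the two options can be sketched as follows. This is a simplified model in plain Python. The function bag_of_words and the example vocabulary are hypothetical, and the real featurizer delegates counting to sklearn's CountVectorizer.

```python
from collections import Counter


def bag_of_words(tokens, vocabulary, oov_token=None):
    """Count known tokens; map unseen ones to oov_token if it is in the vocabulary."""
    counts = Counter()
    for token in tokens:
        if token in vocabulary:
            counts[token] += 1
        elif oov_token is not None and oov_token in vocabulary:
            counts[oov_token] += 1
        # Otherwise the unseen word is ignored.
    return dict(counts)


# Vocabulary learned from training data that already contained "_oov_".
vocabulary = {"book", "a", "flight", "_oov_"}
message = ["book", "a", "spaceship", "flight"]

print(bag_of_words(message, vocabulary, oov_token="_oov_"))
print(bag_of_words(message, vocabulary))
```

In the first call the unseen word "spaceship" is counted under "_oov_". In the second call, with no OOV token configured, it is dropped.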
Note: This featurizer creates a bag-of-words representation by counting words, so the number of OOV_token in the sentence might be important.
Note: Providing OOV_words is optional; training data can contain OOV_token input manually or by a custom additional preprocessor. Unseen words will be substituted with OOV_token only if this token is present in the training data or the OOV_words list is provided.
If you want to share the vocabulary between user messages and intents, you need to set the option use_shared_vocab to True. In that case a common vocabulary set between tokens in intents and user messages is built.
config.yml
pipeline:
- name: "CountVectorsFeaturizer"
  # Analyzer to use, either 'word', 'char', or 'char_wb'
  "analyzer": "word"
  # Set the lower and upper boundaries for the n-grams
  "min_ngram": 1
  "max_ngram": 1
  # Set the out-of-vocabulary token
  "OOV_token": "_oov_"
  # Whether to use a shared vocab
  "use_shared_vocab": False
Configuring for incremental training
To ensure that sparse_features are of fixed size during incremental training, the component should be configured to account for additional vocabulary tokens that may be added as part of new training examples in the future. To do so, configure the additional_vocabulary_size parameter while training the base model from scratch:
config.yml
pipeline:
- name: CountVectorsFeaturizer
  additional_vocabulary_size:
    text: 1000
    response: 1000
    action_text: 1000
As in the above example, you can define additional vocabulary size for each of text (user messages), response (bot responses used by ResponseSelector) and action_text (bot responses not used by ResponseSelector). If you are building a shared vocabulary (use_shared_vocab=True), you only need to define a value for the text attribute. If any of the attributes is not configured by the user, the component takes half of the current vocabulary size as the default value for the attribute's additional_vocabulary_size. This number is kept at a minimum of 1000 in order to avoid running out of additional vocabulary slots too frequently during incremental training. Once the component runs out of additional vocabulary slots, the new vocabulary tokens are dropped and not considered during featurization. At this point, it is advisable to retrain a new model from scratch.
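The default sizing rule and the behavior when slots run out can be written down in a few lines. Both function names are hypothetical, and rounding "half of the current vocabulary size" down to an integer is an assumption.

```python
def default_additional_vocabulary_size(current_vocabulary_size, minimum=1000):
    """Half of the current vocabulary size, but never fewer than `minimum` slots."""
    return max(minimum, current_vocabulary_size // 2)


def assign_new_tokens(new_tokens, free_slots):
    """Return (tokens that receive a slot, tokens that are dropped)."""
    return new_tokens[:free_slots], new_tokens[free_slots:]


print(default_additional_vocabulary_size(500))
print(default_additional_vocabulary_size(6000))
print(assign_new_tokens(["refund", "voucher", "invoice"], free_slots=2))
```

Tokens in the second list of the result are the ones that would no longer be considered during featurization, which is the point at which retraining from scratch is advisable.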
The above configuration parameters are the ones you should configure to fit your model to your data. However, additional parameters exist that can be adapted.
More configurable parameters
+---------------------------+-------------------------+--------------------------------------------------------------+
| Parameter | Default Value | Description |
+===========================+=========================+==============================================================+
| use_shared_vocab | False | If set to 'True' a common vocabulary is used for labels |
| | | and user message. |
+---------------------------+-------------------------+--------------------------------------------------------------+
| analyzer | word | Whether the features should be made of word n-gram or |
| | | character n-grams. Option 'char_wb' creates character |
| | | n-grams only from text inside word boundaries; |
| | | n-grams at the edges of words are padded with space. |
| | | Valid values: 'word', 'char', 'char_wb'. |
+---------------------------+-------------------------+--------------------------------------------------------------+
| strip_accents | None | Remove accents during the pre-processing step. |
| | | Valid values: 'ascii', 'unicode', 'None'. |
+---------------------------+-------------------------+--------------------------------------------------------------+
| stop_words | None | A list of stop words to use. |
| | | Valid values: 'english' (uses an internal list of |
| | | English stop words), a list of custom stop words, or |
| | | 'None'. |
+---------------------------+-------------------------+--------------------------------------------------------------+
| min_df | 1 | When building the vocabulary ignore terms that have a |
| | | document frequency strictly lower than the given threshold. |
+---------------------------+-------------------------+--------------------------------------------------------------+
| max_df | 1 | When building the vocabulary ignore terms that have a |
| | | document frequency strictly higher than the given threshold |
| | | (corpus-specific stop words). |
+---------------------------+-------------------------+--------------------------------------------------------------+
| min_ngram | 1 | The lower boundary of the range of n-values for different |
| | | word n-grams or char n-grams to be extracted. |
+---------------------------+-------------------------+--------------------------------------------------------------+
| max_ngram | 1 | The upper boundary of the range of n-values for different |
| | | word n-grams or char n-grams to be extracted. |
+---------------------------+-------------------------+--------------------------------------------------------------+
| max_features | None | If not 'None', build a vocabulary that only consider the top |
| | | max_features ordered by term frequency across the corpus. |
+---------------------------+-------------------------+--------------------------------------------------------------+
| lowercase | True | Convert all characters to lowercase before tokenizing. |
+---------------------------+-------------------------+--------------------------------------------------------------+
| OOV_token | None | Keyword for unseen words. |
+---------------------------+-------------------------+--------------------------------------------------------------+
| OOV_words | [] | List of words to be treated as 'OOV_token' during training. |
+---------------------------+-------------------------+--------------------------------------------------------------+
| alias | CountVectorFeaturizer | Alias name of featurizer. |
+---------------------------+-------------------------+--------------------------------------------------------------+
| use_lemma | True | Use the lemma of words for featurization. |
+---------------------------+-------------------------+--------------------------------------------------------------+
| additional_vocabulary_size| text: 1000 | Size of additional vocabulary to account for incremental |
| | response: 1000 | training while training a model from scratch |
| | action_text: 1000 | |
+---------------------------+-------------------------+--------------------------------------------------------------+
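To see how analyzer, min_ngram, and max_ngram interact for the char_wb option, the sketch below generates padded character n-grams word by word. It is a simplified illustration of the idea, not sklearn's implementation, and char_wb_ngrams is a hypothetical name.

```python
def char_wb_ngrams(text, min_ngram, max_ngram):
    """Character n-grams built inside word boundaries, with space padding."""
    ngrams = []
    for word in text.split():
        padded = f" {word} "
        for n in range(min_ngram, max_ngram + 1):
            if n >= len(padded):
                # The word is too short for this n: keep the padded word once.
                ngrams.append(padded)
                break
            ngrams.extend(padded[i:i + n] for i in range(len(padded) - n + 1))
    return ngrams


print(char_wb_ngrams("hi you", 3, 3))
print(char_wb_ngrams("hi", 1, 1))
```

The second call shows why the defaults of 1 are a poor fit for character analyzers: the vocabulary would consist of single characters only.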
LexicalSyntacticFeaturizer
-
Short
Creates lexical and syntactic features for a user message to support entity extraction.
-
Outputs
sparse_features for user messages
-
Requires
tokens
-
Type
Sparse featurizer
-
Description
Creates features for entity extraction. Moves with a sliding window over every token in the user message and creates features according to the configuration (see below). As a default configuration is present, you don't need to specify a configuration.
-
Configuration
You can configure what kind of lexical and syntactic features the featurizer should extract. The following features are available:
============== ==========================================================================================
Feature Name Description
============== ==========================================================================================
BOS Checks if the token is at the beginning of the sentence.
EOS Checks if the token is at the end of the sentence.
low Checks if the token is lower case.
upper Checks if the token is upper case.
title Checks if the token starts with an uppercase character and all remaining characters are
lowercased.
digit Checks if the token contains just digits.
prefix5 Take the first five characters of the token.
prefix2 Take the first two characters of the token.
suffix5 Take the last five characters of the token.
suffix3 Take the last three characters of the token.
suffix2 Take the last two characters of the token.
suffix1 Take the last character of the token.
============== ==========================================================================================
As the featurizer is moving over the tokens in a user message with a sliding window, you can define features for previous tokens, the current token, and the next tokens in the sliding window. You define the features as a [before, token, after] array. If you want to define features for the token before, the current token, and the token after, your features configuration would look like this:
config.yml
pipeline:
- name: LexicalSyntacticFeaturizer
  "features": [
    ["low", "title", "upper"],
    ["BOS", "EOS", "low", "upper", "title", "digit"],
    ["low", "title", "upper"],
  ]
This configuration is also the default configuration.
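The [before, token, after] layout can be pictured as a window sliding over the tokens. The following sketch is a hypothetical illustration. The function window_features, the "offset:name" key format, and the boolean values are assumptions, and only a subset of the feature names from the table is implemented.

```python
FEATURE_FUNCTIONS = {
    "low": lambda token: token.islower(),
    "upper": lambda token: token.isupper(),
    "title": lambda token: token.istitle(),
    "digit": lambda token: token.isdigit(),
}


def window_features(tokens, features):
    """Compute features per token; `features` is a [before, token, after] list."""
    half = len(features) // 2
    offsets = range(-half, half + 1)
    rows = []
    for i in range(len(tokens)):
        row = {}
        for offset, names in zip(offsets, features):
            j = i + offset
            if not 0 <= j < len(tokens):
                continue  # the window reaches past the sentence boundary
            for name in names:
                if name == "BOS":
                    value = j == 0
                elif name == "EOS":
                    value = j == len(tokens) - 1
                else:
                    value = FEATURE_FUNCTIONS[name](tokens[j])
                row[f"{offset}:{name}"] = value
        rows.append(row)
    return rows


default_features = [
    ["low", "title", "upper"],
    ["BOS", "EOS", "low", "upper", "title", "digit"],
    ["low", "title", "upper"],
]
for row in window_features(["Fly", "to", "BERLIN"], default_features):
    print(row)
```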
Intent Classifiers
Intent classifiers assign one of the intents defined in the domain file to incoming user messages.
LogisticRegressionClassifier
-
Short
Logistic regression intent classifier, using the scikit-learn implementation.
-
Outputs
intent and intent_ranking
-
Requires
Either sparse_features or dense_features need to be present.
-
Output-Example
{
"intent": { "name": "greet", "confidence": 0.78 },
"intent_ranking": [
{
"confidence": 0.78,
"name": "greet"
},
{
"confidence": 0.14,
"name": "goodbye"
},
{
"confidence": 0.08,
"name": "restaurant_search"
}
]
}
-
Description
This classifier uses scikit-learn's logistic regression implementation to perform intent classification. It can work with sparse features alone, but will also pick up any dense features that are present. In general, DIET should yield higher accuracy results, but this classifier should train faster and may be used as a lightweight benchmark. Our implementation uses the base settings from scikit-learn, with the exception of the class_weight parameter where we assume the "balanced" setting.
-
Configuration
An example configuration with all the defaults can be found below.
pipeline:
- name: LogisticRegressionClassifier
  max_iter: 100
  solver: lbfgs
  tol: 0.0001
  random_state: 42
  ranking_length: 10
The configuration parameters are briefly explained below.
- max_iter: Maximum number of iterations taken for the solvers to converge.
- solver: Solver to be used. For very small datasets you might consider liblinear.
- tol: Tolerance for stopping criteria of the optimizer.
- random_state: Used to shuffle the data before training.
- ranking_length: Number of top intents to report. Set to 0 to report all intents.
More details on the parameters can be found on the scikit-learn documentation page.
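The effect of ranking_length on the reported output can be sketched as below. The helper build_output is hypothetical. It only mirrors the shape of the output example and the "0 reports all intents" rule, not the classifier's internals.

```python
def build_output(confidences, ranking_length=10):
    """Build an intent / intent_ranking dict from a name -> confidence mapping."""
    ranking = sorted(
        ({"name": name, "confidence": conf} for name, conf in confidences.items()),
        key=lambda entry: entry["confidence"],
        reverse=True,
    )
    if ranking_length > 0:
        ranking = ranking[:ranking_length]
    return {"intent": ranking[0], "intent_ranking": ranking}


confidences = {"goodbye": 0.14, "greet": 0.78, "restaurant_search": 0.08}
print(build_output(confidences, ranking_length=2))
print(build_output(confidences, ranking_length=0))
```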
SklearnIntentClassifier
-
Short
Sklearn intent classifier
-
Outputs
intent and intent_ranking
-
Requires
dense_features for user messages
-
Output-Example
{
"intent": { "name": "greet", "confidence": 0.78 },
"intent_ranking": [
{
"confidence": 0.78,
"name": "greet"
},
{
"confidence": 0.14,
"name": "goodbye"
},
{
"confidence": 0.08,
"name": "restaurant_search"
}
]
}
-
Description
The sklearn intent classifier trains a linear SVM which gets optimized using a grid search. It also provides rankings of the labels that did not “win”. The SklearnIntentClassifier needs to be preceded by a dense featurizer in the pipeline. This dense featurizer creates the features used for the classification. For more information about the algorithm itself, take a look at the GridSearchCV documentation.
-
Configuration
During the training of the SVM a hyperparameter search is run to find the best parameter set. In the configuration you can specify the parameters that will get tried.
config.yml
pipeline:
- name: "SklearnIntentClassifier"
  # Specifies the list of regularization values to
  # cross-validate over for C-SVM.
  # This is used with the ``kernel`` hyperparameter in GridSearchCV.
  C: [1, 2, 5, 10, 20, 100]
  # Specifies the kernel to use with C-SVM.
  # This is used with the ``C`` hyperparameter in GridSearchCV.
  kernels: ["linear"]
  # Gamma parameter of the C-SVM.
  "gamma": [0.1]
  # We try to find a good number of cross folds to use during
  # intent training, this specifies the max number of folds.
  "max_cross_validation_folds": 5
  # Scoring function used for evaluating the hyper parameters.
  # This can be a name or a function.
  "scoring_function": "f1_weighted"
KeywordIntentClassifier
-
Short
Simple keyword matching intent classifier, intended for small, short-term projects.
-
Outputs
intent
-
Requires
Nothing
-
Output-Example
{
"intent": { "name": "greet", "confidence": 1.0 }
}
-
Description
This classifier works by searching a message for keywords. The matching is case sensitive by default and searches only for exact matches of the keyword-string in the user message. The keywords for an intent are the examples of that intent in the NLU training data. This means the entire example is the keyword, not the individual words in the example.
-
Configuration
config.yml
pipeline:
- name: "KeywordIntentClassifier"
  case_sensitive: True
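The matching behavior described above can be approximated in a few lines. This is a sketch under assumptions: classify_by_keyword is a hypothetical name, matching on word boundaries is assumed, and the shape of the result when nothing matches is invented for illustration.

```python
import re


def classify_by_keyword(message, examples_by_intent, case_sensitive=True):
    """Return the first intent whose full training example occurs in the message."""
    flags = 0 if case_sensitive else re.IGNORECASE
    for intent, examples in examples_by_intent.items():
        for example in examples:
            # The whole example is the keyword, not its individual words.
            if re.search(r"\b" + re.escape(example) + r"\b", message, flags):
                return {"intent": {"name": intent, "confidence": 1.0}}
    return {"intent": {"name": None, "confidence": 0.0}}


examples = {"greet": ["hello", "good morning"], "goodbye": ["bye"]}
print(classify_by_keyword("hello there", examples))
print(classify_by_keyword("Hello there", examples))
print(classify_by_keyword("Hello there", examples, case_sensitive=False))
print(classify_by_keyword("good evening", examples))
```

The last call finds no match because "good morning" is treated as a single keyword, so "good" alone is not enough.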
Entity Extractors
Entity extractors extract entities, such as person names or locations, from the user message.
If you use multiple entity extractors, we advise that each extractor targets an exclusive set of entity types. For example, use Duckling to extract dates and times, and CRFEntityExtractor to extract person names. Otherwise, if multiple extractors target the same entity types, it is very likely that entities will be extracted multiple times.
For example, if you use two or more general purpose extractors like CRFEntityExtractor, the entity types in your training data will be found and extracted by all of them. If the slots you are filling with your entity types are of type text, then the last extractor in your pipeline will win. If the slot is of type list, then all results will be added to the list, including duplicates.
Another, less obvious case of duplicate/overlapping extraction can happen even if extractors focus on different entity types. Imagine a food delivery bot and a user message like "I would like to order the Monday special". Hypothetically, if your time extractor's performance isn't very good, it might extract "Monday" here as a time for the order, and your other extractor might extract "Monday special" as the meal.
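The difference between text and list slots when several extractors return the same entity type can be sketched like this. The function fill_slot and the extractor name "OtherExtractor" are hypothetical, and the entity dicts follow the shape of the output examples in this section.

```python
def fill_slot(entities, entity_type, slot_type):
    """Derive a slot value from extracted entities, listed in pipeline order."""
    values = [e["value"] for e in entities if e["entity"] == entity_type]
    if not values:
        return None
    if slot_type == "text":
        return values[-1]  # the last extractor in the pipeline wins
    if slot_type == "list":
        return values  # all results are kept, including duplicates
    raise ValueError(f"unsupported slot type: {slot_type}")


entities = [
    {"entity": "city", "value": "New York City", "extractor": "CRFEntityExtractor"},
    {"entity": "city", "value": "New York City", "extractor": "OtherExtractor"},
]
print(fill_slot(entities, "city", "text"))
print(fill_slot(entities, "city", "list"))
```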
CRFEntityExtractor
-
Short
Conditional random field (CRF) entity extraction
-
Outputs
entities
-
Requires
tokens and dense_features (optional)
-
Output-Example
{
"entities": [
{
"value": "New York City",
"start": 20,
"end": 33,
"entity": "city",
"confidence": 0.874,
"extractor": "CRFEntityExtractor"
}
]
}
-
Description
This component implements conditional random fields (CRF) to do named entity recognition. CRFs can be thought of as an undirected Markov chain where the time steps are words and the states are entity classes. Features of the words (capitalization, POS tagging, etc.) give probabilities to certain entity classes, as do transitions between neighbouring entity tags: the most likely set of tags is then calculated and returned.
If you want to pass custom features, such as pre-trained word embeddings, to CRFEntityExtractor, you can add any dense featurizer to the pipeline before the CRFEntityExtractor and subsequently configure CRFEntityExtractor to make use of the dense features by adding "text_dense_features" to its feature configuration. CRFEntityExtractor automatically finds the additional dense features and checks if the dense features are an iterable of len(tokens), where each entry is a vector. A warning will be shown in case the check fails. However, CRFEntityExtractor will continue to train just without the additional custom features. In case dense features are present, CRFEntityExtractor will pass the dense features to sklearn_crfsuite and use them for training.
-
Configuration
CRFEntityExtractor has a list of default features to use. However, you can overwrite the default configuration. The following features are available:
=================== ==========================================================================================
Feature Name Description
=================== ==========================================================================================
low word identity - use the lower-cased token as a feature.
upper Checks if the token is upper case.
title Checks if the token starts with an uppercase character and all remaining characters are
lowercased.
digit Checks if the token contains just digits.
prefix5 Take the first five characters of the token.
prefix2 Take the first two characters of the token.
suffix5 Take the last five characters of the token.
suffix3 Take the last three characters of the token.
suffix2 Take the last two characters of the token.
suffix1 Take the last character of the token.
pattern Take the patterns defined by ``RegexFeaturizer``.
bias Add an additional "bias" feature to the list of features.
text_dense_features Adds additional features from a dense featurizer.
=================== ==========================================================================================
As the featurizer is moving over the tokens in a user message with a sliding window, you can define features for previous tokens, the current token, and the next tokens in the sliding window. You define the features as a [before, token, after] array.
Additionally, you can set a flag to determine whether to use the BILOU tagging schema or not.
- BILOU_flag determines whether to use BILOU tagging or not. Default True.
config.yml
pipeline:
- name: "CRFEntityExtractor"
  # BILOU_flag determines whether to use BILOU tagging or not.
  "BILOU_flag": True
  # features to extract in the sliding window
  "features": [
    ["low", "title", "upper"],
    [
      "bias",
      "low",
      "prefix5",
      "prefix2",
      "suffix5",
      "suffix3",
      "suffix2",
      "upper",
      "title",
      "digit",
      "pattern",
      "text_dense_features"
    ],
    ["low", "title", "upper"],
  ]
  # The maximum number of iterations for optimization algorithms.
  "max_iterations": 50
  # weight of the L1 regularization
  "L1_c": 0.1
  # weight of the L2 regularization
  "L2_c": 0.1
  # Name of dense featurizers to use.
  # If list is empty all available dense features are used.
  "featurizers": []
  # Indicates whether a list of extracted entities should be split into individual entities for a given entity type
  "split_entities_by_comma":
    address: False
    email: True
Note: If pattern features are used, you need to have RegexFeaturizer in your pipeline.
Note: If text_dense_features features are used, you need to have a dense featurizer (e.g. LanguageModelFeaturizer) in your pipeline.
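For readers unfamiliar with the BILOU schema that BILOU_flag enables, the sketch below tags tokens with B (begin), I (inside), L (last), O (outside), and U (unit, a single-token entity). The function bilou_tags and its token-index span format are hypothetical and only illustrate the tagging scheme, not the extractor's code.

```python
def bilou_tags(tokens, entity_spans):
    """Tag tokens; entity_spans holds (first_index, last_index, entity), inclusive."""
    tags = ["O"] * len(tokens)
    for first, last, entity in entity_spans:
        if first == last:
            tags[first] = f"U-{entity}"
            continue
        tags[first] = f"B-{entity}"
        for i in range(first + 1, last):
            tags[i] = f"I-{entity}"
        tags[last] = f"L-{entity}"
    return tags


tokens = ["fly", "to", "New", "York", "City", "on", "Monday"]
print(bilou_tags(tokens, [(2, 4, "city"), (6, 6, "time")]))
```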
DucklingEntityExtractor
-
Short
Duckling lets you extract common entities like dates, amounts of money, distances, and others in a number of languages.
-
Outputs
entities
-
Requires
Nothing
-
Output-Example
{
"entities": [
{
"end": 53,
"entity": "time",
"start": 48,
"value": "2017-04-10T00:00:00.000+02:00",
"confidence": 1.0,
"extractor": "DucklingEntityExtractor"
}
]
}
-
Description
To use this component you need to run a duckling server. The easiest option is to spin up a docker container using docker run -p 8000:8000 rasa/duckling.
Alternatively, you can install duckling directly on your machine and start the server.
Duckling recognizes dates, numbers, distances and other structured entities and normalizes them. Please be aware that duckling tries to extract as many entity types as possible without providing a ranking. For example, if you specify both number and time as dimensions for the duckling component, the component will extract two entities: "10" as a number and "in 10 minutes" as a time from the text "I will be there in 10 minutes". In such a situation, your application would have to decide which entity type is the correct one. The extractor will always return 1.0 as a confidence, as it is a rule based system.
The list of supported languages can be found in the Duckling GitHub repository.
-
Configuration
Configure which dimensions, i.e. entity types, the duckling component should extract. A full list of available dimensions can be found in the duckling project readme. Leaving the dimensions option unspecified will extract all available dimensions.
config.yml
pipeline:
- name: "DucklingEntityExtractor"
  # url of the running duckling server
  url: "http://localhost:8000"
  # dimensions to extract
  dimensions: ["time", "number", "amount-of-money", "distance"]
  # allows you to configure the locale, by default the language is
  # used
  locale: "de_DE"
  # if not set the default timezone of Duckling is going to be used
  # needed to calculate dates from relative expressions like "tomorrow"
  timezone: "Europe/Berlin"
  # Timeout for receiving response from http url of the running duckling server
  # if not set the default timeout of duckling http url is set to 3 seconds.
  timeout: 3
RegexEntityExtractor
-
Short
Extracts entities using the lookup tables and/or regexes defined in the training data
-
Outputs
entities
-
Requires
Nothing
-
Description
This component extracts entities using the lookup tables and regexes defined in the training data. The component checks if the user message contains an entry of one of the lookup tables or matches one of the regexes. If a match is found, the value is extracted as an entity.
This component only uses those regex features that have a name equal to one of the entities defined in the training data. Make sure to annotate at least one example per entity.
Note: When you use this extractor in combination with CRFEntityExtractor, it can lead to multiple extraction of entities, especially if many training sentences have entity annotations for the entity types for which you have also defined regexes. See the big info box at the start of the entity extractor section for more info on multiple extraction.
If you seem to need both this RegexEntityExtractor and one of the aforementioned statistical extractors, we advise you to consider one of the following two options.
Option 1 is advisable when you have exclusive entity types for each type of extractor. To make sure the extractors don't interfere with one another, annotate exactly one example sentence for each regex/lookup entity type, and no more.
Option 2 is useful when you want to use regex matches as an additional signal for your statistical extractor, but you don't have separate entity types. In this case you will want to 1) add the RegexFeaturizer before the extractors in your pipeline, 2) annotate all your entity examples in the training data and 3) remove the RegexEntityExtractor from your pipeline. This way, your statistical extractors will receive an additional signal about the presence of regex matches and will be able to statistically determine when to rely on these matches and when not to.
-
Configuration
Make the entity extractor case sensitive by adding the
case_sensitive: True
option, the default being
case_sensitive: False
. To correctly process languages such as Chinese that don't use whitespace for word separation, add the
use_word_boundaries: False
option, the default being
use_word_boundaries: True
.
config.yml
pipeline:
- name: RegexEntityExtractor
  # text is processed case-insensitively by default
  case_sensitive: False
  # use lookup tables to extract entities
  use_lookup_tables: True
  # use regexes to extract entities
  use_regexes: True
  # match lookup table entries only at word boundaries
  use_word_boundaries: True
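To illustrate what the case_sensitive and use_word_boundaries options mean in practice, here is a minimal sketch of lookup-table matching built on Python's `re` module. It is an assumption-laden illustration, not Rasa's implementation: the function name `extract_lookup_entities` and its parameters are hypothetical and merely mirror the option names above.

```python
import re


def extract_lookup_entities(text, entity, entries,
                            case_sensitive=False, use_word_boundaries=True):
    """Find every lookup-table entry in `text` and return it as an entity dict."""
    # Try longer entries first so that "New York City" wins over "New York".
    alternatives = "|".join(
        re.escape(entry) for entry in sorted(entries, key=len, reverse=True)
    )
    pattern = f"(?:{alternatives})"
    if use_word_boundaries:
        # \b only matches between a word character and a non-word character.
        pattern = rf"\b{pattern}\b"
    flags = 0 if case_sensitive else re.IGNORECASE
    return [
        {"entity": entity, "value": m.group(0), "start": m.start(), "end": m.end()}
        for m in re.finditer(pattern, text, flags)
    ]


# Case-insensitive by default: "berlin" matches the entry "Berlin".
cities = extract_lookup_entities("I live in berlin", "city", ["Berlin", "Paris"])

# Chinese text has no whitespace between words, so \b finds no boundary
# around the entry and nothing matches unless word boundaries are disabled.
with_boundaries = extract_lookup_entities("我想去北京旅游", "city", ["北京"])
without_boundaries = extract_lookup_entities(
    "我想去北京旅游", "city", ["北京"], use_word_boundaries=False
)
```

In this sketch `with_boundaries` is empty while `without_boundaries` contains one match, which is the reason use_word_boundaries: False is recommended for such languages.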
EntitySynonymMapper
-
Short
Maps synonymous entity values to the same value.
-
Outputs
Modifies existing entities that previous entity extraction components found.
-
Requires
An extractor from Entity Extractors
-
Description
If the training data contains defined synonyms, this component will make sure that detected entity values will be mapped to the same value. For example, if your training data contains the following examples:
[
  {
    "text": "I moved to New York City",
    "intent": "inform_relocation",
    "entities": [
      {
        "value": "nyc",
        "start": 11,
        "end": 24,
        "entity": "city"
      }
    ]
  },
  {
    "text": "I got a new flat in NYC.",
    "intent": "inform_relocation",
    "entities": [
      {
        "value": "nyc",
        "start": 20,
        "end": 23,
        "entity": "city"
      }
    ]
  }
]
This component will allow you to map the entities
New York City
and
NYC
to
nyc
. The entity extraction will return
nyc
even though the message contains
NYC
. When this component changes an existing entity, it appends itself to the processor list of this entity.
-
Configuration
config.yml
pipeline:
- name: "EntitySynonymMapper"
Note: When using the
EntitySynonymMapper
as part of an NLU pipeline, place it below any entity extractors in the configuration file.
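The mapping described above can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions, not Rasa's implementation: the helper names `collect_synonyms` and `map_synonyms` are hypothetical, the lookup is assumed to be case-insensitive, and the processor list is assumed to live under a `processors` key.

```python
def collect_synonyms(examples):
    """Learn surface-text -> value mappings from annotated training examples."""
    synonyms = {}
    for example in examples:
        for ent in example.get("entities", []):
            surface = example["text"][ent["start"]:ent["end"]]
            if surface != ent["value"]:
                # Assumption: keys are lower-cased so lookups ignore case.
                synonyms[surface.lower()] = ent["value"]
    return synonyms


def map_synonyms(entities, synonyms, name="EntitySynonymMapper"):
    """Replace entity values with their mapped value and record the processor."""
    for ent in entities:
        mapped = synonyms.get(str(ent["value"]).lower())
        if mapped is not None and mapped != ent["value"]:
            ent["value"] = mapped
            # Assumption: the processor list is stored under "processors".
            ent.setdefault("processors", []).append(name)
    return entities


training_examples = [
    {"text": "I moved to New York City",
     "entities": [{"value": "nyc", "start": 11, "end": 24, "entity": "city"}]},
    {"text": "I got a new flat in NYC.",
     "entities": [{"value": "nyc", "start": 20, "end": 23, "entity": "city"}]},
]
synonyms = collect_synonyms(training_examples)

# Entities as an upstream extractor might have produced them.
found = [
    {"entity": "city", "value": "NYC", "start": 20, "end": 23},
    {"entity": "city", "value": "Berlin", "start": 0, "end": 6},
]
mapped_entities = map_synonyms(found, synonyms)
```

The sketch also shows why the mapper has to come after the extractors in the pipeline: it only rewrites entities that already exist.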
Incremental training
This feature is experimental. We introduce experimental features to get feedback from our community, so we encourage you to try it out! However, the functionality might be changed or removed in the future. If you have feedback (positive or negative) please share it with us on the Rasa Forum.
In order to improve the performance of an assistant, it's helpful to practice CDD
and add new training examples based on how your users have talked to your assistant. You can use rasa train --finetune
to initialize the pipeline with an already trained model and further finetune it on the
new training dataset that includes the additional training examples. This will help reduce the
training time of the new model.
By default, the command picks up the latest model in the models/
directory. If you have a specific model
which you want to improve, you may specify the path to this by
running rasa train --finetune <path to model to finetune>
. Finetuning a model usually
requires fewer epochs to train machine learning components like DIETClassifier
, ResponseSelector
and TEDPolicy
compared to training from scratch.
Either use a model configuration for finetuning
which defines fewer epochs than before or use the flag
--epoch-fraction
. --epoch-fraction
will use a fraction of the epochs specified for each machine learning component
in the model configuration file. For example, if DIETClassifier
is configured to use 100 epochs,
specifying --epoch-fraction 0.5
will only use 50 epochs for finetuning.
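As a rough worked example of the arithmetic, the sketch below scales the configured epochs of each component by the fraction. The function name `finetune_epochs` is hypothetical, and rounding up to a whole epoch is an assumption made here for illustration; the text above only states the 100 to 50 case.

```python
import math


def finetune_epochs(configured_epochs, epoch_fraction):
    """Scale each component's configured epochs by the given fraction."""
    return {
        component: math.ceil(epochs * epoch_fraction)  # rounding up is an assumption
        for component, epochs in configured_epochs.items()
    }


# DIETClassifier configured with 100 epochs, as in the example above.
scaled = finetune_epochs({"DIETClassifier": 100}, 0.5)
```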
You can also finetune an NLU-only or dialogue management-only model by using
rasa train nlu --finetune
and rasa train core --finetune
respectively.
To be able to finetune a model, the following conditions must be met:
-
The configuration supplied should be exactly the same as the configuration used to train the model which is being finetuned. The only parameter that you can change is
epochs
for the individual machine learning components and policies.
-
The set of labels (intents, actions, entities and slots) for which the base model was trained must be exactly the same as the set present in the training data used for finetuning. This means that you cannot add new intent, action, entity or slot labels to your training data during incremental training. You can still add new training examples for each of the existing labels. If you have added or removed labels in the training data, the pipeline needs to be trained from scratch.
-
The model to be finetuned must have been trained with a Rasa version that is at least the
MINIMUM_COMPATIBLE_VERSION
of the currently installed rasa version.
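The first two conditions can be checked before starting a finetuning run. The following is a minimal sketch of such a pre-check, not a Rasa API: the helper names `strip_epochs` and `can_finetune` are hypothetical, configurations are assumed to be plain dictionaries loaded from the configuration file, and the version condition is left out.

```python
def strip_epochs(config):
    """Return a copy of the configuration with every `epochs` key removed."""
    if isinstance(config, dict):
        return {k: strip_epochs(v) for k, v in config.items() if k != "epochs"}
    if isinstance(config, list):
        return [strip_epochs(item) for item in config]
    return config


def can_finetune(base_config, new_config, base_labels, new_labels):
    """Check that only `epochs` differs and that the label sets are identical."""
    same_config = strip_epochs(base_config) == strip_epochs(new_config)
    same_labels = set(base_labels) == set(new_labels)
    return same_config and same_labels


base = {"pipeline": [{"name": "DIETClassifier", "epochs": 100}]}
fewer_epochs = {"pipeline": [{"name": "DIETClassifier", "epochs": 50}]}

# Only `epochs` changed and the labels are the same, so finetuning is possible.
ok = can_finetune(base, fewer_epochs, ["greet", "goodbye"], ["goodbye", "greet"])

# A new intent label was added, so the pipeline has to be trained from scratch.
new_label = can_finetune(base, fewer_epochs, ["greet"], ["greet", "goodbye"])
```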