NLU Components
NLU-based assistants
This section refers to building NLU-based assistants. If you are working with Conversational AI with Language Models (CALM), this content may not apply to you.
To use NLU components, you need to install the nlu dependency group:
pip install 'rasa-pro[nlu]'
For more information about dependency groups, see our Python Versions and Dependencies reference page.
Tokenizers
Tokenizers split text into tokens. If you want to split intents into multiple labels, e.g. for predicting multiple intents or for modeling hierarchical intent structure, use the following flags with any tokenizer:
- intent_tokenization_flag indicates whether to tokenize intent labels or not. Set it to True so that intent labels are tokenized.
- intent_split_symbol sets the delimiter string used to split the intent labels; the default is underscore (_).
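For example, to split composite intent labels such as check_balances+transfer_money on a "+" delimiter, any tokenizer could be configured along the lines of the following sketch (the intent name and delimiter are only illustrative):
config.yml
pipeline:
  - name: "WhitespaceTokenizer"
    # tokenize intent labels such as "check_balances+transfer_money"
    intent_tokenization_flag: True
    # split intent labels on "+" instead of the default "_"
    intent_split_symbol: "+"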
WhitespaceTokenizer
- Short: Tokenizer using whitespaces as a separator
- Outputs: tokens for user messages, responses (if present), and intents (if specified)
- Requires: Nothing
- Description:
Creates a token for every whitespace separated character sequence.
Any character not in a-zA-Z0-9_#@& will be substituted with whitespace before splitting on whitespace if the character fulfills any of the following conditions:
- the character follows a whitespace: " !word" → "word"
- the character precedes a whitespace: "word! " → "word"
- the character is at the beginning of the string: "!word" → "word"
- the character is at the end of the string: "word!" → "word"
Note that: "wo!rd" → "wo!rd"
In addition, any character not in a-zA-Z0-9_#@&.~:\/?[]()!$*+,;=- will be substituted with whitespace before splitting on whitespace if the character is not between numbers:
- "twenty{one" → "twenty", "one" ("{" is not between numbers)
- "20{1" → "20{1" ("{" is between numbers)
Note that: "name@example.com" → "name@example.com", "10,000.1" → "10,000.1", "1 - 2" → "1", "2"
- Configuration:
config.yml
pipeline:
  - name: "WhitespaceTokenizer"
    # Flag to check whether to split intents
    "intent_tokenization_flag": False
    # Symbol on which intent should be split
    "intent_split_symbol": "_"
    # Regular expression to detect tokens
    "token_pattern": None
Featurizers
Text featurizers are divided into two different categories: sparse featurizers and dense featurizers. Sparse featurizers return feature vectors with a lot of missing values, e.g. zeros. As those feature vectors would normally take up a lot of memory, we store them as sparse features. Sparse features only store the values that are non-zero and their positions in the vector. Thus, we save a lot of memory and are able to train on larger datasets.
All featurizers can return two different kinds of features: sequence features and sentence features.
The sequence features are a matrix of size (number-of-tokens x feature-dimension).
The matrix contains a feature vector for every token in the sequence.
This allows us to train sequence models.
The sentence features are represented by a matrix of size (1 x feature-dimension).
It contains the feature vector for the complete utterance.
The sentence features can be used in any bag-of-words model.
The corresponding classifier can therefore decide what kind of features to use.
Note: The feature-dimension for sequence and sentence features does not have to be the same.
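As a purely illustrative sketch (not an actual Rasa file or API), the two kinds of features for a hypothetical three-token message and a feature dimension of 4 would have these shapes:
# illustration only: feature shapes for the message "book a table"
sequence_features:        # (number-of-tokens x feature-dimension) = (3 x 4)
  - [0.0, 1.0, 0.0, 2.0]  # feature vector for "book"
  - [1.0, 0.0, 0.0, 0.0]  # feature vector for "a"
  - [0.0, 0.0, 3.0, 1.0]  # feature vector for "table"
sentence_features:        # (1 x feature-dimension) = (1 x 4)
  - [1.0, 1.0, 3.0, 3.0]  # one feature vector for the complete utterance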
LanguageModelFeaturizer
- Short: Creates a vector representation of user message and response (if specified) using a pre-trained language model.
- Outputs: dense_features for user messages and responses
- Type: Dense featurizer
- Description:
Creates features for entity extraction, intent classification, and response selection. Uses a pre-trained language model to compute vector representations of input text.
Note: Please make sure that you use a language model which is pre-trained on the same language corpus as that of your training data.
- Configuration:
Include a Tokenizer component before this component.
You should specify which language model to load via the parameter model_name. See the table below for the currently supported language models. The weights to be loaded can be specified by the additional parameter model_weights. If left empty, it uses the default model weights listed in the table.
+----------------+--------------+-------------------------+
| Language Model | Parameter | Default value for |
| | "model_name" | "model_weights" |
+----------------+--------------+-------------------------+
| BERT | bert | rasa/LaBSE |
+----------------+--------------+-------------------------+
| GPT | gpt | openai-gpt |
+----------------+--------------+-------------------------+
| GPT-2 | gpt2 | gpt2 |
+----------------+--------------+-------------------------+
| XLNet | xlnet | xlnet-base-cased |
+----------------+--------------+-------------------------+
| DistilBERT | distilbert | distilbert-base-uncased |
+----------------+--------------+-------------------------+
| RoBERTa | roberta | roberta-base |
+----------------+--------------+-------------------------+
| camemBERT | camembert | camembert-base |
+----------------+--------------+-------------------------+
Apart from the default pretrained model weights, further models from HuggingFace can be used, provided the following conditions are met (the mentioned files can be found in the "Files and versions" section of the model website):
- The model architecture is one of the supported language models (check that the model_type in config.json is listed in the table's column model_name)
- The model has pretrained TensorFlow weights (check that the file tf_model.h5 exists; at this time Safetensors are not supported)
- The model uses the default tokenizer (config.json should not contain a custom tokenizer_class setting)
Note: While the LaBSE weights are loaded by default for the bert architecture, offering a multi-lingual model trained on 112 languages (see our tutorial and the original paper), we now recommend using the MiniLM model for better performance.
The LaBSE weights can still serve as a baseline for initial testing and development. After establishing this baseline, we strongly encourage exploring optimization with MiniLM to improve your assistant's effectiveness, before trying to optimize this component with other weights/architectures.
The following configuration loads the language model BERT with rasa/LaBSE weights, which can be found here:
config.yml
pipeline:
  - name: LanguageModelFeaturizer
    # Name of the language model to use
    model_name: "bert"
    # Pre-trained weights to be loaded
    model_weights: "rasa/LaBSE"
    # An optional path to a directory from which
    # to load pre-trained model weights.
    # If the requested model is not found in the
    # directory, it will be downloaded and
    # cached in this directory for future use.
    # The default value of `cache_dir` can be
    # set using the environment variable
    # `TRANSFORMERS_CACHE`, as per the
    # Transformers library.
    cache_dir: null
For enhanced performance, we recommend the sentence-transformers/all-MiniLM-L6-v2 weights, which can be found here:
config.yml
pipeline:
  - name: LanguageModelFeaturizer
    model_name: "bert"
    model_weights: "sentence-transformers/all-MiniLM-L6-v2"
    cache_dir: null
RegexFeaturizer
- Short: Creates a vector representation of user message using regular expressions.
- Outputs: sparse_features for user messages and tokens.pattern
- Requires: tokens
- Type: Sparse featurizer
- Description:
Creates features for entity extraction and intent classification. During training the RegexFeaturizer creates a list of regular expressions defined in the training data format. For each regex, a feature will be set marking whether this expression was found in the user message or not. All features will later be fed into an intent classifier / entity extractor to simplify classification (assuming the classifier has learned during the training phase that this set feature indicates a certain intent / entity). Regex features for entity extraction are currently only supported by the CRFEntityExtractor.
- Configuration:
Make the featurizer case insensitive by adding the case_sensitive: False option; the default is case_sensitive: True. To correctly process languages such as Chinese that don't use whitespace for word separation, add the use_word_boundaries: False option; the default is use_word_boundaries: True.
config.yml
pipeline:
  - name: "RegexFeaturizer"
    # Text will be processed case sensitively by default
    "case_sensitive": True
    # Use word boundaries when matching lookup tables
    "use_word_boundaries": True
CountVectorsFeaturizer
- Short: Creates bag-of-words representation of user messages, intents, and responses.
- Outputs: sparse_features for user messages, intents, and responses
- Requires: tokens
- Type: Sparse featurizer
- Description:
Creates features for intent classification and response selection. Creates a bag-of-words representation of user message, intent, and response using sklearn's CountVectorizer. All tokens which consist only of digits (e.g. 123 and 99 but not a123d) will be assigned to the same feature.
- Configuration:
See sklearn's CountVectorizer docs for detailed description of the configuration parameters.
This featurizer can be configured to use word or character n-grams, using the analyzer configuration parameter. By default analyzer is set to word, so word token counts are used as features. If you want to use character n-grams, set analyzer to char or char_wb. The lower and upper boundaries of the n-grams can be configured via the parameters min_ngram and max_ngram. By default both of them are set to 1. By default the featurizer takes the lemma of a word instead of the word directly if it is available. You can disable this behavior by setting use_lemma to False.
Note: Option char_wb creates character n-grams only from text inside word boundaries; n-grams at the edges of words are padded with space. This option can be used to create Subword Semantic Hashing.
Note: For character n-grams, do not forget to increase the min_ngram and max_ngram parameters. Otherwise the vocabulary will contain only single letters.
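For instance, a character n-gram setup could look like the following sketch (the n-gram range shown is only an illustration; pick values that suit your data):
config.yml
pipeline:
  - name: "CountVectorsFeaturizer"
    # character n-grams inside word boundaries
    "analyzer": "char_wb"
    # use 2- to 4-character n-grams rather than single letters
    "min_ngram": 2
    "max_ngram": 4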
Handling Out-Of-Vocabulary (OOV) words:
Note: Enabled only if analyzer is word.
Since the training is performed on limited vocabulary data, it cannot be guaranteed that during prediction an algorithm will not encounter an unknown word (a word that was not seen during training). In order to teach an algorithm how to treat unknown words, some words in the training data can be substituted by the generic word OOV_token. In this case, during prediction all unknown words will be treated as this generic word OOV_token.
For example, one might create a separate intent outofscope in the training data containing messages with different numbers of OOV_token words and maybe some additional general words. Then an algorithm will likely classify a message with unknown words as this intent outofscope.
You can set either the OOV_token or a list of words OOV_words:
- OOV_token sets a keyword for unseen words; if the training data contains OOV_token as a word in some messages, during prediction the words that were not seen during training will be substituted with the provided OOV_token; if OOV_token=None (default behavior) words that were not seen during training will be ignored during prediction time;
- OOV_words sets a list of words to be treated as OOV_token during training; if a list of words that should be treated as Out-Of-Vocabulary is known, it can be set via OOV_words instead of manually changing it in the training data or using a custom preprocessor.
Note: This featurizer creates a bag-of-words representation by counting words, so the number of OOV_token occurrences in the sentence might be important.
Note: Providing OOV_words is optional; the training data can contain OOV_token input manually or via a custom additional preprocessor. Unseen words will be substituted with OOV_token only if this token is present in the training data or an OOV_words list is provided.
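As an illustration of the outofscope pattern described above, hypothetical training data along these lines would teach the classifier what messages made up of unknown words look like (the intent name and examples are purely illustrative):
nlu.yml
nlu:
  - intent: outofscope
    # assumes OOV_token is configured as "OOV_token" in the pipeline
    examples: |
      - OOV_token OOV_token
      - can you OOV_token
      - OOV_token please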
If you want to share the vocabulary between user messages and intents, you need to set the option use_shared_vocab to True. In that case a common vocabulary set between tokens in intents and user messages is built.
config.yml
pipeline:
- name: "CountVectorsFeaturizer"
# Analyzer to use, either 'word', 'char', or 'char_wb'
"analyzer": "word"
# Set the lower and upper boundaries for the n-grams
"min_ngram": 1
"max_ngram": 1
# Set the out-of-vocabulary token
"OOV_token": "_oov_"
# Whether to use a shared vocab
"use_shared_vocab": FalseConfiguring for incremental training
To ensure that sparse_features are of fixed size during incremental training, the component should be configured to account for additional vocabulary tokens that may be added as part of new training examples in the future. To do so, configure the additional_vocabulary_size parameter while training the base model from scratch:
config.yml
pipeline:
  - name: CountVectorsFeaturizer
    additional_vocabulary_size:
      text: 1000
      response: 1000
      action_text: 1000
As in the above example, you can define the additional vocabulary size for each of text (user messages), response (bot responses used by ResponseSelector) and action_text (bot responses not used by ResponseSelector). If you are building a shared vocabulary (use_shared_vocab=True), you only need to define a value for the text attribute. If any of the attributes is not configured by the user, the component takes half of the current vocabulary size as the default value for that attribute's additional_vocabulary_size. This number is kept at a minimum of 1000 in order to avoid running out of additional vocabulary slots too frequently during incremental training. Once the component runs out of additional vocabulary slots, new vocabulary tokens are dropped and not considered during featurization. At this point, it is advisable to retrain a new model from scratch.
The above configuration parameters are the ones you should configure to fit your model to your data. However, additional parameters exist that can be adapted.
More configurable parameters
+---------------------------+-------------------------+--------------------------------------------------------------+
| Parameter | Default Value | Description |
+===========================+=========================+==============================================================+
| use_shared_vocab | False | If set to 'True' a common vocabulary is used for labels |
| | | and user message. |
+---------------------------+-------------------------+--------------------------------------------------------------+
| analyzer | word | Whether the features should be made of word n-gram or |
| | | character n-grams. Option 'char_wb' creates character |
| | | n-grams only from text inside word boundaries; |
| | | n-grams at the edges of words are padded with space. |
| | | Valid values: 'word', 'char', 'char_wb'. |
+---------------------------+-------------------------+--------------------------------------------------------------+
| strip_accents | None | Remove accents during the pre-processing step. |
| | | Valid values: 'ascii', 'unicode', 'None'. |
+---------------------------+-------------------------+--------------------------------------------------------------+
| stop_words | None | A list of stop words to use. |
| | | Valid values: 'english' (uses an internal list of |
| | | English stop words), a list of custom stop words, or |
| | | 'None'. |
+---------------------------+-------------------------+--------------------------------------------------------------+
| min_df | 1 | When building the vocabulary ignore terms that have a |
| | | document frequency strictly lower than the given threshold. |
+---------------------------+-------------------------+--------------------------------------------------------------+
| max_df | 1 | When building the vocabulary ignore terms that have a |
| | | document frequency strictly higher than the given threshold |
| | | (corpus-specific stop words). |
+---------------------------+-------------------------+--------------------------------------------------------------+
| min_ngram | 1 | The lower boundary of the range of n-values for different |
| | | word n-grams or char n-grams to be extracted. |
+---------------------------+-------------------------+--------------------------------------------------------------+
| max_ngram | 1 | The upper boundary of the range of n-values for different |
| | | word n-grams or char n-grams to be extracted. |
+---------------------------+-------------------------+--------------------------------------------------------------+
| max_features | None | If not 'None', build a vocabulary that only consider the top |
| | | max_features ordered by term frequency across the corpus. |
+---------------------------+-------------------------+--------------------------------------------------------------+
| lowercase | True | Convert all characters to lowercase before tokenizing. |
+---------------------------+-------------------------+--------------------------------------------------------------+
| OOV_token | None | Keyword for unseen words. |
+---------------------------+-------------------------+--------------------------------------------------------------+
| OOV_words | [] | List of words to be treated as 'OOV_token' during training. |
+---------------------------+-------------------------+--------------------------------------------------------------+
| alias | CountVectorFeaturizer | Alias name of featurizer. |
+---------------------------+-------------------------+--------------------------------------------------------------+
| use_lemma | True | Use the lemma of words for featurization. |
+---------------------------+-------------------------+--------------------------------------------------------------+
| additional_vocabulary_size| text: 1000 | Size of additional vocabulary to account for incremental |
| | response: 1000 | training while training a model from scratch |
| | action_text: 1000 | |
+---------------------------+-------------------------+--------------------------------------------------------------+
LexicalSyntacticFeaturizer
- Short: Creates lexical and syntactic features for a user message to support entity extraction.
- Outputs: sparse_features for user messages
- Requires: tokens
- Type: Sparse featurizer
- Description:
Creates features for entity extraction. Moves with a sliding window over every token in the user message and creates features according to the configuration (see below). As a default configuration is present, you don't need to specify a configuration.
- Configuration:
You can configure what kind of lexical and syntactic features the featurizer should extract. The following features are available:
============== ==========================================================================================
Feature Name Description
============== ==========================================================================================
BOS Checks if the token is at the beginning of the sentence.
EOS Checks if the token is at the end of the sentence.
low Checks if the token is lower case.
upper Checks if the token is upper case.
title Checks if the token starts with an uppercase character and all remaining characters are
lowercased.
digit Checks if the token contains just digits.
prefix5 Take the first five characters of the token.
prefix2 Take the first two characters of the token.
suffix5 Take the last five characters of the token.
suffix3 Take the last three characters of the token.
suffix2 Take the last two characters of the token.
suffix1 Take the last character of the token.
============== ==========================================================================================
As the featurizer is moving over the tokens in a user message with a sliding window, you can define features for previous tokens, the current token, and the next tokens in the sliding window. You define the features as a [before, token, after] array. If you want to define features for the token before, the current token, and the token after, your features configuration would look like this:
config.yml
pipeline:
  - name: LexicalSyntacticFeaturizer
    "features": [
      ["low", "title", "upper"],
      ["BOS", "EOS", "low", "upper", "title", "digit"],
      ["low", "title", "upper"],
    ]
This configuration is also the default configuration.
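If you also want the sliding window to look at short prefixes and suffixes from the table above, a customized sketch could add those feature names for the current token (which features actually help depends on your data):
config.yml
pipeline:
  - name: LexicalSyntacticFeaturizer
    "features": [
      ["low", "title", "upper"],
      # also include 2-character prefixes and suffixes of the current token
      ["BOS", "EOS", "low", "upper", "title", "digit", "prefix2", "suffix2"],
      ["low", "title", "upper"],
    ]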
Intent Classifiers
Intent classifiers assign one of the intents defined in the domain file to incoming user messages.
LogisticRegressionClassifier
- Short: Logistic regression intent classifier, using the scikit-learn implementation.
- Outputs: intent and intent_ranking
- Requires: Either sparse_features or dense_features need to be present.
- Output-Example:
{
"intent": { "name": "greet", "confidence": 0.78 },
"intent_ranking": [
{
"confidence": 0.78,
"name": "greet"
},
{
"confidence": 0.14,
"name": "goodbye"
},
{
"confidence": 0.08,
"name": "restaurant_search"
}
]
}
- Description:
This classifier uses scikit-learn's logistic regression implementation to perform intent classification. It's able to use only sparse features, but will also pick up any dense features that are present. In general, DIET should yield higher accuracy results, but this classifier should train faster and may be used as a lightweight benchmark. Our implementation uses the base settings from scikit-learn, with the exception of the class_weight parameter, where we assume the "balanced" setting.
- Configuration:
An example configuration with all the defaults can be found below.
pipeline:
  - name: LogisticRegressionClassifier
    max_iter: 100
    solver: lbfgs
    tol: 0.0001
    random_state: 42
    ranking_length: 10
These configuration parameters are briefly explained below.
- max_iter: Maximum number of iterations taken for the solvers to converge.
- solver: Solver to be used. For very small datasets you might consider liblinear.
- tol: Tolerance for the stopping criteria of the optimizer.
- random_state: Used to shuffle the data before training.
- ranking_length: Number of top intents to report. Set to 0 to report all intents.
More details on the parameters can be found on the scikit-learn documentation page.
SklearnIntentClassifier
- Short: Sklearn intent classifier
- Outputs: intent and intent_ranking
- Requires: dense_features for user messages
- Output-Example:
{
"intent": { "name": "greet", "confidence": 0.78 },
"intent_ranking": [
{
"confidence": 0.78,
"name": "greet"
},
{
"confidence": 0.14,
"name": "goodbye"
},
{
"confidence": 0.08,
"name": "restaurant_search"
}
]
}
- Description:
The sklearn intent classifier trains a linear SVM which gets optimized using a grid search. It also provides rankings of the labels that did not “win”. The SklearnIntentClassifier needs to be preceded by a dense featurizer in the pipeline. This dense featurizer creates the features used for the classification. For more information about the algorithm itself, take a look at the GridSearchCV documentation.
- Configuration:
During the training of the SVM a hyperparameter search is run to find the best parameter set. In the configuration you can specify the parameters that will get tried.
config.yml
pipeline:
  - name: "SklearnIntentClassifier"
    # Specifies the list of regularization values to
    # cross-validate over for C-SVM.
    # This is used with the ``kernel`` hyperparameter in GridSearchCV.
    C: [1, 2, 5, 10, 20, 100]
    # Specifies the kernel to use with C-SVM.
    # This is used with the ``C`` hyperparameter in GridSearchCV.
    kernels: ["linear"]
    # Gamma parameter of the C-SVM.
    "gamma": [0.1]
    # We try to find a good number of cross folds to use during
    # intent training, this specifies the max number of folds.
    "max_cross_validation_folds": 5
    # Scoring function used for evaluating the hyper parameters.
    # This can be a name or a function.
    "scoring_function": "f1_weighted"
KeywordIntentClassifier
- Short: Simple keyword matching intent classifier, intended for small, short-term projects.
- Outputs: intent
- Requires: Nothing
- Output-Example:
{
"intent": { "name": "greet", "confidence": 1.0 }
}
- Description:
This classifier works by searching a message for keywords. The matching is case sensitive by default and searches only for exact matches of the keyword-string in the user message. The keywords for an intent are the examples of that intent in the NLU training data. This means the entire example is the keyword, not the individual words in the example.
- Configuration:
config.yml
pipeline:
  - name: "KeywordIntentClassifier"
    case_sensitive: True
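As described above, the keywords for an intent are its NLU training examples, matched as whole strings. A hypothetical snippet of such training data might look like this (every full example, e.g. "good morning", acts as one keyword):
nlu.yml
nlu:
  - intent: greet
    examples: |
      - hey there
      - good morning
  - intent: goodbye
    examples: |
      - bye bye
      - see you around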