
Components

Note

For clarity, we have renamed the pre-defined pipelines to reflect what they do rather than which libraries they use as of Rasa NLU 0.15. The tensorflow_embedding pipeline is now called supervised_embeddings, and spacy_sklearn is now known as pretrained_embeddings_spacy. Please update your code if you are using these.

This is a reference of the configuration options for every built-in component in Rasa NLU. If you want to build a custom component, check out Custom NLU Components.

Word Vector Sources

MitieNLP

Short:

MITIE initializer

Outputs:

nothing

Requires:

nothing

Description:

Initializes mitie structures. Every mitie component relies on this, hence this should be put at the beginning of every pipeline that uses any mitie components.

Configuration:

The MITIE library needs a language model file, which must be specified in the configuration:

pipeline:
- name: "MitieNLP"
  # language model to load
  model: "data/total_word_feature_extractor.dat"

For more information on where to get that file, head over to installing MITIE.

SpacyNLP

Short:

spacy language initializer

Outputs:

nothing

Requires:

nothing

Description:

Initializes spacy structures. Every spacy component relies on this, hence this should be put at the beginning of every pipeline that uses any spacy components.

Configuration:

Language model to use. By default, the configured language will be used. If the spacy model to be used has a name that is different from the language tag ("en", "de", etc.), the model name can be specified using this configuration variable. The name will be passed to spacy.load(name).

pipeline:
- name: "SpacyNLP"
  # language model to load
  model: "en_core_web_md"

  # when retrieving word vectors, this will decide if the casing
  # of the word is relevant. E.g. `hello` and `Hello` will
  # retrieve the same vector, if set to `false`. For some
  # applications and models it makes sense to differentiate
  # between these two words; in that case, set this to `true`.
  case_sensitive: false

Featurizers

MitieFeaturizer

Short:

MITIE intent featurizer

Outputs:

nothing, used as an input to intent classifiers that need intent features (e.g. SklearnIntentClassifier)

Requires:

MitieNLP

Description:

Creates features for intent classification using the MITIE featurizer.

Note

NOT used by the MitieIntentClassifier component. Currently, only SklearnIntentClassifier is able to use precomputed features.

Configuration:
pipeline:
- name: "MitieFeaturizer"

SpacyFeaturizer

Short:

spacy intent featurizer

Outputs:

nothing, used as an input to intent classifiers that need intent features (e.g. SklearnIntentClassifier)

Requires:

SpacyNLP

Description:

Creates features for intent classification using the spacy featurizer. Optionally adds word vectors for each token to ner_features, which can be referenced in CRFEntityExtractor.

Configuration:
pipeline:
- name: "SpacyFeaturizer"
  # Whether to add word vectors to ``ner_features`` (default: False)
  ner_feature_vectors: True

NGramFeaturizer

Short:

Appends char-ngram features to feature vector

Outputs:

nothing, appends its features to an existing feature vector generated by another intent featurizer

Requires:

SpacyNLP

Description:

This featurizer appends character ngram features to a feature vector. During training, the component looks for the most common character sequences (e.g. app or ing). The added features are boolean flags indicating whether the character sequence is present in the word sequence or not.

Note

There needs to be another intent featurizer previous to this one in the pipeline!

Configuration:
pipeline:
- name: "NGramFeaturizer"
  # Maximum number of ngrams to use when augmenting
  # feature vectors with character ngrams
  max_number_of_ngrams: 10

RegexFeaturizer

Short:

regex feature creation to support intent and entity classification

Outputs:

text_features and tokens.pattern

Requires:

nothing

Description:

During training, the regex intent featurizer creates a list of regular expressions defined in the training data format. For each regex, a feature will be set marking whether this expression was found in the input. These features are later fed into the intent classifier / entity extractor to simplify classification (assuming the classifier has learned during training that this set feature indicates a certain intent). Regex features for entity extraction are currently only supported by the CRFEntityExtractor component!
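For example, a regular expression can be defined in the Markdown training data. A minimal sketch, where zipcode is only an illustrative name and pattern:

## regex:zipcode
- [0-9]{5}

During training this yields a single feature that marks whether the zipcode pattern was found in the input.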

Note

There needs to be a tokenizer previous to this featurizer in the pipeline!

CountVectorsFeaturizer

Short:

Creates a bag-of-words representation of user message and label (intent and response) features

Outputs:

nothing, used as an input to intent classifiers that need bag-of-words representation of intent features (e.g. EmbeddingIntentClassifier)

Requires:

nothing

Description:

Creates a bag-of-words representation of user message and label features using sklearn’s CountVectorizer. All tokens which consist only of digits (e.g. 123 and 99 but not a123d) will be assigned to the same feature.

Note

If the words in the model language cannot be split by whitespace, a language-specific tokenizer is required in the pipeline before this component (e.g. using JiebaTokenizer for Chinese).

Configuration:

See sklearn’s CountVectorizer docs for detailed description of the configuration parameters.

This featurizer can be configured to use word or character n-grams, using the analyzer config parameter. By default analyzer is set to word, so word token counts are used as features. If you want to use character n-grams, set analyzer to char or char_wb.

Note

Option ‘char_wb’ creates character n-grams only from text inside word boundaries; n-grams at the edges of words are padded with space. This option can be used to create Subword Semantic Hashing.

Note

For character n-grams, do not forget to increase the min_ngram and max_ngram parameters. Otherwise the vocabulary will contain only single letters.
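A minimal sketch of a character n-gram configuration, assuming tri- to five-grams are wanted (the range is only illustrative):

pipeline:
- name: "CountVectorsFeaturizer"
  analyzer: 'char_wb'
  min_ngram: 3
  max_ngram: 5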

Handling Out-Of-Vocabulary (OOV) words:

Note

Enabled only if analyzer is word.

Since the training is performed on limited vocabulary data, it cannot be guaranteed that during prediction an algorithm will not encounter an unknown word (a word that was not seen during training). In order to teach an algorithm how to treat unknown words, some words in the training data can be substituted by the generic word OOV_token. In this case, during prediction all unknown words will be treated as this generic word OOV_token.

For example, one might create a separate intent outofscope in the training data containing messages with different numbers of OOV_token s and maybe some additional general words, as sketched below. An algorithm will then likely classify a message with unknown words as the intent outofscope.
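A minimal sketch of such training data in the Markdown format (the intent name and wording are only illustrations):

## intent:outofscope
- OOV_token OOV_token
- what is OOV_token
- can you OOV_token my OOV_token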

Note

This featurizer creates a bag-of-words representation by counting words, so the number of OOV_token s in the sentence might be important.

  • OOV_token sets the keyword for unseen words; if the training data contains OOV_token as a word in some messages, then during prediction the words that were not seen during training will be substituted with the provided OOV_token; if OOV_token=None (default behaviour), words that were not seen during training will be ignored at prediction time;
  • OOV_words sets a list of words to be treated as OOV_token during training; if a list of words that should be treated as Out-Of-Vocabulary is known, it can be set to OOV_words instead of manually changing them in the training data or using a custom preprocessor.

Note

Providing OOV_words is optional; the training data can contain OOV_token inserted manually or by a custom preprocessor. Unseen words will be substituted with OOV_token only if this token is present in the training data or a list of OOV_words is provided.

Sharing Vocabulary between user message and labels:

Note

Enabled only if use_shared_vocab is True

Build a common vocabulary set between tokens in labels and user message.

pipeline:
- name: "CountVectorsFeaturizer"
  # whether to use a shared vocab
  "use_shared_vocab": False,
  # whether to use word or character n-grams
  # 'char_wb' creates character n-grams only inside word boundaries
  # n-grams at the edges of words are padded with space.
  analyzer: 'word'  # use 'char' or 'char_wb' for character
  # the parameters are taken from
  # sklearn's CountVectorizer
  # regular expression for tokens
  token_pattern: r'(?u)\b\w\w+\b'
  # remove accents during the preprocessing step
  strip_accents: None  # {'ascii', 'unicode', None}
  # list of stop words
  stop_words: None  # string {'english'}, list, or None (default)
  # min document frequency of a word to add to vocabulary
  # float - the parameter represents a proportion of documents
  # integer - absolute counts
  min_df: 1  # float in range [0.0, 1.0] or int
  # max document frequency of a word to add to vocabulary
  # float - the parameter represents a proportion of documents
  # integer - absolute counts
  max_df: 1.0  # float in range [0.0, 1.0] or int
  # set ngram range
  min_ngram: 1  # int
  max_ngram: 1  # int
  # limit vocabulary size
  max_features: None  # int or None
  # if convert all characters to lowercase
  lowercase: true  # bool
  # handling Out-Of-Vocabulary (OOV) words
  # will be converted to lowercase if lowercase is true
  OOV_token: None  # string or None
  OOV_words: []  # list of strings

ConveRTFeaturizer

Short:

Creates a vector representation of user message and response (if specified) using the ConveRT model.

Outputs:

nothing, used as an input to intent classifiers and response selectors that need intent features and response features respectively (e.g. EmbeddingIntentClassifier and ResponseSelector)

Requires:

nothing

Description:

Creates features for intent classification and response selection. Uses the default signature to compute vector representations of input text.

Warning

Since the ConveRT model is trained only on an English corpus of conversations, this featurizer should only be used if your training data is in English.

Note

To use ConveRTFeaturizer you should install tensorflow_text==1.15.1 and tensorflow_hub==0.6.0. Alternatively, you can do a pip install of Rasa with pip install rasa[convert].

Configuration:
pipeline:
- name: "ConveRTFeaturizer"

Intent Classifiers

MitieIntentClassifier

Short:

MITIE intent classifier (using a text categorizer)

Outputs:

intent

Requires:

A tokenizer and a featurizer

Output-Example:
{
    "intent": {"name": "greet", "confidence": 0.98343}
}
Description:

This classifier uses MITIE to perform intent classification. The underlying classifier uses a multi-class linear SVM with a sparse linear kernel (see MITIE trainer code).

Configuration:
pipeline:
- name: "MitieIntentClassifier"

SklearnIntentClassifier

Short:

sklearn intent classifier

Outputs:

intent and intent_ranking

Requires:

A featurizer

Output-Example:
{
    "intent": {"name": "greet", "confidence": 0.78343},
    "intent_ranking": [
        {
            "confidence": 0.1485910906220309,
            "name": "goodbye"
        },
        {
            "confidence": 0.08161531595656784,
            "name": "restaurant_search"
        }
    ]
}
Description:

The sklearn intent classifier trains a linear SVM which gets optimized using a grid search. In addition to the predicted intent, it also provides rankings of the labels that did not “win”. The sklearn intent classifier needs to be preceded by a featurizer in the pipeline. This featurizer creates the features used for the classification.

Configuration:

During the training of the SVM a hyperparameter search is run to find the best parameter set. In the config, you can specify the parameters that will get tried.

pipeline:
- name: "SklearnIntentClassifier"
  # Specifies the list of regularization values to
  # cross-validate over for C-SVM.
  # This is used with the ``kernel`` hyperparameter in GridSearchCV.
  C: [1, 2, 5, 10, 20, 100]
  # Specifies the kernel to use with C-SVM.
  # This is used with the ``C`` hyperparameter in GridSearchCV.
  kernels: ["linear"]

EmbeddingIntentClassifier

Short:

Embedding intent classifier

Outputs:

intent and intent_ranking

Requires:

A featurizer

Output-Example:
{
    "intent": {"name": "greet", "confidence": 0.8343},
    "intent_ranking": [
        {
            "confidence": 0.385910906220309,
            "name": "goodbye"
        },
        {
            "confidence": 0.28161531595656784,
            "name": "restaurant_search"
        }
    ]
}
Description:

The embedding intent classifier embeds user inputs and intent labels into the same space. Supervised embeddings are trained by maximizing similarity between them. This algorithm is based on StarSpace. However, in this implementation the loss function is slightly different and additional hidden layers are added together with dropout. This algorithm also provides similarity rankings of the labels that did not “win”.

The embedding intent classifier needs to be preceded by a featurizer in the pipeline. This featurizer creates the features used for the embeddings. It is recommended to use CountVectorsFeaturizer that can be optionally preceded by SpacyNLP and SpacyTokenizer.
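A minimal sketch of such a pipeline, here with a whitespace tokenizer instead of the spaCy components (the exact set of components depends on your data and language):

pipeline:
- name: "WhitespaceTokenizer"
- name: "CountVectorsFeaturizer"
- name: "EmbeddingIntentClassifier"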

Note

If during prediction time a message contains only words unseen during training, and no Out-Of-Vocabulary preprocessor was used, empty intent None is predicted with confidence 0.0.

Configuration:

The algorithm also has hyperparameters to control:

  • neural network’s architecture:

    • hidden_layers_sizes_a sets a list of hidden layer sizes before the embedding layer for user inputs, the number of hidden layers is equal to the length of the list
    • hidden_layers_sizes_b sets a list of hidden layer sizes before the embedding layer for intent labels, the number of hidden layers is equal to the length of the list
    • share_hidden if set to True, shares the hidden layers between user inputs and intent labels
  • training:

    • batch_size sets the number of training examples in one forward/backward pass, the higher the batch size, the more memory space you’ll need;
    • batch_strategy sets the type of batching strategy, it should be either sequence or balanced;
    • epochs sets the number of times the algorithm will see training data, where one epoch equals one forward pass and one backward pass of all the training examples;
    • random_seed if set to any int will get reproducible training results for the same inputs;
  • embedding:

    • embed_dim sets the dimension of embedding space;
    • num_neg sets the number of incorrect intent labels, the algorithm will minimize their similarity to the user input during training;
    • similarity_type sets the type of the similarity, it should be either auto, cosine or inner, if auto, it will be set depending on loss_type, inner for softmax, cosine for margin;
    • loss_type sets the type of the loss function, it should be either softmax or margin;
    • mu_pos controls how similar the algorithm should try to make embedding vectors for correct intent labels, used only if loss_type is set to margin;
    • mu_neg controls maximum negative similarity for incorrect intents, used only if loss_type is set to margin;
    • use_max_sim_neg if true the algorithm only minimizes maximum similarity over incorrect intent labels, used only if loss_type is set to margin;
    • scale_loss if true the algorithm will downscale the loss for examples where correct label is predicted with high confidence, used only if loss_type is set to softmax;
  • regularization:

    • C2 sets the scale of L2 regularization
    • C_emb sets the scale of how important it is to minimize the maximum similarity between embeddings of different intent labels;
    • droprate sets the dropout rate, it should be between 0 and 1, e.g. droprate=0.1 would drop out 10% of input units;

Note

For cosine similarity mu_pos and mu_neg should be between -1 and 1.

Note

There is an option to use linearly increasing batch size. The idea comes from https://arxiv.org/abs/1711.00489. In order to do it pass a list to batch_size, e.g. "batch_size": [64, 256] (default behaviour). If constant batch_size is required, pass an int, e.g. "batch_size": 64.

In the config, you can specify these parameters. The default values are defined in EmbeddingIntentClassifier.defaults:

defaults = {
    # nn architecture
    # sizes of hidden layers before the embedding layer for input words
    # the number of hidden layers is thus equal to the length of this list
    "hidden_layers_sizes_a": [256, 128],
    # sizes of hidden layers before the embedding layer for intent labels
    # the number of hidden layers is thus equal to the length of this list
    "hidden_layers_sizes_b": [],
    # Whether to share the hidden layer weights between input words and labels
    "share_hidden_layers": False,
    # training parameters
    # initial and final batch sizes - batch size will be
    # linearly increased for each epoch
    "batch_size": [64, 256],
    # how to create batches
    "batch_strategy": "balanced",  # string 'sequence' or 'balanced'
    # number of epochs
    "epochs": 300,
    # set random seed to any int to get reproducible results
    "random_seed": None,
    # embedding parameters
    # dimension size of embedding vectors
    "embed_dim": 20,
    # the number of incorrect intent labels; the algorithm will minimize
    # their similarity to the user input during training
    "num_neg": 20,
    # the type of the similarity
    "similarity_type": "auto",  # string 'auto' or 'cosine' or 'inner'
    # the type of the loss function
    "loss_type": "softmax",  # string 'softmax' or 'margin'
    # how similar the algorithm should try
    # to make embedding vectors for correct labels
    "mu_pos": 0.8,  # should be 0.0 < ... < 1.0 for 'cosine'
    # maximum negative similarity for incorrect labels
    "mu_neg": -0.4,  # should be -1.0 < ... < 1.0 for 'cosine'
    # flag: if true, only minimize the maximum similarity for incorrect labels
    "use_max_sim_neg": True,
    # scale loss inverse proportionally to confidence of correct prediction
    "scale_loss": True,
    # regularization parameters
    # the scale of L2 regularization
    "C2": 0.002,
    # the scale of how critical the algorithm should be of minimizing the
    # maximum similarity between embeddings of different labels
    "C_emb": 0.8,
    # dropout rate for rnn
    "droprate": 0.2,
    # visualization of accuracy
    # how often to calculate training accuracy
    "evaluate_every_num_epochs": 20,  # small values may hurt performance
    # how many examples to use for calculation of training accuracy
    "evaluate_on_num_examples": 0,  # large values may hurt performance
}
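A minimal sketch of overriding a couple of these defaults in the pipeline configuration (the chosen values are only illustrative):

pipeline:
- name: "EmbeddingIntentClassifier"
  epochs: 500
  # use a constant batch size instead of the default linearly increasing one
  batch_size: 64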

Note

Parameter mu_neg is set to a negative value to mimic the original starspace algorithm in the case mu_neg = mu_pos and use_max_sim_neg = False. See starspace paper for details.

KeywordIntentClassifier

Short:

Simple keyword matching intent classifier, intended for small, short-term projects.

Outputs:

intent

Requires:

nothing

Output-Example:
{
    "intent": {"name": "greet", "confidence": 1.0}
}
Description:

This classifier works by searching a message for keywords. The matching is case sensitive by default and searches only for exact matches of the keyword-string in the user message. The keywords for an intent are the examples of that intent in the NLU training data. This means the entire example is the keyword, not the individual words in the example.
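For example, a minimal sketch of training data in the Markdown format (the intent name and examples are only illustrations):

## intent:greet
- hi
- hey there

Here the whole string hey there, not the individual words hey and there, acts as a keyword for the intent greet.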

Note

This classifier is intended only for small projects or for getting started. If you only have a small amount of NLU training data, you can use one of the pipelines described in Choosing a Pipeline.

Configuration:
pipeline:
- name: "KeywordIntentClassifier"
  case_sensitive: True

Selectors

Response Selector

Short:

Response Selector

Outputs:

A dictionary with the key direct_response_intent and a value containing response and ranking

Requires:

A featurizer

Output-Example:
{
    "text": "What is the recommend python version to install?",
    "entities": [],
    "intent": {"confidence": 0.6485910906220309, "name": "faq"},
    "intent_ranking": [
        {"confidence": 0.6485910906220309, "name": "faq"},
        {"confidence": 0.1416153159565678, "name": "greet"}
    ],
    "response_selector": {
      "faq": {
        "response": {"confidence": 0.7356462617, "name": "Supports 3.5, 3.6 and 3.7, recommended version is 3.6"},
        "ranking": [
            {"confidence": 0.7356462617, "name": "Supports 3.5, 3.6 and 3.7, recommended version is 3.6"},
            {"confidence": 0.2134543431, "name": "You can ask me about how to get started"}
        ]
      }
    }
}
Description:

The Response Selector component can be used to build a response retrieval model to directly predict a bot response from a set of candidate responses. The prediction of this model is used by Retrieval Actions. It embeds user inputs and response labels into the same space and follows the exact same neural network architecture and optimization as the EmbeddingIntentClassifier.

The response selector needs to be preceded by a featurizer in the pipeline. This featurizer creates the features used for the embeddings. It is recommended to use CountVectorsFeaturizer that can be optionally preceded by SpacyNLP.

Note

If during prediction time a message contains only words unseen during training, and no Out-Of-Vocabulary preprocessor was used, empty response None is predicted with confidence 0.0.

Configuration:

The algorithm includes all the hyperparameters that EmbeddingIntentClassifier uses. In addition, the component can also be configured to train a response selector for a particular retrieval intent:

  • retrieval_intent: sets the name of the intent for which this response selector model is trained. Defaults to None (see the sketch below).
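A minimal sketch of a pipeline entry that trains a selector only for the faq retrieval intent used in the output example above:

pipeline:
- name: "ResponseSelector"
  retrieval_intent: "faq"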

In the config, you can specify these parameters. The default values are defined in ResponseSelector.defaults:

defaults = {
    # nn architecture
    # sizes of hidden layers before the embedding layer for input words
    # the number of hidden layers is thus equal to the length of this list
    "hidden_layers_sizes_a": [256, 128],
    # sizes of hidden layers before the embedding layer for intent labels
    # the number of hidden layers is thus equal to the length of this list
    "hidden_layers_sizes_b": [256, 128],
    # Whether to share the hidden layer weights between input words and intent labels
    "share_hidden_layers": False,
    # training parameters
    # initial and final batch sizes - batch size will be
    # linearly increased for each epoch
    "batch_size": [64, 256],
    # how to create batches
    "batch_strategy": "balanced",  # string 'sequence' or 'balanced'
    # number of epochs
    "epochs": 300,
    # set random seed to any int to get reproducible results
    "random_seed": None,
    # embedding parameters
    # dimension size of embedding vectors
    "embed_dim": 20,
    # the number of incorrect intent labels; the algorithm will minimize
    # their similarity to the user input during training
    "num_neg": 20,
    # the type of the similarity
    "similarity_type": "auto",  # string 'auto' or 'cosine' or 'inner'
    # the type of the loss function
    "loss_type": "softmax",  # string 'softmax' or 'margin'
    # how similar the algorithm should try
    # to make embedding vectors for correct intent labels
    "mu_pos": 0.8,  # should be 0.0 < ... < 1.0 for 'cosine'
    # maximum negative similarity for incorrect intent labels
    "mu_neg": -0.4,  # should be -1.0 < ... < 1.0 for 'cosine'
    # flag: if true, only minimize the maximum similarity for
    # incorrect intent labels
    "use_max_sim_neg": True,
    # scale loss inverse proportionally to confidence of correct prediction
    "scale_loss": True,
    # regularization parameters
    # the scale of L2 regularization
    "C2": 0.002,
    # the scale of how critical the algorithm should be of minimizing the
    # maximum similarity between embeddings of different intent labels
    "C_emb": 0.8,
    # dropout rate for rnn
    "droprate": 0.2,
    # visualization of accuracy
    # how often to calculate training accuracy
    "evaluate_every_num_epochs": 20,  # small values may hurt performance
    # how many examples to use for calculation of training accuracy
    "evaluate_on_num_examples": 0,  # large values may hurt performance,
    # selector config
    # name of the intent for which this response selector is to be trained
    "retrieval_intent": None,
}

Tokenizers

WhitespaceTokenizer

Short:

Tokenizer using whitespaces as a separator

Outputs:

nothing

Requires:

nothing

Description:

Creates a token for every whitespace separated character sequence. Can be used to define tokens for the MITIE entity extractor.

Configuration:

If you want to split intents into multiple labels, e.g. for predicting multiple intents or for modeling hierarchical intent structure, use these flags:

  • tokenization of intent and response labels:
    • intent_split_symbol sets the delimiter string to split the intent and response labels; the default is whitespace. See the sketch after the configuration example below.

Make the tokenizer case insensitive by adding the case_sensitive: false option. The default is case_sensitive: true.

pipeline:
- name: "WhitespaceTokenizer"
  case_sensitive: false
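As mentioned above, intents can also be split into multiple labels. A minimal sketch, assuming + is used as the delimiter inside intent names (e.g. greet+ask_weather):

pipeline:
- name: "WhitespaceTokenizer"
  intent_split_symbol: "+"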

JiebaTokenizer

Short:

Tokenizer using Jieba for Chinese language

Outputs:

nothing

Requires:

nothing

Description:

Creates tokens using the Jieba tokenizer specifically for the Chinese language. For languages other than Chinese, Jieba works as the WhitespaceTokenizer does. Can be used to define tokens for the MITIE entity extractor. Make sure to install Jieba with pip install jieba.

Configuration:

Custom dictionary files can be loaded automatically by specifying the directory containing them via dictionary_path:

pipeline:
- name: "JiebaTokenizer"
  dictionary_path: "path/to/custom/dictionary/dir"

If the dictionary_path is None (the default), then no custom dictionary will be used.

MitieTokenizer

Short:

Tokenizer using MITIE

Outputs:

nothing

Requires:

MitieNLP

Description:

Creates tokens using the MITIE tokenizer. Can be used to define tokens for the MITIE entity extractor.

Configuration:
pipeline:
- name: "MitieTokenizer"

SpacyTokenizer

Short:

Tokenizer using spacy

Outputs:

nothing

Requires:

SpacyNLP

Description:

Creates tokens using the spacy tokenizer. Can be used to define tokens for the MITIE entity extractor.
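
Configuration:

A minimal pipeline entry, following the same pattern as the other tokenizers:

pipeline:
- name: "SpacyTokenizer"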

Entity Extractors

MitieEntityExtractor

Short:

MITIE entity extraction (using a MITIE NER trainer)

Outputs:

appends entities

Requires:

MitieNLP

Output-Example:
{
    "entities": [{"value": "New York City",
                  "start": 20,
                  "end": 33,
                  "confidence": null,
                  "entity": "city",
                  "extractor": "MitieEntityExtractor"}]
}
Description:

This uses the MITIE entity extraction to find entities in a message. The underlying classifier uses a multi-class linear SVM with a sparse linear kernel and custom features. The MITIE component does not provide entity confidence values.

Configuration:
pipeline:
- name: "MitieEntityExtractor"

SpacyEntityExtractor

Short:

spaCy entity extraction

Outputs:

appends entities

Requires:

SpacyNLP

Output-Example:
{
    "entities": [{"value": "New York City",
                  "start": 20,
                  "end": 33,
                  "entity": "city",
                  "confidence": null,
                  "extractor": "SpacyEntityExtractor"}]
}
Description:

Using spaCy, this component predicts the entities of a message. spaCy uses a statistical BILOU transition model. As of now, this component can only use the built-in spaCy entity extraction models and cannot be retrained. This extractor does not provide any confidence scores.

Configuration:

Configure which dimensions, i.e. entity types, the spacy component should extract. A full list of available dimensions can be found in the spaCy documentation. Leaving the dimensions option unspecified will extract all available dimensions.

pipeline:
- name: "SpacyEntityExtractor"
  # dimensions to extract
  dimensions: ["PERSON", "LOC", "ORG", "PRODUCT"]

EntitySynonymMapper

Short:

Maps synonymous entity values to the same value.

Outputs:

modifies existing entities that previous entity extraction components found

Requires:

nothing

Description:

If the training data contains defined synonyms (by using the value attribute on the entity examples), this component will make sure that detected entity values are mapped to the same value. For example, if your training data contains the following examples:

[{
  "text": "I moved to New York City",
  "intent": "inform_relocation",
  "entities": [{"value": "nyc",
                "start": 11,
                "end": 24,
                "entity": "city",
               }]
},
{
  "text": "I got a new flat in NYC.",
  "intent": "inform_relocation",
  "entities": [{"value": "nyc",
                "start": 20,
                "end": 23,
                "entity": "city",
               }]
}]

This component will allow you to map the entities New York City and NYC to nyc. The entity extraction will return nyc even though the message contains NYC. When this component changes an existing entity, it appends itself to the processor list of this entity.

CRFEntityExtractor

Short:

conditional random field entity extraction

Outputs:

appends entities

Requires:

A tokenizer

Output-Example:
{
    "entities": [{"value":"New York City",
                  "start": 20,
                  "end": 33,
                  "entity": "city",
                  "confidence": 0.874,
                  "extractor": "CRFEntityExtractor"}]
}
Description:

This component implements conditional random fields to do named entity recognition. CRFs can be thought of as an undirected Markov chain where the time steps are words and the states are entity classes. Features of the words (capitalisation, POS tagging, etc.) give probabilities to certain entity classes, as do transitions between neighbouring entity tags: the most likely set of tags is then calculated and returned. If POS features are used (pos or pos2), spaCy has to be installed. To use custom features made available by Featurizers, use "ner_features".

Configuration:
pipeline:
- name: "CRFEntityExtractor"
  # The features are a ``[before, word, after]`` array with
  # before, word, after holding keys about which
  # features to use for each word, for example, ``"title"``
  # in array before will have the feature
  # "is the preceding word in title case?".
  # Available features are:
  # ``low``, ``title``, ``suffix5``, ``suffix3``, ``suffix2``,
  # ``suffix1``, ``pos``, ``pos2``, ``prefix5``, ``prefix2``,
  # ``bias``, ``upper``, ``digit``, ``pattern``, and ``ner_features``
  features: [["low", "title"], ["bias", "suffix3"], ["upper", "pos", "pos2"]]

  # The flag determines whether to use BILOU tagging or not. BILOU
  # tagging is more rigorous, however it
  # requires more examples per entity. Rule of thumb: use only
  # if more than 100 examples per entity.
  BILOU_flag: true

  # This is the value given to sklearn_crfsuite.CRF tagger before training.
  max_iterations: 50

  # This is the value given to sklearn_crfsuite.CRF tagger before training.
  # Specifies the L1 regularization coefficient.
  L1_c: 0.1

  # This is the value given to sklearn_crfsuite.CRF tagger before training.
  # Specifies the L2 regularization coefficient.
  L2_c: 0.1
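
A minimal sketch of passing spaCy word vectors to this extractor via ner_features; the SpacyFeaturizer adds the vectors and the CRF consumes them as a feature of the current word (the remaining feature keys are only illustrative):

pipeline:
- name: "SpacyNLP"
- name: "SpacyTokenizer"
- name: "SpacyFeaturizer"
  # add word vectors to ``ner_features``
  ner_feature_vectors: True
- name: "CRFEntityExtractor"
  features: [["low", "title"], ["bias", "ner_features"], ["upper", "pos"]]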

DucklingHTTPExtractor

Short:

Duckling lets you extract common entities like dates, amounts of money, distances, and others in a number of languages.

Outputs:

appends entities

Requires:

nothing

Output-Example:
{
    "entities": [{"end": 53,
                  "entity": "time",
                  "start": 48,
                  "value": "2017-04-10T00:00:00.000+02:00",
                  "confidence": 1.0,
                  "extractor": "DucklingHTTPExtractor"}]
}
Description:

To use this component you need to run a duckling server. The easiest option is to spin up a docker container using docker run -p 8000:8000 rasa/duckling.

Alternatively, you can install duckling directly on your machine and start the server.

Duckling allows you to recognize dates, numbers, distances and other structured entities and normalizes them. Please be aware that duckling tries to extract as many entity types as possible without providing a ranking. For example, if you specify both number and time as dimensions for the duckling component, the component will extract two entities: 10 as a number and in 10 minutes as a time from the text I will be there in 10 minutes. In such a situation, your application would have to decide which entity type is the correct one. The extractor will always return 1.0 as a confidence, as it is a rule-based system.

Configuration:

Configure which dimensions, i.e. entity types, the duckling component should extract. A full list of available dimensions can be found in the duckling documentation. Leaving the dimensions option unspecified will extract all available dimensions.

pipeline:
- name: "DucklingHTTPExtractor"
  # url of the running duckling server
  url: "http://localhost:8000"
  # dimensions to extract
  dimensions: ["time", "number", "amount-of-money", "distance"]
  # allows you to configure the locale, by default the language is
  # used
  locale: "de_DE"
  # if not set the default timezone of Duckling is going to be used
  # needed to calculate dates from relative expressions like "tomorrow"
  timezone: "Europe/Berlin"
  # Timeout for receiving response from http url of the running duckling server
  # if not set the default timeout of duckling http url is set to 3 seconds.
  timeout: 3