Warning: This document is for the development version of Rasa. The latest version is 1.9.4.

Entity Extraction

Entity extraction involves parsing user messages for required pieces of information. Rasa Open Source provides entity extractors for custom entities as well as pre-trained ones like dates and locations. Here is a summary of the available extractors and what they are used for:

Component              Requires           Model                                             Notes
---------------------  -----------------  ------------------------------------------------  ---------------------------------
CRFEntityExtractor     sklearn-crfsuite   conditional random field                          good for training custom entities
SpacyEntityExtractor   spaCy              averaged perceptron                               provides pre-trained entities
DucklingHTTPExtractor  running duckling   context-free grammar                              provides pre-trained entities
MitieEntityExtractor   MITIE              structured SVM                                    good for training custom entities
EntitySynonymMapper    existing entities  N/A                                               maps known synonyms
DIETClassifier                            conditional random field on top of a transformer  good for training custom entities

The “entity” Object

After parsing, an entity is returned as a dictionary. Two fields show how the pipeline affected the returned entities: the extractor field tells you which entity extractor found this particular entity, and the processors field contains the names of the components that altered this specific entity.

The use of synonyms can cause the value field not to match the text exactly. Instead, it will contain the trained synonym.

{
  "text": "show me chinese restaurants",
  "intent": "restaurant_search",
  "entities": [
    {
      "start": 8,
      "end": 15,
      "value": "chinese",
      "entity": "cuisine",
      "extractor": "CRFEntityExtractor",
      "confidence": 0.854,
      "processors": []
    }
  ]
}
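
As a rough illustration of how these fields fit together, the sketch below loads the response shown above with Python's standard json module and compares each entity's value with the span of text given by start and end. The helper name describe_entities is hypothetical and not part of Rasa.

```python
import json

# The parsed result from the example above, as it might arrive as JSON.
RESPONSE = """
{
  "text": "show me chinese restaurants",
  "intent": "restaurant_search",
  "entities": [
    {
      "start": 8,
      "end": 15,
      "value": "chinese",
      "entity": "cuisine",
      "extractor": "CRFEntityExtractor",
      "confidence": 0.854,
      "processors": []
    }
  ]
}
"""


def describe_entities(parsed):
    """Return one summary dict per entity (hypothetical helper, not a Rasa API)."""
    summaries = []
    for entity in parsed.get("entities", []):
        # start/end are character offsets into the original message text.
        surface = parsed["text"][entity["start"]:entity["end"]]
        summaries.append({
            "entity": entity["entity"],
            "surface": surface,
            "value": entity["value"],
            # False when the value was changed, e.g. by a synonym mapping.
            "value_matches_text": surface == entity["value"],
            "extractor": entity.get("extractor"),
            "processors": entity.get("processors", []),
        })
    return summaries


summaries = describe_entities(json.loads(RESPONSE))
print(summaries)
```

When value_matches_text is false, the processors list is a reasonable first place to look for the component that changed the value.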

Note

The confidence is set by the CRFEntityExtractor component. The DucklingHTTPExtractor always returns 1. The SpacyEntityExtractor and the DIETClassifier do not provide this information and return null.
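
Because confidence may be a number or null depending on the extractor, code that consumes the entities probably needs to handle both cases. A minimal sketch, assuming entity dictionaries shaped like the one above: the function name, the 0.5 threshold, the sample entities, and the choice to keep entities without a confidence are all illustrative assumptions.

```python
def filter_by_confidence(entities, threshold=0.5, keep_unscored=True):
    """Keep entities whose confidence meets the threshold.

    Hypothetical helper: entities whose confidence is missing or None
    (JSON null) are kept or dropped according to keep_unscored.
    """
    kept = []
    for entity in entities:
        confidence = entity.get("confidence")
        if confidence is None:
            if keep_unscored:
                kept.append(entity)
        elif confidence >= threshold:
            kept.append(entity)
    return kept


# Made-up sample entities, one per confidence behaviour described above.
entities = [
    {"entity": "cuisine", "extractor": "CRFEntityExtractor", "confidence": 0.854},
    {"entity": "time", "extractor": "DucklingHTTPExtractor", "confidence": 1.0},
    {"entity": "GPE", "extractor": "SpacyEntityExtractor", "confidence": None},
    {"entity": "cuisine", "extractor": "CRFEntityExtractor", "confidence": 0.31},
]

kept = filter_by_confidence(entities)
print([e["extractor"] for e in kept])
```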

Some extractors, like duckling, may include additional information. For example:

{
  "additional_info":{
    "grain":"day",
    "type":"value",
    "value":"2018-06-21T00:00:00.000-07:00",
    "values":[
      {
        "grain":"day",
        "type":"value",
        "value":"2018-06-21T00:00:00.000-07:00"
      }
    ]
  },
  "confidence":1.0,
  "end":5,
  "entity":"time",
  "extractor":"DucklingHTTPExtractor",
  "start":0,
  "text":"today",
  "value":"2018-06-21T00:00:00.000-07:00"
}
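
The value of such a time entity is an ISO 8601 string, so it can be turned into a Python datetime with the standard library. The sketch below is one way to do that, assuming Python 3.7 or later; to_datetime is a hypothetical helper, and the dictionary is the example above trimmed to the fields used here.

```python
from datetime import datetime

# The Duckling entity from the example above, trimmed to the fields used here.
entity = {
    "additional_info": {
        "grain": "day",
        "type": "value",
        "value": "2018-06-21T00:00:00.000-07:00",
    },
    "entity": "time",
    "extractor": "DucklingHTTPExtractor",
    "text": "today",
    "value": "2018-06-21T00:00:00.000-07:00",
}


def to_datetime(time_entity):
    """Parse the ISO 8601 value of a time entity (hypothetical helper)."""
    # fromisoformat accepts the millisecond part and the UTC offset
    # of this particular string on Python 3.7 and later.
    return datetime.fromisoformat(time_entity["value"])


when = to_datetime(entity)
# additional_info is not present for every extractor, so read it defensively.
grain = entity.get("additional_info", {}).get("grain")
print(when.date(), when.utcoffset(), grain)
```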

Custom Entities

Almost every chatbot and voice app will have some custom entities. A restaurant assistant should understand chinese as a cuisine, but to a language-learning assistant it would mean something very different. The CRFEntityExtractor component can learn custom entities in any language, given some training data. See Training Data Format for details on how to include entities in your training data.

Extracting Places, Dates, People, Organisations

spaCy has excellent pre-trained named-entity recognisers for a few different languages. You can test them out in this interactive demo. We don’t recommend that you try to train your own NER using spaCy, unless you have a lot of data and know what you are doing. Note that some spaCy models are highly case-sensitive.

Dates, Amounts of Money, Durations, Distances, Ordinals

The duckling library does a great job of turning expressions like “next Thursday at 8pm” into actual datetime objects that you can use, e.g.

"next Thursday at 8pm"
=> {"value":"2018-05-31T20:00:00.000+01:00"}

The list of supported languages can be found here. Duckling can also handle durations like “two hours”, amounts of money, distances, and ordinals. A ready-to-use duckling docker container is available; you just need to spin it up and connect it to Rasa NLU (see DucklingHTTPExtractor).
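
Inside a Rasa pipeline the DucklingHTTPExtractor talks to the server for you. To try a running container directly, a sketch like the following may help. It assumes the server listens on localhost port 8000 and accepts a form-encoded POST with text and locale fields on a /parse path, as Duckling's example server does; check these against your own setup. Both function names are hypothetical, and only the request builder runs here, so no server is needed to execute the block.

```python
import json
import urllib.parse
import urllib.request

# Assumption: a Duckling server is reachable at this address, e.g. one
# started from the docker container mentioned above.
DUCKLING_URL = "http://localhost:8000/parse"


def build_duckling_request(text, locale="en_GB", url=DUCKLING_URL):
    """Build a form-encoded POST request for the parse endpoint."""
    body = urllib.parse.urlencode({"text": text, "locale": locale}).encode("utf-8")
    return urllib.request.Request(url, data=body, method="POST")


def parse_with_duckling(text, locale="en_GB", url=DUCKLING_URL):
    """Send the text to the server and return the decoded JSON reply.

    Needs a running server, so it is defined here but not called.
    """
    request = build_duckling_request(text, locale, url)
    with urllib.request.urlopen(request, timeout=5) as response:
        return json.loads(response.read().decode("utf-8"))


request = build_duckling_request("next Thursday at 8pm")
print(request.get_method(), request.full_url, request.data)
```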

Regular Expressions (regex)

You can use regular expressions to help the CRF model learn to recognize entities. In your training data (see Training Data Format) you can provide a list of regular expressions, each of which provides the CRFEntityExtractor with an extra binary feature, which says if the regex was found (1) or not (0).

For example, the names of German streets often end in strasse. By adding this as a regex, we tell the model to pay attention to words ending this way, and it will quickly learn to associate that with a location entity.
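
To make the idea of a binary regex feature concrete, here is a small sketch that computes, for every token, a 1 if a pattern is found and a 0 otherwise. It only illustrates the concept; Rasa's own feature computation may differ, and the regex_features function, the street_suffix name, and the sample sentence are made up for this example.

```python
import re

# Illustrative pattern list; the name "street_suffix" is made up here.
REGEXES = [
    {"name": "street_suffix", "pattern": r"strasse$"},
]


def regex_features(tokens, regexes=REGEXES):
    """Return, per token, a dict mapping regex name to 1 (found) or 0 (not found)."""
    features = []
    for token in tokens:
        token_features = {}
        for regex in regexes:
            found = re.search(regex["pattern"], token, flags=re.IGNORECASE)
            token_features[regex["name"]] = 1 if found else 0
        features.append(token_features)
    return features


tokens = "ich wohne in der Invalidenstrasse".split()
features = regex_features(tokens)
print(list(zip(tokens, features)))
```

The feature is a hint rather than a rule: the model still decides from all of its features whether a token belongs to an entity.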

If you just want to match regular expressions exactly, you can do this in your code, as a postprocessing step after receiving the response from Rasa NLU.
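
Such a postprocessing step could look roughly like the sketch below, which appends one entity per exact regex match to a copy of the parsed response. The five-digit postal code pattern, the zipcode entity name, the regex_postprocessing marker, and the add_regex_entities function are all hypothetical choices for illustration.

```python
import re

# Hypothetical pattern: a five-digit postal code.
ZIPCODE = re.compile(r"\b\d{5}\b")


def add_regex_entities(parsed, pattern=ZIPCODE, entity_name="zipcode"):
    """Append one entity per regex match to a copy of the parsed result."""
    entities = list(parsed.get("entities", []))
    for match in pattern.finditer(parsed["text"]):
        entities.append({
            "start": match.start(),
            "end": match.end(),
            "value": match.group(0),
            "entity": entity_name,
            # Made-up marker so these can be told apart from extractor output.
            "extractor": "regex_postprocessing",
        })
    return {**parsed, "entities": entities}


# Made-up response with no entities found by the pipeline.
parsed = {"text": "deliver to 10115 please", "intent": "order", "entities": []}
result = add_regex_entities(parsed)
print(result["entities"])
```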

Passing Custom Features to CRFEntityExtractor

If you want to pass custom features, such as pre-trained word embeddings, to CRFEntityExtractor, you can add any dense featurizer to the pipeline before the CRFEntityExtractor. CRFEntityExtractor automatically finds the additional dense features and checks whether they are an iterable of len(tokens), where each entry is a vector. If the check fails, a warning is shown, but CRFEntityExtractor continues to train, just without the additional custom features. If dense features are present, CRFEntityExtractor passes them to sklearn_crfsuite and uses them for training.
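
The check described above can be pictured with the following sketch. It is not Rasa's implementation: the function names are hypothetical, and plain lists of numbers stand in for the arrays a real dense featurizer would produce. It returns the features when there is exactly one vector per token, and otherwise warns and returns None so that training could proceed without them.

```python
import warnings
from collections.abc import Sequence


def is_vector(value):
    """True for a non-string sequence of numbers (stand-in for a dense vector)."""
    return (
        isinstance(value, Sequence)
        and not isinstance(value, (str, bytes))
        and all(isinstance(x, (int, float)) for x in value)
    )


def usable_dense_features(tokens, dense_features):
    """Return the features as a list if there is one vector per token, else None."""
    if dense_features is None:
        return None
    try:
        vectors = list(dense_features)
    except TypeError:
        vectors = None  # not iterable at all
    ok = (
        vectors is not None
        and len(vectors) == len(tokens)
        and all(is_vector(v) for v in vectors)
    )
    if not ok:
        warnings.warn("Dense features do not match the tokens; ignoring them.")
        return None
    return vectors


tokens = ["show", "me", "chinese", "restaurants"]
good = usable_dense_features(tokens, [[0.1, 0.2], [0.0, 0.3], [0.5, 0.1], [0.2, 0.2]])

with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    bad = usable_dense_features(tokens, [[0.1, 0.2]])  # one vector for four tokens

print(good is not None, bad, len(caught))
```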