rasa.nlu.tokenizers.convert_tokenizer
ConveRTTokenizer Objects
Tokenizer using the ConveRT model.
Loads the ConveRT (https://github.com/PolyAI-LDN/polyai-models#convert) model from TFHub and computes sub-word tokens for dense featurizable attributes of each message object.
__init__
Construct a new tokenizer using the WhitespaceTokenizer framework.
Arguments:
component_config - user configuration for the component.
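For illustration, a minimal construction sketch follows, assuming the standard Rasa component API where the user configuration is passed as a `component_config` dict. The keys shown are common tokenizer options and are illustrative, not a complete reference for this version.

```python
# A minimal sketch, assuming the standard Rasa component constructor.
# Note: constructing the tokenizer loads the ConveRT model from TFHub.
from rasa.nlu.tokenizers.convert_tokenizer import ConveRTTokenizer

tokenizer = ConveRTTokenizer(
    component_config={
        "intent_tokenization_flag": False,  # whether to split intent labels into tokens
        "intent_split_symbol": "_",         # delimiter used when splitting intent labels
    }
)
```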
cache_key
Cache the component for future use.
Arguments:
component_meta - configuration for the component.
model_metadata - configuration for the whole pipeline.

Returns:

- key of the cache for future retrievals.
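The docstring does not specify how the key is computed. As a purely hypothetical sketch, a cache key could fingerprint the component configuration so that identical configurations reuse the same cached component; `example_cache_key` below is an illustrative stand-in, not the actual implementation.

```python
# Hypothetical sketch only; the real cache_key implementation may differ.
import hashlib
import json
from typing import Any, Dict, Optional, Text


def example_cache_key(
    component_meta: Dict[Text, Any], model_metadata: Dict[Text, Any]
) -> Optional[Text]:
    # Fingerprint the component configuration so that two identical
    # configurations map to the same cached component instance.
    config_fingerprint = hashlib.md5(
        json.dumps(component_meta, sort_keys=True).encode()
    ).hexdigest()
    return f"ConveRTTokenizer-{config_fingerprint}"
```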
tokenize
Tokenize the text using the ConveRT model. ConveRT adds a special character in front of (some) words and splits words into sub-words. To ensure the entity start and end values match the token values, the text is first tokenized using the whitespace tokenizer. If individual tokens are split up into multiple sub-word tokens, this information is added to the respective tokens.
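A minimal sketch of this alignment strategy is shown below. `Token` and `convert_subwords` are illustrative stand-ins (the real sub-word splitting comes from the ConveRT model); the point is that character offsets come from the whitespace tokens, so entity annotations stay aligned, while the sub-word pieces are attached to each enclosing token.

```python
# Sketch of whitespace-first tokenization with sub-word alignment.
from dataclasses import dataclass, field
from typing import List


@dataclass
class Token:
    text: str
    start: int
    end: int
    subwords: List[str] = field(default_factory=list)


def convert_subwords(word: str) -> List[str]:
    # Stand-in for the ConveRT sub-word splitter; splits words
    # crudely into 4-character pieces for illustration only.
    return [word[i : i + 4] for i in range(0, len(word), 4)]


def tokenize(text: str) -> List[Token]:
    tokens = []
    offset = 0
    for word in text.split():
        start = text.index(word, offset)
        end = start + len(word)
        offset = end
        # Keep whitespace-level offsets so entity start/end values still
        # line up, and attach the sub-word pieces to the enclosing token.
        tokens.append(Token(word, start, end, convert_subwords(word)))
    return tokens


print(tokenize("ConveRT tokenization example"))
```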