rasa.nlu.utils.hugging_face.transformers_pre_post_processors
cleanup_tokens
Utility method to apply delimiter-based cleanup to a list of tokens.
Arguments:
token_ids_string
- List of tuples, each containing (token id, token string).
delimiter
- Character/string to be cleaned from the token strings.
Returns:
Token ids and token strings unpacked.
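For illustration, a minimal sketch of the described behavior (the actual implementation in the module may differ in details): strip the delimiter from every token string, drop strings that become empty, and unzip the surviving pairs.

```python
from typing import List, Text, Tuple

def cleanup_tokens_sketch(
    token_ids_string: List[Tuple[int, Text]], delimiter: Text
) -> Tuple[Tuple[int, ...], Tuple[Text, ...]]:
    # Strip the delimiter from every token string.
    cleaned = [(i, s.replace(delimiter, "")) for i, s in token_ids_string]
    # Drop tokens whose string is empty after cleanup.
    cleaned = [(i, s) for i, s in cleaned if s]
    # Unpack the pairs into parallel sequences of ids and strings.
    token_ids, token_strings = zip(*cleaned)
    return token_ids, token_strings

# cleanup_tokens_sketch([(9, "hel"), (10, "##lo")], "##")
# -> ((9, 10), ("hel", "lo"))
```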
bert_tokens_pre_processor
Add BERT-style special tokens (CLS and SEP).
Arguments:
token_ids
- List of token ids without any special tokens.
Returns:
List of token ids augmented with special tokens.
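For illustration, a hedged sketch assuming the id conventions of bert-base-uncased ([CLS] = 101, [SEP] = 102); the real ids depend on the loaded vocabulary:

```python
from typing import List

def bert_tokens_pre_processor_sketch(token_ids: List[int]) -> List[int]:
    CLS_ID = 101  # [CLS] in bert-base-uncased (assumption; vocabulary-dependent)
    SEP_ID = 102  # [SEP] in bert-base-uncased (assumption; vocabulary-dependent)
    # Prepend CLS and append SEP around the plain token ids.
    return [CLS_ID] + token_ids + [SEP_ID]

# bert_tokens_pre_processor_sketch([11, 12]) -> [101, 11, 12, 102]
```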
gpt_tokens_pre_processor
Add GPT-style special tokens (GPT adds none, so the ids are returned unchanged).
Arguments:
token_ids
- List of token ids without any special tokens.
Returns:
List of token ids augmented with special tokens.
xlnet_tokens_pre_processor
Add XLNet-style special tokens.
Arguments:
token_ids
- List of token ids without any special tokens.
Returns:
List of token ids augmented with special tokens.
roberta_tokens_pre_processor
Add RoBERTa-style special tokens.
Arguments:
token_ids
- List of token ids without any special tokens.
Returns:
List of token ids augmented with special tokens.
xlm_tokens_pre_processor
Add XLM-style special tokens.
Arguments:
token_ids
- List of token ids without any special tokens.
Returns:
List of token ids augmented with special tokens.
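The pre-processors above follow the same pattern and differ only in which ids they add and where. A hedged sketch of the placements implied by the post-processors below: GPT adds nothing, XLNet appends its special tokens at the end (its post-processor strips the last two time steps), and RoBERTa and XLM wrap the sequence (their post-processors strip the first and last time steps). All ids shown are placeholders, not real vocabulary entries.

```python
from typing import List

def gpt_tokens_pre_processor_sketch(token_ids: List[int]) -> List[int]:
    # GPT adds no special tokens, so the ids pass through unchanged.
    return token_ids

def xlnet_tokens_pre_processor_sketch(token_ids: List[int]) -> List[int]:
    SEP_ID, CLS_ID = 4, 3  # placeholder ids (assumption)
    # XLNet places its special tokens at the END of the sequence.
    return token_ids + [SEP_ID, CLS_ID]

def roberta_tokens_pre_processor_sketch(token_ids: List[int]) -> List[int]:
    BOS_ID, EOS_ID = 0, 2  # placeholder ids (assumption)
    # RoBERTa wraps the sequence: <s> ... </s>.
    return [BOS_ID] + token_ids + [EOS_ID]

def xlm_tokens_pre_processor_sketch(token_ids: List[int]) -> List[int]:
    BOS_ID, SEP_ID = 0, 1  # placeholder ids (assumption)
    # XLM likewise wraps the sequence with one special token on each side.
    return [BOS_ID] + token_ids + [SEP_ID]
```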
bert_embeddings_post_processor
Post-process embeddings from BERT by removing the CLS and SEP embeddings and returning the CLS token embedding as the sentence representation.
Arguments:
sequence_embeddings
- Sequence of token level embeddings received as output from BERT.
Returns:
sentence level embedding and post-processed sequence level embedding.
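A numpy sketch of the described operation, assuming sequence_embeddings has shape (sequence_length, embedding_dim) with CLS first and SEP last:

```python
from typing import Tuple
import numpy as np

def bert_embeddings_post_processor_sketch(
    sequence_embeddings: np.ndarray,
) -> Tuple[np.ndarray, np.ndarray]:
    # The CLS embedding (first time step) serves as the sentence representation.
    sentence_embedding = sequence_embeddings[0]
    # Drop the CLS and SEP time steps from the sequence output.
    post_processed_embeddings = sequence_embeddings[1:-1]
    return sentence_embedding, post_processed_embeddings
```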
gpt_embeddings_post_processor
Post-process embeddings from GPT models by taking the mean over the sequence embeddings and returning that as the sentence representation.
Arguments:
sequence_embeddings
- Sequence of token level embeddings received as output from GPT.
Returns:
sentence level embedding and post-processed sequence level embedding.
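A numpy sketch of the described operation; since GPT adds no special tokens, the sequence is returned unchanged:

```python
from typing import Tuple
import numpy as np

def gpt_embeddings_post_processor_sketch(
    sequence_embeddings: np.ndarray,
) -> Tuple[np.ndarray, np.ndarray]:
    # The mean over all time steps serves as the sentence representation.
    sentence_embedding = np.mean(sequence_embeddings, axis=0)
    return sentence_embedding, sequence_embeddings
```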
xlnet_embeddings_post_processor
Post-process embeddings from XLNet models by taking the mean over the sequence embeddings and returning that as the sentence representation. The last two time steps, which correspond to the special tokens, are removed from the sequence embeddings.
Arguments:
sequence_embeddings
- Sequence of token level embeddings received as output from XLNet.
Returns:
sentence level embedding and post-processed sequence level embedding.
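A numpy sketch of the described operation, assuming the mean is taken over the trimmed sequence:

```python
from typing import Tuple
import numpy as np

def xlnet_embeddings_post_processor_sketch(
    sequence_embeddings: np.ndarray,
) -> Tuple[np.ndarray, np.ndarray]:
    # XLNet's special tokens sit at the end of the sequence: drop the last two.
    post_processed_embeddings = sequence_embeddings[:-2]
    sentence_embedding = np.mean(post_processed_embeddings, axis=0)
    return sentence_embedding, post_processed_embeddings
```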
roberta_embeddings_post_processor
Post-process embeddings from RoBERTa models by taking the mean over the sequence embeddings and returning that as the sentence representation. The first and last time steps, which correspond to the special tokens, are removed from the sequence embeddings.
Arguments:
sequence_embeddings
- Sequence of token level embeddings received as output from RoBERTa.
Returns:
sentence level embedding and post-processed sequence level embedding.
xlm_embeddings_post_processor
Post-process embeddings from XLM models by taking the mean over the sequence embeddings and returning that as the sentence representation. The first and last time steps, which correspond to the special tokens, are removed from the sequence embeddings.
Arguments:
sequence_embeddings
- Sequence of token level embeddings received as output from XLM.
Returns:
sentence level embedding and post-processed sequence level embedding.
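RoBERTa and XLM share the same post-processing pattern; a numpy sketch covering both, again assuming the mean is taken over the trimmed sequence:

```python
from typing import Tuple
import numpy as np

def strip_ends_and_mean_sketch(
    sequence_embeddings: np.ndarray,
) -> Tuple[np.ndarray, np.ndarray]:
    # Both models wrap the sequence in special tokens: drop first and last steps.
    post_processed_embeddings = sequence_embeddings[1:-1]
    sentence_embedding = np.mean(post_processed_embeddings, axis=0)
    return sentence_embedding, post_processed_embeddings
```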
bert_tokens_cleaner
Token cleanup method for BERT.
Clean up the extra delimiters (##) that BERT adds while breaking a token into sub-tokens.
Arguments:
token_ids
- List of token ids received as output from the BERT Tokenizer.
token_strings
- List of token strings received as output from the BERT Tokenizer.
Returns:
Cleaned token ids and token strings.
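A hedged usage sketch; the ids are placeholders and the sub-token strings are illustrative:

```python
from rasa.nlu.utils.hugging_face.transformers_pre_post_processors import (
    bert_tokens_cleaner,
)

# WordPiece splits "playing" into ["play", "##ing"]; the cleaner strips "##".
ids, strings = bert_tokens_cleaner([11, 12], ["play", "##ing"])
# Expected under the described behavior (placeholder ids):
#   ids     -> (11, 12)
#   strings -> ("play", "ing")
```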
openaigpt_tokens_cleaner
Token cleanup method for GPT.
Clean up the extra delimiters (</w>) that OpenAI GPT adds while breaking a token into sub-tokens.
Arguments:
token_ids
- List of token ids received as output from the GPT Tokenizer.
token_strings
- List of token strings received as output from the GPT Tokenizer.
Returns:
Cleaned token ids and token strings.
gpt2_tokens_cleaner
Token cleanup method for GPT2.
Clean up the extra delimiters (Ġ) that GPT2 adds while breaking a token into sub-tokens.
Arguments:
token_ids
- List of token ids received as output from the GPT2 Tokenizer.
token_strings
- List of token strings received as output from the GPT2 Tokenizer.
Returns:
Cleaned token ids and token strings.
xlnet_tokens_cleaner
Token cleanup method for XLNet.
Clean up the extra delimiters (▁) that XLNet adds while breaking a token into sub-tokens.
Arguments:
token_ids
- List of token ids received as output from the XLNet Tokenizer.
token_strings
- List of token strings received as output from the XLNet Tokenizer.
Returns:
Cleaned token ids and token strings.
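All four cleaners apply the same delimiter-based cleanup and differ only in the delimiter they pass. A hedged sketch that delegates to the module's cleanup_tokens, whose documented signature takes (token id, token string) pairs and a delimiter:

```python
from typing import List, Text
from rasa.nlu.utils.hugging_face.transformers_pre_post_processors import (
    cleanup_tokens,
)

def tokens_cleaner_sketch(
    token_ids: List[int], token_strings: List[Text], delimiter: Text
):
    # Pair ids with strings, then delegate to the delimiter-based cleanup.
    return cleanup_tokens(list(zip(token_ids, token_strings)), delimiter)

# Delimiters used by each cleaner, per the descriptions above:
#   bert_tokens_cleaner      -> "##"
#   openaigpt_tokens_cleaner -> "</w>"
#   gpt2_tokens_cleaner      -> "Ġ"
#   xlnet_tokens_cleaner     -> "▁"
```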