Rasa Pro License
You'll need a license to get started with Rasa Pro.
New in 3.6
You can now anonymize personally identifiable information (PII) in your logs and in events streamed via the Kafka event broker. Other event broker types are not supported. Continue reading to learn how to enable the anonymization of PII data.
Management of sensitive customer data collected by assistants is a critical requirement for complying with regulations and managing the data securely while making it available for analysis. Rasa Pro 3.6 introduces the ability to anonymize PII data in your logs and events streamed via the Kafka event broker.
Note that the Tracker Store continues to store conversation data which is not anonymized. This is required so that we retain a source of truth of pristine data for non-repudiation purposes. Learn how to secure your Tracker Store using Vault secrets manager to protect the source of truth with rotating credentials.
In addition, because the tracker store data plays a vital role in the assistant's dialogue management at inference time, anonymizing data in flight could have undesirable consequences for the assistant's dialogue management action predictions.
The anonymization of the Rasa event is run through an anonymization pipeline at the end of the dialogue management action prediction and execution. Note that the processing done by the anonymization pipeline is scheduled as a background task and does not affect the assistant's response time.
The anonymization steps are as follows:
- The Rasa agent invokes the anonymization pipeline each time it handles a user message.
- The anonymization pipeline runs through a sequence of anonymization rules that are applied to the new events.
- The pipeline publishes the anonymized events to the Kafka topic that is mapped to the anonymization rule list in endpoints.yml.
- The tracker store saves the original events that are not anonymized.
Supported Rasa Events
The Rasa events that are anonymized include the following:
Supported PII entity types
The anonymization pipeline uses Microsoft Presidio as both entity recognizer and anonymizer. Presidio is an open-source library that supports a wide range of entity types and anonymization methods.
You can specify any of the out-of-the-box supported Presidio entity types in the anonymization rules. Note that it is currently not possible to add custom entity types for the Rasa Pro anonymization pipeline.
How to write anonymization rules
You can now write anonymization rules in endpoints.yml to explicitly declare which Presidio entities should be anonymized.
The anonymization pipeline is configurable via the anonymization section in the endpoints.yml file. This section must have the structure given in the following example:
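As a rough sketch of what this can look like, the snippet below combines the metadata and rule_lists fields described in the following sections. The rule list id rules_1, the language code, and the replacement text are placeholder assumptions chosen for illustration, not values that Rasa Pro requires.

```yaml
# endpoints.yml (illustrative sketch; ids and values are placeholders)
anonymization:
  metadata:
    language: en                  # one language per anonymization pipeline
    model_name: en_core_web_lg    # must be installed in the Rasa Pro environment
    model_provider: spacy
  rule_lists:
    - id: rules_1                 # unique string id, referenced by topics and the logger
      rules:
        - entity: PERSON          # Presidio entity type, in uppercase
          substitution: text
          value: "[name removed]" # placeholder replacement text
```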
How to populate the metadata section
The metadata section contains the following fields:
- The language field specifies the language code of the text to be anonymized. You can specify only one language per anonymization pipeline, so the pipeline cannot currently handle end users switching languages.
- The model_name and model_provider fields specify the name and provider of the Presidio model to be used for anonymization. The available model providers are spacy, stanza, and transformers.
If you want to use spacy, we strongly recommend using one of the available large models, such as en_core_web_lg, to ensure more accurate entity recognition. When using large models, you must ensure that your Rasa Pro environment has enough memory to load the model.
If you opt for the transformers model provider, you must specify two model names in the model_name field: the HuggingFace model name and the spaCy model name. The spaCy model wraps the transformers NER model.
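A metadata section for the transformers provider might look like the sketch below. The transformers key is the one used in the installation section of this page. The spacy key name and the en_core_web_sm model are assumptions made for illustration, so check them against your Rasa Pro version before relying on them.

```yaml
# endpoints.yml (illustrative sketch for the transformers provider)
anonymization:
  metadata:
    language: en
    model_provider: transformers
    model_name:
      transformers: dslim/bert-base-NER   # HuggingFace NER model
      spacy: en_core_web_sm               # assumed key and model: spaCy pipeline wrapping the NER model
```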
How to install the language model
You must install the model you declare in the model_name field in your Rasa Pro environment. For example, if you declare model_name: en_core_web_lg, you must install the spaCy en_core_web_lg model in your Rasa Pro environment.
You can follow the model installation instructions for spaCy and stanza models in the official Presidio documentation.
In the case of the transformers model provider, you must install both models that you declare in the model_name field in your Rasa Pro environment. For example, if you declare transformers: dslim/bert-base-NER in endpoints.yml, you must install the dslim/bert-base-NER model in your Rasa Pro environment. You can find model download instructions for HuggingFace in the official Presidio documentation.
How to populate the rule_lists section
The rule_lists section contains a list of anonymization rule lists. Each rule list must have a unique id of type string and a list of rules. Each rule must have an entity field and a substitution field. The entity field specifies the Presidio entity type to be anonymized and must be in uppercase. Note that regular expressions are currently not supported for identifying entities.
The substitution field specifies the anonymization method to be used. Currently, the following anonymization methods are supported:
- The text anonymization method replaces the original entity value with the value specified in the value field. For example, a rule for the PERSON entity with a value field replaces each detected person name with that value.
- The mask anonymization method replaces the original entity value with a mask of the same length using the character '*'. For example, if the original entity value is John Doe, the anonymized value will be ********.
- The faker anonymization method replaces the original entity value with a fake value generated by the Faker library. For example, if the original entity value is John Doe, it will be replaced with a fake name generated by the Faker library.
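The three methods can be mixed within one rule list. The sketch below shows one rule per method. The rule list id, the choice of entity types, and the replacement text are illustrative assumptions, and the metadata section is omitted for brevity.

```yaml
# endpoints.yml (illustrative sketch; metadata section omitted)
anonymization:
  rule_lists:
    - id: rules_1
      rules:
        - entity: PERSON
          substitution: text
          value: "[name removed]"   # every detected PERSON becomes this text
        - entity: PHONE_NUMBER
          substitution: mask        # replaced by '*' characters of the same length
        - entity: LOCATION
          substitution: faker       # replaced by a value generated with Faker
```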
If no substitution method is specified, the default substitution method is mask.
The value field is only used by the text anonymization method. It specifies the text to be used as the anonymized value. If the value field is not specified, the original entity value is replaced with the entity type name between brackets. For example, if the value field is not specified for the PERSON entity type, the anonymized value will be <PERSON>.
The faker anonymization method uses the Faker library to generate fake data. By default, the faker anonymization method generates fake data in English unless a localized Presidio entity type is used. For example, if you use the faker substitution method for the ES_NIF entity type, the generated fake data will match the format of a Spanish NIF.
The faker substitution method does not support the following Presidio entity types:
If any of the above entities is used together with the faker substitution method, the anonymization pipeline will default to the mask substitution method.
How to update the Kafka event broker configuration
The anonymization pipeline uses Kafka to publish the anonymized events to the Kafka topic that is mapped to the anonymization rule list. You can configure the Kafka event broker in the endpoints.yml file. The Kafka event broker must contain the anonymization_topics section, which must have the following structure:
The anonymization_topics section contains a list of Kafka topics to which the anonymized events will be published. Each Kafka topic must have a name field and an anonymization_rules field. The name field specifies the name of the Kafka topic. The anonymization_rules field specifies the id of the anonymization rule list to be used for the Kafka topic.
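A Kafka event broker entry with two anonymization topics could be sketched as follows. The broker URL, the topic names, and the rule list ids are placeholders. Each anonymization_rules value is assumed to match the id of a rule list declared in the anonymization section.

```yaml
# endpoints.yml (illustrative sketch; URL, topic names, and ids are placeholders)
event_broker:
  type: kafka
  url: localhost:9092
  anonymization_topics:
    - name: anonymized_events_1      # Kafka topic that receives events anonymized with rules_1
      anonymization_rules: rules_1
    - name: anonymized_events_2      # Kafka topic that receives events anonymized with rules_2
      anonymization_rules: rules_2
```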
Streaming anonymized events to Rasa X/Enterprise with Kafka
Streaming anonymized events to Rasa X/Enterprise is only supported for Rasa X/Enterprise versions 1.3.0 and above.
In addition, you must use the Kafka event broker; other event broker types are not supported.
You can stream anonymized events to Rasa X/Enterprise via Kafka by adding the rasa_x_consumer: true key-value pair to an entry in the anonymization_topics list. If multiple Kafka anonymization topics contain the rasa_x_consumer key-value pair, the anonymized events will be streamed to the Kafka topic of the first entry in the anonymization_topics list that contains the rasa_x_consumer key-value pair.
Note that the rasa_x_consumer key-value pair is optional. If it is not specified, the anonymized events will be published to the Kafka topic, but they will not be streamed to Rasa X/Enterprise.
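As a sketch, marking one topic for Rasa X/Enterprise consumption could look like this. The broker URL, the topic names, and the rule list ids are again placeholders.

```yaml
# endpoints.yml (illustrative sketch; URL, topic names, and ids are placeholders)
event_broker:
  type: kafka
  url: localhost:9092
  anonymization_topics:
    - name: anonymized_events_1
      anonymization_rules: rules_1
      rasa_x_consumer: true          # events from this topic are streamed to Rasa X/Enterprise
    - name: anonymized_events_2      # published to Kafka only, not streamed to Rasa X/Enterprise
      anonymization_rules: rules_2
```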
How to enable anonymization of PII in logs
You can enable anonymization of PII in logs by filling in the logger section in the endpoints.yml file. The logger section must have the following structure:
The anonymization_rules field specifies the id of the anonymization rule list to be used for the logs.
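A minimal sketch of such a logger section is shown below. It assumes that logger is a top-level key in endpoints.yml and that rules_1 is the id of an existing rule list. Any other keys the logger section may accept are left out here.

```yaml
# endpoints.yml (illustrative sketch; rules_1 is a placeholder rule list id)
logger:
  anonymization_rules: rules_1   # rule list applied to PII found in log output
```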
We strongly recommend running with log level INFO in production. Running with log level DEBUG will increase the assistant's response latency because of processing delays.
Note that running rasa shell in debug mode with a Kafka event broker might result in logs related to event publishing being printed to the console after the bot message. This behaviour is expected because event anonymization and publishing are done asynchronously as a background task, so they will complete after the assistant has already predicted and executed the