December 6th, 2021
Markers in Rasa Open Source 3.0
Aciel Eshky
Vincent Warmerdam
As part of Rasa Open Source 3.0, we’ve launched a new experimental feature called Markers. This feature will help Rasa Developers evaluate the performance of their assistants by counting how often specific events play out in a conversation. This blog post will explain what the feature does and how to use it.
Evaluation
“How do I evaluate my assistant?” is a question that our customers and community members frequently ask our research team.
One way to answer this question is by evaluating the different components that make up your assistant. For example, you can measure the accuracy of intent detection, entity extraction, and action selection. While these metrics are certainly meaningful, the accuracy of these individual components doesn’t tell you how well your assistant performs as a whole. After all, these metrics are entirely unaware of the task that the assistant is supposed to handle.
Instead, you can approach this question by defining and measuring task-specific evaluation criteria, such as how often task success, task failure, dialogue completion, and human handoff occur. You can think of these as your bot's key performance indicators (KPIs).
This is where the new markers feature comes in. Markers give you a language to describe important events in your dialogue and a method for recording when these critical events occur. More precisely, markers are conditions that allow you to define and mark points of interest in dialogues. When the conditions are met, the dialogues are “marked” for further analysis.
Example: moodbot
In Rasa, a dialogue is represented as a sequence of events. These events include bot actions, detected user intents, and slot updates. Markers allow you to describe conditions over such events and to track when those conditions are met for further analysis.
For example, let’s take moodbot, the assistant you get when you run rasa init. Moodbot is tasked with cheering up users when they express an unhappy mood. In this case, task success for moodbot can be defined as “successfully cheering up a user when they express an unhappy mood”. You can express this as a marker in a markers.yml file with the following syntax:
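Based on the condition operators available in the experimental markers configuration (such as seq, which matches events occurring in the given order), a definition along these lines could capture the success case; check the documentation for the exact schema:

```yaml
# markers.yml — task success: the user was unhappy, the bot tried to
# cheer them up, asked whether it helped, and the user affirmed.
marker_cheer_up_success:
  seq:
    - intent: mood_unhappy
    - action: utter_cheer_up
    - action: utter_did_that_help
    - intent: affirm
```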
In the case of moodbot, we can also think of a marker that indicates a task failure. Task failure can be defined as “failing to cheer up a user when they express an unhappy mood”. Again, this can be expressed as a marker:
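Sketched with the same assumed seq operator, the failure marker differs only in the final event, where the user denies that the bot helped:

```yaml
# markers.yml — task failure: same sequence as success, but the user
# answers "No" when asked whether the cheer-up helped.
marker_cheer_up_failure:
  seq:
    - intent: mood_unhappy
    - action: utter_cheer_up
    - action: utter_did_that_help
    - intent: deny
```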
Let’s discuss an example that demonstrates how these markers might trigger. Suppose we have the following dialogue sequence.
| Text |
| --- |
| User: Hello |
| Bot: Hi! How are you? |
| User: I’m sad |
| Bot: Here’s something to cheer you up. |
| Bot: Did that help? |
| User: Yes! |
These messages correspond to intents and actions. Let’s add those events alongside each message.

| Text | Events |
| --- | --- |
| User: Hello | intent: greet |
| Bot: Hi! How are you? | action: utter_greet |
| User: I’m sad | intent: mood_unhappy |
| Bot: Here’s something to cheer you up. | action: utter_cheer_up |
| Bot: Did that help? | action: utter_did_that_help |
| User: Yes! | intent: affirm |
Given these events, Rasa can now determine whether a marker applies. For example, marker_cheer_up_success applies only once the entire sequence of events has been observed.
| Events | marker_cheer_up_success |
| --- | --- |
| intent: greet | False |
| action: utter_greet | False |
| intent: mood_unhappy | False |
| action: utter_cheer_up | False |
| action: utter_did_that_help | False |
| intent: affirm | True |
Suppose now that we have a different dialogue sequence where the user answers “No” to the question “Did that help?”. The events would be different, as shown below.
| Text | Events |
| --- | --- |
| User: Hello | intent: greet |
| Bot: Hi! How are you? | action: utter_greet |
| User: I’m sad | intent: mood_unhappy |
| Bot: Here’s something to cheer you up. | action: utter_cheer_up |
| Bot: Did that help? | action: utter_did_that_help |
| User: No! | intent: deny |
In this case, the marker marker_cheer_up_failure will apply instead, because this sequence of events matches its definition.
| Events | marker_cheer_up_failure |
| --- | --- |
| intent: greet | False |
| action: utter_greet | False |
| intent: mood_unhappy | False |
| action: utter_cheer_up | False |
| action: utter_did_that_help | False |
| intent: deny | True |
In addition to defining and tracking your KPIs, you can use markers to diagnose your dialogues. Markers can also match slot events through the slot_was_set and slot_was_not_set conditions. That means you can track forms where certain slots didn’t get filled. With these events, you can also track the occurrence of unhappy paths in your conversations.
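As a sketch of such a diagnostic marker, the definition below flags conversations where the bot asked for a value but the corresponding slot was never set. The action and slot names (utter_ask_email, email) are made up for illustration, and the seq operator is assumed from the experimental configuration format:

```yaml
# markers.yml — hypothetical unhappy path: the bot asked for an email
# address, but the email slot was not filled afterwards.
marker_email_missing:
  seq:
    - action: utter_ask_email
    - slot_was_not_set: email
```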
Running Markers
Let’s now walk through an example together using moodbot. First, we’ll need to set up a tracker store to hold our conversations. We’ll use a local SQLite tracker store by adding the following lines to the endpoints.yml file:
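A minimal SQLite tracker store configuration looks like this (the database filename rasa.db is an assumption here; use whatever filename suits your setup):

```yaml
# endpoints.yml — store conversation trackers in a local SQLite database
tracker_store:
  type: SQL
  dialect: sqlite
  db: rasa.db
```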
Now, we can populate this SQLite database when we run rasa shell.
We’ll now assume that you have three dialogues with this assistant.
- For the first dialogue, you merely declare that you have a happy mood.
- For the second dialogue, you express an unhappy mood and that the assistant was able to cheer you up.
- For the final dialogue, you express an unhappy mood and that the assistant was not able to cheer you up.
We’ll assume that these three dialogues are stored in your local SQLite database. Because it’s configured in your endpoints.yml file, Rasa knows where to find these conversations when you extract the markers.
To extract markers, run the following command:
rasa evaluate markers all --config markers.yml out.csv
This produces three files: out.csv, stats-overall.csv, and stats-per-session.csv.
The out.csv file lists each occurrence of every marker. In this case, we have two markers and each of them occurred only once:
```
sender_id,session_idx,marker,event_idx,num_preceding_user_turns
0cdfe2dfea2845aab8a056dee58a1c22,0,marker_cheer_up_failure,15,2
85b830482c88499e9ffe876d6aba4cba,0,marker_cheer_up_success,15,2
```
Let’s go over all the columns in this file.
- The file contains the sender_id, which identifies the user in the conversation.
- Each dialogue can have multiple sessions, which we index by session_idx. In this case, each dialogue had one session with index 0.
- We then have the column marker, which lists the marker name.
- Next, you can see the event_idx column, which represents the index of the event at which the marker occurred. In this case, both markers triggered at the 15th event step in the conversation.
- Finally, we have the number of preceding user turns in the num_preceding_user_turns column. This gives us an indication of how long it took, from the user’s perspective, for a marker to apply. Both markers applied after two user turns.
We can use this CSV file to filter dialogues based on the occurrence of a specific marker. For example, we may want to inspect the dialogues where marker_cheer_up_failure occurred to gain insights into how to improve our assistant.
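As a quick sketch of such a filter using only Python’s standard library (the CSV contents from the example above are inlined here rather than read from disk):

```python
import csv
import io

# Contents of out.csv from the run above, inlined for illustration.
csv_text = """sender_id,session_idx,marker,event_idx,num_preceding_user_turns
0cdfe2dfea2845aab8a056dee58a1c22,0,marker_cheer_up_failure,15,2
85b830482c88499e9ffe876d6aba4cba,0,marker_cheer_up_success,15,2
"""

# Parse the CSV into one dict per row, keyed by column name.
rows = list(csv.DictReader(io.StringIO(csv_text)))

# Keep only the conversations where the failure marker applied.
failure_ids = [row["sender_id"] for row in rows
               if row["marker"] == "marker_cheer_up_failure"]
print(failure_ids)
```

In practice you would replace the inlined string with `open("out.csv")`, and could then pull the matching conversations from the tracker store for review.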
In addition to this file, we output summary statistics in the stats-overall.csv and stats-per-session.csv files. These files contain statistics that can be useful for reporting. If you’d like to learn more about what’s in these files, check the documentation.
Conclusion
To conclude, in this blog post we’ve introduced the new markers feature, which provides a language for expressing important events in dialogues and a way of marking when these events occurred. We showed an example of how to use markers to define task success and failure for moodbot, and then demonstrated how to quantify these important events to evaluate the success and failure rates of the assistant.
We want to emphasize that this first iteration of the Markers feature is experimental, which means it might change in the future. We encourage our community and customers to try it out and tell us on our forum what is missing and what we could improve. If you’d like to learn more about markers, be sure to check out the documentation page.