“How do I evaluate my assistant?” is a question that our customers and community members frequently ask our research team.
One way to answer this question is by evaluating the different components that make up your assistant. For example, you can measure the accuracy of intent detection, entity extraction, and action selection. While these metrics are certainly meaningful, the accuracy of these individual components doesn’t tell you how well your assistant performs as a whole. After all, these metrics are entirely unaware of the task that the assistant is supposed to handle.
Instead, you can approach this question by defining and measuring task-specific evaluation criteria, such as how often task success, task failure, dialogue completion, and human handoff occur. You can think of these as your bot's key performance indicators (KPIs).
This is where the new markers feature comes in. Markers give you a language to describe important events in your dialogue and a method for recording when these critical events occur. More precisely, markers are conditions that allow you to define and mark points of interest in dialogues. When the conditions are met, the dialogues are “marked” for further analysis.
In Rasa, a dialogue is represented as a sequence of events. These events include bot actions, detected user intents, and slot events. Markers allow you to describe conditions over such events and track when those conditions are met for further analysis.
For example, let's take moodbot, the assistant you get when you run `rasa init`. Moodbot is tasked with cheering up users when they express an unhappy mood. In this case, task success for moodbot can be defined as "successfully cheering up a user when they express an unhappy mood". You can express this as a marker in a `markers.yml` file.
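A sketch of what such a marker definition might look like (the markers feature is experimental, so the exact condition syntax may differ from this; check the Rasa documentation for your version):

```yaml
# markers.yml (sketch): task success for moodbot.
# Assumes a `seq` operator that matches the listed events in order,
# with other events allowed in between.
marker_cheer_up_success:
  seq:
    - intent: mood_unhappy
    - action: utter_cheer_up
    - intent: affirm
```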
In the case of moodbot, we can also think of a marker that indicates a task failure. Task failure can be defined as “failing to cheer up a user when they express an unhappy mood”. Again, this can be expressed as a marker:
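A sketch of what this definition might look like (again assuming the experimental `seq` syntax; consult the Rasa documentation for the exact format):

```yaml
# markers.yml (sketch): task failure for moodbot.
# The sequence ends with the user denying that the cheer-up helped.
marker_cheer_up_failure:
  seq:
    - intent: mood_unhappy
    - action: utter_cheer_up
    - intent: deny
```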
Let’s discuss an example that demonstrates how these markers might trigger. Suppose we have the following dialogue sequence.
| Text |
| --- |
| User: Hello |
| Bot: Hi How are you? |
| User: I'm sad |
| Bot: Here's something to cheer you up. |
| Bot: Did that help? |
| User: Yes! |
These messages are related to intents and actions. Let’s add those.
| Text | Events |
| --- | --- |
| User: Hello | intent: greet |
| Bot: Hi How are you? | action: utter_greet |
| User: I'm sad | intent: mood_unhappy |
| Bot: Here's something to cheer you up. | action: utter_cheer_up |
| Bot: Did that help? | action: utter_did_that_help |
| User: Yes! | intent: affirm |
Given these events, Rasa can now determine whether a marker applies. For example, the `marker_cheer_up_success` marker will apply once the entire sequence is satisfied.
| Events | marker_cheer_up_success |
| --- | --- |
| intent: greet | False |
| action: utter_greet | False |
| intent: mood_unhappy | False |
| action: utter_cheer_up | False |
| action: utter_did_that_help | False |
| intent: affirm | True |
Suppose now that we have a different dialogue sequence where the user answers “No” to the question “Did that help?”. The events would be different, as shown below.
| Text | Events |
| --- | --- |
| User: Hello | intent: greet |
| Bot: Hi How are you? | action: utter_greet |
| User: I'm sad | intent: mood_unhappy |
| Bot: Here's something to cheer you up. | action: utter_cheer_up |
| Bot: Did that help? | action: utter_did_that_help |
| User: No! | intent: deny |
In this case, the `marker_cheer_up_failure` marker will apply, because the sequence of events matches its definition.
| Events | marker_cheer_up_failure |
| --- | --- |
| intent: greet | False |
| action: utter_greet | False |
| intent: mood_unhappy | False |
| action: utter_cheer_up | False |
| action: utter_did_that_help | False |
| intent: deny | True |
In addition to defining and tracking your KPIs, you can also use markers to diagnose your dialogues. Markers can also detect slot events, such as `slot_was_not_set` events. That means you can track forms where certain slots didn't get filled in. With these events, you can also track the occurrence of unhappy paths in your conversations.
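For example, a marker built on such a slot event might look like this (a sketch; `cuisine` is a hypothetical form slot, and the exact syntax may differ in your Rasa version):

```yaml
# markers.yml (sketch): flag dialogues where the hypothetical
# `cuisine` form slot was never filled.
marker_cuisine_not_filled:
  slot_was_not_set: cuisine
```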
Let's now walk through an example together using moodbot. First, we'll need to set up a tracker store to record our conversations. We'll use a local SQLite tracker store, configured in the `endpoints.yml` file.
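A minimal SQLite tracker store configuration might look like this (the `db` filename is an example; see the Rasa tracker store documentation for all options):

```yaml
# endpoints.yml: store conversations in a local SQLite database.
tracker_store:
  type: SQL
  dialect: sqlite
  db: rasa.db
```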
Now, we can populate this SQLite database by talking to the assistant, for example by running `rasa shell`.
We’ll now assume that you have three dialogues with this assistant.
- For the first dialogue, you merely declare that you have a happy mood.
- For the second dialogue, you express an unhappy mood, and the assistant is able to cheer you up.
- For the final dialogue, you express an unhappy mood, but the assistant is not able to cheer you up.
We’ll assume that these three dialogues are stored in your local SQLite database. Because it’s configured in your endpoints.yml file, Rasa knows where to find these conversations when you extract the markers.
To extract markers, we ran the following command:

```shell
rasa evaluate markers all --config markers.yml out.csv
```
This produces three files. The `out.csv` file records each occurrence of each marker. In this case, we have two markers, and each of them occurred only once:
```
sender_id,session_idx,marker,event_idx,num_preceding_user_turns
0cdfe2dfea2845aab8a056dee58a1c22,0,marker_cheer_up_failure,15,2
85b830482c88499e9ffe876d6aba4cba,0,marker_cheer_up_success,15,2
```
Let’s go over all the columns in this file.
- The file contains the `sender_id`, which identifies the user in the conversation.
- Each dialogue can have multiple sessions, which we index by `session_idx`. In this case, each dialogue had one session with index 0.
- We then have the `marker` column, which lists the marker name.
- Next, you can see the `event_idx` column, which represents the index of the event at which the marker applied. In this case, both markers triggered at the 15th event step in the conversation.
- Finally, we have the number of preceding user turns in the `num_preceding_user_turns` column. This gives us an indication of how long it took for a marker to apply from the user's perspective. Both markers applied after two user turns.
We can use this CSV file to filter dialogues based on the occurrence of a specific marker. For example, we may want to inspect the dialogues where `marker_cheer_up_failure` occurred to gain insights into how to improve our assistant.
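A minimal sketch of such a filter, using only Python's standard library and assuming the `out.csv` format shown above (`sender_ids_for_marker` and the inlined sample data are illustrative, not part of Rasa):

```python
import csv
import io

# Sample contents of out.csv, matching the format shown above.
OUT_CSV = """\
sender_id,session_idx,marker,event_idx,num_preceding_user_turns
0cdfe2dfea2845aab8a056dee58a1c22,0,marker_cheer_up_failure,15,2
85b830482c88499e9ffe876d6aba4cba,0,marker_cheer_up_success,15,2
"""

def sender_ids_for_marker(csv_text: str, marker_name: str) -> list:
    """Return the sender_id of every dialogue in which the marker applied."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return [row["sender_id"] for row in reader if row["marker"] == marker_name]

# Dialogues worth inspecting for improvement ideas:
failed = sender_ids_for_marker(OUT_CSV, "marker_cheer_up_failure")
print(failed)  # → ['0cdfe2dfea2845aab8a056dee58a1c22']
```

With the `sender_id` values in hand, you can pull the corresponding conversations from the tracker store and review them.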
In addition to this file, the command outputs summary statistics, including the `stats-per-session.csv` files. These files contain many statistics that could be useful for reporting. If you'd like to learn more about what's in these files, you can check the documentation.
To conclude, in this blog post we’ve introduced the new markers feature, which provides a language for expressing important events in dialogues and a way of marking when these events occurred. We showed an example of how to use markers to define task success and failure for moodbot, and then demonstrated how to quantify these important events to evaluate the success and failure rates of the assistant.
We want to emphasize that this first iteration of the markers feature is experimental, which means that the feature may change in the future. We encourage our community and customers to try it out and give us feedback on our forum about what is missing and what we could improve. If you'd like to learn more about markers, be sure to check out the documentation page.