“How do I evaluate my assistant?” is a question that our customers and community members frequently ask our research team.
One way to answer this question is by evaluating the different components that make up your assistant. For example, you can measure the accuracy of intent detection, entity extraction, and action selection. While these metrics are certainly meaningful, the accuracy of these individual components doesn’t tell you how well your assistant performs as a whole. After all, these metrics are entirely unaware of the task that the assistant is supposed to handle.
Instead, you can approach this question by defining and measuring task-specific evaluation criteria, such as how often task success, task failure, dialogue completion, and human handoff occur. You can think of these as your bot's key performance indicators (KPIs).
This is where the new markers feature comes in. Markers give you a language to describe important events in your dialogue and a method for recording when these critical events occur. More precisely, markers are conditions that allow you to define and mark points of interest in dialogues. When the conditions are met, the dialogues are “marked” for further analysis.
In Rasa, a dialogue is represented as a sequence of events. These events include bot actions, detected user intents, and slot events. Markers allow you to describe conditions over such events and track when those conditions are met for further analysis.
For example, let's take moodbot, the assistant you get when you run `rasa init`. Moodbot is tasked with cheering up users when they express an unhappy mood. In this case, task success for moodbot can be defined as "successfully cheering up a user when they express an unhappy mood". You can express this as a marker in a `markers.yml` file.
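A sketch of what such a marker definition might look like (the markers feature is experimental, so the exact condition syntax may differ from this; check the Rasa documentation for your version):

```yaml
# markers.yml (sketch): task success for moodbot.
# Assumes a `seq` operator that matches the listed events in order,
# with other events allowed in between.
marker_cheer_up_success:
  seq:
    - intent: mood_unhappy
    - action: utter_cheer_up
    - intent: affirm
```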
In the case of moodbot, we can also think of a marker that indicates a task failure. Task failure can be defined as “failing to cheer up a user when they express an unhappy mood”. Again, this can be expressed as a marker:
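A sketch of what this definition might look like (again assuming the experimental `seq` syntax; consult the Rasa documentation for the exact format):

```yaml
# markers.yml (sketch): task failure for moodbot.
# The sequence ends with the user denying that the cheer-up helped.
marker_cheer_up_failure:
  seq:
    - intent: mood_unhappy
    - action: utter_cheer_up
    - intent: deny
```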
Let’s discuss an example that demonstrates how these markers might trigger. Suppose we have the following dialogue sequence.
| Text |
| --- |
| User: Hello |
| Bot: Hi How are you? |
| User: I'm sad |
| Bot: Here's something to cheer you up. |
| Bot: Did that help? |
| User: Yes! |
These messages are related to intents and actions. Let’s add those.
| Text | Events |
| --- | --- |
| User: Hello | intent: greet |
| Bot: Hi How are you? | action: utter_greet |
| User: I'm sad | intent: mood_unhappy |
| Bot: Here's something to cheer you up. | action: utter_cheer_up |
| Bot: Did that help? | action: utter_did_that_help |
| User: Yes! | intent: affirm |
Given these events, Rasa can now determine whether a marker applies. For example, the `marker_cheer_up_success` marker will apply once the entire sequence is satisfied.
| Events | marker_cheer_up_success |
| --- | --- |
| intent: greet | False |
| action: utter_greet | False |
| intent: mood_unhappy | False |
| action: utter_cheer_up | False |
| action: utter_did_that_help | False |
| intent: affirm | True |
Suppose now that we have a different dialogue sequence where the user answers “No” to the question “Did that help?”. The events would be different, as shown below.
| Text | Events |
| --- | --- |
| User: Hello | intent: greet |
| Bot: Hi How are you? | action: utter_greet |
| User: I'm sad | intent: mood_unhappy |
| Bot: Here's something to cheer you up. | action: utter_cheer_up |
| Bot: Did that help? | action: utter_did_that_help |
| User: No! | intent: deny |
In this case, the `marker_cheer_up_failure` marker will apply, because the sequence of events matches its definition.
| Events | marker_cheer_up_failure |
| --- | --- |
| intent: greet | False |
| action: utter_greet | False |
| intent: mood_unhappy | False |
| action: utter_cheer_up | False |
| action: utter_did_that_help | False |
| intent: deny | True |
In addition to defining and tracking your KPIs, you can also use markers to diagnose your dialogues. Markers can also detect slot events, such as `slot_was_not_set` events. That means you can track forms where certain slots didn't get filled in. With these events, you can also track the occurrence of unhappy paths in your conversations.
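For example, a marker built on such a slot event might look like this (a sketch; `cuisine` is a hypothetical form slot, and the exact syntax may differ in your Rasa version):

```yaml
# markers.yml (sketch): flag dialogues where the hypothetical
# `cuisine` form slot was never filled.
marker_cuisine_not_filled:
  slot_was_not_set: cuisine
```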
Let's now walk through an example together using moodbot. First, we'll need to set up a tracker store to record our conversations. We'll use a local SQLite tracker store, configured in the `endpoints.yml` file.
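A minimal SQLite tracker store configuration might look like this (the `db` filename is an example; see the Rasa tracker store documentation for all options):

```yaml
# endpoints.yml: store conversations in a local SQLite database.
tracker_store:
  type: SQL
  dialect: sqlite
  db: rasa.db
```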
Now, we can populate this SQLite database by talking to the assistant, for example by running `rasa shell`.
We’ll now assume that you have three dialogues with this assistant.
- For the first dialogue, you merely declare that you have a happy mood.
- For the second dialogue, you express an unhappy mood, and the assistant is able to cheer you up.
- For the final dialogue, you express an unhappy mood, but the assistant is not able to cheer you up.
We’ll assume that these three dialogues are stored in your local SQLite database. Because it’s configured in your endpoints.yml file, Rasa knows where to find these conversations when you extract the markers.
To extract markers, we ran the following command:

```shell
rasa evaluate markers all --config markers.yml out.csv
```
This produces three files. The `out.csv` file records each occurrence of each marker. In this case, we have two markers, and each of them occurred only once:
```
sender_id,session_idx,marker,event_idx,num_preceding_user_turns
0cdfe2dfea2845aab8a056dee58a1c22,0,marker_cheer_up_failure,15,2
85b830482c88499e9ffe876d6aba4cba,0,marker_cheer_up_success,15,2
```
Let’s go over all the columns in this file.
- The file contains the `sender_id`, which identifies the user in the conversation.
- Each dialogue can have multiple sessions, which we index by `session_idx`. In this case, each dialogue had one session with index 0.
- We then have the `marker` column, which lists the marker name.
- Next, you can see the `event_idx` column, which represents the index of the event at which the marker applied. In this case, both markers triggered at the 15th event step in the conversation.
- Finally, we have the number of preceding user turns in the `num_preceding_user_turns` column. This gives us an indication of how long it took for a marker to apply from the user's perspective. Both markers applied after two user turns.
We can use this CSV file to filter dialogues based on the occurrence of a specific marker. For example, we may want to inspect the dialogues where `marker_cheer_up_failure` occurred to gain insights into how to improve our assistant.
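A minimal sketch of such a filter, using only Python's standard library and assuming the `out.csv` format shown above (`sender_ids_for_marker` and the inlined sample data are illustrative, not part of Rasa):

```python
import csv
import io

# Sample contents of out.csv, matching the format shown above.
OUT_CSV = """\
sender_id,session_idx,marker,event_idx,num_preceding_user_turns
0cdfe2dfea2845aab8a056dee58a1c22,0,marker_cheer_up_failure,15,2
85b830482c88499e9ffe876d6aba4cba,0,marker_cheer_up_success,15,2
"""

def sender_ids_for_marker(csv_text: str, marker_name: str) -> list:
    """Return the sender_id of every dialogue in which the marker applied."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return [row["sender_id"] for row in reader if row["marker"] == marker_name]

# Dialogues worth inspecting for improvement ideas:
failed = sender_ids_for_marker(OUT_CSV, "marker_cheer_up_failure")
print(failed)  # → ['0cdfe2dfea2845aab8a056dee58a1c22']
```

With the `sender_id` values in hand, you can pull the corresponding conversations from the tracker store and review them.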
In addition to this file, the command outputs summary statistics, including the `stats-per-session.csv` files. These files contain many statistics that could be useful for reporting. If you'd like to learn more about what's in these files, you can check the documentation.
To conclude, in this blog post we’ve introduced the new markers feature, which provides a language for expressing important events in dialogues and a way of marking when these events occurred. We showed an example of how to use markers to define task success and failure for moodbot, and then demonstrated how to quantify these important events to evaluate the success and failure rates of the assistant.
We want to emphasize that this first iteration of the markers feature is experimental, which means that the feature may change in the future. We encourage our community and customers to try it out and give us feedback on our forum about what is missing and what we could improve. If you'd like to learn more about markers, be sure to check out the documentation page.