Scenarios
A scenario describes a single test case for your agent: who the simulated user is, what they want to accomplish, and what a successful conversation looks like. Each scenario is an individual YAML file in eval/scenarios/.
A scenario has three parts:
- simulation_context: a natural language briefing that tells the LLM how to play the user role across the whole conversation.
- setup: optional initial slot values injected before the conversation starts (e.g. pre-authenticated state).
- goals: what success looks like, expressed as deterministic assertions (checked against the tracker event history) and natural language criteria (scored by an LLM judge).
Schema
```yaml
scenario:
  name: Agent handles delayed order gracefully
  simulation_context: >
    You are an impatient customer who escalates quickly if not given a direct answer.
    You want to find out why the order is delayed and get a clear resolution.
  setup:
    initial_slots:
      authenticated: true
      account_id: "12345"
      order_id: "ORD-9981"
  goals:
    criteria:
      - Agent communicates the delay reason clearly and empathetically
      - Agent does not ask the user to repeat information already provided
    assertions:
      - flow_started:
          flow_ids: [check_order_status]
          operator: any
      - flow_completed:
          flow_id: check_order_status
      - flow_cancelled:
          flow_id: verify_identity
      - action_executed: action_check_order_status
      - slot_was_set:
          - name: order_id
            value: "ORD-9981"
      - slot_was_not_set:
          - name: transfer_amount
      - bot_uttered:
          utter_name: utter_order_delayed
      - bot_did_not_utter:
          utter_name: utter_dont_know
      - pattern_clarification_contains:
          flow_ids: [check_order_status, track_shipment]
          operator: any
      - generative_response_is_relevant:
          utter_source: utter_order_status_details
      - generative_response_is_grounded:
          utter_source: utter_order_status_details
      - sequencing:
          - flow_started: verify_identity
          - action_executed: action_verify_account
          - slot_was_set: authenticated
          - flow_completed: verify_identity
          - flow_started: check_order_status
          - flow_completed: check_order_status
```
Field Reference
simulation_context
A free-form natural language description of the simulated user and the conversation goal. The LLM uses this to generate consistent user turns across the whole conversation. You can include user behaviour (emotional state, interaction style), conversational intent, and contextual facts (account state, order details).
A single sentence works. The more specific you are, the more targeted the simulation.
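As an illustration, a more specific briefing might combine user behaviour, conversational intent, and contextual facts. This is only a sketch: the scenario name, persona, and order details below are made-up values, not taken from a real project.

```yaml
scenario:
  name: Impatient customer asks about a delayed order   # hypothetical name
  # The briefing covers, in order: behaviour, intent, contextual facts.
  simulation_context: >
    You are an impatient customer who answers in short sentences and
    gets irritated when asked to repeat yourself.
    You want to know why order ORD-9981 is late and when it will arrive.
    The order was due three days ago and you have not received any
    notification about the delay.
```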
setup.initial_slots
Slot values injected into the conversation before the first user turn. Use this to pre-authenticate users, pre-fill context, or skip onboarding steps that are out of scope.
When initial_slots are set, the evaluator sends a /session_start message first so that any custom action_session_start logic runs against an empty tracker, then injects the slot values on top.
Slot names must exist in your domain with compatible types. validate_scenario checks this before any simulation runs.
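A minimal sketch of how the two sides might line up, assuming a Rasa-style domain file in which each slot declares a type. The domain excerpt is an assumption about your project, and slot mappings are left out for brevity.

```yaml
# Domain excerpt (assumed): each injected slot is declared with a type.
slots:
  authenticated:
    type: bool
  account_id:
    type: text
---
# Scenario excerpt: the injected values are compatible with those types.
scenario:
  setup:
    initial_slots:
      authenticated: true     # boolean value for a bool slot
      account_id: "12345"     # quoted so YAML reads it as a string, not a number
```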
goals.criteria
A list of natural language statements describing required bot-side behavior. The LLM judge evaluates each criterion independently after the conversation ends and produces a pass/fail verdict with a written rationale. A run is marked failed if any criterion fails.
Use criteria for deal-breakers — non-negotiable system behavior such as correct step ordering, mandatory tool calls, flow adherence, or required phrasing. Do not use criteria for subjective qualities like tone or empathy, or for overall user goal achievement; those are captured by quality metrics and do not gate the pass/fail verdict.
If you omit criteria, the run still produces a full transcript and assertion results.
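For illustration, criteria written as deal-breakers might look like the following. The statements are hypothetical examples of the categories named above (step ordering, a mandatory tool call, required phrasing), not a recommended set.

```yaml
scenario:
  goals:
    criteria:
      # Step ordering
      - Agent verifies the user's identity before sharing any order details
      # Mandatory tool call
      - Agent looks up the order status instead of guessing the delay reason
      # Required phrasing
      - Agent states the new estimated delivery date explicitly
```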
goals.assertions
Deterministic, binary checks run against the conversation's full event history. These use the same assertion types as E2E tests, with the addition of the sequencing type:
| Assertion | Checks | YAML value |
|---|---|---|
| flow_started | A specific flow was started | {flow_ids: [...], operator: any\|all} |
| flow_completed | A flow reached a completed state | {flow_id: "...", flow_step_id: "..."} (flow_step_id optional) |
| flow_cancelled | A flow was cancelled | {flow_id: "...", flow_step_id: "..."} (flow_step_id optional) |
| action_executed | A specific action was triggered | plain string (action name) |
| slot_was_set | A slot was set (optionally to a specific value) | list of {name: "...", value: ...} (value optional) |
| slot_was_not_set | A slot was not set, or was not set to a specific value | list of {name: "...", value: ...} (value optional) |
| bot_uttered | The agent produced a response matching text, buttons, or a response name | {utter_name: "..."} or {text_matches: "..."} or {buttons: [...]} |
| bot_did_not_utter | The agent did not produce a response matching a pattern | same fields as bot_uttered |
| pattern_clarification_contains | The clarification suggestions included the expected flow IDs | {flow_ids: [...], operator: any\|all} |
| generative_response_is_relevant | A generative response is contextually relevant to the user's message (LLM-scored) | {utter_source: "..."} (threshold and utter_name optional) |
| generative_response_is_grounded | A generative response is factually accurate against its source material (LLM-scored) | {utter_source: "..."} (threshold, utter_name, ground_truth optional) |
| sequencing | A sequence of events occurred in the specified order | list of single-key step dicts (see below) |
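The optional fields listed in the table can be combined as in the sketch below. All flow, step, slot, and response values are hypothetical. The comments describe the presumed effect of each optional field, and text_matches is assumed to take a plain string, so check the E2E test documentation for the exact matching rules.

```yaml
scenario:
  goals:
    assertions:
      # operator: all presumably requires every listed flow to have started.
      - flow_started:
          flow_ids: [verify_identity, check_order_status]
          operator: all
      # flow_step_id is optional; assumed to narrow the check to one step.
      - flow_completed:
          flow_id: check_order_status
          flow_step_id: inform_delay        # hypothetical step ID
      # Without a value, any assignment to the slot counts.
      - slot_was_set:
          - name: authenticated
      # With a value, the check targets that specific value only.
      - slot_was_not_set:
          - name: order_id
            value: "ORD-0000"               # hypothetical wrong order ID
      # Match on response text instead of a response name.
      - bot_uttered:
          text_matches: "Your order is delayed"
```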
sequencing
The sequencing assertion takes a list of step matchers. Each step must match an event that appears strictly after the previous match in the conversation. Every step value is a plain string: a flow ID, an action name, or a slot name.
| Step key | Matches |
|---|---|
| flow_started | FlowStarted event with the given flow ID |
| flow_completed | FlowCompleted event with the given flow ID |
| flow_cancelled | FlowCancelled event with the given flow ID |
| flow_interrupted | FlowInterrupted event with the given flow ID |
| action_executed | ActionExecuted event with the given action name |
| slot_was_set | SlotSet event for the given slot name |
```yaml
- sequencing:
    - flow_started: verify_identity
    - action_executed: action_verify_account
    - slot_was_set: authenticated
    - flow_completed: verify_identity
    - flow_started: check_order_status
    - flow_completed: check_order_status
```
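Because each step only has to match after the previous one, a sequence can presumably also describe an interruption followed by a resumption. The sketch below is hypothetical: it assumes check_order_status is interrupted by verify_identity and completes afterwards, and that unrelated events between the matched steps do not break the sequence.

```yaml
- sequencing:
    - flow_started: check_order_status
    # flow_interrupted is listed only in the sequencing step table.
    - flow_interrupted: check_order_status
    - flow_completed: verify_identity
    - flow_completed: check_order_status
```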
Unlike in E2E tests, generative_response_is_relevant and generative_response_is_grounded do not have access to the full conversation context. They assess a single turn in isolation, not the whole transcript. For multi-turn scenarios, use criteria instead, because the LLM judge evaluates the complete conversation and produces more reliable results.
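For a multi-turn scenario, this could mean expressing the check as a criterion and keeping only deterministic checks under assertions, as in this sketch. The criterion wording is a hypothetical example.

```yaml
scenario:
  goals:
    criteria:
      # Judged against the whole transcript, so earlier turns are taken into account.
      - Agent's answers about the order status stay consistent with details given earlier in the conversation
    assertions:
      # Deterministic checks still belong here.
      - flow_completed:
          flow_id: check_order_status
```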