Scenarios

A scenario describes a single test case for your agent: who the simulated user is, what they want to accomplish, and what a successful conversation looks like. Each scenario is an individual YAML file in eval/scenarios/.

A scenario has three parts:

  • simulation_context — a natural language briefing that tells the LLM how to play the user role across the whole conversation.
  • setup — optional initial slot values injected before the conversation starts (e.g. pre-authenticated state).
  • goals — what success looks like, expressed as deterministic assertions (checked against the tracker event history) and natural language criteria (scored by an LLM judge).

Schema

eval/scenarios/order_delay.yml
scenario:
  name: Agent handles delayed order gracefully

  simulation_context: >
    You are an impatient customer who escalates quickly if not given a direct answer.
    You want to find out why the order is delayed and get a clear resolution.

  setup:
    initial_slots:
      authenticated: true
      account_id: "12345"
      order_id: "ORD-9981"

  goals:
    criteria:
      - Agent communicates the delay reason clearly and empathetically
      - Agent does not ask the user to repeat information already provided
    assertions:
      - flow_started:
          flow_ids: [check_order_status]
          operator: any
      - flow_completed:
          flow_id: check_order_status
      - flow_cancelled:
          flow_id: verify_identity
      - action_executed: action_check_order_status
      - slot_was_set:
          - name: order_id
            value: "ORD-9981"
      - slot_was_not_set:
          - name: transfer_amount
      - bot_uttered:
          utter_name: utter_order_delayed
      - bot_did_not_utter:
          utter_name: utter_dont_know
      - pattern_clarification_contains:
          flow_ids: [check_order_status, track_shipment]
          operator: any
      - generative_response_is_relevant:
          utter_source: utter_order_status_details
      - generative_response_is_grounded:
          utter_source: utter_order_status_details
      - sequencing:
          - flow_started: verify_identity
          - action_executed: action_verify_account
          - slot_was_set: authenticated
          - flow_completed: verify_identity
          - flow_started: check_order_status
          - flow_completed: check_order_status

Field Reference

simulation_context

A free-form natural language description of the simulated user and the conversation goal. The LLM uses this to generate consistent user turns across the whole conversation. You can include user behavior (emotional state, interaction style), conversational intent, and contextual facts (account state, order details).

A single sentence works. The more specific you are, the more targeted the simulation.
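As an illustration of a more specific briefing, the fragment below combines the three kinds of detail mentioned above: behavior, intent, and contextual facts. The wording and the order details are made up for this sketch, and only the simulation_context key is shown.

```yaml
# Fragment only: the simulation_context key of a scenario file.
# Every detail below is an illustrative assumption, not taken from a real project.
simulation_context: >
  You are a polite but time-pressed customer who writes short messages.
  You want to know why order ORD-9981 has not arrived and when it will ship.
  You placed the order two weeks ago and have already been charged.
```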

setup.initial_slots

Slot values injected into the conversation before the first user turn. Use this to pre-authenticate users, pre-fill context, or skip onboarding steps that are out of scope.

When initial_slots are set, the evaluator sends a /session_start message first so that any custom action_session_start logic runs against an empty tracker, then injects the slot values on top.

Slot names must exist in your domain with compatible types. validate_scenario checks this before any simulation runs.
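A small sketch of the type-compatibility point, assuming that authenticated is a boolean slot and that account_id and order_id are text slots in the domain (the domain itself is not shown in this section). In YAML, an unquoted 12345 is parsed as an integer, so quoting keeps the value a string.

```yaml
# Fragment only: the setup key of a scenario file.
setup:
  initial_slots:
    authenticated: true     # YAML boolean; assumes a boolean slot in the domain
    account_id: "12345"     # quoted, so YAML reads a string, not the integer 12345
    order_id: "ORD-9981"    # already a string; the quotes are optional but harmless
```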

goals.criteria

A list of natural language statements describing required bot-side behavior. The LLM judge evaluates each criterion independently after the conversation ends and produces a pass/fail verdict with a written rationale. A run is marked failed if any criterion fails.

Use criteria for deal-breakers — non-negotiable system behavior such as correct step ordering, mandatory tool calls, flow adherence, or required phrasing. Do not use criteria for subjective qualities like tone or empathy, or for overall user goal achievement; those are captured by quality metrics and do not gate the pass/fail verdict.

If you omit criteria, the run still produces a full transcript and assertion results.
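A hedged sketch of criteria phrased as deal-breakers rather than subjective qualities. The statements are illustrative and assume an order-status agent like the one in the schema example.

```yaml
# Fragment only: the goals.criteria key of a scenario file.
goals:
  criteria:
    # Non-negotiable, checkable behavior: each is judged pass/fail on its own.
    - Agent verifies the user's identity before sharing any order details
    - Agent states the reason for the delay
    - Agent does not ask the user to repeat information already provided
    # Avoid here: subjective statements such as "Agent sounds empathetic".
```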

goals.assertions

Deterministic, binary checks run against the conversation's full event history. These use the same assertion types as E2E tests, with the addition of the sequencing type:

| Assertion | Checks | YAML value |
| --- | --- | --- |
| flow_started | A specific flow was started | {flow_ids: [...], operator: any\|all} |
| flow_completed | A flow reached a completed state | {flow_id: "...", flow_step_id: "..."} (flow_step_id optional) |
| flow_cancelled | A flow was cancelled | {flow_id: "...", flow_step_id: "..."} (flow_step_id optional) |
| action_executed | A specific action was triggered | plain string (the action name) |
| slot_was_set | A slot was set (optionally to a specific value) | list of {name: "...", value: ...} (value optional) |
| slot_was_not_set | A slot was not set, or was not set to a specific value | list of {name: "...", value: ...} (value optional) |
| bot_uttered | The agent produced a response matching text, buttons, or a response name | {utter_name: "..."} or {text_matches: "..."} or {buttons: [...]} |
| bot_did_not_utter | The agent did not produce a response matching a pattern | same fields as bot_uttered |
| pattern_clarification_contains | The clarification suggestions included the expected flow IDs | {flow_ids: [...], operator: any\|all} |
| generative_response_is_relevant | A generative response is contextually relevant to the user's message (LLM-scored) | {utter_source: "..."} (threshold and utter_name optional) |
| generative_response_is_grounded | A generative response is factually accurate against its source material (LLM-scored) | {utter_source: "..."} (threshold, utter_name, ground_truth optional) |
| sequencing | A sequence of events occurred in the specified order | list of single-key step dicts (see below) |
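The optional fields listed in the table can be sketched as follows. The step ID, the slot value, the matched text, and the threshold are hypothetical values chosen for illustration. This section does not say how text_matches compares text (substring or pattern) or what range threshold accepts, so treat both as assumptions.

```yaml
# Fragment only: the goals.assertions key of a scenario file.
goals:
  assertions:
    # flow_completed with the optional flow_step_id (step ID is hypothetical)
    - flow_completed:
        flow_id: check_order_status
        flow_step_id: inform_delay
    # slot_was_not_set with a value: the slot must not have been set to this value
    - slot_was_not_set:
        - name: order_id
          value: "ORD-0000"
    # bot_uttered matched on response text instead of a response name
    - bot_uttered:
        text_matches: "Your order is delayed"
    # generative check with the optional threshold (0.8 is an arbitrary example)
    - generative_response_is_grounded:
        utter_source: utter_order_status_details
        threshold: 0.8
```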

sequencing

The sequencing assertion takes a list of step matchers, each a single-key mapping from a step key to a plain string. Each step must match an event that appears strictly after the event matched by the previous step.

| Step key | Matches |
| --- | --- |
| flow_started | FlowStarted event with the given flow ID |
| flow_completed | FlowCompleted event with the given flow ID |
| flow_cancelled | FlowCancelled event with the given flow ID |
| flow_interrupted | FlowInterrupted event with the given flow ID |
| action_executed | ActionExecuted event with the given action name |
| slot_was_set | SlotSet event for the given slot name |

- sequencing:
    - flow_started: verify_identity
    - action_executed: action_verify_account
    - slot_was_set: authenticated
    - flow_completed: verify_identity
    - flow_started: check_order_status
    - flow_completed: check_order_status
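A further sketch using the flow_interrupted step key. It assumes a conversation in which the user starts check_order_status, is diverted into verify_identity, and then returns to the original flow. Whether unrelated events may occur between two matched steps is not stated explicitly in this section; the "strictly after" wording suggests they can, but treat that as an assumption.

```yaml
# Fragment only: one sequencing entry under goals.assertions.
goals:
  assertions:
    - sequencing:
        - flow_started: check_order_status      # user asks about the order
        - flow_interrupted: check_order_status  # flow is interrupted
        - flow_started: verify_identity         # identity check begins
        - flow_completed: verify_identity       # identity check finishes
        - flow_completed: check_order_status    # original flow resumes and completes
```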

Caution

Unlike in E2E tests, generative_response_is_relevant and generative_response_is_grounded do not have access to the full conversation context. They assess a single turn in isolation, not the whole transcript. For multi-turn scenarios, use criteria instead. The LLM judge evaluates the complete conversation and produces more reliable results.