Scenarios

A scenario describes a single test case for your agent: who the simulated user is, what they want to accomplish, and what a successful conversation looks like. Each scenario is an individual YAML file in `eval/scenarios/`.

A scenario has three parts:

- `simulation_context`: a natural-language briefing that tells the LLM how to play the user role across the whole conversation.
- `setup`: optional initial slot values injected before the conversation starts (e.g. pre-authenticated state).
- `goals`: what success looks like, expressed as deterministic assertions (checked against the tracker event history) and natural-language criteria (scored by an LLM judge).

Schema

eval/scenarios/order_delay.yml
```yaml
scenario:
  name: Agent handles delayed order gracefully

  simulation_context: >
    You are an impatient customer who escalates quickly if not given a direct answer.
    You want to find out why the order is delayed and get a clear resolution.

  setup:
    initial_slots:
      authenticated: true
      account_id: "12345"
      order_id: "ORD-9981"

  goals:
    criteria:
      - Agent communicates the reason for the delay
      - Agent does not ask the user to repeat information already provided
    assertions:
      - flow_started:
          flow_ids: [check_order_status]
          operator: any
      - flow_completed:
          flow_id: check_order_status
      - flow_cancelled:
          flow_id: verify_identity
      - action_executed: action_check_order_status
      - slot_was_set:
          - name: order_id
            value: "ORD-9981"
      - slot_was_not_set:
          - name: transfer_amount
      - bot_uttered:
          utter_name: utter_order_delayed
      - bot_did_not_utter:
          utter_name: utter_dont_know
      - pattern_clarification_contains:
          flow_ids: [check_order_status, track_shipment]
          operator: any
      - generative_response_is_relevant:
          utter_source: utter_order_status_details
      - generative_response_is_grounded:
          utter_source: utter_order_status_details
      - sequencing:
          - flow_started: verify_identity
          - action_executed: action_verify_account
          - slot_was_set: authenticated
          - flow_completed: verify_identity
          - flow_started: check_order_status
          - flow_completed: check_order_status
```

Field Reference

`simulation_context`

A free-form natural language description of the simulated user and the conversation goal. The LLM uses this to generate consistent user turns across the whole conversation. You can include user behaviour (emotional state, interaction style), conversational intent, and contextual facts (account state, order details).

A single sentence is enough, but the more specific the description, the more targeted the simulation.
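
As a rough illustration of that difference in specificity, the two briefings below describe the same test case. Both are hypothetical: the delivery details and the user's behaviour are invented for the example, and each fragment is assumed to sit under the `scenario` key of its own file.

```yaml
# Minimal: one sentence is enough to drive the simulation.
simulation_context: >
  You want to know why your order is late.
---
# More specific: user behaviour, intent, and contextual facts in one briefing.
simulation_context: >
  You are a returning customer who is short on time and answers in brief,
  direct messages. Your order ORD-9981 was due three days ago and has not
  arrived. You want to know the reason for the delay and a new delivery date,
  and you ask for a human agent if you are asked to repeat your order number.
```

The second briefing gives the simulated user a concrete trigger for escalation, so the conversation is more likely to exercise that path.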

`setup.initial_slots`

Slot values injected into the conversation before the first user turn. Use this to pre-authenticate users, pre-fill context, or skip onboarding steps that are out of scope.

When `initial_slots` is set, the evaluator first sends a `/session_start` message so that any custom `action_session_start` logic runs against an empty tracker, then injects the slot values on top.

Slot names must exist in your domain with compatible types. `validate_scenario` checks this before any simulation runs.
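
Because types have to be compatible, it helps to keep in mind how YAML itself reads each value. The fragment below is a small sketch of that; `loyalty_tier` is a hypothetical slot, and the comments assume the domain declares a boolean slot and two text slots.

```yaml
setup:
  initial_slots:
    authenticated: true     # unquoted true is read as a boolean
    account_id: "12345"     # quoted, so YAML reads a string rather than the integer 12345
    loyalty_tier: gold      # hypothetical slot; a plain word is read as a string
```

Quoting numeric-looking identifiers such as account or order numbers avoids a type mismatch against a text slot.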

`goals.criteria`

A list of natural language statements describing required bot-side behavior. The LLM judge evaluates each criterion independently after the conversation ends and produces a pass/fail verdict with a written rationale. A run is marked failed if any criterion fails.

Use criteria for deal-breakers — non-negotiable system behavior such as correct step ordering, mandatory tool calls, flow adherence, or required phrasing. Do not use criteria for subjective qualities like tone or empathy, or for overall user goal achievement; those are captured by quality metrics and do not gate the pass/fail verdict.

If you omit criteria, the run still produces a full transcript and assertion results.
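
The fragment below sketches how that guidance might translate into a `criteria` list. The statements are illustrative only, and the commented-out entries show the kind of wording that the guidance above leaves to quality metrics.

```yaml
goals:
  criteria:
    # Deal-breakers: specific, bot-side behavior that either happened or did not.
    - Agent verifies the user's identity before sharing any order details
    - Agent states the reason for the delay
    - Agent does not ask the user to repeat information already provided
    # Avoid as criteria (subjective, or about overall goal achievement):
    #   - Agent is friendly and empathetic
    #   - User gets what they wanted
```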

`goals.assertions`

Deterministic, binary checks run against the conversation's full event history. These use the same assertion types as E2E tests, with the addition of the `sequencing` type:

| Assertion | Checks | YAML value |
| --- | --- | --- |
| `flow_started` | A specific flow was started | `{flow_ids: [...], operator: any\|all}` |
| `flow_completed` | A flow reached a completed state | `{flow_id: "...", flow_step_id: "..."}` (`flow_step_id` optional) |
| `flow_cancelled` | A flow was cancelled | `{flow_id: "...", flow_step_id: "..."}` (`flow_step_id` optional) |
| `action_executed` | A specific action was triggered | plain string (the action name) |
| `slot_was_set` | A slot was set (optionally to a specific value) | list of `{name: "...", value: ...}` (`value` optional) |
| `slot_was_not_set` | A slot was not set, or was not set to a specific value | list of `{name: "...", value: ...}` (`value` optional) |
| `bot_uttered` | The agent produced a response matching text, buttons, or a response name | `{utter_name: "..."}` or `{text_matches: "..."}` or `{buttons: [...]}` |
| `bot_did_not_utter` | The agent did not produce a response matching a pattern | same fields as `bot_uttered` |
| `pattern_clarification_contains` | The clarification suggestions included the expected flow IDs | `{flow_ids: [...], operator: any\|all}` |
| `generative_response_is_relevant` | A generative response is contextually relevant to the user's message (LLM-scored) | `{utter_source: "..."}` (`threshold` and `utter_name` optional) |
| `generative_response_is_grounded` | A generative response is factually accurate against its source material (LLM-scored) | `{utter_source: "..."}` (`threshold`, `utter_name`, `ground_truth` optional) |
| `sequencing` | A sequence of events occurred in the specified order | list of single-key step dicts (see below) |
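
The fragment below sketches a few of the optional fields from the table. The step ID `inform_delay` and the matched text are hypothetical. The comments also rest on two assumptions that are not spelled out here: that `operator: all` requires every listed flow, and that `text_matches` accepts a plain piece of text (its exact matching rules are not specified in this section).

```yaml
goals:
  assertions:
    # flow_completed with the optional flow_step_id (hypothetical step ID).
    - flow_completed:
        flow_id: check_order_status
        flow_step_id: inform_delay
    # Assumed meaning of "all": every listed flow must have started.
    - flow_started:
        flow_ids: [verify_identity, check_order_status]
        operator: all
    # slot_was_set without a value: the slot only has to be set at some point.
    - slot_was_set:
        - name: order_id
    # slot_was_not_set with a value: the slot must never be set to this value.
    - slot_was_not_set:
        - name: authenticated
          value: false
    # bot_uttered matched on response text instead of a response name.
    - bot_uttered:
        text_matches: "delayed"
```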

`sequencing`

`sequencing` takes a list of step matchers. Each step must match an event that occurs strictly after the event matched by the previous step. Every step value is a plain string.

| Step key | Matches |
| --- | --- |
| `flow_started` | `FlowStarted` event with the given flow ID |
| `flow_completed` | `FlowCompleted` event with the given flow ID |
| `flow_cancelled` | `FlowCancelled` event with the given flow ID |
| `flow_interrupted` | `FlowInterrupted` event with the given flow ID |
| `action_executed` | `ActionExecuted` event with the given action name |
| `slot_was_set` | `SlotSet` event for the given slot name |

```yaml
- sequencing:
    - flow_started: verify_identity
    - action_executed: action_verify_account
    - slot_was_set: authenticated
    - flow_completed: verify_identity
    - flow_started: check_order_status
    - flow_completed: check_order_status
```
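
A second sketch, using the `flow_interrupted` step key, checks that an order lookup is paused for identity verification and then finished. It assumes that unrelated events may occur between two matched steps, which the "strictly after" wording suggests but does not state outright, and that the interrupted flow emits a completion event once it resumes and finishes.

```yaml
goals:
  assertions:
    - sequencing:
        - flow_started: check_order_status      # user asks about the order
        - flow_interrupted: check_order_status  # lookup is paused
        - flow_started: verify_identity         # verification takes over
        - flow_completed: verify_identity
        - flow_completed: check_order_status    # original flow finishes last
```

Swapping any two steps in this list would make the assertion fail, even if every individual event still occurred.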

caution

Unlike in E2E tests, `generative_response_is_relevant` and `generative_response_is_grounded` do not have access to the full conversation context: they assess a single turn in isolation, not the whole transcript. For multi-turn scenarios, use `criteria` instead, because the LLM judge evaluates the complete conversation and produces more reliable results.
