
Simulating and Evaluating Your Agent

Simulation and evaluation give you a way to test your agent's behavior before shipping, without manually replaying conversations every time you make a change. You write a scenario in YAML, or let the agent generate one from your flow definition, that describes who the simulated user is, what they're trying to accomplish, and what success looks like. Rasa drives full multi-turn conversations automatically, then evaluates whether your agent met those goals.

Evaluations cover the full assertion set supported by E2E testing and add LLM-as-judge scoring on top. They show their real strength with agents that have some degree of autonomy. E2E tests require you to specify every step of the conversation upfront, which works well for fully deterministic flows but becomes impractical as soon as the agent can make independent decisions, such as choosing between paths, generating responses, or invoking subagents.

Prerequisites

Before running evaluations, make sure you have the following in place:

  • Rasa Pro installed with the rasa tools component enabled.

  • The rasa-simulating-conversations agent skill installed. Run rasa tools init in your project root to install it alongside other skills — no extra steps needed.

  • To use a natural language interface for generating scenarios and running evaluations, you need an IDE agent (such as GitHub Copilot, Cursor, or any MCP-compatible AI agent) connected to the rasa tools run FastMCP server. For more details, refer to the documentation on the Rasa MCP Tools.

    • This skill has been most extensively tested with Claude Code (Sonnet 4.6) and Cursor with Composer 2.5. Either is a good starting point if you are unsure which to use.
  • The rest and inspector channels configured in your credentials.yml.

  • If you want to inspect simulation results in the Inspector, start your server with the --inspect flag:

    rasa run --inspect

Quick Start

Once the skill is installed and your server is running, type a prompt into your AI coding assistant (such as GitHub Copilot, Cursor, or any MCP-compatible agent). You can generate a single scenario:

Simulate a happy path scenario for the add_contact skill.

Or generate multiple scenarios at once:

Generate and run scenarios for the billing skill: a happy path, a case where the user provides incorrect details, and a cancellation mid-flow.

The AI coding assistant will:

  1. Check whether eval/conftest.yml exists in your project. If it doesn't, it creates a starter template and pauses for you to fill in your LLM provider details.
  2. Read your flow definition to understand available slots, actions, and branching logic.
  3. Write one or more scenario YAMLs to eval/scenarios/.
  4. Call validate_scenario to check each scenario file for syntax and domain errors before running.
  5. Call evaluate_agent to run the simulation and evaluation and return a pass/fail summary with a link to each result file.

Writing Scenarios

Scenarios live in eval/scenarios/ as individual YAML files. Each scenario describes who the simulated user is, how the conversation should unfold, and what success looks like.

Generating Scenarios with the Agent

Instead of writing scenarios by hand, ask the agent to generate them. The skill reads your flow definition and applies evaluation best practices to produce grounded, useful scenarios.

Minimal — generates a first batch from the flow definition alone:

Generate scenarios for the add_contact skill.

With direction — specify the types of scenarios you want:

Generate scenarios for the billing skill: a happy path, a multi-turn conversation,
and a case where the user provides incorrect details.

Detailed — provide specific goals, assertions, or personas:

Generate a scenario where an impatient customer asks about a delayed order (ORD-9981).
Assert that the check_order_status flow completes and that the agent never asks for
information the user already provided.

Scenario generation and simulation are separate steps. After the agent writes YAMLs to eval/scenarios/, you can edit them manually before running. Use this to refine criteria, adjust assertions, or add edge cases.

Full Scenario Reference

eval/scenarios/order_delay.yaml
scenario:
  name: Agent handles delayed order gracefully

  simulation_context: >
    You are an impatient customer who escalates quickly if not given a direct answer.
    You want to find out why the order is delayed and get a clear resolution.

  setup:
    initial_slots:
      authenticated: true
      account_id: "12345"
      order_id: "ORD-9981"

  goals:
    criteria:
      - Agent communicates the delay reason clearly and empathetically
      - Agent offers an alternative resolution when human handoff fails
      - Agent does not ask the user to repeat information already provided
    assertions:
      - flow_started: check_order_status
      - slot_was_set:
          name: order_id
          value: "ORD-9981"
      - flow_completed: check_order_status
      - bot_did_not_utter: "utter_dont_know"
      - sequencing:
          flow_started: verify_identity
          flow_completed: cancel_account

Field Reference

simulation_context

A free-form natural language description of who the simulated user is and what the conversation should cover. The LLM uses this to generate realistic, consistent user turns across the whole conversation.

You can include any combination of:

  • User behavior — emotional state or interaction style (e.g. impatient, cooperative, vague)
  • Conversational intent — what the user wants to accomplish and how the conversation should unfold
  • Contextual facts — account state, order details, or any background the user brings to the conversation

A single sentence works. So does a detailed multi-line description. The more specific you are, the more targeted the simulation.

setup.initial_slots

Slot values injected into the conversation before the first user turn. Use this to pre-authenticate users, pre-fill context, or skip onboarding steps that are out of scope for the scenario being tested.

Slot names must exist in your domain with compatible types — the validate_scenario tool checks this before any simulation runs.

goals.criteria

A list of natural language statements describing what a successful conversation looks like. The LLM judge evaluates each criterion independently after the conversation ends and produces a pass/fail score with a written rationale.

Use criteria for outcomes that require judgment: goal achievement, tone, empathy, or appropriate handling of edge cases.

If you omit criteria, the run still produces a full transcript and assertion results — no LLM judgment is applied to criteria.

goals.assertions

Deterministic, binary checks run against the conversation's event history. These match the assertion types supported in E2E tests with the addition of a new one (sequencing):

| Assertion | Checks |
| --- | --- |
| flow_started | A specific flow was started |
| flow_completed | A flow reached a completed state |
| flow_cancelled | A flow was cancelled |
| action_executed | A specific action was triggered |
| slot_was_set | A slot was set (optionally to a specific value) |
| slot_was_not_set | A slot was not set, or was not set to a specific value |
| bot_uttered | The agent produced a response matching text, buttons, or a response name |
| bot_did_not_utter | The agent did not produce a response matching a pattern |
| pattern_clarification_contains | The clarification suggestions returned the expected flow names |
| generative_response_is_relevant | A generative response is contextually relevant (LLM-scored) |
| generative_response_is_grounded | A generative response is factually accurate against a ground truth (LLM-scored) |
| sequencing | The events occurred in the expected order in the conversation |
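
The page does not describe how the sequencing assertion is evaluated internally. As a rough illustration only, the sketch below assumes that "expected order" means the listed events appear in the event history as an ordered subsequence, with other events allowed in between. The (event_type, name) tuple representation and the occurs_in_order helper are hypothetical and are not part of Rasa.

```python
from typing import Iterable, Sequence, Tuple

# Hypothetical event representation: (event_type, name),
# e.g. ("flow_started", "verify_identity"). Not Rasa's tracker format.
Event = Tuple[str, str]


def occurs_in_order(history: Iterable[Event], expected: Sequence[Event]) -> bool:
    """Return True if `expected` appears in `history` as an ordered subsequence.

    Other events may occur in between; only the relative order matters.
    """
    remaining = iter(history)
    # any(...) consumes the shared iterator up to the first match, so each
    # expected event has to be found strictly after the previous one.
    return all(any(event == wanted for event in remaining) for wanted in expected)


history = [
    ("flow_started", "verify_identity"),
    ("slot_was_set", "authenticated"),
    ("flow_completed", "verify_identity"),
    ("flow_started", "cancel_account"),
    ("flow_completed", "cancel_account"),
]
expected = [("flow_started", "verify_identity"), ("flow_completed", "cancel_account")]

in_order = occurs_in_order(history, expected)
reversed_order = occurs_in_order(history, list(reversed(expected)))
```

With this history, in_order is True and reversed_order is False, because the completion of cancel_account is never followed by the start of verify_identity.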

Configuring Evaluations

The eval/conftest.yml file controls which LLMs power the simulation and evaluation steps. Create it in the eval/ directory at the root of your project:

eval/conftest.yml
simulation:
  llm:
    provider: openai
    model: gpt-5.1

evaluation:
  llm:
    provider: openai
    model: gpt-5.1

The simulation.llm model drives the simulated user turns. The evaluation.llm model acts as the judge, scoring each conversation against your quality criteria and computing quality metrics. Configure them independently — a smaller, cheaper model works well for simulation while a more capable model is recommended for judging.

Overriding Evaluation Prompts

Three built-in prompt templates can be overridden: two for the LLM judge (one for scoring your quality criteria, one for computing quality metrics) and one that controls how the simulated user behaves during a conversation. Override any of them if you have domain-specific requirements:

eval/conftest.yml
simulation:
  llm:
    provider: openai
    model: gpt-5.1
  simulated_user_prompt: eval/prompts/simulator.jinja2

evaluation:
  criteria_judge_prompt: prompt_templates/criteria_evaluation.jinja2
  metrics_judge_prompt: prompt_templates/metrics_evaluation.jinja2
  llm:
    provider: openai
    model: gpt-5.1

Paths are relative to your project root. If omitted, the built-in templates are used.

MCP Tools

The evaluation feature exposes two MCP tools, both registered on the rasa tools run FastMCP server.

validate_scenario

Validates a scenario YAML file before running any simulation. The agent skill calls this automatically after generating a scenario, and you can invoke it directly to check a scenario you've written or edited by hand.

It checks that:

  • The YAML syntax and structure are valid
  • All assertion types are valid
  • Slots referenced in initial_slots and in slot assertions exist in your domain with compatible types

All validation errors are reported together, not one at a time.
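
To make the "reported together" behavior concrete, here is a minimal sketch of a slot check that collects every problem before returning instead of stopping at the first one. It is not the validate_scenario implementation: the DOMAIN_SLOTS mapping, the check_initial_slots function, and the loyalty_tier slot are assumptions made up for illustration, and a real domain describes slot types differently.

```python
from typing import Any, Dict, List

# Hypothetical, simplified view of a domain: slot name -> expected Python type.
DOMAIN_SLOTS: Dict[str, type] = {
    "authenticated": bool,
    "account_id": str,
    "order_id": str,
}


def check_initial_slots(
    initial_slots: Dict[str, Any], domain_slots: Dict[str, type]
) -> List[str]:
    """Collect every problem found instead of stopping at the first one."""
    errors: List[str] = []
    for name, value in initial_slots.items():
        if name not in domain_slots:
            errors.append(f"unknown slot '{name}'")
        elif not isinstance(value, domain_slots[name]):
            expected = domain_slots[name].__name__
            errors.append(f"slot '{name}' expects {expected}, got {type(value).__name__}")
    return errors


# "authenticated" has the wrong type and "loyalty_tier" is not in the domain.
errors = check_initial_slots(
    {"authenticated": "yes", "order_id": "ORD-9981", "loyalty_tier": "gold"},
    DOMAIN_SLOTS,
)
for message in errors:
    print(message)
```

Returning a list rather than raising on the first failure lets you fix a hand-edited scenario in one pass.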

evaluate_agent

Runs the simulation and evaluation loop for a given scenario. It:

  1. Loads eval/conftest.yml and the scenario YAML
  2. Injects initial_slots into a fresh sim-<uuid> conversation
  3. Runs an LLM-simulated multi-turn conversation against your running Rasa server
  4. Fetches the tracker and runs deterministic assertions against the event history
  5. Passes the transcript to the LLM judge, which scores criteria and computes quality metrics
  6. Writes a result file and updates the experiment summary

Running Evaluations

Use the evaluate_agent MCP tool by asking your IDE agent:

Run the order_delay scenario 3 times.
Run all scenarios in eval/scenarios/.

You control how many times each scenario is simulated. Running a scenario multiple times (N=3 or more) accounts for LLM non-determinism and gives a more reliable signal — a scenario that passes 3/3 is more meaningful than one that passes 1/1.
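
If you collect the pass/fail outcome of each run yourself, a per-scenario pass rate makes the N-run signal easy to read. The sketch below is illustrative only: the results dictionary is hand-written sample data, the add_contact_happy_path scenario name is hypothetical, and pass_rates is not a Rasa function.

```python
from typing import Dict, List


def pass_rates(results: Dict[str, List[bool]]) -> Dict[str, float]:
    """Map each scenario name to the fraction of its runs that passed."""
    return {
        scenario: sum(runs) / len(runs)
        for scenario, runs in results.items()
        if runs  # skip scenarios with no completed runs
    }


# Hand-written sample outcomes: one boolean per run (True = run passed).
results = {
    "order_delay": [True, False, True],
    "add_contact_happy_path": [True, True, True],
}

rates = pass_rates(results)

# Scenarios that pass only some of the time are the ones worth a closer look.
needs_attention = [name for name, rate in rates.items() if 0 < rate < 1]
```

A scenario with a rate strictly between 0 and 1 passes only intermittently, which usually points at LLM non-determinism or an ambiguous criterion rather than a hard failure.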

Results are written incrementally as each run completes. You don't need to wait for the full batch to see early failures.

Reading Results

Results are written to eval/results/<timestamp>/ and preserved across runs:

eval/
  conftest.yml
  scenarios/
    order_delay.yaml
  results/
    2026-05-28_10-22-00/
      summary.txt
      order_delay/
        run_1.txt
        run_2.txt
        run_3.txt
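
Because the layout is plain folders and files, you can walk it with the standard library. The following sketch assumes the directory structure shown above, that timestamp folder names sort chronologically as strings, and that run files are always named run_<number>.txt. Both helper functions are hypothetical and not part of Rasa.

```python
from pathlib import Path
from typing import Dict, List


def latest_experiment(results_root: Path) -> Path:
    """Return the most recent timestamped folder under eval/results/.

    Assumes folder names such as 2026-05-28_10-22-00 sort chronologically.
    """
    return max(
        (p for p in results_root.iterdir() if p.is_dir()),
        key=lambda p: p.name,
    )


def collect_run_files(experiment_dir: Path) -> Dict[str, List[Path]]:
    """Group run_N.txt files by scenario folder, ordered by run number."""
    runs: Dict[str, List[Path]] = {}
    for scenario_dir in sorted(p for p in experiment_dir.iterdir() if p.is_dir()):
        files = sorted(
            scenario_dir.glob("run_*.txt"),
            # Sort numerically so run_10 comes after run_2.
            key=lambda path: int(path.stem.split("_")[1]),
        )
        if files:
            runs[scenario_dir.name] = files
    return runs
```

For example, collect_run_files(latest_experiment(Path("eval/results"))) would return a mapping such as order_delay to its three run files for the tree above.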

Result File

Each run_N.txt contains the full evaluation output for one simulated conversation:

scenario: order_delay
run: 2
timestamp: 2026-05-28T10:22:11Z
conversation_id: sim-1f6de497-8e1a-45b6-8ad4-e8d68a229c50
overall_result: FAIL

--- Quality Criteria Results ---
[PASS] Agent communicates the delay reason clearly and empathetically
rationale: Agent acknowledged the weather delay and expressed understanding before offering alternatives.

[FAIL] Agent does not ask the user to repeat information already provided
rationale: Agent re-requested the order ID on turn 4, which the user had already provided on turn 1.

--- Quality Metrics ---
bot_quality: 3.5/5
helpfulness: 4/5 - The agent provided a clear explanation and offered an alternative resolution.
task_completion: 3/5 - The user's goal was partially met — the delay reason was communicated but the resolution step failed.
[... additional metrics ...]
summary: The agent handled the core inquiry well but introduced unnecessary friction by asking for
information the user had already provided.

--- Assertion Results ---
[PASS] flow_started(check_order_status)
[PASS] slot_was_set(order_id)
[PASS] flow_completed: check_order_status
[FAIL] bot_did_not_utter: Bot uttered a forbidden utterance 'utter_dont_know'.

--- Raw Transcript ---
[1] user: Hey, where's my order ORD-9981?
[2] agent: Let me check that for you. Your order is delayed due to weather — estimated arrival in 3 days.
[3] user: That's not good enough. Can I speak to someone?
[4] agent: I understand your frustration. Could you share your order ID again?

--- Inspector URL ---
http://localhost:5005/webhooks/inspector/inspect.html?sender=sim-1f6de497-8e1a-45b6-8ad4-e8d68a229c50
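
If you want to pull the headline information out of a run file programmatically, a small parser is enough. This sketch relies only on the line shapes visible in the sample above (an overall_result: line and [PASS] / [FAIL] prefixes); the real file format may differ or change, and summarize_run is a hypothetical helper, not a Rasa API.

```python
import re
from typing import Dict, List

# Matches lines such as "[FAIL] bot_did_not_utter: ..." from the sample output.
CHECK_LINE = re.compile(r"^\[(PASS|FAIL)\]\s+(.*)$")


def summarize_run(text: str) -> Dict[str, object]:
    """Extract the overall result and the failed checks from one run file."""
    overall = None
    failed: List[str] = []
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if line.startswith("overall_result:"):
            overall = line.split(":", 1)[1].strip()
            continue
        match = CHECK_LINE.match(line)
        if match and match.group(1) == "FAIL":
            failed.append(match.group(2))
    return {"overall_result": overall, "failed_checks": failed}


# Shortened copy of the sample result file shown above.
sample = """\
scenario: order_delay
run: 2
overall_result: FAIL

--- Quality Criteria Results ---
[PASS] Agent communicates the delay reason clearly and empathetically
[FAIL] Agent does not ask the user to repeat information already provided

--- Assertion Results ---
[PASS] flow_started(check_order_status)
[FAIL] bot_did_not_utter: Bot uttered a forbidden utterance 'utter_dont_know'.
"""

summary = summarize_run(sample)
```

Here summary holds the overall result FAIL plus the two failed checks: one quality criterion and one assertion.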

Summary File

Each experiment folder contains a summary.txt that lists the scenarios run, the overall pass/fail counts across all scenarios and runs, the total time, and, when any runs failed, the scenarios that need attention.

Inspecting a Simulated Conversation

Every result file includes an Inspector URL. Open it in your browser to step through the conversation turn by turn, inspect slot values and tracker events, and resume from a specific turn to test a fix without replaying from scratch.

Simulated conversations use a sim- prefix on the sender ID (e.g. sim-f3a1b2c4), so they are clearly distinguished from real user conversations.
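
The prefix also makes simulated conversations easy to filter out in your own tooling. A minimal sketch, assuming only the sim-<uuid> shape described on this page; both helpers are hypothetical, and Rasa generates the real IDs for you.

```python
import uuid

SIM_PREFIX = "sim-"


def new_simulation_sender_id() -> str:
    """Build a sender ID in the sim-<uuid> shape described on this page."""
    return f"{SIM_PREFIX}{uuid.uuid4()}"


def is_simulated(sender_id: str) -> bool:
    """True for conversations that carry the simulation prefix."""
    return sender_id.startswith(SIM_PREFIX)


# Example: keep only real user conversations from a mixed list of sender IDs.
# "user-42" is a made-up ID for a real user conversation.
sender_ids = ["user-42", "sim-f3a1b2c4", new_simulation_sender_id()]
real_users = [sender_id for sender_id in sender_ids if not is_simulated(sender_id)]
```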

Evaluations vs. E2E Tests

Both tools test your agent, but they serve different purposes. Use whichever fits the flow you're testing — or both in parallel.

| | E2E Tests | Evaluations |
| --- | --- | --- |
| Best for | Deterministic, scripted flows with a fixed expected path | Autonomous LLM-driven agents where the path is hard to fully pre-define |
| When to use | CI pipelines, regression checks on controlled business logic | Build loop: testing whether your agent achieves its goals while iterating |
| Pass/fail signal | Binary per step | LLM judge scores + deterministic assertions combined |
| Suitable for CI? | Yes | No |

Evaluations are not yet suitable for blocking CI pipelines. Keep your E2E tests in CI for now and use evaluations in your local build loop.