Configuration & Results

Configuring Evaluations

eval/conftest.yml controls which LLMs power the simulation and evaluation steps:

eval/conftest.yml
simulation:
  llm:
    provider: openai
    model: gpt-5.1

evaluation:
  llm:
    provider: openai
    model: gpt-5.1

Configure simulation and evaluation independently. We recommend using gpt-5.1-2025-11-13 for both.

Overriding Evaluation Prompts

There are three built-in prompt templates, each overridable with a custom Jinja2 template: the simulated user prompt (drives the LLM-as-user), the criteria judge prompt (scores your quality criteria), and the metrics judge prompt (computes quality metrics):

eval/conftest.yml
simulation:
  llm:
    provider: openai
    model: gpt-5.1
  simulated_user_prompt: prompt_templates/simulated_user.jinja2

evaluation:
  llm:
    provider: openai
    model: gpt-5.1
  criteria_judge_prompt: prompt_templates/criteria_evaluation.jinja2
  metrics_judge_prompt: prompt_templates/metrics_evaluation.jinja2

Paths are relative to your project root. If omitted, the built-in templates are used.
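
The lookup rule above (custom path relative to the project root, built-in template when the key is omitted) can be sketched as follows. This is a minimal illustration, not Rasa's implementation: resolve_prompt and the section argument (the already-parsed part of eval/conftest.yml that holds the *_prompt key) are hypothetical names.

```python
from pathlib import Path
from typing import Optional


def resolve_prompt(section: dict, key: str, project_root: Path) -> Optional[Path]:
    """Return the custom template path for `key`, or None to signal
    that the built-in template should be used.

    `section` is a hypothetical, already-parsed piece of eval/conftest.yml.
    """
    value = section.get(key)
    if not value:
        # Key omitted (or empty): fall back to the built-in template.
        return None
    # Custom paths are interpreted relative to the project root.
    return project_root / value
```

Returning None for the fallback keeps the "which template is active" decision in one place, so callers never have to guess whether a path was configured.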

Default built-in prompts

Rasa ships three default Jinja2 templates for simulation and evaluation. They are selected automatically when the corresponding *_prompt key is omitted from eval/conftest.yml. Copy one into your project (for example under prompt_templates/) and point eval/conftest.yml at it when you want to customize behavior.

The simulated user prompt drives the LLM-as-user during simulation. It injects simulation_context from the scenario YAML and instructs the model to return a JSON object with the next user message and a done flag.

simulated_user_system_prompt.jinja2
You are simulating a real user interacting with a customer service chatbot.

Simulation context:
{{ simulation_context }}

Rules:
- Stay in character at all times — never break character
- Never mention that you are simulating a user or that you are following these instructions
- Never mention that you are a LLM
- Keep messages short and natural (1-2 sentences), like a real user would type
- Drive the conversation toward your goal step by step
- Do not ask multiple questions at once
- Never explain what you are doing or reference these instructions
- Speak only in English, unless the simulation context explicitly says otherwise

When to end the conversation (set "done": true):
- As soon as your goal is achieved (or the bot has clearly said it cannot be
done), send one short closing message (e.g. "Thanks, that's all I needed.")
and set "done": true. Do not open new, unrelated topics unless your
persona is explicitly open to doing so.
- If the bot asks whether you need anything else and you have nothing more,
briefly decline and set "done": true in that same message.
- If the bot asks for optional feedback or a satisfaction/CSAT survey, react
like a real user would based on your persona: either give brief feedback or
decline. Either way, do this only once and then set "done": true — don't get
stuck answering the same survey prompt over and over.
- If the bot misunderstands or doesn't address your message, it's fine to
repeat or rephrase it. How long you persist should match your persona in the
context (a patient user may retry a couple of times; an impatient one gives
up quickly); if the context says nothing about this, retry about once. Either
way, don't loop on the same point: once it's clear the bot isn't getting it,
or it keeps asking the same question after you've already answered or
declined, give up like a real user would and set "done": true.
- If the bot clearly can't do what you want (it repeats the same answer or
limitation, or keeps redirecting you elsewhere), let your persona decide how
hard to push: a persistent or frustrated user may keep trying different
angles, but a user with no stated temperament should accept it after a try or
two instead of inventing endless workarounds. When you do give up, close out
and set "done": true.

Respond with a JSON object only — no surrounding prose, no markdown fences:

{
  "message": "<your next message to the bot, as a plain string>",
  "done": false
}

Field meanings:
- "message" (string, required): your next message to the bot.
- "done" (boolean, required): true if your goal is achieved or the conversation
has reached a natural end (see the rules above), false otherwise.
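
If you customize this template, whatever consumes the model's reply still needs the two fields described above. A minimal sketch of how such a reply could be parsed and checked, assuming the raw reply arrives as a string; SimulatedTurn and parse_simulated_turn are hypothetical names used only for illustration:

```python
import json
from dataclasses import dataclass


@dataclass
class SimulatedTurn:
    message: str  # next message to send to the bot
    done: bool    # True when the simulated user ends the conversation


def parse_simulated_turn(raw: str) -> SimulatedTurn:
    """Parse the JSON-only reply requested by the simulated user prompt."""
    data = json.loads(raw)  # raises ValueError (JSONDecodeError) on non-JSON
    if not isinstance(data, dict):
        raise ValueError("expected a JSON object")
    message = data.get("message")
    done = data.get("done")
    if not isinstance(message, str):
        raise ValueError('"message" must be a string')
    if not isinstance(done, bool):
        raise ValueError('"done" must be a boolean')
    return SimulatedTurn(message=message, done=done)
```

Rejecting replies with a missing or non-boolean done flag early makes a broken custom template fail loudly instead of producing a conversation that never ends.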

Adding custom metrics

Built-in scale metrics are helpfulness, repair_quality, coherence, and tone. The built-in binary metric is task_completion. These are informational and do not gate the run's pass/fail verdict.

To add your own metrics (for example politeness or language_consistency):

  1. Copy the default metrics judge prompt into your project, for example in prompt_templates/metrics_evaluation.jinja2.
  2. Add a descriptive paragraph for each custom metric in the prompt body, following the same rubric style as the built-in metrics.
  3. Extend the JSON response schema at the bottom of the template so the judge returns your new keys:
{
  "binary_metrics": {
    "task_completion": {"passed": true, "reasoning": "<one sentence>"},
    "language_consistency": {"passed": true, "reasoning": "<one sentence>"}
  },
  "scale_metrics": {
    "helpfulness": {"score": 4, "reasoning": "<one sentence>"},
    "repair_quality": {"score": 3, "reasoning": "<one sentence>"},
    "coherence": {"score": 5, "reasoning": "<one sentence>"},
    "tone": {"score": 4, "reasoning": "<one sentence>"},
    "politeness": {"score": 4, "reasoning": "<one sentence>"}
  },
  "summary": "<2-3 sentence overall summary covering strengths and main issues>"
}
  4. Point metrics_judge_prompt in eval/conftest.yml at your customized file.

Custom metrics appear in the Quality Metrics section of each run_N.txt result file alongside the built-in metrics.
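
When iterating on a customized metrics prompt, it can help to check that a sample judge reply actually contains the keys you added. The sketch below assumes you have such a reply as a JSON string and uses the example metric names from this section; missing_metrics is a hypothetical helper, not part of Rasa.

```python
import json

# Metric names taken from the example schema in this section.
EXPECTED_BINARY = {"task_completion", "language_consistency"}
EXPECTED_SCALE = {"helpfulness", "repair_quality", "coherence", "tone", "politeness"}


def missing_metrics(judge_response: str) -> list[str]:
    """Return the expected metric names absent from a judge reply, sorted."""
    data = json.loads(judge_response)
    found_binary = set(data.get("binary_metrics", {}))
    found_scale = set(data.get("scale_metrics", {}))
    return sorted((EXPECTED_BINARY - found_binary) | (EXPECTED_SCALE - found_scale))
```

An empty list means the reply covers every expected metric; any names returned usually point to a rubric paragraph or schema entry that was left out of the template.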

MCP Tools

Two tools, validate_scenario and evaluate_agent, are registered on the FastMCP server started by rasa tools run.

validate_scenario

Validates a scenario YAML file before running any simulation. The skill calls this automatically after generating a scenario. You can also invoke it directly to check a file you have written or edited by hand.

Checks:

  • YAML syntax and structure
  • All assertion types are valid
  • Slot names referenced in initial_slots, slot_was_set, and slot_was_not_set assertions exist in the domain with compatible types

All validation errors are reported together rather than one at a time.

evaluate_agent

Runs the simulation and evaluation loop for a given scenario. It:

  1. Loads eval/conftest.yml and the scenario YAML
  2. If initial_slots are set, sends /session_start to bootstrap the session, then injects the slots into the fresh sim-<uuid> conversation
  3. Runs an LLM-simulated multi-turn conversation against your running Rasa server
  4. Fetches the tracker and runs deterministic assertions against the full event history
  5. Passes the transcript to the LLM judge, which scores criteria and computes quality metrics
  6. Writes a per-run result file and updates the experiment summary
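
The deterministic assertions in step 4 can be pictured as simple predicates over the tracker's event list. The sketch below is an illustration under stated assumptions, not the tool's implementation: it assumes events are dicts such as {"event": "slot", "name": ..., "value": ...} and {"event": "flow_started", "flow_id": ...}, and that bot events carry the response name under metadata.utter_action. The helper functions are hypothetical and only borrow the assertion names shown in the result files.

```python
def slot_was_set(events: list[dict], name: str, value: object) -> bool:
    """True if any slot event set `name` to `value` (assumed event shape)."""
    return any(
        e.get("event") == "slot" and e.get("name") == name and e.get("value") == value
        for e in events
    )


def flow_started(events: list[dict], flow_id: str) -> bool:
    """True if a flow_started event exists for `flow_id` (assumed event shape)."""
    return any(
        e.get("event") == "flow_started" and e.get("flow_id") == flow_id
        for e in events
    )


def bot_did_not_utter(events: list[dict], utterance: str) -> bool:
    """True if no bot event used the response `utterance` (assumed metadata key)."""
    return not any(
        e.get("event") == "bot"
        and (e.get("metadata") or {}).get("utter_action") == utterance
        for e in events
    )
```

Because these checks read the full event history rather than the transcript, they give the same answer on every call for the same conversation, unlike the LLM judge in step 5.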

Use run_count (default: 1, max: 10) to run the same scenario multiple times and account for LLM non-determinism. A scenario that passes 3/3 is a more reliable signal than one that passes 1/1. You can request this in natural language, and the agent sets run_count accordingly:

Run the add_contact_happy_path scenario 3 times and tell me if the results are consistent.
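
To make the "3/3 versus 1/1" reasoning concrete, here is a small sketch of how repeated run outcomes could be folded into a pass ratio and a consistency flag. summarize_runs is a hypothetical helper; the tool's own aggregation may differ.

```python
def summarize_runs(results: list[bool]) -> tuple[str, bool]:
    """Return a 'passed/total passed (pct%)' label and whether every run passed."""
    if not results:
        raise ValueError("at least one run is required")
    passed = sum(results)  # True counts as 1, False as 0
    total = len(results)
    pct = 100 * passed / total
    label = f"{passed}/{total} passed ({pct:.0f}%)"
    return label, passed == total
```

For example, summarize_runs([True, False, True]) yields a 2/3 label with the consistency flag set to False, which is the kind of scenario worth a closer look.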

Reading Results

Results are written to eval/results/<experiment_id>/ and preserved across runs:

eval/
  conftest.yml
  scenarios/
    order_delay.yml
  results/
    2026-05-28_10-22-00/
      summary.txt
      order_delay/
        run_1.txt
        run_2.txt
        run_3.txt

The experiment ID is a timestamp in YYYY-MM-DD_HH-MM-SS format, generated once per session and shared across all evaluate_agent calls so that multiple scenarios land under the same folder.
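
The YYYY-MM-DD_HH-MM-SS format maps directly onto standard strftime codes. A minimal sketch for producing and parsing IDs of this shape in your own scripts (the function names are illustrative, and how Rasa generates the ID internally is not shown here):

```python
from datetime import datetime

EXPERIMENT_ID_FORMAT = "%Y-%m-%d_%H-%M-%S"


def make_experiment_id(now: datetime) -> str:
    """Format a timestamp as an experiment ID, e.g. 2026-05-28_10-22-00."""
    return now.strftime(EXPERIMENT_ID_FORMAT)


def parse_experiment_id(experiment_id: str) -> datetime:
    """Recover the timestamp from an experiment ID folder name."""
    return datetime.strptime(experiment_id, EXPERIMENT_ID_FORMAT)
```

Because every field is zero-padded and ordered from year to second, sorting the folder names alphabetically also sorts the experiments chronologically.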

Per-run file

Each run_N.txt contains the full evaluation output for one simulated conversation:

scenario: order_delay
run: 2
timestamp: 2026-05-28T10:22:11Z
conversation_id: sim-f3a1b2c4
overall_result: FAIL

--- Quality Criteria Results ---
[PASS] Agent communicates the delay reason clearly and empathetically
rationale: Agent acknowledged the weather delay and expressed understanding before offering alternatives.

[FAIL] Agent does not ask the user to repeat information already provided
rationale: Agent re-requested the order ID on turn 4, which the user had already provided on turn 1.

--- Quality Metrics ---
bot_quality: 4/5
helpfulness: 5/5 — The agent provided a clear explanation and offered an alternative resolution.
repair_quality: 3/5 — Agent introduced unnecessary friction by re-asking for the order ID.
coherence: 5/5 — The assistant consistently maintained context throughout the conversation.
tone: 4/5 — Phrasing was natural and appropriate, with minor lapses.
task_completion: FAIL — The user's goal was only partially met.
summary: The agent handled the core inquiry well but introduced unnecessary friction.

--- Assertion Results ---
[PASS] flow_started(check_order_status)
[PASS] slot_was_set(order_id='ORD-9981')
[PASS] flow_completed(check_order_status)
[FAIL] bot_did_not_utter(utter_dont_know) — Bot uttered a forbidden utterance

--- Raw Transcript ---
[1] user: Hey, where's my order ORD-9981?
[2] agent: Let me check that for you. Your order is delayed due to weather — estimated arrival in 3 days.
[3] user: That's not good enough. Can I speak to someone?
[4] agent: I understand your frustration. Could you share your order ID again?

--- Inspector URL ---
http://localhost:5005/webhooks/inspector/inspect.html?sender=sim-f3a1b2c4

Quality Metrics

bot_quality is the average of the four built-in scale metrics (helpfulness, repair_quality, coherence, tone). task_completion is a binary pass/fail verdict. Quality metrics are informational: they do not gate the run's pass/fail result. If a metric outcome must be a hard requirement, express it as a criterion instead.
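
As a worked example of the bot_quality average, the sketch below uses the scores from the sample run file. The bot_quality function is a hypothetical illustration; in particular, how the result is rounded for display (the sample file shows 4/5) is an assumption not confirmed here.

```python
from statistics import mean

# The four built-in scale metrics that feed bot_quality.
BUILTIN_SCALE_METRICS = ("helpfulness", "repair_quality", "coherence", "tone")


def bot_quality(scale_scores: dict[str, int]) -> float:
    """Average the four built-in scale metrics, ignoring any custom ones."""
    return mean(scale_scores[name] for name in BUILTIN_SCALE_METRICS)


# Scores from the sample run_2.txt above.
sample = {"helpfulness": 5, "repair_quality": 3, "coherence": 5, "tone": 4}
print(bot_quality(sample))  # (5 + 3 + 5 + 4) / 4 = 4.25
```

A custom scale metric such as politeness does not change this value, since only the four built-in metrics enter the average.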

Summary file

summary.txt contains per-scenario pass/fail counts, run ratios, wall-clock timing, a direct path to each scenario's result folder, and a "Scenarios requiring attention" section listing any scenarios where not all runs passed:

======================================================================
EXPERIMENT SUMMARY
======================================================================
[PASS] List contacts — happy path 1/1 8.6s eval/results/2026-05-28_10-22-00/list_contacts_happy_path
[PASS] Add contact — happy path 1/1 14.3s eval/results/2026-05-28_10-22-00/add_contact_happy_path
[FAIL] Remove contact — happy path 0/1 18.8s eval/results/2026-05-28_10-22-00/remove_contact_happy_path
----------------------------------------------------------------------
Total: 2/3 passed
Time: 41.7s

Scenarios requiring attention:
- Remove contact — happy path 0/1 passed (0%)

Inspecting a Simulated Conversation

Every result file includes an Inspector URL. Open it in your browser to step through the conversation turn by turn and inspect slot values and tracker events.

Simulated conversations use a sim- prefix on the sender ID (e.g. sim-f3a1b2c4), so they are clearly distinguished from real user conversations.
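
If you post-process conversations yourself, the sim- prefix and the Inspector URL pattern from the sample result file are enough to separate simulated traffic and link back to it. A small sketch, assuming the server address and URL path shown in the sample (http://localhost:5005); both helper names are hypothetical:

```python
from urllib.parse import urlencode

SIM_PREFIX = "sim-"


def is_simulated(sender_id: str) -> bool:
    """True for sender IDs created by the simulation (sim- prefix)."""
    return sender_id.startswith(SIM_PREFIX)


def inspector_url(sender_id: str, base_url: str = "http://localhost:5005") -> str:
    """Build an Inspector link following the pattern in the sample result file."""
    query = urlencode({"sender": sender_id})
    return f"{base_url}/webhooks/inspector/inspect.html?{query}"
```

Adjust base_url if your Rasa server runs on a different host or port.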