Configuration & Results
Configuring Evaluations
eval/conftest.yml controls which LLMs power the simulation and evaluation steps:
simulation:
  llm:
    provider: openai
    model: gpt-5.1
evaluation:
  llm:
    provider: openai
    model: gpt-5.1
You can configure the simulation and evaluation models independently. We recommend using gpt-5.1-2025-11-13 for both.
Overriding Evaluation Prompts
There are three built-in prompt templates, each overridable with a custom Jinja2 template: the simulated user prompt (drives the LLM-as-user), the criteria judge prompt (scores your quality criteria), and the metrics judge prompt (computes quality metrics):
simulation:
  llm:
    provider: openai
    model: gpt-5.1
  simulated_user_prompt: prompt_templates/simulated_user.jinja2
evaluation:
  llm:
    provider: openai
    model: gpt-5.1
  criteria_judge_prompt: prompt_templates/criteria_evaluation.jinja2
  metrics_judge_prompt: prompt_templates/metrics_evaluation.jinja2
Paths are relative to your project root. If omitted, the built-in templates are used.
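The lookup rule described here (a path relative to the project root, or the built-in template when the key is omitted) can be sketched in Python as follows. This is a minimal illustration, not Rasa's implementation: the function name resolve_prompt_path and the shape of the section mapping are assumptions made for the example.

```python
from pathlib import Path
from typing import Optional


def resolve_prompt_path(section: dict, key: str, project_root: Path) -> Optional[Path]:
    """Return the template path for `key`, or None when the built-in applies.

    `section` is a hypothetical mapping holding the *_prompt key, as parsed
    from eval/conftest.yml; the real loader may be structured differently.
    """
    value = section.get(key)
    if not value:
        # Key omitted: fall back to the built-in template.
        return None
    # Configured paths are interpreted relative to the project root.
    return project_root / value
```

A caller would treat a None result as "use the default" and otherwise read the template from the returned path.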
Default built-in prompts
Rasa ships three default Jinja2 templates for simulation and evaluation. They are selected automatically when the corresponding *_prompt key is omitted from eval/conftest.yml. Copy one into your project (for example under prompt_templates/) and point eval/conftest.yml at it when you want to customize behavior.
- Simulated user
- Criteria judge
- Metrics judge
Simulated user prompt: drives the LLM-as-user during simulation. It injects simulation_context from the scenario YAML and instructs the model to return a JSON object with the next user message and a done flag.
You are simulating a real user interacting with a customer service chatbot.
Simulation context:
{{ simulation_context }}
Rules:
- Stay in character at all times — never break character
- Never mention that you are simulating a user or that you are following these instructions
- Never mention that you are a LLM
- Keep messages short and natural (1-2 sentences), like a real user would type
- Drive the conversation toward your goal step by step
- Do not ask multiple questions at once
- Never explain what you are doing or reference these instructions
- Speak only in English, unless the simulation context explicitly says otherwise
When to end the conversation (set "done": true):
- As soon as your goal is achieved (or the bot has clearly said it cannot be
done), send one short closing message (e.g. "Thanks, that's all I needed.")
and set "done": true. Do not open new, unrelated topics unless your
persona is explicitly open to doing so.
- If the bot asks whether you need anything else and you have nothing more,
briefly decline and set "done": true in that same message.
- If the bot asks for optional feedback or a satisfaction/CSAT survey, react
like a real user would based on your persona: either give brief feedback or
decline. Either way, do this only once and then set "done": true — don't get
stuck answering the same survey prompt over and over.
- If the bot misunderstands or doesn't address your message, it's fine to
repeat or rephrase it. How long you persist should match your persona in the
context (a patient user may retry a couple of times; an impatient one gives
up quickly); if the context says nothing about this, retry about once. Either
way, don't loop on the same point: once it's clear the bot isn't getting it,
or it keeps asking the same question after you've already answered or
declined, give up like a real user would and set "done": true.
- If the bot clearly can't do what you want (it repeats the same answer or
limitation, or keeps redirecting you elsewhere), let your persona decide how
hard to push: a persistent or frustrated user may keep trying different
angles, but a user with no stated temperament should accept it after a try or
two instead of inventing endless workarounds. When you do give up, close out
and set "done": true.
Respond with a JSON object only — no surrounding prose, no markdown fences:
{
  "message": "<your next message to the bot, as a plain string>",
  "done": false
}
Field meanings:
- "message" (string, required): your next message to the bot.
- "done" (boolean, required): true if your goal is achieved or the conversation
has reached a natural end (see the rules above), false otherwise.
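For readers who customize the simulated user template, the reply contract above (a JSON object with a string "message" and a boolean "done") can be checked with a few lines of Python. This is a hedged sketch: SimulatedTurn and parse_simulated_turn are hypothetical names for illustration, and Rasa's own parsing may differ.

```python
import json
from dataclasses import dataclass


@dataclass
class SimulatedTurn:
    message: str
    done: bool


def parse_simulated_turn(raw: str) -> SimulatedTurn:
    """Parse one simulated-user reply and enforce the two required fields."""
    data = json.loads(raw)
    if not isinstance(data, dict):
        raise ValueError("reply must be a JSON object")
    message = data.get("message")
    done = data.get("done")
    if not isinstance(message, str):
        raise ValueError('"message" must be a string')
    if not isinstance(done, bool):
        raise ValueError('"done" must be a boolean')
    return SimulatedTurn(message=message, done=done)
```

A simulation loop built on this sketch would send turn.message to the bot and stop once turn.done is true.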
Criteria judge prompt: scores each deal-breaker criterion from the scenario against the transcript and, when one is available, the Rasa event ledger. It injects criteria_text and transcript, plus event_ledger when present.
You are evaluating whether a chatbot conversation satisfied a list of specific requirements.
Your only job is to mark each numbered criterion as `passed` or `not passed`. Do **not** judge tone, naturalness, or overall quality — that is evaluated separately. Be strict: a criterion passes only if the evidence clearly shows it was met. If evidence is ambiguous or missing, mark it as not passed.
Write every `reasoning` field in **English**, even when the conversation transcript is in another language. You may quote transcript text in its original language when citing evidence.
Criteria (numbered):
{{ criteria_text }}
Conversation transcript:
{{ transcript }}
{% if event_ledger %}
Behind-the-scenes event ledger (turn-anchored timeline of what the assistant runtime actually did):
{{ event_ledger }}
How to use the event ledger:
- The ledger is a curated, compressed timeline of `flow_started` / `flow_completed`, custom and terminal `action` invocations, user-driven `slot_set` events, sub-agent lifecycle events, and `mcp_tool:` calls made by MCP-tooled sub-agents.
- Each `[Tn]` block is anchored to the user's n-th message in the transcript. `[Tn] (events repeat identically through Tm — × N turns total)` means the same internal events fired in every one of those turns.
- **Two kinds of actions appear in the ledger and they mean very different things:**
- **`action: utter_<name>`** — a template / NLG render. It only produces visible bot text. It is **not** evidence that any backend operation happened. A bot saying "your transfer is complete" via `utter_transfer_complete` is just text — it does **not** prove money moved, an API was called, a record was written, an email was sent, or anything else.
- **`action: action_<name>`** (no `utter_` prefix) — a custom action that executes a real backend side effect: API call, DB write, submission, transfer, payment, escalation, integration. Presence is positive evidence of execution; absence is evidence the operation did not occur.
- **`flow_completed: <flow_id>`** is the strongest single signal that a flow finished its full sequence. If a flow has a `flow_started` but no matching `flow_completed`, the flow was interrupted and any "I've completed your X" claim by the bot is suspect.
- **Sub-agent lifecycle events** appear when a flow delegates work to an agent (an MCP-tooled LLM agent, an A2A remote agent, etc.). When known, the agent's protocol is shown as `protocol=mcp` or `protocol=a2a` — this tells you whether to expect `mcp_tool:` lines from that agent (only `mcp` agents emit them). Treat these events like flow events for the purpose of judging completion:
- **`agent_started: <agent_id> (flow=<flow_id>, protocol=<protocol>)`** — the agent step was entered. By itself this proves nothing about success.
- **`agent_completed: <agent_id> (flow=<flow_id>, protocol=<protocol>, status=<status>)`** — the agent finished its work. This is the strongest evidence the agent's task ran end-to-end. Check `status` if present: a status like `success` / `completed` is positive evidence; `failed` / `error` / `input_required` is the opposite.
- **`agent_interrupted: <agent_id>`** — the agent was paused (typically by user input mid-stream). Anything the bot then claims about completion is suspect unless a later `agent_resumed` + `agent_completed` pair appears.
- **`agent_cancelled: <agent_id> (flow=<flow_id>, reason=<reason>)`** — the agent was cancelled. The agent's task did **not** finish, regardless of what the parent flow says.
- **`agent_resumed: <agent_id>`** — work continued after an interrupt; only counts as completion if a subsequent `agent_completed` appears.
- If a criterion asks whether a sub-agent completed something (a downstream system was reached, a remote agent finished its task, a tool-using assistant executed its plan), require a matching `agent_completed` with a non-error status. An `agent_started` without `agent_completed`, or an `agent_cancelled` / `agent_interrupted`-without-resume, is evidence the task did not finish.
- **`mcp_tool:` lines** are individual backend tool calls executed *inside* an MCP-tooled sub-agent's loop. They are the strongest direct evidence that a specific backend operation ran:
- **`mcp_tool: <tool_name>(<args>) → ok`** — the sub-agent actually called `<tool_name>` with those arguments and it returned successfully. This is positive evidence that the operation (e.g. `get_bills`, `password_check`, `submit_claim`) executed. Use it to verify criteria about *which* tool was called and with what inputs — something the transcript alone cannot show.
- **`mcp_tool: <tool_name>(<args>) → ok (empty)`** — the tool ran and the MCP layer reported success, but it returned **no data** (an empty payload). Many tools swallow a backend failure into an empty result while still reporting success. Treat this as a **likely failure**: if the bot then states a concrete result that this tool was supposed to provide (specific figures, a record, a status), that claim is **suspect** and a criterion asserting the data was delivered should `FAIL`.
- **`mcp_tool: <tool_name>(<args>) → ok (reports error)`** — the tool returned successfully at the MCP layer but its payload carries an error field. Treat this the same as `→ ERROR`: the operation did not actually succeed.
- **`mcp_tool: <tool_name>(<args>) → ERROR: <message>`** — the tool was called but failed. Any bot claim that depends on that tool succeeding (e.g. "here are your charges", "your request is filed") is **suspect**: a criterion asserting the operation succeeded should `FAIL`, even if a later `agent_completed` appears (the agent can finish while one of its tools errored).
- **Reasoning about a missing tool call**: only MCP-tooled sub-agents emit `mcp_tool:` lines. The agent's protocol is shown on its `agent_started` / `agent_completed` line as `protocol=mcp` or `protocol=a2a`. Use it to decide what a *missing* tool call means:
- For a `protocol=mcp` agent step, its `mcp_tool:` lines are the **complete** record of the backend tools it called. If a criterion requires a specific tool and it does **not** appear among that agent's `mcp_tool:` lines, the tool was not run (the bot may have fabricated the result) — `FAIL` the criterion. This holds **even when that agent shows `agent_completed` with `status=success`, and even when the agent emitted no `mcp_tool:` lines at all**: a `protocol=mcp` agent that completes with zero `mcp_tool:` lines called **no** backend tools, so any criterion requiring a backend retrieval or operation it was supposed to perform must `FAIL` (a `success` status only means the agent's own loop ended cleanly — not that it called the expected tool). Apply this only when the criterion actually requires a specific backend tool/data; do not penalize a `protocol=mcp` agent that legitimately needs no tool call (e.g. it answers from already-available context).
- For a `protocol=a2a` agent step (tools run on a remote server, invisible to the ledger) — or any agent step with **no** `protocol=` label, e.g. older runs — the *absence* of `mcp_tool:` lines is **not** evidence a tool failed. Fall back to the `agent_completed` status (and any `action_<name>` / `flow_completed`) for whether the work finished. As a secondary cue, if that step did emit at least one `mcp_tool:` line it is instrumented, so a specific missing tool is still meaningful.
- **Precedence and hallucination-checking**: when a criterion asks whether some real-world action actually happened (a submission was made, a payment processed, a transfer executed, an account updated, a record filed, a sub-agent finished its task):
- Look for the corresponding `action: action_<name>` in the ledger, and ideally a matching `flow_completed`. If a sub-agent was involved, also require `agent_completed` for that agent_id with a non-error status, and — for MCP-tooled agents — a successful `mcp_tool:` line for the relevant operation (a `→ ERROR`, `→ ok (empty)`, or `→ ok (reports error)` tool call does not count as the operation having succeeded).
- If the bot text in the transcript says "I've done X" / "X is complete" / "X was successful" but the ledger shows neither a `action: action_<name>` nor a `flow_completed` (nor a successful `agent_completed`, when an agent step is involved) for the relevant flow, **the bot has hallucinated the outcome**. Any criterion claiming the action succeeded should `FAIL`, and any criterion phrased as "the assistant did not falsely claim X" should also `FAIL`.
- Do **not** treat an `utter_<name>` whose name suggests success ("`utter_transfer_complete`", "`utter_invoice_submitted`") as evidence the underlying operation happened. The template can be rendered without its corresponding action ever executing.
- Do **not** treat `agent_started` alone as evidence of completion — it only marks the entry to the agent step; without `agent_completed` (with a non-error status) the task did not finish.
- **Events upgrade behavior verdicts; they never override a user-visible failure on an outcome criterion.** Distinguish two kinds of criteria:
- **System-behavior criteria** — "the assistant retrieved the coupon information", "the right flow ran", "the correct tool was called", "the submission was executed". These are about what the system *did*. Use the ledger as ground truth: a `flow_completed` / `action_<name>` / successful `agent_completed` can `PASS` these even if the bot's wording to the user was clumsy or generic.
- **User-outcome / experience criteria** — "the user was able to view their coupons", "the assistant successfully helped the user reset their password", "the user received their answer". These are about what the user actually *got*. For these, the transcript is decisive: **if the bot communicated a failure or error to the user (e.g. "I'm having trouble, please try again later", "something went wrong", a generic fallback), the criterion `FAIL`s — regardless of what the ledger shows happened behind the scenes.** A backend that completed correctly while the user was shown an error is still a failed outcome for that user. Do **not** let `flow_completed` or `action_<name>` flip such a criterion to `PASS`.
- When a criterion is compound ("retrieved X **and** presented it to the user"), it only `PASS`es if **both** the behavior (ledger) **and** the user-facing delivery (transcript) succeeded. A backend success behind a user-visible error `FAIL`s the compound criterion.
- Linguistic criteria (e.g. "the assistant clearly communicated why it could not process the case") still come from the transcript — events don't have wording.
- Do not invent events. If a criterion asks about an action or flow that does not appear in the ledger, treat that as evidence the action did not happen.
{% endif %}
For each criterion, echo the criterion text **verbatim** in `criterion_text` — this is used to verify index-to-verdict alignment during review.
Respond with a JSON object only — no surrounding prose, no markdown fences:
{
  "criteria": [
    {
      "criterion_index": 1,
      "criterion_text": "<verbatim copy of the criterion text>",
      "passed": true,
      "reasoning": "<one sentence quoting or referencing the relevant turn(s){% if event_ledger %} or event(s){% endif %}>"
    }
  ]
}
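The template asks the judge to echo each criterion verbatim so that index-to-verdict alignment can be verified. One way such a check could look is sketched below. It assumes criterion_index is 1-based (as the example value suggests) and uses a hypothetical helper name; it is not Rasa's actual review code.

```python
import json
from typing import List


def check_alignment(criteria: List[str], raw_response: str) -> List[str]:
    """Return a list of alignment problems; an empty list means all verdicts line up."""
    verdicts = json.loads(raw_response)["criteria"]
    problems = []
    seen = set()
    for verdict in verdicts:
        index = verdict["criterion_index"]
        seen.add(index)
        if not 1 <= index <= len(criteria):
            problems.append(f"criterion {index}: index out of range")
        elif verdict["criterion_text"].strip() != criteria[index - 1].strip():
            problems.append(f"criterion {index}: echoed text does not match")
    # Every scenario criterion should have received a verdict.
    for index in range(1, len(criteria) + 1):
        if index not in seen:
            problems.append(f"criterion {index}: no verdict returned")
    return problems
```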
Metrics judge prompt: computes the built-in quality metrics from the transcript. It injects transcript. See Adding custom metrics to extend the output schema.
You are an expert evaluator of conversational AI quality. Score the assistant's overall conversational quality along two groups of metrics:
* **scale_metrics** — graded quality dimensions, each scored on a 1–5 scale (5 = excellent). Most production-quality conversations should land at 3–4; reserve `5` for clearly excellent behavior.
* **binary_metrics** — pass/fail verdicts. `true` for pass, `false` for fail. No partial credit.
Scoring rules (apply to every metric):
1. Score the **assistant's** behavior. The user's turns are context — they tell you what the assistant should have done — but the assistant's responses are what receive the score.
2. **Behavioral dimensions** (`helpfulness`, `repair_quality`, `coherence`) are evaluated *relative to* the user's turns: use the user's intent, corrections, and pushback to decide what the assistant should have done, then score how well the assistant actually did it.
3. **Surface dimensions** (`tone`) score the assistant's wording itself, independent of the user's emotional state. Whether the assistant should have softened or adapted its language given user frustration is captured by `repair_quality` and `coherence`, not `tone`.
4. Do not let the user's mood, politeness, length, or word choice move scores for the assistant. Identical assistant responses should receive identical scores regardless of how the user phrases their turns.
5. Write all `reasoning` fields and the `summary` in **English**, even when the conversation transcript is in another language. You may quote transcript text in its original language when citing evidence.
Conversation transcript:
{{ transcript }}
Binary metrics:
- **task_completion** — was the user's intended outcome clearly delivered by the assistant?
- `true` = the outcome was clearly delivered in the assistant's responses, not merely promised, deferred, or alluded to.
- `false` = the outcome was missing, only partially delivered, wrong, contradicted by a later turn, the conversation ended before it was addressed, or it is unclear from the transcript whether the user got what they came for (ambiguity = `false`).
Three rules override a naive reading:
- **CSAT / feedback loops** after a clear outcome delivery do not change the verdict. If the outcome was delivered earlier in the conversation, return `true` regardless of subsequent feedback loops or survey prompts.
- **Promised handoffs and async actions** — distinguish two cases:
- **Handoff to an external party** (live agent transfer, specialist callback, escalation to a human team, "I'll have someone reach out") requires follow-through. The bot offering a handoff that the user accepts only counts as outcome delivery if the assistant actually executed it within the conversation (filed a ticket, scheduled the callback, performed the transfer, displayed a confirmation that the handoff is in motion). An offer the user accepted but the assistant never followed through on = `false`.
- **Async system actions the assistant owns end-to-end** (email an invoice/receipt, queue a notification, submit a request to a backend, place an order, file a damage report) count as success once the assistant has collected the required inputs, confirmed them with the user, and announced the terminal action with concrete details — e.g. "your invoice will be sent to the email on file within 24 hours", "your damage report has been received". The judge cannot verify the email arrives or the report is processed in a backend — that is expected and acceptable. The verbalized completion *is* the bot's terminal action. Score `false` only if the announcement is vague or qualified ("we will try to send", "someone may follow up", "this will be handled at some point"), if the bot interrupted itself before completing the in-flow steps, or if the user clearly never confirmed the inputs.
- **User-initiated withdrawal** counts as success only when the user gives a clear, *positive* reason to withdraw that is independent of bot performance — examples: "I already paid it", "I found it in my email", "I figured it out myself", "I changed my mind, I'll just use the app", "I don't need that anymore". The user must explicitly signal that the original need has evaporated (already met elsewhere, no longer relevant, or actively reconsidered). Only in this case return `true`.
Two anti-patterns that are still `false`:
- **Frustration-driven dropoffs**: short curt exits like "never mind", "forget it", "I give up", "whatever" after the bot has deflected, re-asked, ignored pushback, or otherwise failed to make progress. These signal the user gave up *because the bot failed them*, not because their need was satisfied. Without an explicit positive reason ("I found it elsewhere", "I changed my mind"), default to `false`.
- **Pause-to-return-later**: the user steps away due to an external interruption ("hold on, I have to take a call", "I'll try again tonight", "my battery is dying"). The original need is unresolved, only deferred. Verdict is `false`.
When in doubt about whether a withdrawal is voluntary-satisfied vs. frustration-driven, default to `false` (consistent with the general ambiguity rule).
Scale metrics (1–5). Every point on the scale is anchored — use the in-between points (2 and 4) when the assistant's behavior clearly sits between the adjacent anchors, not as a hedge.
The four scale metrics measure **distinct observable behaviors**. A single failure mode (e.g. a "bot stuck in a feedback loop") will normally be the failure case for *at most one* metric — pick the metric that most directly describes what went wrong, and be reluctant to penalize the others for the same observation. The "Ignore" line under each metric tells you which failure modes belong to other metrics and should not affect the current score.
- **helpfulness** — **content quality of the assistant's substantive answers**: when the assistant attempts to answer the user's request, are those answers accurate, complete, and directly useful?
- **Ignore**: how the assistant handled disfluency or pushback (that is `repair_quality`); whether the assistant remembered prior turns or re-asked for given info (that is `coherence`); how the answer was phrased (that is `tone`). Only the *content* of the answers themselves goes into this score.
- **Slot-filling counts as forward progress**, not as a missing answer: an assistant that legitimately needs a piece of information to fulfill the request (e.g. asking for a phone number to look up a bill) is making productive progress and should not be scored down for "no substantive answer". Only penalize when the assistant *should* have answered something concrete and instead deflected, gave a vague non-answer, or was wrong.
- 5 = every substantive answer was accurate, complete, and directly actionable for the user's request.
- 4 = nearly all substantive answers were correct; one minor incomplete or imprecise answer that the user could still act on.
- 3 = at least one significant answer was vague, generic, or only partially correct where a specific answer was needed.
- 2 = multiple answers were unhelpfully vague or only partially correct; the user got partial value at best.
- 1 = the assistant consistently failed to give substantive answers — only deflections, generic non-answers, or wrong information. (Score 1 even for polite, consistent deflection if the user's request was within the assistant's stated scope.)
- **repair_quality** — **the assistant's response to user disfluency**: when the user explicitly signals the assistant misunderstood (using words like "no", "wait", "actually", "that's not what I meant", "I already told you", or repeating themselves), did the assistant adjust?
- **Ignore**: whether the assistant's substantive answers were correct (that is `helpfulness`); whether the assistant remembered slot values (that is `coherence`); how responses were phrased (that is `tone`). This metric is *only* about behavior immediately after a user disfluency signal.
- **If the conversation contains no disfluency signals** (the user accepted every answer and never pushed back), score 5 — there was no repair opportunity to mishandle. Note this default so it doesn't get confused with active recovery.
- 5 = every disfluency was acknowledged and acted on; the assistant absorbed user-driven course corrections cleanly. (Or: no disfluency arose.)
- 4 = nearly every disfluency handled; one case where the assistant needed an extra turn before recovering on its own.
- 3 = at least one disfluency signal was missed and the user had to repeat themselves once.
- 2 = multiple disfluencies missed; the user pushed back several times before the assistant changed course (or it never did).
- 1 = the assistant ignored explicit user pushback or kept repeating the same response after the user said it didn't help.
- **coherence** — **state and context tracking across turns**: does the assistant remember what was said earlier in the same conversation — slot values the user provided, prior corrections, the user's current intent?
- **Ignore**: whether the assistant's answers were correct (that is `helpfulness`); whether it recovered from disfluency (that is `repair_quality`); how it phrased things (that is `tone`). A consistently wrong, polite deflection is highly coherent — score it accordingly. Coherence is *only* about within-conversation memory and consistency.
- Penalize **only** for: (a) re-asking for information the user already provided, (b) contradicting an earlier statement the assistant itself made, or (c) continuing on an outdated interpretation after the user corrected it.
- 5 = no state-tracking issues: the assistant never re-asks for given info, never contradicts itself, never reverts to an outdated user intent.
- 4 = at most one minor slip (e.g. a brief outdated reference) that the assistant corrected within the next turn.
- 3 = one clear state-tracking failure: re-asked for one piece of info, OR briefly followed an outdated interpretation, OR one self-contradiction.
- 2 = two or three state-tracking failures across the conversation.
- 1 = persistent state-tracking failure: repeatedly re-asks for already-given info, persistent contradictions, or follows an outdated interpretation across multiple turns even after the user corrected it.
- **tone** — was the assistant's register and phrasing natural, varied, and easy to read? Do not penalize length by itself; penalize stilted, robotic, archaic, or awkward wording.
- 5 = natural, conversational phrasing; varied and easy to read.
- 4 = mostly natural with one or two slightly formulaic or stiff phrasings that don't disrupt readability.
- 3 = generally fine but occasional stilted, repetitive, or awkward phrasing.
- 2 = noticeably stilted or formulaic — multiple awkward, robotic, or repetitive phrasings that visibly affect readability.
- 1 = robotic, archaic, or consistently stilted/awkward phrasing.
Respond with a JSON object only — no surrounding prose, no markdown fences:
{
  "binary_metrics": {
    "task_completion": {"passed": <true or false>, "reasoning": "<one sentence>"}
  },
  "scale_metrics": {
    "helpfulness": {"score": <1-5>, "reasoning": "<one sentence>"},
    "repair_quality": {"score": <1-5>, "reasoning": "<one sentence>"},
    "coherence": {"score": <1-5>, "reasoning": "<one sentence>"},
    "tone": {"score": <1-5>, "reasoning": "<one sentence>"}
  },
  "summary": "<2-3 sentence overall summary covering strengths and main issues>"
}
Adding custom metrics
Built-in scale metrics are helpfulness, repair_quality, coherence, and tone. The built-in binary metric is task_completion. These are informational and do not gate the run's pass/fail verdict.
To add your own metrics (for example politeness or language_consistency):
- Copy the default metrics judge prompt into your project, for example in prompt_templates/metrics_evaluation.jinja2.
- Add a descriptive paragraph for each custom metric in the prompt body, following the same rubric style as the built-in metrics.
- Extend the JSON response schema at the bottom of the template so the judge returns your new keys:
{
  "binary_metrics": {
    "task_completion": {"passed": true, "reasoning": "<one sentence>"},
    "language_consistency": {"passed": true, "reasoning": "<one sentence>"}
  },
  "scale_metrics": {
    "helpfulness": {"score": 4, "reasoning": "<one sentence>"},
    "repair_quality": {"score": 3, "reasoning": "<one sentence>"},
    "coherence": {"score": 5, "reasoning": "<one sentence>"},
    "tone": {"score": 4, "reasoning": "<one sentence>"},
    "politeness": {"score": 4, "reasoning": "<one sentence>"}
  },
  "summary": "<2-3 sentence overall summary covering strengths and main issues>"
}
- Point metrics_judge_prompt in eval/conftest.yml at your customized file.
Custom metrics appear in the Quality Metrics section of each run_N.txt result file alongside the built-in metrics.
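When a customized template is first tried out, it can help to confirm that the judge actually returns the new keys. The sketch below checks a raw judge response against the metric names you expect. The helper name missing_metrics is hypothetical, and the response shape is taken from the schema shown above.

```python
import json
from typing import Dict, List


def missing_metrics(raw_response: str, expected: Dict[str, List[str]]) -> List[str]:
    """List expected metrics (as "group.name") that are absent from the judge response."""
    data = json.loads(raw_response)
    missing = []
    for group, names in expected.items():
        returned = data.get(group, {})
        for name in names:
            if name not in returned:
                missing.append(f"{group}.{name}")
    return missing
```

For the example above, expected would be {"binary_metrics": ["language_consistency"], "scale_metrics": ["politeness"]}, and an empty result means both custom metrics came back.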
MCP Tools
Both tools, validate_scenario and evaluate_agent, are registered on the rasa tools run FastMCP server.
validate_scenario
Validates a scenario YAML file before running any simulation. The skill calls this automatically after generating a scenario. You can also invoke it directly to check a file you have written or edited by hand.
Checks:
- YAML syntax and structure
- All assertion types are valid
- Slot names referenced in initial_slots, slot_was_set, and slot_was_not_set assertions exist in the domain with compatible types
- All validation errors are reported together, not one at a time
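The "report everything together" behavior in the checks above can be illustrated with a small collector that gathers all unknown slot references before returning. This is only a sketch of the idea: the function name and input shape are hypothetical, the real validator works on the scenario YAML and domain directly, and type compatibility is not covered here.

```python
from typing import Dict, Iterable, List


def check_slot_references(
    references: Dict[str, Iterable[str]], domain_slots: Iterable[str]
) -> List[str]:
    """Collect every unknown slot reference instead of stopping at the first one.

    `references` maps a source (e.g. "initial_slots", "slot_was_set") to the
    slot names mentioned there; this shape is an assumption for the example.
    """
    known = set(domain_slots)
    errors = []
    for source, names in references.items():
        for name in names:
            if name not in known:
                errors.append(f"{source}: unknown slot '{name}'")
    return errors
```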
evaluate_agent
Runs the simulation and evaluation loop for a given scenario. It:
- Loads eval/conftest.yml and the scenario YAML
- (If initial_slots are set) sends /session_start to bootstrap the session, then injects slots into the fresh sim-<uuid> conversation
- Runs an LLM-simulated multi-turn conversation against your running Rasa server
- Fetches the tracker and runs deterministic assertions against the full event history
- Passes the transcript to the LLM judge, which scores criteria and computes quality metrics
- Writes a per-run result file and updates the experiment summary
Use run_count (default: 1, max: 10) to run the same scenario multiple times and account for LLM non-determinism. A scenario that passes 3/3 is a more reliable signal than one that passes 1/1. You can control this in natural language, and the agent sets run_count accordingly:
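To make the "3/3 versus 1/1" point concrete, the sketch below summarizes a list of per-run verdicts into a pass ratio and a consistency flag. The helper name summarize_runs is hypothetical, and how the tool itself aggregates runs is not specified here.

```python
from typing import List, Tuple


def summarize_runs(results: List[bool]) -> Tuple[str, bool]:
    """Return ("passed/total", consistent) for a list of per-run verdicts.

    A scenario is treated as consistent when every run agrees (all pass or all fail).
    """
    passed = sum(results)
    consistent = passed in (0, len(results))
    return f"{passed}/{len(results)}", consistent
```

With verdicts [True, True, False] this gives ("2/3", False), a hint that the scenario is flaky and worth rerunning or inspecting.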
Run the add_contact_happy_path scenario 3 times and tell me if the results are consistent.
Reading Results
Results are written to eval/results/<experiment_id>/ and preserved across runs:
eval/
  conftest.yml
  scenarios/
    order_delay.yml
  results/
    2026-05-28_10-22-00/
      summary.txt
      order_delay/
        run_1.txt
        run_2.txt
        run_3.txt
The experiment ID is a timestamp in YYYY-MM-DD_HH-MM-SS format, generated once per session and shared across all evaluate_agent calls so that multiple scenarios land under the same folder.
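The timestamp format and folder layout described above can be reproduced in a few lines of Python, which is handy when scripts need to locate result files. This is a sketch under assumptions: the helper names are hypothetical, and whether the tool uses local time or UTC for the ID is not stated in this section.

```python
from datetime import datetime
from pathlib import Path

# YYYY-MM-DD_HH-MM-SS, as documented for the experiment ID.
EXPERIMENT_ID_FORMAT = "%Y-%m-%d_%H-%M-%S"


def experiment_id(now: datetime) -> str:
    """Format a timestamp as an experiment ID."""
    return now.strftime(EXPERIMENT_ID_FORMAT)


def run_file(results_root: Path, exp_id: str, scenario: str, run: int) -> Path:
    """Build the path of one per-run result file."""
    return results_root / exp_id / scenario / f"run_{run}.txt"
```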
Per-run file
Each run_N.txt contains the full evaluation output for one simulated conversation:
scenario: order_delay
run: 2
timestamp: 2026-05-28T10:22:11Z
conversation_id: sim-f3a1b2c4
overall_result: FAIL
--- Quality Criteria Results ---
[PASS] Agent communicates the delay reason clearly and empathetically
rationale: Agent acknowledged the weather delay and expressed understanding before offering alternatives.
[FAIL] Agent does not ask the user to repeat information already provided
rationale: Agent re-requested the order ID on turn 4, which the user had already provided on turn 1.
--- Quality Metrics ---
bot_quality: 4/5
helpfulness: 5/5 — The agent provided a clear explanation and offered an alternative resolution.
repair_quality: 3/5 — Agent introduced unnecessary friction by re-asking for the order ID.
coherence: 5/5 — The assistant consistently maintained context throughout the conversation.
tone: 4/5 — Phrasing was natural and appropriate, with minor lapses.
task_completion: FAIL — The user's goal was only partially met.
summary: The agent handled the core inquiry well but introduced unnecessary friction.
--- Assertion Results ---
[PASS] flow_started(check_order_status)
[PASS] slot_was_set(order_id='ORD-9981')
[PASS] flow_completed(check_order_status)
[FAIL] bot_did_not_utter(utter_dont_know) — Bot uttered a forbidden utterance
--- Raw Transcript ---
[1] user: Hey, where's my order ORD-9981?
[2] agent: Let me check that for you. Your order is delayed due to weather — estimated arrival in 3 days.
[3] user: That's not good enough. Can I speak to someone?
[4] agent: I understand your frustration. Could you share your order ID again?
--- Inspector URL ---
http://localhost:5005/webhooks/inspector/inspect.html?sender=sim-f3a1b2c4
Quality Metrics — bot_quality is the average of the four built-in scale metrics (helpfulness, repair_quality, coherence, tone). task_completion is a binary pass/fail verdict. Quality metrics are informational: they do not gate the run's pass/fail result. If a metric outcome must be a hard requirement, express it as a criterion instead.
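As a worked example of the averaging rule, the scores in the sample run file (5, 3, 5, 4) average to 4.25, which is consistent with the displayed 4/5. The sketch below computes the raw mean only; how the value is rounded for display is not specified here, so no rounding is applied. The function name is hypothetical.

```python
from typing import Dict

# Only the four built-in scale metrics enter the average.
SCALE_METRICS = ("helpfulness", "repair_quality", "coherence", "tone")


def bot_quality(scores: Dict[str, int]) -> float:
    """Average the four built-in scale metrics; other keys are ignored."""
    values = [scores[name] for name in SCALE_METRICS]
    return sum(values) / len(values)
```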
Summary file
summary.txt contains per-scenario pass/fail counts, run ratios, wall-clock timing, a direct path to each scenario's result folder, and a "Scenarios requiring attention" section listing any scenarios where not all runs passed:
======================================================================
EXPERIMENT SUMMARY
======================================================================
[PASS] List contacts — happy path 1/1 8.6s eval/results/2026-05-28_10-22-00/list_contacts_happy_path
[PASS] Add contact — happy path 1/1 14.3s eval/results/2026-05-28_10-22-00/add_contact_happy_path
[FAIL] Remove contact — happy path 0/1 18.8s eval/results/2026-05-28_10-22-00/remove_contact_happy_path
----------------------------------------------------------------------
Total: 2/3 passed
Time: 41.7s
Scenarios requiring attention:
- Remove contact — happy path 0/1 passed (0%)
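If you want to post-process summary.txt in CI, the per-scenario lines can be picked apart with a regular expression. The sketch below assumes the fields are whitespace-separated in the order shown above (status, name, pass ratio, seconds, result path); the exact column layout of the real file may differ, and the helper name is hypothetical.

```python
import re
from typing import Dict, Optional

SUMMARY_LINE = re.compile(
    r"^\[(?P<status>PASS|FAIL)\]\s+(?P<name>.+?)\s+"
    r"(?P<passed>\d+)/(?P<total>\d+)\s+(?P<seconds>[\d.]+)s\s+(?P<path>\S+)$"
)


def parse_summary_line(line: str) -> Optional[Dict[str, object]]:
    """Parse one per-scenario line; return None for headers, totals, and other lines."""
    match = SUMMARY_LINE.match(line.strip())
    if match is None:
        return None
    return {
        "status": match["status"],
        "name": match["name"],
        "passed": int(match["passed"]),
        "total": int(match["total"]),
        "seconds": float(match["seconds"]),
        "path": match["path"],
    }
```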
Inspecting a Simulated Conversation
Every result file includes an Inspector URL. Open it in your browser to step through the conversation turn by turn and inspect slot values and tracker events.
Simulated conversations use a sim- prefix on the sender ID (e.g. sim-f3a1b2c4), so they are clearly distinguished from real user conversations.
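The Inspector URL pattern from the sample result file and the sim- prefix convention can be combined into two small helpers. This is an illustrative sketch: the function names are hypothetical, and the default base URL is simply the one shown in the sample output, so adjust it to wherever your Rasa server runs.

```python
from urllib.parse import urlencode


def inspector_url(sender_id: str, base_url: str = "http://localhost:5005") -> str:
    """Build the Inspector URL for a conversation, following the sample result file."""
    query = urlencode({"sender": sender_id})
    return f"{base_url}/webhooks/inspector/inspect.html?{query}"


def is_simulated(sender_id: str) -> bool:
    """Simulated conversations carry a sim- prefix on the sender ID."""
    return sender_id.startswith("sim-")
```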