Fine-tuning Recipe for Command Generator
The feature is available starting with version 3.10.0 as a beta feature. If you are already familiar with the concepts of the recipe, head over to the user guide to follow the exact steps needed to try the recipe.
CALM is LLM agnostic, which means that when you start building your CALM assistant, you can use a powerful off-the-shelf LLM like GPT-4o via the OpenAI / Azure OpenAI platform. This is a great way to bootstrap your CALM assistant.
However, as the assistant is scaled up to more use cases and higher traffic, it can run into the following issues:
- Response times can be high, degrading the experience of the end user talking to the assistant.
- Relying on third-party LLM providers means adhering to the rate limits they impose, which can result in some user messages not receiving a reply.
- Some of these powerful LLMs can be quite costly at scale.
The fine-tuning recipe helps you fine-tune a small language model, e.g. Llama-3.1 8B, for the task of command generation and integrate it with your CALM assistant. Doing so can mitigate the issues around response times and LLM availability by a huge margin, and it also lowers the runtime costs of the assistant.
This page provides a conceptual understanding of how the recipe works under the hood. You can refer to the user guide to follow the exact steps needed to try the recipe.
Conceptual Overview
The recipe semi-automates the following steps in order to produce a fine-tuned LLM:
- Annotate commands for each user step for every sample conversation available.
- Generate synthetic data using an LLM to create new conversations by rephrasing every user step.
- Construct a fine-tuning dataset by aggregating prompt and commands of every user step across all generated conversations.
- Fine-tune an LLM on the fine-tuning dataset.
We explain each of the steps in more detail in the following sections.
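To make step 3 concrete: each example in the fine-tuning dataset pairs the command-generator prompt for a single user step with the commands the LLM is expected to produce at that step. The sketch below only illustrates this pairing; the field names are hypothetical, and the actual on-disk format is produced by the recipe's data preparation command.

# Illustrative sketch of one fine-tuning example per user step (hypothetical field names)
- prompt: |
    <the command generator prompt rendered for this user step: available flows,
     slots, and the conversation so far>
  completion: |
    start flow search_rental_car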
Preparation
The feature assumes that you already have a CALM assistant built with the CompactLLMCommandGenerator as the command generator, using a strong LLM like gpt-4o-2024-11-20, and that E2E tests are written for the same assistant.
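For reference, the relevant part of such an assistant's config.yml might look roughly like the following. This is a minimal sketch assuming the OpenAI provider; your actual configuration will contain additional components and settings.

pipeline:
  - name: CompactLLMCommandGenerator
    llm:
      model: gpt-4o-2024-11-20
      provider: openai

policies:
  - name: FlowPolicy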
To fine-tune your model effectively, it’s crucial to ensure that your system is comprehensively covered by E2E tests. These tests provide the data needed for fine-tuning. If your E2E tests do not sufficiently cover the assistant's functionality, the fine-tuned model may not perform well due to a lack of relevant training examples.
To address this, you can use an E2E test diagnostic tool, which is available as part of Rasa’s CLI. This tool helps you evaluate whether your E2E tests adequately cover the system's capabilities. It also identifies areas where existing tests may need to be updated or where new tests should be created before proceeding with fine-tuning.
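As an illustration, an E2E test for the car-booking scenario used throughout this page could look like the following sketch. The test name and file path mirror the annotated example shown later; the individual steps are assumed for illustration.

# e2e_tests/user_books_car_and_hotel.yml (illustrative)
test_cases:
  - test_case: car booking combined with hotel
    steps:
      - user: I'd like to book a car
      - bot: in which city?
      - user: to Basel
      - bot: When would you like to pick up the car?
      - user: from may 14th to the 17th
      - utter: utter_ask_car_rental_selection
      - user: I'll take the luxury one! looks nice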
Assessing test coverage for fine-tuning
When reviewing the results of the coverage report, focus on the following key areas to ensure your data is suitable for fine-tuning:
- Representation of All Commands: Ensure that all commands your assistant might generate are represented in your tests. If certain commands are not covered, the model may struggle to generate them correctly, having never "seen" these scenarios during training. This can be evaluated by inspecting the command coverage histograms.
- Demonstration of Desired Skills: Ensure that the skills you want your bot to demonstrate are well represented in the tests. This ensures the model learns from a variety of examples and scenarios, increasing its robustness and reliability. This can be evaluated by inspecting the flow coverage report.
By carefully analyzing and expanding your test coverage, you can better prepare your model for fine-tuning, resulting in improved performance and a more reliable assistant.
Command Annotation
If an E2E test fails on your assistant, it is ignored by the command annotation module and, subsequently, by all other steps of the recipe. Hence, please ensure that the assistant successfully passes the input E2E tests. We also recommend using the E2E coverage analysis tool to understand how well the passing tests cover the flows of your assistant.
As the first step of the recipe, the command annotator module runs the E2E tests through the CALM assistant and extracts the commands predicted by the CompactLLMCommandGenerator at every user step. The module runs as part of the rasa llm finetune prepare-data CLI command, and each E2E test is augmented with the commands the LLM should predict at every user step. The output of this step converts each E2E test into a conversation that looks like this:
conversations:
  - original_test_case: ../e2e_tests/user_books_car_and_hotel.yml::car booking combined with hotel
    steps:
      - user: I'd like to book a car
        llm_commands:
          - start flow search_rental_car
      - bot: in which city?
      - user: to Basel
        llm_commands:
          - set slot trip_destination Basel
      - bot: When would you like to pick up the car?
      - user: from may 14th to the 17th
        llm_commands:
          - set slot car_rental_start_date "may 14th"
          - set slot car_rental_end_date "may 17th"
      - utter: utter_ask_car_rental_selection
      - user: I'll take the luxury one! looks nice
        llm_commands:
          - set slot car_rental_selection "Avis - Luxury"
Only user steps that the CompactLLMCommandGenerator processes are annotated with commands and end up in the final training dataset for fine-tuning. For example, if you bypass the CompactLLMCommandGenerator by using buttons that issue set slot commands, that user step will not be annotated.
Synthetic data generation
After the user steps of each conversation are annotated with commands, the synthetic data generation module creates a number of rephrasings for each annotated user step and validates whether each rephrased user step produces the same set of commands as the original user step in the corresponding conversation. Only rephrased user steps that pass this validation are added to the fine-tuning dataset.
Note: User utterances that come from buttons, e.g. when the user clicked a button instead of typing a response, are skipped by the synthetic data generator and not rephrased.
The conversation with its passing and failed rephrased user steps looks like this, assuming we produced 3 rephrasings per user step:
conversations:
  - original_test_case: ../e2e_tests/user_books_car_and_hotel.yml::car booking combined with hotel
    steps:
      - user: I'd like to book a car
        llm_commands:
          - start flow search_rental_car
        passing_rephrasings:
          - I need to reserve a car.
          - Could I arrange for a car rental?
          - I'm interested in hiring a car.
      - bot: in which city?
      - user: to Basel
        llm_commands:
          - set slot trip_destination Basel
        passing_rephrasings:
          - The destination is Basel.
          - I'd like to go to Basel.
        failed_rephrasings:
          - I'll need the car in Basel.
      - bot: When would you like to pick up the car?
      - user: from may 14th to the 17th
        llm_commands:
          - set slot car_rental_start_date "may 14th"
          - set slot car_rental_end_date "may 17th"
        passing_rephrasings:
          - I need the car from May 14th to May 17th.
          - The rental period will be May 14th to 17th.
          - I'll require the vehicle from the 14th to the 17th of May.
      - utter: utter_ask_car_rental_selection
      - user: I'll take the luxury one! looks nice
        llm_commands:
          - set slot car_rental_selection "Avis - Luxury"
        passing_rephrasings:
          - I'd like to go with the luxury option; it looks appealing.
          - I'll choose the luxury model; it seems nice.
          - I'm opting for the luxury car; it looks great.
Rephraser LLM
By default, the Rephraser LLM uses gpt-4o-mini to create 10 paraphrases of a user step. It uses the following prompt to create the rephrasings:
Objective:
Create multiple rephrasings of user messages tailored to the "{{ test_case_name }}" conversation scenario.
===
Conversation overview:
{{ transcript or "Not provided." }}
===
Task:
Produce {{ number_of_rephrasings }} rephrasings for each user message that are diverse yet contextually appropriate.
Preserve the intent and content, but vary the structure, formality, and detail.
Only rephrase messages prefixed with "{{ user_prefix }}:".
Guidelines:
- Use a variety of expressions from brief and casual to elaborate and formal.
- Vary sentence structures, vocabularies, and expressions creatively.
- Keep the core message intact with concise and simple modifications.
Format:
- Each original user message should be prefixed with "USER: ".
- Enumerate the rephrasing.
- Separate each user message set with a line break.
===
Example output for 3 rephrasings of 2 user messages:
"""
USER: Show invoices
1. I want to see my bills.
2. I mean bills
3. Yes, I want to see the invoices.
USER: I'd like to book a car
1. I need to reserve a car.
2. Could I arrange for a car rental?
3. I'm interested in hiring a car.
"""
===
Expected output:
{{ number_of_rephrasings }} rephrasings for the following {{ number_of_user_messages }} user messages in the expected
format:
{% for message in user_messages -%}
- {{ message }}
{% endfor %}
If you want to modify the prompt or use a different LLM for the Rephraser LLM, you can specify a custom config via the argument --rephrase-config <path-to-config-file> on the CLI command rasa llm finetune prepare-data.
The default config looks like this:
prompt_template: default_rephrase_prompt_template.jinja2
llm:
  model: gpt-4o-mini
  provider: openai
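For example, a custom config that points to your own prompt template and a different model could look like the following sketch; the file path and model shown here are illustrative.

prompt_template: ./prompts/custom_rephrase_prompt.jinja2
llm:
  model: gpt-4o-2024-11-20
  provider: openai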
You can specify the number of rephrasings per user step by adding the flag --num-rephrases <number> to the CLI command rasa llm finetune prepare-data. If you set num-rephrases to 0, the synthetic data generator is skipped.
As the synthetic data generator adds linguistic diversity to the dataset, it is recommended to use at least a couple of rephrasings. Our internal experiments showed that adding rephrasings to the dataset increases the performance of the fine-tuned model.
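Putting the options together, a typical invocation of the data preparation step might look like the sketch below. Depending on your setup, you may also need to point the command at your E2E tests and choose an output location; consult the CLI help for the full list of arguments. The file rephrase_config.yml refers to a custom config like the one described above.

rasa llm finetune prepare-data \
  --num-rephrases 10 \
  --rephrase-config rephrase_config.yml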