# Fine Tuning an LLM
This page explains why you might want to fine-tune a smaller LLM (e.g., Llama-3.1 8B) for command generation in CALM, outlines Rasa’s fine-tuning recipe at a high level, and shows you how to get started.
## Why Would You Want to Fine-Tune an LLM?
When starting out with CALM, you might rely on a powerful off-the-shelf LLM (e.g., GPT-4) via OpenAI or Azure OpenAI. That’s often the quickest path to a functional assistant. However, as your use cases multiply and your traffic grows, you might hit constraints like:
- Response Latency: Third-party LLM services can become slow under heavy load, increasing response times and frustrating users.
- Rate Limits & Availability: Relying on external providers can mean hitting usage caps or facing downtime.
- High Inference Costs: Large LLMs are expensive at scale.
Fine-tuning a smaller LLM locally or on a private cloud helps mitigate these issues. By tailoring a model specifically for command generation in your assistant, you can:
- Boost Performance and Reliability: A well-fine-tuned smaller model can respond faster, and self-hosting removes dependence on a third party's uptime and rate limits.
- Cut Costs: Inference on a smaller, domain-tailored LLM can be significantly cheaper.
- Retain Control Over Your Data: Run the model on your own infrastructure to avoid potential data-sharing or vendor lock-in concerns.
## What Is the Fine-Tuning Recipe?
The fine-tuning recipe in Rasa Pro semi-automates the process of gathering training data for your domain and training a smaller LLM to produce the commands (e.g., `StartFlow`, `SetSlot`) that drive your CALM assistant.
Below is a high-level overview of how to put the fine-tuning recipe into practice. If you want more technical details on each step, head over to the Fine Tuning Recipe reference for step-by-step instructions.
- Prepare Your Assistant & Tests
  - Make sure your CALM assistant uses a strong LLM command generator (e.g., `CompactLLMCommandGenerator` with GPT-4) and that you have comprehensive E2E tests covering your most important conversation flows.
  - Check test coverage and verify that your E2E tests pass. Failing tests are skipped by the fine-tuning recipe.
- Annotate Commands
  - Run `rasa llm finetune prepare-data` on your E2E tests.
  - This process executes the tests against your assistant, captures the commands the LLM produces at each user step, and appends them to your test files.
- Generate Synthetic Data
  - Using the same CLI command (with `--num-rephrases` set to a desired number), the recipe will:
    - Rephrase user utterances.
    - Check whether each rephrasing produces the same commands.
    - Keep only valid rephrasings.
  - You now have an expanded set of user utterances, ensuring greater linguistic variety in training.
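The rephrase-and-filter idea behind the synthetic data step can be sketched in a few lines of Python. This is an illustrative toy, not Rasa Pro's internals: `predict_commands` stands in for the assistant's LLM command generator, and the canned utterances and command strings are made up for the example.

```python
def predict_commands(utterance: str) -> list[str]:
    """Stand-in for the assistant's LLM command generator (hypothetical)."""
    canned = {
        "I want to send money": ["StartFlow(transfer_money)"],
        "I'd like to transfer some funds": ["StartFlow(transfer_money)"],
        "What's the weather like?": ["StartFlow(weather)"],
    }
    return canned.get(utterance, ["CannotHandle()"])


def filter_rephrasings(original: str, rephrasings: list[str]) -> list[str]:
    """Keep only rephrasings that yield the same commands as the original."""
    expected = predict_commands(original)
    return [r for r in rephrasings if predict_commands(r) == expected]


kept = filter_rephrasings(
    "I want to send money",
    ["I'd like to transfer some funds", "What's the weather like?"],
)
print(kept)  # → ["I'd like to transfer some funds"]
```

A rephrasing that drifts to a different intent (here, the weather question) produces different commands and is discarded, which is what keeps the synthetic data label-consistent.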
- Build the Fine-Tuning Dataset
  - The recipe collates your original user utterances and valid rephrasings into new “synthetic” conversations.
  - Each user step becomes a data point containing the prompt (conversation history + user utterance) and completion (the commands).
  - By default, the recipe splits this dataset into training (80%) and validation (20%).
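A minimal sketch of what such a dataset build might look like, assuming a simple prompt/completion record shape (the field names and prompt layout here follow the description above, not a confirmed Rasa Pro schema):

```python
import json
import random


def make_data_point(history: list[str], user_utterance: str, commands: list[str]) -> dict:
    """One data point per user step: prompt = history + utterance, completion = commands."""
    prompt = "\n".join(history + [f"USER: {user_utterance}"])
    return {"prompt": prompt, "completion": "\n".join(commands)}


def train_val_split(points: list[dict], train_frac: float = 0.8, seed: int = 42):
    """Shuffle deterministically, then split into train/validation."""
    rng = random.Random(seed)
    shuffled = points[:]
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * train_frac)
    return shuffled[:cut], shuffled[cut:]


points = [
    make_data_point(["BOT: Hi, how can I help?"], f"utterance {i}", ["SetSlot(amount, 50)"])
    for i in range(10)
]
train, val = train_val_split(points)
print(len(train), len(val))  # 8 2

# Each split is then written out as JSONL, one record per line:
with open("train.jsonl", "w") as f:
    for p in train:
        f.write(json.dumps(p) + "\n")
```

The JSONL format (one JSON object per line) is convenient here because fine-tuning frameworks can stream it record by record.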
- Fine-Tune Your Model
  - Take the resulting dataset (usually stored in `.jsonl` format) and fine-tune a smaller LLM, such as Llama-3.1 8B.
  - Rasa provides an example notebook demonstrating how to:
    - Load the model in 8-bit precision.
    - Apply parameter-efficient fine-tuning (LoRA).
    - Train for command generation.
    - Save and deploy your model.
  - After training, you can configure your CALM assistant to point to your fine-tuned model instead of GPT-4.
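To see why LoRA keeps this step affordable: it freezes the base weights and trains only small low-rank adapter factors attached to selected weight matrices. A back-of-the-envelope estimate, using Llama-3.1 8B's published hidden size (4096) and layer count (32), and simplifying by treating each adapted projection as a square 4096×4096 matrix (in reality the value projections are smaller under grouped-query attention):

```python
def lora_extra_params(d_model: int, r: int, n_matrices: int) -> int:
    """Each adapted d×d weight matrix gains two low-rank factors:
    A (d×r) and B (r×d), i.e. 2*d*r trainable parameters per matrix."""
    return 2 * d_model * r * n_matrices


# Adapting q_proj and v_proj in all 32 layers at rank r=16:
print(lora_extra_params(4096, 16, 32 * 2))  # 8388608, i.e. ~8.4M parameters
```

Roughly 8 million trainable parameters versus 8 billion frozen ones, which is why a single GPU with the base model loaded in 8-bit precision is typically enough. The rank 16 and the choice of adapted projections are illustrative, not Rasa's recommended settings.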
- Test & Deploy
  - Test your fine-tuned LLM by running the same E2E tests again.
  - If the results meet your quality and performance criteria, you can deploy the new model to your production environment.
Check out the Fine Tuning Recipe reference for all available parameters and how to customize each fine-tuning stage. Head over to Deploying a Fine-Tuned Model for more information on hosting fine-tuned LLMs.
Fine-tuning your own LLM can be an effective way to deliver faster, cheaper, and more controllable conversational experiences with CALM. By following the recipe and leveraging your existing E2E tests, you can quickly build a strong, custom-tailored command-generation model for your assistant.