Fine-tuning and self-hosting an LLM for command generation
The fine-tuning recipe allows you to fine-tune a base LLM for the task of command generation. The fine-tuned model can then be used as part of an LLM-based command generator within CALM. This can help mitigate latency and reliability issues and gives you more control over the LLM.
The current page is a tutorial covering all the steps required to fine-tune a base LLM and self-deploy it for the task of command generation. A more conceptual overview of the recipe is available on this page.
A prerequisite for using the fine-tuning recipe is that you have already built an assistant and written some E2E tests. To fine-tune a base model and use it within CALM, perform the following steps:
- Step 1: Ensure comprehensive coverage of your system for fine-tuning
- Step 2: Prepare the fine-tuning dataset
- Step 3: Fine-tune a base model
- Step 4: Evaluate your fine-tuned model
- Step 5: Create a production-ready fine-tuned model
- Step 6: Deploy the fine-tuned model to production
- Step 7: Connect the assistant to the fine-tuned model in production
Step 1: Ensure comprehensive coverage of your system for fine-tuning
First, enable the feature by setting the RASA_PRO_BETA_FINE_TUNING_RECIPE feature flag to True.
As mentioned above, the fine-tuning recipe needs sample conversations to train a model on, and these sample conversations are written in the E2E test format. To assess the breadth of your existing E2E tests, run E2E testing with the coverage flag on.
This run records which flow steps and commands were successfully covered. It then generates:
- A report on the coverage of flow steps by the E2E tests.
- A histogram of commands covered.
- Separate files for all passed and failed tests.
important
Inspect the flow coverage report to ensure that the E2E tests reach at least 90% total coverage across the flows and that no single flow has a coverage below 80%. Also inspect the command coverage histogram to ensure every command is well represented. Both inspections are important for generating high-quality data for fine-tuning an LLM.
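The two coverage thresholds above can be checked with a few lines of Python. This is only a sketch: the flow names and step counts are made up, and it assumes that total coverage is the number of covered steps divided by the number of all steps, which may differ from how the coverage report computes its total.

```python
from typing import Dict, List, Tuple


def check_flow_coverage(
    flow_steps: Dict[str, Tuple[int, int]],
    min_per_flow: float = 0.80,
) -> Tuple[float, List[str]]:
    """Return (total coverage, flows below the per-flow threshold).

    flow_steps maps a flow name to (covered_steps, total_steps).
    """
    covered = sum(c for c, _ in flow_steps.values())
    total = sum(t for _, t in flow_steps.values())
    total_coverage = covered / total if total else 0.0
    weak_flows = [
        name for name, (c, t) in flow_steps.items() if t and c / t < min_per_flow
    ]
    return total_coverage, weak_flows


# Hypothetical numbers; in practice, copy them from the flow coverage report.
example = {
    "transfer_money": (10, 10),
    "check_balance": (4, 5),
    "add_contact": (3, 5),
}
total_coverage, weak_flows = check_flow_coverage(example)
print(f"total coverage: {total_coverage:.0%} (target: at least 90%)")
print("flows below 80%:", weak_flows)
```

Any flow listed in the second line of output is a candidate for additional E2E tests before moving on to Step 2.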
Step 2: Prepare the fine-tuning dataset
warning
This step needs to be executed using a Rasa model trained with the SingleStepLLMCommandGenerator and a strong LLM like gpt-4-0613 configured for the command generator.
info
This step takes some time to execute due to the number of LLM calls that are made.
To create a training and validation dataset from the passing E2E tests for fine-tuning, execute the following command:
By default, the passing E2E tests from Step 1 are written to the e2e_coverage_results/passed.yml file. You can pass this file as the E2E tests for this step.
The command will use the latest trained Rasa model in the models folder to prepare the data for fine-tuning. If you want to use a different model, you can set the flag --model <path-to-model>.
The data preparation script performs a number of tasks:
- command annotation
- synthetic data generation
- training and validation dataset generation
For details on how these tasks work, please check out the conceptual overview of the fine-tuning recipe.
By default, the output of the data preparation command is written to the folder output. You can modify the output path by setting the flag --out <folder-name>.
The output folder has the following structure:
The most important files are located under output/4_train_test_split. The remaining files are only there for debugging purposes. If you are interested in what this data looks like, please check the conceptual overview of the fine-tuning recipe.
The training and validation datasets needed for Step 3 are located under output/4_train_test_split/ft_splits/train.jsonl and output/4_train_test_split/ft_splits/val.jsonl.
To evaluate your fine-tuned model (see Step 4) you will need the E2E tests located under output/4_train_test_split/e2e_tests.
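Before moving on to Step 3, it can be worth taking a quick look at the generated JSONL files. The sketch below counts the records in a JSON Lines file and lists the field names they use. It runs on a throwaway demo file whose "prompt" and "completion" fields are an assumption made for illustration; point it at your own train.jsonl and val.jsonl to see what they actually contain.

```python
import json
import tempfile
from collections import Counter
from pathlib import Path


def summarize_jsonl(path: Path) -> dict:
    """Count the records in a JSON Lines file and how often each field appears."""
    n_records = 0
    keys = Counter()
    with path.open(encoding="utf-8") as f:
        for line in f:
            if not line.strip():
                continue  # tolerate blank lines
            record = json.loads(line)
            n_records += 1
            keys.update(record.keys())
    return {"records": n_records, "keys": dict(keys)}


# Demo on a temporary file; the field names here are illustrative only.
with tempfile.TemporaryDirectory() as tmp:
    demo = Path(tmp) / "train.jsonl"
    demo.write_text(
        '{"prompt": "p1", "completion": "c1"}\n'
        '{"prompt": "p2", "completion": "c2"}\n',
        encoding="utf-8",
    )
    summary = summarize_jsonl(demo)

print(summary)
```

For your own data, you would call summarize_jsonl(Path("output/4_train_test_split/ft_splits/train.jsonl")) and do the same for val.jsonl.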
Step 3: Fine-tune a base model
Environment Configuration
Recommended Hardware:
- NVIDIA A100 GPU (40GB VRAM)
- 12 core CPU with 64GB RAM
- 256GB disk
Recommended Software:
- Python 3.10
- CUDA Toolkit 12.1
- PyTorch 2.2
warning
Although these instructions will work with a relatively underpowered GPU, such as the NVIDIA T4, the actual fine-tuning and inference will be very slow.
We highly recommend using an NVIDIA A100 or a similar GPU.
Once you have provisioned the above hardware and installed the relevant software, deploy this Python notebook and run the cells to go through the rest of the model fine-tuning steps. The first few cells of the notebook also help in setting up the environment.
Below, we highlight a few important steps with recommendations for different setups you can try for them.
info
If you run into any issues with the code provided in the Python notebook for fine-tuning the model, please open an issue in its source repository on GitHub.
Base Model
The base model for fine-tuning is downloaded from the Hugging Face model hub. The Python notebook installs the packages needed to do that.
Next, ensure that you add HUGGINGFACE_TOKEN and BASE_MODEL variables with your own values in this cell.
We recommend starting with the Llama 3.1 8B Instruct model as the base model to fine-tune. If you want to try a different model, we recommend one that has been instruction-tuned, so that it has already been trained to faithfully follow instructions from the prompt. A good alternative model is the CodeLlama 13b Instruct model.
Training and Validation Datasets
Ensure that the training and validation datasets, namely train.jsonl and val.jsonl, created in Step 2 are available on the disk where you are running fine-tuning. The code provided should be able to load the datasets and format them automatically.
As the files use the TRL instruction format, the TRL trainer used later will be able to automatically parse the datasets and generate the prompts from a template configured in the tokenizer.
Prompt templates vary between models and TRL will infer the correct template from your base model. If this is not available for your base model or if you wish to change it, you can set your own template string.
You can also define your own prompt formatting function in order to have full control of how the prompts are constructed.
Training Hyper-parameters
The provided code for this part uses the SFTConfig and SFTTrainer from the TRL library. The default hyper-parameter values worked best in our in-house experiments; however, there are a few parameters you can tune:
- If you get an out-of-memory (OOM) error when running fine-tuning, you can reduce per_device_train_batch_size in order to reduce the memory footprint. However, if your GPU has sufficient memory, you can try increasing it in order to reduce the total number of training steps.
- Consider tuning max_steps, as you may not need to perform all epochs in order to achieve optimal model accuracy. Conversely, you may see better model accuracy by increasing num_train_epochs.
- If the model training is taking too long, you can increase eval_steps in order to reduce how often validation is performed.
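To see how the batch size and the number of epochs interact, the arithmetic can be sketched as below. This is an approximation rather than the trainer's exact bookkeeping: it assumes the last partial batch of each epoch is kept, and gradient_accumulation_steps and n_devices are additional knobs not discussed above, included only so the formula is complete.

```python
import math


def approx_training_steps(
    n_examples: int,
    per_device_train_batch_size: int,
    num_train_epochs: int,
    gradient_accumulation_steps: int = 1,  # assumption: not covered in the text above
    n_devices: int = 1,  # assumption: single-GPU setup
) -> int:
    """Approximate the number of optimizer steps for a full training run."""
    effective_batch = (
        per_device_train_batch_size * gradient_accumulation_steps * n_devices
    )
    steps_per_epoch = math.ceil(n_examples / effective_batch)
    return steps_per_epoch * num_train_epochs


# Hypothetical dataset of 1,000 training examples, trained for 3 epochs.
for batch_size in (4, 8):
    steps = approx_training_steps(1000, batch_size, num_train_epochs=3)
    print(f"batch size {batch_size}: about {steps} steps")
```

Doubling the batch size roughly halves the number of steps, which is why a larger batch helps when memory allows. A max_steps value below the computed total stops training before all epochs are completed.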
Saving the trained model
The provided code persists the trained model using 16-bit precision. It also merges the LoRA adapters with the weights of the base model and persists all the parameters as part of the fine-tuned model. If you are using a relatively small GPU, such as the NVIDIA T4, you may have to save the model in 4-bit instead (e.g. save_method = "merged_4bit_forced").
Visualizing training metrics
Once model training has finished, the training and validation losses are plotted. Check the plot for the following:
- Ideally, as the fine-tuning steps increase, the training and validation losses should decrease and converge.
- If both loss curves do not converge, it may be worth performing more fine-tuning steps or epochs. This condition is known as underfitting.
- If the validation loss suddenly starts to increase while the training loss continues to decrease or converge, you should decrease your total number of steps or epochs. This is known as overfitting.
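The three cases above can be turned into a rough heuristic for a first read of the loss curves. The sketch below is not part of the notebook: the function name, the patience window, and the min_delta threshold are all illustrative assumptions, and a visual check of the plot remains the more reliable guide.

```python
from typing import Sequence


def diagnose_losses(
    train_loss: Sequence[float],
    val_loss: Sequence[float],
    patience: int = 2,  # assumption: how many consecutive rises count as a trend
    min_delta: float = 0.01,  # assumption: smallest drop that counts as improvement
) -> str:
    """Classify loss curves recorded at the same evaluation points."""
    if len(train_loss) != len(val_loss) or len(val_loss) <= patience:
        raise ValueError("need equally long curves with more than `patience` points")
    recent_val = val_loss[-(patience + 1):]
    val_rising = all(later > earlier for earlier, later in zip(recent_val, recent_val[1:]))
    train_not_rising = train_loss[-1] <= train_loss[-(patience + 1)]
    if val_rising and train_not_rising:
        return "overfitting"  # consider fewer steps or epochs
    if val_loss[-2] - val_loss[-1] > min_delta:
        return "underfitting"  # still improving; consider more steps or epochs
    return "converged"


# Hypothetical curves, one value per evaluation.
print(diagnose_losses([2.0, 1.5, 1.2, 1.0, 0.9], [2.1, 1.6, 1.3, 1.4, 1.5]))
print(diagnose_losses([2.0, 1.7, 1.4, 1.1], [2.1, 1.8, 1.5, 1.2]))
print(diagnose_losses([2.0, 1.0, 0.62, 0.60, 0.60], [2.1, 1.1, 0.70, 0.70, 0.70]))
```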
Exporting fine-tuned model
Export your fine-tuned model directory to an appropriate storage location that can be easily accessed later for deployment.
We recommend using a cloud object store, such as Amazon S3 or Google Cloud Storage.
Step 4: Evaluate your fine-tuned model
Once you have a fine-tuned model, it is essential to evaluate it and cross-check its performance. Follow the steps below to do that:
- Ensure you have downloaded the fine-tuned model on a cloud instance which has access to at least an A100 GPU.
- Install vllm using pip install vllm==0.6.0. More information on that here.
- Run the model server with vllm serve finetuned_model, with finetuned_model being the name of the directory containing the model artifacts. This should run a model server on the localhost network interface. If the assistant is on a different instance or machine, you should expose the model server on a public IP so that it can be reached from an external machine.
- Configure the assistant to use the model server:
- Rasa Pro <=3.10.x
- Rasa Pro >=3.11.x
config.yml
    - name: SingleStepLLMCommandGenerator
      llm:
        model_group: finetuned_model_group
endpoints.yml
    model_groups:
      - id: finetuned_model_group
        models:
          - model: finetuned_model
            provider: self-hosted
            api_base: <URL of the model server> # `localhost` if assistant and model server on same machine
And add the following to your endpoints.yml:
endpoints.yml
    vector_store:
      type: milvus
      host: localhost
      port: 19530
      collection: rasa
- Train the assistant with rasa train.
- Run rasa test e2e <path to validation E2E tests>, where <path to validation E2E tests> is the path to 4_train_test_split/e2e_tests/validation.yaml inside the output directory generated in Step 2. The output of E2E testing should show the number of E2E test conversations that pass versus fail. These E2E test conversations were specifically not used during the fine-tuning of the base model, so they give a good idea of how well the fine-tuned model generalizes across different natural language variations of user messages and domains of flows not present in the training data.
- As a sanity check, it is also good to check the performance of the fine-tuned model on the E2E tests that were used to generate the fine-tuning data. To do that, run rasa test e2e <path to training E2E tests>, where <path to training E2E tests> is the path to 4_train_test_split/e2e_tests/train.yaml inside the output directory generated in Step 2. Doing this ensures that the fine-tuning hasn't completely gone off the rails.
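If the E2E tests fail with connection errors, it can help to first confirm that the model server is reachable from the assistant's machine. The sketch below assumes the vLLM server exposes its OpenAI-compatible API, that the api_base ends in /v1, and that the server listens on port 8000; adjust the address to whatever you configured. Only the URL construction runs here, since the network call needs a live server.

```python
import json
import urllib.request
from typing import List


def models_url(api_base: str) -> str:
    """Build the model-listing URL from an OpenAI-style API base (assumed to end in /v1)."""
    return api_base.rstrip("/") + "/models"


def list_served_models(api_base: str, timeout: float = 5.0) -> List[str]:
    """Return the model names the server reports; raises on connection errors."""
    with urllib.request.urlopen(models_url(api_base), timeout=timeout) as response:
        payload = json.load(response)
    return [entry["id"] for entry in payload.get("data", [])]


# Hypothetical address; replace it with the api_base from your endpoints.yml.
print(models_url("http://localhost:8000/v1"))
# On a machine that can reach the server, you could then call:
# list_served_models("http://localhost:8000/v1")
```

The name returned by the server should match the model value configured for the command generator.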
To assess whether the fine-tuned model is performing well, check the following:
- The assistant passes at least 80% of the E2E test conversations from validation.yaml and 95% of the E2E test conversations from train.yaml when using the fine-tuned model as the command generator.
- The performance of the assistant with the fine-tuned LLM as the command generator is comparable to its performance with a strong LLM like gpt-4-0613 as the command generator when evaluated on the above train.yaml and validation.yaml.
If neither condition holds, you should invest effort in creating more sample conversations, adding them to the existing ones, regenerating the fine-tuning data, and fine-tuning a new model for evaluation.
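The pass-rate targets can be checked with a short helper once you have the pass and fail counts from both E2E runs. The counts below are hypothetical, and the helper only covers the first criterion; judging whether the fine-tuned model is comparable to a strong LLM remains a manual comparison of the two sets of results.

```python
def meets_pass_rate_targets(
    val_passed: int,
    val_total: int,
    train_passed: int,
    train_total: int,
    min_val_rate: float = 0.80,
    min_train_rate: float = 0.95,
) -> dict:
    """Compare E2E pass rates against the 80% (validation) and 95% (train) targets."""
    val_rate = val_passed / val_total
    train_rate = train_passed / train_total
    return {
        "validation_pass_rate": val_rate,
        "train_pass_rate": train_rate,
        "meets_targets": val_rate >= min_val_rate and train_rate >= min_train_rate,
    }


# Hypothetical counts taken from two `rasa test e2e` runs.
result = meets_pass_rate_targets(
    val_passed=17, val_total=20, train_passed=97, train_total=100
)
print(result)
```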
Step 5: Create a production-ready fine-tuned model
This step involves retraining the model, but this time combining the train set (train.jsonl) and the validation set (val.jsonl) into a single training dataset.
This step is optional and should be used in situations where training data is limited.
The effectiveness of training and the choice of hyper-parameters should be established during Step 4, while a validation set is still available.
Once this is done, a small improvement in performance can be gained by retraining the base model with the same hyper-parameters and a slightly larger training dataset that includes the validation set, as follows:
- Concatenate the training and validation sets, e.g. cat output/4_train_test_split/ft_splits/*.jsonl > output/4_train_test_split/ft_splits/combined_train_val.jsonl
- Retrain the model as per Step 3: Fine-tune a base model, using combined_train_val.jsonl
- Evaluate the performance of this fine-tuned model as per Step 4, after which some manual testing may also be appropriate to ensure that training was successful.
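If you prefer to do the concatenation in Python, a minimal sketch could look like the one below. It names the two input files explicitly, because a wildcard would also match the combined file on a second run, and it makes sure every record ends with a newline so that two records are never fused. The demo runs on temporary files with made-up contents.

```python
import tempfile
from pathlib import Path
from typing import Iterable


def combine_jsonl(sources: Iterable[Path], destination: Path) -> int:
    """Concatenate JSON Lines files, skipping blank lines; return the record count."""
    count = 0
    with destination.open("w", encoding="utf-8") as out:
        for source in sources:
            with source.open(encoding="utf-8") as f:
                for line in f:
                    if line.strip():
                        out.write(line.rstrip("\n") + "\n")
                        count += 1
    return count


# Demo on temporary files; in practice, use the files under
# output/4_train_test_split/ft_splits/.
with tempfile.TemporaryDirectory() as tmp:
    train = Path(tmp) / "train.jsonl"
    val = Path(tmp) / "val.jsonl"
    combined = Path(tmp) / "combined_train_val.jsonl"
    train.write_text('{"id": 1}\n{"id": 2}', encoding="utf-8")  # no trailing newline
    val.write_text('{"id": 3}\n', encoding="utf-8")
    n_records = combine_jsonl([train, val], combined)
    combined_lines = combined.read_text(encoding="utf-8").splitlines()

print(n_records, combined_lines)
```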
Step 6: Deploy the fine-tuned model to production
This step assumes that you have:
- already installed Docker and the NVIDIA Container Toolkit in a VM instance with an NVIDIA GPU
- the fine-tuned model files available on the instance in a directory called finetuned_model
You can then start the vLLM server on that instance using the docker command below. Make sure to first update the value of the --served-model-name flag, which overrides the name of the model used in API calls.
You will then need to expose the vLLM server port to the outside world so your assistant can access it via a public IP address.
If you wish to deploy using Kubernetes, refer to the details on this page.
Step 7: Connect the assistant to the fine-tuned model in production
Change the assistant's config to connect to the fine-tuned model that was deployed to production in Step 6:
- Rasa Pro <=3.10.x
- Rasa Pro >=3.11.x