
Fine-tuning and self-hosting an LLM for command generation

The fine-tuning recipe allows you to fine-tune a base LLM for the task of command generation. The fine-tuned model can then be used as part of an LLM-based command generator within CALM. This helps mitigate latency and reliability issues and gives you control over the LLM.

The current page is a tutorial covering all the steps required to fine-tune a base LLM and self-deploy it for the task of command generation. A more conceptual overview of the recipe is available on this page.

A pre-requisite of using the fine-tuning recipe is that you have already built an assistant and have written some E2E tests. In order to fine-tune a base model and use it within CALM, you need to perform the following steps:

Step 1: Ensure comprehensive coverage of your system for fine-tuning

First, enable the feature by setting the RASA_PRO_BETA_FINE_TUNING_RECIPE feature flag to true.

export RASA_PRO_BETA_FINE_TUNING_RECIPE=true

As mentioned above, the fine-tuning recipe needs sample conversations to train a model on, and these sample conversations are written in the E2E test format. To assess the breadth of your existing E2E tests, run E2E testing with the coverage flag enabled.

rasa test e2e <path-to-test-cases> --coverage-report

The above command records which flow steps and commands were successfully covered. It then generates:

  • A report on the coverage of flow steps by the E2E tests.
  • A histogram of commands covered.
  • Separate files for all passed and failed tests.
important

Inspect the flow coverage report to ensure that E2E tests have at least a 90% total coverage across the flows and no single flow has a coverage below 80%. Also inspect the command coverage histogram to ensure every command is represented well. Both the above inspections are important to generate high quality data for fine-tuning an LLM.
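The two coverage thresholds above can be checked mechanically once you have the per-flow numbers. The following is a minimal sketch, not part of the Rasa tooling: the input shape (a mapping from flow name to covered and total step counts) and the function names are hypothetical, and it assumes total coverage is computed over all flow steps. Adapt it to the values in your coverage report.

```python
from typing import Dict, List, Tuple

# Thresholds taken from the guidance above.
MIN_TOTAL_COVERAGE = 0.90
MIN_FLOW_COVERAGE = 0.80


def check_flow_coverage(
    steps_per_flow: Dict[str, Tuple[int, int]],
) -> Tuple[float, List[str]]:
    """Return total coverage and the flows below the per-flow minimum.

    `steps_per_flow` maps a flow name to (covered_steps, total_steps).
    This input shape is hypothetical; fill it from your coverage report.
    """
    covered = sum(c for c, _ in steps_per_flow.values())
    total = sum(t for _, t in steps_per_flow.values())
    total_coverage = covered / total if total else 0.0
    weak_flows = [
        name
        for name, (c, t) in steps_per_flow.items()
        if t and c / t < MIN_FLOW_COVERAGE
    ]
    return total_coverage, weak_flows


def coverage_is_sufficient(steps_per_flow: Dict[str, Tuple[int, int]]) -> bool:
    """True when both the total and the per-flow thresholds are met."""
    total_coverage, weak_flows = check_flow_coverage(steps_per_flow)
    return total_coverage >= MIN_TOTAL_COVERAGE and not weak_flows
```

Any flow returned in the second element of the tuple is a candidate for additional E2E tests before moving on to Step 2.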

Step 2: Prepare the fine-tuning dataset

warning

This step needs to be executed using a Rasa model trained with the SingleStepLLMCommandGenerator and a strong LLM like gpt-4-0613 configured for the command generator.

info

This step takes some time to execute due to the number of LLM calls that are made.

To create a training and validation dataset from the passing E2E tests for fine-tuning, execute the following command:

rasa llm finetune prepare-data <path-to-e2e-test-cases>

By default, the passing E2E tests from Step 1 are written to the e2e_coverage_results/passed.yml file. You can use them as E2E tests for this step, i.e.

rasa llm finetune prepare-data e2e_coverage_results/passed.yml

The command will use the latest trained Rasa model in the models folder to prepare the data for fine-tuning. If you want to use a different model, you can set the flag --model <path-to-model>.

The data preparation script performs a number of tasks:

  • command annotation
  • synthetic data generation
  • training and validation dataset generation

For details on how these tasks work, please check out the conceptual overview of the fine-tuning recipe.

By default, the output of the data preparation command is written to the folder output. You can modify the output path by setting the flag --out <folder-name>. The output folder has the following structure:

.
├── output
│   ├── 1_command_annotations
│   ├── 2_rephrasings
│   ├── 3_llm_finetune_data
│   ├── 4_train_test_split
│   │   ├── e2e_tests
│   │   │   ├── train.yml
│   │   │   └── validation.yaml
│   │   └── ft_splits
│   │       ├── train.jsonl
│   │       └── val.jsonl
│   ├── result_summary.yaml
│   └── params.yaml

The most important files are located under output/4_train_test_split. The remaining files are only for debugging purposes. If you are interested in what this data looks like, please check the conceptual overview of the fine-tuning recipe.

The training and validation datasets needed for Step 3 are located under output/4_train_test_split/ft_splits/train.jsonl and output/4_train_test_split/ft_splits/val.jsonl.

To evaluate your fine-tuned model (see Step 4) you will need the E2E tests located under output/4_train_test_split/e2e_tests.

Step 3: Fine-tune a base model

Environment Configuration

Recommended Hardware:

  • NVIDIA A100 GPU (40GB VRAM)
  • 12 core CPU with 64GB RAM
  • 256GB disk

Recommended Software:

  • Python 3.10
  • CUDA Toolkit 12.1
  • PyTorch 2.2
warning

Although these instructions will work with a relatively underpowered GPU, such as the NVIDIA T4, the actual fine-tuning and inference will be very slow.

It is highly recommended that you use an NVIDIA A100 or other similar GPU types.

Once you have provisioned the above hardware and installed the relevant software, deploy this Python notebook and run the cells to go through the rest of the model fine-tuning steps. The first few cells of the notebook also help with setting up the environment.

Below, we highlight a few important steps with recommendations for different setups you can try for them.

info

If you run into any issues with the code provided in the Python notebook for fine-tuning the model, please open an issue in its source repository on GitHub.

Base Model

The base model for fine-tuning is downloaded from the Hugging Face model hub. The Python notebook installs the packages required to do that.

Next, ensure that you add HUGGINGFACE_TOKEN and BASE_MODEL variables with your own values in this cell.

We recommend starting with the Llama 3.1 8B Instruct model as the base model to fine-tune. If you want to try a different model, we recommend one that has been instruction tuned, so that it has already been trained to faithfully follow instructions in the prompt. A good alternative is the CodeLlama 13B Instruct model.

Training and Validation Datasets

Ensure that the training and validation datasets, namely train.jsonl and val.jsonl, created in Step 2 are available on the disk where you are running fine-tuning. The code provided should be able to load the datasets and format them automatically.

As the files use the TRL instruction format, the TRL trainer used later will be able to automatically parse the datasets and generate the prompts from a template configured in the tokenizer.
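Before starting a long training run, it can be worth checking that both JSONL files parse cleanly. The sketch below assumes that each line is a JSON object with non-empty string fields named prompt and completion, which is how the TRL instruction format is commonly laid out. Treat the key names and the validate_jsonl helper as assumptions and adjust them to the files produced in Step 2.

```python
import json
from pathlib import Path
from typing import List

# Assumed record keys for the prompt/completion instruction format.
REQUIRED_KEYS = ("prompt", "completion")


def validate_jsonl(path: Path) -> List[str]:
    """Return a list of human-readable problems found in a JSONL file."""
    problems = []
    with path.open(encoding="utf-8") as handle:
        for line_number, line in enumerate(handle, start=1):
            if not line.strip():
                continue  # tolerate blank lines
            try:
                record = json.loads(line)
            except json.JSONDecodeError as error:
                problems.append(f"line {line_number}: invalid JSON ({error.msg})")
                continue
            if not isinstance(record, dict):
                problems.append(f"line {line_number}: not a JSON object")
                continue
            for key in REQUIRED_KEYS:
                value = record.get(key)
                if not isinstance(value, str) or not value:
                    problems.append(f"line {line_number}: missing or empty '{key}'")
    return problems
```

An empty list for both train.jsonl and val.jsonl means that every record has the assumed shape.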

Prompt templates vary between models and TRL will infer the correct template from your base model. If this is not available for your base model or if you wish to change it, you can set your own template string.

You can also define your own prompt formatting function in order to have full control of how the prompts are constructed.
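As an illustration, a custom formatting function can be as small as the sketch below. The template string, the prompt and completion key names, and the format_example helper are all hypothetical, and real chat templates differ between base models. TRL's SFTTrainer accepts such a function through its formatting_func argument, but whether it is called with a single example or a batch depends on the TRL version, so check the version used in the notebook.

```python
from typing import Dict

# Hypothetical template; real chat templates differ between base models.
PROMPT_TEMPLATE = "### Instruction:\n{prompt}\n\n### Response:\n{completion}"


def format_example(example: Dict[str, str], eos_token: str = "") -> str:
    """Build one training text from a prompt/completion record."""
    text = PROMPT_TEMPLATE.format(
        prompt=example["prompt"].strip(),
        completion=example["completion"].strip(),
    )
    # Appending the end-of-sequence token marks where the completion ends.
    return text + eos_token
```

In practice you would pass the tokenizer's own end-of-sequence token as eos_token rather than a hard-coded string.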

Training Hyper-parameters

The provided code for this part uses the SFTConfig and SFTTrainer classes from the TRL library. The default hyper-parameter values worked best in some in-house experiments, but there are a few parameters you can experiment with:

  1. If you get an OOM error when running fine-tuning, you can reduce per_device_train_batch_size in order to reduce the memory footprint. However, if your GPU has sufficient memory, you can try increasing it in order to reduce the total number of training steps.

  2. Consider tuning max_steps, as you may not need to perform all epochs in order to achieve optimal model accuracy. Conversely, you may see better model accuracy by increasing num_train_epochs.

  3. If the model training is taking too long, you can increase eval_steps in order to reduce how often validation is performed.
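To see how these parameters interact, the following sketch estimates the number of optimizer steps and validation runs for a given configuration. It is back-of-the-envelope arithmetic only: the helper names are hypothetical, and the trainer may round slightly differently, for example when the last batch is dropped.

```python
import math


def optimizer_steps(
    num_examples: int,
    per_device_train_batch_size: int,
    num_train_epochs: int = 1,
    gradient_accumulation_steps: int = 1,
    num_devices: int = 1,
) -> int:
    """Approximate the total number of optimizer steps for a training run."""
    effective_batch = (
        per_device_train_batch_size * gradient_accumulation_steps * num_devices
    )
    steps_per_epoch = math.ceil(num_examples / effective_batch)
    return steps_per_epoch * num_train_epochs


def evaluation_runs(total_steps: int, eval_steps: int) -> int:
    """Approximate how many times validation runs during training."""
    return total_steps // eval_steps
```

For example, with 1000 training examples and 3 epochs, doubling the batch size from 4 to 8 halves the estimate from 750 to 375 steps, and raising eval_steps from 50 to 150 on the 750-step run reduces the validation runs from 15 to 5.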

Saving the trained model

The provided code persists the trained model using 16-bit precision. It also merges the LoRA adapters with the weights of the base model and persists all the parameters as part of the fine-tuned model. If you are using a relatively small GPU, such as the NVIDIA T4, you may have to save the model in 4-bit precision instead (e.g. save_method = "merged_4bit_forced").

Visualizing training metrics

Once model training has finished, the training and validation losses are plotted. Check the plot for the following:

  1. Ideally, as the fine-tuning steps increase, the training and validation losses should decrease and converge.
  2. If both loss curves do not converge, it may be worth performing more fine-tuning steps or epochs. This condition is known as underfitting.
  3. If the validation loss suddenly starts to increase while the training loss continues to decrease or converge, you should decrease your total number of steps or epochs. This is known as overfitting.
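The same reading of the loss curves can be approximated in code if you have the logged loss values as lists. This is a rough heuristic sketch, not what the notebook does: the diagnose_losses helper, its labels, and the patience parameter are assumptions chosen for illustration.

```python
from typing import List


def diagnose_losses(
    train_loss: List[float], val_loss: List[float], patience: int = 3
) -> str:
    """Classify loss curves as 'overfitting', 'still improving' or 'plateau'.

    Both lists must hold one value per evaluation, in chronological order.
    """
    if not val_loss or len(train_loss) != len(val_loss):
        raise ValueError("expected two non-empty lists of equal length")
    # Index of the evaluation with the lowest validation loss.
    best = min(range(len(val_loss)), key=val_loss.__getitem__)
    evals_since_best = len(val_loss) - 1 - best
    train_still_falling = train_loss[-1] < train_loss[best]
    if evals_since_best >= patience and train_still_falling:
        # Validation stopped improving while training loss kept decreasing.
        return "overfitting"
    if evals_since_best == 0:
        # The latest evaluation is the best so far; more steps may help.
        return "still improving"
    return "plateau"
```

A result of "overfitting" suggests reducing the number of steps or epochs, while "still improving" suggests that further training may help.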

Exporting fine-tuned model

Export your fine-tuned model directory to an appropriate storage location that can be easily accessed later for deployment.

It is recommended that you use a cloud object store, such as Amazon S3 or Google Cloud Storage.

Step 4: Evaluate your fine-tuned model

Once you have a fine-tuned model, it is essential to evaluate it and cross-check its performance. Follow the steps below to do that:

  1. Ensure you have downloaded the fine-tuned model on a cloud instance which has access to at least an A100 GPU.

  2. Install vllm using pip install vllm==0.6.0. More information on that here.

  3. Run the model server with vllm serve finetuned_model, where finetuned_model is the name of the directory containing the model artifacts. This starts a model server on the localhost network interface. If the assistant runs on a different instance or machine, expose the model server on a public IP so that it can be reached from an external machine.

  4. Configure the assistant to use the model server:

    config.yml
    - name: SingleStepLLMCommandGenerator
      llm:
        model_group: finetuned_model_group

    endpoints.yml
    model_groups:
      - id: finetuned_model_group
        models:
          - model: finetuned_model
            provider: self-hosted
            api_base: <URL of the model server> # `localhost` if assistant and model server on same machine

    And add the following to your endpoints.yml:

    endpoints.yml
    vector_store:
      type: milvus
      host: localhost
      port: 19530
      collection: rasa

  5. Train the assistant with rasa train.

  6. Run:

    rasa test e2e <path to validation E2E tests>

    <path to validation E2E tests> is the path to 4_train_test_split/e2e_tests/validation.yaml inside the output directory generated in Step 2. The output of E2E testing should show the number of E2E test conversations that pass versus fail. These E2E test conversations were deliberately held out from fine-tuning, so they give a good idea of how well the fine-tuned model generalizes to natural language variations of user messages and to flows not present in the training data.

  7. As a sanity check, also evaluate the fine-tuned model on the E2E tests that were used to generate the fine-tuning data. To do that, run:

    rasa test e2e <path to training E2E tests>

    <path to training E2E tests> is the path to 4_train_test_split/e2e_tests/train.yaml inside the output directory generated in Step 2. Doing this ensures that the fine-tuning hasn't completely gone off the rails.

In order to assess whether the fine-tuned model is performing well, you can use the following pointers:

  1. The assistant passes at least 80% of the E2E test conversations from validation.yaml and 95% of the E2E test conversations from train.yaml when using the fine-tuned model as the command generator.
  2. The performance of the assistant with the fine-tuned LLM as the command generator is comparable to its performance with a strong LLM like gpt-4-0613 as the command generator when evaluated on the same training and validation E2E tests.

If neither of the above conditions holds, you should invest effort in creating more sample conversations, adding them to the existing ones, regenerating the fine-tuning data, and fine-tuning a new model for evaluation.
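The pass-rate targets from the first pointer can be expressed as a small check. In this sketch the E2EResult container and the meets_pass_rate_targets helper are hypothetical, and the passed and failed counts are meant to be copied by hand from the output of the two rasa test e2e runs.

```python
from dataclasses import dataclass

# Targets taken from the first pointer above.
MIN_VALIDATION_PASS_RATE = 0.80
MIN_TRAIN_PASS_RATE = 0.95


@dataclass
class E2EResult:
    """Passed and failed counts from one E2E test run (hypothetical container)."""

    passed: int
    failed: int

    @property
    def pass_rate(self) -> float:
        total = self.passed + self.failed
        return self.passed / total if total else 0.0


def meets_pass_rate_targets(train: E2EResult, validation: E2EResult) -> bool:
    """True when both the training and validation pass rates reach their targets."""
    return (
        train.pass_rate >= MIN_TRAIN_PASS_RATE
        and validation.pass_rate >= MIN_VALIDATION_PASS_RATE
    )
```

The comparison against a strong LLM from the second pointer can reuse the same container by computing a second pair of results with that LLM configured as the command generator.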

Step 5: Create a production-ready fine-tuned model

This step retrains the model on a single training dataset that combines the training set (train.jsonl) and the validation set (val.jsonl). It is optional and is most useful when training data is limited. Establish the effectiveness of training and the choice of hyper-parameters during Step 4, while a validation set is still available. Once that is done, retraining the base model with the same hyper-parameters on the slightly larger combined dataset can yield a small improvement in performance, as follows:

  1. Concatenate the training and validation sets, naming both input files explicitly so that a previously created combined file is never read as input, e.g.
    cat output/4_train_test_split/ft_splits/train.jsonl output/4_train_test_split/ft_splits/val.jsonl > output/4_train_test_split/ft_splits/combined_train_val.jsonl
  2. Retrain the model as per Step 3: Fine-tune a base model, using combined_train_val.jsonl.
  3. Evaluate the performance of this fine-tuned model as per Step 4. Some manual testing may also be appropriate afterwards to ensure that training was successful.
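If you prefer to combine the files from Python, for example inside the notebook, the concatenation can be sketched as follows. The combine_jsonl helper is hypothetical. It skips blank lines, guarantees one record per line even when a source file lacks a trailing newline, and ignores the destination file if it is passed as a source by mistake.

```python
from pathlib import Path
from typing import Iterable


def combine_jsonl(sources: Iterable[Path], destination: Path) -> int:
    """Concatenate JSONL files into `destination` and return the record count."""
    records = 0
    with destination.open("w", encoding="utf-8") as out:
        for source in sources:
            if source.resolve() == destination.resolve():
                continue  # never read the output file as an input
            with source.open(encoding="utf-8") as handle:
                for line in handle:
                    line = line.rstrip("\n")
                    if not line.strip():
                        continue  # drop blank lines
                    out.write(line + "\n")
                    records += 1
    return records
```

A typical call passes the paths of train.jsonl and val.jsonl as sources and the path of combined_train_val.jsonl as the destination.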

Step 6: Deploy the fine-tuned model to production

Assuming you have:

You can start the vLLM server on that instance using the Docker command below. Before running it, update the value of the --served-model-name flag, which overrides the name of the model used in API calls.

docker run --runtime nvidia --gpus all \
-v "./finetuned_model:/mnt/models" \
-p 8000:8000 --ipc="host" \
vllm/vllm-openai:latest \
--model "/mnt/models" \
--served-model-name "llama-fine-tuned"

You will then need to expose the vLLM server port to the outside world so your assistant can access it via a public IP address.
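Once the port is exposed, a quick way to confirm that the assistant will be able to reach the model is to list the models the server reports. The sketch below assumes the server exposes the OpenAI-compatible model listing at the /models path under an api_base that ends in /v1 (for example http://<host>:8000/v1), and that the response carries the model names in a data list of objects with an id field. The helper names are hypothetical.

```python
import json
from typing import List
from urllib.request import urlopen


def served_model_ids(models_payload: dict) -> List[str]:
    """Extract model names from an OpenAI-style model listing."""
    return [
        entry["id"] for entry in models_payload.get("data", []) if "id" in entry
    ]


def fetch_served_model_ids(api_base: str, timeout: float = 5.0) -> List[str]:
    """Query `<api_base>/models` and return the served model names.

    `api_base` is assumed to include the `/v1` prefix.
    """
    url = api_base.rstrip("/") + "/models"
    with urlopen(url, timeout=timeout) as response:
        return served_model_ids(json.load(response))
```

If the name passed to --served-model-name does not appear in the returned list, the model value in the assistant's endpoints.yml will not match what the server expects.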

If you wish to deploy using Kubernetes, refer to the details on this page.

Step 7: Connect the assistant to the fine-tuned model in production

Change the assistant's config to connect to the fine-tuned model deployed to production in Step 6:

config.yml
- name: SingleStepLLMCommandGenerator
  llm:
    model_group: finetuned_model_group

endpoints.yml
model_groups:
  - id: finetuned_model_group
    models:
      - model: llama-fine-tuned
        provider: self-hosted
        api_base: <URL of the model server from step 6>

Retrain the assistant with `rasa train`.