Deploying Fine-Tuned LLMs for Command Generation
This page provides detailed steps for the different options and optimizations for hosting a fine-tuned LLM. If you are looking for a more high-level overview of deploying LLMs, consult the following pages:
Options for deploying the fine-tuned LLM
The following are different scenarios for using the vLLM model serving library with your fine-tuned model.
Using vLLM directly
A quick option to try out a model right after training is to serve it directly on the machine where you trained it. Assuming that you have already installed vLLM and your fine-tuned model files are in a directory called finetuned_model, you can start the vLLM server this way:
vllm serve finetuned_model --enable-prefix-caching
If you also run your Rasa assistant on that machine, it will be able to reach the port of your vLLM server on localhost. The name of the model to be used in API calls, unless overridden with flags, will be finetuned_model.
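To sanity-check the server before wiring it into your assistant, you can send a request to its OpenAI-compatible chat completions endpoint (assuming the default port 8000), for example:
curl http://127.0.0.1:8000/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{"model": "finetuned_model", "messages": [{"role": "user", "content": "Hello"}]}'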
You can add a model group to your endpoints.yml referencing this deployment like so:
model_groups:
  - id: vllm
    models:
      - provider: self-hosted
        model: finetuned_model
        api_base: http://127.0.0.1:8000/v1
        temperature: 0.0
Make sure to use this model group id in your Rasa config.yml and retrain the assistant.
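Depending on your Rasa version, the config.yml entry referencing this model group might look roughly like the sketch below; the component name and keys shown here are assumptions, so check the documentation of your command generator:
pipeline:
  - name: SingleStepLLMCommandGenerator
    llm:
      model_group: vllm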
If you would like to run your Rasa assistant on a different machine than the vLLM server, there are several options. You could set up a reverse proxy or use SSH local forwarding (ssh -L 8000:localhost:8000 <your vllm server>) to forward the port to your local machine. Make sure that you properly protect your service from unauthorized use and that the firewall of the infrastructure surrounding your VM allows you to reach your server through the desired protocol (https / ssh).
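As one possible sketch of the reverse proxy option, an nginx server block terminating TLS in front of the vLLM server could look like this; the hostname and certificate paths are placeholders:
server {
    listen 443 ssl;
    server_name vllm.example.com;

    # placeholder certificate paths
    ssl_certificate     /etc/ssl/certs/vllm.example.com.pem;
    ssl_certificate_key /etc/ssl/private/vllm.example.com.key;

    location /v1/ {
        # forward API traffic to the local vLLM server
        proxy_pass http://127.0.0.1:8000/v1/;
    }
}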
Refer to the official documentation for all the flags available for the vllm serve command.
Using vLLM via Docker
It is recommended that you use the official vLLM Docker image if you want to run a fine-tuned model on a VM instance via Docker.
Assuming you have:
- already installed Docker and the NVIDIA Container Toolkit on a VM instance with an NVIDIA GPU
- the fine-tuned model files available on the instance in a directory called finetuned_model

You can start the vLLM server on that instance using the docker command below.
docker run --runtime nvidia --gpus all \
    -v "./finetuned_model:/mnt/finetuned_model" \
    -p 8000:8000 --ipc="host" \
    vllm/vllm-openai:latest \
    --model "/mnt/finetuned_model" \
    --enable-prefix-caching
You will then need to expose the vLLM server port to the outside world so your assistant can access it via a public IP address. Make sure that you properly protect your service from unauthorized use.
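A matching model group in your endpoints.yml could then look roughly like the sketch below; the id is arbitrary and the IP address is a placeholder for your VM's public address. Note that when the model is passed to vLLM as a path, it is served under that path name unless you also set --served-model-name in the docker command above.
model_groups:
  - id: vllm_docker
    models:
      - provider: self-hosted
        model: /mnt/finetuned_model
        api_base: http://YOUR-VM-PUBLIC-IP:8000/v1
        temperature: 0.0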
Make sure to use this model group id in your Rasa config.yml and retrain the assistant.
Kubernetes cluster
It is recommended that you use the KServe model inference platform when deploying a fine-tuned model to a Kubernetes cluster for use in production, due to the integrated vLLM runtime.
It is also recommended that you put your model files in a cloud object store, as KServe can automatically download models from buckets. You must first configure KServe with the credentials for your cloud storage provider, such as with Amazon S3 or with Google Cloud Storage.
Assuming that your cluster already has KServe installed and has at least one NVIDIA GPU node with appropriate drivers, you can use the manifest below to deploy a fine-tuned model from a bucket. First, make sure to update the values of:
- the metadata.name field with a unique name for your model inference service; it is used for the internal DNS within the Kubernetes cluster
- the STORAGE_URI environment variable with the cloud storage URI for your model
- the --served-model-name flag with the model name to be used when calling the OpenAI API
apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: "CHANGEME-SERVICE-NAME"
spec:
  predictor:
    containers:
      - name: "main"
        image: "kserve/vllmserver:latest"
        command:
          - "python3"
          - "-m"
          - "vllm.entrypoints.openai.api_server"
        args:
          - "--port"
          - "8000"
          - "--model"
          - "/mnt/models"
          - "--served-model-name"
          - "CHANGEME-MODEL-NAME"
          - "--enable-prefix-caching"
        env:
          - name: "STORAGE_URI"
            value: "CHANGEME-STORAGE-URI"
        resources:
          limits:
            nvidia.com/gpu: "1"
If your CALM assistant is deployed in the same cluster as your fine-tuned model, you can exploit the Kubernetes DNS and use the internal URI of your inference service in your assistant config. Otherwise, you will have to set up your own ingress and use the external IP of your service.
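For the in-cluster case, a model group in endpoints.yml could look roughly like the sketch below; the host is an assumption based on the usual Kubernetes service DNS pattern, so use the URL reported by kubectl get inferenceservice for your deployment:
model_groups:
  - id: vllm_kserve
    models:
      - provider: self-hosted
        model: CHANGEME-MODEL-NAME
        api_base: http://CHANGEME-SERVICE-NAME.CHANGEME-NAMESPACE.svc.cluster.local/v1
        temperature: 0.0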
SageMaker Inference Endpoints
This is a recipe for the prototypical use of SageMaker inference endpoints; it involves a number of manual steps.
For production deployments, we recommend using the AWS CDK to define roles and resources as code instead.
Prerequisites
To create an inference endpoint you need the appropriate permissions and quotas. First, create an AWS role with full access to SageMaker.
Second, make sure that your user account has permissions 1) to assume the previously created role and 2) to stop SageMaker endpoints.
Third, make sure that you have the quota for using your desired instance type as a SageMaker inference endpoint.
There are separate quotas for using instances in different AWS services, such as VMs in EC2 and SageMaker inference endpoints, so make sure that you increase the one for SageMaker inference endpoints. For example, if you want to use an ml.g6e.2xlarge instance as a SageMaker endpoint, there is a specific quota for that, "ml.g6e.2xlarge for endpoint usage", which you can raise in the Service Quotas section.
Starting an Endpoint
Once you have taken care of the prerequisites, you can use the following Python script to start a SageMaker inference endpoint with a model that has been uploaded to Hugging Face. It uses the SageMaker Python SDK to start the endpoint.
Make sure that your AWS credentials are in the environment when running the script. If you want to use a gated model from Hugging Face, also add the Hugging Face token.
import os

import boto3
import sagemaker
from sagemaker.djl_inference.model import DJLModel

iam = boto3.client('iam')
role = iam.get_role(RoleName='THE-ROLE-YOU-CREATED-EARLIER')['Role']['Arn']

session = sagemaker.session.Session()

# The model you want the endpoint to serve
model_id = "meta-llama/Llama-3.1-8B-Instruct"

# add a HF token to the env vars if the model you want to host is gated
hf_token = os.environ.get("HF_TOKEN")

# vLLM arguments are prefixed with OPTION_
env = {
    "OPTION_ENABLE_PREFIX_CACHING": "true",
}

# Using AWS DJL containers that come prepackaged with vLLM for large model inference.
# Updated containers can be found here:
# https://github.com/aws/deep-learning-containers/blob/master/available_images.md
# Make sure to replace the region part of the image URI with your desired region,
# as cross-region image pulls fail.
model = DJLModel(
    model_id=model_id,
    env=env,
    role=role,
    image_uri="763104351884.dkr.ecr.eu-central-1.amazonaws.com/djl-inference:0.33.0-lmi15.0.0-cu128",
    huggingface_hub_token=hf_token,
)

# Instance type, here an instance with a single L40S GPU
instance_type = "ml.g6e.2xlarge"

endpoint_name = sagemaker.utils.name_from_base("lmi-model")

# deploy the model to SageMaker Inference
predictor = model.deploy(
    initial_instance_count=1,
    endpoint_name=endpoint_name,
    instance_type=instance_type,
    container_startup_health_check_timeout=300,
)

# print the endpoint name so you can reference it in endpoints.yml
print(f"Endpoint name: {endpoint_name}")
This script will print the name of the endpoint that was just created. It will be something like lmi-model-TIMESTAMP, e.g. lmi-model-2025-07-25-09-52-15-821. You need this name to tell Rasa where to find your model in the next section.
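If you want to check that the endpoint is ready before pointing Rasa at it, one option is to query its status with boto3, for example (the endpoint name is a placeholder):
import boto3

sm = boto3.client("sagemaker")
status = sm.describe_endpoint(EndpointName="lmi-model-2025-07-25-09-52-15-821")["EndpointStatus"]
# the endpoint is usable once the status is "InService"
print(status)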
Using an Endpoint
In your Rasa config, make sure that you add a model group to your endpoints.yml:
model_groups:
  - id: sagemaker_ft
    models:
      - provider: sagemaker_chat
        model: "sagemaker_chat/lmi-model-2025-07-25-09-52-15-821"
        temperature: 0.0
        timeout: 10.0
Afterward, make sure that you reference this model group in your config.yml and retrain your assistant.
Also, add your AWS credentials to a .env file or export them, so that AWS can find the endpoint in your account and you can use it.
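Such a .env file would typically contain the standard AWS variables, for example (a session token may also be needed depending on how your credentials are issued):
AWS_ACCESS_KEY_ID=...
AWS_SECRET_ACCESS_KEY=...
AWS_DEFAULT_REGION=eu-central-1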
The sagemaker_chat prefix to your model name is important here: it tells Rasa to use the endpoint in chat format, i.e. it makes sure vLLM adds the required chat formatting tokens to the prompt. Using just sagemaker as a prefix would call the model without these formatting tokens and produce unusable responses.
Stopping an Endpoint
To stop an endpoint, navigate to the SageMaker service in the AWS console, select Inference, then Endpoints, select your desired endpoint, and delete it. The model and endpoint configs do not necessarily need to be deleted.
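Alternatively, you can delete the endpoint programmatically, for example with boto3 (the endpoint name is a placeholder):
import boto3

sm = boto3.client("sagemaker")
# removes the running endpoint; the endpoint config and model objects are left in place
sm.delete_endpoint(EndpointName="lmi-model-2025-07-25-09-52-15-821")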
Debugging an Endpoint
Sometimes, especially when trying out new configuration options, things fail. You can access the SageMaker logs by opening the endpoint details in the AWS console (SageMaker -> Inference -> Endpoints) and selecting the endpoint's CloudWatch logs.
Optimizations
Prefix Caching
Prefix caching allows you to save compute on prompts that share the same start (= prefix). For all tokens from the beginning of the prompt that match a previously seen prompt, earlier computation can be reused, which saves time when processing the prompt.
This optimization is really important for command generation because the prompts have a large overlapping section at the beginning. We recommend that you turn it on by default.
For an in-depth look into prefix caching, take a look at this section in the vLLM docs.
8-bit loading
vLLM allows you to dynamically quantize 16-bit models to 8-bit at load time. This allows you to gain further speedup at the cost of a little accuracy. In our experiments we saw around 1% loss of accuracy. The exact numbers will depend on your use case and even on your hardware. Make sure that the GPU you use has fp8 cores to fully make use of the quantization. For example, L40S and H100 GPUs have dedicated fp8 cores, while the A100 GPU does not. That can make fp8 slower on A100 than fp16.
To load a model in 8-bit mode, add the parameter --quantization fp8 to your vLLM starting parameters.
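For example, combined with the direct serving command from earlier:
vllm serve finetuned_model --enable-prefix-caching --quantization fp8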
Read more at this section in the vLLM docs.
4-bit loading
Loading your model in 4-bit mode can speed up the model even further at the cost of some more accuracy. You need to make sure that the bitsandbytes library is installed to use this feature.
To load a model in 4-bit mode, add the parameter --quantization bitsandbytes to your vLLM starting parameters.
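For example:
vllm serve finetuned_model --enable-prefix-caching --quantization bitsandbytes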
Read more at this section in the vLLM docs.