
Deploy a fine-tuned model for command generation using vLLM

New in 3.10

This page relates to the fine-tuning recipe, a beta feature available starting with version 3.10.0.

After you have fine-tuned a base language model for the task of command generation, you can deploy it to be used by a development or production CALM assistant.

The deployed fine-tuned model must be accessible to the assistant via an OpenAI-compatible API.

It is highly recommended that you use the vLLM model serving library (version 0.6.0), as it provides an OpenAI-compatible server.

This page provides advice on how to deploy a fine-tuned model with vLLM in different environments. The same instructions can be used to deploy any language model from Hugging Face, not necessarily one that you have fine-tuned.

info

After deploying your model, you will have to update your assistant's config so that it uses the model served by vLLM for command generation.
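For illustration, such a config entry might look like the sketch below. The component name and the provider, model, and api_base keys are assumptions based on a locally served model named finetuned_model; check the CALM command generator documentation for the exact keys your version expects.

pipeline:
  - name: SingleStepLLMCommandGenerator   # assumed component name, adjust to your setup
    llm:
      provider: "self-hosted"             # assumption: provider key for an OpenAI-compatible server
      model: "finetuned_model"            # must match the served model name
      api_base: "http://localhost:8000/v1"  # base URL of your vLLM server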

System requirements

Deploying a language model requires a GPU accelerator in order to achieve low inference latency.

If you are using a language model with billions of parameters, such as Llama 3.1, it is highly recommended that you use a GPU with a relatively large amount of memory, such as the NVIDIA A100.

You should use a system with at least the same amount of resources as was used to fine-tune the model.
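Before serving the model, you can quickly confirm which GPU is visible and how much memory it has, for example with nvidia-smi:

# list the GPU name, total memory, and free memory
nvidia-smi --query-gpu=name,memory.total,memory.free --format=csv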

Deploy the fine-tuned model to a development environment

Assuming that you have already installed vLLM==0.6.0 and your fine-tuned model files are in a directory called finetuned_model, you can deploy it locally for development as a vLLM server:

vllm serve finetuned_model

If you run your assistant locally too, it will be able to reach the port of your vLLM server on localhost. Unless overridden with flags, the name of the model to be used in API calls will be finetuned_model.
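To verify that the server is up and that the model name is what you expect, you can query the OpenAI-compatible endpoints exposed by vLLM, for example:

# list the models served by the local vLLM server
curl http://localhost:8000/v1/models

# send a test completion request (the prompt is just a placeholder)
curl http://localhost:8000/v1/completions \
    -H "Content-Type: application/json" \
    -d '{"model": "finetuned_model", "prompt": "Hello", "max_tokens": 16}'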

Refer to the official documentation for all the flags available for the vllm serve command, as they can have an impact on prediction latency.

Deploy the fine-tuned model to a production environment

VM instance

It is recommended that you use the official vLLM Docker image if you want to run a fine-tuned model on a VM instance for use with a production assistant. The recommended version of vLLM is 0.6.0.

Assuming you have:

  • already installed Docker and the NVIDIA Container Toolkit on a VM instance with an NVIDIA GPU
  • the fine-tuned model files available on the instance in a directory called finetuned_model

you can start the vLLM server on that instance using the docker command below. Make sure to first update the value of the --served-model-name flag, which overrides the name of the model used in API calls.
docker run --runtime nvidia --gpus all \
    -v "./finetuned_model:/mnt/models" \
    -p 8000:8000 --ipc="host" \
    vllm/vllm-openai:v0.6.0 \
    --model "/mnt/models" \
    --served-model-name "CHANGEME"

You will then need to expose the vLLM server port to the outside world so your assistant can access it via a public IP address.
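As a quick check that the server is reachable from outside the instance, you can call the models endpoint via the public IP; PUBLIC_IP is a placeholder for your instance's address:

curl http://PUBLIC_IP:8000/v1/models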

Kubernetes cluster

It is recommended that you use the KServe model inference platform when deploying a fine-tuned model to a Kubernetes cluster for use in production, due to the integrated vLLM runtime.

It is also recommended that you put your model files in a cloud object store, as KServe can automatically download models from buckets. You must first configure KServe with the credentials for your cloud storage provider, such as with Amazon S3 or with Google Cloud Storage.
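As an example, assuming Amazon S3, you could upload the fine-tuned model files to a bucket with the AWS CLI and then use the resulting URI as the STORAGE_URI value in the manifest below; the bucket name is a placeholder:

aws s3 sync ./finetuned_model s3://YOUR_BUCKET/finetuned_model
# the corresponding storage URI would then be s3://YOUR_BUCKET/finetuned_model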

Assuming that your cluster already has KServe installed and has at least one NVIDIA GPU node with appropriate drivers, you can use the manifest below to deploy a fine-tuned model from a bucket. First, make sure to update the values of:

  • the metadata.name field with a unique name for your model inference service
  • the STORAGE_URI environment variable with the cloud storage URI for your model
  • the --served-model-name flag with the model name to be used when calling the OpenAI API
apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: "CHANGEME"
spec:
  predictor:
    containers:
      - name: "main"
        image: "kserve/vllmserver:latest"
        command:
          - "python3"
          - "-m"
          - "vllm.entrypoints.openai.api_server"
        args:
          - "--port"
          - "8000"
          - "--model"
          - "/mnt/models"
          - "--served-model-name"
          - "CHANGEME"
        env:
          - name: "STORAGE_URI"
            value: "CHANGEME"
        resources:
          limits:
            nvidia.com/gpu: "1"
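
After updating the placeholder values, you can apply the manifest and wait for the inference service to become ready; the file and service names below are placeholders:

kubectl apply -f inference-service.yaml
# check readiness and note the service URL
kubectl get inferenceservice CHANGEME
kubectl get inferenceservice CHANGEME -o jsonpath='{.status.url}'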

If your CALM assistant is deployed in the same cluster as your fine-tuned model, you can rely on Kubernetes DNS and use the internal URI of your inference service in your assistant config. Otherwise, you will have to set up your own ingress and use the external IP of your service.