Deploying Fine-Tuned LLMs for Command Generation
This page relates to the fine-tuning recipe, which is a beta feature available starting with version 3.10.0.
After you have fine-tuned a base language model for the task of command generation, you can deploy it to be used by a CALM assistant.
The deployed fine-tuned model must be accessible to the assistant via a litellm-compatible API.
It is highly recommended that you use the vLLM model serving library, as it provides a number of important features such as batching and prefix caching.
This page gives an overview of the requirements and options for deploying a fine-tuned model in different environments, with links to more detailed guides in our reference section. The same instructions can be used to deploy any language model from Hugging Face, not just models you have fine-tuned yourself.
After deploying your model, you will need to update your assistant's config so that it uses the model for command generation.
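As a quick check that a deployed model is reachable over an OpenAI-/litellm-compatible API, you can send it a test request before updating the assistant's config. The sketch below uses the openai Python client; the base URL, API key, and model name are placeholders for your own deployment (the defaults shown assume a vLLM server running locally on port 8000).

```python
# Minimal sanity check against an OpenAI-/litellm-compatible endpoint.
# base_url, api_key, and model name are placeholders for your deployment.
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8000/v1",  # default address of a local vLLM server
    api_key="EMPTY",                      # any placeholder works if the server does not enforce a key
)

response = client.chat.completions.create(
    model="my-finetuned-llama",           # must match the name the server registers for your model
    messages=[{"role": "user", "content": "Reply with the single word OK."}],
    max_tokens=5,
    temperature=0.0,
)
print(response.choices[0].message.content)
```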
Hardware Requirements
Hosting large language models requires a GPU to achieve low inference latency.
If you are using a language model with billions of parameters, such as Llama 3.1, it is highly recommended that you use a GPU with a relatively large amount of memory. For example, an 8 billion parameter model needs at least 16GB of GPU memory just to hold the weights at 16-bit precision. On top of that, memory is required for processing one or more queries at once and for caching activations from prior queries (prefix caching), which speeds up the processing of similar queries. In our experience, a GPU with 40GB of memory is enough to operate an 8B parameter model at scale.
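As a back-of-the-envelope check, the memory needed just for the weights follows directly from the parameter count and the precision. The sketch below reproduces the 8B / 16-bit example from above; the remaining headroom on a 40GB card is what gets used for batching and prefix caching.

```python
def weight_memory_gb(num_params: float, bytes_per_param: int = 2) -> float:
    """Memory required just to hold the model weights, in GB."""
    return num_params * bytes_per_param / 1e9

# 8 billion parameters at 16-bit (2 bytes per parameter) precision:
print(weight_memory_gb(8e9))  # ~16.0 GB for the weights alone
# The rest of the GPU memory is used for per-request activations and the
# prefix cache, which is why a 40GB card is comfortable for an 8B model.
```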
Suitable GPUs for low-latency inference:
- A100 (40GB/80GB)
- L40S (48GB)
- H100 (80GB)
Expected Throughput and Latency
With one of the recommended GPUs you can process around 5 requests per second for command generation at a median response time below 500ms. Note that requests per second is not the same as the number of users. For example, in a voice bot, a user might only interact with the system every 10 seconds, because each interaction loop involves the user speaking, the system processing and responding, and the user listening, deciding, and speaking again. Thus, 5 requests per second equates to around 50 concurrent users.
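The arithmetic behind that estimate is simple to write down. The sketch below uses the illustrative numbers from this paragraph (5 requests per second, one user turn roughly every 10 seconds); they are not guarantees for your deployment.

```python
def supported_concurrent_users(requests_per_second: float,
                               seconds_between_user_turns: float) -> float:
    """Each user issues one command generation request per conversational turn."""
    return requests_per_second * seconds_between_user_turns

# 5 requests/second with one user turn roughly every 10 seconds (voice bot example):
print(supported_concurrent_users(5, 10))  # ~50 concurrent users
```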
The exact latency and throughput will depend on the size of your prompt, which in turn is influenced by factors such as the number of flows, the length of their descriptions, and whether you are using flow retrieval.
Also take a look at our reference section on further optimizations.
Options for Deploying the Fine-Tuned Model
The following options show different scenarios for using the vLLM model serving library with your fine-tuned model.
- If you would like to serve the model for testing purposes directly on the machine where you trained it, you can run vLLM directly (a quick local smoke test is sketched after this list).
- You can also use Docker to start vLLM on a VM.
- For production use, you can deploy the LLM to a Kubernetes cluster.
- If you are on AWS, SageMaker Inference Endpoints are another option for production use. Try them out with our guide in the reference section.
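Before committing to one of the deployment options above, you can smoke-test the fine-tuned checkpoint on the training machine with vLLM's offline Python API. This is a minimal sketch: the model path is a placeholder for your local output directory (or a Hugging Face model id), and it assumes a GPU with enough free memory; the serving setups themselves are covered in the linked reference guides.

```python
# Quick local check of a fine-tuned checkpoint with vLLM's offline Python API.
# "path/to/finetuned-model" is a placeholder for your local output directory
# or a Hugging Face model id.
from vllm import LLM, SamplingParams

llm = LLM(model="path/to/finetuned-model", dtype="float16")
sampling = SamplingParams(temperature=0.0, max_tokens=32)

# Use one of the command generation prompts from your fine-tuning data here.
outputs = llm.generate(["<paste one of your command generation prompts here>"], sampling)
print(outputs[0].outputs[0].text)
```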