
October 7th, 2025

A Beginner’s Guide to Fine-Tuning LLMs

Kara Hartnett

TL;DR: Pretrained LLMs are powerful, but they aren’t one-size-fits-all. Fine-tuning helps you adapt them to your business needs, whether that’s customer service, legal research, or another specialized task. By preparing the right dataset, choosing a suitable base model, and applying techniques like supervised training, RLHF, or parameter-efficient tuning, you can align an LLM with your tone, workflows, and compliance requirements.

Pretrained large language models (LLMs) are impressive: They can summarize text, help with content drafting, and answer questions. But, as amazing as they are, they may not meet your business’s unique needs. This is why LLM fine-tuning is important.

Fine-tuning is, essentially, continuing to train an existing LLM on examples of the specific tasks you want it to handle. The process adjusts a model’s behavior, aligning it with your goals, language, and tone so that it can perform better in specific contexts (like customer service) or offer valuable assistance in niche fields (like law).

How does LLM fine-tuning occur? Read on to find out.

A simple walkthrough of the fine-tuning process

Training foundation LLMs from scratch requires significant resources: funding, massive datasets, and compute infrastructure. While large enterprises may have these resources, many organizations don’t. What is in reach is fine-tuning, which costs a fraction as much and takes significantly less work. You can deploy your system relatively quickly if you know how to handle the fine-tuning process.

Here’s a basic step-by-step guide to help you:

Prepare your dataset

The first step is to gather the data you want your LLM to learn from. Your dataset should reflect the kind of responses you want the LLM to give and the type of tasks you want it to handle. For example, if you want the model to help your customer service team, compile sample questions (based on real customer queries) and ideal responses.

For a seamless training process:

  • Choose a suitable file type: CSV works for small, flat datasets of prompt-response pairs, while JSON Lines (JSONL) scales better for larger datasets (see the sketch after this list).
  • Ensure proper formatting: Maintain a consistent entry pattern and punctuation. Also, avoid unnecessary symbols.
  • Label your data: Highlight which parts of your datasets constitute typical customer queries and ideal responses, using labels like “input” and “output” or “prompt” and “response.”
  • Maintain a consistent structure: Follow the same pattern across all your entries to reduce confusion. For example, if you use the “prompt” and “response” structure, maintain it throughout your dataset.
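
As a minimal sketch of what that structure can look like, the snippet below writes a couple of hand-written prompt-response pairs to a JSONL file. The example pairs and the customer_support.jsonl filename are purely illustrative; real entries would come from actual customer queries and vetted answers.

```python
import json

# Illustrative prompt-response pairs; replace with data drawn from real support interactions.
examples = [
    {"prompt": "How do I reset my password?",
     "response": "Go to Settings > Security, choose 'Reset password', and follow the emailed link."},
    {"prompt": "Can I change my billing date?",
     "response": "Yes. Open Billing, select 'Change billing date', and pick any day of the month."},
]

# One JSON object per line (JSONL) keeps large datasets easy to stream and append to.
with open("customer_support.jsonl", "w", encoding="utf-8") as f:
    for row in examples:
        f.write(json.dumps(row, ensure_ascii=False) + "\n")
```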

Choose your base model

Next, choose your base (foundation) model. Look for small LLMs if you want a fast, cost-effective solution and larger LLMs if you need more sophisticated reasoning. If you’re unsure what to choose, start with a smaller option and reassess from there.

After determining the size, consider your use case. Go for a pretrained model that aligns with your specific needs. For example, if you have a global audience and want to build a bot that can speak to customers in their native languages, choose a multilingual model. This can save time and resources by reducing the amount of fine-tuning required.
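
If you’re evaluating open models, a quick way to sanity-check a candidate is to load it and look at its size. The sketch below assumes the Hugging Face transformers library is installed and uses distilgpt2 only as a stand-in for whichever small model you’re considering.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "distilgpt2"  # stand-in for whichever small base model you're evaluating
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

# A quick size check helps you compare candidates before committing to one.
print(f"{model.num_parameters():,} parameters in {model_name}")
```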

Train and evaluate

This is where the actual fine-tuning starts. Feed your dataset into the base model so it can learn your preferred language, tone, and patterns. When you’re done, test the model by entering new prompts and assessing its responses for accuracy and relevance.

The first run might not be perfect, but training is rarely a one-off task. If the model’s responses don’t match your expectations, adjust your dataset by refining your responses and including more examples. Then, retrain and retest the LLM until it meets your needs.
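
Here is one possible shape for that loop, assuming the JSONL file from the dataset step, the Hugging Face transformers and datasets libraries, and distilgpt2 as a stand-in base model. The prompt template and hyperparameters are placeholders, not recommendations.

```python
from datasets import load_dataset
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer, TrainingArguments)

model_name = "distilgpt2"  # illustrative small base model
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token  # GPT-2-style models ship without a pad token
model = AutoModelForCausalLM.from_pretrained(model_name)

# Load the JSONL file from the dataset step and fold each pair into one training string.
dataset = load_dataset("json", data_files="customer_support.jsonl", split="train")

def tokenize(example):
    text = f"Customer: {example['prompt']}\nAgent: {example['response']}{tokenizer.eos_token}"
    return tokenizer(text, truncation=True, max_length=512)

tokenized = dataset.map(tokenize, remove_columns=dataset.column_names)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="ft-support", num_train_epochs=3,
                           per_device_train_batch_size=4, learning_rate=5e-5),
    train_dataset=tokenized,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()

# Spot-check the result with a prompt that isn't in the training data.
inputs = tokenizer("Customer: How do I update my email address?\nAgent:",
                   return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_new_tokens=60, pad_token_id=tokenizer.eos_token_id)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```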

Examples of the different types of fine-tuning

Some fine-tuning methods may be better suited for your LLM than others, depending on your goals and budget. Here’s a look at the most common examples to help you decide what to go for:

Supervised fine-tuning

Supervised fine-tuning (SFT) is the most common approach. It involves teaching the model by example, using a labeled dataset that aligns with your desired use case.

If you’re training a customer service system, you’ll first need to gather examples that show the types of questions users ask (prompts) and the ideal responses (answers). Then, feed the examples to your model so it learns to reproduce that style of answer.

This approach is straightforward and quick to iterate on. But it may require a lot of examples, especially if you want to train your LLM for complex tasks.
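
One detail worth knowing: many SFT setups compute the loss only on the response tokens, so the model is graded on what it says rather than on repeating the prompt. A rough sketch of that masking, assuming a Hugging Face tokenizer and distilgpt2 as an illustrative base model:

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("distilgpt2")  # illustrative small base model

def build_example(prompt: str, response: str, max_length: int = 512):
    """Turn one labeled pair into a training example where only the response is scored."""
    prompt_ids = tokenizer(prompt + "\n", add_special_tokens=False)["input_ids"]
    response_ids = tokenizer(response + tokenizer.eos_token, add_special_tokens=False)["input_ids"]
    input_ids = (prompt_ids + response_ids)[:max_length]
    # -100 tells the loss function to ignore the prompt tokens.
    labels = ([-100] * len(prompt_ids) + response_ids)[:max_length]
    return {"input_ids": input_ids, "labels": labels}

example = build_example("How do I reset my password?",
                        "Go to Settings > Security and choose 'Reset password'.")
print(len(example["input_ids"]), example["labels"][:8])
```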

Reinforcement learning from human feedback (RLHF)

This type of fine-tuning is common in advanced models like ChatGPT. Instead of relying only on fixed labeled examples, RLHF uses human feedback to teach the model which outputs people actually prefer. Here’s how it works:

  1. You enter a prompt into your model.
  2. The model generates multiple outputs in response to the prompt.
  3. Human annotators rank or score the outputs based on factors like relevance.
  4. You then train a separate model (reward model) with the ranking data to predict human preferences.
  5. The reward model helps optimize your system through reinforcement learning (RL).

RLHF is great because it captures human preferences, increasing a model’s chances of success in real-world interactions. However, collecting firsthand human feedback can be costly. What’s more, you need a diverse pool of annotators to mitigate bias and avoid overfitting to a narrow set of preferences.
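
The reward model in step 4 is typically trained with a pairwise preference loss: it should assign a higher score to the completion annotators preferred. A minimal PyTorch sketch of that objective (the scoring model itself is omitted, and the numbers are toy values):

```python
import torch
import torch.nn.functional as F

def preference_loss(reward_chosen: torch.Tensor, reward_rejected: torch.Tensor) -> torch.Tensor:
    """Pairwise loss: push the preferred completion's score above the rejected one's."""
    return -F.logsigmoid(reward_chosen - reward_rejected).mean()

# Toy scores for a batch of three comparisons, as if produced by a reward model.
chosen = torch.tensor([1.2, 0.3, 2.1])
rejected = torch.tensor([0.4, 0.9, 1.5])
print(preference_loss(chosen, rejected))  # lower loss when chosen completions score higher
```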

Parameter-efficient fine-tuning (PEFT)

PEFT uses techniques that only change a small fraction of your model’s parameters. It’s common when training large LLMs, as they require significant computing power and memory for full fine-tuning.

Some common PEFT techniques include:

  • Low-Rank Adaptation (LoRA): Trains small, low-rank update matrices alongside the frozen original weights instead of updating the full weight matrices.
  • Quantized LoRA (QLoRA): Combines LoRA with quantization, which stores the base model’s weights at lower numerical precision (for example, 4-bit), to cut memory use further.

These approaches are popular for adapting larger models because they reduce training costs (by minimizing memory and GPU requirements) and enable faster deployments. However, they can underperform full fine-tuning on some niche tasks.
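
In practice, LoRA is usually applied through the Hugging Face peft library. The sketch below wraps a small GPT-2-style model (distilgpt2, used only for illustration) so that just the low-rank adapter weights are trainable; the rank, scaling, and target modules are placeholder choices.

```python
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

base = AutoModelForCausalLM.from_pretrained("distilgpt2")  # illustrative small model

# "c_attn" is the attention projection in GPT-2-style architectures; other
# architectures use different module names.
config = LoraConfig(r=8, lora_alpha=16, lora_dropout=0.05,
                    target_modules=["c_attn"], task_type="CAUSAL_LM")

model = get_peft_model(base, config)
model.print_trainable_parameters()  # typically a small fraction of the base model's weights
```

The wrapped model can be dropped into the same training loop shown earlier; only the adapter weights receive gradient updates.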

Instruction tuning and adapter layers

Instruction fine-tuning is a form of SFT that trains models to follow natural language instructions. It uses diverse datasets of instructions paired with ideal outputs, allowing models to generalize to a broader range of tasks than models tuned on a single narrow task.

Adapter layers are small, trainable modules inserted between an LLM’s existing layers. During fine-tuning, teams train only these adapter parameters and leave the base model’s weights frozen. Adapter layers are a form of PEFT and, as such, offer the same cost and time benefits as the other techniques.
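
Conceptually, an adapter is just a small bottleneck network with a residual connection. A bare-bones PyTorch sketch (the hidden size and bottleneck width below are arbitrary):

```python
import torch
import torch.nn as nn

class Adapter(nn.Module):
    """Bottleneck adapter: down-project, nonlinearity, up-project, residual add."""

    def __init__(self, hidden_size: int, bottleneck: int = 64):
        super().__init__()
        self.down = nn.Linear(hidden_size, bottleneck)
        self.up = nn.Linear(bottleneck, hidden_size)
        self.act = nn.GELU()
        # Zero-init the up-projection so an untrained adapter starts as (almost) an identity.
        nn.init.zeros_(self.up.weight)
        nn.init.zeros_(self.up.bias)

    def forward(self, hidden_states: torch.Tensor) -> torch.Tensor:
        # The residual connection keeps the frozen model's representation as the starting point.
        return hidden_states + self.up(self.act(self.down(hidden_states)))

adapter = Adapter(hidden_size=768)
print(adapter(torch.randn(1, 10, 768)).shape)  # torch.Size([1, 10, 768])
```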

What kind of data do you need for fine-tuning LLMs?

Your data should be domain-specific when fine-tuning large language models. To build a customer service bot, you’ll need task-specific data from sources such as support transcripts, internal knowledge bases, and product documentation. Similarly, if you’re designing an LLM to serve as a legal assistant, you’ll need sources like legal briefs, agreements, case law, and judicial rulings. Feeding your model domain-specific data increases its chances of providing high-quality, meaningful responses.

Other factors to keep in mind when choosing data include:

  • Quality: The data should be free of errors and bias so your model learns the right patterns (a simple automated check is sketched after this list).
  • Consistency: It should follow the same structure and format throughout to avoid confusing the model.
  • Balance: It should represent different scenarios evenly to reduce the risk of overfitting.
  • Volume: It should include enough examples to give your model sufficient context. That doesn’t mean millions of examples; a few thousand high-quality ones are often plenty.
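
A lightweight script can catch some of these issues before training, for example empty fields and duplicate prompts in the JSONL file from earlier. This is only a starting point; balance and bias still need human review.

```python
import json

def check_dataset(path: str) -> list[str]:
    """Flag empty fields and duplicate prompts in a prompt/response JSONL file."""
    problems, seen = [], set()
    with open(path, encoding="utf-8") as f:
        for i, line in enumerate(f, start=1):
            row = json.loads(line)
            prompt = row.get("prompt", "").strip()
            response = row.get("response", "").strip()
            if not prompt or not response:
                problems.append(f"line {i}: empty prompt or response")
            if prompt in seen:
                problems.append(f"line {i}: duplicate prompt")
            seen.add(prompt)
    return problems

print(check_dataset("customer_support.jsonl"))
```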

Tools and platforms that support fine-tuning

While less complex than foundational LLM training, fine-tuning can still be quite challenging. The good news is some tools and platforms can make your work easier. Here are some popular options:

  • Hugging Face: Open-source solution with offerings like Hugging Face Hub (lets users discover, share, and collaborate on models) and Transformers Library (a Python library that allows users to load LLMs like GPT with a few lines of code).
  • OpenAI Fine-Tuning: Offers a clean API for fine-tuning GPT-3.5 and GPT-4 models (a short sketch follows this list).
  • Amazon Bedrock: Enterprise-level solution that offers support for models like Amazon Titan, Meta’s Llama, and Cohere models.
  • Google Cloud Vertex AI: Machine learning (ML) platform with enterprise-grade customization capabilities for models like Gemini and PaLM.
  • Weights & Biases (W&B): Model management and experiment tracking tool that lets you visualize model performance metrics and compare experiments.
  • Rasa Pro: While not exactly a fine-tuning platform, Rasa Pro offers Conversational AI with Language Models (CALM), a solution that integrates custom models and rule-based flows to help teams train assistants on domain-specific intents and responses. Ideal when you want to take advantage of the power of LLMs in conversational AI.
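
As an illustration of the hosted route, here is roughly what starting a job with OpenAI’s fine-tuning API looks like. It assumes the openai Python SDK (v1+), an API key in the environment, and a chat-formatted JSONL file; the model name is only an example, since the fine-tunable lineup changes over time.

```python
from openai import OpenAI

client = OpenAI()  # expects OPENAI_API_KEY in the environment

# OpenAI's fine-tuning endpoint expects chat-formatted JSONL, one
# {"messages": [...]} object per line, so convert the prompt/response file first.
training_file = client.files.create(
    file=open("customer_support_chat.jsonl", "rb"),
    purpose="fine-tune",
)

# The model name is illustrative; check OpenAI's docs for currently fine-tunable models.
job = client.fine_tuning.jobs.create(
    training_file=training_file.id,
    model="gpt-3.5-turbo",
)
print(job.id, job.status)
```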

Bring smarter conversations to life with Rasa

Despite being “functional” out of the box, pretrained LLMs need fine-tuning. The process aligns them with specific tasks, tones, and business needs, facilitating more efficient and relevant model outputs.

Fine-tuning can be even more effective when paired with a strong dialogue and user flow management framework—something Rasa helps with.

Rasa can help you leverage the true potential of LLMs by integrating them with business logic. Through CALM, you can build AI assistants that align with organizational goals and keep track of business rules and conversation states to power seamless interactions.

Want to elevate your LLM’s capabilities? Reach out today to learn how Rasa can help.