When should you fine-tune your LLM? (in 2024)

OpenAI released ChatGPT in November 2022, and the chatbot revolutionized artificial intelligence (AI) and launched it into the mainstream. Since then, everybody has been talking about AI and how it will impact the world going forward. ChatGPT was originally built on the powerful GPT-3.5 transformer model, a generative artificial intelligence (GenAI) model in the class of Large Language Models (LLMs). An LLM is a type of AI model that is trained to understand and generate human language in a way that is coherent and contextually relevant. These models are built using deep learning techniques, and they are changing how we build and maintain AI-powered products.

In this article we will go more deeply into what LLMs are, why they are important, and what you need to know to be able to use them effectively in your organization.

What are LLMs?

Large Language Models fall into a broader category of machine learning (ML) models known as Foundation Models (FMs). These are large-scale ML models, often with billions of parameters, that are pre-trained on very large amounts of data. The idea is that the amount of data they are trained on is so large that, by using unsupervised learning, they can gain a deep understanding of the data’s underlying patterns. LLMs are based on the transformer architecture and are examples of GenAI.

When an FM is trained on language, for instance on the whole of English Wikipedia (part of the training data of the famous BERT model), you get an LLM. Such a model will have a deep understanding of the English language and will be able to perform all kinds of Natural Language Processing (NLP) tasks.

Developing LLMs requires a lot of computing power. Image: photo of a data center by imgix on Unsplash

Advantages of pre-trained LLMs

You can imagine that developing an LLM is easier said than done. One does not simply “apply” an FM to a giant database of text. From a technical standpoint, LLMs are giant data processing functions that need to be carefully trained to yield a usable model. The textual data first has to be translated into numerical data through tokenization and vocabulary mapping, and training LLMs requires state-of-the-art hardware and months of development time (not to mention huge sums of money).
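
To make the tokenization and vocabulary mapping step concrete, here is a minimal sketch in Python. It assumes a naive whitespace tokenizer and a toy two-sentence corpus, both invented for illustration; production LLMs typically rely on subword tokenizers and far larger vocabularies.

```python
# Toy whitespace tokenizer; real LLMs typically use subword tokenizers.
def build_vocab(corpus):
    """Assign an integer id to every distinct lowercase word in the corpus."""
    vocab = {"<unk>": 0}  # id 0 is reserved for words not seen in the corpus
    for sentence in corpus:
        for word in sentence.lower().split():
            if word not in vocab:
                vocab[word] = len(vocab)
    return vocab


def encode(text, vocab):
    """Map text to a list of token ids, falling back to <unk> for unknown words."""
    return [vocab.get(word, vocab["<unk>"]) for word in text.lower().split()]


corpus = ["the model reads text", "the model writes text"]  # hypothetical corpus
vocab = build_vocab(corpus)
ids = encode("the model reads poetry", vocab)
print(ids)  # "poetry" is not in the vocabulary, so it maps to the <unk> id
```

The numbers themselves carry no meaning; they are simply indices that the model later turns into embedding vectors.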

Fortunately, there are several organizations such as OpenAI, Meta, and Google, that are very interested in developing LLMs. With their deep pockets and dedicated research labs, they can fund such efforts and train impressive LLMs. Some of these models, like Meta’s LLaMA or the Technology Innovation Institute’s Falcon, have been released to the public as open-source, pre-trained LLMs.

So why are pre-trained LLMs so useful? Essentially, it’s because they can perform certain language-related tasks at a level of complexity that rivals human ability. You can save labor costs by deploying them as virtual assistants or by using them for content generation, text completion, and translation, and you can apply them in search engines to increase the likelihood that a query yields just what you were looking for. The advantages of pre-trained LLMs can be summarized as follows:

Versatility: LLMs can generate, summarize, rewrite, extract, search, cluster, and classify text. This means a single pre-trained LLM can be used for a wide range of NLP tasks without extensive retraining.
Contextual Understanding: LLMs capture the contextual relationships within text, which allows them to generate more relevant and contextually appropriate responses.
Language Generation: LLMs are important and valuable tools for things like marketing, creative writing, and content creation due to their ability to generate text in a human-like manner. This ability can free up human resources, which in turn can save time and effort.
Off-the-shelf LLMs: Training an LLM of this size requires not only expert knowledge but also a vast amount of computational resources. Using pre-trained models can significantly reduce the resources and time required to deploy powerful LLM-based applications. The saved resources can then be used to adapt the model to the desired application.

Using LLMs

To use LLMs yourself you first have to decide how you want to use them. There are multiple ways to incorporate LLMs into your business operations, ranging from out-of-the-box use to highly customized use. Let’s say you want to build a chatbot for your business using an LLM – you might want to ask yourself:

Does my chatbot need to query information that is not publicly available?
What kinds of prompts will be given to my chatbot?
Do I want my chatbot to respond in a particular tone of voice?

Asking questions like these will help you to ascertain how much an off-the-shelf LLM will need to be customized to match your needs.

Finding the right pre-trained model

On platforms like HuggingFace there are many different LLMs that you could consider using. To find out which model is best for your case, some research on the specifics of different models is required. When choosing a model, there are a few factors to take into consideration:

Task and Use Case: Determine the specific task you want the LLM to perform. Different LLMs have varying capabilities and strengths. Some may be better for generating content, while others might be good at providing accurate information. This partially depends on the dataset that the model was trained on.
Model Size and Complexity: LLMs come in different sizes, with larger models generally having more capacity to understand context and generate coherent responses. However, larger models also require more computational resources, which can impact inference speed and cost.
Languages Supported: Make sure the LLM supports the natural or programming language you need for your use case.
Accuracy and Quality of Responses: Consider the quality of responses the LLM generates. Some models might produce more coherent and accurate responses than others. LLMs with a higher number of parameters tend to provide more accurate responses, but are more compute intensive to run.
Availability and Accessibility: Some LLMs might have restrictions on access, while others might be open source.
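
As a rough illustration of how model size translates into hardware needs, the sketch below estimates the memory required just to hold a model’s weights. The parameter counts are hypothetical round numbers, and the estimate deliberately ignores activations and other runtime overhead, so treat it as a lower bound rather than a sizing guide.

```python
def weight_memory_gb(num_parameters, bytes_per_parameter):
    """Memory needed to hold the weights alone, in gigabytes (10**9 bytes)."""
    return num_parameters * bytes_per_parameter / 1e9


# Hypothetical model sizes, chosen only to show how the estimate scales.
for name, params in [("7B model", 7e9), ("70B model", 70e9)]:
    fp32 = weight_memory_gb(params, 4)  # 32-bit floats take 4 bytes each
    fp16 = weight_memory_gb(params, 2)  # 16-bit floats take 2 bytes each
    print(f"{name}: {fp32:.0f} GB in 32-bit, {fp16:.0f} GB in 16-bit")
```

Comparing this estimate with the memory of the GPUs you can access is a quick first filter when shortlisting models.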

When choosing which LLM is best it’s also necessary to determine if any customization is required. Will you use the model as-is, or will you need to fine-tune it? With that said, customizability of the LLM is another factor to take into consideration.

The difference between fine-tuning and prompt engineering

In most cases you want to customize an FM for a desired task. This can be achieved in multiple ways. You can choose to either guide the model to a desired output (i.e. prompt engineering) or change its behavior by changing the FM’s parameters (i.e. fine-tuning).

Let’s say you want to customize an LLM for a chatbot, for example. The FM is able to simulate human-like conversations straight from the get go, but it might not be able to produce the relevant and coherent response you are looking for. In this case you want to improve the output of the model so that it gives a more detailed and effective answer. For guiding and shaping the LLM’s output you can use prompt engineering.

Sometimes guiding and shaping the output of the LLM is not enough to produce the output that you want. FMs are often trained on a wide array of text which can cause them to hallucinate when asked a domain-specific question. In situations like this, where more domain-specific knowledge is required (e.g. LLMs for medical applications), you need to change the behavior of the model. This can be achieved by providing the model with a smaller, more domain-specific dataset that you can use to alter the parameters of the LLM. This is called fine-tuning.

So in short, with prompt engineering, the input (i.e. prompt) of a model is altered to produce a more detailed and higher-quality output. With fine-tuning, on the other hand, the parameters of a model are modified, which can help you enhance your LLM’s performance on domain-specific tasks.

Prompt engineering

Prompt engineering refers to the design and formulation of effective prompts or instructions that lead an LLM to generate the desired output or complete a task accurately. It considers factors such as context, format, and wording to elicit the best responses from the model. Well-designed prompts can significantly improve the quality and relevance of generated outputs, making interaction with a language model more productive and useful.

Prompt engineering can be applied through several techniques. Few-shot prompting, for instance, is a technique where a handful of example inputs and outputs are given in the prompt, which the model can use to provide a better response. Some of the newer LLMs, like GPT-4, also offer an extended context length, which leaves more room in the prompt for Retrieval Augmented Generation (RAG). With RAG, extra context is provided to the LLM in the prompt. This extra context can be provided in multiple ways:

One way is to provide the LLM with a vector database; this is useful when you want to use an LLM for your own internal documentation, for example.
You can also combine your LLM with an information retrieval component. When an input is provided, the retrieval component will retrieve relevant or supporting documents that are concatenated as context with the input prompt and then fed to the LLM.

RAG also has the added benefit of making the LLM adaptive to situations where facts change over time.
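
The retrieve-then-concatenate flow described above can be sketched in a few lines of Python. This is a minimal sketch under loud assumptions: the documents are invented, and a simple word-count vector stands in for the learned embeddings and vector database a real RAG setup would use. The final prompt would then be sent to whichever LLM you have chosen.

```python
import math
from collections import Counter


def embed(text):
    """Stand-in embedding: word counts (a real system would use a learned embedding model)."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[word] * b[word] for word in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(question, documents, top_k=1):
    """Return the top_k documents most similar to the question."""
    query = embed(question)
    ranked = sorted(documents, key=lambda doc: cosine(query, embed(doc)), reverse=True)
    return ranked[:top_k]


def build_rag_prompt(question, documents, top_k=1):
    """Concatenate the retrieved documents as context ahead of the question."""
    context = "\n".join(retrieve(question, documents, top_k))
    return "\n".join([
        "Answer using only the context below.",
        "",
        "Context:",
        context,
        "",
        f"Question: {question}",
        "Answer:",
    ])


documents = [  # hypothetical internal documentation
    "Refunds are processed within 14 days of the return request.",
    "Our office is closed on public holidays.",
    "Support tickets are answered within one business day.",
]
prompt = build_rag_prompt("How long do refunds take?", documents)
print(prompt)
```

Because the facts live in the document store rather than in the model’s parameters, updating a document changes the answers without any retraining.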

Prompt engineering is a cost-effective way of getting your LLM to produce relevant and accurate outputs.
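
Few-shot prompting can be as simple as placing a few worked examples ahead of the actual input. The sketch below assembles such a prompt for a sentiment task; the instruction, the examples, and the `Review:`/`Sentiment:` layout are all assumptions made for illustration, not a required format.

```python
def build_few_shot_prompt(instruction, examples, query):
    """Assemble an instruction, worked examples, and the new input into one prompt."""
    lines = [instruction, ""]
    for text, label in examples:
        lines += [f"Review: {text}", f"Sentiment: {label}", ""]
    # The prompt ends on an open label so the model completes it.
    lines += [f"Review: {query}", "Sentiment:"]
    return "\n".join(lines)


examples = [  # hypothetical labeled examples
    ("The delivery was fast and the product works great.", "positive"),
    ("The app crashes every time I open it.", "negative"),
]
prompt = build_few_shot_prompt(
    "Classify the sentiment of each review as positive or negative.",
    examples,
    "Customer support never answered my emails.",
)
print(prompt)
```

Swapping the examples is all it takes to steer the same model toward a different tone or output format, which is why this is usually the first technique to try.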

Fine-tuning LLMs

The general understanding of language that pre-trained LLMs possess means that the majority of the development work is already done for you! However, to get a specialized model that is optimized to perform a specific task, you need to adapt your LLM to that task through a process called fine-tuning. This process requires some expertise, as well as state-of-the-art hardware, but much less so than if you were to develop a model from scratch.

Fine-tuning LLMs refers to taking a pre-trained LLM and tuning it using a dataset that is much smaller but more specific to a task. In this process, the general knowledge gained by the LLM during pre-training serves as the foundation for its ability to solve your specific task. Fine-tuning requires you to prepare a dataset for your specific use case. There are common architectural features of pre-trained LLMs that enable different types of fine-tuning, so let’s start with a brief overview of LLM pre-training.

Pre-training LLMs

In this phase, we start with a large neural network architecture like a transformer-based model for language tasks. This network is then trained on a massive and diverse dataset through unsupervised learning. During this phase, the network learns to recognize general features and patterns present in the data. In the case of text, it learns about syntax, grammar, and semantics. We now have ourselves a pre-trained LLM.
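
One common form of this kind of training, used for GPT-style models, is next-token prediction: the labels come from the text itself, so no human annotation is needed. The sketch below shows how a raw sentence could be turned into training pairs. The helper name and the context size are assumptions made for illustration, and other objectives (such as the masked-word prediction used by BERT) exist as well.

```python
def next_token_pairs(tokens, context_size):
    """Turn a raw token sequence into (context, target) training pairs."""
    pairs = []
    for i in range(1, len(tokens)):
        # The context is the window of tokens just before position i.
        context = tokens[max(0, i - context_size):i]
        pairs.append((context, tokens[i]))
    return pairs


tokens = "the cat sat on the mat".split()
pairs = next_token_pairs(tokens, context_size=3)
for context, target in pairs:
    print(context, "->", target)
```

During pre-training the network sees an enormous number of such pairs and adjusts its weights to make each target more likely given its context.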

Pre-trained models can be found on platforms such as HuggingFace. Unless you have a really good reason to invest a lot of time and resources into getting the most custom and specialized LLM possible, which also requires tons of data (over 1,000 GB), you’re better off relying on one of the available pre-trained models.

After pre-training, the model has learned to represent data in a way that captures meaningful information. This knowledge is stored in the weights and parameters of the neural network’s layers.

General architecture of an LLM, GeeksforGeeks

Transfer learning

Once the network is pre-trained, it can be adapted to a more specific task with a smaller dataset. This new task might have a different domain or specific requirements – consider the use of medical literature to adapt an LLM to medical use cases.

Let’s simplify an LLM into two parts:

A pre-trained transformer: composed of many layers, including embedding, positional encoding, and attention layers. This part of the LLM deals with processing the meaning and order of words, as well as inferring context.
A classifier or output layer: which processes the transformer’s output to generate a response.

With fine-tuning you typically freeze the pre-trained transformer part of the model and fine-tune the classifier or output part. There are several approaches, explained and visualized below, which differ in how much of the model is updated:

Feature-based approach: where you generate output embeddings and use them to train the classifier model.
Update the output layers: where only the output layers of the models are fine-tuned, and the pre-trained transformer part is kept frozen.
Update all layers: where the entire model is unfrozen and updated.

Image inspired by: finetuning-large-language-models

Going from left to right in the figure above, the costs and performance gains of fine-tuning increase. There are ways of reducing the costs: for example, you could limit the number of parameters to update by applying a Parameter-Efficient Fine-Tuning (PEFT) technique, like Low-Rank Adaptation (LoRA).
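
To give a feel for the savings, the sketch below compares the number of values updated for a single weight matrix under full fine-tuning and under Low-Rank Adaptation, where the update is expressed as the product of two much smaller matrices. The layer dimensions and the rank are hypothetical values picked for illustration.

```python
def full_update_params(d_out, d_in):
    """Values updated when the whole d_out x d_in weight matrix is fine-tuned."""
    return d_out * d_in


def lora_update_params(d_out, d_in, rank):
    """Values in the two low-rank factors: B (d_out x rank) and A (rank x d_in)."""
    return rank * (d_out + d_in)


d_out, d_in, rank = 4096, 4096, 8  # hypothetical layer size and rank
full = full_update_params(d_out, d_in)
lora = lora_update_params(d_out, d_in, rank)
print(f"full: {full:,}  low-rank: {lora:,}  share: {lora / full:.2%}")
```

The original weights stay frozen, so only the small factors need gradients and optimizer state, which is where most of the cost reduction comes from.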

The knowledge acquired by the neural network during the pre-training phase gives it a strong foundation in understanding features and patterns of the data. During fine-tuning, this knowledge is transferred to a new task, hence the term transfer learning. The network, i.e. the LLM, can quickly adapt to the new task by adjusting its parameters, building on the information it learned during pre-training.
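
The idea of keeping the pre-trained part frozen and training only an output layer can be sketched without any deep learning library. In the toy example below, a hand-written feature function stands in for the frozen transformer (a loud simplification: a real model would return a learned embedding), and a small logistic regression head is the only part that gets trained. The dataset and word lists are invented for illustration.

```python
import math


def frozen_features(text):
    """Stand-in for the frozen pre-trained transformer; it is never updated."""
    words = text.lower().split()
    return [
        sum(w in {"great", "good", "love"} for w in words),  # count of positive cue words
        sum(w in {"bad", "broken", "hate"} for w in words),  # count of negative cue words
    ]


def predict(weights, bias, features):
    """Output layer: logistic regression on top of the frozen features."""
    z = bias + sum(w * x for w, x in zip(weights, features))
    return 1.0 / (1.0 + math.exp(-z))


def train_head(dataset, epochs=200, lr=0.5):
    """Update only the head's weights and bias with stochastic gradient descent."""
    weights, bias = [0.0, 0.0], 0.0
    for _ in range(epochs):
        for text, label in dataset:
            features = frozen_features(text)
            error = predict(weights, bias, features) - label  # gradient of the log loss w.r.t. z
            weights = [w - lr * error * x for w, x in zip(weights, features)]
            bias -= lr * error
    return weights, bias


dataset = [  # hypothetical task-specific examples (1 = positive, 0 = negative)
    ("great product love it", 1),
    ("good value", 1),
    ("bad and broken", 0),
    ("hate this broken thing", 0),
]
weights, bias = train_head(dataset)
score = predict(weights, bias, frozen_features("good product"))
print(f"P(positive) = {score:.2f}")
```

Only three numbers are learned here, which mirrors why updating just the output layers is the cheapest of the approaches listed above.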

Fine-tuning LLMs gives a nice balance between model strength (i.e. how good it is at a specific task) and development time.

Prompt engineering vs fine-tuning

While both concepts provide an improvement over using a pre-trained model out-of-the-box, the focus of the two concepts, and thus the outcome of the process, differs. The key differences between prompt engineering and fine-tuning are shown in the table below:

Prompt Engineering	Fine-tuning
Focuses on producing better output	Focuses on enhancing model performance on specific tasks
Improves output by providing more detailed and effective inputs	Improves knowledge in specific area by training on domain-specific data
More precise control over a model’s actions and outputs	Adds depth and detail to relevant topic area
No training compute required, but requires effort to engineer the right prompts	Requires significant compute resources and availability of sufficient task-related data

Prompt tuning

New techniques are emerging to adapt an LLM to a specific task. One of those techniques, called prompt tuning, combines elements of both fine-tuning and prompt engineering to improve a model’s performance with soft prompts. With prompt tuning, the user’s prompt is first processed by a smaller, lightweight, tunable model (the P-tuned model in the image below), which then provides the LLM with both the user’s prompt and additional soft prompts. Soft prompts are additional numerical prompts that are generated by AI, as opposed to hard prompts, which are written by humans (e.g. via prompt engineering). A disadvantage of using soft prompts, however, is that AI-generated prompts are not readable by humans, which reduces the explainability of a prompt-tuned model.
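
As a simplified picture of what a soft prompt is, the sketch below prepends a couple of numerical vectors to the embedded user prompt before it would reach the LLM. It skips the smaller tunable model described above and stores the soft-prompt vectors directly, which is an assumption made to keep the example short. All names, dimensions, and values are hypothetical, and no training is performed.

```python
import random

EMBEDDING_DIM = 4  # hypothetical; real models use far larger embeddings


def embed_tokens(tokens, table):
    """Look up the frozen embedding vector of each token in the user's hard prompt."""
    return [table[token] for token in tokens]


def with_soft_prompt(soft_prompt, token_embeddings):
    """Prepend the tunable soft-prompt vectors to the embedded user prompt."""
    return soft_prompt + token_embeddings


rng = random.Random(0)
words = ["summarize", "this", "report"]
table = {word: [rng.uniform(-1, 1) for _ in range(EMBEDDING_DIM)] for word in words}

# Soft prompt: vectors that correspond to no word in the vocabulary. During prompt
# tuning only these values would be updated; the LLM and `table` would stay frozen.
soft_prompt = [[rng.uniform(-1, 1) for _ in range(EMBEDDING_DIM)] for _ in range(2)]

model_input = with_soft_prompt(soft_prompt, embed_tokens(words, table))
print(len(model_input), "vectors of dimension", len(model_input[0]))
```

Printing `soft_prompt` shows only lists of floats, which is the readability problem mentioned above: there is no text a human could inspect.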

Source: NVIDIA

Challenges of fine-tuning

Keep in mind that fine-tuning a model is quite challenging, and in most cases not necessary. To fine-tune a model, you need to prepare a dataset that contains enough raw data for your model to be able to recognize patterns. In addition to this, fine-tuning can be rather compute intensive, which drives up its costs. In most situations a prompt-engineering technique, like few-shot prompting or RAG, will provide you with the performance you desire. Furthermore, new techniques like prompt tuning are emerging, which can provide performance increases similar to those of fine-tuning while using fewer resources.

If one of the techniques mentioned above still isn’t giving you your desired performance, you can, of course, still opt for fine-tuning. However, in general, LLM fine-tuning is not advised when:

You don’t have enough available domain-specific data
The data changes frequently (e.g. news related data)
The application is dynamic and context-sensitive (e.g. working with user data)
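
The three warning signs above can be written down as a small checklist function. This is only a sketch of the rule of thumb in this article; the function name and its boolean inputs are hypothetical, and a real decision would weigh costs and requirements in more detail.

```python
def fine_tuning_advised(enough_domain_data, data_changes_frequently, dynamic_and_context_sensitive):
    """Return (advised, reasons) based on the three warning signs listed above."""
    reasons = []
    if not enough_domain_data:
        reasons.append("not enough domain-specific data")
    if data_changes_frequently:
        reasons.append("data changes frequently")
    if dynamic_and_context_sensitive:
        reasons.append("application is dynamic and context-sensitive")
    # Fine-tuning is only advised when none of the warning signs apply.
    return (not reasons, reasons)


advised, reasons = fine_tuning_advised(
    enough_domain_data=True,
    data_changes_frequently=True,  # e.g. a news-related use case
    dynamic_and_context_sensitive=False,
)
print(advised, reasons)
```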

Conclusion

Large Language Models have emerged as a powerful tool for natural language processing tasks. While pre-trained models provide a strong foundation, customization is often required to optimize performance for specific use cases. There are several techniques available for adapting LLMs:

Prompt engineering allows guiding the model’s outputs by crafting effective prompts without needing to retrain the model. Techniques like few-shot prompting and RAG can provide extra context to produce higher quality and more relevant responses.
Fine-tuning goes a step further by updating the model’s parameters on a domain-specific dataset. This enhances model performance for specialized tasks, but requires more resources. Prompt tuning combines both soft prompts and fine-tuning for an optimal balance.

In summary, pre-trained LLMs provide a strong starting point which can then be adapted through various techniques like prompt engineering and fine-tuning. The right approach depends on factors like task requirements, data availability, and available resources. The UbiOps platform provides an intuitive way to leverage the power of LLMs for your products and applications by:

Simplifying the deployment and management of LLMs;
Making models accessible by exposing them with their own Application Programming Interface (API) endpoint;
Providing access to state-of-the-art CPUs and GPUs;
Empowering models with rapid adaptive auto-scaling.

With UbiOps, you can deploy and manage your LLM, easily integrating it into products and applications. Let us know if you have an exciting use case that you’d like to adapt an LLM for – we’d be happy to help out.
