Falcon LLM fine-tuning

In the good old days, machine learning models were built from scratch by data scientists. This involved acquiring and cleaning data before training a model and getting it to production.

In recent years, though, models have grown in size, and so has the amount of training data they require. This is one of the reasons why foundation models (e.g. computer vision models and Large Language Models (LLMs)) have become popular: these models are pre-trained and come ready off the shelf.

In general, these models are trained on very large datasets, which makes them flexible and usable for many different kinds of tasks, but there are also downsides. Because the training data is so broad, foundation models like LLMs sometimes lack domain knowledge for more specific use cases. Pre-trained foundation models therefore need to be adapted to downstream tasks.

Prompt engineering vs fine-tuning

This adaptation can be achieved using several techniques, the two most common being prompt engineering and fine-tuning. With prompt engineering, the prompt (i.e. the user’s input) is complemented with extra context. This method aims to “guide” the model to its answer. An example would be providing the model with several translations from French to English before asking it for the translation you actually want. A more complicated example of prompt engineering is Retrieval Augmented Generation (RAG), where an information retrieval component is connected to the pre-trained LLM. This component searches for extra context that the LLM can use to generate a response to the prompt.
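To make the translation example concrete, here is a minimal sketch of how such a few-shot prompt could be assembled. The helper name `build_few_shot_prompt`, the instruction line and the example sentences are all made up for illustration; no specific model or API is assumed.

```python
# Hypothetical helper: builds a few-shot prompt from example translations.
def build_few_shot_prompt(examples, query):
    """Return a prompt that shows worked examples before the real request."""
    lines = ["Translate French to English."]
    for french, english in examples:
        lines.append(f"French: {french}")
        lines.append(f"English: {english}")
    # The final block is left open so the model completes the translation.
    lines.append(f"French: {query}")
    lines.append("English:")
    return "\n".join(lines)


# Illustrative example pairs; in practice you would pick pairs from your own domain.
examples = [("Bonjour", "Hello"), ("Merci beaucoup", "Thank you very much")]
prompt = build_few_shot_prompt(examples, "Bonne nuit")
print(prompt)
```

The resulting string is what would be sent to the LLM instead of the bare question. The model’s weights stay untouched; only the input changes.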

In some cases, prompt engineering alone doesn’t give the LLM the desired accuracy. In these cases, fine-tuning is usually the better option. Fine-tuning can also be applied when you want to adapt or narrow down a model for a different or more specific task than it was originally developed for. With fine-tuning, the “behavior” of the model itself is changed by re-training some or all of the model’s weights. You can read more about when to fine-tune an LLM in this article.

In this guide

Fine-tuning a model can be tricky, but it’s definitely doable when using the right platforms. In this article I’ll show you how you can fine-tune Falcon 1B using UbiOps. You can also use this guide for fine-tuning a different version of Falcon simply by changing the “model_id” variable in the training and deployment code. 

UbiOps is a platform designed for training, hosting & serving your AI models. We’ve already released a tutorial on how to deploy an LLM (LLaMa 2) with a customizable front-end; now it’s time to show you how you can easily fine-tune an LLM on UbiOps too. For this guide, we’ll mostly be working in an IDE, creating files and making changes in our project with the UbiOps Client Library.

We’ll make use of UbiOps’ training functionality to fine-tune the model on English quotes. In short, this training interface is made up of training experiments and training runs. The experiments define the training setup, while the actual training code executions take place inside these experiments as training runs. The relationship is similar to that between a deployment and a deployment version.

The model we’ll be fine-tuning is the Falcon 1B model from Hugging Face.

The requirements for executing this guide are:

      • Python 3.10

    Fine-tuning a model on UbiOps can be done in four steps. We’ll need to: 

        • establish a connection with our UbiOps environment

        • create the coding environment for our training run

        • create a training experiment

        • initialize a training run

      Connect to your UbiOps environment

      Let’s start by establishing the connection to your UbiOps environment. You can do this by creating an API token; make sure that the token has project editor rights. Then insert the API token into your code as shown in the code block below:

      !pip install --upgrade ubiops

      import ubiops

      API_TOKEN = '<API TOKEN>' # Make sure this is in the format "Token token-code"

      PROJECT_NAME = '<PROJECT_NAME>'    # Fill in your project name here

      configuration = ubiops.Configuration()

      configuration.api_key['Authorization'] = API_TOKEN

      api_client = ubiops.ApiClient(configuration)

      api = ubiops.api.CoreApi(api_client)
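The comment above asks for the token in the form "Token token-code". As a small, hedged sketch, a helper like the one below could guard against pasting only the bare code. The name `normalize_token` is hypothetical and not part of the UbiOps client library.

```python
def normalize_token(raw_token: str) -> str:
    """Return the token in the 'Token <code>' form described above."""
    token = raw_token.strip()
    if not token:
        raise ValueError("API token is empty")
    # Leave tokens that already carry the prefix untouched.
    if token.startswith("Token "):
        return token
    return f"Token {token}"


# Both spellings end up in the same format.
print(normalize_token("abc123"))
print(normalize_token("  Token abc123  "))
```

You could then write `API_TOKEN = normalize_token('<API TOKEN>')` before building the configuration.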


      Then we need to create two environment variables in order for us to access the default data storage bucket for the project. We’ll store the results of the model in this bucket.


      # Create an environment variable for the API token (marked as secret so its value stays hidden).
      # The names are lowercase because the training code reads os.environ['api_token'] and os.environ['project_name'].
      # Note: the experiment 'falcon-fine-tuning' (created further below) must exist before these calls can succeed.
      envvar_token = ubiops.EnvironmentVariableCreate(name='api_token', value=API_TOKEN, secret=True)
      api.deployment_version_environment_variables_create(PROJECT_NAME, deployment_name='training-base-deployment', version='falcon-fine-tuning', data=envvar_token)

      # Create an environment variable for the project name
      envvar_projname = ubiops.EnvironmentVariableCreate(name='project_name', value=PROJECT_NAME, secret=False)
      api.deployment_version_environment_variables_create(PROJECT_NAME, deployment_name='training-base-deployment', version='falcon-fine-tuning', data=envvar_projname)
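Environment variable names are case-sensitive on Linux, so the name used when creating a variable has to match the lookup in the training code exactly (the training code further down looks up `api_token` and `project_name`). A minimal sketch of a lookup that fails with a readable message instead of a bare `KeyError`; `read_required_env` is a hypothetical helper, and the values set here only simulate a local dry run:

```python
import os


def read_required_env(name: str) -> str:
    """Return the value of an environment variable or raise a descriptive error."""
    value = os.environ.get(name)
    if value is None:
        raise RuntimeError(f"Environment variable '{name}' is not set on the experiment")
    return value


# Simulated values for a local dry run; on UbiOps these come from the experiment.
os.environ.setdefault("api_token", "Token example")
os.environ.setdefault("project_name", "example-project")

settings = {name: read_required_env(name) for name in ("api_token", "project_name")}
print(sorted(settings))
```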

      Create code environment

      Then we can start creating our coding environment. We can do this by selecting a base environment and then adding additional dependencies.

      We define these dependencies in a `ubiops.yaml` (for packages that need to be installed on OS-level) and a `requirements.txt` (for packages that will be used inside the deployment). These files will be added to a directory, which we’ll zip and upload to UbiOps later on.

      The base environment in this case will be: “ubuntu22-04-python3-10-cuda11-7-1”


      Create the `requirements.txt` file:

      !mkdir fine-tuning-environment-files

      %%writefile fine-tuning-environment-files/requirements.txt
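A minimal sketch of what `requirements.txt` could contain, written from Python instead of the `%%writefile` magic. The package list is an assumption: it is inferred from the imports in `prepare.py` and `train.py`, plus `bitsandbytes` and `accelerate`, which `transformers` needs for 4-bit loading. No versions are pinned because the versions used for this guide are not known here.

```python
from pathlib import Path

# Assumption: package list inferred from the imports used in prepare.py and train.py.
packages = [
    "ubiops",
    "requests",
    "joblib",
    "torch",
    "transformers",
    "peft",
    "bitsandbytes",  # backend for 4-bit quantization
    "accelerate",    # required by transformers when loading quantized models
]

env_dir = Path("fine-tuning-environment-files")
env_dir.mkdir(exist_ok=True)

requirements_path = env_dir / "requirements.txt"
requirements_path.write_text("\n".join(packages) + "\n")
print(requirements_path.read_text())
```

For a reproducible environment you would normally pin each package to a version you have tested together.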









      Create the `ubiops.yaml` file:

      %%writefile fine-tuning-environment-files/ubiops.yaml
      apt:
        packages:
          - git


      Now that our directory is complete, we can create our environment and upload the package to UbiOps: 

      import shutil

      zip_name = "fine-tuning-environment-files"
      ENVIRONMENT_NAME = "fine-tuning-falcon1b"

      # Zip the directory that holds requirements.txt and ubiops.yaml
      shutil.make_archive(zip_name, "zip", "fine-tuning-environment-files")

      # Create the environment on top of the CUDA base environment
      data = ubiops.EnvironmentCreate(name=ENVIRONMENT_NAME, base_environment="ubuntu22-04-python3-10-cuda11-7-1")
      api_response = api.environments_create(PROJECT_NAME, data)

      # Upload the zipped dependency files as a new revision of the environment
      api_response = api.environment_revisions_file_upload(PROJECT_NAME, ENVIRONMENT_NAME, file=f"{zip_name}.zip")
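Before uploading, it can help to confirm that the dependency files sit at the root of the archive rather than inside a nested folder. The sketch below reproduces the zipping step in a temporary directory with placeholder file contents and lists what ends up in the archive. It only uses the standard library and does not talk to UbiOps.

```python
import shutil
import tempfile
import zipfile
from pathlib import Path

with tempfile.TemporaryDirectory() as tmp:
    env_dir = Path(tmp) / "fine-tuning-environment-files"
    env_dir.mkdir()
    # Placeholder contents; the real files hold your dependencies.
    (env_dir / "requirements.txt").write_text("ubiops\n")
    (env_dir / "ubiops.yaml").write_text("# OS-level packages go here\n")

    # Same call shape as in the guide: archive name, format, directory to zip.
    archive_path = shutil.make_archive(
        str(Path(tmp) / "fine-tuning-environment-files"), "zip", str(env_dir)
    )

    with zipfile.ZipFile(archive_path) as archive:
        archive_names = sorted(archive.namelist())

print(archive_names)
```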


      Creating training experiment in UbiOps

      As mentioned in the introduction, the training functionality of UbiOps is built on two concepts: experiments and training runs. The experiment defines the training setup. Here you can define things like:

          • the instance type (i.e. hardware) that will be used for your training runs

          • the coding environment

          • the bucket for outputted files (UbiOps offers storage as well)


        The training run is the actual code execution and runs inside the training experiment. For the training run you can configure things like:

            • training code

            • training data

            • any parameters that you would like to use in your training code

          Create training experiment:

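          from ubiops.training.training import Training

          training_instance = Training(api_client)

          # Create experiment
          EXPERIMENT_NAME = "falcon-fine-tuning"

          api_response = training_instance.experiments_create(
              project_name=PROJECT_NAME,
              data=ubiops.ExperimentCreate(
                  name=EXPERIMENT_NAME,
                  instance_type='16384mb_t4',  # assumption: a single-T4 GPU instance type; use the name available in your project
                  description='A finetuning experiment for Falcon',
                  environment=ENVIRONMENT_NAME,
                  default_bucket='default',
                  labels={"type": "pytorch", "model": "falconLLM"}
              )
          )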


          Create training runs

          We’ll initiate two training runs: the first downloads the model’s checkpoints and the dataset and uploads these files to a UbiOps bucket (prepare.py); the second downloads these files from the bucket and executes the actual training job (train.py).

          The result of the first run will be stored in the `default` bucket which we defined earlier. 
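The hand-over between the two runs relies on packing the checkpoint directory into a single tar archive and unpacking it again later. Here is a self-contained sketch of that round trip using only the standard library. The file `weights.bin` is a tiny stand-in for real model files, and the bucket upload and download in between are left out.

```python
import tarfile
import tempfile
from pathlib import Path

with tempfile.TemporaryDirectory() as tmp:
    work = Path(tmp)

    # Stand-in for the downloaded checkpoint directory.
    checkpoint = work / "checkpoint"
    checkpoint.mkdir()
    (checkpoint / "weights.bin").write_bytes(b"\x00" * 16)

    # First run: pack the directory into one archive that can be uploaded.
    archive = work / "checkpoint.tar"
    with tarfile.open(archive, "w") as tar:
        tar.add(checkpoint, arcname="checkpoint")

    # Second run: unpack the archive to get the directory back.
    restore_dir = work / "restored"
    restore_dir.mkdir()
    with tarfile.open(archive, "r") as tar:
        tar.extractall(restore_dir)

    restored_size = (restore_dir / "checkpoint" / "weights.bin").stat().st_size

print(restored_size)
```

Moving one archive instead of many small files keeps the upload and download steps to a single call each.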


          Create preparation run

          %%writefile prepare.py
          from transformers import AutoModelForCausalLM, AutoTokenizer
          import tarfile
          import ubiops
          import requests
          import os

          def train(training_data, parameters, context = {}):
              # Connect to UbiOps using the environment variables set on the experiment
              configuration = ubiops.Configuration(api_key={'Authorization': os.environ['api_token']})
              api_client = ubiops.ApiClient(configuration)
              api = ubiops.api.CoreApi(api_client)

              # Load model weights
              print("Load model weights")
              model_id = "tiiuae/falcon-rw-1b"  # assumption: Hugging Face ID for Falcon 1B; change it to use another Falcon version
              cache_dir = "checkpoint"
              model = AutoModelForCausalLM.from_pretrained(model_id, cache_dir=cache_dir)
              tokenizer = AutoTokenizer.from_pretrained(model_id, cache_dir=cache_dir)

              # Pack the downloaded checkpoint directory into a single tar archive
              with tarfile.open(f'{cache_dir}.tar', 'w') as tar:
                  tar.add(cache_dir)

              # Uploading weights
              print("Uploading weights")
              file_uri = ubiops.utils.upload_file(
                  client = api_client,
                  project_name = os.environ['project_name'],
                  file_path = f'{cache_dir}.tar',
                  bucket_name = "default",
                  file_name = f'{cache_dir}.tar'
              )

              # Load dataset
              print("Load dataset")
              ds = "quotes.jsonl"
              r = requests.get("https://huggingface.co/datasets/Abirate/english_quotes/resolve/main/quotes.jsonl")
              with open(ds, 'wb') as f:
                  f.write(r.content)

              # Uploading dataset
              file_uri = ubiops.utils.upload_file(
                  client = api_client,
                  project_name = os.environ['project_name'],
                  file_path = ds,
                  bucket_name = "default",
                  file_name = ds
              )


          Now we initialize a training run that will execute the code in the `prepare.py` above.

          As mentioned previously, this run will download the model’s weights and the dataset, which will be used in a subsequent training run that fine-tunes the Falcon model.


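          run = training_instance.experiment_runs_create(
              project_name=PROJECT_NAME,
              experiment_name=EXPERIMENT_NAME,
              data=ubiops.ExperimentRunCreate(
                  name='load-model',  # assumption: illustrative run name
                  description='Load model',
                  training_code='./prepare.py'
              )
          )

          # Wait for the prepare.py run to complete
          ubiops.utils.wait_for_experiment_run(
              client = api_client,
              project_name = PROJECT_NAME,
              experiment_name = EXPERIMENT_NAME,
              run_id = run.id
          )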

          Set up training run

          The goal of this training run is to show how to combine several LLM performance optimization techniques that together allow you to fine-tune an LLM on an instance with a single T4 GPU.

          Falcon is therefore not going to be fine-tuned for any specific benchmark; instead, it will learn to respond a bit more in the style of quotes from famous people.
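One of these techniques is the 4-bit quantization that `train.py` relies on. A back-of-the-envelope sketch of why it matters: the memory needed just to hold the weights scales with the number of bits per weight. The parameter count below is a round number chosen for illustration, not the exact size of Falcon 1B, and the estimate ignores activations, gradients, optimizer state and the small overhead of quantization constants.

```python
def weight_memory_gib(num_parameters: int, bits_per_weight: int) -> float:
    """Memory needed to store the weights alone, in GiB."""
    return num_parameters * bits_per_weight / 8 / 1024**3


NUM_PARAMETERS = 1_000_000_000  # assumption: round number for illustration

full_precision = weight_memory_gib(NUM_PARAMETERS, 32)
four_bit = weight_memory_gib(NUM_PARAMETERS, 4)

print(f"32-bit weights: {full_precision:.2f} GiB")
print(f"4-bit weights:  {four_bit:.2f} GiB")
print(f"Reduction:      {full_precision / four_bit:.0f}x")
```

The factor of eight follows directly from 32 / 4; how much total VRAM you save in practice depends on everything else that lives on the GPU during training.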


          %%writefile train.py

          import ubiops

          import os

          import tarfile

          import json

          import joblib

          from typing import List

          import torch

          import transformers

          from torch.utils.data import Dataset

          from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

          from peft import prepare_model_for_kbit_training, LoraConfig, get_peft_model

          from  ubiops import utils

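          class QuotesDataset(Dataset):
              """Wraps the list of quote records and tokenizes one quote per item."""

              def __init__(self, data: List[dict], tokenizer):
                  self.tokenizer = tokenizer
                  self.data = data

              def __len__(self) -> int:
                  return len(self.data)

              def __getitem__(self, idx: int) -> dict:
                  # Tokenize the "quote" field of the record at position idx
                  return self.tokenizer(self.data[idx]["quote"])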

          def train(training_data, parameters, context = {}):
              # Connect to UbiOps using the environment variables set on the experiment
              configuration = ubiops.Configuration(api_key={'Authorization': os.environ['api_token']})
              api_client = ubiops.ApiClient(configuration)
              api = ubiops.api.CoreApi(api_client)

              # The first step is to load the model's weights and the dataset of English quotes into this deployment
              for f in ["checkpoint.tar", "quotes.jsonl"]:
                  file_uri = ubiops.utils.download_file(
                      client = api_client,  # a UbiOps API client
                      file_name = f,
                      project_name = os.environ['project_name'],
                      output_path = ".",
                      bucket_name = "default"
                  )

              # Unpack the checkpoint archive that prepare.py uploaded
              with tarfile.open("checkpoint.tar", 'r') as tar:
                  tar.extractall("./")
              # This config represents the model in lower precision: every weight takes 4 bits instead of 32 bits, so the weights use roughly 8 times less VRAM.
              nf4_config = BitsAndBytesConfig(
                  load_in_4bit=True,
                  bnb_4bit_quant_type="nf4"
              )

              model_id = "tiiuae/falcon-rw-1b"  # assumption: Hugging Face ID for Falcon 1B; must match the model downloaded by prepare.py
              cache_dir = "checkpoint"

              # Loading the model's weights and allocating them according to the config
              model = AutoModelForCausalLM.from_pretrained(model_id, cache_dir=cache_dir, quantization_config=nf4_config)

              # Also enabling gradient checkpointing, a technique that saves memory by recomputing some activations during the backward pass instead of storing them.
              model.gradient_checkpointing_enable()
              model = prepare_model_for_kbit_training(model)

              # LoRA is another technique that saves memory, this time by reducing the number of trainable parameters. It also defines the task for our fine-tuning as CAUSAL_LM, which means the LLM will learn to predict the next word based on the previous words in a quote.
              config = LoraConfig(
                  r=16,  # assumption: illustrative rank
                  lora_alpha=32,  # assumption: illustrative scaling factor
                  target_modules=["query_key_value"],  # assumption: Falcon's fused attention projection layer
                  lora_dropout=0.05,  # assumption: illustrative dropout
                  bias="none",
                  task_type="CAUSAL_LM"
              )

              model = get_peft_model(model, config)
              tokenizer = AutoTokenizer.from_pretrained(model_id, cache_dir=cache_dir)
              tokenizer.pad_token = tokenizer.eos_token

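              # Read the dataset: quotes.jsonl holds one JSON object per line
              lines = list()
              with open("quotes.jsonl", 'r') as f:
                  for line in f:
                      lines.append(json.loads(line))

              dataset = QuotesDataset(lines, tokenizer)

              # Run trainer from the transformers library.
              trainer = transformers.Trainer(
                  model=model,
                  train_dataset=dataset,
                  args=transformers.TrainingArguments(
                      # assumption: illustrative hyperparameters for a short demo run on one T4
                      per_device_train_batch_size=1,
                      gradient_accumulation_steps=4,
                      max_steps=10,
                      learning_rate=2e-4,
                      fp16=True,
                      logging_steps=1,
                      output_dir="outputs"
                  ),
                  data_collator=transformers.DataCollatorForLanguageModeling(tokenizer, mlm=False),
              )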

              model.config.use_cache = False  # Silence the warnings. Please re-enable for inference!
              finetuned_model = trainer.train()

              # Save the model
              # Note: trainer.train() returns a TrainOutput with training metrics; the updated LoRA weights live in `model`
              # and can be persisted with model.save_pretrained(...) if you need the adapter itself.
              joblib.dump(finetuned_model, "finetuned_falcon.pkl")

              file_uri = ubiops.utils.upload_file(
                  client = api_client,
                  project_name = os.environ['project_name'],
                  file_path = f'{cache_dir}.tar',
                  bucket_name = "default",
                  file_name = f'{cache_dir}.tar'
              )

              return {"artifact": "finetuned_falcon.pkl"}
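To illustrate what the CAUSAL_LM task means for this dataset, here is a simplified sketch that reads one JSONL record and lists the (previous words, next word) pairs the model would learn from. It splits on whitespace, whereas the real tokenizer works on subword tokens, and the quote itself is made up. Only the `quote` field name is taken from `QuotesDataset` above.

```python
import io
import json

# Stand-in for quotes.jsonl: one JSON object per line with a "quote" field.
sample_file = io.StringIO('{"quote": "Stay curious and keep learning"}\n')
records = [json.loads(line) for line in sample_file if line.strip()]


def next_word_pairs(text):
    """Return (context, next word) pairs for every position after the first word."""
    words = text.split()
    return [(" ".join(words[:i]), words[i]) for i in range(1, len(words))]


pairs = next_word_pairs(records[0]["quote"])
for context, target in pairs:
    print(f"{context!r} -> {target!r}")
```

During training, the data collator with `mlm=False` builds these targets automatically by using the input tokens themselves as labels.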


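          Create training run in UbiOps:

          run = training_instance.experiment_runs_create(
              project_name=PROJECT_NAME,
              experiment_name=EXPERIMENT_NAME,
              data=ubiops.ExperimentRunCreate(
                  name='fine-tune-falcon',  # assumption: illustrative run name
                  description='training run',
                  training_code='./train.py'
              )
          )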

          And there you have it!

          We have just fine-tuned the Falcon 1B model in four easy steps! You can use the code from this blog post as a template for your own use case. The fine-tuning approach applied in this guide is based on the LoRA technique.
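As a rough illustration of why LoRA reduces the number of trainable parameters: instead of updating a full weight matrix, it trains two small low-rank matrices whose product has the same shape. The dimensions and rank below are assumptions picked for the arithmetic, not values read from Falcon.

```python
def lora_trainable_parameters(d_out: int, d_in: int, rank: int) -> int:
    """Parameters in the two LoRA matrices: B (d_out x rank) and A (rank x d_in)."""
    return rank * (d_out + d_in)


D_OUT, D_IN, RANK = 2048, 2048, 16  # assumption: illustrative sizes

full_matrix = D_OUT * D_IN
lora_matrices = lora_trainable_parameters(D_OUT, D_IN, RANK)
fraction = lora_matrices / full_matrix

print(f"Full matrix:   {full_matrix:,} parameters")
print(f"LoRA matrices: {lora_matrices:,} parameters")
print(f"Trainable fraction: {fraction:.2%}")
```

A lower rank means fewer trainable parameters but also less capacity to adapt, so the rank is a trade-off you tune per use case.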

          Having completed this guide, you may now be wondering how to deploy or manage LLM-based applications (AKA LLMOps). You can find more guides on deploying LLM-based applications here and here, and learn more about LLMOps here.

          If you’d like us to write about something specific, just shoot us a message. Our team would love to help you bring your project to life!

          Thanks for reading!
