Implementing RAG for your LLM (Mistral)

Most of the open-source models available on Huggingface come pre-trained on a large corpus of publicly available data, like WebText. In general, the size of these datasets give large language models (LLMs) an adequate performance for various use cases. For some, more specific, use cases, however, more domain specific knowledge is required for the LLM to provide satisfactory results. There are two ways to impart this knowledge to a model: 

  • Fine-tune your model, by using a smaller dataset to (partly) retrain the LLM. This approach alters the weights and parameters of the model, and so aims to edit the behavior of the model.
  • Prompt-engineering, where the input of the LLM is supplemented with additional instructions or context. Here, the aim is to guide the model to a better output instead of changing its behavior.

Fine-tuning vs. prompt-engineering

In general, fine-tuning a pre-trained LLM achieves better results for domain-specific use cases, but it’s more expensive. Furthermore, the knowledge of the model can only be updated by fine-tuning it again. These are just some of the considerations to take into account when choosing to go for fine-tuning vs. prompt-engineering

Advantage of prompt-engineering

One prompt engineering technique that is gaining increasing popularity is Retrieval Augmented Generation (RAG). With this technique, the input to a LLM is substantiated by relevant, additional context which is retrieved using a smaller model. A great advantage of this technique is that you can update a model’s domain specific knowledge without having to retrain the model.

In this guide

This guide will show you how to set up a RAG framework for the Mistral 7b Instruct v0.2t model, where the model can retrieve extra context from Git documentation on UbiOps. To do this, the documentation must be converted into embeddings. An embedding is a vector that represents text –, in our case, a document. These embeddings can then be compared to inputs that are fed to the model, to produce more specific responses.

In this blog post we will use UbiOps’ training functionality to calculate embeddings for the documents and save them to a storage bucket, which is also hosted on UbiOps. When we run the model, the input prompt we provide will be converted into an embedding, so that it can be compared to the previously stored embeddings to find similar documents.

It only takes three steps to set up a RAG framework on UbiOps:

  1. Create an environment
  2. Create an experiment & initiate a training run  
  3. Create a deployment

In order to follow along with this guide, you’ll need the following:

Let’s get started!

  1. Create a project

After creating an account you can head over to UbiOps and create an organization. UbiOps uses organizations to group projects.Projects in UbiOps allow you to compartmentalize your AI, machine learning (ML), or data science activities. Within a project, you can create deployments (your containerized code), pipelines (chains of deployments), or training experiments.

After creating your organization, you can click on “Create new project” and give your project its own unique name. Alternatively, you can have UbiOps generate a name for you.

Creating a new project

For the code to run we’ll need to create an environment variable of an API token for the training experiment and the deployment. You can create an API token, you can do this by navigating to the Project admin page, clicking on Permissions -> API tokens -> “Add token”. Make sure this token is created on “Project level”, and has the “Project editor” role. Copy the token and save it somewhere, we’ll need it later on.

  1. Create an environment

in UbiOps, the code and the environment it runs in can be managed separately. Reusing the same environment can greatly reduce the time it takes to build deployment versions and training experiments (more about these two concepts later). An environment on UbiOps consists of a base environment, to which we can add additional dependencies using a `requirements.txt` file, `ubiops.yaml` file, or a combination of the two, to create a custom environment. 

Environments can be created in two ways: implicitly, by adding environment files to your deployment package, or explicitly, by creating an environment separately. For this blog post we will be creating our environment explicitly.

Head over to the “Environment” tab on the left hand side of the WebApp, and click on “+Custom environment”. Then use the following parameters:

Base environmentUbuntu 22.04 + Python 3.10 + CUDA 11.7.1
Custom dependenciesUpload the environment file package

Then scroll down and click on “Create’, after which UbiOps will start building your environment (this can take a couple of minutes). We will use this environment for both the `` (for generating the embeddings) & the `` (running the model).

Creating an environment

  1. Create an experiment and initiate a training run

Now we will start defining the training set-up, or experiment, that we will use. Here we define the environment that the `` code file will run in and what instance type to use.

Head over to the “Training” tab in the WebApp and click on “+Create new”. Now fill in the following parameters:

Hardware settings16384 MB + 4 vCPU + NVIDIA Tesla T4 (if your project has access to GPUs)
Environment settingsrag-mistral-git (the environment we created earlier)
Select default output bucketLeave on default
Environment variablesClick on +Create variable and use the following parameters:
Name: UBIOPS_API_TOKEN, Value: Put the token we created earlier here, make sure to include the “Token” part, Secret: Yes

Scroll down and click on the “Create” button.

Creating an experiment

Now we need to upload the zip file that contains all the documents to be converted into embeddings. Go to the “Storage” page, click on the “default” bucket, and upload the documents file,

Head back over to the experiment we just created and click on “Create run”. We will now initiate the run that will create the embeddings, which the LLM will use as context when a user enters a prompt. You can fill in the following parameters for the training run:

Training codeUpload the

Leave the rest as is, scroll down, and click on “Create”. The code inside the, which creates and uploads the embeddings, will now be executed. The output of this code will be a file containing all the embeddings. This file will be uploaded to the same bucket as the documents file we uploaded earlier (default).

Initializing a training run

While the training code is running we can create the deployment that will contain the code to download the Mistral model from Huggingface and process requests.

  1. Create the deployment

A deployment is an object within UbiOps which can serve your code to process data. After uploading the data, UbiOps will containerize your code and run it as a microservice within UbiOps. Each deployment on UbiOps has its own unique API endpoint which you can use to send requests to your deployed code. Each deployment consists of one or more versions. The in- and output are defined at a deployment level, but the instance type, deployed code, and environment are all defined at the version level. 

Let’s create the deployment, and the deployment version. You can use the following parameters for the deployment leave the rest as is:

Input Type: structured, Name: user_input, Data Type: string
OutputType: structured, Name: response, Data Type: string

After filling in these parameters, scroll down and click on “Next: Create a version”

Here we’ll define the code we want to deploy to UbiOps, the instance type, and the code environment. You can use the following parameters:

Deployment packageUpload the deployment file
Select hardware180000 MB + 22 vCPU + NVIDIA Ampere A100 80gb

Now the last thing we need to do is copy the environment variable from the training experiment to this deployment. You can do so by navigating to the “Environment variables” tab -> “Copy existing variables”. For the deployment name select `training-base-deployment`, and for the version you need to select the name of the experiment (i.e., “rag-mistral-git”).

After the deployment has finished building you can click on the “Create request” button and start sending requests (i.e., prompts) to your Mistral LLM. When a user enters a prompt the LLM will search for additional context from the embeddings we created with the training run, and generate a response using this context.


And there we have it!

Our very own Mistral-7b-Instruct deployment enhance with RAG, hosted and served on UbiOps. All in under 15 minutes, no software engineering required. 

Of course, there are further optimizations that can be made to get your deployment delivering the most accurate responses every time. We left these out of scope for this guide – but we invite you to iterate and improve your own deployment!

Having completed this guide, you may be wondering how RAG compares to fine-tuning a pre-trained LLM: try our Falcon LLM fine-tuning guide to see for yourself. Or perhaps you’d like to build a front-end for your newly optimized chatbot: take a look at our guide to building a Llama 2 chatbot with a customizable Streamlit front-end.

If you’d like us to write about something specific, just shoot us a message or start a conversation in our Slack community. The UbiOps team would love to help you bring your project to life!

Thanks for reading!

Latest news

Turn your AI & ML models into powerful services with UbiOps