How to build a RAG query engine with LlamaIndex and UbiOps

Large Language Models (LLMs) are trained on vast datasets sourced from the public internet, but those datasets of course do not include the specific data points relevant to your business or use case. Retrieval-Augmented Generation (RAG) addresses this by dynamically incorporating your own data as context in the prompt sent to the LLM. This way there is no need to retrain or fine-tune the LLM, while it can still provide accurate and relevant responses! Because no training is needed, you typically need far fewer resources to make a RAG system work well. A nice bonus is that RAG systems are also less prone to hallucination.

With RAG, the most relevant context is retrieved from your data and fed to the LLM together with the original prompt. Several Python libraries make this process fairly straightforward; in this guide I will be using LlamaIndex, a simple, flexible data framework for connecting custom data sources to LLMs. For this example I will use a dataset containing Kaggle's public documentation and llama-2-13b to set up a RAG query engine, and I'll deploy the query engine to UbiOps so it can be served at scale.

Stages within RAG

Every RAG system performs the following key steps:

  1. Loading: Loading your context data into your RAG system. 
  2. Indexing: Creating a data structure that allows you to query your context data efficiently. For LLMs this nearly always means creating vector embeddings, which are numerical representations of the meaning of your data.
  3. Storing: Once your data is indexed, you will want to store the index, along with any other metadata, to avoid having to re-index it. For larger systems it makes sense to store your indexed data in a dedicated vector store like Pinecone.
  4. Querying: Actually querying your context data to find the most relevant context to enhance your prompt to your LLM with.

To bring this to UbiOps, we will put all of these steps into a single deployment with the help of LlamaIndex. In the __init__() function of the deployment we will perform steps 1 to 3, and in the request() function we will perform step 4, as sketched below. The context data itself can be stored in a UbiOps storage bucket.
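In outline, the deployment will look roughly like this (a minimal sketch of the structure only; the full working version is built up in the rest of this guide):

class Deployment:
    def __init__(self, base_directory, context):
        # Steps 1-3, run once when a deployment instance starts:
        # load the context data from a UbiOps bucket, index it into
        # a vector store and build the query engine from that index.
        self.query_engine = None  # built in the full version below

    def request(self, data):
        # Step 4, run for every request: query the index with the
        # incoming prompt and return the generated answer.
        response = self.query_engine.query(data["prompt"])
        return {"response": response.response}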

Uploading our context data to UbiOps

Before we dive into the code for setting up our query engine, let’s make sure the context data is actually available within UbiOps. I created an archive called “kaggle_docs.zip” containing the raw Kaggle docs in txt format inside a “texts” folder. The data I’m using is part of this Kaggle dataset. Do note that I’m only using the “raw” folder of that dataset, not the cleaned-up question-answer CSV.

To upload it to UbiOps simply navigate to Storage > default in the UbiOps WebApp, and click “Upload file” to upload the zip.
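If you prefer to script this step, the upload can also be done with the UbiOps Python client. The sketch below assumes the ubiops.utils.upload_file helper (the counterpart of the download_file call we use later) and an API token with write access to the bucket (the read-only token we create in the next step is only enough for downloads):

import os
import ubiops

# Assumes UBIOPS_API_TOKEN holds a token with write access to the "default" bucket
configuration = ubiops.Configuration(api_key={"Authorization": os.environ["UBIOPS_API_TOKEN"]})
api_client = ubiops.ApiClient(configuration)

# Upload the archive to the "default" bucket (project name is a placeholder)
ubiops.utils.upload_file(
    client=api_client,
    project_name="YOUR_PROJECT_NAME",
    file_path="kaggle_docs.zip",
    bucket_name="default",
)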

Now that our data is on UbiOps, let’s also create an API token with enough permissions to access this file. We can use that token in our code to be able to load the file into our RAG system.

You can create a token by navigating to Project Admin > Permissions > API tokens and clicking “Create token”. You can give the token the files-reader permission on bucket level, for the default bucket. Copy the token and save it for later!

Using LlamaIndex to build a RAG system

Let’s build up our RAG system step by step before we put it in a UbiOps deployment. The first thing we need to do is import our dependencies:

from llama_index.core import SimpleDirectoryReader, VectorStoreIndex, set_global_tokenizer
from llama_index.llms.llama_cpp import LlamaCPP
from llama_index.llms.llama_cpp.llama_utils import (
    messages_to_prompt,
    completion_to_prompt,
)
from llama_index.embeddings.huggingface import HuggingFaceEmbedding
from transformers import AutoTokenizer
import ubiops
import os
import shutil

Then let’s define the LLM we want to use in our RAG system. In my case I’m going to use llama-2-13b-chat, but you can of course use a different one. Right after downloading the model, I also set the global tokenizer and load the embedding model so everything is ready for use.

# Download and set up the model
model_url = "https://huggingface.co/TheBloke/Llama-2-13B-chat-GGUF/resolve/main/llama-2-13b-chat.Q4_0.gguf"
llm = LlamaCPP(
    model_url=model_url,
    model_path=None,
    temperature=0.1,
    max_new_tokens=256,
    context_window=3900,
    generate_kwargs={},
    # set to at least 1 to use the GPU
    model_kwargs={"n_gpu_layers": 1},
    # transform inputs into the Llama 2 prompt format
    messages_to_prompt=messages_to_prompt,
    completion_to_prompt=completion_to_prompt,
    verbose=True,
)

# Use the Llama 2 tokenizer so token counting matches the model
set_global_tokenizer(
    AutoTokenizer.from_pretrained("NousResearch/Llama-2-7b-chat-hf").encode
)

# Embedding model used to index and query the context documents
embed_model = HuggingFaceEmbedding(model_name="BAAI/bge-small-en-v1.5")

Now let’s load our Kaggle documentation data and index it for our vector store.

# Download the context documents from UbiOps storage
configuration = ubiops.Configuration(api_key={"Authorization": "YOUR TOKEN"})
api_client = ubiops.ApiClient(configuration)

file_uri = ubiops.utils.download_file(
    client=api_client,  # a UbiOps API client
    file_name="kaggle_docs.zip",
    project_name="YOUR_PROJECT_NAME",  # in the deployment this will come from context["project"]
    output_path=".",
    bucket_name="default"
)

shutil.unpack_archive("kaggle_docs.zip", ".")

# Load the documents
documents = SimpleDirectoryReader(
    "./texts"
).load_data()

# Create the vector store index
index = VectorStoreIndex.from_documents(documents, embed_model=embed_model)

Perfect! With our data indexed and ready to go, we can use LlamaIndex to turn it into a query engine.

# Set up the query engine
query_engine = index.as_query_engine(llm=llm)

To use the query engine, we can simply send a prompt to it:

prompt = "Can I use a TPU with Kaggle?"
response_vector = query_engine.query(prompt)
print(response_vector.response)

Deploying the RAG system to UbiOps

Now that we have a RAG system, let’s deploy it to UbiOps! To do so we need to perform two steps:

  1. Create an environment with all the necessary dependencies
  2. Create a deployment with our RAG code

Creating a custom environment in UbiOps

To create an environment we need a requirements.txt detailing all the necessary pip packages, as well as a ubiops.yaml for any OS-level dependencies. Our requirements.txt should look like this:

llama-index>=0.10.3
llama-index-embeddings-huggingface
llama-index-llms-llama-cpp
ubiops

We will deploy our RAG system to a GPU instance in UbiOps to get good performance. To run Llama 2 on a GPU, we need to make sure we install the GPU-compatible build of llama-cpp-python. The following ubiops.yaml will do the trick:

environment_variables:
- PIP_EXTRA_INDEX_URL=https://download.pytorch.org/whl/cu117
- CMAKE_ARGS="-DLLAMA_CUBLAS=on"
- FORCE_CMAKE=1
apt:
  packages:
    - build-essential
    - gcc-11
    - nvidia-cuda-toolkit

Now that we have the necessary files, let’s create the environment in UbiOps. Navigate to the Environments page via the sidebar and click “Custom environment”. This opens a form where we can configure a new environment. You can use the Python 3.10 base environment with CUDA 11.7.1 pre-installed. For the dependency files, upload a zip containing both the requirements.txt and the ubiops.yaml, as shown in the sketch below. When you click “Create”, your environment will start building.
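If you want to create that zip from Python rather than by hand, here is a minimal sketch (the archive name is just an example; both dependency files are placed at the root of the zip):

import zipfile

# Bundle the two dependency files into one environment package,
# with both files sitting at the root of the archive
with zipfile.ZipFile("environment_package.zip", "w") as zf:
    zf.write("requirements.txt")
    zf.write("ubiops.yaml")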

Creating the UbiOps deployment

To create a UbiOps deployment, we need to put the code we created in the first section into a format that UbiOps will understand. UbiOps needs to know what code to execute whenever a new instance is created, and what to execute when new data is sent to the deployment. In our case, it makes sense to load the data, index it and create the query engine upon initialization of a new deployment, so that we only need to do that once. The actual query can then be performed whenever a new prompt is sent to the deployment.

 

Filling in our code in the UbiOps deployment template then leads to the following deployment.py:

from llama_index.core import SimpleDirectoryReader, VectorStoreIndex, set_global_tokenizer
from llama_index.llms.llama_cpp import LlamaCPP
from llama_index.llms.llama_cpp.llama_utils import (
    messages_to_prompt,
    completion_to_prompt,
)
from llama_index.embeddings.huggingface import HuggingFaceEmbedding
from transformers import AutoTokenizer
import ubiops
import os
import shutil


class Deployment:
    def __init__(self, base_directory, context):
        print("Initialising deployment")

        # Download and set up the model
        model_url = "https://huggingface.co/TheBloke/Llama-2-13B-chat-GGUF/resolve/main/llama-2-13b-chat.Q4_0.gguf"
        llm = LlamaCPP(
            model_url=model_url,
            model_path=None,
            temperature=0.1,
            max_new_tokens=256,
            context_window=3900,
            generate_kwargs={},
            # set to at least 1 to use the GPU
            model_kwargs={"n_gpu_layers": 1},
            # transform inputs into the Llama 2 prompt format
            messages_to_prompt=messages_to_prompt,
            completion_to_prompt=completion_to_prompt,
            verbose=True,
        )

        set_global_tokenizer(
            AutoTokenizer.from_pretrained("NousResearch/Llama-2-7b-chat-hf").encode
        )
        embed_model = HuggingFaceEmbedding(model_name="BAAI/bge-small-en-v1.5")

        # Download the context documents from UbiOps storage. The UBIOPS_API_TOKEN
        # environment variable is set on the deployment version (see below).
        configuration = ubiops.Configuration(api_key={"Authorization": os.environ["UBIOPS_API_TOKEN"]})
        api_client = ubiops.ApiClient(configuration)

        file_uri = ubiops.utils.download_file(
            client=api_client,  # a UbiOps API client
            file_name="kaggle_docs.zip",
            project_name=context["project"],
            output_path=".",
            bucket_name="default"
        )

        shutil.unpack_archive("kaggle_docs.zip", ".")

        # Load the documents
        documents = SimpleDirectoryReader(
            "./texts"
        ).load_data()

        # Create the vector store index
        index = VectorStoreIndex.from_documents(documents, embed_model=embed_model)

        # Set up the query engine
        self.query_engine = index.as_query_engine(llm=llm)

    def request(self, data):
        print("Processing request...")
        prompt = data["prompt"]
        response_vector = self.query_engine.query(prompt)

        return {
            "response": response_vector.response
        }

Let’s upload this to UbiOps! Navigate to Deployments via the sidebar and click “Create”. Use the following settings to create the deployment: 

  • Name: llama-index-query-engine
  • Description: A query engine built with LlamaIndex for Kaggle’s documentation.
  • Input: prompt of type string
  • Output: response of type string

Click “Next: Create a version”.

For the deployment package you can use a zipped version of the deployment.py we made earlier. For the instance type I used a 16 GB NVIDIA T4 instance. Note that this instance type isn’t available in a free trial account.

Open the advanced settings and scroll to the environment variables section. Create a new environment variable called UBIOPS_API_TOKEN with the API token value you created and saved earlier, and mark it as a secret. The deployment code reads this variable to download the context data from the bucket.

Then scroll down and click “Create”. Your deployment will start building, and once its status changes to available it’s ready for use.

Test out your RAG system on UbiOps

With the deployment ready you can now start using it! You can click “Create request” to create a request to the deployment. Fill in a prompt of your choice, for instance “What is Fluffy guarding?”
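You can also send a request from Python with the UbiOps client library. A minimal sketch, assuming the deployment name chosen above and an API token that is allowed to make requests to it (the project name is a placeholder):

import ubiops

configuration = ubiops.Configuration(api_key={"Authorization": "YOUR TOKEN"})
api_client = ubiops.ApiClient(configuration)
core_api = ubiops.CoreApi(api_client)

# Send a prompt to the deployment; this call waits for the result
request_result = core_api.deployment_requests_create(
    project_name="YOUR_PROJECT_NAME",
    deployment_name="llama-index-query-engine",
    data={"prompt": "Can I use a TPU with Kaggle?"},
)
print(request_result.result["response"])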

Conclusion

It’s very easy to set up a LlamaIndex RAG pipeline within UbiOps. Simply upload your context data to UbiOps storage, and index it in your deployment code to create a vector store that you can query for every prompt. In this guide we created and deployed our very own query engine for Kaggle’s documentation, all without needing any cloud expertise! You can use the same approach to create a chatbot for your own product’s documentation.


Having completed this tutorial, you may be wondering how RAG compares to fine-tuning a pre-trained LLM: try our Falcon LLM fine-tuning guide to see for yourself. Or perhaps you’d like to know more about how different LLMs compare and which one you should use. We also wrote a handy guide for that!
