Deploy LLaMa 2 with a customizable front-end in under 15 minutes using only UbiOps, Python and Streamlit: in 2024

What can you get out of this guide?

In this guide, we explain how to deploy LLaMa 2, an open-source Large Language Model (LLM), using UbiOps for easy model hosting and Streamlit for creating a chatbot UI. The guide provides step-by-step instructions for packaging a deployment, loading it into UbiOps, configuring compute on GPUs and CPUs, generating API tokens, and integrating with Streamlit for the front-end. We conclude with a benchmark test showing that GPUs can provide over 30x faster processing speeds than CPUs. This guide aims to make cutting-edge AI accessible by allowing anyone to deploy their own LLaMa 2 chatbot in minutes.

To successfully complete this guide, you will need:

Python 3.9 or higher installed
Streamlit library installed
UbiOps Client Library installed
UbiOps account (see below)


Where did LLaMa 2 come from?

Natural language processing systems can be traced back to the 1960s with the advent of early chatbots like ELIZA. Language models followed, and more recently LLMs, in no small part thanks to the boost in processing speed afforded by specialized GPUs for data science, an effort spearheaded by NVIDIA (read more about GPUs).

Since the introduction of Transformers, there has been an explosion in the democratization of LLMs and, more broadly, generative AI. With ChatGPT’s free version leading the charge, established businesses and aspiring entrepreneurs around the world have become aware of the power of generative AI.

Meta AI introduced the LLaMA series of large language models in early 2023, with access initially restricted to select researchers before being leaked on the internet. The landscape shifted with the release of the open-source Llama 2 family, which quickly rose to the top of benchmarks. Pre-training innovations like increased context length and data cleaning enabled Llama 2’s strong performance.

Though not yet on par with massive proprietary models, Llama 2 and other open LLMs have democratized access and sparked remarkable progress in language AI. And still, the open-source community continues to drive advances through thoughtful innovation and benchmarking.

[2303.18223] A Survey of Large Language Models

What is LLaMa v2?

Researchers created the Llama 2 family of large language models using a robust pre-training approach, including more data cleaning, increased context length, and grouped-query attention for scalability. Trained on 2 trillion tokens from public sources, Llama 2-70B performed on par with or better than other major models like GPT-3.5, PaLM, and LLaMA on benchmarks testing reasoning, common sense, multitasking, and tendency to reproduce falsehoods.

Llama 2-70B took the top spot on the HuggingFace leaderboard, surpassing leading models like LLaMA and Falcon. The top 3 models currently are Llama 2-70B, LLaMA-65B/30B, and Falcon-40B, based on average scores on benchmarks like AI2 Reasoning Challenge, HellaSwag, MMLU, and TruthfulQA.

While Llama 2 shows novelty and strong performance, other impressive models have also emerged from fine-tuning it, demonstrating the rapid pace of advancement in large language models. Overall, Llama 2 represents a promising new family of models with state-of-the-art capabilities and the potential to serve as viable alternatives to closed-source models.

For the purpose of this guide, we will be deploying the chat version of LLaMa 2-7B, which is available on HuggingFace. It is lighter than the other LLaMa 2 models, with fewer parameters, which makes it a good fit for anyone looking to dip their feet into generative AI and learn the basics quickly. The principles of this guide remain the same for heavier versions, so this is a great place to start.

What is UbiOps?

UbiOps is a powerful AI model serving and orchestration service with unmatched simplicity, speed and scale. UbiOps minimizes DevOps time and costs to run, train and manage AI models, and distributes them on any compute infrastructure at scale. It is built for training, deploying, running and managing production-grade AI in an agile way. It features unique functionality for workflow orchestration (Pipelines), automatic adaptive scaling in hybrid or multi-cloud environments as well as key MLOps features. You can learn more about UbiOps features on our Product page.

What is Streamlit?

Streamlit is an open source Python framework that enables data scientists and machine learning engineers to quickly build and share interactive web applications for their models and analyses. By being compatible with popular Python data science libraries like scikit-learn, PyTorch, and pandas, Streamlit allows users to easily leverage their existing workflows to create apps.

In a few words, Streamlit provides data science teams an easy yet powerful way to create, deploy and share interactive web apps from their Python scripts and notebooks, increasing efficiency and cross-team communication.

Why should you use UbiOps and Streamlit to deploy an LLM?

Many of UbiOps’ users deploy their LLMs, computer vision, or generative models with Streamlit front-ends so that they don’t have to learn an entirely new language. And both Streamlit and UbiOps are particularly well suited for generative AI thanks to their combined wealth of support for integrations with other data science tools.

In addition, UbiOps exposes solutions with API endpoints so you can integrate models that require heavy computational power into lightweight dashboards. UbiOps’ vast computing capabilities include powerful hardware as well as compatibility with both cloud and on-premise computing environments, key features for businesses concerned about their intellectual property.

What can you do with a deployed LLaMa 2 model and a front-end?

LLMs are being applied across industries to enhance productivity and access to information:

In agriculture, LLMs provide personalized and translated advice to farmers on best practices.
In healthcare, they assist doctors with diagnosis and care plans while empowering patients with individualized assessments.
The financial sector uses LLMs for internal applications like information retrieval and chatbots (they are still cautious about client-facing applications).
Manufacturing leverages these models so workers can query complex systems such as digital twins in natural language, increasing responsiveness on the shop floor.

Overall, LLMs make specialized knowledge more accessible to non-technical users through human-like interaction. By leveraging Streamlit’s extensive library of community code snippets, you can even deploy a LLaMa model and get it to write improvements to its own front-end code.

If you’ve made it this far into this guide, you probably get the point that a LLaMa 2, Streamlit, UbiOps combined deployment can get you access to your very own state of the art LLM with minimal time and effort spent.

With that said, let’s get started!

First things first, how to deploy AI models on UbiOps

The first step is to create a free UbiOps account. Simply sign up with an email address and within a few clicks you will be good to go.

Once you have your account, log in to the WebApp, head over to the “Deployments” tab on the left, and click on “Create”. In the following menu, you can fill in fields like the name of the deployment and an optional description, and define the input and output of the deployment (read more about deployments here). The input and output fields define what data the deployment expects to receive and return when a request is made (i.e. when running the model). In this case, both can be set as type “Structured” and data type “String”. The code in this guide reads an input field named `prompt` and returns an output field named `response`, so use those names.

After providing that information, UbiOps will generate Python and R deployment code snippets that can be downloaded or copied. We will use these to create the deployment package. For this guide, we will be using Python.
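To make the input and output definitions concrete, here is a minimal sketch of the data a request carries, assuming an input field named `prompt` and an output field named `response` (the names used by the code in this guide). The `check_fields` helper is hypothetical and only illustrates the shape of the data; UbiOps performs its own validation.

```python
# Hypothetical helper: checks that a payload matches a set of string fields.
def check_fields(payload, expected_fields):
    """Return True if payload has exactly the expected keys and all values are strings."""
    return set(payload) == set(expected_fields) and all(
        isinstance(value, str) for value in payload.values()
    )


# Example structured input and output for this deployment.
example_input = {"prompt": "What is a llama?"}
example_output = {"response": "A llama is a domesticated South American camelid."}

print(check_fields(example_input, ["prompt"]))     # True
print(check_fields(example_output, ["response"]))  # True
```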

How to create a deployment package

Open your preferred code editor and paste the Python deployment code snippet that UbiOps generated for you.

Which libraries to use

The first thing you probably want to do is import libraries. For this deployment, you’ll need to include the following libraries at a minimum:

import os  # for reading environment variables in UbiOps, in this case `HF_TOKEN`, which is how we will authenticate with HuggingFace
from transformers import AutoTokenizer, LlamaForCausalLM, GenerationConfig  # AutoTokenizer and LlamaForCausalLM load the tokenizer and model from HuggingFace; GenerationConfig sets a repetition penalty for the model
import torch  # for interacting with the instance device
import shutil  # for file operations
from huggingface_hub import login  # for authenticating with HuggingFace
import ubiops  # for connecting to the UbiOps API

Now your code file should look a little something like this:

import os
from transformers import AutoTokenizer, LlamaForCausalLM, GenerationConfig
import torch
import shutil
from huggingface_hub import login
import ubiops


class Deployment:

    def __init__(self, base_directory, context):
        print("Initialising deployment llama")

    def request(self, data):
        print("Processing request for deployment llama")
        prompt_value = data["prompt"]

        # <YOUR CODE>
        response_value = ""  # placeholder until the model call is added

        return {
            # TODO fill in the values of your output fields
            "response": response_value
        }

Define your __init__ and request functions

The `__init__` function runs only once when an instance of the model starts up, while the `request` function runs every time a call is made to the model API.

Typically, an `__init__` function will include the necessary tasks for:

Checking computing device availability
Authentication
Locating any generic files for your model (e.g. tokens)
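The difference between the two functions can be seen without loading any model. The sketch below uses a stand-in class (`ToyDeployment`, a hypothetical name) that only counts calls: the setup runs once per instance, while `request` runs for each call.

```python
class ToyDeployment:
    """Stand-in for a UbiOps Deployment class; no model is loaded."""

    def __init__(self, base_directory, context):
        # Runs once per instance: expensive setup (model loading, logins) belongs here.
        self.startup_count = 1
        self.request_count = 0

    def request(self, data):
        # Runs on every API call: keep this as light as possible.
        self.request_count += 1
        return {"response": f"echo: {data['prompt']}"}


deployment = ToyDeployment(base_directory=".", context={"project": "demo"})
for text in ["hello", "world"]:
    result = deployment.request({"prompt": text})

print(deployment.startup_count, deployment.request_count)  # 1 2
```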

Here is how to define your `__init__` function for LLaMa 2:

    def __init__(self, base_directory, context):
        """
        Initialisation method for the deployment. Any code inside this method will execute when the deployment starts up.
        It can for example be used for loading modules that have to be stored in memory or setting up connections.
        """
        # Extract environment variables
        PROJECT_NAME = context['project']
        UBIOPS_API_TOKEN = os.environ.get('UBIOPS_API_TOKEN')

        # Not used below; relevant if you load the model artifacts from UbiOps object storage instead
        LLAMA_BUCKET = os.environ.get('LLAMA_BUCKET', 'model-artifacts')
        LLAMA_DIR = os.environ.get('LLAMA_DIR', 'llama')
        LLAMA_VERSION = os.environ.get('LLAMA_VERSION', 'llama-2-7b-chat-hf')

        self.REPETITION_PENALTY = float(os.environ.get('REPETITION_PENALTY', 1.15))
        # `max_length` in `generate` must be an integer number of tokens
        self.MAX_RESPONSE_LENGTH = int(os.environ.get('MAX_RESPONSE_LENGTH', 128))

        # Create a UbiOps API client and connect to the API (skipped when no token is configured)
        if UBIOPS_API_TOKEN:
            client = ubiops.ApiClient(ubiops.Configuration(api_key={'Authorization': UBIOPS_API_TOKEN}, host='https://api.ubiops.com/v2.1'))
            api = ubiops.CoreApi(client)

        # Use `os` and `huggingface_hub` to log in to HuggingFace
        HF_TOKEN = os.environ["HF_TOKEN"]
        login(token=HF_TOKEN)

        # Use `transformers` to download the LLaMa model and tokenizer from HuggingFace
        model_hf_name = "meta-llama/Llama-2-7b-chat-hf"
        print("Downloading model from huggingface")
        self.model = LlamaForCausalLM.from_pretrained(model_hf_name)
        print("Downloading tokenizer from huggingface")
        self.tokenizer = AutoTokenizer.from_pretrained(model_hf_name)
        print(f"Model {model_hf_name} loaded")

        # Load model to GPU if available
        self.device = self.set_device()
        # self.device = "cpu"  # uncomment to force the model onto the CPU
        self.model.to(self.device)

        # Set model config
        self.generation_config = GenerationConfig(repetition_penalty=self.REPETITION_PENALTY)

        print("Initialising deployment")

This implementation downloads the `meta-llama/Llama-2-7b-chat-hf` model from HuggingFace each time an instance is spun up. For better version control and faster start-up times, you can also store the model artifacts in UbiOps object storage and download the model from there.
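One way to structure that choice is to prefer a local copy of the model when one exists and fall back to the HuggingFace name otherwise. The sketch below is an assumption about how you might organise this, not UbiOps' own method: `resolve_model_source` is a hypothetical helper, and copying the artifacts from object storage into the local directory is not shown. It relies on the fact that a model directory written by `save_pretrained` contains a `config.json` file.

```python
import os
import tempfile

HUB_NAME = "meta-llama/Llama-2-7b-chat-hf"


def resolve_model_source(local_dir, hub_name):
    """Return local_dir if it holds a saved model, otherwise the HuggingFace model name."""
    if os.path.isfile(os.path.join(local_dir, "config.json")):
        return local_dir
    return hub_name


# Demonstration with an empty temporary directory standing in for a model cache.
with tempfile.TemporaryDirectory() as cache_dir:
    before = resolve_model_source(cache_dir, HUB_NAME)  # no local copy yet
    open(os.path.join(cache_dir, "config.json"), "w").close()
    after = resolve_model_source(cache_dir, HUB_NAME)   # local copy found
    used_local_copy = after == cache_dir

print(before)           # meta-llama/Llama-2-7b-chat-hf
print(used_local_copy)  # True
```

The returned value can be passed to `LlamaForCausalLM.from_pretrained` and `AutoTokenizer.from_pretrained`, which accept either a model name or a local directory.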

The `request` function should fetch the input prompt, run the model itself, and return the output. We apply post-processing to filter out the prompt from the response. Here is how to define your `request` function for LLaMa 2:

    def request(self, data):
        """
        Method for deployment requests, called separately for each individual request.
        """
        # Fetch the input as you defined it in your deployment and tokenize it
        prompt = data["prompt"]
        print(f"Running model on {self.device}")
        inputs = self.tokenizer(prompt, return_tensors="pt")

        # Move the input tensors to the same device as the model
        inputs = inputs.to(self.device)

        # Run model to generate response (`max_length` must be an integer)
        with torch.no_grad():
            generate_ids = self.model.generate(
                inputs.input_ids,
                max_length=int(self.MAX_RESPONSE_LENGTH),
                generation_config=self.generation_config,
            )

        # Decode the generated tokens
        response = self.tokenizer.batch_decode(generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False)
        print("full response")
        print(response)

        # Return the response without the echoed prompt
        filtered_response = self.filter_prompt_from_response(prompt, response[0])
        return {"response": filtered_response}

    @staticmethod
    def set_device():
        gpu_available = torch.cuda.is_available()
        print(f"GPU_available: {gpu_available}")
        device = torch.device("cuda") if gpu_available else torch.device("cpu")
        return device

    @staticmethod
    def filter_prompt_from_response(prompt, response_text):
        # Find where the prompt starts; if it is not echoed back, return the full text
        prompt_start_index = response_text.find(prompt)
        if prompt_start_index == -1:
            return response_text.strip()
        # Get the generated response after the prompt
        return response_text[prompt_start_index + len(prompt):].strip()
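One detail worth knowing about the `request` function: in `transformers`, the `max_length` argument of `generate` counts the prompt tokens plus the generated tokens, so a long prompt leaves less room for the answer. The sketch below only illustrates that arithmetic; `new_token_budget` is a hypothetical helper and the token counts are made-up examples. If you would rather cap the answer alone, `generate` also accepts `max_new_tokens`.

```python
def new_token_budget(prompt_token_count, max_length):
    """Tokens left for the answer when max_length covers prompt plus generated tokens."""
    return max(max_length - prompt_token_count, 0)


print(new_token_budget(prompt_token_count=28, max_length=128))   # 100
print(new_token_budget(prompt_token_count=200, max_length=128))  # 0
```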

Now that you have defined your `__init__` and `request` functions in your `Deployment` class, save your `deployment.py` file to a directory. Within the same directory, create two more text files: `requirements.txt` and `ubiops.yaml`.

In the `requirements.txt` file, paste the following libraries that UbiOps will install within your code environment:

# This file contains package requirements for the environment
# installed via PIP.
diffusers
transformers
scipy
torch==1.13.1
accelerate
huggingface-hub
ubiops

Similarly, the `ubiops.yaml` file is used to configure the environment that UbiOps builds for your application. Here it sets `PIP_EXTRA_INDEX_URL` so that pip can also look for CUDA-enabled PyTorch builds. Paste the following in the file that you created:

environment_variables:
- PIP_EXTRA_INDEX_URL=https://download.pytorch.org/whl/cu110

Save both files to the same directory as your deployment code and zip the directory.
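If you prefer to script the zipping step, the standard library can do it. This is a minimal sketch: the folder name `llama_deployment_package` and the `zip_deployment_package` helper are hypothetical, the demo uses empty placeholder files, and it assumes the archive should contain the directory itself, as "zip the directory" suggests.

```python
import os
import shutil
import tempfile
import zipfile


def zip_deployment_package(package_dir, archive_base):
    """Zip package_dir (including the folder itself) and return the path of the .zip file."""
    package_dir = os.path.abspath(package_dir)
    parent = os.path.dirname(package_dir)
    folder = os.path.basename(package_dir)
    return shutil.make_archive(archive_base, "zip", root_dir=parent, base_dir=folder)


# Demonstration with empty placeholder files in a temporary directory.
with tempfile.TemporaryDirectory() as workdir:
    package_dir = os.path.join(workdir, "llama_deployment_package")
    os.makedirs(package_dir)
    for name in ["deployment.py", "requirements.txt", "ubiops.yaml"]:
        open(os.path.join(package_dir, name), "w").close()

    archive_path = zip_deployment_package(package_dir, os.path.join(workdir, "llama_deployment_package"))
    with zipfile.ZipFile(archive_path) as archive:
        archived_names = sorted(archive.namelist())

print(archived_names)
```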

Your LLaMa 2 deployment package is now ready to be loaded into UbiOps!

How to load LLaMa 2 deployment package into UbiOps and configure compute settings

Let’s head back to the WebApp and click “Next: Create a version”. To load your deployment package, you’ll need to associate it with a version. Set the name as something like “gpu” (this will become clear later on in the guide) and upload the zipped directory you created in the last section.

Next, toggle “Enable GPU”, and choose “Python 3.9 + CUDA 11.0.3” as the base code environment upon which UbiOps will build your custom environment using the `requirements.txt` and `ubiops.yaml` files you included in your zipped deployment package.

Under “Optional / Advanced Settings” you can configure the computing settings for your LLaMa deployment (e.g. the type of GPU you want to run on, the maximum number of instances to be spun up at any given time, etc). Your choice of settings will mainly depend on how fast and how responsive you want your application to be. For a head start, here is an example setup you might like to use:

When you are happy with your setup, click “Create”. Then, navigate to the “Environment Variables” tab under your deployment. Here, you will need to define an Environment Variable for your HuggingFace token. We do this because the LLaMa 2 models are hosted on HuggingFace behind an authorization wall, so to access the model you will need to request access from HuggingFace. Once your request is approved, you can use a personal access token to download the model. In the “Environment Variables” tab, click on “Create Variable” to add an Environment Variable with the name “HF_TOKEN” and your token as its value, then set it as “Secret”.
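A missing `HF_TOKEN` otherwise only surfaces as a bare `KeyError` when the deployment starts. As an optional refinement, you could fail with a clearer message. The `require_env` helper below is a hypothetical sketch, and the token value is a placeholder.

```python
import os


def require_env(name, environ=None):
    """Return the value of an environment variable, or raise a readable error if it is unset."""
    environ = os.environ if environ is None else environ
    value = environ.get(name, "").strip()
    if not value:
        raise RuntimeError(
            f"Environment variable {name} is not set; add it in the Environment Variables tab."
        )
    return value


# Demonstration with a stand-in environment instead of the real one.
print(require_env("HF_TOKEN", {"HF_TOKEN": "hf_placeholder"}))  # hf_placeholder
```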

Your deployment is now live!

How to create a front-end for LLaMa 2 using Streamlit

Streamlit has written a helpful tutorial on how to build a front-end for a LLaMa 2 chatbot, which we used to create an example of what your Streamlit code could look like, with some adjustments taken from our very own tutorial on integrating Streamlit with UbiOps:

import streamlit as st
import ubiops
import os

# App title
st.set_page_config(page_title="Llama 2 Chatbot")

# UbiOps credentials
with st.sidebar:
    st.title('Llama 2 Chatbot')
    if 'UBIOPS_API_TOKEN' in st.secrets:
        st.success('API key already provided!')
        ubiops_api_token = st.secrets['UBIOPS_API_TOKEN']
    else:
        ubiops_api_token = st.text_input('Enter UbiOps API token:', type='password')
        if not ubiops_api_token.startswith('Token '):
            st.warning('Please enter your credentials!')
        else:
            st.success('Proceed to entering your prompt message!')
    st.markdown('Learn how to build this app in this [blog](#link-to-blog)!')
os.environ['UBIOPS_API_TOKEN'] = ubiops_api_token

# Store LLM generated responses
if "messages" not in st.session_state.keys():
    st.session_state.messages = [{"role": "assistant", "content": "How may I assist you today?"}]

# Display or clear chat messages
for message in st.session_state.messages:
    with st.chat_message(message["role"]):
        st.write(message["content"])

def clear_chat_history():
    st.session_state.messages = [{"role": "assistant", "content": "How may I assist you today?"}]
st.sidebar.button('Clear Chat History', on_click=clear_chat_history)

# Function for generating LLaMA2 response
# Refactored from <https://github.com/a16z-infra/llama2-chatbot>
def generate_llama2_response(prompt_input):
    # The dialogue history is assembled here, but only the latest prompt is sent below.
    # Include `string_dialogue` in the request data if you want multi-turn context.
    string_dialogue = "You are a helpful assistant. You do not respond as 'User' or pretend to be 'User'. You only respond once as 'Assistant'."
    for dict_message in st.session_state.messages:
        if dict_message["role"] == "user":
            string_dialogue += "User: " + dict_message["content"] + "\n\n"
        else:
            string_dialogue += "Assistant: " + dict_message["content"] + "\n\n"

    # Request llama, authenticating with the token collected in the sidebar
    configuration = ubiops.Configuration(api_key={'Authorization': ubiops_api_token}, host='https://api.ubiops.com/v2.1')
    api = ubiops.CoreApi(ubiops.ApiClient(configuration))
    response = api.deployment_version_requests_create(
        project_name="INSERT_PROJECT_NAME",
        deployment_name="INSERT_DEPLOYMENT_NAME",
        version="INSERT_VERSION_NAME",
        data={"prompt": prompt_input}
    )
    api.api_client.close()
    return response.result['response']

# User-provided prompt
if prompt := st.chat_input(disabled=not ubiops_api_token):
    st.session_state.messages.append({"role": "user", "content": prompt})
    with st.chat_message("user"):
        st.write(prompt)

# Generate a new response if last message is not from assistant
if st.session_state.messages[-1]["role"] != "assistant":
    with st.chat_message("assistant"):
        with st.spinner("Thinking..."):
            response = generate_llama2_response(prompt)
            placeholder = st.empty()
            full_response = ''
            for item in response:
                full_response += item
                placeholder.markdown(full_response)
            placeholder.markdown(full_response)
    message = {"role": "assistant", "content": full_response}
    st.session_state.messages.append(message)

The first thing to take note of is the `UBIOPS_API_TOKEN` variable. This is essentially a password that is needed to authenticate with UbiOps and grant a user access to your LLaMa 2 chatbot. It can be provided to Streamlit in one of three ways: through a `secrets.toml` file, through the Streamlit code itself, or by a user accessing your front-end.

You can create an API token for your app in the UbiOps WebApp by navigating to Project Admin > Permissions > API Tokens, clicking on “Add token” and following the steps to add an API token with a “Project” level role of “deployment-request-user” or above. Take note of your unique token and use it to access your chatbot (don’t forget to include the ‘Token ’ prefix).
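Because the prefix is easy to forget, you might normalise the token before using it. A minimal sketch, with a hypothetical helper name and a placeholder token value:

```python
def normalize_ubiops_token(raw_token):
    """Return the token in 'Token <value>' form, adding the prefix if it is missing."""
    token = raw_token.strip()
    if not token:
        raise ValueError("API token is empty")
    if not token.startswith("Token "):
        token = f"Token {token}"
    return token


print(normalize_ubiops_token("abc123"))          # Token abc123
print(normalize_ubiops_token(" Token abc123 "))  # Token abc123
```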

Next, you will need to replace `INSERT_PROJECT_NAME`, `INSERT_DEPLOYMENT_NAME`, and `INSERT_VERSION_NAME` with the names of your application’s project, deployment, and version respectively, exactly as they are in UbiOps.

When you have completed these steps, run your Streamlit code file from a terminal with `streamlit run`, followed by the name of your file.

This should open a window in your default browser with your very own LLaMa 2 chatbot!

Now let’s prompt our LLM using the Streamlit front-end

Head over to the Streamlit front-end you just created and input your UbiOps API token if necessary. This will unlock the prompt interface at the bottom of your UI.

Ask your chatbot whatever you like! In the background, Streamlit will take your prompt and make a call to your deployment in UbiOps using a unique API endpoint and display its output as the chatbot’s response.
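Note that the Streamlit code sends only the latest prompt to the deployment. If you want the model to see earlier turns, one option is to send the whole conversation as the prompt. The sketch below assumes the message format stored in `st.session_state.messages`; the `build_dialogue_prompt` helper and the plain `User:`/`Assistant:` layout are illustrative choices, not a format prescribed by UbiOps or Meta.

```python
def build_dialogue_prompt(messages, system_prompt="You are a helpful assistant."):
    """Flatten a chat history into a single prompt that ends with an open assistant turn."""
    lines = [system_prompt]
    for message in messages:
        speaker = "User" if message["role"] == "user" else "Assistant"
        lines.append(f"{speaker}: {message['content']}")
    lines.append("Assistant:")
    return "\n\n".join(lines)


history = [
    {"role": "assistant", "content": "How may I assist you today?"},
    {"role": "user", "content": "What is UbiOps?"},
]
dialogue = build_dialogue_prompt(history)
print(dialogue)
```

Keep in mind that a longer prompt uses up more of the token budget set by `max_length` in the deployment.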

Bear in mind that response times will depend on the number of minimum instances you set in your version configuration. If set to zero, your deployment will need to initialize from scratch (a cold start) whenever no instance is already running, which will increase the response time. For a faster response time, you can increase the number of minimum instances – but remember that this will use up your computing credits faster, since you are essentially keeping an instance of your deployment running in the background.

You can always check on a request’s progress by heading over to the “Requests” subtab under the deployment version that is in use. Here you will find a full list of all requests that have been made to your model and their status (e.g. processing, completed, etc).

How does LLaMa 2 perform on a CPU vs GPU (benchmark)

GPUs are a great way to increase the processing speed of your AI or machine learning models. Publicly available generative AIs such as ChatGPT, Midjourney, or Runway are running on powerful GPU setups which make it possible for users to get responses in seconds. But just how much faster is it to run a generative AI on a GPU compared to a CPU? Follow the steps below to see for yourself.

To do this, you will need to create a new version of your LLaMa 2 deployment. You can name it something like “cpu”, to easily differentiate it from your other version that runs on GPUs, and upload the same deployment package. Now make sure that the “Enable GPU” toggle is set to off. In the advanced settings, you can select CPUs of various speeds. For this benchmark we used the same instance type to run the model on CPU and on GPU; for the CPU benchmark, we simply did not use the instance’s GPU.

Now you can create requests to your LLaMa 2 deployment and have them be processed either by a GPU or a CPU by making the requests to the respective version. Then you can head over to Monitoring > Deployments in the WebApp and compare their performance side by side by selecting each version and displaying them. Feel free to uncheck the “Show aggregated metrics” checkbox to declutter your graphs.

Our own comparison showed that running LLaMa 2-7B on a CPU took around 85 seconds, compared to around 2.5 seconds on a GPU. That is more than a 30x increase in processing speed, which makes for a noticeably better user experience.
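If you want to take your own measurements from code rather than from the Monitoring page, a small timing helper is enough. In this sketch `fake_request` is a stand-in for a function that sends a prompt to one version of your deployment; `time_call` and `speedup` are hypothetical helper names. The last line applies the ratio to the rounded timings quoted above; your own results will vary with instance type and prompt length.

```python
import time


def time_call(send_request, prompt):
    """Return (result, elapsed_seconds) for a single call to send_request(prompt)."""
    start = time.perf_counter()
    result = send_request(prompt)
    return result, time.perf_counter() - start


def speedup(slow_seconds, fast_seconds):
    """How many times faster the fast run was than the slow run."""
    return slow_seconds / fast_seconds


def fake_request(prompt):
    # Stand-in for a real request to the "cpu" or "gpu" version of the deployment.
    time.sleep(0.01)
    return {"response": "ok"}


result, seconds = time_call(fake_request, "Hello")
print(f"stub request took {seconds:.3f}s")
print(f"{speedup(85.0, 2.5):.0f}x")  # 34x
```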

Conclusion

And there we have it!

Our very own top of the line LLaMa 2 chatbot, accessible from anywhere through a Streamlit front-end, and hosted and served on UbiOps. All in under 15 minutes, without needing a software engineer.

Naturally, there are further optimizations that can be made to the code to get your deployment running as fast as possible every time. We left these out of scope for this guide. We invite you to iterate and improve your deployment!

Having completed this guide, you may now be wondering how to tailor LLaMa 2 to your own particular need by adding fine-tuning or prompt engineering to the equation – or maybe you have an exciting idea for a LLaMa-based application. Our team will be releasing more easy to follow guides on how to make the most of open source LLMs, so until then just shoot us a message or start the conversation in our Slack community.

Thanks for reading!
