Open-source LLMs like Mixtral and Llama are starting to rival the performance of some proprietary LLMs. One thing to consider when working with open-source models, though, is that they do not come ready for every use case out of the box. For example, they do not ship with a front-end.
Furthermore, most open-source models are stateless by default, which means that they aren’t suitable for use in chatbots without extra work. Luckily, there are packages, like Streamlit, that you can easily use in combination with an open-source model, like Llama 3.1.
Whether to use proprietary or open-source models depends on the use case and company policy. In some cases, like those involving sensitive data, proprietary models may not fulfill regulatory requirements.
The ability of Large Language Models (LLMs) to understand and generate human language makes them well suited for a whole range of use cases, like:
- Chatbots
- Summarization
- Cybersecurity
Since the introduction of ChatGPT, more and more companies have started to develop their own chatbots, for either internal or external use. While most proprietary models work well out of the box for a chatbot, open-source models often miss one key feature needed behind a chatbot: conversational awareness.
Open-source models are stateless, which means that they immediately “forget” the user’s input after processing it, i.e., every query to the model is handled as an independent input that does not consider past interactions. You can understand why this causes problems when you want to use an LLM for a chatbot. Luckily, there are ways to give an LLM conversational memory, for example with LangChain.
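To make statelessness concrete, here is a minimal sketch in plain Python. `toy_model` is a hypothetical stand-in for an LLM call (it is not a real model or library function): it can only use the text it is given. Carrying a conversation forward therefore means resending earlier turns together with each new query.

```python
def toy_model(prompt: str) -> str:
    """Hypothetical stand-in for an LLM: it only 'knows' what is in the prompt."""
    marker = "my name is "
    lowered = prompt.lower()
    if "what is my name" in lowered:
        if marker in lowered:
            start = lowered.index(marker) + len(marker)
            name = prompt[start:].split()[0].strip(".,!?")
            return f"Your name is {name}."
        return "I don't know your name."
    return "Nice to meet you!"


# Stateless use: each query is sent on its own, so the second call has no context.
toy_model("Hi, my name is Ada.")
stateless_answer = toy_model("What is my name?")

# With memory: earlier turns are prepended to the new query before calling the model.
history = ["Hi, my name is Ada."]
with_memory_answer = toy_model("\n".join(history + ["What is my name?"]))

print(stateless_answer)    # I don't know your name.
print(with_memory_answer)  # Your name is Ada.
```

The second approach is essentially what a memory component automates: it stores past turns and injects them into the next prompt.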
LangChain is a framework that enables you to chain interoperable components to build LLM applications. Examples include using LangChain as an information retriever that supplies additional context in a RAG setup or, as we’ll show in this blog post, adding conversational memory to your LLM.
LangChain’s conversational memory comes in different forms, the most basic ones being:
- ConversationBufferMemory: Sends the entire history, together with the latest query, to the LLM. This approach is the most intuitive and gives the LLM the most information. The downside is that the number of tokens can increase rapidly, meaning slower response times, higher inference costs, and the danger of reaching the maximum token limit of the LLM.
- ConversationBufferWindowMemory: Tries to tackle this by remembering only the last K interactions, using a sliding window whose size you determine.
- ConversationSummaryMemory: Here the conversation is summarized over time. This approach is useful for condensing the information of the conversation over time, which is especially useful for longer conversations where keeping the message history would take up too many tokens. The conversation is summarized as it happens, and stored in the memory. The summary of the conversation so far is then injected into the prompt.
- ConversationSummaryBufferMemory: This method tries to combine the best of both worlds, by allowing you to set a limit on the number of tokens that will be used for the chat summary. As with ConversationSummaryMemory, the number of tokens used for short conversations might increase, but the buffer means that recent interactions aren’t missed.
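The trade-off between the first two strategies can be illustrated without LangChain. The classes below are a simplified sketch, not LangChain’s implementation: `BufferMemory`, `WindowMemory`, and the whitespace-based `count_tokens` are hypothetical names used for illustration only.

```python
from collections import deque


def count_tokens(text: str) -> int:
    """Crude token estimate (whitespace split); real tokenizers differ."""
    return len(text.split())


class BufferMemory:
    """Keeps every interaction, like a full-history buffer."""

    def __init__(self):
        self.turns = []

    def save(self, user: str, assistant: str) -> None:
        self.turns.append((user, assistant))

    def as_prompt(self) -> str:
        return "\n".join(f"User: {u}\nAssistant: {a}" for u, a in self.turns)


class WindowMemory(BufferMemory):
    """Keeps only the last k interactions (a sliding window)."""

    def __init__(self, k: int):
        # A deque with maxlen silently drops the oldest turn when full.
        self.turns = deque(maxlen=k)


full, window = BufferMemory(), WindowMemory(k=2)
for i in range(1, 6):
    for memory in (full, window):
        memory.save(f"question {i}", f"answer {i}")

print(count_tokens(full.as_prompt()))    # 30
print(count_tokens(window.as_prompt()))  # 12
```

After five interactions the full buffer keeps growing, while the window stays at a fixed size at the cost of forgetting the earliest turns. The summary-based variants sit in between by compressing older turns instead of dropping them.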
In this blog post, we’ll show you how you can create a chatbot, using Llama 3.1 8B Instruct, LangChain, Streamlit, and UbiOps. Llama 3.1 will be used to process a user’s input, LangChain will be used to give the model conversational memory, Streamlit will be used as the front-end, and we’ll be running the model on UbiOps.
UbiOps is a platform built for model serving & management, and is well suited for managing fine-tuning tasks, running models via a REST API endpoint, and handling large-scale AI applications. UbiOps offers both on-premise installations and SaaS. Features like pipelines enable you to manage many input & output data streams for your AI application; these data streams can be connected to your model, and logical operations can be performed on them. Furthermore, UbiOps offers extensive logging, monitoring, and event auditing capabilities. These tools enable you to monitor, evaluate, and respond to any potential errors in your deployment. In short: UbiOps is a powerful AI serving & orchestration tool that is useful to any machine learning stack.
For this blog post, we’ll combine the code in the UbiOps Streamlit integration with a modified version of the Deploy Llama 3 guide. To be able to follow along with this blog post you’ll need to have:
- Streamlit installed
- A Hugging Face access token with permission to download Llama 3.1
Let’s get started!
Create a project
After creating your account, head over to the UbiOps WebApp and click on “Create new Project”. You can either let UbiOps generate a name for your project or pick your own unique name.
In UbiOps you work in an organization, in which you can create one or more projects. Inside these projects, you can then create deployments, which are containerized versions of your code. As mentioned earlier you can also chain these deployments together to create pipelines.
After creating your project we can start building the environment in which the model will run.
Create the environment
Environments in UbiOps consist of a base environment, to which you can add your own dependencies. The result is a custom environment. In UbiOps you can create custom environments in two ways:
- Implicitly: by adding environment files to your deployment package
- Explicitly: by creating an environment in the “Environments” tab, as we’ll be doing now.
Go to the “Environments” tab on the left-hand side and click on “+Custom Environment”. Then fill in the following parameters:
Parameter | Value |
Name | llama-3-1-chatbot |
Base environment | Ubuntu 20.04 + Python 3.10 + CUDA 11.0.3 |
Custom dependencies | Upload this file |
After filling everything in you can scroll down and click on “Create”. Note that building the environment can take a couple of minutes.
Create your deployment
A deployment is an object within UbiOps that processes data by serving your code. Each deployment gets a unique API endpoint, which the Streamlit dashboard we’ll build later on will use to send requests. We’ll need to define an input & output, so our deployment knows what kind of data it can expect.
Navigate to the “Deployments” tab on the left-hand side and click on “Create”. Give your deployment a name (like llama-3-1-chatbot) and use the following parameters for the input & output:
Field | Name | Datatype |
Input | prompt | String |
Output | output | String |
After providing the input & output, UbiOps will automatically generate Python and R deployment code snippets that can be downloaded or copied.
You can use these to create the deployment package. For this guide, you can ignore these as we’ll be providing the code.
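For orientation, a UbiOps deployment package is built around a `deployment.py` file containing a `Deployment` class with an `__init__` method (run when an instance starts) and a `request` method (run for every request). The sketch below follows that structure with the `prompt` input and `output` output defined above. The `_generate` helper and the `history` list are hypothetical placeholders for the actual model call and memory, not the code used in this guide.

```python
class Deployment:
    def __init__(self, base_directory, context):
        # Runs once when an instance starts: load the model and memory here.
        self.history = []  # hypothetical in-instance conversation store

    def _generate(self, prompt):
        # Placeholder for the real LLM call (assumption for illustration).
        return f"(echo) {prompt}"

    def request(self, data):
        # `data` holds the input fields defined for the deployment.
        prompt = data["prompt"]
        answer = self._generate(prompt)
        self.history.append((prompt, answer))
        # The returned keys must match the deployment's output fields.
        return {"output": answer}


deployment = Deployment(base_directory=".", context={})
print(deployment.request({"prompt": "Hello"}))  # {'output': '(echo) Hello'}
```

Because the object lives as long as the instance does, anything stored on `self` in `__init__` persists between requests handled by that instance.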
Scroll down and click on “Next: Create a version”.
Create a deployment version
Upload the deployment file, toggle the “Enable accelerated hardware” button, and select the “16384 MB + 4 vCPU + NVIDIA Ada Lovelace L4” instance type. For the coding environment, select the environment we created earlier (llama-3-1-chatbot).
Scroll down and click on “Create”, after which UbiOps will start building your version.
To download Llama 3.1 from Hugging Face, you’ll need to sign in to Hugging Face and accept the license agreement from Meta on the Meta-Llama-3.1-8B-Instruct page.
After accepting, navigate within Hugging Face to Settings → Access Tokens → New Token and create a new token with “read” permission. Copy this token to your clipboard.
Now we turn the Hugging Face token into an environment variable. Navigate to your deployment and open the “Environment Variables” tab.
Click on “Create Variable”, name the variable “HF_TOKEN”, paste your Hugging Face token as the value, and mark it as a secret.
Click on the check mark to save the token. After saving your token the model is ready to receive inference requests!
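Inside the deployment, an environment variable like this is typically read with `os.environ`. The helper below is a small sketch of that step. `read_hf_token` is a hypothetical name, and what the token is passed to afterwards (for example a Hugging Face download call) depends on the deployment code.

```python
import os


def read_hf_token(env=os.environ):
    """Return the Hugging Face token, failing early with a clear message."""
    token = env.get("HF_TOKEN", "").strip()
    if not token:
        raise RuntimeError(
            "HF_TOKEN is not set; add it as a secret environment variable."
        )
    return token


# Example with an explicit mapping instead of the real environment:
print(read_hf_token({"HF_TOKEN": "hf_example"}))  # hf_example
```

Failing early in this way makes a missing or misnamed variable show up as a clear error in the deployment logs rather than as a download failure later on.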
Create the front-end for your chatbot using Streamlit
For the front-end, we’ll be using a modified version of the code from the How to build a front-end for a LLaMa 2 chatbot tutorial and the integrating Streamlit with UbiOps tutorial:
import os

import streamlit as st
import ubiops

# App title
st.set_page_config(page_title="💬 Llama 3.1 Chatbot Assistant")

# UbiOps credentials
with st.sidebar:
    st.title('💬 Llama 3.1 Chatbot Assistant')
    # Use the token from secrets.toml if present, otherwise ask the user for it
    if 'UBIOPS_API_TOKEN' in st.secrets:
        st.success('API key already provided!', icon='✅')
        ubiops_api_token = st.secrets['UBIOPS_API_TOKEN']
    else:
        ubiops_api_token = st.text_input('Enter UbiOps API token:', type='password')
        if not ubiops_api_token.startswith('Token '):
            st.warning('Please enter your credentials!', icon='⚠️')
        else:
            st.success('Proceed to entering your prompt message!', icon='👉')
    st.markdown('📖 Learn how to build this app in this [blog](#link-to-blog)!')

# Make the token available to the UbiOps client through an environment variable
os.environ['UBIOPS_API_TOKEN'] = ubiops_api_token

# Store LLM-generated responses
if "messages" not in st.session_state:
    st.session_state.messages = [{"role": "assistant", "content": "How may I assist you today?"}]

# Display chat messages
for message in st.session_state.messages:
    with st.chat_message(message["role"]):
        st.write(message["content"])

# Clear chat messages
def clear_chat_history():
    st.session_state.messages = [{"role": "assistant", "content": "How may I assist you today?"}]

st.sidebar.button('Clear Chat History', on_click=clear_chat_history)

# Function for generating a Llama 3.1 response
def generate_llama_response(prompt_input):
    # Flattened chat history. Note: this string is not part of the request below;
    # only the latest prompt is sent to the deployment.
    string_dialogue = "You are a helpful assistant. You do not respond as 'User' or pretend to be 'User'. You only respond once as 'Assistant'."
    for dict_message in st.session_state.messages:
        if dict_message["role"] == "user":
            string_dialogue += "User: " + dict_message["content"] + "\n\n"
        else:
            string_dialogue += "Assistant: " + dict_message["content"] + "\n\n"
    # Send the request to the Llama 3.1 deployment
    api = ubiops.CoreApi()
    response = api.deployment_version_requests_create(
        project_name=st.secrets["project_name"],
        deployment_name=st.secrets["deployment_name"],
        version=st.secrets["version"],
        data={"prompt": prompt_input, "config": {}},
        timeout=3600,
    )
    api.api_client.close()
    return response.result['output']

# User-provided prompt
if prompt := st.chat_input(disabled=not ubiops_api_token):
    st.session_state.messages.append({"role": "user", "content": prompt})
    with st.chat_message("user"):
        st.write(prompt)

# Generate a new response if the last message is not from the assistant
if st.session_state.messages[-1]["role"] != "assistant":
    with st.chat_message("assistant"):
        with st.spinner("Thinking..."):
            response = generate_llama_response(prompt)
            placeholder = st.empty()
            full_response = ''
            # Render the response incrementally, character by character
            for item in response:
                full_response += item
                placeholder.markdown(full_response)
            placeholder.markdown(full_response)
    message = {"role": "assistant", "content": full_response}
    st.session_state.messages.append(message)
To set up a connection between the front-end and the Llama 3.1 deployment we’ll use a UbiOps API token. This variable can be passed in three ways:
- Directly in the Streamlit code itself
- By a user accessing your front-end
- Through a Streamlit “secrets.toml” file
API tokens can be created in the UbiOps WebApp: navigate to Project Admin → Permissions → API tokens and click on “Add token”.
Follow the steps to add an API token with a “Project” level role of “deployment-request-user” or higher. If you want, you can copy the unique token, and add it to the “secrets.toml” file in the format “Token 12345”. In the “secrets.toml” file also add your project_name, deployment_name, and version:
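A “secrets.toml” along the following lines should work. Streamlit looks for this file in a `.streamlit` folder in the directory you run the app from. The keys match the ones the front-end code reads through `st.secrets`. All values are placeholders to replace with your own, and `v1` as the version name is only an assumption.

```toml
# .streamlit/secrets.toml (placeholder values)
UBIOPS_API_TOKEN = "Token 12345"
project_name = "your-project-name"
deployment_name = "llama-3-1-chatbot"
version = "v1"
```

Keep this file out of version control, since it contains your API token.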
After filling in these parameters, you can run your Streamlit code from a terminal with the “streamlit run” command followed by the name of your script. This should open a window in your default browser with your very own Llama 3.1 chatbot!
Prompt your LLM using the front-end
Now you can head over to your Streamlit front-end and input your UbiOps API token if you haven’t added it to the “secrets.toml” file. After providing the token the prompt interface at the bottom of the UI will be unlocked.
You can ask your chatbot any question you like. When a prompt is provided Streamlit will make a call to your Llama 3.1 deployment using the unique API endpoint, and display its output as the response of the chatbot.
Keep in mind that the response time of the model will depend on the minimum number of instances you set while configuring the deployment version. When set to zero, your deployment has to initialize from scratch whenever a call is made while no instance is running, which will increase the time it takes to complete that request.
If you want faster response times, you can increase the minimum number of instances – remember that this uses up computing credits faster, since you’re just keeping instances of your deployment running in the background.
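One way to observe this effect is to time requests from the client side: with a minimum of zero instances, the first request after a period of inactivity should take noticeably longer than the ones that follow. A minimal sketch, where `fake_request` is a hypothetical stand-in for the function that sends a prompt to your deployment:

```python
import time


def timed_call(send_request, prompt):
    """Call `send_request(prompt)` and return (result, elapsed_seconds)."""
    start = time.perf_counter()
    result = send_request(prompt)
    return result, time.perf_counter() - start


def fake_request(prompt):
    # Hypothetical stand-in for the real request to the deployment.
    time.sleep(0.05)
    return f"reply to: {prompt}"


reply, seconds = timed_call(fake_request, "Hello")
print(f"{reply!r} took {seconds:.2f}s")
```

Comparing the timings of a first request and an immediate follow-up gives a rough idea of how much of the response time is start-up rather than inference.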
You can monitor the progress of a request by heading over to the WebApp and navigating to the “Requests” subtab under the deployment version that is in use. Here you can find a full list of requests that have been made to the model and their status.
Conclusion
And there you have it! You have now created your own chatbot using Llama 3.1 Instruct, UbiOps, LangChain, and Streamlit. We showed you how to create a custom environment explicitly, create a deployment & deployment version, create an environment variable, create a front-end, and send prompts to your model.
If you are interested in learning even more functionalities from UbiOps, or general information about LLMs, you can check out: