Transformer neural networks have recently become the go-to deep learning architecture for a wide variety of tasks.

ChatGPT by OpenAI keeps making headlines and is seen as a paradigm shift in AI becoming a mainstream technology. And ChatGPT is not the only one: similar endeavors at Google and Meta use much the same architecture.

Transformer models have especially revolutionized the field of natural language processing (NLP). They perform exceptionally well on a range of NLP tasks, such as machine translation, question answering, and text summarization. Because they outperform other deep learning approaches on these tasks, transformer models have quickly become the default choice for many NLP applications.

The architecture does not limit itself to NLP tasks alone.

In computer vision, transformers have been used for image classification or object detection, and transformers can also be used for recommendation systems and time series analysis. It’s a versatile architecture useful for a wide range of tasks. Researchers are continuing to find new applications for transformers in various domains.

We’ll explore the transformer architecture in more detail and also show you how you can run and implement pre-trained transformer models yourself with the help of the Hugging Face Transformers library and UbiOps as a model deployment and serving tool.

The power of transformer models

In 2017 the Transformer was introduced as a new neural network architecture in the paper “Attention Is All You Need” (Vaswani et al.). It produced state-of-the-art results, especially in sequence-to-sequence tasks that require processing sequential input data, like language translation or text summarization.

One of the key characteristics of transformer models is that they are very good at understanding context. This is because a key element of their design is the concept of ‘self-attention’.

Source: “Attention is all you need”, https://arxiv.org/pdf/1706.03762.pdf

Comparing Transformers with recurrent networks

In the past, Recurrent Neural Networks (RNNs) were used for processing sequential data like text and time series. Because a recurrent neural network learns relationships inside a series of data by processing it sequentially, it has a hard time learning relationships between parts of the data that are very far apart. If we feed it a long story of 10,000 words, for example, it can have a hard time finding relationships between the first and last paragraphs of the story.

A major cause of this is the ‘vanishing gradient problem’, a flaw in traditional recurrent neural network architectures. LSTM networks try to mitigate it with gates that learn what to keep in memory and what to ‘flush’, but they still struggle with very long sequences.

But unlike an RNN, a transformer network does not use recurrent connections and instead relies on the mentioned self-attention mechanism to learn the relationships between input and output sequences.

It learns to calculate a set of weights, the attention weights, that are used to focus on the most important parts of the input data. Because every position can attend directly to every other position, the network captures long-range dependencies much better than an RNN. Its ‘memory’ is in practice limited only by the maximum input length the model accepts.
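To make the idea of attention weights a bit more concrete, here is a minimal sketch of scaled dot-product self-attention in plain Python. It is a simplification under stated assumptions: the toy vectors are made up, and the same vectors are used as queries, keys and values, whereas a real transformer first passes the input through learned projection matrices and uses several attention heads.

```python
import math

def softmax(scores):
    """Turn raw scores into positive weights that sum to 1."""
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(queries, keys, values):
    """Scaled dot-product attention over a whole sequence.

    Each argument is a list of vectors, one per position.
    Returns (outputs, weights); weights[i][j] says how strongly
    position i attends to position j.
    """
    dim = len(keys[0])
    outputs, weights = [], []
    for q in queries:
        # Compare this position's query with the key of every position.
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(dim) for k in keys]
        row = softmax(scores)
        weights.append(row)
        # Blend the value vectors using the attention weights.
        outputs.append([
            sum(w * v[d] for w, v in zip(row, values))
            for d in range(len(values[0]))
        ])
    return outputs, weights

# Toy 3-token sequence with made-up 2-dimensional vectors. For simplicity the
# same vectors serve as queries, keys and values (no learned projections).
tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
outputs, weights = self_attention(tokens, tokens, tokens)
```

Every row of `weights` sums to 1, and no position has to wait for an earlier one to be processed first, which is what makes the computation easy to parallelize.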

Another downside of RNNs compared to transformers is that they are notoriously difficult to train as the model processes text word for word in its learning phase. This makes it hard to parallelize the process, making for instance the use of GPUs for training less efficient.

Transformer models, on the other hand, can process all their input data simultaneously, resulting in a huge efficiency boost and much better use of GPU acceleration. GPT-3, the OpenAI model family that ChatGPT builds on, was for example trained on a dataset distilled from a staggering 45TB of raw text data.

Deploying the BERT transformer model with Hugging Face & UbiOps

Pre-trained ML models

Now, GPT-3 is a closed-source, proprietary model. However, there are many great open source Transformer model variants available that are already pre-trained so you can use them directly in your own projects.

The most popular source for these pre-trained open source models is Hugging Face. They maintain a huge library of all kinds of open source machine learning models that you can use directly or fine-tune with your own data.

Of course, these models also need to be deployed somewhere for inference. We will use the UbiOps platform to deploy the model and get a scalable inference API endpoint. This allows us to make calls to the model from anywhere and integrate it using its API.

In this example we will deploy the BERT transformer model.

About BERT

BERT (Bidirectional Encoder Representations from Transformers) is a pre-trained language representation model developed by Google. It’s already pre-trained on a huge amount of text data and can be used for a variety of natural language processing (NLP) tasks.

The basic task on which BERT is trained is predicting ‘masked’ words in a sentence, based on the context provided by the other words in that sentence. This makes it a powerful foundation model that can be fine-tuned and applied to a wide variety of tasks, such as question answering and language inference. For this, the core BERT architecture does not need to change: adding one additional output layer is enough to solve different problems like the ones just mentioned.

For our example we will deploy the original BERT model to predict a masked word in a sentence, as in “Paris is the capital of [MASK].”
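Conceptually, masked-word prediction comes down to looking up the scores the model produced at the position of the mask and picking the best-scoring vocabulary entry. The sketch below illustrates only that selection step. The five-word vocabulary and all scores are invented for illustration, and `predict_masked_word` is a hypothetical helper, not part of any library.

```python
def predict_masked_word(tokens, vocabulary, scores_per_position, mask_token="[MASK]"):
    """Return the vocabulary entry with the highest score at the mask position."""
    mask_index = tokens.index(mask_token)      # where the hidden word sits
    scores = scores_per_position[mask_index]   # one score per vocabulary entry
    best_id = max(range(len(scores)), key=scores.__getitem__)
    return vocabulary[best_id]

# Hypothetical five-word vocabulary and invented scores; a real BERT model
# produces one score per entry of its roughly 30,000-token vocabulary.
vocabulary = ["france", "germany", "paris", "spain", "."]
tokens = ["paris", "is", "the", "capital", "of", "[MASK]", "."]
scores_per_position = [[0.0] * len(vocabulary) for _ in tokens]
scores_per_position[5] = [4.2, 1.1, 0.3, 0.9, -2.0]

prediction = predict_masked_word(tokens, vocabulary, scores_per_position)
```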

Deploying a BERT inference API with UbiOps

The UbiOps platform is built for deploying and serving AI & ML models, like this one. UbiOps will take your code, build a container and run it as a service with its own API.

This implementation for deploying pre-trained Hugging Face models on UbiOps is basically the same for any model you want to run.

We also created a Google Colab notebook with the full code example in it that you can run yourself.

We will start with the UbiOps deployment template, which is the basic wrapper for any code we want to run on UbiOps as a serverless function.

The deployment has two methods, the initialization and the request. The initialization method runs when the deployment container starts up, so we can include all the code here that only needs to run once, like (down)loading the model, and reduce our overhead for the actual inference step.

In the request function we basically run the predict or inference step of our model. This method is called every time the deployment API receives a call with new data.
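The split between the two methods can be tried out locally before anything is deployed. The following is a rough sketch that assumes nothing beyond the two method signatures described above: the 'model' is a trivial stand-in function rather than BERT, and the argument values passed to the constructor are placeholders for whatever the platform supplies.

```python
class Deployment:
    """Stand-in deployment; the 'model' is a trivial function, not BERT."""

    def __init__(self, base_directory, context):
        # Runs once at container start-up: do the expensive loading here.
        self.model = str.upper  # placeholder for a real model object

    def request(self, data):
        # Runs for every API call: only the inference step belongs here.
        return {"prediction": self.model(data["sentence"])}

# Local smoke test that mimics the platform: initialise once, call many times.
# The argument values are assumptions; the platform supplies the real ones.
deployment = Deployment(base_directory=".", context={})
results = [deployment.request({"sentence": s}) for s in ("first call", "second call")]
```

Keeping the loading code out of `request` means repeated calls reuse the same in-memory model.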

Let’s start with the code in the initialization step.

From the Hugging Face `transformers` Python library we download the AutoTokenizer tokenizer and BertForMaskedLM model from the repository and store them locally in the UbiOps deployment container.

import torch
from transformers import AutoTokenizer, BertForMaskedLM


class Deployment:
    def __init__(self, base_directory, context):
        """
        Initialisation method for the deployment. Any code inside this method will execute when the deployment starts up.
        It can for example be used for loading modules that have to be stored in memory or setting up connections.
        """
        print("Initialising deployment")
        # Download the tokenizer and the model once, at start-up
        self.tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
        self.model = BertForMaskedLM.from_pretrained("bert-base-uncased")

Because the local storage is not persistent when the container shuts down, the files will be downloaded from Hugging Face every time the deployment starts. To load the model files quickly the next time on start-up without too much latency you can store them to the UbiOps file storage.

UbiOps has built-in object storage with a high-bandwidth data connection to the deployment containers. You can use the UbiOps Python client to store both the tokenizer and the model files in UbiOps file storage and load them from there. The next time the model needs to start from an inactive state, it will use the copies from the UbiOps file storage.

More information on how to do this is described in this “How-to” article in the UbiOps documentation.
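The start-up logic described above boils down to a simple fallback order: use local files if the container still has them, otherwise copy them from file storage, and only download from the model hub as a last resort. The sketch below shows that order with stub functions. `ensure_model_files`, `fetch_from_storage` and `fetch_from_hub` are hypothetical names introduced for illustration; the actual storage calls of the UbiOps client are not shown here.

```python
import os
import tempfile

def ensure_model_files(local_dir, fetch_from_storage, fetch_from_hub):
    """Make sure model files exist in local_dir and report where they came from.

    fetch_from_storage(local_dir) -> bool : hypothetical, copies files from
        file storage and returns False when nothing is stored there yet.
    fetch_from_hub(local_dir) : hypothetical, downloads from the model hub.
    """
    if os.path.isdir(local_dir) and os.listdir(local_dir):
        return "local"      # container is still warm
    os.makedirs(local_dir, exist_ok=True)
    if fetch_from_storage(local_dir):
        return "storage"    # fast path after a cold start
    fetch_from_hub(local_dir)
    return "hub"            # slow path, first start only

# Stub callables so the flow can be exercised without any network access.
def empty_storage(local_dir):
    return False

def fake_hub_download(local_dir):
    with open(os.path.join(local_dir, "config.json"), "w") as handle:
        handle.write("{}")

with tempfile.TemporaryDirectory() as tmp:
    model_dir = os.path.join(tmp, "bert-base-uncased")
    first = ensure_model_files(model_dir, empty_storage, fake_hub_download)
    second = ensure_model_files(model_dir, empty_storage, fake_hub_download)
```

With the Hugging Face library, writing and reading such a local directory is typically done with `save_pretrained` and by passing the directory path to `from_pretrained`.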

Inside the `request` method, we basically use the example code from the Hugging Face docs to run the BERT model for masked input prediction.

    def request(self, data):
        """
        Method for deployment requests, called separately for each individual request.
        """
        print("Processing request")
        inputs = self.tokenizer(data["sentence"], return_tensors="pt")
        with torch.no_grad():
            logits = self.model(**inputs).logits

        # retrieve index of [MASK]
        mask_token_index = (inputs.input_ids == self.tokenizer.mask_token_id)[0].nonzero(as_tuple=True)[0]
        predicted_token_id = logits[0, mask_token_index].argmax(axis=-1)
        result = self.tokenizer.decode(predicted_token_id)

        # here we set our output parameters in the form of a json
        return {"prediction": result}

Now to deploy the model on UbiOps, we can use either the Python client or the UbiOps Webapp in the browser.

We will set up this deployment in such a way that it expects a sentence as a string, with one word hidden with `[MASK]`. The output of the deployment will be the prediction for the value of the mask.

# Create the deployment
deployment_template = ubiops.DeploymentCreate(
    name=DEPLOYMENT_NAME,
    input_type='structured',
    output_type='structured',
    input_fields=[{'name': 'sentence', 'data_type': 'string'}],
    output_fields=[{'name': 'prediction', 'data_type': 'string'}]
)
api.deployments_create(project_name=PROJECT_NAME, data=deployment_template)

Now we will create a version of the deployment. For the version we need to define the name, the Python version, the type of instance (CPU or GPU) and the size of the instance.

UbiOps has the option to run deployments on different types of hardware and different node sizes. The BERT model we use in our example is quite efficient so it also runs quickly on CPU, but for larger models you can make use of a UbiOps node with an NVIDIA T4 GPU or NVIDIA A100 GPU in it.

For this example we will use Python 3.10 on a CPU instance with 4096 MB of memory. Optionally you can run on a GPU, which will speed up inference.

# Let's first create the version
version_template = ubiops.DeploymentVersionCreate(
    version=DEPLOYMENT_VERSION,
    language='python3.10',
    instance_type='4096mb',  # You can use '16384mb_t4' if you want to run on GPU
    maximum_instances=1,
    minimum_instances=0,
    maximum_idle_time=600,  # = 10 minutes
    request_retention_mode='full'
)
api.deployment_versions_create(project_name=PROJECT_NAME, deployment_name=DEPLOYMENT_NAME, data=version_template)

After defining the deployment and version, we can upload the code to UbiOps. We zip and upload the folder containing the `requirements.txt` and `deployment.py` files. As we do this, UbiOps will build a container based on the settings above and install all packages defined in our requirements file.

This step might take a few minutes. You can monitor the progress in the UbiOps WebApp by navigating to the deployment version and clicking the `logs` icon.
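The zipping part of this step can be done with the Python standard library alone. The sketch below builds such a package from a folder holding `deployment.py` and `requirements.txt`. The folder name `deployment_package`, the placeholder file contents, the dependency list and the choice to keep the folder as the top level of the archive are all assumptions made for illustration. The upload itself is left out because it needs a live UbiOps client.

```python
import os
import shutil
import tempfile
import zipfile

def build_deployment_package(source_dir, archive_base):
    """Zip source_dir (kept as the top-level folder) and return the zip path."""
    parent, folder = os.path.split(os.path.abspath(source_dir))
    return shutil.make_archive(archive_base, "zip", root_dir=parent, base_dir=folder)

workdir = tempfile.mkdtemp()
package_dir = os.path.join(workdir, "deployment_package")
os.makedirs(package_dir)

# Placeholder file contents; the real deployment.py holds the Deployment class.
with open(os.path.join(package_dir, "deployment.py"), "w") as handle:
    handle.write("class Deployment:\n    pass\n")
with open(os.path.join(package_dir, "requirements.txt"), "w") as handle:
    handle.write("torch\ntransformers\n")  # assumed dependency list

zip_path = build_deployment_package(package_dir, os.path.join(workdir, "deployment_package"))
with zipfile.ZipFile(zip_path) as archive:
    packaged_files = sorted(archive.namelist())
```

Listing the archive contents before uploading is a cheap way to catch a missing requirements file early.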

Testing the deployed BERT model

Now that our BERT model is deployed and live on UbiOps, we can use the UbiOps client to make a call to the model API endpoint.

data = {
    "sentence": "Paris is the capital of [MASK].",
}

api.deployment_requests_create(
    project_name=PROJECT_NAME, deployment_name=DEPLOYMENT_NAME, data=data
).result

And we get an almost instantaneous result from the model API on UbiOps. Note that ‘[MASK]’ is the special token the BERT tokenizer uses to mark the word the model needs to predict.

The same approach can be used to deploy and run any other model from the Hugging Face Transformers library. Just make sure that the input and output fields match the model.
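Because the deployment in this example expects a sentence with exactly one `[MASK]`, it can help to check the input on the client side before sending a request. The helper below is a small sketch of such a check. `validate_sentence` is a hypothetical function introduced here, not part of the UbiOps or Hugging Face libraries.

```python
MASK_TOKEN = "[MASK]"

def validate_sentence(sentence, mask_token=MASK_TOKEN):
    """Return the request payload if the sentence has exactly one mask token."""
    if not isinstance(sentence, str):
        raise TypeError("sentence must be a string")
    count = sentence.count(mask_token)
    if count != 1:
        raise ValueError(f"expected exactly one {mask_token}, found {count}")
    return {"sentence": sentence}

payload = validate_sentence("Paris is the capital of [MASK].")
```

The returned dictionary has the same shape as the `data` object passed to the request call above, so it can be sent as-is.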

Conclusion

Transformer neural networks have proven themselves as a very powerful architecture for a range of ML applications, especially natural language processing tasks. They owe this to the fact that they are much better at understanding context and that they are more efficient to train because of the way they process data.

Many of the large language models are closed source and proprietary, but there’s a world of pre-trained open source ML models that you can run, tailor and implement yourself.

As an example, we used a pre-trained BERT model from Hugging Face and deployed it to UbiOps to get a scalable inference API endpoint for it. This is something that you can easily do yourself and make use of the power of pre-trained transformer models for your own projects.
