Data pipelines: what, why and which ones

8 September 2021 | Blog, Functionality, UbiOps

Examples and explanations of how different pipeline frameworks relate to each other

If you are working in the data science field, you have probably seen the term “data pipeline” in countless articles and tutorials. You might have also noticed that the term pipeline can refer to many different things! There are pipelines spanning different parts of your IT stack, pipelines for a specific tool, and pipelines within a specific code library. UbiOps, the company I am working for, also offers its own form of pipelines, for instance.

It can be quite confusing keeping track of what all these different pipelines are and how they differ from one another. In this article, we will map out and compare a few common pipelines, as well as clarify where UbiOps pipelines fit in the general picture.

 

What is a data pipeline?

Let’s start at the beginning: what is a data pipeline? In general terms, a data pipeline is simply an automated chain of operations performed on data. It can bring data from point A to point B, it can be a flow that aggregates data from multiple sources and sends it off to some data warehouse, or it can perform some type of analysis on the retrieved data. Basically, data pipelines come in many shapes and sizes. But they all have three things in common: they are automated, they introduce reproducibility, and they help to split up complex tasks into smaller, reusable components.

You might be familiar with ETL, or its modern counterpart ELT, which are common types of data pipelines. ETL stands for Extract, Transform, Load and it does exactly that. ELT pipelines extract, load and only then transform. Both are common pipeline patterns used by a large range of companies working with data, especially when they work with data from different sources that needs to be stored in a data warehouse. ETL and ELT pipelines are a subset of data pipelines. In other words, every ETL/ELT pipeline is a data pipeline, but not every data pipeline is an ETL or ELT pipeline.

On the other side of the spectrum of data pipelines we have the more analytics-focused pipelines. These are often present in data science projects. They process incoming data, which is generally already cleaned in some way, to extract insights. They go beyond just loading and transforming the data and instead perform analyses on it.

A data pipeline is not confined to one type or the other; it’s more like a spectrum. Some pipelines focus purely on the ETL side, others on the analytics side, and some do a bit of both. There are many different ways you can go about setting up a data pipeline, but in the end the most important thing is that it fits your project’s needs.
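To make the Extract, Transform, Load pattern concrete, here is a minimal sketch that uses only the Python standard library. The input data, the table name and the function names are all made up for illustration; a real pipeline would read from actual sources and write to an actual data warehouse rather than an in-memory SQLite database.

```python
import csv
import io
import sqlite3

# Hypothetical raw input; in practice this would come from a file, API or database.
RAW_CSV = """city,temp_f
Amsterdam,59
The Hague,61
Utrecht,
"""


def extract(raw: str) -> list[dict]:
    """Extract: parse the raw CSV text into a list of row dictionaries."""
    return list(csv.DictReader(io.StringIO(raw)))


def transform(rows: list[dict]) -> list[tuple]:
    """Transform: drop incomplete rows and convert Fahrenheit to Celsius."""
    cleaned = []
    for row in rows:
        if not row["temp_f"]:
            continue  # skip rows without a measurement
        temp_c = round((float(row["temp_f"]) - 32) * 5 / 9, 1)
        cleaned.append((row["city"], temp_c))
    return cleaned


def load(records: list[tuple], conn: sqlite3.Connection) -> None:
    """Load: write the cleaned records to the target table."""
    conn.execute("CREATE TABLE IF NOT EXISTS temperatures (city TEXT, temp_c REAL)")
    conn.executemany("INSERT INTO temperatures VALUES (?, ?)", records)
    conn.commit()


def run_etl(raw: str, conn: sqlite3.Connection) -> None:
    """Chain the three steps in ETL order."""
    load(transform(extract(raw)), conn)


conn = sqlite3.connect(":memory:")  # stand-in for a data warehouse
run_etl(RAW_CSV, conn)
rows = conn.execute("SELECT city, temp_c FROM temperatures ORDER BY city").fetchall()
print(rows)
```

An ELT variant would load the raw rows into the database first and do the conversion there afterwards, for instance with a SQL query.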

 

Why do you need data pipelines?

Okay, we covered what data pipelines are, but maybe you’re still wondering what their added benefit is. Setting up pipelines does take time after all. I can assure you that this time is well spent, for a couple of reasons.

For starters, automated pipelines will save you time in the end because you won’t have to do the same thing over and over manually. They let you save time on repeatable tasks, so you can allocate more time to other parts of your project.

Probably the most important reason for working with automated pipelines though, is that you need to think, plan and write down somewhere the whole process you plan to put in the pipeline. In other words: it forces you to make a design up front and think about the necessities. Reflecting on the process and documenting it can be incredibly useful for preventing mistakes, and for allowing multiple people to use the pipeline.

In addition, pipelines allow you to split up a large task into smaller steps. This increases efficiency, scalability and reusability. It helps in making the different steps optimized for what they have to do. For instance, sometimes a different framework or language fits better for different steps of the pipeline. If it is one big script, you will have to stick to one, but with most pipeline tools you can pick the best framework or language for each individual part of the pipeline.

Lastly, pipelines introduce reproducibility, which means the results can be reproduced by almost anyone and nearly everywhere (if they have access to the data, of course). This not only introduces security and traceability, but it also makes debugging much easier. The process is the same every time you run the pipeline, so whenever there is a mistake, you can easily trace back the steps and find out where it went wrong.

 

How do UbiOps pipelines fit in this picture?

If you are familiar with UbiOps you will know that UbiOps also has a pipeline functionality. UbiOps pipelines are modular workflows consisting of objects that we call deployments. Every deployment serves a piece of Python or R code in UbiOps. Deployments each have their own API endpoints and are scaled dynamically based on usage. With pipelines you can connect deployments together to create larger workflows. This set-up helps in modularization, allowing you to split your application into small individual parts to build more powerful software over time. UbiOps will take care of the data routing between these deployments and the entire pipeline will be exposed via its own API endpoint for you to use.

 

When we look back at the spectrum of pipelines I discussed earlier, UbiOps is more on the analytics side. In the end, UbiOps pipelines can do whatever you specify in the code, but they are meant more for data processing and analytics than for routing data between other technologies.

 

Comparison of different pipeline frameworks

As I mentioned earlier, there are a ton of different pipeline frameworks out there, all with their own benefits and use cases. A few that keep popping up in the data science scene are: Luigi, Airflow, scikit-learn pipelines and Pandas pipes. Let’s have a look at their similarities and differences, and also check how they relate to UbiOps pipelines.

Luigi

Luigi was built by Spotify for its data science teams to build long-running pipelines of thousands of tasks that stretch across days or weeks. It was intended to help stitch tasks together into smooth workflows. It is an open-source Python package, released under the Apache License 2.0.

The purpose of Luigi is to address all the plumbing typically associated with long-running batch processes, where many tasks need to be chained together. These tasks can be anything, but are typically long running things like Hadoop jobs, dumping data to/from databases, or running machine learning algorithms.

A Luigi task is defined by three methods:

  • requires() defines the dependencies between the tasks
  • output() defines the target of the task
  • run() defines the computation performed by each task
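To show how these three methods work together, here is a toy sketch in plain Python. It deliberately does not import Luigi: the Task base class, the build() helper and the two example tasks are all hypothetical stand-ins that only mimic the pattern, in which a task is skipped when its output already exists. Real Luigi tasks subclass luigi.Task and are run by Luigi's own scheduler.

```python
import os
import tempfile


class Task:
    """Toy base class mimicking the requires/output/run pattern (not Luigi itself)."""

    def requires(self):
        return []  # upstream tasks this task depends on

    def output(self):
        raise NotImplementedError  # path of the file this task produces

    def run(self):
        raise NotImplementedError  # the actual computation


def build(task):
    """Run upstream tasks first and skip any task whose output already exists."""
    executed = []
    for dependency in task.requires():
        executed += build(dependency)
    if not os.path.exists(task.output()):
        task.run()
        executed.append(type(task).__name__)
    return executed


WORKDIR = tempfile.mkdtemp()  # scratch directory for the example outputs


class WriteNumbers(Task):
    def output(self):
        return os.path.join(WORKDIR, "numbers.txt")

    def run(self):
        with open(self.output(), "w") as f:
            f.write("\n".join(str(n) for n in range(1, 6)))


class SumNumbers(Task):
    def requires(self):
        return [WriteNumbers()]

    def output(self):
        return os.path.join(WORKDIR, "sum.txt")

    def run(self):
        # Read the output of the upstream task and write the total.
        with open(self.requires()[0].output()) as f:
            total = sum(int(line) for line in f)
        with open(self.output(), "w") as f:
            f.write(str(total))


first_run = build(SumNumbers())   # both tasks run
second_run = build(SumNumbers())  # nothing runs: both outputs already exist
with open(SumNumbers().output()) as f:
    result = f.read()
print(first_run, second_run, result)
```

Because each task reads the output of the task it requires, the tasks are tied to the data that flows between them, which is the coupling discussed next.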

In Luigi, tasks are intricately connected with the data that feeds into them, which makes it harder to create and test a new task in isolation than to simply string existing tasks together. Because of this setup, it can also be difficult to change a task, as you’ll also have to change each dependent task individually.

Airflow

Airflow was originally built by Airbnb to help their data engineers, data scientists and analysts keep on top of the tasks of building, monitoring, and retrofitting data pipelines. Airflow is a very general system, capable of handling flows for a variety of tools. Airflow defines workflows as Directed Acyclic Graphs (DAGs), and tasks are instantiated dynamically.
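If the idea of a workflow as a DAG is new to you, the following sketch may help. It is not Airflow code: it uses graphlib from the Python standard library (Python 3.9 and up), and the task names and the run_dag() helper are invented for illustration. It only shows the core idea that a task runs after everything it depends on has finished.

```python
from graphlib import TopologicalSorter

# Each key is a task; its value is the set of tasks it depends on.
dag = {
    "extract": set(),
    "clean": {"extract"},
    "train_model": {"clean"},
    "build_report": {"clean"},
    "notify": {"train_model", "build_report"},
}


def run_dag(dag, tasks):
    """Run every task after all of its dependencies and return the order used."""
    order = []
    for name in TopologicalSorter(dag).static_order():
        tasks[name]()
        order.append(name)
    return order


# Hypothetical task bodies: each one only records that it ran.
log = []
tasks = {name: (lambda n=name: log.append(n)) for name in dag}

order = run_dag(dag, tasks)
print(order)
```

Note that train_model and build_report do not depend on each other, so a real orchestrator is free to run them in parallel on different workers. If the graph contained a cycle, TopologicalSorter would raise a CycleError, which is why the graph has to be acyclic.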

Airflow is built around:

  • Hooks, which are high level interfaces for connecting to external platforms (e.g. a Postgres Hook)
  • Operators, which are predefined Tasks that become nodes of the DAG
  • Executors (usually Celery) that run jobs remotely, handle message queuing, and decide which worker will execute each task
  • Schedulers, which handle both triggering scheduled workflows, and submitting Tasks to the executor to run.

With Airflow it is possible to create highly complex pipelines, and it is good for orchestration and monitoring. The most important factor to mention for Airflow is its capability to connect well with other systems, like databases, Spark or Kubernetes.

A big disadvantage of Airflow, however, is its steep learning curve. To work well with Airflow you need DevOps knowledge. Everything is highly customizable and extendable, but at the cost of simplicity.

scikit-learn pipelines

scikit-learn pipelines are very different from Airflow and Luigi. They are not meant for orchestrating big tasks across different services; instead, they make your data science code a lot cleaner and more reproducible. scikit-learn pipelines are part of the scikit-learn Python package, which is very popular for data science.

scikit-learn pipelines allow you to concatenate a series of transformers followed by a final estimator. This way you can chain together specific steps for model training or for data processing for example. With scikit-learn pipelines your workflow becomes much easier to read and understand. Because of this, it will also become much easier to spot things like data leakage.
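As a small illustration, the sketch below chains a StandardScaler (a transformer) with a LogisticRegression (the final estimator). The synthetic data and the step names "scale" and "clf" are arbitrary choices made for this example.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data, for illustration only: the label depends on the first two features.
rng = np.random.default_rng(seed=0)
X = rng.normal(size=(200, 3))
y = (X[:, 0] + X[:, 1] > 0).astype(int)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# A series of transformers followed by a final estimator.
pipe = Pipeline(steps=[
    ("scale", StandardScaler()),
    ("clf", LogisticRegression()),
])

# fit() fits the scaler on the training data only, then trains the classifier
# on the scaled training data.
pipe.fit(X_train, y_train)

# predict() and score() reuse the already fitted scaler on the test data.
predictions = pipe.predict(X_test)
accuracy = pipe.score(X_test, y_test)
print(accuracy)
```

Because the scaler is fitted inside the pipeline, its statistics come from the training split alone. That is what helps against data leakage: information from the test set never ends up in the preprocessing step.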

Keep in mind, though, that scikit-learn pipelines only work with transformers and estimators that follow the scikit-learn interface (such as the ones shipped with the library itself), and that all steps need to run in the same runtime. These pipelines are thus very different from the orchestration pipelines you can make in Airflow or Luigi. With Airflow or Luigi you could for instance run different parts of your pipeline on different worker nodes, while keeping a single control point.

Pandas Pipes

Pandas pipes are another example of pipelines for a specific Python package, in this case Pandas. Pandas is a popular data analysis and manipulation library. The bigger your data analysis becomes, the messier your code can get as well. Pandas pipes offer a way to clean up the code by allowing you to chain multiple functions into a single expression, similar to scikit-learn pipelines.

Pandas pipes have one criterion: every step should be a function with a DataFrame as argument, and a DataFrame as output. You can add as many steps as you need as long as you adhere to that criterion. Your functions are allowed to take additional arguments next to the DataFrame, which can be passed to the pipeline as well.

With the DataFrame in, DataFrame out principle, Pandas pipes are quite versatile. They are comparable to scikit-learn pipelines in terms of use cases, and thus also differ a lot from Airflow and Luigi. They might be pipelines as well, but of a very different kind.
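Here is a short sketch of what that looks like with DataFrame.pipe. The three step functions and the example data are made up for illustration; the only real Pandas API used for the chaining is the pipe method itself.

```python
import pandas as pd


def drop_missing(df: pd.DataFrame) -> pd.DataFrame:
    """Remove rows that contain missing values."""
    return df.dropna()


def add_total(df: pd.DataFrame, price_col: str, qty_col: str) -> pd.DataFrame:
    """Add a 'total' column; returns a new DataFrame and leaves the input as is."""
    return df.assign(total=df[price_col] * df[qty_col])


def keep_above(df: pd.DataFrame, column: str, threshold: float) -> pd.DataFrame:
    """Keep only the rows where `column` exceeds `threshold`."""
    return df[df[column] > threshold]


orders = pd.DataFrame({
    "price": [10.0, 5.0, None, 2.5],
    "quantity": [3, 1, 4, 2],
})

# Every step takes a DataFrame and returns a DataFrame, so the calls can be chained.
# Extra arguments are passed along after the function itself.
result = (
    orders
    .pipe(drop_missing)
    .pipe(add_total, price_col="price", qty_col="quantity")
    .pipe(keep_above, column="total", threshold=5.0)
)
print(result)
```

Without pipe, the same logic would read inside out as keep_above(add_total(drop_missing(orders), ...), ...), which gets harder to follow with every step you add.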

Comparison

Clearly all these different pipelines are fit for different types of use cases, and might even work well in combination. It is for instance completely possible to use Pandas Pipes within the deployments of a UbiOps pipeline, in this way combining their strengths. Let’s put the aforementioned pipelines side by side to sketch the bigger picture.

In the spectrum of pipelines, Luigi and Airflow are on the higher level software orchestration side, whereas Pandas pipes and scikit-learn pipelines are down to the code level of a specific analysis, and UbiOps is somewhere in between. Luigi and Airflow are great tools for creating workflows spanning multiple services in your stack, or scheduling tasks on different nodes. Pandas Pipes and scikit-learn pipelines are great for better code readability and more reproducible analyses. UbiOps works well for creating analytics workflows in Python or R.

There is some overlap between UbiOps, Airflow and Luigi, but they are all geared towards different use cases. UbiOps is geared towards data science teams who need to put their analytics flows in production with minimal DevOps hassle. Luigi is geared towards long-running batch processes that can span days or weeks and in which many tasks need to be chained together. And lastly, Airflow is the most versatile of the three, allowing you to monitor and orchestrate complex workflows, but at the cost of simplicity. With its increased versatility and power also comes a lot of complexity and a steep learning curve.

scikit-learn and Pandas pipelines are not really comparable to UbiOps, Airflow or Luigi, as each of them only works within its own library. scikit-learn and Pandas pipelines could actually be used in combination with UbiOps, Airflow or Luigi by just including them in the code run in the individual steps in those pipelines!

 

Conclusion

Data pipelines are a great way of introducing automation, reproducibility and structure to your projects. There are many different types of pipelines out there, each with their own pros and cons. Hopefully this article helped with understanding how all these different pipelines relate to one another.

This article was written by Anouk Dutrée, the Product Owner at UbiOps. 

 

 

—————–

About UbiOps

UbiOps is an easy-to-use deployment and serving platform built on Kubernetes. It turns your Python & R models and scripts into web services, allowing you to use them from anywhere at any time. You can embed them in your own applications, website or data infrastructure without having to worry about security, reliability or scalability. UbiOps takes care of this for you and helps you get ahead with MLOps in your team.

UbiOps is built to be as flexible and adaptive as you need it to be for running your code, without compromising on performance or security. We’ve designed it to fit like a building block on top of your existing stack, rather than having to make an effort to learn and adopt new technologies. It lets you manage your models from one centralized location, offering flexibility and governance.

You can find more technical information in our documentation -> www.ubiops.com/docs 

To help you get up to speed with UbiOps, we have prepared some examples and quickstart tutorials -> www.ubiops.com/docs/tutorials/quickstart/