In most companies we speak with, data science teams operate separately from software engineering and IT teams.
Data scientists focus on devising, developing, and conducting experiments with many iterations. Software developers are focused on building stable, robust, and scalable solutions.
An algorithm alone is not enough to bring value. You want to run it live in the background as part of a larger application, feed data through it, connect it to a frontend, and so on. Therefore, developing end-to-end data-driven applications requires expertise in both data science and in software, IT infrastructure, and DevOps to tie everything together.
The handover of an algorithm from a data scientist to the software team for integration in the larger scope of things is often a time-consuming, and sometimes even frustrating, experience. By looking at the situation from the perspective of both the data scientists and the software engineers, there are many possibilities for keeping both sides happy and productive.
The data science perspective
First, let’s have a look at the priorities for a data science team.
Their daily work is centered around designing and executing experiments. This usually means huge Jupyter notebooks and a constantly changing set of algorithms and software libraries.
The focus lies on quickly realizing the best possible prototype algorithm. It is about the performance of the model and answering the business questions. Testing more hypotheses therefore often takes priority over writing cleaner code, and that is okay: at this stage, a rough working prototype is more useful than a clean, well-written algorithm that does not do the trick.
- For a data scientist, the ability to prototype rapidly is important. That is also why Python as a programming language and notebooks as a development environment are the main tools of a data science team. Python also opens the door to a huge realm of open-source data science libraries, making development even quicker, as many algorithms and implementations come out of the box. Some, like neural networks, may require a lot of computing resources and GPUs to train, while others can quickly be tested on a laptop. This implies the need for flexibility in infrastructure and dependencies.
- Furthermore, experiments are often done in a team. That requires an environment where everyone can exchange code, run experiments, and collaborate. Notebooks are often the easiest to exchange, as they usually contain not only the algorithm but also the data preprocessing and cleaning steps, graphs displaying performance and results, and so on.
Then, when the dust has settled, a prototype algorithm emerges which has to be integrated into the larger scope of the end-to-end application.
The software perspective
From a software and IT viewpoint, other things are of importance.
They need to guarantee the availability, security, and robustness of the end solution, of which a data science algorithm is just one part. This task brings a different set of priorities.
- In the first place, this concerns the handling of possible future errors, clear documentation, and the efficiency of the implementation (speed of the code). That is why, in practice, the starting point is usually restructuring the code: detaching the model from a notebook, eliminating all code related to experiments from the previous phase, and preserving the essentials. This elemental part then needs to be transformed into a service that can receive, process, and return data.
- Especially with data science applications, an algorithm might require updates after a while. This can be because a better performing version is available, or the model does not perform as expected anymore in live conditions and retraining is necessary. Therefore, there is a need for a way to have version control on the algorithms in a live environment, so updates and rollbacks can be made efficiently and without downtime.
- Changes to algorithms must be traceable, so that when things go wrong, the cause can be pinpointed. Transparency and auditing of the algorithm, the data flow, and the underlying infrastructure become increasingly important when decisions are made based on their outcomes. You want to know exactly who changed what.
- All services need to be robust, secure, and available. A service needs good error handling so that it does not crash abruptly when it receives faulty data, for instance. Another important condition for the production environment is the availability of the solution; think of automatic restarts when systems fail.
- Every application needs a certain amount of scalability. It may be used differently throughout the day or week, receive bursts of data at intervals, or see its usage grow over time. In all cases, you want to have automated the scaling of the application before any limits are reached.
- While not applicable to every situation, it can be very worthwhile to ensure that algorithms are portable between environments. This applies when there are separate development, test, and production environments, but also when you use a hybrid cloud setup and want the ability to switch where your algorithm runs without having to change the code or runtime itself.
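Two of the points above — detaching a prototype model from its notebook and exposing it as a service with proper input validation, and keeping versions around so rollbacks are quick — can be sketched in plain Python. This is a minimal illustration with hypothetical names, not a production implementation: a real setup would use a web framework and a model store, and the "models" here are stand-in functions.

```python
class ModelRegistry:
    """Keeps every deployed model version in memory so rollbacks are instant."""

    def __init__(self):
        self._versions = {}   # version label -> model callable
        self._active = None   # currently served version

    def deploy(self, version, model):
        """Register a new model version and make it the active one."""
        self._versions[version] = model
        self._active = version

    def rollback(self, version):
        """Switch back to a previously deployed version."""
        if version not in self._versions:
            raise KeyError(f"unknown version: {version}")
        self._active = version

    def predict(self, payload):
        """Validate the payload and return a structured response
        instead of crashing on faulty input."""
        features = payload.get("features")
        if not isinstance(features, list) or not features:
            return {"status": "error", "message": "'features' must be a non-empty list"}
        if not all(isinstance(x, (int, float)) for x in features):
            return {"status": "error", "message": "features must be numeric"}
        model = self._versions[self._active]
        return {"status": "ok", "model_version": self._active,
                "prediction": model(features)}


registry = ModelRegistry()
registry.deploy("v1", lambda xs: sum(xs) / len(xs))  # toy model: mean of features
registry.deploy("v2", lambda xs: max(xs))            # "improved" toy model

print(registry.predict({"features": [1.0, 2.0, 3.0]}))  # served by v2 -> 3.0
registry.rollback("v1")                                  # v2 misbehaves? roll back
print(registry.predict({"features": [1.0, 2.0, 3.0]}))  # served by v1 -> 2.0
print(registry.predict({"features": "oops"}))            # structured error, no crash
```

The point of the sketch is the shape, not the model: the service always returns a structured answer, and every version stays addressable so an update or rollback is a metadata change rather than a redeploy of code.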
As you can see, this is a very different set of responsibilities than the data science team has. But that is not a bad thing at all, as long as the handover of work goes as smoothly as possible.
Maintaining a separation of concerns
A smooth handover and collaboration between data science and software teams will save companies a lot of time and costs. But what is needed to achieve this?
In short, data scientists need to spend their time doing experiments and designing the best algorithms. Software engineers are responsible for the availability, security and robustness of the final application. A good platform and infrastructure would serve both at the same time.
How did we approach this problem?
We approached this problem by ensuring that the responsibilities of data scientists cannot affect the responsibilities of the software engineers, while maintaining a strict separation of concerns between the two. In our own architecture, the data science code is containerized and isolated from the underlying infrastructure.
Data scientists can define and construct the runtime for their algorithms, with plenty of flexibility to install libraries of their choice and enough knobs to turn to match resource provisioning to the algorithm's needs. Deployment requires only limited code refactoring by data scientists. Security is ensured by isolating these containers from the rest of the infrastructure and running them with minimal permissions.
Software, development and operations
For the software and DevOps side, robustness and scalability are ensured by Kubernetes and custom services, which take care of scaling and managing the containers in which the data science code runs: graceful failover, for instance, without affecting other active services in the process. The software team only has to manage the cluster of underlying infrastructure; the data science code, its runtime dependencies, scaling, and monitoring are abstracted away, so one cannot affect the other.
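As a generic illustration of how Kubernetes provides these properties (this is not our platform's actual configuration — all names below are hypothetical), a Deployment for a containerized algorithm declares its replica count, resource requests and limits, and a liveness probe, so the cluster restarts failed containers on its own and scaling is a matter of changing one number:

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: churn-model                # hypothetical algorithm service
spec:
  replicas: 3                      # scale by changing this (or via an autoscaler)
  selector:
    matchLabels:
      app: churn-model
  template:
    metadata:
      labels:
        app: churn-model
    spec:
      containers:
        - name: model
          image: registry.example.com/churn-model:v2   # versioned image, easy rollback
          resources:
            requests: {cpu: "500m", memory: "512Mi"}   # resource "knobs"
            limits:   {cpu: "1",    memory: "1Gi"}
          livenessProbe:                               # automatic restart on failure
            httpGet: {path: /healthz, port: 8080}
            initialDelaySeconds: 10
```

Because the image tag carries the model version, rolling out an update or reverting to the previous algorithm is a declarative change to this file, which Kubernetes applies without downtime via a rolling update.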
This is our approach to supporting a happy marriage between the data science and software team.