The new world of MLOps
Are you or any of your colleagues Data Scientists? Have you been involved in bringing data science models into production? Do you recognize how much effort this takes? Have you seen that the actual machine learning code is only a tiny fraction of the whole solution? Then you should read further.
For a few years, we at Dutch Analytics focused on developing data-driven solutions for predictive maintenance. To serve our clients, we had to set up an entire custom infrastructure every time we put a data science model into production, and we did not like it. Model code was the result of experimenting back and forth rather than of a structured design written with maintainability in mind. The model was also written to operate on a test set instead of on live data, so time was wasted rewriting and restructuring the code solely for production. The deployment pipeline (Travis CI) and the infrastructure itself (a couple of cloud services) changed with each model. For monitoring, we had to log into a server to check the log file. We thought this could be done much better and started building our own data science infrastructure on top of Kubernetes, because existing technologies did not answer our needs at that time.
Recently we wrote about why data science needs more Ops: best practices from the DevOps world are adjusted to the specific needs of data scientists. The result is a repeatable way to deliver and maintain data science models. We think that not every data scientist should have to become a DevOps specialist, nor should every DevOps specialist have to become a machine learning engineer. Instead, we think a specialized piece of technology should fill the gap between Data Scientists and DevOps specialists by answering the needs described in the ten commandments of MLOps.
But how do you fit this technology into your existing landscape? What about security? Most technologies are offered either as SaaS or in a more self-managed way (managed VPC, self-managed VPC, self-managed on-premises).
Self-managed vs SaaS
The holy grail of SaaS is 100% self-service: no need to purchase, install or maintain anything before you can deploy your model live. That is why companies are increasingly adopting SaaS services to answer their growing IT needs.
As shown in Figure 1 below, it is very easy to get started with a SaaS solution. On your side, a little work is needed to connect to the SaaS solution. For example, a common requirement is to whitelist the IP ranges of the SaaS solution in order to access models/data residing in your infrastructure.
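To make the whitelisting requirement concrete, here is a minimal sketch of the check a firewall or gateway on your side performs. The CIDR ranges are placeholders taken from the address blocks reserved for documentation, not the ranges of any real provider, and the function name `is_allowed` is our own.

```python
import ipaddress

# Hypothetical egress ranges a SaaS provider might publish.
# These are documentation-only address blocks, used purely for illustration.
SAAS_EGRESS_RANGES = ["203.0.113.0/24", "198.51.100.0/25"]


def is_allowed(source_ip: str, allowed_ranges=SAAS_EGRESS_RANGES) -> bool:
    """Return True if source_ip falls inside any allowlisted CIDR range."""
    address = ipaddress.ip_address(source_ip)
    return any(address in ipaddress.ip_network(cidr) for cidr in allowed_ranges)


if __name__ == "__main__":
    for ip in ("203.0.113.42", "198.51.100.200", "192.0.2.1"):
        print(ip, "allowed" if is_allowed(ip) else "blocked")
```

In practice this rule lives in a firewall or security group rather than in application code, but the logic is the same: only traffic originating from the provider's published ranges reaches your models and data.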
The alternative to SaaS, a self-managed ML infrastructure, requires upfront investment and in-house knowledge. Whether the higher investment pays off is doubtful, as a SaaS provider tackles your ‘problem’ on a larger scale. Nevertheless, it may give you the fine-grained tuning of security, performance, and monitoring that you need.
It must be noted that changes in the way software is written and deployed have reduced the upfront investment. Most of the software used to host ML applications is written ‘cloud native’. This means it is built as a set of microservices that rely on scalable backing services such as Kubernetes, managed blob storage, managed databases and networking. These building blocks are offered by the big three cloud providers (AWS, Azure, GCP) in a similar way, which makes ‘cloud native’ software highly transferable from one environment to another.
Still, a self-managed ML infrastructure requires more input from your side. For example, you need an agreed architecture for setting up the infrastructure in such a way that it is accessible and secure; see Figure 2 for a high-level overview. The configuration can be specified in more detail, e.g. whether the Kubernetes engine runs on private nodes, whether an internal or external load balancer is used, in which region or zone the infrastructure is located and how many database replicas are required. On the other hand, integration with backend systems in your infrastructure can be much easier than with SaaS, as the ML infrastructure is already in your environment. Your security officer will be much more willing to approve access to models and data residing in your infrastructure.
Hybrid options such as a dedicated VPC deployment exist, which balance some of the advantages and disadvantages of SaaS and self-managed. To limit the length of this article, we will not elaborate on these options.
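As an illustration of the configuration choices mentioned above, the sketch below captures them in a small data structure and lists the points worth raising with a security officer. The field names, the `review` function and its thresholds are assumptions made for this example; they do not correspond to the schema of any cloud provider or tool, and the region value is only an example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ClusterConfig:
    """Hypothetical settings for a self-managed ML cluster (not a real provider schema)."""
    region: str
    private_nodes: bool
    load_balancer: str  # "internal" or "external"
    database_replicas: int


def review(config: ClusterConfig) -> list[str]:
    """Return findings to discuss before the infrastructure goes live."""
    findings = []
    if not config.private_nodes:
        findings.append("nodes have public IP addresses")
    if config.load_balancer == "external":
        findings.append("load balancer is reachable from the internet")
    if config.database_replicas < 2:
        findings.append("database has no replica for failover")
    return findings


if __name__ == "__main__":
    config = ClusterConfig(
        region="europe-west4",  # example value only
        private_nodes=True,
        load_balancer="external",
        database_replicas=1,
    )
    for finding in review(config):
        print("-", finding)
```

None of these findings is wrong in itself. An external load balancer may be exactly what you need. The point is that in a self-managed setup each of these choices is yours to make and to justify.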
Security
The most important reason to opt for a self-managed solution is security. You may not want to rely on another company to manage sensitive data science models and the data processed by those models. SaaS solutions can offer a very high level of security, and may even facilitate compliance with ISO standards or the GDPR, but in the end they still host your models and process your data.
Before deciding where to host your ML applications, we think the following security aspects are important to consider:
- Model and data: how sensitive are the models and data actually? This varies hugely. Data may be personal data and fall under the GDPR. Models and data may initially be regarded as highly valued IP that requires strict protection, but often turn out to be valuable only in the context of a company whose business processes are driven by the ML application.
- Policy: company rules may simply forbid data to leave the security perimeter of your company. In a self-managed solution you are in control of the security: you can add a firewall, API gateway, or custom encryption key.
- Data retention: is the ML infrastructure only doing inference on real-time data, or is it also storing data for training and other long-term use? Is your data removed as soon as it is processed?
- Features: are your data and models encrypted in transit and at rest? Can you provide your own encryption key? How does the SaaS solution handle multi-tenancy between clients? Is it possible to trace any access to models/data through audit trails? Note that in the end SaaS and self-managed solutions use the same underlying technology, a combination of Kubernetes and a handful of cloud services.
- Legal: is the SaaS located in the EU or is the data processed in the USA? Does the data require GDPR compliance? Are the models and data still your property even when hosted in a SaaS environment?
The above points are meant as starting points for a discussion, and we would like to hear your opinion about them.
Total cost of ownership
A simplistic approach to costs is to look only at the price quotes of companies that provide infrastructure to host your ML applications. Often it is not that simple. Several factors contribute to the total cost of ownership:
- Initial cost:
For SaaS solutions, this is typically as low as creating an account or paying for a small subscription. Setting up and maintaining an in-house infrastructure has a higher initial cost. Fortunately, most machine learning technology is based on commodity building blocks as explained before, thereby significantly lowering the adoption cost for self-managed ML infrastructure.
- Operating cost:
SaaS platforms are pay-per-use, and the price includes a margin on the compute resources actually used. With a self-managed infrastructure, you pay for the compute resources directly, so the cost scales better with a large number of ML applications.
The FTE required for operation should also be taken into account. This is much lower for a SaaS solution until you hit the boundaries of the platform. Beyond that point, operating and debugging are much easier in a self-managed environment.
- Integration cost:
Another reason to prefer self-managed is the integration with existing data stores that are used by your ML applications. Some examples:
1. Making a large set of images accessible to a SaaS solution can be very costly (egress charges) and slow (egress bandwidth vs internal network bandwidth).
2. Arranging access to your data stores from a SaaS solution requires a time investment, possibly a significant one, from the IT department of your company.
3. Access to your data stores from a SaaS solution requires extra security measures (VPN, proxy, etc.) to be put in place.
- Team cost:
Self-managing a data science infrastructure requires some in-house knowledge of how to set it up within the perimeter of the company. Without an IT team that can support the infrastructure, it will sooner or later lack support in the company.
Integration with a self-managed environment is more flexible than with a SaaS platform. Which option is cheaper for you depends on the number of ML applications, your budgets, and your preference for capital or operating expenses.
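The cost factors above can be turned into a rough back-of-the-envelope calculation. The sketch below is a deliberately simple model under our own assumptions: SaaS charges compute plus a margin, self-managed charges compute plus a fixed monthly team cost, and egress is billed per gigabyte. All function names and numbers are hypothetical; replace them with your own quotes before drawing conclusions.

```python
import math


def saas_monthly_cost(apps, compute_per_app, margin, fixed=0.0):
    """Pay-per-use: compute plus the provider's margin, plus any subscription fee."""
    return fixed + apps * compute_per_app * (1 + margin)


def self_managed_monthly_cost(apps, compute_per_app, team_cost):
    """Compute at cost price, plus the fixed cost of the team running the platform."""
    return team_cost + apps * compute_per_app


def break_even_apps(compute_per_app, margin, team_cost, saas_fixed=0.0):
    """Smallest number of apps at which self-managed is no more expensive than SaaS."""
    if compute_per_app <= 0 or margin <= 0:
        raise ValueError("compute_per_app and margin must be positive")
    extra_fixed = team_cost - saas_fixed
    if extra_fixed <= 0:
        return 0
    return math.ceil(extra_fixed / (compute_per_app * margin))


def transfer_estimate(dataset_gb, price_per_gb, bandwidth_mbit_s):
    """Return (egress cost, transfer time in hours) for moving a dataset once."""
    cost = dataset_gb * price_per_gb
    seconds = dataset_gb * 8000 / bandwidth_mbit_s  # 1 GB = 8000 megabit
    return cost, seconds / 3600


if __name__ == "__main__":
    # Hypothetical inputs: 200 per app per month, 25% margin, 4000 per month team cost.
    print("break-even at", break_even_apps(200, 0.25, 4000), "apps")
    cost, hours = transfer_estimate(dataset_gb=500, price_per_gb=0.09, bandwidth_mbit_s=200)
    print(f"moving 500 GB: about {cost:.2f} in egress, {hours:.1f} hours")
```

With these made-up inputs the break-even lies at 80 applications. Below that number, the SaaS margin is cheaper than the fixed team cost; above it, self-managed wins. The model ignores integration effort and initial setup, which is exactly why a price quote alone is not enough.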
Control
Intuitively, an on-premises data science infrastructure is preferred when you want full control, but a more utilitarian view looks further than that.
Technology is much more tunable in a self-managed environment, but not necessarily better. Some examples:
- When low latency really matters, it is always better to have systems located close to each other.
- Specific hardware requirements are more easily fulfilled in a self-managed environment (e.g. a specific GPU type).
- SLA factors (e.g. uptime, latency) can be better in a SaaS environment whose sole purpose is to serve its clients according to the agreed contract.
Maintenance is always part of a software solution: solving issues, providing security updates, etc. SaaS providers typically take good care to keep the platform online and to minimize the impact of updates on their clients. In a self-managed environment, it’s your own responsibility. Whether this is an advantage depends on the expertise and dedication of your team versus the SaaS provider’s team.
Visibility. In a SaaS environment, only the application level logs/metrics are exposed to the end-user. In a self-managed environment, full debugging is possible.
Next steps
This article is meant as a primer for a more elaborate discussion about where to host your ML applications. Our main conclusion is that it is very important to consider the requirements on technology, security and control from the perspectives of the different departments and people in your company. Without a clear understanding of these needs, it’s better to start small by using SaaS services.
SaaS, self-managed, and hybrid hosting methods are converging in the sense that all of them depend on commodity building blocks and cloud providers. This can significantly reduce the differences in total cost of ownership and security.
Most important of all is the expertise of your team. With a team that can quickly learn how to develop and deploy ML models, any technology can be integrated successfully into your company’s landscape, leading to a successful data science strategy.