Introduction to cloud computing for Data Scientists – #2
As the field of cloud computing expands, many of our clients wonder whether migrating to the cloud is a feasible option for them. It can be hard to keep up when new services and vendors go online every day. We believe that the key to knowing whether the cloud is right for your business is understanding the basics of cloud computing. This is relevant for everyone involved in a data science application, from Data Scientists and Product Owners to the management team.
This blog post is the final part in our series introducing the cloud’s core principles. In the first part, we introduced cloud computing and its vendors, and explained why it is important for Data Scientists. This second part focuses on the different cloud service models and security in the cloud.
How do you choose the products and services that you need?
In the first part, we talked about becoming dependent on your cloud provider through ‘vendor lock-in’. Another factor that determines your level of dependency on the cloud provider is the deployment model that you choose. This model also affects the Data Scientist’s way of working.
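In general, we consider three deployment models:
- Infrastructure-as-a-service (IaaS)
- Platform-as-a-service (PaaS)
- Software-as-a-service (SaaS)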
In the graph below, the three models are depicted. The higher up the pyramid, the more managed the service is and the less control you have over it.
IaaS stands for Infrastructure-as-a-service and is the least managed of the three models. It comes closest to the basic set-up that we mentioned before. It allows you to rent ‘bare’ machines, and what you put on them is completely up to you. You decide which OS runs on them. You can also install your own (custom) databases and manage the environment of your application.
The client is most ‘in-control’ with respect to what happens on the machines, but there’s a downside to this freedom. It requires a lot of expertise to monitor and maintain the infrastructure. When something does go wrong, there is much less support available to help you handle the situation. For that reason, choosing IaaS as a service model can only be cost-effective if there’s a team of IT administrators that is able to maintain the infrastructure.
PaaS stands for Platform-as-a-service and sits one level above IaaS in terms of how much is managed for you. PaaS offerings allow developers to create applications in the cloud without having to worry about which OS to install or which database to deploy. There is slightly less control over the environment of the machines you rent than with the IaaS model, but no time is wasted on setting up databases and network connections. Everything is in place for skilled developers to quickly deploy applications in the cloud. Moreover, the cloud provider offers ample documentation and tutorials, so the development team gets support along the way if there are issues.
Lastly, the SaaS model, the most managed of the three, stands for Software-as-a-service. This is the model most people are familiar with: well-known SaaS products include Google’s Gmail, Office 365 and Salesforce CRM. UbiOps is also offered as a SaaS solution for businesses and individuals. With the SaaS model, the client does not need any coding skills to interact with the application. The product often has an intuitive user interface, allowing the user to get started quickly without a steep learning curve. If there is a problem, the vendor’s support team is available to help solve it. The downsides of the SaaS model are that it gives the user less control to fully customize the product and that it is usually more costly than the other deployment models.
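The difference between the three models can be summarized as a small lookup of who manages which layer. The sketch below is a simplification: the four layer names and the exact split per model are illustrative assumptions based on the descriptions above, not an official definition from any cloud provider.

```python
# Illustrative layers of a deployment stack, from bottom to top (assumed names).
LAYERS = ["hardware", "operating system", "database", "application"]

# Layers the cloud provider manages under each model (simplified assumption).
PROVIDER_MANAGED = {
    "IaaS": {"hardware"},
    "PaaS": {"hardware", "operating system", "database"},
    "SaaS": {"hardware", "operating system", "database", "application"},
}


def client_managed(model: str) -> list[str]:
    """Return the layers the client still has to manage under a given model."""
    managed = PROVIDER_MANAGED[model]
    return [layer for layer in LAYERS if layer not in managed]


for model in PROVIDER_MANAGED:
    print(model, "-> client manages:", client_managed(model) or "nothing")
```

The shorter the list for a model, the more managed the service is and the less control the client has over it.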
How is security handled in the cloud?
A big concern for many companies migrating to the cloud is keeping their private data secure behind closed walls, while still allowing anyone with an internet connection to access their application. How can the cloud provider guarantee the security of the data? How is security managed for the three deployment models? Read more on that in this Tripwire article. While we will not go further into the details of how to achieve a secure set-up in the cloud in this article, we would like to focus on two questions that are asked frequently.
Where is cloud data stored?
The location of where cloud data is stored is important for two reasons:
- First of all, the location of the data needs to be physically secure, to prevent criminals from stealing the hardware and the software and data on it.
- Second of all, many countries have privacy-related regulations in place concerning the location of the data.
As discussed in the first part of this series, the data of the cloud resides in massive data centers, owned by the cloud provider. However, these centers are no ordinary office buildings of the kind you find in any city center’s business district. As you can imagine when hosting the world’s largest collection of (private) data, data centers look more like big bunkers with a military level of defence. Employees are only allowed entry when they have to perform maintenance tasks, and they are monitored the whole time to ensure no theft occurs.
Location also matters
Aside from the building’s security, the location also matters. The climate cannot be too hot or too cold, to prevent the machines from malfunctioning. There can be no danger of floods, earthquakes, or any other type of natural disaster. Internet connectivity should be excellent and the electricity in the building can’t ever go down. For that reason, the data center is often located right next to a power plant to ensure availability. Some data centers have their own fire department to quickly put out fires in case of emergency. To top it all off, most cloud providers make replicas of your data on multiple machines or even across multiple data centers in the same region. In the event of hardware failure of one of the machines, your data is not lost and can be restored easily.
The above measures help ensure that both the hardware and the software are protected and that your data remains available.
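The replication idea can be sketched in a few lines. This is a toy model only: `ReplicatedStore`, the machine names and the stored key are hypothetical, and real providers use far more sophisticated storage systems.

```python
class ReplicatedStore:
    """Toy key-value store that writes every value to several machines."""

    def __init__(self, machine_names):
        # Each machine is modelled as a plain dict (an assumption for illustration).
        self.machines = {name: {} for name in machine_names}

    def put(self, key, value):
        # Write the value to every machine so each one holds a replica.
        for storage in self.machines.values():
            storage[key] = value

    def fail(self, name):
        """Simulate a hardware failure by dropping one machine entirely."""
        del self.machines[name]

    def get(self, key):
        # Read from the first surviving machine that holds the key.
        for storage in self.machines.values():
            if key in storage:
                return storage[key]
        raise KeyError(key)


store = ReplicatedStore(["machine-a", "machine-b", "machine-c"])
store.put("model.pkl", b"trained weights")
store.fail("machine-a")           # one machine breaks down
print(store.get("model.pkl"))     # the data is still served by a replica
```

Because every write lands on several machines, losing one of them does not lose the data.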
Following GDPR regulations
Under the GDPR, personal data of people in the EU can only be transferred outside the European Economic Area under specific conditions, such as an adequacy decision or other appropriate safeguards. That’s one reason why most cloud providers have one or more data centers located in each part of the world. However, stricter regulations may apply and differ between countries. This is why it’s always important to check where the cloud provider stores your data.
Who can access my data?
As mentioned above, the level of security of the machines inside a data center is outstanding. But how is the security inside a machine controlled? In Part 1 of this series, we spoke of renting ‘machines’. Ideally, one computer would be rented by only one party, but in practice this turned out not to be feasible for the provider. Cloud computing is cost-effective because one machine can be shared by many clients using ‘virtualization’. Virtualization is the process of creating a virtual (rather than an actual) version of a resource, such as a server, desktop or even an entire Operating System (OS).
2a. A setup of a local machine. 2b. Set-up enabling virtualization of a machine
In this way, your application is not stored and run directly on the OS of a real machine, but on an OS that is itself a (containerized) package running on a hypervisor layer, which in turn runs on a physical machine. The hypervisor layer monitors the different virtual machines and configures their environments. Effectively, the cloud provider is not hosting your application, but a containerized package that holds your data, your software and the OS it runs on. This package can be ‘activated’ (or provisioned) on any machine in the data center, leading to many advantages.
Instant provisioning allows for fast scaling (which earned the cloud its popularity), and it also allows for load balancing. When we push our own computers to the limit by executing many tasks at the same time, CPU usage goes up and performance goes down. Eventually, your computer will freeze. In the cloud, if the load gets too high for one machine, another machine can provision a copy of the container and take over some of the workload. This helps maintain performance for the client.
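A minimal sketch of this scale-out behaviour is shown below. The `Machine` class, the `assign` function and the `MAX_LOAD` threshold are all hypothetical names and values chosen for illustration; actual cloud load balancers and autoscalers work with much richer metrics.

```python
MAX_LOAD = 0.8  # assumed threshold: fraction of capacity before scaling out


class Machine:
    """A machine that tracks the fraction of its capacity in use."""

    def __init__(self, name):
        self.name = name
        self.load = 0.0


def assign(machines, task_load):
    """Place a task on the least-loaded machine, provisioning a new one if needed."""
    target = min(machines, key=lambda m: m.load)
    if target.load + task_load > MAX_LOAD:
        # Even the least-loaded machine is too busy:
        # provision another copy of the container on a new machine.
        target = Machine(f"machine-{len(machines) + 1}")
        machines.append(target)
    target.load += task_load
    return target.name


machines = [Machine("machine-1")]
placements = [assign(machines, 0.3) for _ in range(5)]
print(placements)
print("machines in use:", len(machines))
```

Each task goes to the least-loaded machine, and a new machine is only added when that machine would exceed the threshold, so the pool grows with the workload.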
Downside to virtualization in the cloud
One machine can host the containers of multiple users, also called ‘tenants’. Just like with normal tenancy of houses, you share some of the infrastructure and may occasionally be bothered by your ‘neighbours’. For instance, if the workload of one tenant increases very fast, this affects the performance of the entire machine and could result in lower performance for the other tenants. In addition, since the data of multiple tenants sits side by side on the same machine, there will always be a risk of exposing the data or application to other tenants. Even though many security measures are taken to prevent this from happening, (human) errors do occur and can result in data leaks.
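The ‘noisy neighbour’ effect can be illustrated with a toy calculation. The sketch assumes that an overloaded machine divides its CPU among tenants in proportion to their demand; this is a simplifying assumption, and the function name `cpu_share` and the tenant names are hypothetical. Real hypervisors use more elaborate scheduling and often enforce per-tenant limits.

```python
def cpu_share(demands, capacity=1.0):
    """Split a machine's CPU among tenants, proportionally to demand when overloaded."""
    total = sum(demands.values())
    if total <= capacity:
        return dict(demands)  # enough capacity: everyone gets what they ask for
    scale = capacity / total
    return {tenant: demand * scale for tenant, demand in demands.items()}


# Tenant A asks for the same amount of CPU in both situations.
quiet = cpu_share({"tenant-a": 0.3, "tenant-b": 0.3})
noisy = cpu_share({"tenant-a": 0.3, "tenant-b": 1.2})

print("tenant-a with a quiet neighbour:", quiet["tenant-a"])
print("tenant-a with a noisy neighbour:", round(noisy["tenant-a"], 2))
```

Tenant A’s own demand never changes, yet it receives less CPU once its neighbour’s workload spikes.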
In this article series, we explained why we are fans of the cloud. Its availability, capacity and robustness give us the power to develop UbiOps fast and with ease. But we recognize that the cloud can be intimidating at first, especially for data scientists. We believe that you only truly know whether the cloud is right for you if you understand its core principles. Choosing a cloud provider that offers the service suiting your team’s needs then becomes easy, and it will ultimately result in a happy marriage between you and the cloud.