Introduction to cloud computing for Data Scientists #1

At Dutch Analytics, we are big fans of the cloud. It offers us the resources and flexibility we require, without the hassle of maintenance. But we notice that for many of our clients, computing in the cloud seems like a daunting task. Because most Data Scientists start out in a field other than Computer Science or Software Engineering, many core principles of the cloud are unknown to them. That’s why we are publishing a two-part ‘Introduction to cloud computing for Data Scientists’ series, to get you up to speed in no time. In the first part, we’ll introduce the concept of cloud computing, explain why it is important for Data Scientists, and discuss what to consider when choosing a cloud vendor. The second part will focus on the different cloud service models and security in the cloud. This article is meant to introduce the core concepts of cloud computing and serves as a starting point for your first cloud adventures, whether you’re a Data Scientist, Product Owner, or in any other way involved with applications that could benefit from migrating to the cloud.
  1. What is the cloud? 
  2. Why is cloud computing important for data scientists?
  3. How to choose a cloud vendor?
  4. How to choose the product/services that I need?
  5. How is security handled in the cloud?

What is the cloud?

First of all, what do we mean when we talk about ‘the cloud’, apart from that fluffy white stuff floating in the sky? It seems like an abstract concept, but it simply refers to computers that you reach over the internet. When we talk about cloud computing, we mean computing that does not happen on your local computer, but is sent over the internet to be done on a remote server. To understand why this can be profitable, we have to take a look at the origin of cloud vendors.

Cloud vendors

Cloud vendors are companies that own a lot of computing resources and rent them out to other parties. One of the first cloud vendors was not Google, Microsoft or any other tech giant you might expect, but an online bookseller named Amazon. Back in the day, hosting a website as large as Amazon’s meant running huge in-house data centers (a data center is a collection of many computers, all linked together to provide enormous amounts of computing power and memory).

Downsides?

The only downside: the costs of maintaining these data centers were incredibly high. Just think of the electricity bill of all those computers combined, not to mention the costs of the engineers who have to keep all of it running. On top of that, Amazon did not need all of its in-house computing power all year round. It had enough capacity to get through the busy Christmas period, but in the slow summer months the computers would just sit idle. Still, Amazon was paying for them. The company decided that it could rent out its machines to other companies for a fee, which was a win-win for everybody: all of the machines were used all year round, and the renting companies didn’t have to set up (or maintain) expensive data centers themselves. The first cloud vendor was born.

So how does the cloud work? 

Nowadays there are many cloud vendors active on the market, and they all have a similar set-up: they own a large number of computers that they’ve made available for other people to rent. The idea is as simple as it sounds: a company or individual can rent machines and run applications on them for as long as necessary. If you need more computing power, you scale up; if you need less, you scale down. The cloud vendor (also called cloud provider) takes care of all maintenance, makes sure the machines are always running and have the latest updates, and takes care of security for you. Especially now, with the Covid-19 pandemic raging on and whole companies forced to work from home, the cloud’s flexibility and scaling capacities are more relevant than ever, as stated by Forbes.
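
The reasoning behind renting instead of owning can be made concrete with a little arithmetic. The sketch below is only an illustration: the monthly demand, the cost per owned machine and the hourly rental rate are made-up numbers, not prices from any actual vendor. It compares owning enough machines for the busiest month with renting only what each month requires.

```python
# All numbers in this example are hypothetical and only illustrate the reasoning.
HOURS_PER_MONTH = 730  # rough average number of hours in a month


def owned_cost(peak_machines, monthly_cost_per_machine, months=12):
    """Cost of owning enough machines for the busiest month, all year long."""
    return peak_machines * monthly_cost_per_machine * months


def rented_cost(machines_needed_per_month, hourly_rate):
    """Cost of renting only the machines that are needed in each month."""
    return sum(n * hourly_rate * HOURS_PER_MONTH for n in machines_needed_per_month)


# Hypothetical demand: quiet most of the year, busy in November and December.
demand = [10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 40, 50]

owned = owned_cost(peak_machines=max(demand), monthly_cost_per_machine=100)
rented = rented_cost(demand, hourly_rate=0.20)

# Share of the owned capacity that is actually used over the year.
utilization = sum(demand) / (max(demand) * len(demand))

print(f"Owning peak capacity: {owned:.0f}")
print(f"Renting on demand:    {rented:.0f}")
print(f"Utilization of owned machines: {utilization:.0%}")
```

In this made-up scenario a rented machine costs more per hour than an owned one, yet renting is cheaper over the year because the owned machines sit idle most of the time. With a flatter demand curve the outcome can flip, so it is worth running the numbers for your own workload.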

Why is cloud computing important for a Data Scientist? 

Data Scientists are increasingly expected to be able to work with cloud tools and frameworks. It’s hard to find a vacancy or job description that doesn’t mention experience with AWS, GCP or Azure as a big plus or even a required skill. As organizations increasingly store their data in the cloud, isn’t it reasonable to expect a Data Scientist to be able to retrieve and handle the data they are working with?

Big or small data

The arrival of cloud computing has paved the way for efficient storage and processing of Big Data (Big Data: data that is big in volume, velocity or variety), but companies with not-so-big data are bringing it to the cloud as well. Why? Because the infrastructure needed to accommodate their data, and everything they want to do with it, is already present in the cloud. Moreover, they don’t need to worry about server maintenance, scalability, security or replication of data (more on cloud security in part two of this series). And if all the data resides in the cloud, it’s accessible to the whole company, from anywhere and at any time.

The cloud is your friend

As good as the cloud sounds, for a Data Scientist it can feel like a burden: it’s another framework to learn, more concepts to grasp and yet another skill to master. You may be used to getting models running on your local machine, but sending your application and all its dependencies (not to mention sensitive data) over the internet seems a little tricky. Still, the cloud is about making your life easier. If your goal as a Data Scientist is to work with the latest technologies and run the newest models reliably, without worrying about the underlying infrastructure or a cap on memory and computing power, the cloud is your friend.

How to choose a cloud vendor?

Once you’ve decided to migrate (part of) your application to the cloud, the next question soon follows: which cloud provider should I choose? Or: how does the provider chosen by my company compare to other cloud providers? Where to even get started? The best-known cloud providers currently are Microsoft Azure, Amazon Web Services (AWS) and Google Cloud. Even though they are the biggest players, the market is still expanding, which can be seen from the increasing share of parties like IBM Cloud, Oracle, Alibaba Cloud, VMware and DigitalOcean.

Cloud services 

In principle, cloud providers offer one thing: computers you can rent. However, to make migrating to the cloud as easy as possible, cloud providers nowadays offer many more services on top of that. As a Data Scientist, you will most likely want to allocate some computers to store data and others to do the computing. But setting up your own databases or applications on a freshly installed Operating System (OS) might be a little too much effort. To make this easier, cloud vendors provide services for commodities that most people will want to use, like storage and compute. For example, Amazon has ‘S3’ buckets (S3 stands for Simple Storage Service) and Google offers Cloud Storage. These services allow you to simply upload your data files to the cloud and access them at any time, without worrying about what’s happening under the hood. For computing, most providers have dedicated services that make data analysis or even full-blown machine learning easy. Before choosing a cloud vendor, it’s good to think about which additional services it provides that might benefit you, but also consider ease of adoption: even though AWS has the broadest range of services, its platform can be difficult to navigate, which makes for a steep learning curve when you are just starting out in the cloud. Be sure to try multiple vendors before making a decision.
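
To give an idea of what ‘just upload your data files’ looks like in practice, here is a minimal sketch that uses the upload_file and download_file methods of the S3 client from boto3, the AWS library for Python. The function name upload_and_fetch, the bucket name and the file names are placeholders we made up for illustration. Running it against the real service would require the boto3 package, AWS credentials and an existing bucket.

```python
def upload_and_fetch(s3_client, bucket, key, local_path, download_path):
    """Upload a local file to a bucket, then download it again.

    `s3_client` is expected to behave like the client returned by
    boto3.client("s3"); all other arguments are plain strings.
    """
    # Store the file in the bucket under `key` (its name in the cloud).
    s3_client.upload_file(local_path, bucket, key)
    # Retrieve the same object later, possibly from a different machine.
    s3_client.download_file(bucket, key, download_path)


# Usage with the real service (placeholder names, not run here):
#   import boto3
#   s3 = boto3.client("s3")
#   upload_and_fetch(s3, "my-example-bucket", "data/train.csv",
#                    "train.csv", "train_copy.csv")
```

Note that nothing in this code says which server or disk the file ends up on. That is exactly the part the storage service handles for you.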

Vendor lock-in

There is a downside to using the vendors’ services: it makes you vulnerable to vendor lock-in. As your company migrates more and more of its applications to the cloud, it gives away part of the ownership of those applications. The application naturally remains yours, but because the vendor owns the infrastructure that your application depends on, the vendor’s actions have a direct impact on the application. In 2009, Amazon remotely removed books from Kindle e-readers (NY Times article). Whether this was legal is up for debate, but it goes to show that the lines of who-owns-what become blurry and that some autonomy could go out of the window when you purchase services from a cloud provider. Depending solely on another company for hosting your product carries a further risk: what if the vendor decides to discontinue a service that your application depends on? Google has been notorious for ending services that only lived for a few years. If this happens, and your application depends heavily on that particular service, it can be complicated to integrate your application with another provider’s services. Lastly, as more and more of a company’s data is transferred to the cloud, the costs of moving data from one vendor to another become substantial, making it very expensive to switch providers later on. Make sure that what your application can do is not limited to the abilities and features of a single cloud service.
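
One common way to limit this dependency, which we sketch here as an illustration rather than a recommendation, is to let your application talk to a small interface of your own instead of directly to a vendor’s service. All names below (Storage, LocalStorage, store_predictions) are hypothetical. The local implementation simply writes files to disk; an implementation for a specific vendor would offer the same two methods.

```python
from pathlib import Path
from typing import Protocol


class Storage(Protocol):
    """The only storage operations the application is allowed to rely on."""

    def save(self, key: str, data: bytes) -> None: ...

    def load(self, key: str) -> bytes: ...


class LocalStorage:
    """Stores objects as files in a local directory (handy for testing)."""

    def __init__(self, root: str) -> None:
        self.root = Path(root)

    def save(self, key: str, data: bytes) -> None:
        path = self.root / key
        # Create intermediate directories so keys like "a/b.csv" work.
        path.parent.mkdir(parents=True, exist_ok=True)
        path.write_bytes(data)

    def load(self, key: str) -> bytes:
        return (self.root / key).read_bytes()


def store_predictions(storage: Storage, predictions: list[float]) -> None:
    """Application code: depends on the interface, not on a vendor."""
    payload = ",".join(str(p) for p in predictions).encode("utf-8")
    storage.save("results/predictions.csv", payload)
```

Switching providers then means writing one new class with save and load methods, while store_predictions and the rest of the application stay untouched. It does not remove the cost of moving the data itself, and it only works as long as you stick to features that every provider offers.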

Stay up to date!

As more and more technologies are developed, cloud computing will play an ever bigger role. The reason is clear: the cloud is meant to make our lives easier. We hope this article helped you understand the key concepts and gives you a good base for diving deeper into cloud computing. In the next part of the series, we will focus on the different cloud service models and security in the cloud. If you want to stay up to date with our publications, subscribe to the newsletter and follow our social media.
