6  Getting Started with the Cloud

Let’s say your organization decides that it needs a server to process a large monthly analytics job. If the year is before 2006, you basically have only one option: hosting a server yourself – usually called running on-prem.

So before you can start running the job, you’ll need to buy a physical server, wait for it to be delivered, install it in a server rack, and get it configured with the proper operating system, networking, and connectivity. Only then can you start configuring it to run analytics workloads.

Things are different today. While there are still organizations who run on-prem analytics workloads, many organizations are moving or have moved their data science and analytics workloads to the cloud.

So, what is the cloud? To put it simply, the cloud is the term for the immense server farms owned by the provider companies who rent them to organizations in need of computational capacity. When you want to get a server from the cloud, you can just request a certain configuration from the cloud provider and you’ll have it within seconds.

In this chapter, we’re going to get into what the cloud is, why you might want to use it, and spend some time on a lab where we’ll get a cloud server of our own.

A quick history of the cloud

Around the year 2000, the Amazon.com team centralized the company’s IT platform so that instead of each team running and hosting its own servers, teams would get them from a shared internal platform.

Over the next few years, Amazon execs realized that, just as this service was valuable to teams inside Amazon, other organizations would also be interested in renting server capacity rather than running servers themselves. Amazon launched Amazon Web Services (AWS) in 2006. It wasn’t the first cloud service, but it was the first to get major commercial traction.

Before we boost the cloud too much, let’s do some hype-busting.

Over the last decade, you’d often hear people say that they were going to save money by moving to the cloud. Often, the explanation was that the flexibility of the cloud was going to enable dynamic autoscaling of server capacity. So even if they paid a little more per hour, they were going to save so much by turning capacity way down at night that it’d work out in their favor.

It turns out that even with the cloud, the engineering needed to make autoscaling happen this smoothly is extremely hard and only the most sophisticated IT/Admin organizations have achieved anything close to this goal.

While there might be individual examples to the contrary, I think it’s safe to say this hasn’t materialized, and most organizations don’t save money by moving to the cloud.

On (not) saving money in the cloud

It’s hard to prove that any particular organization lost money in a cloud migration, but there’s a reason why AWS was only 13% of Amazon’s revenue in 2021, but a whopping 74% of the company’s profits for that year.1

Additionally, an a16z report found that the total cost of ownership (TCO) – including staffing – of bringing server workloads in-house for large, established companies like Dropbox was only 1/3 to 1/2 the cost of running the same workloads with a public cloud provider.2

There’s a reason why there’s a small consulting industry solely focused on saving money on AWS bills.

So let’s assume your organization isn’t going to beat the curve at autoscaling all their cloud services. Why should you move to the cloud?

The first – and most important – reason is flexibility. While the cloud may be more expensive than hosting servers on-prem, it’s way more flexible. In an on-prem-only world, your organization has to buy and host all the server capacity it needs up front. But in a world with cloud resources, you can have a new server up, or rescale an existing one, in minutes. This is an incredible ability for a small organization, an organization with unpredictable workloads, or for temporary workloads.3

The second reason is about IT team scope and competency. Say you work for a small organization with only a few employees. In the pre-cloud world, you couldn’t run a server until you hired or contracted with someone to run a server rack for you.

In today’s world, all you need to do is feel comfortable working on the command line – and you can get a server of your own.

It’s also worth considering the incentives of IT/Admin teams and professionals. IT/Admin teams tend to be stretched really thin, so the chance to offload a major component of their work – physically hosting servers – is really appealing. Moreover, as IT/Admin professionals have watched the world move in a cloud-first direction, the resume-building incentives of individual professionals have shifted toward the cloud. When I’m looking for my next job, it’s probably going to be way more valuable to say that I’m an expert at working with cloud-based Kubernetes clusters than that I’m really good at managing physical servers.

6.1 Layers of cloud services

While the basic premise of the cloud is “rent a server”, there are layers and layers of services on top of that basic premise. Depending on your organization’s needs, it may make sense to rent a very basic service or a higher-level one.

Let’s talk about cakes to help make the levels clear. This year, I’m planning to bake and decorate a beautiful cake for my friend’s birthday. It’s going to be a round layer cake with white frosting, and it’s going to say “Happy Birthday!” in teal with giant lavender frosting rosettes.

Now that I’ve decided what I’m making, I have a few different options for how I get there. If I’m a real DIY kind of person, I could buy all the ingredients from the grocery store and make everything from scratch. Or I could buy a cake mix – reducing the likelihood I’ll buy the wrong ingredients or end up with unnecessary leftovers. Or maybe I don’t want to bake at all – I could just buy a premade cake already covered in white frosting and just do the writing and the rosettes.

The choice of how much to DIY my friend’s birthday cake and how much to get from a bakery is very much akin to your choices when buying server capacity from the cloud.

TODO: Image of getting cake w/ IaaS, PaaS, SaaS

In the US, a huge fraction of server workloads run on servers rented from one of the “big three clouds” – Amazon Web Services (AWS), Microsoft Azure, and Google Cloud Platform (GCP). There are also numerous smaller cloud providers, many of which target more specific use cases.

It’s totally possible to just rent the raw ingredients from these cloud providers – they’ll happily rent you the most basic cloud services like servers, networking, and storage and let you put them together yourself. This kind of cloud service – the equivalent of baking from scratch – is called Infrastructure as a Service (IaaS, pronounced eye-az).

But then I’m responsible for managing the whole reproducibility stack – I have to keep the servers up to date with operating system upgrades and security patches, while also ensuring there are modern versions of R and Python on the system and that RStudio and Jupyter Notebook are present.

But just as you might not want to have to buy eggs, milk, and sugar just to end up with a pretty cake, IT/Admin organizations are increasingly taking advantage of the cloud equivalent of cake mixes and premade blank sheet cakes.

If I want to offload more of the work, I can think about a Platform as a Service (PaaS – pronounced pass) solution. This would give me the ability to define an environment and send it somewhere to run without worrying about the underlying servers. For example, I might want to build a Docker container with Python and Jupyter Notebook and host it somewhere that autoscales for the number of users and the amount of other resources needed.

There are a bunch of different entities you can run in this way, and services to match each. For example, if you want to run a general user-facing app that dynamically scales, you can use something like AWS’s Elastic Beanstalk (TODO: Azure/GCP equivalents). If you want to run a container, you might use XXXX, and if you want to get a Kubernetes cluster as a service, you might use XXXX. If you just want to run a function, there’s always AWS Lambda.

Danger

I am not recommending you run a data science workbench on a service like Elastic Beanstalk. This actually tends not to work very well, for reasons we’ll get into in a bit.

If I want to go one level more abstracted, I might go with a Software as a Service (SaaS – pronounced sass) solution. This would be something like one of the cloud providers’ hosted machine learning environments, where you can train, deploy, and monitor machine learning models. AWS has SageMaker, Azure has Azure Machine Learning, and Google has Vertex AI. And there are organizations, like Posit.cloud and Saturn Cloud, that host popular data science tools you can use.

Sometimes you’ll hear people describe using PaaS or SaaS solutions as going serverless.

The first thing to understand about serverless computing is that there is no such thing as serverless computing.

Every bit of cloud computation runs on a server - the question is whether you have to deal with the server or if you just deal with a preconfigured service.

In this book, we’ll be working exclusively with IaaS services – taking the most basic services and building up a working data science environment. Once you understand how these pieces fit together, you’ll be in a much better place if it turns out your organization wants to leverage more abstracted versions of those services.

6.2 Common services for data science usage

If you’re not familiar with cloud provider terminology, it can be very hard to tell what service you might need from a cloud provider, and they don’t really help the matter. It’s very common (especially in AWS) to have many different services that fulfill similar needs, and it can be really hard to concretely tell the difference.

Making the issue even more difficult, many companies go out of their way to make their services sound grand and important and don’t just say, “this is a ___ you can rent”.

It’s helpful to keep in mind that at the very bottom, all you’re doing is renting servers and storage and managing networking and permissions for the servers and storage. Every other service is just a combination of servers, storage, networking, and permissions that comes with software preinstalled or preconfigured to make your life easier.4

We’re going to talk through some of the basic services that are offered by each of the three major cloud platforms.

Cloud service naming

Azure and GCP tend to name their offerings pretty literally, while AWS chooses cutesy names that have no relationship to the task at hand. I’m going to try to name the services for each of the purposes I’m talking about – but it’s worth noting that feature sets aren’t exactly parallel across cloud providers.

This makes AWS names a little harder to learn, but much easier to recall once you’ve learned them. In this section – contrary to standard practice – I’m going to use the common abbreviated names for AWS services and put the full names in parentheses, as the full names are just trivia.

Remember, at the bottom of all cloud services are servers, and each provider has a service that is just “rent me a server”. AWS has EC2 (Elastic Compute Cloud), Azure has Azure VMs, and Google has Google Compute Engine.

Along with servers, there are two main kinds of storage you’ll rent. The first is file storage, where you’ll store things in a file directory like on your laptop. These are AWS’s EBS (Elastic Block Store), Azure Managed Disk, and Google Persistent Disk.

The major cloud providers all also have blob (Binary Large Object) storage. Blob storage allows you to store individual objects somewhere and recall them to any other machine that has access to the blob store. The major blob stores are AWS Simple Storage Service (S3), Azure Blob Storage, and Google Cloud Storage.
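To make blob storage concrete, here’s a sketch of working with S3 from the AWS CLI. This assumes the CLI is installed and configured with credentials, and the bucket name do4ds-example-bucket is hypothetical – you’d substitute a bucket you’ve created:

```shell
# Upload a local file to a bucket (bucket name is hypothetical)
aws s3 cp results.csv s3://do4ds-example-bucket/analytics/results.csv

# See what's stored under that prefix
aws s3 ls s3://do4ds-example-bucket/analytics/

# From any other machine with access, pull the object back down
aws s3 cp s3://do4ds-example-bucket/analytics/results.csv results.csv
```

The point of blob storage is exactly that last step: the object isn’t tied to any one server’s disk, so any machine with the right permissions can retrieve it.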

There are also two important networking services for each of the clouds – a way to make a private network and a way to do DNS routing. If you don’t know what these mean, there will be a lot more detail in Chapter 9. For now, it’s enough to know that private networking is done in AWS’s VPC (Virtual Private Cloud), Azure’s Virtual Network, and Google’s Virtual Private Cloud. DNS is done in AWS’s Route 53, Azure DNS, and Google Cloud DNS.

Once you’ve got all this stuff up and running, you need to make sure that permissions are set in the right way. AWS has IAM (Identity and Access Management), Azure has Azure Active Directory, and Google has Identity and Access Management.

Now, there are a variety of things you might want to do past these basic tasks of server, storage, networking, and access management. Here are a few more services you’ll likely hear about over time.

Service                    AWS                                Azure                            GCP
Kubernetes cluster         EKS (Elastic Kubernetes Service)   AKS (Azure Kubernetes Service)   GKE (Google Kubernetes Engine)
Run a function as an API   Lambda                             Azure Functions                  Google Cloud Functions
Database                   RDS/Redshift5                      Azure Database                   Google Cloud SQL
ML Platform                SageMaker                          Azure ML                         Vertex AI

6.3 Lab

Welcome to the lab portion. In this lab, we’re going to get you up and running with an AWS account and show you how to manage, start, and stop EC2 instances in AWS.

At a high level, you can do all of this with any cloud provider. Feel free to follow along in Azure or GCP if you prefer, but the details of getting started and the service names will be different.

We will be standing up a server in AWS’s free tier – so there will be no cost involved as long as you haven’t used up all your AWS free tier credits before now.

6.3.1 Login to the AWS Console

We’re going to start by logging into AWS. If you already know how, you can skip ahead to standing up an instance.

If not, go to https://aws.amazon.com and click Sign In to the Console.

If you’ve never set up an AWS account before, click Create a New AWS account and follow the instructions to create an account. Note that even if you’ve got an Amazon account for ordering stuff online and watching movies, an AWS account is separate.

Once you’ve logged in, you’ll be confronted by the AWS console. There are a ton of things here – feel free to poke around if you want – come back and continue when you’re ready.

6.3.2 Stand up an instance

AWS’s “rent a server” is called EC2.6 It is the most basic AWS service, and it’s what we’re going to use to get started.

Note

I’m going to walk you through setting up an instance from the AWS console. The exact buttons and text may change by the time you read this, but the general overview will be the same.

There are five things we have to configure before launching the server. I’m not going to walk you through configuring each one on this page, because Amazon is constantly updating the exact layout of the page and the text, but I’ll explain the choices and then you can make them yourself – or even make similar choices if you’re in GCP or Azure.

The first choice is instance name and tags. None of this is required – it’s for convenience and organization. I’d suggest you name the server something like do4ds-lab.

Next, you’ll have to choose the image. All clouds have a concept of server images. In AWS, they’re called AMIs (pronounced like you’re reading the letters individually, short for Amazon Machine Image).

An image is the set of software that’s preinstalled when you start the server. Images can range from just a bare operating system to a running RStudio instance that’s suspended in between two computations. There are many paid images that come preinstalled with a bunch of software ready to go.

Since we’re going to work on configuring the server from the ground up, we’re going to choose an AMI that’s just the operating system. Choose the most basic Ubuntu one. At the time of this writing, that’s Ubuntu 22.04.

If you want to use a different operating system, that’s fine, but you may need to adjust the commands in this chapter and subsequent ones to match.
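If you’d rather look up the image from the command line, the AWS CLI can search for it. This is a sketch assuming the CLI is installed and configured; 099720109477 is Canonical’s AWS account ID, and the image name pattern may shift as Ubuntu releases change:

```shell
# Find the newest official Ubuntu 22.04 AMI published by Canonical
aws ec2 describe-images \
  --owners 099720109477 \
  --filters "Name=name,Values=ubuntu/images/hvm-ssd/ubuntu-jammy-22.04-amd64-server-*" \
  --query 'sort_by(Images, &CreationDate)[-1].ImageId' \
  --output text
```

The console does the same search for you behind the scenes, so this is purely optional.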

You’ll have to choose an Instance Type. In AWS, there are two components to instance type – the family and the size. The family is the category of server that you’re using. In AWS, families are denoted by letters and numbers, so there are T2s and T3s, C4s and C5s, R5s and many more.

Within each family, there are different sizes. The sizes vary by family, but generally range from nano and micro up through small, medium, and large to multiples of xlarge like 24xlarge.

We’ll get into choosing the right family and size for your team’s use case in Chapter 10.

For now, I’d recommend you get the largest server that is free tier eligible, which as of this writing is a t2.micro with 1 CPU and 1 GB of memory.

Server sizing for the lab

A t2.micro with 1 CPU and 1 GB of memory is a very small server.

If all you’re doing is walking through the lab, it should be sufficient, but if you actually want to do any data science work, you’ll need a substantially larger server.

It is possible to rack up really large AWS bills, so be careful.

That said, a modestly-sized server is still pretty cheap if you’re only putting it up for a short amount of time.

I’m writing this on a 2021 M1 MacBook Pro with 10 CPUs and 32 GB of memory. If you wanted the same computational power from an AWS server, it’s roughly comparable to a t3.2xlarge, with 8 CPUs and 32 GB of memory. That server costs $0.33 an hour. So running that instance full time for a year would cost nearly $3,000 – but if you’re only running it for a little while, say the few hours it’ll take you to complete this lab, it will probably only cost a few dollars.
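If you want to sanity-check an hourly rate against an annual figure yourself, the arithmetic is just hours times rate – here’s a quick check using the $0.33/hour figure from the text:

```shell
# Hourly on-demand rate for a t3.2xlarge, from the text
RATE=0.33

# Running full time for a year: rate * 24 hours * 365 days
awk -v rate="$RATE" 'BEGIN { printf "$%.2f per year\n", rate * 24 * 365 }'
# → $2890.80 per year

# Running for a 3-hour lab session
awk -v rate="$RATE" 'BEGIN { printf "$%.2f for the lab\n", rate * 3 }'
# → $0.99 for the lab
```

Doing this quick multiplication before standing up a big instance is a good habit – hourly prices look deceptively small.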

Next, you’ll need to make sure you have a keypair. We’ll get into what this key is and how to use it in Chapter 7. For now, you’ll need to create a keypair if you don’t already have one. I’d suggest naming it do4ds-lab-key because I’ll use that name, so you can just copy/paste commands if you use the same name.

I’d recommend creating a directory for this lab, perhaps something like do4ds-lab and putting your keypair there. If you’re not going to do that, just make sure you keep track of where you downloaded it.

Whether you’re on Windows or Mac, download the pem version of the key.
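One step worth doing as soon as the key is downloaded: SSH will refuse to use a private key that other users on your machine can read. Assuming you saved the key as do4ds-lab-key.pem inside your do4ds-lab directory, you can lock the permissions down like this:

```shell
# Make the key readable only by you -- SSH refuses private keys
# with looser permissions
chmod 400 do4ds-lab/do4ds-lab-key.pem

# Confirm: the first column should read -r--------
ls -l do4ds-lab/do4ds-lab-key.pem
```

We’ll actually use the key to connect in Chapter 7.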

You shouldn’t need to change any networking settings from the defaults. Make sure it’s configured to allow SSH traffic.

You’ll need to configure storage. By default, the server comes just with root storage. Root storage is where the operating system will be, as well as storing most of the executables you’ll need.

You’ll want another volume where you’ll store the actual work you’re doing – data files, code, and more. You can attach volumes quite easily. In AWS, they’re called EBS (short for Elastic Block Store) volumes. While it’s not strictly necessary to configure an extra EBS volume to complete the labs in this book, I’d recommend adding one, since you’ll need it if you ever want to stand up an actual environment to do work.

You can feel free to look through the Advanced Details, but you shouldn’t need to adjust any of them.

When you go to the summary, it should look something like this, assuming you followed the instructions here:

Click Launch Instance. AWS is now creating a virtual server just for you.

If you go back to the EC2 page and click on Instances you can see your instance as it comes up. You may need to remove the filter for State: Running since it’ll take a few moments to be Running.

6.3.3 Grab the address of your server

If you click on the actual instance ID in blue, you can see all the details of your server.

The instance ID and public IP addresses were auto-assigned.

Grab the Public IPv4 DNS address, which starts with ec2- and ends with amazonaws.com. Copy it somewhere easy to grab. That’s going to be the way we access the server.

For example, as I write this, my server has the address ec2-54-159-134-39.compute-1.amazonaws.com. In the commands, I’ll include the variable SERVER_ADDRESS. If you’d like to be able to copy commands verbatim, you can set the variable SERVER_ADDRESS to be your server address using SERVER_ADDRESS=ec2-54-159-134-39.compute-1.amazonaws.com.

Tip

If you’re used to R, where it’s best practice to put spaces around =, notice that assigning variables in bash requires no spaces around =.
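Putting that together – using the example address above, which you’d swap for your own – setting and checking the variable looks like this:

```shell
# No spaces around = when assigning a variable in bash
SERVER_ADDRESS=ec2-54-159-134-39.compute-1.amazonaws.com

# Echo it back to confirm it's set
echo "$SERVER_ADDRESS"
# → ec2-54-159-134-39.compute-1.amazonaws.com
```

Note that the variable only lives in the current shell session; if you open a new terminal, you’ll need to set it again.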

6.3.4 Stopping or burning it down

Whenever you’re stopping for the day, you may want to suspend your server so you’re not paying for it overnight or using up your free tier hours.

You can suspend an instance in the state it’s in so it can be restarted later. Depending on how much data you’re storing, it may not be free, but storage costs are generally very modest.

Whenever you want to suspend your instance, go to the EC2 page for your server. Under the Instance State drop down in the upper right, choose Stop Instance.

After a couple minutes the instance will stop and you won’t get charged for it. Before you come back to the next lab, you’ll need to start the instance back up so it’s ready to go.

If you want to completely delete the instance at any point, you can choose to Terminate Instance from that same Instance State dropdown.
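The same lifecycle operations are available from the AWS CLI if you prefer the command line. This is a sketch assuming the CLI is installed and configured; the instance ID below is hypothetical – yours is shown on the EC2 Instances page and starts with i-. You’d run whichever one of these applies, not all three:

```shell
# Substitute your own instance ID (this one is made up)
INSTANCE_ID=i-0123456789abcdef0

# Suspend the instance so you stop paying for compute
aws ec2 stop-instances --instance-ids "$INSTANCE_ID"

# Start it back up before the next lab
aws ec2 start-instances --instance-ids "$INSTANCE_ID"

# Permanently delete the instance -- this cannot be undone
aws ec2 terminate-instances --instance-ids "$INSTANCE_ID"
```

Terminating, unlike stopping, deletes the instance and (by default) its root volume, so double-check the instance ID before running it.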

6.4 Comprehension Questions

  1. What is the difference between PaaS, IaaS, and SaaS? What’s an example of each that you’re familiar with?
  2. What are the names of AWS’s services for renting a server, file system storage, and blob storage?

  1. https://www.visualcapitalist.com/aws-powering-the-internet-and-amazons-profits/↩︎

  2. https://a16z.com/2021/05/27/cost-of-cloud-paradox-market-cap-cloud-lifecycle-scale-growth-repatriation-optimization/↩︎

  3. By temporary workloads, I mean for example the need for a very large server to run a particular machine learning model. This is not the autoscaling failure case, because you’re not expecting this server to turn itself on and off. Instead, you’d stand it up – run it for a few hours, days, or weeks while you’re using it, and then manually turn it off. This is super easy with a cloud server and very difficult in an on-prem world.↩︎

  4. There are also some wild services that do specific things, like let you rent you satellite ground station infrastructure or do Internet of Things (IoT) workloads. Those services are really cool, but so far outside the scope of this book that I’m fine with talking like they don’t exist.↩︎

  5. There are many others. These are just the most popular I’ve seen for data science use cases.↩︎

  6. It’s short for Elastic Compute Cloud, but everyone just says EC2.↩︎