7  The Cloud

The cloud is a crucial part of the way production data science gets done these days. Nearly every data science organization is already in the cloud or is considering a cloud transition.

But as someone who doesn’t spend the day working directly with cloud services, trying to understand what the cloud is can feel like trying to catch a – well, you know – and pin it down.1 In this chapter, you’ll learn about what the cloud is and get an introduction to essential cloud services for data science.

This chapter has two labs. In the first, you’ll start with an AWS (Amazon Web Services) server – getting it stood up and learning how to start and stop it. In the second lab, you’ll put the model from our penguin mass modeling lab into an S3 bucket (more on that in a bit).

The cloud is rental servers

At one time, the only way to get servers was to buy physical machines and hire someone to install and maintain them. This is called running the servers on-prem (short for on-premises).

There’s nothing wrong with running servers on-prem. Some organizations, especially those with highly sensitive data, still do. But only those with obviously worthwhile use cases and sophisticated IT/Admin teams have the wherewithal to run on-prem server farms. If your company needs only a little server capacity or isn’t sure about the payoff, hiring someone and buying a bunch of hardware probably isn’t worth it.

Enter an online bookstore named Amazon. Around 2000, Amazon started centralizing servers across the company so teams who needed capacity could acquire it from this central pool instead of running their own. Over the next few years, Amazon execs (correctly) realized that other companies and organizations would value this ability to rent server capacity. They launched this “rent a server” business as AWS in 2006.

The cloud platform business is now enormous – collectively nearly a quarter of a trillion dollars. It’s also highly profitable. AWS was only 13% of Amazon’s revenue in 2021 but was a whopping 74% of the company’s profits for that year.2

AWS is still the biggest cloud platform by a considerable margin, but it’s far from alone. Approximately two-thirds of the market belongs to the big three or the cloud hyperscalers – AWS, Microsoft Azure, and GCP (Google Cloud Platform) – with the final third comprising numerous smaller companies.3

Real (and fake) cloud benefits

The cloud arrived with an avalanche of marketing fluff. More than a decade after the cloud went mainstream, it’s clear that many of the purported benefits are real, while some are not.

The most important cloud benefit is flexibility. Moving to the cloud allows you to get a new server or re-scale an existing one in minutes; you only pay for what you use, often on an hourly basis.4 Because you pay as you go, the risk of incorrectly guessing how much capacity you’ll need is way lower than in an on-prem environment.

The other significant benefit of the cloud is that it allows IT/Admin teams to narrow their scope. For most organizations, managing physical servers isn’t a core competency, and outsourcing that work to a cloud provider is a great choice to promote focus.

Note

One other dynamic is the incentives of individual IT/Admins. As technical professionals, IT/Admins want evidence on their résumés that they have experience with the latest and greatest technologies – generally cloud services – rather than managing physical hardware.

Along with these genuine benefits, the cloud was supposedly going to result in significant savings relative to on-prem operations. For the most part, that hasn’t materialized.

The theory was that the cloud would enable organizations to scale their capacity to match needs at any moment. So even if the hourly price were higher, the organization would turn servers off at night or during slow periods and save money.

But dynamic server scaling takes a fair amount of engineering effort, and only the most sophisticated IT/Admin organizations have implemented effective autoscaling. And even for the organizations that do autoscale, cloud providers are very good at pricing their products to capture a lot of those savings.

Some organizations have started doing cloud repatriations – bringing workloads back on-prem for significant cost savings. An a16z study found that, for large organizations with stable workloads, the total cost of repatriated workloads, including staffing, could be only one-third to one-half the cost of using a cloud provider.5

That said, even if the cash savings aren’t meaningful, the cloud is a crucial enabler for many businesses. The ability to start small, focus on what matters, and scale up quickly is worth it.

Since you’re reading this book, I’m assuming you’re a nerd, and you may be interested in buying a physical server or re-purposing an old computer just for fun. You’re in good company; I’ve run Ubuntu Server on multiple aging laptops. If you mostly want to play, go for it. But if you’re trying to spend more time getting things done and less time playing, acquiring a server from the cloud is the way to go.

Understanding cloud services

In the beginning, cloud providers did just one thing: rent you a server. But they didn’t stop there. Instead, they started building layers and layers of services on top of the rental servers they provide.

In the end, all cloud services boil down to “rent me an X”. As a data scientist trying to figure out which services you might need, you should start by asking, “What is the X for this service?”

Unfortunately, cloud marketing materials aren’t usually aimed at the data scientist trying to decide whether to use the services; instead, they’re aimed at your boss and your boss’s boss, who want to hear about the benefits of using the services. That can make it difficult to decode what X is.

It’s helpful to remember that any service that doesn’t directly rent a server is just renting a server that already has certain software pre-installed and configured.6

Less of Serverless Computing

You might hear people talking about going serverless. The thing to know about serverless computing is that there is no such thing as serverless computing. Serverless is a marketing term meant to convey that you don’t have to manage the servers. The cloud provider manages them for you, but they’re still there.

Cloud services are sometimes grouped into three layers to indicate whether you’re renting a basic computing service or something more sophisticated. To explain these layers, I’ll use an analogy to a more familiar layered object: cake. Let’s say you’re throwing a birthday party for a friend, and you’re responsible for bringing the cake.7

Big Three Service Naming

In this next section, I’ll mention services for everyday tasks from the big three. AWS tends to use cutesy names with only a tangential relationship to the task at hand. Azure and GCP name their offerings more literally.

This makes AWS names harder to learn, but much more memorable once you’ve learned them. A table of all the services mentioned in this chapter is in Appendix D.

IaaS Offerings

Infrastructure as a service (IaaS, pronounced eye-ahzz) is the basic “rent a server” premise from the earliest days of the cloud.

What’s in a VM?

When you rent a server from a cloud provider, you are usually not renting a whole server. Instead, you’re renting a virtualized server or a virtual machine (VM), usually called an instance. What you see as your server is probably just a part of a larger physical server you share with other cloud provider customers.

Unlike a personal computer, a rented cloud server doesn’t include storage (hard disk), so you’ll acquire that separately and attach it to your instance.

From a data science perspective, an IaaS offering might look like what we’re doing in the lab in this book, i.e., acquiring a server, networking, and storage from the cloud provider and assembling it into a data science workbench. This is the best and cheapest way to learn how to administer a data science environment, but it’s also the most time-consuming.

Recalling the cake analogy, IaaS is akin to going to the grocery store, gathering supplies, and baking and decorating your friend’s cake all from scratch.

Some common IaaS services you’re likely to use include:

  • Renting a server from AWS with EC2 (Elastic Compute Cloud), Azure with Azure VMs, or GCP with Google Compute Engine instances.

  • Attaching storage with AWS’s EBS (Elastic Block Store), Azure Managed Disk, or Google Persistent Disk.

  • Creating and managing the networking where your servers sit with AWS’s VPC (Virtual Private Cloud), Azure’s Virtual Network, and GCP’s Virtual Private Cloud.

  • Managing DNS records via AWS’s Route 53, Azure DNS, and Google Cloud DNS. (More on what this means in Chapter 13.)

While IaaS relieves IT/Admins of physically managing servers, they’re responsible for everything else, including keeping the servers updated and secured. For that reason, many organizations are moving away from IaaS toward something more managed.

PaaS Offerings

In a Platform as a Service (PaaS) solution, you hand off management of the servers and manage your applications via an API specific to the service.

In the cake-baking world, PaaS would be like buying a pre-made cake and some frosting and writing “Happy Birthday!” on the cake yourself.

One PaaS service that already came up in this book is blob (Binary Large Object) storage. Blob storage allows you to put objects somewhere and recall them to any other machine that has access to the blob store. Many data science artifacts, including machine learning models, are kept in blob stores. The major blob stores are AWS’s S3 (Simple Storage Service), Azure Blob Storage, and Google Cloud Storage.
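To make this concrete, here’s a minimal sketch of writing and reading an object with Python’s {boto3} package (which we’ll meet again in the labs); the bucket and object names here are hypothetical:

import boto3

s3 = boto3.client("s3")
# Put an object into a (hypothetical) bucket...
s3.put_object(Bucket="my-bucket", Key="hello.txt", Body=b"Hello, blob store!")
# ...and read it back from any machine with access to the bucket.
obj = s3.get_object(Bucket="my-bucket", Key="hello.txt")
print(obj["Body"].read())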

You’ll also likely use cloud-based database, data lake, and data warehouse offerings. I’ve seen RDS or Redshift from AWS, Azure Database or Azure Data Lake, and Google Cloud SQL or Google BigQuery used most frequently. This category also includes several offerings from outside the big three, most notably Snowflake and Databricks.8

Depending on your organization, you may also use services that run APIs or applications from containers or machine images, like AWS’s ECS (Elastic Container Service), Elastic Beanstalk, or Lambda, Azure’s Container Apps or Functions, or GCP’s App Engine or Cloud Functions.

Increasingly, organizations are turning to Kubernetes to host services. (More on that in Chapter 17.) Most organizations that do so use a cloud provider’s Kubernetes cluster as a service: AWS’s EKS (Elastic Kubernetes Service) or Fargate, Azure’s AKS (Azure Kubernetes Service), or GCP’s GKE (Google Kubernetes Engine).

Many organizations are moving to PaaS solutions for hosting internal applications because doing so removes the hassle of managing and updating actual servers. On the flip side, these offerings are less flexible than just renting a server, and some applications don’t run well in these environments.

SaaS Offerings

SaaS (Software as a Service) is where you rent the end-user software itself, often priced per seat or by usage. You’re already familiar with consumer SaaS software like Gmail, Slack, and Office 365.

The cake equivalent of SaaS would be heading to a bakery to buy a decorated cake for your friend.

Depending on your organization, you might use a SaaS data science offering like AWS’s SageMaker, Azure’s Azure ML, or GCP’s Vertex AI or Cloud Workstations.

The great thing about SaaS offerings is that you get immediate access to the end-user application and it’s usually trivial (aside from cost) to add more users. IT/Admin configuration is generally limited to hooking up integrations, often authentication and/or data sources.

The trade-off for this ease is that they’re generally more expensive and you’re at the provider’s mercy for configuration and upgrades.

Cloud Data Stores

Redshift, Azure Data Lake, BigQuery, Databricks, and Snowflake are full-fledged data warehouses that catalog and store data from all across your organization. You might use one of these as a data scientist, but you probably won’t (and shouldn’t) set one up.

However, you’ll likely own one or more databases within the data warehouse and you may have to choose what kind of database to use.

In my experience, Postgres is good enough for most things involving rectangular data of moderate size. And if you’re storing non-rectangular data, you can’t go wrong with blob storage. There are more advanced options, but you probably shouldn’t spring for anything more complicated until you’ve tried the combo of a Postgres database and blob storage and you’ve found it lacking.

Common Services

Regardless of what you’re trying to do, if you’re working in the cloud, you must ensure that the right people have the correct permissions. To manage these permissions, AWS has IAM (Identity and Access Management), GCP has Identity Access Management, and Azure has Microsoft Entra ID, which was called Azure Active Directory until the summer of 2023. Your organization might integrate these services with a SaaS identity management solution like Okta or OneLogin.

Additionally, some cloud services are geographically specific. Each cloud provider has split the world into several geographic areas, which they all call regions.

Some services are region-specific and can only interact with other services in that region by default. If you’re doing things yourself, I recommend just choosing the region where you live and putting everything there. Costs and service availability vary somewhat across regions, but it shouldn’t be materially different for what you’re trying to do.

Regions are subdivided into availability zones (AZs). Each AZ is designed to be independent, so an outage affecting one AZ won’t affect other AZs in the region. Some organizations run services that span multiple AZs to protect against outages. If you’re running something sophisticated enough to need a multi-AZ configuration, you should be working with a professional IT/Admin.

Comprehension questions

  1. What are two reasons you should consider going to the cloud? What’s one reason you shouldn’t?
  2. What is the difference between PaaS, IaaS, and SaaS? What’s an example of each that you’re familiar with?
  3. What are the names of AWS’s services for: renting a server, filesystem storage, blob storage?

Introduction to labs

Welcome to the lab!

You’ll build a functional data science workbench by walking through the labs sequentially. It won’t be sufficient for enterprise-level requirements, but will be secure enough for a hobby project or even a small team.

For the labs, we will use services from AWS, as they’re the biggest cloud provider and the one you’re most likely to run into in the real world. Because we’ll be mostly using IaaS services, there are very close analogs from Azure and GCP should you want to use one of them instead.

Lab: Getting started with AWS

In this first lab, we will get up and running with an AWS account and learn how to stand up, start, and stop an EC2 instance in AWS.

The server we’ll stand up will be from AWS’s free tier – so there will be no cost involved as long as you haven’t used up all your AWS free tier credits yet.

Tip

Throughout the labs, I’ll suggest you name things in specific ways so you can copy commands straight from the book. Feel free to use different names if you prefer.

Start by creating a directory for this lab, named do4ds-lab.

Step 1: Log in to the AWS console

We’re going to start by logging into AWS at https://aws.amazon.com.

Note

An AWS account is separate from an Amazon account for ordering online and watching movies. You’ll have to create one if you’ve never used AWS before.

Once logged in, you’ll be confronted by the AWS console. You’re now looking at a list of all the different services AWS offers. There are many, and most are irrelevant right now. Poke around if you want and then continue when you’re ready.

Accessing AWS from Code

Any of the activities in this lab can be done from the AWS Command Line Interface (CLI). Feel free to configure the CLI on your machine if you want to control AWS from the command line.

There are also R and Python packages for interacting with AWS services. The most common are Python’s {boto3} package and R’s {paws} and {aws.s3} packages.

Regardless of what tooling you’re using, you’ll generally configure your credentials in three environment variables – AWS_ACCESS_KEY_ID, AWS_SECRET_ACCESS_KEY, and AWS_REGION.

You can get the access key and secret access key from the AWS console, and you should know the region. The region is not a secret, but use proper secrets management for your access ID and key.
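For example, here’s a minimal sketch of checking that your credentials work using Python’s {boto3} package, which reads those environment variables automatically:

import boto3

# boto3 picks up AWS_ACCESS_KEY_ID, AWS_SECRET_ACCESS_KEY, and the
# region from the environment; no credentials appear in the code.
s3 = boto3.client("s3")
print([bucket["Name"] for bucket in s3.list_buckets()["Buckets"]])

If the credentials are missing or wrong, the list_buckets() call will raise an error rather than return your buckets.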

Step 2: Stand up an EC2 instance

There are five attributes to configure for your EC2 instance. If it’s not mentioned here, stick with the defaults. In particular, stay with the default Security Group. We’ll learn what Security Groups are and how to configure them later.

Name and Tags

Instance name and tags are human-readable labels so you can remember what this instance is. Neither name nor tag is required, but I’d recommend you name the server something like do4ds-lab in case you stand up others later.

If you’re doing this at work, there may be tagging policies so the IT/Admin team can figure out who servers belong to later.

Image

An image is a system snapshot that serves as the starting point for your server. AWS’s images are called AMIs (Amazon Machine Images). They range from free images of bare operating systems to paid images bundled with software you might want.

Choose an AMI that’s just the newest LTS Ubuntu operating system. As of this writing, that’s 22.04. It should say, “free tier eligible”.

Instance Type

The instance type identifies the capability of the machine you’re renting. An instance type is a family and a size with a period in between – for example, in t2.micro, the family is t2 and the size is micro. More on AWS instance types in Chapter 11.

For now, I’d recommend getting the largest free tier-eligible server. As of this writing, that’s a t2.micro with 1 CPU and 1 GB of memory.

Server Sizing for the Lab

A t2.micro with 1 CPU and 1 GB of memory is a very small server. For example, your laptop probably has at least 8 CPUs and 16 GB of memory.

A t2.micro should be sufficient to finish the lab, but you’ll need a substantially larger server to do real data science work.

Luckily, it’s easy to upgrade cloud server sizes later. How to do that, along with advice on sizing servers for real data science work, is covered in Chapter 11.

Keypair

The keypair is the skeleton key to your server. We’ll learn how to use and configure it in Chapter 8. For now, create a new keypair. I’d recommend naming it do4ds-lab-key. Download the .pem version and put it in your do4ds-lab directory.

Storage

Bump up the storage to as much as you can under the free tier, because why not? As of this writing, that’s 30 GB.

Step 3: Start the server

If you have followed these instructions, you should be looking at a summary that lists the operating system, server type, firewall, and storage. Go ahead and launch your instance.

If you go back to the EC2 page and click on Instances, you can see your instance as it comes up. When it’s up, it will transition to State: Running.

Optional: Stop the server

Whenever you’re done for the day, you may want to stop your server so you’re not using up your free tier hours or paying for it. Stopping an instance preserves its state so you can restart it later. Stopped instances aren’t entirely free – you still pay for the attached storage – but they’re very cheap.

Whenever you want to stop your instance, go to the EC2 page for your server. Under the Instance State drop-down in the upper right, choose Stop Instance.

After a couple of minutes, the instance will stop. Before you return to the next lab, you’ll need to start the instance back up so it’s ready to go.

If you want to delete the instance, you can choose to Terminate Instance from that same Instance State dropdown.
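If you find yourself stopping and starting the instance often, you can also script it. Here’s a minimal sketch using Python’s {boto3} package; the instance ID is hypothetical – copy yours from the instance’s EC2 page:

import boto3

ec2 = boto3.client("ec2")
# Hypothetical instance ID; find yours on the EC2 Instances page.
instance_id = "i-0123456789abcdef0"
ec2.stop_instances(InstanceIds=[instance_id])
# Later, when you're ready to work again:
ec2.start_instances(InstanceIds=[instance_id])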

Lab: Put the penguins data and model in S3

Whether or not you host your own server, if you work at an organization that uses AWS, you’ll almost certainly run into S3, AWS’s blob store.

It’s common to store ML models in S3. We will store the penguin mass prediction model we created in Chapters 2 and 3 in an S3 bucket.

Step 1: Create an S3 bucket

You’ll have to create a bucket, most commonly from the AWS console. You can also do it from the AWS CLI. I’m naming mine do4ds-lab. S3 bucket names must be globally unique, so you’ll need to name yours something else and use that name throughout the lab.
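If you’d rather create the bucket from code, here’s a minimal {boto3} sketch (swap in your own bucket name):

import boto3

s3 = boto3.client("s3")
# Swap in your own globally unique name. Note that outside us-east-1,
# you must also pass CreateBucketConfiguration with a LocationConstraint.
s3.create_bucket(Bucket="do4ds-lab")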

Step 2: Push the model to S3

Let’s change the code in our model.qmd doc to push the model into S3 when the model rebuilds, instead of just saving it locally.

If your credentials are in the proper environment variables and you’re using {vetiver}, pushing the model to S3 is as easy as changing the {vetiver} board type to board_s3:

model.qmd
from pins import board_s3
from vetiver import vetiver_pin_write

# The board points at the S3 bucket; credentials come from the AWS_*
# environment variables. v is the vetiver model from earlier chapters.
board = board_s3("do4ds-lab", allow_pickle_read=True)
vetiver_pin_write(board, v)
Tip

You can see working examples of the model.qmd and other scripts in this lab in the GitHub repo for this book (akgold/do4ds) in the _labs/lab7 directory.

Under the hood, {vetiver} uses standard R and Python tooling to access an S3 bucket. If you wanted to go more DIY, you could use another R or Python package to directly interact with the S3 bucket.
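For example, here’s a hedged sketch of the DIY route with {boto3}, assuming you had first saved the model to a local file (the file and key names are hypothetical):

import boto3

s3 = boto3.client("s3")
# Hypothetical local file and object key names.
s3.upload_file("model.joblib", "do4ds-lab", "model/model.joblib")
s3.download_file("do4ds-lab", "model/model.joblib", "model.joblib")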

Instead of using credentials, you could configure an instance profile using IAM, so the entire EC2 instance can access the S3 bucket without needing credentials. Configuring instance profiles is the kind of thing you should work with a real IT/Admin to do.

Step 3: Pull the API model from S3

You’ll also have to configure the API to load the model from the S3 bucket. If you’re using {vetiver}, that’s as easy as changing the board-creation line to:

from pins import board_s3
board = board_s3("do4ds-lab", allow_pickle_read=True)
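For context, here’s a minimal sketch of the whole loading side of the API, assuming the model was pinned under the (hypothetical) name penguin_model – use whatever name you pinned the model under:

from pins import board_s3
from vetiver import VetiverModel, VetiverAPI

board = board_s3("do4ds-lab", allow_pickle_read=True)
# Hypothetical pin name; use the name you pinned the model under.
v = VetiverModel.from_pin(board, "penguin_model")
app = VetiverAPI(v)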

Step 4: Give GitHub Actions the S3 credentials

We want our model building to push to S3 even when it’s running in GitHub Actions. GitHub doesn’t have our S3 credentials by default, so we’ll need to provide them.

We will declare the variables we need in the Render and Publish step of the Action.

Once you’re done, that section of the publish.yml should look something like this:

.github/workflows/publish.yml
  env:
    GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }}
    AWS_ACCESS_KEY_ID: ${{ secrets.AWS_ACCESS_KEY_ID }}
    AWS_SECRET_ACCESS_KEY: ${{ secrets.AWS_SECRET_ACCESS_KEY }}
    AWS_REGION: us-east-1

Now, unlike the GITHUB_TOKEN secret, which GitHub Actions automatically provides to itself, we’ll have to add these secrets ourselves in the repository’s settings on GitHub.

Lab Extensions

You might also want to put the actual data you’re using into S3. This can be a great way to separate the data from the project, as recommended in Chapter 2.

Putting the data in S3 is such a common pattern that DuckDB allows you to directly query Parquet files stored in S3.
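For example, here’s a minimal sketch using DuckDB’s Python client and its httpfs extension; the bucket and file names are hypothetical:

import duckdb

con = duckdb.connect()
con.execute("INSTALL httpfs")
con.execute("LOAD httpfs")
# Depending on your DuckDB version, you may need to configure S3
# credentials first (e.g., via SET s3_access_key_id or a DuckDB secret).
df = con.execute(
    "SELECT * FROM read_parquet('s3://do4ds-lab/data/penguins.parquet')"
).df()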


  1. Yes, that is a Sound of Music reference.

  2. https://www.visualcapitalist.com/aws-powering-the-internet-and-amazons-profits/

  3. https://www.statista.com/chart/18819/worldwide-market-share-of-leading-cloud-infrastructure-service-providers/

  4. A huge amount of cloud spending is now done via annual pre-commitments, which AWS calls Savings Plans. The cloud providers offer big discounts for making an up-front commitment, which the organization then spends down over the course of one or more years. If you need to use cloud services, it’s worth investigating whether your organization has committed spending you can tap into.

  5. https://a16z.com/2021/05/27/cost-of-cloud-paradox-market-cap-cloud-lifecycle-scale-growth-repatriation-optimization/

  6. There are also some wild services that do specific things, like letting you rent satellite ground station infrastructure or run Internet of Things (IoT) workloads. Those services are definitely cool but are so far outside the scope of this book that I’m going to pretend they don’t exist.

  7. If you’re planning my birthday party, chocolate layer cake with vanilla frosting is the correct cake configuration.

  8. Some materials classify Snowflake and Databricks as SaaS. I find the line between PaaS and SaaS to be quite blurry and somewhat immaterial.