17  Compute at enterprise scale

There are two reasons to scale a data science cluster. The first is that you’re doing basically laptop-sized jobs but a lot of them. The second is that you’ve got jobs that exceed the capacity of an individual laptop.

In an enterprise context, it’s very plausible that your IT/Admin team is confronting both of these problems. A platform that needs to support 100 or 500 or even 1,000 data scientists has scale issues because of the sheer number of users, and also probably has to support a bunch of different use cases.

Providing that platform to users involves a lot of complexity and expense because the IT/Admin group is managing multiple – perhaps many – servers. And as the number of users and the importance of a data science platform rise, downtime becomes very expensive, so the stability of the system becomes paramount.

This chapter will help you understand how enterprise IT/Admins think about managing and scaling that many servers to create a stable platform. It will also help you figure out what you need to communicate to your organization’s IT/Admins about the scale of your work.

17.1 DevOps best practices

In an enterprise, the IT/Admin group is managing dozens or hundreds or thousands of servers. In order to keep that many servers manageable, the IT/Admin group tries to keep those servers very standardized and interchangeable. This idea is often encompassed in the adage that “servers should be cattle, not pets”. That means that you almost certainly won’t be allowed to SSH in and make whatever changes you might want as a data scientist.

Indeed, in many organizations, no one is allowed to SSH in and make changes. Instead, all changes have to go through a robust change management process and are deployed via Infrastructure as Code (IaC) tools so the environment can always be torn down and replaced easily.

Note

Avoiding this complexity is the major reason many organizations are moving away from directly managing servers at all. Instead, they’re outsourcing server patches and management by acquiring PaaS or SaaS software from cloud providers.

There are actually two parts to standing up an environment. First, the servers and networking need to be stood up, which is called provisioning. Once that’s done, they need to be configured, which involves installing and activating applications like Python, R, JupyterHub, and RStudio Server. In many enterprises, provisioning and configuration are done by separate groups, usually the server group and the application administration group.

There are many different IaC tools you may hear the IT/Admins at your organization talk about. These include Terraform, Ansible, CloudFormation (AWS’s IaC tool), Chef, Puppet, and Pulumi. Most of these tools can do both provisioning and configuration, but each tends to specialize in one or the other, so many organizations use a pair of them together.
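If you’re curious what the provisioning half looks like as code, here is a minimal sketch using Pulumi’s Python SDK (one of the tools above). The resource names, instance type, and AMI ID are placeholders, not recommendations, and the security group rules are purely illustrative.

```python
# A minimal provisioning sketch with Pulumi's Python SDK.
# All names, sizes, and IDs below are placeholders.
import pulumi
import pulumi_aws as aws

# Provisioning: stand up the server and its basic networking.
sg = aws.ec2.SecurityGroup(
    "workbench-sg",
    description="Allow SSH and HTTPS to the data science workbench",
    ingress=[
        {"protocol": "tcp", "from_port": 22, "to_port": 22, "cidr_blocks": ["10.0.0.0/8"]},
        {"protocol": "tcp", "from_port": 443, "to_port": 443, "cidr_blocks": ["10.0.0.0/8"]},
    ],
)

server = aws.ec2.Instance(
    "workbench",
    instance_type="t3.xlarge",          # placeholder size
    ami="ami-0123456789abcdef0",        # placeholder AMI ID
    vpc_security_group_ids=[sg.id],
    tags={"team": "data-science"},
)

# Configuration (installing Python, R, JupyterHub, etc.) would typically be
# handled separately, e.g., with Ansible or by baking a machine image.
pulumi.export("workbench_ip", server.private_ip)
```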

No Code IaC

Some enterprises manage servers without IaC. This usually involves writing extensive runbooks that tell another person how to configure the servers. If your spidey sense is tingling that this probably isn’t nearly as good as code, you’re right. Finding that your enterprise IT/Admin organization doesn’t use IaC tooling is definitely a red flag.

Along with making deployments via IaC, organizations that follow DevOps best practices use a Dev/Test/Prod setup for making changes to servers and applications. The Dev and Test environments, often called lower environments, are solely for testing changes to the environment itself. To differentiate these environments from the data scientist’s Dev and Test environments, I often refer to them as staging.1

Generally, you won’t have access to the staging environment at all, except potentially for doing user acceptance testing of changes there.

In this kind of setup, environment promotion is a two-dimensional grid, with IT/Admins working on changes to the environment in staging and data scientists working in Dev and Test within the Prod IT/Admin environment. The ultimate goal of all of this is to create an extremely reliable prod-prod environment.

[Figure: The IT/Admin promotes the complete staging environment; then the data scientist or IT/Admin promotes within Prod.]

In enterprises, moves from staging to prod, including upgrades to applications or operating systems or the addition of system libraries, often have rules around them. They may need to be validated or approved by security. In some highly regulated environments, the IT/Admin group may only be able to make changes during certain windows of time. This can be a source of tension between a data scientist who wants a new library or version now and an IT/Admin who isn’t allowed to move fast.

In addition to changes that go from staging to prod, enterprises also sometimes undergo a complete rebuild of their environments. These days, many of those rebuilds are the result of a move to the cloud, which can be a multi-year affair.

17.2 Compute for many users

At some number of data scientists, you outstrip the ability of any one server – even a big one – to accommodate all of the work that needs to get done. How many data scientists it takes to overtax a single server depends entirely on what data scientists do at your organization. If you’re doing highly intensive simulation work or deep learning, you may hit that limit with only one person. On the other hand, I’ve seen organizations that mostly work with small data sets comfortably fit 50 concurrent users on a single server.

Once you need multiple servers to support the data science team(s), you have to horizontally scale. There is a simple way to horizontally scale, which is just to give every user or every group their own disconnected server. In some organizations this can work very well. The downside is that the IT/Admin group either has to manage a lot of separate servers or ends up delegating server management to the individual teams.

Many enterprises don’t want to do things this way. Instead, they want to run one centralized service that everyone in the company, or at least a lot of users, comes to. Managing just one environment makes things simpler in some ways, but it drastically increases the cost of downtime. For example, one hour of downtime for a platform that supports 500 data scientists wastes over $25,000.2
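Here’s the arithmetic behind that number, using the assumptions in footnote 2:

```python
# Back-of-the-envelope cost of one hour of downtime.
# Assumes a fully-loaded cost of $100,000 per data scientist per year
# and 2,000 working hours per year (see footnote 2).
data_scientists = 500
cost_per_year = 100_000
working_hours_per_year = 2_000

cost_per_hour = data_scientists * cost_per_year / working_hours_per_year
print(f"One hour of downtime: ${cost_per_hour:,.0f}")  # One hour of downtime: $25,000
```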

Measuring Uptime

Organizations often introduce Service Level Agreements (SLAs) or Operating Level Agreements (OLAs) about how much downtime is allowed. These limits are often measured in nines of uptime, which refers to the proportion of the time that the service is guaranteed to be online. So a one-nine service is guaranteed to be up 90% of the time, allowing for about 36 days of downtime a year. A five-nine service is guaranteed to be up 99.999% of the time, allowing for only about 5 1/4 minutes of downtime a year.
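A quick back-of-the-envelope calculation of what each level of nines allows:

```python
# Allowed downtime per year for each "nines" level, assuming a 365-day year.
minutes_per_year = 365 * 24 * 60

for nines in range(1, 6):
    uptime = 1 - 10 ** -nines             # e.g., 3 nines -> 0.999
    downtime_min = minutes_per_year * (1 - uptime)
    print(f"{nines} nine(s) ({uptime:.3%} uptime): "
          f"{downtime_min:,.1f} minutes (~{downtime_min / 60 / 24:.2f} days) of downtime/year")
```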

For that reason, organizations that support enterprise data science platforms take avoiding downtime very seriously. Most have some sort of disaster recovery policy. Some maintain frequent (often nightly) snapshots of the state of the environment so they can roll back to a known good state in the event of a failure. Sometimes there is an actual copy of the environment waiting on standby in case of an issue with the main environment.
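As a concrete (and hedged) example of the nightly-snapshot approach, here’s what triggering one snapshot might look like with boto3, assuming the environment’s persistent storage lives on an EBS volume. The volume ID is a placeholder, and a real setup would run this on a schedule and manage snapshot retention.

```python
# A minimal sketch of one "nightly snapshot" step, assuming EBS-backed storage.
import boto3

ec2 = boto3.client("ec2")

snapshot = ec2.create_snapshot(
    VolumeId="vol-0123456789abcdef0",   # placeholder volume ID
    Description="Nightly snapshot of workbench home directories",
    TagSpecifications=[
        {"ResourceType": "snapshot",
         "Tags": [{"Key": "purpose", "Value": "disaster-recovery"}]}
    ],
)
print("Started snapshot:", snapshot["SnapshotId"])
```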

Other times, there are stiffer requirements such that nodes in the cluster could fail and the users wouldn’t be meaningfully affected. This requirement for limited cluster downtime is often called high availability. If you’re talking to an IT/Admin about high availability, it’s worth understanding that it’s a description of a desired outcome, not a particular technology or technical approach.

17.3 Computing in clusters

Whether it’s for horizontal scaling or high availability reasons, most enterprises run their data science environments in a cluster, which is a set of servers (nodes) that operate together as one unit. Ideally, working in a cluster feels the same as working in a single-server environment, but there are multiple servers to add computational power or provide resilience if one server were to fail.

In order to have a cluster operate as a single environment, you need to solve two problems. First, you want to provide a single front door that routes users to a node in the cluster, preferably without them being aware of it happening. This is accomplished with a load balancer, which is a kind of proxy server.

Second, you need to make sure that the user is able to persist things (state) on the server even if they end up on a different node later. This is accomplished by setting up storage so that persistent data doesn’t actually live on any of the nodes. Instead, it lives in separate storage that is symmetrically accessible to all the nodes in the cluster.
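Here’s a toy sketch, just to illustrate those two problems, of how routing decisions and shared state fit together. It is not a real load balancer; actual routing happens in a proxy (see Appendix B), and the node names and mount point are placeholders.

```python
# Toy illustration of load-balancer routing plus off-node state. Not a real proxy.
import itertools
import hashlib

nodes = ["node-1", "node-2", "node-3"]

# Simple round-robin: each new arrival goes to the next node in the cycle.
round_robin = itertools.cycle(nodes)

def route_round_robin() -> str:
    return next(round_robin)

# "Sticky" routing: hash the username so the same user keeps landing on the same node.
def route_sticky(username: str) -> str:
    digest = int(hashlib.sha256(username.encode()).hexdigest(), 16)
    return nodes[digest % len(nodes)]

# State lives off the nodes (e.g., an NFS share mounted at the same path on every
# node), so it doesn't matter which node a user ends up on.
shared_storage = "/mnt/home"   # placeholder mount point

print(route_round_robin())     # node-1
print(route_sticky("alice"))   # always the same node for alice
```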

[Figure: Users come to the load balancer, which sends them to the nodes, which connect to the state.]

If you are a solo data scientist reading this, please do not try to run your own load balanced data science cluster. When you undertake load balancing, you’ve taken on a distributed systems problem and those are inherently difficult.

High availability is often phrased as “no single point of failure”. It’s worth noting that load balancing alone doesn’t eliminate single points of failure. In fact, it’s totally possible to make your system less stable by carelessly load balancing a bunch of nodes. For example, what if your load balancer were to fail? Or the place where the state is stored? What if your NAS has low performance relative to a non-networked drive? Sophisticated IT/Admin organizations have answers to these questions and standard ways they implement high availability.

Note

For technical information on how load balancers work and different types of configuration, see Appendix B.

17.4 Docker in enterprise = Kubernetes

Originally created at Google and released in 2014, the open source Kubernetes (K8S, pronounced koo-ber-net-ees or kates for the abbreviation) is the way to run production services out of Docker containers.3 Many organizations are moving towards running much or all of their production work in Kubernetes.

Note

Apparently Kubernetes is an ancient Greek word for “helmsman”.

Relative to managing individual nodes in a load balanced cluster, Kubernetes makes things easy because it completely separates provisioning and configuration. In Kubernetes, you create a cluster of worker nodes. Then, you separately tell the cluster’s control plane that you want it to run a certain set of Docker containers with a certain amount of computational power allocated to each one. These Docker containers running in Kubernetes are called pods.

The elegance of Kubernetes is that you don’t have to think at all about where each pod goes. The control plane schedules the pods on the nodes without you having to think about where the nodes actually are or how networking will work.
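To make that concrete, here is a minimal sketch of asking the control plane for a pod using the official Kubernetes Python client. The namespace, image name, and resource numbers are placeholders, and in a real environment this would almost always be managed declaratively (see Helm below) rather than from a one-off script.

```python
# A minimal sketch of requesting a pod from the control plane.
# Requires: pip install kubernetes; a valid kubeconfig; placeholder values throughout.
from kubernetes import client, config

config.load_kube_config()            # uses your local kubeconfig credentials
v1 = client.CoreV1Api()

pod = client.V1Pod(
    metadata=client.V1ObjectMeta(name="jupyter-alice"),
    spec=client.V1PodSpec(
        containers=[
            client.V1Container(
                name="jupyter",
                image="jupyter/minimal-notebook:latest",   # placeholder image
                resources=client.V1ResourceRequirements(
                    requests={"cpu": "2", "memory": "8Gi"},
                    limits={"cpu": "4", "memory": "16Gi"},
                ),
            )
        ]
    ),
)

# The control plane decides which node the pod runs on; we never say.
v1.create_namespaced_pod(namespace="data-science", body=pod)
```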

[Figure: A Kubernetes cluster with 3 nodes and 6 pods of various sizes arranged across the nodes.]

From the perspective of the IT/Admin, this is wonderful because you just make sure you’ve got enough horsepower in the cluster and then all the app requirements go in the container, which makes node configuration trivial.

In fact, almost everyone running Kubernetes can just add nodes with a few button clicks because they’re using a cloud provider’s managed service: AWS’s EKS (Elastic Kubernetes Service), Azure’s AKS (Azure Kubernetes Service), or GCP’s GKE (Google Kubernetes Engine).4

For production purposes, pod deployments are usually managed with Helm charts, which are the standard IaC way to declare what pods you want, how many of each you need, and what their relationship is.

Kubernetes is an extremely powerful tool and there are many reasons to like it. But like most powerful tools, it’s also complicated and becoming a proficient Kubernetes admin is a skill unto itself. These days, many IT/Admins are trying to add Kubernetes to their list of skills. If you have a competent Kubernetes admin, you should be in good shape, but you should be careful that setting up your data science environment doesn’t become someone else’s chance to learn Kubernetes.

17.5 Consider HPC over Kubernetes

As you add more people, you’re also likely to add more variety in requirements. For example, you might want to incorporate jobs that require very large nodes. Or maybe you want to run GPU-backed nodes. Or maybe you want “burstable” capacity for particularly busy days or times.

It’s unlikely that your organization wants to be running all of these nodes all the time, especially expensive large or GPU-backed ones. By far the simplest way to manage this complexity is to run a “main” cluster for everyday workloads and stand up additional special environments as needed. If the need is just for burst capacity over a relatively long time scale (e.g., a full day), manually adding or removing nodes from a cluster is easy.

Some organizations really want to run a single environment that contains different kinds of nodes or that can autoscale to accommodate quick bursts of needed capacity.

Often, the best option for these kinds of activities is a high-performance computing (HPC) framework. HPC is particularly appropriate when you need very large machines. For example, Slurm is an HPC framework that supports multiple queues for different sets of machines and allows you to treat an entire cluster as if it were one machine with many – even thousands of – CPU cores. AWS has a service called ParallelCluster that lets users easily set up a Slurm cluster at no additional cost beyond the underlying AWS hardware.
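To give a flavor of what using a Slurm queue looks like from Python, here is a hedged sketch that writes a batch script requesting a large allocation and submits it with sbatch. The partition name, resource numbers, and workload script are placeholders for whatever your cluster actually offers.

```python
# A minimal sketch of submitting a large job to a Slurm queue.
# Partition name, resources, and the workload script are placeholders.
import subprocess

batch_script = """#!/bin/bash
#SBATCH --job-name=big-simulation
#SBATCH --partition=highmem
#SBATCH --cpus-per-task=64
#SBATCH --mem=512G
#SBATCH --time=08:00:00

python run_simulation.py
"""

with open("big_simulation.sbatch", "w") as f:
    f.write(batch_script)

# sbatch prints something like "Submitted batch job 12345"
result = subprocess.run(
    ["sbatch", "big_simulation.sbatch"], capture_output=True, text=True, check=True
)
print(result.stdout.strip())
```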

Some organizations want to accomplish this kind of approach in Kubernetes. That is generally more difficult. It is not as easy to manage multiple kinds of nodes in one cluster with Kubernetes. And where HPC frameworks are designed to let you combine an arbitrary number of nodes into what acts like a single machine with thousands or tens of thousands of cores, Kubernetes usually can’t schedule a pod that’s larger than any single node in the cluster.

Autoscaling is also easier in many HPC frameworks compared to Kubernetes. Regardless of the framework, scaling up is always easy. You just add more nodes and you can schedule more work. But it’s scaling down that’s hard.

Because of its history, Kubernetes assumes that pods can easily be moved from one node to another, so it treats shutting down a node and rescheduling its pods elsewhere as no big deal. That doesn’t work well when what’s in the pod is your Jupyter or RStudio session. Relative to Kubernetes, HPC is much more “session-aware” and often does a better job scaling down the kinds of workloads that happen in a data science environment.

The upshot is that most IT/Admins will reach for Kubernetes to solve the problem of multiple use cases in one cluster or to autoscale it, but HPC may actually be a better solution in a data science context.

17.6 Comprehension Questions

  1. What is the difference between horizontal and vertical scaling? For each of the following examples, which one would be more appropriate?
    1. You’re the only person using your data science workbench and run out of RAM because you’re working with very large data sets in memory.

    2. Your company doubles the size of the team that will be working in your data science workbench. Each person will be working with reasonably small data, but there’s going to be a lot more of them.

    3. You have a big modeling project that’s too large for your existing machine. The modeling you’re doing is highly parallelizable.

  2. What is the role of the load balancer in horizontal scaling? When do you really need a load balancer and when can you go without?
  3. What are the biggest strengths of Kubernetes as a scaling tool? What are some drawbacks? What are some alternatives?

  1. You’ll have to fight out who gets to claim the title Dev/Test/Prod for their environments with the IT/Admins at your organization. Be nice, they probably had the idea long before you did.↩︎

  2. Assuming a (probably too low) fully-loaded cost of $100,000 per data scientist per year and 2,000 working hours per year.↩︎

  3. If you are pedantic, there are other tools for deploying Docker containers, like Docker Swarm, and Kubernetes is not limited to Docker containers. But for all practical purposes, production Docker = Kubernetes.↩︎

  4. It’s rare, but some organizations do run an on-prem Kubernetes cluster with a distribution like Red Hat’s OpenShift.↩︎