11  Server Resources and Scaling

You will need sufficient computational resources if you want to do more than play around. That means appropriately sizing and scaling your server to accommodate your work.

This chapter will help you develop a mental model of a server’s computational resources, teach you about the command line tools to assess resource usage, and provide recommendations on scaling and sizing a server for data science.

The briefest intro to computational theory

You’re probably aware that everything you’ve ever seen on a computer – from this book to your work in R or Python, your favorite internet cat videos, and Minecraft – is just 1s and 0s.

But the 1s and 0s aren’t the interesting part. They are just binary representations of integers (whole numbers) that themselves represent something meaningful. The mind-bending part is that everything your computer does – every single cooking video, Jupyter Notebook, and internet personality quiz – is accomplished solely by adding these integers.1

That means a helpful mental model for a computer is a factory for doing addition problems. Everything you ask your computer to do is turned into an addition problem, then processed and returned, with the results interpreted as meaningful.

Since a computer is like an addition factory, decisions about server sizing and scaling are akin to optimally designing the conveyor belts in a factory. In this computer as factory analogy, you should consider three main resources: compute, memory, and storage.

How computers compute

The addition assembly line – where the work gets done – is called compute. It’s where \(2+2\) gets turned into \(4\), and where \(345619912 + 182347910\) gets turned into \(527967822\).

Computers do most computing in their central processing unit (CPU), which completes addition problems in one or more cores.

The CPU’s speed is primarily determined by the number of cores and the speed of those cores.

The number of cores is like the number of lines in the factory. These days, most consumer-grade laptops have between 4 and 16 physical cores. Many have software capabilities that effectively double that number, so they can simultaneously do between 4 and 32 addition problems.

The baseline speed of an individual core, called single-core clock speed, is how quickly a single core completes a single addition problem. You can think of this as how fast the conveyor belt moves. Clock speeds are measured in operations per second or hertz (Hz). The cores in your laptop probably max out between two and five gigahertz (GHz), which means between two billion and five billion operations per second.

For decades, many of the innovations in computing came from increases in single-core clock speed, but those have fallen off in the last few decades. The single-core clock speeds of consumer-grade chips increased by approximately 10 times during the 90s, by 2–3 times in the 2000s, and somewhere between not at all and 1.5 times in the 2010s.

But computers have continued getting faster anyway. The improvements mostly came from increases in the number of cores, better usage of software parallelization, better heat management so the machine can run at full speed for longer, and faster loading and unloading of the CPU (called the bus).

With this background in mind, here are some of my recommendations for how to choose a performant data science machine.

Recommendation 1: Fewer, faster CPU cores

R and Python are single-threaded. This means that unless you’re using special libraries for parallel processing, you’ll max out a single CPU core while the others sit unused.

Therefore, for most R and Python work, single-core clock speed matters more than the number of cores, and fewer, faster cores are usually preferable to many slower ones.

You’re not really exposed to this tradeoff when you buy a laptop or phone. Modern consumer CPUs are all pretty good, and you should buy the one that fits your budget. But, if you’re standing up a cloud server, you often do have an explicit choice between more slower cores and fewer faster ones, determined by the instance family.

The number of cores you need for a multi-user data science server can be hard to estimate. If you’re doing non-ML tasks like counts and dashboarding or relatively light-duty machine learning, I might advise the following:

\[ \text{n cores} = \text{1 core per user} + 1 \]

The spare core is for the server to do its own operations apart from the data science usage. On the other hand, if you’re doing heavy-duty machine learning or parallelizing jobs across the CPU, you may need more cores than this rule of thumb.

How memory works

Your computer’s random access memory (RAM) is its short-term storage. RAM is like the area adjacent to the assembly line where work to be done sits waiting, and completed work is temporarily placed before it gets sent elsewhere.

Your computer can quickly access objects in RAM, so things stored in RAM are ready to go. The downside is that RAM is temporary. When your computer turns off, the RAM gets wiped.2

Note

Memory and storage are measured in bytes with metric prefixes.

Standard sizes for memory are in gigabytes (billion bytes) and terabytes (trillion bytes). Some enterprise data stores run on the scales of thousands of terabytes (petabytes) or even thousands of petabytes (yotabytes).

Modern consumer-grade laptops come with somewhere between 4 and 16 GB of memory.

Recommendation 2: Get as much RAM as feasible

In most cases, R and Python must load all your data into memory. Thus, the data you can use is limited to your machine’s RAM.

Most other machine limits will result in work completing slower than you might like, but trying to load too much data into memory will make your session crash.

Note

If you’re facing a memory limitation, reconsider your project architecture as discussed in Chapter 2. Maybe you can load less data into memory.

Because your computer needs memory for things other than R and Python, and because you’ll often be doing transformations that temporarily increase the size of your data, you need more memory than your largest dataset.

Nobody has ever complained about having too much RAM, but a good rule of thumb is that you’ll be happy if:

\[\text{Amount of RAM} \ge 3 * \text{max amount of data}\]

If you’re considering running a multi-user server, you’ll need to take a step back to think about how many concurrent users you expect and how much data you anticipate each one to load.

Understanding storage

Storage, or hard disk/drive, is where your computer stores things for the long term. It’s where applications are installed, and where you save objects you want to keep.

Relative to the RAM right next to the factory floor, your computer’s storage is like the warehouse in the next building. Storage is much slower than RAM, often 10 times to 100 times slower, but storage allows you to save things permanently.

Storage was even slower until a few years ago when solid-state drives (SSDs) became common. SSDs are collections of flash memory chips that are up to 15 times faster than the hard disk drives (HDDs) that preceded them.

HDDs consist of spinning magnetic disks with magnetized read/write heads that save and read data from the disks. While HDDs spin very fast – 5,400 and 7,200 RPM are typical speeds – SSDs with no moving parts are much faster.

Recommendation 3: Get lots of storage; it’s cheap

Get however much storage you think you’ll need when you configure storage your server, but don’t think too hard. Storage is cheap and easy to upgrade. It’s almost always more cost-effective to buy additional storage than to have a highly-paid human figure out how to delete things to free up room.

Note

If the IT/Admins at your organization want you to spend a lot of time deleting things from storage, that’s usually a red flag indicating that they aren’t thinking much about how to make the overall organization work more smoothly.

If you’re running a multi-user server, the amount of storage you need depends significantly on your data and workflows. If you’re not saving large data files, the amount of space each person needs on the server is small. Code files are negligible, and it’s rare to see R and Python packages take up more than a few dozen megabytes per data scientist. A reasonable rule of thumb is to choose

\[ \text{Amount of Storage} = \text{1GB} * \text{n users} \]

On the other hand, if your workflows save a lot of data to disk, you must consider that. In some organizations, each data scientist will save dozens of flat files of a gigabyte or more for each project.

Note

If you’re working with a professional IT/Admin, they may be concerned about the storage implications of having package copies for each team member, a best practice for using environments as code is discussed in Chapter 1. I’ve frequently heard this concern from IT/Admins thinking ahead about running their server but rarely encountered a case where it’s actually been a problem.

If you’re operating in the cloud, this isn’t an important choice. As you’ll see in the lab, upgrading the amount of storage you have is a trivial operation, requiring at most a few minutes of downtime. Choose a size you estimate will be adequate and add more if needed.

GPUs are special-purpose compute

All computers have a CPU. Some computers have specialized chips where the CPU can offload particular tasks – the most common being the graphical processing unit (GPU). GPUs are architected for tasks like rendering video game graphics, some kinds of machine learning, training large language models (LLMs), and, yes, Bitcoin mining.3

A GPU is an addition factory just like a CPU, but with the opposite architecture. CPUs have only a handful of cores, but those cores are fast. A GPU takes the opposite approach with many (relatively) slow cores.

Where a consumer-grade CPU has 4–16 cores, mid-range GPUs have 700–4,000 cores, each running at only about 1 to 10% of the speed of a CPU core. For the tasks GPUs are good at, the overwhelming parallelism ends up being more important than the speed of any individual core, and GPU computation can be dramatically faster.

Recommendation 4: Get a GPU, maybe

The tasks that most benefit from GPU computing include training highly parallel machine learning models like deep learning or tree-based models. If you have one of these use cases, GPU computing can massively speed up your computation – making models trainable in hours instead of days.

If you plan to use cloud resources for your computing, large GPU-backed instances are pricey (hundreds of dollars an hour as of this writing). You’ll want to be careful about only putting those machines up when using them.

Because GPUs are expensive, I generally wouldn’t bother with GPU-backed computing unless you’ve already tried without and find that it takes too long to be feasible.

It’s also worth noting that using a GPU won’t happen automatically. The tooling has gotten good enough that it’s usually easy to set up, but your computer won’t train your XGBoost models on your GPU unless you tell it to do so.

Now that you’re equipped with some general recommendations about choosing the right amount of resources, let’s learn how to tell whether it might be time to upgrade a system you already have.

Assessing RAM and CPU usage

Once you’ve chosen your server size and gotten up and running, you’ll want to be able to monitor RAM and CPU for problems.

A running program is called a process. For example, when you type python on the command line to start an interactive Python prompt, that starts a single Python process. If you were to start a second terminal session and run python again, you’d have a second Python process.

Complicated programs often involve multiple interlocking processes. For example, running the RStudio IDE involves (at minimum) one process for the IDE itself and one for the R session it uses in the background. The relationships between these processes are mostly hidden from you – the end user.

As an admin, you may want to inspect the processes running on your system at any given time. The top command is a good first stop. top shows information about the processes consuming the most CPU in real-time.

Here’s the top output from my machine as I write this sentence:4

Terminal
PID    COMMAND      %CPU TIME     #PORT MEM 
0      kernel_task  16.1 03:56:53 0     2272K
16329  WindowServer 16.0 01:53:20 3717  941M-
24484  iTerm2       11.3 00:38.20 266-  71M- 
29519  top          9.7  00:04.30 36    9729K 
16795  Magnet       3.1  00:39.16 206   82M 
16934  Arc          1.8  18:18.49 938   310M 

In most instances, the first three columns are the most useful. The first column is the unique process ID (pid) for that process. You’ve got the name of the process (COMMAND) and how much CPU it’s using. You’ve also got the amount of memory used a few columns over. Right now, nothing is using a lot of CPU.

The top command takes over your whole terminal. You can exit with Ctrl + c.

So Much CPU?

For top (and most other commands), CPU is expressed as a percent of single core availability. On a modern machine with multiple cores, it’s very common to see CPU totals well over 100%. Seeing a single process using over 100% of CPU is rare.

Another useful command for finding runaway processes is ps aux. It lists a snapshot of all processes running on the system and how much CPU or RAM they use. You can sort the output with the --sort flag and specify sorting by CPU with --sort -%cpu or by memory with --sort -%mem.

Because ps aux returns every running process on the system, you’ll probably want to pipe the output into head. In addition to CPU and Memory usage, ps aux tells you who launched the command and the PID.

One of the times you’ll be most interested in the output of top or ps aux is when something is going rogue on your system and using more resources than you intended. If you have some sense of the name or who started it, you may want to pipe the output of ps aux into grep to find the pid.

For example, I might run ps aux | grep RStudio to get:5

Terminal
> ps aux | grep RStudio
USER      PID   %CPU %MEM STARTED TIME     COMMAND
alexkgold 23583 0.9  1.7  Sat09AM 17:15.27 RStudio
alexkgold 23605 0.5  0.4  Sat09AM  1:58.16 rsession

RStudio is behaving nicely on my machine, but if it were not responsive, I could make a note of its pid and end the process immediately by calling the kill command with the pid.

Examining storage usage

A common culprit for weird server behavior is running out of storage space. There are two handy commands for monitoring the amount of storage you’ve got: du and df. These commands are almost always used with the -h flag to put file sizes in human-readable formats.

df (disk free) shows the capacity left on the device where the directory sits. For example, here are the first few columns from running the df command on the chapters directory on my laptop that includes this chapter.

Terminal
> df -h chapters
Filesystem     Size   Used  Avail Capacity
/dev/disk3s5  926Gi  227Gi  686Gi    25% 

You can see that the chapters folder lives on a disk called /dev/disk3s5 that’s a little less than one TB and is 25% full – no problem. This can be particularly useful to know on a cloud server because switching a disk out for a bigger one in the same spot is easy.

If you’ve figured out that a disk is full, buying a bigger one is usually the most cost-effective. But sometimes something weird happens. Maybe there are a few exceptionally big files, or you think unnecessary copies are being made.

If so, the du command (disk usage) gives you the size of individual files inside a directory. It’s particularly useful in combination with the sort command. For example, here’s the result of running du on the chapters directory where the text files for this book live.

Terminal
> du -h chapters | sort
12M chapters
1.7M    chapters/sec1/images
1.8M    chapters/sec1
236K    chapters/images
488K    chapters/sec2/images-traffic
5.3M    chapters/sec2/images-networking
552K    chapters/sec2/images
6.6M    chapters/sec2
892K    chapters/append/images
948K    chapters/append

If I were thinking about cleaning up this directory, I could see that my sec1/images directory is my biggest single directory. If you need to find big files on your Linux server, it’s worth looking through the options in the help pages for du.

Running out of resources

If you recognize you’re running out of resources on your current server, you may want to move to something bigger. There are two primary reasons servers run out of room.

The first reason is because people are running big jobs. This can happen at any scale of organization. There are data science teams of one person with use cases that necessitate terabytes of data.

The second reason is you have many people using your server. This is generally a feature of big data science teams, irrespective of workload size.

Either way, there are two options for how to scale your data science workbench. The first option is vertical scaling, which is a fancy way to say get a bigger server. The second option is horizontal scaling, which means running a whole fleet of servers in parallel and spreading the workload across them.

As a data scientist, you shouldn’t be shy about vertically scaling if your budget allows it. The complexity of managing a t3.nano with two cores and 0.5 GB of memory is the same as a C5.24xlarge with 96 cores and 192 GB of memory. In fact, the bigger one may be easier to manage since you won’t have to worry about running low on resources.

There are limits to the capacity of vertical scaling. As of this writing, AWS’s general-use instance types max out at 96–128 cores. That can quickly get eaten up by 50 data scientists with reasonably heavy computational demands.

Once you’re thinking about horizontal scaling, you’ve got a distributed service problem, which is inherently complex. You should almost certainly get an IT/Admin professional involved. See Chapter 17 for more on how to talk to them about it.

AWS Instances for Data Science

AWS offers various EC2 instance types split up by family and size. The family is the category of EC2 instance. Different families of instances are optimized for different kinds of workloads.

Here’s a table of common instance types for data science purposes:

Instance Type What It Is
t3 The “standard” configuration. Relatively cheap. Sizes may be limited.
C CPU-optimized instances, aka faster CPUs.
R Higher ratio of RAM to CPU relative to t3.
P GPU instances. Very expensive.

Within each family, there are different sizes available, ranging from nano to multiples of xl. Instances are denoted by <family>.<size>. For example, when we put our instance originally on a free tier machine, we put it on a t2.micro.

In most cases, going up a size doubles the amount of RAM, the number of cores, and the hourly cost. You should do some quick math before you stand up a C5.24xlarge or a GPU-based P instance. If your instance won’t be up for very long, it may be fine, but make sure you take it down when you’re done, lest you rack up a huge bill.

Comprehension questions

  1. Think about the scenarios below. Which part of your computer would you want to upgrade to solve the problem?

    1. You try to load a .csv file into {pandas} in Python. It churns for a while and then crashes.

    2. You go to build a new ML model on your data. You’d like to re-train the model once a day, but training this model takes 26 hours on your laptop.

    3. You design a visualization using the {matplotlib} package and want to create one version of the visualization for each US state. You could do it in a loop, but it would be faster to parallelize the plot creation. Right now, you’re running on a t2.small with 1 CPU.

  2. Draw a mind map of the following: CPU, RAM, Storage, Operations Per Second, Parallel Operations, GPU, Machine Learning.

  3. What are the architectural differences between a CPU and a GPU? Why does this make a GPU particularly good for Machine Learning?

  4. How would you do the following?

    1. Find all running JupyterHub processes that belong to the user alexkgold.

    2. Find the different disks attached to your server and see how full each one is.

    3. Find the biggest files in each user’s home directory.

Lab: Changing instance size

In this lab, we will upgrade the size of our server. And the best part is that we’re in the cloud, so it will only take a few minutes.

Step 1: Confirm the current server size

First, let’s confirm what we’ve got available. Once you ssh into the server, you can check the number of CPUs you’ve got with lscpu in a terminal. Similarly, you can check the amount of RAM with free -h. This is so that you can prove to yourself later that the instance changed.

Step 2: Change the instance type and bring it back

Now, you can go to the instance page in the AWS console. The first step is to stop (not terminate!) the instance. This means that changing instance type requires some downtime, but it’s brief.

Once the instance has stopped, you can change the instance type under Actions > Instance Settings. Then, start the instance. It’ll take a few seconds.

Step 3: Confirm the new server size

For example, I changed from a t2.micro to a t2.small. Both only have 1 CPU, so I won’t see any difference in lscpu, but running free -h before and after the switch reveals that I’ve got more RAM:

Terminal
test-user@ip-172-31-53-181:~$ free -h
               total  used  free   available
Mem:           966Mi  412Mi  215Mi   404Mi
test-user@ip-172-31-53-181:~$ free -h
               total  used  free   available
Mem:           1.9Gi  412Mi  1.4Gi   1.6Gi

There’s twice as much after the change!

There are some rules around being able to change from one instance type to another, but this is a superpower if you’ve got variable workloads or a team that’s growing. Once you’re done with your larger server, it’s just as easy to scale it back down.

Step 4: Upgrade storage (maybe)

If you want more storage, resizing the EBS volume attached to your server is similarly straightforward.

I wouldn’t recommend doing it for this lab because you can only automatically adjust volume sizes upward. That means you’d have to manually transfer your data if you ever wanted to scale back down.

If you do resize the volume, you’ll have to let Linux know so it can resize the filesystem with the new space available. AWS has a great walkthrough called Extend a Linux filesystem after resizing the volume that I recommend you follow.


  1. This was proved in Alan Turing’s 1936 paper on computability. If you’re interested in learning more, I recommend The Annotated Turing: A Guided Tour Through Alan Turing’s Historic Paper on Computability and the Turing Machine by Charles Petzold for a surprisingly readable walkthrough.↩︎

  2. You probably don’t experience this. Modern computers are pretty smart about saving RAM state onto the hard disk before shutting down and bringing it back on startup, so you won’t notice this happening unless something goes wrong.↩︎

  3. Purpose-built chips are becoming more common for AI/ML tasks, especially doing local inference on large models. These include Tensor Processing Units (TPUs) and Intelligence Processing Units (IPUs).↩︎

  4. I’ve cut out a few columns for readability.↩︎

  5. I’ve done a bunch of doctoring to the output to make it easier to read.↩︎