# 15  Choosing the right server for you

Throughout this section of the book, you’ve learned a lot about how to configure a server: getting one from AWS, administering Linux on it, and safely making it accessible from the internet.

But let’s take a step back. How do you know you chose the right server? How do you choose a server with enough horsepower to get you and your team through the day, but without breaking the bank?

In this chapter, we’ll spend some time clarifying what a computer actually does and how that relates to the size and type of server you might want. We’ll also get a little into how AWS classifies servers and how you’ll choose the one you want.

## 15.1 Computers are (just) addition factories

You’re probably aware that everything you’ve ever seen on a computer – from this book to your work in R or Python, your favorite internet cat videos, and Minecraft – is just 1s and 0s.

The theory of why this works is deep and fascinating.1 Luckily, the amount of computational theory you need to understand to be an effective data scientist can be summarized in three sentences:

1. Everything on a computer is represented by a (usually very large) number.
2. At a hardware level, the only thing computers do is add these numbers together.
3. Modern computers add very quickly and very accurately.

This means that for your day-to-day work, you can think of a computer as just a big factory for doing additions.

Every bit of input your computer gets is turned into an addition problem, processed, and the result is converted back into something we interpret as meaningful.

The addition assembly line itself – where the work actually gets done – is referred to as compute. It’s where 2+2 gets turned into 4, and where 345619912 + 182347910 gets turned into 527967822.

There are a number of different kinds of compute, but they all basically work the same way. The total speed is determined by two factors – the number of conveyor belts (cores) and the speed at which each belt runs (clock speed).

We’re going to spend some time considering the three important resources – compute, memory, and storage – that you have to allocate and manage on your computer or server. We’ll explore what each one is in general and then get into how to think about each specifically with regard to data science.

### 15.1.1 Most computation is on the CPU

All computers have a central processing unit (CPU). These days, most consumer-grade laptops have between 4 and 16 cores, and may have additional capabilities that effectively double that number. So most laptop CPUs can do between 4 and 32 simultaneous addition problems.

Clock speeds are measured in operations per second, or hertz (Hz). The cores in your laptop probably run between two and five gigahertz (GHz), which means between 2 and 5 billion operations per second when running at full speed.
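
If you’re curious what your own machine has, here’s a minimal Python sketch for checking core count and clock speed. It assumes the third-party psutil package is installed (the core count alone needs only the standard library).

```python
import os

import psutil  # third-party; install with `pip install psutil`

# Number of logical cores the operating system reports
print(f"Logical cores: {os.cpu_count()}")

# Current and maximum clock speed in MHz (may be unavailable on some platforms)
freq = psutil.cpu_freq()
if freq is not None:
    print(f"Clock speed: {freq.current:.0f} MHz (max {freq.max:.0f} MHz)")
```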

For decades, many of the innovations in computing came from increases in clock speed, but raw speed gains have slowed dramatically. The clock speeds of consumer-grade chips increased by approximately 10x during the 1990s, by 2-3x in the 2000s, and somewhere between not at all and 1.5x in the 2010s.

But computers have continued getting a lot faster even as clock speed gains have slowed. The gains have mostly come from more cores, better software use of parallelization, and innovative chip architectures for special-purpose usage.

#### 15.1.1.1 Recommendation 1: Fewer, faster CPU cores

R and Python are single-threaded. Unless you’re using special libraries for parallel processing, you’ll end up red-lining a single CPU core while the others just look on in silence.

Therefore, for most R and Python work, single-core clock speed matters more than the number of cores, and fewer, faster cores are usually preferable to many slower ones.
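
To make that concrete, here’s a minimal Python sketch contrasting a plain loop, which stays on one core, with the standard library’s multiprocessing pool, which spreads the same work across several cores. The simulate function is just an illustrative stand-in for any CPU-heavy task.

```python
from multiprocessing import Pool

def simulate(seed: int) -> float:
    """Stand-in for a CPU-heavy task, e.g. one bootstrap draw or one model fit."""
    total = 0.0
    for i in range(2_000_000):
        total += (seed * i) % 7
    return total

if __name__ == "__main__":
    seeds = list(range(8))

    # Single-threaded: red-lines one core while the others sit idle
    serial_results = [simulate(s) for s in seeds]

    # Parallel: splits the eight tasks across four worker processes
    with Pool(processes=4) as pool:
        parallel_results = pool.map(simulate, seeds)

    assert serial_results == parallel_results
```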

You’re probably not used to thinking about this tradeoff when buying a laptop or phone. The reality is that modern CPUs are pretty darn good, and you should just buy the one that fits your budget.

If you’re standing up a server, you often do have an explicit choice between more, slower cores and fewer, faster ones. In AWS, the instance type dictates the tradeoff – more on this below.

If you’re running a multi-user server, the number of cores you need is really hard to estimate. If you’re doing non-ML tasks like counts and dashboarding, or relatively light-duty machine learning, I might advise the following:

$\text{n cores} = \text{1 core per user} + 1$

The spare core is for the server to do its own operations apart from the data science usage. On the other hand, if you’re doing heavy-duty machine learning or parallelizing jobs across the CPU, you may need more cores than that.
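
As a quick worked example of this rule of thumb (the team size here is hypothetical):

```python
n_users = 5              # hypothetical number of data scientists on the server
n_cores = n_users + 1    # one core per user, plus one spare for the system itself
print(n_cores)           # 6
```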

### 15.1.2 GPU-backed computing

The most common special-architecture chip is the graphics processing unit (GPU). GPUs are specialized chips used for tasks like editing photos or videos, rendering video game graphics, some kinds of machine learning, and (yes) Bitcoin mining.

Like a CPU, a GPU is just an addition factory, but with a different architecture. A CPU has a few fast cores, so it can only work on a few problems simultaneously, but it gets through each one very quickly. A GPU takes the opposite approach, with many slower cores. Where a consumer-grade CPU has 4-16 cores, a mid-range GPU has 700-4,000 cores, but each one runs at between 1% and 10% of the speed of a CPU core.

For GPU-friendly tasks, the overwhelming parallelism of a GPU is more important than the speed of any individual core, and GPU computation can be dramatically faster. For the purposes of data science, many popular machine learning techniques – including neural networks, XGBoost, and other tree-based models – can run much, much faster on GPUs than on CPUs.

Some machines are also adding other types of specialized chips to do machine learning – though these generally aren’t accessible for training models. For example, recent iPhones include Apple’s Neural Engine, which accelerates on-device machine learning inference.

#### 15.1.2.1 Recommendation 2: Get a GPU…maybe…

Whether you need a GPU to do your work really depends on what you’re doing and your budget.

Only certain kinds of data science tasks are even amenable to GPU-backed acceleration. Many data science tasks can only be done in sequence; others can be parallelized, but splitting them across a small number of CPU cores is perfectly adequate. For the most part, the things that benefit most from GPU computing are training highly parallel machine learning models like neural networks or tree-based models.

If you do have one of these use cases, GPU computing can massively speed up your computation – making models trainable in hours instead of days.

If you are planning to use cloud resources for your computing, GPU-backed instances are quite pricey, and you’ll want to be careful to run those machines only when you’re actually using them.

It’s also worth noting that using a GPU won’t happen automatically. The tooling has gotten good enough that it’s usually pretty easy to set up, but your computer won’t train your XGBoost models on your GPU unless you tell it to do so.
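
For example, here’s a minimal sketch of opting in with XGBoost in Python. The data is a random placeholder, and the exact parameter spelling depends on your XGBoost version – recent releases use device="cuda" alongside tree_method="hist", while older ones use tree_method="gpu_hist" – so treat this as illustrative rather than definitive.

```python
import numpy as np
import xgboost as xgb

# Placeholder data; substitute your real features and labels
X = np.random.rand(10_000, 20)
y = np.random.randint(0, 2, size=10_000)

# Nothing here touches the GPU unless you ask for it explicitly.
# XGBoost >= 2.0 spelling; older versions use tree_method="gpu_hist" instead.
model = xgb.XGBClassifier(n_estimators=100, tree_method="hist", device="cuda")
model.fit(X, y)
```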

Because GPUs are expensive, I generally wouldn’t bother with GPU-backed computing unless you’ve already tried without a GPU and find that it takes too long to be feasible.

### 15.1.3 Memory (RAM)

Your computer’s random access memory (RAM) is its short-term storage. In the computer-as-adding-factory analogy, the RAM is like the stock that’s sitting out on the factory floor, ready to go right onto an assembly line.

RAM is very fast for your computer to access, so you can read and write to it very quickly. The downside is that it’s temporary. When your computer turns off, the RAM gets wiped.2

Note

You probably know this, but memory and storage are measured in bytes, with metric prefixes. Common sizes for memory these days are gigabytes (billions of bytes) and terabytes (trillions of bytes). Some enterprise data stores run at the scale of thousands of terabytes (petabytes) or even thousands of petabytes (exabytes).

Modern consumer-grade laptops come with somewhere between 4 and 16 GB of memory.

#### 15.1.3.1 Recommendation 3: Get as much RAM as feasible

In most cases, R and Python have to load all of your data into memory. Thus, the size of the data you can work with is limited by the size of your machine’s RAM. Most other limits of your machine will just make things slower than you’d like, but trying to load too much data into memory will crash your session, and you won’t be able to do your analysis.

Note

You can get around the in-memory limitation by using a database or libraries that facilitate on-disk operations like Apache Arrow or dask.
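
For instance, here’s a minimal sketch of the out-of-core approach with dask; the file pattern and column names are hypothetical, and it assumes the dask package (with its dataframe dependencies) is installed.

```python
import dask.dataframe as dd

# Lazily reference a collection of CSVs that may be much larger than RAM
df = dd.read_csv("data/sales-*.csv")  # hypothetical file pattern

# Work is streamed in chunks; only the aggregated result has to fit in memory
monthly_totals = df.groupby("month")["revenue"].sum().compute()
print(monthly_totals)
```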

Because you’ll often be doing some sort of transformation that results in invisible copies of your data, and because your computer can’t devote all of its memory to your data, you’ll want to leave plenty of headroom above your actual data size.

It’s easy to say that you’ll always want more RAM, but a rough rule of thumb for whether you’ve got enough is the following:

$\text{Amount of RAM} = \text{max amount of data} \times 3$

If you’re thinking about running a multi-user server, consider the maximum number of simultaneous users and the most data each one would want loaded into memory at once, and use that for your max amount of data.
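
Here’s a quick back-of-the-envelope sketch of that rule in Python; the data size and user count are hypothetical.

```python
# Size of the largest dataset one person is likely to load, in GB.
# You could measure a real file with os.path.getsize("path/to/file") / 1e9;
# here we just assume a hypothetical 2 GB dataset.
data_gb = 2.0

# Rule of thumb: leave ~3x headroom for the copies made during transformation
ram_per_user_gb = data_gb * 3

# For a multi-user server, scale by the maximum number of simultaneous users
n_simultaneous_users = 4  # hypothetical
server_ram_gb = ram_per_user_gb * n_simultaneous_users

print(f"Aim for roughly {server_ram_gb:.0f} GB of RAM")  # ~24 GB
```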

### 15.1.4 Storage (Hard Drive/Disk)

Relative to the RAM that’s right next to the factory floor, your computer’s storage is like the warehouse in the next building over. It’s much, much slower to get things from storage than from RAM, but it’s also permanent once it’s stored there.

Until a few years ago, hard drives were very slow. Hard disk drives (HDDs) have a bunch of magnetic disks that spin very fast (5,400 and 7,200 RPM are common speeds), and magnetized read/write heads move across the disks to save and read your data.

While 7,200 RPM is very fast, there are still physical moving parts, and reading and writing data is very slow by computational standards.

In the last few years, solid-state drives (SSDs) have become more-or-less standard in laptops. SSDs, which are collections of flash memory chips with no moving parts, are up to 15x faster than HDDs. They also can take a wider variety of shapes and sizes, and are more reliable and durable because they have no moving parts. The main drawback is that they’re usually more expensive per byte, but prices are still quite reasonable.

Many consumer laptops have only an SSD at this point. Some desktops and high-end laptops combine a smaller SSD with a larger HDD.

As for storage – get a lot, but don’t think about it too hard, because it’s cheap. Both a 1 TB SSD and a 4 TB HDD are around $100. Storage is cheap enough these days that it is almost always more cost-efficient to buy more storage than to make a highly paid professional spend their time figuring out how to move things around.

One litmus test of an IT organization that is well-equipped to support data science is whether they understand this. Smart organizations know that just getting more storage is almost always worth the cost in terms of the time of admins and data scientists.

If you’re running a multi-user server, the amount of storage you need depends a lot on your data and your workflows. A reasonable rule of thumb is to choose

$\text{Amount of Storage} = \text{data saved} + 1 \text{ GB} \times \text{n users}$

One thing to keep in mind is that you can basically just choose storage to match your data size and add a gigabyte per person. Aside from actual data, the amount of space each person needs on the server is small. Code is very small, and it’s rare to see R and Python packages take up more than a few dozen MB per data scientist.

Note

If you’re working with a professional IT admin, they may be concerned about the storage implications of keeping package copies for each person on the team. I’ve heard this concern a lot from IT/Admins thinking prospectively about running their server, and almost never heard of a case where it’s actually been a problem. Finding a person who has more than a few hundred MB of packages would be very strange indeed.

So it really comes down to how much data you expect to save on the server. In some organizations, each data scientist will save dozens of flat files of a GB or more for each of their projects. That team would need a lot of storage. In other teams, all the data lives in a database, and you basically don’t need anything beyond that 1 GB per person.

If you’re operating in the cloud, this really isn’t an important choice. As you’ll see in the lab, upgrading the amount of storage you have is a trivial operation, requiring at most a few minutes of downtime. Choose a size you guess will be adequate and add more if you need it.

## 15.2 Comprehension Questions

1. Think about the scenarios below – which part of your computer would you want to upgrade to solve the problem?
   1. You try to load a big csv file into pandas in Python. It churns for a while and then crashes.
   2. You go to build a new ML model on your data. You’d like to re-train the model once a day, but it turns out training this model takes 26 hours on your laptop.
   3. You design a visualization in Matplotlib and create a whole bunch of them in a loop, and you want to parallelize the operation. Right now you’re running on a t2.small with 1 CPU.
2. Draw a mind map of the following: CPU, RAM, Storage, Operations Per Second, Parallel Operations, GPU, Machine Learning.
3. What are the architectural differences between a CPU and a GPU? Why does this make a GPU particularly good for machine learning?

## 15.3 AWS Instance Classes for Data Science

AWS offers a variety of different EC2 instance types. There are a few different types you’ll probably consider; here’s a quick guide to the ones most commonly used for data science purposes. Within each type, there are different sizes available, ranging from nano to 2xl. Instances are denoted by <instance type>.<size>. So, for example, when we originally put our instance on a free-tier machine, we used a t2.micro.
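
If you want to compare the specs of a few candidate types programmatically, here’s a minimal sketch using boto3, AWS’s Python SDK. It assumes boto3 is installed and your AWS credentials are configured, and the instance types and region listed are just examples.

```python
import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")  # pick your own region

# Look up vCPU and memory specs for a few candidate instance types
resp = ec2.describe_instance_types(InstanceTypes=["t3.micro", "t3.small", "t3.medium"])

for it in sorted(resp["InstanceTypes"], key=lambda i: i["MemoryInfo"]["SizeInMiB"]):
    vcpus = it["VCpuInfo"]["DefaultVCpus"]
    ram_gb = it["MemoryInfo"]["SizeInMiB"] / 1024
    print(f'{it["InstanceType"]}: {vcpus} vCPU, {ram_gb:.1f} GB RAM')
```
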
In most cases, going up a size doubles the amount of RAM, the number of cores, and the cost. That isn’t strictly true, especially at some of the more extreme sizes.

| Instance Type | Notes |
|---------------|-------|
| t3 | The “standard” configuration. Relatively cheap per core/GB of RAM. A good value because of burstable CPU credits, but limited in maximum size. |
| C | Faster CPUs. |
| R | Higher ratio of RAM to CPU. |
| P | GPU instances; very expensive. |

## 15.4 Lab: Changing Instance Size

Ok, now we’re going to experience the real magic of the cloud – flexibility. We’re going to upgrade the size of our server in just a minute or two.

First, let’s confirm what we’ve got available. You can check the number of CPUs with lscpu in a terminal. Similarly, you can check the amount of RAM with free -h. This is just so you can prove to yourself later that the instance really changed.

Now, go to the instance page in the AWS console. The first step is to stop (not terminate!) the instance. This means that changing instance type does require some downtime, but it’s quite limited.

Once the instance has stopped, you can change the instance type under Actions > Instance Settings. Then start the instance back up – it takes only a few seconds. And that’s it.

So, for example, I changed from a t2.micro to a t2.small. Both have only 1 CPU, so I won’t see any difference in lscpu, but running free -h before and after the switch reveals the difference in the total column:

```
test-user@ip-172-31-53-181:~$ free -h
               total        used        free      shared  buff/cache   available
Mem:           966Mi       412Mi       215Mi       0.0Ki       338Mi       404Mi
Swap:             0B          0B          0B

test-user@ip-172-31-53-181:~$ free -h
               total        used        free      shared  buff/cache   available
Mem:           1.9Gi       225Mi       1.3Gi       0.0Ki       447Mi       1.6Gi
Swap:             0B          0B          0B
```

I got twice as much RAM!

There are some rules around being able to change from one instance type to another…but this is an amazing superpower if you’ve got variable workloads or a team that’s growing. The ability to scale a machine so easily is a game-changer.

A lot of the things we did – adding the elastic IP and daemonizing JupyterHub – are the reason that we’re able to bring it back up so smoothly.

It’s similarly easy to resize the EBS volume attached to your server for more storage.

There are two caveats worth knowing. First, you can only automatically adjust volume sizes up, so you’d have to manually transfer all of your data if you ever wanted to scale back down. Second, you’ll have to adjust the Linux filesystem so it knows it can take advantage of the new space after resizing. I’d recommend AWS’s walkthroughs on how to adjust the Linux filesystem if you do resize your volume.

1. The reason why this is the case and how it works is fascinating. If you’re interested, it comes back to Alan Turing’s famous paper on computability. I recommend The Annotated Turing: A Guided Tour Through Alan Turing’s Historic Paper on Computability and the Turing Machine by Charles Petzold for a surprisingly readable walkthrough of the paper.↩︎

2. You probably don’t experience this personally. Modern computers are pretty smart about dumping RAM onto the hard disk before shutting down, and bringing it back on startup, so you usually won’t notice this happening.↩︎