8  Basic Linux SysAdmin

In the last chapter, we went over using the command line, which is how you’ll interact with your server. This chapter is an intro to basic Linux server administration. By the end of this chapter, you’ll have a much better idea of what to do once you’re logged in.

It’s worth saying that you can make a whole career out of Linux System Administration, so this chapter is going to be really focused on the things you need to know to successfully run a data science environment (and fun asides I feel like sharing).

In the lab for this chapter, we’re going to finally get started with something that is recognizably about data science – we’re going to configure R and Python on our EC2 instance and configure both RStudio Server and JupyterHub. By the end of the chapter you’ll have a running – though still inaccessible – data science workbench.

8.1 Linux is an operating system with a long history

A computer’s operating system (OS) defines how applications – like Microsoft Word, RStudio, and Minecraft – interact with the underlying hardware to actually do computation. OSes define how files are stored and accessed, how applications are installed and can connect to networks, and more.

These days, basically all computers run on one of a few different operating systems – Windows, MacOS, or Linux for laptops and desktops; Windows or Linux for servers; Android (a flavor of Linux) or iOS for phones and tablets; and Linux for other kinds of embedded systems (like ATMs and the chips in your car).

When you stand up a server, you’re going to be choosing from one of a few versions of Linux. If you’re unfamiliar with Linux, the number of choices can seem overwhelming, so here’s a quick primer on the history of operating systems. Hopefully it’ll help it all make sense.

Before the early 1970s, the market for computer hardware and software looked nothing like it does now. Computers released in that era had extremely tight linking between hardware and software. There were no standard interfaces between hardware and software, so each hardware manufacturer also had to release the software to use with their machine.

In the early 1970s, Bell Labs released Unix – the first widely adopted operating system that wasn’t tied to a single manufacturer’s hardware.

Once there was an operating system, the computer market started looking a lot more familiar to 2020s eyes. Hardware manufacturers would build machines that ran Unix and software companies could write applications that ran on Unix. The fact that those applications would run on any Unix machine was a game-changer.

In the 1980s, programmers wanted to be able to work with Unix themselves, but didn’t necessarily want to pay Bell Labs for Unix, so they started writing Unix-like operating systems. Unix-like OSes or Unix clones behaved just like Unix, but didn’t actually include any code from Unix itself.1

In 1991, Linus Torvalds – then a 21-year-old Finnish grad student – released Linux, an open source Unix clone, via an amusingly nonchalant newsgroup posting.2

Since then, Linux has seen tremendous adoption. A large majority of the world’s servers run on Linux.3 Along with most of the world’s servers, almost all of the world’s embedded computers – in ATMs, cars and planes, TVs, and most other gadgets and gizmos – run on Linux. If you have an Android phone or a Chromebook – that’s Linux. Basically all of the world’s supercomputers use Linux.

As you might imagine, running Linux in so many different places has necessitated the creation of many different kinds of Linux. For example, a full-featured Linux server is going to require a very different operating system than the barebones operating system running on an ATM with extremely modest computational power.

These different versions are called distributions (distros for short) of Linux. They have a variety of technical attributes and also different licensing models.

Some versions of Linux, like Ubuntu, are completely open source. Others, like Red Hat Enterprise Linux (RHEL), are paid. Most paid Linux OSes have closely-related free and open source versions – like CentOS and Fedora for RHEL.4

Many organizations have a standard Linux distro they use – most often RHEL/CentOS or Ubuntu. Increasingly, organizations deploying in AWS are using Amazon Linux, which is independently maintained by Amazon but was originally a RHEL derivative. There are also some organizations that use SUSE (pronounced soo-suh), which has both open source and enterprise versions.

8.2 A tiny intro to Linux administration

We’ll get into how to administer a Linux server just below, but before we get there, let’s introduce what you’ll be doing as a Linux server admin. There are four main things you’ll manage as a Linux server admin:

  • System resources: Each server has a certain amount of resources available. In particular, you’ve got CPU, RAM, and storage. Keeping track of how much you’ve got of these things, how they’re being used, and making sure no one is gobbling up all the resources is an important part of system administration.

  • Networking: Your server is only valuable if you and others can connect to it, so managing how your server can connect to the environment around it is an important part of Linux administration.

  • Permissions: Servers generally exist to allow a number of people to access the same machine. Creating users and groups and managing what they’re allowed to do is a huge part of server administration.

  • Applications: Generally you want to do something with your server, so being able to interact with applications that are running, debug issues, and fix things that aren’t going well is an essential Linux admin skill.

When you log into a Linux server, you’ll be interacting exclusively via the command line, so all of the commands in this chapter are going to be terminal commands. If you haven’t yet figured out how to open the terminal on your laptop and got it themed and customized so it’s perfect, I’d advise going back to Chapter 7 to get it all configured.5

Windows, Mac, and Linux

MacOS is based on BSD, a Unix clone, so any terminal commands you’ve used before will be very similar to Linux commands.

Windows, on the other hand, is basically the only popular operating system that isn’t a Unix clone. Over time, the Windows command line has gotten more Unix-like, so the differences aren’t as big as they used to be, but there will be some differences in the exact commands that work on Windows vs on Linux.

The most obvious difference is the types of slashes used in file paths. Unix-like systems use forward slashes / to denote file hierarchies, while Windows uses back slashes \.

8.3 Managing who can do what

Whenever you’re doing something in Linux, you’re doing that thing as a particular user.

On any Unix-like system, you can check your active user at any time with the whoami command. For example, here’s what it looks like on my MacBook.

 ❯ whoami                                                       
alexkgold

whoami returns the username of my user.

Usernames have to be unique on the system – but they’re not the true identifier for a Linux user. A user is uniquely (and permanently) identified by their user id (uid). All other attributes including username, password, home directory, groups, and more are malleable – but uid is forever.

Many of the users on a Linux server correspond to actual humans. But there are more users than that. Most programs that run on a Linux server run as a service account that represents the set of permissions allowed to that program.

For example, installing RStudio Server will create a user with username rstudio-server. Then when rstudio-server goes to do something – start an R session for example – it will do so as rstudio-server.

A few details on UIDs

uids are just numbers from 0 to over 2,000,000,000. uids are assigned by the system at the time the user is created. You should probably keep uids below 2,000,000 or so if you happen to be assigning uids manually – some programs can’t deal with uids any bigger.

On most Linux distros, including Ubuntu, 1,000 is the lowest uid that’s available for use by a regular user account. Everything below that is reserved for predefined system accounts or application accounts.

In addition to users, Linux has a notion of groups. A group is a collection of users. Each user has exactly one primary group and can be a member of secondary groups.6 By default, each user’s primary group has the same name as their username.

Just as a user has a uid, a group has a gid. On most distros, gids for regular user groups also start at 1,000.

You can see a user’s username, uid, groups, and gid with the id command.

 ❯ id                                                                
uid=501(alexkgold) gid=20(staff) groups=20(staff),12(everyone),61(localaccounts),79(_appserverusr),80(admin),81(_appserveradm),98(_lpadmin),701(com.apple.sharepoint.group.1),33(_appstore),100(_lpoperator),204(_developer),250(_analyticsusers),395(com.apple.access_ftp),398(com.apple.access_screensharing),400(com.apple.access_remote_ae)

On my laptop, I’m a member of a number of different groups.

There’s one extra special user – called the admin, root, sudo, or super user. They get the ultra-cool uid 0. That user has permission to do anything on the system. You almost never want to actually log in as the root user. Instead, you make users and add them to the admin or sudo group so that they have the ability to temporarily assume those admin powers.
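To make the difference between names and ids a little more concrete, here’s a minimal sketch that asks the id command for each piece separately. It assumes a typical Linux system with a root account, and the variable names are just ones I picked for the example.

```shell
# Numeric uid and gid of whoever is running this shell
my_uid="$(id -u)"
my_gid="$(id -g)"

# The same information as names rather than numbers
# (these can come back empty if the uid has no entry in /etc/passwd,
# which sometimes happens inside containers)
my_name="$(id -un 2>/dev/null)"
my_group="$(id -gn 2>/dev/null)"

echo "user:  ${my_name:-unknown} (uid $my_uid)"
echo "group: ${my_group:-unknown} (gid $my_gid)"

# You can also ask about another user by name. Root has uid 0.
root_uid="$(id -u root)"
echo "root has uid $root_uid"
```

The -u and -g flags give numbers, and adding n switches to names. That’s handy when you need just one value in a script rather than the full id readout.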

On Ubuntu, the easiest way to make users is with the adduser command. Once you have a user, you may need to change the password, which you can do at any time with the passwd command. Both adduser and passwd start interactive prompts, so you don’t need to do much more than run those commands. (The lower-level useradd command also creates users, but it isn’t interactive and expects everything as flags.)

Command          What it does
su <username>    Change to be a different user.
whoami           Get username of current user.
id               Get full user + group info on current user.
passwd           Change password.
adduser          Add a new user.

8.3.1 File Permissions

Every object in Linux is just a file. Every log – file. Every picture – file. Every program – file. Every system setting – file.

So determining whether a user can take a particular action is really a question of whether they have the right permissions on a particular file.

Note

The question of who’s allowed to do what – authorization – is an extremely deep one. There’s a chapter all about authorization, how it differs from authentication, and the different ways your IT/Admins might want to manage it later in the book.

This is just going to be a high-level overview of basic Linux authorization.

There are three permissions you can have: read, write, and execute. Read means you’re allowed to see the contents of a file, write means you can save a changed version of a file, and execute means you’re allowed to run the file as a program.

The execute permission really only makes sense for some kinds of files - what would it mean to execute a csv file? But Linux doesn’t care – you can assign any combination of these three permissions for any file.

How are these permissions assigned? Every file has an owner and an owning group.

So you can think of permissions in Linux as being assigned in a 3x3 grid. The owner, the owning group, and everyone else can have permissions to read, write, or execute the file.


          Owner   Group   Everyone Else
Read      ✅/❌   ✅/❌   ✅/❌
Write     ✅/❌   ✅/❌   ✅/❌
Execute   ✅/❌   ✅/❌   ✅/❌

To understand better, let’s look at the permissions on an actual file.

Running ls -l on a directory gives you the list of files in that directory, along with their permissions. The first few columns of the list give you the full set of file permissions – though they can be a little tricky to read.

So, for example, here are a few lines of the output of running ls -l on a Python project I’ve got.

❯ ls -l                                                           
-rw-r--r--  1 alexkgold  staff     28 Oct 30 11:05 config.py
-rw-r--r--  1 alexkgold  staff   2330 May  8  2017 credentials.json
-rw-r--r--  1 alexkgold  staff   1083 May  8  2017 main.py
drwxr-xr-x 33 alexkgold  staff   1056 May 24 13:08 tests

Each line of this readout starts with a series of 10 characters giving the file permissions, followed by a number, then the file’s owner and the file’s group.7 Let’s learn how to read these.

The file’s owner and group are the easiest to understand. In this case, I (alexkgold) own all the files, and the group of all the files is staff.

The 10 character file permissions are relative to that user and group.

The first character indicates the type of file – - for normal and d for a directory.

The next 9 characters indicate the three permissions – r for read, w for write, and x for execute, with - meaning the permission isn’t granted – first for the user, then the group, then any other user on the system.

So, for example, my config.py file with permissions of rw-r--r-- indicates the user (alexkgold) can read and write the file, and everyone else – including members of the file’s group, staff – can only read it.

In the course of administering a server, you will probably need to change a file’s permissions. You can do so using the chmod command.

For chmod, permissions are indicated with only 3 numbers – one each for the user, the group, and everyone else. The way this works is pretty clever – you just sum up the permissions as follows: 4 for read, 2 for write, and 1 for execute. You can check for yourself, but any set of permissions can be uniquely identified by a number between 0 and 7.8

So chmod 765 <filename> would give the user full permissions, read and write to the group, and read and execute to everyone else. This would be a strange set of permissions to give a file, but it’s a perfectly valid chmod command.
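If you’d like to see the arithmetic in action, here’s a small sketch that builds a mode digit by digit and applies it to a throwaway file. It assumes a Linux machine with GNU coreutils, since stat -c is the Linux form of the command (macOS uses different flags). The variable names are made up for the example.

```shell
# Work on a throwaway file so nothing important is touched
demo_file="$(mktemp)"

# Build each digit by adding 4 (read), 2 (write), and 1 (execute)
owner=$((4 + 2))   # read + write
group=$((4))       # read only
other=$((0))       # no permissions
mode="${owner}${group}${other}"

chmod "$mode" "$demo_file"

# %a prints the numeric mode and %A prints the familiar rwx string
numeric="$(stat -c '%a' "$demo_file")"
symbolic="$(stat -c '%A' "$demo_file")"
echo "$mode -> $numeric $symbolic"

rm -f "$demo_file"
```

Reading the numeric and symbolic forms side by side is a nice way to convince yourself that each digit really does map to one rwx triplet.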

Note

If you spend any time administering a Linux server, you almost certainly will at some point find yourself running into a problem and frustratedly applying chmod 777 to rule out a permissions issue.

I can’t in good faith tell you not to do this – we’ve all been there. But if it’s something important, be sure you change it back once you’re finished figuring out what’s going on.

In some cases you might actually want to change the owner or group of a file. You can do so using the chown command, specifying users and groups by either name or id. If you’re changing the group, the group name gets prefixed with a colon.
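Here’s a rough sketch of the chown syntax. Changing a file’s owner to someone else normally needs sudo, so to keep this runnable by anyone it just sets a throwaway file’s owner and group to the current user’s own ids, which changes nothing but shows the shape of the command. It assumes GNU stat on Linux.

```shell
demo_file="$(mktemp)"

# chown <user>:<group> <file> sets both at once.
# Here they are set to our own uid and primary gid, which needs no special privileges.
chown "$(id -u):$(id -g)" "$demo_file"

# A leading colon changes only the group
chown ":$(id -g)" "$demo_file"

# %u and %g print the numeric owner and group
owner_and_group="$(stat -c '%u:%g' "$demo_file")"
echo "$owner_and_group"

rm -f "$demo_file"
```

On a real server, the same pattern with names would look something like sudo chown test-user:my-group some-file, where all three names are placeholders.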

In some cases, you might not be the correct user to take a particular action. You might not want to change the file permissions, but instead to change who you are. In that case, you can switch users with the su command.

Some actions are also reserved for the admin user. For example, let’s take a look at this configuration file:

 ❯ ls -l /etc/config/my-config                      
-rw-r--r--  1 root  system  4954 Dec  2 06:37 config.conf

As you can see, all users can read this file to check the configuration settings. But this file is owned by root, and only the owner has write permissions. So I could run cat config.conf to see it. Or I could go into it with vim config.conf, but I’d find myself stuck if I wanted to make changes.

So if I want to change this configuration file, I’d need to temporarily assume my root powers to make changes. Instead of switching to be the root user, I would run sudo vim config.conf and open the file for editing with root permissions.

Command                       What it does                       Helpful options + notes
chmod <permissions> <file>    Modifies permissions on a file.    Number indicates permissions for user, group, others: add 4 for read, 2 for write, 1 for execute, 0 for nothing.
chown <user/group> <file>     Change the owner of a file.        Can be used for user or group. Prefix a group with a colon, e.g. :my-group.
su <username>                 Change active user.
sudo <command>                Adopt super user permissions for the following command.

8.4 Installing Stuff

There are several different ways to install programs for Linux, and you’ll see a few of them throughout this book.

Just as CRAN and PyPI are repositories for R and Python packages, Linux distros also have their own repositories. For Ubuntu, the apt command is used for accessing and installing .deb files from the Ubuntu repositories. For CentOS and Red Hat, the yum command is used for installing .rpm files.

Note

The examples below are all for Ubuntu, since that’s what we use in the lab for this book. Conceptually, using yum is very similar, though the exact commands differ somewhat.

When you’re installing packages in Ubuntu, you’ll often see commands prefixed with apt-get update && apt-get upgrade -y. This makes your machine refresh its list of available packages from the repositories and upgrade everything installed to the latest version. The -y flag answers yes to the confirmation prompt.

Packages are installed with apt-get install <package>. Depending on which user you are, you may need to prefix the command with sudo.

You can also install packages that aren’t from the central package repository. Doing that will generally involve downloading a file directly from a URL – usually with wget – and then installing it from the file you’ve downloaded, often with the gdebi command.
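To tie those pieces together, here’s a sketch of the two installation routes written as shell functions. The functions are only defined here, not run, because running them needs sudo rights and network access. The function names are my own invention, and I’m assuming gdebi has already been installed (on Ubuntu it comes from the gdebi-core package).

```shell
# Sketch only: these functions are defined but not called.

# Refresh the package list, then install one package from the Ubuntu repositories
install_from_repo() {
  sudo apt-get update && sudo apt-get install -y "$1"
}

# Download a .deb file from a URL, then install it with gdebi.
# -O names the downloaded file and -n tells gdebi not to prompt.
install_from_url() {
  deb_file="$(basename "$1")"
  wget -O "$deb_file" "$1" && sudo gdebi -n "$deb_file"
}
```

You’d then call something like install_from_repo gdebi-core, followed by install_from_url with the address of the .deb file you want, for example a placeholder like https://example.com/some-package.deb.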

Command                                  What it does
apt-get update && apt-get upgrade -y     Fetch and install upgrades to system packages.
apt-get install <package>                Install a system package.
wget                                     Download a file from a URL.
gdebi                                    Install local .deb file.

8.5 Debugging and troubleshooting

There are three main resources you’ll need to manage as the server admin – CPU, RAM, and storage space. There’s more on all three of these and how to make sure you’ve got enough in Chapter 10.

For now, we’re just going to go over how to check how much you’ve got, how much you’re using, and getting rid of stuff that’s misbehaving.

8.5.1 Storage

A common culprit for weird server behavior is running out of storage space. There are two handy commands for monitoring the amount of storage you’ve got – du and df. These commands are almost always used with the -h flag to put file sizes in human-readable formats.

du, short for disk usage, gives you the size of individual files inside a directory. This can be helpful for finding your largest files or directories if you think you might need to clean up things. It’s particularly useful in combination with the sort command.

For example, here’s the result of running du on the chapters directory where the text files for this book live.

 ❯ du -h chapters | sort -h                                      
 44K    chapters/sec2/images-servers
124K    chapters/sec3/images-scaling
156K    chapters/sec2/images
428K    chapters/sec2/images-traffic
656K    chapters/sec1/images-code-promotion
664K    chapters/sec1/images-docker
1.9M    chapters/sec1/images-repro
3.4M    chapters/sec1
3.9M    chapters/sec3/images-auth
4.1M    chapters/sec3
4.5M    chapters/sec2/images-networking
5.3M    chapters/sec2
 13M    chapters

So if I were thinking about cleaning up this directory, I could see that my images-networking directory in sec2 is the biggest single bottom-level directory. If you find yourself needing to find big files on your Linux server, it’s worth spending some time with the help pages for du. There are lots of really useful options.
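If you want to try the du and sort combination without poking around a real project, here’s a small sketch that builds a throwaway directory with one obviously large subdirectory. It assumes GNU sort (for the -h flag), and the directory and file names are invented for the example.

```shell
# Build a small throwaway directory tree with one large subdirectory
demo_dir="$(mktemp -d)"
mkdir "$demo_dir/small" "$demo_dir/big"
head -c 524288 /dev/urandom > "$demo_dir/small/a.bin"     # about 512 KB
head -c 2097152 /dev/urandom > "$demo_dir/big/b.bin"      # about 2 MB

# Sort smallest to largest and keep the last few lines, where the biggest entries are
du -h "$demo_dir" | sort -h | tail -n 3

# The final line is the top-level total, so the line before it is the largest subdirectory
largest="$(du -h "$demo_dir" | sort -h | tail -n 2 | head -n 1)"
echo "largest subdirectory: $largest"

rm -rf "$demo_dir"
```

Piping into tail (or using sort -rh with head) keeps the output manageable on a directory with thousands of entries.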

du is useful for identifying large files and directories on a server. df, for disk free, is useful for checking whether the disk a directory lives on is running out of room. If you’re struggling to write into a directory – perhaps getting out-of-space errors – df can help you diagnose the problem.

df answers the question – given a file or directory, what device is it mounted on and how full is that device?

So here’s the result of running the df command on that same chapters directory.

 ❯ df -h chapters                                                    
Filesystem     Size   Used  Avail Capacity iused      ifree %iused  Mounted on
/dev/disk3s5  926Gi  163Gi  750Gi    18% 1205880 7863468480    0%   /System/Volumes/Data

So you can see that the chapters folder lives on a disk called /dev/disk3s5 that’s a little less than 1 TB and is 18% full – no problem. On a server this can be really useful to know, because it’s quite easy to switch a disk out for a bigger one in the same spot.
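Here’s a minimal sketch of pulling just the percentage out of df so a script can act on it. It uses the -P flag, which asks for a fixed POSIX layout that’s easier to parse than the default. The 90% threshold is an arbitrary number I picked for the example.

```shell
# -P gives one line per filesystem, with the percentage used in column 5
# and the mount point in column 6
usage_line="$(df -P . | tail -n 1)"
percent_used="$(echo "$usage_line" | awk '{print $5}' | tr -d '%')"
mount_point="$(echo "$usage_line" | awk '{print $6}')"

echo "$mount_point is ${percent_used}% full"

# 90 is an arbitrary threshold chosen for this example
if [ "$percent_used" -ge 90 ]; then
  echo "warning: running low on space"
fi
```

Swap the dot for any directory you’re having trouble writing to, and you’ll see how full the disk behind it is.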

Command   What it does                      Helpful options
du        Check size of files.              du -h <dir> | sort -h is the most common usage.9 Also useful to combine with head.
df        Check storage space on device.    -h

8.5.2 Monitoring processes

Every program your computer runs is a process. For example, when you type python on the command line to open a REPL, that’s a process. Running more complicated programs usually involves more than one process.

For example, running RStudio involves (at minimum) one process for the IDE itself and one for the R session that it uses in the background. The relationships between these different processes are mostly hidden from you – the end user.

As a server admin, finding runaway processes, killing them, and figuring out how to prevent them from happening again is a pretty common task. Runaway processes usually misbehave by using up the entire CPU or filling up the machine’s entire RAM.

Just as users and groups have ids, each process has a numeric process id (pid). Each process also has an owner – this can be either a service account or a real user. If you’ve got a rogue process, the pattern is to try to find the process and make note of its pid. Then you can immediately end the process by pid with the kill command.

So, how do you find a troublesome process?

The top command is a good first stop. top shows the top CPU-consuming processes in real time. Here’s the top output from my machine as I write this sentence.

PID    COMMAND      %CPU TIME     #TH    #WQ  #PORT MEM    PURG   CMPRS PGRP
0      kernel_task  16.1 03:56:53 530/10 0    0     2272K  0B     0B    0
16329  WindowServer 16.0 01:53:20 23     6    3717  941M-  16M+   124M  16329
24484  iTerm2       11.3 00:38.20 5      2    266-  71M-   128K   18M-  24484
29519  top          9.7  00:04.30 1/1    0    36    9729K  0B     0B    29519
16795  Magnet       3.1  00:39.16 3      1    206   82M    0B     39M   16795
16934  Arc          1.8  18:18.49 45     6    938   310M   144K   61M   16934
16456  Messages     1.7  06:58.27 4      1    603   138M   2752K  63M   16456
1      launchd      1.7  13:41.03 4/1    3/1  3394+ 29M    0B     6080K 1
573    diagnosticd  1.4  04:31.97 3      2    49    2417K  0B     816K  573
16459  zoom.us      1.3  66:38.37 30     3    2148  214M   384K   125M  16459
16575  UniversalCon 1.3  01:15.89 2      1    131   12M    0B     2704K 16575

In most instances, the first three columns are the most useful. You’ve got the pid, the name of the command, and how much CPU it’s using. Right now, nothing is using very much CPU. If I were to find something concerning – perhaps an R process that is using 500% of CPU – I would want to take note of its pid so I could end it with kill.

So much CPU?

For top (and most other commands), CPU is expressed as a percent of single core availability. So on a modern machine, it’s very common to see CPU totals well over 100%. Seeing a single process using over 100% of CPU is rarer.

The top command takes over your whole terminal. You can exit with Ctrl + c.

Another useful command for finding runaway processes is ps aux.10 It lists all processes currently running on the system, along with how much CPU and RAM they’re using. You can sort the output with the --sort flag and specify sorting by CPU with --sort -%cpu or by memory with --sort -%mem.

Because ps aux returns every running process on the system, you’ll probably want to pipe the output into head.

Another useful way to use ps aux is in combination with grep. If you pretty much know what the problem is – often this might be a runaway R or Python process – ps aux | grep <name> can be super useful to get the pid.

For example, here are the RStudio processes currently running on my system.

 > ps aux | grep "RStudio\|USER"                                                                                      [10:21:18]
USER               PID  %CPU %MEM      VSZ    RSS   TT  STAT STARTED      TIME COMMAND
alexkgold        23583   0.9  1.7 37513368 564880   ??  S    Sat09AM  17:15.27 /Applications/RStudio.app/Contents/MacOS/RStudio
alexkgold        23605   0.5  0.4 36134976 150828   ??  S    Sat09AM   1:58.16 /Applications/RStudio.app/Contents/MacOS/rsession --config-file none --program-mode desktop 
Tip

The grep command above looks a little weird because I used a little trick. I wanted to keep the header in the output, so the regex I used matches both the header line (USER) and the thing I actually care about (RStudio).

Command   What it does                         Helpful options
top       See what’s running on the system.
ps aux    See all system processes.            Consider using --sort and piping into head or grep.
kill      Kill a system process.               -9 to force kill immediately.
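Here’s a small sketch of the find-the-pid-then-kill pattern from this section. It starts a harmless sleep command in the background to stand in for a runaway process, so you can try it without endangering anything real. It assumes the Linux version of ps, since --sort isn’t available on macOS.

```shell
# Start a harmless long-running process in the background to play the runaway job
sleep 300 &
runaway_pid=$!

# The most CPU-hungry processes, with the header line kept at the top
ps aux --sort=-%cpu | head -n 5

# kill -0 sends no signal; it only checks that the process exists
kill -0 "$runaway_pid" && echo "process $runaway_pid is running"

# Ask the process to stop. Reach for kill -9 only if it ignores this.
kill "$runaway_pid"
wait "$runaway_pid" 2>/dev/null

if kill -0 "$runaway_pid" 2>/dev/null; then
  still_running=yes
else
  still_running=no
fi
echo "still running: $still_running"
```

In real life you wouldn’t have the pid handed to you by $!, so you’d read it out of the top or ps aux output instead.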

8.5.3 Managing networking

Networking is a complicated topic, which we’ll cover in detail in Chapter 9. For now, it’s important to be able to see what’s running on your server that is accessible from the outside world on a particular port.

The main command to help you see what ports are being used and by what services is the netstat command. netstat returns the services that are running and the ports associated with them. netstat is generally most useful with the -tlp flags, which restrict the output to TCP ports that are being listened on and show the program associated with each.


Command   What it does                          Helpful options
netstat   See ports and services using them.    Usually used with -tlp.
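One wrinkle worth knowing: netstat comes from the net-tools package, which isn’t installed by default on newer Ubuntu releases. The ss command is usually there instead and accepts the same -t, -l, and -p flags. Here’s a sketch that uses whichever one it finds. I’ve added -n, which is my own addition and just shows port numbers rather than trying to look up service names.

```shell
# Prefer ss if it is installed, otherwise fall back to netstat
if command -v ss >/dev/null 2>&1; then
  listen_cmd="ss -tlnp"
elif command -v netstat >/dev/null 2>&1; then
  listen_cmd="netstat -tlnp"
else
  listen_cmd=""
fi

if [ -n "$listen_cmd" ]; then
  # Lists TCP ports that are being listened on, plus the owning program where visible
  $listen_cmd
else
  echo "neither ss nor netstat is installed"
fi
```

Without sudo you’ll generally only see program names for processes you own, so on a server you’ll often want to put sudo in front.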

Sometimes you know you’ve got a service running on your machine, but you just can’t seem to get the networking working. It can be useful to access the service directly without having to deal with networking. You can do this with port forwarding, also called tunneling.

SSH port forwarding allows you to take the output of a port on a remote server, route it through SSH, and display it as if it were on a local port. For example, let’s say I’ve got RStudio Server running on my server. Maybe I don’t have networking set up yet, or I just can’t get it working. If I’ve got SSH to my server working properly, I can double check that the service is working as I expect and the issue really is somewhere in the network.

I find that the syntax for port forwarding completely defies my memory and I have to google it every time I use it. For the kind of port forwarding you’ll use most often in debugging, you’ll use the -L flag.

ssh -L <local port>:<remote ip>:<remote port> <ssh hostname>

When you’re doing SSH forwarding with -L, the local port is a port on the machine where you run the ssh command – usually your laptop. The remote IP and remote port are the destination of the traffic as seen from the machine you’re ssh-ed into (aka your server).

Since the service I want is running on the server itself, I almost always want to use localhost as the remote IP, and I usually want to use the same port remotely and locally – unless that port is reserved or already in use on my laptop.

So let’s say I’ve got RStudio Server running on my server at my-ds-workbench.com on port 3939. Then I could run ssh -L 3939:localhost:3939 my-user@my-ds-workbench.com. With this command, I can bypass networking and just access whatever is at port 3939 on my server (hopefully RStudio Server!) by just going to localhost:3939 in my laptop’s browser.

8.5.4 Understanding PATHs

Let’s say you want to open R on your command line. Once you’ve got everything properly configured, you can just type R and have it open right up.

 ❯ R                                                       

R version 4.2.0 (2022-04-22) -- "Vigorous Calisthenics"
Copyright (C) 2022 The R Foundation for Statistical Computing
Platform: x86_64-apple-darwin17.0 (64-bit)

R is free software and comes with ABSOLUTELY NO WARRANTY.
You are welcome to redistribute it under certain conditions.
Type 'license()' or 'licence()' for distribution details.

  Natural language support but running in an English locale

R is a collaborative project with many contributors.
Type 'contributors()' for more information and
'citation()' on how to cite R or R packages in publications.

Type 'demo()' for some demos, 'help()' for on-line help, or
'help.start()' for an HTML browser interface to help.
Type 'q()' to quit R.

>

But how does the operating system know what you mean when you type R? If you’ve been reading carefully, you’ve realized that running the command means opening a particular runnable file, and R isn’t a file path on my system.

You can actually just type the complete path of a runnable binary into your command line. For example, on my MacBook, my version of R is at /usr/local/bin/R, so I could open an R session by typing that full path. Sometimes it can be handy to be precise about exactly which executable you’re opening (looking at you, multiple versions of Python), so you may want to use full paths to executables.

If you ever want to check which actual executable is being used by a command, you can use the which command. For example, on my system this is the result of which R.

 ❯ which R                                                    
/usr/local/bin/R

Most of the time you don’t want to have to bother with full paths for executables. You want to just type R on the command line and have R open. Moreover, there are cases where functionality relies on another executable being able to find R and run it – think of running RStudio Server, which starts a version of R under the hood.

The operating system knows how to find the actual runnable programs on your system via something called the path. When you type R into the command line, it searches along the path to find a version of R it can run.

You can check your path at any time by just echoing the PATH environment variable with echo $PATH. On my MacBook, this is what the path looks like.

 ❯ echo $PATH                                                      
/opt/homebrew/bin:/opt/homebrew/sbin:/usr/local/bin:/System/Cryptexes/App/usr/bin:/usr/bin:/bin:/usr/sbin:/sbin

Later, when we get into running versions of programs that aren’t the system versions, we may have to append locations to the path so that we can run them easily.
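Appending to the path can be sketched with a throwaway directory and a tiny script. The command name hello-tool is made up for this example, and the change to PATH only lasts for the current shell session.

```shell
# Make a throwaway directory holding one tiny executable script
tool_dir="$(mktemp -d)"
printf '#!/bin/sh\necho "hello from hello-tool"\n' > "$tool_dir/hello-tool"
chmod 755 "$tool_dir/hello-tool"

# Before the directory is on the PATH, the shell cannot find the command
command -v hello-tool || echo "hello-tool not found yet"

# Append the directory to the PATH for this shell session only
PATH="$PATH:$tool_dir"
export PATH

# Now the command resolves, and it can be run by name alone
found_at="$(command -v hello-tool)"
echo "found at: $found_at"
tool_output="$(hello-tool)"
echo "$tool_output"

rm -rf "$tool_dir"
```

Because the shell searches the path from left to right, putting a directory at the end means system versions of a command still win. Put it at the front instead if you want your version to take priority.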

8.6 Lab: A working Data Science Workbench

In this lab, we’re going to take your server from an empty server that you can access to one with users and useful software installed and running – even if it’s not yet accessible to the outside world.

8.6.1 Step 1: Create a non-root user

The first thing we’re going to do is create a user so that you can log in without running as root all the time. In general, if you’ve got a multitenant server, you’re going to want users for each actual human who’s accessing the system.

I’m going to use the username test-user. If you want to be able to copy/paste commands from the online book, I’d advise doing the same. If you were creating users based on real humans, I’d advise using their names.

Let’s create a user using the adduser command. This will walk us through a set of prompts to create a new user with a home directory and a password. Feel free to add any information you want – or to leave it blank – when prompted.

sudo adduser test-user

We want this new user to be able to adopt root privileges. Remember that this is determined by whether the user is part of the sudo group.

sudo usermod -aG sudo test-user

Hopefully it’s reasonably intuitive that -aG means add to group: -a appends, and -G names the group to append to.

8.6.2 Step 2: Add an SSH Key for your user

In order to be able to connect as a new user, we need to add the public key as an authorized key for this user. If you’re the server admin, you’ll have your users create their SSH keys, share the public keys with you, and you’ll put them into the right place on the server.

In this case you’re both the user logging in and the admin, so you’ll just add your own SSH key in your own home directory.

We do so by putting the public key into the .ssh/authorized_keys file inside the user’s home directory.

If you’ve already got an SSH key you use, feel free to use that one. If you don’t have one, google how to create an SSH key and do it on your laptop.

You’ll need to scp the public key to the server first.11
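Here’s a sketch of what that scp command could look like, to be run from your laptop. I’ve wrapped it in a function so that nothing happens until you call it. The pem key path and the fallback server address are placeholders you’d need to change, and I’m assuming your public key has the default name.

```shell
# Placeholders: change these to match your own setup before calling the function
PEM_KEY="$HOME/path/to/server-key.pem"     # the pem key for the server
PUBLIC_KEY="$HOME/.ssh/id_ed25519.pub"     # the public half of your own SSH key
SERVER_ADDRESS="${SERVER_ADDRESS:-your.server.address}"

# Defined but not called here, since it needs a reachable server.
# -i selects the pem key; the file lands in the ubuntu user's home directory.
copy_public_key() {
  scp -i "$PEM_KEY" "$PUBLIC_KEY" "ubuntu@${SERVER_ADDRESS}:/home/ubuntu/"
}
```

Once the placeholders are filled in, running copy_public_key does the actual copy.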

Note that we have to specify that we’re using the pem key for the server, which file we’re copying, which user we’re connecting as, and where it’s going on the server.

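Now our public key is on the server, but it’s in the ubuntu user’s home directory, it’s owned by the ubuntu user, and it has the wrong permissions. We need to move it into test-user’s home directory, change its owner, and add it to authorized_keys with the right permissions.

Here are the commands to do so: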

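ssh -i do4ds-lab-key.pem ubuntu@$SERVER_ADDRESS #log in as ubuntu using the pem key
ubuntu@ip-172-31-2-42:~$ cd /home/ubuntu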
ubuntu@ip-172-31-2-42:~$ sudo mv /home/ubuntu/id_ed25519.pub /home/test-user/ 
ubuntu@ip-172-31-2-42:~$ sudo chown test-user /home/test-user/id_ed25519.pub
ubuntu@ip-172-31-2-42:~$ su test-user #change user
Password:
test-user@ip-172-31-2-42:/home/ubuntu$ cd ~ #go to home dir
test-user@ip-172-31-2-42:~$ mkdir -p .ssh #create .ssh directory
test-user@ip-172-31-2-42:~$ chmod 700 .ssh # Add appropriate permissions to directory
test-user@ip-172-31-2-42:~$ cat id_ed25519.pub >> .ssh/authorized_keys #add public key to end of authorized_keys file
test-user@ip-172-31-2-42:~$ chmod 600 .ssh/authorized_keys #set permissions
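If you end up doing this for several users, the directory and permission steps above can be bundled into a function. This is a minimal sketch, not the book’s method: install_pubkey is a hypothetical helper name, it assumes you run it as the user who owns the home directory, and the demo uses a scratch directory and a placeholder string rather than a real key.

```shell
# install_pubkey HOME_DIR PUBKEY_FILE
# Hypothetical helper: appends a public key to HOME_DIR/.ssh/authorized_keys
# using the same permissions as the manual steps (700 on .ssh, 600 on the file).
# Run it as the user who owns HOME_DIR.
install_pubkey() {
  home_dir="$1"
  pubkey_file="$2"
  mkdir -p "$home_dir/.ssh"
  chmod 700 "$home_dir/.ssh"
  cat "$pubkey_file" >> "$home_dir/.ssh/authorized_keys"
  chmod 600 "$home_dir/.ssh/authorized_keys"
}

# Demo in a scratch directory with a placeholder (not a real key).
scratch="$(mktemp -d)"
echo "ssh-ed25519 AAAAplaceholder test-user@laptop" > "$scratch/id_ed25519.pub"
install_pubkey "$scratch" "$scratch/id_ed25519.pub"
ls -ld "$scratch/.ssh" "$scratch/.ssh/authorized_keys"
```

The ls -ld output lets you eyeball the permission strings, which is a handy habit whenever SSH refuses a key.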

Now we’re all set up with SSH, and you can log in as a normal user from your laptop just using ssh test-user@$SERVER_ADDRESS.

When we used the pem key, we needed to specify it with -i. If you followed standard instructions for creating a key, it was generated with the default name, probably id_ed25519.12 The ssh command knows to automatically use a default key, so you no longer need the -i option.

If you want to use a different key name or if you have multiple SSH keys for different services, you can set up a config file that maps addresses to keys so you don’t always have to use -i. A google search should return good instructions for setting up your SSH config.

If you want to set up an SSH config for this server, I’d advise waiting until we’ve got a permanent URL for it in the next chapter.
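For reference, here is roughly what such a config entry could look like once you have an address you’re happy with. Treat it as a sketch under assumptions: do4ds-lab is a made-up alias, the HostName is the example address from the scp command in this lab and should be replaced with your own, and the key is assumed to have the default file name.

```shell
# Assumptions: "do4ds-lab" is a made-up alias, the HostName is an example
# address to replace with your own, and the key has the default file name.
SSH_CONFIG="${SSH_CONFIG:-$HOME/.ssh/config}"
mkdir -p "$(dirname "$SSH_CONFIG")"

# Append a host entry that maps the alias to an address, user, and key.
cat >> "$SSH_CONFIG" <<'EOF'
Host do4ds-lab
    HostName ec2-54-159-134-39.compute-1.amazonaws.com
    User test-user
    IdentityFile ~/.ssh/id_ed25519
EOF

# Keep the config file private to your user.
chmod 600 "$SSH_CONFIG"
```

With an entry like this in place, ssh do4ds-lab would stand in for the longer ssh test-user@$SERVER_ADDRESS command.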

Now that we’re all set up, you should store the pem key somewhere safe and never use it to log in again.

Whenever you want to exit SSH and get back to your machine, you can just type exit.

8.6.3 Step 3: Install R and Python

Everything until now has been generic server administration. Now let’s get into some data science specific work – setting up R and Python.

Tip

If you run into trouble assuming sudo with your new user, try exiting SSH and coming back. Sometimes these changes aren’t picked up until you restart the shell.

8.6.3.1 Installing R

There are a number of ways to install R on your server including installing it from source, from the system repository, or using R-specific tooling.

You can use the system repository version of R, but then you just get whatever version of R happens to be in the repository when you run sudo apt-get install r-base. My preferred option is to use rig, which is an R-specific installation manager.

Note

As of this writing, rig only supports Ubuntu. If you want to install on a different Linux distro, you will have to install R a different way.

Posit makes R binaries available for a variety of different operating systems, including Ubuntu and RedHat. You can find instructions at https://docs.posit.co/resources/install-r/.

TODO: confirm that installing into /opt/R works for RStudio Server OSS.

There are good instructions on downloading rig and using it to install R on the GitHub repo – https://github.com/r-lib/rig.

Once you’ve got R installed on your server, you can check that it’s running by just typing R into the command line. If that works, you’re good to move on to the next step.
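If you’d rather check without dropping into an interactive R session, you can ask the shell where the executables live. The helper below is a small sketch; check_cmd is a hypothetical name, and it simply reports whether a command is on the PATH.

```shell
# check_cmd NAME: report where NAME is installed, or that it is missing.
# (check_cmd is a hypothetical helper name.)
check_cmd() {
  if command -v "$1" >/dev/null 2>&1; then
    echo "$1: $(command -v "$1")"
  else
    echo "$1: not found"
  fi
}

check_cmd R
check_cmd Rscript
```

If both are found, R --version will print the version that is first on the PATH.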

8.6.4 Installing RStudio Server

Once you’ve got R installed, let’s download and install RStudio Server. This should be a very easy process: install the gdebi package from apt, which is used for installing downloaded packages, then download the RStudio Server package and install it.

I’m not going to reproduce the commands here because the RStudio Server version numbers change frequently and you probably want the most recent one.

You can find the exact commands on the Posit website at https://posit.co/download/rstudio-server/. Make sure to pick the version that matches your operating system. Since you’ve already installed R, you can skip down to actually installing the server.

Once you’ve installed, you can check the status with sudo systemctl status rstudio-server. If it says running, you’re good to go!

But knowing it’s good to go isn’t nearly as fun as actually trying it. We don’t have a stable public URL for the server yet, so we can’t just access it from our browser. This is a perfect use case for an SSH tunnel.

By default, RStudio Server is on port 8787, so we’ll tunnel port 8787 on the server to localhost:8787 on our laptop. The command to do that is ssh -L 8787:localhost:8787 test-user@$SERVER_ADDRESS.

Now, if you go to localhost:8787 in your browser, you should be able to access RStudio Server and log in with the username test-user and the password you set on the server.

8.6.5 Installing JupyterHub + JupyterLab

RStudio and R are system libraries. So when RStudio runs, it calls and owns the R process that you’ll use inside RStudio Server. In contrast, JupyterHub and JupyterLab are Python programs, so we install them inside a Python installation.

People pretty much only ever install R to do data science. In contrast, Python is one of the world’s most popular programming languages for general purpose computing. Contrary to what you might think, this actually makes configuring Python harder than configuring R.

The reason is that your system comes with a version of Python installed, but we don’t want to use that version. Given the centrality of that version to normal server operations, we want to leave it alone. It turns out that installing one or more other versions of Python and then ignoring the system version of Python isn’t totally trivial to do. Then we’re going to want to create a standalone virtual environment that’s just for running JupyterHub so it doesn’t get messed up later.

TODO: diagram of relationships of system python, DS python, Jupyter Python.

It’s very likely that the version of Python on your system is old. Generally we’re going to want to install a newer Python for doing data science work, so let’s start there. As of this writing, Python 3.10 is a relatively new version of Python, so we’ll install that one.

Let’s start by actually installing Python 3.10 on our system. We can do that with apt.

TODO: finish these instructions

> sudo su
> apt install python3.10-venv

Now that we’ve installed Python, we can create a standalone virtual environment for running JupyterHub.

> python3.10 -m venv /opt/jupyterhub
> source /opt/jupyterhub/bin/activate
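It’s easy to lose track of which Python you’re using, so it can help to check that a virtual environment is really active. Here is a minimal sketch, assuming python3 is on the PATH. The scratch path stands in for /opt/jupyterhub, and --without-pip is only there to keep the demo lightweight.

```shell
# Assumptions: python3 is on the PATH; the scratch path stands in for
# /opt/jupyterhub, and --without-pip just keeps this demo lightweight.
VENV_DIR="$(mktemp -d)/demo-venv"
python3 -m venv --without-pip "$VENV_DIR"
. "$VENV_DIR/bin/activate"

# Inside an active virtual environment, sys.prefix points at the venv
# while sys.base_prefix points at the Python it was created from.
python -c 'import sys; print(sys.prefix); print(sys.prefix != sys.base_prefix)'
```

The same one-liner works after activating /opt/jupyterhub. If the two prefixes are equal, the environment isn’t active in that shell.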

Now we’re going to actually get JupyterHub up and running inside the virtual environment we just created. JupyterHub produces docs that you can use to get up and running very quickly. If you have to stop for any reason, make sure to come back, assume sudo, and start the JupyterHub virtual environment we created.

Here were the installation steps that worked for me:

apt-get install npm nodejs # install node and npm first; the proxy below needs them
npm install -g configurable-http-proxy
python3 -m pip install jupyterhub jupyterlab notebook

ln -s /opt/jupyterhub/bin/jupyterhub-singleuser /usr/local/bin/jupyterhub-singleuser # symlink in singleuser server, necessary because we're using virtual environment

jupyterhub

If all went well, you’ll now have JupyterHub up and running on port 8000!

If you want to confirm, tunnel in with ssh -L 8000:localhost:8000 test-user@$SERVER_ADDRESS.

8.6.5.1 Running JupyterHub as a service

As I mentioned above, JupyterHub is a Python process, not a system process. This is ok, but it means that we’ve got to remember the command to start it if we have to restart it, and that it won’t auto restart if it were to fail for any reason.

A program that runs in the background on a machine, starts automatically, and is controlled by systemctl is called a daemon. Since we want JupyterHub to be a daemon, we’ve got to add it as a system daemon, which isn’t hard.

We don’t need it right now, but it’ll be easier to manage JupyterHub later on from a config file that’s in /etc/jupyterhub.

Let’s create a default config file and move it into the right place using

> jupyterhub --generate-config
> mkdir -p /etc/jupyterhub
> mv jupyterhub_config.py /etc/jupyterhub

Now we’ve got to daemon-ize JupyterHub. There are two steps – create a file describing the service for the server’s daemon, and then start the service.

To start with, end the existing JupyterHub process. If you’ve still got that terminal open, you can do so with ctrl + c. If not, you can use your ps aux and grep skills to find and kill the JupyterHub processes.
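As a refresher, here is one way the ps aux and grep approach could look. It’s a sketch rather than the only way to do it: find_pids is a hypothetical helper name, and the demo uses a harmless sleep process as a stand-in for JupyterHub.

```shell
# find_pids PATTERN: print the PIDs of processes whose ps line matches PATTERN.
# (find_pids is a hypothetical helper; the second grep drops the grep itself.)
find_pids() {
  ps aux | grep -- "$1" | grep -v grep | awk '{print $2}'
}

# Demo with a harmless stand-in process instead of JupyterHub.
sleep 300 &
demo_pid=$!
sleep 1                 # give the background process a moment to start
find_pids "sleep 300"   # the printed PIDs should include $demo_pid
kill "$demo_pid"
```

On the server, you would run find_pids jupyterhub and then kill the PIDs it prints.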

On Ubuntu, daemons are managed by a tool called systemd, and adding one is really straightforward.

First, add the following to /etc/systemd/system/jupyterhub.service. If you’re reading this book in hard copy, you can go to the online version to copy/paste or get this file on the book’s Git repo at TODO.

/etc/systemd/system/jupyterhub.service
[Unit]
Description=Jupyterhub
After=syslog.target network.target

[Service]
User=root
Environment="PATH=/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/opt/jupyterhub/bin"
ExecStart=/opt/jupyterhub/bin/jupyterhub -f /etc/jupyterhub/jupyterhub_config.py

[Install]
WantedBy=multi-user.target

Hopefully this file is pretty easy to parse. There are two things to notice. First, the Environment line adds /opt/jupyterhub/bin to the path, which is where our virtual environment is.

Second, the ExecStart line is the startup command and includes our -f /etc/jupyterhub/jupyterhub_config.py – this is the command to start JupyterHub with the config we created a few seconds ago.

Now we just need to reload the daemon tool so it picks up the new service it has available and start JupyterHub!

systemctl daemon-reload
systemctl start jupyterhub

You should now be able to see that JupyterHub is running using systemctl status jupyterhub and can see it again by tunneling to it.

To set JupyterHub to start automatically when the server boots, run systemctl enable jupyterhub.
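One thing to be aware of: the service file above doesn’t include a Restart setting, and by default systemd does not restart a service that exits. If you want JupyterHub to come back after a crash, one option is a drop-in file that adds a restart policy. This is a sketch rather than part of the setup above. The file name restart.conf and the five-second delay are assumptions, and you could equally add the two settings to the [Service] section of jupyterhub.service.

```ini
# /etc/systemd/system/jupyterhub.service.d/restart.conf
# (hypothetical drop-in file name; any name ending in .conf in this directory works)
[Service]
# Restart the service if it exits with an error.
Restart=on-failure
# Wait 5 seconds between restart attempts (an arbitrary choice).
RestartSec=5
```

After adding the file, run systemctl daemon-reload and systemctl restart jupyterhub so the new settings take effect.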

8.7 Running a Plumber API in a Container

In addition to running a development workbench on your server, you might want to run a data science project. If you’re running a Shiny app, Shiny Server is easy to configure along the lines of RStudio Server.

However, if you put an API in a container like we did in Chapter 5, you might want to just deploy that somewhere on your server as a running container.

The first step is just to install docker on your system with sudo apt-get install docker.io. You can check that you can run docker with docker ps. You may need to adopt sudo privileges to do so.

Once we’ve got docker installed, getting the API running is almost trivially easy using the command we used back in Chapter 5 to run our container.

docker run --rm -d \
  -p 8555:8000 \
  --name palmer-plumber \
  alexkgold/plumber

The one change you might note is that I’ve changed the port on the server to be 8555, since we’ve already got JupyterHub running on 8000.

Now, we can SSH tunnel into our server and view our running API at localhost:8555/__docs__/.

This was easy to get something up and running quickly. But you should notice that this isn’t daemonized – if we restart the server or the container dies for any reason, it won’t auto-restart.

It’s not generally a best practice to daemon-ize a docker container by just putting the run command into systemd. Instead, you should use a container management system like Docker Compose or Kubernetes that is designed specifically to manage running containers. Getting deeper into those is beyond the scope of this book.

8.8 Comprehension Questions

  1. Create a mind map of the following terms: Operating System, Windows, MacOS, Unix, Linux, Distro, Ubuntu
  2. When you initially SSH-ed into your server using ubuntu@$SERVER_ADDRESS, what user were you and what directory did you enter? What about when you used test-user@$SERVER_ADDRESS?
  3. What are the 3x3 options for Linux file permissions? How are they indicated in an ls -l command?
  4. How would you do the following?
    1. Find and kill the process IDs for all running rstudio-server processes.

    2. Figure out which port JupyterHub is running on.

    3. Create a file called secrets.txt, open it with vim, write something in, close and save it, and make it so that only you can read it.

8.8.1 Questions for Alex

We didn’t actually make use of the EBS volume we mounted for home dirs. Should we do that? https://www.tecmint.com/move-home-directory-to-new-partition-disk-in-linux/


  1. Or at least they weren’t supposed to. There’s an interesting history of lawsuits around the BSD operating system including Unix code. BSD is a Unix clone that was the predecessor of MacOS.↩︎

  2. People who are pedantic about operating systems or the history of computing will scream that the original release of Linux was just the operating system kernel, not a full operating system like Unix. I’ve noted it here to satisfy pedants, but it doesn’t matter much in practice.↩︎

  3. The remainder are almost entirely Windows servers. There are a few other Unix-like systems that you might encounter, like Oracle Solaris. There is no MacOS server. There is a product called Mac Server, but it’s just a program for managing Mac desktops and iOS devices.↩︎

  4. CentOS (short for Community ENTerprise Operating System) is an open source operating system maintained by Red Hat. The relationship between RHEL and CentOS is changing. The details are somewhat complicated, but most people expect less adoption of CentOS in enterprise settings going forward.↩︎

  5. My goal here is to be useful not precise, so I’m intermingling bash commands and Linux system commands because they’re useful. If you know the difference and are pedantic enough to care, this list isn’t for you anyway.↩︎

  6. Depending on your version of Linux, there may be a limit of 16 groups per user.↩︎

  7. You can generally ignore the number, which is the number of links to the file.↩︎

  8. Clever eyes may realize that this is just the base 10 representation of a 3 digit binary number.↩︎

  9. Note that if you’re using du with -h, sort also needs a -h so it knows to sort by human-readable attributes. Otherwise the sorting gets weird.↩︎

  10. This is another one where you’ll almost never use ps without aux.↩︎

  11. Alternatively, you could open a file of the right name and just copy/paste the contents of your public key in there. Either way works.

    scp -i do4ds-lab-key.pem   ~/.ssh/id_ed25519.pub   ubuntu@ec2-54-159-134-39.compute-1.amazonaws.com:/home/ubuntu

    ↩︎
  12. The pattern is id_<key type>. ed25519 is the standard SSH key type as of this writing.↩︎