10 Application Administration

The last few chapters have focused on how to run a Linux server. But you don’t care about running a Linux server – you care about doing data science on a server. That means you’ll need the know-how to run data science applications like JupyterHub, RStudio, R, Python, and more.

In this chapter, you’ll learn how to install and administer applications on a Linux server. You’ll also pick up pointers for managing data science tools like R, Python, and the system packages they use.

Linux app install and config

The first step to running applications on a server is installing them. Most software you install will come from system repositories. Your system will have several default repositories; you can add others to install software from non-default repositories.

For Ubuntu, the apt command is used for interacting with repositories of .deb files. The yum command is used for installing .rpm files on CentOS and Red Hat.

Note

The examples below are all for Ubuntu, since that’s what we are using in the labs for this book. Conceptually, using yum is very similar, though the exact commands differ somewhat.

On Ubuntu, packages are installed with apt-get install <package>. Depending on your user, you may need to prefix the command with sudo.

Beyond installing packages, apt also keeps your system current. The update subcommand refreshes the lists of available packages, and upgrade brings every installed package to its latest version. When you find Ubuntu commands online, it’s common to see them prefixed with apt-get update && apt-get upgrade -y to update all system packages to the latest version. The -y flag bypasses a manual confirmation step.

Some packages may not live in system repositories at all. To install that software, you will download a file on the command line, usually with wget, and then install the software from the file, often with gdebi.
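As a rough sketch, the two install routes look something like this on Ubuntu. The package name htop, the download URL, and the .deb file name are placeholders, not anything used in this book's labs. The run helper is only there so the sketch can be tried safely: it prints each command instead of executing it unless you set DRY_RUN=0.

```shell
# Print each command instead of running it unless DRY_RUN=0 is set.
run() {
  if [ "${DRY_RUN:-1}" = "1" ]; then
    echo "+ $*"
  else
    "$@"
  fi
}

# Route 1: install from a system repository.
install_from_repo() {
  run sudo apt-get update              # refresh the lists of available packages
  run sudo apt-get upgrade -y          # bring installed packages up to date
  run sudo apt-get install -y "$1"     # install the requested package
}

# Route 2: download a .deb file, then install it with gdebi.
install_from_file() {
  run wget "$1"                        # download into the current directory
  run sudo gdebi "${1##*/}"            # ${1##*/} strips the URL down to the file name
}

install_from_repo htop                                          # placeholder package
install_from_file https://example.com/downloads/some-tool.deb   # placeholder URL
```

Run as-is, this only echoes the commands. On a real Ubuntu server you would type the underlying commands directly, or run the script with DRY_RUN=0.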

Application Configuration

Most applications require some configuration after they’re installed. Configuration may include connecting to auth sources, setting display and access controls, or configuring networking. On your personal computer, you’d probably find these settings in menus. On a server, no such menu exists.

Application behavior is usually configured through one or more config files. For applications hosted inside a Docker container, behavior is often configured with environment variables, sometimes in addition to config files.

The application you’re running will have documentation on how to set different configuration options. That documentation is probably dry and boring, but reading it will put you ahead of most people trying to administer the application.

Where To Find Application Files

Linux applications often use several files located in different locations on the filesystem. Here are some of the ones you’ll use most frequently:

/bin, /opt, /usr/local, /usr/bin – installation locations for software.
/etc – configuration files for applications.
/var – variable data, most commonly log files in /var/log or /var/lib.

This means that on a Linux server, the files for a particular application probably don’t all live in the same directory. Instead, you might run the application from the executable in /opt, configure it with files in /etc, and troubleshoot from logs in /var.

Configuration with Vim and Nano

Since application configuration is in text files, you’ll spend a fair bit of time editing text files to administer applications. Unlike on your personal computer, where you click a text file to open and edit it, you’ll need to work with a command line text editor when you’re working on a server.

There are two command line text editors you’ll probably encounter: Nano and Vim. While they’re both powerful text editing tools, they can also be intimidating if you’ve never used them.

You can open a file by typing nano <filename> or vim <filename>.

Note

Depending on your system, you may have Vi in place of Vim. Vi is the original fullscreen text editor for Unix, which Linux inherited. Vim is its successor (Vim stands for Vi IMproved). The only difference germane to this section is that you open Vi with vi <filename>.

When you open Nano, some helpful-looking prompts will be at the bottom of the screen. You’ll see that you can exit with ^x. But should you try to type that, you’ll discover the ^ isn’t the caret character. In Nano, ^ is shorthand for the Ctrl key. On a Mac, that’s the Control key, not Command (⌘). So Ctrl+x will exit.

Where Nano gives you helpful – if obscure – hints, a first experience with Vim is the stuff of computer nightmares. You’ll type words, and they won’t appear onscreen. Instead, you’ll experience dizzying jumps around the page. Words and entire lines of text will disappear without a trace.

Many newbie command line users would now be unable to do anything – even to exit and try again. But don’t worry; there’s a way out of this labyrinth. This happens because Vim uses the letter keys to navigate the page, interact with Vim itself, and type words. You see, Vim was created before keyboards uniformly had arrow keys.

Vim is an extremely powerful text editor. Vim includes keyboard shortcuts, called keybindings, that make it fast to move within and between lines and to select and edit text. The learning curve is steep, but I recommend posting a list of keybindings beside your desk and getting comfortable. Most IDEs you might use, including RStudio, JupyterLab, and VS Code, have Vim modes. This introduction will be just enough to get you in and out of Vim successfully.

When you enter Vim, you’re in the (now poorly named) normal mode, which is for navigation only. Pressing the i key activates insert mode, which will feel normal for those used to arrow keys. In insert mode, words will appear when you type, and the arrow keys will navigate you around the page.

Once you’ve escaped, you may wish never to return to normal mode, but it’s the only way to save files and exit Vim. You can return to normal mode with the escape key.

To do file operations, type a colon, :, followed by the shortcut for what you want to do, and enter. The two most common commands you’ll use are w for save (write) and q for quit. You can combine these to save and quit in one command using :wq.

Sometimes, you may want to exit without saving, or you may have opened and changed a file you don’t have permission to edit. If you’ve made changes and try to exit with :q, you’ll find yourself in an endless loop of warnings that your changes won’t be saved. You can tell Vim you mean it with the exclamation mark, !, and exit using :q!.

Reading logs

Once your applications are up and running, you may run into issues. Even if you don’t, you may want to examine how things are going.

Most applications write their logs somewhere inside the /var directory. Some activities will get logged to the main log at /var/log/syslog. Other things may get logged to /var/log/<application name> or /var/lib/<application name>.

It’s essential to get comfortable with the commands to read text files so you can examine logs (and other files). The commands I use most commonly are:

cat prints a whole file, starting at the beginning.
less prints a file, starting at the beginning, but only a few lines at a time.
head prints only the first few lines and exits. It is especially useful to peer at the beginning of a large file, like a csv file – so you can quickly preview the column heads and the first few values.
tail prints only the last few lines of a file. This is especially useful for log files, as the newest logs are appended to the end of a file. This is such a common practice that “tailing a log file” is a common phrase.
- Sometimes, you’ll want to use the -f flag (for follow) to tail a file with a live view as it updates.
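Here is a small sketch of these commands in action. It builds a throwaway log file in a temporary directory (the file name app.log and its contents are made up for illustration) so you can try it anywhere without touching real logs.

```shell
# Build a small throwaway log so the commands have something to read.
log_dir=$(mktemp -d)
log_file="$log_dir/app.log"
for i in 1 2 3 4 5 6 7 8 9 10 11 12; do
  echo "line $i: request handled" >> "$log_file"
done

cat "$log_file"          # print the whole file
head -n 3 "$log_file"    # print only the first 3 lines
tail -n 3 "$log_file"    # print only the last 3 lines

# less and tail -f are interactive, so they are shown here but not run:
#   less "$log_file"       # page through the file; press q to quit
#   tail -f "$log_file"    # keep printing new lines as they arrive; Ctrl+c to stop
```

Without the -n flag, head and tail each print 10 lines by default.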

Sometimes, you want to search around inside a text file. You’re probably familiar with the power and hassle of regular expressions (regex) to search for specific character sequences in text strings. The Linux command grep is the main regex command.

In addition to searching in text files, grep is often helpful in combination with other commands. For example, you may want to send the output of ls into grep using the pipe, |, to search for a particular file in a big directory.
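A minimal sketch of both uses follows. The log lines and file names are invented for the example, and everything lives in a temporary directory.

```shell
# Set up a scratch directory with a fake log and a few files.
work_dir=$(mktemp -d)
printf '%s\n' \
  "INFO  server started" \
  "ERROR could not connect to database" \
  "INFO  request handled" \
  "ERROR request timed out" > "$work_dir/app.log"
touch "$work_dir/report.csv" "$work_dir/penguins.csv" "$work_dir/notes.txt"

# Print only the lines that match a pattern.
grep "ERROR" "$work_dir/app.log"

# Count the matching lines instead of printing them.
error_count=$(grep -c "ERROR" "$work_dir/app.log")
echo "errors: $error_count"

# Pipe the output of ls into grep to find files by name.
# The pattern is a regex: \. is a literal dot and $ anchors the end of the name.
csv_files=$(ls "$work_dir" | grep '\.csv$')
echo "$csv_files"
```

The same pipe pattern works with most commands that print text, which is why grep shows up at the end of so many command lines.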

Running the right commands

Let’s say you want to open Python on your command line. One option would be to type the absolute path to a Python install every time. For example, I’ve got a version of Python in /usr/bin, so /usr/bin/python3 works.

But in most cases, it’s nice to type python3 and have the correct version open up:

Terminal

> python3
Python 3.9.6 (default, May  7 2023, 23:32:45) 
[Clang 14.0.3 (clang-1403.0.22.14.1)] on darwin
Type "help", "copyright", "credits" or "license" for more.
>>>

Sometimes, you might want to go the other way. Maybe python3 opens Python correctly, but you’re unsure where it’s located. You can use the which command to identify the actual executable for a command. For example, this is the result of which python3 on my system:

Terminal

> which python3                                                    
/usr/bin/python3

Sometimes, you must make a program available without providing a full path every time. Some applications rely on others, like RStudio Server needing to find R or Jupyter Notebook needing your Python kernels.

The operating system knows how to find executables via the path. The path is a set of directories that the system knows to search when it tries to run a program. The path is stored in an environment variable conveniently named PATH.

You can check your path at any time with echo $PATH. On my MacBook, this is what it looks like:

Terminal

> echo $PATH                                                      
/opt/homebrew/bin:/opt/homebrew/sbin:/usr/local/bin:
  /System/Cryptexes/App/usr/bin:/usr/bin:/bin:/usr/sbin:/sbin

When you install a new application, the system can only find it by name if its location is on the path. Let’s say I installed a new version of Python in /opt/python. That’s not on my PATH, so my system wouldn’t be able to find it.

I can get it on the path in one of two ways. The first option would be to add /opt/python to my PATH every time a terminal session starts, usually via a file in /etc or a shell startup file like .bashrc or .zshrc.

The other option is to create a symlink to the new application in a directory already on the PATH. A symlink makes it appear that a copy of a file is in a different location without actually moving it. Symlinks are created with the ln command.
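Both options can be sketched without root access by letting temporary directories stand in for the real locations. Here install_dir plays the role of something like /opt/python/bin, bin_dir plays the role of a directory already on the PATH, and mytool is a made-up program.

```shell
# Temp directories stand in for real locations so this runs without root.
install_dir=$(mktemp -d)   # e.g. /opt/python/bin
bin_dir=$(mktemp -d)       # e.g. /usr/local/bin
printf '#!/bin/sh\necho "hello from mytool"\n' > "$install_dir/mytool"
chmod +x "$install_dir/mytool"

# Option 1: add the install directory to PATH for this shell session.
old_path=$PATH
export PATH="$install_dir:$PATH"
found_via_path=$(command -v mytool)
echo "$found_via_path"
export PATH="$old_path"    # undo, so option 2 starts from a clean slate

# Option 2: symlink the program into a directory that is already on PATH.
export PATH="$bin_dir:$PATH"                    # pretend bin_dir was on PATH all along
ln -s "$install_dir/mytool" "$bin_dir/mytool"   # ln -s <target> <link name>
found_via_link=$(command -v mytool)
echo "$found_via_link"
link_target=$(readlink "$bin_dir/mytool")       # the file the link points to
echo "$link_target"
mytool
```

An export like the one in option 1 only lasts for the current session. To make it stick, the same line goes in a startup file such as ~/.bashrc for one user or, on many distros, a script in /etc/profile.d for everyone.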

Running applications as services

On your personal computer, you probably have programs that start every time your computer does. Maybe this happens for Slack, Microsoft Teams, or Spotify. Applications like these, which execute on startup and run in the background waiting for input, are called daemons or services.

Most server-based applications are configured to run as a service, so users can access them without needing permissions to start them first. For example, on a data science workbench, you’d want JupyterHub and/or RStudio Server to run as a service.

In Linux, the tool to turn a regular application into a daemon is called systemd. Some applications automatically configure themselves with systemd when they’re installed. If your application doesn’t, or you want to alter the startup behavior, most applications have their systemd configuration in /etc/systemd/system/<service name>.service.

Daemonized services are controlled using the systemctl command line tool.

Note

Basically all modern Linux distros have coalesced around using systemd and systemctl. Older systems may not have it installed by default and you may have to install it or use a different tool.

The systemctl command has a set of sub-commands that are useful for working with applications. They look like systemctl <subcommand> <application>. Often systemctl has to be run with sudo, since you’re working with an application for all system users.

The most useful systemctl commands include start and stop, status for checking whether a program is running, and restart for a stop followed by a start. Many applications also support a reload command, which reloads configuration settings without restarting the process. Which settings require a restart vs. a reload depends on the application.

If you’ve changed a service’s systemd configuration, you can load changes with daemon-reload. You also can turn a service on or off for the next time the server starts with enable and disable.
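As a sketch, here is a typical sequence after editing a service's systemd file. The service name myapp is hypothetical, and the run helper prints each command instead of executing it unless DRY_RUN=0 is set, so this is safe to try on a machine without systemd.

```shell
# Print each command instead of running it unless DRY_RUN=0 is set.
run() {
  if [ "${DRY_RUN:-1}" = "1" ]; then
    echo "+ $*"
  else
    "$@"
  fi
}

# Apply a changed .service file and make sure the service survives a reboot.
apply_service_changes() {
  run sudo systemctl daemon-reload              # pick up edits to the .service file
  run sudo systemctl restart "$1"               # stop, then start, the service
  run sudo systemctl status "$1" --no-pager     # check that it is running
  run sudo systemctl enable "$1"                # start it automatically at boot
}

apply_service_changes myapp   # hypothetical service name
```

The other subcommands follow the same shape, for example sudo systemctl stop myapp or sudo systemctl reload myapp.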

Running Docker Containers as a Service

People love Docker Containers because they easily run on most machines. To run a container as a service, you’ll need to make sure Docker itself is daemonized and then ensure the container you care about comes up whenever Docker does by setting a restart policy for the container.
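A sketch of those two steps, using the penguin model container from this book's lab as the example. The choice of the unless-stopped policy is an assumption (Docker also offers always and on-failure), and the run helper prints the commands instead of executing them unless DRY_RUN=0 is set.

```shell
# Print each command instead of running it unless DRY_RUN=0 is set.
run() {
  if [ "${DRY_RUN:-1}" = "1" ]; then
    echo "+ $*"
  else
    "$@"
  fi
}

start_container_service() {
  # Step 1: make sure the Docker daemon starts at boot (and start it now).
  run sudo systemctl enable --now docker

  # Step 2: start the container with a restart policy, so Docker brings it
  # back up whenever the daemon starts or the container exits.
  run sudo docker run -d \
    --restart unless-stopped \
    -p 8080:8080 \
    --name penguin-model \
    alexkgold/penguin-model
}

start_container_service
```

Note that there is no --rm flag here. Docker refuses to combine --rm with a restart policy, since one removes the container on exit and the other restarts it.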

However, many Docker services involve coordinating more than one container. If so, you’ll want to use a purpose-built system for managing multiple containers. The most popular are Docker Compose and Kubernetes.

Docker Compose is a relatively lightweight system that allows you to write a YAML file describing the containers you need and their relationship. You can then use a single command to launch the entire set of Docker Containers.

Docker Compose is fantastic for prototyping systems of Docker Containers and for running small-scale Dockerized deployments on a single server. There are many great resources online to learn more about Docker Compose.
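To give a feel for it, this sketch writes a small Compose file into a temporary directory. The penguin-model service uses the image from this book's lab. The proxy service, its nginx image, and its port mapping are assumptions added only to show how a second container and a dependency are declared.

```shell
# Write an illustrative Compose file to a scratch directory.
compose_dir=$(mktemp -d)
cat > "$compose_dir/docker-compose.yml" <<'EOF'
services:
  penguin-model:
    image: alexkgold/penguin-model
    ports:
      - "8080:8080"
    restart: unless-stopped
  proxy:                      # hypothetical second container
    image: nginx
    ports:
      - "80:80"
    depends_on:
      - penguin-model         # start the API before the proxy
    restart: unless-stopped
EOF

cat "$compose_dir/docker-compose.yml"

# On a server with Docker Compose installed, you would launch everything
# from that directory with a single command:
#   sudo docker compose up -d
```

The restart lines play the same role as a restart policy on docker run, so the whole set of containers comes back when Docker does.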

Kubernetes is designed for a similar purpose, but instead of running a handful of containers on one server, Kubernetes is a heavy-duty production system designed to schedule up to hundreds or thousands of Docker-based workloads across a cluster of many servers.

In general, I recommend sticking with Docker Compose for the work you’re doing. If you need the full might of Kubernetes to do what you want, you probably should be working closely with a professional IT/Admin.

Managing R and Python

As the admin of a data science server, Python and R are probably the most critical applications you’ll manage.

The easiest path to making many users happy is having several versions of R and Python installed side-by-side. That way, users can upgrade their version of R or Python as it works for their project, not according to your upgrade schedule.

If you just sudo apt-get install python3 or sudo apt-get install r-base, you’ll end up with only one version of Python or R, which will get overwritten whenever the package is upgraded.

Python-Specific Considerations

Python is one of the world’s most popular programming languages for general-purpose computing. This makes configuring Python harder. Getting Python up and running is famously frustrating on both servers and your personal computer.¹

Almost every system comes with a system version of Python. This is the version of Python the operating system uses for various tasks. It’s almost always old, and you don’t want to mess with it.

To configure Python for data science, you have to install the versions of Python you want to use, get them on the path, and get the system version of Python off the path.

Installing data science Python versions into /opt/python makes this simpler. Managing versions of Python somewhere wholly distinct from the system Python removes some headaches, and adding a single directory to the path is easy.

My favorite route (though I’m biased) is to install Python from the pre-built binaries provided by Posit.

Notes on Conda, Part II

In Chapter 1, I mentioned that Conda is useful when you have to create a laptop-based data science environment for yourself, but isn’t great in production. Similarly, as an admin trying to install Python for all the users on a server, you should stay away from Conda.

Conda is meant to let users install Python for themselves without help from an admin. Now, you are that admin, and should you choose to use Conda, you’ll be fighting default behaviors the whole time. Configuring server-wide data science versions of Python is more straightforward without Conda.

R-Specific Considerations

Generally, people only install R to do data science, so where you install R is usually not a big issue. Using apt-get install is fine if you know you’ll only ever want one version of R.

If you want multiple versions, you’ll need to install them manually. I recommend installing into /opt/R with binaries provided by Posit or using rig, a great R installation manager that supports Windows, Mac, and Ubuntu.

Managing system libraries

As an admin, you’ll also have to decide what to do about system packages, which are Linux libraries you install from a Linux repository or the internet.

Many packages in Python and R don’t do any work themselves. Instead, they’re just language-specific interfaces to system packages. For example, any R or Python library that uses a JDBC database connector must use Java on your system. And many geospatial libraries make use of system packages like GDAL.

As the administrator, you must understand the system libraries required for your Python and R packages. You’ll also need to ensure they’re available and on the path.

For many of these libraries, it’s not a huge problem. You’ll install the required library using apt or the system package manager for your distro. In some cases (especially Java), more configuration may be necessary to ensure that the system package you need appears on the path when your code runs.

Some admins with sophisticated requirements around system library versions use Docker Containers or Linux Environment Modules to keep system libraries linked to projects.

Comprehension questions

What are two different ways to install Linux applications, and what are the commands?
What does it mean to daemonize a Linux application? What programs and commands are used to do so?
How do you know if you’ve opened Nano or Vim? How would you exit them if you didn’t mean to?
What are four commands to read text files?
How would you create a file called secrets.txt, open it with Vim, write something in, close and save it, and make it so that only you can read it?

Lab: Installing applications

As we’ve started to administer our server, we’ve mostly been doing generic server administration tasks. Now, let’s set up the applications we need to run a data science workbench and get our API and Shiny app set up.

Step 1: Install Python

Let’s start by installing a data science version of Python, so we’re not using the system Python for data science purposes.

If you want just one version of Python, you can apt-get install a specific version. As of this writing, Python 3.10 is a relatively new version of Python, so we’ll install that one with:

Terminal

> sudo apt-get install python3.10-venv

Once you’ve installed Python, you can check that you’ve got the correct version by running the following:

Terminal

> python3 --version

This route to installing Python is easy if you only want one version. If you want to enable multiple versions of Python, apt-get install-ing Python isn’t the way to go.

Step 2: Install R

Since we’re using Ubuntu, we can use rig. There are good instructions on downloading rig and using it to install R on the r-lib/rig GitHub repo. Use those instructions to install the current R release on your server.

Once you’ve installed R on your server, you can check that it’s running by just typing R into the command line. If that works, you can move on to the next step. If not, you’ll need to ensure R got onto the path.

Step 3: Install JupyterHub and JupyterLab

JupyterHub and JupyterLab are Python programs, so we will run them from within a Python virtual environment. I’d recommend putting that virtual environment inside /opt/jupyterhub.

Here are the commands to create and activate a jupyterhub virtual environment in /opt/jupyterhub:

Terminal

> sudo python3 -m venv /opt/jupyterhub
> source /opt/jupyterhub/bin/activate

Now, we will get JupyterHub up and running inside the virtual environment we just created. JupyterHub has great docs (Google “JupyterHub quickstart”) to get up and running quickly. If you have to stop partway through, remember to assume sudo and re-activate the JupyterHub virtual environment when you return.

Note that because we’re working inside a virtual environment, you may have to use the jupyterhub-singleuser version of the binary.

Step 4: Daemonize JupyterHub

Because JupyterHub is a Python process, not a system process, it won’t automatically get daemonized, so we’ll have to do it manually.

We don’t need it right now, but it will be easier to manage JupyterHub later on from a config file that’s in /etc/jupyterhub. To do so, activate the jupyterhub virtual environment, create a default JupyterHub config (Google for the command), and move it into /etc/jupyterhub/jupyterhub_config.py.
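If you'd rather not search, here is one way those steps might look. It assumes JupyterHub was installed into the /opt/jupyterhub virtual environment, and it calls the binary by its full path so it works whether or not the environment is activated. The run helper prints each command instead of executing it unless DRY_RUN=0 is set.

```shell
# Print each command instead of running it unless DRY_RUN=0 is set.
run() {
  if [ "${DRY_RUN:-1}" = "1" ]; then
    echo "+ $*"
  else
    "$@"
  fi
}

make_jupyterhub_config() {
  # Writes jupyterhub_config.py into the current directory.
  run /opt/jupyterhub/bin/jupyterhub --generate-config
  # Create the config directory and move the file into place.
  run sudo mkdir -p /etc/jupyterhub
  run sudo mv jupyterhub_config.py /etc/jupyterhub/jupyterhub_config.py
}

make_jupyterhub_config
```

The generated file is almost entirely commented-out defaults, so moving it into place changes nothing until you edit it.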

Tip

You can see working examples of the jupyterhub_config and other files mentioned in this lab in the GitHub repo for this book (akgold/do4ds) in the _labs/lab10 directory.

Now let’s move on to daemonizing JupyterHub. To start, kill the existing JupyterHub process (consult the cheat sheet in Appendix D if you need help). Since JupyterHub wasn’t automatically daemonized, you must create the systemd file in /etc/systemd/system/jupyterhub.service.

That file needs to do three things: add /opt/jupyterhub/bin to the path, because that’s where our virtual environment is; provide the startup command; and specify that JupyterHub should use the config we created.
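A minimal sketch of such a file is below, written to a temporary directory so you can inspect it before copying it anywhere. It is not necessarily identical to the version in this book's repo. Running as root, the After= target, and the exact PATH value are assumptions you may want to adjust.

```shell
# Write an illustrative unit file to a scratch directory
# (the real one belongs in /etc/systemd/system).
unit_dir=$(mktemp -d)
cat > "$unit_dir/jupyterhub.service" <<'EOF'
[Unit]
Description=JupyterHub
After=network.target

[Service]
User=root
Environment="PATH=/opt/jupyterhub/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin"
ExecStart=/opt/jupyterhub/bin/jupyterhub -f /etc/jupyterhub/jupyterhub_config.py

[Install]
WantedBy=multi-user.target
EOF

cat "$unit_dir/jupyterhub.service"
```

The Environment line puts the virtual environment's bin directory on the path, and ExecStart is the startup command, with -f pointing JupyterHub at the config file. Once you're happy with it, copy the file to /etc/systemd/system/jupyterhub.service using sudo.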

Now, you’ll need to use systemctl to reload the daemon, start JupyterHub, and enable it.

Step 5: Install RStudio Server

You can find the commands to install RStudio Server on the Posit website. Make sure to pick the version that matches your operating system. Since you’ve already installed R, skip to the “Install RStudio Server” step.

Unlike JupyterHub, RStudio Server daemonizes itself right out of the box, so you can check and control the status with systemctl without further work.

Step 6: Run the Penguin API from Docker

First, you’ll have to ensure that Docker is available. It can be installed from apt using apt-get install docker.io. You may need to adopt sudo privileges to do so.

Once Docker is installed, running the API is almost trivially easy using the command we used in Chapter 6 to run our container.

Terminal

> sudo docker run --rm -d \
    -p 8080:8080 \
    --name penguin-model \
    alexkgold/penguin-model

Once it’s up, you can check that it’s running with docker ps.

Step 7: Put up the Shiny app

We will use Shiny Server to host our Shiny app on the server. Start by moving the app code to the server. I put mine in /home/test-user/do4ds-lab/app by cloning the Git repo.

After that, you’ll need to:

Open R or Python and rebuild the package library with {renv} (for R) or a venv virtual environment (for Python).
Install Shiny Server using the instructions from the Shiny Server Admin Guide.
- Note that you can skip steps to install R and/or Python, and the {shiny} package since we’ve already done that.
Edit Shiny Server’s configuration file to run the right app.
Start and enable Shiny Server with systemctl.
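For the configuration step, a sketch of what the file might contain is below. It is written to a temporary directory here; on the server, Shiny Server reads its configuration from /etc/shiny-server/shiny-server.conf. The port and log directory follow Shiny Server's usual defaults, and whether the shiny user can read an app inside a home directory is an assumption you should check against the permissions on your server.

```shell
# Write an illustrative Shiny Server config to a scratch directory.
conf_dir=$(mktemp -d)
cat > "$conf_dir/shiny-server.conf" <<'EOF'
# Run apps as the shiny user
run_as shiny;

server {
  listen 3838;

  location / {
    # Serve the single app that lives in this directory
    app_dir /home/test-user/do4ds-lab/app;
    log_dir /var/log/shiny-server;
  }
}
EOF

cat "$conf_dir/shiny-server.conf"
```

After copying the file into place with sudo, restart Shiny Server so it picks up the change.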

Lab Extensions

You might want to consider a few things before moving on to the next chapter, where we’ll start working on giving this server a stable public URL.

First, we haven’t daemonized the API. Feel free to try Docker Compose or set a restart policy for the container.

Second, neither the API nor the Shiny app will automatically update when we change them. You might want to set up a GitHub Action to do so. For Shiny Server, you’ll need to push the updates to the server and then restart Shiny Server. For the API, you’d need to configure a GitHub Action to rebuild the container and push it to a registry. You’d then need to tell Docker on the server to re-pull and restart the container.

Finally, there’s no authentication in front of our API. The API has limited functionality, so that’s not a huge worry. But if you had an API with more functionality, that might be a problem. Additionally, someone could try to flood your API with requests to make it unusable. The most common way to solve this is to buy a product that hosts the API for you or to put an authenticating proxy in front of the API. We’ll be adding NGINX soon, so you can try adding authentication later.

¹ See, for example, the XKCD comic titled Python Environment.