10 Application Administration
The last few chapters have focused on how to run a Linux server. But you don’t care about running a Linux server – you care about doing data science on a server. That means you’ll need the know-how to run data science applications like JupyterHub, RStudio, R, Python, and more.
In this chapter, you’ll learn how to install and administer applications on a Linux server and get pointers for managing data science tools like R, Python, and the system packages they use.
Linux app install and config
The first step to running applications on a server is installing them. Most software you install will come from system repositories. Your system will have several default repositories; you can add others to install software from non-default repositories.
For Ubuntu, the apt command is used for interacting with repositories of .deb files. The yum command is used for installing .rpm files on CentOS and Red Hat.
The examples below are all for Ubuntu, since that’s what we are using in the labs for this book. Conceptually, using yum is very similar, though the exact commands differ somewhat.
On Ubuntu, packages are installed with apt-get install <package>. Depending on your user, you may need to prefix the command with sudo.
In addition to installing packages, apt is also the utility for keeping the lists of available packages up to date with update and for bringing all packages on your system to their latest version with upgrade. When you find Ubuntu commands online, it’s common to see them prefixed with apt-get update && apt-get upgrade -y to update all system packages to the latest version. The -y flag bypasses a manual confirmation step.
Some packages may not live in system repositories at all. To install that software, you will download a file on the command line, usually with wget, and then install the software from the file, often with gdebi.
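As a rough sketch, the two install routes described above might look like the following. The download URL and the file name some-app.deb are hypothetical placeholders, and the small run helper is an assumption added here so that the sketch only prints each command unless DRY_RUN=0 is set.

```shell
#!/usr/bin/env bash
# Sketch of both install routes. By default this only prints the commands;
# set DRY_RUN=0 to actually run them on an Ubuntu server.
run() {
  if [ "${DRY_RUN:-1}" = "1" ]; then
    echo "+ $*"
  else
    "$@"
  fi
}

# Route 1: install from the system repositories.
run sudo apt-get update                  # refresh the lists of available packages
run sudo apt-get upgrade -y              # bring installed packages up to date
run sudo apt-get install -y gdebi-core   # Ubuntu package that provides gdebi

# Route 2: download a .deb file and install the software from that file.
# The URL and file name below are hypothetical placeholders.
run wget https://example.com/downloads/some-app.deb
run sudo gdebi some-app.deb
```

Printing first and running second is a cautious habit when copying install commands from the internet, since it lets you read exactly what will run with sudo.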
Application Configuration
Most applications require some configuration after they’re installed. Configuration may include connecting to auth sources, setting display and access controls, or configuring networking. On your personal computer, you’d probably find these settings in menus. On a server, no such menu exists.
Application behavior is usually configured through one or more config files. For applications hosted inside a Docker container, behavior is often configured with environment variables, sometimes in addition to config files.
The application you’re running will have documentation on how to set different configuration options. That documentation is probably dry and boring, but reading it will put you ahead of most people trying to administer the application.
Where To Find Application Files
Linux applications often use several files located in different locations on the filesystem. Here are some of the ones you’ll use most frequently:
- /bin, /opt, /usr/local, /usr/bin: installation locations for software.
- /etc: configuration files for applications.
- /var: variable data, most commonly log files in /var/log or /var/lib.
This means that on a Linux server, the files for a particular application probably don’t all live in the same directory. Instead, you might run the application from the executable in /opt, configure it with files in /etc, and troubleshoot from logs in /var.
Configuration with Vim and Nano
Since application configuration is in text files, you’ll spend a fair bit of time editing text files to administer applications. Unlike on your personal computer, where you click a text file to open and edit it, you’ll need to work with a command line text editor when you’re working on a server.
There are two command line text editors you’ll probably encounter: Nano and Vim. While they’re both powerful text editing tools, they can also be intimidating if you’ve never used them.
You can open a file by typing nano <filename> or vim <filename>.
Depending on your system, you may have Vi in place of Vim. Vi is the original fullscreen text editor for Linux. Vim is its successor (Vim stands for Vi IMproved). The only difference germane to this section is that you open Vi with vi <filename>.
When you open Nano, some helpful-looking prompts will be at the bottom of the screen. You’ll see that you can exit with ^x. But should you try to type that, you’ll discover the ^ isn’t the caret character. In Nano’s prompts, ^ is short for the Ctrl key, on Windows, Linux, and Mac keyboards alike, so Ctrl+x will exit.
Where Nano gives you helpful – if obscure – hints, a first experience with Vim is the stuff of computer nightmares. You’ll type words, and they won’t appear onscreen. Instead, you’ll experience dizzying jumps around the page. Words and entire lines of text will disappear without a trace.
Many newbie command line users would now be unable to do anything, even to exit and try again. But don’t worry; there’s a way out of this labyrinth. This happens because Vim uses the letter keys to navigate the page and interact with Vim itself, as well as to type words. Vim inherits this design from Vi, which was created before keyboards uniformly had arrow keys.
Vim is an extremely powerful text editor. Vim includes keyboard shortcuts, called keybindings, that make it fast to move within and between lines and to select and edit text. The learning curve is steep, but I recommend posting a list of keybindings beside your desk and getting comfortable. Most IDEs you might use, including RStudio, JupyterLab, and VS Code, have vim modes. This introduction will be just enough to get you in and out of Vim successfully.
When you enter Vim, you’re in the (now poorly named) normal mode, which is for navigation only. Pressing the i key activates insert mode, which will feel normal for those used to arrow keys. In insert mode, words will appear when you type, and the arrow keys will navigate you around the page.
Once you’ve escaped, you may wish never to return to normal mode, but it’s the only way to save files and exit Vim. You can return to normal mode with the escape key.
To do file operations, type a colon, :, followed by the shortcut for what you want to do, and enter. The two most common commands you’ll use are w for save (write) and q for quit. You can combine these to save and quit in one command using :wq.
Sometimes, you may want to exit without saving, or you may have opened and changed a file you don’t have permission to edit. If you’ve made changes and try to exit with :q, you’ll find yourself in an endless loop of warnings that your changes won’t be saved. You can tell Vim you mean it with the exclamation mark, !, and exit using :q!.
Reading logs
Once your applications are up and running, you may run into issues. Even if you don’t, you may want to examine how things are going.
Most applications write their logs somewhere inside the /var directory. Some activities will get logged to the main log at /var/log/syslog. Other things may get logged to /var/log/<application name> or /var/lib/<application name>.
It’s essential to get comfortable with the commands to read text files so you can examine logs (and other files). The commands I use most commonly are:
- cat prints a whole file, starting at the beginning.
- less prints a file, starting at the beginning, but only one screenful at a time.
- head prints only the first few lines and exits. It is especially useful for peering at the beginning of a large file, like a CSV file, so you can quickly preview the column heads and the first few values.
- tail prints only the last few lines of a file. This is especially useful for log files, as the newest logs are appended to the end of a file. This is such a common practice that “tailing a log file” is a common phrase.
  - Sometimes, you’ll want to use the -f flag (for follow) to tail a file with a live view as it updates.
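To make the difference between these commands concrete, here is a minimal sketch that builds a throwaway log file and reads it from each end. The file and its log lines are invented for illustration; on a real server you would point the same commands at something under /var/log.

```shell
#!/usr/bin/env bash
# Build a small sample log so the commands have something to read.
# The log format here is made up for illustration.
log_file="$(mktemp)"
for i in 1 2 3 4 5 6 7 8 9 10 11 12; do
  echo "2024-01-01 12:00:$(printf '%02d' "$i") INFO request $i handled" >> "$log_file"
done

cat "$log_file"                          # print the whole file

first_lines="$(head -n 3 "$log_file")"   # only the first three lines
last_lines="$(tail -n 3 "$log_file")"    # only the last three lines
echo "$first_lines"
echo "$last_lines"

# less is interactive, so it is left out of this script. To page through
# a file or follow a live log on a server, you would run, for example:
#   less /var/log/syslog
#   tail -f /var/log/syslog
```

The -n flag sets how many lines head and tail print; without it, both default to ten.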
Sometimes, you want to search around inside a text file. You’re probably familiar with the power and hassle of regular expressions (regex) to search for specific character sequences in text strings. The Linux command grep is the main regex command.
In addition to searching in text files, grep is often helpful in combination with other commands. For example, you may want to use the pipe to send the output of ls into grep to search for a particular file in a big directory.
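Both uses of grep can be sketched in a few lines. The directory, file names, and log lines below are invented for illustration.

```shell
#!/usr/bin/env bash
# Set up a scratch directory with a few files and a small log.
# All names and log lines here are invented for illustration.
work_dir="$(mktemp -d)"
touch "$work_dir/report_2023.csv" "$work_dir/report_2024.csv" "$work_dir/notes.txt"
printf '%s\n' \
  "INFO server started" \
  "ERROR could not bind to port 8000" \
  "INFO retrying" \
  "ERROR giving up" > "$work_dir/app.log"

# Search inside a file: print every line that contains ERROR.
errors="$(grep "ERROR" "$work_dir/app.log")"
echo "$errors"

# Pipe the output of ls into grep to find files whose names match a regex.
# -E turns on extended regular expressions.
matches="$(ls "$work_dir" | grep -E '^report_[0-9]{4}\.csv$')"
echo "$matches"
```

The same pipe pattern works with most commands that print text, which is why grep shows up so often at the end of a longer command.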
Running the right commands
Let’s say you want to open Python on your command line. One option would be to type the absolute path to a Python install every time. For example, I’ve got a version of Python in /usr/bin, so /usr/bin/python3 works.
But in most cases, it’s nice to type python3 and have the correct version open up:
Terminal
> python3
Python 3.9.6 (default, May 7 2023, 23:32:45)
[Clang 14.0.3 (clang-1403.0.22.14.1)] on darwin
Type "help", "copyright", "credits" or "license" for more.
>>>
Sometimes, you might want to go the other way. Maybe python3 opens Python correctly, but you’re unsure where it’s located. You can use the which command to identify the actual executable for a command. For example, this is the result of which python3 on my system:
Terminal
> which python3
/usr/bin/python3
Sometimes, you must make a program available without providing a full path every time. Some applications rely on others, like RStudio Server needing to find R or Jupyter Notebook needing your Python kernels.
The operating system knows how to find executables via the path. The path is a set of directories that the system knows to search when it tries to run a program. The path is stored in an environment variable conveniently named PATH.
You can check your path at any time with echo $PATH. On my MacBook, this is what it looks like:
Terminal
> echo $PATH
/opt/homebrew/bin:/opt/homebrew/sbin:/usr/local/bin:
/System/Cryptexes/App/usr/bin:/usr/bin:/bin:/usr/sbin:/sbin
For the system to find a newly installed application by name, the application’s location has to be on the path. Let’s say I installed a new version of Python in /opt/python. That’s not on my PATH, so my system wouldn’t be able to find it.
I can get it on the path in one of two ways. The first option would be to add /opt/python to my PATH every time a terminal session starts, usually via a file in /etc or a shell startup file such as .bashrc or .zshrc.
The other option is to create a symlink to the new application in a directory already on the PATH. A symlink makes it appear that a copy of a file is in a different location without actually moving it. Symlinks are created with the ln command.
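Here is a small sketch of both options using scratch directories, so nothing on the real system is touched. The mypython executable and both directories are hypothetical stand-ins: one directory plays the role of /opt/python, the other plays the role of a directory that is already on the path.

```shell
#!/usr/bin/env bash
# Stand-ins for a real install. All names here are hypothetical.
install_dir="$(mktemp -d)"   # plays the role of /opt/python
bin_dir="$(mktemp -d)"       # plays the role of a directory already on PATH
printf '#!/bin/sh\necho "hello from mypython"\n' > "$install_dir/mypython"
chmod +x "$install_dir/mypython"

# Option 1: add the install directory to PATH. To make this stick, the
# same export line would go in a shell startup file.
export PATH="$install_dir:$PATH"
found_via_path="$(command -v mypython)"   # like which, but built into the shell
echo "$found_via_path"

# Option 2: symlink the executable into a directory that is on PATH.
# The argument order is: ln -s <target> <link name>
ln -s "$install_dir/mypython" "$bin_dir/mypython-link"
export PATH="$bin_dir:$PATH"
link_output="$(mypython-link)"
echo "$link_output"
```

Note the argument order for ln -s: the existing file comes first and the new link second, which is easy to get backward.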
Running applications as services
On your personal computer, you probably have programs that start every time your computer does. Maybe this happens for Slack, Microsoft Teams, or Spotify. An application that starts on boot and runs in the background, waiting for input, is called a daemon or a service.
Most server-based applications are configured to run as a service, so users can access them without needing permissions to start them first. For example, on a data science workbench, you’d want JupyterHub and/or RStudio Server to run as a service.
In Linux, the tool to turn a regular application into a daemon is called systemd. Some applications automatically configure themselves with systemd when they’re installed. If your application doesn’t, or you want to alter the startup behavior, most applications have their systemd configuration in /etc/systemd/system/<service name>.service.
Daemonized services are controlled using the systemctl command line tool.
Basically, all modern Linux distros have coalesced around using systemd and systemctl. Older systems may not have them installed by default and you may have to install them or use a different tool.
The systemctl command has a set of sub-commands that are useful for working with applications. They look like systemctl <subcommand> <application>. Often systemctl has to be run with sudo, since you’re working with an application for all system users.
The most useful systemctl commands include start and stop, status for checking whether a program is running, and restart for a stop followed by a start. Many applications also support a reload command, which reloads configuration settings without restarting the process. Which settings require a restart vs. a reload depends on the application.
If you’ve changed a service’s systemd configuration, you can load changes with daemon-reload. You also can turn a service on or off for the next time the server starts with enable and disable.
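Put together, a typical session with these sub-commands might look like the sketch below. The service name jupyterhub is just an example, and the run helper is an assumption added here so that the sketch prints each command rather than touching a real service unless DRY_RUN=0 is set.

```shell
#!/usr/bin/env bash
# Prints the commands by default; set DRY_RUN=0 to run them on a server
# that uses systemd. "jupyterhub" is just an example service name.
run() {
  if [ "${DRY_RUN:-1}" = "1" ]; then
    echo "+ $*"
  else
    "$@"
  fi
}

service="jupyterhub"

run sudo systemctl status "$service"    # is the service running?
run sudo systemctl restart "$service"   # a stop followed by a start
run sudo systemctl daemon-reload        # pick up edits to .service files
run sudo systemctl enable "$service"    # start automatically on boot
```

Note that daemon-reload takes no service name, because it reloads systemd’s own view of all unit files.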
Running Docker Containers as a Service
People love Docker Containers because they easily run on most machines. To run a container as a service, you’ll need to make sure Docker itself is daemonized and then ensure the container you care about comes up whenever Docker does by setting a restart policy for the container.
However, many Docker services involve coordinating more than one container. If so, you’ll want to use a purpose-built system for managing multiple containers. The most popular are Docker Compose and Kubernetes.
Docker Compose is a relatively lightweight system that allows you to write a YAML file describing the containers you need and their relationship. You can then use a single command to launch the entire set of Docker Containers.
Docker Compose is fantastic for prototyping systems of Docker Containers and for running small-scale Dockerized deployments on a single server. There are many great resources online to learn more about Docker Compose.
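To give a feel for what such a YAML file looks like, here is a minimal sketch with a single service. It reuses the image name and port from the lab at the end of this chapter. The service name, the restart policy, and the scratch directory are assumptions made for illustration.

```shell
#!/usr/bin/env bash
# Write a minimal Compose file into a scratch directory.
project_dir="$(mktemp -d)"
cat > "$project_dir/compose.yaml" <<'EOF'
services:
  penguin-model:
    image: alexkgold/penguin-model
    ports:
      - "8080:8080"
    # Bring the container back up whenever Docker itself starts.
    restart: unless-stopped
EOF
cat "$project_dir/compose.yaml"

# With Docker and the Compose plugin installed, the whole set of
# containers would be launched from that directory with:
#   docker compose up -d
```

A real deployment would usually list several services in the same file, which is where Compose starts to pay off over individual docker run commands.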
Kubernetes is designed for a similar purpose, but instead of running a handful of containers on one server, Kubernetes is a heavy-duty production system designed to schedule up to hundreds or thousands of Docker-based workloads across a cluster of many servers.
In general, I recommend sticking with Docker Compose for the work you’re doing. If you need the full might of Kubernetes to do what you want, you probably should be working closely with a professional IT/Admin.
Managing R and Python
As the admin of a data science server, Python and R are probably the most critical applications you’ll manage.
The easiest path to making many users happy is having several versions of R and Python installed side-by-side. That way, users can upgrade their version of R or Python as it works for their project, not according to your upgrade schedule.
If you just run sudo apt-get install python3 or sudo apt-get install r-base, you’ll end up with only one version of Python or R, and it will be replaced whenever that package is upgraded.
Python-Specific Considerations
Python is one of the world’s most popular programming languages for general-purpose computing. This makes configuring Python harder. Getting Python up and running is famously frustrating on both servers and your personal computer. [1]
Almost every system comes with a system version of Python. This is the version of Python the operating system uses for various tasks. It’s almost always old, and you don’t want to mess with it.
To configure Python for data science, you have to install the versions of Python you want to use, get them on the path, and get the system version of Python off the path.
Installing data science Python versions into /opt/python makes this simpler. Managing versions of Python somewhere wholly distinct from the system Python removes some headaches, and adding a single directory to the path is easy.
My favorite route (though I’m biased) is to install Python from the pre-built binaries provided by Posit.
In Chapter 1, I mentioned that Conda is useful when you have to create a laptop-based data science environment for yourself, but isn’t great in production. Similarly, as an admin trying to install Python for all the users on a server, you should stay away from Conda.
Conda is meant to let users install Python for themselves without help from an admin. Now, you are that admin, and should you choose to use Conda, you’ll be fighting default behaviors the whole time. Configuring server-wide data science versions of Python is more straightforward without Conda.
R-Specific Considerations
Generally, people only install R to do data science, so where you install R is usually not a big issue. Using apt-get install is fine if you know you’ll only ever want one version of R.
If you want multiple versions, you’ll need to install them manually. I recommend installing into /opt/R with binaries provided by Posit or using rig, a great R installation manager that supports Windows, Mac, and Ubuntu.
Managing system libraries
As an admin, you’ll also have to decide what to do about system packages, which are Linux libraries you install from a Linux repository or the internet.
Many packages in Python and R don’t do any work themselves. Instead, they’re just language-specific interfaces to system packages. For example, any R or Python library that uses a JDBC database connector must use Java on your system. And many geospatial libraries make use of system packages like GDAL.
As the administrator, you must understand the system libraries required for your Python and R packages. You’ll also need to ensure they’re available and on the path.
For many of these libraries, it’s not a huge problem. You’ll install the required library using apt or the system package manager for your distro. In some cases (especially Java), more configuration may be necessary to ensure that the system package you need appears on the path when your code runs.
Some admins with sophisticated requirements around system library versions use Docker Containers or Linux Environment Modules to keep system libraries linked to projects.
Comprehension questions
- What are two different ways to install Linux applications, and what are the commands?
- What does it mean to daemonize a Linux application? What programs and commands are used to do so?
- How do you know if you’ve opened Nano or Vim? How would you exit them if you didn’t mean to?
- What are four commands to read text files?
- How would you create a file called secrets.txt, open it with Vim, write something in, close and save it, and make it so that only you can read it?
Lab: Installing applications
As we’ve started to administer our server, we’ve mostly been doing generic server administration tasks. Now, let’s set up the applications we need to run a data science workbench and get our API and Shiny app set up.
Step 1: Install Python
Let’s start by installing a data science version of Python, so we’re not using the system Python for data science purposes.
If you want just one version of Python, you can apt-get install a specific version. As of this writing, Python 3.10 is a relatively new version of Python, so we’ll install that one with:
Terminal
> sudo apt-get install python3.10-venv
Once you’ve installed Python, you can check that you’ve got the correct version by running the following:
Terminal
> python3 --version
This route to installing Python is easy if you only want one version. If you want to enable multiple versions of Python, installing Python with apt-get install isn’t the way to go.
Step 2: Install R
Since we’re using Ubuntu, we can use rig. There are good instructions on downloading rig and using it to install R on the r-lib/rig GitHub repo. Use those instructions to install the current R release on your server.
Once you’ve installed R on your server, you can check that it’s running by just typing R into the command line. If that works, you can move on to the next step. If not, you’ll need to ensure R got onto the path.
Step 3: Install JupyterHub and JupyterLab
JupyterHub and JupyterLab are Python programs, so we will run them from within a Python virtual environment. I’d recommend putting that virtual environment inside /opt/jupyterhub.
Here are the commands to create and activate a jupyterhub virtual environment in /opt/jupyterhub:
Terminal
> sudo python3 -m venv /opt/jupyterhub
> source /opt/jupyterhub/bin/activate
Now, we will get JupyterHub up and running inside the virtual environment we just created. JupyterHub has great docs (Google “JupyterHub quickstart”) to get up and running quickly. If you must stop for any reason, assume sudo and reactivate the JupyterHub virtual environment we created when you return.
Note that because we’re working inside a virtual environment, you may have to use the jupyterhub-singleuser version of the binary.
Step 4: Daemonize JupyterHub
Because JupyterHub is a Python process, not a system process, it won’t automatically get daemonized, so we’ll have to do it manually.
We don’t need it right now, but it will be easier to manage JupyterHub later on from a config file that’s in /etc/jupyterhub. To do so, activate the jupyterhub virtual environment, create a default JupyterHub config (Google for the command), and move it into /etc/jupyterhub/jupyterhub_config.py.
You can see working examples of the jupyterhub_config and other files mentioned in this lab in the GitHub repo for this book (akgold/do4ds) in the _labs/lab10 directory.
Now let’s move on to daemonizing JupyterHub. To start, kill the existing JupyterHub process (consult the cheat sheet in Appendix D if you need help). Since JupyterHub wasn’t automatically daemonized, you must create the systemd file in /etc/systemd/system/jupyterhub.service.
That file will need to add /opt/jupyterhub/bin to the path, because that’s where our virtual environment is. It will also have to provide the startup command and specify that JupyterHub should use the config we created.
Now, you’ll need to use systemctl to reload the daemon, start JupyterHub, and enable it.
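For orientation, here is one way the unit file could be drafted. Treat it as a sketch rather than the file from the book’s repo: the Description, After, User, and WantedBy values and the exact PATH string are assumptions, while the virtual environment and config locations come from the steps above. It writes the draft to a scratch directory so you can inspect it before copying it into place.

```shell
#!/usr/bin/env bash
# Draft the unit file in a scratch directory first. On the server it
# belongs at /etc/systemd/system/jupyterhub.service, which needs sudo.
unit_file="$(mktemp -d)/jupyterhub.service"
cat > "$unit_file" <<'EOF'
[Unit]
Description=JupyterHub
After=network.target

[Service]
# Assumption: run as root so JupyterHub can start sessions for other users.
User=root
# Put the virtual environment's bin directory on the path.
Environment="PATH=/opt/jupyterhub/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin"
# Startup command, pointing JupyterHub at the config file we created.
ExecStart=/opt/jupyterhub/bin/jupyterhub -f /etc/jupyterhub/jupyterhub_config.py

[Install]
WantedBy=multi-user.target
EOF
cat "$unit_file"

# Then, on the server:
#   sudo cp "$unit_file" /etc/systemd/system/jupyterhub.service
#   sudo systemctl daemon-reload
#   sudo systemctl start jupyterhub
#   sudo systemctl enable jupyterhub
```

If the service fails to start, sudo systemctl status jupyterhub is usually the quickest place to look for the error.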
Step 5: Install RStudio Server
You can find the commands to install RStudio Server on the Posit website. Make sure to pick the version that matches your operating system. Since you’ve already installed R, skip to the “Install RStudio Server” step.
Unlike JupyterHub, RStudio Server daemonizes itself right out of the box, so you can check and control the status with systemctl without further work.
Step 6: Run the Penguin API from Docker
First, you’ll have to ensure that Docker is available. It can be installed from apt using apt-get install docker.io. You may need to adopt sudo privileges to do so.
Once Docker is installed, running the API is almost trivially easy using the command we used in Chapter 6 to run our container.
Terminal
> sudo docker run --rm -d \
-p 8080:8080 \
--name penguin-model \
alexkgold/penguin-model
Once it’s up, you can check that it’s running with docker ps.
Step 7: Put up the Shiny app
We will use Shiny Server to host our Shiny app on the server. Start by moving the app code to the server. I put mine in /home/test-user/do4ds-lab/app by cloning the Git repo.
After that, you’ll need to:
- Open R or Python and rebuild the package library with {renv} or {venv}.
- Install Shiny Server using the instructions from the Shiny Server Admin Guide.
  - Note that you can skip steps to install R and/or Python, and the {shiny} package since we’ve already done that.
- Edit Shiny Server’s configuration file to run the right app.
- Start and enable Shiny Server with systemctl.
Lab Extensions
You might want to consider a few things before moving on to the next chapter, where we’ll start working on giving this server a stable public URL.
First, we haven’t daemonized the API. Feel free to try Docker Compose or set a restart policy for the container.
Second, neither the API nor the Shiny app will automatically update when we change them. You might want to set up a GitHub Action to do so. For Shiny Server, you’ll need to push the updates to the server and then restart Shiny Server. For the API, you’d need to configure a GitHub Action to rebuild the container and push it to a registry. You’d then need to tell Docker on the server to re-pull and restart the container.
Finally, there’s no authentication in front of our API. The API has limited functionality, so that’s not a huge worry. But if you had an API with more functionality, that might be a problem. Additionally, someone could try to flood your API with requests to make it unusable. The most common way to solve this is to buy a product that hosts the API for you or to put an authenticating proxy in front of the API. We’ll be adding NGINX soon, so you can try adding authentication later.
[1] See, for example, the XKCD comic titled Python Environment.