IT/Admin for Data Science

Welcome to the part of the book I wish I didn’t need to write – the part where you’ll learn about the basics of doing IT/Admin tasks yourself.

As a data scientist, you want to share a development environment with other data professionals or publish a data science project to non-technical stakeholders. That sharing requires a centralized server, and someone needs to administer that server.

In my experience, data scientists are at their best when paired with a professional IT/Admin who administers the servers. But that partnership often isn’t achievable.

You might work at a small organization that lacks dedicated IT/Admins. Or maybe you’re a student or hobbyist trying to cheaply DIY an environment. You may work at a sophisticated organization with professional IT/Admins, but they, unfortunately, lack the time, interest, or expertise necessary to be helpful.

Sometimes, you have to be your own IT/Admin to be able to take your work to production at all. It’s fair to say that many – if not most – data scientists will be responsible for administering the servers where their work runs at some point in their careers. And that’s a scary place to be.

Administering a server as a novice is like suddenly stepping into an 18-wheel tractor-trailer when you’ve never driven anything other than a cute Honda Civic.¹ You’re leaping from managing a personal device to wrangling a professional-scale work machine without the training to match.

Even with the many online resources available as support, the number of topics and the depth of each can be overwhelming. And being a lousy IT/Admin can lead to security vulnerabilities, system instability, and general annoyance.

This part will teach you the basics of being your own IT/Admin. You’ll be introduced to the IT/Admin topics that are relevant for a data science environment. By the end, you’ll be comfortable administering a simple data science workbench or server to host a data science project.

If you don’t have to be your own IT/Admin, that’s even better. Reading this part will give you an appreciation for what an IT/Admin does and help you be a better partner to them.

Getting and running a server

Many data science tasks require a server and supporting tools like networking and storage. These days, the most common way to set up a data science environment is to rent a server from a cloud provider. That’s why Chapter 7 introduces the cloud and how you might want to use it for data science purposes.

Unlike your phone or personal computer, you’ll never touch this cloud server you’ve rented. Instead, you’ll administer the server via a virtual interface from your computer. Moreover, servers generally don’t even have the kind of point-and-click interface you’re familiar with from your personal devices.

Instead, you’ll access and manage your server from the text-only command line. That’s why Chapter 8 is about setting up the command line on your local machine to make it convenient and ergonomic, and how to connect to your server for administration purposes using SSH.

Unlike the Apple, Windows, or Android operating systems on your personal devices, most servers run the Linux operating system. Chapter 9 will teach you a little about what Linux is and introduce you to the basics of Linux administration, including how to think about files and users on a multi-tenant server.

But you’re not just interested in running a Linux server. You want to use it to accomplish data science tasks. In particular, you want to use data science tools like R, Python, RStudio, JupyterHub, and more. You’ll need to learn how to install, run, and configure applications on your server. That’s why Chapter 10 is about application administration.

When your phone or computer gets slow or you run out of storage, it’s probably time for a new one. But a server is a working machine that can be scaled up or down to accommodate more people or heavier workloads over time. That means you may have to manage the server’s resources more actively than your personal devices. That’s why Chapter 11 is about managing and scaling server resources.

Making it (safely) accessible

Unless you’re doing something very silly, your personal devices aren’t accessible to anyone who isn’t physically touching the device. In contrast, most servers are only useful because they’re addressable on a computer network, perhaps even the open internet.

Putting a server up on the internet makes it useful but also introduces risk. Many dastardly plans for your personal devices are thwarted because a villain must physically steal them to get access. For a server, allowing digital access invites many more potential threats to steal data or hijack your computational resources for nefarious ends. Therefore, you’ve got to be careful about how you’re providing access to the machine.

Risk aside, there’s a lot of depth to computer networking, and just getting it working isn’t trivial. You can probably muddle through by following online tutorials, but that’s a great way to end up with connections that suddenly work and no idea what you did right or how you might break them in the future.

The good news is that it’s not magic. Chapter 12 discusses how computers find each other across a network. Once you understand a computer network’s basic structure and operations, you can configure your server’s networking and feel confident that you’ve done it right.

But you’re not done once you’ve configured basic connectivity for your server. You will want to take two more steps to make it safe and easy to access. The first is to host your server at a human-friendly URL, which you’ll learn to configure in Chapter 13. The second is to add SSL/TLS to your server to secure the traffic going to and from your server. You’ll learn how to do that in Chapter 14.
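To make those two steps a little more concrete, here is a minimal Python sketch of what a URL tells a client about your server. The function name `describe_url` and the `example.com` address are placeholders invented for illustration, not anything from the labs. The only facts it leans on are that `http` and `https` default to ports 80 and 443, and that `https` means the traffic is protected by SSL/TLS.

```python
from urllib.parse import urlsplit

# Well-known default ports for the two web schemes.
DEFAULT_PORTS = {"http": 80, "https": 443}


def describe_url(url: str) -> dict:
    """Split a URL into the pieces that matter when exposing a server.

    Hypothetical helper for illustration only.
    """
    parts = urlsplit(url)
    scheme = parts.scheme.lower()
    return {
        # The human-friendly name that DNS must map to the server's IP address.
        "host": parts.hostname,
        # An explicit port if the URL has one, otherwise the scheme's default.
        "port": parts.port or DEFAULT_PORTS.get(scheme),
        # https means the connection is wrapped in SSL/TLS.
        "encrypted": scheme == "https",
    }


if __name__ == "__main__":
    # example.com is a placeholder domain, not the book's server.
    print(describe_url("https://example.com/rstudio"))
```

The hostname is the part you will set up in Chapter 13, and the `https` scheme is the part you will earn in Chapter 14.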

By the end of these chapters, you will have solid mental models for all the basic tasks you or any other IT/Admin will take on in administering a data science workbench or deployment platform.

Labs in this Part

You created a DevOps-friendly data science project in the book’s first part. The labs in this part will focus on putting that project into production.

You’ll start by standing up a server from a cloud provider, configuring your local command line, and connecting to the server via SSH. Once you’ve done that, you’ll learn how to create users on the server and access the server as a particular user.

You’ll be ready to transition into data science work at that point. You’ll add R, Python, RStudio Server, and JupyterHub to your server and get them configured. Additionally, you’ll deploy the Shiny App and API you created in the book’s first part onto the server.

Once the server is ready, you must configure its networking to make it accessible and secure. You’ll learn how to open the proper ports, set up a proxy to access multiple services on the same server, configure DNS records so your server is available at a real URL, and activate SSL/TLS so it can all be done securely.
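If the idea of a proxy serving multiple services from one server feels abstract, the following Python sketch shows the core decision a proxy makes: look at the path of an incoming request and pick the local port to forward it to. This is a toy model under stated assumptions, not the proxy configuration used in the labs. The path prefixes and the `route` function are made up for illustration, and the ports are the usual defaults for RStudio Server (8787) and JupyterHub (8000), which your setup may change.

```python
from typing import Optional

# Hypothetical routing table: URL path prefix -> local port of the service.
# 8787 and 8000 are the usual defaults for RStudio Server and JupyterHub;
# the prefixes are invented for this example.
ROUTES = {
    "/rstudio": 8787,
    "/jupyter": 8000,
}


def route(path: str) -> Optional[int]:
    """Return the local port that should handle a request path, or None."""
    # Check longer prefixes first so the most specific match wins.
    for prefix in sorted(ROUTES, key=len, reverse=True):
        # Match the prefix exactly or as a leading path segment,
        # so "/rstudiox" does not accidentally match "/rstudio".
        if path == prefix or path.startswith(prefix + "/"):
            return ROUTES[prefix]
    return None


if __name__ == "__main__":
    for p in ["/rstudio/", "/jupyter/hub/login", "/unknown"]:
        print(p, "->", route(p))
```

With a proxy in front, only the proxy's own port needs to be open to the outside world, while the individual services listen locally.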

By the time you’ve finished the labs in this part, you’ll be able to use your EC2 instance as a data science workbench and add your penguin mass prediction Shiny App to the Quarto website you created in the book’s first part.

For more details on what you’ll do in each chapter, see Appendix C.

¹ The first car I ever bought was a Honda Civic Hybrid. Great car.