IT/Admin for Data Science
Welcome to the section of the book I wish I didn’t need to write – the section where you’ll learn about the basics of doing IT/Admin tasks yourself.
As a data scientist, you want to share a development environment with other data professionals or publish a data science project to non-technical stakeholders. That sharing requires a centralized server, and someone needs to administer that server.
In my experience, data scientists are at their best when paired with a professional IT/Admin who administers the servers. But that partnership often isn’t achievable.
You might work at a small organization that lacks dedicated IT/Admins. Or maybe you’re a student or hobbyist trying to cheaply DIY an environment. It’s possible you work at a sophisticated organization with professional IT/Admins, but they, unfortunately, lack the time, interest, or expertise necessary to be helpful.
Sometimes, you have to be your own IT/Admin to be able to take your work to production at all. It’s fair to say that many – if not most – data scientists will find themselves responsible for administering the servers where their work runs at some point in their career. And that’s a scary place to be.
Administering a server as a novice is like suddenly stepping into an 18-wheel tractor-trailer when you’ve never driven anything other than a cute Honda Civic.[1] You’re leaping from managing a personal device to wrangling a professional-scale work machine without the training to match.
Even with the many online resources available as support, the number of topics and the depth of each can be completely overwhelming. And being a bad IT/Admin can lead to security vulnerabilities, system instability, and general annoyance.
In this section, you’re going to learn the basics of being your own IT/Admin. You’ll be introduced to the IT/Admin topics that are relevant for a data science environment. By the end, you’ll be comfortable administering a simple data science workbench or server to host a data science project.
If you don’t have to be your own IT/Admin, that’s even better. Reading this section will give you an appreciation for what an IT/Admin does and help you be a better partner to them.
Getting and running a server
Many data science tasks require a server and a variety of supporting tools like networking and storage. These days, the most common way to set up a data science environment is to rent a server from a cloud provider. That’s why Chapter 7 is an introduction to what the cloud is and how you might want to use it for data science purposes.
Unlike your phone or personal computer, you’ll never touch this cloud server you’ve rented. Instead, you’ll administer the server via a virtual interface from your computer. Moreover, servers generally don’t even have the kind of point-and-click interface you’re familiar with from your personal devices.
Instead, you’ll access and manage your server from the text-only command line. That’s why Chapter 8 is all about how to set up the command line on your local machine to make it convenient and ergonomic, and how to connect to your server for administration purposes using a technology called SSH.
Unlike the Apple, Windows, or Android operating systems you have on your personal devices, most servers run the Linux operating system. Chapter 9 will teach you a little about what Linux is and will introduce you to the basics of Linux administration, including how to think about files and users on a multi-tenant server.
But you’re not just interested in running a Linux server. You want to use it to accomplish data science tasks. In particular, you want to use data science tools like R, Python, RStudio, JupyterHub, and more. You’ll need to learn how to install, run, and configure applications on your server. That’s why Chapter 10 is about application administration.
When your phone or computer gets slow or you run out of storage, it’s probably time for a new one. But a server is a working machine that can be scaled up or down to accommodate more people or heavier workloads over time. That means that you may have to manage the server’s resources more actively than your personal devices. That’s why Chapter 11 is all about managing and scaling server resources.
Making it (safely) accessible
Unless you’re doing something very silly, your personal devices aren’t accessible to anyone who isn’t physically touching the device. In contrast, most servers are only useful because they’re addressable on a computer network, perhaps even the open internet.
Making a server accessible to people over the internet makes it useful, but it also introduces risk. Many dastardly plans for your personal devices are thwarted because a villain would have to physically steal them to get access. For a server, allowing digital access means there are many more potential threats looking to steal data or hijack your computational resources for nefarious ends. Therefore, you’ve got to be careful about how you’re providing access to the machine.
Risk aside, there’s a lot of depth to computer networking and just getting it working isn’t trivial. You can probably muddle through by following tutorials on the internet, but that’s a great way to end up with connections that suddenly work and no idea what you did right or how you could break it in the future.
The good news is that it’s not magic. Chapter 12 is all about how computers find each other across a network. Once you understand the basic structure and operations of a computer network, you’ll be able to configure your server’s networking and feel confident that you’ve done it right.
But you’re not done once you’ve configured basic connectivity for your server. You will want to take two more steps to make it safe and easy to access. The first is to host your server at a human-friendly URL, which you’ll learn how to configure in Chapter 13. The second is to add SSL/TLS to your server to secure the traffic going to and from your server. You’ll learn how to do that in Chapter 14.
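If it helps to see where those two steps show up, here is a minimal Python sketch that pulls apart the address someone would type to reach a server. It is only an illustration, not something the labs depend on: the function name `describe_address`, the `DEFAULT_PORTS` table, and both example addresses are made up for this example. It uses only the standard library and never touches the network.

```python
from urllib.parse import urlsplit

# Conventional default ports for plain and TLS-secured web traffic.
DEFAULT_PORTS = {"http": 80, "https": 443}


def describe_address(url: str) -> dict:
    """Split a URL into the parts that matter for reaching a server.

    Hypothetical helper for illustration only.
    """
    parts = urlsplit(url)
    # Use the explicit port if one is given, otherwise the scheme's default.
    port = parts.port or DEFAULT_PORTS.get(parts.scheme)
    return {
        "host": parts.hostname,                 # a name or a raw IP address
        "port": port,                           # which door on the server
        "encrypted": parts.scheme == "https",   # True when SSL/TLS is in use
    }


# Before the extra steps: a raw IP address, an explicit port, no encryption.
# (203.0.113.10 is a documentation-only address; 8787 is just an example port.)
before = describe_address("http://203.0.113.10:8787")

# After the extra steps: a human-friendly name, secured with SSL/TLS.
after = describe_address("https://example.com/penguins")
```

In this sketch, `before` has a numeric host and `encrypted` set to `False`, while `after` has a readable hostname, the default HTTPS port, and `encrypted` set to `True`. That difference is roughly what the two steps above buy you.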
By the end of these chapters, you will have solid mental models for all the basic tasks you or any other IT/Admin are going to take on in administering a data science workbench or hosting platform.
Labs in this Section
In the first section of the book, you created a DevOps-friendly data science project. In this section, the labs will focus on actually putting that project into production.
You’ll start by standing up a server from a cloud provider, configuring your local command line, and connecting to the server via SSH. Once you’ve done that, you’ll learn how to create users on the server and access the server as a particular user.
At that point, you’ll be ready to transition into data science work. You’ll add R, Python, RStudio Server, and JupyterHub to your server and get them configured. Additionally, you’ll deploy the Shiny App and API you created in the book’s first section onto the server.
Once the server itself is ready, you’ll need to configure the server’s networking to make it accessible and secure. You’ll learn how to open the proper ports, set up a proxy to access multiple services on the same server, configure DNS records so your server is available at a real URL, and activate SSL so it can all be done securely.
By the time you’ve finished the labs in this section, you’ll be able to use your EC2 instance as a data science workbench and add your penguin mass prediction Shiny App to the Quarto website you created in the book’s first section.
For more details on what you’ll do in each chapter, see Appendix C.
[1] The first car I ever bought was a Honda Civic Hybrid. Great car.