IT/Admin for Data Science
Data scientists shouldn’t be responsible for server administration. There’s a reason IT Administration is a career all to itself. There’s so much to know about servers, networking, managing users, storage, and more.
You’re probably great at writing R or Python code, cleaning and managing data, or building models. Unless you feel like a career switch, managing servers isn’t in your core skillset.
And being responsible for a server is scary – you can unknowingly make choices that cause security vulnerabilities, system instability, and general annoyance.
That said, there are good reasons you might find yourself administering a server. Maybe you’re a student who wants to host a project for portfolio purposes, the hobbyist who’s hosting a toy project, or the data science leader who has no choice but to host things for themselves because they just can’t get IT/Admin support.
I’ve been in all of these situations. Believe, me I feel you.
The good news is that if you are in any of those roles, you’ve come to the right place.
The first section of this book focused on how to move your data science practices closer to the DevOps ideal by doing Dev better.
This section focuses on how to do Ops. In contrast to the first section, which assumed a pretty high level of understanding and focused on best practices, this section is a lot of introductory IT/Admin knowledge. This section assumes you know nothing about how to administer a server other than the tools introduced in ?sec-2-intro.
Throughout the chapters in this section, you’re going to learn standard patterns for simple administrative tasks for servers and apply those to a data science use-case. By the time you’re done, you’ll be ready to administer a simple server-based environment for doing or deploying data science in the cloud.
So what exactly will you learn about in this section?
We’ll start with an intro to the cloud – what it is, how it works, and how to make use of it for data science purposes. We’ll then spend a chapter on Linux administration. If you’re using a server, it’s going to be a Linux server, and understanding how to administer a Linux server is worth learning. We’ll then spend three chapters on several facets of networking, including a basic intro, how getting and using a domain name works, and how to secure a server with SSL/HTTPS. Lastly, we’ll get into how to choose the size and type of server to use for your data science purposes.
Each chapter in this section is accompanied by a lab. If you follow along with the labs in this section, you’ll get an AWS server configured and ready to run RStudio Server, JupyterHub, and serve model predictions out of a Docker Container. While these aren’t all the things you could want to do on a server, once you’ve learned to do these tasks, many other things will be reasonably similar.
What you’ll learn in this section and in the labs should be sufficient for you to do data science on a server for a small group of data scientists who you trust, assuming you’re not using data that is especially sensitive.
If this isn’t you – if you actually do have IT/Admin support, I’d still recommend skimming this section, especially Chapters Chapter 12, Chapter 13, and Chapter 14. Being able to understand a little bit about networking will make it way easier to communicate with the IT/Admin group at your organization.
And don’t forget, there’s still section 4 – where you’ll learn more about how to communicate with IT/Admin professionals for issues that really should be left to them.
Let’s jump in!