DIY DevOps for Data Science

There are times where you may find yourself administering your own data science environment. In general, this is a situation to be avoided. There’s a reason IT Administration is a career. There’s a lot to know about servers, networking, managing users, storage, and more.

But if you’ve got the need, it is possible to manage your own server.

I’m not going to say it’s easy, but it is straightforward.

In this section, we’re going to explore two main topics – how to set up a data science workbench on a server, and how to host an app of your own.

This section is going to be a hands-on walkthrough of setting up a data science workbench of your own that includes both RStudio Server and JupyterHub. Throughout, we’re going to combine hands-on labs with explanations of why we’re doing what we’re doing so that you can feel confident you’ve got it right – or can adjust if the context you’re working in is somewhat different than the one in the book.

We’re going to start by learning about what the cloud is and the services you can use there. Then we’re going to set up a cloud server on AWS.

Next, we’re going to learn about the command line and SSH. We’re going to use them to access the server we just created and poke around a little bit.

Next, we’re going to learn just a little about Linux system administration including how installing software, managing processes, and managing users works. We’re going to actually install R, Python, RStudio Server, and JupyterHub on the server.

Next, we’re going to learn a little bit about server networking, DNS, and how to keep web traffic secure. We’re going to configure our server with a real URL and an SSL certificate, so you can access your server securely and at a memorable web address.

In the next section, we’re going to stand up a database and connect to it.

In the last chapter, we’re going to talk about all the things we didn’t do. By the time you’re done, this server will be totally usable for basic data science workloads and probably will be fine if your company has a relatively permissive security policy. In the last section we’re going to talk about the things we didn’t do – including using networking to further protect our server, design a more sophisticated data science environment, and integrate with centralized authentication providers.

Other things we could do (TBD):