DevOps Lessons for Data Science

You are a software developer.

Your title is probably data scientist or statistician or data engineer. But if you’re writing R or Python code for production, you’re also a software developer.

And as a software developer, even a reluctant one, you have a lot to learn from DevOps about how to write good software.

DevOps principles aim to build scalability, security, and stability into software from the very beginning. The idea is to avoid building software that works locally, but doesn’t work well in collaboration or production.

You could just take general purpose DevOps principles and apply them to data science. If you talk to a software engineer or IT/Admin who doesn’t know about data science, they’ll probably encourage you to do so.

But the specifics of those principles are squishy. Basically every resource on DevOps lists a different set of core principles and frameworks.[1] And for data scientists, that’s exacerbated by the profusion of data science-adjacent xOps terms like DataOps, MLOps, and more.

Moreover, the type of software you’re creating as a data scientist isn’t general purpose software.

A general purpose software engineer is like an architect – they need to build something to meet a particular need. They get to dream up data structures and data flows from scratch, designing them to work optimally for their systems. That’s a very different kind of job than trying to ferret out and share a needle of signal in a haystack of noise.

As a data professional, your job is to take data generated elsewhere, by a business, social, or natural process, derive some signal from it, and make that signal available to the systems and people that need it.

If the software developer is an architect, you’re an archaeologist. You’re pointed at a spot where there’s some data and told to figure out what – if anything – can be built to deliver value from it. That usually means a much more meandering path to delivering something of value.

But that’s not to say there aren’t best practices. In the chapters in this section, we’ll explore what data science and data scientists can learn from DevOps to make your apps and environments as robust as possible.

Managing environments

One of the core issues DevOps addresses is the dreaded “works on my machine” phenomenon. If you’ve ever collaborated on a data science project, you’ve almost certainly reached a point where something worked on your laptop but not for your colleague, and you just don’t know why.

The code you’re writing relies on the environment in which it runs. While most data scientists have ways to share code, sharing environments isn’t always standard practice, but it should be. We can take lessons from DevOps, where the solution is to create explicit linkages between the code and the environment so you can share both together, which is what Chapter 1 is all about.
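
To make the idea of an explicit linkage between code and environment concrete, here is a minimal sketch in Python. It records the Python version and the installed versions of the packages a project depends on, so that record can be saved alongside the code. The function name `snapshot_environment` and the package list are hypothetical, chosen only for illustration; real projects usually lean on a dedicated environment tool rather than a hand-rolled script like this.

```python
import json
import platform
from importlib import metadata


def snapshot_environment(package_names):
    """Record the Python version and the installed version of each named package.

    `package_names` is a hypothetical list of the packages your project uses.
    Packages that are not installed are recorded as None.
    """
    versions = {}
    for name in package_names:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return {"python": platform.python_version(), "packages": versions}


if __name__ == "__main__":
    # Print the snapshot; in practice you would save it next to your code.
    print(json.dumps(snapshot_environment(["pandas", "scikit-learn"]), indent=2))
```

If a colleague runs the same function and gets a different result, you have a concrete starting point for figuring out why the code behaves differently on their machine.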

App Architecture

Despite the fact that you’re more archaeologist than architect, you do have some space to play architect as you take your work to production. At that point you should know what you’ve unearthed and you’re trying to figure out how to best share it.

Software development best practices and DevOps have a lot to say about app architecture patterns that work well, but they’re not directly portable to data science projects because of differences in the type of software you’re building.

You’re almost certainly writing software that consumes much more data than it produces. That’s in stark contrast to general purpose software, which does the opposite. On net, Microsoft Word, electronic health records, and Twitter all produce much more data than they consume.

And because data science is about using real-world data produced by a process in the outside world, you’re probably stuck with data flows that weren’t designed with your needs in mind, unlike a piece of software that consumes data it produced for itself.

That means that you’re going to spend a lot of time thinking about how to connect to preexisting data sources. Chapter 3 is all about how to securely connect to data sources from your data science projects.
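
As a small taste of what connecting securely can look like, one common pattern is to keep credentials out of your code and read them from environment variables at runtime. The sketch below assumes that pattern; the function name `get_db_credentials` and the variable names `DB_HOST`, `DB_USER`, and `DB_PASSWORD` are hypothetical, and this is not necessarily the approach the later chapter takes.

```python
import os


def get_db_credentials(env=None):
    """Read database credentials from environment variables.

    The variable names below are hypothetical placeholders. Passing `env`
    explicitly makes the function easy to test without touching os.environ.
    """
    if env is None:
        env = os.environ
    required = ("DB_HOST", "DB_USER", "DB_PASSWORD")
    missing = [name for name in required if name not in env]
    if missing:
        # Fail loudly at startup instead of partway through a job.
        raise RuntimeError("Missing environment variables: " + ", ".join(missing))
    return {name.lower(): env[name] for name in required}
```

Because the secret never appears in the source file, you can share the code freely and give each environment its own credentials.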

And Chapter 2 is all about how to take DevOps and Software Engineering best practices and apply them to the layers of your app you can control – the processing and presentation layers.

Monitoring and Logging

It’s bad to find out from someone else that your app was down or that your model was producing bad results. DevOps practices aim to make what’s happening inside the system visible during and after the code runs. Chapter 4 addresses how to build monitoring and logging into your data science projects.
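
For a rough sense of what making a system visible means in practice, here is a minimal sketch using Python’s standard `logging` module. The logger name `my_model`, the `predict` function, and its placeholder arithmetic are all hypothetical stand-ins for a real model; the point is only that inputs, outputs, and failures each leave a record.

```python
import logging

logger = logging.getLogger("my_model")  # hypothetical logger name


def predict(value):
    """Stand-in for a real model that logs each request, result, and failure."""
    logger.info("prediction requested: value=%s", value)
    if value <= 0:
        logger.error("invalid input: value=%s", value)
        raise ValueError("value must be positive")
    prediction = 2.0 * value  # placeholder arithmetic, not a fitted model
    logger.info("prediction returned: %s", prediction)
    return prediction


if __name__ == "__main__":
    # Send log records to the console with a timestamp and severity level.
    logging.basicConfig(
        level=logging.INFO,
        format="%(asctime)s %(levelname)s %(name)s %(message)s",
    )
    predict(21.0)
```

With records like these, you can see what the code was asked to do and where it failed without waiting for a user to report it.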

Deployments

When you go to deploy your code, you want to make sure it goes right the first time. Doing so requires that you think ahead about how those deployments are going to work. Chapter 5 gets into how to design a deployment and promotion system that is robust.

Docker for Data Science

Docker is an increasingly popular tool in software development and data science that makes it easy to capture and share the environment around code. Docker by itself doesn’t solve these problems, but it’s common enough in data science work that Chapter 6 gives a basic introduction to what Docker is and how to use it.

Labs in this section

Each chapter in this section has a lab so you can get hands-on experience implementing DevOps best practices in your data science projects.

In the labs, you’ll stand up a website to explore the Palmer Penguins dataset, especially the relationship between penguin bill length and mass. Your website will include pages on exploratory data analysis and model building. This website will automatically build and deploy based on changes in a git repo.

You’ll also create a Shiny app that visualizes model predictions and an API that hosts the model and provides real-time predictions to the app. Additionally, you’ll get to practice putting that API inside a Docker container to see how using Docker can make your life easier when moving code around.

For more details on exactly what you’ll do in each chapter, see Appendix C.


  1. If you enjoy this introduction, I strongly recommend The Phoenix Project by Gene Kim, Kevin Behr, and George Spafford. It’s a novel about implementing DevOps principles. A good friend described it as “a trashy romance novel about DevOps.” It’s a very fun read.