DevOps Lessons for Data Science

You are a software developer.

Your title is probably data scientist or statistician or data engineer. But if you’re writing R or Python code for production, you’re also a software developer.

And as a software developer, even a reticent one, DevOps has much to teach about how to write good software.

DevOps principles aim to create software that builds security, stability, and scalability into the software from the very beginning. The idea is to avoid building software that works locally, but doesn’t work well in collaboration or production.

You could take general-purpose DevOps principles and apply them to data science. If you talk to a software engineer or IT/Admin who doesn’t know about data science, they’ll probably encourage you to do just that.

But the specifics of those principles are squishy. Every DevOps resource lists a different set of core principles and frameworks.1 And for data scientists, that’s exacerbated by the profusion of data science adjacent xOps terms like DataOps, MLOps, and more.

Moreover, you’re writing code for data science, not general-purpose software engineering.

A general-purpose software engineer designs software to fill a particular need. They get to dream up data structures and data flows from scratch, figuring out how to move data to fill the holes in the existing system.

Think of some examples like Microsoft Word, electronic health records, and Instagram. Each of these systems is a producer and consumer of its own data. That means the engineers who created them got to design the data flow from beginning to end.

That’s a very different job than the data scientist’s task of ferreting out a needle of signal in a haystack of noise. You don’t get to design your data flows. Instead, you take data generated elsewhere, by a business, social, or natural process and try to make information signal available to the systems and people that need it.

If the software developer is like an architect, the data scientist is an archaeologist. You’re pointed at some data and asked to derive value from it without even knowing if that’s possible. Delivering value as a data scientist is predictably preceded by dead-ends, backtracking, and experimentation in a way a general-purpose software engineer doesn’t experience.

But there are best practices you can follow to make it easier to deliver value once you’ve discovered something interesting. In the chapters in this section, we’ll explore what data science and data scientists can learn from DevOps to make your apps and environments as robust as possible.

Managing environments

One of the core issues DevOps addresses is the dreaded “works on my machine” phenomenon. If you’ve ever collaborated on a data science project, you’ve almost certainly reached a point where something worked on your laptop but not for your colleague, and you don’t know why.

The code you’re writing relies on the environment in which it runs. While most data scientists have ways to share code, sharing environments isn’t always standard practice, but it should be. We can take lessons from DevOps, where the solution is to create explicit linkages between the code and the environment so you can share both, which is what Chapter 1 is all about.

App Architecture

Even though you’re more archaeologist than architect, you have some space to play architect as you take your work to production. At that point, you should know what you’ve unearthed, and you’re trying to figure out how to share it.

As the creator of data science software that consumes much more data than it creates, you’ll spend a lot of time thinking about connecting to preexisting data sources. Chapter 3 is about securely connecting to data sources from your data science projects.

And Chapter 2 is all about how to take DevOps and Software Engineering best practices and apply them to the layers of your app you can control – the processing and presentation layers.

Monitoring and Logging

It’s bad to discover that your app was down or your model was producing bad results from someone else. DevOps practices aim to help you detect issues and do forensic analysis after the fact by making the system visible during and after the code runs. Chapter 4 addresses building monitoring and logging into your data science projects.

Deployments

One core process of releasing data science projects is moving them from the workbench to the deployment platform. Making that process smooth requires thinking ahead about how those deployments will work. Chapter 5 investigates how to design a robust deployment and promotion system.

Docker for Data Science

Docker is an increasingly popular tool in the software development and data science world that allows for the easy capture and sharing of the environment around code. Docker is increasingly popular in data science contexts, so Chapter 6 is a basic introduction to what Docker is and how to use it.

Labs in this section

Each chapter in this section has a lab so you can get hands-on experience implementing DevOps best practices in your data science projects.

You’ll create a website in the labs to explore the Palmer Penguins dataset, especially the relationship between penguin bill length and mass. Your website will include pages on exploratory data analysis and model building. This website will automatically build and deploy based on changes in a git repo.

You’ll also create a Shiny app that visualizes model predictions and an API that hosts the model and provides real-time predictions to the app. Additionally, you’ll get to practice putting that API inside a Docker Container to see how using Docker can make your life easier when moving code around.

For more details on precisely what you’ll do in each chapter, see Appendix C.


  1. If you enjoy this introduction, I strongly recommend The Phoenix Project by Gene Kim, Kevin Behr, and George Spafford. It’s a novel about implementing DevOps principles. A good friend described it as, “a trashy romance novel about DevOps”. It’s a very fun read.↩︎