DevOps Lessons for Data Science

DevOps is a set of processes, tools, and cultural norms designed to simplify, speed up, and lower the risk of putting software into production.

The term DevOps is a portmanteau reflecting the two halves of software delivery it is meant to bring closer together – development and operations. DevOps is a somewhat slippery concept as it’s not a specific dogma or set of tools. Instead, it’s the application of principles and norms – combined with tooling – to the particular situation you face in delivering software.

And for you, the particular situation you face is delivering data science assets. Delivering data science fundamentally is a form of software development. Whether you consciously acknowledge it or not, delivering a data science asset is the same as delivering software in many important ways.

The Problems DevOps Solves

DevOps started as an offshoot of the Agile software movement. In particular, Agile’s focus on quick iteration via frequent delivery of small chunks and immediate feedback proved completely incompatible with a pattern where developers completed software and hurled it over an organizational wall to somehow be put into production by an IT/Admin team.

There are a few particular problems that DevOps attempted to solve – problems that will probably feel familiar if you’ve ever tried to put a data science asset into production.

The first issue DevOps addresses is the “works on my machine” phenomenon. If you’ve ever collaborated on a piece of data science code, you’ve almost certainly gotten an email, instant message, or quick shout that some code that was working great for you is failing now that your colleague is trying to run it.

The processes and tooling of DevOps are designed to link an application much more closely to its environment in order to prevent the “works on my machine” phenomenon from rearing its head.

The second problem DevOps addresses is the “breaks on deployment” issue. Perhaps you wrote some code and tested it lovingly on your machine, but didn’t have the chance to test it against a production configuration. Or perhaps you don’t really have patterns around testing code in your organization. Even if you tested thoroughly, you might not know if something breaks when it’s deployed. DevOps is designed to reduce the risk of deploying code that won’t function as intended the first time it’s deployed.

DevOps is designed to incorporate ideas about scaling into the genesis of software, helping avoid software that works fine locally, but can’t be deployed for real.

Core principles and best practices of DevOps

As I’ve mentioned, the term DevOps is squishy. So squishy that there isn’t even agreement on which basic tenets of DevOps help solve the problems it’s attempting to solve. Basically every resource on DevOps lists a different set of core principles and frameworks.

And the profusion of xOps like DataOps, MLOps, and more just adds confusion about what DevOps itself actually is.

I’m going to name five core tenets of DevOps. Some lists of DevOps principles have more components, and some fewer, but this is a good-faith attempt to summarize what I believe the core components are.

  1. Code should be well-tested and tests should be automated.
  2. Updates should be frequent and low-risk.
  3. Security concerns should be considered up front as part of architecture.
  4. Production systems should have monitoring and logging.
  5. Frequent opportunities for review, change, and updating should be built into the system – both culturally and technically.

These five tenets are a great philosophical stance: they’re about things that should happen, and they seem pretty inarguably good.

Applying DevOps to data science

Hopefully, you’re convinced that the principles of DevOps are relevant to you as a data scientist and you’re excited to learn more!

However, it would be inappropriate to just take the DevOps principles and practices and apply them to data science.

As a data scientist, the huge majority of what you’re doing is taking data generated by a business process, deriving some sort of signal from that data flow, and making it available to other people or other software. Fundamentally, data science apps are consumers of data, almost by definition.

In contrast, most traditional software either doesn’t involve meaningful data flows or is a producer of business data. An online store, software for managing inventory, an electronic health record – these tools all produce data.

There’s a major architectural and process implication from this difference – how much freedom you have. Software engineers get to dream up data structures and data flows from scratch, designing them to work optimally for their systems. In contrast, you are stuck with the way the data flows into your system – most likely designed by someone who wasn’t thinking about the needs of data science at all.

Language-specific tooling

There’s one other important difference between data science and general-purpose software development. As of the writing of this book, a huge majority of data science work is done in just two programming languages, R and Python (plus SQL). For that reason, this book on DevOps for Data Science can get much deeper into the particularities of applying DevOps principles to those specific languages than a general-purpose book on DevOps ever would.

So while the problems DevOps attempts to solve will probably resonate with most data scientists, and the core principles seem equally applicable, the technical best practices need some translation.

So here are four technical best practices from DevOps and their equivalents in the data science world.[1]

Use CI/CD

Continuous Integration/Continuous Delivery/Continuous Deployment (CI/CD) is the notion that there should be a central repository of code where changes are merged. Once these changes are merged, the code should be tested, built, and delivered/deployed in an automated way.

The data science analog of CI/CD is code promotion and integration processes. The chapter on this topic will help you think about how to structure your app or report so that you can feel secure moving an app into production and updating it later. It will also include an introduction to real CI/CD tools, so that you can get started using them in your own work.
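To make the "tested in an automated way" part concrete, here is a minimal sketch of the kind of check a CI/CD system could run on every merge. The function `clean_ages`, its tests, and the `run_checks` helper are hypothetical names invented for illustration; only Python's built-in `unittest` module is assumed.

```python
import unittest


def clean_ages(ages):
    """Hypothetical cleaning step: drop missing or implausible ages."""
    return [a for a in ages if a is not None and 0 <= a <= 120]


class TestCleanAges(unittest.TestCase):
    def test_drops_missing_values(self):
        self.assertEqual(clean_ages([34, None, 51]), [34, 51])

    def test_drops_implausible_values(self):
        self.assertEqual(clean_ages([-1, 20, 400]), [20])


def run_checks():
    """Run the tests and report whether the code looks safe to promote."""
    suite = unittest.defaultTestLoader.loadTestsFromTestCase(TestCleanAges)
    result = unittest.TextTestRunner(verbosity=0).run(suite)
    return result.wasSuccessful()
```

In a real setup, the CI/CD tool rather than you would call something like `run_checks()` and refuse to promote the code when it returns `False`.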

Infrastructure as Code

The underlying infrastructure for development and deployment should be reproducible using code so it can be updated and replaced with minimal fuss or disruption.

The data science analog is thinking about managing environments as code. The chapter on this topic will help you think about how to create a reproducible and secure project-level data science environment so you can be confident it can be used, secured, and resurrected later (or somewhere else) as need be.
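As a rough sketch of what "environment as code" can mean at its simplest, the snippet below records the exact package versions installed in the current Python environment, similar in spirit to what `pip freeze` produces. The function names `snapshot_environment` and `write_snapshot` are my own, chosen for illustration; the only assumption is the standard library's `importlib.metadata` module.

```python
from importlib import metadata


def snapshot_environment():
    """Return sorted 'name==version' pins for every installed package."""
    pins = set()
    for dist in metadata.distributions():
        name = dist.metadata.get("Name")
        if name:  # skip distributions whose metadata has no name
            pins.add(f"{name}=={dist.version}")
    return sorted(pins, key=str.lower)


def write_snapshot(path):
    """Write the pins to a requirements-style text file; return the count."""
    pins = snapshot_environment()
    with open(path, "w", encoding="utf-8") as f:
        f.write("\n".join(pins) + "\n")
    return len(pins)
```

A file like this, checked in next to your project code, is one way to give a colleague (or a server) a written record of the environment your code was working in.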

Microservices

Any large application should be decomposed into smaller services that are as atomic and lightweight as possible. This makes large projects easier to reason about and makes interfaces between components clearer so changes and updates are safer.

I believe this technical best practice requires the most translation to apply to data science, so the chapter on this topic is about how to think of your data science project in terms of its components. It will help you think about what the various components of your projects are and how to split them up for painless and simple updating and atom-izing.
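One way to picture this, as a toy sketch rather than a recommended architecture, is a small report split into three pieces that only talk to each other through plain data. The function names and the sales figures below are made up for illustration.

```python
from statistics import mean


def load_data():
    """Stand-in for reading from a real source; the values are made up."""
    return [
        {"region": "north", "sales": 10.0},
        {"region": "north", "sales": 14.0},
        {"region": "south", "sales": 7.0},
    ]


def summarize(rows):
    """Pure transformation: average sales per region."""
    by_region = {}
    for row in rows:
        by_region.setdefault(row["region"], []).append(row["sales"])
    return {region: mean(values) for region, values in by_region.items()}


def render_report(summary):
    """Presentation only: turn the summary into lines of text."""
    return [f"{region}: {avg:.1f}" for region, avg in sorted(summary.items())]
```

Because each piece has a narrow interface, you could swap the data source or change the report layout without touching the summary logic in the middle.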

Monitoring and Logging

Application metrics and logs are essential for understanding the usage and performance of production services, and should be leveraged as much as possible to have a holistic picture at all times.

The fourth chapter in this section is on monitoring and logging, which is – honestly – in its infancy in the data science world, but deserves more love and attention.
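As a small illustration of what this can look like in practice, here is a sketch using Python's built-in `logging` module to record how many rows a job handled, how long it took, and whether it failed. The `score` function, the logger name `scoring_job`, and the doubling "model" are placeholders I made up for the example.

```python
import logging
import time

logger = logging.getLogger("scoring_job")


def score(values):
    """Hypothetical scoring step that logs its size, duration, and failures."""
    start = time.perf_counter()
    logger.info("scoring started: n_rows=%d", len(values))
    try:
        result = [v * 2 for v in values]  # placeholder for a real model
    except Exception:
        # Record the full traceback, then let the error propagate.
        logger.exception("scoring failed")
        raise
    elapsed = time.perf_counter() - start
    logger.info("scoring finished: n_rows=%d seconds=%.4f", len(result), elapsed)
    return result
```

Calling `logging.basicConfig(level=logging.INFO)` once at startup is enough to see these messages in the console; where the logs ultimately go in production is a separate decision.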

Other Things

Most DevOps frameworks also include communication, collaboration, and review practices, as the technical best practices of DevOps exist to support the work of the people who use them. This is obviously equally important in the data science world – it’s what the entire second section is about.

And in the fifth chapter, we’ll learn about Docker – a tool that has become so common in DevOps practices that it deserves some discussion all on its own. In that chapter, you’ll get a general intro to what Docker is and how it works – as well as a hands-on intro to using Docker yourself.


  1. Resources you see online have anywhere between three and seven of these. I think these four cover the core of almost any list you’ll see.