Introduction

Data science itself is pretty useless.

It’s likely you became a data scientist because you love creating beautiful charts, minimizing model prediction error, or writing elegant code in R or Python.

Ultimately – and perhaps frustratingly – these things don’t matter. What matters is whether the output of your work is useful in affecting decisions at your organization or in the broader world.

That means you’re going to have to share your work by putting it in production. Many data scientists think of "in production" as some mythical state: supercomputers running ultra-complex machine learning models over dozens of shards of data, terabytes each.

It definitely occurs on a misty mountaintop, and does not involve the Google Sheets, CSV files, or half-baked database queries you probably wrangle every day.

But that’s wrong. If you’re a data scientist and you’re trying to put your work in front of someone else’s eyes, you’re in production. And if you’re in production, this book is for you.

You may sensibly be asking who I am to make such a proclamation.

As of this writing, I’ve been on the Solutions Engineering team at RStudio (soon to be Posit) for nearly four years. Solutions Engineering helps users of our open source and professional tools understand how to deploy, install, configure, and use RStudio’s Professional Products.

As such, I’ve spoken with hundreds of organizations managing data science in production about what being in production means for them, and how to make their production systems for developing and sharing data science products more robust – both with RStudio’s Professional Products and using purely open source tooling.

For some organizations, in production means a report that gets emailed around at the end of each week. For others, it means hosting a live app or dashboard that people visit. For still others, it means serving live predictions to another service from a machine learning model.
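
To make that last flavor concrete, here is a minimal sketch of a prediction service. I’m assuming Python with FastAPI, a hypothetical model.pkl file, and made-up feature names; treat it as an illustration of the pattern, not a recipe from later chapters.

```python
# A minimal sketch of "in production" as a live prediction service.
# FastAPI, the model file name, and the feature names are illustrative
# assumptions, not requirements.
import pickle

from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()

# Load a previously trained model from disk (hypothetical file produced elsewhere).
with open("model.pkl", "rb") as f:
    model = pickle.load(f)


class Features(BaseModel):
    bill_length_mm: float
    flipper_length_mm: float


@app.post("/predict")
def predict(features: Features):
    # Hand the incoming features to the model and return its (assumed numeric) prediction.
    prediction = model.predict([[features.bill_length_mm, features.flipper_length_mm]])
    return {"prediction": float(prediction[0])}
```

Another service can then POST feature values to /predict and get predictions back. Keeping that endpoint reliable (where it runs, how it restarts, who can reach it) is exactly the not-data-science work this book is about.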

Regardless of the actual character of the data science products, organizations are universally concerned with making these assets reliable, reproducible, and performant (enough).

At RStudio, the Solutions Engineering team is most directly responsible for engaging with our customers’ IT/Admin organizations. So that’s what this book is about – all the stuff that isn’t data science, but that it takes to deploy a data science asset into production.

A short history of DevOps

Here’s the one sentence definition: DevOps is a set of cultural norms, practices, and supporting tooling to help make the process of developing and deploying software smoother and lower risk.

If you feel like that definition is pretty vague and unhelpful, you’re right. Like Agile software development, to which it is closely related, DevOps is a squishy concept. That’s partially because DevOps isn’t just one thing – it’s the application of some principles and process ideas to whatever context you’re actually working in. That malleability is one of the great strengths of DevOps, but also makes the concept quite squishy.

This squishiness is furthered by the ecosystem of companies enabling DevOps. There are dozens and dozens of companies proselytizing their own particular flavor of DevOps – one that (curiously) reflects the capabilities of whatever product they’re selling.

But underneath the industry hype and the marketing jargon, there are some extremely valuable lessons to take from the field.

To understand better, let’s go back to the birth of the field.

The Manifesto for Agile Software Development was originally published in 2001. Throughout the 1990s, software developers had begun observing that delivering software in small units, quickly collecting feedback, and iterating was an effective model. In the years that followed, many different frameworks of concrete working patterns were developed and popularized.

However, many of these frameworks were really focused on software development. What happened once the software was written?

Historically, IT Administrators managed the servers, networking, and workstations needed to deploy, release, and operate that software. So, when an application was complete (or perceived as such), it was hurled over the wall from Development to Operations. They’d figure out the hardware and networking requirements, check that it was performant enough, and get it going in the real world.

Needless to say, this pattern was fragile and error-prone. It quickly became apparent that the Agile process of creating and getting feedback on small, iterative changes to working software needed a complementary process for getting that software deployed into production.

DevOps arose as this discipline – a way for software developers and the administrators of operational software to better collaborate on making sure the software being written was making it reliably and quickly into production. It took a little while for the field to be formalized, and the term DevOps came into common usage around 2010.

Those who do DevOps

Throughout this book, I’ll use two different terms – and though they may sound similar, I mean very different things by them.

DevOps refers to the knowledge, practices, and tools that make it easier, safer, and faster to put work into production. So if you’re a software developer (and as a data scientist, you are) you need to be thinking about DevOps.

Most organizations also have a set of people and roles who have the permission and responsibility for managing the servers and computers at your organization. I’m going to refer to this group as IT/Admins. Their names vary widely by organization – they might be named Information Technology (IT), SysAdmin, Site Reliability Engineering (SRE), or DevOps.1

Depending on what you’re trying to accomplish, the relevant IT/Admins may change. For example, if you’re trying to get access to a particular database, the relevant IT/Admins may be a completely different group of people than if you’re trying to procure a new server.

Fundamentally, DevOps is about creating good patterns for people to collaborate on developing and deploying software. As a data scientist, you’re on the Dev side of the house, and so a huge part of making DevOps work at your organization is about finding some Ops counterparts with whom you can develop a successful collaboration. There are many different organizational structures that support collaboration between data scientists and IT/Admins.

However, I will point out three patterns that are almost always red flags – mostly because they make it hard to develop the relationships that can sustain the kind of collaboration DevOps necessitates. If you find yourself in one of these situations, you’re not doomed – you can still get things done. But progress is likely to be slow.

  1. At some very large organizations, IT/Admin functions are split into small atomic units like security, databases, networking, storage, procurement, cloud, and more. This is useful for keeping the scope of work manageable for the people in each group (and often results in super deep expertise within the group), but it also means you’ll need to bring people together from disparate teams to actually get anything done. And even when you find the person who can help you with one task, they’re probably not the right person to help you with anything else – and they may not even know who is.

  2. Some organizations have chosen to outsource their IT/Admin functions. This isn’t a problem per se – the people who work for outsourced IT/Admin companies are often very competent – but it does indicate a lack of internal commitment to the Ops half of DevOps. The main issues in this case tend to be logistical. Many outsourced IT/Admin teams are based in India, so it can be hard to find meeting times that work for American and European teams. Additionally – and I’m not quite sure why – turnover on projects and systems tends to be very high among outsourced IT/Admin organizations. That means it can be really hard to find anyone who’s an expert on a particular system, or to go back to them once you have.

  3. At some very small organizations, there isn’t yet an IT/Admin function. And at others, the IT/Admins are preoccupied with other tasks and don’t have the capacity to help the data science team. This isn’t a tragedy, but it probably means you’re about to become your own IT/Admin. Luckily, you’ve picked up this book, so you’re in the right place.

Whether or not your organization has an IT/Admin setup that facilitates DevOps best practices, hopefully this book can help you take the first steps toward making your path to production smoother and simpler.

What’s in this book?

My hope for this book is twofold.

First, I’d like to share some patterns.

DevOps is a well-developed field in its own right. However, a simple one-to-one transposition of DevOps practices onto data science would be a mistake. Over the course of engaging with so many organizations at RStudio, I’ve observed some particular patterns, borrowed from DevOps, that work especially well to grease the path to production for data scientists.

Hopefully, by the time you’re done with this book, you’ll have a pretty good mental model of some patterns and principles you can apply in your own work to make deployment more reliable. That’s what’s in the first section of this book.

Second, I want to equip you with some technical knowledge.

IT administration is an older field than DevOps or data science, full of arcane language and technologies. My hope in this book is to equip you with the vocabulary to talk to the IT/Admins at your organization and the beginnings of the skills you’ll need if it turns out you have to DIY a lot of what you’re doing.

The second section is designed for data scientists who have to interact with IT/Admins. We’ll get into things that you probably shouldn’t manage yourself, but where it’s helpful to have a working knowledge of what particular technologies are and how they work. My hope is that reading this section will help you plan ahead for how you want to work with the IT/Admins at your organization, including the questions you’ll need to ask along the way.

The final section is the hands-on one, for anyone who’s administering data science for themselves. It’s designed to equip you with some language and tools to get started administering data science in production.


Chapter List:

Section 1: DevOps for DS

  1. Code Promotion
  2. Environments as Code
  3. Project Components
  4. Logging and Monitoring
  5. Docker for DS

Section 2: DIY DS Workbench

  1. Cloud (Lab: AWS Console + get an instance)
  2. Command Line + SSH (Lab: SSH into the server)
  3. Linux SysAdmin (Lab: install R, Py, RS, JH)
  4. Networking, DNS, SSL (Lab: URL, SSL)
  5. How servers work + choosing the right one (Lab: take down the instance + attach to a bigger one)

TODO: Do I need a chapter on standing up + connecting to a database?

Section 3: Steps not Taken

  1. Code Promotion for DevOps, Dev/Test/Prod, Docker
  2. Better Networking (Proxies, Bastion)/Offline
  3. Auth Integrations
  4. Scaling your servers

  1. I think a lot of DevOps experts would argue that you’re doing DevOps wrong if you have a standalone DevOps team, but some companies have them.