Introduction

Data science itself is pretty useless.

It’s likely you became a data scientist because you love creating beautiful charts, minimizing model prediction error, or writing elegant code in R or Python.

Ultimately – and perhaps frustratingly – these things don’t matter. What matters is whether the output of your work is useful in affecting decisions at your organization or in the broader world.

That means you’re going to have to share your work by putting it in production.

Many data scientists think of being in production as some mythical state: supercomputers running ultra-complex machine learning models over dozens of shards of data, terabytes each. It definitely occurs on a misty mountaintop, and does not involve the Google Sheets, CSV files, or half-baked database queries you probably wrangle every day.

But that’s wrong. If you’re a data scientist and you’re trying to put your work in front of someone else’s eyes, you’re in production. And if you’re in production, this book is for you.

You may sensibly be asking who I am to make such a proclamation.

As of this writing, I’ve been on the Solutions Engineering team at RStudio (soon to be Posit) for nearly four years. The team helps users of our open source and professional tools understand how to deploy, install, configure, and use RStudio’s Professional Products.

As such, I’ve spoken with hundreds of organizations managing data science in production about what being in production means for them, and how to make their production systems for developing and sharing data science products more robust – both with RStudio’s Professional Products and using purely open source tooling.

For some organizations, in production means a report that gets emailed around at the end of each week. For others, it means hosting a live app or dashboard that people visit. For still others, it means serving live predictions from a machine learning model to another service.
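To make that last pattern concrete, here’s a minimal sketch of a prediction API using R’s plumber package. The model file (model.rds) and the hp predictor are hypothetical stand-ins for whatever your own model expects, not something this book prescribes:

```r
# plumber.R -- a minimal prediction API sketch.
# "model.rds" and the "hp" predictor are hypothetical stand-ins.
library(plumber)

# Load a previously trained model from disk when the API starts.
model <- readRDS("model.rds")

#* Return a live prediction from the model
#* @param hp:numeric A numeric predictor the model was trained on
#* @post /predict
function(hp) {
  predict(model, newdata = data.frame(hp = as.numeric(hp)))
}
```

You’d serve this with something like `plumber::pr("plumber.R") |> plumber::pr_run(port = 8000)`, and another service could then POST to /predict and get predictions back as JSON.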

Regardless of the actual character of the data science products, organizations are universally concerned with making these assets reliable, reproducible, and performant (enough).

For the purpose of this book, I’m going to use the terms analytics, data science, and machine learning interchangeably. The common thread is people deriving insights or predictions from data using R or Python.

At RStudio, the Solutions Engineering team is most directly responsible for engaging with the IT/Admin organizations at our customers. So that’s what this book is about – all of the stuff that is not data science that it takes to deploy a data science asset into production.

A short history of DevOps

Here’s the one sentence definition: DevOps is a set of cultural norms, practices, and supporting tooling to help make the process of developing and deploying software smoother and lower risk.

If you feel like that definition is pretty vague and unhelpful, you’re right. Like Agile software development, to which it is closely related, DevOps is a squishy concept. That’s partially because DevOps isn’t just one thing – it’s the application of some principles and process ideas to whatever context you’re actually working in. That malleability is one of the great strengths of DevOps, but it also makes the concept quite squishy.

This squishiness is furthered by the ecosystem of companies enabling DevOps. There are dozens and dozens of companies proselytizing their own particular flavor of DevOps – one that (curiously) reflects the capabilities of whatever product they’re selling.

But underneath the industry hype and the marketing jargon, there are some extremely valuable lessons to take from the field.

To understand DevOps better, let’s go back to the birth of the field.

The Manifesto for Agile Software Development, published in 2001, codified what software developers had observed throughout the 1990s: delivering software in small units, quickly collecting feedback, and iterating is an effective model. In the years that followed, many different frameworks of actual working patterns were developed and popularized.

However, most of these frameworks focused on writing software. What happened once the software was written?

Historically, IT Administrators managed the servers, networking, and workstations needed to deploy, release, and operate that software. So, when an application was complete (or perceived as such), it was hurled over the wall from Development to Operations. Operations would figure out the hardware and networking requirements, check that the application was performant enough, and get it running in the real world.

This pattern is fragile and error-prone, and it quickly became apparent that the Agile process – creating and getting feedback on small, iterative changes to working software – needed a complementary process to get that software deployed and into production.

DevOps arose as this discipline – a way for software developers and the administrators of operational software to better collaborate on making sure the software being written was making it reliably and quickly into production. It took a little while for the field to be formalized, and the term DevOps came into common usage around 2010.

Those who do DevOps

Throughout this book, I’ll use two different terms – DevOps and IT/Admin – and though they may sound similar, I mean very different things by them.

DevOps refers to the knowledge, practices, and tools that make it easier, safer, and faster to put work into production. So, if you’re a software developer (and as a data scientist, you are), you need to be thinking about DevOps.

Most organizations also have a set of people and roles who have the permission and responsibility for managing the servers and computers at your organization. I’m going to refer to this group as IT/Admins. Their names vary widely by organization – they might be named Information Technology (IT), SysAdmin, Site Reliability Engineering (SRE), or DevOps.1

Depending on what you’re trying to accomplish, the relevant IT/Admins may change. For example, if you’re trying to get access to a particular database, the relevant IT/Admins may be a completely different group of people than if you’re trying to procure a new server.

Fundamentally, DevOps is about creating good patterns for people to collaborate on developing and deploying software. As a data scientist, you’re on the Dev side of the house, and so a huge part of making DevOps work at your organization is about finding some Ops counterparts with whom you can develop a successful collaboration. There are many different organizational structures that support collaboration between data scientists and IT/Admins.

However, I will point out three patterns that are almost always red flags – mostly because they make it hard to develop relationships that can sustain the kind of collaboration DevOps necessitates. If you find yourself in these situations, you’re not doomed – you can still get things done. But progress is likely to be slow.

  1. At some very large organizations, IT/Admin functions are split into small atomic units like security, databases, networking, storage, procurement, cloud, and more. This is useful for keeping the scope of work manageable for the people in each group – and often results in super deep expertise within the group – but it also means you’ll need to bring people together from disparate teams to actually get anything done. And even when you find the person who can help you with one task, they’re probably not the right person to help you with anything else. They may not even know who is.

  2. Some organizations have chosen to outsource their IT/Admin functions. This isn’t a problem per se – the people who work for outsourced IT/Admin companies are often very competent – but it does suggest a lack of investment in IT/Admin at your organization. The main issues in this case tend to be logistical. Outsourced IT/Admin teams are often based in India, so it can be hard to find meeting times with teams in the Americas or Europe. Additionally, turnover on projects and systems tends to be very high at outsourced IT/Admin organizations, so it can be really hard to find anyone who’s an expert on a particular system – or to get back to them once you have.

  3. At some very small organizations, there isn’t yet an IT/Admin function. And at others, the IT/Admins are preoccupied with other tasks and don’t have the capacity to help the data science team. This isn’t a tragedy, but it probably means you’re about to become your own IT/Admin. If your organization is happy to let you do what you want, great – you’ve picked up this book, so you’re in the right place. If you’re going to have to fight for money to get servers, expect an uphill battle.

Whether your organization has an IT/Admin setup that facilitates DevOps best practices or not, hopefully this book can help you take the first steps towards making your path to production smoother and simpler.

TODO: Ways to ameliorate red flags

What’s in this book?

My hope for this book is twofold.

First, I’d like to share some patterns.

DevOps is a well-developed field in its own right. However, a simple one-to-one transposition of DevOps practices from traditional software to data science would be a mistake. Over the course of engaging with so many organizations at RStudio, I’ve observed some patterns, borrowed from traditional DevOps, that work particularly well to grease the path to production for data scientists.

Hopefully, by the time you’re done with this book, you’ll have a pretty good mental model of some patterns and principles you can apply in your own work to make deployment more reliable. That’s what’s in the first section of this book.

Second, I want to equip you with some technical knowledge.

IT administration is an older field than DevOps or data science, full of arcane language and technologies. My hope in this book is to equip you with the vocabulary to talk to the IT/Admins at your organization and the beginnings of the skills you’ll need if it turns out you have to DIY a lot of what you’re doing.

The second section is the hands-on section for anyone who’s administering a data science server for themselves. In this section, we’re going to DIY a data science environment. If you need it to be, this could be an environment you actually work in. If not, this section will help you learn the language and tools to better collaborate with the IT/Admins at your organization.

The final section is about taking the DIY environment we’ll set up in Section 2 and making it enterprise-grade. Hopefully, if you have enterprise requirements, you also have enterprise IT support, so this section will be more focused on the conceptual knowledge and terminology you’ll need to productively interact with the IT/Admins who are (hopefully) responsible for helping you. And if you don’t have IT/Admins to help you, it will at least give you some terms to google.

Questions, Portfolio Exercises, and Labs

For each chapter, I’ll provide a chance for you to check whether you’ve understood the content. Sections 1 and 3 are mostly informational, so there I’ll provide comprehension questions at the end of each chapter.

Mental Models + Mental Maps

Throughout the book, I’ll talk a lot about building a mental model. The idea is reasonably intuitive, but just to be clear: a mental model is a model in your head of various entities and how they relate. Many of the chapters in this book will be about helping you build mental models of various DevOps concepts and how they relate to data science.

A mental map is a particularly helpful way to represent mental models. Because I believe generating mental maps is a great test of mental models, I’ll suggest them as exercises in a number of places.

In a mental map, you’ll draw out each of the entities I suggest and connect them with arrows. Each node will be a noun, and each arrow will be labeled with the relationship between the entities it connects.

Here’s an example about this book:

```mermaid
graph LR
    A[I] -->|Wrote| B[DO4DS]
    C[You] -->|Are Reading| B
    B -->|Includes| D[Exercises]
    D -->|Some Are| E[Mind Maps]
```

Note how every node is a noun and every edge label (the text on the arrows) is a verb. It’s pretty simple! But writing down the relationships between entities like this is a great check on understanding.

In many of the chapters, I’ll include a Portfolio Exercise. These are exercises you can complete to demonstrate that you’ve really understood the content of the chapter. If you’re looking for a job, completing these portfolio exercises will give you a great piece of work that shows you know how to get data science into production.

Section 2 is all about creating your own data science workbench, so each of its chapters will include a lab – a walkthrough of one step in setting up that workbench. By the end of the section, you’ll have a place where you can do actual data science work in AWS.


Chapter List (FOR DEV PURPOSES ONLY, WILL BE REMOVED):

Section 1: DevOps for DS

  1. Code Promotion
  2. Environments as Code
  3. Project Components
  4. Logging and Monitoring
  5. Docker for Data Science

Section 2: DIY Data Science Workbench

| Number | Explains | Lab |
|--------|----------|-----|
| 1 | Cloud | AWS Console + get instance |
| 2 | Command line + SSH | SSH into server |
| 3 | Linux SysAdmin | Install R, Py, RS, JH |
| 4 | Networking, DNS, SSL | URL, SSL |
| 5 | How servers work + choosing the right one | Take down instance + attach to a bigger one |

Section 3: Steps not Taken

  1. Code Promotion for DevOps, Dev/Test/Prod, Docker
  2. Better Networking (Proxies, Bastion)/Offline
  3. Auth Integrations
  4. Scaling your servers

  1. I think a lot of DevOps experts would argue that you’re doing DevOps wrong if you have a standalone DevOps team, but some companies have them.