5  Deployments and Code Promotion

Your work doesn’t matter if it never leaves your computer. You want your work to be useful, and it only becomes useful if you share it with the people and systems that matter. That requires putting it into production.

When it goes well, putting code into production is a low-drama affair. When it goes poorly, putting code into production is a time-consuming ordeal. That’s why putting code into production – deployment – is one of the primary concerns of DevOps best practices.

The DevOps way to deploy code is called CI/CD, which is short for Continuous Integration and Continuous Deployment (or Continuous Delivery – yes, the CD stands for two different things). When implemented well, CI/CD eases the deployment process through a combination of good workflows and automation.

A few of the principles of CI/CD workflows include:

  • Version control is the single source of truth for what runs in prod.

  • Changes are merged frequently and incrementally rather than in large batches.

  • Testing and deployment are automated wherever possible.

In this chapter, you’ll learn how to create a code promotion process to incorporate CI/CD principles and tools into your data science project deployments.

Separate the prod environment

CI/CD is all about quickly promoting code into production. It’s all too easy to mess up production if you don’t have a clear boundary between what is in production and what isn’t. That’s why software environments are often divided into dev, test, and prod.

Dev is the development environment where new work is produced; test is where the code is checked for performance, usability, and feature completeness; and prod is the production environment. Sometimes dev and test are collectively called the lower environments and prod the higher environment.

An app moving through dev, test, and prod environments.

While the dev/test/prod triad is the most traditional, some organizations have more than two lower environments and some have only dev and prod. That’s all fine. The number and configuration of lower environments should vary according to your organization’s needs. But, just as Tolstoy said about happy families, all prod environments are alike.

Some criteria that all good prod environments meet:

  1. The environment is created using code. For data science, that means managing R and Python packages using environments-as-code tooling, as discussed in Chapter 1.

  2. Changes happen via a promotion process. The process combines human approvals validating code is ready for production with automations to run tests and deploy.

  3. Changes only happen via the promotion process. This means no manual changes to the environment or the active code in production.

Rules 1 and 2 tend to be easy to follow. It might even be fun to figure out how to create the environment with code and design a promotion process. But the first time something breaks in your prod environment, you will be sorely tempted to violate rule 3. Please don’t do it.

Keeping a pristine prod environment is necessary if you want to run a data science project that becomes critical to your organization. When an issue arises, you must reproduce it in a lower environment before pushing a fix through your promotion process – and that's only possible if you keep your environments in sync.

These guidelines for a prod environment look almost identical to guidelines for general-purpose software engineering. The divergent needs of data scientists and general-purpose software engineers show up in the composition of lower environments.

Dev and test environments

As a data scientist, dev means working in a lab environment like RStudio, Spyder, VSCode, or PyCharm and experimenting with the data. You’re slicing the data this way or that to see if anything meaningful emerges, creating plots to see if they are the right way to show off a finding, and checking whether certain features improve model performance. All this means it is impossible to work without real data.

“Duh”, you say, “Of course you can’t do data science without real data.”

This may be obvious to you, but needing to do data science on real data in dev is a common source of friction with IT/Admins.

That’s because this need is unique to data scientists. For general-purpose software engineering, a lower environment needs data formatted like the real data, but the content doesn’t matter.

For example, if you’re building an online store, you need dev and test environments where the API calls from the sales system are in the same format as the real data, but you don’t care if it’s real sales data. In fact, you probably want to create some odd-looking cases for testing purposes.

One way to help alleviate concerns about using real data is to create a data science sandbox. A great data science sandbox provides:

  • Read-only access to real data for experimentation.

  • Places to write mock data to test things you’ll write for real in prod.

  • Expanded access to R and Python packages for experiments before promoting to prod.

Working with your IT/Admin team to get these things isn’t always easy. They might not want to give you real data in dev. One point to emphasize is that creating this environment makes things more secure. It gives you a place to do development without fear that you might damage production data or services.

Version control implements code promotion

Once you’ve invented your code promotion process, you need a way to operationalize it. If your process says that your code needs testing and review before it’s pushed to prod, you need a place to do that. Version control is the tool to make your code promotion process real.

Version control is software that allows you to keep the prod version of your code safe, gives contributors a copy to work on, and hosts tools to manage merging changes back together. These days, Git is the industry standard for version control.

Git is an open-source system for tracking changes to computer files in a project-level collection called a repository. You can host repositories on your own Git server, but most organizations host their repositories with free or paid plans from tools like GitHub, GitLab, Bitbucket, or Azure DevOps.

This is not a book on Git. If you're not comfortable using local and remote repositories, branching, and merging, the rest of this chapter won't be useful right now. I recommend you take a break from this book and learn about Git.

Hints on learning Git

People who say learning Git is easy are either lying or have forgotten. I am sorry our industry has standardized on a tool with such terrible ergonomics, but it's worth your time to learn it.

Whether you’re an R or Python user, I’d recommend starting with a resource designed to teach Git to a data science user. My recommendation is to check out HappyGitWithR by Jenny Bryan.

If you’re a Python user, some specific tooling suggestions won’t apply, but the general principles will be the same.

If you know Git and need a reminder of commands, see Appendix D for a cheatsheet of common ones.

The precise contours of your code promotion process – and, therefore, your Git policies – are up to you and your organization's needs. Do you need multiple rounds of review? Can anyone promote something to prod, or just certain people? Is automated testing required?

You should make these decisions as part of your code promotion process, which you can enshrine in the project’s Git repository configuration.

One important decision you’ll make is how to configure the branches of your Git repository. Here’s how I’d suggest you do it for production data science projects:

  1. Maintain two long-running branches – main is the prod version of your project and test is a long-running pre-prod version.
  2. Code can only be promoted to main via a merge from test. Direct pushes to main are not allowed.
  3. New functionality is developed in short-lived feature branches that are merged into test when you think they’re ready to go. Once sufficient approvals are granted, the feature branch changes in test are merged into main.

This framework helps maintain a reliable prod version on the main branch while leaving sufficient flexibility to accomplish any set of approvals and testing you might want.

Here’s an example of how this might work. Let’s say you were working on a dashboard and were trying to add a new plot.

You would create a new feature branch with your work, perhaps called new_plot. When you were happy with it, you would merge the feature branch to test. Depending on your organization’s process, you might be able to merge to test yourself or you might require approval.

If your testing turned up a bug, you’d fix the bug in the feature branch, merge the bug fix into test, re-test, and merge to main once you were satisfied.

Here’s what the Git graph for that sequence of events might look like:

Diagram of the branching strategy: a feature branch called new_plot is created from test and then merged back into it. Testing reveals a bug, so another commit fixing the bug is merged into test, and then test is merged into main.
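
At the command line, that sequence might look roughly like the sketch below. It assumes the merges into test and main happen via pull requests in your Git provider:

# branch off test to do new work
git switch test
git pull
git switch -c new_plot

# ...add the new plot, then commit and push...
git add .
git commit -m "Add new plot"
git push -u origin new_plot

# open a pull request from new_plot into test
# when testing reveals the bug, fix it on new_plot,
# commit, push, and merge into test again
# once everything passes, open a pull request from test into main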

One of the tenets of a good CI/CD practice is that changes are merged frequently and incrementally into production.

A good rule of thumb is that you want your merges to be the smallest meaningful change that can be incorporated into main in a standalone way.

There are no hard and fast rules here. Knowing the appropriate scope for a single merge is an art – one that can take years to develop. Your best resource here is more senior team members who’ve already figured it out.

CI/CD automates Git operations

The role of Git is to make your code promotion process happen. Git allows you to configure requirements for whatever approvals and testing you need. Your CI/CD tool sits on top of that so that all this merging and branching does something.1

To be more precise, a CI/CD pipeline for a project watches the Git repository and does something based on a trigger. Common triggers include a push or merge to a particular branch or a pull request opening.

The most common CI/CD operations are pre-merge checks like spell checking, code linting, automated testing, and post-merge deployments.
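
For example, a minimal pre-merge check in GitHub Actions might be triggered by a pull request into main. This is a sketch, not tied to any particular project; the final run step is a placeholder for whatever linter or test suite you use:

on:
  pull_request:
    branches: [main]

name: Pre-merge checks

jobs:
  checks:
    runs-on: ubuntu-latest
    steps:
      - name: Check out repository
        uses: actions/checkout@v2
      - name: Run checks
        run: echo "run linting and automated tests here"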

There are a variety of CI/CD tools available. Because of the tight linkage between CI/CD operations and Git repos, CI/CD pipelines built into Git providers are very popular.

GitHub Actions (GHA) was released a few years ago and is eating the world of CI/CD. Depending on your organization and the age of your CI/CD pipeline, you might also see Jenkins, Travis, Azure DevOps, or GitLab.

Configuring per-environment behavior

As you promote an app from dev to test and prod, you probably want behavior to look different across the environments. For example, you might want to switch data sources from a dev database to a prod one, switch a read-only app into write mode, or use a different logging level.

The easiest way to create per-environment behavior is to:

  1. Write code that includes flags that change behavior (e.g., write to the database or not).
  2. Capture the intended behavior for each environment in a YAML config file.
  3. Read in the config file as your code starts.
  4. Choose the values for the current environment based on an environment variable.
Note

Only non-secret configuration settings should go in a config file. Secrets should always be configured using secrets management tools so they don’t appear in plain text.

Most CI/CD tooling supports injecting secrets into the pipeline as environment variables.

For example, let’s say you have a project that should use a special read-only database in dev and switch to writing in the prod database in prod. You might write the config file below to describe this behavior:

config.yml
dev:
  write: false
  db-path: dev-db
test:
  write: true
  db-path: dev-db
prod:
  write: true
  db-path: prod-db

Then, you’d set an environment variable to have the value dev in dev, test in test, and prod in prod. Your code would grab the correct set of values based on the environment variable.

In Python, there are many ways to set and read a per-environment configuration. The easiest way to use YAML is to read it in with the {yaml} package (installed as PyYAML) and treat the result as a dictionary.

In R, the {config} package is the standard way to load a configuration from a YAML file. The config::get() function uses the value of the R_CONFIG_ACTIVE environment variable to choose which configuration to use.
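
Here’s a minimal Python sketch of this pattern. It assumes the config.yml above and an environment variable named CONFIG_ACTIVE; that variable name is an arbitrary choice for illustration, while R’s {config} package specifically uses R_CONFIG_ACTIVE:

import os

import yaml  # installed as PyYAML

# Pick the active environment, defaulting to dev.
env = os.environ.get("CONFIG_ACTIVE", "dev")

# Load the whole file, then keep only the active environment's values.
with open("config.yml") as f:
    config = yaml.safe_load(f)[env]

# Use the flags to change behavior.
if config["write"]:
    print(f"Writing to {config['db-path']}")
else:
    print(f"Read-only mode against {config['db-path']}")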

Comprehension Questions

  1. Write down a mental map of the relationship between the three environments for data science. Include the following terms: Git, Promote, CI/CD, automation, deployment, dev, test, prod
  2. Why is Git so important to a good code promotion strategy? Can you have a code promotion strategy without Git?
  3. What is the relationship between Git and CI/CD? What’s the benefit of using Git and CI/CD together?
  4. Write out a mental map of the relationship of the following terms: Git, GitHub, CI/CD, GitHub Actions, Version Control

Lab: Host a website with automatic updates

In labs 1 through 4, you’ve created a Quarto website for the penguin model. You’ve got sections on EDA and model building. But it’s still just on your computer.

In this lab, we will deploy that website to a public site on GitHub and set up GitHub Actions as CI/CD so the EDA and modeling steps re-render every time we make changes.

Before we get into the meat of the lab, there are a few things you need to do on your own. If you don’t know how, there are plenty of great tutorials online.

  1. Create an empty public repo on GitHub.
  2. Configure the repo as the remote for your Quarto project directory.

Once you’ve connected the GitHub repo to your project, you will set up the Quarto project to publish via GitHub Actions. There are great directions on configuring that on the Quarto website.

Following those instructions will accomplish three things for you:

  1. Generate a _publish.yml, which is a Quarto-specific file for configuring publishing locations.
  2. Configure GitHub Pages to serve your website off a long-running standalone branch called gh-pages.
  3. Generate a GitHub Actions workflow file, which will live at .github/workflows/publish.yml.

Here’s the basic GitHub Actions file (or close to it) that the process will auto-generate for you.

.github/workflows/publish.yml
on:
  workflow_dispatch:
  push:
    branches: main

name: Quarto Publish

jobs:
  build-deploy:
    runs-on: ubuntu-latest
    permissions:
      contents: write
    steps:
      - name: Check out repository
        uses: actions/checkout@v2

      - name: Set up Quarto
        uses: quarto-dev/quarto-actions/setup@v2

      - name: Render and Publish
        uses: quarto-dev/quarto-actions/publish@v2
        with:
          target: gh-pages
        env:
          GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }}

Like all GitHub Actions, this action is defined in a .yml file in the .github/workflows directory of a project. It contains several sections:

  • The on section defines when the workflow occurs. In this case, we’ve configured the workflow to trigger only on a push to the main branch.2 Another common case would be to trigger on a pull request to main or another branch.
  • The jobs section defines what happens as a series of steps that occur in sequential order.
    • The runs-on field specifies which runner to use. A runner is a virtual machine that is started from scratch every time the action runs. GitHub Actions offers runners with Ubuntu, Windows, and macOS. You can also add custom runners.
    • Most steps are defined with uses, which calls a preexisting GitHub Actions step that someone else has written.
    • You can inject variables using with and environment variables using env.

If you try to run this, it probably won’t work.

The CI/CD process occurs in a completely isolated environment. This auto-generated action doesn’t include setting up versions of R and Python or the packages to run our EDA and modeling scripts. We need to get that configured before this action will work.

Note

The Quarto documentation recommends freezing your computations. Freezing is useful if you want to render your R or Python code only once and update only the text of your document. You wouldn’t need to set up R or Python in CI/CD, and the document would render faster.

That said, freezing isn’t an option if you intend the CI/CD environment to re-run the R or Python code.

Because the main point here is to learn about getting environments as code working in CI/CD, you should not freeze your computations.
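
For reference, freezing is controlled in the project’s _quarto.yml. A minimal sketch of what enabling it would look like (don’t add this for the lab):

_quarto.yml
execute:
  freeze: auto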

First, add the commands to install R, {renv}, and the packages for your content to the GitHub Actions workflow.

.github/workflows/publish.yml
      - name: Install R
        uses: r-lib/actions/setup-r@v2
        with:
          r-version: '4.2.0'
          use-public-rspm: true

      - name: Setup renv and install packages
        uses: r-lib/actions/setup-renv@v2
        with:
          cache-version: 1
        env:
          RENV_CONFIG_REPOS_OVERRIDE: https://packagemanager.rstudio.com/all/latest
Note

If you’re having slow package installs in CI/CD for R, I’d strongly recommend using a repos override like in the example above.

The issue is that CRAN doesn’t serve binary packages for Linux, which means slow installs. You need to direct {renv} to install from Public Posit Package Manager, which does have Linux binaries.

You’ll also need to add steps to the GitHub Actions workflow to install Python and the necessary Python packages from requirements.txt.

.github/workflows/publish.yml
      - name: Install Python and Dependencies
        uses: actions/setup-python@v4
        with:
          python-version: '3.10'
          cache: 'pip'
      - run: pip install jupyter
      - run: pip install -r requirements.txt

Note that we run the Python environment restore commands with run rather than uses. Where uses calls an existing GitHub Action that someone else has written, run executes a shell command directly on the runner.

Once you’ve made those changes, try pushing or merging your project to main. If you click on the Actions tab on GitHub, you’ll be able to see the Action running.

In all honesty, it will probably fail the first time or five. You will rarely get your Actions correct on the first try. Breathe deeply and know we’ve all been there. You’ll figure it out.

Once it’s up, your website will be available at https://<username>.github.io/<repo-name>.


  1. Strictly speaking, this is not true. There are a lot of different ways to kick off CI/CD jobs. But the right way to do it is to base it on Git operations.↩︎

  2. A completed merge counts as a push.↩︎