3  Data Science Project Architecture

As a data scientist, you’re probably familiar with functions in your chosen programming language. The advantage of functions is that while you’re writing one, you can focus exclusively on what’s happening inside it. The rest of the time, you can abstract away how the function works and simply rely on what it does.

You should be thinking similarly about the architecture of your data science project.

As you build your app, report, or API, think about how to architect the project so that future updates are painless. You want to modularize your app so that you can easily apply the other best practices in this book, like CI/CD.

In short, this chapter will be about how to take the software engineering notion of a three-layer app architecture and apply it to your data science project. We’ll get into what the layers should be, how to decide on a boundary point between them, and how to configure each piece to work with the others.

3.1 The three-layer app

The three-layer app architecture is one of the most common software architectures in use today.

A three-layer app consists, unsurprisingly, of three parts:

  • Presentation/Interaction layer – the layer where users actually interact.

  • Application layer – processes data and does the actual work of the app.

  • Data layer – the layer that stores data for the app.

If you’ve heard the terms front-end and back-end, front-end roughly corresponds to the presentation layer, and back-end to application and data layers.

Is three-layer dead?

If you’re looking around on the internet, you may see assertions that the three-layer app is dead and architectures are moving to microservices or other architectures. In my observation, this is overhyped, and many data science projects would do well to move to a three-layer architecture.

The hardest part of building a three-layer app is understanding what goes in each layer and how to draw the boundaries. In this chapter, I’ll introduce two rules of production app creation:

  • Separate your business logic from your interaction logic.

  • Separate your data from the app.

These roughly correspond to the boundaries between the presentation and application layers, and between the application and data layers.

3.2 Separate business and interaction logic

There are great frameworks for writing apps in both R and Python – there’s Streamlit, Dash, and Shiny in Python and Shiny in R. This advice also applies if you’re creating a report or some kind of static document using a Jupyter Notebook, R Markdown, or Quarto.

Regardless of the framework you’re using to create your project, it’s important to separate the business and the interaction logic.

What does this mean?

All too often, I see monolithic Shiny apps of thousands or tens of thousands of lines of code, with button definitions, UI bits, and user interaction definitions mixed in among the actual work of the app. This doesn’t work particularly well, for two reasons.

First, it’s much easier to read through your app or report code and understand what it’s doing when the app itself is only concerned with displaying UI elements to the user, passing choices and actions to the backend, and then displaying the result back to the user. Second, mixing the two makes it hard to change the business logic without also touching the UI code.

The application layer is where your business logic should live. The business logic is the actual work that the application does. For a data science app, this is often slicing and dicing data, computing statistics on that data, generating model predictions, and constructing plots.

Separating the presentation layer from the business logic means encapsulating the business logic so that you can change how it works independently of how users interact with the app. For example, this might mean creating standalone functions to create plots or compute statistics.

If you’re using Shiny, R Markdown, or Quarto, this means passing those functions ordinary, non-reactive values that have been generated from the input parameters.

Let’s start with a counterexample. Here’s a simple app in Shiny for R that visualizes certain data points from the Palmer Penguins data set. This is an example of bad app architecture.

library(ggplot2)
library(dplyr)
library(palmerpenguins)
library(shiny)

all_penguins <- c("Adelie", "Chinstrap", "Gentoo")

# Define UI
ui <- fluidPage(
  sidebarLayout(
    sidebarPanel(
      # Select which species to include
      selectInput(
        inputId = "species", 
        label = "Species", 
        choices = all_penguins, 
        selected = all_penguins,
        multiple = TRUE
      )
    ),
    # Show a plot of penguin data
    mainPanel(
      plotOutput("penguinPlot")
    )
  )
)

server <- function(input, output) {
  
  output$penguinPlot <- renderPlot({
    # Filter data
    dat <- palmerpenguins::penguins %>%
      dplyr::filter(
        species %in% input$species
      )
    
    # Render Plot
    dat %>%
      ggplot(
        aes(
          x = flipper_length_mm,
          y = body_mass_g,
          color = sex
        )
      ) +
      geom_point()
  })
  
}

# Run the application 
shinyApp(ui = ui, server = server)

The structure of this Shiny app is bad. It’s not a huge deal here, because this is a simple app that’s pretty easy to parse if you’re reasonably comfortable with R and Shiny.

Why is this bad? Look at the app’s server block: all of the app’s logic is contained inside a single renderPlot call.

renderPlot is a presentation function: it renders plots. But I’ve got logic in there that generates the data set I need and generates the plot. Again, because this app is simple, it’s not a huge deal here. But imagine if this app had several tabs, multiple input dropdowns, a dozen or more plots, and complicated logic dictating how to process the dropdown choices into the plots. It would be a mess!

Instead, we should separate the presentation logic from the business logic. That is, let’s separate out the code for generating the UI, taking the user’s choice of penguin species, and displaying the result.

The business logic – what those choices mean, and the resulting calculations – should, at minimum, be moved into standalone functions.

library(ggplot2)
library(dplyr)
library(palmerpenguins)
library(shiny)

all_penguins <- c("Adelie", "Chinstrap", "Gentoo")

# Define UI
ui <- fluidPage(
  sidebarLayout(
    sidebarPanel(
      # Select which species to include
      selectInput(
        inputId = "species", 
        label = "Species", 
        choices = all_penguins, 
        selected = all_penguins,
        multiple = TRUE
      )
    ),
    # Show a plot of penguin data
    mainPanel(
      plotOutput("penguinPlot")
    )
  )
)

server <- function(input, output) {
  
  # Filter data
  dat <- reactive(
    filter_data(input$species)
    )
  
  # Render Plot
  output$penguinPlot <- renderPlot(
    make_penguin_plot(dat())
    )
}

# Run the application 
shinyApp(ui = ui, server = server)

Now you can see that the app itself has gotten much simpler. The UI hasn’t changed at all, but the server block is now just two short statements! And since I used descriptive function names, it’s really easy to understand what happens in each of the places where my app has reactive behavior.

Either in the same file, or in another file I can source in, I can now include the two functions that contain my business logic:

#' Get the penguin data
#'
#' @param species character, which penguin species
#' @return data frame
#'
#' @examples
#' filter_data("Adelie")
filter_data <- function(species = c("Adelie", "Chinstrap", "Gentoo")) {
  palmerpenguins::penguins %>%
    dplyr::filter(
      species %in% !!species
    )
}

#' Create a plot of the penguin data
#'
#' @param data data frame
#'
#' @return ggplot object
#'
#' @examples
#' filter_data("Adelie") |> make_penguin_plot()
make_penguin_plot <- function(data) {
  data %>%
    ggplot(
      aes(
        x = flipper_length_mm,
        y = body_mass_g,
        color = sex
      )
    ) +
    geom_point()
}

Note that somewhere along the way, I also added function documentation using roxygen2 comments. This isn’t an accident! Writing standalone functions is a great way to force yourself to be clear about what should happen, and writing examples is the first step towards writing tests for your code.
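The same separation carries over to Python. The sketch below is one possible shape, not a prescribed one: the rows are made-up illustrative values rather than real penguin measurements, and `mean_body_mass` and `render_summary` are hypothetical names introduced only for this example.

```python
# Business logic: plain functions with no UI code, so they can be tested alone.
# The rows below are made-up illustrative values, not real penguin measurements.
PENGUINS = [
    {"species": "Adelie", "flipper_length_mm": 190, "body_mass_g": 3700},
    {"species": "Gentoo", "flipper_length_mm": 217, "body_mass_g": 5100},
    {"species": "Chinstrap", "flipper_length_mm": 196, "body_mass_g": 3750},
    {"species": "Gentoo", "flipper_length_mm": 221, "body_mass_g": 5300},
]


def filter_data(rows, species):
    """Keep only the rows whose species is in the requested collection."""
    wanted = set(species)
    return [row for row in rows if row["species"] in wanted]


def mean_body_mass(rows):
    """Return the mean body mass in grams, or None if there are no rows."""
    if not rows:
        return None
    return sum(row["body_mass_g"] for row in rows) / len(rows)


# Presentation logic: only formats an already-computed result for display.
def render_summary(species, mean_mass):
    if mean_mass is None:
        return f"No data for {', '.join(species)}"
    return f"Mean body mass for {', '.join(species)}: {mean_mass:.0f} g"


selected = ["Gentoo"]  # in a real app this would come from a UI input
gentoo_rows = filter_data(PENGUINS, selected)
summary_text = render_summary(selected, mean_body_mass(gentoo_rows))
print(summary_text)
```

Because the two business-logic functions never touch the UI, they can be exercised directly in a test or a notebook without starting an app.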

3.2.1 Consider using an API for long-running processes

In a true three-layer app, the middle layer is almost always an application programming interface (API). In a data science app, separating business logic into functions is often sufficient. But if you’ve got a long-running bit of business logic, it’s often helpful to separate it into an API.

You can basically think of an API as a “function as a service”. That is, an API is just one or more functions, but instead of being called within the same process that your app is running or your report is processing, it will run in a completely separate process.

For example, let’s say you’ve got an app that allows users to feed in input data and then generate a model based on that data. If you generate the model inside the app, the user will have the experience of pressing the button to generate the model and having the app seize up on them while they’re waiting. Moreover, other users of the app will find themselves affected by this behavior.

If, instead, the button in the app ships the long-running process to a separate API, you can scale the presentation layer separately from the business logic.

Luckily, if you’ve written functions for your app, turning them into an API is trivial.

Let’s take the first function, the one that gets the appropriate data set, and turn it into an API using the plumber package in R. In Python, FastAPI is a popular library for writing APIs.

library(plumber)
library(dplyr)  # provides the %>% pipe used in the function below

#* @apiTitle Penguin Explorer
#* @apiDescription An API for exploring palmer penguins.

#* Get data set based on parameters
#* @param species character, which penguin species
#* @get /data
function(species = c("Adelie", "Chinstrap", "Gentoo")) {
  palmerpenguins::penguins %>%
    dplyr::filter(
      species %in% !!species
    )
}

You’ll notice that there are no changes to the actual code of the function. The comment lines that document the function and its arguments are now prefixed by #* rather than #', and there are a few more annotations, including the HTTP method this endpoint accepts (@get) and its path (/data).

For more on querying APIs and working with paths, see Chapter 9.

I’ll also need to change my function in the app somewhat to actually call the API, but it’s pretty easy using a package like httr2 in R or requests in Python.
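To make the “function as a service” idea concrete, here is a minimal Python sketch that uses only the standard library rather than FastAPI or requests. It is an illustration under several assumptions: the rows are made up, `DataHandler` is a hypothetical name, and the server runs in a background thread purely so the example is self-contained. In a real deployment the API would run as its own process.

```python
import json
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer
from urllib.parse import parse_qs, urlparse
from urllib.request import urlopen

# Made-up illustrative rows standing in for the real data set.
PENGUINS = [
    {"species": "Adelie", "body_mass_g": 3700},
    {"species": "Gentoo", "body_mass_g": 5100},
]


def filter_data(species):
    """Business logic: unchanged whether called directly or through the API."""
    return [row for row in PENGUINS if row["species"] in species]


class DataHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        url = urlparse(self.path)
        if url.path != "/data":
            self.send_error(404)
            return
        # ?species=Adelie&species=Gentoo -> ["Adelie", "Gentoo"]
        species = parse_qs(url.query).get("species", [])
        body = json.dumps(filter_data(species)).encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, format, *args):
        pass  # keep the example quiet


# Port 0 asks the operating system for any free port.
server = HTTPServer(("127.0.0.1", 0), DataHandler)
threading.Thread(target=server.serve_forever, daemon=True).start()

# The "app" side: call the API over HTTP instead of calling the function.
port = server.server_address[1]
with urlopen(f"http://127.0.0.1:{port}/data?species=Gentoo") as response:
    api_rows = json.loads(response.read())

server.shutdown()
server.server_close()
print(api_rows)
```

The only thing that changes on the app side is how the function is reached: an HTTP request replaces the direct call, while `filter_data` itself stays the same.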

The easiest way to host an R or Python API is using Docker. See Chapter 5 for how you would host this example inside a Docker container, or the documentation of the relevant package for other options.

3.3 Separate data from app

Similarly to separating the presentation layer from the application layer, we’ll also want to separate out the data layer. In a data science app, there are two things you’re likely to have in the data layer.

The first is one or more rectangular data frames. A lot of data science projects are dashboards based on existing data. You may need to ingest and clean that data before it goes into the app. You should create standalone jobs that do this work before the data reaches the app’s business logic.

The other thing you’re likely to store is a machine learning model. You’ll probably have a separate process to train the machine learning model and have it ready to go.

<TODO: Image of two strands of ds project – ML model + rect df>

3.3.1 Storage Format

The first question of how to store the data is the storage format. There are really three distinct options for storage format.

Flat file storage describes writing the data out into a simple file. The canonical example of a flat file is a CSV file. However, there are also other formats that may make stored data smaller through compression, make reads faster, and/or allow you to save arbitrary objects rather than just rectangular data. In R, rds is the generic binary format, while pickle is the generic binary format in Python.

Flat files can be moved around just like any other file on your computer. You can keep them on your computer and share them through tools like Dropbox, Google Drive, or scp.

The two biggest disadvantages of flat file storage both stem from the files’ indivisibility. First, to use a flat file in R or Python, you’ll need to load the whole thing into your R or Python session. For small data files, this isn’t a big deal. But a large file can take a long time to read, especially if it has to travel over a network, and you might have to wait for it every time the app starts up. Second, there’s generally no way to version the data or update just part of it, so if you’re saving archival versions, they can take up a lot of space very quickly.
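As a small illustration of flat files in Python, the sketch below writes the same made-up rows to a CSV file and to a pickle file and reads both back. The file names are arbitrary, and the values are invented for the example.

```python
import csv
import pickle
import tempfile
from pathlib import Path

# Made-up illustrative rows.
rows = [
    {"species": "Adelie", "body_mass_g": 3700},
    {"species": "Gentoo", "body_mass_g": 5100},
]

with tempfile.TemporaryDirectory() as tmp:
    csv_path = Path(tmp) / "penguins.csv"
    pkl_path = Path(tmp) / "penguins.pkl"

    # CSV: portable and human-readable, but every value comes back as text.
    with csv_path.open("w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=["species", "body_mass_g"])
        writer.writeheader()
        writer.writerows(rows)
    with csv_path.open(newline="") as f:
        from_csv = list(csv.DictReader(f))  # reads the entire file

    # pickle: preserves Python types and arbitrary objects, but is Python-only
    # and should only be loaded from sources you trust.
    with pkl_path.open("wb") as f:
        pickle.dump(rows, f)
    with pkl_path.open("rb") as f:
        from_pickle = pickle.load(f)  # also loads everything at once

print(type(from_csv[0]["body_mass_g"]).__name__)
print(type(from_pickle[0]["body_mass_g"]).__name__)
```

In both cases the whole file is read into memory in one go, which is the indivisibility described above.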

At the other extreme from a flat file format is a database. A database is a standalone server with its own storage, memory, and compute. In general, you’ll retrieve things from a database using some sort of query language. Most databases you’ll interact with in a data science context are designed around storing rectangular data structures and use Structured Query Language (SQL) to get at the data inside.

There are other sorts of databases that store other kinds of objects – you may need these depending on the kind of objects you’re working with. Often the IT/Admin group will have standard databases they work with or use, and you can just piggyback on their decisions. Sometimes you’ll also have choices to make about what database to use, which are beyond the scope of this book.

The big advantage of a database is that the data is stored and managed by an independent process. This means that accessing data from your app is often a matter of just connecting to the database, as opposed to having to move files around.

Working with databases can also be fraught. You usually end up in one of two situations. In the first situation, the database isn’t really for the data science team. You can probably get read access, but not write – so you’ll be able to use the database as your source of truth, but you won’t be able to write there for intermediate tables and other things you might need. In the second situation, you have freedom to set up your own database, in which case you’ll have to own it – and that comes with its own set of headaches.

There’s a third family of options for data storage that is quickly rising in popularity for medium-sized data. These options allow you to store data in a flat file, but access it in a smarter way than “just load all the data into memory”. SQLite is a classic example on this front that gives you SQL access to what is basically just a flat file. There are also newer entrants into this space that are better from an analytics perspective, like Apache Arrow with Feather and Parquet files, and the Dask project in Python.

These tools can give you the best of both worlds: you get away from the R and Python limitation of having to load all your data into memory, without having to run a separate database server. But you’ll still have to keep track of where the actual files are and make them accessible to your app.
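Python ships with a `sqlite3` module, so the SQLite case is easy to sketch. The table layout and values below are invented for illustration. The point is that the database is a single file, yet the app can ask for only the rows it needs.

```python
import sqlite3
import tempfile
from pathlib import Path

with tempfile.TemporaryDirectory() as tmp:
    db_path = Path(tmp) / "penguins.sqlite"  # the whole database is this one file

    # Write step, e.g. done by a separate data-preparation job.
    conn = sqlite3.connect(db_path)
    conn.execute("CREATE TABLE penguins (species TEXT, body_mass_g INTEGER)")
    conn.executemany(
        "INSERT INTO penguins VALUES (?, ?)",
        [("Adelie", 3700), ("Gentoo", 5100), ("Gentoo", 5300)],  # made-up values
    )
    conn.commit()
    conn.close()

    # Read step, e.g. inside the app: only the matching rows are returned.
    conn = sqlite3.connect(db_path)
    gentoo = conn.execute(
        "SELECT species, body_mass_g FROM penguins "
        "WHERE species = ? ORDER BY body_mass_g",
        ("Gentoo",),
    ).fetchall()
    conn.close()

print(gentoo)
```

The file still has to live somewhere the app can reach, which is the trade-off noted above.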

One last option is a shared spreadsheet, like Google Sheets. This can be a good solution because you don’t have to host it anywhere yourself, access is easy to control, and there are simple tools for interacting with it.

3.3.2 Storage Location

The second question after what you’re storing is where. If you are using a database, then the answer is easy. The database just lives where it lives, and you’ll need to make sure you have a way to access it – both in terms of network access, as well as making sure you can authenticate into it (more on that below).

If you’re not using a database, then you’ll have to decide where to store the data for your app. Most apps that aren’t using a database start off rather naively with the data in the app bundle.

<TODO: Image of data in app bundle>

This works really well during development and is an easy pattern to get started with. Usually this pattern works fine for a while. The problem is that this pattern generally falls apart when it goes to production. Problems start to arise when the data needs updating – and most data needs updating. Usually, you’ll be ready to update the data in the app long before you’re ready to update the app itself.

At this point, you’ll be kicking yourself that you now have to update the data inside the app every time you want to make a data update. It’s generally a better idea to have the data live outside the app bundle. Then you can update the data without mucking around with the app itself.

A few options include putting a flat file (or a flat file that supports partial reads, as described above) into a directory near the app bundle. The pins package is also a great option here.
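One way to keep a flat file outside the app bundle is to have the app look up the data directory at runtime. The sketch below assumes a hypothetical environment variable named `APP_DATA_DIR` and a hypothetical file named `counts.json`. Neither comes from any particular tool.

```python
import json
import os
import tempfile
from pathlib import Path


# APP_DATA_DIR is a hypothetical variable name chosen for this sketch.
def load_counts():
    """Read the data file from a directory configured outside the app code."""
    data_dir = Path(os.environ.get("APP_DATA_DIR", "data"))
    with (data_dir / "counts.json").open() as f:
        return json.load(f)


with tempfile.TemporaryDirectory() as tmp:
    os.environ["APP_DATA_DIR"] = tmp  # normally set by the deployment, not the app
    data_file = Path(tmp) / "counts.json"

    # A scheduled job writes the data...
    data_file.write_text(json.dumps({"deer": 3}))
    first_read = load_counts()

    # ...and later refreshes it, without touching or redeploying the app.
    data_file.write_text(json.dumps({"deer": 5, "fox": 1}))
    second_read = load_counts()

print(first_read, second_read)
```

The app code never changes between the two reads. Only the file does.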

3.4 Choosing your storage solution

3.4.1 How frequently are the data updated relative to the code?

Many data apps have different update requirements for different data in the app.

For example, imagine you were the data scientist for a wildlife group that needed a dashboard to track the types of animals that had been spotted by a wilderness wildlife camera. You probably have a table that gives parameters for the animals themselves – perhaps things like endangered status, expected frequency, and more. That table probably needs to be updated very infrequently.

On the other hand, the day-to-day counts of animals spotted probably need to be updated much more frequently.

If your data is updated only very infrequently, it might make sense to just bundle it up with the app code and update it on a similar cadence to the app itself.

<TODO: Picture data in app bundle>

On the other hand, the more frequently updated data probably doesn’t make sense to update at the same cadence as the app code. You probably want to access that data in some sort of external location, perhaps on a mounted drive outside the app bundle, in a pin or bucket, or in a database.

In my experience, you almost never want to actually bundle data into the app. You almost always want to allow for the app data (“state”) to live outside the app and for the app to read it at runtime. Even data that you think will be updated infrequently is unlikely to be updated as infrequently as your app code. Animals move on and off the endangered list, ingredient substitutions are made, and hospitals open and close and change their names in memory of someone.

It’s also worth considering whether your app needs a live data connection to do processing, or whether looking up values in a pre-processed table will suffice. The more complex the logic inside your app, the less likely you’ll be able to anticipate what users need, and the more likely you’ll have to do a live lookup.

3.4.2 Is your app read-only, or does it have to write?

Many data apps are read-only. This is nice. If you’re going to allow your app to write, you’ll need to be careful about permissions, protect against data loss from SQL injection and similar attacks, and check data quality.
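For an app that writes to a SQL database, a common defence against SQL injection is to pass user input through query placeholders rather than pasting it into the SQL string. The sketch below uses Python’s built-in `sqlite3` module with an invented `sightings` table. `record_sighting` is a hypothetical helper, and the validation rule is only an example of a data quality check.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sightings (animal TEXT, n INTEGER)")


def record_sighting(conn, animal, n):
    """Validate input, then insert with placeholders instead of string pasting."""
    if not isinstance(n, int) or n < 0:
        raise ValueError("n must be a non-negative integer")
    # The ? placeholders make the driver treat the values strictly as data.
    conn.execute("INSERT INTO sightings (animal, n) VALUES (?, ?)", (animal, n))
    conn.commit()


record_sighting(conn, "deer", 2)
# A malicious-looking value is stored as plain text; it is never run as SQL.
record_sighting(conn, "x'); DROP TABLE sightings; --", 1)

stored = conn.execute("SELECT animal, n FROM sightings ORDER BY rowid").fetchall()
print(stored)
```

Other database drivers use different placeholder symbols, so check the documentation for the one you use.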

If you want to save the data, you’ll also need a solution for that. There’s no one-size-fits-all answer here as it really depends on the sort of data you’re using. The main thing to keep in mind is that if you’re using a database, you’ll have to make sure you have write permissions.

3.5 When does the app fetch its data?

Does the app fetch its data at app open or throughout runtime?

The first important question you’ll have to figure out is what the requirements are for the code you’re trying to put into production.

3.5.1 How big are the data in the app?

When I ask this question, people often jump to the size of the raw data they’re using – but that’s often a completely irrelevant metric. You’re starting backwards if you start from the size of the raw data. Instead, you should figure out what’s the size of data you actually need inside the app.

To make this a little more concrete, let’s imagine you work for a large retailer and are responsible for creating a dashboard that will allow people to visualize the last week’s worth of sales for a variety of products. With this vague prompt, you could end up needing to load a huge amount of data into your app – or very little at all.

One of the most important questions is how much you can cache before someone even opens the app. For example, if you need to provide total weekly sales at the department level, that’s probably just a few data points. And even if you need to go back a long way, it’s just a few hundred data points.

But if you start needing to slice and dice the data in a lot of directions, then the data size starts multiplying, and you may have to include the entire raw data set in the report. For example, if you need to include weekly sales at the department level, then the size of your data is \(\text{number of weeks} \times \text{number of departments}\). If you need to include more dimensions, say geographies, then your data size multiplies again by the number of geographies.
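A quick back-of-the-envelope calculation shows how fast this grows. All of the counts below are hypothetical, chosen only to show the multiplication.

```python
# All counts below are hypothetical, chosen only to show how sizes multiply.
n_weeks = 52
n_departments = 20
n_geographies = 50

rows_by_department = n_weeks * n_departments
rows_by_department_and_geography = rows_by_department * n_geographies

print(rows_by_department)                # 52 * 20
print(rows_by_department_and_geography)  # 52 * 20 * 50
```

With these made-up numbers, adding one dimension takes the table from about a thousand rows to tens of thousands.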

3.5.2 What are the performance requirements for the app?

One crucial question for your app is how much wait time is acceptable for people wanting to see the app – and when is that waiting ok? For example, if people need to be able to make selections and see the results in real time, then you probably need a snappy database, or all the data preloaded into memory when they show up.

For some apps, you want the data to be snappy throughout runtime, but it’s ok to have a lengthy startup process (perhaps because it can happen before the user actually arrives). In that case, you’ll want to load a lot of data as the app is starting and do much less throughout the app runtime.

3.5.3 Creating Performant Database Queries

If you are using a database, you’ll want to be careful about how you construct your queries to make sure they perform well. The main way to think about this is whether your queries will be eager or lazy.

In an eager app, you’ll pull basically all of the data for the app as it starts up, while a lazy app will pull data only as it is needed.

<TODO: Diagram of eager vs lazy data pulling>

Making your app eager is usually much simpler – you just read in all the data at the beginning. This is often a good first cut at writing an app, as you’re not sure exactly what requirements your app has. For relatively small datasets, this is often good enough.

If it seems like your app is starting up slowly, or your data is too big to pull in all at once, you may want to pull data more lazily.
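The difference between the two styles can be sketched in a few lines of Python. Here `SOURCE` is a made-up stand-in for a slow data source, and `fetch_log` exists only to count how many trips are made to it. A real app would be querying a database or reading files instead.

```python
from functools import lru_cache

# Stand-in for a slow data source; a real app would query a database or file.
# The numbers are made up for illustration.
SOURCE = {"toys": [10, 12], "garden": [7, 9], "grocery": [30, 31]}
fetch_log = []  # records every trip to the data source


def fetch_department(name):
    fetch_log.append(name)
    return SOURCE[name]


# Eager: pull everything once at startup, then serve from memory.
eager_data = {name: fetch_department(name) for name in SOURCE}
fetches_at_startup = len(fetch_log)


# Lazy: pull a department only when someone asks for it, and cache the result.
@lru_cache(maxsize=None)
def get_department_lazy(name):
    return tuple(fetch_department(name))


fetch_log.clear()
get_department_lazy("toys")
get_department_lazy("toys")  # served from the cache, no second fetch
fetches_when_lazy = len(fetch_log)

print(fetches_at_startup, fetches_when_lazy)
```

The eager version pays for every department up front. The lazy version pays only for what users actually request, at the cost of a wait on the first request.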

Tip

Before you start converting queries to speed up your app, it’s always worthwhile to profile your app and actually check that the data pulling is the slow step. I’ve often been wrong in my intuitions about what the slow step of the app is.

There’s nothing more annoying than spending hours refactoring your app to pull data more lazily only to realize that pulling the data was never the slow step to begin with.
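A rough way to check where the time goes is to time each step separately. The sketch below uses `time.perf_counter` from the Python standard library. `timed`, `pull_data`, and `build_plot` are hypothetical names, and the `sleep` calls merely simulate work. For a fuller picture, the standard library also includes the `cProfile` module.

```python
import time


def timed(step_name, func, timings):
    """Run func(), store how long it took under step_name, return its result."""
    start = time.perf_counter()
    result = func()
    timings[step_name] = time.perf_counter() - start
    return result


# Stand-ins for the real steps of an app; the sleeps simulate work.
def pull_data():
    time.sleep(0.01)
    return list(range(1000))


def build_plot(data):
    time.sleep(0.1)
    return f"plot of {len(data)} points"


timings = {}
data = timed("pull_data", pull_data, timings)
plot = timed("build_plot", lambda: build_plot(data), timings)

slowest_step = max(timings, key=timings.get)
print(slowest_step, timings)
```

In this contrived case the plot step dominates, so making the data pull lazier would not help much.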

It’s also worth considering how to make your queries perform better, regardless of when they occur in your code. You want to pull the minimum amount of data possible, so making data less granular, pulling in a smaller window of data, or pre-computing summaries is great when possible (though again, it’s worth profiling before you take on a lot of work that might result in minimal performance improvements).

Once you’ve decided whether to make your app eager or lazy, you can think about whether to make the query eager or lazy. In most cases, when you’re working with a database, the slowest part of the process is actually pulling the data. That means that it’s generally worth it to be lazy with your query. And if you’re using dplyr from R, being eager vs lazy is simply a matter of where in the chain you put the collect() call.

So you’re better off sending a query to the database, letting the database do a bunch of computations, and pulling a small results set back, rather than pulling in a whole data set and doing computations in R.
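The same idea can be shown in Python with the built-in `sqlite3` module. The `sales` table and its values are invented for the example. Both approaches give the same totals, but the second pulls back far fewer rows.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (department TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?)",
    [("toys", 10.0), ("toys", 15.0), ("garden", 7.0), ("garden", 3.0)],  # made up
)

# Eager query: pull every row, then aggregate in Python.
all_rows = conn.execute("SELECT department, amount FROM sales").fetchall()
totals_in_python = {}
for department, amount in all_rows:
    totals_in_python[department] = totals_in_python.get(department, 0.0) + amount

# Lazy query: let the database aggregate and pull back only the summary.
summary_rows = conn.execute(
    "SELECT department, SUM(amount) FROM sales GROUP BY department"
).fetchall()
totals_in_database = dict(summary_rows)

print(len(all_rows), len(summary_rows))
```

With four rows the difference is invisible, but the number of rows transferred by the first approach grows with the table, while the second stays at one row per department.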

3.5.4 How to connect to databases?

In R, there are two answers to how to connect to a database.

The first option is to use a direct connector to connect to the database. This connector generally will provide a driver to the DBI package. There are other database alternatives, but they’re pretty rare.

<TODO: image of direct connection vs through driver>

Alternatively, you can use an ODBC/JDBC driver to connect to the database. In this case, you’ll use something inside your R or Python session to use a database driver that has nothing to do with R or Python. Many organizations like these because IT/Admins can configure them on behalf of users and can be agnostic about whether users are using them from R, Python, or something else entirely.

If you’re in R, the odbc package gives you a way to interface with ODBC drivers. In Python, the pyodbc package plays a similar role.

A DSN (data source name) is a particular way to configure an ODBC driver. DSNs are nice because the Admin can fill in the connection details ahead of time, and you don’t need to know any details of the connection other than your username and password.

<TODO: image of how DSN works>

In R, writing a package that creates database connections for users is also a very popular way to provide database connections to the group.

3.6 How do I do data authorization?

This is a question you probably don’t think about much as you’re puttering around inside RStudio or in a Jupyter Notebook. But when you take an app to production, this becomes a crucial question.

The best and easiest case here is that everyone who views the app has the same permissions to see the data. In that case, you can just allow the app access to the data, and you can check whether someone is authorized to view the app as a whole, rather than at the data access layer.

In some cases, you might need to provide differential data access to different users. Sometimes this can be accomplished in the app itself. For example, if you can identify the user, you can gate access to certain tabs or features of your app. Many popular app hosting options for R and Python data science apps pass the username into the app as an environment variable.

Sometimes you might also have a column in a table that allows you to filter by who’s allowed to view, so you might just be able to filter to allowable rows in your database query.
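A sketch of that row-level filter, using Python’s built-in `sqlite3` module: the `sales` table, the `allowed_user` column, and the usernames are all invented for illustration, and `rows_for_user` is a hypothetical helper.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, amount REAL, allowed_user TEXT)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?, ?)",
    [("east", 100.0, "alice"), ("west", 250.0, "bob"), ("east", 50.0, "alice")],
)


def rows_for_user(conn, username):
    """Return only the rows this user is allowed to see."""
    return conn.execute(
        "SELECT region, amount FROM sales WHERE allowed_user = ? ORDER BY amount",
        (username,),
    ).fetchall()


# In a hosted app the username would come from the hosting platform.
alice_rows = rows_for_user(conn, "alice")
print(alice_rows)
```

This only works if the app can trust the username it is given, so it depends on the hosting platform handling authentication.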

Sometimes though, you’ll actually have to pass database credentials along to the database, which will do the authorization for you. This is nice, because then all you have to do is pass along the correct credential, but it’s also a pain because you have to somehow get the credential and send it along with the query.

<TODO: Image of how a kinit/JWT flow work>

Most commonly, Kerberos tickets or JSON web tokens (JWTs) are used for this task. Usually your options for this depend on the database itself, and the ticket/JWT granting process will likely have to be handled by the database admin.

3.6.1 Securely Managing Credentials

The single most important thing you can do to secure your credentials for your outside services is to avoid ever putting credentials in plaintext. The simplest alternative is to do a lookup from environment variables in either R or Python. There are many more secure things you can do, but it’s pretty trivial to put Sys.getenv("my_db_password") into an app rather than actually typing the value. In that case, you’d set the variable in an .Rprofile or .Renviron file.

Similarly, in Python, you can get and set environment variables using the os module. You can set one with os.environ['DB_PASSWORD'] = 'my-pass' and read it with os.getenv('DB_PASSWORD'), os.environ.get('DB_PASSWORD'), or os.environ['DB_PASSWORD']. If you want to set environment variables from a file, people in Python generally use the python-dotenv package along with a .env file.

You should not commit these files to git, but should manually move them across environments, so they never appear anywhere centrally accessible.
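In Python, a small helper can make a missing credential fail loudly instead of silently passing an empty password along. `get_required_env` is a hypothetical helper name, and the variable is set in-process below only so that the sketch runs on its own.

```python
import os


def get_required_env(name):
    """Return the value of an environment variable, failing loudly if unset."""
    value = os.environ.get(name)
    if not value:
        raise RuntimeError(f"Environment variable {name} is not set")
    return value


# For this sketch only, set the variable in-process so the example runs.
# In real use it would be set outside the code, for example by a .env file
# or by the hosting platform, and never typed into the source.
os.environ["DB_PASSWORD"] = "example-only-password"

db_password = get_required_env("DB_PASSWORD")
print("password loaded:", bool(db_password))
```

Note that the example prints only whether the password was loaded, never the value itself, which is a good habit for logs as well.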

In some organizations, this will still not be perceived as secure enough, because the credentials are not encrypted at rest. Any of the aforementioned files are just plain text files – so if someone unauthorized were to get access to your machine, they’d be able to grab all of the goodies in your .Rprofile and use them themselves.

Some hosting software, like RStudio Connect, can take care of this problem, as they store your environment variables inside the software in an encrypted fashion and inject them into the R runtime.

There are a number of more secure alternatives – but they generally require a little more work.

There are packages in both R and Python called keyring that allow you to use the system keyring to securely store secrets and recall them at runtime. These can be good in a development environment, but run into trouble in a production environment because they generally rely on a user actually inputting a password for the system keyring.

One popular alternative is to use credentials pre-loaded on the system to enable using a ticket or token – often a Kerberos ticket or a JWT. This is generally quite doable, but often requires some system-level configuration.

<TODO: image of kerberos>

You may need to enable running as particular Linux users if you don’t want to do all of the authentication interactively in the browser. You usually cannot just recycle login tokens, because they are service authorization tokens, not token-granting tokens.1

3.7 Comprehension Questions

  1. What are the layers of a three-layer application architecture? What libraries could you use to implement a three-layer architecture in R or Python?
  2. What is the relationship between an R or Python function and an API?
  3. What are the different options for data storage formats for apps? What are the advantages and disadvantages of each? How does update frequency relate to choosing a data storage format?
  4. When should an app fetch all of the data up front? When is it better for the app to do live data fetches?
  5. What is a good way to create a non-performant database query?

3.8 Portfolio Exercise: Standing Up a Loosely Coupled App

Find a public data source you like that’s updated on a regular cadence. Some popular ones include weather, public transit, and air travel.

Create a job that uses R or Python to access the data and put it into a database (make sure this is compliant with the data use agreement).

Create an API in Plumber (R) or Flask or FastAPI (Python) that allows you to query the data – or maybe create a machine learning model that you can run against new data coming in.

Create an app that visualizes what’s coming out of the API using Dash or Streamlit (Python) or Shiny (either R or Python).


  1. I believe that it is theoretically possible to create a token that grants access to multiple services. For example, you could have a token that grants access to both RStudio Server and a database. I’ve never seen this implemented.