3  Data Pipeline Architecture

TODO: write a cheatsheet of requirements gathering questions

Data is like water. It flows through the systems you’ve designed. As a data scientist, you spend all day thinking about the water. You think about the relationships in it, the reality it reflects, and how to predict what’s going to happen next.

But there’s another important perspective, which is thinking about the pipeline carrying the water. This is how data engineers and IT/Admins think about data. The content of the data is irrelevant to them – but they care a lot that it’s flowing easily, safely, and with minimal turbulence.

That’s what this chapter is about.

This chapter has nothing to say on feature engineering, plot generation, or machine learning models. I’m going to assume that you’ve done your discovery and that you’re ready to put a reasonably stable data pipeline into production. What this chapter does have a lot to say about is how to architect and implement a data storage, processing, and access system that will allow you to reliably and securely control the data flowing into and through your data science project.

But we’ve already standardized on X

Great! You should probably use that.

If your organization has a standard database or data warehouse that they use, don’t reinvent the wheel.

Now, it might be the case that you still need to be creative for at least part of your data pipeline, most often because you don’t have write access or because your organization’s system is only for rectangular data and you have something that’s more “blobby”.

This chapter will help you evaluate the type of data you’re taking in and the kind of project you’re trying to create, so you can make smarter decisions about how to get the data into your project. Hopefully, by the end of this chapter, you’ll have a better sense of how to think about the data requirements for your project and how to design your data pipeline and your project to make everything easier.

3.1 The composition of a data pipeline

The right way to design a data pipeline is to begin at the end. You want to start by thinking about the output of your project. It might be charts, reports, visualizations, apps, and more. Most often, this layer uses either an interactive app library like Shiny, Dash, or Streamlit or a report-generation library like R Markdown, Quarto, or Voila. In some cases, the presentation phase is just an API for machine to machine communication – but that’s still relatively uncommon.

There are often several kinds of data that have to get prepared to go into the presentation layer.

The first kind is rectangular business data. That’s the kind of data that obviously belongs in a data.frame in R or a pandas.DataFrame in Python. It’s the data you’re actually doing data science on. Depending on your organization, you might get it as completely raw data you actually have to enter yourself. In other organizations, data engineering teams have created complete data pipelines that deliver highly-refined data right to your model training process.

Most organizations fall somewhere in the middle. There’s a central home for data – a data warehouse, data lake, or share drive for files - but it’s not fully ready to be presented, and you’ll have to do some processing to get there.

Then there are the other kinds of data that are the output of your analytics pipeline. Maybe you trained one or more machine learning models, derived descriptive statistics, or did causal inference. You also might’ve pulled out data specifically for the presentation layer during your pipeline.

These would be things like lists of valid categories for slicers and selectors, dates to include in the analysis, and metadata you’ll want to include like plot headers and prettier names for columns in your data.

Depending on the exact type of data, some of it might be pretty rectangular – something like a list of store locations and metadata about them or model coefficients and standard errors. Or it might be something really non-rectangular like a machine learning model.

If you take one thing away from this chapter, let it be this – you should separate your data pipeline from your presentation layer.

This may be a big change. Many data scientists just write a bunch of code, assuming that the data is just the data. It’s very common to see a Shiny app that mushes a bunch of data cleaning into the app itself. Or a Quarto report that does processing in the body of the report.

For some projects, this is a valid assumption and your data will never change. If so, you probably don’t need this chapter.

But for many data projects, data will change – perhaps even on a regular basis. In these cases, you’ll want to think carefully about how to link your data to your app. This chapter will review ways to write your project so that you can update and change out the data later without having to completely rewrite or rework your code.

3.2 Can you load all your data?

Just loading all of your data into memory is a good option until you’ve established that you can’t. Basically, you should try this first and see if it works. If it does, go no further. If it doesn’t – or you anticipate it won’t – you’ll want to choose a different answer.

You don’t have to figure out any clever way to access the data or how to make sure you’re getting all the data you need. You’ve got it all! “Real engineers” may scoff at this pattern, but don’t let their criticism dissuade you. If your data size is small and your app performance is good enough, just read in all of your data and operate on it live. Don’t complicate things more than they need to be. After all, as Donald Knuth put it, “premature optimization is the root of all evil.”

So the main question is whether your data is too big to load. Preprocessing will generally happen on a schedule, so as long as your data fits into memory and processing it doesn’t kill your server resources, you’re good!

In the presentation layer, the question is a little more complicated. It’s quick to jump to the size of the raw data you’re using and assume you can’t load all the data in, but that’s often a completely irrelevant metric. You’re starting backwards if you start from the size of the raw data. Instead, you should figure out how big the data is that you actually need in the presentation layer. In a lot of cases, you can preprocess the input data into a small dataset you can just read wholly into the presentation layer.
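A quick way to check, once you have a candidate dataset loaded in your session (the object name here is hypothetical):

# size of the object you'd actually ship to the presentation layer
format(object.size(sales), units = "MB")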

To make this a little more concrete, let’s imagine you work for a large retailer and are responsible for creating a dashboard that will allow people to visualize the last week’s worth of sales for a variety of products. With this vague prompt, you could end up needing to load a huge amount of data into your project – or very little at all.

One of the most important questions is how much you can cache before someone even opens the app. For example, if you need to provide total weekly sales at the department level, that’s probably just a few data points. And even if you need to go back a long ways, it’s just a few hundred data points – load all that in!

But if you start needing to slice and dice the data in a lot of directions, then the data size starts multiplying, and you may have to include the entire raw data set in the report. For example, if you need to include weekly sales at the department level, then the size of your data is the \(\text{number of weeks} \times \text{number of departments}\). If you need to include more dimensions – say you need to add geographies – then your data size multiplies by the number of geographies.
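To put hypothetical numbers on it: two years of weekly data across 25 departments is \(104 \times 25 = 2{,}600\) rows – trivial to load. Add 50 geographies and you’re at \(104 \times 25 \times 50 = 130{,}000\) rows, and each additional dimension multiplies the total again.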

Now that you’ve figured out how big the data is, you need to determine whether it’s too big. In this case, too big is really a matter of performance.

For some apps, you want the data to be snappy throughout runtime, but it’s ok to have a lengthy startup process (perhaps because it can happen before the user actually arrives). In that case, it’s probably still ok to use a flat file and read it all in on startup.

One crucial question for your project is how much wait time is acceptable for people wanting to see the project – and when is that waiting ok? For example, if people need to be able to make selections and see the results in realtime, then you probably need a snappy database, or all the data preloaded into memory when they show up.

But maybe you have high performance requirements or a lot of data. Or maybe you know your data will grow over time. In that case, you’re going to want to choose some other system for accessing your data.

If you can just load in all your data, the question then becomes how. One common way to load all your data is just to read in a whole database table. If that’s an option for you, it can be really nice.

If you don’t have a database handy, the most common way to save and read data is with a flat file. A flat file is what happens when you just save an active data frame somewhere. If you have relatively small data in a flat file or a database table, just moving flat files around is by far the simplest way to manage your data pipeline.
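As a minimal sketch – the table and file names here are hypothetical – reading everything in one go looks like this:

# read an entire database table into memory
# (assumes `con` is a database connection, as discussed in Section 3.5.1)
sales <- DBI::dbReadTable(con, "weekly_sales")

# or, with a flat file
sales <- read.csv("data/weekly_sales.csv")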

3.2.1 Choose your flat file format

Flat file storage describes writing the data out into a simple file. The canonical example of a flat file is a csv file. However, there are also other formats that may make data storage smaller through compression, make reads faster, and/or allow you to save arbitrary objects rather than just rectangular data. In R, rds is the generic binary format; in Python, it’s pickle.

Flat files can be moved around just like any other file on your computer. You can put them on your computer and share them through tools like Dropbox, Google Drive, scp, and more.

The biggest disadvantages of flat file data storage stem from their indivisibility. In order to use a flat file in R or Python, you’ll need to load the whole thing into your session. For small data files, this isn’t a big deal. But a large file can take a long time to read – especially if it has to travel over a network or be loaded into an app at startup – and that’s a wait you may not want. There’s also generally no way to version the data or update just part of it, so if you’re saving archival versions, they can take up a lot of space very quickly.


There are a few different options for types of flat files. The most common is a comma separated value (csv) file, which is just a literal text file of the values in your data with commas as separators.1 You could open it in a text editor and read it if you wanted to.

The advantage of csvs is that they’re completely ubiquitous. Basically every programming language has some way to read in a csv file and work with it.

On the downside, csvs are completely uncompressed and they can only hold rectangular data. That makes them quite large relative to other sorts of files and slow to read and write. So if you’re trying to save a machine learning model, a csv doesn’t make any sense.

Both R and Python have language-specific flat-file types – rds in R and pickle in Python. Relative to csvs, these are nice because they usually include some amount of compression. They also can hold non-rectangular data, which can be great if you want to save a machine learning model or some other sort of object. The other advantage of rds and pickle is that they preserve data types that might not come through when saving to csv. For example, if you’re saving dates, an rds or pickle file will save the data frame so that the column comes back as a date. If you’re reading from a csv, you’d probably need to read the dates in as character strings and cast them back to dates, which can be annoying and – depending on the size of your data – time-consuming.
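Here’s a small illustration of that difference (the file names are just placeholders):

df <- data.frame(
  sale_date = as.Date(c("2023-01-01", "2023-01-08")),
  revenue = c(100, 250)
)

# rds round-trips the Date class
saveRDS(df, "sales.rds")
class(readRDS("sales.rds")$sale_date)
#> [1] "Date"

# csv stores the dates as plain text, so they come back as character
# (in recent versions of R, which no longer default to factors)
write.csv(df, "sales.csv", row.names = FALSE)
class(read.csv("sales.csv")$sale_date)
#> [1] "character"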

3.3 How and where to save a flat file or blob

So let’s say you’ve decided that you’re ok to just load all the data, and you don’t have a database to store it in.

There’s one place you usually should not save your data, and that’s inside your presentation bundle. The bundle is the set of code and assets that make up your presentation layer. For example, let’s say you’re building a simple Dash app that’s just a single file. You could create a project structure like this.

my-project/
├─ my_app.py
├─ data/
│  ├─ my_data.csv
│  ├─ my_model.pkl

You could do this, but you shouldn’t.

Note

There’s one situation in which it’s ok to just put the data inside the project, and that’s when your data will be updated at the same cadence as the app or report itself. This is pretty rare for an in-production app, but sometimes there are cases where you’re just sharing an existing dataset in the form of a report/app.

This is convenient as a first cut at a project, but it can make it really difficult to update the data once your project is in production. You have to keep track of where the project is on your deployment file system, reach in, and update it.

Instead, you want to decouple the data location from the project location.

There are a few ways you can do this. The most basic is just to put the data somewhere on your file system that isn’t inside the app bundle. This is, again, easy – just pick a different directory and you’re good.
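For example – the environment variable name and paths here are made up – you might point the project at an external data directory:

# the data lives outside the app bundle, e.g. in /data/my-project/
data_dir <- Sys.getenv("PROJECT_DATA_DIR", "/data/my-project")
sales <- read.csv(file.path(data_dir, "my_data.csv"))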

But when it comes to deployment this can also be complicated. If you’re writing your app and deploying it on the same server, then you can access the same directory. If not, you’ll need to worry about how to make sure that directory is also accessible on the server where you’re deploying your project.

Additionally, if you just put the file somewhere on the file system, you’ll need to control access using Linux file permissions. That’s not the end of the world, but controlling Linux file permissions is generally harder than controlling other sorts of access.

If you’re not going to store the flat file on the filesystem and you’re in the cloud, the most common option for where it can go is in blob storage. Blob storage is the term for storage where you store and recall things by name.2 Each cloud provider has blob storage – AWS’s is s3 (short for simple storage service), Azure has Azure Blob Storage, and GCP has Google Cloud Storage.

The nice thing about blob storage is that it can be accessed from anywhere that has access to the cloud. You can also control access using standard cloud identity management tooling, so you could control who has access using individual credentials or could just say that any request for a blob coming from a particular server would be valid.

There are packages in both R and Python for interacting with AWS that are very commonly used for getting access to s3 – Boto3 in Python, and paws in R.
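As a rough sketch with paws – the bucket and key names are placeholders, and this assumes AWS credentials are already available in your environment:

s3 <- paws::s3()
obj <- s3$get_object(Bucket = "my-bucket", Key = "data/my_data.csv")

# the object body comes back as raw bytes; parse it as a csv
sales <- read.csv(text = rawToChar(obj$Body))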

There’s also the popular pins package in both R and Python that basically wraps using blob storage into neater code. It can use a variety of storage backends, including cloud blob storage, networked or cloud drives like Dropbox, Microsoft365 sites, and Posit Connect.
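A sketch of the pins workflow, with a hypothetical bucket and pin name:

library(pins)

board <- board_s3("my-project-data")   # could also be board_folder(), etc.

# write from your data pipeline...
pin_write(board, sales, "weekly-sales")

# ...and read from the presentation layer
sales <- pin_read(board, "weekly-sales")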

Lastly, a Google Sheet can be a great temporary way to save and recall a flat file. While this isn’t the most robust option from an engineering perspective, it can often be a good first step while you’re still figuring out the right answer for your pipeline. Its primary engineering weakness – that it’s editable by someone who logs in – can also be an asset if you need someone to edit the data. If you have to get something up and running right away, storing the data in a Google Sheet can tide you over while you figure out the long-term home for your data.
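If you go that route, the googlesheets4 package makes reading it a one-liner (the sheet URL below is a placeholder):

# for a sheet shared as "anyone with the link can view"
googlesheets4::gs4_deauth()
sales <- googlesheets4::read_sheet("https://docs.google.com/spreadsheets/d/<sheet-id>")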

3.4 What to do when you can’t load all the data

There are some scales at which you literally cannot hold all your data in memory. It also may be the case that your data isn’t literally too big to fit into memory, but loading it all is impractical – usually because it would make app startup times unacceptably long.

In this case, you’ll need to store your data somewhere else and run live queries against the data as you’re using it.

If you are using a database, you’ll want to be careful about how you construct your queries to make sure they perform well. The main way to think about this is whether your queries will be eager or lazy.

In an eager project, you’ll pull basically all of the data as the project starts up, while a lazy project will pull data only as it is needed.

<TODO: Diagram of eager vs lazy data pulling>

Making your project eager is usually much simpler – you just read in all the data at the beginning. This is often a good first cut at writing an app, as you’re not sure exactly what requirements your project has. For relatively small datasets, this is often good enough.

If it seems like your project is starting up slowly – or your data’s too big to all pull in, you may want to pull data more lazily.
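As a sketch of the difference in a Shiny app – the table, column, and input names are hypothetical:

# eager: pull everything once when the app starts
sales <- DBI::dbGetQuery(con, "SELECT * FROM weekly_sales")

# lazy: inside the server function, pull only what the current user asks for
# (the '?' placeholder syntax varies by database driver)
filtered_sales <- shiny::reactive({
  DBI::dbGetQuery(
    con,
    "SELECT * FROM weekly_sales WHERE department = ?",
    params = list(input$department)
  )
})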

Tip

Before you start converting queries to speed up your app, it’s always worthwhile to profile your project and actually check that the data pulling is the slow step. I’ve often been wrong in my intuitions about what the slow step of the project is.

There’s nothing more annoying than spending hours refactoring your project to pull data more lazily only to realize that pulling the data was never the slow step to begin with.

It’s also worth considering how to make your queries perform better, regardless of when they occur in your code. You want to pull the minimum amount of data possible, so making data less granular, pulling in a smaller window of data, or pre-computing summaries is great when possible (though again, it’s worth profiling before you take on a lot of work that might result in minimal performance improvements).

Once you’ve decided whether to make your project eager or lazy, you can think about whether to make the query eager or lazy. In most cases, when you’re working with a database, the slowest part of the process is actually pulling the data. That means that it’s generally worth it to be lazy with your query. And if you’re using dplyr from R, being eager vs lazy is simply a matter of where in the chain you put the collect statement.

So you’re better off sending a query to the database, letting the database do a bunch of computations, and pulling a small results set back, rather than pulling in a whole data set and doing computations in R or Python.
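With dplyr and dbplyr, that looks something like the sketch below (the table and column names are hypothetical) – everything before collect() is translated to SQL and runs in the database:

library(dplyr)

sales_tbl <- tbl(con, "weekly_sales")   # a lazy reference; no data pulled yet

dept_summary <- sales_tbl |>
  group_by(department) |>
  summarize(total_sales = sum(sales, na.rm = TRUE)) |>
  collect()                             # only the summary rows come back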

3.5 Query-able storage

There are basically two choices when you need to do live queries – use an external system to store and access your data or use an on-disk storage system.

An external system – a database or Spark cluster are the most common – is the classic choice. The advantage of an external system is that you’ve got an external server that does all the computation. You ask for some computation, it goes off to this external system, and then you get back results you can use.

Another advantage is that making this system accessible is as simple as getting credentials to it from the group that controls it at your organization.

But the downside is that (hopefully) another organization controls it. Being a database administrator (DBA) is yet another real job. Depending on your organization, having someone else control the cluster can be a blessing – it’s just there! It works! Or it can be a curse – it takes weeks or months to get access! You can’t get anywhere to write to! It really depends on your organization.

The other option, which has been rising in popularity for medium-sized data, is on-disk storage that is query-able in place. DuckDB is the most popular of these tools. The basic idea is that you store your data on disk and run a process only when you need it that queries the data and loads just what you need into memory.

DuckDB is most powerful when backed by Apache Parquet or Arrow – these relatively new data formats save your data in a clever way that makes it really quick to query without loading it all into memory.

I can’t stress enough how cool these tools are – all of a sudden you don’t really need a database for data well into the tens of gigabytes or more.
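For example, here’s a sketch of querying a parquet file directly with DuckDB (the file path is a placeholder):

library(DBI)
library(duckdb)

con <- dbConnect(duckdb())

# DuckDB scans the parquet file on disk and only materializes the result
dept_sales <- dbGetQuery(con, "
  SELECT department, SUM(sales) AS total_sales
  FROM 'data/weekly_sales.parquet'
  GROUP BY department
")

dbDisconnect(con, shutdown = TRUE)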

If you have even more data than that, you need a big data tool.

These days Spark is more-or-less the standard big data tool for analytics teams. There are two ways you can use Spark. The first is just to treat it as a glorified database and access the data inside using SQL. That can work great if you just have big data and you need to access it.

But Spark is far more powerful than that – it can also do a number of other tasks, including certain kinds of machine learning inside the Spark cluster. In that case, you should use pyspark or sparklyr to do all kinds of cool Spark stuff. If you’re just getting started, I strongly recommend the book Mastering Spark with R (free online, like this book!). Even if you’re using pyspark, the conceptual introduction is fantastic.
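A minimal sparklyr sketch – the master address and table name are hypothetical, and it assumes a table already registered in the cluster:

library(sparklyr)
library(dplyr)

sc <- spark_connect(master = "local")   # or your cluster's master URL

sales_tbl <- tbl(sc, "weekly_sales")

sales_tbl |>
  group_by(department) |>
  summarize(total_sales = sum(sales, na.rm = TRUE)) |>
  collect()

spark_disconnect(sc)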

3.5.1 How to connect to databases?

In both R and Python, there are two general ways to connect to a database.

The first option is to use a direct connector to connect to the database. These connectors are most often thin R or Python wrappers around a bunch of C++ code. If one exists for the database you’re using, it’s generally the most optimized for your database and you should probably use it.

The second option is to connect to the database using a system-level driver and then connect to that database from R or Python. Many organizations like these because IT/Admins can configure them on behalf of users and can be agnostic about whether users are using them from R, Python, or something else entirely.

<TODO: image of direct connection vs through driver>

In this case, you’ll have to get the driver from your database provider, configure it on the system, and connect to it from R or Python. There are two main standards for system drivers – Java Database Connectivity (JDBC) and Open Database Connectivity (ODBC).

In general, I recommend using ODBC over JDBC if available. If you’re using JDBC, you have to deal with configuring Java in addition to your driver. Anyone who’s ever had to configure the rJava package will share horror stories. Enough said.

In R, I strongly recommend using an ODBC driver with the odbc and DBI packages to make the connection. There are other ODBC packages, and there are JDBC drivers available, but odbc and DBI are very actively maintained by Posit.
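A connection through a system ODBC driver looks something like this – the driver name, host, and database here are placeholders that would match whatever your IT/Admin configured:

con <- DBI::dbConnect(
  odbc::odbc(),
  Driver   = "PostgreSQL",          # name of the installed system driver
  Server   = "my-db.example.com",
  Database = "my-db-name",
  Port     = 5432,
  UID      = Sys.getenv("DB_USER"),
  PWD      = Sys.getenv("DB_PW")
)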

In Python, pyodbc is the main package for using ODBC connections.

3.5.2 Data authorization is key in production

Who is authorized to see which data is a question you probably don’t think about much as you’re puttering around inside RStudio or a Jupyter Notebook. But when you take an app to production, it becomes a crucial question.

The best and easiest case here is that everyone who views the app has the same permissions to see the data. In that case, you can just allow the app access to the data, and you can check whether someone is authorized to view the app as a whole, rather than at the data access layer.

In some cases, you might need to provide differential data access to different users. Sometimes this can be accomplished in the app itself. For example, if you can identify the user, you can gate access to certain tabs or features of your app. Many popular app hosting options for R and Python data science apps pass the username into the app as an environment variable.
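For example, here’s a sketch that assumes your hosting platform exposes the viewer’s username in an environment variable – the variable name and user list are made up, so check your platform’s documentation for the real mechanism:

current_user <- Sys.getenv("APP_USER")   # hypothetical variable name
admin_users <- c("alice", "bob")

if (current_user %in% admin_users) {
  # render the admin-only tab
}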

Sometimes you might also have a column in a table that allows you to filter by who’s allowed to view, so you might just be able to filter to allowable rows in your database query.

Sometimes, though, you’ll actually have to pass the user’s credentials along to the database, which will do the authorization for you. This is nice because all you have to do is pass along the correct credential, but it’s also a pain because you have to somehow get that credential and send it along with the query.

3.5.3 Securely Managing Credentials

The single most important thing you can do to secure your credentials for your outside services is to avoid ever putting credentials in your code. That means that the actual value of your username or password should never actually appear in your code.

There are a few alternatives for how to make your credentials safe.

The first, and simplest option, is just to set them into an environment variable. If you need a refresher on how to set or use environment variables, see Section 1.4.

In R, it’s also very common for organizations to create internal packages that provide connections to databases. The right way to do this is to create a function that returns a database connection object that you can use in conjunction with the DBI or dplyr packages.

Here’s an example of what that might look like if you were using a Postgres database:

#' Return a database connection
#'
#' @param user username, character, defaults to value of DB_USER
#' @param pw password, character, defaults to value of DB_PW
#' @param ... other arguments passed to DBI::dbConnect()
#' @param driver driver, defaults to RPostgres::Postgres()
#'
#' @return DBI connection
#' @export
#'
#' @examples
#' my_db_con()
my_db_con <- function(
    user = Sys.getenv("DB_USER"),
    pw = Sys.getenv("DB_PW"),
    ...,
    driver = RPostgres::Postgres()
) {
  DBI::dbConnect(
    driver,
    dbname = "my-db-name",
    host = "my-db.example.com",
    port = 5432, # default for Postgres
    user = user,
    password = pw,
    ...          # forward any extra arguments to dbConnect()
  )
}

In some organizations, using environment variables may be fine for configuration, but will not be perceived as secure enough, because the credentials are not encrypted at rest.

Information that is encrypted at rest is cryptographically secured against unauthorized access. There are at least two different things people can mean. Whole-disk encryption means that your drive is encrypted so that it can only be accessed by your machine. That means that if someone were to take your hard drive to another computer, your data wouldn’t be usable. This has become more-or-less standard. Both Mac and Windows have built-in utilities to do full-disk encryption.

There’s another level of encrypted at rest that some organizations want, which is avoiding storing credentials in plaintext at all. For example, an .Rprofile file is just a simple text file. If an unauthorized user were to get access to my laptop, they could steal the credentials from my .Rprofile and use them themselves. Some organizations have prohibitions against ever storing credentials in plaintext. In these cases, the credentials must be stored in a cryptographically secure way and are only decrypted when they’re actually used. A password manager, or passwords saved in your browser, works this way.

There are a number of more secure alternatives – but they generally require a little more work.

There are packages in both R and Python called keyring that allow you to use the system keyring to securely store environment variables and recall them at runtime. These can be good in a development environment, but run into trouble in a production environment because they generally rely on a user actually inputting a password for the system keyring.
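A quick sketch of the keyring workflow in R – the service name and username are arbitrary:

# store the password once; this prompts you for it interactively
keyring::key_set("my-database", username = "alex")

# recall it at runtime without it ever appearing in your code
pw <- keyring::key_get("my-database", username = "alex")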

Configuring a production system that never exposes credentials in plaintext is a nontrivial undertaking. Generally, you’ll need to work with an IT/Admin to pass through the identity of the person trying to access the system to a service that will grant them a token or ticket. This is something you’ll need help from an IT/Admin to accomplish and there’s more on how to discuss the issue in Chapter 18.

Note

Some hosting software, like Posit Connect, can take care of this problem, as they store your environment variables inside the software in an encrypted fashion and inject them into the runtime.

You still have the issue of how you’re going to store the environment variables in your development environment, where a .env or .Rprofile file is still the most common approach.

If you are using ODBC, you or your admin may want to consider configuring a Data Source Name (DSN). A DSN provides connection details for an ODBC connection and can be configured at a system or a user level. This means that you can put connection details like the hostname, port, and database name into a simple text file right on the system, and users don’t need to know anything beyond their username and password to log in.
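With a DSN in place, the connection code gets even simpler – here assuming your admin has configured a DSN called MyWarehouse (a made-up name):

con <- DBI::dbConnect(
  odbc::odbc(),
  dsn = "MyWarehouse",
  uid = Sys.getenv("DB_USER"),
  pwd = Sys.getenv("DB_PW")
)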

3.6 Comprehension Questions

  1. What are the different options for data storage formats for apps? What are the advantages and disadvantages of each?
  2. When should an app fetch all of the data up front? When is it better for the app to do live queries?
  3. What are the different flat file types and how can they be stored?
  4. What is a good way to create a database query that performs badly? (Hint: what’s the slowest part of the query)

3.7 Portfolio Exercise: Standing Up a Loosely Coupled App

Find a public data source you like that’s updated on a regular cadence. Some popular ones include weather, public transit, and air travel.

Create a job that uses R or Python to access the data and create a job to put it into a database (make sure this is compliant with the data use agreement).

Create an API in Plumber (R) or Flask or FastAPI (Python) that allows you to query the data – or maybe create a machine learning model that you can run against new data coming in.

Create an app that visualizes what’s coming out of the API using Dash or Streamlit (Python) or Shiny (either R or Python).


  1. There are other separators you can use. Tab separated value files (tsv) are something you’ll see occasionally.

  2. The term blob is great to describe the thing you’re saving in blob storage, but it’s actually an abbreviation for binary large object. I think it’s very clever.