3 Using databases and data APIs
Your job as a data scientist is to sift through a massive pile of data to extract nuggets of real information – and then use that information. Because you sit at the end of a data collection process that’s largely outside your control, you need to meet the data where it lives.
In most cases, that will be in a database or a data API. This chapter is about the mechanics of working with those data sources, i.e. how to access the data and how to keep those connections secure.
3.1 Accessing and using databases
Databases are defined by their queryable interface, usually structured query language (SQL).
There are many kinds of databases, and choosing the right one for your project is beyond the scope of this book. One recommendation: open source PostgreSQL (Postgres) is a great place to start for most general-purpose data science tasks.
Any database connection starts by creating a connection object at the outset of your code. You’ll then use this object to send SQL queries, which you can write yourself or generate with a package like {sqlalchemy} in Python or {dplyr} in R.
For example, in Python you might write the following to connect to a Postgres database:

import psycopg2
con = psycopg2.connect()

In R, it might look like this:

con <- DBI::dbConnect(RPostgres::Postgres())
Python and R both have standard connection APIs that define operations like connecting and disconnecting, sending queries, and retrieving results.
In Python, packages for individual databases like {psycopg2} directly implement the API, which is why the example above calls the connect() function from the {psycopg2} package.
In R, the API is split into two parts. The {DBI} package (short for database interface) implements the actual connections. It works with a database driver package, which is the first argument to DBI::dbConnect(). Packages that implement the {DBI} interface are called DBI-compliant.
There are Python packages that don’t implement the connections API and there are non-DBI-compliant database packages in R. I’d recommend sticking with the standard route if possible.
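Either way, once you have a connection object, the workflow is the same: send a query and retrieve the results. Here’s a minimal Python sketch of that workflow; the host, database name, and penguins table are assumptions for illustration, and the credentials come from environment variables (more on those later in this chapter).

import os
import psycopg2

# Connect using credentials stored in environment variables
con = psycopg2.connect(
    host = "localhost",
    dbname = "my-db-name",
    user = os.getenv("DB_USER"),
    password = os.getenv("DB_PW")
)

# Send a SQL query and retrieve the results through a cursor
cur = con.cursor()
cur.execute("SELECT species, COUNT(*) FROM penguins GROUP BY species")
rows = cur.fetchall()    # a list of tuples

cur.close()
con.close()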
In many cases, there will be a Python or R package that directly implements your database driver. For example, when you’re connecting to a Postgres database, there are Postgres-specific connectors – {psycopg2} in Python and {RPostgres} in R. For Spark, you’ve got {pyspark} and {sparklyr}.
If a package exists for your database, you should probably prefer it. It’s likely to be faster and may provide database-specific functionality that generic options lack.
If there isn’t a database-specific package, you’ll need to use a generic system driver in concert with a Python or R package that can interface with system drivers.
While performance sometimes isn’t as good for system drivers, the tradeoff is that IT/Admins can pre-configure details of the connection in a data source name (DSN). If one is preconfigured for you, you don’t have to remember the database name, host, and port, or even the username and password if they’re shared.
For example, you might connect with something that looks like:
import pyodbc
con = pyodbc.connect("DSN=MY_DSN")
In R, it might look like this:
con <- DBI::dbConnect(odbc::odbc(), dsn = "MY_DSN")
System drivers come in two main varieties: Java Database Connectivity (JDBC) and Open Database Connectivity (ODBC).
In Python, {pyodbc} is the main package for using ODBC connections and {JayDeBeApi} is the main one for connecting via JDBC. In R, {odbc} is the best package for using system ODBC connections and {RJDBC} is the standard way to use JDBC.
If you’re using R and have the choice between JDBC and ODBC, I strongly recommend ODBC. JDBC requires an extra hop through Java and the {rJava} package, which is painful to configure.1
3.2 Providing credentials to data projects
Let’s imagine you’ve created a data science project that pulls data from a database. When you’re actively working on the project, it’s easy for you to provide credentials as needed to the database. But what happens when you deploy that project to production and you’re not sitting there to provide credentials?
In many organizations, you’ll be allowed to use your own data access permissions for the project and then to share the project with others in the company at your discretion. This situation is sometimes called discretionary access control (DAC).
In some more restrictive environments, you won’t have this luxury. The IT/Admin team may maintain control of permissions themselves or require that data access be more tightly governed.
In some cases, it will be acceptable to create or use a service account, which is a non-human account that exists to hold permissions for a project. You might want to use a service account to limit the project’s permissions to exactly what it needs or to be able to manage the project’s permissions independently of the humans involved.
In the most restrictive case, you’ll actually have to use the credentials of the person who’s viewing the content and pass those along. This last option is much harder than the other two.
If you have to use the viewer’s credentials for data access, you can write code to collect them from the viewer and pass them along. I don’t recommend this as you have to take responsibility for collecting those credentials and storing and using them responsibly.
In other cases, the project may be able to run as the viewer when it is accessing the database. The patterns for doing this are complicated and will require working with an IT/Admin. More on this topic in Chapter 16.
3.3 Connecting to APIs
Some data sources come in the form of an API.
It’s common to have Python or R packages that wrap APIs, so you just write Python or R code without needing to think about the API underneath. The usage of these patterns often looks similar to databases – you create and use a connection object that stores the connection details. If your API has a package like this, you should just use it.
If you’re consuming a private API at your organization, a helper package probably doesn’t exist or you may have to write it yourself.
There’s increasingly good tooling to auto-generate packages based on API documentation, so you may never have to write an API wrapper package by hand. It’s still helpful to understand how it works.
If you find yourself having to call an API directly, you can use the {requests} package in Python or {httr2} in R.
These packages provide idiomatic R and Python ways to call APIs. It’s worth understanding that they’re purely syntactic sugar. There’s nothing special about calling an API from inside Python or R versus using the command line and you can go back and forth as you please. It is sometimes helpful to try to replicate Python or R API calls without the language wrapper for debugging reasons.
3.3.1 What’s in an API?
APIs are the standard way for two computer systems to communicate with each other. It’s an extremely general term that describes machine-to-machine communication. For our purposes, we’re talking about HTTP-based RESTful APIs.
HTTP operates on a request-response model, so when you use an API, you send a request to the API and it sends a response back.
The best way to learn about a new API is to read the documentation, which will include a lot of details about how to use it. Let’s go through some of the most salient ones.
3.3.2 API Endpoints and Paths
Each request to an API is directed to a specific endpoint. An API can have many endpoints, each of which you can think of like a function in a package. Each endpoint lives on a path, which is where you find that particular endpoint.
For example, if you did the lab in Chapter 2 and used {vetiver} to create an API for serving the penguin mass model, you found your API at http://localhost:8080. By default, you went to the root path at / and found the API documentation there.
As you scrolled the documentation, there were two endpoints – /ping and /predict. You can read the definitions to see what parameters you could send to them and what you’d get back. Those paths are relative to the root, so you could access /ping at http://localhost:8080/ping.
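For instance, hitting the /ping endpoint from Python is just a request to that path. This is a minimal sketch that assumes the API from the Chapter 2 lab is still running locally on port 8080.

import requests

# /ping lives at the /ping path relative to the API root
resp = requests.get("http://localhost:8080/ping")
print(resp.status_code)  # 200 means the API is up and answering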
3.3.3 HTTP verbs
When you make a request over HTTP, you are asking a server to do something. The http verb, also known as the request method, describes the type of operation you’re asking for. Each endpoint has one or more verbs that it knows how to use.
If you look at the penguin mass API, you’ll see that /ping is a GET endpoint and /predict is a POST. This isn’t a coincidence. I’d approximate that 95% of the API endpoints you’ll use as a data scientist are GET and POST, which respectively fetch information from the server and provide information to the server.
To round out the basic HTTP verbs you might see, PUT and PATCH change or update something and DELETE (you guessed it) deletes something. There are also more esoteric verbs you’ll probably never see.
3.3.4 Request parameters and bodies
Like a function in a package, each endpoint accepts specific arguments in a required format. Again, like a function, some arguments may be optional while others may be required.
For GET requests, the arguments are specified via query parameters embedded in the URL after a ?. When you see a URL in your browser that looks like ?first_name=alex&last_name=gold, those are query parameters.
For POST, PUT, and PATCH requests, arguments are provided in a body, which is usually formatted as JSON.2 Both {httr2} and {requests} have built-in functionality for converting standard Python and R data types to their JSON equivalents. APIs often require their arguments to be nested in particular ways; you can experiment with how your objects get converted to JSON using {json} in Python and {jsonlite} in R to figure out how to get the nesting right.
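Here’s roughly how query parameters and JSON bodies look with {requests} in Python. The endpoint and field names below are made up for illustration; your API’s documentation defines the real ones.

import json
import requests

# GET: arguments become query parameters after the ? in the URL
resp = requests.get(
    "https://api.example.com/people",  # hypothetical endpoint
    params = {"first_name": "alex", "last_name": "gold"}
)
# requests builds .../people?first_name=alex&last_name=gold

# POST: arguments go in a JSON body; json= converts the dict for you
resp = requests.post(
    "https://api.example.com/people",
    json = {"name": {"first": "alex", "last": "gold"}}
)

# To check how an object will be nested once converted to JSON:
print(json.dumps({"name": {"first": "alex", "last": "gold"}}, indent = 2))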
3.3.5 (Auth) Headers
Most APIs require authentication. The most common forms of authentication are a username and password combination, an API key, or an OAuth token.
API keys and OAuth tokens are often associated with particular scopes. Scopes are permissions to do particular things. For example, an API key might be scoped to have GET access to a given endpoint but not POST access.
Regardless of your authentication type, it will be provided in a header of your API call. Your API documentation will tell you how to provide your username and password, API key, or token in a header. Both {requests} and {httr2} provide easy helpers for adding authentication headers, as well as more general ways to set headers if needed.
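For example, providing an API key from an environment variable in a header with {requests} might look like this. The header name and Bearer scheme here are assumptions; check your API’s documentation for the exact format it expects.

import os
import requests

# Send an API key in an Authorization header; many APIs use a Bearer scheme,
# but the exact header name and format depend on the API.
resp = requests.get(
    "https://api.example.com/data",  # hypothetical endpoint
    headers = {"Authorization": f"Bearer {os.getenv('API_KEY')}"}
)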
Aside from authentication, headers are also used for a variety of different metadata like the type of machine that is sending the request and cookies that are set. You’ll rarely interact directly with these.
3.3.6 Request Status Codes
The first thing you’ll consult when you get a result back is the status code. Status codes indicate what happened with your request to the server. You always hope to see a 200 code, which indicates a successful response.
There are also two common classes of error codes. 4xx codes indicate that there’s a problem with your request and the API couldn’t understand what you were asking. 5xx codes indicate that your request was fine, but some sort of error happened while the server was processing it.
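In {requests}, the status code sits right on the response object. A minimal sketch, using a hypothetical endpoint:

import requests

resp = requests.get("https://api.example.com/data")  # hypothetical endpoint
print(resp.status_code)    # e.g., 200 on success
resp.raise_for_status()    # raises an exception for 4xx and 5xx responses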
See Appendix D for a table of common HTTP codes.
3.3.7 Response Bodies
The contents of the response are in the body. You’ll need to turn the body into a Python or R object you can work with.
Most often, bodies are in JSON and you’ll decode them with {json} or {jsonlite}. Depending on the API, you may have the option to request something other than JSON as the return format. I rarely need anything other than JSON.
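With {requests}, decoding a JSON body is a single call. A sketch against a hypothetical endpoint:

import json
import requests

resp = requests.get("https://api.example.com/data")  # hypothetical endpoint

body = resp.json()            # decode the JSON body into Python dicts/lists
body = json.loads(resp.text)  # equivalent, using the json module directly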
3.3.8 Common API patterns
Here are a few common API patterns that are good to know about:
- Pagination – many data-feed APIs implement pagination. A paginated API returns only a certain number of results at a time to keep data sizes modest. Check the API documentation to learn how to get all the pages of your request (see the sketch after this list).
- Job APIs – HTTP is designed for relatively quick request-response cycles. If your API kicks off a long-running job, it’s rare to wait until the job is done to get a response. Instead, a common pattern is for the API to return a job-id, which you can use to poll a job-status endpoint to check how things are going and eventually claim your result.
- Multiple verbs – a single endpoint often accepts multiple verbs, for example, a GET and a POST at the same endpoint for getting and setting the data that endpoint stores.
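Here’s a rough sketch of walking a paginated API in Python. The page parameter and the results/next_page fields are hypothetical; check your API’s documentation for the real names.

import requests

url = "https://api.example.com/records"  # hypothetical paginated endpoint
page = 1
records = []

while True:
    resp = requests.get(url, params = {"page": page})
    resp.raise_for_status()
    body = resp.json()
    records.extend(body["results"])      # accumulate this page's results
    if not body.get("next_page"):        # stop when the API reports no next page
        break
    page += 1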
3.4 Environment variables to secure data connections
When you take an app to production, authenticating to your data source while keeping your secrets secure is crucial.
The single most important thing you can do to secure your credentials is to avoid ever putting credentials in your code. Your username and password or API key should never appear in your code.
The simplest way to provide credentials without the values appearing in your code is with an environment variable. Environment variables are set before your code starts – sometimes from completely outside Python or R.
This section assumes you can use a username and password or API key to connect to your data source. That may not be true, depending on your organization. See Chapter 16 for how to handle data connections if you can’t directly connect with a username and password.
3.4.1 Getting environment variables
The power of using an environment variable is that you reference them by name. Using only a name makes it easy to swap out the value in production versus other environments and also means it’s totally safe to share code since all it does is reveal that an environment variable exists.
It is convention to make environment variable names in all caps with words separated by underscores. The values are always simple character values, though these can be cast to some other type inside R or Python.
In Python, you can read environment variables from the os.environ dictionary or by using os.getenv("<VAR_NAME>"). In R, you can get environment variables with Sys.getenv("<VAR_NAME>").
It’s common to provide environment variables directly to functions as arguments, including as defaults, though you can also put the values in normal Python or R variables and use them from there.
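For example, in Python it might look like this. DB_USER and DB_PORT are example names, not variables your environment sets automatically.

import os

# Values always come back as strings; cast them if you need another type
db_user = os.environ["DB_USER"]               # raises KeyError if unset
db_port = int(os.getenv("DB_PORT", "5432"))   # getenv lets you supply a default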
3.4.2 Setting environment variables
The most common way to set environment variables in a development environment is to load secrets from a text file. In Python, environment variables are usually set by reading a .env file into your Python session. The {python-dotenv} package is a good choice for doing this.
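A minimal sketch of that pattern; the variable names in the .env file are just examples.

# Contents of .env -- keep this file out of version control:
#   DB_USER=alex
#   DB_PW=super-secret

import os
from dotenv import load_dotenv

load_dotenv()                  # reads .env and sets those environment variables
print(os.getenv("DB_USER"))    # now available by name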
R automatically reads the .Renviron file as environment variables and also sources the .Rprofile file, where you can set environment variables with Sys.setenv(). Personally, I prefer putting everything in .Rprofile so I’m only using one file, but that’s not a universal opinion.
Some organizations don’t ever want credential files in plain text. After all, if someone steals a plain text secrets file, there’s nothing to stop the thief from using them.
There are packages in both R and Python called {keyring} that allow you to use the system keyring to securely store secrets and recall them at runtime.
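In Python, that looks roughly like this. The service and user names are made up; {keyring} stores the value in your operating system’s credential store rather than a plain text file.

import keyring

# Store the secret once, e.g., interactively from a console
keyring.set_password("my-db", "alex", "super-secret-pw")

# Recall it at runtime -- no plain text file involved
pw = keyring.get_password("my-db", "alex")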
Setting environment variables in production is a little harder.
Just moving your secrets from your code into a different file that you push to prod doesn’t solve the problem of putting secrets in code. And using {keyring} in a production environment is quite cumbersome.
Your production environment may provide environment management tools. For example, GitHub Actions and Posit Connect both provide you the ability to set secrets that aren’t visible to the users, but are accessible to the code at runtime in an environment variable.
Increasingly, organizations are using token-based authorization schemes that just exchange one cryptographically secure token for another, never relying on credentials at all. The tradeoff for the enhanced security is that they can be difficult to implement, likely requiring coordination with an IT/Admin to use technologies like Kerberos or OAuth. There’s more on how to do that in Chapter 16.
3.5 Data Connection Packages
It’s very common for organizations to write their own data connector packages in Python or R that include all of the shared connection details so users don’t have to remember them. If everyone has their own credentials, it’s also nice if those packages set standard names for the environment variables so they can be more easily set in production.
Whether you’re using R or Python, the function in your package should return the database connection object for people to use.
Here’s an example of what that might look like if you were using a Postgres database from R:
#' Return a database connection
#'
#' @param user username, character, defaults to value of DB_USER
#' @param pw password, character, defaults to value of DB_PW
#' @param ... other arguments passed to DBI::dbConnect()
#' @param driver driver, defaults to RPostgres::Postgres
#'
#' @return DBI connection
#' @export
#'
#' @examples
#' my_db_con()
my_db_con <- function(
  user = Sys.getenv("DB_USER"),
  pw = Sys.getenv("DB_PW"),
  ...,
  driver = RPostgres::Postgres()
) {
  DBI::dbConnect(
    drv = driver,
    dbname = 'my-db-name',
    host = 'my-db.example.com',
    port = 5432,
    user = user,
    password = pw,
    ...
  )
}
Note that the function signature defines default environment variables that will be consulted. If those environment variables are set ahead of time by the user, this code will just work.
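A Python equivalent might look something like this. It’s a sketch using {psycopg2}; the database name and host are placeholders for your organization’s details.

import os
import psycopg2

def my_db_con(user = None, pw = None, **kwargs):
    """Return a database connection, defaulting to the DB_USER and DB_PW environment variables."""
    return psycopg2.connect(
        dbname = "my-db-name",
        host = "my-db.example.com",
        port = 5432,
        user = user or os.getenv("DB_USER"),
        password = pw or os.getenv("DB_PW"),
        **kwargs
    )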
3.6 Comprehension Questions
- Draw two mental maps for connecting to a database, one using a database driver in a Python or R package vs an ODBC or JDBC driver. You should (at a minimum) include the nodes database package, DBI (R only), driver, system driver, ODBC, JDBC, and database.
- Draw a mental map for using an API from R or Python. You should (at a minimum) include nodes for {requests}/{httr2}, request, HTTP verb/request method, headers, query parameters, body, JSON, response, and response code.
- How can environment variables be used to keep secrets secure in your code?
3.7 Lab 3: Use a database and an API
In this lab, we’re going to build out both the data layer and the presentation layer for our penguin mass model exploration, creating an interactive app to explore the model.
Let’s start by moving the data into a real data layer.
3.7.1 Step 1: Put the data in DuckDB
We’ll move the data into a DuckDB database and use it from there for the modeling and EDA scripts. First, let’s load the data.
Here’s what that looks like in R:
con <- DBI::dbConnect(duckdb::duckdb(), dbdir = "my-db.duckdb")
DBI::dbWriteTable(con, "penguins", palmerpenguins::penguins)
DBI::dbDisconnect(con)
Or equivalently, in Python:
import duckdb
from palmerpenguins import penguins
con = duckdb.connect('my-db.duckdb')
df = penguins.load_penguins()
con.execute('CREATE TABLE penguins AS SELECT * FROM df')
con.close()
Now that the data is loaded, let’s adjust our scripts to use the database.
In R, we are just going to replace our data loading with connecting to the database. Leaving out all the parts that don’t change, it looks like
eda.qmd
con <- DBI::dbConnect(
  duckdb::duckdb(),
  dbdir = "my-db.duckdb"
)
df <- dplyr::tbl(con, "penguins")
We also need to call DBI::dbDisconnect(con) at the end of the script.
Because we wrote our data processing code in {dplyr}, we actually don’t have to change anything. Under the hood, {dplyr} can switch seamlessly to a database backend, which is really cool.
eda.qmd
df %>%
  group_by(species, sex) %>%
  summarise(
    across(
      ends_with("mm") | ends_with("g"),
      \(x) mean(x, na.rm = TRUE)
    )
  ) %>%
  dplyr::collect() %>%
  knitr::kable()
It’s not strictly necessary, but I’ve added a call to dplyr::collect() at the end of the pipeline. It would be implied if I didn’t put it there manually, but it helps make obvious that all the work before that point gets pushed off to the database. It doesn’t matter for this small dataset, but it could be a real benefit on a larger one.
In Python, we’re just going to load the entire dataset into memory for modeling, so the line loading the dataset changes to
model.qmd
con = duckdb.connect('my-db.duckdb')
df = con.execute("SELECT * FROM penguins").fetchdf().dropna()
con.close()
Now let’s switch to figuring out the connection we’ll need to our processing layer in the presentation layer.
3.7.2 Step 2: Call the model API from code
Before you start, make sure the API is running on your machine from the last lab.
I’m assuming it’s running on port 8080 in this lab. If you’ve put it somewhere else, change the 8080 in the code below to match the port on your machine.
If you want to call the model in code, you can use any HTTP request library. In R you should use {httr2} and in Python you should use {requests}.
Here’s what it looks like to call the API in Python
import requests
req_data = {
  "bill_length_mm": 0,
  "species_Chinstrap": False,
  "species_Gentoo": False,
  "sex_male": False
}
req = requests.post('http://127.0.0.1:8080/predict', json = req_data)
res = req.json().get('predict')[0]
or equivalently in R
req <- httr2::request("http://127.0.0.1:8080/predict") |>
  httr2::req_body_json(
    list(
      "bill_length_mm" = 0,
      "species_Chinstrap" = FALSE,
      "species_Gentoo" = FALSE,
      "sex_male" = FALSE
    )
  ) |>
  httr2::req_perform()
res <- httr2::resp_body_json(req)$predict[[1]]
Note that there’s no translation necessary to send the request. The {requests} and {httr2} packages automatically know what to do with the Python dictionary and the R list.
Getting the result back takes a little more work to find the right spot in the JSON returned. This is quite common.
The {vetiver} package also includes the ability to auto-query a {vetiver} API. I’m not using it here so that you can see the details of calling an API.
Now, let’s take this API-calling code and build the presentation layer around it.
3.7.3 Step 3: Build a shiny app
We’re going to use the {shiny} package, an R and Python framework for creating interactive web apps using just code. If you don’t know much about {shiny}, you can choose to just blindly follow the examples here, or you could spend some time with the Mastering Shiny book to learn to use it yourself.
Either way, the penguin mass predictor app would look like this in Python
app.py
from shiny import App, render, ui, reactive
import requests
api_url = 'http://127.0.0.1:8080/predict'

app_ui = ui.page_fluid(
    ui.panel_title("Penguin Mass Predictor"),
    ui.layout_sidebar(
        ui.panel_sidebar(
            [ui.input_slider("bill_length", "Bill Length (mm)", 30, 60, 45, step = 0.1),
             ui.input_select("sex", "Sex", ["Male", "Female"]),
             ui.input_select("species", "Species", ["Adelie", "Chinstrap", "Gentoo"]),
             ui.input_action_button("predict", "Predict")]
        ),
        ui.panel_main(
            ui.h2("Penguin Parameters"),
            ui.output_text_verbatim("vals_out"),
            ui.h2("Predicted Penguin Mass (g)"),
            ui.output_text("pred_out")
        )
    )
)

def server(input, output, session):
    @reactive.Calc
    def vals():
        d = {
            "bill_length_mm": input.bill_length(),
            "sex_male": input.sex() == "Male",
            "species_Gentoo": input.species() == "Gentoo",
            "species_Chinstrap": input.species() == "Chinstrap"
        }
        return d

    @reactive.Calc
    @reactive.event(input.predict)
    def pred():
        r = requests.post(api_url, json = vals())
        return r.json().get('predict')[0]

    @output
    @render.text
    def vals_out():
        return f"{vals()}"

    @output
    @render.text
    def pred_out():
        return f"{round(pred())}"

app = App(app_ui, server)
And like this in R
app.R
library(shiny)
api_url <- "http://127.0.0.1:8080/predict"

ui <- fluidPage(
  titlePanel("Penguin Mass Predictor"),

  # Model input values
  sidebarLayout(
    sidebarPanel(
      sliderInput(
        "bill_length",
        "Bill Length (mm)",
        min = 30,
        max = 60,
        value = 45,
        step = 0.1
      ),
      selectInput(
        "sex",
        "Sex",
        c("Male", "Female")
      ),
      selectInput(
        "species",
        "Species",
        c("Adelie", "Chinstrap", "Gentoo")
      ),
      # Get model predictions
      actionButton(
        "predict",
        "Predict"
      )
    ),
    mainPanel(
      h2("Penguin Parameters"),
      verbatimTextOutput("vals"),
      h2("Predicted Penguin Mass (g)"),
      textOutput("pred")
    )
  )
)

server <- function(input, output) {
  # Input params
  vals <- reactive(
    list(
      bill_length_mm = input$bill_length,
      species_Chinstrap = input$species == "Chinstrap",
      species_Gentoo = input$species == "Gentoo",
      sex_male = input$sex == "Male"
    )
  )

  # Fetch prediction from API
  pred <- eventReactive(
    input$predict,
    httr2::request(api_url) |>
      httr2::req_body_json(vals()) |>
      httr2::req_perform() |>
      httr2::resp_body_json(),
    ignoreInit = TRUE
  )

  # Render to UI
  output$pred <- renderText(pred()$predict[[1]])
  output$vals <- renderPrint(vals())
}

# Run the application
shinyApp(ui = ui, server = server)
Over the next few chapters, we’re going to implement more architectural best practices for the app and eventually go to deployment.
I have heard that some write operations may be faster with a JDBC driver than an ODBC one. I would argue that if you’re doing enough writing to a database that speed matters, you probably should be using database-specific data loading tools, and not just writing from R or Python.↩︎
You may see POST used for things that look like GETs. For example, fetching a model prediction from an API feels like a GET to me. The reason is that the HTTP spec only recently allowed GET endpoints to use request bodies, and it’s still discouraged. So if the API wants to use a body in the request, it’s likely to be a POST even if it’s more of a GET activity.↩︎