4  Logging and Monitoring

Logging and monitoring are the tools used to make the operations of a computer system visible to the outside world. Logging is the process of keeping track of what’s happening inside an app for the purpose of understanding usage or debugging if something goes wrong. Monitoring is the process of checking on the health and performance of a system at a given point in time.

Both logging and monitoring have two halves – emitting and aggregating.

Think, for example, about building a Shiny app. As you’re building your app, you have to think about what are the kinds of events you want to log – perhaps every time someone switches tabs in the app – and put code to do so in your app. Similarly, you’ve got to think about monitoring the app – are you going to save performance metrics for certain operations so you can monitor them over time.

On the flip side, there’s the aggregation part of the coin – let’s say your app does a great job emitting logs of all the right things. How will you be able to access those logs from the outside? And if your app is emitting monitoring data, how will you aggregate the monitoring metrics, detect if anything is awry, and alert the right people.

TODO: diagram of monitoring + logging emission vs aggregation

4.1 You should do more logging

Logging is a really essential tool that’s underused in data science apps and assets.

Actually implementing logging is not hard. In fact, as a data scientist, you already do a lot of logging!

The language around logging and monitoring seems most relevant in the context of a running app or API. But there are all kinds of other things data scientists do, like create reports and run batch jobs.

A (mildly) controversial take of mine is that data scientists should never run actual jobs in .R or .py files. Instead, all running analysis code should be in a self-documenting literate programming format like a Jupyter Notebook, R Markdown file, or Quarto document. If you exclusively use literate programming formats and are diligent about saving the rendered output after they run, you’ve already got logs of every time things run!

TODO: Diagram of functions in .R and .py sourced into notebook.

So, what should go into .R and .py files? Function and class definitions. Use these formats to create the objects you’ll manipulate in your actual analysis that you can render and make visible to end users or yourself later.

And if you’re thoughtful about what you note in the notebook, the document itself is the log.

If you’re creating an interactive app or API and are adding logging, you should use a formal logging library. For example, python has the standard logging package, and there are a number of great logging packages in R. I’m a particularly big fan of log4r.

4.1.1 What to log

I’ve generally seen three main purposes for logging in data science apps. The first is to log access operations for when people visit your app. If you aggregate these later, they can be really useful for making a business case about how important your app is. In addition to overall access to the app, logging when people access particular tabs or parts of the app can be really useful.

It’s worth noting that some commercial products where you might host your app, like RStudio Connect, create app access logs for you. You’re still responsible for logging anything that happens within your app, but the deployment platform takes care of logging who’s accessing your app and when.

The second is for audit trail reasons. In many cases, data science apps are read-only with respect to production data sources. I’d say this is generally a best practice, and if you can avoid writing production data sources from your app, that’s a great thing – create copies of data as needed. But if you must write to production data sources, being able to keep an audit trail of who made changes in the app and what they did is really essential.

The last purpose is for debugging reasons. If you’re running a production app, it’s really useful to log what people are doing inside your app. That way if you have an error that occurs, it’ll be much easier to understand what happened immediately before the error. The first step to debugging an error in an app is to be able to reproduce the conditions that caused it, which will be way easier if you implement a good logging system.

One important consideration is to make sure that you’re not accidentally logging sensitive information, like API keys to outside data sources in your logs.

4.1.1.1 How to structure logs

Logging is structured around the criticality of the logged message. For example, if you’re emitting logging from a Shiny app, you probably want to log every time someone changes tabs in the app, and also if something happens that makes the entire app crash – but you want to log them in different ways and be able to only pay attention to the logs you care about.

Formal logging libraries offer you actual levels of logging, which you can then selectively expose in the logs themselves. While you will have to use whatever levels your logging library exposes, it’s great to have a conceptual model of different logging levels, so even if you’re not using a formal logging framework (for example in a notebook), you have a sense of what you’d want to log and how to do so.

Most logging libraries have 5-7 levels of logging. The six below are reasonably common. If you understand these, you can condense or expand to match whatever framework you’re actually using. Six levels of logging from least to most granular:

  1. Critical/Fatal - an error so big that the app itself shuts down. For example, if your app cannot run without a connection to an outside service, you might log an inability to connect as a Critical error.
  2. Error – an issue that will make an operation not work, but that won’t bring down your app. In the language of software engineering, you might think of this as a caught exception. An example might be a user submitting invalid input.
  3. Warning – an unexpected application issue that isn’t fatal. For example, you might include having to retry doing something or noticing that resource usage is high. If something were to go wrong later, these might be helpful breadcrumbs to look at.
  4. Info – something normal happened in the app. These record things like starting and stopping, successfully making database and other connections, and configuration options that are used.
  5. Trace – a record of user interaction. If a user were to run into an issue, a trace should allow you to reconstruct what the user was doing when they ran into the issue and (hopefully) reproduce it.
  6. Debug – a deep record of what the app was doing. The difference between trace and debug is that the trace would be useful to reconstruct a sequence of events even if you weren’t familiar with the inner workings of the app. On the other hand, the debug log is meant for people who actually understand the app itself. Where a trace log would record “user pressed the button”, the debug might record the actual functions invoked and their arguments.

These aren’t hard-and-fast rules, but can be useful rules of thumb for how to use logging. Remember, far more important than sticking to any particular logging framework is creating logging that’s useful for you. Almost all frameworks include Info, Warn, Error, and Critical/Fatal. Some have several levels more acute than Error to alert on what should happen as a result. Others (log4r, for example) only include one level more granular than info, so you’ll have to figure out how you want to adjust for that.

4.1.1.2 An Example App with Logging

TODO

4.2 Do you really need to monitor?

Many (probably most) data science assets are used to display the output of an analysis to stakeholders. So the running data science processes are either async scripts that run to output a static report or analysis that stakeholders can access, or they’re apps that are largely used to display data. In many cases, these apps are important, but minute-to-minute access is not and so the effort required to implement good monitoring and logging simply isn’t worth it.

For most data science use cases, monitoring is not terribly important. You don’t really care if your ETL job takes a few extra minutes, and the hassle of setting up a monitoring platform generally isn’t worth it if the audience for your app is small and they won’t care too much if it’s down for a few hours.

The one exception to this is monitoring machine learning models. In that case, you want to model performance over time to make sure that the model performance isn’t changing too much over time. This is a distinct question from whether speed performance of serving predictions is changing or degrading over time.

Right now, the field of ML Ops is the next hot thing, and it’s not clear what the industry standards will be for monitoring machine learning model performance over time. My (admittedly biased) take is that I’m excited by the vetiver project, which is an open source framework for deploying machine learning models and monitoring their performance.

4.3 Aggregating Monitoring and Logging

TODO

Prometheus