18  Package Management in the Enterprise

The most obvious day-to-day difference for a data scientist at an enterprise versus a smaller organization is how they manage open source Python and R packages.

Many small organizations have a laissez-faire attitude towards packages. You install what you need, when you need it, from wherever you need. You have free rein to install packages from PyPI, conda-forge, CRAN, Bioconductor, GitHub, and wherever else you might want. That is unlikely to be true in an enterprise context.

As a data scientist, you understand in your bones how badly you need access to numerous open source libraries and packages to get anything done. But the IT/Admins at your organization probably don’t have that same understanding, and an enterprise may have organizational rules that govern how software comes into their environments.

This chapter will help you understand the concerns IT/Admins have around packages to help you better collaborate with them.

18.1 Ensuring packages are safe

The biggest concern most IT/Admins have about packages is that they might be unsafe. Unsafe packages might introduce exploitable bugs into your code that let outside actors get in, or they may themselves be Trojan horses that exfiltrate data.

Some of these security concerns can be ameliorated because most data science projects are run entirely inside a private environment. For example, there are many security concerns with running JavaScript on public websites that are sharply reduced when the only people who can access your application are already staff at your organization. Similarly, a package that maliciously grabs data from your system and exports it will be rendered ineffective in an airgapped environment.

Depending on your industry, IT/Admins may also take some responsibility for creating “validated” environments that contain only packages known to produce good results. This is particularly common in industries that have longstanding statistical practices, like Pharma. In other cases, organizations will only want to use packages that are known to be well-maintained and are likely to stay that way.

A basic, but effective, form of package security is to limit allowed packages to popular packages, packages from known organizations/authors, or packages that outside groups have validated to be safe.
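As a minimal sketch of the allowlist idea, assuming a hypothetical set of approved package names (`APPROVED`) and simple `requirements.txt`-style lines, a check might look something like this:

```python
import re
from typing import List, Optional, Set

# Hypothetical allowlist; a real one would come from your IT/Admins.
APPROVED = {"numpy", "pandas", "scikit-learn"}


def normalize(name: str) -> str:
    """Normalize a package name the way PyPI does (PEP 503)."""
    return re.sub(r"[-_.]+", "-", name).lower()


def requirement_name(line: str) -> Optional[str]:
    """Extract the package name from a simple requirements.txt line."""
    line = line.split("#", 1)[0].strip()  # drop trailing comments
    if not line or line.startswith("-"):  # skip blanks and pip options
        return None
    match = re.match(r"[A-Za-z0-9][A-Za-z0-9._-]*", line)
    return normalize(match.group(0)) if match else None


def unapproved(requirements: List[str], approved: Set[str] = APPROVED) -> List[str]:
    """Return the requested packages that are not on the allowlist."""
    names = (requirement_name(line) for line in requirements)
    return sorted({n for n in names if n and n not in approved})
```

A check like this only tells you what is off the list. It does nothing to stop the install itself, which is why the controls discussed later in the chapter matter.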

Increasingly, there are industry groups that are validating that certain packages have met quality and security standards and that anyone in the industry should feel comfortable using them. For example, the R Validation Hub is a pharmaceutical industry group that is working to create lists of packages that are broadly agreed to meet certain quality standards. There are also paid products that may serve this validation function.

Other organizations may want to check that incoming packages don’t contain known security vulnerabilities.

Every day, security vulnerabilities in software are identified and publicized. These vulnerabilities are tracked in the CVE (Common Vulnerabilities and Exposures) system. Each CVE is assigned an identifier and, usually via the Common Vulnerability Scoring System (CVSS), a severity rating that ranges from None to Critical.

Very often, companies have policies that disallow the usage of software with too many vulnerabilities. These policies often completely ban software with Critical CVEs and only temporarily allow software with a few High CVEs.
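To make that kind of policy concrete, here is a rough sketch. It assumes the CVSS v3 qualitative rating bands and a hypothetical threshold of at most two High CVEs (`max_high`); your organization's actual rules, including any time limit on exceptions, will differ.

```python
from collections import Counter
from typing import Iterable


def severity(cvss_score: float) -> str:
    """Map a CVSS v3 base score (0.0-10.0) to its qualitative rating."""
    if not 0.0 <= cvss_score <= 10.0:
        raise ValueError(f"CVSS score out of range: {cvss_score}")
    if cvss_score == 0.0:
        return "None"
    if cvss_score < 4.0:
        return "Low"
    if cvss_score < 7.0:
        return "Medium"
    if cvss_score < 9.0:
        return "High"
    return "Critical"


def is_allowed(cvss_scores: Iterable[float], max_high: int = 2) -> bool:
    """Hypothetical policy: no Critical CVEs, at most `max_high` High CVEs."""
    counts = Counter(severity(score) for score in cvss_scores)
    return counts["Critical"] == 0 and counts["High"] <= max_high
```

For example, `is_allowed([5.3, 7.5])` passes under this policy, while a single score of 9.8 is enough to fail it.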

Some organizations try to ensure the security of incoming packages via a code scanner. A code scanner is a piece of software that scans all incoming code and detects potential security risks – like usage of insecure encryption libraries, calls to external web services, or places where the code accesses a database.
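To give a feel for what such a tool does, here is a toy sketch built on Python's standard `ast` module. The watchlist and the weak-hash list are hypothetical, and commercial scanners are far more elaborate than this; the point is only that scanning amounts to pattern-matching over code without running it.

```python
import ast
from typing import List, Tuple

# Hypothetical watchlist: top-level modules and the risk they hint at.
WATCHLIST = {
    "socket": "network access",
    "urllib": "network access",
    "requests": "network access",
    "sqlite3": "database access",
}
WEAK_HASHES = {"md5", "sha1"}


def scan(source: str) -> List[Tuple[int, str]]:
    """Return sorted (line number, finding) pairs for risky-looking code."""
    findings = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            modules = [alias.name for alias in node.names]
        elif isinstance(node, ast.ImportFrom):
            modules = [node.module or ""]
        else:
            modules = []
        for module in modules:
            top = module.split(".")[0]
            if top in WATCHLIST:
                findings.append((node.lineno, f"{WATCHLIST[top]}: imports {module}"))
        # Flag calls such as hashlib.md5(...), which use a weak hash function.
        if (
            isinstance(node, ast.Call)
            and isinstance(node.func, ast.Attribute)
            and isinstance(node.func.value, ast.Name)
            and node.func.value.id == "hashlib"
            and node.func.attr in WEAK_HASHES
        ):
            findings.append((node.lineno, f"weak hash: hashlib.{node.func.attr}"))
    return sorted(findings)
```

Note how easily this produces false positives: importing `requests` is flagged even if the code only ever calls an internal service.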

These are almost always paid tools. It is my personal opinion that the creators of these tools often overstate the potential benefits and that a lot of code scanning is security theater. Unfortunately, that doesn’t change the reality that getting open source packages into your environment may require them going through a code scanner.

The sophistication of these tools is roughly in proportion to how popular the language is. So JavaScript, which is both extremely popular and also makes up most public websites, has reasonably well-developed software scanning. Python, which is very popular but is only rarely on the front end of websites, has fewer scanners, and R, which is far less popular and is never in a website front end, has none as far as I’m aware.

18.2 Open source licensing issues

In addition to security issues, some organizations are concerned about the legal implications of using free and open source software (FOSS) in their environment. These organizations, most often organizations that themselves write software, want to limit the use of certain types of licenses inside their environment.

Note

I am not a lawyer and this is not legal advice, but hopefully this is helpful context on the legal issues around FOSS software.

When someone releases software, they can choose a license, which is a legal document explaining what consumers are allowed to do with that software.

The type of license you’re probably most familiar with is a copyright. A copyright gives the owner exclusivity to distribute the software and charge for it. For example, if you buy a copy of Microsoft Word, you have a limited license to use the software, but you’re not allowed to inspect the source code of Microsoft Word and you’re not allowed to share the software.

In 1985, the Free Software Foundation (FSF) was created to support the creation of free software. They wanted to facilitate using, reusing, and sharing software. In particular, the FSF supported four freedoms for software:[1]

  1. Run the program however you wish for any purpose.
  2. Study the source code of the program and change it as you wish.
  3. Redistribute the software as you wish to others.
  4. Distribute copies of the software once you’ve made changes so everyone can benefit.

Now, you could just do this without applying a license to your software. But from a lawyer’s perspective, that’s dangerous and unsustainable. Creating and applying FOSS licenses to software made it something that was enforceable.

What does “free” mean?

It’s expensive to create and maintain FOSS. For that reason, the free in FOSS is about freedom, not about zero cost. As a common saying goes – it means free as in free speech, not free as in free beer.

There are many different flavors of open-source licenses. All of them I’m aware of, even the anti-capitalist one, allow you to charge for software.

Organizations have attempted to support FOSS with a variety of different business models to varying degrees of success. These models include donations, paid features or products, advertising or integrations, and paid support, services, or hosting.

There isn’t just one FOSS license. Instead, there are dozens, which fall into two categories. Permissive licenses allow you to do essentially whatever you want with the software. For example, the permissive MIT license allows you to “use, copy, modify, merge, publish, distribute, sublicense, and/or sell” MIT-licensed software, provided you keep the copyright and permission notice with it. Most organizations have basically no concerns about using software with a permissive open source license.

The bigger concern is using software that has a copyleft or viral license. Software licensed under a copyleft regime requires that any derivative works are themselves released under a similar license. The idea is that open source software should beget more open source software and not silently be used by big companies to make megabucks.

The concern enterprises have with copyleft licenses is that they might propagate into the private work you are doing inside your company. This concern is especially keen at organizations that themselves sell proprietary software. For example, what if a court were to rule that Apple or Google had to suddenly open source all their software because of the use of copyleft licenses by developers there?

Much of the concern centers around what it means for a piece of software to be a derivative work of another. Most people agree that artifacts created with copyleft-licensed software, like your plots, reports, and apps, are not themselves derivative works. But the treatment of software that incorporates copyleft-licensed software is murky. The reality is that there have been basically no court cases on this topic and nobody knows how things would shake out if one did get to court, so some organizations err on the side of caution.

These concerns are somewhat less acute for Python than for R. Python is released under the permissive Python Software Foundation (PSF) license and Jupyter Notebook under a permissive modified BSD license. R is released under the copyleft GPL and RStudio under the copyleft AGPL.

However, each package author chooses the license for their own package. In an enterprise context, these discussions tend to focus on knowing – and potentially blocking – the use of packages under copyleft licenses inside the enterprise. Some package repository software surfaces the license type of individual packages to help organizations make their own decisions.
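If your repository software doesn't surface licenses, you can get a rough picture from package metadata yourself. The sketch below uses the standard library's `importlib.metadata` and a hypothetical, deliberately crude keyword list (`COPYLEFT_MARKERS`). Declared metadata is often missing or inconsistent, so treat the output as a starting point for a conversation, not a legal determination.

```python
from importlib import metadata
from typing import Dict

# Hypothetical keyword list covering the GPL family only.
COPYLEFT_MARKERS = ("GPL", "GNU GENERAL PUBLIC", "AFFERO")


def classify(license_text: str) -> str:
    """Roughly bucket a license string as 'copyleft', 'other', or 'unknown'."""
    text = (license_text or "").strip().upper()
    if not text:
        return "unknown"
    if any(marker in text for marker in COPYLEFT_MARKERS):
        return "copyleft"
    return "other"


def installed_licenses() -> Dict[str, str]:
    """Classify each installed distribution from its declared metadata."""
    report = {}
    for dist in metadata.distributions():
        meta = dist.metadata
        classifiers = meta.get_all("Classifier") or []
        declared = " ".join(
            [str(meta.get("License-Expression") or ""), str(meta.get("License") or "")]
            + [str(c) for c in classifiers if str(c).startswith("License ::")]
        )
        report[str(meta.get("Name") or "unknown")] = classify(declared)
    return report
```

Calling `installed_licenses()` in a project environment returns a dictionary keyed by package name, which you could filter for `"copyleft"` or `"unknown"` entries to review.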

18.3 Controlling package flows

Whether your organization is trying to limit CVE exposure, run a code scanner, limit copyleft exposure, or stick to a known list of good packages, they need a way to actually restrict the packages that are available inside the environment.

If you’re giving someone access to Python or R, it’s not possible to just remove their ability to run pip install or install.packages. That’s one reason why many enterprise environments are airgapped – it’s the only way to ensure data scientists can’t install packages from outside the environment.

Most IT/Admins understand that airgapping is the best way to stop unauthorized package installs. The next bit – that they need to provide you some way to install packages – is the part that may require some convincing.

In order to allow packages to be installed in their environments, many enterprises run package repository software inside their firewall. Common package repositories include JFrog Artifactory, Sonatype Nexus, Anaconda Business, and Posit Package Manager.

Artifactory and Nexus are generalized library and package management solutions for all sorts of software, while Anaconda and Posit Package Manager are more narrowly tailored for data science use cases. If possible, I’d try working with your IT/Admins to get a data science focused repository software. Often these repositories can run alongside general-purpose repositories if you already have them.

Depending on the repository software you use, it may connect to an outside sync service or support manual file transfers for package updates. In many cases, IT/Admins are comfortable having narrow exceptions so the package repository can download packages, but no one in the data science environment can reach the internet.

Figure: A data science environment getting packages from a package repository, while all other connections bounce back inside the firewall.

This tends to work best when the IT/Admin is the one who controls which packages are allowed into the repository and when. Then you, as the data scientist, have the freedom to install those packages into any individual project and manage them there using environment as code tooling, as discussed in Chapter 1.
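From the data scientist's side, installing from an internal repository mostly means pointing your installer at it. As a small sketch, assuming a hypothetical repository URL (`INTERNAL_INDEX`), this helper builds a pip command that resolves packages only from that index using pip's `--index-url` option:

```python
import sys
from typing import List

# Hypothetical internal repository URL; use the one your IT/Admins provide.
INTERNAL_INDEX = "https://packages.example.internal/pypi/simple"


def pip_install_command(packages: List[str], index_url: str = INTERNAL_INDEX) -> List[str]:
    """Build a pip command that resolves packages only from `index_url`."""
    if not packages:
        raise ValueError("no packages requested")
    # Use the current interpreter so packages land in the active environment.
    return [sys.executable, "-m", "pip", "install", "--index-url", index_url, *packages]
```

You could run the result with `subprocess.run(pip_install_command(["pandas"]), check=True)`. In practice, IT/Admins often set the index URL once in pip's configuration so that a plain pip install already goes to the internal repository.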

18.4 Comprehension Questions

  1. What are the concerns IT/Admins have about packages in an enterprise context?
  2. What are three tools IT/Admins might use to ensure packages are safe?
  3. What is the difference between permissive and copyleft open source licenses? Why are some organizations concerned about using code that includes copyleft licenses?

  [1] They’re numbered 1-4 here for clarity in writing, but like many computer science topics, the numbering actually starts at 0.