11 Open Source Data Science in the Enterprise
This chapter is devoted to two of the biggest challenges of doing data science in the enterprise: the relationship between IT/Admins and data scientists, and how to make free and open source software (FOSS) acceptable to enterprise IT/Admins.
11.1 Data Science Sandboxes
In the chapter on environment as code tooling, we discussed how you want to create a version of your package environment in a safe place and then use code to share it with others and promote it into production.
The IT/Admin group at your organization probably is thinking about something similar, but at a more basic level. They want to make sure that they’re providing a great experience to you – the users of the platform.
So they probably want to make sure they can upgrade the servers you’re using, change to a newer operating system, or upgrade to a newer version of R or Python without interrupting your use of the servers.
These are the same concerns you probably have about the users of the apps and reports you’re creating. In this case, I recommend a two-dimensional grid – one for promoting the environment itself into place for IT/Admins and another for promoting the data science assets into production.
In order to differentiate the environments, I often call the IT/Admin development and testing area staging, and use dev/test/prod language for the data science promotion process.
11.1.1 Creating Dev/Test/Prod
In Chapter 1 on code promotion, we talked about wanting to create separate dev/test/prod environments for your data science assets. In many organizations, your good intentions to do dev/test/prod work for your data science assets are insufficient.
In the enterprise, it’s often the case that IT/Admins are required to ensure that your dev work can’t wreak havoc on actual production systems or data. In many cases, this is actually a good thing!
In this case, I recommend you work with your IT/Admins to create a data science sandbox - a place where you can have free access to the data and packages you need to develop new assets to your heart’s content.
Then you need a promotion process to get into “higher lanes”.
There are two items that are likely to raise friction between you and the IT/Admins as you’re trying to create a data science sandbox. The first is access to data, and the second is access to packages.
Many IT/Admins are familiar with building workbenches for other sorts of software development tasks, which are fundamentally different from doing data science. Because so much of data science is exploratory, you need a safe place where you can work with real data, explore relationships in the data, and try out ways of modeling and visualizing the data to see what works.
But for many software development tasks, access to fake or mock data is sufficient. Helping IT/Admins understand that you need access to real data, even if just a sample or in read-only mode, is likely to be one of the highest hurdles to getting a data science sandbox that’s actually useful. In data science, dev also looks less like test and prod than it does in software engineering – there you’re generally iterating on an app, writing code, and then promoting it into a testing lane. In data science, the dev activities are entirely different from the test or prod activities – you can’t start building a model or Shiny app until you’ve really explored the data in depth.
In many cases, creating a data science sandbox with read-only access to data is a great solution. If you’re creating a tool that also writes data, you can write the data only within the dev environment.
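To make that pattern concrete, here is a minimal Python sketch: read real data with a read-only account, and write any outputs only to a dev location. The connection strings, the read-only role, and the database names are all hypothetical placeholders for whatever your IT/Admins actually provision.

```python
# A sketch of the sandbox data pattern: read real data with a read-only
# account, write outputs only to a dev location. All connection details
# below are hypothetical placeholders.
import pandas as pd
import sqlalchemy

# Read-only credentials provisioned by IT/Admin for the sandbox.
prod = sqlalchemy.create_engine(
    "postgresql://readonly_user:REDACTED@prod-db.example.com/warehouse"
)
# A separate dev database the team is allowed to write to.
dev = sqlalchemy.create_engine(
    "postgresql://ds_user:REDACTED@dev-db.example.com/sandbox"
)

# Explore a sample of real data...
sales = pd.read_sql("SELECT * FROM sales LIMIT 10000", prod)

# ...and write any derived tables only into the dev environment.
sales.to_sql("sales_features", dev, if_exists="replace", index=False)
```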
The other issue that can be a tough sell is getting freer access to packages in the dev environment. As you probably know, part of the exploratory process is trying out new packages – is it better to use this modeling package or that one, this type of visualization or another? If you can convince your IT/Admins to give you freer access to packages in dev, that’s ideal.
You can work with them to figure out what the promotion process will look like. With tools like renv or venv, it’s easy to generate a list of the packages you’ll need to promote apps or reports into production. Great collaborations with IT/Admins are possible if you can develop a procedure where you give them package manifests that they can compare to allow-lists and then make those packages available.
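As a sketch of what that handoff could look like on the Python side, here is one way to compare a project’s manifest against an IT/Admin-maintained allow-list. The manifest comes from pip freeze inside the project’s virtual environment (renv::snapshot() plays the equivalent role for R), and the allow-list file name and format here are hypothetical.

```python
# A sketch of the manifest/allow-list handoff. The allow-list file and its
# format are illustrative, not a standard.
import json
import subprocess

# Export the project's package manifest from the active environment.
manifest = subprocess.run(
    ["pip", "freeze"], capture_output=True, text=True, check=True
).stdout.splitlines()

requested = {line.split("==")[0].lower() for line in manifest if "==" in line}

# IT/Admin maintains an allow-list of approved package names.
with open("allowed_packages.json") as f:
    allowed = {name.lower() for name in json.load(f)}

needs_review = sorted(requested - allowed)
print("Packages needing review before promotion:", needs_review or "none")
```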
11.1.2 Infrastructure As Code Tooling
There are many, many varieties of infrastructure as code (IaC) tooling. Whole books have been written on the topic, and I won’t be covering it in any depth here. Instead, I’ll share a few of the different “categories” (parts of the stack) of IaC tooling and suggest a few of my favorites.
To get from “nothing” to a usable server state, there are (at minimum) two things you need to do – provision the infrastructure you need, and configure that infrastructure to do what you want.
For example, let’s say I’m standing up a server to deploy a simple Shiny app. In order to get that app up, I’ll need to stand up an actual server, including configuring the security settings and networking that will allow the proper people to access it. Then I’ll need to install a version of R on the server, the Shiny package, and a piece of hosting software like Shiny Server.
So, for example, you might use AWS’s CloudFormation to stand up a virtual private cloud (VPC), put an EC2 server instance inside that VPC, attach an appropriately-sized storage unit, and attach the correct networking rules. Then you might use Chef to install the correct software on the server and get your Shiny app up-and-running.
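If you prefer to stay in Python, here is a minimal sketch of the provisioning half using boto3, AWS’s Python SDK. The AMI ID, key pair, and security group are placeholders, and a real setup would also create the VPC, subnets, and storage – which is exactly what tools like CloudFormation or Terraform handle for you.

```python
# A minimal provisioning sketch with boto3 (AWS's Python SDK). The AMI ID,
# key pair, and security group below are placeholders.
import boto3

ec2 = boto3.resource("ec2", region_name="us-east-1")

instances = ec2.create_instances(
    ImageId="ami-0123456789abcdef0",             # placeholder AMI
    InstanceType="t3.medium",                    # size it for your workload
    MinCount=1,
    MaxCount=1,
    KeyName="my-key-pair",                       # placeholder SSH key pair
    SecurityGroupIds=["sg-0123456789abcdef0"],   # placeholder security group
    TagSpecifications=[{
        "ResourceType": "instance",
        "Tags": [{"Key": "Name", "Value": "shiny-server"}],
    }],
)
print("launched:", instances[0].id)

# Configuration (installing R, Shiny, and Shiny Server) would then be handled
# by a configuration tool like Chef or Ansible, or by a startup script.
```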
In infrastructure as code tooling, there generally isn’t a clear dividing line between tools that do provisioning and tools that do configuration…but most tools lean one way or the other.
Basically any tool that does provisioning will integrate directly with the APIs of the major cloud providers to make it easy to provision cloud servers. Each of the cloud providers also has its own IaC tool, but many people prefer to use other tools when given the option (to be delicate).
The other important division in IaC tools is declarative vs imperative. In declarative tooling, you simply enumerate the things you want, and the tool makes sure they get done in the right order. In contrast, an imperative tool requires that you provide actual instructions to get to where you want to go.
In many cases, it’s easy to be declarative with provisioning servers, but it’s often useful to have a way to fall back to an imperative mode when configuring them because there may be dependencies that aren’t obvious to the provisioning tool, but are easy to put down in code. If the tool does have an imperative mode, it’s also nice if it’s compatible with a language you’d be comfortable with.
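Here is a toy illustration of the difference – not a real IaC tool, just a sketch of the two mindsets. The declarative version describes the end state and lets a generic converge step work out what’s missing; the imperative version spells out the commands, their order, and the already-installed checks yourself.

```python
# Toy sketch of declarative vs. imperative configuration (not a real IaC tool).
installed = {"r-base"}  # pretend this is what's already on the server

# Declarative: describe only the desired end state.
desired = {"r-base", "r-shiny", "shiny-server"}  # hypothetical package names

def converge(desired_pkgs, installed_pkgs):
    """Install whatever is missing; anything already present is left alone."""
    for pkg in sorted(desired_pkgs - installed_pkgs):
        print(f"installing {pkg}")  # a real tool would call the package manager
        installed_pkgs.add(pkg)

converge(desired, installed)

# Imperative: write out the steps (and their order) yourself.
steps = [
    "apt-get update",
    "apt-get install -y r-shiny",
    "apt-get install -y shiny-server",
]
for cmd in steps:
    print(f"running: {cmd}")  # a real script would execute each command
```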
One somewhat complicated addition to the IaC lineup is Docker and related orchestration tools. There’s a whole chapter on containerization and Docker, so check that out if you want more details. The short answer is that Docker can’t really do provisioning, but you can definitely use Docker as a configuration management IaC tool, as long as you’re disciplined about updating your Dockerfiles and redeploying when you want to make changes to the contents.
Basically none of these tools will save you from your own bad habits, but they can give you alternatives.
In short, exactly which tool you’ll need will depend a lot on what you’re trying to do. Probably the most important question in choosing a tool is whether you’ll be able to get help from other people at your organization on it. So if you’re thinking about heading into IaC tooling, I’d suggest doing a quick survey of some folks in DevOps and choosing something they already know and like.
11.2 FOSS in Enterprise
Succeeding with open source in an enterprise context means forging a good partnership with IT/Admins. Many IT/Admins have only limited experience with FOSS at all, and even if they’re used to FOSS, they probably have almost no experience with FOSS for data science, which looks a good bit different from many other applications of FOSS.
11.2.1 Legal issues with open source
When using free and open source software (FOSS) in the enterprise, concerns may arise about whether the organization is actually allowed to use that software at all.
FOSS is made such by the license applied to it. There are a variety of different open source licenses with different implications for the organization that wants to make use of them. I am not a lawyer and this is not legal advice, but here’s a little context on the issues at hand.
There are some open source licenses that are called “permissive” because they allow broad use of the software licensed under them. For example, the MIT license allows anyone to “use, copy, modify, merge, publish, distribute, sublicense, and/or sell” software that is licensed under MIT – requiring little more than that the copyright and license notice be preserved.
This means that you are broadly allowed to use something MIT-licensed and do what you want with it, without having to worry too much about the implications.1 Other permissive licenses (like some of the BSD variants) are similar, differing mainly in the details of their attribution and endorsement requirements.
On the other hand, there are more restrictive or “copyleft” licenses – popular ones are the GPL and AGPL. Aside from restrictions these licenses may place on the usage of the software, they also require that derivative works be released under the same license.
R is released under the GPL, a copyleft license. Python is released under the specialized Python Software Foundation (PSF) license, which is a permissive license.
This means, for example, that you cannot take the base R language, rewrite a section of it, and create a proprietary version of R you want people to pay for.2
However, there is disagreement among lawyers about the scope of copyleft licenses. Many copyright lawyers believe that a work is only derivative, for the purposes of a copyleft license, if it literally repackages and redistributes some of the original code.
Beyond the base languages themselves, packages in R and Python are released under whatever licenses the package authors decide.
So, for example, the popular ggplot2 R package for plotting is released under an MIT license. But wait, you say, R is GPL – doesn’t that mean derivative works also have to carry a copyleft license? This is allowed because ggplot2 uses R, but the source code of the R language doesn’t actually appear in the ggplot2 package, so it isn’t bound by R’s GPL license.3
There is a grey area about using these packages. Say I create an app or plot in R and then share it with the public – is that app or plot bound by the license as well? Do I now have to release my source code to the public? Many would argue no – it uses R, but doesn’t derive from R in any way. Others have stricter concerns. Most interpretations of the very strict AGPL license hold that running a service based on AGPL-licensed code requires releasing the source code. Understandably, organizations that themselves sell software tend to be a little more concerned about these issues.
There are disagreements on this among lawyers, and you should be sure to talk to a lawyer at your organization if you have concerns.
11.2.2 Package Management in the Enterprise
In the chapter on environments as code, we went in depth on how to manage a per-project package environment that moves around with that project. This is a best practice and you should do it.
But in some Enterprise environments, there may be further requirements around how packages get into the environment, which packages are allowed, and how to validate those packages.
The biggest difference between libraries and repositories is how many versions of packages are allowed. Because libraries can house (at most) one version of any given package, they should live as close to where they’re used as possible.
In contrast, it’s nice to manage repositories at as general a level as possible for convenience’s sake.
In some cases, enterprises are happy to allow data scientists open access to any packages on CRAN, PyPI, GitHub, and elsewhere. But most enterprises have somewhat more restrictive stances for security reasons.
In these cases, they have to do two things: make sure that public repositories are not available to users of their data science platforms, and use a repository tool to manage the set of packages that are available inside their environment. It often takes a bit of convincing, but a good division of labor here is generally that the IT/Admins manage the repository server and what’s allowed into the environment, and the individual teams manage their own project libraries.
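On the team side, that division of labor mostly means pointing your package tooling at the internal repository instead of the public one. Here is a hedged Python sketch; the repository URL is a placeholder for wherever your IT/Admins host the approved packages, and in practice you’d usually set it once in pip.conf or the PIP_INDEX_URL environment variable (renv has equivalent repository settings for R).

```python
# A sketch of installing from an internal repository instead of public PyPI.
# The URL is a placeholder for your organization's repository server
# (Posit Package Manager, Artifactory, Nexus, etc.).
import subprocess

INTERNAL_INDEX = "https://packages.example.com/pypi/simple"  # placeholder URL

# --index-url replaces public PyPI rather than adding to it, so only packages
# IT/Admin has allowed into the internal repository can be installed.
subprocess.run(
    ["pip", "install", "--index-url", INTERNAL_INDEX, "pandas"],
    check=True,
)
```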
When admins hear about per-project package environments, they start getting worried about storage space. I have heard this concern many times from admins who haven’t yet adopted this strategy, and I’ve almost never heard an admin say they were running out of storage space because of package storage.
The reality is that most R and Python packages are very small, so storing many of them is reasonably trivial.
Also, these environment management tools are pretty smart. They keep a shared package cache across projects, so each package only gets installed once but can be used in a variety of projects.
It is true that each user then has their own copy of the package. Again, because packages are small, this tends to be a minor issue. It is possible to share the package cache across users, but the (small) risk that introduces of one user affecting other users on the server probably isn’t worth it compared to the very small cost of just provisioning enough storage that this isn’t an issue.
Many enterprises run some sort of package repository software. Common package repositories used for R and Python include JFrog Artifactory, Sonatype Nexus, Anaconda Business, and Posit Package Manager.
Artifactory and Nexus are generalized library and package management solutions for all sorts of software, while Anaconda and Posit Package Manager are more narrowly tailored for data science use cases.
There are two main concerns that come up in the context of managing packages for the enterprise. The first is how to manage package security vulnerabilities.
In this context, the question of how to do security scanning comes up. What exactly security professionals mean by scanning varies widely, and what’s possible differs a good bit from language to language.
It is possible to imagine a security scanner that actually reads in all of the code in a package and identifies potential security risks – like usage of insecure libraries, calls to external web services, or places where it accesses a database. Tools at this level of sophistication exist roughly in proportion to how popular the language is and how exposed that language tends to be.
So JavaScript, which is both extremely popular and makes up most public websites, has reasonably well-developed software scanning. Python, which is very popular but only rarely on the front end of websites, has fewer scanners, and R, which is far less popular, has even fewer. I am unaware of any actual code scanners for R code.
One thing that can be done is to compare the contents of a package against databases of known software vulnerabilities.
New vulnerabilities in software are constantly being identified. When these vulnerabilities are made known to the public, they are cataloged as Common Vulnerabilities and Exposures (CVE) records. One basic form of security checking is looking for the use of libraries with known CVE records inside of packages.
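As a sketch of what that kind of check can look like, here is a small Python example that queries the public OSV.dev vulnerability database for a single package version. For Python specifically, tools like pip-audit automate this kind of check across a whole requirements file; this is just to show the idea.

```python
# A sketch of checking one package version against known vulnerabilities
# using the public OSV.dev API.
import requests

def known_vulns(name: str, version: str, ecosystem: str = "PyPI") -> list[str]:
    """Return the advisory IDs (CVE/GHSA/etc.) for a package version, if any."""
    resp = requests.post(
        "https://api.osv.dev/v1/query",
        json={"package": {"name": name, "ecosystem": ecosystem}, "version": version},
        timeout=30,
    )
    resp.raise_for_status()
    return [v["id"] for v in resp.json().get("vulns", [])]

# Example: an older requests release with published advisories.
print(known_vulns("requests", "2.19.0"))
```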
The second thing your organization may care about is the licenses software is released under. They may want to disallow certain licenses – especially aggressive copyleft licenses – from being present in their codebases.
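Here is a hedged sketch of a first-pass license audit of the packages installed in the current Python environment, flagging copyleft licenses for review. License metadata is self-reported by package authors and often lives in the classifiers rather than the License field, so treat this as a screen, not a legal determination.

```python
# A first-pass license audit of the current environment, flagging copyleft
# licenses for review. Not a legal determination.
from importlib.metadata import distributions

COPYLEFT_MARKERS = ("GPL", "AGPL", "LGPL")  # crude substring match

for dist in distributions():
    name = dist.metadata["Name"]
    license_field = dist.metadata["License"] or ""
    classifiers = " ".join(dist.metadata.get_all("Classifier") or [])
    text = f"{license_field} {classifiers}"
    if any(marker in text for marker in COPYLEFT_MARKERS):
        print(f"review: {name} ({license_field or 'see classifiers'})")
```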
11.3 Comprehension Questions
- What is the purpose of creating a data science sandbox? Who benefits most from the creation of a sandbox?
- Why is using infrastructure as code an important prerequisite for doing Dev/Test/Prod?
- What is the difference between permissive and copyleft open source licenses? Why are some organizations concerned about using code that includes copyleft licenses?
- What are the key issues to solve for open source package management in the enterprise?