Making it Enterprise-Grade

Over the last few chapters, we walked through getting set up with a server, accessing and managing the server via SSH, understanding DNS and getting a real URL, securing the server with SSL/HTTPS, and right-sizing the server for your needs.

And if you walked through the process with the labs, you’ve got a data science workbench of your very own, running RStudio Server, JupyterHub, and a model-serving API. That’s awesome!

That server is great if you’re working alone on some data science projects you want to host in the cloud or if you’ve got a small team of data scientists in a small organization.

But if you work at a large organization, an organization with more-stringent security needs, or a larger team of data scientists, you’re going to need to start considering some more complex ways of managing your data science environment for security and stability.

Broadly speaking, these concerns fall under the umbrella of enterprise. In this context, Enterprise roughly translates to “a large organization with well-defined IT/Admin responsibilities and roles”. If you’re not in an Enterprise, you likely have the rights and ability to just stand up a data science workbench and use it, as we did in the last section.

If you are in an Enterprise, you probably had to do all that on a personal account, and almost certainly couldn’t connect to your real data sources.

In general, organizations are trying to implement something called the principle of least privilege. The idea is to keep systems safe and stable by giving people the permissions they need to get their work done, and no more. For example, say you’re working in a data science workbench environment. You need a place where you can write code and load data. It might occasionally be useful to have root access so you can make updates to the server yourself. But are you going to get that in an enterprise environment? Almost certainly not, because you don’t need it to get your work done day-to-day. Instead, you’ll have to coordinate with the IT/Admin team when you need something that requires root access.
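If you’re curious where you sit on the privilege spectrum, a couple of standard Linux commands will show you. This is just a quick sketch, assuming a typical Linux server like the one from the labs:

```bash
# Which groups does my account belong to?
groups

# What (if anything) am I allowed to run with elevated privileges?
# On a least-privilege setup, this will often just report that you
# may not run sudo at all.
sudo -l
```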

Doing open source data science in an Enterprise almost certainly means having to work across teams to get your workbench built and having to convince other people that you’ll be a good actor at the end of the day.

The goal of the next few chapters is to help you understand the ways your data science workbench isn’t yet enterprise-grade and how to communicate with the IT/Admins at your organization who are responsible for such things. We’ll get into how to integrate open-source data science into an organization, what more complex network architecture might look like, how auth works, and how to scale your environment.

Hopefully, you won’t have to implement much of this yourself. Instead, reading and understanding these chapters should make you a better partner to the teams at your organization who are responsible for these things. You’ll be equipped with the language and mental models to ask good questions and give informative answers to the questions the IT/Admins have about your team’s requirements.

What enterprise IT is about

As a data scientist, your primary concern about your data science environment is that it’s useful. You want all the data you need right at your fingertips.

Many data scientists in enterprises find that this desire runs headlong into requirements from their organization’s IT/Admin teams.

This can be extremely frustrating, so it’s helpful to understand the concerns enterprise IT/Admins have in mind. Broadly, IT/Admins care about the security and stability of the systems they control.

Great IT/Admin teams also care about the usefulness of the system to its users (that’s you), but it’s usually a distant third. And there is sometimes a tension here. After all, the only system that’s completely secure is the one that doesn’t exist at all.

But that tension isn’t the whole story. Often, there’s a lot to gain by partnering with the IT/Admin team at your organization. You may be primarily focused on getting stuff done minute-to-minute, but a data science platform that is insecure and allows bad actors to break in and steal data is not useful. And one where you can do whatever you want but end up crashing the workbench for 50 other users is ultimately self-defeating.

Balancing security, stability, and usefulness is always about tradeoffs. Great IT/Admin teams are constantly in conversation with the rest of the organization to figure out the right stance given the set of tradeoffs it faces.

Unfortunately, many IT/Admin organizations don’t operate that way; instead, they act as gatekeepers to the resources you need to do your job. That means you’ll have to figure out how to communicate with those teams, understand what matters to them, help them understand what matters to you, and reach acceptable organizational outcomes.

You probably already have a good understanding of how a data science environment can be useful. But what about secure and stable? What do those actually mean?

Security is about making sure that the right people can interact with the systems they’re supposed to and that unauthorized people can’t.

IT security professionals think about security in layers. And while you’ve done a good job setting your server up to comply with basic security best practices, there’s really only one layer. That server’s front door is open to the internet. Literally anyone in the world can come to the authentication page for your RStudio Server or JupyterHub and start trying out passwords. That means you’re just one person choosing the password “password” away from a bad actor getting access to your server.

Lest you think you’re immune because you’re not an interesting target, there are plenty of bots out there randomly trying to break into every existing IP address, not because they care about what’s inside, but because they want to co-opt your resources for their own purposes, like crypto mining or DDoS attacks on Turkish banks.1
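You can see this scanning for yourself. As a rough sketch, assuming a Debian or Ubuntu server where SSH failures are written to /var/log/auth.log, something like this shows how much password-guessing your server has already absorbed:

```bash
# Show the five most recent failed SSH login attempts
sudo grep "Failed password" /var/log/auth.log | tail -n 5

# Count the failed attempts since the log last rotated
sudo grep -c "Failed password" /var/log/auth.log
```

On a server that has been publicly reachable for more than a few hours, that count is rarely zero.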

Moreover, security and IT professionals aren’t just concerned with bad actors from outside (outsider threats) or someone internal who decides to steal data or resources (insider threats). They are also (or at least should be) concerned with accidents and mistakes; data that is accidentally and permanently deleted is bad in much the same way that stolen data is bad.

Stability means ensuring that enterprise-grade systems are available when people need them and that they hold up under whatever load they face in the course of normal operation. The importance of stability tends to rise with the scale of the team and how central its work is to the functioning of the organization.

If you’re a team of three data scientists sitting in the same room, it probably won’t be a huge deal if someone accidentally knocks your data science workbench offline for 30 minutes by running a job that was too big. You can sort it out together and everyone learns something from the experience.

That’s not the case when you get to enterprise-grade tooling. An enterprise-grade data science workbench probably supports dozens or hundreds of professionals across multiple teams. The server being down isn’t a sort-of-funny occurrence you can all fix together; it’s a problem that must be fixed immediately, or better yet, avoided altogether.

IT/Admins think hard about how to provision resources so that servers don’t go down because they’ve hit resource constraints.
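You can get a quick read on how close a server is to those constraints with a handful of standard Linux commands. This is just a sketch of the kind of thing an IT/Admin (or you) might check:

```bash
# How much memory is in use, and how much is left?
free -h

# How full are the disks?
df -h

# How many CPUs are there, and how loaded have they been recently?
nproc
uptime
```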

One thing that is almost certain in an enterprise context is that you will not have root access as a user of the system.

There is no one-size-fits-all (or even one-size-fits-most) position on security. Instead, great security teams are constantly articulating and making decisions about tradeoffs.


  1. Yes, these both really happened.