Working with IT

Moving Data Science to A Server

In recent years, as data science has become more central to organizations, many have been moving their operations off of individual contributors’ laptops and onto centralized servers. Depending on your organization, the centralization of data science operations can make your life way easier – or it can be kinda a bummer.

Server migrations can work well regardless of whether they’re instigated by the data science or the IT organization. The biggest determinant is how well the data science and IT/DevOps teams can collaborate.

Data scientists are good at manipulating and using data, but most have little expertise in SysAdmin work, and aren’t really that interested. On the flip side, IT/DevOps organizations usually don’t really understand data science workflows, the data science development process, or how data scientists use R and Python.

Often, migrations to a server are instigated by the data scientists themselves – usually because they’ve run out of horsepower on their laptops. If you, or one of your teammates, enjoys and is good as SysAdmin work, this can be a great situation! You get the hardware you need for your project quickly and with minimal interference.

On the other hand, most data scientists don’t really want to be SysAdmins, and these systems are often fragile, isolated from other corporate systems, and potentially susceptible to security vulnerabilities.

Other organizations are moving to servers as well, but led by the IT group. For many IT groups, it’s way easier to maintain a centralized server environment, as opposed to helping each data scientist maintain their own environment on their laptop.

Having just one platform makes it much easier to give shared access to more powerful computing platforms, to data sources that require some configuration, and to R and Python packages that wrap around system libraries and can be a pain to configure (looking at you, rJava).

This can be a great situation for data scientists! If the platform is well-configured and scoped, you can get instant access through their web browser to more compute resources, and don’t have to worry about maintaining local installations of data science tools like R, Python, RStudio, and Jupyter, and you don’t need to worry about how to connect to important data sources – those things are just available for use.

But this can also be a bad experience. Long wait times for hardware or software updates, overly restrictive policies – especially around package management – and misunderstandings of what data scientists are trying to do on the platforms can lead to servers going largely unused.

So much of whether the server-based experience is good or not depends on the relationship between the data science and IT/Admin group. In organizations where these groups work together smoothly, this can be a huge win for everyone involved. However, there are some organizations where IT/Admins are so concerned with stability and security that they make it impossible to do data science, and the data scientists spend all their time playing cat-and-mouse games to try to get work done behind IT/Admin’s backs.

If you work at such a place, it’s frankly hard to get much done on the server. It’s probably worth investing some time into improving your relationship with your favorite person on the IT/Admin team. Hopefully, this book will help you understand a little of what’s on the minds of people in the IT group, and a sense of how to talk to them better.

As data science increasingly moves to the cloud, it’s helpful to have specific cloud recommendations. That’s why this book will go pretty deep into how to configure the things we’re discussing inside AWS. You always have the option of using other clouds, and any conceptual information you learn in this book will 100% apply, but trying to provide detailed directions across more than 1 cloud provider is just too difficult.