12  Upgraded Networking

In Chapter 9, we covered a general introduction to networking, including DNS, how to get and use a real URL, and how to make sure your servers are secure with SSL/HTTPS.

When you’re managing a relatively simple server – especially one that’s low-threat – this is perfectly fine. However, if you’re managing a cluster of several servers, have other security concerns, or operate in a high-threat environment, you may need more complex networking to keep your environment secure and manageable.

In this chapter, we’ll learn a little about how an environment can be further secured from internet traffic.

12.0.1 What does it mean for networking to be more complex?

In our lab, we set up a single server so it could be directly accessed over the internet. You went to the IP address or URL we configured and could get right into the server, assuming you had the right password.

More complex networking generally consists of putting your server inside a private network. We’ll get into the details of exactly how this works, but the general idea is that the server itself lives inside a private network and is not accessible from outside. Anyone wanting to get in has to go through one or more layers of other servers, generally referred to as proxy servers or proxies.

This means that the traffic actually coming to your server only comes from other servers you control, as opposed to from anywhere on the internet. It’s easy to imagine this is more secure, and we’ll get into exactly why below.

12.0.2 What complex networking gets you

If you’re working as part of a larger organization, a server deployed directly on the open internet is almost certainly not an acceptable configuration.

In this chapter, we’ll dig into how more complex network topologies work, but let’s start with why. There are three main things a more complex network gets you – abstraction for ease of use, security, and compliance.

Compliance is the simplest of these. In some fields, you may have actual legal or contractual requirements that your server not be accessible to the public. That means that the configuration we used in the earlier section of this book is simply not acceptable and you’ll have to do something different.

The second thing more complex networking gets you is more useful abstraction. For example, let’s say you wanted to replace your single server data science workbench with a pair of load-balanced servers. In the configuration we currently have, you’d have to spend a lot of time manually adjusting the URLs or IP addresses stored in various other servers. By using more complex network routing and hostnames inside a private network, you can assign a hostname to an IP address in one place and then use that hostname elsewhere, so even if you switch out the actual server, not much has to change in your networking configuration. This probably doesn’t matter much in your current configuration, but gets to be really important when you have a system with many different services, each potentially with several servers.

Lastly, and perhaps most obviously, using more complex networking techniques is one of the best ways to introduce security in layers to the front door of your data science workbench. There are a few different ways more complex networking can protect your server.

The first, and simplest, is that it puts another point of failure between your high-value server and the open internet. Say you are the victim of a distributed denial of service (DDoS) attack – a type of attack where someone who controls a fleet of bots sends unending amounts of traffic to your servers solely to overwhelm them, usually for ransom. It’s not great for your actual data science server to go down from being overwhelmed. If your traffic goes through a proxy, on the other hand, the attack takes down the proxy while your data science server keeps humming along, giving you the opportunity to quickly bring up another proxy or access your server a different way.

The second reason private networking is more secure is that your proxy servers are a natural place to do threat detection. For example, if you’re hosting RStudio Server directly on the open internet and become the target of a DDoS attack, you might not know it’s happening until your server goes down. While threat detection generally doesn’t happen automatically, a proxy server is a great place to put monitoring for various kinds of threats.

Lastly, and most importantly, using a proxy server reduces the attack surface for outside attacks. You can think of it like this – putting your server open on the internet is akin to having a castle in the middle of an open field. It’s probably pretty safe – and if everything goes right, you’re fine. But what if someone accidentally leaves a window open? Or the guard forgets to close the gate at night? You could be in big trouble then.

Putting your valuable server inside a private network is like surrounding that castle with an impregnable wall with only one front door. Now you can ensure that anyone who wants to get in has to come through the front door first. You still don’t want to leave more openings in your server than absolutely necessary, but you can rest a little easier that mistakes configuring your server are unlikely to be catastrophic.

In this case, the castle windows and doors are like the ports on your server. You can control port openings from inside the server itself or by adjusting security groups. But almost everyone I know has, at some point, opened their security groups wide to test or debug something. What happens if you forget to set them back?

12.0.3 What does a secure network look like?

The topology of a more secure network looks, honestly, a lot like a castle. The outer walls of the castle are the boundaries of your private network, which AWS calls a virtual private cloud (VPC). Inside the VPC, you get complete control over the IP addresses, so you can assign hostnames that are only valid within your VPC and not worry about them conflicting with anything in the outside world.

Inside the VPC, you can define subnets, which are smaller areas of the VPC that can be used for different purposes. Subnets are either private, meaning they’re only accessible from inside the VPC, or public, meaning they may be accessible from the outside internet.

How networks are defined

VPCs and subnets are defined by something called a Classless Inter-Domain Routing (CIDR) block. We’re not going to get deep into the details, but the basic idea is that each CIDR block contains a certain number of IP addresses, and you can allot parts of the VPC’s overall CIDR block to each of your subnets.

Each CIDR block is defined by a starting address and the size of the network. A /32 network has only one IP address in it, and lower numbers indicate broader blocks. So 0.0.0.0/16 would indicate the CIDR block that includes all 65,536 IPv4 addresses from 0.0.0.0 to 0.0.255.255.

For IPv4 networks, VPCs should be defined using the IP address ranges reserved for private use – those starting with 10 (10.0.0.0/8), 172.16 through 172.31 (172.16.0.0/12), or 192.168 (192.168.0.0/16).
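
If you want to play with CIDR math yourself, Python’s built-in ipaddress module makes it easy. Here’s a short sketch using an illustrative 10.0.0.0/16 VPC carved into /24 subnets – the specific addresses are just example values.

```python
import ipaddress

# A hypothetical VPC using a /16 block from the private 10.x.x.x range.
vpc = ipaddress.ip_network("10.0.0.0/16")
print(vpc.num_addresses)  # 65536 addresses: 10.0.0.0 through 10.0.255.255

# Carve the VPC into /24 subnets of 256 addresses each.
subnets = list(vpc.subnets(new_prefix=24))
print(len(subnets))  # 256 subnets
print(subnets[0])    # 10.0.0.0/24 -- could be the public subnet (DMZ)
print(subnets[1])    # 10.0.1.0/24 -- could be a private subnet

# Check whether a given server IP falls inside a particular subnet.
print(ipaddress.ip_address("10.0.1.17") in subnets[1])  # True
```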

Your public subnets generally house only the servers you’re using to proxy traffic from the outside world into your private subnets. This is sometimes called a demilitarized zone (DMZ). In most cases, you’ll have at minimum two servers in the DMZ: one or more proxies or load-balancers to handle the incoming HTTPS traffic from the outside world, and a proxy that just passes SSH traffic along to hosts in the private network, often called a bastion host or jump box.

Also in the DMZ lives your NAT gateway, which lets servers in the private subnets make outbound connections to the internet without being reachable from it; the internet gateway attached to your VPC is what allows traffic to flow between the VPC and the internet in the first place.

Private subnets generally host all of the servers that actually do things. Your data science workbench server, your databases, your server for hosting Shiny apps – all of these should live inside the private network. Anything that needs to be accessible from outside should be exposed by setting the right rules on your security groups, proxies, and NAT.

Should you need to access resources that live in another VPC, you can do something called VPC peering, which allows two or more VPCs to essentially act as one. It’s worth noting that the CIDR blocks of peered VPCs can’t overlap, so IP addresses have to be unique across both peered VPCs.

12.1 The role of proxies

A proxy is a server that exists solely as an intermediary to manage the traffic coming to it. Proxies can serve a variety of roles, including acting as a firewall, routing traffic to different places, handling authentication for the servers behind them, or caching data for better performance on frequent queries.

For an IT/Admin, managing proxies is an everyday activity.

Hopefully, as a data scientist, you’ll almost never have to know much about the proxies themselves, but being able to ask intelligent questions – and knowing how proxies might get in the way of what you’re trying to do – can relieve some headaches before they occur.

Network Debugging Tip

If you’re experiencing weird behavior in your data science environment – files failing to upload or download, sessions getting cut off strangely, or data not transferring correctly – issues with a proxy are probably the first thing you should check.

If you’re talking to your IT/Admin about the network where your data science environment sits, one of the most important questions is whether that network has a proxy. There are two different kinds of proxies that are important to know about – forward and reverse. Proxies are discussed from the perspective of being inside the network, so a forward proxy is one that intercepts and does something to outbound traffic, while a reverse proxy is one that does something to inbound traffic. Personally, I much prefer the terms inbound and outbound to forward and reverse, because it’s much easier to remember which is which.

In most cases, proxies sit right on the edge of your network – in the DMZ (public subnet) between your private servers and the open internet. In some cases, there are also proxies inside different segments of the private network. These are generally for the purposes of isolating components as a security measure – like having watertight bulkheads between sections of a ship. If you’ve got a proxy – for example – between your app development environment and where it’s deployed, that’s a reasonably common cause of issues along the way.

Developing a good mental model of where network connections originate is really important in terms of understanding why proxies might be causing you trouble. One of the most helpful things you can do when talking to your IT/Admin about networking issues is help them understand when your data science environment requires an inbound connection and when it requires an outbound connection.

TODO: image inbound vs outbound connection

As we went over in chapter 2-4, network traffic always operates on a call and response model. So whether your traffic is inbound or outbound is dependent on who makes the call. Inbound means that the call is coming from a computer outside the private network directed to a server inside the private network, and outbound is the opposite.

So, basically, anything that originates on your laptop – including the actual session into the server – is an inbound connection, while anything that originates on the server – including everything your code does when it runs there – is an outbound connection.
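
As a concrete (if simplified) illustration, the very same library call is inbound or outbound traffic depending on where it runs. The URLs below are hypothetical placeholders.

```python
import requests

# Run from your laptop against the workbench's public URL: this is inbound
# traffic into the private network (it must pass the reverse proxy).
requests.get("https://workbench.example.com/health", timeout=10)

# Run from a session on the server against a public API: this is outbound
# traffic leaving the private network (it must pass any forward proxy).
requests.get("https://api.example.com/data", timeout=10)
```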

12.1.1 What proxies do

There are a number of different functions proxies can perform – here are a few of the most common.

Proxies – especially reverse/inbound ones – often do redirection. This means the proxy is generally what’s actually available at the public URL of your server. It then redirects traffic along to the actual hostname for your server. Inside your private network, you can name your server whatever you want, so if you’ve got a server named rstudio inside your network, your proxy would know that anyone going to my-org.com/rstudio ends up at the host named rstudio.

One other nice thing proxies can do along with redirection is manage ports more securely. Many services run on a high-numbered port by default to avoid conflicts with other services – for example, RStudio Server runs on port 8787. But remembering which unusual port has to be open for every service can be kind of a pain. It’s much easier to keep just the standard ports (80 for HTTP, 443 for HTTPS, and 22 for SSH) open on your proxy and have the proxy redirect the traffic coming in on 80 to port 8787 on the server running RStudio Server.
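
To make the redirection idea concrete, here’s a deliberately tiny sketch of a proxy that accepts web traffic on one port and forwards it to RStudio Server’s default port 8787 on the same machine. In real environments this job is done by something like nginx, Apache, or a cloud load-balancer, which also handle TLS, websockets, and streaming – this is only meant to show the shape of the thing.

```python
# Toy reverse proxy: listens on one port and forwards requests to a backend
# on a high-numbered port (here, RStudio Server's default 8787). Sketch only.
from http.server import BaseHTTPRequestHandler, HTTPServer
import urllib.request

BACKEND = "http://localhost:8787"  # assumed backend address


class ProxyHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        # Forward the incoming path to the backend and relay the response.
        with urllib.request.urlopen(BACKEND + self.path) as upstream:
            body = upstream.read()
            self.send_response(upstream.status)
            self.send_header(
                "Content-Type",
                upstream.headers.get("Content-Type", "text/html"),
            )
            self.end_headers()
            self.wfile.write(body)


if __name__ == "__main__":
    # Listening on 8080 here; binding to 80 or 443 requires elevated privileges.
    HTTPServer(("0.0.0.0", 8080), ProxyHandler).serve_forever()
```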

There’s a special kind of reverse proxy called a load-balancer, which takes traffic coming to a single URL and sends it not to just one server but to one of a pool of servers. In the scaling chapter, we’ll get more into how to think about pools of servers and how load-balancing works, but the networking part is handled by a load-balancer.

[TODO: image of path rewriting + load-balancing]

Sometimes proxies also terminate SSL. Because the proxy is the last server that’s accessible from the public network, many organizations don’t bother to implement SSL/HTTPS inside the private network, so they don’t have to worry about managing SSL certificates there. This is getting rarer as tooling for managing SSL certificates gets better, but it’s common enough that you might start seeing HTTP addresses if you’re doing server-to-server things inside the private network.

TODO: image of SSL termination

Occasionally, proxies also do authentication. In most cases, a proxy passes any traffic that comes in along to where it’s supposed to go, and if there’s authentication, it happens at the server itself. Sometimes, though, the proxy is where authentication actually happens, so you have to provide your credentials at the edge of the network. Once those credentials have been supplied, the proxy will let you through. Depending on the configuration, the proxy may also add some sort of token or header to your incoming traffic to let the servers inside know that your authentication is good and to pass along your identity for authorization purposes.

TODO: image of auth at proxy
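
As a sketch of what this looks like from the application side, an app sitting behind an authenticating proxy might read the identity the proxy injected from a request header. The header name below (X-Auth-User) is hypothetical – the actual header depends entirely on how your proxy is configured – and Flask is used here purely for illustration.

```python
# Minimal sketch of an app reading an identity header set by an upstream
# authenticating proxy. "X-Auth-User" is a hypothetical header name --
# check what your proxy actually injects.
from flask import Flask, request

app = Flask(__name__)


@app.route("/")
def home():
    user = request.headers.get("X-Auth-User", "anonymous")
    return f"Hello, {user}! The proxy vouched for you."


if __name__ == "__main__":
    app.run(port=5000)
```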

The last thing that proxies can do is just block traffic that isn’t explicitly allowed. If you’ve got a forward proxy in your environment, you may have to work with your IT/Admins to make sure that you’re able to access the resources you need to get your work done. If that’s not an option, you may have to think about how to operate offline, which we’ll address towards the end of the chapter.

12.1.2 Inbound vs outbound proxies

The things you’ll have to think about are very different depending on whether your VPC has an outbound/forward or an inbound/reverse proxy. If you’ve got a reverse proxy (and most enterprise networks do), the main thing you probably have to consider is path-rewriting – generally, the proxy is what’s actually hosted at the public URL of the server, and it then passes people along to the internal hostname of the actual server.

The other big concern is what kinds of connections your proxy supports. For example, many data science app frameworks (including Shiny and Streamlit) use a technology called Websockets for maintaining the connection between the user and the app session. Most modern proxies support Websockets – but some don’t and you’ll have to figure out workarounds if you can’t get Websockets enabled.

Additionally, some inbound proxies have limitations that can make things weird for data science use cases – the most common are limits on file size for uploads and downloads and timeouts on file uploads, downloads, and sessions. It’s reasonably common for organizations to have standard file size limits or timeouts that don’t work well in a data science context, where files tend to be big and session lengths long. If weird things are happening with file uploads or downloads, or sessions are ending unexpectedly, checking the inbound proxy settings is a good first hunch.

Inbound proxies are the thing to think about when something seems to be going wrong with your connection to the server. If you can’t upload or download the files you need directly to the server, or your RStudio session keeps timing out, a reverse proxy is a likely candidate.

On the other hand, forward/outbound proxies are the likely culprit if your code is having trouble running. In general, outbound proxies are simpler than inbound ones. The most common thing they do is simply block traffic from leaving the private network. Many organizations have these proxies to reduce the risk of someone getting in and then being able to exfiltrate valuable resources.

The most common reasons you’d hit a forward/outbound proxy in your code are installing packages from a public repository and making use of an API or other web service that’s outside your private network.
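
If the outbound proxy passes traffic along rather than blocking it outright, your code usually just needs to know the proxy’s address. Most HTTP tooling (pip, curl, requests, and friends) respects the standard HTTP_PROXY/HTTPS_PROXY environment variables; here’s a minimal sketch, where the proxy and API addresses are hypothetical placeholders.

```python
# Sketch: routing outbound requests through a corporate forward proxy.
# The proxy address is hypothetical -- get the real one from your IT/Admin.
import os

import requests

# Option 1: set the standard environment variables, which most tools
# (pip, curl, requests, etc.) pick up automatically.
os.environ["HTTPS_PROXY"] = "http://proxy.example.internal:3128"

# Option 2: pass proxies explicitly for a single request.
resp = requests.get(
    "https://api.example.com/data",
    proxies={"https": "http://proxy.example.internal:3128"},
    timeout=30,
)
print(resp.status_code)
```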

In some cases, ameliorating these issues is as easy as talking to your IT/Admin and asking them to open the outbound proxy to the right server. Especially if it’s an HTTPS-protected URL that serves only one thing – for example, CRAN, PyPI, or the public RStudio Package Manager – it’s generally pretty safe, and many organizations are happy to allow-list a limited number of outbound addresses.

If not, the next section is for you.

12.2 Fully Offline/Airgapped operations

Offline or airgapped environments are quite common in highly regulated industries with strong requirements around data security and governance.

The term airgapped comes from the notion that there is a physical gap – air – between the internet and the environment. In these instances, the servers where your data science environment lives are physically disconnected from the outside world, and if you need to move something into the environment, you have to load it onto a physical drive outside the environment and then walk it in to wherever the environment lives.

However, environments that are this thoroughly airgapped are quite rare. Most airgapped environments are airgapped by proxies.

This means that they sharply limit where inbound connections are allowed to come from – for example perhaps only from machines in the physical building where your company sits.1 The way you get into your airgapped environment usually involves either being in a certain physical location to get on the network or logging into a VPN. If your organization requires offline operations, they almost certainly have ways to give you access to the network. Talk to your IT/Admin.

TODO: image of VPC inside VPN

If your server is offline, it’s likely that your organization strictly limits or completely disallows outbound connections from the servers inside the private network. This is an area where your organization probably doesn’t have standard practices, so getting clear with your IT/Admins about exactly what you need will really help.

Here are the four most common reasons you’ll need to make outbound connections from inside your data science environment.

  • Downloading Packages Downloading a package requires a network connection to the repository – usually CRAN, Bioconductor, the public RStudio Package Manager, Conda, PyPI, or GitHub. If you can get narrow exceptions for these, that’s great! If not, you’ll need to figure out how to run a package repository inside your data science environment, like we discussed in the last chapter – see the sketch after this list for what pointing at an internal repository can look like.

  • Data Science Tasks In many organizations, you don’t need internet access at all for the work you’re doing. You’re just working on data from databases or files inside your private network and don’t really need access to data or resources outside. On the other hand, if you’re consuming data from public APIs or scraping data from the web, that may require external connections.

  • System Libraries In addition to R and Python packages, there are also system libraries you’ll need installed, like the versions of R and Python themselves and other libraries used by the system. Generally, it’ll be the IT/Admin managing and installing these, so they probably have a strategy for doing it. This may come up specifically in a data science context if you’re using R or Python packages that are basically just wrappers around system libraries, like the sf package in R or the GDAL Python package, which wrap the GDAL system library for geospatial work.

  • Software Licensing If you’re using all open source software, this probably won’t be an issue. But if you’re buying licenses to a professional product, you’ll have to figure out how to activate the software licensing, which generally operates by reaching out to servers owned by the software vendor. They should have a method for activating servers that can’t reach the internet, but your IT/Admins will appreciate it if you’ve done your homework on this before asking them to activate some new software.
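
As referenced in the packages bullet above, here’s a sketch of what pointing Python at an internal package repository can look like. The index URL is a hypothetical placeholder for whatever your internal repository (RStudio Package Manager, devpi, Artifactory, etc.) exposes; R users do the analogous thing by setting options(repos = ...) to the internal repository’s URL.

```python
# Sketch: installing Python packages from an internal mirror instead of the
# public PyPI. The index URL is hypothetical -- use whatever your internal
# repository actually exposes.
import subprocess
import sys

subprocess.run(
    [
        sys.executable, "-m", "pip", "install",
        "--index-url", "https://packages.example.internal/pypi/simple",
        "pandas",
    ],
    check=True,
)
```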

Before you go ahead and treat your environment as truly offline/airgapped, it’s almost always worth asking whether narrow exceptions can be made. The answer may surprise you.

If you are truly offline, you probably won’t be able to move things on or off your private servers. Instead, when you need things, the IT/Admin will either connect to a server in the DMZ that has permission to access both the public internet and the private network to load things, or they’ll actually have to download things to their laptop from the internet, connect to the server in the offline environment, and upload them.

TODO: drawing of offline operations

12.3 Comprehension Questions

  1. What is the advantage of adopting a more complex networking setup than a server just deployed directly on the internet? Are there advantages other than security?
  2. Draw a mental map with the following entities: inbound traffic, outbound traffic, proxy, DMZ, private subnet, public subnet, VPC
  3. Let’s say you’ve got a private VPC that hosts an instance of RStudio Server, an instance of JupyterHub, and a Shiny Server that has an app deployed. Here are a few examples of traffic – are they outbound, inbound, or within the network?
    1. Someone connecting to and starting a session on RStudio Server.

    2. Someone SFTP-ing an app and packages from RStudio Server to Shiny Server.

    3. Someone installing a package to the Shiny Server.

    4. Someone uploading a file to JupyterHub.

    5. A call in a Shiny app using httr2 or requests to a public API that hosts data.

    6. Accessing a private corporate database from a Shiny for Python app using sqlalchemy.

  4. What are the most likely pain points for running a data science workbench that is fully offline/airgapped?

  1. You really understood the chapter on networking in section two if you have a pretty clear idea of how an IT/Admin could accomplish this.↩︎