12 Intro to Computer Networks
These days using a computer is basically synonymous with using a computer network. When you visit a website, print a document, or login to your email, you are making use of a computer network.
This chapter is an introduction to how computer networks work. This chapter is heavy on introducing mental models for how network traffic works. There is no lab in this chapter.
The computer network we’re most familiar with is the biggest of them all: The Internet. But there are myriad other networks, like the very small private network of the devices (phones, computers, TVs, etc) connected to your home wifi router, to the somewhat larger VPN (which stands for virtual private network) you might connect to for school or work.
The good news is that computer networks are basically all the same. If you understand one, you understand them all. Each step in the process is governed by a protocol that defines what is valid.
There are two basic levels of protocols that are helpful to understand – the transport + address layer, and the application layer.
Just as you and your friend could communicate in any number of human languages, computers define what a valid message is with an application layer protocol.
There are numerous application layer protocols. Some you will see in this book include SSH
for access to the terminal on a remote machine, (S)FTP
for file transfers, SMTP
for email, LDAP(S)
for authentication and authorization, and websockets
for persistent bi-directional communication, which is what the Shiny and Streamlit packages use.
The most common application layer protocol, used for all standard web traffic, is hypertext transfer protocol (http), which we discussed in some depth in Chapter 4 on using APIs for data science purposes.
For the purposes of this chapter, we’re going to completely set aside physical network infrastructure. Yes, there are wires, cables, satellites, routers, server farms, and more that allow computer networking to work. For the purposes of this chapter, we’re going to focus entirely on the software side of networking.
One cool thing about the cloud is that it allows you to abstract away all of the physical hardware and focus completely on how the software interacts.
12.1 TCP/IP and the mail
Let’s start by imagining you have a penpal, who lives in an apartment building across the country.1
Let’s say you’ve got some juicy gossip to share with your penpal. Because this gossip is so juicy, you’re not going to put it on a postcard. You’d write your letter, put it inside an envelope, address the letter to the right spot, and put it in the mail.
You trust the postal service to take your letter, deliver it to the correct address, and then your friend will be able to read your letter.
The process of actualy getting data from one computer to another is governed by the TCP/IP protocol and is called packet switching.2
When a computer has data to send to another computer, it takes that information and packages it up into a bundle called a packet.3
The data in the packet is like the contents of your letter.
Just as the postal service defines permissible envelope sizes and address formats, the TCP/IP protocol defines what a packet has to look like from the outside, including what constitutes a valid address.
A uniform resource locator (URL) is the way to address a specific resource on a network.
A full URL includes 4 pieces of information: \[\overbrace{\text{https://}}^\text{protocol}\overbrace{\text{example.com}}^\text{address}\overbrace{\text{:443}}^\text{port}\overbrace{\text{/}}^\text{resource}\]
The protocol starts the URL. It is separated from the rest of the URL by ://
.
Each of the rest of the pieces of the URL is needed to get to a specific resource and has a real-world analog.
The address is like the street address of your penpal’s building. It specifies the host where your data should go.4
A host is any entity on a network that can receive requests and send responses. So a server is a common type of host on a network, but so is a router, your laptop, and a printer.
The port is like your friend’s apartment. It specifies which service on the server to address.
Lastly, the resource dictates what resource you want on the server. It’s like the name on the address of the letter, indicating that you’re writing to your penpal, not their mom or sister.
This full URL may look a little strange to you. You’re probably used to just putting a standalone domain like \(google.com\) into your browser when you want to go to a website. That’s because https
, port 443
, and /
are all defaults, so modern browsers don’t make you specify them.
But what if you make it explicit? Try going to https://google.com:443/
. What do you get?
12.2 Ports get you to the right service
A port is a location on a server where a network connection is possible. It’s like the apartment number for the mail. Each port has a unique number 1 to just over 65,000. By default, the overwhelming majority of the ports on the server are closed for security reasons.
Your computer automatically opens ports to make outgoing connections, but if you want to allow someone to make inbound connections – like to access RStudio or a Shiny app on a server – you need to manually configure and open a port.
Any program that is running on a server and that you intend to be accessible from the outside is called a service. For example, we set up RStudio, JupyterHub, and a Plumber API as services in the lab in Chapter 11. Each service lives on a unique port on the server.
Since each service running on a server needs a unique port, it’s common to choose a somewhat random relatively high-numbered port. That makes sure it’s unlikely that the port will conflict with another service running on the server.
For example RStudio Server runs on port 8787
by default. According to JJ Allaire, there’s no special meaning to this port. It was chosen because it was easy to remember and not 8888
, which some other popular projects had taken.
There’s a cheatsheet of commonly used ports at the end of the chapter.
12.3 Assigning addresses
The proper address of a server on a network is defined by the Internet Protocol (IP) and is called an IP Address. IP addresses are mapped to human-friendly domains like \(google.com\) with the Domain Name Service (DNS).
In this chapter, we’re going to set DNS aside and talk exclusively about IP Addresses.
Chapter 13 is all about DNS.
Every host on the internet has a public IP address.
A public IP address is an IP address that is valid across the entire internet. That means that each public IP address is unique across the entire internet. You can think of a public IP address like the public street address of a building.
But many hosts are not publicly accessible on the internet. Many are housed in private networks. A private network is one where the hosts aren’t directly accessible to the public. You can think of a host in a private network like a building inside a gated community. You may still be able to get there from the public, but you can’t just walk up to the building from the street. Instead you need to come in through the specific gates that have been permitted and approach only on the roads that are allowed.
TODO: Image of public IPs like street address, private like cul-de-sac
There are many different kinds of private networks. Some are small and enforced by connection to a physical endpoint, like the private network your WiFi router controls that houses your laptop, phone, TV, Xbox, and anything else you allow to connect to your router. In other cases, the network is a software network. Many organizations have virtual private networks (VPNs). In this case, you connect to the network via software. There may be resources you can only connect to inside your VPN and there also might be limitations about what you can go out and get.
There are a few different reasons for this public/private network split. The first is security. If you’ve got a public IP address, anyone on the internet can come knock on your virtual front door. That’s actually not so bad. What’s more problematic is that they can go all the way around the building looking for an unlocked side door. Putting your host inside a private network is a way to ensure that people can only approach the host through on the pathways you intend.
The second reason is convenience.
Private networks provide a nice layer of abstraction for network addresses.
You probably have a variety of network-enabled devices in your home, from your laptop and phone to your TV, Xbox, washing machine, and smart locks. But from the outside, your house has only one public IP address – the address of the router in your home. That means that your router has to keep track of all the devices you’ve got, but they don’t need to register with any sort of public service just to be able to use the internet.
As we’ll get into in Chapter 13, keeping track of IP Addresses is best left to machines. If you’re managing a complex private network, it’s really nice to give hostnames to individual hosts. A nice feature of a private hostname compared to a public address is that you don’t have to worry if the hostname is unique across the entire internet – it just has to be unique inside your private network.
12.3.1 Firewalls, allow-lists, and other security settings
One of the most basic ways to keep a server safe is to not allow traffic from the outside to come in. Generally that means that in addition to keeping the server ports themselves closed, you’ll also have a firewall up that defaults to all ports being closed.
In AWS, the basic level of protection for your server is called the security group. If you remember, we used a default security group in launching your server. When you want to go add more services to your server, you’ll have to open additional ports both on the server and in the security group.
In addition to keeping particular ports closed, you can also set your server to only allow incoming traffic from certain IP addresses. This is generally a very coarse way to do security, and rather fragile. For example, you could configure your server to only accept incoming requests from your office’s IP address, but what if someone needs to access the server from home or the office’s IP address is reassigned?
One thing that is not a security setting is just using a port that’s hard to guess. For example, you might think, “Well, if I were to put SSH on port 2943 instead of 22, that would be safer because it’s harder to guess!” I guess so, but it’s really just an illusion of security. There are ways to make your server safer. Choosing esoteric port numbers really isn’t it.
12.4 How packets are routed
The way packets travel from one computer to another is called routing. A router is a hardware or software device that route packets.
You can think of routers and public networks as existing in trees. Each router knows about the IP addresses downstream of it and also the single upstream default address.5
TODO: diagram of routers in trees
For example, the router in your house just keeps track of the actual devices that are attached to it. So if you were to print something from your laptop, the data would just go to your router and then to your printer.
On the other hand, when you look at a picture on Instagram, that traffic has to go over the public network. The default address for your home’s router is probably one owned by your internet service provider (ISP) for your neighborhood. And that router’s default address is probably also owned by your ISP, but for a broader network.
So your packet will get passed upstream to a sufficiently general network and then back downstream to the actual address you’re trying to reach.
Meanwhile, your computer just waits for a response. Once the server has a response to send, it comes back using the same technique. Obviously a huge difference between sending a letter to a penpal and using a computer network is the speed. Where sending a physical letter takes a minimum of days, sending and receiving packets over a network is so fast that the delay is imperceptible when you’re playing a multiplayer video game or collaborating on a document online.
12.5 How to recognize an IP address
You’ve probably seen IPv4 addresses many times. They’re four blocks of 8-bit fields (numbers between 0
and 255
) with dots in between, so they look something like 65.77.154.233
.
If you do the math, you’ll realize there are “only” about 4 billion of these. While we can stretch those 4 billion IP addresses to many more devices since most devices only have private IPs, we are indeed running out of public IPv4 addresses.
The good news is that smart people started planning for this a while ago. In the last few years, adoption of the new IPv6 standard has started. IPv6 addresses are eight blocks of hexadecimal (0-9
+ a-f
) digits separated by colons, with certain rules that allow them to be shortened, so 4b01:0db8:85a3:0000:0000:8a2e:0370:7334
or 3da4:66a::1
are both examples of valid IPv6 addresses.
IPv6 will coexist with IPv4 for a few decades and we’ll eventually switch entirely to IPv6. There’s no worry about running out of IPv6 addresses any time soon, because the total quantity of IPv6 addresses is a number with 39 zeroes.
12.5.1 Reserved IP Addresses
Most IPv4 addresses are freely available to be assigned, but there are a few that you’ll see in particular contexts and it’s useful to know what they are.
The first IP address you’ll see a lot is 127.0.0.1
, also known as localhost
or loopback. This is the way a machine refers to itself.
For example, if you open a Shiny app in RStudio Desktop, the app will pop up in a little window along with a notice that says
Listening on http://127.0.0.1:6311
That http://127.0.0.1
is indicating that your computer is serving the Shiny app to itself on the localhost address.
There are also a few blocks of addresses that are reserved for use on private networks, so they’re never assigned in public.
Code | Meaning |
---|---|
127.0.0.1 | localhost or loopback – the machine that originated the request |
192.168.x.x 172.16.x.x.x 10.x.x.x |
Protected address blocks used for private IP addresses. |
You don’t really need to remember these, but it’s very likely you’ve seen an address like 192.168.0.1
or 192.168.1.1
if you’ve ever tried to configure a router or modem for your home wifi.
Now you know why.
12.6 Basic network troubleshooting
Networking can be difficult to manage because there are so many layers where it can go awry. Let’s say you’ve configured a service on your server, but you just can’t seem to access it.
The ping
command can be useful for checking whether your server is reachable on the network. For example, here’s what happens when I ping
the domain where this book sits.
> ping -o do4ds.com
PING do4ds.com (185.199.110.153): 56 data bytes
64 bytes from 185.199.110.153: icmp_seq=0 ttl=57 time=13.766 ms
--- do4ds.com ping statistics ---
1 packets transmitted, 1 packets received, 0.0% packet loss
round-trip min/avg/max/stddev = 13.766/13.766/13.766/0.000 ms
This looks great – it sent 1 packet to the server and got one back. That’s exactly what I want. Seeing an unreachable host or packet loss would be an indication that my networking probably isn’t configured correctly somewhere between me and the server. I generally like to use ping
with the -o
option for sending just one packet – as opposed to continuously trying.
If ping
fails, it means that my server isn’t reachable. The things I’d want to check is that I have the URL for the server correct, that DNS is configured correctly (see Chapter 13), that I’ve correctly configured any firewalls to have the right ports open (Security Groups in AWS), and that any intermediate networking devices are properly configured (see more on proxies in Chapter 17).
If ping
succeeds but I still can’t access the server, curl
is good to check. curl
actually attempts to fetch the website at a particular URL. It’s often useful to use curl
with the -I
option so it just returns a simple status report, not the full contents of what it finds there.
For example, here’s what I get when I curl
CRAN from my machine.
> curl -I https://cran.r-project.org/
HTTP/1.1 200 OK
Date: Sun, 15 Jan 2023 15:34:19 GMT
Server: Apache
Last-Modified: Mon, 14 Nov 2022 17:33:06 GMT
ETag: "35a-5ed71a1e393e7"
Accept-Ranges: bytes
Content-Length: 858
Vary: Accept-Encoding
Content-Type: text/html
The important thing here is that first line. The server is returning a 200
HTTP status code, which means all is well. For more on HTTP status codes and how to interpret them, see Chapter 4.
If ping
succeeds, but curl
does not, it means that the server is up at the expected IP address, but the service is not accessible. At that point, you might check whether the right ports are accessible – it’s possible to (for example) have port 443 or 80 accessible on your server, but not the port you actually need for your service. You also might check on the server itself that the service is running and that it is running on the port you think it is.
If you’re running inside a container, you should check that you’ve properly configured the port inside container to be forwarded to the outside.
12.7 Comprehension Questions
- What are the 4 components of a URL? What’s the significance of each?
- What are the two things a router keeps track of? How does it use each of them?
- Are there any inherent differences between public and private IP addresses?
- What is the difference between an IP address and a port?
- Let’s say you’ve got a server at 54.33.115.12. Draw a mind map of what happens when you try to SSH into the server. Your explanation should include the terms: IP Address, port, 22, default address, router, sever.
12.8 Lab: Making it accessible in one place
Right now, the only way to get to the various services on our server is only possible via an SSH tunnel. That’s fine for you as you’re working with it – but doesn’t work well if you want to share with other folks.
Now, you could just open up the different ports each service is on and have people access them there – RStudio on 8787
, JupyterHub on 8000
, and the API on 8080
. But that’s not really ideal. That would mean your users would have to remember and use those ports.
If you do want to try it to prove to yourself that this “works”, go to your server’s Security Group settings and add custom TCP rules allowing access to ports 8787
, 8000
, and 8080
from anywhere. If you visit $SERVER_ADDRESS:8787
you should get RStudio, similarly with JupyterHub at $SERVER_ADDRESS:8000
, and the API at $SERVER_ADDRESS:8080
.
Ok, now close those ports back up so we can do this the right way.
So instead, we want all the traffic to come in one front door and for users to use convenient subpaths to reach the services. The tool to accomplish this kind of rerouting is called a proxy.
In our case, we’re just going to run a software proxy on our server that reroutes traffic to different ports on our server. We’re going to use Nginx – a very popular open source proxy.
TODO: Image of proxy
Proxies are an advanced networking topic. Most enterprise networks make extensive uses of proxies. For the data science workbench you’re configuring, you may also need to use a proxy because some open source tooling doesn’t support configuring SSL/HTTPS and don’t permit authentication.
Let’s get it configured.
12.8.1 Step 1: Configure Nginx
The first thing we’re going to do is to configure Nginx on our server. Configuring Nginx is pretty straightforward – you install Nginx, put the configuration file into place, and restart the service to pick up the changes. The hard part is figuring out the right configuration. I’ve tested these steps, and they should work for you the first time. But if they don’t, you’re about to learn about the pain of proxy debugging.
Here are the steps to configure your proxy on your server:
- SSH into your server.
- Install Nginx with
sudo apt install nginx
. - Save a backup of the default
nginx.conf
,cp /etc/nginx/nginx.conf /etc/nginx/nginx-backup.conf
.6 - Edit the Nginx configuration with
sudo vim /etc/nginx/nginx.conf
and replace it with:
/etc/nginx/nginx.conf
user www-data;
worker_processes auto;
pid /run/nginx.pid;
include /etc/nginx/modules-enabled/*.conf;
events {
worker_connections 768;
# multi_accept on;
}
http {
map $http_upgrade $connection_upgrade {
default upgrade;
'' close;
}
server {
listen 80;
location /rstudio/ {
# Needed only for a custom path prefix of /rstudio
rewrite ^/rstudio/(.*)$ /$1 break;
# Use http here when ssl-enabled=0 is set in rserver.conf
proxy_pass http://localhost:8787;
proxy_http_version 1.1;
proxy_set_header Upgrade $http_upgrade;
proxy_set_header Connection $connection_upgrade;
proxy_read_timeout 20d;
# Not needed if www-root-path is set in rserver.conf
proxy_set_header X-RStudio-Root-Path /rstudio;
# Optionally, use an explicit hostname and omit the port if using 80/443
proxy_set_header Host $host:$server_port;
}
location /jupyter/ {
# NOTE important to also set bind url of jupyterhub to /jupyter in its config
proxy_pass http://127.0.0.1:8000;
proxy_redirect off;
proxy_set_header X-Real-IP $remote_addr;
proxy_set_header Host $host;
proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
proxy_set_header X-Forwarded-Proto $scheme;
# websocket headers
proxy_set_header Upgrade $http_upgrade;
proxy_set_header Connection $connection_upgrade;
}
location /penguins/ {
proxy_pass http://127.0.0.1:8555/;
proxy_set_header Host $host;
}
}
}
- Test that your configuration is valid
sudo nginx -t
. - Start Nginx with
sudo systemctl start nginx
. If you see nothing all is well.
If you need to change anything, update the config and then restart with sudo systemctl restart nginx
.
12.8.2 Step 2: Open port 80
Now, if you try to go to your server, your browser will spin for a while and nothing will happen. That’s because the AWS security group still only allows SSH access on port 22
. We need to add a rule that will allow HTTP access on port 80
.
On the AWS console page for your instance, find the Security section and click into the security group for your instance. You want to add a new inbound HTTP rule that allows access on port 80
from anywhere. Make sure not to get rid of the rule that allows SSH access on 22
. You still need that one too.
Once you do this, you should be able to visit your server address and get the default Nginx landing page.
12.8.3 Step 3: Configure RStudio Server and JupyterHub to be on a subpath
Complex web apps like RStudio and JupyterHub frequently reroute people back to themselves. In general, they assume that they’re on the root path /
. That’s not true in this case, so we’ve got to let them know about the subpath where they’re actually located.
Configuring RStudio Server is already done. The X-RStudio-Root-Path
line in the Nginx configuration adds a header to each request coming through the proxy that tells RStudio Server that it’s on the /rstudio
path.
Jupyter needs an explicit update to its own configuration to let it know that it’s on a subpath. Luckily it’s a very simple change. You can edit the Jupyter configuration with
Find the line that reads # c.JupyterHub.bind_url = 'http://:8000'
.
You can search in vim from normal mode with / <thing you're searching for>
. Go to the next hit with n
.
Delete the #
to uncomment the line and add the subpath on the end. If you’re using the /jupyter
subpath and the default 8000
port, that line will read c.JupyterHub.bind_url = 'http://:8000/jupyter'
.
JupyterHub should pick up the new config when it’s restarted with
12.8.4 Step 4: Try it out!
Now we should have each service configured on a subpath. RStudio Server at /rstudio
, JupyterHub at /jupyter
, and our machine learning API at /penguins
. For example, with my server at ec2-54-159-134-39.compute-1.amazonaws.com
, I can get to RStudio Server at http://ec2-54-159-134-39.compute-1.amazonaws.com/rstudio
.
As of this writing, the machine learning API serves itself off of /__docs__/
, so you actually won’t be able to access anything interesting at /penguins
. Instead, you’ll find the API at /penguins/__docs__/
.
This should be fixed before final publication of this book.
Note that right now, this server is on HTTP, which is not a best practice. In fact, it’s such a bad practice that your browser will probably autocorrect the url to start with https
and it won’t work. You’ll have to manually correct it to http
. Don’t leave it like this for long – make sure to make sure to configure https
in Chapter 14 before doing anything real on this server.
12.8.5 Lab Extensions
If you’ve gone to the bare URL for your server, you’ve probably noticed that it’s just the default Nginx landing page, which is not very attractive.
You might want to create a landing page with links to the subpath by serving a static html page off of /
. Or maybe you want one of the services at /
and the others at a different subpath.
If you want to change the subpaths, the location
lines in the nginx.conf
define the subpaths where the services will be available. By changing those locations, you can change the paths where the services live or you could add another that serves a static web page.
You also could add a different service at a different path. Note that the proxy_path
lines define the port where the service is running on the server. Depending on the service you’re configuring, there may be other configuration you’ll have to do, but that will vary on a service-by-service basis.
12.8.6 Cheatsheet: Special Ports
All ports below 1024 are reserved for common server tasks, so you can’t assign services to low-numbered ports.
There are also three common ports that will come up over and over. These are handy because if you’re using the relevant service, you don’t have to indicate if it’s using the default port.
Protocol | Default Port |
---|---|
HTTP | 80 |
HTTPS | 443 |
SSH | 22 |
For this analogy to work, everyone lives in apartment buildings rather than standalone houses.↩︎
TCP and IP are actually two separate protocols at different layers of the protocol stack. But they’re so closely linked that we can talk about them as one.↩︎
One of the biggest ways the mail is not like packet switching is that your message gets chopped up among lots of different packets, which are routed independently, and are reassembled when they get where they’re going. Works well for computers, not so well for real-world mail. It’s also pretty much an irrelevant detail, since this whole process works quite invisibly.↩︎
Often it’s not a single server at the address – but it behaves like one. It could instead be a proxy in front of a cluster of servers or even more complicated routing. All that matters is that it behaves like a single server from the outside.↩︎
There are actually a few different types of addresses used to do this. IP addresses are used for identifying network resources and the MAC address used for physical hardware. Your router is also responsible for assigning IP addresses to devices as they join the network via the dynamic host configuration protocol (DHCP). I’m glossing over all these details as they’re immaterial to the understanding important for this chapter.↩︎
This is generally a good practice before you start messing with config files. Bad configuration is usually preferable to a service that can’t start at all because you’ve messed up the config so badly. It happens.↩︎