graph TD A[Browser] -->|Request| B[SSL Certificate] B --> |Sent| A C[Certificate Authority] --> |Verifies| B A --> |Consults| C
Throughout the book to this point, we’ve been talking a lot about the http
application layer protocol. But these days you rarely see http
when you go to a website. Instead, you see https
.
https
is the secured version of http
. It’s a mashup of http
and secure sockets layer/transport layer security (SSL/TLS). It’s one of a number of application layer protocols that are just secured versions of other protocols, like (S)FTP and LDAP(S).
These days, almost everyone actually uses the successor to SSL, transport layer security (TLS). In fact, we’re now on the 4th version to TLS – version 1.3.
However, because the experience of configuring TLS is identical to SSL, admins usually just talk about configuring SSL even when they mean TLS.
Securing your website or server using SSL/TLS is one of the most basic things you can do to make sure your website traffic is safe. You should always configure https
for a public website – full stop.
In this chapter we’ll go through why using https
is so important, how it works, and how to configure https
for yourself on your website.
https
solvesFor a long time, most web traffic was just plain http
traffic.
If you recall, there used to be a lot of warnings around that you shouldn’t just connect to and use the WiFi at your neighborhood Starbucks. That was (mostly) because of the risks of using plain http
traffic.
Over the last 10 years or so, almost all web traffic has migrated to https
, particularly since 2015 when Google Chrome began the process of marking any site using plain http
as insecure. These days it’s actually pretty safe to use any random WiFi network you want – because of https
.
The biggest risk with those public WiFi networks was that someone could compromise the router. If the router was compromised, there were two things a bad actor could do.
http
has no way to verify that the website you think you’re interacting with is actually that website. So a bad actor could mess with the router’s DNS resolution, and make it so that \(bankofamerica.com\) went to a lookalike website to capture your banking login. That’s called a man-in-the-middle attack.
http
traffic is also completely unencrypted. So that packet you’re sending would be just as readable by someone who installed spyware on the router that’s passing it along as by the website you intended to send it to. This is called a packet sniffing attack.
Both of these risks are defanged when you’re using https
rather than plain http
.
https
do?
Using SSL/TLS helps a lot with security – but it’s worth noting that it ends at the domain. Using https
validates that you’ve gotten to the real example.com
if that’s the website you think you’re visiting. It doesn’t in any way validate who owns example.com
, whether they’re reputable, or whether you should trust them.
There are higher levels of SSL certification that do validate that, for example, the company that owns google.com
is actually the company Google.
Here’s the process of connecting to a website that uses https
. Before your browser starts exchanging any real data with the website, the website sends over its signature, which is generated with its private certificate.
Your computer checks this signature with a trusted certificate authority (CA) who has a copy of the site’s public certificate. Once your browser has verified that the signature is valid, you know that you’re actually communicating with \(bankofamerica.com\) and you’re not victim to a man-in-the-middle attack.
Once the verification process is done, your browser then establishes an encryption routine with the website on the other end. Only then does it start sending real data – now encrypted so that only the website on the other end can read the contents.
graph TD A[Browser] -->|Request| B[SSL Certificate] B --> |Sent| A C[Certificate Authority] --> |Verifies| B A --> |Consults| C
https
There are three steps to configuring https
on a website you control – getting an SSL certificate, putting the certificate on the server, and making sure the server only accepts https
traffic.
It used to be that getting an SSL certificate was somewhat of a pain. While it could cost less than $10 per year to get a basic SSL certificate for particular subdomains, getting a wildcard certificate that covered all the subdomains of a root domain could be quite pricey.
The main alternative at the time was to use a self-signed cert. With a self-signed cert, you would create a certificate yourself and add the certificate directly to any machine that needed to access that site. This is a pain. You need to share the certificate with everyone and get them to install it on their machine. Certificates do expire and so this would have to be periodically repeated. The alternative is just to ignore the scary warning that you might be the victim of a man-in-the-middle attack – and some software doesn’t even allow that.
Luckily there’s now another option. With the rise of https
everywhere, there’s now an organization called letsencrypt. It’s a free CA that issues basic SSL/TLS certificates for free. They even have some nice tooling that makes it super easy to create and configure your certificate right on your server.
When you’re talking about communications that run entirely inside private networks, you have a choice to make. Large organizations often run their own private CA that verifies all the certificates inside your private network.
This is a bit of a hassle, since you’ve got to maintain the CA and configure every host inside the network to trust your private CA. Many organizations decide they’re fine with plain http
communication within private networks.
Either way, once you’ve configured your server with https
, you generally want to shut down access via plain http
, usually by redirecting http
traffic, which comes in on port 80
to come in over https
via port 443
.
http
and how does https
mitigate that?We’re going to use letsencrypt’s certbot utility to automatically generate an SSL certificate, share it with the CA, install it on the server, and even update your nginx configuration.
For anyone who’s never dealt with self-signing certificates in the past, let me tell you, the fact that it does this all automatically is magical!
In our nginx configuration, we’ll need to add a line certbot will use to know which site to generate the certificate for.
Somewhere inside the nginx configuration’s server block, add
/etc/nginx/nginx.conf
server_name do4ds-lab.shop www.do4ds-lab.shop;
substituting in your domain for mine. Make sure to do both the bare domain and the www
subdomain.
At that point, you can just install certbot and let it do its thing!
If you google “configure nginx with letsencrypt”, there’s a great article by someone at Nginx walking you through the process.
As of this writing, that was as simple as running these lines
> sudo apt-get install certbot python3-certbot-nginx
> sudo systemctl restart nginx
> sudo certbot –nginx -d do4ds-lab.shop -d www.do4ds-lab.shop
Before you move along, I’d recommend you take a moment and inspect the /etc/nginx/nginx.conf
file to look at what certbot added.
You’ll notice two things. You’ll notice that the listen 80
is gone from inside the server block. We’re no longer listening for http
traffic.
Instead there’s a listen 443
– the default port for https
, and a bunch of stuff that tells nginx where to find the certificate on the server.
Scrolling down a little, there’s a new server block that is listening on 80
. This block returns a 301
, which you might recall is a redirect code (specifically for a permanent redirect) sending all http
traffic to https
.
Before we exit and test it out, let’s do one more thing. RStudio does a bunch of sending traffic back to itself. For that reason, the /rstudio
proxy location also needs to know to upgrade traffic from http
to https
.
So add the following line to the nginx config:
/etc/nginx/nginx.conf
proxy_set_header X-Forwarded-Proto https;
This line adds a header to the traffic the proxy forwards specifying the protocol to be https
. RStudio Server uses this header to know that it should send traffic back to itself using https
, not http
.
Ok, now try it by going to the URL your server is hosted at, and you’ll find that…it’s broken again.
Before you read along, think for just a moment. Why is it broken?
If your thoughts went to something involving ports and AWS security groups, you’re right!
By default, our server was open to SSH traffic on port 22
. Since then, we may have opened or closed port 80
, 8000
, 8787
, and/or 8555
.
But now that we’re exclusively sending https
traffic into the proxy on 443
and letting the proxy redirect things elsewhere. So you have to go into the security group settings and change it so there are only 2 rules – one that allows SSH traffic on 22
from anywhere and one that allows https
traffic on 443
from anywhere.
If you decided to only configure one service on the server, you could have a much simpler setup. Neither RStudio Server nor Plumber support direct configuration of a SSL certificate, so you would still need a proxy – though it could be a much simpler configuration that just passes traffic at the root /
on to your service at a different port.
JupyterHub can be configured directly with an SSL certificate, so you would just configure JupyterHub to be aware of the SSL certificate and put itself on port 443
.
NOW you should be able to go to <your-domain>/rstudio
and get to RStudio, <your-domain>/jupyter
to get to JupyterHub, and <your-domain>/palmer
to get the API! Voila!
And you should still be able to SSH into your server on your domain.
At this point your server is fully configured. You have a server hosting three real data science services available on a domain of your choosing protected by https
. You can really use this server to do real data science work.
Now, that’s not to say this is fully enterprise-ready. If you’re working on a small team or at a small organization, this may be sufficient. But if you’re working at a large organization, your IT/Admin group is going to have other concerns. Section 4 of this book gets into some of those concerns and how you might discuss them with the relevant teams.