Introducing TraefikProxy — a scalable and highly available proxy for JupyterHub

March 6, 2019

Georgiana Dolocan  |  Jupyter Blog

In the JupyterHub context, the proxy is the component in charge of directing user requests to their notebook servers.

The proxy manages a list of [user : notebook] mappings (the proxy routing table) in order to decide which request is sent where. The routing table must be continuously updated as users start and stop their servers without disrupting the requests being processed. The following drawing illustrates the proxy functionality in a JupyterHub deployment.

Why the need for a new proxy?

Currently, the default proxy implementation for JupyterHub is configurable-http-proxy (CHP), a single-process Node.js proxy that stores the routing table in memory. CHP is easy to install and run, so in most cases it’s a fine option. However, because you can only run a single copy of the proxy at a time, it has limitations when used in dynamic, large-scale systems.

What makes this new proxy special?

JupyterHub 0.8 opened the way for users to create custom proxy implementations based on their deployment needs. JupyterHub Traefik Proxy leverages this feature to offer an alternative to the default proxy. It is an implementation of the JupyterHub Proxy API based on traefik, an extremely lightweight, portable reverse proxy that supports load balancing and can configure itself automatically and dynamically. JupyterHub Traefik Proxy comes in two flavors, depending on how traefik stores the routing table:

  • TraefikTomlProxy — for smaller, single-node deployments
  • TraefikEtcdProxy — for distributed setups
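To give a feel for what "implementing the JupyterHub Proxy API" involves, here is a toy sketch. It is not code from JupyterHub Traefik Proxy: the class name InMemoryProxy is made up for illustration and it forwards no traffic at all. It only mirrors the three route-management coroutines a custom proxy is expected to provide (add_route, delete_route, get_all_routes) and keeps the table in a plain dict.

```python
import asyncio


class InMemoryProxy:
    """Toy stand-in for a JupyterHub proxy: it only keeps a routing table."""

    def __init__(self):
        # routespec (URL prefix) -> route description
        self._routes = {}

    async def add_route(self, routespec, target, data):
        """Register (or overwrite) the route for routespec."""
        self._routes[routespec] = {
            "routespec": routespec,
            "target": target,
            "data": data,
        }

    async def delete_route(self, routespec):
        """Forget a route; deleting an unknown route is a no-op."""
        self._routes.pop(routespec, None)

    async def get_all_routes(self):
        """Return a copy of the routing table, keyed by routespec."""
        return dict(self._routes)


async def demo():
    proxy = InMemoryProxy()
    # Mary starts her server: the Hub asks the proxy to add a route.
    await proxy.add_route("/user/mary/", "http://10.0.1.5:12345", {"user": "mary"})
    routes_while_running = await proxy.get_all_routes()
    # Mary stops her server: the route is removed again.
    await proxy.delete_route("/user/mary/")
    routes_after_stop = await proxy.get_all_routes()
    return routes_while_running, routes_after_stop


running, stopped = asyncio.run(demo())
print(running)
print(stopped)
```

The two flavors listed above differ mainly in where such a table ends up being stored: in a toml file or in etcd.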

How does it work?

Both TraefikTomlProxy and TraefikEtcdProxy use a toml file for the global configuration. This file contains information about how to set up the connections to the routing table provider (the unit that stores routing rules such as: requests to “/user/mary” should be sent to Mary’s server at http://10.0.1.5:12345) and to the network entry points into Traefik (listening port, SSL, traffic redirection). However, the two proxies go in different directions when it comes to the provider used for storing the routing table.
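To make the notion of a routing rule concrete, the sketch below resolves a request path against a small routing table by picking the longest matching prefix. This is only an illustration of the idea, not how traefik matches requests internally. The resolve function and the fallback route to the Hub at http://127.0.0.1:8081 are assumptions made for the example; only Mary's address comes from the text above.

```python
from typing import Optional

# Hypothetical routing table: URL prefix -> address of the server behind it.
routes = {
    "/": "http://127.0.0.1:8081",           # assumed fallback: the Hub itself
    "/user/mary": "http://10.0.1.5:12345",  # Mary's notebook server
}


def resolve(path: str, table: dict) -> Optional[str]:
    """Return the target of the longest route prefix that matches path."""
    matches = [
        prefix
        for prefix in table
        # Match the prefix exactly or as a whole path segment, so that
        # "/user/maryann" is not mistaken for "/user/mary".
        if path == prefix or path.startswith(prefix.rstrip("/") + "/")
    ]
    if not matches:
        return None
    return table[max(matches, key=len)]


print(resolve("/user/mary/tree", routes))
print(resolve("/hub/login", routes))
```

Matching on whole path segments is a deliberate choice here: a plain string-prefix test would send another user's requests to Mary's server whenever the user names share a prefix.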

TraefikTomlProxy uses a toml file to store the routes and keeps an in-memory copy of it for faster access to the routes. This is appropriate for smaller-scale deployments. For example, The Littlest JupyterHub (the single-node JupyterHub distribution, for a small number of users) just switched from using two proxies (traefik as an edge proxy with Let’s Encrypt support and CHP for routing) to a configuration with just TraefikTomlProxy that serves both requirements. ❤

TraefikEtcdProxy uses etcd, a distributed key-value store, to persist the routing table. This implementation aims to benefit Zero to JupyterHub with Kubernetes because it allows running multiple proxy replicas, which makes the proxy highly available and improves the scalability and stability of the system.

Adding TraefikProxy to the first diagram, the drawing below presents the two proxy flavors with their common parts (the global configuration file, traefik.toml) and their differences (the mechanism used for storing the routing table: a toml file vs. etcd).

Another cool Traefik feature is the Web UI dashboard, which lists all of the registered frontends (the sets of rules that determine how incoming requests are forwarded) and backends (the notebooks), the routing rules, some useful metrics, and other configuration elements. The port on which TraefikProxy’s API runs, as well as the username and password used for authenticating, are all configurable.

Here’s what the dashboard looks like:

How to enable Traefik Proxy?

These instructions help you enable one of the Traefik proxies on your JupyterHub.

1. Install it:

    python3 -m pip install jupyterhub-traefik-proxy

2. Install traefik and etcd:

    python3 -m jupyterhub_traefik_proxy.install --output=/usr/local/bin

This installs the default versions of traefik and etcd (traefik-1.7.5 and etcd-3.3.10) into /usr/local/bin, the directory given by the --output option.

3. Configure JupyterHub to run with TraefikProxy through jupyterhub_config.py, using the proxy_class config option.
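As a minimal sketch of step 3 for the single-node flavor, a jupyterhub_config.py might look like the following. It assumes that TraefikTomlProxy can be imported from the jupyterhub_traefik_proxy package and that the API credential options are named traefik_api_username and traefik_api_password in the installed release; please verify both against the project documentation. The credentials are placeholders.

```python
# jupyterhub_config.py (sketch; check class and option names against the
# project documentation for the version you installed)
from jupyterhub_traefik_proxy import TraefikTomlProxy

c = get_config()  # noqa: F821  (provided by JupyterHub when it loads this file)

# Use the toml-backed traefik proxy instead of the default CHP.
c.JupyterHub.proxy_class = TraefikTomlProxy

# Placeholder credentials for traefik's API and dashboard; replace with your own.
c.TraefikTomlProxy.traefik_api_username = "admin"
c.TraefikTomlProxy.traefik_api_password = "change-me"
```

For a distributed setup, TraefikEtcdProxy would be selected the same way; it also needs settings for reaching etcd, which are not shown here.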

As there are many other proxy configuration options, please check out the project documentation for more info and some example configurations.

What’s next?

TraefikEtcdProxy is soon to be integrated into zero-to-jupyterhub-k8s, the Helm Chart for deploying JupyterHub on Kubernetes. By using the Traefik proxy with etcd, we can eliminate the downtime while the proxy restarts or upgrades, as well as have an all-in-one proxy that can be used for both routing and HTTPS (Let’s Encrypt) support.

Also, a performance analysis of all three proxies is on the way. It will help us understand the advantages and limitations of this new JupyterHub Proxy implementation.

The story behind

JupyterHub Traefik Proxy was developed as part of my Outreachy internship. I am very grateful to have had the chance to work with the Jupyter community, and I’m happy and proud to say that I’ve learned a lot and that the experience has been invaluable. There were a lot of questions and a ton of fears along the way, but the encouragement and guidance I received helped me move forward and finish this great project. So, a big thank you to everyone for their support, and to NumFOCUS and the Berkeley Institute for Data Science (BIDS) for sponsoring this Outreachy round! ❤

Thanks to Chris Holdgraf. This post is cross-posted on the Jupyter Blog.


