Update (December 14, 20:45 UTC): all services should now be restored.
On December 13, at 22:10 UTC (4:10pm EST), a large number of Jupyter-provided services stopped responding. This included, but was not limited to, https://nbviewer.jupyter.org, https://try.jupyter.org (powered by tmpnb) and https://cdn.jupyter.org. We quickly narrowed this down to an issue with our hosting provider and have been working with them to resolve it as quickly as possible.
When outages happen, the Jupyter Status page shows which services are affected, and we publish updates there.
How are Jupyter services hosted?
To understand the cause of the outage, we need to understand how the Jupyter services are hosted and maintained. As Jupyter is an open organization that is mostly maintained by volunteers, we do not have a dev-ops team assigned to maintaining our infrastructure. Even with full-time developers hired through universities or companies, the time spent fixing infrastructure comes out of nights and weekends. These developers are often stretched thin and cannot be available 24/7.
Most of our cloud infrastructure is donated to us by companies like CloudFlare, Rackspace, Fastly, Google, and Microsoft. Donating resources can be challenging, both technically and legally. In this particular case, Rackspace graciously created a special account for Jupyter that handles invoices on our behalf, thereby making resources free to the project. Following a hiccup, this Jupyter account was suspended, and all services became unavailable as a result.
Temporary resolution
As nbviewer is one of the most used services provided by Jupyter, we’ve moved it to one of our personal accounts at another cloud provider. Fastly is set up to load-balance across the original instances, once they come back up, and this newly created instance, so nbviewer should be working now.
The other services (tmpnb, mails@jupyter.org, cdn.jupyter.org, …) will remain unavailable or highly degraded until a permanent solution is found or the services are restarted. In the meantime, try.jupyter.org will likely redirect to a repo on https://mybinder.org so people can still try out Jupyter.
Low bus factor
The outage of all these services lasted for a significant time (more than 18 hours), which disrupted many of you who rely on them. We understand that this is hardly acceptable, and we hope you’ll bear with us, as these services are provided for free and without ads. One of the factors leading to the slow restoration of service was a relatively low bus factor, with only one and a half of our developers knowing how to deploy and maintain these services. Documentation and access to credentials were also limited.
This is one of the challenges in a distributed team like Jupyter where contributors self-organize. It is easy to forget that new code is not the only way to contribute and that infrastructure and maintenance are crucial.
We also rely too heavily on a single vendor (in this case Rackspace). While we are happy with Rackspace and have no reason to move to another provider, we should have a plan to restore critical services, even temporarily, in case of failure.
A couple of months ago, the subject was brought to our attention, and we developed a plan to move many of our deployments to Kubernetes (which is provider agnostic). We underestimated the likelihood of needing an emergency plan this early.
How can you help?
Jupyter is mainly governed by a community spread all around the world. Contributing is not limited to writing code! We need members with knowledge of multiple languages, design, dev-ops, and more. Whether you are an expert or still learning, we would like you to get involved.
Thanks, everyone, for your patience and for the kind words when you reached out to us upon discovering the services were down.
Incident Report: Jupyter services down was originally published in Jupyter Blog on Medium.