
Introducing repo2docker


The Binder Project’s repo2docker tool gives data scientists the benefits of containerization technology without needing to learn Docker itself. To make your repository compatible with repo2docker, you only need to add text files that are already present in many repositories. This means that you get the benefits of containerization, a powerful but complex ecosystem, without having to change your workflow.

repo2docker is a lightweight command-line tool written in Python that takes a path or URL to a git repository and builds a suitable Docker image for it. To do so, it follows the same steps a human would:

  1. Inspect the repository for common “configuration” files (like requirements.txt; an example is sketched after this list);
  2. Infer the Docker commands to run from these well-known files; and
  3. Build a Docker image.
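
For example, a plain requirements.txt is often all repo2docker needs to reconstruct a Python environment. The packages and version pins below are purely illustrative:

```
# requirements.txt -- illustrative contents; pinning versions keeps builds reproducible
numpy==1.15.4
pandas==0.23.4
matplotlib==3.0.2
```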

It has a few more tricks up its sleeve, such as automatically installing RStudio for you when it detects that you are using R. Once the image has been built, a Docker container is created and executed, giving you access to the environment in which the repository author wanted the code to be run. To achieve this, you need access to two things: repo2docker and a Docker daemon (which does not necessarily have to run on your local computer).
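
As a quick sketch of those prerequisites: repo2docker itself is installed with pip (the PyPI package is named jupyter-repo2docker), and a reachable Docker daemon is a separate requirement, local or remote:

```
# Install the repo2docker command-line tool (assumes Python and pip are available)
pip install jupyter-repo2docker

# Check that a Docker daemon is reachable (locally, or remotely via DOCKER_HOST)
docker info
```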

The JupyterHub team just released v0.7 of repo2docker, so we decided to spend a bit of time explaining what it’s all about.

An example repo2docker workflow. In this case, repo2docker is invoked locally. repo2docker is passed a URL to a git repository (https://github.com/norvig/pytudes). It then clones the repository, discovers configuration files in the repo (in this case, `requirements.txt`), builds a Docker image with this environment installed, and opens a local Jupyter server to explore and run the contents of the repo.
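
The workflow in the figure boils down to a single command; this is a sketch, and the exact console output will vary:

```
# Build an image from the repository and launch a Jupyter server inside it
jupyter-repo2docker https://github.com/norvig/pytudes
```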

The guiding principles behind repo2docker

repo2docker is meant to be as lightweight and common-sense as possible. The driving principles behind repo2docker are as follows:

  1. Leverage pre-existing workflows in data science as much as possible. This means using standard configuration files (like requirements.txt) instead of requiring people to learn new configuration patterns.
  2. The shareable unit is a repository or directory containing human-readable files, not a single file (like a notebook) or a binary blob (like a built Docker image). This means that humans can inspect and extend other repositories meant for repo2docker, and that they can manually do what repo2docker does automatically. No black box.
  3. Be workflow agnostic. repo2docker supports many languages and user interfaces. It can run arbitrary shell scripts that are baked into the image, or trigger a script each time a person runs the Docker image (see the sketch after this list).
  4. Be extensible and composable. repo2docker should allow for multiple languages, tools, or workflows to be defined in a single GitHub repository. It should also be relatively easy to extend to support new use-cases.
  5. Enable deterministic outputs. We want repo2docker to make it possible for authors to generate the exact same environment from their repository every time, provided that they follow best practices in computational methods (like providing specific version numbers for packages). repo2docker can build a specific commit, tag, or branch of a repository, which allows an image to be built deterministically (also sketched below).
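
As a sketch of principles 3 and 5 (the file names come from the repo2docker documentation; the repository URL and tag below are placeholders): a postBuild script is run once while the image is built, a start script is run each time the container starts, and the --ref flag pins the build to a specific commit, tag, or branch.

```
# Hypothetical repository layout:
#   requirements.txt   - Python dependencies
#   postBuild          - shell script executed once at image-build time
#   start              - shell script executed on every container start
#
# Build the repository at a specific tag so the resulting image is reproducible
jupyter-repo2docker --ref v1.0.0 https://github.com/<user>/<repo>
```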

How can repo2docker be used?

Over the last 18 months, we have been using repo2docker in production to automatically generate images that run repositories for mybinder.org. It is used to build around 1000 unique repositories every week. The core functionality has proven itself and is considered production-ready.

Over the last year, we’ve seen a few major use-cases come out of repo2docker:

First, it can be used as a part of production systems like BinderHub. BinderHub automatically uses repo2docker to build images that run a user’s environment, and lets users share links so that others can interact with the image.

Second, repo2docker can be used to build an image for use with a JupyterHub. For example, teachers have used repo2docker to convert a GitHub repository of course materials into a runnable Docker image that students access via a shared JupyterHub in the cloud.
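
A hedged sketch of that workflow, using placeholder names for the image and repository: build the image without running it, and tag it so that a JupyterHub deployment can reference it.

```
# Build (but do not run) an image for a course repository, tagged for reuse on a hub
jupyter-repo2docker --no-run --image-name myorg/course-env:2019 \
    https://github.com/<org>/<course-materials>
```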

Finally, repo2docker has been used by individuals who wish to build reproducible images from their local work. repo2docker can optionally run a Jupyter server from within the built image, which makes it possible to verify the results of analyses in an environment that was built solely from the configuration files present in the repository.
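
For example, running repo2docker against the current directory rebuilds the environment from the repository’s configuration files alone and opens a Jupyter server inside it (the path is illustrative):

```
# Build and launch the repository in the current working directory
cd ~/projects/my-analysis
jupyter-repo2docker .
```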

What next?

We think that repo2docker serves as a useful tool for the community and that it is an important part of the larger stack of reproducible scientific software. It gives data scientists the benefits of containerization technology without needing to learn a new tool like Docker. It achieves this by being a lightweight command-line tool written in Python that automates the creation of the environment in which the authors of a piece of software wanted it to be executed.

We’d love to see the repo2docker community grow, and for more languages, interfaces, use-cases, and workflows to be supported with repo2docker’s build pack system. Let us know what you think!

repo2docker is primarily maintained by the JupyterHub and Binder teams. If you’d like to get involved with the community or want to learn more about the tool, reach out! The repo2docker repository (https://github.com/jupyterhub/repo2docker) and its documentation (https://repo2docker.readthedocs.io) are good places to start.

Note: some folks might be wondering why we developed repo2docker instead of contributing to a pre-existing containerization tool such as the excellent source2image project. We take the decision to create new open-source tech very seriously, and wrote a blog post about our decision to do so in this case: http://words.yuvi.in/post/why-not-s2i/



