A farewell to the Berkeley Institute for Data Science

May 25, 2018

In February 2015, I joined the UC Berkeley Institute for Data Science (BIDS) in a very unusual position: I got to focus full-time on making the open Python ecosystem work better for scientists. My contract is ending in a bit over a month, so I'm currently thinking about what's next. But in this post I want to instead look back on what this unique opportunity allowed me to do, both as a kind of personal post-mortem and in the hopes that it might be of interest to people and institutions who are thinking about different models for funding open source and open science. In particular, there's also a follow-up post discussing some implications for software sustainability efforts.

A BIDS retrospective

Might as well start with the worst part: in late 2016 I came down with some serious health issues, and have been on partial disability leave since then. This has been gradually getting better – cross your fingers for me. But it does mean that despite the calendar dates, in terms of hours worked I've only been at BIDS for 2 years and change.

But I'm pretty proud of what I accomplished in that time. There were four main projects I led while at BIDS, which I'll discuss in individual sections below. And to be clear, I'm certainly not claiming exclusive credit for any of these – they all involved lots of other people, who together did way more than I did! But I think it's fair to say that these are all projects where I played a critical role in identifying the issues and finding a way to push the community towards solving them, and that if BIDS hadn't funded my position then none of these things would have happened.

Revitalizing NumPy development

NumPy is so central to numerical work in Python, and so widely used in both academia and industry, that many people assume that it must receive substantial funding and support. But it doesn't; in fact for most of its history it's been maintained by a small group of loosely-organized, unpaid volunteers. When I started at BIDS one of my major goals was to change that, ultimately by getting funding – but simply airdropping money into a community-run OSS project doesn't always produce good results.

So the first priority was to get the existing maintainers on the same page about where we wanted to take the project and how funding could be effectively used – basically paying down "social debt" that had accumulated during the years of under-investment. I organized a developer meeting, and based on the discussions there (and with many other stakeholders) we were ultimately able to get consensus around a governance document (latest version) and technical roadmap. Based on this, I was able to secure two grants totaling $1.3 million from the Moore and Sloan foundations, and we've just finished hiring two full-time NumPy developers at BIDS.

I have to pause here to offer special thanks to the rest of the NumPy grant team at BIDS: Jonathan Dugan, Jarrod Millman, Fernando Pérez, Nelle Varoquaux, and Stéfan van der Walt. I didn't actually have any prior experience with writing grant proposals or hiring people, and initially I was on my own figuring this out, which turned out to be, let's say, challenging... especially since I was trying to do this at the same time as navigating my initial diagnosis and treatment. (It turns out not all buses have wheels.) They deserve major credit for stepping in and generously contributing their time and expertise to keep things going.

Improving Python packaging (especially for science)

Software development, like science in general, is an inherently collaborative activity: we all build on the work of others, and hopefully contribute back our own work for others to build on in turn. One of the main mechanisms for this is the use and publication of software packages. Unfortunately, Python packaging tools have traditionally been notoriously unfriendly and difficult to work with – especially for scientific projects that often require complex native code in C/C++/Fortran – and this has added substantial friction to this kind of collaboration. While at BIDS, I worked on reducing this in two ways: one for users, and one for publishers.

On the package user side, conda has done a great deal to relieve the pain... but only for conda users. For a variety of reasons, many people still need or prefer to use the official community-maintained pip/PyPI/wheel stack. And one major limitation of that stack was that you could distribute pre-compiled packages on Windows and macOS, but not on the other major OS: Linux. To solve this, I led the creation of the "manylinux" project. This has dramatically improved the user experience around installing Python packages on Linux servers, especially the core scientific stack. When I ran the numbers a few weeks ago (2018-05-07), ~388 million manylinux packages had been downloaded from PyPI, and that number was growing by ~1 million downloads every day, so we're almost certainly past 400 million now. And if you look at those downloads, scientific software is heavily represented: ~30 million downloads of NumPy, ~15 million SciPy, ~15 million pandas, ~12 million scikit-learn, ~8 million matplotlib, ~4 million tensorflow, ... (Fun fact: a back of the envelope calculation[1] suggests that the manylinux wheels for SciPy alone have so far prevented ~90 metric tons of CO2 emissions, equivalent to planting ~2,400 trees.)

So manylinux makes things easier for users. Eventually, users become developers in their own right, and want to publish their work. And then they have to learn to use distutils/setuptools, which is... painful. Distutils/setuptools can work well, especially in simple cases, but their design has some fundamental limitations that make them confusing and difficult to extend, and this is especially problematic for any projects with complex native code dependencies or that use NumPy's C API, i.e. scientific packages. This isn't exactly distutils's fault – its design dates back to the last millennium, and no-one could have anticipated all the ways Python would be used over the coming decades. And Python's packaging maintainers have done a heroic job of keeping things working and incrementally improving on extremely minimal resources. But often this has meant piling expedient hacks on top of each other; it's very difficult to revisit fundamental decisions when you're an all-volunteer project struggling to maintain critical infrastructure with millions of stakeholders. And so fighting with distutils/setuptools has remained a rite of passage for Python developers. (And conda can't help you here either: for builds, conda packages rely on distutils/setuptools, just like the rest of us.)

Another of my goals while at BIDS was to chart a path forward out of this tangle – and, with the help of lots of folks at distutils-sig (especially Thomas Kluyver, whose efforts were truly heroic!), we now have one. PEP 518 defines the pyproject.toml file and for the first time makes it possible to extend distutils/setuptools in a reasonable way (for those who know setup.py: this is basically setup_requires, except it works). This recently shipped in pip 10. And PEP 517 isn't quite implemented yet, but soon it will make it easy for projects to abandon distutils/setuptools entirely in favor of tools that are easier to use or better prepared to handle demanding scientific users, making software publication easier and more accessible to ordinary scientists.
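
To give a sense of the mechanism: PEP 518's change is a small TOML file at the project root that declares what must be installed before the build starts. A minimal sketch (NumPy is shown here as a hypothetical build dependency, for projects compiling against its C API):

```toml
# pyproject.toml -- the file defined by PEP 518.
# pip reads this and installs the listed packages into an isolated
# environment *before* running the build, so build dependencies are
# guaranteed to be present -- unlike the old setup_requires mechanism.
[build-system]
requires = ["setuptools", "wheel", "numpy"]
```

Because the requirements are declared statically rather than inside setup.py, pip can see them without executing any project code first, which is what made this impossible to do reliably before.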

The Viridis colormap

When I started at BIDS, matplotlib still used the awful "jet" colormap by default, despite probably dozens of peer-reviewed articles pointing out how rainbow colormaps like "jet" distort users' understanding of their data, create barriers to accessibility, and lead to bad decisions, including (for example) unnecessary medical diagnostic errors. So I suggested to Stéfan that we fix this. This was an interesting challenge, with two parts: first, the computational challenge of building a set of tools to visualize and design better colormaps, and second and more importantly, the social challenge of convincing people to actually use them. After all, there have been many proposals for better colormaps over the years. Most of them sank without a trace, and it was entirely possible that our colormap "viridis" would do the same.

This required working with the matplotlib community to first find a socially acceptable way to make any changes at all in their default styles – here my suggestion of a style-change-only 2.0 release proved successful (and ultimately led to a much-needed broader style overhaul). Then we had the problem that there are many perfectly reasonable colormaps, and we needed to build consensus around a single proposal without getting derailed by endless discussion – avoiding this was the goal of a talk I gave at SciPy 2015.

In the end, we succeeded beyond our wildest expectations. As of today, my talk's been watched >85,000 times, making it the most popular talk in the history of the SciPy conference. Viridis is now the default colormap in matplotlib, octave, and parts of ggplot2. Its R package receives hundreds of thousands of downloads every month, which puts it comfortably in the top 50 most popular R packages. Its fans have ported it to essentially every visualization framework known to humankind. It's been showcased in Nobel-prize winning research and NASA press releases, and inspired stickers and twitter bots and follow-ups from other researchers.
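
For readers who want to verify their own setup: since matplotlib 2.0, no configuration is needed to get viridis. A small sketch (the Agg backend and output filename are arbitrary choices for this example):

```python
import matplotlib
matplotlib.use("Agg")  # headless backend; works on servers and CI
import matplotlib.pyplot as plt
import numpy as np

# On matplotlib >= 2.0 the default image colormap is viridis:
print(matplotlib.rcParams["image.cmap"])  # "viridis"

# Plot some data without naming a colormap -- viridis is used automatically.
data = np.linspace(0, 1, 400).reshape(20, 20)
fig, ax = plt.subplots()
im = ax.imshow(data)
fig.colorbar(im)
fig.savefig("viridis_demo.png")
```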

On the one hand, it's "just" a colormap. But it feels pretty good to know that every day millions of people are gaining a little more understanding, more insight, and making better decisions thanks to our work, and that we've permanently raised the bar on good data visualization practice.

Making concurrent programming more accessible

Here's a common problem: writing a program that does multiple things concurrently, either for performance or as an intrinsic part of its functionality – from web servers handling simultaneous users and web spiders that want to fetch lots of pages in parallel, to Jupyter notebooks juggling multiple backend kernels and a UI, to complex simulations running on HPC clusters. But writing correct concurrent programs is notoriously challenging, even for experts. This is a challenge across the industry, but felt particularly acutely by scientists, who generally receive minimal training as software developers, yet often need to write novel high-performance parallel code – since by definition, their work involves pushing the boundary of what's possible. (In fact Software Carpentry originally "grew out of [Greg Wilson's] frustration working with scientists who wanted to parallelize complex programs but didn't know what version control was...".)

Over the last year I've been developing a new paradigm for making practical concurrent programming more accessible to ordinary developers, based on a novel analysis of where some of the difficulties come from, and repurposing some old ideas in language design. In the course of this work I've produced a practical implementation in the Python library Trio, together with a series of articles, including two discussing the theory behind the core new language constructs.

This last project is a bit different than the others – it's more in the way of basic research, so it will be some time before we know the full impact. But so far it's attracting quite a bit of interest across the industry and from language designers (for example) and I suspect that either Trio or something very like it will become the de facto standard library for networking and concurrency in Python.

Other work

Some other smaller things I did at BIDS, besides the four major projects discussed above:

  • Was elected as an honorary PSF Fellow, and to the Python core developer team.
  • Wrote up feedback for the BLAS working group on their proposal for a next generation BLAS API. The BLAS is the set of core linear algebra routines that essentially all number-crunching software is built on, and the BLAS working group is currently developing the first update in almost two decades. In the past, BLAS has been designed mostly with input from traditional HPC users running Fortran on dedicated clusters; this is the first time NumPy/SciPy have been involved in this process.
  • Provided some assistance with organizing the MOSS grant that funded the new PyPI.
  • Created the h11 HTTP library, and came up with a plan for using it to let urllib3/requests and downstream packages join the new world of Python async concurrency.
  • Had a number of discussions with the conda team about how the conda and pip worlds could cooperate better.
  • And of course lots of general answering of questions, giving of advice, fixing of bugs, triaging of bugs, making of connections, etc.

...and the ones that got away

And finally, there are the ones that got away: projects where I've been working on laying the groundwork, but ran out of time before producing results. I think these are entirely feasible and have transformative potential – I'm mentioning them here partly in hopes that someone picks them up:

PyIR: Here's the problem. Libraries like NumPy and pandas are written in C, which makes them reasonably fast on CPython, but prevents JIT optimizers like PyPy or Numba from being able to speed them up further. If we rewrote them in Python, they'd be fast on PyPy or Numba, but unusably slow on regular CPython. Is there any way to have our cake and eat it too? Right now, our only solution is to maintain multiple copies of NumPy and other key libraries (e.g. Numba and PyPy have both spent significant resources on this), which isn't scalable or sustainable.

So I organized a workshop and invited all the JIT developers I could find. I think we came up with a viable way forward, based around the idea of a Cython-like language that generates C code for CPython, and a common higher-level IR for the JITs, and multiple projects were excited about collaborating on this – but this happened literally the week before I got sick, and I wasn't able to follow up and get things organized. It's still doable though, and could unlock a new level of performance for Python – and as a bonus, in the long run it might provide a way to escape the "C API trap" that currently blocks many improvements to CPython (e.g., removing the GIL).

Telemetry: One reason why developing software like NumPy is challenging is that we actually have very little idea how people use it. If we remove a deprecated API, how disruptive will that be? Is anyone actually using that cool new feature we added? Should we put more resources into optimizing module X or module Y? And what about at the ecosystem level – how many users do different packages have? Which ones are used together? Answering these kinds of questions is crucial to providing responsible stewardship, but right now there's simply no way to do it.

Of course there are many pitfalls to gathering this sort of data; if you're going to do it at all, you have to do it right, with affirmative user consent, clear guidelines for what can be collected and how it can be used, a neutral non-profit to provide oversight, shared infrastructure so we can share the effort across many projects, and so on. But these are all problems that can be solved with the right investment (about which, see below), and doing so could radically change the conversations around maintaining and sustaining open scientific software.

What next?

So there you have it: that's what I've been up to for the last few years. Not everything worked out the way I hoped, but overall I'm extremely proud of what I was able to accomplish, and grateful to BIDS and its funders for providing this opportunity.

As mentioned above, I'm currently considering options for what to do next – if you're interested in discussing possibilities, get in touch!

Or, if you're interested in the broader question of sustainability for open scientific software, I wrote a follow-up post trying to analyze what it was about this position that allowed it to be so successful.


[1] Assuming that the people installing SciPy manylinux wheels would instead have built SciPy from source (which is what pip install scipy used to do), that building SciPy takes 10 minutes, and that during that time the computer consumes an extra 50 W of power, then we can calculate 10 minutes * 50 W / 60 minutes/hour / 1000 Wh/kWh * 15,000,000 builds = 125,000 kWh of reduced electricity usage, which I then plugged into this EPA calculator.
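
The footnote's arithmetic can be sanity-checked in a few lines (all figures are the rough estimates stated above, not measurements):

```python
# Back-of-the-envelope check of the SciPy build-energy estimate.
build_minutes = 10            # assumed time to build SciPy from source
extra_watts = 50              # assumed extra power draw while compiling
builds_avoided = 15_000_000   # approx. SciPy manylinux downloads

# Total energy in kilowatt-hours:
# minutes * watts -> watt-minutes; /60 -> watt-hours; /1000 -> kWh
total_kwh = build_minutes * extra_watts * builds_avoided / 60 / 1000
print(total_kwh)  # 125000.0, matching the footnote
```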