The Data Structures for Data Science Workshop: Exploring Common Data Structures across Programming Languages and Packages

November 9, 2015

Co-written by Kyle Barbary

On September 18 and 19, 2015, we brought together 60 community experts for the Data Structures for Data Science (DS4DS) Workshop. Data structures, the ways in which (most often numerical) data are represented, are the foundation upon which computational tools, particularly those for data science, are built. Common data structures across programming languages and packages allow for interoperability and code reuse, yet many modern data structures are confined to a single language or computing environment. This technical, hands-on workshop was an attempt to answer, in practical terms, questions about the unification of data structures, with specific consideration given to storage formats, libraries and implementations, exchange protocols, and in-core versus out-of-core manipulation.
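As a small illustration of what a shared data structure buys, here is a minimal sketch using only Python's standard library. It is not taken from the workshop; the variable names are made up for the example. It relies on Python's buffer protocol, which lets two independent consumers work on the same block of memory without copying it.

```python
from array import array

# A contiguous block of 64-bit floats, standing in for data owned by one package.
samples = array("d", [1.0, 2.0, 3.0, 4.0])

# A second consumer views the same memory through the buffer protocol.
# No bytes are copied when the view is created.
view = memoryview(samples)

# Writing through the view changes the original array: the storage is shared.
view[0] = 10.0

# The view iterates as floats because it carries the element format ("d").
total = sum(view)

# The same memory can also be reinterpreted as raw bytes, e.g. for writing to disk.
raw = view.cast("B")

print(samples[0], total, len(raw))
```

Because both objects agree on the memory layout and element type, neither needs to know anything else about the other. That agreement is the kind of common ground a cross-language data structure has to provide.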

To that end, the invited participants represented a range of languages, from Python and R to Julia and C++, and various scientific domains from both industry and academia. Authors of popular packages in statistics, visualization, and machine learning joined forces with domain experts in compression, graphs, and distributed computation to explore a variety of problems around numerical processing of data at scale.

Both days of the workshop started with lightning talks, where speakers illustrated popular concepts and libraries. On the first day, this was followed by open discussion, identifying important sub-topics in preparation for the second day, which was designated for hands-on working sessions to produce code, designs, or concept notes.

Topics under discussion included the design of cross-platform DataFrame APIs; the management of large datasets; memory-efficient ways to handle data; a better data-type system for NumPy; and refactorings at the interfaces between packages like NumPy, xray, and Pandas.

On Saturday, participants worked on refactoring core parts of NumPy into Cython, using Dask for machine learning, wrapping the Dato SFrame in Julia and R, adding sparse and HDF5-backed storage to Pandas, implementing declarative visualization in Python, adding labeled arrays to NumPy, doing machine learning with Ibis, and making the data types in DyND dynamic.

Many enthusiastic participants stayed well into both nights to talk and hack together. It will be a while before the full impact of the workshop can be measured, but we believe that the opportunity to connect key players in data structures was in itself well worth it. We are planning another event for 2016, and while this workshop was invitation only, we are considering opening up the next gathering. Is this the kind of event you would like to attend? If so, please send feedback to Stéfan.

Featured Fellows

Kyle Barbary

Berkeley Center for Cosmological Physics