Multi-Dimensional Arrays for Big Labeled Data

Data Science Lecture Series


October 16, 2015
1:00pm to 2:30pm
190 Doe Library

The N-dimensional array has been the mainstay data structure of scientific computing. But can we do better? In this talk, I’ll describe two Python projects, xray and dask.array, that extend multi-dimensional arrays with features that enable scalable and reproducible science. Xray adds labels and metadata to arrays, letting developers use meaningful names for operations instead of easily confused axis numbers or integer positions. Dask is a framework for easy parallel computing with arrays that may be too big to fit into memory. I will highlight examples in climate science and meteorology, for which large and complex datasets are common.
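
To make the contrast between axis numbers and meaningful names concrete, here is a minimal sketch using only NumPy. The `LabeledArray` class, the dimension names, and the temperature values are all hypothetical and invented for illustration. This is not xray's API or implementation, only a toy version of the idea the abstract describes.

```python
import numpy as np


class LabeledArray:
    """Hypothetical toy wrapper: a NumPy array plus one name per axis."""

    def __init__(self, values, dims):
        values = np.asarray(values)
        if values.ndim != len(dims):
            raise ValueError("need exactly one dimension name per axis")
        self.values = values
        self.dims = tuple(dims)

    def mean(self, dim):
        # Translate the dimension name into an axis number internally,
        # so callers never have to pass a bare integer.
        axis = self.dims.index(dim)
        remaining = self.dims[:axis] + self.dims[axis + 1:]
        return LabeledArray(self.values.mean(axis=axis), remaining)


# Made-up data: 2 time steps x 3 latitudes x 4 longitudes.
temps = np.arange(24, dtype=float).reshape(2, 3, 4)

# Plain NumPy: the reader has to remember that axis 0 means time.
by_axis = temps.mean(axis=0)

# Labeled version: the same reduction, stated by name.
labeled = LabeledArray(temps, dims=("time", "lat", "lon"))
by_name = labeled.mean("time")
```

Both calls compute the same time average. The labeled form keeps working if the axes are later reordered, whereas `axis=0` would silently average over the wrong dimension.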


Stephan Hoyer

Data Scientist, The Climate Corporation

Stephan is a data scientist and researcher at The Climate Corporation, where he builds statistical models for climate and weather data. These models help farmers make better decisions, such as when to plant and how much fertilizer to apply. Stephan graduated from Swarthmore College with a BA in physics in 2008 and from UC Berkeley with a PhD in physics in 2013. He is the main author of xray and is a core contributor to a number of other projects in the scientific Python stack, including dask, pandas and NumPy.