Berkeley Lab, BIDS Take on Big Data

August 30, 2018

Berkeley Lab-BIDS Fellows to share their expertise at ML4Sci Workshop and California Water Data Hackathon

August 30, 2018 | Linda Vu, Berkeley Lab Computing Sciences News

The world is currently generating data at a breakneck pace (about 2.5 quintillion bytes per day), and this trend is only accelerating. To make sense of this torrent of information, the Berkeley Institute for Data Science (BIDS) has built an ecosystem of researchers to advance data-analytic methods and inquiry, develop and expand software and analytics tools, and share best practices.

The BIDS ecosystem comprises an impressive network of Fellows, including some who are Lawrence Berkeley National Laboratory (Berkeley Lab) scientists. This month, several Berkeley Lab-BIDS Fellows are organizing two events to share their data-science expertise. Some are helping to organize a Machine Learning for Science (ML4Sci) Workshop that will be held in early September, where they will introduce scientists to state-of-the-art machine learning applications and train them to use these applications on massively parallel supercomputers. Later in September, another group is hosting the California Water Data Hackathon to help address the lack of access to clean, safe drinking water in parts of the state.

“There’s a perspective that one promise of data science comes from the interdisciplinary nature of the research it enables. ‘Inter’ can mean among different fields of inquiry, but also can mean among alternative approaches to handling data-intensive workloads in research,” said BIDS Executive Director David Mongeau. “For instance, many data scientists at UC Berkeley might default to familiar infrastructure for their work, but by interacting with Berkeley Lab can explore alternative approaches made possible with high performance computing.”

Machine Learning for Science (ML4Sci)

Cori supercomputer at NERSC. (Photo by Marilyn Chung, Berkeley Lab)

Some of the Berkeley Lab researchers bringing this expertise to BIDS are Deborah Agarwal, head of the Computational Research Division’s (CRD’s) Data Science and Technology Department; Daniela Ushizima, a staff scientist in the Center for Advanced Mathematics for Energy Research Applications (CAMERA) and CRD’s Data Analytics & Visualization group; and Kristofer Bouchard, a computational bioscientist in the Biosciences Area. They are helping to organize the ML4Sci workshop, which will be held at Berkeley Lab Sept. 4-5 in conjunction with the National Energy Research Scientific Computing Center’s (NERSC’s) annual Data Day (Sept. 6-7). Other key organizers of the workshop are Prabhat, Steve Farrell, and Mustafa Mustafa of NERSC’s Data & Analytics Services group, and Zarija Lukić of Berkeley Lab’s Computational Cosmology Center.

The workshop will feature several UC Berkeley faculty-BIDS Fellows as keynote speakers, including Bin Yu, John Canny, Philip Stark, and Joshua Bloom. The event will introduce researchers to cutting-edge machine learning applications for high-energy physics, nuclear physics, cosmology, chemistry, biosciences, materials engineering, climate, and high performance computing. Additionally, machine learning experts will provide hands-on training in deploying these applications on supercomputers at NERSC.

“There are so many benefits from the cross-pollination of expertise and resources between Berkeley Lab and BIDS,” said Ushizima. “During the ML4Sci workshop, Berkeley Lab staff will be showcasing Jupyter tools. Today, these tools are open source and serve a variety of data science needs; for example, there are currently more than 2 million Jupyter Notebooks hosted on GitHub. But the root of Jupyter was pioneered by Fernando Perez, one of the founding fathers of BIDS, currently a professor in the Department of Statistics at UC Berkeley, and a Berkeley Lab researcher.”

Earlier this year, the Association for Computing Machinery honored the Jupyter Project Team with its Software System Award, which recognizes software that has had a lasting influence on computing. At Berkeley Lab, Ushizima also leads the Department of Energy Early Career Project Image across Domains, Experiments, Algorithms and Learning (IDEAL).

California Water Data Hackathon

Berkeley View from Berkeley Lab. (Photo by Roy Kaltschmidt, Berkeley Lab)

Beyond scientific applications, BIDS also focuses on social impact issues. Earlier this year, when a number of state agencies, private companies and the West Big Data Innovation Hub joined forces to create the 2018 California Safe Drinking Water Data Challenge, BIDS knew it wanted to be a part of this effort. Zexuan Xu, a BIDS Data Science Fellow and a postdoctoral researcher in hydrology in Berkeley Lab’s Earth and Environmental Sciences Area, is helping to organize BIDS’ participation in this event.

As part of the challenge, BIDS is teaming up with UC Berkeley’s Division of Data Sciences to host the California Water Data Hackathon on Sept. 14-15. According to Xu, the hackathon is open to all, but most participants are expected to be undergraduate and graduate students from a variety of disciplines. The goal is to teach the students about California’s water issues, then have them use publicly available data to find innovative ways to increase community access to safe drinking water, better understand vulnerabilities, and help identify and deploy solutions.

“Up to 1 million Californians lack access to clean, safe drinking water at some point during the year. Droughts and other disruptions in water supply and contamination in water quality can limit or eliminate access to safe drinking water for days, months, or years,” said Xu. “All the topics that the hackathon participants will address are currently open questions. If they come up with interesting questions and/or solutions, we will deliver their interests to the state agencies, and encourage them to continue the research.”

In many ways the hackathon embodies the philosophy of BIDS, which takes a broad view of data science and welcomes candidates from a full range of research focuses—from digital humanities and psychology to statistics and computer science—who are interested in pushing the frontiers of data-intensive research in their own field and in cross-disciplinary collaborations.

“The greatest benefit of being a BIDS Fellow is getting to know people that work in different fields of science. I am a domain expert in earth and environmental science, but others are experts in math, software development, statistics, bioscience, etc.,” said Xu. “Because the community is so integrated, I can collaborate with mathematicians that I don’t normally have access to. We work on research projects together, then I have a chance to learn the cutting-edge research in other science areas and also share my knowledge and insights with others in my domain area.”

That benefit, and the bonds that Berkeley Lab and the University continue to strengthen, come in part through Nobel laureate Saul Perlmutter serving as BIDS Director. He shared the 2011 Nobel Prize in Physics for the discovery of the accelerating expansion of the universe.

Although registration for the ML4Sci workshop is closed, you can still register for the California Water Data Hackathon here:

A full list of BIDS Fellows:

Featured Fellows

Deb Agarwal

Computational Research Division, LBNL

Daniela Ushizima

Computational Research Division, Lawrence Berkeley National Lab

Kristofer Bouchard

Bioengineering and Biomedical Sciences Division, LBNL

Bin Yu


John Canny

Division of Computer Science

Philip Stark

Statistics; Statistical Computing Facility
Co-I for Moore/Sloan Data Science Environments

Joshua Bloom

Astronomy; Center for Time-Domain Informatics
Co-I for Moore/Sloan Data Science Environments

Fernando Perez

Co-I for Moore/Sloan Data Science Environments

Zexuan Xu

Climate and Ecosystem Science Division, LBNL

Saul Perlmutter

Berkeley Institute for Data Science

David Mongeau

Berkeley Institute for Data Science
Executive Director