Jupyter Community Workshop Showcases Open Source Success: Three-day event at Berkeley Lab and BIDS focused on Jupyter in HPC and science user facilities
July 12, 2019 | Kathy Kincade | Berkeley Computing Sciences News
The three-day Jupyter Community Workshop for Scientific User Facilities and High-Performance Computing, held June 11-13 at Berkeley Lab and the Berkeley Institute for Data Science (BIDS), brought together more than 40 Jupyter developers, engineers, and experimental and observational data (EOD) facilities staff to brainstorm on how to make this increasingly popular open-source tool the pre-eminent interface for managing EOD workflows and data analytics at high performance computing (HPC) centers. The event was jointly sponsored by the National Energy Research Scientific Computing Center (NERSC) and BIDS and was part of a series of Jupyter Community Workshops being funded by the media company Bloomberg.
40+ participants from universities, national labs, industry, and science user facilities attended
the Jupyter Community Workshop at Berkeley Lab in June. Credit: Fernando Perez.
Project Jupyter is an international collaboration of more than 1,500 contributors that develops tools for “interactive computing,” a process of human-computer interplay for scientific exploration and data analysis. These tools – which include the very popular Jupyter Notebook and JupyterHub – have become a de facto standard for data analysis in research, education, journalism, and industry and are becoming increasingly critical for scientific discovery. (The workshop coincided with the release of the latest version of Jupyter’s user interface, JupyterLab.)
Advances in EOD technologies, high-bandwidth global networks, and HPC have resulted in an exponential growth of data to collect, manage, and understand. Interpreting these data streams requires computational and storage resources greatly exceeding those available on laptops, workstations, or university department clusters. Funding agencies are thus increasingly looking to HPC centers to address the growing and changing data needs of their scientists, and scientists are seeking new ways to seamlessly and transparently integrate HPC into their EOD workflows.
This is where Jupyter comes in.
“Over the past four years, we have seen Jupyter at NERSC evolve from a novel science gateway application used by just a few Python enthusiasts into a principal means for many of our users to interact with our systems and services,” said Rollin Thomas, NERSC Data Architect and chair of the workshop. In 2016, Thomas and the NERSC Jupyter team engineered a way for users to launch notebooks on a single shared node on NERSC’s Cori supercomputer; since then, demand has increased to the point that two additional nodes have been allocated for running Jupyter notebooks. On any given day, 200 users have notebooks running on these nodes, a level of usage comparable to that of the more traditional shared “login” nodes. Recently the team has expanded access to Cori’s compute nodes through Jupyter as well.
In addition, NERSC has joined forces with the Usable Software Systems group in Berkeley Lab’s Computational Research Division to enhance Jupyter as a key interface for EOD workflows that run at NERSC under the Superfacility initiative. “Scientists at major DOE-supported user facilities have told us that they want to manage and manipulate their data and compute through Jupyter, so we need to develop tools that work with our infrastructure to make that happen,” said Debbie Bard, leader of the Data Science Engagement Group and NERSC Superfacility Team lead. “This is a problem faced by all big science facilities: streaming their data through high-performance compute resources to accelerate science in a seamless way.”
‘A Great Deal of Excitement’
These and related challenges are what prompted the June workshop, which featured dozens of talks and breakout sessions focused on “pain points” and best practices in Jupyter deployment, infrastructure, and user support; securing Jupyter in multi-tenant environments; sharing notebooks; HPC/EOD-focused Jupyter extensions; and strategies for communicating with stakeholders.
“Scientists love Jupyter because it combines visualization, data analytics, text, and code into a document they can share, modify, and even publish,” Thomas said. “But what about using Jupyter to control experiments in real time, steer complex simulations on a supercomputer, or connect experiments to HPC for real-time feedback and decision making? How can users reach outside the notebook to corral external data and computational resources in a seamless, Jupyter-friendly manner?”
“We began talking about this event two years ago,” said Fernando Perez in an address to the group on the first morning. Perez, an assistant professor of statistics at UC Berkeley, a Senior Fellow at BIDS, and a staff scientist in Berkeley Lab’s Computational Research Division, is credited with developing IPython, the interactive add-on to Python that served as the foundation for Jupyter. “There is a great deal of excitement from the Jupyter community about how HPC will use Jupyter. We see Jupyter as the heart of the human/machine connection, enabling and supporting interactive scientific computing.”
Perez is a founding member of BIDS, which hosted the third day of the workshop on the UC Berkeley campus. The Jupyter team at Berkeley focuses on tools for interactive interfaces for data science and education (Jupyter Notebooks, JupyterLab, and Jupyter Book), shared infrastructure for interactive computing (JupyterHub and JupyterHub distributions), and reproducible, sharable computational environments (through the Binder Project). Each project is run in partnership with researchers and educators at BIDS and UC Berkeley's Division of Data Science.
During the workshop, Michael Milligan from the Minnesota Supercomputing Institute echoed Perez’s sentiments in his keynote, “Jupyter: a One-Stop Shop for Interactive HPC Services.” Milligan is the creator of BatchSpawner and WrapSpawner, JupyterHub extensions that let HPC users run notebooks on compute nodes managed by a variety of batch queue systems. Contributors to both extensions met in an afternoon-long breakout to build consensus around some technical issues and to begin managing development and support collaboratively.
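To give a sense of how these extensions fit together, a site's `jupyterhub_config.py` can point JupyterHub at a batch scheduler via BatchSpawner, with WrapSpawner's ProfilesSpawner letting users choose a job profile at login. The sketch below is illustrative only; the profile names, Slurm partition, and resource requests are placeholders, not a recommended configuration:

```python
# jupyterhub_config.py -- illustrative sketch of a BatchSpawner/WrapSpawner setup.
# Profile names, the 'regular' partition, and resource values are placeholders.

c = get_config()  # provided by JupyterHub when it loads this file

# ProfilesSpawner presents a menu of spawner configurations at login.
c.JupyterHub.spawner_class = 'wrapspawner.ProfilesSpawner'

c.ProfilesSpawner.profiles = [
    # (display name, key, spawner class, config overrides)
    ('Login node (shared)', 'local',
     'jupyterhub.spawner.LocalProcessSpawner', {}),
    ('Compute node - 30 min batch job', 'batch',
     'batchspawner.SlurmSpawner',
     dict(req_partition='regular', req_runtime='00:30:00')),
]

# SlurmSpawner submits each user's notebook server as a Slurm batch job,
# so the notebook runs on a compute node rather than a shared login node.
c.SlurmSpawner.req_memory = '8G'
```

Analogous spawner classes exist in BatchSpawner for other schedulers (e.g., PBS and LSF), which is what makes the approach portable across HPC centers.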
“In the past, most computational tasks fit into one of two buckets: ‘local, interactive, informally managed compute’ or ‘remote, scheduled, professionally managed compute,’” Milligan said. “Now users are accustomed to compute that is remote and interactive by default – Google Docs, not Microsoft Word. We hope our users are going to be doing something fundamentally new, so we need to give them general-purpose tools.”
Other highlights of the workshop included:
- Jupyter security. Thomas Mendoza from Lawrence Livermore National Laboratory talked about his work to enable end-to-end SSL in JupyterHub and best practices for securing Jupyter, while two breakout sessions on security yielded a number of next steps, including a plan to more prominently document security best practices and to convene a future meeting focused specifically on security in Jupyter.
- Jupyter implementation at national labs and user facilities. Speakers from Lawrence Livermore and Oak Ridge National Laboratories and the European Space Agency showed off a variety of JupyterLab extensions, integrations, and plug-ins for climate science, complex physical simulations, astronomical images and catalogs, and atmospheric monitoring. People at these and other facilities are finding ways to adapt Jupyter to meet the specific needs of their scientists.
- Interactive distributed computing. Berkeley Lab’s Shreyas Cholia gave a lightning talk, “Interactive Distributed Computing with Jupyter and Friends,” stitching together ipyparallel, qgrid, bqplot, and Kale to deliver interactive, distributed deep learning. Other lightning talks covered topics ranging from LFortran and KBase to dashboards, Slurm, Charliecloud, and GPUs.
Looking ahead, plans are in motion for a security-focused meeting in the fall; in addition, a panel at PEARC19 will include a retrospective discussion of the Berkeley workshop.
“Many facilities have figured out how to deploy, manage, and customize Jupyter, but they’ve done it while focused on their unique requirements and capabilities,” Thomas said. “Still others are just taking their first steps and want to avoid reinventing the wheel. With some initial critical mass, we can start contributing what we’ve learned separately into a shared body of knowledge, patterns, tools, and best practices.”
To learn more about this and related events, read this blog post by Rollin Thomas. For more on the history and development of Jupyter and interactive computing, check out this April 2019 TEDx Talk by Fernando Perez.
About Computing Sciences at Berkeley Lab:
Berkeley Lab Computing Sciences provides the computing and networking resources and expertise critical to advancing Department of Energy Office of Science (DOE-SC) research missions: developing new energy sources, improving energy efficiency, developing new materials, and increasing our understanding of ourselves, our world, and our universe. ESnet, the Energy Sciences Network, provides the high-bandwidth, reliable connections that link scientists at 40 DOE research sites to each other and to experimental facilities and supercomputing centers around the country. The National Energy Research Scientific Computing Center (NERSC) powers the discoveries of 7,000-plus scientists at national laboratories and universities. NERSC and ESnet are both Department of Energy Office of Science National User Facilities. The Computational Research Division (CRD) conducts research and development in mathematical modeling and simulation, algorithm design, data storage, management and analysis, computer system architecture and high-performance software implementation.
Berkeley Lab addresses the world's most urgent scientific challenges by advancing sustainable energy, protecting human health, creating new materials, and revealing the origin and fate of the universe. Founded in 1931, Berkeley Lab's scientific expertise has been recognized with 13 Nobel prizes. The University of California manages Berkeley Lab for the DOE’s Office of Science. The DOE Office of Science is the United States' single largest supporter of basic research in the physical sciences and is working to address some of the most pressing challenges of our time. For more information, please visit science.energy.gov.
This article is cross-posted from the Berkeley Computing Sciences News press release about this event.
Contact: Kathy Kincade, firstname.lastname@example.org, +1 510 495 2124