Last week, several BIDS fellows and members headed down to Austin, TX, for SciPy 2016, the 15th annual Scientific Computing with Python conference. The general purpose of SciPy is to bring industry, academia, and government participants together to highlight their work, learn from Python power users, and collaborate on code development.
Like last year, the BIDS community was heavily involved in SciPy, contributing tutorials, talks, working sessions, lightning talks, and more. Below is a recap of BIDS fellows', staff's, and members' participation in SciPy 2016, including descriptions and videos:
Tutorials
Scikit-image: Image analysis in Python (Intermediate)
Stéfan van der Walt, BIDS Research Fellow—Computation (with Andreas Mueller, New York University Center for Data Science; Juan Nunez-Iglesias, University of Melbourne)
Summary: Across domains, modalities, and scales of exploration, images form an integral subset of scientific measurements. Despite a deep appeal to human intuition, gaining understanding of image content remains challenging and often relies on heuristics. Even so, the wealth of knowledge contained inside images cannot be overstated. Scikit-image is an image processing library built on top of SciPy that provides researchers, practitioners, and educators access to a strong foundation upon which to build algorithms and applications. In this tutorial, aimed at intermediate users of scientific Python, we introduce the library; give practical, real-world examples of applications; and briefly explore its use in the context of machine learning. Throughout, attendees are given the opportunity to learn through hands-on exercises. Prerequisites: a working knowledge of NumPy arrays.
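For a flavor of what the tutorial covers, here is a minimal, illustrative scikit-image pipeline (not taken from the tutorial materials) that segments one of the library's bundled sample images and measures the resulting regions:

```python
# Illustrative sketch only: segment a sample image and measure its regions.
import numpy as np
from skimage import data, filters, measure

image = data.coins()                        # built-in sample image of coins
threshold = filters.threshold_otsu(image)   # global Otsu threshold
binary = image > threshold                  # foreground mask

labels = measure.label(binary)              # connected-component labeling
regions = measure.regionprops(labels)       # per-region measurements
areas = [r.area for r in regions if r.area > 50]
print("Detected %d regions; mean area = %.1f px" % (len(areas), np.mean(areas)))
```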
Matthias Bussonnier (BIDS Postdoctoral Scholar) and Jessica Hamrick (BIDS Member) also led Software Carpentry training.
Talks
Reproducible One-Button Workflows with the Jupyter Notebook and SCons
Jessica Hamrick, BIDS Member
Summary: What is the best way to develop analysis code in the Jupyter notebook while managing complex dependencies between analyses? In this talk, I will introduce nbflow, a project that integrates a Python-based build system (SCons) with the Jupyter notebook, enabling researchers to build sophisticated analysis pipelines entirely within notebooks while still maintaining a "one-button workflow" in which all analyses can be executed in the correct order from a single command. I will show how nbflow can be applied to existing analyses and how it can be used to construct an analysis pipeline stretching all the way from data cleaning to computing statistics, generating figures, and even automatically generating LaTeX commands that can be used in publications to format results without the risk of copy-and-paste errors.
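To make the "one-button workflow" idea concrete, here is a hedged sketch of an SConstruct file that executes notebooks in dependency order using plain SCons and nbconvert. nbflow's own machinery extracts these dependencies from the notebooks automatically; the notebook and output names below are hypothetical:

```python
# SConstruct -- illustrative sketch of a notebook build pipeline.
env = Environment()

# The cleaning notebook is assumed (hypothetically) to write results/clean.csv.
env.Command(
    "results/clean.csv",
    "notebooks/01-clean-data.ipynb",
    "jupyter nbconvert --to notebook --execute --inplace $SOURCE",
)

# The analysis notebook depends on the cleaned data, so SCons runs it second.
env.Command(
    "results/stats.json",
    ["notebooks/02-analyze.ipynb", "results/clean.csv"],
    "jupyter nbconvert --to notebook --execute --inplace ${SOURCES[0]}",
)

# Running `scons` then re-executes every out-of-date notebook, in order.
```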
Proselint: The Linting of Science Prose, and the Science of Prose Linting
Michael Pacer, BIDS Member (in collaboration with Jordan Suchow, University of California, Berkeley)
Summary: Writing is notoriously hard, even for the best writers, and it's not for lack of good advice—a tremendous amount of knowledge is strewn across usage guides, dictionaries, technical manuals, essays, pamphlets, websites, and the hearts and minds of great authors and editors. But this knowledge is trapped, waiting to be extracted and transformed. We built Proselint, a Python-based linter for prose. Proselint identifies violations of expert style and usage guidelines. Proselint is open-source software released under the BSD license and works with Python 2 and 3. It runs as a command-line utility or editor plugin (e.g., Sublime Text, Atom, Vim, Emacs) and outputs advice in standard formats (e.g., JSON). Though in its infancy—perhaps 2% of what it could be—Proselint already includes modules addressing redundancy, jargon, illogic, clichés, sexism, misspelling, inconsistency, misuse of symbols, malapropisms, oxymorons, security gaffes, hedging, apologizing, and pretension. Proselint can be seen as both a language tool for scientists and a tool for language science. On the one hand, it includes modules that promote clear and consistent prose in science writing. On the other, it measures language usage and explores the factors relevant to creating a useful linter.
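Because Proselint is written in Python, it can also be driven programmatically. Below is a hedged sketch of doing so; the module path and call signature (proselint.tools.lint) are assumptions about the project's API rather than something confirmed by the talk abstract:

```python
# Hedged sketch: lint a snippet of prose from Python.
# (Assumes proselint.tools.lint(text) returns a list of suggestions.)
from proselint.tools import lint

text = "At the end of the day, the methodology is very unique."
for suggestion in lint(text):
    # Each suggestion identifies the check that fired and the offending span.
    print(suggestion)
```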
Reinventing the whl: New Developments in the Upstream Python Packaging Ecosystem
Nathaniel Smith, BIDS Research Fellow—Computation
Summary: Pip, wheels, and setuptools are the standard tools for installing, distributing, and building Python packages, so if you're a user or package author, you're probably using them at least some of the time, even though they've traditionally been a major source of pain when it comes to handling scientific packages. Fortunately, things have been getting better! In this talk, I'll describe how members of the scientific Python community have been working with upstream Python to solve some of the worst issues; show you how to build and distribute binary wheels for Linux users, build Windows packages without MSVC, and use wheels to handle dependencies on non-Python libraries like BLAS or libhdf5; and give the latest updates on our effort to drive a stake through the heart of setup.py files and replace them with something better.
Computational Supply Chain Risk Management for Open Source Software
Sebastian Benthall, BIDS Member
Summary: We address the cybersecurity problems of supply chain risk management in open source software. How does one detect high-risk components in a deployed software system that includes many open source components? As a complement to software assurance approaches based on static source code analysis, we propose a technique based on an analysis of the entire open source ecosystem, inclusive of its technical products and contributor activity. We show how dependency topology, community activity, and exogenous vulnerability and exposure information can be integrated to detect high-risk "hot spots" requiring additional investment. We demonstrate this technique using the Python dependency topology extracted from PyPI and data from GitHub. We will discuss how our analysis prototype has been implemented with SciPy tools.
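As a rough illustration of how topology and vulnerability data can be combined (not the authors' actual method or data), consider scoring packages in a toy dependency graph with networkx; the package names and vulnerability counts below are made up:

```python
# Illustrative sketch: score "hot spots" by combining topological importance
# (PageRank over the dependency graph) with hypothetical vulnerability counts.
import networkx as nx

deps = nx.DiGraph()  # edge A -> B means "A depends on B"
deps.add_edges_from([
    ("app", "requests"), ("app", "numpy"),
    ("requests", "urllib3"), ("numpy", "blas-bindings"),
])

# Widely depended-upon packages accumulate importance along dependency edges.
importance = nx.pagerank(deps)

known_vulns = {"urllib3": 2}  # hypothetical CVE counts per package
risk = {pkg: importance[pkg] * (1 + known_vulns.get(pkg, 0)) for pkg in deps}

for pkg, score in sorted(risk.items(), key=lambda kv: -kv[1]):
    print("%-15s %.3f" % (pkg, score))
```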
MPCite: Continuous and High-throughput Allocation of Digital Object Identifiers for Calculated and Contributed Data in the Materials Project
Shreyas Cholia, BIDS Member (in collaboration with Patrick Huck, Anubhav Jain, Daniel Gunter, Donald Winston, Kristin Persson at Lawrence Berkeley National Laboratory)
Summary: We introduce "MPCite," which enables the continuous request, validation, and dissemination of digital object identifiers (DOIs) for all inorganic materials currently available in the Materials Project (www.materialsproject.org). It provides our users with the necessary software infrastructure to achieve a new level of reproducibility in their research: it allows for the convenient and persistent citation of our materials data in online and print publications and facilitates sharing amongst collaborators. We also demonstrate how we extend the use of MPCite to non-core database entries, such as theoretical and experimental data contributed through "MPContribs" or suggested by the user for calculation via the "MPComplete" service. We expect MPCite to be easily extendable to other scientific domains where the number of data records demands high-throughput and continuous allocation of DOIs.
Governing Open Source Projects at Scale: Lessons from Wikipedia's Growing Pains
Stuart Geiger, BIDS Ethnographer
Summary: Many open source volunteer-driven projects begin with a small tight-knit group of collaborators but then rapidly expand far faster than anyone expects or plans for. I discuss cases of governance growing pains in Wikipedia, which have many lessons for running open source software projects. I discuss how Wikipedians have dealt with various issues as they have become one of the largest volunteer-based open collaboration projects, including the project’s growing bureaucracy and controversies between volunteers and professional staff.
Machine Learning for Time Series Data in Python
Brett Naul, BIDS Member; Stéfan van der Walt, BIDS Research Fellow—Computation; Fernando Perez, BIDS Associate Researcher; Joshua Bloom, BIDS Senior Fellow (in collaboration with Ari Crellin-Quick)
Summary: The analysis of time series data is a fundamental part of many scientific disciplines, but there are few resources meant to help domain scientists to easily explore time course datasets: traditional statistical models of time series are often too rigid to explain complex time domain behavior, while popular machine learning packages deal almost exclusively with "fixed-width" datasets containing a uniform number of features. Cesium is a time series analysis framework consisting of a Python library as well as a web front-end interface that allows researchers to apply modern machine learning techniques to time series data in a way that is simple, easily reproducible, and extensible.
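The core idea, restated in code, is to reduce each variable-length, irregularly sampled series to a fixed-width feature vector that standard machine learning estimators can consume. The sketch below uses a few hand-rolled summary statistics and toy data purely for illustration; cesium automates this featurization step and provides a far richer feature library:

```python
# Illustrative sketch: turn variable-length time series into a fixed-width
# feature matrix, then fit an ordinary scikit-learn classifier.
import numpy as np
from sklearn.ensemble import RandomForestClassifier

def simple_features(times, values):
    """A few summary statistics that tolerate uneven sampling and length."""
    return [values.mean(), values.std(),
            values.max() - values.min(), times.max() - times.min()]

# Three toy "light curves" of different lengths (hypothetical data).
series = [
    (np.sort(np.random.rand(50)), np.random.randn(50)),
    (np.sort(np.random.rand(80)), np.random.randn(80) + 1.0),
    (np.sort(np.random.rand(65)), np.random.randn(65)),
]
labels = [0, 1, 0]

X = np.array([simple_features(t, v) for t, v in series])  # fixed-width matrix
clf = RandomForestClassifier(n_estimators=10).fit(X, labels)
```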
JupyterLab: Building Blocks for Interactive Computing
Fernando Perez, BIDS Associate Researcher (in collaboration with Brian Granger, Cal Poly State University, Project Jupyter; Jason Grout, Bloomberg LP; Sylvain Corlay; Chris Colbert, Continuum Analytics; Cameron Oelsen; David Willmer; Afshin Darian)
Summary: Project Jupyter provides building blocks for interactive and exploratory computing. These building blocks make science and data science reproducible across over 40 programming languages (Python, Julia, R, etc.). Central to the project is the Jupyter Notebook, a web-based interactive computing platform that allows users to author data- and code-driven narratives—computational narratives—that combine live code, equations, narrative text, visualizations, interactive dashboards, and other media. While the Jupyter Notebook has proved to be an incredibly productive way of working with code and data interactively, it is helpful to decompose notebooks into more primitive building blocks: kernels for code execution, input areas for typing code, markdown cells for composing narrative content, output areas for showing results, terminals, etc. The fundamental idea of JupyterLab is to offer a user interface that allows users to assemble these building blocks in different ways to support interactive workflows that include, but go far beyond, Jupyter Notebooks. JupyterLab accomplishes this by providing a modular and extensible user interface that exposes these building blocks in the context of a powerful workspace. Users can arrange multiple notebooks, text editors, terminals, output areas, etc., on a single page with multiple panels, tabs, splitters, and collapsible sidebars with a file browser, command palette, and integrated help system. The codebase and user interface of JupyterLab are based on a flexible plugin system that makes it easy to extend with new components. In this talk, we will demonstrate the JupyterLab interface and its codebase and describe how it fits within the overall roadmap of the project.
Launching Python Applications on Peta-scale Massively Parallel Systems
Yu Feng, BIDS Fellow
Summary: We introduce a method to launch Python applications at near-native speed on large high-performance computing systems. The Python run-time and other dependencies are bundled and delivered to computing nodes via a broadcast operation. The interpreter is instructed to use the local version of the files on the computing node, removing the shared file system as a bottleneck during application start-up. Our method can be added as a preamble to the traditional job script, improving the performance of user applications in a non-invasive way. Furthermore, our method allows us to implement a three-tier system for the supporting components of an application, reducing the overhead of runs during the development phase of an application. The method is used for applications on Cray XC30 and Cray XT systems up to full machine capability with an overhead typically less than two minutes. We expect the method to be portable to similar applications in Julia or R. We also hope the three-tier system for the supporting components provides some insight into container-based solutions for launching applications in a development environment. We provide the full source code of an implementation of the method here. Given that large-scale Python applications can be launched extremely efficiently on state-of-the-art supercomputing systems, it is time for the high-performance computing community to seriously consider building complicated computational applications at large scale with Python.
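The delivery step can be pictured with a short, hedged sketch: one MPI rank reads the bundled environment from the shared file system, broadcasts it, and every node unpacks it to fast node-local storage before the real application starts. The bundle name, local path, and environment handling below are hypothetical simplifications, not the authors' implementation:

```python
# Hedged sketch of broadcast-based delivery of a bundled Python environment.
import io
import os
import tarfile
from mpi4py import MPI

comm = MPI.COMM_WORLD

bundle = None
if comm.Get_rank() == 0:
    # Only rank 0 touches the shared file system (hypothetical bundle name).
    with open("python-env.tar.gz", "rb") as f:
        bundle = f.read()

# Every other rank receives the bundle over the interconnect instead.
bundle = comm.bcast(bundle, root=0)

# Unpack to node-local storage; in practice only one rank per node would do
# this, and the path would come from the batch system.
local_dir = "/dev/shm/python-env"
with tarfile.open(fileobj=io.BytesIO(bundle), mode="r:gz") as tar:
    tar.extractall(local_dir)

# The job-script preamble would then point the interpreter at the local copy,
# e.g. by prepending it to PYTHONPATH before exec'ing the application.
os.environ["PYTHONPATH"] = (os.path.join(local_dir, "lib")
                            + os.pathsep + os.environ.get("PYTHONPATH", ""))
```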
A Whirlwind Tour of UC Berkeley’s Data Science Education Program
Cathryn Carson, BIDS Senior Fellow; Sam Lau, BIDS URAP Student; Chris Holdgraf, BIDS Fellow; David Culler, BIDS Senior Fellow (in collaboration with Elaine Angelino, John DeNero, Ryan Lovett at UC Berkeley)
Summary: At the University of California, Berkeley, an exciting new Data Science Education Program is running full steam ahead. This presentation will provide an overview of the program, focusing on the Python-based Foundations of Data Science (DATA 8) course aimed at any and all freshmen. This material will be of interest to anyone thinking about data science education, using Jupyter notebooks in the classroom, or deploying and scaling JupyterHub. In this presentation, we'll highlight student-facing content and provide an overview of our JupyterHub deployment. All these materials are publicly available on GitHub.