The BIDS community recently gathered to celebrate the 2018-2019 research milestones of this year’s cohort of BIDS undergraduate interns. Through the Undergraduate Research Apprentice Program (URAP), BIDS Data Science Fellows offer engaging projects that augment the Berkeley undergraduate research experience. This year they mentored a total of 35 undergraduates on collaborative research projects in areas including astrophysics, biomedical imaging, software engineering, data analysis and reporting, genomics, and environmental science.
“This is such an exciting time for you and your generation,” BIDS Data Science Fellow Ciera Martinez said as she welcomed the students assembled at BIDS. “You are the first generation in human history to not only have this magnitude of data at your disposal, but also the data science skills at such an early age. With these powerful tools at hand, you are able to apply them to whatever research questions you may encounter, no matter what field you’re working in.”
BIDS is home to a diverse range of cross-disciplinary, data-intensive research, and the undergraduates assembled were pursuing a wide range of academic studies in areas including statistics, data science and computer science, as well as applied mathematics and linguistics.

BIDS Data Science Fellows and URAP mentors Maryam Vareth, Stuart Geiger, Ciera Martinez, Stéfan van der Walt and Andreas Zoglauer, with their interns.
BIDS Data Science Fellow Andreas Zoglauer has been leading two projects focused on the NASA-funded Compton Spectrometer and Imager (COSI) gamma-ray telescope. The first, Preparing for the next 100-day stratospheric balloon flight of the Compton Spectrometer and Imager (COSI), focused on getting the telescope ready for flight. The second, Improving the COSI data-analysis pipeline with Machine Learning, centered on applying the latest machine learning tools to individual tasks in COSI’s data analysis pipeline. Andreas’ group is also working on new data analysis techniques for future gamma-ray telescopes beyond COSI, and these projects will continue into the next academic year. In total, Andreas mentored 10 undergraduates representing fields including statistics, data science, computer science, mathematics, physics, and mechanical engineering.
BIDS Research Data Scientist Stéfan van der Walt leads the BIDS Machine Shop, in which students help build targeted computational solutions (libraries, software, web apps) for research problems across campus and, in the process, learn sound principles of software engineering and development. Over the past year, undergraduate intern Dennis Feng has been broadening his skills in Python and scientific imaging applications by developing software that measures various attributes of insect specimens from photographs. That software is now being used by researchers at the Natural History Museum in London.
BIDS-LLNL Data Science Fellow Maryam Vareth leads a collaborative project entitled Deep Learning in Medical Imaging, which is offered through BIDS and the UCSF Department of Radiology and Biomedical Imaging. As part of this project, students contributed to the work of a UCSF medical imaging research team, learning that real-life data sets can be much messier than expected. This project is also helping students think about the personalization of data, and the privacy and ethical issues entailed in doing collaborative health-related research.
Six undergraduate researchers have been working with BIDS Ethnographer Stuart Geiger on a project entitled Garbage In, Garbage Out? Do Machine Learning Research Papers Report Where Training Data Comes From? As part of this research project, students investigated the source and soundness of human-labeled training data in more than 150 papers about machine learning classifiers. They are now exploring to what extent cutting-edge machine learning research follows best practices in reporting how authors obtained their training data. “Most machine learning courses give students cleanly-labeled datasets and focus on the math and programming aspects of machine learning, but we all got to see how important it is to also pay attention to where that data comes from and whether we should trust it,” Stuart Geiger said. This project is also helping students think more critically about data and information, and about the importance of consistent standards for data collection and analysis. Exploring the ethical issues around big data is also an important component of Cathryn Carson’s Human Contexts and Ethics in Data Science course, which many of the students on Stuart’s team have taken as part of UC Berkeley’s new Data Science curriculum.
BIDS Data Science Fellows Ciera Martinez and Sara Stoudt led a project funded by the Mozilla Foundation called The Cabinet of Curiosity Project - Natural History and Data Science. As part of this project, each student chose one of the many vast biodiversity datasets to explore, document, and clean. Ciera’s group included students majoring in Data Science and Computing Sciences. Although they are not life scientists per se, their training in data science prepared them for working with these highly specialized databases. You can view their work at http://curiositydata.org.
Ciera is also leading a project with UC Berkeley Genetics professor Michael Eisen entitled Cryptography of the unknown regions of genomes, in which students learn to identify genetic “enhancers,” discovering how they are sequenced and the influence they have on organism development. Ciera has found the program to be an invaluable learning experience for both the interns and the BIDS mentors: “These interns are extremely well trained and a vital resource for Berkeley research; I’m always learning new perspectives and viewpoints from them. They are an excellent resource for furthering this important research, and in turn, they are learning crucial data science skills that will serve them in whatever research field they decide to work in.”
BIDS/CERC-WET Data Science Fellow Zexuan Xu leads a project called Hydrologic forecasting for the East River, CO. In this project, students implemented a workflow to ensure that datasets downloaded from instrumented watersheds are comparable and operational, then developed and tested machine learning models to predict river discharge. Interns are working with professionals at the intersection of data science and hydrology, improving their data processing and modeling techniques and preparing for more diverse data science applications. This project is part of the BIDS Research Project Improving hydrology forecasting and water/energy nexus, which is led by BIDS Senior Fellows Laurel Larsen and Fernando Perez.
BIDS plans to expand research opportunities for undergraduates in the coming year, and there are also plans to relaunch an expanded BIDS Data Science Fellowship Program (for graduate students and postdoctoral researchers) in fall 2019.