Documenting Data Science and Documentation in Data Science: an Ethnographic Exploration

eScience Institute, Data Science Seminar

Lecture

January 24, 2019
4:30pm to 5:20pm
Seattle, WA

The collection, curation, and analysis of data have always been as social as they are technical. Even in the most automated, data-driven systems, there are always humans who work behind the scenes, from the software developers and hardware operators who maintain invisible infrastructures to those who collect, label, annotate, clean, validate, merge, and manage data. These activities tend to get far less attention than the headline-grabbing technologies of machine learning and artificial intelligence, but it is crucial to always keep them in view. In this talk, I discuss the central yet often overlooked role of documentation in data science, drawing on several recent and ongoing studies and projects about the importance of documentation in software packages, datasets, analysis code, research protocols, and research teams. Documentation is often seen as an unglamorous, low-status chore to be left for later, but it is a crucial form of communication, collaboration, and collective sensemaking. Documentation can nonetheless be difficult, precisely because writing it well requires complex skills and because it plays many different, sometimes even contradictory, roles for various audiences and stakeholders. In examining the work of documentation as communication, we gain a broader view into many pressing issues in data science, including those around open science, reproducibility, and data ethics.

Speaker(s)

R. Stuart Geiger

BIDS Alum – Ethnographer

Former BIDS Ethnographer Stuart Geiger is now a faculty member at the University of California, San Diego, jointly appointed in the Department of Communication and the Halıcıoğlu Data Science Institute. At BIDS, as an ethnographer of science and technology, he studied the infrastructures and institutions that support the production of knowledge. He launched the Best Practices in Data Science discussion group in 2019, having been one of the original members of the MSDSE Data Science Studies Working Group. Previously, his work on Wikipedia focused on the community of volunteer editors who produce and maintain an open encyclopedia. He also studied distributed scientific research networks and projects, including the Long-Term Ecological Research Network and the Open Science Grid. In Wikipedia and scientific research, he studied topics including newcomer socialization, community governance, specialization and professionalization, quality control and verification, cooperation and conflict, the roles of support staff and technicians, and diversity and inclusion. Because these communities are made possible through software systems, he also studied how the design of software tools and systems intersects with all of these issues. He received an undergraduate degree from UT Austin and an MA in Communication, Culture, and Technology from Georgetown University, where he began empirically studying communities using qualitative and ethnographic methods. As part of receiving his PhD from the UC Berkeley School of Information, he worked with anthropologists, sociologists, psychologists, historians, organizational and management scholars, designers, and computer scientists.