New insights on the practices of documentation of open-source software

June 12, 2018

The Berkeley Institute for Data Science (BIDS) hosts research, events, and tool development focused on facilitating data-intensive research. Today, computational research and data analytics rely on complex ecosystems of open source software (OSS) tools and libraries. Software documentation is crucial to help researchers discover and use these tools, to help build a community ecosystem of data science packages, and to define best practices in the field. However, documentation in open source software is notoriously considered to be of low quality. Why is good documentation so difficult to write in open source software, and what can communities do to better support good documentation in their projects?

We (BIDS fellows and researchers Stuart Geiger, Nelle Varoquaux, Charlotte Mazel-Cabasse, and Chris Holdgraf) recently published a study about the types, roles, and practices of documentation in open source software projects, focusing on data analytics software libraries like ggplot2, pandas, or Matplotlib. The paper appears in the Journal of Computer-Supported Cooperative Work, a premier venue for research on the social and technical dimensions of collaboration.

The study was based on interviews with novice and veteran documentation contributors to open source data analytics software libraries, additionally drawing on Stuart and Charlotte’s broader ethnographic research into data science as well as Nelle and Chris’s extensive experience participating in many OSS projects. The study began as part of the Docathon event (March 6-9, 2017) held at BIDS, UW’s eScience Institute, and virtually around the world, in which contributors pledged to spend time together writing documentation.

We discuss the many types and formats of documentation, ranging from short examples to book-like tutorials. Each type and format plays a different and complementary part in educating, promoting, and organizing the tools. One challenge is that so many types of documentation serve different purposes for different audiences. The practices of documentation are themselves multifaceted and require a broad range of skills, such as writing, reviewing, maintaining documentation, and coding. Documentation contributors therefore need a set of skills that goes beyond software development, and they often have to overcome many social and technical barriers to contribute to a project’s documentation.

Last but not least, most contributors do not inherently enjoy writing documentation as much as writing code. One of our interviewees stated that “We all hate writing documentation,” although we did find a small number of people who enjoy it as much as writing code. Some volunteer out of a sense of responsibility, or because the project they contributed to required them to do so. In addition, many contributors stated that they did not feel they received the same level of positive community feedback for documentation work as they did for adding new features or fixing bugs. We identify issues around what kinds of work receive recognition and credit in various communities. Even though writing documentation involves substantial technical expertise, it is often seen as a trivial “non-technical” chore, which impacts motivation. Overall, contributors felt they lacked incentive, motivation, and credit from the broader community for writing documentation.

We also identify substantial “documentation guilt” around not writing or updating documentation, as documentation is the kind of task that is often left for later, if it is done at all. At the 2017 SciPy conference, a survey on documentation practices found that 76.5% of respondents felt they should spend more time writing documentation than they do!

This research project has also been a compelling case of ethnographers collaborating with data scientists to empirically study the people, practices, and platforms behind the scenes of data science, which is what we call “data science studies.” Our paper was written to speak to both open source software contributors and social science researchers, as we believe there is much to be learned in bringing these groups closer together. In combining our various strengths and areas of expertise, we are able to turn the research lens on data science itself. With such research, we better understand what is needed to support and sustain the critical infrastructure we rely on for doing data- and computationally-intensive research in open and reproducible ways.

The paper has already received substantial attention from both the open source software community and the CSCW research community: it won an honorable mention for the annual David B. Martin best paper award, its presentation won best presentation at the ECSCW 2018 conference, and, according to altmetric.com, it is the most shared paper on social media among those published in the Journal of CSCW in the last five years. We are thrilled by this response. Stay tuned for more collaborative research in this area!

_________

The Types, Roles, and Practices of Documentation in Data Analytics Open Source Software Libraries: A Collaborative Ethnography of Documentation Work
R. Stuart Geiger, Nelle Varoquaux, Charlotte Mazel-Cabasse, and Chris Holdgraf
May 29, 2018 | Computer-Supported Cooperative Work (JCSCW)



Featured Fellows

R. Stuart Geiger: Berkeley Institute for Data Science, Ethnographer
Nelle Varoquaux: Statistics
Charlotte Cabasse: Ethnographer
Chris Holdgraf: Data Science Education Program