In their new PNAS article, Veridical Data Science, BIDS Senior Fellow and UC Berkeley Statistics/EECS Professor Bin Yu, along with former doctoral student Karl Kumbier (now at UCSF), unveil the PCS framework for navigating and quality-controlling the data science life cycle in all research domains.
Built on three core data science principles, predictability, computability, and stability (PCS), the framework lays the groundwork for using data science tools reliably, reproducibly, and transparently, and for guiding data-driven decision-making across science, social science, engineering, business, and government. It applies to any field that uses data science tools to extract meaningful information from the vast data resources available, including disciplines outside statistics such as genomics, astronomy, precision medicine, political science, and economics.
The data science life cycle (DSLC) begins with a domain-specific research question and then proceeds through data collection and management, processing (cleaning), exploration, modeling, and interpretation, which together generate results and guide subsequent actions. According to Yu and Kumbier, “Given the transdisciplinary nature of this process, data science requires human involvement from those who collectively understand both the domain and tools used to collect, process, and model data.” The PCS framework consists of a PCS workflow for all stages of the data science life cycle and PCS documentation, written in R Markdown or a Jupyter Notebook, that provides narratives (qualitative and quantitative arguments) and code supporting the data conclusions, including records of the human judgment calls made along the way.
The PCS framework can be used to navigate and quality-control the scientific process — from hypothesis generation to experimental design — because it incorporates the principles and best practices of the sciences while embracing the machine-learning platform that is part of modern statistics. According to Bin Yu, “The stability of choices and the decisions made by data scientists at all stages of the data life cycle is a minimum requirement for the interpretability of results as well as a key to their validity.”
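To make the stability principle concrete, here is a minimal sketch of one way such a check could look. It is not taken from the article: the function names (`ols_slope`, `trim_outliers`, `stability_report`), the synthetic data, the choice of bootstrap resampling as the data perturbation, and the outlier-trimming rule as the example judgment call are all assumptions made for illustration. The idea is simply to ask whether a conclusion (here, the sign and size of a fitted slope) survives both resampling of the data and a different cleaning decision.

```python
import random
import statistics


def ols_slope(xs, ys):
    """Least-squares slope of y on x (assumes at least two distinct x values)."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    return sxy / sxx


def trim_outliers(xs, ys, k=3.0):
    """Hypothetical cleaning choice: drop points whose y lies more than k SDs from the mean."""
    my, sy = statistics.fmean(ys), statistics.pstdev(ys)
    kept = [(x, y) for x, y in zip(xs, ys) if abs(y - my) <= k * sy]
    return [x for x, _ in kept], [y for _, y in kept]


def stability_report(xs, ys, n_boot=200, seed=0):
    """Refit the slope under bootstrap resampling for each cleaning choice."""
    rng = random.Random(seed)
    cleaners = {"raw": lambda a, b: (list(a), list(b)), "trimmed": trim_outliers}
    report = {}
    for name, clean in cleaners.items():
        cx, cy = clean(xs, ys)
        slopes = []
        for _ in range(n_boot):
            idx = [rng.randrange(len(cx)) for _ in range(len(cx))]
            bx = [cx[i] for i in idx]
            by = [cy[i] for i in idx]
            if len(set(bx)) < 2:
                continue  # degenerate resample: slope is undefined, skip it
            slopes.append(ols_slope(bx, by))
        report[name] = {
            "median_slope": statistics.median(slopes),
            "frac_positive": sum(s > 0 for s in slopes) / len(slopes),
        }
    return report


# Synthetic data (an assumption for illustration): y is roughly 2x plus noise,
# with one gross outlier so the two cleaning choices actually differ.
data_rng = random.Random(42)
xs = [float(i) for i in range(50)]
ys = [2.0 * x + data_rng.gauss(0.0, 1.0) for x in xs]
ys[-1] += 500.0

report = stability_report(xs, ys)
for name, summary in report.items():
    print(name, summary)
```

If the slope keeps the same sign and a similar magnitude under every perturbation and cleaning choice, the conclusion is stable in this narrow sense; if it flips, that judgment call deserves an explicit entry in the documentation.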
Veridical Data Science
February 13, 2020 | Bin Yu and Karl Kumbier | PNAS
QnAs with Bin Yu
February 12, 2020 | Farooq Ahmed | PNAS