BIDS Faculty Affiliate Bin Yu presented the Hotelling Lectures as part of an annual event hosted by the Department of Statistics & Operations Research at the University of North Carolina – Chapel Hill.
Bin Yu to deliver Hotelling Lectures
April 13, 2022 | UNC News
“Veridical Data Science: the practice of responsible data analysis and decision-making” was presented on Tuesday, April 19, 2022, in 209 Manning Hall. Abstract: “A.I. is like nuclear energy – both promising and dangerous” – Bill Gates, 2019. Data Science is a pillar of A.I. and has driven most of the recent cutting-edge discoveries in biomedical research and beyond. In practice, Data Science has a life cycle (DSLC) that includes problem formulation, data collection, data cleaning, modeling, result interpretation and the drawing of conclusions. Human judgment calls are ubiquitous at every step of this process, e.g., in choosing data cleaning methods, predictive algorithms and data perturbations. Such judgment calls are often responsible for the “dangers” of A.I. To maximally mitigate these dangers, we developed a framework based on three core principles: Predictability, Computability and Stability (PCS). Through a workflow and documentation (in R Markdown or Jupyter Notebook) that allows one to manage the whole DSLC, the PCS framework unifies, streamlines and expands on the best practices of machine learning and statistics – taking a step forward towards veridical Data Science. In this lecture, we will illustrate the PCS framework through the development of iterative random forests (iRF) for predictive and stable non-linear interaction discovery and through using iRF and UK Biobank data to find gene-gene interactions driving, respectively, red hair and a heart disease called hypertrophic cardiomyopathy.
“Interpreting deep neural networks towards trustworthiness” was presented on Wednesday, April 20, 2022, in 120 Hanes Hall. Abstract: Recent deep learning models have achieved impressive predictive performance by learning complex functions of many variables, often at the cost of interpretability. This lecture first defines interpretable machine learning in general and introduces the agglomerative contextual decomposition (ACD) method to interpret neural networks. Extending ACD to the scientifically meaningful frequency domain, an adaptive wavelet distillation (AWD) interpretation method is developed. AWD is shown to both outperform deep neural networks and be interpretable in two prediction problems from cosmology and cell biology. Finally, the lecture advocates a quality-controlled data science life cycle for building any model for trustworthy interpretation and introduces a Predictability, Computability and Stability (PCS) framework for such a data science life cycle.