BIDS-BCHSI Research Xchange Forum — Comparison of synthetic electronic health record data generation techniques for training predictive clinical models

I4H Forum

March 1, 2021
12:30pm to 1:30pm
Virtual Participation


BIDS-BCHSI Research Xchange Forum 
Date: Monday, March 1, 2021
Time: 12:30-1:30 PM Pacific Time
Location: Virtual Participation 
Register to receive the virtual access links.

Comparison of synthetic electronic health record data generation techniques for training predictive clinical models 

Haley Hunter-Zinck
2019-2021 Data Science Health Innovation Fellow 
BIDS and UCSF Bakar Computational Health Sciences Institute 

Abstract: Synthetic data is gaining attention as a way to facilitate access to electronic health record (EHR) data for building predictive clinical models. Currently, there are several methodologies for generating synthetic data. Some rely on access to real, patient-level EHR data, such as methods based on generative adversarial networks or other machine learning or statistical techniques. Others, such as Synthea, do not depend on record-level EHR access and instead use publicly available, aggregate data resources. Here, we perform quantitative and qualitative comparisons of different synthetic data generation methodologies for the purpose of building clinical predictive models using EHR data. We formulate comparable synthetic datasets with CorGAN and Synthea using the Veterans Health Administration’s COVID-19 Shared Data Resource as a template and a benchmark. Using each synthetic dataset, we train models to predict COVID-19 outcomes such as transfer to the intensive care unit or mortality, then validate the synthetically trained models on a real test dataset to measure and compare model utility. We also qualitatively compare the synthetic data generators on privacy risks, required data inputs, and the manual effort and computational resources needed to train them.

The BIDS-BCHSI Research Xchange Forum is an open discussion platform for the interdisciplinary exchange of ideas and research projects at the intersection of healthcare and data science. Participants are invited to engage in a variety of activities, including presentations of work-in-progress, discussions and critiques of recent papers and AI methods in healthcare, introductions to new tools and methods, and opportunities to foster new collaborations. Invited speakers include leading voices in AI and healthcare, and active conversations invite participants to share fresh perspectives. Clinicians and physicians with an interest in data science methods and tools, as well as data science faculty and researchers with applications or interests in healthcare and the health sciences, are welcome and encouraged to participate. Regular participants will also include the I4H Fellows, as well as post-docs, staff, and faculty from UC Berkeley, UCSF, and Johnson & Johnson. The immediate goals of this Forum are to share our current research projects with a wider audience, and to increase engagement and improve communication among the three host organizations. Meetings are held virtually on the first Monday of each month at 12:30-1:30 PM Pacific Time, and interested members of the UC Berkeley, UCSF, and Johnson & Johnson communities are invited to sign up for this group's mailing list to receive information about upcoming webinars. Please contact for more information.



Haley Hunter-Zinck

Data Science Health Innovation Fellow

Haley Hunter-Zinck joined BIDS and BCHSI/UCSF in Fall 2019 as part of the first cohort of Data Science Health Innovation Fellows in the Innovate For Health program. She earned a Ph.D. in Computational Biology from Cornell University in 2014. She completed a postdoc in medical informatics at the VA Boston Healthcare System and continued at VA Boston from 2017 to 2019 as a researcher, applying machine learning to emergency department and hospital flow problems.