Open-source software for generating synthetic electronic health records

BIDS Health and Life Sciences Lead Maryam Vareth offers this project (#2) through UC Berkeley's Undergraduate Research Apprentice Program (URAP).

Research access to electronic health record (EHR) data is limited due to patient privacy concerns. Creating synthetic EHR data (data that models realistic patterns and yet does not correspond to real patient records) provides a potential mechanism to expand data access.

Although the academic and commercial sectors have developed successful methodologies for generating realistic synthetic EHR data, these methodologies are not in common use despite a great need. Commercial products are closed source and expensive. Academic solutions are sometimes open source but are often buggy, not portable, and difficult to use, especially for clinical users. Furthermore, different generators are implemented in different languages and packages, making direct comparison and benchmarking laborious. Finally, validation of the realism and privacy-preserving properties of generated synthetic datasets is often not incorporated into the generation pipeline.

This project aims to create a portable, usable, consolidated, and open-source software package for generating synthetic EHR data. The student will gain knowledge of open-source software, generative unsupervised machine learning techniques such as generative adversarial networks, and user-interface design, along with exposure to EHR data and healthcare data science. The student will be expected to present their work at the end of the semester and may have the opportunity to contribute to a manuscript describing the developed software package.


