Open-source software for generating synthetic electronic health records

BIDS Health and Life Sciences Lead Maryam Vareth is offering this project (#2) through UC Berkeley's Undergraduate Research Apprentice Program (URAP) for the Spring 2022 academic semester. Eligible undergraduates may apply online January 11-24, 2022.

Project Description

Research access to electronic health record (EHR) data is limited due to patient privacy concerns. Creating synthetic EHR data (data that models realistic patterns and yet does not correspond to real patient records) provides a potential mechanism to expand data access.

Although the academic and commercial sectors have developed successful methodologies for generating realistic synthetic EHR data, these methodologies are not in common use despite a great need. Commercial products are closed source and expensive. Academic solutions are sometimes open source but are often buggy, not portable, and difficult to use, especially for clinical users. Furthermore, different generators are implemented in different languages and packages, which makes direct comparison and benchmarking laborious. Finally, validation of the realism and privacy-preserving properties of generated synthetic datasets is often not incorporated into the generation pipeline.

This project aims to create a portable, usable, consolidated, and open-source software package for generating synthetic EHR data. The student will gain knowledge of open-source software development, unsupervised generative machine learning techniques such as generative adversarial networks, and user-interface design, as well as exposure to EHR data and healthcare data science. The student will be expected to present their work at the end of the semester and may have the opportunity to contribute to a manuscript describing the developed software package.

The specific selection of tasks will depend on the skill set and interests of the student and could include the following:

• Re-implementation and/or standardization of existing synthetic EHR generators, for example:
• MedGAN (Choi et al. 2017)
• CorGAN (Torfi et al. 2020)
• Writing modules for validation of synthetic data realism and privacy-preserving properties
• Developing a command line interface or dashboard for users to interact with the package
• Generic testing and debugging
• Testing the package on a real EHR dataset (e.g. MIMIC-III)
• Presenting work at group meetings
• Formal presentation at the end of the semester to BIDS community
• Contributing to the writing of a manuscript
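To make the validation task above more concrete, here is a minimal Python sketch of two checks that are commonly applied to binary (multi-label) patient records: a dimension-wise prevalence comparison as a rough realism check, and the distance from each synthetic record to its closest real record as a rough privacy heuristic. This is an illustration only, not the project's actual design. The function names `dimension_wise_prevalence` and `min_hamming_distance` are hypothetical, the data is randomly generated toy data, and a nearest-record distance is a heuristic rather than a formal privacy guarantee.

```python
import numpy as np


def dimension_wise_prevalence(real, synthetic):
    """Compare how often each binary feature is set in the two datasets.

    Returns the per-feature prevalence in the real data, the per-feature
    prevalence in the synthetic data, and their Pearson correlation.
    """
    real_p = real.mean(axis=0)
    synth_p = synthetic.mean(axis=0)
    corr = np.corrcoef(real_p, synth_p)[0, 1]
    return real_p, synth_p, corr


def min_hamming_distance(real, synthetic):
    """For each synthetic record, count the features that differ from
    its closest real record (0 means an exact copy of a real record)."""
    # Broadcast to shape (n_synthetic, n_real, n_features), then compare.
    diffs = synthetic[:, None, :] != real[None, :, :]
    return diffs.sum(axis=2).min(axis=1)


# Toy data (an assumption for illustration): four binary codes with
# different base rates, sampled independently for both datasets.
rng = np.random.default_rng(0)
base_rates = np.array([0.05, 0.2, 0.5, 0.8])
real_records = (rng.random((200, 4)) < base_rates).astype(int)
synthetic_records = (rng.random((100, 4)) < base_rates).astype(int)

real_p, synth_p, corr = dimension_wise_prevalence(real_records, synthetic_records)
distances = min_hamming_distance(real_records, synthetic_records)

print("real prevalence:     ", real_p)
print("synthetic prevalence:", synth_p)
print("prevalence correlation:", corr)
print("share of synthetic records identical to a real record:",
      (distances == 0).mean())
```

A prevalence correlation close to 1 suggests that the marginal feature frequencies are preserved, while a large share of zero distances suggests that the generator may be copying training records. With only a handful of binary features, as in this toy data, exact matches are expected by chance, so the second number is meaningful only on realistically high-dimensional records.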


• Choi, E., Biswal, S., Malin, B., Duke, J., Stewart, W. F., Sun, J. (2017). Generating Multi-label Discrete Patient Records using Generative Adversarial Networks. 68(PG-1-20), 1–20.
• Torfi, A., Fox, E. A. (2020, January 25). COR-GAN: Correlation-Capturing Convolutional Neural Networks for Generating Synthetic Healthcare Records. ArXiv.Org.


Required:
• Interest in open-source software development, data science, machine learning, and healthcare research
• Great teamwork (e.g. communication skills, punctuality, organization)
• Proficiency in Python
• Majoring in EECS, BioE, CS, data science, math, statistics, or another related discipline
• Working knowledge of TensorFlow/Keras or PyTorch
• Working knowledge of version control (e.g. GitHub)

Recommended:
• Familiarity with machine learning
• Experience with portable package creation (e.g. Docker)
• CS 188 and/or CS 189


