Note: all analyses and data used in this post can be found at this GitHub repository. You can also run the code interactively at this Binder link.
The Brain Imaging Data Structure (BIDS) is a community-driven specification for structuring human neuroimaging data and metadata. It is not a new data format; instead, it separates the structure of the metadata from the raw data file itself. BIDS began in the fMRI community, and extension proposals for MEG, EEG, and iEEG are currently under way in those communities.
A goal of BIDS is to accommodate pre-existing workflows in these communities, while simultaneously encouraging and facilitating best practices in reproducible, efficient, and open science. A long-standing challenge here has been the data format fragmentation problem. Because there is no accepted standard format for storing raw electrophysiology data, labs use many different and mutually incompatible formats. Sometimes this is for good reason: one format may not be technically suited for a particular kind of data (for example, if the format does not allow numbers to be represented with enough bits). Other times, it is a relatively arbitrary choice made due to lab or institute history.
As the EEG, MEG, and iEEG communities attempt to define their own standards around data (re)use, we decided to ask community members what formats they currently use, as well as what they’d be willing to use for data sharing. We sent out a short survey to members of the EEG, MEG, and iEEG communities via email listservs and social media. The survey consisted of three simple questions:
- What brain modalities do you currently study?
- Which data formats do you currently use for raw data?
- Which data formats would you be willing to use for sharing raw data?
This post describes the results of this survey, a snapshot of the current usage patterns for data formats within the human electrophysiology community. We’ll break it down into a few main takeaways below.
Respondents
We received over 440 responses from nearly 90 unique universities and institutions. Respondents were largely located in the United States and Europe, with several responses from Asia and Australia. Here is the geographic distribution of most respondents:
Unsurprisingly, the majority of responses came from the EEG community, which is both older and larger than either the MEG or iEEG communities. Here is a breakdown of responses by modality:
The responses also showed that many researchers work with more than one brain modality:
Formats vary widely by brain modality
It is common for a lab to use several data formats, either for technical reasons or because researchers within the lab prefer one format to another. The survey asked about a wide range of formats, focusing on those that members of each community had reported beforehand. Here is a list of all formats mentioned in the survey:
For each format (columns), we calculated the percentage of its users who also reported using each other format (rows). The results showed high usage overlap for subsets of data formats, particularly within communities:
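The calculation described above can be sketched in a few lines of Python. This is a minimal illustration, not the analysis code from the repository: the function name `co_usage_percent` and the toy responses are made up for the example, and each respondent is assumed to be represented as a set of format names.

```python
def co_usage_percent(responses):
    """Return pct[col][row]: % of users of format `col` who also use `row`.

    `responses` is a list of sets, one set of format names per respondent
    (an assumed representation, chosen for illustration).
    """
    formats = sorted(set().union(*responses))
    # Number of respondents who reported using each format.
    users = {f: sum(f in r for r in responses) for f in formats}
    return {
        col: {
            row: 100.0 * sum(col in r and row in r for r in responses) / users[col]
            for row in formats
            if row != col
        }
        for col in formats
    }


# Hypothetical toy data, not the survey responses.
toy_responses = [
    {"EDF", "EEGLAB"},
    {"EDF"},
    {"EDF", "FieldTrip"},
    {"EEGLAB", "FieldTrip"},
]
pct = co_usage_percent(toy_responses)
print(pct["EDF"]["EEGLAB"], pct["EEGLAB"]["EDF"])
```

Note that the result is not symmetric: the share of EDF users who also use EEGLAB generally differs from the share of EEGLAB users who also use EDF, because the two groups have different sizes.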
It is clear from these results that many labs don’t pick one format and use it exclusively; they often use a variety of formats within their group. With this in mind, we also asked respondents which formats they’d be willing to use in order to share their data with others, a primary goal of the BIDS project. Here is a breakdown of the formats in which respondents would be willing to share data:
As you can see, the majority of respondents would be willing to share data in formats with a custom structure under the hood (e.g. `.mat` or `.hdf5` files). This is likely because labs already use some internal structure of their own, and are most likely to recommend whatever format they’re already using.
Unfortunately, using bespoke data structures is not a sustainable solution for open data sharing in the community. BIDS is interested in finding structured, open data formats that fit most researchers' needs. To this end, we next break down the formats that respondents would be willing to use for sharing. We’ll make one plot for each modality, and show a dotted line at the format where coverage reaches 80% of respondents. We’ll remove any custom-tailored format responses (.mat and .hdf5), since these couldn’t adhere to the BIDS specification by themselves.
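One possible reading of the 80% threshold is sketched below: rank formats by popularity, then add them one at a time until at least 80% of respondents accept at least one of the chosen formats. This is an assumption about how the cutoff could be computed, not necessarily how the plots in this post were produced. The function name `formats_to_cover`, the toy data, and the choice to ignore respondents who listed only custom formats are all illustrative.

```python
from collections import Counter


def formats_to_cover(responses, target=0.8, exclude=(".mat", ".hdf5")):
    """Return the most popular formats needed to cover `target` of respondents.

    `responses` is a list of sets of format names, one per respondent.
    Custom-structure formats in `exclude` are dropped first, and respondents
    left with no formats are ignored (an assumption made for this sketch).
    """
    cleaned = [set(r) - set(exclude) for r in responses]
    cleaned = [r for r in cleaned if r]
    if not cleaned:
        return []
    counts = Counter(f for r in cleaned for f in r)
    # Most popular first; ties broken alphabetically for reproducibility.
    ranked = sorted(counts, key=lambda f: (-counts[f], f))
    chosen = []
    for fmt in ranked:
        chosen.append(fmt)
        covered = sum(bool(r & set(chosen)) for r in cleaned)
        if covered / len(cleaned) >= target:
            break
    return chosen


# Hypothetical toy data, not the survey responses.
toy_sharing = [
    {"EDF"},
    {"EDF", "EEGLAB"},
    {"EDF", ".mat"},
    {"EEGLAB"},
    {"FieldTrip"},
    {".mat"},
]
print(formats_to_cover(toy_sharing))
```

Running this separately on the responses for each modality would give one list of formats per modality, which is the kind of per-modality comparison the plots are meant to support.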
EDF, EEGLAB, and FieldTrip formats are used everywhere
One takeaway from the plots above is that EDF is by far the most popular structured data format across these modalities. This is a good thing! EDF has been around for many years, is well-defined, and is an open format. However, it also has technical drawbacks, most notably that one cannot store data with more than 16 bits per number. BIDS needs to support formats that are both widely adopted and technically suited to the workflows of neuroscientists. As a final analysis, let’s look at the number of formats needed to cover the top 80% of responses for each modality:
Takeaways
So what did the BIDS format survey teach us? First, it confirmed what we already knew: researchers in the human MEG/EEG/iEEG communities use a wide variety of data formats. However, we also learned that most people would be willing to share their data in a small subset of formats. We hope that the BIDS specifications for these communities will reflect this distribution of responses, and that these data will be used to support formats that make it easier for researchers to share their data with the community.