Data about Data: Starting to Define "Data Science"

May 28, 2015

Last October, BIDS hosted the first Data Science Environment (DSE) Summit, bringing together researchers, staff, and students from UC Berkeley, NYU, and UW who are partnering to “harness the potential of data scientists and big data for basic research and scientific discovery.”

At the summit, there was a lot of talk about what data science is exactly—a lot of talk but no real consensus. We came to the conclusion that data science encompasses a lot of things—different methods, different domains, different types of people, different end goals. There’s no easy explanation.

While we didn’t come up with a simple definition of data science at the summit, it was there that I first heard about an interesting “experiment” under way at the Moore Foundation to explore the data science paradigm a little more deeply.

As part of its Data-Driven Discovery Initiative—the goal of which is to “support people who innovate around data-driven discovery”—the Gordon and Betty Moore Foundation ran an Investigator Competition to identify the top data scientists throughout the country. In the competition pre-application stage, the Moore Foundation asked each applicant to provide up to five works related to “big data” in scientific discovery that had influenced their research. When all was said and done, they had collected nearly 5,000 references from 1,100 scientists conducting data-intensive research.

In late May, Mark Stalzer and Chris Mentzel from the Moore Foundation shared a preprint of their preliminary findings. It summarizes a list of resources from a population of people who are at the top of the data science field. While the authors did not frame the paper in this way, I feel this list could be a very interesting data set for helping to define data science. After all, it captures the zeitgeist of a large number of researchers most likely to be called “data scientists” in an academic context.

Stalzer and Mentzel used a casual sorting analysis to process the citations and manually grouped the works into several main areas of the data science world: foundational theory, astronomy, genomics, classical statistical methods, machine learning, the Google Papers, general tools, and the centrality of the scientific method. While these emergent themes are informative, an automated natural language processing pipeline, along with more formal topic modeling and clustering, could yield more insight.
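For the curious, here is a rough sketch of what a first step toward that kind of automated grouping might look like. It is not the authors' method and it is not formal topic modeling: it simply weights the words in each title with TF-IDF and greedily groups titles whose cosine similarity clears a threshold. The titles are made-up placeholders (the raw citation data has not been released), and the function names and the 0.2 threshold are my own assumptions for illustration.

```python
import math
import re
from collections import Counter

# Made-up placeholder titles, NOT drawn from the Moore Foundation data set.
TITLES = [
    "Random forests for classification",
    "Support vector machines for classification",
    "Sky survey data release for astronomy",
    "Astronomy sky survey catalog",
]


def tokenize(text):
    """Lowercase the text and split it into alphabetic words."""
    return re.findall(r"[a-z]+", text.lower())


def tfidf_vectors(docs):
    """Return one sparse {term: weight} TF-IDF vector per document."""
    tokenized = [tokenize(doc) for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: in how many documents does each term appear?
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        term_freq = Counter(tokens)
        # Smoothed inverse document frequency, so no weight is ever zero.
        vectors.append({
            term: count * (math.log((1 + n_docs) / (1 + doc_freq[term])) + 1)
            for term, count in term_freq.items()
        })
    return vectors


def cosine(a, b):
    """Cosine similarity between two sparse vectors."""
    dot = sum(weight * b.get(term, 0.0) for term, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def group_by_similarity(docs, threshold=0.2):
    """Greedy grouping: a document joins the first group whose first
    member is similar enough, otherwise it starts a new group."""
    vectors = tfidf_vectors(docs)
    groups = []  # each group is a list of document indices
    for i, vec in enumerate(vectors):
        for group in groups:
            if cosine(vec, vectors[group[0]]) >= threshold:
                group.append(i)
                break
        else:
            groups.append([i])
    return groups


if __name__ == "__main__":
    for group in group_by_similarity(TITLES):
        print([TITLES[i] for i in group])
```

On a real collection of several thousand references, a proper clustering or topic modeling library would be the more sensible choice, and the threshold would need tuning against the data.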

The current results are preliminary, but with a more formal approach to the language processing and analysis, the data set might serve as part of the foundation for a broader discussion of what data science is. For those of us still trying to determine what data science is exactly, it is definitely worth watching out for a release of the raw data or an updated version of the analysis.