With Data, Bigger Might Not Always Be Better

March 21, 2016

Every conversation I’ve heard about what it means to be a data scientist consists of tons of ideas but no consensus. While it seems like nobody can agree on the sufficient conditions for obtaining this illustrious title, a lot of people are vehement about a necessary one: being a data scientist means you work with big data.

This might not seem like a problem—after all, the existence of data science as a field was borne out of the lava from the big data eruption. But that doesn’t mean you have to work with datasets that crash your hard drive to be a data scientist! An alternative title to this blog post: You’re probably a data scientist; you just don’t know it yet.

In fact, the thing that the recent data explosion has given us (besides, well, data) is a slow-building but powerful prejudice against non-massive datasets. If you aren’t panicking about how to efficiently store, access, and analyze your data, then what are you even doing? I work in biology, and from what I’ve seen, big data fever generally comes with the side effect of dividing the community into two warring factions: namely, those who can get big data and those who can’t.

There’s been quite a bit of talk of late about a quantitative revolution in the life sciences, but the truth is that biology, or at least several of its subfields, has been quantitative for quite some time. As XKCD pointed out, biology is the most applied of the traditional “hard” sciences (we can all fight about this definition later). This unique position has meant that biologists have always benefited from using quantitative techniques from chemistry and physics.

The only difference is that, nowadays, the amount of data coming out of the biological sciences has made mathematicians and computer scientists finally take notice of us. And this is great! But—there’s always a “but”—the focus of this attention has largely been on things that end in “omics,” which, in my opinion, are just a small portion of the biological fields that generate and analyze interesting datasets.

I really hope that this will change in the future, and working toward correcting this oversight is a major reason I’m excited to be a part of BIDS. I work with data that comes from single-molecule manipulation and detection, like optical tweezers and fluorescence microscopy. While the data I handle on a day-to-day basis isn’t as huge as genomics data has been since the sequencing explosion, it comes with its own set of interesting and challenging problems.

For instance, when you try to track something that’s really, really small, thermal noise is usually about as large as the forces you’re trying to measure. This, predictably, makes finding efficient and accurate techniques to separate the two pretty tough. If you want to learn more about how computational techniques are being used across the full spectrum of the life sciences, here’s an awesome video about some of the exciting work going on right here at Berkeley.
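To make that concrete, here is a minimal sketch (not from my own analysis pipeline) of one simple way to pull a slow signal out of comparable-sized thermal noise: smoothing a simulated trace with a Savitzky–Golay filter from SciPy. Every number and the simulated signal below are invented purely for illustration.

```python
# Toy illustration: recover a slow displacement signal buried in thermal
# noise of similar amplitude, using a Savitzky-Golay smoothing filter.
import numpy as np
from scipy.signal import savgol_filter

rng = np.random.default_rng(0)

# Simulated slow "true" displacement plus Gaussian noise of comparable size
t = np.linspace(0, 1, 2000)                   # time, seconds
true_signal = 5 * np.sin(2 * np.pi * 2 * t)   # nm, slow oscillation
noise = rng.normal(scale=5, size=t.size)      # nm, thermal noise ~ signal size
measured = true_signal + noise

# The filter fits a low-order polynomial in a sliding window, which keeps
# slow features while averaging out fast fluctuations.
smoothed = savgol_filter(measured, window_length=101, polyorder=3)

rms_before = np.sqrt(np.mean((measured - true_signal) ** 2))
rms_after = np.sqrt(np.mean((smoothed - true_signal) ** 2))
print(f"RMS error before smoothing: {rms_before:.2f} nm")
print(f"RMS error after smoothing:  {rms_after:.2f} nm")
```

Real single-molecule analyses use far more careful techniques (and have to worry about what the smoothing throws away), but the sketch shows why separating signal from noise of the same magnitude is a genuinely interesting problem.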

I’ll also point out something that isn’t talked about that much in data science: it can be hard (and sometimes really hard) to get any data at all, let alone the big kind. During my undergrad, I studied snake locomotion (shameless plug here). Unsurprisingly, getting as many replicates out of an intact vertebrate as you could out of its genome is just not possible. And it’s even harder when you work with higher animals!

These fields come with their own unique set of challenges. For example, raising and caring for vertebrates (especially higher ones) is hard, and it’s often not feasible to stick to the one-animal-one-trial rule needed to assume that data points are independent. Low-n datasets also tend to call for nonparametric tests, since with so few samples the underlying distribution is hard to tease out. Issues like these, which are unique to small data, have been neglected by data science at large. In addition to hoping that the quantitative community comes to embrace other areas of the life sciences as enthusiastically as they have genomics, I also hope that the rise of big data in biology will not overshadow the importance of the fields that, by necessity, rely on small datasets.
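As a small, hypothetical example of that nonparametric point: with only a handful of measurements per group, you might reach for something like the Mann–Whitney U test, which compares ranks rather than assuming any particular distribution. The groups and values below are made up.

```python
# Hypothetical comparison of two small samples (say, stride frequencies
# from two groups of animals) using a rank-based nonparametric test.
from scipy.stats import mannwhitneyu

group_a = [1.8, 2.1, 2.4, 1.9, 2.2]   # made-up measurements, n = 5
group_b = [2.6, 2.9, 2.5, 3.1, 2.7]   # made-up measurements, n = 5

stat, p_value = mannwhitneyu(group_a, group_b, alternative="two-sided")
print(f"U = {stat:.1f}, p = {p_value:.3f}")
```

The trade-off, of course, is statistical power: rank-based tests ask less of the data and, with so few points, can tell you correspondingly less.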

Having said all this, I will concede that, generally speaking, having a mountain of data is probably not a bad thing, despite the aforementioned panic it might elicit. The need for more powerful statistical techniques in the life sciences has changed the landscape of the field for the better. Nowadays, even biologists who work with small data (and the journals who publish them) are acutely aware of the need for rigorous testing and have access to the tools they need to perform it.

The problem arises when we get so caught up in the frenzy to get more and more and more data that we stop paying attention to how we’re getting it and what it’s really telling us (or, importantly, if it’s telling us anything at all). After all, the same lesson permeates almost every aspect of our lives: quality should trump quantity. Why should our data be any different?