by Kasia Metkowski
Data meets narrative in Rebecca Barter’s Superheat, an R package that creates colorful and customizable heatmaps.
The author, Rebecca Barter, seeks to demystify the world of statistics and inject beauty into data visualizations. A third-year PhD student in the Department of Statistics at UC Berkeley and a data science fellow at the Berkeley Institute for Data Science (BIDS), she hails from Australia, where she graduated from The University of Sydney in 2013 with a bachelor of science (advanced) and was awarded the University Medal in Statistics.
Kidney Rejection in HIV-Positive Patients
Together with her advisors, Professors Bin Yu and Jasjeet Sekhon, Barter currently collaborates with the Sarwal Lab at UCSF on a project concerning kidney rejection in HIV-positive patients. Because HIV patients now live longer, problems like kidney disease are becoming more common in this population, often resulting in an increased demand for transplants. Although the overall survival rate of HIV-positive transplant patients is comparable to that of their HIV-negative counterparts, HIV-positive patients are more likely to physically reject transplants.
Barter observes expression levels of thousands of genes at a time to develop models that will hopefully reveal the causes of these increased rejection rates. To explore and understand connections between immunological/genetic characteristics and transplant rejection, she and her team require special visualization tools.
Researchers like Barter often face challenges when trying to visualize these huge amounts of data. Common tools, such as scatterplots or bar plots, rarely help in these situations because computer screens lack the space needed to display large amounts of data in these formats. For example, if Barter were to use scatterplots for her research, all the data points would sit on top of one another.
Heatmaps
However, there’s still hope for researchers wanting to display large amounts of data in a useful, aesthetically pleasing format. Heatmaps, which present variations of values in differently colored cells, have proven to be the best visualization option for Barter and her colleagues in bioinformatics.
In heatmaps, colors encode values, enabling users to read the data at a glance and quickly grasp micropatterns. For example, light blue could represent small sums and dark blue could represent larger ones. For most researchers, such displays are much easier to understand than a table of numbers, making them invaluable to Barter’s work.
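To make the idea of colors standing in for values concrete, here is a minimal Python sketch that maps each number in a small table to a shade between light blue and dark blue. It is only an illustration of the general technique, not Superheat's own code. The function names and the two endpoint colors are assumptions chosen for the example.

```python
# Illustrative sketch only: not part of Superheat.
# The endpoint colors below are arbitrary choices for the example.
LIGHT_BLUE = (222, 235, 247)
DARK_BLUE = (8, 48, 107)


def value_to_blue(value, vmin, vmax):
    """Map a number to a hex color between light blue and dark blue."""
    if vmax == vmin:
        t = 0.0  # all values equal: use the light end of the scale
    else:
        t = (value - vmin) / (vmax - vmin)
    t = min(1.0, max(0.0, t))  # clamp values outside [vmin, vmax]
    # Linear interpolation of each RGB channel.
    rgb = tuple(round(lo + t * (hi - lo)) for lo, hi in zip(LIGHT_BLUE, DARK_BLUE))
    return "#{:02x}{:02x}{:02x}".format(*rgb)


def heatmap_colors(matrix):
    """Return a grid of hex colors with the same shape as the input grid."""
    flat = [v for row in matrix for v in row]
    vmin, vmax = min(flat), max(flat)
    return [[value_to_blue(v, vmin, vmax) for v in row] for row in matrix]


grid = heatmap_colors([[1, 2], [3, 4]])
```

In this sketch the smallest value in the table gets the lightest shade and the largest gets the darkest, which is what lets a reader spot patterns without reading any numbers.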
All visualization methods, heatmaps included, have their limitations. According to Barter, much of the existing software in R for producing heatmaps is difficult to manipulate and looks unappealing. She credits Professor Bin Yu with inspiring her to improve on these tools after Yu expressed a desire to compare specific data regions to models.
Superheat
Thus, she authored Superheat, an R package that builds upon traditional R heatmap tools, making heatmaps more customizable and easier to manipulate. For instance, additional plots, such as line plots and boxplots, can be placed adjacent to the color grid.
In her package, researchers have full control over the aesthetics. It offers a wide range of extensions, making it easier to see relationships between independent variables and responses. Superheat maps are uniquely extendable: it is straightforward to add extra information, such as text annotations and adjacent dendrograms. Users can choose palettes based on preference, and the graph’s interior can be manipulated as well. Inside the cells, text can be added, fonts can be changed, words can be italicized, and so on.
“The results are much prettier,” she cheerfully adds.
She conceived her package by adding a scatterplot of model residuals next to a plotted heatmap while clustering similar data points together. She then thought of more and different kinds of features to add, such as bar plots of correlation information and line plots of response variables.
She explains, “I just kept adding more and more features until it eventually became what it is today.”
One of the key features of Superheat is the ability to plot additional data adjacent to the heatmap. She explains this feature’s significance with an example of a model that predicts rainfall amount. This model measures temperature, air pressure, wind speed, and humidity over many different days for its predictions.
A Superheat plot of this model shows when residuals (the differences between the predicted and true rainfall amounts) are higher for certain types of days (e.g., cold, humid days). Put simply, it indicates when the model is not working well under particular weather conditions. To reveal these discrepancies, Superheat maps the daily measurements as a heatmap and then plots a scatterplot of the residuals adjacent to each day’s data. To assess errors, the observer can compare the residuals to the corresponding weather conditions.
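The residual comparison described above can be sketched in a few lines of Python. This is a hedged illustration under stated assumptions, not Barter's analysis or Superheat's API: the four days of data are invented for the example, and the field names and the cutoffs that define a "cold, humid" day are hypothetical.

```python
from statistics import mean

# Invented example data: one dict per day (values are made up).
days = [
    {"temp_c": 5.0, "humidity": 0.90, "predicted_mm": 12.0, "actual_mm": 4.0},
    {"temp_c": 6.0, "humidity": 0.85, "predicted_mm": 2.0, "actual_mm": 9.0},
    {"temp_c": 24.0, "humidity": 0.40, "predicted_mm": 1.0, "actual_mm": 0.5},
    {"temp_c": 27.0, "humidity": 0.35, "predicted_mm": 0.0, "actual_mm": 0.5},
]


def residual(day):
    """Predicted rainfall minus true rainfall for one day."""
    return day["predicted_mm"] - day["actual_mm"]


def is_cold_humid(day, temp_cutoff=10.0, humidity_cutoff=0.8):
    """Hypothetical rule for a 'cold, humid' day; the cutoffs are assumptions."""
    return day["temp_c"] < temp_cutoff and day["humidity"] > humidity_cutoff


def mean_abs_residual(rows):
    """Average size of the prediction error over a group of days."""
    return mean(abs(residual(d)) for d in rows)


# One residual per day, in the same order as the rows of the heatmap,
# so each value can be drawn next to that day's measurements.
residuals = [residual(d) for d in days]

cold_humid = [d for d in days if is_cold_humid(d)]
other = [d for d in days if not is_cold_humid(d)]
```

Comparing `mean_abs_residual(cold_humid)` with `mean_abs_residual(other)` gives a numeric version of what the adjacent scatterplot shows visually: whether the model's errors are larger on one kind of day than on another.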
Barter personally uses Superheat in her project at UCSF when she needs to match patients who rejected their transplant to comparable patients who did not. Her package allows her to visualize the variables she wishes to match on (e.g., age and pre-transplant CD4 count) to find natural groupings among the patients. Because she can add text to each cell, she can see how many measurements she has of that type of variable for each patient.
Promotion of Heatmaps
Despite their potential, heatmaps rarely appear outside bioinformatics. Barter hopes to see them used more often as their applications could affect a wide array of fields, from public health to text analysis to neuroscience, benefiting any researcher who wants to visualize multiple types of data at the same time.
Barter acknowledges barriers to their popularity. She explains that academics tend to stick to what has always been used in their field even when better tools exist. Above all, the biggest barrier to heatmaps’ popularity is that few people know of the tool.
“Raising awareness is half the battle,” she says.
When not transforming the field of data analysis, Barter likes to knit and practice yoga. She is currently co-authoring a book about the field with Professor Bin Yu, and she blogs about life as a graduate student.
You can find out more about her work by checking out her blog or by emailing her.
Superheat features can be explored here.