scPCA: A toolbox for sparse contrastive principal component analysis in R

Philippe Boileau, Nima S. Hejazi, and Sandrine Dudoit

The Journal of Open Source Software
February 25, 2020


Data pre-processing and exploratory data analysis are crucial steps in the data science life cycle, often relying on dimensionality reduction techniques to extract pertinent signal. As the collection of large and complex datasets becomes the norm, the need for methods that can successfully glean pertinent information from among increasingly intricate technical artifacts is greater than ever. What’s more, many of the most historically reliable and commonly used methods have demonstrably poor performance, or even fail outright, in reducing the dimensionality of large and noisy datasets in a stable, interpretable, and relevant manner.

Principal component analysis (PCA) is one such method. Although popular for its interpretable results and ease of implementation, PCA’s performance on high-dimensional data often leaves much to be desired. Its performance has been characterized as unstable in such settings (Johnstone & Lu, 2009), and it has been shown to often emphasize unwanted variation (e.g., batch effects) in lieu of the signal of interest.

Consequently, modifications of PCA have been developed to remedy these issues. Namely, sparse PCA (SPCA) (Zou, Hastie, & Tibshirani, 2006) was created to increase the stability and interpretability of the principal component loadings in high dimensions, while contrastive PCA (cPCA) (Abid, Zhang, Bagaria, & Zou, 2018) leverages control data to adjust for unwanted effects and capture relevant information.
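To make the contrastive idea concrete, the following is a minimal base R sketch, not the implementation of Abid et al. (2018) or of the scPCA package. It assumes simulated data in which variable 1 carries a nuisance effect present in both the target and the control (background) data, while variable 2 carries signal found only in the target data. The helper `contrastive_loadings()` and the fixed contrast value of 2 are hypothetical choices made for illustration; the loadings are taken as the leading eigenvectors of the target covariance matrix minus a scaled background covariance matrix.

```r
# Simulated data (an assumption for illustration, not from the paper).
set.seed(1)
n <- 500
p <- 10

# Background (control) data: noise plus a nuisance effect on variable 1.
background <- matrix(rnorm(n * p), nrow = n, ncol = p)
background[, 1] <- background[, 1] + rnorm(n, sd = 4)

# Target data: the same kind of nuisance effect on variable 1, plus a
# two-group signal on variable 2 that the background data do not contain.
target <- matrix(rnorm(n * p), nrow = n, ncol = p)
target[, 1] <- target[, 1] + rnorm(n, sd = 4)
target[, 2] <- target[, 2] + rep(c(-2, 2), each = n / 2)

# Hypothetical helper: leading k eigenvectors of the contrastive covariance
# matrix. A contrast of 0 reduces to ordinary PCA on the target data.
contrastive_loadings <- function(target, background, contrast, k = 1) {
  c_contrast <- cov(target) - contrast * cov(background)
  eigen(c_contrast, symmetric = TRUE)$vectors[, seq_len(k), drop = FALSE]
}

pca_v <- contrastive_loadings(target, background, contrast = 0)
cpca_v <- contrastive_loadings(target, background, contrast = 2)

# Index of the variable with the largest absolute loading: PCA should be
# dominated by the nuisance variable, the contrastive version by the signal.
which.max(abs(pca_v))
which.max(abs(cpca_v))
```

In this toy setting the contrast is fixed by hand; choosing it from a set of candidate values is the step that the semi-automated and automated variants of cPCA address.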

Although SPCA and cPCA have proven useful in resolving individual shortcomings of PCA, neither is capable of tackling the issues of stability and relevance simultaneously. The scPCA R package implements sparse contrastive PCA (scPCA) (Boileau, Hejazi, & Dudoit, 2019), a combination of these methods, drawing on cPCA to remove unwanted effects and on SPCA to sparsify the principal component loadings. In both simulation studies and data analysis, Boileau et al. (2019) provided practical demonstrations of scPCA’s ability to extract stable, interpretable, and uncontaminated signal from high-dimensional biological data. Indeed, scPCA was found to produce more informative and interpretable embeddings than linear (e.g., PCA, cPCA) and non-linear dimensionality reduction methods (e.g., UMAP (McInnes, Healy, & Melville, 2018), t-SNE (van der Maaten & Hinton, 2008)) commonly used to explore high-dimensional biological data. Such demonstrations included the re-analysis of several publicly available protein expression, microarray gene expression, and single-cell transcriptome sequencing datasets.
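The two-step combination described above (contrast first, then sparsify) can be sketched in base R as follows. This is only an illustration under stated assumptions: the data are simulated, the contrast and penalty values are fixed by hand, and the sparsification step is a crude soft-thresholding of the leading eigenvector, used here as a stand-in for the SPCA formulation of Zou et al. (2006) rather than a reproduction of it. The helper `soft_threshold()` is hypothetical.

```r
# Simulated data (an assumption for illustration): variable 1 carries a
# nuisance effect shared by both datasets; variable 2 carries signal that
# appears only in the target data.
set.seed(2)
n <- 500
p <- 10
background <- matrix(rnorm(n * p), nrow = n, ncol = p)
background[, 1] <- background[, 1] + rnorm(n, sd = 4)
target <- matrix(rnorm(n * p), nrow = n, ncol = p)
target[, 1] <- target[, 1] + rnorm(n, sd = 4)
target[, 2] <- target[, 2] + rep(c(-2, 2), each = n / 2)

# Step 1 (contrastive step): contrastive covariance matrix for a fixed,
# hand-picked contrast value.
contrast <- 2
c_contrast <- cov(target) - contrast * cov(background)
dense_v <- eigen(c_contrast, symmetric = TRUE)$vectors[, 1]

# Step 2 (sparsification stand-in): shrink loadings toward zero, set small
# ones exactly to zero, then rescale to unit length.
soft_threshold <- function(v, penalty) sign(v) * pmax(abs(v) - penalty, 0)
sparse_v <- soft_threshold(dense_v, penalty = 0.2)
sparse_v <- sparse_v / sqrt(sum(sparse_v^2))

# One-dimensional embedding of the centered target data.
embedding <- scale(target, center = TRUE, scale = FALSE) %*% sparse_v

# Number of variables that keep a non-zero loading.
sum(sparse_v != 0)
```

Loadings that are exactly zero drop the corresponding variables from the embedding, which is what makes sparse loadings easier to interpret in high dimensions.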

As the scPCA software package was specially designed for use in disentangling biological signal from technical noise in high-throughput sequencing data, a free and open-source software implementation has been made available via the Bioconductor Project (Gentleman, Carey, Huber, Irizarry, & Dudoit, 2006; Gentleman et al., 2004; Huber et al., 2015) for the R language and environment for statistical computing (R Core Team, 2020). The scPCA package also implements cPCA, previously unavailable in the R language, in two flavors: (1) the semi-automated version of Abid et al. (2018) and (2) the automated version formulated by Boileau et al. (2019). In order to interface seamlessly with data structures common in computational biology, the scPCA package integrates fully with the SingleCellExperiment container class (Lun & Risso, 2019), using the class to store the cPCA and scPCA representations generated via the reducedDims accessor method. Finally, to facilitate parallel computation, the scPCA package contains parallelized versions of each of its core subroutines, making use of the infrastructure provided by the BiocParallel package. To use parallelization effectively, one need only set parallel = TRUE in a call to the scPCA package, after having registered a particular parallelization backend, as per the BiocParallel documentation.

Boileau et al. (2020). scPCA: A toolbox for sparse contrastive principal component analysis in R. Journal of Open Source Software, 5(46), 2079.
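The parallelization workflow described above might look roughly like the following sketch. Only `parallel = TRUE` and the need to register a BiocParallel backend come from the text. The simulated data, the choice of a two-worker `MulticoreParam` backend, and the argument names `target`, `background`, `n_centers`, `contrasts`, and `penalties` (with their small grids of values) are assumptions about the package interface that should be checked against the package documentation. The call is guarded so that the script still runs when the packages are not installed.

```r
# Simulated data (an assumption for illustration): two groups in the target
# data, separated along variable 2; nuisance variation on variable 1 in both.
set.seed(3)
n <- 100
p <- 10
background <- matrix(rnorm(n * p), nrow = n, ncol = p)
background[, 1] <- background[, 1] + rnorm(n, sd = 4)
target <- matrix(rnorm(n * p), nrow = n, ncol = p)
target[, 1] <- target[, 1] + rnorm(n, sd = 4)
target[, 2] <- target[, 2] + rep(c(-2, 2), each = n / 2)

fit <- NULL
if (requireNamespace("scPCA", quietly = TRUE) &&
    requireNamespace("BiocParallel", quietly = TRUE)) {
  # Register a parallelization backend first (two workers is an arbitrary
  # choice; other BiocParallel backends can be registered instead).
  BiocParallel::register(BiocParallel::MulticoreParam(workers = 2))

  # Assumed interface: small grids of contrasts and penalties keep this
  # example quick; n_centers is the assumed number of clusters used when
  # selecting among them.
  fit <- scPCA::scPCA(
    target = target,
    background = background,
    n_centers = 2,
    contrasts = c(1, 10),
    penalties = c(0.05, 0.1),
    parallel = TRUE
  )
}
```

Registering the backend separately from the analysis call leaves the choice of parallel hardware to the user, so the same analysis code can run unchanged on a laptop or a cluster.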
