Salvaging a Genetics Project: Identifying and Correcting Sample Mix-Ups in High-Dimensional Data

Data Science Lecture Series


April 7, 2017
1:10pm to 2:30pm
190 Doe Library

Consider two inbred mouse strains that differ in a phenotype of interest, such as blood pressure. An intercross between these strains (similar to Mendel's crosses between different strains of peas) can be used to identify genes that contribute to the trait difference: we look for genomic regions (called quantitative trait loci [QTL]) for which the offspring genotypes are associated with their blood pressures. A key weakness of this approach is that one is generally left with very large regions containing many genes. One strategy to deal with this weakness is to also measure intermediate phenotypes, such as the mRNA expression of all genes in a relevant tissue. We then seek to identify genetic loci (called expression quantitative trait loci [eQTL]) that affect mRNA expression and to find genes for which genotype is associated with mRNA expression and also blood pressure.
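
As a rough illustration of what "genotype associated with phenotype" means at a single locus, the sketch below computes a LOD score by simple marker regression on simulated intercross data. Everything in it is an assumption made for illustration: the function name `lod_score`, the simulated effect sizes, and the use of single-marker regression, which is a simplification and not necessarily the method used in the talk.

```python
import math
import random


def lod_score(genotypes, phenotypes):
    """LOD score comparing 'one mean per genotype group' to 'one common mean'."""
    n = len(phenotypes)
    grand_mean = sum(phenotypes) / n
    # Residual sum of squares under the null model (no genotype effect).
    rss0 = sum((y - grand_mean) ** 2 for y in phenotypes)

    # Residual sum of squares when each genotype group gets its own mean.
    groups = {}
    for g, y in zip(genotypes, phenotypes):
        groups.setdefault(g, []).append(y)
    rss1 = 0.0
    for ys in groups.values():
        group_mean = sum(ys) / len(ys)
        rss1 += sum((y - group_mean) ** 2 for y in ys)

    return (n / 2) * math.log10(rss0 / rss1)


# Simulated intercross: genotypes 0/1/2 (AA/AB/BB) in roughly 1:2:1 proportions.
rng = random.Random(0)
n_mice = 200
linked = [rng.choices([0, 1, 2], weights=[1, 2, 1])[0] for _ in range(n_mice)]
unlinked = [rng.choices([0, 1, 2], weights=[1, 2, 1])[0] for _ in range(n_mice)]

# Hypothetical blood pressure: each B allele at the linked locus adds 5 units.
bp = [100 + 5 * g + rng.gauss(0, 5) for g in linked]

lod_linked = lod_score(linked, bp)
lod_unlinked = lod_score(unlinked, bp)
print(f"LOD at linked marker:   {lod_linked:.1f}")
print(f"LOD at unlinked marker: {lod_unlinked:.1f}")
```

A marker near a real QTL should give a much larger LOD score than an unlinked one; in practice the scan is repeated across the genome, and the resulting peaks tend to be broad, which is the weakness described above.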

Keeping track of sample identifiers is critical in this sort of work: sample mix-ups in the genotypes, phenotypes, or mRNA expression data will weaken the genotype/phenotype/mRNA associations. In a recent study with 500 intercross mice and gene expression microarray data on six tissues, we identified a large number of sample mix-ups (~18%) in the genotype data and a smaller number of mix-ups in each set of microarrays. I'll describe how I found and corrected these problems. In a nutshell: the expression of some genes is so strongly associated with genotype that the expression data can effectively serve as a DNA fingerprint for establishing individuals' identities.
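
The fingerprinting idea can be sketched on simulated data. The example below is a minimal illustration under stated assumptions, not the procedure used in the study: it simulates independent loci (real markers are linked), uses a nearest-centroid rule to predict genotype from expression, and all names (`simulate`, `predict_genotypes`, `find_mixups`) and parameter values are hypothetical.

```python
import random


def simulate(n_mice=30, n_genes=40, noise_sd=0.15, seed=1):
    """Simulate genotypes (0/1/2) and expression that tracks genotype closely."""
    rng = random.Random(seed)
    geno = [[rng.choices([0, 1, 2], weights=[1, 2, 1])[0] for _ in range(n_genes)]
            for _ in range(n_mice)]
    # Each gene's expression equals the genotype at its own locus plus small noise.
    expr = [[g + rng.gauss(0.0, noise_sd) for g in row] for row in geno]
    return geno, expr


def predict_genotypes(geno, expr):
    """Predict each sample's genotype at each locus from expression alone."""
    n_mice, n_genes = len(geno), len(geno[0])
    pred = [[0] * n_genes for _ in range(n_mice)]
    for k in range(n_genes):
        # Mean expression per genotype group, using the (mostly correct) labels.
        groups = {}
        for i in range(n_mice):
            groups.setdefault(geno[i][k], []).append(expr[i][k])
        centroids = {g: sum(v) / len(v) for g, v in groups.items()}
        # Assign each expression sample to the nearest group mean.
        for i in range(n_mice):
            x = expr[i][k]
            pred[i][k] = min(centroids, key=lambda g: abs(x - centroids[g]))
    return pred


def find_mixups(geno, expr):
    """Map each expression sample to its best-matching DNA sample if not itself."""
    pred = predict_genotypes(geno, expr)
    mixups = {}
    for i, p in enumerate(pred):
        # Number of loci where the predicted genotype disagrees with each DNA sample.
        mismatch = [sum(a != b for a, b in zip(p, row)) for row in geno]
        best = min(range(len(geno)), key=mismatch.__getitem__)
        if best != i:
            mixups[i] = best
    return mixups


geno, expr = simulate()
geno[3], geno[7] = geno[7], geno[3]  # introduce a mix-up: swap two DNA samples
mixups = find_mixups(geno, expr)
print(mixups)
```

Because a correctly labeled sample disagrees with its own genotypes at very few loci while unrelated samples disagree at most of them, the swapped pair should stand out clearly. Real data call for more care, for example in choosing which genes have strong enough genotype effects to be informative.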


Karl W. Broman

Biostatistics and Medical Informatics, University of Wisconsin-Madison

Karl Broman is a professor in the Department of Biostatistics & Medical Informatics at the University of Wisconsin-Madison. He is an applied statistician working on problems in genetics and genomics. He develops the R package R/qtl, has written a number of short tutorials useful for data scientists, and is keen to develop tools for interactive data visualization. He was Terry Speed's student in statistics at UC Berkeley and received his PhD in 1997.

Twitter: @kwbroman