Background: Testing single nucleotide polymorphisms (SNPs) one at a time is the usual way to identify genetic variants associated with complex traits. Ideally, one should model all covariates in unison, but most existing analysis methods for genome-wide association studies (GWAS) perform only univariate regression.
Results: We extend and efficiently implement iterative hard thresholding (IHT) for multiple regression, in which all SNPs are modeled jointly. Our extensions accommodate generalized linear models (GLMs), prior information on genetic variants, and grouping of variants. In our simulations, IHT recovers up to 30% more true predictors than SNP-by-SNP association testing and reduces false positive rates by 2 to 3 orders of magnitude relative to lasso regression. These advantages stem from IHT’s ability to recover unbiased coefficient estimates. We also apply IHT to the Northern Finland Birth Cohort of 1966 and find that it recovers plausible variants associated with HDL and LDL.
Conclusions: Our real-data analysis and simulation studies suggest that IHT can (a) recover highly correlated predictors, (b) avoid over-fitting, (c) deliver better true positive and false positive rates than either marginal testing or lasso regression, (d) recover unbiased regression coefficients, and (e) exploit prior information and group-sparsity. Although we study these advances in the context of GWAS inference, our extensions are pertinent to other regression problems with large numbers of predictors.
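For readers unfamiliar with IHT, the core idea can be illustrated with a minimal sketch: take a gradient step on the loss, then keep only the k largest-magnitude coefficients. The code below is not the implementation described in this work. It assumes an ordinary least-squares loss, a fixed step size, and a user-chosen sparsity level k, and it omits the GLM, prior-weight, and group-sparsity extensions. The function names `project_sparse` and `iht_least_squares` are hypothetical.

```python
import numpy as np


def project_sparse(beta, k):
    """Keep the k largest-magnitude entries of beta and set the rest to zero."""
    out = np.zeros_like(beta)
    if k > 0:
        keep = np.argpartition(np.abs(beta), -k)[-k:]
        out[keep] = beta[keep]
    return out


def iht_least_squares(X, y, k, n_iter=200, tol=1e-8):
    """Basic IHT for min 0.5 * ||y - X b||^2 subject to at most k nonzeros in b.

    Assumption: a fixed step size 1/L, with L the squared spectral norm of X.
    This is a simple, conservative choice made for illustration only.
    """
    p = X.shape[1]
    beta = np.zeros(p)
    step = 1.0 / np.linalg.norm(X, 2) ** 2
    for _ in range(n_iter):
        grad = X.T @ (X @ beta - y)                       # gradient of the loss
        new_beta = project_sparse(beta - step * grad, k)  # hard thresholding
        converged = np.linalg.norm(new_beta - beta) <= tol * (1.0 + np.linalg.norm(beta))
        beta = new_beta
        if converged:
            break
    return beta


# Small simulated example (hypothetical data, not from the study).
rng = np.random.default_rng(0)
X = rng.standard_normal((200, 50))
true_beta = np.zeros(50)
true_beta[[3, 17, 42]] = [2.0, -3.0, 1.5]
y = X @ true_beta + 0.1 * rng.standard_normal(200)

estimate = iht_least_squares(X, y, k=3)
print(np.flatnonzero(estimate))  # indices of the selected predictors
```

Because the projection only selects coefficients and never shrinks them, the retained estimates are not pulled toward zero the way lasso estimates are, which is the intuition behind the unbiasedness claim above. In practice k would be chosen by a procedure such as cross-validation rather than fixed in advance.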