Abstract: Large batch size training of Neural Networks has been shown to incur accuracy loss when trained with the current methods. The precise underlying reasons for this are still not completely understood. Here, we study large batch size training through the lens of the Hessian operator and robust optimization. In particular, we perform a Hessian based study to analyze how the landscape of the loss functional is different for large batch size training. We compute the true Hessian spectrum, without approximation, by back-propagating the second derivative. Our results on multiple networks show that, when training at large batch sizes, one tends to stop at points in the parameter space with noticeably higher/larger Hessian spectrum, i.e., where the eigenvalues of the Hessian are much larger. We then study how batch size affects robustness of the model in the face of adversarial attacks. All the results show that models trained with large batches are more susceptible to adversarial attacks, as compared to models trained with small batch sizes. Furthermore, we prove a theoretical result which shows that the problem of finding an adversarial perturbation is a saddle-free optimization problem. Finally, we show empirical results that demonstrate that adversarial training leads to areas with smaller Hessian spectrum. We present detailed experiments with five different network architectures tested on MNIST, CIFAR-10, and CIFAR-100 datasets.
October 2, 2018 | arXiv.org
Zhewei Yao, Amir Gholami, Kurt Keutzer, Michael Mahoney