Abstract: Scientists across fields are exploring and exploiting deep learning (DL) techniques for classification, prediction, simulation, and dimensionality reduction. Given their computation, communication, and I/O characteristics, these DL applications are naturally supercomputing applications. In this talk, I will present two works that enable highly scalable distributed DL training. The first enables efficient and scalable I/O for DL applications on supercomputers with FanStore, with which we scale real-world applications to hundreds of nodes on CPU and GPU clusters with over 90% scaling efficiency. The second focuses on scaling practice and its application to ImageNet training on thousands of compute nodes with state-of-the-art validation accuracy.