id: cord-026827-6vjg386e
author: Awan, Ammar Ahmad
title: HyPar-Flow: Exploiting MPI and Keras for Scalable Hybrid-Parallel DNN Training with TensorFlow
date: 2020-05-22
pages:
extension: .txt
mime: text/plain
words: 6142
sentences: 347
flesch: 56
summary: To address these problems, we create HyPar-Flow, a model-size- and model-type-agnostic, scalable, practical, and user-transparent system for hybrid-parallel training that exploits MPI, Keras, and TensorFlow. HyPar-Flow provides a single API that can be used to perform data-, model-, and hybrid-parallel training of any Keras model at scale. We create an internal distributed representation of the user-provided Keras model, utilize TF's eager execution features for distributed forward/back-propagation across processes, exploit pipelining to improve performance, and leverage efficient MPI primitives for scalable communication. For ResNet-1001, an ultra-deep model, HyPar-Flow provides: 1) up to 1.6× speedup over Horovod-based data-parallel training, 2) 110× speedup over single-node training on 128 Stampede2 nodes, and 3) 481× speedup over single-node training on 512 Frontera nodes. To achieve this performance, we need to investigate whether widely-used HPC techniques, such as 1) efficient placement of processes on CPU cores, 2) pipelining via batch splitting, and 3) overlap of computation and communication, can be exploited to improve the performance of model-parallel and hybrid-parallel training.
cache: ./cache/cord-026827-6vjg386e.txt
txt: ./txt/cord-026827-6vjg386e.txt
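The summary refers to MPI primitives, TF eager execution, and a user-transparent Keras-level interface. Below is a minimal, hypothetical sketch, not the HyPar-Flow API, of the underlying idea of MPI-driven data-parallel training of a Keras model under TensorFlow 2 eager execution; every name here (the toy model, train_step, the random data shard) is an assumption introduced only for illustration.

# Hypothetical sketch: MPI-based data-parallel gradient averaging for a Keras
# model. Not HyPar-Flow's implementation; names and structure are assumptions.
from mpi4py import MPI
import tensorflow as tf

comm = MPI.COMM_WORLD
rank, size = comm.Get_rank(), comm.Get_size()

# Each rank builds an identical small Keras model.
model = tf.keras.Sequential([
    tf.keras.layers.Dense(32, activation="relu", input_shape=(16,)),
    tf.keras.layers.Dense(1),
])
loss_fn = tf.keras.losses.MeanSquaredError()
opt = tf.keras.optimizers.SGD(0.01)

# Broadcast rank 0's initial weights so all replicas start identically.
weights = comm.bcast(model.get_weights() if rank == 0 else None, root=0)
model.set_weights(weights)

def train_step(x, y):
    # Local forward/backward pass (eager execution), then allreduce-averaged
    # gradients so every replica applies the same update.
    with tf.GradientTape() as tape:
        loss = loss_fn(y, model(x, training=True))
    grads = tape.gradient(loss, model.trainable_variables)
    avg_grads = [comm.allreduce(g.numpy(), op=MPI.SUM) / size for g in grads]
    opt.apply_gradients(zip([tf.constant(g) for g in avg_grads],
                            model.trainable_variables))
    return float(loss)

# Toy per-rank shard of random data, standing in for a real partitioned dataset.
x = tf.random.normal((64, 16))
y = tf.random.normal((64, 1))
print(f"rank {rank}: loss {train_step(x, y):.4f}")

Run under an MPI launcher (e.g. mpirun -np 4 python script.py). A production system such as the one the paper describes would additionally split the model's layers across processes (model parallelism) and pipeline micro-batches, which this sketch does not attempt.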