Scaling TensorFlow and Caffe to 256 GPUs
I wrote this story on Aug 7, 2017, for the IBM Blog.
Deep learning has taken the world by storm in the last four years, powering hundreds of consumer web and mobile applications that we use every day. But the extremely long training times in most frameworks present a hurdle that curtails the broader proliferation of deep learning. It may currently take days or even weeks to train large AI models on big data sets to the desired accuracy.
At the crux of this problem is a technical limitation. The popular open-source deep-learning frameworks do not run efficiently across multiple servers. So, while most data scientists are using servers with four or eight GPUs, they can't scale beyond that single node. For example, when we trained a ResNet-101 model on the ImageNet-22K data set, it took us 16 days on a single Power Systems server (S822LC for High Performance Computing) with four NVIDIA P100 GPU accelerators.
16 days — that’s a lot of time you could be spending elsewhere.
And model training is an iterative task: a data scientist tweaks hyperparameters, the model, and even the input data, then retrains, often many times over. These long training runs delay time to insight and limit productivity.
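To see why scaling out matters, the 16-day single-server run above can be plugged into a simple speedup estimate. This is a minimal sketch: the 16-day / 4-GPU baseline comes from the article and the 256-GPU target from the title, but the scaling-efficiency values are illustrative assumptions, not measured results.

```python
# Back-of-the-envelope estimate of training time when scaling out.
# Baseline (16 days on 4 GPUs) is from the article; the efficiency
# values below are hypothetical, not measurements.

def estimated_days(baseline_days, baseline_gpus, target_gpus, efficiency):
    """Estimate training time after scaling from baseline_gpus to target_gpus.

    efficiency is the fraction of ideal linear speedup actually achieved
    (1.0 = perfect scaling; real multi-node runs achieve less).
    """
    speedup = (target_gpus / baseline_gpus) * efficiency
    return baseline_days / speedup

for eff in (1.0, 0.95, 0.5):
    days = estimated_days(16, 4, 256, eff)
    print(f"scaling efficiency {eff:.0%}: ~{days * 24:.1f} hours")
```

Even at a pessimistic 50% scaling efficiency, the 16-day run drops to about half a day on 256 GPUs, which is why efficient multi-node communication is the key problem.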
Continue reading on IBM's blog …
Read all the blogs I have written on IBM’s blog.
Learn More about PowerAI
PowerAI is a software suite based on open-source AI frameworks such as TensorFlow and PyTorch. Learn more in these blogs: