September 13-14, 2017 - Los Angeles, CA
Friday, September 15 • 2:00pm - 2:40pm
Distributed Deep Learning on Apache Mesos with GPUs and Gang Scheduling - Min Cai, Alex Sergeev, Paul Mikesell & Anne Holler, UBER


Distributed deep learning is essential for speeding up the training of complex models, scaling out to hundreds of GPUs, and sharding models that cannot fit on a single machine. With recent advances in deep learning models for self-driving car problems such as lane detection and perception, it is important to enable distributed deep learning on large-scale GPU clusters.

This presentation will discuss our design and implementation for running distributed TensorFlow on top of Mesos clusters with hundreds of GPUs. The implementation leverages several key features offered by Mesos, such as GPU isolation and nested containers. We also implemented several features in our scheduler to support GPU and gang scheduling, task discovery, and dynamic port allocation. Finally, we will show the speedup of distributed training on Mesos using an example TensorFlow model for image classification.
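
As a rough illustration of the gang scheduling idea mentioned in the abstract, here is a minimal sketch in Python. It is not the scheduler described in the talk: the `Offer` and `Task` classes, the `gang_schedule` function, and the agent and task names are all hypothetical, and the greedy first-fit strategy is an assumption chosen for brevity. The only property it aims to show is all-or-nothing placement: either every task in a distributed training job gets GPUs, or none are placed.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class Offer:
    """Hypothetical, simplified resource offer: an agent and its free GPUs."""
    agent: str
    gpus: int


@dataclass
class Task:
    """Hypothetical training task (for example, one worker) and its GPU demand."""
    name: str
    gpus: int


def gang_schedule(tasks: List[Task], offers: List[Offer]) -> Optional[Dict[str, str]]:
    """Return a task -> agent placement for the whole gang, or None.

    All-or-nothing: if any single task cannot be placed, no placement is
    returned. Placement is greedy first-fit over tasks sorted by descending
    GPU demand, so it can miss placements an exhaustive search would find.
    """
    # Working copy of free GPUs per agent; the caller's offers are not mutated.
    free: Dict[str, int] = {}
    for offer in offers:
        free[offer.agent] = free.get(offer.agent, 0) + offer.gpus

    placement: Dict[str, str] = {}
    for task in sorted(tasks, key=lambda t: t.gpus, reverse=True):
        # First agent with enough free GPUs for this task, if any.
        agent = next((a for a, g in free.items() if g >= task.gpus), None)
        if agent is None:
            return None  # One task does not fit, so reject the whole gang.
        free[agent] -= task.gpus
        placement[task.name] = agent
    return placement


if __name__ == "__main__":
    offers = [Offer("agent-1", 4), Offer("agent-2", 2)]
    gang = [Task("worker-0", 2), Task("worker-1", 2), Task("ps-0", 2)]
    print(gang_schedule(gang, offers))
    # A fourth 2-GPU task exceeds the 6 GPUs on offer, so nothing is placed.
    print(gang_schedule(gang + [Task("worker-2", 2)], offers))
```

In a real framework, a scheduler would typically hold or decline offers until the whole gang fits, since a partially launched job would occupy GPUs without making progress.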


Min Cai

Staff Engineer, UBER
Min Cai is a Staff Engineer at UBER working on cluster management. He received his Ph.D. degree in Computer Science from USC. Before joining Uber, he was a Sr. Staff Engineer at VMware working on vMotion and vSphere.

Alex Sergeev

Senior Engineer, UBER
Alex Sergeev is a Senior Engineer at UBER working on scalable Deep Learning. He received his M.S. degree in Computer Science from MEPhI. Before joining UBER, he was a Senior Engineer at Microsoft working on Big Data Mining.

Friday September 15, 2017 2:00pm - 2:40pm PDT
Diamond Salon 1 & 2