Byung-Gon Chun, Brian Cho, Beomyeol Jeon, Joo Seong Jeong, Gunhee Kim, Joo Yeon Kim, Woo-Yeon Lee, Yun Seong Lee, Markus Weimer, Youngseok Yang, Gyeong-In Yu


Large-scale machine learning (ML) systems are becoming widely used. Typically, these ML systems run on fixed resources, but it is difficult to find their optimal configurations (e.g., how many nodes to use, how to distribute data) since they depend on multiple factors such as hardware environments, ML algorithms, input datasets, etc. Furthermore, optimal configurations often can change over time due to fluctuating cluster resources and changing ML algorithm patterns. In this paper, we present Dolphin, an elastic machine learning framework that addresses the configuration problem at runtime. Dolphin solves a cost-based optimization problem to find an optimal configuration and reconfigures the system dynamically at runtime. Dolphin introduces a new distributed memory abstraction to change resource and data configurations based on the optimizer plan transparently and efficiently.

Download PDF


  title={Dolphin: Runtime Optimization for Distributed Machine Learning},
  author={Byung-Gon Chun, and Brian Cho, and Beomyeol Jeon, and Joo Seong Jeong, and Gunhee Kim, and Joo Yeon Kim, and Woo-Yeon Lee, and Yun Seong Lee, and Markus Weimer, and Youngseok Yang, and Gyeong-In Yu},
  booktitle={The ML Systems Workshop at ICML},