TY - JOUR
T1 - NetPlacer+: Model Parallelism Based on Load Balance in Distributed Deep Learning
AU - Gao, Yunqi
AU - Hu, Bing
AU - Mashhadi, Mahdi Boloursaz
AU - Jin, A-Long
AU - Xiao, Pei
PY - 2025
Y1 - 2025
N2 - Model parallelism is increasingly important in distributed deep learning as Deep Neural Networks (DNNs) grow in scale and demand for faster training rises. Unlike existing work, we propose NetPlacer+, a model-parallel strategy based on load balance. The main idea of NetPlacer+ is to partition the DNN model across multiple devices by balancing each device's computation and communication load. We build a mathematical model of NetPlacer+, transform it, and obtain an approximately optimal solution using the interior point method. Extensive experiments on two GPU clusters with eight modern DNNs verify the effectiveness of NetPlacer+. The results show that the model-parallel strategy of NetPlacer+ achieves up to a 1.25x speedup over NVIDIA's DLPlacer.
AB - Model parallelism is increasingly important in distributed deep learning as Deep Neural Networks (DNNs) grow in scale and demand for faster training rises. Unlike existing work, we propose NetPlacer+, a model-parallel strategy based on load balance. The main idea of NetPlacer+ is to partition the DNN model across multiple devices by balancing each device's computation and communication load. We build a mathematical model of NetPlacer+, transform it, and obtain an approximately optimal solution using the interior point method. Extensive experiments on two GPU clusters with eight modern DNNs verify the effectiveness of NetPlacer+. The results show that the model-parallel strategy of NetPlacer+ achieves up to a 1.25x speedup over NVIDIA's DLPlacer.
U2 - 10.1109/TETCI.2025.3543765
DO - 10.1109/TETCI.2025.3543765
M3 - Article
SN - 2471-285X
JO - IEEE Transactions on Emerging Topics in Computational Intelligence
JF - IEEE Transactions on Emerging Topics in Computational Intelligence
ER -