TY - JOUR
T1 - US-Byte: An Efficient Communication Framework for Scheduling Unequal-Sized Tensor Blocks in Distributed Deep Learning
T2 - IEEE Transactions on Parallel and Distributed Systems
AU - Gao, Yunqi
AU - Hu, Bing
AU - Mashhadi, Mahdi Boloursaz
AU - Jin, A-Long
AU - Xiao, Pei
AU - Wu, Chunming
N1 - Publisher Copyright:
© 1990-2012 IEEE.
PY - 2024/1/1
Y1 - 2024/1/1
N2 - The communication bottleneck severely constrains the scalability of distributed deep learning, and efficient communication scheduling accelerates distributed DNN training by overlapping computation and communication tasks. However, existing approaches based on tensor partitioning are inefficient and suffer from two challenges: 1) a fixed number of tensor blocks transferred in parallel does not necessarily minimize the communication overhead; 2) although a scheduling order that preferentially transmits tensor blocks close to the input layer allows forward propagation in the next iteration to start earlier, it does not yield the shortest per-iteration time. In this paper, we propose US-Byte, an efficient communication framework that schedules unequal-sized tensor blocks in a near-optimal order to minimize the training time. We build the mathematical model of US-Byte in two phases: 1) the overlap of gradient communication with backward propagation, and 2) the overlap of gradient communication with forward propagation. We theoretically derive the optimal solution for the second phase and efficiently solve the first phase with a low-complexity algorithm. We implement US-Byte on the PyTorch framework. Extensive experiments on two different 8-node GPU clusters demonstrate that US-Byte achieves up to 1.26x and 1.56x speedup over ByteScheduler and WFBP, respectively. We further run simulations with 128 GPUs to assess the potential scaling performance of US-Byte; the results show that US-Byte achieves up to 1.69x speedup over the state-of-the-art communication framework.
AB - The communication bottleneck severely constrains the scalability of distributed deep learning, and efficient communication scheduling accelerates distributed DNN training by overlapping computation and communication tasks. However, existing approaches based on tensor partitioning are inefficient and suffer from two challenges: 1) a fixed number of tensor blocks transferred in parallel does not necessarily minimize the communication overhead; 2) although a scheduling order that preferentially transmits tensor blocks close to the input layer allows forward propagation in the next iteration to start earlier, it does not yield the shortest per-iteration time. In this paper, we propose US-Byte, an efficient communication framework that schedules unequal-sized tensor blocks in a near-optimal order to minimize the training time. We build the mathematical model of US-Byte in two phases: 1) the overlap of gradient communication with backward propagation, and 2) the overlap of gradient communication with forward propagation. We theoretically derive the optimal solution for the second phase and efficiently solve the first phase with a low-complexity algorithm. We implement US-Byte on the PyTorch framework. Extensive experiments on two different 8-node GPU clusters demonstrate that US-Byte achieves up to 1.26x and 1.56x speedup over ByteScheduler and WFBP, respectively. We further run simulations with 128 GPUs to assess the potential scaling performance of US-Byte; the results show that US-Byte achieves up to 1.69x speedup over the state-of-the-art communication framework.
KW - Communication scheduling
KW - data parallelism
KW - distributed deep learning
KW - tensor fusion
KW - tensor partitioning
UR - http://www.scopus.com/inward/record.url?scp=85177065080&partnerID=8YFLogxK
U2 - 10.1109/TPDS.2023.3331372
DO - 10.1109/TPDS.2023.3331372
M3 - Article
AN - SCOPUS:85177065080
SN - 1045-9219
VL - 35
SP - 123
EP - 139
JO - IEEE Transactions on Parallel and Distributed Systems
JF - IEEE Transactions on Parallel and Distributed Systems
IS - 1
ER -