Abstract
The communication bottleneck severely restricts the scalability of distributed deep learning. Tensor fusion improves the scalability of data parallelism by overlapping computation and communication tasks. However, existing tensor fusion schemes yield only suboptimal training performance. In this paper, we propose an efficient communication mechanism (OF-WFBP) that finds the optimal tensor fusion scheme for synchronous data parallelism. We present the mathematical model of OF-WFBP and prove that finding the optimal tensor fusion scheme is NP-hard. We solve the model analytically in two special cases; for the remaining cases, we propose an improved sparrow search algorithm (GradSSA) that finds a near-optimal tensor fusion scheme efficiently. Experimental results on two different GPU clusters show that OF-WFBP achieves up to 1.43x speedup over state-of-the-art tensor fusion mechanisms.
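For context on the underlying idea, below is a minimal Python sketch of tensor fusion: gradient tensors produced during the backward pass are grouped into fusion buffers so that each all-reduce amortizes its fixed startup latency over more bytes. The sketch uses a simple size-threshold bucketing policy, which is the kind of heuristic OF-WFBP improves upon by searching for an optimal partition; the function name `fuse_by_threshold` and the example sizes are illustrative, not from the paper.

```python
# Illustrative sketch (not the paper's OF-WFBP implementation): group
# gradient tensors into fusion buffers so each all-reduce amortizes its
# fixed startup latency. Bucket boundaries here come from a simple size
# threshold; OF-WFBP instead searches for the optimal partition, which
# the paper proves is NP-hard in general.

from typing import List, Tuple

def fuse_by_threshold(
    grad_sizes: List[int],  # per-tensor gradient sizes in bytes, in
                            # backward-propagation (WFBP) order
    threshold: int,         # flush a bucket once it reaches this size
) -> List[Tuple[int, int]]:
    """Return (start, end) index ranges of tensors fused into one buffer."""
    buckets, start, acc = [], 0, 0
    for i, size in enumerate(grad_sizes):
        acc += size
        if acc >= threshold:            # bucket full: one all-reduce for it
            buckets.append((start, i + 1))
            start, acc = i + 1, 0
    if acc > 0:                          # flush the trailing partial bucket
        buckets.append((start, len(grad_sizes)))
    return buckets

# Example: with a 4 MB threshold, small gradients are merged into fewer,
# larger all-reduce calls that can overlap with the ongoing backward pass.
sizes = [512_000, 2_100_000, 3_500_000, 800_000, 4_200_000]
print(fuse_by_threshold(sizes, threshold=4_000_000))
# -> [(0, 3), (3, 5)]
```

The trade-off this illustrates is the one the abstract describes: larger buckets reduce per-message overhead but delay the start of communication, while smaller buckets overlap better with computation; choosing the partition that balances the two is the optimization problem OF-WFBP formalizes.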
| Original language | English |
| --- | --- |
| Article number | 103053 |
| Journal | Parallel Computing |
| Volume | 118 |
| DOIs | |
| Publication status | Published - Nov 2023 |
| Externally published | Yes |
Keywords
- Data parallelism
- Distributed deep learning
- Tensor fusion