TY - GEN
T1 - SyncMalloc
T2 - 53rd International Conference on Parallel Processing, ICPP 2024
AU - Zhang, Jiajian
AU - Wu, Fangyu
AU - Jiang, Hai
AU - Cheng, Guangliang
AU - Chen, Genlang
AU - Wang, Qiufeng
N1 - Publisher Copyright:
© 2024 Owner/Author.
PY - 2024/8/12
Y1 - 2024/8/12
N2 - Dynamic memory allocation on GPUs, increasingly crucial for applications with dynamic computational patterns, faces significant challenges: the allocation logic involves complex, branch-heavy calculations, and the metadata for massive numbers of thread allocations consumes substantial memory. Despite existing research, a scalable and flexible solution that effectively manages dynamic memory allocation while minimizing memory usage on GPUs is still lacking. This paper introduces SyncMalloc, a synchronized Host-Device Co-Management system designed to handle dynamic memory allocations of diverse magnitudes. By integrating pipelining and producer-consumer mechanisms, SyncMalloc reduces communication overhead and resolves architectural mismatches, and it further extends its capability through integration with CUDA's unified memory to support oversubscription. Moreover, SyncMalloc advances slab-based memory management to improve the efficiency of small allocations, reducing conflict probability and overhead in high-activity scenarios. Finally, we present a comprehensive performance evaluation that expands benchmarks and measurement dimensions to more accurately reflect the performance of real-world applications. The experimental results demonstrate, from multiple perspectives, the effectiveness of SyncMalloc in supporting dynamic GPU allocations scaled from 4 B to 200 GB. Our source code is available at https://github.com/jjZhang94/SyncMalloc.
AB - Dynamic memory allocation on GPUs, increasingly crucial for applications with dynamic computational patterns, faces significant challenges: the allocation logic involves complex, branch-heavy calculations, and the metadata for massive numbers of thread allocations consumes substantial memory. Despite existing research, a scalable and flexible solution that effectively manages dynamic memory allocation while minimizing memory usage on GPUs is still lacking. This paper introduces SyncMalloc, a synchronized Host-Device Co-Management system designed to handle dynamic memory allocations of diverse magnitudes. By integrating pipelining and producer-consumer mechanisms, SyncMalloc reduces communication overhead and resolves architectural mismatches, and it further extends its capability through integration with CUDA's unified memory to support oversubscription. Moreover, SyncMalloc advances slab-based memory management to improve the efficiency of small allocations, reducing conflict probability and overhead in high-activity scenarios. Finally, we present a comprehensive performance evaluation that expands benchmarks and measurement dimensions to more accurately reflect the performance of real-world applications. The experimental results demonstrate, from multiple perspectives, the effectiveness of SyncMalloc in supporting dynamic GPU allocations scaled from 4 B to 200 GB. Our source code is available at https://github.com/jjZhang94/SyncMalloc.
KW - Dynamic Allocation
KW - GPU
KW - Memory Management
UR - http://www.scopus.com/inward/record.url?scp=85202438630&partnerID=8YFLogxK
U2 - 10.1145/3673038.3673069
DO - 10.1145/3673038.3673069
M3 - Conference Proceeding
AN - SCOPUS:85202438630
T3 - ACM International Conference Proceeding Series
SP - 179
EP - 188
BT - 53rd International Conference on Parallel Processing, ICPP 2024 - Main Conference Proceedings
PB - Association for Computing Machinery
Y2 - 12 August 2024 through 15 August 2024
ER -