TY - JOUR
T1 - Accelerating content-defined-chunking based data deduplication by exploiting parallelism
AU - Xia, Wen
AU - Feng, Dan
AU - Jiang, Hong
AU - Zhang, Yucheng
AU - Chang, Victor
AU - Zou, Xiangyu
N1 - Author Biography:
Hong Jiang received the B.Sc. degree in Computer Engineering in 1982 from Huazhong University of Science and Technology, Wuhan, China; the M.A.Sc. degree in Computer Engineering in 1987 from the University of Toronto, Toronto, Canada; and the Ph.D. degree in Computer Science in 1991 from Texas A&M University, College Station, Texas, USA. He is currently Chair of the Computer Science and Engineering Department and Wendell H. Nedderman Endowed Professor at the University of Texas at Arlington. Prior to joining UTA, he served as a Program Director at the National Science Foundation (January 2013–August 2015); before that he had been at the University of Nebraska-Lincoln since 1991, where he was Willa Cather Professor of Computer Science and Engineering. He has graduated 13 Ph.D. students who, upon graduation, either landed academic tenure-track positions at Ph.D.-granting US institutions or were employed by major US IT corporations. His present research interests include computer architecture, computer storage systems and parallel I/O, high-performance computing, big data computing, cloud computing, and performance evaluation. He recently served as an Associate Editor of the IEEE Transactions on Parallel and Distributed Systems. He has over 200 publications in major journals and international conferences in these areas, including IEEE TPDS, IEEE TC, Proceedings of the IEEE, ACM TACO, JPDC, ISCA, MICRO, USENIX ATC, FAST, EuroSys, LISA, SIGMETRICS, ICDCS, IPDPS, Middleware, OOPSLA, ECOOP, SC, ICS, HPDC, INFOCOM, ICPP, etc., and his research has been supported by the NSF, DOD, the State of Texas, and the State of Nebraska. Dr. Jiang is a Fellow of the IEEE and a Member of the ACM.
Funding Information:
We are grateful to the anonymous reviewers for their insightful comments and feedback on this work. This research was partly supported by the National Key Research and Development Program of China under Grant 2017YFB0802204; by NSFC, China, under Grants No. 61821003, No. 61502190, No. U1705261, No. 61832007, No. 61772222, No. 61772180, and No. 61672010; by the Scientific Research Fund of Hubei Provincial Department of Education, China, under Grant B2017042; and by the US NSF under Grants CCF-1704504 and CCF-1629625. A preliminary version of this manuscript was published in the proceedings of the IEEE International Conference on Networking, Architecture, and Storage (IEEE NAS), 2012.
Publisher Copyright:
© 2019 Elsevier B.V.
PY - 2019/9
Y1 - 2019/9
N2 - Data deduplication, a data reduction technique that efficiently detects and eliminates redundant data chunks and files, has been widely applied in large-scale storage systems. Most existing deduplication-based storage systems employ content-defined chunking (CDC) and secure-hash-based fingerprinting (e.g., SHA1) to remove redundant data at the chunk level (e.g., 4 KB/8 KB chunks), both of which are extremely compute-intensive and thus time-consuming for storage systems. Therefore, we present P-Dedupe, a pipelined and parallelized data deduplication system that accelerates the deduplication process by dividing it into four stages (i.e., chunking, fingerprinting, indexing, and writing), pipelining these four stages at the granularity of chunks and files (the data units processed in deduplication), and then parallelizing the CDC and secure-hash-based fingerprinting stages to further alleviate the computation bottleneck. More importantly, to efficiently parallelize CDC while satisfying both the maximum and minimum chunk-size requirements, and inspired by the MapReduce model, we first split the data stream into several segments (i.e., “Map”), where each segment runs CDC in parallel on an independent thread, and then re-chunk and join the boundaries of these segments (i.e., “Reduce”) to ensure the chunking effectiveness of parallelized CDC. Experimental results of P-Dedupe with eight datasets on a quad-core Intel i7 processor suggest that P-Dedupe is able to accelerate deduplication throughput nearly linearly by exploiting parallelism in the CDC-based deduplication process, at the cost of only a 0.02% decrease in the deduplication ratio. Our work contributes to big data science by ensuring that all files go through the deduplication process quickly and thoroughly, and that the same file is processed and analyzed only once rather than multiple times.
AB - Data deduplication, a data reduction technique that efficiently detects and eliminates redundant data chunks and files, has been widely applied in large-scale storage systems. Most existing deduplication-based storage systems employ content-defined chunking (CDC) and secure-hash-based fingerprinting (e.g., SHA1) to remove redundant data at the chunk level (e.g., 4 KB/8 KB chunks), both of which are extremely compute-intensive and thus time-consuming for storage systems. Therefore, we present P-Dedupe, a pipelined and parallelized data deduplication system that accelerates the deduplication process by dividing it into four stages (i.e., chunking, fingerprinting, indexing, and writing), pipelining these four stages at the granularity of chunks and files (the data units processed in deduplication), and then parallelizing the CDC and secure-hash-based fingerprinting stages to further alleviate the computation bottleneck. More importantly, to efficiently parallelize CDC while satisfying both the maximum and minimum chunk-size requirements, and inspired by the MapReduce model, we first split the data stream into several segments (i.e., “Map”), where each segment runs CDC in parallel on an independent thread, and then re-chunk and join the boundaries of these segments (i.e., “Reduce”) to ensure the chunking effectiveness of parallelized CDC. Experimental results of P-Dedupe with eight datasets on a quad-core Intel i7 processor suggest that P-Dedupe is able to accelerate deduplication throughput nearly linearly by exploiting parallelism in the CDC-based deduplication process, at the cost of only a 0.02% decrease in the deduplication ratio. Our work contributes to big data science by ensuring that all files go through the deduplication process quickly and thoroughly, and that the same file is processed and analyzed only once rather than multiple times.
KW - Backup storage systems
KW - Content-defined chunking
KW - Data deduplication
KW - Performance evaluation
UR - http://www.scopus.com/inward/record.url?scp=85063748545&partnerID=8YFLogxK
U2 - 10.1016/j.future.2019.02.008
DO - 10.1016/j.future.2019.02.008
M3 - Article
AN - SCOPUS:85063748545
SN - 0167-739X
VL - 98
SP - 406
EP - 418
JO - Future Generation Computer Systems
JF - Future Generation Computer Systems
ER -