Record-aware compression for big textual data analysis acceleration

Dapeng Dong, John Herbert

Research output: Chapter in Book or Report/Conference proceeding › Conference Proceeding › peer-review

7 Citations (Scopus)

Abstract

Big data analysis technologies are becoming more widely used in industry. The ever-increasing data volume, however, puts data analytics platforms such as Hadoop under constant pressure. Several compression methods are available on the Hadoop platform to reduce data size and to deliver data efficiently between cluster nodes. In the Hadoop context, compressed data can be categorized as splittable or non-splittable. Working with non-splittable data conflicts with the goal of parallelism. In addition, the current realization of splittable data by indexing is potentially harmful to the data locality property. To this end, we introduce the Record-aware Compression (RaC) scheme, which makes the compressed contents splittable, uses a lightweight Hadoop Record Reader, and preserves the parallelism and data locality properties as much as possible. We evaluate RaC using a set of classical MapReduce jobs with a collection of well-known datasets from companies such as Google, Yahoo!, and Amazon. The experimental results show an average 24% improvement in analysis performance and a reduction in data size of up to 75%.
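
As a rough illustration of what "splittable" compressed data means, the sketch below groups whole records into blocks and compresses each block independently, so that any block can be decompressed without reading the others. It is not the authors' RaC scheme, whose internals the abstract does not describe. The function names, the newline-delimited record format, the block size, and the use of zlib are all assumptions made for this example.

```python
import zlib


def compress_record_blocks(records, max_block_bytes=64 * 1024):
    """Hypothetical helper: pack whole records into independently compressed blocks.

    Assumes each record is a bytes object that contains no newline.
    A record is never split across two blocks.
    """
    blocks, current, size = [], [], 0
    for rec in records:
        line = rec + b"\n"
        # Close the current block before it would exceed the size limit.
        if current and size + len(line) > max_block_bytes:
            blocks.append(zlib.compress(b"".join(current)))
            current, size = [], 0
        current.append(line)
        size += len(line)
    if current:
        blocks.append(zlib.compress(b"".join(current)))
    return blocks


def read_block(block):
    """Hypothetical helper: decompress one block on its own and return its records."""
    # Every record ends with a newline, so the final split element is empty.
    return zlib.decompress(block).split(b"\n")[:-1]


records = [f"record-{i}".encode() for i in range(1000)]
blocks = compress_record_blocks(records, max_block_bytes=1024)

# Each block could be handed to a separate worker and read in isolation.
restored = [rec for block in blocks for rec in read_block(block)]
```

Because block boundaries coincide with record boundaries, a reader that is handed a single block never needs data from a neighbouring block to finish a record, which is the property that lets blocks be processed in parallel.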

Original language: English
Title of host publication: Proceedings - 2015 IEEE International Conference on Big Data, IEEE Big Data 2015
Editors: Feng Luo, Kemafor Ogan, Mohammed J. Zaki, Laura Haas, Beng Chin Ooi, Vipin Kumar, Sudarsan Rachuri, Saumyadipta Pyne, Howard Ho, Xiaohua Hu, Shipeng Yu, Morris Hui-I Hsiao, Jian Li
Publisher: Institute of Electrical and Electronics Engineers Inc.
Pages: 1183-1190
Number of pages: 8
ISBN (Electronic): 9781479999255
Publication status: Published - 22 Dec 2015
Externally published: Yes
Event: 3rd IEEE International Conference on Big Data, IEEE Big Data 2015 - Santa Clara, United States
Duration: 29 Oct 2015 to 1 Nov 2015

Publication series

Name: Proceedings - 2015 IEEE International Conference on Big Data, IEEE Big Data 2015

Conference

Conference: 3rd IEEE International Conference on Big Data, IEEE Big Data 2015
Country/Territory: United States
City: Santa Clara
Period: 29/10/15 to 1/11/15

Keywords

  • Big Data
  • Compression
  • Hadoop
  • MapReduce
  • Record-aware
