Record-aware two-level compression for big textual data analysis acceleration

Dapeng Dong, John Herbert

Research output: Chapter in Book or Report/Conference proceeding; Conference Proceeding; peer-reviewed

2 Citations (Scopus)

Abstract

An increasing volume of data puts MapReduce data analytic platforms such as Hadoop under constant resource pressure. A new two-phase text compression scheme has been specially designed to accelerate data analysis and reduce cluster resource usage, and it has been implemented for Hadoop. The scheme consists of two levels of compression. The first-level compression allows a Hadoop program to consume the compressed data directly, thus reducing the data transmission cost within a cluster during analysis. The second-level compression packages data into fixed-size blocks that respect the logical data records. This further reduces the data size to a size similar to that achieved by a higher-order entropy encoder while also making the compressed data splittable for the HDFS. The provided utility functions make the use of the compression scheme transparent to Hadoop developers. The compression scheme is evaluated using a set of standard MapReduce jobs on a selection of real-world datasets. The experimental results show an improvement in analysis performance of up to 72% and compression ratios close to those achieved by a standard compressor such as Bzip.
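
The record-respecting packaging described for the second level can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes newline-delimited records, uses Python's standard bz2 module as a stand-in compressor, and bounds the uncompressed size of each block rather than producing truly fixed-size output. The names pack_records and read_block are hypothetical. The point it shows is that when no record straddles a block boundary, each block can be decompressed and parsed on its own, which is the property that makes compressed data splittable.

```python
import bz2


def pack_records(records, block_size):
    """Group whole records into blocks and compress each block independently.

    Assumptions (not from the paper): records are bytes without embedded
    newlines, block_size bounds the *uncompressed* bytes per block, and a
    single record longer than block_size gets a block of its own.
    """
    blocks = []
    current = []
    current_len = 0
    for record in records:
        line = record + b"\n"
        # Close the current block before a record would overflow it,
        # so that a record is never split across two blocks.
        if current and current_len + len(line) > block_size:
            blocks.append(bz2.compress(b"".join(current)))
            current, current_len = [], 0
        current.append(line)
        current_len += len(line)
    if current:
        blocks.append(bz2.compress(b"".join(current)))
    return blocks


def read_block(block):
    """Decompress one block in isolation and return its complete records."""
    data = bz2.decompress(block)
    # Every record ends with a newline, so drop the trailing empty piece.
    return data.split(b"\n")[:-1]


if __name__ == "__main__":
    records = [b"alpha,1", b"beta,22", b"gamma,333", b"delta,4444"]
    blocks = pack_records(records, block_size=20)
    # Any block can be handed to a separate worker and read independently.
    for index, block in enumerate(blocks):
        print(index, read_block(block))
```

In a sketch like this, a worker that receives only one block still sees complete records, whereas a single bz2 stream over the whole file would have to be decoded from the start.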

Original language: English
Title of host publication: Proceedings - IEEE 7th International Conference on Cloud Computing Technology and Science, CloudCom 2015
Publisher: Institute of Electrical and Electronics Engineers Inc.
Pages: 9-16
Number of pages: 8
ISBN (Electronic): 9781467395601
Publication status: Published - 1 Feb 2016
Externally published: Yes
Event: 7th IEEE International Conference on Cloud Computing Technology and Science, CloudCom 2015 - Vancouver, Canada
Duration: 30 Nov 2015 to 3 Dec 2015

Publication series

Name: Proceedings - IEEE 7th International Conference on Cloud Computing Technology and Science, CloudCom 2015

Conference

Conference: 7th IEEE International Conference on Cloud Computing Technology and Science, CloudCom 2015
Country/Territory: Canada
City: Vancouver
Period: 30/11/15 to 3/12/15

Keywords

  • Big Data
  • Compression
  • Content-aware
  • Hadoop
  • MapReduce
  • Record-aware
