Abstract
An increasing volume of data puts MapReduce data analytic platforms such as Hadoop under constant resource pressure. A new two-phase text compression scheme has been designed specifically to accelerate data analysis and reduce cluster resource usage, and it has been implemented for Hadoop. The scheme consists of two levels of compression. The first-level compression allows a Hadoop program to consume the compressed data directly, thus reducing the data transmission cost within a cluster during analysis. The second level packages data into fixed-size blocks that respect the logical data records. This further reduces the data to a size similar to that achieved by a higher-order entropy encoder, while also making the compressed data splittable for HDFS. Utility functions are provided that make the use of the compression scheme transparent to Hadoop developers. The compression scheme is evaluated using a set of standard MapReduce jobs on a selection of real-world datasets. The experimental results show an improvement in analysis performance of up to 72% and compression ratios close to those achieved by a standard compressor such as Bzip.
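The record-aware packaging idea described for the second level can be illustrated with a minimal sketch. This is not the paper's implementation: the function names `pack_records`, `compress_blocks`, and `read_block` are hypothetical, `zlib` stands in for the unspecified second-level encoder, records are assumed to be newline-free byte strings, and the size limit here applies to uncompressed bytes as a simplification.

```python
import zlib


def pack_records(records, block_size):
    """Group records into blocks of at most block_size uncompressed bytes.

    A record is never split across two blocks, so every block can be
    decoded and parsed without looking at its neighbours.
    """
    blocks, current, current_len = [], [], 0
    for rec in records:
        line = rec + b"\n"  # newline-terminated record (assumed format)
        if len(line) > block_size:
            raise ValueError("record larger than block size")
        if current_len + len(line) > block_size:
            # Close the current block before it would overflow.
            blocks.append(b"".join(current))
            current, current_len = [], 0
        current.append(line)
        current_len += len(line)
    if current:
        blocks.append(b"".join(current))
    return blocks


def compress_blocks(blocks):
    # zlib is a stand-in for the second-level encoder (an assumption).
    return [zlib.compress(block) for block in blocks]


def read_block(compressed_block):
    """Decode one block on its own and return the records it holds."""
    return zlib.decompress(compressed_block).split(b"\n")[:-1]


records = [b"a,1", b"bb,2", b"ccc,3", b"d,4"]
blocks = pack_records(records, block_size=10)
compressed = compress_blocks(blocks)
```

Because each block ends on a record boundary, a reader handed any single compressed block (for example, one input split) can recover whole records from it without access to the rest of the file, which is the property that makes such a layout splittable.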
| Original language | English |
|---|---|
| Title of host publication | Proceedings - IEEE 7th International Conference on Cloud Computing Technology and Science, CloudCom 2015 |
| Publisher | Institute of Electrical and Electronics Engineers Inc. |
| Pages | 9-16 |
| Number of pages | 8 |
| ISBN (Electronic) | 9781467395601 |
| Publication status | Published - 1 Feb 2016 |
| Externally published | Yes |
| Event | 7th IEEE International Conference on Cloud Computing Technology and Science, CloudCom 2015 - Vancouver, Canada. Duration: 30 Nov 2015 → 3 Dec 2015 |
Publication series
| Name | Proceedings - IEEE 7th International Conference on Cloud Computing Technology and Science, CloudCom 2015 |
|---|---|
Conference
| Conference | 7th IEEE International Conference on Cloud Computing Technology and Science, CloudCom 2015 |
|---|---|
| Country/Territory | Canada |
| City | Vancouver |
| Period | 30/11/15 → 3/12/15 |
Keywords
- Big Data
- Compression
- Content-aware
- Hadoop
- MapReduce
- Record-aware