Abstract
Parallel workloads on shared-memory multi-core processors often suffer from performance degradation. Cache eviction, true/false sharing and bus contention are among the well-understood causes of this problem. This paper presents a study showing that the L2 DPL (data prefetch logic) in processors based on the Intel Core microarchitecture can also be a cause. Using image integration as a case study, it finds that parallel integration fails to scale for images whose size exceeds the capacity of the processor's L2 cache. An analysis of the relevant performance events with the Intel VTune™ Performance Analyzer shows that the L2 DPL is less effective at prefetching the needed data during parallel integration than during serial integration. To resolve the problem, a novel parallel image reverse loading technique is developed to reduce the number of memory accesses during parallel integration and the associated delay. Experimental results demonstrate that parallel integration after parallel reverse loading achieves a significant speedup over the same parallel integration after serial loading.
Original language | English |
---|---|
Pages (from-to) | 915-924 |
Number of pages | 10 |
Journal | Journal of Parallel and Distributed Computing |
Volume | 71 |
Issue number | 7 |
Publication status | Published - Jul 2011 |
Keywords
- Hardware prefetching
- Parallel nonscaling analysis
- Parallel performance degradation
- Temporal caching efficiency