Resolving a L2-prefetch-caused parallel nonscaling on Intel Core microarchitecture

Nan Zhang*

*Corresponding author for this work

Research output: Contribution to journal › Article › peer-review


Abstract

Parallel workloads on shared-memory multi-core processors often suffer from performance degradation. Cache eviction, true/false sharing and bus contention are among the well-understood causes of this problem. This paper presents a study showing that the L2 DPL (data prefetch logic) in processors based on the Intel Core microarchitecture can also cause it. Through a case study of image integration, the study finds that the parallel integration fails to scale for images whose size exceeds the capacity of the processor's L2 cache. An analysis of the relevant performance events using the Intel VTune™ Performance Analyser shows that the L2 DPL is less effective at prefetching the needed data during the parallel integration than during the serial one. To resolve the problem, a novel parallel reverse loading of the image is developed to reduce the number of memory accesses during the parallel integration, and the associated delay. Experimental results demonstrate that the parallel integration performed after parallel reverse loading achieves a significant speedup over the same parallel integration performed after serial loading.
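The abstract does not define "image integration". Assuming it means computing an integral image (a summed-area table), the sketch below shows a serial version and one common way to parallelize it: split the image into row bands, integrate each band independently, then add a carry row. The function names and the row-band scheme are hypothetical illustrations. They are not the paper's code or necessarily its partitioning.

```python
from concurrent.futures import ThreadPoolExecutor


def integral_serial(img):
    """Summed-area table: out[y][x] = sum of img[y'][x'] for all y' <= y, x' <= x."""
    h, w = len(img), len(img[0])
    out = [[0] * w for _ in range(h)]
    for y in range(h):
        row_sum = 0  # running prefix sum along the current row
        for x in range(w):
            row_sum += img[y][x]
            # Add the already-integrated value directly above, if any.
            out[y][x] = row_sum + (out[y - 1][x] if y > 0 else 0)
    return out


def integral_parallel(img, workers=4):
    """Hypothetical row-band parallelization (not necessarily the paper's scheme)."""
    h, w = len(img), len(img[0])
    step = -(-h // workers)  # ceiling division: rows per band
    bounds = [(y0, min(y0 + step, h)) for y0 in range(0, h, step)]

    # Pass 1: each worker integrates its own band as if it were a standalone image.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        bands = list(pool.map(lambda b: integral_serial(img[b[0]:b[1]]), bounds))

    # Pass 2: add the last global row of all earlier bands to every row of a band.
    carry = [0] * w
    out = []
    for band in bands:
        for row in band:
            out.append([v + c for v, c in zip(row, carry)])
        carry = out[-1]
    return out
```

For example, `integral_serial([[1, 2, 3], [4, 5, 6], [7, 8, 9]])` gives `[[1, 3, 6], [5, 12, 21], [12, 27, 45]]`, and `integral_parallel` returns the same table. Because of the GIL, Python threads will not speed up this pure-Python loop. The sketch only shows the data partitioning, in which each worker streams through its own contiguous band of the image. A native implementation would be needed to observe cache or prefetch effects like those the paper measures.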

Original language: English
Pages (from-to): 915-924
Number of pages: 10
Journal: Journal of Parallel and Distributed Computing
Volume: 71
Issue number: 7
Publication status: Published - Jul 2011

Keywords

  • Hardware prefetching
  • Parallel nonscaling analysis
  • Parallel performance degradation
  • Temporal caching efficiency
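One way to build intuition for why load order can affect temporal caching efficiency is a toy model. Suppose an image larger than the cache is loaded front to back, and the integration then reads it front to back. Under LRU replacement, only the tail of the image is still resident when the read starts at the head. Loading in reverse leaves the head resident instead. The sketch below makes several simplifying assumptions: a fully associative LRU cache, no hardware prefetching, and hypothetical sizes. It is not a model of the Core L2 (which is set-associative and has the DPL), and it is not the paper's parallel reverse loading algorithm.

```python
from collections import OrderedDict


def count_hits(load_order, scan_order, capacity):
    """Toy fully associative LRU cache holding `capacity` lines.

    Touches every line in load_order (warm-up), then returns the number of
    hits while touching the lines in scan_order.
    """
    cache = OrderedDict()  # insertion order doubles as LRU order (oldest first)

    def touch(line):
        hit = line in cache
        if hit:
            cache.move_to_end(line)  # mark as most recently used
        else:
            if len(cache) >= capacity:
                cache.popitem(last=False)  # evict the least recently used line
            cache[line] = True
        return hit

    for line in load_order:
        touch(line)
    return sum(touch(line) for line in scan_order)


N, C = 1000, 256  # hypothetical: image spans 1000 cache lines, cache holds 256
forward = list(range(N))
print(count_hits(forward, forward, C))        # forward load, forward scan -> 0
print(count_hits(forward[::-1], forward, C))  # reverse load, forward scan -> 256
```

After a forward load, every access in the forward scan misses. Each miss evicts exactly the line that will be needed soonest. After a reverse load, the first `C` lines of the scan hit. Real hardware with prefetching behaves differently, so treat this only as an illustration of the reuse argument.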
