Resolving a L2-prefetch-caused parallel nonscaling on Intel Core microarchitecture

Nan Zhang

doi:10.1016/j.jpdc.2011.03.005

Nan Zhang^*

^*Corresponding author for this work

School of Advanced Technology

Research output: Contribution to journal › Article › peer-review

1 Citation (Scopus)

Abstract

Parallel workloads on shared-memory multi-core processors often suffer from performance degradation. Cache eviction, true/false sharing and bus contention are among the well-understood causes to this problem. This paper presents a study that shows the L2 DPL (data prefetch logic) in processors based on Intel Core microarchitecture can be a cause to this problem as well. The study through a case of an image integration finds the nonscaling problem on the parallel integration of images whose size exceeds the capacity of the processor's L2 cache. Through an analysis on relevant performance events using Intel VTune™ Performance Analyser the L2 DPL prefetch is found less effective over the parallel integration in prefetching needed data than over the serial ones. To resolve the problem a novel parallel image reverse loading is developed with the purpose of reducing the number of memory accesses over the parallel integration and the associated delay. Experimental results demonstrate that the parallel integration after the parallel reverse loading shows significant speedup against the same parallel integration but after serial loading.

Original language	English
Pages (from-to)	915-924
Number of pages	10
Journal	Journal of Parallel and Distributed Computing
Volume	71
Issue number	7
DOIs	https://doi.org/10.1016/j.jpdc.2011.03.005
Publication status	Published - Jul 2011

Keywords

Hardware prefetching
Parallel nonscaling analysis
Parallel performance degradation
Temporal caching efficiency

Cite this

@article{7771c43d15e04bc692199ccd2d0e8764,

title = "Resolving a L2-prefetch-caused parallel nonscaling on Intel Core microarchitecture",

abstract = "Parallel workloads on shared-memory multi-core processors often suffer from performance degradation. Cache eviction, true/false sharing and bus contention are among the well-understood causes to this problem. This paper presents a study that shows the L2 DPL (data prefetch logic) in processors based on Intel Core microarchitecture can be a cause to this problem as well. The study through a case of an image integration finds the nonscaling problem on the parallel integration of images whose size exceeds the capacity of the processor's L2 cache. Through an analysis on relevant performance events using Intel VTune{\texttrademark} Performance Analyser the L2 DPL prefetch is found less effective over the parallel integration in prefetching needed data than over the serial ones. To resolve the problem a novel parallel image reverse loading is developed with the purpose of reducing the number of memory accesses over the parallel integration and the associated delay. Experimental results demonstrate that the parallel integration after the parallel reverse loading shows significant speedup against the same parallel integration but after serial loading.",

keywords = "Hardware prefetching, Parallel nonscaling analysis, Parallel performance degradation, Temporal caching efficiency",

author = "Nan Zhang",

year = "2011",

month = jul,

doi = "10.1016/j.jpdc.2011.03.005",

language = "English",

volume = "71",

pages = "915--924",

journal = "Journal of Parallel and Distributed Computing",

issn = "0743-7315",

number = "7",

}

TY - JOUR

T1 - Resolving a L2-prefetch-caused parallel nonscaling on Intel Core microarchitecture

AU - Zhang, Nan

PY - 2011/7

Y1 - 2011/7

N2 - Parallel workloads on shared-memory multi-core processors often suffer from performance degradation. Cache eviction, true/false sharing and bus contention are among the well-understood causes to this problem. This paper presents a study that shows the L2 DPL (data prefetch logic) in processors based on Intel Core microarchitecture can be a cause to this problem as well. The study through a case of an image integration finds the nonscaling problem on the parallel integration of images whose size exceeds the capacity of the processor's L2 cache. Through an analysis on relevant performance events using Intel VTune™ Performance Analyser the L2 DPL prefetch is found less effective over the parallel integration in prefetching needed data than over the serial ones. To resolve the problem a novel parallel image reverse loading is developed with the purpose of reducing the number of memory accesses over the parallel integration and the associated delay. Experimental results demonstrate that the parallel integration after the parallel reverse loading shows significant speedup against the same parallel integration but after serial loading.

AB - Parallel workloads on shared-memory multi-core processors often suffer from performance degradation. Cache eviction, true/false sharing and bus contention are among the well-understood causes to this problem. This paper presents a study that shows the L2 DPL (data prefetch logic) in processors based on Intel Core microarchitecture can be a cause to this problem as well. The study through a case of an image integration finds the nonscaling problem on the parallel integration of images whose size exceeds the capacity of the processor's L2 cache. Through an analysis on relevant performance events using Intel VTune™ Performance Analyser the L2 DPL prefetch is found less effective over the parallel integration in prefetching needed data than over the serial ones. To resolve the problem a novel parallel image reverse loading is developed with the purpose of reducing the number of memory accesses over the parallel integration and the associated delay. Experimental results demonstrate that the parallel integration after the parallel reverse loading shows significant speedup against the same parallel integration but after serial loading.

KW - Hardware prefetching

KW - Parallel nonscaling analysis

KW - Parallel performance degradation

KW - Temporal caching efficiency

UR - http://www.scopus.com/inward/record.url?scp=79957507436&partnerID=8YFLogxK

U2 - 10.1016/j.jpdc.2011.03.005

DO - 10.1016/j.jpdc.2011.03.005

M3 - Article

AN - SCOPUS:79957507436

SN - 0743-7315

VL - 71

SP - 915

EP - 924

JO - Journal of Parallel and Distributed Computing

JF - Journal of Parallel and Distributed Computing

IS - 7

ER -
