TY - GEN
T1 - Event-based analysis of a L2 prefetch related parallel nonscaling on Intel dual core processor
AU - Zhang, Nan
PY - 2010
Y1 - 2010
N2 - Performance degradation is a common problem in parallel computing, in which the speedup gained by parallelising a workload is notably lower than the factor predicted by Amdahl's law. This work examines such a case. Image pixels are summed along the rows in parallel by two threads, yet no speedup over the sequential summation is gained on images whose sizes exceed the capacity of the processor's L2 cache. Counts of relevant performance events collected with the Intel VTune™ Performance Analyser show that this nonscaling problem is not caused by well-understood pitfalls such as cache contention, bus overloading or unbalanced workload. Instead, during the parallel summation the processor's L2 prefetchers are less effective at bringing in data before it is needed. Consequently, considerably more cache lines are brought into the L2 cache by demand requests originating from the L1 data caches, and the parallel computation pays the penalty for these accesses.
AB - Performance degradation is a common problem in parallel computing, in which the speedup gained by parallelising a workload is notably lower than the factor predicted by Amdahl's law. This work examines such a case. Image pixels are summed along the rows in parallel by two threads, yet no speedup over the sequential summation is gained on images whose sizes exceed the capacity of the processor's L2 cache. Counts of relevant performance events collected with the Intel VTune™ Performance Analyser show that this nonscaling problem is not caused by well-understood pitfalls such as cache contention, bus overloading or unbalanced workload. Instead, during the parallel summation the processor's L2 prefetchers are less effective at bringing in data before it is needed. Consequently, considerably more cache lines are brought into the L2 cache by demand requests originating from the L1 data caches, and the parallel computation pays the penalty for these accesses.
KW - Hardware prefetching
KW - Parallel nonscaling analysis
KW - Parallel performance degradation
KW - Parallel scalability
UR - http://www.scopus.com/inward/record.url?scp=77958071519&partnerID=8YFLogxK
U2 - 10.1109/ICCET.2010.5485345
DO - 10.1109/ICCET.2010.5485345
M3 - Conference Proceeding
AN - SCOPUS:77958071519
SN - 9781424463503
T3 - ICCET 2010 - 2010 International Conference on Computer Engineering and Technology, Proceedings
SP - V225-V229
BT - ICCET 2010 - 2010 International Conference on Computer Engineering and Technology, Proceedings
T2 - 2010 2nd International Conference on Computer Engineering and Technology, ICCET 2010
Y2 - 16 April 2010 through 18 April 2010
ER -