Scalability of Threaded Applications

Intel Software College

Scalability of Multithreaded Applications

Intel and the Intel logo are trademarks or registered trademarks of Intel Corporation or its subsidiaries in the United States or other countries. *Other brands and names are the property of their respective owners.

Objectives

After completion of this module you will understand

• The need for designing multithreaded applications for scalability to take advantage of an increasing number of available cores

• What tools are available to measure and predict scalability

• How several different factors can inhibit scaling of applications on increased number of cores

Agenda

Why focus on scalability?
  • Measuring and estimating scalability
  • Where would you start?

Tools for scalability analysis

Factors inhibiting scalability
  • Serially Dominant Workloads
  • Granularity and Parallel Overhead
  • Load Imbalance
  • Synchronization Issues
  • Memory Related Issues
  • I/O

What is scalability?

Handle growing amounts of work in a graceful manner

What resources might be increased?
  • Cores and threads
  • Memory capacity
  • Data, problem size
    • Not a resource, but likely to see increases as computation power increases

“What is it that we really mean by scalability? A service is said to be scalable if when we increase the resources in a system, it results in increased performance in a manner proportional to resources added.”

-- Werner Vogels CTO - Amazon.com

Evolution

Dual core
  • Symmetric multithreading

Multi-core array
  • CMP with ~10 cores

Many-core array
  • CMP with 10s-100s low power cores
  • Scalar cores
  • Capable of TFLOPS+
  • Full System-on-Chip
  • Servers, workstations, embedded…

Large, Scalar cores for high single-thread performance

Scalar plus many core for highly threaded workloads

Evolutionary configurable architecture

Amdahl’s Law

Speedup is limited by the amount of serial code

Ψ(p) ≤ 1 / (s + (1 - s) / p), where 0 ≤ s ≤ 1 is the fraction of serial operations and p is the number of cores

[Figure: Maximum Theoretical Speedup from Amdahl's Law. Speedup vs. number of cores (up to 8), one curve each for %serial = 0, 10, 20, 30, 40, 50]
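As a quick check on the curves, Amdahl's bound, speedup ≤ 1 / (s + (1 - s) / p), can be evaluated directly. The following is a minimal sketch; the function names `amdahl_speedup` and `amdahl_limit` are illustrative and not part of the course material.

```c
/* Upper bound on speedup from Amdahl's Law.
 * s: serial fraction (0 <= s <= 1), p: number of cores (p >= 1). */
double amdahl_speedup(double s, int p)
{
    return 1.0 / (s + (1.0 - s) / (double)p);
}

/* Limit of the bound as p grows without bound: 1 / s.
 * Only meaningful for s > 0; with s == 0 the speedup is unbounded. */
double amdahl_limit(double s)
{
    return 1.0 / s;
}
```

For example, `amdahl_speedup(0.05, 64)` evaluates to about 15.42: with 5% serial code, 64 cores cannot deliver more than roughly a 15x speedup.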

Question 1

If an application is only 25% serial, what is the maximum speedup you can ever achieve, assuming an infinite number of processors? (Ignore parallel overhead.)

A: 1.25
B: 2.0
C: 4.0
D: No speedup

Scaled Speedup (Gustafson-Barsis’s Law)

Amdahl’s Law does not take into account

• overhead costs

• increases in problem size able to be computed with more cores

Increasing the number of cores enables…

• Increasing the problem size ―> Decreasing the sequential fraction of computation ―> Increasing Speedup

Given p cores and a parallel code solving a problem of size n, let s be the fraction of the parallel execution time spent in serial code.

Ψ ≤ p + (1 – p) s

Scaled Speedup estimates how much faster the parallel execution is than the same computation on a single core

Assumes problem size increases linearly with number of cores

Using Scaled Speedup

If an application runs on 64 cores in 220 seconds, with 11 seconds devoted to serial execution, what is the scaled speedup?

Ψ = 64 + (1 – 64) (11/220)
  = 64 – 63 * 0.05
  = 60.85

Assuming fixed serial time, what is the single-core execution time?

(220 - 11) * 64 + 11 = 13387 seconds

• Amdahl’s Law then yields a speedup of 60.85 on 64 cores with 0.0822% serial time (11 / 13387)

Would serial time be fixed? Would the problem fit on one core?

For comparison, Amdahl’s Law with 5% serial on 64 cores => 15.42
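The arithmetic on this slide can be reproduced with a few lines of C. This is a sketch under the slide's own assumptions (fixed serial time, parallel work that divides evenly across cores); the function names are illustrative.

```c
/* Scaled speedup (Gustafson-Barsis): p + (1 - p) * s, where s is the
 * fraction of the parallel run time spent in serial code. */
double scaled_speedup(double s, int p)
{
    return (double)p + (1.0 - (double)p) * s;
}

/* Rearranged form: largest serial fraction that still reaches a target
 * scaled speedup on p cores (p > 1). */
double serial_fraction_for_scaled_speedup(double target, int p)
{
    return ((double)p - target) / ((double)p - 1.0);
}

/* Estimated single-core run time, assuming the serial time is fixed and
 * the parallel portion would take p times longer on one core. */
double single_core_time(double parallel_run_time, double serial_time, int p)
{
    return (parallel_run_time - serial_time) * (double)p + serial_time;
}
```

With the numbers above, `scaled_speedup(11.0 / 220.0, 64)` gives 60.85 and `single_core_time(220.0, 11.0, 64)` gives 13387 seconds.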

Question 2

What is the maximum fraction of serial execution time for a parallel application to achieve a scaled speedup of 7.5 on an eight-core system?

7.5 = 8 + (1 – 8) s
  s = 0.5 / 7
    = 0.071 => 7.1%

Using Amdahl’s Law instead, the serial percentage must be ≤ 0.952%

Estimating potential scalability of serial applications

• Need to estimate serial vs. parallelizable execution times
  • Speedup estimate based on Amdahl’s law

• VTune sampling
  • Identify potential areas for parallelization
    • Example: loops
  • Use clock ticks to estimate parallel time
  • Serial time = Total run time – parallelizable run time
  • Compute scalability estimate

• VTune call graph
  • See potential call trees for parallelization
  • Use “Total time” (self + descendents) for parallelizable run time

Estimating scalability upper bound for parallel applications

Need to estimate serial vs. parallel execution times
  • Speedup estimate based on Amdahl’s law
  • Serial percentage for Gustafson-Barsis’s Law

Thread Profiler
  • Use critical path information in Profile View
  • Use information in Concurrency Level view

• Experimental technique based on CPU utilization of all processors/cores

Finding Serial and Parallel Time: Thread Profiler

Thread Profiler – for parallel applications

• Use Concurrency Level View

• Total Serial (CL:0 and CL:1) and Parallel (CL:2 and up) times
  • Under Utilized times counted as parallel time

Finding Serial and Parallel Time: CPU utilization

Experimental approaches – for parallel applications
  • Monitor utilization of all CPUs over time
  • Parallel region is where all CPUs are active
  • Perfmon* (Windows) or mpstat (Linux)
  • Example: 76% serial, 24% parallel on DP

[Figure: Perfmon CPU utilization vs. workload run time (1 to 43 sec) on a dual Xeon™ processor system (2.8 GHz, Windows* XP); regions marked Serial, Serial, Parallel]

Perfmon* or mpstat does not capture sub-second behavior

How can speedup estimate help identify scalability issues?

• Different workloads can exercise different parts of application

• Estimates can point to workloads that need scalability analysis and improvement

• Compare measured vs. estimate

• Choose largest delta workloads for analysis

[Figure: 2P and 4P Speedup (IBM* X440 .NET RC1); measured 2P speedup and measured 4P speedup per workload]

Workloads 5 & 12 show significant difference between estimate and actual; focus tuning here

Workload 11 is predicted to have low scaling

Quick Review: Measuring and Estimating Speedup

Estimate serial vs. parallel times in workloads

• Allows prediction of speedup upper bounds

Serial applications

• Estimate based on VTune Sampling or Callgraph runs

Parallel applications

• Use Thread Profiler

• Experimental techniques
  • Measuring CPU utilization over time for all processors

Approaching a serial application

1. Pick a workload

2. Establish a scalability target
   • Example: Must have at least 2.5x improvement from 1 core to 4 cores

3. Estimate amount of parallelization required
   • Dictated by Amdahl’s law
   • Example: 2.5x improvement from 1 core to 4 cores would require 80% of run time to be parallelized
   • Identify areas to parallelize
   • Cannot find areas to meet required amount of parallelization?
     • Reset scalability target and continue parallelization

4. Parallelize and measure speedup

5. Did you meet the scalability target?
   • If not, root cause and improve

Repeat for other workloads
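Step 3 can be made concrete by solving Amdahl's Law for the parallel fraction. The sketch below assumes no parallel overhead; `required_parallel_fraction` is an illustrative name, not something defined in the course material.

```c
/* Fraction of run time that must be parallelized to reach a target
 * speedup on p cores (p > 1), from Amdahl's Law with no overhead:
 *   target = 1 / ((1 - f) + f / p)  =>  f = (1 - 1/target) / (1 - 1/p)
 * A result greater than 1.0 means the target cannot be met on p cores. */
double required_parallel_fraction(double target_speedup, int p)
{
    return (1.0 - 1.0 / target_speedup) / (1.0 - 1.0 / (double)p);
}
```

For the example target, `required_parallel_fraction(2.5, 4)` is 0.8, matching the 80% figure in step 3. Asking for 5x on 4 cores returns a value above 1.0, which is the signal to reset the scalability target.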

Approaching a parallel application

1. Pick a workload

2. Estimate expected speedup
   • Amdahl, Gustafson-Barsis

3. Measure speedup

4. Did you meet the expected scaling?
   • If not, root cause and improve

Repeat for other workloads

Question 3a: What is the best design for scalability?

Audio processing application

• Left channel computation

• Right channel computation

Question 3b: What is the best design for scalability?

Video stream encoding

• Thread intra-frame?

• Thread groups of pictures?

Question 3c: What is the best design for scalability?

Room Assignment Problem (Simulated Annealing)

Goal: Find most compatible roommate assignments

Method:

• Roomers take interest survey

• Roommates initially chosen at random

• Two people are swapped at random

• Does new assignment increase common interests in roommates (reduce conflict)?
  • If yes, keep new assignment
  • If no, undo swap; shrinking random chance to keep bad match

• Continue until solution stabilizes

Windows*: Perfmon*

Recommended first set of counters

• “Processor” performance object: %processor time, %privileged time (for each CPU)

• “System” performance object: Context Switches/sec, System Calls/sec

• “PhysicalDisk” performance object: Disk Read bytes/sec, Disk Write bytes/sec (for each disk)

• “Memory” performance object: Pages/sec

• “Network Interface” performance object: Bytes Total/sec (for each network card)

Windows command line tools available

• Logman

• Relog

• Typeperf

Windows*: Fixing Process to Core

Eliminate “noise” from context switches that abandon cache

Windows Task Manager

• “Process” tab: right-click on a process to set affinity

Windows APIs

• SetProcessAffinityMask

• SetThreadAffinityMask

VTune* call graph

Helps isolate call trees for potential threading

VTune Counter Monitor

Tracks operating system counters over time

Some relevant counters:

• Processor time

• Available memory

• Context switches

Intel Thread Profiler

Identifies

• Serial vs. parallel run times

• Lock contention areas

• Parallel overhead

• Load imbalance

Loop graph viewer

Being considered for VTune 9.0

View program as loop hierarchies
  • Loop self times and total times (in terms of instructions retired)
  • Similar to call graph self and total times
  • Loop counts

Helps identify loops for coarse grain threading
  • Loop hierarchies can span functions and files

PIN tool based prototype
  • Currently Linux-only

Quick Review: Tools

No single tool may give you all the answers for scalability issues

• Thread Profiler comes close

Simple tools can provide insight into scalability issues

• Perfmon
  • Monitoring of CPU utilization of processors and application threads

Effects of serial domination

Serially dominated workloads do not scale well

• Amdahl’s Law

How to estimate serial time?

• VTune sampling, VTune Call graph
  • Serial applications

• Thread Profiler, experimental approaches
  • Parallel applications

Case Study 1

76% serial time on DP
  • 53% non-concurrent time on UP
  • 4P theoretical scaling 1.6X
  • 4P measured scaling 1.1X

[Figure: 2P and 4P Speedup (IBM* X440 .NET RC1); measured 2P speedup and measured 4P speedup]

[Figure: Perfmon CPU utilization vs. run time (1 to 43 sec) on a dual Xeon™ processor system (2.8 GHz, Windows* XP); regions marked Serial, Serial, Parallel]

Parallelize serial sections for better scalability

Question 4

Profile shows 80% of runtime spent calculating multidimensional FFT

Assume calling sequence of fft2d → fft1d → fftcc

Where would you thread for better scaling?

gprof profile:

Each sample counts as 0.01 seconds.

  %   cumulative     self                 self     total
 time    seconds  seconds      calls   ms/call   ms/call  name
64.71      93.43    93.43   23952910      0.00      0.00  fftcc_
11.47     110.00    16.57   23952910      0.00      0.00  fft1d_
11.41     126.47    16.47        100    164.70   1402.89  ssf_3dcs_
 4.94     133.59     7.13     151600      0.05      0.77  fft2d_
 2.94     137.84     4.25      37900      0.11      0.11  phaseshift3d_
 2.24     141.07     3.23        100     32.30     32.30  imaging_

A: fftcc

B: fft1d

C: fft2d

D: All of the above

Top-Down Design

The iterative “hotspot” tuning process:

Find and fix hot spot…

The top-down parallelization process:

Find the highest level of natural parallelism…

Top-down approach considers the parallelism of the whole application rather than individual hotspots.

The result is usually a more scalable, parallel application.

Granularity

Loosely defined as the ratio of computation to synchronization

Be sure there is enough work to merit parallel computation

Example: Working on the railroad. How many more workers can be added?

Activity 1: Workload-dependent Scaling

Lab shows a chosen number of spheres bouncing within an enclosed box

• Obey laws of physics for bouncing off walls and colliding with other spheres

User is able to control

• Number of spheres

• Amount of physics computation before rendering

• Whether to run with single thread or multithreaded
  • Load balance between threads will be explored in later lab

GUI frames per second displayed is performance metric

Parallel Overhead

Parallel overhead impacts scalability

Thread creation/destruction

• Amount of work vs. overhead

• Thread Pool (Windows*) may be a good solution

Synchronization

• Call overhead

• Transition in and out of kernel space

Possible indicators

• High kernel time

• Thread Profiler
  • Critical Path view showing large overhead times

Example: Threading Quicksort

Algorithm:

• Pick pivot value from elements

• Partition data around pivot
  • Less-than or equal to pivot
  • Greater than pivot

• Quicksort the two partitions

[Figure: array split around the pivot into a less-than-or-equal region and a greater-than region]

void QuickSort(int p, int r) // Assume global array of data
{
    if (p < r) {
        int q = Partition(p, r);
        QuickSort(p, q-1); // sort less-than
        QuickSort(q+1, r); // sort greater-than
    }
}
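The slide does not show `Partition`. One possibility consistent with the serial recursion (which skips index q because the pivot lands in its final position there) is a Lomuto-style partition. The sketch below is an assumption rather than the course's actual routine; the global array `data`, its size `MAX_N`, and the helper `swap_ints` are hypothetical names.

```c
#define MAX_N 1024          /* hypothetical capacity of the global array */

int data[MAX_N];            /* global array of data, as the slide assumes */

static void swap_ints(int *a, int *b)
{
    int t = *a;
    *a = *b;
    *b = t;
}

/* Lomuto partition: uses data[r] as the pivot, moves every element
 * <= pivot to the left, and returns the pivot's final index. */
int Partition(int p, int r)
{
    int pivot = data[r];
    int i = p - 1;
    for (int j = p; j < r; j++) {
        if (data[j] <= pivot) {
            i++;
            swap_ints(&data[i], &data[j]);
        }
    }
    swap_ints(&data[i + 1], &data[r]);
    return i + 1;
}

/* Serial quicksort over data[p..r], following the slide's recursion. */
void QuickSort(int p, int r)
{
    if (p < r) {
        int q = Partition(p, r);
        QuickSort(p, q - 1);   /* sort less-than-or-equal side */
        QuickSort(q + 1, r);   /* sort greater-than side */
    }
}
```

Filling `data[0..n-1]` and calling `QuickSort(0, n - 1)` sorts the array in place.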

Example: Threading Quicksort

How about creating threads at each recursive call?

typedef struct {
    int s, t;
} qParams; // packs both indices into the single thread parameter

DWORD WINAPI QuickSort(LPVOID pr)
{
    int p = ((qParams *)pr)->s;
    int r = ((qParams *)pr)->t;
    qParams lo, hi;
    HANDLE hLOHI[2];

    if (p < r) {
        int q = Partition(p, r);
        lo.s = p;   lo.t = q-1; // pivot at q is already in place, as in the serial version
        hi.s = q+1; hi.t = r;
        hLOHI[0] = CreateThread(NULL, 0, QuickSort, (LPVOID) &lo, 0, NULL);
        hLOHI[1] = CreateThread(NULL, 0, QuickSort, (LPVOID) &hi, 0, NULL);
        WaitForMultipleObjects(2, hLOHI, TRUE, INFINITE);
    }
    return 0;
}

Quicksort Performance Results

Is There a More Scalable Quicksort Implementation?

Thread pool to control number of threads

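Producer/Consumer relationship with index pair queue
  • Dequeue pair struct from queue and partition (Consumer)
  • Recursive calls become enqueue of index struct (Producer)

DWORD WINAPI QuickSort(LPVOID pArg)
{
    int p, r, q;
    while (1) {
        WaitForSingleObject(hSem, INFINITE); // wait until a pair is available
        dequeue(&p, &r);
        if (p < r) {
            q = Partition(p, r);
            enqueue(p, q-1);  // pivot at q is already in place, as in the serial version
            enqueue(q+1, r);
            ReleaseSemaphore(hSem, 2, NULL); // two new pairs are now in the queue
        }
    }
    return 0; // not reached as written; termination logic is not shown on the slide
}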

Semaphore counts number of pairs in queue

Encapsulation of index pairs done in queue routines

Quicksort Thread Pool Performance

Looking at Load Imbalances

Load imbalance reduces scalability

• Idle CPU

• Easier to spot on 4P or above
  • On 2P, idle times might be mistaken as “serial” sections

How do you detect this?

• Windows* Perfmon

• Linux* mpstat

• Thread Profiler

Case Study 2

Linux* mpstat CPU data

2P data not suggestive of load imbalance

4P data shows CPUs drop off

Case Study 2: Improved

Linux* mpstat CPU data

Second figure shows improvement with load balancing

• 1P-4P scaling improves from 2.1x to 2.7x

Spotting Load Imbalance in Thread Profiler

Differences in Active Thread state

First problem noticed is create/destroy threads for each iteration…

…but there is a difference in Active Thread state within pairs.

Activity 2: Effects of Load Balance in Multi-Threaded Implementation

Use Load Balance control within Basic Physics GUI to control number of spheres assigned to threads

How does this affect FPS measure?

Synchronization

Lost time waiting for locks

Most likely scenario for high contention

• Work inside AND outside protected region is very small

• “Threads pile up” on the lock
  • Symptoms: High context switches/sec, high kernel times

Spotting highly contended synchronization objects

• Thread Profiler

[Figure: timelines for Thread 0 through Thread 3 showing when each thread is in the critical section]

Lock Contention Indicators in Thread Profiler

Large percentage of Locks time

Large amount of Impact time associated with object

Synchronization Primitives: Windows*

Choice of synchronization primitives

• Atomic increments/decrements
  • InterlockedIncrement

• Critical Section, Critical Section with spin count
  • EnterCriticalSection, LeaveCriticalSection, SetCriticalSectionSpinCount
  • Works within a single process

• Events
  • Signal condition has been changed/satisfied

• Mutex
  • Works across processes as well

• Semaphore
  • Works across processes as well

Activity 3: Measuring Synchronization Object Overhead

Determine overhead for using different synchronization objects
  • InterlockedIncrement
  • CRITICAL_SECTION
  • CRITICAL_SECTION with spin count
  • Mutex
  • Semaphore

CRITICAL_SECTION is used as baseline

• InterlockedIncrement is specialized functionality; others more general

Synchronization Primitives Costs: Un-contended

[Figure: lock times relative to InterlockedIncrement by lock type; 1P/1C/1T (data in L1 cache), Mrm 2.6, Windows XP 32-bit (MP kernel); higher => more expensive]

Use the least expensive synchronization method possible

Lock Contention

Lock contention reduces scalability

Following factors combine to produce contention and reduce scalability

• Amount of work inside vs. outside protected region

• Synchronization primitive costs

• OS context switches during lock contention

Possible indicators (without Thread Profiler)

• High context switches/sec
  • >10,000/s should be investigated

• And high kernel time
  • >20% should be investigated

Watch for high context switches/sec and kernel time

Reducing Lock Contention

Lock contention reduces scalability – fix?

• Ideally, work inside << work outside
  • Redesign

• Explore use of “spin count” (Windows*)
  • InitializeCriticalSectionAndSpinCount, SetCriticalSectionSpinCount
  • #define _WIN32_WINNT 0x0403 // or higher
  • Spin count = 4000 recommended by Microsoft*
  • Not very portable

Case Study 3

17% serial time on 2P
  • 9% serial time on UP
  • 4P theoretical scaling 3.1X
  • Measured 4P scaling is 0.4X

Why should we think this is a Synchronization issue?

Clues: 50% kernel time, >40K context switches/sec on 4P

[Figure: 2P and 4P Speedup (IBM* X440 .NET RC1); measured 2P speedup and measured 4P speedup]

[Figure: Perfmon* CPU utilization vs. run time (1 to 66 sec) on a dual Xeon™ processor system (2.8 GHz, Windows* XP); regions marked Serial, Parallel]

Pretty good load balance

Case Study 3: Speedup

We have a negative scaling problem…

[Figure: speedup of original code on 1/2/4 Xeon™ processors, Windows* XP (2.8 GHz/512K L2/2MB iL3); recoverable bar labels: 1.00, 1.26]

If adding more threads results in worse performance, there must be some increased contention on a shared resource

Case Study 3: First Approach

Root cause: A class defined a critical section as a static member variable

Solution: Have each instance of class use separate lock by removing static declaration

Before: 4 threads randomly accessing 8 lights with 1 global lock

After: 4 threads randomly accessing 8 lights with 8 private locks

Case Study 3: Performance

Observations

• 4P scaling has improved from 0.7x to 1.3x

• There still is much work to do:
  • Now 80,000 context switches/second
  • Utilization of each CPU near 75%

[Figure: speedup of original vs. new code on 1/2/4 Xeon™ processors, Windows* XP (2.8 GHz/512K L2/2MB iL3); recoverable bar labels: 1.00, 1.00, 1.26]

Case Study 3: Second Approach

Perfmon* observations (4P)

• Almost no serial execution, utilization of each CPU near 50%

• Almost 200,000 Context Switches/sec!

Case Study 3: Diagnosis

Root cause: poor choice of synchronization primitive
  • Computation is incrementing a single variable
  • Threads contending on single Critical Section object

Solution: Use of “InterlockedIncrement”
  • Critical section with spin count is another possibility
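The same idea can be sketched portably with C11 atomics, which stand in here for the Windows `InterlockedIncrement` call used in the case study. This is an illustration of replacing a lock around a single increment with one atomic operation, not the case study's code; `shared_count`, `count_event`, and `read_count` are hypothetical names.

```c
#include <stdatomic.h>

/* Shared counter updated by all threads; no lock object is needed. */
static atomic_long shared_count = 0;

/* Atomically add one and return the new value, mirroring the return
 * convention of InterlockedIncrement (atomic_fetch_add returns the
 * previous value, hence the + 1). */
long count_event(void)
{
    return atomic_fetch_add(&shared_count, 1) + 1;
}

/* Read the current value of the counter. */
long read_count(void)
{
    return atomic_load(&shared_count);
}
```

Each call to `count_event` is a single atomic read-modify-write, so threads never block in the kernel waiting for a critical section just to bump a counter.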

[Figure: 1P/4P scaling of original vs. new code on 1/2/4 Xeon™ processors, Windows* XP (2GHz/512K L2/2MB iL3); recoverable bar labels: 1.00, 1.00, 1.40]

Use the least expensive synchronization method possible

How Much of Data Structure to be Locked?

Example: Array of counts/buckets/pointers (random access)

• Enumeration sort, radix sort, bucket sort

• Hash table

Lock whole structure?
  • Easy to implement
  • Severely restricts access

Lock individual elements?
  • Individual access by different threads
  • Extra space in structure

Modulo Locks

Assuming little contention for individual elements

Create array of locks to protect every Kth element
  • Fixed number of locks, say 2
  • Lock index used to determine which elements are protected
    • To access element Data[Q], thread must hold LOCK[Q % 2]
  • Works for 2-D and 3-D arrays
    • For example, with eight locks, accessing A[i,j] would use LOCK[(i+j) % 8]

Set number of locks equal to number of threads
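A minimal sketch of modulo locks for the 2-D case follows. It assumes eight locks and uses a C11 `atomic_flag` spinlock as a portable stand-in for whatever lock type the application actually uses (for example a Windows CRITICAL_SECTION); the array dimensions and all function names are hypothetical.

```c
#include <stdatomic.h>

#define NUM_LOCKS 8     /* fixed number of locks, as in the 2-D example */
#define ROWS 16         /* hypothetical array dimensions */
#define COLS 16

static long A[ROWS][COLS];
static atomic_flag locks[NUM_LOCKS];

/* Put every lock in the unlocked state; call once before use. */
void init_locks(void)
{
    for (int k = 0; k < NUM_LOCKS; k++)
        atomic_flag_clear(&locks[k]);
}

/* Lock protecting A[i][j]: LOCK[(i + j) % NUM_LOCKS]. */
int lock_index(int i, int j)
{
    return (i + j) % NUM_LOCKS;
}

/* Update one element while holding only the lock that covers it, so
 * threads touching elements under other locks are not blocked. */
void add_to_element(int i, int j, long delta)
{
    atomic_flag *lock = &locks[lock_index(i, j)];
    while (atomic_flag_test_and_set_explicit(lock, memory_order_acquire))
        ;                               /* spin until the lock is free */
    A[i][j] += delta;
    atomic_flag_clear_explicit(lock, memory_order_release);
}

/* Read one element (no locking; intended for use after updates finish). */
long get_element(int i, int j)
{
    return A[i][j];
}
```

The space cost is fixed at `NUM_LOCKS` lock objects regardless of array size, which is the trade-off between one global lock and one lock per element.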

Frontside Bus (FSB) Bandwidth

Cores share bus in current Intel® multi-core architectures
  • Saturating the bus limits scalability
  • Newer independent bus designs improve scalability
  • Applies to current SMP platforms too

Good metric to monitor, if
  • CPU utilization is close to 100%
  • Poor scaling to 4P or 8P
  • Low context switches/sec

How do you measure this?
  • VTune™ Performance Analyzer
  • Compare 1-thread vs. multi-thread VTune runs
    • Look for areas where clock ticks show significant jumps
    • Code inspection

Case Study 4

IPF Madison 1.5Ghz/9M/400Mhz

• 1P to 2P scaling: 1.28

• 1P to 4P scaling: 1.27

2P close to FSB saturation

• ~5GB/s

• Madison 400Mhz bus peak bandwidth is 6.4GB/s

Solution?

• Change algorithm / data structures to keep data in cache more often

• Easier said than done

Frontside Bus Lab

Intent of this lab

• Observe impact of FSB saturation on scalability using Stream benchmark

• Learn use of appropriate VTune performance event to monitor bus utilization

[Figure: platform block diagram: chipset (MCH) connected over Bus 0 and Bus 1 to Woodcrest Socket 0 and Woodcrest Socket 1; Cores 0 through 7, each labeled 2M L2]

Computing FSB Data Bandwidth: Core 2™ Processor

Bus bandwidth (MB/s) per core = (BDC.ta / CCU.b) * TB

• BDC.ta is the BUS_DRDY_CLOCKS.THIS_AGENT event count
  • Counts the number of bus cycles when data is sent on the bus (the DRDY [Data Ready] signal is asserted on the bus)

• CCU.b is the CPU_CLK_UNHALTED.BUS event count
  • Counts the number of bus cycles that occurred during measurement (bus cycles when core is not halted)

• TB is the theoretical bandwidth of the bus in MB/s = 8 * bus frequency
  • Example: 2.6GHz processor with 1333MHz FSB
  • Theoretical bandwidth = 8 bytes/clock * 1333 = 10664 MB/s

Total bus bandwidth = ∑ “per core” bandwidths
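The per-core formula is easy to wrap in a helper when working from a table of event counts. The sketch below simply encodes (BDC.ta / CCU.b) * TB with TB = 8 bytes/clock * bus frequency; the function names are illustrative, and the counts are passed as doubles to avoid integer overflow on large samples.

```c
/* Theoretical FSB bandwidth in MB/s: 8 bytes per bus clock. */
double fsb_theoretical_mb_s(double bus_mhz)
{
    return 8.0 * bus_mhz;
}

/* Per-core bus bandwidth in MB/s from two event counts:
 *   drdy_clocks: BUS_DRDY_CLOCKS.THIS_AGENT
 *   bus_clocks:  CPU_CLK_UNHALTED.BUS */
double fsb_bandwidth_mb_s(double drdy_clocks, double bus_clocks, double bus_mhz)
{
    return (drdy_clocks / bus_clocks) * fsb_theoretical_mb_s(bus_mhz);
}
```

For instance, `fsb_bandwidth_mb_s(159204930.0, 1318288023.0, 1333.0)` comes to roughly 1288 MB/s, or about 1.29 GB/s. Summing the result over all cores gives the total bus bandwidth.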

Frontside Bus Lab: VTune Notes

• Where is “BUS_DRDY_CLOCKS.THIS_AGENT” event?
  • Configure Sampling Events Tab -> Event Groups: “External Bus Events”

• View results as Table in VTune
  • Easier to compute bandwidth

• View results per CPU (Show/Hide CPU Info.)

Frontside Bus Lab: Example

Bus bandwidth (MB/s) per core = (BDC.ta / CCU.b) * TB
  • TB = 8 bytes/clock * 1333 MHz = 10664 MB/s

Processor 0 BW = (159,204,930 / 1,318,288,023) * 10664 MB/s
               = (0.159/1.318) * 10.7 GB/s
               ~ 1.29 GB/s

Activity 4: Measuring Frontside Bus Saturation

Intent of this activity

• Observe impact of FSB saturation on scalability using Stream benchmark

• Learn use of appropriate VTune performance event to monitor bus utilization

What you may see
  • Scalabilitybus0.bat: ~5 sec / ~1.2 GB/s
  • Scalabilitybus0123.bat: ~18 sec / ~4 GB/s

Frontside Bus Lab: 4 Core Examples

Processor 0 BW = (0.308/3.216)*10.7 GB/s ~ 1.02 GB/s

Total BW ~ 1.0*4 = 4 GB/s

Total BW ~ 0.67*4 = 2.68 GB/s

Frontside Bus Lab: 8 Core Example

Total BUS_DRDY_CLOCKS.THIS_AGENT = 2,524,773,727

Average Total CPU_CLK_UNHALTED.BUS = 52,382,932,480 / 8
                                   = 6,547,866,560

Total Avg. BW = 2,524,773,727 / 6,547,866,560 * 10.7 GB/s
              ~ 4.11 GB/s

Frontside Bus Lab: Discussion

Almost 3x slower run time from 1 to 4 cores

• Same amount of data transferred by each thread

• Contention for shared bus makes everything run slower

How do clockticks compare in first 2 runs?

• Notice the clock ticks go up significantly in the 4 stream case in the source view as well

Why is there a difference in MB/s reported by Stream vs. what you calculated using VTune?

Frontside Bus Lab: Caution – Measuring BW on Underutilized System

Process or thread migration can break the formula

Example: Single thread Stream allowed to migrate in the lab

Is bandwidth used equal to (3.9 * 4 =) 15.6 GB/s?

Dempsey 3.2/2S/2C/1066Mhz - 1 Stream process allowed to migrate

                 Processor0      Processor1      Processor2      Processor3
Bus Data Ready   89,419,935      59,181,430      193,423,450     158,135,505
Clockticks       2,358,400,000   1,545,600,000   5,033,600,000   4,128,000,000
FSB BW MB/s      3927            3890            3928            3942

Best to tie threads to cores for bandwidth analysis

What is False Sharing?

Multiple threads repeatedly write to the same cache line shared by cores
  • Usually different data
  • Cache lines get invalidated
    • Forces additional reads from memory
  • Severe performance impact in tight loops, in general
    • Threads read/write to the same cache line very rapidly

Good metric to monitor if
  • CPU utilization of all processors very high
  • Poor scaling to 4P or 8P
  • Low context switches/sec

Detecting False Sharing with VTune Analyzer

Core 2® processor-based events:

• MACHINE_NUKES.MEM_ORDER event

• Significant last level cache read misses
  • 2nd Level or 3rd Level Cache Read Misses
  • MEM_LOAD_RETIRED.L2_MISS

• Significant FSB activity
  • BUS_DRDY_CLOCKS.THIS_AGENT

Compare 1-thread vs. multi-thread VTune runs

• Look for areas where clock ticks show significant jumps

• Code inspection

False Sharing Example 1

#define N_THREADS 16
double sum = 0.0, sum_local[N_THREADS];

#pragma omp parallel
{
    int me = omp_get_thread_num();
    sum_local[me] = 0.0;

    #pragma omp for
    for (i = 0; i < N; i++)
        sum_local[me] += x[i] * y[i];

    #pragma omp atomic
    sum += sum_local[me];
}

No overlap of memory access; no sync needed

Each thread can invalidate cache line for others

To fix, declare and use true local sum variable for each thread
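One way to apply the suggested fix is sketched below: each thread accumulates into a variable declared inside the parallel region, so the partial sums live in per-thread storage instead of adjacent elements of a shared array. The wrapper function `dot_product` and the local name `my_sum` are illustrative. The code uses only OpenMP pragmas, so a compiler without OpenMP enabled will ignore them and run the loop serially with the same result.

```c
/* Dot product with a true thread-local partial sum.
 * x, y: input arrays of length n. */
double dot_product(const double *x, const double *y, int n)
{
    double sum = 0.0;

    #pragma omp parallel
    {
        double my_sum = 0.0;    /* private to each thread: no shared cache line */

        #pragma omp for
        for (int i = 0; i < n; i++)
            my_sum += x[i] * y[i];

        #pragma omp atomic
        sum += my_sum;          /* one synchronized update per thread */
    }
    return sum;
}
```

An OpenMP `reduction(+:sum)` clause on the loop would express the same pattern more compactly; the explicit local variable is shown here because it mirrors the wording of the fix.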

False Sharing Example 2

Normalization of an array of spatial vectors (double precision)

• 10,000 vectors (<256K size; fits in L2)

• 5⅓ vectors per cache line

False sharing case

• Round-Robin distribution

• Each thread works on “start index + i*Num_Threads”

No false sharing case

• Each thread works on a block of data

• Block per thread = Array size / Num_Threads
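The two distributions differ only in how a thread maps its local iteration to an array index. The sketch below shows both mappings; the `Range` type and the function names are hypothetical, and spreading any remainder over the first few threads is an added assumption that the slide does not specify.

```c
/* Half-open index range [begin, end) owned by one thread. */
typedef struct {
    int begin;
    int end;
} Range;

/* Block distribution: each thread gets one contiguous chunk of about
 * n / nthreads elements; the first (n % nthreads) threads get one extra. */
Range block_range(int tid, int nthreads, int n)
{
    int base = n / nthreads;
    int extra = n % nthreads;
    Range r;
    r.begin = tid * base + (tid < extra ? tid : extra);
    r.end = r.begin + base + (tid < extra ? 1 : 0);
    return r;
}

/* Round-robin distribution: the i-th element handled by thread tid is
 * "start index + i * Num_Threads", with start index equal to tid. */
int round_robin_index(int tid, int nthreads, int i)
{
    return tid + i * nthreads;
}
```

With 10,000 vectors and 4 threads, `block_range(1, 4, 10000)` covers indices 2500 through 4999, so neighboring threads touch the same cache line only at block boundaries. Under round-robin, consecutive vectors in one cache line belong to different threads, which is what triggers the false sharing.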

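[Figure: round-robin distribution: vectors V0, V1, V2, ... V9, ... assigned in turn to Thread 0, Thread 1, Thread 2, Thread 3]

[Figure: block distribution: vectors V0 ... V2499, V2500 ... V4999, V5000 ... assigned as contiguous blocks to Thread 0, Thread 1, Thread 2, Thread 3]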

False Sharing Example 2 – Effects on Speedup (2S/2C Dempsey; HT off)

[Figure: Effects of False Sharing. Speedup vs. number of cores/threads (up to 4) for ScalabilityLab-FS.exe (32-bit) on Dempsey 3.2 2S/2C, Windows 2003 Server 64-bit; series: No False Sharing, With False Sharing]

Activity 5: Identifying False Sharing

Intent of this activity

• Observe impact of false sharing on scalability

• Learn use of appropriate VTune performance events
  • Compare and contrast false sharing vs. no false sharing

Activity 5: Discussion – Typical Results

Sum of events on all cores

Event                        1T            4T-FS          4T-NOFS
CPU_CLK_UNHALTED.CORE        38.6 E+09     173.2 E+09     37.1 E+09
INST_RETIRED.ANY             30.5 E+09     30.5 E+09      28.5 E+09
MACHINE_NUKES.MEM_ORDER      1.92 E+06     105.48 E+06    0.04 E+06
MEM_LOAD_RETIRED.L2_MISS     0.020 E+06    75.557 E+06    0.185 E+06
BUS_DRDY_CLOCKS.THIS_AGENT   8.4 E+06      2454.5 E+06    7.3 E+06

No false sharing in single thread execution

MACHINE_NUKES.MEM_ORDER counts events most likely due to false sharing

• Cache misses can be indication of problems

Effects of On-die Shared Cache

Last level cache (LLC) size, shared vs. not shared

• Dempsey: 2MB L2 not shared

• Merom/Woodcrest: 4MB L2 shared

• Clovertown: 8MB L2 (4MB per die shared)

Cache sensitive application will run better with threads on cores not sharing cache

[Figure: chipset and L2 cache layout for Dempsey, Woodcrest, and Clovertown]

Detecting Effects of On-die Shared Cache

VTune sampling

• LLC cache misses increase significantly when run on same socket vs. different sockets

Experiments with single socket vs. multi-socket show differences in scaling

May require thread affinity to correct performance problems

Paying Attention to NUMA Issues

NUMA may affect scalability

• Non-Uniform Memory Access

• Adds extra memory layer to locate data
  • Registers
  • Cache
  • Memory
  • “Far” Memory

[Figure: platform diagrams labeled Cache-coherent Interconnect and Dual Independent Bus, each showing a chipset]

Paying attention to NUMA issues

NUMA related scalability issues depend on
  • Platform design
  • NUMA aware OS used or not
    • NUMA-aware OSes: Windows* Server 2003 and Linux 2.6 kernel
  • Application being NUMA aware or not

Check for NUMA issues if
  • Scaling falls off when going from SMP to NUMA
  • Low context switches/sec
  • Application is memory latency sensitive

How do you detect this?
  • Knowledge of platform architecture
  • Through experimentation
    • Tie threads to different cores to measure performance
    • Measure memory latency ratio between “near” and “far” memory

OS and Application NUMA Support

Definition of node
  • Own processors and memory
  • Connected to the larger system through a cache-coherent interconnect

Role of NUMA-aware OS
  • Schedule threads on processors in the same node as memory being used
  • Satisfy memory-allocation requests from within the node
    • But will allocate memory from other nodes if necessary

Role of NUMA-aware applications
  • Use of NUMA APIs
    • Topology of nodes
    • Memory per node
  • Use of Affinity Mask APIs
    • SetThreadAffinityMask, SetProcessAffinityMask
    • Keep threads sharing memory on the same node

Watching I/O

I/O impacts scalability

• CPU likely to be idle

• Check for I/O to disk and network

How do you detect this?

• Windows* Perfmon

• Linux* vmstat, sar, iostat

Case Study 5

Linux* mpstat CPU data, sar I/O data

Correlation between disk write peaks and CPU utilization troughs

When I/O is reduced using application configuration options
  • 1P-4P scaling improves from 1.9x to

Striped or RAID disk configurations could have helped

Overlapped I/O implementation in application

Factors Inhibiting Scalability Summary

Serially dominated workload

Choice of synchronization primitives and lock contention

Granularity

Parallel overhead

I/O (Disk and Network)

Load Imbalance

High front side bus utilization

Memory related
  • NUMA
  • False sharing
  • Shared cache effects

Backup

MESI protocol

Every cache line is marked with one of the four following states (coded in two additional bits):

• M - Modified: The cache line is present only in the current cache, and is dirty; it has been modified from the value in main memory. The cache is required to write the data back to main memory at some time in the future, before permitting any other read of the (no longer valid) main memory state.

• E - Exclusive: The cache line is present only in the current cache, but is clean; it matches main memory.

• S - Shared: Indicates that this cache line may be stored in other caches of the machine.

• I - Invalid: Indicates that this cache line is invalid.

A cache may satisfy a read from any state except Invalid. An Invalid line must be fetched (to the Shared or Exclusive states) to satisfy a read.

A write may only be performed if the cache line is in the Modified or Exclusive state. If it is in the Shared state, all other cached copies must be invalidated first. This is typically done by a broadcast operation.

A cache may discard a non-Modified line at any time, changing to the Invalid state. A Modified line must be written back first.

MESI protocol (contd.)

A cache that holds a line in the Modified state must snoop (intercept) all attempted reads (from all of the other CPUs in the system) of the corresponding main memory location and insert the data that it holds. This is typically done by forcing the read to back off (i.e. to abort the memory bus transaction), then writing the data to main memory and changing the cache line to the Shared state.

A cache that holds a line in the Shared state must also snoop all invalidate broadcasts from other CPUs, and discard the line (by moving it into Invalid state) on a match.

A cache that holds a line in the Exclusive state must also snoop all read transactions from all other CPUs, and move the line to Shared state on a match.

The Modified and Exclusive states are always precise: i.e. they match the true cacheline ownership situation in the system. The Shared state may be imprecise: if another CPU discards a Shared line, and this CPU becomes the sole owner of that cacheline, the line will not be promoted to Exclusive state. (because broadcasting all cacheline replacements from all CPUs is not practical over a broadcast snoop bus)
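The read/write rules and snoop responses described on these two slides can be summarized as a small state machine. This is a simplified sketch covering only the transitions stated in the text, not a full protocol model; the enum and function names are illustrative.

```c
typedef enum {
    MESI_MODIFIED,
    MESI_EXCLUSIVE,
    MESI_SHARED,
    MESI_INVALID
} MesiState;

/* A read can be satisfied from any state except Invalid. */
int mesi_can_read(MesiState s)
{
    return s != MESI_INVALID;
}

/* A write may proceed only from Modified or Exclusive; a Shared line
 * must first have all other cached copies invalidated. */
int mesi_can_write(MesiState s)
{
    return s == MESI_MODIFIED || s == MESI_EXCLUSIVE;
}

/* State of our copy after snooping a read by another CPU.
 * Modified: data is written back first, then the line becomes Shared.
 * Exclusive: the line becomes Shared. Shared and Invalid are unchanged. */
MesiState mesi_on_remote_read(MesiState s)
{
    if (s == MESI_MODIFIED || s == MESI_EXCLUSIVE)
        return MESI_SHARED;
    return s;
}

/* State of our copy after snooping an invalidate broadcast.
 * The text describes this only for Shared lines, which are discarded;
 * other states are returned unchanged in this sketch. */
MesiState mesi_on_remote_invalidate(MesiState s)
{
    if (s == MESI_SHARED)
        return MESI_INVALID;
    return s;
}
```

False sharing shows up in this model as lines bouncing between states: each write by one core invalidates the Shared copies held by the others, which must then refetch the line before their next access.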

Linux*: vmstat

Quick way to watch for
  • Disk i/o
  • Overall cpu utilization
  • Swap
  • Context switches

vmstat -n 1
  • print header only once
  • output data every 1 sec

Linux*: sar

May not be installed by default
  • sysstat package (CD 3 RPMS directory – RHEL3 U2)

Monitors
  • Disk i/o, all CPUs, swap, network traffic, interrupts

sar -U ALL -bWw -o <binfile> 1 0

Report statistics, 1 sec interval, forever
  -U ALL         Report on all CPUs
  -b             aggregated disk I/O (for more details use iostat)
  -W             swap statistics
  -w             context switches/s
  -o <binfile>   write to binary file
  -f <binfile>   read from binary file

Linux*: sar

Using sar in application launch scripts

export PATH=$PATH:/sbin

sar -U ALL -bWw -o app.sar 1 0 &

kill -9 `pidof sar` (could use: killall -9 sar)

kill -9 `pidof sadc` (could use: killall -9 sadc)

sar seems more expensive
  • Time gaps in reports even if 1 sec output is requested

• Needs more investigation

Linux*: mpstat

mpstat -P ALL 1
  -P ALL   all processors

Using mpstat in application launch scripts

export PATH=$PATH:/sbin

mpstat -P ALL 1 >mpstat.out &

kill -9 `pidof mpstat` (could use: killall -9 mpstat)

Linux*: top

Useful interactive mode options
  • Press the appropriate keys

s   changes delay between updates
u   selects only specified user’s process
H   show threads & utilization (toggle); shows CPU on which thread is scheduled
i   idle processes or threads (toggle)

b   batch mode
  • Does not report threads

Linux*: using processor affinity

schedutils package
  • “taskset” command

Affinity system call APIs
  • sched_setaffinity
  • In 2.6 kernels
