Programming Multi-Core Processor-Based Embedded Systems
A Hands-On Experience on Cavium Octeon-Based Platforms
Lecture 1 (Overview)
Course Objectives
A hands-on opportunity to learn multi-core architectures and programming of multi-core systems
Emphasis on programming: using the multi-threading paradigm, understanding the complexities, applying it to generic computing/networking problems, and implementing on a popular embedded multi-core platform
Grading Policy and Reference Books
Grading policy: lectures (40%), labs (50%), daily quizzes (10%)
Reference material:
Shameem Akhter and Jason Roberts, Multi-Core Programming, Intel Press, 2006
David E. Culler and Jaswinder Pal Singh, Parallel Computer Architecture: A Hardware/Software Approach, Morgan Kaufmann, 1998
Class notes
Course Outline
Introduction
Parallel architectures and terminology
Context for current interest: multi-core processors
Programming paradigms for multi-core
Octeon processor architecture
Multi-threading on multi-core processors
Applications for multi-core processors
Application layer computing on multi-core
Performance measurement and tuning
An Introduction to Parallel Computing in the Context of Multi-Core Architectures
Developing Software for Multi-Core: A Paradigm Shift
Application developers are typically oblivious of the underlying hardware architecture: they write a sequential program and get an automatic, guaranteed performance benefit with each processor upgrade, at no extra work to the programmer
There is no such "free lunch" with multi-core systems: modern processors contain multiple cores, and parallel programs are needed to exploit that parallelism
Parallel computing is now part of the mainstream
Parallel Computing for the Mainstream: Old vs. New Programming Paradigms
Known tools and techniques: high performance computing and communication (HPCC), with its wealth of existing knowledge about parallel algorithms, programming paradigms, languages and compilers, and scientific/engineering applications
Multi-threading for multi-core: common in desktop and enterprise applications; exploits the parallelism of multi-core, with its challenges
New realizations of old paradigms: parallel computing on the PlayStation 3, parallel computing on GPUs, cluster computing for large-volume data
Dealing with the Challenge of Multi-Core Programming with Hands-On Experience
Our objective is two-fold: overview the known paradigms for background, and learn using state-of-the-art implementations
Choice of platform for hands-on experience: a Cavium Networks Octeon processor based system, with multiple cores (1 to 16), suitable for embedded products, commonly used in networking products, and with a standard Linux based development environment
Agenda for Today
Parallel architectures and terminology: processor technology trends, architecture trends, taxonomy
Why multi-core architectures? Traditional parallel computing and the transition to multi-core architectures
Programming paradigms: traditional, plus recent additions
Introduction to Octeon processor based systems
Architecture and Terminology
Background on parallel architectures and commonly used terminology
Architectures and Terminology
Objectives of this section:
Understand processor technology trends
Realize that parallel architectures evolve based on technology and architecture trends
Terminology used in parallel computing: von Neumann, Flynn's taxonomy, Bell's taxonomy, and other commonly used terminology
Processor Technology Trends
Processor technology evolution, Moore's law, ILP, and current trends
Processor Technology Evolution
Increasing number of transistors on a chip
Moore's law: the number of transistors on a chip is expected to double every 18 months
Chip densities are reaching their physical limits, but technological breakthroughs have kept Moore's law alive
Increasing clock rates during the 1990s: faster and smaller transistors, gates, and circuits on a chip; microprocessor clock rates increased by ~30% per year
Benchmark results (e.g., the SPEC suite) indicate performance improvement with technology
Moore’s Law
Gordon Moore, founder of Intel
1965: since the integrated circuit was invented, the number of transistors per square inch in these circuits had roughly doubled every year; this trend would continue for the foreseeable future
1975: revised estimate: circuit complexity doubles every 18 months
This was simply a prediction based on little data; however, it has defined the processor industry
Moore’s Original Law (2)
ftp://download.intel.com/research/silicon/moorespaper.pdf
Moore’s Original Issues
Design cost: still valid
Power dissipation: still valid
What to do with all the functionality possible?
ftp://download.intel.com/research/silicon/moorespaper.pdf
Moore’s Law and Intel Processors
From: http://www.intel.com/technology/silicon/mooreslaw/pix/mooreslaw_chart.gif
Good News: Moore’s Law isn’t done yet
Source: Webinar by Dr. Tim Mattson, Intel Corp.
Bad News: Single-Thread Performance is Falling Off
Worse News: Power (normalized to i486) Trend
Source: Webinar by Dr. Tim Mattson, Intel Corp.
Addressing Power Issues
Source: Webinar by Dr. Tim Mattson, Intel Corp.
Architecture Optimized for Power: a big step in the right direction
Source: Webinar by Dr. Tim Mattson, Intel Corp.
Long-Term Solution: Multi-Core
Source: Webinar by Dr. Tim Mattson, Intel Corp.
Summary of Technology Trends
Moore's law is still relevant, but its related issues need to be dealt with: design complexity and power consumption; uniprocessor performance is slowing down
Multiple processor cores resolve these issues: parallelism at the hardware level
The end user is exposed to this parallelism, with added complexities related to programming such systems
Taxonomy for Parallel Architectures
Von Neumann, Flynn's, and Bell's taxonomies and other common terminology
Von Neumann Architecture Evolution
[Figure: the evolution tree of the von Neumann architecture. Scalar sequential execution leads to lookahead with I/E (instruction/execution) overlap, then to functional parallelism through multiple functional units and pipelining. Pipelining leads to implicit and explicit vector machines (memory-to-memory and register-to-register). From there the tree splits into SIMD (associative processors, processor arrays) and MIMD (multicomputers, multiprocessors), converging in today's massively parallel processors.]
Pipelining and Parallelism
Instruction prefetch overlaps the fetch of one instruction with the execution of the previous one
Functional parallelism is supported by multiple functional units and by pipelining
Pipelining appears in instruction execution, arithmetic computations, and memory access operations
Pipelining is attractive for performing identical operations repeatedly over vector data strings
Flynn’s Classification
Michael Flynn (1972) classified architectures by their instruction and data streams
Single Instruction stream over a Single Data stream (SISD): conventional sequential machines
Single Instruction stream over Multiple Data streams (SIMD): vector computers equipped with scalar and vector hardware
Flynn’s Classification (2)
Multiple Instruction streams over a Single Data stream (MISD): the same data flows through a linear array of processors, also known as systolic arrays, for pipelined execution of specific algorithms
Multiple Instruction streams over Multiple Data streams (MIMD): the suitable model for general-purpose parallel architectures
Bell’s Taxonomy for MIMD
Multicomputers:
Multiple address spaces
The system consists of multiple computers, called nodes
Nodes are interconnected by a message-passing network
Each node has its own processor, memory, NIC, and I/O devices
Multiprocessors:
Shared address space
Further classified by how memory is accessed:
Uniform Memory Access (UMA)
Non-Uniform Memory Access (NUMA)
Cache-Only Memory Access (COMA)
Cache-Coherent Non-Uniform Memory Access (cc-NUMA)
Multicomputer Generations
First generation (1983-87): processor boards connected in a hypercube architecture; software-controlled message switching; examples: Caltech Cosmic Cube, Intel iPSC/1
Second generation (1988-92): mesh-connected architecture; hardware message routing; a software environment for medium-grain distributed computing; example: Intel Paragon
Third generation (1993-97): fine-grain multicomputers; examples: MIT J-Machine and Caltech Mosaic
Multiprocessor Examples
Distributed memory (scalable): dynamic binding of addresses to processors (KSR); static binding with caching (Alliant, DASH); static program binding (BBN, Cedar)
Central memory (not scalable): cross-point or multi-stage interconnects (Cray, Fujitsu, Hitachi, IBM, NEC, Tera); simple multi-bus (DEC, Encore, NCR, Sequent, SGI, Sun)
Supercomputers
Supercomputers use vector processing and data parallelism, and are classified into two categories: vector supercomputers and SIMD supercomputers
SIMD machines offer massive data parallelism: an instruction is broadcast to a large number of PEs
Examples: Illiac IV (64 PEs), MasPar MP-1 (16,384 PEs), and CM-2 (65,536 PEs)
Vector supercomputers
Machines with powerful vector processors: if a decoded instruction is a vector operation, it is sent to the vector unit
Register-to-register architecture: Fujitsu VP2000 series
Memory-to-memory architecture: Cyber 205
Pipelined vector supercomputer: Cray Y-MP
Dataflow Architectures
Represent computation as a graph of essential dependences
A logical processor at each node is activated by the availability of operands
Messages (tokens) carrying the tag of the next instruction are sent on to the next processor
The tag is compared with others in the matching store; a match fires execution
Dataflow Architectures (2)
[Figure: a dataflow graph for a = (b + 1) × (b − c), d = c × e, f = a × d, alongside the processor organization: tokens arrive from the network into a token queue, pass through a waiting/matching store, trigger instruction fetch from the program store, execute, and form new tokens that return to the network.]
Systolic Architectures
Replace the single processor with an array of regular processing elements
Orchestrate data flow for high throughput with less memory access
[Figure: a conventional organization (memory feeding a single PE) versus a systolic organization (memory feeding a chain of PEs).]
Systolic Architectures (2)
Different from pipelining: nonlinear array structure, multidirectional data flow, and each PE may have (small) local instruction and data memory
Different from SIMD: each PE may do something different
Initial motivation: VLSI enables inexpensive special-purpose chips
Represent algorithms directly by chips connected in a regular pattern
Systolic Arrays (Cont'd)
Example: a systolic array for 1-D convolution
y(i) = w1·x(i) + w2·x(i+1) + w3·x(i+2) + w4·x(i+3)
Practical realizations (e.g., iWARP) use quite general processors to enable a variety of algorithms on the same hardware
But dedicated interconnect channels: data is transferred directly from register to register across a channel
Specialized, and with the same problems as SIMD; general-purpose systems work well for the same algorithms (locality etc.)
[Figure: the weights w1..w4 sit in the cells while the x(i) stream flows through; each cell computes xout = x, x = xin, yout = yin + w·xin.]
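For reference, the same 4-tap convolution written as plain sequential C. This is a sketch for illustration (the names w, x, y, and n are ours, not from the slides); it computes exactly what the systolic array pipelines:

/* 1-D convolution with 4 weights:
   y(i) = w1*x(i) + w2*x(i+1) + w3*x(i+2) + w4*x(i+3) */
void convolve(const double w[4], const double x[], double y[], int n) {
    for (int i = 0; i + 3 < n; i++) {
        y[i] = w[0] * x[i] + w[1] * x[i + 1]
             + w[2] * x[i + 2] + w[3] * x[i + 3];
    }
}

The systolic version distributes the four multiply-accumulate steps of each y(i) across four cells, so a new y value emerges every cycle without re-reading x from memory.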
Cluster of Computers
Started as a poor man's parallel system: inexpensive PCs, an inexpensive switched Ethernet, and a run-time system to support message passing
Low performance for HPCC applications: high network I/O latency and low bandwidth
Suitable for high-throughput applications: data center applications, virtualized resources, independent threads or processes
Summary of Taxonomy
Multiple taxonomies: based on functional parallelism (the von Neumann and Flynn taxonomies) or on programming paradigm (Bell's taxonomy)
Parallel architecture types: multicomputers (distributed address space), multiprocessors (shared address space), multi-core, multi-threaded, and others (vector, dataflow, systolic, and cluster)
Why Multi-Core Architectures
Based on technology and architecture trends
Multi-Core Architectures
Traditional architectures: sequential (Moore's law as increasing clock frequency) and parallel (diminishing returns from ILP)
Transition to multi-core: architecture similar to SMPs; programming is typically SAS
Challenges of the transition: performance requires efficient parallelization; selecting a suitable programming paradigm; performance tuning
Traditional Parallel Architectures
Definition and development tracks
Defining a Parallel Architecture
A sequential architecture is characterized by a single processor and a single control-flow path
A parallel architecture has multiple processors with an interconnection network, multiple control-flow paths, and communication and synchronization
A parallel computer can be defined as a collection of processing elements that communicate and cooperate to solve large problems fast
Broad Issues in Parallel Architectures
Resource allocation: how large a collection? how powerful are the elements? how much memory?
Data access, communication, and synchronization: how do the elements cooperate and communicate? how are data transmitted between processors? what are the abstractions and primitives for cooperation?
Performance and scalability: how does it all translate into performance? how does it scale?
General Context: Multiprocessors
A multiprocessor is any computer with several processors
SIMD (single instruction, multiple data): e.g., modern graphics cards
MIMD (multiple instructions, multiple data): e.g., the Lemieux cluster at the Pittsburgh Supercomputing Center
Architecture Development Tracks
Multiple-processor tracks: the shared-memory track and the message-passing track
Multivector and SIMD tracks
Multithreaded and dataflow tracks
Multi-core track
Shared-Memory Track
Starts with the C.mmp system developed at CMU in 1972: a UMA multiprocessor with 16 PDP-11/40 processors connected to 16 shared memory modules via a crossbar switch, and a pioneering multiprocessor OS (Hydra) development effort
Later systems: Illinois Cedar (1987), IBM RP3 (1985), BBN Butterfly (1989), NYU Ultracomputer (1983), Stanford DASH (1992), Fujitsu VPP500 (1992), KSR1 (1990)
Message-Passing Track
The Cosmic Cube (1981) pioneered message-passing computers
Medium-grain multicomputers: Intel iPSCs (1983), Intel Paragon (1992)
Fine-grain multicomputers: nCUBE-2 (1990), Mosaic (1992), MIT J-Machine (1992)
Multivector Track
CDC 7600 (1970), Cray 1 (1978), Cray Y-MP (1989), and the Fujitsu, NEC, and Hitachi models
CDC Cyber 205 (1982), ETA 10 (1989)
SIMD Track
Illiac IV (1968), Goodyear MPP (1980), DAP 610 (AMT, Inc., 1987), CM-2 (TMC, 1990)
BSP (1982), IBM GF/11 (1985), MasPar MP-1 (1990)
Multi-Threaded Track
Each processor can execute multiple threads of control at the same time
Multi-threading helps hide long latencies when building large-scale multiprocessors
Multithreaded architecture was pioneered by Burton Smith in the HEP system (1978); later systems include MIT Alewife (1989) and Tera (1990)
Simultaneous Multi-Threading (SMT): two hardware threads per core, available in Intel processors
Dataflow Track
Instead of the control flow of the von Neumann architecture, dataflow architectures are based on a dataflow mechanism
The dataflow concept was pioneered by Jack Dennis (1974) with a static architecture
The concept later inspired dynamic dataflow: MIT tagged-token (1980) and Manchester (1982)
Multi-Core Track
Intel dual-core processors, AMD Opteron, IBM Cell processor, Sun processors, Cavium processors, Freescale processors
A Multi-Core Processor is a Special Kind of Multiprocessor
All processors are on the same chip
Multi-core processors are MIMD: different cores execute different threads (multiple instructions), operating on different parts of memory (multiple data)
Multi-core is a shared-memory multiprocessor: all cores share the same address space
Summary: Why Parallel Architecture?
Increasingly attractive: economics, technology, architecture, and application demand; multi-core processor based systems are readily available
Increasingly central and mainstream
Parallelism is exploited at many levels: instruction-level parallelism, multiple threads of software, multiple cores, multiple processors
Focus of this class: multiple cores and/or processors, and programming paradigms
Example: Intel Pentium Pro Quad
All coherence and multiprocessing glue is in the processor module
Highly integrated, targeted at high volume
Low latency and bandwidth
[Figure: four P-Pro modules (each a CPU with bus interface, MIU, 256-KB L2 cache, and interrupt controller) on the P-Pro bus (64-bit data, 36-bit address, 66 MHz), with a memory controller to 1-, 2-, or 4-way interleaved DRAM and two PCI bridges to PCI buses with I/O cards.]
Example: SUN Enterprise
16 cards of either type: processors + memory, or I/O
All memory is accessed over the bus, so the machine is symmetric
Higher bandwidth, higher latency bus
[Figure: CPU/memory cards (two processors, each with L1 and L2 caches, plus a memory controller) and I/O cards (SBus slots, 2 Fibre Channel, 100bT, SCSI), all connected through bus interface/switch chips to the Gigaplane bus (256-bit data, 41-bit address, 83 MHz).]
Example: Cray T3E
Scales up to 1024 processors with 480 MB/s links
The memory controller generates communication requests for nonlocal references
No hardware mechanism for coherence (SGI Origin etc. provide this)
[Figure: each node pairs a processor and cache with a memory controller/NI and local memory, connected through a switch into a 3D (X, Y, Z) interconnect with external I/O.]
Example: IBM SP-2
Made out of essentially complete RS/6000 workstations
Network interface integrated on the I/O bus (bandwidth limited by the I/O bus)
[Figure: an IBM SP-2 node: a POWER2 CPU with L2 cache on the memory bus, a memory controller with 4-way interleaved DRAM, and a MicroChannel I/O bus carrying an i860-based NIC with DMA; nodes are joined by a general interconnection network formed from 8-port switches.]
Example: Intel Paragon
[Figure: an Intel Paragon node: two i860 processors with L1 caches, a NI with DMA driver, and a memory controller with 4-way interleaved DRAM on a 64-bit, 50 MHz memory bus; nodes attach to every switch of an 8-bit, 175 MHz, bidirectional 2D grid network. Pictured: Sandia's Intel Paragon XP/S-based supercomputer.]
Transition to Multi-Core
Selected multi-core processors
Single to Multiple Core Transition
Multi-Core Processor Architectures
Source: Michael Perrone of IBM
IBM POWER 4
Shipped in Dec. 2001
Technology: 180 nm lithography
Dual processor cores, 8-way superscalar
Out-of-order execution; 2 load/store units; 2 fixed-point and 2 floating-point units; >200 instructions in flight
Hardware instruction and data prefetch
IBM POWER 5
Shipped in Aug. 2004
Technology: 130 nm lithography
Dual processor cores, 8-way superscalar
SMT core: up to 2 virtual cores per physical core
A natural extension of the POWER4 design
IBM Cell Processor
An IBM, Sony, and Toshiba alliance formed in 2000
Based on IBM's 64-bit Power architecture
9 cores, 10 threads:
1 dual-thread Power Processor Element (PPE) for control
8 Synergistic Processor Elements (SPEs) for processing
Up to 25 GB/s memory bandwidth; up to 75 GB/s I/O bandwidth
>300 GB/s Element Interconnect Bus bandwidth
source: Dr. Michael Perrone, IBM
Future Multi-Core Platforms
A Many-Core Example: Intel's 80-Core Test Chip
Programming the Multi-Core
Programming, OS interaction, applications, synchronization, and scheduling
Parallel Computing is Ubiquitous
Over the next few years, all computing devices will be parallel computers: servers, laptops, cell phones
What about software? Herb Sutter of Microsoft wrote in Dr. Dobb's Journal:
"The free lunch is over: a fundamental turn toward concurrency in software"
Software will no longer get faster from one generation to the next as hardware improves … unless it is parallel software
Interaction with OS
The OS perceives each core as a separate processor, the same as for an SMP; the Linux SMP kernel works on multi-core processors
The OS scheduler maps threads/processes to different cores; migration is possible, and processes can also be pinned to specific cores
Most major OSes support multi-core today: Windows, Linux, Mac OS X, …
What Applications Benefit from Multi-Core?
Database servers
Web servers (web commerce)
Compilers
Multimedia applications
Scientific applications, CAD/CAM
In general, applications with thread-level parallelism (as opposed to instruction-level parallelism), where each thread can run on its own core
More examples
Editing a photo while recording a TV show through a digital video recorder
Downloading software while running an anti-virus program
"Anything that can be threaded today will map efficiently to multi-core"
BUT: some applications are difficult to parallelize
Programming for Multi-Core
Programmers must use threads or processes to spread the workload across multiple cores: threads for business/desktop apps; multiple processes for complex systems or SPMD-based parallel applications
The OS maps threads/processes to cores, transparently to the programmer
Programmers must also write parallel algorithms; this is true for scientific/engineering applications, where the programmer needs to define the mapping to cores
Thread Safety is very important
Pre-emptive context switching: a context switch can happen AT ANY TIME
Multi-core provides true concurrency, not just uniprocessor time-slicing
Concurrency bugs are exposed much faster with multi-core
However: Need to use synchronization even if only time-slicing on a uniprocessor

int counter = 0;

void thread1() {
    int temp1 = counter;
    counter = temp1 + 1;
}

void thread2() {
    int temp2 = counter;
    counter = temp2 + 1;
}
Need to use synchronization even if only time-slicing on a uniprocessor

One interleaving (each thread's read-modify-write runs to completion) gives counter = 2:

temp1 = counter;
counter = temp1 + 1;
temp2 = counter;
counter = temp2 + 1;

Another interleaving (both reads happen before either write, a lost update) gives counter = 1:

temp1 = counter;
temp2 = counter;
counter = temp1 + 1;
counter = temp2 + 1;
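The standard fix is to make the read-modify-write atomic with a lock. A minimal sketch using POSIX threads (Pthreads come up again later in the course; the helper name increment is ours, not from the slides):

#include <pthread.h>
#include <stdio.h>

int counter = 0;
pthread_mutex_t counter_lock = PTHREAD_MUTEX_INITIALIZER;

/* both threads run this; the mutex serializes the read-modify-write */
void *increment(void *arg) {
    pthread_mutex_lock(&counter_lock);
    int temp = counter;
    counter = temp + 1;
    pthread_mutex_unlock(&counter_lock);
    return NULL;
}

int main(void) {
    pthread_t t1, t2;
    pthread_create(&t1, NULL, increment, NULL);
    pthread_create(&t2, NULL, increment, NULL);
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    printf("counter = %d\n", counter);  /* now always prints 2 */
    return 0;
}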
Assigning Threads to the Cores
Each thread/process has an affinity mask
Affinity mask specifies what cores the thread is allowed to run on
Different threads can have different masks
Affinities are inherited across fork()
Affinity Masks are Bit Vectors
Example: a 4-way multi-core, without SMT

    mask:  1 0 1 1
    core:  3 2 1 0

The process/thread is allowed to run on cores 0, 2, and 3, but not on core 1
Affinity Masks when Multi-Core and SMT are Combined
Separate bits for each simultaneous thread
Example: a 4-way multi-core with 2 threads per core

    mask:    1 1   0 0   1 0   1 1
    thread:  1 0   1 0   1 0   1 0
    core:     3     2     1     0

Core 2 can't run the process
Core 1 can use only one simultaneous thread
Default Affinities
The default affinity mask is all 1s: all threads can run on all processors
The OS scheduler then decides which threads run on which cores
The OS scheduler detects skewed workloads and migrates threads to less busy processors
Process Migration is Costly
Migration requires restarting the execution pipeline, and cached data is invalidated
The OS scheduler therefore tries to avoid migration as much as possible: it tends to keep a thread on the same core
This is called soft affinity
Hard Affinities
The programmer can prescribe their own affinities (hard affinities)
Rule of thumb: use the default scheduler unless there is a good reason not to
When to Set your own Affinities
When two (or more) threads share data structures in memory: map them to the same core so that they can share the cache
Real-time threads. Example: a thread running a robot controller must not be context-switched, or else the robot can go unstable; dedicate an entire core just to this thread
Source: Sensable.com
Kernel Scheduler API
#include <sched.h>
int sched_getaffinity(pid_t pid, unsigned int len, unsigned long * mask);
Retrieves the current affinity mask of process ‘pid’ and stores it into space pointed to by ‘mask’.
‘len’ is the system word size: sizeof(unsigned long)
Kernel Scheduler API
#include <sched.h>
int sched_setaffinity(pid_t pid, unsigned int len, unsigned long * mask);
Sets the current affinity mask of process ‘pid’ to *mask
‘len’ is the system word size: sizeof(unsigned long)
To query the affinity of a running process:
$ taskset -p 3935
pid 3935's current affinity mask: f
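Note that the prototypes above are the old raw forms; current glibc wraps the same calls with a cpu_set_t mask. A minimal sketch under that interface (it assumes a machine with at least 4 cores, and the mask mirrors the 1011 example from the earlier slide):

#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>

int main(void) {
    cpu_set_t mask;
    CPU_ZERO(&mask);        /* start with an empty mask */
    CPU_SET(0, &mask);      /* allow core 0 */
    CPU_SET(2, &mask);      /* allow core 2 */
    CPU_SET(3, &mask);      /* allow core 3; core 1 stays disallowed */

    /* pid 0 means "the calling process" */
    if (sched_setaffinity(0, sizeof(mask), &mask) != 0) {
        perror("sched_setaffinity");
        return 1;
    }

    /* read the mask back to confirm */
    CPU_ZERO(&mask);
    if (sched_getaffinity(0, sizeof(mask), &mask) == 0)
        printf("allowed on %d cores\n", CPU_COUNT(&mask));
    return 0;
}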
Windows Task Manager
[Figure: Task Manager's CPU usage history panes, one per core (core 1, core 2).]
Summary
Reasons for multi-core based systems: processor technology trends, architecture trends and diminishing returns, application trends
Background for using multi-core systems: traditional parallel architectures and experience with parallelization; the transition to multi-core is happening
The challenge: programming the multi-core
Programming Paradigms
Traditional as well as new additions
Programming Paradigms in the Context of Multi-Core Systems
The architecture is similar to SMPs: SAS programming works, but message-passing is also possible
Traditional paradigms and their use for parallelization: explicit message passing (MP), shared address space (SAS), and multi-threading as a parallel computing model
New realizations of traditional paradigms: multi-threading on the Sony PlayStation 3, MapReduce, CUDA
Traditional Programming Models
David E. Culler and Jaswinder Pal Singh, Parallel Computer Architecture: A Hardware/Software Approach, Morgan Kaufmann, 1998
Parallel Programming Models
Message passing: each process uses its private address space; access to non-local address spaces requires explicit message passing; the programmer has full control of low-level data movement
Data parallel programming: an extension of message passing in which parallelism is based on data decomposition and assignment
Parallel Programming Models (2)
Shared memory programming: a single address space (SAS); hardware manages low-level data movement; the primary issue is coherence
Multi-threading: an extension of SAS, commonly used on SMP as well as uniprocessor (UP) systems
Libraries and building blocks: often specific to architectures; help reduce the development effort
Message Passing
Message passing operations: point-to-point unicast and collective communication; efficient collective communication can greatly enhance performance
Characteristics of message passing: the programmer controls the interactions, can customize them to leverage hardware features, and can tune them to enhance performance
Distributed Memory Issues
Load balancing: the programmer has explicit control over load balancing
Data locality: high performance depends on keeping each processor busy, so the data it requires should be kept local (decomposition and assignment)
Data distribution: affects the computation-to-communication ratio
Low-level data movement and process control: data movement results from the algorithmic structure
Data Movement and Process Control
Types of low-level data movement: replication, reduction, scatter/gather, parallel prefix and segmented scan, and permutation
Process control involves barrier synchronization and global conditionals
Data movement results in communication, and high performance depends on doing it efficiently; overhead results from processes waiting for data
Message Passing Libraries
Motivation: portability (an application need not be reprogrammed on a new platform), heterogeneity, modularity, and performance
Implementations: PVM, Express, P4, PICL, MPICH, etc.
MPI has become a de facto API standard
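To give a flavor of the MPI API itself, a minimal point-to-point sketch in C (the rank roles, tag, and payload value are illustrative). Rank 0 sends one integer to rank 1:

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    int rank, value;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0) {
        value = 42;   /* illustrative payload */
        MPI_Send(&value, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
    } else if (rank == 1) {
        MPI_Recv(&value, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        printf("rank 1 received %d\n", value);
    }

    MPI_Finalize();
    return 0;
}

Built with mpicc and launched with mpirun -np 2, every rank runs the same program; the explicit send/receive pair is the programmer-controlled data movement discussed above.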
Data Parallel Programming
This is an extension of message passing based programming
Instead of explicit message passing, parallelism is expressed implicitly in the language: it comes from distributing data structures and control, implemented through hints for the compiler
Data Parallel Programming (2)
Motivations: explicit message passing programs are difficult to write and debug, and data distribution results in computation on local data
Useful for massively parallel distributed-memory machines as well as clusters of workstations
Developed as extensions to existing languages: Fortran, C/C++, Ada, etc.
High Performance Fortran (HPF)
HPF defines language extensions to Fortran to support data parallel programming; the compiler generates a message-passing program
Features include data distribution and alignment, whose purpose is to reduce inter-processor communication: the DISTRIBUTE, ALIGN, and PROCESSORS directives
HPF (2)
Features (cont'd):
Parallel statements provide the ability to express parallelism: FORALL, INDEPENDENT, and PURE directives
Extended intrinsic functions and a standard library
EXTRINSIC procedures: a mechanism to use efficient machine-specific primitives
Limitations: useful only for Fortran code, and not useful outside the HPCC domain
Shared Memory Programming
A single address space view of the entire memory: logical for the programmer and easier to use
Parallelism comes from loops: this requires loop-level dependence analysis, and loop iterations that are independent with respect to the index can be scheduled on different processors
Such fine-grained parallelism requires hardware support for memory consistency to avoid overhead
Shared Memory Programming (2)
Performance tuning is non-trivial: data locality is critical; coarsening the grain by parallelizing the outer-most loop requires inter-procedural dependence analysis, and very few parallelizing compilers can analyze and parallelize even reasonably sized application codes
Simplifies parallel programming: sequential code is directly usable with compiler hints for parallel loops
OpenMP standardizes the compiler directives for shared memory parallel programming
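As a taste of those directives, a minimal OpenMP sketch in C (the array size and names are illustrative). The single pragma is the compiler hint; the loop iterations are independent with respect to i, so they can be split across cores:

#include <omp.h>
#include <stdio.h>

#define N 1000000

int main(void) {
    static double a[N], b[N];

    #pragma omp parallel for   /* distribute independent iterations over cores */
    for (int i = 0; i < N; i++)
        a[i] = 2.0 * b[i] + 1.0;

    printf("ran with up to %d threads\n", omp_get_max_threads());
    return 0;
}

Compiled with, e.g., gcc -fopenmp; without the flag the pragma is ignored and the same code runs sequentially, which is exactly the "sequential code plus hints" property noted above.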
Libraries and Building Blocks
Pre-implemented parallel code that can serve as kernels of larger well-known parallel applications
Several numerical methods are reusable across applications; examples: linear equation solvers, partial differential equations, matrix-based operations, etc.
Libraries and Building Blocks (2)
Low-level operations, such as data movement, can be used to build efficient matrix operations: matrix transpose, matrix multiply, LU decomposition
These building blocks can be implemented in distributed-memory numerical libraries; examples: ScaLAPACK (dense and banded matrices) and massively parallel LINPACK
Summary: Traditional Paradigms
Message-passing based programming: suitable for distributed-memory multicomputers; multiple message-passing libraries exist, and MPI is a de facto standard
Shared address space programming: suitable for SMPs; based on low-granularity loop-level parallelism; implemented through compiler directives, with OpenMP as the de facto standard
Threading as a parallel programming model
Types of applications and threading environments
Parallel Programming with Threads
Two types of applications: compute-intensive (HPCC systems) and throughput-intensive (distributed systems)
Threads help in different ways: dividing data reduces the work per thread, and distributing work enables concurrency (see the sketch below)
Programming languages that support threads: C/C++ or C#, Java, and OpenMP directives for C/C++ code
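To make the "dividing data" point concrete, a minimal Pthreads sketch in C that splits an array sum across a fixed number of threads (the thread count, array size, and helper names are illustrative, not from the slides):

#include <pthread.h>
#include <stdio.h>

#define NTHREADS 4
#define N 1000000

static double data[N];
static double partial[NTHREADS];

/* each thread sums its own contiguous slice of the array */
static void *sum_slice(void *arg) {
    long id = (long)arg;
    long lo = id * (N / NTHREADS);
    long hi = (id == NTHREADS - 1) ? N : lo + (N / NTHREADS);
    double s = 0.0;
    for (long i = lo; i < hi; i++)
        s += data[i];
    partial[id] = s;           /* no sharing: each thread has its own slot */
    return NULL;
}

int main(void) {
    pthread_t tid[NTHREADS];
    for (long i = 0; i < N; i++)
        data[i] = 1.0;
    for (long t = 0; t < NTHREADS; t++)
        pthread_create(&tid[t], NULL, sum_slice, (void *)t);
    double total = 0.0;
    for (long t = 0; t < NTHREADS; t++) {
        pthread_join(tid[t], NULL);
        total += partial[t];   /* combined after joining: no lock needed */
    }
    printf("total = %.0f\n", total);   /* prints 1000000 */
    return 0;
}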
Virtual Environments
VMs and platforms
Runtime virtualization
System virtualization
Systems with Virtual Machine Monitors
Summary: Threading for Parallelization
Threads can exploit parallelism on multi-core: scientific and engineering applications, and high-throughput applications
Shared memory programming is realized through Pthreads and OpenMP
Challenges of multi-threading: synchronization, and memory consistency (the dark side of shared memory)
Octeon Processor Architecture
Target system processor and system architecture, and programming environment
Cavium Octeon Processor
MIPS64 architecture
Up to 16 cores
Shared L2 cache
Security accelerator
Used in network security gateways
source: http://www.caviumnetworks.com/OCTEON_CN38XX_CN36XX.html
Key Takeaways for Today’s Session
Multi-core systems are here to stay: processor technology trends and application performance requirements
Programming is a challenge: no "free lunch"; programmers need to parallelize, and extra effort is needed to tune performance
Programming paradigms: message passing, shared address space based programming, and multi-threading (the focus of this course)