Programming Multi-Core Processor-Based Embedded Systems
A Hands-On Experience on Cavium Octeon-Based Platforms
Lecture 1 (Overview)
Course Objectives
A hands-on opportunity to learn multi-core architectures and programming of multi-core systems
Emphasis on programming: using the multi-threading paradigm, understanding the complexities, applying it to generic computing/networking problems, and implementing on a popular embedded multi-core platform
Grading Policy and Reference Books
Grading policy: lectures (40%), labs (50%), daily quizzes (10%)
Reference material:
Shameem Akhter and Jason Roberts, Multi-Core Programming, Intel Press, 2006
David E. Culler and Jaswinder Pal Singh, Parallel Computer Architecture: A Hardware/Software Approach, Morgan Kaufmann, 1998
Class notes
Course Outline
Introduction
Parallel architectures and terminology
Context for current interest: multi-core processors
Programming paradigms for multi-core
Octeon processor architecture
Multi-threading on multi-core processors
Applications for multi-core processors
Application layer computing on multi-core
Performance measurement and tuning
An Introduction to Parallel Computing in the Context of Multi-Core Architectures
Developing Software for Multi-Core: A Paradigm Shift
Application developers are typically oblivious of the underlying hardware architecture: they write a sequential program and get an automatic, guaranteed performance benefit with each processor upgrade, at no extra work to the programmer
There is no such "free lunch" with multi-core systems: modern processors contain multiple cores, and parallel programs are needed to exploit that parallelism
Parallel computing is now part of the mainstream
Parallel Computing for the Mainstream: Old vs. New Programming Paradigms
Known tools and techniques: high performance computing and communication (HPCC), with its wealth of existing knowledge about parallel algorithms, programming paradigms, languages and compilers, and scientific/engineering applications
Multi-threading for multi-core: common in desktop and enterprise applications; exploits the parallelism of multi-core, with its challenges
New realizations of old paradigms: parallel computing on the PlayStation 3, parallel computing on GPUs, cluster computing for large-volume data
Dealing with the Challenge of Multi-Core Programming with Hands-On Experience
Our objective is two-fold: overview the known paradigms for background, and learn using state-of-the-art implementations
Choice of platform for hands-on experience: a Cavium Networks Octeon processor based system, with multiple cores (1 to 16), suitable for embedded products, commonly used in networking products, and with a standard Linux based development environment
Agenda for Today
Parallel architectures and terminology: processor technology trends, architecture trends, taxonomy
Why multi-core architectures? Traditional parallel computing and the transition to multi-core architectures
Programming paradigms: traditional, plus recent additions
Introduction to Octeon processor based systems
Architecture and Terminology
Background on parallel architectures and commonly used terminology
Architectures and Terminology
Objectives of this section:
Understand processor technology trends
Realize that parallel architectures evolve based on technology and architecture trends
Terminology used in parallel computing: von Neumann, Flynn's taxonomy, Bell's taxonomy, and other commonly used terminology
Processor Technology Trends
Processor technology evolution, Moore's law, ILP, and current trends
Processor Technology Evolution
Increasing number of transistors on a chip
Moore's law: the number of transistors on a chip is expected to double every 18 months
Chip densities are reaching their physical limits, but technological breakthroughs have kept Moore's law alive
Increasing clock rates during the 1990s: faster and smaller transistors, gates, and circuits on a chip; microprocessor clock rates increased by ~30% per year
Benchmark results (e.g., the SPEC suite) indicate performance improvement with technology
Moore’s Law
Gordon Moore, founder of Intel
1965: since the integrated circuit was invented, the number of transistors per square inch in these circuits had roughly doubled every year; this trend would continue for the foreseeable future
1975: revised estimate: circuit complexity doubles every 18 months
This was simply a prediction based on little data; however, it has defined the processor industry
Moore’s Original Law (2)
ftp://download.intel.com/research/silicon/moorespaper.pdf
Moore’s Original Issues
Design cost: still valid
Power dissipation: still valid
What to do with all the functionality possible?
ftp://download.intel.com/research/silicon/moorespaper.pdf
Moore’s Law and Intel Processors
From: http://www.intel.com/technology/silicon/mooreslaw/pix/mooreslaw_chart.gif
Good News: Moore’s Law isn’t done yet
Source: Webinar by Dr. Tim Mattson, Intel Corp.
Bad News: Single-Thread Performance is Falling Off
Worse News: Power (normalized to i486) Trend
Source: Webinar by Dr. Tim Mattson, Intel Corp.
Addressing Power Issues
Source: Webinar by Dr. Tim Mattson, Intel Corp.
Architecture Optimized for Power: a big step in the right direction
Source: Webinar by Dr. Tim Mattson, Intel Corp.
Long-Term Solution: Multi-Core
Source: Webinar by Dr. Tim Mattson, Intel Corp.
Summary of Technology Trends
Moore's law is still relevant, but its related issues need to be dealt with: design complexity and power consumption; uniprocessor performance is slowing down
Multiple processor cores resolve these issues: parallelism at the hardware level
The end user is exposed to this parallelism, with added complexities related to programming such systems
Taxonomy for Parallel Architectures
Von Neumann, Flynn's, and Bell's taxonomies and other common terminology
Von Neumann Architecture Evolution
[Figure: the evolution tree of the von Neumann architecture. Scalar sequential execution leads to lookahead with I/E (instruction/execution) overlap, then to functional parallelism through multiple functional units and pipelining. Pipelining leads to implicit and explicit vector machines (memory-to-memory and register-to-register). From there the tree splits into SIMD (associative processors, processor arrays) and MIMD (multicomputers, multiprocessors), converging in today's massively parallel processors.]
Pipelining and Parallelism
Instruction prefetch overlaps the fetch of one instruction with the execution of the previous one
Functional parallelism is supported by multiple functional units and by pipelining
Pipelining appears in instruction execution, arithmetic computations, and memory access operations
Pipelining is attractive for performing identical operations repeatedly over vector data strings
Flynn’s Classification
Michael Flynn (1972) classified architectures by their instruction and data streams
Single Instruction stream over a Single Data stream (SISD): conventional sequential machines
Single Instruction stream over Multiple Data streams (SIMD): vector computers equipped with scalar and vector hardware
Flynn’s Classification (2)
Multiple Instruction streams over a Single Data stream (MISD): the same data flows through a linear array of processors, also known as systolic arrays, for pipelined execution of specific algorithms
Multiple Instruction streams over Multiple Data streams (MIMD): the suitable model for general-purpose parallel architectures
Bell’s Taxonomy for MIMD
Multicomputers:
Multiple address spaces
The system consists of multiple computers, called nodes
Nodes are interconnected by a message-passing network
Each node has its own processor, memory, NIC, and I/O devices
Multiprocessors:
Shared address space
Further classified by how memory is accessed:
Uniform Memory Access (UMA)
Non-Uniform Memory Access (NUMA)
Cache-Only Memory Access (COMA)
Cache-Coherent Non-Uniform Memory Access (cc-NUMA)
Multicomputer Generations
First generation (1983-87): processor boards connected in a hypercube architecture; software-controlled message switching; examples: Caltech Cosmic Cube, Intel iPSC/1
Second generation (1988-92): mesh-connected architecture; hardware message routing; a software environment for medium-grain distributed computing; example: Intel Paragon
Third generation (1993-97): fine-grain multicomputers; examples: MIT J-Machine and Caltech Mosaic
Multiprocessor Examples
Distributed memory (scalable): dynamic binding of addresses to processors (KSR); static binding with caching (Alliant, DASH); static program binding (BBN, Cedar)
Central memory (not scalable): cross-point or multi-stage interconnects (Cray, Fujitsu, Hitachi, IBM, NEC, Tera); simple multi-bus (DEC, Encore, NCR, Sequent, SGI, Sun)
Supercomputers
Supercomputers use vector processing and data parallelism, and are classified into two categories: vector supercomputers and SIMD supercomputers
SIMD machines offer massive data parallelism: an instruction is broadcast to a large number of PEs
Examples: Illiac IV (64 PEs), MasPar MP-1 (16,384 PEs), and CM-2 (65,536 PEs)
Vector supercomputers
Machines with powerful vector processors: if a decoded instruction is a vector operation, it is sent to the vector unit
Register-to-register architecture: Fujitsu VP2000 series
Memory-to-memory architecture: Cyber 205
Pipelined vector supercomputer: Cray Y-MP
Dataflow Architectures
Represent computation as a graph of essential dependences
A logical processor at each node is activated by the availability of operands
Messages (tokens) carrying the tag of the next instruction are sent on to the next processor
The tag is compared with others in the matching store; a match fires execution
Dataflow Architectures (2)
[Figure: a dataflow graph for a = (b + 1) × (b − c), d = c × e, f = a × d, alongside the processor organization: tokens arrive from the network into a token queue, pass through a waiting/matching store, trigger instruction fetch from the program store, execute, and form new tokens that return to the network.]
Systolic Architectures
Replace the single processor with an array of regular processing elements
Orchestrate data flow for high throughput with less memory access
[Figure: a conventional organization (memory feeding a single PE) versus a systolic organization (memory feeding a chain of PEs).]
Systolic Architectures (2)
Different from pipelining: nonlinear array structure, multidirectional data flow, and each PE may have (small) local instruction and data memory
Different from SIMD: each PE may do something different
Initial motivation: VLSI enables inexpensive special-purpose chips
Represent algorithms directly by chips connected in a regular pattern
Systolic Arrays (Cont'd)
Example: a systolic array for 1-D convolution
y(i) = w1·x(i) + w2·x(i+1) + w3·x(i+2) + w4·x(i+3)
Practical realizations (e.g., iWARP) use quite general processors to enable a variety of algorithms on the same hardware
But dedicated interconnect channels: data is transferred directly from register to register across a channel
Specialized, and with the same problems as SIMD; general-purpose systems work well for the same algorithms (locality etc.)
[Figure: the weights w1..w4 sit in the cells while the x(i) stream flows through; each cell computes xout = x, x = xin, yout = yin + w·xin.]
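For reference, the same 4-tap convolution written as plain sequential C. This is a sketch for illustration (the names w, x, y, and n are ours, not from the slides); it computes exactly what the systolic array pipelines:

/* 1-D convolution with 4 weights:
   y(i) = w1*x(i) + w2*x(i+1) + w3*x(i+2) + w4*x(i+3) */
void convolve(const double w[4], const double x[], double y[], int n) {
    for (int i = 0; i + 3 < n; i++) {
        y[i] = w[0] * x[i] + w[1] * x[i + 1]
             + w[2] * x[i + 2] + w[3] * x[i + 3];
    }
}

The systolic version distributes the four multiply-accumulate steps of each y(i) across four cells, so a new y value emerges every cycle without re-reading x from memory.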
Cluster of Computers
Started as a poor man's parallel system: inexpensive PCs, an inexpensive switched Ethernet, and a run-time system to support message passing
Low performance for HPCC applications: high network I/O latency and low bandwidth
Suitable for high-throughput applications: data center applications, virtualized resources, independent threads or processes
Summary of Taxonomy
Multiple taxonomies: based on functional parallelism (the von Neumann and Flynn taxonomies) or on programming paradigm (Bell's taxonomy)
Parallel architecture types: multicomputers (distributed address space), multiprocessors (shared address space), multi-core, multi-threaded, and others (vector, dataflow, systolic, and cluster)
Why Multi-Core Architectures
Based on technology and architecture trends
Multi-Core Architectures
Traditional architectures: sequential (Moore's law as increasing clock frequency) and parallel (diminishing returns from ILP)
Transition to multi-core: architecture similar to SMPs; programming is typically SAS
Challenges of the transition: performance requires efficient parallelization; selecting a suitable programming paradigm; performance tuning
Traditional Parallel Architectures
Definition and development tracks
Defining a Parallel Architecture
A sequential architecture is characterized by a single processor and a single control-flow path
A parallel architecture has multiple processors with an interconnection network, multiple control-flow paths, and communication and synchronization
A parallel computer can be defined as a collection of processing elements that communicate and cooperate to solve large problems fast
Broad Issues in Parallel Architectures
Resource allocation: how large a collection? how powerful are the elements? how much memory?
Data access, communication, and synchronization: how do the elements cooperate and communicate? how are data transmitted between processors? what are the abstractions and primitives for cooperation?
Performance and scalability: how does it all translate into performance? how does it scale?
General Context: Multiprocessors
A multiprocessor is any computer with several processors
SIMD (single instruction, multiple data): e.g., modern graphics cards
MIMD (multiple instructions, multiple data): e.g., the Lemieux cluster at the Pittsburgh Supercomputing Center
Architecture Development Tracks
Multiple-processor tracks: the shared-memory track and the message-passing track
Multivector and SIMD tracks
Multithreaded and dataflow tracks
Multi-core track
Shared-Memory Track
Starts with the C.mmp system developed at CMU in 1972: a UMA multiprocessor with 16 PDP-11/40 processors connected to 16 shared memory modules via a crossbar switch, and a pioneering multiprocessor OS (Hydra) development effort
Later systems: Illinois Cedar (1987), IBM RP3 (1985), BBN Butterfly (1989), NYU Ultracomputer (1983), Stanford DASH (1992), Fujitsu VPP500 (1992), KSR1 (1990)
Message-Passing Track
The Cosmic Cube (1981) pioneered message-passing computers
Medium-grain multicomputers: Intel iPSCs (1983), Intel Paragon (1992)
Fine-grain multicomputers: nCUBE-2 (1990), Mosaic (1992), MIT J-Machine (1992)
Multivector Track
CDC 7600 (1970), Cray 1 (1978), Cray Y-MP (1989), and the Fujitsu, NEC, and Hitachi models
CDC Cyber 205 (1982), ETA 10 (1989)
SIMD Track
Illiac IV (1968), Goodyear MPP (1980), DAP 610 (AMT, Inc., 1987), CM-2 (TMC, 1990)
BSP (1982), IBM GF/11 (1985), MasPar MP-1 (1990)
Multi-Threaded Track
Each processor can execute multiple threads of control at the same time
Multi-threading helps hide long latencies when building large-scale multiprocessors
Multithreaded architecture was pioneered by Burton Smith in the HEP system (1978); later systems include MIT Alewife (1989) and Tera (1990)
Simultaneous Multi-Threading (SMT): two hardware threads per core, available in Intel processors
Dataflow Track
Instead of the control flow of the von Neumann architecture, dataflow architectures are based on a dataflow mechanism
The dataflow concept was pioneered by Jack Dennis (1974) with a static architecture
The concept later inspired dynamic dataflow: MIT tagged-token (1980) and Manchester (1982)
Multi-Core Track
Intel dual-core processors, AMD Opteron, IBM Cell processor, Sun processors, Cavium processors, Freescale processors
A Multi-Core Processor is a Special Kind of Multiprocessor
All processors are on the same chip
Multi-core processors are MIMD: different cores execute different threads (multiple instructions), operating on different parts of memory (multiple data)
Multi-core is a shared-memory multiprocessor: all cores share the same address space
Summary: Why Parallel Architecture?
Increasingly attractive: economics, technology, architecture, and application demand; multi-core processor based systems are readily available
Increasingly central and mainstream
Parallelism is exploited at many levels: instruction-level parallelism, multiple threads of software, multiple cores, multiple processors
Focus of this class: multiple cores and/or processors, and programming paradigms
Example: Intel Pentium Pro Quad
All coherence and multiprocessing glue is in the processor module
Highly integrated, targeted at high volume
Low latency and bandwidth
[Figure: four P-Pro modules (each a CPU with bus interface, MIU, 256-KB L2 cache, and interrupt controller) on the P-Pro bus (64-bit data, 36-bit address, 66 MHz), with a memory controller to 1-, 2-, or 4-way interleaved DRAM and two PCI bridges to PCI buses with I/O cards.]
Example: SUN Enterprise
16 cards of either type: processors + memory, or I/O
All memory is accessed over the bus, so the machine is symmetric
Higher bandwidth, higher latency bus
[Figure: CPU/memory cards (two processors, each with L1 and L2 caches, plus a memory controller) and I/O cards (SBus slots, 2 Fibre Channel, 100bT, SCSI), all connected through bus interface/switch chips to the Gigaplane bus (256-bit data, 41-bit address, 83 MHz).]
Example: Cray T3E
Scales up to 1024 processors with 480 MB/s links
The memory controller generates communication requests for nonlocal references
No hardware mechanism for coherence (SGI Origin etc. provide this)
[Figure: each node pairs a processor and cache with a memory controller/NI and local memory, connected through a switch into a 3D (X, Y, Z) interconnect with external I/O.]
Example: IBM SP-2
Made out of essentially complete RS/6000 workstations
Network interface integrated on the I/O bus (bandwidth limited by the I/O bus)
[Figure: an IBM SP-2 node: a POWER2 CPU with L2 cache on the memory bus, a memory controller with 4-way interleaved DRAM, and a MicroChannel I/O bus carrying an i860-based NIC with DMA; nodes are joined by a general interconnection network formed from 8-port switches.]
Example: Intel Paragon
[Figure: an Intel Paragon node: two i860 processors with L1 caches, a NI with DMA driver, and a memory controller with 4-way interleaved DRAM on a 64-bit, 50 MHz memory bus; nodes attach to every switch of an 8-bit, 175 MHz, bidirectional 2D grid network. Pictured: Sandia's Intel Paragon XP/S-based supercomputer.]
Transition to Multi-Core
Selected multi-core processors
Single to Multiple Core Transition
Multi-Core Processor Architectures
Source: Michael Perrone of IBM
IBM POWER 4
Shipped in Dec. 2001
Technology: 180 nm lithography
Dual processor cores, 8-way superscalar
Out-of-order execution; 2 load/store units; 2 fixed-point and 2 floating-point units; >200 instructions in flight
Hardware instruction and data prefetch
IBM POWER 5
Shipped in Aug. 2004
Technology: 130 nm lithography
Dual processor cores, 8-way superscalar
SMT core: up to 2 virtual cores per physical core
A natural extension of the POWER4 design
IBM Cell Processor
An IBM, Sony, and Toshiba alliance formed in 2000
Based on IBM's 64-bit Power architecture
9 cores, 10 threads:
1 dual-thread Power Processor Element (PPE) for control
8 Synergistic Processor Elements (SPEs) for processing
Up to 25 GB/s memory bandwidth; up to 75 GB/s I/O bandwidth
>300 GB/s Element Interconnect Bus bandwidth
source: Dr. Michael Perrone, IBM
Future Multi-Core Platforms
A Many-Core Example: Intel's 80-Core Test Chip
Programming the Multi-Core
Programming, OS interaction, applications, synchronization, and scheduling
Parallel Computing is Ubiquitous
Over the next few years, all computing devices will be parallel computers: servers, laptops, cell phones
What about software? Herb Sutter of Microsoft wrote in Dr. Dobb's Journal:
"The free lunch is over: a fundamental turn toward concurrency in software"
Software will no longer get faster from one generation to the next as hardware improves … unless it is parallel software
Interaction with OS
The OS perceives each core as a separate processor, the same as for an SMP; the Linux SMP kernel works on multi-core processors
The OS scheduler maps threads/processes to different cores; migration is possible, and processes can also be pinned to specific cores
Most major OSes support multi-core today: Windows, Linux, Mac OS X, …
What Applications Benefit from Multi-Core?
Database servers
Web servers (web commerce)
Compilers
Multimedia applications
Scientific applications, CAD/CAM
In general, applications with thread-level parallelism (as opposed to instruction-level parallelism), where each thread can run on its own core
More examples
Editing a photo while recording a TV show through a digital video recorder
Downloading software while running an anti-virus program
"Anything that can be threaded today will map efficiently to multi-core"
BUT: some applications are difficult to parallelize
Programming for Multi-Core
Programmers must use threads or processes to spread the workload across multiple cores: threads for business/desktop apps; multiple processes for complex systems or SPMD-based parallel applications
The OS maps threads/processes to cores, transparently to the programmer
Programmers must also write parallel algorithms; this is true for scientific/engineering applications, where the programmer needs to define the mapping to cores
Thread Safety is very important
Pre-emptive context switching: a context switch can happen AT ANY TIME
Multi-core provides true concurrency, not just uniprocessor time-slicing
Concurrency bugs are exposed much faster with multi-core
However: Need to use synchronization even if only time-slicing on a uniprocessor

int counter = 0;

void thread1() {
    int temp1 = counter;
    counter = temp1 + 1;
}

void thread2() {
    int temp2 = counter;
    counter = temp2 + 1;
}
Need to use synchronization even if only time-slicing on a uniprocessor

One interleaving (each thread's read-modify-write runs to completion) gives counter = 2:

temp1 = counter;
counter = temp1 + 1;
temp2 = counter;
counter = temp2 + 1;

Another interleaving (both reads happen before either write, a lost update) gives counter = 1:

temp1 = counter;
temp2 = counter;
counter = temp1 + 1;
counter = temp2 + 1;
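The standard fix is to make the read-modify-write atomic with a lock. A minimal sketch using POSIX threads (Pthreads come up again later in the course; the helper name increment is ours, not from the slides):

#include <pthread.h>
#include <stdio.h>

int counter = 0;
pthread_mutex_t counter_lock = PTHREAD_MUTEX_INITIALIZER;

/* both threads run this; the mutex serializes the read-modify-write */
void *increment(void *arg) {
    pthread_mutex_lock(&counter_lock);
    int temp = counter;
    counter = temp + 1;
    pthread_mutex_unlock(&counter_lock);
    return NULL;
}

int main(void) {
    pthread_t t1, t2;
    pthread_create(&t1, NULL, increment, NULL);
    pthread_create(&t2, NULL, increment, NULL);
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    printf("counter = %d\n", counter);  /* now always prints 2 */
    return 0;
}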
Assigning Threads to the Cores
Each thread/process has an affinity mask
Affinity mask specifies what cores the thread is allowed to run on
Different threads can have different masks
Affinities are inherited across fork()
Affinity Masks are Bit Vectors
Example: a 4-way multi-core, without SMT

    mask:  1 0 1 1
    core:  3 2 1 0

The process/thread is allowed to run on cores 0, 2, and 3, but not on core 1
Affinity Masks when Multi-Core and SMT are Combined
Separate bits for each simultaneous thread
Example: a 4-way multi-core with 2 threads per core

    mask:    1 1   0 0   1 0   1 1
    thread:  1 0   1 0   1 0   1 0
    core:     3     2     1     0

Core 2 can't run the process
Core 1 can use only one simultaneous thread
Default Affinities
The default affinity mask is all 1s: all threads can run on all processors
The OS scheduler then decides which threads run on which cores
The OS scheduler detects skewed workloads and migrates threads to less busy processors
Process Migration is Costly
Migration requires restarting the execution pipeline, and cached data is invalidated
The OS scheduler therefore tries to avoid migration as much as possible: it tends to keep a thread on the same core
This is called soft affinity
Hard Affinities
The programmer can prescribe their own affinities (hard affinities)
Rule of thumb: use the default scheduler unless there is a good reason not to
When to Set your own Affinities
When two (or more) threads share data structures in memory: map them to the same core so that they can share the cache
Real-time threads. Example: a thread running a robot controller must not be context-switched, or else the robot can go unstable; dedicate an entire core just to this thread
Source: Sensable.com
Kernel Scheduler API
#include <sched.h>
int sched_getaffinity(pid_t pid, unsigned int len, unsigned long * mask);
Retrieves the current affinity mask of process ‘pid’ and stores it into space pointed to by ‘mask’.
‘len’ is the system word size: sizeof(unsigned long)
Kernel Scheduler API
#include <sched.h>
int sched_setaffinity(pid_t pid, unsigned int len, unsigned long * mask);
Sets the current affinity mask of process ‘pid’ to *mask
‘len’ is the system word size: sizeof(unsigned long)
To query the affinity of a running process:
$ taskset -p 3935
pid 3935's current affinity mask: f
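Note that the prototypes above are the old raw forms; current glibc wraps the same calls with a cpu_set_t mask. A minimal sketch under that interface (it assumes a machine with at least 4 cores, and the mask mirrors the 1011 example from the earlier slide):

#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>

int main(void) {
    cpu_set_t mask;
    CPU_ZERO(&mask);        /* start with an empty mask */
    CPU_SET(0, &mask);      /* allow core 0 */
    CPU_SET(2, &mask);      /* allow core 2 */
    CPU_SET(3, &mask);      /* allow core 3; core 1 stays disallowed */

    /* pid 0 means "the calling process" */
    if (sched_setaffinity(0, sizeof(mask), &mask) != 0) {
        perror("sched_setaffinity");
        return 1;
    }

    /* read the mask back to confirm */
    CPU_ZERO(&mask);
    if (sched_getaffinity(0, sizeof(mask), &mask) == 0)
        printf("allowed on %d cores\n", CPU_COUNT(&mask));
    return 0;
}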
Windows Task Manager
[Figure: Task Manager's CPU usage history panes, one per core (core 1, core 2).]
Summary
Reasons for multi-core based systems: processor technology trends, architecture trends and diminishing returns, application trends
Background for using multi-core systems: traditional parallel architectures and experience with parallelization; the transition to multi-core is happening
The challenge: programming the multi-core
Programming Paradigms
Traditional as well as new additions
Programming Paradigms in the Context of Multi-Core Systems
The architecture is similar to SMPs: SAS programming works, but message-passing is also possible
Traditional paradigms and their use for parallelization: explicit message passing (MP), shared address space (SAS), and multi-threading as a parallel computing model
New realizations of traditional paradigms: multi-threading on the Sony PlayStation 3, MapReduce, CUDA
Traditional Programming Models
David E. Culler and Jaswinder Pal Singh, Parallel Computer Architecture: A Hardware/Software Approach, Morgan Kaufmann, 1998
Parallel Programming Models
Message passing: each process uses its private address space; access to non-local address spaces requires explicit message passing; the programmer has full control of low-level data movement
Data parallel programming: an extension of message passing in which parallelism is based on data decomposition and assignment
Parallel Programming Models (2)
Shared memory programming: a single address space (SAS); hardware manages low-level data movement; the primary issue is coherence
Multi-threading: an extension of SAS, commonly used on SMP as well as uniprocessor (UP) systems
Libraries and building blocks: often specific to architectures; help reduce the development effort
Message Passing
Message passing operations: point-to-point unicast and collective communication; efficient collective communication can greatly enhance performance
Characteristics of message passing: the programmer controls the interactions, can customize them to leverage hardware features, and can tune them to enhance performance
Distributed Memory Issues
Load balancing: the programmer has explicit control over load balancing
Data locality: high performance depends on keeping each processor busy, so the data it requires should be kept local (decomposition and assignment)
Data distribution: affects the computation-to-communication ratio
Low-level data movement and process control: data movement results from the algorithmic structure
Data Movement and Process Control
Types of low-level data movement: replication, reduction, scatter/gather, parallel prefix and segmented scan, and permutation
Process control involves barrier synchronization and global conditionals
Data movement results in communication, and high performance depends on doing it efficiently; overhead results from processes waiting for data
Message Passing Libraries
Motivation: portability (an application need not be reprogrammed on a new platform), heterogeneity, modularity, and performance
Implementations: PVM, Express, P4, PICL, MPICH, etc.
MPI has become a de facto API standard
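To give a flavor of the MPI API itself, a minimal point-to-point sketch in C (the rank roles, tag, and payload value are illustrative). Rank 0 sends one integer to rank 1:

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    int rank, value;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0) {
        value = 42;   /* illustrative payload */
        MPI_Send(&value, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
    } else if (rank == 1) {
        MPI_Recv(&value, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        printf("rank 1 received %d\n", value);
    }

    MPI_Finalize();
    return 0;
}

Built with mpicc and launched with mpirun -np 2, every rank runs the same program; the explicit send/receive pair is the programmer-controlled data movement discussed above.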
Data Parallel Programming
This is an extension of message passing based programming
Instead of explicit message passing, parallelism is expressed implicitly in the language: it comes from distributing data structures and control, implemented through hints for the compiler
Data Parallel Programming (2)
Motivations: explicit message passing programs are difficult to write and debug, and data distribution results in computation on local data
Useful for massively parallel distributed-memory machines as well as clusters of workstations
Developed as extensions to existing languages: Fortran, C/C++, Ada, etc.
High Performance Fortran (HPF)
HPF defines language extensions to Fortran to support data parallel programming; the compiler generates a message-passing program
Features include data distribution and alignment, whose purpose is to reduce inter-processor communication: the DISTRIBUTE, ALIGN, and PROCESSORS directives
HPF (2)
Features (cont'd):
Parallel statements provide the ability to express parallelism: FORALL, INDEPENDENT, and PURE directives
Extended intrinsic functions and a standard library
EXTRINSIC procedures: a mechanism to use efficient machine-specific primitives
Limitations: useful only for Fortran code, and not useful outside the HPCC domain
Shared Memory Programming
A single address space view of the entire memory: logical for the programmer and easier to use
Parallelism comes from loops: this requires loop-level dependence analysis, and loop iterations that are independent with respect to the index can be scheduled on different processors
Such fine-grained parallelism requires hardware support for memory consistency to avoid overhead
Shared Memory Programming (2)
Performance tuning is non-trivial: data locality is critical; coarsening the grain by parallelizing the outer-most loop requires inter-procedural dependence analysis, and very few parallelizing compilers can analyze and parallelize even reasonably sized application codes
Simplifies parallel programming: sequential code is directly usable with compiler hints for parallel loops
OpenMP standardizes the compiler directives for shared memory parallel programming
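As a taste of those directives, a minimal OpenMP sketch in C (the array size and names are illustrative). The single pragma is the compiler hint; the loop iterations are independent with respect to i, so they can be split across cores:

#include <omp.h>
#include <stdio.h>

#define N 1000000

int main(void) {
    static double a[N], b[N];

    #pragma omp parallel for   /* distribute independent iterations over cores */
    for (int i = 0; i < N; i++)
        a[i] = 2.0 * b[i] + 1.0;

    printf("ran with up to %d threads\n", omp_get_max_threads());
    return 0;
}

Compiled with, e.g., gcc -fopenmp; without the flag the pragma is ignored and the same code runs sequentially, which is exactly the "sequential code plus hints" property noted above.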
Libraries and Building Blocks
Pre-implemented parallel code that can serve as kernels of larger well-known parallel applications
Several numerical methods are reusable across applications; examples: linear equation solvers, partial differential equations, matrix-based operations, etc.
Libraries and Building Blocks (2)
Low-level operations, such as data movement, can be used to build efficient matrix operations: matrix transpose, matrix multiply, LU decomposition
These building blocks can be implemented in distributed-memory numerical libraries; examples: ScaLAPACK (dense and banded matrices) and massively parallel LINPACK
Summary: Traditional Paradigms
Message-passing based programming: suitable for distributed-memory multicomputers; multiple message-passing libraries exist, and MPI is a de facto standard
Shared address space programming: suitable for SMPs; based on low-granularity loop-level parallelism; implemented through compiler directives, with OpenMP as the de facto standard
Threading as a parallel programming model
Types of applications and threading environments
Parallel Programming with Threads
Two types of applications: compute-intensive (HPCC systems) and throughput-intensive (distributed systems)
Threads help in different ways: dividing data reduces the work per thread, and distributing work enables concurrency (see the sketch below)
Programming languages that support threads: C/C++ or C#, Java, and OpenMP directives for C/C++ code
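To make the "dividing data" point concrete, a minimal Pthreads sketch in C that splits an array sum across a fixed number of threads (the thread count, array size, and helper names are illustrative, not from the slides):

#include <pthread.h>
#include <stdio.h>

#define NTHREADS 4
#define N 1000000

static double data[N];
static double partial[NTHREADS];

/* each thread sums its own contiguous slice of the array */
static void *sum_slice(void *arg) {
    long id = (long)arg;
    long lo = id * (N / NTHREADS);
    long hi = (id == NTHREADS - 1) ? N : lo + (N / NTHREADS);
    double s = 0.0;
    for (long i = lo; i < hi; i++)
        s += data[i];
    partial[id] = s;           /* no sharing: each thread has its own slot */
    return NULL;
}

int main(void) {
    pthread_t tid[NTHREADS];
    for (long i = 0; i < N; i++)
        data[i] = 1.0;
    for (long t = 0; t < NTHREADS; t++)
        pthread_create(&tid[t], NULL, sum_slice, (void *)t);
    double total = 0.0;
    for (long t = 0; t < NTHREADS; t++) {
        pthread_join(tid[t], NULL);
        total += partial[t];   /* combined after joining: no lock needed */
    }
    printf("total = %.0f\n", total);   /* prints 1000000 */
    return 0;
}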
Virtual Environments
VMs and platforms
Runtime virtualization
System virtualization
Systems with Virtual Machine Monitors
Summary: Threading for Parallelization
Threads can exploit parallelism on multi-core: scientific and engineering applications, and high-throughput applications
Shared memory programming is realized through Pthreads and OpenMP
Challenges of multi-threading: synchronization, and memory consistency (the dark side of shared memory)
Octeon Processor Architecture
Target system processor and system architecture, and programming environment
Cavium Octeon Processor
MIPS64 architecture
Up to 16 cores
Shared L2 cache
Security accelerator
Used in network security gateways
source: http://www.caviumnetworks.com/OCTEON_CN38XX_CN36XX.html
Key Takeaways for Today’s Session
Multi-core systems are here to stay: processor technology trends and application performance requirements
Programming is a challenge: no "free lunch"; programmers need to parallelize, and extra effort is needed to tune performance
Programming paradigms: message passing, shared address space based programming, and multi-threading (the focus of this course)