Performance and Productivity of Emerging Architectures



Page 1: Performance and Productivity of Emerging Architectures

Presented by

Performance and Productivity of Emerging Architectures

Jeremy Meredith

Sadaf Alam

Jeffrey Vetter

Future Technologies

Page 2: Performance and Productivity of Emerging Architectures

2 Meredith_Architectures_SC07

Overview

Motivation

Technologies on the horizon several orders of magnitude more powerful and power-efficient than microprocessors

Grand challenge applications requiring several orders of magnitude more performance than now available

Goals

Gauge performance of emerging devices using high-level languages

Estimate productivity with respect to contemporary microprocessing devices

[Figure: performance (GF, 0 to 300) versus year (1998 to 2006) for CPU and GPU]

Page 3: Performance and Productivity of Emerging Architectures


Intel Xeon multicore CPU

Dual-core (Woodcrest)

Quad-core (Clovertown)

Programming models: MPI, OpenMP, Pthreads

Resource contention: memory bandwidth, I/O bandwidth

[Figure: platform block diagram. Dempsey/Woodcrest/Clovertown processors connect over the DIB (1066/1333 MHz, 8.5/10.5 GB/s, up to 21 GB/s) to the Blackford MCH, which drives the FBD memory channels and a configurable set of PCIe ports (x8). An ESI link joins the MCH to the ESB-2 I/O bridge. Attached I/O: PCIe slot, Sunrise Lake PCI-X, 10 GbE, I/OAT, iSCSI, SAS/SATA-2.]

Page 4: Performance and Productivity of Emerging Architectures


IBM Cell Broadband Engine

Cell heterogeneous multicore processor

8 synergistic processing element (SPE) cores

200 GFLOPS at 3.2 GHz

200-GB/s memory bandwidth

Programming models: multisource compilation; threadlike launch semantics for SPEs
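The 200 GFLOPS figure is consistent with a simple peak-rate calculation. The sketch below is an illustration and is not taken from the slides: it assumes each SPE can issue one 4-wide single-precision fused multiply-add per cycle, and the function name `peak_gflops` is made up for this example.

```python
def peak_gflops(cores, simd_lanes, flops_per_lane_per_cycle, clock_ghz):
    """Theoretical peak in GFLOPS, assuming every lane of every core is busy every cycle."""
    return cores * simd_lanes * flops_per_lane_per_cycle * clock_ghz


# Assumption (not stated on the slide): one 4-wide single-precision fused
# multiply-add per SPE per cycle, counted as 2 flops per lane.
cell_peak = peak_gflops(cores=8, simd_lanes=4, flops_per_lane_per_cycle=2, clock_ghz=3.2)
print(f"Cell SPE peak: {cell_peak:.1f} GFLOPS")
```

Under that assumption the eight SPEs alone account for roughly the quoted peak; sustained rates on real kernels would be lower.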

[Figure: Cell block diagram. Eight SPEs (each an SPU with SXU, LS, and MFC) and the PPE (PPU with L1, PXU, and L2) attach to the EIB (up to 96 B/cycle), alongside the BIC, dual XDR, and FlexIO interfaces.]

Source: M. Gschwind et al., Hot Chips-17, August 2005

Page 5: Performance and Productivity of Emerging Architectures


Graphics processing unit (GPU)

NVIDIA 7900GTX: 24 SIMD pixel pipelines; 200 GFLOPS at 650 MHz; 50-GB/s memory bandwidth

Programming models: multiple-source compilation; gather semantics define parallelism; host CPU drives program setup and data transfer
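As a rough illustration of what gather semantics imply, the sketch below computes each output element by reading whatever inputs it needs and writing only its own result, so outputs can be evaluated independently and in any order. It is plain serial Python, not GPU code; the names `gather_kernel` and `run_gather` and the sample data are hypothetical.

```python
def gather_kernel(out_index, positions, neighbors):
    """Compute one output element by reading (gathering) the inputs it needs.

    The kernel writes nothing except its own return value, so every output
    index can be evaluated independently of the others.
    """
    xi = positions[out_index]
    return sum(positions[j] - xi for j in neighbors[out_index])


def run_gather(positions, neighbors):
    # A parallel device would spread these independent evaluations across its
    # pipelines; this stand-in loop simply runs them one after another.
    return [gather_kernel(i, positions, neighbors) for i in range(len(positions))]


positions = [0.0, 1.0, 3.0, 6.0]          # hypothetical 1-D coordinates
neighbors = [[1], [0, 2], [1, 3], [2]]    # hypothetical neighbor lists
result = run_gather(positions, neighbors)
print(result)
```

A scatter formulation, in which one input updates several outputs, would need coordinated writes and does not map onto this model as directly.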

[Figure: GPU block diagram. Triangle setup, Z-cull, shader instruction dispatch, vertex shader units, pixel pipelines, L2 Tex, fragment crossbar, ROP engine, and four memory partitions.]

Page 6: Performance and Productivity of Emerging Architectures


Extreme multithreading with Cray MTA-II

Cray XMT system: MTA-I and MTA-II processor architecture; XT3/4 scalable infrastructure; AMD Torrenza technology

Fine-grain multithreading (128 concurrent threads)

Uniform memory hierarchy in MTA-I and MTA-II
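One way to see why many hardware streams help is a toy issue model. The sketch below is an illustration under stated assumptions, not a description of the actual MTA pipeline: each stream issues one instruction and then waits a fixed number of cycles before it can issue again, and the processor issues from any ready stream each cycle. The function name and the 20-cycle gap are hypothetical.

```python
def pipeline_utilization(ready_streams, issue_gap_cycles):
    """Fraction of cycles in which some stream is able to issue an instruction.

    Toy model: after a stream issues, it must wait issue_gap_cycles before
    issuing again; at most one instruction issues per cycle.
    """
    next_ready = [0] * ready_streams      # cycle at which each stream may issue again
    issued = 0
    total_cycles = 10 * issue_gap_cycles
    for cycle in range(total_cycles):
        for s in range(ready_streams):
            if next_ready[s] <= cycle:
                next_ready[s] = cycle + issue_gap_cycles
                issued += 1
                break                      # only one issue slot per cycle
    return issued / total_cycles


# Hypothetical gap of 20 cycles between instructions from the same stream.
for streams in (1, 10, 20, 128):
    print(streams, pipeline_utilization(streams, issue_gap_cycles=20))
```

In this model the issue slot stays full once the number of ready streams reaches the gap, which is the intuition behind hiding latency with threads rather than with caches.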

[Figure: MTA execution model. Programs running in parallel (serial code, subproblem A, subproblem B, with loop iterations i = 0 to n) are divided into concurrent threads of computation, which occupy the 128 hardware streams (some unused). The streams feed the instruction ready pool, which feeds the pipeline of executing instructions.]

Page 7: Performance and Productivity of Emerging Architectures


Application kernels

Molecular dynamics (MD) calculations

Applications in biology, chemistry, and materials

Newton’s second law of motion

Bonded calculations

Nonbonded electrostatic and van der Waals forces

Workload characteristics: floating-point intensive; irregular memory access patterns
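A minimal sketch of the nonbonded part of such a kernel follows. It is not the code benchmarked in the talk: it assumes a Lennard-Jones potential for the van der Waals term, omits electrostatics and bonded terms, uses an all-pairs loop with a cutoff, and advances the atoms with a semi-implicit Euler step. All names and default parameters are hypothetical.

```python
def lj_forces(positions, epsilon=1.0, sigma=1.0, cutoff=2.5):
    """Nonbonded Lennard-Jones forces on every atom, using an O(N^2) pair loop."""
    n = len(positions)
    forces = [[0.0, 0.0, 0.0] for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            d = [positions[i][k] - positions[j][k] for k in range(3)]
            r2 = sum(c * c for c in d)
            if r2 > cutoff * cutoff:
                continue                   # pair is outside the cutoff radius
            s6 = (sigma * sigma / r2) ** 3
            # -dU/dr divided by r, for U = 4*eps*((sigma/r)^12 - (sigma/r)^6)
            scale = 24.0 * epsilon * (2.0 * s6 * s6 - s6) / r2
            for k in range(3):
                forces[i][k] += scale * d[k]
                forces[j][k] -= scale * d[k]   # equal and opposite force on j
    return forces


def step(positions, velocities, mass, dt):
    """One semi-implicit Euler step of Newton's second law, a = F / m."""
    forces = lj_forces(positions)
    for i, f in enumerate(forces):
        for k in range(3):
            velocities[i][k] += dt * f[k] / mass
            positions[i][k] += dt * velocities[i][k]


atoms = [[0.0, 0.0, 0.0], [1.0, 0.0, 0.0]]
print(lj_forces(atoms))
```

Codes that replace the all-pairs loop with per-atom neighbor lists read positions through an index array, which is one source of the irregular memory access noted above.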

Covariance matrix calculation

Applications in hyperspectral imaging and machine learning

Determines covariance among samples in data

Workload characteristics: memory-bandwidth and floating-point intensive; regular memory access patterns
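For reference, a covariance matrix calculation can be sketched as below. This is an illustration only, not the benchmarked code: it assumes rows are observations and columns are variables (for example, spectral bands), and it uses the unbiased n - 1 normalization. The function name and sample data are hypothetical.

```python
def covariance_matrix(samples):
    """Sample covariance between the columns of `samples` (rows are observations)."""
    n = len(samples)
    dims = len(samples[0])
    means = [sum(row[d] for row in samples) / n for d in range(dims)]
    cov = [[0.0] * dims for _ in range(dims)]
    for row in samples:                    # one sequential pass over the data
        centered = [row[d] - means[d] for d in range(dims)]
        for a in range(dims):
            for b in range(dims):
                cov[a][b] += centered[a] * centered[b]
    return [[c / (n - 1) for c in r] for r in cov]


data = [[1.0, 2.0], [2.0, 4.0], [3.0, 6.0]]   # hypothetical observations
cov = covariance_matrix(data)
print(cov)
```

Every row is read in order and contributes a fixed amount of arithmetic, which matches the regular access pattern and the combined bandwidth and floating-point demands listed above.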

Page 8: Performance and Productivity of Emerging Architectures


Performance

OpenMP improves performance moderately on commodity CPUs

High parallelism of Cell and GPU can improve performance, but more so when memory access is regular

[Figure: bar chart of speedup (0 to 8) for the molecular dynamics and covariance matrix kernels. Series: reference (Woodcrest, single thread); OpenMP (Woodcrest, 2 cores); OpenMP (Clovertown, 4 cores); Cell (8 SPEs); GPU (NVIDIA 7900GTX); MTA-2 (1 processor, 128 streams).]

Page 9: Performance and Productivity of Emerging Architectures


Productivity

Although OpenMP yields only small performance increases, it is easy enough to use that productivity can still improve

Conversely, the speedup on Cell and GPU is high enough that productivity improves even when the porting effort is substantial

[Figure: bar chart of productivity improvement (0.0 to 3.0) for the molecular dynamics and covariance matrix kernels. Series: reference (Woodcrest, single thread); OpenMP (Woodcrest, 2 cores); OpenMP (Clovertown, 4 cores); Cell (8 SPEs); GPU (NVIDIA 7900GTX); MTA-2 (1 processor, 128 streams).]
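The slides do not define how productivity improvement is computed. One plausible reading, assumed here purely for illustration, is speedup divided by development effort relative to the reference implementation. The function names and all numbers below are hypothetical and are not measurements from the talk.

```python
def speedup(reference_seconds, ported_seconds):
    """Runtime of the reference divided by runtime of the ported version."""
    return reference_seconds / ported_seconds


def productivity_improvement(speedup_factor, relative_effort):
    """Assumed metric: speedup per unit of effort, with the reference at 1.0."""
    return speedup_factor / relative_effort


# Hypothetical numbers, for illustration only.
easy_port = productivity_improvement(speedup(10.0, 6.25), relative_effort=1.1)
hard_port = productivity_improvement(speedup(10.0, 1.25), relative_effort=4.0)
print(f"low effort, small speedup:  {easy_port:.2f}")
print(f"high effort, large speedup: {hard_port:.2f}")
```

Under this assumed metric both cases land above 1.0: a modest speedup pays off when the effort is small, and a large speedup pays off even when the effort is several times higher.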

Page 10: Performance and Productivity of Emerging Architectures


Scaling when accessing multiple devices

Productivity scaling

Increased parallelism from all architectures with little increased effort results in higher productivity

[Figure: bar chart of speedup (0 to 3) for the molecular dynamics and covariance matrix kernels. Series: reference (Woodcrest, single thread); OpenMP (Woodcrest, 2 cores); OpenMP (Clovertown, 4 cores); MTA-2 (32 processors).]

Page 11: Performance and Productivity of Emerging Architectures


Contacts

Jeremy Meredith, Future Technologies Group, (865) [email protected]

Sadaf Alam, Future Technologies Group, (865) [email protected]

Jeffrey Vetter, Future Technologies Group, (865) [email protected]