
Breaking Down the Memory Wall for Future Scalable Computing Platforms

Wen-mei Hwu
Sanders-AMD Endowed Chair Professor

with

John W. Sias, Erik M. Nystrom, Hong-seok Kim, Chien-wei Li, Hillery C. Hunter, Ronald D. Barnes, Shane Ryoo, Sain-Zee Ueng, James W. Player, Ian M. Steiner, Chris I. Rodrigues, Robert E. Kidd, Dan R. Burke, Nacho Navarro, Steven S. Lumetta

University of Illinois at Urbana-Champaign

SIGMICRO Online Seminar, January 18, 2005

Trends in hardware

- High variability
  - Increasing speed and power variability of transistors
  - Limited frequency increase
  - Reliability / verification challenges
- Large interconnect delay
  - Increasing interconnect delay and shrinking clock domains
  - Limited size of individual computing engines

[Figure: Interconnect RC Delay. Delay (ps), log scale from 1 to 10000, versus process generation (350, 250, 180, 130, 90, 65). Labels: Clock Period; RC delay of 1mm interconnect; Copper Interconnect.]

[Figure: Normalized Frequency (0.9 to 1.4) versus Normalized Leakage (Isb, 1 to 5). Annotations: 130nm, 30%, 5X. Data: Shekhar Borkar, Intel]


Trends in architecture

- Transistors are free… until connected or used
- Continued scaling of traditional processor core no longer economically viable
  - 2-3X effective area yields ~1.6X performance [PollackMICRO32]
  - Verification, power, transistor variability
- Only obvious scaling route: “Multi-Everything”
  - Multi-thread, multi-core, multi-memory, multi-?
  - CW: Distributed parallelism is easy to design
- But what about software?
  - If you build a better mousetrap…
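The "2-3X effective area yields ~1.6X performance" figure is consistent with the square-root relationship commonly called Pollack's rule. A quick check, assuming performance scales as the square root of core area (the function name is illustrative, not from the talk):

```python
import math

def pollack_speedup(area_ratio):
    """Estimated speedup, assuming performance ~ sqrt(area) (Pollack's rule of thumb)."""
    return math.sqrt(area_ratio)

if __name__ == "__main__":
    for area in (2.0, 2.5, 3.0):
        print(f"{area:.1f}x area -> ~{pollack_speedup(area):.2f}x performance")
```

Under this assumption the 2-3X range brackets the quoted ~1.6X, which is why spending the same transistors on several simpler cores looks attractive.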


A “multi-everything” processor of the future

[Diagram: block diagram with MAIN MEMORY, GPP, MTM, ACC, ACC, APP, and three LOCAL MEMORY blocks]

- Distributed, less complex components
  - Variability, power density, and verification: easier to address
- Who bears the SW mapping burden?
  - General purpose software changes prohibitively expensive (cf. SIMD, IA-64)
  - Advanced compiler features: “Deep Analysis”
  - New programming models / frameworks
  - Interactive compilers


General purpose processor component(s)

[Diagram: block diagram with MAIN MEMORY, GPP, MTM, ACC, ACC, APP, and three LOCAL MEMORY blocks]

- The system director
  - Performs traditionally-programmed tasks
  - Software migration starts here
- Likely multiple GPPs
  - Less complex processor cores


Computational efficiency through customization

[Diagram: block diagram with MAIN MEMORY, GPP, MTM, ACC, ACC, APP, and three LOCAL MEMORY blocks]

- Goal: Offload most processing to more specialized, more efficient units
- Application Processors (APP)
  - Specialized instruction sets, memory organizations and access facilities
- Programmable Accelerators (ACC)
  - Think ASIC with knobs
  - Highly-specialized pipelines
  - Approximate ASIC design points
- Higher performance/watt than general purpose for target applications


Memory efficiency through diversity

[Diagram: block diagram with MAIN MEMORY, GPP, MTM, ACC, ACC, APP, and three LOCAL MEMORY blocks]

- Traditional monolithic memory model: major power / performance sink
- Need partnership of general-purpose memory hierarchy and software-managed memories
- Local memories will reduce unnecessary memory traffic and power consumption
- Bulk data transfer scheduled by Memory Transfer Module
- Software will gradually adopt decentralized model for power and bandwidth


Tolerating communication & adding macropipelining

[Diagram: block diagram with MAIN MEMORY, GPP, MTM, ACC, ACC, APP, and three LOCAL MEMORY blocks]

- Bulk communication overhead often substantial for traditional accelerators
- Shared memory / snooping communication approach limits available bandwidth
- Compilation tools will have to seamlessly connect processors and accelerators
- Accelerators will be able to operate on bulk transferred, buffered data…
- … or on streamed data


Embedded systems already trying out this paradigm

[Diagrams: Intel IXP1200 Network Processor; Philips Nexperia (Viper); Intel IXP2400 Network Processor. Block labels, in extraction order: XScale Core, Hash Engine, Scratch-pad SRAM, RFIFO, Microengine (16), QDR SRAM (4), RDRAM (3), PCI, CSRs, TFIFO, SPI4 / CSIX; ARM, MICRO-ENGINES, ACCESS CTL.; MIPS, MPEG, VLIW, VIDEO, MSP]


Decentralizing parallelism in a JPEG decoder

- Convert a typical media-processing application to the decentralized model
  - Arrays used to implement streams
  - Multiple loci of computation with various models of parallelism
  - Memory access bandwidth a bottleneck w/o private data

[Diagram: Conceptual dataflow view of two JPEG decoding steps. Labels: YCC Image, Upsample (Optional / Bypassed), Upsampled Image, Color Conversion, Conversion Tables, RGB Image]


Data privatization and local memory

[Diagram: Conceptual dataflow view of two JPEG decoding steps. Labels: YCC Image, Upsample (Optional / Bypassed), Upsampled Image, Color Conversion, Conversion Tables, RGB Image]

- Accelerate color conversion first (execute in ACC or APP)
  - Main processor sends inputs, receives outputs
- Large tables: inefficient to send data from main processor
  - Need tables to reside in the accelerator for efficiency of access
  - Tables are initialized once during program execution, and never modified again
  - Accurate pointer analysis necessary to determine this
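To make the traffic argument concrete, here is a minimal sketch that counts words sent to an accelerator with and without resident tables. The `ColorAccelerator` class, the word-counting model, and the table layout are assumptions made for illustration; only the YCbCr-to-RGB coefficients are the standard JFIF ones.

```python
def build_tables():
    """Build the YCbCr -> RGB lookup tables (standard JFIF coefficients)."""
    cr_r = [round(1.402 * (i - 128)) for i in range(256)]
    cb_b = [round(1.772 * (i - 128)) for i in range(256)]
    cr_g = [-0.714136 * (i - 128) for i in range(256)]
    cb_g = [-0.344136 * (i - 128) for i in range(256)]
    return cr_r, cb_b, cr_g, cb_g

TABLE_WORDS = 4 * 256

def clamp(v):
    return max(0, min(255, v))

class ColorAccelerator:
    """Hypothetical accelerator model; counts words received from the main processor."""

    def __init__(self, resident_tables):
        self.resident_tables = resident_tables
        # With privatization, the tables are initialized once inside the accelerator.
        self.tables = build_tables() if resident_tables else None
        self.words_received = 0

    def convert(self, pixels, tables=None):
        if self.resident_tables:
            tables = self.tables                  # private copy, never modified again
        else:
            self.words_received += TABLE_WORDS    # tables shipped with every call
        self.words_received += 3 * len(pixels)    # Y, Cb, Cr per pixel
        cr_r, cb_b, cr_g, cb_g = tables
        out = []
        for y, cb, cr in pixels:
            r = clamp(y + cr_r[cr])
            g = clamp(y + round(cb_g[cb] + cr_g[cr]))
            b = clamp(y + cb_b[cb])
            out.append((r, g, b))
        return out
```

The privatized variant is only legal if nothing outside the accelerator writes the tables after initialization, which is the property the pointer analysis has to establish.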


Increasing parallelism

[Diagram: timeline (Time axis) of Upsample and Convert stages over YCC Image, Upsampled Image, and RGB Image, shown separately and as a combined Convert/Upsample stream]

- Heavyweight loop nests communicate through intermediate array
  - Direct streaming of data is possible, supports higher parallelism (macropipelining)
  - Convert() and Upsample() loops can be chained
  - Accurate interprocedural dataflow analysis is necessary
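The difference between communicating through an intermediate array and chaining the two loops can be sketched with generators. This is an illustration of the dependence structure only: Python generators interleave the stages but do not run them in parallel, and the per-sample arithmetic is a stand-in rather than real JPEG code.

```python
def upsample(rows, trace):
    """Stage 1: 2x horizontal upsampling by sample replication, one row at a time."""
    for i, row in enumerate(rows):
        trace.append(f"upsample {i}")
        yield [s for s in row for _ in range(2)]

def convert(rows, trace):
    """Stage 2: stand-in for color conversion (a simple per-sample inversion)."""
    for i, row in enumerate(rows):
        trace.append(f"convert {i}")
        yield [255 - s for s in row]

def decode_buffered(rows, trace):
    upsampled = list(upsample(rows, trace))      # whole intermediate array built first
    return list(convert(upsampled, trace))

def decode_streamed(rows, trace):
    return list(convert(upsample(rows, trace), trace))   # each row handed over directly
```

Both versions produce the same image; in the streamed version each row is converted as soon as it is upsampled, so the intermediate image never has to exist in main memory.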


How the next-generation compiler will do it (1)

To-do list:
- Identify acceleration opportunities
- Localize memory
- Stream data and overlap computation

Acceleration opportunities:
- Heavyweight loops identified for acceleration
- However, they are isolated in separate functions called through pointers

[Diagram: Memory and Callgraph views of TableInitialization, LoadScanline, Upsample, and ColorConversion; annotation: Heavyweight loops]


To-do list:
- Identify acceleration opportunities (done)
- Localize memory
- Stream data and overlap computation

Localize memory:
- Pointer analysis identifies localizable memory objects
- Private tables inside accelerator initialized once, saving most traffic

[Diagram: Memory and Callgraph views of TableInitialization, LoadScanline, Upsample, and ColorConversion, with Accelerator 1 and Accelerator 2; annotations: Large constant lookup tables identified; Initialization code identified]




To-do list:
- Identify acceleration opportunities (done)
- Localize memory (done)
- Stream data and overlap computation

Streaming and computation overlap:
- Memory dataflow summarizes array/pointer access patterns
- Opportunities for streaming are automatically identified
- Unnecessary memory operations replaced with streaming

[Diagram: Memory and Callgraph views of TableInitialization, LoadScanline, Upsample, and ColorConversion, with Accelerator 1 and Accelerator 2; annotations: Summarize input access pattern; Summarize output access pattern; Constant table privatized]




To-do list:
- Identify acceleration opportunities (done)
- Localize memory (done)
- Stream data and overlap computation (done)

Achieve macropipelining of parallelizable accelerators
- Upsampling and color conversion can stream to each other
- Optimizations can have substantial effect on both efficiency and performance

[Diagram: Memory and Callgraph views of TableInitialization, LoadScanline, Upsample, and ColorConversion, with Accelerator 1 and Accelerator 2]


Memory dataflow in the pointer world

[Diagram: Y, C, C sample planes indexed by Rows and Cols; annotations: Array of constant pointers; Row arrays never overlap]

- Arrays are not true 3D arrays (unlike in Fortran)
- Actual implementation: array of pointers to array of samples
- New type of dataflow problem: understanding the semantics of memory structures instead of true arrays
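The two properties annotated in the diagram (constant row pointers, non-overlapping row arrays) are what let a pointer-based image be treated like a true array. A minimal sketch of those properties, modeling addresses as plain integers; the function names and the dynamic check are illustrative assumptions, since a compiler would have to establish this statically:

```python
def rows_never_overlap(row_ptrs, row_bytes):
    """True if the [ptr, ptr + row_bytes) ranges of all rows are pairwise disjoint."""
    spans = sorted((p, p + row_bytes) for p in row_ptrs)
    return all(prev_end <= start
               for (_, prev_end), (start, _) in zip(spans, spans[1:]))

def sample_address(row_ptrs, row, col, sample_bytes=1):
    """Address of image[row][col]: read the row pointer, then add the column offset."""
    return row_ptrs[row] + col * sample_bytes
```

If the row pointers never change and the rows are disjoint, distinct (row, col) pairs within the row width always name distinct addresses, which is the guarantee a true 2D array gives for free.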


Compiler vs. hardware memory walls

- Hardware memory wall
  - Prohibitive implementation cost of memory system while trying to keep up with the processor speed under power budget
- Compiler memory wall
  - The use of memory as a generic pool obstructs compiler’s view of true program and data structures
- The decentralized and diversified memory approach is key to breaking the hardware memory wall
- Breaking the compiler memory wall will be increasingly important in breaking the hardware memory wall


Pointer analysis: sensitivity, stability and safety

- Improved efficiency increases the scope over which unique, heap-allocated objects can be discovered
- Improved analysis algorithms provide more accurate call graphs (below) instead of a blurred view (above) for use by program transformation tools

[Chart: 132.ijpeg. Discovered Objects (1 to 1000, log scale) versus Observed Connectivity (1 to 10000, log scale) for ANALYSIS SCOPE 0 to 3. BETTER: a multitude of distinct objects. WORSE: a few, highly-connected objects.]

[PASTE2004]


Pointer analysis: sensitivity, stability and safety

- Analysis is abstract execution
  - simplifying abstractions → analysis stability
  - “unrealizable dataflow” results
- Many components of accuracy
  - Typical to cut some corners to enable “key” component for particular applications
- Making the components usefully compatible is a major contribution
  - No need for a priori corner-cutting → better results across broad code base
- Safety in “unsafe” languages
  - C poses major challenges
  - Efficiency challenge increased in safe algos.

[Diagram: components of accuracy: Context, Field, Subtyping, Heap, Arithmetic, Flow, ?]
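As a point of reference for "analysis is abstract execution", here is a minimal sketch of a generic inclusion-based (Andersen-style) points-to analysis. It is flow- and context-insensitive, so it is the kind of corner-cutting baseline the slide contrasts against, not the algorithm presented in the talk; the statement encoding is an assumption made for this example.

```python
def points_to(stmts):
    """Flow- and context-insensitive inclusion-based points-to analysis.

    Each statement is (kind, lhs, rhs):
      ("addr",  p, x)   p = &x
      ("copy",  p, q)   p = q
      ("load",  p, q)   p = *q
      ("store", p, q)   *p = q
    """
    pts = {}

    def get(var):
        return pts.setdefault(var, set())

    changed = True
    while changed:                        # iterate to a fixed point
        changed = False
        for kind, lhs, rhs in stmts:
            if kind == "addr":
                targets, new = [lhs], {rhs}
            elif kind == "copy":
                targets, new = [lhs], set(get(rhs))
            elif kind == "load":
                new = set()
                for obj in list(get(rhs)):
                    new |= get(obj)
                targets = [lhs]
            elif kind == "store":
                targets, new = list(get(lhs)), set(get(rhs))
            else:
                raise ValueError(f"unknown statement kind: {kind}")
            for t in targets:
                if not new <= get(t):
                    get(t).update(new)
                    changed = True
    return pts
```

Because statement order is ignored and every call site would share one set per variable, this style of analysis is stable and simple but can report facts that no real execution produces.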


How do sensitivity, stability and safety coexist?

Our two-pronged approach to sensitive, stable, safe pointer analysis

[Diagram: organization chart with CEO, VPs, MANAGERs, and WORKERs; axis: Increased Abstraction]

- Summarization: Only relevant details are forwarded to a higher level
- Containment: The algorithm can cut its losses locally (like a bulkhead) … to avoid a global explosion in problem size

Example: summarization-based context sensitivity…


Context sensitivity: naïve inlining

Program:
    int g;

    iris()
        int a;
        jade1(&g, 1)
        jade2(&a, 3)

    jade(int *p, int q)
        int r;
        *p := q;
        r := g + 5;
        x := z

Retained original jade (call sites merged):
    p := &g;  p := &a;
    q := 1;   q := 3;
    g := 1    a := 3
    g := 3    a := 1
    r := 6    r := 8

Inlined copies:
    jade1:  p1 := &g; q1 := 1; *p1 := q1; r1 := g + 5; x1 := z1
    jade2:  p2 := &a; q2 := 3; *p2 := q2; r2 := g + 5; x2 := z2
    g := 1
    a := 3

- Retention of side effect still leads to spurious results
- Excess statements unnecessary and costly


Context sensitivity: summarization-based

Program:
    int g;

    iris()
        int a;
        jade1(&g, 1)
        jade2(&a, 3)

    jade(int *p, int q)
        int r;
        *p := q;
        r := g + 5;
        x := z

Bindings at the call sites:
    p := &g;  p := &a;
    q := 1;   q := 3;

Summary of jade:
    *p := q; r := 6

Summary applied per call site:
    jade1:  p1 := &g; q1 := 1; *p1 := q1;
    jade2:  p2 := &a; q2 := 3; *p2 := q2;
    g := 1
    a := 3

- Compact summary of jade used
- Summary accounts for all side-effects. BLOCK assignment to prevent contamination
- Now, only correct result derived
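The jade example can be replayed in a few lines to show where the spurious facts come from. This is a toy model of the slide's example only, not the algorithm from the talk: a call site is reduced to a pair (variable that p points to, value of q), and jade's summary is reduced to the single side effect `*p := q`.

```python
# Call sites in iris(): jade(&g, 1) and jade(&a, 3).
CALL_SITES = [("g", 1), ("a", 3)]

def context_insensitive(call_sites):
    """Merge the actuals of all call sites, then apply '*p := q' once."""
    p = {target for target, _ in call_sites}     # p := &g; p := &a
    q = {value for _, value in call_sites}       # q := 1;  q := 3
    return {target: set(q) for target in p}

def apply_summary(call_sites):
    """Apply jade's summary '*p := q' separately at each call site."""
    values = {}
    for target, value in call_sites:
        values.setdefault(target, set()).add(value)
    return values

def possible_r(values):
    """Values of r := g + 5 implied by the possible values of g."""
    return {g + 5 for g in values["g"]}
```

With merged call sites the model derives g in {1, 3} and a in {1, 3}, hence r in {6, 8}; applying the summary per call site leaves g = 1, a = 3 and r = 6.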


Analyzing large, complex programs

Benchmark   INACCURATE            PREV                NEW
            Context-Insensitive   Context-Sensitive   Context-Sensitive
            (seconds)             (seconds)           (seconds)
espresso    2                     9                   1
li          1                     1332                1
ijpeg       2                     85                  1
perl        4                     408                 11
gcc         52                    HOURS               124
perlbmk     155                   MONTHS              198
gap         62                    3350                117
vortex      5                     136                 3
twolf       1                     2                   1

- Originally, problem size exploded as more contexts were encountered
- New algorithm contains problem size with each additional context
- This results in an efficient analysis process without loss of accuracy

[Charts: two panels plotted against Call Graph Depth (1 to 39, from main() to leaves) for 008.espresso, 099.go, 130.li, 124.m88ksim, 175.vpr, 134.perl, 176.gcc, 254.gap, 255.vortex. Panel "Naïve Exhaustive Inlining": log scale 1E+00 to 1E+14, annotated 10^12. Panel "New Compaction Algorithm": log scale 1E+00 to 1E+05, annotated 10^4.]

[SAS2004]
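The explosion described above can be reproduced on a synthetic call graph. A minimal sketch, assuming an acyclic call graph given as a mapping from each function to its callees (one entry per call site); the `chain` generator and both counting functions are hypothetical helpers, not measurements from the talk.

```python
from functools import lru_cache

def inlined_copies(callgraph, root):
    """Function bodies created by exhaustive inlining: one per distinct call path from root.

    Assumes an acyclic call graph (recursion would need separate handling).
    """
    @lru_cache(maxsize=None)
    def copies(fn):
        return 1 + sum(copies(callee) for callee in callgraph.get(fn, ()))
    return copies(root)

def summaries_needed(callgraph, root):
    """Functions reachable from root: each one is summarized once."""
    seen, stack = set(), [root]
    while stack:
        fn = stack.pop()
        if fn not in seen:
            seen.add(fn)
            stack.extend(callgraph.get(fn, ()))
    return len(seen)

def chain(depth, fanout=2):
    """Hypothetical call graph: f0 .. f<depth>, each calling the next at `fanout` call sites."""
    return {f"f{i}": [f"f{i + 1}"] * fanout for i in range(depth)}
```

With only two call sites per level, exhaustive inlining doubles the number of copies at every level of depth, while the number of summaries grows with the number of functions.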


The outlook in software

- Software is changing too, more gradually
- Applications driving development: rich in parallelism
  - Physical world: medicine, weather
  - Video, games: signal & media processing
- Source code availability
  - Open Source continues to grow
  - Microsoft’s Phoenix Compiler Project
- New programming models
  - Enhanced developer productivity & enhanced parallelism


Beyond the traditional language environment

- Domain-specific, higher-level modeling languages
  - More intuitive than C for inherently parallel problems
  - Implementation details abstracted away from developers: increased productivity, increased portability
- Still an important role for the compiler in this domain
  - Little visibility “through” the model for low-level optimization by developers: communication, memory optimization will be critical in next-gen systems
  - Model can provide structured semantics for the compiler, beyond what can be derived from analysis of low-level code
- As new system models are developed, compilers, modeling languages, and developers will take on new, interactive roles


Domain-specific modeling and optimization

[Diagram: NPClick Programming Model with elements fromDevice, checkIPHeader, rt(ipLookup), toDevice, and discardPush. Naïve Implementation: checkIPHeader and ipLookup each Load IPHeader from Main Memory, passing a Packet Token. Compiler Optimized Implementation: Redundant Load Elimination, passing Packet Data.]

- Programming Model provides the compiler with information that one cannot extract with analysis alone
- Compiler breaks the limitations that are imposed by the model, allowing for efficient, high-performance binaries
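The redundant load elimination in the diagram can be illustrated by counting header loads. A minimal sketch under stated assumptions: `MainMemory`, the header fields, and the lookup rule are hypothetical stand-ins, and the element names only mirror the labels in the diagram; this is not NPClick code.

```python
class MainMemory:
    """Hypothetical packet store that counts header loads."""

    def __init__(self, packets):
        self.packets = packets
        self.header_loads = 0

    def load_ip_header(self, token):
        self.header_loads += 1
        return self.packets[token]

def check_ip_header(header):
    return header["ttl"] > 0

def ip_lookup(header):
    return header["dst"] % 4              # stand-in for a routing-table lookup

def route_naive(mem, token):
    """Each element receives a packet token and loads the header itself."""
    if not check_ip_header(mem.load_ip_header(token)):
        return None                       # discard
    return ip_lookup(mem.load_ip_header(token))   # same header loaded again

def route_optimized(mem, token):
    """The header is loaded once and forwarded between elements."""
    header = mem.load_ip_header(token)
    if not check_ip_header(header):
        return None
    return ip_lookup(header)
```

The optimized form is only safe if nothing modifies the header between the two elements, which is the kind of guarantee a programming model can state directly and low-level analysis often cannot recover.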


Concluding thoughts

- Reaching the true potential of multi-everything hardware
  - Scalability requires distributed parallelism and memory models
  - Requires new compilation tools to break compiler memory wall
- Broad suite of analyses necessary
  - Advanced pointer analysis
  - Memory dataflow analysis
  - New interactions of classical analyses
- This is not just reinventing HPF
  - New distributed parallelism paradigms
  - New applications, new challenges!
- As the field develops, new domain-specific programming models will also benefit from advanced compilation technology
