
Programming the FlexRAM Parallel Intelligent Memory System

B. B. Fraguela*, J. Renau†, P. Feautrier‡, D. Padua† and J. Torrellas†

*Univ. da Coruña

†Univ. of Illinois

‡ENS de Lyon

Slides: iacoma.cs.uiuc.edu/iacoma-papers/PRES/present_ppopp03.pdf

Intelligent Memory Architectures

Main memory enhanced with many simple processors

• Heterogeneous

• Highly parallel

Problem: Little research on how to program these architectures

B. B. Fraguela et al., Programming FlexRAM, PPoPP 2003

Contributions

Language support for intelligent memories

OpenMP-like directives (CFlex)

Library of Intelligent Memory Operations (IMOs)

Runtime system and OS extensions

Speedups of over one order of magnitude

Outline

• FlexRAM Architecture

• Software Support

• CFlex

• IMOs

• Evaluation

• Related Work

• Conclusions

FlexRAM Architecture

[Figure: system diagram showing the PHost, standard memory modules, and the FlexRAM chips]

PHost: off-the-shelf system

64 PArrays/chip

64 MB/chip

PArrays are much simpler than the PHost(s)

Programmed independently (MIMD, SPMD)

2D torus inside each chip

Controller: communication and synchronization tasks

The FlexRAM bus interconnects all the chips

FlexRAM Architectural Issues

PArrays cannot interrupt/invoke the PHost(s)

• PArray requests are sent to the chip controller

• PHost polls the controllers and services the requests

Communication PHost - PArray: pass input and output arguments through memory

No HW cache coherence

• Compiler-inserted cache flushes and invalidations
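Because PArrays cannot interrupt the PHost, the request path described on this slide can be pictured as a polling loop. The C sketch below is only an illustration under assumptions, not the FlexRAM runtime: the `controller_t` type, the fixed queue depth, and the names `controller_post`, `phost_poll`, and `sum_handler` are all hypothetical.

```c
#include <stddef.h>

#define MAX_REQS 8            /* hypothetical per-controller queue depth */

/* Hypothetical model of the request queue held by one chip controller. */
typedef struct {
    int reqs[MAX_REQS];       /* request codes posted by PArrays */
    size_t count;             /* number of pending requests */
} controller_t;

/* Hypothetical handler type: how the PHost services one request. */
typedef void (*service_fn)(int req, void *ctx);

/* PArray side: post a request to the chip controller.
   Returns 0 on success, -1 if the queue is full. */
int controller_post(controller_t *c, int req)
{
    if (c->count == MAX_REQS)
        return -1;
    c->reqs[c->count++] = req;
    return 0;
}

/* PHost side: visit every controller once and service all pending
   requests. Returns the number of requests serviced. */
size_t phost_poll(controller_t *ctrl, size_t nctrl, service_fn service, void *ctx)
{
    size_t serviced = 0;
    for (size_t i = 0; i < nctrl; i++) {
        for (size_t j = 0; j < ctrl[i].count; j++) {
            service(ctrl[i].reqs[j], ctx);
            serviced++;
        }
        ctrl[i].count = 0;    /* queue drained */
    }
    return serviced;
}

/* Example handler: accumulates the request codes into an int. */
void sum_handler(int req, void *ctx)
{
    *(int *)ctx += req;
}
```

A PHost loop would call `phost_poll` periodically; how often to poll is a latency versus overhead trade-off that the slides do not quantify.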


Operating System Extensions

Common address space for PHost(s) and PArrays

PArray kernel

• Manages the TLB

• Manages spawn and termination of local tasks

PHost OS

• Updates the shared page table

• First-touch placement of pages

• Cooperates with PArray kernels to keep TLBs coherent
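First-touch placement means a page is homed in the memory bank of whichever processor accesses it first. The sketch below shows that rule in isolation; the 4 KB page size, the tiny table, and the names `page_table_t`, `page_table_init`, and `touch` are assumptions made for the illustration and are not taken from the FlexRAM OS.

```c
#include <stddef.h>
#include <stdint.h>

#define PAGE_SHIFT 12         /* assumed 4 KB pages */
#define NUM_PAGES  16         /* tiny address space for the illustration */
#define NO_HOME    (-1)

/* Hypothetical shared page table: home bank of each virtual page. */
typedef struct {
    int home[NUM_PAGES];
} page_table_t;

/* Mark every page as not yet placed. */
void page_table_init(page_table_t *pt)
{
    for (size_t i = 0; i < NUM_PAGES; i++)
        pt->home[i] = NO_HOME;
}

/* Record an access to addr by the processor attached to bank.
   The first access fixes the page's home; later accesses leave it alone.
   Returns the page's home bank, or NO_HOME if addr is out of range. */
int touch(page_table_t *pt, uintptr_t addr, int bank)
{
    size_t page = (size_t)(addr >> PAGE_SHIFT);
    if (page >= NUM_PAGES)
        return NO_HOME;
    if (pt->home[page] == NO_HOME)
        pt->home[page] = bank;
    return pt->home[page];
}
```

With this rule, initializing a data structure from the processor that will later work on it is enough to co-locate data and computation.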

Programming FlexRAM

OpenMP-like directives (CFlex)

Library of Intelligent Memory Operations (IMOs)

CFlex

CFlex: family of directives inspired by OpenMP

• Execution modifiers: build/sync tasks

• Data modifiers: properties of data structures

• Executable directives: barriers, prefetches, ...

#pragma FlexRAM directive-type [clauses]

Execution Modifier: Spawn

Directive-type

• [phost|parray]: specifies the kind of processor to use

Clauses

• on_home(x): run the task on the PArray in whose bank x is located

• sync/async: the parent task must stop until the child finishes (sync) or not (async)

• pfor: parallelize a for loop

Execution Modifier: Spawn (II)

Clauses (cont.):

• if(cond)/else: conditional execution of the compiler directive

• shared, private, firstprivate, lastprivate, reduction: scope clauses with the same meaning as in OpenMP

• flush: specify which pieces of data to flush from the PHost cache

Example: Parallelizing a Loop

for(p = head; p != NULL; p = p->next)
    process(p->data);


Example: Parallelizing a Loop

#pragma FlexRAM phost sync
for(p = head; p != NULL; p = p->next)
#pragma FlexRAM parray async on_home(*(p->data)) \
        firstprivate(p)
    process(p->data);

Example: Parallelizing a Loop

#pragma FlexRAM parray pfor on_home(*(p->data)) \
        firstprivate(p)
for(p = head; p != NULL; p = p->next)
    process(p->data);
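One way to picture what `pfor` with `on_home` asks of the runtime is that each loop iteration becomes a task routed to the PArray that owns the data. The following is a sequential emulation in plain C, written under assumptions: the `home_bank` field stands in for a page-table lookup, `NUM_BANKS` and `bank_stats_t` are hypothetical, and this `process` keeps a per-bank tally only so the routing is observable.

```c
#include <stddef.h>

#define NUM_BANKS 4           /* hypothetical number of PArray banks */

typedef struct node {
    int data;                 /* payload handled by process() */
    unsigned home_bank;       /* stand-in for the bank that holds the data */
    struct node *next;
} node_t;

/* Per-bank tally, standing in for the work done by each PArray. */
typedef struct {
    int tasks[NUM_BANKS];     /* iterations executed on each bank */
    long sum[NUM_BANKS];      /* example result of process() per bank */
} bank_stats_t;

/* Body of one loop iteration, as it would run on the given bank. */
static void process(bank_stats_t *st, unsigned bank, int data)
{
    st->tasks[bank]++;
    st->sum[bank] += data;
}

/* Emulates the parallel loop: the list is walked once and every
   iteration is routed to the bank that on_home would select. */
void pfor_on_home(node_t *head, bank_stats_t *st)
{
    for (node_t *p = head; p != NULL; p = p->next) {
        node_t *task_p = p;   /* firstprivate(p): the task gets its own copy */
        process(st, task_p->home_bank % NUM_BANKS, task_p->data);
    }
}
```

The private copy of `p` matters in the real system because the spawning loop keeps advancing `p` while earlier tasks are still running.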

Parallelizing Complex Codes

int TreeAdd (register tree_t *t) {
    if (t == NULL) return 0;
    else {
        int leftval, rightval;

        leftval = TreeAdd(t->left);
        rightval = TreeAdd(t->right);

        return leftval + rightval + t->val;
    }
}

Parallelizing Complex Codes (II)

[Figure: tree diagram; labels: PHost async task, PHost-allocated node, PArray subtree, PArray async task]

Parallelizing Complex Codes (III)

int TreeAdd (register tree_t *t) {
    if (t == NULL) return 0;
    else {
        int leftval, rightval;

#pragma FlexRAM phost sync
        {
#pragma FlexRAM parray async on_home(*(t->left)) if (lcl(t->left))
#pragma FlexRAM phost async else
            leftval = TreeAdd(t->left);

#pragma FlexRAM parray async on_home(*(t->right)) if (lcl(t->right))
            rightval = TreeAdd(t->right);
        }

        return leftval + rightval + t->val;
    }
}
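The if(cond)/else clauses in the TreeAdd example select, per child, which directive takes effect. The sketch below emulates that selection sequentially so the decisions can be counted. It rests on assumptions that the slides do not confirm: `lcl` is read as "the subtree root is local to a PArray bank" and modelled with a hypothetical `in_parray` flag, a child with no applicable directive is taken to run inline in the current task, and `TreeAddTraced`, `resolve`, and `tally_t` are illustrative names.

```c
#include <stddef.h>

typedef struct tree {
    int val;
    int in_parray;            /* assumed flag: node lives in a PArray bank */
    struct tree *left, *right;
} tree_t;

typedef enum { RUN_INLINE, RUN_PHOST_ASYNC, RUN_PARRAY_ASYNC } dispatch_t;

/* Counts of how each recursive call would have been dispatched. */
typedef struct { int inline_calls, phost_tasks, parray_tasks; } tally_t;

/* Hypothetical stand-in for lcl(): true when the subtree root is
   local to a PArray bank. */
static int lcl(const tree_t *t) { return t != NULL && t->in_parray; }

/* Resolve an if(cond)/else directive pair. has_else says whether a
   "phost async else" directive follows the parray directive. */
static dispatch_t resolve(const tree_t *child, int has_else)
{
    if (lcl(child)) return RUN_PARRAY_ASYNC;
    return has_else ? RUN_PHOST_ASYNC : RUN_INLINE;
}

static void record(tally_t *tl, dispatch_t d)
{
    if (d == RUN_PARRAY_ASYNC) tl->parray_tasks++;
    else if (d == RUN_PHOST_ASYNC) tl->phost_tasks++;
    else tl->inline_calls++;
}

/* Sequential emulation: returns the same sum as TreeAdd and records
   one dispatch decision per recursive call site. */
int TreeAddTraced(const tree_t *t, tally_t *tl)
{
    if (t == NULL) return 0;
    record(tl, resolve(t->left, 1));     /* left child: if ... else */
    int leftval = TreeAddTraced(t->left, tl);
    record(tl, resolve(t->right, 0));    /* right child: if only */
    int rightval = TreeAddTraced(t->right, tl);
    return leftval + rightval + t->val;
}
```

Under this reading, a PHost task hands the left child to a new task and keeps the right child for itself unless that child is local to a PArray, which keeps the PHost busy while limiting the number of PHost tasks.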


Intelligent Memory Operations (IMOs)

Libraries that hide FlexRAM while providing near-optimal performance

Implement common operations on data structures often used in programs

Highly optimized, both the sequential and the parallel versions

Example IMOs: Vector Container

IMO description                                                 Syntax
Apply func f with arg a                                         Vector_apply(v,f,a)
Search element that fulfills cond f with arg a                  Vector_search(v,f,a)
Generate vector with the result of applying func f with arg a   v2=Vector_map(v,f,a)
Reduce vector applying func f, whose neutral element is a       Vector_reduce(v,f,a)
Process two vectors and an arg a, generating a new vector       v3=Vector_map2(v,v2,f,a)
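To make the semantics of the vector IMOs concrete, here is a sequential reference sketch in C. Only the operation names and argument order come from the slide. Everything else is an assumption for illustration: the `Vector` layout, the `int` element type, the callback typedefs, in-place update for `Vector_apply`, heap-allocated results for the map operations, and the example callbacks `add_arg`, `equals_arg`, `add`, and `axpy`. The real library is described as highly optimized and parallel, which this sketch is not.

```c
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical vector container of ints. */
typedef struct { int *elem; size_t n; } Vector;

typedef int (*elem_fn)(int x, int a);          /* element op with arg a */
typedef int (*pair_fn)(int x, int y, int a);   /* two-element op with arg a */
typedef int (*fold_fn)(int acc, int x);        /* reduction op */

/* Allocate an n-element vector; on failure the result has n == 0. */
static Vector vector_alloc(size_t n)
{
    Vector r = { malloc(n * sizeof(int)), n };
    if (r.elem == NULL) r.n = 0;
    return r;
}

/* Replace every element x by f(x, a). */
void Vector_apply(Vector *v, elem_fn f, int a)
{
    for (size_t i = 0; i < v->n; i++) v->elem[i] = f(v->elem[i], a);
}

/* Index of the first element for which f(x, a) is nonzero, or -1. */
long Vector_search(const Vector *v, elem_fn f, int a)
{
    for (size_t i = 0; i < v->n; i++)
        if (f(v->elem[i], a)) return (long)i;
    return -1;
}

/* New vector with r[i] = f(v[i], a); the caller frees r.elem. */
Vector Vector_map(const Vector *v, elem_fn f, int a)
{
    Vector r = vector_alloc(v->n);
    for (size_t i = 0; i < r.n; i++) r.elem[i] = f(v->elem[i], a);
    return r;
}

/* Fold the vector with f, starting from the neutral element a. */
int Vector_reduce(const Vector *v, fold_fn f, int a)
{
    int acc = a;
    for (size_t i = 0; i < v->n; i++) acc = f(acc, v->elem[i]);
    return acc;
}

/* New vector with r[i] = f(v[i], v2[i], a) over the shorter length;
   the caller frees r.elem. */
Vector Vector_map2(const Vector *v, const Vector *v2, pair_fn f, int a)
{
    size_t n = v->n < v2->n ? v->n : v2->n;
    Vector r = vector_alloc(n);
    for (size_t i = 0; i < r.n; i++) r.elem[i] = f(v->elem[i], v2->elem[i], a);
    return r;
}

/* Example callbacks. */
int add_arg(int x, int a)     { return x + a; }
int equals_arg(int x, int a)  { return x == a; }
int add(int acc, int x)       { return acc + x; }
int axpy(int x, int y, int a) { return a * x + y; }
```

Because each operation touches elements independently (or, for the reduction, through a function with a neutral element), a library can split the vector by home bank and run the pieces on the PArrays without the caller seeing any directive.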


Architecture Parameters

PHost                     PArray
1.6 GHz, 5 issue          1.2 GHz, 2 issue in order
L1 cache: 32 KB           L1 cache: 8 KB
L2 cache: 1 MB            No FP support
Mem latency 180 cycles    Mem latency 14 cycles

Applications (I)

Application  Suite          Access  Data  Task Size
TSP          Olden          Ptr     FP    Large
TreeAdd      Olden          Ptr     Int   Large
Swim         SPEC OMP 2001  Reg     FP    Med
Mgrid        SPEC OMP 2001  Reg     FP    Var
Dmxdm        Kernel         Reg     FP    Large
Spmxv        Kernel         Ind     FP    Med
Distance     CAM            Ptr     Int   Large
Path         CAM            Ptr     Int   Small

Applications (II)

Original code not changed

Application  Code size (lines)  Directives  Additional lines
TSP          485                12          5
TreeAdd      71                 8           4
Swim         272                8           0
Mgrid        470                13          0
Dmxdm        81                 1           2
Spmxv        47                 1           0
Distance     108                17          7
Path         165                17          9

Speedups

[Figure: speedup results]

Speedups: Over one order of magnitude!

Further Optimizations

• Single task for consecutive pages affected by an on_home residing in the same bank

• Consecutive rather than cyclic spawn among the FlexRAM chips

• Limiting the usage of on_home to large loops

Increase the average speedup by about 30%
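The difference between cyclic and consecutive spawn can be shown with two small mapping functions. This is a sketch under an assumption: "consecutive" is read here as a block distribution of tasks over chips, which the slides do not spell out. The function names are hypothetical.

```c
#include <stddef.h>

/* Cyclic spawn: task i goes to chip i mod nchips, so neighbouring
   tasks land on different chips. */
size_t chip_cyclic(size_t i, size_t nchips)
{
    return i % nchips;
}

/* Consecutive spawn: the ntasks tasks are split into nchips contiguous
   blocks, so neighbouring tasks tend to land on the same chip.
   Expects i < ntasks and nchips > 0. */
size_t chip_consecutive(size_t i, size_t ntasks, size_t nchips)
{
    size_t block = (ntasks + nchips - 1) / nchips;   /* ceil(ntasks / nchips) */
    if (block == 0)
        return 0;             /* no tasks at all: nothing to distribute */
    return i / block;
}
```

With 8 tasks and 4 chips, the cyclic mapping sends tasks 0 and 1 to chips 0 and 1, while the consecutive mapping keeps both on chip 0.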


Other Intelligent Mem. Arch.

Active Pages [Oskin et al], DIVA [Hall et al]

• Program the machine by hand

• Can also be programmed with CFlex/IMOs

• Seem more sensitive to data placement than FlexRAM

• Further extensions of CFlex to specify alignment and placement of data structures

• Message passing is more natural for DIVA

Related Work

FlexRAM [Solihin et al]: automatic partition and mapping by a compiler

• Only feasible for simple codes

• Performance is limited

Widespread compiler directives

• OpenMP: UMA model, no locality clauses

• HPF: extensive alignment + replication

• Both: inadequate for irregular applications

Conclusions

Effective programming support for intelligent memory

• CFlex: family of pragmas inspired by OpenMP

• IMOs: library of intelligent memory operations

CFlex parallelizes more problems than OpenMP

Complexity can be further hidden using IMOs

Speedups over one order of magnitude


Runtime System

Task management

• Creation of buffer with input args + message

• PHost spins on a termination flag set by the PArray

Interface to chip controller locks and constructions built upon them, like barriers

Heap memory management

Polling of the FlexRAM chips
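The task-management handshake (argument buffer, message, termination flag) can be sketched as follows. This is a single-threaded emulation under assumptions, not the FlexRAM runtime: `task_buf_t`, `phost_spawn`, `parray_step`, `phost_wait`, and `sum_args` are hypothetical names, the `pending` field stands in for the message, and the PArray is advanced from inside the spin loop because there is no second processor here.

```c
#include <stdatomic.h>
#include <stddef.h>

/* Hypothetical task body type: reads the input arguments. */
typedef int (*task_fn)(const int *args, size_t nargs);

/* Hypothetical task buffer shared through memory. */
typedef struct {
    task_fn fn;               /* task to run on the PArray */
    const int *args;          /* input arguments */
    size_t nargs;
    int result;               /* output argument */
    int pending;              /* stands in for the message to the PArray */
    atomic_int done;          /* termination flag set by the PArray */
} task_buf_t;

/* PHost side: fill the buffer and post the message. */
void phost_spawn(task_buf_t *tb, task_fn fn, const int *args, size_t nargs)
{
    tb->fn = fn;
    tb->args = args;
    tb->nargs = nargs;
    tb->result = 0;
    atomic_init(&tb->done, 0);
    tb->pending = 1;
}

/* PArray side: run a pending task, write the output, set the flag. */
void parray_step(task_buf_t *tb)
{
    if (!tb->pending)
        return;
    tb->pending = 0;
    tb->result = tb->fn(tb->args, tb->nargs);
    atomic_store(&tb->done, 1);
}

/* PHost side: spin until the flag is set, then read the output.
   Must follow phost_spawn on the same buffer. On real hardware the
   PArray runs concurrently; here it is stepped inside the loop. */
int phost_wait(task_buf_t *tb)
{
    while (!atomic_load(&tb->done))
        parray_step(tb);
    return tb->result;
}

/* Example task body: sums its arguments. */
int sum_args(const int *args, size_t nargs)
{
    int s = 0;
    for (size_t i = 0; i < nargs; i++) s += args[i];
    return s;
}
```

Since the machine has no hardware cache coherence, a real implementation would also need the compiler-inserted flushes and invalidations around the buffer accesses; the sketch leaves those out.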

PArray Structure

[Figure: PArray structure]

Further Optimizations

Initial optimizations:

• Alignments using the OS first-touch policy

• on_home clause to exploit locality

New optimizations:

• H: single task for consecutive pages affected by an on_home residing in the same bank

• C: consecutive rather than cyclic spawn among the FlexRAM chips

• L: limiting the usage of on_home to large loops

Software Optimization Results

[Figure: software optimization results]

Final Results

[Figure: final results]