
DSZOOM@wmpi2001 Uppsala Architecture Research Team (UART) 1

Zoran Radović and Erik Hagersten
{zoranr, eh}@it.uu.se

Uppsala University
Information Technology
Department of Computer Systems

Implementing Low Latency Distributed Software-Based Shared Memory


Problems with Traditional SW-DSMs

Page-sized coherence unit
• False sharing!
[e.g., Ivy, Munin, TreadMarks, Cashmere-2L, Shasta, GeNIMA, …]

Protocol agent messaging is slow
• Most efficiency lost in interrupt/poll

[Figure: two nodes, each with CPUs, Mem, and a Prot. agent; a CPU issues LD x]
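To make the false-sharing point concrete, here is a minimal sketch (an illustration, not taken from the slides) that maps byte addresses to coherence units. The 8 KiB page size and the 128-byte fine-grain unit are assumed values picked for the example, and `coherence_unit` and `falsely_shared` are hypothetical helper names.

```python
PAGE_SIZE = 8192   # assumed page size in bytes (illustrative)
LINE_SIZE = 128    # assumed fine-grain coherence unit in bytes (illustrative)

def coherence_unit(addr, unit_size):
    """Return the index of the coherence unit that contains addr."""
    return addr // unit_size

def falsely_shared(addr_a, addr_b, unit_size):
    """True if two distinct addresses fall into the same coherence unit."""
    return (addr_a != addr_b
            and coherence_unit(addr_a, unit_size) == coherence_unit(addr_b, unit_size))

# Two unrelated variables 1 KiB apart, each written by a different node.
x_addr, y_addr = 0x10000, 0x10400

print(falsely_shared(x_addr, y_addr, PAGE_SIZE))  # True: both live on the same page
print(falsely_shared(x_addr, y_addr, LINE_SIZE))  # False: different 128-byte units
```

With a page-sized unit, every write to one variable would invalidate the other node's copy of the whole page, even though the two nodes never touch the same data.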


Our proposal: DSZOOM

Run the entire protocol in the requesting processor
• No protocol agent communication!

Assumes user-level remote memory access
• put, get, and atomics [e.g., InfiniBand]

Fine-grain access-control checks
[e.g., Shasta, Blizzard-S, Sirocco-S]

[Figure: two nodes, each with CPUs, Mem, and DIR. The protocol sits with the CPUs that issue LD x; arrows are labeled atomic and get.]


Outline

Motivation

General DSZOOM Overview

Experimentation Environment

DSZOOM-WF Implementation Details

Performance Results

Improved DSZOOM… [SC2001]


DSZOOM Cluster

DSZOOM Nodes:
• Each node consists of an unmodified SMP multiprocessor
• SMP hardware keeps coherence among the caches and the memory within each SMP node

DSZOOM Cluster Network:
• Non-coherent cluster interconnect
• Inexpensive user-level remote memory access
• Remote atomic operations [e.g., InfiniBand]


Current DSZOOM Hardware

Two E6000 connected through a hardware-coherent interface (Sun-WildFire) with a raw bandwidth of 800 MB/s in each direction
• Data migration and coherent memory replication (CMR) are kept inactive

16 UltraSPARC II (250 MHz) CPUs per node and 8 GB memory
• Memory access times: 330 ns local / 1700 ns remote (lmbench latency)

Run as 16-way SMP, 2x8 HW-ccNUMA, and 2x8 SW-DSM


Compilation Process

[Figure: compilation flow. Boxes shown: Unmodified SPLASH-2 Application; m4; DSZOOM-WF Implementation of PARMACS Macros; GNU gcc; Binary; EEL; Coherence Protocols; DSZOOM-WF Run-Time Library; a.out]


Process and Memory Distribution

[Figure: processes are created with fork and bound to Cabinet 1 or Cabinet 2 with pset_bind. Each process keeps its own Stack, Text & Data, Heap, and PRIVATE_DATA. Two shared segments are created with shmget, shmid = A in the physical memory of Cabinet 1 and shmid = B in the physical memory of Cabinet 2, and attached with shmat as Cabinet_1_G_MEM and Cabinet_2_G_MEM. The address 0x80000000 is marked next to G_MEM, which is labeled "Aliasing".]
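The layout in the figure, one shared segment per cabinet that every forked process can reach, can be sketched roughly as follows. This is an illustration under stated assumptions and not the DSZOOM-WF code: it swaps System V `shmget`/`shmat` for anonymous shared `mmap` regions (the Python standard library does not expose the System V calls), it leaves out the Solaris-specific `pset_bind` step, and `run_demo`, `SEG_SIZE`, and the values written are hypothetical. It runs only on Unix-like systems because it uses `os.fork`.

```python
import mmap
import os
import struct

SEG_SIZE = 4096  # size of each shared segment in bytes (illustrative)

def run_demo(procs_per_cabinet=2):
    # One anonymous shared mapping per cabinet; stands in for shmget + shmat.
    segments = [mmap.mmap(-1, SEG_SIZE), mmap.mmap(-1, SEG_SIZE)]
    children = []
    for cabinet in range(2):
        for rank in range(procs_per_cabinet):
            pid = os.fork()
            if pid == 0:
                # Child process: the real system would bind it to its
                # cabinet's processor set here (pset_bind).
                value = 100 * (cabinet + 1) + rank
                struct.pack_into("<i", segments[cabinet], 4 * rank, value)
                os._exit(0)
            children.append(pid)
    for pid in children:
        os.waitpid(pid, 0)
    # The parent sees every child's write because the mappings are shared,
    # while stack, heap, and other private data stay per-process after fork.
    return [
        list(struct.unpack_from("<%di" % procs_per_cabinet, seg, 0))
        for seg in segments
    ]

if __name__ == "__main__":
    print(run_demo())
```

Each child inherits both mappings across `fork`, which mirrors how every process in the figure has both Cabinet_1_G_MEM and Cabinet_2_G_MEM attached.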


So far …

[Figure: the compilation flow again, with boxes: Unmodified SPLASH-2 Application; m4; DSZOOM-WF Implementation of PARMACS Macros; GNU gcc; (Un)executable; EEL; Coherence Protocols; DSZOOM-WF Run-Time Library; a.out]


Squeezing Protocols into Binaries …

Static Binary Instrumentation

EEL: Machine-independent Executable Editing Library, implemented in C++
• Replace global loads with snippets containing fine-grain access control checks
• Insert coherence protocols


Fine-grain Access Control Checks

The “magic” value is a small integer corresponding to an IEEE floating-point NaN [Blizzard-S, Sirocco-S]

Floating-point load example:

    ld     [address],%reg      // original LD
    fcmps  %fcc0,%reg,%reg     // compare reg with itself
    fbe,pt %fcc0,hit           // if (reg == reg) goto hit
    nop
    // call global coherence load routine
hit:
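The check works because a NaN is the only floating-point value that compares unequal to itself, so a line filled with the magic pattern fails the `reg == reg` test and falls through to the coherence routine. The sketch below models that idea at a high level. It is an illustration only: the pattern `0xFFFFFFFF` (the integer -1) is an assumed magic value, not necessarily the one DSZOOM uses, and `Node`, `load_float`, and `coherence_load` are hypothetical names.

```python
import struct

MAGIC_BITS = 0xFFFFFFFF  # assumed magic pattern: integer -1, which decodes as a NaN

def bits_to_float(bits):
    """Reinterpret a 32-bit pattern as an IEEE single-precision float."""
    return struct.unpack("<f", struct.pack("<I", bits))[0]

def float_to_bits(value):
    """Return the 32-bit pattern of a single-precision float."""
    return struct.unpack("<I", struct.pack("<f", value))[0]

class Node:
    def __init__(self, home):
        self.home = home   # dict addr -> bits; stands in for the home node's memory
        self.local = {}    # local copies; an absent entry is treated as MAGIC_BITS
        self.misses = 0

    def coherence_load(self, addr):
        """Slow path: fetch the value from home and install it locally."""
        self.misses += 1
        bits = self.home[addr]
        self.local[addr] = bits
        return bits_to_float(bits)

    def load_float(self, addr):
        value = bits_to_float(self.local.get(addr, MAGIC_BITS))  # original LD
        if value == value:                # fcmps + fbe: only a NaN differs from itself
            return value                  # hit
        return self.coherence_load(addr)  # miss, or a genuine NaN in the data

home = {0x40: float_to_bits(3.5)}
node = Node(home)
print(node.load_float(0x40), node.misses)  # first access takes the slow path
print(node.load_float(0x40), node.misses)  # second access hits locally
```

Note that application data that really is a NaN also takes the slow path, so the coherence routine has to tell that case apart from an invalid line.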


Modified-Shared-Invalid (MSI)

Distributed DIR
• One DIR_ENTRY per cache line
• DIR_ENTRY: LOCK and presence bits

Presence bits before MEM_STORE: 0 0 0 0 0 0 0 1
Presence bits after MEM_STORE:  0 0 0 0 0 0 1 0

[Figure: G_MEM ("Aliasing"), Cabinet_1_G_MEM, and Cabinet_2_G_MEM, showing a shared cache line, an invalid cache line, and a MEM_STORE]
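The before/after presence bits can be reproduced with a small model of a directory entry. This is a sketch under assumptions rather than the DSZOOM data layout: it assumes that bit i of the presence field stands for node i (rightmost bit is node 0), keeps the lock as a separate flag, and uses the hypothetical names `DirEntry` and `mem_store`.

```python
class DirEntry:
    """One directory entry per cache line: a lock flag plus presence bits."""

    def __init__(self, presence=0):
        self.locked = False
        self.presence = presence  # assumption: bit i set means node i holds a copy

    def sharers(self):
        return [i for i in range(8) if (self.presence >> i) & 1]

def mem_store(entry, node, invalidate):
    """Make `node` the only holder of the line, invalidating all other copies."""
    assert not entry.locked, "entry is busy"
    entry.locked = True                # a remote atomic in the real system
    for other in entry.sharers():
        if other != node:
            invalidate(other)          # e.g. overwrite the remote copy with the invalid marker
    entry.presence = 1 << node
    entry.locked = False

entry = DirEntry(presence=0b00000001)  # before MEM_STORE: only node 0 has the line
invalidated = []
mem_store(entry, node=1, invalidate=invalidated.append)
print(format(entry.presence, "08b"), invalidated)  # 00000010 [0]
```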


Read Data from Home Node: 2-hop read

[Figure: the Requestor and the home node (Mem, DIR). Messages: 1a. f&s to the DIR; 1b. get, returning data; 2. put. Legend: small packet (~10 bytes); large packet (~68 bytes); message on the critical path; message off the critical path.]
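The message sequence in the figure can be walked through with a single-threaded model in which the requester drives every step and the home node only stores state. This is a hedged sketch, not the actual protocol code: `HomeNode`, `remote_read`, the directory layout, and the assignment of small and large packet sizes to individual messages are assumptions made for illustration. Steps 1a and 1b are labeled as a pair on the slide; the sketch simply issues them one after the other.

```python
SMALL, LARGE = 10, 68  # approximate packet sizes in bytes, taken from the figure legend

class HomeNode:
    """Passive memory and directory; no protocol agent runs here."""

    def __init__(self, data):
        self.mem = dict(data)  # addr -> value
        self.dir = {a: {"lock": 0, "presence": 0} for a in data}

    def fetch_and_set(self, addr):
        """Atomically set the lock; return the old lock value and the presence bits."""
        entry = self.dir[addr]
        old = (entry["lock"], entry["presence"])
        entry["lock"] = 1
        return old

    def get(self, addr):
        return self.mem[addr]

    def put(self, addr, presence):
        """Write the new presence bits and release the lock."""
        self.dir[addr] = {"lock": 0, "presence": presence}

def remote_read(home, addr, requester, log):
    while True:  # spin until the directory lock is acquired
        was_locked, presence = home.fetch_and_set(addr)  # 1a. f&s on the DIR
        log.append(("f&s", SMALL))
        if not was_locked:
            break
    value = home.get(addr)                               # 1b. get the data
    log.append(("get", LARGE))
    home.put(addr, presence | (1 << requester))          # 2. put: update DIR, unlock
    log.append(("put", SMALL))
    return value

home = HomeNode({0x80: 42})
log = []
print(remote_read(home, 0x80, requester=1, log=log))  # 42
print(log)
```

All three operations are issued by the requesting processor, which is the point of the design: the home node's CPUs are never interrupted or polled.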


Instrumentation Performance

Program         Problem Size                        %LD    %ST    Instrumentation Overhead
FFT             1,048,576 points (48.1 MB)          26.1   22.2   1.43
LU-Cont         1024x1024, block 16 (8.0 MB)        22.7   14.5   1.68
LU-Non-Cont     1024x1024, block 16 (8.0 MB)        23.9   16.6   1.42
Radix           4,194,304 items (36.5 MB)           24.1   14.9   1.15
Barnes-Hut      16,384 bodies (32.8 MB)             37.5   50.5   1.25
FMM             32,768 particles (8.1 MB)           25.5   22.9   1.12
Ocean-Cont      514x514 (57.5 MB)                   28.6   26.2   1.34
Ocean-Non-Cont  258x258 (22.9 MB)                   15.5   31.6   1.21
Radiosity       Room (29.4 MB)                      31.1   35.0   1.11
Raytrace        Car (32.2 MB)                       28.8   31.5   1.53
Water-nsq       2,197 mols., 2 steps (2.0 MB)       24.5   32.4   1.21
Water-sp        2,197 mols., 2 steps (1.5 MB)       25.5   27.6   1.21
Average                                             26.2   27.2   1.30
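As a quick consistency check, the averages in the last row and the 11–68% overhead range quoted in the conclusions can be recomputed from the per-program rows. The snippet below only re-derives numbers already in the table; the variable and function names are hypothetical.

```python
# (%LD, %ST, instrumentation overhead) per program, copied from the table.
rows = {
    "FFT":            (26.1, 22.2, 1.43),
    "LU-Cont":        (22.7, 14.5, 1.68),
    "LU-Non-Cont":    (23.9, 16.6, 1.42),
    "Radix":          (24.1, 14.9, 1.15),
    "Barnes-Hut":     (37.5, 50.5, 1.25),
    "FMM":            (25.5, 22.9, 1.12),
    "Ocean-Cont":     (28.6, 26.2, 1.34),
    "Ocean-Non-Cont": (15.5, 31.6, 1.21),
    "Radiosity":      (31.1, 35.0, 1.11),
    "Raytrace":       (28.8, 31.5, 1.53),
    "Water-nsq":      (24.5, 32.4, 1.21),
    "Water-sp":       (25.5, 27.6, 1.21),
}

def column_mean(index):
    """Arithmetic mean of one column over all programs."""
    return sum(r[index] for r in rows.values()) / len(rows)

overheads = [r[2] for r in rows.values()]
lowest = min(rows, key=lambda name: rows[name][2])
highest = max(rows, key=lambda name: rows[name][2])

print("mean %%LD %.2f, mean %%ST %.2f, mean overhead %.3f"
      % (column_mean(0), column_mean(1), column_mean(2)))
print("overhead range: %.2f (%s) to %.2f (%s)"
      % (min(overheads), lowest, max(overheads), highest))
```

The minimum and maximum overhead factors, 1.11 for Radiosity and 1.68 for LU-Cont, correspond to the 11% and 68% endpoints of the range.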


Normalized Instrumentation Overhead Breakdown (Seq. Exec.)

[Figure: stacked bar chart, 0% to 100%. Components: f-p-ST-snippet, int-ST-snippet, f-p-LD-snippet, int-LD-snippet, E6000 seq]


Results (1)
Execution Times in Seconds (16 CPUs)

[Figure: bar chart, 0 to 12 seconds. Series: E6000 16 Processors, ccNUMA 2x8, DSZOOM-WF 2x8 CL128]


Results (2)
Normalized Execution Time Breakdowns (16 CPUs)

[Figure: stacked bar chart, 0% to 100%. Components: Store, Load, Locks, Barriers, Task]


Conclusions

DSZOOM completely eliminates asynchronous messaging between protocol agents

Consistently competitive and stable performance in spite of high instrumentation overhead
• 35% slowdown compared to hardware
• State-of-the-art checking overheads are in the range of 5–35% (e.g., Shasta); DSZOOM: 11–68%


Improved DSZOOM… [SC2001]

Protocol/Overall optimizations
• Coherency unit variations

Synchronization improvements
• More balanced execution between cabinets

Better instrumentation
• More detailed backward slice algorithm


SC2001 Teaser
Execution Times in Seconds (16 CPUs)

[Figure: bar chart, 0 to 12 seconds. Series: ccNUMA 2x8, DSZOOM-WF 2x8 CL128, DSZOOM Today]


DSZOOM’s Home Page

http://www.it.uu.se/research/group/uart