The Memory is the Computer, Rob Schreiber, HP Labs, DOE Salishan Conference, 2014


The Memory is the Computer

Rob Schreiber, HP Labs

DOE Salishan Conference, 2014

Let’s Build an Exascale Machine

And make it useful.
– Adequate memory capacity
– No disks (except for archival store)
– 20 MW (good thing)
– More than a loosely-connected cluster
– It won’t always work!

It will be a very parallel machine

Exaflops at gigahertz

a billion operations per clock
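The arithmetic behind "a billion operations per clock" can be checked in a few lines. This is only a sketch of the slide's own figures (one exaflop per second, a one-gigahertz clock); the function name is made up for illustration.

```python
# Required concurrency = target operation rate / clock rate.
EXAFLOP_PER_SECOND = 1e18   # 1 exaflop/s, the target from the slide
CLOCK_HZ = 1e9              # 1 GHz clock, as on the slide


def ops_per_clock(flops_per_second: float, clock_hz: float) -> float:
    """Operations that must complete in every clock tick to sustain the rate."""
    return flops_per_second / clock_hz


parallelism = ops_per_clock(EXAFLOP_PER_SECOND, CLOCK_HZ)
print(f"{parallelism:.0e} operations per clock")
```

Since clock rates are no longer rising, every further gain in flops has to come from this concurrency number, which is why the next slides ask where the parallelism will come from.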

How will we get that much parallelism?

• Big problems

• More than one problem

• Pipelines of problems
– Preprocess, mesh gen
– Solve, resolve, UQ, optimize
– Postprocess and visualize

• Solving the same problem twice!

And we will need a lot of memory

• Many problems on the machine at one time

• Performance costs memory

• No disks!

Disk is the New Tape

How slow is a millisecond?

• From 1960 to the present: clock improves 6000X (2 microseconds to 0.3 nanoseconds).

• Latency: 11X (IBM 1405, 1961: 23 RPS. Today: 250 RPS.)

• Seek: 100X (IBM 1405: 600 ms. Today: 6 ms.)

The millisecond is not a reasonable unit in the exascale era.
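To make the imbalance concrete, the ratios quoted above can be recomputed from the slide's numbers. This is a rough sketch: the variable names are illustrative, and the last line (clock cycles that fit in one millisecond) is an added back-of-the-envelope figure, not one stated in the talk.

```python
# Figures as quoted on the slide.
clock_1960_s = 2e-6       # 2 microsecond cycle time
clock_now_s = 0.3e-9      # 0.3 nanosecond cycle time
rps_1961, rps_now = 23, 250          # disk revolutions per second
seek_1961_ms, seek_now_ms = 600, 6   # average seek time

clock_gain = clock_1960_s / clock_now_s    # roughly 6000-7000X
latency_gain = rps_now / rps_1961          # roughly 11X
seek_gain = seek_1961_ms / seek_now_ms     # 100X

# Added illustration: how many of today's clock cycles one millisecond spans.
cycles_per_ms = 1e-3 / clock_now_s

print(f"clock {clock_gain:.0f}X, latency {latency_gain:.1f}X, seek {seek_gain:.0f}X")
print(f"one millisecond is about {cycles_per_ms:.2e} clock cycles")
```

A disk access measured in milliseconds therefore costs on the order of millions of clock cycles, each of which could have carried a billion operations on the machine described earlier.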

We need a lot of memory

Things are becoming unbalanced!!!

Argonne National Labs plans for leading-edge supercomputing. Thanks, Rick!

               2012        2016          2020           2024
Peak FLOP/s    10-20 PF    100-200 PF    500-2000 PF    2000-4000 PF
Memory         0.5-1 PB    5-10 PB       32-64 PB       50-100 PB
Flops/Bytes    20          20            16–32          40
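The Flops/Bytes row follows from the two rows above it. The sketch below recomputes it, assuming (this pairing is a guess, not stated on the slide) that the low end of each flops range goes with the low end of the memory range and likewise for the high ends.

```python
# (peak PFLOP/s range, memory PB range) per year, as given in the table.
plans = {
    2012: ((10, 20), (0.5, 1)),
    2016: ((100, 200), (5, 10)),
    2020: ((500, 2000), (32, 64)),
    2024: ((2000, 4000), (50, 100)),
}


def balance(pflops: tuple, petabytes: tuple) -> tuple:
    """Flops-per-byte ratio, pairing low with low and high with high (assumed)."""
    return pflops[0] / petabytes[0], pflops[1] / petabytes[1]


ratios = {year: balance(pf, pb) for year, (pf, pb) in plans.items()}
for year, (low, high) in ratios.items():
    print(year, f"{low:g}" if low == high else f"{low:g} to {high:g}")
```

The ratio doubles between 2012 and 2024, which is the sense in which memory capacity is falling behind compute.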

Moore’s Law: Last Call?

“There’s no getting around the fact that we make these things out of atoms.”

- Gordon Moore

We need some new Moore’s Laws

Density challenge

$10 per gigabyte (DRAM) today
– A DRAM exabyte costs $10B
– At exascale time, still billions
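The $10B figure is a direct unit conversion. A small sketch, using decimal units (one exabyte taken as 10^9 gigabytes) and a hypothetical helper name:

```python
GB_PER_EB = 10**9   # decimal units: 1 EB = 10^9 GB


def exabyte_cost_dollars(dollars_per_gb: float) -> float:
    """Cost of one exabyte of memory at a given price per gigabyte."""
    return dollars_per_gb * GB_PER_EB


cost_today = exabyte_cost_dollars(10)   # the slide's $10/GB DRAM price
print(f"${cost_today / 1e9:.0f}B per exabyte")
```

Even a tenfold drop in price per gigabyte would still leave the memory alone costing a billion dollars, which is the motivation for looking at denser technologies.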

Let’s consider other memory technologies

How warm

As we move data from disk to memory, all other things being equal, the memory cools

Cost, Density, Power

• Feature shrink – running on empty

• MLC technologies

• 3D technologies (not stacking, real 3D…)

• Static power and memory capacity

Power challenges

DRAM
– Reduced-memory exascale
– Overfetch, leakage, refresh, scrubbing
– Giridhar et al., SC13: 100 PB can be achieved at 4.7 MW
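To put the 4.7 MW figure in context, here is a rough calculation against the 20 MW target mentioned earlier. It reads the 4.7 MW as memory-system power, and the extrapolation to a full exabyte assumes power grows linearly with capacity; that linearity is an assumption made here for illustration, not a claim from the talk or the cited paper.

```python
# Figures quoted on the slides.
DRAM_CAPACITY_PB = 100     # capacity in the Giridhar et al. design point
DRAM_POWER_MW = 4.7        # power quoted for that capacity
MACHINE_BUDGET_MW = 20     # whole-machine power target

mw_per_pb = DRAM_POWER_MW / DRAM_CAPACITY_PB
budget_share = DRAM_POWER_MW / MACHINE_BUDGET_MW

# ASSUMPTION: linear scaling of power with capacity (1 EB = 1000 PB).
exabyte_power_mw = mw_per_pb * 1000

print(f"100 PB of DRAM uses {budget_share:.1%} of the 20 MW budget")
print(f"naive linear estimate for 1 EB: {exabyte_power_mw:.0f} MW")
```

Under that naive scaling an exabyte of DRAM would exceed the entire machine budget, which fits the slide's label of "reduced-memory exascale".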

Nonvolatile memory is usually energy-costly to write, but it has no static power, no scrubbing, and no refresh.

We could make a lot more money if our customers had a bigger plug to plug our machines into.

Flash

3D NAND Flash is BIG

128 Gb chips reported (vs. 4-8 Gb for DRAM).

But ...

Characteristics

Flash in Exascale Systems

Return of the millisecond

And it can wear out; so it would be a separate tier

Flash is the new disk

NVRAM has a future

"I'm reasonably confident that ... nonvolatile technologies will replace flash and bring non-volatile memory very close [to compute] with dramatic improvements in latency. Architectures will clearly have to react and respond to that."

-- Justin Rattner

New memory on the horizon


• Spin-Torque-Transfer RAM (STTRAM)
– Grandis (54nm, acquired by Samsung)

• Phase-Change RAM (PCRAM)
– Samsung (20nm, diode, up to 8Gb)
– Micron and Nokia – In phones now

• Resistive RAM (memristor)
– Panasonic (180nm process, 4-layer xpoint)
– Unity Semi (64MB, acquired by Rambus)
– Crossbar
– Several others under development

ReRAM

– Very promising, still
– Technical issues
– Fits in the density range between DRAM and Flash
– Should scale well
– Low power – exabyte will be feasible

NVM Programming

• Storage (files, databases)
• Persistent heaps
• Just plain memory
• The SNIA PM Programming TWG
• Caches and persistence, transactions, failure atomicity
• A co-design opportunity
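Failure atomicity means that a crash in the middle of an update leaves either the old state or the new state, never a mixture. The sketch below shows the idea with an ordinary file, using the familiar write-temp, sync, rename pattern. It is a file-system analogy only, not persistent-memory code and not the SNIA programming model; the function names and the JSON format are chosen purely for illustration.

```python
import json
import os
import tempfile


def atomic_store(path: str, state: dict) -> None:
    """Replace the file at `path` so a crash leaves the old or the new state."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp = tempfile.mkstemp(dir=directory)   # temp file on the same filesystem
    try:
        with os.fdopen(fd, "w") as f:
            json.dump(state, f)
            f.flush()
            os.fsync(f.fileno())    # push the new data to stable storage first
        os.replace(tmp, path)       # then publish it with an atomic rename
    except BaseException:
        if os.path.exists(tmp):
            os.unlink(tmp)          # never leave a half-written temp file behind
        raise
    # Note: a fully durable version would also sync the containing directory.


def load_state(path: str) -> dict:
    """Read back the most recently published state."""
    with open(path) as f:
        return json.load(f)
```

With byte-addressable NVM the same ordering problem moves into the cache hierarchy: data must be flushed from volatile caches before the pointer that publishes it, which is why caches, transactions, and failure atomicity appear together on the slide.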

Attack of the killer cellphones

The end of Moore’s Law: a restoration of diversity

Networks and machine usability

• Ethernet cadence: 1 Gb, 10, 40, 100.
• No Moore’s Law
• Very high overheads
• New interest, even outside HPC, in
– RDMA
– Topological routing
– User-level comm
– Very fine grained, low latency

Error detection and correction

How will we know we got a correct answer?

How do we respond to an error flag?

Verify in-line (at every timestep) and recompute if there is a probable error!

Embedded auxiliary scheme

[Figure: apparent local error (vertical axis, 0 to 0.8) versus iteration (horizontal axis, 0 to 500)]

Bigger errors are easier to detect

There is high recall (almost all errors found) and very few false positives

(For a simple case – the heat equation)
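A minimal sketch of verify-and-recompute for the heat equation follows. Everything specific in it is an assumption made for illustration, not the scheme from the talk: the main scheme is one explicit (FTCS) step on a periodic 1-D grid, the auxiliary scheme is two half-size steps, the gap between them serves as the apparent local error, and the tolerance, grid size, and injected fault are arbitrary.

```python
import math


def ftcs_step(u, r):
    """One explicit heat-equation step on a periodic grid; r = alpha*dt/dx**2."""
    n = len(u)
    return [u[i] + r * (u[i - 1] - 2 * u[i] + u[(i + 1) % n]) for i in range(n)]


def checked_step(u, r, tol, fault=None):
    """Advance one step, compare with the auxiliary result, recompute on mismatch."""
    new = ftcs_step(u, r)
    if fault is not None:                 # hypothetical fault: corrupt one value
        index, delta = fault
        new[index] += delta
    aux = ftcs_step(ftcs_step(u, r / 2), r / 2)   # auxiliary: two half steps
    err = max(abs(a - b) for a, b in zip(new, aux))
    if err > tol:                         # probable error: redo the step
        return ftcs_step(u, r), err, True
    return new, err, False


def run(steps, fault_at=None, n=64, r=0.25, tol=1e-3):
    """Integrate a sine wave; optionally inject one fault. Returns state and flags."""
    u = [math.sin(2 * math.pi * i / n) for i in range(n)]
    flagged = []
    for k in range(steps):
        fault = (n // 4, 0.1) if k == fault_at else None
        u, err, recomputed = checked_step(u, r, tol, fault)
        if recomputed:
            flagged.append(k)
    return u, flagged


clean, clean_flags = run(50)
repaired, fault_flags = run(50, fault_at=20)
print("flagged steps:", fault_flags, "| states match:", clean == repaired)
```

A fault much smaller than the tolerance would pass unnoticed here, which mirrors the observation that bigger errors are easier to detect. This auxiliary costs twice as much as the main step; a practical embedded scheme would presumably be far cheaper.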

A plan for progress

DRAM cannot provide adequate low-static-power capacity

No disks. Solid-state memory+storage

Twilight of the one-size-fits-all server

Low-latency communication

Self-checking algorithms + in-NVRAM checkpoints for resilience

Generic Disclaimer, and Acknowledgements

The views in this presentation are not necessarily those of HP.

Thanks to: Sarah Anthony, Cullen Bash, Al Davis, Paolo Faraboschi, Dick Henze, Kevin Lim, Moray McLaren, Naveen Muralimanohar, Jerry Rolia, Mike Tan