
Page 1:

12th ANNUAL WORKSHOP 2016

OPENPOWER-BASED OPEN HYBRID COMPUTING

Bernard Metzler, IBM Zurich Research

Building the Ecosystem for a Flexible Heterogeneous Compute Architecture

Page 2:

INDUSTRY TRENDS:
TOWARDS WORKFLOW OPTIMIZED SYSTEM

[Figure: schematic Price/Performance vs. year (2004-2016), log scale 0.1-100, showing the Moore's Law trend]

System stack innovations are required to drive Cost/Performance

Moore's Law

(*schematic diagram)

Page 3:

(Build of the previous slide; adds "Frameworks" to the system stack innovations.)

Page 4:

(Build of the previous slide; adds "GPU & Accelerator Integration".)

Page 5:

(Build of the previous slide; adds "IO Acceleration".)

Page 6:

OPENPOWER CONSORTIUM

Collaboration around Power Architecture products

Open technical membership organization, founded in 2013

IBM is opening up technology around the Power Architecture under a liberal license:
• Processor specifications
• Firmware
• Software

Member companies:
• Can customize POWER processors and system platforms
• Custom systems for large/warehouse data centers
• Custom workload acceleration through GPUs, ASICs, FPGAs, advanced I/O
• Over 190 members to date

Goal: create an ecosystem of HW and SW development to drive innovation in HPC and cloud computing
• Allow for workload-optimized solutions
• Rely on open, stable specs and interfaces
• Agree on a common architecture, but avoid vendor lock-in

Page 7:

190+ MEMBERS

Page 8:

OPENPOWER
I/O, ACCELERATOR AND GPU TECHNOLOGY ROADMAP

2015: IBM CPU POWER8 (OpenPower CAPI interface) · Mellanox Connect-IB, FDR InfiniBand, PCIe Gen3 · NVIDIA Kepler GPU, PCIe Gen3 · IBM system: Firestone

2016: IBM CPU POWER8' (CAPI & NVLink) · Mellanox ConnectX-4, EDR InfiniBand, CAPI over PCIe Gen3 · NVIDIA Pascal GPU, NVLink · IBM system: Garrison

2017: IBM CPU POWER9 (enhanced CAPI & enhanced NVLink) · Mellanox ConnectX-5, next-gen InfiniBand, enhanced CAPI over PCIe Gen4 · NVIDIA Volta GPU, enhanced NVLink · IBM system: Witherspoon

Page 9:

EFFICIENT GPU INTEGRATION

Heterogeneous systems combining strong general purpose CPUs with GPU accelerators:

• Strong single thread performance (fast response time) & broad applicability of CPU

• Tremendous efficiency of the GPU’s massively parallel FLOPs for scientific & analytic applications

[Figure: standard PCIe-attached GPU, with PCIe as the system bottleneck between host memory and GPU memory, vs. POWER8 with NVLink, 2 x NVLink per GPU]

Standard PCIe connected GPUs can't realize full potential due to communication bottlenecks

NVLINK eliminates this bottleneck, substantially increasing the GPU-CPU communication bandwidth

Page 10:

WHY ACCELERATORS?

Transistor Efficiency & Extreme Parallelism

• Bit-level operations

• Variable-precision floating point

Power-Performance Advantage

• > 2x compared to Multi-core (MIC) or GPGPU

• Unused LUTs are powered off

Technology Scaling better than CPU/GPU

• FPGAs are not frequency or power limited yet

• 3D has great potential

Dynamic reconfiguration

• Flexibility for application tuning at run-time vs. compile-time

Additional advantages when FPGAs are network or storage connected

• Allows network/storage as well as compute specialization

• Near memory computation


Page 11:

CAPI: COHERENT ACCELERATOR PROCESSOR INTERFACE

Customer-defined acceleration within the processor context

Function-based acceleration:
• Main application executed on the host processor
• Computationally heavy functions offloaded
• Single binary with both HW and SW versions of the function
• Also runs without an accelerator available (see the sketch below)

CAPI device: full peer to the processor
• Enables coherent memory between AFU and application
• Via the PSL, the AFU can 'understand' the app's VA
• No need to pin application memory for I/O
• AFU works within the application context

[Figure: FPGA hosting several AFUs behind the IBM-supplied PSL, attached over PCIe to the CAPP; coherent memory and PPC core on the host side]
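To make the "single binary, optional accelerator" point concrete, here is a minimal C sketch of function-based acceleration with a software fallback. The helper names (capi_probe_afu, afu_compress, sw_compress) are hypothetical placeholders for illustration, not part of any shipped CAPI library.

/* One binary carries both a software and an AFU-backed version of a hot
 * function and falls back to software when no CAPI accelerator is present. */
#include <stdbool.h>
#include <stddef.h>

bool capi_probe_afu(const char *afu_name);                 /* hypothetical: is the AFU available? */
int  afu_compress(const void *src, size_t len, void *dst); /* hypothetical: offloaded version     */
int  sw_compress(const void *src, size_t len, void *dst);  /* plain software implementation       */

int compress_block(const void *src, size_t len, void *dst)
{
    /* Decide once whether the hardware path is usable; the same application
     * binary runs with or without the accelerator. */
    static int have_afu = -1;
    if (have_afu < 0)
        have_afu = capi_probe_afu("compress") ? 1 : 0;

    return have_afu ? afu_compress(src, len, dst)
                    : sw_compress(src, len, dst);
}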

Page 12:

CAPI: COHERENT ACCELERATOR PROCESSOR INTERFACE

(Build of the previous slide; adds a comparison of the two I/O flows.)

Typical I/O model flow:
• DD call, copy or pin source data
• MMIO notify accelerator
• Acceleration
• Poll / interrupt completion
• Copy or unpin result data
• Return from DD, completion

Flow with a coherent model (sketched in C below):
• Shared memory notify accelerator
• Acceleration
• Shared memory completion
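A minimal sketch of the coherent-model flow above, assuming a work element that CPU and AFU see coherently: submission and completion are plain loads and stores on shared memory, with no driver call, no pinning, no MMIO doorbell and no interrupt. The wed layout and field names are invented for illustration only, not a real CAPI interface.

#include <stdatomic.h>
#include <stdint.h>

struct wed {                         /* hypothetical work element descriptor, shared with the AFU */
    uint64_t         src_va;         /* virtual address of source data (no copy, no pinning)      */
    uint64_t         dst_va;         /* virtual address of result buffer                          */
    uint64_t         len;
    _Atomic uint64_t doorbell;       /* a store here notifies the accelerator                     */
    _Atomic uint64_t done;           /* the accelerator stores completion status here             */
};

static int run_coherent(struct wed *w, void *src, void *dst, uint64_t len)
{
    w->src_va = (uint64_t)(uintptr_t)src;
    w->dst_va = (uint64_t)(uintptr_t)dst;
    w->len    = len;

    atomic_store_explicit(&w->done, 0, memory_order_relaxed);
    atomic_store_explicit(&w->doorbell, 1, memory_order_release); /* shared-memory notify      */

    while (atomic_load_explicit(&w->done, memory_order_acquire) == 0)
        ;                                                         /* shared-memory completion  */
    return 0;
}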

Page 13:

CAPI: ACCELERATOR OPERATION

0. OS is aware of CAPI device(s), libCapi in place, application binary contains the algorithm
1. Algorithm gets transferred to the FPGA
2. Shared context set up, containing shared
   • work descriptors,
   • result descriptors,
   • shared locks,
   • ...
   (see the sketch after this slide)
3. CAPI algorithm starts working
4. Ongoing communication between app and the CAPI-attached algorithm
5. Work completes, CAPI algorithm shutdown

[Figure: application ELF holding the app's CAPI algorithm; OS and libCapi on the POWER8 host (core and memory); the algorithm is loaded into an AFU slot on the FPGA/ASIC behind the IBM-supplied PSL, attached via PCI to the CAPP; app and AFU share work descriptors (WD) and result descriptors (RD)]

1. Virtual addressing: no pinning, no copy
2. Coherent caching
3. Elimination of the device driver: direct app - device communication
4. Efficient hybrid HW/SW co-design
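The shared context of step 2 might look roughly like the following C sketch. Every type and field name here is made up for illustration; the actual layout is defined by the AFU design and the library, not by this code.

#include <stdatomic.h>
#include <stdint.h>

#define CTX_SLOTS 16

struct work_desc   { uint64_t src_va, dst_va, len, opcode; };
struct result_desc { uint64_t status, bytes_out; };

struct capi_shared_ctx {
    atomic_flag        lock;               /* shared lock guarding slot allocation   */
    _Atomic uint32_t   head, tail;         /* producer/consumer indices              */
    struct work_desc   wd[CTX_SLOTS];      /* work descriptors written by the app    */
    struct result_desc rd[CTX_SLOTS];      /* result descriptors written by the AFU  */
};

/* Enqueue one unit of work; the AFU picks it up from the same (coherent) memory. */
static int submit(struct capi_shared_ctx *ctx, const struct work_desc *w)
{
    while (atomic_flag_test_and_set(&ctx->lock))
        ;                                   /* spin on the shared lock                   */
    uint32_t slot = atomic_load(&ctx->head) % CTX_SLOTS;
    ctx->wd[slot] = *w;
    atomic_fetch_add(&ctx->head, 1);        /* step 4: ongoing app <-> AFU communication */
    atomic_flag_clear(&ctx->lock);
    return (int)slot;
}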

Page 14:

EFFICIENT RDMA:
MEMORY REGISTRATION

Classic RDMA communication buffer management
• Static communication buffer memory pinning for RDMA memory access
• Adapter translates VA/PA on the fast path
• Potential waste of host memory resources
• Time consuming for short-lived communication contexts

CAPI model
• No memory pinning needed
• Adapter runs in user context and 'understands' VA
• OS supports on-the-fly page-in if needed
• No pinning of sparsely used memory
• Memory registration
  • Communicates buffer boundaries to the adapter
  • Used by the adapter to enforce access protection
• Elegant way of doing 'lazy memory registration'

[Chart: the cost of ibv_reg_mr()]
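For reference, this is the classic registration step whose cost the chart refers to, using the standard libibverbs call ibv_reg_mr(); error handling is trimmed for brevity.

#include <infiniband/verbs.h>
#include <stdlib.h>

struct ibv_mr *register_buffer(struct ibv_pd *pd, size_t len)
{
    void *buf = malloc(len);
    if (!buf)
        return NULL;

    /* Pins 'len' bytes and hands the VA/PA translation to the adapter; this is
     * expensive for large buffers and short-lived communication contexts. Under
     * the CAPI model described above, the step can be avoided because the
     * adapter resolves virtual addresses in the user's context. */
    struct ibv_mr *mr = ibv_reg_mr(pd, buf, len,
                                   IBV_ACCESS_LOCAL_WRITE  |
                                   IBV_ACCESS_REMOTE_READ  |
                                   IBV_ACCESS_REMOTE_WRITE);
    if (!mr)
        free(buf);
    return mr;
}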

Page 15:

EFFICIENT RDMA:
ATOMIC OPERATIONS

RDMA Atomics
• Atomic read-modify-write operations on small in-memory objects, such as
  • Compare & Swap, Fetch & Add
• Used for efficient distributed access control
  • Databases, collective operations etc.

Classic implementation
• Adapter fetches the current object value via PCI transfer (1)
• Adapter potentially sets a new value (2) and writes it back via PCI transfer (3)

CAPI implementation
• Adapter and host share a PSL/CAPP-managed consistent view of the atomic object
• Atomic object cached / accessed in situ by the adapter, PCI transfer only if the local CPU accesses it
• Higher atomic RDMA OP/s rate possible
• Atomics to be moved to the memory controller (enhanced CAPI)

[Figure: classic - RDMA NIC performs steps (1) read, (2) modify, (3) write-back of the RDMA atomic object in main memory across PCI; CAPI - NIC and CPU share a coherent view of the atomic object via the CAPI protocol]
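A short libibverbs example of posting a fetch-and-add, to show what an RDMA atomic looks like at the API level; whether the responder serves it from a coherent cache (CAPI) or via PCI transfers is invisible to this code.

#include <infiniband/verbs.h>
#include <stdint.h>
#include <string.h>

int post_fetch_add(struct ibv_qp *qp, struct ibv_mr *local_mr,
                   uint64_t remote_addr, uint32_t rkey, uint64_t add)
{
    struct ibv_sge sge = {
        .addr   = (uint64_t)(uintptr_t)local_mr->addr, /* 8-byte buffer receiving the old value */
        .length = sizeof(uint64_t),
        .lkey   = local_mr->lkey,
    };

    struct ibv_send_wr wr;
    memset(&wr, 0, sizeof(wr));
    wr.opcode                = IBV_WR_ATOMIC_FETCH_AND_ADD;
    wr.sg_list               = &sge;
    wr.num_sge               = 1;
    wr.send_flags            = IBV_SEND_SIGNALED;
    wr.wr.atomic.remote_addr = remote_addr;   /* remote 64-bit object, 8-byte aligned */
    wr.wr.atomic.rkey        = rkey;
    wr.wr.atomic.compare_add = add;           /* value to add                        */

    struct ibv_send_wr *bad = NULL;
    return ibv_post_send(qp, &wr, &bad);
}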

Page 16:

EFFICIENT RDMA:
ENDPOINT RESOURCE ALLOCATION

Classic RDMA endpoint model
• Memory mapped endpoint resources
  • QP, CQ, RQ, ...
• Memory mapped doorbell register
• Overhead from memory mapping (page pinning)
• Overhead for short-lived endpoints (endpoint setup time)

Using the CAPI model
• Application transparently shares endpoint resources with the adapter
• CPU and adapter updates to work queues and other structures are kept coherent
• Simplifies the user library

[Figure: classic - CPU mmap()s QP/CQ pages in main memory and rings the NIC via an MMIO doorbell; CAPI - CPU and RDMA NIC access QP/CQ in main memory coherently via load/store]
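For contrast, a sketch of classic endpoint allocation with libibverbs: the CQ/QP creation below is the per-endpoint mapping and setup cost that the CAPI model reduces by sharing coherently mapped queue state with the adapter. The queue depths are illustrative only.

#include <infiniband/verbs.h>

struct ibv_qp *create_endpoint(struct ibv_context *ctx, struct ibv_pd *pd)
{
    /* Completion queue: backing pages are mapped and pinned for the adapter. */
    struct ibv_cq *cq = ibv_create_cq(ctx, 256 /* depth */, NULL, NULL, 0);
    if (!cq)
        return NULL;

    struct ibv_qp_init_attr attr = {
        .send_cq = cq,
        .recv_cq = cq,
        .qp_type = IBV_QPT_RC,          /* reliable connected endpoint            */
        .cap     = {
            .max_send_wr  = 128,        /* work queue sizes -> mapped resources   */
            .max_recv_wr  = 128,
            .max_send_sge = 1,
            .max_recv_sge = 1,
        },
    };
    return ibv_create_qp(pd, &attr);    /* QP buffers and doorbell get mapped     */
}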

Page 17:

CAPI AND NVM INTEGRATION

TMS Flash attached to POWER8 via a CAPI FC coherent adapter

Read/Write from the application significantly shortens the code path

5x the bandwidth per core at half the latency

Conventional path (about 20K instructions): application → read/write syscall → file system (strategy()/iodone()) → LVM (strategy()/iodone()) → disk & adapter DD (pin buffers, translate, map DMA, start I/O; interrupt, unmap, unpin, iodone scheduling on completion)

CAPI path (20K instructions reduced to 2K): application → user library → shared memory work queue → read/write to the device

Page 18:

COHERENT NVM INTEGRATION - OUTLOOK

CAPI I/O model will benefit from higher P9 CAPI bandwidth

Emerging new non-volatile memory technologies
• Higher endurance, lower latency
• Justify closer integration with the CPU
• Complement the asynchronous I/O model

Introduction of a true load/store NVM interface
• Use enhanced-CAPI attached memory
• CPU load/store to NVM

Results in 3 models of coherent access to mmap()'d memory (see the sketch below):
• GPU memory (enhanced NVLINK)
• NV memory, async I/O access (CAPI)
• NV memory, ld/st access (enhanced CAPI)

[Figure: CPU and coherency infrastructure connecting main memory, GPU memory (enhanced NVLINK), storage class memory via CAPI I/O in the I/O domain, and storage class memory via CAPI ld/st; all reachable through mmap()]
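A small POSIX sketch of the ld/st access model: once storage class memory is exposed through the coherency infrastructure, the application just mmap()s it and issues ordinary loads and stores. The device path /dev/scm0 is a hypothetical placeholder, and real NVM platforms may offer finer-grained persistence primitives than msync().

#include <fcntl.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <unistd.h>

int scm_store(const void *src, size_t len, off_t offset)
{
    int fd = open("/dev/scm0", O_RDWR);              /* hypothetical SCM device node */
    if (fd < 0)
        return -1;

    void *scm = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, offset);
    if (scm == MAP_FAILED) {
        close(fd);
        return -1;
    }

    memcpy(scm, src, len);          /* plain CPU stores, no read/write syscall per access  */
    msync(scm, len, MS_SYNC);       /* conservative flush; assumes msync suffices here      */
    munmap(scm, len);
    close(fd);
    return 0;
}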

Page 19:

SUMMARY AND OUTLOOK

Accelerator and GPU integration
• Dictated by technology trends
• Efficient integration means:
  • Flexible software/accelerator co-design
  • Maintaining coherency
  • IO-balanced, high sustained bandwidth

OpenPOWER
• Industry consortium building an open ecosystem around POWER technology
• Aiming at efficient accelerator and GPU integration
• Coherent Accelerator Processor Interface (CAPI) and NVLINK

OpenPOWER-based efficient OpenFabrics RDMA stack
• Lazy memory registration
• RDMA endpoint management
• Coherent RDMA Atomics
• Efficient NVM integration

Outlook: hybrid computer systems
• Programmable near-memory accelerators
• See the 2016 OpenPOWER Summit talk "Programmable Near-Memory Acceleration on ConTutto"

Page 20:

12th ANNUAL WORKSHOP 2016

THANK YOU