m3 microkernel-based system for heterogeneous manycores file[1] gpufs: integrating a le system with...

68
Overall System Design Prototype Platforms Capabilities OS Services Context Switching Evaluation M 3 Microkernel-based System for Heterogeneous Manycores Nils Asmussen MKC, 06/29/2017 1 / 48

Upload: others

Post on 30-Oct-2019

2 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: M3 Microkernel-based System for Heterogeneous Manycores file[1] GPUfs: integrating a le system with GPUs, ASPLOS’13 [2] GPUnet: Networking Abstractions for GPU Programs, OSDI’14

Overall System Design Prototype Platforms Capabilities OS Services Context Switching Evaluation

M3 – Microkernel-based System forHeterogeneous Manycores

Nils Asmussen

MKC, 06/29/2017

1 / 48

Page 2: M3 Microkernel-based System for Heterogeneous Manycores file[1] GPUfs: integrating a le system with GPUs, ASPLOS’13 [2] GPUnet: Networking Abstractions for GPU Programs, OSDI’14

Overall System Design Prototype Platforms Capabilities OS Services Context Switching Evaluation

Heterogeneous Systems

2 / 48

Page 3: M3 Microkernel-based System for Heterogeneous Manycores file[1] GPUfs: integrating a le system with GPUs, ASPLOS’13 [2] GPUnet: Networking Abstractions for GPU Programs, OSDI’14

Overall System Design Prototype Platforms Capabilities OS Services Context Switching Evaluation

Heterogeneous Systems

2 / 48

Page 4: M3 Microkernel-based System for Heterogeneous Manycores file[1] GPUfs: integrating a le system with GPUs, ASPLOS’13 [2] GPUnet: Networking Abstractions for GPU Programs, OSDI’14

Overall System Design Prototype Platforms Capabilities OS Services Context Switching Evaluation

Heterogeneous Systems

2 / 48

Page 5: M3 Microkernel-based System for Heterogeneous Manycores file[1] GPUfs: integrating a le system with GPUs, ASPLOS’13 [2] GPUnet: Networking Abstractions for GPU Programs, OSDI’14

Overall System Design Prototype Platforms Capabilities OS Services Context Switching Evaluation

Why?

FPGA-based memcached 16x better in performance per wattthan Atom CPU [1]

Machine learning accelerator is 20% faster than GPU andrequires 128 times less energy [2]

[1] Thin servers with smart pipes: Designing SoC accelerators for memcached, ISCA’13[2] PuDianNao: A polyvalent machine learning accelerator, ASPLOS’15

3 / 48

Page 6: M3 Microkernel-based System for Heterogeneous Manycores file[1] GPUfs: integrating a le system with GPUs, ASPLOS’13 [2] GPUnet: Networking Abstractions for GPU Programs, OSDI’14

Overall System Design Prototype Platforms Capabilities OS Services Context Switching Evaluation

The Problem for OSes

IntelXeon

IntelXeon

ARMbig

ARMLITTLEDSP

DSPAudio

Decoder

FPGA

4 / 48

Page 7: M3 Microkernel-based System for Heterogeneous Manycores file[1] GPUfs: integrating a le system with GPUs, ASPLOS’13 [2] GPUnet: Networking Abstractions for GPU Programs, OSDI’14

Overall System Design Prototype Platforms Capabilities OS Services Context Switching Evaluation

The Problem for OSes

IntelXeon

IntelXeon

ARMbig

ARMLITTLEDSP

DSPAudio

Decoder

FPGA

Kernel

Kernel

4 / 48

Page 8: M3 Microkernel-based System for Heterogeneous Manycores file[1] GPUfs: integrating a le system with GPUs, ASPLOS’13 [2] GPUnet: Networking Abstractions for GPU Programs, OSDI’14

Overall System Design Prototype Platforms Capabilities OS Services Context Switching Evaluation

The Problem for OSes

IntelXeon

IntelXeon

ARMbig

ARMLITTLEDSP

DSPAudio

Decoder

FPGA

Kernel

Kernel

Kernel

Kernel

4 / 48

Page 9: M3 Microkernel-based System for Heterogeneous Manycores file[1] GPUfs: integrating a le system with GPUs, ASPLOS’13 [2] GPUnet: Networking Abstractions for GPU Programs, OSDI’14

Overall System Design Prototype Platforms Capabilities OS Services Context Switching Evaluation

The Problem for OSes

IntelXeon

IntelXeon

ARMbig

ARMLITTLE

Kernel

Kernel

Kernel

Kernel

4 / 48

Page 10: M3 Microkernel-based System for Heterogeneous Manycores file[1] GPUfs: integrating a le system with GPUs, ASPLOS’13 [2] GPUnet: Networking Abstractions for GPU Programs, OSDI’14

Overall System Design Prototype Platforms Capabilities OS Services Context Switching Evaluation

Making Accelerators More First-Class

File system access for GPUs [1]

Network access for GPUs [2]

Access to OS services from FPGAs [3,4]

Computing directly on the SSD [5]

[1] GPUfs: integrating a file system with GPUs, ASPLOS’13[2] GPUnet: Networking Abstractions for GPU Programs, OSDI’14[3] ReconOS: An operating system approach for reconfigurable computing, MICRO’14[4] A Unified Hardware/Software Runtime Environment for FPGA-based Reconfigurable Computers Using BORPH,TECS’08[5] Willow: A user-programmable SSD, OSDI’14

5 / 48

Page 11: M3 Microkernel-based System for Heterogeneous Manycores file[1] GPUfs: integrating a le system with GPUs, ASPLOS’13 [2] GPUnet: Networking Abstractions for GPU Programs, OSDI’14

Overall System Design Prototype Platforms Capabilities OS Services Context Switching Evaluation

Is There a Systematic Way?

Can we design a system that treats all computeunits (CU) as first-class citizens from the beginning?

1 Run untrusted code without causing harm2 Access operating system services3 Context switching support4 Direct communication without involving CPU

6 / 48

Page 12: M3 Microkernel-based System for Heterogeneous Manycores file[1] GPUfs: integrating a le system with GPUs, ASPLOS’13 [2] GPUnet: Networking Abstractions for GPU Programs, OSDI’14

Overall System Design Prototype Platforms Capabilities OS Services Context Switching Evaluation

Outline

1 Overall System Design

2 Prototype Platforms

3 Capabilities

4 OS Services

5 Context Switching

6 Evaluation

7 / 48

Page 13: M3 Microkernel-based System for Heterogeneous Manycores file[1] GPUfs: integrating a le system with GPUs, ASPLOS’13 [2] GPUnet: Networking Abstractions for GPU Programs, OSDI’14

Overall System Design Prototype Platforms Capabilities OS Services Context Switching Evaluation

Outline

1 Overall System Design

2 Prototype Platforms

3 Capabilities

4 OS Services

5 Context Switching

6 Evaluation

8 / 48

Page 14: M3 Microkernel-based System for Heterogeneous Manycores file[1] GPUfs: integrating a le system with GPUs, ASPLOS’13 [2] GPUnet: Networking Abstractions for GPU Programs, OSDI’14

Overall System Design Prototype Platforms Capabilities OS Services Context Switching Evaluation

My Approach – Hardware

IntelXeon

IntelXeon

ARMbig

ARMLITTLEDSP

DSPAudio

Decoder

FPGA

Asmussen et al.: M3: A Hardware/OS Co-Design to Tame Heterogeneous Manycores, ASPLOS’16

9 / 48

Page 15: M3 Microkernel-based System for Heterogeneous Manycores file[1] GPUfs: integrating a le system with GPUs, ASPLOS’13 [2] GPUnet: Networking Abstractions for GPU Programs, OSDI’14

Overall System Design Prototype Platforms Capabilities OS Services Context Switching Evaluation

My Approach – Hardware

IntelXeon

IntelXeon

ARMbig

ARMLITTLEDSP

DSPAudio

Decoder

FPGA

DTUDTUDTUDTU

DTU DTU DTU DTU

Asmussen et al.: M3: A Hardware/OS Co-Design to Tame Heterogeneous Manycores, ASPLOS’16

9 / 48

Page 16: M3 Microkernel-based System for Heterogeneous Manycores file[1] GPUfs: integrating a le system with GPUs, ASPLOS’13 [2] GPUnet: Networking Abstractions for GPU Programs, OSDI’14

Overall System Design Prototype Platforms Capabilities OS Services Context Switching Evaluation

My Approach – Hardware

IntelXeon

IntelXeon

ARMbig

ARMLITTLEDSP

DSPAudio

Decoder

FPGA

DTUDTUDTUDTU

DTU DTU DTU DTU

PE PE PE PE

PEPEPEPE

Asmussen et al.: M3: A Hardware/OS Co-Design to Tame Heterogeneous Manycores, ASPLOS’16

9 / 48

Page 17: M3 Microkernel-based System for Heterogeneous Manycores file[1] GPUfs: integrating a le system with GPUs, ASPLOS’13 [2] GPUnet: Networking Abstractions for GPU Programs, OSDI’14

Overall System Design Prototype Platforms Capabilities OS Services Context Switching Evaluation

My Approach – Software

IntelXeon

IntelXeon

ARMbig

ARMLITTLEDSP

DSPAudio

Decoder

FPGA

DTUDTUDTUDTU

DTU DTU DTU DTU

App

AppAppApp App

AppAppKernel

PE PE PE PE

PEPEPEPE

Asmussen et al.: M3: A Hardware/OS Co-Design to Tame Heterogeneous Manycores, ASPLOS’16

9 / 48

Page 18: M3 Microkernel-based System for Heterogeneous Manycores file[1] GPUfs: integrating a le system with GPUs, ASPLOS’13 [2] GPUnet: Networking Abstractions for GPU Programs, OSDI’14

Overall System Design Prototype Platforms Capabilities OS Services Context Switching Evaluation

My Approach – Software

IntelXeon

IntelXeon

ARMbig

ARMLITTLEDSP

DSPAudio

Decoder

FPGA

DTUDTUDTUDTU

DTU DTU DTU DTU

App

AppAppApp App

AppAppKernel

PE PE PE PE

PEPEPEPE

Asmussen et al.: M3: A Hardware/OS Co-Design to Tame Heterogeneous Manycores, ASPLOS’16

9 / 48

Page 19: M3 Microkernel-based System for Heterogeneous Manycores file[1] GPUfs: integrating a le system with GPUs, ASPLOS’13 [2] GPUnet: Networking Abstractions for GPU Programs, OSDI’14

Overall System Design Prototype Platforms Capabilities OS Services Context Switching Evaluation

Data Transfer Unit

Supports memory access and message passing

Provides a number of endpoints

Each endpoint can be configured for:1 Accessing memory (contiguous range, byte granular)2 Receiving messages into a receive buffer3 Sending messages to a receiving endpoint

Direct reply on received messages

Configuration only by kernel, usage by application

Credit system to prevent DoS attacks

10 / 48

Page 20: M3 Microkernel-based System for Heterogeneous Manycores file[1] GPUfs: integrating a le system with GPUs, ASPLOS’13 [2] GPUnet: Networking Abstractions for GPU Programs, OSDI’14

Overall System Design Prototype Platforms Capabilities OS Services Context Switching Evaluation

OS Design

Kernel m3fs

pipeserv App

App

App

M3: Microkernel-based system forhet. manycores (or L4 ±1)

Implemented from scratch

Drivers, filesystems, . . . are imple-mented on top

Kernel manages permissions, usingcapabilities

DTU enforces permissions (communication, memory access)

Kernel is independent of other CUs in the system

11 / 48

Page 21: M3 Microkernel-based System for Heterogeneous Manycores file[1] GPUfs: integrating a le system with GPUs, ASPLOS’13 [2] GPUnet: Networking Abstractions for GPU Programs, OSDI’14

Overall System Design Prototype Platforms Capabilities OS Services Context Switching Evaluation

M3 System Call

ARMbigDSP

Mem DTU

AppKernel

DTUMemS R

12 / 48

Page 22: M3 Microkernel-based System for Heterogeneous Manycores file[1] GPUfs: integrating a le system with GPUs, ASPLOS’13 [2] GPUnet: Networking Abstractions for GPU Programs, OSDI’14

Overall System Design Prototype Platforms Capabilities OS Services Context Switching Evaluation

Outline

1 Overall System Design

2 Prototype Platforms

3 Capabilities

4 OS Services

5 Context Switching

6 Evaluation

13 / 48

Page 23: M3 Microkernel-based System for Heterogeneous Manycores file[1] GPUfs: integrating a le system with GPUs, ASPLOS’13 [2] GPUnet: Networking Abstractions for GPU Programs, OSDI’14

Overall System Design Prototype Platforms Capabilities OS Services Context Switching Evaluation

Tomahawk 2 and 4

Xtensa LX4

Instr.SPM

DataSPM

DTU

PEPEPE

PE

PE PE

PE

DRAM

RRR

R R R

RRR

PE

MemCtrl.

PEs have no OS support:

No privileged mode

No MMU, no caches, but SPM

T2: simple DTU; T4: most features

14 / 48

Page 24: M3 Microkernel-based System for Heterogeneous Manycores file[1] GPUfs: integrating a le system with GPUs, ASPLOS’13 [2] GPUnet: Networking Abstractions for GPU Programs, OSDI’14

Overall System Design Prototype Platforms Capabilities OS Services Context Switching Evaluation

Linux

M3 runs on Linux using it as a virtual machine

A process simulates a PE, having two threads (CPU + DTU)

DTUs communicate over UNIX domain sockets

No accuracy because

Programs are directly executed on hostData transfers have huge overhead compared to HW

Very useful for debugging and early prototyping

15 / 48

Page 25: M3 Microkernel-based System for Heterogeneous Manycores file[1] GPUfs: integrating a le system with GPUs, ASPLOS’13 [2] GPUnet: Networking Abstractions for GPU Programs, OSDI’14

Overall System Design Prototype Platforms Capabilities OS Services Context Switching Evaluation

gem5

Modular platform for computer architecture research

Supports various ISAs (x86, ARM, Alpha, SPARC, . . . )

Provides detailed CPU and memory models

Cycle-accurate simulation

We built a DTU for gem5

We also added hardware accelerators

16 / 48

Page 26: M3 Microkernel-based System for Heterogeneous Manycores file[1] GPUfs: integrating a le system with GPUs, ASPLOS’13 [2] GPUnet: Networking Abstractions for GPU Programs, OSDI’14

Overall System Design Prototype Platforms Capabilities OS Services Context Switching Evaluation

gem5 – Example Configuration

x86 PE

L2$

DTUL1$

AccelPE

DTU

SPM

IO$

AccelPE

DTU

L1$

x86

PE

DTU

L1$ IO$

DTU

VM

ME

DRAM

17 / 48

Page 27: M3 Microkernel-based System for Heterogeneous Manycores file[1] GPUfs: integrating a le system with GPUs, ASPLOS’13 [2] GPUnet: Networking Abstractions for GPU Programs, OSDI’14

Overall System Design Prototype Platforms Capabilities OS Services Context Switching Evaluation

Outline

1 Overall System Design

2 Prototype Platforms

3 Capabilities

4 OS Services

5 Context Switching

6 Evaluation

18 / 48

Page 28: M3 Microkernel-based System for Heterogeneous Manycores file[1] GPUfs: integrating a le system with GPUs, ASPLOS’13 [2] GPUnet: Networking Abstractions for GPU Programs, OSDI’14

Overall System Design Prototype Platforms Capabilities OS Services Context Switching Evaluation

Overview

0 2 0 21VPE 1 VPE 2

Kernel

VPE 2VPE 1

VPE SGate RGate VPE

19 / 48

Page 29: M3 Microkernel-based System for Heterogeneous Manycores file[1] GPUfs: integrating a le system with GPUs, ASPLOS’13 [2] GPUnet: Networking Abstractions for GPU Programs, OSDI’14

Overall System Design Prototype Platforms Capabilities OS Services Context Switching Evaluation

Capabilities

M3 has the following capabilities:

Send: send messages to a receive EP

Receive: receive messages from send EPs

Memory: access remote memory via DTU

Mapping: access remote memory via load/store

Service: create sessions

Session: exchange caps with service

VPE: use a PE

20 / 48

Page 30: M3 Microkernel-based System for Heterogeneous Manycores file[1] GPUfs: integrating a le system with GPUs, ASPLOS’13 [2] GPUnet: Networking Abstractions for GPU Programs, OSDI’14

Overall System Design Prototype Platforms Capabilities OS Services Context Switching Evaluation

Capability Exchange

Kernel provides syscalls to create, exchange and revoke caps

There are two ways to exchange caps:1 Directly with another VPE (typically, a child VPE)2 Over a session with a service

The kernel offers two operations:1 Delegate: send capability to somebody else2 Obtain: receive capability from somebody else

Difference to L4:

Applications communicate directly, without involving the kernel→ Capability exchange cannot be done during IPCSpecial communication channel between kernel and serversKernel uses this channel to send exchange requests to server

21 / 48

Page 31: M3 Microkernel-based System for Heterogeneous Manycores file[1] GPUfs: integrating a le system with GPUs, ASPLOS’13 [2] GPUnet: Networking Abstractions for GPU Programs, OSDI’14

Overall System Design Prototype Platforms Capabilities OS Services Context Switching Evaluation

Communication

DTU

DTUDTU adds

CU

Mem

buffer

occupunread

EP credits

labeltarget

Receiver: PE1 Sender: PE2

channel

Kernel: PE0

SendGate

DTU

Mem

CUCU

EP

configuration of endpoints to establish a channel

VPE1: PE1

header data

Recv Cap RecvGate

VPE2: PE2

Send CapMem

EP

cmdregcmdreg cmdreg

22 / 48

Page 32: M3 Microkernel-based System for Heterogeneous Manycores file[1] GPUfs: integrating a le system with GPUs, ASPLOS’13 [2] GPUnet: Networking Abstractions for GPU Programs, OSDI’14

Overall System Design Prototype Platforms Capabilities OS Services Context Switching Evaluation

Virtual PEs

M3 kernel manages user PEs in terms of VPEs

VPE is combination of a process and a thread

VPE creation yields a VPE cap. and memory cap.

Library provides primitives like fork and exec

VPEs are used for all PEs:

Accelerators are not handled differently by the kernelAll VPEs can perform system callsAll VPEs can have time slices and priorities. . .

23 / 48

Page 33: M3 Microkernel-based System for Heterogeneous Manycores file[1] GPUfs: integrating a le system with GPUs, ASPLOS’13 [2] GPUnet: Networking Abstractions for GPU Programs, OSDI’14

Overall System Design Prototype Platforms Capabilities OS Services Context Switching Evaluation

VPEs – Examples

Executing ELF-Binaries

VPE vpe("test");

char *args[] = {"/bin/hello", "foo", "bar"};

vpe.exec(3, args);

Asynchronous Lambdas

VPE vpe("test");

MemGate mem = MemGate :: create_global (0x1000 , RW);

vpe.delegate(CapRngDesc(mem.sel(), 1));

vpe.run_async ([& mem ]() {

mem.read(buf , sizeof(buf));

cout << "Done reading !\n";

});

24 / 48

Page 34: M3 Microkernel-based System for Heterogeneous Manycores file[1] GPUfs: integrating a le system with GPUs, ASPLOS’13 [2] GPUnet: Networking Abstractions for GPU Programs, OSDI’14

Overall System Design Prototype Platforms Capabilities OS Services Context Switching Evaluation

VPEs – Examples

Executing ELF-Binaries

VPE vpe("test");

char *args[] = {"/bin/hello", "foo", "bar"};

vpe.exec(3, args);

Asynchronous Lambdas

VPE vpe("test");

MemGate mem = MemGate :: create_global (0x1000 , RW);

vpe.delegate(CapRngDesc(mem.sel(), 1));

vpe.run_async ([& mem ]() {

mem.read(buf , sizeof(buf));

cout << "Done reading !\n";

});

24 / 48

Page 35: M3 Microkernel-based System for Heterogeneous Manycores file[1] GPUfs: integrating a le system with GPUs, ASPLOS’13 [2] GPUnet: Networking Abstractions for GPU Programs, OSDI’14

Overall System Design Prototype Platforms Capabilities OS Services Context Switching Evaluation

Outline

1 Overall System Design

2 Prototype Platforms

3 Capabilities

4 OS Services

5 Context Switching

6 Evaluation

25 / 48

Page 36: M3 Microkernel-based System for Heterogeneous Manycores file[1] GPUfs: integrating a le system with GPUs, ASPLOS’13 [2] GPUnet: Networking Abstractions for GPU Programs, OSDI’14

Overall System Design Prototype Platforms Capabilities OS Services Context Switching Evaluation

File Protocol

S

M

RClient Server

req(in/out)

resp(pos, len)

Mem

File protocol is used for allfile-like objects

Simple for accelerators, yetflexible for software

Software uses POSIX-like APIon top of the protocol

Server provides client accessto data by configuring client’smemory endpoint

Client accesses data via DTU, without involving others

req(in/out) requests next input/output piece and implicitlycommits previous piece

commit(nbytes) commits nbytes of previous piece

Receiving resp(n, 0) indicates EOF

26 / 48

Page 37: M3 Microkernel-based System for Heterogeneous Manycores file[1] GPUfs: integrating a le system with GPUs, ASPLOS’13 [2] GPUnet: Networking Abstractions for GPU Programs, OSDI’14

Overall System Design Prototype Platforms Capabilities OS Services Context Switching Evaluation

Implementation: m3fs – Overview

m3fs is an in-memory file system

m3fs organizes the file’s data in extents

Two types of sessions: metadata session, file session

Metadata session is created first, allows stat, open, . . .

open creates a new file session

Both sessions can be cloned to provide other VPEs access

27 / 48

Page 38: M3 Microkernel-based System for Heterogeneous Manycores file[1] GPUfs: integrating a le system with GPUs, ASPLOS’13 [2] GPUnet: Networking Abstractions for GPU Programs, OSDI’14

Overall System Design Prototype Platforms Capabilities OS Services Context Switching Evaluation

Implementation: m3fs – File Protocol

The file session implements the file protocol (plus seeking)

File session holds file position and advances it on read/write

req(in/out) request next extent

m3fs configures client’s EP for this extent

Appending reserves new space, invisible to other clients

commit(nbytes) commits a previous append

28 / 48

Page 39: M3 Microkernel-based System for Heterogeneous Manycores file[1] GPUfs: integrating a le system with GPUs, ASPLOS’13 [2] GPUnet: Networking Abstractions for GPU Programs, OSDI’14

Overall System Design Prototype Platforms Capabilities OS Services Context Switching Evaluation

Implementation: Pipe – Overview

writer reader

29 / 48

Page 40: M3 Microkernel-based System for Heterogeneous Manycores file[1] GPUfs: integrating a le system with GPUs, ASPLOS’13 [2] GPUnet: Networking Abstractions for GPU Programs, OSDI’14

Overall System Design Prototype Platforms Capabilities OS Services Context Switching Evaluation

Implementation: Pipe – Overview

writer reader

Shared Memory

msg passing

pipeserv

30 / 48

Page 41: M3 Microkernel-based System for Heterogeneous Manycores file[1] GPUfs: integrating a le system with GPUs, ASPLOS’13 [2] GPUnet: Networking Abstractions for GPU Programs, OSDI’14

Overall System Design Prototype Platforms Capabilities OS Services Context Switching Evaluation

Implementation: Pipe

Two types of sessions: pipe session, channel session

Pipe session represents whole pipe, allows to create channels

Channel session implements file protocol

Channel session can be cloned

Server configures client’s EP just once at the beginning

req(in/out) request access to next data

commit(nbytes) commits previous request

31 / 48

Page 42: M3 Microkernel-based System for Heterogeneous Manycores file[1] GPUfs: integrating a le system with GPUs, ASPLOS’13 [2] GPUnet: Networking Abstractions for GPU Programs, OSDI’14

Overall System Design Prototype Platforms Capabilities OS Services Context Switching Evaluation

File Multiplexing

File protocol maps directly to EPs (limited resource)

Number of open files shouldn’t be limited (that much)

libm3 dedicates at most 4 EPs to files and multiplexes them

Multiplexing requires:1 commit(nbytes) to commit read/written data2 revocation of EP capability (old server)3 delegation of EP capability (new server)4 next read/write will contact server again

Fortunately, file multiplexing does almost never happen

32 / 48

Page 43: M3 Microkernel-based System for Heterogeneous Manycores file[1] GPUfs: integrating a le system with GPUs, ASPLOS’13 [2] GPUnet: Networking Abstractions for GPU Programs, OSDI’14

Overall System Design Prototype Platforms Capabilities OS Services Context Switching Evaluation

Accel. Example: Stream Processing

Accelerator works on scratchpad memory

Input data needs to be loaded into scratchpad

Result needs to be stored elsewhere

FSM

Acceleratorlogic DTU

SPM

S in

out

CU

M

SM

33 / 48

Page 44: M3 Microkernel-based System for Heterogeneous Manycores file[1] GPUfs: integrating a le system with GPUs, ASPLOS’13 [2] GPUnet: Networking Abstractions for GPU Programs, OSDI’14

Overall System Design Prototype Platforms Capabilities OS Services Context Switching Evaluation

Accel. Example: Stream Processing

Accelerator works on scratchpad memory

Input data needs to be loaded into scratchpad

Result needs to be stored elsewhere

FSM

Acceleratorlogic DTU

SPM

S in

out

CU

M

SM

33 / 48

Page 45: M3 Microkernel-based System for Heterogeneous Manycores file[1] GPUfs: integrating a le system with GPUs, ASPLOS’13 [2] GPUnet: Networking Abstractions for GPU Programs, OSDI’14

Overall System Design Prototype Platforms Capabilities OS Services Context Switching Evaluation

Shell Integration

M3 allows to use accelerators from the shell:preproc | accel1 | accel2 > output.dat

Shell connects the EPs according to stdin/stdout

Accelerators work autonomously afterwards

Requires about 30 additional lines in the shell

34 / 48

Page 46: M3 Microkernel-based System for Heterogeneous Manycores file[1] GPUfs: integrating a le system with GPUs, ASPLOS’13 [2] GPUnet: Networking Abstractions for GPU Programs, OSDI’14

Overall System Design Prototype Platforms Capabilities OS Services Context Switching Evaluation

Demo

Demo

35 / 48

Page 47: M3 Microkernel-based System for Heterogeneous Manycores file[1] GPUfs: integrating a le system with GPUs, ASPLOS’13 [2] GPUnet: Networking Abstractions for GPU Programs, OSDI’14

Overall System Design Prototype Platforms Capabilities OS Services Context Switching Evaluation

Outline

1 Overall System Design

2 Prototype Platforms

3 Capabilities

4 OS Services

5 Context Switching

6 Evaluation

36 / 48

Page 48: M3 Microkernel-based System for Heterogeneous Manycores file[1] GPUfs: integrating a le system with GPUs, ASPLOS’13 [2] GPUnet: Networking Abstractions for GPU Programs, OSDI’14

Overall System Design Prototype Platforms Capabilities OS Services Context Switching Evaluation

Context Switching

Kernel

DTUDTU

CtxSw

CU: ARM CU: Accel

37 / 48

Page 49: M3 Microkernel-based System for Heterogeneous Manycores file[1] GPUfs: integrating a le system with GPUs, ASPLOS’13 [2] GPUnet: Networking Abstractions for GPU Programs, OSDI’14

Overall System Design Prototype Platforms Capabilities OS Services Context Switching Evaluation

Context Switching

interrupt request

Kernel

DTUDTU

CtxSw

CU: ARM CU: Accel

37 / 48

Page 50: M3 Microkernel-based System for Heterogeneous Manycores file[1] GPUfs: integrating a le system with GPUs, ASPLOS’13 [2] GPUnet: Networking Abstractions for GPU Programs, OSDI’14

Overall System Design Prototype Platforms Capabilities OS Services Context Switching Evaluation

Context Switching

interrupt request

Kernel

DTUDTU

CtxSw

CU: ARM CU: Accel

CU: Accel

RCTMux

37 / 48

Page 51: M3 Microkernel-based System for Heterogeneous Manycores file[1] GPUfs: integrating a le system with GPUs, ASPLOS’13 [2] GPUnet: Networking Abstractions for GPU Programs, OSDI’14

Overall System Design Prototype Platforms Capabilities OS Services Context Switching Evaluation

Context Switching

interrupt request

Kernel

DTUDTU

CtxSw

CU: ARM CU: Accel

CU: Accel

RCTMux

DTU

RCTMux

App

CU: x86

37 / 48

Page 52: M3 Microkernel-based System for Heterogeneous Manycores file[1] GPUfs: integrating a le system with GPUs, ASPLOS’13 [2] GPUnet: Networking Abstractions for GPU Programs, OSDI’14

Overall System Design Prototype Platforms Capabilities OS Services Context Switching Evaluation

Communication with Suspended VPEs

If a VPE is suspended, communication channels stay valid

Each DTU knows the ID of the currently running VPE

Messages contain the target VPE ID

If these do not match, DTU responds with an error

In this case, the sender lets the kernel forward the message

Kernel will resume the VPE and afterwards transmit themessage on behalf of the sender

38 / 48

Page 53: M3 Microkernel-based System for Heterogeneous Manycores file[1] GPUfs: integrating a le system with GPUs, ASPLOS’13 [2] GPUnet: Networking Abstractions for GPU Programs, OSDI’14

Overall System Design Prototype Platforms Capabilities OS Services Context Switching Evaluation

Computing vs. Idling

How does the kernel know what VPEs are doing?

VPEs communicate directly, without involving the kernel andwait for the next msg via DTU

The kernel asks VPEs to report idling, if other VPEs are ready

As soon as a VPE starts to idle, it checks whether it shouldreport that

If so, the VPE waits for the time chosen by the kernel andperforms a system call afterwards

39 / 48

Page 54: M3 Microkernel-based System for Heterogeneous Manycores file[1] GPUfs: integrating a le system with GPUs, ASPLOS’13 [2] GPUnet: Networking Abstractions for GPU Programs, OSDI’14

Overall System Design Prototype Platforms Capabilities OS Services Context Switching Evaluation

Outline

1 Overall System Design

2 Prototype Platforms

3 Capabilities

4 OS Services

5 Context Switching

6 Evaluation

40 / 48

Page 55: M3 Microkernel-based System for Heterogeneous Manycores file[1] GPUfs: integrating a le system with GPUs, ASPLOS’13 [2] GPUnet: Networking Abstractions for GPU Programs, OSDI’14

Overall System Design Prototype Platforms Capabilities OS Services Context Switching Evaluation

Experimental Setup

Evaluation platform is gem5

Each general-purpose PE has x86 64 core @ 3GHz,32+32 KiB L1 cache, 256 KiB L2 cache

Accelerator PEs are clocked with 1GHz

DRAM (DDR3 1600 8x8) clocked with 1GHz

Short running, but representative benchmarks

41 / 48

Page 56: M3 Microkernel-based System for Heterogeneous Manycores file[1] GPUfs: integrating a le system with GPUs, ASPLOS’13 [2] GPUnet: Networking Abstractions for GPU Programs, OSDI’14

Overall System Design Prototype Platforms Capabilities OS Services Context Switching Evaluation

Accelerator Chaining – Variants

input

output

OS

Logic

SPM

Accel

DMA

SPM

AccelDMA

SPM

AccelDMA

Assisted

input

output

shell

SPMDTU

SPM

DTU

SPM

AccelDTU

Logic

Accel

Logic

Accel

Logic

Autonomous

42 / 48

Page 57: M3 Microkernel-based System for Heterogeneous Manycores file[1] GPUfs: integrating a le system with GPUs, ASPLOS’13 [2] GPUnet: Networking Abstractions for GPU Programs, OSDI’14

Overall System Design Prototype Platforms Capabilities OS Services Context Switching Evaluation

Accelerator Chaining – Variants

input

output

OS

Logic

SPM

Accel

DMA

SPM

AccelDMA

SPM

AccelDMA

Assisted

input

output

shell

SPMDTU

SPM

DTU

SPM

AccelDTU

Logic

Accel

Logic

Accel

Logic

Autonomous

42 / 48

Page 58: M3 Microkernel-based System for Heterogeneous Manycores file[1] GPUfs: integrating a le system with GPUs, ASPLOS’13 [2] GPUnet: Networking Abstractions for GPU Programs, OSDI’14

Overall System Design Prototype Platforms Capabilities OS Services Context Switching Evaluation

Accelerator Chaining – Results

Assisted Autonomous

4G

B

2G

B

1G

B

0.5

GB

1 Accel.

Tim

e (

ms)

0

5

10

4G

B

2G

B

1G

B

0.5

GB

2 Accel.

4G

B

2G

B

1G

B

0.5

GB

4 Accel.

4G

B

2G

B

1G

B

0.5

GB

8 Accel.

43 / 48

Page 59: M3 Microkernel-based System for Heterogeneous Manycores file[1] GPUfs: integrating a le system with GPUs, ASPLOS’13 [2] GPUnet: Networking Abstractions for GPU Programs, OSDI’14

Overall System Design Prototype Platforms Capabilities OS Services Context Switching Evaluation

Accelerator Chaining – Results

Assisted Autonomous

4G

B

2G

B

1G

B

0.5

GB

1 Accel.

Tim

e (

ms)

0

5

10

4G

B

2G

B

1G

B

0.5

GB

2 Accel.

4G

B

2G

B

1G

B

0.5

GB

4 Accel.

4G

B

2G

B

1G

B

0.5

GB

8 Accel.

43 / 48

Page 60: M3 Microkernel-based System for Heterogeneous Manycores file[1] GPUfs: integrating a le system with GPUs, ASPLOS’13 [2] GPUnet: Networking Abstractions for GPU Programs, OSDI’14

Overall System Design Prototype Platforms Capabilities OS Services Context Switching Evaluation

Accelerator Chaining – Results

Assisted Autonomous

4G

B

2G

B

1G

B

0.5

GB

1 Accel.

CP

U t

ime

(re

l)

0.0

0.5

1.0

4G

B

2G

B

1G

B

0.5

GB

2 Accel.

4G

B

2G

B

1G

B

0.5

GB

4 Accel.

4G

B

2G

B

1G

B

0.5

GB

8 Accel.

44 / 48

Page 61: M3 Microkernel-based System for Heterogeneous Manycores file[1] GPUfs: integrating a le system with GPUs, ASPLOS’13 [2] GPUnet: Networking Abstractions for GPU Programs, OSDI’14

Overall System Design Prototype Platforms Capabilities OS Services Context Switching Evaluation

Accelerator Chaining – Results

Assisted Autonomous

4G

B

2G

B

1G

B

0.5

GB

1 Accel.

CP

U t

ime

(re

l)

0.0

0.5

1.0

4G

B

2G

B

1G

B

0.5

GB

2 Accel.

4G

B

2G

B

1G

B

0.5

GB

4 Accel.

4G

B

2G

B

1G

B

0.5

GB

8 Accel.

44 / 48

Page 62: M3 Microkernel-based System for Heterogeneous Manycores file[1] GPUfs: integrating a le system with GPUs, ASPLOS’13 [2] GPUnet: Networking Abstractions for GPU Programs, OSDI’14

Overall System Design Prototype Platforms Capabilities OS Services Context Switching Evaluation

Application Performance

Comparison to Linux 4.10, using tmpfs

Traced obtained on Linux and replayed on M3

M3: 3 user PEs; Linux: 1 core (same config)

Lx

M3

tar

0

2

4

6

8

10

Tim

e (

ms)

Lx

M3

untar

Lx

M3

shasum

Lx

M3

sort

Lx

M3

find

Lx

M3

SQLite

Lx

M3

LevelDB

App Xfers OS

45 / 48

Page 63: M3 Microkernel-based System for Heterogeneous Manycores file[1] GPUfs: integrating a le system with GPUs, ASPLOS’13 [2] GPUnet: Networking Abstractions for GPU Programs, OSDI’14

Overall System Design Prototype Platforms Capabilities OS Services Context Switching Evaluation

Application Performance

Comparison to Linux 4.10, using tmpfs

Traced obtained on Linux and replayed on M3

M3: 3 user PEs; Linux: 1 core (same config)

Lx

M3

tar

0

2

4

6

8

10

Tim

e (

ms)

Lx

M3

untar

Lx

M3

shasum

Lx

M3

sort Lx

M3

find

Lx

M3

SQLite

Lx

M3

LevelDB

App Xfers OS

45 / 48

Page 64: M3 Microkernel-based System for Heterogeneous Manycores file[1] GPUfs: integrating a le system with GPUs, ASPLOS’13 [2] GPUnet: Networking Abstractions for GPU Programs, OSDI’14

Overall System Design Prototype Platforms Capabilities OS Services Context Switching Evaluation

Application Performance

Comparison to Linux 4.10, using tmpfs

Traced obtained on Linux and replayed on M3

M3: 3 user PEs; Linux: 1 core (same config)

Lx

M3

tar

0

2

4

6

8

10

Tim

e (

ms)

Lx

M3

untar

Lx

M3

shasum

Lx

M3

sort Lx

M3

find

Lx

M3

SQLite

Lx

M3

LevelDB

App Xfers OS

45 / 48

Page 65: M3 Microkernel-based System for Heterogeneous Manycores file[1] GPUfs: integrating a le system with GPUs, ASPLOS’13 [2] GPUnet: Networking Abstractions for GPU Programs, OSDI’14

Overall System Design Prototype Platforms Capabilities OS Services Context Switching Evaluation

PE Sharing

3 user PEs: pager, m3fs, app (baseline)

2 user PEs: pager+m3fs, app

1 user PEs: pager+m3fs+app

tar untar shasum sort find SQLite LevelDB0

1

2

3

Re

lative

ru

ntim

e

M3 (3 uPEs) M

3 (2 uPEs) M

3 (1 uPE) Linux (1 core)

46 / 48

Page 66: M3 Microkernel-based System for Heterogeneous Manycores file[1] GPUfs: integrating a le system with GPUs, ASPLOS’13 [2] GPUnet: Networking Abstractions for GPU Programs, OSDI’14

Overall System Design Prototype Platforms Capabilities OS Services Context Switching Evaluation

PE Sharing

3 user PEs: pager, m3fs, app (baseline)

2 user PEs: pager+m3fs, app

1 user PEs: pager+m3fs+app

tar untar shasum sort find SQLite LevelDB0

1

2

3

Re

lative

ru

ntim

e

M3 (3 uPEs) M

3 (2 uPEs) M

3 (1 uPE) Linux (1 core)

46 / 48

Page 67: M3 Microkernel-based System for Heterogeneous Manycores file[1] GPUfs: integrating a le system with GPUs, ASPLOS’13 [2] GPUnet: Networking Abstractions for GPU Programs, OSDI’14

Overall System Design Prototype Platforms Capabilities OS Services Context Switching Evaluation

Ongoing Work

Multiple instances of the kernel/services (by Matthias Hille)

Improved network support (by Georg Kotheimer)

Extension of m3fs for storage devices (by Sebastian Reimers)

47 / 48

Page 68: M3 Microkernel-based System for Heterogeneous Manycores file[1] GPUfs: integrating a le system with GPUs, ASPLOS’13 [2] GPUnet: Networking Abstractions for GPU Programs, OSDI’14

Overall System Design Prototype Platforms Capabilities OS Services Context Switching Evaluation

Conclusion

M3 uses a hardware/software co-design

DTU introduces common interface for all CUs

Allows to treat all CUs as first-class citizens

Access to OS services for all CUs

M3 uses the same concepts for all CUs

Allows simple management of complex systems

48 / 48