Source: os.inf.tu-dresden.de/Studium/KMB/WS2016/08-Scalability.pdf

Faculty of Computer Science, Institute for System Architecture, Operating Systems Group

SCALABILITY AND HETEROGENEITY

Nils Asmussen

Dresden, 11/29/2016

Layers

[Figure: layered system diagram. Labels: Application; Runtime, Services, ...; OS Kernel; Coherency; Interconnect; Core (x2); Mem (x2).]

Scalability and Heterogeneity Slide 2 of 39

Commodity System with GPU

[Figure: commodity system with a GPU. Labels: Application; Runtime; OS Kernel; Driver; Compute Kernel; OpenCL; HLSL; Core (x2); CC; Mem; Interconnect; GPU with its own Mem.]

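Current Trend

[Figure: system with several cores, each with its own memory, plus an accelerator. Labels: Application; RT; Kernel; Core; Mem; Acc; two question marks (??) mark parts of the software stack that are still open.]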

Why?

• More cores can (for some use cases) deliver more performance

• Specialization is the next step

• Cache coherency gets more expensive (performance, complexity, and energy) with more (and heterogeneous) cores

Commodity Hardware

NUMA

Non-Uniform Memory Access

• Core-to-RAM distance differs

• Various interconnect topologies: bus, star, ring, mesh, ...

• The good: all memory can be directly addressed

• The bad: different access latencies

• Consider placement of data

NUMA Machine

Measuring NUMA effects on:

Daniel Muller: Memory and Thread Management on NUMA Systems, Diploma Thesis, 2013

NUMA Effects

Operation       Access Time   NUMA Factor
read local      37.420s       1.000
read remote     53.223s       1.422
write local     23.555s       1.000
write remote    23.976s       1.018

Daniel Muller: Memory and Thread Management on NUMA Systems, Diploma Thesis, 2013
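
The NUMA factor in the table above is the remote access time divided by the local access time for the same operation. The following is a minimal sketch of that arithmetic; the function names `numa_factor` and `round3` are illustrative assumptions and do not come from the thesis.

```cpp
#include <cassert>
#include <cmath>

// NUMA factor: remote access time divided by local access time.
// (Illustrative helper; the name is an assumption, not from the source.)
double numa_factor(double remote_seconds, double local_seconds) {
    assert(local_seconds > 0.0);
    return remote_seconds / local_seconds;
}

// Round to three decimals, the precision used in the table.
double round3(double x) {
    return std::round(x * 1000.0) / 1000.0;
}
```

With the measured values, `round3(numa_factor(53.223, 37.420))` gives 1.422 for reads and `round3(numa_factor(23.976, 23.555))` gives 1.018 for writes, so remote reads are noticeably slower while remote writes are almost as fast as local ones.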

NUMA Mechanisms

Daniel Muller: Memory and Thread Management on NUMA Systems, Diploma Thesis, 2013

NUMA Policies

• fundamental options: migrate thread vs. migrate data

• use performance counters to decide

• dynamic management shows > 10% performance benefit compared to best static placement
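
The choice between migrating the thread and migrating the data can be sketched as a small decision function over counter-derived samples. Everything here is a hypothetical illustration: the `Sample` fields, the 25% remote-access threshold, and the 1024-page limit are assumptions chosen for the example, not values from the slides or the thesis.

```cpp
#include <cassert>
#include <cstdint>

enum class Action { Stay, MigrateThread, MigrateData };

// Hypothetical per-thread sample, as might be derived from performance counters.
struct Sample {
    std::uint64_t local_accesses;    // accesses served by the local node
    std::uint64_t remote_accesses;   // accesses served by a remote node
    std::uint64_t remote_pages;      // pages the thread uses on the remote node
    bool remote_node_has_idle_core;  // could the thread run next to its data?
};

// Illustrative policy; the thresholds are made up for this sketch.
Action decide(const Sample& s) {
    const std::uint64_t total = s.local_accesses + s.remote_accesses;
    if (total == 0) return Action::Stay;
    // Fewer than 25% remote accesses: not worth any migration cost.
    if (s.remote_accesses * 4 < total) return Action::Stay;
    // Moving the thread avoids copying data if a core near the data is idle.
    if (s.remote_node_has_idle_core) return Action::MigrateThread;
    // Otherwise move the data, but only if the amount is small enough.
    if (s.remote_pages <= 1024) return Action::MigrateData;
    return Action::Stay;
}
```

A dynamic manager would evaluate such a function periodically and re-decide as the access pattern changes, which is what distinguishes it from a static placement.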

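HPC

[Figure: HPC system with two nodes connected by an interconnect. Each node has a Core with Mem and runs its own stack of K, RT, and App; the Apps communicate via MPI.]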

Research Prototypes

Barrelfish

Andrew Baumann et al.: The Multikernel: A new OS architecture for scalable multicore systems, SOSP 2009

Barrelfish

• Concept: multikernel, implementation: Barrelfish

• Treat the machine as cores with a network

• “CPU driver” plus exokernel-ish structure

• No inter-core sharing at the lower levels

• Monitors coordinate system-wide state via replication and synchronization

Andrew Baumann et al.: The Multikernel: A new OS architecture for scalable multicore systems, SOSP 2009
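
The idea of keeping system-wide state as per-core replicas that are updated only by messages can be sketched as follows. This is a single-threaded toy model under stated assumptions: the `Update`, `Monitor`, and `broadcast` names are hypothetical and are not Barrelfish's API, and the agreement protocol a real monitor needs for ordering and failure handling is left out.

```cpp
#include <cassert>
#include <map>
#include <queue>
#include <string>
#include <vector>

// One update to the replicated state (hypothetical message format).
struct Update {
    std::string key;
    int value;
};

// Per-core monitor: owns a private replica and an inbox; nothing is shared.
class Monitor {
public:
    // Another core "sends" an update by placing it in this core's inbox.
    void deliver(const Update& u) { inbox_.push(u); }

    // Apply all pending updates to the local replica, in arrival order.
    void drain() {
        while (!inbox_.empty()) {
            const Update& u = inbox_.front();
            replica_[u.key] = u.value;
            inbox_.pop();
        }
    }

    // Read from the local replica only; returns fallback if the key is unknown.
    int lookup(const std::string& key, int fallback) const {
        auto it = replica_.find(key);
        return it == replica_.end() ? fallback : it->second;
    }

private:
    std::queue<Update> inbox_;
    std::map<std::string, int> replica_;
};

// Send the same update to every monitor, including the sender's own.
void broadcast(std::vector<Monitor>& monitors, const Update& u) {
    for (Monitor& m : monitors) m.deliver(u);
}
```

Reads never touch another core's memory; a replica only becomes consistent with the others once its monitor has drained its inbox, which is the price paid for avoiding shared data structures.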

Cosh

• Based on Barrelfish

• Introduces abstractions for non-CC systems

• Takes advantage of CC, if possible

• Otherwise, data transfers via, e.g., DMA units

• Used to implement OS services (net, fs, ...)

• Evaluated for Intel i7 CPU + Intel Knights Ferry

Andrew Baumann et al.: Cosh: clear OS data sharing in an incoherent world, TRIOS 2014

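Barrelfish + Cosh

[Figure: two cores with their own Mem plus an accelerator (Acc) with Mem, connected by an interconnect. Each core runs a K and a Mon; the Mons exchange Messages, and the Apps exchange Messages.]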

Barrelfish Scalability

• Driven by scalability issues of shared kernel designs and cache coherence

• This might not be a pressing issue today

Andrew Baumann et al.: The Multikernel: A new OS architecture for scalable multicore systems, SOSP 2009

Factored Operating System

David Wentzlaff, Anant Agarwal: Factored Operating Systems (fos): The Case for a Scalable Operating System for Multicores, SIGOPS OSR 2009

Popcorn Linux

• Idea: multiple Linux kernels on one system

• Provide the illusion of a POSIX SMP system

• Kernels communicate to sync/exchange state

• Does not rely on global shared memory

• Distributed shared memory, if necessary

• Processes can migrate between kernels

Barbalace et al.: Popcorn: Bridging the Programmability Gap in Heterogeneous-ISA Platforms, EuroSys 2015

Popcorn Linux

Barbalace et al.: Popcorn: Bridging the Programmability Gap in Heterogeneous-ISA Platforms, EuroSys 2015

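Popcorn Linux

[Figure: two cores with their own Mem plus an accelerator (Acc) with Mem, connected by an interconnect. Each core runs its own K; the Ks exchange Messages. A single Runtime and Application span the kernels.]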

Helios

• Idea: heterogeneous ISA systems need some kind of compiler support

• ISA-specific kernels: “satellite kernels”

• Provide uniform OS abstractions

• Memory management, scheduling

• Bootstrap: first kernel becomes coordinator, boots other cores

Nightingale et al.: Helios: Heterogeneous Multiprocessing with Satellite Kernels, SOSP 2009

Helios

• Share-nothing, even on ccNUMA

• Processes cannot span across kernels

• Implementation based on Singularity

• Applications compiled into intermediate code

• Second-stage compilation to native code of all available ISAs at install time

• Placement based on affinity hints

Nightingale et al.: Helios: Heterogeneous Multiprocessing with Satellite Kernels, SOSP 2009

Helios

Nightingale et al.: Helios: Heterogeneous Multiprocessing with Satellite Kernels, SOSP 2009

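Helios

[Figure: two cores with shared Mem and CC running Application, Runtime, and OS Kernel, next to an accelerator (Acc) with its own Mem running its own K, RT, and App. Further labels: Chan; IC.]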

Our Own Work

M3

Approach

[Figure, built up in two steps: eight processing elements (Intel Xeon (x2), ARM big, ARM LITTLE, DSP (x2), Audio Decoder, FPGA), each with its own Mem and a DTU. In the second step every element is labeled PE; seven PEs run an App and one runs the Kernel.]

Data Transfer Unit

• Supports memory access and message passing

• Provides a number of endpoints

• Each endpoint can be configured for:

  1. Accessing memory (contiguous range, byte granular)

  2. Receiving messages into a ringbuffer

  3. Sending messages to a receiving endpoint

• Configuration only by kernel, usage by application

• Direct reply on received messages
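
The endpoint model above can be sketched as a small data structure: each endpoint is unconfigured or set up for exactly one of the three roles, and every use is checked against that configuration. This is a software illustration under assumptions, not the DTU's hardware interface; `EpKind`, `Endpoint`, `Dtu`, and all member names are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

enum class EpKind { Unconfigured, Memory, Receive, Send };

// One endpoint; only the fields matching `kind` are meaningful.
struct Endpoint {
    EpKind kind;
    std::uint64_t base, size;         // Memory: contiguous, byte-granular range
    std::size_t slots;                // Receive: ring buffer capacity in messages
    std::size_t target_pe, target_ep; // Send: the receiving endpoint

    static Endpoint none() { return {EpKind::Unconfigured, 0, 0, 0, 0, 0}; }
    static Endpoint memory(std::uint64_t base, std::uint64_t size) {
        return {EpKind::Memory, base, size, 0, 0, 0};
    }
    static Endpoint receive(std::size_t slots) {
        return {EpKind::Receive, 0, 0, slots, 0, 0};
    }
    static Endpoint send(std::size_t pe, std::size_t ep) {
        return {EpKind::Send, 0, 0, 0, pe, ep};
    }
};

class Dtu {
public:
    explicit Dtu(std::size_t num_endpoints)
        : eps_(num_endpoints, Endpoint::none()) {}

    // Kernel-only operation in the design described above; this sketch
    // models that restriction by convention, not by enforcement.
    void configure(std::size_t ep, const Endpoint& cfg) { eps_.at(ep) = cfg; }

    // Application-side check: is [offset, offset + len) inside the range?
    bool may_access(std::size_t ep, std::uint64_t offset, std::uint64_t len) const {
        const Endpoint& e = eps_.at(ep);
        return e.kind == EpKind::Memory && offset <= e.size && len <= e.size - offset;
    }

    // Application-side check: can messages be sent through this endpoint?
    bool may_send(std::size_t ep) const { return eps_.at(ep).kind == EpKind::Send; }

private:
    std::vector<Endpoint> eps_;
};
```

Because an application can only use endpoints and never reconfigure them, the checks in `may_access` and `may_send` are all that stands between it and the rest of the system, which is why configuration is reserved for the kernel.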

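M3

System Call

[Figure: an App on one PE (DSP) and the Kernel on another PE (ARM big), each with Mem and a DTU. The system call is a message from a send endpoint (S) to a receive endpoint (R).]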

M3 = L4 ±1

• Microkernel-based system for heterogeneous manycores

• Implemented from scratch

• Mechanisms for PEs, memory and communication

• Drivers, filesystems, ... are implemented on top

• Kernel manages permissions, using capabilities

• DTU enforces permissions (communication, memory access)

• Kernel is independent of other cores in the system

Virtual PEs

• Creating a VPE yields a VPE capability and a memory capability

• Library provides primitives like fork and exec

Execute function on different PE:

    VPE vpe("test");
    vpe.run_async([]() {
        Serial::get() << "Hello World!\n";
        return 0;
    });
    int exitcode = vpe.wait();

Filesystem: m3fs

• FS service is implemented outside of kernel

• m3fs is (currently) an in-memory filesystem

[Figure, built up in steps. Labels: m3fs, App, kernel; Mem, FS; PE1, PE2, PE3. Arrows added step by step: open; obtain; obtain; read/write.]

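M3

[Figure: two cores with their own Mem plus an accelerator (Acc) with Mem, connected by an interconnect. Labels: K; Srv (x2) exchanging Messages; App (x2) exchanging Messages.]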

Tomahawk

[Figure: grid of routers (R) connecting several PEs, a memory controller (Mem Ctrl.), and DRAM. Detail of one PE: Xtensa LX4 core, instruction SPM, data SPM, DTU.]

• Cores attached to NoC with DTU

• No privileged mode

• No MMU, no caches, but SPM

• Only simple DTU + SW emulation

Linux

• M3 runs on Linux, using it as a virtual machine

• A process simulates a PE, having two threads (CPU + DTU)

• DTUs communicate over UNIX domain sockets

• No accuracy because

  – Programs are directly executed on host

  – Data transfers have huge overhead compared to HW

• Very useful for debugging and early prototyping

gem5

• Modular platform for computer-system architecture research

• Supports various ISAs (x86, ARM, Alpha, ...)

• Cycle-accurate simulation

• Has an out-of-order CPU model

• We built a DTU for gem5

• Support for caches and virtual memory

gem5 – Example Configuration

[Figure: several PEs, each an x86 core with L1 and L2 caches and a DTU; one PE uses an SPM instead. Further labels: TLB; PT; Ctrl; ME; DRAM.]

Summary and Outlook

• Various different approaches

• Not clear yet how to handle heterogeneity

• Memory will get heterogeneous as well (NVM)

• Reconfigurable hardware will emerge
