naums mogers - functional interface for performance ......naums mogers christophe dubach arm...

32
Naums Mogers Christophe Dubach Functional Interface for Performance Portability on Parallel Accelerators

Upload: others

Post on 11-Sep-2020

3 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: Naums Mogers - Functional Interface for Performance ......Naums Mogers Christophe Dubach ARM Research Summit, September 2019 Functional Interface for Performance Portability on Parallel

▣ Naums Mogers

Christophe Dubach

ARM Research Summit, September 2019

Functional Interface forPerformance Portability on Parallel Accelerators

Page 2: Naums Mogers - Functional Interface for Performance ......Naums Mogers Christophe Dubach ARM Research Summit, September 2019 Functional Interface for Performance Portability on Parallel

Hardware accelerators

Architectures

Scopes

Applications

2Requirements

CPU GPU

CGRA

FPGA

ASIC

ML

Security

Server Desktop Mobile Embed

Vision

DBGraph

Energy Mem

Page 3: Naums Mogers - Functional Interface for Performance ......Naums Mogers Christophe Dubach ARM Research Summit, September 2019 Functional Interface for Performance Portability on Parallel

Hardware accelerators

Architectures

Scopes

Applications

3Requirements

CPU GPU

CGRA

FPGA

ASIC

ML

Security

Server Desktop Mobile Embed

Vision

DBGraph

Energy Mem

Page 4: Naums Mogers - Functional Interface for Performance ......Naums Mogers Christophe Dubach ARM Research Summit, September 2019 Functional Interface for Performance Portability on Parallel

Applications

Current landscape

XLA

TPU

BrainWaveISA

BrainWave

ArmNN

ARM ML

MDK

Movidius

HiAI

HiSiliconNPU

OpenCL

GPUsMulticore

CPUs

OpenMP VHDL

FPGAs

Page 5: Naums Mogers - Functional Interface for Performance ......Naums Mogers Christophe Dubach ARM Research Summit, September 2019 Functional Interface for Performance Portability on Parallel

XLA

TPU

BrainWaveISA

BrainWave

ArmNN

ARM ML

MDK

Movidius

HiAI

HiSiliconNPU

Applications

Image Processing

Neural Networks

Graph Analytics

Data-ParallelIntermediate Language

Performance PortableCode Generator

OpenCL

GPUsMulticore

CPUs

OpenMP VHDL

FPGAs

Domain-Specific Languages (DSLs)

Language for Parallelism

Compiler Technology

What we need

Page 6: Naums Mogers - Functional Interface for Performance ......Naums Mogers Christophe Dubach ARM Research Summit, September 2019 Functional Interface for Performance Portability on Parallel

6

What is the right interface for HW accelerators?

Page 7: Naums Mogers - Functional Interface for Performance ......Naums Mogers Christophe Dubach ARM Research Summit, September 2019 Functional Interface for Performance Portability on Parallel

7

What is the right interface for HW accelerators?

Page 8: Naums Mogers - Functional Interface for Performance ......Naums Mogers Christophe Dubach ARM Research Summit, September 2019 Functional Interface for Performance Portability on Parallel

Functional approach

Abstract

▣ Expresses algorithm (WHAT), not implementation (HOW)

8

Safe

▣ Easier to use and parallelise

High-level

▣ Captures plenty of algorithmic meta-info for analysis

Expressive

▣ Control flow

▣ Memory management

Pure

▣ Easy to transform

Composable

▣ Easier to maintain, code-reuse

Page 9: Naums Mogers - Functional Interface for Performance ......Naums Mogers Christophe Dubach ARM Research Summit, September 2019 Functional Interface for Performance Portability on Parallel

XLA

TPU

BrainWaveISA

BrainWave

ArmNN

ARM ML

MDK

Movidius

HiAI

HiSiliconNPU

Applications

Image Processing

Neural Networks

Graph Analytics

Lift: Data-ParallelIntermediate Language

Lift: Rewrite rule-based compilerto HW-specific languages

OpenCL

GPUsMulticore

CPUs

OpenMP VHDL

FPGAs

Domain-Specific Languages (DSLs)

Language for Parallelism

Compiler Technology

Lift

Page 10: Naums Mogers - Functional Interface for Performance ......Naums Mogers Christophe Dubach ARM Research Summit, September 2019 Functional Interface for Performance Portability on Parallel

10

Lift www.lift-project.org

▣ Functional data-parallel language and compiler

Page 11: Naums Mogers - Functional Interface for Performance ......Naums Mogers Christophe Dubach ARM Research Summit, September 2019 Functional Interface for Performance Portability on Parallel

Lift IR www.lift-project.org

toGlobal

toLocal

toPrivate

toVector

toScalar

add, mul

dot

tanh

Address space operators

Casters Data operators

Int, Float

Vector

Array

Data types

Map, Reduce

Zip, Split

Scatter, Gather

Slide

Algorithmic patterns

Page 12: Naums Mogers - Functional Interface for Performance ......Naums Mogers Christophe Dubach ARM Research Summit, September 2019 Functional Interface for Performance Portability on Parallel

12

Lift IR: Views www.lift-project.org

Reorder(stride(s)) >> Map(f)

▣ Virtual composable data layout transformations□ Reorder, Transpose, Slide, Slice, etc

▣ Expressed with Views▣ Help avoid extra memory writes

NO WRITES TO MEMORY

TRANSFORMED READS

Page 13: Naums Mogers - Functional Interface for Performance ......Naums Mogers Christophe Dubach ARM Research Summit, September 2019 Functional Interface for Performance Portability on Parallel

Lift IR www.lift-project.org

toGlobal

toLocal

toPrivate

toVector

toScalar

add, mul

dot

tanh

Address space operators

Casters Data operators

Int, Float

Vector

Array

Data types

Map, Reduce

Zip, Split

Scatter, Gather

Slide

Algorithmic patterns

Page 14: Naums Mogers - Functional Interface for Performance ......Naums Mogers Christophe Dubach ARM Research Summit, September 2019 Functional Interface for Performance Portability on Parallel

14

Lift IR www.lift-project.org

Address space operators

Casters Data operators

IR level

toGlobal

toLocal

toPrivate

toVector

toScalar

add, mul

dot

tanh

Generic

DSL

conv, lstm

blur, sharpen

select

toDRAM

toSRAM

toRegistor

MapGlobalMapLocalReduceSeq

toInt8toFloat16

VVMulMVMulVTanh

Platform-specific

Map, Reduce

Zip, Split

Scatter, Gather

Slide

Algorithmic patterns

Int, Float

Vector

Array

Data types

Vector

Matrix

Tensor

Int8Float16Float32

Page 15: Naums Mogers - Functional Interface for Performance ......Naums Mogers Christophe Dubach ARM Research Summit, September 2019 Functional Interface for Performance Portability on Parallel

15

How do we achieve performance portability?

Page 16: Naums Mogers - Functional Interface for Performance ......Naums Mogers Christophe Dubach ARM Research Summit, September 2019 Functional Interface for Performance Portability on Parallel

Lift: Rewrite Rules

▣ Express algorithmic implementation choices▣ Preserve semantic correctness▣ Leverage algorithmic info

▣ Decouples optimisation from code generation16

Split-join rule Map fusion rule GEMV rule

Page 17: Naums Mogers - Functional Interface for Performance ......Naums Mogers Christophe Dubach ARM Research Summit, September 2019 Functional Interface for Performance Portability on Parallel

Rewrite rules

17

Lift: Rewrite Rules

IR level

▣ Split-join rule

▣ Map fusion rule

▣ Reduce rules

▣ ...

Generic

DSL

▣ Algorithm choices for high-level primitives

▣ Precision level

▣ ...

Platform-specific

▣ Using built-ins

▣ Lowering to the platform programming model

▣ ...

Page 18: Naums Mogers - Functional Interface for Performance ......Naums Mogers Christophe Dubach ARM Research Summit, September 2019 Functional Interface for Performance Portability on Parallel

18

Lift: rewriting

HOW TO OPTIMISE?

Page 19: Naums Mogers - Functional Interface for Performance ......Naums Mogers Christophe Dubach ARM Research Summit, September 2019 Functional Interface for Performance Portability on Parallel

19

Lift: rewriting

HARD STARTING POINT

Page 20: Naums Mogers - Functional Interface for Performance ......Naums Mogers Christophe Dubach ARM Research Summit, September 2019 Functional Interface for Performance Portability on Parallel

20

Lift: rewriting Search

Page 21: Naums Mogers - Functional Interface for Performance ......Naums Mogers Christophe Dubach ARM Research Summit, September 2019 Functional Interface for Performance Portability on Parallel

21

Lift: rewriting

Exploitation

Search

Built-in primitive

Page 22: Naums Mogers - Functional Interface for Performance ......Naums Mogers Christophe Dubach ARM Research Summit, September 2019 Functional Interface for Performance Portability on Parallel

22

Lift: rewriting

Exploitation

Search

Built-in primitive

Parallelisation choice

Page 23: Naums Mogers - Functional Interface for Performance ......Naums Mogers Christophe Dubach ARM Research Summit, September 2019 Functional Interface for Performance Portability on Parallel

23

Lift: rewriting

Exploitation

Code generation

Search

Built-in primitive

Parallelisation choice

Page 24: Naums Mogers - Functional Interface for Performance ......Naums Mogers Christophe Dubach ARM Research Summit, September 2019 Functional Interface for Performance Portability on Parallel

Lift: Rewrite Rules

▣ Domain-specific and generic▣ Reusable▣ Provably correct▣ Self-contained, extensible

24

Page 25: Naums Mogers - Functional Interface for Performance ......Naums Mogers Christophe Dubach ARM Research Summit, September 2019 Functional Interface for Performance Portability on Parallel

Lift: Constraint Inference

▣ Required for valid search space generation when using tuning parameters

▣ Leverages algorithmic meta-info▣ Can express heuristics and HW restrictions

25

Page 26: Naums Mogers - Functional Interface for Performance ......Naums Mogers Christophe Dubach ARM Research Summit, September 2019 Functional Interface for Performance Portability on Parallel

Lift: Search Space Exploration

▣ Uniform random sampling▣ Predictor models▣ Genetic algorithms▣ …

26

Page 27: Naums Mogers - Functional Interface for Performance ......Naums Mogers Christophe Dubach ARM Research Summit, September 2019 Functional Interface for Performance Portability on Parallel

Lift: Research Directions

▣ Linear algebra▣ Sparse data parallelism▣ Optimising reductions▣ Stencil computations▣ 3D wave modelling▣ High-level synthesis for FPGAs▣ Machine Learning

27

Page 28: Naums Mogers - Functional Interface for Performance ......Naums Mogers Christophe Dubach ARM Research Summit, September 2019 Functional Interface for Performance Portability on Parallel

Lift for Machine Learning

▣ Machine Learning□ Convolution inference optimisation□ Platforms: Mali GPUs, BrainWave

28

Page 29: Naums Mogers - Functional Interface for Performance ......Naums Mogers Christophe Dubach ARM Research Summit, September 2019 Functional Interface for Performance Portability on Parallel

Lift for Machine Learning

▣ Machine Learning□ Convolution inference optimisation□ Platforms: Mali GPUs, BrainWave

29

Page 30: Naums Mogers - Functional Interface for Performance ......Naums Mogers Christophe Dubach ARM Research Summit, September 2019 Functional Interface for Performance Portability on Parallel

Lift for Machine Learning

▣ Machine Learning□ Convolution inference optimisation□ Platforms: Mali GPUs, BrainWave

30

Page 31: Naums Mogers - Functional Interface for Performance ......Naums Mogers Christophe Dubach ARM Research Summit, September 2019 Functional Interface for Performance Portability on Parallel

Lift for Machine Learning

Naums Mogers, PhD student, Edinburgh

How to best exploit HW accelerators?

31

Christof Schlaak, PhD student, Edinburgh

How to generate accelerator architectures?

Lu Li, Postdoctoral Researcher, Edinburgh

How to optimise the host code?How to drive the rewriting process?

Christophe Dubach, Reader, Edinburgh

All of the above

Page 32: Naums Mogers - Functional Interface for Performance ......Naums Mogers Christophe Dubach ARM Research Summit, September 2019 Functional Interface for Performance Portability on Parallel

32

Lift source code is published

https://github.com/lift-project/lift

References(icons) Noun Project, https://thenounproject.com(icons) Font Awesome, https://fontawesome.com

http://www.lift-project.org