
DATAFLOW FOR HIGH PERFORMANCE COMPUTING
Leandro Marzulo
Universidade do Estado do Rio de Janeiro


No more free lunch…
• Can’t buy a new processor and expect to improve performance automatically.
• Parallel programming is a must!
  • Average programmers don’t know how to do it
  • Parallel implementation may not scale
  • Synchronization
• Heterogeneous Systems
  • So many devices – CPU, GPU, Xeon Phi, FPGA …
  • So many libraries/languages – CUDA, OpenCL, TBB, OpenMP, MPI, Pthreads, VHDL…
• TOO MUCH TO LEARN!

Sweet times ahead…

• Time to think out of the box

• To experiment with different stuff

• To revisit old concepts

• To rethink the way we teach programming

• To connect to different fields and research groups

The industry is investing!!!

Why Dataflow?

Just because it feels natural!

Dataflow x Von Neumann

Characteristic           | Dataflow                                     | Von Neumann
Register File            | ✖                                            | ✔
Program Counter          | ✖                                            | ✔
Control Flow             | Steer (one per operand)                      | Branches and Jumps
Parallelism              | Natural (Parallelism Explosion)              | Pipeline, Branch Prediction, Tomasulo, ROB, …
Language requirements    | Functional (no side effects) *               | Nonrestrictive
Compilation difficulties | Control Flow (especially loops and functions) | Several architecture-specific optimizations

* WaveScalar and its wave-ordering annotation scheme
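To make the comparison above concrete, here is a minimal Python sketch of the dataflow firing rule: an instruction runs as soon as all of its operands have arrived, and a steer instruction replaces the branch. No program counter picks the next instruction. The names `Instr`, `steer`, `run` and `example` are illustrative assumptions and do not come from TALM, WaveScalar or any real dataflow machine.

```python
from collections import deque


class Instr:
    """Hypothetical dataflow instruction: fires when every input port holds an operand."""

    def __init__(self, name, op, n_inputs):
        self.name = name
        self.op = op
        self.operands = [None] * n_inputs
        self.present = [False] * n_inputs
        self.dests = []  # (target instruction, target port, index of our output)

    def connect(self, target, port, out=0):
        self.dests.append((target, port, out))

    def receive(self, port, value):
        self.operands[port] = value
        self.present[port] = True
        return all(self.present)  # True means the instruction is ready to fire


def steer(value, cond):
    """Route `value` to output 0 when `cond` is true, otherwise to output 1."""
    return (value, None) if cond else (None, value)


def run(initial_tokens):
    """Fire instructions in the order they become ready; return {name: outputs}."""
    ready = deque()
    results = {}

    def deliver(instr, port, value):
        if instr.receive(port, value):
            ready.append(instr)

    for instr, port, value in initial_tokens:
        deliver(instr, port, value)
    while ready:
        instr = ready.popleft()
        out = instr.op(*instr.operands)
        if not isinstance(out, tuple):
            out = (out,)
        results[instr.name] = out
        instr.present = [False] * len(instr.present)  # operands are consumed
        for target, port, idx in instr.dests:
            if out[idx] is not None:  # None stands for "no token on this output"
                deliver(target, port, out[idx])
    return results


def example(x):
    """Dataflow graph for: x * 2 if x > 0 else x + 100."""
    gt = Instr("gt", lambda v: v > 0, 1)
    st = Instr("steer", steer, 2)
    dbl = Instr("double", lambda v: v * 2, 1)
    add = Instr("add100", lambda v: v + 100, 1)
    gt.connect(st, 1)          # comparison result becomes the steer condition
    st.connect(dbl, 0, out=0)  # true side
    st.connect(add, 0, out=1)  # false side
    return run([(gt, 0, x), (st, 0, x)])
```

Only the instruction on the taken side of the steer ever receives a token, so the other side simply never fires. In this sketch `None` marks an absent token, which is a simplification that would not suit programs carrying `None` as data.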

Dataflow Revives!
• TERAFLUX (Unisi, BSC, Microsoft, HP, …)
  • Language
  • Compiler
  • Simulator (no actual HW yet)
• OmpSs (BSC)
  • Heterogeneous
• TBB Flow Graph (Intel)
  • Create and connect nodes
  • Associate them to Lambda Functions
  • Inject starter operands

Maxeler
• Static Dataflow – DAGs (mostly)
• FPGA based – DFE (DataFlow Engine)
• Michael Flynn – MPP / SBAC-PAD 2014 Keynote
• More performance requires more effort (Flynn’s words)
• Compiler – Dataflow Graph in FPGA
• Galava DFE – Academic version (USD 4999)
  • 500 multipliers
  • 12 GB RAM
  • PCI-E

Maxeler - Products

• CPUs plus DFEs: Intel Xeon CPU cores and up to 6 DFEs with 288GB of RAM
• DFEs shared over Infiniband: up to 8 DFEs with 384GB of RAM and dynamic allocation of DFEs to CPU servers
• Low latency connectivity: Intel Xeon CPUs and 1-2 DFEs with up to six 10Gbit Ethernet connections
• MaxWorkstation: desktop development system
• MaxCloud: on-demand scalable accelerated compute resource, hosted in London

Maxeler - RTM
• 3U System
  • 1U traditional CPU node
  • 2 x MPC-X 2000 (16 DFEs)
  • Less than 2.5 kW power usage
• Performance = 80 x 16 core Intel nodes!
  • 27x space reduction
  • 15x power consumption reduction
  • 5x improvement on total cost of ownership
• There are other similar examples

TALM

• TALM is an Architecture and Language for Multithreading

• Hybrid Dataflow/Von Neumann (coarse-grained)

• Trebuchet Virtual Machine

• THLL (Annotations – C)

• Couillard Compiler

[Figure: TALM toolchain]
• C Source (.c) + Blocks Definition (THLL) → Annotated Source (.df.c)
• Couillard: Super-Instruction Code Extraction → Super-instructions Source (.lib.c); Dataflow Compilation → Dataflow ASM Code (.fl)
• Assembler: Dataflow Binary Code Generation → Dataflow Binary (.flb); Placement File Creation → Placement File (.pla)
• Library Compilation (gcc) → Super-instruction Library (.so)
• Trebuchet: Loader, PE 1 … PE N connected by a Network (e.g. Inst 3, Inst 50, Inst 52 on PE 1; Inst 19, Inst 39, Inst 43 on PE N)

TALM – NW Code

TALM – Results - Blackscholes

TALM – Results - NW

TALM Extra Features
• Static Scheduler – Can use profiler information
• Selective Workstealing – Custom heuristic
• Memory Speculation
  • Transactional Memories
  • Distributed Control – Commit Graph
  • Avoid manual synchronization (dummy edges)
  • No Compiler Support yet
• Error Detection and Recovery
  • Redundant execution
  • Distributed Control – in the graph
• Can have super-instructions in CUDA
  • Compiler support needed (data movements)
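The "Redundant execution" item under Error Detection and Recovery can be pictured with a small Python sketch: run a task twice on the same operands, commit the result only when both copies agree, and re-execute otherwise. This is only an illustration of the general idea. `FlakyDouble` and `run_redundant` are hypothetical names, and the sketch centralizes the check in one function, whereas the slides describe control that is distributed in the graph.

```python
class FlakyDouble:
    """Hypothetical super-instruction that corrupts its result on chosen calls."""

    def __init__(self, faulty_calls):
        self.calls = 0
        self.faulty_calls = set(faulty_calls)

    def __call__(self, x):
        self.calls += 1
        result = 2 * x
        if self.calls in self.faulty_calls:
            result += 1  # simulated transient fault
        return result


def run_redundant(task, operands, max_attempts=3):
    """Execute `task` twice per attempt; commit only when both copies agree."""
    for _ in range(max_attempts):
        primary = task(*operands)
        replica = task(*operands)
        if primary == replica:
            return primary  # copies agree: commit the result
        # mismatch detected: discard both results and re-execute
    raise RuntimeError("no agreement after %d attempts" % max_attempts)


task = FlakyDouble(faulty_calls=[2])  # the second call returns a wrong value
value = run_redundant(task, (21,))
```

Here the first attempt is rejected because the two copies disagree, and the second attempt commits. Comparing two copies detects a fault only when the copies differ, so identical faults in both executions would go unnoticed.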

Sucuri
• A minimalistic Dataflow Programming Library for Python
• Transparent Execution on Clusters
  • Mpi_enable = TRUE
  • Need to obey DF principles – All data treated as operands
  • Python serializes objects – easy implementation
• Main Classes
  • Scheduler – Pool of tasks
  • Graph – Container
  • Nodes – Related to functions

Sucuri - Architecture

Sucuri - Pipeline

[Figure: annotated pipeline code; the annotations mark the following steps]
• Create a Graph
• Create a Scheduler
• Create Nodes
• Add nodes to Graph
• Connect Nodes
• Start Scheduler
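The steps named on this slide can be walked through with a self-contained Python sketch. It is not Sucuri's API: `MiniGraph`, `MiniNode` and `MiniScheduler` are hypothetical classes written for this example, and the scheduler uses a thread pool from the standard library as its pool of tasks, where the slides describe cluster execution.

```python
from concurrent.futures import FIRST_COMPLETED, ThreadPoolExecutor, wait


class MiniNode:
    """Hypothetical node: wraps a function and waits for `n_inputs` operands."""

    def __init__(self, func, n_inputs):
        self.func = func
        self.inputs = [None] * n_inputs
        self.missing = n_inputs
        self.edges = []  # (destination node, destination input port)
        self.result = None

    def add_edge(self, dest, port):
        self.edges.append((dest, port))


class MiniGraph:
    """Hypothetical container for nodes."""

    def __init__(self):
        self.nodes = []

    def add(self, node):
        self.nodes.append(node)


class MiniScheduler:
    """Hypothetical scheduler: submits every ready node to a pool of workers."""

    def __init__(self, graph, n_workers=2):
        self.graph = graph
        self.n_workers = n_workers

    def start(self):
        with ThreadPoolExecutor(max_workers=self.n_workers) as pool:
            pending = {}  # future -> node being executed
            for node in self.graph.nodes:
                if node.missing == 0:  # source nodes need no operands
                    pending[pool.submit(node.func)] = node
            while pending:
                done, _ = wait(list(pending), return_when=FIRST_COMPLETED)
                for future in done:
                    node = pending.pop(future)
                    node.result = future.result()
                    for dest, port in node.edges:
                        dest.inputs[port] = node.result  # send the operand
                        dest.missing -= 1
                        if dest.missing == 0:  # all operands arrived: fire
                            pending[pool.submit(dest.func, *dest.inputs)] = dest


graph = MiniGraph()                                # create a graph
sched = MiniScheduler(graph, n_workers=2)          # create a scheduler
a = MiniNode(lambda: 3, 0)                         # create nodes
b = MiniNode(lambda: 4, 0)
total = MiniNode(lambda x, y: x * x + y * y, 2)
for node in (a, b, total):                         # add nodes to the graph
    graph.add(node)
a.add_edge(total, 0)                               # connect nodes
b.add_edge(total, 1)
sched.start()                                      # start the scheduler
```

After `sched.start()` returns, `total.result` holds the value computed from both sources. All bookkeeping on the graph happens in the scheduler thread, so the workers only ever run node functions and need no locks in this sketch.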

Sucuri – Results - LCS

Ongoing Work
• TALM
  • Compiler Improvements
  • Cluster Version
  • Placement Improvements
• Sucuri
  • Node Gallery
  • Graph Templates
  • Better scheduler
• Both
  • Full GPU Support
  • FPGA Support
  • Multiple implementations for the same task!
  • Applications and users!

[Figure: examples of gallery nodes and graph templates: ImageFilterNode, Fork/Join Graph, WavefrontGraph]
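A wavefront graph suits dynamic programming problems such as LCS, which appears among the Sucuri results. As a rough Python illustration of the dependency pattern, the sketch below fills the LCS table one anti-diagonal at a time: each cell depends only on its top, left and top-left neighbours, so all cells on the same anti-diagonal are independent and could run as parallel nodes. The function `lcs_wavefront` is written for this example and runs sequentially; it is not the implementation behind the reported results.

```python
def lcs_wavefront(a, b):
    """Length of the longest common subsequence, computed wave by wave."""
    n, m = len(a), len(b)
    table = [[0] * (m + 1) for _ in range(n + 1)]
    # Cells with i + j == d form one wave; waves must run in increasing d,
    # but the cells inside a wave do not depend on each other.
    for d in range(2, n + m + 1):
        for i in range(max(1, d - m), min(n, d - 1) + 1):
            j = d - i
            if a[i - 1] == b[j - 1]:
                table[i][j] = table[i - 1][j - 1] + 1
            else:
                table[i][j] = max(table[i - 1][j], table[i][j - 1])
    return table[n][m]
```

In a dataflow version, each cell (or each block of cells, to keep tasks coarse-grained) would be a node with edges from its three neighbours, and the wave order would emerge from operand availability instead of the outer loop.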

Our Dataflow Research Group
• Leandro Marzulo (UERJ)
• Tiago Alves
• Felipe França (UFRJ)
• Sandip Kundu (UMASS)
• Vítor Santos Costa (UPorto)
• Master Students (6 ongoing, 1 finished):
  • Brunno Goldstein – UFRJ
  • Leandro Santiago – UFRJ
  • Marcos Paulo Rocha – UFRJ
  • Leandro Rouberte – UFRJ
  • Alexandre Machado – UERJ
  • Julio Ho – UERJ
  • Alexandre Sardinha – Finished his Master – Petrobras
• Undergrad students (UERJ)
  • 6 finished – 3 are Master students now
  • 11 ongoing

Questions?

TALM – Results - RT

Sucuri – Hierarchical reduction

Sucuri - Wavefront
