DATAFLOW FOR HIGH PERFORMANCE COMPUTING
Leandro Marzulo
Universidade do Estado do Rio de Janeiro
leandro@ime.uerj.br
No more free lunch…
• Can’t buy a new processor and expect performance to improve automatically.
• Parallel programming is a must!
  • Average programmers don’t know how to do it
  • Parallel implementation may not scale
  • Synchronization
• Heterogeneous Systems
  • So many devices – CPU, GPU, Xeon Phi, FPGA …
  • So many libraries/languages – CUDA, OpenCL, TBB, OpenMP, MPI, Pthreads, VHDL…
• TOO MUCH TO LEARN!
Sweet times ahead..
• Time to think out of the box
• To experiment with different stuff
• To revisit old concepts
• To rethink the way we teach programming
• To connect to different fields and research groups
The industry is investing!!!
Why Dataflow?
Just because it feels natural!
Dataflow x Von Neumann
| Characteristic | Dataflow | Von Neumann |
| --- | --- | --- |
| Register File | ✖ | ✔ |
| Program Counter | ✖ | ✔ |
| Control Flow | Steer (one per operand) | Branches and Jumps |
| Parallelism | Natural (Parallelism Explosion) | Pipeline, Branch Prediction, Tomasulo, ROB, … |
| Language requirements | Functional (no side effects) * | Nonrestrictive |
| Compilation difficulties | Control Flow (especially loops and functions) | Several architecture-specific optimizations |
* WaveScalar and its wave-ordering annotation scheme
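The dataflow column of the comparison can be made concrete with a small sketch. The code below is only an illustration of the firing rule (an instruction runs as soon as all of its operands have arrived, with no program counter) and of a steer routing one operand according to a boolean. The names `Node`, `Steer` and `run` are hypothetical and do not come from any of the systems discussed in this talk.

```python
import operator


class Node:
    """A dataflow instruction: it fires once every input port holds an operand."""

    def __init__(self, name, func, n_inputs):
        self.name = name
        self.func = func
        self.inputs = [None] * n_inputs
        self.present = [False] * n_inputs
        self.targets = []  # (destination node, destination port) pairs

    def connect(self, target, port):
        self.targets.append((target, port))

    def receive(self, port, value, ready):
        self.inputs[port] = value
        self.present[port] = True
        if all(self.present):  # firing rule: all operands available
            ready.append(self)

    def fire(self, ready):
        result = self.func(*self.inputs)
        self.present = [False] * len(self.inputs)
        for target, port in self.targets:
            target.receive(port, result, ready)


class Steer(Node):
    """Port 0 carries a boolean, port 1 the data operand to be routed."""

    def __init__(self, name):
        super().__init__(name, None, 2)
        self.if_true = []
        self.if_false = []

    def fire(self, ready):
        control, value = self.inputs
        self.present = [False, False]
        for target, port in (self.if_true if control else self.if_false):
            target.receive(port, value, ready)


def run(a, b):
    """Evaluate (a + b) * (a - b) and steer the product by its sign."""
    positive, non_positive, ready = [], [], []
    add = Node("add", operator.add, 2)
    sub = Node("sub", operator.sub, 2)
    mul = Node("mul", operator.mul, 2)
    is_pos = Node("is_pos", lambda x: x > 0, 1)
    steer = Steer("steer")
    pos_sink = Node("pos_sink", positive.append, 1)
    neg_sink = Node("neg_sink", non_positive.append, 1)

    add.connect(mul, 0)
    sub.connect(mul, 1)
    mul.connect(is_pos, 0)
    mul.connect(steer, 1)
    is_pos.connect(steer, 0)
    steer.if_true.append((pos_sink, 0))
    steer.if_false.append((neg_sink, 0))

    # Inject the starter operands; there is no program counter.
    for node in (add, sub):
        node.receive(0, a, ready)
        node.receive(1, b, ready)
    while ready:
        ready.pop().fire(ready)  # any ready node may fire next
    return positive, non_positive
```

Here `add` and `sub` are both ready after the starter operands are injected, so either could fire first (or both in parallel on real hardware); `mul` cannot fire until both results have arrived.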
Dataflow Revives!
• TERAFLUX (Unisi, BSC, Microsoft, HP, …)
  • Language
  • Compiler
  • Simulator (no actual HW yet)
• OmpSs (BSC)
  • Heterogeneous
• TBB Flowgraph (Intel)
  • Create and connect nodes
  • Associate them to Lambda Functions
  • Inject starter operands
Maxeler
• Static Dataflow – DAGs (mostly)
• FPGA based – DFE (DataFlow Engine)
• Michael Flynn – MPP / SBAC-PAD 2014 Keynote
• More performance requires more effort (Flynn’s words)
• Compiler – Dataflow Graph in FPGA
• Galava DFE – Academic version (USD 4999)
  • 500 multipliers
  • 12 GB RAM
  • PCI-E
Maxeler - Products
• CPUs plus DFEs: Intel Xeon CPU cores and up to 6 DFEs with 288GB of RAM
• DFEs shared over Infiniband: up to 8 DFEs with 384GB of RAM and dynamic allocation of DFEs to CPU servers
• Low latency connectivity: Intel Xeon CPUs and 1-2 DFEs with up to six 10Gbit Ethernet connections
• MaxWorkstation: desktop development system
• MaxCloud: on-demand scalable accelerated compute resource, hosted in London
Maxeler - RTM
• 3U System
  • 1U traditional CPU node
  • 2 x MPC-X 2000 (16 DFEs)
  • Less than 2.5KW power usage
• Performance = 80 x 16 core Intel nodes!
  • 27x space reduction
  • 15x power consumption reduction
  • 5x improvement on total cost of ownership
• There are other similar examples
TALM
• TALM is an Architecture and Language for Multithreading
• Hybrid Dataflow/Von Neumann (coarse-grained)
• Trebuchet Virtual Machine
• THLL (Annotations – C)
• Couillard Compiler
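TALM toolchain (diagram)
• Blocks Definition (THLL): C Source (.c) → Annotated Source (.df.c)
• Couillard
  • Super-Instruction Code Extraction → Super-instructions Source (.lib.c)
  • Dataflow Compilation → Dataflow ASM Code (.fl)
• Library Compilation (gcc): .lib.c → Super-instruction Library (.so)
• Assembler
  • Dataflow Binary Code Generation → Dataflow Binary (.flb)
  • Placement File Creation → Placement File (.pla)
• Trebuchet
  • Loader reads the Dataflow Binary, the Placement File and the Super-instruction Library
  • Instructions are placed on PE 1 … PE N (e.g. Inst 3, Inst 50, Inst 52 on PE 1; Inst 19, Inst 39, Inst 43 on PE N)
  • PEs communicate through a Network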
TALM – NW Code
TALM – Results - Blackscholes
TALM – Results - NW
TALM Extra Features
• Static Scheduler – Can use profiler information
• Selective Workstealing – Custom heuristic
• Memory Speculation
  • Transactional Memories
  • Distributed Control – Commit Graph
  • Avoid manual synchronization (dummy edges)
  • No Compiler Support yet
• Error Detection and Recovery
  • Redundant execution
  • Distributed Control – in the graph
• Can have super-instructions in CUDA
  • Compiler support needed (data movements)
Sucuri
• A minimalistic Dataflow Programming Library for Python
• Transparent Execution on Clusters
  • Mpi_enable = TRUE
  • Need to obey DF principles – All data treated as operands
  • Python serializes objects – easy implementation
• Main Classes
  • Scheduler – Pool of tasks
  • Graph – Container
  • Nodes – Related to functions
Sucuri - Architecture
Sucuri - Pipeline
Create a Graph
Create a Scheduler
Create Nodes
Add nodes to Graph
Connect Nodes
Start Scheduler
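The pipeline steps can be sketched with a toy stand-in written only with the Python standard library. This is not Sucuri's actual API: `ToyGraph`, `ToyNode`, `ToyScheduler` and their methods are hypothetical names chosen to mirror the three main classes (graph as container, nodes tied to functions, scheduler with a pool of tasks), and the worker pool here is a local thread pool rather than a cluster.

```python
from concurrent.futures import FIRST_COMPLETED, ThreadPoolExecutor, wait


class ToyNode:
    """A node wraps a function and waits for one operand per input port."""

    def __init__(self, func, n_inputs):
        self.func = func
        self.operands = [None] * n_inputs
        self.missing = n_inputs
        self.edges = []  # (destination node, destination port) pairs

    def add_edge(self, dst, port):
        self.edges.append((dst, port))


class ToyGraph:
    """A plain container of nodes."""

    def __init__(self):
        self.nodes = []

    def add(self, node):
        self.nodes.append(node)


class ToyScheduler:
    """Submits ready nodes to a worker pool and forwards their results."""

    def __init__(self, graph, n_workers):
        self.graph = graph
        self.n_workers = n_workers

    def start(self):
        with ThreadPoolExecutor(max_workers=self.n_workers) as pool:
            # Nodes without inputs are ready immediately.
            running = {pool.submit(n.func): n
                       for n in self.graph.nodes if n.missing == 0}
            while running:
                done, _ = wait(running, return_when=FIRST_COMPLETED)
                for future in done:
                    node = running.pop(future)
                    value = future.result()
                    # Operands are matched here, in the scheduler thread.
                    for dst, port in node.edges:
                        dst.operands[port] = value
                        dst.missing -= 1
                        if dst.missing == 0:
                            running[pool.submit(dst.func, *dst.operands)] = dst


def run_example():
    results = []
    graph = ToyGraph()                          # create a graph
    sched = ToyScheduler(graph, n_workers=2)    # create a scheduler
    a = ToyNode(lambda: 2, 0)                   # create nodes
    b = ToyNode(lambda: 3, 0)
    add = ToyNode(lambda x, y: x + y, 2)
    out = ToyNode(results.append, 1)
    for node in (a, b, add, out):               # add nodes to the graph
        graph.add(node)
    a.add_edge(add, 0)                          # connect nodes
    b.add_edge(add, 1)
    add.add_edge(out, 0)
    sched.start()                               # start the scheduler
    return results
```

In this sketch the two source nodes may run concurrently, while `add` is only submitted once both of its operands have been delivered, which is the same "all data treated as operands" discipline the Sucuri slide asks for.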
Sucuri – Results - LCS
Ongoing Work
• TALM
  • Compiler Improvements
  • Cluster Version
  • Placement Improvements
• Sucuri
  • Node Gallery (e.g. ImageFilterNode)
  • Graph Templates (e.g. Fork/Join Graph, WavefrontGraph)
  • Better scheduler
• Both
  • Full GPU Support
  • FPGA Support
  • Multiple implementations for the same task!
  • Applications and users!
Our Dataflow Research Group
• Leandro Marzulo (UERJ)
• Tiago Alves
• Felipe França (UFRJ)
• Sandip Kundu (UMASS)
• Vítor Santos Costa (UPorto)
• Master Students (6 ongoing, 1 finished):
  • Brunno Goldstein – UFRJ
  • Leandro Santiago – UFRJ
  • Marcos Paulo Rocha – UFRJ
  • Leandro Rouberte – UFRJ
  • Alexandre Machado – UERJ
  • Julio Ho – UERJ
  • Alexandre Sardinha – Finished his Master – Petrobras
• Undergrad students (UERJ)
  • 6 finished – 3 are Master students now
  • 11 ongoing
Questions?
TALM – Results - RT
Sucuri – Hierarchical reduction
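The figure for this slide did not survive in the transcript. As a generic illustration of what a hierarchical (tree) reduction looks like, and not a reconstruction of the Sucuri graph itself, the sketch below combines partial results pairwise, level by level. The function name `tree_reduce` is hypothetical.

```python
import operator


def tree_reduce(values, combine):
    """Reduce values pairwise; return (result, number of levels used).

    All combinations within one level are independent of each other,
    so in a dataflow graph they could be separate nodes running in parallel.
    """
    level = list(values)
    if not level:
        raise ValueError("cannot reduce an empty sequence")
    depth = 0
    while len(level) > 1:
        # Pair up neighbours: (0, 1), (2, 3), ...
        nxt = [combine(level[i], level[i + 1])
               for i in range(0, len(level) - 1, 2)]
        if len(level) % 2:  # an unpaired last element moves up unchanged
            nxt.append(level[-1])
        level = nxt
        depth += 1
    return level[0], depth


total, levels = tree_reduce(range(1, 9), operator.add)  # 8 inputs, 3 levels
```

With n inputs the number of levels grows as the base-2 logarithm of n (rounded up), instead of the n - 1 sequential steps of a linear reduction.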
Sucuri - Wavefront
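The figure for this slide is also missing from the transcript. To show the dependency pattern that a wavefront graph captures, here is a sequential sketch using LCS (the benchmark named in the results slide) as the example: each cell depends on its upper, left and upper-left neighbours, so all cells on one anti-diagonal are independent and could be separate dataflow nodes. The function `lcs_wavefront` is a hypothetical illustration, not code from Sucuri.

```python
def lcs_wavefront(a, b):
    """Return (LCS length, number of waves), sweeping anti-diagonals."""
    rows, cols = len(a), len(b)
    if rows == 0 or cols == 0:
        return 0, 0
    table = [[0] * (cols + 1) for _ in range(rows + 1)]
    waves = 0
    # Cells (i, j) with i + j == d form one wave; they only read
    # cells from waves d - 1 and d - 2, so they can run in parallel.
    for d in range(2, rows + cols + 1):
        for i in range(max(1, d - cols), min(rows, d - 1) + 1):
            j = d - i
            if a[i - 1] == b[j - 1]:
                table[i][j] = table[i - 1][j - 1] + 1
            else:
                table[i][j] = max(table[i - 1][j], table[i][j - 1])
        waves += 1
    return table[rows][cols], waves


length, waves = lcs_wavefront("AGGTAB", "GXTXAYB")
```

In practice a wavefront graph would typically assign a block of cells to each node rather than a single cell, so that the work per node outweighs the cost of passing operands.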