An Introduction to Goblin-Core64
A Massively Parallel Processor Architecture Designed for Complex Data Analytics
John D. Leidel
jleidel<at>ttu<dot>edu
Overview
• Data Intensive Computing Architectural Challenges
  – The destruction of cache efficiency using irregular algorithms
• Goblin-Core64 Architecture Infrastructure Design
  – Sustainable Exascale performance with data intensive applications
• Progress and Roadmap
  – The path forward
DATA INTENSIVE COMPUTING ARCHITECTURAL CHALLENGES
The destruction of cache efficiency using irregular algorithms
What is Big Data? …and how does it relate to HPC?
• Problem spaces outside of traditional HPC are now encountering the same problems that we find in HPC
  – Complexity
  – Time to Solution
  – Scale
• These problems are generally not
  – Simulating the physical world
  – Bound by simple floating point performance
  – Fixed in result set size as the problem scales
• These problems are generally
  – Sparse in nature
  – Built on complex [sometimes unconstrained] data types
  – Growing in result set size as the problem scales
• The other side of the HPC coin
Three Drivers to HPC Solutions
[Figure: three drivers, Time To Solution, Algorithmic Complexity, and Scale of Working Set, with three regions each labeled HPC]
Convergence Criteria for HPC Adoption
• Time + Complexity
  – Fraud Detection
  – High Performance Trading Analytics
• Time + Scale
  – Power Grid Analytics
  – Graph500 Benchmark
• Complexity + Scale
  – Epidemiology
  – Agent-Based Network Analytics
• Time + Complexity + Scale
  – Grand Challenge Problems
  – Cyber Analytics
Dense Solver Efficiency
[Figure: chart of top500 efficiency (axis 0.00% to 100.00%) for Tianhe-2 (MilkyWay-2), Sequoia - BlueGene/Q, K computer, Jaguar - Cray XT5-HE, Roadrunner, BlueGene/L, Earth-Simulator, ASCI White SP Power3, ASCI Red, SR2201/1024, Numerical Wind Tunnel, XP/S140, and CM-5/1024]
Sparse Solver Efficiency
[Figure: chart of Graph500 GTEPS/core (axis 0 to 1.8) for DOE/NNSA/LLNL Sequoia BGQ, UV 2000, GraphCREST-MC48, Dingus, Matterhorn, and Gordon; annotation: Cache-less Architectures]
GOBLIN-CORE64 ARCHITECTURE INFRASTRUCTURE DESIGN
Sustainable Exascale performance with data intensive applications
Goal
Build an architecture that efficiently maps programming model concepts to hardware in order to improve data intensive [sparse] application throughput
The Result: Goblin-Core64
• Hierarchical set of architectural modules that provide:
  – Native PGAS memory addressing
  – High efficiency RISC ISA
  – SIMD capabilities
  – Architectural notion of “tasks”
  – Latency hiding techniques
    • Single cycle context/task switching
  – Advanced synchronization techniques
    • Ease the burden of barriers and sync points by eliminating spin waits
  – Memory coalescing/aggregation
    • Local requests
    • Global requests
    • AMOs
  – Makes use of latest high bandwidth memory technologies
    • Hybrid Memory Cube
Goblin-Core64 Modules
[Figure: block diagram with labels Task Reg, Task Unit, Task Proc, ALU, SIMD, Task Group, MMU, GC64 Socket, Software Managed Scratchpad, HMC Memory Interface, AMO Unit, Coalesce Unit, SOC, NET, Peripherals, and Packet Engine; callouts numbered 1 to 4]
GC64 Module Hierarchy
• Task Unit
  – Small divisible unit; Register file + control logic
• Task Proc
  – Multiple Task Units + context switch control logic
• Task Group
  – Multiple Task Procs + local MMU
• GC64 Socket
  – Coalesce Unit: coalesces adjacent memory requests into a single payload
  – AMO Unit: intelligently handles local AMO requests
  – HMC Unit: HMC packet request engine + SERDES
  – Software Managed Scratchpad: on-chip memory
  – Packet Engine: off-chip memory interface
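The Coalesce Unit's job, merging adjacent memory requests into a single payload, can be illustrated in software. The sketch below is only a model under stated assumptions: the `Request` type, the `coalesce` function, and the 64-byte `max_payload` limit are hypothetical names and values chosen for illustration, not part of the GC64 specification.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Request:
    addr: int  # starting byte address
    size: int  # length in bytes


def coalesce(requests, max_payload=64):
    """Merge contiguous or overlapping requests into payloads of at most max_payload bytes.

    max_payload is a hypothetical limit, not a value from the GC64 slides.
    """
    merged = []
    for req in sorted(requests, key=lambda r: r.addr):
        if merged:
            last = merged[-1]
            last_end = last.addr + last.size
            new_end = max(last_end, req.addr + req.size)
            # Adjacent: the new request starts at or before the end of the last payload.
            if req.addr <= last_end and new_end - last.addr <= max_payload:
                merged[-1] = Request(last.addr, new_end - last.addr)
                continue
        merged.append(req)
    return merged


# Three 8-byte requests to neighboring addresses plus one distant request.
requests = [Request(0x100, 8), Request(0x110, 8), Request(0x108, 8), Request(0x200, 8)]
payloads = coalesce(requests)
print([(hex(p.addr), p.size) for p in payloads])
```

In this model the three neighboring requests leave the unit as one 24-byte payload, while the distant request stays separate.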
GC64 Scalable Units
GC64 Execution Model
• GC64 execution model provides “pressure-driven”, single cycle context switching between threads/tasks
• Pressure state machine provides fair-sharing of ALU based upon:
  – Number of outstanding requests
  – Statistical probability of a register stall
  – Number of cycles in current execution context
• Minimum execution is two instructions
  – Based upon instruction format
[Figure labels: ALU, SIMD, Task Unit Mux, Context Switch State Machine]
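A software model may help make the pressure-driven policy concrete. The sketch below combines the three inputs listed on the slide into a single score and honors the two-instruction minimum. The linear weighting, the weights themselves, the threshold, and every name (`TaskState`, `pressure`, `next_task`) are assumptions for illustration; the slides do not say how the hardware state machine combines these inputs.

```python
from dataclasses import dataclass

MIN_INSTRUCTIONS = 2  # from the slide: minimum execution is two instructions


@dataclass
class TaskState:
    task_id: int
    outstanding_requests: int = 0   # memory requests still in flight
    stall_probability: float = 0.0  # estimated chance of a register stall
    cycles_in_context: int = 0      # cycles since this task was switched in
    issued: int = 0                 # instructions issued since switch-in


def pressure(task, w_req=1.0, w_stall=4.0, w_cycles=0.25):
    """Hypothetical linear pressure score; higher means 'switch me out sooner'."""
    return (w_req * task.outstanding_requests
            + w_stall * task.stall_probability
            + w_cycles * task.cycles_in_context)


def next_task(current, candidates, threshold=4.0):
    """Pick the task that should own the ALU on the next cycle."""
    if current.issued < MIN_INSTRUCTIONS:
        return current  # must finish the minimum two-instruction slot
    if not candidates or pressure(current) < threshold:
        return current  # not enough pressure to justify a switch
    return min(candidates, key=pressure)  # favor the least-pressured waiting task


running = TaskState(0, outstanding_requests=3, stall_probability=0.5,
                    cycles_in_context=8, issued=2)
waiting = [TaskState(1, outstanding_requests=2), TaskState(2, stall_probability=0.1)]
print(next_task(running, waiting).task_id)
```

Here the running task has accumulated enough pressure to be switched out, and the waiting task with the fewest pending stalls takes over.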
GC64 Unified [PGAS] Addressing
• GC64 physical addressing provides block access to:
  – Local HMC [main] memory
  – Local software-managed scratchpad
  – Globally mapped [remote] memory
• Pointer arithmetic between memory spaces
• Obeys all the constraints of paged, virtual memory
Physical Address Specification
• Base Physical Address [33:0]
• CUB [37:33]
• Reserved [41:38]
• Socket [49:42]
• Unused [63:50]
Physical Address Destinations
• Local HMC Memory: One of 8 local HMC devices
• Scratchpad: CUB = 0xF
• Remote Memory: Remote Socket ID
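The physical address layout can be exercised with a few bit operations. The sketch below is an illustration under assumptions: the slide lists the CUB field as [37:33], which overlaps bit 33 of the base address, so the code assumes a 4-bit CUB field at [37:34] (wide enough to hold 0xF). It also assumes that an address is "remote" whenever its socket field differs from the local socket ID. The function names are hypothetical.

```python
BASE_BITS = 34                      # base physical address, bits [33:0]
CUB_SHIFT, CUB_BITS = 34, 4         # ASSUMPTION: CUB at [37:34]; the slide says [37:33]
SOCKET_SHIFT, SOCKET_BITS = 42, 8   # socket ID, bits [49:42]
SCRATCHPAD_CUB = 0xF                # from the slide: CUB = 0xF selects the scratchpad


def encode(socket, cub, base):
    """Pack socket, CUB and base address into a 64-bit physical address."""
    if not 0 <= socket < (1 << SOCKET_BITS):
        raise ValueError("socket out of range")
    if not 0 <= cub < (1 << CUB_BITS):
        raise ValueError("cub out of range")
    if not 0 <= base < (1 << BASE_BITS):
        raise ValueError("base out of range")
    return (socket << SOCKET_SHIFT) | (cub << CUB_SHIFT) | base


def decode(addr):
    """Unpack a physical address into (socket, cub, base)."""
    socket = (addr >> SOCKET_SHIFT) & ((1 << SOCKET_BITS) - 1)
    cub = (addr >> CUB_SHIFT) & ((1 << CUB_BITS) - 1)
    base = addr & ((1 << BASE_BITS) - 1)
    return socket, cub, base


def destination(addr, local_socket):
    """Classify an address as remote memory, scratchpad, or local HMC memory."""
    socket, cub, _ = decode(addr)
    if socket != local_socket:      # ASSUMPTION: any other socket ID means remote
        return "remote"
    if cub == SCRATCHPAD_CUB:
        return "scratchpad"
    return "local_hmc"


addr = encode(socket=3, cub=5, base=0x1234)
print(hex(addr), decode(addr), destination(addr, local_socket=3))
```

Because all three destinations share one address format, ordinary pointer arithmetic on the base field stays within the selected memory space.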
GC64 ISA
• Simple 64-bit instruction format
  – Two instructions per payload
    • Optional immediate payload
• Instruction control block
  – Specifies immediate values, breakpoints and vector register aliasing
GC64 Vector Register Aliasing
• Vector register aliasing provides access to scalar register file from SIMD unit
• No need for additional vector register file
  – Increasing the data path, not the physical storage
• Compiler optimizations can be used to perform complex, irregular operations
  – Vector-Scalar-Vector arithmetic
  – Vector Fill
  – Scatter/Gather
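One way to picture vector register aliasing is a single scalar register file that the SIMD unit reads as windows of consecutive registers. The sketch below is a toy model under assumptions: the 64-entry register file, the consecutive-register aliasing scheme, and the helper names are hypothetical, and the lane count of 4 is borrowed from the SIMD width quoted on the performance slide.

```python
SIMD_WIDTH = 4  # lane count quoted in the performance slide's footnote


class RegisterFile:
    """Single scalar register file; vectors are aliased views, not extra storage."""

    def __init__(self, num_regs=64):  # register count is a hypothetical choice
        self.r = [0] * num_regs

    def vec(self, base):
        """Read SIMD_WIDTH consecutive scalar registers as one vector."""
        return self.r[base:base + SIMD_WIDTH]

    def set_vec(self, base, values):
        """Write a vector back into the aliased scalar registers."""
        if len(values) != SIMD_WIDTH or base + SIMD_WIDTH > len(self.r):
            raise ValueError("vector does not fit the aliased window")
        self.r[base:base + SIMD_WIDTH] = values


def vfill(rf, dst, scalar_reg):
    """Vector Fill: broadcast one scalar register into every lane."""
    rf.set_vec(dst, [rf.r[scalar_reg]] * SIMD_WIDTH)


def vadd_scalar(rf, dst, src, scalar_reg):
    """Vector-Scalar-Vector arithmetic: add a scalar register to each lane."""
    rf.set_vec(dst, [x + rf.r[scalar_reg] for x in rf.vec(src)])


rf = RegisterFile()
rf.r[0], rf.r[1] = 7, 3
vfill(rf, dst=8, scalar_reg=0)
vadd_scalar(rf, dst=12, src=8, scalar_reg=1)
print(rf.vec(8), rf.vec(12), rf.r[12])
```

The last value printed is a plain scalar read of a register that was just written as a vector lane, which is the point of aliasing: a wider data path over the same physical storage.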
GC64 Potential Performance

| Task Groups | Task Procs | Task Units | Tasks/Socket | Peak Giga-ops | Bandwidth/Socket |
|---|---|---|---|---|---|
| 2 | 2 | 2 | 4 | 4 | 2.384 GB/s |
| 4 | 4 | 4 | 64 | 32 | 9.536 GB/s |
| 8 | 8 | 4 | 256 | 256 | 38.147 GB/s |
| 16 | 16 | 8 | 2048 | 1024 | 152.588 GB/s |
| 32 | 16 | 16 | 8192 | 4096 | 305.176 GB/s |
| 32 | 32 | 16 | 16384 | 8192 | 610.352 GB/s |
| 64 | 32 | 16 | 32768 | 16384 | 1220.703 GB/s |
| 128 | 32 | 16 | 65536 | 32768 | 2441.406 GB/s |
| *256 | 128 | 32 | 1048576 | 262144 | 19531.25 GB/s |
| *256 | *256 | *256 | 16777216 | 524288 | 39062.50 GB/s |

*Max config
– SIMD Width = 4
– Task Issue Rate = 2
– Cycle = 1 GHz
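The slide does not state how the Tasks/Socket and Peak Giga-ops columns were derived. The sketch below shows one plausible reading, offered as an assumption: tasks per socket as the product of the three unit counts, and peak giga-ops as Task Groups × Task Procs × SIMD width × issue rate × clock. This reproduces Tasks/Socket for every row except the first, and Peak Giga-ops from the 32/16/16 row onward; the smaller configurations list lower values, so the formulas should not be taken as the author's method.

```python
SIMD_WIDTH = 4        # from the slide footnote
TASK_ISSUE_RATE = 2   # from the slide footnote
CLOCK_GHZ = 1         # from the slide footnote


def tasks_per_socket(groups, procs, units):
    """Assumed: every Task Unit in every Task Proc of every Task Group holds one task."""
    return groups * procs * units


def peak_giga_ops(groups, procs):
    """Assumed: each Task Proc issues TASK_ISSUE_RATE SIMD operations per cycle."""
    return groups * procs * SIMD_WIDTH * TASK_ISSUE_RATE * CLOCK_GHZ


for groups, procs, units in [(32, 16, 16), (128, 32, 16), (256, 256, 256)]:
    print(groups, procs, units,
          tasks_per_socket(groups, procs, units),
          peak_giga_ops(groups, procs))
```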
PROGRESS AND ROADMAP
The path forward
GC64 Progress Report
• Complete
  – GC64 ISA definition
  – Physical Address Format
  – Execution Model
  – HMC Simulator [to be used in the GC64 sim]
    • 2 x papers submitted, third paper in progress
    • First academic publications on Hybrid Memory Cube technology
• In Progress
  – Architecture specification document
  – GC64 Simulator
  – ABI definition
  – Virtual addressing model
  – Compiler & Binutils [LLVM]
• Active Research Topics
  – Memory coalescing & AMO techniques [Spring 2014]
  – Context switch pressure model
  – Software managed scratchpad organization
  – Off-chip network protocol
  – Thread/task runtime optimization
HMC-Sim Stream Triad Results
Goblin-Core64
• Source code and specification are licensed under a BSD license
• www.gc64.org
  – Source code
  – Architecture documentation
  – Developer documentation
References
[1] John D. Leidel. Convey ThreadSim: A Simulation Framework for Latency-Tolerant Architectures. High Performance Computing Symposium: HPC2011, Boston, MA. April 6, 2011.
[2] John D. Leidel. Designing Heterogeneous Multithreaded Instruction Sets from the Programming Model Down. 2012 SIAM Conference on Parallel Processing for Scientific Computing, Savannah, Georgia. February 2012.
[3] John D. Leidel, Kevin Wadleigh, Joe Bolding, Tony Brewer, and Dean Walker. 2012. CHOMP: A Framework and Instruction Set for Latency Tolerant, Massively Multithreaded Processors. In Proceedings of the 2012 SC Companion: High Performance Computing, Networking Storage and Analysis (SCC ’12). IEEE Computer Society, Washington, DC, USA, 232-239.
[4] John D. Leidel. Toward a General Purpose Partitioned Global Virtual Address Specification for Heterogeneous Exascale Architectures. 2013 Exascale Applications and Software Conference, Edinburgh, Scotland, UK. April 2013.
[5] John D. Leidel, Geoffrey Rogers, Joe Bolding. Toward a Scalable Heterogeneous Runtime System for the Convey MX Architecture. 2013 Workshop on Multithreaded Architectures and Applications, Boston, MA. May 2013.
[6] https://code.google.com/p/goblin-core/
[7] http://www.hybridmemorycube.org/
[8] http://www.hotchips.org/wp-content/uploads/hc_archives/hc23/HC23.18.3-memory-FPGA/HC23.18.320-HybridCube-Pawlowski-Micron.pdf
[9] John D. Leidel, Yong Chen. A High Fidelity, and Accurate Simulation Framework for Hybrid Memory Cube Devices. 2014 International Parallel and Distributed Processing Symposium. Submitted.
BACKUP
What did we learn from CHOMP?
• Pros
  – We can design tightly coupled ISAs and runtime models that are extremely efficient
    • Each instruction becomes precious & necessary
  – Code generation is quite natural
    • Allows the compiler the best opportunities for optimization
  – Latency hiding characteristics function as designed
  – Single-cycle context switch mechanisms function as designed
  – AMOs are increasingly useful
    • For more than just memory protection
  – RISC ISAs are still providing high performance architectures
    • VLSI and JIT’ing are unnecessary overhead/area
What did we learn from CHOMP?
• Cons
  – NEED MORE BANDWIDTH!
    • Tests with dense per-thread memory operations could utilize ~4X more bandwidth with no other changes
  – Designing for an FPGA has its constraints
  – The lack of ILP hinders arithmetic throughput in some applications
    • Even SpMV kernels can utilize ILP
  – Paged virtual memory is always expensive
    • Not all applications/programmers exploit large pages
  – Use the native runtime!
    • Don’t simply rely on the compiler to generate efficient code for higher-level parallel languages [OpenMP, et al.]. Use the machine-level runtime
  – Cache is bad, but coalescing is good
    • Cache is expensive to implement and often impedes performance
    • We can occasionally take advantage of spatial locality at access time
GC64 Research
• PGAS Addressing
  – How do we develop a segmented address directory w/o the use of a TLB?
  – How do we distribute and maintain this directory while providing applications “virtual memory” security?
• Efficient Synchronization
  – Building efficient synchronization algorithms using the GC64 ISA mechanisms and available bandwidth is orthogonal to traditional implementations
• Memory Coalescing
  – Development of memory request coalescing algorithms for local, remote and AMO-type requests
  – How do our synchronization techniques play into this?
• Software-Managed Scratchpad
  – Is there room/desire to add features such as this?
  – What additional pressure does this put on the programming model/compiler?
• Compilation Techniques
  – Providing MTA-C style loop transformations
  – https://sites.google.com/site/parallelizationforllvm/loop-transforms
• Runtime Models
  – Building a machine-level optimized runtime that is programming model agnostic
• HMC Integration
  – HMC 1.0 specification is available. Very different than traditional DRAM technology