TILEmpower-Gx36
- Architecture overview & performance benchmarks -
Presented by Younghyun Jo
2013/12/18
Computer Systems and Platforms Lab
Outline
- Architecture overview
  - Motivation
  - Specification of the TILE-Gx8036 processor
- Performance evaluation
  - Computational performance evaluation
  - Memory performance evaluation
- Conclusion
Motivation
- Dr. Anant Agarwal
  - A founder of Tilera Corp.
  - Computer architecture researcher, professor of EECS at MIT
  - Led the Alewife project and the Raw architecture project
- MIT Alewife project (1990 ~ 1999)
  - Alewife: a large-scale multiprocessor
  - Cache-coherent distributed shared memory and user-level message passing in a single integrated hardware framework
- Raw processor (1997 ~ 2007)
  - Tiled multicore architecture
  - Wire-efficient multicore architecture (interconnection between tiles)
  - Highly parallel VLSI; the compiler knows low-level details of the hardware
Motivation
- Scalar Operand Networks [IEEE TPDS]: challenges in the design of scalable scalar operand networks and how to overcome them
  - Frequency scalability
  - Bandwidth scalability
  - Deadlock and starvation
  - Handling exceptional events
  - Efficient operation-operand matching
- Tiled multicore
  - Distributed everything + routed interconnect
  - Replace long wires with a routed interconnect
  - From a centralized clump of CPUs to distributed ALUs and a routed bypass network
  - From a large centralized cache to a distributed shared cache
TILE-Gx8036
- TILE-Gx8036
  - 36 cores
  - DDR3 DRAM
  - Rshim: boot controls, diagnostics
  - TRIO: transactional I/O with DMA
  - mPIPE: packet management
  - MiCA: hardware accelerators for crypto and compression
TILE-Gx8036
- Each core
  - Processor
    - 1.2 GHz
    - 64-bit addressing mode
    - 3-way VLIW CPU
  - Storage
    - 32 KB L1I / L1D cache
    - 256 KB L2 cache
    - 9 MB coherent L3 cache: Dynamic Distributed Cache
Processor Pipelines
- Processor pipeline
  - Consists of 6 main stages: Fetch, Branch Predict, Decode, Execute 0, Execute 1, and Write Back
Switch Interfaces
- Switch interfaces
  - IDN: I/O dynamic network
  - UDN: User dynamic network
  - RDN: Memory response network
  - QDN: Memory request network
  - SDN: Shared dynamic network
Operating systems/Processes isolation
- Hardwall
  - Prevents unwanted communication between user applications running on adjacent tiles
  - Programmable protection bit on each output port of the UDN or STN
- Hardwall also provides a powerful virtualization tool
Network Arbitration
- Network arbitration
  - Packets requiring the same output port are blocked until the current packet has finished routing
  - Arbitration is basically round robin
    - Round robin
    - Network priority round robin
- Routing algorithm
  - The X dimension is checked first
  - The Y dimension is checked next
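The two ideas on this slide can be sketched in a few lines of Python. This is only an illustration, not Tilera's hardware logic: the coordinate tuples, the port names, and the `xy_route` and `RoundRobinArbiter` names are assumptions made for the example, and the "network priority round robin" variant is not modeled.

```python
def xy_route(src, dst):
    """Return the tiles visited from src to dst, resolving X first, then Y."""
    x, y = src
    dst_x, dst_y = dst
    path = [(x, y)]
    # Move along the X dimension until the column matches.
    while x != dst_x:
        x += 1 if dst_x > x else -1
        path.append((x, y))
    # Only then move along the Y dimension.
    while y != dst_y:
        y += 1 if dst_y > y else -1
        path.append((x, y))
    return path


class RoundRobinArbiter:
    """Picks which input port may use one output port for the next packet."""

    def __init__(self, ports):
        self.ports = list(ports)   # hypothetical input port names
        self.next_index = 0        # port that gets first chance next time

    def grant(self, requesting):
        """Grant the output to one requester; the others stay blocked."""
        count = len(self.ports)
        for offset in range(count):
            i = (self.next_index + offset) % count
            if self.ports[i] in requesting:
                # The search next time starts just after the winner.
                self.next_index = (i + 1) % count
                return self.ports[i]
        return None


if __name__ == "__main__":
    print(xy_route((0, 0), (2, 1)))
    arbiter = RoundRobinArbiter(["north", "east", "south", "west", "core"])
    print([arbiter.grant({"east", "west"}) for _ in range(3)])
```

Resolving one dimension completely before the other gives every source and destination pair exactly one path, and rotating the starting port after each grant keeps any one requester from being served twice before the others have had a turn.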
System Software Stack
- System software stack
  - Tile processor hardware
  - Hypervisor
  - Supervisor: Tile Linux
  - Applications / user
- 4 different modes for tiles
  - Standard: SMP Tile Linux (2.6.38)
  - Dataplane: Zero Overhead Linux
  - Bare metal environment: user-created run-time environment
  - Dedicated: tile for debugging
Bare metal environment
- Bare Metal Environment
  - Run-time environment that allows users to run applications that require direct access to the hardware
- Abilities
  - Full access to all hardware resources
  - Install interrupt vectors
  - Virtual/physical memory allocator
  - I/O device setup
  - UDN/IDN (can also communicate with SMP Linux)
  - Libc utilities that do not depend on OS system services
Power management
Dynamic voltage and frequency scaling (DVS, DFS) are available
Configurable I/O and accelerator shutdowns
Hardware-initiated zero-latency Tile sleep
Software-initiated low-power Tile NAP mode
Multicore Development Environment
- TILEmpower-Gx development environment
  - x86 host machine: bern.snu.ac.kr
  - MDE 4.1/4.2 (RPM)
  - Operating systems
  - Multicore profiler/debugger
  - Evaluation platforms
  - KVM, IDE, gcc, and so on
  - $ tile-monitor -flags
Computational performance evaluation
- Benchmark scenario
  - Matrix multiplication with OpenMP
  - C (1000 by 1000) = A (1000 by 1000) x B (1000 by 1000)
- Performance
  - [Figure: normalized performance (y-axis, 0 to 1.2) for tile configurations 6x6, 3x6, 6x3, 3x3, 2x2, and 1x1]
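As a rough illustration of how such a benchmark divides its work, the sketch below splits the rows of C among workers, which is the kind of decomposition a row-wise OpenMP parallel loop performs. It is not the benchmark code from the talk: the function names and the use of Python threads are assumptions for the example, and Python threads will not reproduce the speedups measured on the tiles.

```python
from concurrent.futures import ThreadPoolExecutor


def multiply_rows(a, b, row_start, row_end):
    """Compute rows [row_start, row_end) of C = A x B."""
    inner = len(b)
    cols = len(b[0])
    return [
        [sum(a[i][k] * b[k][j] for k in range(inner)) for j in range(cols)]
        for i in range(row_start, row_end)
    ]


def parallel_matmul(a, b, workers):
    """Split the rows of C into contiguous chunks, one chunk per task."""
    n = len(a)
    chunk = (n + workers - 1) // workers  # ceiling division
    ranges = [(start, min(start + chunk, n)) for start in range(0, n, chunk)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # map() keeps the chunks in row order, so they can be concatenated.
        parts = list(pool.map(lambda r: multiply_rows(a, b, r[0], r[1]), ranges))
    c = []
    for part in parts:
        c.extend(part)
    return c


if __name__ == "__main__":
    a = [[1, 2], [3, 4]]
    b = [[5, 6], [7, 8]]
    print(parallel_matmul(a, b, workers=2))
```

Each output row depends only on one row of A and all of B, so the chunks need no synchronization until the results are collected.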
Memory performance for each core
Memory access cycles for each core on ZOL (Zero Overhead Linux)
Memory node 0 holds buffer 0; memory node 1 holds buffer 1.
Each cell: load buffer 0 in node 0 (blue) / load buffer 1 in node 1 (green)

Tiles  0-5  : 104/114  106/112  108/109  109/108  112/106  114/104
Tiles  6-11 : 100/109  102/107  104/105  106/103  108/100  109/100   (faster row)
Tiles 12-17 : 104/114  106/112  108/109  109/108  112/106  114/104
Tiles 18-23 : 104/114  106/112  108/109  109/108  112/106  114/104
Tiles 24-29 : 100/109  102/107  104/105  106/103  108/100  109/100   (faster row)
Tiles 30-35 : 104/114  106/112  108/109  109/108  112/106  ***/***

Legend: the number of cycles
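The "faster row" observation can be checked directly from the ZOL cycle counts on this slide. The sketch below is a minimal example that assumes the 36 tiles are numbered row by row on a 6x6 grid; the names `row_means` and `fastest_rows` are made up for the illustration, and tile 35, shown as `***` on the slide, is left out of the averages.

```python
# (buffer 0 in node 0, buffer 1 in node 1) cycle counts from the ZOL slide.
ROW_SLOW = [(104, 114), (106, 112), (108, 109), (109, 108), (112, 106), (114, 104)]
ROW_FAST = [(100, 109), (102, 107), (104, 105), (106, 103), (108, 100), (109, 100)]

# Tiles 0-35 in order; None marks tile 35, which has no values on the slide.
ZOL_CYCLES = (
    ROW_SLOW + ROW_FAST + ROW_SLOW + ROW_SLOW + ROW_FAST + ROW_SLOW[:5] + [None]
)

TILES_PER_ROW = 6  # assumption: tiles are numbered row by row on a 6x6 grid


def row_means(cycles, tiles_per_row=TILES_PER_ROW):
    """Average all available cycle counts in each row of tiles."""
    means = []
    for start in range(0, len(cycles), tiles_per_row):
        samples = [
            value
            for pair in cycles[start:start + tiles_per_row]
            if pair is not None
            for value in pair
        ]
        means.append(sum(samples) / len(samples))
    return means


def fastest_rows(cycles):
    """Return the indices of the rows with the lowest average cycle count."""
    means = row_means(cycles)
    best = min(means)
    return [index for index, mean in enumerate(means) if mean == best]


if __name__ == "__main__":
    for index, mean in enumerate(row_means(ZOL_CYCLES)):
        print(f"row {index}: {mean:.2f} cycles on average")
    print("fastest rows:", fastest_rows(ZOL_CYCLES))
```

Under the row-by-row numbering assumption, the rows holding tiles 6-11 and 24-29 come out lowest, averaging about 104.4 cycles against about 108.8 for the other rows.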
Memory performance for each core
Memory access cycles for each core on BME (Bare Metal Environment)
Memory node 0 holds buffer 0; memory node 1 holds buffer 1.
Each cell: load buffer 0 in node 0 (blue) / load buffer 1 in node 1 (green)

Tiles  0-5  : 103/113  105/111  107/108  108/107  111/105  113/103
Tiles  6-11 : 100/108  100/106  103/104  105/102  107/100  109/98    (faster row)
Tiles 12-17 : 103/113  105/111  107/108  108/107  111/105  113/103
Tiles 18-23 : 103/113  105/111  107/108  108/107  111/105  113/103
Tiles 24-29 : 100/108  100/106  103/104  105/102  107/100  109/98    (faster row)
Tiles 30-35 : 100/113  105/111  107/108  108/107  111/105  113/103

Legend: the number of cycles
Memory controller
Memory controller block diagram