TILEmpower-Gx36 - Architecture overview & performance benchmarks – Presented by Younghyun Jo 2013/12/18


TILEmpower-Gx36

- Architecture overview & performance benchmarks –

Presented by Younghyun Jo

2013/12/18


Computer Systems and Platforms Lab

Outline

Architecture Overview
- Motivation
- Specification of TILE-Gx8036 processors

Performance evaluations
- Computational performance evaluation
- Memory performance evaluation

Conclusion


Motivation of Tilera architectures


Motivation

Dr. Anant Agarwal
- A founder of Tilera Corp.
- Computer architecture researcher, professor of EECS at MIT
- He led the Alewife project and the Raw architecture project

MIT Alewife project (1990 ~ 1999)
- Alewife: a large-scale multiprocessor
- Cache-coherent distributed shared memory and user-level message passing in a single integrated hardware framework

Raw Processor (1997 ~ 2007)
- Tiled multicore architecture
- Wire-efficient multicore architecture (interconnection between tiles)
- Highly parallel VLSI; the compiler knows low-level details of the hardware


Motivation

Scalar Operand Networks [IEEE TPDS]: challenges in the design of scalable scalar operand networks and how to overcome them
- Frequency scalability
- Bandwidth scalability
- Deadlock and starvation
- Handling exceptional events
- Efficient operation-operand matching

Tiled multicore
- Distributed everything + routed interconnection
- Replace long wires with routed interconnect
- From a centralized clump of CPUs to distributed ALUs and a routed bypass network
- From a large centralized cache to a distributed shared cache


Specification of TILE-Gx8036 processors


TILE-Gx8036

TILE-Gx8036
- 36 cores
- DDR3 DRAM
- Rshim: boot controls, diagnostics
- TRIO: transactional I/O with DMA
- mPIPE: packet management
- MiCA: hardware accelerators (crypto and compression)


TILE-Gx8036

Each core

Processor
- 1.2 GHz
- 64-bit addressing mode
- 3-way VLIW CPU

Storage
- 32 KB L1I / L1D cache
- 256 KB L2 cache
- 9 MB coherent L3 cache: Dynamic Distributed Cache
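The 9 MB L3 figure is consistent with the per-tile L2 caches being pooled across all 36 tiles. The slide does not spell out this relationship, so the sketch below treats it as an assumption and only checks the arithmetic.

```python
# Per-tile figures are taken from the slide; the idea that the L3 capacity is
# the sum of all per-tile L2 caches is an assumption made for illustration.
TILES = 36
L2_KB_PER_TILE = 256


def aggregate_l2_mb(tiles: int = TILES, l2_kb: int = L2_KB_PER_TILE) -> float:
    """Total L2 capacity across all tiles, in MB (1 MB = 1024 KB)."""
    return tiles * l2_kb / 1024


print(aggregate_l2_mb())  # 9.0
```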


Processor Pipelines

The processor pipeline consists of 6 main stages: Fetch, Branch Predict, Decode, Execute 0, Execute 1, and Write Back


Processor Pipelines

Pipeline latencies


Switch Interfaces

Switch Interfaces
- IDN: Internal dynamic networks
- UDN: User dynamic networks
- RDN: Memory response networks
- QDN: Memory request networks
- SDN: Shared dynamic networks


Operating systems/Processes isolation

Hardwall
- Prevents unwanted communication between user applications running on adjacent tiles
- Programmable protection bit on each output port of the UDN or STN
- Hardwall also provides a powerful virtualization tool
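A minimal sketch of the hardwall idea, assuming a rectangular group of tiles on a 2D mesh: a protection bit is set on every output port that would lead out of the rectangle. This is a hypothetical model for illustration, not the Tilera API; the names `hardwall_bits` and `may_forward` and the coordinate convention are made up here.

```python
# Hypothetical model, not the Tilera API: tiles are (x, y) coordinates on a
# mesh, and each tile has one protection bit per output direction.
DIRECTIONS = {"N": (0, -1), "S": (0, 1), "W": (-1, 0), "E": (1, 0)}


def hardwall_bits(x0, y0, width, height):
    """Return {tile: set of protected output directions} for a rectangle.

    A direction is protected when the neighbouring tile in that direction
    lies outside the rectangle, so traffic cannot leave the region.
    """
    inside = {(x, y)
              for x in range(x0, x0 + width)
              for y in range(y0, y0 + height)}
    bits = {}
    for (x, y) in inside:
        bits[(x, y)] = {d for d, (dx, dy) in DIRECTIONS.items()
                        if (x + dx, y + dy) not in inside}
    return bits


def may_forward(bits, tile, direction):
    """A packet may leave `tile` via `direction` only if the bit is clear."""
    return direction not in bits[tile]


# A 2x2 region whose top-left tile is (1, 1).
region = hardwall_bits(1, 1, 2, 2)
print(sorted(region[(1, 1)]))  # ['N', 'W']
```

In this model, traffic between tiles inside the rectangle is unaffected, while any packet that tries to cross the boundary is stopped at the output port.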


Network Arbitration

Network Arbitration
- Packets requiring the same output port are blocked until the current packet has finished routing
- Arbitration is basically round robin
  - Round robin
  - Network priority round robin

Routing algorithm
- The X dimension is checked first
- The Y dimension is checked next
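The routing and arbitration rules above can be sketched as follows. This is an illustrative model under stated assumptions, not Tilera's implementation: `xy_route` and `RoundRobinArbiter` are hypothetical names, tiles are addressed as (x, y) pairs, and the "network priority round robin" variant and the blocking of a port while a packet is in flight are not modeled.

```python
from collections import deque


def xy_route(src, dst):
    """Dimension-order route: move along X until it matches, then along Y."""
    x, y = src
    path = [(x, y)]
    while x != dst[0]:
        x += 1 if dst[0] > x else -1
        path.append((x, y))
    while y != dst[1]:
        y += 1 if dst[1] > y else -1
        path.append((x, y))
    return path


class RoundRobinArbiter:
    """Grants one output port to requesting input ports in rotating order."""

    def __init__(self, inputs):
        self.order = deque(inputs)

    def grant(self, requests):
        """Return the next requester in rotation, or None if nobody asks."""
        for _ in range(len(self.order)):
            candidate = self.order[0]
            self.order.rotate(-1)  # candidate moves to the back either way
            if candidate in requests:
                return candidate
        return None


print(xy_route((0, 0), (2, 1)))  # [(0, 0), (1, 0), (2, 0), (2, 1)]
```

Because every packet resolves X before Y, two packets never wait on each other in a cycle of turns, which is the usual reason for choosing dimension-order routing on a mesh.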


System Software Stack

System Software Stack
- Tile Processor Hardware
- Hypervisor
- Supervisor: Tile Linux
- Applications / User

4 different modes for tiles
- Standard: SMP Tile Linux (2.6.38)
- Dataplane: Zero Overhead Linux
- Bare metal environments: user-created run-time environment
- Dedicated: tile for debugging


Bare metal environment

Bare Metal Environment
- Run-time environment that allows users to run applications that require direct access to the hardware

Abilities
- Full access to all hardware resources
- Install interrupt vectors
- Virtual/physical memory allocator
- I/O device setup
- UDN/IDN (can also communicate with SMP Linux)
- Libc utilities that do not depend on OS system services


Power management

Dynamic voltage and frequency scaling (DVS, DFS) are available

Configurable I/O and accelerator shutdowns

Hardware-initiated zero-latency Tile sleep

Software-initiated low-power Tile NAP mode


Multicore Development Environment

TILEmpower-Gx development environment
- x86 host machine: bern.snu.ac.kr
- MDE 4.1/4.2 (RPM)
- Operating systems
- Multicore profiler/debugger
- Evaluation platforms
- KVM, IDE, gcc, and so on
- $ tile-monitor -flags


Computational performance evaluation


Computational performance evaluation

Benchmark scenario
- Matrix multiplication with OpenMP
- C (1000 by 1000) = A (1000 by 1000) x B (1000 by 1000)
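The benchmark itself was written with OpenMP; its source is not included in the slides. As a rough Python analogue of a row-partitioned parallel loop, the sketch below splits the output rows of C into contiguous blocks and hands each block to a worker. The function names and the worker count are assumptions, and because of CPython's global interpreter lock this shows only the work partitioning, not a real speedup.

```python
from concurrent.futures import ThreadPoolExecutor


def matmul_rows(a, b, row_start, row_stop):
    """Compute rows [row_start, row_stop) of the product a x b."""
    inner, cols = len(b), len(b[0])
    return [[sum(a[i][k] * b[k][j] for k in range(inner)) for j in range(cols)]
            for i in range(row_start, row_stop)]


def parallel_matmul(a, b, workers=4):
    """Split the output rows into contiguous blocks, one task per block."""
    n = len(a)
    step = max(1, -(-n // workers))  # ceiling division
    blocks = [(s, min(s + step, n)) for s in range(0, n, step)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        parts = list(pool.map(lambda blk: matmul_rows(a, b, *blk), blocks))
    # pool.map preserves input order, so the blocks concatenate back into C.
    return [row for part in parts for row in part]


a = [[1, 2], [3, 4]]
b = [[5, 6], [7, 8]]
print(parallel_matmul(a, b, workers=2))  # [[19, 22], [43, 50]]
```

Each block writes a disjoint set of output rows, so the workers need no locking; the same property is what makes the outer loop of a matrix multiplication a natural target for a parallel-for construct.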

Performance

[Bar chart: normalized performance (y-axis, 0 to 1.2) for tile configurations 6x6, 3x6, 6x3, 3x3, 2x2, and 1x1]


Memory performance evaluation


Memory performance for each core

Memory access cycles for each core on ZOL (Zero Overhead Linux)
Blue: load buffer0 in node0 / Green: load buffer1 in node1
Legend: the number of cycles. Each cell lists the two cycle counts shown for that tile.

Tiles    +0       +1       +2       +3       +4       +5
0-5      104/114  106/112  108/109  109/108  112/106  114/104
6-11     100/109  102/107  104/105  106/103  108/100  109/100
12-17    104/114  106/112  108/109  109/108  112/106  114/104
18-23    104/114  106/112  108/109  109/108  112/106  114/104
24-29    100/109  102/107  104/105  106/103  108/100  109/100
30-35    104/114  106/112  108/109  109/108  112/106  ******

Figure labels: Memory Node 0 (Buffer 0), Memory Node 1 (Buffer 1), Faster row
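To see which groups of tiles the "Faster row" label may refer to, the ZOL cycle counts can be averaged per group of six consecutive tiles. This is a sketch under assumptions: the slide does not state that tiles are numbered six per row, and tile 35 is masked in the source, so it is skipped.

```python
# Cycle-count pairs for tiles 0..35 on ZOL, copied from the slide.
# Tile 35 is masked ("******") in the source, so it is recorded as None.
ROW_A = [(104, 114), (106, 112), (108, 109), (109, 108), (112, 106), (114, 104)]
ROW_B = [(100, 109), (102, 107), (104, 105), (106, 103), (108, 100), (109, 100)]
ZOL = ROW_A + ROW_B + ROW_A + ROW_A + ROW_B + ROW_A[:5] + [None]


def row_means(cycles, width=6):
    """Mean cycle count per group of `width` tiles, skipping masked tiles."""
    means = []
    for start in range(0, len(cycles), width):
        values = [v for pair in cycles[start:start + width] if pair for v in pair]
        means.append(sum(values) / len(values))
    return means


means = row_means(ZOL)
fastest = [i for i, m in enumerate(means) if m == min(means)]
print(fastest)  # [1, 4]
```

Under the six-per-row assumption, the groups holding tiles 6-11 and 24-29 have the lowest mean access time, at roughly 104.4 cycles against about 108.8 for the other groups.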


Memory performance for each core

Memory access cycles for each core on BME (Bare Metal Environment)
Blue: load buffer0 in node0 / Green: load buffer1 in node1
Legend: the number of cycles. Each cell lists the two cycle counts shown for that tile.

Tiles    +0       +1       +2       +3       +4       +5
0-5      103/113  105/111  107/108  108/107  111/105  113/103
6-11     100/108  100/106  103/104  105/102  107/100  109/98
12-17    103/113  105/111  107/108  108/107  111/105  113/103
18-23    103/113  105/111  107/108  108/107  111/105  113/103
24-29    100/108  100/106  103/104  105/102  107/100  109/98
30-35    100/113  105/111  107/108  108/107  111/105  113/103

Figure labels: Memory Node 0 (Buffer 0), Memory Node 1 (Buffer 1), Faster row


Memory controller

Memory controller block diagram


Thank you