phytium 64 core cpu preview
Post on 16-Apr-2017
1.966 Views
Preview:
TRANSCRIPT
Mars: A 64-core ARMv8 Processor
Charles Zhang
Phytium Technology Co., Ltd
Phytium Technology Co., Ltd
Statements
The following slides are presented to introduce the
general features of one of our products, instead of any
commitment about it. It is for information purposes
only, and may not be incorporated into any contract. It
is not suggested to make purchasing decisions
accordingly. The development, release, and timing of
any features or functionality described here remains at
the sole discretion of Phytium.
2
Phytium Technology Co., Ltd
A Brief Introduction of Phytium
China corporation, founded in 2012
Guangzhou
Tianjin
Vision
Leading edge CPU and ASIC provider in China
Market focuses on chips for
Internet & Cloud Computing infrastructure
Traditional workload mainframe servers
3
Phytium Technology Co., Ltd
China is a Fast-growing Server Market
4
Company
1Q15
Revenue
1Q15 Market
Share (%)
1Q14
Revenue
1Q14 Market
Share (%)
1Q15-1Q14
Growth (%)
HP 3,191,694,948 23.8 2,890,992,229 25.5 10.4
Dell 2,296,473,026 17.1 2,006,639,006 17.7 14.4
IBM 1,887,939,141 14.1 2,244,631,789 19.8 -15.9
Lenovo 970,254,659 7.2 127,973,470 1.1 658.2
Cisco 890,179,930 6.6 616,620,000 5.4 44.4
Others 4,157,871,704 31.0 3,469,383,444 30.6 19.8
Total 13,394,413,409 100.0 11,356,239,939 100.0 17.9
Company
1Q15
Revenue
1Q15 Market
Share (%)
1Q14
Revenue
1Q14 Market
Share (%)
1Q15-1Q14
Growth (%)
Inspur 332,613,480 21 227,328,256 17 46
Dell 322,063,140 20 246,281,271 19 31
Lenovo 295,914,571 18 80,084,826 6 270
HP 217,487,450 14 167,775,923 13 30
Huawei 197,490,419 12 189,963,266 14 4
Sugon 140,377.091 9 70,705,366 5 99
Others 104,566,737 6 329,549,621 25 -68
Total 1,610,512,888 100.0 1,311,688,529 100.0 23
Source: Gartner (May 2015)
China
WW
Phytium Technology Co., Ltd
What is Mars for?
5
High performance
High volume of memory
High bandwidth memory access
High bandwidth I/O access
Large scale cache coherency maintained
Moderate performance
High power efficiency
High density computing
High bandwidth memory
access
Low cost
Mars
Earth
Phytium Technology Co., Ltd
Mars Overview
Architecture Features
64 Xiaomi cores, ARMv8
compatible
Hardware-maintained global
cache coherency
Panel-based data affinity
architecture
Mesh topology on chip network
32MB L2 cache
8 Cache & Memory Chips (CMC)
128MB L3 cache
16 DDR3-1600 channels
Two 16-lane PCIE3.0 i/f
ECC and parity protection on all
caches, tags and TLBs
6
Physical • ~180M instances
• 2.0GHz@28nm
• 120W
Performance • Peak:512GFLOPS
• Mem BW:204GB/s
• I/O BW: 32GB/s
panel0 panel1 panel3 panel2
panel4 panel5 panel7 panel6
CM
C
PCIe
PCIe
DDR3
DDR3
CM
C
DDR3
DDR3
CM
C
DDR3
DDR3
CM
C
DDR3
DDR3
CMC
DD
R3
DD
R3
CMC
DD
R3
DD
R3
CMC
DD
R3
DD
R3
CMC
DD
R3
DD
R3
Phytium Technology Co., Ltd
Panel Architecture
Eight Xiaomi Cores
Compatible design with ARMv8 arch license
Both AArch32 and AArch64 modes
EL0~EL3 supported
ASIMD-128 supported
Adv. hybrid Branch Prediction
4 fetch/4 decode/4 dispatch Out-of-Order
superscalar pipeline
Cache Hierarchy
Separated L1 ICache and L1 Dcache
Shared L2 cache, totally 4MB
Directory-based cache coherency
maintenance
Directory Control Unit (DCU)
Routing Cell
7
Xiaomi
Xiaomi
Xiaomi
Xiaomi L2cache
Routing Cell
DCU
DCU
Xiaomi
Xiaomi
Xiaomi
Xiaomi L2cache
6000μm
10600μm
Phytium Technology Co., Ltd 8
Xiaomi Core
ITLB I Cache BTB
DirPre
IndPre
SRS Loop
Detect
Instruction Buffer
decoder decoder decoder decoder
Rename Logic
Arch.
Reg
file Phy.
Reg
file Dispatch Logic
Reorder
Buffer
SCInt/VT
Queue
FP/VT
Queue
LD/ST
Queue
ALU
/BR
FMA
C/FDI
V
FMA
C/FDI
V
DTLB
D Cache
L2 Cache
STB & Prefetch
Prefetch
Debug
/Trace
/Interrupt
/Timer ALU
/BR
MUL
/DIV
MUL
/DIV
MCI/VT
Queue
Phytium Technology Co., Ltd
ITLB I Cache BTB
DirPre
IndPre
SRS Loop
Detect
Instruction Buffer
decoder decoder decoder decoder
Rename Logic
Arch.
Reg
file Phy.
Reg
file Dispatch Logic
Reorder
Buffer
SCInt/VT
Queue
FP/VT
Queue
LD/ST
Queue
ALU
/BR
FMA
C/FDI
V
FMA
C/FDI
V
DTLB
D Cache
L2 Cache
STB & Prefetch
Prefetch
Debug
/Trace
/Interrupt
/Timer ALU
/BR
MUL
/DIV
MUL
/DIV
MCI/VT
Queue
9
Xiaomi Core Front End
ITLB I Cache BTB
DirPre
IndPre
SRS Loop
Detect
Instruction Buffer
Prefetch
• 32KB L1 instr. Cache
• Next line prefetch
• Hybrid Branch Predictor
• 2048-entry BTB
• Direction predict with TAGE predictor
• 512-entry indirect predictor
• 48-entry Speculative Return Stack
• Four instructions fetched per cycle
• 32-entry instruction buffer
• Loop detect and Instr. Cache bypass
Phytium Technology Co., Ltd
ITLB I Cache BTB
DirPre
IndPre
SRS Loop
Detect
Instruction Buffer
decoder decoder decoder decoder
Rename Logic
Arch.
Reg
file Phy.
Reg
file Dispatch Logic
Reorder
Buffer
SCInt/VT
Queue
FP/VT
Queue
LD/ST
Queue
ALU
/BR
FMA
C/FDI
V
FMA
C/FDI
V
DTLB
D Cache
L2 Cache
STB & Prefetch
Prefetch
Debug
/Trace
/Interrupt
/Timer ALU
/BR
MUL
/DIV
MUL
/DIV
MCI/VT
Queue
10
Xiaomi Core Decode, Rename & Dispatch
• Up to four instructions
decoded per cycle
• 192 physical registers
• Up to four instructions
renamed per cycle decoder decoder decoder decoder
Rename Logic
Arch.
Reg
file Phy.
Reg
file Dispatch Logic
Reorder
Buffer
• Up to four instructions dispatched per cycle
• Reorder buffer can hold 160 instructions, and about 210+ instructions
can be in-flight in the whole pipeline.
• Dispatch in-order, execution out-of-order, retirement in-order.
Phytium Technology Co., Ltd
ITLB I Cache BTB
DirPre
IndPre
SRS Loop
Detect
Instruction Buffer
decoder decoder decoder decoder
Rename Logic
Arch.
Reg
file Phy.
Reg
file Dispatch Logic
Reorder
Buffer
SCInt/VT
Queue
FP/VT
Queue
LD/ST
Queue
ALU
/BR
FMA
C/FDI
V
FMA
C/FDI
V
DTLB
D Cache
L2 Cache
STB & Prefetch
Prefetch
Debug
/Trace
/Interrupt
/Timer ALU
/BR
MUL
/DIV
MUL
/DIV
MCI/VT
Queue
11
Xiaomi Core Function Units
• Two separated 16-entry integer and ASIMD queues shared by four
integer units
• Two integer unit can process single-cycle integer instructions and
integer SIMD instructions, one can also process branch instructions.
• Two integer units can process multi-cycle integer instructions and
integer SIMD instructions.
• One shared16-entry floating point and ASIMD queue
• Two FP/ASIMD units equipped, which can be combined into one
lockstep ASIMD unit.
• FMA supported in both units.
• FMUL: 3cycles, FADD: 3cycles, FMA: 6cycles
SCInt/VT
Queue
FP/VT
Queue
ALU
/BR
FMA
C/FDI
V
FMA
C/FDI
V
ALU
/BR
MUL
/DIV
MUL
/DIV
MCI/VT
Queue
Phytium Technology Co., Ltd
ITLB I Cache BTB
DirPre
IndPre
SRS Loop
Detect
Instruction Buffer
decoder decoder decoder decoder
Rename Logic
Arch.
Reg
file Phy.
Reg
file Dispatch Logic
Reorder
Buffer
SCInt/VT
Queue
FP/VT
Queue
LD/ST
Queue
ALU
/BR
FMA
C/FDI
V
FMA
C/FDI
V
DTLB
D Cache
L2 Cache
STB & Prefetch
Prefetch
Debug
/Trace
/Interrupt
/Timer ALU
/BR
MUL
/DIV
MUL
/DIV
MCI/VT
Queue
12
Xiaomi Core Function Units
• One 24-entry load/store queue
• 32KB L1 data cache
• 6 outstanding loads
• 4 cycles latency from load to use
• Next line and stride detected data
prefetch
• Streamlined pattern auto detected
LD/ST
Queue
DTLB
D Cache
STB & Prefetch
Phytium Technology Co., Ltd
Cache coherence protocol Hawk cache coherence protocol
Distributed directory-based global cache coherency
MOESI-like packet-based coherence protocol
A home node DCU(directory control unit) supports
Affinitive pairing of L2Cs and CMCs
“Infinite” capacity for non-conflicting Reads & Writes
Optimized transaction flow for exclusive atomic accesses
Reduced latency by cacheline forwarding
13
L2C L2C L2C
Hawk
L3C &
Memory I/O
Interconnects
Global Exclusive Monitor
Co
re0
Co
re7
Coherence Logic
Pa
ne
l N
MEM
Co
re0
Co
re7
Coherence Logic
Pa
ne
l 0
Local Monitor
Phytium Technology Co., Ltd
Network on Chip
2D Concentrated Mesh Architecture
Cell based switch with 6 bidirectional ports
Uniform package format for each port, a port can be configured to be
connected with a device or cascade cell
4 physical channels for CC and 1 channel for debug, DOR Y-X routing
Low latency: 3 cycles for each hop
High bandwidth: 384GB/s each cell
Cell1
53
0
24
L2cache
L2cache
MIU/IOU
MIU or
Cascade
Dest. Lat. (cycles)
0 3
1 6
2 9
3 12
4 15
5 12
6 9
7 6
Avg. 9
3
4
Cell01
53
0
24 Cell1
0
24
1
53
Cell45
13
2
04 Cell5
2
04
5
13 Cell7
5
13
2
04
Cell20
24
1
53Cell3
1
53
0
24
Cell62
04
5
13
master
0
1 2
5 6
7
14
Phytium Technology Co., Ltd
Cache & Memory Chip
L3 cache
16MB Data Array
2MB Data ECC
DDR bandwidth
2 x DDR3-800:25.6GB/s
Proprietary interface between Mars &
CMC
Parallel interface
Needs more pins, but lower latency than
serdes
Separate write/cmd and read data
channel
L3
Bank0
Mars Interface
L3
Bank1
L3
Bank2
L3
Bank3
Mem
Ctrl0
Mem
Ctrl1
D
D
R
D
D
R
15
Effective read channel bandwidth:12.8GB/s
Effective write/cmd channel bandwidth:6.4GB/s
Phytium Technology Co., Ltd
Latency of affinitive access
Memory access latency(ns)
Local L1 cache hit ~2
Local L2 cache hit ~8
Affinitive L2 cache hit ~20
Affinitive L3 cache hit ~36
Affinitive DDR access ~70
• Panel : 2.0GHz
• NoC: 2.0GHz
• CMC: 1.5GHz
* PCB latency not considered
16
Xiaomi
Xiaomi
Xiaomi
Xiaomi L2cache
Routing Cell
DCU
DCU
Xiaomi
Xiaomi
Xiaomi
Xiaomi L2cache
Xiaomi
Xiaomi
Xiaomi
Xiaomi L2cache
Routing Cell
DCU
DCU
Xiaomi
Xiaomi
Xiaomi
Xiaomi L2cache
CM
C
Phytium Technology Co., Ltd
Memory Tune (mTune)
Rich Data Collection
Number of cache hits/misses for L1/L2/L3
Workload of cache pipelines
Busyness of the NoC
ECC corrections of the memory system
Support Multiple Metrics
Average Miss rate/Hit rate
Minimal/Maximal/Average Access Latency
Bandwidth Analysis
Concurrent Average Memory Access Time (CAMAT)
Support MPI/OpenMP Applications
Thread behavior analysis
Inter-process behavior analysis
17
Phytium Technology Co., Ltd
Scalable Debug System
ARMv8 CoreSight Compatible debug system
Scalable dedicated debug network across 64 cores
Distributed debug components
Configurable events broadcast scope
Timestamp broadcasts with single signal to simplify
implementation
18
Cell0
DNB_C
DNB_C
1
530
DNB_M
DNB_I2
4 Cell14 31 0
5 2
DNB_C
DNB_C
DNB_M
Cell4
DNB_C
DNB_C
5
132
04
DNB_M
Cell54 35 2
1 0
DNB_C
DNB_CDNB_M
Cell33 40 1
2 5
DNB_C
DNB_C
DNB_M
Cell24 31 0
5 2
DNB_C
DNB_C
DNB_M
Cell7
DNB_C
DNB_C
5
132
04
DNB_M
Cell64 35 2
1 0
DNB_C
DNB_CDNB_M
panel0 panel1
panel4 panel5
panel3 panel2
panel7 panel6
hdbg
JTAG1
JTAG2
Phytium Technology Co., Ltd
Physical Design
28nm process
0.9v core/1.8v IO
10 metal layers
~180M instances
2.0GHz
120W
640mm2 die size
FCBGA
~3000 pins
19
25.38mm
25.2mm
Phytium Technology Co., Ltd
Performance Evaluation
SpecCPU2006
20
Single copy of SPEC CPU benchmark 64 copies of SPEC CPU benchmark
19.2 17.8
0
5
10
15
20
25
INT FP
SPEC_CPU2006_base
672 585
0
100
200
300
400
500
600
700
800
INT FP
SPEC_CPU2006_rate
Phytium Technology Co., Ltd
Performance Evaluation
STREAM
21
0
10
20
30
40
50
60
70
80
90
1 2 4 8 16 24 32 40 48 56 64
STREAM triad
#cores
Bandw
idth
(G
B/s
)
Phytium Technology Co., Ltd
Next Generation Scale-up CPU
More powerful core
Aggressive Branch Predictor
Multithreading
More aggressive ILP exploitation
Wider SIMD
More RAS features
Higher bandwidth memory access
Higher power efficiency
22
Mars: A 64-core ARMv8 Processor
Charles Zhang
Charles.zhang@phytium.com.cn
top related