sparc64(tm) viiifx: fujitsu's new generation octo core ... · sparc64 v + gs21 sparc64...

21
All Rights Reserved,Copyright© FUJITSU LIMITED 2009 SPARC64 VIIIfx: Fujitsu’s New Generation Octo Core Processor for PETA Scale computing August 25, 2009 Takumi Maruyama LSI Development Division Next Generation Technical Computing Unit Fujitsu Limited

Upload: lediep

Post on 26-Apr-2019

222 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: SPARC64(TM) VIIIfx: Fujitsu's New Generation Octo Core ... · SPARC64 V + GS21 SPARC64 UNIXUNIX HPC-ACE System On Chip Hardware Barrier SPARC64 VIIIfx SPARC64 VII Tr=760M CMOS Cu

All Rights Reserved,Copyright© FUJITSU LIMITED 2009

SPARC64™ VIIIfx: Fujitsu’s New Generation Octo Core Processor

for PETA Scale computing

August 25, 2009

Takumi Maruyama

LSI Development DivisionNext Generation Technical Computing UnitFujitsu Limited

Page 2: SPARC64(TM) VIIIfx: Fujitsu's New Generation Octo Core ... · SPARC64 V + GS21 SPARC64 UNIXUNIX HPC-ACE System On Chip Hardware Barrier SPARC64 VIIIfx SPARC64 VII Tr=760M CMOS Cu

SPARC64™ VIIIfx All Rights Reserved,Copyright© FUJITSU LIMITED 20092

2004~2007

2000~2003

1998~1999

1996~1997

~1995

SPARC64II

SPARC64™Processor

SPARC64SPARC64™™ProcessorProcessor

Tr=190MCMOS Cu

130nm

Tr=30MCMOS Cu

180nm / 150nm

Tr=190MCMOS Cu

130nmSPARC64

V

SPARC64GP

Tr=30MCMOS Al

250nm / 220nm

Tr=46MCMOS Cu180nm

GS8900

MainframeMainframeMainframe

GS21600

GS8800B

Tr=400MCMOS Cu

90nm

Tr=540MCMOS Cu

90nm

SPARC64VI

GS8800

Tr=500MCMOS Cu

90nm

Tr=10MCMOS Al

350nm

Store AheadBranch HistoryPrefetch

Single-chip CPU

Non-Blocking $O-O-O ExecutionSuper-Scalar

L2$ on DieMulti-core / Multi-thread

Fujitsu Processor DevelopmentFujitsu Processor Development

High Perform

anceT

echnologyH

igh Reliability

Technology GS8600

GS21900

Tr=600MCMOS Cu

65nm

SPARC64V +

GS21

SPARC64

UNIXUNIXUNIX

HPC-ACE System On Chip

Hardware Barrier

SPARC64VIIIfx

SPARC64VII

Tr=760MCMOS Cu / 45nm

2008~

MainframeMainframeMainframe

HPCHPCHPCFeatures

$ ECCRegister/ALU ParityInstruction Retry$ Dynamic DegradationRC/RT/History

Page 3: SPARC64(TM) VIIIfx: Fujitsu's New Generation Octo Core ... · SPARC64 V + GS21 SPARC64 UNIXUNIX HPC-ACE System On Chip Hardware Barrier SPARC64 VIIIfx SPARC64 VII Tr=760M CMOS Cu

SPARC64™ VIIIfx All Rights Reserved,Copyright© FUJITSU LIMITED 20093

SPARC64™ VIIIfx Design TargetProcessor for Fujitsu’s supercomputer for the peta-scale

computing age, which realizes both high performance and low powerUnachievable goal with conventional processor design

• More GF (Giga Flops)• More Efficiency• No more high frequency

Large ISA (Instruction Set Architecture)extension is required

High Integration: SoC (System On Chip)High Reliability

Reuse SPARC64TM VII design when applicable

Focus of this presentation

Page 4: SPARC64(TM) VIIIfx: Fujitsu's New Generation Octo Core ... · SPARC64 V + GS21 SPARC64 UNIXUNIX HPC-ACE System On Chip Hardware Barrier SPARC64 VIIIfx SPARC64 VII Tr=760M CMOS Cu

SPARC64™ VIIIfx All Rights Reserved,Copyright© FUJITSU LIMITED 20094

HPC-ACE (High Performance Computing - Arithmetic Computational Extensions)

ISA (Instruction Set Architecture) of SPARC64VIIIfx is: Complies with

• The SPARC-V9 standard• JPS (Joint Programmer’s Specification): Extension to SPARC-V9

HPC-ACE: Fujitsu’s unique ISA extension for HPC• Large register sets• SIMD (single instruction multiple data) instructions• etc

You can download SPARC64TM VIIIfx ISA documents from http://jp.fujitsu.com/solutions/hpc/brochures/ The SPARC® Architecture Manual Version 9 SPARC® Joint Programming Specification (JPS1): Commonality SPARC64 VIIIfx Extensions

Page 5: SPARC64(TM) VIIIfx: Fujitsu's New Generation Octo Core ... · SPARC64 V + GS21 SPARC64 UNIXUNIX HPC-ACE System On Chip Hardware Barrier SPARC64 VIIIfx SPARC64 VII Tr=760M CMOS Cu

SPARC64™ VIIIfx All Rights Reserved,Copyright© FUJITSU LIMITED 20095

Large register sets 1/2Register enhancements from V9

160192 INT registers 32256 FP registers (DP)

• Lower 32 registers are the same with SPARC-V9’s.

• FP registers are “flat”. Extended FP registers can be accessed with non-SIMD instruction as well.

Why needed To extract more parallelism

currently limited by the number of Arch registers.

Reduce spill/fill overhead

Ext.

V9

Ext.

V9V9

Ext.

INT Reg FP RegRegisterWindow

32

224

32

32160

SIMD(basic)

SIMD(extended)

Loop0

Loop1

Loop0Loop1

LoopN

Limited parallelismw/ O-O-O

Large parallelismw/ loop unrolling

32

Page 6: SPARC64(TM) VIIIfx: Fujitsu's New Generation Octo Core ... · SPARC64 V + GS21 SPARC64 UNIXUNIX HPC-ACE System On Chip Hardware Barrier SPARC64 VIIIfx SPARC64 VII Tr=760M CMOS Cu

SPARC64™ VIIIfx All Rights Reserved,Copyright© FUJITSU LIMITED 20096

Large register sets 2/2 Instruction format for 256 FP registers

8 bit x 4 (3 read +1 write) register number fields are necessary for FMA (Floating-point Multiply and Add) instruction.

But SPARC-V9 instruction length is limited (32bits – fixed) Defined a new prefix instruction (SXAR) to specify upper-3bit of

register numbers of the following two instructions.

SXAR inst1 inst2

Lower-5bit x4Upper-3bit x4

SXAR (Set XAR) instruction XAR: Extended Arithmetic Register

• Set by the SXAR instruction• Valid bit is cleared once the corresponding subsequent instruction

gets executed.

fv furd furs1 furs2 furs3 sv surd surs1 surs2 surs331 0

fsimd ssimd1615

First Upper Register Source-1 bits

SXAR1: set XAR for subsequent one instruction. SXAR2: set XAR for subsequent two instructions.

Page 7: SPARC64(TM) VIIIfx: Fujitsu's New Generation Octo Core ... · SPARC64 V + GS21 SPARC64 UNIXUNIX HPC-ACE System On Chip Hardware Barrier SPARC64 VIIIfx SPARC64 VII Tr=760M CMOS Cu

SPARC64™ VIIIfx All Rights Reserved,Copyright© FUJITSU LIMITED 20097

SIMD SIMD FP operation

1 SIMD FP instruction executes two SP or DP floating-point operations

SIMD Load/Store 1 SIMD load/store instruction accesses two

contiguous SP or DP data in memory Data alignment requirement for SIMD-load

(DP) is 8 byte rather than usual 16byte. Combine two independent FP operation into

one with the Flat FP registers

SXAR instruction Specifies SIMD operation of the

subsequent two instructions. No new opcode is required for SIMD

operation

More SW optimization possibility with SIMD

16byte Load x2

basic extended

SIMD

SIMD

Conventional SIMD usageSIMD-Load (Src1)SIMD-Load (Src2)SIMD-FPSIMD-Store (Dst)

Combine two FP ops with SIMDLoad (Src1-A)Load (Src1-B)Load (Src2-A)Load (Src2-B)SIMD-FPStore (Dst-A)Store (Dst-B)

B-pipe D-pipe

A-pipe C-pipe

#128-255#32-127

#0-31

Page 8: SPARC64(TM) VIIIfx: Fujitsu's New Generation Octo Core ... · SPARC64 V + GS21 SPARC64 UNIXUNIX HPC-ACE System On Chip Hardware Barrier SPARC64 VIIIfx SPARC64 VII Tr=760M CMOS Cu

SPARC64™ VIIIfx All Rights Reserved,Copyright© FUJITSU LIMITED 20098

FP Trigonometric Functions

Taylor series approximation of sin (x)~= x – 1/3!x3 + 1/5!x5 – 1/7!x7 + 1/9!x9 – 1/11x11 + 1/13x13 - 1/15x15

= x ((((((((0 – 1/15!)x2 + 1/13!)x2 – 1/11!)x2 + 1/9!)x2 – 1/7!)x2 + 1/5!)x2 – 1/3!)x2 + 1)

New instruction - ftrimaddd Function: rs1 × abs(rs2) + T[index] → rd Usage:

Instructions for fast Trigonometric Function calculation

# S = 0ftrimaddd S, X2, 7, S # S * X2 – 1/15! Sftrimaddd S, X2, 6, S # S * X2 + 1/13! Sftrimaddd S, X2, 5, S # S * X2 – 1/11! Sftrimaddd S, X2 , 4, S # S * X2 + 1/9! Sftrimaddd S, X2 , 3, S # S * X2 – 1/7! Sftrimaddd S, X2 , 2, S # S * X2 + 1/5! Sftrimaddd S, X2 , 1, S # S * X2 – 1/3! Sftrimaddd S, X2 , 0, S # S * X2 + 1/1! SFmuld S, X, S # S * X S

Page 9: SPARC64(TM) VIIIfx: Fujitsu's New Generation Octo Core ... · SPARC64 V + GS21 SPARC64 UNIXUNIX HPC-ACE System On Chip Hardware Barrier SPARC64 VIIIfx SPARC64 VII Tr=760M CMOS Cu

SPARC64™ VIIIfx All Rights Reserved,Copyright© FUJITSU LIMITED 20099

1

Software controlled Cache

HW Implementation Keep sector info of each cache line. Select the way to be replaced to meet the

sector ratio on cache misses

SW specifies a sector depending on temporal and spatial locality of data

L2 Cache structure

111111000

Way0 Way9Ex) Sector0:Sector1=30%:70%

1011110110index

Give SW ability to control cache to optimize performance while keeping cache coherency

Divide Cache into 2 groups (sectors) SXAR instruction specifies the sector

• sector 0: Instruction fetch / normal operand access (default)• sector 1: operand access explicitly specified by SXAR

Sector cache configuration registerSpecifies the ratio of sector 0 and 1 at the same index

Page 10: SPARC64(TM) VIIIfx: Fujitsu's New Generation Octo Core ... · SPARC64 V + GS21 SPARC64 UNIXUNIX HPC-ACE System On Chip Hardware Barrier SPARC64 VIIIfx SPARC64 VII Tr=760M CMOS Cu

SPARC64™ VIIIfx All Rights Reserved,Copyright© FUJITSU LIMITED 200910

Other HPC-ACE featuresConditional operation

For efficient execution of Loop with ‘if’ Conditional branch can be removed with the

following conditional instructions. FP Conditional Compare to Register Move Selected FP-register on FP-register’s condition Store FP-register on FP-register’s condition

Compiler can optimize the loop with SW pipeline

FP Reciprocal Approximation of Divide/Square-root Calculate reciprocal approximation with

rounding error < 1/256 To achieve higher divide/square-root

performance with pipelined operation

FP Minimum and Maximum

SIMD

i=0 i=1i=2 i=3i=4 i=5

time

for (i=0; i<00; i++) {if (A(i)>0)

FPop X(i)else

FPop Y(i)}

basic extended

Usage:# %f2 / %f10 %f0frcpad %f10, %f6fmuld %f2, %f6, %f2fnmsubd %f6, %f10, 1.0, %f6fmuld %f6, %f6, %f0fmaddd %f6, %f6, %f6, %f4fmaddd %f0, %f0, %f6, %f0fmaddd %f4, %f2, %f2, %f4fmaddd %f0, %f4, %f2, %f0

i=6 i=7

Page 11: SPARC64(TM) VIIIfx: Fujitsu's New Generation Octo Core ... · SPARC64 V + GS21 SPARC64 UNIXUNIX HPC-ACE System On Chip Hardware Barrier SPARC64 VIIIfx SPARC64 VII Tr=760M CMOS Cu

SPARC64™ VIIIfx All Rights Reserved,Copyright© FUJITSU LIMITED 200911

DO J=1,NDO I=1,MA(J)=A(J)+A(I,J+1)*B(I,J)

ENDEND

DO J=1,NDO I=1,MA(J)=A(J)+A(I,J+1)*B(I,J)

ENDEND

DO J=1,NDO I=1,M

A(J)=A(J)+A(I,J+1)*B(I,J)END

END

Integrated Multicore Parallel Architecture

Parallel

Time

PPP

Vector

Time

VVV

P

P

Vector Conventional Scalar

SPARC64TM VIIIfx

CORE0 CORE2CORE1 CORE3

I= CORE0 CORE1 CORE2

SIMD basic extendedbasic extended

SPARC64TM VIIIfx HW HPC-ACE Shared L2 cache to avoid false sharing Hardware barrier for fast inter-core

synchronization Fujitsu’s compiler technology

Automatic parallelization

More possibility of optimization:Parallelize the innermost loop

J 1234・・・

J 1234・・・

I 1234・・・

J = 1 2 3 4 5I= 1 2 3 4 5

Time

Parallel

Page 12: SPARC64(TM) VIIIfx: Fujitsu's New Generation Octo Core ... · SPARC64 V + GS21 SPARC64 UNIXUNIX HPC-ACE System On Chip Hardware Barrier SPARC64 VIIIfx SPARC64 VII Tr=760M CMOS Cu

SPARC64™ VIIIfx All Rights Reserved,Copyright© FUJITSU LIMITED 200912

SPARC64™ VIIIfx Chip Overview• Architecture Features

• 8 cores• Shared 5 MB L2$• Embedded Memory Controller• 2 GHz

• Fujitsu 45nm CMOS• 22.7mm x 22.6mm• 760M transistors• 1271 signal pins

• Performance (peak)• 128GFlops• 64GB/s memory throughput

• Power• 58W (TYP, 30℃)• Water Cooling – Low leakage

power and High reliability

Core5

Core4

Core1

Core0

Core7

Core6

Core3

Core2

DD

R3

inte

rfac

e

DD

R3

inte

rfac

e

L2$ Data

L2$ Data

HSIO

L2$ ControlMACMAC

MACMAC

Page 13: SPARC64(TM) VIIIfx: Fujitsu's New Generation Octo Core ... · SPARC64 V + GS21 SPARC64 UNIXUNIX HPC-ACE System On Chip Hardware Barrier SPARC64 VIIIfx SPARC64 VII Tr=760M CMOS Cu

SPARC64™ VIIIfx All Rights Reserved,Copyright© FUJITSU LIMITED 200913

<floating-point><integer>Execution units

256 (FPR)48 x2 (FUB)

188 (GPR)32 (GUB)

#Registers

FMA x4 (2SIMD)COMPARE x2

DIVIDE x2VIS x1

ALU x2SHIFT x2MULT x1DIVIDE x1AGEN x2

8 (= 4 Floating Multiply and Add)#FP-operations per clock

L1I$ 32KB/2wayL1D$ 32KB/2way

L1$

SPARC-V9/JPS + HPC-ACE

Instruction Set Architecture

Execution UnitL1D$

L1I$

Instruction controlL1$

control

SPARC64™ VIIIfx Core spec

FPR + FUB

Page 14: SPARC64(TM) VIIIfx: Fujitsu's New Generation Octo Core ... · SPARC64 V + GS21 SPARC64 UNIXUNIX HPC-ACE System On Chip Hardware Barrier SPARC64 VIIIfx SPARC64 VII Tr=760M CMOS Cu

SPARC64™ VIIIfx All Rights Reserved,Copyright© FUJITSU LIMITED 200914

L1 I$64KB2Way

BranchTarget

Address8Kentry4Way

Decode& Issue

RSE8x2Entry

RSA10Entry

RSF8x2Entry

RSBR10Entry

GUB48Registers

GPR156Registers

x2

EXA

EXB

EAGA

EAGB

FPR64Registers

x2

FUB48Registers

FLA

FLB

FetchPort

16Entry

Store Port

8x2Entry

Write Buffer

5x2Entry

L1 D$64KB2Way

System BusInterface

Fetch(4stages)

Issue(2stages)

Dispatch Reg.-Read(4stages)

Execute Memory(L1$: 3(int)/4(fp)stages)

CSE64Entry

Commit(2stages)

PCx2

ControlRegisters

x2

SPARC64TM VII Pipeline

L2$6MB

12Way

Page 15: SPARC64(TM) VIIIfx: Fujitsu's New Generation Octo Core ... · SPARC64 V + GS21 SPARC64 UNIXUNIX HPC-ACE System On Chip Hardware Barrier SPARC64 VIIIfx SPARC64 VII Tr=760M CMOS Cu

SPARC64™ VIIIfx All Rights Reserved,Copyright© FUJITSU LIMITED 200915

L1 I$32KB2Way

BranchTarget

Address1Kentry2Way

Decode& Issue

RSE10Entry

RSA10Entry

RSF8x2Entry

RSBR6Entry

GUB32Registers

GPR188Registers

EXA

EXB

EAGA

EAGB

FPR256Registers

FUB48x2Registers

FLA

FLB

FetchPort

20Entry

Store Port

8Entry

L1 D$32KB2Way

MemoryController

Fetch(4stages)

Issue(3stages)

Dispatch Reg.-Read(4(int)/5(fp) stages)

Execute Memory(L1$: 3(int)/4(fp)stages)

CSE48Entry

Commit(2stages)

PC

ControlRegisters

SPARC64TM VIIIfx Pipeline

L2$5MB

10WayFLC

FLD

DIMM

Write Buffer5Entry

Page 16: SPARC64(TM) VIIIfx: Fujitsu's New Generation Octo Core ... · SPARC64 V + GS21 SPARC64 UNIXUNIX HPC-ACE System On Chip Hardware Barrier SPARC64 VIIIfx SPARC64 VII Tr=760M CMOS Cu

SPARC64™ VIIIfx All Rights Reserved,Copyright© FUJITSU LIMITED 200916

Changes in the pipeline

Issue stage Added “PD” stage to handle SXAR 6-instructions packed into 4 at “PD” An CSE entry is assigned at “D” The packed instructions have

Extended register fields SIMD operation attributes

4 packed (=6 unpacked) instructions can be committed at the same time.

Execution Stage “B” stage has been split into two for flat

Floating-point Register access

SXAR

E

PD

D

A B SXAR C D

A’ B’ C’ D’

Decode

Read IB

P

X

B FPR FUBFPR FUB

B2 Bypass

basic

FLA/B FLC/D

extended

Page 17: SPARC64(TM) VIIIfx: Fujitsu's New Generation Octo Core ... · SPARC64 V + GS21 SPARC64 UNIXUNIX HPC-ACE System On Chip Hardware Barrier SPARC64 VIIIfx SPARC64 VII Tr=760M CMOS Cu

SPARC64™ VIIIfx All Rights Reserved,Copyright© FUJITSU LIMITED 200917

0

0.2

0.4

0.6

0.8

1

1.2

1.4

VII VIIIfx(noSIMD)

VIIIfx VII VIIIfx(noSIMD)

VIIIfx VII VIIIfx(noSIMD)

VIIIfx

Loop6: y(i) = x1(i) / x2(i) y(i) = sin(x1(i)) Loop14: 9th Degr Polyn

Performance= x 1.4

Performance Results – Core

Benchmark:Euroben + in-house

SPARC64TM [email protected]

ecut

ion

Tim

eR

elat

ive

to S

PAR

C64

TMVI

I@2.

5GH

z

Performance= x 4.3

Performance= x 6.8

SPARC64TM VIIIfx realizes much higher core performance than SPARC64TM VII thanks to HPC-ACE despite its lower frequency

Hardware measured results

0 commit due to- $/system- Execution unit1 commit2-3 commit4 commit

0 commit due to- $/system- Execution unit1 commit2-3 commit4 commit

Page 18: SPARC64(TM) VIIIfx: Fujitsu's New Generation Octo Core ... · SPARC64 V + GS21 SPARC64 UNIXUNIX HPC-ACE System On Chip Hardware Barrier SPARC64 VIIIfx SPARC64 VII Tr=760M CMOS Cu

SPARC64™ VIIIfx All Rights Reserved,Copyright© FUJITSU LIMITED 200918

Performance Results – ChipSimulated results

SPARC64TM [email protected]

Performance= x 2.4

Performance= x 2.6

SPARC64TM VIIIfx shows about x 2.5 performance of SPARC64TM VII Stall time due to the execution unit busy has been reduced dramatically.

Expect x 3 performance with further compiler optimization

Exec

utio

n Ti

me

Rel

ativ

e to

SPA

RC

64TM

VII@

2.5G

Hz Benchmark:

Scientific Application- Molecular Dynamics- Fluid Dynamics

0 commit due to- $/system- Execution unit1 commit2-3 commit4 commit

0 commit due to- $/system- Execution unit1 commit2-3 commit4 commit

0

0.2

0.4

0.6

0.8

1

1.2

1.4

SPARC64VII SPARC64VIIIfx SPARC64VII SPARC64VIIIfx

Molecular Dynamics Fluid Dynamics

Page 19: SPARC64(TM) VIIIfx: Fujitsu's New Generation Octo Core ... · SPARC64 V + GS21 SPARC64 UNIXUNIX HPC-ACE System On Chip Hardware Barrier SPARC64 VIIIfx SPARC64 VII Tr=760M CMOS Cu

SPARC64™ VIIIfx All Rights Reserved,Copyright© FUJITSU LIMITED 200919

SPARC64™ History and FutureEvolution rather than revolution

SPARC64TM V (1 core)• RAS• Single thread performance

SPARC64TM VI (2 core x 2 VMT)• Throughput

SPARC64TM VII (4 core x 2 SMT)• More Throughput• High Performance Computing

SPARC64TM VIIIfx (8 core)• High Performance Computing• Low Power• SoC

Peak Performance & Power

0

5

10

15

20

25

V V + VI VII VIIIfx

SPARC64

SPA

RC

64 V

= 1

.0

Performance (GF) = x 3

Power (W) = x 0.5

Page 20: SPARC64(TM) VIIIfx: Fujitsu's New Generation Octo Core ... · SPARC64 V + GS21 SPARC64 UNIXUNIX HPC-ACE System On Chip Hardware Barrier SPARC64 VIIIfx SPARC64 VII Tr=760M CMOS Cu

SPARC64™ VIIIfx All Rights Reserved,Copyright© FUJITSU LIMITED 200920

SPARC64TM VIIIfx SummarySPARC64TM VIIIfx has been designed to be used for

Fujitsu’s supercomputer for the PETA-scale computing age.

HPC-ACE instruction sets overcomes the limitation of conventional SPARC-V9 architecture.

SPARC64TM VIIIfx has combined high performance and low power.

SPARC64TM VIIIfx chip is up and running in the lab.

Fujitsu will continue to develop SPARC64TM series to meet the needs of a new era.

Page 21: SPARC64(TM) VIIIfx: Fujitsu's New Generation Octo Core ... · SPARC64 V + GS21 SPARC64 UNIXUNIX HPC-ACE System On Chip Hardware Barrier SPARC64 VIIIfx SPARC64 VII Tr=760M CMOS Cu

SPARC64™ VIIIfx All Rights Reserved,Copyright© FUJITSU LIMITED 200921

Abbreviations• SPARC64TM VIIIfx

– IB: Instruction Buffer – RSA: Reservation Station for Address generation– RSE: Reservation Station for Execution– RSF: Reservation Station for Floating-point– RSBR: Reservation Station for Branch– GUB: General Update Buffer– FUB: Floating point Update Buffer– GPR: General Purpose Register– FPR: Floating Point Register– CSE: Commit Stack Entry