5.5 GFLOPS Vector Units for “Emotion Synthesis”
System ULSI Engineering Laboratory, TOSHIBA Corp.
A. Kunimatsu, N. Ide, T. Sato, Y. Endo, H. Murakami, T. Kamei, M. Hirano
Sony Computer Entertainment Inc.
M. Oka, A. Ohba, T. Yutaka, T. Okada, M. Suzuoki
Overview
• Strategy of VU architecture
• VU features and instruction sets
• Examples
• Performance
Strategy of VU Architecture
• Targets
  – Optimized for 3D calculations
  – Easy to develop software pipelined programs
  – Simple hardware
• Solutions
  – VLIW processor
  – 4 parallel fMACs
  – All operations have the same pipeline depth (except DIV/SQRT/RSQRT operations)
And one more essence ...
"Emotion Synthesis"
• "Emotion Synthesis"
  – More realistic, more detailed and more natural motions
• We prepared two different Vector Units (VUs)
  – VU0 : for flexible calculation with CPU control
  – VU1 : for well-defined 3D calculations
Need Huge Calculation Power
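“Emotion Engine” Block Diagram
[Figure: “Emotion Engine” block diagram. CPU: 128-bit processor core with COP1 FPU, I$, D$ and SP-RAM. VPU0: VU0 (COP2) with 4 KB instruction memory, 4 KB data memory and VIF0. VPU1: VU1 and EFU with 16 KB instruction memory, 16 KB data memory and VIF1. Also on the 128-bit system bus: IPU, GIF (to the rendering engine, GS LSI), 10-channel DMAC, memory interface (external memory) and I/O interface (peripherals).]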
VU
• VLIW mode
  – 64-bit VLIW instruction formats
  – 5 function units are available simultaneously: 4 FMAC + (FDIV, Branch, load/store, or iALU)
• Co-processor mode
  – MIPS COP2 instruction : 32-bit opcode
  – Two kinds of COP2 instructions :
    • 4 parallel floating operations
    • Call micro-subroutine of VLIW mode
VU0 and VU1
                    VU0                                    VU1
Main purpose        Flexible calculation with CPU control  Well-defined 3D calculations
Co-processor mode   Available                              N/A
VLIW mode           Available                              Available
VPU                 With Inst. memory : 4KB,               With Inst. memory : 16KB,
                    Data memory : 4KB,                     Data memory : 16KB,
                    VIF (system bus I/F)                   VIF (system bus I/F),
                                                           GIF (Graphics I/F),
                                                           EFU (VU option)
VPU1 Block Diagram
[Figure: Vector Unit (VU) of VPU1. A 64-bit instruction from the 16 KByte instruction memory splits into a 32-bit upper instruction and a 32-bit lower instruction. Upper execution unit: fMACx, fMACy, fMACz, fMACw. Lower execution unit: fDIV, LSU, iALU, random unit, etc. Register files: floating registers (128 bit x 32) and integer registers (16 bit x 16). Data memory: 16 KByte. EFU with its own fMAC and fDIV. VIF connects to the system bus; GIF takes Path1, Path2 and Path3 to the rendering engine (GS LSI).]
Instruction opcode to support 2 modes
• Co-processor mode : 32-bit MIPS COP2 instruction opcode (bits 31 to 00, fields listed from bit 31 down)
  – VMADD : COP2 (6 bit, 010010) | co (1 bit, 1) | dest (4 bit) | ft reg (5 bit) | fs reg (5 bit) | fd reg (5 bit) | VMADD (4 bit, 0010) | bc (2 bit)
  – VLQI : COP2 (6 bit, 010010) | co (1 bit, 1) | dest (4 bit) | ft reg (5 bit) | is reg (5 bit) | VLQI (11 bit, 01101 1111 00)
• VLIW mode : 64-bit VLIW instruction opcode (bits 63 to 00, fields listed from bit 63 down)
  – Upper 32 bit, MADD : I, E, M, D, T, -, - (1 bit each) | dest (4 bit) | ft reg (5 bit) | fs reg (5 bit) | fd reg (5 bit) | MADD (4 bit, 0010) | bc (2 bit)
  – Lower 32 bit, LQI : Lower OP. (7 bit, 1000000) | dest (4 bit) | ft reg (5 bit) | is reg (5 bit) | LQI (11 bit, 01101 1111 00)
• A VLIW instruction includes two COP2 instructions: the upper and lower 32 bits carry the same opcode bodies as the COP2 forms.
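The field widths on this slide can be checked by packing them into a word. The following is a minimal sketch, assuming the fields of the co-processor mode VMADD form are laid out from bit 31 downward in the order COP2, co, dest, ft, fs, fd, opcode, bc. The function name `encode_vmadd` and the example register numbers are hypothetical and chosen only for illustration.

```python
# Constants taken from the slide; everything else here is illustrative.
COP2_OPCODE = 0b010010  # 6-bit MIPS COP2 major opcode
CO_BIT = 1              # 1-bit "co" field
VMADD_OPCODE = 0b0010   # 4-bit VMADD opcode


def encode_vmadd(dest, ft, fs, fd, bc):
    """Pack the fields into a 32-bit word, most significant field first."""
    fields = [
        (COP2_OPCODE, 6),
        (CO_BIT, 1),
        (dest, 4),
        (ft, 5),
        (fs, 5),
        (fd, 5),
        (VMADD_OPCODE, 4),
        (bc, 2),
    ]
    word = 0
    for value, width in fields:
        if not 0 <= value < (1 << width):
            raise ValueError(f"value {value} does not fit in {width} bits")
        word = (word << width) | value
    return word


word = encode_vmadd(dest=0b1111, ft=1, fs=2, fd=3, bc=0)
print(f"{word:032b}")
```

The widths add up to 6 + 1 + 4 + 5 + 5 + 5 + 4 + 2 = 32 bits, which is why the loop needs no padding.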
64-bit VLIW instruction
• Upper Instructions (upper 32 bit, FMAC x 4) :
  – 4 parallel floating ADD/SUB
  – 4 parallel floating MUL
  – 4 parallel floating MADD/MSUB
  – 4 parallel floating MAX/MIN
  – Outer product calculation
  – Clipping detection
• Lower Instructions (lower 32 bit, FDIV, load/store, iALU) :
  – Floating DIV/SQRT/RSQRT
  – Load/Store 128-bit data (4 FP data)
  – Jump/Branch
  – Elementary Function Unit
  – Random Unit
[Figure: 64-bit instruction word with a MADD in the upper 32 bits and an LQI in the lower 32 bits.]
Registers
• 128-bit floating register x 32 (VF00 to VF31)
  – Each register holds four 32-bit fields: w (bits 127 to 96), z (bits 95 to 64), y (bits 63 to 32), x (bits 31 to 0)
  – VF00 holds w = 1.00, z = 0.00, y = 0.00, x = 0.00
• 16-bit integer register x 16 (VI00 to VI15, bits 15 to 0)
  – VI00 holds 0
4 Parallel FMADD
[Figure: source register 1 (w, z, y, x) and source register 2 (w, z, y, x) feed fMACw, fMACz, fMACy and fMACx. Each fMAC multiplies its pair of elements and adds the product to its accumulator field (ACCw, ACCz, ACCy, ACCx).]
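The data flow of the 4 parallel FMADD can be sketched in a few lines. This is a plain model of the arithmetic only, not of the hardware; the function name `fmadd4` and the (x, y, z, w) tuple layout are assumptions made for the example.

```python
def fmadd4(acc, src1, src2):
    """Lane-wise multiply-add over the x, y, z, w lanes: acc[i] + src1[i] * src2[i]."""
    return tuple(a + s1 * s2 for a, s1, s2 in zip(acc, src1, src2))


# Each lane is independent, as with the four fMAC units on the slide.
acc = fmadd4((1.0, 1.0, 1.0, 1.0), (1.0, 2.0, 3.0, 4.0), (5.0, 6.0, 7.0, 8.0))
print(acc)
```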
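4 Parallel FMADD with Broadcast
[Figure: fMACw, fMACz, fMACy and fMACx each multiply their own element (w, z, y, x) of one source register by a single element of the other source register, which is broadcast to all four units. Each product is added to the accumulator field ACCw, ACCz, ACCy, ACCx.]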
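A broadcast multiply-add differs from the plain form only in that one scalar is shared by all four lanes. The sketch below assumes the broadcast element comes from the second source register and is selected by a lane name; `fmadd4_bc` and the `LANES` string are hypothetical names for illustration.

```python
LANES = "xyzw"  # assumed lane order of the tuples used here


def fmadd4_bc(acc, src1, src2, bc):
    """Multiply every lane of src1 by the single lane `bc` of src2, then add to acc."""
    scalar = src2[LANES.index(bc)]
    return tuple(a + s * scalar for a, s in zip(acc, src1))


# Broadcast the y element (20) of the second register to all four lanes.
print(fmadd4_bc((0, 0, 0, 0), (1, 2, 3, 4), (10, 20, 30, 40), "y"))
```

This form is what makes a matrix-vector product cheap: one vector element scales a whole matrix column in a single instruction.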
VLIW mode Pipeline Stages
• 3 cycle latency : basic operations (ADD, SUB, MUL, MADD, MSUB, load/store, iALU operations, etc.)
  F D X1 X2 X3 W
• 7 cycle latency : DIV / SQRT
  F D 1 2 3 4 5 6 7
• 13 cycle latency : RSQRT
  F D 1 2 3 4 5 6 7 8 9 10 11 12 13
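Geometry Transformation
\[
\begin{pmatrix}
m_{00} & m_{01} & m_{02} & m_{03} \\
m_{10} & m_{11} & m_{12} & m_{13} \\
m_{20} & m_{21} & m_{22} & m_{23} \\
m_{30} & m_{31} & m_{32} & m_{33}
\end{pmatrix}
\begin{pmatrix} x \\ y \\ z \\ w \end{pmatrix}
=
\begin{pmatrix}
m_{00}x + m_{01}y + m_{02}z + m_{03}w \\
m_{10}x + m_{11}y + m_{12}z + m_{13}w \\
m_{20}x + m_{21}y + m_{22}z + m_{23}w \\
m_{30}x + m_{31}y + m_{32}z + m_{33}w
\end{pmatrix}
\]
The four result rows are computed in parallel by FMACx, FMACy, FMACz and FMACw.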
Geometry Transformation Pipeline
Upper instructions, each running on FMACx, FMACy, FMACz and FMACw:
• 4 x MULA (x broadcast) : $(m_{00}x,\ m_{10}x,\ m_{20}x,\ m_{30}x)$
• 4 x MADDA (y broadcast) : $+\ (m_{01}y,\ m_{11}y,\ m_{21}y,\ m_{31}y)$
• 4 x MADDA (z broadcast) : $+\ (m_{02}z,\ m_{12}z,\ m_{22}z,\ m_{32}z)$
• 4 x MADD (w broadcast) : $+\ (m_{03}w,\ m_{13}w,\ m_{23}w,\ m_{33}w)$
Pipeline:
4 x MULA    F  D  X1 X2 X3 W
4 x MADDA      F  D  X1 X2 X3 W
4 x MADDA         F  D  X1 X2 X3 W
4 x MADD             F  D  X1 X2 X3 W
F : Fetch, D : Decode, X1 - 3 : eXecute, W : Write back
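The MULA, MADDA, MADDA, MADD sequence on this slide amounts to accumulating one scaled matrix column per instruction. The following is a minimal sketch of that arithmetic, assuming the matrix is stored as rows `m[r][c]` and the vector as an (x, y, z, w) tuple. The helper names `mul_bc`, `madd_bc` and `transform` are hypothetical and do not model timing or registers.

```python
def mul_bc(column, scalar):
    """Model of 4 x MULA: start the accumulator with column * scalar."""
    return tuple(c * scalar for c in column)


def madd_bc(acc, column, scalar):
    """Model of 4 x MADDA / MADD: accumulator + column * scalar."""
    return tuple(a + c * scalar for a, c in zip(acc, column))


def transform(m, v):
    """Multiply the 4x4 matrix m (list of rows) by the vector v = (x, y, z, w)."""
    x, y, z, w = v
    columns = [tuple(m[r][c] for r in range(4)) for c in range(4)]
    acc = mul_bc(columns[0], x)          # 4 x MULA,  x broadcast
    acc = madd_bc(acc, columns[1], y)    # 4 x MADDA, y broadcast
    acc = madd_bc(acc, columns[2], z)    # 4 x MADDA, z broadcast
    return madd_bc(acc, columns[3], w)   # 4 x MADD,  w broadcast


# A translation by (5, 6, 7) applied to the point (1, 2, 3, 1).
translate = [
    [1, 0, 0, 5],
    [0, 1, 0, 6],
    [0, 0, 1, 7],
    [0, 0, 0, 1],
]
print(transform(translate, (1, 2, 3, 1)))
```

Working column by column is what lets every instruction keep all four multiply-add lanes busy, with no horizontal sum across lanes.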
Perspective Transformation
[Figure: perspective transformation software pipelining over LOOP 1 to LOOP 4. Upper: in each loop, the geometry transformation (4 x MULA, 4 x MADDA, 4 x MADDA, 4 x MADD) followed by 4 x MUL and others, each as F D X1 X2 X3 W. Lower: one DIV per loop (F D d1 d2 d3 d4 d5 d6 d7).]
• 7 cycles : DIV repeat time
• Another Lower inst. : load/store, loop control
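The slide pairs the geometry transformation with a DIV and a final 4 x MUL. A minimal sketch of that arithmetic follows, assuming the DIV computes the reciprocal of the transformed w and the 4 x MUL scales the transformed vector by it; the slide does not spell out the operands, so this reading is an assumption. The names `transform` and `perspective` are hypothetical.

```python
def transform(m, v):
    """4x4 matrix (list of rows) times the vector v = (x, y, z, w)."""
    return tuple(sum(m[r][c] * v[c] for c in range(4)) for r in range(4))


def perspective(m, v):
    """Transform v, then divide through by the transformed w."""
    x, y, z, w = transform(m, v)
    q = 1.0 / w                          # DIV on the lower unit (assumed: 1 / w)
    return (x * q, y * q, z * q, w * q)  # 4 x MUL on the upper unit


identity = [
    [1.0, 0.0, 0.0, 0.0],
    [0.0, 1.0, 0.0, 0.0],
    [0.0, 0.0, 1.0, 0.0],
    [0.0, 0.0, 0.0, 1.0],
]
print(perspective(identity, (2.0, 4.0, 6.0, 2.0)))
```

Because the divide runs on the lower unit, it can overlap with the upper-unit multiply-adds of a neighbouring loop iteration, which is the point of the software pipelining shown on the slide.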
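Vector Normalization
• MUL : $x^2$
• MADD : $x^2 + y^2$
• MADD : $x^2 + y^2 + z^2$
• RSQRT : $\dfrac{1}{\sqrt{x^2 + y^2 + z^2}}$
• 3 x MUL : $\left( \dfrac{x}{\sqrt{x^2 + y^2 + z^2}},\ \dfrac{y}{\sqrt{x^2 + y^2 + z^2}},\ \dfrac{z}{\sqrt{x^2 + y^2 + z^2}} \right)$
[Figure: pipeline chart. MUL, MADD, MADD and the final 3 x MUL each run F D X1 X2 X3 W; RSQRT runs F D 1 2 3 4 5 6 7 8 9 10 11 12 13.]
• 25 cycles for normalization
• 13 cycles for repeat (software pipelining)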
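The normalization steps map directly onto a short function. This is a sketch of the arithmetic only, with `math.sqrt` standing in for the RSQRT unit; the name `normalize` is hypothetical, and a zero-length input is not handled.

```python
import math


def normalize(v):
    """Scale the 3-vector v = (x, y, z) to unit length."""
    x, y, z = v
    s = x * x                   # MUL  : x^2
    s = s + y * y               # MADD : x^2 + y^2
    s = s + z * z               # MADD : x^2 + y^2 + z^2
    r = 1.0 / math.sqrt(s)      # RSQRT: 1 / sqrt(x^2 + y^2 + z^2)
    return (x * r, y * r, z * r)  # 3 x MUL


n = normalize((3.0, 0.0, 4.0))
print(n)
```

Computing one reciprocal square root and then three multiplies avoids three separate divisions, which matters when the divider is the slowest unit.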
Co-processor Mode Pipeline Stages
• Basic operations : 10 stages
  [Figure: MIPS core pipeline (I Q R A D W) together with the VU pipeline (F D X1 X2 X3 W).]
• Call VLIW processor mode programs : 11 + n stages
  [Figure: MIPS core pipeline (I Q R A D W) followed by n steps of VU VLIW pipelines (F D X1 X2 X3 W each).]
Example : Jelly
• Dynamic simulation
Spring model
Performance
[Figure: bar chart of performance in M vectors/sec for the Emotion Engine (300 MHz).]
Geometry Trans. : 150
Perspective Trans. : 66
Distance : 66
1/Distance : 46
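One way to relate the chart to the pipeline slides is a simple throughput estimate. The sketch below rests on assumptions that the slides do not state: that each VU completes one vector per repeat interval, that the geometry transformation repeats every 4 cycles (its four upper instructions), and that two VUs run in parallel. The function name `vectors_per_second` is hypothetical.

```python
CLOCK_HZ = 300e6  # Emotion Engine clock from the slides


def vectors_per_second(repeat_cycles, units=1):
    """Throughput if each unit completes one vector every `repeat_cycles` cycles."""
    return units * CLOCK_HZ / repeat_cycles


# Assumed repeat intervals: 4 cycles for geometry, 13 cycles for 1/distance (RSQRT).
geometry = vectors_per_second(4, units=2) / 1e6
inv_distance = vectors_per_second(13, units=2) / 1e6
print(f"geometry: {geometry:.1f} M vectors/sec")
print(f"1/distance: {inv_distance:.1f} M vectors/sec")
```

Under these assumptions the estimate lands close to the 150 and 46 M vectors/sec bars. It is a plausibility check, not a statement of how the published figures were measured.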
EE Micrograph
Die size : 15.02 mm x 15.04 mm
Frequency : 300 MHz
Transistors : 13.5 M
Power : 18 Watts
Design Rule : 0.25 um
Gate Length : 0.18 um
VU Micrograph
Acknowledgements
TOSHIBA Corp.
Y. Fujimoto, H. Goto, S. Hayakawa, F. Ishihara, T. Kawaai, S. Kawakami, M. Saito, H. Tago, H. Takeda, T. Teruyama, Y. Watanabe, I. Yamazaki
TOSHIBA MICROELECTRONICS Corp.
Y. Ogawa
TOSHIBA ENGINEERING Corp.
N. Iwatsuki, M. Otsuki, Y. Tsutsumi
TOSHIBA INFORMATION SYSTEMS (JAPAN) Corp.
M. Kuramochi, Y. Hikita
SONY COMPUTER ENTERTAINMENT Inc.
S. Aoki, K. Kutaragi, S. Shinohara, K. Umemura