Emotion Engine™

Page 1: Emotion Engine™

Emotion Engine™

AKA the “Playstation 2” Architecture
Or
The progeny of a MIPS and a DSP
By Idan Gazit – June 2002

Page 2: Emotion Engine™

Overview

Based around a modified and extended MIPS core (the R5900, which implements the 64-bit MIPS III instruction set).
Designed from the ground up to run “media applications” (read: games) VERY fast, but can function as a general-purpose CPU.
Bears much resemblance to DSPs (Digital Signal Processors); more on this later.

Page 3: Emotion Engine™

Basic Layout – Parallelism is Key!

MIPS CPU core (R5900)
1 FPU (Floating Point) coprocessor
2 VUs (Vector Units); more on this later
Graphics Interface Unit (GIF): passes rendered data on to the Graphics Synth, which does the work of actually “drawing” it to the screen
128b wide main bus
10-channel DMA controller

Page 4: Emotion Engine™

Basic Layout (diagram)

Page 5: Emotion Engine™

The Nitty-Gritty

The main job of the EE is to render entire frames, the product of which is a “display list”, i.e. a list of geometry (points, polygons, textures) and where each item needs to be placed on the screen.
All of this needs to be done very fast, so note the very wide data paths (128b main bus, and additional “private” links between certain units).
Also note the 10-channel DMA controller: the CPU shouldn’t waste time on I/O. Multiple connections between different units allow more than one I/O transaction at once, so long as they’re on different buses.

Page 6: Emotion Engine™

The CPU

Honest, it’s just a plain MIPS with some minor extensions.
32 x 128b general-purpose registers
2 x 64b ALUs (Arithmetic Logic Units)
1 x 128b load/store unit (parallelism again: load/store 4 words at once)
1 branch execution unit
2 coprocessors, FPU and VU0: proper MIPS coprocessors controlled by COP instructions!

Page 7: Emotion Engine™

The CPU

Able to do two 64b integer ops per cycle, or one 64b integer op and one 128b load/store.
ALUs are interesting: they are pipelined, but can be used two ways:
  Separately, as in normal CPUs (2 x 64b ops)
  Locked, to perform a 128b instruction:
    16 x 8b ops in one cycle
    8 x 16b ops in one cycle
    4 x 32b ops in one cycle
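The “locked” mode can be pictured as one 128b value carved into independent lanes. The sketch below is a software model only, not the hardware's instruction set: the function names are hypothetical, and wrap-around on overflow is an assumption, since the slides do not say how overflow is handled.

```python
LANE_WIDTHS = (8, 16, 32)  # lane sizes listed on the slide


def pack(values, lane_bits):
    """Pack unsigned lane values (lane 0 in the low bits) into one 128-bit int."""
    lane_mask = (1 << lane_bits) - 1
    word = 0
    for i, value in enumerate(values):
        word |= (value & lane_mask) << (i * lane_bits)
    return word


def unpack(word, lane_bits):
    """Split a 128-bit int back into its lanes, lane 0 first."""
    lane_mask = (1 << lane_bits) - 1
    return [(word >> shift) & lane_mask for shift in range(0, 128, lane_bits)]


def packed_add(a, b, lane_bits):
    """Add two 128-bit values lane by lane; carries never cross a lane boundary.

    Assumption: each lane wraps around on overflow.
    """
    if lane_bits not in LANE_WIDTHS:
        raise ValueError("lane_bits must be 8, 16, or 32")
    lane_mask = (1 << lane_bits) - 1
    result = 0
    for shift in range(0, 128, lane_bits):
        lane_sum = ((a >> shift) & lane_mask) + ((b >> shift) & lane_mask)
        result |= (lane_sum & lane_mask) << shift
    return result
```

One call with `lane_bits=8` stands for 16 additions, with `lane_bits=16` for 8, and with `lane_bits=32` for 4, matching the three cases on the slide.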

Page 8: Emotion Engine™

The CPU

Example supported instructions:
  MUL/DIV instructions
  3-op MUL/MADD instructions
  Arithmetic ADD/SUB instructions
  Pack and extend instructions
  Min/Max instructions
  Absolute instructions
  Shift instructions
  Logical instructions
  Compare instructions
  Quadword load/store (remember, 128b L/S unit)

Page 9: Emotion Engine™

The CPU

8k data / 16k instruction cache, 2-way set associative
6-stage pipeline (shallow, compared to modern PC architectures)
Speculative execution possible, but the penalty for a branch miss isn’t bad because it’s a short pipeline.
Pipeline stages:
  1. PC select
  2. Instruction fetch
  3. Instruction decode and register read
  4. Execute
  5. Cache access
  6. Writeback
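To see why a short pipeline keeps branch misses cheap, here is a rough cycle-count model. It is a sketch under stated assumptions, not a description of the real core: it treats the pipeline as single-issue and stall-free, and it assumes a miss discards every stage in front of the one where the branch is resolved. The 20-stage comparison machine is hypothetical.

```python
PIPELINE_STAGES = 6  # from the slide: PC select through writeback


def ideal_cycles(n_instructions, stages=PIPELINE_STAGES):
    """Cycles to finish n instructions with no stalls: fill the pipe, then one per cycle."""
    if n_instructions <= 0:
        return 0
    return stages + n_instructions - 1


def flush_penalty(resolve_stage):
    """Assumed cost of one branch miss: the stages ahead of the resolving stage are wasted."""
    return resolve_stage - 1


def cycles_with_misses(n_instructions, branch_misses, penalty, stages=PIPELINE_STAGES):
    """Ideal cycle count plus a fixed penalty per branch miss."""
    return ideal_cycles(n_instructions, stages) + branch_misses * penalty


# Assumption: the branch is resolved in Execute (stage 4 of 6).
short_pipe = cycles_with_misses(100, 10, flush_penalty(4))
# Hypothetical deep pipeline: 20 stages, branch resolved in stage 18.
deep_pipe = cycles_with_misses(100, 10, flush_penalty(18), stages=20)
```

Under these assumptions the same ten misses cost 30 extra cycles on the short pipeline and 170 on the deep one.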

Page 10: Emotion Engine™

The CPU

16k of SPRAM (“Scratch Pad” RAM): VERY VERY FAST. In the CPU core.
What is this stuff? This is actually a very fast data cache shared by the CPU and VU0.
The 128b “private” link between the CPU and VU0 allows VU0 to use the SPRAM and the CPU to directly reference the VU’s registers.
Which leads us nicely to the fact that the really difficult work is performed by…

Page 11: Emotion Engine™

Vector Units: The heart of EE

FMAC: Floating-Point Multiply-Accumulate
  As it turns out, this operation is critical to 3D rendering, and is performed many times in tight loops.
  An obvious candidate for parallelism and pipelining!
Between both VUs and the FPU, there are a total of 10 FMAC units, each able to do 1 FMAC per cycle, as well as other useful instructions.
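As an illustration of why FMAC dominates 3D work, transforming one vertex by a 4x4 matrix is nothing but multiply-accumulates. The sketch below is plain Python with hypothetical names (`fmac`, `transform`); it models the arithmetic, not the hardware.

```python
def fmac(acc, a, b):
    """One floating-point multiply-accumulate: acc + a * b."""
    return acc + a * b


def transform(matrix, vertex):
    """Multiply a 4x4 matrix (list of rows) by a 4-component vertex using only FMACs."""
    out = [0.0, 0.0, 0.0, 0.0]
    for col in range(4):        # 4 sequential steps
        for row in range(4):    # 4 independent FMACs per step
            out[row] = fmac(out[row], matrix[row][col], vertex[col])
    return out


# Example: translate the point (1, 2, 3) by (5, 6, 7).
translate = [
    [1.0, 0.0, 0.0, 5.0],
    [0.0, 1.0, 0.0, 6.0],
    [0.0, 0.0, 1.0, 7.0],
    [0.0, 0.0, 0.0, 1.0],
]
moved = transform(translate, [1.0, 2.0, 3.0, 1.0])
```

The 16 FMACs fall into 4 steps of 4 independent operations each, so a unit with four FMACs working side by side could, in principle, finish each step in one go. The four-per-VU figure is an inference from the slides' totals (10 overall, minus one in the FPU and one in VU1's EFU), not something they state directly.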

Page 12: Emotion Engine™

Example VU “Useful Instructions”

FMAC: 1 cycle
Min/Max: 1 cycle
FDIV – another logical unit, 1 per VU
  Floating-point divide: 7 cycles
  Square root: 7 cycles
  Inverse square root: 13 cycles
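These cycle counts can be used for a back-of-the-envelope cost estimate. The sketch below compares two ways of normalizing a 3-component vector. It is a deliberately crude model: operations are assumed to run one after another with no overlap, each scaling multiply is counted as an FMAC, and the operation names in the table are just labels for the figures above.

```python
# Cycle costs as given on the slide.
CYCLES = {"FMAC": 1, "MINMAX": 1, "FDIV": 7, "SQRT": 7, "RSQRT": 13}


def total_cycles(ops):
    """Sum the cost of a sequence of operations, assuming no overlap between them."""
    return sum(CYCLES[op] for op in ops)


# length^2 = x*x + y*y + z*z (3 FMACs), then 1/sqrt, then scale x, y, z.
normalize_with_rsqrt = ["FMAC"] * 3 + ["RSQRT"] + ["FMAC"] * 3

# Same, but with a square root followed by one divide to get 1/length.
normalize_with_sqrt_div = ["FMAC"] * 3 + ["SQRT", "FDIV"] + ["FMAC"] * 3

rsqrt_cost = total_cycles(normalize_with_rsqrt)        # 3 + 13 + 3
sqrt_div_cost = total_cycles(normalize_with_sqrt_div)  # 3 + 7 + 7 + 3
```

In this model the dedicated inverse square root saves one cycle over a square root plus a divide (19 versus 20).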

Page 13: Emotion Engine™

Vector Units

However, there are differences between the two VUs and how they are utilized.
Both are VLIW: each long instruction word bundles more than one operation.
Processing units are split into two “working groups”:
  Group 1: CPU + FPU + VU0 (“Emotion Synthesis” on diagram)
  Group 2: VU1 + GIF (“Geometry Processing” on diagram)

Page 14: Emotion Engine™

Group 1

Here, the FPU and VU0 act as proper MIPS coprocessors, and are linked to the CPU by a private 128b wide bus to avoid crowding the main bus.
FPU is nothing special, just another FPU coprocessor. 1 FMAC unit, 1 FDIV unit, each identical to the VU FMAC/FDIV.
VU0 does the real heavy lifting when it comes to the math; the CPU acts as more of a traffic director, feeding data as fast as it can to the VU for processing.

Page 15: Emotion Engine™

Group 1

Although group 1 does geometry processing, it is also responsible for more general-purpose calculations, such as enemy AI, game physics, etc.
Therefore group 1 has the (more generalized) CPU, whereas group 2 focuses only on geometry (and has only VU1 and the GIF).
Definite hierarchy of control in group 1: CPU controls FPU and VU0.

Page 16: Emotion Engine™

Group 1 – Vector Unit 0 (diagram)

Page 17: Emotion Engine™

Group 1 – Vector Unit 0

32 x 128b FP registers, each holding 4 x 32b single-precision FP numbers.
16 x 16b integer registers for integer math
Instructions are just standard 32b “COP” (coprocessor) instructions
Data is passed from the CPU in 128b bundles, which the VIF (VU Interface) “unpacks” into 4 x 32b data words.
8k each for data cache/instruction cache
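The “unpack” step can be sketched in software as splitting 16 bytes into four 32b floats. This is only an illustration of the data shape: the function name is hypothetical, and the little-endian, four-floats-in-order layout is an assumption rather than a description of the VIF's actual formats.

```python
import struct


def unpack_quadword(quadword):
    """Split one 128-bit bundle (16 bytes) into four 32-bit single-precision floats.

    Assumption: little-endian byte order, components stored in order.
    """
    if len(quadword) != 16:
        raise ValueError("a quadword is exactly 16 bytes")
    return struct.unpack("<4f", quadword)


# Build a bundle holding the vertex (1.0, 2.0, 3.0, 1.0) and unpack it again.
bundle = struct.pack("<4f", 1.0, 2.0, 3.0, 1.0)
x, y, z, w = unpack_quadword(bundle)
```

The four values are what would land in one of the 128b FP registers described above.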

Page 18: Emotion Engine™

Group 2

Consists of VU1 and the GIF (Graphics Interface).
VU1 acts like a standalone VLIW processor, and is not directly controlled by the CPU.
Perhaps a proper name for VU1 is the “Geometry Processor” for the GIF: this is pure data processing, and it has to happen quickly to keep the GIF saturated with graphics to draw out to your TV.

Page 19: Emotion Engine™

Group 2 – Vector Unit 1 (diagram)

Page 20: Emotion Engine™

Group 2 – Vector Unit 1

Same general features as VU0, but some differences according to VU1’s role:
  Addition of an “EFU” (elementary functional unit): basically one FMAC and one FDIV unit doing the more rudimentary geometry calculations. Note a striking resemblance to the FPU from group 1…
  16k each of data & instruction cache, up from 8k. Since VU1 must handle geometry independently of the CPU, it ends up handling much more data than VU0.

Page 21: Emotion Engine™

Group 2 – Vector Unit 1

Special direct connection between data cache and the GIF.
Why is this special? VU1 can work on a display list in cache and have it sent over to the GIF by DMA. Quicker than using the main bus to shuttle data around, less dependent on the CPU, and leaves the main bus free for load instructions.

Page 22: Emotion Engine™

Vector Unit Comparison

Designers opted for flexibility in design, and thus the architecture is slightly confusing:
  VU0 is a coprocessor, VU1 is a VLIW mini-processor.
  BUT… VU0 can be switched into VLIW mode, where the CPU then communicates with it like VU1 (e.g. receiving 64b instruction “bundles” and parsing them with the VIF).

Page 23: Emotion Engine™

Vector Unit Instructions

We really should treat the VUs as limited processors.
Each 64b VLIW breaks down into two 32b COP instructions, an “upper” instruction and a “lower” instruction.
The upper/lower distinction is important; the types of work they do are different.
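Splitting a 64b bundle into its two halves can be sketched as below. The function name is hypothetical, and placing the “upper” instruction in the high 32 bits is an assumption made for illustration; the slides only say that each bundle contains one of each.

```python
def split_vliw(word64):
    """Split a 64-bit instruction bundle into (upper, lower) 32-bit instructions.

    Assumption: the upper instruction occupies bits 63..32, the lower bits 31..0.
    """
    if not 0 <= word64 < (1 << 64):
        raise ValueError("expected an unsigned 64-bit value")
    upper = (word64 >> 32) & 0xFFFFFFFF
    lower = word64 & 0xFFFFFFFF
    return upper, lower


upper, lower = split_vliw(0x1122334455667788)
```

Because the two halves go to different kinds of execution units, one bundle can start a SIMD math operation and a “utility” operation in the same step.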

Page 24: Emotion Engine™

Vector Unit Instructions

Upper instructions: SIMD (Single Instruction – Multiple Data) instructions
  Aptly named: these are the “fast” multimedia instructions that do the same operation on lots and lots of data.
  Logically, these types of instructions are handled by the special VU units: FMAC, FDIV, etc.
  Note that these instructions ONLY use the “special” units in each VU.

Page 25: Emotion Engine™

Vector Unit Instructions

Lower instructions: non-SIMD type
More “utility” than processing:
  Load/store instructions
  Jump/branch instructions
  Random number generation
  EFU instructions (only in VU1; remember, 1 FMAC and 1 FDIV)
Note that these instructions use units in the VUs that I didn’t mention (RNG unit, load/store unit, etc.); they’re the more “mundane” units for the more “mundane” tasks.

Page 26: Emotion Engine™

Flow of Execution

So with all of this confusing flexibility, what do we get?
Two ways of doing work:
  Group 1 & Group 2 both render in parallel, both passing on display lists to the GIF.
  Group 1 (CPU, VU0, FPU) prepares instructions for VU1 (load/store, branching, etc.), which VU1 renders and passes on to the GIF.

Page 27: Emotion Engine™

Flow of Execution

Method 1: (parallel)
Method 2: (serial)
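The two methods differ only in who produces display lists for the GIF. The toy model below shows that difference in data flow. All names are hypothetical, the “rendering” is a string tag, and Python runs the steps one after another, so it says nothing about timing.

```python
def render(group, batch):
    """Stand-in for geometry processing: tag each object with the group that handled it."""
    return [f"{group}:{obj}" for obj in batch]


def parallel_flow(objects):
    """Method 1: both groups render their share and both send display lists to the GIF."""
    half = len(objects) // 2
    gif_input = []
    gif_input += render("group1", objects[:half])
    gif_input += render("group2", objects[half:])
    return gif_input


def serial_flow(objects):
    """Method 2: group 1 prepares the work, VU1 (group 2) renders it for the GIF."""
    prepared = [f"prepared({obj})" for obj in objects]
    return render("group2", prepared)


scene = ["ship", "rock", "laser", "star"]
method_1 = parallel_flow(scene)
method_2 = serial_flow(scene)
```

In the parallel case the GIF receives entries from both groups; in the serial case everything reaching the GIF has passed through VU1.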

Page 28: Emotion Engine™

DSPs, PS2s and PCs, oh my!

Essentially, the PS2 (like DSPs) is performing a small number of instructions on a large amount of “uniform” data.
Exactly the opposite of PCs, which perform a large number of instructions on varying data.
Side-effect bonus: good “Locality of Reference”. Instructions in the PS2 don’t jump around as much as in PCs, therefore there is less chance of cache misses or branch mispredictions.

Page 29: Emotion Engine™

DSPs, PS2s and PCs, oh my!

Note design decisions that promote data-intensive computing:
  Wide buses, and private connections between units that move a lot of data.
  VLIW: instructions come packaged with lots and lots of data.
  Large registers and load/store units. Instructions geared towards SIMD-style (e.g. a 128-bit load brings in 4 words of data at once).
  MASSIVE ability to calculate inner-loop instructions (FMAC) in ONE CYCLE: 10 FMAC units, therefore 10 of these can be done in 1 cycle. Even FDIVs are fast (7 cycles).
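The “10 FMACs in one cycle” figure turns into a peak rate once a clock speed is chosen. The slides give no clock frequency, so the 300 MHz below is purely an illustrative assumption. Counting each FMAC as two floating-point operations (one multiply, one add) is a common convention, not something stated in the slides.

```python
FMAC_UNITS = 10  # from the slides: both VUs plus the FPU


def peak_fmacs_per_second(clock_hz, units=FMAC_UNITS):
    """Upper bound on FMACs per second if every unit completes one per cycle."""
    return clock_hz * units


def peak_flops(clock_hz, units=FMAC_UNITS):
    """Same bound, counting each FMAC as a multiply plus an add."""
    return 2 * peak_fmacs_per_second(clock_hz, units)


ASSUMED_CLOCK_HZ = 300_000_000  # illustrative value only, not taken from the slides
fmacs = peak_fmacs_per_second(ASSUMED_CLOCK_HZ)
flops = peak_flops(ASSUMED_CLOCK_HZ)
```

This is a ceiling: it assumes all ten units are kept busy on every cycle, which is exactly what the wide buses and DMA channels are there to make possible.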

Page 30: Emotion Engine™

Conclusion

Entire EE design is centered around a specialized purpose: games! It can run generalized apps, but with a penalty.
How much of a penalty? Interesting question. Perhaps not much, because there is a general-purpose MIPS at the core.
More similar in design to a DSP: a fixed, small number of instructions to be done on large amounts of uniform data.

Page 31: Emotion Engine™

The End & References

http://www.arstechnica.com/reviews/1q00/playstation2/ee-1.html
http://www.arstechnica.com/cpu/2q00/ps2/ps2vspc-1.html
http://www.scea.com/news/press_example.asp?ps2=ps2&ReleaseID=9
http://users.ece.gatech.edu/~scotty/7102/pres/5
http://www.eecg.toronto.edu/~stoodla/processors/Sony/EmotionEngine.html
http://ntsrv2000.educ.ualberta.ca/nethowto/examples/m_ho/ps2eengine.html
http://www.geocities.com/SiliconValley/Bay/6114/cpu2.html