Cell Broadband Engine Introduction & Architecture



IBM Systems & Technology Group

Cell/BE

Cell Programming Workshop 04/19/23 © 2007 IBM Corporation

Cell Broadband Engine

Introduction & architecture

Francesco Bertagnolli
System & Technology Group


Agenda

– Cell introduction

– Cell architecture

– SDK 3.0

– Linux on PS3

– Cell basic programming

– Hands-on

– Cell applications

Cell History

– 2000: IBM, SCEI/Sony, Toshiba alliance formed
– March 2001: Austin-based Design Center opened
– Spring 2004: Single Cell BE operational
– February 7, 2005: First technical disclosures
– November 9, 2005: Open source SDK & simulator published
– February 8, 2006: IBM announced Cell Blade
– July 2006: SDK 1.1 available
– September 2006: GA of IBM BladeCenter QS20
– December 2006: SDK 2.0 available
– October 2007: SDK 3.0 available
– October 2007: QS21 available
– May 2008: QS22 available

Introduction

[Figure: Clock speed (MHz, log scale from 1.0E+02 to 1.0E+04) versus year, 1990–2010, for Intel processors and IBM processors (RS64-4, Power3, Power3-II, Power4, Power4+, Power5, Power5+, Power6, Z6, BlueGene/L, BlueGene/P), with Intel's 2003 roadmap.]

[Figure: Single-precision floating-point performance (Mflops, log scale) versus year, 1994–2006, for game processors and PC processors, with 1 GFlop and 1 TFlop reference levels.]

Overview of the Cell Broadband Engine Processor

The CBE processor is the first implementation of a new multiprocessor family conforming to the Cell Broadband Engine Architecture (CBEA).

The CBEA and the CBE processor are the result of a collaboration between Sony, Toshiba, and IBM known as STI, formally begun in early 2001.

The Cell Broadband Engine Architecture has been designed to support a very broad range of applications (commercial, scientific fields...).

Although the CBE processor is initially intended for applications in media-rich consumer-electronics devices such as game consoles and high-definition televisions, the architecture has been designed to enable fundamental advances in processor performance.

Cell Competitive Roadmap

[Figure: roadmap over 2006–2010 with two tracks, "Performance Enhancements/Scaling" and "Cost Reduction". Designs shown:]

– Cell BE (1+8), 90nm SOI
– Cell BE (1+8), 65nm SOI
– Enhanced Cell BE (1+8 eDP SPE), 65nm SOI
– Next Gen (2 PPE' + 32 SPE'), 45nm SOI, ~1 TFlop (est.)

All future dates and specifications are estimations only; subject to change without notice. Dashed outlines indicate concept designs.

Cell Broadband Engine Architecture Blades: IBM BladeCenter QS20 and beyond

[Figure: blade and SDK roadmap over 2006, 2007, 2008, 2009-2010; entries are marked either "Committed" or "Concept".]

BladeCenter QS20 (GA September 2006)
– 2 Cell/B.E. processors
– 1 PPE + 8 SPE
– SP: 460 GFLOPS per Cell blade
– DP: 42 GFLOPS per Cell blade
– 1 GB memory

BladeCenter QS2X (target availability: 4Q07)
– 2 Cell/B.E. processors
– 1 PPE + 8 SPE
– SP: 460 GFLOPS per Cell blade
– DP: 42 GFLOPS per Cell blade
– Next Generation I/O chip
– 2 GB memory

BladeCenter QS2Y (target availability: 1H08)
– 2 CBEA-compliant processors
– 1 PPE + 8 eDP SPE
– SP: 460 GFLOPS per blade
– eDP: 217 GFLOPS per blade
– Up to 32 GB memory
– PCI Express™ x16 slots

BladeCenter QS2Z (target availability: 1H10)
– First CBEA teraflop processor
– 2 PPE' + 32 eSPE
– Power Architecture compliant
– ~2 TFLOPS SP per blade
– ~1 TFLOPS DP per blade
– Next generation memory technology

SDK releases
– SDK 1.1: available July 2006
– SDK 2.1: available March 07
– SDK 3.0: target release September 07
– SDK 4.0: target release 08
– SDK 5.0: target release December 08


Cell Basic Design Concept

Where Have All the Transistors Gone …?

– Cache
– Deep Pipelining
– Out-of-Order Processing

(Each of the three items is marked with an X on the slide.)

Add Performance … and Inefficiency

Three Major Limiters to Processor Performance

Power Wall
– Hard limit to acceptable system power

Memory Wall
– Processor frequency vs. DRAM memory latency

Frequency Wall

Cell Concept

Increased efficiency and performance
– Non-homogeneous coherent chip multiprocessor
  • Allows an attack on the “Frequency Wall”
– DMA architecture attacks “Memory Wall”
– Design, low operating voltage attacks “Power Wall”
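
To put the "Memory Wall" (processor frequency vs. DRAM latency) in rough numbers, the sketch below converts a memory access latency into processor cycles. The 3.2 GHz clock is the Cell/B.E. frequency quoted in this deck; the latency values are hypothetical and chosen only for illustration, not taken from the slides or from measurements.

```python
def stall_cycles(latency_ns: float, clock_ghz: float) -> float:
    """Cycles a core would wait for one memory access of the given latency."""
    # 1 GHz is one cycle per nanosecond, so cycles = nanoseconds * GHz.
    return latency_ns * clock_ghz


# Hypothetical DRAM latencies (assumed for illustration only).
for latency_ns in (50.0, 100.0, 200.0):
    cycles = stall_cycles(latency_ns, clock_ghz=3.2)
    print(f"{latency_ns:5.0f} ns at 3.2 GHz = {cycles:.0f} cycles")
```

The higher the clock, the more cycles each access costs, which is why the Cell concept moves data with explicit DMA instead of waiting on loads.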

Example Dual Core
– 349 mm², 3.4 GHz @ 150 W
– 2 cores, ~54 SP GFlops

Cell/B.E.
– 3.2 GHz
– 9 cores, ~230 SP GFlops

Cell/B.E.: ½ the space & power vs. traditional approaches

Please note that on any traditional processor, the ratio of cores to cache shown here remains ~50% of area.
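
The ~230 SP GFlops figure can be reproduced with simple arithmetic. The sketch below assumes that each of the 9 cores (1 PPE + 8 SPEs) can complete one 4-wide single-precision fused multiply-add per cycle; that assumption is not stated on the slide, but it yields the quoted number. The constant and function names are illustrative.

```python
CLOCK_GHZ = 3.2      # Cell/B.E. clock from the slide
SIMD_LANES = 4       # four 32-bit floats in a 128-bit register
FLOPS_PER_FMA = 2    # assumption: one multiply plus one add per lane per cycle


def peak_sp_gflops(cores: int, clock_ghz: float = CLOCK_GHZ) -> float:
    """Peak single-precision GFLOPS under the assumptions above."""
    return cores * SIMD_LANES * FLOPS_PER_FMA * clock_ghz


per_core = peak_sp_gflops(1)
cell_total = peak_sp_gflops(9)   # 1 PPE + 8 SPEs
print(f"per core:        {per_core:.1f} GFLOPS")
print(f"Cell/B.E. total: {cell_total:.1f} GFLOPS")
```

Two such chips per blade give roughly the 460 GFLOPS per blade quoted for the BladeCenter systems.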


Why Cell ? (1)

Cell/BE: General Purpose…

Flexibility

Multi-level parallelism

Stream processing

Dual pipelines in the SPEs

Static scheduling pipeline: no buffer

Storage hierarchy

Why Cell ? (2)

Simple hardware LS

SPEs independent & synergistic
– cluster with 8

Several systems:
– game, HDTV, blades, supercomputing, cluster computing, mainframes, etc.
– Structure is not fixed

MFC, DMA

Registers 128x128 (4x32)

Why Cell ? (3)

Technology 90-65-45.. nm

State of the art

Software development support

Low power consumption

Flaws? NO, it's RISC..

FLEXIBILITY

Cell/B.E. enables scalable, shared architecture with full consumer to professional potential

[Figure: Cell/B.E. systems placed across market segments labeled Consumer, Professional, Business, and High Perf Computing:]

– SCE PS3 (Cell/B.E. + GPU)
– Sony Cell/B.E. Computing Unit (Cell/B.E. + GPU + AV I/O)
– Mercury Cell/B.E. PCI Card (Cell/B.E. + Network)
– IBM Cell/B.E. Blade (2 Cell/B.E.s)
– IBM Roadrunner (16,000 Cell/B.E.s + AMD)

Common Operating Systems, Infrastructure, Tools, Libraries, Code…


Challenges of Digital Future – System integration and flexibility

Integration of offload engines and accelerators into processor

– Simpler system structure

Integration of bridge functionality

– More efficient I/O designs


Cell Hardware components & performance


Hardware Environment

The Processor Elements

Element Interconnect Bus

Memory Interface Controller

Cell Broadband Engine Interface Unit

Block diagram of the CBE processor hardware

[Figure: block diagram showing the Synergistic Processor Elements, the PowerPC Processor Element, the Memory Interface Controller, the Element Interconnect Bus, and the Cell Broadband Engine Interface Unit.]

Power Processor Elements: PPE

[Figure: PPE on the EIB, labeled "64-bit Power Architecture with VMX", with blocks PPU, PXU, L1, and L2.]

The PowerPC Processor Element (PPE) features:
– a general-purpose 64-bit RISC processor
– conforms to the PowerPC Architecture
– dual-threaded
– with vector/SIMD multimedia extensions

PPE responsibilities:

o is responsible for overall control of a CBE system

o runs the operating systems

It has:

32 KB level-1 (L1) instruction and data caches

512 KB level-2 (L2) unified (instruction and data) cache

The PPE supports the standard PowerPC Architecture instructions and the vector/SIMD multimedia extensions


PPE Registers

32 General-Purpose Registers (GPRs)—Fixed-point instructions operate on the full 64-bit width of the GPRs.

32 Floating-Point Registers (FPRs), 64 bits wide. The internal format of floating- point data is the IEEE 754 double-precision format. Single-precision results are maintained internally in the double-precision format.

64-bit LR - to hold the effective address of a branch target.

64-bit CTR - to hold either a loop counter or the effective address of a branch target.

64-bit XER - contains the carry and overflow bits and the byte count for the move-assist instructions.

32 128-bit-wide VMRs - serve as source and destination registers for all vector instructions.
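
The note that single-precision results are held internally in double-precision format works because every IEEE 754 single-precision value is exactly representable as a double. The sketch below illustrates this on the host with Python's `struct` module; it does not touch PPE registers, and the helper name `to_single` is made up for this example.

```python
import struct


def to_single(x: float) -> float:
    """Round a Python float (an IEEE 754 double) to single precision."""
    return struct.unpack(">f", struct.pack(">f", x))[0]


value = 0.1
single = to_single(value)

# Rounding to single precision changes the value of 0.1 ...
print(f"double 0.1: {value!r}")
print(f"single 0.1: {single!r}")
# ... but the rounded result sits in a double with no further loss:
# converting it to single again gives back exactly the same number.
print("stable after re-rounding:", to_single(single) == single)
```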

PPE multithreading

The PPE hardware supports two simultaneous threads of execution.

It has duplicate sets of the PowerPC and vector user-state register files (one set for each thread).

To software, the PPE appears to provide two independent instruction-processing units.

The threads appear to be independent because the PPE provides each thread with a copy of architectural state (registers), but the threads are not completely independent because many execution resources are shared by the threads to reduce the hardware cost of multithreading.

To software, the PPE implementation of multithreading looks similar to a multiprocessor implementation, but there are several important differences.


PPE Multithreading vs Multi-Core Implementations

Table compares the PPE multithreading implementation to a conventional dual-core microprocessor

PPE Block Diagram

[Figure: PPE pipeline for Thread A and Thread B. Blocks shown: Fetch Control, Branch Scan, L1 Instruction Cache, L2 Interface, Pre-Decode, Microcode, SMT Dispatch (Queue), Decode, Dependency, Issue, VMX/FPU Issue (Queue), VMX Load/Store/Permute, VMX Arith./Logic Unit, FPU Load/Store, FPU Arith/Logic Unit, Load/Store Unit, Branch Execution Unit, Fixed-Point Unit, L1 Data Cache, FPU Completion, VMX Completion, Completion/Flush.]

Synergistic Processor Elements: SPEs

Each SPE:
– RISC core
– 256 KB SRAM Local Store for instructions and data
– 128 x 128-bit register file
– supports a special SIMD instruction set

[Figure: SPEs on the EIB. Labels within each SPE: SPU, SPU Core (SXU), Channel Unit, Local Store (LS), MFC (DMA Unit), "To Element Interconnect Bus".]

DMA Unit: transfers data between Local Store and Main Memory

Synergistic Processor Element (SPE)

The eight identical SPEs are single-instruction, multiple-data (SIMD) processor elements that are optimized for data-rich operations allocated to them by the PPE.

An SPE is not optimized for running an operating system.

The SPEs are independent processor elements, each running its own individual application programs or threads.

The SPEs are designed to be programmed in high-level languages, such as C/C++.

They support a rich instruction set that includes extensive SIMD functionality.

However, use of SIMD data types is preferred, not mandatory.
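
As a rough picture of what one SIMD operation does, the sketch below models a 128-bit register as 16 bytes holding four 32-bit floats and adds two such registers lane by lane. It is a host-side illustration in Python, not SPU code; the function name `simd_add_f32` is hypothetical and does not correspond to an SPU intrinsic.

```python
import struct


def simd_add_f32(a: bytes, b: bytes) -> bytes:
    """Lane-wise add of two 128-bit values, each holding four 32-bit floats."""
    lanes_a = struct.unpack(">4f", a)
    lanes_b = struct.unpack(">4f", b)
    return struct.pack(">4f", *(x + y for x, y in zip(lanes_a, lanes_b)))


va = struct.pack(">4f", 1.0, 2.0, 3.0, 4.0)
vb = struct.pack(">4f", 10.0, 20.0, 30.0, 40.0)
result = struct.unpack(">4f", simd_add_f32(va, vb))

print(len(va) * 8, "bits per register")
print("lane-wise sum:", result)
```

One instruction thus performs four additions, which is where the data-rich workloads get their speedup.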


SPU Organization


SPE Registers

128 General-Purpose Registers (GPRs), each 128 bits wide, that can be used to store all data types

The Floating-Point Status and Control Register (FPSCR) records information about the result and any associated exceptions.
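
The size of the SPE register file and the number of elements that fit in one register follow directly from the 128 x 128-bit figure. A small sketch of that arithmetic (the element widths listed are common integer and floating-point sizes, chosen here for illustration):

```python
REGISTERS = 128
REGISTER_BITS = 128

# Total capacity of the register file in bytes.
total_bytes = REGISTERS * REGISTER_BITS // 8

# How many elements of each width fit in one 128-bit register.
lanes = {width: REGISTER_BITS // width for width in (8, 16, 32, 64)}

print(f"register file: {total_bytes} bytes")
for width, count in lanes.items():
    print(f"{count:2d} x {width}-bit elements per register")
```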


One Difference between PPE and SPEs ...

The most significant difference between the SPE and PPE lies in how they access memory

The PPE accesses main storage with load and store instructions that move data between main storage and a private register file, the contents of which may be cached

The SPEs, in contrast, access main storage with direct memory access (DMA) commands that move data and instructions between main storage and a private local memory, called a local store or local storage (LS). The LS has no associated cache

This 3-level organization of storage (register file, LS, main storage) is a radical break from conventional architecture and programming models
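
The programming consequence of this storage model is that an SPE program stages its data: copy a block from main storage into the local store, compute on the local copy, copy the result back. The sketch below simulates that pattern in plain Python. It is not the MFC or SDK API; the 16 KB chunk size, the function names, and the `increment` kernel are assumptions made for this illustration.

```python
LOCAL_STORE_BYTES = 256 * 1024   # LS size from the slides
CHUNK_BYTES = 16 * 1024          # assumed transfer size for this sketch


def process_in_chunks(main_memory: bytearray, kernel) -> None:
    """Stage data through a small buffer, as an SPE stages data through its LS."""
    assert CHUNK_BYTES <= LOCAL_STORE_BYTES
    for start in range(0, len(main_memory), CHUNK_BYTES):
        # "DMA get": copy a block from main storage into the local buffer.
        local = bytearray(main_memory[start:start + CHUNK_BYTES])
        # Compute only on the local copy.
        kernel(local)
        # "DMA put": copy the result back to main storage.
        main_memory[start:start + len(local)] = local


def increment(buf: bytearray) -> None:
    """Toy kernel: add 1 to every byte (wrapping at 256)."""
    for i in range(len(buf)):
        buf[i] = (buf[i] + 1) % 256


data = bytearray(40 * 1024)      # larger than one chunk, not a multiple of it
process_in_chunks(data, increment)
print(data[0], data[-1], len(data))
```

On real hardware the transfers would be asynchronous DMA commands, so the next block can be fetched while the current one is being processed.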

[Figure: memory on the Cell BE chip: System Memory; 512 kB L2 cache (4 x 128 kB sub-arrays); 32 kB L1 data cache; 32 kB L1 instruction cache; 256 kB Local Store (16 x 16 kB sub-arrays).]

Memory Interface Controller - MIC

[Figure: MIC between the EIB and Dual XDR™ memory.]

The MIC provides the interface between the EIB and physical memory.

It supports one or two Rambus extreme data rate (XDR) memory interfaces (which together support between 64 MB and 64 GB of XDR DRAM memory).

XDR DRAM is ECC-protected, with multi-bit error detection and optional single-bit error correction.

Memory interface: 16 B/cycle, 25.6 GB/s (@ 1.6 GHz)
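
The quoted memory bandwidth is simply bytes per cycle times clock rate. A minimal sketch of that calculation, using only the two numbers given on the slide (the function name is illustrative):

```python
def bandwidth_gb_per_s(bytes_per_cycle: int, clock_ghz: float) -> float:
    """Peak bandwidth: bytes moved per cycle times cycles per nanosecond."""
    return bytes_per_cycle * clock_ghz


mic = bandwidth_gb_per_s(16, 1.6)
print(f"memory interface peak: {mic:.1f} GB/s")
```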

Element Interconnect Bus - EIB

– On-chip coherent bus
– 96 B/cycle bandwidth
– 2 rings in each direction

Cell Broadband Engine Interface Unit (BEI)

– I/O interface (FlexIO™)
– Can be coherent
– 16 B/cycle x 2


Cell performance

>100 GFLOPs DP in 65nm

Cell is not a collection of different processors, but a synergistic whole



Source: Cell Broadband Engine Architecture and its first implementation – A performance view, http://www-128.ibm.com/developerworks/library/pa-cellperf/

Key Performance Characteristics

Cell's performance is about an order of magnitude better than that of a general-purpose processor (GPP) for media and other applications that can take advantage of its SIMD capability

– Performance of its simple PPE is comparable to that of a traditional GPP

– Each SPE performs about the same as, or better than, a GPP with SIMD running at the same frequency

– The key performance advantage comes from its 8 decoupled SPE SIMD engines with dedicated resources, including large register files and DMA channels

Cell can cover a wide range of application space with its capabilities in

– floating point operations

– integer operations

– data streaming / throughput support

– real-time support


Cell Blade

The First Generation Cell Blade

– Cell processors
– 1 GB XDR memory
– IBM BladeCenter interface

[Figure: blade layout with two Cell processors, each with its own XDRAM and South Bridge; two IB 4X links and two GbE links lead to the BladeCenter network interface.]

BladeCenter-H

– 2 Cell chips per QS21 blade
– 14 QS21 blades per BladeCenter
– 60 Watt per Cell

Peak Performance

Up to 460 GFLOPS per blade

Up to 6.4 TFLOPS in a single BladeCenter H chassis

Up to 25.8 TFLOPS in a standard 42U rack
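
The chassis and rack figures follow from the per-blade number. The sketch below redoes that arithmetic; the 9U chassis height is an assumption not stated on the slide, used here only to explain why four chassis fit in a 42U rack.

```python
GFLOPS_PER_BLADE = 460     # from the slide (2 Cell/B.E. chips per blade)
BLADES_PER_CHASSIS = 14    # from the slide
RACK_UNITS = 42            # standard rack from the slide
CHASSIS_UNITS = 9          # assumption: BladeCenter H height in rack units

chassis_tflops = GFLOPS_PER_BLADE * BLADES_PER_CHASSIS / 1000
chassis_per_rack = RACK_UNITS // CHASSIS_UNITS
rack_tflops = chassis_tflops * chassis_per_rack

print(f"per chassis:      {chassis_tflops:.2f} TFLOPS")
print(f"chassis per rack: {chassis_per_rack}")
print(f"per rack:         {rack_tflops:.2f} TFLOPS")
```

Under these assumptions the results round to the 6.4 and 25.8 TFLOPS quoted above.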

Server architecture

[Figure: workstations (ThinkPad T60 laptops and PCs) connected through an Ethernet switch to IBM BladeCenter chassis holding QS20 and QS21 blades; the chassis are linked by InfiniBand, with an InfiniBand-to-Ethernet connection (IB <> Eth.).]


IBM BladeCenter QS21

Announcement: August 28, 2007


IBM BladeCenter QS22


IBM BladeCenter QS22: specifications


Where to get more Cell BE information?

Cell Resource

Cell resource center at developerWorks
– http://www-128.ibm.com/developerworks/power/cell/

Cell developer's corner at power.org
– http://www.power.org/resources/devcorner/cellcorner/

The cell project at IBM Research
– http://www.research.ibm.com/cell/

The Cell BE at IBM alphaWorks
– http://www.alphaworks.ibm.com/topics/cell

Cell BE at IBM Engineering & Technical Services
– http://www-03.ibm.com/technology/

IBM Power Architecture
– http://www-03.ibm.com/chips/power/

Cell BE documentation at IBM Microelectronics
– http://www-306.ibm.com/chips/techlib/techlib.nsf/products/Cell_Broadband_Engine

Cell Linux info at the Barcelona Supercomputing Center website
– http://www.bsc.es/projects/deepcomputing/linuxoncell/


Cell Education

Online courses at IBM Education Assistant

– http://publib.boulder.ibm.com/infocenter/ieduasst/stgv1r0/index.jsp

Online courses at IBM Learning

– http://ibmlearning.ibm.com/index.html

Podcasts at power.org

– http://www.power.org

Onsite classes at IBM Innovation Center

– https://www-304.ibm.com/jct09002c/isv/spc/events/cbea.html

Cell BE Documentation

The following documents define the Cell Broadband Engine architecture, programming using the SDK, the new IBM BladeCenter QS20, XL C/C++ compiler, Full-System Simulator, and the PowerPC base architecture.

Cell Broadband Engine
– Cell Broadband Engine Architecture V1.01 (updated)
– Cell Broadband Engine Programming Handbook V1.0
– Cell Broadband Engine Registers V1.4 (updated)
– SPU C/C++ Language Extensions V2.2.1 (updated)
– Synergistic Processor Unit (SPU) Instruction Set Architecture V1.11 (updated)
– SPU Application Binary Interface Specification V1.5.1 (updated)
– SPU Assembly Language Specification V1.4 (updated)

Cell BE Documentation

Cell Broadband Engine Programming using the SDK
– Cell Broadband Engine SDK Installation Guide V2.0 (updated)
– Cell Broadband Engine SDK Programmer's Guide V1.0 (new)
– Cell Broadband Engine Programming Tutorial V2.0 (updated)
– Cell Broadband Engine Linux Reference Implementation Application Binary Interface Specification V1.1 (updated)
– SPE Runtime Management library documentation V1.2 (updated)
– SPE Runtime Management library documentation V2.0 (new)
– Cell Broadband Engine SIMD Math Library Specification V1.0 (new)
– Accelerator Library Framework Programming Guide and API Reference V1.0 (new)
– Sample Library documentation V2.0 (updated)
– IDL Compiler for Remote Procedure Calls
– Post-link Optimization Utility (new)

Cell BE Documentation

IBM BladeCenter QS20
– IBM BladeCenter QS20 Datasheet
– IBM BladeCenter QS20 Installation and User's Guide
– IBM BladeCenter QS20 Problem Determination and Service Guide

IBM XL C/C++ Compiler
– Getting Started with IBM XL C/C++ Compiler (new)
– IBM XL C/C++ Compiler Language Reference (new)
– IBM XL C/C++ Compiler Programming Guide (new)
– IBM XL C/C++ Compiler Reference (new)
– IBM XL C/C++ Compiler Installation Guide (new)

Cell BE Documentation

IBM Cell Broadband Engine Full-System Simulator
– IBM Full-System Simulator Users Guide (updated)
– IBM Full-System Simulator Command Reference (updated)
– Performance Analysis with the IBM Full-System Simulator
– IBM Full-System Simulator BogusNet HowTo (updated)

PowerPC Base
– PowerPC Architecture Book, Version 2.02
  • Book I: PowerPC User Instruction Set Architecture
  • Book II: PowerPC Virtual Environment Architecture
  • Book III: PowerPC Operating Environment Architecture
– PowerPC Microprocessor Family
  • Vector/SIMD Multimedia Extension Technology Programming Environments Manual Version 2.06c


Cell BE Technical Articles

Real-time Ray Tracing

Papers from the Fall Processor Forum 2005: Unleashing the Cell Broadband Engine Processor: The Element Interconnect Bus

Papers from the Fall Processor Forum 2005: Unleashing the power of the Cell Broadband Engine: A programming model approach

Cell Broadband Engine Architecture and its first implementation

Introduction to the Cell Broadband Engine

Introduction to the Cell Multiprocessor

Maximizing the power of the Cell Broadband Engine processor: 25 tips to optimal application performance

Terrain Rendering Engine (TRE): Cell Broadband Engine Optimized Real-time Ray-caster

An Implementation of the Feldkamp Algorithm for Medical Imaging on Cell Broadband Engine

Cell Broadband Engine Support for Privacy, Security, and Digital Rights Management Applications

A Programming Example: Large FFT on the Cell Broadband Engine