fpga, a history of interconnect - electrical and computer ...ocin06/talks/bolsens.pdf · sparse...

32
1 FPGA, a history of interconnect Ivo Bolsens CTO, Xilinx

Upload: phamphuc

Post on 18-Mar-2018

232 views

Category:

Documents


10 download

TRANSCRIPT

Page 1: FPGA, a history of interconnect - Electrical and Computer ...ocin06/talks/bolsens.pdf · Sparse Chevron Pin-out Pattern ... bandwidth die-stacking • 2 Terabit/sec bandwidth between

1

FPGA, a history of interconnect

Ivo Bolsens

CTO, Xilinx

Page 2: FPGA, a history of interconnect - Electrical and Computer ...ocin06/talks/bolsens.pdf · Sparse Chevron Pin-out Pattern ... bandwidth die-stacking • 2 Terabit/sec bandwidth between

2Xilinx Confidential

“The Architecture of the Month”

TC

IBMCLCALQL

ACT

CP

MPA

ERA

VirtexApex

XC4000

XC2000XC3000

AT

6200Am4000

ORCADLXC5200

FLEX

The Architectural Shakeout

Mass extinctions in the mid 1990sXilinx: 8100, 6200, 4700, Prizm, …Plessey, Toshiba, Motorola, IBM, ...

We were hit by fast-moving CMOS process technology, particularly multiple metal layers.

Page 3: FPGA, a history of interconnect - Electrical and Computer ...ocin06/talks/bolsens.pdf · Sparse Chevron Pin-out Pattern ... bandwidth die-stacking • 2 Terabit/sec bandwidth between

3Xilinx Confidential

TrendsC

ourte

sy a

nd C

opyr

ight

of U

MC

0100002000030000400005000060000700008000090000

100000

1985 1990 1995 2000 2005010002000300040005000600070008000900010000

LUTs

Wire

Page 4: FPGA, a history of interconnect - Electrical and Computer ...ocin06/talks/bolsens.pdf · Sparse Chevron Pin-out Pattern ... bandwidth die-stacking • 2 Terabit/sec bandwidth between

4Xilinx Confidential

Interconnect and ease of use

Longer wire reach gave dramatic improvement for ease of design

0

1000

2000

3000

4000

5000

6000

0 200 400 600 800 1000

LUTs within reach

Delay

(Tilo

+inte

rcon

nect

) ps

4036EX

VirtexVirtex II

4000XL

Virtex-II Pro

Lower is better

Page 5: FPGA, a history of interconnect - Electrical and Computer ...ocin06/talks/bolsens.pdf · Sparse Chevron Pin-out Pattern ... bandwidth die-stacking • 2 Terabit/sec bandwidth between

5Xilinx Confidential

State-of-the-Art 65nm FPGA

High-Performance 6-LUT Fabric

High-Performance 6-LUT Fabric

36Kbit Dual-Port

Block RAM / FIFO with ECC

36Kbit Dual-Port

Block RAM / FIFO with ECC

SelectIO with ChipSync

+ XCITE DCI

SelectIO with ChipSync

+ XCITE DCI

550 MHz ClockManagement Tile

DCM + PLL

550 MHz ClockManagement Tile

DCM + PLL

25x18 MultiplierDSP Slice with Integrated ALU

25x18 MultiplierDSP Slice with Integrated ALU

MoreConfiguration

Options

MoreConfiguration

Options

Page 6: FPGA, a history of interconnect - Electrical and Computer ...ocin06/talks/bolsens.pdf · Sparse Chevron Pin-out Pattern ... bandwidth die-stacking • 2 Terabit/sec bandwidth between

6Xilinx Confidential

Virtex-4 Interconnect PatternPredictable Interconnect

Fast Connect1 Hop2 Hops3 Hops

1 CLBSymmetric routing pattern reaches more CLBswith fewer hops

Dramatically increases design performanceDramatically increases design performance

Page 7: FPGA, a history of interconnect - Electrical and Computer ...ocin06/talks/bolsens.pdf · Sparse Chevron Pin-out Pattern ... bandwidth die-stacking • 2 Terabit/sec bandwidth between

Xilinx Upstairs 7 7Xilinx Confidential

Layers of Interconnect

1 mmBall Pitch

175 umBump Pitch

o o o o o o o o o o o o o o o o o

} Signals

Ground Plane

Ground Plane

12 layers on chip, 10 layers in package, 10 layers in PCB.

82% of noise is determined by the pin-out and found in package balls and PCB vias

82% of noise is determined by the pin-out and found in package balls and PCB vias

Page 8: FPGA, a history of interconnect - Electrical and Computer ...ocin06/talks/bolsens.pdf · Sparse Chevron Pin-out Pattern ... bandwidth die-stacking • 2 Terabit/sec bandwidth between

Xilinx Upstairs 8 8Xilinx Confidential

Sparse Chevron Pin-out Pattern

X X X X X X X X X X X X X X X X X X X X X X X X XX X X X X X X X X X X X X X X X X X X X X X X 9 X X X X X X X X X X X X X X X X X X X X X X X X 3 X X X X X X XX X X X X X X X X X X X X X X X X X 3 X X X X X X X X XX X X X X X X X X X X X X X X X X X X X X X X X 9 X X X XX X X X X 0 X X X X X X X X X X X X X X X X 9 X X X X X XX X X 0 X X X X X X X X X X X X X X X X 3 X X X X X X X X X X X X X X X X X X X X X X X X X X X X X X X X 5 X X XX X X X X X 0 X X X X X X X X X X X X X X X X X X X X XX X X X 6 X X X X X X X X X X X X X X X X X X X X X X X XX X X X X X X X X X X X X X X X X X X X X X X 5 X XX X X X X X X X X X X X X 5 X X X X X X X X 6 X X X X X X X X X X X X XX X 6 X X X X X X X v X X X X X X X X 1 XX X X X X X X X X X X X X X 1 X X X XX X X X X 2 X X X X X X X X X X X X XX X X 2 X X X X X X X X X X X X X X X X X X X X X X X X X X X X 1 X X XX X X X X X 2 X X X X X X X X X X X XX X X X 4 X X X X X X X X X X X X X XX X X X X X X X X X X X X X X X X XX X X X X X X 4 X X v X X X X X X X X X X X X 4 X X X X X X X X X X X X X X X XX X 8 X X X X X X X X X X X X X X X X X X X XX X X X X X X X X X X X X X X X X X X X X X X X X X X XX X X X X 8 X X X X X X X X X X X X X X X X X X X X X XX X X 8 X X X X X X X X X X X X X X X X v X X X X X X X X X X X X X X X X 6 X X X X X X X X X X X X X X X X X X XX X X X X X 2 X X X X X X X X X X X X X X X X 1 X X X X XX X X X 2 X X X X X X X X X X X X X X X X X X X X X X X XX X X X X X X X X 6 X X X X X X X X X X X X X X X X 1 X XX X X X X X X 6 X X X X X X X X X X X X X X X X 1 X X X X X X X X 2 X X X X X X X X X X X X X X X X X X X X X X X

X X X X X X X X X X X X X X X X X X X X X X X X X

GNDCore VccI/O VccI/O SignalX

X X X X X X X X X X X XX X X X X X X X X X X X X X

X X X X X X X X X X X X XX X X X X X X X X X X X X XX X X X X X X X X X X X X XX X X X X 0 X X X X X X X XX X X 0 X X X X X X X X X X X

X X X X X X X X X X X X XX X X X X X 0 X X X X X X X XX X X X 6 X X X X X X X X X XX X X X X X X X X X X XX X X X X X X X X

X X X X 6 X X X XX X 6 X X X X X X XX X X X X X X X XX X X X X 2 X X X XX X X 2 X X X X X X

• Every SelectIO adjacent to return path

• Achieves near-ideal return current loop

IO pins in a regular array of return path pinsIO pins in a regular array of return path pins

Page 9: FPGA, a history of interconnect - Electrical and Computer ...ocin06/talks/bolsens.pdf · Sparse Chevron Pin-out Pattern ... bandwidth die-stacking • 2 Terabit/sec bandwidth between

9Xilinx Confidential

Off chip interconnect• Growing gap between number of logic

gates and I/O • Technology scaling favors logic density

15x drop inI/O-to-logic

ratioby 2020

Source: ITRS

Page 10: FPGA, a history of interconnect - Electrical and Computer ...ocin06/talks/bolsens.pdf · Sparse Chevron Pin-out Pattern ... bandwidth die-stacking • 2 Terabit/sec bandwidth between

10Xilinx Confidential

I/O to Logic RatioComparison of number of I/O per 1000 logic cell in the largest FPGA in each family

~60x decrease inI/O-logic

ratio

Page 11: FPGA, a history of interconnect - Electrical and Computer ...ocin06/talks/bolsens.pdf · Sparse Chevron Pin-out Pattern ... bandwidth die-stacking • 2 Terabit/sec bandwidth between

11Xilinx Confidential

Year

Microprocessor’s chip area is dominated by cache memory to overcome the

lack of I/O bandwidth

Logic to I/O Gap in Microprocessors

Page 12: FPGA, a history of interconnect - Electrical and Computer ...ocin06/talks/bolsens.pdf · Sparse Chevron Pin-out Pattern ... bandwidth die-stacking • 2 Terabit/sec bandwidth between

The Move to Serial Connectivity

12Xilinx Confidential

Source: Intel/Xilinx

Arrays of serial Gb transceivers

ProgrammableProgrammableTransceivers Transceivers 6 Gbps

1 Gbps Parallel Bus Limit1 Gbps Parallel Bus Limit

MANY STANDARDSMANY STANDARDSSerial

InterconnectSerialSerial

InterconnectInterconnect

PCI ExPCI ExFCFC--1/21/2

GbEGbE

10 GbE10 GbE

SATASATA

FCFC--44

OCOC--48483 Gbps

Sign

alin

g R

ate

Sign

alin

g R

ate

ISAISAISA

PCIPCIPCI

UP TO 66 MHzUP TO 66 MHz

VESAVESA

PCIPCI--XX

ParallelInterconnect

ParallelParallelInterconnectInterconnect

1980s 1990s 2000s

Arrays of serial Gb transceivers

Page 13: FPGA, a history of interconnect - Electrical and Computer ...ocin06/talks/bolsens.pdf · Sparse Chevron Pin-out Pattern ... bandwidth die-stacking • 2 Terabit/sec bandwidth between

13Xilinx Confidential

Die-Stacking Landscape (Connection Density, Number of Device Layers)

>106/cm2

3-4 device layers

Photo: Tezzaron

102-103/cm2

4+ device layers 104 -106/cm2

4+ device layers

Photo: Amkor

Monolithic 3-D integration

Chip-Stacking(Through Silicon Vias)

Chip-Stacking(wire-bonding)

Page 14: FPGA, a history of interconnect - Electrical and Computer ...ocin06/talks/bolsens.pdf · Sparse Chevron Pin-out Pattern ... bandwidth die-stacking • 2 Terabit/sec bandwidth between

14Xilinx Confidential

Imagine…

Specialized Layers ofDSP fabric,Memory fabric,FPGA fabric, …

Benefits :• Single package of heterogeneous die• Multiple configurations of standard products • Lower cost• Lower power• Ultimate customization• Ultimate flexibility

With 3D Interconnect

Optimized for Technology130nm, 90nm, 32nm, …

And at its heart…An FPGA SoC with: • Embedded Processing• Embedded DSP• High-speed Serial Connectivity• Reprogrammable FPGA Logic Fabric as the Base

Page 15: FPGA, a history of interconnect - Electrical and Computer ...ocin06/talks/bolsens.pdf · Sparse Chevron Pin-out Pattern ... bandwidth die-stacking • 2 Terabit/sec bandwidth between

15Xilinx Confidential

Wafer BondingDaughter Wafer

Mother Wafer

Daughter IC

Mother Wafer

Daughter ICPrep Wafers; Dice Daughter

1)

2)

Attach

3)

Fuse

4)

4 to 15 µm 7 to 50 µm

< 5µm

Wafers from Foundry

Source: Cubic Wafer

Page 16: FPGA, a history of interconnect - Electrical and Computer ...ocin06/talks/bolsens.pdf · Sparse Chevron Pin-out Pattern ... bandwidth die-stacking • 2 Terabit/sec bandwidth between

16Xilinx Confidential

Inter-Die Connection Performance

Die 2

Die 1

1 ns

• 100X less dynamic power than conventionalsingle-ended I/O

• Link performance ~ 1 Gigabit/sec

Page 17: FPGA, a history of interconnect - Electrical and Computer ...ocin06/talks/bolsens.pdf · Sparse Chevron Pin-out Pattern ... bandwidth die-stacking • 2 Terabit/sec bandwidth between

17Xilinx Confidential

2-D FPGA Fabric

Routing SwitchLook Up Table (LUT). A programmable logic block

Page 18: FPGA, a history of interconnect - Electrical and Computer ...ocin06/talks/bolsens.pdf · Sparse Chevron Pin-out Pattern ... bandwidth die-stacking • 2 Terabit/sec bandwidth between

18Xilinx Confidential

3-D FPGA Fabric

A. Rahman, et al., IEEE TVLSI, 11(1), 2003

• Shorter wire-length and delay• Higher logic density

Page 19: FPGA, a history of interconnect - Electrical and Computer ...ocin06/talks/bolsens.pdf · Sparse Chevron Pin-out Pattern ... bandwidth die-stacking • 2 Terabit/sec bandwidth between

19Xilinx Confidential

Performance Improvement• Integration of RAM with FPGA by high

bandwidth die-stacking• 2 Terabit/sec bandwidth between FPGA

and RAM

High Performance Applications

Projected Performance Improvement

Sparse Matrix/ Vector Multiply

4X-8X over 2.4GHz Pentium

170X over 2.2GHz Opteron

20X over 1.7GHz Pentium

Molecular Dynamics ~20X over 3 GHz Processor

Traffic Simulation

Radiative Heat Transfer

(Courtesy of Maya Gokhale, LANL)

Page 20: FPGA, a history of interconnect - Electrical and Computer ...ocin06/talks/bolsens.pdf · Sparse Chevron Pin-out Pattern ... bandwidth die-stacking • 2 Terabit/sec bandwidth between

20Xilinx Confidential

The FPGA System

MGTs I/OsMemory PowerPCLogicEmulation DSP

Com

mun

icatio

nPo

rt

CustomLogic

Inte

rnal

Mem

ory

Exte

rnal

Mem

ory P

ort

DSP

Acce

lerat

or

µP

Page 21: FPGA, a history of interconnect - Electrical and Computer ...ocin06/talks/bolsens.pdf · Sparse Chevron Pin-out Pattern ... bandwidth die-stacking • 2 Terabit/sec bandwidth between

21Xilinx Confidential

MicroBlaze’s Flexible Acceleration Interface

• FSL = Fast Simplex Link• Eliminates bus signaling overhead

– No address decode– No arbitration– No acknowledge cycles

• Simple instruction programming• Flexible number master and slave

FSL ports – Configurable depth FIFO in FSL– Input and output FSL port

width is configurable as 8,16, or 32 bits.

• Dedicated MicroBlaze instruction– Get fromInputFSL M, toReg N– Put toOutputFSL M, fromReg N– Blocking and non-blocking

support

Page 22: FPGA, a history of interconnect - Electrical and Computer ...ocin06/talks/bolsens.pdf · Sparse Chevron Pin-out Pattern ... bandwidth die-stacking • 2 Terabit/sec bandwidth between

22Xilinx Confidential

Application-SpecificHardware Acceleration

• When the processor core begins to reach software task capacity, then Fabric Acceleration to the rescue– Use Fast Simplex Link

(FSL) to interface to customer-defined accelerators

– Enables dramatic improvements in performance

Page 23: FPGA, a history of interconnect - Electrical and Computer ...ocin06/talks/bolsens.pdf · Sparse Chevron Pin-out Pattern ... bandwidth die-stacking • 2 Terabit/sec bandwidth between

FPGA/processor• CoreConnect Architecture

– Processor Local Bus (PLB)• Ideal for bust transfers

– Memory, High Speed Peripherals, Cache Interface• 32-bit address, 64-bit data• 2.1 GB/s Max BW @ 133 MHz

– On-Chip Peripheral Bus (OPB)• Low speed peripherals• 32-bit address, 32-bit data

– Device Control Register Bus (DCR)• For peripheral setup and control

• On Chip Memory Interface (OCM)– 4 Processor Cycle Latency– Lowest Latency and good data rate– Not a bus interface

23Xilinx Confidential

Page 24: FPGA, a history of interconnect - Electrical and Computer ...ocin06/talks/bolsens.pdf · Sparse Chevron Pin-out Pattern ... bandwidth die-stacking • 2 Terabit/sec bandwidth between

24Xilinx Confidential

Accelerate Performance Beyond the Core

SoftAuxiliary

Processor

SoftSoftAuxiliaryAuxiliary

ProcessorProcessorAPU

ControlAPUAPU

ControlControlPowerPCPowerPCPowerPC

FPGA FabricFPGA Fabric

Hard Processor BlockHard Processor Block

OCMOCM

PLBPLB

APU I/F

FPGAInterface

• Extends PPC 405 Instruction Set– Floating point support– User Defined Instructions

• Offloads CPU intensive operations

– Matrix calculations• Video processing

– Floating point mathematics• 3D data processing

• Direct interface to HW accelerators

– High Bandwidth – Low Latency

• Reduce number of bus cycles by factor of 10X

• Increase performance by over 20X

Page 25: FPGA, a history of interconnect - Electrical and Computer ...ocin06/talks/bolsens.pdf · Sparse Chevron Pin-out Pattern ... bandwidth die-stacking • 2 Terabit/sec bandwidth between

25Xilinx Confidential

Comparison with Traditional Bus-basedAPU PLB

ProcessorBlock Soft Aux.

ProcessorProcessor

Block Soft Aux. ProcessorAPU I/F APU I/F

Write Instruction

and operands

Read Result and Status

Write Operand1

Read Status

5 PLB cycles + 2 CPU cycles

1 APU cycle

ExecutionExecution

Write Operand2 and Instruction

Read Result

ExecutionExecution 5 PLB cycles + 2 CPU cyclesNEX APU cycle

NEX PLB cycle1 APU cycle +1 CPU cycle

6 PLB cycles + 3 CPU cycles

6 PLB cycles + 3 CPU cycles

Page 26: FPGA, a history of interconnect - Electrical and Computer ...ocin06/talks/bolsens.pdf · Sparse Chevron Pin-out Pattern ... bandwidth die-stacking • 2 Terabit/sec bandwidth between

26Xilinx Confidential

Processor Performance and Fabric Acceleration

1,0001,000

20022002 20042004 20062006

D-MIPSD-MIPS

• PowerPC 405 at 450 MHz• APU

20082008

2,0002,000

3,0003,000

20102010

Next generationPowerPC

VirtexVirtex--II ProII Pro

VirtexVirtex--44

Fabric AccelerationFabric AccelerationPowerPC PowerPC –– APUAPU

MicroBlaze MicroBlaze -- FSLsFSLs

““TraditionalTraditional””Frequency ScalingFrequency Scaling

VirtexVirtex--55• PowerPC • APU

Page 27: FPGA, a history of interconnect - Electrical and Computer ...ocin06/talks/bolsens.pdf · Sparse Chevron Pin-out Pattern ... bandwidth die-stacking • 2 Terabit/sec bandwidth between

Virtex-II Pro Fabric

PowerPC 405

ISO

CM

ISO

CM

DSO

CM

DSO

CM

GPIOGPIO UARTUART10/100Ethernet10/100

Ethernet

User PeripheralUser Peripheral

OPB IPIFOPB IPIFUser LogicUser Logic

BRAMBRAM

PCI 64/66PCI 64/66GE MACGE MAC MemoryControllerMemory

Controller

DDR SDRAMDDR SDRAMSDRAMSDRAM

MemoryMemory

ZBT SSRAMZBT SSRAM

PowerPC ArchitectureData Control Register Bus - DCR

Instruction DataOPB

ArbiterProcessor Local Bus - PLB On-Chip Peripheral Bus - OPBPLB Arbiter

PLB/OPBBridge

JTAG BlockJTAG Block

System Reset

System Reset

Full System Customization & High Performance

27Xilinx Confidential

Page 28: FPGA, a history of interconnect - Electrical and Computer ...ocin06/talks/bolsens.pdf · Sparse Chevron Pin-out Pattern ... bandwidth die-stacking • 2 Terabit/sec bandwidth between

28Xilinx Confidential

Scalability

2005 2007 2009

Scalable in performance, and re-usable solution FPGAProcessing

rate Flexible systemon chip used fortailored systemarchitectures

Fixed processor

Processor / memory bottlenecksworsen

Next fixed architecture Next fixed architectureNo re-use, architecture dependent

Page 29: FPGA, a history of interconnect - Electrical and Computer ...ocin06/talks/bolsens.pdf · Sparse Chevron Pin-out Pattern ... bandwidth die-stacking • 2 Terabit/sec bandwidth between

29Xilinx Confidential

Configurable InterconnectCell Processor

85GB External Bandwidth205GB internal bus400GB register file bandwidth400GB internal memory(cache)60-130 Watts

FPGA80GB External Bandwidth2TB internal memory(BRAM)2.5TB internal embedded mem>10TB ALU<->register<->ALU10-20 watts

You must customize your program to make use of the busses of the Cell processor

The FPGA connectivity can be customized to fit your program

Page 30: FPGA, a history of interconnect - Electrical and Computer ...ocin06/talks/bolsens.pdf · Sparse Chevron Pin-out Pattern ... bandwidth die-stacking • 2 Terabit/sec bandwidth between

Flexibility

0

50

100

150

200

Computation (GOPS) Memory Bandwidth(GB/sec)

CPUV2ProV5

Woodcrest = 80 WattVirtex-5 = 10 Watt

Woodcrest : 24 GFlops @3GhzVirtex-5 : 60 GFlops (70 GFlops-SP) @ 350Mhz

30Xilinx Confidential

Page 31: FPGA, a history of interconnect - Electrical and Computer ...ocin06/talks/bolsens.pdf · Sparse Chevron Pin-out Pattern ... bandwidth die-stacking • 2 Terabit/sec bandwidth between

31Xilinx Confidential

Internal Memory Bandwidth

0100200300400500600700800900

1000

0 10 20 30 40 50 60

Bandwidth (Tbps)

Mem

ory

(KB

)

4VLX200 @250Mhz2V60003.73GHz iPxe965

BRAM

LUT-RAM

Page 32: FPGA, a history of interconnect - Electrical and Computer ...ocin06/talks/bolsens.pdf · Sparse Chevron Pin-out Pattern ... bandwidth die-stacking • 2 Terabit/sec bandwidth between

32Xilinx Confidential

Conclusions

• It is all about connectivity• Important aspects

– Off-chip interconnect goes serial• Latency, power

– On-chip interconnect has to be• Scalable• Hierarchical• Flexible• Easy to use

– System interconnect heterogeneous and tailored to application