CSE 291E/ECE 290C, Spring 2002 - Copyright © 2002, Bill Lin
System-on-Chip: An Architecture Perspective
CSE 291E / ECE 260C, Spring 2002
System-On-Chip (or what am I going to do with a billion transistors?)
• At 0.13µm
  • ARM9 = 3mm²
  • Pentium 4 = 144mm², ~50 x ARM9
• By 2007
  • 0.02µm (20-nm)
  • 20GHz
  • 1 billion transistors!
  • ~2500 x ARM9
[Figure: relative die areas of ARM9 (3mm²) and Pentium 4 (144mm²)]
What am I going to do with a billion transistors?
• View 1: More logic to make sequential execution faster
  • More & more logic for predictive & speculative execution, “on-the-fly” optimizations & code morphing …
• View 2: More memory on-chip
  • Currently 70% logic, 30% cache/memory
  • Expect 30% logic, 70% cache/memory
What am I going to do with a billion transistors?
• View 3: Chassis to boards to ICs
  • What used to be in a chassis is now on a board
  • What used to be on a board is now on a single IC
• Been doing this for decades
  • Most common definition of “System-on-Chip”
[Figure: example SoC block diagrams. First: RISC, DSP, Modulator, Channel Codec, Source Codec, SRAM, Memory Controller, PCI Controller, Test, MAC, Framer, Tables. Second: RISC, DSP, SRAM, and function blocks F1 to F4]
What about I/O to a billion transistor IC?
• Problem
  • Transistor count doubles every 18 months (exponential)
  • Pin count is constant, linear growth at best
• Solution
  • More pins: 1000+ pins possible
  • SerDes (Serializer/Deserializer) to the rescue!
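The gap between exponential transistor growth and linear pin growth can be sketched numerically. This is only an illustration: the 18-month doubling comes from the slide, while `start_pins` and `pins_added_per_year` are hypothetical values chosen to show the trend.

```python
def transistor_growth(years, doubling_period_years=1.5):
    """Growth factor when the transistor count doubles every 18 months."""
    return 2 ** (years / doubling_period_years)


def pin_growth(years, start_pins=500, pins_added_per_year=50):
    """Growth factor for a linearly growing pin count (hypothetical numbers)."""
    return (start_pins + pins_added_per_year * years) / start_pins


if __name__ == "__main__":
    for years in (3, 6, 9):
        t = transistor_growth(years)
        p = pin_growth(years)
        # The ratio t / p is how much more logic each pin has to serve.
        print(f"{years} years: transistors x{t:.0f}, pins x{p:.1f}, "
              f"transistors per pin x{t / p:.1f}")
```

Whatever the assumed pin numbers, the transistors-per-pin ratio keeps growing, which is the motivation for pushing more bandwidth through each pin pair.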
SerDes to the Rescue!
• Based on differential signaling
• Can pump 4Gbps per pair of pins
• Internally serialize/deserialize; 8:1 & 16:1 currently common
• Roadmap: 10Gbps per pair of pins, 32:1 in a few years
[Figure: Tx driving Rx over a differential (+/-) pair]
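A quick way to read the ratios above: the serialization ratio sets how fast the on-chip parallel side has to run for a given line rate. The sketch below is plain arithmetic on the slide's numbers; the function name is made up for illustration, and coding overhead is ignored.

```python
def parallel_word_rate_mhz(line_rate_gbps, ratio):
    """Rate of parallel words on the chip side of a SerDes, in MHz.

    line_rate_gbps: serial rate on the differential pair (Gbps)
    ratio: serialization ratio, e.g. 8 for 8:1
    """
    return line_rate_gbps * 1000 / ratio


if __name__ == "__main__":
    print(parallel_word_rate_mhz(4, 8))    # 500.0
    print(parallel_word_rate_mhz(4, 16))   # 250.0
    print(parallel_word_rate_mhz(10, 32))  # 312.5
```

A wider ratio lets the internal logic run at a modest clock even as the per-pair line rate climbs.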
8b/10b Encoding
• Ensures bit transitions within each 10b word
• Clock can be recovered from the data
[Figure: 8-bit data bytes enter a parallel interface, pass through the 8b/10b encoder and serializer to Tx; 10-bit coded words travel to Rx, then through the deserializer and 10b/8b decoder to a parallel interface]
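Two consequences of 8b/10b can be checked with a few lines of code: every 8 payload bits cost 10 line bits, and a receiver depends on transitions to recover the clock. This sketch does not implement the 8b/10b code tables; the bit patterns are illustrative inputs only, not specific code words.

```python
def payload_rate_gbps(line_rate_gbps):
    """Usable data rate after 8b/10b coding (8 data bits per 10 line bits)."""
    return line_rate_gbps * 8 / 10


def count_transitions(word, width=10):
    """Number of 0->1 or 1->0 changes between adjacent bits of a word."""
    bits = [(word >> i) & 1 for i in range(width)]
    return sum(1 for a, b in zip(bits, bits[1:]) if a != b)


if __name__ == "__main__":
    print(payload_rate_gbps(4))                # payload carried by a 4Gbps pair
    print(count_transitions(0x00, width=8))    # raw all-zero byte: no edges
    print(count_transitions(0b1010101010))     # alternating 10-bit pattern
```

A raw byte such as 0x00 gives the receiver no edges to lock onto, which is the situation the encoder is there to prevent.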
DSP vs. µP Cores
• Many DSP & digital communications algorithms have this “sum-of-products” computation, e.g. FFT computation
$$H_n = \sum_{k=0}^{N-1} W^{nk} h_k$$
• DSP datapath vs. µP datapath
[Figure: DSP datapath (MUL feeding ALU, hard-coded multiply-accumulate) vs. µP datapath (ALU only)]
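The sum-of-products structure is easy to see when the transform is evaluated directly. The sketch below computes H_n term by term; the twiddle factor W = exp(-2πi/N) is the usual DFT convention and is an assumption here, since the slide does not define W.

```python
import cmath


def dft(h):
    """Directly evaluate H_n = sum_{k=0}^{N-1} W^(n*k) * h_k.

    Each inner-loop step is one multiply-accumulate, the operation a DSP
    datapath hard-codes.
    """
    N = len(h)
    W = cmath.exp(-2j * cmath.pi / N)  # assumed twiddle factor
    H = []
    for n in range(N):
        acc = 0j
        for k in range(N):
            acc += (W ** (n * k)) * h[k]  # multiply, then accumulate
        H.append(acc)
    return H


if __name__ == "__main__":
    for value in dft([1, 1, 1, 1]):
        print(f"{value.real:.3f} {value.imag:+.3f}j")
```

Direct evaluation costs N x N multiply-accumulates; an FFT reduces the count but is still built from the same operation.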
DSP vs. µP Cores
• Try to do multiply-accumulate with just an adder datapath
  • Requires many instructions!
• Think of hardware multiply-accumulator datapath as 1st “kernel-accelerator” with broad appeal
[Figure: array of partial products a_i x_j (i, j = 0 to 4) summed to form a 5-bit by 5-bit product]
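To make "requires many instructions" concrete, here is a minimal shift-and-add sketch of a multiply-accumulate on an adder-only datapath. It is one common software approach, not necessarily what the lecture had in mind; the function names and the add counter are illustrative.

```python
def multiply_with_adder_only(a, x, width=5):
    """Multiply two unsigned width-bit values using shifts, bit tests, and adds.

    Each set bit of x contributes one shifted copy of a, i.e. one row of the
    partial-product array.
    """
    product = 0
    adds = 0
    for i in range(width):
        if (x >> i) & 1:
            product += a << i
            adds += 1
    return product, adds


def mac_with_adder_only(acc, a, x, width=5):
    """acc + a * x, returning the result and the number of additions used."""
    product, adds = multiply_with_adder_only(a, x, width)
    return acc + product, adds + 1


if __name__ == "__main__":
    result, adds = mac_with_adder_only(10, 21, 31)
    print(result, adds)  # 661 6
```

Each addition also needs a shift and a bit test around it, so one multiply-accumulate that a MAC unit does in a single operation expands into a loop of several instructions per bit.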
DSP vs. µP Cores
• Many small differences have been touted between DSPs (e.g. TI-DSP) & µPs (e.g. ARM)
• Configurable processors (e.g. Tensilica) make multiply-accumulator datapath an “option”
  • With multiply-accumulator = DSP?
  • Without multiply-accumulator = µP?
  • Distinction between DSP & µP blurred?
• At least one significant issue
  • Multiply-accumulator delay >> adder delay
  • Pipelining design trade-offs may be different
SoC Design
• e.g. ARM core and AMBA bus
• Analogous to Pentium, Windows/Linux, PCI, memory, graphics card, sound card, Ethernet card …
• Need standardized bus & driver models (e.g. AMBA bus instead of PCI)
• Need stripped down OS (e.g. WindRiver instead of Windows/Linux) to host software and “drivers” (communication channel) to hardware
[Figure: AMBA-based SoC with ARM CPU, Test I/F Controller, Smart Card I/F, Audio Codec I/F, Color LCD Controller, UART, Synchronous Serial Port, SDRAM Controller, and SRAM; the AMBA AHB system bus (high speed) connects through a bridge to the AMBA APB peripherals bus (low power)]
SoC Design
• Example on-chip bus interconnects
  • ARM’s AMBA bus
  • IBM’s CoreConnect
  • Virtual Socket Interface Alliance group
  • Open Core Protocol group
• Example processor cores
  • ARM
  • MIPS
  • PowerPC
Standardization is key to creation of SoC value-chain
• 3rd party embedded software possible
• 3rd party ASIC cores possible (often called IP cores, IP = Intellectual Property, not Internet Protocol)
  • Pluggable to a standardized interconnect
  • Interoperable with embedded software on processors via well-defined drivers …
• Towards “plug-n-play”
• Effects similar to the Intel-PC value-chain, NPF (Network Processing Forum) based switch/router value-chain …
See Tality for examples of 3rd party cores
• USB 2.0 PHY
• SDRAM memory controller
• GSM/GPRS reference platform
• GSM/GPRS data module
• 10bit 50MHz pipelined ADC
• 10bit 300MHz DAC
• Viterbi decoder
• Reed-Solomon decoder
• 10/100 PHY
• 10/100 MAC
• 10G XAUI interface
• 8b/10b encoders
• HomePNA PHY
• 1.25G/2.5G SerDes
• DOCSIS 1.1 PHY
• 802.11a OFDM modem
• 802.11a/b dual-mode MAC
• Bluetooth baseband
• Bluetooth PHY
• …
Current SoC Interconnect Approaches
• Standardized busses
• Point-to-point wires
• Shared FIFOs or shared register files
[Figure: three interconnect styles: bus, point-to-point wires, and shared FIFO or shared register file]
Need Scalable On-Chip Switch Architecture
• Shared busses & local point-to-point connections may not be sufficient as we move into the billion transistor era
  • Too many cores to interconnect
  • Complex design & verification effort
  • Lots & lots of wires to route
• Need systematic scalable network architecture
  • Analogous to crossbar-based packet switches at the board level, again with standardized protocols & interfaces needed for a value-chain
• But on-chip & board-level design trade-offs may be different
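The "lots & lots of wires" concern can be put in rough numbers. The sketch below counts links for full point-to-point wiring and crosspoints for a crossbar as the number of cores grows. These are textbook counting formulas used for illustration, not figures from the lecture, and real wire counts also depend on link width.

```python
def point_to_point_links(n):
    """Dedicated links needed to connect every pair of n cores directly."""
    return n * (n - 1) // 2


def crossbar_crosspoints(n):
    """Crosspoints in an n-input, n-output crossbar switch."""
    return n * n


if __name__ == "__main__":
    for n in (4, 16, 64):
        print(n, point_to_point_links(n), crossbar_crosspoints(n))
```

Both counts grow quadratically with the number of cores, which is one reason to look for network structures whose per-tile cost grows more slowly.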
Example On-Chip Network Architecture: Hypercube-on-Chip
• Stanford’s proposal (B. Dally)
• Uses the hypercube architecture found in Cray’s T3, SGI’s Origin …
• O(log N) hops & outgoing links per tile
[Figure: 4 x 4 array of tiles labeled 00 through 33]
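The O(log N) claim follows from how a hypercube is wired: each node links to the nodes whose label differs in exactly one bit. The sketch below assumes N is a power of two and labels tiles with binary numbers, which differs from the two-digit labels in the figure; the function names are illustrative.

```python
def hypercube_neighbors(node, dims):
    """Nodes directly linked to `node` in a hypercube with 2**dims nodes."""
    return [node ^ (1 << d) for d in range(dims)]


def hypercube_hops(src, dst):
    """Minimum hop count: the number of bit positions where labels differ."""
    return bin(src ^ dst).count("1")


if __name__ == "__main__":
    dims = 4  # 16 tiles
    print(hypercube_neighbors(0b0000, dims))  # [1, 2, 4, 8]
    print(hypercube_hops(0b0000, 0b1111))     # 4
```

With 16 tiles each tile has 4 outgoing links and the worst-case route is 4 hops, both equal to log2(16).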
Example On-Chip Network Architecture: Hypercube-on-Chip
• Claims
  • 6.6% area overhead using layer 6 & 7 metal that routes over tiles (cores)
  • Regularized routing enables faster data rates over wires than the global clock (e.g. using double/quad data rates or serial differential signaling approaches)
• Alternative architectures should be investigated
  • e.g. crossbar, mesh, quad trees, token rings …
[Figure: tile router with North, South, East, and West outputs plus a tile output]
On-Chip Memory Hierarchy: Another Major Issue
• Need to move lots of data in-and-out of chip!
• Many open challenges to on-chip memory hierarchy design
• Cache coherency across multiple processing nodes on-chip?
[Figure: four RDRAM channels, each behind a RIMM I/F, striped for 320Gbps = 4 x 80Gbps memory bandwidth; on-chip hierarchy of L3, two L2, and four L1 caches feeding the logic, with increasingly faster & lower latency data access closer to the logic]
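The 320Gbps figure relies on spreading accesses across all four memory channels. A minimal sketch of address striping is shown below; the channel count and per-channel rate come from the figure, while `STRIPE_BYTES` and the modulo mapping are hypothetical choices made for illustration.

```python
NUM_CHANNELS = 4     # four RDRAM channels, as in the figure
CHANNEL_GBPS = 80    # per-channel bandwidth, as in the figure
STRIPE_BYTES = 64    # hypothetical stripe size, not from the lecture


def channel_for_address(addr):
    """Map a byte address to a memory channel by interleaving stripes."""
    return (addr // STRIPE_BYTES) % NUM_CHANNELS


def aggregate_gbps():
    """Peak bandwidth when all channels are busy at once."""
    return NUM_CHANNELS * CHANNEL_GBPS


if __name__ == "__main__":
    print(aggregate_gbps())  # 320
    print([channel_for_address(a) for a in range(0, 512, 64)])
```

Sequential accesses rotate through the channels, so a streaming workload can approach the aggregate rate, while accesses that keep hitting one channel see only that channel's bandwidth.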
Cross-Fertilization of SoC and CMP Research
• Many challenges and ideas in Chip-Multi-Processor (CMP) research are applicable to System-on-Chip (SoC) research
• Many challenges and ideas in System-on-Chip (SoC) research are applicable to Chip-Multi-Processor (CMP) research
  • Digital communications, network processing, and video processing guys have been building chip multiprocessors for years
Questions?