TRANSCRIPT
A 10Gbps Ethernet TCP/IP Processor
HOT Chips 15 - August 17, 2003
Jianping Xu, Nitin Borkar, Vasantha Erraguntla, Yatin Hoskote, Tanay Karnik,
Sriram Vangal, Justin Rattner
Microprocessor Research, Intel Labs
Outline
- TCP/IP processing challenges
- TCP/IP protocol processor (TIPP) features
- TIPP microarchitecture
- TIPP implementation
- Experimental chip performance
- Summary
TCP/IP Challenges

  100MbE    148K pkts/sec    6.72 µs per packet
  1GbE     1.48M pkts/sec     672 ns per packet
  10GbE    14.8M pkts/sec    67.2 ns per packet
  40GbE    59.5M pkts/sec    16.8 ns per packet

A general-purpose processor cannot keep up: network bandwidth demand grows faster over time than the CPU MIPS gain that process improvement delivers. Process scaling by Moore's Law provides part of the gain; architectural enhancements are required to close the rest of the gap.

[Chart: CPU MIPS from process scaling vs. network bandwidth demand over time]
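The per-packet budgets in the table above follow directly from the line rate and the wire cost of a minimum-size Ethernet frame. A small sketch, assuming standard Ethernet framing overhead (8B preamble and 12B inter-frame gap on top of the 64B minimum frame — constants not stated on the slide):

```python
# Per-packet time budget for minimum-size Ethernet frames.
# Wire cost per frame: 64B frame + 8B preamble + 12B inter-frame gap.
WIRE_BYTES = 64 + 8 + 12  # 84 bytes on the wire per minimum-size frame

def packet_budget(line_rate_gbps):
    """Return (packets/sec, seconds per packet) at a given line rate."""
    pps = line_rate_gbps * 1e9 / (WIRE_BYTES * 8)
    return pps, 1.0 / pps

for rate in (0.1, 1, 10, 40):
    pps, t = packet_budget(rate)
    print(f"{rate:>4} Gbps: {pps / 1e6:6.2f} Mpkts/s, {t * 1e9:7.1f} ns/packet")
```

At 10 Gbps this reproduces the slide's figures: about 14.88M packets/sec, i.e. a 67.2 ns budget per packet.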
CPU Requirements
[Charts — Throughput & CPU Utilization on a 2.4GHz CPU: left, CPU cost (GHz per Gbps) vs. transfer size (64B, 1KB, 2KB, 16KB, 64KB) for TX and RX; right, % CPU utilization and throughput (0-1000 Mbps) vs. transfer size for TX and RX]
Ingress Processing
[Block diagram: the ingress stream enters through the PHY, MAC, and Rx buffer; the ingress-processing pipeline — packet filter, protocol demuxer, IP reassembler, execution unit (UDP processing), and TCP ingress process engine, with a boot PROM — sits on an internal bus; a crossbar, SDRAM interface, SPI interface, Tx buffer, egress-processing block, and PCI Express link to the host complete the chip]
Egress Processing
[Block diagram: the stream from the send buffer crosses the internal bus into egress processing — IP forwarder, IP fragmentation, execution unit (UDP processing), TCP egress processing, and a boot PROM — then out through the Tx buffer, MAC, and PHY; the same crossbar, SDRAM interface, SPI interface, and PCI Express host link appear as on the ingress slide]
TCP Offload
[Bar chart: traffic by protocol type (%) — TCP dominates all other protocols]
Why TCP processing?
- 82% of all traffic is TCP
- TCP processing is the bottleneck: complex and compute-intensive
[Ingress block diagram repeated — packet filter, protocol demuxer, IP reassembler, execution unit (UDP processing), boot PROM on the internal bus — with the TCP ingress process engine highlighted]
Key Features of TIPP

Special-purpose processor
– Dual-frequency, low-latency, buffer-free design
– High-frequency execution core
– Accelerated context lookup and loading

Programmability for up-to-date protocols
– Programmable design with special instructions
– Rapid validation and debug

Scalable solution
– Across bandwidths and packet sizes
– Extendable to a multi-core solution
TIPP Organization
Legend: CAM — content-addressable memory; IDB — ingress dispatcher block; TCB — transmission control block; SR — state register; ECore — execution core; SP — scratch pad; IROM — instruction ROM; EB — egress buffer

[Block diagram: the 10GbE ingress stream enters the IDB, which feeds CAM1, CAM2, the ECore, and the TCB/SR; the ECore, with its IROM and scratch pad, sits in a separate high-frequency clock domain, and results leave through the EB as the egress stream; annotated datapath widths range from 6 to 264 bits]
Instruction Set

General-purpose instructions:
1. LOAD A data
2. MOV A B
3. AND A B cond
4. OR A B cond
5. ADD A B C
6. SUB A B C
7. CMP A B cond
8. EQUAL A B cond
9. NOT A C
10. BREQZ / BRNEQZ label
11. JMP label
12. SHL2 A
13. NOP

Special-purpose instructions:
1. CAM1CLR index
2. CAM2CLR index
3. CAM1LKUP key index, data
4. CAM2LKUP key index, data
5. CAM2WR key data
6. CAM2EMPTY cond
7. TCBWR index data
8. TCBRD index data
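The special-purpose instructions implement the accelerated context lookup: a CAM maps a connection key to a TCB index, and the TCB holds per-connection state. A behavioral sketch in Python (not TIPP microcode) of that CAM1LKUP/TCBRD flow — the 4-tuple key layout and the state fields are illustrative assumptions, not taken from the slides:

```python
class Cam:
    """Tiny behavioral model of a content-addressable memory."""
    def __init__(self):
        self.entries = {}              # key -> stored index

    def write(self, key, index):       # fill an entry (cf. CAM2WR)
        self.entries[key] = index

    def lookup(self, key):             # associative match (cf. CAM1LKUP)
        idx = self.entries.get(key)
        return (idx is not None), idx

cam1 = Cam()
tcb = [None] * 64                      # small TCB store for the sketch

# Install a connection; the key here is an illustrative 4-tuple.
key = ("10.0.0.1", "10.0.0.2", 1234, 80)
cam1.write(key, 7)
tcb[7] = {"snd_nxt": 1000, "rcv_nxt": 5000}   # cf. TCBWR index data

hit, idx = cam1.lookup(key)            # cf. CAM1LKUP key -> index
ctx = tcb[idx] if hit else None        # cf. TCBRD index -> context
```

The hardware performs the associative match in parallel across all entries in a single operation; the dictionary above only models the semantics, not the timing.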
High Speed Execution Core
[Datapath diagram: the ingress stream and the IDB/SR form a 264-bit key into CAM1, which returns a TCB index; the TCB supplies context to the pipelined ALU and CAM2; the program counter, branch address, and start address drive the IROM, whose 112-bit output is latched in the IR, decoded, and executed in the pipelined ALU stage]
ALU
[ALU diagram: two 32-bit source operands (A and B) feed a pipelined add/sub unit, a pipelined cmp/equal unit, and or/and/not/shl2/load/mov logic; the condition register and the IROM pipe stage steer the ALU output]
Packet Reordering

- Packets arrive in random order
- A conventional software solution sorts packets to reorder them
- Hardware sorting is expensive and cumbersome
- Use the CAMs to eliminate sorting:
  – store and look up packets in the CAMs by sequence number
  – combine adjacent packets in the CAMs
- Critical for wire-speed processing
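The idea above can be sketched in Python: out-of-order segments are stored keyed by sequence number and merged with adjacent segments on insert, so no sorting pass is ever needed. A dict stands in for the CAM; the hardware does the same matches with parallel lookups, and this sketch handles only one merge per side per insert:

```python
def insert_segment(cam, seq, data):
    """cam maps sequence number -> payload for buffered segments.

    Merges the new segment with any stored segment immediately
    after it or immediately before it, then stores the result.
    """
    # Merge with a segment that starts right after this one ends.
    nxt = cam.pop(seq + len(data), None)
    if nxt is not None:
        data = data + nxt
    # Merge with a segment that ends right where this one starts.
    for s, d in list(cam.items()):
        if s + len(d) == seq:
            del cam[s]
            seq, data = s, d + data
            break
    cam[seq] = data
    return cam

cam = {}
insert_segment(cam, 300, b"cc")   # arrives first, out of order
insert_segment(cam, 100, b"aa")
insert_segment(cam, 102, b"bb")   # adjacent to 100..101 -> merged
# cam now holds two runs: one at seq 100 (b"aabb") and one at seq 300
```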
System Scalability
- Multi-core solution, with each core supporting up to 8K connections
- Scales to a large number of connections: >32K to 64K
- 4-signal handshake between the controller (CTL) and each TIPP

[Diagram: a CTL block fronting multiple TIPP cores in the offload engine]
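One way to picture the multi-core scaling is the controller steering each connection to a single TIPP core, with every core bounded at 8K connections. The slides specify only the 8K-per-core bound and the 4-signal handshake; the hash-based dispatch and admission check below are illustrative assumptions:

```python
N_CORES = 8                    # illustrative core count (not from the slides)
MAX_CONN_PER_CORE = 8 * 1024   # 8K connections per core, per the slide

def core_for(conn_key, n_cores=N_CORES):
    """Pick a core by hashing the connection 4-tuple."""
    return hash(conn_key) % n_cores

loads = [0] * N_CORES          # connections currently mapped to each core

def admit(conn_key):
    """Admit a new connection, or return None if its core is full."""
    core = core_for(conn_key)
    if loads[core] >= MAX_CONN_PER_CORE:
        return None            # core full: reject, or rebalance elsewhere
    loads[core] += 1
    return core
```

With 8 such cores the engine covers the 32K-64K connection range quoted on the slide.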
Enabling Circuit Technologies
- Quaternary-tree adder architecture
- Dual-VT design
- Semi-dynamic flip-flops
- High-speed instruction ROM
- High-bandwidth, leakage-tolerant register file
- Adaptive body bias
- Multiple clock-domain synchronization
Experimental Chip
[Die photo: ECore, PLL, TCB, CAM1, CAM2, IROM, IDB, and EB blocks labeled]

Chip area:     2.23 x 3.54 mm²
Process:       90nm communication CMOS, dual-VT
Interconnect:  1 poly, 7 metal
Transistors:   460K
Pad count:     306
Measured Processing Performance
[Chart: processing rate (Gbps) vs. Vcc (0.8-1.8V) — 4.4Gbps at 0.9V, 9.64Gbps at 1.72V]
Power Consumption
[Left: measured power (W) vs. processing rate (2-10 Gbps), from 4.4Gbps at 0.9V to 9.64Gbps at 1.72V. Right: simulated power-density map of the die (W/cm² at 1.72V, 100C), binned from 0-50 up to 250-300, over the DECAP, EB, ECore, IROM, TCB, CAM1, IDB, and CAM2 regions]
Summary
Programmable hardware engine
– Line-speed TCP ingress processing
– Supports minimum packet sizes

Dual-frequency, buffer-free design
– High-speed execution core
– 9.64Gbps processing at 1.72V and 6.39W in a 90nm communication CMOS process
– Extendable to a large number of connections

Dynamic reordering of packets in hardware
Acknowledgments
B. Bloechel, G. Dermer, D. Finan, J. Howard, K. Ikeda, D. Klowden, S. Narendra, H. Nguyen, C. Parsons, G. Ruhl, D. Somasekhar, S. Tang, K. Truong, J. Tschanz, V. Veeramachaneni, and H. Wilson for contributions to the TIPP experimental chip; A. Foong and F. Hady for simulation contributions; S. Borkar and M. Haycock for encouragement and support.