ethernet summit 2011_toe



DESCRIPTION

Intilop Corporation is a pioneer in developing and providing 'Customizable Silicon IP' for networking, network security, data storage (SAN/NAS), and embedded applications, allowing customers to differentiate their products and make quick enhancements. Intilop and its customers have successfully implemented these cores in several ASICs, SoCs, FPGAs, and full-scale systems.

TRANSCRIPT

Page 1: Ethernet summit 2011_toe

intilop Corporation
4800 Great America Pkwy., Ste-231
Santa Clara, CA 95054
Ph: 408-496-0333, Fax: 408-496-0444
www.intilop.com

TCP Offload vs Non-Offload
Delivering 10G Line-Rate Performance with Ultra-Low Latency
A TOE Story

Santa Clara, CA USA, February 2011

Page 2: Ethernet summit 2011_toe

Topics

- Network Traffic Growth ............................ 3
- TCP/IP in Networks ................................ 5
- TCP Offload vs Non-Offload ........................ 6
- Why TCP/IP Software is Slow? ...................... 8
- Market Opportunity ................................ 12
- Required Protocol Layers .......................... 13
- Why in FPGA? ...................................... 15
- Intilop's TOE – Key Features ...................... 18
- 10 Gbit TOE – Architecture ........................ 21

Page 3: Ethernet summit 2011_toe

Network Traffic Growth

Global IP traffic will increase fivefold by 2015:

Global IP traffic is expected to increase fivefold from 2010 to 2015, approaching 70 exabytes per month in 2015, up from approximately 11 exabytes per month in 2009.

By 2015, annual global IP traffic will reach about 0.7 of a zettabyte. IP traffic in North America will reach 13 exabytes per month by 2015, slightly ahead of Western Europe, which will reach 12.5 exabytes per month, and behind Asia Pacific (AsiaPac), where IP traffic will reach 21 exabytes per month. The Middle East and Africa will grow the fastest, with a compound annual growth rate of 51 percent, reaching 1 exabyte per month in 2015.

An optimized TCP stack running on a Xeon-class CPU at 2.x GHz, dedicated to just one Ethernet port, can handle a data rate of only up to about 500 Mb/s before slowing down.

Unit            Decimal  Binary
Terabyte (TB)   10^12    2^40
Petabyte (PB)   10^15    2^50
Exabyte (EB)    10^18    2^60
Zettabyte (ZB)  10^21    2^70
Yottabyte (YB)  10^24    2^80
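As a back-of-envelope check (an illustrative calculation, not from the slides), the projected 70 exabytes per month corresponds to a sustained aggregate rate in the hundreds of terabits per second:

```c
/* Back-of-envelope, illustrative only: what 70 exabytes per month
 * means as a sustained aggregate bit rate. */
#include <stdio.h>

int main(void)
{
    double bytes = 70e18;             /* 70 EB (10^18 bytes each) */
    double secs  = 30.0 * 24 * 3600;  /* ~one month of seconds */
    printf("~%.0f Tb/s sustained\n", bytes * 8 / secs / 1e12);
    return 0;
}
```

This prints roughly 216 Tb/s, the kind of aggregate load the network as a whole must carry.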

Page 4: Ethernet summit 2011_toe

TCP/IP in Networks & Challenges

[Diagram: servers, a NAS or DB server, and SAN storage exchanging TCP/IP traffic over the network]

- Increasing TCP/IP performance has been a major research area for networking system designers.
- Many incremental improvements, such as TCP checksum offload, have since become widely adopted.
- However, these improvements only keep the problem from getting worse over time; they do not solve the network scalability problem caused by the increasing disparity between improvements in CPU speed, memory bandwidth, and memory latency on one hand and network bandwidth on the other.
- At multi-gigabit data rates, TCP/IP processing is still a major source of system overhead.

Page 5: Ethernet summit 2011_toe

TCP Offload vs Non-Offload

[Diagram comparing the two architectures:

Current TCP/IP software architecture: Applications / upper-level protocols -> Application-Socket API -> sockets/buffers map -> Layer 4 TCP -> Layer 3 IP -> Layer 2 MAC, all in the standard TCP protocol software stack (Linux or Windows) -> PHY.

TCP offload architecture: Applications -> Socket API -> Full TCP/IP Offload in hardware (intilop) -> PHY.]

Page 6: Ethernet summit 2011_toe

Various Degrees of TCP Offload

[Diagram comparing three implementations:

1. Traditional TCP/IP implementation: Applications -> standard TCP/IP protocol software stack (Linux or Windows) -> MAC -> PHY.
2. Enhanced TCP/IP (partial offload, in a few designs): Applications -> remaining TCP functions on the CPU -> partial TOE (hardware assist) + MAC -> PHY.
3. Full TCP offload: Applications -> MAC + TOE -> PHY.]

Page 7: Ethernet summit 2011_toe

Why TCP/IP Software is Slow?

Traditional methods to reduce TCP/IP overhead offer limited gains. After an application sends data across a network, several data-movement and protocol-processing steps occur. These and other TCP activities consume critical host resources:

• The application writes the transmit data to the TCP/IP sockets interface for transmission, in payload sizes ranging from 4 KB to 64 KB.
• The OS segments the data into maximum transmission unit (MTU)-size packets, then adds TCP/IP header information to each packet.

Page 8: Ethernet summit 2011_toe

Why TCP/IP Software is Slow?

• The OS copies the data onto the network interface card's (NIC) send queue.
• The NIC performs a direct memory access (DMA) transfer of each data packet from the TCP buffer space to the NIC, and interrupts the CPU to indicate completion of the transfer.
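To make the copies and boundary crossings concrete, here is a minimal sketch of this conventional, non-offloaded send path in C. The address, port, and chunking are illustrative assumptions, not from the slides; every send() call copies payload from user space into kernel socket buffers, where the OS performs the segmentation described above.

```c
/* Minimal sketch of the conventional (non-offloaded) send path.
 * Address, port, and buffer handling are illustrative only. */
#include <arpa/inet.h>
#include <netinet/in.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <unistd.h>

int send_buffer(const char *buf, size_t len)
{
    int fd = socket(AF_INET, SOCK_STREAM, 0);
    if (fd < 0)
        return -1;

    struct sockaddr_in dst = {0};
    dst.sin_family = AF_INET;
    dst.sin_port = htons(9000);                      /* illustrative port */
    inet_pton(AF_INET, "192.0.2.1", &dst.sin_addr);  /* TEST-NET address */

    if (connect(fd, (struct sockaddr *)&dst, sizeof dst) < 0) {
        close(fd);
        return -1;
    }

    size_t off = 0;
    while (off < len) {
        /* each call crosses the user/kernel boundary and copies the
         * payload into kernel socket buffers for segmentation */
        ssize_t n = send(fd, buf + off, len - off, 0);
        if (n < 0) {
            close(fd);
            return -1;
        }
        off += (size_t)n;
    }
    close(fd);
    return 0;
}
```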

The two most popular methods to reduce the substantial CPU overhead that TCP/IP processing incurs are TCP/IP checksum offload and large send offload.

Page 9: Ethernet summit 2011_toe

Why TCP/IP Software is Slow?

• TCP/IP checksum offload: offloads the checksum calculation to hardware, speeding TCP/IP processing up by 8-15%.
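For reference, here is a minimal C version of the work this offload moves into hardware: the Internet checksum (RFC 1071). It is simplified for illustration; it assumes the data is already in network byte order, and a real TCP stack also sums a pseudo-header.

```c
/* Internet checksum (RFC 1071): a 16-bit one's-complement sum. */
#include <stddef.h>
#include <stdint.h>

uint16_t inet_checksum(const uint8_t *p, size_t len)
{
    uint32_t sum = 0;
    while (len > 1) {                 /* sum 16-bit big-endian words */
        sum += (uint32_t)((p[0] << 8) | p[1]);
        p += 2;
        len -= 2;
    }
    if (len)                          /* pad a trailing odd byte */
        sum += (uint32_t)(p[0] << 8);
    while (sum >> 16)                 /* fold carries back into 16 bits */
        sum = (sum & 0xffff) + (sum >> 16);
    return (uint16_t)~sum;
}
```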

• Large send offload (LSO), or TCP segmentation offload (TSO): relieves the OS of the task of segmenting the application's transmit data into MTU-size chunks. Using LSO, TCP can hand the network adapter a chunk of data larger than the MTU. The adapter driver then divides the data into MTU-size chunks and uses an early copy of the TCP and IP headers from the send buffer to create TCP/IP headers for each packet in preparation for transmission.
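A hedged sketch of the segmentation step that LSO/TSO delegates to the driver/adapter follows; the struct layout, 54-byte header size, and function are invented for illustration, not a real driver API.

```c
/* Illustrative LSO-style segmentation: split one large send into
 * MTU-size packets, replicating the early header copy. */
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define MSS 1460   /* payload per packet for a 1500-byte MTU */

struct pkt {
    uint8_t hdr[54];            /* Ethernet + IP + TCP header copy */
    uint8_t payload[MSS];
    size_t  len;                /* payload bytes actually used */
};

/* Split 'data' into MTU-size packets, replicating 'proto_hdr' (the
 * early header copy from the send buffer). Returns packets built. */
size_t lso_segment(const uint8_t proto_hdr[54], const uint8_t *data,
                   size_t len, struct pkt *out, size_t max_pkts)
{
    size_t n = 0;
    for (size_t off = 0; off < len && n < max_pkts; n++) {
        size_t chunk = (len - off < MSS) ? len - off : MSS;
        memcpy(out[n].hdr, proto_hdr, sizeof out[n].hdr);
        /* a real driver also patches the TCP sequence number, IP
         * total length, IP ID, and recomputes both checksums here */
        memcpy(out[n].payload, data + off, chunk);
        out[n].len = chunk;
        off += chunk;
    }
    return n;
}
```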

Page 10: Ethernet summit 2011_toe

Why TCP/IP Software is too Slow?

• CPU interrupt processing: an application that generates a write to a remote host over a network produces a series of interrupts to segment the data into packets and process the incoming acknowledgments. Handling each interrupt creates a significant amount of context switching.

• All these tasks end up taking tens of thousands of lines of code.

• There are optimized versions of TCP/IP software that achieve a 10-50% performance improvement.

• The question is: is that enough?
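A back-of-envelope calculation (illustrative numbers, not from the slides) shows why per-packet interrupts cannot keep up at line rate:

```c
/* Per-packet CPU budget at 10 Gb/s line rate. */
#include <stdio.h>

int main(void)
{
    double line_rate  = 10e9;        /* 10 Gb/s */
    double frame_bits = 1538 * 8.0;  /* 1500 B MTU + Ethernet framing,
                                        preamble, and inter-frame gap */
    double pps        = line_rate / frame_bits;   /* ~813 kpps */
    double cpu_hz     = 2.0e9;       /* "Xeon-class CPU @ 2.x GHz" */

    printf("packets/s: %.0f\n", pps);
    printf("CPU cycles per packet: %.0f\n", cpu_hz / pps);
    return 0;
}
```

At roughly 2,500 cycles per full-size frame, a single interrupt plus context switch can consume the entire budget before any protocol work is done.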

Page 11: Ethernet summit 2011_toe

Market Opportunity

Accelerated financial transactions, deep-packet inspection, and storage data processing require ultra-fast, highly intelligent information processing and storage/retrieval.

Problem: the critical functions and elements traditionally performed by software cannot meet today's performance requirements.

Response: we developed the critical building blocks in pure hardware, creating 'Mega IPs' that perform these tasks with lightning speed.

Page 12: Ethernet summit 2011_toe

Required Protocol Layers

[Dataflow diagram: a received packet traverses PHY (Layer 1), MAC (Layer 2), IP (Layer 3), and TCP (Layer 4) headers up to the application (Layers 5/6/7). The receive path checksums the packet and strips headers; per-flow state (Flow_0 .. Flow_n) and descriptors (Descr_0 .. Descr_n) steer each payload (Payld_0 .. Payld_n) into socket/application buffers in CPU memory via the TDI/application interface, with control reads and updates along the way.]

Page 13: Ethernet summit 2011_toe

TCP/IP Protocol Hardware Implementation

[Diagram: the same receive dataflow as the previous slide, but with Layer 2 MAC, Layer 3 IP, checksum/header stripping, flow descriptors, and payload placement all integrated inside the TOE FPGA (4 layers integrated). The TOE writes payloads directly into application receive buffers (AppRxBuf) and reads from application transmit buffers (AppTxBuf), bypassing the CPU's software stack.]

Page 14: Ethernet summit 2011_toe

Why in FPGA?

• Flexibility of technology and architecture:
• By design, FPGA technology is much more conducive to innovative ideas, and adaptive to implementing them in hardware.
• It allows you to easily carve up localized memory in sizes varying from 640 bits to 144K bits, based on the dynamic needs of the number of sessions and the performance desired, within the FPGA's available slices/ALEs/LUTs and block RAM (see the sizing sketch after this list).
• The availability of mature, standard hard IP cores makes it easy to integrate them and build a complete system at the cutting edge of technology.
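A toy sizing calculation, assuming an invented 16 Mbit block-RAM budget (the 16 KB per-session FIFO figure echoes the scalability range quoted later in this deck):

```c
/* Trade per-session FIFO depth against session count for a given
 * block-RAM budget. All numbers are assumptions for illustration. */
#include <stdio.h>

int main(void)
{
    long bram_bits = 16L * 1024 * 1024;  /* assumed 16 Mbit BRAM budget */
    long fifo_bits = 16L * 1024 * 8;     /* one 16 KB FIFO, in bits */
    /* x2: one Rx and one Tx FIFO per session */
    printf("sessions supported: %ld\n", bram_bits / (2 * fifo_bits));
    return 0;
}
```

Shrinking the per-session FIFO raises the session count proportionally, which is the flexibility the slide describes.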

Page 15: Ethernet summit 2011_toe

Why in FPGA?

• Speed and ease of development:
• A typical design mod or bug fix can be done in a few hours, versus several months in an ASIC flow.
• Most FPGA design tools are far more readily available, less expensive, and easier to use than ASIC design tools.
• FPGAs have become a de facto standard for starting development.
• Much more cost-effective to develop.

Page 16: Ethernet summit 2011_toe

Why in FPGA?

• Spec changes:
• TCP spec updates and RFC updates are easily adopted.
• Design spec changes are implemented more easily.
• Future enhancements:
• Adding features, improving code for higher throughput or lower latency, and upgrading to 40G/100G are much easier.
• Next-generation products can be introduced much faster and cheaper.

Page 17: Ethernet summit 2011_toe

Intilop's TOE – Key Features

• Scalability and design flexibility:
• The architecture can be scaled up to 40G MAC+TOE.
• Internal FIFO/memory scales from 64 bytes to 16K bytes, allocatable on a per-session basis, and can accommodate very 'Large Send' data for even higher throughput.
• Implements an optimized, simplified 'data streaming interface' (no interrupts; asynchronous communication between user logic and the TOE) - see the polling sketch after this list.
• Asynchronous user interface that can run over a range of clock speeds for flexibility.
• Gives the user the ability to target slower, cheaper FPGA devices.
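To illustrate the interrupt-free streaming model, here is a minimal polled-FIFO sketch in C; the MMIO addresses, register layout, and flag bit are invented for illustration and do not describe Intilop's actual interface.

```c
/* Minimal sketch of an interrupt-free, polled streaming interface. */
#include <stddef.h>
#include <stdint.h>

#define RX_STATUS   (*(volatile uint32_t *)0xA0000000u) /* assumed MMIO */
#define RX_DATA     (*(volatile uint32_t *)0xA0000004u) /* assumed MMIO */
#define RX_NONEMPTY 0x1u                                /* assumed flag */

/* Drain whatever payload words the TOE has queued. No interrupts and
 * no kernel transition: user-space logic just polls the FIFO flag. */
size_t rx_poll(uint32_t *dst, size_t max_words)
{
    size_t n = 0;
    while (n < max_words && (RX_STATUS & RX_NONEMPTY))
        dst[n++] = RX_DATA;
    return n;
}
```

Polling trades a little CPU for the removal of interrupt and context-switch latency, which is what makes the asynchronous user interface predictable.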

Page 18: Ethernet summit 2011_toe

Intilop's TOE – Key Features

• Easy hardware and software integration:
• Standard FIFO interface to user hardware for payload.
• Standard embedded-CPU interface for control.
• Easy integration in Linux/Windows; runs in kernel-bypass mode in user space.

• Performance advantage:
• Line-rate TCP performance.
• Delivers 97% of theoretical network bandwidth and 100% of TCP bandwidth - much better utilization of existing pipes' capacity.
• No need for load balancing across switch ports, reducing the number of switch ports and the number of servers/ports.
• Latency for TCP offload: ~200 ns, compared to ~50 µs for a CPU-based stack (roughly a 250x reduction).
• Patented search-engine technology is used in critical areas of the TOE design for the fastest results.

Page 19: Ethernet summit 2011_toe

10 Gbit TOE - Key Features

• Complete control- and data-plane processing of TCP/IP sessions in hardware accelerates processing by 5x-10x.
• TCP Offload Engine: 20 Gb/s (full-duplex) performance.
• Scalable to 80 Gb/s.
• 1-256 sessions, depending on on-chip memory availability.
• TCP and IP checksums in hardware.
• Session setup, teardown, and payload transfer done by hardware; no CPU involvement.
• Integrated 10 Gbit Ethernet MAC.
• Xilinx or Altera CPU interfaces.
• Out-of-sequence packet detection/storage/reassembly (optional).
• MAC address search logic/filter (optional).
• Accelerates security processing and storage networking over TCP.
• DDMA: data placement directly in the application's buffer reduces CPU utilization by 95+%.
• Future-proof: flexible implementation of TCP offload accommodates future specification changes.
• Customizable. Netlist, encrypted source code (Verilog), Verilog models, Perl models, verification suite.
• Available: now.

Page 20: Ethernet summit 2011_toe

10 Gbit TOE Engine - Diagram

[Simplified block diagram: 10G TCP Offload Engine + EMAC. XGMII interfaces connect the 10G EMAC to the Rx and Tx interfaces; packets pass through header/flag processing and a filters block into the Rx/Tx packet sequencing and queue manager. A protocol processor and session processor (with optional external memory via SRAM and external memory controllers) manage session state; 4/8 DMAs and a payload FIFO move the user Tx/Rx payload; a PCIe DMA and PCIe interface (optional), a PLB/APB interface with a registers block, and optional external flash provide the control bus to the host.]

Page 21: Ethernet summit 2011_toe

10-Gbit MAC

• Full functionality proven on Xilinx SoC-FPGA and ASIC.
• Scalable architecture to 10 Gbit.
• Targets high-end switches, routers, and security appliances.
• Full 20 Gbit line-rate packet transmission/reception, sustained.
• 14M 64-byte packets per second tested through each port.
• XGMII or XAUI interface.
• User-configurable deep FIFOs: 16K, 32K, or 64K bytes.
• Direct memory storage interface.
• Statistics counters.
• Fully integrated content-inspection engine (optional).
• Fully integrated CAM controller/format engine (optional).
• Configurable RAM block.
• Configurable packet bus: 32/64/128 bits.
• Configurable internal host bus: 32/64/128 bits.
• Source code (Verilog), Perl models, Verilog models, verification suite, customizable, netlist version.
• Available: now.

Page 22: Ethernet summit 2011_toe

Intilop's Network Acceleration and Security Engines

These are the main building-block IPs, integral components of a network security engine that performs deep-packet inspection of network traffic at multi-gigabit line rate, sustained, full duplex:

1. 10 Gbit TCP Offload Engine
2. 1 Gbit TCP Offload Engine
3. 10 Gbit Ethernet MAC
4. 1 Gbit Ethernet MAC
5. 4-16 port Layer 2 switch with IEEE 1588, 1 Gbit per port
6. Deep-Packet Content Inspection Engine

Page 23: Ethernet summit 2011_toe


THANK YOU