Bridge Chip Composing a PCIe Switch over Ethernet to Make a Seamless Disaggregated Computer in Data-Center Scale
Takashi Yoshikawa (1), Jun Suzuki (1), Yoichi Hidaka (2), Junichi Higuchi (2), and Shinji Abe (3)
(1) Green Platform Research Labs, (2) System Device Division, (3) IT Platform Division, NEC


Problem Statement

▌ Cloud data centers must provide computing platforms that meet diversified user requests quickly and at a reasonable price.

One solution is a "disaggregated platform" built from resource pools of CPUs, storage, GPUs, and other devices.

▌ PCIe is an appropriate interconnect.

Simple protocol, vast hardware/software assets.

▌ However, the transport layer of PCIe does not reach data-center scale.

Connection distance and port count are limited to in-box scale.

The coupling of host and I/O devices is too tight to disaggregate them into a shared resource pool.

Architecture (1/2): Distributed PCIe Switch

▌ To be PCIe compliant, Ethernet must be transparent.

The internal bus of the PCIe switch chip is extended over Ethernet.

ExpEther appears to the OS/software as a single-hop PCIe switch.

Commodity devices, OSes, and device drivers are used without modification.
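To make the transparency idea concrete, here is a minimal sketch in C of how a PCIe TLP could be carried inside an Ethernet frame between two ExpEther bridges so that neither the root complex nor the device sees anything but an ordinary switch hop. The header layout, field names, and the EtherType value are illustrative assumptions only; the poster does not disclose the actual ExpEther frame format.

```c
/* Illustrative sketch only: a hypothetical encapsulation of a PCIe TLP in an
 * Ethernet frame. Field names, sizes, and the EtherType are assumptions, not
 * the actual ExpEther wire format. */
#include <stdint.h>
#include <string.h>
#include <arpa/inet.h>

#define ETHERTYPE_EXPETHER_EXAMPLE 0x88B5  /* IEEE local-experimental EtherType */

struct eth_hdr {
    uint8_t  dst[6];     /* MAC of the peer ExpEther bridge */
    uint8_t  src[6];     /* MAC of this ExpEther bridge */
    uint16_t ethertype;  /* marks the payload as an encapsulated TLP */
};

struct ee_hdr_example {
    uint16_t group_id;   /* GID: host and I/O chips with the same GID attach */
    uint16_t seq_no;     /* sequence number used for ACK/retry */
};

/* Wrap a raw TLP into an Ethernet frame; returns the total frame length. */
size_t encap_tlp(uint8_t *frame, const uint8_t dst_mac[6],
                 const uint8_t src_mac[6], uint16_t gid, uint16_t seq,
                 const uint8_t *tlp, size_t tlp_len)
{
    struct eth_hdr eth;
    struct ee_hdr_example ee;

    memcpy(eth.dst, dst_mac, 6);
    memcpy(eth.src, src_mac, 6);
    eth.ethertype = htons(ETHERTYPE_EXPETHER_EXAMPLE);
    ee.group_id = htons(gid);
    ee.seq_no   = htons(seq);

    memcpy(frame, &eth, sizeof eth);
    memcpy(frame + sizeof eth, &ee, sizeof ee);
    memcpy(frame + sizeof eth + sizeof ee, tlp, tlp_len);
    return sizeof eth + sizeof ee + tlp_len;
}
```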

Performance

▌ Storage: Block I/O reads/writes to a PCIe SSD achieve 92%/74% of the direct-attached device's performance at 4 KB access size, which is 7x and 2x the performance of iSCSI and iSCSI/ToE, respectively.

▌ GPU application: 3D graphics benchmark performance degrades little, scoring about 80% to 97% of the direct-attached case.

Proposal

▌ Use Ethernet as the transport for PCI Express to realize 100x the port count and connection distance.

Seamless expansion of PCIe realizes a large single computer with many CPU/I/O resources on a single network.

Software-defined configuration realizes the functions and performance demanded by diversified user requirements.

Architecture (2/2): Reliable Ethernet

▌ To maintain the PCIe connection, Ethernet must be reliable.

Packet loss is prevented by congestion control and retry.

Multipath transport provides Nx bandwidth and redundancy.

Conventional Ethernet switches are used without modification.
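As a rough software analogue of the reliability mechanisms named above (ACK/retry plus RTT-monitored rate control), the sketch below keeps a small retry window and adjusts the send interval from the measured RTT. The window size, timeout, and adjustment rule are assumptions for illustration, not the algorithm implemented in the chip.

```c
/* Minimal sketch of ACK/retry with RTT-based rate control, as suggested by the
 * poster's "monitoring RTT to control packet rate" figure. Thresholds and the
 * adjustment rule are illustrative assumptions. */
#include <stdint.h>
#include <stdbool.h>

#define WINDOW 32                 /* unacked packets kept in the retry buffer */

struct retry_entry { uint16_t seq; uint64_t sent_ns; bool acked; };

static struct retry_entry retry_buf[WINDOW];
static double send_interval_ns = 1000.0;   /* current pacing between packets */

/* Called when an ACK for sequence number `seq` arrives at time `now_ns`. */
void on_ack(uint16_t seq, uint64_t now_ns)
{
    for (int i = 0; i < WINDOW; i++) {
        if (retry_buf[i].seq == seq && !retry_buf[i].acked) {
            retry_buf[i].acked = true;
            uint64_t rtt = now_ns - retry_buf[i].sent_ns;
            /* Longer RTT suggests congestion: slow down. Short RTT: speed up. */
            if (rtt > 100000)      send_interval_ns *= 1.5;
            else if (rtt < 20000)  send_interval_ns *= 0.9;
            break;
        }
    }
}

/* Called periodically: any entry unacked past the timeout is retransmitted. */
void retransmit_expired(uint64_t now_ns, void (*resend)(uint16_t seq))
{
    for (int i = 0; i < WINDOW; i++) {
        if (!retry_buf[i].acked && now_ns - retry_buf[i].sent_ns > 500000) {
            resend(retry_buf[i].seq);
            retry_buf[i].sent_ns = now_ns;   /* restart the timer */
        }
    }
}
```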

Disaggregated Resource Management

▌ To make a PCIe connection, each ExpEther chip has a group ID (GID) and a MAC address.

I/O devices are attached to the host that has the same GID.

Packet routing is performed autonomously by the Ethernet layer.

▌ The GID can be set remotely by management software.

Software-defined configuration is provided by OpenStack-based software.
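A hedged sketch of the management side of GID-based attachment is shown below: giving an I/O chip the same GID as a host chip is all that is needed for the device to appear under that host. The function names and register-write path are hypothetical placeholders; the poster only states that GIDs are set remotely by OpenStack-based management software.

```c
/* Sketch of GID-based attachment from the management software's point of view.
 * set_gid_register() and the chip addressing are hypothetical placeholders. */
#include <stdint.h>
#include <stdio.h>

struct ee_chip { const char *mac; uint16_t gid; int is_host; };

/* Hypothetical remote register write to an ExpEther chip, keyed by MAC. */
static void set_gid_register(struct ee_chip *chip, uint16_t gid)
{
    chip->gid = gid;
    printf("set GID %u on chip %s\n", gid, chip->mac);
}

/* Attach an I/O device's chip to a host by giving both the same GID.
 * Ethernet-layer routing then lets them find each other, and the device
 * shows up to the host OS as if it were behind a local PCIe switch. */
void attach_device_to_host(struct ee_chip *host, struct ee_chip *io_dev)
{
    set_gid_register(io_dev, host->gid);
}

int main(void)
{
    struct ee_chip host = { "00:11:22:33:44:01", 1, 1 };
    struct ee_chip gpu  = { "00:11:22:33:44:10", 0, 0 };
    attach_device_to_host(&host, &gpu);   /* GPU now belongs to group 1 */
    return 0;
}
```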

Conclusion

▌ The distributed PCIe switch architecture and reliable Ethernet realize a LAN-scale disaggregated computer with resource pools.

PCIe compliant: uses low-cost commodity software and hardware.

Performance comparable to that of a direct-attached device.

Software-defined configuration achieves the required performance.

Technology proven: already in service.

Implementation

▌ To match the performance of a PCIe switch, all functions are implemented in a chip, without a software stack.

The main functional blocks are the PCIe Ethernet Bridge (PEB) and the Ethernet Forwarding Engine (EFE).

The external interfaces are standard: PCI Express and Ethernet.
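For readers who prefer code to block diagrams, the following self-contained C model traces one transmit step through the two blocks: the PEB encapsulates the TLP and keeps a retry copy, and the EFE selects an Ethernet port and sends the frame. Every function body here is a toy placeholder; in the real chip these stages are hardware and involve no software stack.

```c
/* Self-contained software model of the transmit path split between the PEB and
 * EFE blocks. Everything here is an illustrative placeholder. */
#include <stdint.h>
#include <stdio.h>
#include <string.h>

static uint8_t retry_copy[2048];
static size_t  retry_len;

/* PEB stage: bridge the TLP, encapsulate it, and keep a copy for retry. */
static size_t peb_transmit(const uint8_t *tlp, size_t len, uint8_t *frame)
{
    memcpy(frame, "HDR", 3);              /* stand-in for Ethernet/EE header */
    memcpy(frame + 3, tlp, len);
    retry_len = len + 3;
    memcpy(retry_copy, frame, retry_len); /* retry buffer keeps it until ACK */
    return retry_len;
}

/* EFE stage: pace per congestion state, pick a path, send on that MAC. */
static void efe_transmit(const uint8_t *frame, size_t n)
{
    int port = frame[n - 1] % 2;          /* trivial stand-in for multipath  */
    printf("sending %zu-byte frame on Ethernet MAC %d\n", n, port);
}

int main(void)
{
    uint8_t tlp[16] = {0}, frame[2048];
    size_t n = peb_transmit(tlp, sizeof tlp, frame);
    efe_transmit(frame, n);
    return 0;
}
```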

Seamless Disaggregated Computer

▌ To build a computer that provides the performance and functions a user requires, the necessary devices can be selected and attached from the resource pool.

▌ ExpEther provides a single-hop PCIe connection over Ethernet among all CPUs and devices in the resource pool.


[Figure: Port count vs. distance (m), from 0.1 m to 10 km and 1 to 10k ports. A standard PCIe switch is rigid and limited to board/rack scale; data-center-scale disaggregation (video, storage, flexible multi-tenant use) requires campus-scale distance and far higher port counts.]

[Figure: Standard PCIe switch vs. ExpEther. In a standard switch, the CPU (root complex) reaches I/O devices through an upstream port (PCI bridge) and downstream ports (PCI bridges) over PCI Express. In ExpEther, the upstream port becomes an ExpEther host chip and each downstream port an ExpEther I/O chip, with the switch's internal bus carried across a transparent Ethernet region of Ethernet switches; each port and its Ethernet side are implemented in the same chip.]

[Figure: Congestion control and retry / multi-path. ExpEther packets between two ExpEther bridges are sequence-numbered and acknowledged (ACK packets); the RTT is monitored to control the packet rate, and lost packets are retried. Two Ethernet paths (Ethernet-1, Ethernet-2) between the bridges give Nx performance using N paths and restoration at a path error.]

[Figure: Chip block diagram of the PEB and EFE. Blocks include the up/downstream PCIe bridge core, PCIe switch routing core, virtual registers, encapsulation/decapsulation, retry buffer, retry control, congestion control, multi-path forwarder, and Ethernet MACs; peripheral interfaces include MDIO, I2C master/slave, register bus, control register, reset, virtual wire, and a 16-bit CPU interface.]

[Figure: 3D graphics benchmark scores (Total, SM2.0, HDR/SM3.0, CPU), normalized to 1, for ExpEther (2G x8) vs. direct-attached (4G x16).]

[Figure: Random-read IOPS (k), 0 to 50, vs. access size from 512 B to 4 MB, for direct-attached vs. ExpEther.]

[Figure: Physical and logical views of grouping. Devices A-J sit behind four hosts' PCIe switches; assigning GIDs (A, B, C, D to Group#1; E, F to Group#2; G, H, I to Group#3; J to Group#4) makes each host see only its own group's devices behind a virtual PCIe switch. The figure omits the EE chips.]

Multi-rack resource pool system with more than 100 servers and I/O devices (GPU, storage (HDD, PCIe Flash), VDI (GRID, PCoIP)), in service since April 2014.

[Figure: ExpEther, PCI Express over Ethernet. In a conventional computer and network system, PCI Express is an in-box I/O bus between CPU/memory and devices, and hosts are linked by an IP/Ethernet/InfiniBand/storage network. With ExpEther, PCIe becomes an out-of-box I/O bus: devices plug into the network and form a virtual PCI Express switch over the LAN, giving PCIe LAN-scale distance, port count, and dis/connectivity.]

[Figure: Latency (µs) vs. distance (m) for a PCIe switch and for ExpEther, from rack through floor and building to campus scale.]

[Figure: Resource pool prototype with ExpEther on the motherboard; each PCIe slot can hold either a CPU module or an I/O device. CPU pool: high-performance Xeon server modules and low-performance Avoton micro-server modules. Storage pool: MemCache, NVMe drives, and legacy HDD modules (SAS controller with HDDs). Network pool: NICs with DPI, high-end NICs, and normal NICs. Special I/O pool: HPC coprocessors (Xeon Phi), GPGPU, and security/encryption devices. A management server sets GIDs and monitors resources; the CPU pool system is controlled by OpenStack.]

Latency estimation (10GE FPGA): EE chip ~0.7 µs, Ethernet switch ~0.2 µs, cable 5 ns/m.
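Using the per-component figures above, a quick worked estimate for one direction of a host-to-device path: two ExpEther chips, one Ethernet switch hop, and 100 m of cable come to about 2 x 0.7 + 0.2 + 0.5 = 2.1 µs. The small C snippet below just evaluates that sum; the single-switch topology and the 100 m cable length are assumed for the example.

```c
/* Worked one-way latency estimate from the poster's per-component figures.
 * The topology (two EE chips, one Ethernet switch, 100 m of cable) is an
 * assumed example, not a measured configuration. */
#include <stdio.h>

int main(void)
{
    double ee_chip_us = 0.7;    /* per ExpEther chip (10GE FPGA)  */
    double eth_sw_us  = 0.2;    /* per Ethernet switch hop        */
    double cable_ns_m = 5.0;    /* propagation delay per metre    */
    double metres     = 100.0;  /* assumed cable length           */

    double total_us = 2.0 * ee_chip_us + eth_sw_us + metres * cable_ns_m / 1000.0;
    printf("estimated one-way latency: %.1f us\n", total_us);  /* 2.1 us */
    return 0;
}
```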

Detailed specification of the chips (external interfaces: PCI Express and Ethernet):

Chip                       ASIC              FPGA              Structured ASIC
PCIe                       Gen2.1, 1 lane    Gen2.1, 8 lanes   Gen2.1, 8 lanes
                           @ 2 Gbps          @ 2 Gbps          @ 2 Gbps
Ethernet                   2 x GbE           2 x 10GE          2 x 10GE
Power (RMS)                1.0 W             6.7 W             3.5 W
Dimensions                 19 x 19 mm        33 x 33 mm        29 x 29 mm
Pin count (w/ power/GND)   388               780               780
Process                    130 nm            40 nm             40 nm
Latency                    1050 ns           700 ns            700 ns
Internal clock             125 MHz           125/156 MHz       125/156 MHz