[2019-hipeac-terzenidis] photonics for disaggregated datacenter...
TRANSCRIPT
Photonics Systems and Networks (PhosNET) research group
Dept. of Informatics, Aristotle Univ. of Thessaloniki
Center for Interdisciplinary Research & Innovation (CIRI), Greece
Photonics for Disaggregated DataCenter and
Computercom Architectures
Nikos Terzenidis, Miltiadis Moralis-Pegios, Stelios Pitris, Theoni Alexoudi and Nikos Pleros
HiPEAC 2019, Valencia, Spain, January 21, 2019
Presentation Outline
Motivation
Disaggregation on the rack-level
The Hipoλaos optical switch architecture
Disaggregation on the board-level
The ICT-STREAMS silicon photonics interconnection architecture
Combining Hipoλaos and STREAMS architectures
Conclusion
Resource Under-utilization in Modern Data-Centers (DC)
Intel and Tencent, “Tencent Explores Datacenter Resource Pooling Using Intel® Rack Scale Architecture (Intel® RSA).”
Significant heterogeneity in resource usage per machine and workload (DC workloads, 2015)
“One size fits all” server configuration does not work in modern Data-Centers
➢ Specialized servers are the norm (90% of total servers) in order to support the variety of workloads and their requirements.
Considerable resource underutilization – up to 50%
➢ Impact on cost & energy efficiency
Virtualization to increase utilization
Data Centers globally consume 100 GWh/year
Even sophisticated Virtualization is not enough: it can reduce idle resources, but it does not eliminate inefficiencies.
Disaggregation – a whole new way of organizing the Data Center!
Traditional DC Architecture: multiple servers with a fixed set of resources (CPU / MEMORY / STORAGE / NETWORK)
➢ Lack of flexibility – waste of resources / energy
Disaggregated DC Architecture: multiple Rack-Trays with pools of resources
➢ Decoupling CPU/MEMORY/STORAGE from a single server-board
➢ Fine-grain control of available resources
➢ Up to 42% average power saving (*Ali et al. 2017)
Communication requirements / Deployment of Resource disaggregation
[Figure: type of resource disaggregation vs. BW/latency requirements and demonstrated deployments]
BW / latency requirements shown: 1000 Gb/s / 10 nsec; 500 Gb/s / 20 nsec; 6 Gb/s / 10 μsec; 1 Gb/s / 1 msec (typical server box; SAN)
Deployments: Pitwon et al. SPIE ’14; Pitwon et al. SPIE ’17; Intel RSA; Yosemite; Weerasinghe et al. FPT ’16; Abel et al. HOTI ’17; C. C. Tu et al. ANCS ’14; J. Gu et al. USENIX ’17; dREDBox HiPEAC ’18
Switch Requirements: sub-μs latency, ~100s radix, >10G/port BW
New challenges for the Interconnection network
Interconnection layers in a disaggregated DC
Network is the most crucial infrastructure that determines performance in disaggregation
[Figure: DC rack with ToR switch and Intel QPI/UPI board-level links]
Rack-level interconnects: dominated by electrical ToR switches
Board-level interconnects: dominated by point-to-point interconnects
Rack level interconnects - Electrical DC switches
Arista 7150S-64:
• 1.28 Tbps capacity
• 380 nsec latency
• ~175 pJ/bit energy
Arista 7250QX-64:
• 5.12 Tbps capacity
• ~2 μsec latency
• ~120 pJ/bit energy
Arista DCS-7308:
• 20 Tbps capacity
• >2 μsec latency
• ~300 pJ/bit energy
❖ Capacity increases only via multiple line-cards
❖ >2 μsec latency for >256-port configurations
❖ Energy efficiency degrades with capacity – up to 300 pJ/bit!
Latency & energy efficiency of current electrical switches can be prohibiting factors for efficient resource disaggregation!
Rack level interconnects - What about optical switching?
Switching Latency = Header processing & Switch reconfiguration + Packet Contention Resolution
➢ Header processing & switch reconfiguration: minimized through distributed control (lower-complexity algorithms at high radices)
➢ Packet contention resolution: minimized through optical buffering at line-rate (avoiding OEO and SerDes)
Optical circuit switching (OCS): can yield the required port count but cannot provide dynamic operation on a per-packet level (msec reconfiguration)
Optical packet switching (OPS): ns-scale switching at low radices (up to 64 ports), but latency can scale beyond μs for higher port counts

Switch Design | Port Number | Distr. Control | Contention Resolution | Data Rate
Iris | 80×80 | No | Optical delay line | 40 Gb/s
Petabit | 1024×1024 | No | EB-Input | 10 Gb/s
Tonak-Lions | 1024×1024 | Yes | EB-Input | 10 Gb/s
NTT | 270×270 | No | No | 10 Gb/s
Nagoya | 1536×1536 | No | No | 10 Gb/s
A-Star | 448×448 | No | No | 40 Gb/s
OpSquare | 2056×2056 | Yes | EB-Input | 40 Gb/s
OSMOSIS | 2048×2048 | No | EB-Input/Output | 40 Gb/s
Data Vortex | 10K×10K | Yes | Deflection | 10 Gb/s

Architectures employing distributed control utilize either electrical buffers or deflection routing
Only the Iris architecture offers optical delay line buffering, but it lacks a distributed controller
Rack level interconnects - Our approach
Hipoλaos: a High-port λ-routed all-optical packet switch
Spanke design (Dorren et al., JOCN 2012):
✓ Strictly non-blocking
✓ Autonomously controlled switches
✓ Easily implemented in optics via B&S
Optical buffering at line-rate:
✓ Optical delay line feed-forward buffering
✓ Avoid OEO
AWGR-based wavelength routing:
✓ Any-to-any connectivity
✓ Low-latency
N. Terzenidis et al., OpEx, pp. 8756-8766 (2018)
Generic view of the Hipoλaos architecture
DC Rack organized in Trays
ToR Switch interconnect:
➢ N² nodes
➢ N Trays
➢ N nodes/Tray
Internal switch organization:
➢ Switch Planes
➢ AWGRs
Switch Planes aggregate N ports, implement B&S, contention resolution
Wavelength routing to desired node via AWGRs
Scalable up to 1024 ports
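The N² scaling above (N planes of N ports each, interconnecting N trays of N nodes) can be sanity-checked with a short sketch; the helper name and dict layout are ours, and the 10 G/port rate is the figure used throughout the slides:

```python
def hipolaos_scaling(n, port_rate_gbps=10.0):
    """Hipoλaos scaling: N switch planes with N ports each interconnect
    N^2 nodes, organised as N trays with N nodes per tray."""
    total_ports = n * n
    return {
        "planes": n,
        "ports_per_plane": n,
        "trays": n,
        "nodes_per_tray": n,
        "total_ports": total_ports,
        "capacity_tbps": total_ports * port_rate_gbps / 1000.0,
    }

print(hipolaos_scaling(16))   # 256-port configuration
print(hipolaos_scaling(32))   # 1024-port configuration: 10.24 Tbps at 10G/port
```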
AWGR Principle of operation
Passive photonic device
➢ Routing of the optical signal according to wavelength
WDM signal on input: each wavelength routed to a different output
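This routing rule fits in a couple of lines, assuming the standard cyclic-AWGR convention (output = (input + wavelength index) mod N); the function name is illustrative:

```python
def awgr_output_port(input_port, wavelength_index, n):
    """Cyclic N x N AWGR: a signal entering input i on wavelength index w
    exits output (i + w) mod N. Purely passive - the chosen wavelength
    alone determines the route."""
    return (input_port + wavelength_index) % n

N = 16
# A WDM signal on a single input fans out: each wavelength lands on a
# different output, so any output is reachable by picking the right λ.
outputs = [awgr_output_port(0, w, N) for w in range(N)]
```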
Detailed layout of the 256-port Hipoλaos architecture
3 Functional Stages
Stage A:
➢ Header Processing on FPGA
➢ Signal Broadcasting
Stage B:
➢ Tray Selection
➢ Contention Resolution
Stage C:
➢ Destination node selection & routing via AWGR
256-port configuration:
➢ 16 Planes
➢ 16 ports/Plane
Process Flow on incoming packets
1. Packet arrival at switch inputs #1 & #16
2. Data streams split in two to achieve header processing
3. Data streams broadcasted to all WCs (each connected to a different output tray)
4. Tray selection by enabling the appropriate WC
5. Packets delayed in feed-forward buffers
6. Conversion to the appropriate wavelength
7. Routing to the destination node via AWGR
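Steps 4-6 above (tray selection, feed-forward buffering, wavelength conversion) amount to a contention-resolution policy. A minimal sketch, assuming a first-come first-delayed assignment of buffer slots; the names and the policy are illustrative, not the actual FPGA control logic:

```python
def resolve_contention(packets, num_buffers):
    """Assign feed-forward delays (0..num_buffers packet slots) to packets
    contending for the same output tray in one arrival slot; packets that
    exceed the available delay lines are dropped.
    Each packet is a (tray, node) pair; scheduled entries are
    (packet, delay_slots, wavelength_index) - the wavelength index picks
    the destination node at the AWGR stage."""
    next_free_slot = {}            # tray -> next free delay slot
    scheduled, dropped = [], []
    for tray, node in packets:
        delay = next_free_slot.get(tray, 0)
        if delay > num_buffers:    # direct path + num_buffers delay lines used up
            dropped.append((tray, node))
            continue
        next_free_slot[tray] = delay + 1
        scheduled.append(((tray, node), delay, node))
    return scheduled, dropped

# Two packets contend for tray 2; the second gets a one-slot delay:
sched, drop = resolve_contention([(2, 5), (2, 7), (4, 1)], num_buffers=2)
```

With `num_buffers=2` this matches the experimental setup's 0, 1 and 2 packet-slot fiber delays.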
Throughput and Latency Performance Analysis
Simulation Parameters (Omnet++ framework):
Nr. of Nodes: 256
Node-Switch BW: 10 Gbps
Packet size: 72 Bytes
Traffic Generation mode: Unicast
Traffic pattern: Random-Uniform
✓ 95% Throughput with 8 buffers
✓ 85% Throughput with only 2 buffers
✓ 65% Throughput with 0 buffers
✓ Up to 780 ns latency
✓ 610 ns latency with 2 buffers
✓ Biggest part associated with the FPGA PCS/PMA
✓ Low-latency due to distributed control: arbitrate 16 nodes but connect 256
✓ Small number of buffers: practical for optical feed-forward buffering!
N. Terzenidis et al., OpEx, Vol. 26, Issue 7, pp. 8756-8766 (2018)
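The buffer-vs-throughput trend reported above can be illustrated with a toy slotted simulation under uniform random traffic. This is a rough sketch, not the Omnet++ model behind the reported numbers: it simply drops packets that exceed a per-tray service capacity of one direct path plus the delay lines, rather than carrying them over between slots:

```python
import random

def toy_throughput(num_inputs, num_trays, num_buffers,
                   slots=10_000, load=1.0, seed=1):
    """Toy slotted model of one switch plane: each slot, every input
    sends (with probability `load`) to a uniformly random tray; at most
    1 + num_buffers packets per tray survive a slot (the direct path
    plus the feed-forward delay lines). Excess packets are dropped."""
    rng = random.Random(seed)
    sent = delivered = 0
    for _ in range(slots):
        per_tray = {}
        for _ in range(num_inputs):
            if rng.random() < load:
                tray = rng.randrange(num_trays)
                per_tray[tray] = per_tray.get(tray, 0) + 1
                sent += 1
        delivered += sum(min(c, 1 + num_buffers) for c in per_tray.values())
    return delivered / sent

# Throughput improves with buffer count (qualitative trend only):
for b in (0, 2, 8):
    print(f"{b} buffers -> throughput {toy_throughput(16, 16, b):.2f}")
```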
256-port Experimental Evaluation
Data Generation Stage:
 Data at 10.3125 Gbps
 405-bit long packets
 35-bit inter-packet guardband
 425-bit electrical envelopes
Stage A:
 2 contending input ports
 SOA-MZIs for WC
 (Laser + modulator) bank for λ tuning at the WC
Stage B:
 Fiber-based feed-forward buffer
 0, 1 and 2 packet-slot delay
Stage C:
 Tunable laser
 SOA-MZI for WC
 16x16 AWGR for passive λ routing
256-port Experimental Results
[Figure panels: Switch Input, Switch Output, BER measurements (AWGR port #8)]
➢ Five-packet trace + eye for input 1 (SFP)
➢ Five-packet trace + eye for input 2 (XFP)
➢ The five selected packets at the 8th & 9th outputs of the AWGR
➢ Spectrum & eyes for each AWGR output port
✓ Proper packet routing & buffering
✓ 2 dB average power penalty
More Experimental Results
Power Budget Analysis for 256 & 1024 ports:
✓ No limitations in terms of power budget
Power penalty for all 16 output ports:
✓ 1.5 – 2.5 dB power penalty at a BER of 10⁻⁹
N. Terzenidis et al., JOCN, Vol. 10, Issue 7, pp. B102-B116 (2018)
Extending Hipoλaos architecture – 1024-port layout validated
 32 planes with 32 ports/plane
 B&S to 32 WCs
 32x32 port AWGR
 Capacity increases to 10.24 Tbps (at 10G/port)
Detailed Layout Results:
✓ 1.5 – 3.2 dB power penalty at a BER of 10⁻⁹
M. Moralis-Pegios et al., accepted at JLT
Extending Hipoλaos architecture – Intra-tray multicast
 Limited modifications to enable multicasting operation
 Passive multi-wavelength routing at Stage C by the AWGR
 No extra latency
[Figure: packet from node (1,1) multicast to nodes (1,1), (1,2), …, (1,j)]
Results:
✓ Multicast to 5 nodes
✓ 1 dB power variation
N. Terzenidis et al., IEEE PTL, pp. 1535-1538, Sept. 2018
Interconnection layers in a disaggregated DC
Network is the most crucial infrastructure that determines performance in disaggregation
[Figure: DC rack with ToR switch and Intel QPI/UPI board-level links]
Rack-level interconnects: dominated by electrical ToR switches
Board-level interconnects: dominated by point-to-point interconnects
Board level interconnects
Direct P2P – High-End Server Processors (e.g. Intel® Xeon® processor E5-4600, CISC x86 processors):
 Intel QPI/UPI, AMD Infinity Fabric/HyperTransport, IBM POWER8 SMP Links, Oracle SPARC Coherence links, NVIDIA NVLink
 Low-latency, low-power
 Limited connectivity
Switch-based:
 PCIe-based, RapidIO-based, Oracle Bixby etc.
 >8 nodes connectivity
 Increased latency (extra hops), power, cost, size
Board level interconnects – The ICT-STREAMS approach
Chip-to-chip board-level optical interconnect exploiting WDM-enabled SiPho TxRx and AWGR:
✓ Distributed routing
✓ All-to-All connectivity
✓ High number of ports (up to 16)
✓ Strictly non-blocking
✓ High bandwidth optical links
✓ Time-of-flight latency
✓ No O/E/O conversions
✓ Support for multicast/broadcast
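The AWGR-based all-to-all connectivity can be sketched as a static wavelength plan, assuming the cyclic routing convention output = (input + wavelength index) mod N; the function name is ours:

```python
def tx_wavelength(src, dst, n):
    """All-to-all over an N x N cyclic AWGR (output = (input + w) mod N):
    node `src` reaches node `dst` by transmitting on wavelength index
    (dst - src) mod N. No active switching, no O/E/O on the path."""
    return (dst - src) % n

N = 16
# Full wavelength plan: every (source, destination) pair gets a wavelength,
# and at any AWGR output the N sources arrive on N distinct wavelengths,
# which is what makes the fabric strictly non-blocking.
plan = {(s, d): tx_wavelength(s, d, N) for s in range(N) for d in range(N)}
```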
The ICT-STREAMS O-band board
1.6 Tb/s (32×50 Gb/s) O-band WDM TxRx Optical Engine:
 50 Gb/s Ring Modulator & PD (III-V on SOI)
 Si MUX/DEMUX
 50 Gb/s Electr. MOD DR and TIA
 CLIPP-assisted thermal drift compensation system
25.6 Tb/s AWGR-based passive routing platform:
 16x16 O-band AWGR
 1300 nm SM High-Freq. EOPCB
www.ict-streams.eu
Experimental demonstration - 8×40Gb/s multi-socket Tx/Rx/routing
RM: 1 pJ/bit @ 40 Gbps
TIA: 3.95 pJ/bit @ 40 Gbps
Input ER: 5.4 dB
Output ER: ~4 dB
S. Pitris et al., IEEE Photonics Journal, Oct. 2018, DOI: 10.1109/JPHOT.2018.2873673
Experimental demonstration - 4×40Gb/s SiPho TxRx
[Figure: Si chip & mask layout; Tx1-Tx4 eye diagrams at 10 ps/div]
➢ Measured extinction ratios (Tx1-Tx4): 4.4 dB, 4.1 dB, 4.2 dB, 4 dB
➢ λ1 = 1310.25 nm, λ2 = 1303.23 nm, λ3 = 1324.59 nm, λ4 = 1317.28 nm
✓ Channel spacing of 6.75 nm
✓ 4x40G Silicon Ring Modulator
✓ Energy Efficiency: 24.84 fJ/bit/RM
✓ Footprint: 2.7×5.2 mm²
S. Pitris et al., “A 4×40 Gb/s O-band WDM Silicon Photonic Transmitter based on Micro-Ring Modulators,” accepted for presentation at OFC 2019, W3E.2, March 6, 2019
Experimental demonstration - 40Gb/s Data Transmission in a 52km-long Fiber Link
Si RM co-packaged with 1V DR IC (Microring Modulator + Electronic Driver)
✓ Error-free 40 Gb/s transmission with very low power penalty (<1 dB) for up to 52 km
✓ Low-power 28-nm FD-SOI electronic driver
✓ Power consumption of the electronic driver: 40 mW
✓ Total link losses: 17.6 dB
Combining Hipoλaos and STREAMS architectures
Dual-layer locality-aware interconnect performance (50-50 on/off-board traffic):
✓ Small number of buffers for 100% throughput: practical!
✓ P99 latency < 610 nsec!
T. Alexoudi et al., “Optics in Computing: from Photonic Network-on-Chip to Chip-to-Chip Interconnects and Disintegrated Architectures,” JLT, doi: 10.1109/JLT.2018.2875995
Conclusions
✓ Hipoλaos optical switch for Rack-level interconnection in disaggregated DCs:
 Up to 1024 ports
 10.24 Tbps capacity
 Multicast support
 Up to 95% throughput for 4 packet-buffers, with sub-μs latency
✓ STREAMS architecture for board-level interconnection:
 >8 on-board nodes connectivity
 Low latency, low energy with Si-Pho
ACKNOWLEDGMENTS
M. Moralis-Pegios, G. Mourgias-Alexandris, S. Pitris, T. Alexoudi, K. Vyrsokinos, N. Pleros
Contacts:
Nikos Terzenidis @ [email protected]
Nikos Pleros @ [email protected]