[2019-hipeac-terzenidis] photonics for disaggregated datacenter...
TRANSCRIPT
Photonics Systems and Networks (PhosNET) research group
Dept. of Informatics, Aristotle Univ. of Thessaloniki
Center for Interdisciplinary Research & Innovation (CIRI), Greece
Photonics for Disaggregated DataCenter and
Computercom Architectures
Nikos Terzenidis, Miltiadis Moralis-Pegios, Stelios Pitris, Theoni Alexoudi and Nikos Pleros
HiPEAC 2019, Valencia, Spain, January 21, 2019
Presentation Outline
Motivation
Disaggregation on the rack-level
The Hipoλaos optical switch architecture
Disaggregation on the board-level
The ICT-STREAMS silicon photonics interconnection architecture
Combining Hipoλaos and STREAMS architectures
Conclusion
Resource Under-utilization in Modern Data-Centers (DC)
Intel and Tencent, “Tencent Explores Datacenter Resource Pooling Using Intel® Rack Scale Architecture (Intel® RSA).”
Significant heterogeneity in resource usage per machine and workload (DC workloads, 2015)
“One size fits all” server configuration does not work in modern Data-Centers
➢ Specialized servers are the norm (90% of total servers) in order to support the variety of workloads and their requirements.
Considerable resource underutilization – up to 50%
➢ Impact on cost & energy efficiency
Virtualization to increase utilization
Data Centers globally consume 100 GWh/year
Even sophisticated Virtualization is not enough: it can reduce idle resources, but it does not eliminate inefficiencies.
Disaggregation – a whole new way of organizing the Data Center!
Traditional DC Architecture: multiple servers with a fixed set of resources (CPU / MEMORY / STORAGE / NETWORK)
➢ Lack of flexibility – waste of resources / energy
Disaggregated DC Architecture: multiple Rack-Trays with pools of resources
➢ Decoupling CPU/MEMORY/STORAGE from a single server-board
➢ Fine-grain control of available resources
➢ Up to 42% average power saving (*Ali et al. 2017)
Communication requirements / Deployment of Resource disaggregation
[Figure: type of resource disaggregation vs. BW/latency requirements and demonstrated deployments]
BW / latency requirements shown: 1000 Gb/s / 10 nsec; 500 Gb/s / 20 nsec; 6 Gb/s / 10 μsec; 1 Gb/s / 1 msec (typical server box; SAN)
Deployments: Pitwon et al. SPIE ’14; Pitwon et al. SPIE ’17; Intel RSA; Yosemite; Weerasinghe et al. FPT ’16; Abel et al. HOTI ’17; C. C. Tu et al. ANCS ’14; J. Gu et al. USENIX ’17; dREDBox HiPEAC ’18
Switch Requirements: sub-μs latency, ~100s radix, >10G/port BW
New challenges for the Interconnection network
Interconnection layers in a disaggregated DC
Network is the most crucial infrastructure that determines performance in disaggregation
[Figure: DC rack with ToR switch and Intel QPI/UPI board-level links]
Rack-level interconnects: dominated by electrical ToR switches
Board-level interconnects: dominated by point-to-point interconnects
Rack level interconnects - Electrical DC switches
Arista 7150S-64:
• 1.28 Tbps capacity
• 380 nsec latency
• ~175 pJ/bit energy
Arista 7250QX-64:
• 5.12 Tbps capacity
• ~2 μsec latency
• ~120 pJ/bit energy
Arista DCS-7308:
• 20 Tbps capacity
• >2 μsec latency
• ~300 pJ/bit energy
❖ Capacity increases only via multiple line-cards
❖ >2 μsec latency for >256-port configurations
❖ Energy efficiency degrades with capacity – up to 300 pJ/bit!
Latency & energy efficiency of current electrical switches can be prohibiting factors for efficient resource disaggregation!
Rack level interconnects - What about optical switching?
Switching Latency = Header processing & Switch reconfiguration + Packet Contention Resolution
➢ Header processing & switch reconfiguration: minimized through distributed control (lower-complexity algorithms at high radices)
➢ Packet contention resolution: minimized through optical buffering at line-rate (avoiding OEO and SerDes)
Optical circuit switching (OCS): can yield the required port count but cannot provide dynamic operation on a per-packet level (msec reconfiguration)
Optical packet switching (OPS): ns-scale switching at low radices (up to 64 ports), but latency can scale beyond μs for higher port counts

Switch Design | Port Number | Distr. Control | Contention Resolution | Data Rate
Iris | 80×80 | No | Optical delay line | 40 Gb/s
Petabit | 1024×1024 | No | EB-Input | 10 Gb/s
Tonak-Lions | 1024×1024 | Yes | EB-Input | 10 Gb/s
NTT | 270×270 | No | No | 10 Gb/s
Nagoya | 1536×1536 | No | No | 10 Gb/s
A-Star | 448×448 | No | No | 40 Gb/s
OpSquare | 2056×2056 | Yes | EB-Input | 40 Gb/s
OSMOSIS | 2048×2048 | No | EB-Input/Output | 40 Gb/s
Data Vortex | 10K×10K | Yes | Deflection | 10 Gb/s

Architectures employing distributed control utilize either electrical buffers or deflection routing
Only the Iris architecture offers optical delay line buffering, but it lacks a distributed controller
Rack level interconnects - Our approach
Hipoλaos: a High-port λ-routed all-optical packet switch
Spanke design (Dorren et al., JOCN 2012):
✓ Strictly non-blocking
✓ Autonomously controlled switches
✓ Easily implemented in optics via B&S
Optical buffering at line-rate:
✓ Optical delay line feed-forward buffering
✓ Avoid OEO
AWGR-based wavelength routing:
✓ Any-to-any connectivity
✓ Low-latency
N. Terzenidis et al., OpEx, pp. 8756-8766 (2018)
Generic view of the Hipoλaos architecture
DC Rack organized in Trays
ToR Switch interconnect:
➢ N² nodes
➢ N Trays
➢ N nodes/Tray
Internal switch organization:
➢ Switch Planes
➢ AWGRs
Switch Planes aggregate N ports, implement B&S, contention resolution
Wavelength routing to desired node via AWGRs
Scalable up to 1024 ports
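The N² scaling above (N planes of N ports each, interconnecting N trays of N nodes) can be sanity-checked with a short sketch; the helper name and dict layout are ours, and the 10 G/port rate is the figure used throughout the slides:

```python
def hipolaos_scaling(n, port_rate_gbps=10.0):
    """Hipoλaos scaling: N switch planes with N ports each interconnect
    N^2 nodes, organised as N trays with N nodes per tray."""
    total_ports = n * n
    return {
        "planes": n,
        "ports_per_plane": n,
        "trays": n,
        "nodes_per_tray": n,
        "total_ports": total_ports,
        "capacity_tbps": total_ports * port_rate_gbps / 1000.0,
    }

print(hipolaos_scaling(16))   # 256-port configuration
print(hipolaos_scaling(32))   # 1024-port configuration: 10.24 Tbps at 10G/port
```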
AWGR Principle of operation
Passive photonic device
➢ Routing of the optical signal according to wavelength
WDM signal on input: each wavelength routed to a different output
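This routing rule fits in a couple of lines, assuming the standard cyclic-AWGR convention (output = (input + wavelength index) mod N); the function name is illustrative:

```python
def awgr_output_port(input_port, wavelength_index, n):
    """Cyclic N x N AWGR: a signal entering input i on wavelength index w
    exits output (i + w) mod N. Purely passive - the chosen wavelength
    alone determines the route."""
    return (input_port + wavelength_index) % n

N = 16
# A WDM signal on a single input fans out: each wavelength lands on a
# different output, so any output is reachable by picking the right λ.
outputs = [awgr_output_port(0, w, N) for w in range(N)]
```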
Detailed layout of the 256-port Hipoλaos architecture
3 Functional Stages
Stage A:
➢ Header Processing on FPGA
➢ Signal Broadcasting
Stage B:
➢ Tray Selection
➢ Contention Resolution
Stage C:
➢ Destination node selection & routing via AWGR
256-port configuration:
➢ 16 Planes
➢ 16 ports/Plane
Process Flow on incoming packets
1. Packet arrival at switch inputs #1 & #16
2. Data streams split in two to achieve header processing
3. Data streams broadcasted to all WCs (each connected to a different output tray)
4. Tray selection by enabling the appropriate WC
5. Packets delayed in feed-forward buffers
6. Conversion to the appropriate wavelength
7. Routing to the destination node via AWGR
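Steps 4-6 above (tray selection, feed-forward buffering, wavelength conversion) amount to a contention-resolution policy. A minimal sketch, assuming a first-come first-delayed assignment of buffer slots; the names and the policy are illustrative, not the actual FPGA control logic:

```python
def resolve_contention(packets, num_buffers):
    """Assign feed-forward delays (0..num_buffers packet slots) to packets
    contending for the same output tray in one arrival slot; packets that
    exceed the available delay lines are dropped.
    Each packet is a (tray, node) pair; scheduled entries are
    (packet, delay_slots, wavelength_index) - the wavelength index picks
    the destination node at the AWGR stage."""
    next_free_slot = {}            # tray -> next free delay slot
    scheduled, dropped = [], []
    for tray, node in packets:
        delay = next_free_slot.get(tray, 0)
        if delay > num_buffers:    # direct path + num_buffers delay lines used up
            dropped.append((tray, node))
            continue
        next_free_slot[tray] = delay + 1
        scheduled.append(((tray, node), delay, node))
    return scheduled, dropped

# Two packets contend for tray 2; the second gets a one-slot delay:
sched, drop = resolve_contention([(2, 5), (2, 7), (4, 1)], num_buffers=2)
```

With `num_buffers=2` this matches the experimental setup's 0, 1 and 2 packet-slot fiber delays.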
Throughput and Latency Performance Analysis
Simulation Parameters (Omnet++ framework):
Nr. of Nodes: 256
Node-Switch BW: 10 Gbps
Packet size: 72 Bytes
Traffic Generation mode: Unicast
Traffic pattern: Random-Uniform
✓ 95% Throughput with 8 buffers
✓ 85% Throughput with only 2 buffers
✓ 65% Throughput with 0 buffers
✓ Up to 780 ns latency
✓ 610 ns latency with 2 buffers
✓ Biggest part associated with the FPGA PCS/PMA
✓ Low-latency due to distributed control: arbitrate 16 nodes but connect 256
✓ Small number of buffers: practical for optical feed-forward buffering!
N. Terzenidis et al., OpEx, Vol. 26, Issue 7, pp. 8756-8766 (2018)
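The buffer-vs-throughput trend reported above can be illustrated with a toy slotted simulation under uniform random traffic. This is a rough sketch, not the Omnet++ model behind the reported numbers: it simply drops packets that exceed a per-tray service capacity of one direct path plus the delay lines, rather than carrying them over between slots:

```python
import random

def toy_throughput(num_inputs, num_trays, num_buffers,
                   slots=10_000, load=1.0, seed=1):
    """Toy slotted model of one switch plane: each slot, every input
    sends (with probability `load`) to a uniformly random tray; at most
    1 + num_buffers packets per tray survive a slot (the direct path
    plus the feed-forward delay lines). Excess packets are dropped."""
    rng = random.Random(seed)
    sent = delivered = 0
    for _ in range(slots):
        per_tray = {}
        for _ in range(num_inputs):
            if rng.random() < load:
                tray = rng.randrange(num_trays)
                per_tray[tray] = per_tray.get(tray, 0) + 1
                sent += 1
        delivered += sum(min(c, 1 + num_buffers) for c in per_tray.values())
    return delivered / sent

# Throughput improves with buffer count (qualitative trend only):
for b in (0, 2, 8):
    print(f"{b} buffers -> throughput {toy_throughput(16, 16, b):.2f}")
```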
256-port Experimental Evaluation
Data Generation Stage:
 Data at 10.3125 Gbps
 405-bit long packets
 35-bit inter-packet guardband
 425-bit electrical envelopes
Stage A:
 2 contending input ports
 SOA-MZIs for WC
 (Laser + modulator) bank for λ tuning at the WC
Stage B:
 Fiber-based feed-forward buffer
 0, 1 and 2 packet-slot delay
Stage C:
 Tunable laser
 SOA-MZI for WC
 16x16 AWGR for passive λ routing
256-port Experimental Results
[Figure panels: Switch Input, Switch Output, BER measurements (AWGR port #8)]
➢ Five-packet trace + eye for input 1 (SFP)
➢ Five-packet trace + eye for input 2 (XFP)
➢ The five selected packets at the 8th & 9th outputs of the AWGR
➢ Spectrum & eyes for each AWGR output port
✓ Proper packet routing & buffering
✓ 2 dB average power penalty
More Experimental Results
Power Budget Analysis for 256 & 1024 ports:
✓ No limitations in terms of power budget
Power penalty for all 16 output ports:
✓ 1.5 – 2.5 dB power penalty at a BER of 10⁻⁹
N. Terzenidis et al., JOCN, Vol. 10, Issue 7, pp. B102-B116 (2018)
Extending Hipoλaos architecture – 1024-port layout validated
 32 planes with 32 ports/plane
 B&S to 32 WCs
 32x32 port AWGR
 Capacity increases to 10.24 Tbps (at 10G/port)
Detailed Layout Results:
✓ 1.5 – 3.2 dB power penalty at a BER of 10⁻⁹
M. Moralis-Pegios et al., accepted at JLT
Extending Hipoλaos architecture – Intra-tray multicast
 Limited modifications to enable multicasting operation
 Passive multi-wavelength routing at Stage C by the AWGR
 No extra latency
[Figure: packet from node (1,1) multicast to nodes (1,1), (1,2), …, (1,j)]
Results:
✓ Multicast to 5 nodes
✓ 1 dB power variation
N. Terzenidis et al., IEEE PTL, pp. 1535-1538, Sept. 2018
Interconnection layers in a disaggregated DC
Network is the most crucial infrastructure that determines performance in disaggregation
[Figure: DC rack with ToR switch and Intel QPI/UPI board-level links]
Rack-level interconnects: dominated by electrical ToR switches
Board-level interconnects: dominated by point-to-point interconnects
Board level interconnects
Direct P2P – High-End Server Processors (e.g. Intel® Xeon® processor E5-4600, CISC x86 processors):
 Intel QPI/UPI, AMD Infinity Fabric/HyperTransport, IBM POWER8 SMP Links, Oracle SPARC Coherence links, NVIDIA NVLink
 Low-latency, low-power
 Limited connectivity
Switch-based:
 PCIe-based, RapidIO-based, Oracle Bixby etc.
 >8 nodes connectivity
 Increased latency (extra hops), power, cost, size
Board level interconnects – The ICT-STREAMS approach
Chip-to-chip board-level optical interconnect exploiting WDM-enabled SiPho TxRx and AWGR:
✓ Distributed routing
✓ All-to-All connectivity
✓ High number of ports (up to 16)
✓ Strictly non-blocking
✓ High bandwidth optical links
✓ Time-of-flight latency
✓ No O/E/O conversions
✓ Support for multicast/broadcast
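The AWGR-based all-to-all connectivity can be sketched as a static wavelength plan, assuming the cyclic routing convention output = (input + wavelength index) mod N; the function name is ours:

```python
def tx_wavelength(src, dst, n):
    """All-to-all over an N x N cyclic AWGR (output = (input + w) mod N):
    node `src` reaches node `dst` by transmitting on wavelength index
    (dst - src) mod N. No active switching, no O/E/O on the path."""
    return (dst - src) % n

N = 16
# Full wavelength plan: every (source, destination) pair gets a wavelength,
# and at any AWGR output the N sources arrive on N distinct wavelengths,
# which is what makes the fabric strictly non-blocking.
plan = {(s, d): tx_wavelength(s, d, N) for s in range(N) for d in range(N)}
```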
The ICT-STREAMS O-band board
1.6 Tb/s (32×50 Gb/s) O-band WDM TxRx Optical Engine:
 50 Gb/s Ring Modulator & PD (III-V on SOI)
 Si MUX/DEMUX
 50 Gb/s Electr. MOD DR and TIA
 CLIPP-assisted thermal drift compensation system
25.6 Tb/s AWGR-based passive routing platform:
 16x16 O-band AWGR
 1300 nm SM High-Freq. EOPCB
www.ict-streams.eu
Experimental demonstration - 8×40Gb/s multi-socket Tx/Rx/routing
RM: 1 pJ/bit @ 40 Gbps
TIA: 3.95 pJ/bit @ 40 Gbps
Input ER: 5.4 dB
Output ER: ~4 dB
S. Pitris et al., IEEE Photonics Journal, Oct. 2018, DOI: 10.1109/JPHOT.2018.2873673
Experimental demonstration - 4×40Gb/s SiPho TxRx
[Figure: Si chip & mask layout; Tx1-Tx4 eye diagrams at 10 ps/div]
➢ Measured extinction ratios (Tx1-Tx4): 4.4 dB, 4.1 dB, 4.2 dB, 4 dB
➢ λ1 = 1310.25 nm, λ2 = 1303.23 nm, λ3 = 1324.59 nm, λ4 = 1317.28 nm
✓ Channel spacing of 6.75 nm
✓ 4x40G Silicon Ring Modulator
✓ Energy Efficiency: 24.84 fJ/bit/RM
✓ Footprint: 2.7×5.2 mm²
S. Pitris et al., “A 4×40 Gb/s O-band WDM Silicon Photonic Transmitter based on Micro-Ring Modulators,” accepted for presentation at OFC 2019, W3E.2, March 6, 2019
Experimental demonstration - 40Gb/s Data Transmission in a 52km-long Fiber Link
Si RM co-packaged with 1V DR IC (Microring Modulator + Electronic Driver)
✓ Error-free 40 Gb/s transmission with very low power penalty (<1 dB) for up to 52 km
✓ Low-power 28-nm FD-SOI electronic driver
✓ Power consumption of the electronic driver: 40 mW
✓ Total link losses: 17.6 dB
Combining Hipoλaos and STREAMS architectures
Dual-layer locality-aware interconnect performance (50-50 on/off-board traffic):
✓ Small number of buffers for 100% throughput: practical!
✓ P99 latency < 610 nsec!
T. Alexoudi et al., “Optics in Computing: from Photonic Network-on-Chip to Chip-to-Chip Interconnects and Disintegrated Architectures,” JLT, doi: 10.1109/JLT.2018.2875995
Conclusions
✓ Hipoλaos optical switch for Rack-level interconnection in disaggregated DCs:
 Up to 1024 ports
 10.24 Tbps capacity
 Multicast support
 Up to 95% throughput for 4 packet-buffers, with sub-μs latency
✓ STREAMS architecture for board-level interconnection:
 >8 on-board nodes connectivity
 Low latency, low energy with Si-Pho
ACKNOWLEDGMENTS
M. Moralis-Pegios, G. Mourgias-Alexandris, S. Pitris, T. Alexoudi, K. Vyrsokinos, N. Pleros
Contacts:
Nikos Terzenidis @ [email protected]
Nikos Pleros @ [email protected]