
Circuit Switching Under the Radar with REACToR

He Liu, Feng Lu, Alex Forencich, Rishi Kapoor,
Malveeka Tewari, Geoffrey M. Voelker,
George Papen, Alex C. Snoeren, George Porter

How to build 100G datacenter networks?

Datacenters Traffic Is Skewed

10G Fat-Tree

[Figure: Bandwidth Required vs. Destinations; the skewed demand reaches toward 100G, far beyond what a 10G Fat-Tree provides]

100G Fat-Tree

Full bi-sectional bandwidth, but expensive: cables, transceivers, …

[Figure: Bandwidth Required vs. Destinations, with 10G and 100G levels marked; a 100G Fat-Tree covers the skewed demand]

Helios, c-Through: Hotspot Circuits

Uses mirrors; reconfigures in 10 ms. A single 100G optical circuit carries the hotspot, with a 10G electrical Fat-tree for the rest. [SIGCOMM 2010]

OSA: More Circuits

Several optical circuits at 100G. [NSDI 2012]

Mordia: Fast Circuit Switching

Reconfigures in 15 microseconds, enabling time sharing of the 100G circuit. [SIGCOMM 2013]

Limitation: Still Circuit Switching

Good with a few big flows.

[Figure: Bandwidth Required vs. Destinations; the 100G circuit is time-shared between the big flows to A and B, with 50G and 100G levels marked]

Limitation: Inefficient with Small Flows

Low link utilization: reconfiguring takes time.

[Figure: timeline of the circuit cycling through destinations A to F (50G and 100G levels marked), with reconfiguration gaps between slots]

Our Approach: REACToR

100G hybrid of circuit and packet switching.

[Figure: Bandwidth Required vs. Destinations; the big flows (to A and B) ride 100G circuit switching, time-shared, while the small flows (to C, D, E, F) ride 10G packet switching, which transmits continuously]

Start with a Pre-existing 10G Network

[Figure: end hosts connect over 10G links to a ToR box, which connects over 10G links to a 10G electrical packet network]

Connect via REACToR

[Figure: end hosts now connect to a REACToR over 100G links; the REACToR connects to the 10G electrical packet network over 10G links]

Connect Circuits

[Figure: the REACToR also connects over 100G links to a 100G optical circuit network, in addition to its 10G links to the 10G electrical packet network; end hosts attach over 100G links]

Hybrid Network

[Figure: end hosts attach to the REACToR over 100G links; big flows are steered onto the 100G circuit links and small flows onto the 10G packet links]

Challenge: Two Different Networks

Electrical Packet

• Low bandwidth

• Buffers all the way

• Tx at any time


Optical Circuit

• High bandwidth

• Bufferless TDMA

• Tx only when circuit connects

Design Requirements

• Hybrid scheduling: classify traffic into circuits or packets

• Buffer packets at source hosts until circuit is available

• Have sources transmit when the circuit is connected

• Rate control to prevent downlink overload


The Hybrid Scheduling Problem

• Collect traffic demand from all hosts

• TDMA schedule the big flows on the circuit path

• Schedule the rest on the packet path

• An oracle predicts the demand and builds the schedules (a minimal sketch follows below)

[Figure: the demand matrix (Bandwidth Required per Destination) is split into a centralized TDMA circuit schedule and a free-running packet network]
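To make the scheduling split concrete, here is a minimal sketch under illustrative assumptions (the function name, the slot count, and the 90% threshold are mine; the real problem is a full matrix decomposition, as the backup Scheduling slide notes): given predicted per-flow demand, the largest flows receive TDMA circuit slots and everything else is left to the packet path.

# Minimal sketch of the hybrid scheduling split (illustrative, not the
# REACToR scheduler): big flows get circuit slots, the rest go to packets.

def hybrid_schedule(demand, num_slots=7, big_flow_fraction=0.9):
    """demand: dict mapping (src, dst) -> predicted volume for the next epoch.
    Returns (circuit_slots, packet_flows)."""
    total = sum(demand.values())
    flows = sorted(demand.items(), key=lambda kv: kv[1], reverse=True)

    circuit_slots = []          # one (src, dst) pair per TDMA slot
    covered = 0.0
    for (src, dst), volume in flows:
        if len(circuit_slots) == num_slots or covered >= big_flow_fraction * total:
            break
        circuit_slots.append((src, dst))
        covered += volume

    # Everything without a circuit slot rides the free-running packet path.
    packet_flows = [pair for pair, _ in flows if pair not in circuit_slots]
    return circuit_slots, packet_flows

# Example: one dominant flow takes a circuit slot; the mice use packets.
demand = {("A", "B"): 90e9, ("A", "C"): 1e9, ("A", "D"): 0.5e9}
slots, packet = hybrid_schedule(demand, num_slots=2)
print(slots)   # [('A', 'B')]
print(packet)  # [('A', 'C'), ('A', 'D')]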

End Host: Classify and Buffer Packets

• Classify packets and map into different hardware queues

– Based on the schedule

• Packet path: one hardware queue for all destinations

– Can transmit at any time, but at 10G

• Circuit path: one hardware queue for each destination

– Can only transmit when the particular circuit is connected

• Buffer the packets in end-host memory (see the classification sketch below)

[Figure: host A's NIC with a single packet-path queue and per-destination circuit queues for hosts B to F]
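A minimal sketch of this classification step, under assumed queue identifiers (the queue numbering and the function are illustrative, not the actual NIC programming interface): one hardware queue for the packet path, one per circuit destination, and a fallback to the packet queue for anything the current schedule does not cover.

# Illustrative end-host packet classification (assumed queue IDs, not the
# real NIC interface).

PACKET_QUEUE = 0                          # single queue for the 10G packet path
CIRCUIT_QUEUE = {"B": 1, "C": 2, "D": 3,  # one queue per circuit destination
                 "E": 4, "F": 5}

def classify(dst, scheduled_circuit_dsts):
    """Map a packet to a hardware queue based on the current circuit schedule."""
    if dst in scheduled_circuit_dsts:
        return CIRCUIT_QUEUE[dst]   # buffered until that circuit is connected
    return PACKET_QUEUE             # may transmit at any time, but only at 10G

# Example: the current schedule grants circuit time toward B and C only.
schedule = {"B", "C"}
print(classify("B", schedule))  # 1 -> circuit queue for destination B
print(classify("E", schedule))  # 0 -> packet path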

Packet Transmission

• Packet path: rate limit to 10G
• Circuit path: transmit only when the circuit is connected
• REACToR pulls packets from the circuit queues in real time
• Use PFC frames to selectively unpause queues (sketched below)

[Figure: REACToR and an end host NIC; while the circuit connects to host A, REACToR sends a PFC "Unpause A" so the NIC drains A's circuit queue, and when the circuit switches to host B it sends "Pause A" and "Unpause B"; the packet path keeps running throughout]
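A minimal sketch of the PFC-gated draining shown in the figure above. Mapping each circuit destination to its own PFC class, and the class and queue bookkeeping here, are illustrative assumptions; frame formats and timing are simplified.

# Illustrative PFC-gated draining of per-destination circuit queues.
from collections import deque

class EndHostNic:
    def __init__(self, destinations):
        self.circuit_queues = {d: deque() for d in destinations}
        self.packet_queue = deque()
        self.unpaused = set()            # destinations whose class is unpaused

    def on_pfc(self, dst, pause):
        """React to a PFC frame from the REACToR for one destination's class."""
        if pause:
            self.unpaused.discard(dst)
        else:
            self.unpaused.add(dst)

    def transmit(self):
        """Drain whatever is allowed right now: unpaused circuit queues plus
        the always-on (rate-limited) packet queue."""
        sent = []
        for dst in sorted(self.unpaused):
            while self.circuit_queues[dst]:
                sent.append(("circuit", dst, self.circuit_queues[dst].popleft()))
        if self.packet_queue:
            sent.append(("packet", None, self.packet_queue.popleft()))
        return sent

# Example: packets for B stay buffered until REACToR unpauses B's class.
nic = EndHostNic("BCDEF")
nic.circuit_queues["B"].extend(["pkt1", "pkt2"])
print(nic.transmit())         # [] -- B is still paused
nic.on_pfc("B", pause=False)  # circuit to host B is now connected
print(nic.transmit())         # drains pkt1 and pkt2 over the circuit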

Rate Control

• Problem: the downlink must merge 100G (circuit) + 10G (packet) onto a single 100G link to the end host
• Our approach: rate limit the circuit path at the source to avoid overloading the downlink (worked through below)

[Figure: without rate control, 100G of circuit traffic plus 10G of packet traffic contend for the 100G host downlink; capping the circuit path at 90G lets circuit (90G) + packet (10G) fit the 100G downlink]
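A small worked sketch of the rate-control arithmetic above, with the function names and frame size as illustrative assumptions: the circuit path is capped at the source so circuit plus packet traffic never exceeds the host's downlink.

# Worked sketch of the downlink rate-control arithmetic (illustrative).

def circuit_rate_cap(downlink_gbps, packet_path_gbps):
    """Cap the circuit path at the source so circuit + packet traffic
    never exceeds the end host's downlink capacity."""
    return downlink_gbps - packet_path_gbps

def pacing_gap_ns(rate_gbps, frame_bytes=1500):
    """Inter-frame gap needed to pace fixed-size frames at the capped rate
    (1 Gb/s is one bit per nanosecond)."""
    return frame_bytes * 8 / rate_gbps

cap = circuit_rate_cap(100, 10)      # 100G downlink minus the 10G packet path
print(cap)                           # 90
print(round(pacing_gap_ns(cap), 1))  # 133.3 ns between 1500-byte frames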

Implementation


10G/1G Prototype

[Photo: the REACToR prototype]

[Figure: testbed with two REACToR switches, each built on a Virtex 6 FPGA and each connected to four hosts (Linux servers with Intel 82599 10G NICs); a Mordia OCS serves as the circuit switch and a Fulcrum Monaco switch as the EPS; a control computer collects demand and distributes the schedule and circuit control; 10G and 1G links]

Timing Parameters

• End-to-end reconfiguration time: 30 μs
• Schedule reconfigures every 1,500 μs
• Example: 7 flows TDMA, 86% duty cycle (worked through below)

[Figure: timeline of 7 TDMA slots within one 1,500 μs schedule, each slot preceded by a 30 μs reconfiguration gap]
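One way to arrive at the quoted 86% duty cycle, assuming the 1,500 μs schedule contains seven slots and each slot pays the 30 μs reconfiguration once:

# Duty-cycle check for the timing parameters above (illustrative arithmetic).
reconfig_us = 30      # end-to-end reconfiguration time per slot
schedule_us = 1500    # length of one TDMA schedule
slots = 7             # one slot per flow in the 7-flow TDMA example

duty_cycle = (schedule_us - slots * reconfig_us) / schedule_us
print(f"{duty_cycle:.0%}")  # 86%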

Evaluation

• Experiment 1: Supporting TCP
  – Performance with a stock network stack
• Experiment 2: Reacting to demand changes
  – Dynamics when handling changes and mispredictions
• Experiment 3: Demonstrating the benefit of the hybrid
  – Performance gain on skewed demand

Experiment 1: Supporting TCP

• Each host receives 7 TCP flows, one from each of the other hosts
• Hybrid schedule: data packets via the OCS, ACKs via the EPS
• 7 flows TDMA, fairly sharing the link
• Check whether TCP sustains high throughput

TCP Throughput

[Figure: per-flow throughput for the 7 TCP flows]

TDMA invisible to TCP

Experiment 2: React to Demand Changes

[Figure: traffic shifts from intra-rack (within Rack 1 and within Rack 2) to inter-rack (between Rack 1 and Rack 2)]

Use pktgen to impose a precise, sudden traffic pattern change, and see whether REACToR can react in time.

React to Demand Changes

[Figure: per-host transmissions in Rack 1 and Rack 2 across a change from a 3-host round robin to a 4-host round robin; traffic sent via the packet path is marked]

REACToR reacts quickly and robustly to demand changes.

Experiment 3: Demonstrating Hybrid

• Simulated 64 hosts with demand of varying skewness
• Big benefit from a small electrical packet switch

[Figure: two 100G demand profiles; "Skewed": one 50G flow plus 20 flows sharing 50G; "Very Skewed": one 99G flow plus 20 flows sharing 1G]

Optical Circuit Switching Not Enough

[Figure: link utilization vs. the fraction of bandwidth the big flow uses (99% down to 50%). A 100G electrical Fat-tree stays at 100%, while optical circuit switching reaches only about 85% at high skew and roughly 70-75% at lower skew; the gap is the cost of switching plus bandwidth lost to inefficient circuit usage, and datacenters typically operate in this lower-skew region]

Hybrid Switching with REACToR

[Figure: the same plot with the REACToR hybrid added: about 90% link utilization, degrading only gradually as the big flow's share of bandwidth falls, compared with optical circuit switching alone]

Performance akin to a 100G Fat-tree.

Conclusion

REACToR: 100G optical circuit switching + 10G electrical packet switching = the performance of 100G electrical packet switching, for datacenter workloads, at a lower cost.

Thank you!


DCTCP: Datacenter Workload [SIGCOMM 2010]

Cost of Transceivers

• Cost of 10G transceivers
  – Cost: $500 per pair
  – Power: 1 Watt per pair
  – (100G costs even more)
• 3-level Fat-tree: 27.6k hosts
• Transceivers per host: (an illustrative estimate follows below)
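The per-host transceiver count is left unstated above. As an illustration only (not necessarily the figure the slide intended), here is the count of switch-to-switch links in a standard k-ary 3-level Fat-tree, assuming k = 48 (48³/4 = 27,648 ≈ 27.6k hosts) and that only switch-to-switch links use optical transceivers:

# Illustrative Fat-tree transceiver arithmetic (assumptions: standard k-ary
# 3-level Fat-tree, optics only on switch-to-switch links).

k = 48
hosts = k ** 3 // 4                        # 27,648 hosts (~27.6k)
edge_agg_links = k * (k // 2) * (k // 2)   # edge <-> aggregation links
agg_core_links = k * (k // 2) * (k // 2)   # aggregation <-> core links
switch_links = edge_agg_links + agg_core_links

pairs_per_host = switch_links / hosts      # one transceiver pair per link
print(hosts, switch_links, pairs_per_host) # 27648 55296 2.0

# At $500 and 1 W per pair, that is $1,000 and 2 W of optics per host,
# before counting any optics on the host-facing links.
print(pairs_per_host * 500, pairs_per_host * 1)  # 1000.0 2.0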

Scheduling

• Problem: matrix decomposition
  – Similar to BvN (Birkhoff-von Neumann), but must account for the reconfiguration penalty
  – An NP-complete problem
  – Goal: schedule all the big flows (90% of the demand)
• Greedy approach (e.g. iSLIP): suboptimal
• Naïve BvN: fragmented by small elements and residuals
• A good algorithm should (see the sketch below):
  – Prioritize the big flows
  – Perform a full matrix decomposition (like BvN)
  – Minimize the number of reconfigurations at the same time
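A minimal sketch of the kind of decomposition described above, as an illustrative greedy heuristic rather than REACToR's algorithm: repeatedly peel off a matching that favors the largest remaining demand entries, and bound the number of configurations so reconfiguration overhead stays small.

# Greedy BvN-style decomposition sketch (illustrative heuristic, not the
# REACToR scheduler).

def greedy_decompose(demand, max_configs):
    """demand: square list-of-lists (src x dst). Returns a list of
    (matching, duration) pairs, where matching maps src -> dst."""
    n = len(demand)
    remaining = [row[:] for row in demand]
    schedule = []
    for _ in range(max_configs):
        # Build one matching by repeatedly taking the largest remaining entry.
        entries = sorted(((remaining[s][d], s, d)
                          for s in range(n) for d in range(n)
                          if remaining[s][d] > 0), reverse=True)
        used_src, used_dst, matching = set(), set(), {}
        for _, s, d in entries:
            if s not in used_src and d not in used_dst:
                matching[s] = d
                used_src.add(s)
                used_dst.add(d)
        if not matching:
            break
        # Hold this configuration long enough to clear its smallest entry.
        duration = min(remaining[s][matching[s]] for s in matching)
        for s, d in matching.items():
            remaining[s][d] -= duration
        schedule.append((matching, duration))
    return schedule

# Example: one dominant flow (0 -> 1) plus smaller cross traffic; the big
# flow keeps its circuit in every configuration.
demand = [[0, 90, 5],
          [5, 0, 10],
          [10, 5, 0]]
for matching, duration in greedy_decompose(demand, max_configs=3):
    print(matching, duration)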
