
T-110.5116 Computer Networks II Data center networks 4.11.2013 Matti Siekkinen

(Sources: S. Kandula et al.: "The Nature of Data Center Traffic: Measurements & Analysis", A. Greenberg: "Networking The Cloud", M. Alizadeh et al.: "Data Center TCP (DCTCP)", C. Kim: "VL2: A Scalable and Flexible Data Center Network")

Outline

•  What are data center networks?
•  Layer 2 vs. Layer 3 in data center networks
•  Data center network architectures
•  TCP in data center networks
   –  Problems of basic TCP
   –  Data Center TCP (DCTCP)
•  Conclusions


What is a data center?

•  Contains servers and data
•  Has a network
   –  Connects servers together
•  Runs applications and services
   –  Internal and external
•  Centrally managed
•  Operated in a controlled environment
•  Can have very different sizes
   –  SME data center vs. Google


Applications and services

•  External facing
   –  Search, Mail, Shopping Carts, …
•  Internal to the company/institution
   –  E.g. ERP (Financial, HR, …)
•  Services internal to the data center
   –  Those necessary for the data center to work
      •  E.g. network operations (DNS, DHCP), backup
   –  Building blocks for external-facing apps
      •  MapReduce, Colossus, GFS, Spanner, BigTable, Dynamo, Hadoop, Dryad…
•  Often distributed


Multi-tier architecture

•  E.g. 3 tiers
   –  Front-end servers: handle static requests
   –  Application servers: handle dynamic content
   –  Backend database servers: handle database transactions
•  Advantages
   –  Performance & scalability
   –  Security

What does it look like?

•  Servers in racks
   –  Contain commodity servers (blades)
   –  Connected to a Top-of-Rack switch
   –  Traffic aggregated to the next level
•  Modular data centers
   –  Shipping containers full of racks

[Photo: inside a container, from Microsoft's Chicago data center]


A large data center requires a lot of power and cooling

•  Some statistics
   –  Microsoft runs about 1 million servers
   –  Google probably well over a million
      •  450,000 servers in 2006

[Photos of power and cooling infrastructure from Microsoft's Chicago data center]

Cloud computing

•  Cloud computing
   –  Abstracts the underlying resources from the service provided
   –  Abstraction at different levels: IaaS, PaaS, SaaS
•  Virtualization enables many of the cloud's properties
   –  Elastic resource allocation
      •  Of course limited by the number of physical servers
      •  One user's resources are limited by the SLA, not by the amount of hardware
   –  Efficient use of resources
      •  No need to run all servers at full steam all the time
      •  A client's VMs can run on any physical server


Data center vs. Cloud

•  A data center is physical
   –  Physical infrastructure that runs services
•  A cloud is not physical
   –  Offers some service(s)
   –  The physical infrastructure is virtualized away
•  Large clouds usually need to be hosted in data centers
   –  Depends on scale
•  A data center does not need to host a cloud


Different kinds of data centers

•  Traditional enterprise DC: IT staff cost dominates
   –  Human-to-server ratio: 1:100
   –  Less automation in management
   –  A few high-priced servers
   –  Cost borne by the enterprise
      •  Utilization is not critical
•  Cloud service DC: other costs dominate
   –  Human-to-server ratio: 1:1000
   –  Automation is more crucial
   –  Distributed workload, spread out over lots of commodity servers
   –  High upfront cost amortized over time and use
   –  Pay-per-use for customers
      •  Utilization is critical


What is a data center network (DCN)?


•  Enables DC communication
   –  Internally, from server to server
   –  From/to outside of the DC
•  In practice
   –  HW: switches, routers, and cabling
   –  SW: communication protocols (layers 2-4)

What is a data center network (DCN)?

•  Both layer 2 (link) and layer 3 (network) are present
   –  Not only L3 routers but also L2 switches
   –  Layer 2 subnets connected at layer 3
•  Layer 4 (transport) is needed, just as in any packet network
•  Note: it does not have to be TCP/IP!
   –  Not part of the routed Internet
      •  A DC server's address cannot be resolved directly from the Internet, only the front-end servers'
   –  But often it is TCP/IP…

[Figure: the Internet protocol hourglass - email/WWW/phone over SMTP/HTTP/SIP, TCP/UDP, IP, Ethernet/PPP/WiFi/3GPP, copper/fiber/radio (OFDM, FHSS)]

copper fiber radio OFDM FHSS..."

What makes DCNs special?

•  Just plug all servers into a single switch and that's it?
   –  Several issues with this approach
•  Scaling up capacity
   –  Lots of servers need lots of switch ports
   –  E.g., a state-of-the-art Cisco Nexus 7000 modular data center switch (L2 and L3) supports at most 768 1/10GE ports
•  Switch capacity and price
   –  Price goes up with the number of ports
   –  E.g., the list price for 768 ports with 10GE modules is somewhere beyond $1M
   –  Buying lots of commodity switches is an attractive option
•  Potentially the majority of traffic stays within the DC
   –  Server to server


What makes DCNs special? (cont.)

•  Requirements differ from Internet applications
   –  Large amounts of bandwidth
   –  Very, very short delays
   –  Still, Internet protocols (TCP/IP) are often used
•  Management requirements
   –  Incremental expansion
   –  Should withstand server failures, link outages, and rack failures
      •  Under failures, performance should degrade gracefully
•  Requirements due to expenses
   –  Cost-effectiveness: high throughput per dollar
   –  Power efficiency

⇒ DCN topology and equipment matter a lot


Data Center Costs

Amortized cost*   Component              Sub-components
~45%              Servers                CPU, memory, disk
~25%              Power infrastructure   UPS, cooling, power distribution
~15%              Power draw             Electrical utility costs
~15%              Network                Switches, links, transit

•  Total cost varies
   –  Upwards of $0.25 billion for a mega data center
•  Server costs dominate
   –  Network costs are also significant

⇒ Network should allow high utilization of servers

Source: Greenberg et al. The Cost of a Cloud: Research Problems in Data Center Networks. Sigcomm CCR 2009.

*3 yr amortization for servers, 15 yr for infrastructure; 5% cost of money


Outline

•  What are data center networks?
•  Layer 2 vs. Layer 3 in data center networks
•  Data center network architectures
•  TCP in data center networks
   –  Problems of basic TCP
   –  Data Center TCP (DCTCP)
•  Conclusions


Switch vs. router: What’s the difference?

•  A switch is a layer 2 device
   –  Does not understand the IP protocol
   –  Does not run any routing protocol
•  A router is a layer 3 device
   –  "Speaks" the IP protocol
   –  Runs routing protocols to determine shortest paths
      •  OSPF, RIP, etc.
•  Terminology is often not so clear-cut
   –  E.g., multi-layer and layer 3 switches also exist


Switch vs. router: Difference in basic functioning

•  Router
   –  Forwards packets based on destination IP address
      •  Prefix lookup against routing tables (a small lookup sketch follows below)
   –  Routing tables built and maintained by routing algorithms and protocols
      •  Protocols exchange information about paths to known destinations
      •  Algorithms compute shortest paths based on this information
   –  Broadcast sending usually not allowed
•  Switch
   –  Forwards frames (packets) based on destination MAC address
   –  Uses a switch table
      •  Equivalent to a routing table in a router
   –  Broadcast sending is common
   –  How is the switch table built and maintained, since there is no routing protocol?
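To make the contrast concrete, here is a small Python sketch of the two lookups: longest-prefix match against a routing table vs. exact match against a switch table. The tables and addresses are made-up examples, not any real device's configuration.

    # Router: longest-prefix match; switch: exact MAC match (or flood).
    import ipaddress

    routing_table = {                  # prefix -> next hop (illustrative)
        "10.0.0.0/8":  "core-router",
        "10.1.0.0/16": "agg-switch-1",
        "10.1.2.0/24": "tor-switch-7",
    }
    switch_table = {                   # MAC -> output port (exact match only)
        "aa:bb:cc:00:00:01": 3,
        "aa:bb:cc:00:00:02": 7,
    }

    def route_lookup(dst_ip):
        """Longest-prefix match: the most specific matching prefix wins."""
        dst = ipaddress.ip_address(dst_ip)
        matches = [p for p in routing_table if dst in ipaddress.ip_network(p)]
        return routing_table[max(matches, key=lambda p: ipaddress.ip_network(p).prefixlen)]

    print(route_lookup("10.1.2.3"))                        # tor-switch-7 (/24 beats /16 and /8)
    print(switch_table.get("aa:bb:cc:00:00:02", "flood"))  # 7
    print(switch_table.get("aa:bb:cc:00:00:99", "flood"))  # unknown MAC -> flood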


Switch is self-learning

•  When a frame is received on a port (sketched below)
   –  The switch learns that the sender is behind that port
   –  The switch adds that information to its switch table
   –  Soft state: forgotten after a while
•  If the destination is not (yet) known
   –  Flood to all other ports
•  Flooding can lead to forwarding loops
   –  Switches connected in a cyclic manner
   –  These loops can create broadcast storms
•  Spanning Tree Protocol (STP) is used to avoid loops
   –  Generates a loop-free topology
   –  Avoids using some ports when flooding
   –  Rapid Spanning Tree Protocol (RSTP): faster convergence after a topology change
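A minimal Python sketch of the learning and flooding behavior described above, assuming a simple dict-based frame format; it is illustrative only and omits STP, so a cyclic topology would still flood forever.

    import time

    class LearningSwitch:
        """Minimal sketch of a self-learning L2 switch (illustrative, not a real implementation)."""

        def __init__(self, ports, aging_time=300.0):
            self.ports = ports             # e.g. [1, 2, 3, 4]
            self.aging_time = aging_time   # soft state: entries expire
            self.table = {}                # MAC address -> (port, last_seen)

        def receive(self, frame, in_port):
            src, dst = frame["src"], frame["dst"]
            # Learn: the sender is reachable behind the ingress port.
            self.table[src] = (in_port, time.time())
            # Age out stale entries (soft state).
            now = time.time()
            self.table = {mac: (p, t) for mac, (p, t) in self.table.items()
                          if now - t < self.aging_time}
            # Forward: known destination -> single port, unknown destination -> flood.
            entry = self.table.get(dst)
            if entry and entry[0] != in_port:
                return [entry[0]]
            return [p for p in self.ports if p != in_port]   # flood

    # Usage: a frame from AA to DD is flooded until DD has been heard from.
    sw = LearningSwitch(ports=[1, 2, 3, 4])
    print(sw.receive({"src": "AA", "dst": "DD"}, in_port=1))   # flood -> [2, 3, 4]
    print(sw.receive({"src": "DD", "dst": "AA"}, in_port=3))   # AA already learned -> [1]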


[Figure: a frame <Src=AA, Dest=DD> is flooded back and forth between two switches connected through two hubs, and so on forever - there is no TTL in L2 headers!]

Layer 2 vs. Layer 3 in DCN

•  Management
   –  L2 is close to plug-and-play
   –  L3 usually requires some manual configuration (subnet masks, DHCP)
•  Scalability and performance
   –  L2 broadcasting and STP scale poorly
   –  L2 forwarding is less scalable than L3 forwarding
      •  L2 is based on flat MAC addresses
      •  L3 is based on hierarchical IP addresses (prefix lookup)
   –  L2 has no load balancing over multiple paths like L3 does
   –  L2 loops may still happen in practice, even with STP


Layer 2 vs. Layer 3 in DCN

•  Flexibility
   –  VM migration may require a change of IP address in an L3 network
      •  Need to conform to the subnet's address
   –  An L2 network allows any IP address for any server
•  Some reasons may prevent a pure L3 design
   –  Some servers may need L2 adjacency, e.g.:
      •  Servers performing the same functions (load balancing, redundancy)
      •  Non-IP traffic
   –  Dual-homed servers may need to be in the same L2 domain
      •  Connected to two different access switches
      •  Some configurations require both primary and secondary to be in the same L2 domain


VLAN

•  VLAN = Virtual Local Area Network
•  Some servers may need to belong to the same L2 domain
   –  See previous slide…
•  VLANs overcome limitations of the physical topology
   –  E.g., running out of switch ports
   –  A VLAN allows flexible growth while maintaining layer 2 adjacency
   –  An L2 domain can extend across an L3 device
•  A VLAN can be port-based or MAC-address-based


Port-based VLAN

•  Traffic isolation (a small sketch follows after the figure)
   –  Frames to/from ports 1-8 can only reach ports 1-8
   –  A VLAN can also be defined based on the MAC addresses of the endpoints, rather than on switch ports
•  Dynamic membership
   –  Ports can be dynamically assigned among VLANs
•  Forwarding between VLANs happens at L3

[Figure: one switch with VLAN1 (ports 1-8) and VLAN2 (ports 9-15); traffic between the two VLANs goes through routing]
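A minimal sketch of the port-based isolation rule described above, using the port-to-VLAN assignment from the figure; the helper function is illustrative, not a switch API.

    # Port-based VLAN isolation on a single switch: a frame received on one port
    # may only be flooded to ports in the same VLAN.
    vlan_of_port = {**{p: "VLAN1" for p in range(1, 9)},    # ports 1-8
                    **{p: "VLAN2" for p in range(9, 16)}}   # ports 9-15

    def flood_ports(in_port):
        """Ports an unknown-destination frame may be flooded to (same VLAN, excluding ingress)."""
        vlan = vlan_of_port[in_port]
        return [p for p, v in vlan_of_port.items() if v == vlan and p != in_port]

    print(flood_ports(3))    # only ports 1-8 (minus 3); VLAN2 never sees the frame
    print(flood_ports(12))   # only ports 9-15 (minus 12); reaching VLAN1 requires L3 routing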

VLANs spanning multiple switches

•  VLANs can span multiple switches
•  Also across different routed subnets
   –  Routers in between

[Figure: a VLAN spanning two switches - on the first switch VLAN1 covers ports 1-8 and VLAN2 ports 9-15; on the second switch ports 2, 3, and 5 belong to VLAN1 and ports 4, 6, 7, and 8 belong to VLAN2]


Outline

•  What are data center networks?
•  Layer 2 vs. Layer 3 in data center networks
•  Data center network architectures
•  TCP in data center networks
   –  Problems of basic TCP
   –  Data Center TCP (DCTCP)
•  Conclusions


Design Alternatives for DCN

Two high-level choices for interconnects:

•  Specialized hardware and protocols
   –  E.g., InfiniBand seems common
   –  Pros:
      •  Can provide high bandwidth & extremely low latency
      •  Custom hardware takes care of some reliability tasks
      •  Relatively low-power physical layer
   –  Cons:
      •  Expensive
      •  Not natively compatible with TCP/IP applications
•  Commodity (1/10 Gb) Ethernet switches and routers
   –  Compatible
   –  Cheaper
   –  Let's look at this a bit more…


Conventional DCN architecture

•  Topology: two- or three-level trees of switches or routers
   –  Multipath routing
   –  High bandwidth by appropriate interconnection of many commodity switches
   –  Redundancy

[Figure: Internet → layer-3 router → layer-2/3 aggregation switches → layer-2 Top-of-Rack access switches → servers]


Issues with conventional architecture

•  Bandwidth oversubscription
   –  Total bandwidth at the core/aggregation level is less than the summed-up bandwidth at the access level (a small worked example follows below)
   –  Limited server-to-server capacity
   –  Application designers need to be aware of the limitations
•  No performance isolation
   –  VLANs typically provide reachability isolation only
   –  One server (service) sending/receiving too much traffic hurts all servers sharing its subtree
•  There are more…
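As a small worked example of oversubscription (with hypothetical port counts and speeds, not numbers from the lecture):

    def oversubscription(n_down, down_gbps, n_up, up_gbps):
        """Ratio of access-side capacity to uplink capacity (>1 means oversubscribed)."""
        return (n_down * down_gbps) / (n_up * up_gbps)

    # 40 servers at 1 Gb/s behind a ToR switch with 2 x 10 Gb/s uplinks:
    ratio = oversubscription(n_down=40, down_gbps=1, n_up=2, up_gbps=10)
    print(f"oversubscription {ratio:.0f}:1")   # 2:1 - each server gets at most ~500 Mb/s
                                               # when all send out of the rack at once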


One solution to oversubscription

•  Fat-tree topology with a special lookup scheme (a sizing sketch follows after the figure)
   –  Add more commodity switches
      •  Carefully designed topology
      •  All ports have the same capacity as the servers
   –  Enables
      •  Full bisection bandwidth
      •  Lower cost, because all switch ports have the same capacity
   –  Drawbacks
      •  Needs customized switches: a special two-level lookup scheme to distribute traffic
      •  A lot of cabling


M. Al-Fares et al. Commodity Data Center Network Architecture. In SIGCOMM 2008.

[Figure: fat-tree topology with core switches, aggregation switches, and edge switches]
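A quick sizing sketch for the k-ary fat tree of Al-Fares et al., built entirely from identical k-port commodity switches; the helper below just evaluates the closed-form counts from that construction (k pods, k/2 edge and k/2 aggregation switches per pod, (k/2)^2 core switches, k^3/4 hosts).

    def fat_tree_size(k):
        """Return (hosts, edge, aggregation, core) counts for a fat tree of k-port switches."""
        assert k % 2 == 0
        pods = k
        edge = pods * (k // 2)                # k/2 edge switches per pod
        aggregation = pods * (k // 2)         # k/2 aggregation switches per pod
        core = (k // 2) ** 2
        hosts = pods * (k // 2) * (k // 2)    # = k^3 / 4; each edge switch serves k/2 hosts
        return hosts, edge, aggregation, core

    # With commodity 48-port switches:
    hosts, edge, agg, core = fat_tree_size(48)
    print(f"{hosts} hosts, {edge + agg + core} switches")   # 27648 hosts, 2880 switches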

One solution to performance isolation: VLB

•  Random flow spreading with Valiant Load Balancing (VLB)
   –  Similar fat-tree (Clos) topology built from commodity switches
   –  Every flow is "bounced" off a random intermediate switch (sketched below, after the figure)
   –  Provably hotspot-free for any admissible traffic matrix
   –  No need to modify switches (standard forwarding)
      •  Relies on existing protocols and clever addressing
   –  Requires some changes to servers


[Figure: VL2 Clos topology - D/2 intermediate switches with 10G ports, D aggregation switches with D/2 ports up and D/2 ports down, and Top-of-Rack switches with 20 server ports each; supports (D^2/4) * 20 servers in total]

A. Greenberg et al. VL2: A Scalable and Flexible Data Center Network. In SIGCOMM, 2009.
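A minimal sketch of the VLB idea of bouncing each flow off a (pseudo-)random intermediate switch. The 5-tuple hashing and the switch names are illustrative; VL2 itself realizes this with ECMP and address indirection, which is not modeled here.

    import hashlib

    INTERMEDIATE_SWITCHES = [f"int-{i}" for i in range(8)]   # hypothetical switch names

    def pick_intermediate(flow_5tuple):
        """Deterministically map a flow to one intermediate switch (per-flow, not per-packet,
        so packets of one TCP flow stay on one path and are not reordered)."""
        digest = hashlib.sha256(repr(flow_5tuple).encode()).digest()
        index = int.from_bytes(digest[:4], "big") % len(INTERMEDIATE_SWITCHES)
        return INTERMEDIATE_SWITCHES[index]

    flow = ("10.0.1.5", 40321, "10.0.9.7", 80, "tcp")
    print(pick_intermediate(flow))   # e.g. "int-3"; different flows spread across all switches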

DCN architectures in research

•  Lots of alternative architectures proposed in recent years
•  Goals
   –  Overcome the limitations of today's typical architectures
   –  Use commodity, standard equipment
•  Examples
   –  VL2, Monsoon & CamCube (MSR)
   –  PortLand (UCSD)
   –  DCell & BCube (MSR, Tsinghua, UCLA)
   –  …


Outline

•  What are data center networks?
•  Layer 2 vs. Layer 3 in data center networks
•  Data center network architectures
•  TCP in data center networks
   –  Problems of basic TCP
   –  Data Center TCP (DCTCP)
•  Conclusions


TCP in the Data Center

•  TCP is typically used as the transport inside the DC
•  DCNs are a different environment for TCP compared to normal Internet end-to-end transport
   –  Very short delays
   –  Specific application workloads
•  How well does TCP work in DCNs?
   –  Several problems…


DCN: latency matters

•  Online data-intensive (OLDI) applications
   –  Web search, retail, advertisement, …
•  Latency is critical for OLDI applications
   –  Direct impact on user-perceived QoE
   –  Page load time is a key metric for customer retention → direct impact on revenue
•  How important is it?
   –  Amazon.com: revenue decreased by 1% of sales for every 100 ms of added latency
•  Objectives:
   –  Complete data flows as quickly as possible
   –  Meet flow deadlines


Generating a Web page in DCN

•  Facebook as an example
   –  Lots of components…

[Figure: creating a page - a front-end server reached over the Internet composes the page from data-center-internal services such as News Feed, Search, Ads, and Chat]


Generating a Web page in DCN (cont.)

•  Servers may need to perform hundreds of data retrievals
   –  Many of which must be performed serially
•  Overall page deadline of 200-300 ms → only 2-3 ms per data retrieval
   –  Including communication and computation
•  High percentiles of delay are important
   –  If a single data retrieval is unlikely to miss its deadline (median) but 1 out of 100 retrievals is likely to miss it (99th percentile) → a deadline miss happens on essentially every page
   –  Data-retrieval dependencies can magnify the impact


Partition/Aggregate Application Structure

[Figure: requests from the Internet fan out from a top-level aggregator (deadline = 250 ms) via mid-level aggregators (deadline = 50 ms) to worker nodes (deadline = 10 ms)]

•  The foundation of many large-scale web applications
   –  Web search, social network content composition, ad selection, etc.
•  Deadlines at lower levels of the hierarchy must be met for the all-up deadline to hold
•  Iterative requests are common → workers have tight deadlines (a minimal sketch follows below)
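A minimal sketch of the partition/aggregate pattern with a worker-level sub-deadline, using Python threads; the worker latencies and the 10 ms budget are illustrative.

    import concurrent.futures, random, time

    def worker(shard_id):
        # Each worker searches its shard; response time varies (assumed workload).
        time.sleep(random.uniform(0.001, 0.015))
        return f"results from shard {shard_id}"

    def aggregate(num_workers=32, deadline_s=0.010):
        """Fan the query out, collect whatever returns before the sub-deadline."""
        results = []
        with concurrent.futures.ThreadPoolExecutor(max_workers=num_workers) as pool:
            futures = [pool.submit(worker, i) for i in range(num_workers)]
            try:
                for f in concurrent.futures.as_completed(futures, timeout=deadline_s):
                    results.append(f.result())
            except concurrent.futures.TimeoutError:
                pass   # late workers miss the deadline; answer quality degrades
        return results

    print(f"{len(aggregate())} of 32 workers made the 10 ms deadline")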

Workloads

•  Query-response traffic (requires minimal delay)
   –  Partition/Aggregate
   –  Small flows ("mice")
•  Background traffic
   –  Short messages [50 KB-1 MB]
      •  Coordination, control state
      •  Small flows
   –  Large flows [1 MB-50 MB] (require high throughput)
      •  Updating data on each server
      •  Large flows ("elephants")
•  Challenge:
   –  All this traffic goes through the same switches
   –  The requirements are conflicting

DCN characteristics

•  Network characteristics
   –  Large aggregate bandwidths
   –  Very short round-trip-time delays (<1 ms)
•  Typical switches
   –  Large numbers of commodity switches are used
   –  A commodity switch typically has shared memory
      •  A common memory pool for all ports
   –  Why not separate memory per port?
      •  A cost issue for commodity switches


Resulting problems with TCP in DCN

•  Incast
   –  Incoming reply traffic gets synchronized
   –  Buffer overflow at a switch
•  Queue buildup
   –  Large flows eat up buffer space at switches
   –  Small flows suffer
•  Buffer pressure
   –  A special case of queue buildup with switches that have shared buffers


Problems: Incast

[Figure: four workers send synchronized responses to one aggregator through a single switch port]

•  Synchronized mice collide
   –  Caused by the Partition/Aggregate pattern

Incast (cont.)

•  What happens next?
   –  TCP timeout
   –  The default minimum timeout value is 200-400 ms, depending on the OS
•  Why is that a major problem?
   –  Several orders of magnitude longer than the RTT → huge penalty
   –  Deadlines are missed
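A back-of-the-envelope sketch of why incast causes timeouts; the buffer size, response size, and worker counts are hypothetical, and the 200 ms RTOmin matches the default minimum timeout mentioned above.

    def incast(workers, response_bytes, port_buffer_bytes, rto_min_ms=200, rtt_ms=0.25):
        offered = workers * response_bytes            # arrives within roughly one RTT
        if offered <= port_buffer_bytes:
            return f"fits in the buffer, completes in ~{rtt_ms} ms"
        # Tail packets are dropped; senders that lose whole responses wait for RTOmin.
        return (f"{offered - port_buffer_bytes} bytes overflow -> drops -> "
                f"retransmit after ~{rto_min_ms} ms (~{int(rto_min_ms / rtt_ms)}x the RTT)")

    print(incast(workers=4,  response_bytes=20_000, port_buffer_bytes=128_000))
    print(incast(workers=40, response_bytes=20_000, port_buffer_bytes=128_000))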

Problems: Queue Buildup

•  Remember the different workloads
   –  Small "mice" flows
   –  Large "elephant" flows
•  Large flows can eat up the shared buffer space
   –  Same outgoing port
•  The result is similar to incast


Problems: Queue Buildup (cont.)

[Figure: two senders share one outgoing switch port toward a receiver]

•  Big flows build up queues
   –  Increased latency for short flows
   –  Packet loss


Problems: Buffer pressure

•  A kind of generalization of the previous problem
•  Increased queuing delay and packet loss due to long flows traversing other ports
   –  Shared memory pool
   –  Packets arriving at and departing from different ports still compete for the common buffer space


Outline

•  What are data center networks?
•  Layer 2 vs. Layer 3 in data center networks
•  Data center network architectures
•  TCP in data center networks
   –  Problems of basic TCP
   –  Data Center TCP (DCTCP)
•  Conclusions


Data Center Transport Requirements


1.  High burst tolerance
    –  Cope with the incast problem
2.  Low latency
    –  For short flows, queries
3.  High throughput
    –  For continuous data updates, large file transfers

We want to achieve all three at the same time

Exploring the solution space

•  Deep switch buffers
   +  Throughput: can achieve high throughput
   +  Burst tolerance (incast): tolerates large bursts
   −  Latency: queuing delays increase latency
•  Shallow buffers
   −  Throughput: can hurt the throughput of elephant flows
   −  Burst tolerance: cannot tolerate bursts well
   +  Latency: avoids long queuing delays
•  Jittering
   ~  Throughput: no major impact
   +  Burst tolerance: prevents incast
   −  Latency: increases median latency
•  Shorter RTOmin
   +  Throughput: improves throughput
   +  Burst tolerance: helps recover faster
   −  Latency: doesn't help queue buildup
•  Network-assisted congestion control (ECN style)
   +  Throughput: high throughput with high utilization
   +  Burst tolerance: helps in most cases, but a problem if even 1 packet per sender is too much
   +  Latency: reacts early to queue buildup


Jittering

•  Add a random delay before responding (a minimal sketch follows after the figure)
   –  Desynchronizes the responding sources to avoid buffer overflow
•  Jittering trades the median off against the high percentiles

[Figure: MLA query completion time (ms) with jittering off vs. jittering on; requests are jittered over a 10 ms window]
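A minimal sketch of the jittering idea; the 10 ms window comes from the slide, while the send callback is a placeholder.

    import random, time

    JITTER_WINDOW_S = 0.010   # 10 ms

    def send_response(payload, send=print):
        time.sleep(random.uniform(0, JITTER_WINDOW_S))   # desynchronize the senders
        send(payload)                                    # median latency rises by ~5 ms,
                                                         # but the synchronized burst is avoided
    send_response("worker 7: 20 KB of results")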



Review: TCP with ECN


[Figure: two senders share a switch queue toward a receiver; the switch sets an ECN mark (1 bit) on packets when congested, and the receiver echoes it back to the sender]

•  ECN = Explicit Congestion Notification
•  Q: How do TCP senders react to an ECN echo?
   A: Cut the sending rate (congestion window) in half

DCTCP: Two key ideas

1.  React in proportion to the extent of congestion, not just its presence
    ✓  Reduces the variance in sending rates, lowering queuing requirements
2.  Mark based on instantaneous queue length
    ✓  Fast feedback to better deal with bursts

ECN marks (last 10 packets)    TCP                  DCTCP
1 0 1 1 1 1 0 1 1 1            Cut window by 50%    Cut window by 40% (8/10 marked → α ≈ 0.8)
0 0 0 0 0 0 0 0 0 1            Cut window by 50%    Cut window by 5%  (1/10 marked → α ≈ 0.1)

Q: Why doesn't normal TCP with ECN behave like DCTCP?
A: Fairness with conventional TCP…


DCTCP Algorithm

•  Switch side:
   –  Mark packets when the instantaneous queue length > K
•  Sender side:
   –  Maintain a moving average α of the fraction of packets marked:
      α ← (1 − g)·α + g·F, where F is the fraction of packets marked in the last window of data and g is a fixed weight
   –  In each RTT, decrease the window adaptively:
      cwnd ← cwnd · (1 − α/2)
      •  Note: the window is divided by a factor between 1 (α ≈ 0) and 2 (α = 1)


[Figure: switch queue with marking threshold K - packets arriving when the queue is longer than K are marked, others are not]
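A minimal Python sketch of the sender-side update above; the parameter values (g = 1/16, the initial window) are illustrative, and a real implementation lives in the OS TCP stack.

    class DctcpSender:
        def __init__(self, cwnd=10.0, g=1/16):
            self.cwnd = cwnd     # congestion window in packets
            self.alpha = 0.0     # moving average of the marked fraction
            self.g = g           # weight given to new samples

        def on_window_acked(self, acked_pkts, marked_pkts):
            """Called once per RTT with how many of the ACKed packets carried ECN echoes."""
            f = marked_pkts / acked_pkts if acked_pkts else 0.0
            self.alpha = (1 - self.g) * self.alpha + self.g * f
            if marked_pkts:
                # Proportional decrease: between almost no cut (alpha ~ 0) and a 50% cut (alpha = 1).
                self.cwnd = max(1.0, self.cwnd * (1 - self.alpha / 2))
            else:
                self.cwnd += 1.0   # normal additive increase when nothing was marked
            return self.cwnd

    s = DctcpSender()
    print(s.on_window_acked(acked_pkts=10, marked_pkts=1))   # mild congestion -> small cut
    print(s.on_window_acked(acked_pkts=10, marked_pkts=8))   # heavier congestion -> larger cut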

Why does DCTCP work?

•  High burst tolerance
   –  Aggressive marking → sources react before packets are dropped
   –  Large buffer headroom → bursts fit
•  Low latency
   –  Small buffer occupancies → low queuing delay
•  High throughput
   –  ECN averaging → smooth rate adjustments, low variance
   –  Leads to high utilization


DCTCP in Action

[Figure: switch queue length (KBytes) over time with DCTCP]


Does it completely solve the Incast problem?

•  Remember incast: a large number of synchronized small flows hit the same queue
•  It depends on the number of small flows
   –  If the number of synchronized flows is so high that just 1 packet from each is enough to overflow the buffer → DCTCP does not help
      •  No chance to give feedback to the senders before the damage is done
      •  No congestion control mechanism can help
      •  The only solution is to somehow schedule the responses (e.g., jittering)
   –  DCTCP helps if each flow has several packets to transmit
      •  Windows build up over multiple RTTs
      •  Bursts in subsequent RTTs would lead to packet drops
      •  DCTCP sources receive enough ECN feedback to prevent buffer overflows


Comparing TCP and DCTCP

•  Emulate traffic within one rack of a Bing cluster
   –  45 1G servers, one 10G server for external traffic
•  Generate query and background traffic
   –  Flow sizes and arrival times follow distributions seen in Bing
•  Metric:
   –  Flow completion time for queries and background flows
•  RTOmin = 10 ms for both TCP and DCTCP
   –  A more-than-fair comparison


Comparing TCP and DCTCP (cont.)

[Figures: flow completion times for background flows and query flows under TCP vs. DCTCP]

•  Short flows finish quicker
•  Throughput remains just as high for long flows
•  Better burst tolerance for query flows

DCTCP summary

•  DCTCP
   –  Handles bursts well
   –  Keeps queuing delays low
   –  Achieves high throughput
•  Features:
   –  A simple change to TCP plus a single switch parameter (the marking threshold K)
   –  Based on existing mechanisms (ECN)


TCP for DCN research

•  Data transport in DCNs has received a lot of attention recently
•  Many solutions have been proposed over the last three years
   –  Deadline-Driven Delivery control protocol (D3) (UCSB, Microsoft)
   –  Deadline-Aware Datacenter TCP (D2TCP) (Purdue, Google)
   –  DeTail (Berkeley, Facebook): a cross-layer solution
   –  PDQ (UIUC): a flow-scheduling solution, not a transport protocol
   –  pFabric (Insieme Networks, Google, Stanford, Berkeley/ICSI)
   –  Kwiken (UIUC, Steklov Math Inst., Microsoft): a framework including policing and resource reallocation, not a transport protocol
•  The story is not completely written yet
   –  Important to understand the problem and how the solution space differs from that of Internet transport
   –  DCTCP is only one possible approach to mitigating TCP performance issues in DCNs


Outline

•  What are data center networks?
•  Layer 2 vs. Layer 3 in data center networks
•  Data center network architectures
•  TCP in data center networks
   –  Problems of basic TCP
   –  Data Center TCP (DCTCP)
•  Conclusions


Wrapping up

•  Data center networks pose specific networking challenges
   –  Potentially huge scale
   –  Different requirements than traditional Internet applications
•  Recently, a lot of research activity
   –  Newly proposed architectures and protocols
   –  A big deal for companies with mega-scale data centers: $$
•  The popularity of cloud computing accelerates this evolution


Want to know more?

1.  M. Arregoces and M. Portolani. Data Center Fundamentals. Cisco Press, 2003.
2.  S. Kandula, S. Sengupta, A. Greenberg, P. Patel, and R. Chaiken. The Nature of Data Center Traffic: Measurements & Analysis. In Proceedings of IMC 2009.
3.  V. Vasudevan, A. Phanishayee, H. Shah, E. Krevat, D. G. Andersen, G. R. Ganger, G. A. Gibson, and B. Mueller. Safe and Effective Fine-grained TCP Retransmissions for Datacenter Communication. In Proceedings of ACM SIGCOMM 2009.
4.  A. Greenberg et al. VL2: A Scalable and Flexible Data Center Network. In Proceedings of ACM SIGCOMM 2009.
5.  C. Guo et al. DCell: A Scalable and Fault-Tolerant Network Structure for Data Centers. In Proceedings of ACM SIGCOMM 2008.
6.  M. Al-Fares, A. Loukissas, and A. Vahdat. A Scalable, Commodity Data Center Network Architecture. In Proceedings of ACM SIGCOMM 2008.
7.  R. Niranjan Mysore, A. Pamboris, N. Farrington, N. Huang, P. Miri, S. Radhakrishnan, V. Subramanya, and A. Vahdat. PortLand: A Scalable Fault-Tolerant Layer 2 Data Center Network Fabric. In Proceedings of ACM SIGCOMM 2009.
8.  D. A. Joseph, A. Tavakoli, and I. Stoica. A Policy-Aware Switching Layer for Data Centers. In Proceedings of ACM SIGCOMM 2008.
9.  C. Guo, G. Lu, D. Li, H. Wu, X. Zhang, Y. Shi, C. Tian, Y. Zhang, and S. Lu. BCube: A High Performance, Server-centric Network Architecture for Modular Data Centers. In Proceedings of ACM SIGCOMM 2009.
10. H. Abu-Libdeh, P. Costa, A. Rowstron, G. O'Shea, and A. Donnelly. Symbiotic Routing in Future Data Centers. In Proceedings of ACM SIGCOMM 2010.
11. Check the SIGCOMM 2011-2013 programs as well.
