T-110.5116 Computer Networks II Data center networks
28.11.2011 Matti Siekkinen
(Sources: S. Kandula et al.: “The Nature of Data Center Traffic: Measurements & Analysis”, A. Greenberg: “Networking The Cloud”, M. Alizadeh et al.: “Data Center TCP (DCTCP)”, C. Kim: “VL2: A Scalable and Flexible Data Center Network”)
Outline
What are data center networks?
Typical data center network architecture
Data center network traffic
Alternative data center network architectures
End-to-end transport in data center networks
  Problems with TCP in DCN
  Data Center TCP (DCTCP)
Conclusions
What is a data center?
A DC houses critical computing resources
  One or more server farms
  Data, potentially lots of it
Under centralized management
Operated in a controlled environment
Applications and services
  Internal: financial, HR, network operations (DNS, NFS, DHCP), etc.
  External: e-commerce, B2B
  o Web search, advertising, recommendation systems, ...
Multi-tier architecture for applications
E.g. 3 tiers: application servers between clients and backend database servers
Advantages: performance, security, scalability, support for heterogeneity
External-facing e-commerce applications
  Search, mail, shopping carts, ...
Internal distributed services
  MapReduce, GFS, BigTable (Google), Dynamo (Amazon), Hadoop (Yahoo!), Dryad (Microsoft)
  Building blocks for external-facing apps (n tiers)
[Diagram: a front-end server fans requests out through layers of aggregators to many workers]
What does it look like?
Servers in racks
  Contain commodity servers (blades)
  Connected to a Top-Of-Rack (ToR) switch
  Traffic aggregated to the next level
Modular data centers
  Shipping containers full of racks
[Photos: inside a container, Microsoft Chicago data center]
Operating a large data center requires a lot of...
Power (efficiency measured as PUE, Power Usage Effectiveness: total facility power divided by IT equipment power)
Cooling
[Photos from Microsoft Chicago data center]
Some statistics
  Google: 450,000 servers in 2006, estimated over a million by now
  Microsoft is doubling its number of servers every 14 months
Data center vs. Cloud
What is the difference?
  Virtualization is a key concept for the cloud
  A cloud could be seen as a DC with heavy use of virtualization
Virtualization enables many of the cloud's properties
  "Unlimited" capacity
  o Of course limited by the number of physical servers
  o But each user seems to have an unlimited portion of it
  Flexible use of resources
  o No need to power on all servers all the time
  o A client's VMs can run on any physical server
Cloud often means outsourcing resources
  Shared resources
  Only partial control
  Different business models
Cloud DC vs. Enterprise DC
Traditional enterprise DC: IT staff cost dominates
  Human-to-servers ratio: 1:100
  Automation is partial; configuration and monitoring not fully automated
Cloud service DC: other costs dominate
  Human-to-server ratio: 1:1000
  Automation is more crucial
Cloud DC vs. Enterprise DC (cont.)
Enterprise: scale up
  Limited shared resources
  Scale up: a few high-priced servers
  Utilization not that critical
Cloud data center: scale out
  100,000 servers
  Distributed workload, spread over many commodity servers
  High upfront cost amortized over time and use
  Pay-per-use for customers
  o Utilization is very important
What is a data center network (DCN)?
Enables communication within the DC
  Server ↔ server
  Client ↔ server
In practice
  HW: switches, routers, and cabling
  SW: communication protocols (layers 2-4)
Both layers 2 and 3 present
  Not just L3 routers but also L2 switches
  Layer 2 subnets connected with layer 3
  Not always TCP/IP over Ethernet
Principles evolved from enterprise networks
What makes DCNs special?
Just plug all servers into an edge router and be done with it? Several issues with this approach:
Scale vs. switch capacity
  May need to support even O(10^5) servers
  E.g.: a state-of-the-art Cisco Nexus 7000 modular data center switch (L2 and L3) supports at most 768 1/10GE ports
Switch capacity vs. price
  Price goes up with the number of ports
  E.g.: list price for 768 ports with 10GE modules is somewhere beyond $1M
  Buying lots of commodity switches is an attractive option
Potentially the majority of traffic stays within the DC
  Server to server
  Want to avoid single bottlenecks
What makes DCNs special? (cont.)
Requirements differ from Internet applications
  Large amounts of bandwidth
  Very, very short delays
  Still, Internet protocols (TCP/IP) are often used
Management requirements
  Incremental expansion
  Should withstand server failures, link outages, and server rack failures
  o Under failures, performance should degrade gracefully
Requirements due to expenses
  Cost-effectiveness: high throughput per dollar
  Power efficiency
⇒ DCN topology and equipment matter a lot
Data Center Costs

Amortized cost*  Component             Sub-components
~45%             Servers               CPU, memory, disk
~25%             Power infrastructure  UPS, cooling, power distribution
~15%             Power draw            Electrical utility costs
~15%             Network               Switches, links, transit

Total cost varies; upwards of $1/4B for a mega data center
Server costs dominate; network costs are also significant
⇒ Network should allow high utilization of servers
Greenberg et al. The Cost of a Cloud: Research Problems in Data Center Networks. Sigcomm CCR 2009. *3 yr amortization for servers, 15 yr for infrastructure; 5% cost of money
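To make the amortization footnote concrete: one plausible reading, using the standard annuity formula (the formula choice is my assumption, not spelled out on the slide), gives a monthly cost for capital cost C paid off over n years at annual cost of money r of

\[
\text{monthly cost} \;=\; C \cdot \frac{r/12}{1 - (1 + r/12)^{-12n}},
\]

so, for example, a $2,000 server amortized over 3 years at r = 0.05 comes to about $60 per month, while 15-year infrastructure is spread far thinner.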
Design Alternatives for DCN
Two high-level choices for interconnection:
Specialized hardware and communication protocols
  Infiniband seems most prominent
  o Can provide high bandwidth & extremely low latency
    • Custom hardware takes care of some reliability tasks
  o Relatively low-power physical layer
  o Expensive
  o Not natively compatible with TCP/IP applications
Commodity (1/10 Gb) Ethernet switches and routers
  Compatible
  Cheaper
  We focus on this
Current Trends
Topology: two- or three-level trees of switches or routers
High bandwidth by appropriate interconnection of many commodity switches
Redundancy
[Diagram: Internet at the top, a layer-3 router, layer-2/3 aggregation switches, layer-2 Top-Of-Rack access switches, and servers at the bottom]
Current Trends (cont.)
"Cheap" off-the-shelf Ethernet switches
  Form the basis for large-scale DCNs
Multipath routing with ECMP
  Most enterprise core switches support Equal Cost Multi-Path (ECMP) routing
  o Layer-3 solution
  Path selection via hashing (see the sketch after this list)
  o # buckets = # outgoing links
  o Hash network information (source/dest IP addrs) to select the outgoing link: preserves flow affinity
  Why not just round-robin packets?
  o Reordering: risk of triple duplicate ACKs with TCP
  o Different RTT per path (for TCP RTO)
  o Different MTUs per path
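A minimal sketch of the hashing idea (illustrative only: real switches hash the full 5-tuple in hardware, and the hash function here is an arbitrary choice):

    import hashlib

    def ecmp_select_link(src_ip: str, dst_ip: str, n_links: int) -> int:
        """Map a flow to one of n_links equal-cost outgoing links."""
        key = f"{src_ip}->{dst_ip}".encode()
        digest = hashlib.md5(key).digest()          # deterministic per flow
        bucket = int.from_bytes(digest[:4], "big")  # one bucket per link
        return bucket % n_links

    # All packets between the same pair of hosts hash to the same link,
    # so a flow's packets are never reordered across paths:
    assert ecmp_select_link("10.0.0.1", "10.0.1.7", 4) == \
           ecmp_select_link("10.0.0.1", "10.0.1.7", 4)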
Layer 2 and Layer 3 in DCN
Layer 2
  Spanning tree protocol (STP)
  Loop-free topology
Layer 3
  Shortest-path routing
  ECMP (Equal Cost Multipath routing)
The layer 2 problem is loops
  Difficult to avoid completely
  o Physical topologies are rarely loop-free
  o STP should normally prevent loops, but in practice they do happen
  Cause an increase in traffic level
  o Packets replicated an unbounded number of times
    • No TTL in L2 headers
  o Broadcast looping is even worse
  o Can bring down the (sub)network
  Difficult to troubleshoot
  o Hard to track down the root of the problem
Layer 2 and Layer 3 in DCN: Need for layer 2
But layer 2 is needed because...
Clustering
  o Servers performing the same functions (load balancing, redundancy)
  o Heartbeat or application packets may not be routable
    • Need L2 adjacency
Dual-homed servers
  o Connected to two different access switches
  o If primary and standby interfaces use the same IP and MAC -> both interfaces have the same default gateway (IP address for outbound traffic) -> need to be in the same L2 domain
Certain stateful service devices such as load balancers and firewalls
  o If deployed in pairs for redundancy -> need to exchange session state information and heartbeats, which may not be routable
    • Need L2 adjacency
Layer 2 and Layer 3 in DCN: VLAN
VLAN = Virtual Local Area Network
Some servers need to belong to the same L2 broadcast domain
  See previous slide...
VLANs overcome limitations of the physical topology
  Run out of switch ports? VLANs allow flexible growth while maintaining layer 2 adjacency
  L2 domain across routers
VLANs can be port-based or MAC-address-based
Inter-switch communication: tag frames with the VLAN number
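A rough sketch of what such a tag looks like on the wire (simplified 802.1Q framing; real switches insert and strip the tag in hardware):

    import struct

    TPID = 0x8100  # EtherType value identifying an 802.1Q-tagged frame

    def make_vlan_tag(vlan_id: int, priority: int = 0) -> bytes:
        """Build the 4-byte tag inserted after the source MAC address."""
        assert 0 < vlan_id < 4095, "VLAN ID is a 12-bit field"
        tci = (priority << 13) | vlan_id   # PCP(3) | DEI(1)=0 | VID(12)
        return struct.pack("!HH", TPID, tci)

    def parse_vlan_tag(tag: bytes) -> int:
        tpid, tci = struct.unpack("!HH", tag)
        assert tpid == TPID, "not an 802.1Q tag"
        return tci & 0x0FFF                # extract the VLAN ID

    assert parse_vlan_tag(make_vlan_tag(42)) == 42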
Nature of data center traffic
A glimpse of an example case study from Microsoft Research, 2009
  Detailed view of traffic within a large DC (one typical cluster of it)
We look at
  Flow characteristics
  Traffic matrices
Methodology and data set
Measurements with instrumented servers (not switches)
  Possible in a DC, not for ISPs
  o ISPs instead use SNMP counters, flow sampling, deep packet inspection
One logical cluster = 1500 servers
  o Part of a DC of O(10000) servers
  Servers in racks of 20
Socket-level events at each server
  ETW - Event Tracing for Windows
  o http://msdn.microsoft.com/en-us/magazine/cc163437.aspx#S1
  One event per application read or write
  o Aggregates over several packets
Data set
  2 months of measurements
  Traffic and application logs (linkage possible)
  A petabyte of data
Application Workload
Map-reduce style jobs
  Programmers write jobs in Scope, a SQL-like programming language
  A compiler transforms a job into a workflow consisting of phases
  Phases of different types: Extract, Partition, Aggregate, Combine
  Jobs range from short interactive programs to long-running programs
80% of the packets stay inside the data center
Traffic patterns
[Heatmap: ln(bytes) exchanged between server pairs per 10 s period; servers within a rack are adjacent on the axes]
Work-Seeks-Bandwidth (W-S-B)
  Small squares around the diagonal
Scatter-Gather (S-G)
  Horizontal and vertical lines
Traffic patterns (cont.)
Work-seeks-bandwidth
  Need to make an effort to place jobs under the same ToR
Scatter-gather patterns
  A server pushes/pulls data to/from many servers across the cluster
  Distributed query processing: map, reduce
  o Data divided into small parts
  o Each server works on a particular part
  o Answers aggregated
Shows the need for inter-ToR communication
  Computation constrained by the network
Flows and traffic matrix
Flow: a 5-tuple, with all activity separated by gaps of < 60 s
  Most of the flows: various mice
  Most of the bytes: within 100MB flows
Traffic matrix (TM) computation to study regularity in traffic
  Regular is good -> could optimize the network for most traffic
  ToR-to-ToR TM computation
  o TM(t)_{i,j}: aggregate traffic volume within 100 s from ToR_i to ToR_j, starting at time t
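A small sketch of this flow definition as I read it (the data structures are mine, not from the paper):

    IDLE_TIMEOUT = 60.0  # seconds: a longer gap starts a new flow

    def count_flows(packets):
        """packets: iterable of (five_tuple, timestamp), timestamps sorted."""
        last_seen = {}   # five_tuple -> timestamp of the last packet seen
        flows = 0
        for five_tuple, ts in packets:
            prev = last_seen.get(five_tuple)
            if prev is None or ts - prev >= IDLE_TIMEOUT:
                flows += 1   # first packet, or a gap of >= 60 s: new flow
            last_seen[five_tuple] = ts
        return flows

    pkts = [(("10.0.0.1", "10.0.0.2", 6, 1234, 80), t) for t in (0, 1, 2, 70)]
    assert count_flows(pkts) == 2   # the 68 s gap splits the activity in two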
Traffic volatility
Collapse similar traffic matrices (over 100 s) into "clusters"
Take the 40 best TM clusters and classify each 100 s TM into the best-fitting cluster
  Traffic pattern changes nearly constantly
  => Unpredictable traffic
Need 50-60 clusters to cover a day's traffic
  => Lack of regularity; bad news for traffic engineering
Alternative proposed architectures
Goals
  Overcome limitations of today's typical architectures
  Use commodity standard equipment
A lot of research activity in recent years
We briefly look at two specific proposals
  Fat-tree-based solution from UCSD
  VL2 from Microsoft Research
Many others also exist
  PortLand (UCSD)
  DCell (Microsoft Research Asia, Tsinghua University, and UCLA)
  BCube (Microsoft Research Asia, Tsinghua University, UCLA, PKU, HUST)
  Monsoon and CamCube (Microsoft Research)
  ...
Some example issues with conventional architecture
Bandwidth oversubscription
  Ratio of allocated bandwidth per server to the guaranteed bandwidth per server is greater than 1 (see the worked example after this list)
  o Total bandwidth at the core/aggregation level is less than the summed-up bandwidth at the access level
  o Limited server-to-server capacity
No performance isolation
  Collateral damage
Fragmentation of resources
  Not possible for virtual machines to migrate from one VLAN to another while keeping their IP address
  Requires VLAN and IP reconfiguration
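A worked example with assumed numbers (the rack and uplink sizes are illustrative, not from the slides): a rack of 20 servers with 1 Gbps NICs whose ToR switch has two 1 Gbps uplinks gives

\[
\text{oversubscription} \;=\; \frac{20 \times 1\,\text{Gbps}}{2 \times 1\,\text{Gbps}} \;=\; 10:1,
\]

so when all servers send out of the rack at once, each is guaranteed only 1 Gbps / 10 = 100 Mbps.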
Oversubscription
Data centers run two kinds of applications:
  Outward facing (e.g. serving web pages to users)
  Internal computation (e.g. computing a search index)
Can't we just place servers that inter-communicate a lot adjacently?
  May be poorly predictable (cf. the case study on traffic analysis)
[Diagram: conventional tree of core routers (CR), access routers (AR), switches (S), load balancers (LB), and servers; 10:1 oversubscription or worse (80:1, 240:1) toward the core]
Oversubscription solution: FAT tree topology with a special look-up scheme
  Add more commodity switches
  o Carefully designed topology
  o All ports have the same capacity as servers
  Use two-level look-ups to distribute traffic
  o Enables use of all the "shortest paths"
  o Requires a specific addressing scheme
  Enables
  o Full bisection bandwidth
  o Even lower cost, because all switch ports have the same capacity
  Drawbacks
  o Needs customized switches (forwarding part)
  o A lot of cabling
M. Al-Fares et al. A Scalable, Commodity Data Center Network Architecture. In SIGCOMM 2008.
[Diagram: FAT tree with core switches, aggregation switches, and edge switches]
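The achievable scale follows from the fat-tree construction in the cited paper; a quick sizing helper (the function is mine, the k³/4 arithmetic is the paper's):

    def fat_tree_size(k):
        """Element counts for a fat-tree built from identical k-port switches
        (per the Al-Fares et al. construction)."""
        assert k % 2 == 0, "k must be even"
        return {
            "pods": k,
            "edge_switches": k * (k // 2),
            "aggregation_switches": k * (k // 2),
            "core_switches": (k // 2) ** 2,
            "servers": k ** 3 // 4,   # full bisection bandwidth to all of them
        }

    print(fat_tree_size(48)["servers"])  # 27648 servers from cheap 48-port switches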
No performance isolation
VLANs typically provide reachability isolation only
  One service sending/receiving too much traffic hurts all services sharing its subtree -> collateral damage
[Diagram: same conventional tree (CR, AR, S, LB, servers); one service's traffic floods the subtree it shares with others]
Performance isolation with VLB
Random flow spreading with Valiant Load Balancing (VLB)
  Similar FAT-tree topology with commodity switches
  Every flow is "bounced" off a random intermediate switch
  Provably hotspot-free for any admissible traffic matrix
  o Works also for unpredictable, non-regular traffic
  No need to modify switches (standard forwarding)
  o Relies on ECMP and clever addressing
  Needs some changes to servers, though
[Diagram: Clos topology — D/2 intermediate switches (the VLB bounce points), D aggregation switches with D ports each at 10G, and Top-Of-Rack switches with 20 servers each; (D²/4) × 20 servers in total]
A. Greenberg et al. VL2: A Scalable and Flexible Data Center Network. In SIGCOMM, 2009.
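A toy sketch of the VLB idea (VL2 itself realizes the random bounce with ECMP and encapsulation rather than explicit path computation; the switch names are made up):

    import random

    INTERMEDIATES = ["int-1", "int-2", "int-3", "int-4"]  # hypothetical names

    def vlb_path(src_tor, dst_tor):
        # Pick a random intermediate once per flow; all of the flow's packets
        # then go up to it and back down, spreading load over the core.
        bounce = random.choice(INTERMEDIATES)
        return [src_tor, bounce, dst_tor]

    print(vlb_path("tor-12", "tor-47"))  # e.g. ['tor-12', 'int-3', 'tor-47']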
Fragmentation of resources
VLANs used to isolate properties (services) from each other
IP addresses topologically determined by ARs
Reconfiguration of IPs and VLAN trunks is error-prone, slow, and often manual
[Diagram: same conventional tree; each service is pinned to the VLAN and address range of one subtree]
Solution: Name-location separation
Servers use flat names (x, y, z)
ToR switches
  Run link-state routing
  Maintain a switch-level topology
A thin shim layer added to the server OS (the VL2 agent)
  Remember HIP?
  No change to applications or to clients outside the DC
Centralized directory system for name resolution
Advantages
  Protects network and hosts from host-state churn
  Obviates host and switch reconfiguration
The directory system is a potential bottleneck
[Diagram: the VL2 agent sits between TCP/IP and the NIC (user/kernel), intercepts ARP, and encapsulates packets using a MAC resolution cache; remote IPs are resolved via the directory service, e.g. x→ToR2, y→ToR3, z→ToR4, with z's entry changing to ToR3 after migration]
A. Greenberg et al. VL2: A Scalable and Flexible Data Center Network. In SIGCOMM, 2009.
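A toy sketch of name-location separation (heavily simplified from VL2; the data structures are assumptions):

    # x, y, z are flat application-level names; ToR names stand in for locators.
    directory = {"x": "ToR2", "y": "ToR3", "z": "ToR4"}  # name -> locator

    def send(dst_name, payload):
        locator = directory[dst_name]        # resolve via the directory, not ARP
        return (locator, dst_name, payload)  # encapsulate toward the locator

    # When z's VM migrates, only the directory entry changes; z keeps its
    # name, and no host or switch is reconfigured:
    directory["z"] = "ToR3"
    print(send("z", b"hello"))  # ('ToR3', 'z', b'hello')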
TCP in the Data Center
TCP rules as the transport inside DCs
  99.9% of traffic
DCNs are a different environment for TCP than normal Internet end-to-end transport
  Very short delays
  Specific application workloads
How well does TCP work in DCNs? Several problems occur
  o Bursty packet drops, the incast problem, ...
  o TCP builds up large queues
    • Adds latency and wastes buffer space
Measurement study of TCP in DCN
Case study: Bing, Microsoft's search engine
Measurements from a 6000-server production cluster
Instrumentation passively collects logs
  Application-level
  Socket-level
  Selected packet-level
150TB of compressed data over a month
Partition/Aggregate Application Structure
The foundation for many large-scale web applications
  Web search, social network composition, ad selection, etc.
Time is money -> strict deadlines
  A missed deadline means a lower-quality result
[Diagram: requests from the Internet fan out from a top-level aggregator (deadline 250 ms) through mid-level aggregators (deadline 50 ms) to worker nodes (deadline 10 ms)]
Workloads

Workload                                      Size       Sensitivity
Partition/Aggregate (query)                   -          Delay-sensitive
Short messages (coordination, control state)  50KB-1MB   Delay-sensitive
Large flows (data update)                     1MB-50MB   Throughput-sensitive (background traffic)
Problems with TCP in DCN
  Incast
  Queue buildup
  Buffer pressure
All of these are caused by shared-memory switches
  Common memory pool for all ports
  Memory is a scarce resource
  Typical for commodity switches; cost is the reason
Problems: Incast
Synchronized mice collide: caused by Partition/Aggregate
  Workers' responses arrive at the aggregator nearly simultaneously
Always causes a TCP timeout (no duplicate ACKs to trigger fast retransmit)
  RTOmin = 300 ms
[Diagram: Workers 1-4 answer the aggregator at the same instant; their packets collide in the shared switch buffer]
Alleviating Incast through Jittering
Requests are jittered: each is deliberately delayed over a 10 ms window -> desynchronization
Jittering trades off the median against the high percentiles
[Plot: MLA query completion time (ms) over the day; jittering was switched off around 8:30 am]
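One plausible implementation sketch of jittering at the aggregator (the async structure is my assumption; the slide only specifies the 10 ms window):

    import asyncio, random

    JITTER_WINDOW = 0.010  # 10 ms

    async def send_jittered(send_fn, request):
        # Delay each request by a random amount within the window so that
        # workers' responses are desynchronized instead of colliding.
        await asyncio.sleep(random.uniform(0, JITTER_WINDOW))
        return await send_fn(request)

    async def fan_out(send_fn, requests):
        return await asyncio.gather(
            *(send_jittered(send_fn, r) for r in requests))

    # Usage: asyncio.run(fan_out(my_rpc_call, worker_requests)); responses
    # now trickle in over ~10 ms, at the cost of a few ms of median latency.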
Problems: Queue Buildup
Big flows build up queues
  Increased latency for short flows
  Packet loss
Measurements in the Bing cluster
  For 90% of packets: RTT < 1 ms
  For 10% of packets: 1 ms < RTT < 15 ms
[Diagram: Sender 1 and Sender 2 share a switch port toward the receiver; short flows wait behind the big flow's queue]
Problems: Buffer pressure
A similar phenomenon to the previous one
  Increased queuing delay and packet loss due to long flows traversing other ports
  Shared memory pool
Data Center Transport Requirements
1. High burst tolerance
   Incast due to Partition/Aggregate is common
2. Low latency
   Short flows, queries
3. High throughput
   Continuous data updates, large file transfers
The challenge is to achieve all three together.
Tension Between Requirements
[Diagram: triangle of high burst tolerance, high throughput, and low latency, with DCTCP at the center]
Deep buffers: queuing delays increase latency
Shallow buffers: bad for bursts & throughput
Reduced RTOmin (SIGCOMM '09): doesn't help latency
AQM (RED): average queue not fast enough for incast
Objective: low queue occupancy & high throughput
Review: TCP with ECN
ECN = Explicit Congestion Notification
[Diagram: Sender 1 and Sender 2 send through a switch to the receiver; the congested switch sets a 1-bit ECN mark on packets]
DCTCP: Two key ideas
1. React in proportion to the extent of congestion, not its presence
   Reduces variance in sending rates, lowering queuing requirements
2. Mark based on instantaneous queue length
   Fast feedback to better deal with bursts

ECN marks     TCP                DCTCP
1011110111    Cut window by 50%  Cut window by 40%
0000000001    Cut window by 50%  Cut window by 5%
Data Center TCP Algorithm
Switch side:
  Mark packets when queue length > K
  [Diagram: queue with threshold K; packets arriving above K are marked, below K are not]
Sender side:
  Maintain a running average of the fraction of packets marked (α). In each RTT:
    α ← (1 − g)·α + g·F, where F is the fraction of packets marked in the last RTT and g is a fixed weight
  Adaptive window decrease:
    W ← (1 − α/2)·W
  Note: the decrease factor is between 1 and 2.
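A minimal simulation sketch of the sender-side update (the structure follows the DCTCP paper; the weight g = 1/16 is an assumed value):

    G = 1.0 / 16  # EWMA weight g (an assumption; the paper tunes this)

    class DctcpSender:
        def __init__(self, cwnd=10.0):
            self.cwnd = cwnd    # congestion window, in packets
            self.alpha = 0.0    # running estimate of the fraction marked

        def on_rtt(self, marked, total):
            f = marked / total                          # F for this RTT
            self.alpha = (1 - G) * self.alpha + G * f   # alpha <- (1-g)a + gF
            if marked:
                self.cwnd *= 1 - self.alpha / 2         # proportional decrease
            else:
                self.cwnd += 1                          # normal additive increase

    # A first mark barely dents the window while alpha is still small,
    # whereas standard TCP would halve it; as alpha converges toward a
    # steady marked fraction of 0.1, the per-RTT cut approaches 5%.
    s = DctcpSender()
    s.on_rtt(marked=1, total=10)
    print(round(s.cwnd, 3))  # 9.969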
DCTCP in Action
Setup: Windows 7 hosts, Broadcom 1 Gbps switch. Scenario: 2 long-lived flows, K = 30 KB.
[Plot: instantaneous queue length (Kbytes) over time for TCP vs. DCTCP]
Why it Works
1. High burst tolerance
   Large buffer headroom → bursts fit
   Aggressive marking → sources react before packets are dropped
2. Low latency
   Small buffer occupancies → low queuing delay
3. High throughput
   ECN averaging → smooth rate adjustments, low variance
Analysis
How low can DCTCP maintain queues without loss of throughput? How do we set the DCTCP parameters?
Need to quantify the queue-size oscillations (stability).
[Diagram: sawtooth of the window size over time, oscillating between (W*+1)(1−α/2) and W*+1 around W*; packets sent in the RTT at the peak are marked]
Result: DCTCP needs 85% less buffer than TCP.
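Reading the oscillation amplitude straight off that sawtooth (a derivation from the quantities shown on the slide, not a result stated there): the window peaks at W*+1, where marking kicks in, and drops to (W*+1)(1−α/2), so the peak-to-trough amplitude in packets is

\[
A \;=\; (W^*+1) - (W^*+1)\Bigl(1 - \frac{\alpha}{2}\Bigr) \;=\; \frac{\alpha}{2}\,(W^*+1).
\]

Since α is small when marking is rare, this is far below standard TCP's halving, \(A_{\mathrm{TCP}} = (W^*+1)/2\), which is why DCTCP can keep queues — and hence switch buffers — so much smaller.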
Cluster Traffic Benchmark
Emulate traffic within 1 rack of a Bing cluster
  45 1G servers, one 10G server for external traffic
Generate query and background traffic
  Flow sizes and arrival times follow the distributions seen in Bing
Metric: flow completion time for queries and background flows
We use RTOmin = 10 ms for both TCP & DCTCP.
Baseline
[Plots: flow completion times for background flows and query flows, TCP vs. DCTCP]
Low latency for short flows. High throughput for long flows. High burst tolerance for query flows.
DCTCP summary
DCTCP
  Handles bursts well
  Keeps queuing delays low
  Achieves high throughput
Features
  A simple change to TCP and a single switch parameter
  Based on existing mechanisms
Wrapping up
Data center networks pose specific networking challenges
  Potentially huge scale
  Different requirements than traditional Internet applications
Recently a lot of research activity
  New proposed architectures and protocols
  A big deal to companies with mega-scale data centers: $$
The popularity of cloud computing accelerates this evolution
Want to know more?
1. M. Arregoces and M. Portolani. Data Center Fundamentals. Cisco Press, 2003.
2. S. Kandula, S. Sengupta, A. Greenberg, P. Patel, and R. Chaiken. The Nature of Data Center Traffic: Measurements & Analysis. In Proceedings of IMC 2009.
3. V. Vasudevan, A. Phanishayee, H. Shah, E. Krevat, D. G. Andersen, G. R. Ganger, G. A. Gibson, and B. Mueller. Safe and Effective Fine-grained TCP Retransmissions for Datacenter Communication. In Proceedings of ACM SIGCOMM 2009.
4. A. Greenberg et al. VL2: A Scalable and Flexible Data Center Network. In Proceedings of ACM SIGCOMM 2009.
5. C. Guo et al. DCell: A Scalable and Fault-Tolerant Network Structure for Data Centers. In Proceedings of ACM SIGCOMM 2008.
6. M. Al-Fares, A. Loukissas, and A. Vahdat. A Scalable, Commodity Data Center Network Architecture. In Proceedings of ACM SIGCOMM 2008.
7. R. Niranjan Mysore, A. Pamboris, N. Farrington, N. Huang, P. Miri, S. Radhakrishnan, V. Subramanya, and A. Vahdat. PortLand: A Scalable Fault-Tolerant Layer 2 Data Center Network Fabric. In Proceedings of ACM SIGCOMM 2009.
8. D. A. Joseph, A. Tavakoli, and I. Stoica. A Policy-Aware Switching Layer for Data Centers. In Proceedings of ACM SIGCOMM 2008.
9. C. Guo, G. Lu, D. Li, H. Wu, X. Zhang, Y. Shi, C. Tian, Y. Zhang, and S. Lu. BCube: A High Performance, Server-centric Network Architecture for Modular Data Centers. In Proceedings of ACM SIGCOMM 2009.
10. H. Abu-Libdeh, P. Costa, A. Rowstron, G. O'Shea, and A. Donnelly. Symbiotic Routing in Future Data Centers. In Proceedings of ACM SIGCOMM 2010.