T-110.5116 Computer Networks II Data center networks
28.11.2011 Matti Siekkinen
(Sources: S. Kandula et al.: “The Nature of Data Center Traffic: Measurements & Analysis”, A. Greenberg: “Networking The Cloud”, M. Alizadeh et al.: “Data Center TCP (DCTCP)”, C. Kim: “VL2: A Scalable and Flexible Data Center Network”)
Outline
What are data center networks?
Typical data center network architecture
Data center network traffic
Alternative data center network architectures
End-to-end transport in data center networks
  Problems with TCP in DCN
  Data Center TCP (DCTCP)
Conclusions
What is a data center?
A DC houses critical computing resources
  One or more server farms
  Data, potentially lots of it
Under centralized management
Operated in a controlled environment
Applications and services
  Internal: financial, HR, network operations (DNS, NFS, DHCP), etc.
  External: e-commerce, B2B
  o Web search, advertising, recommendation systems, ...
Multi-tier architecture for applications
E.g. 3 tiers: application servers between clients and backend database servers
Advantages: performance, security, scalability, support for heterogeneity
External-facing e-commerce applications
  Search, mail, shopping carts, ...
Internal distributed services
  MapReduce, GFS, BigTable (Google), Dynamo (Amazon), Hadoop (Yahoo!), Dryad (Microsoft)
  Building blocks for external-facing apps (n tiers)
[Diagram: a front-end server fans requests out through layers of aggregators to many workers]
What does it look like?
Servers in racks
  Contain commodity servers (blades)
  Connected to a Top-Of-Rack (ToR) switch
  Traffic aggregated to the next level
Modular data centers
  Shipping containers full of racks
[Photos: inside a container, Microsoft Chicago data center]
Operating a large data center requires a lot of...
Power (efficiency measured as PUE, Power Usage Effectiveness: total facility power divided by IT equipment power)
Cooling
[Photos from Microsoft Chicago data center]
Some statistics
  Google: 450,000 servers in 2006, estimated over a million by now
  Microsoft is doubling its number of servers every 14 months
Data center vs. Cloud
What is the difference?
  Virtualization is a key concept for the cloud
  A cloud could be seen as a DC with heavy use of virtualization
Virtualization enables many of the cloud's properties
  "Unlimited" capacity
  o Of course limited by the number of physical servers
  o But each user seems to have an unlimited portion of it
  Flexible use of resources
  o No need to power on all servers all the time
  o A client's VMs can run on any physical server
Cloud often means outsourcing resources
  Shared resources
  Only partial control
  Different business models
Cloud DC vs. Enterprise DC
Traditional enterprise DC: IT staff cost dominates
  Human-to-servers ratio: 1:100
  Automation is partial; configuration and monitoring not fully automated
Cloud service DC: other costs dominate
  Human-to-server ratio: 1:1000
  Automation is more crucial
Cloud DC vs. Enterprise DC (cont.)
Enterprise: scale up
  Limited shared resources
  Scale up: a few high-priced servers
  Utilization not that critical
Cloud data center: scale out
  100,000 servers
  Distributed workload, spread over many commodity servers
  High upfront cost amortized over time and use
  Pay-per-use for customers
  o Utilization is very important
What is a data center network (DCN)?
Enables communication within the DC
  Server ↔ server
  Client ↔ server
In practice
  HW: switches, routers, and cabling
  SW: communication protocols (layers 2-4)
Both layers 2 and 3 present
  Not just L3 routers but also L2 switches
  Layer 2 subnets connected with layer 3
  Not always TCP/IP over Ethernet
Principles evolved from enterprise networks
What makes DCNs special?
Just plug all servers into an edge router and be done with it? Several issues with this approach:
Scale vs. switch capacity
  May need to support even O(10^5) servers
  E.g.: a state-of-the-art Cisco Nexus 7000 modular data center switch (L2 and L3) supports at most 768 1/10GE ports
Switch capacity vs. price
  Price goes up with the number of ports
  E.g.: list price for 768 ports with 10GE modules is somewhere beyond $1M
  Buying lots of commodity switches is an attractive option
Potentially the majority of traffic stays within the DC
  Server to server
  Want to avoid single bottlenecks
What makes DCNs special? (cont.)
Requirements differ from Internet applications
  Large amounts of bandwidth
  Very, very short delays
  Still, Internet protocols (TCP/IP) are often used
Management requirements
  Incremental expansion
  Should withstand server failures, link outages, and server rack failures
  o Under failures, performance should degrade gracefully
Requirements due to expenses
  Cost-effectiveness: high throughput per dollar
  Power efficiency
⇒ DCN topology and equipment matter a lot
Data Center Costs

Amortized cost*  Component             Sub-components
~45%             Servers               CPU, memory, disk
~25%             Power infrastructure  UPS, cooling, power distribution
~15%             Power draw            Electrical utility costs
~15%             Network               Switches, links, transit

Total cost varies; upwards of $1/4B for a mega data center
Server costs dominate; network costs are also significant
⇒ Network should allow high utilization of servers
Greenberg et al. The Cost of a Cloud: Research Problems in Data Center Networks. Sigcomm CCR 2009. *3 yr amortization for servers, 15 yr for infrastructure; 5% cost of money
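To make the amortization footnote concrete: one plausible reading, using the standard annuity formula (the formula choice is my assumption, not spelled out on the slide), gives a monthly cost for capital cost C paid off over n years at annual cost of money r of

\[
\text{monthly cost} \;=\; C \cdot \frac{r/12}{1 - (1 + r/12)^{-12n}},
\]

so, for example, a $2,000 server amortized over 3 years at r = 0.05 comes to about $60 per month, while 15-year infrastructure is spread far thinner.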
Design Alternatives for DCN
Two high-level choices for interconnection:
Specialized hardware and communication protocols
  Infiniband seems most prominent
  o Can provide high bandwidth & extremely low latency
    • Custom hardware takes care of some reliability tasks
  o Relatively low-power physical layer
  o Expensive
  o Not natively compatible with TCP/IP applications
Commodity (1/10 Gb) Ethernet switches and routers
  Compatible
  Cheaper
  We focus on this
Current Trends
Topology: two- or three-level trees of switches or routers
High bandwidth by appropriate interconnection of many commodity switches
Redundancy
[Diagram: Internet at the top, a layer-3 router, layer-2/3 aggregation switches, layer-2 Top-Of-Rack access switches, and servers at the bottom]
Current Trends (cont.)
"Cheap" off-the-shelf Ethernet switches
  Form the basis for large-scale DCNs
Multipath routing with ECMP
  Most enterprise core switches support Equal Cost Multi-Path (ECMP) routing
  o Layer-3 solution
  Path selection via hashing (see the sketch after this list)
  o # buckets = # outgoing links
  o Hash network information (source/dest IP addrs) to select the outgoing link: preserves flow affinity
  Why not just round-robin packets?
  o Reordering: risk of triple duplicate ACKs with TCP
  o Different RTT per path (for TCP RTO)
  o Different MTUs per path
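A minimal sketch of the hashing idea (illustrative only: real switches hash the full 5-tuple in hardware, and the hash function here is an arbitrary choice):

    import hashlib

    def ecmp_select_link(src_ip: str, dst_ip: str, n_links: int) -> int:
        """Map a flow to one of n_links equal-cost outgoing links."""
        key = f"{src_ip}->{dst_ip}".encode()
        digest = hashlib.md5(key).digest()          # deterministic per flow
        bucket = int.from_bytes(digest[:4], "big")  # one bucket per link
        return bucket % n_links

    # All packets between the same pair of hosts hash to the same link,
    # so a flow's packets are never reordered across paths:
    assert ecmp_select_link("10.0.0.1", "10.0.1.7", 4) == \
           ecmp_select_link("10.0.0.1", "10.0.1.7", 4)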
Layer 2 and Layer 3 in DCN
Layer 2
  Spanning tree protocol (STP)
  Loop-free topology
Layer 3
  Shortest-path routing
  ECMP (Equal Cost Multipath routing)
The layer 2 problem is loops
  Difficult to avoid completely
  o Physical topologies are rarely loop-free
  o STP should normally prevent loops, but in practice they do happen
  Cause an increase in traffic level
  o Packets replicated an unbounded number of times
    • No TTL in L2 headers
  o Broadcast looping is even worse
  o Can bring down the (sub)network
  Difficult to troubleshoot
  o Hard to track down the root of the problem
Layer 2 and Layer 3 in DCN: Need for layer 2
But layer 2 is needed because...
Clustering
  o Servers performing the same functions (load balancing, redundancy)
  o Heartbeat or application packets may not be routable
    • Need L2 adjacency
Dual-homed servers
  o Connected to two different access switches
  o If primary and standby interfaces use the same IP and MAC -> both interfaces have the same default gateway (IP address for outbound traffic) -> need to be in the same L2 domain
Certain stateful service devices such as load balancers and firewalls
  o If deployed in pairs for redundancy -> need to exchange session state information and heartbeats, which may not be routable
    • Need L2 adjacency
Layer 2 and Layer 3 in DCN: VLAN
VLAN = Virtual Local Area Network
Some servers need to belong to the same L2 broadcast domain
  See previous slide...
VLANs overcome limitations of the physical topology
  Run out of switch ports? VLANs allow flexible growth while maintaining layer 2 adjacency
  L2 domain across routers
VLANs can be port-based or MAC-address-based
Inter-switch communication: tag frames with the VLAN number
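A rough sketch of what such a tag looks like on the wire (simplified 802.1Q framing; real switches insert and strip the tag in hardware):

    import struct

    TPID = 0x8100  # EtherType value identifying an 802.1Q-tagged frame

    def make_vlan_tag(vlan_id: int, priority: int = 0) -> bytes:
        """Build the 4-byte tag inserted after the source MAC address."""
        assert 0 < vlan_id < 4095, "VLAN ID is a 12-bit field"
        tci = (priority << 13) | vlan_id   # PCP(3) | DEI(1)=0 | VID(12)
        return struct.pack("!HH", TPID, tci)

    def parse_vlan_tag(tag: bytes) -> int:
        tpid, tci = struct.unpack("!HH", tag)
        assert tpid == TPID, "not an 802.1Q tag"
        return tci & 0x0FFF                # extract the VLAN ID

    assert parse_vlan_tag(make_vlan_tag(42)) == 42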
Nature of data center traffic
A glimpse of an example case study from Microsoft Research, 2009
  Detailed view of traffic within a large DC (one typical cluster of it)
We look at
  Flow characteristics
  Traffic matrices
Methodology and data set
Measurements with instrumented servers (not switches)
  Possible in a DC, not for ISPs
  o ISPs instead use SNMP counters, flow sampling, deep packet inspection
One logical cluster = 1500 servers
  o Part of a DC of O(10000) servers
  Servers in racks of 20
Socket-level events at each server
  ETW - Event Tracing for Windows
  o http://msdn.microsoft.com/en-us/magazine/cc163437.aspx#S1
  One event per application read or write
  o Aggregates over several packets
Data set
  2 months of measurements
  Traffic and application logs (linkage possible)
  A petabyte of data
Application Workload
Map-reduce style jobs
  Programmers write jobs in Scope, a SQL-like programming language
  A compiler transforms a job into a workflow consisting of phases
  Phases of different types: Extract, Partition, Aggregate, Combine
  Jobs range from short interactive programs to long-running programs
80% of the packets stay inside the data center
Traffic patterns
[Heatmap: ln(bytes) exchanged between server pairs per 10 s period; servers within a rack are adjacent on the axes]
Work-Seeks-Bandwidth (W-S-B)
  Small squares around the diagonal
Scatter-Gather (S-G)
  Horizontal and vertical lines
Traffic patterns (cont.)
Work-seeks-bandwidth
  Need to make an effort to place jobs under the same ToR
Scatter-gather patterns
  A server pushes/pulls data to/from many servers across the cluster
  Distributed query processing: map, reduce
  o Data divided into small parts
  o Each server works on a particular part
  o Answers aggregated
Shows the need for inter-ToR communication
  Computation constrained by the network
Flows and traffic matrix
Flow: a 5-tuple, with all activity separated by gaps of < 60 s
  Most of the flows: various mice
  Most of the bytes: within 100MB flows
Traffic matrix (TM) computation to study regularity in traffic
  Regular is good -> could optimize the network for most traffic
  ToR-to-ToR TM computation
  o TM(t)_{i,j}: aggregate traffic volume within 100 s from ToR_i to ToR_j, starting at time t
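A small sketch of this flow definition as I read it (the data structures are mine, not from the paper):

    IDLE_TIMEOUT = 60.0  # seconds: a longer gap starts a new flow

    def count_flows(packets):
        """packets: iterable of (five_tuple, timestamp), timestamps sorted."""
        last_seen = {}   # five_tuple -> timestamp of the last packet seen
        flows = 0
        for five_tuple, ts in packets:
            prev = last_seen.get(five_tuple)
            if prev is None or ts - prev >= IDLE_TIMEOUT:
                flows += 1   # first packet, or a gap of >= 60 s: new flow
            last_seen[five_tuple] = ts
        return flows

    pkts = [(("10.0.0.1", "10.0.0.2", 6, 1234, 80), t) for t in (0, 1, 2, 70)]
    assert count_flows(pkts) == 2   # the 68 s gap splits the activity in two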
Traffic volatility
Collapse similar traffic matrices (over 100 s) into "clusters"
Take the 40 best TM clusters and classify each 100 s TM into the best-fitting cluster
  Traffic pattern changes nearly constantly
  => Unpredictable traffic
Need 50-60 clusters to cover a day's traffic
  => Lack of regularity; bad news for traffic engineering
Alternative proposed architectures
Goals
  Overcome limitations of today's typical architectures
  Use commodity standard equipment
A lot of research activity in recent years
We briefly look at two specific proposals
  Fat-tree-based solution from UCSD
  VL2 from Microsoft Research
Many others also exist
  PortLand (UCSD)
  DCell (Microsoft Research Asia, Tsinghua University, and UCLA)
  BCube (Microsoft Research Asia, Tsinghua University, UCLA, PKU, HUST)
  Monsoon and CamCube (Microsoft Research)
  ...
Some example issues with conventional architecture
Bandwidth oversubscription
  Ratio of allocated bandwidth per server to the guaranteed bandwidth per server is greater than 1 (see the worked example after this list)
  o Total bandwidth at the core/aggregation level is less than the summed-up bandwidth at the access level
  o Limited server-to-server capacity
No performance isolation
  Collateral damage
Fragmentation of resources
  Not possible for virtual machines to migrate from one VLAN to another while keeping their IP address
  Requires VLAN and IP reconfiguration
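A worked example with assumed numbers (the rack and uplink sizes are illustrative, not from the slides): a rack of 20 servers with 1 Gbps NICs whose ToR switch has two 1 Gbps uplinks gives

\[
\text{oversubscription} \;=\; \frac{20 \times 1\,\text{Gbps}}{2 \times 1\,\text{Gbps}} \;=\; 10:1,
\]

so when all servers send out of the rack at once, each is guaranteed only 1 Gbps / 10 = 100 Mbps.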
Oversubscription
Data centers run two kinds of applications:
  Outward facing (e.g. serving web pages to users)
  Internal computation (e.g. computing a search index)
Can't we just place servers that inter-communicate a lot adjacently?
  May be poorly predictable (cf. the case study on traffic analysis)
[Diagram: conventional tree of core routers (CR), access routers (AR), switches (S), load balancers (LB), and servers; 10:1 oversubscription or worse (80:1, 240:1) toward the core]
Oversubscription solution: FAT tree topology with a special look-up scheme
  Add more commodity switches
  o Carefully designed topology
  o All ports have the same capacity as servers
  Use two-level look-ups to distribute traffic
  o Enables use of all the "shortest paths"
  o Requires a specific addressing scheme
  Enables
  o Full bisection bandwidth
  o Even lower cost, because all switch ports have the same capacity
  Drawbacks
  o Needs customized switches (forwarding part)
  o A lot of cabling
M. Al-Fares et al. A Scalable, Commodity Data Center Network Architecture. In SIGCOMM 2008.
[Diagram: FAT tree with core switches, aggregation switches, and edge switches]
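The achievable scale follows from the fat-tree construction in the cited paper; a quick sizing helper (the function is mine, the k³/4 arithmetic is the paper's):

    def fat_tree_size(k):
        """Element counts for a fat-tree built from identical k-port switches
        (per the Al-Fares et al. construction)."""
        assert k % 2 == 0, "k must be even"
        return {
            "pods": k,
            "edge_switches": k * (k // 2),
            "aggregation_switches": k * (k // 2),
            "core_switches": (k // 2) ** 2,
            "servers": k ** 3 // 4,   # full bisection bandwidth to all of them
        }

    print(fat_tree_size(48)["servers"])  # 27648 servers from cheap 48-port switches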
No performance isolation
VLANs typically provide reachability isolation only
  One service sending/receiving too much traffic hurts all services sharing its subtree -> collateral damage
[Diagram: same conventional tree (CR, AR, S, LB, servers); one service's traffic floods the subtree it shares with others]
Performance isolation with VLB
Random flow spreading with Valiant Load Balancing (VLB)
  Similar FAT-tree topology with commodity switches
  Every flow is "bounced" off a random intermediate switch
  Provably hotspot-free for any admissible traffic matrix
  o Works also for unpredictable, non-regular traffic
  No need to modify switches (standard forwarding)
  o Relies on ECMP and clever addressing
  Needs some changes to servers, though
[Diagram: Clos topology — D/2 intermediate switches (the VLB bounce points), D aggregation switches with D ports each at 10G, and Top-Of-Rack switches with 20 servers each; (D²/4) × 20 servers in total]
A. Greenberg et al. VL2: A Scalable and Flexible Data Center Network. In SIGCOMM, 2009.
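A toy sketch of the VLB idea (VL2 itself realizes the random bounce with ECMP and encapsulation rather than explicit path computation; the switch names are made up):

    import random

    INTERMEDIATES = ["int-1", "int-2", "int-3", "int-4"]  # hypothetical names

    def vlb_path(src_tor, dst_tor):
        # Pick a random intermediate once per flow; all of the flow's packets
        # then go up to it and back down, spreading load over the core.
        bounce = random.choice(INTERMEDIATES)
        return [src_tor, bounce, dst_tor]

    print(vlb_path("tor-12", "tor-47"))  # e.g. ['tor-12', 'int-3', 'tor-47']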
Fragmentation of resources
VLANs used to isolate properties (services) from each other
IP addresses topologically determined by ARs
Reconfiguration of IPs and VLAN trunks is error-prone, slow, and often manual
[Diagram: same conventional tree; each service is pinned to the VLAN and address range of one subtree]
Solution: Name-location separation
Servers use flat names (x, y, z)
ToR switches
  Run link-state routing
  Maintain a switch-level topology
A thin shim layer added to the server OS (the VL2 agent)
  Remember HIP?
  No change to applications or to clients outside the DC
Centralized directory system for name resolution
Advantages
  Protects network and hosts from host-state churn
  Obviates host and switch reconfiguration
The directory system is a potential bottleneck
[Diagram: the VL2 agent sits between TCP/IP and the NIC (user/kernel), intercepts ARP, and encapsulates packets using a MAC resolution cache; remote IPs are resolved via the directory service, e.g. x→ToR2, y→ToR3, z→ToR4, with z's entry changing to ToR3 after migration]
A. Greenberg et al. VL2: A Scalable and Flexible Data Center Network. In SIGCOMM, 2009.
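A toy sketch of name-location separation (heavily simplified from VL2; the data structures are assumptions):

    # x, y, z are flat application-level names; ToR names stand in for locators.
    directory = {"x": "ToR2", "y": "ToR3", "z": "ToR4"}  # name -> locator

    def send(dst_name, payload):
        locator = directory[dst_name]        # resolve via the directory, not ARP
        return (locator, dst_name, payload)  # encapsulate toward the locator

    # When z's VM migrates, only the directory entry changes; z keeps its
    # name, and no host or switch is reconfigured:
    directory["z"] = "ToR3"
    print(send("z", b"hello"))  # ('ToR3', 'z', b'hello')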
TCP in the Data Center
TCP rules as the transport inside DCs
  99.9% of traffic
DCNs are a different environment for TCP than normal Internet end-to-end transport
  Very short delays
  Specific application workloads
How well does TCP work in DCNs? Several problems occur
  o Bursty packet drops, the incast problem, ...
  o TCP builds up large queues
    • Adds latency and wastes buffer space
Measurement study of TCP in DCN
Case study: Bing, Microsoft's search engine
Measurements from a 6000-server production cluster
Instrumentation passively collects logs
  Application-level
  Socket-level
  Selected packet-level
150TB of compressed data over a month
Partition/Aggregate Application Structure
The foundation for many large-scale web applications
  Web search, social network composition, ad selection, etc.
Time is money -> strict deadlines
  A missed deadline means a lower-quality result
[Diagram: requests from the Internet fan out from a top-level aggregator (deadline 250 ms) through mid-level aggregators (deadline 50 ms) to worker nodes (deadline 10 ms)]
Workloads

Workload                                      Size       Sensitivity
Partition/Aggregate (query)                   -          Delay-sensitive
Short messages (coordination, control state)  50KB-1MB   Delay-sensitive
Large flows (data update)                     1MB-50MB   Throughput-sensitive (background traffic)
Problems with TCP in DCN
  Incast
  Queue buildup
  Buffer pressure
All of these are caused by shared-memory switches
  Common memory pool for all ports
  Memory is a scarce resource
  Typical for commodity switches; cost is the reason
Problems: Incast
Synchronized mice collide: caused by Partition/Aggregate
  Workers' responses arrive at the aggregator nearly simultaneously
Always causes a TCP timeout (no duplicate ACKs to trigger fast retransmit)
  RTOmin = 300 ms
[Diagram: Workers 1-4 answer the aggregator at the same instant; their packets collide in the shared switch buffer]
Alleviating Incast through Jittering
Requests are jittered: each is deliberately delayed over a 10 ms window -> desynchronization
Jittering trades off the median against the high percentiles
[Plot: MLA query completion time (ms) over the day; jittering was switched off around 8:30 am]
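One plausible implementation sketch of jittering at the aggregator (the async structure is my assumption; the slide only specifies the 10 ms window):

    import asyncio, random

    JITTER_WINDOW = 0.010  # 10 ms

    async def send_jittered(send_fn, request):
        # Delay each request by a random amount within the window so that
        # workers' responses are desynchronized instead of colliding.
        await asyncio.sleep(random.uniform(0, JITTER_WINDOW))
        return await send_fn(request)

    async def fan_out(send_fn, requests):
        return await asyncio.gather(
            *(send_jittered(send_fn, r) for r in requests))

    # Usage: asyncio.run(fan_out(my_rpc_call, worker_requests)); responses
    # now trickle in over ~10 ms, at the cost of a few ms of median latency.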
Problems: Queue Buildup
Big flows build up queues
  Increased latency for short flows
  Packet loss
Measurements in the Bing cluster
  For 90% of packets: RTT < 1 ms
  For 10% of packets: 1 ms < RTT < 15 ms
[Diagram: Sender 1 and Sender 2 share a switch port toward the receiver; short flows wait behind the big flow's queue]
Problems: Buffer pressure
A similar phenomenon to the previous one
  Increased queuing delay and packet loss due to long flows traversing other ports
  Shared memory pool
Data Center Transport Requirements
1. High burst tolerance
   Incast due to Partition/Aggregate is common
2. Low latency
   Short flows, queries
3. High throughput
   Continuous data updates, large file transfers
The challenge is to achieve all three together.
Tension Between Requirements
[Diagram: triangle of high burst tolerance, high throughput, and low latency, with DCTCP at the center]
Deep buffers: queuing delays increase latency
Shallow buffers: bad for bursts & throughput
Reduced RTOmin (SIGCOMM '09): doesn't help latency
AQM (RED): average queue not fast enough for incast
Objective: low queue occupancy & high throughput
Review: TCP with ECN
ECN = Explicit Congestion Notification
[Diagram: Sender 1 and Sender 2 send through a switch to the receiver; the congested switch sets a 1-bit ECN mark on packets]
DCTCP: Two key ideas
1. React in proportion to the extent of congestion, not its presence
   Reduces variance in sending rates, lowering queuing requirements
2. Mark based on instantaneous queue length
   Fast feedback to better deal with bursts

ECN marks     TCP                DCTCP
1011110111    Cut window by 50%  Cut window by 40%
0000000001    Cut window by 50%  Cut window by 5%
Data Center TCP Algorithm
Switch side:
  Mark packets when queue length > K
  [Diagram: queue with threshold K; packets arriving above K are marked, below K are not]
Sender side:
  Maintain a running average of the fraction of packets marked (α). In each RTT:
    α ← (1 − g)·α + g·F, where F is the fraction of packets marked in the last RTT and g is a fixed weight
  Adaptive window decrease:
    W ← (1 − α/2)·W
  Note: the decrease factor is between 1 and 2.
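A minimal simulation sketch of the sender-side update (the structure follows the DCTCP paper; the weight g = 1/16 is an assumed value):

    G = 1.0 / 16  # EWMA weight g (an assumption; the paper tunes this)

    class DctcpSender:
        def __init__(self, cwnd=10.0):
            self.cwnd = cwnd    # congestion window, in packets
            self.alpha = 0.0    # running estimate of the fraction marked

        def on_rtt(self, marked, total):
            f = marked / total                          # F for this RTT
            self.alpha = (1 - G) * self.alpha + G * f   # alpha <- (1-g)a + gF
            if marked:
                self.cwnd *= 1 - self.alpha / 2         # proportional decrease
            else:
                self.cwnd += 1                          # normal additive increase

    # A first mark barely dents the window while alpha is still small,
    # whereas standard TCP would halve it; as alpha converges toward a
    # steady marked fraction of 0.1, the per-RTT cut approaches 5%.
    s = DctcpSender()
    s.on_rtt(marked=1, total=10)
    print(round(s.cwnd, 3))  # 9.969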
DCTCP in Action
Setup: Windows 7 hosts, Broadcom 1 Gbps switch. Scenario: 2 long-lived flows, K = 30 KB.
[Plot: instantaneous queue length (Kbytes) over time for TCP vs. DCTCP]
Why it Works
1. High burst tolerance
   Large buffer headroom → bursts fit
   Aggressive marking → sources react before packets are dropped
2. Low latency
   Small buffer occupancies → low queuing delay
3. High throughput
   ECN averaging → smooth rate adjustments, low variance
Analysis
How low can DCTCP maintain queues without loss of throughput? How do we set the DCTCP parameters?
Need to quantify the queue-size oscillations (stability).
[Diagram: sawtooth of the window size over time, oscillating between (W*+1)(1−α/2) and W*+1 around W*; packets sent in the RTT at the peak are marked]
Result: DCTCP needs 85% less buffer than TCP.
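Reading the oscillation amplitude straight off that sawtooth (a derivation from the quantities shown on the slide, not a result stated there): the window peaks at W*+1, where marking kicks in, and drops to (W*+1)(1−α/2), so the peak-to-trough amplitude in packets is

\[
A \;=\; (W^*+1) - (W^*+1)\Bigl(1 - \frac{\alpha}{2}\Bigr) \;=\; \frac{\alpha}{2}\,(W^*+1).
\]

Since α is small when marking is rare, this is far below standard TCP's halving, \(A_{\mathrm{TCP}} = (W^*+1)/2\), which is why DCTCP can keep queues — and hence switch buffers — so much smaller.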
Cluster Traffic Benchmark
Emulate traffic within 1 rack of a Bing cluster
  45 1G servers, one 10G server for external traffic
Generate query and background traffic
  Flow sizes and arrival times follow the distributions seen in Bing
Metric: flow completion time for queries and background flows
We use RTOmin = 10 ms for both TCP & DCTCP.
Baseline
[Plots: flow completion times for background flows and query flows, TCP vs. DCTCP]
Low latency for short flows. High throughput for long flows. High burst tolerance for query flows.
DCTCP summary
DCTCP
  Handles bursts well
  Keeps queuing delays low
  Achieves high throughput
Features
  A simple change to TCP and a single switch parameter
  Based on existing mechanisms
Wrapping up
Data center networks pose specific networking challenges
  Potentially huge scale
  Different requirements than traditional Internet applications
Recently a lot of research activity
  New proposed architectures and protocols
  A big deal to companies with mega-scale data centers: $$
The popularity of cloud computing accelerates this evolution
Want to know more?
1. M. Arregoces and M. Portolani. Data Center Fundamentals. Cisco Press, 2003.
2. S. Kandula, S. Sengupta, A. Greenberg, P. Patel, and R. Chaiken. The Nature of Data Center Traffic: Measurements & Analysis. In Proceedings of IMC 2009.
3. V. Vasudevan, A. Phanishayee, H. Shah, E. Krevat, D. G. Andersen, G. R. Ganger, G. A. Gibson, and B. Mueller. Safe and Effective Fine-grained TCP Retransmissions for Datacenter Communication. In Proceedings of ACM SIGCOMM 2009.
4. A. Greenberg et al. VL2: A Scalable and Flexible Data Center Network. In Proceedings of ACM SIGCOMM 2009.
5. C. Guo et al. DCell: A Scalable and Fault-Tolerant Network Structure for Data Centers. In Proceedings of ACM SIGCOMM 2008.
6. M. Al-Fares, A. Loukissas, and A. Vahdat. A Scalable, Commodity Data Center Network Architecture. In Proceedings of ACM SIGCOMM 2008.
7. R. Niranjan Mysore, A. Pamboris, N. Farrington, N. Huang, P. Miri, S. Radhakrishnan, V. Subramanya, and A. Vahdat. PortLand: A Scalable Fault-Tolerant Layer 2 Data Center Network Fabric. In Proceedings of ACM SIGCOMM 2009.
8. D. A. Joseph, A. Tavakoli, and I. Stoica. A Policy-Aware Switching Layer for Data Centers. In Proceedings of ACM SIGCOMM 2008.
9. C. Guo, G. Lu, D. Li, H. Wu, X. Zhang, Y. Shi, C. Tian, Y. Zhang, and S. Lu. BCube: A High Performance, Server-centric Network Architecture for Modular Data Centers. In Proceedings of ACM SIGCOMM 2009.
10. H. Abu-Libdeh, P. Costa, A. Rowstron, G. O'Shea, and A. Donnelly. Symbiotic Routing in Future Data Centers. In Proceedings of ACM SIGCOMM 2010.