Chapter 6
Paper Study: Data Center Networking
Data Center Networks
Major theme: What are the new networking issues posed by large-scale data centers?
Network architecture? Topology design? Addressing? Routing? Forwarding?
Please do the required readings!
Data Center Interconnection Structure
Nodes in the system: racks of servers How are the nodes (racks) inter-connected?
Typically a hierarchical inter-connection structure
Today’s typical data center structure (the Cisco recommended structure), starting from the bottom level:
Rack switches
1-2 layers of (layer-2) aggregation switches
Access routers
Core routers
Is such an architecture good enough?
Cisco Recommended Data Center Structure
[Figure: Cisco recommended data center structure — core routers (CR) connect to the Internet; below them, access routers (AR), layer-2 switches (S) with load balancers (LB), and racks of servers. Layer 3 above, Layer 2 below.]
Key: CR = L3 Core Router; AR = L3 Access Router; S = L2 Switch; LB = Load Balancer; A = Rack of 20 servers with Top of Rack switch
Data Center Design Requirements
Data centers typically run two types of applications:
Outward facing (e.g., serving web pages to users)
Internal computations (e.g., MapReduce for web indexing)
Workloads are often unpredictable:
Multiple services run concurrently within a DC
Demand for new services is unexpected
Failures of servers are the norm
Recall that GFS, MapReduce, etc., resort to dynamic reassignment of chunkservers and jobs/tasks (worker servers) to deal with failures; data is often replicated across racks, …
The “traffic matrix” between servers is constantly changing
Data Center Costs
Total cost varies
Power consumption in the data center
Server costs dominate
Network costs significant
Long provisioning timescales: new servers purchased quarterly at best
Amortized Cost*   Component               Sub-Components
~45%              Servers                 CPU, memory, disk
~25%              Power infrastructure    UPS, cooling, power distribution
~15%              Power draw              Electrical utility costs
~15%              Network                 Switches, links, transit

*3-yr amortization for servers, 15-yr for infrastructure; 5% cost of money
Source: “The Cost of a Cloud: Research Problems in Data Center Networks,” Sigcomm CCR 2009. Greenberg, Hamilton, Maltz, Patel.
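The footnote’s amortization can be made concrete with the standard annuity formula. This is a sketch of the arithmetic implied by the footnote’s parameters, not code from the cited paper:

```python
def monthly_cost(principal, years, annual_rate=0.05):
    """Amortized monthly cost of an asset using the standard annuity
    formula, with the footnote's parameters: 5% cost of money,
    3-yr amortization for servers, 15-yr for infrastructure."""
    r = annual_rate / 12          # monthly cost of money
    n = years * 12                # number of monthly payments
    return principal * r / (1 - (1 + r) ** -n)

# Per dollar of capital, a server amortized over 3 years costs roughly
# 4x per month what 15-yr power infrastructure does -- which is why
# servers dominate the amortized-cost breakdown above.
server_monthly = monthly_cost(1.0, 3)
infra_monthly = monthly_cost(1.0, 15)
```

This helps explain why ~45% of amortized cost is servers even though infrastructure capital outlays are also large.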
Overall Data Center Design Goal
Agility – Any service, any server
Turn the servers into a single large fungible pool
Let services “breathe”: dynamically expand and contract their footprint as needed
At the computation/storage layer, this is done by Google’s GFS (Google File System), BigTable, and MapReduce.
Benefits
Increased service developer productivity
Lower cost
High performance and reliability
Achieving Agility
These are the three motivators for most data center infrastructure projects!
1. Workload Management
Means for rapidly installing a service’s code on a server
Dynamic cluster scheduling and server assignment
E.g., MapReduce, BigTable, virtual machines, disk images
2. Storage Management
Means for a server to access persistent data
Distributed file systems (e.g., GFS)
3. Network Management
Means for communicating with other servers, regardless of where they are in the data center
Achieve high performance and reliability
Networking Objectives
1. Uniform high capacity
Capacity between servers limited only by their NICs
No need to consider topology when adding servers
High capacity between any two servers, no matter which racks they are located in!
2. Performance isolation
Traffic of one service should be unaffected by others
3. Ease of management: “plug-&-play” (layer-2 semantics)
Flat addressing, so any server can have any IP address
Server configuration is the same as in a LAN
Legacy applications depending on broadcast must work
Is Today’s DC Architecture Adequate?
[Figure: the same Cisco-style hierarchy shown earlier — CR/AR layers over L2 switches, load balancers, and racks. Key: CR = L3 Core Router; AR = L3 Access Router; S = L2 Switch; LB = Load Balancer; A = Top of Rack switch]
• Uniform high capacity?
• Performance isolation? Typically via VLANs
• Agility in terms of dynamically adding or shrinking servers?
• Agility in terms of adapting to failures, and to traffic dynamics?
• Ease of management?
Today’s answers:
• Hierarchical network; 1+1 redundancy
• Components higher in the DC hierarchy must have capacity to handle more aggregated traffic
• More expensive; more effort spent on availability (a scale-up design)
• Servers connect via 1 Gbps links to Top-of-Rack switches
• Other links are a mix of 1G and 10G; fiber and copper
Paper Study
1. A Scalable, Commodity Data Center Network Architecture
A new fat-tree “inter-connection” structure (topology) to increase “bisection” bandwidth
Needs “new” addressing, forwarding/routing
2. VL2: A Scalable and Flexible Data Center Network
Consolidates layer-2/layer-3 into a “virtual layer 2”
Separates “naming” and “addressing”; also deals with dynamic load-balancing issues
Optional Materials
PortLand: A Scalable Fault-Tolerant Layer 2 Data Center Network Fabric
BCube: A High-Performance, Server-centric Network Architecture for Modular Data Centers
A Scalable, Commodity Data Center Network Architecture
Main goal: address the limitations of today’s data center network architecture
Avoid single points (links) of failure
Avoid oversubscription of links in the topology
Trade off cost against the performance provided
Key Design Considerations/Goals
Allow host communication at line speed, no matter where the hosts are located!
Backwards compatible with existing infrastructure: no changes to applications; support for layer 2 (Ethernet)
Cost effective: cheap infrastructure; low power consumption & heat dissipation
M. Al-Fares, A. Loukissas, and A. Vahdat, “A scalable, commodity data center network architecture,” in Proceedings of the ACM SIGCOMM 2008 Conference on Data Communication, pages 63–74, 2008.
Fat-Tree Based DC Architecture
Inter-connect racks (of servers) using a fat-tree topology
Fat-tree: a special type of Clos network
K-ary fat tree: three-layer topology (edge, aggregation, and core)
Each pod consists of (k/2)² servers and 2 layers of k/2 k-port switches
Each edge switch connects to k/2 servers and k/2 aggregation switches
Each aggregation switch connects to k/2 edge and k/2 core switches
(k/2)² core switches: each connects to k pods
Fat-tree with K=4
Fat tree network with K = 3 supporting 54 hosts
Fat-Tree Based Topology … Why Fat-Tree?
A fat tree has identical bandwidth at any bisection
Each layer has the same aggregate bandwidth
Can be built using cheap devices with uniform capacity
Each port supports the same speed as an end host
All devices can transmit at line speed if packets are distributed uniformly along available paths
Great scalability: k-port switches support k³/4 servers
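The counts stated above follow directly from the k-ary construction; a small helper makes them easy to check (the function name is mine, not from the paper):

```python
def fat_tree_sizes(k):
    """Element counts for a k-ary fat tree (k even), per the construction:
    k pods, each with (k/2)**2 servers and k/2 edge + k/2 aggregation
    k-port switches; (k/2)**2 core switches, each touching all k pods."""
    assert k % 2 == 0, "k must be even"
    return {
        "servers": k ** 3 // 4,          # k pods * (k/2)^2 servers
        "edge": k * (k // 2),            # k/2 edge switches per pod
        "aggregation": k * (k // 2),     # k/2 aggregation switches per pod
        "core": (k // 2) ** 2,
    }
```

For k = 4 this gives the 16-host topology in the figure; k = 48 commodity switches already support 27,648 servers.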
Cost of Maintaining Switches
[Figure: cost of the network as the number of hosts scales, comparing designs built from 10G switches vs. 1G switches.]
Fat-tree Topology is Great, But …
Is using a fat-tree topology to inter-connect racks of servers in itself sufficient?
What routing protocols should we run on these switches?
Layer 2 switching: data-plane flooding!
Layer 3 IP routing:
shortest-path IP routing will typically use only one path, despite the path diversity in the topology
if equal-cost multi-path routing is used at each switch independently and blindly, packet re-ordering may occur; further, load may not necessarily be well balanced
Aside: control-plane flooding!
Fat-Tree Modified
Enforce a special (IP) addressing scheme in the DC:
unused.podnumber.switchnumber.endhost
Allows hosts attached to the same switch to route only through that switch
Allows intra-pod traffic to stay within the pod
Use two-level look-ups to distribute traffic and maintain packet ordering:
First level is a prefix lookup
Second level is a suffix lookup
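The two-level lookup can be sketched as follows. This models an aggregation switch in pod 1 of a k = 4 fat tree with the 10.pod.switch.host scheme; the table contents are illustrative, not taken verbatim from the paper:

```python
# First level: prefixes for subnets inside this switch's own pod,
# pointing at downward ports (intra-pod traffic stays in the pod).
prefix_table = {(1, 0): 0, (1, 1): 1}   # (pod, switch) -> down port

# Second level: suffixes on the host byte, spreading inter-pod traffic
# across upward ports toward the core while keeping each host's
# packets on one path (preserving per-host packet ordering).
suffix_table = {2: 2, 3: 3}             # host byte -> up port

def lookup(dst):
    """Two-level lookup: prefix match first; on a miss, fall through
    to a suffix match on the host byte. dst = (10, pod, switch, host)."""
    _, pod, switch, host = dst
    if (pod, switch) in prefix_table:   # prefix hit: route down
        return prefix_table[(pod, switch)]
    return suffix_table[host]           # prefix miss: suffix picks an uplink
```

Traffic to hosts in the local pod goes down; traffic to other pods is diffused over the uplinks by host byte.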
More on Fat-Tree DC Architecture: Diffusion Optimizations
Flow classification
Eliminates local congestion
Assigns traffic to ports on a per-flow basis instead of a per-host basis
Flow scheduling
Eliminates global congestion
Prevents long-lived flows from sharing the same links
Assigns long-lived flows to different links
What are potential drawbacks of this architecture?
VL2: A Scalable and Flexible Data Center Network
Main goal: support agility & be cost-effective
A virtual (logical) layer-2 architecture for connecting racks of servers (network as one big “virtual switch”)
Employs a 3-level Clos topology (full mesh in the top 2 levels) with non-uniform switch capacities
Also provides identity/location separation
“application-specific” vs. “location-specific” addresses
Employs a directory service for name resolution
Needs direct host participation
Explicitly accounts for DC traffic-matrix dynamics
Employs the Valiant load-balancing (VLB) technique
Uses randomization to cope with volatility
A. Greenberg, J. Hamilton, N. Jain, S. Kandula, C. Kim, P. Lahiri, D. Maltz, P. Patel, and S. Sengupta, “VL2: A scalable and flexible data center network,” in Proceedings of the ACM SIGCOMM 2009 Conference on Data Communication, pages 51–62, 2009.
Clos network
A Clos network is a kind of multistage circuit-switching network, proposed by Charles Clos in 1953.
It represents a theoretical idealization of practical multi-stage telephone switching systems.
Clos networks are needed when the physical circuit-switching needs exceed the capacity of the largest feasible single crossbar switch.
The key advantage of Clos networks is that the number of crosspoints (which make up each crossbar switch) required can be far smaller than if the entire switching system were implemented with one large crossbar switch.
Clos network
Clos networks have three stages: the ingress stage, the middle stage, and the egress stage.
Each stage is made up of a number of crossbar switches (see diagram below), often just called crossbars.
Each call entering an ingress crossbar switch can be routed through any of the available middle-stage crossbar switches to the relevant egress crossbar switch.
A middle-stage crossbar is available for a particular new call if both the link connecting the ingress switch to the middle-stage switch and the link connecting the middle-stage switch to the egress switch are free.
Clos network
Clos networks are defined by three integers n, m, and r.
n represents the number of sources which feed into each of the r ingress-stage crossbar switches.
Each ingress-stage crossbar switch has m outlets, and there are m middle-stage crossbar switches.
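The classic blocking results follow from these parameters. The two conditions below are standard (Clos 1953 for the strict-sense case; Slepian–Duguid for the rearrangeable case):

```python
def strict_sense_nonblocking(n, m):
    """Clos's 1953 result: with n inputs per ingress switch and m
    middle-stage switches, a new call can always be routed without
    disturbing existing calls iff m >= 2n - 1."""
    return m >= 2 * n - 1

def rearrangeably_nonblocking(n, m):
    """Slepian-Duguid theorem: m >= n suffices if existing calls may
    be rerouted (rearranged) to admit the new one."""
    return m >= n
```

Data center Clos fabrics such as VL2 operate in the rearrangeable regime, relying on many parallel middle-stage paths rather than strict-sense nonblocking guarantees.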
Bidirectional Paths
Routing in Bidirectional Paths
These networks are multi-path
Routing takes place in two steps: route to an intermediate node, followed by routing to the destination
1. Multiple intermediate nodes can be selected
2. The path from the intermediate node to the destination is unique
Moving to Fat Trees
Nodes at tree leaves
Switches at tree vertices
Total link bandwidth is constant across all tree levels, with full bisection bandwidth
Equivalent to folded Benes topology
Preferred topology in many system area networks
Folded Clos = Folded Benes = Fat tree network
[Figure: folded Clos / fat-tree network with nodes 0–15 at the leaves; the dashed line marks the network bisection. © T.M. Pinkston, J. Duato, with major contributions by J. Flich]
Fat Trees: Another View
Equivalent to the preceding multistage implementation
Common topology in many supercomputer installations
[Figure: fat tree with forward and backward links]
Specific Objectives and Solutions

Objective                                   Solution Approach
1. Layer-2 semantics                        Employ flat addressing; name-location separation & resolution service
2. Uniform high capacity between servers    Guarantee bandwidth for hose-model traffic; flow-based random traffic indirection (Valiant LB)
3. Performance isolation                    Enforce hose model using existing mechanisms only
VL2 Topology Design: Scale-out vs. Scale-up
Argue for and exploit the gap between switch-to-switch and switch-to-server capacities
current: 10 Gbps vs. 1 Gbps; future: 40 Gbps vs. 10 Gbps
A scale-out design with broad layers
E.g., a 3-level Clos topology with full mesh in the top 2 levels: ToR switches, aggregation switches, & core (intermediate) switches
Less wiring complexity, and more path diversity
Same bisection capacity at each layer: no oversubscription
Extensive path diversity: graceful degradation under failure
VL2 Topology: Example
[Figure: 3-level Clos example — D/2 intermediate switches (the intermediate nodes in VLB, connecting to the Internet; D_I ports × 10G), D aggregation switches (D_A/2 ports × 10G), and Top-of-Rack switches (2 × 10G uplinks, 20 server ports), supporting [D_A·D_I/4] × 20 servers.]

Node degree (D) of available switches & # servers supported:
D       # Servers in pool
4       80
24      2,880
48      11,520
144     103,680
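The table above can be reproduced from the slide’s formula [D_A·D_I/4] × 20, taking D_A = D_I = D:

```python
def vl2_servers(d):
    """Servers supported by the example VL2 topology: the slide's
    formula [D_A * D_I / 4] * 20 with D_A = D_I = D, i.e. D*D/4
    ToR switches with 20 servers each."""
    return (d * d // 4) * 20

# Reproduces the table: D=4 -> 80, 24 -> 2,880, 48 -> 11,520, 144 -> 103,680
```

Note the quadratic growth in D: doubling switch port count quadruples the server pool.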
Addressing and Routing
Address resolution and packet forwarding
LA (location-specific IP address)
Assigned to all switches and interfaces.
Any packet encapsulated with an LA is forwarded along the shortest path.
AA (application-specific IP address)
Associated with an LA: the IP address of the ToR switch to which the application server is connected.
Addressing and Routing (cont.)
Packet forwarding steps
The sender’s host agent encapsulates the packet, setting the destination of the outer header to the LA associated with the destination AA.
Once the packet arrives at that LA, the destination ToR switch decapsulates the packet and delivers it to the destination AA.
Address resolution and access control
The first time a host sends to an AA, it generates an ARP request for that AA.
The source’s network stack intercepts the ARP request and converts it to a unicast query to the directory server (DS).
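The steps above can be sketched as follows. The class and method names here are hypothetical illustrations of the host-agent behavior, not the paper’s implementation:

```python
class VL2Agent:
    """Illustrative sketch of a VL2 host agent: resolves AAs to LAs
    via the directory service and encapsulates outgoing packets."""

    def __init__(self, directory):
        self.directory = directory   # directory service: AA -> ToR LA
        self.cache = {}              # local resolution cache

    def resolve(self, aa):
        # ARP interception: instead of broadcasting an ARP request,
        # unicast a query to the directory service and cache the answer.
        if aa not in self.cache:
            self.cache[aa] = self.directory[aa]
        return self.cache[aa]

    def send(self, dst_aa, payload):
        la = self.resolve(dst_aa)
        # Encapsulate: the fabric routes only on the outer LA; the
        # destination ToR decapsulates and delivers to the inner AA.
        return {"outer_dst_la": la, "inner_dst_aa": dst_aa,
                "payload": payload}
```

The key point the sketch shows: switches never see AAs, so host churn never touches switch state.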
Address Resolution and Packet Forwarding
Addressing and Routing: Name-Location Separation
[Figure: server x (at ToR2) sends packets to y and z; each packet is encapsulated with the destination ToR’s LA. The directory service stores the AA-to-LA mappings (x → ToR2, y → ToR3, z → ToR4, later updated to z → ToR3) and answers lookup & response queries.]
Servers use flat names
Switches run link-state routing and maintain only switch-level topology
Copes with host churn with very little overhead
• Allows use of low-cost switches
• Protects network and hosts from host-state churn
• Obviates host and switch reconfiguration
Maintain host information using the DS
The DS provides 2 key functions:
1. Lookups and updates for AA-to-LA mappings.
2. A reactive cache-update mechanism that ensures eventual consistency of the mappings with very little update overhead.
Two tiers with different performance goals in the DS architecture
Read-optimized: replicated lookup servers that cache AA-to-LA mappings and communicate with host agents.
Write-optimized: asynchronous replicated state machine (RSM) servers offering a strongly consistent, reliable store of AA-to-LA mappings.
Directory System
Replicated state machine (RSM): offers a strongly consistent, reliable store of AA-to-LA mappings
Use Randomization to Cope with Volatility
Valiant Load Balancing (VLB)
Every flow “bounced” off a random intermediate switch
Provably hotspot-free for any admissible traffic matrix
Servers could randomize flow-lets if needed
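Per-flow bouncing is typically done by hashing the flow identifier, so all packets of a flow take the same two-hop path (preserving packet order) while distinct flows spread uniformly. A minimal sketch of that selection step, with illustrative names:

```python
import hashlib

def vlb_intermediate(flow, intermediates):
    """Pick the 'random' intermediate switch for a flow by hashing its
    5-tuple. Deterministic per flow (no packet re-ordering), but
    approximately uniform across flows (no persistent hotspots for
    any admissible traffic matrix)."""
    digest = hashlib.sha256(repr(flow).encode()).hexdigest()
    return intermediates[int(digest, 16) % len(intermediates)]
```

Because the hash is stable, retransmissions and later packets of the same flow never change path; load balancing happens across flows, not within them.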
VL2 Summary
VL2 achieves agility at scale via:
1. L2 semantics
2. Uniform high capacity between servers
3. Performance isolation between services
Lessons
• Randomization can tame volatility
• Add functionality where you have control
• There’s no need to wait!
Additional Case Studies
Optional Material
PortLand: A Scalable Fault-Tolerant Layer 2 Data Center Network Fabric
Main idea: a new “hierarchical” addressing scheme to facilitate dynamic and fault-tolerant routing/forwarding
R. Niranjan Mysore, A. Pamboris, N. Farrington, N. Huang, P. Miri, S. Radhakrishnan, V. Subramanya, and A. Vahdat, “PortLand: a scalable fault-tolerant layer 2 data center network fabric,” in Proceedings of the ACM SIGCOMM 2009 Conference on Data Communication, pages 39–50, 2009.
PortLand: A Scalable Fault-Tolerant Layer 2 Data Center Network Fabric
PortLand is a single “logical layer 2” data center network fabric that scales to millions of endpoints
PortLand internally separates host identity from host location
uses the IP address as the host identifier
introduces “Pseudo MAC” (PMAC) addresses internally to encode endpoint location
PortLand runs on commodity switch hardware with unmodified hosts
PortLand Requirements
Any VM may migrate to any physical machine. Migrating VMs should not have to change their IP addresses, since doing so would break pre-existing TCP connections and application-level state.
An administrator should not need to configure any switch before deployment.
Any end host should be able to efficiently communicate with any other end host in the data center along any of the available physical communication paths.
There should be no forwarding loops.
Failures will be common at scale, so failure detection should be rapid and efficient.
Existing unicast and multicast sessions should proceed unaffected to the extent allowed by underlying physical connectivity.
Design Goals for Network Fabric
Support for agility!
Easy configuration and management: plug-&-play
Fault tolerance, routing and addressing: scalability
Commodity switch hardware: small switch state
Virtualization support: seamless VM migration
What are the limitations of current layer-2 and layer-3?
layer-2 (Ethernet w/ flat addressing) vs. layer-3 (IP w/ prefix-based addressing): plug-&-play? scalability? small switch state? seamless VM migration?
PortLand Solution
Assumption: a fat-tree network topology for the DC
Introduce “pseudo MAC addresses” (PMACs) to balance the pros and cons of flat vs. topology-dependent addressing
PMACs are “topology-dependent” hierarchical addresses
Used only as “host locators,” not “host identities”
IP addresses used as “host identities” (for compatibility w/ apps)
Pros: small switch state & seamless VM migration
“Eliminates” flooding in both data & control planes
But requires an IP-to-PMAC mapping and name resolution: a location directory service
And a location discovery protocol & fabric manager for support of “plug-&-play”
PMAC Addressing Scheme
PMAC (48 bits): pod.position.port.vmid
pod: 16 bits; position and port: 8 bits each; vmid: 16 bits
Assigned only to servers (end hosts) – by switches
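The 48-bit layout (16 + 8 + 8 + 16 bits, per the field widths above) can be packed and unpacked with simple bit operations; the helper names are mine:

```python
def encode_pmac(pod, position, port, vmid):
    """Pack a 48-bit PMAC pod.position.port.vmid using the stated
    field widths: pod 16 bits, position 8, port 8, vmid 16."""
    assert pod < 1 << 16 and position < 1 << 8
    assert port < 1 << 8 and vmid < 1 << 16
    return (pod << 32) | (position << 24) | (port << 16) | vmid

def decode_pmac(pmac):
    """Inverse of encode_pmac: recover (pod, position, port, vmid)."""
    return (pmac >> 32, (pmac >> 24) & 0xFF,
            (pmac >> 16) & 0xFF, pmac & 0xFFFF)
```

Because the high-order bits encode the pod, a switch can forward on a PMAC prefix alone, which is what keeps PortLand’s switch state small.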
Location Discovery Protocol
Location Discovery Messages (LDMs) exchanged between neighboring switches
LDMs contain the following information: switch identifier, pod number, position, tree level, and up/down orientation.
PortLand: Name Resolution
Edge switch listens to end hosts and discovers new source MACs
Installs <IP, PMAC> mappings and informs the fabric manager
PortLand: Name Resolution …
1. Edge switch intercepts ARP messages from end hosts
2. Sends a request to the fabric manager, which replies with the PMAC
PortLand: Fabric Manager
Fabric manager: a logically centralized, multi-homed server
Maintains topology and <IP, PMAC> mappings in “soft state”
Loop-free Forwarding and Fault-Tolerant Routing
Switches build forwarding tables based on their position: edge, aggregation, or core
Use strict “up-down semantics” to ensure loop-free forwarding
Load balancing: use any equal-cost multi-path (ECMP) routing path, via flow hashing to ensure packet ordering
Fault-tolerant routing:
Mostly concerned with detecting failures
Fabric manager maintains a logical fault matrix with per-link connectivity info and informs affected switches
Affected switches re-compute their forwarding tables
Fault-Tolerant Routing
1. Upon not receiving an LDM (also referred to as a keepalive in this context) for some configurable period of time, a switch assumes a link failure.
(A keepalive (KA) is a message sent by one device to another to check that the link between the two is operating, or to prevent the link from being broken.)
2. The detecting switch informs the fabric manager about the failure.
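The timeout logic in step 1 can be sketched as follows; the class and method names are illustrative, not from the PortLand paper:

```python
class LinkMonitor:
    """Sketch of LDM-based failure detection: a link is presumed
    failed when no LDM (keepalive) has arrived within `timeout`
    seconds. Failures found here are what the switch would report
    to the fabric manager in step 2."""

    def __init__(self, timeout):
        self.timeout = timeout
        self.last_ldm = {}                  # link -> time of last LDM

    def ldm_received(self, link, now):
        self.last_ldm[link] = now           # refresh the keepalive timer

    def failed_links(self, now):
        return [link for link, t in self.last_ldm.items()
                if now - t > self.timeout]
```

Since LDMs are already exchanged for location discovery, reusing them as keepalives costs no extra control traffic.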
Fault-Tolerant Routing
3. The fabric manager maintains a logical fault matrix with per-link connectivity information for the entire topology and updates it with the new information.
4. The fabric manager informs all affected switches of the failure, which then individually recalculate their forwarding tables based on the new version of the topology.
Multicast: Fault Detection and Action
There is one sender (in pod 1) and three receivers, spread across pods 0 and 1.
The sender forwards packets to the designated core switch, which in turn distributes the packets to the receivers.
In step 1, two highlighted links in pod 0 simultaneously fail.
Multicast: Fault detection and action
Two aggregation switches detect the failure in step 2 and notify the fabric manager, which in turn updates its fault matrix in step 3.
The fabric manager calculates forwarding entries for all affected multicast groups in step 4.
BCube: A High-Performance, Server-centric Network Architecture for Modular Data Centers
Main goal: a network architecture for shipping-container-based modular data centers
BCube construction: leveled structure
BCubek is recursively constructed from BCubek-1
Server-centric: servers perform routing and forwarding
Considers a variety of communication patterns: one-to-one, one-to-many, one-to-all, all-to-all; single-path and multi-path routing
(Will not be discussed in detail; please read the paper!)
BCube Construction
Leveled structure: BCubek is recursively constructed from n BCubek-1’s and n^k n-port switches
[Figure: BCube1 and BCubek structures, n = 4]
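The recursion above determines the element counts directly; a small helper (name mine) makes them explicit:

```python
def bcube_sizes(n, k):
    """Element counts for BCube_k built from n-port switches: the
    recursion (n copies of BCube_(k-1) plus n**k new switches) yields
    n**(k+1) servers and k+1 levels of n**k switches each."""
    return {"servers": n ** (k + 1),
            "switches": (k + 1) * n ** k}
```

For n = 4, k = 1 (the BCube1 in the figure) this gives 16 servers and 8 switches; a container-scale BCube3 with 8-port switches reaches 4,096 servers.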
One-to-All Traffic Forwarding
Using a spanning tree
Speed-up: L/(k+1) for a file of size L
[Figure: two edge-disjoint (server) spanning trees in BCube1 for one-to-all traffic]