TRANSCRIPT
-
The Time-less Datacenter
Paul Borrill and Alan H. Karp, Earth Computing
The Datacenter Resilience Company
Stanford EE Computer Systems Colloquium, Wednesday, November 16, 2016
http://ee380.stanford.edu
-
The Three Taxes: 1. Complexity 2. Fragility 3. Vulnerability
Cloud Computing
2
-
Earth Computing | The Datacenter Resilience Company
Twitter Today
Systems can fail in catastrophic ways, leading to death or tremendous financial loss. Although there are many potential causes, including physical failure, human error, and environmental factors, design errors are increasingly becoming the most serious culprit.*
3
*NASA Formal Methods Program: https://shemesh.larc.nasa.gov/fm/fm-why-new.html
-
Key Computer Science Problems
Reliable Consensus
• Generals Problem (no fixed-length protocol exists that guarantees a reliable solution in an environment where messages can get lost)
• Slow Node vs. Link Failure Indistinguishability, i.e., what can one side of a failed link assume about a partner or cohort on the other side?
FLP Result
• Impossibility of Distributed Consensus with One Faulty Process
Key Idea:
• Don't depend on processes to provide liveness; use a new kind of link
4
-
Problem: Event Ordering is Hard
• In a distributed system over a general network, we can't tell whether an event at process P happened before an event at process Q unless P caused Q's event in some way
• Causal Trees provide this guarantee when they are stable
• Dynamic Causal Trees provide guarantees through failure & healing, iff you have AIT on each link
• Needs Atomic Information Transfer (AIT) in the Link
[Figure: space-time diagram of three processes P, Q, R, each event annotated with a vector clock (P:n, Q:n, R:n); message edges obey slope ≤ c, bounding each event's causal history and future effect.]
5
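The vector clocks shown in the figure can be sketched generically in a few lines; this is an illustration of the mechanism, not ECLP code, and all names are illustrative:

```python
# Minimal vector-clock sketch for three processes P, Q, R.
# Each process keeps one counter per process; a send attaches a copy of
# the vector, and a receive merges it element-wise before ticking.

class Process:
    def __init__(self, name, peers):
        self.name = name
        self.clock = {p: 0 for p in peers}

    def local_event(self):
        self.clock[self.name] += 1

    def send(self):
        self.local_event()
        return dict(self.clock)            # the vector travels with the message

    def receive(self, msg_clock):
        for p, t in msg_clock.items():     # element-wise max merge
            self.clock[p] = max(self.clock[p], t)
        self.clock[self.name] += 1

def happened_before(a, b):
    """True iff vector clock a causally precedes vector clock b."""
    return all(a[p] <= b[p] for p in a) and a != b

peers = ["P", "Q", "R"]
P, Q, R = (Process(n, peers) for n in peers)
m = P.send()                               # P's first event
Q.receive(m)                               # Q merges, now {"P": 1, "Q": 1, "R": 0}
assert happened_before(m, Q.clock)         # P's send precedes Q's receive
```

Two events are concurrent exactly when `happened_before` is false in both directions, which is why a general network alone cannot order them.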
-
Problem: Consensus is Hard
• Failure detectors have failed to solve the problem
• 2PC (Fail-Stop): vulnerable to coordinator failure (no safety proof)
• 3PC: vulnerable to network partitions (no liveness proof)
• Paxos (Fail-Recover): robust algorithm, but hard to understand & get right
• Causal Trees make roles robust, easier to understand & verify
6
-
Why? Because the Network is Flaky!
• App developers believe the network is the problem
• Networks drop, delay, duplicate & reorder packets
• Networking people believe the apps are the problem
• The network end-to-end principle: apps should retry to distinguish between delays & drops … but … retries* ruin TCP's ordering guarantees
• Both are incorrect. The solution requires a simple but fresh perspective
Peter Bailis & Kyle Kingsbury, "The Network Is Reliable": http://aphyr.com/posts/288-the-network-is-reliable
* Application retries (i.e., opening a new socket)
7
-
Datacenter Failures Cascade
8
Switches are DReDDful
They Drop, Reorder, Delay and Duplicate Packets
• Interdependent failures
• Reconstruction storms
• Timeout storms
• Gossip storms
• Cascade failures
-
It's Time to Simplify
Delta, Amazon, Google, Apple, Netflix, PayPal, …
9
-
Earth Computing

The Big Idea

Tim Berners-Lee's World Wide Web Key Idea: 2 Simple Sets of Rules
• Document Language (html)
• Connection Protocol (http)
ONE-WAY LINKS
Result: Cloud Computing. Mere mortals can now get their computers to talk to each other.

Earth Computing Key Idea: 2 Simple Sets of Rules
• Graph Language (gvml)
• Connection Protocol (eclp)
TWO-WAY LINKS
Result: Mere mortals can now manage their infrastructures.
10
-
Distributed Systems Primitives

CAS: Atomic Instruction
Key Idea: Lock-Free Data Structures
• New Concurrency Libraries
• Atomic RMW
Concurrent Safety, Non-Blocking
Durable Indivisible Property: Shared Memory
while (CAS(address, oldvalue, newvalue) != oldvalue) { }  /* spin until the swap succeeds */

AIT: Atomic Information Transfer
Key Idea: Recoverable Atomic Tokens
• Deterministic, In-Order
• Reversible Atomic Message
Deterministic Recoverability
Durable Indivisible Property: Reversible Token
while (Transfer(AIT(tokenID, Notify=NO)) != Continue) { }  /* retry until the token transfers */
11
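The CAS retry pattern on this slide can be made concrete. Python exposes no hardware compare-and-swap, so this sketch simulates one with a lock; the `AtomicCell` class and `lock_free_increment` helper are illustrative names, not from the talk:

```python
import threading

# A compare-and-swap cell simulated with a lock (Python has no raw CAS
# primitive); the retry loop below mirrors the slide's spin pattern.
class AtomicCell:
    def __init__(self, value=0):
        self._value = value
        self._lock = threading.Lock()

    def load(self):
        with self._lock:
            return self._value

    def cas(self, expected, new):
        """Set the value to `new` iff it currently equals `expected`.
        Returns the value observed; the swap succeeded iff that equals
        `expected` (the usual atomic RMW convention)."""
        with self._lock:
            observed = self._value
            if observed == expected:
                self._value = new
            return observed

def lock_free_increment(cell):
    while True:                    # retry on contention, never block
        old = cell.load()
        if cell.cas(old, old + 1) == old:
            return old + 1

cell = AtomicCell()
workers = [threading.Thread(
               target=lambda: [lock_free_increment(cell) for _ in range(1000)])
           for _ in range(4)]
for t in workers: t.start()
for t in workers: t.join()
print(cell.load())                 # 4000: every increment lands exactly once
```

The AIT transfer loop on the slide has the same shape, but the retried operation is a token handoff over a link rather than a write to shared memory.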
-
12
C2C Lattice of Cells & Links
Typical Clos-Based DC Topology, Spine and Leaf Architecture
[Diagram, left: a typical Clos spine-and-leaf datacenter with DC Gateways (GW), Spine Nodes (SN), Leaf Nodes (LN), and ToR switches fronting racks of VMs, with datacenters joined by DCI/WAN. Right: the equivalent C2C lattice of cells and links.]
Today's Networking: servers & switches, any-to-any (IP) addressing
EARTH Computing: cells & links; simpler wiring: N2N, switchless
-
13
The Datacenter Today: Internal Segregation Firewalls
The Datacenter Simplified: EARTH Dynamic Confinement Domains (Fundamentally Simpler)
-
Split infrastructure into:
• Cloud: datacenter accessed by untrusted legacy protocols
• Earth: dynamic, resilient, programmable topologies
• Core: where data is immutable, secure, protected, & resilient to perturbations (failures, disasters, attacks)
[Diagram: Outside World, Cloudplane, Earth Computing Network Fabric, EarthCore, within the Data Center.]
14
-
Confidential | Earth Computing Inc.
The Big Idea
[Diagram: Outside World, Cloudplane, EarthCore.]
15
-
Logical Foundation for Resilience
16
[Diagram: the Fabric, a lattice of cells, each a CellAgent with six NICs, linked directly to its neighbors.]
-
EARTH Computing Link Protocol (ECLP)
• Events replace heartbeats & timeouts
• Addresses the Common Knowledge* Problem
17
[Diagram: cells, each a Cell Agent with six NICs, joined by cables.]
New Distributed Systems Foundation
*Knowledge and Common Knowledge in a Distributed Environment – Joseph Y. Halpern & Yoram Moses ’90 (initial version 1984).
http://www.cs.cornell.edu/home/halpern/papers/common_knowledge.pdf
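The recoverable-token idea behind AIT, where the surviving half of a link can tell a restarted partner what to recover, can be sketched as follows. The `LinkHalf` class, its log, and the two-step handoff are illustrative assumptions for this sketch, not the actual ECLP wire protocol:

```python
# Sketch: each half of a link durably records token transfers, so an
# in-flight token is never known to only one side. After a crash, the
# partner's record reconstructs the lost state.

class LinkHalf:
    def __init__(self, name):
        self.name = name
        self.token = None          # token currently held by this half
        self.log = []              # durable record of transfers seen

    def send(self, other, token):
        other.log.append(("received", token))   # receiver records it
        self.log.append(("sent", token))        # sender remembers it too
        other.token = token
        self.token = None

def recover(crashed, partner):
    """Replay the partner's log into a restarted half that lost its state."""
    for event, token in partner.log:
        if event == "sent":
            crashed.token = token
    return crashed.token

a, b = LinkHalf("A"), LinkHalf("B")
a.send(b, "T1")
b.token = None                     # B crashes and forgets everything
assert recover(b, a) == "T1"       # A's half of the link restores B's state
```

Because both halves record every transfer, the two sides share what amounts to local common knowledge about the token's disposition, which is exactly what heartbeat-and-timeout schemes fail to provide.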
-
Composable Presence Management

[Diagram: a chain of cells and routers, each an agent with six NICs, joined end to end by cables.]
18
-
Composable Presence Management

[Diagram: the same chain of cells and routers as on the previous slide.]
19
-
Demo
20
-
Two Generals Problem
21
-
Example Use Cases
22
Two Phase Commit
Paxos
Link Reversal
-
Example Use Cases
• Two-phase commit: The prepare phase asks whether the receiving agent is ready to accept the token. This serves two purposes: communication liveness and agent readiness. Links provide the communication liveness test, and we can avoid blocking on agent readiness by having the link store the token on the receiving half of the link. If there is a failure, both sides know, and both sides know what to do next.
• Paxos: "Agents may fail by stopping, and may restart. Since all agents may fail after a value is chosen and then restart, a solution is impossible unless some information can be remembered by an agent that has failed and restarted." The assumption is that when a node has failed and restarted, it can't remember the state it needs to recover. With AIT, the other half of the link can tell it the state to recover from.
• Reliable tree generation: Binary link-reversal algorithms work by reversing the directions of some edges, transforming an arbitrary directed acyclic input graph into an output graph with at least one route from each node to a special destination node. The resulting graph can thus be used to route messages in a loop-free manner. Links store the direction of the arrow (head and tail); AIT facilitates the atomic swap of the arrow's tail and head to maintain loop-free routes during failure and recovery.
23
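The two-phase-commit variant described above, where the prepare phase parks the token on the receiving half of the link, might look like this in outline. The `Link`, `Participant`, `deliver`, and `drain` names are hypothetical, chosen for this sketch rather than taken from Earth Computing's API:

```python
# Sketch: the link's receiving half stores the prepared token on behalf
# of the participant, so the coordinator's prepare never blocks on a
# slow or busy agent.

class Link:
    def __init__(self):
        self.staged = None         # token parked on the receiving half

    def deliver(self, token):
        self.staged = token        # link accepts on the agent's behalf
        return "ack"               # liveness is confirmed by the link itself

class Participant:
    def __init__(self, link):
        self.link = link
        self.committed = None

    def drain(self):
        """Pick up a staged token whenever the agent becomes ready."""
        if self.link.staged is not None:
            self.committed = self.link.staged
            self.link.staged = None

link = Link()
p = Participant(link)
assert link.deliver("commit:tx42") == "ack"   # prepare returns immediately
p.drain()                                     # participant commits when ready
assert p.committed == "commit:tx42"
```

The design point is that the acknowledgement tests the link, not the agent; agent readiness is decoupled into `drain`, so a coordinator failure after `deliver` leaves the token recoverable from the link rather than in limbo.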
-
24
Common Knowledge
Courtesy: Adrian Colyer, The Morning Paper. https://blog.acolyer.org/2015/02/16/knowledge-and-common-knowledge-in-a-distributed-environment/
-
Confidential | Earth Computing Inc. | Paul Borrill
Implementation on Smart NICs

[Diagram: Cell Hardware (containing Processor, Memory, Storage and, e.g., 6 physical network ports, each with a PHY) running Cell Software (Primary Agent); labels: Network Interface Controller, Network Connector, Entanglement Synchronization Domain.]
25
-
TRAPHs (Tree-gRAPHs)
• Simple Provisioning, Confinement, Elasticity, Migration, Failover
26
New Distributed Systems Foundation
Bare Metal
NAL: Network Asset Layer
-
27
Demo: Simulator
-
28
• Don't Make the Datacenter Look Like the Internet
• Simpler to Configure/Reconfigure
• More Resilient to All Perturbations
• Easier to Secure
• Key Ideas
• RAFE: Reliable Address-Free Ethernet
• Replace switches with cell-to-cell links
• Don't rely on blueprints; discover wiring
• Event-driven => no network timeouts
• Keep state in links for recovery
• No VLANs, no network-layer encryption
• Scalable design: local-only view
• No IP; service addressing
• Self-recovering from link & server failures
• RAFE is a discovery process rather than a configuration process
Questions?
[Diagram: Outside World, Cloudplane, EarthCore.]