Posted on 19-Jan-2016
Lecture 1 – Introduction to Distributed Systems
15-440 Distributed Systems
What Is A Distributed System?
“A collection of independent computers that appears to its users as a single coherent system.”
• Features:
  • No shared memory – message-based communication
  • Each runs its own local OS
  • Heterogeneity
  • Expandability
• Ideal: present a single-system image – the distributed system “looks like” a single computer rather than a collection of separate computers.
Definition of a Distributed System
Figure 1-1. A distributed system organized as middleware. The middleware layer runs on all machines, and offers a uniform interface to the system
Distributed Systems: Goals
• Resource Availability: remote access to resources
• Distribution Transparency: single system image
  • Access, Location, Migration, Replication, Failure, …
• Openness: services according to standards (RPC)
• Scalability: size, geographic, administrative domains, …
• Examples of distributed systems: web search on Google; DNS (decentralized, scalable, robust to failures); …
04/21/23 4
Lecture 2 & 3 – 15-441 in 2 Days
15-440 Distributed Systems
Packet Switching – Statistical Multiplexing
• Switches arbitrate between inputs
• Can send from any input that’s ready
• Links never idle when there is traffic to send (efficiency!)
Model of a communication channel
• Latency - how long does it take for the first bit to reach destination
• Capacity - how many bits/sec can we push through? (often termed “bandwidth”)
• Jitter - how much variation in latency?
• Loss / Reliability - can the channel drop packets?
• Reordering - can packets arrive out of order?
Packet Switching
• Source sends information as self-contained packets that have an address.
  • Source may have to break up a single message into multiple packets.
• Each packet travels independently to the destination host.
  • Switches use the address in the packet to determine how to forward the packets.
  • Store and forward.
• Analogy: a letter in surface mail.
Internet
• An inter-net: a network of networks. Networks are connected using routers that support communication in a hierarchical fashion.
• Often need other special devices at the boundaries for security, accounting, …
• The Internet: the interconnected set of networks of the Internet Service Providers (ISPs). About 17,000 different networks make up the Internet.
Network Service Model
• What is the service model for an inter-network?
  • Defines what promises the network gives for any transmission.
  • Defines what types of failures to expect.
• Ethernet/Internet: best-effort – packets can get lost, etc.
Possible Failure models
• Fail-stop: when something goes wrong, the process stops / crashes / etc.
• Fail-slow or fail-stutter: performance may vary on failures as well.
• Byzantine: anything that can go wrong, will – including malicious entities taking over your computers and making them do whatever they want.
• These models are useful for proving things; the real world typically has a bit of everything.
• Deciding which model to use is important!
What is Layering?
• Modular approach to network functionality
• Example (layers, bottom to top): link hardware → host-to-host connectivity → application-to-application channels → application
IP Layering
• Relatively simple: hosts run the full stack (Application, Transport, Network, Link, Physical); bridges/switches operate at the link layer, routers/gateways at the network layer.
Protocol Demultiplexing
• Multiple choices at each layer, selected by a demultiplexing field:
  • Link layer: type field selects the network protocol (IP, IPX, …)
  • Network layer: IP protocol field selects TCP vs. UDP
  • Transport layer: port number selects the application (FTP, HTTP, TFTP, NV, …)
Goals [Clark88]
0. Connect existing networks (initially ARPANET and the ARPA packet radio network)
1. Survivability: ensure communication service even in the presence of network and router failures
2. Support multiple types of services
3. Must accommodate a variety of networks
4. Allow distributed management
5. Allow host attachment with a low level of effort
6. Be cost effective
7. Allow resource accountability
Goal 1: Survivability
• If the network is disrupted and reconfigured…
  • Communicating entities should not care!
  • No higher-level state reconfiguration.
• How to achieve such reliability? Where can communication state be stored?

                        Network           Host
  Failure handling      Replication       “Fate sharing”
  Net engineering       Tough             Simple
  Switches              Maintain state    Stateless
  Host trust            Less              More
CIDR IP Address Allocation
• Provider is given 201.10.0.0/21 and sub-allocates it to customers:
  201.10.0.0/22, 201.10.4.0/24, 201.10.5.0/24, 201.10.6.0/23
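The sub-allocation above can be checked mechanically; here is a small sketch using Python's standard `ipaddress` module with the prefixes from the slide:

```python
import ipaddress

# Provider's allocation and the customer sub-blocks from the slide
provider = ipaddress.ip_network("201.10.0.0/21")
customers = ["201.10.0.0/22", "201.10.4.0/24",
             "201.10.5.0/24", "201.10.6.0/23"]

for block in customers:
    net = ipaddress.ip_network(block)
    # subnet_of() verifies that the customer prefix nests inside the /21
    print(block, net.subnet_of(provider))   # each line prints True
```

The /21 covers 201.10.0.0–201.10.7.255, so every customer block fits inside it.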
Ethernet Frame Structure (cont.)
• Addresses: 6 bytes
  • Each adapter is given a globally unique address at manufacturing time.
  • Address space is allocated to manufacturers: 24 bits identify the manufacturer (e.g., 0:0:15:* → a 3Com adapter).
  • Frame is received by all adapters on a LAN and dropped if the address does not match.
• Special addresses:
  • Broadcast – FF:FF:FF:FF:FF:FF is “everybody”.
  • A range of addresses is allocated to multicast; the adapter maintains a list of multicast groups the node is interested in.
End-to-End Argument
• Deals with where to place functionality:
  • inside the network (in switching elements), or
  • at the edges.
• Argument: if you have to implement a function end-to-end anyway (e.g., because it requires the knowledge and help of the end-point host or application), don’t implement it inside the communication system – unless there’s a compelling performance enhancement.
• Key motivation for the split of functionality between TCP, UDP and IP.

Further reading: “End-to-End Arguments in System Design.” Saltzer, Reed, and Clark.
User Datagram Protocol (UDP): An Analogy
Postal mail:
• Single mailbox to receive letters
• Unreliable
• Not necessarily in-order delivery
• Each letter is independent
• Must address each letter

UDP:
• Single socket to receive messages
• No guarantee of delivery
• Not necessarily in-order delivery
• Datagram – independent packets
• Must address each packet

Example UDP applications: multimedia, voice over IP.
Transmission Control Protocol (TCP): An Analogy
TCP:
• Reliable – guaranteed delivery
• Byte stream – in-order delivery
• Connection-oriented – single socket per connection
• Setup connection followed by data transfer

Telephone call:
• Guaranteed delivery
• In-order delivery
• Connection-oriented
• Setup call followed by conversation

Example TCP applications: Web, Email, Telnet.
Lecture 5 – Classical Synchronization
Classic synchronization primitives
• Basics of concurrency:
  • Correctness (achieves mutual exclusion, no deadlock, no livelock)
  • Efficiency (no spinlocks or wasted resources)
  • Fairness
• Synchronization mechanisms:
  • Semaphores (P() and V() operations)
  • Mutex (binary semaphore)
  • Condition variables (allow a thread to sleep)
    • Must be accompanied by a mutex
    • Wait and Signal operations
• Work through examples again + Go primitives
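The condition-variable discipline above (hold the mutex, wait inside a loop that re-checks the predicate, signal on state change) can be sketched in Python; the `Mailbox` class is illustrative, not a course API:

```python
import threading

# Minimal sketch: a one-slot mailbox built from a mutex + condition variable.
# wait() must be called with the lock held, inside a while loop that
# re-checks the predicate (this guards against spurious wakeups).
class Mailbox:
    def __init__(self):
        self.lock = threading.Lock()
        self.nonempty = threading.Condition(self.lock)
        self.value = None

    def put(self, v):
        with self.lock:
            self.value = v
            self.nonempty.notify()        # Signal: wake one waiting reader

    def get(self):
        with self.lock:
            while self.value is None:     # Wait releases the mutex while asleep
                self.nonempty.wait()
            v, self.value = self.value, None
            return v

box = Mailbox()
t = threading.Thread(target=lambda: box.put(42))
t.start()
print(box.get())   # 42
t.join()
```

Note that `wait()` atomically releases the mutex and sleeps, then re-acquires it before returning, which is exactly the "must be accompanied by a mutex" rule.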
Lecture 6 – RPC
RPC Goals
• Ease of programming
  • Hides complexity
  • Automates the task of implementing distributed computation
  • Familiar model for programmers (just make a function call)

Historical note: it seems obvious in retrospect, but RPC was only invented in the ’80s. See Birrell & Nelson, “Implementing Remote Procedure Call”, or Bruce Nelson’s Ph.D. thesis, Carnegie Mellon University: Remote Procedure Call, 1981.
Passing Value Parameters (1)
• The steps involved in doing a remote computation through RPC.
Stubs: obtaining transparency
• The compiler generates, from the API, stubs for a procedure on the client and server.
• Client stub:
  • Marshals arguments into a machine-independent format
  • Sends request to server
  • Waits for response
  • Unmarshals result and returns to caller
• Server stub:
  • Unmarshals arguments and builds a stack frame
  • Calls the procedure
  • Marshals results and sends the reply
Real solution: break transparency
• Possible semantics for RPC:
  • Exactly-once: impossible in practice
  • At-least-once: only for idempotent operations
  • At-most-once: zero, don’t know, or once
  • Zero-or-once: transactional semantics
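A minimal sketch of at-most-once semantics, assuming the client tags each request with a unique ID so the server can suppress duplicate execution on retries (the `Server` class and its names are illustrative, not from any real RPC library):

```python
# Sketch of at-most-once semantics: the server remembers completed
# request IDs and replays the cached reply instead of re-executing,
# so a retried request changes state at most once.
class Server:
    def __init__(self):
        self.completed = {}   # request_id -> cached reply
        self.counter = 0      # side effect we must not duplicate

    def handle(self, request_id, amount):
        if request_id in self.completed:      # duplicate (client retry)
            return self.completed[request_id]
        self.counter += amount                # execute exactly once
        self.completed[request_id] = self.counter
        return self.completed[request_id]

s = Server()
print(s.handle("req-1", 10))  # 10
print(s.handle("req-1", 10))  # 10  (retry: replayed, not re-executed)
print(s.counter)              # 10
```

Without the dedup table this degrades to at-least-once, which is only safe for idempotent operations.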
Asynchronous RPC (3)
• A client and server interacting through two asynchronous RPCs.
Important Lessons
• Procedure calls:
  • Simple way to pass control and data
  • Elegant, transparent way to distribute applications
  • Not the only way…
• Hard to provide true transparency:
  • Failures
  • Performance
  • Memory access
  • Etc.
• How to deal with a hard problem: give up and let the programmer deal with it.
  • “Worse is better”
Lecture 7 & 8 – Distributed File Systems
Why DFSs?
• Why distributed file systems:
  • Data sharing among multiple users
  • User mobility
  • Location transparency
  • Backups and centralized management
• Examples: NFS (v1 – v4), AFS, CODA, LBFS
• Idea: provide file-system interfaces to remote file systems
  • Challenges: heterogeneity, scale, security, concurrency, …
  • Non-challenges: AFS meant for a campus community
• Virtual File Systems: pluggable file systems
• Use RPCs
NFS vs AFS Goals
• AFS: global distributed file system
  • “One AFS”, like “one Internet”
  • LARGE numbers of clients and servers (1000s cache files)
  • Global namespace (organized as cells) => location transparency
  • Clients with disks => caching; write sharing rare; callbacks
  • Open-to-close consistency (session semantics)
• NFS: very popular network file system
  • NFSv4 meant for the wide area
  • Naming: per-client view (/home/yuvraj/…)
  • Caches data in memory, not on disk; write-through cache
  • Consistency model: buffers data (eventual, ~30 seconds)
  • Requires significant resources as users scale
DFS Important bits (1)
• Distributed file systems almost always involve a tradeoff among consistency, performance, and scalability.
• We’ve learned a lot since NFS and AFS (and can implement faster, etc.), but the general lesson holds, especially in the wide area.
• We’ll see a related tradeoff, also involving consistency, in a while: the CAP tradeoff (Consistency, Availability, Partition-resilience).
DFS Important Bits (2)
• Client-side caching is a fundamental technique to improve scalability and performance, but it raises important questions of cache consistency.
• Timeouts and callbacks are common methods for providing (some forms of) consistency.
• AFS picked close-to-open consistency as a good balance of usability (the model seems intuitive to users), performance, etc. The AFS authors argued that apps with highly concurrent, shared access, like databases, needed a different model.
Coda Summary
• Distributed file system built for mobility; disconnected operation is the key idea.
• Puts scalability and availability before data consistency (unlike NFS).
• Assumes that inconsistent updates are very infrequent.
• Introduced disconnected operation mode, file hoarding, and the idea of “reintegration”.
Low Bandwidth File System (LBFS): Key Ideas
• A network file system for slow or wide-area networks.
• Exploits similarities between files:
  • Avoids sending data that can be found in the server’s file system or the client’s cache.
  • Uses Rabin fingerprints on file content (file chunks), which can deal with byte offsets when parts of a file change.
• Also uses conventional compression and caching.
• Requires 90% less bandwidth than traditional network file systems.
Lecture 9 – Time Synchronization
Impact of Clock Synchronization
• When each machine has its own clock, an event that occurred after another event may nevertheless be assigned an earlier time.
Clocks in a Distributed System
• Computer clocks are not generally in perfect agreement.
  • Skew: the difference between the times on two clocks (at any instant).
• Computer clocks are subject to clock drift (they count time at different rates).
  • Clock drift rate: the difference per unit of time from some ideal reference clock.
  • Ordinary quartz clocks drift by about 1 sec in 11–12 days (10⁻⁶ secs/sec).
  • High-precision quartz clocks have a drift rate of about 10⁻⁷ or 10⁻⁸ secs/sec.
Perfect networks
• Messages always arrive, with propagation delay exactly d.
• Sender sends time T in a message; receiver sets its clock to T + d.
• Synchronization is exact.
Cristian’s Time Sync
[Figure: process p sends request mr to time server S and receives reply mt.]
• A time server S receives signals from a UTC source.
• Process p requests the time in mr and receives t in mt from S; p sets its clock to t + RTT/2.
• Accuracy is ± (RTT/2 − min), because:
  • the earliest time S can put t in message mt is min after p sent mr;
  • the latest time is min before mt arrives at p;
  • so the time by S’s clock when mt arrives is in the range [t + min, t + RTT − min].
• RTT (Tround) is the round-trip time recorded by p; min is an estimated minimum round-trip time.
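The rule t + RTT/2 and the accuracy bound can be written out directly; the values below (times in milliseconds) are made up for illustration:

```python
# Sketch of Cristian's algorithm: p sets its clock to t + RTT/2, and the
# accuracy of the estimate is bounded by ±(RTT/2 - min).
def cristian(t_server, rtt, rtt_min):
    estimate = t_server + rtt / 2     # assume symmetric delay
    accuracy = rtt / 2 - rtt_min      # worst-case error either way
    return estimate, accuracy

# Server reports t = 1000 ms; measured RTT = 20 ms; min one-way delay = 5 ms
est, acc = cristian(t_server=1000, rtt=20, rtt_min=5)
print(est)   # 1010.0 -> p sets its clock to this
print(acc)   # 5.0    -> p's clock is now within ±5 ms of S
```

Note that as min approaches RTT/2 the bound tightens; with min = 0 the bound is just ±RTT/2.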
Berkeley algorithm
• Cristian’s algorithm:
  • a single time server might fail, so they suggest the use of a group of synchronized servers;
  • it does not deal with faulty servers.
• Berkeley algorithm (also 1989):
  • An algorithm for internal synchronization of a group of computers.
  • A master polls to collect clock values from the others (slaves).
  • The master uses round-trip times to estimate the slaves’ clock values.
  • It takes an average (eliminating any with above-average round-trip time or with faulty clocks).
  • It sends the required adjustment to the slaves (better than sending the time, which depends on the round-trip time).
• Measurements: 15 computers, clock synchronization to 20–25 millisecs, drift rate < 2×10⁻⁵.
• If the master fails, a new master can be elected to take over (but not in bounded time).
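The averaging step can be sketched as follows, with made-up clock values; the master includes its own offset of 0 in the average and returns per-machine adjustments rather than an absolute time:

```python
# Sketch of the Berkeley algorithm's averaging step: compute each slave's
# offset from the master, average all offsets (master's own offset is 0),
# and return the adjustment each machine should apply to its clock.
def berkeley_adjustments(master_time, slave_times):
    offsets = [0] + [t - master_time for t in slave_times]
    avg = sum(offsets) / len(offsets)
    # first adjustment is for the master, then one per slave
    return [avg - off for off in offsets]

# Master reads 300; slaves read 310 and 290 (arbitrary units)
adj = berkeley_adjustments(master_time=300, slave_times=[310, 290])
print(adj)   # [0.0, -10.0, 10.0] -> every machine converges on time 300
```

A real implementation would also drop outliers (above-average RTT, faulty clocks) before averaging, as the slide notes.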
NTP Protocol
• All modes use UDP.
• Each message bears timestamps of recent events:
  • local times of send and receive of the previous message;
  • local time of send of the current message.
• The recipient notes the time of receipt T3 (so we have T0, T1, T2, T3).
[Figure: messages m and m' between client and server, timestamped T0 (client send), T1 (server receive), T2 (server send), T3 (client receive).]
Logical time and logical clocks (Lamport 1978)
• Instead of synchronizing clocks, event ordering can be used:
  1. If two events occurred at the same process pi (i = 1, 2, … N), then they occurred in the order observed by pi; that is the definition of “→i”.
  2. When a message m is sent between two processes, send(m) happens before receive(m).
  3. The happened-before relation is transitive.
• The happened-before relation is the relation of causal ordering.
Total-order Lamport clocks
• Many systems require a total ordering of events, not a partial ordering.
• Use Lamport’s algorithm, but break ties using the process ID:
  • L(e) = M × Li(e) + i
  • M = maximum number of processes
  • i = process ID
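A sketch of Lamport clocks with this tiebreak; `M = 10` is an assumed bound on the number of processes, and the class names are illustrative:

```python
# Sketch of Lamport clocks with the total-order tiebreak from the slide:
# L(e) = M * Li(e) + i, where M exceeds the number of processes, so two
# events with equal Lamport time still get distinct global stamps.
M = 10   # assumed maximum number of processes

class Process:
    def __init__(self, pid):
        self.pid, self.clock = pid, 0

    def local_event(self):
        self.clock += 1
        return M * self.clock + self.pid   # globally unique total-order stamp

    def send(self):
        self.clock += 1
        return self.clock                  # Lamport time piggybacked on message

    def receive(self, msg_clock):
        self.clock = max(self.clock, msg_clock) + 1
        return M * self.clock + self.pid

p0, p1 = Process(0), Process(1)
t = p0.send()            # p0's clock becomes 1
stamp = p1.receive(t)    # p1's clock becomes max(0, 1) + 1 = 2
print(stamp)             # 21  (= 10 * 2 + 1)
```

Comparing stamps numerically now yields a total order consistent with happened-before.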
Vector Clocks
• Note that e → e’ implies V(e) < V(e’). The converse is also true.
• Can you see a pair of parallel events?
  • c || e (parallel), because neither V(c) ≤ V(e) nor V(e) ≤ V(c).
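The component-wise comparison behind these definitions can be written out directly; the vector stamps below are hypothetical examples:

```python
# Sketch of vector-clock comparison: V(e) <= V(e') iff every component
# is <=; two events are concurrent (e || e') when neither dominates.
def leq(a, b):
    return all(x <= y for x, y in zip(a, b))

def concurrent(a, b):
    return not leq(a, b) and not leq(b, a)

# Hypothetical stamps: c happened at process 0, e at process 1,
# with no message between them
c = (2, 0, 0)
e = (0, 1, 0)
print(concurrent(c, e))       # True: c || e
print(leq((1, 0, 0), c))      # True: (1,0,0) happened-before c
```

This is exactly why vector clocks, unlike plain Lamport clocks, can *detect* concurrency rather than just order events.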
Lecture 10 – Mutual Exclusion
Example: Totally-Ordered Multicasting
• A San Francisco customer adds $100 to a $1,000 account while the New York bank adds 1% interest: applied in different orders at the two replicas, San Francisco ends with $1,111 and New York with $1,110.
• Updating a replicated database this way leaves it in an inconsistent state.
• Can use Lamport clocks to totally order the updates.
Mutual Exclusion: A Centralized Algorithm (1)
@ Client, Acquire:
    send (Request, i) to coordinator
    wait for reply

@ Server:
    while true:
        m = receive()
        if m == (Request, i):
            if available():
                send (Grant) to i
Distributed Algorithm Strawman
• Assume that there are n coordinators.
• Access requires a majority vote from m > n/2 coordinators.
• A coordinator always responds immediately to a request with GRANT or DENY.
• Node failures are still a problem.
• Large numbers of nodes requesting access can affect availability.
A Distributed Algorithm (2): Lamport Mutual Exclusion
• Every process maintains a queue of pending requests for entering the critical section, in order. The queues are ordered by virtual timestamps derived from Lamport timestamps:
  • For any events e, e' such that e → e' (causal ordering), T(e) < T(e').
  • For any distinct events e, e', T(e) ≠ T(e').
• When node i wants to enter the C.S., it sends a time-stamped request to all other nodes (including itself), and waits for replies from all other nodes.
• If its own request is at the head of its queue and all replies have been received, it enters the C.S.
• Upon exiting the C.S., it removes its request from the queue and sends a release message to every process.
A Distributed Algorithm (3): Ricart & Agrawala
• Also relies on Lamport totally ordered clocks.
• When node i wants to enter the C.S., it sends a time-stamped request to all other nodes. These other nodes reply (eventually). When i receives n−1 replies, it can enter the C.S.
• Trick: a node j with an earlier request doesn’t reply to i until after it has completed its own C.S.
A Token Ring Algorithm
• Organize the processes involved into a logical ring.
• One token at any time, passed from node to node along the ring.
A Token Ring Algorithm
• Correctness: clearly safe – only one process can hold the token.
• Fairness: the token will pass around the ring at most once before a node gets access.
• Performance:
  • Each cycle requires between 1 and ∞ messages.
  • Latency of the protocol is between 0 and n−1.
• Issues: lost token.
Summary
• Lamport algorithm demonstrates how distributed processes can maintain consistent replicas of a data structure (the priority queue).
• Ricart & Agrawala's algorithms demonstrate utility of logical clocks.
• Centralized and ring-based algorithms have much lower message counts.
• None of these algorithms can tolerate failed processes or dropped messages.
Lecture 11 – Concurrency, Transactions
Distributed Concurrency Management
• Multiple objects, multiple servers (ignore failures for now).
• Single server: transactions (reads/writes to global state).
  • ACID: Atomicity, Consistency, Isolation, Durability – learn what these mean in the context of transactions.
  • E.g., a banking app => ACID is violated if not careful.
  • Solutions: two-phase locking (general, strict, strong strict).
  • Dealing with deadlocks => build a “waits-for” graph.
• Transactions: 2 phases (prepare, commit/abort).
  • Preparation: generate lock set “L” and updates “U”.
  • COMMIT (update global state); ABORT (leave state as is).
• Example using a banking app.
Distributed Transactions – 2PC
• Similar idea as before, but state is spread across servers (maybe even a WAN).
• Want to enable single transactions to read and update global state while maintaining ACID properties.
• Overall idea:
  • Client initiates the transaction and makes use of a “coordinator”.
  • All other relevant servers operate as “participants”.
  • Coordinator assigns a unique transaction ID (TID).
• Two-phase commit:
  • Prepare & vote phase (participants determine their state, talk to the coordinator).
  • Commit/abort phase (coordinator broadcasts the decision to the participants).
Lecture 12 – Logging and Crash Recovery
Summary – Fault Tolerance
• Real systems are often unreliable.
• Introduced basic concepts for fault-tolerant systems, including redundancy, process resilience, and RPC.
• Fault tolerance: backward recovery using checkpointing, both independent and coordinated.
• Fault tolerance: recovery using write-ahead logging, which balances the overhead of checkpointing against the ability to recover to a consistent state.
Dependability Concepts
• Availability – the system is ready to be used immediately.
• Reliability – the system runs continuously without failure.
• Safety – if a system fails, nothing catastrophic will happen. (e.g. process control systems)
• Maintainability – when a system fails, it can be repaired easily and quickly (sometimes, without its users noticing the failure). (also called Recovery)
• What’s a failure? A system that cannot meet its goals. Failures are caused by faults.
• Faults can be: transient, intermittent, or permanent.
Masking Failures by Redundancy
• Strategy: hide the occurrence of failure from other processes using redundancy.
1. Information Redundancy – add extra bits to allow for error detection/recovery (e.g., Hamming codes and the like).
2. Time Redundancy – perform an operation and, if need be, perform it again. Think about how transactions work (BEGIN/END/COMMIT/ABORT).
3. Physical Redundancy – add extra (duplicate) hardware and/or software to the system.
Recovery Strategies
• When a failure occurs, we need to bring the system into an error free state (recovery). This is fundamental to Fault Tolerance.
1. Backward Recovery: return the system to some previous correct state (using checkpoints), then continue executing.
-- Can be expensive, however still used
2. Forward Recovery: bring the system into a correct new state, from which it can then continue to execute.
-- Need to know potential errors up front!
Independent Checkpointing
• The domino effect – cascaded rollback: P2 crashes and rolls back, but the two checkpoints are inconsistent (P2 shows m received, but P1 does not show m sent).
• Solution? Coordinated checkpointing.
Shadow Paging Vs WAL
• Shadow pages: provide atomicity and durability; “page” = unit of storage.
• Idea: when writing a page, make a “shadow” copy.
  • No references from other pages, so it can be edited easily!
• ABORT: discard the shadow page.
• COMMIT: make the shadow page “real”; update pointers to data on this page from other pages (recursively). Can be done atomically.
Shadow Paging vs WAL
• Write-ahead logging: provides atomicity and durability.
• Idea: create a log recording every update to the database.
  • Updates are considered reliable when stored on disk.
  • Updated versions are kept in memory (page cache).
  • Logs typically store both REDO and UNDO operations.
• After a crash, recover by replaying log entries to reconstruct the correct state. Three passes: analysis pass, recovery (redo) pass, undo pass.
• WAL is more common: fewer disk operations, and transactions are considered committed once the log is written.
Lecture 13 – Errors and Failures
Measuring Availability
• Mean time to failure (MTTF)
• Mean time to repair (MTTR)
• MTBF = MTTF + MTTR
• Availability = MTTF / (MTTF + MTTR)
• Suppose an OS crashes once per month and takes 10 min to reboot:
  • MTTF = 720 hours = 43,200 minutes; MTTR = 10 minutes
  • Availability = 43,200 / 43,210 ≈ 0.9998 (~“3 nines”)
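The same arithmetic in code, using the slide's crash-once-a-month example:

```python
# Availability = MTTF / (MTTF + MTTR), i.e. the fraction of time the
# system is up. Both times must be in the same unit (minutes here).
def availability(mttf_min, mttr_min):
    return mttf_min / (mttf_min + mttr_min)

# OS crashes once per month (43,200 minutes) and takes 10 min to reboot
a = availability(mttf_min=43_200, mttr_min=10)
print(round(a, 4))   # 0.9998 -> roughly "three nines"
```

Note how strongly the result depends on MTTR: cutting the reboot to 1 minute would add another nine, which is why fast repair matters as much as rare failure.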
[Figure: disk failure conditional probability distribution – the “bathtub curve”: high infant mortality, a flat region at rate 1/(reported MTTF) over the expected operating lifetime, then burn-out.]
Parity Checking
• Single-bit parity: detects single-bit errors.
Block Error Detection
• EDC = Error Detection and Correction bits (redundancy)
• D = data protected by error checking; may include header fields
• Error detection is not 100% reliable!
  • The protocol may miss some errors, but rarely.
  • A larger EDC field yields better detection and correction.
Error Detection – Cyclic Redundancy Check (CRC)
• Polynomial code:
  • Treat packet bits as coefficients of an n-bit polynomial.
  • Choose an (r+1)-bit generator polynomial (well known, chosen in advance).
  • Add r bits to the packet such that the message is divisible by the generator polynomial.
• Better loss detection properties than checksums.
• Cyclic codes have favorable properties: they are well suited for detecting burst errors, and are therefore used on networks and hard drives.
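A sketch of the division over GF(2), where "subtraction" is XOR; the message 1101011011 and generator 10011 (x⁴ + x + 1) are a standard textbook example, not from the slides:

```python
# Sketch of CRC as polynomial division over GF(2): append r zero bits,
# divide by the (r+1)-bit generator using XOR, and the remainder is the
# r-bit checksum appended to the packet.
def crc_remainder(data_bits, generator):
    r = generator.bit_length() - 1
    dividend = data_bits << r                 # append r zero bits
    for shift in range(dividend.bit_length() - 1, r - 1, -1):
        if dividend & (1 << shift):           # leading bit set: "subtract" (XOR)
            dividend ^= generator << (shift - r)
    return dividend                           # remainder: the r CRC bits

msg = 0b1101011011
crc = crc_remainder(msg, 0b10011)             # generator x^4 + x + 1, r = 4
print(bin(crc))                               # 0b1110

# Receiver check: the message with the CRC appended divides evenly
print(crc_remainder((msg << 4) | crc, 0b10011))   # 0
```

A zero remainder at the receiver means no detected error; any burst of length ≤ r flips the remainder, which is the burst-detection property the slide mentions.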
Error Recovery
• Two forms of error recovery:
  • Redundancy
    • Error Correcting Codes (ECC)
    • Replication/voting
  • Retry
• ECC:
  • Keep encoded redundant data to help repair losses.
  • Forward Error Correction (FEC) – send the redundant bits in advance.
    • Reduces the latency of recovery at the cost of bandwidth.
Summary
• Definitions of MTTF/MTBF/MTTR: understanding availability in systems.
• Failure detection and fault-masking techniques.
• Engineering tradeoff: cost of failures vs. cost of failure masking.
  • At what level of the system should we mask failures?
• Leading into replication as a general strategy for fault tolerance.
• A thought to leave you with: what if you have to survive the failure of entire computers? Of a rack? Of a datacenter?
Lecture 14 – RAID
Thanks to Greg Ganger and Remzi Arpaci-Dusseau for slides
Just a bunch of disks (JBOD)
• Yes, it’s a goofy name; industry really does sell “JBOD enclosures”.
Disk Striping
• Interleave data across multiple disks.
• Large file streaming can enjoy parallel transfers.
• High-throughput request workloads can enjoy thorough load balancing.
  • If blocks of hot files are equally likely to be on all disks (really?)
Redundancy via replicas
• Two (or more) copies: mirroring, shadowing, duplexing, etc.
• Write both, read either.
Simplest approach: Parity Disk
• Capacity: one extra disk needed per stripe
Updating and using the parity
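Both the small-write parity update and reconstruction after a failure are XOR computations; here is a sketch with three made-up 4-bit "blocks":

```python
from functools import reduce

# Sketch of parity maintenance: parity is the XOR of the data blocks, so
# a small write can be done as new_parity = old_parity ^ old_data ^ new_data
# (no need to read the other disks), and any one lost block equals the
# XOR of all surviving blocks plus parity.
blocks = [0b1010, 0b1100, 0b0110]             # data on three disks
parity = reduce(lambda a, b: a ^ b, blocks)   # parity disk contents

# Small write: change block 1 without re-reading the other data disks
new_block1 = 0b1111
parity = parity ^ blocks[1] ^ new_block1
blocks[1] = new_block1

# Disk 2 fails: reconstruct its block from the survivors plus parity
rebuilt = blocks[0] ^ blocks[1] ^ parity
print(rebuilt == 0b0110)   # True
```

The small-write path is exactly why RAID 4/5 small writes cost four I/Os: read old data, read old parity, write new data, write new parity.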
Solution: Striping the Parity
• Removes parity disk bottleneck
RAID Taxonomy
• Redundant Array of Inexpensive (now Independent) Disks
  • Defined by UC-Berkeley researchers (including Garth Gibson) in the late ’80s
• RAID 0 – coarse-grained striping with no redundancy
• RAID 1 – mirroring of independent disks
• RAID 2 – fine-grained data striping plus Hamming-code disks
  • Uses Hamming codes to detect and correct multiple errors.
  • Originally implemented when drives didn’t always detect errors.
  • Not used in real systems.
• RAID 3 – fine-grained data striping plus a parity disk
• RAID 4 – coarse-grained data striping plus a parity disk
• RAID 5 – coarse-grained data striping plus striped parity
How often are failures?
• MTBF (Mean Time Between Failures)
  • MTBF_disk ~ 1,200,000 hours (~136 years, <1% per year)
• MTBF of a multi-disk system = mean time to first disk failure, which is MTBF_disk / (number of disks)
• For a striped array of 200 drives:
  • MTBF_array = 136 years / 200 drives ≈ 0.68 years
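Redoing that arithmetic in code (8,760 hours per year), under the usual assumption of independent, memoryless failures:

```python
# With independent failures, the expected time to the *first* failure
# among N identical disks is MTBF_disk / N: more disks, sooner failure.
mtbf_disk_hours = 1_200_000          # per-disk MTBF (~136 years)
n_disks = 200
hours_per_year = 8_760

mtbf_array_years = mtbf_disk_hours / hours_per_year / n_disks
print(round(mtbf_array_years, 2))    # 0.68 -> a failure every ~8 months
```

This is the motivation for redundancy: a 200-disk stripe with no parity loses data roughly every eight months.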
Rebuild: restoring redundancy after failure
• After a drive failure:
  • data is still available for access,
  • but a second failure is BAD.
• So, we should reconstruct the data onto a new drive:
  • On-line spares are common features of high-end disk arrays; they reduce the time to start a rebuild.
  • Must balance rebuild rate with foreground performance impact: a performance vs. reliability trade-off.
• How data is reconstructed:
  • Mirroring: just read the good copy.
  • Parity: read all remaining drives (including parity) and compute.
Conclusions
• RAID turns multiple disks into a larger, faster, more reliable disk.
• RAID-0: striping. Good when performance and capacity really matter, but reliability doesn’t.
• RAID-1: mirroring. Good when reliability and write performance matter, but capacity (cost) doesn’t.
• RAID-5: rotating parity. Good when capacity and cost matter or the workload is read-mostly; a good compromise choice.