Google Base Infrastructure:
GFS and MapReduce
Johannes Passing
Agenda
- Context
- Google File System
  - Aims
  - Design
  - Walk-through
- MapReduce
  - Aims
  - Design
  - Walk-through
- Conclusion
Context
- Google's scalability strategy
  - Cheap hardware
  - Lots of it
  - More than 450,000 servers (NYT, 2006)
- Basic problems
  - Faults are the norm
  - Software must be massively parallelized
  - Writing distributed software is hard
GFS: Aims
- Aims
  - Must scale to thousands of servers
  - Fault tolerance / data redundancy
  - Optimize for large WORM (write once, read many) files
  - Optimize for large reads
  - Favor throughput over latency
- Non-aims
  - Space efficiency
  - Client-side caching (futile for this workload)
  - POSIX compliance
- Significant deviation from standard distributed file systems (NFS, AFS, DFS, …)
Design Choices
- File system cluster
  - Spread data over thousands of machines
  - Provide safety through redundant storage
- Master/slave architecture
Design Choices
- Files are split into chunks
  - Unit of management and distribution
  - Replicated across servers
  - Subdivided into blocks for integrity checks
  - Accept internal fragmentation for better throughput
[Diagram: files in GFS spread as chunks across the file system cluster]
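As an illustration of the chunk/block layout above (not part of the original slides), the arithmetic for turning a file offset into a chunk and block index could look like the sketch below; the 64 MB chunk size and 64 KB block size follow the GFS paper and are assumptions here.

```python
# Minimal sketch: map a byte offset to (chunk index, block index, offset in block).
CHUNK_SIZE = 64 * 1024 * 1024   # 64 MB per chunk (assumed, as in the GFS paper)
BLOCK_SIZE = 64 * 1024          # 64 KB per integrity-check block (assumed)

def locate(offset: int) -> tuple[int, int, int]:
    """Return (chunk index, block index within the chunk, byte offset within the block)."""
    chunk_index = offset // CHUNK_SIZE
    within_chunk = offset % CHUNK_SIZE
    return chunk_index, within_chunk // BLOCK_SIZE, within_chunk % BLOCK_SIZE

print(locate(200 * 1024 * 1024 + 5))   # -> (3, 128, 5)
```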
Architecture: Chunkserver
- Hosts chunks
  - Stores chunks as files in a Linux file system
  - Verifies data integrity
  - Implemented entirely in user mode
- Fault tolerance
  - Fails requests when an integrity violation is detected
  - Every chunk is also stored on at least one other server
  - Re-replication restores redundancy after a server's downtime
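A hedged sketch of the per-block integrity check a chunkserver might perform on reads; the 64 KB block size and the use of CRC32 are assumptions for illustration, not the exact GFS implementation.

```python
import zlib

BLOCK_SIZE = 64 * 1024   # assumed checksum block size

def block_checksums(chunk: bytes) -> list[int]:
    """Compute one checksum per block of the chunk."""
    return [zlib.crc32(chunk[i:i + BLOCK_SIZE])
            for i in range(0, len(chunk), BLOCK_SIZE)]

def read_block(chunk: bytes, checksums: list[int], block_index: int) -> bytes:
    data = chunk[block_index * BLOCK_SIZE:(block_index + 1) * BLOCK_SIZE]
    if zlib.crc32(data) != checksums[block_index]:
        # Integrity violation: fail the request so the client falls back to
        # another replica and the master can re-replicate the chunk.
        raise IOError("checksum mismatch, read from another replica")
    return data
```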
Architecture: Master
- Namespace and metadata management
- Cluster management
  - Chunk location registry (transient)
  - Chunk placement decisions
  - Re-replication, garbage collection
  - Health monitoring
- Fault tolerance
  - Backed by shadow masters
  - Mirrored operations log
  - Periodic snapshots
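The following sketch (with hypothetical names, not the real interface) illustrates how the master's lookup could combine persistent namespace metadata with the transient chunk location registry that is rebuilt from chunkserver heartbeats.

```python
class Master:
    def __init__(self):
        self.namespace = {}   # path -> [chunk handle, ...]        (logged, snapshotted)
        self.locations = {}   # chunk handle -> {chunkserver, ...} (transient)

    def heartbeat(self, server: str, handles: list[str]) -> None:
        # Chunkservers report which chunks they hold; the location registry
        # is rebuilt from these reports after a master restart.
        for h in handles:
            self.locations.setdefault(h, set()).add(server)

    def lookup(self, path: str, chunk_index: int):
        """Resolve (path, chunk index) to the chunk handle and its replica locations."""
        handle = self.namespace[path][chunk_index]
        return handle, sorted(self.locations.get(handle, ()))
```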
Reading a chunk
- Client: sends filename and offset
- Master: determines file and chunk, returns chunk locations and version
- Client: chooses the closest chunkserver and requests the chunk
- Chunkserver: verifies chunk version and block checksums
- Chunkserver: returns the data if valid, otherwise fails the request
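Putting the steps above together, a client-side read could look roughly like this sketch; the master and chunkserver objects are hypothetical stand-ins, and "closest" is simplified to "first replica in the list".

```python
CHUNK_SIZE = 64 * 1024 * 1024   # assumed chunk size

def read(master, chunkservers, path, offset, length):
    chunk_index = offset // CHUNK_SIZE
    handle, replicas = master.lookup(path, chunk_index)
    for server in replicas:                       # prefer the closest replica
        try:
            # The chunkserver verifies chunk version and block checksums
            # and fails the request if either check does not hold.
            return chunkservers[server].read(handle, offset % CHUNK_SIZE, length)
        except IOError:
            continue                              # fall back to the next replica
    raise IOError("no valid replica available")
```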
Writing a chunk
- Client: sends filename and offset
- Master: determines the lease owner (granting a lease if necessary), returns chunk locations and lease owner
- Client: pushes data to any chunkserver; data is forwarded along the replica chain
- Client: sends the write request to the primary (lease owner)
- Primary: creates a mutation order, applies it, forwards the write request to the replicas
- Replicas: validate blocks, apply the modifications, increment the version, acknowledge
- Primary: replies to the client
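As a toy illustration of the primary's role (classes and method names are invented for this sketch, not the GFS wire protocol): the lease owner picks the mutation order and forwards each write to the secondaries, which apply mutations in that same order.

```python
class Replica:
    def __init__(self):
        self.applied = []                 # (serial number, data) in the order applied

    def apply(self, serial: int, data: bytes) -> None:
        self.applied.append((serial, data))

class Primary(Replica):
    def __init__(self, secondaries: list[Replica]):
        super().__init__()
        self.secondaries = secondaries
        self.next_serial = 0

    def write(self, data: bytes) -> None:
        serial = self.next_serial          # mutation order is decided here
        self.next_serial += 1
        self.apply(serial, data)           # apply locally first
        for s in self.secondaries:         # forward the write request to every replica
            s.apply(serial, data)
```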
Appending a record (1/2)
- Client: sends append request
- Master: returns chunk locations and lease owner
- Client: pushes data to any chunkserver; data is forwarded along the replica chain
- Client: sends the append request to the primary
- Primary: chooses the offset and checks the space left in the chunk
- Primary: if the record does not fit, pads the chunk and requests a retry
- Client: requests allocation of a new chunk
- Master: returns the new chunk locations and lease owner
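The space check behind the "pad chunk and request retry" step could be sketched as follows; the size and the return convention are assumptions for illustration only.

```python
CHUNK_SIZE = 64 * 1024 * 1024   # assumed chunk size

def try_append(used: int, record: bytes):
    """Return (new_used, offset) on success, or (CHUNK_SIZE, None) to signal
    'chunk padded, client must retry on a freshly allocated chunk'."""
    if used + len(record) > CHUNK_SIZE:
        return CHUNK_SIZE, None            # pad the rest of the chunk
    return used + len(record), used        # record written at offset `used`
```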
Appending a record (2/2)
- Client: sends append request
- Primary: forwards the request to the replicas
- Replicas: write the data; a failure occurs on one replica
- Primary: fails the request
- Client: retries and sends the append request again
- Primary: forwards the request to the replicas
- Replicas: write the data (allocating the chunk where needed); the append now succeeds
- The replicas that completed the first attempt are left with a garbage copy of the record
[Diagram: chunk contents of the three replicas over time - padding, records A and B, free space, and garbage left behind by the failed append]
File Region Consistency
- File region states
  - Consistent: all clients see the same data
  - Defined: consistent, and clients see the change in its entirety
  - Inconsistent
[Diagram: chunk regions across the three replicas after successive appends - records B, C, D interleaved with free space and garbage; regions labelled inconsistent, consistent, and defined]
- Consequences
  - Unique record IDs to identify duplicates
  - Records should be self-identifying and self-validating
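One possible realization of such self-identifying, self-validating records (the framing format here is an assumption, not something prescribed by GFS): each record carries its own ID and checksum, so a reader can skip padding or corrupt regions and drop duplicates left behind by retried appends.

```python
import json, zlib

def encode(record_id: str, payload: str) -> bytes:
    """Frame a record as: 4-byte length | 4-byte CRC32 | JSON body with id + payload."""
    body = json.dumps({"id": record_id, "payload": payload}).encode()
    return len(body).to_bytes(4, "big") + zlib.crc32(body).to_bytes(4, "big") + body

def decode_valid(records: list[bytes]) -> list[dict]:
    """Keep only records that validate, dropping duplicates by record ID."""
    seen, out = set(), []
    for raw in records:
        length = int.from_bytes(raw[:4], "big")
        crc = int.from_bytes(raw[4:8], "big")
        body = raw[8:8 + length]
        if zlib.crc32(body) != crc:
            continue                       # corrupt data or padding region: skip
        rec = json.loads(body)
        if rec["id"] in seen:
            continue                       # duplicate from a retried append: skip
        seen.add(rec["id"])
        out.append(rec)
    return out
```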
Fault tolerance GFS
- Automatic re-replication
- Replicas are spread across racks
- Fault mitigations
  - Data corruption → choose a different replica
  - Machine crash → choose a replica on a different machine
  - Network outage → choose a replica in a different network
  - Master crash → shadow masters
Wrap-up GFS
- Proven scalability, fault tolerance, and performance
- Highly specialized
- Distribution is transparent to clients
- Only a partial abstraction: clients must cooperate
- The network is the bottleneck
MapReduce: Aims
- Aims
  - Unified model for large-scale distributed data processing
  - Massively parallelizable
  - Distribution transparent to the developer
  - Fault tolerance
  - Allow moving computation close to the data
Higher order functions
- Functional: map(f, [a, b, c, …]) applies f to every element independently, yielding [f(a), f(b), f(c), …]
- Google: map(k, v) is invoked once per input key/value pair and emits a list of intermediate (key, value) pairs
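A minimal Python illustration of this contrast (not from the slides): the functional map transforms each element independently, while a Google-style map takes one (key, value) pair and may emit any number of intermediate pairs.

```python
# Functional map: apply f element-wise.
print(list(map(lambda x: x * x, [1, 2, 3])))      # -> [1, 4, 9]

def map_kv(key: str, value: str):
    # Google-style map: one call per input pair, emitting intermediate pairs.
    yield (key.upper(), len(value))

print(list(map_kv("url", "page contents")))       # -> [('URL', 13)]
```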
Higher order functions
- Functional: foldl/reduce(f, [a, b, c, …], i) threads an accumulator through the list: f(f(f(i, a), b), c)
- Google: reduce(k, [a, b, c, d, …]) is invoked once per intermediate key with the list of all values collected for that key
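The same contrast for reduce, again as an illustrative sketch: foldl threads an accumulator through a list, whereas a Google-style reduce sees one key together with all intermediate values collected for that key.

```python
from functools import reduce

# Functional foldl with initial value 0.
print(reduce(lambda acc, x: acc + x, [1, 2, 3], 0))   # -> 6

def reduce_kv(key: str, values: list[int]):
    # Google-style reduce: one call per intermediate key.
    yield (key, sum(values))

print(list(reduce_kv("the", [1, 1, 1])))              # -> [('the', 3)]
```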
Idea
- Observation
  - Map can easily be parallelized
  - Reduce might be parallelized
- Idea
  - Design the application around a map and reduce scheme
  - The infrastructure manages scheduling and distribution
  - The user implements map and reduce
- Constraints
  - All data is coerced into key/value pairs
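The canonical word-count example from the MapReduce paper, transcribed to Python as a sketch of what the user actually implements: only a map and a reduce function, with all data flowing as key/value pairs.

```python
def wc_map(doc_name: str, contents: str):
    # Emit (word, 1) for every word in the document.
    for word in contents.split():
        yield (word, 1)

def wc_reduce(word: str, counts: list[int]):
    # Sum all counts collected for this word.
    yield (word, sum(counts))
```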
Walkthrough
- Provide data
  - e.g. from GFS or BigTable
  - Create M splits
- Map
  - Spawn the master and M mappers
  - Map is invoked once per key/value pair
  - Intermediate pairs are partitioned into R buckets
- Combine
  - Spawn up to R*M combiners
- Barrier: reduce starts only after all map tasks have finished
- Reduce
  - Spawn up to R reducers
[Diagram: data flow from input splits through mappers, combiners, and reducers]
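A toy, single-process driver following this walkthrough (distribution, combiners, and fault tolerance are omitted; all names are illustrative): run map per input pair, partition the intermediate pairs into R buckets by hashing the key, then, after the barrier, reduce each bucket.

```python
from collections import defaultdict

def run(inputs, mapper, reducer, R):
    buckets = [defaultdict(list) for _ in range(R)]
    for key, value in inputs:                      # one map invocation per input pair
        for k, v in mapper(key, value):
            buckets[hash(k) % R][k].append(v)      # partition into R buckets by key
    # Barrier: reducers only start once every map invocation has completed.
    return [pair
            for bucket in buckets
            for k, vs in sorted(bucket.items())
            for pair in reducer(k, vs)]

# Word count as the user-supplied mapper/reducer (same as the earlier sketch).
mapper = lambda name, text: ((w, 1) for w in text.split())
reducer = lambda word, counts: [(word, sum(counts))]
print(run([("d1", "to be or not to be"), ("d2", "to do is to be")],
          mapper, reducer, R=3))
```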
Scheduling
- Locality
  - Mappers are scheduled close to the data
  - Chunk replication improves locality
  - Reducers run on the same machines as mappers
- Choosing M and R
  - Load balancing and fast recovery vs. number of output files
  - M >> R, with R a small multiple of the number of machines
- Backup tasks are scheduled to avoid stragglers
Fault tolerance MapReduce
- Fault mitigations
  - Map crash
    - All intermediate data on that machine is in a questionable state
    - Repeat all map tasks of the machine
    - This is the rationale for the barrier
  - Reduce crash
    - Output data is global and completed stages are marked
    - Repeat only the crashed reduce task
  - Master crash
    - Start over
  - Repeated crashes on certain records
    - Skip those records
Wrap-up MapReduce
- Proven scalability
- Restricted programming model in exchange for better runtime support
- Tailored to Google's needs
- The programming model is fairly low-level
Conclusion
- Rethinking infrastructure
  - Rigorous design for scalability and performance
- Highly specialized solutions
  - Not generally applicable
  - Solutions are more low-level than usual
  - Maintenance effort may be significant

"We believe we get tremendous competitive advantage by essentially building our own infrastructures."
-- Eric Schmidt
References
- Jeffrey Dean and Sanjay Ghemawat, "MapReduce: Simplified Data Processing on Large Clusters", 2004
- Sanjay Ghemawat, Howard Gobioff, and Shun-Tak Leung, "The Google File System", 2003
- Ralf Lämmel, "Google's MapReduce Programming Model – Revisited", SCP journal, 2006
- Google, "Cluster Computing and MapReduce", http://code.google.com/edu/content/submissions/mapreduce-minilecture/listing.html, 2007
- David F. Carr, "How Google Works", http://www.baselinemag.com/c/a/Projects-Networks-and-Storage/How-Google-Works/, 2006
- Colby Ranger, Ramanan Raghuraman, Arun Penmetsa, Gary Bradski, and Christos Kozyrakis, "Evaluating MapReduce for Multi-core and Multiprocessor Systems", 2007