
Page 1: MapReduce & BigTable

MapReduce & BigTable

http://net.pku.edu.cn/~wbia

黄连恩
[email protected]
School of Information Engineering, Peking University

12/10/2013

Page 2: MapReduce & BigTable

MapReduce

Page 3: MapReduce & BigTable

Imperative Programming

In computer science, imperative programming is a programming paradigm that describes computation in terms of statements that change a program state.

Page 4: MapReduce & BigTable

Declarative Programming

In computer science, declarative programming is a programming paradigm that expresses the logic of a computation without describing its control flow.
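As a small illustration (not from the original slides), the same summation can be written imperatively, as statements that mutate state, or declaratively, as an expression of what is wanted:

    # Imperative: explicit statements that change program state.
    def total_imperative(numbers):
        total = 0
        for n in numbers:
            total += n          # state change on every iteration
        return total

    # Declarative/functional: describe the result, not the control flow.
    def total_declarative(numbers):
        return sum(numbers)     # the built-in expresses "the sum of numbers"

    print(total_imperative([1, 2, 3]))   # 6
    print(total_declarative([1, 2, 3]))  # 6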

Page 5: MapReduce & BigTable

Functional Language

map f lst: ('a -> 'b) -> ('a list) -> ('b list)
Applies f to every element of the input list and produces a new list.

fold f x0 lst: ('a * 'b -> 'b) -> 'b -> ('a list) -> 'b
Applies f to each element of the input list together with an accumulator; f returns the next accumulator value.

[Diagram: map applies f to each element independently; fold threads an initial accumulator value through successive applications of f until a final value is returned.]

Page 6: MapReduce & BigTable

From Functional Language View

map f lst: ('a -> 'b) -> ('a list) -> ('b list)
Applies f to every element of the input list and produces a new list.

fold f x0 lst: ('a * 'b -> 'b) -> 'b -> ('a list) -> 'b
Applies f to each element of the input list together with an accumulator; f returns the next accumulator value.

[Diagram: map applies f to each element independently; fold threads an initial accumulator value through successive applications of f until a final value is returned.]

Functional operations do not modify their data; they always produce new data. map and reduce are therefore inherently parallel:

Map can be fully parallelized.
Reduce can be executed concurrently and out of order when f is associative.

Reduce corresponds to a left fold: foldl : (a, [a]) -> a
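A quick illustration (not from the slides) of why associativity matters: when the combining function is associative, the elements can be grouped in any order and the result is unchanged, which is what lets a parallel reduce combine partial results from different workers.

    from functools import reduce

    data = [1, 2, 3, 4, 5, 6, 7, 8]

    # Sequential left fold.
    left = reduce(lambda a, x: a + x, data, 0)

    # "Parallel" grouping: reduce two halves independently, then combine.
    half1 = reduce(lambda a, x: a + x, data[:4], 0)
    half2 = reduce(lambda a, x: a + x, data[4:], 0)
    combined = half1 + half2

    print(left, combined)   # 36 36 -- same result because + is associative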

Page 7: MapReduce & BigTable

Example

(* All three helpers are instances of foldl with different combining functions. *)
fun foo(l : int list) = sum(l) + mul(l) + length(l)

fun sum(lst)    = foldl (fn (x, a) => x + a) 0 lst
fun mul(lst)    = foldl (fn (x, a) => x * a) 1 lst
fun length(lst) = foldl (fn (x, a) => 1 + a) 0 lst

Page 8: MapReduce & BigTable

MapReduce is…

“MapReduce is a programming model and an associated implementation for processing and generating large data sets.”[1]

J. Dean and S. Ghemawat, "MapReduce: Simplified Data Processing on Large Clusters," in OSDI, 2004, pp. 137-150.

Page 9: MapReduce & BigTable

From Parallel Computing View

MapReduce is a parallel programming model.

The essence is a single function that executes in parallel on independent data sets, with outputs that are eventually combined to form a single or small number of results.

f is a map operator:    map f (x:xs) = f x : map f xs
g is a reduce operator: reduce g y (x:xs) = reduce g (g y x) xs

homomorphic skeletons

Page 10: MapReduce & BigTable

Mapreduce Framework

Page 11: MapReduce & BigTable

Typical problem solved by MapReduce

Read input: records in key/value format.

Map: extract something from each record.
map (in_key, in_value) -> list(out_key, intermediate_value)
Processes an input key/value pair and emits intermediate key/value pairs.

Shuffle: exchange and regroup the data so that all intermediate results with the same key are collected on the same node.

Reduce: aggregate, summarize, filter, etc.
reduce (out_key, list(intermediate_value)) -> list(out_value)
Merges all values for one key, computes over them, and emits the combined result (usually just one).

Write the output.
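A minimal in-memory sketch of this pipeline (illustrative only; a real framework such as Hadoop distributes each stage across machines):

    from collections import defaultdict

    def run_mapreduce(records, map_fn, reduce_fn):
        # Map: emit intermediate (key, value) pairs from each input record.
        intermediate = []
        for in_key, in_value in records:
            intermediate.extend(map_fn(in_key, in_value))

        # Shuffle: group all values that share the same intermediate key.
        groups = defaultdict(list)
        for out_key, value in intermediate:
            groups[out_key].append(value)

        # Reduce: merge the values of each key into a final output value.
        return {key: reduce_fn(key, values) for key, values in sorted(groups.items())}

Word counting (Pages 14-15) and inverted indexing (Pages 16-18) are just different map_fn/reduce_fn pairs plugged into this skeleton.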

Page 12: MapReduce & BigTable

Shuffle Implementation

Page 13: MapReduce & BigTable

Partition and Sort Group

Partition function: hash(key) % number of reducers
Group function: sort by key
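A sketch of that partitioning rule (illustrative; Hadoop's default HashPartitioner uses the key's hashCode in the same way):

    import zlib

    def partition(key, num_reducers):
        # Use a stable hash so the same key always maps to the same reducer,
        # even across processes (Python's built-in hash() is salted per run).
        return zlib.crc32(key.encode("utf-8")) % num_reducers

    print(partition("apple", 4), partition("apple", 4))   # same partition both times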

Page 14: MapReduce & BigTable

Word Frequencies in Web pages

Input: one document per record.
The user implements a map function whose input is:

key = document URL
value = document contents

map emits (potentially many) key/value pairs: for every word that appears in the document, it emits a record <word, "1">.

Page 15: MapReduce & BigTable

Example continued:

The MapReduce runtime (library) gathers together all records with the same key (shuffle/sort).

The user implements a reduce function that computes over the values for one key:

sum them

Reduce emits <key, sum>.
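A hedged word-count sketch in this style (illustrative; in Hadoop this would be a Mapper and Reducer class or a streaming script):

    from collections import defaultdict

    def wc_map(url, contents):
        # Emit <word, 1> for every word occurrence in the document.
        return [(word, 1) for word in contents.split()]

    def wc_reduce(word, counts):
        # Sum all the 1's collected for this word.
        return sum(counts)

    # Tiny local driver standing in for the MapReduce runtime.
    docs = [("http://a", "the cat sat"), ("http://b", "the cat ran")]
    groups = defaultdict(list)
    for url, text in docs:
        for word, one in wc_map(url, text):
            groups[word].append(one)
    print({w: wc_reduce(w, vs) for w, vs in groups.items()})
    # {'the': 2, 'cat': 2, 'sat': 1, 'ran': 1}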

Page 16: MapReduce & BigTable

Inverted Index

Page 17: MapReduce & BigTable

Build Inverted Index

Map: <doc#, word> ➝ [<word, doc-num>]
Reduce: <word, [doc1, doc3, ...]> ➝ <word, "doc1, doc3, …">

Page 18: MapReduce & BigTable

Build index

Input: web page data

Mapper: <url, document content> -> <term, docid, locid>

Shuffle & Sort: sort by term

Reducer: <term, docid, locid>* -> <term, <docid, locid>*>

Result: global index file, which can be split by docid range
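A minimal sketch of these mapper/reducer roles (illustrative; a real build would also write a partitioned index file rather than an in-memory dict):

    from collections import defaultdict

    def index_map(url, contents):
        # Emit <term, (docid, locid)> for each term occurrence.
        return [(term, (url, loc)) for loc, term in enumerate(contents.split())]

    def index_reduce(term, postings):
        # Collect all (docid, locid) pairs for this term, sorted by document.
        return sorted(postings)

    docs = [("doc1", "new york city"), ("doc3", "new jersey")]
    groups = defaultdict(list)
    for url, text in docs:
        for term, posting in index_map(url, text):
            groups[term].append(posting)
    index = {t: index_reduce(t, ps) for t, ps in groups.items()}
    print(index["new"])   # [('doc1', 0), ('doc3', 0)]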

Page 19: MapReduce & BigTable

#Exercise

PageRank algorithm, clustering algorithm, recommendation algorithm.

1. Serial formulation of the algorithm:
   1. Core formulas, step-by-step description, and explanation
   2. Input data representation and core data structures

2. Implementation under MapReduce:
   1. How to write map and reduce
   2. What the inputs and outputs of each are

Page 20: MapReduce & BigTable

MapReduce Runtime System

Page 21: MapReduce & BigTable

Google MapReduce Architecture

Single Master node

Many worker bees


Page 22: MapReduce & BigTable

MapReduce Operation

1. Initial data is split into 64MB blocks.
2. Map tasks compute and store their results locally.
3. The master is informed of the result locations.
4. The master sends the data locations to the reduce workers.
5. The final output is written.

Page 23: MapReduce & BigTable

Fault Tolerance

Fault tolerance is achieved through re-execution.
Periodic heartbeats detect failures.
Re-execute both the completed and the in-progress map tasks of a failed node. Why? (Completed map output lives on the failed machine's local disk, while completed reduce output is already in the global file system.)
Re-execute only the in-progress reduce tasks of a failed node.
Task completion is committed through the master.

Robust: once lost 1600 of 1800 machines, yet the job finished OK.

Master failure?

Page 24: MapReduce & BigTable

Refinement: Redundant Execution

Slow workers significantly delay completion time:
  other jobs consuming resources on the machine
  bad disks with soft errors transfer data slowly

Solution: near the end of a phase, spawn backup tasks; whichever copy finishes first "wins".

Dramatically shortens job completion time

Page 25: MapReduce & BigTable

Refinement: Locality Optimization

Master scheduling policy: asks GFS for the locations of the replicas of the input file blocks.
Map tasks are typically split into 64MB pieces (the GFS block size).
Map tasks are scheduled so that a replica of their GFS input block is on the same machine or the same rack.

Effect: thousands of machines read input at local-disk speed.
Without this, rack switches limit the read rate.

Page 26: MapReduce & BigTable

Refinement: Skipping Bad Records

Map/Reduce functions sometimes fail for particular inputs. The best solution is to debug & fix, but that is not always possible (e.g., third-party source libraries).

On a segmentation fault:
  send a UDP packet to the master from the signal handler
  include the sequence number of the record being processed

If the master sees two failures for the same record, the next worker is told to skip that record.

Page 27: MapReduce & BigTable

Other Refinements

Compression of intermediate data.

Combiner: "combiner" functions can run on the same machine as a mapper; they cause a mini-reduce phase to occur before the real reduce phase, to save bandwidth.

Local execution for debugging/testing.

User-defined counters.
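For word count, the combiner can simply reuse the reducer logic, pre-summing counts on the mapper's machine before anything crosses the network (an illustrative sketch, not Hadoop's API):

    from collections import defaultdict

    def combine(mapper_output):
        # Pre-aggregate <word, 1> pairs emitted by one mapper so that only
        # one <word, partial_count> pair per word is sent over the network.
        partial = defaultdict(int)
        for word, count in mapper_output:
            partial[word] += count
        return list(partial.items())

    mapper_output = [("the", 1), ("cat", 1), ("the", 1), ("the", 1)]
    print(combine(mapper_output))   # [('the', 3), ('cat', 1)]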

Page 28: MapReduce & BigTable

Hadoop MapReduce Architecture

[Diagram: a MapReduce job submitted by a client computer goes to the JobTracker on the master node; a TaskTracker on each slave node runs the individual task instances.]

Master/Worker model; load balancing via a polling mechanism.

Page 29: MapReduce & BigTable

History of Hadoop

2004 - Initial versions of what is now the Hadoop Distributed File System and MapReduce implemented by Doug Cutting & Mike Cafarella.
December 2005 - Nutch ported to the new framework. Hadoop runs reliably on 20 nodes.
January 2006 - Doug Cutting joins Yahoo!
February 2006 - Apache Hadoop project officially started to support the standalone development of MapReduce and HDFS.
March 2006 - Formation of the Yahoo! Hadoop team.
May 2006 - Yahoo! sets up a Hadoop research cluster - 300 nodes.
April 2006 - Sort benchmark run on 188 nodes in 47.9 hours.
May 2006 - Sort benchmark run on 500 nodes in 42 hours (better hardware than the April benchmark).
October 2006 - Research cluster reaches 600 nodes.
December 2006 - Sort times: 20 nodes in 1.8 hrs, 100 nodes in 3.3 hrs, 500 nodes in 5.2 hrs, 900 nodes in 7.8 hrs.
January 2007 - Research cluster reaches 900 nodes.
April 2007 - Research clusters - 2 clusters of 1000 nodes.
Sep 2008 - Scaling Hadoop to 4000 nodes at Yahoo!

Page 30: MapReduce & BigTable

Hadoop Software Ecosystem

Page 31: MapReduce & BigTable

BigTable

Page 32: MapReduce & BigTable

Google’s Motivation – Scale!

Scale problem:
  lots of data
  millions of machines
  different projects/applications
  hundreds of millions of users

Storage for (semi-)structured data.
No commercial system was big enough, and Google couldn't have afforded one if there had been.

Low-level storage optimization helps performance significantly; it is much harder to do when running on top of a database layer.

Page 33: MapReduce & BigTable

Bigtable

Distributed multi-level map.
Fault-tolerant, persistent.
Scalable:
  thousands of servers
  terabytes of in-memory data
  petabytes of disk-based data
  millions of reads/writes per second, efficient scans
Self-managing:
  servers can be added/removed dynamically
  servers adjust to load imbalance

Page 34: MapReduce & BigTable

Real Applications

Page 35: MapReduce & BigTable

Data Model

a sparse, distributed, persistent multi-dimensional sorted map

(row, column, timestamp) -> cell contents
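A tiny sketch of this map using the WebTable example from the Bigtable paper, modeled here as a Python dict keyed by (row, column, timestamp); the values and timestamps shown are illustrative:

    # (row, column, timestamp) -> cell contents
    webtable = {
        # Row keys are reversed URLs so that pages of one site sort together.
        ("com.cnn.www", "contents:", 3): "<html>...</html>",
        ("com.cnn.www", "anchor:cnnsi.com", 9): "CNN",
        ("com.cnn.www", "anchor:my.look.ca", 8): "CNN.com",
    }

    # Look up the most recent version of a cell.
    def latest(table, row, column):
        versions = [(ts, v) for (r, c, ts), v in table.items() if r == row and c == column]
        return max(versions)[1] if versions else None

    print(latest(webtable, "com.cnn.www", "contents:"))   # '<html>...</html>'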

Page 36: MapReduce & BigTable

Data Model

Rows:
  arbitrary strings
  access to data in a row is atomic
  ordered lexicographically

Page 37: MapReduce & BigTable

Data Model

Columns
Two-level name structure: family:qualifier
The column family is the unit of access control.

Page 38: MapReduce & BigTable

Data Model

Timestamps:
  store different versions of data in a cell
  lookup options: return the most recent K values, or return all values

Page 39: MapReduce & BigTable

Data Model

The row range for a table is dynamically partitioned. Each row range is called a tablet. Tablets are the unit of distribution and load balancing.

Page 40: MapReduce & BigTable

APIs

Metadata operations: create/delete tables and column families, change metadata.

Writes:
  Set(): write cells in a row
  DeleteCells(): delete cells in a row
  DeleteRow(): delete all cells in a row

Reads:
  Scanner: read arbitrary cells in a bigtable
    Each row read is atomic.
    Can restrict returned rows to a particular range.
    Can ask for just data from 1 row, all rows, etc.
    Can ask for all columns, just certain column families, or specific columns.
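The paper's client API is C++ (RowMutation, Apply, Scanner); the sketch below is a hypothetical Python-flavored equivalent just to show how the write operations fit together. The class and method names are invented for illustration, not Google's actual API.

    # Hypothetical client wrapper -- names are illustrative, not Google's API.
    class RowMutation:
        def __init__(self, table, row_key):
            self.table, self.row_key, self.ops = table, row_key, []

        def set(self, column, value):
            # Queue a write of one cell (a new timestamped version) in this row.
            self.ops.append(("set", column, value))

        def delete_cells(self, column):
            self.ops.append(("delete", column, None))

    def apply(mutation):
        # Atomically apply all queued operations to a single row.
        print(f"applying {len(mutation.ops)} ops to row {mutation.row_key}")

    # Usage: update anchors for one row of the WebTable example.
    m = RowMutation("webtable", "com.cnn.www")
    m.set("anchor:www.c-span.org", "CSPAN")
    m.delete_cells("anchor:www.abc.com")
    apply(m)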

Page 41: MapReduce & BigTable

Typical Cluster

Shared pool of machines that also run other distributed applications

Page 42: MapReduce & BigTable

Building Blocks

Google File System (GFS): stores persistent data (SSTable file format).

Scheduler: schedules jobs onto machines.

Chubby lock service: distributed lock manager; master election, location bootstrapping.

MapReduce (optional): data processing; reads/writes Bigtable data.

Page 43: MapReduce & BigTable

Chubby

A {lock/file/name} service with coarse-grained locks.
Each client has a session with Chubby; the session expires if the client is unable to renew its lease within the lease expiration time.
5 replicas; a majority vote is needed to be active.
Also an OSDI '06 paper.

Page 44: MapReduce & BigTable

Implementation

Single-master distributed system with three major components:

A library that is linked into every client.

One master server:
  assigning tablets to tablet servers
  detecting the addition and expiration of tablet servers
  balancing tablet-server load
  garbage collection
  metadata operations

Many tablet servers:
  handle read and write requests to the tablets they serve
  split tablets that have grown too large

Page 45: MapReduce & BigTable

Implementation

Page 46: MapReduce & BigTable

Tablets

Each tablet is assigned to one tablet server.
A tablet holds a contiguous range of rows; clients can often choose row keys to achieve locality.
Aim for ~100MB to 200MB of data per tablet.

A tablet server is responsible for ~100 tablets.
  Fast recovery: 100 machines each pick up 1 tablet from a failed machine.
  Fine-grained load balancing: migrate tablets away from an overloaded machine; the master makes the load-balancing decisions.

Page 47: MapReduce & BigTable

How to locate a Tablet?

Given a row, how do clients find the location of the tablet whose row range covers the target row?

METADATA table: key = table id + end row, data = tablet location.
Aggressive caching and prefetching on the client side.
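A sketch of the lookup idea (illustrative, assuming the METADATA rows are available as a sorted list): because each METADATA key ends with its tablet's end row, the client can binary-search for the first entry whose end row is >= the target row.

    import bisect

    # METADATA entries for one table: (end_row_of_tablet, tablet_location),
    # kept sorted by end row. The values here are made up for illustration.
    metadata = [
        ("apple", "tabletserver-3"),
        ("mango", "tabletserver-7"),
        ("zzzzz", "tabletserver-1"),   # last tablet covers everything above "mango"
    ]

    def locate_tablet(row_key):
        # The first tablet whose end row is >= row_key covers the target row.
        end_rows = [end for end, _ in metadata]
        i = bisect.bisect_left(end_rows, row_key)
        return metadata[i][1]

    print(locate_tablet("banana"))   # tabletserver-7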

Page 48: MapReduce & BigTable

Tablet Assignment

Each tablet is assigned to one tablet server at a time.
The master server keeps track of the set of live tablet servers and the current assignment of tablets to servers.
When a tablet is unassigned, the master assigns it to a tablet server with sufficient room.
The master uses Chubby to monitor the health of tablet servers and to restart/replace failed servers.

Page 49: MapReduce & BigTable

Tablet Assignment

Chubby:
  A tablet server registers itself by acquiring a lock on a file in a specific Chubby directory.
  Chubby gives a "lease" on the lock, which must be renewed periodically.
  The server loses its lock if it gets disconnected.

The master monitors this directory to find which servers exist and are alive.
If a server is not contactable or has lost its lock, the master grabs the lock and reassigns its tablets.
GFS replicates the data; prefer to start the tablet server on a machine where the data already is.

Page 50: MapReduce & BigTable

Refinement – Locality groups & Compression

Locality groups:
  Multiple column families can be grouped into a locality group.
  A separate SSTable is created for each locality group in each tablet.
  Segregating column families that are not typically accessed together enables more efficient reads.
  In WebTable, page metadata can be in one group and the contents of the page in another group.

Compression:
  Many opportunities for compression:
    similar values in a cell at different timestamps
    similar values in different columns
    similar values across adjacent rows

Page 51: MapReduce & BigTable

Performance - Scaling

As the number of tablet servers is increased by a factor of 500:

Performance of random reads from memory increases by a factor of 300.

Performance of scans increases by a factor of 260.

Not linear! Why?

Page 52: MapReduce & BigTable

Not linearly?

Load imbalance:
  competition with other processes (network, CPU)
  the rebalancing algorithm does not work perfectly
    it tries to reduce the number of tablet movements
    load shifts around as the benchmark progresses

Page 53: MapReduce & BigTable

Thank You!

Q&A

Page 54: MapReduce & BigTable

Calculate PageRank

Input: web graph as <from, <PR, <to>*>>
Iterate until convergence:

Mapper: for each <from, <PR, <to>*>>, emit
  <to, PR / outDegree(from)> for every out-link, and
  <from, <0, <to>*>> to carry the graph structure forward

Shuffle & Sort by <to>

Reducer: takes <to, value>* together with <to, <0, <out>*>> and emits
  <to, ∑(value), <out>*>

Result: the <to, ∑(value)> pairs form PR[], the PageRank result array.
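One hedged sketch of a single PageRank iteration in this style (illustrative; it omits the damping factor and dangling-node handling that a production job would need):

    from collections import defaultdict

    def pr_map(node, pr, out_links):
        # Share this node's rank among its out-links, and re-emit the
        # graph structure so the reducer can rebuild the adjacency list.
        share = pr / len(out_links) if out_links else 0.0
        yield (node, ("graph", out_links))
        for to in out_links:
            yield (to, ("rank", share))

    def pr_reduce(node, values):
        out_links, new_pr = [], 0.0
        for kind, v in values:
            if kind == "graph":
                out_links = v
            else:
                new_pr += v
        return (node, new_pr, out_links)

    # One iteration over a toy 3-node graph: node -> (PR, out-links).
    graph = {"a": (1.0, ["b", "c"]), "b": (1.0, ["c"]), "c": (1.0, ["a"])}
    groups = defaultdict(list)
    for node, (pr, outs) in graph.items():
        for key, value in pr_map(node, pr, outs):
            groups[key].append(value)
    print([pr_reduce(n, vs) for n, vs in sorted(groups.items())])
    # [('a', 1.0, ['b', 'c']), ('b', 0.5, ['c']), ('c', 1.5, ['a'])]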

Page 55: MapReduce & BigTable

Mapreduce Framework

[Diagram: input key/value pairs from data stores 1..n are fed to map tasks, each emitting (key, values...) pairs; a barrier aggregates the intermediate values by output key; reduce tasks then turn each key's intermediate values into final values.]

Page 56: MapReduce & BigTable

Quiz

1. Use two different clustering methods to group the following instances into 3 clusters:

(-4,-2), (-3,-2), (-2,-2), (-1,-2), (1,-1), (1,1), (2,3), (3,2), (3,4), (4,3)

Discuss the differences between these methods. Which methods produced the same result? How do these results differ from what a person would produce by clustering manually?

Page 57: MapReduce & BigTable

2. Your task is to classify words as English or non-English. The words are generated from the following distribution:

(i) Compute the parameters of a multinomial Naive Bayes classifier that uses the letters b, n, o, u, and z as features. When computing the parameters, apply smoothing: smooth zero probabilities to 0.01 and leave non-zero probabilities unchanged.
(ii) What class does this classifier assign to the word "zoo"?