MapReduce with Hadoop at MyLife
June 6, 2013. Speaker: Jeff Meister

DESCRIPTION

A brief talk introducing and explaining MapReduce and Hadoop, and describing part of how we use Hadoop MapReduce at MyLife.com.

TRANSCRIPT

Page 1

MapReduce with Hadoop at MyLife

June 6, 2013. Speaker: Jeff Meister

Page 2

Topics of Talk

• What are MapReduce and Hadoop?

• When would you want to use them?

• How do they work?

• What does Hadoop do for you?

• How do you write MapReduce programs to take advantage of that?

• What do we use them for at MyLife?

Page 3

What are MapReduce and Hadoop?

• MapReduce is a programming model for parallel processing of large datasets

• An idea for how to write programs under certain constraints

• Hadoop is an open-source implementation of MapReduce

• Designed for clusters of commodity machines

Page 4

Motivation: Why would you use MapReduce?

Page 5

Background: Disk vs. Memory

• Memory

• Where the computer keeps data it’s currently working on

• Fast response time, random access supported

• Expensive: typical size in tens of GB

• Hard disk

• More permanent storage of data for future tasks

• Slow response time; efficient only for sequential access

• Cheap: typical size in hundreds or thousands of GB

Page 6

Example Task on Small Datasets

Public records (size: 8 MB):

  ID      Public record
  R1      Steve Jones, 36, 12 Main St, 10001
  R2      John Brown, 72, 625 8th Ave, 90210
  R3      James Davis, 23, 10 Broadway, 20202
  R4      Tom Lewis, 45, 95 Park Pl, 90024
  R5      Tim Harris, 33, PO Box 256, 33514
  ...
  R2000   Adam Parker, 59, 82 F St, 45454

Phone records (size: 3.5 MB):

  ID      Phone number
  P1      Robert White, 45121, (654) 321-4702
  P2      David Johnson, 07470, (973) 602-2519
  P3      Scott Lee, 23910, (602) 412-2255
  P4      Steve Jones, 10001, (212) 347-3380
  P5      John Wayne, 13284, (312) 446-8878
  ...
  P1000   Tom Lewis, 90024, (650) 945-2319

Page 7

Real World: Large Datasets

• 290 million public records = 380 GB

• 228 million phone records = 252 GB

• We could improve the previous algorithm, but...

• The machine doesn’t have enough memory

• Would spend lots of time moving pieces of data between disk and memory

• Disk is so slow, the task is now impractical

• What to do? Use Hadoop MapReduce!

• Divide into smaller tasks, run them in parallel

Page 8

Hadoop: What does it do?

How do you work with it?

Page 9

Components of the Hadoop System

• Hadoop Distributed File System (HDFS)

• Splits up files into blocks, stores them on multiple computers

• Knows which blocks are on each machine

• Transfers blocks between machines over the network

• Replicates blocks, designed to tolerate frequent machine failures

• MapReduce engine

• Supports distributed computation

• Programmer writes Map and Reduce functions

• Engine takes care of parallelization, so you can focus on your work

Page 10

The Map and Reduce Functions

• map : (K1, V1) → List(K2, V2)

• Take an input record and produce (emit) a list of intermediate (key, value) pairs

• reduce : (K2, List(V2)) → List(K3, V3)

• Examine the values for each intermediate key, produce a list of output records

• Critical observation: output type of map ≠ input type of reduce!

• What’s going on in between?
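The types above can be made concrete with a small local simulation, written here in plain Python rather than the Hadoop API. The function names and the word-count example are illustrative, not from the talk: map emits intermediate (K2, V2) pairs, the framework groups them by key, and reduce turns each group into output records.

```python
from collections import defaultdict

# map : (K1, V1) -> List[(K2, V2)]
# Here K1 is a line number, V1 a line of text; K2 a word, V2 the count 1.
def map_fn(line_no, line):
    return [(word, 1) for word in line.split()]

# reduce : (K2, List[V2]) -> List[(K3, V3)]
def reduce_fn(word, counts):
    return [(word, sum(counts))]

def run_mapreduce(inputs, map_fn, reduce_fn):
    # The "in between" phase: group intermediate values by key,
    # as Hadoop does between the map and reduce stages.
    groups = defaultdict(list)
    for k1, v1 in inputs:
        for k2, v2 in map_fn(k1, v1):
            groups[k2].append(v2)
    output = []
    for k2 in sorted(groups):  # Hadoop guarantees reducer input is sorted by key
        output.extend(reduce_fn(k2, groups[k2]))
    return output

lines = [(1, "the cat sat"), (2, "the cat ran")]
print(run_mapreduce(lines, map_fn, reduce_fn))
# [('cat', 2), ('ran', 1), ('sat', 1), ('the', 2)]
```

The shuffle-and-sort step inside run_mapreduce is exactly the "magic" the next slide describes.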

Page 11

The “Magic”: A Fast Parallel Sort

• The core of Hadoop MapReduce is a distributed parallel sorting algorithm

• Hadoop guarantees that the input to each reducer is sorted by key (K2)

• All the (K2, V2) pairs from the mappers are grouped by key

• The reducer gets a list of values corresponding to each key
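The sorted-key guarantee is what turns a flat stream of (K2, V2) pairs into per-key groups. Locally this is just sort-then-group (a sketch using Python's itertools, not Hadoop's actual implementation; the example keys are illustrative):

```python
from itertools import groupby
from operator import itemgetter

# Intermediate (K2, V2) pairs as emitted by several mappers, in arbitrary order
pairs = [("jones", "R1"), ("lewis", "P1000"), ("jones", "P4"), ("lewis", "R4")]

# Sort by key, then group: each reducer call sees (key, [values...])
pairs.sort(key=itemgetter(0))
grouped = [(k, [v for _, v in g]) for k, g in groupby(pairs, key=itemgetter(0))]
print(grouped)  # [('jones', ['R1', 'P4']), ('lewis', ['P1000', 'R4'])]
```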

Page 12

Why Is It Fast?

• Imagine how you might sort a deck of cards

• The most intuitive procedure for humans is very inefficient for computers

• It turns out a far more efficient algorithm, merge sort, is less straightforward

• Split the data up into smaller pieces, sort the pieces individually, then merge them

• Hadoop is using HDFS to do a giant parallel merge sort over its cluster

Page 13

Example Task with MapReduce

• map : (source_id, record) → List(match_key, source_id)

• For each input record, select the fields to match by, make a key out of them

• Use the record’s unique identifier as the value

• reduce : (match_key, List(source_id)) → List(public_record_id, phone_id)

• For each match key, look through the list of unique IDs

• If we find both a public record ID and a phone ID in the same list, match!

• The profiles with these IDs share all fields in the key

• Generate the output pair of matched IDs
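The matching logic above can be sketched locally on a few toy records (the field choices and record values here are illustrative; the real job runs on Hadoop over the full datasets):

```python
from collections import defaultdict

# Toy records keyed by source_id; R-ids are public records, P-ids are phone records.
# Each value holds the fields we match by (here: name and zip code).
records = {
    "R1": ("Steve Jones", "10001"),
    "R4": ("Tom Lewis", "90024"),
    "P2": ("David Johnson", "07470"),
    "P4": ("Steve Jones", "10001"),
    "P1000": ("Tom Lewis", "90024"),
}

def map_fn(source_id, fields):
    # Build the match key from the selected fields; the value is the unique ID
    return [(fields, source_id)]

def reduce_fn(match_key, source_ids):
    # A match requires both a public record ID and a phone ID under the same key
    r_ids = [s for s in source_ids if s.startswith("R")]
    p_ids = [s for s in source_ids if s.startswith("P")]
    return [(r, p) for r in r_ids for p in p_ids]

# Shuffle: group intermediate values by match key, then reduce each group
groups = defaultdict(list)
for sid, fields in records.items():
    for key, value in map_fn(sid, fields):
        groups[key].append(value)

matches = []
for key in sorted(groups):
    matches.extend(reduce_fn(key, groups[key]))
print(sorted(matches))  # [('R1', 'P4'), ('R4', 'P1000')]
```

David Johnson's key appears in only one dataset, so his group produces no output pair.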

Page 14

Example Task on Small Datasets

Public records (size: 8 MB):

  ID      Public record
  R1      Steve Jones, 36, 12 Main St, 10001
  R2      John Brown, 72, 625 8th Ave, 90210
  R3      James Davis, 23, 10 Broadway, 20202
  R4      Tom Lewis, 45, 95 Park Pl, 90024
  R5      Tim Harris, 33, PO Box 256, 33514
  ...
  R2000   Adam Parker, 59, 82 F St, 45454

Phone records (size: 3.5 MB):

  ID      Phone number
  P1      Robert White, 45121, (654) 321-4702
  P2      David Johnson, 07470, (973) 602-2519
  P3      Scott Lee, 23910, (602) 412-2255
  P4      Steve Jones, 10001, (212) 347-3380
  P5      John Wayne, 13284, (312) 446-8878
  ...
  P1000   Tom Lewis, 90024, (650) 945-2319

Page 15

When is MapReduce Appropriate?

• To benefit from using Hadoop:

• The data must be decomposable into many (key, value) pairs

• Each mapper runs the same operation, independently of other mappers

• Map output keys should divide the values into groups of similar size, so the load is balanced across reducers

• Sequential algorithms that are more straightforward may need redesign for the MapReduce model

Page 16

Common Applications of MapReduce

• Many common distributed tasks are easily expressible with MapReduce. A few examples:

• Term frequency counting

• Pattern searching

• Of course, sorting

• Graph algorithms, such as reversal (Web links)

• Inverted index generation

• Data mining (clustering, statistics)
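For instance, inverted index generation fits the model directly: map emits (term, document_id) pairs, and reduce collects the posting list for each term. A local sketch with made-up documents:

```python
from collections import defaultdict

docs = {"d1": "hadoop sorts data", "d2": "hadoop stores data"}

# Map phase: emit a (term, doc_id) pair for every term occurrence
pairs = [(term, doc_id) for doc_id, text in docs.items() for term in text.split()]

# Shuffle + reduce: group doc ids under each term to form the posting lists
index = defaultdict(set)
for term, doc_id in pairs:
    index[term].add(doc_id)

print({term: sorted(ids) for term, ids in sorted(index.items())})
# {'data': ['d1', 'd2'], 'hadoop': ['d1', 'd2'], 'sorts': ['d1'], 'stores': ['d2']}
```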

Page 17

MapReduce at MyLife

Page 18

Applications of MapReduce at MyLife

• We regularly run computations over large sets of people data

• Who’s Searching For You

• Content-based aggregation pipeline (1.5 TB)

• Deltas of licensed data updates (300 GB)

• Generating search indexes for the old platform

• Various ad hoc jobs involving matching, searching, extraction, counting, de-duplication, and more

Page 19

Hadoop Cluster Specifications

• Currently 63 machines, each configured to run 4 or 6 map or reduce tasks at once (total capacity 296)

• CPU:

• Each machine: 2x quad-core Opteron @ 2.2 GHz

• Memory:

• Each machine: 32 GB

• Cluster total: 2 TB

• Hard disk:

• Each machine: between 3 and 9 TB

• Total HDFS capacity: 345 TB

Page 20

Other Companies Using Hadoop

• Yahoo! - Index calculations for Web search

• Facebook - Analytics and machine learning

• World’s largest Hadoop cluster!

• Amazon - Supports Hadoop on EC2/S3 cloud services

• LinkedIn

• People You May Know

• Viewers of This Profile Also Viewed

• Apple - Used in iAds platform

• Twitter - Data warehousing and analytics

• Lots more... http://wiki.apache.org/hadoop/PoweredBy

Page 21

Further Reading

• Google research papers

• Google File System, SOSP 2003

• MapReduce, OSDI 2004

• BigTable, OSDI 2006

• Hadoop manual: http://hadoop.apache.org/

• Other Hadoop-related projects from Apache: Cassandra, HBase, Hive, Pig