Large-Scale Data Processing with Hadoop and PHP (ZendCon 2011, 2011-10-19)

TRANSCRIPT

Page 1:

LARGE-SCALE DATA PROCESSING WITH HADOOP AND PHP

Page 2:

David Zuelke

Page 3:

David Zülke

Page 4:
Page 5:

http://en.wikipedia.org/wiki/File:München_Panorama.JPG

Page 6:

Founder

Page 8:

Lead Developer

Page 11:

THE BIG DATA CHALLENGE
Distributed and Parallel Computing

Page 12:

we want to process data

Page 13:

how much data exactly?

Page 14:

SOME NUMBERS

• Facebook

• New data per day:

• 200 GB (March 2008)

• 2 TB (April 2009)

• 4 TB (October 2009)

• 12 TB (March 2010)

• Google

• Data processed per month: 400 PB (in 2007!)

• Average job size: 180 GB

Page 15:

what if you have that much data?

Page 16:

what if you have just 1% of that amount?

Page 17:

“No Problemo”, you say?

Page 18:

reading 180 GB sequentially off a disk will take ~45 minutes
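(That figure assumes a sustained sequential read rate of roughly 65-70 MB/s, typical for a single spinning disk at the time: 180 GB is about 184,320 MB, and 184,320 MB / 68 MB/s ≈ 2,700 s ≈ 45 minutes.)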

Page 19:

and you only have 16 to 64 GB of RAM per computer

Page 20:

so you can't process everything at once

Page 21:

general rule of modern computers:

Page 22:

data can be processed much faster than it can be read

Page 23:

solution: parallelize your I/O

Page 24:

but now you need to coordinate what you’re doing

Page 25:

and that’s hard

Page 26:

what if a node dies?

Page 27:

is data lost? will other nodes in the grid have to restart?

how do you coordinate this?

Page 28:

ENTER: OUR HERO
Introducing MapReduce

Page 29:

in the olden days, the workload was distributed across a grid

Page 30:

and the data was shipped around between nodes

Page 31:

or even stored centrally on something like a SAN

Page 32:

which was fine for small amounts of information

Page 33:

but today, on the web, we have big data

Page 34:

I/O bottleneck

Page 35:

along came a Google publication in 2004

Page 36:

MapReduce: Simplified Data Processing on Large Clusters
http://labs.google.com/papers/mapreduce.html

Page 37:

now the data is distributed

Page 38:

computing happens on the nodes where the data already is

Page 39:

processes are isolated and don’t communicate (share-nothing)

Page 40:

BASIC PRINCIPLE: MAPPER

• A Mapper reads records and emits <key, value> pairs

• Example: Apache access.log

• Each line is a record

• Extract client IP address and number of bytes transferred

• Emit IP address as key, number of bytes as value

• For hourly rotating logs, the job can be split across 24 nodes*

* In practice, it’s a lot smarter than that

Page 41:

BASIC PRINCIPLE: REDUCER

• A Reducer is given a key and all values for this specific key

• Even if there are many Mappers on many computers, the results are aggregated before they are handed to Reducers

• Example: Apache access.log

• The Reducer is called once for each client IP (that’s our key), with a list of values (transferred bytes)

• We simply sum up the bytes to get the total traffic per IP!

Page 42:

EXAMPLE OF MAPPED INPUT

IP Bytes

212.122.174.13 18271

212.122.174.13 191726

212.122.174.13 198

74.119.8.111 91272

74.119.8.111 8371

212.122.174.13 43

Page 43:

REDUCER WILL RECEIVE THIS

IP              Bytes

212.122.174.13  18271

212.122.174.13  191726

212.122.174.13  198

212.122.174.13  43

74.119.8.111    91272

74.119.8.111    8371

Page 44:

AFTER REDUCTION

IP Bytes

212.122.174.13 210238

74.119.8.111 99643

Page 45:

PSEUDOCODE

function map($line_number, $line_text) {
    $parts = parse_apache_log($line_text);
    emit($parts['ip'], $parts['bytes']);
}

function reduce($key, $values) {
    $bytes = array_sum($values);
    emit($key, $bytes);
}

212.122.174.13  210238
74.119.8.111    99643

212.122.174.13 - - [30/Oct/2009:18:14:32 +0100] "GET /foo HTTP/1.1" 200 18271
212.122.174.13 - - [30/Oct/2009:18:14:32 +0100] "GET /bar HTTP/1.1" 200 191726
212.122.174.13 - - [30/Oct/2009:18:14:32 +0100] "GET /baz HTTP/1.1" 200 198
74.119.8.111   - - [30/Oct/2009:18:14:32 +0100] "GET /egg HTTP/1.1" 200 91272
74.119.8.111   - - [30/Oct/2009:18:14:32 +0100] "GET /moo HTTP/1.1" 200 8371
212.122.174.13 - - [30/Oct/2009:18:14:32 +0100] "GET /yay HTTP/1.1" 200 43

Page 46:

A YELLOW ELEPHANT
Introducing Apache Hadoop

Page 48:

The name my kid gave a stuffed yellow elephant. Short, relatively easy to spell and pronounce, meaningless and not used elsewhere: those are my naming criteria. Kids are good at generating such. Googol is a kid’s term.

Doug Cutting

Page 49:

Hadoop is a MapReduce framework

Page 50:

it allows us to focus on writing Mappers, Reducers etc.

Page 51:

and it works extremely well

Page 52:

how well exactly?

Page 53:

HADOOP AT FACEBOOK (I)

• Predominantly used in combination with Hive (~95%)

• 8400 cores with ~12.5 PB of total storage

• 8 cores, 12 TB storage and 32 GB RAM per node

• 1x Gigabit Ethernet for each server in a rack

• 4x Gigabit Ethernet from rack switch to core

http://www.slideshare.net/royans/facebooks-petabyte-scale-data-warehouse-using-hive-and-hadoop

Hadoop is aware of racks and locality of nodes

Page 54:

HADOOP AT FACEBOOK (II)

• Daily stats:

• 25 TB logged by Scribe

• 135 TB of compressed data scanned

• 7500+ Hive jobs

• ~80k compute hours

• New data per day:

• I/08: 200 GB

• II/09: 2 TB (compressed)

• III/09: 4 TB (compressed)

• I/10: 12 TB (compressed)

http://www.slideshare.net/royans/facebooks-petabyte-scale-data-warehouse-using-hive-and-hadoop

Page 55:

HADOOP AT YAHOO!

• Over 25,000 computers with over 100,000 CPUs

• Biggest cluster:

• 4000 Nodes

• 2x4 CPU cores each

• 16 GB RAM each

• Over 40% of jobs run using Pig

http://wiki.apache.org/hadoop/PoweredBy

Page 56:

OTHER NOTABLE USERS

• Twitter (storage, logging, analysis. Heavy users of Pig)

• Rackspace (log analysis; data pumped into Lucene/Solr)

• LinkedIn (friend suggestions)

• Last.fm (charts, log analysis, A/B testing)

• The New York Times (converted 4 TB of scans using EC2)

Page 57:

JOB PROCESSING
How Hadoop Works

Page 58:

Just like I already described! It’s MapReduce! \o/

Page 59:

BASIC RULES

• Uses Input Formats to split up your data into single records

• You can optimize using combiners to reduce locally on a node

• Only possible in some cases, e.g. for max() but not avg(); see the sketch after this list

• You can control partitioning of map output yourself

• Rarely useful, the default partitioner (key hash) is enough

• And a million other things that really don’t matter right now ;)
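Why the max()/avg() distinction? A combiner’s partial results get merged again later, so the operation must give the same answer no matter how the values are grouped. max() does: max(max(1, 5), 3) = max(1, 5, 3). avg() does not: avg(avg(1, 2), 6) = 3.75, but avg(1, 2, 6) = 3. A common workaround, sketched below in PHP (a standard trick, not something this deck prescribes), is to carry (sum, count) pairs and divide only at the very end:

<?php
// Combinable "average": keep (sum, count) pairs as the intermediate value.
// Pairs can be merged in any order and grouping; the actual average is
// computed only once, in the final reducer, as $sum / $count.
function combine(array $partials) {
    $sum = 0;
    $count = 0;
    foreach ($partials as $p) {
        $sum   += $p[0]; // partial total
        $count += $p[1]; // number of records behind that total
    }
    return array($sum, $count); // still a combinable pair
}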

Page 60:

HDFS
Hadoop Distributed File System

Page 61:

HDFS

• Stores data in blocks (default block size: 64 MB)

• Designed for very large data sets

• Designed for streaming rather than random reads

• Write-once, read-many (although appending is possible)

• Capable of compression and other cool things

Page 62:

HDFS CONCEPTS

• Large blocks minimize the number of seeks and maximize throughput

• Blocks are stored redundantly (3 replicas by default)

• Aware of infrastructure characteristics (nodes, racks, ...)

• Datanodes hold blocks

• Namenode holds the metadata

The namenode is the critical component of an HDFS cluster: a single point of failure that needs special care for high availability
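To put the defaults together: a 1 GB file becomes sixteen 64 MB blocks, and with 3x replication there are 48 block copies spread across the datanodes; if a disk dies, the namenode schedules re-replication from the surviving copies.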

Page 63:

there’s just one little problem

Page 64:

you need to write Java code

Page 65:

however, there is hope...

Page 66:

STREAMING
Hadoop Won’t Force Us to Use Java

Page 67:

Hadoop Streaming can use any script as Mapper or Reducer

Page 68:

many configuration options (parsers, formats, combining, …)

Page 69:

it works using STDIN and STDOUT

Page 70:

Mappers are streamed the records (usually by line: <line>\n)

and emit key/value pairs: <key>\t<value>\n
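A minimal sketch of such a mapper in PHP, assuming the access.log example from earlier (the regex standing in for parse_apache_log is ours, and it simply assumes the byte count is the last field on each line):

#!/usr/bin/env php
<?php
// mapper.php - streaming mapper sketch: read records from STDIN,
// emit <ip>\t<bytes>\n on STDOUT
while (($line = fgets(STDIN)) !== false) {
    if (preg_match('/^(\S+).*\s(\d+)\s*$/', $line, $m)) {
        echo $m[1], "\t", $m[2], "\n";
    }
}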

Page 71:

Reducers are streamed key/value pairs:
<keyA>\t<value1>\n
<keyA>\t<value2>\n
<keyA>\t<value3>\n
<keyB>\t<value4>\n

Page 72:

Caution: there is no separate Reducer process per key (but keys are sorted)
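That caveat matters: a single reducer process sees the entire sorted stream, so the script has to detect key changes itself. A sketch under the same assumptions as the mapper above:

#!/usr/bin/env php
<?php
// reducer.php - streaming reducer sketch: input arrives sorted by key,
// but one process handles many keys, so we track key boundaries ourselves
$currentKey = null;
$sum = 0;
while (($line = fgets(STDIN)) !== false) {
    list($key, $value) = explode("\t", rtrim($line, "\n"), 2);
    if ($key !== $currentKey) {
        if ($currentKey !== null) {
            echo $currentKey, "\t", $sum, "\n"; // flush the previous key
        }
        $currentKey = $key;
        $sum = 0;
    }
    $sum += (int)$value;
}
if ($currentKey !== null) {
    echo $currentKey, "\t", $sum, "\n"; // don't forget the last key
}

Because it is all plain STDIN/STDOUT, the pipeline can be dry-run locally, with sort standing in for Hadoop’s shuffle:

php mapper.php < access.log | sort | php reducer.php

Submitting it to a cluster goes through the streaming jar (its exact path varies by version and distribution), roughly:

hadoop jar $HADOOP_HOME/contrib/streaming/hadoop-streaming-*.jar \
    -input /logs/access.log -output /logs/bytes-per-ip \
    -mapper mapper.php -reducer reducer.php \
    -file mapper.php -file reducer.php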

Page 73:

STREAMING WITH PHP
Introducing HadooPHP

Page 74:

HADOOPHP

• A little framework to help with writing mapred jobs in PHP

• Takes care of input splitting, can do basic decoding et cetera

• Automatically detects and handles Hadoop settings such as key length or field separators

• Packages jobs as one .phar archive to ease deployment

• Also creates a ready-to-rock shell script to invoke the job

Page 75:

written by

Page 76:
Page 78:

DEMO
Hadoop Streaming & PHP in Action

Page 79:

The End

Page 80:

RESOURCES

• http://www.cloudera.com/developers/learn-hadoop/

• Tom White: Hadoop: The Definitive Guide. O’Reilly, 2009

• http://www.cloudera.com/hadoop/

• Cloudera Distribution for Hadoop is easy to install and has all the stuff included: Hadoop, Hive, Flume, Sqoop, Oozie, …

Page 81:

Questions?