
Hadoop and Big Data

Keijo Heljanko

Department of Information and Computer Science
School of Science
Aalto University

[email protected]

12.6-2013


Business Drivers of Cloud Computing

- Large data centers allow for economies of scale
  - Cheaper hardware purchases
  - Cheaper cooling of hardware
  - Example: Google paid 40 MEur for a Summa paper mill site in Hamina, Finland: data center cooled with sea water from the Baltic Sea
  - Cheaper electricity
  - Cheaper network capacity
  - Smaller number of administrators per computer
- Unreliable commodity hardware is used
  - Reliability obtained by replication of hardware components combined with a fault-tolerant software stack


Cloud Computing Technologies

A collection of technologies aimed to provide elastic "pay as you go" computing:

- Virtualization of computing resources: Amazon EC2, Eucalyptus, OpenNebula, OpenStack Compute, ...
- Scalable file storage: Amazon S3, GFS, HDFS, ...
- Scalable batch processing: Google MapReduce, Apache Hadoop, PACT, Microsoft Dryad, Google Pregel, Berkeley Spark, ...
- Scalable datastores: Amazon Dynamo, Apache Cassandra, Google Bigtable, HBase, ...
- Distributed consensus: Google Chubby, Apache ZooKeeper, ...
- Scalable Web application hosting: Google App Engine, Microsoft Azure, Heroku, ...


Clock Speed of Processors

- Herb Sutter: The Free Lunch Is Over: A Fundamental Turn Toward Concurrency in Software. Dr. Dobb's Journal, 30(3), March 2005 (updated graph in August 2009).


Implications of the End of Free Lunch

- The clock speeds of microprocessors are not going to improve much in the foreseeable future
- The efficiency gains in single-threaded performance are going to be only moderate
- The number of transistors in a microprocessor is still growing at a high rate
- One of the main uses of the transistors has been to increase the number of computing cores the processor has
- The number of cores in a low-end workstation (such as those employed in large-scale datacenters) is going to keep on steadily growing
- Programming models need to change to efficiently exploit all the available concurrency - scalability to a high number of cores/processors will need to be a major focus


Warehouse-scale Computing (WSC)

- The smallest unit of computation at Google scale is a warehouse full of computers
- [WSC]: Luiz André Barroso, Urs Hölzle: The Datacenter as a Computer: An Introduction to the Design of Warehouse-Scale Machines. Morgan & Claypool Publishers, 2009. http://dx.doi.org/10.2200/S00193ED1V01Y200905CAC006
- The WSC book says: ". . . we must treat the datacenter itself as one massive warehouse-scale computer (WSC)."


Jeffrey Dean (Google): LADIS 2009 Keynote Failure Numbers

- LADIS 2009 keynote: "Designs, Lessons and Advice from Building Large Distributed Systems" http://www.cs.cornell.edu/projects/ladis2009/talks/dean-keynote-ladis2009.pdf
- At warehouse scale, failures will be the norm and thus fault-tolerant software will be inevitable
- Hypothetically assume that the server mean time between failures (MTBF) would be 30 years (a highly optimistic, unrealistic number!). In a WSC with 10000 nodes roughly one node will fail per day (see the arithmetic sketch after this list)
- Typical yearly flakiness metrics from J. Dean (Google, 2009):
  - 1-5% of your disk drives will die
  - Servers will crash at least twice (2-4% failure rate)
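As a rough sanity check of the one-failure-per-day claim above, here is a back-of-the-envelope sketch assuming independent failures and the stated 30-year MTBF:

\[
\frac{10000\ \text{nodes}}{30\ \text{years} \times 365\ \text{days/year}}
  \approx \frac{10000}{10950}
  \approx 0.9\ \text{node failures per day}
\]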


Big Data

- As of May 2009, the amount of digital content in the world is estimated to be 500 Exabytes (500 million TB)
- An EMC-sponsored study by IDC in 2007 estimates the amount of information created in 2010 to be 988 EB
- Worldwide estimated hard disk sales in 2010: ≈ 675 million units
- Data comes from: video, digital images, sensor data, biological data, Internet sites, social media, ...
- The problem of such large data masses, termed Big Data, calls for new approaches to storage and processing of data


Example: Simple Word Search

- Example: Suppose you need to search a 2 TB mass of text for a word that occurs only once in it, using a compute cluster
- Assuming a 100 MB/s read speed, in the worst case reading all data from a single 2 TB disk takes ≈ 5.5 hours
- If 100 hard disks can be used in parallel, the same task takes less than four minutes (see the arithmetic sketch after this list)
- Scaling using hard disk parallelism is one of the design goals of scalable batch processing in the cloud
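As a worked version of the arithmetic above (a minimal sketch assuming the stated 100 MB/s per-disk read speed and ideal parallel speedup):

\[
\frac{2\,\text{TB}}{100\ \text{MB/s}} = \frac{2\,000\,000\ \text{MB}}{100\ \text{MB/s}} = 20\,000\ \text{s} \approx 5.5\ \text{h},
\qquad
\frac{20\,000\ \text{s}}{100\ \text{disks}} = 200\ \text{s} \approx 3.3\ \text{min}
\]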


Google MapReduce

- A scalable batch processing framework developed at Google for computing the Web index
- When dealing with Big Data (a substantial portion of the Internet in the case of Google!), the only viable option is to use hard disks in parallel to store and process it
- Some of the challenges for storage come from Web services that store and distribute pictures and videos
- We need a system that can effectively utilize hard disk parallelism and hide hard disk and other component failures from the programmer


Google MapReduce (cnt.)

- MapReduce is tailored for batch processing with hundreds to thousands of machines running in parallel; typical job runtimes are from minutes to hours
- As an added bonus we would like to get increased programmer productivity compared to each programmer developing their own tools for utilizing hard disk parallelism


Google MapReduce (cnt.)

- The MapReduce framework takes care of all issues related to parallelization, synchronization, load balancing, and fault tolerance. All these details are hidden from the programmer
- The system needs to be linearly scalable to thousands of nodes working in parallel. The only way to do this is to have a very restricted programming model where the communication between nodes happens in a carefully controlled fashion
- Apache Hadoop is an open source MapReduce implementation used by Yahoo!, Facebook, and Twitter


MapReduce and Functional Programming

- Based on functional programming "in the large":
  - The user is only allowed to write side-effect free "Map" and "Reduce" functions
  - Re-execution is used for fault tolerance. If a node executing a Map or a Reduce task fails to produce a result due to hardware failure, the task will be re-executed on another node
  - Side effects in functions would make this impossible, as one could not re-create the environment in which the original task executed
  - One just needs fault-tolerant storage of task inputs
- The functions themselves are usually written in a standard imperative programming language, usually Java


Why No Side-Effects?

- Side-effect free programs will produce the same output regardless of the number of computing nodes used by MapReduce
- Running the code on one machine for debugging purposes produces the same results as running the same code in parallel
- It is easy to introduce side effects into MapReduce programs, as the framework does not enforce a strict programming methodology. However, the behavior of such programs is undefined by the framework, and side effects should therefore be avoided (a cautionary sketch follows below)
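To make the warning above concrete, here is a minimal, hypothetical anti-pattern sketch (not from the original slides; the class name and the static counter are invented for illustration) of a Mapper with a side effect:

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Hypothetical anti-pattern: this Mapper has a side effect (a mutable static
// counter shared across map() calls). If the task is re-executed after a
// failure, or runs speculatively on two nodes, the counter's final value
// depends on how many times and where the task ran.
public class SideEffectingMapper
        extends Mapper<LongWritable, Text, Text, IntWritable> {

    private static long linesSeen = 0;  // side effect: survives across calls in the same JVM

    @Override
    public void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        linesSeen++;                                        // not reproducible under re-execution
        context.write(new Text("lines"), new IntWritable(1)); // the side-effect free output stays correct
    }
}

Because the framework may re-execute or duplicate this task, the value of linesSeen is not well defined, while the side-effect free (key, value) output remains deterministic.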


Yahoo! MapReduce Tutorial

- We use figures from the excellent Yahoo! MapReduce tutorial [YDN-MR], available from: http://developer.yahoo.com/hadoop/tutorial/module4.html
- In functional programming, two list processing concepts are used:
  - Mapping a list with a function
  - Reducing a list with a function
- A small sketch of both concepts in plain Java follows below
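The following is a minimal sketch (not part of the original slides; the class name is hypothetical) of the two list-processing concepts in plain Java, using Java 8 streams:

import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

public class ListMapReduceSketch {
    public static void main(String[] args) {
        List<Integer> input = Arrays.asList(1, 2, 3, 4);

        // "Map": apply a function to every list element, producing a new list.
        List<Integer> squared = input.stream()
                                     .map(x -> x * x)              // [1, 4, 9, 16]
                                     .collect(Collectors.toList());

        // "Reduce": combine the list into a single value with a reduce function.
        int sum = squared.stream().reduce(0, Integer::sum);        // 30

        System.out.println(squared + " -> " + sum);
    }
}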


Mapping a List

Mapping a list applies the mapping function to each list element (in parallel) and outputs the list of mapped list elements:

Figure: Mapping a List with a Map Function, Figure 4.1 from[YDN-MR]


Reducing a List

Reducing a list iterates over the list sequentially and produces an output created by the reduce function:

Figure: Reducing a List with a Reduce Function, Figure 4.2 from[YDN-MR]


Grouping Map Output by Key to Reducer

In MapReduce the map function outputs (key, value) pairs. The MapReduce framework groups map outputs by key, and gives each reduce function instance a (key, (..., list of values, ...)) pair as input. Note: each list of values having the same key will be independently processed:

Figure: Keys Divide Map Output to Reducers, Figure 4.3 from[YDN-MR]


MapReduce Data Flow

Practical MapReduce systems split input data into large (64 MB+) blocks fed to user-defined map functions:

Figure: High Level MapReduce Dataflow, Figure 4.4 from [YDN-MR]


Recap: Map and Reduce Functions

- The framework only allows a user to write two functions: a "Map" function and a "Reduce" function
- The Map function is fed blocks of data (block size 64-128 MB), and it produces (key, value) pairs
- The framework groups all values with the same key into a (key, (..., list of values, ...)) format, and these are then fed to the Reduce function
- A special Master node takes care of the scheduling and fault tolerance by re-executing Mappers or Reducers
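For reference, the type signatures of the two user-supplied functions, as given in the Dean and Ghemawat MapReduce paper (cited in the diagram caption below), can be written as:

\begin{align*}
\mathit{map}    &: (k_1, v_1) \rightarrow \mathit{list}(k_2, v_2) \\
\mathit{reduce} &: (k_2, \mathit{list}(v_2)) \rightarrow \mathit{list}(v_2)
\end{align*}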


MapReduce Diagram

[Figure: MapReduce execution overview. The user program forks a Master and Worker processes; the Master assigns map and reduce tasks to Workers. Map Workers read the input splits (split 0-4) and write intermediate files to their local disks; Reduce Workers remotely read the intermediate files and write the output files (output file 0, output file 1). Phases: Input files -> Map phase -> Intermediate files (on local disks) -> Reduce phase -> Output files.]

Figure: J. Dean and S. Ghemawat: MapReduce: Simplified Data Processing on Large Clusters, OSDI 2004


Example: Word Count

- Classic word count example from the Hadoop MapReduce tutorial: http://hadoop.apache.org/common/docs/current/mapred_tutorial.html
- Consider doing a word count of the following file using MapReduce:

Hello World Bye World
Hello Hadoop Goodbye Hadoop


Example: Word Count (cnt.)

- Consider a Map function that reads in words one at a time, and outputs (word, 1) for each parsed input word
- The Map function output is:

(Hello, 1)
(World, 1)
(Bye, 1)
(World, 1)
(Hello, 1)
(Hadoop, 1)
(Goodbye, 1)
(Hadoop, 1)


Example: Word Count (cnt.)

- The Shuffle phase between the Map and Reduce phases creates a list of values associated with each key
- The Reduce function input is:

(Bye, (1))
(Goodbye, (1))
(Hadoop, (1, 1))
(Hello, (1, 1))
(World, (1, 1))


Example: Word Count (cnt.)

- Consider a Reduce function that sums the numbers in the list for each key and outputs (word, count) pairs. The output of the Reduce function is the output of the MapReduce job:

(Bye, 1)
(Goodbye, 1)
(Hadoop, 2)
(Hello, 2)
(World, 2)


Phases of MapReduce

1. A Master (in Hadoop terminology: Job Tracker) is started that coordinates the execution of a MapReduce job. Note: the Master is a single point of failure
2. The Master creates a predefined number M of Map workers, and assigns each one an input split to work on. It also later starts a predefined number R of Reduce workers
3. Input is assigned to a free Map worker one 64-128 MB split at a time, and each user-defined Map function is fed (key, value) pairs as input and also produces (key, value) pairs


Phases of MapReduce (cnt.)

4. Periodically the Map workers flush their (key, value) pairs to the local hard disks, partitioning them by key into R partitions (default: use hashing; see the sketch after this list), one per Reduce worker
5. When all the input splits have been processed, a Shuffle phase starts where M × R file transfers are used to send all of the mapper outputs to the reducer handling each key partition. After a reducer receives its input files, the reducer sorts (and groups) the (key, value) pairs by the key
6. User-defined Reduce functions iterate over the (key, (..., list of values, ...)) lists, generating output (key, value) pair files, one per reducer
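A minimal sketch of the default hash partitioning rule mentioned in step 4 (Hadoop's HashPartitioner computes the reducer index the same way; the class and method names here are invented for illustration):

public class HashPartitionSketch {
    // The key's hash code, with the sign bit masked off so the result is
    // non-negative, taken modulo the number R of reduce tasks.
    static int partitionFor(Object key, int numReduceTasks) {
        return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
    }

    public static void main(String[] args) {
        System.out.println(partitionFor("Hello", 4)); // deterministic index in 0..3
    }
}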


Google MapReduce (cnt.)

- The user just supplies the Map and Reduce functions, nothing more
- The only means of communication between nodes is through the shuffle from a mapper to a reducer
- The framework can be used to implement a distributed sorting algorithm by using a custom partitioning function (see the sketch after this list)
- The framework does automatic parallelization and fault tolerance by using a centralized Job Tracker (Master) and a distributed filesystem that stores all data redundantly on compute nodes
- Uses the functional programming paradigm to guarantee correctness of parallelization and to implement fault tolerance by re-execution
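As a hedged illustration of the custom-partitioning idea (this range partitioner and its split points are hypothetical, not from the slides): if keys are routed to reducers in key-range order, concatenating the sorted per-reducer output files yields a globally sorted result.

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

// Hypothetical range partitioner: partitions Text keys by their first letter so
// that reducer i receives an alphabetically lower key range than reducer i+1.
public class FirstLetterPartitioner extends Partitioner<Text, IntWritable> {
    @Override
    public int getPartition(Text key, IntWritable value, int numPartitions) {
        String s = key.toString();
        if (s.isEmpty()) {
            return 0;
        }
        char c = Character.toLowerCase(s.charAt(0));
        // Map 'a'..'z' to 0..numPartitions-1; everything else goes to partition 0.
        int bucket = (c < 'a' || c > 'z') ? 0 : (c - 'a') * numPartitions / 26;
        return Math.min(bucket, numPartitions - 1);
    }
}

It would be enabled in the job driver with job.setPartitionerClass(FirstLetterPartitioner.class).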


Apache Hadoop

- An open source implementation of the MapReduce framework, originally developed by Doug Cutting and heavily used by e.g. Yahoo! and Facebook
- "Moving Computation is Cheaper than Moving Data" - ship code to data, not data to code
- Map and Reduce workers are also storage nodes for the underlying distributed filesystem: job allocation is first tried to a node having a copy of the data, and if that fails, then to a node in the same rack (to maximize network bandwidth)

I Project Web page: http://hadoop.apache.org/


Apache Hadoop (cnt.)

- When deciding whether MapReduce is the correct fit for an algorithm, one has to remember the fixed data-flow pattern of MapReduce. The algorithm has to be efficiently mapped to this data-flow pattern in order to efficiently use the underlying computing hardware
- Builds reliable systems out of unreliable commodity hardware by replicating most components (exceptions: the Master/Job Tracker and the NameNode in the Hadoop Distributed File System)


Apache Hadoop (cnt.)

- Tuned for large (gigabytes of data) files
- Designed for very large 1 PB+ data sets
- Designed for streaming data accesses in batch processing; designed for high bandwidth instead of low latency
- For scalability: NOT a POSIX filesystem
- Written in Java, runs as a set of user-space daemons


Hadoop Distributed Filesystem (HDFS)

- A distributed replicated filesystem: all data is replicated by default on three different DataNodes
- Inspired by the Google Filesystem
- Each node is usually a Linux compute node with a small number of hard disks (4-12)
- A single NameNode maintains the file locations; there are many DataNodes (1000+)


Hadoop Distributed Filesystem (cnt.)

- Any piece of data is available if at least one DataNode replica is up and running
- Rack optimized: by default one replica is written locally, a second in the same rack, and a third replica in another rack (to protect against rack failures, e.g., a failed rack switch or rack power feed)
- Uses a large block size; 128 MB is a common default - designed for batch processing
- For scalability: a write-once, read-many filesystem


Implications of Write Once

- All applications need to be re-engineered to only do sequential writes. Example systems working on top of HDFS:
  - HBase (Hadoop Database), a database system with only sequential writes, a Google Bigtable clone
  - MapReduce batch processing system
  - Apache Pig and Hive data mining tools
  - Mahout machine learning libraries
  - Lucene and Solr full text search
  - Nutch web crawling


Two Large Hadoop Installations

- Yahoo! (2009): 4000 nodes, 16 PB raw disk, 64 TB RAM, 32K cores
- Facebook (2010): 2000 nodes, 21 PB storage, 64 TB RAM, 22.4K cores
  - 12 TB (compressed) data added per day, 800 TB (compressed) data scanned per day
  - A. Thusoo, Z. Shao, S. Anthony, D. Borthakur, N. Jain, J. Sen Sarma, R. Murthy, H. Liu: Data warehousing and analytics infrastructure at Facebook. SIGMOD Conference 2010: 1013-1020. http://doi.acm.org/10.1145/1807167.1807278


Writing Hadoop Jobs

- See Hadoop installation instructions at: https://noppa.aalto.fi/noppa/kurssi/t-79.5308/practical_matters
- Example: Assume we want to count the words in a file using Apache Hadoop
- What we need to write in Java:
  - Code to include the needed Hadoop Java libraries
  - Code for the Mapper
  - Code for the Reducer
  - (Optionally: Code for the Combiner)
  - Code for the main method for Job configuration and submission


WordCount: Including Hadoop Java Libraries

package org.apache.hadoop.examples;

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.util.GenericOptionsParser;


WordCount: Including Hadoop Java Libraries

- The package org.apache.hadoop.examples gives the package name we are defining
- The java.util.StringTokenizer is a utility to parse a string into a list of words by returning an iterator over the words of the string
- Most org.apache.hadoop.* libraries listed are needed by all Hadoop jobs; cut&paste them from the examples


WordCount: Including Hadoop Java Libraries

- The IntWritable is a Hadoop variant of the Java Integer type
- The Text is a Hadoop variant of the Java String class
- In addition: the LongWritable is a Hadoop variant of the Java Long type, the FloatWritable is a Hadoop variant of the Java Float type, etc.
- The Hadoop types must be used for both Mapper and Reducer keys and values instead of the Java native types (see the conversion sketch below)
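A minimal sketch (not from the original slides; the class name is hypothetical) of converting between Java native types and the Hadoop wrapper types:

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;

public class WritableConversionSketch {
    public static void main(String[] args) {
        IntWritable count = new IntWritable(42);  // Java int -> IntWritable
        int n = count.get();                      // IntWritable -> Java int

        Text word = new Text("Hadoop");           // Java String -> Text
        String s = word.toString();               // Text -> Java String

        word.set("Kalevala");                     // reuse the same Text object
        count.set(n + 1);                         // reuse the same IntWritable object

        System.out.println(word + " " + count + " " + s);
    }
}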


WordCount: Code for the Mapper

public class WordCount {

  public static class TokenizerMapper
      extends Mapper<Object, Text, Text, IntWritable> {

    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();

    public void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, one);
      }
    }
  }


WordCount: Code for the Mapper

- The line public class WordCount states that we are defining the WordCount class, to which the Mapper, the Reducer, and the main method all belong
- The line public static class TokenizerMapper gives the name of the Mapper we are defining
- The four parameters of the Mapper in: extends Mapper<Object, Text, Text, IntWritable> are the types of the input key, input value, output key, and output value, respectively
- The key argument is not used by the WordCount Mapper, but by default it is a long integer offset for text inputs


WordCount: Code for the Mapper

- Remember that Hadoop keys and values are of Hadoop internal types: Text, IntWritable, LongWritable, FloatWritable, etc.
- Code is needed to convert between them and the Java native types. Because of this, we define the Hadoop objects used for creating the Mapper output:
  - The code private final static IntWritable one = new IntWritable(1) creates an IntWritable constant "1"
  - The code private Text word = new Text() creates a Text object to be used as the output key object of the Mapper


WordCount: Code for the Mapper

- The Mapper method is defined as: public void map(Object key, Text value, Context context), where Context is an object used to write the output of the Mapper
- The Java code StringTokenizer itr = new StringTokenizer(value.toString()) parses a Text value converted into a Java String into a list of words by returning an iterator itr over the words of the String
- The code word.set(itr.nextToken()) overwrites the current contents of the Text object word with the next String contents returned by the iterator itr
- The code context.write(word, one) outputs the parsed word as the key and the constant "1" as the value from the Mapper, giving an output pair (word, 1) of (Text, IntWritable) type


WordCount: Code for the Reducer

  public static class IntSumReducer
      extends Reducer<Text, IntWritable, Text, IntWritable> {

    private IntWritable result = new IntWritable();

    public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      result.set(sum);
      context.write(key, result);
    }
  }


WordCount: Code for the Reducer

- The line public static class IntSumReducer gives the name of the Reducer
- The four parameters of the Reducer in: extends Reducer<Text,IntWritable,Text,IntWritable> are the types of the input key, input value, output key, and output value, respectively. The input types of the Reducer must match the output types of the Mapper
- The code private IntWritable result = new IntWritable() generates an IntWritable object to be used as the output value of the Reducer


WordCount: Code for the Reducer

- The Reducer method is defined as: public void reduce(Text key, Iterable<IntWritable> values, Context context), where Iterable<IntWritable> values is an iterator over the value type of the Mapper output, and Context is an object used to write the output of the Reducer
- The Reducer will be called once per key, and will be given a list of values mapped to that key to iterate on using the values iterator


WordCount: Code for the Reducer

- The loop for (IntWritable val : values) iterates over the values mapped to the current key and sums them up
- The code result.set(sum) sets the IntWritable object result using the sum variable; remember that native Java classes cannot be directly used as outputs
- The code context.write(key, result) writes the Reducer output as a (word, count) pair of (Text, IntWritable) type


WordCount: Code for main 1/2

  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    String[] otherArgs =
        new GenericOptionsParser(conf, args).getRemainingArgs();
    if (otherArgs.length != 2) {
      System.err.println("Usage: wordcount <in> <out>");
      System.exit(2);
    }


WordCount: Code for main 1/2

- The line public static void main(String[] args) declares the main routine that will be running on the client machine that starts the MapReduce job
- The line Configuration conf = new Configuration() creates the configuration object conf holding all the information of the MapReduce job to be created
- The line String[] otherArgs = new GenericOptionsParser(conf, args).getRemainingArgs() and the code after it do argument processing and check that the code is called with two arguments


WordCount: Code for main 2/2

    Job job = new Job(conf, "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class);
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(otherArgs[0]));
    FileOutputFormat.setOutputPath(job, new Path(otherArgs[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}


WordCount: Code for main 2/2

- The code Job job = new Job(conf, "word count") creates a new MapReduce Job with the name "word count" using the created configuration object conf
- The line job.setJarByClass(WordCount.class) tells the name of the defined class
- The Mapper is set: job.setMapperClass(TokenizerMapper.class)
- The (optional!) Combiner is set: job.setCombinerClass(IntSumReducer.class). Note that this is sound only when the Reduce function is commutative and associative (addition is! - see the check after this list)
- The Reducer is set: job.setReducerClass(IntSumReducer.class)
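As a quick check of the soundness condition above: the addition used by IntSumReducer is both commutative and associative, so partial sums produced by the Combiner can be added again by the Reducer without changing the final counts:

\[
a + b = b + a, \qquad (a + b) + c = a + (b + c)
\]

A reducer computing, e.g., an arithmetic mean does not satisfy this and could not be reused unchanged as a Combiner.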


WordCount: Code for main 2/2

- The Java type system is not able to figure out the correct types of output keys and values, so we have to specify them again:
  - The code job.setOutputKeyClass(Text.class) sets the output key of both Mapper and Reducer to the Text type. If the Mapper has a different output key type from the Reducer, the Mapper key type can be specified by overriding this using job.setMapOutputKeyClass()
  - The code job.setOutputValueClass(IntWritable.class) sets the output value of both Mapper and Reducer to the IntWritable type. Also job.setMapOutputValueClass() can be used to override this if the Mapper output value type is different (see the driver sketch after this list)
- Also, because of Java limitations, the compiler does not complain at compile time if Mapper outputs and Reducer inputs are of incompatible types
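A hypothetical driver fragment (the class name and type choices are invented for illustration) for a job whose Mapper emits (Text, Text) pairs while the Reducer emits (Text, IntWritable) pairs, declaring the map output types separately as described above:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;

public class MixedTypesDriverSketch {
    public static void main(String[] args) throws Exception {
        Job job = new Job(new Configuration(), "mixed output types");
        job.setMapOutputKeyClass(Text.class);        // Mapper output key type
        job.setMapOutputValueClass(Text.class);      // Mapper output value type
        job.setOutputKeyClass(Text.class);           // Reducer (job) output key type
        job.setOutputValueClass(IntWritable.class);  // Reducer (job) output value type
        // The Mapper, Reducer, and input/output paths would be set here as in WordCount.
    }
}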


WordCount: Code for main 2/2

- The code FileInputFormat.addInputPath(job, new Path(otherArgs[0])) sets the input to be all of the files in the input directory specified as the first argument on the command line
- The code FileOutputFormat.setOutputPath(job, new Path(otherArgs[1])) directs the output to the output directory specified as the second argument on the command line, one file per reducer
- The code System.exit(job.waitForCompletion(true) ? 0 : 1) submits the job to Hadoop, waits for the job to finish, and returns 0 on success, 1 on failure


A Quick WordCount Demo: Formatting HDFS

$ hadoop namenode -format

11/09/20 01:39:29 INFO namenode.NameNode: STARTUP_MSG:
/************************************************************
STARTUP_MSG: Starting NameNode
STARTUP_MSG:   host = nipsu.local/127.0.0.1
STARTUP_MSG:   args = [-format]
STARTUP_MSG:   version = 0.20.203.0
STARTUP_MSG:   build = http://svn.apache.org/repos/asf/hadoop/common/branches/branch-0.20-security-203 -r 1099333; compiled by 'oom' on Wed May 4 07:57:50 PDT 2011
************************************************************/
11/09/20 01:39:29 INFO util.GSet: VM type       = 64-bit
11/09/20 01:39:29 INFO util.GSet: 2% max memory = 19.9175 MB
11/09/20 01:39:29 INFO util.GSet: capacity      = 2^21 = 2097152 entries
11/09/20 01:39:29 INFO util.GSet: recommended=2097152, actual=2097152
11/09/20 01:39:30 INFO namenode.FSNamesystem: fsOwner=keijoheljanko
11/09/20 01:39:30 INFO namenode.FSNamesystem: supergroup=supergroup
11/09/20 01:39:30 INFO namenode.FSNamesystem: isPermissionEnabled=true
11/09/20 01:39:30 INFO namenode.FSNamesystem: dfs.block.invalidate.limit=100
11/09/20 01:39:30 INFO namenode.FSNamesystem: isAccessTokenEnabled=false accessKeyUpdateInterval=0 min(s), accessTokenLifetime=0 min(s)
11/09/20 01:39:30 INFO namenode.NameNode: Caching file names occuring more than 10 times
11/09/20 01:39:30 INFO common.Storage: Image file of size 119 saved in 0 seconds.
11/09/20 01:39:30 INFO common.Storage: Storage directory /tmp/hadoop-keijoheljanko/dfs/name has been successfully formatted.
11/09/20 01:39:30 INFO namenode.NameNode: SHUTDOWN_MSG:
/************************************************************
SHUTDOWN_MSG: Shutting down NameNode at nipsu.local/127.0.0.1
************************************************************/


A Quick WordCount Demo: Starting Hadoop

$ start-all.sh

starting namenode, logging to /Users/keijoheljanko/hadoop/hadoop-0.20.203.0/bin/../logs/hadoop-keijoheljanko-namenode-nipsu.local.out
localhost: starting datanode, logging to /Users/keijoheljanko/hadoop/hadoop-0.20.203.0/bin/../logs/hadoop-keijoheljanko-datanode-nipsu.local.out
localhost: starting secondarynamenode, logging to /Users/keijoheljanko/hadoop/hadoop-0.20.203.0/bin/../logs/hadoop-keijoheljanko-secondarynamenode-nipsu.local.out
starting jobtracker, logging to /Users/keijoheljanko/hadoop/hadoop-0.20.203.0/bin/../logs/hadoop-keijoheljanko-jobtracker-nipsu.local.out
localhost: starting tasktracker, logging to /Users/keijoheljanko/hadoop/hadoop-0.20.203.0/bin/../logs/hadoop-keijoheljanko-tasktracker-nipsu.local.out


A Quick WordCount Demo: Uploading Kalevala to HDFS

$ wget http://www.nic.funet.fi/pub/doc/literary/finnish/kalevala/kalevala.txt.gz

--2011-09-20 01:41:24--  http://www.nic.funet.fi/pub/doc/literary/finnish/kalevala/kalevala.txt.gz
Resolving www.nic.funet.fi (www.nic.funet.fi)... 193.166.3.3, 2001:708:10:9::20:3
Connecting to www.nic.funet.fi (www.nic.funet.fi)|193.166.3.3|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 200872 (196K) [application/x-gzip]
Saving to: 'kalevala.txt.gz'

100%[===============================================================================>] 200,872     --.-K/s   in 0.06s

2011-09-20 01:41:24 (3.06 MB/s) - 'kalevala.txt.gz' saved [200872/200872]

$ gunzip kalevala.txt.gz

$ recode ISO8859-1..UTF-8 kalevala.txt

$ ls -al kalevala.txt
-rw-r--r--  1 keijoheljanko  staff  586333 Nov  5  1996 kalevala.txt

$ hadoop fs -mkdir input

$ hadoop fs -copyFromLocal kalevala.txt input

$ hadoop fs -ls input
Found 1 items
-rw-r--r--   1 keijoheljanko supergroup     586333 2011-09-20 01:46 /user/keijoheljanko/input/kalevala.txt


A Quick WordCount Demo: Compiling WordCount

$ ls -al WordCount.java
-rw-r--r--@ 1 keijoheljanko  staff  2395 Sep 18 13:53 WordCount.java

$ javac -classpath ${HADOOP_HOME}/hadoop-core-${HADOOP_VERSION}.jar:${HADOOP_HOME}/lib/commons-cli-1.2.jar -d wordcount_classes WordCount.java

$ jar -cvf wordcount.jar -C wordcount_classes/ .
added manifest
adding: org/(in = 0) (out= 0)(stored 0%)
adding: org/apache/(in = 0) (out= 0)(stored 0%)
adding: org/apache/hadoop/(in = 0) (out= 0)(stored 0%)
adding: org/apache/hadoop/examples/(in = 0) (out= 0)(stored 0%)
adding: org/apache/hadoop/examples/WordCount$IntSumReducer.class(in = 1789) (out= 746)(deflated 58%)
adding: org/apache/hadoop/examples/WordCount$TokenizerMapper.class(in = 1790) (out= 765)(deflated 57%)
adding: org/apache/hadoop/examples/WordCount.class(in = 1911) (out= 996)(deflated 47%)

$ ls -al wordcount.jar
-rw-r--r--  1 keijoheljanko  staff  3853 Sep 20 02:07 wordcount.jar


A Quick WordCount Demo: Running WordCount (1/2)

$ hadoop jar wordcount.jar org.apache.hadoop.examples.WordCount input output

11/09/20 02:11:32 INFO input.FileInputFormat: Total input paths to process : 1
11/09/20 02:11:32 INFO mapred.JobClient: Running job: job_201109200145_0001
11/09/20 02:11:33 INFO mapred.JobClient:  map 0% reduce 0%
11/09/20 02:11:47 INFO mapred.JobClient:  map 100% reduce 0%
11/09/20 02:12:00 INFO mapred.JobClient:  map 100% reduce 100%
11/09/20 02:12:05 INFO mapred.JobClient: Job complete: job_201109200145_0001
11/09/20 02:12:05 INFO mapred.JobClient: Counters: 25
11/09/20 02:12:05 INFO mapred.JobClient:   Job Counters
11/09/20 02:12:05 INFO mapred.JobClient:     Launched reduce tasks=1
11/09/20 02:12:05 INFO mapred.JobClient:     SLOTS_MILLIS_MAPS=14461
11/09/20 02:12:05 INFO mapred.JobClient:     Total time spent by all reduces waiting after reserving slots (ms)=0
11/09/20 02:12:05 INFO mapred.JobClient:     Total time spent by all maps waiting after reserving slots (ms)=0
11/09/20 02:12:05 INFO mapred.JobClient:     Launched map tasks=1
11/09/20 02:12:05 INFO mapred.JobClient:     Data-local map tasks=1
11/09/20 02:12:05 INFO mapred.JobClient:     SLOTS_MILLIS_REDUCES=10719
11/09/20 02:12:05 INFO mapred.JobClient:   File Output Format Counters
11/09/20 02:12:05 INFO mapred.JobClient:     Bytes Written=318862


A Quick WordCount Demo: Running WordCount (2/2)

11/09/20 02:12:05 INFO mapred.JobClient:   FileSystemCounters
11/09/20 02:12:05 INFO mapred.JobClient:     FILE_BYTES_READ=423804
11/09/20 02:12:05 INFO mapred.JobClient:     HDFS_BYTES_READ=586457
11/09/20 02:12:05 INFO mapred.JobClient:     FILE_BYTES_WRITTEN=890075
11/09/20 02:12:05 INFO mapred.JobClient:     HDFS_BYTES_WRITTEN=318862
11/09/20 02:12:05 INFO mapred.JobClient:   File Input Format Counters
11/09/20 02:12:05 INFO mapred.JobClient:     Bytes Read=586333
11/09/20 02:12:05 INFO mapred.JobClient:   Map-Reduce Framework
11/09/20 02:12:05 INFO mapred.JobClient:     Reduce input groups=26472
11/09/20 02:12:05 INFO mapred.JobClient:     Map output materialized bytes=423804
11/09/20 02:12:05 INFO mapred.JobClient:     Combine output records=26472
11/09/20 02:12:05 INFO mapred.JobClient:     Map input records=14366
11/09/20 02:12:05 INFO mapred.JobClient:     Reduce shuffle bytes=423804
11/09/20 02:12:05 INFO mapred.JobClient:     Reduce output records=26472
11/09/20 02:12:05 INFO mapred.JobClient:     Spilled Records=52944
11/09/20 02:12:05 INFO mapred.JobClient:     Map output bytes=839403
11/09/20 02:12:05 INFO mapred.JobClient:     Combine input records=67583
11/09/20 02:12:05 INFO mapred.JobClient:     Map output records=67583
11/09/20 02:12:05 INFO mapred.JobClient:     SPLIT_RAW_BYTES=124
11/09/20 02:12:05 INFO mapred.JobClient:     Reduce input records=26472


A Quick WordCount Demo: Downloading Output

$ hadoop fs -copyToLocal output output

$ ls -al output
total 624
drwxr-xr-x   5 keijoheljanko  staff     170 Sep 20 02:16 .
drwxr-xr-x  17 keijoheljanko  staff     578 Sep 20 02:16 ..
-rw-r--r--   1 keijoheljanko  staff       0 Sep 20 02:16 _SUCCESS
drwxr-xr-x   3 keijoheljanko  staff     102 Sep 20 02:16 _logs
-rw-r--r--   1 keijoheljanko  staff  318862 Sep 20 02:16 part-r-00000

$ head -15 output/part-r-00000
"Ahto,        1
"Aik’         1
"Aika         1
"Ain’         1
"Aina         2
"Ainap’       1
"Ainapa       1
"Aita         2
"Ajoa         1
"Akat         1
"Akatp’       1
"Akka         5
"Ala          1
"Ampuisitko   1
"Anna         7


Hadoop Books

- I can warmly recommend the book:
  - Tom White: "Hadoop: The Definitive Guide, Second Edition", O'Reilly Media, 2010. Print ISBN: 978-1-4493-8973-4, Ebook ISBN: 978-1-4493-8974-1. http://www.hadoopbook.com/
- An alternative book is:
  - Chuck Lam: "Hadoop in Action", Manning, 2010. Print ISBN: 9781935182191.


Recap of Hadoop Commands Used

- start-all.sh
  - Starts all Hadoop daemons.
  - Hint: If you are getting the following error on Mac OS X: "Bad connection to FS. command aborted. exception: Call to localhost/127.0.0.1:9000 failed on connection exception: java.net.ConnectException: Connection refused", call start-all.sh another time to start the NameNode for real.


Recap of Hadoop Commands Used (cnt.)

- hadoop namenode -format
  - Formats the HDFS filesystem
- hadoop fs -mkdir foo
  - Creates a directory called "foo" in the HDFS home directory of the user
- hadoop fs -rmr foo
  - Removes the directory/file called "foo" in the HDFS home directory of the user
- hadoop fs -copyFromLocal foo.txt bar
  - Copies the file "foo.txt" to the HDFS directory/file called "bar" in the HDFS home directory of the user


Recap of Hadoop Commands Used (cnt.)

- hadoop fs -ls foo
  - Does a directory/file listing for the directory called "foo" in the HDFS home directory of the user
- hadoop jar wordcount.jar org.apache.hadoop.examples.WordCount foo bar
  - Runs the MapReduce job in the file "wordcount.jar" in the class "org.apache.hadoop.examples.WordCount" using the HDFS directory "foo" as input, and stores the output in a new HDFS directory "bar" to be created by the job (i.e., "bar" should not yet exist in HDFS)


Recap of Hadoop Commands Used (cnt.)

- hadoop fs -copyToLocal foo bar
  - Copies the directory/file "foo" in the user's HDFS home directory to the local file system directory/file "bar"
- stop-all.sh
  - Stops all Hadoop daemons


Hadoop Default ports

- Use the URL http://localhost:50030 to access the JobTracker through a Web browser. Hint: Use the refresh button to update the view during job execution.
- Use the URL http://localhost:50070 to access the NameNode through a Web browser. Hint: You can also browse the HDFS filesystem through this URL.


Hadoop NameNode and SecondaryNameNode Locking

- The Hadoop NameNode and SecondaryNameNode use locking to disallow accidentally starting two daemons that modify the same files, which would lead to HDFS inconsistency
- If either daemon gets killed uncleanly, it might leave lock files around and prevent a new NameNode or SecondaryNameNode from starting
- The lock files are called data/in_use.lock, name/in_use.lock, and namesecondary/in_use.lock
- They are located in the dfs subdirectory of hadoop.tmp.dir, which can be set in core-site.xml (by default /tmp/hadoop-${user.name})


Commercial Hadoop Support

- Cloudera: A Hadoop distribution based on Apache Hadoop + patches. Available from: http://www.cloudera.com/
- Hortonworks: A Yahoo! spin-off of their Hadoop development team, will be working on mainline Apache Hadoop. http://www.hortonworks.com/
- MapR: A rewrite of much of Apache Hadoop in C++, including a new filesystem. API-compatible with Apache Hadoop. http://www.mapr.com/


Cloud Software Project

- ICT SHOK Program Cloud Software (2010-2013)
- A large Finnish consortium, see: http://www.cloudsoftwareprogram.org/
- Case study at Aalto: CSC Genome Browser Cloud Backend
- Co-authors: Matti Niemenmaa, André Schumacher (Aalto University, Department of Information and Computer Science), Aleksi Kallio, Eija Korpelainen, Taavi Hupponen, and Petri Klemelä (CSC - IT Center for Science)


CSC Genome Browser

- CSC provides tools and infrastructure for bioinformatics
- Bioinformatics is the largest customer group of CSC (in user numbers)
- Next-Generation Sequencing (NGS) produces large data sets (TB+)
- Cloud computing can be harnessed for analyzing these data sets
- The 1000 Genomes project (http://www.1000genomes.org): a freely available 50 TB+ data set of human genomes


CSC Genome Browser

- Cloud computing technologies will enable scalable NGS data analysis
- There exists prior research into sequence alignment and assembly in the cloud
- Visualization is needed in order to understand large data masses
- Interactive visualization for 100 GB+ datasets can only be achieved with preprocessing in the cloud


Genome Browser Requirements

- Interactive browsing with zooming in and out, "Google Earth" style
- Single datasets of 100 GB-1 TB+ with interactive visualization at different zoom levels
- Preprocessing is used to compute summary data for the higher zoom levels
- The dataset is too large to compute the summary data in real time from the raw dataset
- The scalable cloud data processing paradigm MapReduce, implemented in Hadoop, is used to compute the summary data in preprocessing (currently up to 15 x 12 = 180 cores)


Genome Browser GUI


Experiments

- One 50 GB (compressed) BAM input file from 1000 Genomes
- Run on the Triton cluster with 1-15 compute nodes of 12 cores each
- Four repetitions for each increasing number of worker nodes
- Two types of runs: sorting according to starting position ("sorted") and read aggregation using five summary-block sizes at once ("summarized")


Mean speedup

[Figure: Mean speedup vs. number of workers (1, 2, 4, 8, 15) for the 50 GB input. Left panel: "50 GB sorted"; right panel: "50 GB summarized for B=2,4,8,16,32". Curves: Ideal, Input file import, Sorting/Summarizing, Output file export, Total elapsed.]


Results

- An updated CSC Genome Browser GUI
- Chipster: tools and a framework for producing visualizable data
- An implementation of preprocessing for visualization using Hadoop
- Scalability studies for running Hadoop on 50 GB+ datasets on the Triton cloud testbed
- Software released, with more than 1200 downloads: http://sourceforge.net/projects/hadoop-bam/


Current Research Topics

- High-level programming frameworks for ad-hoc queries: Apache Pig integration of Hadoop-BAM in the SeqPig tool (http://sourceforge.net/projects/seqpig/), and SQL integration in Apache Hive
- Real-time SQL query evaluation on Big Data: Cloudera Impala and Berkeley Shark
- Scalable datastores: HBase (Hadoop Database) for real-time read/write database workloads
- Energy efficiency through better main memory and SSD utilization