HADOOP ADMIN: Session -2 What is Hadoop?

Upload: charla-randall

Post on 24-Dec-2015


Page 1: HADOOP ADMIN: Session -2 What is Hadoop?. AGENDA Hadoop Demo using Cygwin HDFS Daemons Map Reduce Daemons Hadoop Ecosystem Projects

HADOOP ADMIN: Session -2

What is Hadoop?

Page 2:

AGENDA

Hadoop Demo using Cygwin

HDFS Daemons

Map Reduce Daemons

Hadoop Ecosystem Projects

Page 3:

Hadoop Using Cygwin

What is Cygwin?

Hadoop needs Java version 1.6 or higher

bin/hadoop

bin/hadoop jar hadoop-examples-1.0.4.jar

Word count input output

Word count example

Tokenization problem

Modifying the Program

C:\Documents and Settings\sb009239\Deskt

Page 4:

HDFS Daemons

Name Node: metadata in RAM

Data Node 1: block report, heartbeats, read (Read Data Block 1)

Secondary Name Node: not a backup node or standby node

Secondary Name Node checkpoint (diagram labels):

Roll edits

Copy fsimage and edits

Replay all edits and create new fsimage

Rename new edits

Send new fsimage
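The checkpoint labels on this slide (roll edits, copy fsimage and edits, replay all edits, send the new fsimage) can be illustrated with a toy model. This is only a sketch: the dictionary standing in for the fsimage, the tuple format of the edits, and the function names `replay_edits` and `checkpoint` are all assumptions made for illustration, not Hadoop's actual on-disk formats or APIs.

```python
# Toy model of an HDFS checkpoint. All data structures here are hypothetical
# stand-ins; real fsimage and edits files are binary and far richer.

def replay_edits(fsimage, edits):
    """Apply each logged edit, in order, to a copy of the namespace image."""
    new_image = dict(fsimage)
    for op, path, value in edits:
        if op == "create":
            new_image[path] = value
        elif op == "delete":
            new_image.pop(path, None)
        else:
            raise ValueError(f"unknown edit op: {op}")
    return new_image


def checkpoint(namenode):
    """Run one simplified checkpoint cycle against a Name Node dictionary."""
    # Roll edits: new changes go to a fresh log while the old log is merged.
    old_edits, namenode["edits"] = namenode["edits"], []
    # Copy fsimage and edits: the Secondary Name Node works on its own copies.
    image_copy = dict(namenode["fsimage"])
    # Replay all edits and create a new fsimage.
    new_image = replay_edits(image_copy, old_edits)
    # Send the new fsimage back to the Name Node.
    namenode["fsimage"] = new_image
    return namenode


namenode = {
    "fsimage": {"/a.txt": 3},  # path -> replication factor (assumed layout)
    "edits": [("create", "/b.txt", 1), ("delete", "/a.txt", None)],
}
checkpoint(namenode)
print(namenode["fsimage"])  # {'/b.txt': 1}
print(namenode["edits"])    # []
```

The point of the sketch is that the checkpoint only shortens the edit log; it does not turn the Secondary Name Node into a standby, which matches the slide's "not a backup node" note.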

Page 5:

Map Reduce V1 Daemons

Job Tracker

Task Tracker

[Diagram: one Job Tracker connected to several Task Trackers]

Page 6:

Word Count over a Given Set of Web Pages

Page 1 text: see bob throw

Page 2 text: see spot run

Per-page counts (page 1): see 1, bob 1, throw 1

Per-page counts (page 2): see 1, spot 1, run 1

Combined counts: bob 1, run 1, see 2, spot 1, throw 1

Can we do word count in parallel?
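One way to see why the answer is yes: each page can be counted independently, and only the final summing step needs to see all pages. The sketch below uses the slide's two example pages; the function names `map_page`, `shuffle` and `reduce_counts` are illustrative choices, and everything runs in one process rather than on a cluster.

```python
from collections import defaultdict


def map_page(page):
    """Map step: emit a (word, 1) pair for every word on one page."""
    return [(word, 1) for word in page.split()]


def shuffle(pairs):
    """Group the emitted values by word, as the framework would between phases."""
    grouped = defaultdict(list)
    for word, count in pairs:
        grouped[word].append(count)
    return grouped


def reduce_counts(word, counts):
    """Reduce step: sum all counts seen for one word."""
    return word, sum(counts)


pages = ["see bob throw", "see spot run"]

# Each call to map_page is independent, so pages could be processed in parallel.
mapped = [pair for page in pages for pair in map_page(page)]

result = dict(reduce_counts(w, c) for w, c in sorted(shuffle(mapped).items()))
print(result)  # {'bob': 1, 'run': 1, 'see': 2, 'spot': 1, 'throw': 1}
```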

Page 7:

The MapReduce Framework (pioneered by Google)

Page 8:

Automatic Parallel Execution in MapReduce (Google)

Handles failures automatically, e.g., restarts tasks if a node fails; runs multiple copies of the same task to avoid a slow task slowing down the whole job

Page 9:

MapReduce in Hadoop (1)

Page 10:

MapReduce in Hadoop (2)

Page 11:

Data Flow in a MapReduce Program in Hadoop

InputFormat -> Map function -> Partitioner -> Sorting & Merging -> Combiner -> Shuffling -> Merging -> Reduce function -> OutputFormat

1:many
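The stages named on this slide can be traced in a small single-process sketch. It is a simplification under stated assumptions: two reducers (`NUM_REDUCERS`), a CRC32-based `partition` function standing in for Hadoop's hash partitioner, and in-memory lists instead of spill files. None of these names come from the slides.

```python
import zlib
from collections import defaultdict

NUM_REDUCERS = 2  # assumed value for illustration


def map_fn(line):
    """Map function: one input record can yield many (key, value) pairs."""
    return [(word, 1) for word in line.split()]


def partition(key, num_reducers):
    """Partitioner: choose the reducer for a key (deterministic CRC32 stand-in)."""
    return zlib.crc32(key.encode("utf-8")) % num_reducers


def combine(sorted_pairs):
    """Combiner: pre-aggregate one map task's output before it is shuffled."""
    local = defaultdict(int)
    for key, value in sorted_pairs:
        local[key] += value
    return list(local.items())


def run_job(splits):
    reducer_inputs = [defaultdict(list) for _ in range(NUM_REDUCERS)]
    for split in splits:  # one map task per input split
        parts = [[] for _ in range(NUM_REDUCERS)]
        for line in split:
            for key, value in map_fn(line):
                parts[partition(key, NUM_REDUCERS)].append((key, value))
        for r, part in enumerate(parts):
            # Sort each partition, combine it, then "shuffle" it to reducer r.
            for key, value in combine(sorted(part)):
                reducer_inputs[r][key].append(value)
    # Reduce function: each reducer sums the merged values for its own keys.
    return [{k: sum(vs) for k, vs in sorted(inp.items())} for inp in reducer_inputs]


splits = [["see bob throw", "see bob"], ["see spot run"]]
outputs = run_job(splits)
for reducer_id, output in enumerate(outputs):
    print(reducer_id, output)
```

Each word lands in exactly one reducer's output because the partitioner depends only on the key, which is what lets the reducers run independently.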

Page 12:

Lifecycle of a MapReduce Job

Map function

Reduce function

Run this program as a MapReduce job

Page 13:

Lifecycle of a MapReduce Job

Map function

Reduce function

Run this program as a MapReduce job

Page 14:

Lifecycle of a MapReduce Job

[Timeline diagram: Input Splits, Map Wave 1, Map Wave 2, Reduce Wave 1, Reduce Wave 2, plotted against Time]

How are the number of splits, the number of map and reduce tasks, memory allocation to tasks, etc., determined?
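As a rough back-of-the-envelope illustration of that question, the sketch below estimates splits and waves from a few inputs. It assumes one map task per fixed-size split and a fixed number of task slots in the cluster; the function name `estimate_job_shape`, the 64 MB split size, and the slot counts are all assumptions chosen for the example. Real split computation also depends on the input format, file boundaries, and configuration.

```python
def ceil_div(a, b):
    """Integer ceiling division, avoiding floating-point rounding."""
    return -(-a // b)


def estimate_job_shape(input_bytes, split_bytes, map_slots, num_reducers, reduce_slots):
    """Estimate task and wave counts under the simplifying assumptions above."""
    num_splits = ceil_div(input_bytes, split_bytes)
    return {
        "map_tasks": num_splits,  # one map task per input split
        "map_waves": ceil_div(num_splits, map_slots),
        "reduce_tasks": num_reducers,  # set by the job, not derived from the input
        "reduce_waves": ceil_div(num_reducers, reduce_slots),
    }


MB = 1024 * 1024
shape = estimate_job_shape(
    input_bytes=1000 * MB,  # hypothetical input size
    split_bytes=64 * MB,    # assumed split size
    map_slots=8,            # assumed cluster-wide map slots
    num_reducers=4,
    reduce_slots=2,         # assumed cluster-wide reduce slots
)
print(shape)
# {'map_tasks': 16, 'map_waves': 2, 'reduce_tasks': 4, 'reduce_waves': 2}
```

With these numbers the job runs in two map waves and two reduce waves, the same shape as the timeline on this slide.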

Page 15:

Job Configuration Parameters

190+ parameters in Hadoop

Set manually or defaults are used

Page 16:

Hadoop Ecosystem/Sub Projects

Page 17:

PIG

One frequent complaint about MR is that it's difficult to program

One criticism of MapReduce is that the development cycle is very long

As you implement the program in MapReduce, you'll have to think at the level of mapper and reducer functions and job chaining

Pig started as a research project within Yahoo! in the summer of 2006, joining Apache Incubator in September of 2007

Pig is a dataflow programming environment for processing very large files. Pig's language is called Pig Latin

Pig is a Hadoop extension that simplifies Hadoop programming by giving you a high-level data processing language while keeping Hadoop's simple scalability and reliability

Yahoo runs 40% of all its Hadoop jobs with Pig. Twitter uses Pig

Indeed, it was created at Yahoo! to make it easier for researchers and engineers to mine the huge datasets there

Page 18:

PIG: What it looks like

Not a variable, a relation

Loads data file into a relation, with a defined schema

Page 19:

Word count example in PIG

text = LOAD 'text' USING TextLoader();   -- loads each line as one column

tokens = FOREACH text GENERATE FLATTEN(TOKENIZE($0)) AS word;

wordcount = FOREACH (GROUP tokens BY word) GENERATE group AS word, COUNT_STAR($1);

[Diagram: PIG JOB -> MR TRANSFORMATION -> MR JOBS -> HDFS]

Page 20:

PIG Vs Hive

Pig is a new language, easy to learn if you know languages similar to Perl

Hive is a sub-set of SQL with very simple variations to enable map-reduce like computation. So, if you come from a SQL background you will find Hive QL extremely easy to pick up (many of your SQL queries will run as is), while if you come from a procedural programming background (w/o SQL knowledge) then Pig will be much more suitable for you

Hive is a bit easier to integrate with other systems and tools since it speaks the language they already speak (i.e. SQL).

Ultimately the choice of whether to use Hive or PIG will depend on the exact requirements of the application domain and the preferences of the implementers and those writing queries.

Page 21:

HIVE (HQL)

Hive is a data warehouse infrastructure built on top of Hadoop that can compile SQL queries into MR jobs and run them on a Hadoop cluster

Invented at Facebook for their own problems.

SQL-like query language (HQL/Hive QL) to retrieve the data and process it.

JDBC/ODBC access is provided

Currently used with HBase

Page 22:

HBase

HBase is not about being a high-level language that compiles to map-reduce

HBase is about allowing Hadoop to support lookups/transactions on key/value pairs. HBase allows you to do quick random lookups, versus scanning all of the data sequentially, and to insert/update/delete in the middle, not just add/append.

Page 23:

Sqoop

To load bulk data into Hadoop from relational databases

Imports individual tables or entire databases to files in HDFS

Provides the ability to import from SQL databases straight into your Hive data warehouse

Importing this table into HDFS could be done with the command:

you@db$ sqoop --connect jdbc:mysql://db.example.com/website --table USERS \
    --local --hive-import