Introduction to the Hadoop ecosystem by Uwe Seiler

Upload: codemotion

Post on 27-May-2015

DESCRIPTION

Apache Hadoop is one of the most popular solutions for today’s Big Data challenges. Hadoop offers a reliable and scalable platform for fail-safe storage of large amounts of data as well as the tools to process this data. This presentation gives an overview of the architecture of Hadoop and explains the possibilities for integration within existing enterprise systems. Finally, it introduces the main tools for processing data: the scripting language layer Pig, the SQL-like query layer Hive, and the column-based NoSQL layer HBase.

TRANSCRIPT

Introduction to the Hadoop ecosystem

About me

About us

Why Hadoop?

How to scale data?

[Figure: diagram with labels w1, w2, w3 and r1, r2, r3]

But…

What is Hadoop?

The Hadoop App Store

HDFS MapRed HCat Pig Hive HBase Ambari Avro Cassandra

Chukwa Intel Sync Flume Hana HyperT Impala Mahout Nutch Oozie Sqoop

Scribe Tez Vertica Whirr ZooKee Cloudera Horton MapR EMC

IBM Talend TeraData Pivotal Informat Microsoft Pentaho Jasper

Kognitio Tableau Splunk Platfora Rack Karma Actuate MicStrat

Data Storage

Hadoop Distributed File System

HDFS Architecture

Data Processing

MapReduce

Typical large-data problem

MapReduce Flow

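Input: (k1, v1) (k2, v2) (k3, v3) (k4, v4) (k5, v5) (k6, v6)

Map: (a, 1) (b, 2) (c, 3) (c, 6) (a, 3) (c, 2) (b, 7) (c, 8)

Combine: (a, 1) (b, 2) (c, 9) (a, 3) (c, 2) (b, 7) (c, 8)

Shuffle and sort: a → [1, 3], b → [2, 7], c → [2, 8, 9]

Reduce: (a, 4) (b, 9) (c, 19)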
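
The flow on this slide can be traced in a few lines of plain Java. This is only an in-memory sketch of the combine, shuffle and reduce steps, not Hadoop code: the class name `MapReduceFlowSketch`, its helper methods, and the way the map output is split across four mappers are assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class MapReduceFlowSketch {

    // Combine: pre-aggregate one mapper's output locally by summing values per key.
    static List<Map.Entry<String, Integer>> combine(List<Map.Entry<String, Integer>> mapperOutput) {
        Map<String, Integer> sums = new TreeMap<>();
        for (Map.Entry<String, Integer> pair : mapperOutput) {
            sums.merge(pair.getKey(), pair.getValue(), Integer::sum);
        }
        return new ArrayList<>(sums.entrySet());
    }

    // Shuffle and sort: collect the values of every key across all mappers, keys sorted.
    static Map<String, List<Integer>> shuffle(List<List<Map.Entry<String, Integer>>> outputs) {
        Map<String, List<Integer>> grouped = new TreeMap<>();
        for (List<Map.Entry<String, Integer>> output : outputs) {
            for (Map.Entry<String, Integer> pair : output) {
                grouped.computeIfAbsent(pair.getKey(), k -> new ArrayList<>()).add(pair.getValue());
            }
        }
        return grouped;
    }

    // Reduce: sum the grouped values of each key.
    static Map<String, Integer> reduce(Map<String, List<Integer>> grouped) {
        Map<String, Integer> totals = new TreeMap<>();
        grouped.forEach((key, values) ->
            totals.put(key, values.stream().mapToInt(Integer::intValue).sum()));
        return totals;
    }

    // Runs combine, shuffle and reduce over the per-mapper outputs.
    static Map<String, Integer> run(List<List<Map.Entry<String, Integer>>> mapOutputs) {
        List<List<Map.Entry<String, Integer>>> combined = new ArrayList<>();
        for (List<Map.Entry<String, Integer>> output : mapOutputs) {
            combined.add(combine(output));
        }
        return reduce(shuffle(combined));
    }

    public static void main(String[] args) {
        // Map outputs from the slide; the split across four mappers is an assumption.
        List<List<Map.Entry<String, Integer>>> mapOutputs = List.of(
            List.of(Map.entry("a", 1), Map.entry("b", 2)),
            List.of(Map.entry("c", 3), Map.entry("c", 6)),
            List.of(Map.entry("a", 3), Map.entry("c", 2)),
            List.of(Map.entry("b", 7), Map.entry("c", 8)));
        System.out.println(run(mapOutputs)); // prints {a=4, b=9, c=19}
    }
}
```

The combiner only shrinks what has to cross the network, here (c, 3) and (c, 6) become (c, 9); the final totals are the same with or without it because addition is associative and commutative.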

Jobs & Tasks

Combined Hadoop Architecture

Word Count Mapper in Java

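import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;

// Emits (word, 1) for every whitespace-separated token of each input line.
public class WordCountMapper extends MapReduceBase
    implements Mapper<LongWritable, Text, Text, IntWritable>
{
    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();

    public void map(LongWritable key, Text value,
            OutputCollector<Text, IntWritable> output, Reporter reporter)
            throws IOException
    {
        String line = value.toString();
        StringTokenizer tokenizer = new StringTokenizer(line);
        while (tokenizer.hasMoreTokens())
        {
            word.set(tokenizer.nextToken());
            output.collect(word, one);
        }
    }
}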

Word Count Reducer in Java

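import java.io.IOException;
import java.util.Iterator;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;

// Sums the counts emitted for each word.
public class WordCountReducer extends MapReduceBase
    implements Reducer<Text, IntWritable, Text, IntWritable>
{
    public void reduce(Text key, Iterator<IntWritable> values,
            OutputCollector<Text, IntWritable> output, Reporter reporter)
            throws IOException
    {
        int sum = 0;
        while (values.hasNext())
        {
            sum += values.next().get();
        }
        output.collect(key, new IntWritable(sum));
    }
}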

Scripting for Hadoop

Apache Pig

Pig in the Hadoop ecosystem

Hadoop Distributed File System

Distributed Programming Framework

Metadata Management

Scripting

Pig Latin

users = LOAD 'users.txt' USING PigStorage(',') AS (name, age);
pages = LOAD 'pages.txt' USING PigStorage(',') AS (user, url);
filteredUsers = FILTER users BY age >= 18 AND age <= 50;
joinResult = JOIN filteredUsers BY name, pages BY user;
grouped = GROUP joinResult BY url;
summed = FOREACH grouped GENERATE group, COUNT(joinResult) AS clicks;
sorted = ORDER summed BY clicks DESC;
top10 = LIMIT sorted 10;
STORE top10 INTO 'top10sites';
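
To make the semantics of the script concrete, here is a rough in-memory equivalent in plain Java. It is a sketch for reading along, not a MapReduce job: the class `TopSitesSketch`, the method `topSites`, and the sample data are hypothetical, and it assumes user names are unique so that a map from name to age can stand in for the users relation.

```java
import java.util.Comparator;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.stream.Collectors;

public class TopSitesSketch {

    // users: name -> age (assumes unique names); pages: (user, url) entries.
    // Returns up to `limit` URLs with their click counts, most clicked first.
    static Map<String, Long> topSites(Map<String, Integer> users,
                                      List<Map.Entry<String, String>> pages,
                                      int limit) {
        // FILTER users BY age >= 18 AND age <= 50
        Set<String> filteredUsers = users.entrySet().stream()
            .filter(u -> u.getValue() >= 18 && u.getValue() <= 50)
            .map(Map.Entry::getKey)
            .collect(Collectors.toSet());

        // JOIN filteredUsers BY name, pages BY user; GROUP BY url; COUNT
        Map<String, Long> clicks = pages.stream()
            .filter(p -> filteredUsers.contains(p.getKey()))
            .collect(Collectors.groupingBy(Map.Entry::getValue, Collectors.counting()));

        // ORDER BY clicks DESC (ties broken by URL for a stable result); LIMIT
        return clicks.entrySet().stream()
            .sorted(Map.Entry.<String, Long>comparingByValue(Comparator.reverseOrder())
                .thenComparing(Map.Entry.<String, Long>comparingByKey()))
            .limit(limit)
            .collect(Collectors.toMap(Map.Entry::getKey, Map.Entry::getValue,
                                      (a, b) -> a, LinkedHashMap::new));
    }

    public static void main(String[] args) {
        // Hypothetical sample data standing in for users.txt and pages.txt.
        Map<String, Integer> users = Map.of("alice", 25, "bob", 17, "carol", 40, "dave", 60);
        List<Map.Entry<String, String>> pages = List.of(
            Map.entry("alice", "a.com"), Map.entry("alice", "b.com"),
            Map.entry("carol", "a.com"), Map.entry("bob", "a.com"),
            Map.entry("dave", "b.com"), Map.entry("carol", "c.com"));
        System.out.println(topSites(users, pages, 10)); // prints {a.com=2, b.com=1, c.com=1}
    }
}
```

Everything here fits in one process because the data is tiny; Pig compiles the same logical steps into jobs that run across the cluster.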

Pig Execution Plan

Try that with Java…

SQL for Hadoop

Apache Hive

Hive in the Hadoop ecosystem

Hadoop Distributed File System

Distributed Programming Framework

Metadata Management

Scripting Query

Hive Architecture

Hive Example

CREATE TABLE users(name STRING, age INT);
CREATE TABLE pages(user STRING, url STRING);
LOAD DATA INPATH '/user/sandbox/users.txt' INTO TABLE users;
LOAD DATA INPATH '/user/sandbox/pages.txt' INTO TABLE pages;
SELECT pages.url, count(*) AS clicks
FROM users JOIN pages ON (users.name = pages.user)
WHERE users.age >= 18 AND users.age <= 50
GROUP BY pages.url
ORDER BY clicks DESC
LIMIT 10;

Bringing it all together…

Online Advertising

Getting started…

Hortonworks Sandbox

Hadoop Training

The end…or the beginning?