
Page 1: Processing Data using Amazon Elastic MapReduce and Apache Hive

Processing Data using Amazon Elastic MapReduce and Apache Hive

Team Members
Frank Paladino
Aravind Yeluripiti

Page 2: Processing Data using Amazon Elastic MapReduce and Apache Hive

Project Goals
• Creating an Amazon EC2 instance to process queries against large datasets using Hive
• Choosing two public data sets available on Amazon
• Running queries against these datasets
• Comparing HiveQL and SQL

Page 3: Processing Data using Amazon Elastic MapReduce and Apache Hive

Definitions
• Hadoop MapReduce is a software framework for easily writing applications which process vast amounts of data (multi-terabyte datasets) in parallel on large clusters (thousands of nodes) of commodity hardware in a reliable, fault-tolerant manner.
• Amazon Elastic MapReduce is a web service which provides easy access to a Hadoop MapReduce cluster in the cloud.
• Apache Hive is an open source data warehouse built on top of Hadoop which provides a SQL-like interface over MapReduce.

Page 4: Processing Data using Amazon Elastic MapReduce and Apache Hive

Starting an interactive Hive session
• Register for an AWS account
• Create a key pair in the EC2 console
• Launch an EMR job flow to start an interactive Hive session using the key pair
• Use the domain name and key pair to SSH into the master node of the Amazon EC2 cluster as the user "hadoop"
• Start Hive
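Once connected, typing hive at the master node's shell starts the interactive prompt where the statements on the following slides are entered. A minimal sanity check (assuming a fresh cluster with no tables defined yet):

hive> SHOW TABLES;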

Page 5: Processing  Data  using Amazon Elastic  MapReduce  and Apache Hive

Data set 1: DISASTERS WORLDWIDE FROM 1900-2008

• Available at: – http://www.infochimps.com/datasets/disasters-worldwide-fr

om-1900-2008• Info– Disaster data from 1900 – 2008, – organized by

• start and end date, • country (and sub-location), • disaster type (and sub-type), • disaster name, cost, and persons killed and affected by the disaster.

• Details– Number of rows: 17829– Size: 451 kB (Compressed) , 1.5 MB (uncompressed)

Page 6: Processing Data using Amazon Elastic MapReduce and Apache Hive

Create table

CREATE EXTERNAL TABLE emdata (
  id string,  -- assumed column: missing from the transcript, but the queries below reference it
  start string,
  ende string,
  country string,
  locatione string,
  type string,
  subtype string,
  name string,
  killed string,
  cost string,
  affected string
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
LOCATION 's3://cmpt592-assignment5/input/';
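A quick way to confirm that the external table reads the S3 input as expected (a sketch; the rows returned depend on the input files):

SELECT * FROM emdata LIMIT 5;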

Page 7: Processing Data using Amazon Elastic MapReduce and Apache Hive

Q1:
• Get the total number of disasters that occurred in each country
• Query:

SELECT count(distinct id), country
FROM emdata
GROUP BY country;
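A variant of Q1 that ranks countries by disaster count (a sketch, not part of the original slides):

SELECT count(distinct id) AS num, country
FROM emdata
GROUP BY country
ORDER BY num DESC;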

Page 8: Processing Data using Amazon Elastic MapReduce and Apache Hive

Q1:
• Output:

Page 9: Processing Data using Amazon Elastic MapReduce and Apache Hive

Q2:
• Get the number and type of disasters in a given country
• Query:

SELECT count(distinct id), type
FROM emdata
WHERE country='Afghanistan'
GROUP BY type;

Page 10: Processing Data using Amazon Elastic MapReduce and Apache Hive

Q2:
• Output:

Page 11: Processing Data using Amazon Elastic MapReduce and Apache Hive

Q3:
• Get the number, type, and subtype of disasters in a given country
• Query:

SELECT count(distinct id), type, subtype
FROM emdata
WHERE country='Afghanistan'
GROUP BY type, subtype;

Page 12: Processing Data using Amazon Elastic MapReduce and Apache Hive

Q3:
• Output:

Page 13: Processing Data using Amazon Elastic MapReduce and Apache Hive

Q4:
• Get the total casualties and type of disasters in a given country, only when casualties > 100
• Query:

SELECT sum(killed), type
FROM emdata
WHERE country='Afghanistan' and killed>100
GROUP BY type;
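Since killed is declared as a string, Hive casts it implicitly in both sum() and the comparison. An equivalent form with explicit casts (a sketch; same result assuming the column holds plain integers):

SELECT sum(cast(killed AS bigint)), type
FROM emdata
WHERE country='Afghanistan' AND cast(killed AS bigint) > 100
GROUP BY type;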

Page 14: Processing Data using Amazon Elastic MapReduce and Apache Hive

Q4:
• Output:

Page 15: Processing Data using Amazon Elastic MapReduce and Apache Hive

Q5:
• Get the total casualties and name of country for a certain type of disaster when casualties > 500
• Query:

SELECT sum(killed), country
FROM emdata
WHERE type='Flood' and killed>500
GROUP BY country;

Page 16: Processing Data using Amazon Elastic MapReduce and Apache Hive

Q5:
• Output:

Page 17: Processing Data using Amazon Elastic MapReduce and Apache Hive

Q6:
• Get the total cost and name of country for a certain type of disaster when cost > 1,000,000
• Query:

SELECT sum(cost), country
FROM emdata
WHERE type='Flood' and cost>1000000
GROUP BY country;

Page 18: Processing Data using Amazon Elastic MapReduce and Apache Hive

Q6:
• Output:

Page 19: Processing Data using Amazon Elastic MapReduce and Apache Hive

Data set 2: Google Books n-grams
• Available at: s3://datasets.elasticmapreduce/ngrams/books/
• Details
– Size: 2.2 TB
– Source: Google Books
– Created On: January 5, 2011 6:11 PM GMT
– Last Updated: January 21, 2012 2:12 AM GMT
• Processing n-gram data using Amazon Elastic MapReduce and Apache Hive.
• Calculating the top trending topics per decade
– http://aws.amazon.com/articles/5249664154115844

Page 20: Processing Data using Amazon Elastic MapReduce and Apache Hive

Data set 2: Google Books n-grams
• N-grams are fixed-size tuples of items. In this case the items are words extracted from the Google Books corpus. The n specifies the number of elements in the tuple, so a 5-gram contains five words or characters.
• The n-grams in this dataset were produced by passing a sliding window over the text of books and outputting a record for each new token. For example, the sentence
• "The yellow dog played fetch."
– would produce the following 2-grams:
["The", "yellow"] ["yellow", "dog"] ["dog", "played"] ["played", "fetch"] ["fetch", "."]
Or the following 3-grams:
["The", "yellow", "dog"] ["yellow", "dog", "played"] ["dog", "played", "fetch"] ["played", "fetch", "."]

Page 21: Processing Data using Amazon Elastic MapReduce and Apache Hive

Data set 2: Google Books n-grams
• Two settings to efficiently process data from S3:
hive> set hive.base.inputformat=org.apache.hadoop.hive.ql.io.HiveInputFormat;
hive> set mapred.min.split.size=134217728;
• Used to tell
– Hive to use an InputFormat that will split a file into pieces for processing, and
– not to split them into pieces any smaller than 128 MB (134217728 bytes).

Page 22: Processing Data using Amazon Elastic MapReduce and Apache Hive

Creating input table
• In order to process any data, first define the source of the data.
• Data set used: English 1-grams
– Rows: 472,764,897
– Compressed Size: 4.8 GB

Page 23: Processing Data using Amazon Elastic MapReduce and Apache Hive

Creating input table
• Statement to define the data:

CREATE EXTERNAL TABLE english_1grams (
  gram string,
  year int,
  occurrences bigint,
  pages bigint,
  books bigint
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
STORED AS SEQUENCEFILE
LOCATION 's3://datasets.elasticmapreduce/ngrams/books/20090715/eng-all/1gram/';

Page 24: Processing Data using Amazon Elastic MapReduce and Apache Hive

Creating input table
• Output

Page 25: Processing Data using Amazon Elastic MapReduce and Apache Hive

Count the number of rows
• Statement:
SELECT count(*) FROM english_1grams;
• Output

Page 26: Processing Data using Amazon Elastic MapReduce and Apache Hive

Normalizing the data
• The data in its current form is very raw: it contains punctuation and numbers (typically years), and it is case sensitive.
• First we want to create a table to store the results of the normalization. In the process, we can also drop unnecessary columns.
• Statement:

CREATE TABLE normalized (
  gram string,
  year int,
  occurrences bigint
);

Page 27: Processing Data using Amazon Elastic MapReduce and Apache Hive

Normalizing the data
• Output

Page 28: Processing Data using Amazon Elastic MapReduce and Apache Hive

Inserting data into normalized table
• Read the raw data and insert it into the normalized table.
• Statement:

INSERT OVERWRITE TABLE normalized
SELECT lower(gram), year, occurrences
FROM english_1grams
WHERE year >= 1890 AND gram REGEXP "^[A-Za-z+'-]+$";

The WHERE clause drops years before 1890 and keeps only grams made up entirely of letters, apostrophes, hyphens, and plus signs, filtering out the numbers and punctuation.

Page 29: Processing Data using Amazon Elastic MapReduce and Apache Hive

Inserting data into normalized table
• Output

Page 30: Processing Data using Amazon Elastic MapReduce and Apache Hive

Finding word ratio by decade
• More books are printed over time, so every word has a tendency to have more occurrences in later decades.
• We only care about the relative usage of a word over time, so we want to ignore the change in the size of the corpus. This can be done by finding the ratio of occurrences of the word to the total number of occurrences of all words.
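As a minimal sketch of that denominator (the same aggregation appears as the subquery on Page 32), the total occurrences per decade can be computed as:

SELECT substr(year, 0, 3) AS decade, sum(occurrences) AS total
FROM normalized
GROUP BY substr(year, 0, 3);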

Page 31: Processing Data using Amazon Elastic MapReduce and Apache Hive

Finding word ratio by decade
• Create a table to store this data.
• Statement:

CREATE TABLE by_decade (
  gram string,
  decade int,
  ratio double
);

• Calculate the total number of word occurrences by decade. Then
• Join this data with the normalized table in order to calculate the usage ratio.

Page 32: Processing Data using Amazon Elastic MapReduce and Apache Hive

Finding word ratio by decade
• Statement:

INSERT OVERWRITE TABLE by_decade
SELECT a.gram, b.decade, sum(a.occurrences) / b.total
FROM normalized a
JOIN (
  SELECT substr(year, 0, 3) AS decade, sum(occurrences) AS total
  FROM normalized
  GROUP BY substr(year, 0, 3)
) b
ON substr(a.year, 0, 3) = b.decade
GROUP BY a.gram, b.decade, b.total;

Here substr(year, 0, 3) takes the first three digits of the year (Hive treats a start position of 0 like 1), so 1950 becomes 195, which identifies the decade.

Page 33: Processing Data using Amazon Elastic MapReduce and Apache Hive

Finding word ratio by decade
• Output

Page 34: Processing Data using Amazon Elastic MapReduce and Apache Hive

Calculating changes per decade
• With a normalized dataset by decade we can get down to calculating changes by decade. This can be achieved by joining the dataset on itself.
• We'll want to join rows where the n-grams are equal and the decade is off by one. This lets us compare ratios for a given n-gram from one decade to the next.

Page 35: Processing Data using Amazon Elastic MapReduce and Apache Hive

Calculating changes per decade
• Statement:

SELECT
  a.gram as gram,
  a.decade as decade,
  a.ratio as ratio,
  a.ratio / b.ratio as increase
FROM by_decade a
JOIN by_decade b
ON a.gram = b.gram and a.decade - 1 = b.decade
WHERE a.ratio > 0.000001 and a.decade >= 190
DISTRIBUTE BY decade
SORT BY decade ASC, increase DESC;

DISTRIBUTE BY decade sends all rows for a decade to the same reducer, and SORT BY orders the rows within each reducer, so each decade's output lists its fastest-growing grams first. The condition a.decade >= 190 keeps only the 1900s onward, since decades are stored as the first three digits of the year.

Page 36: Processing Data using Amazon Elastic MapReduce and Apache Hive

Calculating changes per decade
• Output

Page 37: Processing Data using Amazon Elastic MapReduce and Apache Hive

Final output

Page 38: Processing Data using Amazon Elastic MapReduce and Apache Hive

Hive (Hadoop) vs. SQL (RDBMS)
• Hive (Hadoop)
– Intended for scaling to hundreds and thousands of machines.
– Optimized for full table scans; jobs incur substantial overhead in submission and scheduling.
– Best for data transformation, summarization, and analysis of large volumes of data, but not appropriate for applications requiring fast query response times.
– Read based and therefore not appropriate for transaction processing requiring write operations.
• SQL (RDBMS)
– Typically runs on a single large machine and does not provide support for executing map and reduce functions on the tables.
– RDBMS systems are best when referential integrity is required and frequent small updates are performed.
– Tables are indexed and cached so small amounts of data can be retrieved very quickly.

Page 39: Processing Data using Amazon Elastic MapReduce and Apache Hive

References
• Disasters worldwide from 1900-2008 dataset
– URL: http://www.infochimps.com/datasets/disasters-worldwide-from-1900-2008
• Finding trending topics using Google Books n-grams data and Apache Hive on Elastic MapReduce
– URL: http://aws.amazon.com/articles/5249664154115844
• Google Books Ngrams
– URL: http://aws.amazon.com/datasets/8172056142375670
• Apache Hadoop
– URL: http://hadoop.apache.org/
• Amazon Elastic Compute Cloud
– URL: http://docs.aws.amazon.com/AWSEC2/latest/UserGuide/concepts.html
• Getting Started Guide
– URL: http://docs.aws.amazon.com/gettingstarted/latest/emr/getting-started-emr-tutorial.html

Page 40: Processing Data using Amazon Elastic MapReduce and Apache Hive

Acknowledgements
• Frank Paladino

Page 41: Processing Data using Amazon Elastic MapReduce and Apache Hive

Acknowledgements
• Dr. Aparna Varde
