
Page 1: Processing Data using Amazon Elastic MapReduce and Apache Hive

Processing Data using Amazon Elastic MapReduce and Apache Hive

Team Members
Frank Paladino
Aravind Yeluripiti

Page 2: Processing Data using Amazon Elastic MapReduce and Apache Hive

Project Goals
• Creating an Amazon EC2 instance to process queries against large datasets using Hive
• Choosing two public data sets available on Amazon
• Running queries against these datasets
• Comparing HiveQL and SQL

Page 3: Processing Data using Amazon Elastic MapReduce and Apache Hive

Definitions
• Hadoop MapReduce is a software framework for easily writing applications which process vast amounts of data (multi-terabyte datasets) in parallel on large clusters (thousands of nodes) of commodity hardware in a reliable, fault-tolerant manner.
• Amazon Elastic MapReduce is a web service which provides easy access to a Hadoop MapReduce cluster in the cloud.
• Apache Hive is an open source data warehouse built on top of Hadoop which provides a SQL-like interface over MapReduce.

Page 4: Processing Data using Amazon Elastic MapReduce and Apache Hive

Starting an interactive Hive session
• Register for an AWS account
• Create a key pair in the EC2 console
• Launch an EMR job flow to start an interactive Hive session using the key pair
• Use the domain name and key pair to SSH into the master node of the Amazon EC2 cluster as the user "hadoop"
• Start Hive
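Once connected, typing hive at the master node's shell starts the interactive prompt where the statements on the following slides are entered. A minimal sanity check (assuming a fresh cluster with no tables defined yet):

hive> SHOW TABLES;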

Page 5: Processing  Data  using Amazon Elastic  MapReduce  and Apache Hive

Data set 1: DISASTERS WORLDWIDE FROM 1900-2008

• Available at: – http://www.infochimps.com/datasets/disasters-worldwide-fr

om-1900-2008• Info– Disaster data from 1900 – 2008, – organized by

• start and end date, • country (and sub-location), • disaster type (and sub-type), • disaster name, cost, and persons killed and affected by the disaster.

• Details– Number of rows: 17829– Size: 451 kB (Compressed) , 1.5 MB (uncompressed)

Page 6: Processing Data using Amazon Elastic MapReduce and Apache Hive

Create table

CREATE EXTERNAL TABLE emdata (
  id string,  -- assumed column: missing from the transcript, but the queries below reference it
  start string,
  ende string,
  country string,
  locatione string,
  type string,
  subtype string,
  name string,
  killed string,
  cost string,
  affected string
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
LOCATION 's3://cmpt592-assignment5/input/';
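A quick way to confirm that the external table reads the S3 input as expected (a sketch; the rows returned depend on the input files):

SELECT * FROM emdata LIMIT 5;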

Page 7: Processing Data using Amazon Elastic MapReduce and Apache Hive

Q1:
• Get the total number of disasters that occurred in each country
• Query:

SELECT count(distinct id), country
FROM emdata
GROUP BY country;
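A variant of Q1 that ranks countries by disaster count (a sketch, not part of the original slides):

SELECT count(distinct id) AS num, country
FROM emdata
GROUP BY country
ORDER BY num DESC;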

Page 8: Processing Data using Amazon Elastic MapReduce and Apache Hive

Q1:
• Output:

Page 9: Processing Data using Amazon Elastic MapReduce and Apache Hive

Q2:
• Get the number and type of disasters in a given country
• Query:

SELECT count(distinct id), type
FROM emdata
WHERE country='Afghanistan'
GROUP BY type;

Page 10: Processing Data using Amazon Elastic MapReduce and Apache Hive

Q2:
• Output:

Page 11: Processing Data using Amazon Elastic MapReduce and Apache Hive

Q3:
• Get the number, type, and subtype of disasters in a given country
• Query:

SELECT count(distinct id), type, subtype
FROM emdata
WHERE country='Afghanistan'
GROUP BY type, subtype;

Page 12: Processing Data using Amazon Elastic MapReduce and Apache Hive

Q3:
• Output:

Page 13: Processing Data using Amazon Elastic MapReduce and Apache Hive

Q4:
• Get the total casualties and type of disasters in a given country, only when casualties > 100
• Query:

SELECT sum(killed), type
FROM emdata
WHERE country='Afghanistan' and killed>100
GROUP BY type;
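Since killed is declared as a string, Hive casts it implicitly in both sum() and the comparison. An equivalent form with explicit casts (a sketch; same result assuming the column holds plain integers):

SELECT sum(cast(killed AS bigint)), type
FROM emdata
WHERE country='Afghanistan' AND cast(killed AS bigint) > 100
GROUP BY type;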

Page 14: Processing Data using Amazon Elastic MapReduce and Apache Hive

Q4:
• Output:

Page 15: Processing Data using Amazon Elastic MapReduce and Apache Hive

Q5:
• Get the total casualties and name of country for a certain type of disaster when casualties > 500
• Query:

SELECT sum(killed), country
FROM emdata
WHERE type='Flood' and killed>500
GROUP BY country;

Page 16: Processing Data using Amazon Elastic MapReduce and Apache Hive

Q5:
• Output:

Page 17: Processing Data using Amazon Elastic MapReduce and Apache Hive

Q6:
• Get the total cost and name of country for a certain type of disaster when cost > 1,000,000
• Query:

SELECT sum(cost), country
FROM emdata
WHERE type='Flood' and cost>1000000
GROUP BY country;

Page 18: Processing Data using Amazon Elastic MapReduce and Apache Hive

Q6:
• Output:

Page 19: Processing Data using Amazon Elastic MapReduce and Apache Hive

Data set 2: Google Books n-grams
• Available at: s3://datasets.elasticmapreduce/ngrams/books/
• Details
– Size: 2.2 TB
– Source: Google Books
– Created On: January 5, 2011 6:11 PM GMT
– Last Updated: January 21, 2012 2:12 AM GMT
• Processing n-gram data using Amazon Elastic MapReduce and Apache Hive.
• Calculating the top trending topics per decade
– http://aws.amazon.com/articles/5249664154115844

Page 20: Processing Data using Amazon Elastic MapReduce and Apache Hive

Data set 2: Google Books n-grams
• N-grams are fixed-size tuples of items. In this case the items are words extracted from the Google Books corpus. The n specifies the number of elements in the tuple, so a 5-gram contains five words or characters.
• The n-grams in this dataset were produced by passing a sliding window over the text of books and outputting a record for each new token. For example, the sentence
• "The yellow dog played fetch."
– would produce the following 2-grams:
["The", "yellow"] ["yellow", "dog"] ["dog", "played"] ["played", "fetch"] ["fetch", "."]
Or the following 3-grams:
["The", "yellow", "dog"] ["yellow", "dog", "played"] ["dog", "played", "fetch"] ["played", "fetch", "."]

Page 21: Processing Data using Amazon Elastic MapReduce and Apache Hive

Data set 2: Google Books n-grams
• Two settings to efficiently process data from S3:
hive> set hive.base.inputformat=org.apache.hadoop.hive.ql.io.HiveInputFormat;
hive> set mapred.min.split.size=134217728;
• Used to tell
– Hive to use an InputFormat that will split a file into pieces for processing, and
– not to split them into pieces any smaller than 128 MB (134217728 bytes).

Page 22: Processing Data using Amazon Elastic MapReduce and Apache Hive

Creating input table
• In order to process any data, first define the source of the data.
• Data set used: English 1-grams
– Rows: 472,764,897
– Compressed Size: 4.8 GB

Page 23: Processing Data using Amazon Elastic MapReduce and Apache Hive

Creating input table
• Statement to define the data:

CREATE EXTERNAL TABLE english_1grams (
  gram string,
  year int,
  occurrences bigint,
  pages bigint,
  books bigint
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
STORED AS SEQUENCEFILE
LOCATION 's3://datasets.elasticmapreduce/ngrams/books/20090715/eng-all/1gram/';

Page 24: Processing Data using Amazon Elastic MapReduce and Apache Hive

Creating input table
• Output

Page 25: Processing Data using Amazon Elastic MapReduce and Apache Hive

Count the number of rows
• Statement:
SELECT count(*) FROM english_1grams;
• Output

Page 26: Processing Data using Amazon Elastic MapReduce and Apache Hive

Normalizing the data
• The data in its current form is very raw: it contains punctuation and numbers (typically years), and it is case sensitive.
• First we want to create a table to store the results of the normalization. In the process, we can also drop unnecessary columns.
• Statement:

CREATE TABLE normalized (
  gram string,
  year int,
  occurrences bigint
);

Page 27: Processing Data using Amazon Elastic MapReduce and Apache Hive

Normalizing the data
• Output

Page 28: Processing Data using Amazon Elastic MapReduce and Apache Hive

Inserting data into normalized table
• Read the raw data and insert it into the normalized table.
• Statement:

INSERT OVERWRITE TABLE normalized
SELECT lower(gram), year, occurrences
FROM english_1grams
WHERE year >= 1890 AND gram REGEXP "^[A-Za-z+'-]+$";

The WHERE clause drops years before 1890 and keeps only grams made up entirely of letters, apostrophes, hyphens, and plus signs, filtering out the numbers and punctuation.

Page 29: Processing Data using Amazon Elastic MapReduce and Apache Hive

Inserting data into normalized table
• Output

Page 30: Processing Data using Amazon Elastic MapReduce and Apache Hive

Finding word ratio by decade
• More books are printed over time, so every word has a tendency to have more occurrences in later decades.
• We only care about the relative usage of a word over time, so we want to ignore the change in the size of the corpus. This can be done by finding the ratio of occurrences of the word to the total number of occurrences of all words.
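As a minimal sketch of that denominator (the same aggregation appears as the subquery on Page 32), the total occurrences per decade can be computed as:

SELECT substr(year, 0, 3) AS decade, sum(occurrences) AS total
FROM normalized
GROUP BY substr(year, 0, 3);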

Page 31: Processing Data using Amazon Elastic MapReduce and Apache Hive

Finding word ratio by decade
• Create a table to store this data.
• Statement:

CREATE TABLE by_decade (
  gram string,
  decade int,
  ratio double
);

• Calculate the total number of word occurrences by decade. Then
• Join this data with the normalized table in order to calculate the usage ratio.

Page 32: Processing Data using Amazon Elastic MapReduce and Apache Hive

Finding word ratio by decade
• Statement:

INSERT OVERWRITE TABLE by_decade
SELECT a.gram, b.decade, sum(a.occurrences) / b.total
FROM normalized a
JOIN (
  SELECT substr(year, 0, 3) AS decade, sum(occurrences) AS total
  FROM normalized
  GROUP BY substr(year, 0, 3)
) b
ON substr(a.year, 0, 3) = b.decade
GROUP BY a.gram, b.decade, b.total;

Here substr(year, 0, 3) takes the first three digits of the year (Hive treats a start position of 0 like 1), so 1950 becomes 195, which identifies the decade.

Page 33: Processing Data using Amazon Elastic MapReduce and Apache Hive

Finding word ratio by decade
• Output

Page 34: Processing Data using Amazon Elastic MapReduce and Apache Hive

Calculating changes per decade
• With a normalized dataset by decade we can get down to calculating changes by decade. This can be achieved by joining the dataset on itself.
• We'll want to join rows where the n-grams are equal and the decade is off by one. This lets us compare ratios for a given n-gram from one decade to the next.

Page 35: Processing Data using Amazon Elastic MapReduce and Apache Hive

Calculating changes per decade
• Statement:

SELECT
  a.gram as gram,
  a.decade as decade,
  a.ratio as ratio,
  a.ratio / b.ratio as increase
FROM by_decade a
JOIN by_decade b
ON a.gram = b.gram and a.decade - 1 = b.decade
WHERE a.ratio > 0.000001 and a.decade >= 190
DISTRIBUTE BY decade
SORT BY decade ASC, increase DESC;

DISTRIBUTE BY decade sends all rows for a decade to the same reducer, and SORT BY orders the rows within each reducer, so each decade's output lists its fastest-growing grams first. The condition a.decade >= 190 keeps only the 1900s onward, since decades are stored as the first three digits of the year.

Page 36: Processing Data using Amazon Elastic MapReduce and Apache Hive

Calculating changes per decade
• Output

Page 37: Processing Data using Amazon Elastic MapReduce and Apache Hive

Final output

Page 38: Processing Data using Amazon Elastic MapReduce and Apache Hive

Hive (Hadoop) vs. SQL (RDBMS)
• Hive (Hadoop)
– Intended for scaling to hundreds and thousands of machines.
– Optimized for full table scans; jobs incur substantial overhead in submission and scheduling.
– Best for data transformation, summarization, and analysis of large volumes of data, but not appropriate for applications requiring fast query response times.
– Read based and therefore not appropriate for transaction processing requiring write operations.
• SQL (RDBMS)
– Typically runs on a single large machine and does not provide support for executing map and reduce functions on the tables.
– RDBMS systems are best when referential integrity is required and frequent small updates are performed.
– Tables are indexed and cached so small amounts of data can be retrieved very quickly.

Page 39: Processing Data using Amazon Elastic MapReduce and Apache Hive

References
• Disasters worldwide from 1900-2008 dataset
– URL: http://www.infochimps.com/datasets/disasters-worldwide-from-1900-2008
• Finding trending topics using Google Books n-grams data and Apache Hive on Elastic MapReduce
– URL: http://aws.amazon.com/articles/5249664154115844
• Google Books Ngrams
– URL: http://aws.amazon.com/datasets/8172056142375670
• Apache Hadoop
– URL: http://hadoop.apache.org/
• Amazon Elastic Compute Cloud
– URL: http://docs.aws.amazon.com/AWSEC2/latest/UserGuide/concepts.html
• Getting Started Guide
– URL: http://docs.aws.amazon.com/gettingstarted/latest/emr/getting-started-emr-tutorial.html

Page 40: Processing Data using Amazon Elastic MapReduce and Apache Hive

Acknowledgements
• Frank Paladino

Page 41: Processing Data using Amazon Elastic MapReduce and Apache Hive

Acknowledgements
• Dr. Aparna Varde
