
An Introduction to Apache Hive

Credits
By: Reza Ameri
Semester: Fall 2013
Course: DDB
Prof: Dr. Naderi


Agenda

• Starting Note
– What is Hive
– What is cool about Hive
– Hive in use
– What Hive is not

• Brief About Data Warehouse



Agenda- Contd.

• Hive Architecture
– Components
– Architecture Diagram
• Hive in Production
– HQL
– Data Insertion/Aggregation
• Performance
• Further Reading
• References



Starting Note

• What is Apache Hive?
– Open source (very important!), so free to use
– Data warehouse system on Hadoop
– Provides HQL (HiveQL), an SQL-like query interface
– Suitable for structured and semi-structured data
– Works with different storage systems and file formats



Starting Note- Contd.

• What is cool about Hive
– The HiveQL interface lets users run MapReduce (MR) jobs without having to think in MR terms.
• Some history
– Hive was created at Facebook!
– Netflix is also developing it.
– Amazon uses it in Amazon Elastic MapReduce.
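
To make the "MR without thinking MR" point concrete, here is a minimal Python sketch of the map, shuffle, and reduce steps that a query such as SELECT tags, COUNT(*) FROM movies GROUP BY tags conceptually stands for. The sample rows and all function names are made up for illustration; this is not the code Hive actually generates.

```python
from collections import defaultdict

# Hypothetical sample rows: (movie_id, movie_name, tags)
MOVIES = [
    (1, "Toy Story", "animation"),
    (2, "Heat", "crime"),
    (3, "Up", "animation"),
]


def map_phase(rows):
    """Emit one (key, 1) pair per row, keyed by the grouping column."""
    for _movie_id, _name, tag in rows:
        yield tag, 1


def shuffle_phase(pairs):
    """Collect all values that share a key, as the framework would."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups):
    """Sum the values of each key to get COUNT(*) per group."""
    return {key: sum(values) for key, values in groups.items()}


def count_by_tag(rows):
    """Run the three phases in order, like a single MR job."""
    return reduce_phase(shuffle_phase(map_phase(rows)))


if __name__ == "__main__":
    print(count_by_tag(MOVIES))
```

With HiveQL the user writes only the one-line query; the three phases above are what they no longer have to write by hand.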



Starting Note- Contd.

• What Hive is not

– It does not use complex indexes, so it does not respond within seconds!

– But it scales very well and works with data on the order of petabytes.

– It is not independent: its performance is tied to Hadoop.



Brief About Data Warehouse

• OLAP vs OLTP
– A DW is needed for OLAP.
– We want reports and summaries, not the live transaction data needed to keep operations running.
– We need reports to improve the operation, not to conduct an operation!
– We use ETL to populate data in the DW.
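
The OLTP/OLAP contrast can be sketched with Python's built-in sqlite3 module. This is only an illustration under assumptions: the sales table, its columns, and its rows are hypothetical, and SQLite stands in for both kinds of system.

```python
import sqlite3


def build_demo_db():
    """Create an in-memory table of hypothetical sales transactions."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE sales (id INTEGER PRIMARY KEY, day TEXT, amount REAL)"
    )
    conn.executemany(
        "INSERT INTO sales (day, amount) VALUES (?, ?)",
        [("2013-10-01", 10.0), ("2013-10-01", 5.0), ("2013-10-02", 7.5)],
    )
    return conn


def oltp_lookup(conn, sale_id):
    """OLTP style: fetch one live transaction by its key."""
    return conn.execute(
        "SELECT day, amount FROM sales WHERE id = ?", (sale_id,)
    ).fetchone()


def olap_daily_totals(conn):
    """OLAP style: scan many rows and summarize them for a report."""
    return conn.execute(
        "SELECT day, SUM(amount) FROM sales GROUP BY day ORDER BY day"
    ).fetchall()


if __name__ == "__main__":
    db = build_demo_db()
    print(oltp_lookup(db, 1))
    print(olap_daily_totals(db))
```

The first query touches a single row and must be fast; the second reads the whole table to produce a summary, which is the kind of workload a data warehouse is built for.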




Inmon approach vs Kimball approach




• Other keywords
– ODS: Operational Data Store
– Fact Tables
– Data Mart
– Dimensions
– Concurrent ETLs



Hive Architecture

• Components
– Hadoop
– Driver
– Command Line Interface (CLI)
– Web Interface
– Metastore
– Thrift Server



Hive Architecture


Architecture diagram (components shown):
• Clients: Web UI, Hive CLI, JDBC/ODBC (browse, query, DDL)
• Thrift API
• MetaStore
• Hive QL: Parser, Planner, Optimizer, Execution
• SerDe: CSV, Thrift, Regex
• UDF/UDAF: substr, sum, average
• File formats: TextFile, SequenceFile, RCFile
• User-defined map-reduce scripts
• Hadoop: Map Reduce, HDFS


Hive Architecture- Contd.

• Internal Components
– Compiler and Planner: compiles and checks the input query and creates an execution plan.
– Optimizer: optimizes the execution plan before it runs.
– Execution Engine: runs the execution plan. The plan is guaranteed to be a DAG (directed acyclic graph).
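
Because the plan is a DAG, its stages can always be run in an order where every stage starts only after the stages it depends on. The following Python sketch shows that idea with a plain topological sort; the stage names and the PLAN dictionary are hypothetical and do not reflect Hive's internal plan format.

```python
from collections import deque

# Hypothetical plan: each stage maps to the stages it depends on.
PLAN = {
    "scan_movies": [],
    "scan_ratings": [],
    "join": ["scan_movies", "scan_ratings"],
    "aggregate": ["join"],
}


def execution_order(plan):
    """Return the stages in an order that respects every dependency.

    Raises ValueError if the plan has a cycle, i.e. is not a DAG.
    """
    remaining = {stage: len(deps) for stage, deps in plan.items()}
    dependents = {stage: [] for stage in plan}
    for stage, deps in plan.items():
        for dep in deps:
            dependents[dep].append(stage)

    # Stages with no unmet dependencies can run right away.
    ready = deque(stage for stage, count in remaining.items() if count == 0)
    order = []
    while ready:
        stage = ready.popleft()
        order.append(stage)
        for nxt in dependents[stage]:
            remaining[nxt] -= 1
            if remaining[nxt] == 0:
                ready.append(nxt)

    if len(order) != len(plan):
        raise ValueError("plan contains a cycle, so it is not a DAG")
    return order


if __name__ == "__main__":
    print(execution_order(PLAN))
```

The cycle check is the reason the DAG guarantee matters: without it, some stage would wait forever on its own output.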



Hive Architecture- Contd.

• Hive Data Model: data in Hive is organized into
– Databases: the first level of abstraction.
– Tables: ordinary tables.
– Partitions: divide a table by the value of a column, which limits the data transferred to MR jobs.
– Buckets: subdivide the data inside partitions to make access easier.
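
A rough way to picture partitions and buckets is as a two-step address for each row: the partition value picks a directory, and a hash of the bucketing column picks a bucket inside it. The Python sketch below is illustrative only. The column=value directory naming follows Hive's convention, but the helper names are hypothetical and the bucket formula assumes an integer key that hashes to itself.

```python
def partition_dir(table, partition_col, value):
    """Directory that holds one partition: <table>/<col>=<value>."""
    return f"{table}/{partition_col}={value}"


def bucket_for(key, num_buckets):
    """Pick a bucket for an integer key: non-negative hash modulo bucket count.

    Assumption: an integer key hashes to itself.
    """
    return (key & 0x7FFFFFFF) % num_buckets


def locate(table, partition_col, partition_value, key, num_buckets):
    """Return (partition directory, bucket number) for one row."""
    return (
        partition_dir(table, partition_col, partition_value),
        bucket_for(key, num_buckets),
    )


if __name__ == "__main__":
    print(locate("logs", "date", "2013-10-01", 42, 4))
```

A query that filters on the partition column only needs to read one directory, and a lookup on the bucketing column only needs one bucket within it.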



Hive in Production

• Log processing
– Daily reports
– User activity measurement
• Data/text mining
– Machine learning (training data)
• Business intelligence
– Advertising delivery
– Spam detection



Hive in Production

• HQL
– Create
– Row Format
– SerDe
– Select
– Cluster By/Distribute By
• Data Insertion/Aggregation



HQL- Samples

• CREATE TABLE
CREATE TABLE movies (movie_id INT, movie_name STRING, tags STRING)
• ROW FORMAT (completes the CREATE TABLE statement above)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ':';
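
As a rough picture of what a delimited row format means for the movies table, the Python sketch below turns one ':'-separated text line into typed columns. The function name is hypothetical, and turning missing or unparsable fields into None is an assumption made here to mirror NULL handling, not a description of Hive's implementation.

```python
def parse_movie_line(line, delimiter=":"):
    """Split one text line into (movie_id, movie_name, tags)."""
    fields = line.rstrip("\n").split(delimiter)
    fields += [None] * (3 - len(fields))      # missing trailing fields become None
    movie_id, movie_name, tags = fields[:3]   # extra fields are ignored
    try:
        movie_id = int(movie_id)
    except (TypeError, ValueError):
        movie_id = None                       # not a valid integer
    return movie_id, movie_name, tags


if __name__ == "__main__":
    print(parse_movie_line("1:Toy Story:animation"))
```

Note that a movie name containing ':' would be split in the wrong place, which is why the delimiter has to be a character that never appears in the data.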



HQL- Samples

• Partition
CREATE TABLE table_name (id INT, name STRING)
PARTITIONED BY (`date` STRING);
– The partition column is declared only in PARTITIONED BY, not in the regular column list.



HQL- Samples

• SerDe
– User table with "id::gender::age::occupation::zipcode" format.
CREATE TABLE USER (id STRING, gender STRING, age STRING, occupation STRING, zipcode STRING)
ROW FORMAT SERDE 'org.apache.hadoop.hive.contrib.serde2.RegexSerDe'
WITH SERDEPROPERTIES ("input.regex" = "(.*)::(.*)::(.*)::(.*)::(.*)");
– This RegexSerDe accepts only STRING columns, so numeric fields are cast at query time.
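
To see what the input.regex does to a line, the pattern can be tried in Python's re module. This is a sketch of the idea only: the deserialize helper and the sample line are hypothetical, values are kept as strings, and returning None for a non-matching line is a simplification.

```python
import re

# Same pattern as the "input.regex" property: one capture group per column.
INPUT_REGEX = re.compile(r"(.*)::(.*)::(.*)::(.*)::(.*)")
COLUMNS = ("id", "gender", "age", "occupation", "zipcode")


def deserialize(line):
    """Map the regex capture groups of one line onto the column names.

    Returns None when the line does not match the pattern.
    """
    match = INPUT_REGEX.fullmatch(line.rstrip("\n"))
    if match is None:
        return None
    return dict(zip(COLUMNS, match.groups()))


if __name__ == "__main__":
    print(deserialize("1::F::1::10::48067"))
```

Each capture group maps to one table column in order, so the number of groups in the regex has to match the number of columns in the table.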



HQL- Samples

• Select
SELECT * FROM movies LIMIT 10;
• Distribute By
SELECT * FROM movies DISTRIBUTE BY tags;
– Selects the column used to organize rows as they are sent to the reducers: rows with the same value go to the same reducer.
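
The effect of DISTRIBUTE BY can be sketched in Python as hashing the chosen column and taking the result modulo the number of reducers. The names below are hypothetical, and zlib.crc32 is used only as a stable stand-in hash; it is an assumption for the sketch, not the hash function Hive uses.

```python
import zlib
from collections import defaultdict


def reducer_for(key, num_reducers):
    """Choose a reducer from a stable hash of the DISTRIBUTE BY key."""
    return zlib.crc32(key.encode("utf-8")) % num_reducers


def distribute(rows, key_index, num_reducers):
    """Group rows by the reducer that would receive them."""
    assignment = defaultdict(list)
    for row in rows:
        assignment[reducer_for(row[key_index], num_reducers)].append(row)
    return dict(assignment)


if __name__ == "__main__":
    movies = [
        (1, "Toy Story", "animation"),
        (2, "Heat", "crime"),
        (3, "Up", "animation"),
    ]
    for reducer, rows in sorted(distribute(movies, 2, 4).items()):
        print(reducer, rows)
```

Whatever the hash function, the property that matters is that equal keys always land on the same reducer, so each reducer sees every row for the tags it owns.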



Hive Process

• Data Insertion/Aggregation
– Bulk (ETL)
    • Talend: community version
    • Sqoop (SQL to Hadoop, Apache license)
    • Syncsort: not free!



Hive Process- Contd.

– STP (Straight Through Processing)
    • Flume: Apache licensed
    • Chukwa: part of the Apache Hadoop distribution
    • Scribe: Facebook's solution for log processing and aggregation




• Netflix Case Study
– Usage of Chukwa
– Log processing
– Count errors per session
– Count streams per day
– Ad-hoc queries such as summaries (sum, max, min, …)
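
The two counts named in the case study can be sketched in a few lines of Python. Everything concrete here is an assumption: the (day, session_id, event_type) layout, the event names, and the sample data are invented for illustration and are not Netflix's actual log format.

```python
from collections import Counter

# Hypothetical parsed log events: (day, session_id, event_type)
EVENTS = [
    ("2013-10-01", "s1", "stream_start"),
    ("2013-10-01", "s1", "error"),
    ("2013-10-01", "s2", "stream_start"),
    ("2013-10-02", "s3", "error"),
    ("2013-10-02", "s3", "error"),
]


def errors_per_session(events):
    """Count 'error' events for each session id."""
    return Counter(session for _day, session, kind in events if kind == "error")


def streams_per_day(events):
    """Count 'stream_start' events for each day."""
    return Counter(day for day, _session, kind in events if kind == "stream_start")


if __name__ == "__main__":
    print(errors_per_session(EVENTS))
    print(streams_per_day(EVENTS))
```

In Hive, each of these would be a single GROUP BY query over the loaded log table.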




• Phase 1
– A Hadoop job parses the logs and loads them into Hive every hour.
– The same job should also run every 24 hours to produce a summary.
• Phase 2
– Real-time log processing (parse/merge/load)
– Chukwa collects logs non-stop.



Performance

• According to Globant's investigations
• Tables:



Further Reading

• Apache Drill
– Software framework that supports data-intensive, distributed applications for interactive analysis of large-scale datasets
• Pig
– Platform for creating and running MR programs on Hadoop
• Oracle Big Data
• DB2 10 and InfoSphere Warehouse
• Parallel databases: Gamma, Bubba, Volcano
• Google: Sawzall
• Yahoo: Pig
• IBM: JAQL
• Microsoft: DryadLINQ, SCOPE



References

• https://www.facebook.com/note.php?note_id=89508453919
• https://github.com/facebook/scribe
• http://sqoop.apache.org/docs/
• http://flume.apache.org/FlumeDeveloperGuide.html
• Sqoop Database Import for Hadoop, Cloudera, Oct. 2009 (http://www.slideshare.net/cloudera/hw09-sqoop-database-import-for-hadoop)
• https://cwiki.apache.org/confluence/display/Hive/LanguageManual
• http://www.semantikoz.com/blog/the-free-apache-hive-book/
• Beginning Microsoft® SQL Server® 2012 Programming, Paul Atkinson and Robert Vieira, Wiley, ISBN: 978-1-118-10228-2
• Hive – A Petabyte Scale Data Warehouse Using Hadoop, Facebook team, 2009



Thanks…
