An Introduction to Apache HIVE

Credits
By: Reza Ameri
Semester: Fall 2013
Course: DDB
Prof: Dr. Naderi

Upload: reza-ameri

Post on 27-Jan-2015



Agenda

• Starting Note
  – What is Hive
  – What is cool about Hive
  – Hive in use
  – What Hive is not
• Brief About Data Warehouse


Agenda- Contd.

• Hive Architecture
  – Components
  – Architecture Diagram
• Hive in Production
  – HQL
  – Data Insertion/Aggregation
• Performance
• Further Reading
• References


Starting Note

• What is Apache Hive?
  – Open source (very important!), so it is free
  – Data warehouse system on Hadoop
  – Provides HQL (an SQL-like query interface)
  – Suitable for structured and semi-structured data
  – Able to work with different storage systems and file formats


Starting Note- Contd.

• What is cool about Hive
  – It lets users run MapReduce (MR) jobs through the HiveQL interface without thinking in MR terms.
• Some history
  – Hive was created at Facebook!
  – Netflix also takes part in its development.
  – Amazon uses it in Amazon Elastic MapReduce.


Starting Note- Contd.

• What Hive is not
  – It does not use complex indexes, so it does not respond within seconds!
  – But it scales very well and works with data on the order of petabytes.
  – It is not independent; its performance is tied to Hadoop.


Brief About Data Warehouse

• OLAP vs OLTP
  – A data warehouse (DW) is needed for OLAP.
  – We want reports and summaries, not the live transaction data needed to keep operations running.
  – We need reports to improve operations, not to conduct an operation!
  – We use ETL to populate data in the DW.


Brief About Data Warehouse

Inmon approach vs Kimball approach


Brief About Data Warehouse

• Other keywords
  – ODS (Operational Data Store)
  – Fact Tables
  – Data Mart
  – Dimensions
  – Concurrent ETLs


Hive Architecture

• Components
  – Hadoop
  – Driver
  – Command Line Interface (CLI)
  – Web Interface
  – Metastore
  – Thrift Server


Hive Architecture


Architecture diagram components:

• Interfaces: Web UI, Hive CLI, JDBC/ODBC (browse, query, DDL)
• Thrift API
• MetaStore
• Hive QL: Parser, Planner, Optimizer, Execution
• SerDe: CSV, Thrift, Regex
• UDF/UDAF: substr, sum, average
• File formats: TextFile, SequenceFile, RCFile
• User-defined map-reduce scripts
• Underlying platform: Map Reduce, HDFS


Hive Architecture- Contd.

– Internal Components
  • Compiler and Planner
    – Compiles and checks the input query and creates an execution plan.
  • Optimizer
    – Optimizes the execution plan before it runs.
  • Execution Engine
    – Runs the execution plan. The execution plan is guaranteed to be a DAG (directed acyclic graph).
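Because the plan is a DAG, its stages can always be put in an order where every stage runs after the stages it depends on. The following is a minimal Python sketch of that idea only, not Hive's actual engine; the stage names and the `execution_order` helper are hypothetical.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Hypothetical plan: each stage maps to the set of stages it depends on.
plan = {
    "scan_movies": set(),
    "scan_ratings": set(),
    "join": {"scan_movies", "scan_ratings"},
    "aggregate": {"join"},
    "write_result": {"aggregate"},
}

def execution_order(stages):
    """Return one valid run order for the stages.

    Raises graphlib.CycleError if the plan contains a cycle, i.e. is not a DAG.
    """
    return list(TopologicalSorter(stages).static_order())

order = execution_order(plan)
print(order)
```

A plan with a cycle has no such order, which is why the DAG guarantee matters to the execution engine.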


Hive Architecture- Contd.

• Hive Data Model
  – Any data in Hive is categorized into:
  • Databases
    – First level of abstraction.
  • Tables
    – Ordinary tables.
  • Partition
    – Splits a table by the value of a partition column, which limits the data an MR job has to read and transfer.
  • Bucket
    – Facilitates data access within partitions.
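The partition and bucket levels can be pictured with a small Python sketch. It assumes the `key=value` directory layout Hive uses for partitions and a hash-modulo rule for buckets; the helper names and the warehouse path are illustrative, and Hive's own hash function for non-integer types differs from this simplification.

```python
def partition_path(table_dir, partition_col, value):
    """Directory that holds one partition, in key=value form."""
    return f"{table_dir}/{partition_col}={value}"

def bucket_for(key, num_buckets):
    """Pick a bucket for an integer key by taking it modulo the bucket count.

    Assumption: an integer key hashes to itself.
    """
    return key % num_buckets

# Illustrative values only.
path = partition_path("/user/hive/warehouse/table_name", "date", "2013-11-01")
bucket = bucket_for(10, 4)
print(path, bucket)
```

A query that filters on the partition column only needs to read the matching directory, and a lookup by key only needs the matching bucket inside it.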


Hive in Production

• Log processing
  – Daily Report
  – User Activity Measurement
• Data/Text mining
  – Machine learning (Training Data)
• Business intelligence
  – Advertising Delivery
  – Spam Detection


Hive in Production

– HQL
  • Create
  • Row Format
  • SerDe
  • Select
  • Cluster By/Distribute By
– Data Insertion/Aggregation


HQL- Samples

• CREATE TABLE
  CREATE TABLE movies (movie_id int, movie_name string, tags string)
• ROW FORMAT
  ROW FORMAT DELIMITED FIELDS TERMINATED BY ':';


HQL- Samples

• Partition
  -- The partition column is declared only in PARTITIONED BY, not in the column list.
  create table table_name (id int, name string)
  partitioned by (date string);


HQL- Samples

• SerDe
  – User table with "id::gender::age::occupation::zipcode" format.
  -- The contrib RegexSerDe accepts only STRING columns; cast in queries where numbers are needed.
  CREATE TABLE USER (id STRING, gender STRING, age STRING, occupation STRING, zipcode STRING)
  ROW FORMAT SERDE 'org.apache.hadoop.hive.contrib.serde2.RegexSerDe'
  WITH SERDEPROPERTIES ("input.regex" = "(.*)::(.*)::(.*)::(.*)::(.*)");
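To see what the `input.regex` above extracts, the same pattern can be tried outside Hive. This is a small Python sketch, not the SerDe itself; the sample line and the `parse_line` helper are made up for illustration.

```python
import re

# Same pattern as the "input.regex" SerDe property: one capture group per column.
INPUT_REGEX = re.compile(r"(.*)::(.*)::(.*)::(.*)::(.*)")
COLUMNS = ("id", "gender", "age", "occupation", "zipcode")

def parse_line(line):
    """Map each capture group to its column; return None if the line does not match."""
    match = INPUT_REGEX.fullmatch(line)
    if match is None:
        return None
    return dict(zip(COLUMNS, match.groups()))

row = parse_line("1::F::25::10::48067")  # hypothetical input line
print(row)
```

Every captured value is a string, so numeric columns such as `age` still need a conversion step before arithmetic.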


HQL- Samples

• Select
  SELECT * FROM movies LIMIT 10;
• Distribute By
  – SELECT * FROM movies DISTRIBUTE BY tags;
  – Selects the column used to organize rows while they are sent to the reducers: rows with the same value in that column go to the same reducer.
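The effect of DISTRIBUTE BY can be sketched in Python as hash partitioning. This is only an illustration under assumptions: `crc32` stands in for whatever hash Hive applies, and the rows, reducer count, and `distribute_by` helper are hypothetical.

```python
from collections import defaultdict
from zlib import crc32

def distribute_by(rows, column, num_reducers):
    """Group rows by reducer so that equal values of `column` share a reducer."""
    reducers = defaultdict(list)
    for row in rows:
        # crc32 is a stand-in hash that gives the same result on every run.
        reducer_id = crc32(str(row[column]).encode("utf-8")) % num_reducers
        reducers[reducer_id].append(row)
    return dict(reducers)

sample_rows = [
    {"movie_id": 1, "tags": "comedy"},
    {"movie_id": 2, "tags": "drama"},
    {"movie_id": 3, "tags": "comedy"},
    {"movie_id": 4, "tags": "action"},
]
assignment = distribute_by(sample_rows, "tags", 3)
print(assignment)
```

Two different tags may still land on the same reducer; the guarantee is only that one tag never spans two reducers.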


Hive Process

• Data Insertion/Aggregation
  – Bulk
    • ETL
      – Talend (community version)
      – Sqoop (SQL to Hadoop, Apache license)
      – SyncSort (not free!)


Hive Process- Contd.

  – STP (Straight Through Processing)
    • Flume – Apache licensed
    • Chukwa – a part of the Apache Hadoop distribution
    • Scribe – Facebook's solution for log processing and aggregation.


Hive Process- Contd.

• Netflix Case Study
  – Usage of Chukwa
  – Log processing
  – Count errors per session
  – Count streams per day
  – Ad-hoc queries like summaries (sum, max, min, …)
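"Count errors per session" is a group-and-count aggregation over parsed log events. A minimal Python sketch of that computation follows; the event fields (`session_id`, `level`) and the sample data are assumptions, not the actual Netflix log schema.

```python
from collections import Counter

def errors_per_session(events):
    """Count ERROR events for each session id."""
    return Counter(e["session_id"] for e in events if e["level"] == "ERROR")

# Hypothetical parsed log events.
sample_events = [
    {"session_id": "s1", "level": "ERROR"},
    {"session_id": "s1", "level": "INFO"},
    {"session_id": "s2", "level": "ERROR"},
    {"session_id": "s1", "level": "ERROR"},
]
error_counts = errors_per_session(sample_events)
print(error_counts)
```

In Hive the same result would come from a filtered count grouped by the session column.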


Hive Process- Contd.

• Phase 1
  – A Hadoop job parses the logs and loads them into Hive every hour.
  – The same job should also run every 24 hours to produce a summary.
• Phase 2
  – Real-time log processing (parse/merge/load)
  – Chukwa provides non-stop log collection.


Performance

• According to Globant's investigations
• Tables:


Further Reading

• Apache Drill
  – Software framework that supports data-intensive, distributed applications for interactive analysis of large-scale datasets
• Pig
  – MR platform for creating and using MR on Hadoop
• Oracle Big Data
• DB2 10 and InfoSphere Warehouse
• Parallel databases: Gamma, Bubba, Volcano
• Google: Sawzall
• Yahoo: Pig
• IBM: JAQL
• Microsoft: DryadLINQ, SCOPE


References

• https://www.facebook.com/note.php?note_id=89508453919
• https://github.com/facebook/scribe
• http://sqoop.apache.org/docs/
• http://flume.apache.org/FlumeDeveloperGuide.html
• Sqoop Database Import For Hadoop, Cloudera, Oct. 2009
• https://cwiki.apache.org/confluence/display/Hive/LanguageManual
• http://www.semantikoz.com/blog/the-free-apache-hive-book/
• Beginning Microsoft® SQL Server® 2012 Programming, Paul Atkinson and Robert Vieira, Wiley, ISBN: 978-1-118-10228-2
• Hive – A Petabyte Scale Data Warehouse Using Hadoop, Facebook team, 2009


Thanks…