Apache Hive An Introduction

Upload: saeed-meethal

Post on 28-Nov-2015


Page 1: Hive Intro

Apache Hive – An Introduction

Page 2: Hive Intro

2

Agenda

Overview – A quick look at Hive and its background.

Structure – A peek at the structure of Hive.

Language – How to write DDL and DML statements in Hive.

Hive at Yahoo – Working with Hive on Yahoo’s grids.

Advanced Features – Some more things you can do with Hive.

More Information – Where to look when you need more details or help.

Page 3: Hive Intro

3

Overview

Page 4: Hive Intro

4

Motivation for Hive

Companies are no longer dealing with gigabytes, but

rather terabytes

Large amount of data to analyze

Researchers want to study and understand the data

Business folk want to slice and dice the data & metrics

in various ways

Everyone is impatient – give me answers now

Joining across large data sets is quite tricky

Page 5: Hive Intro

5

Motivation for Hive

Started in January 2007 at Facebook

Query data on Hadoop without having to write complex

MapReduce programs in Java each time

SQL chosen for familiarity and tools-support

An active open-source project since August 2008

Top-level Apache project (hive.apache.org)

Used in many companies; a diverse set of contributors

Page 6: Hive Intro

6

What Hive Is

A Hadoop-based system for managing and querying

structured data

Hive provides a view of your data as tables with rows

and columns

Uses HDFS for storing data

Provides a SQL-like interface for querying data

Uses MapReduce for executing queries

Scales well to handle massive data-sets

Page 7: Hive Intro

7

Example

SELECT COUNT(1) AS job_count, t.wait_time

FROM

(SELECT ROUND(wait_time/1000) AS wait_time, job_id

FROM starling_jobs

WHERE grid = 'MB'

AND dt >= '2011_07_11'

AND dt <= '2011_07_13') t

GROUP BY t.wait_time;

Page 8: Hive Intro

8

8 Simple steps

Log in to the grid gateway machine.

Create an HDFS directory to store your Hive warehouse data,   Ex:-    hadoop fs -mkdir /user/vmoorthy/warehouse

Enter the Hive shell by running hive

SET mapred.job.queue.name=unfunded;   -- to run your job in the unfunded queue

 

Page 9: Hive Intro

9

8 Simple steps (…)

Create a database, specifying the location for its data store.

Ex:-   CREATE DATABASE autos LOCATION '/user/vmoorthy/warehouse';

USE autos;   -- to work with the previously created database named 'autos'

CREATE TABLE used_car(chromeTrimId INT, trimId INT, usedCarCondition STRING, usedCarMileage INT, usedCarPrice INT, chromeModelId INT, modelId INT)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
LINES TERMINATED BY '\n'
LOCATION '/user/vmoorthy/usedCarTrim';

-- create a table for the tab separated hdfs file named usedCarTrim

Page 10: Hive Intro

10

8 Simple steps (…)

Now, you are ready to run select queries on the above table.

Ex:-   SELECT * FROM used_car WHERE chrometrimid > 3030;

Page 11: Hive Intro

11

Structure

Page 12: Hive Intro

12

Architecture

Hive

Driver(Compiler, Optimizer, Executor)

Command-line Interface

Web Interface

Thrift Server

JDBC ODBC

Meta-store

Hadoop Database

Page 13: Hive Intro

13

Query Execution

Parser

Logical Plan Generator

Optimizer

Physical Plan Generator

Executor

Query

MapReduce Job(s)

Page 14: Hive Intro

14

Storage

<warehouse-directory>

<database-directory>

<table-directory>

<partition-directory>

<data-file1>

<data-file2>

[…]

<data-fileN>

Table metadata is stored in meta-store

Directories for databases, tables and partitions

Files for table-data
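
As a concrete sketch, a partitioned table named page_views (hypothetical) stored under the warehouse directory from the earlier example might be laid out like this:

/user/vmoorthy/warehouse/page_views/dt=2011_07_11/000000_0

/user/vmoorthy/warehouse/page_views/dt=2011_07_11/000001_0

/user/vmoorthy/warehouse/page_views/dt=2011_07_12/000000_0

A query filtering on dt reads only the files under the matching partition-directory.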

Page 15: Hive Intro

15

Language

Page 16: Hive Intro

16

Data Model

Database – a namespace for tables and other units of

data (“default” if none specified)

Table – a row-based store for data in a database; each

row having one or more columns

Partition – a key-based separation of data in a table for

reducing the amount of data scanned (Optional)

Bucket – cluster of data in a partition based on hashing

a column-value (Optional)
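
Putting these together, a sketch of a table that uses both partitions and buckets (database, table and column names are illustrative):

CREATE DATABASE IF NOT EXISTS weblogs;
USE weblogs;
CREATE TABLE page_views(user_id INT, url STRING)
PARTITIONED BY (dt STRING)
CLUSTERED BY (user_id) INTO 32 BUCKETS;

Queries filtering on dt then scan only the matching partition-directories.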

Page 17: Hive Intro

17

Primitive Data-types

Integers – TINYINT (1 byte), SMALLINT (2 bytes), INT (4

bytes), BIGINT (8 bytes)

Boolean – BOOLEAN (TRUE / FALSE)

Floating-Point – FLOAT, DOUBLE

String – STRING

Implicit and explicit casting supported
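
For instance, explicit casts use CAST, while arithmetic casts implicitly (assuming the employees table used in later slides):

SELECT CAST(age AS DOUBLE) FROM employees;   -- explicit: INT to DOUBLE

SELECT age + 1.5 FROM employees;   -- implicit: age widened to DOUBLE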

Page 18: Hive Intro

18

Complex Data-types

Arrays – a list of elements of the same data-type accessible

using an index. “A[n]” denotes the element at index

“n” (starts from zero) in array “A”

Structs – a record with named elements. “foo.bar”

denotes the field “bar” in the struct “foo”

Maps – maintains mappings from keys to respective

values. “M[‘foo’]” denotes the value for “foo” in the

map “M”

Collections can be nested arbitrarily
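
A sketch of a table using all three collection types (schema is illustrative):

CREATE TABLE emp_info(
name STRING,
skills ARRAY<STRING>,
address STRUCT<city:STRING, zip:INT>,
phones MAP<STRING, STRING>);

SELECT skills[0], address.city, phones['home'] FROM emp_info;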

Page 19: Hive Intro

19

Operators

Relational – “=”, “!=”, “<”, “<=”, etc. as well as “IS

NULL”, “IS NOT NULL”, “LIKE”, etc. Generate TRUE

or FALSE based on comparison

Arithmetic – “+”, “-”, “*”, “/”, etc. Generate number

based on the result of the arithmetic operation

Logical – “AND”, “OR”, “NOT”, etc. Generate TRUE or

FALSE based on combining Boolean expressions
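
Combined in a single query (using the employees table from later examples):

SELECT name, age * 2
FROM employees
WHERE age >= 30 AND (name LIKE 'J%' OR name IS NULL);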

Page 20: Hive Intro

20

Built-in Functions

Mathematical – “round()”, “floor()”, “rand()”, etc.

String – “concat()”, “substr()”,

“regexp_replace()”, etc.

Time – “to_date()”, “from_unixtime()”, “year()”,

“month()”, etc.

Aggregates – “count()”, “sum()”, “min()”, “max()”,

“avg()”

…and quite a lot more
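
A few of these in action (columns assumed from the employees examples):

SELECT substr(name, 1, 3), round(age / 10) FROM employees;

SELECT count(1), avg(age), min(age), max(age) FROM employees;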

Page 21: Hive Intro

21

Creating a Table

CREATE TABLE employees(name STRING, age INT);

or

CREATE TABLE IF NOT EXISTS employees(name STRING, age INT);

or

CREATE TABLE employees(name STRING, age INT)

PARTITIONED BY (join_dt STRING);

or

CREATE TABLE employees(name STRING, age INT)

STORED AS SequenceFile;

etc.

Page 22: Hive Intro

22

Loading Data

LOAD DATA INPATH '/foo/bar/snafu.txt'

INTO TABLE employees;

or

LOAD DATA LOCAL INPATH '/homes/wombat/emp_2011-12-01.txt'

INTO TABLE employees

PARTITION (join_dt='2011_12_01');

or

INSERT OVERWRITE TABLE employees

SELECT name, age FROM all_employees

WHERE location = 'Bangalore';

Page 23: Hive Intro

23

Querying Data

SELECT * FROM employees;

or

SELECT * FROM employees LIMIT 10;

or

SELECT name, age FROM employees

WHERE age > 30;

or

SET hive.exec.compress.output=false;

SET hive.cli.print.header=true;

INSERT OVERWRITE LOCAL DIRECTORY '/homes/wombat/blr'

SELECT * FROM all_employees

WHERE location = 'Bangalore';

etc.

Page 24: Hive Intro

24

External Tables

CREATE EXTERNAL TABLE foo(name string, age int)

LOCATION '/user/bar/wombat';

Data not managed by Hive

Useful when data is already processed and in a usable state

Manually clean up after dropping tables/partitions

Page 25: Hive Intro

25

Altering a Table

ALTER TABLE employees RENAME TO blr_employees;

ALTER TABLE employees

REPLACE COLUMNS (emp_name STRING, emp_age INT);

ALTER TABLE employees ADD COLUMNS (emp_id STRING);

ALTER TABLE all_employees

DROP PARTITION (location='Slackville');

Page 26: Hive Intro

26

Databases

CREATE DATABASE foo;

or
CREATE DATABASE IF NOT EXISTS foo;

or
CREATE DATABASE foo LOCATION '/snafu/wombat';

USE foo;
SELECT * FROM bar LIMIT 10;

or
SELECT * FROM foo.bar LIMIT 10;

DROP DATABASE foo;

or
DROP DATABASE IF EXISTS foo;

Page 27: Hive Intro

27

Other Operations

SHOW TABLES;

SHOW PARTITIONS all_employees;

SHOW PARTITIONS all_employees

PARTITION (location='Bangalore');

DESCRIBE employees;

DROP TABLE employees;

or

DROP TABLE IF EXISTS employees;

Page 28: Hive Intro

28

Joins

SELECT e.name, d.dept_name

FROM departments d JOIN all_employees e

ON (e.dept_id = d.dept_id);

or

SELECT e.name, d.dept_name

FROM departments d

LEFT OUTER JOIN all_employees e

ON (e.dept_id = d.dept_id);

Page 29: Hive Intro

29

Ordering of Data

ORDER BY – global ordering of results based on the

selected columns

SORT BY – local ordering of results on each reducer

based on the selected columns
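
Side by side (employees table assumed):

SELECT name, age FROM employees ORDER BY age DESC;   -- one reducer; total order

SELECT name, age FROM employees SORT BY age DESC;   -- ordered within each reducer only

SORT BY is often combined with DISTRIBUTE BY to control which reducer each row goes to.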

Page 30: Hive Intro

30

File-formats

TextFile – plain-text files; fields delimited with ^A by

default

SequenceFile – serialized objects, possibly-

compressed

RCFile – columnar storage of serialized objects,

possibly-compressed

Page 31: Hive Intro

31

TextFile Delimiters

Default field-separator is ^A; row-separator is \n

John Doe^A36\n

Jane Doe^A33\n

Default list-separator is ^B; value-separator is ^C

John Doe^Adept^Cfinance^Bemp_id^C2357\n

CREATE TABLE employees(name STRING, age INT)

ROW FORMAT DELIMITED

FIELDS TERMINATED BY '\t';

Page 32: Hive Intro

32

Buckets

CREATE TABLE employees(name STRING, age INT)

CLUSTERED BY (name) INTO 31 BUCKETS;

Distribute partition-data into files based on columns

Improves performance for filters with these columns

Works best when data is uniformly distributed
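
Bucketing also enables efficient sampling – a sketch using TABLESAMPLE on the table above:

SELECT * FROM employees
TABLESAMPLE(BUCKET 1 OUT OF 31 ON name);

This reads roughly one bucket-file instead of the whole table.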

Page 33: Hive Intro

33

Compressed Storage

Saves space and generally improves performance

Direct support for reading compressed files

LOAD DATA LOCAL INPATH '/foo/bar/emp_data.bz2' INTO TABLE all_employees;

Compressed TextFile cannot usually be split

SequenceFile or RCFile recommended instead

Page 34: Hive Intro

34

Tips

Judicious use of partitions and buckets can drastically

improve the performance of your queries

Put always-used Hive CLI commands in

$HOME/.hiverc (e.g. SET

mapred.job.queue.name=unfunded;)

Use EXPLAIN to analyze a query before executing it

Use RCFile with compression to save storage and to

improve performance
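
For example, to inspect the plan without running the query:

EXPLAIN
SELECT name, age FROM employees WHERE age > 30;

EXPLAIN EXTENDED adds further detail, such as input paths.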

Page 35: Hive Intro

35

Hive at Yahoo

Page 36: Hive Intro

36

Specifics

Hive CLI available as /home/y/bin/hive on gateways

of supported grids

Mandatory LOCATION clause in CREATE TABLE

Must specify MapReduce queue for submitted Jobs

(e.g. SET mapred.job.queue.name=unfunded;)

No JDBC / ODBC support

Integrated with HCatalog

Page 37: Hive Intro

37

Advanced Features

Page 38: Hive Intro

38

User-defined Functions

Many very useful built-in functions

SHOW FUNCTIONS;

DESCRIBE FUNCTION foo;

Extensible using user-defined functions

User-defined Function (UDF) for one-to-one mapping. E.g. round(), concat(), unix_timestamp(), etc.

User-defined Aggregate Function (UDAF) for many-to-one

mapping. E.g. sum(), avg(), stddev(), etc.

User-defined Table-generating Function (UDTF) for one-to-many

mapping. E.g. explode(), etc.
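
A UDTF emits zero or more rows per input row. For example, explode() over an array column (table and column are hypothetical):

SELECT explode(skills) AS skill FROM emp_info;

or, to keep other columns alongside, via a LATERAL VIEW:

SELECT name, skill
FROM emp_info LATERAL VIEW explode(skills) s AS skill;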

Page 39: Hive Intro

39

Custom UDF

package com.yahoo.hive.udf;

import org.apache.hadoop.hive.ql.exec.UDF;

import org.apache.hadoop.hive.ql.exec.Description;

import org.apache.hadoop.io.Text;

@Description(

name = "toupper",

value = "_FUNC_(str) - Converts a string to uppercase",

extended = "Example:\n" +

" > SELECT toupper(author_name) FROM authors a;\n" +

" STEPHEN KING"

)

Page 40: Hive Intro

40

Custom UDF (…)

public class ToUpper extends UDF {

public Text evaluate(Text s) {

Text to_value = new Text("");

if (s != null) {

try {

to_value.set(s.toString().toUpperCase());

} catch (Exception e) { // Should never happen

to_value = new Text(s);

}

}

return to_value;

}

}

Page 41: Hive Intro

41

UDF Usage

add jar build/ql/test/test-udfs.jar;

CREATE TEMPORARY FUNCTION TO_UPPER AS 'com.yahoo.hive.udf.ToUpper';

SELECT TO_UPPER(src.value) FROM src;

DROP TEMPORARY FUNCTION TO_UPPER;

Page 42: Hive Intro

42

Overloaded UDF

public class UDFExampleAdd extends UDF {

public Integer evaluate(Integer a, Integer b) {

if (a == null || b == null) return null;

return a + b;

}

public Double evaluate(Double a, Double b) {

if (a == null || b == null) return null;

return a + b;

}

}

Page 43: Hive Intro

43

Overloaded UDF

add jar build/contrib/hive_contrib.jar;

CREATE TEMPORARY FUNCTION example_add AS 'org.apache.hadoop.hive.contrib.udf.UDFExampleAdd';

SELECT example_add(1, 2) FROM src;

SELECT example_add(1.1, 2.2) FROM src;

Page 44: Hive Intro

44

UDAF Example

SELECT page_url, count(1), count(DISTINCT user_id)
FROM mylog
GROUP BY page_url;

public class UDAFCount extends UDAF {

public static class Evaluator implements UDAFEvaluator {

private int mCount;

public void init() {mCount = 0;}

public boolean iterate(Object o) {

if (o!=null) mCount++; return true;}

public Integer terminatePartial() {return mCount;}

public boolean merge(Integer o) {mCount += o; return true;}

public Integer terminate() {return mCount;}

}

}

Page 45: Hive Intro

45

Overloaded UDAF

public class UDAFSum extends UDAF {

public static class IntEvaluator implements UDAFEvaluator {

private int mSum;

public void init() {mSum = 0;}

public boolean iterate(Integer o)

{mSum += o; return true;}

public Integer terminatePartial() {return mSum;}

public boolean merge(Integer o) {mSum += o; return true;}

public Integer terminate() {return mSum;}

}

Page 46: Hive Intro

46

Overloaded UDAF

public static class DblEvaluator implements UDAFEvaluator {

private double mSum;

public void init() {mSum = 0;}

public boolean iterate(Double o)

{mSum += o; return true;}

public Double terminatePartial() {return mSum;}

public boolean merge(Double o)

{mSum += o; return true;}

public Double terminate() {return mSum;}

}

}

Page 47: Hive Intro

47

What Hive Is Not

Not suitable for small data-sets

Does not provide real-time results

Does not support row-level updates

Imposes a schema on the data

Does not support transactions

Does not need expensive server-class hardware,

RDBMS licenses or god-like DBAs to scale

Page 48: Hive Intro

48

More Information

Page 49: Hive Intro

49

External References

Hive home-page: hive.apache.org

Hive wiki: cwiki.apache.org/confluence/display/Hive

Hive tutorial: cwiki.apache.org/confluence/display/Hive/Tutorial

Hive language manual:

cwiki.apache.org/confluence/display/Hive/LanguageManual

Mailing-list: [email protected]

Page 50: Hive Intro

50

Internal References

Hive at Yahoo: wiki.corp.yahoo.com/view/Grid/Hive

Hive FAQ: wiki.corp.yahoo.com/view/Grid/HiveFAQ

Troubleshooting: wiki.corp.yahoo.com/view/Grid/HiveTroubleShooting

Internal mailing-list: [email protected]

Hive CLI yinst package: hive_cli

Installation instructions:

wiki.corp.yahoo.com/view/Grid/HiveInstallation

Page 51: Hive Intro

51

Questions?