
Hive – A Warehousing Solution Over a MapReduce Framework

Bingbing Liu

2009-12-12


Outline

• Introduction

• Data Model

• Architecture

• HiveQL


What is Hive?

• A system for managing and querying structured data built on top of Hadoop
– Map-Reduce for execution
– HDFS for storage
– Metadata on raw files

• Key Building Principles:
– SQL as a familiar data warehousing tool
– Extensibility: Types, Functions, Formats, Scripts
– Scalability and Performance

Hive/Hadoop Usage @ Facebook

• Types of Applications:
– Reporting
   • E.g., daily/weekly aggregations of impression/click counts
   • Complex measures of user engagement
– Ad hoc Analysis
   • E.g., how many group admins broken down by state/country
– Data Mining (Assembling training data)
   • E.g., user engagement as a function of user attributes
– Spam Detection
   • Anomalous patterns for Site Integrity
   • Application API usage patterns
– Ad Optimization
– Too many to count ..

700 terabytes of data

5000 queries/day

More than 100 users


Data Warehousing at Facebook Today

[Figure: data flow diagram. Components shown: Web Servers, Scribe Servers, Filers, Hive on Hadoop Cluster, Oracle RAC, Federated MySQL]

Data Model

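• How data is organized in Hive:

– Tables: conceptually similar to tables in an RDBMS; in storage, each table corresponds to an HDFS directory.

– Partitions: each table has one or more partitions, which determine how the data is distributed across subdirectories.

– Buckets: within each partition, rows are assigned to buckets by hashing a column; each bucket is a file.

Example: to partition the data by column ds:

Create table sc (sno int) partitioned by (ds string)

For rows with ds=2009-12-08, the partition's subdirectory in storage is

/sc/ds=2009-12-08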
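The table/partition/bucket layout described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the helper names `partition_dir` and `bucket_for` are hypothetical, and CRC32 stands in for whatever hash function Hive actually applies to the bucketing column.

```python
import zlib


def partition_dir(table_dir: str, partition_col: str, value: str) -> str:
    """Return the subdirectory that holds one partition of a table."""
    return f"{table_dir}/{partition_col}={value}"


def bucket_for(value, num_buckets: int) -> int:
    """Pick a bucket by hashing the bucketing column.

    CRC32 is a stand-in hash chosen for determinism; it is an assumption,
    not Hive's real hash function.
    """
    return zlib.crc32(str(value).encode("utf-8")) % num_buckets


# Partition directory for table sc, partitioned by ds.
print(partition_dir("/sc", "ds", "2009-12-08"))

# Each row of the partition lands in exactly one of the bucket files.
for sno in (1, 2, 3, 4):
    print(sno, "-> bucket", bucket_for(sno, num_buckets=2))
```

The point of the sketch is that the partition value becomes part of the directory path, while the bucket is derived from the row's own column value.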


Data Model

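[Figure: data model for table sc, shown in HDFS and in the MetaStore. Diagram labels: Logical Partitioning, Hash Partitioning.]

HDFS paths shown:

/hive/sc

/hive/sc/ds=2009-12-08

/hive/sc/ds=2009-12-08/sc.txt

…

MetaStore (Metastore DB) contents shown: Tables (student, course), Data Location, Bucketing Info, Partitioning Cols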

Metastore

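• Stored locally or in a traditional RDBMS (not in HDFS).

• Database
– The namespace for all tables; the default is "default".

• Table
– Includes the column list and column types, plus storage and serialization/deserialization (SerDe) information.
– Storage covers the underlying data location, the data format (type), and bucket information.

• Partition
– Each partition can have its own columns, SerDe information, and storage information.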
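One way to picture the Metastore entities listed above is as a small set of record types. The sketch below is an assumption-laden illustration, not Hive's actual schema: every class name, field name, and the SerDe string are hypothetical, chosen only to mirror the bullets on this slide.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple

Column = Tuple[str, str]  # (column name, column type)


@dataclass
class Storage:
    location: str      # where the data lives in the underlying store
    data_format: str   # data format (type)
    num_buckets: int = 0


@dataclass
class Partition:
    values: Dict[str, str]  # partition column -> value
    columns: List[Column]   # a partition may carry its own columns
    storage: Storage
    serde: str              # serialization/deserialization info


@dataclass
class Table:
    name: str
    columns: List[Column]
    partition_cols: List[Column]
    storage: Storage
    serde: str
    partitions: List[Partition] = field(default_factory=list)


@dataclass
class Database:
    name: str = "default"   # namespace for tables
    tables: Dict[str, Table] = field(default_factory=dict)


db = Database()
sc = Table(
    name="sc",
    columns=[("sno", "int"), ("cno", "int"), ("grade", "int")],
    partition_cols=[("ds", "string")],
    storage=Storage("/hive/sc", "text"),
    serde="hypothetical-text-serde",
)
sc.partitions.append(
    Partition(
        values={"ds": "2009-12-08"},
        columns=list(sc.columns),
        storage=Storage("/hive/sc/ds=2009-12-08", "text"),
        serde=sc.serde,
    )
)
db.tables[sc.name] = sc
print(db.tables["sc"].partitions[0].storage.location)
```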

Architecture

[Figure: Hive architecture diagram. Components shown: Hive CLI (DDL, Queries, Browsing), Web UI, Thrift API, MetaStore, DB, Parser, Planner, Optimizer, Execution, SerDe (Thrift, Jute, JSON, ..), Map Reduce, HDFS]

HiveQL – Hive Query Language

• Support:
– Select, project, aggregate, union all
– Load data into a table from a local or HDFS directory
– Equi-joins
– Subqueries in the FROM clause
– Multi-table Insert
– Multi-group-by

Example

• Student (sno int, sname string, class int);

• Course (cno int, cname string);

• Sc (sno int, cno int, grade int) partitioned by (ds string);

The traditional form Insert into table test (1, 1, 1); is not supported.

HiveQL- Join

• SQL:

INSERT OVERWRITE TABLE test

SELECT t1.sname,t2.cno

FROM student t1 JOIN sc t2 ON (t1.sno = t2.sno);

student

Sno  Sname  Class
1    Wang   1
2    Zhang  1
3    Zhou   2
4    Chen   2

X

sc

Sno  Cno  Grade
1    1    90
1    2    80
2    1    79
2    2    80

=

test

sname  cno
Wang   1
Wang   2
Zhang  1
Zhang  2

HiveQL- Join in Map Reduce

Input tables:

student

Sno  Sname  Class
1    Wang   1
2    Zhang  1
3    Zhou   2
4    Chen   2

sc

Sno  Cno  Grade
1    1    90
1    2    80
2    1    79
2    2    80

Map output:

student

key  value
1    <0,Wang>
2    <0,Zhang>
3    <0,Zhou>
4    <0,Chen>

sc

key  value
1    <1,1>
1    <1,2>
2    <1,1>
2    <1,2>

Shuffle Sort, then Reduce input grouped by key:

key  value
1    <0,Wang>
1    <1,1>
1    <1,2>

key  value
2    <0,Zhang>
2    <1,1>
2    <1,2>

key  value
3    <0,Zhou>
4    <0,Chen>
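The map, shuffle/sort, and reduce steps of this join can be simulated in plain Python. This is a minimal single-process sketch, not Hive's implementation: the function names are hypothetical, and the tags 0 (student) and 1 (sc) follow the `<tag,value>` pairs shown on the slide.

```python
from collections import defaultdict

student = [(1, "Wang", 1), (2, "Zhang", 1), (3, "Zhou", 2), (4, "Chen", 2)]
sc = [(1, 1, 90), (1, 2, 80), (2, 1, 79), (2, 2, 80)]


def map_phase(student_rows, sc_rows):
    """Emit (join key, (table tag, projected value)) for every input row."""
    for sno, sname, _class in student_rows:
        yield sno, (0, sname)   # tag 0 marks a student row
    for sno, cno, _grade in sc_rows:
        yield sno, (1, cno)     # tag 1 marks an sc row


def shuffle_sort(pairs):
    """Group values by key; sort keys, and sort values so tag 0 comes first."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return {key: sorted(groups[key]) for key in sorted(groups)}


def reduce_phase(groups):
    """For each key, pair every student name with every cno from sc."""
    for _key, values in groups.items():
        names = [v for tag, v in values if tag == 0]
        cnos = [v for tag, v in values if tag == 1]
        for sname in names:
            for cno in cnos:
                yield sname, cno


test_rows = list(reduce_phase(shuffle_sort(map_phase(student, sc))))
print(test_rows)  # [('Wang', 1), ('Wang', 2), ('Zhang', 1), ('Zhang', 2)]
```

Keys 3 and 4 reach the reducer with only a student row and no sc rows, so the inner join emits nothing for Zhou and Chen.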

Query plan

Map:

TableScanOperator
   Table: student [sno int, sname string, class int]

ReduceSinkOperator
   Partition cols: col[0] [0 int, 1 string, 2 int]

TableScanOperator
   Table: sc [sno int, cno int, grade int]

ReduceSinkOperator
   Partition cols: col[0] [0 int, 1 int, 2 int]

Reduce:

JoinOperator
   Predicate: col[0,0] = col[1,0] [0 int, 1 string, 2 int, 3 int, 4 int, 5 int]

SelectOperator
   Expressions: [col[1], col[4]] [0 string, 1 int]

FileOutputOperator
   Table: test [0 string, 1 int]

Hive QL – Group By

SELECT student.class, count(1)

FROM student

GROUP BY student.class;

student

Sno  Sname  Class
1    Wang   1
2    Zhang  1
3    Zhou   2
4    Chen   2

Result:

Class  count
1      2
2      2

Hive QL – Group By in Map Reduce

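Input (table student, read as two splits):

Sno  Sname  Class
1    Wang   1
2    Zhang  1

Sno  Sname  Class
3    Zhou   2
4    Chen   2

Map output (one table per split):

key  value
1    1
1    1

key  value
2    1
2    1

Shuffle Sort, then Reduce input grouped by key:

key  value
1    1
1    1

key  value
2    1
2    1

Reduce output:

class  count
1      2

class  count
2      2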
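The same group-by flow can be simulated in plain Python. This is a minimal single-process sketch with hypothetical function names, assuming the student table is read as the two splits shown on the slide.

```python
from collections import defaultdict

# The student table, read as two input splits.
splits = [
    [(1, "Wang", 1), (2, "Zhang", 1)],
    [(3, "Zhou", 2), (4, "Chen", 2)],
]


def map_split(rows):
    """Emit (class, 1) for every row of one split."""
    return [(cls, 1) for _sno, _sname, cls in rows]


def shuffle_sort(mapped_splits):
    """Bring all values for the same key together, with keys in sorted order."""
    groups = defaultdict(list)
    for pairs in mapped_splits:
        for key, value in pairs:
            groups[key].append(value)
    return dict(sorted(groups.items()))


def reduce_counts(groups):
    """Sum the 1s for each class to get count(1) per group."""
    return [(key, sum(values)) for key, values in groups.items()]


result = reduce_counts(shuffle_sort(map_split(s) for s in splits))
print(result)  # [(1, 2), (2, 2)]
```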

Query plan

TableScanOperator
   Table: student [sno int, sname string, class int]

ReduceSinkOperator
   Partition cols: col[2] [0 int, 1 string, 2 int]

GroupByOperator
   Aggregations: [count[2]]
   Keys: [col[2]] [0 int, 1 bigint]

FileOutputOperator
   Table: tmp1 [0 int, 1 bigint]

TableScanOperator
   Table: tmp1 [0 int, 1 bigint]

ReduceSinkOperator
   Partition cols: col[0] [0 int, 1 bigint]

SelectOperator
   Expressions: [col[0], col[1]] [0 int, 1 bigint]

The aggregation key:

What if the query groups by sno, class?

0 <int, int>?
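The question about grouping by two columns can be explored with a small Python sketch in which the aggregation key is a tuple of the grouping columns. This only illustrates the idea of a composite key; it makes no claim about how Hive represents such a key internally, and `group_count` is a hypothetical helper.

```python
from collections import Counter

student = [(1, "Wang", 1), (2, "Zhang", 1), (3, "Zhou", 2), (4, "Chen", 2)]


def group_count(rows, key_indexes):
    """Count rows per group, using a tuple of the grouping columns as the key."""
    return Counter(tuple(row[i] for i in key_indexes) for row in rows)


by_class = group_count(student, [2])         # one-column key: (class,)
by_sno_class = group_count(student, [0, 2])  # composite key: (sno, class)

print(by_class)
print(by_sno_class)
```

With a single grouping column the key is a one-element tuple; with two columns the key becomes a pair of ints, which is the shape the question on the slide points at.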

Multi group by
