
Hive – A Warehousing Solution Over a MapReduce Framework

Bingbing Liu

2009-12-12


Outline

• Introduction

• Data Model

• Architecture

• HiveQL


What is Hive?

• A system for managing and querying structured data, built on top of Hadoop
  – MapReduce for execution
  – HDFS for storage
  – Metadata on raw files

• Key building principles:
  – SQL as a familiar data warehousing tool
  – Extensibility: types, functions, formats, scripts
  – Scalability and performance


Hive/Hadoop Usage @ Facebook

• Types of applications:
  – Reporting
    • E.g., daily/weekly aggregations of impression/click counts
    • Complex measures of user engagement
  – Ad hoc analysis
    • E.g., how many group admins, broken down by state/country
  – Data mining (assembling training data)
    • E.g., user engagement as a function of user attributes
  – Spam detection
    • Anomalous patterns for site integrity
    • Application API usage patterns
  – Ad optimization
  – Too many to count

700 terabytes of data

5,000 queries/day

More than 100 users


Data Warehousing at Facebook Today

[Diagram. Components shown: Web Servers, Scribe Servers, Filers, Hive on Hadoop Cluster, Oracle RAC, Federated MySQL]


Data Model

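• How data is organized in Hive:

– Tables: conceptually similar to tables in an RDBMS; in storage, each table corresponds to an HDFS directory.

– Partitions: each table can have one or more partitions, which determine how its data is distributed across subdirectories.

– Buckets: within each partition, rows are assigned to buckets by hashing a column; each bucket is a file.

Example: partition the data by column ds:

    CREATE TABLE sc (sno INT) PARTITIONED BY (ds STRING);

Rows with ds=2009-12-08 are then stored in the partition subdirectory

/sc/ds=2009-12-08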
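The mapping from partition values to a subdirectory, and from a column value to a bucket, can be sketched in a few lines of Python. This is only an illustration of the layout described above, not Hive's code: the function names `partition_dir` and `bucket_for` are hypothetical, and a plain modulo of Python's built-in `hash` stands in for whatever hash function Hive applies.

```python
def partition_dir(table_dir, partition_values):
    """Return <table_dir>/<col>=<value>/... for one partition."""
    parts = [f"{col}={val}" for col, val in partition_values]
    return "/".join([table_dir.rstrip("/")] + parts)


def bucket_for(value, num_buckets):
    """Pick a bucket index for a column value.

    Assumption: modulo of Python's hash() stands in for the real hash.
    """
    return hash(value) % num_buckets


# The example from the slide: table sc partitioned by ds.
print(partition_dir("/sc", [("ds", "2009-12-08")]))
# A row with sno = 3 in a (hypothetical) 2-bucket layout on sno.
print(bucket_for(3, 2))
```

The first call prints the subdirectory named on the slide; the bucket count of 2 is chosen only for the demonstration.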


Data Model

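[Diagram: how tables map to HDFS and to the MetaStore]

Labels: Logical Partitioning, Hash Partitioning

HDFS paths shown for table sc:
/hive/sc
/hive/sc/ds=2009-12-08
/hive/sc/ds=2009-12-08/sc.txt

MetaStore (Metastore DB) contents: Tables, Data Location, Bucketing Info, Partitioning Cols

Tables shown: sc, student, course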


Metastore

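• Stored locally or in a traditional RDBMS (not in HDFS).

• Database
  – A namespace for all tables; the default is "default".

• Table
  – Includes the list of columns and their types, plus storage and serialization/deserialization (SerDe) information.
  – Storage information includes the underlying data location, the data format (type), and bucketing information.

• Partition
  – Each partition can have its own columns, SerDe information, and storage information.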
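One way to picture these metadata objects is as nested records. The sketch below is an assumption made for illustration: the class and field names are hypothetical and do not reproduce the real Metastore schema, and the values "text" and "placeholder-serde" are placeholders. Only the table sc, its columns, and the HDFS paths come from the slides.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple

Column = Tuple[str, str]  # (column name, type name)


@dataclass
class Storage:
    location: str         # where the data lives underneath
    data_format: str      # data format (type)
    serde: str            # serialization/deserialization information
    num_buckets: int = 0  # bucketing information


@dataclass
class Partition:
    values: Dict[str, str]  # e.g. {"ds": "2009-12-08"}
    columns: List[Column]   # a partition may carry its own columns
    storage: Storage        # and its own SerDe and storage information


@dataclass
class Table:
    name: str
    columns: List[Column]
    partition_cols: List[Column]
    storage: Storage
    partitions: List[Partition] = field(default_factory=list)


@dataclass
class Database:
    name: str = "default"  # namespace for tables
    tables: Dict[str, Table] = field(default_factory=dict)


cols = [("sno", "int"), ("cno", "int"), ("grade", "int")]
sc = Table("sc", cols, [("ds", "string")],
           Storage("/hive/sc", "text", "placeholder-serde"))
sc.partitions.append(Partition(
    {"ds": "2009-12-08"}, cols,
    Storage("/hive/sc/ds=2009-12-08", "text", "placeholder-serde")))

db = Database()
db.tables[sc.name] = sc
print(db.tables["sc"].partitions[0].storage.location)
```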


Architecture

[Architecture diagram. Components shown:]

• Hive CLI (DDL, Queries, Browsing)
• Web UI
• Thrift API
• Parser, Planner, Optimizer, Execution
• SerDe (Thrift, Jute, JSON, ..)
• MetaStore (DB)
• Map Reduce
• HDFS


HiveQL – Hive Query Language

• Supports:
  – Select, project, aggregate, union all
  – Load data to a table from a local or HDFS directory
  – Equi-joins
  – Subqueries in the FROM clause
  – Multi-table insert
  – Multi-group-by


Example

• Student (sno int, sname string, class int)

• Course (cno int, cname string)

• Sc (sno int, cno int, grade int) partitioned by (ds string)


The traditional form "INSERT INTO TABLE test (1, 1, 1);" is not supported.


HiveQL- Join

• SQL:

INSERT OVERWRITE TABLE test

SELECT t1.sname,t2.cno

FROM student t1 JOIN sc t2 ON (t1.sno = t2.sno);

student X sc = test

student
Sno  Sname  Class
1    Wang   1
2    Zhang  1
3    Zhou   2
4    Chen   2

sc
Sno  Cno  Grade
1    1    90
1    2    80
2    1    79
2    2    80

test
sname  cno
Wang   1
Wang   2
Zhang  1
Zhang  2


HiveQL - Join in Map Reduce

Input: student
Sno  Sname  Class
1    Wang   1
2    Zhang  1
3    Zhou   2
4    Chen   2

Input: sc
Sno  Cno  Grade
1    1    90
1    2    80
2    1    79
2    2    80

Map output (student)
key  value
1    <0,Wang>
2    <0,Zhang>
3    <0,Zhou>
4    <0,Chen>

Map output (sc)
key  value
1    <1,1>
1    <1,2>
2    <1,1>
2    <1,2>

Shuffle/Sort output (input to Reduce)
key  value
1    <0,Wang>
1    <1,1>
1    <1,2>

key  value
2    <0,Zhang>
2    <1,1>
2    <1,2>

key  value
3    <0,Zhou>
4    <0,Chen>
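The three stages above can be replayed in plain Python to see how the tagged values lead to the join result. This is a single-process sketch of the idea, assuming the first element of each value (0 or 1) marks the source table as in the diagram; the function names are hypothetical and nothing here is Hive's actual implementation.

```python
from collections import defaultdict

# Rows from the example slides.
student = [(1, "Wang", 1), (2, "Zhang", 1), (3, "Zhou", 2), (4, "Chen", 2)]
sc = [(1, 1, 90), (1, 2, 80), (2, 1, 79), (2, 2, 80)]


def map_phase(student_rows, sc_rows):
    """Emit (join key, <tag, payload>): tag 0 = student, tag 1 = sc."""
    for sno, sname, _class in student_rows:
        yield sno, (0, sname)
    for sno, cno, _grade in sc_rows:
        yield sno, (1, cno)


def shuffle_sort(pairs):
    """Group values by key; within a key, tag-0 values come first."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return {k: sorted(groups[k], key=lambda v: v[0]) for k in sorted(groups)}


def reduce_phase(groups):
    """For each key, pair every student name with every course number."""
    for _sno, values in groups.items():
        names = [payload for tag, payload in values if tag == 0]
        cnos = [payload for tag, payload in values if tag == 1]
        for sname in names:
            for cno in cnos:
                yield sname, cno


joined = list(reduce_phase(shuffle_sort(map_phase(student, sc))))
print(joined)  # [('Wang', 1), ('Wang', 2), ('Zhang', 1), ('Zhang', 2)]
```

Keys 3 and 4 reach the reducer with only a student value, so they produce no output rows, which matches the contents of the test table.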


Query plan

Map

TableScanOperator
  Table: student [sno int, sname string, class int]
ReduceSinkOperator
  Partition cols: col[0]
  [0 int, 1 string, 2 int]

TableScanOperator
  Table: sc [sno int, cno int, grade int]
ReduceSinkOperator
  Partition cols: col[0]
  [0 int, 1 int, 2 int]

Reduce

JoinOperator
  Predicate: col[0,0] = col[1,0]
  [0 int, 1 string, 2 int, 3 int, 4 int, 5 int]
SelectOperator
  Expressions: [col[1], col[4]]
  [0 string, 1 int]
FileOutputOperator
  Table: test
  [0 string, 1 int]


Hive QL – Group By

SELECT student.class, count(1)

FROM student

GROUP BY student.class;

student
Sno  Sname  Class
1    Wang   1
2    Zhang  1
3    Zhou   2
4    Chen   2

Result
Class  count
1      2
2      2


Hive QL – Group By in Map Reduce

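Input: student (first split)
Sno  Sname  Class
1    Wang   1
2    Zhang  1

Input: student (second split)
Sno  Sname  Class
3    Zhou   2
4    Chen   2

Map output (first split)
key  value
1    1
1    1

Map output (second split)
key  value
2    1
2    1

Shuffle/Sort output (input to Reduce)
key  value
1    1
1    1

key  value
2    1
2    1

Reduce output
class  count
1      2

class  count
2      2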
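The same flow can be replayed in Python: each mapper emits (class, 1), the shuffle groups the ones by key, and the reducer sums them. This is a sketch of the idea rather than Hive's implementation. The `key_cols` parameter is a hypothetical addition that makes the key a tuple of column values, so the same code also handles a grouping on more than one column.

```python
from collections import defaultdict

# Rows from the example slides: (sno, sname, class).
student = [(1, "Wang", 1), (2, "Zhang", 1), (3, "Zhou", 2), (4, "Chen", 2)]


def map_phase(rows, key_cols):
    """Emit (grouping key, 1) per row; the key is a tuple of column values."""
    for row in rows:
        yield tuple(row[i] for i in key_cols), 1


def shuffle_sort(pairs):
    """Collect all values that share a key, with keys in sorted order."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return {k: groups[k] for k in sorted(groups)}


def reduce_phase(groups):
    """count(1) per group is the sum of the emitted ones."""
    return {key: sum(values) for key, values in groups.items()}


# GROUP BY class (column index 2).
counts = reduce_phase(shuffle_sort(map_phase(student, key_cols=[2])))
print(counts)  # {(1,): 2, (2,): 2}
```

Passing `key_cols=[0, 2]` groups on (sno, class) instead; with this data every such key is unique, so each count is 1.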


Query plan

TableScanOperator
  Table: student [sno int, sname string, class int]
ReduceSinkOperator
  Partition cols: col[2]
  [0 int, 1 string, 2 int]
GroupByOperator
  Aggregations: [count[2]]
  Keys: [col[2]]
  [0 int, 1 bigint]
FileOutputOperator
  Table: tmp1
  [0 int, 1 bigint]

TableScanOperator
  Table: tmp1
  [0 int, 1 bigint]
ReduceSinkOperator
  Partition cols: col[0]
  [0 int, 1 bigint]
SelectOperator
  Expressions: [col[0], col[1]]
  [0 int, 1 bigint]

Note: the aggregation key.

Question: what if the query groups by sno, class? Would the key be 0 <int, int>?


Multi group by
