hadoop and hive
DESCRIPTION
Facebook and Open Source talk for University of Illinois at Urbana-Champaign by Zheng ShaoTRANSCRIPT
![Page 1: Hadoop and Hive](https://reader035.vdocuments.net/reader035/viewer/2022062303/5550f728b4c90572478b46b9/html5/thumbnails/1.jpg)
![Page 2: Hadoop and Hive](https://reader035.vdocuments.net/reader035/viewer/2022062303/5550f728b4c90572478b46b9/html5/thumbnails/2.jpg)
Facebook and Open SourceUniversity of Illinois, Urbana-Champaign
Zheng ShaoFacebook Data Team10/3/2008
![Page 3: Hadoop and Hive](https://reader035.vdocuments.net/reader035/viewer/2022062303/5550f728b4c90572478b46b9/html5/thumbnails/3.jpg)
Current Status of Facebook▪ #5 site on the Internet
▪ 100+ million users
▪ 65 billion monthly page views
▪ > 2^32 photos
• 400,000 developers around the world
• Google/7 search queries
• More hourly news stories than every other media source in history combined
![Page 4: Hadoop and Hive](https://reader035.vdocuments.net/reader035/viewer/2022062303/5550f728b4c90572478b46b9/html5/thumbnails/4.jpg)
Architecture▪ 3 service tiers:
▪ Web/Apache
▪ Memcached
▪ MySQL
▪ Uses:
▪ Linux
▪ PHP
![Page 5: Hadoop and Hive](https://reader035.vdocuments.net/reader035/viewer/2022062303/5550f728b4c90572478b46b9/html5/thumbnails/5.jpg)
Architecture (continued)▪ Backend Data Warehousing System:
Web Servers Scribe Servers
Network Storage
Hive on Hadoop ClusterOracle RAC MySQL
![Page 6: Hadoop and Hive](https://reader035.vdocuments.net/reader035/viewer/2022062303/5550f728b4c90572478b46b9/html5/thumbnails/6.jpg)
Open Source Strategy
▪ Facebook relies heavily on open source software.
▪ Facebook makes big contributions to the open source community.
▪ What’s the benefit of open source for you?
▪ Everybody can use it, including YOU
▪ Everybody can work on it, including YOU
![Page 7: Hadoop and Hive](https://reader035.vdocuments.net/reader035/viewer/2022062303/5550f728b4c90572478b46b9/html5/thumbnails/7.jpg)
Open Source at Facebook▪ Thrift
▪ Cassandra
▪ Facebook Open Platform
▪ MemCacheD enhancements
▪ PHP enhancements
▪ Hive on top of Hadoop
▪ Many more at http://developers.facebook.com/opensource.php
![Page 8: Hadoop and Hive](https://reader035.vdocuments.net/reader035/viewer/2022062303/5550f728b4c90572478b46b9/html5/thumbnails/8.jpg)
Hadoop and HiveDistributed File System, Map-Reduce, and SQL
![Page 9: Hadoop and Hive](https://reader035.vdocuments.net/reader035/viewer/2022062303/5550f728b4c90572478b46b9/html5/thumbnails/9.jpg)
Hadoop
▪ Mimics Google File System and Map Reduce
▪ Started by Yahoo in Feb 2006
▪ Supported/used by 30+ companies/universities now, including Google : http://wiki.apache.org/hadoop/PoweredBy
![Page 10: Hadoop and Hive](https://reader035.vdocuments.net/reader035/viewer/2022062303/5550f728b4c90572478b46b9/html5/thumbnails/10.jpg)
Why Hadoop?▪ We Want
▪ Efficiency
▪ Scalability
▪ Redundancy
▪ Load Balance
▪ Google File System and Map Reduce is nice
▪ But not open source
![Page 11: Hadoop and Hive](https://reader035.vdocuments.net/reader035/viewer/2022062303/5550f728b4c90572478b46b9/html5/thumbnails/11.jpg)
Hive
▪ A database/data warehouse on top of Hadoop
▪ Rich data types (structs, lists and maps)
▪ Efficient implementations of SQL filters, joins and group-by’s on top of map reduce
▪ Easy interactions with different programming languages
![Page 12: Hadoop and Hive](https://reader035.vdocuments.net/reader035/viewer/2022062303/5550f728b4c90572478b46b9/html5/thumbnails/12.jpg)
Why Hive?▪ We want
▪ SQL support
▪ Efficiency
▪ Scalability
▪ Redundancy
▪ Easy to add new features
▪ Alternatives:
▪ Traditional database systems
▪ Proprietary data warehousing systems
▪ Sawzall / Pig
![Page 13: Hadoop and Hive](https://reader035.vdocuments.net/reader035/viewer/2022062303/5550f728b4c90572478b46b9/html5/thumbnails/13.jpg)
Hive Architecture
HDFS
Hive CLIDDLQueriesBrowsing
Map Reduce
SerDe
Thrift Jute JSONThrift API
MetaStore
Web UIMgmt, etc
Hive QL
Planner ExecutionParser Planner
![Page 14: Hadoop and Hive](https://reader035.vdocuments.net/reader035/viewer/2022062303/5550f728b4c90572478b46b9/html5/thumbnails/14.jpg)
Machine 2
Machine 1
<k1, v1><k2, v2><k3, v3>
<k4, v4><k5, v5><k6, v6>
(Simplified) Map Reduce Review
<nk1, nv1><nk2, nv2><nk3, nv3>
<nk2, nv4><nk2, nv5><nk1, nv6>
LocalMap
<nk2, nv4><nk2, nv5><nk2, nv2>
<nk1, nv1><nk3, nv3><nk1, nv6>
GlobalShuffle
<nk1, nv1><nk1, nv6><nk3, nv3>
<nk2, nv4><nk2, nv5><nk2, nv2>
LocalSort
<nk2, 3>
<nk1, 2><nk3, 1>
LocalReduce
![Page 15: Hadoop and Hive](https://reader035.vdocuments.net/reader035/viewer/2022062303/5550f728b4c90572478b46b9/html5/thumbnails/15.jpg)
Hive QL – Join
• SQL:
INSERT INTO TABLE pv_users
SELECT pv.pageid, u.age
FROM page_view pv JOIN user u ON (pv.userid = u.userid);
pageid
userid
time
1 111 9:08:01
2 111 9:08:13
1 222 9:08:14
userid
age gender
111 25 female
222 32 male
pageid
age
1 25
2 25
1 32
X =
page_viewuser
pv_users
![Page 16: Hadoop and Hive](https://reader035.vdocuments.net/reader035/viewer/2022062303/5550f728b4c90572478b46b9/html5/thumbnails/16.jpg)
Hive QL – Join in Map Reduce
key value
111 <1,1>
111 <1,2>
222 <1,1>
pageid
userid
time
1 111 9:08:01
2 111 9:08:13
1 222 9:08:14
userid
age gender
111 25 female
222 32 male
page_view
user
key value
111 <2,25>
222 <2,32>
Map
key value
111 <1,1>
111 <1,2>
111 <2,25>
key value
222 <1,1>
222 <2,32>
ShuffleSort
pageid
age
1 25
2 25
pageid age
1 32
Reduce
![Page 17: Hadoop and Hive](https://reader035.vdocuments.net/reader035/viewer/2022062303/5550f728b4c90572478b46b9/html5/thumbnails/17.jpg)
Hive QL – Group By
• SQL:▪ INSERT INTO TABLE pageid_age_sum
▪ SELECT pageid, age, count(1)
▪ FROM pv_users
▪ GROUP BY pageid, age;
pageid
age
1 25
2 25
1 32
2 25
pv_users
pageid
age Count
1 25 1
2 25 2
1 32 1
pageid_age_sum
![Page 18: Hadoop and Hive](https://reader035.vdocuments.net/reader035/viewer/2022062303/5550f728b4c90572478b46b9/html5/thumbnails/18.jpg)
Hive QL – Group By in Map Reduce
pageid
age
1 25
2 25
pv_users
pageid
age Count
1 25 1
1 32 1
pageid_age_sum
pageid
age
1 32
2 25
Map
key value
<1,25>
1
<2,25>
1
key value
<1,32>
1
<2,25>
1
key value
<1,25>
1
<1,32>
1
key value
<2,25>
1
<2,25>
1
ShuffleSort
pageid
age Count
2 25 2
Reduce
![Page 19: Hadoop and Hive](https://reader035.vdocuments.net/reader035/viewer/2022062303/5550f728b4c90572478b46b9/html5/thumbnails/19.jpg)
Hive QL – Group By with Distinct
▪ SQL
▪ SELECT pageid, COUNT(DISTINCT userid)
▪ FROM page_view GROUP BY pageid
pageid
userid time
1 111 9:08:01
2 111 9:08:13
1 222 9:08:14
2 111 9:08:20
page_view
pageid count_distinct_userid
1 2
2 1
result
![Page 21: Hadoop and Hive](https://reader035.vdocuments.net/reader035/viewer/2022062303/5550f728b4c90572478b46b9/html5/thumbnails/21.jpg)
Hive OptimizationsEfficient execution of SQL on Map Reduce
![Page 22: Hadoop and Hive](https://reader035.vdocuments.net/reader035/viewer/2022062303/5550f728b4c90572478b46b9/html5/thumbnails/22.jpg)
Machine 2
Machine 1
<k1, v1><k2, v2><k3, v3>
<k4, v4><k5, v5><k6, v6>
(Simplified) Map Reduce Revisit
<nk1, nv1><nk2, nv2><nk3, nv3>
<nk2, nv4><nk2, nv5><nk1, nv6>
LocalMap
<nk2, nv4><nk2, nv5><nk2, nv2>
<nk1, nv1><nk3, nv3><nk1, nv6>
GlobalShuffle
<nk1, nv1><nk1, nv6><nk3, nv3>
<nk2, nv4><nk2, nv5><nk2, nv2>
LocalSort
<nk2, 3>
<nk1, 2><nk3, 1>
LocalReduce
![Page 23: Hadoop and Hive](https://reader035.vdocuments.net/reader035/viewer/2022062303/5550f728b4c90572478b46b9/html5/thumbnails/23.jpg)
Hive Optimizations – Merge Sequential Map Reduce Jobs
▪ SQL:
▪ FROM (a join b on a.key = b.key) join c on a.key = c.key SELECT …
key av bv
1 111 222
key av
1 111
A
Map Reducekey bv
1 222
B
key cv
1 333
C
AB
Map Reducekey av bv cv
1 111 222 333
ABC
![Page 24: Hadoop and Hive](https://reader035.vdocuments.net/reader035/viewer/2022062303/5550f728b4c90572478b46b9/html5/thumbnails/24.jpg)
Hive Optimizations – Share Common Read Operations
• Extended SQL
▪ FROM pv_users
▪ INSERT INTO TABLE pv_pageid_sum
▪ SELECT pageid, count(1)
▪ GROUP BY pageid
▪ INSERT INTO TABLE pv_age_sum
▪ SELECT age, count(1)
▪ GROUP BY age;
pageid
age
1 25
2 32
Map Reduce
pageid
count
1 1
2 1
pageid
age
1 25
2 32
Map Reduce
age count
25 1
32 1
![Page 25: Hadoop and Hive](https://reader035.vdocuments.net/reader035/viewer/2022062303/5550f728b4c90572478b46b9/html5/thumbnails/25.jpg)
Hive Optimizations – Load Balance Problem
pageid
age
1 25
1 25
1 25
2 32
1 25
pv_users
pageid
age count
1 25 4
2 32 1
pageid_age_sum
Map-Reducepagei
dage coun
t
1 25 2
2 32 1
1 25 2
pageid_age_partial_sum
Map-Reduce
![Page 26: Hadoop and Hive](https://reader035.vdocuments.net/reader035/viewer/2022062303/5550f728b4c90572478b46b9/html5/thumbnails/26.jpg)
Future Works
▪ Database Management Functionalities
▪ Optimizations using Combiners
▪ Optimizations based on Table Statistics
(Learn more about that in CS 511)
▪ Everybody can use it, including YOU
▪ Everybody can work on it, including YOU
![Page 28: Hadoop and Hive](https://reader035.vdocuments.net/reader035/viewer/2022062303/5550f728b4c90572478b46b9/html5/thumbnails/28.jpg)
CreditsData Management at Facebook, Jeff
HammerbacherHive – Data Warehousing & Analytics on Hadoop,
Joydeep Sen Sarma, Ashish Thusoo
PeopleSuresh AnthonyZheng ShaoPrasad ChakkaPete WyckoffNamit JainRaghu MurthyJoydeep Sen SarmaAshish Thusoo
![Page 29: Hadoop and Hive](https://reader035.vdocuments.net/reader035/viewer/2022062303/5550f728b4c90572478b46b9/html5/thumbnails/29.jpg)
(c) 2008 Facebook, Inc. or its licensors. "Facebook" is a registered trademark of Facebook, Inc.. All rights reserved. 1.0
![Page 30: Hadoop and Hive](https://reader035.vdocuments.net/reader035/viewer/2022062303/5550f728b4c90572478b46b9/html5/thumbnails/30.jpg)
Appendix Pages
![Page 31: Hadoop and Hive](https://reader035.vdocuments.net/reader035/viewer/2022062303/5550f728b4c90572478b46b9/html5/thumbnails/31.jpg)
Dealing with Structured Data▪ Type system
▪ Primitive types
▪ Recursively build up using Composition/Maps/Lists
▪ Generic (De)Serialization Interface (SerDe)
▪ To recursively list schema
▪ To recursively access fields within a row object
▪ Serialization families implement interface
▪ Thrift DDL based SerDe
▪ Delimited text based SerDe
▪ You can write your own SerDe
▪ Schema Evolution
![Page 32: Hadoop and Hive](https://reader035.vdocuments.net/reader035/viewer/2022062303/5550f728b4c90572478b46b9/html5/thumbnails/32.jpg)
MetaStore▪ Stores Table/Partition properties:
▪ Table schema and SerDe library
▪ Table Location on HDFS
▪ Logical Partitioning keys and types
▪ Other information
▪ Thrift API
▪ Current clients in Php (Web Interface), Python (old CLI), Java (Query Engine and CLI), Perl (Tests)
▪ Metadata can be stored as text files or even in a SQL backend
![Page 33: Hadoop and Hive](https://reader035.vdocuments.net/reader035/viewer/2022062303/5550f728b4c90572478b46b9/html5/thumbnails/33.jpg)
Hive CLI▪ DDL:
▪ create table/drop table/rename table
▪ alter table add column
▪ Browsing:
▪ show tables
▪ describe table
▪ cat table
▪ Loading Data
▪ Queries
![Page 34: Hadoop and Hive](https://reader035.vdocuments.net/reader035/viewer/2022062303/5550f728b4c90572478b46b9/html5/thumbnails/34.jpg)
Web UI for Hive▪ MetaStore UI:
▪ Browse and navigate all tables in the system
▪ Comment on each table and each column
▪ Also captures data dependencies
▪ HiPal:
▪ Interactively construct SQL queries by mouse clicks
▪ Support projection, filtering, group by and joining
▪ Also support
![Page 35: Hadoop and Hive](https://reader035.vdocuments.net/reader035/viewer/2022062303/5550f728b4c90572478b46b9/html5/thumbnails/35.jpg)
Hive Query Language▪ Philosophy
▪ SQL▪ Map-Reduce with custom scripts (hadoop streaming)
▪ Query Operators▪ Projections▪ Equi-joins▪ Group by▪ Sampling▪ Order By
![Page 36: Hadoop and Hive](https://reader035.vdocuments.net/reader035/viewer/2022062303/5550f728b4c90572478b46b9/html5/thumbnails/36.jpg)
Hive QL – Custom Map/Reduce Scripts• Extended SQL:
• FROM (
• FROM pv_users
• SELECT TRANSFORM(pv_users.userid, pv_users.date)
• USING 'map_script' AS (dt, uid)
• CLUSTER BY dt) map
• INSERT INTO TABLE pv_users_reduced
• SELECT TRANSFORM(map.dt, map.uid)
• USING 'reduce_script' AS (date, count);
• Map-Reduce: similar to hadoop streaming