cs 626 large scale data sciencejzhang/cs626/lecture11.pdfcs 626 large scale data science jun zhang...
TRANSCRIPT
![Page 1: CS 626 Large Scale Data Sciencejzhang/CS626/Lecture11.pdfCS 626 Large Scale Data Science Jun Zhang March 5, 2020 Originally prepared by Licong Cui Lecture 11 – Apache Hive 1 Review:](https://reader033.vdocuments.net/reader033/viewer/2022042223/5ec9a3ec66b0bc5d441a5cb0/html5/thumbnails/1.jpg)
CS 626 Large Scale Data Science
Jun ZhangMarch 5, 2020
Originally prepared by Licong Cui
Lecture 11 – Apache Hive
1
![Page 2: CS 626 Large Scale Data Sciencejzhang/CS626/Lecture11.pdfCS 626 Large Scale Data Science Jun Zhang March 5, 2020 Originally prepared by Licong Cui Lecture 11 – Apache Hive 1 Review:](https://reader033.vdocuments.net/reader033/viewer/2022042223/5ec9a3ec66b0bc5d441a5cb0/html5/thumbnails/2.jpg)
Review: Hadoop Ecosystem
2
![Page 3: CS 626 Large Scale Data Sciencejzhang/CS626/Lecture11.pdfCS 626 Large Scale Data Science Jun Zhang March 5, 2020 Originally prepared by Licong Cui Lecture 11 – Apache Hive 1 Review:](https://reader033.vdocuments.net/reader033/viewer/2022042223/5ec9a3ec66b0bc5d441a5cb0/html5/thumbnails/3.jpg)
Review: Pig
A platform for analyzing large data sets that consists of a high-level language for expressing data analysis programs.
Pig Engine
Pig Latin3
![Page 4: CS 626 Large Scale Data Sciencejzhang/CS626/Lecture11.pdfCS 626 Large Scale Data Science Jun Zhang March 5, 2020 Originally prepared by Licong Cui Lecture 11 – Apache Hive 1 Review:](https://reader033.vdocuments.net/reader033/viewer/2022042223/5ec9a3ec66b0bc5d441a5cb0/html5/thumbnails/4.jpg)
Review: WordCount Using Pig
lines = LOAD ‘cs626/words.txt' AS (line:chararray);words = FOREACH lines GENERATE FLATTEN(TOKENIZE(line)) as word;grouped = GROUP words BY word;wordcount = FOREACH grouped GENERATE group, COUNT(words);DUMP wordcount;
4
![Page 5: CS 626 Large Scale Data Sciencejzhang/CS626/Lecture11.pdfCS 626 Large Scale Data Science Jun Zhang March 5, 2020 Originally prepared by Licong Cui Lecture 11 – Apache Hive 1 Review:](https://reader033.vdocuments.net/reader033/viewer/2022042223/5ec9a3ec66b0bc5d441a5cb0/html5/thumbnails/5.jpg)
Pig vs MapReduce vs Hive
1https://www.dezyre.com/article/mapreduce-vs-pig-vs-hive/163
1
5
![Page 6: CS 626 Large Scale Data Sciencejzhang/CS626/Lecture11.pdfCS 626 Large Scale Data Science Jun Zhang March 5, 2020 Originally prepared by Licong Cui Lecture 11 – Apache Hive 1 Review:](https://reader033.vdocuments.net/reader033/viewer/2022042223/5ec9a3ec66b0bc5d441a5cb0/html5/thumbnails/6.jpg)
Outline
What is Hive?Data TypesData ModelsHive ArchitectureHive vs Traditional DatabaseHive vs Pig
6
![Page 7: CS 626 Large Scale Data Sciencejzhang/CS626/Lecture11.pdfCS 626 Large Scale Data Science Jun Zhang March 5, 2020 Originally prepared by Licong Cui Lecture 11 – Apache Hive 1 Review:](https://reader033.vdocuments.net/reader033/viewer/2022042223/5ec9a3ec66b0bc5d441a5cb0/html5/thumbnails/7.jpg)
Hive
Facebook Data warehousing infrastructure based on
Hadoop Designed to Enable easy data summarization Ad-hoc querying Analysis of large volumes of data
HiveQL - Hive’s query language7
![Page 8: CS 626 Large Scale Data Sciencejzhang/CS626/Lecture11.pdfCS 626 Large Scale Data Science Jun Zhang March 5, 2020 Originally prepared by Licong Cui Lecture 11 – Apache Hive 1 Review:](https://reader033.vdocuments.net/reader033/viewer/2022042223/5ec9a3ec66b0bc5d441a5cb0/html5/thumbnails/8.jpg)
Hive
Organize data into tables
Metastore
Metadata (table schemas)
8
![Page 9: CS 626 Large Scale Data Sciencejzhang/CS626/Lecture11.pdfCS 626 Large Scale Data Science Jun Zhang March 5, 2020 Originally prepared by Licong Cui Lecture 11 – Apache Hive 1 Review:](https://reader033.vdocuments.net/reader033/viewer/2022042223/5ec9a3ec66b0bc5d441a5cb0/html5/thumbnails/9.jpg)
Run Hive
Interactive mode Hive shell: hive
Non-interactive mode Local: hive –f script.q
Hive web interface
JDBC (Java Database Connectivity)
9
![Page 10: CS 626 Large Scale Data Sciencejzhang/CS626/Lecture11.pdfCS 626 Large Scale Data Science Jun Zhang March 5, 2020 Originally prepared by Licong Cui Lecture 11 – Apache Hive 1 Review:](https://reader033.vdocuments.net/reader033/viewer/2022042223/5ec9a3ec66b0bc5d441a5cb0/html5/thumbnails/10.jpg)
HiveQL Data Types
Primitive data types
Complex data types
10
![Page 11: CS 626 Large Scale Data Sciencejzhang/CS626/Lecture11.pdfCS 626 Large Scale Data Science Jun Zhang March 5, 2020 Originally prepared by Licong Cui Lecture 11 – Apache Hive 1 Review:](https://reader033.vdocuments.net/reader033/viewer/2022042223/5ec9a3ec66b0bc5d441a5cb0/html5/thumbnails/11.jpg)
Primitive Data Types
Boolean
Numeric: TINYINT, SMALLINT, INT, BIGINT, FLOAT, DOUBLE, DECIMAL
String: STRING, VARCHAR
Timestamp: TIMESTAMP, DATE
11
![Page 12: CS 626 Large Scale Data Sciencejzhang/CS626/Lecture11.pdfCS 626 Large Scale Data Science Jun Zhang March 5, 2020 Originally prepared by Licong Cui Lecture 11 – Apache Hive 1 Review:](https://reader033.vdocuments.net/reader033/viewer/2022042223/5ec9a3ec66b0bc5d441a5cb0/html5/thumbnails/12.jpg)
Primitive Data Types (cont.)
12
![Page 13: CS 626 Large Scale Data Sciencejzhang/CS626/Lecture11.pdfCS 626 Large Scale Data Science Jun Zhang March 5, 2020 Originally prepared by Licong Cui Lecture 11 – Apache Hive 1 Review:](https://reader033.vdocuments.net/reader033/viewer/2022042223/5ec9a3ec66b0bc5d441a5cb0/html5/thumbnails/13.jpg)
Primitive Data Types (cont.)
13
![Page 14: CS 626 Large Scale Data Sciencejzhang/CS626/Lecture11.pdfCS 626 Large Scale Data Science Jun Zhang March 5, 2020 Originally prepared by Licong Cui Lecture 11 – Apache Hive 1 Review:](https://reader033.vdocuments.net/reader033/viewer/2022042223/5ec9a3ec66b0bc5d441a5cb0/html5/thumbnails/14.jpg)
Complex Data Types
14
![Page 15: CS 626 Large Scale Data Sciencejzhang/CS626/Lecture11.pdfCS 626 Large Scale Data Science Jun Zhang March 5, 2020 Originally prepared by Licong Cui Lecture 11 – Apache Hive 1 Review:](https://reader033.vdocuments.net/reader033/viewer/2022042223/5ec9a3ec66b0bc5d441a5cb0/html5/thumbnails/15.jpg)
Example Complex Data Types
15
![Page 16: CS 626 Large Scale Data Sciencejzhang/CS626/Lecture11.pdfCS 626 Large Scale Data Science Jun Zhang March 5, 2020 Originally prepared by Licong Cui Lecture 11 – Apache Hive 1 Review:](https://reader033.vdocuments.net/reader033/viewer/2022042223/5ec9a3ec66b0bc5d441a5cb0/html5/thumbnails/16.jpg)
Example Complex Data Types (cont.)
16
![Page 17: CS 626 Large Scale Data Sciencejzhang/CS626/Lecture11.pdfCS 626 Large Scale Data Science Jun Zhang March 5, 2020 Originally prepared by Licong Cui Lecture 11 – Apache Hive 1 Review:](https://reader033.vdocuments.net/reader033/viewer/2022042223/5ec9a3ec66b0bc5d441a5cb0/html5/thumbnails/17.jpg)
HiveQL Example
17
![Page 18: CS 626 Large Scale Data Sciencejzhang/CS626/Lecture11.pdfCS 626 Large Scale Data Science Jun Zhang March 5, 2020 Originally prepared by Licong Cui Lecture 11 – Apache Hive 1 Review:](https://reader033.vdocuments.net/reader033/viewer/2022042223/5ec9a3ec66b0bc5d441a5cb0/html5/thumbnails/18.jpg)
HiveQL Example (cont.)
18
![Page 19: CS 626 Large Scale Data Sciencejzhang/CS626/Lecture11.pdfCS 626 Large Scale Data Science Jun Zhang March 5, 2020 Originally prepared by Licong Cui Lecture 11 – Apache Hive 1 Review:](https://reader033.vdocuments.net/reader033/viewer/2022042223/5ec9a3ec66b0bc5d441a5cb0/html5/thumbnails/19.jpg)
HiveQL Example (cont.)
19
![Page 20: CS 626 Large Scale Data Sciencejzhang/CS626/Lecture11.pdfCS 626 Large Scale Data Science Jun Zhang March 5, 2020 Originally prepared by Licong Cui Lecture 11 – Apache Hive 1 Review:](https://reader033.vdocuments.net/reader033/viewer/2022042223/5ec9a3ec66b0bc5d441a5cb0/html5/thumbnails/20.jpg)
Data Model
Tables
Partitions
Buckets
20
![Page 21: CS 626 Large Scale Data Sciencejzhang/CS626/Lecture11.pdfCS 626 Large Scale Data Science Jun Zhang March 5, 2020 Originally prepared by Licong Cui Lecture 11 – Apache Hive 1 Review:](https://reader033.vdocuments.net/reader033/viewer/2022042223/5ec9a3ec66b0bc5d441a5cb0/html5/thumbnails/21.jpg)
Tables
Analogous to relational tables Hive moves data to its warehouse directory Each table has corresponding directory in HDFS Hive does not check that the files in the table
directory conform to the schema at the loading time Example:
hdfs://user/tom/data.txt ==> hdfs://user/hive/warehouse/managed_table
21
![Page 22: CS 626 Large Scale Data Sciencejzhang/CS626/Lecture11.pdfCS 626 Large Scale Data Science Jun Zhang March 5, 2020 Originally prepared by Licong Cui Lecture 11 – Apache Hive 1 Review:](https://reader033.vdocuments.net/reader033/viewer/2022042223/5ec9a3ec66b0bc5d441a5cb0/html5/thumbnails/22.jpg)
External Tables
Does not move data to Hive’s warehouse directory
Point to existing data directories in HDFS
Data is assumed to be in Hive-compatible format
Dropping external table drops only the metadata
Example:
22
![Page 23: CS 626 Large Scale Data Sciencejzhang/CS626/Lecture11.pdfCS 626 Large Scale Data Science Jun Zhang March 5, 2020 Originally prepared by Licong Cui Lecture 11 – Apache Hive 1 Review:](https://reader033.vdocuments.net/reader033/viewer/2022042223/5ec9a3ec66b0bc5d441a5cb0/html5/thumbnails/23.jpg)
Partitions
Hive organize tables into partitions
Partitions determine distribution of data within subdirectories
Example:
23
![Page 24: CS 626 Large Scale Data Sciencejzhang/CS626/Lecture11.pdfCS 626 Large Scale Data Science Jun Zhang March 5, 2020 Originally prepared by Licong Cui Lecture 11 – Apache Hive 1 Review:](https://reader033.vdocuments.net/reader033/viewer/2022042223/5ec9a3ec66b0bc5d441a5cb0/html5/thumbnails/24.jpg)
Partitions (cont.)
Example:
SELECT statements:
24
![Page 25: CS 626 Large Scale Data Sciencejzhang/CS626/Lecture11.pdfCS 626 Large Scale Data Science Jun Zhang March 5, 2020 Originally prepared by Licong Cui Lecture 11 – Apache Hive 1 Review:](https://reader033.vdocuments.net/reader033/viewer/2022042223/5ec9a3ec66b0bc5d441a5cb0/html5/thumbnails/25.jpg)
Buckets
Data in each partition divided into buckets
Each bucket is stored as a file in partition directory
Hash function: H(column) mod num_buckets = bucket_number
Example:
25
![Page 26: CS 626 Large Scale Data Sciencejzhang/CS626/Lecture11.pdfCS 626 Large Scale Data Science Jun Zhang March 5, 2020 Originally prepared by Licong Cui Lecture 11 – Apache Hive 1 Review:](https://reader033.vdocuments.net/reader033/viewer/2022042223/5ec9a3ec66b0bc5d441a5cb0/html5/thumbnails/26.jpg)
Hive Architecture
26
![Page 27: CS 626 Large Scale Data Sciencejzhang/CS626/Lecture11.pdfCS 626 Large Scale Data Science Jun Zhang March 5, 2020 Originally prepared by Licong Cui Lecture 11 – Apache Hive 1 Review:](https://reader033.vdocuments.net/reader033/viewer/2022042223/5ec9a3ec66b0bc5d441a5cb0/html5/thumbnails/27.jpg)
Hive Architecture
27
![Page 28: CS 626 Large Scale Data Sciencejzhang/CS626/Lecture11.pdfCS 626 Large Scale Data Science Jun Zhang March 5, 2020 Originally prepared by Licong Cui Lecture 11 – Apache Hive 1 Review:](https://reader033.vdocuments.net/reader033/viewer/2022042223/5ec9a3ec66b0bc5d441a5cb0/html5/thumbnails/28.jpg)
Thrift Server
Framework for cross language services
Server written in Java
Support for clients written in different languages JDBC(java), ODBC(c++), php, perl, python scripts
28
![Page 29: CS 626 Large Scale Data Sciencejzhang/CS626/Lecture11.pdfCS 626 Large Scale Data Science Jun Zhang March 5, 2020 Originally prepared by Licong Cui Lecture 11 – Apache Hive 1 Review:](https://reader033.vdocuments.net/reader033/viewer/2022042223/5ec9a3ec66b0bc5d441a5cb0/html5/thumbnails/29.jpg)
Metastore
System catalog which contains metadata about the Hive tables
Stored in RDBMS/local file system HDFS too slow (not optimized for random access) Derby, MySQL
Objects of Metastore Database - Namespace of tables Table - list of columns, types, owner, storage, SerDe Partition - Partition specific column, SerDe and storage
29
![Page 30: CS 626 Large Scale Data Sciencejzhang/CS626/Lecture11.pdfCS 626 Large Scale Data Science Jun Zhang March 5, 2020 Originally prepared by Licong Cui Lecture 11 – Apache Hive 1 Review:](https://reader033.vdocuments.net/reader033/viewer/2022042223/5ec9a3ec66b0bc5d441a5cb0/html5/thumbnails/30.jpg)
Driver
Driver Maintains the lifecycle of HiveQL statement
Query Compiler Compiles HiveQL in a DAG of map reduce tasks
Executor Executes the tasks plan generated by the compiler in
proper dependency order Interacts with the underlying Hadoop instance
30
![Page 31: CS 626 Large Scale Data Sciencejzhang/CS626/Lecture11.pdfCS 626 Large Scale Data Science Jun Zhang March 5, 2020 Originally prepared by Licong Cui Lecture 11 – Apache Hive 1 Review:](https://reader033.vdocuments.net/reader033/viewer/2022042223/5ec9a3ec66b0bc5d441a5cb0/html5/thumbnails/31.jpg)
Hive vs Traditional Database
Schema on Read Versus Schema on Write Traditional database: schema on write table’s schema is enforced at data load time
Hive: schema on read does not verify the data when it is loaded, but rather when
a query is issued
31
![Page 32: CS 626 Large Scale Data Sciencejzhang/CS626/Lecture11.pdfCS 626 Large Scale Data Science Jun Zhang March 5, 2020 Originally prepared by Licong Cui Lecture 11 – Apache Hive 1 Review:](https://reader033.vdocuments.net/reader033/viewer/2022042223/5ec9a3ec66b0bc5d441a5cb0/html5/thumbnails/32.jpg)
Hive vs Traditional Database (cont.)
Updates, Transactions, and Indexes Mainstays of traditional databases
HDFS does not provide in-place file updates Changes resulting from inserts, updates, and deletes are
stored in small delta files
Delta files are periodically merged into the base table files by MapReduce jobs
32
![Page 33: CS 626 Large Scale Data Sciencejzhang/CS626/Lecture11.pdfCS 626 Large Scale Data Science Jun Zhang March 5, 2020 Originally prepared by Licong Cui Lecture 11 – Apache Hive 1 Review:](https://reader033.vdocuments.net/reader033/viewer/2022042223/5ec9a3ec66b0bc5d441a5cb0/html5/thumbnails/33.jpg)
Hive vs. RDBMS
33
![Page 34: CS 626 Large Scale Data Sciencejzhang/CS626/Lecture11.pdfCS 626 Large Scale Data Science Jun Zhang March 5, 2020 Originally prepared by Licong Cui Lecture 11 – Apache Hive 1 Review:](https://reader033.vdocuments.net/reader033/viewer/2022042223/5ec9a3ec66b0bc5d441a5cb0/html5/thumbnails/34.jpg)
SQL vs HiveQL
34
Cheat sheet at class webpage
![Page 35: CS 626 Large Scale Data Sciencejzhang/CS626/Lecture11.pdfCS 626 Large Scale Data Science Jun Zhang March 5, 2020 Originally prepared by Licong Cui Lecture 11 – Apache Hive 1 Review:](https://reader033.vdocuments.net/reader033/viewer/2022042223/5ec9a3ec66b0bc5d441a5cb0/html5/thumbnails/35.jpg)
SQL vs HiveQL (cont.)
35
![Page 36: CS 626 Large Scale Data Sciencejzhang/CS626/Lecture11.pdfCS 626 Large Scale Data Science Jun Zhang March 5, 2020 Originally prepared by Licong Cui Lecture 11 – Apache Hive 1 Review:](https://reader033.vdocuments.net/reader033/viewer/2022042223/5ec9a3ec66b0bc5d441a5cb0/html5/thumbnails/36.jpg)
Pig vs HivePig Hive
Procedural data flow language
Declarative SQLishLanguage
36
![Page 37: CS 626 Large Scale Data Sciencejzhang/CS626/Lecture11.pdfCS 626 Large Scale Data Science Jun Zhang March 5, 2020 Originally prepared by Licong Cui Lecture 11 – Apache Hive 1 Review:](https://reader033.vdocuments.net/reader033/viewer/2022042223/5ec9a3ec66b0bc5d441a5cb0/html5/thumbnails/37.jpg)
Pig vs Hive (cont.)
Pig Mainly for data transformations and processing
Unstructured and structured data
Hive Mainly for data warehousing and querying data
Structured data
Lower learning curve than Pig or MapReduce
HiveQL is much closer to SQL than Pig37
![Page 38: CS 626 Large Scale Data Sciencejzhang/CS626/Lecture11.pdfCS 626 Large Scale Data Science Jun Zhang March 5, 2020 Originally prepared by Licong Cui Lecture 11 – Apache Hive 1 Review:](https://reader033.vdocuments.net/reader033/viewer/2022042223/5ec9a3ec66b0bc5d441a5cb0/html5/thumbnails/38.jpg)
Hive, Pig, and Hadoop Benchmark
Version: Hadoop – 0.18x, Pig:786346, Hive:786346
38
![Page 39: CS 626 Large Scale Data Sciencejzhang/CS626/Lecture11.pdfCS 626 Large Scale Data Science Jun Zhang March 5, 2020 Originally prepared by Licong Cui Lecture 11 – Apache Hive 1 Review:](https://reader033.vdocuments.net/reader033/viewer/2022042223/5ec9a3ec66b0bc5d441a5cb0/html5/thumbnails/39.jpg)
Bigger Picture
Store large amounts of data to HDFS
Process raw data using Pig
Build schema using Hive
Querying data using Hive
39
![Page 40: CS 626 Large Scale Data Sciencejzhang/CS626/Lecture11.pdfCS 626 Large Scale Data Science Jun Zhang March 5, 2020 Originally prepared by Licong Cui Lecture 11 – Apache Hive 1 Review:](https://reader033.vdocuments.net/reader033/viewer/2022042223/5ec9a3ec66b0bc5d441a5cb0/html5/thumbnails/40.jpg)
References
Hadoop: The Definitive Guide (By Tom White) https://courses.engr.illinois.edu/cs525/sp201
5/ http://www.slideshare.net/chirag064/hive-
warehousing-over-hadoop https://acadgild.com/blog/working-with-
hive-complex-data-types/
40