Miscellanea Assignment #3 The Book Chapter 12 Break Project Conclusion References
CS-495/595 Hive
Lecture #9
Dr. Chuck Cartledge
18 Mar. 2015
Table of contents
1 Miscellanea
2 Assignment #3
3 The Book
4 Chapter 12
5 Break
6 Project
7 Conclusion
8 References
Corrections and additions since last lecture.
Assignment #3 due in a few hours.
Pay attention to the assignment details.
Things like:
Getting the average of the correct procedure based on the numeric code
Grouping the practitioners by type
Getting the average for the state based on the numeric code
Addressing those geographic areas that aren’t in your cartographic file
If appropriate, a “heat scale”
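The checklist above can be sketched in a few lines of Python. This is only an illustration, not the assignment's solution: the record layout (state, practitioner type, procedure code, payment), the sample values, and the set of areas in the cartographic file are all hypothetical.

```python
from collections import defaultdict

# Hypothetical records: (state, practitioner_type, procedure_code, payment).
RECORDS = [
    ("VA", "Cardiology", 99213, 70.0),
    ("VA", "Cardiology", 99213, 90.0),
    ("VA", "Dermatology", 99213, 50.0),
    ("PR", "Cardiology", 99213, 40.0),
    ("VA", "Cardiology", 12345, 999.0),  # different procedure, must be excluded
]

# Hypothetical set of areas that the cartographic file can draw.
MAPPED_AREAS = {"VA", "MD", "NC"}


def average_by(records, procedure_code, key_index):
    """Average payment for one procedure code, grouped by the chosen field."""
    totals = defaultdict(lambda: [0.0, 0])  # key -> [sum, count]
    for record in records:
        if record[2] != procedure_code:
            continue
        entry = totals[record[key_index]]
        entry[0] += record[3]
        entry[1] += 1
    return {key: total / count for key, (total, count) in totals.items()}


def split_mappable(state_averages, mapped_areas):
    """Separate areas that can be drawn from those missing in the map file."""
    drawable = {s: v for s, v in state_averages.items() if s in mapped_areas}
    missing = sorted(s for s in state_averages if s not in mapped_areas)
    return drawable, missing


by_state = average_by(RECORDS, 99213, key_index=0)
by_type = average_by(RECORDS, 99213, key_index=1)
drawable, missing = split_mappable(by_state, MAPPED_AREAS)
```

Areas that end up in `missing` still need a decision: report them separately, or say explicitly that they were left off the map.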
Hadoop, The Definitive Guide
Version 3 is specified in the syllabus [5]
Version 4 came out in November 2015
We’ll use Version 3 as much as possible
Overview
Where to get it.
“. . . was created to make it possible for analysts with SQL skills . . . to run queries on huge data . . . ”
Installable much like Pig
Image from [3].
Available from “https://hive.apache.org/downloads.html”
Overview
How to get it running.
Assuming that you have Hadoop up and running
There are “logical” conflicts between Hadoop and Hive
Overview
Different interfaces
Like so many tools in the Hadoop ecosystem.
A command line interface (CLI, our old friend)
Web interface (not sure if this works on our cluster)
JDBC and ODBC
All feed a compiler, an optimizer, and an executor.
Overview
Simple things like creating tables
Tables can be created from external data files
MapReduce underlies everything, so rows and fields are user definable
Tables can be partitioned for optimization
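One way to see why partitioning helps: if rows are grouped by the partition column when they are stored, a query that filters on that column only has to read the matching group. A minimal Python sketch of that idea follows. It is not how Hive is implemented, and the table contents, column names, and the `partition_rows` and `scan` helpers are hypothetical.

```python
from collections import defaultdict


def partition_rows(rows, column):
    """Group rows (dicts) by the value of the partition column."""
    partitions = defaultdict(list)
    for row in rows:
        partitions[row[column]].append(row)
    return dict(partitions)


def scan(partitions, wanted_value):
    """Return matching rows and how many rows had to be read to find them."""
    rows = partitions.get(wanted_value, [])
    return rows, len(rows)


# Hypothetical table of payments with a 'state' column.
ROWS = [
    {"state": "VA", "amount": 10},
    {"state": "VA", "amount": 20},
    {"state": "MD", "amount": 30},
    {"state": "NC", "amount": 40},
]

partitions = partition_rows(ROWS, "state")
va_rows, rows_read = scan(partitions, "VA")
```

Without partitioning, the same filter would have to look at all four rows; here it reads only the two rows in the VA group.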
Overview
More information about architecture
Focusing on where data is stored.
Actual data is stored either locally or in HDFS
Metadata is stored in the metastore used by Hive
By default, the metastore is a Derby database (data about the hive is stored in an RDBMS)
Hive tables are stored on HDFS under /user/hive/warehouse
Image from [5].
Overview
Comparison with Traditional Databases.
Schema enforcement
Traditional: enforced at load time (schema on write)
Hive: enforced at query time (schema on read)
Updates, transactions, and indexes
Traditional: these are mainstays
Hive: HDFS doesn’t support these actions
Image from [1].
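The difference between the two enforcement points can be sketched in Python. This is an illustration under stated assumptions, not Hive's behavior in detail: the two-column schema, the sample lines, and the choice to turn an unparsable field into `None` at query time are all hypothetical.

```python
SCHEMA = (("name", str), ("amount", float))  # hypothetical two-column schema


def parse(line, strict):
    """Split a comma-separated line and cast each field per SCHEMA."""
    fields = line.split(",")
    row = {}
    for (column, cast), raw in zip(SCHEMA, fields):
        try:
            row[column] = cast(raw)
        except ValueError:
            if strict:
                raise
            row[column] = None  # bad field becomes a missing value
    return row


def load_schema_on_write(lines):
    """Validate every line at load time; one bad line rejects the load."""
    return [parse(line, strict=True) for line in lines]


def load_schema_on_read(lines):
    """Loading just keeps the raw lines; nothing is checked yet."""
    return list(lines)


def query_schema_on_read(raw_lines):
    """The schema is applied only when the data is queried."""
    return [parse(line, strict=False) for line in raw_lines]


LINES = ["alice,10.5", "bob,not-a-number"]
```

With these lines, `load_schema_on_write(LINES)` raises a `ValueError` during the load, while `load_schema_on_read(LINES)` succeeds immediately and the bad field only surfaces when `query_schema_on_read` is run. The trade-off is a fast load versus problems being discovered later.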
Overview
Hive has its roots in MySQL
Relatively small differences in syntax between HiveQL and MySQL
Limitations on VIEWs, indexes, and updates (workaround is to create new tables)
HiveQL supports “partitioning” the table (column or row) to better fit the HDFS and MapReduce paradigm
HiveQL does not support updating existing records
Image from [2].
HiveQL has grown “organically” to meet the needs of its users.
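The "create new tables" workaround can be tried out with Python's built-in `sqlite3` module. SQLite is used here only as a runnable stand-in: it is not Hive, its SQL dialect is not HiveQL, and the table and column names are hypothetical. The point is the pattern of deriving a corrected table from the old one instead of changing a record in place.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE payments (provider TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO payments VALUES (?, ?)",
    [("A", 100.0), ("B", 200.0)],
)

# Instead of updating provider B in place, derive a new table from the old one.
conn.execute(
    """
    CREATE TABLE payments_v2 AS
    SELECT provider,
           CASE WHEN provider = 'B' THEN 250.0 ELSE amount END AS amount
    FROM payments
    """
)

old_rows = conn.execute(
    "SELECT provider, amount FROM payments ORDER BY provider"
).fetchall()
new_rows = conn.execute(
    "SELECT provider, amount FROM payments_v2 ORDER BY provider"
).fetchall()
conn.close()
```

The original table is left untouched, and the corrected value for provider B appears only in the new table.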
Break time.
Take about 10 minutes.
Additional details
Techniques and tools to solve the problem
Database “heavy lifting”
MapReduce
Pig Latin
Hive
Display and analysis
Custom code (language du jour)
Excel
Some cartographic package
Additional details
Class membership
Undergrads
1 CAMPBELL, CHRISTOPHER G.
2 CRUZ, JOSHUA T.
3 DAVIS, RANDALL A.
4 JIANG, MING H.
5 PHELPS, NATHAN A.
6 ZHANG, HEMIAO
Graduates
1 ALFURAYJ, HAIFA S.
2 ARAB, MARYAM
3 BETHU, ANVESH
4 DASARI, VICTOR PRABHU
5 GARNER, KEVIN M.
6 HAVANUR, SRINIVAS J.
7 LAMBI, ROHIT D.
8 PATEL, PRIYANK A.
9 POTINENI, BHAVYATEJA
10 SADANA, PRANEET
11 SAJJAN, PRASANNA KUMAR BASAVARAJ
So many people.
Additional details
Team membership
Undergrads
1 Campbell and Cruz
2 Davis and Phelps
3 Jiang and Zhang
Graduates
1 Alfurayj and Arab
2 Bethu and Dasari
3 Garner, Havanur and Sajjan
4 Lambi and Sadana
5 Patel and Potineni
Additional details
Scatter graphs
We don’t have a priori knowledge of any relationship between Medicare billings and pharmaceutical payments. So create a scatter graph of the data.
Just plot one value versus the other and look
Look for any apparent relationship
Relationship may not exist, may not be linear, may not be monotonic
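Why look before computing? A small Python sketch, using made-up data, shows how a single number can mislead when the relationship is not linear or not monotonic. Pearson and Spearman correlation are used here only as familiar examples of such numbers; the slides do not prescribe them, and the data sets are hypothetical.

```python
import math


def pearson(xs, ys):
    """Pearson correlation: strength of a linear relationship."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def ranks(values):
    """Rank values from 1..n (assumes no ties, to keep the sketch short)."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0] * len(values)
    for rank, index in enumerate(order, start=1):
        result[index] = rank
    return result


def spearman(xs, ys):
    """Spearman correlation: Pearson on ranks, detects monotonic trends."""
    return pearson(ranks(xs), ranks(ys))


XS = [-3, -2, -1, 0, 1, 2, 3]
LINEAR = [2 * x + 1 for x in XS]   # linear
CUBIC = [x ** 3 for x in XS]       # monotonic but not linear
PARABOLA = [x ** 2 for x in XS]    # clear pattern, but not monotonic
```

For the cubic data, `pearson` falls short of 1 while `spearman` reports a perfect monotonic relationship. For the parabola, `pearson` is zero even though the points follow an obvious curve, which a scatter graph shows at a glance.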
Additional details
Examples of scatter graphs
Different types of relationships:
None (shotgun)
Positive (strong positive)
Negative (strong negative)
Independent (or low)
Independent and non-monotonic (or low)
Spurious
Image from [4].
The scatter graph will be our guide for computations.
Additional details
Presentation
Short (on the order of 5 minutes)
Address the 6 questions
PowerPoint, or other format
Still need the “standard” PDF submission
What have we covered?
Gave some “final” hints/directions for assignment #3
Talked about Hive, HiveQL, origins, strengths, and weaknesses
Talked about the project
Next lecture: Discussion of current real-world applications of Big Data
References
[1] Amr A. Awadallah, Schema-on-read vs schema-on-write, http://www.slideshare.net/awadallah/schemaonread-vs-schemaonwrite, 2014.
[2] Marc Holmes, Hive for sql users, http://hortonworks.com/blog/hive-cheat-sheet-for-sql-users/, 2013.
[3] Jasper Pei Lee, Hive a sql-like wrapper over hadoop, https://jasperpeilee.wordpress.com/2011/11/22/hive-a-sql-like-wrapper-over-hadoop/, 2011.
[4] Bioscience Staff, Numbers numerical methods for bioscience students, http://web.anglia.ac.uk/numbers/graphsCharts.html.
[5] Tom White, Hadoop: The definitive guide, 3rd edition, O’Reilly Media, Inc., 2012.