sdec2011 essentials of hive
Confidential, for personal use only. All original content copyright owned by Treasury of Ideas LLC.Copyright for all other & referenced work is retained by their respective owners.
Essentials of Hive
Mastering Hadoop Map-reduce for Data Analysis
Shashank Tiwari
blog: shanky.org | twitter: @tshanky
What is Hive?
• A data warehouse system for Hadoop
• Facilitates data summarization and ad-hoc queries
• Allows SQL-like querying using HiveQL, by projecting metadata (a table schema) onto data stored in HDFS
• Custom mappers and reducers can also be plugged in
Supported Platforms
• Linux/Unix and Mac OSX
• Does not work on Cygwin
Required Software
• Java 1.6.x
• Hadoop 0.17.x to 0.20.x
Download
• Source: http://hive.apache.org/releases.html
• Version:
• hive-0.7.0
• Both binary and source distributions available
Install
• Extract: tar zxvf hive-0.7.0-bin.tar.gz
• Move and Create Symbolic Link: ln -s hive-0.7.0-bin hive
• Set environment variable HIVE_HOME to point to the hive directory
• Add $HIVE_HOME/bin to your PATH environment variable
Build From Source
• $ svn co http://svn.apache.org/repos/asf/hive/trunk hive
• $ cd hive
• $ ant clean package
• The binary distribution is in build/dist
Hive Needs Hadoop
• Hive runs on top of Hadoop, so a working Hadoop installation is required
• Add Hadoop distribution to your path or set HADOOP_HOME
• Start Hadoop daemons
• bin/start-all.sh
Configure Hive
• Create /tmp in HDFS and set appropriate permissions
• bin/hadoop fs -mkdir /tmp
• bin/hadoop fs -chmod g+w /tmp
• Create /user/hive/warehouse and set appropriate permissions
• bin/hadoop fs -mkdir /user/hive/warehouse
• bin/hadoop fs -chmod g+w /user/hive/warehouse
Default Hive Configuration
• Default configuration: conf/hive-default.xml
• Override default configuration by redefining properties in:
• conf/hive-site.xml
• Set HIVE_CONF_DIR to set a new location for the config file
• Hive configuration is an overlay on top of the Hadoop configuration
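A minimal sketch of what an override file could contain, generated here with Python. The slides do not show a hive-site.xml, so the helper name build_hive_site is hypothetical, and hive.metastore.warehouse.dir is used only as an example property, pointing at the warehouse directory created earlier.

```python
import xml.etree.ElementTree as ET

def build_hive_site(overrides):
    """Return hive-site.xml text for a dict of property overrides."""
    root = ET.Element("configuration")
    for name, value in sorted(overrides.items()):
        prop = ET.SubElement(root, "property")
        ET.SubElement(prop, "name").text = name
        ET.SubElement(prop, "value").text = str(value)
    return ET.tostring(root, encoding="unicode")

# Assumed example override: the warehouse directory created in HDFS.
site_xml = build_hive_site({"hive.metastore.warehouse.dir": "/user/hive/warehouse"})
print(site_xml)
```

Only the properties that differ from conf/hive-default.xml need to appear in the override file.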
Hive Configuration Manipulation
• Edit: conf/hive-site.xml
• Use SET command on the Hive cli
• Pass parameters to Hive
• bin/hive -hiveconf prop1=val1 -hiveconf prop2=val2
• set HIVE_OPTS to "-hiveconf prop1=val1 -hiveconf prop2=val2"
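The two ways of passing parameters can be sketched from a Python launcher script. This is an illustration only: hiveconf_args is a hypothetical helper, and prop1/prop2 are the placeholder names from the slide, not real Hive properties.

```python
import shlex

def hiveconf_args(props):
    """Turn a dict of properties into repeated -hiveconf name=value arguments."""
    args = []
    for name, value in props.items():
        args += ["-hiveconf", f"{name}={value}"]
    return args

props = {"prop1": "val1", "prop2": "val2"}  # placeholder names from the slide

# Command-line form: bin/hive -hiveconf prop1=val1 -hiveconf prop2=val2
command = ["bin/hive"] + hiveconf_args(props)

# Environment form: the same arguments joined into one string for HIVE_OPTS
hive_opts = " ".join(shlex.quote(a) for a in hiveconf_args(props))

print(command)
print(hive_opts)
```

The list in command could be handed to subprocess.run, and hive_opts could be exported as HIVE_OPTS before starting the cli.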
Hive by Example -- Getting Started
• Start the cli: bin/hive
• Basic DDL statements
• List the existing tables
• SHOW TABLES;
Create Table
• CREATE TABLE books (isbn INT, title STRING);
• DESCRIBE books;
• isbn int
• title string
• CREATE TABLE users (id INT, name STRING) PARTITIONED BY (vcol STRING);
• What does PARTITIONED BY (vcol STRING) mean?
Logical Table Partitions
• A Hive table can be logically partitioned by a virtual column
• The virtual column is derived from the partition in which the data is stored
• A table can have multiple partitions
• Each partition is uniquely identified by a virtual column value
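As a rough illustration of "derived from the partition": Hive keeps each partition in its own subdirectory of the table directory, named column=value. The sketch below builds such a path and recovers the virtual column value from it. The function names and the partition value 2011 are hypothetical.

```python
import posixpath

WAREHOUSE = "/user/hive/warehouse"  # warehouse directory created in HDFS

def partition_dir(table, partition_spec):
    """Directory that holds the files of one partition of a table."""
    parts = [f"{col}={val}" for col, val in partition_spec]
    return posixpath.join(WAREHOUSE, table, *parts)

def partition_values(path):
    """Recover the virtual column values from a partition directory path."""
    return dict(seg.split("=", 1) for seg in path.split("/") if "=" in seg)

# Hypothetical partition value for the users table defined with vcol.
path = partition_dir("users", [("vcol", "2011")])
print(path)
print(partition_values(path))
```

The value of vcol is never stored inside the data files; it comes from the directory the files sit in.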
Alter Table
• ALTER TABLE books ADD COLUMNS (author STRING, category STRING);
• Change Column Property
• ALTER TABLE table_name CHANGE [COLUMN] old_column_name new_column_name column_type [COMMENT column_comment] [FIRST|AFTER column_name]
Alter Table Column Property
• ALTER TABLE books CHANGE author author ARRAY<STRING> COMMENT "multi-valued";
• Both the old and the new column name must be specified, even when the name does not change
• The data type changes from STRING to ARRAY<STRING>
Data Types Supported
• Primitives: INT, STRING, etc.
• Complex types: MAP, ARRAY, STRUCT
Rename Table
• ALTER TABLE books RENAME TO published_contents;
• DESCRIBE published_contents;
• DESCRIBE books; (Execution error!)
Drop Tables
• DROP TABLE published_contents;
• DROP TABLE users;
GroupLens Example -- Getting the Data Set
• Movie ratings -- 1 million records
• Available in tar.gz format: million-ml-data.tar__0.gz
• Extract: tar zxvf million-ml-data.tar__0.gz
Loading Rating Data
• Format of data in ratings.dat:
• UserID::MovieID::Rating::Timestamp
• Replace the delimiter '::' with '#'
• In vi: :%s/::/#/g
• Save as ratings.dat.hash_delimited
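For a file with a million rows, the same replacement can be scripted instead of done in an editor. A minimal sketch, assuming the file is read as latin-1 text; the function names and the sample row are illustrative and not taken from the slides.

```python
def to_hash_delimited(lines):
    """Yield each line with the '::' field separator replaced by '#'."""
    for line in lines:
        yield line.replace("::", "#")

def convert_file(src_path, dst_path):
    """Stream src_path to dst_path line by line, rewriting the delimiter."""
    with open(src_path, encoding="latin-1") as src, \
         open(dst_path, "w", encoding="latin-1") as dst:
        dst.writelines(to_hash_delimited(src))

# Illustrative row in UserID::MovieID::Rating::Timestamp form.
sample = ["1::1193::5::978300760\n"]
print(list(to_hash_delimited(sample)))
```

A call such as convert_file("ratings.dat", "ratings.dat.hash_delimited") would produce the file loaded in the next steps. The approach is only safe when no field already contains '#'.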
Creating Metadata and Loading the File
• hive> CREATE TABLE ratings( userid INT, movieid INT, rating INT, tstamp STRING) ROW FORMAT DELIMITED FIELDS TERMINATED BY '#' STORED AS TEXTFILE;
• LOAD DATA LOCAL INPATH '<path/to/flat/file>' OVERWRITE INTO TABLE <table name>;
File Load Properties
• No validation on load. It is the developer's responsibility to make sure the file matches the table schema.
• Data can be on the local filesystem or on HDFS
• Local files are copied into the Hive HDFS namespace; files already on HDFS are moved there
• If OVERWRITE is not specified, the data is appended
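Because the load itself performs no validation, a quick check before loading can catch malformed rows. The sketch below assumes the '#'-delimited ratings layout (userid, movieid, rating as integers, then a timestamp string); find_bad_rows and the sample rows are made up for illustration.

```python
def find_bad_rows(lines, delimiter="#", expected_fields=4):
    """Return (line_number, line) pairs that would not fit the ratings schema."""
    bad = []
    for number, line in enumerate(lines, start=1):
        fields = line.rstrip("\n").split(delimiter)
        # userid, movieid and rating are INT columns; tstamp is a STRING.
        ok = len(fields) == expected_fields and all(f.isdigit() for f in fields[:3])
        if not ok:
            bad.append((number, line.rstrip("\n")))
    return bad

# Made-up rows: one valid, one with a non-numeric rating, one with a missing field.
rows = ["1#1193#5#978300760\n", "1#661#three#978302109\n", "2#1357#5\n"]
print(find_bad_rows(rows))
```

Rows that slip through unchecked are not rejected at load time; problems only show up later, at query time.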
Rating Data Load
• hive> LOAD DATA LOCAL INPATH '/path/to/ratings.dat.hash_delimited'
• > OVERWRITE INTO TABLE ratings;
A SQL Style Query
• SELECT COUNT(*) FROM ratings;
Loading movies and users data
• Now load the movies and users data in the same way as the ratings data.
• Details on the console...
• CREATE TABLE users_2(userid INT, gender STRING, age INT, occupation STRING, zipcode STRING) ROW FORMAT DELIMITED FIELDS TERMINATED BY '#' STORED AS TEXTFILE;
• add FILE /Users/tshanky/workspace/hadoop_workspace/hive_workspace/occupation_mapper.py;
• INSERT OVERWRITE TABLE users_2 SELECT TRANSFORM (userid, gender, age, occupation, zipcode) USING 'python occupation_mapper.py' AS (userid, gender, age, occupation_str, zipcode) FROM users;
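The slides do not show occupation_mapper.py itself. The following is a guess at its general shape, not the author's script: TRANSFORM streams rows to the script as tab-separated text on standard input and reads tab-separated rows back from standard output. The occupation table here is an assumed, partial mapping, and the sample row is invented.

```python
import io

# Assumed subset of occupation codes; check the data set's README for the real list.
OCCUPATIONS = {"0": "other", "12": "programmer", "15": "scientist"}

def map_row(line):
    """Replace the numeric occupation code in one tab-separated row by a label."""
    userid, gender, age, occupation, zipcode = line.rstrip("\n").split("\t")
    occupation_str = OCCUPATIONS.get(occupation, "unknown")
    return "\t".join([userid, gender, age, occupation_str, zipcode])

def run(stream_in, stream_out):
    """Process every row from stream_in and write the result to stream_out."""
    for line in stream_in:
        stream_out.write(map_row(line) + "\n")

# When used by TRANSFORM the script would call run(sys.stdin, sys.stdout).
# Here an in-memory stream with an invented row stands in for Hive.
demo_out = io.StringIO()
run(io.StringIO("42\tF\t25\t12\t00000\n"), demo_out)
print(demo_out.getvalue(), end="")
```

The column order written by the script has to match the AS (...) list in the TRANSFORM clause.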
Good Old SQL
• SELECT * FROM movies LIMIT 5;
• SELECT * FROM ratings WHERE movieid = 1;
• SELECT COUNT(*) FROM ratings WHERE movieid < 10;
• SELECT COUNT(*) FROM ratings WHERE movieid = 1 and rating = 5;
• SELECT title FROM movies WHERE title RLIKE '^Toy+';
More Than Good Old SQL
• SELECT `.+(id)` FROM ratings WHERE movieid = 1;
• Regular-expression-based selection of columns by name (here: userid and movieid)
• SELECT ratings.rating, COUNT(ratings.rating) FROM ratings WHERE movieid = 1 GROUP BY ratings.rating; (group by)
• SELECT * FROM movies ORDER BY movieid DESC;
• DISTRIBUTE BY and SORT BY (combined: CLUSTER BY) -- distribute rows across reducers by key and sort within each reducer
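CLUSTER BY is Hive's shorthand for DISTRIBUTE BY plus SORT BY on the same columns. The effect can be emulated in a few lines of Python. This is a conceptual sketch with invented rows and a hypothetical function name, not how Hive is implemented.

```python
from collections import defaultdict

def distribute_and_sort(rows, key, num_reducers):
    """Emulate DISTRIBUTE BY key SORT BY key over a fixed number of reducers."""
    buckets = defaultdict(list)
    for row in rows:
        # DISTRIBUTE BY: rows with the same key always reach the same reducer.
        buckets[hash(row[key]) % num_reducers].append(row)
    # SORT BY: each reducer sorts only its own rows, so there is no global order.
    return {r: sorted(b, key=lambda row: row[key]) for r, b in sorted(buckets.items())}

# Invented ratings rows.
ratings = [
    {"movieid": 3, "rating": 4},
    {"movieid": 1, "rating": 5},
    {"movieid": 2, "rating": 3},
    {"movieid": 1, "rating": 2},
]
result = distribute_and_sort(ratings, "movieid", 2)
for reducer, rows in result.items():
    print(reducer, rows)
```

ORDER BY, by contrast, gives one total order over the whole result.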
JOIN(s) in HiveQL
• equality joins, outer joins, left semi-joins
• SELECT ratings.userid, ratings.rating, ratings.tstamp, movies.title FROM ratings JOIN movies ON (ratings.movieid = movies.movieid) LIMIT 5;
• More than 2 tables:
• SELECT ratings.userid, ratings.rating, ratings.tstamp, movies.title, users.gender FROM ratings JOIN movies ON (ratings.movieid = movies.movieid) JOIN users ON (ratings.userid = users.userid) LIMIT 5;
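One common way an equality join maps onto MapReduce is the reduce-side join: the map phase emits every row under its join key, and the reduce phase pairs rows from both tables that share a key. The sketch below emulates that for ratings and movies with invented rows. The plan Hive actually chooses can differ (for example, a map-side join).

```python
from collections import defaultdict

def reduce_side_join(ratings, movies):
    """Join ratings to movies on movieid the way a reduce-side join would."""
    # Map phase: tag every row with its source table and group it by join key.
    shuffled = defaultdict(lambda: {"ratings": [], "movies": []})
    for row in ratings:
        shuffled[row["movieid"]]["ratings"].append(row)
    for row in movies:
        shuffled[row["movieid"]]["movies"].append(row)
    # Reduce phase: for each key, pair up the rows that came from both tables.
    joined = []
    for movieid in sorted(shuffled):
        group = shuffled[movieid]
        for r in group["ratings"]:
            for m in group["movies"]:
                joined.append((r["userid"], r["rating"], r["tstamp"], m["title"]))
    return joined

# Invented sample rows; movieid 9 has no matching movie and drops out.
ratings = [
    {"userid": 1, "movieid": 1, "rating": 5, "tstamp": "978824268"},
    {"userid": 2, "movieid": 2, "rating": 3, "tstamp": "978300000"},
    {"userid": 3, "movieid": 9, "rating": 4, "tstamp": "978300001"},
]
movies = [{"movieid": 1, "title": "Movie A"}, {"movieid": 2, "title": "Movie B"}]
joined = reduce_side_join(ratings, movies)
print(joined)
```

Joining a third table such as users repeats the same pattern on a different key (userid).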
Explain Plan -- the MapReduce Under the Hood
• EXPLAIN SELECT COUNT(*) FROM ratings WHERE movieid = 1 and rating = 5;
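Conceptually, a filtered COUNT(*) like this one breaks down into a map phase that filters rows and counts locally, and a reduce phase that adds up the partial counts. The Python sketch below emulates that idea on invented input splits. It does not reproduce the stages and operator trees that EXPLAIN actually prints.

```python
def map_phase(split):
    """Filter the rows of one input split and emit a partial count."""
    return sum(1 for row in split if row["movieid"] == 1 and row["rating"] == 5)

def reduce_phase(partial_counts):
    """Add up the partial counts from all mappers."""
    return sum(partial_counts)

# Two invented input splits, each handled by its own mapper.
splits = [
    [{"movieid": 1, "rating": 5}, {"movieid": 1, "rating": 4}],
    [{"movieid": 2, "rating": 5}, {"movieid": 1, "rating": 5}],
]
total = reduce_phase(map_phase(s) for s in splits)
print(total)
```

Each mapper sees only its own split, which is why the final sum has to happen in a single reduce step.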
Questions?
• blog: shanky.org | twitter: @tshanky