Hive Quick Start Tutorial


DESCRIPTION

Hive quick start tutorial presented at March 2010 Hive User Group meeting. Covers Hive installation and administration commands.

TRANSCRIPT

Page 1: Hive Quick Start Tutorial

Hive Quick Start

© 2010 Cloudera, Inc.

Page 2: Hive Quick Start Tutorial


Background

• Started at Facebook

• Data was collected by nightly cron jobs into an Oracle DB

• “ETL” via hand-coded Python

• Grew from 10s of GBs (2006) to 1 TB/day of new data (2007), now 10x that

Page 3: Hive Quick Start Tutorial


Hadoop as Enterprise Data Warehouse

• Scribe and MySQL data loaded into Hadoop HDFS

• Hadoop MapReduce jobs to process data

• Missing components:
  – Command-line interface for “end users”
  – Ad-hoc query support
    • … without writing full MapReduce jobs
  – Schema information

Page 4: Hive Quick Start Tutorial


Hive Applications

• Log processing
• Text mining
• Document indexing
• Customer-facing business intelligence (e.g., Google Analytics)

• Predictive modeling, hypothesis testing

Page 5: Hive Quick Start Tutorial


Hive Architecture

Page 6: Hive Quick Start Tutorial


Data Model

• Tables
  – Typed columns (int, float, string, date, boolean)
  – Also, array/map/struct for JSON-like data

• Partitions
  – e.g., to range-partition tables by date

• Buckets
  – Hash partitions within ranges (useful for sampling, join optimization; see the sketch below)
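A minimal HiveQL sketch of how partitions and buckets combine; the table name, columns, and bucket count are made up for illustration:

CREATE TABLE actions (          -- 'actions' is an illustrative table, not from the slides
  user_id INT,
  action  STRING
)
PARTITIONED BY (dt STRING)               -- one subdirectory per date value
CLUSTERED BY (user_id) INTO 32 BUCKETS;  -- hash rows into 32 buckets within each partition

-- Bucketing makes sampling cheap: read roughly 1/32 of the data.
SELECT * FROM actions TABLESAMPLE(BUCKET 1 OUT OF 32 ON user_id);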

Page 7: Hive Quick Start Tutorial

Column Data Types

CREATE TABLE t (
  s STRING,
  f FLOAT,
  a ARRAY<MAP<STRING, STRUCT<p1:INT, p2:INT>>>
);

SELECT s, f, a[0]['foobar'].p2 FROM t;


Page 8: Hive Quick Start Tutorial


Metastore

• Database: namespace containing a set of tables

• Holds Table/Partition definitions (column types, mappings to HDFS directories)

• Statistics
• Implemented with DataNucleus ORM. Runs on Derby, MySQL, and many other relational databases

Page 9: Hive Quick Start Tutorial


Physical Layout

• Warehouse directory in HDFS
  – e.g., /user/hive/warehouse

• Table row data stored in subdirectories of the warehouse

• Partitions form subdirectories of table directories (see the listing below)

• Actual data stored in flat files
  – Control character-delimited text, or SequenceFiles
  – With a custom SerDe, can use arbitrary format
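For example, a date-partitioned table named foo (hypothetical) might lay out like this, assuming the default warehouse location; output trimmed to just the paths:

$ hadoop fs -lsr /user/hive/warehouse/foo    # 'foo' is a hypothetical table
/user/hive/warehouse/foo/dt=2009-03-19/part-00000
/user/hive/warehouse/foo/dt=2009-03-20/part-00000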

Page 10: Hive Quick Start Tutorial

Installing Hive

From a Release Tarball:

$ wget http://archive.apache.org/dist/hadoop/hive/hive-0.5.0/hive-0.5.0-bin.tar.gz

$ tar xvzf hive-0.5.0-bin.tar.gz
$ cd hive-0.5.0-bin
$ export HIVE_HOME=$PWD
$ export PATH=$HIVE_HOME/bin:$PATH


Page 11: Hive Quick Start Tutorial

Installing Hive

Building from Source:

$ svn co http://svn.apache.org/repos/asf/hadoop/hive/trunk hive
$ cd hive
$ ant package
$ cd build/dist
$ export HIVE_HOME=$PWD
$ export PATH=$HIVE_HOME/bin:$PATH


Page 12: Hive Quick Start Tutorial

Installing Hive

Other Options:

• Use a Git Mirror:

– git://github.com/apache/hive.git

• Cloudera Hive Packages
  – Red Hat and Debian
  – Packages include backported patches
  – See archive.cloudera.com


Page 13: Hive Quick Start Tutorial

Hive Dependencies

• Java 1.6
• Hadoop 0.17-0.20
• Hive *MUST* be able to find Hadoop (example below):
  – Set $HADOOP_HOME=<hadoop-install-dir>
  – Add $HADOOP_HOME/bin to $PATH
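For example (the install path is just a placeholder):

$ export HADOOP_HOME=/path/to/hadoop-0.20    # wherever Hadoop is installed
$ export PATH=$HADOOP_HOME/bin:$PATH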


Page 14: Hive Quick Start Tutorial

Hive Dependencies

• Hive needs r/w access to /tmp and /user/hive/warehouse on HDFS:

$ hadoop fs -mkdir /tmp
$ hadoop fs -mkdir /user/hive/warehouse
$ hadoop fs -chmod g+w /tmp
$ hadoop fs -chmod g+w /user/hive/warehouse


Page 15: Hive Quick Start Tutorial

Hive Configuration

• Default configuration in $HIVE_HOME/conf/hive-default.xml
  – DO NOT TOUCH THIS FILE!

• (Re)define properties in $HIVE_HOME/conf/hive-site.xml (sketch below)

• Use $HIVE_CONF_DIR to specify an alternate conf dir location
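A minimal hive-site.xml sketch, assuming you only want to override the warehouse location (any property from hive-default.xml can be redefined the same way):

<configuration>
  <!-- example override: the default warehouse directory property -->
  <property>
    <name>hive.metastore.warehouse.dir</name>
    <value>/user/hive/warehouse</value>
  </property>
</configuration>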


Page 16: Hive Quick Start Tutorial

Hive Configuration

• You can override Hadoop configuration properties in Hive’s configuration, e.g.:
  – mapred.reduce.tasks=1
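Two common ways to apply the override above. From inside the CLI (per-session):

hive> set mapred.reduce.tasks=1;

Or at startup from the shell, assuming the -hiveconf option of the hive command:

$ hive -hiveconf mapred.reduce.tasks=1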


Page 17: Hive Quick Start Tutorial

Logging

• Hive uses log4j
• Log4j configuration located in $HIVE_HOME/conf/hive-log4j.properties

• Logs are stored in /tmp/${user.name}/hive.log
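A common debugging override, assuming the hive.root.logger property defined in the default hive-log4j.properties (check your copy for the exact name):

$ hive -hiveconf hive.root.logger=DEBUG,console    # send logs to the terminal instead of /tmp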


Page 18: Hive Quick Start Tutorial


Starting the Hive CLI

• Start a terminal and run:
  $ hive

• Should see a prompt like:
  hive>

Page 19: Hive Quick Start Tutorial

Hive CLI Commands

• Set a Hive or Hadoop conf prop:
  – hive> set propkey=value;

• List all properties and values:
  – hive> set -v;

• Add a resource to the DCache:
  – hive> add [ARCHIVE|FILE|JAR] filename;


Page 20: Hive Quick Start Tutorial

Hive CLI Commands

• List tables:
  – hive> show tables;

• Describe a table:
  – hive> describe <tablename>;

• More information:
  – hive> describe extended <tablename>;


Page 21: Hive Quick Start Tutorial

Hive CLI Commands

• List Functions:
  – hive> show functions;

• More information:
  – hive> describe function <functionname>;


Page 22: Hive Quick Start Tutorial


Selecting data

hive> SELECT * FROM <tablename> LIMIT 10;

hive> SELECT * FROM <tablename> WHERE freq > 100 SORT BY freq ASC LIMIT 10;

Page 23: Hive Quick Start Tutorial


Manipulating Tables

• DDL operations
  – SHOW TABLES
  – CREATE TABLE
  – ALTER TABLE
  – DROP TABLE

Page 24: Hive Quick Start Tutorial


Creating Tables in Hive

• Most straightforward:

CREATE TABLE foo(id INT, msg STRING);

• Assumes default table layout
  – Text files; fields terminated with ^A, lines terminated with \n (loading example below)
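To get rows into that table you might load a ^A-delimited file from the local filesystem; the path below is a placeholder:

LOAD DATA LOCAL INPATH '/tmp/foo.txt' INTO TABLE foo;   -- copies the file into foo's warehouse directory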

Page 25: Hive Quick Start Tutorial


Changing Row Format

• Arbitrary field and record separators are possible, e.g., CSV format:

CREATE TABLE foo (id INT, msg STRING)
ROW FORMAT DELIMITED
  FIELDS TERMINATED BY ','
  LINES TERMINATED BY '\n';

Page 26: Hive Quick Start Tutorial


Partitioning Data

•  One or more partition columns may be specified:

CREATE TABLE foo (id INT, msg STRING) PARTITIONED BY (dt STRING);

•  Creates a subdirectory for each value of the partition column, e.g.:

/user/hive/warehouse/foo/dt=2009-03-20/

•  Queries with partition columns in the WHERE clause will scan only a subset of the data (example below)
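A sketch of both halves, reusing the foo table above (the input path is a placeholder):

-- load one day's data into its own partition directory
LOAD DATA INPATH '/data/foo-2009-03-20.txt'
INTO TABLE foo PARTITION (dt='2009-03-20');

-- the dt predicate lets Hive read only that partition's subdirectory
SELECT * FROM foo WHERE dt = '2009-03-20' LIMIT 10;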

Page 27: Hive Quick Start Tutorial

Sqoop = SQL-to-Hadoop


Page 28: Hive Quick Start Tutorial

Sqoop: Features

• JDBC-based interface (MySQL, Oracle, PostgreSQL, etc.)

• Automatic datatype generation
  – Reads column info from the table and generates Java classes
  – Classes can be used in further MapReduce processing passes

• Uses MapReduce to read tables from the database
  – Can select an individual table (or a subset of columns)
  – Can read all tables in the database

•  Supports most JDBC standard types and null values


Page 29: Hive Quick Start Tutorial

Example input

mysql> use corp;
Database changed

mysql> describe employees;
+------------+-------------+------+-----+---------+----------------+
| Field      | Type        | Null | Key | Default | Extra          |
+------------+-------------+------+-----+---------+----------------+
| id         | int(11)     | NO   | PRI | NULL    | auto_increment |
| firstname  | varchar(32) | YES  |     | NULL    |                |
| lastname   | varchar(32) | YES  |     | NULL    |                |
| jobtitle   | varchar(64) | YES  |     | NULL    |                |
| start_date | date        | YES  |     | NULL    |                |
| dept_id    | int(11)     | YES  |     | NULL    |                |
+------------+-------------+------+-----+---------+----------------+


Page 30: Hive Quick Start Tutorial

Loading into HDFS

$ sqoop --connect jdbc:mysql://db.foo.com/corp \
    --table employees

•  Imports “employees” table into HDFS directory


Page 31: Hive Quick Start Tutorial

Hive Integration

$ sqoop --connect jdbc:mysql://db.foo.com/corp --hive-import --table employees

•  Auto-generates CREATE TABLE / LOAD DATA INPATH statements for Hive

•  After data is imported to HDFS, auto-executes Hive script

•  Follow-up step: Loading into partitions (see the sketch below)
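One way that follow-up step might look, assuming a hypothetical date-partitioned target table (the imported employees table itself is unpartitioned):

-- 'employees_by_start' is a hypothetical table with the same columns, partitioned by dt
INSERT OVERWRITE TABLE employees_by_start PARTITION (dt='2009-03-20')
SELECT id, firstname, lastname, jobtitle, dept_id
FROM employees
WHERE start_date = '2009-03-20';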


Page 32: Hive Quick Start Tutorial


Hive Project Status

• Open source, Apache 2.0 license
• Official subproject of Apache Hadoop

• Current version is 0.5.0
• Supports Hadoop 0.17-0.20

Page 33: Hive Quick Start Tutorial


Conclusions

• Supports rapid iteration of ad-hoc queries

• High-level Interface (HiveQL) to low-level infrastructure (Hadoop).

• Scales to handle much more data than many similar systems

Page 34: Hive Quick Start Tutorial

Hive Resources

Documentation
• wiki.apache.org/hadoop/Hive

Mailing Lists
• [email protected]

IRC
• ##hive on Freenode


Page 35: Hive Quick Start Tutorial

Carl Steinbach

[email protected]

Page 36: Hive Quick Start Tutorial