hive and hiveql - module6
Hive and HiveQL
What is Hive?
• Apache Hive is a data warehouse system for Hadoop.
• Hive is not a relational database; it only maintains metadata about your Big Data stored on HDFS.
• Hive lets you treat your Big Data as tables and perform SQL-like operations on the data using a query language called HiveQL.
• Hive is not a database, but it uses a database (called the metastore) to store the definitions of the tables that you create. By default, the metastore is an embedded Derby database.
• A Hive table consists of a schema stored in the metastore and data stored on HDFS.
• Hive converts HiveQL commands into MapReduce jobs.
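The schema-plus-data idea can be sketched in HiveQL as follows. The table name records and its columns are hypothetical, chosen only for illustration; the data is assumed to be tab-delimited text.

```sql
-- Hypothetical table: the schema is recorded in the metastore,
-- while the rows themselves live as plain files on HDFS.
CREATE TABLE records (year STRING, temperature INT, quality INT)
ROW FORMAT DELIMITED
  FIELDS TERMINATED BY '\t';

-- A SQL-like query; Hive plans it and runs it as a job on the cluster.
SELECT year, MAX(temperature)
FROM records
WHERE temperature != 9999
GROUP BY year;
```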
Hive Architecture Contd..
Step 1: Issuing Commands
Using the Hive CLI, a web interface, or a Hive JDBC/ODBC client, a Hive query is submitted to the HiveServer.
Step 2: Hive Query Plan
The Hive query is compiled, optimized, and planned as a MapReduce job.
Step 3: MapReduce Job Executes
The corresponding MapReduce job is executed on the Hadoop cluster.
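One way to look at the plan produced in Step 2 without running the job is the EXPLAIN statement. The sketch below assumes a table named records2 with a year column, as used in later examples; the exact plan output depends on the Hive version and execution engine.

```sql
-- EXPLAIN prints the stages Hive would execute instead of running the query.
EXPLAIN
SELECT year, COUNT(1)
FROM records2
GROUP BY year;
```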
Comparison with Traditional Database
Hive data types
Arithmetic Operators
Mathematical functions
Aggregate functions
Other built-in functions
Managed Tables
• When a table is created in Hive, by default Hive will manage the data, which means that Hive moves the data into its warehouse directory.
• When data is loaded into a managed table, it is moved into Hive's warehouse directory.
CREATE TABLE managed_table (dummy STRING);
LOAD DATA INPATH '/user/tom/data.txt' INTO TABLE managed_table;
• It will move the file hdfs://user/tom/data.txt into Hive's warehouse directory for the managed_table table, which is hdfs://user/hive/warehouse/managed_table
• If the table is later dropped, then the table, including its metadata and its data, is deleted.
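To check whether a table is managed and where its data lives, DESCRIBE FORMATTED can be used. This is a small sketch that assumes the managed_table from the example above exists; the output layout varies between Hive versions.

```sql
-- The output includes a "Table Type" field (MANAGED_TABLE or EXTERNAL_TABLE)
-- and a "Location" field showing the table's directory on HDFS.
DESCRIBE FORMATTED managed_table;
```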
External Tables
• When an external table is created, it tells Hive to refer to data at an existing location outside the warehouse directory; that data is not managed by Hive.
• The location of the external data (a directory) is specified at table creation time:
CREATE EXTERNAL TABLE external_table (dummy STRING)
ROW FORMAT DELIMITED
  FIELDS TERMINATED BY '\t'
  LINES TERMINATED BY '\n'
LOCATION '/user/tom/external_table_location';
• You control the creation and deletion of the data; Hive does not do it for you.
• Hive tables can be organized into buckets, which imposes extra structure on the table and how the underlying files are stored. Bucketing has two key benefits:
• More efficient queries: especially when performing joins on the same bucketed columns.
• More efficient sampling: because the data is already split up into smaller pieces.
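A minimal sketch of bucketing and bucket-based sampling follows. The tables users and bucketed_users and their columns are hypothetical. On older Hive releases the hive.enforce.bucketing setting is needed when populating the table; newer releases always enforce bucketing and no longer need it.

```sql
-- Hypothetical table split into 4 buckets by hashing the id column.
CREATE TABLE bucketed_users (id INT, name STRING)
CLUSTERED BY (id) INTO 4 BUCKETS;

-- Needed on older Hive versions only (assumption: pre-2.0 behaviour).
SET hive.enforce.bucketing = true;

-- Populate from an existing, unbucketed table (hypothetical: users).
INSERT OVERWRITE TABLE bucketed_users
SELECT id, name FROM users;

-- Sample roughly one quarter of the rows by reading a single bucket.
SELECT * FROM bucketed_users
TABLESAMPLE (BUCKET 1 OUT OF 4 ON id);
```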
Storage Formats
• There are two dimensions that govern table storage in Hive:
• Row format: The row format dictates how rows, and the fields in a particular row, are stored. The row format is defined by a SerDe.
• File format: The file format dictates the container format for fields in a row.
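The two dimensions map onto two separate clauses of CREATE TABLE. The sketch below uses hypothetical table names (csv_events, seq_events) to show each clause being set independently.

```sql
-- Row format: comma-delimited fields; file format: plain text files.
CREATE TABLE csv_events (id INT, name STRING)
ROW FORMAT DELIMITED
  FIELDS TERMINATED BY ','
STORED AS TEXTFILE;

-- Same columns, default row format, but stored in Hadoop sequence files.
CREATE TABLE seq_events (id INT, name STRING)
STORED AS SEQUENCEFILE;
```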
The default storage format: Delimited text
• When a table is created with no ROW FORMAT or STORED AS clauses, the default format is delimited text, with a row per line.
• The default field delimiter within a row is not a tab character, but the Control-A character.
• The default collection item delimiter is a Control-B character, used to delimit items in an ARRAY or STRUCT, or key-value pairs in a MAP.
• The default map key delimiter is a Control-C character, used to delimit the key and value in a MAP.
• Rows in a table are delimited by a newline character.
Storage Formats Contd..
CREATE TABLE XYZ;
is identical to the more explicit:
CREATE TABLE ...
ROW FORMAT DELIMITED
  FIELDS TERMINATED BY '\001'
  COLLECTION ITEMS TERMINATED BY '\002'
  MAP KEYS TERMINATED BY '\003'
  LINES TERMINATED BY '\n'
STORED AS TEXTFILE;
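The collection and map-key delimiters only matter for complex column types. As an illustration, the hypothetical table below (named complex, with made-up column names) has one column of each kind; with the default format, its ARRAY items, STRUCT fields, and MAP entries would be separated by Control-B, and each MAP key and value by Control-C.

```sql
-- Hypothetical table with one column of each complex type.
CREATE TABLE complex (
  c1 ARRAY<INT>,
  c2 MAP<STRING, INT>,
  c3 STRUCT<a:STRING, b:INT, c:DOUBLE>
);

-- Access an array element, a map value by key, and a struct field.
SELECT c1[0], c2['b'], c3.c
FROM complex;
```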
Importing Data
INSERT OVERWRITE TABLE
INSERT OVERWRITE TABLE target
SELECT col1, col2
FROM source;
• For partitioned tables, you can specify the partition to insert into by supplying a PARTITION clause:
INSERT OVERWRITE TABLE target
PARTITION (dt='2010-01-01')
SELECT col1, col2
FROM source;
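The partition value can also be taken from the data itself, which Hive calls a dynamic-partition insert. The sketch below assumes that source has a dt column and that target is partitioned by dt; neither is stated in the slides. The two SET commands are typically required because dynamic partitioning is restricted by default in many Hive versions.

```sql
-- Allow the partition value to be determined at run time.
SET hive.exec.dynamic.partition = true;
SET hive.exec.dynamic.partition.mode = nonstrict;

-- The last selected column (dt) supplies the partition for each row.
INSERT OVERWRITE TABLE target
PARTITION (dt)
SELECT col1, col2, dt
FROM source;
```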
Importing Data Contd..
Multitable insert
FROM records2
INSERT OVERWRITE TABLE stations_by_year
  SELECT year, COUNT(DISTINCT station)
  GROUP BY year
INSERT OVERWRITE TABLE records_by_year
  SELECT year, COUNT(1)
  GROUP BY year
INSERT OVERWRITE TABLE good_records_by_year
  SELECT year, COUNT(1)
  WHERE temperature != 9999
    AND (quality = 0 OR quality = 1 OR quality = 4 OR quality = 5 OR quality = 9)
  GROUP BY year;
Importing Data Contd..
CREATE TABLE...AS SELECT
CREATE TABLE target
AS
SELECT col1, col2
FROM source;
• A CTAS operation is atomic, so if the SELECT query fails for some reason, then the table is not created.
Altering Tables
• ALTER TABLE source RENAME TO target;
• ALTER TABLE target ADD COLUMNS (col3 STRING);
Dropping Tables
DROP TABLE table_name;
• The DROP TABLE statement deletes the data and metadata for a managed table. In the case of external tables, only the metadata is deleted; the data is left untouched.
Querying data (Sorting and Aggregating)
• Sorting data in Hive can be achieved with a standard ORDER BY clause. ORDER BY produces a result that is totally sorted, so it sets the number of reducers to one.
• SORT BY produces a sorted file per reducer.
• The DISTRIBUTE BY clause controls which reducer a particular row goes to.
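DISTRIBUTE BY and SORT BY are often combined. As a sketch, assuming the records2 table with year and temperature columns used elsewhere in these slides, the query below sends all rows for a given year to the same reducer and sorts within each reducer's output.

```sql
-- All rows for one year go to the same reducer (DISTRIBUTE BY);
-- each reducer's output file is then sorted (SORT BY).
SELECT year, temperature
FROM records2
DISTRIBUTE BY year
SORT BY year ASC, temperature DESC;
```

The result is sorted per reducer only; a single globally sorted result still needs ORDER BY.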
Querying data (Joins)
• Inner Join
• Left Outer Join
• Right Outer Join
• Full Outer Join
• Left Semi Join
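The join types listed above can be sketched with two hypothetical tables, sales (name, id) and things (id, name), which are not part of the original slides.

```sql
-- Inner join: only rows with a match in both tables.
SELECT sales.*, things.*
FROM sales JOIN things ON (sales.id = things.id);

-- Left outer join: every row of sales, with NULLs where things has no match.
SELECT sales.*, things.*
FROM sales LEFT OUTER JOIN things ON (sales.id = things.id);

-- Right outer join: every row of things, with NULLs where sales has no match.
SELECT sales.*, things.*
FROM sales RIGHT OUTER JOIN things ON (sales.id = things.id);

-- Full outer join: every row of both tables.
SELECT sales.*, things.*
FROM sales FULL OUTER JOIN things ON (sales.id = things.id);

-- Left semi join: rows of things that have at least one match in sales.
-- Only columns from the left table may appear in the SELECT list.
SELECT *
FROM things LEFT SEMI JOIN sales ON (sales.id = things.id);
```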
Subqueries
• A subquery is a SELECT statement that is embedded in another SQL statement.
• Hive has limited support for subqueries, only permitting a subquery in the FROM clause of a SELECT statement.
SELECT station, year, AVG(max_temperature)
FROM (
  SELECT station, year, MAX(temperature) AS max_temperature
  FROM records2
  WHERE temperature != 9999
    AND (quality = 0 OR quality = 1 OR quality = 4 OR quality = 5 OR quality = 9)
  GROUP BY station, year
) mt
GROUP BY station, year;
Views
• A view is a sort of "virtual table" that is defined by a SELECT statement.
CREATE VIEW max_temperatures (station, year, max_temperature) AS
SELECT station, year, MAX(temperature)
FROM valid_records
GROUP BY station, year;
• With the views in place, we can now use them by running a query:
SELECT station, year, AVG(max_temperature)
FROM max_temperatures
GROUP BY station, year;
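The max_temperatures view reads from valid_records, which is not defined in these slides. A plausible definition, assuming it applies the same quality filter to records2 as the subquery example, might look like this:

```sql
-- Assumed definition: keep only readings that pass the quality filter.
CREATE VIEW valid_records AS
SELECT *
FROM records2
WHERE temperature != 9999
  AND (quality = 0 OR quality = 1 OR quality = 4 OR quality = 5 OR quality = 9);
```

With both views in place, the query against max_temperatures gives the same result as the earlier subquery form.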
Hive Join Strategies