impala 2.0 update #impalajp
DESCRIPTION
http://connpass.com/event/9031/
TRANSCRIPT
1
Impala 2.0 Update
Sho Shimauchi, Cloudera
2014/10/31
2
Today’s Topic
• What is Cloudera Impala?
• Impala 1.4 / 2.0 update
  • Performance Improvement
  • Query Language
  • Resource Management and Security
  • Others
3
Who am I?
• Pre-sales Solutions Architect
• joined Cloudera in 2011, the first Japanese employee at Cloudera
• email: [email protected]
• twitter: @shiumachi
4
Cloudera Impala
5
What is Impala?
• MPP SQL query engine for Hadoop environment
  • written in native code for maximum hardware efficiency
• open-source!
  • http://impala.io/
• Supported by Cloudera, Amazon, and MapR
• History
  • 2012/10 Public Beta released
  • 2013/04 Impala 1.0 released
  • current version: Impala 2.0
6
Impala is easy to use
• create tables as virtual views over data stored in HDFS / HBase
• schema metadata is stored in Metastore
  • shared with Hive, Pig, etc.
• connect via ODBC / JDBC
• authenticate via Kerberos / LDAP
• run standard SQL
  • ANSI SQL-92 based
  • limited to SELECT and bulk INSERT
  • correlated subqueries are available as of 2.0
  • UDF / UDAF
7
Impala 1.4 (2014/07)
• DECIMAL(<precision>, <scale>)
• HDFS caching DDL
• column definition based on Parquet file (CREATE TABLE … LIKE PARQUET)
• ORDER BY without LIMIT
• LDAP connections through TLS
• SHOW PARTITIONS
• YARN-integrated resource manager is production ready
• Llama HA support
• CREATE TABLE … STORED AS AVRO
• SUMMARY command in impala-shell (provides high-level summary of query plan)
• faster COMPUTE STATS
• Performance improvements for partition pruning
• impala-shell supports UTF-8 characters
• additional built-ins from EDW systems
8
Impala 2.0 (2014/10)
• hash table can spill to disk
  • join and aggregate tables of arbitrary size
• Subquery enhancements
  • allowed in WHERE clauses
  • EXISTS / NOT EXISTS
  • IN / NOT IN can operate on the result set from a subquery
  • correlated / uncorrelated subqueries
  • scalar subqueries
• SQL 2003 compliant analytic window functions
  • LEAD(), LAG(), RANK(), FIRST_VALUE(), etc.
• New data types: VARCHAR, CHAR
• Security Enhancements
  • multiple authentication methods
  • GRANT / REVOKE / CREATE ROLE / DROP ROLE / SHOW ROLES / etc.
• text + gzip / bzip2 / Snappy
• Hint inside views
• QUERY_TIMEOUT_S
• DATE_PART() / EXTRACT()
• Parquet default block size is changed to 256MB (was: 1GB)
• LEFT ANTI JOIN / RIGHT ANTI JOIN
• impala-shell can read settings from $HOME/.impalarc
9
Performance Improvement
10
HDFS caching
• When HDFS files are cached in memory, Impala can read the cached data without any disk reads, and without making an additional copy of the data in memory
  • avoids checksumming and data copies
• new HDFS API is available in CDH 5.0
• configure cache with Impala DDL
  • CREATE TABLE tbl_name CACHED IN '<pool>'
  • ALTER TABLE tbl_name ADD PARTITION … CACHED IN '<pool>'
11
Partition Pruning improvement
• Previously, Impala typically queried tables with up to approximately 3000 partitions. With the performance improvement in partition pruning, Impala can now comfortably handle tables with tens of thousands of partitions.
12
Spilling to Disk SQL Operation
• writes temporary data to disk when Impala is close to exceeding its memory limit
• In PROFILE, the BlockMgr.BytesWritten counter reports how much data was written to disk during the query
13
Query Language
14
Subquery
Scalar subquery: produces a result set with a single row containing a single column
SELECT x FROM t1 WHERE x > (SELECT MAX(y) FROM t2);
Uncorrelated subquery: does not refer to any tables from the outer block of the query
SELECT x FROM t1 WHERE x IN (SELECT y FROM t2);
Correlated subquery: compares one or more values from the outer query block to values referenced in the WHERE clause of the subquery
SELECT employee_name, employee_id FROM employees one WHERE
  salary > (SELECT avg(salary) FROM employees two WHERE one.dept_id = two.dept_id);
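The Impala 2.0 feature list earlier in the deck also names EXISTS / NOT EXISTS, which the examples above do not show. A minimal sketch, assuming the same t1(x) and t2(y) tables as in the scalar and uncorrelated examples:

```sql
-- Rows of t1 that have at least one matching row in t2 (correlated EXISTS).
SELECT x FROM t1 WHERE EXISTS (SELECT 1 FROM t2 WHERE t2.y = t1.x);

-- Rows of t1 that have no matching row in t2 (correlated NOT EXISTS).
SELECT x FROM t1 WHERE NOT EXISTS (SELECT 1 FROM t2 WHERE t2.y = t1.x);
```

The NOT EXISTS form asks the same question as a LEFT ANTI JOIN on t1.x = t2.y.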
15
Analytic Functions (a.k.a. Window Functions)
• supported in 2.0 and later
• supported functions
  • RANK() / DENSE_RANK()
  • FIRST_VALUE() / LAST_VALUE()
  • LAG() / LEAD()
  • ROW_NUMBER()
• Aggregate functions are already implemented
  • MAX(), MIN(), AVG(), SUM(), etc.
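As a rough sketch of how the ranking functions in this list are written, the query below assumes a stock_ticker table with stock_symbol, closing_date, and closing_price columns. The aliases price_rank and day_no are made up for illustration.

```sql
-- For each stock, rank trading days from highest to lowest closing price
-- and number the days in date order.
-- RANK() leaves gaps after ties; DENSE_RANK() would not.
SELECT stock_symbol, closing_date, closing_price,
       RANK() OVER (PARTITION BY stock_symbol ORDER BY closing_price DESC) AS price_rank,
       ROW_NUMBER() OVER (PARTITION BY stock_symbol ORDER BY closing_date) AS day_no
FROM stock_ticker;
```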
16
Analytic Functions Example
For each day, the query prints the closing price alongside the previous day's closing price:
select stock_symbol, closing_date, closing_price,
  lag(closing_price,1) over (partition by stock_symbol order by closing_date) as "yesterday closing"
  from stock_ticker
  order by closing_date;
+--------------+---------------------+---------------+-------------------+
| stock_symbol | closing_date        | closing_price | yesterday closing |
+--------------+---------------------+---------------+-------------------+
| JDR          | 2014-09-13 00:00:00 | 12.86         | NULL              |
| JDR          | 2014-09-14 00:00:00 | 12.89         | 12.86             |
| JDR          | 2014-09-15 00:00:00 | 12.94         | 12.89             |
| JDR          | 2014-09-16 00:00:00 | 12.55         | 12.94             |
| JDR          | 2014-09-17 00:00:00 | 14.03         | 12.55             |
| JDR          | 2014-09-18 00:00:00 | 14.75         | 14.03             |
| JDR          | 2014-09-19 00:00:00 | 13.98         | 14.75             |
+--------------+---------------------+---------------+-------------------+
17
Approximation features
• APPX_COUNT_DISTINCT query option
  • rewrites COUNT(DISTINCT) calls to use NDV()
  • speeds up the operation
  • allows multiple COUNT(DISTINCT) in a single query
• APPX_MEDIAN()
  • returns a value that is approximately the median (midpoint) of values in the set of input values
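A minimal sketch of the APPX_COUNT_DISTINCT option in use. The table web_logs and its columns user_id and session_id are hypothetical names chosen for illustration.

```sql
-- Hypothetical table: web_logs(user_id, session_id).
SET APPX_COUNT_DISTINCT=true;

-- With the option on, each COUNT(DISTINCT ...) is rewritten to NDV(...),
-- so two of them can appear in one query. The results are estimates.
SELECT COUNT(DISTINCT user_id), COUNT(DISTINCT session_id) FROM web_logs;

-- Calling NDV() directly gives the same kind of estimate without the option.
SELECT NDV(user_id), NDV(session_id) FROM web_logs;
```

Because the counts are approximate, this suits dashboards and exploration better than exact reporting.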
18
Approx. functions example
[localhost:21000] > select min(x), max(x), avg(x) from million_numbers;
+-------------------+-------------------+-------------------+
| min(x)            | max(x)            | avg(x)            |
+-------------------+-------------------+-------------------+
| 4.725693727250069 | 49994.56852674231 | 24945.38563793553 |
+-------------------+-------------------+-------------------+
[localhost:21000] > select appx_median(x) from million_numbers;
+----------------+
| appx_median(x) |
+----------------+
| 24721.6        |
+----------------+
19
CREATE TABLE … LIKE PARQUET
• CREATE TABLE ... LIKE PARQUET 'hdfs_path_of_parquet_file'
• The column names and data types are automatically configured based on the Parquet data file
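A short sketch of the statement in context. The table name and the HDFS path are placeholders, not values from the talk.

```sql
-- Create an empty table whose columns are inferred from an existing Parquet file.
-- Both the table name and the file path below are hypothetical.
CREATE TABLE sales_from_file
  LIKE PARQUET '/user/etl/sales/part-00000.parq'
  STORED AS PARQUET;

-- Inspect the column names and types that were picked up from the file.
DESCRIBE sales_from_file;
```

The statement only defines the table; the data file still has to be loaded or referenced separately.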
20
ORDER BY without LIMIT
• LIMIT clause is now optional for queries that use the ORDER BY clause
• Impala automatically uses a temporary disk work area to perform the sort if the sort operation would otherwise exceed the Impala memory limit for a particular data node.
21
DECODE()
SELECT event, DECODE(day_of_week, 1, "Monday", 2, "Tuesday", 3, "Wednesday",
  4, "Thursday", 5, "Friday", 6, "Saturday", 7, "Sunday", "Unknown day")
  FROM calendar;
22
ANTI JOIN
LEFT ANTI JOIN / RIGHT ANTI JOIN are supported in Impala 2.0
[localhost:21000] > create table t1 (x int);
[localhost:21000] > insert into t1 values (1), (2), (3), (4), (5), (6);

[localhost:21000] > create table t2 (y int);
[localhost:21000] > insert into t2 values (2), (4), (6);

[localhost:21000] > select x from t1 left anti join t2 on (t1.x = t2.y);
+---+
| x |
+---+
| 1 |
| 3 |
| 5 |
+---+
23
new data types
• DECIMAL (Impala 1.4)
  • column_name DECIMAL[(precision[,scale])]
  • DECIMAL with no precision or scale values is equivalent to DECIMAL(9,0)
• VARCHAR (Impala 2.0)
  • STRING with a max length
• CHAR (Impala 2.0)
  • STRING with a fixed length
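A minimal sketch of the three types in one table definition. The table and column names are hypothetical, and an Impala 2.0 or later cluster is assumed.

```sql
-- Hypothetical table using the types introduced in Impala 1.4 and 2.0.
CREATE TABLE products (
  price   DECIMAL(10,2),  -- up to 10 digits in total, 2 of them after the decimal point
  country CHAR(2),        -- fixed-length string
  name    VARCHAR(64)     -- string of at most 64 characters
);
```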
24
new built-in functions
• EXTRACT(): returns one date or time field from a TIMESTAMP value
• TRUNC(): truncates date/time values to year, month, etc.
• ADD_MONTHS(): alias for MONTHS_ADD()
• ROUND(): rounds DECIMAL values
• for computing properties for statistical distributions
  • STDDEV()
  • STDDEV_SAMP() / STDDEV_POP()
  • VARIANCE()
  • VARIANCE_SAMP() / VARIANCE_POP()
• MAX_INT() / MIN_SMALLINT()
• IS_INF() / IS_NAN()
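A few of these functions in use, as an illustrative sketch. The last query assumes a million_numbers table with a numeric column x; output is omitted because it depends on the current time and the data.

```sql
-- NOW() returns the current TIMESTAMP.
SELECT EXTRACT(YEAR FROM NOW());   -- one field of a TIMESTAMP
SELECT DATE_PART('year', NOW());   -- same field via DATE_PART()
SELECT TRUNC(NOW(), 'MM');         -- truncate to the start of the month
SELECT ADD_MONTHS(NOW(), 1);       -- same as MONTHS_ADD(NOW(), 1)

-- Statistical aggregates over a numeric column.
SELECT STDDEV(x), VARIANCE(x) FROM million_numbers;
```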
25
SHOW PARTITIONS
[localhost:21000] > show partitions census;
+-------+-------+--------+------+---------+
| year  | #Rows | #Files | Size | Format  |
+-------+-------+--------+------+---------+
| 2000  | -1    | 0      | 0B   | TEXT    |
| 2004  | -1    | 0      | 0B   | TEXT    |
| 2008  | -1    | 0      | 0B   | TEXT    |
| 2010  | -1    | 0      | 0B   | TEXT    |
| 2011  | 4     | 1      | 22B  | TEXT    |
| 2012  | 4     | 1      | 22B  | TEXT    |
| 2013  | 1     | 1      | 231B | PARQUET |
| Total | 9     | 3      | 275B |         |
+-------+-------+--------+------+---------+
26
SUMMARY
• impala-shell command
• easy-to-digest overview of the timings for the different phases of execution for a query
[localhost:21000] > select avg(ss_sales_price) from store_sales where ss_coupon_amt = 0;
+---------------------+
| avg(ss_sales_price) |
+---------------------+
| 37.80770926328327   |
+---------------------+
[localhost:21000] > summary;
+--------------+--------+----------+----------+-------+------------+----------+---------------+-----------------+
| Operator     | #Hosts | Avg Time | Max Time | #Rows | Est. #Rows | Peak Mem | Est. Peak Mem | Detail          |
+--------------+--------+----------+----------+-------+------------+----------+---------------+-----------------+
| 03:AGGREGATE | 1      | 1.03ms   | 1.03ms   | 1     | 1          | 48.00 KB | -1 B          | MERGE FINALIZE  |
| 02:EXCHANGE  | 1      | 0ns      | 0ns      | 1     | 1          | 0 B      | -1 B          | UNPARTITIONED   |
| 01:AGGREGATE | 1      | 30.79ms  | 30.79ms  | 1     | 1          | 80.00 KB | 10.00 MB      |                 |
| 00:SCAN HDFS | 1      | 5.45s    | 5.45s    | 2.21M | -1         | 64.05 MB | 432.00 MB     | tpc.store_sales |
+--------------+--------+----------+----------+-------+------------+----------+---------------+-----------------+
27
SET statement
• Before Impala 2.0, SET could be used only in impala-shell
• In Impala 2.0, you can also use SET in client applications through the JDBC / ODBC APIs.
28
Resource Management and Security
29
Admission Control (Impala 1.3)
• Fast and lightweight resource management mechanism
  • avoids oversubscription of resources for concurrent workloads
  • queries are queued when reaching configurable limits
• Runs on every impalad
  • no SPOF
30
YARN and Llama
• Llama: Low Latency Application MAster
  • Subdivides coarse-grain YARN scheduling into finer granularity for low-latency and short-lived queries
• Llama registers one long-lived AM per YARN pool
• Llama caches resources allocated by YARN for a short time, so that they can be quickly re-allocated to Impala queries
  • much faster than waiting for YARN
• Impala 1.4: GA. Llama HA support
31
Query Timeout
• A new query option, QUERY_TIMEOUT_S, lets you specify a timeout period in seconds for individual queries
• Note: The timeout clock for queries and sessions only starts ticking when the query or session is idle
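A minimal sketch of setting the option for one session. The value 60 is arbitrary, and store_sales is only a stand-in table name.

```sql
-- Cancel queries in this session once they have been idle for 60 seconds.
-- 60 is an illustrative value, not a recommendation.
SET QUERY_TIMEOUT_S=60;

-- Later queries in the session are subject to the idle timeout.
SELECT COUNT(*) FROM store_sales;
```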
32
Security
• Impala 2.0 can accept either kind of auth. request
  • ex) host A with Kerberos, and host B with LDAP
• Security related statements
  • GRANT
  • REVOKE
  • CREATE ROLE
  • DROP ROLE
  • SHOW ROLES
  • SHOW ROLE GRANT
• --disk_spill_encryption option
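The role statements listed above could be combined roughly as follows. This is a sketch under assumptions: the role analyst, the group analysts, and the table sales.orders are hypothetical, and authorization is assumed to be enabled on the cluster.

```sql
-- Create a role and attach it to a (hypothetical) OS/LDAP group.
CREATE ROLE analyst;
GRANT ROLE analyst TO GROUP analysts;

-- Give the role read access to one table.
GRANT SELECT ON TABLE sales.orders TO ROLE analyst;

-- Inspect what exists.
SHOW ROLES;
SHOW ROLE GRANT GROUP analysts;

-- Undo the privilege and remove the role.
REVOKE SELECT ON TABLE sales.orders FROM ROLE analyst;
DROP ROLE analyst;
```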
33
Others
34
Text + gzip, bzip2, and Snappy
• In Impala 2.0 and later, Impala supports using text data files that employ gzip, bzip2, or Snappy compression
• use ROW FORMAT with delimiter and escape character to create table
CREATE TABLE csv_compressed (a STRING, b STRING, c STRING)
  ROW FORMAT DELIMITED FIELDS TERMINATED BY ",";
35
impala-shell
• UTF-8 support (1.4)
• .impalarc file (2.0)
[impala]
verbose=true
default_db=tpc_benchmarking
write_delimited=true
output_delimiter=,
output_file=/home/tester1/benchmark_results.csv
show_profiles=true
36
Documentation
• Cluster Sizing Guidelines for Impala
  • http://www.cloudera.com/content/cloudera/en/documentation/core/latest/topics/impala_cluster_sizing.html
37