Apache Impala (incubating) 2.5 performance update

© Cloudera, Inc. All rights reserved.

Apache Impala 2.5 (Incubating)
Performance improvements overview


Agenda

• What is Impala?
• Impala at Apache
• What is new in Impala 2.5 (CDH 5.7)
• Impala performance update
• Roadmap
• Q&A


SQL-on-Hadoop engines

SQLImpala

SQL-on-Apache Hadoop – Choosing the right tool for the right job

https://vision.cloudera.com/sql-on-apache-hadoop-choosing-the-right-tool-for-the-right-job/


• General-purpose SQL engine
• Real-time queries in Apache Hadoop
• General availability (v1.0) release out since April 2013
• Analytic SQL functionality (v2.0) since October 2014
• Apache incubator project since December 2015
• Previous release 2.3 (CDH 5.5) released November 2015
• Current release 2.5 (CDH 5.7) April 2016

What is Impala?

Today’s topic



• Query speed over Hadoop that meets or exceeds that of a proprietary analytic DBMS
• General-purpose SQL query engine:
  • Targeted for analytical workloads
  • Supports queries that take from milliseconds to hours
• Runs directly within Hadoop:
  • reads widely used Hadoop file formats
  • talks to widely used Hadoop storage managers
  • runs on same nodes that run Hadoop processes
• Highly available
• High performance:
  • C++ instead of Java
  • Run time code generation

Impala overview


Impala Use Cases

• Interactive BI/analytics on more data
• Asking new questions: exploration, ML (Ibis)
• Data processing with tight SLAs
• Query-able archive w/full fidelity

http://blog.cloudera.com/blog/2015/07/ibis-on-impala-python-at-scale-for-data-science/


• Incubator project since December 2015

• Development process slowly moving to ASF infrastructure (see IMPALA-3221)

• Help wanted!

Where to find the Impala community:

[email protected]

[email protected]

http://impala.io

@apacheimpala

Impala at Apache



New in Impala 2.5

Usability Enhancements
• Admission Control Improvements
• Null-safe join/equals

Performance and Scalability
• Runtime filters
• Improved Cardinality Estimation and Join Ordering
• Query start-up improvements
• Additional codegen and code optimizations
• Decimal arithmetic improvements
• Fast min/max values on partition columns (with query option)

Integrations
• Support for EMC DSSD

http://www.cloudera.com/documentation/enterprise/latest/topics/impala_runtime_filtering.html


New in Impala 2.5

Performance and Scalability
• Runtime filters
• Improved Cardinality Estimation and Join Ordering
• Query start-up improvements
• Additional codegen and code optimizations
• Decimal arithmetic improvements
• Incremental metadata updates (DDL)
• Fast min/max values on partition columns (with query option)

Covered today

http://www.cloudera.com/documentation/enterprise/latest/topics/impala_runtime_filtering.html


Impala 2.5 (CDH 5.7) improvements vs Impala 2.3 (CDH 5.5)

• 2.2x speedup for TPC-H
• 1.7x speedup for TPC-H (Nested)
• 4.3x speedup for TPC-DS

https://blog.cloudera.com/blog/2015/11/new-in-cloudera-enterprise-5-5-support-for-complex-types-in-impala/


Runtime filtering

• General idea: some predicates can only be computed at runtime

• Example: SELECT count(*) FROM date_dim dt ,store_sales WHERE dt.d_date_sk = store_sales.ss_sold_date_sk AND dt.d_moy = 12;

• How does Impala execute this query?


SELECT dt.d_year, item.i_brand brand, sum(ss_ext_sales_price) sum_agg
FROM date_dim dt, store_sales, item
WHERE dt.d_date_sk = store_sales.ss_sold_date_sk
  AND store_sales.ss_item_sk = item.i_item_sk
  AND i_category = "Books"
  AND i_class = "fiction"
  AND dt.d_moy = 12
GROUP BY dt.d_year, item.i_brand
ORDER BY dt.d_year, sum_agg DESC, i_brand
LIMIT 100

Runtime filters

[Query plan diagram: store_sales scan (43 billion rows) -> Broadcast Join #1 with item (198 rows), 290 million rows out -> Broadcast Join #2 with date_dim (6,200 rows) -> Aggregate, 47 million rows]








Runtime filters: the opportunity
● The planner doesn't know which values of ss_sold_date_sk and ss_item_sk will qualify, even with statistics.
● This is an opportunity to save work: why bother sending all 43 billion of those rows to the joins?
● Runtime filters compute this predicate at runtime.







Step 1: The planner tells Join #1 to produce a bloom filter for qualifying i_item_sk values and Join #2 to produce a bloom filter for qualifying d_date_sk values.







Step 2: Each join reads all rows from its build side (right input) and computes a filter containing all distinct values of i_item_sk (Join #1) and d_date_sk (Join #2).







Step 3: Join #1 and Join #2 send their filters to the store_sales scan. The scan eliminates rows that don't have a match in the bloom filters.







Runtime filters

[Query plan diagram after filtering: store_sales scan (47 million rows) -> Broadcast Join #1 with item (198 rows), 47 million rows out -> Broadcast Join #2 with date_dim (6,200 rows) -> Aggregate, 47 million rows]

The store_sales scan uses the bloom filter from Join #2 to filter out partitions (ss_sold_date_sk) and the bloom filter from Join #1 to filter out rows that don't qualify (ss_item_sk).







Runtime filters: result
• 914x reduction in number of rows coming out of the scan: 43 billion -> 47 million
• 6x reduction in number of rows coming out of the join: 290 million -> 47 million
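The build/probe flow described in the steps above can be sketched in a few lines of C++. This is only an illustration under stated assumptions: the class name `BloomFilter`, the helper `FilterScan`, the two hash functions and the bit-array layout are all hypothetical and are not Impala's actual implementation.

```cpp
#include <cstddef>
#include <cstdint>
#include <functional>
#include <vector>

// Minimal bloom filter over 64-bit join keys (illustrative, not Impala's code).
class BloomFilter {
 public:
  // num_bits must be greater than zero.
  explicit BloomFilter(std::size_t num_bits) : bits_(num_bits, false) {}

  // Build side: record a key seen on the right input of the join.
  void Insert(std::int64_t key) {
    bits_[Hash1(key) % bits_.size()] = true;
    bits_[Hash2(key) % bits_.size()] = true;
  }

  // Probe side: false means the key is definitely absent,
  // true means the key may be present (false positives are possible).
  bool MayContain(std::int64_t key) const {
    return bits_[Hash1(key) % bits_.size()] && bits_[Hash2(key) % bits_.size()];
  }

 private:
  static std::size_t Hash1(std::int64_t key) {
    return std::hash<std::int64_t>{}(key);
  }
  // Second hash: multiply by an odd 64-bit constant (an arbitrary choice here).
  static std::size_t Hash2(std::int64_t key) {
    return static_cast<std::size_t>(
        (static_cast<std::uint64_t>(key) * 0x9E3779B97F4A7C15ULL) >> 17);
  }

  std::vector<bool> bits_;
};

// Scan side: keep only rows whose join key may match a build-side row.
std::vector<std::int64_t> FilterScan(const std::vector<std::int64_t>& scan_keys,
                                     const BloomFilter& filter) {
  std::vector<std::int64_t> kept;
  for (std::int64_t key : scan_keys) {
    if (filter.MayContain(key)) kept.push_back(key);
  }
  return kept;
}
```

A bloom filter never drops a row that has a match, so the joins still see every qualifying row; it may let a few non-matching rows through, which the join itself then discards.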


SELECT c_email_address, sum(ss_ext_sales_price) sum_agg
FROM store_sales, customer, customer_demographics
WHERE ss_customer_sk = c_customer_sk
  AND cd_demo_sk = c_current_cdemo_sk
  AND cd_gender = 'M'
  AND cd_purchase_estimate = 10000
  AND cd_credit_rating = 'Low Risk'
GROUP BY c_email_address
ORDER BY sum_agg DESC

Runtime filters variation : Global filters

[Query plan diagram: store_sales scan (43 billion rows) and customer scan (3.8 million rows) are shuffled into Shuffle Join #1 (43 billion rows out) -> Broadcast Join #2 with customer_demo (2,400 rows) -> Aggregate, 49 million rows]








Join #1 and Join #2 are expensive because the left side of each join has 43 billion rows.







Create a bloom filter from Join #2 on cd_demo_sk and push it down to the customer table scan.







Customer rows are reduced by 826x: 3.8 million -> 4,600 rows.







Create a bloom filter from Join #1 on c_customer_sk and push it down to the store_sales table scan.







With both filters applied, the store_sales scan and Shuffle Join #1 each emit 49 million rows instead of 43 billion.

877x reduction in rows: 43 billion -> 49 million rows

set RUNTIME_FILTER_MODE=GLOBAL;


Runtime filters: real-world results

• Runtime filters can be highly effective. Some benchmark queries are more than 30 times faster in Impala 2.5.0.
• As always, results depend on your queries, your schemas and your cluster environment.
• By default, runtime filters are enabled in limited 'local' mode in Impala 2.5.0. They can be enabled fully by setting RUNTIME_FILTER_MODE=GLOBAL.
• Other runtime filter parameters include:
  • RUNTIME_BLOOM_FILTER_SIZE: [1048576]
  • RUNTIME_FILTER_WAIT_TIME_MS: [0]


Improved Cardinality Estimates and Join Order

1. More robust scan cardinality estimation
  • Mitigate correlated predicates (exponential backoff)
2. Improved join cardinality estimation
  • Special treatment of common case of PK/FK joins
  • Detect selective joins by applying the selectivity of build-side predicates to the estimated join cardinality

• TPC-H Q8 impact: >8x speedup (91s in Impala 2.3 -> 11s in Impala 2.5)

Example of correlated predicates (make and model are not independent):
SELECT * FROM cars WHERE cars.make = 'Toyota' AND cars.model = 'Camry'
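To make the exponential backoff idea concrete, here is a small C++ sketch that compares it with the naive independence assumption. The function names and the specific exponents (1, 1/2, 1/3, ...) are assumptions chosen for illustration; the slides do not state the exact formula Impala uses.

```cpp
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <vector>

// Naive estimate: treats predicates as independent and multiplies selectivities.
double IndependentSelectivity(const std::vector<double>& selectivities) {
  double result = 1.0;
  for (double s : selectivities) result *= s;
  return result;
}

// Exponential backoff: the most selective predicate is applied in full, and
// each further predicate is dampened by a shrinking exponent (assumed 1/(i+1)).
double BackoffSelectivity(std::vector<double> selectivities) {
  std::sort(selectivities.begin(), selectivities.end());  // most selective first
  double result = 1.0;
  for (std::size_t i = 0; i < selectivities.size(); ++i) {
    result *= std::pow(selectivities[i], 1.0 / static_cast<double>(i + 1));
  }
  return result;
}
```

For the cars query, suppose make = 'Toyota' keeps 10% of rows and model = 'Camry' keeps 4%. Independence predicts 0.4% of rows, which is too low because every Camry is a Toyota; the backoff estimate of 0.04 * sqrt(0.1), roughly 1.3%, is closer to the true 4%.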


Query start-up: performance impact


LLVM Codegen Support in Impala

Operations:
• Hash join
• Aggregation
• Scans: Text, Sequence, Avro
• Expressions in all operators
• Sort
• Top-N

Data Types:
• TINYINT, SMALLINT, INT, BIGINT
• FLOAT, DOUBLE
• BOOLEAN
• STRING, VARCHAR
• DECIMAL

Legend (slide highlighting): New in Impala 2.5 · Extended in Impala 2.5


Codegen for Order by & Top-N

void* ExprContext::GetValue(Expr* e, TupleRow* row) {
  switch (e->type_.type) {
    case TYPE_BOOLEAN: { .. .. }
    case TYPE_TINYINT: { .. .. }
    case TYPE_INT: { .. .

int Compare(TupleRow* lhs, TupleRow* rhs) const {
  for (int i = 0; i < sort_cols_lhs_.size(); ++i) {
    void* lhs_value = sort_cols_lhs_[i]->GetValue(lhs);
    void* rhs_value = sort_cols_rhs_[i]->GetValue(rhs);

    if (lhs_value == NULL && rhs_value != NULL) return nulls_first_[i];
    if (lhs_value != NULL && rhs_value == NULL) return -nulls_first_[i];

    int result = RawValue::Compare(lhs_value, rhs_value,
                                   sort_cols_lhs_[i]->root()->type());
    if (!is_asc_[i]) result = -result;
    if (result != 0) return result;
    // Otherwise, try the next Expr
  }
  return 0; // fully equivalent key
}


Codegen for Order by & Top-N

Codegen code (compare with the original code above):

int CompareCodgened(TupleRow* lhs, TupleRow* rhs) const {
  int64_t lhs_value = sort_columns[i]->GetBigIntVal(lhs); // i = 0
  int64_t rhs_value = sort_columns[i]->GetBigIntVal(rhs); // i = 1

  int result = lhs_value > rhs_value ? 1 : (lhs_value < rhs_value ? -1 : 0);
  if (result != 0) return result;
  // Otherwise, try the next Expr
  return 0; // fully equivalent key
}

• Perfectly unrolls "for each grouping column" loop
• No switching on input type(s)
• Removes branching on ASCENDING/DESCENDING, NULLS FIRST/LAST





Result: 10x more efficient code
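The effect of specializing a comparator can be imitated at compile time with C++ templates. The sketch below is only an analogy: Impala generates specialized code at query runtime through LLVM, whereas here the sort direction is fixed by a template parameter. The names `CompareGeneric` and `CompareSpecialized` are hypothetical.

```cpp
#include <cstdint>

// Generic path: the sort direction is a runtime value that is checked on
// every single comparison.
int CompareGeneric(std::int64_t lhs, std::int64_t rhs, bool is_asc) {
  int result = lhs > rhs ? 1 : (lhs < rhs ? -1 : 0);
  if (!is_asc) result = -result;
  return result;
}

// Specialized path: the sort direction is a compile-time constant, so each
// instantiation contains no branch on ascending versus descending.
template <bool kIsAsc>
int CompareSpecialized(std::int64_t lhs, std::int64_t rhs) {
  int result = lhs > rhs ? 1 : (lhs < rhs ? -1 : 0);
  return kIsAsc ? result : -result;
}
```

Both paths return the same answers; the specialized one simply has less work to do per row, which matters when the comparator runs billions of times inside a sort.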


Decimal arithmetic and aggregation

Float/Double vs Decimal?

Pros for Float/Double
• Uses less memory.
• Faster because floating point math operations are natively supported by processors. (Note: Decimal uses fixed-point hardware types: int64 and __int128)
• Can represent a larger range of numbers.

Cons for Float/Double
• Precision errors compound during aggregations
• Can't do math with a wide number of significant digits (123456789.1 * .0000987654321)

No go for applications requiring high precision & accuracy
What about performance penalty?


Decimal arithmetic and aggregation

SELECT l_returnflag,
       l_linestatus,
       Sum(l_quantity) AS SUM_QTY,
       Sum(l_extendedprice) AS SUM_BASE_PRICE,
       Sum(l_extendedprice * ( 1 - l_discount )) AS SUM_DISC_PRICE
FROM lineitem
GROUP BY l_returnflag, l_linestatus
ORDER BY l_returnflag, l_linestatus

3x speedup

● Simplified overflow check for decimal.
● Extended Codegen framework to support aggregations involving decimal.
● Bridged the performance gap between double and decimal


Distributed Aggregations in Impala

select cust_id, sum(dollars) from sales group by cust_id;

[Diagram: three Scan nodes feed three Preagg nodes, which send rows over the network to three Merge nodes]

• Impala aggregations have two phases:
  • Pre-aggregation phase
  • Merge phase
• The pre-aggregation phase greatly reduces network traffic if there are many input rows per grouping value.
  • E.g. many sales per customer.
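The two phases can be sketched for the sum(dollars) query as follows. This is a single-process illustration under assumptions: the type aliases `Row` and `Partial` and the functions `PreAggregate` and `Merge` are hypothetical names, and the "network" is just a function call.

```cpp
#include <cstddef>
#include <cstdint>
#include <functional>
#include <unordered_map>
#include <utility>
#include <vector>

using Row = std::pair<std::int64_t, double>;                // (cust_id, dollars)
using Partial = std::unordered_map<std::int64_t, double>;   // cust_id -> sum

// Phase 1 (on each scan node): collapse local rows into one partial sum per
// cust_id, so at most one row per customer leaves the node.
Partial PreAggregate(const std::vector<Row>& local_rows) {
  Partial partial;
  for (const Row& row : local_rows) partial[row.first] += row.second;
  return partial;
}

// Phase 2: hash-partition the partial sums by cust_id so that each merge node
// owns a disjoint set of customers, then add up the partials it receives.
// num_merge_nodes must be greater than zero.
std::vector<Partial> Merge(const std::vector<Partial>& partials,
                           std::size_t num_merge_nodes) {
  std::vector<Partial> merged(num_merge_nodes);
  for (const Partial& partial : partials) {
    for (const auto& entry : partial) {
      std::size_t target = std::hash<std::int64_t>{}(entry.first) % num_merge_nodes;
      merged[target][entry.first] += entry.second;
    }
  }
  return merged;
}
```

With many sales per customer, each node ships far fewer rows than it scanned, which is where the network savings come from.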


Downsides of Pre-aggregations

select distinct * from sales;

[Diagram: three Scan nodes feed three Preagg nodes, which send rows over the network to three Merge nodes]

• Pre-aggregations consume:
  • Memory
  • CPU cycles
• Pre-aggregations are not always effective at reducing network traffic
  • E.g. select distinct for nearly-distinct rows
• Pre-aggregations can spill to disk under memory pressure
  • Disk I/O is bad - better to send to merge agg rather than disk


Streaming Pre-aggregations in Impala 2.5

select distinct * from sales;

[Diagram: three Scan nodes, network, three Merge nodes]

• Reduction factor is dynamically estimated based on the actual data processed
• Pre-aggregation expands memory usage only if reduction factor is good
• Benefits:
  • Certain aggregations with low reduction factor see speedups of up to 40%
  • Memory consumption can be reduced by 50% or more
  • Streaming pre-aggregations don't spill to disk
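A rough sketch of the streaming idea for a distinct-style aggregation is shown below. It is an assumption-heavy illustration, not Impala's algorithm: the class `StreamingPreAgg`, the fixed `initial_capacity`, the `min_reduction` threshold and the way the reduction factor is measured are all hypothetical.

```cpp
#include <cstddef>
#include <cstdint>
#include <unordered_set>
#include <vector>

// Streaming pre-aggregation for "select distinct key" (illustrative only).
// The hash table grows past its initial capacity only while it has been
// collapsing input well; otherwise new keys are passed through unaggregated
// to the merge phase instead of consuming more memory or spilling.
class StreamingPreAgg {
 public:
  StreamingPreAgg(std::size_t initial_capacity, double min_reduction)
      : initial_capacity_(initial_capacity), min_reduction_(min_reduction) {}

  void Add(std::int64_t key) {
    ++rows_seen_;
    if (table_.count(key) > 0) return;  // duplicate: absorbed by the table
    if (table_.size() < initial_capacity_ || ObservedReduction() >= min_reduction_) {
      table_.insert(key);               // room left, or growing is paying off
    } else {
      passed_through_.push_back(key);   // stream the row to the merge phase
    }
  }

  // Input rows consumed per output row produced so far.
  double ObservedReduction() const {
    std::size_t out = table_.size() + passed_through_.size();
    return out == 0 ? 1.0 : static_cast<double>(rows_seen_) / static_cast<double>(out);
  }

  std::size_t table_size() const { return table_.size(); }
  std::size_t passed_through_size() const { return passed_through_.size(); }

 private:
  std::size_t initial_capacity_;
  double min_reduction_;
  std::size_t rows_seen_ = 0;
  std::unordered_set<std::int64_t> table_;
  std::vector<std::int64_t> passed_through_;
};
```

With nearly distinct input the table stops growing at its initial capacity and the remaining rows flow straight to the merge phase; with heavily repeated keys the observed reduction stays high and the table is allowed to grow.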


Streaming Pre-aggregations in Impala 2.5

Baseline finished in 23.13 seconds

Operator       #Hosts  Avg Time   Max Time   #Rows    Est. #Rows  Peak Mem   Est. Peak Mem  Detail
06:AGGREGATE   1       366.581ms  366.581ms  1        1           72.00 KB   -1.00 B        FINALIZE
05:EXCHANGE    1       149.923us  149.923us  15       1           0          -1.00 B        UNPARTITIONED
02:AGGREGATE   15      243.604ms  248.701ms  15       1           12.00 KB   10.00 MB
04:AGGREGATE   15      8s887ms    9s585ms    450.00M  437.91M     1.53 GB    245.01 MB      FINALIZE
03:EXCHANGE    15      827.770ms  932.785ms  450.00M  437.91M     0          0              HASH(o_orderkey)
01:AGGREGATE   15      9s995ms    11s484ms   450.00M  437.91M     1.64 GB    3.59 GB
00:SCAN HDFS   15      142.192ms  189.179ms  450.00M  450.00M     150.94 MB  88.00 MB       tpch_300_parquet.orders

With streaming pre-aggregation enabled finished in 14.9 seconds

Operator       #Hosts  Avg Time   Max Time   #Rows    Est. #Rows  Peak Mem   Est. Peak Mem  Detail
06:AGGREGATE   1       356.667ms  356.667ms  1        1           72.00 KB   -1.00 B        FINALIZE
05:EXCHANGE    1       110.924us  110.924us  15       1           0          -1.00 B        UNPARTITIONED
02:AGGREGATE   15      246.188ms  250.408ms  15       1           12.00 KB   10.00 MB
04:AGGREGATE   15      11s174ms   11s753ms   450.00M  437.91M     1.51 GB    245.01 MB      FINALIZE
03:EXCHANGE    15      750.620ms  805.099ms  450.00M  437.91M     0          0              HASH(o_orderkey)
01:AGGREGATE   15      5s670ms    6s715ms    450.00M  437.91M     153.40 MB  3.59 GB        STREAMING
00:SCAN HDFS   15      151.746ms  201.804ms  450.00M  450.00M     150.95 MB  88.00 MB       tpch_300_parquet.orders


Optimization for partition keys scan

• Use metadata to avoid table accesses for partition key scans:
  • select min(month), max(year) from functional.alltypes;
  • month, year are partition keys of the table
• Enabled by query option OPTIMIZE_PARTITION_KEY_SCANS
• Applicable:
  • min(), max(), ndv() and aggregate functions with distinct keyword
  • partition keys only

Plan without optimization:

03:AGGREGATE [FINALIZE]
|  output: min:merge(month), max:merge(year)
|
02:EXCHANGE [UNPARTITIONED]
|
01:AGGREGATE
|  output: min(month), max(year)
|
00:SCAN HDFS [functional.alltypes]
   partitions=24/24 files=24 size=478.45KB

Plan with optimization:

01:AGGREGATE [FINALIZE]
|  output: min(month),max(year)
|
00:UNION
   constant-operands=24
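The optimized plan replaces the HDFS scan with a set of constants, one per partition. A minimal C++ sketch of that idea follows; the `PartitionKey` struct and `MinMonthMaxYear` function are hypothetical stand-ins for catalog metadata and are not Impala APIs.

```cpp
#include <algorithm>
#include <utility>
#include <vector>

// One entry per table partition, as it might be held in table metadata
// (hypothetical struct for illustration).
struct PartitionKey {
  int year;
  int month;
};

// Computes (min(month), max(year)) from partition key values alone. No data
// file is opened, because the key values come from metadata.
// The partition list must not be empty.
std::pair<int, int> MinMonthMaxYear(const std::vector<PartitionKey>& partitions) {
  int min_month = partitions.front().month;
  int max_year = partitions.front().year;
  for (const PartitionKey& p : partitions) {
    min_month = std::min(min_month, p.month);
    max_year = std::max(max_year, p.year);
  }
  return {min_month, max_year};
}
```

For the 24 partitions of functional.alltypes, this loop touches 24 small in-memory values rather than 24 files.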


Hardware: 21-node cluster, each node with
● 384GB memory, 2 sockets, 12 total cores, Intel Xeon CPU E5-2630L 0 at 2.00GHz
● 12 disk drives at 932GB each (one for the OS, the rest for HDFS)

Comparative Set
● Impala 2.5
  ○ RUNTIME_FILTER_MODE = 2;
● Spark SQL 1.6
  ○ Thrift JDBC server used to avoid startup cost
  ○ --master yarn --deploy-mode client --driver-memory 24G --driver-cores 8 --executor-memory 24G --num-executors 240

Workload
● TPC-DS 15TB stored in Parquet file format (default of 256MB block size)
● Unmodified TPC-DS queries: 3, 7, 8, 19, 25, 27, 34, 42, 43, 46, 47, 52, 53, 55, 59, 61, 63, 68, 73, 79, 88, 89, 96, 98
● Caveats:
  ○ Spark SQL failed running:
    ■ Q25: Bad plan
    ■ Q47: StackOverflowError
    ■ Q89: StackOverflowError

Competitive benchmark : TPC-DS


Competitive benchmark

Query complexity varied from Q3:

SELECT dt.d_year,
       item.i_brand_id brand_id,
       item.i_brand brand,
       Sum(ss_ext_sales_price) sum_agg
FROM date_dim dt, store_sales, item
WHERE dt.d_date_sk = store_sales.ss_sold_date_sk
  AND store_sales.ss_item_sk = item.i_item_sk
  AND item.i_manufact_id = 436
  AND dt.d_moy = 12
GROUP BY dt.d_year, item.i_brand, item.i_brand_id
ORDER BY dt.d_year, sum_agg DESC, brand_id
LIMIT 100;

to Q25 (fact to fact joins):

SELECT i_item_id,
       i_item_desc,
       s_store_id,
       s_store_name,
       Stddev_samp(ss_net_profit),
       Stddev_samp(sr_net_loss),
       Stddev_samp(cs_net_profit) AS catalog_sales_profit
FROM store_sales, store_returns, catalog_sales,
     date_dim d1, date_dim d2, date_dim d3, store, item
WHERE d1.d_moy = 4
  AND d1.d_year = 2001
  AND d1.d_date_sk = ss_sold_date_sk
  AND i_item_sk = ss_item_sk
  AND s_store_sk = ss_store_sk
  AND ss_customer_sk = sr_customer_sk
  AND ss_item_sk = sr_item_sk
  AND ss_ticket_number = sr_ticket_number
  AND sr_returned_date_sk = d2.d_date_sk
  AND d2.d_moy BETWEEN 4 AND 10
  AND d2.d_year = 2001
  AND sr_customer_sk = cs_bill_customer_sk
  AND sr_item_sk = cs_item_sk
  AND cs_sold_date_sk = d3.d_date_sk
  AND d3.d_moy BETWEEN 4 AND 10
  AND d3.d_year = 2001
GROUP BY i_item_id, i_item_desc, s_store_id, s_store_name
ORDER BY i_item_id, i_item_desc, s_store_id, s_store_name
LIMIT 100;


Competitive benchmark


Competitive benchmark

Impala 2.5 is 11x faster (based on geomean)


Performance Benchmark Takeaways

• Impala unlocks BI usage directly on Hadoop
  • Meets BI low-latency and multi-user requirements
  • Advantage expands for single-user vs just 10 users
• Spark SQL enables easier Spark application development
  • Enables mixed procedural Spark (Java/Scala) and SQL job development
• Mid-term trends will further favor Impala's design approach for latency and concurrency
  • More data sets move to memory (HDFS caching, in-memory joins, Intel joint roadmap)
  • CPU efficiency will increase in importance
  • Native code enables easy optimizations for CPU instruction sets


• Available today in Impala 2.5:
  • All the same Impala functionality, performance, and third-party integrations
  • Supported across our cloud partners
  • Deployment via Director
  • Modular architecture enables cloud's decoupled storage and elasticity future
• Available soon in Impala 2.6:
  • Impala read/write to S3 in addition to local HDFS (IMPALA-1878)
  • Dynamically sized runtime filters
  • Parquet scanner optimization
  • Faster joins, aggregations, sorts and decimal arithmetic
  • Rack aware scheduling
  • Faster code generation

Impala and Cloud

http://www.cloudera.com/documentation/enterprise/release-notes/topics/impala_new_features.html#new_features_250

https://www.cloudera.com/products/cloudera-director.html

https://issues.cloudera.org/browse/IMPALA-1878


Impala Roadmap

2H 2015 | 1H 2016 | 2016

• SQL Support & Usability
  • Nested structures
  • Kudu updates (beta)
• Management & Security
  • Record reader service (beta)
  • Finer-grained security (Sentry)
• Integration
  • Isilon support
  • Python interface (Ibis)
• Performance & Scale
  • Improved predictability under concurrency

• Performance & Scale
  • Continued scalability and concurrency
  • Initial perf/scale improvements
• Management & Security
  • Improved admission control
  • Resource utilization and showback
• SQL Support & Usability
  • Dynamic partitioning

• Performance & Scale
  • >20x performance
  • Multi-threaded joins/aggregations
  • Continued scale work
• Cloud
  • S3 read/write support
• Management & Security
  • Improved YARN integration
  • Automated metadata
• SQL Support & Usability
  • Data type improvements
  • Added SQL extensions



Appendix.


• Pre Impala 2.5:
  • Coordinator starts receiving fragments before senders
  • Problem:
    • Serializes startup
    • Scale and plan complexity ~ slower startup
• Impala 2.5:
  • Coordinator starts fragments in any order
  • Added wait logic for senders and receivers

Query start-up improvements


Scheduling Small Queries

Query scheduler assigns scan ranges to workers (running impalad).
First it selects an HDFS datanode to read from.

[Diagram: replicas A, B, C]

Selection will always start with the same replica to make optimal use of OS buffer caches.
This can lead to hot-spots for some workloads.
Improvement: Pick impalad at random.



New Query Option: random_replica

Disabled by default.
set random_replica = 1;

Also has a corresponding query hint:
SELECT AVG(c1) FROM t /* +SCHEDULE_RANDOM_REPLICA */;


Where It Can Help
• Large number of small queries, each with few input tables.
• High load on only one of multiple replicas of a table.
• Queries are CPU bound.
• Benefit: Distribute load more evenly over replicas.
• Tradeoff: Distribution of local reads will increase buffer cache usage.

What's Next
• Add possibility to prefer remote reads.
• Switch remote impalad selection from round-robin to load-based.
• Add rack-awareness.


Catalog Improvements

Incrementally update table metadata instead of force-reloading all table metadata during DDL/DML operations

Reload metadata of only ‘dirty’ partitions

Reuse descriptors of HDFS files to avoid loading file/block metadata for files that haven’t been modified

Significantly reduce the latency of DDL/DML operations that change a small fraction of table metadata (e.g. alter table foo partition (year = 2010) set location ‘blah’)


Catalog Improvements - Results
