cloudera impala + postgresql

Running Cloudera Impala on PostgreSQL, by Chengzhong Liu, 2013.12

Upload: liuknag

Post on 14-Dec-2014


DESCRIPTION

Hacking Cloudera Impala to run on a PostgreSQL cluster in MPP style. Performance is verified for typical SQL statements and for concurrent use.

TRANSCRIPT

Page 1: Cloudera Impala + PostgreSQL

Running Cloudera Impala on PostgreSQL

By Chengzhong Liu

2013.12

Page 2: Cloudera Impala + PostgreSQL

Story coming from…

• Data gravity
• Why big data
• Why SQL on big data

Page 3: Cloudera Impala + PostgreSQL

Today's agenda

• Big data at Miaozhen Systems (秒针系统)
• Overview of Cloudera Impala
• Hacking practice in Cloudera Impala
• Performance
• Conclusions
• Q&A

Page 4: Cloudera Impala + PostgreSQL

What happened at Miaozhen

• 3 billion ad impressions per day
• 20 TB data scan for report generation every morning
• 24-server cluster

• Besides this
  – TV Monitor
  – Mobile Monitor
  – Site Monitor
  – …

Page 5: Cloudera Impala + PostgreSQL

Before Hadoop

• Scrat
  – PostgreSQL 9.1 cluster
  – Wrote a simple proxy
  – <2 s for a 2 TB data scan

• Mobile Monitor
  – Hadoop-like distributed computing system
  – RabbitMQ + 3 computing servers
  – Wrote a MapReduce in C++
  – Handles 30 million to 500 million ad impressions

Page 6: Cloudera Impala + PostgreSQL

Problem & Opportunity

• Database cluster
• SQL on Hadoop
• Miscellaneous data

• Requirements
  – Most data is relational
  – SQL interface

Page 7: Cloudera Impala + PostgreSQL

SQL on Hadoop

• Google Dremel
• Apache Drill
• Cloudera Impala
• Facebook Presto
• EMC Greenplum/Pivotal

[Diagram labels: HDFS; Map Reduce; Hive, Pig; Impala/Drill/Pivotal/Presto]

Latency matters

Page 8: Cloudera Impala + PostgreSQL

What’s this

• A kind of MPP engine
• In-memory processing
• Small-to-big join
  – Broadcast join

• Small result size
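The small-to-big broadcast join named on this slide can be sketched as follows. This is a minimal illustration in Python, not Impala's implementation: the function name `broadcast_join`, the per-node partition layout and the sample rows are all assumptions made for the example.

```python
from collections import defaultdict


def broadcast_join(big_partitions, small_table, key):
    """Join every partition of a big table against one broadcast small table.

    big_partitions: list of lists of dict rows, one list per (hypothetical) node.
    small_table:    list of dict rows, small enough to copy to every node.
    """
    joined = []
    for partition in big_partitions:
        # Each node receives a full copy of the small table and builds
        # its own in-memory hash table on the join key.
        hash_table = defaultdict(list)
        for row in small_table:
            hash_table[row[key]].append(row)
        # The big table is only scanned locally; it never moves between nodes.
        for big_row in partition:
            for small_row in hash_table.get(big_row[key], []):
                joined.append({**small_row, **big_row})
    return joined


# Hypothetical sample data: a small campaign table and two big-table partitions.
campaigns = [{"campaign": 1, "name": "spring"}, {"campaign": 2, "name": "summer"}]
impressions = [
    [{"id": "a", "campaign": 1}, {"id": "b", "campaign": 2}],  # node 0
    [{"id": "c", "campaign": 1}, {"id": "d", "campaign": 3}],  # node 1
]
result = broadcast_join(impressions, campaigns, "campaign")
```

Only the small table crosses the network, which is why this strategy suits a small-to-big join and stops paying off once both sides are large.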

Page 9: Cloudera Impala + PostgreSQL

Why Cloudera Impala

• The team moves fast
  – UDFs coming out
  – Better join strategy on the way

• Good code base
  – Modular
  – Easy to add subclasses

• Really fast
  – LLVM code generation
    • 80s/95s (uv test)
  – Distributed aggregation tree
  – In-situ data processing (inside storage)

Page 10: Cloudera Impala + PostgreSQL

Typical Arch.

[Diagram: SQL Interface and Meta Store, plus three nodes, each with a Query Planner, a Coordinator and an Exec Engine]

Page 11: Cloudera Impala + PostgreSQL

Our target

• An MPP database
  – Built on PostgreSQL 9.1
  – Scales well
  – Speed

• A mixed data source MPP query engine
  – Join two tables in different sources
  – In fact…

Page 12: Cloudera Impala + PostgreSQL

Hacking… from where

• Add, not change
  – Scan node type
  – DB meta info

• Put changes in configuration
  – Thrift protocol update
    • TDBHostInfo
    • TDBScanNode

Page 13: Cloudera Impala + PostgreSQL

Front end

• Meta store update
  – Link data to the table name
  – Table location management

• Front end
  – Compute table location
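One way to picture the meta store and front-end changes on this slide is a lookup from table name to the PostgreSQL hosts that hold its data. The sketch below is only an assumption about the shape of that mapping: the slides name `TDBHostInfo` but not its fields, so the `DBHostInfo` fields, the host names, the `TABLE_LOCATIONS` dictionary and `compute_table_location` are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DBHostInfo:
    # Hypothetical fields; the slides name TDBHostInfo but not its contents.
    host: str
    port: int
    database: str


# Hypothetical meta store content: table name -> PostgreSQL hosts holding its data.
TABLE_LOCATIONS = {
    "t": [
        DBHostInfo("pg-node-1", 5432, "ads"),
        DBHostInfo("pg-node-2", 5432, "ads"),
        DBHostInfo("pg-node-3", 5432, "ads"),
    ],
}


def compute_table_location(table_name):
    """Return one (host, query) scan assignment per host that holds the table."""
    hosts = TABLE_LOCATIONS.get(table_name)
    if hosts is None:
        raise KeyError(f"no location registered for table {table_name!r}")
    return [(host, f"SELECT * FROM {table_name}") for host in hosts]


assignments = compute_table_location("t")
```

Each assignment would then be handed to a scan node on the back end, so the planner itself does not need to know how the data is stored.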

Page 14: Cloudera Impala + PostgreSQL

Back end

• Coordinator
  – PG host

• New scan node type
  – DB scan node
    • PG scan node
    • psql library using a cursor
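The cursor-based scan mentioned on this slide can be sketched as a generator that pulls rows in fixed-size batches instead of loading the whole result. This is an illustration only: Python's built-in `sqlite3` stands in for PostgreSQL so that the example runs without a server, and `scan_batches`, the table layout and the batch size are assumptions, not the code from the talk.

```python
import sqlite3


def scan_batches(conn, query, batch_size):
    """Yield rows from `query` in fixed-size batches using a cursor."""
    cursor = conn.cursor()
    cursor.execute(query)
    while True:
        # fetchmany returns at most batch_size rows, and an empty list at the end.
        batch = cursor.fetchmany(batch_size)
        if not batch:
            break
        yield batch


# sqlite3 is a stand-in for a PostgreSQL connection (assumption for runnability).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (id TEXT, campaign INTEGER)")
conn.executemany(
    "INSERT INTO t VALUES (?, ?)", [(f"id{i}", i % 3) for i in range(10)]
)
batches = list(scan_batches(conn, "SELECT id, campaign FROM t", batch_size=4))
```

Fetching in batches keeps the scan node's memory use bounded and lets downstream operators start work before the scan finishes.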

Page 15: Cloudera Impala + PostgreSQL

SQL Plan

[Plan diagram. Nodes from root to leaf: Aggr. sum(count(id)); Exchange node; Aggr. count(id); Aggr. group by id; Exchange node; Aggr. group by id; HDFS/PG scan]

• select count(distinct id) from table
  – MR-like process
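The MR-like plan for `select count(distinct id) from table` can be sketched in a single process as below. It is an illustration under stated assumptions, not Impala's executor: `count_distinct`, the per-node row lists and the hash partitioning are hypothetical stand-ins for the scan, aggregation and exchange nodes in the plan.

```python
from collections import defaultdict


def count_distinct(node_rows, num_nodes):
    """Two-exchange plan for `select count(distinct id) from table`."""
    # Step 1: each node scans its rows and groups by id locally,
    # which drops local duplicates before anything is sent.
    local_groups = [set(rows) for rows in node_rows]

    # Step 2: first exchange. Ids are hash-partitioned so that every
    # occurrence of one id lands on the same node, then grouped again.
    shuffled = defaultdict(set)
    for ids in local_groups:
        for value in ids:
            shuffled[hash(value) % num_nodes].add(value)

    # Step 3: each node counts its ids, which are now globally unique.
    partial_counts = [len(ids) for ids in shuffled.values()]

    # Step 4: second exchange to the coordinator, which sums the counts.
    return sum(partial_counts)


# Hypothetical ids held by three nodes, with duplicates within and across nodes.
node_rows = [["a", "b", "a"], ["b", "c"], ["c", "d", "a"]]
total = count_distinct(node_rows, num_nodes=3)
```

The partial counts can be summed safely only because the first exchange guarantees that no id is counted on two nodes.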

Page 16: Cloudera Impala + PostgreSQL

Env.

• Ad impression logs
  – 150 million, 100KB/line

• 3 servers
  – 24 cores
  – 32 GB memory
  – 12 × 2 TB hard disks
  – 100 Mbps LAN

• Queries
  – select count(id) from t group by campaign
  – select count(distinct id) from t group by campaign
  – select * from t where id = 'xxxxxxxx'

Page 17: Cloudera Impala + PostgreSQL

Performance

[Bar chart: queries 1, 2 and 3 on the x-axis; y-axis from 0 to 700; series: impala, hive, pg+impala]

• Group by speed / core
• 20 M/s

Page 18: Cloudera Impala + PostgreSQL

With index

Page 19: Cloudera Impala + PostgreSQL

Codegen on/off

[Bar chart: uv_test, distinct and duplicated on the x-axis; y-axis from 0 to 100; series: en_codegen, dis_codegen]

• select count(distinct id) from t group by c

• select distinct id from t

• select id from t group by id having count(case when c = '1' then 1 else null end) > 0 and count(case when c = '2' then 1 else null end) > 0 limit 10;
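The last query on this slide keeps only the ids that occur with both `c = '1'` and `c = '2'`: each `count(case ...)` counts the rows matching one value, and the `having` clause requires both counts to be positive. The sketch below runs that query on a tiny table to show the effect. It uses Python's built-in `sqlite3` purely as a convenient SQL engine, and the sample rows are made up for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (id TEXT, c TEXT)")
conn.executemany(
    "INSERT INTO t VALUES (?, ?)",
    [
        ("a", "1"), ("a", "2"),              # seen under both values
        ("b", "1"),                          # only c = '1'
        ("c", "2"),                          # only c = '2'
        ("d", "1"), ("d", "2"), ("d", "1"),  # both values, with a duplicate
    ],
)

query = """
select id from t
group by id
having count(case when c = '1' then 1 else null end) > 0
   and count(case when c = '2' then 1 else null end) > 0
limit 10
"""
# The query has no ORDER BY, so sort in Python for a stable result.
overlap = sorted(row[0] for row in conn.execute(query).fetchall())
```

`count` ignores nulls, so the `else null` branch is what makes each count cover only the matching rows.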

Page 20: Cloudera Impala + PostgreSQL

Multi-users

Page 21: Cloudera Impala + PostgreSQL

Conclusion

• Source quality
  – Readable
  – Google C++ style
  – Robust

• MPP solution based on PG
  – Proven performance
  – Easy to scale

• Mixed engine usage
  – HDFS and DB

Page 22: Cloudera Impala + PostgreSQL

What’s next

• YARN integration
• UDF
• Joins with big tables
• BI roadmap
• Failover

Page 24: Cloudera Impala + PostgreSQL

Thanks!

Q & A