gnw05 - extending oracle databases with hadoop
TRANSCRIPT
Extending Databases With the Full Power of Hadoop: How Gluent Does It!
Tanel Poder
http://gluent.com
Gluent - who we are

Tanel Poder: Co-founder & CEO, and still a performance geek
• I was an independent consultant for many years, doing Oracle performance & scalability work
• I also co-authored the Expert Oracle Exadata book

We are long-term Oracle Database & Data Warehousing guys, focused on performance & scale
• We got started in Dec 2014 and are ~20 people by now
Agenda
1. Intro: Why Hadoop? (3-minute overview)
2. Gluent Data Platform fundamentals
3. Offloading Oracle data to Hadoop
4. Updating Hadoop data in Oracle
5. Querying Hadoop data in Oracle
6. Sharing Hadoop & RDBMS data with multiple apps & databases
(+ demos throughout)
Why Hadoop?
• Scalability in Software!
• Open Data Formats
• Future-proof!
• One Data, Many Engines!

SQL-on-Hadoop (Hive, Impala, Spark, Presto) is only one of many applications of Hadoop: others include graph processing (Giraph), full-text search, image processing, log processing and stream processing.
Processing can be done close to data – awesome scalability!
• One of the mantras of Hadoop: "Moving computation is cheaper than moving data"
• Hadoop data processing frameworks hide the complexity of MPP
• Now you can build a supercomputer from commodity "pizza boxes" with cheap locally attached disks!

[Diagram: your code runs on nodes 1..N, each with locally attached disks, on top of a distributed storage + parallel execution framework: Impala, Hive, Spark, Presto, HBase/coprocessors, ...]

"Affordable at Scale" (Jeffrey Needham)
• No SAN storage cost & bottlenecks
• No expensive big-iron hardware
One Data, Many Engines!
• Decoupling storage from compute + open data formats = flexible, future-proof data platforms! (example below)

[Diagram: many engines (Impala SQL, Hive SQL, Spark, MR, Solr/Search, the Kudu API, libparquet, ...) read the same data in open formats (Parquet, ORC, XML, Avro, weblogs) stored on HDFS, Amazon S3 or the Kudu column-store]
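To make the "one data, many engines" idea concrete, here is a minimal sketch (illustrative table name, columns and HDFS path, not from the deck): a table defined over open-format Parquet files that Hive, Impala, Spark SQL, Presto and others can all query in place.

-- Hive/Impala DDL: an external table over Parquet files on HDFS.
-- The data is written once in an open format; any engine that speaks
-- the Hive metastore can query the same files without copying them.
CREATE EXTERNAL TABLE sales_archive (
  prod_id     BIGINT,
  cust_id     BIGINT,
  time_id     TIMESTAMP,
  amount_sold DECIMAL(10,2)
)
STORED AS PARQUET
LOCATION 'hdfs:///data/sales_archive';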
Hadoop, NoSQL, RDBMS: one way to look at it
• Hadoop: the data lake – all data! Scalable and flexible
• RDBMS: complex transactions & analytics – sophisticated, out-of-the-box, mature
• "NoSQL": transaction ingest – scalable, simple transactions

Access to all enterprise data?
Gluent as a data virtualization layer

[Diagram: Gluent sits as a virtualization layer between the databases (Oracle, Teradata, Postgres, MSSQL) and big data sources in open data formats; apps X, Y and Z keep connecting to their own databases]
Push computation down to Hadoop

[Diagram: apps X, Y and Z issue SQL against their databases (Oracle, Teradata, Postgres, MSSQL); Gluent pushes the computation down to Hadoop engines: Hive, Impala, Spark, etc.]
Gluent Extends Databases with the Full Power of Hadoop
• Offload RDBMS data
• Query any Hadoop data
• Transparent: all queries work – no code changes required!
Data Offload
Gluent's Data Offload tool
• Gluent provides the full orchestration for syncing entire tables to Hadoop (and keeping them in sync with just a single command)
• Lots of steps for ensuring proper hybrid query performance, partitioning and data correctness
• No ETL development needed!
Before & After

[Diagram: before, the entire time-partitioned fact table (hot + cold data) and the dimension tables sit on expensive storage; after, hot data and dimension tables stay on expensive RDBMS storage while the cold fact partitions move to cheap scalable storage (e.g. HDFS + Hive/Impala) spread across Hadoop nodes]
Large Database Offloading
• DW fact tables in terabytes; dimension tables in gigabytes
• Fact tables are time-partitioned, e.g. RANGE (order_date) + HASH (customer_id), holding months to years of history (a sketch of such a table follows below)
• Hotness of data decreases with age; old fact data is rarely updated
• After multiple joins on dimension tables, a full scan is done on the fact table
• A few filter predicates sit directly on the fact table; most predicates are on the dimension tables
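As a concrete illustration of the fact table shape described above (hypothetical names and partition bounds, not from the deck):

-- Oracle DDL sketch: a fact table range-partitioned by order_date and
-- hash-subpartitioned by customer_id, as in the diagram above.
CREATE TABLE dw.orders_fact (
  order_id    NUMBER,
  customer_id NUMBER,
  store_id    NUMBER,
  order_date  DATE,
  amount      NUMBER
)
PARTITION BY RANGE (order_date)
SUBPARTITION BY HASH (customer_id) SUBPARTITIONS 16
( PARTITION p2015 VALUES LESS THAN (DATE '2016-01-01'),
  PARTITION p2016 VALUES LESS THAN (DATE '2017-01-01')
);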
Hybrid table types
• 100% RDBMS table (no offload)
• "90/10" table (union-all): most, but not all, data offloaded & dropped from the RDBMS
• "100/10" table: entire table available in Hadoop
• "100/100" table
• 100% Hadoop table: usually presented to other DBs
Decoupled Partitioning Models (examples)

[Diagram: three layouts over the same fact data, with hotness of data decreasing along the RANGE (order_date) axis: RANGE (order_date) + HASH (customer_id); RANGE (order_date) + RANGE (customer_id) + HASH (store_id); RANGE (order_date) + HASH (customer_id) + HASH (store_id)]
• DB table partitioned weekly, Hadoop monthly (a Hadoop-side sketch follows below)
• Multi-level partitioning
• No DB partitioning, full Hadoop partitioning
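A hedged sketch of the Hadoop side of the "DB weekly, Hadoop monthly" example (illustrative DDL and names; Gluent generates its own layout): the offloaded copy can be partitioned on a coarser, derived month key, independent of the RDBMS partitioning.

-- Hive/Impala DDL sketch: the offloaded copy of the fact table is
-- partitioned monthly on a derived key, decoupled from the weekly
-- RANGE partitioning of the source RDBMS table.
CREATE TABLE ssh_h.sales (
  prod_id     BIGINT,
  cust_id     BIGINT,
  time_id     TIMESTAMP,
  amount_sold DOUBLE
)
PARTITIONED BY (time_month STRING)   -- e.g. '2016-01'
STORED AS PARQUET;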
Data Model Virtualization (example)
• Large RDBMS tables that are always joined together ("fact to fact" joins)
• Large "fact to fact" joins offloaded to Hadoop

[Diagram: tables T1 and T2 sit behind a view T1_T2; the join is either executed in Hadoop SQL or materialized in Hadoop]
• Accessing the "wide" table does not require further joining (see the sketch below)
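A minimal sketch of the virtualized-join idea (illustrative names and columns, not Gluent's actual generated DDL): the RDBMS view hides whether the join runs live in Hadoop or reads a pre-materialized wide table.

-- Applications simply query T1_T2; the underlying join work happens
-- in Hadoop against the offloaded copies of T1 and T2.
CREATE OR REPLACE VIEW hybrid_schema.T1_T2 AS
SELECT t1.order_id, t1.order_date, t2.line_id, t2.amount
FROM   hybrid_schema.T1_EXT t1   -- offloaded copy of T1
JOIN   hybrid_schema.T2_EXT t2   -- offloaded copy of T2
ON     t1.order_id = t2.order_id;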
Offloading options
• How to schedule regular (additional partition) offloading?
• And incremental changed-data offloading?

Offload a table to Hadoop, appending new partitions as needed:
./offload -t schema.table --older-than-days=30 -x

Enable batch incremental changed-data syncing to Hadoop:
./offload -t schema.table --incremental-enabled -x

Enable incremental offloading, with DML & data update support:
./offload -t schema.table --updates-enabled -x
Updating offloaded data?
• Sometimes historical updates are needed, or deletion of specific records
• Gluent allows you to update Hadoop data from your familiar database, using existing RDBMS SQL (illustrated below)
• We call it "data patching" internally, because it is not meant for full-blown hybrid OLTP on Hadoop
• You do get some transactional properties for the hybrid transaction: atomicity (either all RDBMS + Hadoop updates go through, or none do)
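Purely for illustration, a hypothetical "data patch" expressed as ordinary RDBMS SQL against a hybrid table (table and column names assumed; the actual mechanism is Gluent's):

-- Patch historical rows that now live in Hadoop, using plain SQL in
-- the RDBMS; the predicate targets offloaded (Hadoop-resident) data.
UPDATE hybrid_schema.SALES
SET    amount_sold = 0
WHERE  promo_id = 999
AND    time_id < DATE '2015-01-01';
COMMIT;   -- the hybrid change applies atomically, or not at all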
SQL Query & Workload offloading?
Gluent Demo
• First I'll show some demos: demo.sql and demo_plsql.sql
• And then explain how it works :-)
Gluent Smart Connector

[Diagram: Oracle RDBMS on SAN/Exadata serves reports, queries, ETL and data feeds, connected to the Hadoop ecosystem: HDFS (onsite or cloud) with SQL-on-Hadoop engines (Hive/LLAP, Impala, Spark SQL) and HBase, holding petabytes of data, scanning billions of rows/s, reading hundreds of GB/s]

1. Secret sauce: read the SQL execution plan, analyze the SQL and bind variables
2. Run parts of the SQL in Hadoop: push down scanning, filtering, aggregations and joins
3. Return only the required rows and columns to Oracle

For every query, billions of rows can be scanned in Hadoop with only the relevant millions returned to Oracle, so the returned data size and processing are greatly reduced.
Table scan & filter pushdown: how do we do it?
• Gluent virtualizes individual tables, i.e. individual nodes of the execution plan
• Everything else still runs in the RDBMS (100% compatible)

[Diagram: the SALES fact table access node, sitting under a union-all, is offloaded to Hadoop; the rest of the plan stays in Oracle]
Gluent Data Offload creates Hybrid Schemas with virtualized tables

[Diagram: the Original Schema holds the dimension tables and the original FACT table; the Hybrid Schema holds synonyms for the dimensions plus a FACT union-all view over the remaining local data and the Hadoop-backed FACT_EXT table; the offloaded history is dropped from the RDBMS]
• ETL unchanged (as it does not need to see the long fact history)
• Report SQLs unchanged: reports just log in to a different schema
Smart Connector Architecture + Oracle direct filter pushdown

[Diagram: App SQL → Oracle external table + preprocessor → Gluent Smart Connector → data access SQL → Hive / Impala SQL → HDFS]

1. The Oracle DB compiles the SQL into an execution plan
2. Some tables in the plan are hybrid objects (Gluent external tables)
3. The external table preprocessor launches the Gluent Smart Connector (see the sketch below)
4. The Smart Connector reads Oracle execution plan memory
5. The Smart Connector constructs a "data access SQL" for Hadoop, with filter & projection pushdown
6. The Hadoop SQL engine returns only the requested rows and columns to the connector
7. The connector returns the table results to the Oracle external table
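The external table + preprocessor mechanism in steps 2-3 is standard Oracle functionality; a minimal sketch of what such a definition can look like (illustrative names, directories and access parameters, not Gluent's actual generated DDL):

-- An external table whose PREPROCESSOR launches a program (here the
-- connector) that writes the table's rows to stdout for Oracle to read.
CREATE TABLE ssh_h.SALES_EXT (
  prod_id     NUMBER,
  cust_id     NUMBER,
  time_id     DATE,
  amount_sold NUMBER
)
ORGANIZATION EXTERNAL (
  TYPE ORACLE_LOADER
  DEFAULT DIRECTORY gluent_data_dir
  ACCESS PARAMETERS (
    RECORDS DELIMITED BY NEWLINE
    PREPROCESSOR gluent_bin_dir:'smart_connector.sh'
    FIELDS TERMINATED BY ','
  )
  LOCATION ('sales_ext.conf')
)
REJECT LIMIT UNLIMITED;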
Execution plan reading: Oracle conceptual view

SQL> SELECT SUM(object_id)
     FROM   test_users u, test_objects o
     WHERE  u.username = o.owner
     AND    o.status = 'VALID'
     GROUP BY u.created;

-------------------------------------------------
| Id | Operation                | Name         |
-------------------------------------------------
|  0 | SELECT STATEMENT         |              |
|  1 |  HASH GROUP BY           |              |
|* 2 |   HASH JOIN              |              |
|  3 |    TABLE ACCESS FULL     | TEST_USERS   |
|* 4 |    EXTERNAL TABLE ACCESS | TEST_OBJECTS |
-------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------
   2 - access("U"."USERNAME"="O"."OWNER")
   4 - filter("O"."STATUS"='VALID')

Column Projection Information (identified by operation id):
-----------------------------------------------------------
   1 - "U"."CREATED"[DATE,7], SUM("OBJECT_ID")[22]
   2 - (#keys=1) "U"."CREATED"[DATE,7], "OBJECT_ID"[NUMBER,22]
   3 - "U"."USERNAME"[VARCHAR2,30], "U"."CREATED"[DATE,7]
   4 - "O"."OWNER"[VARCHAR2,30], "OBJECT_ID"[NUMBER,22]

• Gluent identifies which tables in the plan are "hybrid" and which are local
• All required information is available in the execution plan memory (the same sections DBMS_XPLAN can display; see below)
• V$SQL_PLAN is not enough – we read what we need from the SGA
• What about bind variables? We read bind values from process memory (the PGA)
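For reference, the predicate and projection sections above are standard Oracle plan output; you can display them yourself (no Gluent involved) with DBMS_XPLAN:

-- Show the full plan of the last statement executed in this session,
-- including the Predicate and Column Projection sections.
SELECT * FROM TABLE(dbms_xplan.display_cursor(format => 'ADVANCED'));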
90/10 Hybrid Table: Under the hood
• A UNION ALL view combines:
  • the latest partitions in Oracle (SALES table)
  • the offloaded data in Hadoop (SALES_EXT smart table)

CREATE VIEW hybrid_schema.SALES AS
SELECT PROD_ID, CUST_ID, TIME_ID, CHANNEL_ID, PROMO_ID,
       QUANTITY_SOLD, AMOUNT_SOLD
FROM   SSH.SALES
WHERE  TIME_ID >= TO_DATE(' 2016-01-01 00:00:00', 'SYYYY-MM-DD HH24:MI:SS')
UNION ALL
SELECT "prod_id", "cust_id", "time_id", "channel_id", "promo_id",
       "quantity_sold", "amount_sold"
FROM   SSH_H.SALES_EXT
WHERE  "time_id" < TO_DATE(' 2016-01-01 00:00:00', 'SYYYY-MM-DD HH24:MI:SS')
Selective Offload Processing

SELECT c.cust_gender, SUM(s.amount_sold)
FROM   ssh.customers c, sales_v s
WHERE  c.cust_id = s.cust_id
GROUP BY c.cust_gender

[Diagram: the plan (SELECT STATEMENT → GROUP BY → HASH JOIN) reads the SALES union-all view: hot data and dimension tables via TABLE ACCESS FULL on expensive storage, offloaded data via EXTERNAL TABLE ACCESS / DBLINK against cheap scalable storage (e.g. HDFS + Hive/Impala) across Hadoop nodes]
Partially Offloaded Oracle Execution Plan (90/10 table)

---------------------------------------------------------
| Id | Operation                          | Name       |
---------------------------------------------------------
|  0 | SELECT STATEMENT                   |            |
|  1 |  SORT GROUP BY ROLLUP              |            |
|* 2 |   HASH JOIN                        |            |
|  3 |    TABLE ACCESS STORAGE FULL       | TIMES      |
|* 4 |    HASH JOIN                       |            |
|* 5 |     TABLE ACCESS STORAGE FULL      | CHANNELS   |
|* 6 |     HASH JOIN                      |            |
|* 7 |      TABLE ACCESS STORAGE FULL     | PRODUCTS   |
|* 8 |      HASH JOIN                     |            |
|* 9 |       TABLE ACCESS STORAGE FULL    | PROMOTIONS |
| 10 |       VIEW                         | SALES      |
| 11 |        UNION-ALL                   |            |
| 12 |         PARTITION RANGE ITERATOR   |            |
| 13 |          TABLE ACCESS STORAGE FULL | SALES      |
|*14 |          EXTERNAL TABLE ACCESS FULL| SALES_EXT  |
---------------------------------------------------------
...
   5 - filter("CH"."CHANNEL_CLASS"='Direct')
   6 - access("P"."PROD_ID"="S"."PROD_ID")
   7 - filter("P"."PROD_CATEGORY"='Peripherals and Accessories')
   8 - access("PM"."PROMO_ID"="S"."PROMO_ID")
   9 - filter("PM"."PROMO_NAME"='blowout sale')
  14 - filter("TIME_ID"<TO_DATE(' 2015-06-01 00:00:00', 'syyyy-mm-dd hh24:mi:ss'))

• The Hadoop rowsource is visited only when access to old data is required
MSSQL 90/10 table execution plan

[Screenshot: SQL Server execution plan showing a Concatenation operator (== UNION ALL) combining local data with a Remote Query to Hadoop]
Aggregation & full join pushdown: how do we do it?
• We go beyond single-table virtualization: push down joins and aggregations by offloading entire execution plan branches
• Everything else still runs in the RDBMS (100% compatible)
• For transparent optimization on Oracle: dbms_advanced_rewrite (see the sketch below)
• The connector runs the more complex SQL in Hadoop
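A hedged sketch of the dbms_advanced_rewrite mechanism named above (illustrative statements and object names, not Gluent's actual generated rewrite): Oracle is told that two statements are equivalent, and transparently substitutes the second one, which reads a Hadoop-backed hybrid object.

-- Declare that an aggregation over ssh.sales may be answered from a
-- pre-aggregated, Hadoop-backed external table instead (names assumed).
BEGIN
  sys.dbms_advanced_rewrite.declare_rewrite_equivalence(
    name             => 'SALES_AGG_PUSHDOWN',
    source_stmt      => 'SELECT prod_id, SUM(amount_sold) amt
                         FROM ssh.sales GROUP BY prod_id',
    destination_stmt => 'SELECT prod_id, amt
                         FROM ssh_h.sales_agg_ext',
    validate         => FALSE,
    rewrite_mode     => 'GENERAL');
END;
/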
Adaptive Join Filter Pulldown
• The offloaded table is read with its direct predicates (PROMO_ID, PROD_ID and CHANNEL_ID ranges)
• Gluent's adaptive join filter pulldown adds the dimension-table filters (the IN subqueries):

SELECT ...
FROM  `SH`.`SALES_ALL_INT`
WHERE `PROMO_ID` = 37
  AND `PROD_ID` >= 14 AND `PROD_ID` <= 130
  AND `PROD_ID` IN (SELECT "P"."PROD_ID"
                    FROM ["SH"."PRODUCTS" "P"]
                    WHERE ("P"."PROD_CATEGORY" = 'Peripherals and Accessories'))
  AND `CHANNEL_ID` >= 3 AND `CHANNEL_ID` <= 9
  AND `CHANNEL_ID` IN (SELECT "CH"."CHANNEL_ID"
                       FROM ["SH"."CHANNELS" "CH"]
                       WHERE ("CH"."CHANNEL_CLASS" = 'Direct'))
Phases of Data(base) platform modernization

Enterprise Data Warehouse with no Offload
• Starting point: too expensive & too slow
• [src → ETL → Oracle/Exadata DW]

Offload Phase 1: Data Offload
• Your ETL and queries remain unchanged
• [src → ETL → Oracle/Exadata DW → Hadoop]

Offload Phase 2: Query Offload
• ETL and original queries still unchanged
• Some queries run directly on Hadoop
• [src → ETL → Oracle/Exadata DW ↔ Hadoop]

Offload Phase 3: Hadoop-first
• Some ETL and data feeds land directly in Hadoop
• The Oracle DB can access all Hadoop data
• [src → ETL → Oracle/Exadata DW ↔ Hadoop ← src]
Hadoop as a data sharing backend (data hub)

[Diagram: multiple databases share a central Hadoop hub, alongside new big data feeds, machine-generated data and new IoT feeds]
Demo
Gluent is designed to…
1. Liberate enterprise data: RDBMS data is offloaded from isolated silos to scalable storage & processing platforms, in open data formats (Hadoop/HDFS)
2. Use modern distributed processing platforms for the heavy lifting: as these platforms improve & evolve, so does your Gluent experience
3. Require no disruptive forklift upgrade or switchover: your existing applications still log in to the RDBMS as they've always done
4. Require no application code rewrites: access all your data – the code doesn't change and the application architecture doesn't change
Next Steps?
• Next webinar: Gluent Real World Results from Customer Deployments
  • 17 January 2017
  • Sign up here: https://gluent.com/event/gnw06-modernizing-enterprise-data-architecture-with-gluent-cloud-and-hadoop/
• Training: Hadoop for Database Professionals (1-day overview training)
  • TBD February 2017
  • Register interest here: https://gluent.com/hadoop-for-database-professionals/
More info about Gluent
• Gluent whitepapers, including the Advisor: https://gluent.com/whitepapers/
• Gluent New World videos (will include this one): http://vimeo.com/gluent
• Podcast about moving to the "New World": the "Drill to Detail" podcast with Mark Rittman
  http://www.drilltodetail.com/podcast/2016/12/6/drill-to-detail-ep12-gluent-and-the-new-world-of-hybrid-data-with-special-guest-tanel-poder
We are hiring awesome developers & data engineers ;-)
http://gluent.com/careers
Thanks!!!