Tanel Poder: Connecting Hadoop and Oracle · gluent.com

Upload: lycong

Post on 03-Jul-2018


Page 1: Connecting Hadoop and Oracle

Tanel Poder, a long time computer performance geek

gluent.com

Page 2: Intro: About me

• Tanel Põder
  • Oracle Database Performance geek (18+ years)
  • Exadata Performance geek
  • Linux Performance geek
  • Hadoop Performance geek

• CEO & co-founder: [company logo]

Expert Oracle Exadata book (2nd edition is out now!)

Instant promotion

Page 3: All Enterprise Data Available in Hadoop!

[Diagram. Labels: Gluent, Hadoop; data sources MSSQL, Teradata, IBM DB2, Oracle, Big Data Sources; applications App X, App Y, App Z]

Page 4: Gluent Offload Engine

• Access any data source in any DB & APP
• Push processing to Hadoop

[Diagram. Labels: Gluent, Hadoop; data sources MSSQL, Teradata, IBM DB2, Oracle, Big Data Sources; applications App X, App Y, App Z]

Page 5: Big Data Plumbers

Page 6: Did you know?

• Cloudera Impala now has analytic functions and can spill workarea buffers to disk
  • …since v2.0 in Oct 2014

• Hive has "storage indexes" (Impala not yet)
  • Hive 0.11 with ORC format, Impala possibly with new Kudu storage

• Hive supports row level changes & ACID transactions
  • …since v0.14 in Nov 2014
  • Hive uses base + delta table approach (not for OLTP!)

• SparkSQL is the new kid on the block
  • Actually, Spark is old news already -> Apache Flink

Page 7: Scalability vs. Features (Hive example)

$ hive (version 1.2.1 HDP 2.3.1)

hive> SELECT SUM(duration)
    > FROM call_detail_records
    > WHERE
    >     type = 'INTERNET'
    >     OR phone_number IN ( SELECT phone_number
    >                          FROM customer_details
    >                          WHERE region = 'R04' );

FAILED: SemanticException [Error 10249]: Line 5:17 Unsupported SubQuery Expression 'phone_number': Only SubQuery expressions that are top level conjuncts are allowed

We Oracle users have been spoiled with a very sophisticated SQL engine for years :)

Page 8: Scalability vs. Features (Impala example)

$ impala-shell

SELECT SUM(order_total)
FROM orders
WHERE order_mode='online'
   OR customer_id IN (SELECT customer_id FROM customers
                      WHERE customer_class = 'Prime');

Query: select SUM(order_total) FROM orders WHERE order_mode='online' OR customer_id IN (SELECT customer_id FROM customers WHERE customer_class = 'Prime')

ERROR: AnalysisException: Subqueries in OR predicates are not supported: order_mode = 'online' OR customer_id IN (SELECT customer_id FROM soe.customers WHERE customer_class = 'Prime')

Cloudera: CDH 5.3, Impala 2.1.0

Page 9: Scalability vs. Features (Hive example 2)

$ hive (version 1.2.1 HDP 2.3.1)

hive> SELECT
    >   *
    > FROM
    >   table1
    > WHERE
    >   ID IN (SELECT id FROM table2 WHERE id<100)
    > AND ID IN (SELECT id FROM table3 WHERE id<100)
    > AND ID <100;

FAILED: SemanticException [Error 10249]: Line 7:6 Unsupported SubQuery Expression 'ID': Only 1 SubQuery expression is supported.

Page 10: Hadoop vs. Oracle Database: One way to look at it

Hadoop: Cheap, Scalable, Flexible

Oracle: Sophisticated, Out-of-the-box, Mature

Big Data Plumbing?

Page 11: 3 major big data plumbing scenarios

1. Load data from Hadoop to Oracle

2. Offload data from Oracle to Hadoop

3. Query Hadoop data in Oracle

There is a big difference between data copy and on-demand data query tools

Page 12: Right tool for the right problem!

How to know what's right?

Page 13: Use Case: Move Data to/from Hadoop: Sqoop

• A parallel JDBC <-> HDFS data copy tool
• Apache 2.0 licensed (like most other Hadoop components)
• Generates MapReduce jobs that connect to Oracle with JDBC

[Diagram. Labels: Oracle DB; JDBC; SQOOP with four MapReduce jobs; Hadoop/HDFS with several files; Hive metadata]

Page 14: Example: Copy Data to Hadoop with Sqoop

• Sqoop is "just" a JDBC DB client capable of writing to HDFS
• It is very flexible

sqoop import --connect jdbc:oracle:thin:@oel6:1521/LIN112 --username system \
  --null-string '' --null-non-string '' \
  --target-dir=/user/impala/offload/tmp/ssh/sales \
  --append -m1 --fetch-size=5000 \
  --fields-terminated-by '','' --lines-terminated-by ''\\n'' \
  --optionally-enclosed-by ''\"'' --escaped-by ''\"'' \
  --split-by TIME_ID \
  --query "\"SELECT * FROM SSH.SALES WHERE TIME_ID < TO_DATE(' 1998-01-01 00:00:00', 'SYYYY-MM-DD HH24:MI:SS','NLS_CALENDAR=GREGORIAN') AND \$CONDITIONS\""

Just an example of sqoop syntax

Page 15: Sqoop use cases

1. Copy data from Oracle to Hadoop
  • Entire schemas, tables, partitions or any SQL query result

2. Copy data from Hadoop to Oracle
  • Read HDFS files, convert to JDBC arrays and insert

3. Copy changed rows from Oracle to Hadoop
  • Sqoop supports adding WHERE syntax to table reads:
  • WHERE last_chg > TIMESTAMP'2015-01-10 12:34:56'
  • WHERE ora_rowscn > 123456789
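The third use case (copying only changed rows) can be sketched as a small command builder. This is a minimal illustration, not part of the original talk: the helper name `build_sqoop_import`, the `last_copied` bookkeeping value and the assumption that SSH.SALES has a `last_chg` column are all hypothetical. Only commonly documented Sqoop 1 import options are used, and credentials are assumed to be handled separately.

```python
from typing import List, Optional


def build_sqoop_import(connect: str, username: str, table: str,
                       target_dir: str, where: Optional[str] = None,
                       mappers: int = 4) -> List[str]:
    """Return an argv list for a `sqoop import` of one table (hypothetical helper)."""
    argv = [
        "sqoop", "import",
        "--connect", connect,
        "--username", username,
        "--table", table,
        "--target-dir", target_dir,
        "--append",              # add new files next to earlier imports
        "-m", str(mappers),      # number of parallel map tasks
    ]
    if where:
        # Restrict the table read to rows changed since the last run.
        argv += ["--where", where]
    return argv


# Hypothetical bookkeeping: the newest change timestamp already copied.
last_copied = "2015-01-10 12:34:56"
cmd = build_sqoop_import(
    connect="jdbc:oracle:thin:@oel6:1521/LIN112",
    username="system",
    table="SSH.SALES",
    target_dir="/user/impala/offload/tmp/ssh/sales",
    where=f"last_chg > TIMESTAMP'{last_copied}'",  # assumes a last_chg column
)
print(" ".join(cmd))
```

Building the command as a list rather than one string avoids the nested shell quoting seen in the longer example earlier; the list can be handed to a process launcher as is.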

Page 16: Sqoop with Oracle performance optimizations - 1

• Is called OraOOP
  • Was a separate module by Guy Harrison's team at Quest (Dell)
  • Included in standard Sqoop v1.4.5+ and Sqoop2 coming too!

• sqoop import --direct …

• Parallel Sqoop Oracle reads done by block ranges or partitions
  • Previously required wide index range scans to parallelize workload

• Read Oracle data with full table scans and direct path reads
  • Much less IO, CPU, buffer cache impact on production databases

Page 17: Sqoop with Oracle performance optimizations - 2

• Data loading into Oracle via direct path insert or merge
  • Bypass buffer cache and most redo/undo generation

• HCC compression on-the-fly for loaded tables
  • ALTER TABLE fact COMPRESS FOR QUERY HIGH
  • All following direct path inserts will be compressed

• Sqoop is probably the best tool for the data copy/move use case
  • For batch data movement needs
  • If used with the Oracle optimizations

Page 18: What about real-time change replication?

• Oracle Change Data Capture
  • Supported in 11.2, but not recommended by Oracle anymore
  • Desupported in 12.1

• Oracle GoldenGate or DBVisit Replicate

• Some manual work/scripting involved with all these tools:
  1. Sqoop the source table snapshots to Hadoop
  2. Replicate changes to "delta" tables on HDFS or in HBase
  3. Use a "merge" view on Hive/Impala to merge base table with delta

Page 19: "Applying" updates/changes in Hadoop

• HDFS files are append-only by design (scalability reasons)
  • No random writes or changing of existing HDFS file data

• Base table + delta table approach
  1. A base table contains original (batch offloaded) data
  2. A delta table contains any new versions of data
  3. A merge view performs an outer join to show latest data from base table or delta table (if any)
  4. Optional "compaction": merging base + delta together in background

• Hive ACID transactions use this approach under the hood
• Cloudera's KUDU project uses similar concept without HDFS
  • …but deltas can be in memory and stored in flash (if available)
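The base + delta idea can be tried out on a laptop. The sketch below uses Python's built-in sqlite3 module purely as a stand-in for Hive/Impala; the table names `sales_base`, `sales_delta` and the view `sales_merged` are made up for illustration, and deletes are not handled.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE sales_base  (id INTEGER PRIMARY KEY, amount INTEGER);
    CREATE TABLE sales_delta (id INTEGER PRIMARY KEY, amount INTEGER);

    -- Base table: the original batch-offloaded rows.
    INSERT INTO sales_base VALUES (1, 100), (2, 200), (3, 300);

    -- Delta table: a newer version of row 2 and a brand new row 4.
    INSERT INTO sales_delta VALUES (2, 250), (4, 400);

    -- Merge view: prefer the delta version of a row when one exists,
    -- then add rows that only exist in the delta table.
    CREATE VIEW sales_merged AS
        SELECT b.id, COALESCE(d.amount, b.amount) AS amount
        FROM sales_base b
        LEFT OUTER JOIN sales_delta d ON d.id = b.id
        UNION ALL
        SELECT d.id, d.amount
        FROM sales_delta d
        WHERE NOT EXISTS (SELECT 1 FROM sales_base b WHERE b.id = d.id);
""")

merged = conn.execute(
    "SELECT id, amount FROM sales_merged ORDER BY id"
).fetchall()
print(merged)
```

Neither table is ever updated in place, which is the point: both can live in append-only files, and an optional compaction job would simply materialize the view into a new base table.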

Page 20: References: Incremental data updates in Hadoop

• Cloudera (Impala)
  • http://blog.cloudera.com/blog/2015/11/how-to-ingest-and-query-fast-data-with-impala-without-kudu/

• Cloudera (Kudu)
  • https://blog.cloudera.com/blog/2015/09/kudu-new-apache-hadoop-storage-for-fast-analytics-on-fast-data/

• Hortonworks (Hive)
  • http://docs.hortonworks.com/HDPDocuments/HDP2/HDP-2.3.0/bk_dataintegration/content/incrementally-updating-hive-table-with-sqoop-and-ext-table.html

Page 21: Use Case: Query Hadoop data as if it was in (Oracle) RDBMS

The ultimate goal (for me at least :)

• Low cost and scalability of Hadoop…

• …together with the sophistication of Oracle SQL engine

• Keep using your existing (Oracle) applications without re-engineering & rewriting SQL

Page 22: What do SQL queries do?

1. Retrieve data: disk reads, decompression, filtering, column extraction & pruning

2. Process data: join, parallel distribute, aggregate, sort

3. Return results: final result column projection, send data back over network …or dump to disk

Page 23: How does the query offloading work? (Ext. tab)

----------------------------------------------------------
| Id | Operation                       | Name            |
----------------------------------------------------------
|  0 | SELECT STATEMENT                |                 |
|  1 |  SORT ORDER BY                  |                 |
|* 2 |   FILTER                        |                 |
|  3 |    HASH GROUP BY                |                 |
|* 4 |     HASH JOIN                   |                 |
|* 5 |      HASH JOIN                  |                 |
|* 6 |       EXTERNAL TABLE ACCESS FULL| CUSTOMERS_EXT   |
|* 7 |       EXTERNAL TABLE ACCESS FULL| ORDERS_EXT      |
|  8 |      EXTERNAL TABLE ACCESS FULL | ORDER_ITEMS_EXT |
----------------------------------------------------------

Data retrieval nodes: the EXTERNAL TABLE ACCESS FULL rows
Data processing nodes: the rows above them

• Data retrieval nodes do what they need to produce rows/cols
  • Read Oracle datafiles, read external tables, call ODBC, read memory

• In the rest of the execution plan, everything works as usual

Page 24: How does the query offloading work?

Data retrieval nodes produce rows/columns

Data processing nodes consume rows/columns

Data processing nodes don't care how and where the rows were read from, the data flow format is always the same.
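The producer/consumer split can be illustrated with plain Python iterators. This is only an analogy under made-up names (`scan_local_table`, `scan_external_text`, `group_by_mode`), not how the Oracle row source layer is implemented: two different retrieval nodes emit rows in one common shape, and the processing node works the same way on either.

```python
import csv
import io
from collections import defaultdict
from typing import Dict, Iterable, Iterator, Tuple

Row = Tuple[int, str, int]  # (customer_id, order_mode, order_total)


def scan_local_table() -> Iterator[Row]:
    """Retrieval node 1: rows already held in native form."""
    yield from [(1, "online", 50), (2, "direct", 70)]


def scan_external_text(text: str) -> Iterator[Row]:
    """Retrieval node 2: parse delimited text into the same Row shape."""
    for cust, mode, total in csv.reader(io.StringIO(text)):
        yield (int(cust), mode, int(total))


def group_by_mode(rows: Iterable[Row]) -> Dict[str, int]:
    """Processing node: sums totals per order_mode, unaware of the row source."""
    totals: Dict[str, int] = defaultdict(int)
    for _cust, mode, total in rows:
        totals[mode] += total
    return dict(totals)


external = "3,online,25\n4,direct,5\n"
from_local = group_by_mode(scan_local_table())
from_external = group_by_mode(scan_external_text(external))
print(from_local, from_external)
```

Note where the `int()` calls sit: the datatype conversion cost lands in the retrieval node, which is the question the later overhead comparison is about.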

Page 25: How does the query offloading work? (ODBC)

-------------------------------------------------------
| Id | Operation         | Name        | Inst   |IN-OUT|
-------------------------------------------------------
|  0 | SELECT STATEMENT  |             |        |      |
|  1 |  SORT ORDER BY    |             |        |      |
|* 2 |   FILTER          |             |        |      |
|  3 |    HASH GROUP BY  |             |        |      |
|* 4 |     HASH JOIN     |             |        |      |
|* 5 |      HASH JOIN    |             |        |      |
|  6 |       REMOTE      | orders      | IMPALA | R->S |
|  7 |       REMOTE      | customers   | IMPALA | R->S |
|  8 |      REMOTE       | order_items | IMPALA | R->S |
-------------------------------------------------------

Remote SQL Information (identified by operation id):
----------------------------------------------------

   6 - SELECT `order_id`,`order_mode`,`customer_id`,`order_status` FROM `soe`.`orders` WHERE `order_mode`='online' AND `order_status`=5 (accessing 'IMPALA.LOCALDOMAIN' )

   7 - SELECT `customer_id`,`cust_first_name`,`cust_last_name`,`nls_territory`,`credit_limit` FROM `soe`.`customers` WHERE `nls_territory` LIKE 'New%' (accessing 'IMPALA.LOCALDOMAIN' )

   8 - SELECT `order_id`,`unit_price`,`quantity` FROM `soe`.`order_items` (accessing 'IMPALA.LOCALDOMAIN' )

Data retrieval nodes: the REMOTE rows, each running remote Hadoop SQL

Data processing: business as usual

Page 26: Evaluate Hadoop -> Oracle SQL processing

• Who reads disk?

• Who filters the rows and prunes columns, and when?

• Who decompresses data?

• Who converts the resultset to Oracle internal datatypes?

• Who pays the CPU price?!

Page 27: Sqooping data around

[Diagram. Labels: Hadoop side: HDFS, Sqoop, Map-reduce jobs, IO, Read IO. Link: JDBC. Oracle side: Oracle DB, Datatype conversion, Load table, Write IO + Filter, Scan, join, aggregate]

Page 28: Oracle SQL Connector for HDFS

[Diagram. Labels: Hadoop side: HDFS. Oracle side: Oracle DB, Java HDFS client, libjvm, libhdfs, External Table, Filter + Datatype conversion, Joins, processing etc]

• No filter pushdown into Hadoop.
• Parse text file or read pre-converted DataPump file

Page 29: Query using DB links & ODBC Gateway

[Diagram. Labels: Hadoop side: HDFS, Impala/Hive, IO + filter, Decompress, Filter, Project. Link: Thrift protocol, ODBC driver, ODBC Gateway. Oracle side: Oracle DB, Datatype conversion, Joins, processing etc]

Page 30: Big Data SQL

[Diagram. Labels: Hadoop side: Oracle BDA, HDFS, Oracle Storage Cell, IO + filter, Decompress, Filter, Project, Datatype conversion. Link: iDB, rows. Oracle side: Oracle Exadata DB, Hive External Table, Joins, processing etc]

Page 31: Oracle <-> Hadoop data plumbing tools - capabilities

Columns: Tool | LOAD DATA | OFFLOAD DATA | ALLOW QUERY DATA | OFFLOAD QUERY | PARALLEL EXECUTION

Sqoop: Yes, Yes, Yes
Oracle Loader for Hadoop: Yes, Yes
Oracle SQL Connector for HDFS: Yes, Yes, Yes
ODBC Gateway: Yes, Yes, Yes
Big Data SQL: Yes, Yes, Yes, Yes
Gluent Offload Engine: Yes, Yes, Yes, Yes, Yes

[Blank cells were dropped in the transcript, so for rows with fewer than five "Yes" marks the column positions are not recoverable.]

Page 32: Oracle <-> Hadoop data plumbing tools - overhead

Columns: Tool | Load data to use | Decompress CPU | Filtering CPU | Datatype Conversion

Sqoop | Oracle | Hadoop | Oracle | Oracle
Oracle Loader for Hadoop | Oracle | Hadoop | Oracle | Hadoop
Oracle SQL Connector for HDFS: Oracle, Oracle /Hadoop*
ODBC Gateway: Hadoop, Hadoop, Oracle
Big Data SQL: Hadoop, Hadoop, Hadoop
Gluent Offload Engine: Hadoop, Hadoop, Hadoop

[Blank cells were dropped in the transcript, so the column positions in the last four rows are not recoverable.]

Slide callouts:
• * Parse text files on HDFS or read pre-converted DataPump binary
• Oracle 12c + BDA + Exadata only
• OK for occasional archive query
• Oracle 11g, 12c, no platform restrictions

Page 33: Example Use Case

Data Warehouse Offload

Page 34: Dimensional DW design

[Diagram. Labels: DW FACT TABLES in TeraBytes, partitioned by RANGE(order_date) and HASH(customer_id); DIMENSION TABLES in GigaBytes (several "D" boxes); Hotness of data]

• Fact tables time-partitioned
• Months to years of history
• Old fact data rarely updated
• After multiple joins on dimension tables, a full scan is done on the fact table
• Some filter predicates directly on the fact table (time range, country code etc)

Page 35: DW Data Offload to Hadoop

[Diagram. Labels: Expensive Storage (shown twice); Time-partitioned fact table; Hot Data; Cold Data; Dim. Tables; Cheap Scalable Storage (e.g. HDFS + Hive/Impala) made of five Hadoop Nodes]

Page 36: 1. Copy Data to Hadoop with Sqoop

• Sqoop is "just" a JDBC DB client capable of writing to HDFS
• It is very flexible

sqoop import --connect jdbc:oracle:thin:@oel6:1521/LIN112 --username system \
  --null-string '' --null-non-string '' \
  --target-dir=/user/impala/offload/tmp/ssh/sales \
  --append -m1 --fetch-size=5000 \
  --fields-terminated-by '','' --lines-terminated-by ''\\n'' \
  --optionally-enclosed-by ''\"'' --escaped-by ''\"'' \
  --split-by TIME_ID \
  --query "\"SELECT * FROM SSH.SALES WHERE TIME_ID < TO_DATE(' 1998-01-01 00:00:00', 'SYYYY-MM-DD HH24:MI:SS','NLS_CALENDAR=GREGORIAN') AND \$CONDITIONS\""

Just an example of sqoop syntax

Page 37: 2. Load Data into Hive/Impala in read-optimized format

• The additional external table step gives better flexibility when loading into the target read-optimized table

CREATE TABLE IF NOT EXISTS SSH_tmp.SALES_ext (
    PROD_ID bigint, CUST_ID bigint, TIME_ID timestamp, CHANNEL_ID bigint,
    PROMO_ID bigint, QUANTITY_SOLD bigint, AMOUNT_SOLD bigint
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' ESCAPED BY '"' LINES TERMINATED BY '\n'
STORED AS TEXTFILE LOCATION '/user/impala/offload/tmp/ssh/sales'

INSERT INTO SSH.SALES PARTITION (year)
SELECT t.*, CAST(YEAR(time_id) AS SMALLINT) FROM SSH_tmp.SALES_ext t

Page 38: 3a. Query Data via a DB Link/HS/ODBC

• (I'm not covering the ODBC driver install and config here)

SQL> EXPLAIN PLAN FOR SELECT COUNT(*) FROM ssh.sales@impala;
Explained.

SQL> SELECT * FROM TABLE(DBMS_XPLAN.DISPLAY);
-------------------------------------------------------------
| Id | Operation        | Name | Cost (%CPU)| Inst   |IN-OUT|
-------------------------------------------------------------
|  0 | SELECT STATEMENT |      |     0   (0)|        |      |
|  1 |  REMOTE          |      |            | IMPALA | R->S |
-------------------------------------------------------------

Remote SQL Information (identified by operation id):
----------------------------------------------------

   1 - SELECT COUNT(*) FROM `SSH`.`SALES` A1 (accessing 'IMPALA')

The entire query gets sent to Impala thanks to Oracle Heterogeneous Services

Page 39: 3b. Query Data via Oracle SQL Connector for HDFS

CREATE TABLE SSH.SALES_DP
(   "PROD_ID" NUMBER,
    "CUST_ID" NUMBER,
    "TIME_ID" DATE,
    "CHANNEL_ID" NUMBER,
    "PROMO_ID" NUMBER,
    "QUANTITY_SOLD" NUMBER,
    "AMOUNT_SOLD" NUMBER
)
ORGANIZATION EXTERNAL
(   TYPE ORACLE_LOADER
    DEFAULT DIRECTORY "OFFLOAD_DIR"
    ACCESS PARAMETERS
    (   external variable data
        PREPROCESSOR "OSCH_BIN_PATH":'hdfs_stream'
    )
    LOCATION
    (   'osch-tanel-00000', 'osch-tanel-00001', 'osch-tanel-00002',
        'osch-tanel-00003', 'osch-tanel-00004', 'osch-tanel-00005'
    )
)
REJECT LIMIT UNLIMITED
NOPARALLEL

• The external variable data option says it's a DataPump format input stream
• hdfs_stream is the Java HDFS client app installed into your Oracle server as part of the Oracle SQL Connector for Hadoop
• The location files are small .xml config files created by Oracle Loader for Hadoop (telling where in HDFS our files are)

Page 40: 3b. Query Data via Oracle SQL Connector for HDFS

• External Table access is parallelizable!
  • Multiple HDFS location files must be created for PX (done by loader)

SQL> SELECT /*+ PARALLEL(4) */ COUNT(*) FROM ssh.sales_dp;

-------------------------------------------------------------------------------
| Id | Operation                       | Name     |    TQ |IN-OUT| PQ Distrib |
-------------------------------------------------------------------------------
|  0 | SELECT STATEMENT                |          |       |      |            |
|  1 |  SORT AGGREGATE                 |          |       |      |            |
|  2 |   PX COORDINATOR                |          |       |      |            |
|  3 |    PX SEND QC (RANDOM)          | :TQ10000 | Q1,00 | P->S | QC (RAND)  |
|  4 |     SORT AGGREGATE              |          | Q1,00 | PCWP |            |
|  5 |      PX BLOCK ITERATOR          |          | Q1,00 | PCWC |            |
|  6 |       EXTERNAL TABLE ACCESS FULL| SALES_DP | Q1,00 | PCWP |            |
-------------------------------------------------------------------------------

• Oracle Parallel Slaves can each access a different DataPump file on HDFS
• Filter predicates are not offloaded. All data/columns are read!

Page 41: 3c. Query Data via Big Data SQL

• This is just one simple example:

SQL> CREATE TABLE movielog_plus
       (click VARCHAR2(40))
     ORGANIZATION EXTERNAL
       (TYPE ORACLE_HDFS
        DEFAULT DIRECTORY DEFAULT_DIR
        ACCESS PARAMETERS (
          com.oracle.bigdata.cluster=bigdatalite
          com.oracle.bigdata.overflow={"action":"truncate"}
        )
        LOCATION ('/user/oracle/moviework/applog_json/')
       )
     REJECT LIMIT UNLIMITED;

ORACLE_HIVE would use Hive metadata to figure out data location and structure (but storage cells do the disk reading from HDFS directly)

Page 42: 4. Query Data via Gluent Smart Connector

• I will stay civilized and will not turn this into a sales presentation…

• …but check out http://gluent.com for more info ;-)

Page 43: How to query the full dataset?

• UNION ALL
  • Latest partitions in Oracle: (SALES table)
  • Offloaded data in Hadoop: (SALES@impala or SALES_DP ext tab)

CREATE VIEW app_reporting_user.SALES AS
SELECT PROD_ID, CUST_ID, TIME_ID, CHANNEL_ID, PROMO_ID, QUANTITY_SOLD, AMOUNT_SOLD
FROM SSH.SALES
WHERE TIME_ID >= TO_DATE(' 1998-07-01 00:00:00', 'SYYYY-MM-DD HH24:MI:SS')
UNION ALL
SELECT "prod_id", "cust_id", "time_id", "channel_id", "promo_id", "quantity_sold", "amount_sold"
FROM SSH.SALES@impala
WHERE "time_id" < TO_DATE(' 1998-07-01 00:00:00', 'SYYYY-MM-DD HH24:MI:SS')
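The UNION ALL pattern can be tried locally. The sketch below uses Python's built-in sqlite3 module with two ordinary tables standing in for the Oracle partitions and the offloaded Hadoop data; the names `sales_hot` and `sales_cold` and the sample rows are invented for illustration, and no database link is involved.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Stand-in for the recent partitions kept in the database.
    CREATE TABLE sales_hot  (prod_id INTEGER, time_id TEXT, amount_sold INTEGER);
    -- Stand-in for the offloaded history (SALES@impala in the slide).
    CREATE TABLE sales_cold (prod_id INTEGER, time_id TEXT, amount_sold INTEGER);

    INSERT INTO sales_hot  VALUES (1, '1998-08-15 00:00:00', 10),
                                  (2, '1998-07-01 00:00:00', 20);
    INSERT INTO sales_cold VALUES (1, '1998-03-01 00:00:00', 30),
                                  (3, '1997-12-31 00:00:00', 40);

    -- The cutoff predicates keep the two branches disjoint.
    CREATE VIEW sales AS
        SELECT prod_id, time_id, amount_sold FROM sales_hot
        WHERE time_id >= '1998-07-01 00:00:00'
        UNION ALL
        SELECT prod_id, time_id, amount_sold FROM sales_cold
        WHERE time_id <  '1998-07-01 00:00:00';
""")

total = conn.execute("SELECT SUM(amount_sold) FROM sales").fetchone()[0]
recent = conn.execute(
    "SELECT COUNT(*) FROM sales WHERE time_id >= '1998-07-01 00:00:00'"
).fetchone()[0]
print(total, recent)
```

The application only ever sees the view name, so the cutoff date can move as more partitions are offloaded without touching application SQL.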

Page 44: Hybrid table with selective offloading

SELECT c.cust_gender, SUM(s.amount_sold)
FROM ssh.customers c, sales_v s
WHERE c.cust_id = s.cust_id
GROUP BY c.cust_gender

[Diagram. Plan labels: SELECT STATEMENT, GROUP BY, HASH JOIN, SALES union-all view, TABLE ACCESS FULL, EXTERNAL TABLE ACCESS / DBLINK. Storage labels: Expensive Storage with Hot Data and Dim. Tables; Cheap Scalable Storage (e.g. HDFS + Hive/Impala) made of five Hadoop Nodes]

Page 45: Partially Offloaded Execution Plan

-------------------------------------------------------------------------
| Id | Operation                    | Name      |Pstart| Pstop | Inst   |
-------------------------------------------------------------------------
|  0 | SELECT STATEMENT             |           |      |       |        |
|  1 |  HASH GROUP BY               |           |      |       |        |
|* 2 |   HASH JOIN                  |           |      |       |        |
|  3 |    PARTITION RANGE ALL       |           |     1|    16 |        |
|  4 |     TABLE ACCESS FULL        | CUSTOMERS |     1|    16 |        |
|  5 |    VIEW                      | SALES_V   |      |       |        |
|  6 |     UNION-ALL                |           |      |       |        |
|  7 |      PARTITION RANGE ITERATOR|           |     7|    68 |        |
|  8 |       TABLE ACCESS FULL      | SALES     |     7|    68 |        |
|  9 |      REMOTE                  | SALES     |      |       | IMPALA |
-------------------------------------------------------------------------

   2 - access("C"."CUST_ID"="S"."CUST_ID")

Remote SQL Information (identified by operation id):
----------------------------------------------------

   9 - SELECT `cust_id`,`time_id`,`amount_sold` FROM `SSH`.`SALES` WHERE `time_id`<'1998-07-01 00:00:00' (accessing 'IMPALA' )

Page 46: Performance

• Vendor X: "We load N Terabytes per hour"

• Vendor Y: "We load N Billion rows per hour"

• Q: What datatypes?
  • So much cheaper to load fixed CHARs
  • …but that's not what your real data model looks like.

• Q: How wide rows?
  • 1 wide varchar column or 500 numeric columns?

• Datatype conversion is CPU hungry!

Make sure you know where in the stack you'll be paying the price

And is it once or always when accessed?
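The "once or always" question can be made concrete by counting conversions instead of timing them. The sketch below is a toy model with invented rows and an invented `convert` helper, not a benchmark of any of the tools discussed: it compares re-parsing text on every query with converting once at load time.

```python
from datetime import datetime
from decimal import Decimal
from typing import List, Tuple

counter = {"fields": 0}  # counts individual field conversions


def convert(line: str) -> Tuple[int, datetime, Decimal]:
    """Turn one delimited text line into typed values (hypothetical row layout)."""
    prod_id, time_id, amount = line.split(",")
    counter["fields"] += 3
    return (int(prod_id),
            datetime.strptime(time_id, "%Y-%m-%d %H:%M:%S"),
            Decimal(amount))


text_rows = ["1,1998-01-01 00:00:00,10.50", "2,1998-01-02 00:00:00,4.25"]
queries = 5

# Pay on every access: each query re-parses the text rows.
for _ in range(queries):
    total_each_time = sum(convert(line)[2] for line in text_rows)
always_cost = counter["fields"]

# Pay once: convert at load time, then query the typed rows.
counter["fields"] = 0
typed_rows: List[Tuple[int, datetime, Decimal]] = [convert(l) for l in text_rows]
for _ in range(queries):
    total_once = sum(row[2] for row in typed_rows)
once_cost = counter["fields"]

print(always_cost, once_cost, total_once)
```

With these toy numbers the query-time approach performs rows × columns × queries conversions, while the load-time approach performs rows × columns, which is why it matters whether a tool parses text on each access or reads pre-converted data.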

Page 47: Performance: Data Retrieval Latency

• Cloudera Impala
  • Low latency (subsecond)

• Hive
  • Use latest Hive 0.14+ with all the bells'n'whistles (Stinger, TEZ+YARN)
  • Otherwise you'll wait for jobs and JVMs to start up for every query
  • The next Hortonworks HDP release (2.4?) has Hive LLAP daemons
  • With LLAP, JVM containers will not be started for simple scans/filters

• Oracle SQL Connector for HDFS
  • Multi-second latency due to hdfs_stream JVM startup

• Oracle Big Data SQL
  • Low latency thanks to "Exadata storage cell software on Hadoop"

Time to first row/byte

Page 48: Big Data SQL Performance Considerations

• Only data retrieval (the TABLE ACCESS FULL etc) is offloaded!
  • Filter pushdown etc

• All data processing still happens in the DB layer
  • GROUP BY, JOIN, ORDER BY, Analytic Functions, PL/SQL etc

• Just like on Exadata…
  • Storage cells only speed up data retrieval

Page 49: Thanks!

We are hiring developers & data engineers!!!

http://gluent.com

Also, the usual:

http://[email protected]