There and back again, or how to connect Oracle and Big Data.
GLEB OTOCHKIN
Gleb Otochkin
Started to work with data in 1992
At Pythian since 2008
Area of expertise:
● Data Integration
● Oracle RAC
● Oracle engineered systems
● Virtualization
● Performance tuning
● Big Data
@sky_vst
Principal Consultant
© The Pythian Group Inc., 2017 2
ABOUT PYTHIAN
Pythian’s 400+ IT professionals help companies adopt and manage disruptive technologies to better compete
© 2016 Pythian. Confidential 3
EXPERIENCED: 11,800 systems currently managed by Pythian
GLOBAL: 400 Pythian experts in 35 countries
EXPERTS: 2 millennia of experience gathered and shared over 19 years
Journey of our data begins here.
AGENDA
Big Data?
Big Data ecosystem
Big Data? Here in Texas we call it just Data.
Big Data ecosystem
[Word cloud: BIG, VOLUME, GROWS, COMPLEX, ANY STRUCTURE, PROCESSING, ANALYSIS, DISCOVERY]
Some Big Data tools and terms.
Some terms and tools used in Big Data:
• HDFS - Hadoop Distributed File System.
• Apache Hadoop - framework for distributed storage and processing.
• HBase - non-relational, distributed database.
• Kafka - stream-processing data platform.
• Flume - framework for streaming and aggregating data.
• Cassandra - open-source distributed NoSQL database.
• MongoDB - open-source cross-platform document-oriented database.
• Hive - data query and analysis framework on top of Hadoop.
• Avro - data format widely used in BD (JSON + binary).
Business cases
Why do we need replication?
Business cases.
Business cases. Operational activity on the DB, with the rest on a Data Lake.
[Diagram: RDBMS (OLTP), RDBMS (OLTP), BD platform, Data Preparation Engine, BI & Analysis]
Business cases. Operational activity on the DB, with data in Big Data and BI on an RDBMS.
[Diagram: RDBMS (OLTP), RDBMS (OLTP), BD platform, Data Preparation Engine, BI & Analysis (RDBMS)]
Business cases. Operational and BI activity on the DB, with the main body of data in a Big Data platform.
[Diagram: RDBMS (OLTP), RDBMS (OLTP), BD platform, ODI & BD SQL, BI & Analysis]
Business cases. Operational and BI activity on the DB, with the main body of data in a Big Data platform.
[Diagram: RDBMS (OLTP), RDBMS (OLTP), Kafka, BD platforms, BI & Analysis, RDBMS]
Business cases. Operational and BI activity on the DB, with the main body of data in a Big Data platform.
[Diagram: RDBMS (OLTP), RDBMS (OLTP), Kafka, Stream processing, BD platforms, BI & Analysis]
Business cases. Short summary.
• How we connect different platforms:
  • Data replication tools:
    ▪ Oracle GoldenGate.
    ▪ DBVisit.
    ▪ Shareplex.
    ▪ ELT tools.
  • Batch load tools:
    ▪ Sqoop.
    ▪ Oracle loader for Hadoop.
  • Data Integration tools:
    ▪ Oracle Data Integrator.
  • Presenting Big Data to RDBMS:
    ▪ Oracle Big Data SQL.
    ▪ Gluent.
• Reasons:
  • Growing Data Volume.
  • Data retention policy.
  • Cost.
  • New options and API.
Oracle to Big Data.
Real-time replication of OLTP data to a Big Data platform.
Replication to Big Data. Why GoldenGate?
Oracle GoldenGate:
• Pros:
  • Real-time streaming.
  • No or minimal impact on the source.
  • Supports most data types.
  • Different formats and conversion.
  • Enterprise support.
• Cons:
  • Licensing fee.
  • Closed source code.
Replication to Big Data. Replication to HDFS by Oracle GoldenGate.
[Diagram: Apps, Database, Tran Log, Oracle GoldenGate, OGG Trail, Oracle GoldenGate, HDFS]
Replication to Big Data. Source side.
• Oracle 11.2.0.4 and up.
• Archivelog mode.
• Supplemental logging:
  • Minimal on DB level.
  • On schema or table level.
• OGG user in database.
[Diagram: Apps, Database, Tran Log]
Replication to Big Data. Oracle GoldenGate.
• OGG 12.2.0.1 and up.
• BD adapters with replicat.
• Supported BD targets:
  • HDFS
  • Kafka
  • Flume
  • HBase
  • MongoDB (from 12.3)
  • Cassandra (from 12.3)
• Different formats.
• With the latest OGG for BD:
  • JDBC, Elasticsearch, Kinesis, Kafka Connect ...
[Diagram: Tran Log, Manager, Extract, OGG Trail, Data Pump, OGG Trail, Manager, Replicat]
Replication to Big Data. To HDFS.
• OGG 12.2.0.1 and up.
• BD adapters with replicat.
• Different formats.
• DML and DDL.
[Diagram: OGG Trail, Manager, Replicat, HDFS Client, HDFS]
Replication to Big Data. Test two.
• New columns:
  • Operation type.
  • Table name.
  • Local and UTC timestamp.
orcl> select * from ggtest.test_tab_2;
ID RND_STR USE_DATE
1 BGBXRKJL 02/13/16 08:34:19
2 FNMCEPWE 08/17/15 04:50:18
hive> select * from BDTEST.TEST_TAB_2;
OK
I BDTEST.TEST_TAB_2 2016-10-25 01:09:16.000168 2016-10-24T21:09:21.186000 00000000120000004759 1 BGBXRKJL 2016-02-13:08:34:19 NULL
I BDTEST.TEST_TAB_2 2016-10-25 01:09:16.000168 2016-10-24T21:09:22.827000 00000000120000004921 2 FNMCEPWE 2015-08-17:04:50:18 NULL
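The extra leading columns in the Hive output above can be separated from the replicated table columns with a few lines of code. The sketch below is only an illustration: it assumes the fields arrive tab-separated (the real delimiter depends on how the handler is configured), and the names `ChangeRecord`, `parse_change_record` and the field labels are hypothetical. The fifth value is treated as a position-like marker, which is an assumption based on how it looks in the output.

```python
from typing import List, NamedTuple


class ChangeRecord(NamedTuple):
    """One replicated change: metadata columns first, table columns after."""
    op_type: str            # operation type, e.g. I, U or D
    table_name: str         # schema-qualified table name
    first_timestamp: str    # first of the two timestamps in the record
    second_timestamp: str   # second of the two timestamps in the record
    position: str           # position-like marker (assumed meaning)
    columns: List[str]      # the replicated column values


def parse_change_record(line: str, delimiter: str = "\t") -> ChangeRecord:
    """Split one delimited text line into metadata and column values."""
    fields = line.rstrip("\n").split(delimiter)
    if len(fields) < 5:
        raise ValueError(f"expected at least 5 metadata fields, got {len(fields)}")
    op_type, table_name, first_ts, second_ts, position = fields[:5]
    return ChangeRecord(op_type, table_name, first_ts, second_ts, position, fields[5:])


# Sample built from the first row of the Hive output, joined with tabs.
sample = "\t".join([
    "I", "BDTEST.TEST_TAB_2",
    "2016-10-25 01:09:16.000168", "2016-10-24T21:09:21.186000",
    "00000000120000004759",
    "1", "BGBXRKJL", "2016-02-13:08:34:19", "NULL",
])
record = parse_change_record(sample)
print(record.op_type, record.table_name, record.columns)
```

Keeping the metadata apart from the column values makes it easier to filter by operation type or by table before the data is used downstream.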
Replication to Big Data. Journal of changes instead of a state.
Operation      Database   HDFS
Insert row 1   Row 1      I | Row 1
Insert row 2   Row 2      I | Row 2
Insert row 3   Row 3      I | Row 3
Update row 3   Row 3      U | Row 3
Delete row 2   -          D | Row 2
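Because HDFS receives a journal of changes rather than the current state, a consumer that wants the state has to replay the journal. A minimal sketch of that replay follows; the tuple layout `(operation, key, value)` and the name `replay_journal` are assumptions made for illustration, not a format produced by any tool.

```python
from typing import Dict, Iterable, Tuple


def replay_journal(journal: Iterable[Tuple[str, int, str]]) -> Dict[int, str]:
    """Apply I/U/D records in order and return the resulting state."""
    state: Dict[int, str] = {}
    for op, key, value in journal:
        if op in ("I", "U"):
            # Inserts and updates both leave the latest value for the key.
            state[key] = value
        elif op == "D":
            # Deletes remove the key; a missing key is ignored.
            state.pop(key, None)
        else:
            raise ValueError(f"unknown operation type: {op!r}")
    return state


# The same sequence as on the slide: three inserts, one update, one delete.
journal = [
    ("I", 1, "Row 1"),
    ("I", 2, "Row 2"),
    ("I", 3, "Row 3"),
    ("U", 3, "Row 3 (updated)"),
    ("D", 2, "Row 2"),
]
current_state = replay_journal(journal)
print(current_state)
```

The journal holds five records while the replayed state holds two rows, which is the difference between the HDFS side and the database side.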
Replication to Big Data. To Kafka.
[Diagram: OGG Trail, Manager, Replicat, Kafka producer, Broker, Kafka topics, Zookeeper, Stream, Kafka Consumers]
Replication to Big Data. Some notes about Kafka handler and others:
• Different topologies for Kafka.
• Topic partitioning to table.
• Topics are created automatically.
• Operation vs transactional mode.
• Blocking vs non-blocking.
• Version 0.9+.
• Schema definition can go to a different topic (Avro).
Replication to Big Data. Some notes about source:
• We are replicating a log of changes.
• Only committed transactions.
• “Passive” commit.
• Different levels of supplemental logging may lead to different captured data (compressed DML).
• DDL only with following DML.
• CTAS is not supported.
and BD destination:
• JSON, Text, Avro (different types) and XML formats.
• Some strange behaviour with Avro.
• Hive support through text format.
• Trimming out leading or trailing whitespace.
• Truncates only as DML.
• Proper path to Java classes.
Kafka Connect and Kafka Streams. Confluent team and their additions.
• Kafka Connect.
  • Framework to build connectors.
  • https://www.confluent.io/product/connectors/
  • Main purpose is to copy data to and from Kafka.
• Kafka Streams and KSQL (beta).
  • Incorporated into Kafka and works together with it.
  • https://www.confluent.io/product/kafka-streams/
  • Input data are transformed to output data.
Replication to Big Data. To Flume.
[Diagram: OGG Trail, Manager, Replicat, Flume agent (Flume source, Flume channel, Flume sink), HDFS]
Replication to Big Data. Some notes about Flume handler and others:
• Configuration file path.
• Failover and load balancing (from 1.6.x for Avro).
• Kerberos security (from 1.6.x for Thrift).
• Version 1.4+ (1.6.x is better).
• Avro and Thrift formats.
• Schema changes go to a different Flume event.
Replication to Big Data. To HBase. MongoDB. Cassandra.
[Diagram: OGG Trail, Manager, Replicat, HBase client, HBase, HDFS]
Replication to Big Data. Primary key for a source table in HBase.
orcl> alter table ggtest.test_tab_2 add constraint pk_test_tab_2 primary key (pk_id);
Table altered.
orcl> insert into ggtest.test_tab_2 values(9,'PK_TEST',sysdate,null);
• Having supplemental logging for all columns:
  • Row id as concatenation of values for all columns.
• Adding a primary key and supplemental logging for keys:
  • Row id is primary key.
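The two row id cases can be illustrated with a short sketch. It is not the HBase handler's code: the function name `hbase_row_key` is hypothetical, and the `|` separator is taken from the row keys visible in the HBase scan output in this deck.

```python
from typing import Dict, Optional, Sequence


def hbase_row_key(
    row: Dict[str, object],
    key_columns: Optional[Sequence[str]] = None,
    separator: str = "|",
) -> str:
    """Build a row key from the key columns, or from all columns if none are given."""
    # Without a primary key, every column value takes part in the row key.
    columns = list(key_columns) if key_columns else list(row.keys())
    return separator.join(str(row[name]) for name in columns)


# Case 1: no primary key, supplemental logging for all columns.
row_without_pk = {"PK_ID": 7, "RND_STR_1": "IJWQRO7T", "ACC_DATE": "2013-07-07:08:13:52"}
key_all_columns = hbase_row_key(row_without_pk)

# Case 2: primary key defined, supplemental logging for keys only.
row_with_pk = {"PK_ID": 9, "RND_STR_1": "PK_TEST", "ACC_DATE": "2016-05-05:14:44:19"}
key_primary = hbase_row_key(row_with_pk, key_columns=["PK_ID"])

print(key_all_columns)
print(key_primary)
```

A key built from all columns changes whenever any column changes, so an update then lands in a new HBase row; with a primary key the row id stays stable.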
Replication to Big Data. Two cases. All columns supplemental logging and PK supplemental logging.
hbase(main):012:0> scan 'BDTEST:TEST_TAB_2'
ROW COLUMN+CELL
7|IJWQRO7T|2013-07-07:08:13:52 column=cf:ACC_DATE, timestamp=1459275116849, value=2013-07-07:08:13:52
7|IJWQRO7T|2013-07-07:08:13:52 column=cf:PK_ID, timestamp=1459275116849, value=7
7|IJWQRO7T|2013-07-07:08:13:52 column=cf:RND_STR_1, timestamp=1459275116849, value=IJWQRO7T
8|TEST_INS1|2016-03-29:15:14:37|TEST_ALTER column=cf:ACC_DATE, timestamp=1459278884047, value=2016-03-29:15:14:37
8|TEST_INS1|2016-03-29:15:14:37|TEST_ALTER column=cf:PK_ID, timestamp=1459278884047, value=8
8|TEST_INS1|2016-03-29:15:14:37|TEST_ALTER column=cf:RND_STR_1, timestamp=1459278884047, value=TEST_INS1
8|TEST_INS1|2016-03-29:15:14:37|TEST_ALTER column=cf:TEST_COL, timestamp=1459278884047, value=TEST_ALTER
9 column=cf:ACC_DATE, timestamp=1462473865704, value=2016-05-05:14:44:19
9 column=cf:PK_ID, timestamp=1462473865704, value=9
9 column=cf:RND_STR_1, timestamp=1462473865704, value=PK_TEST
9 column=cf:TEST_COL, timestamp=1462473865704, value=NULL
Replication to Big Data. Some notes about HBase handler and others:
• hbase-site.xml in the classpath.
• GROUPTRANSOPS batching for performance.
• Kerberos security (from 1.6.x for Thrift).
• Version 1.1+ (0.98 with compat parameter).
• HBase row key mapping.
• Only a single column family.
Replication to Big Data. Some notes about Cassandra handler:
• Does not automatically create keyspaces (but creates tables).
• Primary keys are immutable.
• Inserts and updates are different than in traditional databases.
• Primary key updates are deletes and inserts (watch compressed updates in OGG).
and MongoDB:
• Does not allow the _id column to be modified.
• Undo for bulk transactions (requires full before image for deletes in OGG).
• Primary key updates are deletes and inserts (watch compressed updates in OGG).
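The note about primary key updates can be sketched as follows. This is an illustration of the idea only, not the handlers' implementation: `expand_key_update` and the dictionary row images are hypothetical. It also shows why compressed updates matter: if the after image carries only the changed columns, the new row can be rebuilt only when a full before image is available.

```python
from typing import Dict, List, Sequence, Tuple

Row = Dict[str, object]


def expand_key_update(
    before: Row, after: Row, key_columns: Sequence[str]
) -> List[Tuple[str, Row]]:
    """Turn an update into operations a store with immutable keys can apply."""
    key_changed = any(
        name in after and after[name] != before.get(name) for name in key_columns
    )
    if not key_changed:
        # The key is untouched, so a plain update is enough.
        return [("U", after)]
    # The key changed: delete the old row and insert a complete new one.
    # Merging relies on `before` being a full image of the old row.
    new_row = {**before, **after}
    return [("D", before), ("I", new_row)]


# A compressed update: the after image holds only the changed key column.
operations = expand_key_update(
    before={"id": 1, "name": "a"},
    after={"id": 2},
    key_columns=["id"],
)
print(operations)
```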
Other tools. What can be used as an alternative.
• DBvisit.
  • Replicate connector to Kafka.
  • Confluent version of Kafka.
  • http://www.dbvisit.com/replicate_connector_for_kafka
  • Has similar functionality to Oracle GoldenGate.
• Dell Shareplex.
  • Replication to Kafka.
  • Supports most datatypes.
  • http://documents.software.dell.com/SharePlex/8.6.5/release-notes/system-requirements
  • https://documents.software.dell.com/shareplex/8.6.4/administration-guide/configure-replication-to-open-target-targets/configure-replication-to-a-kafka-target
And what is wrong with Sqoop?
Batch offloading from Oracle to Big Data. To HDFS.
• Sqoop is one of the oldest.
• Works both ways.
• Gluent is more than just a batch tool.
[Diagram: Database, Sqoop, HDFS]
Replication to Big Data. Sqoop import.
[Diagram: Apps, Database, JDBC, Sqoop, MapReduce, HDFS, Hive]
Generates MapReduce jobs. Connects through JDBC. Good for batch processing.
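A parallel Sqoop import divides the table among mappers by ranges of a split column, and each mapper reads its own range over JDBC. The sketch below illustrates that idea in a simplified form; it is not Sqoop's code, the function names are hypothetical, and the table and column names are reused from the examples in this deck.

```python
from typing import List, Tuple


def split_ranges(min_id: int, max_id: int, num_mappers: int) -> List[Tuple[int, int]]:
    """Divide the inclusive range [min_id, max_id] into contiguous chunks."""
    if num_mappers < 1:
        raise ValueError("num_mappers must be at least 1")
    if max_id < min_id:
        return []
    total = max_id - min_id + 1
    base, extra = divmod(total, num_mappers)
    ranges: List[Tuple[int, int]] = []
    start = min_id
    for i in range(num_mappers):
        # The first `extra` chunks take one more value each.
        size = base + (1 if i < extra else 0)
        if size == 0:
            break  # fewer values than mappers: remaining chunks are empty
        end = start + size - 1
        ranges.append((start, end))
        start = end + 1
    return ranges


def mapper_queries(
    table: str, split_column: str, min_id: int, max_id: int, num_mappers: int
) -> List[str]:
    """Build one bounded query per mapper."""
    return [
        f"SELECT * FROM {table} WHERE {split_column} >= {lo} AND {split_column} <= {hi}"
        for lo, hi in split_ranges(min_id, max_id, num_mappers)
    ]


for query in mapper_queries("ggtest.test_tab_2", "pk_id", 1, 100, 4):
    print(query)
```

Every run re-reads the selected ranges from the database, which is why this approach suits batch loads rather than continuous replication.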
Back again
Using data from Big Data in Oracle
From Big Data to Oracle.
Tool | Filtering | Data Conversion | Query / Offload query / Parallel
Sqoop | Oracle | Oracle | Yes
ODBC gateway | Hadoop | Oracle | Yes Yes
Oracle loader for HDFS | Oracle | Hadoop | Yes
Oracle SQL Connector for HDFS | Oracle | Oracle | Yes Yes
BD SQL | Hadoop | Hadoop | Yes Yes Yes
ODI | KM | KM | Yes Yes
Gluent | Hadoop | Hadoop | Yes Yes Yes
Oracle Data Integrator. ODI vs traditional ETL.
[Diagram: traditional ETL (Source I, Source II, intermediate staging and transformation, Target; Extract, Transform, Load) compared with ODI (Source I, Source II, load engine, Target; Extract, Load, Transform)]
Oracle Data Integrator. How it looks.
[Diagram: ODI Studio, Master repository, Work repository, Weblogic, ODI Agent, Oracle DB I, Oracle DB II, HDFS, CSV text, Staging schema, Target schema]
Oracle Data Integrator. ODI workflow.
[Diagram: Designer (Project I, Project II, data models); Logical view (Logical schema I, Logical schema II); Context (Prod Batch, Prod Stream, Test); Physical (PROD, PROD Stream, Test env)]
Oracle Data Integrator. ODI logical mapping
• Logical mapping is separated from physical implementation.
Oracle Data Integrator. ODI physical mapping.
• Different physical mappings for the same logical mapping.
• Different knowledge modules.
Oracle Data Integrator (ODI).
• Knowledge modules for most platforms.
• Uses filters, mappings, joins and constraints.
• Batch or event (stream) oriented integration.
• ELT:
  • Extract.
  • Load.
  • Transform.
• Logical and physical models separation.
• Knowledge modules:
  • RKM - reverse engineering.
  • LKM - load.
  • IKM - integration.
Oracle Big Data SQL Architecture.
[Diagram: Oracle Database, BD SQL, Hadoop, BD SQL agent/service, CDH or HDP management server]
Exadata technology for Big Data
Oracle Big Data SQL. What it can do for us.
• Smart scan:
  • Storage indexes.
  • Bloom filters.
• Works with:
  • Apache Hive.
  • HDFS.
  • Oracle NoSQL Database.
  • Apache HBase.
• Fully supports SQL syntax.
• Predicate push down.
Oracle Big Data SQL. Views and packages to use.
• Smart scan:
  • Storage indexes.
  • Bloom filters.
orcl> select cluster_id,database_name, owner, table_name from all_hive_tables where database_name='bdtest';
CLUSTER_ID DATABASE_NAME OWNER TABLE_NAME
bigdatalite bdtest oracle test_tab_1
bigdatalite bdtest oracle test_tab_2
PROCEDURE CREATE_EXTDDL_FOR_HIVE
Argument Name Type In/Out Default?
------------------------------ ----------------------- ------ --------
CLUSTER_ID VARCHAR2 IN
DB_NAME VARCHAR2 IN
HIVE_TABLE_NAME VARCHAR2 IN
HIVE_PARTITION PL/SQL BOOLEAN IN
TABLE_NAME VARCHAR2 IN
PERFORM_DDL PL/SQL BOOLEAN IN DEFAULT
TEXT_OF_DDL CLOB OUT
Oracle Big Data SQL. Building the external table.
• Hive table can be queried in Oracle now.
CREATE TABLE test_tab_2 ( tran_flag VARCHAR2(4000), tab_name VARCHAR2(4000), ………..)
ORGANIZATION EXTERNAL (TYPE ORACLE_HIVE DEFAULT DIRECTORY DEFAULT_DIR ACCESS PARAMETERS
( com.oracle.bigdata.cluster=bigdatalite
com.oracle.bigdata.tablename=bdtest.test_tab_2) ) PARALLEL 2 REJECT LIMIT UNLIMITED
TRAN_FLAG ID RND_STR USE_DATE
---------- ---------- ---------- --------------------
I 1 BGBXRKJL 2016-02-13:08:34:19
I 2 FNMCEPWE 2015-08-17:04:50:18
I 1 BGBXRKJL 2016-02-13:08:34:19
Oracle Big Data SQL. What it can do for us.
• External table FULL access.
Oracle Big Data SQL. Options for BD SQL.
• Oracle Copy to Hadoop utility.
  • Using Data Pump format.
  • ORACLE_DATAPUMP access driver.
  • CTAS to external table.
  • hadoop fs -put …
  • Hive “INPUTFORMAT 'oracle.hadoop.hive.datapump.DPInputFormat'”
  • Oracle Shell for Hadoop Loaders.
• ACCESS PARAMETERS:
  • com.oracle.bigdata.overflow
  • com.oracle.bigdata.tablename
  • com.oracle.bigdata.colmap
  • com.oracle.bigdata.erroropt
• Directly to HDFS:
  • TYPE oracle_hdfs.
Other tools. What can be used as an alternative.
• Gluent Data Platform.
  • Can offload data to Hadoop.
  • Allows querying data from any data source in Hadoop.
  • Offload engine.
  • Supports Cloudera, Hortonworks or any Hadoop with Impala or Hive SQL engine installed.
  • https://gluent.com/
And here we have returned back.
Q&A