
Page 1: Migrating structured data between Hadoop and RDBMS

February 16th 2016 [email protected]

Migrating structured data between Hadoop and RDBMS

Page 2: Migrating structured data between Hadoop and RDBMS

Who am I?

• Full-stack engineer at Squid Solutions.
• Specialised in Big Data.
• Fun fact: I sleep alone in my tent on top of some of the highest mountains in the world.

Page 3: Migrating structured data between Hadoop and RDBMS

What do I do?

• I develop an analytics toolbox.
• No setup. No SQL. No compromise.
• It generates SQL through a REST API.

It is open source! https://github.com/openbouquet

Page 4: Migrating structured data between Hadoop and RDBMS

Topic of today

• You need scalability?
• You need a machine-learning toolbox?

Hadoop is the solution.

• But you still need structured data? Our tool provides a solution.

=> We need both!

Page 5: Migrating structured data between Hadoop and RDBMS

What does that mean?

• Create a dataset in Bouquet.
• Send the dataset to Spark.
• Enrich it inside Spark.
• Re-inject it into the original database.

Page 6: Migrating structured data between Hadoop and RDBMS

How do we do it?

[Diagram: user input goes to Bouquet, which sits between the relational DB and Spark]

Page 7: Migrating structured data between Hadoop and RDBMS

Create and Send

Page 8: Migrating structured data between Hadoop and RDBMS

How does it work?

[Architecture diagram: Relational DB, Bouquet, Kafka, Spark, HDFS/Tachyon, Hive Metastore]

The user selects the data; Bouquet generates the corresponding SQL code.
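For instance, for the ArtistGender dataset shown later in this deck, the generated SQL might look like the following (the artist table and column names are assumed for illustration):

  SELECT gender, COUNT(*) AS count
  FROM artist
  GROUP BY gender;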

Page 9: Migrating structured data between Hadoop and RDBMS

How does it work?

[Same architecture diagram]

Data is read from the SQL database.
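A minimal sketch of such a JDBC read in Scala, assuming a PostgreSQL driver on the classpath and hypothetical connection details (Bouquet's actual reader is more involved):

  import java.sql.DriverManager

  // Hypothetical connection details, for illustration only
  val conn = DriverManager.getConnection(
    "jdbc:postgresql://db-host:5432/music", "user", "secret")
  val stmt = conn.createStatement()
  stmt.setFetchSize(10000) // stream rows instead of buffering the whole result set
  val rs = stmt.executeQuery(
    "SELECT gender, COUNT(*) AS count FROM artist GROUP BY gender")
  while (rs.next()) {
    val gender = rs.getString("gender") // each row becomes one Avro record (next step)
    val count  = rs.getLong("count")
  }
  rs.close(); stmt.close(); conn.close()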

Page 10: Migrating structured data between Hadoop and RDBMS

How does it work?

[Same architecture diagram]

Bouquet creates an Avro schema and sends the data to Kafka.
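A minimal sketch of this step in Scala, assuming the Confluent Avro serializer and schema registry (hostnames and the topic name are assumptions):

  import java.util.Properties
  import org.apache.avro.Schema
  import org.apache.avro.generic.GenericData
  import org.apache.kafka.clients.producer.{KafkaProducer, ProducerRecord}

  val props = new Properties()
  props.put("bootstrap.servers", "kafka:9092")
  props.put("key.serializer", "io.confluent.kafka.serializers.KafkaAvroSerializer")
  props.put("value.serializer", "io.confluent.kafka.serializers.KafkaAvroSerializer")
  props.put("schema.registry.url", "http://schema-registry:8081")

  // The schema from the "How to keep the data structured?" slide
  val schema = new Schema.Parser().parse(
    """{"type": "record", "name": "ArtistGender", "fields": [
      |  {"name": "count", "type": "long"},
      |  {"name": "gender", "type": "string"}]}""".stripMargin)

  val record = new GenericData.Record(schema)
  record.put("count", 42L)
  record.put("gender", "female")

  // One Kafka topic per schema: the topic is named after the dataset
  val producer = new KafkaProducer[AnyRef, AnyRef](props)
  producer.send(new ProducerRecord[AnyRef, AnyRef]("ArtistGender", record))
  producer.close()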

Page 11: Migrating structured data between Hadoop and RDBMS

How does it work?

[Same architecture diagram]

The Kafka broker(s) receive the data.

Page 12: Migrating structured data between Hadoop and RDBMS

How does it work?

[Same architecture diagram]

The Hive metastore is updated and the HDFS connector writes into HDFS.
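A minimal sketch of a Confluent HDFS sink connector configuration with Hive integration enabled, assuming hypothetical hostnames and a bouquet Hive database:

  name=bouquet-hdfs-sink
  connector.class=io.confluent.connect.hdfs.HdfsSinkConnector
  tasks.max=1
  topics=ArtistGender
  hdfs.url=hdfs://namenode:8020
  flush.size=1000
  hive.integration=true
  hive.metastore.uris=thrift://hive-metastore:9083
  hive.database=bouquet
  schema.compatibility=BACKWARD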

Page 13: Migrating structured data between Hadoop and RDBMS

Tachyon?

• Used as an in-memory filesystem to replace HDFS.
• Spark interacts with it through the HDFS plugin.
• Transparent from the user's point of view.
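A minimal sketch of reading one of these datasets back in Spark 1.5 (Scala shell), assuming the spark-avro package and a Tachyon master at tachyon-master:19998 (paths are assumptions):

  // tachyon:// is an HDFS-compatible scheme, so Spark reads it like any Hadoop filesystem
  val df = sqlContext.read
    .format("com.databricks.spark.avro")
    .load("tachyon://tachyon-master:19998/topics/ArtistGender")
  df.registerTempTable("artist_gender")
  sqlContext.sql("SELECT gender, `count` FROM artist_gender").show()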

Page 14: Migrating structured data between Hadoop and RDBMS

How to keep the data structured?

Use a schema registry (Avro in Kafka). Each schema has a corresponding Kafka topic and a distinct Hive table. For example:

{
  "type": "record",
  "name": "ArtistGender",
  "fields": [
    {"name": "count", "type": "long"},
    {"name": "gender", "type": "string"}
  ]
}

Page 15: Migrating structured data between Hadoop and RDBMS

Challenges

- Auto-creation of topics and Hive tables for each dataset coming from Bouquet.
- JDBC reads are too slow for something like Kafka.
- Type conversion issues: for example, null is not supported in all cases (issue 272 on schema-registry); see the union sketch after this list.
- Versions: Kafka 0.9.0, Tachyon 0.7.1, Spark 1.5.2 with HortonWorks 2.3.4 (Dec 2015).
- Hive: setting the warehouse directory.
- Tachyon: setting up the hostname.
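For the null case, the standard Avro workaround is to declare the field as a union with "null", for example:

  {"name": "gender", "type": ["null", "string"], "default": null}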

Page 16: Migrating structured data between Hadoop and RDBMS

Technology choice

• KISS: Kafka + Spark + Tachyon.
• Flexible (Hive, in-memory storage).
• Easily scalable.

Alternatives considered:
• GemFire, SnappyData, or Apache Ignite for in-memory storage.
• Storm for streaming.

Page 17: Migrating structured data between Hadoop and RDBMS

Status

• Injection DB -> Spark: OK
• Spark usage: OK
• Re-injection: in alpha stage

Page 18: Migrating structured data between Hadoop and RDBMS

Re-injection

Two solutions:
• The Spark user notifies Bouquet that the data has changed (using a custom function).
• Bouquet pulls the data from Spark.
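A purely hypothetical sketch of the first solution in Scala; the actual Bouquet endpoint and payload are not shown in this deck:

  import java.net.{HttpURLConnection, URL}

  // Hypothetical REST endpoint telling Bouquet that a dataset changed
  val url = new URL("http://bouquet-host:8080/api/datasets/artist_gender/refresh")
  val conn = url.openConnection().asInstanceOf[HttpURLConnection]
  conn.setRequestMethod("POST")
  println("Bouquet notified: HTTP " + conn.getResponseCode)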

Page 19: Migrating structured data between Hadoop and RDBMS

We use it for real!

We are collaborating with La Poste, using Spark and the re-injection mechanism to combine Bouquet with a geographical visualisation.

Page 20: Migrating structured data between Hadoop and RDBMS

In the future

• Notebook integration.
• We have a DSL for the Bouquet API; we may want built-in Spark support.
• Improve scalability (bulk unload and Kafka fine-tuning).

Page 21: Migrating structured data between Hadoop and RDBMS

QUESTIONS?

OPENBOUQUET.IO