
Page 1: Migrating structured data between Hadoop and RDBMS

February 16th 2016 [email protected]

Migrating structured data between Hadoop and RDBMS

Page 2: Migrating structured data between Hadoop and RDBMS

Who am I?

• Full-stack engineer at Squid Solutions.
• Specialised in Big Data.
• Fun fact: I sleep alone in my tent on top of some of the highest mountains in the world.

Page 3: Migrating structured data between Hadoop and RDBMS

What do I do?

• I develop an analytics toolbox.
• No setup. No SQL. No compromise.
• It generates SQL through a REST API.

It is open source! https://github.com/openbouquet

Page 4: Migrating structured data between Hadoop and RDBMS

Topic of today

• You need scalability?
• You need a machine-learning toolbox?

Hadoop is the solution.

• But you still need structured data? Our tool provides a solution.

=> We need both!

Page 5: Migrating structured data between Hadoop and RDBMS

What does that mean?

• Create a dataset in Bouquet.
• Send the dataset to Spark.
• Enrich it inside Spark.
• Re-inject it into the original database.

Page 6: Migrating structured data between Hadoop and RDBMS

How do we do it?

[Diagram: user input goes to Bouquet, which sits between the relational DB and Spark]

Page 7: Migrating structured data between Hadoop and RDBMS

Create and Send

Page 8: Migrating structured data between Hadoop and RDBMS

How does it work?

[Architecture diagram: Relational DB, Bouquet, Kafka, Spark, HDFS/Tachyon, Hive Metastore]

The user selects the data; Bouquet generates the corresponding SQL code.
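For instance, for the ArtistGender dataset shown later in this deck, the generated SQL might look like the following (the artist table and column names are assumed for illustration):

  SELECT gender, COUNT(*) AS count
  FROM artist
  GROUP BY gender;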

Page 9: Migrating structured data between Hadoop and RDBMS

How does it work?

[Same architecture diagram]

Data is read from the SQL database.
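A minimal sketch of such a JDBC read in Scala, assuming a PostgreSQL driver on the classpath and hypothetical connection details (Bouquet's actual reader is more involved):

  import java.sql.DriverManager

  // Hypothetical connection details, for illustration only
  val conn = DriverManager.getConnection(
    "jdbc:postgresql://db-host:5432/music", "user", "secret")
  val stmt = conn.createStatement()
  stmt.setFetchSize(10000) // stream rows instead of buffering the whole result set
  val rs = stmt.executeQuery(
    "SELECT gender, COUNT(*) AS count FROM artist GROUP BY gender")
  while (rs.next()) {
    val gender = rs.getString("gender") // each row becomes one Avro record (next step)
    val count  = rs.getLong("count")
  }
  rs.close(); stmt.close(); conn.close()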

Page 10: Migrating structured data between Hadoop and RDBMS

How does it work?

[Same architecture diagram]

Bouquet creates an Avro schema and sends the data to Kafka.
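A minimal sketch of this step in Scala, assuming the Confluent Avro serializer and schema registry (hostnames and the topic name are assumptions):

  import java.util.Properties
  import org.apache.avro.Schema
  import org.apache.avro.generic.GenericData
  import org.apache.kafka.clients.producer.{KafkaProducer, ProducerRecord}

  val props = new Properties()
  props.put("bootstrap.servers", "kafka:9092")
  props.put("key.serializer", "io.confluent.kafka.serializers.KafkaAvroSerializer")
  props.put("value.serializer", "io.confluent.kafka.serializers.KafkaAvroSerializer")
  props.put("schema.registry.url", "http://schema-registry:8081")

  // The schema from the "How to keep the data structured?" slide
  val schema = new Schema.Parser().parse(
    """{"type": "record", "name": "ArtistGender", "fields": [
      |  {"name": "count", "type": "long"},
      |  {"name": "gender", "type": "string"}]}""".stripMargin)

  val record = new GenericData.Record(schema)
  record.put("count", 42L)
  record.put("gender", "female")

  // One Kafka topic per schema: the topic is named after the dataset
  val producer = new KafkaProducer[AnyRef, AnyRef](props)
  producer.send(new ProducerRecord[AnyRef, AnyRef]("ArtistGender", record))
  producer.close()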

Page 11: Migrating structured data between Hadoop and RDBMS

How does it work?

[Same architecture diagram]

The Kafka broker(s) receive the data.

Page 12: Migrating structured data between Hadoop and RDBMS

How does it work?

[Same architecture diagram]

The Hive metastore is updated and the HDFS connector writes into HDFS.
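A minimal sketch of a Confluent HDFS sink connector configuration with Hive integration enabled, assuming hypothetical hostnames and a bouquet Hive database:

  name=bouquet-hdfs-sink
  connector.class=io.confluent.connect.hdfs.HdfsSinkConnector
  tasks.max=1
  topics=ArtistGender
  hdfs.url=hdfs://namenode:8020
  flush.size=1000
  hive.integration=true
  hive.metastore.uris=thrift://hive-metastore:9083
  hive.database=bouquet
  schema.compatibility=BACKWARD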

Page 13: Migrating structured data between Hadoop and RDBMS

Tachyon?

• Used as an in-memory filesystem to replace HDFS.
• Spark interacts with it through the HDFS plugin.
• Transparent from the user's point of view.
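A minimal sketch of reading one of these datasets back in Spark 1.5 (Scala shell), assuming the spark-avro package and a Tachyon master at tachyon-master:19998 (paths are assumptions):

  // tachyon:// is an HDFS-compatible scheme, so Spark reads it like any Hadoop filesystem
  val df = sqlContext.read
    .format("com.databricks.spark.avro")
    .load("tachyon://tachyon-master:19998/topics/ArtistGender")
  df.registerTempTable("artist_gender")
  sqlContext.sql("SELECT gender, `count` FROM artist_gender").show()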

Page 14: Migrating structured data between Hadoop and RDBMS

How to keep the data structured?

Use a schema registry (Avro in Kafka). Each schema has a corresponding Kafka topic and a distinct Hive table. For example:

{
  "type": "record",
  "name": "ArtistGender",
  "fields": [
    {"name": "count", "type": "long"},
    {"name": "gender", "type": "string"}
  ]
}

Page 15: Migrating structured data between Hadoop and RDBMS

Challenges

- Auto-creation of topics and Hive tables for each dataset coming from Bouquet.
- JDBC reads are too slow for something like Kafka.
- Type conversion issues: for example, null is not supported in all cases (issue 272 on schema-registry); see the union sketch after this list.
- Versions: Kafka 0.9.0, Tachyon 0.7.1, Spark 1.5.2 with HortonWorks 2.3.4 (Dec 2015).
- Hive: setting the warehouse directory.
- Tachyon: setting up the hostname.
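For the null case, the standard Avro workaround is to declare the field as a union with "null", for example:

  {"name": "gender", "type": ["null", "string"], "default": null}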

Page 16: Migrating structured data between Hadoop and RDBMS

Technology choice

• KISS: Kafka + Spark + Tachyon.
• Flexible (Hive, in-memory storage).
• Easily scalable.

Alternatives considered:
• GemFire, SnappyData, or Apache Ignite for in-memory storage.
• Storm for streaming.

Page 17: Migrating structured data between Hadoop and RDBMS

Status

• Injection DB -> Spark: OK
• Spark usage: OK
• Re-injection: in alpha stage

Page 18: Migrating structured data between Hadoop and RDBMS

Re-injection

Two solutions:
• The Spark user notifies Bouquet that the data has changed (using a custom function).
• Bouquet pulls the data from Spark.
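A purely hypothetical sketch of the first solution in Scala; the actual Bouquet endpoint and payload are not shown in this deck:

  import java.net.{HttpURLConnection, URL}

  // Hypothetical REST endpoint telling Bouquet that a dataset changed
  val url = new URL("http://bouquet-host:8080/api/datasets/artist_gender/refresh")
  val conn = url.openConnection().asInstanceOf[HttpURLConnection]
  conn.setRequestMethod("POST")
  println("Bouquet notified: HTTP " + conn.getResponseCode)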

Page 19: Migrating structured data between Hadoop and RDBMS

We use it for real!

We are collaborating with La Poste, using Spark and the re-injection mechanism to combine Bouquet with a geographical visualisation.

Page 20: Migrating structured data between Hadoop and RDBMS

In the future

• Notebook integration.
• We have a DSL for the Bouquet API; we may want built-in Spark support.
• Improve scalability (bulk unload and Kafka fine-tuning).

Page 21: Migrating structured data between Hadoop and RDBMS

QUESTIONS?

OPENBOUQUET.IO