BIG DATA REVOLUTION IN JOBRAPIDO

Michele Pinto – Big Data Technical Team Leader @ Jobrapido

Big Data Tech 2016 – Firenze - October 20, 2016


Page 2

ABOUT ME

NAME: Michele Pinto

COMPANY WEBSITE: www.jobrapido.com

LINKEDIN: https://www.linkedin.com/in/pintomichele

Page 3

WHO WE ARE

Jobrapido is the world's leading job search engine. It analyses and collects all job posts on the web, giving jobseekers all available offers, ordered by relevance to the search they have done.

Analysis

Aggregation

Response

WEBSITES IN 58 COUNTRIES: Head office in Milan + office in Amsterdam

UNIQUE VISITORS: 35 million UVs / month

SUBSCRIBERS: 70+ million subscribed users (current stock)

PAGEVIEWS / CLICKS*: 280 million PVs / month & 130 million clicks / month

JOBS: 20+ million jobs at any given time

VISITORS: 1.0 billion visits / year

PEOPLE: 100+

* Clicks on job listings (organic + sponsored) and clicks on contextual ads

Page 5

MOBILE APP

[App screenshots: MY SEARCHES, MY JOBS, MENU, CNT SELECTION, SIGN UP, SIGN IN]

Page 6

WHERE WE ARE

Page 7

THE NEED FOR A BIG DATA ARCHITECTURE (1/2)

Page 8

THE NEED FOR A BIG DATA ARCHITECTURE (2/2)

MAIN FEATURES:

• SCALE throughput and computational power in line with the data growth rate

• Unify the tracking layer in a single TRACKING PLATFORM

• Store all data in, and extract data for analytics from, a single DATA LAKE

• REAL-TIME DATA INGESTION into our Data Warehouse

• Drastically REDUCE COMPLEXITY and MAINTENANCE

Page 9

TRACKING PLATFORM

Page 10

WHY A NEW TRACKING PLATFORM (TP)?

• Obtain a single, simple and scalable Tracking Layer

• Everyone in Jobrapido should be able to design, track and query their own events

• Tracking phase and data processing phase totally decoupled

• Incoming events queryable and processable in real time

• Remove any bottleneck during the event tracking process

Page 11

TP: ARCHITECTURAL OVERVIEW

Page 12

TP TECHNOLOGIES – AVRO (1/3)

Data serialization system that provides a compact, fast, binary data format (avro.apache.org)

MAIN FEATURES:

• Serialization into Avro/Binary or Avro/JSON

• Support for schema evolution: the schema used to read a file does not need to match the schema used to write the file

• Self-documenting: stores schema in file header

• Rich schema language defined in JSON

• Compressible and splittable (good for Spark and MapReduce)

• Can generate Java objects from schemas
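The schema-evolution point can be illustrated with a small sketch. This is not the Avro library: it is a plain-Python toy that mimics Avro's resolution idea of matching record fields by name, using the reader's default for a field the writer did not have and ignoring writer fields the reader does not know. The field names (`job_id`, `country`, `legacy_flag`) are hypothetical.

```python
# Toy illustration of Avro-style schema resolution (NOT the Avro library).
# Field names and defaults below are hypothetical.

MISSING = object()  # sentinel: the reader schema declares no default for this field


def resolve(record, reader_fields):
    """Project a record written with one schema onto a different reader schema.

    reader_fields maps field name -> default value (or MISSING if no default).
    """
    resolved = {}
    for name, default in reader_fields.items():
        if name in record:
            resolved[name] = record[name]   # field known to both schemas
        elif default is not MISSING:
            resolved[name] = default        # field added later: use reader default
        else:
            raise ValueError(f"no value and no default for field {name!r}")
    return resolved                         # writer-only fields are dropped


# Record written with an older schema that still had legacy_flag.
written_v1 = {"job_id": 42, "legacy_flag": True}

# Newer reader schema: drops legacy_flag, adds country with a default.
reader_v2 = {"job_id": MISSING, "country": "unknown"}

print(resolve(written_v1, reader_v2))  # {'job_id': 42, 'country': 'unknown'}
```

In this toy, old data stays readable after the schema changes, which is the property that lets producers and consumers evolve independently.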

Page 13

TP TECHNOLOGIES – AVRO (2/3)

EVERYTHING IS AN EVENT = HEADER + BODY

• Every event has the same header, containing some “technical” fields

• Event types differ only in the body; the tracker fills in only the body attributes
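A minimal sketch of the "header + body" idea follows. The header field names used here (`event_id`, `event_type`, `timestamp_ms`, `source`) are assumptions for illustration only, since the slide's actual header fields are not preserved in this transcript. The tracker supplies only the body; a shared helper fills in the header.

```python
import time
import uuid


def build_event(event_type, body, source="web"):
    """Wrap a tracker-supplied body in a common header (header fields are hypothetical)."""
    header = {
        "event_id": str(uuid.uuid4()),            # unique id per event
        "event_type": event_type,                 # e.g. "click"
        "timestamp_ms": int(time.time() * 1000),  # creation time, epoch milliseconds
        "source": source,                         # emitting application
    }
    return {"header": header, "body": dict(body)}


click = build_event("click", {"job_id": 42, "position": 3})
view = build_event("pageview", {"url": "/search?q=developer"})

# Same header shape for every event type; only the body differs.
assert click["header"].keys() == view["header"].keys()
print(click["body"], view["body"])
```

Keeping the header in one helper means a tracker cannot forget or misname a technical field, and downstream consumers can rely on the same envelope for every topic.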

Page 14

TP TECHNOLOGIES – AVRO (3/3)

BODY: EVERYONE CAN BUILD THEIR OWN EVENT (E.G. THE CLICK EVENT)
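As an illustration of what a custom event definition could look like, the sketch below builds an Avro record schema (Avro schemas are JSON documents) for a click event. The namespace, record names and all field names are hypothetical; the schema shown on the original slide is not preserved in this transcript.

```python
import json

# Hypothetical shared header record; the real header fields are not given in the slides.
HEADER_SCHEMA = {
    "type": "record",
    "name": "Header",
    "namespace": "example.tracking",
    "fields": [
        {"name": "event_id", "type": "string"},
        {"name": "timestamp", "type": {"type": "long", "logicalType": "timestamp-millis"}},
    ],
}


def event_schema(name, body_fields):
    """Build an Avro record schema: common header + event-specific body."""
    return {
        "type": "record",
        "name": name,
        "namespace": "example.tracking",
        "fields": [
            {"name": "header", "type": HEADER_SCHEMA},
            {"name": "body", "type": {
                "type": "record",
                "name": name + "Body",
                "fields": body_fields,
            }},
        ],
    }


# A team defines only the body fields of its own event.
click_schema = event_schema("Click", [
    {"name": "job_id", "type": "long"},
    {"name": "sponsored", "type": "boolean", "default": False},
    {"name": "referrer", "type": ["null", "string"], "default": None},
])

print(json.dumps(click_schema, indent=2))
```

Giving new fields a default, as `sponsored` and `referrer` do here, is what keeps older readers and writers compatible when the body evolves.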

Page 15

TP TECHNOLOGIES – KAFKA

Kafka enables the capture, movement, processing and storage of data streams in a distributed, fault-tolerant fashion (kafka.apache.org)

• Events are sent directly to Kafka

• One topic per event type

• Retention policy is set to 15 days

• High throughput: more than 2,000 messages / second and more than 1.5 MB / second (on average)
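The figures above imply a few useful numbers, worked out in the sketch below. It assumes 1 MB = 10^6 bytes and ignores replication and compression. Because the slide says "more than", the results are lower bounds. `retention.ms` is Kafka's topic-level retention setting; the `events.` topic prefix is a hypothetical naming scheme, not one stated in the talk.

```python
# Figures quoted on the slide (both are "more than", so results are lower bounds).
MESSAGES_PER_SECOND = 2000
BYTES_PER_SECOND = 1.5e6   # 1.5 MB/s, taking 1 MB = 10**6 bytes
RETENTION_DAYS = 15

retention_seconds = RETENTION_DAYS * 24 * 60 * 60
retention_ms = retention_seconds * 1000                 # value for Kafka's retention.ms
avg_message_bytes = BYTES_PER_SECOND / MESSAGES_PER_SECOND
retained_bytes = BYTES_PER_SECOND * retention_seconds   # per replica, uncompressed


def topic_for(event_type):
    """One topic per event type; the 'events.' prefix is a hypothetical naming scheme."""
    return f"events.{event_type}"


print(retention_ms)            # 1296000000
print(avg_message_bytes)       # 750.0
print(retained_bytes / 1e12)   # 1.944 (TB)
print(topic_for("click"))      # events.click
```

So an average event is about 750 bytes, and 15 days of retention corresponds to roughly 2 TB of log data per replica under these assumptions.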

Page 16

DATA LAKE

Page 17

WHY A DATA LAKE?

“If you think of a data mart as a store of bottled water – cleansed and packaged and structured for easy consumption – the Data Lake is a large body of water in a more natural state. The contents of the Data Lake stream in from a source to fill the lake, and various users of the lake can come to examine, dive in, or take samples.” (James Dixon, CTO of Pentaho)

MAIN GOALS:

• Implement a massive storage platform for RAW DATA

• Immutable MASTER DATA: information is never deleted

• Store as much data as we want at a very LOW COST

• Data must be available for various tasks including reporting, visualization, analytics and machine learning

Page 18

DATA LAKE: ARCHITECTURAL OVERVIEW

Page 19

DATA LAKE TECHNOLOGIES – FLUME (1/2)

Distributed data collection service for efficiently collecting and moving large amounts of log data (flume.apache.org)

MAIN FEATURES:

• Distributed, scalable and reliable

• Contextual and dynamic event routing

• Fully extensible (plugin architecture)

• Fully integrated in the Big Data ecosystem

• Easy to install and configure

Page 20

DATA LAKE TECHNOLOGIES – FLUME (2/2)

FLUME AGENT = SOURCE + [INTERCEPTORS] + CHANNEL + SINK
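The agent formula can be made concrete with a toy model. The sketch below is plain Python, not Flume itself or its configuration format: a source produces events, optional interceptors enrich or drop them, a channel buffers them, and a sink delivers them. All function names, the sample input and the in-memory "destination" are hypothetical.

```python
import queue
import time


def source(lines):
    """Source: turn raw input into Flume-style events (headers + body)."""
    for line in lines:
        yield {"headers": {}, "body": line}


def timestamp_interceptor(event):
    """Interceptor: enrich the headers with an arrival timestamp."""
    event["headers"]["timestamp"] = int(time.time() * 1000)
    return event


def drop_empty_interceptor(event):
    """Interceptor: returning None drops the event (hypothetical filter)."""
    return event if event["body"].strip() else None


def run_agent(lines, interceptors, sink):
    channel = queue.Queue()              # Channel: buffer between source and sink
    for event in source(lines):
        for intercept in interceptors:   # interceptors run in order on each event
            event = intercept(event)
            if event is None:
                break
        if event is not None:
            channel.put(event)
    while not channel.empty():           # Sink: drain the channel to the destination
        sink(channel.get())


stored = []                              # stand-in for the data lake storage
run_agent(["click 42", "", "pageview /home"],
          [drop_empty_interceptor, timestamp_interceptor],
          stored.append)
print([e["body"] for e in stored])       # ['click 42', 'pageview /home']
```

The channel is what decouples the two ends: the source can keep accepting events while the sink writes at its own pace.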

Page 21

REAL-TIME DATA WAREHOUSE INGESTION

Page 22

REAL-TIME DATA WAREHOUSE INGESTION (1/2)

MAIN GOALS:

• Data Lake decoupled from Data Warehouse

• Staging area automatically ingested in real time

• Data marts can be refreshed faster

• No data pipeline to implement or maintain

• Ingestion automatically scheduled, filtered and parsed

• JSON events automatically loaded into target tables

• Events are queryable in real time with the best performance on the market

Page 23

REAL-TIME DATA WAREHOUSE INGESTION (2/2)

KAFKA AND VERTICA WORK TOGETHER:

• Vertica acts as a consumer for Kafka (microbatch)

• Scheduling, filtering, parsing (JSON, Avro, custom)

• Vertica->Kafka: Vertica is able to send query results to Kafka

• Monitoring data load activities via Web UI

• Streams, rates, schedulers, rejections and errors

• In-database monitoring
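The micro-batch consumption pattern can be sketched in a few lines. This is a plain-Python toy, not Vertica's Kafka integration or any Kafka client API: a Python list stands in for a topic log, and the table, column names and messages are hypothetical. It shows the three steps named above: read a bounded batch from the last committed offset, parse each JSON message into a row, and set malformed messages aside as rejections.

```python
import json


def micro_batches(log, start_offset, batch_size):
    """Yield (next_offset, messages) slices of a topic log, one micro-batch at a time."""
    offset = start_offset
    while offset < len(log):
        batch = log[offset:offset + batch_size]
        offset += len(batch)
        yield offset, batch


def load_batch(messages, columns, table, rejected):
    """Parse JSON messages into rows of the target table; keep malformed ones aside."""
    for raw in messages:
        try:
            event = json.loads(raw)
            table.append(tuple(event[c] for c in columns))
        except (json.JSONDecodeError, KeyError, TypeError):
            rejected.append(raw)


# Hypothetical topic content: two valid click events and one malformed message.
topic_log = [
    '{"job_id": 1, "country": "IT"}',
    'not json',
    '{"job_id": 2, "country": "NL"}',
]

clicks_table, rejections = [], []
committed_offset = 0
for committed_offset, batch in micro_batches(topic_log, committed_offset, batch_size=2):
    load_batch(batch, ("job_id", "country"), clicks_table, rejections)

print(clicks_table, rejections, committed_offset)
```

Tracking the committed offset alongside the loaded rows is what lets such a loader resume after a restart without skipping or re-reading messages.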

Page 24

JOBRAPIDO BIG DATA ARCHITECTURE

Page 25

WHAT’S NEXT

• Kafka Connect vs Flafka evaluation

• Enrichment of event streams with Kafka Streams

• Unleash the power of Spark

• Integrate KNIME with the Data Lake

• Implement many Data Marts

Page 26

THANK YOU