Michele Pinto – Big Data Technical Team Leader @ Jobrapido
Big Data Tech 2016 – Firenze - October 20, 2016
BIG DATA REVOLUTION IN JOBRAPIDO
ABOUT ME
NAME: Michele Pinto
COMPANY WEBSITE: www.jobrapido.com
LINKEDIN: https://www.linkedin.com/in/pintomichele
WHO WE ARE
Jobrapido is the world's leading job search engine. It analyses and collects all job posts on the web, giving jobseekers all available offers, ordered by relevance to the search they’ve done.
Analysis, Aggregation, Response
WEBSITES IN 58 COUNTRIES: head office in Milan + office in Amsterdam
UNIQUE VISITORS: 35 Mio UVs / month
SUBSCRIBERS: 70+ Mio subscribed users (current stock)
PAGEVIEWS / CLICKS*: 280 Mio PVs / month & 130 Mio clicks / month
JOBS: 20+ Mio jobs at any given time
VISITORS: 1.0 BN visits / year
PEOPLE: 100+
* Clicks on job listings (organic + sponsored) and clicks on contextual ads
MOBILE APP
[App screenshots: MY SEARCHES, MY JOBS, MENU, CNT SELECTION, SIGN UP, SIGN IN]
WHERE WE ARE
THE NEED FOR A BIG DATA ARCHITECTURE (1/2)
THE NEED FOR A BIG DATA ARCHITECTURE (2/2)
MAIN FEATURES:
• SCALE in terms of throughput and computational power correlated to the data growth rate
• Unify the tracking layer in a single TRACKING PLATFORM
• Place and extract data for analytics into a single DATA LAKE
• REAL-TIME DATA INGESTION in our Data Warehouse
• Drastically REDUCE COMPLEXITY and MAINTENANCE
TRACKING PLATFORM
WHY A NEW TRACKING PLATFORM (TP)?
• Obtain a unique, simple and scalable Tracking Layer
• Everyone in Jobrapido should be able to design, track and query their own events
• Tracking phase and data processing phase totally decoupled
• Incoming events queryable and processable in real-time
• Remove any bottleneck during the event tracking process
TP: ARCHITECTURAL OVERVIEW
TP TECHNOLOGIES – AVRO (1/3)
Data serialization system that provides a compact, fast, binary data format (avro.apache.org)
MAIN FEATURES:
• Serialization into Avro/Binary or Avro/JSON
• Support for schema evolution: the schema used to read a file does not need to match the schema used to write the file
• Self-documenting: stores schema in file header
• Rich schema language defined in JSON
• Compressible and splittable (good for Spark and MapReduce)
• Can generate Java objects from schemas
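The schema-evolution point can be illustrated with a minimal sketch in plain Python, without an Avro library. A record written with an older schema is read through a newer reader schema that drops one field and adds another with a default. The `Click` schemas, their field names and the `resolve` helper are illustrative assumptions, not taken from the talk or from the Avro API; real Avro schema resolution also handles type promotion, aliases and unions.

```python
# Writer schema: the shape of the data when it was serialized (hypothetical).
WRITER_SCHEMA = {
    "type": "record",
    "name": "Click",
    "fields": [
        {"name": "job_id", "type": "string"},
        {"name": "position", "type": "int"},
    ],
}

# Reader schema: a later version that drops "position" and adds "source"
# with a default, so records written with the old schema stay readable.
READER_SCHEMA = {
    "type": "record",
    "name": "Click",
    "fields": [
        {"name": "job_id", "type": "string"},
        {"name": "source", "type": "string", "default": "organic"},
    ],
}


def resolve(record, writer_schema, reader_schema):
    """Project a record written with writer_schema onto reader_schema."""
    written = {field["name"] for field in writer_schema["fields"]}
    resolved = {}
    for field in reader_schema["fields"]:
        name = field["name"]
        if name in written:
            # Field known to both schemas: copy the written value.
            resolved[name] = record[name]
        elif "default" in field:
            # Field only in the reader schema: fall back to its default.
            resolved[name] = field["default"]
        else:
            raise ValueError(f"no value or default for field {name!r}")
    # Fields only in the writer schema ("position") are simply ignored.
    return resolved


old_record = {"job_id": "abc-123", "position": 4}
new_view = resolve(old_record, WRITER_SCHEMA, READER_SCHEMA)
```

Here `new_view` keeps `job_id`, gains `source` with its default value, and no longer carries `position`.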
TP TECHNOLOGIES – AVRO (2/3)
EVERYTHING IS AN EVENT = HEADER + BODY
• Each event has the same header, containing some “technical” fields
[Slide image: header schema]
• What differs between event types is the body; the tracker fills in only the body attributes
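A minimal sketch of the "event = header + body" idea, assuming a hypothetical set of technical header fields (`event_id`, `event_type`, `timestamp_ms`). The actual header fields used at Jobrapido appear only in the slide image and are not reproduced here. The tracker passes in the body, and the shared header is filled in for it.

```python
import time
import uuid


def build_header(event_type):
    """Technical fields shared by every event (field names are hypothetical)."""
    return {
        "event_id": str(uuid.uuid4()),
        "event_type": event_type,
        "timestamp_ms": int(time.time() * 1000),
    }


def build_event(event_type, body):
    """Wrap a tracker-supplied body in the common header."""
    return {"header": build_header(event_type), "body": dict(body)}


# The tracker only provides the body attributes of its own event type.
click = build_event("click", {"job_id": "abc-123", "sponsored": False})
```

Keeping the header identical across event types means downstream consumers can route and timestamp any event without knowing its body schema.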
TP TECHNOLOGIES – AVRO (3/3)
BODY: EVERYONE CAN BUILD ITS OWN EVENT (E.G. THE EVENT CLICK)
TP TECHNOLOGIES – KAFKA
Kafka enables the capture, movement, processing and storage of data streams in a distributed, fault-tolerant fashion (kafka.apache.org)
• Events are sent directly to Kafka
• One topic per event type
• Retention policy is set to 15 days
• High-throughput
• More than 2000 messages / second (AVG)
• More than 1.5 MB / second (AVG)
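As a rough back-of-the-envelope check, the figures on this slide can be combined to estimate the average message size and the volume held under the 15-day retention policy. The sketch assumes 1 MB = 10^6 bytes and ignores replication, compression and per-topic differences. Since the slide gives "more than" values, the results are lower-bound estimates rather than measured numbers.

```python
MESSAGES_PER_SECOND = 2000   # average lower bound from the slide
MB_PER_SECOND = 1.5          # average lower bound from the slide
RETENTION_DAYS = 15          # retention policy from the slide

SECONDS_PER_DAY = 24 * 60 * 60

# Average payload per message, assuming 1 MB = 1,000,000 bytes.
avg_message_bytes = MB_PER_SECOND * 1_000_000 / MESSAGES_PER_SECOND

# Messages produced in one day at the average rate.
messages_per_day = MESSAGES_PER_SECOND * SECONDS_PER_DAY

# Data kept on the brokers over the retention window, in TB (10^6 MB),
# before replication and compression.
retained_tb = MB_PER_SECOND * SECONDS_PER_DAY * RETENTION_DAYS / 1_000_000
```

Under these assumptions the average message is about 750 bytes, the daily volume is about 172.8 million messages, and roughly 1.9 TB is retained at any time.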
DATA LAKE
WHY A DATA LAKE?
“If you think of a data mart as a store of bottled water – cleansed and packaged and structured for easy consumption – the Data Lake is a large body of water in a more natural state. The contents of the Data Lake stream in from a source to fill the lake, and various users of the lake can come to examine, dive in, or take samples.” (James Dixon, CTO of Pentaho)
MAIN GOALS:
• Implement a massive storage platform for RAW DATA
• Immutable MASTER DATA: information is never deleted
• Store as much data as we want at a very CHEAP PRICE
• Data must be available for various tasks including reporting, visualization, analytics and machine learning
DATA LAKE: ARCHITECTURAL OVERVIEW
DATA LAKE TECHNOLOGIES – FLUME (1/2)
Distributed data collection service for efficiently collecting and moving large amounts of log data (flume.apache.org)
MAIN FEATURES:
• Distributed, scalable and reliable
• Contextual and dynamic event routing
• Fully extensible (plugin architecture)
• Fully integrated in the Big Data ecosystem
• Easy to install and configure
DATA LAKE TECHNOLOGIES – FLUME (2/2)
FLUME AGENT = SOURCE + [INTERCEPTORS] + CHANNEL + SINK
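The agent formula can be pictured with a toy model in plain Python. This is not the Flume API or a Flume configuration: `ToyAgent`, the two interceptors and the in-memory queue standing in for the channel are all illustrative assumptions. It only shows the order of the pieces: the source wraps the input as an event, optional interceptors modify or drop it, the channel buffers it, and the sink drains the channel.

```python
import queue


def drop_empty_interceptor(event):
    """Interceptor that filters: returning None drops the event."""
    return event if event["body"] else None


def tag_interceptor(event):
    """Interceptor that enriches: adds a header and passes the event on."""
    event["headers"]["ingested"] = "true"
    return event


class ToyAgent:
    """Source + [interceptors] + channel + sink, all in one process."""

    def __init__(self, interceptors, sink):
        self.interceptors = interceptors
        self.channel = queue.Queue()  # buffer between source and sink
        self.sink = sink

    def source_receive(self, body):
        """Source side: wrap raw input, run interceptors, put on the channel."""
        event = {"headers": {}, "body": body}
        for interceptor in self.interceptors:
            event = interceptor(event)
            if event is None:
                return  # dropped before reaching the channel
        self.channel.put(event)

    def sink_drain(self):
        """Sink side: deliver everything currently buffered in the channel."""
        while not self.channel.empty():
            self.sink(self.channel.get())


delivered = []
agent = ToyAgent([drop_empty_interceptor, tag_interceptor], delivered.append)
for raw in ['{"job_id": "abc-123"}', "", '{"job_id": "def-456"}']:
    agent.source_receive(raw)
agent.sink_drain()
```

The empty message is dropped by the first interceptor, so only two tagged events reach the sink. Because the channel sits between the two sides, the source and the sink can run at different speeds.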
REAL-TIME DATA WAREHOUSE INGESTION
REAL-TIME DATA WAREHOUSE INGESTION (1/2)
MAIN GOALS:
• Data Lake decoupled from Data Warehouse
• Staging area automatically ingested in real-time
• Data marts can be refreshed faster
• No data pipeline to implement or maintain
• Ingestion automatically scheduled, filtered and parsed
• JSON events automatically loaded into target tables
• Events are queryable in real-time with the best performance on the market
REAL-TIME DATA WAREHOUSE INGESTION (2/2)
KAFKA AND VERTICA WORK TOGETHER:
• Vertica acts as a consumer for Kafka (microbatch)
• Scheduling, filtering, parsing (JSON, Avro, custom)
• Vertica->Kafka: Vertica is able to send query results to Kafka
• Monitoring data load activities via Web UI
• Streams, rates, schedulers, rejections and errors
• In-database monitoring
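The microbatch consumption pattern on this slide can be sketched in plain Python. This is not Vertica's scheduler or a Kafka client: the list `topic` stands in for a Kafka topic, `table` for a target table, and `load_microbatch` is a hypothetical helper. It shows the general loop of reading a bounded batch from the last offset, parsing each message as JSON, loading the valid rows, recording rejections, and advancing the offset.

```python
import json


def load_microbatch(messages, offset, batch_size):
    """Consume one microbatch starting at offset.

    Returns (rows, rejections, next_offset).
    """
    batch = messages[offset:offset + batch_size]
    rows, rejections = [], []
    for position, raw in enumerate(batch, start=offset):
        try:
            parsed = json.loads(raw)
        except json.JSONDecodeError:
            # Unparseable message: keep it with its offset for monitoring.
            rejections.append((position, raw))
            continue
        if isinstance(parsed, dict):
            rows.append(parsed)
        else:
            # Valid JSON but not an object, so it cannot map to table columns.
            rejections.append((position, raw))
    return rows, rejections, offset + len(batch)


topic = ['{"job_id": "a"}', "not json", '{"job_id": "b"}', '{"job_id": "c"}']
table, rejected, offset = [], [], 0
while offset < len(topic):
    rows, bad, offset = load_microbatch(topic, offset, batch_size=2)
    table.extend(rows)
    rejected.extend(bad)
```

Tracking rejections alongside loaded rows mirrors the kind of load monitoring (rates, rejections, errors) listed on the slide.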
JOBRAPIDO BIG DATA ARCHITECTURE
WHAT’S NEXT
• Kafka Connect vs Flafka evaluation
• Enrichment of event streams with Kafka Streams
• Unleash the power of Spark
• Integrate KNIME with the Data Lake
• Implement a lot of Data Marts
THANK YOU