meetup duchess 20160119 - leboncoin de la data
TRANSCRIPT
LEBONCOIN DE LA DATA
Stéphanie Baltus – Head of Data Engineering – @steph_baltus
Meetup Duchess France @ TheFamily – 01/19/2016
■ About leboncoin
■ Data, data everywhere!
■ To infinity and beyond…
PLAN
ABOUT LEBONCOIN
LEBONCOIN... AND FRIENDS
■ A Schibsted Media Group company
■ Since 2006
■ 320+ people
■ Located in Paris, Montceau-Les-Mines, Reims
■ 2014 Revenue: 150+M€
IN A FEW WORDS
NOT JUST A WEBSITE
■ Classified ads:
■ Professional
■ Personal
■ Premium offer:
■ Highlight products
■ Ad import tools
■ Ad display
NOT JUST A CLASSIFIED ADS COMPANY
DATA, DATA EVERYWHERE
■ Building a team
■ Provide a daily batch DWH:
■ Website traffic (sort of)
■ Ad activity & validation
■ Sales & Coin usage
■ User information
■ Support
■ Try near-real time processing
A BIT OF HISTORY
SO, WE DID SOME BI STUFF (2012-2015)
IT LOOKS LIKE THIS
■ A lot of uncovered scope
■ Incremental load only
■ Inability to reload historical data: stuck with data from 2013 to today
■ A business team unable to query the database
■ A lot of "no!" when asking for new features
■ Vertical scalability only
■ No way to share data with the product (website, app, CRM, …)
IT WORKS! BUT…
TO INFINITY AND BEYOND!
■ Share data services with the website and apps
■ Build a unique source of truth
■ Provide raw data to our analysts
■ Provide real time data
■ Cover all the data scope of leboncoin
THE FUTURE
FUNCTIONAL ARCHITECTURE
DATA ARCHITECTURE: DUMBO STYLE
ONE STACK TO RULE THEM ALL
■ Centralized data cleaning / streamlining
■ Extended analytics apps
■ Ad and customer indexes
■ Ad import web service
■ Data lake indexing through a Bloom filter
■ Anomaly detection
SOME IMPLEMENTATIONS
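The data lake indexing item above relies on a Bloom filter: a compact bit array that answers "definitely not present" or "possibly present" without scanning the lake. A minimal self-contained sketch of the idea (class and parameter names are hypothetical, not leboncoin's actual implementation):

```python
# Minimal Bloom filter sketch -- illustrative only, not leboncoin's code.
import hashlib

class BloomFilter:
    def __init__(self, size_bits=1024, num_hashes=3):
        self.size = size_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(size_bits // 8 + 1)

    def _positions(self, item):
        # Derive k deterministic bit positions from the item.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{item}".encode()).hexdigest()
            yield int(digest, 16) % self.size

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, item):
        # False means definitely absent; True may be a false positive.
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(item))
```

A `False` answer guarantees the key was never indexed, so the expensive lake scan can be skipped; `True` may be a false positive, with a rate tunable via `size_bits` and `num_hashes`.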
■ Goal: help the sysadmin team catch bots that crawl our website and apps to steal our ads or users' phone numbers ⇒ anomaly detection
■ How:
■ Use HTTP logs (150 GB per day)
■ Build KPIs and feature vectors
■ Apply a logistic regression to identify suspicious sessions
■ Next steps:
■ Test a k-means algorithm
CATCH 'EM ALL!
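The pipeline above (per-session KPI vectors from HTTP logs, then a logistic regression) can be sketched with scikit-learn. The feature choices and values are made up for illustration, and scikit-learn stands in for whatever library was actually used in production:

```python
# Illustrative bot-detection sketch: hypothetical per-session KPIs,
# not the actual features or stack used at leboncoin.
from sklearn.linear_model import LogisticRegression

# Hypothetical KPIs per session: [requests per minute, distinct-page ratio]
X_train = [
    [2.0, 0.90],    # human-like: slow, varied browsing
    [3.5, 0.80],
    [400.0, 0.05],  # bot-like: very fast, hammering the same URLs
    [350.0, 0.10],
]
y_train = [0, 0, 1, 1]  # 0 = legitimate session, 1 = bot

model = LogisticRegression()
model.fit(X_train, y_train)

# Score an unseen session: a prediction of 1 flags it as suspicious.
label = model.predict([[380.0, 0.07]])[0]
```

`predict_proba` would additionally give P(bot) per session, which is what lets a team set a review threshold instead of a hard cut-off.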
■ A unified view of the data
■ Home-built data extractor + Spark MDM jobs
■ Build a next-generation BI app
■ Spark ETL + Redshift
■ Share the built information with other apps
■ Spark ETL + ES + Kafka
DIVE INTO DATA SHARING
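The sharing pattern above, one ETL feeding several consumers (Redshift, Elasticsearch, Kafka), boils down to extract → transform → fan-out. A plain-Python sketch of that shape, with hypothetical field names, since the real jobs run on Spark:

```python
# Extract-transform-share sketch. Plain Python stands in for Spark;
# "ad_id"/"category" are hypothetical field names.

def extract(raw_rows):
    # Simulate pulling raw ad events and dropping malformed ones.
    return [r for r in raw_rows if r.get("ad_id") is not None]

def transform(rows):
    # Streamline/clean: normalize the category field.
    return [
        {"ad_id": r["ad_id"], "category": r["category"].strip().lower()}
        for r in rows
    ]

def publish(rows, sinks):
    # Fan the unified view out to every consumer (DWH, search index, bus).
    for sink in sinks:
        sink.extend(rows)

warehouse, search_index = [], []
raw = [
    {"ad_id": 1, "category": " Immobilier "},
    {"ad_id": None, "category": "Auto"},  # malformed event, filtered out
]
publish(transform(extract(raw)), [warehouse, search_index])
```

Keeping cleaning in one transform step is what makes every downstream consumer see the same "source of truth" instead of each re-cleaning the data its own way.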
NOW IT LOOKS LIKE THIS
■ Becoming production-ready
■ New app, new services
■ More machine learning oriented apps
■ Feeding the website
■ Recruiting
WHAT'S NEXT?
QUESTIONS?