From Labelling Open Data Images to Building a Private Recommender System: A Transfer Learning Application
Outline
• Introduction
• Iterative building of a recommender system
• Labelling images AKA: Pragmatic Deep learning for “Dummies”
• Post processing AKA: Using Images information for BI on steroids
• Results & Conclusion
Dataiku
• Founded in 2013
• 60+ employees
• Paris, New York, London, San Francisco
Data Science Software Editor of Dataiku DSS
DESIGN
• PREPARE: Load and prepare your data
• MODEL: Build your models
• ANALYSE: Visualize and share your work
PRODUCTION
• AUTOMATE: Re-execute your workflow at ease
• MONITOR: Follow your production environment
• SCORE: Get predictions in real time
Key Figures
• E-business vacation retailer
• Founded in 2006. 500M revenue in 2015.
• 18 million customers
• Hundreds of sales every day -> recommendation engine
• Sale image is paramount
VPG specificities
• Sales are very temporary
  -> Unlike Amazon / PriceMinister / Cdiscount
  -> Some classical recommender systems fail
  -> Sales are event-linked (Christmas, ski, summer)
• Expensive products
  -> Few recurrent buyers
  -> Appearance counts a lot
• Few recurrent buyers
  -> Classical approaches fail
  -> Less signal: visit information is paramount
  -> Users are less inclined to browse a lot (first 4-10 sales)
A data science workflow: six steps to a predictive model
[Diagram, adapted from the CRISP-DM methodology: Business Understanding, Data Acquisition, Data Exploration & Understanding, Data Preparation, Model Creation, Evaluation, Deployment. Datasets 1 to n go through iterations 1 to n and produce scored datasets.]
Creating a predictive model is a highly iterative process. Data Science Studio enables its users to create and manage these projects end to end. This process is not industry-specific and can be applied to many use cases.
Iterative Building of a Recommender System
Basic Recommendation Engines
Other Factors
One Meta Model to Rule Them All
Recommenders as features
Machine learning to optimize purchasing probability
Combine
Recommend
Describe
One Meta Model to Rule Them All
• Negative sampling
  • Take all purchase tuples: (user, product, timestamp) -> 1
  • Select 5 sales open on the same date that the user did not buy -> 0
  • The model directly optimizes purchasing probability
• Machine learning model
  • Features: recommender systems
  • Logistic regression. Regularizing effect: we don’t want to overfit leaks.
• Reranking approach. Similar to Google or Yandex (Kaggle challenge)
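The negative sampling step can be sketched as follows. This is a minimal illustration, not the production code: the function name `build_training_rows`, the `open_sales_by_date` lookup and the sale identifiers are hypothetical, and dates stand in for timestamps.

```python
import random

def build_training_rows(purchases, open_sales_by_date, n_negatives=5, seed=0):
    """Turn purchase tuples into labelled (user, sale, date, target) rows."""
    rng = random.Random(seed)
    bought = {(user, sale) for user, sale, _ in purchases}
    rows = []
    for user, sale, date in purchases:
        rows.append((user, sale, date, 1))  # the purchase itself is a positive
        # Candidates: sales open on the same date that this user never bought.
        candidates = [s for s in open_sales_by_date[date] if (user, s) not in bought]
        for negative in rng.sample(candidates, min(n_negatives, len(candidates))):
            rows.append((user, negative, date, 0))
    return rows

# Toy data (hypothetical users and sales).
purchases = [("u1", "ski_alps", "2016-12-01"), ("u2", "beach_bali", "2016-12-01")]
open_sales_by_date = {
    "2016-12-01": ["ski_alps", "beach_bali", "spa_prague", "city_berlin",
                   "safari_kenya", "pool_ibiza", "temple_cambodia"],
}
rows = build_training_rows(purchases, open_sales_by_date)
```

Each row would then be enriched with the recommender scores as features before fitting the logistic regression.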
One Meta Model to Rule Them All
• Going further?
  • Predict the visit?
    - Would make it possible to take more information into account
    - But many people browse randomly
  • Learning to rank on target: 2 bought, 1 visited, 0 otherwise
  • Impact of this on the top 10 sales?
• Limitations:
  • Highly dependent on the ranking displayed, which we don’t have
  • May overfit old man-made rules
Recommender system for Home Page Ordering
• Inputs: customer visits, purchases, sales images, sales information
• Cleaning, combining and enrichment of data: the application automatically runs and compiles heterogeneous data
• Recommendation engines: generation of recommendations based on user behaviour
• Meta model: combines the recommendations to directly optimize purchasing probability
• Optimization of home display: every customer is shown the 10 sales they are most likely to buy
• Batch scoring every night
• +7% revenue (A/B testing)
Why use images?
We want to distinguish
« Sun and Beach »
« Ski »
A picture is worth a thousand words
Sales Images
Integrating Image Information
Labelling Model
Pool + Palm Trees Hotel + Mountains
Pool + Forest + Hotel + Sea
Sea + Beach +Forest + Hotel
Sales descriptions
CONTENT BASED
Recommender System
Image Labelling For Recommendation Engine: Pragmatic Deep Learning for “Dummies”
Using Deep Learning models: Common Issues
• “I don’t have a GPU server”
• “I don’t have a deep learning expert”
• “I don’t have labelled data” (or too few)
• “I don’t have the time to wait for model training”
• “I don’t want to pay for private APIs” / “I’m afraid their labelling will change over time”
Pragmatic Deep Learning Cheat Sheet
[Flowchart with Y/N branches. Questions: Do you have labels? Many? Are you sure? Is there a similar database? Is there a pre-trained model? Outcomes: Train DL model, Transfer Learning, Use it!, Create your own.]
“I have no (or too few) labelled data” -> Is there similar data?
Solution 1 : Pre trained models
PLACES DATABASE: 205 categories, 2.5 M images
SUN DATABASE: 307 categories, 110 K images
VPG, example predictions:
• tower: 0.53, skyscraper: 0.26
• swimming_pool/outdoor: 0.65, inn/outdoor: 0.06
Solution 1 : Pre-trained models
If there is open data, there is an open pre-trained model!
• Kudos to the community
• Check the licensing
Example with Places (Caffe Model Zoo):
Solution 2 : Transfer Learning “I want to add information of SUN database” “But I have only 100 K images”
If you know how to recognize… after a little bit of training… you will be able to recognize
Transfer Learning
Use a network that knows how to see
• As a feature generator / transformer
• To be updated for the new problem
Solution 2 : Transfer Learning Not limited to images !
Pan, Sinno Jialin, and Qiang Yang. "A survey on transfer learning." IEEE Transactions on knowledge and data engineering 22.10 (2010): 1345-1359.
Transfer Learning
Word2Vec: use large text corpora
• For grammar learning
• For synonym learning
If you know the sentiment for:
• “This wine tastes great” -> 1
• “The most disgusting cheese ever” -> 0
And (word2vec) you know synonyms and grammar, it’s easy to classify:
• “This cheese tasted awful”
• “The best wine in town”
Solution 2 : Transfer Learning
• Few labeled data, similar data: use network as transformer, create simple model
• Few labeled data, not so similar data: troubles. Simple model on shallow layers? Or get other data
• Lots of labeled data, similar data: fine-tune several layers, depending on size of data
• Lots of labeled data, not so similar data: retrain new network with existing architecture
Credit: Fei-Fei Li & Andrej Karpathy & Justin Johnson http://cs231n.stanford.edu/slides/winter1516_lecture11.pdf
SUN VS Places dataset
VPG:
• No labeled data
• Similar data
Pipeline (PLACES DATABASE, SUN DATABASE, VOYAGE PRIVE):
• Training (optional): pre-trained model VGG16, from the Caffe Model Zoo
• Transferred data: last convolutional layer features
• Re-training: re-trained model in TensorFlow, 2 fully connected layers
• Example output: tower: 0.53, skyscraper: 0.26
• Hardware, in the order shown on the slide: GPU, CPU, GPU
Leverage existing knowledge!
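The "frozen network as feature generator, small re-trained head" idea can be sketched in plain NumPy. This is a toy under stated assumptions, not the VGG16 / TensorFlow pipeline from the slide: a fixed random projection with ReLU stands in for the pre-trained convolutional layers, the data is synthetic, and a single softmax layer replaces the 2 fully connected layers.

```python
import numpy as np

rng = np.random.default_rng(0)

# Assumption: a fixed random projection + ReLU stands in for the frozen,
# pre-trained convolutional layers. Its weights are never updated.
W_frozen = rng.normal(size=(64, 32))

def extract_features(images):
    return np.maximum(images @ W_frozen, 0.0)

# Toy "images": 3 well-separated classes of 64-dimensional vectors.
n_classes = 3
centers = rng.normal(scale=3.0, size=(n_classes, 64))
labels = rng.integers(0, n_classes, size=300)
images = centers[labels] + rng.normal(size=(300, 64))

# Transferred data: features computed once by the frozen network.
features = extract_features(images)
train, valid = slice(0, 240), slice(240, 300)
features = (features - features[train].mean(axis=0)) / features[train].std()

# Re-training: only this new softmax head is learned (gradient descent).
W_head = np.zeros((32, n_classes))
b_head = np.zeros(n_classes)
one_hot = np.eye(n_classes)[labels[train]]
for _ in range(300):
    logits = features[train] @ W_head + b_head
    logits -= logits.max(axis=1, keepdims=True)   # numerical stability
    probs = np.exp(logits)
    probs /= probs.sum(axis=1, keepdims=True)
    grad = (probs - one_hot) / one_hot.shape[0]   # cross-entropy gradient
    W_head -= 0.05 * features[train].T @ grad
    b_head -= 0.05 * grad.sum(axis=0)

predicted = np.argmax(features[valid] @ W_head + b_head, axis=1)
valid_accuracy = (predicted == labels[valid]).mean()
```

Because the expensive part (the frozen network) runs only once per image, the head can be re-trained cheaply whenever the label set changes.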
Solution 2 : Transfer Learning
Accuracy: 72%, Top-5 Acc: 90% > state of the art on dataset alone
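For reference, top-k accuracy counts a prediction as correct when the true label is among the k most probable classes. A minimal sketch on made-up probabilities (the function name `top_k_accuracy` and the numbers are illustrative only):

```python
import numpy as np

def top_k_accuracy(probabilities, true_labels, k=5):
    """Share of rows whose true label is among the k highest-scoring classes."""
    top_k = np.argsort(probabilities, axis=1)[:, -k:]
    hits = (top_k == np.asarray(true_labels)[:, None]).any(axis=1)
    return hits.mean()

# Toy scores for 3 images and 4 classes.
probabilities = np.array([
    [0.50, 0.30, 0.15, 0.05],
    [0.10, 0.20, 0.30, 0.40],
    [0.40, 0.35, 0.20, 0.05],
])
true_labels = [0, 2, 3]
top1 = top_k_accuracy(probabilities, true_labels, k=1)
top2 = top_k_accuracy(probabilities, true_labels, k=2)
```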
Solution 3 : Generating your own large (or not) dataset
• Create label set
  • Easy: man VS woman?
  • Harder: all relevant information in my images
  • Manually select all words in a corpus (ex: WordNet)
• Use search engines
  • Augment search terms
  • Get URLs and images from search terms
  • Deduplicate
• Validate with Mechanical Turk
  • Exclude incorrect images
  • Evaluate human performance
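The deduplication step could look like the sketch below, assuming images have already been downloaded as bytes. It only catches exact byte-level duplicates; resized or re-encoded copies would need a perceptual hash, which is not shown. The URLs and the `deduplicate` helper are hypothetical.

```python
import hashlib

def deduplicate(images_by_url):
    """Keep one URL per distinct image content (exact duplicates only)."""
    first_url_by_digest = {}
    for url, content in images_by_url.items():
        digest = hashlib.sha256(content).hexdigest()
        first_url_by_digest.setdefault(digest, url)  # first URL seen wins
    return sorted(first_url_by_digest.values())

# Toy downloads: the first two entries hold identical bytes.
downloads = {
    "http://example.com/a.jpg": b"\xff\xd8beach",
    "http://example.org/copy_of_a.jpg": b"\xff\xd8beach",
    "http://example.com/b.jpg": b"\xff\xd8ski",
}
unique_urls = deduplicate(downloads)
```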
Solution 4 : What about APIs ?
• Price
  • Their cost is often rather low. Ex: 100 K requests for less than $300
  • VS the cost of redeveloping the model yourself (probably not as well)
• Full database scoring
  • APIs are often limited to a number of queries per month
  • Make sure you can avoid the cold start problem
• Stability
  • Use model versioning
  • Avoid covariate shift, distribution drift
What about APIs ? Use for generating labels !
• How to:
  • Score part of the database for training
  • Train a model
  • Score your entire database
• But I have only 5000 requests? -> Use Transfer Learning!
• Stealing models: Tramèr, Florian, et al. "Stealing Machine Learning Models via Prediction APIs." arXiv preprint arXiv:1609.02943 (2016).
(Or don’t, it’s illegal)
What about APIs ? Use for generating labels !
Experiment:
• 5000 requests on API -> 4500 for training, 500 for validation
• Transfer learning with MIT Places pre-trained model
• Scikit-learn multilabel model
  • One-vs-the-rest
  • Untuned logistic regression
(Or don’t, it’s illegal) (demo, not used in any real project)
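The shape of this experiment can be sketched with scikit-learn. The data below is synthetic: random vectors stand in for features from the pre-trained network, and a random linear rule stands in for the labels returned by the API. Only the 4500/500 split, the one-vs-the-rest setup and the untuned logistic regression come from the slide.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier

rng = np.random.default_rng(0)

# Hypothetical stand-ins: 5000 feature vectors and a binary label matrix
# (one column per label; 4 labels here, each image may carry several).
features = rng.normal(size=(5000, 20))
api_labels = (features @ rng.normal(size=(20, 4)) > 0).astype(int)

X_train, Y_train = features[:4500], api_labels[:4500]
X_valid, Y_valid = features[4500:], api_labels[4500:]

# One independent, untuned logistic regression per label.
model = OneVsRestClassifier(LogisticRegression())
model.fit(X_train, Y_train)

predicted = model.predict(X_valid)            # binary matrix, shape (500, 4)
probabilities = model.predict_proba(X_valid)  # one probability per label
label_accuracy = (predicted == Y_valid).mean()
```

Once trained, the same model can score the entire image database without further API requests.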
What about APIs ? Results
Accuracy 95, Recall 80, Precision 75

Label      Probability   Label      Probability
landscape  1.0000        sunset     0.9998
sky        1.0000        no person  0.9996
outdoors   1.0000        water      0.9990
nature     1.0000        park       0.9849
rock       1.0000        river      0.9678
travel     1.0000        scenic     0.8031

Label      Probability   Label       Probability
beach      1.0000        ocean       1.0000
summer     1.0000        relaxation  1.0000
sand       1.0000        island      1.0000
tropical   1.0000        idyllic     1.0000
travel     1.0000        seashore    0.9998
seascape   1.0000        water       0.9997
(demo, not used in any real project)
Post Treatment
(Or how we transfer the labelling information)
Using Images information for BI on steroids
Classification problem
• Only have probabilities of each class
• Selecting based on probability threshold fails
• Keeping all information is not sparse
-> we keep 5 labels and probabilities per image
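Keeping the 5 most probable labels per image is a one-liner with `numpy.argsort`. A minimal sketch with toy values (some label names and the probabilities 0.65 and 0.06 echo the earlier example, the rest are made up):

```python
import numpy as np

def top_labels(probabilities, label_names, k=5):
    """Return the k most probable (label, probability) pairs, best first."""
    order = np.argsort(probabilities)[::-1][:k]
    return [(label_names[i], float(probabilities[i])) for i in order]

label_names = ["swimming_pool/outdoor", "inn/outdoor", "tower", "skyscraper",
               "beach", "coast", "ski_slope"]
probabilities = np.array([0.65, 0.06, 0.01, 0.02, 0.15, 0.08, 0.03])
kept = top_labels(probabilities, label_names)
```

Unlike a fixed probability threshold, this always yields the same number of labels per image, whatever the confidence of the model.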
Labels post-processing
Deep/Transfer Learning models
5-10 tags per image
• 2s/image with CPU • x20 speed up with GPU
Voyage Privé images
Labels post-processing
Issue with our approach:
• Complementary information
• Redundant information
Solution : Matrix Factorization
Topic extraction with Non-Negative Matrix Factorization
• Non-Negative Matrix Factorization (NMF): X = WH
  • X : image x tag, non-negative
  • W : image x theme
  • H : theme x tag
  (scikit-learn implementation)
• Most represented themes
  • Swimming-pool_Apartment_Putting-green
  • Ocean_Coast_SandBar
  • Coast_SeaCliff_RockArch
  • Beach_Coast_BoardWalk
  • Bridge_Viaduc_River
  • Palace_BuildingFacade-Mansion
  • Castle_Mansion_Monastery
  • HotelRoom_Bedroom_DormRoom
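To make X = WH concrete, here is a small NumPy sketch using the classic multiplicative updates. It is a stand-in for illustration only (the slides mention the scikit-learn implementation), the matrix is synthetic, and normalizing each row of W to sum to 100 is an assumed way of obtaining percentage-like theme scores.

```python
import numpy as np

def nmf(X, n_themes, n_iter=500, seed=0, eps=1e-9):
    """Factorize non-negative X (image x tag) into W (image x theme) @ H (theme x tag)."""
    rng = np.random.default_rng(seed)
    W = rng.random((X.shape[0], n_themes))
    H = rng.random((n_themes, X.shape[1]))
    for _ in range(n_iter):
        # Multiplicative updates keep every entry of W and H non-negative.
        H *= (W.T @ X) / (W.T @ W @ H + eps)
        W *= (X @ H.T) / (W @ H @ H.T + eps)
    return W, H

# Toy image x tag matrix built from 3 hidden themes (40 images, 12 tags).
rng = np.random.default_rng(1)
X = rng.random((40, 3)) @ rng.random((3, 12))

W, H = nmf(X, n_themes=3)
relative_error = np.linalg.norm(X - W @ H) / np.linalg.norm(X)

# Assumed post-processing: theme scores per image, expressed in percent.
theme_scores = 100 * W / W.sum(axis=1, keepdims=True)
```

Each row of H lists the tags that make up a theme, which is what gives names such as Ocean_Coast_SandBar.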
• Dimension reduction
  • 200x200 pixels -> 600 tags => 30 themes
  • Faster content-based filtering
• An image is often a sparse combination of themes
  • Faster content-based filtering
• Each theme has the same explanatory power
  • Balanced vector for content-based filtering
• Explainability
  • Each theme corresponds to a few labels
Image content detection
Topic scores determine the importance of topics in an image

TOPIC                                         TOPIC SCORE (%)
Golf course – Fairway – Putting green         31
Hotel – Inn – Apartment building outdoor      30
Swimming pool – Lido Deck – Hot tub outdoor   22
Beach – Coast – Harbor                        17

TOPIC                                         TOPIC SCORE (%)
Tower – Skyscraper – Office building          62
Bridge – River – Viaduct                      38
Note on model performance
• Image labels are used for similarity. Calling a grass field “putting green”:
  • Is not important if all grass fields are called this way
  • Would be if we had lots of golf trip sales
• Improving the NN performance?
  • Labels are used in NMF and reduced to themes
  • Themes are used to calculate similarities for CB recommenders
  • CB recommenders are used as features in the meta model
  • The meta model gives probabilities of purchase = order
  • Users only check 10 sales… -> how much does online performance change for 1% more accuracy?
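The similarity step for a content-based (CB) recommender can be sketched with cosine similarity between theme vectors. The choice of cosine similarity, the `most_similar` helper and the toy vectors are assumptions for illustration; the slides do not state which similarity measure was used.

```python
import numpy as np

def most_similar(theme_vectors, query_index, top_n=2):
    """Indices of the sales whose theme vectors are closest (cosine) to the query sale."""
    unit = theme_vectors / np.linalg.norm(theme_vectors, axis=1, keepdims=True)
    similarity = unit @ unit[query_index]
    similarity[query_index] = -np.inf  # do not recommend the sale itself
    return [int(i) for i in np.argsort(similarity)[::-1][:top_n]]

# Toy theme scores per sale; columns: pool, beach, tower.
theme_vectors = np.array([
    [0.7, 0.3, 0.0],  # sale 0: mostly pool
    [0.6, 0.4, 0.0],  # sale 1: pool and beach
    [0.0, 0.1, 0.9],  # sale 2: city towers
    [0.1, 0.9, 0.0],  # sale 3: mostly beach
])
neighbours = most_similar(theme_vectors, query_index=0)
```

A mislabelled theme matters little here as long as it is mislabelled consistently: similar images still end up with similar vectors.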
Results
Results?
All visits:
• Mostly France
• Pool displayed
First recommendation:
• Fails to display pools
Only images?
• Pools all around the world
Third column = right mix
Results?
All visits:
• Spain
• Sun & beach
• Pool displayed
First recommendation:
• Displays nature…
Only images?
• Pools all around the world
Third = right mix
• Gets the bungalow feature!
• Do iterative data science!
  • Start simple and grow
  • Validate each step
  • Image labelling = BI on steroids
• Deep Learning?
  • Is there existing data?
  • Is there a pre-trained model?
• Transfer Learning
  • Cheaper, faster
  • Any data scientist can do it
• What’s next?
Conclusion
Learned along the way: for ski sales, showing indoor pictures performs better
What’s next ?
• Comparison proposed/visited vacation
Attractiveness(tag) = Visited offers containing tag / Proposed offers containing tag
Voyage Privé offers database (= baseline): Ocean 67%, Bedroom 33%
Visits database: Ocean 33%, Bedroom 67%
• « Bedroom » attractiveness = 0.67/0.33 = 2
• « Ocean » attractiveness = 0.33/0.67 = 0.5
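The attractiveness ratio is easy to compute from two lists of tagged offers. A minimal sketch reproducing the numbers above on toy data (the helper names and the three-offer lists are hypothetical):

```python
from collections import Counter

def tag_shares(offers):
    """Share of offers containing each tag."""
    counts = Counter(tag for tags in offers for tag in set(tags))
    return {tag: count / len(offers) for tag, count in counts.items()}

def attractiveness(visited_offers, proposed_offers):
    """Share among visited offers divided by share among proposed offers."""
    visited = tag_shares(visited_offers)
    proposed = tag_shares(proposed_offers)
    return {tag: visited.get(tag, 0.0) / share for tag, share in proposed.items()}

# Toy data: the catalogue is 2/3 ocean, but visits are 2/3 bedroom.
proposed_offers = [["ocean"], ["ocean"], ["bedroom"]]
visited_offers = [["ocean"], ["bedroom"], ["bedroom"]]
scores = attractiveness(visited_offers, proposed_offers)
```

A score above 1 means the tag is visited more often than its share of the catalogue would suggest.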
What’s Next ? Customize the Image !
Kenya
Prague
Berlin
Cambodia
Thank you for your attention !