From Labelling Open Data Images to Building a Private Recommender System A transfer learning application

Upload: pierre-gutierrez

Post on 13-Apr-2017


Page 1: From Labelling Open data images to building a private recommender system

From Labelling Open Data Images to Building a Private Recommender System A transfer learning application

Page 2: From Labelling Open data images to building a private recommender system

Outline

•  Introduction

•  Iterative building of a recommender system

•  Labelling images AKA: Pragmatic Deep learning for “Dummies”

• Post processing AKA: Using Images information for BI on steroids

• Results & Conclusion

Page 3: From Labelling Open data images to building a private recommender system

Dataiku

•  Founded in 2013
•  60+ employees
•  Paris, New York, London, San Francisco

Data Science Software Editor of Dataiku DSS

DESIGN
•  PREPARE: Load and prepare your data
•  MODEL: Build your models
•  ANALYSE: Visualize and share your work

PRODUCTION
•  AUTOMATE: Re-execute your workflow at ease
•  MONITOR: Follow your production environment
•  SCORE: Get predictions in real time

Page 4: From Labelling Open data images to building a private recommender system

Key Figures

•  E-business vacation retailer

•  Founded in 2006. 500M revenue in 2015.

•  18 million clients.

•  Hundreds of sales every day -> recommendation engine

•  Sale image is paramount

Page 5: From Labelling Open data images to building a private recommender system

VPG specificities

•  Sales are very temporary
   - Unlike Amazon / Price Minister / Cdiscount
   - Some classical recommender systems fail
   - Sales are event linked (Christmas, ski, summer)

•  Expensive product
   - Few recurrent buyers
   - Appearance counts a lot

•  Few recurrent buyers
   - Classical approaches fail
   - Less signal. Visit information is paramount.
   - Users are less inclined to browse a lot (first 4-10 sales)

Page 6: From Labelling Open data images to building a private recommender system

A data science workflow: Six steps to a predictive model

[Figure: iterative workflow cycle. Business Understanding -> Data Acquisition -> Data Exploration & Understanding -> Data Preparation -> Model Creation -> Evaluation -> Deployment. Dataset 1, Dataset 2, ... Dataset n feed Iteration 1, Iteration 2, ... Iteration n, each producing a scored dataset.]

Creating a predictive model is a highly iterative process. Data Science Studio enables its users to create and manage these projects from end to end. This process is not industry specific, and can be applied to many use cases.

Adapted from the CRISP-DM methodology


Page 8: From Labelling Open data images to building a private recommender system

Iterative Building of a Recommender System

Page 9: From Labelling Open data images to building a private recommender system

Basic Recommendation Engines

Page 10: From Labelling Open data images to building a private recommender system

Other Factors

Page 11: From Labelling Open data images to building a private recommender system

One Meta Model to Rule Them All

Recommenders as features

Machine learning to optimize purchasing probability

Combine

Recommend

Describe

Page 12: From Labelling Open data images to building a private recommender system

One Meta Model to Rule Them All

• Negative sampling
   - Take all purchase tuples : (user, product, timestamp) -> 1
   - Select 5 sales open at the same date that the user did not buy -> 0
   - The model directly optimizes purchasing probability

• Machine learning model
   - Features : recommender systems
   - Logistic Regression. Regularizing effect : we don’t want to overfit leaks.

• Reranking approach. Similar to Google or Yandex (Kaggle challenge)
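The negative sampling and meta model described on this slide can be sketched roughly as follows. This is a minimal illustration, not the production code: the toy purchases, the `build_training_pairs` helper and the random numbers standing in for recommender scores are all assumptions made for the example.

```python
import random

import numpy as np
from sklearn.linear_model import LogisticRegression


def build_training_pairs(purchases, open_sales_by_date, n_negatives=5, seed=0):
    """Return (user, sale, date, label) rows: 1 for a purchase, 0 for a sampled non-purchase."""
    rng = random.Random(seed)
    bought = {(user, sale) for user, sale, _ in purchases}
    rows = []
    for user, sale, date in purchases:
        rows.append((user, sale, date, 1))
        # Negatives: sales open on the same date that this user never bought.
        candidates = [s for s in open_sales_by_date[date] if (user, s) not in bought]
        for negative in rng.sample(candidates, min(n_negatives, len(candidates))):
            rows.append((user, negative, date, 0))
    return rows


# Hypothetical toy data: 3 purchases, 8 sales open on each date.
purchases = [("u1", "s1", "2017-01-01"), ("u2", "s3", "2017-01-01"), ("u1", "s7", "2017-01-02")]
open_sales_by_date = {d: [f"s{i}" for i in range(8)] for d in ("2017-01-01", "2017-01-02")}
rows = build_training_pairs(purchases, open_sales_by_date)
labels = np.array([label for *_, label in rows])

# In the real system each column would be the score of one basic recommender
# for the (user, sale) pair; random numbers with a fake signal stand in here.
rng = np.random.default_rng(0)
features = rng.normal(size=(len(rows), 3)) + labels[:, None]

# Regularized logistic regression (C is the inverse regularization strength).
meta_model = LogisticRegression(C=1.0).fit(features, labels)
purchase_probability = meta_model.predict_proba(features)[:, 1]
```

Sorting the open sales of a user by `purchase_probability` gives the reranking mentioned in the last bullet.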

Page 13: From Labelling Open data images to building a private recommender system

One Meta Model to Rule Them All

• Going further ?
   - Predict the visit ?
      -  Would let us take more information into account
      -  Many people browse randomly
   - Learning to rank on target: 2 bought, 1 visited, 0 elsewhere
   - Impact of this on top 10 sales ?

• Limitations :
   - Highly dependent on the ranking displayed, which we don’t have
   - May overfit old man-made rules.

Page 14: From Labelling Open data images to building a private recommender system

Recommender system for Home Page Ordering

Inputs: customer visits, purchases, sales information, sales images

1. Cleaning, combining and enrichment of data: the application automatically runs and compiles heterogeneous data

2. Recommendation Engines: generation of recommendations based on user behaviour

3. Meta Model: the meta model combines recommendations to directly optimize purchasing probability

4. Optimization of home display: every customer is shown the 10 sales they are most likely to buy

+7% revenue (A/B testing)

Batch scoring every night
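As an illustration of the nightly batch step, the sketch below picks the 10 sales with the highest predicted purchase probability for each customer. The function name, the sale identifiers and the random score matrix are hypothetical; in practice the scores would come from the meta model.

```python
import numpy as np


def top_sales_per_customer(purchase_probability, sale_ids, n_displayed=10):
    """purchase_probability has shape (n_customers, n_sales).

    Returns, for each customer, the sale ids ordered by decreasing probability.
    """
    order = np.argsort(-purchase_probability, axis=1)[:, :n_displayed]
    return [[sale_ids[j] for j in row] for row in order]


# Hypothetical scores for 3 customers and 50 open sales.
rng = np.random.default_rng(0)
sale_ids = [f"sale_{i}" for i in range(50)]
scores = rng.random((3, 50))
home_pages = top_sales_per_customer(scores, sale_ids)
```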

Page 15: From Labelling Open data images to building a private recommender system

Why use Image ?

We want to distinguish

« Sun and Beach »

« Ski »

A picture is worth a thousand words

Page 16: From Labelling Open data images to building a private recommender system

Integrating Image Information

Sales Images -> Labelling Model -> image labels, for example:

Pool + Palm Trees Hotel + Mountains

Pool + Forest + Hotel + Sea

Sea + Beach + Forest + Hotel

Image labels + Sales descriptions -> CONTENT BASED Recommender System

Page 17: From Labelling Open data images to building a private recommender system

Image Labelling For Recommendation Engine: Pragmatic Deep Learning for “Dummies”

Page 18: From Labelling Open data images to building a private recommender system

Using Deep Learning models Common Issues

“I don’t have a GPU server”

“I don’t have a deep learning expert”

“I don’t have labelled data” (or too few)

“I don’t have the time to wait for model training”

“I don’t want to pay for private APIs” / “I’m afraid their labelling will change over time”

Page 19: From Labelling Open data images to building a private recommender system

Pragmatic Deep Learning Cheat Sheet

[Flowchart: Y/N decision tree starting from “Do you have labels ?”, with the questions “Many ?”, “Are you sure ?”, “Is there a similar database ?” and “Is there a pre-trained model ?”, leading to the outcomes “Train DL model”, “Transfer Learning”, “Use it !” and “Create your own”.]

Page 20: From Labelling Open data images to building a private recommender system

“I have no (or few) labelled data” -> Is there similar data ?

Solution 1 : Pre trained models

•  PLACES DATABASE: 205 categories, 2.5 M images
•  SUN DATABASE: 307 categories, 110 K images
•  VPG: similar images

Page 21: From Labelling Open data images to building a private recommender system

Solution 1 : Pre trained models

If there is open data, there is an open pre trained model !
•  Kudos to the community
•  Check the licensing

Example with Places (Caffe Model Zoo):
•  tower: 0.53, skyscraper: 0.26
•  swimming_pool/outdoor: 0.65, inn/outdoor: 0.06

Page 22: From Labelling Open data images to building a private recommender system

Solution 2 : Transfer Learning

“I want to add information of SUN database”
“But I have only 100 K images”

If you know how to recognize… after a little bit of training… you will be able to recognize

Transfer Learning: use a network that knows how to see
•  As a feature generator / transformer
•  To be updated for the new problem

Page 23: From Labelling Open data images to building a private recommender system

Solution 2 : Transfer Learning. Not limited to images !

Word2Vec: use large text corpora
•  For grammar learning
•  For synonym learning

If you know the sentiment for
•  “This wine tastes great” -> 1
•  “The most disgusting cheese ever” -> 0

And you know synonyms and grammar (word2vec)

It’s easy to classify
•  “This cheese tasted awful”
•  “The best wine in town”

Pan, Sinno Jialin, and Qiang Yang. "A survey on transfer learning." IEEE Transactions on knowledge and data engineering 22.10 (2010): 1345-1359.

Page 24: From Labelling Open data images to building a private recommender system

Solution 2 : Transfer Learning

Credit : Fei-Fei Li & Andrej Karpathy & Justin Johnson http://cs231n.stanford.edu/slides/winter1516_lecture11.pdf

Page 25: From Labelling Open data images to building a private recommender system

Solution 2 : Transfer Learning

•  Similar data, few labeled data: use network as transformer, create simple model
•  Similar data, lots of labeled data: fine tune several layers, depending on size of data
•  Not so similar data, few labeled data: troubles. Simple model on shallow layers ? Or get other data
•  Not so similar data, lots of labeled data: retrain new network with existing architecture

Example: SUN vs Places dataset :)

VPG :
•  No labeled data
•  Similar data
•  ?

Credit : Fei-Fei Li & Andrej Karpathy & Justin Johnson http://cs231n.stanford.edu/slides/winter1516_lecture11.pdf

Page 26: From Labelling Open data images to building a private recommender system

Solution 2 : Transfer Learning

Leverage existing knowledge !

[Diagram: PLACES DATABASE -> Training (optional) -> Pre-trained model VGG16 (Caffe Model Zoo), e.g. tower: 0.53, skyscraper: 0.26. SUN DATABASE -> Transferred data: last convolutional layer features -> Re-training -> Re-trained model (TensorFlow, 2 fully connected layers) -> applied to VOYAGE PRIVE images. Hardware labels on the three stages: GPU, CPU, GPU.]

Accuracy: 72%, Top-5 Acc: 90% > state of the art on dataset alone
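To make the re-training step and the top-5 metric concrete, here is a small sketch. It is not the TensorFlow model from the slide: scikit-learn's `MLPClassifier` with two hidden layers is swapped in as the small head, and synthetic vectors stand in for the last convolutional layer features. All sizes and names are assumptions.

```python
import numpy as np
from sklearn.neural_network import MLPClassifier


def top_k_accuracy(probabilities, true_labels, classes, k=5):
    """Share of samples whose true label is among the k most probable classes."""
    top_k = classes[np.argsort(-probabilities, axis=1)[:, :k]]
    return float(np.mean([label in row for label, row in zip(true_labels, top_k)]))


# Synthetic stand-in for transferred features: one noisy cluster per class.
rng = np.random.default_rng(0)
n_classes, n_features = 10, 32
centroids = rng.normal(size=(n_classes, n_features))
labels = rng.integers(0, n_classes, size=600)
features = centroids[labels] + rng.normal(size=(600, n_features))

# Small head with 2 fully connected hidden layers, trained on the fixed features.
head = MLPClassifier(hidden_layer_sizes=(64, 64), max_iter=500, random_state=0)
head.fit(features[:500], labels[:500])

probabilities = head.predict_proba(features[500:])
top1 = head.score(features[500:], labels[500:])
top5 = top_k_accuracy(probabilities, labels[500:], head.classes_, k=5)
```

Because the pre-trained network is only used to produce features, the expensive part runs once per image and the head itself is cheap to train.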

Page 27: From Labelling Open data images to building a private recommender system

Solution 3 : Generating your own large (or not) dataset

• Create Label Set
   - Easy : Man VS Woman ?
   - Harder : all relevant information in my images
   - Manually select all words in a corpus (ex Wordnet)

• Use Search Engines
   - Augment search terms
   - Get URLs and images from search term
   - Deduplicate

• Validate with Mechanical Turk
   - Exclude incorrect images
   - Evaluate human performance

Page 28: From Labelling Open data images to building a private recommender system

Solution 4 : What about APIs ?

Page 29: From Labelling Open data images to building a private recommender system

Solution 4 : What about APIs ?

• Price
   - Their cost is often rather cheap. Ex: 100 K requests for less than $300
   - VS the cost of redeveloping the model yourself (and probably not as well)

• Full database scoring
   - APIs often limit the number of queries per month
   - Make sure you can avoid the cold start problem

• Stability
   - Use model versioning
   - Avoid covariate shift, distribution drift

Page 30: From Labelling Open data images to building a private recommender system

What about APIs ? Use for generating labels !

• How to :
   - Score part of the database for training
   - Train a model
   - Score your entire database

• But I have only 5000 requests ? -> Use Transfer Learning !

• Stealing models (Or don’t, it’s illegal)
  Tramèr, Florian, et al. "Stealing Machine Learning Models via Prediction APIs." arXiv preprint arXiv:1609.02943 (2016).

Page 31: From Labelling Open data images to building a private recommender system

What about APIs ? Use for generating labels !

Experiment:

•  5000 requests on API
   - 4500 for training
   - 500 for validation

•  Transfer learning with MIT Places pre-trained model

•  Scikit-learn multilabel model
   - One Vs the Rest
   - Untuned logistic regression

(Or don’t, it’s illegal) (demo, not used in any real project)
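The shape of this experiment can be sketched with scikit-learn as below. The data is entirely synthetic: random vectors stand in for the features of the pre-trained Places model and a random linear rule stands in for the labels returned by the API, so the resulting metrics say nothing about the real demo.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_score, recall_score
from sklearn.multiclass import OneVsRestClassifier

rng = np.random.default_rng(0)
n_images, n_features, n_labels = 5000, 64, 6

# Stand-in for transferred features and for the multilabel API answers.
features = rng.normal(size=(n_images, n_features))
weights = rng.normal(size=(n_features, n_labels))
noise = rng.normal(size=(n_images, n_labels))
api_labels = (features @ weights + noise > 0).astype(int)

# 4500 images for training, 500 for validation.
X_train, X_val = features[:4500], features[4500:]
y_train, y_val = api_labels[:4500], api_labels[4500:]

# One binary, untuned logistic regression per label.
model = OneVsRestClassifier(LogisticRegression()).fit(X_train, y_train)
predicted = model.predict(X_val)

label_accuracy = float((predicted == y_val).mean())
precision = precision_score(y_val, predicted, average="micro")
recall = recall_score(y_val, predicted, average="micro")
```

Micro-averaging is one possible choice here; the slides do not say how accuracy, precision and recall were aggregated across labels.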

Page 32: From Labelling Open data images to building a private recommender system

What about APIs ? Results

Accuracy    95
Recall      80
Precision   75

Example image 1:

Label       Probability    Label       Probability
landscape   1.0000         sunset      0.9998
sky         1.0000         no person   0.9996
outdoors    1.0000         water       0.9990
nature      1.0000         park        0.9849
rock        1.0000         river       0.9678
travel      1.0000         scenic      0.8031

Example image 2:

Label       Probability    Label       Probability
beach       1.0000         ocean       1.0000
summer      1.0000         relaxation  1.0000
sand        1.0000         island      1.0000
tropical    1.0000         idyllic     1.0000
travel      1.0000         seashore    0.9998
seascape    1.0000         water       0.9997

(demo, not used in any real project)

Page 33: From Labelling Open data images to building a private recommender system

Post Treatment

(Or how we transfer the labelling information)

Using Images Information for BI on Steroids

Page 34: From Labelling Open data images to building a private recommender system

Labels post-processing

Voyage Privé images -> Deep/Transfer Learning models -> 5-10 tags per image
•  2s/image with CPU
•  x20 speed up with GPU

Classification problem
•  Only have probabilities of each class
•  Selecting based on probability threshold fails
•  Keeping all information is not sparse

-> we keep 5 labels and probabilities per image
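Keeping the 5 most probable labels per image, rather than thresholding, could look like the sketch below. The helper name is hypothetical, and only the tower/skyscraper and swimming_pool/inn probabilities come from the earlier Places example; the remaining values are made up to fill the rows.

```python
import numpy as np


def keep_top_labels(probabilities, label_names, n_kept=5):
    """For each image, return its n_kept most probable (label, probability) pairs."""
    order = np.argsort(-probabilities, axis=1)[:, :n_kept]
    return [
        [(label_names[j], float(row[j])) for j in idx]
        for row, idx in zip(probabilities, order)
    ]


label_names = ["tower", "skyscraper", "swimming_pool/outdoor", "inn/outdoor",
               "beach", "coast", "bridge"]
probabilities = np.array([
    [0.53, 0.26, 0.01, 0.02, 0.03, 0.05, 0.10],  # illustrative row for a city image
    [0.02, 0.01, 0.65, 0.06, 0.12, 0.10, 0.04],  # illustrative row for a pool image
])
tags = keep_top_labels(probabilities, label_names)
```

Every image ends up with the same number of tags, which avoids the case where a fixed probability threshold leaves some images with no label at all.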

Page 35: From Labelling Open data images to building a private recommender system

Labels post-processing

Issue with our approach:
•  Complementary information
•  Redundant information

Solution : Matrix Factorization

Page 36: From Labelling Open data images to building a private recommender system

Topic extraction with Non-Negative Matrix Factorization

•  Non-Negative Matrix Factorization (NMF): X = WH (scikit-learn implementation)
   - X : image x tag, non-negative
   - W : image x theme
   - H : theme x tag

•  Most represented themes
   - Swimming-pool_Apartment_Putting-green
   - Ocean_Coast_SandBar
   - Coast_SeaCliff_RockArch
   - Beach_Coast_BoardWalk
   - Bridge_Viaduct_River
   - Palace_BuildingFacade_Mansion
   - Castle_Mansion_Monastery
   - HotelRoom_Bedroom_DormRoom

•  Dimension reduction
   - 200x200 pixels -> 600 tags => 30 themes
   - Faster content based filtering

•  An image is often a sparse combination of themes: faster content based filtering

•  Each theme has the same explanatory power: balanced vector for content based

•  Explicability: each theme corresponds to a few labels
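A minimal sketch of the factorization with scikit-learn's `NMF` follows. The image x tag matrix is random and much smaller than the 600 tags and 30 themes quoted on the slide, and naming a theme after its three strongest tags is an assumption inferred from the theme names listed there.

```python
import numpy as np
from sklearn.decomposition import NMF

# Hypothetical sparse, non-negative image x tag matrix (200 images, 40 tags).
rng = np.random.default_rng(0)
tag_names = [f"tag_{i}" for i in range(40)]
X = rng.random((200, 40)) * (rng.random((200, 40)) < 0.15)

nmf = NMF(n_components=5, init="nndsvda", max_iter=500, random_state=0)
W = nmf.fit_transform(X)   # image x theme
H = nmf.components_        # theme x tag


def theme_name(theme_row, n_tags=3):
    """Name a theme after its n_tags highest-weighted tags."""
    top = np.argsort(-theme_row)[:n_tags]
    return "_".join(tag_names[j] for j in top)


theme_names = [theme_name(row) for row in H]
```

Each row of `W` is the compact theme vector of one image and can be fed to the content-based recommender instead of the full tag vector.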

Page 37: From Labelling Open data images to building a private recommender system

Image content detection

Topic scores determine the importance of topics in an image

Example image 1:

TOPIC                                          TOPIC SCORE (%)
Golf course – Fairway – Putting green          31
Hotel – Inn – Apartment building outdoor       30
Swimming pool – Lido deck – Hot tub outdoor    22
Beach – Coast – Harbor                         17

Example image 2:

TOPIC                                          TOPIC SCORE (%)
Tower – Skyscraper – Office building           62
Bridge – River – Viaduct                       38
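One plausible way to obtain such percentages is to normalize an image's theme weights so that they sum to 100. The slides do not state how the scores were computed, so the normalization, the `min_share` cut-off and the example weights below are assumptions.

```python
import numpy as np


def topic_scores(theme_weights, theme_names, min_share=0.05):
    """Turn one image's non-negative theme weights into percentage shares."""
    weights = np.asarray(theme_weights, dtype=float)
    total = weights.sum()
    if total == 0:
        return []
    shares = weights / total
    ranked = sorted(zip(theme_names, shares), key=lambda pair: -pair[1])
    # Drop negligible themes and report whole percentages.
    return [(name, int(round(100 * float(share)))) for name, share in ranked
            if share >= min_share]


scores = topic_scores(
    [1.24, 0.0, 0.76],
    ["Tower_Skyscraper_OfficeBuilding", "Beach_Coast_Harbor", "Bridge_River_Viaduct"],
)
```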

Page 38: From Labelling Open data images to building a private recommender system

Note on model performance

•  Image labels are used for similarity. Calling a grass field “putting green”:
   - Is not important if all grass fields are called this way.
   - Would be if we had lots of golf trip sales.

•  Improving the NN performance ?
   - Labels are used in NMF and reduced to themes
   - Themes are used to calculate similarities for CB recommenders
   - CB recommenders are used as a feature in the meta model
   - The meta model gives probabilities of purchase = order
   - Users only check 10 sales… -> what is the change in online performance for 1% accuracy ?
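The similarity step in this chain can be illustrated with cosine similarity between theme vectors. The slides do not name the similarity measure, so cosine is an assumption, as are the function name and the toy vectors.

```python
import numpy as np


def most_similar_sales(theme_vectors, sale_index, n_results=3):
    """Indices of the sales whose theme vectors are closest (cosine) to the given sale."""
    norms = np.linalg.norm(theme_vectors, axis=1, keepdims=True)
    unit = theme_vectors / np.where(norms == 0, 1.0, norms)
    similarity = unit @ unit[sale_index]
    similarity[sale_index] = -np.inf  # never recommend the sale itself
    return np.argsort(-similarity)[:n_results].tolist()


# Hypothetical theme vectors with columns (pool, city, ski).
theme_vectors = np.array([
    [0.9, 0.1, 0.0],  # sale 0: mostly pool
    [0.8, 0.2, 0.0],  # sale 1: mostly pool
    [0.0, 0.1, 0.9],  # sale 2: mostly ski
    [0.1, 0.0, 0.9],  # sale 3: mostly ski
])
neighbours = most_similar_sales(theme_vectors, sale_index=0, n_results=2)
```

Because only the relative geometry of the vectors matters, a label that is consistently wrong (every grass field called “putting green”) leaves these similarities unchanged, which is the point made on the slide.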

Page 39: From Labelling Open data images to building a private recommender system

Results

Page 40: From Labelling Open data images to building a private recommender system

Results ?

•  All Visits : mostly France, pool displayed

•  First Recommendation : fails to display pools

•  Only Images ? Pools all around the world

•  Third column = Right Mix

Page 41: From Labelling Open data images to building a private recommender system

Results ?

•  All Visits : Spain, Sun & Beach, pool displayed

•  First Recommendation : displays nature…

•  Only Images ? Pools all around the world

•  Third column = Right Mix : gets the bungalow feature !

Page 42: From Labelling Open data images to building a private recommender system

Conclusion

•  Do iterative data science !
   - Start simple and grow
   - Validate each step
   - Image labelling = BI on steroids

•  Deep Learning ?
   - Is there existing data ?
   - Is there a pre-trained model ?

•  Transfer Learning
   - Cheaper, faster
   - Any Data Scientist can do it

•  What’s Next ?

Page 43: From Labelling Open data images to building a private recommender system

Learned along the way For ski sales, showing indoor pictures performs better

What’s next ?

•  Comparison proposed/visited vacation

$$\mathrm{Attractiveness}(tag) = \frac{\text{Visited offers containing } tag}{\text{Proposed offers containing } tag}$$

•  Voyage Privé offers database = baseline: Ocean 67%, Bedroom 33%

•  Visits database: Ocean 33%, Bedroom 67%

•  « Bedroom » attractiveness = 0.67/0.33 ≈ 2
•  « Ocean » attractiveness = 0.33/0.67 ≈ 0.5
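The attractiveness ratio can be reproduced on a toy example as follows. The two small offer lists are invented so that the shares match the 67% / 33% figures above; the function names are hypothetical.

```python
from collections import Counter


def tag_shares(offers):
    """Share of offers that contain each tag."""
    counts = Counter(tag for tags in offers for tag in set(tags))
    return {tag: count / len(offers) for tag, count in counts.items()}


def attractiveness(visited_offers, proposed_offers):
    """Share of visited offers containing a tag, divided by its share among proposed offers."""
    visited = tag_shares(visited_offers)
    proposed = tag_shares(proposed_offers)
    return {tag: visited.get(tag, 0.0) / share for tag, share in proposed.items()}


# Proposed offers (baseline): 2/3 ocean pictures, 1/3 bedroom pictures.
proposed_offers = [["ocean"], ["ocean"], ["bedroom"]]
# Visited offers: 1/3 ocean pictures, 2/3 bedroom pictures.
visited_offers = [["bedroom"], ["bedroom"], ["ocean"]]
scores = attractiveness(visited_offers, proposed_offers)
```

A score above 1 means a tag is visited more often than its share of the catalogue would suggest, as for « Bedroom » here.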


Page 45: From Labelling Open data images to building a private recommender system

What’s Next ?

Kenya

Prague

Berlin

Cambodia

Page 46: From Labelling Open data images to building a private recommender system

What’s Next ? Customize the Image !

Kenya

Prague

Berlin

Cambodia

Page 47: From Labelling Open data images to building a private recommender system

Thank you for your attention !