Product classification for e-Commerce platforms

Ioannis Partalas and Georgios Balikas

Viseo R&D, Laboratoire d’Informatique de Grenoble

January 27, 2016, Meetup, Grenoble Data Science



Outline

1 The Challenge

2 Data preparation

3 Learning models

4 Results

5 Lessons learned


CDiscount competition

• Run on the http://www.datascience.net platform
• A large collection of product items was available
• Goal: classify new products into a product taxonomy
• Performance criterion: Accuracy = (# products correctly classified) / (# total products)
• Prizes: 1st place 9,000 euros, 2nd 4,000 euros, 3rd 1,000 euros, 4th and 5th 500 euros
• 175 teams participated. We ranked 10th with a score of 64.2 (the winning team had 68.3)
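The performance criterion above can be sketched in a few lines of Python. This is only an illustration of the accuracy formula; the function name `accuracy` and the list-based inputs are assumptions, not the competition's actual scoring code.

```python
def accuracy(y_true, y_pred):
    """Fraction of products whose predicted class equals the true class."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    if not y_true:
        raise ValueError("cannot compute accuracy on an empty list")
    # Count positions where the prediction matches the true label
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)
```

A score such as 64.2 on the leaderboard corresponds to this ratio expressed as a percentage.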


Product classification

• Critical task for e-commerce platforms (e.g. Amazon, eBay, Cdiscount, Kelkoo)
• Supports retrieval and recommendation tasks
• Shopping platforms use product taxonomies to this end
• Product classification can be framed as a text classification problem
  • $x_i \in \mathbb{R}^d$ represents document $i$ in a vector space
  • $y_i \in Y = \{1, \dots, K\}$ is its associated class label, with $|Y| > 2$
• Some problems
  • Titles are very short: “Lot de 20 pastilles de culture” (pack of 20 culture tablets)
  • Problematic grammatical structure (incomplete sentences): “Pastis - Marseille - Vendu à l’unité” (Pastis - Marseille - sold individually)


Training Data

• ∼15M products
• Hierarchy of classes with 3 levels: 52, 536 and 5,789 classes
• Target class: Categorie3
• 35,066 test instances

| Categorie3 | Description | Libelle | Marque |
|---|---|---|---|
| 1000015309 | De Collectif aux éditions SOLESMES | Benedictions de l eglise | |
| 1000015309 | Cartable 2 soufflets, compatible PC. Coloris : Rouge | Cartable | DELSEY |
| 1000010100 | or 750, poids : 3.45gr, diamants : 0.26 carats | Bague or et diamants | AUCUNE |
| 1000003407 | Champagne Brut - Champagne - Vendu à l’unité - 1 x 75cl | Mumm Brut | AUCUNE |

[Figure: frequency histogram; x-axis ticks from 0 to 80, y-axis "Frequency" from 0 to 1,200,000]


Subsampling

• Highly imbalanced dataset: models are biased towards the big classes
• Data was randomly sampled by downsampling the majority classes
• Boosts the best single models by around +2.5%
• Speeds up the training process
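One way to downsample the majority classes is to cap the number of examples kept per class and draw the kept examples at random. The sketch below assumes such a per-class cap; the function name `downsample`, the `max_per_class` parameter and the fixed seed are illustrative choices, since the slides do not say how the sampling rate was set.

```python
import random
from collections import defaultdict

def downsample(labels, max_per_class, seed=0):
    """Return sorted indices, keeping at most max_per_class examples per class."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    kept = []
    for idxs in by_class.values():
        # Only classes larger than the cap (assumed threshold) are sampled down
        if len(idxs) > max_per_class:
            idxs = rng.sample(idxs, max_per_class)
        kept.extend(idxs)
    return sorted(kept)
```

Small classes are left untouched, so the class distribution becomes flatter and the training set smaller, which matches the two effects reported on the slide.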


Preprocessing

• Concatenation of “Description” + “Libellé” + “Marque”
• Removal of non-ASCII and non-printable characters, HTML tags and punctuation
• Accents were stripped
• Words with a numerical and a text part were split: “12cm” → “12” and “cm”

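import re

def preprocess(text):
    # Drop HTML comments and tags, non-printable or non-ASCII characters, and underscores.
    # Python's built-in re has no POSIX [:print:] class, so the printable-ASCII range is spelled out.
    t = re.sub(r'(<!--.*?-->|<[^>]*>)|([^\x20-\x7e])|(_)', ' ', text)
    # Replace punctuation and other non-word characters with spaces
    t = re.sub(r'\W', ' ', t)
    # Split tokens that mix letters and digits, e.g. "12cm" -> "12 cm"
    t = re.sub(r'([a-z]+)([0-9]+)', r'\1 \2', t)
    t = re.sub(r'([0-9]+)([a-z]+)', r'\1 \2', t)
    t = re.sub(r'[/><]', ' ', t)
    # Collapse runs of whitespace into a single space
    t = re.sub(r'\s+', ' ', t)
    return t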
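The slide mentions accent stripping but does not show it. A minimal sketch using the standard `unicodedata` module follows; the helper names `strip_accents` and `normalize` are hypothetical, and applying this step (plus lowercasing) before the regular-expression cleanup is an assumption about the order of operations.

```python
import unicodedata

def strip_accents(text):
    """Remove diacritics by decomposing characters and dropping combining marks."""
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

def normalize(text):
    # Assumed order: strip accents first, then lowercase
    return strip_accents(text).lower()
```

Stripping accents before removing non-ASCII characters keeps the base letter ("é" becomes "e") instead of losing it.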


Vectorization (1/2)

[Figure: documents Doc1 ... DocN → vocabulary extraction (terms $v_1, \dots, v_d$) → vectorization, $x_i \in X \subset \mathbb{R}^{|V|}$]

• Vocabulary extraction:
  • No stemming or lemmatization; we kept stop-words
  • Used several combinations of n-grams, n = 1, 2, 3
  • 2-grams for "Câble antivol CORPORATE Blanc à code" → (Câble antivol), (antivol CORPORATE), (CORPORATE Blanc), (Blanc code) ...
  • Indicatively, |V| = 1.6M for unigrams
  • We keep a fraction to reduce the problem size
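The n-gram extraction described above can be sketched as follows. The function names are illustrative, and ranking n-grams by document frequency to "keep a fraction" of the vocabulary is an assumption: the slides do not state the selection criterion.

```python
from collections import Counter

def ngrams(tokens, n):
    """Return the contiguous n-grams of a token list as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def extract_features(tokens, max_n=3):
    """All n-grams for n = 1..max_n."""
    feats = []
    for n in range(1, max_n + 1):
        feats.extend(ngrams(tokens, n))
    return feats

def build_vocabulary(docs, max_size, max_n=3):
    # Assumption: keep the max_size n-grams that occur in the most documents
    df = Counter()
    for tokens in docs:
        df.update(set(extract_features(tokens, max_n)))
    return [gram for gram, _ in df.most_common(max_size)]
```

For example, `ngrams(["cable", "antivol", "corporate"], 2)` yields the pairs (cable, antivol) and (antivol, corporate).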


Vectorization (2/2)

• We adopted the Vector Space Model (Salton 1975):
  • $x_{ij} = \mathrm{tf}_{v_j,i} \times \mathrm{idf}_{v_j}$
  • $\mathrm{tf}_{v_j,i}$ is the term frequency of term $j$ in document $i$
  • $\mathrm{idf}_v = \log \frac{N}{\mathrm{df}_v + 1}$, where $\mathrm{df}_v$ is the document frequency of term $v$ and $N$ the total number of documents
  • Sublinear scaling: $\mathrm{tf} \leftarrow 1 + \log(\mathrm{tf})$
• Each vector was normalized (unit vectors): $x_i \leftarrow x_i / \|x_i\|$
• Power transformation: $x = (x_1, x_2, \dots, x_d) \to (x_1^{\alpha}, x_2^{\alpha}, \dots, x_d^{\alpha})$
  • Reduces the effect of the most common words
  • $\alpha = 0.5$ seems to work well
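The weighting scheme above can be sketched in pure Python with sparse dictionaries. Several points are assumptions rather than the authors' implementation: the function name `tfidf_vectors`, applying the power transformation after normalization without re-normalizing, and the sign-preserving power, which is needed because $\log \frac{N}{\mathrm{df}_v + 1}$ is negative for a term that appears in every document.

```python
import math
from collections import Counter

def tfidf_vectors(docs, alpha=0.5):
    """docs: list of token lists. Returns one sparse {term: weight} dict per document."""
    n_docs = len(docs)
    # Document frequency: number of documents containing each term
    df = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        # Sublinear tf (1 + log tf) times idf = log(N / (df + 1))
        vec = {t: (1.0 + math.log(c)) * math.log(n_docs / (df[t] + 1))
               for t, c in tf.items()}
        # L2 normalization to a unit vector
        norm = math.sqrt(sum(w * w for w in vec.values()))
        if norm > 0:
            vec = {t: w / norm for t, w in vec.items()}
        # Power transformation, keeping the sign of each weight (assumption)
        vec = {t: math.copysign(abs(w) ** alpha, w) for t, w in vec.items()}
        vectors.append(vec)
    return vectors
```

With `alpha=0.5` every weight is replaced by its square root, which compresses large values and so dampens the dominant terms in a document.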


Vectorizing a document

Câble antivol CORPORATE Blanc à code en acier trempé - Verrou à code

→ Preprocessing → [cable, antivol, corporate, blanc, code, acier, trempe, verrou]

→ [1, 1, 0, 0, 0, 1, 0, ..., 2], a vector of dimension d with one entry per vocabulary term (cable, antivol, corporate, code, ...)


Tuning and Validation Strategy

We used:
• a subset of classes to validate our ideas,
• the public part of the leaderboard to check our performance, and
• periodic rankings with respect to the private part to make sure we did not overfit the public part.


Models

• We rely on linear models, focusing on SVMs:
  $\min_w \frac{1}{2}\|w\|^2 + C \sum_i L(w; x_i, y_i)$
• Loss functions: hinge, $\max(0, 1 - y_i w^T x_i)$, and logistic, $\log(1 + e^{-y_i w^T x_i})$
• One-versus-rest for solving the multiclass problem
• We also employed several hierarchical top-down models
  • Keeping the whole structure
  • Removing layers from the hierarchy

[Figure: example class hierarchy. Root → Arts, Sports; Arts → Movies, Video; Sports → Tennis, Soccer; third level: Players, Fun]
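The objective and the one-versus-rest rule can be illustrated on sparse dictionaries. This sketch only evaluates the two losses and the regularized objective and picks the highest-scoring class; it does not train anything, and all function names are hypothetical. Labels are assumed to be +1 or -1 for each binary one-versus-rest problem.

```python
import math

def dot(w, x):
    """Sparse dot product between a weight dict and a feature dict."""
    return sum(w.get(t, 0.0) * v for t, v in x.items())

def hinge_loss(w, x, y):
    return max(0.0, 1.0 - y * dot(w, x))

def logistic_loss(w, x, y):
    return math.log1p(math.exp(-y * dot(w, x)))

def objective(w, data, C, loss):
    """0.5 * ||w||^2 + C * sum of per-example losses; data is a list of (x, y)."""
    reg = 0.5 * sum(v * v for v in w.values())
    return reg + C * sum(loss(w, x, y) for x, y in data)

def predict_ovr(weights_by_class, x):
    """One-versus-rest: return the class whose linear model scores x highest."""
    return max(weights_by_class, key=lambda c: dot(weights_by_class[c], x))
```

With K classes, one-versus-rest trains K such weight vectors, one per class against all others, so prediction costs K sparse dot products.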


Ensembling

• Our final systems were combinations of the base models
• For an ensemble of classifiers $\{h_1, \dots, h_T\}$, where $h_i^j(x)$ is the output of classifier $h_i$ for class $c_j$:
  • Plurality voting: $H(x) = c_{\arg\max_j \sum_{i=1}^{T} h_i^j(x)}$
  • Weighted voting: $H(x) = c_{\arg\max_j \sum_{i=1}^{T} w_i h_i^j(x)}$
• Unfortunately, we had no time to try stacked generalization

[Figure: the training dataset feeds learning algorithms $a_1, a_2, \dots, a_T$, which produce classifiers $h_1, h_2, \dots, h_T$]
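Both voting rules can be sketched with one function, since plurality voting is weighted voting with equal weights. The function name `weighted_vote` and the tie-breaking behavior (the label seen first wins) are assumptions of this sketch; the slides do not say how the weights were chosen.

```python
from collections import defaultdict

def weighted_vote(predictions, weights=None):
    """predictions: one predicted label per classifier for a single item.
    weights: one weight per classifier; None means plurality voting."""
    if weights is None:
        weights = [1.0] * len(predictions)
    if len(weights) != len(predictions):
        raise ValueError("need exactly one weight per classifier")
    scores = defaultdict(float)
    for label, w in zip(predictions, weights):
        scores[label] += w
    # Label with the largest total weight; ties go to the label seen first
    return max(scores, key=scores.get)
```

For instance, three classifiers voting `["a", "b", "b"]` return "b" under plurality voting, but "a" if the first classifier carries a weight of 3.0 and the others 1.0.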


Hardware

• Full access to a machine with 4 cores at 2.4 GHz and 16 GB of RAM
• Limited access to a machine with 24 cores at 3.3 GHz and 128 GB of RAM
• Preprocessing takes around 1 hour, plus 4 to 6 hours for training a model


Base models results

| # | Description | Public | Private |
|---|---|---|---|
| 1 | 200K unigrams | 61.09 | 60.77 |
| 2 | 200K unigrams, α=0.5 | 61.60 | 61.25 |
| 3 | 250K unigrams | 61.142 | 60.77 |
| 4 | 300K unigrams | 61.148 | 60.87 |
| 5 | 250K unigrams, 250K bigrams | 61.79 | 61.37 |
| 6 | 200K unigrams, 400K bigrams, α=0.5 | 62.09 | 61.76 |
| 7 | 200K unigrams, 400K bigrams, α=0.5, “Marque” as binary feature | 62.64 | 62.15 |
| 8 | 1.2M unigrams, bigrams, trigrams, α=0.5 | 62.28 | 61.99 |
| 9 | 2M unigrams, bigrams, trigrams | 62.35 | 61.83 |
| 10 | 2M unigrams, bigrams, trigrams, α=0.5 | 63.30 | 62.99 |

• Power transformation boosts performance
• The addition of bigrams and trigrams helps, but may overfit the data


Best submissions

| # | Description | Public | Private | Coverage |
|---|---|---|---|---|
| 1 | 270K unigrams, α=0.5, half data | 63.56 | 63.11 | 3,208 |
| 2 | Weighted voting 1 | 64.55 | 64.20 | 3,128 |
| 3 | Weighted voting 2 | 64.57 | 64.14 | 3,116 |

• Downsampling improved the final score
• Weighted voting consistently improved accuracy by about 1.2% to 1.8%
• Low coverage: 40% of the products in the training data belong to the 10 most common classes
• The best system in the competition scored 68.32%. We ranked 10th (we were in 1st place for over 1 month :( )


What didn’t work

• k-Nearest Neighbors, Rocchio (used mainly for ensembling)
• Distributed representations failed to improve the results
  • Used the word2vec tool (Mikolov, 2013)
  • We generated a low-dimensional representation (200 features)
  • Improved k-NN classifiers over the tf-idf representation
• Also tried the BM25 scheme instead of tf-idf, but results were worse


We also explored

• Sparsification of linear models (Moura et al., 2015)
  • Slight increase. Needs more investigation
  • No time to do further experiments
• Re-ranking for large-scale problems (Babbar et al., 2014)
  • Worked in some validations
  • Costly operation


Conclusions

• Knowing the problem helps the feature engineering process.
• The validation mechanism is of primary importance.
  • Do not trust the leaderboard
  • Make only a few submissions initially to test the validation strategy
• High leaderboard ranks matter only after the end of the challenge.
• A clear strategy will benefit your participation in the long run.
• Published ideas do not apply universally.
• Ensembles always win.


Guidelines

• Learn your data. Try to understand it.
• Always have a validation strategy
• Do not fit the leaderboard
• Always keep out-of-fold data with you
  • You may need it to stack, blend or validate
• Don’t go for the money and don’t start dreaming too early :)
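The advice on out-of-fold data can be made concrete with a small sketch: each example is predicted by a model that never saw it during training, and those predictions can later feed a stacking or blending step or serve as a validation estimate. Everything here is illustrative: the interleaved fold assignment, the function names, and the `fit` and `predict` callables are assumptions, not the authors' pipeline.

```python
def kfold_indices(n, k):
    """Split indices 0..n-1 into k interleaved folds."""
    return [list(range(i, n, k)) for i in range(k)]

def out_of_fold_predictions(X, y, k, fit, predict):
    """fit(X_train, y_train) -> model; predict(model, x) -> label.
    Returns one prediction per example, made by a model not trained on it."""
    oof = [None] * len(X)
    for fold in kfold_indices(len(X), k):
        held_out = set(fold)
        train_idx = [i for i in range(len(X)) if i not in held_out]
        model = fit([X[i] for i in train_idx], [y[i] for i in train_idx])
        for i in fold:
            oof[i] = predict(model, X[i])
    return oof
```

Comparing `oof` with `y` gives an accuracy estimate that does not depend on the leaderboard, which is the point of the guideline.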


Thank you

Questions?
