Dataiku at SF DataMining Meetup - Kaggle Yandex Challenge
DESCRIPTION
This is a presentation given on 13 August 2014 at the SF Data Mining Meetup at Trulia. It covers Dataiku and the Kaggle Personalized Web Search Ranking challenge sponsored by Yandex.
TRANSCRIPT
![Page 1: Dataiku at SF DataMining Meetup - Kaggle Yandex Challenge](https://reader033.vdocuments.net/reader033/viewer/2022050815/547dc7895806b5ef5e8b45d8/html5/thumbnails/1.jpg)
write your own data story!
![Page 2: Dataiku at SF DataMining Meetup - Kaggle Yandex Challenge](https://reader033.vdocuments.net/reader033/viewer/2022050815/547dc7895806b5ef5e8b45d8/html5/thumbnails/2.jpg)
short story
![Page 3: Dataiku at SF DataMining Meetup - Kaggle Yandex Challenge](https://reader033.vdocuments.net/reader033/viewer/2022050815/547dc7895806b5ef5e8b45d8/html5/thumbnails/3.jpg)
Founded January 2013
January 2014: A Data Science Studio powered team wins a Challenge
![Page 4: Dataiku at SF DataMining Meetup - Kaggle Yandex Challenge](https://reader033.vdocuments.net/reader033/viewer/2022050815/547dc7895806b5ef5e8b45d8/html5/thumbnails/4.jpg)
![Page 5: Dataiku at SF DataMining Meetup - Kaggle Yandex Challenge](https://reader033.vdocuments.net/reader033/viewer/2022050815/547dc7895806b5ef5e8b45d8/html5/thumbnails/5.jpg)
Data Science Studio’s GA February 2014
![Page 6: Dataiku at SF DataMining Meetup - Kaggle Yandex Challenge](https://reader033.vdocuments.net/reader033/viewer/2022050815/547dc7895806b5ef5e8b45d8/html5/thumbnails/6.jpg)
July 2014: Data Science Studio available for free with a Community Edition!
![Page 7: Dataiku at SF DataMining Meetup - Kaggle Yandex Challenge](https://reader033.vdocuments.net/reader033/viewer/2022050815/547dc7895806b5ef5e8b45d8/html5/thumbnails/7.jpg)
15 People Now
![Page 8: Dataiku at SF DataMining Meetup - Kaggle Yandex Challenge](https://reader033.vdocuments.net/reader033/viewer/2022050815/547dc7895806b5ef5e8b45d8/html5/thumbnails/8.jpg)
!BI
Developer
Data Preparation
Build Algorithm
Build Application
Run Application
Business Analyst
Data Scientist
![Page 9: Dataiku at SF DataMining Meetup - Kaggle Yandex Challenge](https://reader033.vdocuments.net/reader033/viewer/2022050815/547dc7895806b5ef5e8b45d8/html5/thumbnails/9.jpg)
“I don’t want to be a data cleaner anymore”
![Page 10: Dataiku at SF DataMining Meetup - Kaggle Yandex Challenge](https://reader033.vdocuments.net/reader033/viewer/2022050815/547dc7895806b5ef5e8b45d8/html5/thumbnails/10.jpg)
“Finding Leaks in my Data Pipelines”
![Page 11: Dataiku at SF DataMining Meetup - Kaggle Yandex Challenge](https://reader033.vdocuments.net/reader033/viewer/2022050815/547dc7895806b5ef5e8b45d8/html5/thumbnails/11.jpg)
“Waiting for the (gradient boosted) trees to grow”
![Page 12: Dataiku at SF DataMining Meetup - Kaggle Yandex Challenge](https://reader033.vdocuments.net/reader033/viewer/2022050815/547dc7895806b5ef5e8b45d8/html5/thumbnails/12.jpg)
MPP Databases
Statistical Software Machine Learning
No-SQL Hadoop
![Page 13: Dataiku at SF DataMining Meetup - Kaggle Yandex Challenge](https://reader033.vdocuments.net/reader033/viewer/2022050815/547dc7895806b5ef5e8b45d8/html5/thumbnails/13.jpg)
Demo Time
![Page 14: Dataiku at SF DataMining Meetup - Kaggle Yandex Challenge](https://reader033.vdocuments.net/reader033/viewer/2022050815/547dc7895806b5ef5e8b45d8/html5/thumbnails/14.jpg)
Challenge
![Page 15: Dataiku at SF DataMining Meetup - Kaggle Yandex Challenge](https://reader033.vdocuments.net/reader033/viewer/2022050815/547dc7895806b5ef5e8b45d8/html5/thumbnails/15.jpg)
Using the historical logs of a search engine (QUERIES, RESULTS, CLICKS)
and a set of new QUERIES and RESULTS,
rerank the RESULTS in order to optimize relevance.
Personalized Web Search
Yandex
Fri 11 Oct 2013 – Fri 10 Jan 2014
194 Teams
$9,000 cash prize
![Page 16: Dataiku at SF DataMining Meetup - Kaggle Yandex Challenge](https://reader033.vdocuments.net/reader033/viewer/2022050815/547dc7895806b5ef5e8b45d8/html5/thumbnails/16.jpg)
The Team
No researcher. No experience in reranking.
Not much experience in ML for most of us. Not exactly our job. No expectations.
Kenji Lefevre, 37: Algebraic Geometry, learning Python
Christophe Bourguignat, 37: Signal Processing Engineer, learning Scikit
Mathieu Scordia, 24: Data Scientist
Paul Masurel, 33: Software Engineer
![Page 17: Dataiku at SF DataMining Meetup - Kaggle Yandex Challenge](https://reader033.vdocuments.net/reader033/viewer/2022050815/547dc7895806b5ef5e8b45d8/html5/thumbnails/17.jpg)
A-Team?
![Page 18: Dataiku at SF DataMining Meetup - Kaggle Yandex Challenge](https://reader033.vdocuments.net/reader033/viewer/2022050815/547dc7895806b5ef5e8b45d8/html5/thumbnails/18.jpg)
“HOBBITS”
![Page 19: Dataiku at SF DataMining Meetup - Kaggle Yandex Challenge](https://reader033.vdocuments.net/reader033/viewer/2022050815/547dc7895806b5ef5e8b45d8/html5/thumbnails/19.jpg)
Challenge Data
YANDEX SUPPLIED 27 DAYS OF ANONYMOUS LOGS
34,573,630 sessions with user id
21,073,569 queries
64,693,054 clicks
~ 15 GB
Example
![Page 20: Dataiku at SF DataMining Meetup - Kaggle Yandex Challenge](https://reader033.vdocuments.net/reader033/viewer/2022050815/547dc7895806b5ef5e8b45d8/html5/thumbnails/20.jpg)
Relevance?
![Page 21: Dataiku at SF DataMining Meetup - Kaggle Yandex Challenge](https://reader033.vdocuments.net/reader033/viewer/2022050815/547dc7895806b5ef5e8b45d8/html5/thumbnails/21.jpg)
A METRIC FOR RELEVANCE RIGHT FROM THE LOG? ASSUMING WE SEARCH FOR "FRENCH NEWSPAPER", WE TAKE A LOOK AT THE LOGS.
![Page 22: Dataiku at SF DataMining Meetup - Kaggle Yandex Challenge](https://reader033.vdocuments.net/reader033/viewer/2022050815/547dc7895806b5ef5e8b45d8/html5/thumbnails/22.jpg)
DWELL TIME
WE COMPUTE THE SO-CALLED DWELL TIME OF A CLICK, I.E. THE TIME ELAPSED BEFORE THE NEXT ACTION
![Page 23: Dataiku at SF DataMining Meetup - Kaggle Yandex Challenge](https://reader033.vdocuments.net/reader033/viewer/2022050815/547dc7895806b5ef5e8b45d8/html5/thumbnails/23.jpg)
DWELL TIME HAS BEEN SHOWN TO BE CORRELATED WITH THE RELEVANCE
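As a rough illustration of how dwell time can be turned into a relevance grade, here is a minimal sketch. The action tuple layout and the function names are hypothetical. The thresholds (50 and 400 time units, with the last click of a session counted as highly relevant) are recalled from the contest description rather than taken from these slides, so treat them as assumptions.

```python
def dwell_times(actions):
    """Compute the dwell time of each click in one session.

    `actions` is a list of (timestamp, kind, url) tuples sorted by timestamp,
    where kind is "query" or "click" (hypothetical layout).
    Returns a list of (url, dwell); dwell is None when the click is the
    last action of the session.
    """
    result = []
    for index, (timestamp, kind, url) in enumerate(actions):
        if kind != "click":
            continue
        if index + 1 < len(actions):
            next_timestamp = actions[index + 1][0]
            result.append((url, next_timestamp - timestamp))
        else:
            result.append((url, None))
    return result


def relevance_label(dwell):
    """Map a dwell time to a grade in {0, 1, 2} (assumed thresholds)."""
    if dwell is None or dwell >= 400:
        return 2  # long dwell, or last click of the session
    if dwell >= 50:
        return 1
    return 0
```

URLs that were displayed but never clicked would get grade 0 under the same assumptions.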
![Page 24: Dataiku at SF DataMining Meetup - Kaggle Yandex Challenge](https://reader033.vdocuments.net/reader033/viewer/2022050815/547dc7895806b5ef5e8b45d8/html5/thumbnails/24.jpg)
GOOD, WE HAVE A MEASURE OF RELEVANCE! CAN WE GET AN OVERALL SCORE FOR OUR SEARCH ENGINE NOW?
![Page 25: Dataiku at SF DataMining Meetup - Kaggle Yandex Challenge](https://reader033.vdocuments.net/reader033/viewer/2022050815/547dc7895806b5ef5e8b45d8/html5/thumbnails/25.jpg)
![Page 26: Dataiku at SF DataMining Meetup - Kaggle Yandex Challenge](https://reader033.vdocuments.net/reader033/viewer/2022050815/547dc7895806b5ef5e8b45d8/html5/thumbnails/26.jpg)
Discounted Cumulative Gain
Emphasis on relevant documents
Discount per ranking
![Page 27: Dataiku at SF DataMining Meetup - Kaggle Yandex Challenge](https://reader033.vdocuments.net/reader033/viewer/2022050815/547dc7895806b5ef5e8b45d8/html5/thumbnails/27.jpg)
Normalized Discounted Cumulative Gain
Just normalize between 0 and 1 (divide the DCG by the DCG of the ideal ordering)
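The formula itself did not survive in this transcript, so here is a minimal sketch of one common DCG/NDCG definition. The gain `2**rel - 1` is an assumption; it matches the weights 1 and 3 that appear later in the deck for relevance 1 and 2. Returning 0.0 when no result is relevant is also a convention chosen for this sketch.

```python
import math


def dcg(relevances):
    """Discounted cumulative gain of a ranked list of relevance grades."""
    return sum(
        (2 ** rel - 1) / math.log2(rank + 1)  # gain, discounted by rank
        for rank, rel in enumerate(relevances, start=1)
    )


def ndcg(relevances):
    """DCG divided by the DCG of the ideal (sorted) ordering."""
    ideal = dcg(sorted(relevances, reverse=True))
    return dcg(relevances) / ideal if ideal > 0 else 0.0
```

With this definition, a list that is already sorted by decreasing relevance scores 1, and pushing a relevant result further down lowers the score.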
![Page 28: Dataiku at SF DataMining Meetup - Kaggle Yandex Challenge](https://reader033.vdocuments.net/reader033/viewer/2022050815/547dc7895806b5ef5e8b45d8/html5/thumbnails/28.jpg)
PERSONALIZED RERANKING IS ABOUT REORDERING THE N-BEST RESULTS BASED ON THE USER'S PAST SEARCH HISTORY
Results obtained in the contest:
Original NDCG 0.79056
ReRanked NDCG 0.80714
Equivalent to
~ Raising the rank of a relevant (relevancy = 2) result from Rank #6 to Rank #5 on each query
~ Raising the rank of a relevant (relevancy = 2) result from Rank #6 to Rank #2 in 20% of the queries
![Page 29: Dataiku at SF DataMining Meetup - Kaggle Yandex Challenge](https://reader033.vdocuments.net/reader033/viewer/2022050815/547dc7895806b5ef5e8b45d8/html5/thumbnails/29.jpg)
How they did it
![Page 30: Dataiku at SF DataMining Meetup - Kaggle Yandex Challenge](https://reader033.vdocuments.net/reader033/viewer/2022050815/547dc7895806b5ef5e8b45d8/html5/thumbnails/30.jpg)
Simple, pointwise approach
For each (URL, session) pair, predict relevance (0, 1 or 2)
[Figure: the URLs of Session 1, Session 2, ... each labelled 0, 1 or 2]
![Page 31: Dataiku at SF DataMining Meetup - Kaggle Yandex Challenge](https://reader033.vdocuments.net/reader033/viewer/2022050815/547dc7895806b5ef5e8b45d8/html5/thumbnails/31.jpg)
Supervised Learning on History
We split the 27 days of the train dataset into 24 days (history) + 3 days (annotated).
Stop randomly in the last 3 days at a “test” session (like Yandex).
Train Set (24 days of history)
Train Set (annotation)
Test Set
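A minimal sketch of such a split, assuming each session is a dict with a `day` number and a list of `queries` (a hypothetical layout). The way a "test" query is picked here, uniformly at random inside the session, is only a guess at how to mimic the Yandex test set.

```python
import random


def split_sessions(sessions, history_days=24, seed=0):
    """Split sessions into a history part and an annotated part.

    Sessions from the first `history_days` days are kept whole. Later
    sessions are cut at a random query, which plays the role of the
    "test" query; the queries before it form its context.
    """
    rng = random.Random(seed)
    history, annotated = [], []
    for session in sessions:
        if session["day"] <= history_days:
            history.append(session)
        elif session["queries"]:
            cut = rng.randrange(len(session["queries"]))
            annotated.append({
                "day": session["day"],
                "context": session["queries"][:cut],
                "test_query": session["queries"][cut],
            })
    return history, annotated
```

Features are then computed from the history part, and labels come from the annotated part.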
![Page 32: Dataiku at SF DataMining Meetup - Kaggle Yandex Challenge](https://reader033.vdocuments.net/reader033/viewer/2022050815/547dc7895806b5ef5e8b45d8/html5/thumbnails/32.jpg)
Working with a ML workflow collaboratively
![Page 33: Dataiku at SF DataMining Meetup - Kaggle Yandex Challenge](https://reader033.vdocuments.net/reader033/viewer/2022050815/547dc7895806b5ef5e8b45d8/html5/thumbnails/33.jpg)
Features Construction: team members work independently
Learning: team members work independently
Split Train & Validation
Features on 30 days
Labelled 30 days data
![Page 34: Dataiku at SF DataMining Meetup - Kaggle Yandex Challenge](https://reader033.vdocuments.net/reader033/viewer/2022050815/547dc7895806b5ef5e8b45d8/html5/thumbnails/34.jpg)
REGRESSION or CLASSIFICATION
Regression: we keep the hierarchy between the classes, but optimizing NDCG is cookery.
Classification: we lose the hierarchy, but we can optimize the NDCG (more on that later).
According to P. Li, C. J. C. Burges, and Q. Wu, “McRank: Learning to rank using multiple classification and gradient boosting”, NIPS 2007, classification outperforms regression.
![Page 35: Dataiku at SF DataMining Meetup - Kaggle Yandex Challenge](https://reader033.vdocuments.net/reader033/viewer/2022050815/547dc7895806b5ef5e8b45d8/html5/thumbnails/35.jpg)
Compute the probabilities P(Relevance = X)
Build a sorted list
Sort by
P(Relevance=1) + 3 P(Relevance=2)
![Page 36: Dataiku at SF DataMining Meetup - Kaggle Yandex Challenge](https://reader033.vdocuments.net/reader033/viewer/2022050815/547dc7895806b5ef5e8b45d8/html5/thumbnails/36.jpg)
Hence order by decreasing P(Relevance=1) + 3 P(Relevance=2).
P. Li, C. J. C. Burges, and Q. Wu (“McRank: Learning to rank using multiple classification and gradient boosting”, NIPS 2007) get slightly better results with linear weighting.
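The weights 1 and 3 can be read as the gains `2**1 - 1` and `2**2 - 1`, so the sort key is an expected gain. A minimal sketch, assuming a classifier has already produced a probability triple per URL (the input layout and function names are hypothetical):

```python
def expected_gain(probabilities):
    """Expected gain given (P(rel=0), P(rel=1), P(rel=2))."""
    return sum((2 ** rel - 1) * p for rel, p in enumerate(probabilities))


def rerank(urls_with_probabilities):
    """Order URLs by decreasing P(rel=1) + 3 * P(rel=2)."""
    ordered = sorted(
        urls_with_probabilities,
        key=lambda item: expected_gain(item[1]),
        reverse=True,
    )
    return [url for url, _ in ordered]
```

Because `sorted` is stable, URLs with equal expected gain keep their original relative order.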
![Page 37: Dataiku at SF DataMining Meetup - Kaggle Yandex Challenge](https://reader033.vdocuments.net/reader033/viewer/2022050815/547dc7895806b5ef5e8b45d8/html5/thumbnails/37.jpg)
Features
![Page 38: Dataiku at SF DataMining Meetup - Kaggle Yandex Challenge](https://reader033.vdocuments.net/reader033/viewer/2022050815/547dc7895806b5ef5e8b45d8/html5/thumbnails/38.jpg)
RANK AS FEATURE
FIRST OF ALL, THE RANK
In this contest, the rank is both:
THE DISPLAY RANK: the rank that has been displayed to the user
THE NON-PERSONALIZED RANK: the rank that is computed by Yandex using PageRank, non-personalized log analysis (?), TF-IDF, machine learning, etc.
![Page 39: Dataiku at SF DataMining Meetup - Kaggle Yandex Challenge](https://reader033.vdocuments.net/reader033/viewer/2022050815/547dc7895806b5ef5e8b45d8/html5/thumbnails/39.jpg)
Digression
![Page 40: Dataiku at SF DataMining Meetup - Kaggle Yandex Challenge](https://reader033.vdocuments.net/reader033/viewer/2022050815/547dc7895806b5ef5e8b45d8/html5/thumbnails/40.jpg)
THE PROBLEM WITH RERANKING
![Page 41: Dataiku at SF DataMining Meetup - Kaggle Yandex Challenge](https://reader033.vdocuments.net/reader033/viewer/2022050815/547dc7895806b5ef5e8b45d8/html5/thumbnails/41.jpg)
53% OF THE COMPETITORS COULD NOT IMPROVE THE BASELINE
Worse 53%
Better 47%
![Page 42: Dataiku at SF DataMining Meetup - Kaggle Yandex Challenge](https://reader033.vdocuments.net/reader033/viewer/2022050815/547dc7895806b5ef5e8b45d8/html5/thumbnails/42.jpg)
IDEAL
1. compute non-personalized rank
2. select 10 best hits and serve them in order
3. re-rank using log analysis
4. put new ranking algorithm in prod (yeah right!)
5. compute NDCG on new logs
6. …
7. Profits!!
![Page 43: Dataiku at SF DataMining Meetup - Kaggle Yandex Challenge](https://reader033.vdocuments.net/reader033/viewer/2022050815/547dc7895806b5ef5e8b45d8/html5/thumbnails/43.jpg)
REAL
1. compute non-personalized rank
2. select 10 best hits
3. serve 10 best hits ranked in random order
4. re-rank using log analysis, including non-personalized rank as a feature
5. compute score against the log with the former rank
![Page 44: Dataiku at SF DataMining Meetup - Kaggle Yandex Challenge](https://reader033.vdocuments.net/reader033/viewer/2022050815/547dc7895806b5ef5e8b45d8/html5/thumbnails/44.jpg)
PROBLEM
Users tend to click on the first few URLs.
The user satisfaction metric is influenced by the display rank.
Our score is not aligned with our goal.
We cannot separate the effect of the non-personalized rank signal from the effect of the display rank.
![Page 45: Dataiku at SF DataMining Meetup - Kaggle Yandex Challenge](https://reader033.vdocuments.net/reader033/viewer/2022050815/547dc7895806b5ef5e8b45d8/html5/thumbnails/45.jpg)
PROMOTES AN OVER-CONSERVATIVE RE-RANKING POLICY
Even if we knew for sure that the URL at rank 9 would be clicked by the user if it were presented at rank 1, it would probably be a bad idea to rerank it to rank 1 in this contest.
Figure: average per session of the max position jump
![Page 46: Dataiku at SF DataMining Meetup - Kaggle Yandex Challenge](https://reader033.vdocuments.net/reader033/viewer/2022050815/547dc7895806b5ef5e8b45d8/html5/thumbnails/46.jpg)
end digression
![Page 47: Dataiku at SF DataMining Meetup - Kaggle Yandex Challenge](https://reader033.vdocuments.net/reader033/viewer/2022050815/547dc7895806b5ef5e8b45d8/html5/thumbnails/47.jpg)
FEATURES
• Revisits (Query-(User)-URL) features and variants
• Query Features
• Cumulative Features
• User Click Habits
• Collaborative Filtering
• Seasonality
![Page 48: Dataiku at SF DataMining Meetup - Kaggle Yandex Challenge](https://reader033.vdocuments.net/reader033/viewer/2022050815/547dc7895806b5ef5e8b45d8/html5/thumbnails/48.jpg)
REVISITS
In the past, when the user was displayed this URL with the exact same query, what is the probability of:
• satisfaction=2
• satisfaction=1
• satisfaction=0
• miss (not clicked)
• skipped (after the last click)
5 Conditional Probability Features
1 overall display counter
4 mean reciprocal ranks (kind of the harmonic mean of the rank)
1 snippet quality score (twisted formula used to compute snippet quality)
11 Base Features
![Page 49: Dataiku at SF DataMining Meetup - Kaggle Yandex Challenge](https://reader033.vdocuments.net/reader033/viewer/2022050815/547dc7895806b5ef5e8b45d8/html5/thumbnails/49.jpg)
MANY VARIATIONS
With the same user:
• (in the past | within the same session),
• (with this very query | whatever query | a subquery | a super query)
• and was offered (this URL | this domain)
× 2 × 3 × 2
12 variants
Without being the same user (URL - query features):
• Same Domain
• Same URL
• Same Query and Same URL
3 variants
15 Variants × 11 Base Features
165 Features
![Page 50: Dataiku at SF DataMining Meetup - Kaggle Yandex Challenge](https://reader033.vdocuments.net/reader033/viewer/2022050815/547dc7895806b5ef5e8b45d8/html5/thumbnails/50.jpg)
ADDITIVE SMOOTHING
http://fumicoton.com/posts/bayesian_rating
• book A: 1 rating of 5. Average rating of 5.
• book B: 50 ratings. Average rating of 4.5.
In our case, to evaluate the probability that a (URL|query) should have a label l, under predicate P:
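The slide's formula is missing from this transcript. A standard form of additive smoothing, given here as a hedged sketch and not necessarily the exact formula used by the team, pulls the observed frequency toward a prior. The names `prior` and `beta` are hypothetical.

```python
def smoothed_probability(label_count, total_count, prior, beta=1.0):
    """Additively smoothed estimate of P(label | predicate).

    label_count: times the label was observed under the predicate
    total_count: times the predicate held at all
    prior:       probability of the label with no evidence (assumed given)
    beta:        pseudo-count weight given to the prior (assumption)
    """
    return (label_count + beta * prior) / (total_count + beta)
```

With no observations the estimate equals the prior, and with many observations it approaches the raw frequency. This is the same effect as in the book example: a single rating of 5 should not outrank 50 ratings averaging 4.5.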
![Page 51: Dataiku at SF DataMining Meetup - Kaggle Yandex Challenge](https://reader033.vdocuments.net/reader033/viewer/2022050815/547dc7895806b5ef5e8b45d8/html5/thumbnails/51.jpg)
CUMULATIVE FEATURES
Aggregate the features of the URLs above in the ranking list
Rationale: if a URL above is likely to be clicked, those below are likely to be missed
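One simple way to build such a feature is a running sum over the URLs ranked strictly above. The choice of a sum is an assumption, since the slides do not say which aggregate was used.

```python
from itertools import accumulate


def cumulative_feature(values):
    """For each rank, sum a feature over the URLs ranked strictly above it.

    values[i] is the feature of the URL displayed at rank i + 1, for
    example its estimated click probability.
    """
    running = [0.0] + list(accumulate(values))
    return running[:-1]  # drop the total, which includes the last URL
```

The URL at rank 1 always gets 0.0 because nothing is displayed above it.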
![Page 52: Dataiku at SF DataMining Meetup - Kaggle Yandex Challenge](https://reader033.vdocuments.net/reader033/viewer/2022050815/547dc7895806b5ef5e8b45d8/html5/thumbnails/52.jpg)
QUERY FEATURES
How complex and ambiguous is a query?
• Click entropy
• Number of times it has been queried
• Number of terms
• Average position within a session
• Average number of occurrences in a session
• MRR of its clicks
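Click entropy is usually defined as the Shannon entropy of the distribution of clicked URLs for a query. A minimal sketch under that assumption, with a plain list of clicked URLs as a hypothetical input format:

```python
import math
from collections import Counter


def click_entropy(clicked_urls):
    """Shannon entropy (in bits) of the clicked-URL distribution."""
    counts = Counter(clicked_urls)
    total = sum(counts.values())
    return -sum(
        (count / total) * math.log2(count / total)
        for count in counts.values()
    )
```

A query whose clicks always land on the same URL has entropy 0 (unambiguous), while clicks spread evenly over many URLs give a high entropy.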
![Page 53: Dataiku at SF DataMining Meetup - Kaggle Yandex Challenge](https://reader033.vdocuments.net/reader033/viewer/2022050815/547dc7895806b5ef5e8b45d8/html5/thumbnails/53.jpg)
USER FEATURES
What are the user's habits?
• Click entropy
• User click rank counters: Rank {1, 2} clicks, Rank {3, 4, 5} clicks, Rank {6, 7, 8, 9, 10} clicks
• Average number of terms
• Average number of different terms in a session
• Total number of queries issued by the user
![Page 54: Dataiku at SF DataMining Meetup - Kaggle Yandex Challenge](https://reader033.vdocuments.net/reader033/viewer/2022050815/547dc7895806b5ef5e8b45d8/html5/thumbnails/54.jpg)
SEASONALITY
What day is Monday?
![Page 55: Dataiku at SF DataMining Meetup - Kaggle Yandex Challenge](https://reader033.vdocuments.net/reader033/viewer/2022050815/547dc7895806b5ef5e8b45d8/html5/thumbnails/55.jpg)
COLLABORATIVE FILTERING (ATTEMPT)
User / Domain interaction matrix.
FunkSVD algorithm (Simon Funk): http://sifter.org/~simon/journal/20061211.html
Cython implementation: https://github.com/commonsense/divisi/blob/master/svdlib/_svdlib.pyx
Marginal increase of 5×10^-5 in the NDCG!
Why?
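For readers unfamiliar with FunkSVD, here is a small pure-Python sketch of the idea: factorize the observed entries of the interaction matrix by stochastic gradient descent. It is not the Cython implementation linked on the slide. It updates all factors jointly, which is a common variant of Funk's original one-factor-at-a-time scheme, and every hyperparameter value is an assumption.

```python
import random


def funk_svd(ratings, n_users, n_items, rank=2, lr=0.05, reg=0.02,
             epochs=200, seed=0):
    """Factorize observed (user, item, value) triples with SGD.

    Returns user factors P and item factors Q such that the dot product
    of P[u] and Q[i] approximates the observed value.
    """
    rng = random.Random(seed)
    P = [[rng.uniform(-0.1, 0.1) for _ in range(rank)] for _ in range(n_users)]
    Q = [[rng.uniform(-0.1, 0.1) for _ in range(rank)] for _ in range(n_items)]
    for _ in range(epochs):
        for u, i, value in ratings:
            error = value - sum(p * q for p, q in zip(P[u], Q[i]))
            for k in range(rank):
                p, q = P[u][k], Q[i][k]
                # Gradient step with L2 regularization on both factors.
                P[u][k] += lr * (error * q - reg * p)
                Q[i][k] += lr * (error * p - reg * q)
    return P, Q


def mse(ratings, P, Q):
    """Mean squared reconstruction error on the observed triples."""
    errors = [
        (value - sum(p * q for p, q in zip(P[u], Q[i]))) ** 2
        for u, i, value in ratings
    ]
    return sum(errors) / len(errors)
```

In the contest setting the rows would be users and the columns domains, and the predicted interaction would be fed to the ranker as one more feature.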
![Page 56: Dataiku at SF DataMining Meetup - Kaggle Yandex Challenge](https://reader033.vdocuments.net/reader033/viewer/2022050815/547dc7895806b5ef5e8b45d8/html5/thumbnails/56.jpg)
learning
![Page 57: Dataiku at SF DataMining Meetup - Kaggle Yandex Challenge](https://reader033.vdocuments.net/reader033/viewer/2022050815/547dc7895806b5ef5e8b45d8/html5/thumbnails/57.jpg)
Short Story
Point Wise, Random Forest, 30 Features, 4th Place (*): optimize & train in ~ 1 hour (12 cores), 24 trees
List Wise, LambdaMART, 90 Features, 1st Place (*): trained in 2 days, 1135 trees
(*) A Yandex “PaceMaker” team also displayed results on the leaderboard and held first place during the whole competition, although it was not an official contestant
![Page 58: Dataiku at SF DataMining Meetup - Kaggle Yandex Challenge](https://reader033.vdocuments.net/reader033/viewer/2022050815/547dc7895806b5ef5e8b45d8/html5/thumbnails/58.jpg)
LambdaMART
From RankNet to LambdaRank to LambdaMART: An Overview
Christopher J.C. Burges
Microsoft Research Technical Report MSR-TR-2010-82
LambdaMART = LambdaRank + MART
![Page 59: Dataiku at SF DataMining Meetup - Kaggle Yandex Challenge](https://reader033.vdocuments.net/reader033/viewer/2022050815/547dc7895806b5ef5e8b45d8/html5/thumbnails/59.jpg)
LambdaRank
[Figure: original ranking (13 errors) vs. re-ranked (11 errors), with high-quality and low-quality hits; RankNet gradient vs. LambdaRank "gradient"]
From RankNet to LambdaRank to LambdaMART: An Overview
Christopher J.C. Burges - Microsoft Research Technical Report MSR-TR-2010-82
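To make the LambdaRank "gradient" concrete, here is a hedged sketch following the general recipe in the Burges report: for each pair where one document is more relevant than the other, take the RankNet pairwise term and scale it by the change in NDCG obtained by swapping the two documents. The function name, the `2**rel - 1` gain and the sign convention (positive means "push the score up") are choices made for this sketch.

```python
import math


def lambda_pushes(scores, relevances, sigma=1.0):
    """Per-document push (positive = raise the score) in LambdaRank style."""
    n = len(scores)
    order = sorted(range(n), key=lambda doc: scores[doc], reverse=True)
    position = {doc: pos for pos, doc in enumerate(order, start=1)}

    def discount(doc):
        return 1.0 / math.log2(position[doc] + 1)

    def gain(doc):
        return 2 ** relevances[doc] - 1

    ideal = sum(
        (2 ** rel - 1) / math.log2(pos + 1)
        for pos, rel in enumerate(sorted(relevances, reverse=True), start=1)
    )
    pushes = [0.0] * n
    if ideal == 0:
        return pushes  # no relevant document, nothing to learn from
    for i in range(n):
        for j in range(n):
            if relevances[i] <= relevances[j]:
                continue  # only pairs where i should rank above j
            # |change in NDCG| if documents i and j swapped positions.
            delta = abs((gain(i) - gain(j)) * (discount(i) - discount(j))) / ideal
            # RankNet term: large when i is currently scored below j.
            rho = 1.0 / (1.0 + math.exp(sigma * (scores[i] - scores[j])))
            pushes[i] += sigma * rho * delta
            pushes[j] -= sigma * rho * delta
    return pushes
```

Pairs whose swap would move a document near the top of the list get the largest pushes, which is how the metric's emphasis on the first ranks enters the training signal.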
![Page 60: Dataiku at SF DataMining Meetup - Kaggle Yandex Challenge](https://reader033.vdocuments.net/reader033/viewer/2022050815/547dc7895806b5ef5e8b45d8/html5/thumbnails/60.jpg)
Grid Search
We are not doing typical classification here. It is extremely important to perform the grid search directly against the final NDCG score.
NDCG “conservatism” ends up with a large “min samples per leaf” (between 40 and 80).
![Page 61: Dataiku at SF DataMining Meetup - Kaggle Yandex Challenge](https://reader033.vdocuments.net/reader033/viewer/2022050815/547dc7895806b5ef5e8b45d8/html5/thumbnails/61.jpg)
Feature Selection
Top-down approach: starting from a high number of features, iteratively remove subsets of features. This approach led to the subset of 90 features for the winning LambdaMART solution.
(Similar strategy now implemented by sklearn.feature_selection.RFECV)
Bottom-up approach: starting from a low number of features, add the features that produce the best marginal improvement. This gave the 30 features that led to the best solution with the point-wise approach.
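The bottom-up approach can be sketched as a greedy forward selection. Here `score` is a hypothetical callback that would train a model on the given feature subset and return its validation NDCG; the stopping rule (stop when no candidate improves the score) is an assumption.

```python
def forward_selection(candidates, score, max_features):
    """Greedily add the feature with the best marginal improvement.

    candidates:   iterable of feature names
    score:        callable taking a list of features, returning a number
                  to maximize (e.g. validation NDCG)
    max_features: upper bound on the size of the selected subset
    """
    selected = []
    best_score = score(selected)
    remaining = list(candidates)
    while remaining and len(selected) < max_features:
        # Evaluate every remaining feature added to the current subset.
        trial_score, trial_feature = max(
            ((score(selected + [feature]), feature) for feature in remaining),
            key=lambda pair: pair[0],
        )
        if trial_score <= best_score:
            break  # no candidate improves the score any more
        selected.append(trial_feature)
        remaining.remove(trial_feature)
        best_score = trial_score
    return selected, best_score
```

Each round costs one model fit per remaining feature, so in practice `score` would typically run on a subsample or use a fast learner.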
![Page 62: Dataiku at SF DataMining Meetup - Kaggle Yandex Challenge](https://reader033.vdocuments.net/reader033/viewer/2022050815/547dc7895806b5ef5e8b45d8/html5/thumbnails/62.jpg)
Top Features
![Page 63: Dataiku at SF DataMining Meetup - Kaggle Yandex Challenge](https://reader033.vdocuments.net/reader033/viewer/2022050815/547dc7895806b5ef5e8b45d8/html5/thumbnails/63.jpg)
References
![Page 64: Dataiku at SF DataMining Meetup - Kaggle Yandex Challenge](https://reader033.vdocuments.net/reader033/viewer/2022050815/547dc7895806b5ef5e8b45d8/html5/thumbnails/64.jpg)
References
Research Papers
P. Li, C. J. C. Burges, and Q. Wu. McRank: Learning to rank using multiple classification and gradient boosting. In NIPS, 2007.
Christopher J.C. Burges. From RankNet to LambdaRank to LambdaMART: An Overview. Microsoft Research Technical Report MSR-TR-2010-82.
RankLib (implementation of LambdaMART): http://sourceforge.net/p/lemur/wiki/RankLib/
Blog post about additive smoothing: http://fumicoton.com/posts/bayesian_rating
Blog posts about the solution:
http://blog.kaggle.com/2014/02/06/winning-personalized-web-search-team-dataiku/
http://www.dataiku.com/blog/2014/01/14/winning-kaggle.html
Paper with detailed description: http://research.microsoft.com/en-us/um/people/nickcr/wscd2014/papers/wscdchallenge2014dataiku.pdf
Contest URL: https://www.kaggle.com/c/yandex-personalized-web-search-challenge
These slides: http://www.slideshare.net/Dataiku
![Page 65: Dataiku at SF DataMining Meetup - Kaggle Yandex Challenge](https://reader033.vdocuments.net/reader033/viewer/2022050815/547dc7895806b5ef5e8b45d8/html5/thumbnails/65.jpg)
Random Thoughts
Dependency analysis and comparing rank with predictive “relevance” could help determine general cases where the existing engine is not relevant enough. How does it compare to a pure statistical approach?
Applying personalisation techniques this way might not be practical because of the amount of live information to be maintained (in real time) about users (each query, each click) to perform actionable predictions. How could a machine learning challenge enforce this kind of constraint?
Is data science a science, a sport or a hobby? Newcomers can discover a field, improve existing results, and seemingly obtain incrementally more effective results, with little plateau effect! Are we just at the very beginning, in the non-industrial era of this discipline?