Machine Learning at Orbitz
Robert Lancaster and Jonathan Seidman
Strata 2011, February 02, 2011
Why Start the Machine Learning Team at Orbitz?
• Team was created in 2009 with the goal to apply machine learning techniques to improve the customer experience.
• For example:
– Hotel sort optimization: How can we improve the ranking of hotel search results in order to show consumers hotels that more closely match their preferences?
– Cache optimization: Can we intelligently cache hotel rates in order to optimize the performance of hotel searches?
– Personalization/segmentation: Can we show targeted search results to specific consumer segments?
Data Challenges
• The team immediately faced challenges getting access to data:
– The required analysis needs access to large amounts of data on user interaction with the site.
– This data is available in web analytics logs, but required fields were not available in our data warehouse because of size considerations.
– Even worse, we had no archive of the data beyond several days.
– Size constraints aside, it takes considerable time and effort to get new data added to the data warehouse.
New Data Infrastructure to Address These Challenges
• Hadoop provides a solution to these challenges by:
– Providing long-term storage of entire raw dataset without placing constraints on how that data is processed.
– Allowing us to immediately take advantage of new web analytics data added to the site.
– Providing a platform for efficient analysis of data, as well as preparation of data for input to external processes for further analysis.
• Hive was added to the infrastructure to provide structure over the prepared data, facilitating ad-hoc queries and selection of specific data sets for analysis.
• Data stored in Hive not only supports machine learning efforts, but also provides metrics to analysts not available through other sources.
New Data Infrastructure – Cont’d
• Hadoop and Hive are now being used by the machine learning team to:
– Extract data from logs for hotel sort and cache optimization analyses.
– Distribute complex cross-validation and performance evaluation operations.
– Extract data for clustering.
• Hadoop and Hive have also gained rapid adoption in the organization beyond the machine learning team: evaluating page download performance, searching production logs, keyword analysis, etc.
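As a rough illustration of the log-extraction step, the sketch below counts hotel searches per distinct query in map/reduce style. The tab-separated log layout, the "hotel_search" event name, and the function names are all hypothetical; the slides do not describe the actual log format or job code.

```python
from itertools import groupby


def map_line(line):
    """Map step: emit (query_key, 1) for a hotel search line, nothing otherwise.

    Assumed (hypothetical) layout: event<TAB>market<TAB>check_in<TAB>nights
    """
    fields = line.rstrip("\n").split("\t")
    if len(fields) < 4 or fields[0] != "hotel_search":
        return []
    market, check_in, nights = fields[1], fields[2], fields[3]
    return [((market, check_in, nights), 1)]


def reduce_counts(pairs):
    """Reduce step: sum the counts for each key (pairs are sorted first,
    standing in for the shuffle/sort phase of a real job)."""
    for key, group in groupby(sorted(pairs), key=lambda kv: kv[0]):
        yield key, sum(count for _, count in group)


def count_searches(lines):
    """Run the map and reduce steps locally over an iterable of log lines."""
    pairs = [kv for line in lines for kv in map_line(line)]
    return dict(reduce_counts(pairs))
```

Running the same two steps on a cluster would only change how lines are read and how pairs are shuffled; the per-line and per-key logic stays the same.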
Use Case – Hotel Cache Optimization
Overview:
Search methodology:
• Subset of total properties in a location (1 page at a time).
• Get “just enough” information to present to consumers.
Caching:
• Reduces impact to suppliers (maintain “look-to-book” ratio).
• Reduces latency.
• Increases “coverage.”
Optimization Goal:
Improve the customer experience (reduce latency, increase coverage) when searching for hotel rates while controlling impact on suppliers (maintain look-to-book).
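The trade-off above can be sketched with a minimal TTL cache. This is an illustration only, not the Orbitz cache: the class and function names are hypothetical, the hit ratio is used here as a simple stand-in for "coverage," and each miss counts as one supplier request (the "look" side of look-to-book).

```python
class TTLCache:
    """Cache whose entries expire a fixed number of seconds after being stored."""

    def __init__(self, ttl_seconds):
        self.ttl = ttl_seconds
        self._store = {}  # key -> (value, stored_at)
        self.hits = 0
        self.misses = 0

    def get(self, key, now):
        """Return the cached value if still fresh, else None (counts as a miss)."""
        entry = self._store.get(key)
        if entry is not None and now - entry[1] < self.ttl:
            self.hits += 1
            return entry[0]
        self.misses += 1
        return None

    def put(self, key, value, now):
        self._store[key] = (value, now)

    def hit_ratio(self):
        total = self.hits + self.misses
        return self.hits / total if total else 0.0


def lookup_rate(cache, key, now, fetch_from_supplier):
    """Serve a rate from cache when possible; otherwise ask the supplier once."""
    rate = cache.get(key, now)
    if rate is None:
        rate = fetch_from_supplier(key)  # each call adds load on the supplier
        cache.put(key, rate, now)
    return rate
```

A longer TTL raises the hit ratio and lowers supplier load, at the cost of serving rates that may have changed since they were cached.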
Hotel Cache Optimization – Early Attempts
Early approaches were well intentioned but were not driven by analysis of the available data. For example:
Theory: High amount of thrashing leads to eviction of more useful cache entries.
Attempted Solution: Increase cache size.
Result: No increase in measured coverage.
Problem: No actual analysis of the required cache size was performed.
Theory: Locally managed inventory represents “free” information and can be requested without limit to improve coverage.
Attempted Solution: Don’t cache locally managed inventory. Increase the amount of local inventory requested with each user search.
Result: No increase in measured coverage.
Problem: Locally managed inventory doesn’t represent a large percentage of total inventory and is already highly preferenced.
Hotel Cache Optimization – Data Driven Approaches
Data Driven Approaches:
Traffic Partitioning: Identify the subset of traffic that is most efficient and optimize that subset through prefetching and increased bursting.
TTL Optimization: Use historic logs of availability and rate change information to predict volatility of hotel rates and optimize cache TTL.
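One simple way to turn historic change logs into a TTL is sketched below. It is a heuristic chosen for illustration, not the model described in the talk: it assumes the only input is a sorted list of timestamps at which a hotel's rate or availability changed, and the function name, safety factor, and TTL bounds are hypothetical.

```python
def suggest_ttl(change_times, safety_factor=0.5, min_ttl=300.0, max_ttl=86400.0):
    """Suggest a cache TTL (seconds) from observed rate-change timestamps.

    change_times: sorted timestamps (seconds) of observed changes for one hotel.
    The TTL is a fraction of the mean interval between changes, clamped to
    [min_ttl, max_ttl]. Volatile hotels get short TTLs, stable ones long TTLs.
    """
    if len(change_times) < 2:
        # No interval can be measured, so treat the rate as stable.
        return max_ttl
    intervals = [later - earlier
                 for earlier, later in zip(change_times, change_times[1:])]
    mean_interval = sum(intervals) / len(intervals)
    return max(min_ttl, min(max_ttl, safety_factor * mean_interval))
```

For example, a hotel whose rate changed at 0 s, 3600 s, and 7200 s has a mean interval of one hour, so the default safety factor of 0.5 suggests a 30-minute TTL.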
Hotel Cache Optimization – Traffic Distribution
A small number of queries (3%) make up more than a third of search volume.
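A concentration figure like this can be computed from a list of executed searches, as in the sketch below. It assumes each search has already been reduced to a hashable query key; the function name and key format are hypothetical.

```python
import math
from collections import Counter


def share_of_top_queries(searches, top_fraction):
    """Fraction of all searches accounted for by the most frequent queries.

    searches: iterable of query keys, one entry per executed search.
    top_fraction: share of distinct queries to keep, e.g. 0.03 for the top 3%.
    """
    counts = Counter(searches)
    ranked = sorted(counts.values(), reverse=True)
    if not ranked:
        return 0.0
    keep = max(1, math.ceil(top_fraction * len(ranked)))  # at least one query
    return sum(ranked[:keep]) / sum(ranked)
```

With four distinct queries searched 6, 2, 1, and 1 times, the top half of the queries covers 8 of 10 searches, so `share_of_top_queries(..., 0.5)` gives 0.8.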
Optimize Hotel Cache – Traffic Partitioning
Evaluate possible mechanisms for determining most frequent queries.
Favor mechanisms that give a high search/query ratio for the greatest percentage of search volume.
Test for stability of mechanism across multiple time periods.
Partition Strategy  Description                                           Pct Queries  Pct Searches  Searches/Query
Baseline            All traffic                                           100.00%      100.00%       2.19
Top 50              Top 50 searched markets                               14.88%       26.76%        3.94
Heuristic           Top 50 searched markets, weekend stay within 1 month  0.87%        8.52%         21.4
Enumeration         Queries repeated 5 or more times                      3.45%        28.80%        18.29
Prediction          TBD                                                   TBD          TBD           TBD
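The three metrics can be computed for any candidate strategy as sketched below. The definitions are inferred from the table rather than stated in the slides: Pct Queries is the share of distinct queries selected, Pct Searches is the share of search volume they account for, and Searches/Query is their ratio within the partition. This matches the published rows, for example 2.19 × 26.76 / 14.88 ≈ 3.94 for Top 50. Function names and the sample keys are hypothetical.

```python
from collections import Counter


def partition_metrics(searches, in_partition):
    """Evaluate a partition strategy over a list of executed searches.

    searches: iterable of query keys, one entry per executed search.
    in_partition: function (query_key, search_count) -> bool selecting queries.
    """
    counts = Counter(searches)
    total_queries = len(counts)
    total_searches = sum(counts.values())
    selected = {q: n for q, n in counts.items() if in_partition(q, n)}
    part_queries = len(selected)
    part_searches = sum(selected.values())
    return {
        "pct_queries": 100.0 * part_queries / total_queries,
        "pct_searches": 100.0 * part_searches / total_searches,
        "searches_per_query": part_searches / part_queries if part_queries else 0.0,
    }


def baseline(query, count):
    """Baseline strategy: all traffic."""
    return True


def enumeration(query, count, min_repeats=5):
    """Enumeration strategy: queries repeated min_repeats or more times."""
    return count >= min_repeats
```

Re-running the same function on logs from several time periods is one way to check the stability of a strategy.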
Conclusions and Lessons Learned
• Start with a manageable problem (ease of measuring success, availability of data, etc.)
• Avoid thinking of the machine learning team as an R&D organization.
• Instead, foster machine learning approaches throughout the organization:
– Embed resources on actual feature teams.
– Machine learning study groups, etc.