Thomas Jensen. Machine learning
#BigDataBY
The Impact of Big Data on Classic Machine Learning Algorithms
Thomas Jensen, Senior Business Analyst @ Expedia
Who am I?
• Senior Business Analyst @ Expedia
• Working within the competitive intelligence unit
• Responsible for:
  • An algorithm that scores new hotels
  • An algorithm that predicts room nights sold on existing Expedia hotels
  • Scraping competitor sites
  • Other stuff…
The Promise of Big Data
Real time data
Data driven decisions
More accurate and robust models
Granularity
Big Data Challenges
Data Processing – not going to talk about this.
Speed at which to use data – how fast should we update algorithms?
How do we train algorithms on data sets that do not fit into memory?
Big Data Challenges
Taken from: http://drewconway.com/zia/2013/3/26/the-data-science-venn-diagram
Classification - Logistic Regression
• One classic task in machine learning / statistics is to classify some objects/events/decisions correctly
• Examples are:
  • Customer churn
  • Click behavior
  • Purchase behavior
  • …
• One of the most popular algorithms to carry out these tasks is logistic regression
What is logistic regression?
• Logistic regression attaches probabilities to individual outcomes, showing how likely they are to belong to one class or the other
• $\Pr(y \mid x) = \frac{1}{1 + e^{-x\beta}}$
• The challenge is to choose the optimal beta(s)
• To do that we minimize a cost function
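As a rough illustration of the formula and the cost function mentioned above, here is a minimal Python sketch. The slides do not say which cost function is used; the average log loss (negative log-likelihood) is assumed here because it is the usual choice for logistic regression. The function names and the toy data are made up for the example.

```python
import math

def sigmoid(z):
    """Logistic function 1 / (1 + e^-z), written to avoid overflow for large |z|."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def predict_proba(x, beta):
    """Probability of class 1 for one row x and weight vector beta."""
    return sigmoid(sum(xj * bj for xj, bj in zip(x, beta)))

def log_loss(X, y, beta, eps=1e-12):
    """Average negative log-likelihood (assumed cost function)."""
    total = 0.0
    for xi, yi in zip(X, y):
        # Clip the probability so that log() never sees exactly 0.
        p = min(max(predict_proba(xi, beta), eps), 1.0 - eps)
        total -= yi * math.log(p) + (1 - yi) * math.log(1.0 - p)
    return total / len(y)

# Toy data: the first column is a constant 1.0 acting as the intercept.
X = [[1.0, -2.0], [1.0, -1.0], [1.0, 1.0], [1.0, 2.0]]
y = [0, 0, 1, 1]

cost_zero = log_loss(X, y, [0.0, 0.0])  # every probability is 0.5
cost_good = log_loss(X, y, [0.0, 2.0])  # weights that separate the two classes
```

Better betas give a lower cost, which is what "choosing the optimal beta(s)" means in practice: `cost_good` comes out smaller than `cost_zero`.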
Why Use Logistic Regression?
• It is a simple and well-understood algorithm
• Outputs probabilities
• There are tried and tested methods to estimate the parameters
• It is flexible – can handle a number of different inputs, and feature transformations
Usual Approaches
• Batch training (offline approach)
  • Get all the data and train the algorithm in one go
• Disadvantages when data is big
  • Requires all data to be loaded into memory
  • Periodic retraining is necessary
  • Very time consuming with big data!
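To make the memory and time cost of the batch approach concrete, here is a small sketch of full-batch gradient descent for logistic regression. Gradient descent is only one possible batch solver and is an assumption here (the slides do not name one); the function name, step size and toy data are illustrative.

```python
import math

def sigmoid(z):
    """Logistic function, written to avoid overflow for large |z|."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def batch_gradient_descent(X, y, alpha=0.5, n_iter=200):
    """Full-batch training: every single update reads the whole data set."""
    n_rows, n_cols = len(X), len(X[0])
    beta = [0.0] * n_cols
    for _ in range(n_iter):
        grad = [0.0] * n_cols
        for xi, yi in zip(X, y):  # one pass over ALL rows per update
            err = sigmoid(sum(a * b for a, b in zip(xi, beta))) - yi
            for j in range(n_cols):
                grad[j] += err * xi[j]
        # Move against the average gradient of the log loss.
        beta = [b - alpha * g / n_rows for b, g in zip(beta, grad)]
    return beta

# Toy data held entirely in memory; first column is the intercept term.
X = [[1.0, -2.0], [1.0, -1.0], [1.0, 1.0], [1.0, 2.0]]
y = [0, 0, 1, 1]
beta = batch_gradient_descent(X, y)
```

The inner loop touches every row before a single weight changes, so `X` must be available in full for each of the `n_iter` updates. That is the part that stops scaling when the data no longer fits in memory.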
Batch Training
Examples of Logistic Regression in Industry Settings – Real Time Bidding
• RTB
  • RTB algorithms are usually based on logistic regression
  • Whether or not to bid on a user is determined by the probability that the user will click on an ad
• Each day billions of bids are processed
• Each bid has to be processed within 80 milliseconds
Examples of Logistic Regression in Industry Settings – Fraud Detection
Detecting Fraudulent Credit Card Transactions
• The probability that a transaction is using a stolen credit card is typically estimated with logistic regression
• Billions of transactions are analyzed each day
How Slow is the Batch Version of Logistic Regression?
One target variable and two feature vectors. All randomly generated.
A Real World Problem
• Some stats on the training job in the pipeline:
  • Runs training jobs on a per-country basis
  • Longest running job lasts ~9 hours
  • Shortest running job lasts ~3 hours
  • There are often convergence failures
• What we need is an algorithm that:
  • Can reduce training time
  • Is robust towards convergence failures
A Big Data Friendly Approach
Online Training
• Pass each data point sequentially through the algorithm
• Only requires one data point at a time in memory
• Allows for on-the-fly training of the algorithm
Online Learning
• We want to learn a vector of weights
• Initialize all weights. Begin loop:
  1. Get training example
  2. Make a prediction for the target variable
  3. Learn the true value of the target
  4. Update the weights and go to 1
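The four steps of the loop can be sketched in a few lines of Python. This is an illustration under assumptions, not the speaker's implementation: `example_stream` is a hypothetical stand-in for a real data stream, and the update rule is the standard stochastic gradient step for logistic regression with a fixed step size `alpha`.

```python
import math
import random

def sigmoid(z):
    """Logistic function, written to avoid overflow for large |z|."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def example_stream(n, seed=0):
    """Hypothetical data stream: yields one (features, target) pair at a time."""
    rng = random.Random(seed)
    for _ in range(n):
        x1 = rng.uniform(-3.0, 3.0)
        target = 1 if x1 + rng.gauss(0.0, 0.5) > 0 else 0
        yield [1.0, x1], target  # leading 1.0 is the intercept term

def train_online(stream, n_features, alpha=0.1):
    weights = [0.0] * n_features                 # initialize all weights
    for x, y in stream:                          # 1. get training example
        p = sigmoid(sum(w * v for w, v in zip(weights, x)))  # 2. make a prediction
        error = p - y                            # 3. learn the true value of the target
        weights = [w - alpha * error * v         # 4. update the weights, go to 1
                   for w, v in zip(weights, x)]
    return weights

weights = train_online(example_stream(5000), n_features=2)
```

Only the current `(x, y)` pair and the weight vector are held in memory, so the same loop works whether the stream has five thousand rows or five billion.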
Online Learning
• Initialise all weights. Begin loop:
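Repeat {
  For i = 1 to m {
    $\theta_j = \theta_j - \alpha \frac{\partial}{\partial \theta_j} \mathrm{cost}(\theta, (x_i, y_i))$
  }
}
where:
• $\frac{\partial}{\partial \theta_j}$ is the partial derivative of the cost function
• $\mathrm{cost}(\theta, (x_i, y_i))$ is the cost function, given theta and row i, i.e. how wrong are we?
• $\alpha$ is the step size, i.e. how fast we should descend the gradient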
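For logistic regression the partial derivative in the update has a simple closed form. The short derivation below assumes the per-row cost is the log loss (the slides do not state the cost explicitly); $\theta$ plays the role of $\beta$ in the earlier logistic formula and $x_{ij}$ denotes feature j of row i.

```latex
\begin{aligned}
h_\theta(x_i) &= \frac{1}{1 + e^{-x_i\theta}} \\
\mathrm{cost}(\theta, (x_i, y_i)) &= -\,y_i \log h_\theta(x_i) - (1 - y_i)\log\bigl(1 - h_\theta(x_i)\bigr) \\
\frac{\partial}{\partial \theta_j}\,\mathrm{cost}(\theta, (x_i, y_i)) &= \bigl(h_\theta(x_i) - y_i\bigr)\,x_{ij} \\
\theta_j &= \theta_j - \alpha\,\bigl(h_\theta(x_i) - y_i\bigr)\,x_{ij}
\end{aligned}
```

Each update therefore needs only the prediction error on the current row and that row's features, which is why a single data point in memory is enough.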
Online Learning
• Approaches the minimum of the cost function in a jumpy manner and never actually settles on the minimum.
Batch vs. Online Learning
Data size: 4.8 GB, rows: 500,000, columns: 5,000
[Bar chart: training time for Batch, SGDClassifier and Sofia-ml; axis from 0 to 120, units not shown]
*Times include reading data and training the algorithm
Online Learning Vs. Batch
Online Learning
• When we have a continuous stream of data
• When it is important to update the algorithm in real time – can hit a moving target
• When training speed is important
• Parameters are “jumpy” around the optimal values
Batch
• When it is very important to get the exact optimal values
• When data can fit in memory
• When training time is not of the essence
Popular Online Learning Libraries
• Sofia-ml (C/C++)
  • Requires data in SVMlight format
  • Has implementations of SVM, neural networks and logistic regression
  • Supports classification and ranking
• Vowpal Wabbit (C/C++)
  • Requires data in its own VW format
  • Has implementations of the most popular loss functions
  • Supports classification, ranking and regression
• Pandas + scikit-learn (Python)
  • Pandas has a nice function for reading files in batches
  • Can handle sparse and non-sparse matrices
  • Scikit-learn has an SGD classifier that can fit the model in batches
  • Supports classification, ranking and regression