Automatic Feature Engineering The manual approach
Pierre Gutierrez, Leo Dreyfus-Schmidt, Du Phan
CONTENTS
MOTIVATION
Dealing with vertical business problems.
Accelerating the feature engineering process.
INITIAL SOLUTION
A general-purpose human-powered feature generation pipeline.
DEEP FEATURE SYNTHESIS
Leveraging relational nature between tables to aggregate features.
CONCLUSION
What about Deep Learning?
Where does the solution fit in a general data science workflow?
How can we provide a general solution for these problems?
There is always a structure in our data waiting to be exploited
Feature Engineering process (sample size: 1)
[Chart: the process split into "Meh." and "Fun!" parts; the only surviving value is 37 %]
How can we accelerate the boring parts?
OBJECTIVE
Leveraging human knowledge for automatic feature engineering
Build a general-purpose feature generation pipeline
Create expressive features based on user's data model
Versatility, Modularity and Interpretability
Most problems can be aggregated with some primary keys:
user_id
user_id + event_timestamp
user_id + product_id
Most features belong to a "general" feature family
Frequency: how often does the client do a specific action?
Recency: when was the last time the client did this action?
Monetary: what are the client's spending habits?
Distribution: what type of client is this?
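The first three families can be sketched as plain aggregations over an event log keyed by user_id. This is a minimal illustration, not the pipeline from the talk: the function name `rfm_features`, the tuple layout, and the toy events are all assumptions made up for the example, and the distribution family is left out for brevity.

```python
from collections import defaultdict
from datetime import datetime

# Toy events, invented for illustration:
# (user_id, event_type, event_timestamp, amount)
EVENTS = [
    ("u1", "buy_order", datetime(2017, 1, 10), 20.0),
    ("u1", "buy_order", datetime(2017, 3, 5), 40.0),
    ("u1", "page_view", datetime(2017, 3, 6), 0.0),
    ("u2", "buy_order", datetime(2017, 2, 1), 15.0),
]


def rfm_features(events, ref_date, event_type="buy_order"):
    """Aggregate frequency, recency and monetary features per user_id.

    Hypothetical helper: only events of `event_type` dated on or before
    `ref_date` are taken into account.
    """
    grouped = defaultdict(list)
    for user_id, etype, ts, amount in events:
        if etype == event_type and ts <= ref_date:
            grouped[user_id].append((ts, amount))

    features = {}
    for user_id, rows in grouped.items():
        last_ts = max(ts for ts, _ in rows)
        features[user_id] = {
            # Frequency: how many matching events the user has.
            "frequency": len(rows),
            # Recency: whole days since the most recent matching event.
            "recency_days": (ref_date - last_ts).days,
            # Monetary: total amount over the matching events.
            "monetary_total": sum(amount for _, amount in rows),
        }
    return features


features = rfm_features(EVENTS, ref_date=datetime(2017, 4, 1))
```

Swapping the grouping key for a tuple such as (user_id, product_id) would give the same families at a finer grain.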
fit / transform
Feature: frequency of the buying event in the last 6 months
Time window: last 6 months
Primary key: user_id
Filter: event_type is buy_order
DROP TABLE IF EXISTS frequency_feature_last_6_month_group_1;
CREATE TABLE frequency_feature_last_6_month_group_1 AS (
    SELECT
        *,
        -- Divide by 6.0, not 6: COUNT returns an integer, and integer
        -- division would truncate the monthly mean.
        group_1_frequency_last_6_month / 6.0 AS mean_frequency_group_1_per_month_last_6_month
    FROM (
        SELECT user_id, COUNT(event_timestamp) AS group_1_frequency_last_6_month
        FROM (
            -- ref_date is not defined on the slide; it is assumed to be a
            -- column of the table or a value substituted by the pipeline.
            SELECT *
            FROM "events_complete"
            WHERE event_timestamp::timestamp >= (ref_date - INTERVAL '6 month')
              AND event_timestamp::timestamp <= ref_date
              AND event_type IN ('buy_order')
        ) AS table_layer_2
        GROUP BY user_id
    ) AS table_layer_3
);
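The query above is fully determined by four inputs: the source table, the primary key, the event filter and the time window. A rough sketch of how such a query could be rendered from a declarative spec follows. The helper `frequency_feature_sql` and its parameters are hypothetical, not the talk's actual generator; the nested subqueries are flattened into one, and `ref_date` is kept as an unresolved placeholder exactly as on the slide.

```python
def frequency_feature_sql(table, primary_key, event_types, months, group_name):
    """Render a frequency-feature query from a small spec (hypothetical helper).

    Identifiers are interpolated directly into the SQL text, so this sketch
    assumes they come from trusted configuration, never from user input.
    """
    suffix = f"last_{months}_month"
    count_col = f"{group_name}_frequency_{suffix}"
    mean_col = f"mean_frequency_{group_name}_per_month_{suffix}"
    out_table = f"frequency_feature_{suffix}_{group_name}"
    type_list = ", ".join(f"'{t}'" for t in event_types)
    # The count is divided by a float literal so that the monthly mean is
    # not truncated by integer division.
    return f"""
DROP TABLE IF EXISTS {out_table};
CREATE TABLE {out_table} AS (
    SELECT *, {count_col} / {months}.0 AS {mean_col}
    FROM (
        SELECT {primary_key}, COUNT(event_timestamp) AS {count_col}
        FROM "{table}"
        WHERE event_timestamp::timestamp >= (ref_date - INTERVAL '{months} month')
          AND event_timestamp::timestamp <= ref_date
          AND event_type IN ({type_list})
        GROUP BY {primary_key}
    ) AS counts
);
""".strip()


sql = frequency_feature_sql(
    table="events_complete",
    primary_key="user_id",
    event_types=["buy_order"],
    months=6,
    group_name="group_1",
)
```

Changing only `months` or `event_types` yields a new feature table without hand-writing another query, which is the point of generating the SQL.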
Max Kanter, Kalyan Veeramachaneni
Features are often derived using relationships in the dataset
Across datasets, many features are derived using similar mathematical operations
New features are composed using previously derived features
Customers: CustomerID, Age, Churned
Orders: CustomerID, OrderID, Date
OrderProduct: OrderID, ProductID, Product.Price
Step 1: SUM(Product.Price) GROUP BY OrderID
Step 2: AVG(Orders.SUM(Product.Price)) GROUP BY CustomerID
-> average expense per order per customer
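The two steps can be sketched as stacked aggregations that follow the relationships OrderProduct -> Orders -> Customers. This is a hand-written illustration of the idea, not the Deep Feature Synthesis implementation; the rows and function names are invented for the example.

```python
from collections import defaultdict

# Toy rows, invented for illustration.
ORDERS = [  # (CustomerID, OrderID)
    ("c1", "o1"),
    ("c1", "o2"),
    ("c2", "o3"),
]
ORDER_PRODUCTS = [  # (OrderID, ProductID, Product.Price)
    ("o1", "p1", 10.0),
    ("o1", "p2", 5.0),
    ("o2", "p1", 10.0),
    ("o3", "p3", 30.0),
]


def sum_price_per_order(order_products):
    """Step 1: SUM(Product.Price) GROUP BY OrderID."""
    totals = defaultdict(float)
    for order_id, _product_id, price in order_products:
        totals[order_id] += price
    return dict(totals)


def avg_order_total_per_customer(orders, order_totals):
    """Step 2: AVG(Orders.SUM(Product.Price)) GROUP BY CustomerID.

    Orders without any product row count as 0.0 here; that is a choice
    made for this sketch, not something the slides specify.
    """
    per_customer = defaultdict(list)
    for customer_id, order_id in orders:
        per_customer[customer_id].append(order_totals.get(order_id, 0.0))
    return {cid: sum(vals) / len(vals) for cid, vals in per_customer.items()}


order_totals = sum_price_per_order(ORDER_PRODUCTS)
avg_expense = avg_order_total_per_customer(ORDERS, order_totals)
```

The second aggregation consumes the output of the first, which is what "new features are composed using previously derived features" means in practice.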
Limits?
Brute-force nature -> feature selection needs to be considered
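Because brute-force generation tends to produce many redundant columns, even a crude filter helps before modeling. The sketch below only drops constant features and exact duplicates. It is an assumed first pass for illustration, not a method prescribed in the talk, and it would not replace a statistical or model-based selection step. The feature names and values are made up.

```python
def prune_features(feature_table):
    """Drop constant and exactly duplicated feature columns.

    feature_table maps a feature name to its list of values (one per row).
    Hypothetical helper: the first occurrence of a duplicated column is kept.
    """
    kept = {}
    seen = set()
    for name, values in feature_table.items():
        if len(set(values)) <= 1:
            continue  # a constant column carries no signal
        key = tuple(values)
        if key in seen:
            continue  # identical to a column that was already kept
        seen.add(key)
        kept[name] = values
    return kept


# Toy generated features, invented for illustration.
generated = {
    "SUM(Product.Price)": [15.0, 10.0, 30.0],
    "SUM(Product.Price) again": [15.0, 10.0, 30.0],
    "COUNT(Churned flag)": [1, 1, 1],
    "AVG(Product.Price)": [7.5, 10.0, 30.0],
}
selected = prune_features(generated)
```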
AlexNet (2012)
Feature Engineering vs Representation Learning: the chess game metaphor
Feature Engineering: same game, different forms
Representation Learning: different game
Interpretability: do you need it?
Where does this method fit in the data science workflow?
Thank you for your attention! Question time