Fast Prediction of New Feature Utility
Hoyt Koepke, Misha Bilenko


TRANSCRIPT

Page 1: Fast Prediction of New Feature Utility

Fast Prediction of New Feature Utility

Hoyt Koepke Misha Bilenko

Page 2

Machine Learning in Practice

To improve accuracy, we can improve:
– Training
– Supervision
– Features

Problem formulated as a prediction task

Design, refine features

Implement learner, get supervision

Train, validate, ship

Page 3

Improving Accuracy By Improving

• Training – Algorithms, objectives/losses, hyper-parameters, …

• Supervision – Cleaning, labeling, sampling, semi-supervised

• Representation: refine/induce/add new features
– Most ML engineering for mature applications happens here!
– Process: let's try this new extractor/data stream/transform/…
• Manual or automatic [feature induction: Della Pietra et al. '97]

Page 4

Evaluating New Features
• Standard procedure:

– Add features, re-run train/test/CV, hope accuracy improves

• In many applications, this is costly
– Computationally: full re-training is expensive
– Monetarily: there is a cost per feature-value (must check on a small sample)
– Logistically: infrastructure is pipelined, non-trivial, under-documented

• Goal: Efficiently check whether a new feature can improve accuracy without retraining
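The standard procedure above can be sketched in a few lines (a minimal illustration; `utility_by_retraining`, `fit`, and `score` are hypothetical names, with a toy least-squares model standing in for the real learner):

```python
import numpy as np

def utility_by_retraining(X, y, f, fit, score):
    """Standard (costly) check: does adding feature column f improve accuracy?
    `fit` trains a model, `score` evaluates it (both illustrative callables)."""
    base = score(fit(X, y), X, y)
    X_new = np.column_stack([X, f])       # augment the design matrix with f
    with_f = score(fit(X_new, y), X_new, y)
    return with_f - base                   # > 0 means the new feature helped

# Toy example: least-squares regression, R^2 as the score.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
f = rng.normal(size=200)
y = X @ [1.0, -2.0, 0.5] + 3.0 * f + 0.1 * rng.normal(size=200)

fit = lambda X, y: np.linalg.lstsq(X, y, rcond=None)[0]
score = lambda w, X, y: 1 - np.sum((y - X @ w) ** 2) / np.sum((y - y.mean()) ** 2)

print(utility_by_retraining(X, y, f, fit, score))  # clearly positive: f carries signal
```

This is exactly the loop the paper wants to avoid: every candidate feature pays for a full re-train.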

Page 5

Feature Relevance vs. Feature Selection
• Selection objective: removing existing features

• Relevance objective: decide if a new feature is worth adding

• Most feature selection methods either use re-training or estimate marginal relevance

• Feature relevance requires estimating incremental relevance given the current predictor


Page 7

Formalizing New Feature Relevance
• Supervised learning setting
– Training set {(x_i, y_i)}
– Current predictor y_hat = F(x)
– New feature f(x)

• Hypothesis: can a better predictor be learned with the new feature?

• Too general. Instead, let's test an additive form: is there a correction g s.t. loss(F + g(f)) < loss(F)?

For efficiency, we can just test a restricted class of corrections g.

Page 8

Hypothesis Test for New Feature Relevance
• We want to test whether the new feature f has incremental signal: loss(F + g(f)) < loss(F) for some correction g
• Intuition: the loss gradient tells us how to improve the predictor
• Consider the functional loss gradient at the current predictor F
– Since F is locally optimal, no descent direction exists within the current feature set
• Theorem: under reasonable assumptions, the hypothesis above is equivalent to:

corr(f, g) > 0

where g is the normalized negative functional loss gradient evaluated at the training points
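As a concrete instance, assume squared loss (the symbols L, F, and g_i below are illustrative notation, not taken from the slides):

```latex
L(F) = \sum_{i=1}^{n} \ell(F(x_i), y_i), \qquad
\ell(\hat{y}, y) = \tfrac{1}{2}(\hat{y} - y)^2, \qquad
g_i = -\frac{\partial \ell(F(x_i), y_i)}{\partial F(x_i)} = y_i - F(x_i).
```

For squared loss the negative functional gradient is just the residual, so the test asks whether the vector (f(x_1), …, f(x_n)) correlates with the residuals, i.e. whether f supplies a descent direction the current features cannot.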

Page 9

Hypothesis Test for New Feature Relevance

• Intuition: can the new feature yield a descent direction in functional space?
• Why this is cool:

Testing new feature relevance for a broad class of losses ⟺ testing correlation between the feature and the normalized loss gradient
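The equivalence can be exercised directly. Below is a minimal sketch assuming squared loss, where the negative pointwise gradient is simply the residual (`gradient_correlation` and the synthetic data are illustrative, not from the original slides):

```python
import numpy as np

def gradient_correlation(f, y, y_hat):
    """Correlation between a candidate feature and the negative loss
    gradient at the current predictions. For squared loss the negative
    pointwise gradient is simply the residual y - y_hat."""
    g = y - y_hat                      # -dL/dF(x_i) for squared loss
    g = g / np.linalg.norm(g)          # normalize the gradient
    f = (f - f.mean()) / f.std()       # standardize the feature
    return float(np.corrcoef(f, g)[0, 1])

rng = np.random.default_rng(1)
y_hat = rng.normal(size=500)                            # current predictions
useful = rng.normal(size=500)                           # signal the model lacks
y = y_hat + 2.0 * useful + 0.1 * rng.normal(size=500)   # residual driven by `useful`
noise = rng.normal(size=500)                            # irrelevant candidate

print(gradient_correlation(useful, y, y_hat))  # large: useful feature
print(gradient_correlation(noise, y, y_hat))   # near zero: irrelevant feature
```

No model is retrained: one pass over residuals separates the useful candidate from the irrelevant one.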


Page 11

Testing Correlation to Loss Gradient
• We don't have a consistent test for the correlation being > 0 …but since F is locally optimal, the test above is equivalent to a regression formulation …for which we can design a consistent bootstrap test!

• Intuition
– We need to test whether we can train a regressor from the new feature to the loss gradient
– We want it to be as powerful as possible and work on small samples
Q: How do we distinguish between true correlation and overfitting?
A: We correct by the correlation obtained from independent bootstrap samples

Page 12

New Feature Relevance: Algorithm

(1) Train a best-fit regressor from the new feature to the loss-gradient targets; compute the correlation between its predictions and the targets

(2) Repeat B times:
a) Draw independent bootstrap samples of the feature and the targets
b) Train a best-fit regressor, compute the correlation

(3) Score: the correlation from (1) corrected by the null correlations from (2)
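The three steps can be sketched as follows (a toy rendering, not the authors' implementation: squared loss is assumed, the regressor is 1-D least squares, and `relevance_score`, `n_boot` are illustrative names):

```python
import numpy as np

def relevance_score(f, g, n_boot=100, seed=0):
    """Bootstrap-corrected correlation score for a new feature (sketch).
    f: new feature values; g: loss-gradient targets (residuals for squared loss)."""
    rng = np.random.default_rng(seed)

    def fit_corr(f_s, g_s):
        # Best-fit 1-D regressor (slope + intercept) from feature to targets,
        # then the correlation between its predictions and the targets.
        A = np.column_stack([f_s, np.ones_like(f_s)])
        pred = A @ np.linalg.lstsq(A, g_s, rcond=None)[0]
        c = np.corrcoef(pred, g_s)[0, 1]
        return 0.0 if np.isnan(c) else c

    observed = fit_corr(f, g)                                   # step (1)

    # Step (2): draw *independent* bootstrap samples of f and g, which
    # destroys any true association but preserves the overfitting effect.
    n = len(f)
    null = [fit_corr(f[rng.integers(0, n, n)], g[rng.integers(0, n, n)])
            for _ in range(n_boot)]

    return observed - float(np.mean(null))                      # step (3)

rng = np.random.default_rng(2)
g = rng.normal(size=300)                     # gradient targets (residuals)
relevant = g + 0.5 * rng.normal(size=300)    # candidate correlated with targets
irrelevant = rng.normal(size=300)            # candidate with no signal

print(relevance_score(relevant, g))    # clearly positive
print(relevance_score(irrelevant, g))  # near zero after correction
```

The null correlations from independently resampled pairs estimate how much correlation the regressor can manufacture by overfitting alone; subtracting them is the correction in step (3).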

Page 13

New Feature Relevance: Algorithm

Page 14

Connection to Boosting

• AnyBoost/gradient boosting additive form
– Gradient vs. coordinate descent in functional space

• AnyBoost/GB: generalization

• This work: consistent hypothesis test for feasibility
– Statistical stopping criteria for boosting?

Page 15

Experimental Validation

• Natural methodology: compare to full re-training
• For each feature:
– Actual utility: accuracy change from full re-training with the feature
– Predicted utility: the relevance score

• We are mainly interested in high-utility features
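This methodology can be illustrated on synthetic data (everything below is a toy sketch: a least-squares model, training-set R² gain standing in for "actual" utility, and residual correlation for the "predicted" score):

```python
import numpy as np

rng = np.random.default_rng(3)
n = 400
X = rng.normal(size=(n, 2))                   # existing features
strong, weak, junk = rng.normal(size=(3, n))  # three candidate features
y = X @ [1.0, -1.0] + 2.0 * strong + 0.4 * weak + 0.2 * rng.normal(size=n)

def r2(M, y):
    w = np.linalg.lstsq(M, y, rcond=None)[0]
    return 1 - np.sum((y - M @ w) ** 2) / np.sum((y - y.mean()) ** 2)

base = r2(X, y)
residual = y - X @ np.linalg.lstsq(X, y, rcond=None)[0]

actual, predicted = {}, {}
for name, f in [("strong", strong), ("weak", weak), ("junk", junk)]:
    actual[name] = r2(np.column_stack([X, f]), y) - base   # costly: full re-training
    predicted[name] = abs(np.corrcoef(f, residual)[0, 1])  # fast: gradient correlation
    print(f"{name}: actual={actual[name]:.3f} predicted={predicted[name]:.3f}")
```

The point of the comparison: the cheap predicted score should rank the candidates the same way the expensive re-training does, especially at the high-utility end.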

Page 16

Datasets

• WebSearch: each "feature" is a signal source
• E.g., the "Body" source defines all features that depend on the document body

• Signal source examples: AnchorText, ClickLog, etc.

Page 17

Results: Adult

Page 18

Results: Housing

Page 19

Results: WebSearch

Page 20

Comparison to Feature Selection

Page 21

New Feature Relevance: Summary

• Evaluating new features by re-training can be costly
– Computationally, financially, logistically

• Fast alternative: testing correlation to the loss gradient
• Black-box algorithm: regression for (almost) any loss!
• Just one approach, lots of future work:
– Alternatives to hypothesis testing: info-theory, optimization, …
– Semi-supervised methods
– Back to feature selection?
– Removing black-box assumptions