
Page 1: Feature Engineering and Classifier Ensemble for KDD Cup 2010

Feature Engineering and Classifier Ensemble for KDD Cup 2010

Chih-Jen Lin

Department of Computer Science, National Taiwan University

Joint work with HF Yu, HY Lo, HP Hsieh, JK Lou, T McKenzie, JW Chou, PH Chung, CH Ho, CF Chang, YH Wei, JY Weng, ES Yan, CW Chang, TT Kuo, YC Lo, PT Chang, C Po, CY Wang, YH Huang, CW Hung, YX Ruan, YS Lin, SD Lin and HT Lin

July 25, 2010

Chih-Jen Lin (National Taiwan Univ.) July 25, 2010 1 / 39

Page 2: Feature Engineering and Classifier Ensemble for KDD Cup 2010

Outline

Team Members
Initial Approaches and Some Settings
Sparse Features and Linear Classification
Condensed Features and Random Forest
Ensemble and Final Results
Discussion and Conclusions

Chih-Jen Lin (National Taiwan Univ.) July 25, 2010 2 / 39

Page 3: Feature Engineering and Classifier Ensemble for KDD Cup 2010

Team Members

Outline

Team Members
Initial Approaches and Some Settings
Sparse Features and Linear Classification
Condensed Features and Random Forest
Ensemble and Final Results
Discussion and Conclusions

Chih-Jen Lin (National Taiwan Univ.) July 25, 2010 3 / 39

Page 4: Feature Engineering and Classifier Ensemble for KDD Cup 2010

Team Members

Team Members

At National Taiwan University, we organized a course for KDD Cup 2010

Three instructors, two TAs, 19 students and one RA

The 19 students were split into six sub-teams

Sub-teams were named after animals:

Armyants, starfish, weka, trilobite, duck, sunfish

We will be happy to share our experience in running a course for competitions

Chih-Jen Lin (National Taiwan Univ.) July 25, 2010 4 / 39

Page 5: Feature Engineering and Classifier Ensemble for KDD Cup 2010

Team Members

Armyants

麥陶德 (Todd G. McKenzie), 羅經凱 (Jing-Kai Lou) and 解巽評 (Hsun-Ping Hsieh)

Chih-Jen Lin (National Taiwan Univ.) July 25, 2010 5 / 39

Page 6: Feature Engineering and Classifier Ensemble for KDD Cup 2010

Team Members

Starfish

Chia-Hua Ho (何家華), Po-Han Chung (鐘博翰), and Jung-Wei Chou (周融瑋)

Chih-Jen Lin (National Taiwan Univ.) July 25, 2010 6 / 39

Page 7: Feature Engineering and Classifier Ensemble for KDD Cup 2010

Team Members

Weka

Yin-Hsuan Wei (魏吟軒), En-Hsu Yen (嚴恩勗), Chun-Fu Chang (張淳富) and Jui-Yu Weng (翁睿妤)

Chih-Jen Lin (National Taiwan Univ.) July 25, 2010 7 / 39

Page 8: Feature Engineering and Classifier Ensemble for KDD Cup 2010

Team Members

Trilobite

Yi-Chen Lo (羅亦辰), Che-Wei Chang (張哲維) and Tsung-Ting Kuo (郭宗廷)

Chih-Jen Lin (National Taiwan Univ.) July 25, 2010 8 / 39

Page 9: Feature Engineering and Classifier Ensemble for KDD Cup 2010

Team Members

Duck

Chien-Yuan Wang (王建元), Chieh Po (柏傑), and Po-Tzu Chang (張博詞).

Chih-Jen Lin (National Taiwan Univ.) July 25, 2010 9 / 39

Page 10: Feature Engineering and Classifier Ensemble for KDD Cup 2010

Team Members

Sunfish

Yu-Xun Ruan (阮昱勳), Chen-Wei Hung (洪琛洧) and Yi-Hung Huang (黃曳弘)

Chih-Jen Lin (National Taiwan Univ.) July 25, 2010 10 / 39

Page 11: Feature Engineering and Classifier Ensemble for KDD Cup 2010

Team Members

Tiger (RA)

Yu-Shi Lin (林育仕)

Chih-Jen Lin (National Taiwan Univ.) July 25, 2010 11 / 39

Page 12: Feature Engineering and Classifier Ensemble for KDD Cup 2010

Team Members

Snoopy (TAs)

Hsiang-Fu Yu (余相甫) and Hung-Yi Lo (駱宏毅)

Snoopy and Pikachu are the IDs of our team in the final stage of the competition

Chih-Jen Lin (National Taiwan Univ.) July 25, 2010 12 / 39

Page 13: Feature Engineering and Classifier Ensemble for KDD Cup 2010

Team Members

Instructors

林智仁 (Chih-Jen Lin), 林軒田 (Hsuan-Tien Lin) and 林守德 (Shou-De Lin)

Chih-Jen Lin (National Taiwan Univ.) July 25, 2010 13 / 39

Page 14: Feature Engineering and Classifier Ensemble for KDD Cup 2010

Initial Approaches and Some Settings

Outline

Team Members
Initial Approaches and Some Settings
Sparse Features and Linear Classification
Condensed Features and Random Forest
Ensemble and Final Results
Discussion and Conclusions

Chih-Jen Lin (National Taiwan Univ.) July 25, 2010 14 / 39

Page 15: Feature Engineering and Classifier Ensemble for KDD Cup 2010

Initial Approaches and Some Settings

Initial Thoughts and Our Approach

We suspected that this competition would be very different from past KDD Cups

Domain knowledge seems to be extremely important for educational systems

Temporal information may be crucial

At first, we explored a temporal approach

We tried Bayesian networks

But we quickly found that using a traditional classification approach was easier

Chih-Jen Lin (National Taiwan Univ.) July 25, 2010 15 / 39

Page 16: Feature Engineering and Classifier Ensemble for KDD Cup 2010

Initial Approaches and Some Settings

Initial Thoughts and Our Approach (Cont’d)

Traditional classification:

Data points: independent Euclidean vectors

Suitable features to reflect domain knowledge and temporal information

Domain knowledge and temporal information: important, but not as crucial as we thought at the beginning

Chih-Jen Lin (National Taiwan Univ.) July 25, 2010 16 / 39

Page 17: Feature Engineering and Classifier Ensemble for KDD Cup 2010

Initial Approaches and Some Settings

Our Framework

Framework: Problem → Sparse Features / Condensed Features → Ensemble

Chih-Jen Lin (National Taiwan Univ.) July 25, 2010 17 / 39

Page 18: Feature Engineering and Classifier Ensemble for KDD Cup 2010

Initial Approaches and Some Settings

Validation Sets

• Avoid overfitting the leader board

• Standard validation ⇒ ignores time series

• Our validation set: the last problem of each unit in the training set

• This simulates the procedure used to construct the testing sets

A unit of problems:
problem 1 ∈ V
problem 2 ∈ V
...
last problem ∈ V̄

V: internal training; V̄: internal validation

• In the early stage, we focused on validation sets

Chih-Jen Lin (National Taiwan Univ.) July 25, 2010 18 / 39
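To make this split concrete, here is a minimal pandas sketch, assuming illustrative column names (the released files use fields such as "Anon Student Id", "Problem Hierarchy", and "Problem Name"); it marks each student's last problem in every unit as internal validation.

```python
import pandas as pd

# Toy log in the spirit of the released data; rows are assumed to be in
# chronological order, as in the original files.
df = pd.DataFrame({
    "student": ["s1", "s1", "s1", "s1", "s2", "s2"],
    "unit":    ["U1", "U1", "U1", "U1", "U1", "U1"],
    "problem": ["P1", "P1", "P2", "P3", "P1", "P2"],
    "CFA":     [1,    0,    1,    1,    0,    1],
})

# For every (student, unit) group, the problem appearing on the last row.
last_problem = df.groupby(["student", "unit"])["problem"].transform("last")

# Rows of that last problem form the internal validation set (V-bar);
# everything else is internal training (V).
is_validation = df["problem"] == last_problem
internal_train = df[~is_validation]
internal_valid = df[is_validation]
```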

Page 19: Feature Engineering and Classifier Ensemble for KDD Cup 2010

Sparse Features and Linear Classification

Outline

Team Members
Initial Approaches and Some Settings
Sparse Features and Linear Classification
Condensed Features and Random Forest
Ensemble and Final Results
Discussion and Conclusions

Chih-Jen Lin (National Taiwan Univ.) July 25, 2010 19 / 39

Page 20: Feature Engineering and Classifier Ensemble for KDD Cup 2010

Sparse Features and Linear Classification

Framework: Problem → Sparse Features / Condensed Features → Ensemble

Chih-Jen Lin (National Taiwan Univ.) July 25, 2010 20 / 39

Page 21: Feature Engineering and Classifier Ensemble for KDD Cup 2010

Sparse Features and Linear Classification

Basic Sparse Features

Categorical features (student, unit, section, problem, step, KC): expanded to binary features

Numerical features (opportunity value, problem view): scaled by log(1 + x)

A89: Algebra 2008-2009
B89: Bridge to Algebra 2008-2009

RMSE (leader board)      A89     B89
Basic sparse features    0.2895  0.2985
Best leader board        0.2759  0.2777

Five of the six student sub-teams used variants of this approach. From this basic set, we added more features.

Chih-Jen Lin (National Taiwan Univ.) July 25, 2010 21 / 39
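A minimal sketch of this encoding is shown below, with illustrative field names (not the official column names): each categorical value becomes one binary indicator, numerical fields are scaled by log(1 + x), and DictVectorizer builds the sparse matrix.

```python
import numpy as np
from sklearn.feature_extraction import DictVectorizer

# One toy row; field names are illustrative only.
rows = [
    {"student": "s1", "unit": "U1", "section": "S1", "problem": "P1",
     "step": "step-1", "KC": "Write expression", "opportunity": 3, "problem_view": 1},
]

def to_sparse_features(rows):
    feats = []
    for r in rows:
        f = {}
        # Each categorical value becomes one binary feature, e.g. "student=s1".
        for c in ("student", "unit", "section", "problem", "step", "KC"):
            f[c + "=" + str(r[c])] = 1.0
        # Numerical fields are scaled by log(1 + x).
        for n in ("opportunity", "problem_view"):
            f[n] = float(np.log1p(r[n]))
        feats.append(f)
    return feats

vec = DictVectorizer()                           # builds the sparse binary expansion
X = vec.fit_transform(to_sparse_features(rows))  # SciPy sparse matrix
```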

Page 22: Feature Engineering and Classifier Ensemble for KDD Cup 2010

Sparse Features and Linear Classification

Feature Combination and Temporal Information

Feature combination: (problem, step) etc.

⇒ Fetch hierarchical information

Nonlinear mappings of data

Temporal features: add information from previous steps

⇒ Fetch time series information

e.g., add the KC and step names of the previous three steps as temporal features

Chih-Jen Lin (National Taiwan Univ.) July 25, 2010 22 / 39
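The sketch below illustrates how such combination and previous-step features could be added to the dict-based encoding above; the specific pairs and history length are illustrative choices, not necessarily the exact ones the team used.

```python
rows = [  # one student's steps, in chronological order; fields are illustrative
    {"student": "s1", "unit": "U1", "problem": "P1", "step": "step-1", "KC": "k1"},
    {"student": "s1", "unit": "U1", "problem": "P1", "step": "step-2", "KC": "k2"},
    {"student": "s1", "unit": "U1", "problem": "P2", "step": "step-1", "KC": "k1"},
]

def add_combination_and_temporal(rows, history=3):
    feats = []
    prev = []  # (step, KC) of earlier rows for this student
    for r in rows:
        f = {}
        # Combination features: one binary indicator per observed value pair.
        f["problem|step=" + r["problem"] + "|" + r["step"]] = 1.0
        f["student|unit=" + r["student"] + "|" + r["unit"]] = 1.0
        # Temporal features: step name and KC of up to `history` previous steps.
        for k, (p_step, p_kc) in enumerate(reversed(prev[-history:]), start=1):
            f["prev{}_step={}".format(k, p_step)] = 1.0
            f["prev{}_KC={}".format(k, p_kc)] = 1.0
        prev.append((r["step"], r["KC"]))
        feats.append(f)
    return feats

combo_feats = add_combination_and_temporal(rows)  # merge with the basic dicts
```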

Page 23: Feature Engineering and Classifier Ensemble for KDD Cup 2010

Sparse Features and Linear Classification

Feature Combination and Temporal Information (Cont’d)

RMSE as more features are added:

Feature set           A89     B89
Basic                 0.2895  0.2985
+ Combination         0.2843  0.2883
+ Temporal            0.2816  0.2875
+ More combination    0.2815  0.2836

Chih-Jen Lin (National Taiwan Univ.) July 25, 2010 23 / 39

Page 24: Feature Engineering and Classifier Ensemble for KDD Cup 2010

Sparse Features and Linear Classification

Knowledge Component Feature

Originally, we used binary features to indicate whether a KC appears. An alternative way:

Treat each token in the KC as a feature. “Write expression, positive one slope” is similar to “Write expression, positive slope”

Use “write,” “expression,” “positive,” “slope,” and “one” as binary features

Performs well on A89 only

Chih-Jen Lin (National Taiwan Univ.) July 25, 2010 24 / 39
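A small sketch of this token-level encoding, using a hypothetical helper name:

```python
import re

def kc_token_features(kc_string):
    """Binary token features from a KC string, e.g.
    'Write expression, positive one slope' ->
    {'kc_token=write': 1.0, 'kc_token=expression': 1.0, ...}."""
    tokens = set(re.findall(r"[a-z]+", kc_string.lower()))
    return {"kc_token=" + t: 1.0 for t in tokens}

print(kc_token_features("Write expression, positive one slope"))
```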


Page 26: Feature Engineering and Classifier Ensemble for KDD Cup 2010

Sparse Features and Linear Classification

Training via Linear Classification

Large numbers of instances and features

The largest number of features used is 30,971,151

       #instances   #features
A89    8,918,055    ≥ 20M
B89    20,012,499   ≥ 30M

Impractical to use nonlinear classifiers

Use LIBLINEAR, developed at National Taiwan University (Fan et al., 2008)

We consider logistic regression instead of SVM

Training time: about 1 hour for 20M instances and 30M features (B89)

Chih-Jen Lin (National Taiwan Univ.) July 25, 2010 25 / 39
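As a rough sketch of this step, scikit-learn's LIBLINEAR solver can stand in for the LIBLINEAR package the team used directly; random sparse data replaces the real feature matrix so the snippet runs on its own.

```python
import numpy as np
import scipy.sparse as sp
from sklearn.linear_model import LogisticRegression

# X would be the sparse one-hot matrix built earlier and y the 0/1
# "Correct First Attempt" labels; random data keeps the sketch runnable.
rng = np.random.default_rng(0)
X = sp.random(1000, 5000, density=0.002, format="csr", random_state=0)
y = rng.integers(0, 2, size=1000)

clf = LogisticRegression(solver="liblinear", C=1.0)
clf.fit(X, y)

# Predicted probabilities of a correct first attempt; the competition
# metric (RMSE) is computed on these probabilities, not on 0/1 labels.
p = clf.predict_proba(X)[:, 1]
rmse = np.sqrt(np.mean((p - y) ** 2))
```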

Page 27: Feature Engineering and Classifier Ensemble for KDD Cup 2010

Sparse Features and Linear Classification

Result Using Sparse Features

Leader board results:

                        A89     B89
Basic sparse features   0.2895  0.2985
Best sparse features    0.2784  0.2830
Best leader board       0.2759  0.2777

Chih-Jen Lin (National Taiwan Univ.) July 25, 2010 26 / 39

Page 28: Feature Engineering and Classifier Ensemble for KDD Cup 2010

Condensed Features and Random Forest

Outline

Team Members
Initial Approaches and Some Settings
Sparse Features and Linear Classification
Condensed Features and Random Forest
Ensemble and Final Results
Discussion and Conclusions

Chih-Jen Lin (National Taiwan Univ.) July 25, 2010 27 / 39

Page 29: Feature Engineering and Classifier Ensemble for KDD Cup 2010

Condensed Features and Random Forest

Framework: Problem → Sparse Features / Condensed Features → Ensemble

Chih-Jen Lin (National Taiwan Univ.) July 25, 2010 28 / 39

Page 30: Feature Engineering and Classifier Ensemble for KDD Cup 2010

Condensed Features and Random Forest

Condensed Features

Categorical feature ⇒ numerical feature

• Use the correct first attempt rate (CFAR). Example: a student named sid:

CFAR = (# steps with student = sid and CFA = 1) / (# steps with student = sid)

• CFARs for student, step, KC, problem, (student, unit), (problem, step), (student, KC) and (student, problem)

Temporal features: the previous ≤ 6 steps with the same student and KC

• An indicator for the existence of such steps

• Correct first attempt rate

• Average hint request rate

Chih-Jen Lin (National Taiwan Univ.) July 25, 2010 29 / 39
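Below is a pandas sketch of the CFAR idea on a toy frame with assumed column names; in a real pipeline the rates would be estimated from (internal) training rows only, so a row's own label does not leak into its feature.

```python
import pandas as pd

# Toy frame; "CFA" is the 0/1 correct-first-attempt label.
df = pd.DataFrame({
    "student": ["s1", "s1", "s2", "s2", "s2"],
    "step":    ["a",  "b",  "a",  "a",  "c"],
    "KC":      ["k1", "k1", "k2", "k2", "k1"],
    "CFA":     [1,    0,    1,    1,    0],
})

def add_cfar(df, keys):
    """One condensed feature: mean CFA over all steps sharing the values in `keys`."""
    df["cfar_" + "+".join(keys)] = df.groupby(keys)["CFA"].transform("mean")

# A subset of the groupings on the slide; problem, (student, unit), etc.
# would be handled the same way.
for keys in (["student"], ["step"], ["KC"], ["student", "KC"]):
    add_cfar(df, keys)
```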

Page 31: Feature Engineering and Classifier Ensemble for KDD Cup 2010

Condensed Features and Random Forest

Condensed Features (Cont’d)

Temporal features:

When was a step with the same student name and KC last seen?

Binary features to model four levels:

Same day, 1-6 days, 7-30 days, > 30 days

Opportunity and problem view: scaled

In total, 17 condensed features

Chih-Jen Lin (National Taiwan Univ.) July 25, 2010 30 / 39
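A sketch of these four-level recency features, assuming a column giving each row's time in days:

```python
import pandas as pd

# Toy frame; "day" is the (assumed) day index of each row, in chronological order.
df = pd.DataFrame({
    "student": ["s1", "s1", "s1", "s1"],
    "KC":      ["k1", "k1", "k1", "k1"],
    "day":     [0,    0,    10,   45],
})

# Days since the same (student, KC) pair was last seen; NaN if never seen before.
gap = df["day"] - df.groupby(["student", "KC"])["day"].shift()

df["seen_same_day"]  = (gap == 0).astype(int)
df["seen_1_6_days"]  = gap.between(1, 6).astype(int)
df["seen_7_30_days"] = gap.between(7, 30).astype(int)
df["seen_over_30"]   = (gap > 30).astype(int)
```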

Page 32: Feature Engineering and Classifier Ensemble for KDD Cup 2010

Condensed Features and Random Forest

Training by Random Forest

Due to the small number of features, we could try several classifiers via Weka (Hall et al., 2009)

Random Forest (Breiman, 2001) showed the best performance:

                          A89     B89
Basic sparse features     0.2895  0.2985
Best sparse features      0.2784  0.2830
Best condensed features   0.2824  0.2847
Best leader board         0.2759  0.2777

This small feature set works well

Chih-Jen Lin (National Taiwan Univ.) July 25, 2010 31 / 39
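A stand-in sketch with scikit-learn's Random Forest (the team ran it through Weka); random data replaces the actual 17 condensed features so the snippet is self-contained.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# X_cond would be the 17-dimensional condensed feature matrix, y the CFA labels.
rng = np.random.default_rng(0)
X_cond = rng.random((1000, 17))
y = rng.integers(0, 2, size=1000)

rf = RandomForestClassifier(n_estimators=200, random_state=0)
rf.fit(X_cond, y)

# Probability estimates (fraction of trees voting "correct"), used for RMSE.
p = rf.predict_proba(X_cond)[:, 1]
```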

Page 33: Feature Engineering and Classifier Ensemble for KDD Cup 2010

Ensemble and Final Results

Outline

Team Members
Initial Approaches and Some Settings
Sparse Features and Linear Classification
Condensed Features and Random Forest
Ensemble and Final Results
Discussion and Conclusions

Chih-Jen Lin (National Taiwan Univ.) July 25, 2010 32 / 39

Page 34: Feature Engineering and Classifier Ensemble for KDD Cup 2010

Ensemble and Final Results

Framework: Problem → Sparse Features / Condensed Features → Ensemble

Chih-Jen Lin (National Taiwan Univ.) July 25, 2010 33 / 39

Page 35: Feature Engineering and Classifier Ensemble for KDD Cup 2010

Ensemble and Final Results

Linear Regression for Ensemble

Linear regression to ensemble the sub-teams’ results

min_w ‖y − Pw‖² + (λ/2)‖w‖²

y: labels of the testing set (l × 1); l: # testing data

P: l × (# results from student sub-teams)

Predictions truncated to [0, 1]: min(1, max(0, Pw))

Some techniques are needed because y is unavailable

e.g., deciding the regularization parameter λ

Chih-Jen Lin (National Taiwan Univ.) July 25, 2010 34 / 39
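The sketch below solves this regularized least-squares problem in closed form and truncates the combined predictions to [0, 1]. It uses validation labels so it stays runnable; in the competition, y was the unknown test labels, so λ and the fit had to be guided indirectly (e.g., by validation and leader-board feedback).

```python
import numpy as np

def ridge_ensemble(P, y, lam):
    # Solve min_w ||y - P w||^2 + (lam / 2) ||w||^2 in closed form:
    # (P^T P + (lam / 2) I) w = P^T y
    d = P.shape[1]
    return np.linalg.solve(P.T @ P + 0.5 * lam * np.eye(d), P.T @ y)

def ensemble_predict(P, w):
    return np.clip(P @ w, 0.0, 1.0)  # truncate to [0, 1]

# Toy example: three sub-team prediction vectors on a validation set.
rng = np.random.default_rng(0)
y_val = rng.integers(0, 2, size=500).astype(float)
P_val = np.column_stack([0.8 * y_val + rng.normal(0, 0.2, size=500) for _ in range(3)])

w = ridge_ensemble(P_val, y_val, lam=1.0)
pred = ensemble_predict(P_val, w)
```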

Page 36: Feature Engineering and Classifier Ensemble for KDD Cup 2010

Ensemble and Final Results

Ensemble Results

Ensemble significantly improves the results

                          A89     B89     Avg.
Basic sparse features     0.2895  0.2985  0.2940
Best sparse features      0.2784  0.2830  0.2807
Best condensed features   0.2824  0.2847  0.2835
Best ensemble             0.2756  0.2780  0.2768
Best leader board         0.2759  0.2777  0.2768

Our team ranked 2nd on the leader board

The difference from the 1st place was small; we hoped that our solution did not overfit the leader board too much and might be better on the complete challenge set

Chih-Jen Lin (National Taiwan Univ.) July 25, 2010 35 / 39

Page 37: Feature Engineering and Classifier Ensemble for KDD Cup 2010

Ensemble and Final Results

Final Results

Rank  Team name                    Leader board  Cup
1     National Taiwan University   0.276803      0.272952
2     Zhang and Su                 0.276790      0.273692
3     BigChaos @ KDD               0.279046      0.274556
4     Zach A. Pardos               0.279695      0.276590
5     Old Dogs With New Tricks     0.281163      0.277864

Team names used during the competition:

Snoopy ⇒ National Taiwan University

BbCc ⇒ Zhang and Su

Cup scores generally better than leader board

Chih-Jen Lin (National Taiwan Univ.) July 25, 2010 36 / 39

Page 38: Feature Engineering and Classifier Ensemble for KDD Cup 2010

Discussion and Conclusions

Outline

Team Members
Initial Approaches and Some Settings
Sparse Features and Linear Classification
Condensed Features and Random Forest
Ensemble and Final Results
Discussion and Conclusions

Chih-Jen Lin (National Taiwan Univ.) July 25, 2010 37 / 39

Page 39: Feature Engineering and Classifier Ensemble for KDD Cup 2010

Discussion and Conclusions

Diversities in Learning

We believe that one key to our ensemble’s success is its diversity

Feature diversity

Classifier diversity

Different sub-teams tried different ideas, guided by their human intelligence

Our student sub-teams even have biodiversity

Mammals: snoopy, tiger

Birds: weka, duck

Insects: armyants, trilobite

Marine animals: starfish, sunfish

Chih-Jen Lin (National Taiwan Univ.) July 25, 2010 38 / 39


Page 41: Feature Engineering and Classifier Ensemble for KDD Cup 2010

Discussion and Conclusions

Conclusions

Feature engineering and classifier ensembles seem to be useful for educational data mining

All our team members worked very hard, but we were also a bit lucky

We thank the organizers for organizing this interesting and fruitful competition

We also thank National Taiwan University for providing a stimulating research environment

Chih-Jen Lin (National Taiwan Univ.) July 25, 2010 39 / 39