Approach to Generate a Vast Variety of Features for Predicting Dropout in MOOC
Takuya Akiyama, Kei Yonekawa, Aakansh Gupta, Nuo Zhang, Shigeki Muramatsu, Rui Kimura, Nobuyuki Maita, Yujin Tang, Takafumi Watanabe,
Akihiro Kobayashi, Kazunori Matsumoto, and Keiichi Kuroyanagi
The Final Results
Copyright © 2015 KDDI R&D Labs. Inc. All Rights Reserved
“KDDILABS&Keiku” Members

Account Name   Full Name             Affiliations
t.MF           Akiyama, Takuya       KDDI R&D Laboratories, Inc.
Aakansh        Gupta, Aakansh        Uhuru Corporation
NoahZh         Zhang, Nuo            KDDI R&D Laboratories, Inc.
kyone          Yonekawa, Kei         KDDI R&D Laboratories, Inc.
mz-matsumoto   Matsumoto, Kazunori   KDDI R&D Laboratories, Inc.
mura           Muramatsu, Shigeki    KDDI R&D Laboratories, Inc.
ruik           Kimura, Rui           KDDI R&D Laboratories, Inc.
no6est         Maita, Nobuyuki       KDDI R&D Laboratories, Inc.
Yujin          Tang, Yujin           KDDI R&D Laboratories, Inc.
Keiku          Kuroyanagi, Keiichi   Financial Engineering Group, Inc.
TakWat         Watanabe, Takafumi    KDDI R&D Laboratories, Inc.
apf-koba       Kobayashi, Akihiro    KDDI R&D Laboratories, Inc.
Working at KDDILABS office
What is “KDDI”?
Knowledge Discovery and Data mining Institute …?
NO!
KDDI is a Japanese telecommunications company; the name “KDDI” is an acronym of Japanese words.
There is no relation between KDD 2015 and KDDI. We did NOT cheat.
System Overview
[Diagram: Original Data → Feature (×2000) → XGBoost†, Regularized Greedy Forest, Deep Neural Network (bagging of 200 models) → Blend → Submit Data]
Our special twist is “strategic” feature engineering.
† http://dmlc.github.io/
Each member started KDD Cup separately.
At First...
Number of features: over 1,500
Each member created features separately.
In late June, we merged into one team.
“Basic Features”
All features have variations along dimensions such as:
time window, category, event, source, or their combinations
Examples of Basic 1500 Features
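[Figure: timeline of the access logs of one eID, annotated with durations (2h, 30m, 30s, 50s, 1m 15s, 5s, 3m, 4m 30s) and the target prediction interval marked “?”. Labels: the number of logs of the eID; the number of lags; counting up; time to the target prediction interval.]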
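As a rough illustration of how basic features of this kind could be generated, the sketch below counts the logs of one eID per time window and per source/event combination, and measures the time from the last log to the target prediction interval. The record layout `(timestamp, source, event)`, the window sizes, and all function and feature names are assumptions made for this example, not the team's actual code.

```python
from collections import Counter
from datetime import datetime, timedelta


def basic_count_features(logs, interval_start, window_days=(1, 7, 30)):
    """Count-based features for one eID (hypothetical record layout).

    logs: list of (timestamp, source, event) tuples, all earlier than
          interval_start, the start of the target prediction interval.
    """
    features = Counter()
    for days in window_days:
        window_start = interval_start - timedelta(days=days)
        for ts, source, event in logs:
            if window_start <= ts < interval_start:
                # Total logs in the window, and logs per source/event pair.
                features[f"logs_last_{days}d"] += 1
                features[f"logs_last_{days}d_{source}_{event}"] += 1
    result = dict(features)
    if logs:
        last_log = max(ts for ts, _, _ in logs)
        gap = interval_start - last_log
        result["hours_since_last_log"] = gap.total_seconds() / 3600
    return result


example_logs = [
    (datetime(2014, 6, 1, 10), "browser", "access"),
    (datetime(2014, 6, 9, 12), "server", "problem"),
]
print(basic_count_features(example_logs, datetime(2014, 6, 10)))
```

Varying the window sizes and the grouping keys is one way a small set of counting rules can expand into a large number of features.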
Why can’t we predict the “lower right” eIDs accurately? The “lower right” eIDs have very few logs, in some cases only 1 log, yet they did not drop out of the course. Because there are so few logs, it is hard to predict their dropout probability with the basic features.
ROC Curve & Predicted Value Distribution
[Figure: left panel, ROC curve (True Positive Rate vs. False Positive Rate); right panel, density of the predicted value for dropout eIDs and non-dropout eIDs. Predictions are for the training data, by XGBoost with 10-fold cross validation.]
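The area under the ROC curve (AUC) summarizes a plot like the one above in a single number. A minimal sketch of how it can be computed from labels and predicted values follows. It uses the pairwise definition: the AUC is the fraction of (dropout, non-dropout) pairs in which the dropout eID receives the higher predicted value, with ties counted as one half. The function name and the toy data are illustrative only.

```python
def roc_auc(labels, scores):
    """AUC as the fraction of correctly ordered (positive, negative) pairs.

    labels: 1 for dropout, 0 for non-dropout.
    scores: predicted dropout probability for each eID.
    This O(n^2) version is for illustration, not for large data sets.
    """
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("AUC needs at least one positive and one negative")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


print(roc_auc([1, 1, 0, 0], [0.9, 0.4, 0.5, 0.1]))
```

A non-dropout eID with a high predicted value, like the “lower right” eIDs, loses many of these pairwise comparisons and so lowers the AUC.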
We created features with three kinds of methods:
1. Aggregating “Cross-Course” logs
2. Using the idea of a recommendation system
3. Using time-series prediction
Our Strategy & Features
Our Strategy
Creating features which do NOT depend on the number of logs
Idea: About 1/3 of users attend multiple courses.
①Aggregating “Cross-Course” logs
active course count   1      2      3     4     5     ...  28  29  30  31  39
number of users       73509  20251  8237  4118  2277  ...  1   1   1   2   3

All users: 112448
Users attending multiple courses: 38939
It is effective to create features from the logs of not only the target course but also the user's other active courses.
How to create features: A
Some users may have enrolled in multiple courses at once and attended them one by one.
①Aggregating “Cross-Course” logs
[Figure: timeline of User a's logs in Course A, Course B, and Course C, with the target prediction interval of Course A marked “?”. There is only 1 log at Course A, although there are many logs at Course B and Course C. There are few logs in the period, so there is a high probability of attending Course A in this period.]
Count the number of logs, unique days, or unique courses in which logs exist, using a moving time window (window size: 5 days).
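The moving-window counting described above could be sketched as follows. The input layout (one `(date, course_id)` pair per log, covering all of a user's courses), the one-day step, and the function names are assumptions for illustration. Only the 5-day window size comes from the slide.

```python
from datetime import date, timedelta


def window_features(logs, window_start, window_days=5):
    """Counts for one user inside [window_start, window_start + window_days).

    logs: list of (date, course_id) pairs across ALL of the user's courses.
    """
    window_end = window_start + timedelta(days=window_days)
    in_window = [(d, c) for d, c in logs if window_start <= d < window_end]
    return {
        "n_logs": len(in_window),
        "n_unique_days": len({d for d, _ in in_window}),
        "n_unique_courses": len({c for _, c in in_window}),
    }


def moving_window_features(logs, first_day, last_day, window_days=5):
    """Slide the window one day at a time over [first_day, last_day]."""
    rows = []
    start = first_day
    while start + timedelta(days=window_days) <= last_day + timedelta(days=1):
        rows.append((start, window_features(logs, start, window_days)))
        start += timedelta(days=1)
    return rows


user_logs = [
    (date(2014, 6, 1), "B"),
    (date(2014, 6, 1), "C"),
    (date(2014, 6, 3), "B"),
    (date(2014, 6, 8), "A"),
]
for start, feats in moving_window_features(user_logs, date(2014, 6, 1), date(2014, 6, 8)):
    print(start, feats)
```

Because the counts cover every course the user is active in, they remain informative even when the target course itself has only one log.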
How to create features: B
If there is some relationship between the target course and another course, logs of the target course may exist near logs of the other course.
①Aggregating “Cross-Course” logs
[Figure: timeline of User a's logs in Course B and Course A, with the target prediction interval of Course A marked “?”. Courses A and B are related courses: there is a high probability that logs of Course A exist near logs of Course B.]
Two steps:
1. Build a matrix of the interrelationships among all courses, where each entry is the transition probability from a log of one course to a log of another course.
2. For the target course, calculate the sum of products of each course's logs in the prediction interval and that course's interrelationship with the target course.
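One possible reading of the two steps is sketched below. The slides do not define the transition probability precisely, so this example assumes it is estimated from consecutive logs in each user's chronological sequence of course IDs, with each row of the matrix normalized to sum to 1. All names are hypothetical.

```python
from collections import Counter, defaultdict


def transition_matrix(course_sequences):
    """Step 1: estimate P(next log is in course j | current log is in course i).

    course_sequences: one chronological list of course IDs per user.
    Consecutive logs in the same course count as self-transitions.
    """
    counts = defaultdict(Counter)
    for seq in course_sequences:
        for src, dst in zip(seq, seq[1:]):
            counts[src][dst] += 1
    return {
        src: {dst: n / sum(row.values()) for dst, n in row.items()}
        for src, row in counts.items()
    }


def relational_feature(log_counts, target, matrix):
    """Step 2: sum of (logs in other course) x (transition prob. to target).

    log_counts: {course_id: number of the user's logs in the prediction interval}.
    """
    return sum(
        n * matrix.get(course, {}).get(target, 0.0)
        for course, n in log_counts.items()
        if course != target
    )


sequences = [["B", "A", "B", "A"], ["B", "C", "B", "A"]]
matrix = transition_matrix(sequences)
print(matrix["B"])
print(relational_feature({"B": 4, "C": 2}, "A", matrix))
```

In this toy data three of the four transitions out of Course B lead to Course A, so logs in Course B contribute strongly to the feature for Course A, while logs in Course C contribute nothing.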
Idea: We want to create features WITHOUT using the user's own logs.
→ Other users whose course enrollment patterns are similar to the user's are useful.
How to create features: Use collaborative filtering, which is often used in recommendation systems on e-commerce sites and search engines.
②Using Idea of Recommendation System
Collaborative filtering:
1. Calculate the similarity between the user and every other user by comparing their active-course patterns.
2. Estimate the user's value as a weighted average of the values of other users whose similarity is higher than a threshold.
②Using Idea of Recommendation System
Enrollment pattern (○ means “enrolled”, × means “not enrolled”)

User   Course: 1   2   3   4   5   Similarity to User A
A              ○   ×   ○   ×   ×   -
B              ○   ○   ×   ×   ×   0.7
C              ×   ○   ×   ×   ×   0.2
D              ○   ×   ○   ○   ×   0.8

Feature value (in this case, the number of logs)

User   Course: 1     2    3    4    5
A              1     ×    20   ×    ×
B              130   50   ×    ×    ×
C              ×     30   ×    ×    ×
D              50    ×    40   70   ×

Estimate for User A in Course 1: (130×0.7 + 50×0.8) / (0.7 + 0.8) ≈ 87
↓
The user may continue to attend this course.
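The weighted average in this example can be reproduced with a few lines of code. In the sketch below, the similarities and log counts are taken from the tables above. The threshold of 0.5 and the function name are assumptions, and the similarity measure itself is not specified on the slides, so the similarities are treated as given inputs.

```python
def cf_estimate(values, similarities, threshold=0.5):
    """Similarity-weighted average of other users' feature values.

    values: {user: feature value in the target course, or None if not enrolled}.
    similarities: {user: similarity to the user being estimated}.
    Only users above the (assumed) threshold who have a value are used.
    """
    numerator = 0.0
    denominator = 0.0
    for user, value in values.items():
        sim = similarities.get(user, 0.0)
        if value is not None and sim > threshold:
            numerator += sim * value
            denominator += sim
    return numerator / denominator if denominator else None


# Course 1 from the tables: number of logs of the users other than User A.
course1_logs = {"B": 130, "C": None, "D": 50}
similarity_to_a = {"B": 0.7, "C": 0.2, "D": 0.8}
print(round(cf_estimate(course1_logs, similarity_to_a)))
```

User A has only 1 log in Course 1, but the similar users B and D have many, which is why the estimate suggests that User A may continue to attend.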
Idea: Is there a consistent trend in the number of unique users who attend the courses on specific days?
If we know the number of unique users who attend the course in the dropout judgment period, and the order of users by how likely they are to attend the course in that period, we can see the boundary between dropout users and non-dropout users.
How to create features:
Use ARIMA, which is often used in financial prediction and telecommunication traffic prediction.
Predict the number of unique users in the judgment period from the transition of unique users in each specific time window (10 days).
Rank users according to the most useful feature values in the previous dropout prediction system.
③Using Time-Series Prediction
③Using Time-Series Prediction
[Figure: transitions of unique users (time window: 10 days). Legend: prediction using ARIMA; actual value. Annotation: predicted number of unique users who attend the course in days 31~40.]

User ranking according to a specific feature value (for example, the number of logs), normalized by the predicted number of unique users:

Username   Ranking of the feature value   Normalized value
a          1                              0.001
b          1000                           1
c          2000                           2
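The normalization in the table can be sketched as dividing each user's rank by the predicted number of unique users. The example values are consistent with a predicted count of 1000 (1/1000 = 0.001, 1000/1000 = 1, 2000/1000 = 2), but that count is inferred here rather than stated. The ARIMA forecast itself is not implemented below and is simply passed in as a number. Ranking in descending order of the feature value and the function name are also assumptions.

```python
def normalized_ranks(feature_values, predicted_unique_users):
    """Rank users by a feature value and divide by a forecast user count.

    feature_values: {user: value}, e.g. the number of logs (higher = rank 1).
    predicted_unique_users: forecast of unique users in the judgment period,
        e.g. from an ARIMA model (not implemented in this sketch).
    A normalized value of at most 1 places the user inside the predicted
    attending population; a value above 1 places the user outside it.
    """
    ranked = sorted(feature_values, key=feature_values.get, reverse=True)
    return {
        user: rank / predicted_unique_users
        for rank, user in enumerate(ranked, start=1)
    }


logs_per_user = {"a": 10, "b": 5, "c": 1}
print(normalized_ranks(logs_per_user, predicted_unique_users=2))
```

Under this reading, the value 1 marks the boundary between users expected to attend in the judgment period and users expected to drop out.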
Results
Predictions for the “lower right” eIDs are improved.
[Figure: predicted value distributions for dropout eIDs and non-dropout eIDs, shown in two panels.]
The final AUC becomes 0.90756.
Results
Final private score is 0.90597
In this competition, we did not use some features created from the ground-truth labels of the training set, because we were afraid of over-fitting to the training set. That may have restricted more flexible ideas, and may be why we finished no higher than 6th place.
Creating a wide variety of useful features was important. Of course, the choice of three kinds of models (XGBoost, Regularized Greedy Forest, and bagged deep learning) was also important, so we really appreciate the authors of the models and libraries we used.
Miscellaneous
Thanks for your attention.