Post on 01-Jan-2016

Approach to Generate a Vast Variety of Features for Predicting Dropout in MOOC

Takuya Akiyama, Kei Yonekawa, Aakansh Gupta, Nuo Zhang, Shigeki Muramatsu, Rui Kimura, Nobuyuki Maita, Yujin Tang, Takafumi Watanabe, Akihiro Kobayashi, Kazunori Matsumoto, and Keiichi Kuroyanagi

The Final Results

2Copyright © 2015 KDDI R&D Labs. Inc. All Rights Reserved

“KDDILABS&Keiku” Members

Account Name    Full Name              Affiliations
t.MF            Akiyama, Takuya        KDDI R&D Laboratories, Inc.
Aakansh         Gupta, Aakansh         Uhuru Corporation
NoahZh          Zhang, Nuo             KDDI R&D Laboratories, Inc.
kyone           Yonekawa, Kei          KDDI R&D Laboratories, Inc.
mz-matsumoto    Matsumoto, Kazunori    KDDI R&D Laboratories, Inc.
mura            Muramatsu, Shigeki     KDDI R&D Laboratories, Inc.
ruik            Kimura, Rui            KDDI R&D Laboratories, Inc.
no6est          Maita, Nobuyuki        KDDI R&D Laboratories, Inc.
Yujin           Tang, Yujin            KDDI R&D Laboratories, Inc.
Keiku           Kuroyanagi, Keiichi    Financial Engineering Group, Inc.
TakWat          Watanabe, Takafumi     KDDI R&D Laboratories, Inc.
apf-koba        Kobayashi, Akihiro     KDDI R&D Laboratories, Inc.

[Photo: Working at KDDILABS office]

What is “KDDI”?

Knowledge Discovery and Data mining Institute …?

NO!

KDDI is a Japanese telecommunications company. “KDDI” is an acronym standing for Japanese words.

There is no relation between KDD2015 and KDDI. We did NOT cheat.

System Overview

[Diagram: Original Data → Feature (×2000) → XGBoost†, Regularized Greedy Forest, and Deep Neural Network (bagging of 200 models) → Blend → Submit Data]

Our special twist is “strategic” feature engineering

† http://dmlc.github.io/
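
The slides do not say how the bagged models or the final blend were combined. A minimal sketch, assuming plain averaging for the bagging step and a weighted average for the blend; the function names, model keys, and weights are hypothetical and not taken from the slides:

```python
def bag(predictions_per_model):
    """Average per-sample probabilities over bagged models (assumed scheme).

    predictions_per_model: list of equal-length lists, one list per model.
    """
    n_models = len(predictions_per_model)
    return [sum(column) / n_models for column in zip(*predictions_per_model)]


def blend(model_predictions, weights):
    """Weighted average of per-sample probabilities from several model families.

    model_predictions: dict mapping a model name to its list of probabilities.
    weights: dict mapping the same model names to non-negative weights
             (hypothetical values; the slides do not give them).
    """
    total_weight = sum(weights[name] for name in model_predictions)
    n_samples = len(next(iter(model_predictions.values())))
    return [
        sum(weights[name] * preds[i] for name, preds in model_predictions.items())
        / total_weight
        for i in range(n_samples)
    ]


# Usage with made-up numbers: two bagged networks, then a three-way blend.
dnn = bag([[0.2, 0.8], [0.4, 0.6]])
submit = blend(
    {"xgboost": [0.9, 0.1], "rgf": [0.7, 0.3], "dnn": dnn},
    {"xgboost": 2.0, "rgf": 1.0, "dnn": 1.0},
)
```

Averaging is only one possible way to blend; a rank average or a stacked model would fit the diagram equally well.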

At First...

Each member started KDD Cup separately.

Each member created features separately.

In late June, we merged into one team.

The number of features: over 1500 (“Basic Features”)

Examples of Basic 1500 Features

All features have variations with respect to labels such as time window, category, event, source, or their combination.

[Figure: access logs of an eID on a time axis leading up to the target prediction interval. Example features: the number of logs of the eID, the number of lags, the time to the target prediction interval, and counted-up times between logs (labels such as 5s, 50s, 1m 15s, 3m, 4m 30s).]
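
As a rough sketch of how such count features could be generated per eID: the log layout `(timestamp, event, source)`, the window sizes, and the feature names below are assumptions for illustration, not the team's actual definitions.

```python
from collections import Counter
from datetime import datetime, timedelta


def basic_features(logs, interval_start, window_days=(1, 7, 30)):
    """Count logs of one eID in several look-back windows, split by event and source.

    logs: list of (timestamp, event, source) tuples (assumed layout).
    interval_start: start of the target prediction interval.
    window_days: hypothetical look-back window sizes in days.
    """
    counts = Counter()
    past = [log for log in logs if log[0] < interval_start]
    for timestamp, event, source in past:
        age = interval_start - timestamp
        for days in window_days:
            if age <= timedelta(days=days):
                counts[f"logs_{days}d"] += 1
                counts[f"logs_{days}d_event={event}"] += 1
                counts[f"logs_{days}d_source={source}"] += 1
    features = dict(counts)
    if past:
        last_timestamp = max(timestamp for timestamp, _, _ in past)
        # Time from the last log to the target prediction interval, in seconds.
        features["seconds_to_interval"] = (interval_start - last_timestamp).total_seconds()
    return features


# Usage with made-up logs.
example = basic_features(
    [
        (datetime(2014, 6, 29, 12), "video", "browser"),
        (datetime(2014, 6, 25), "problem", "server"),
        (datetime(2014, 6, 1), "video", "browser"),
    ],
    interval_start=datetime(2014, 6, 30),
)
```

Crossing every window with every event and source is what makes the feature count grow quickly, which matches the "variation with respect to labels" idea on this slide.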

ROC Curve & Predicted Value Distribution

[Figure: two panels. ROC curve (x-axis: False Positive Rate, y-axis: True Positive Rate) and density of predicted values (x-axis: Predicted Value, y-axis: Density) for dropout eIDs and non-dropout eIDs. Prediction of training data by XGBoost with 10-fold cross validation.]

Why can’t we predict the “lower right” eIDs accurately?

The “lower right” eIDs do not have enough logs. In some cases there is only 1 log, but they did not drop out of the course.

Because there are so few logs, it is hard to predict their dropout probability with the basic features.

Our Strategy & Features

Our Strategy

Creating features which do NOT depend on the number of logs

We created the features by 3 kinds of methods:

1. Aggregating “Cross-Course” logs

2. Using Idea of Recommendation System

3. Using Time-Series Prediction

①Aggregating “Cross-Course” logs

Idea: About 1/3 of users attend multiple courses.

Active course count    1       2       3      4      5      ...   28   29   30   31   39
Number of users        73509   20251   8237   4118   2277   ...   1    1    1    2    3

All users: 112448

Users attending multiple courses: 38939

It is effective to create features from the logs of not only the target course but also the user's other active courses.

①Aggregating “Cross-Course” logs

How to create features: A

Maybe some users enrolled in multiple courses at once and attended the courses one by one.

[Figure: timeline of User a's logs in Course A, Course B, and Course C, with the target prediction interval of Course A marked “?”. Annotations: there is only 1 log in Course A, although there are many logs in Course B and Course C. There are few logs in the period. There is a high probability of attending Course A in this period.]

Count the number of logs, unique days, or unique courses in which logs exist, using a moving time window (window size: 5 days).
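
A minimal sketch of this moving-window counting, assuming each user's logs are available as `(day, course_id)` pairs across all courses and that the window moves one day at a time (the step size is not stated on the slide; all names are illustrative):

```python
from datetime import date, timedelta


def window_features(logs, window_start, window_days=5):
    """Cross-course counts for one user inside [window_start, window_start + window_days).

    logs: iterable of (day, course_id) pairs for one user over ALL courses.
    """
    window_end = window_start + timedelta(days=window_days)
    in_window = [(day, course) for day, course in logs if window_start <= day < window_end]
    return {
        "n_logs": len(in_window),
        "n_unique_days": len({day for day, _ in in_window}),
        "n_unique_courses": len({course for _, course in in_window}),
    }


def moving_window_features(logs, first_start, n_windows, window_days=5):
    """Slide the window forward one day at a time (assumed step) and collect features."""
    return [
        window_features(logs, first_start + timedelta(days=offset), window_days)
        for offset in range(n_windows)
    ]


# Usage with made-up logs: many logs in courses B and C, a single log in course A.
user_logs = [
    (date(2014, 6, 1), "B"),
    (date(2014, 6, 1), "B"),
    (date(2014, 6, 2), "C"),
    (date(2014, 6, 7), "A"),
]
features = moving_window_features(user_logs, date(2014, 6, 1), n_windows=3)
```

Because the counts run over all of the user's courses, they stay informative even when the target course itself has only one log.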

①Aggregating “Cross-Course” logs

How to create features: B

If there is some relationship between a target course and another course, logs of the target course may exist near logs of the other course.

[Figure: timeline of User a's logs in Course B and Course A, with the target prediction interval of Course A marked “?”. Annotation: for related courses, there is a high probability that logs of Course A exist near logs of Course B.]

Two Steps:

1. Make a matrix of the interrelationships of all courses, where each entry is the transition probability from a log of one course to a log of another course.

2. Calculate the sum of products of the logs of each course and that course's interrelationship to the target course, over the prediction interval.
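
The two steps can be sketched as follows. This assumes that a "transition" is a pair of consecutive logs in one user's time-ordered log sequence and that "logs" in step 2 means per-course log counts in the prediction interval; both are one reading of the slide, and the function names are hypothetical.

```python
from collections import Counter, defaultdict


def transition_matrix(user_sequences):
    """Step 1: row-normalized transition probabilities between courses.

    user_sequences: list of per-user lists of course ids, ordered by log time.
    Returns a nested dict: matrix[src][dst] = P(next log is in dst | log in src).
    """
    counts = defaultdict(Counter)
    for sequence in user_sequences:
        for src, dst in zip(sequence, sequence[1:]):
            counts[src][dst] += 1
    matrix = {}
    for src, row in counts.items():
        total = sum(row.values())
        matrix[src] = {dst: n / total for dst, n in row.items()}
    return matrix


def cross_course_score(log_counts, matrix, target):
    """Step 2: sum of (log count of each other course) x (its transition probability to target).

    log_counts: dict mapping course id to the user's log count in the prediction interval.
    """
    return sum(
        n * matrix.get(course, {}).get(target, 0.0)
        for course, n in log_counts.items()
        if course != target
    )


# Usage with made-up sequences.
matrix = transition_matrix([["B", "A", "B", "A"], ["B", "B", "C"]])
score = cross_course_score({"B": 4, "C": 2, "A": 1}, matrix, target="A")
```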

②Using Idea of Recommendation System

Idea: We want to create features without using logs.

→ Other users who enrolled in a course pattern similar to the user's are useful.

How to create features: Create features by Collaborative Filtering, which is often used for recommendation systems in e-commerce sites or search engines.

②Using Idea of Recommendation System

Collaborative Filtering:

1. Calculate similarities between the user and other users by comparing the active course patterns of each user.

2. Calculate a plausible value as a weighted average of the values of other users whose similarities are higher than a threshold.

Enrollment pattern (○ means “enrolled”, × means “not enrolled”)

User \ Course    1    2    3    4    5    Similarity to User A
A                ○    ×    ○    ×    ×    -
B                ○    ○    ×    ×    ×    0.7
C                ×    ○    ×    ×    ×    0.2
D                ○    ×    ○    ○    ×    0.8

Feature value (in this case, the number of logs)

User \ Course    1      2     3     4     5
A                1      ×     20    ×     ×
B                130    50    ×     ×     ×
C                ×      30    ×     ×     ×
D                50     ×     40    70    ×

Estimated value for User A in Course 1, from Users B and D:

(130×0.7+50×0.8)/(0.7+0.8)≈87

↓

The user may continue to attend this course.
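
The weighted average in this example can be reproduced with a short sketch. The similarities are taken as given from the table, because the slide does not state which similarity measure was used; the threshold of 0.5 and the function name are assumptions.

```python
def cf_estimate(target_course, values, similarities, threshold=0.5):
    """Similarity-weighted average of other users' feature values for one course.

    values: dict user -> dict course -> feature value (here: number of logs).
    similarities: dict user -> similarity to the target user (taken as given).
    threshold: hypothetical cutoff; only users with similarity above it are used.
    Returns None if no sufficiently similar user has a value for the course.
    """
    numerator = 0.0
    denominator = 0.0
    for user, similarity in similarities.items():
        value = values.get(user, {}).get(target_course)
        if value is None or similarity <= threshold:
            continue
        numerator += similarity * value
        denominator += similarity
    return numerator / denominator if denominator > 0 else None


# The numbers from the tables on this slide (User A, Course 1).
values = {
    "B": {1: 130, 2: 50},
    "C": {2: 30},
    "D": {1: 50, 3: 40, 4: 70},
}
similarities = {"B": 0.7, "C": 0.2, "D": 0.8}
estimate = cf_estimate(1, values, similarities)  # (130*0.7 + 50*0.8) / (0.7 + 0.8)
```

User A has only 1 log in Course 1, but the estimate borrowed from similar users is far larger, which is the signal this feature is meant to carry.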

③Using Time-Series Prediction

Idea:

Is there a consistent trend in the number of unique users who attend the courses on specific days?

If we know the number of unique users who attend the courses in the dropout judgment period, and an ordering of users by how likely they are to attend the courses in that period, we can see the boundary between dropout users and non-dropout users.

How to create features:

Use ARIMA, which is often used in financial prediction or telecommunication traffic prediction.

Predict the number of unique users in the judgment period from the transition of unique users in each specific time window (10 days).

Rank users according to the most useful feature values in the previous dropout prediction system.

③Using Time-Series Prediction

[Figure: transitions of unique users (time window: 10 days), with the series “Prediction Using ARIMA” and “Actual Value”. Annotation: predicted number of unique users who attend the course in days 31~40.]

User ranking according to specific feature values (for example, the number of logs) and normalization by the predicted number of unique users:

Username    Ranking of the feature value    Normalized value
a           1                               0.001
b           1000                            1
c           2000                            2
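
A small sketch of the ranking and normalization step. The ARIMA forecast is treated here as a given number; the table's values (rank 1000 maps to 1) are consistent with a forecast of 1000 unique users. Ranking by descending feature value and the function name are assumptions.

```python
def normalized_ranks(feature_values, predicted_unique_users):
    """Rank users by a feature value (largest first) and divide by the forecast.

    feature_values: dict user -> feature value (for example, the number of logs).
    predicted_unique_users: forecast number of unique users in the judgment
        period (assumed to come from a separate time-series model such as ARIMA).
    A normalized value of at most 1 places the user inside the predicted
    number of attendees.
    """
    ordered = sorted(feature_values, key=feature_values.get, reverse=True)
    return {
        user: rank / predicted_unique_users
        for rank, user in enumerate(ordered, start=1)
    }


# Usage with made-up values: the forecast says 2 users will attend.
ranks = normalized_ranks({"a": 50, "b": 20, "c": 5, "d": 1}, predicted_unique_users=2)
```

With these inputs, users a and b fall at or below 1 and users c and d above it, which is the dropout boundary the previous slide describes.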

Results

Prediction of the “lower right” eIDs is improved.

[Figure: predicted value distribution, two panels, each showing dropout eIDs and non-dropout eIDs.]

Results

Final AUC becomes 0.90756

Final private score is 0.90597

Miscellaneous

In this competition, we did not use some features created from the truth data of the training set because we were afraid of over-fitting to the training set. Maybe this restricted more flexible ideas and was why we got no better than 6th rank.

Creating a wide variety of useful features was important. Of course, the choice of three kinds of models (XGBoost, Regularized Greedy Forest, and Bagging Deep Learning) was also important, so we really appreciate the authors of the models and libraries we used.

Thanks for your attention.