Feature Engineering Studio January 21, 2015

Upload: jason-pilgram

Post on 12-Dec-2015


Page 1: Feature Engineering Studio January 21, 2015. Welcome to Feature Engineering Studio Design studio-style course teaching how to distill and engineer features

Feature Engineering Studio

January 21, 2015


Welcome to Feature Engineering Studio

• Design studio-style course teaching how to distill and engineer features for data mining

What We’ll Cover

• The process of feature engineering and distillation
  – brainstorming features
  – deciding what features to create
  – criteria for selecting features
  – actually creating the features
  – studying the impact of features on model goodness

Why?

• Feature engineering is the most important, and least well-studied, part of the process of developing prediction models
• It is an art, it is human-driven design
• It involves lore rather than well-known and validated principles
• It is hard! (But fun, and important)


Why?

• It’s well known in data mining (and statistics for that matter)

• That your model will never be any good if your features (predictors) aren’t very good


The Big Idea

• How can we take the voluminous, ill-formed, and yet under-specified data that we now have in education

• And shape it into a reasonable set of variables

• In an efficient, effective, and predictive way?

Tools We’ll Use

• Excel
• Google Refine
• RapidMiner
• Other relevant tools (TBD/your choice)

Course times

• Monday 11am-12:40pm
• Wednesday 11am-12:40pm

• Not every week; please see online schedule

Course Prerequisite

• Core Methods in Educational Data Mining
• Or instructor approval
• I will approve anyone who has at least a little bit of background building prediction models or similar statistical models
  – Talk to me after class, during my office hours, or by appointment


That said…

• If you haven’t had experience building prediction models in RapidMiner or a similar tool, then you’ll need to learn

• We will have a few special lab sessions to help you catch up if you don’t have experience with this paradigm or tools

• You can definitely catch up

Who here?

• Took or audited my Core Methods course?
• Has built a prediction model using a classification algorithm and cross-validation?
• Has built a regression model in a stats package using stepwise regression?
• Has run a regression in a stats package?
• Has built any kind of mathematical model?

How this class works

• Lots of assignments (13)
  – They can’t be late, because we will discuss them in class
  – 3 of 12 regular assignments can be missed without penalty, but not the final presentation (#13)
  – Important note: You cannot do extra assignments and take the best grades. Only the first 9 assignments turned in will be graded.
• Not many required readings
• Essential to participate in critique and class discussions


Who here?

• Has had a design studio style course before?

This is not…

• A lecture class
• A reading discussion seminar


This is…

• A class where you will be working on a project of your own choosing the whole semester

• A class where you’ll get, and give, a lot of constructive criticism

The semester project

• You will build a prediction model
• If you have your own data set, and research question – perfect!
• If you don’t have your own data set, and research question – no worries! I will help you find one!

Two types of classes

• Regular sessions
  – Discuss readings, work on projects
• Lab sessions
  – Extra practice with tools
  – Lecture on concepts beyond regular class topics
    • Including core content from HUDK4050 needed for this class
    • Not a substitute for HUDK4050; we’ll be covering about 5% of HUDK4050 in these sessions

Assignments

• Let’s look at the syllabus


Readings

• Will be made available very soon


Any questions?

Upcoming Classes

• 1/26 Lab session on data set finding
  – Come to this if you don’t have a data set in mind
• 2/2 Problem proposal (Asgn. 1 due)
• 2/4 Data cleaning (Asgn. 2 due)
• 2/16 Lab session on RapidMiner
  – Come to this if you’ve never built a classifier or regressor in RapidMiner (or a similar tool)
  – Statistical significance tests using linear regression don’t count…
• 2/23 Feature distillation in Excel (Asgn. 3 due)

Assignment One

• Problem Proposal
  – Due Monday, February 2
• Be ready to talk for 5 minutes on:
  – A data set
    • Give where it came from and how big it is
    • You need to already have this data set, or be able to acquire it in the next two weeks
  – A prediction model you will build in this data set
  – What variable will you predict?
  – What kind of variables will you use to predict it?
  – Why is this worth doing?

Example (Pardos et al., 2014)

• Data set
  – ASSISTments system, formative assessment and learning software for math used by 60k students a year (Razzaq et al., 2007)
  – 810,000 data points from 229 students studied
  – Student actions in the software have been overlaid with synchronized field observations of student affect (boredom, frustration, etc.)
    • 3075 field observations
    • Each field observation connects to 20 seconds of log file actions

Example (Pardos et al., 2014)

• We will predict whether a student is bored at a specific time
  – So that we can replicate the human judgments without needing a field observer
• We will predict this from what was going on in the log files at the time the field observation was made
  – We know every student action’s correctness, timing, relevant skill, and probability they knew the skill
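To make the idea of distilling features from a log-file clip concrete, here is a minimal Python sketch. It is an illustration under assumptions, not the feature set or code used by Pardos et al.: the `Action` fields, the `clip_features` helper, the four aggregate features, and the choice to take the 20 seconds ending at the observation time are all hypothetical, and the sample log is made-up toy data.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class Action:
    """One logged student action (field names are hypothetical)."""
    student: str
    time: float      # seconds since session start (assumed unit)
    correct: bool    # was the action correct?
    duration: float  # seconds spent on the action
    p_know: float    # estimated probability the student knew the skill


def clip_features(actions, student, obs_time, window=20.0):
    """Aggregate one student's actions in the `window` seconds up to obs_time.

    Assumption: the clip ends at the observation time; the slides only say
    each observation connects to 20 seconds of log actions.
    Returns None when the clip contains no actions.
    """
    clip = [a for a in actions
            if a.student == student
            and obs_time - window <= a.time <= obs_time]
    if not clip:
        return None
    return {
        "n_actions": len(clip),
        "frac_correct": sum(a.correct for a in clip) / len(clip),
        "mean_duration": mean(a.duration for a in clip),
        "min_p_know": min(a.p_know for a in clip),
    }


# Toy data, invented for illustration only.
log = [
    Action("s1", 5.0, True, 4.0, 0.80),
    Action("s1", 30.0, False, 9.0, 0.35),
    Action("s1", 42.0, True, 3.0, 0.50),
    Action("s2", 40.0, True, 2.0, 0.90),
]

# Features for a field observation of student s1 made at t = 45 seconds.
features = clip_features(log, "s1", obs_time=45.0)
print(features)
```

Each field observation would then become one row: the feature dictionary plus the observer's label (bored or not), which is the form a classifier expects.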


Example (Pardos et al., 2014)

• This is worth doing because boredom is known to predict student learning (Craig et al., 2004; Rodrigo et al., 2009; Pekrun et al., 2010)

• And building a detector will help us study boredom more thoroughly

• As well as enabling us to intervene on boredom in real time


Important Considerations

• Is the problem genuinely important? (usable or publishable)

• Is there a good measure of ground truth? (the variable you want to predict)

• Do we have rich enough data to distill meaningful features?

• Is there enough data to be able to take advantage of data mining?

You don’t need to be able to answer these questions in a week

• Think about them
• Think about your problem
• Email me or come to my office hours (or set up an appointment)
• Bring it to class
• We’ll discuss it in class

• No idea is perfect right from the start!


Be ready to answer questions

• Be ready to ask questions too…


No data ready at hand?

• Come to next Monday’s session, we will find you data!


Any questions or concerns?