# bayesian knowledge tracing prediction models

Post on 12-Jan-2017



Bayesian Knowledge Tracing: Prediction Models

Bayesian Knowledge Tracing

Goal: Infer the latent construct. Does a student know skill X?

From their pattern of correct and incorrect responses on problems or problem steps involving skill X

Enabling: Prediction of future correctness within the educational software

Prediction of future correctness outside the educational software, e.g. on a post-test

Assumptions: Student behavior can be assessed as correct or not correct.

Each problem step/problem is associated with one skill/knowledge component, and this mapping is defined reasonably accurately (though extensions such as Contextual Guess and Slip may be robust to violation of this constraint).

Multiple skills on one step: There are alternate approaches which can handle this (cf. Conati, Gertner, & VanLehn, 2002; Ayers & Junker, 2006; Pardos, Beck, Ruiz, & Heffernan, 2008)

Bayesian Knowledge-Tracing is simpler (and should produce comparable performance) when there is one primary skill per step

Goal: For each knowledge component (KC), infer the student's knowledge state from performance.

Suppose a student has six opportunities to apply a KC and makes the following sequence of correct (1) and incorrect (0) responses: 0 0 1 0 1 1. Has the student learned the rule?

Model Learning Assumptions: Two-state learning model. Each skill is either learned or unlearned.

In problem-solving, the student can learn a skill at each opportunity to apply the skill

A student does not forget a skill once he or she knows it (forgetting is studied in Pavlik's models)

Only one skill per action

Addressing Noise and Error: If the student knows a skill, there is still some chance the student will slip and make a mistake.

If the student does not know a skill, there is still some chance the student will guess correctly.

Corbett and Anderson's Model

Two Learning Parameters: p(L0), the probability the skill is already known before the first opportunity to use the skill in problem solving; and p(T), the probability the skill will be learned at each opportunity to use the skill.

Two Performance Parameters: p(G), the probability the student will guess correctly if the skill is not known; and p(S), the probability the student will slip (make a mistake) if the skill is known.

[State diagram: the "Not learned" state transitions to "Learned" with probability p(T); a correct response occurs with probability p(G) from "Not learned" and 1-p(S) from "Learned"; the initial probability of "Learned" is p(L0).]

Bayesian Knowledge Tracing: Whenever the student has an opportunity to use a skill, the probability that the student knows the skill is updated using formulas derived from Bayes' Theorem.

Formulas
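As a minimal sketch of these formulas, here is the standard Corbett & Anderson update written in Python. The function name and the parameter values below are illustrative, not taken from the slides:

```python
def bkt_update(p_know, correct, p_guess, p_slip, p_transit):
    """One BKT step: Bayes-update P(L_n) on the observed response,
    then apply the chance of learning at this opportunity."""
    if correct:
        # P(L_n | correct) = P(L_n)(1-P(S)) / [P(L_n)(1-P(S)) + (1-P(L_n))P(G)]
        num = p_know * (1 - p_slip)
        p_given_obs = num / (num + (1 - p_know) * p_guess)
    else:
        # P(L_n | incorrect) = P(L_n)P(S) / [P(L_n)P(S) + (1-P(L_n))(1-P(G))]
        num = p_know * p_slip
        p_given_obs = num / (num + (1 - p_know) * (1 - p_guess))
    # P(L_{n+1}) = P(L_n | obs) + (1 - P(L_n | obs)) * P(T)
    return p_given_obs + (1 - p_given_obs) * p_transit

# Trace the earlier example sequence 0 0 1 0 1 1 with illustrative parameters
p = 0.30  # p(L0)
for obs in [0, 0, 1, 0, 1, 1]:
    p = bkt_update(p, obs, p_guess=0.20, p_slip=0.10, p_transit=0.15)
```

A correct response pushes the knowledge estimate up and an incorrect one pushes it down, while p(T) adds a fixed chance of learning at every opportunity.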

Questions? Comments?

Knowledge Tracing: How do we know if a knowledge tracing model is any good?

Our primary goal is to predict knowledge. But knowledge is a latent trait.

We can check our knowledge predictions by checking how well the model predicts performance.

Fitting a Knowledge-Tracing Model: In principle, any set of four parameters can be used by knowledge tracing.

But parameters that predict student performance better are preferred

Knowledge Tracing: So, we pick the knowledge tracing parameters that best predict performance.

Defined as whether a student's action will be correct or wrong at a given time

Effectively a classifier (which we'll talk about in a few minutes)

Questions? Comments?

Recent Extensions: Recently, there has been work towards contextualizing the guess and slip parameters (Baker, Corbett, & Aleven, 2008a, 2008b)

Do we really think the chance that an incorrect response was a slip is equal when: (a) the student has never gotten the action right, spends 78 seconds thinking, answers, and gets it wrong; versus (b) the student has gotten the action right 3 times in a row, spends 1.2 seconds thinking, answers, and gets it wrong?

The jury's still out: Initial reports showed that Contextual Guess and Slip (CG BKT) predicted performance in the tutor much better than existing approaches to fitting BKT (Baker, Corbett, & Aleven, 2008a, 2008b)

But a new brute-force approach, which tries all possible parameter values for the 4-parameter model, performs equally well as CG BKT (Baker, Corbett, & Gowda, 2010)

The jury's still out: CG BKT predicts post-test performance worse than existing approaches to fitting BKT (Baker, Corbett, Gowda, et al., 2010)

But P(S) predicts the post-test above and beyond BKT (Baker, Corbett, Gowda, et al., 2010)

So there is some way that contextual G and S are useful; we just don't know what it is yet.

Questions? Comments?

Fitting BKT Models: Bayes Net Toolkit - Student Modeling (Expectation Maximization), http://www.cs.cmu.edu/~listen/BNT-SM/

Java code (Grid Search / Brute Force), http://users.wpi.edu/~rsbaker/edmtools.html

Conflicting results as to which is best
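As a rough sketch of the grid-search/brute-force idea (this is not the actual code behind either tool above), every parameter combination on a coarse grid can be scored by how well it predicts each response; the response sequence and grid step here are hypothetical:

```python
import itertools

def run_bkt(responses, p_l0, p_t, p_g, p_s):
    """Sum of squared errors between BKT's predicted P(correct) and the data."""
    p, sse = p_l0, 0.0
    for correct in responses:
        pred = p * (1 - p_s) + (1 - p) * p_g           # P(correct) before the response
        sse += (pred - correct) ** 2
        if correct:                                     # Bayes update on the evidence
            p_obs = p * (1 - p_s) / (p * (1 - p_s) + (1 - p) * p_g)
        else:
            p_obs = p * p_s / (p * p_s + (1 - p) * (1 - p_g))
        p = p_obs + (1 - p_obs) * p_t                   # chance to learn this step
    return sse

def grid_search(responses):
    grid = [i / 10 for i in range(1, 10)]               # 0.1, 0.2, ..., 0.9
    best = min(itertools.product(grid, grid, grid, grid),
               key=lambda params: run_bkt(responses, *params))
    return dict(zip(["p_l0", "p_t", "p_g", "p_s"], best))

params = grid_search([0, 0, 1, 0, 1, 1])  # hypothetical response data
```

Note that nothing here stops the search from returning degenerate values with P(G) or P(S) above 0.5; bounding the search range is how that is usually handled.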

Identifiability: Different models can achieve the same predictive power (Beck & Chang, 2007; Pardos et al., 2010)

Model Degeneracy: Some model parameter values, typically where P(S) or P(G) is greater than 0.5, infer that knowledge leads to poorer performance (Baker, Corbett, & Aleven, 2008)

Bounding: Corbett & Anderson (1995) bounded P(S) and P(G) to maximum values below 0.5 to avoid this.

Example: Kappa -> Chi-squared test

You plug in your 10,000 cases, and you get

Chi-sq(1, N = 10,000) = 3.84, two-tailed p = 0.05

Time to declare victory?

Example: Kappa -> Chi-squared test

You plug in your 10,000 cases, and you get

Chi-sq(1, N = 10,000) = 3.84, two-tailed p = 0.05

No, I did something wrong here

Non-independence of the data: If you have 50 students, it is a violation of the statistical assumptions of the test to act like their 10,000 actions are independent from one another.

For student A, actions 6 and 7 are not independent from one another (actions 6 and 48 aren't independent either).

Why does this matter? Because treating the actions like they are independent is likely to make differences seem more statistically significant than they are.

So what can you do?

Compute a statistical significance test for each student, and then use meta-analysis statistical techniques to aggregate across students (hard to do, but does not violate any statistical assumptions)

I have Java code which does this for A', which I'm glad to share with whoever would like to use it later

Comments? Questions?

Hands-on Activity: At 11:45

Regression

Regression: There is something you want to predict (the "label"). The thing you want to predict is numerical.

Number of hints the student requests. How long the student takes to answer. What the student's test score will be.

Regression: Associated with each label is a set of features, which maybe you can use to predict the label.

| Skill | pknow | time | totalactions | numhints |
|---|---|---|---|---|
| ENTERINGGIVEN | 0.704 | 9 | 1 | 0 |
| ENTERINGGIVEN | 0.502 | 10 | 2 | 0 |
| USEDIFFNUM | 0.049 | 6 | 1 | 3 |
| ENTERINGGIVEN | 0.967 | 7 | 3 | 0 |
| REMOVECOEFF | 0.792 | 16 | 1 | 1 |
| REMOVECOEFF | 0.792 | 13 | 2 | 0 |
| USEDIFFNUM | 0.073 | 5 | 2 | 0 |

Regression: The basic idea of regression is to determine which features, in which combination, can predict the label's value.

| Skill | pknow | time | totalactions | numhints |
|---|---|---|---|---|
| ENTERINGGIVEN | 0.704 | 9 | 1 | 0 |
| ENTERINGGIVEN | 0.502 | 10 | 2 | 0 |
| USEDIFFNUM | 0.049 | 6 | 1 | 3 |
| ENTERINGGIVEN | 0.967 | 7 | 3 | 0 |
| REMOVECOEFF | 0.792 | 16 | 1 | 1 |
| REMOVECOEFF | 0.792 | 13 | 2 | 0 |
| USEDIFFNUM | 0.073 | 5 | 2 | 0 |

Linear Regression: The most classic form of regression is linear regression (alternatives include Poisson regression, neural networks, ...).

Numhints = 0.12*Pknow + 0.932*Time - 0.11*Totalactions

| Skill | pknow | time | totalactions | numhints |
|---|---|---|---|---|
| COMPUTESLOPE | 0.544 | 9 | 1 | ? |
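The "?" can be filled in by plugging the row's feature values into the fitted equation above (coefficients as given on the slide; the negative sign on Totalactions is assumed from the equation's reconstruction):

```python
def predict_numhints(pknow, time, totalactions):
    # numhints = 0.12*pknow + 0.932*time - 0.11*totalactions (slide's coefficients)
    return 0.12 * pknow + 0.932 * time - 0.11 * totalactions

prediction = predict_numhints(0.544, 9, 1)  # about 8.34 hints
```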

Linear Regression: Linear regression only fits linear functions (except when you apply transforms to the input variables, which RapidMiner can do for you).

Linear Regression: However...

It is blazing fast

It is often more accurate than more complex models, particularly once you cross-validate (data mining's dirty little secret)

It is feasible to understand your model (with the caveat that the second feature in your model is in the context of the first feature, and so on)

Example of Caveat: Let's study a classic example: drinking too much prune nog at a party, and having to make an emergency trip to the Little Researchers' Room.

Data

Data: Some people are resistant to the deleterious effects of prunes and can safely enjoy high quantities of prune nog!

Learned Function: Probability of emergency = 0.25 * (drinks of nog in last 3 hours) - 0.018 * (drinks of nog in last 3 hours)²

But does that actually mean that (drinks of nog in last 3 hours)² is associated with fewer emergencies?

No!

Example of Caveat

(Drinks of nog in last 3 hours)² is actually positively correlated with emergencies! (r = 0.59)

[Scatterplot: number of drinks of prune nog (x-axis) vs. number of emergencies (y-axis)]

The relationship is only in the negative direction when (Drinks of nog last 3 hours) is already in the model
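A tiny self-contained demo of this suppression effect, using made-up data generated from the learned function above (no real prune-nog data involved; helper names are illustrative):

```python
def fit_two_features(xs, ys):
    """Least squares for y = a*x + b*x^2 (no intercept): solve the 2x2 normal equations."""
    s11 = sum(x ** 2 for x in xs)
    s12 = sum(x ** 3 for x in xs)
    s22 = sum(x ** 4 for x in xs)
    r1 = sum(x * y for x, y in zip(xs, ys))
    r2 = sum(x ** 2 * y for x, y in zip(xs, ys))
    det = s11 * s22 - s12 * s12
    return (r1 * s22 - r2 * s12) / det, (s11 * r2 - s12 * r1) / det

def corr(us, vs):
    """Pearson correlation."""
    n = len(us)
    mu, mv = sum(us) / n, sum(vs) / n
    cov = sum((u - mu) * (v - mv) for u, v in zip(us, vs))
    return cov / (sum((u - mu) ** 2 for u in us) ** 0.5
                  * sum((v - mv) ** 2 for v in vs) ** 0.5)

xs = [i / 2 for i in range(1, 21)]             # drinks: 0.5 .. 10.0
ys = [0.25 * x - 0.018 * x ** 2 for x in xs]   # the learned function above

a, b = fit_two_features(xs, ys)     # b comes out negative once x is in the model
r = corr([x ** 2 for x in xs], ys)  # yet x^2 alone correlates positively with y
```

The squared term's coefficient is negative only in the context of the linear term; on its own, x² moves with the outcome.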

So be careful when interpreting linear regression models (or almost any other type of model).

Comments? Questions?

Neural Networks: Another popular form of regression is neural networks (also called Multilayer Perceptrons)

This image courtesy of Andrew W. Moore, Google http://www.cs.cmu.edu/~awm/tutorials

Neural Networks: Neural networks can fit more complex functions than linear regression. But it is usually near-to-impossible to understand what the heck is going on inside one.

Soller & Stevens (2007)

In fact: The difficulty of interpreting non-linear models is so well known that New York City put up a road sign about it.

And of course: There are lots of fancy regressors in any data mining package.

SMOReg (support vector machine), Poisson regression, and so on.

How can you tell if a regression model is any good? Correlation is a classic method (or its cousin r²).

What data set should you generally test on?

- The data set you trained your classifier on
- A data set from a different tutor
- Split your data set in half; train on one half, test on the other half
- Split your data set in ten; train on each set of 9 sets, test on the tenth; do this ten times
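The last option above is ten-fold cross-validation. A minimal sketch of splitting n rows into 10 folds (round-robin fold assignment; names are illustrative):

```python
def k_fold_indices(n, k=10):
    """Yield (train, test) index lists; each row appears in exactly one test fold."""
    folds = [list(range(i, n, k)) for i in range(k)]   # round-robin assignment
    for i, test in enumerate(folds):
        train = [j for f in folds[:i] + folds[i + 1:] for j in f]
        yield train, test

splits = list(k_fold_indices(100))  # 10 train/test splits over 100 rows
```

Given the non-independence point earlier, for student data it is usually better to assign whole students to folds rather than individual actions.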

Any differences from classifiers?

What are some stat tests you could use?

What about? Take the correlation between your prediction and your label, and run an F test.

So: F(1, 9998) = 50.00, p