Bayesian Ordinal Peer Grading
Karthik Raman, Thorsten Joachims
Cornell University
Evaluation at Scale: Peer Grading
• MCQs & auto-graded questions: Not a good test of understanding.
  • Limits kinds of courses offered.
• Conventional evaluation:
  • Medium-scale classes (20-200 students): TAs perform grading.
• Scaling to MOOCs (10000+ students)??
• PEER GRADING: Students grade each other.
• Overcomes scaling limitation of TA grading:
  • Number of “graders” scales with number of students!
Ordinal Peer Grading [KDD 2014]
• Students are not trained graders: Need to make feedback process simple!
• Ordinal feedback is easier to provide and more reliable than cardinal feedback:
  • “Project X is better than Project Y” vs. “Project X is a B+”.
• Ordinal Peer Grading: Graders provide an ordering of assignments.
  • Need to infer the overall ordering and grader reliabilities.
[Figure: graders' orderings of assignments (X > Y > ...) are aggregated into an overall ordering with grades A, B, C, ...]
Mallows Ordinal Peer Grading
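• GENERATIVE MODEL: $P(\sigma \mid \sigma^*) \propto \exp(-\delta_K(\sigma^*, \sigma))$, where $\sigma$ is a grader's ordering and $\sigma^*$ is the true ordering.
• $\delta_K$ is the Kendall-Tau distance between orderings:
  • # of pairs ordered differently between the two rankings.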
• Greedy algorithm to find MLE (Maximum-Likelihood estimator).
• Easy to extend: Grader Reliabilities, other aggregation models (Bradley-Terry ..)
• Viable alternative to conventional TA evaluation [KDD ’14]
  • As good as (if not better than) cardinal peer grading.
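The Kendall-Tau distance and the Mallows score it induces can be sketched in a few lines of Python. This is an illustrative sketch, not the authors' implementation: the function names and the `d1`, `d2`, `d3` assignment labels are hypothetical, and graders are assumed to rank possibly different subsets of assignments, so only pairs present in both rankings are compared. The score omits the normalizing constant, which in the Mallows model does not depend on the candidate ordering.

```python
from itertools import combinations


def kendall_tau_distance(ranking_a, ranking_b):
    """Count item pairs, present in both rankings, that are ordered differently."""
    pos_a = {item: i for i, item in enumerate(ranking_a)}
    pos_b = {item: i for i, item in enumerate(ranking_b)}
    shared = [item for item in ranking_a if item in pos_b]
    return sum(
        1
        for x, y in combinations(shared, 2)
        # Opposite signs mean the two rankings disagree on this pair.
        if (pos_a[x] - pos_a[y]) * (pos_b[x] - pos_b[y]) < 0
    )


def mallows_log_score(candidate, grader_rankings):
    """Unnormalized Mallows log-likelihood of a candidate overall ordering."""
    return -sum(kendall_tau_distance(candidate, r) for r in grader_rankings)


# Hypothetical toy data: three graders, three assignments (best first).
graders = [["d1", "d2", "d3"], ["d2", "d1", "d3"], ["d1", "d3", "d2"]]
print(kendall_tau_distance(["d1", "d2", "d3"], ["d3", "d2", "d1"]))  # 3
print(mallows_log_score(["d1", "d2", "d3"], graders))  # -2
```

A higher (less negative) score means the candidate ordering disagrees with the graders on fewer pairs, which is the quantity a maximum-likelihood search would try to increase.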
Instructors Want More Details!
• Need finer-grained details to fully trust algorithm.
• Uncertainty information
  • Helps identify most confusing assignments.
• Allows instructors to understand grading output.
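One way such uncertainty information could be summarized, assuming a set of sampled overall orderings is already available, is a per-assignment interval over rank positions. The sketch below is an assumption about how this might look, not the method from the talk: `rank_intervals`, `most_uncertain`, and the hand-made sample list are hypothetical and carry no real data.

```python
from collections import defaultdict


def rank_intervals(sampled_orderings, mass_percent=80):
    """Central interval of rank positions (0 = best) for each assignment."""
    positions_by_item = defaultdict(list)
    for ordering in sampled_orderings:
        for position, item in enumerate(ordering):
            positions_by_item[item].append(position)
    intervals = {}
    for item, positions in positions_by_item.items():
        positions.sort()
        # Number of samples dropped from each tail (integer arithmetic).
        trim = len(positions) * (100 - mass_percent) // 200
        intervals[item] = (positions[trim], positions[len(positions) - 1 - trim])
    return intervals


def most_uncertain(intervals):
    """Assignment whose rank interval is widest."""
    return max(intervals, key=lambda item: intervals[item][1] - intervals[item][0])


# Hand-made illustrative samples (not real data): ten orderings of four assignments.
samples = (
    [["d1", "d2", "d3", "d4"]] * 5
    + [["d1", "d3", "d2", "d4"]] * 3
    + [["d3", "d1", "d2", "d4"]] * 2
)
intervals = rank_intervals(samples, mass_percent=80)
print(intervals["d3"])            # (0, 2)
print(most_uncertain(intervals))  # d3
```

An instructor could review the assignments with the widest intervals first, since those are the ones the sampled orderings disagree on most.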
Solution: Bayesian Mallows Peer Grading
• Sample orderings using MCMC (Metropolis-Hastings):
  • $\sigma_{t-1}$ → sample new ordering $\sigma'$ → accept/reject? → $\sigma_t$ → …
  • [Diagram annotations: $O(|D| \log |D|)$ time; $O(|D|^2)$ time]
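The sample / accept-or-reject loop above can be illustrated with a minimal Metropolis-Hastings sketch over orderings. Several choices here are assumptions not stated on the slide: a uniform prior over orderings, a proposal that swaps two random positions (symmetric, so the acceptance probability reduces to the posterior ratio), and the hypothetical names `log_posterior` and `metropolis_hastings`. The distance is recomputed naively at every step, so the sketch makes no attempt to match the running times listed on the slide.

```python
import math
import random
from collections import Counter
from itertools import combinations


def kendall_tau_distance(ranking_a, ranking_b):
    """Count item pairs, present in both rankings, that are ordered differently."""
    pos_a = {item: i for i, item in enumerate(ranking_a)}
    pos_b = {item: i for i, item in enumerate(ranking_b)}
    shared = [item for item in ranking_a if item in pos_b]
    return sum(
        1
        for x, y in combinations(shared, 2)
        if (pos_a[x] - pos_a[y]) * (pos_b[x] - pos_b[y]) < 0
    )


def log_posterior(candidate, grader_rankings):
    """Unnormalized log-posterior under a Mallows likelihood and uniform prior (assumed)."""
    return -sum(kendall_tau_distance(candidate, r) for r in grader_rankings)


def metropolis_hastings(items, grader_rankings, n_samples, seed=0):
    """Draw n_samples orderings from the chain (no burn-in or thinning applied)."""
    rng = random.Random(seed)
    current = list(items)
    current_score = log_posterior(current, grader_rankings)
    samples = []
    for _ in range(n_samples):
        # Propose: swap two distinct positions (a symmetric proposal).
        i, j = rng.sample(range(len(current)), 2)
        proposal = current[:]
        proposal[i], proposal[j] = proposal[j], proposal[i]
        proposal_score = log_posterior(proposal, grader_rankings)
        # Accept with probability min(1, exp(proposal_score - current_score)).
        delta = proposal_score - current_score
        if delta >= 0 or rng.random() < math.exp(delta):
            current, current_score = proposal, proposal_score
        samples.append(tuple(current))
    return samples


# Hypothetical toy data: three graders, three assignments (best first).
graders = [["d1", "d2", "d3"], ["d2", "d1", "d3"], ["d1", "d3", "d2"]]
samples = metropolis_hastings(["d1", "d2", "d3"], graders, n_samples=5000)
for ordering, count in Counter(samples).most_common(3):
    print(ordering, count / len(samples))
```

The relative frequencies of the sampled orderings approximate the posterior, so the same samples can feed any per-assignment summary an instructor wants to inspect.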
Experiments
• Meaningful posteriors at the cost of only a minimal increase in grading error
[Figure: bar chart for the Poster and Report datasets comparing NCS (Cardinal), MAL-MLE, and Bayes-MAL. Bar labels as extracted: 20.8, 31.9, 21.8, 21.4, 20.3, 30.8.]
[Figure: bar chart for the Poster and Report datasets comparing Bayes(50%), Boot(50%), Bayes(80%), and Boot(80%). Bar labels as extracted: 44.2, 51.7, 42.6, 45.1, 74.5, 86.0, 73.9, 74.0.]