
    Semi-Supervised Learning to Perceive Children’s Affective States in a Tablet Tutor

    Mansi Agarwal and Jack Mostow

PROBLEM

Detect affective states in videos of children using RoboTutor:
Boredom, Confusion, Delight, Frustration, Neutral, Surprise

DATASET

● Dataset: 229 videos of 30 children using RoboTutor
● Pooled labels (see the pooling sketch after this list):
  ○ Positive: Delight, Surprise
  ○ Negative: Boredom, Confusion, Frustration
  ○ Neutral
● Labeled dataset: 345 10-second clips from 17 videos
  ○ Focused sample: clips with extreme feature values
  ○ Random sample: clips centered at random points
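The pooled labels amount to a small many-to-one mapping from the six annotated affective states to three classes. A minimal sketch in Python; the mapping is taken directly from the bullets above, and the dictionary and function names are illustrative only.

# Pool the six annotated affective states into three classes,
# as listed under "Pooled labels" above.
POOLED_LABEL = {
    "Delight": "Positive",
    "Surprise": "Positive",
    "Boredom": "Negative",
    "Confusion": "Negative",
    "Frustration": "Negative",
    "Neutral": "Neutral",
}

def pool(label: str) -> str:
    """Map a raw affect label to its pooled class, e.g. pool('Confusion') == 'Negative'."""
    return POOLED_LABEL[label]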

CONCLUSION

5-fold cross-validated results: unlabeled data helps -- the semi-supervised method beats supervised learning!

Future work:
● Analyze affective states to guide redesign of RoboTutor.
● Port the classifier to run in real time on the tablet.
● Detect and respond to affective states at runtime.
● Use other available inputs -- screen gestures and audio.

ACKNOWLEDGEMENTS

This work would not have been possible without the support of the RoboTutor team, especially Fortunatus Massewe for recording many hours of RoboTutor screen video, and Dr. Leonora Anyango Kivuva, Joash Gambarage, Sheba Naderzad, and Andrew McReynolds for labeling so much data. We would also like to thank Rachel Burcin, Dr. John Dolan, and Mikayla Trost for tirelessly supporting us throughout the summer. The first author was a Robotics Institute Summer Scholar supported by an S.N. Bose Scholarship.

RESULTS COMPARED TO SUPERVISED LEARNING

Model                    Precision  Recall  F1    Accuracy
Gaussian Naive Bayes     0.52       0.21    0.23  0.21
Decision Tree            0.43       0.44    0.43  0.44
SVM                      0.49       0.51    0.46  0.51
Adaboost                 0.47       0.47    0.47  0.47
Logistic Regression      0.50       0.53    0.51  0.53
KNN                      0.53       0.53    0.53  0.53
Random Forest            0.63       0.61    0.60  0.61
Semi-supervised method   0.88       0.84    0.86  0.88
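For context, a comparison like the supervised part of the table can be produced with scikit-learn's 5-fold cross-validation. This is only a sketch: the poster does not state hyperparameters or the metric averaging scheme, so library defaults and weighted averaging are assumed, and X, y stand for the aggregated clip features and pooled labels.

import numpy as np
from sklearn.ensemble import AdaBoostClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_validate
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

MODELS = {
    "Gaussian Naive Bayes": GaussianNB(),
    "Decision Tree": DecisionTreeClassifier(),
    "SVM": SVC(),
    "Adaboost": AdaBoostClassifier(),
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "KNN": KNeighborsClassifier(),
    "Random Forest": RandomForestClassifier(),
}

def compare_supervised_baselines(X, y):
    """Print 5-fold cross-validated precision, recall, F1, and accuracy per model."""
    scoring = ["precision_weighted", "recall_weighted", "f1_weighted", "accuracy"]
    for name, model in MODELS.items():
        scores = cross_validate(model, X, y, cv=5, scoring=scoring)
        means = {m: round(float(np.mean(scores["test_" + m])), 2) for m in scoring}
        print(name, means)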

APPROACH

[Pipeline figure: the camera window of a 10-second screen video clip goes to OpenFace, which extracts facial landmarks, head location and pose, and FAUs; extracted features such as head orientation are temporally aggregated and passed to a classifier, whose output label here is Neutral.]

1. Feature extraction: Compute head proximity, head orientation, facial action units (FAUs), blink rate, pupil size, and eye gaze from OpenFace output.

2. Temporal aggregation: Reduce each channel to a single feature value per 10-second clip (steps 1-2 are sketched after the figure below).

3. Semi-supervised learning: Pseudo-label unlabeled instances to augment the sparse hand-labeled training data (sketched just after this list).
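Step 3 is a form of self-training. Below is a minimal sketch of one pseudo-labeling round; the poster does not specify the base classifier, the confidence threshold, or the number of rounds, so the random forest, the 0.9 threshold, and a single round are assumptions (X_* are NumPy feature matrices, y_labeled the pooled labels).

import numpy as np
from sklearn.ensemble import RandomForestClassifier

def pseudo_label_round(X_labeled, y_labeled, X_unlabeled, threshold=0.9):
    """One self-training round: fit on the hand-labeled clips, pseudo-label
    the unlabeled clips the model is confident about, then refit on the
    augmented training set."""
    base = RandomForestClassifier(n_estimators=200, random_state=0)
    base.fit(X_labeled, y_labeled)

    # Class probabilities for every unlabeled clip; keep only confident ones.
    proba = base.predict_proba(X_unlabeled)
    confident = proba.max(axis=1) >= threshold
    pseudo_y = base.classes_[proba.argmax(axis=1)]

    # Augment the sparse labeled training data with the confident pseudo-labels.
    X_aug = np.vstack([X_labeled, X_unlabeled[confident]])
    y_aug = np.concatenate([np.asarray(y_labeled), pseudo_y[confident]])

    final = RandomForestClassifier(n_estimators=200, random_state=0)
    final.fit(X_aug, y_aug)
    return final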

[Figure: temporal aggregation example -- each extracted channel (Head Proximity, AU04, ...) over the 0-10 sec clip is reduced to a single value (e.g., 0.23, 0.47, 0.12), forming one instance.]
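Steps 1 and 2, illustrated by the figure above, turn the per-frame OpenFace output for a clip into a single feature vector. A rough sketch with pandas; the column names (AU04_r, pose_Tz, gaze_angle_x, ...) follow OpenFace 2.x conventions, and the choice of channels and of the mean as the aggregate are assumptions rather than the poster's exact recipe.

import pandas as pd

# A few of the per-frame channels OpenFace writes to its CSV output.
CHANNELS = ["AU04_r", "pose_Tz", "pose_Rx", "pose_Ry", "gaze_angle_x", "gaze_angle_y"]

def clip_features(openface_csv_path: str) -> pd.Series:
    """Aggregate the per-frame OpenFace output for one 10-second clip
    into a single instance (one value per channel)."""
    frames = pd.read_csv(openface_csv_path)
    frames.columns = frames.columns.str.strip()   # some OpenFace versions pad column names
    frames = frames[frames["success"] == 1]       # drop frames where the face was not tracked
    return frames[CHANNELS].mean()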

● Novel population: children ages 6-12 in Tanzania
● Camera only: user-facing tablet camera; no fancy sensors
● Authentic video: occlusion, variable indoor and outdoor illumination, limited spatial and temporal resolution
● Unlabeled dataset: 1000 clips from 40 other videos
