
    Semi-Supervised Learning to Perceive Children’s Affective States in a Tablet Tutor

    Mansi Agarwal and Jack Mostow

PROBLEM

Detect affective states in videos of children using RoboTutor:
Boredom, Confusion, Delight, Frustration, Neutral, Surprise

DATASET

● Dataset: 229 videos of 30 children using RoboTutor
● Pooled labels (see the pooling sketch after this list):
  ○ Positive: Delight, Surprise
  ○ Negative: Boredom, Confusion, Frustration
  ○ Neutral
● Labeled dataset: 345 10-second clips from 17 videos
  ○ Focused sample: clips with extreme feature values
  ○ Random sample: clips centered at random points
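The pooled labels amount to a small many-to-one mapping from the six annotated affective states to three classes. A minimal sketch in Python; the mapping is taken directly from the bullets above, and the dictionary and function names are illustrative only.

# Pool the six annotated affective states into three classes,
# as listed under "Pooled labels" above.
POOLED_LABEL = {
    "Delight": "Positive",
    "Surprise": "Positive",
    "Boredom": "Negative",
    "Confusion": "Negative",
    "Frustration": "Negative",
    "Neutral": "Neutral",
}

def pool(label: str) -> str:
    """Map a raw affect label to its pooled class, e.g. pool('Confusion') == 'Negative'."""
    return POOLED_LABEL[label]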

CONCLUSION

5-fold cross-validated results: unlabeled data helps -- the semi-supervised method beats supervised learning!

Future work:
● Analyze affective states to guide redesign of RoboTutor.
● Port the classifier to run in real time on the tablet.
● Detect and respond to affective states at runtime.
● Use other available inputs -- screen gestures and audio.

ACKNOWLEDGEMENTS

This work would not have been possible without the support of the RoboTutor team, especially Fortunatus Massewe for recording many hours of RoboTutor screen video, and Dr. Leonora Anyango Kivuva, Joash Gambarage, Sheba Naderzad, and Andrew McReynolds for labeling so much data. We would also like to thank Rachel Burcin, Dr. John Dolan, and Mikayla Trost for tirelessly supporting us throughout the summer. The first author was a Robotics Institute Summer Scholar supported by an S.N. Bose Scholarship.

RESULTS COMPARED TO SUPERVISED LEARNING

Model                    Precision  Recall  F1    Accuracy
Gaussian Naive Bayes     0.52       0.21    0.23  0.21
Decision Tree            0.43       0.44    0.43  0.44
SVM                      0.49       0.51    0.46  0.51
Adaboost                 0.47       0.47    0.47  0.47
Logistic Regression      0.50       0.53    0.51  0.53
KNN                      0.53       0.53    0.53  0.53
Random Forest            0.63       0.61    0.60  0.61
Semi-supervised method   0.88       0.84    0.86  0.88
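For context, a comparison like the supervised part of the table can be produced with scikit-learn's 5-fold cross-validation. This is only a sketch: the poster does not state hyperparameters or the metric averaging scheme, so library defaults and weighted averaging are assumed, and X, y stand for the aggregated clip features and pooled labels.

import numpy as np
from sklearn.ensemble import AdaBoostClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_validate
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

MODELS = {
    "Gaussian Naive Bayes": GaussianNB(),
    "Decision Tree": DecisionTreeClassifier(),
    "SVM": SVC(),
    "Adaboost": AdaBoostClassifier(),
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "KNN": KNeighborsClassifier(),
    "Random Forest": RandomForestClassifier(),
}

def compare_supervised_baselines(X, y):
    """Print 5-fold cross-validated precision, recall, F1, and accuracy per model."""
    scoring = ["precision_weighted", "recall_weighted", "f1_weighted", "accuracy"]
    for name, model in MODELS.items():
        scores = cross_validate(model, X, y, cv=5, scoring=scoring)
        means = {m: round(float(np.mean(scores["test_" + m])), 2) for m in scoring}
        print(name, means)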

APPROACH

[Pipeline figure: the camera window of a 10-second screen video clip goes to OpenFace, which extracts facial landmarks, head location and pose, and FAUs; extracted features such as head orientation are temporally aggregated and passed to a classifier, whose output label here is Neutral.]

1. Feature extraction: Compute head proximity, head orientation, facial action units (FAUs), blink rate, pupil size, and eye gaze from OpenFace output.

2. Temporal aggregation: Reduce each channel to a single feature value per 10-second clip (steps 1-2 are sketched after the figure below).

3. Semi-supervised learning: Pseudo-label unlabeled instances to augment the sparse hand-labeled training data (sketched just after this list).
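Step 3 is a form of self-training. Below is a minimal sketch of one pseudo-labeling round; the poster does not specify the base classifier, the confidence threshold, or the number of rounds, so the random forest, the 0.9 threshold, and a single round are assumptions (X_* are NumPy feature matrices, y_labeled the pooled labels).

import numpy as np
from sklearn.ensemble import RandomForestClassifier

def pseudo_label_round(X_labeled, y_labeled, X_unlabeled, threshold=0.9):
    """One self-training round: fit on the hand-labeled clips, pseudo-label
    the unlabeled clips the model is confident about, then refit on the
    augmented training set."""
    base = RandomForestClassifier(n_estimators=200, random_state=0)
    base.fit(X_labeled, y_labeled)

    # Class probabilities for every unlabeled clip; keep only confident ones.
    proba = base.predict_proba(X_unlabeled)
    confident = proba.max(axis=1) >= threshold
    pseudo_y = base.classes_[proba.argmax(axis=1)]

    # Augment the sparse labeled training data with the confident pseudo-labels.
    X_aug = np.vstack([X_labeled, X_unlabeled[confident]])
    y_aug = np.concatenate([np.asarray(y_labeled), pseudo_y[confident]])

    final = RandomForestClassifier(n_estimators=200, random_state=0)
    final.fit(X_aug, y_aug)
    return final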

[Figure: temporal aggregation example -- each extracted channel (Head Proximity, AU04, ...) over the 0-10 sec clip is reduced to a single value (e.g., 0.23, 0.47, 0.12), forming one instance.]
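Steps 1 and 2, illustrated by the figure above, turn the per-frame OpenFace output for a clip into a single feature vector. A rough sketch with pandas; the column names (AU04_r, pose_Tz, gaze_angle_x, ...) follow OpenFace 2.x conventions, and the choice of channels and of the mean as the aggregate are assumptions rather than the poster's exact recipe.

import pandas as pd

# A few of the per-frame channels OpenFace writes to its CSV output.
CHANNELS = ["AU04_r", "pose_Tz", "pose_Rx", "pose_Ry", "gaze_angle_x", "gaze_angle_y"]

def clip_features(openface_csv_path: str) -> pd.Series:
    """Aggregate the per-frame OpenFace output for one 10-second clip
    into a single instance (one value per channel)."""
    frames = pd.read_csv(openface_csv_path)
    frames.columns = frames.columns.str.strip()   # some OpenFace versions pad column names
    frames = frames[frames["success"] == 1]       # drop frames where the face was not tracked
    return frames[CHANNELS].mean()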

● Novel population: children ages 6-12 in Tanzania
● Camera only: user-facing tablet camera; no fancy sensors
● Authentic video: occlusion, variable indoor and outdoor illumination, limited spatial and temporal resolution
● Unlabeled dataset: 1000 clips from 40 other videos
