TRANSCRIPT
Correlations with Learning in Spoken Tutoring Dialogues
Diane Litman
Learning Research and Development Center
and
Computer Science Department
University of Pittsburgh
Outline
Introduction
Dialogue Acts and Learning Correlations
Learning, Speech Recognition and Text-to-Speech
Current Directions and Summary
Motivation An empirical basis for optimizing dialogue behaviors in
spoken tutorial dialogue systems
What aspects of dialogue correlate with learning?
– Student behaviors
– Tutor behaviors
– Interacting student and tutor behaviors
Do correlations generalize across tutoring situations?
– Human-human tutoring
– Human-computer tutoring
Approach
Initial: learning correlations with superficial dialogue characteristics [Litman et al., Intelligent Tutoring Systems Conf., 2004]
– Easy to compute automatically and in real time, but…
– Correlations in the literature did not generalize to our spoken or human-computer corpora
– Results were difficult to interpret
» e.g., do longer student turns contain more explanations?
Current: learning correlations with deeper “dialogue act” codings [Forbes-Riley et al., Artificial Intelligence and Education Conf., 2005]
ITSPOKE (Version 1): Intelligent Tutoring SPOKEn Dialogue System
Back-end is text-based Why2-Atlas tutorial dialogue system (VanLehn et al., 2002)
Student speech digitized from microphone input; Sphinx2 speech recognizer
Tutor speech played via headphones/speakers; Cepstral text-to-speech synthesizer
Other additions: access to Why2-Atlas “internals”, speech recognition repairs, etc.
Two Spoken Tutoring Corpora

Human-Human Corpus
– 14 students
– 1 human tutor
– 128 physics problems (dialogues)
– 5948 student turns, 5505 tutor turns

Computer-Human Corpus
– 20 students
– ITSPOKE (Version 1) tutor
– 100 physics problems (dialogues)
– 2445 student turns, 2967 tutor turns
Outline
Introduction
Dialogue Acts and Learning Correlations
Learning, Speech Recognition and Text-to-Speech
Current Directions and Summary
Dialogue Acts
Dialogue Acts represent intentions behind utterances
– Both domain-independent and tutoring-specific tagsets
» e.g., Graesser and Person, 1994; Graesser et al., 1995; Chi et al., 2001
– Used in prior studies of correlations with learning
» e.g., tutor acts in AutoTutor (Jackson et al., 2004), dialogue acts in human tutoring (Chi et al., 2001)

ITSPOKE Study
– Student and tutor dialogue acts
– Unigrams and bigrams of dialogue acts
– Human and computer tutoring
– Spoken input and output
Tagset (1): (Graesser and Person, 1994)
• Tutor and Student Question Acts
Short Answer Question: basic quantitative relationships
Long Answer Question: definition/interpretation of concepts
Deep Answer Question: reasoning about causes/effects
Tagset (2): inspired by (Graesser et al., 1995)
• Tutor Feedback Acts
Positive Feedback: overt positive response
Negative Feedback: overt negative response
• Tutor State Acts
Restatement: repetitions and rewordings
Recap: restating earlier-established points
Request/Directive: directions for argument
Bottom Out: complete answer after problematic response
Hint: partial answer after problematic response
Expansion: novel details
Tagset (3): inspired by (Chi et al., 2001)
• Student Answer Acts
Deep Answer: at least 2 concepts with reasoning
Novel/Single Answer: one new concept
Shallow Answer: one given concept
Assertion: answers such as “I don’t know”
• Tutor and Student Non-Substantive Acts: do not contribute to physics discussion
Annotated Human-Human Excerpt
T: Which one will be faster? [Short Answer Question]
S: The feathers. [Novel/Single Answer]
T: The feathers - why? [Restatement, Deep Answer Question]
S: Because there’s less matter. [Deep Answer]
All turns in both corpora were manually coded for dialogue acts (Kappa > .6)
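The agreement figure above is Cohen's kappa over matched turn annotations; a minimal sketch of the computation (the two annotators' tag sequences below are illustrative, not from the corpora):

```python
from collections import Counter

def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators' tag sequences of equal length."""
    assert len(labels_a) == len(labels_b)
    n = len(labels_a)
    # Observed agreement: fraction of turns with identical tags
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Expected chance agreement, from each annotator's tag marginals
    counts_a, counts_b = Counter(labels_a), Counter(labels_b)
    expected = sum(counts_a[t] * counts_b[t] for t in counts_a) / (n * n)
    return (observed - expected) / (1 - expected)

# Hypothetical annotations of six student turns
a = ["Deep", "Shallow", "Deep", "Novel", "Shallow", "Deep"]
b = ["Deep", "Shallow", "Novel", "Novel", "Shallow", "Deep"]
print(round(cohens_kappa(a, b), 2))  # 0.75
```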
Correlations with Unigram Measures (student and tutor-centered analyses)

For each student, and each student and tutor dialogue act tag, compute
– Tag Total: number of turns containing the tag
– Tag Percentage: (tag total) / (turn total)
– Tag Ratio: (tag total) / (turns containing any tag of that type)
Correlate measures with posttest, after regressing out pretest
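The three per-student measures, plus the "regressing out pretest" step (a partial correlation), can be sketched as follows; function names and data structures are illustrative, not from the study's analysis code:

```python
def tag_measures(turns, tag, same_type_tags):
    """Unigram measures for one student's dialogue.

    turns: one set of dialogue-act tags per turn.
    same_type_tags: every tag of the same type as `tag`
    (e.g. all Question acts), used for the ratio denominator.
    """
    tag_total = sum(tag in t for t in turns)
    tag_percentage = tag_total / len(turns)
    type_total = sum(bool(t & same_type_tags) for t in turns)
    tag_ratio = tag_total / type_total if type_total else 0.0
    return tag_total, tag_percentage, tag_ratio

def _residuals(y, x):
    """Residuals of y after a simple linear regression on x."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    slope = (sum((a - mx) * (b - my) for a, b in zip(x, y))
             / sum((a - mx) ** 2 for a in x))
    return [b - (my + slope * (a - mx)) for a, b in zip(x, y)]

def _pearson(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)

def partial_corr(measure, posttest, pretest):
    """Correlate a measure with posttest after regressing out pretest."""
    return _pearson(_residuals(measure, pretest),
                    _residuals(posttest, pretest))
```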
Human-Computer Results (continued)
Tutor Dialogue Acts                 Mean     R      p
# Deep Answer Question              9.59     .41    .08
% Deep Answer Question              6.27%    .45    .05
% Question Act                      76.89%   .57    .01
(Short Answer Question)/Question    .88      -.47   .04
(Deep Answer Question)/Question     .08      .42    .07
# Positive Feedback                 76.10    .38    .10
Human-Human Results (14 students)
Student Dialogue Acts               Mean     R      p
# Novel/Single Answer               19.29    .49    .09
# Deep Answer                       68.50    -.49   .09
(Novel/Single Answer)/Answer        .14      .47    .10
(Short Answer Question)/Question    .91      .56    .05
(Long Answer Question)/Question     .03      -.57   .04
Human-Human Results (continued)
Tutor Dialogue Acts     Mean     R      p
# Request/Directive     19.86    -.71   .01
% Request/Directive     5.65%    -.61   .03
# Restatement           79.14    -.56   .05
# Negative Feedback     14.50    -.60   .03
Discussion
Computer Tutoring: knowledge construction
– Positive correlations
» Student answers displaying reasoning
» Tutor questions requiring reasoning

Human Tutoring: more complex
– Positive correlations
» Student utterances introducing a new concept
– Mostly negative correlations
» Student attempts at deeper reasoning
» Tutor attempts to direct the dialogue
Correlations with Bigram Measures (interaction-centered analyses)

For each student, and each tag sequence containing both a tutor and a student dialogue act, compute
– [Student Act_n – Tutor Act_n+1] Totals
» all bigrams constructed by pairing each Student Dialogue Act in turn n with each Tutor Dialogue Act in turn n+1
– [Tutor Act_n – Student Act_n+1] Totals
» all bigrams constructed by pairing each Tutor Dialogue Act in turn n with each Student Dialogue Act in turn n+1
Correlate measures with posttest, after regressing out pretest
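The bigram totals can be read directly off adjacent turns; a sketch using the annotated human-human excerpt from earlier (the data structures are illustrative):

```python
from collections import Counter

def bigram_totals(turns):
    """Totals of [Act_n – Act_n+1] pairs that cross speakers.

    turns: list of (speaker, acts) pairs, speaker 'S' or 'T',
    acts a list of dialogue-act labels for that turn.
    """
    totals = Counter()
    for (spk1, acts1), (spk2, acts2) in zip(turns, turns[1:]):
        if spk1 == spk2:          # keep student-tutor interactions only
            continue
        for a in acts1:
            for b in acts2:       # pair every act in turn n with turn n+1
                totals[(spk1, a, spk2, b)] += 1
    return totals

# The annotated human-human excerpt from earlier
dialogue = [("T", ["Short Answer Question"]),
            ("S", ["Novel/Single Answer"]),
            ("T", ["Restatement", "Deep Answer Question"]),
            ("S", ["Deep Answer"])]
counts = bigram_totals(dialogue)
print(counts[("T", "Deep Answer Question", "S", "Deep Answer")])  # 1
```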
Bigram Results
Many bigrams incorporate, as either the first or second element, a dialogue act corresponding to one of the unigram results, e.g.
– [Student Deep Answer – Tutor Deep Answer Question]
– [Tutor Recap – Student Deep Answer]

Other dialogue acts only correlate with learning as part of a larger dialogue pattern, e.g.
– [Student Shallow Answer – Tutor Restatement]
– [Tutor Restatement – Student Shallow Answer]
Discussion
Computer Tutoring
– n-grams seem able to capture effective learning patterns in this simpler corpus

Human Tutoring
– despite mostly negative correlations, students are learning!
– suggests effective learning patterns are too complicated to be captured with n-grams
Current Directions
“Correctness” annotation
– Are more Deep Answers “incorrect” or “partially correct” in the human-human corpus?
– Do correct answers positively correlate with learning?

Beyond the turn level
– Correlations with larger dialogue act patterns (e.g., trigrams, n-grams)
– Computation and use of hierarchical discourse structure
Outline
Introduction
Dialogue Acts and Learning Correlations
Learning, Speech Recognition and Text-to-Speech
Current Directions and Summary
Research Question
Does the performance of a system’s speech recognizer and/or text-to-speech system relate to learning?
– Speech recognition accuracy correlates with user satisfaction in non-tutoring systems (Litman and Pan, 2002; Walker et al., 2002)
– The nature of a computer’s voice relates to learning and motivation in pedagogical agents (Baylor et al., 2003; Atkinson)
ASR Performance Measures (1): Rejections
ITSPOKE: Therefore, what is the magnitude of this gravitational force in the horizontal direction?
STUDENT: significant
ASR: significant (False Rejection)
ITSPOKE: Could you please repeat that?
STUDENT: great
ASR: crate (True Rejection)
ITSPOKE: I'm sorry, I'm having trouble understanding you. Please try again.
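Whether a rejection was "true" or "false" can only be decided after the fact, by comparing the discarded ASR hypothesis against a human transcription of the turn; a sketch of that labeling (illustrative only, not the study's evaluation code):

```python
def classify_rejection(transcript, asr_hypothesis):
    """Label a turn the recognizer rejected for low confidence.

    The rejection was 'false' if the discarded hypothesis actually
    matched the human transcription, 'true' if it was wrong anyway.
    """
    return ("False Rejection" if asr_hypothesis == transcript
            else "True Rejection")

# The two rejected turns from the excerpt above
print(classify_rejection("significant", "significant"))  # False Rejection
print(classify_rejection("great", "crate"))              # True Rejection
```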
ASR Performance Measures (2): “Transcription” Misrecognitions
ITSPOKE: Yeah. Does the packet have an acceleration? If yes, please specify its direction.
STUDENT: yes downward
ASR: is downward
word error rate: 50%
binary word error: True
ITSPOKE: <omitted> How would you describe the vertical component of the packet's velocity? (e.g., decreasing, zero, etc.)
STUDENT: increasing
ASR: decreasing
word error rate: 100%
binary word error: True
ASR Performance Measures (3):“Semantic” Misrecognitions
ITSPOKE: Yeah. Does the packet have an acceleration? If yes, please specify its direction.
STUDENT: yes downward     NLU(student): downward
ASR: is downward          NLU(ASR): downward
word error rate: 50%      semantic error rate: 0%
binary word error: True   binary semantic error: False
ITSPOKE: <omitted> How would you describe the vertical component of the packet's velocity? (e.g., decreasing, zero, etc.)
STUDENT: increasing       NLU(student): increase
ASR: decreasing           NLU(ASR): decrease
word error rate: 100%     semantic error rate: 100%
binary word error: True   binary semantic error: True
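Word error rate here is the standard word-level Levenshtein distance normalized by reference length, and binary semantic error simply compares the NLU outputs; a sketch reproducing the numbers above (not the study's scoring code):

```python
def word_error_rate(reference, hypothesis):
    """Word-level Levenshtein distance divided by reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j]: edits to turn the first i ref words into first j hyp words
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,         # deletion
                           dp[i][j - 1] + 1,         # insertion
                           dp[i - 1][j - 1] + cost)  # substitution
    return dp[-1][-1] / len(ref)

def binary_semantic_error(nlu_student, nlu_asr):
    """True when misrecognition changed the meaning the NLU extracts."""
    return nlu_student != nlu_asr

print(word_error_rate("yes downward", "is downward"))    # 0.5
print(word_error_rate("increasing", "decreasing"))       # 1.0
print(binary_semantic_error("downward", "downward"))     # False
```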
Correlations with Learning?
Computed totals, percentages, and ratios of:
– Rejections
» False, True, Both
– Misrecognitions
» Transcription and Semantic
» Word and Binary
– ASR Problems (Rejections + Misrecognitions)
» Transcription and Semantic
» Word and Binary
Found no significant correlations or trends!
ITSPOKE (Version 2)
Prerecorded Output (human voice) versus Synthesized Output (text-to-speech)
ITSPOKE: Terrific. Let's try the original question again. If gravity is the only force acting on an object, will it be moving or staying still?
STUDENT: moving (ASR: moving)
ITSPOKE: Yes. Not only are the person, keys, and elevator moving, they have only gravitational forces acting on them. When an object is falling and has only gravitational force on it, it is said to be in what?
STUDENT: free fall (ASR: free fall)
New Computer Tutoring Experiment
Same subject pool, physics problems, web interface, and experimental procedure as before, except
– ITSPOKE (Version 2)
– Pre-recorded voice condition: 30 students (150 dialogues)
– Text-to-speech condition: 29 students (145 dialogues)
Outline
Introduction
Dialogue Acts and Learning Correlations
Learning, Speech Recognition and Text-to-Speech
Current Directions and Summary
Summary
Many dialogue act correlations
– positive correlations with deep reasoning and questioning in computer tutoring
– correlations in human tutoring more complex
– student, tutor, and interactive perspectives all useful

No correlations with ASR problems

Stay tuned…
– New dialogue act patterns and “correctness” analysis
– Pre-recorded versus text-to-speech
Acknowledgments
The ITSPOKE Group
– Staff: Scott Silliman
– Research Associates: Kate Forbes-Riley, Joel Tetreault, Alison Huettner (consultant)
– Graduate Students: Ai Hua, Beatriz Maeireizo, Amruta Purandare, Mihai Rotaru, Arthur Ward
Kurt VanLehn and the Why2 Team
Hypotheses
Compared to typed dialogues, spoken interactions will yield better learning gains, and will be more efficient and natural
Different student behaviors will correlate with learning in spoken versus typed dialogues, and will be elicited by different tutor actions
Findings in human-human and human-computer dialogues will vary as a function of system performance
Motivation
Working hypothesis regarding learning gains
– Human Dialogue > Computer Dialogue > Text

Most human tutoring involves face-to-face spoken interaction, while most computer dialogue tutors are text-based
– Evens et al., 2001; Zinn et al., 2002; VanLehn et al., 2002; Aleven et al., 2001
Can the effectiveness of dialogue tutorial systems be further increased by using spoken interactions?
Potential Benefits of Speech

Self-explanation correlates with learning and occurs more in speech
– Hausmann and Chi, 2002
Speech contains prosodic information, providing new sources of information for dialogue adaptation – Forbes-Riley and Litman, 2004
Spoken computational environments may prime a more social interpretation that enhances learning– Moreno et al., 2001; Graesser et al., 2003
Potential for hands-free interaction – Smith, 1992; Aist et al., 2003
Spoken Tutorial Dialogue Systems
Recent tutoring systems have begun to add spoken language capabilities
– Rickel and Johnson, 2000; Graesser et al., 2001; Mostow and Aist, 2001; Aist et al., 2003; Fry et al., 2001; Schultz et al., 2003
However, little empirical analysis of the learning ramifications of using speech
Architecture
[Architecture diagram: a www browser connects through a www server to javaITSpoke, whose Text Manager and Spoken Dialogue Manager mediate between the student and the Why2 back-end. Student speech goes to Speech Analysis (Sphinx); tutor turns are synthesized by Cepstral. Essays go to Essay Analysis (Carmel, Tacitus-lite+), and dialogue is driven by the Content Dialogue Manager (Ape, Carmel); the components exchange essay text, student text (xml), tutor turns (xml), repair goals, and tutorial goals.]
Speech Recognition: Sphinx2 (CMU)
Probabilistic language models for different dialogue states

Initial training data
– typed student utterances from Why2-Atlas corpora

Later training data
– spoken utterances obtained during development and pilot testing of ITSPOKE

Total vocabulary
– 1240 unique words

“Semantic Accuracy” Rate = 92.4%
Speech Synthesis: Cepstral
Commercial outgrowth of Festival text-to-speech synthesizer (Edinburgh, CMU)
Required additional processing of Why2-Atlas prompts (e.g., f=m*a)
Common Experimental Aspects
Students take a physics pretest
Students read background material
Students use web interface to work through up to 10 problems with either a computer or a human tutor
Students take a posttest
– 40 multiple choice questions, isomorphic to pretest
ITSPOKE Corpora Comparison: Human-Human versus Human-Computer
Human-Human excerpt …1.3 minutes into session…
Student: Can I ask you questions?
Tutor: Yes
Student: Is that ok?
Tutor: Mm-hm
Student: Um from what we learned from the last problem they're gonna have the same- the keys and you have the same acceleration right?
Tutor: Yes
Student: So that means they'll both hit the bottom at the same time. But I don't understand where- I don't understand if you're in the elevator-
Tutor: You see
Student: Where are you going to-?
Tutor: The uh let me uh the key uh- the person holds the key in front of-
Student: Their face yeah-
Tutor: Uh his or her face and then lets it uh uh it is let go so the the the question relates to the relative position of the key and the face uh-
Student: So the key and the face-
Tutor: Uh how will they- as they go- as they both fall what will be their relative position? That is the question.
Student: (sigh)
Human-Computer excerpt …3.5 minutes into session…
Tutor: I didn't understand. What force acts on all objects within close proximity to earth?
Student: uh the force of gravity
Tutor: Fine. Besides the Earth's gravitational force, are there any other forces acting on the person, keys, and elevator after the release?
Student: no
Tutor: Yep. Thus, keep in mind that after the release, the only force on the person, keys, and elevator is the force of gravity. If gravity is the only force acting on an object, will it be moving or staying still?
Student: it will be moving
Tutor: Fine. Not only are the person, keys, and elevator moving, they have only gravitational forces on them. When an object is falling and has only gravitational force on it, it is said to be in what?
Student: freefall
Tutor: Good. Thus, after the release the only force on the keys and man is the downward force of earth's gravity, so they are in freefall. Before the release, what is the relationship between the man's velocity and the keys' velocity?
Student: they're equal
Tutor: Correct. We know that after the release, the keys, the person, and the elevator are in freefall. Is freefall acceleration near earth a constant? Or does it vary depending upon which object is in freefall?
Student: it's a constant
Learning Correlations after Controlling for Pretest
Dependent Measure              Human Spoken (14)    Human Typed (20)
                               R        p           R       p
Ave. Stud. Words/Turn          -.209    .49         .515    .03
Intercept: Stud. Words/Turn    -.441    .13         .593    .01
Ave. Tut. Words/Turn           -.086    .78         .536    .02
Learning Correlations after Controlling for Pretest
Dependent Measure        Spoken (ITSPOKE)    Typed (Why2-Atlas)
                         R        p          R       p
Tot. Stud. Words         .394     .10        .050    .82
Tot. Subdialogues/KCD    -.018    .94        -.457   .03
ASR-Learning Correlations Key
• ASR MIS: ASR Misrecognition
• SEM MIS: Semantic Misrecognition
• TRUE REJ: True Rejection
• FALSE REJ: False Rejection
• REJ: Total Rejections (true or false)
• ASR PROB: ASR Misrecognition or Rejection
• SEM PROB: Semantic Misrecognition or Rejection
• TIMEOUT: Timeout
Correlation Results (20 students)
                      Learning        Time
SP Measure    Mean    R      p        R      p
# ASR MIS     32.3    .27    .26      .46    .05
# SEM MIS     6.7     -.05   .83      .25    .29
# TRUE REJ    6.5     -.22   .36      .18    .46
# FALSE REJ   1.7     -.26   .28      .27    .26
# REJ         8.2     -.24   .31      .21    .38
# ASR PROB    40.5    .06    .82      .41    .08
# SEM PROB    14.9    -.19   .43      .25    .31
# TIMEOUT     5.5     .30    .22      .45    .06
[Three charts: per-subject word error rates (0-100%) for subjects in the pre-recorded (“pr…”) and synthesized (“syn…”) voice conditions, each chart marking a “goat filter” word-error-rate threshold line.]
Current Directions

Online dialogue act annotation during computer tutoring
– Tutor acts can be authored
– Student acts need to be recognized

“Correctness” annotation
– Are more Deep Answers “incorrect” or “partially correct” in the human-human corpus?
– Do correct answers positively correlate with learning?

Beyond the turn level
– Learning correlations with dialogue act patterns (e.g., bigrams)
– Computation and use of discourse structure
Primary Research Question
How does speech-based dialogue interaction impact the effectiveness of tutoring systems for student learning?
Spoken Versus Typed Human and Computer Dialogue Tutoring (ITS 2004)
Human Tutoring: spoken dialogue yielded learning and efficiency gains
– Many differences in superficial dialogue characteristics

Computer Tutoring: spoken dialogue made less difference

Learning Correlations: few results
– Different dialogue characteristics correlate in human versus computer, and in spoken versus typed
Motivation
An empirical basis for authoring (or learning) optimal dialogue behaviors in spoken tutorial dialogue systems

Previous Approach: learning correlations with superficial dialogue characteristics
– Easy to compute automatically and in real time, but…
– Correlations in the literature did not generalize to our spoken or human-computer corpora
– Results were difficult to interpret
» e.g., do longer student turns contain more explanations?
Current Approach: – learning correlations with measures based on deeper dialogue codings
Current Empirical Studies

Does learning correlate with measures of automatic speech recognition (ASR) performance?
– ITSPOKE (Version 1) corpus

Does the use of pre-recorded audio rather than text-to-speech improve learning and other measures?
– ITSPOKE (Version 2) corpus
Can speech recognition “goats” be screened in advance?
User Survey: after (Baylor et al., 2003), (Walker et al., 2002)
• It was easy to learn from the tutor
• The tutor did not interfere with my understanding of the content
• The tutor believed I was knowledgeable
• The tutor was useful
• The tutor was effective in conveying ideas
• The tutor was precise in providing advice
• The tutor helped me to concentrate
• It was easy to understand the tutor
• I knew what I could say or do at each point in the conversations with the tutor
• The tutor worked the way I expected it to
• Based on my experience using the tutor to learn physics, I would like to use such a tutor regularly.
Response options: almost always, often, sometimes, rarely, almost never