TRANSCRIPT
Correlations with Learning in Spoken Tutoring Dialogues
Diane Litman
Learning Research and Development Center
and
Computer Science Department
University of Pittsburgh
Outline
Introduction
Dialogue Acts and Learning Correlations
Learning, Speech Recognition and Text-to-Speech
Current Directions and Summary
Motivation An empirical basis for optimizing dialogue behaviors in
spoken tutorial dialogue systems
What aspects of dialogue correlate with learning?
– Student behaviors
– Tutor behaviors
– Interacting student and tutor behaviors
Do correlations generalize across tutoring situations?
– Human-human tutoring
– Human-computer tutoring
Approach
Initial: learning correlations with superficial dialogue characteristics [Litman et al., Intelligent Tutoring Systems Conf., 2004]
– Easy to compute automatically and in real time, but…
– Correlations in the literature did not generalize to our spoken or human-computer corpora
– Results were difficult to interpret
» e.g., do longer student turns contain more explanations?
Current: learning correlations with deeper “dialogue act” codings [Forbes-Riley et al., Artificial Intelligence and Education Conf., 2005]
ITSPOKE (Version 1): Intelligent Tutoring SPOKEn Dialogue System
Back-end is text-based Why2-Atlas tutorial dialogue system (VanLehn et al., 2002)
Student speech digitized from microphone input; Sphinx2 speech recognizer
Tutor speech played via headphones/speakers; Cepstral text-to-speech synthesizer
Other additions: access to Why2-Atlas “internals”, speech recognition repairs, etc.
Two Spoken Tutoring Corpora

Human-Human Corpus
– 14 students
– 1 human tutor
– 128 physics problems (dialogues)
– 5948 student turns, 5505 tutor turns

Computer-Human Corpus
– 20 students
– ITSPOKE (Version 1) tutor
– 100 physics problems (dialogues)
– 2445 student turns, 2967 tutor turns
Outline
Introduction
Dialogue Acts and Learning Correlations
Learning, Speech Recognition and Text-to-Speech
Current Directions and Summary
Dialogue Acts
Dialogue Acts represent intentions behind utterances
– Both domain-independent and tutoring-specific tagsets
» e.g., Graesser and Person, 1994; Graesser et al., 1995; Chi et al., 2001
– Used in prior studies of correlations with learning
» e.g., tutor acts in AutoTutor (Jackson et al., 2004), dialogue acts in human tutoring (Chi et al., 2001)

ITSPOKE Study
– Student and tutor dialogue acts
– Unigrams and bigrams of dialogue acts
– Human and computer tutoring
– Spoken input and output
Tagset (1): (Graesser and Person, 1994)
• Tutor and Student Question Acts
Short Answer Question: basic quantitative relationships
Long Answer Question: definition/interpretation of concepts
Deep Answer Question: reasoning about causes/effects
Tagset (2): inspired by (Graesser et al., 1995)
• Tutor Feedback Acts
Positive Feedback: overt positive response
Negative Feedback: overt negative response
• Tutor State Acts
Restatement: repetitions and rewordings
Recap: restating earlier-established points
Request/Directive: directions for argument
Bottom Out: complete answer after problematic response
Hint: partial answer after problematic response
Expansion: novel details
Tagset (3): inspired by (Chi et al., 2001)
• Student Answer Acts
Deep Answer: at least 2 concepts with reasoning
Novel/Single Answer: one new concept
Shallow Answer: one given concept
Assertion: answers such as “I don’t know”
• Tutor and Student Non-Substantive Acts: do not contribute to physics discussion
Annotated Human-Human Excerpt
T: Which one will be faster? [Short Answer Question]
S: The feathers. [Novel/Single Answer]
T: The feathers - why? [Restatement, Deep Answer Question]
S: Because there’s less matter. [Deep Answer]
All turns in both corpora were manually coded for dialogue acts (Kappa > .6)
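The agreement figure above is Cohen's kappa over matched turn annotations; a minimal sketch of the computation (the two annotators' tag sequences below are illustrative, not from the corpora):

```python
from collections import Counter

def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators' tag sequences of equal length."""
    assert len(labels_a) == len(labels_b)
    n = len(labels_a)
    # Observed agreement: fraction of turns with identical tags
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Expected chance agreement, from each annotator's tag marginals
    counts_a, counts_b = Counter(labels_a), Counter(labels_b)
    expected = sum(counts_a[t] * counts_b[t] for t in counts_a) / (n * n)
    return (observed - expected) / (1 - expected)

# Hypothetical annotations of six student turns
a = ["Deep", "Shallow", "Deep", "Novel", "Shallow", "Deep"]
b = ["Deep", "Shallow", "Novel", "Novel", "Shallow", "Deep"]
print(round(cohens_kappa(a, b), 2))  # 0.75
```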
Correlations with Unigram Measures (student and tutor-centered analyses)

For each student, and each student and tutor dialogue act tag, compute
– Tag Total: number of turns containing the tag
– Tag Percentage: (tag total) / (turn total)
– Tag Ratio: (tag total) / (turns containing any tag of that type)
Correlate measures with posttest, after regressing out pretest
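The three per-student measures, plus the "regressing out pretest" step (a partial correlation), can be sketched as follows; function names and data structures are illustrative, not from the study's analysis code:

```python
def tag_measures(turns, tag, same_type_tags):
    """Unigram measures for one student's dialogue.

    turns: one set of dialogue-act tags per turn.
    same_type_tags: every tag of the same type as `tag`
    (e.g. all Question acts), used for the ratio denominator.
    """
    tag_total = sum(tag in t for t in turns)
    tag_percentage = tag_total / len(turns)
    type_total = sum(bool(t & same_type_tags) for t in turns)
    tag_ratio = tag_total / type_total if type_total else 0.0
    return tag_total, tag_percentage, tag_ratio

def _residuals(y, x):
    """Residuals of y after a simple linear regression on x."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    slope = (sum((a - mx) * (b - my) for a, b in zip(x, y))
             / sum((a - mx) ** 2 for a in x))
    return [b - (my + slope * (a - mx)) for a, b in zip(x, y)]

def _pearson(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)

def partial_corr(measure, posttest, pretest):
    """Correlate a measure with posttest after regressing out pretest."""
    return _pearson(_residuals(measure, pretest),
                    _residuals(posttest, pretest))
```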
Human-Computer Results (continued)
Tutor Dialogue Acts                 Mean     R      p
# Deep Answer Question              9.59     .41    .08
% Deep Answer Question              6.27%    .45    .05
% Question Act                      76.89%   .57    .01
(Short Answer Question)/Question    .88      -.47   .04
(Deep Answer Question)/Question     .08      .42    .07
# Positive Feedback                 76.10    .38    .10
Human-Human Results (14 students)
Student Dialogue Acts               Mean     R      p
# Novel/Single Answer               19.29    .49    .09
# Deep Answer                       68.50    -.49   .09
(Novel/Single Answer)/Answer        .14      .47    .10
(Short Answer Question)/Question    .91      .56    .05
(Long Answer Question)/Question     .03      -.57   .04
Human-Human Results (continued)
Tutor Dialogue Acts     Mean     R      p
# Request/Directive     19.86    -.71   .01
% Request/Directive     5.65%    -.61   .03
# Restatement           79.14    -.56   .05
# Negative Feedback     14.50    -.60   .03
Discussion
Computer Tutoring: knowledge construction
– Positive correlations
» Student answers displaying reasoning
» Tutor questions requiring reasoning

Human Tutoring: more complex
– Positive correlations
» Student utterances introducing a new concept
– Mostly negative correlations
» Student attempts at deeper reasoning
» Tutor attempts to direct the dialogue
Correlations with Bigram Measures (interaction-centered analyses)

For each student, and each tag sequence containing both a tutor and a student dialogue act, compute
– [Student Act_n – Tutor Act_n+1] Totals
» all bigrams constructed by pairing each Student Dialogue Act in turn n with each Tutor Dialogue Act in turn n+1
– [Tutor Act_n – Student Act_n+1] Totals
» all bigrams constructed by pairing each Tutor Dialogue Act in turn n with each Student Dialogue Act in turn n+1
Correlate measures with posttest, after regressing out pretest
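The bigram totals can be read directly off adjacent turns; a sketch using the annotated human-human excerpt from earlier (the data structures are illustrative):

```python
from collections import Counter

def bigram_totals(turns):
    """Totals of [Act_n – Act_n+1] pairs that cross speakers.

    turns: list of (speaker, acts) pairs, speaker 'S' or 'T',
    acts a list of dialogue-act labels for that turn.
    """
    totals = Counter()
    for (spk1, acts1), (spk2, acts2) in zip(turns, turns[1:]):
        if spk1 == spk2:          # keep student-tutor interactions only
            continue
        for a in acts1:
            for b in acts2:       # pair every act in turn n with turn n+1
                totals[(spk1, a, spk2, b)] += 1
    return totals

# The annotated human-human excerpt from earlier
dialogue = [("T", ["Short Answer Question"]),
            ("S", ["Novel/Single Answer"]),
            ("T", ["Restatement", "Deep Answer Question"]),
            ("S", ["Deep Answer"])]
counts = bigram_totals(dialogue)
print(counts[("T", "Deep Answer Question", "S", "Deep Answer")])  # 1
```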
Bigram Results
Many bigrams incorporate, as either the first or second element, a dialogue act corresponding to one of the unigram results, e.g.
– [Student Deep Answer – Tutor Deep Answer Question]
– [Tutor Recap – Student Deep Answer]

Other dialogue acts only correlate with learning as part of a larger dialogue pattern, e.g.
– [Student Shallow Answer – Tutor Restatement]
– [Tutor Restatement – Student Shallow Answer]
Discussion
Computer Tutoring
– n-grams seem able to capture effective learning patterns in this simpler corpus

Human Tutoring
– despite mostly negative correlations, students are learning!
– suggests effective learning patterns are too complicated to be captured with n-grams
Current Directions
“Correctness” annotation
– Are more Deep Answers “incorrect” or “partially correct” in the human-human corpus?
– Do correct answers positively correlate with learning?

Beyond the turn level
– Correlations with larger dialogue act patterns (e.g., trigrams, n-grams)
– Computation and use of hierarchical discourse structure
Outline
Introduction
Dialogue Acts and Learning Correlations
Learning, Speech Recognition and Text-to-Speech
Current Directions and Summary
Research Question
Does the performance of a system’s speech recognizer and/or text-to-speech system relate to learning?
– Speech recognition accuracy correlates with user satisfaction in non-tutoring systems (Litman and Pan, 2002; Walker et al., 2002)
– The nature of a computer’s voice relates to learning and motivation in pedagogical agents (Baylor et al., 2003; Atkinson)
ASR Performance Measures (1): Rejections
ITSPOKE: Therefore, what is the magnitude of this gravitational force in the horizontal direction?
STUDENT: significant
ASR: significant (False Rejection)
ITSPOKE: Could you please repeat that?
STUDENT: great
ASR: crate (True Rejection)
ITSPOKE: I'm sorry, I'm having trouble understanding you. Please try again.
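Whether a rejection was "true" or "false" can only be decided after the fact, by comparing the discarded ASR hypothesis against a human transcription of the turn; a sketch of that labeling (illustrative only, not the study's evaluation code):

```python
def classify_rejection(transcript, asr_hypothesis):
    """Label a turn the recognizer rejected for low confidence.

    The rejection was 'false' if the discarded hypothesis actually
    matched the human transcription, 'true' if it was wrong anyway.
    """
    return ("False Rejection" if asr_hypothesis == transcript
            else "True Rejection")

# The two rejected turns from the excerpt above
print(classify_rejection("significant", "significant"))  # False Rejection
print(classify_rejection("great", "crate"))              # True Rejection
```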
ASR Performance Measures (2): “Transcription” Misrecognitions
ITSPOKE: Yeah. Does the packet have an acceleration? If yes, please specify its direction.
STUDENT: yes downward
ASR: is downward
word error rate: 50%
binary word error: True
ITSPOKE: <omitted> How would you describe the vertical component of the packet's velocity? (e.g., decreasing, zero, etc.)
STUDENT: increasing
ASR: decreasing
word error rate: 100%
binary word error: True
ASR Performance Measures (3):“Semantic” Misrecognitions
ITSPOKE: Yeah. Does the packet have an acceleration? If yes, please specify its direction.
STUDENT: yes downward     NLU(student): downward
ASR: is downward          NLU(ASR): downward
word error rate: 50%      semantic error rate: 0%
binary word error: True   binary semantic error: False
ITSPOKE: <omitted> How would you describe the vertical component of the packet's velocity? (e.g., decreasing, zero, etc.)
STUDENT: increasing       NLU(student): increase
ASR: decreasing           NLU(ASR): decrease
word error rate: 100%     semantic error rate: 100%
binary word error: True   binary semantic error: True
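Word error rate here is the standard word-level Levenshtein distance normalized by reference length, and binary semantic error simply compares the NLU outputs; a sketch reproducing the numbers above (not the study's scoring code):

```python
def word_error_rate(reference, hypothesis):
    """Word-level Levenshtein distance divided by reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j]: edits to turn the first i ref words into first j hyp words
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,         # deletion
                           dp[i][j - 1] + 1,         # insertion
                           dp[i - 1][j - 1] + cost)  # substitution
    return dp[-1][-1] / len(ref)

def binary_semantic_error(nlu_student, nlu_asr):
    """True when misrecognition changed the meaning the NLU extracts."""
    return nlu_student != nlu_asr

print(word_error_rate("yes downward", "is downward"))    # 0.5
print(word_error_rate("increasing", "decreasing"))       # 1.0
print(binary_semantic_error("downward", "downward"))     # False
```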
Correlations with Learning?
Computed totals, percentages, and ratios of:
– Rejections
» False, True, Both
– Misrecognitions
» Transcription and Semantic
» Word and Binary
– ASR Problems (Rejections + Misrecognitions)
» Transcription and Semantic
» Word and Binary
Found no significant correlations or trends!
ITSPOKE (Version 2)
Prerecorded Output (human voice) versus Synthesized Output (text-to-speech)
ITSPOKE: Terrific. Let's try the original question again. If gravity is the only force acting on an object, will it be moving or staying still?
STUDENT: moving (ASR: moving)
ITSPOKE: Yes. Not only are the person, keys, and elevator moving, they have only gravitational forces acting on them. When an object is falling and has only gravitational force on it, it is said to be in what?
STUDENT: free fall (ASR: free fall)
New Computer Tutoring Experiment
Same subject pool, physics problems, web interface, and experimental procedure as before, except
– ITSPOKE (Version 2)
– Pre-recorded voice condition: 30 students (150 dialogues)
– Text-to-speech condition: 29 students (145 dialogues)
Outline
Introduction
Dialogue Acts and Learning Correlations
Learning, Speech Recognition and Text-to-Speech
Current Directions and Summary
Summary
Many dialogue act correlations
– positive correlations with deep reasoning and questioning in computer tutoring
– correlations in human tutoring more complex
– student, tutor, and interactive perspectives all useful

No correlations with ASR problems

Stay tuned…
– New dialogue act patterns and “correctness” analysis
– Pre-recorded versus text-to-speech
Acknowledgments
The ITSPOKE Group
– Staff: Scott Silliman
– Research Associates: Kate Forbes-Riley, Joel Tetreault, Alison Huettner (consultant)
– Graduate Students: Ai Hua, Beatriz Maeireizo, Amruta Purandare, Mihai Rotaru, Arthur Ward
Kurt VanLehn and the Why2 Team
Hypotheses
Compared to typed dialogues, spoken interactions will yield better learning gains, and will be more efficient and natural
Different student behaviors will correlate with learning in spoken versus typed dialogues, and will be elicited by different tutor actions
Findings in human-human and human-computer dialogues will vary as a function of system performance
Motivation
Working hypothesis regarding learning gains
– Human Dialogue > Computer Dialogue > Text

Most human tutoring involves face-to-face spoken interaction, while most computer dialogue tutors are text-based
– Evens et al., 2001; Zinn et al., 2002; VanLehn et al., 2002; Aleven et al., 2001
Can the effectiveness of dialogue tutorial systems be further increased by using spoken interactions?
Potential Benefits of Speech

Self-explanation correlates with learning and occurs more in speech
– Hausmann and Chi, 2002
Speech contains prosodic information, providing new sources of information for dialogue adaptation – Forbes-Riley and Litman, 2004
Spoken computational environments may prime a more social interpretation that enhances learning– Moreno et al., 2001; Graesser et al., 2003
Potential for hands-free interaction – Smith, 1992; Aist et al., 2003
Spoken Tutorial Dialogue Systems
Recent tutoring systems have begun to add spoken language capabilities
– Rickel and Johnson, 2000; Graesser et al., 2001; Mostow and Aist, 2001; Aist et al., 2003; Fry et al., 2001; Schultz et al., 2003
However, little empirical analysis of the learning ramifications of using speech
Architecture
[Architecture diagram: a www browser connects through a www server to javaITSpoke, whose Text Manager and Spoken Dialogue Manager mediate between the student and the Why2 back-end. Student speech goes to Speech Analysis (Sphinx); tutor turns are synthesized by Cepstral. Essays go to Essay Analysis (Carmel, Tacitus-lite+), and dialogue is driven by the Content Dialogue Manager (Ape, Carmel); the components exchange essay text, student text (xml), tutor turns (xml), repair goals, and tutorial goals.]
Speech Recognition: Sphinx2 (CMU)
Probabilistic language models for different dialogue states

Initial training data
– typed student utterances from Why2-Atlas corpora

Later training data
– spoken utterances obtained during development and pilot testing of ITSPOKE

Total vocabulary
– 1240 unique words

“Semantic Accuracy” Rate = 92.4%
Speech Synthesis: Cepstral
Commercial outgrowth of Festival text-to-speech synthesizer (Edinburgh, CMU)
Required additional processing of Why2-Atlas prompts (e.g., f=m*a)
Common Experimental Aspects
Students take a physics pretest
Students read background material
Students use web interface to work through up to 10 problems with either a computer or a human tutor
Students take a posttest
– 40 multiple choice questions, isomorphic to pretest
ITSPOKE Corpora Comparison: Human-Human versus Human-Computer
Human-Human excerpt …1.3 minutes into session…
Student: Can I ask you questions?
Tutor: Yes
Student: Is that ok?
Tutor: Mm-hm
Student: Um from what we learned from the last problem they're gonna have the same- the keys and you have the same acceleration right?
Tutor: Yes
Student: So that means they'll both hit the bottom at the same time. But I don't understand where- I don't understand if you're in the elevator-
Tutor: You see
Student: Where are you going to-?
Tutor: The uh let me uh the key uh- the person holds the key in front of-
Student: Their face yeah-
Tutor: Uh his or her face and then lets it uh uh it is let go so the the the question relates to the relative position of the key and the face uh-
Student: So the key and the face-
Tutor: Uh how will they- as they go- as they both fall what will be their relative position? That is the question.
Student: (sigh)
Human-Computer excerpt …3.5 minutes into session…
Tutor: I didn't understand. What force acts on all objects within close proximity to earth?
Student: uh the force of gravity
Tutor: Fine. Besides the Earth's gravitational force, are there any other forces acting on the person, keys, and elevator after the release?
Student: no
Tutor: Yep. Thus, keep in mind that after the release, the only force on the person, keys, and elevator is the force of gravity. If gravity is the only force acting on an object, will it be moving or staying still?
Student: it will be moving
Tutor: Fine. Not only are the person, keys, and elevator moving, they have only gravitational forces on them. When an object is falling and has only gravitational force on it, it is said to be in what?
Student: freefall
Tutor: Good. Thus, after the release the only force on the keys and man is the downward force of earth's gravity, so they are in freefall. Before the release, what is the relationship between the man's velocity and the keys' velocity?
Student: they're equal
Tutor: Correct. We know that after the release, the keys, the person, and the elevator are in freefall. Is freefall acceleration near earth a constant? Or does it vary depending upon which object is in freefall?
Student: it's a constant
Learning Correlations after Controlling for Pretest
Dependent Measure              Human Spoken (14)    Human Typed (20)
                               R        p           R       p
Ave. Stud. Words/Turn          -.209    .49         .515    .03
Intercept: Stud. Words/Turn    -.441    .13         .593    .01
Ave. Tut. Words/Turn           -.086    .78         .536    .02
Learning Correlations after Controlling for Pretest
Dependent Measure        Spoken (ITSPOKE)    Typed (Why2-Atlas)
                         R        p          R       p
Tot. Stud. Words         .394     .10        .050    .82
Tot. Subdialogues/KCD    -.018    .94        -.457   .03
ASR-Learning Correlations Key
• ASR MIS: ASR Misrecognition
• SEM MIS: Semantic Misrecognition
• TRUE REJ: True Rejection
• FALSE REJ: False Rejection
• REJ: Total Rejections (true or false)
• ASR PROB: ASR Misrecognition or Rejection
• SEM PROB: Semantic Misrecognition or Rejection
• TIMEOUT: Timeout
Correlation Results (20 students)
                      Learning        Time
SP Measure    Mean    R      p        R      p
# ASR MIS     32.3    .27    .26      .46    .05
# SEM MIS     6.7     -.05   .83      .25    .29
# TRUE REJ    6.5     -.22   .36      .18    .46
# FALSE REJ   1.7     -.26   .28      .27    .26
# REJ         8.2     -.24   .31      .21    .38
# ASR PROB    40.5    .06    .82      .41    .08
# SEM PROB    14.9    -.19   .43      .25    .31
# TIMEOUT     5.5     .30    .22      .45    .06
[Three charts: per-subject word error rates (0-100%) for subjects in the pre-recorded (“pr…”) and synthesized (“syn…”) voice conditions, each chart marking a “goat filter” word-error-rate threshold line.]
Current Directions

Online dialogue act annotation during computer tutoring
– Tutor acts can be authored
– Student acts need to be recognized

“Correctness” annotation
– Are more Deep Answers “incorrect” or “partially correct” in the human-human corpus?
– Do correct answers positively correlate with learning?

Beyond the turn level
– Learning correlations with dialogue act patterns (e.g., bigrams)
– Computation and use of discourse structure
Primary Research Question
How does speech-based dialogue interaction impact the effectiveness of tutoring systems for student learning?
Spoken Versus Typed Human and Computer Dialogue Tutoring (ITS 2004)
Human Tutoring: spoken dialogue yielded learning and efficiency gains
– Many differences in superficial dialogue characteristics

Computer Tutoring: spoken dialogue made less difference

Learning Correlations: few results
– Different dialogue characteristics correlate in human versus computer, and in spoken versus typed
Motivation
An empirical basis for authoring (or learning) optimal dialogue behaviors in spoken tutorial dialogue systems

Previous Approach: learning correlations with superficial dialogue characteristics
– Easy to compute automatically and in real time, but…
– Correlations in the literature did not generalize to our spoken or human-computer corpora
– Results were difficult to interpret
» e.g., do longer student turns contain more explanations?
Current Approach: – learning correlations with measures based on deeper dialogue codings
Current Empirical Studies

Does learning correlate with measures of automatic speech recognition (ASR) performance?
– ITSPOKE (Version 1) corpus

Does the use of pre-recorded audio rather than text-to-speech improve learning and other measures?
– ITSPOKE (Version 2) corpus
Can speech recognition “goats” be screened in advance?
User Survey: after (Baylor et al., 2003), (Walker et al., 2002)
• It was easy to learn from the tutor
• The tutor did not interfere with my understanding of the content
• The tutor believed I was knowledgeable
• The tutor was useful
• The tutor was effective in conveying ideas
• The tutor was precise in providing advice
• The tutor helped me to concentrate
• It was easy to understand the tutor
• I knew what I could say or do at each point in the conversations with the tutor
• The tutor worked the way I expected it to
• Based on my experience using the tutor to learn physics, I would like to use such a tutor regularly.
Response options: almost always, often, sometimes, rarely, almost never