DUTIE Speech: Determining Utility Thresholds for Information Extraction from Speech


Page 1: DUTIE Speech:  Determining Utility Thresholds for Information Extraction from Speech

DUTIE Speech: Determining Utility Thresholds for Information Extraction from Speech

John Makhoul, Rich Schwartz, Alex Baron, Ivan Bulyko, Long Nguyen, Lance Ramshaw, Dave Stallard, Bing Xiang

Page 2: DUTIE Speech:  Determining Utility Thresholds for Information Extraction from Speech

Objective

- Estimate the speech recognition accuracy required to support utility in the form of question answering (QA)
- Follow-on to the earlier DUTIE study on text
  – Entities and relations were extracted into a database, which was used by human subjects for a QA task
  – Measured human QA performance as a function of information extraction (IE) scores
- Extension to speech recognition
  – Measure the effect of speech recognition errors on IE scores
  – Assume the same relation between IE scores and QA; infer the effect of speech recognition on QA performance

Page 3: DUTIE Speech:  Determining Utility Thresholds for Information Extraction from Speech

Original DUTIE Study with Text Input

- Databases of fully automated IE and manual annotation
  – Populated with entities, relations, and co-reference links
  – 946 articles
- The two databases were blended to produce a continuum of database qualities, as measured by
  – Entity Value Score (EVS)
  – Relation Value Score (RVS)
- For each database, measured human performance
  – QA performance
  – Time taken to answer each question, in seconds

Page 4: DUTIE Speech:  Determining Utility Thresholds for Information Extraction from Speech

DUTIE Results

- Need to reduce the IE error rate by about half to achieve 70% QA performance (see the sketch after the figure)

[Figure: QA performance (left axis, 0.50-1.00) and time per question in seconds (right axis, 135-210) vs. extraction blend, 0% to 100%, with (Entity, Relationship Value Score) pairs from (56, 27) up to (100, 100). QA rises from 0.55 to 0.80 while time per question falls from 203 to 140 seconds. Quadratic fit to the QA scores (x = blend fraction): y = -0.1328x^2 + 0.38x + 0.553, R^2 = 0.9957. Annotations: target accuracy 0.70; inferred IE required: a 46% blend, i.e. value scores of about (75, 48).]
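As a sanity check on the 46% annotation, the fitted quadratic can be solved for the blend at which predicted QA performance reaches the 0.70 target. A minimal sketch, assuming only the fit reported in the figure:

```python
# Solve -0.1328*x^2 + 0.38*x + 0.553 = 0.70 for the extraction
# blend x (0 = today's automated IE, 1 = perfect manual annotation).
# Coefficients are the quadratic fit from the DUTIE results figure.
import math

a, b, c = -0.1328, 0.38, 0.553 - 0.70
x = (-b + math.sqrt(b * b - 4 * a * c)) / (2 * a)
print(f"blend needed for 0.70 QA: {x:.0%}")  # -> ~46%
```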

Page 5: DUTIE Speech:  Determining Utility Thresholds for Information Extraction from Speech

Relative QA Performance vs. EVS

- Same results, just scaled by the QA score obtained with perfect IE

  QA_rel = -1.548 * EVS(ref)^2 + 3.26 * EVS(ref) - 0.715

Page 6: DUTIE Speech:  Determining Utility Thresholds for Information Extraction from Speech

DUTIE Speech Corpus

  Data        #articles   #hours
  TDT           578        15.5
  Newswire      368        19.2
  All           946        34.7

- The DUTIE speech corpus consists of 946 articles with 34.7 hours of audio data in total
  – Same articles as in the original DUTIE study
  – 15.5 hours of TDT broadcast news data
    • ABC, CNN, PRI, VOA (Jan. 1998 ~ June 1998, Oct. 2000 ~ Dec. 2000)
    • MNB, NBC (Oct. 2000 ~ Dec. 2000)
  – 19.2 hours of Newswire read speech recorded at LDC
    • APW, NYT (Feb. 1998 ~ June 1998, Oct. 2000 ~ Dec. 2000)

Page 7: DUTIE Speech:  Determining Utility Thresholds for Information Extraction from Speech

DUTIE Speech Process

- Speech recognition
  – Takes audio; outputs text in SNOR format
  – Run at four different levels of accuracy
- Punctuation
  – Takes recognition output; adds periods/commas
  – Two methods: forced alignment vs. automatic punctuation
- Information extraction (IE)
  – Takes punctuated text and finds entities and relations
  – Produces ACE Program Format (APF) XML
- Scoring IE
  – Compares test and reference APFs and computes the Entity Value Score and Relation Value Score

Page 8: DUTIE Speech:  Determining Utility Thresholds for Information Extraction from Speech

Block Diagram

[Diagram: Speech → Recognizer → Text. The text feeds both Forced Punctuation (which also takes the Reference Text) and Automatic Punctuation, each producing Punctuated Text. Punctuated Text → Information Extraction → Entities and Relations (APFs) → APF Aligner (which also takes the Reference Text) → APFs → Scorer (which also takes the Reference APFs) → Value Scores.]

Page 9: DUTIE Speech:  Determining Utility Thresholds for Information Extraction from Speech

Speech Recognition

- Four systems were used to produce a range of word error rates (WERs)
  – System I: BBN RT04 stand-alone 10xRT system, with heavily-weighted DUTIE text in language model training (cheating)
  – System II: BBN RT04 stand-alone 10xRT system, with normally-weighted DUTIE text in language model training (some cheating)
  – System III: BBN RT02 system (fair)
  – System IV: BBN RT02 system, with decreased grammar weight in decoding (degraded)

  System   Training (hrs)   TDT WER (%)   Newswire WER (%)   Average WER (%)
  I            1700             8.9             6.6               7.6
  II           1700            11.7            10.0              10.8
  III           140            19.2            16.5              17.7
  IV            140            25.4            23.1              24.1

Page 10: DUTIE Speech:  Determining Utility Thresholds for Information Extraction from Speech

Sentence Boundary Detection Model

- Sentence boundaries included periods, question marks, and exclamation points
- Use a 3-gram LM to compute the probability of a sentence boundary at each word position [Stolcke 1996]
- Training data
  – TDT3 closed captions (12M words)
  – HUB4 transcripts (120M words)
  – Gigaword news articles from 2000 (100M words)
- Use Viterbi to find the most likely sequence of tags (see the sketch after this list)
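The slide cites the hidden-event approach of [Stolcke 1996] without implementation details; below is a minimal sketch of the idea, where the boundary token <s> is treated as a hidden event between words and Viterbi search over trigram histories picks the best tag sequence. The `lm_logprob` stub is purely illustrative, standing in for a real trigram LM trained on the data listed above.

```python
import math

def lm_logprob(word, context):
    """Stub trigram log-probability log P(word | context).
    A real system would query an LM trained on punctuated text
    (TDT3 captions, HUB4, Gigaword); these toy values just make
    the sketch runnable."""
    return math.log(0.1) if word == "<s>" else math.log(0.9)

def detect_boundaries(words):
    """Return positions i where a sentence boundary is hypothesized
    between words[i-1] and words[i]."""
    # Viterbi state: the last two emitted tokens (words or <s>).
    # beam[state] = (best log-prob, boundary decisions so far)
    beam = {("<s>", words[0]): (lm_logprob(words[0], ("<s>",)), [])}
    for word in words[1:]:
        new_beam = {}
        for (w1, w2), (lp, tags) in beam.items():
            # Hypothesis 1: no boundary before this word.
            cands = [((w2, word),
                      lp + lm_logprob(word, (w1, w2)),
                      tags + [False])]
            # Hypothesis 2: a hidden <s> event before this word.
            cands.append((("<s>", word),
                          lp + lm_logprob("<s>", (w1, w2))
                             + lm_logprob(word, (w2, "<s>")),
                          tags + [True]))
            for state, score, t in cands:
                if state not in new_beam or score > new_beam[state][0]:
                    new_beam[state] = (score, t)
        beam = new_beam
    _, tags = max(beam.values())
    return [i for i, is_boundary in enumerate(tags, start=1) if is_boundary]

print(detect_boundaries("this is a test it works".split()))
```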

Page 11: DUTIE Speech:  Determining Utility Thresholds for Information Extraction from Speech

Automatic Punctuation Results

- A 3-gram word LM gives a near-state-of-the-art period error rate (the state of the art is 60%, as reported at RT-04)
- Punctuation performance is sensitive to WER (in part because the LM is trained on errorless text)
- Further improvements are possible with new models or prosodic features

  WER (%)   Period Error Rate (%)
    0.0            58
    7.6            60
   10.8            62
   17.7            65
   24.1            68

  (State-of-the-art ASR)

Page 12: DUTIE Speech:  Determining Utility Thresholds for Information Extraction from Speech

Reference Punctuation

- Tokenize the reference into words labeled with punctuation triplets:
  1) Punctuation attached to the beginning of the word
  2) Punctuation attached to the end of the word
  3) Unattached punctuation (e.g., hyphens) to the right of the word
- Align reference and hypothesis words
- Attach each reference word’s punctuation to the hypothesis word it is aligned to (a sketch follows the example)

  Ref text: Hello, I’m looking for a size ten shoe. I prefer black, and don’t care about price.
  ASR out:  JELLO I’M LOOKING FOR * SHOE I PREFER * AND DON’T CARE ABOUT PRICE
  Output:   JELLO, I’M LOOKING FOR SHOE. I PREFER, AND DON’T CARE ABOUT PRICE.
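A minimal sketch of this transfer, with Python's difflib standing in for the real word aligner and only the end-attached punctuation of the triplet scheme handled. One simplification to note: punctuation on deleted reference words (like the comma on "black,") is dropped here rather than moved to a neighboring word as in the slide's output.

```python
import difflib
import re

def split_punct(token):
    """Split a reference token into (word, end punctuation)."""
    word, punct = re.match(r"^(.*?)([.,!?]*)$", token).groups()
    return word, punct

def transfer_punctuation(ref_text, hyp_words):
    """Copy each reference word's end punctuation onto the
    hypothesis word it aligns to."""
    ref_pairs = [split_punct(t) for t in ref_text.split()]
    ref_words = [w.lower() for w, _ in ref_pairs]
    out = list(hyp_words)
    sm = difflib.SequenceMatcher(
        a=ref_words, b=[w.lower() for w in hyp_words])
    for op, i1, i2, j1, j2 in sm.get_opcodes():
        if op in ("equal", "replace"):  # matches and substitutions
            for i, j in zip(range(i1, i2), range(j1, j2)):
                out[j] = hyp_words[j] + ref_pairs[i][1]
    return " ".join(out)

ref = ("Hello, I'm looking for a size ten shoe. "
       "I prefer black, and don't care about price.")
hyp = "JELLO I'M LOOKING FOR SHOE I PREFER AND DON'T CARE ABOUT PRICE"
print(transfer_punctuation(ref, hyp.split()))
```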

Page 13: DUTIE Speech:  Determining Utility Thresholds for Information Extraction from Speech

Information Extraction

- Finds entities and the relations between them
- Identifies entities by character offset intervals in the input text file
  – Character offsets are defined literally: all whitespace and punctuation is counted!
- Produces ACE Program Format (APF) XML (see the offset note after the example)

  MOSCOW (AP) _ Presidents Leonid Kuchma of Ukraine and Boris Yeltsin of Russia signed an economic cooperation plan Friday
  ``We have covered the entire list of questions and discussed how we will be tackling them,'' Yeltsin was quoted as saying.

  .....
  <entity ID="2" TYPE="GPE" SUBTYPE="Other">
    <entity_mention TYPE="NAM" ID="104-1">
      <extent>
        <charseq START="75" END="82"></charseq>
      </extent>
    </entity_mention>
  </entity>
  ....
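For concreteness, a toy illustration of what "literal" offsets mean; the offsets here are chosen for this truncated string (the APF's 75-82 refers to the original file, which contains text not shown above), and the END offset is, to our knowledge, inclusive in APF charseq elements:

```python
# Every character counts, including whitespace and punctuation,
# and END is the offset of the last character of the mention.
text = "MOSCOW (AP) _ Presidents Leonid Kuchma of Ukraine and ..."
start, end = 42, 48            # offsets of "Ukraine" in this toy string
print(text[start:end + 1])     # -> Ukraine
```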

Page 14: DUTIE Speech:  Determining Utility Thresholds for Information Extraction from Speech

Scoring IE, Part I

- The IE scoring program compares the character offset intervals of entities in the reference and test APFs
  – Requires 30% overlap
- Problem #1: Character offsets in the reference APFs reflect all whitespace formatting in the original text file
  – But the recognizer output will have different character offsets, so the test offsets will be wrong
- Solution (sketched below):
  1. Align the words in the reference and test texts
  2. Based on this alignment, compute a character offset mapping between reference and test
  3. Change the character positions in the test APF using the mapping
  4. Compute IE scores
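A minimal sketch of steps 1-2, assuming whitespace tokenization and, again, difflib as a stand-in for the real word aligner; a real implementation would then rewrite each charseq START/END through this mapping (step 3) before scoring:

```python
import difflib

def word_spans(text):
    """(start, end) character span of each whitespace token,
    with every character counted literally and end inclusive."""
    spans, pos = [], 0
    for tok in text.split():
        start = text.index(tok, pos)
        spans.append((start, start + len(tok) - 1))
        pos = start + len(tok)
    return spans

def char_offset_map(test_text, ref_text):
    """Map test char offsets to reference char offsets at the
    boundaries of aligned words (matches and substitutions)."""
    test_spans, ref_spans = word_spans(test_text), word_spans(ref_text)
    sm = difflib.SequenceMatcher(
        a=[w.lower() for w in test_text.split()],
        b=[w.lower() for w in ref_text.split()])
    mapping = {}
    for op, i1, i2, j1, j2 in sm.get_opcodes():
        if op in ("equal", "replace"):
            for i, j in zip(range(i1, i2), range(j1, j2)):
                mapping[test_spans[i][0]] = ref_spans[j][0]  # starts
                mapping[test_spans[i][1]] = ref_spans[j][1]  # ends
    return mapping

# Step 3 would then be: charseq.START = mapping[charseq.START], etc.,
# for every charseq in the test APF.
```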

Page 15: DUTIE Speech:  Determining Utility Thresholds for Information Extraction from Speech

Scoring IE, Part II

- Problem #2: The IE scoring program compares only the character offset intervals, not the words in them
  – So it may ignore word errors in a name
    • “George Hush” vs. “George Bush”
- Solution: Modify the scoring program to also require a match of the alphanumeric characters in the test and reference character intervals
  – Modification courtesy of George Doddington
  – Requires 50% content overlap (see the sketch below)
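The exact overlap measure used by the modified scorer is not given on the slide; the following is one plausible reading, a rough sketch comparing bags of alphanumeric characters, and the real measure may well differ:

```python
from collections import Counter

def alnum_overlap(ref_extent, test_extent):
    """Fraction of alphanumeric characters shared between the two
    extents (a guess at the measure; the real scorer may differ)."""
    ref = Counter(c.lower() for c in ref_extent if c.isalnum())
    test = Counter(c.lower() for c in test_extent if c.isalnum())
    shared = sum((ref & test).values())
    return shared / max(sum(ref.values()), sum(test.values()), 1)

print(alnum_overlap("George Bush", "George Bush"))  # 1.0
print(alnum_overlap("George Bush", "Jaw Judge"))    # 0.3, fails 50%
```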

Page 16: DUTIE Speech:  Determining Utility Thresholds for Information Extraction from Speech

Detailed Results

  WER (%)   Punctuation                   Entity Score (%)   Relation Score (%)
   0.0      All punctuation                    59.7               27.3
   0.0      Correct period and comma           58.9               25.9
   0.0      Correct period, no comma           53.4               21.3
   0.0      Automatic period, no comma         51.7               19.9
   7.6      All punctuation                    49.3               22.9
   7.6      Correct period and comma           48.5               21.4
   7.6      Automatic period, no comma         42.7               17.8
  10.8      All punctuation                    46.9               22.0
  10.8      Correct period and comma           46.0               20.5
  10.8      Correct period, no comma           41.5               17.7
  10.8      Automatic period, no comma         40.5               17.4
  17.7      All punctuation                    39.1               17.8
  17.7      Correct period and comma           38.0               16.6
  17.7      Automatic period, no comma         33.9               14.3
  24.1      All punctuation                    31.0               15.2
  24.1      Correct period and comma           30.2               14.4
  24.1      Automatic period, no comma         26.1               11.8

Page 17: DUTIE Speech:  Determining Utility Thresholds for Information Extraction from Speech

Effect of Punctuation on Entity Value Score

- Sentence boundaries are required, but their locations are not critical (the loss is 2.8% relative with a 62% period error rate)
- Loss of commas results in a 9.5% relative reduction in Entity score
  – Reflects the importance of appositives to IE (“George W. Bush, President of the United States, said this morning …”)

[Figure: Entity Value Score at 0% and 10.8% word error under four conditions: All Punctuation; Correct Period and Comma; Correct Period, No Comma; Automatic Period, No Comma.]

Page 18: DUTIE Speech:  Determining Utility Thresholds for Information Extraction from Speech

Entity Value Scores as a Function of WER

- The effect of WER on Entity score is linear (see the fit check after the figure)
- The loss for automatic punctuation relative to reference punctuation is 13.5% relative

[Figure: Entity Value Score vs. word error rate (0-30%). Reference punctuation points: 0.597, 0.493, 0.469, 0.391, 0.310; linear fit y = -1.166x + 0.592, R^2 = 0.996. Automatic punctuation points: 0.517, 0.427, 0.405, 0.339, 0.261; linear fit y = -1.035x + 0.514, R^2 = 0.996.]
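These fits can be reproduced from the detailed-results table; a quick sketch with numpy, where "reference punctuation" corresponds to the all-punctuation entity scores:

```python
# Reproduce the reference-punctuation line from the detailed results
# table (all-punctuation entity scores) with a least-squares fit.
import numpy as np

wer = np.array([0.000, 0.076, 0.108, 0.177, 0.241])
evs = np.array([0.597, 0.493, 0.469, 0.391, 0.310])
slope, intercept = np.polyfit(wer, evs, 1)
print(f"EVS ~ {slope:.3f} * WER + {intercept:.3f}")
# -> EVS ~ -1.166 * WER + 0.592, matching the figure
```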

Page 19: DUTIE Speech:  Determining Utility Thresholds for Information Extraction from Speech

Relation Value Scores as a Function of WER

- The loss for automatic punctuation relative to reference punctuation is 25% relative

[Figure: Relation Value Score vs. word error rate (0-30%). Reference punctuation points: 0.273, 0.229, 0.220, 0.178, 0.152; linear fit y = -0.505x + 0.271, R^2 = 0.994. Automatic punctuation points: 0.199, 0.178, 0.174, 0.143, 0.118; linear fit y = -0.340x + 0.203, R^2 = 0.979.]

Page 20: DUTIE Speech:  Determining Utility Thresholds for Information Extraction from Speech

Relation Between WER and IE Scores

- Entity Value Score (EVS) and Relation Value Score (RVS) are linear functions of WER:

    EVS(WER) = EVS(0) * (1 - 2 * WER)
    RVS(WER) = RVS(0) * (1 - 1.7 * WER)

- Automatic punctuation has a multiplicative effect on the scores:

    EVS(auto punc) = 0.865 * EVS(ref, 0) * (1 - 2 * WER)
    RVS(auto punc) = 0.75 * RVS(ref, 0) * (1 - 1.7 * WER)

- Relative QA as a function of EVS:

    QA_rel = -1.548 * EVS(ref)^2 + 3.26 * EVS(ref) - 0.715

Page 21: DUTIE Speech:  Determining Utility Thresholds for Information Extraction from Speech

Predicted Relative QA vs. WER and EVS(ref)

- At 12% WER with today’s IE, we get 33% of the maximum QA
  – Near zero at 25% WER (e.g., for non-English)
- With half the IE error rate, half the WER, and half the loss from punctuation, we estimate 72% of the maximum QA (see the sketch below)
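A minimal sketch chaining the fits from the previous slide to reproduce these predictions; the starting value EVS(ref, 0) = 0.597 (today's IE on correctly punctuated reference text) comes from the detailed results table, and the other constants are the fitted values above:

```python
def evs_auto_punc(wer, evs_ref0=0.597):
    """EVS with automatic punctuation at a given WER, per the
    multiplicative model 0.865 * EVS(ref, 0) * (1 - 2 * WER)."""
    return 0.865 * evs_ref0 * (1.0 - 2.0 * wer)

def relative_qa(evs):
    """Relative QA from the quadratic fit to the original DUTIE data."""
    return -1.548 * evs**2 + 3.26 * evs - 0.715

print(f"{relative_qa(evs_auto_punc(0.12)):.0%}")  # -> 33% of max QA
print(f"{relative_qa(evs_auto_punc(0.25)):.0%}")  # -> 2%, near zero
```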

Page 22: DUTIE Speech:  Determining Utility Thresholds for Information Extraction from Speech

Conclusions

- IE scores degrade linearly with WER
- Sentence boundaries are required, but their locations are not critical
- Commas are important for IE
- With current technology (e.g., 12% WER and 60% EVS on text), we can achieve only 33% of maximum QA performance
- If IE error and WER were each cut in half, and the loss due to commas also cut in half, QA performance could increase to over 70% of maximum