Recognizing Discourse Structure: Speech
Discourse & Dialogue, CMSC 35900-1
October 11, 2006
Roadmap
• Recognizing discourse structure in speech
• Analyzing spoken monologue
• Automatic topic segmentation
– Acoustic cues, text cues, and integration
• Conclusions & Plans
Recognizing Discourse Structure
• Hypothesis:
– Discourse can be decomposed into subunits
• Formal written text
– Clues to structure: paragraphs, chapters, sections
• Spoken discourse
– Lacks orthographic cues
– Are compensating features available?
Prosody & Discourse Structure
• Discourse structure model
– Grosz & Sidner 1986
– Global structure: discourse segments, embedding
– Local structure: prominence, salience
• Linguistic structure includes intonation
– Signal global or local structure
• Use of phrases to signal global structure
• Signal parenthetical
Intonational Features
• Theoretical framework
– Tones and Break Indices (ToBI, Pierrehumbert)
• Tone: pitch contours; Breaks: phrase units
• “Intermediate” phrases are basic units
• Features:
• Pitch range within and between phrases
• Amplitude (loudness)
• Pitch contour type
• Speaking rate (syll/sec)
• Inter-phrase pause duration
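As a rough illustration of how the features listed above could be computed per phrase, the sketch below assumes each intermediate phrase arrives as a list of voiced f0 samples in Hz, a syllable count, and start/end times. The `Phrase` record, its field names, and the choice of a peak-f0 ratio as the between-phrase measure are hypothetical; they are not part of ToBI or of any particular toolkit.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Phrase:
    # Hypothetical record for one intermediate phrase (assumed layout).
    f0_hz: List[float]   # voiced f0 samples, in Hz
    n_syllables: int     # syllable count for the phrase
    start_s: float       # phrase start time, in seconds
    end_s: float         # phrase end time, in seconds


def pitch_range(p: Phrase) -> float:
    """Pitch range within a phrase: max f0 minus min f0, in Hz."""
    return max(p.f0_hz) - min(p.f0_hz)


def peak_ratio(prev: Phrase, nxt: Phrase) -> float:
    """One possible between-phrase measure: peak f0 of the next phrase
    relative to the previous one (an assumption, not a standard)."""
    return max(nxt.f0_hz) / max(prev.f0_hz)


def speaking_rate(p: Phrase) -> float:
    """Speaking rate in syllables per second."""
    return p.n_syllables / (p.end_s - p.start_s)


def pause_between(prev: Phrase, nxt: Phrase) -> float:
    """Inter-phrase pause duration in seconds (clamped at zero)."""
    return max(0.0, nxt.start_s - prev.end_s)
```

For example, with `a = Phrase([180.0, 220.0, 200.0], 6, 0.0, 2.0)` and `b = Phrase([240.0, 260.0], 4, 2.5, 3.5)`, `speaking_rate(a)` is 3 syllables per second and `pause_between(a, b)` is half a second.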
Speech Corpora
• Vary on:
– Speaker type: professional/not
– Speaking style: read/spontaneous
– Speech content: news/directions/etc.
• Variability in prosody too...
Pilot Study I: Newswire
• Professionally read 3 AP newswire stories
• Manual segmentation: Text only, Speech
– Consensus labels: SB, SF
• Correlation of pitch range, amplitude, rate
– Can identify structure via hand-labelings
• Issues:
– Difficulty labeling, idiosyncratic BN speech
Pilot Study II: Prominence and Discourse
• Prominence: Accent/stress on a word
– Typically associated with NEW information
– Contrast:
• Locally NEW (in segment) vs Globally NEW
• Analyze all NPs in 20 min spontaneous
• Difference in position and form influence
– Full forms accented, pronouns etc. not
– Mismatches: Imply role of global/local
• Issues:
– Difficulty labeling; use of full names or pronouns
Direction-giving Corpus
• Spontaneous/read speech; non-professional
– Task-oriented: give directions, vary complexity
• Return later to read original transcriptions
• Discourse segment labeling: Text vs Speech
– More consensus labels for speech than text
• Speech allows more reliable segmentation
• Spontaneous more reliable than read (medial)
Acoustic Analysis
• Features:
– Max/mean f0 (pitch), amplitude, rate, pause (pre/post)
• Findings:
• Segment beginnings: Higher max/mean f0, amplitude
– Shorter following pause (Longer preceding pause in read)
• Segment endings: Lower max/mean f0, amplitude
• Similar for T & S annotations
• Issues: Single speaker
Prominence and Discourse
• NPs annotated for:
– Lexical form (full NP/pron), grammatical role, surface position (sent/phrase), accent
– 23% reduced stress
• Effect of form, role
• Repetition, not necessarily reduced
– Also find reduced forms in contrasts
Summary
• Clear prosodic cues to discourse structure
– Across speakers, speaking style, content
– Initiation:
• High max/average pitch, amplitude; preceding pause
– Finality is converse
• Information status
– Few clear correlates with accentuation
• Mediated by form, grammatical role
Prosodic and Lexical Cues to Topic Segmentation
• Broadcast news story-level segmentation
– Television and radio
• Contrast w/GHN
– Fully automatic: transcription, prosodic labeling
– Large data set, multiple speakers
– All teleprompted news
Possible Signals
• Lexical topic similarity in vector space – Hearst (1994)
• Lexical discourse cues (Beeferman et al.)
• E.g. “CNN”: reporter sign-off
– HMM topic model
• Prosodic cues
– Pitch, loudness, duration, speaker change, …
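To make the vector-space signal above concrete, here is a minimal sketch that scores each gap between sentences by the cosine similarity of word-count vectors on either side. It is in the spirit of Hearst's lexical-similarity idea, not a reimplementation of it; the whitespace tokenization, the `block` window size, and the function names are assumptions made for illustration. Low similarity at a gap suggests a possible topic change.

```python
import math
from collections import Counter
from typing import List


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two bag-of-words count vectors."""
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def gap_similarities(sentences: List[str], block: int = 2) -> List[float]:
    """Similarity at each gap between sentence i-1 and sentence i,
    comparing up to `block` sentences on each side of the gap."""
    # Assumption: lowercase + whitespace split stands in for real tokenization.
    bags = [Counter(s.lower().split()) for s in sentences]
    sims = []
    for gap in range(1, len(bags)):
        left = sum(bags[max(0, gap - block):gap], Counter())
        right = sum(bags[gap:gap + block], Counter())
        sims.append(cosine(left, right))
    return sims
```

A segmenter built on this would look for dips in the returned list; how deep a dip must be to count as a boundary is a tuning decision the slides do not specify.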
Basic Approach
• Chop audio stream into “sentences”
• Group “sentences” into topics
• Classify each sentence boundary as topic boundary or not
• Probabilistic framework
– $\arg\max_B \Pr(B \mid W, F)$
• B is the sequence of boundaries, W the words, F the features
Prosodic Classification
• Features:
– Pitch (f0): before and after possible boundary
– Duration: final phoneme, final rhyme, pause
• No amplitude: viewed as redundant with pitch
• Classifier: Decision trees
– Features selected by wrapper loop on training
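The slide does not say which wrapper procedure was used, so the following is only one common variant: greedy forward selection. It repeatedly adds whichever feature most improves a caller-supplied score and stops when nothing helps. The `score` callback is hypothetical; in practice it would train a decision tree on the given feature subset and return held-out accuracy or a similar measure.

```python
from typing import Callable, FrozenSet, Iterable, List


def forward_select(features: Iterable[str],
                   score: Callable[[FrozenSet[str]], float]) -> List[str]:
    """Greedy forward wrapper selection (an assumed variant).

    `score(subset)` is expected to train and evaluate a classifier
    restricted to `subset`; higher is better.
    """
    remaining = set(features)
    chosen: List[str] = []
    best = score(frozenset())
    while remaining:
        # Try adding each remaining feature; keep the best single addition.
        cand, cand_score = max(
            ((f, score(frozenset(chosen + [f]))) for f in sorted(remaining)),
            key=lambda pair: pair[1],
        )
        if cand_score <= best:
            break  # no remaining feature improves the score
        chosen.append(cand)
        remaining.remove(cand)
        best = cand_score
    return chosen
```

Because the classifier is retrained for every candidate subset, the cost grows quickly with the number of features, which is the usual trade-off of wrapper methods against cheaper filter methods.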
Lexical Classification
• HMM topic language models
– Train one model per topic
– Begin/End state
• Train on previous topics
• Later augment with Topic Boundary states
Integrating Models
• With decision trees:
– Incorporate HMM topic boundary probability as additional feature
– Boundary labeled if exceeds some threshold
• With HMMs:
– Use prosodic trees to estimate likelihoods
– Use standard Viterbi decoding to find best
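The HMM-side combination can be sketched with a deliberately simplified two-state chain: at each candidate position the state is either `B` (topic boundary) or `N` (no boundary). This is not the system's actual topic HMM. The transition table `trans` merely stands in for the lexical model, and turning the tree posterior into a likelihood by dividing by the class prior is an assumption (a common Bayes-rule trick), not something the slides state. All names are hypothetical.

```python
import math
from typing import Dict, List

STATES = ("N", "B")  # N = no topic boundary, B = topic boundary


def viterbi(posteriors: List[float], prior_b: float,
            trans: Dict[str, Dict[str, float]]) -> List[str]:
    """Most likely N/B sequence for a non-empty list of positions.

    posteriors[i] is P(B | prosodic features) at position i,
    e.g. as estimated by a decision tree.
    """
    prior = {"B": prior_b, "N": 1.0 - prior_b}

    def log_lik(state: str, p_b: float) -> float:
        post = p_b if state == "B" else 1.0 - p_b
        # Assumed scaled likelihood: P(F | state) ~ posterior / prior.
        return math.log(max(post, 1e-12) / prior[state])

    score = {s: math.log(prior[s]) + log_lik(s, posteriors[0]) for s in STATES}
    back = []
    for p_b in posteriors[1:]:
        new_score, pointers = {}, {}
        for s in STATES:
            prev = max(STATES, key=lambda r: score[r] + math.log(trans[r][s]))
            pointers[s] = prev
            new_score[s] = score[prev] + math.log(trans[prev][s]) + log_lik(s, p_b)
        score = new_score
        back.append(pointers)

    # Trace back from the best final state.
    state = max(STATES, key=lambda s: score[s])
    path = [state]
    for pointers in reversed(back):
        state = pointers[state]
        path.append(state)
    return path[::-1]
```

Unlike thresholding each posterior independently, the decoder picks the jointly best sequence, so a moderately confident prosodic cue can be overridden or reinforced by what the transition model expects around it.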
Testing & Evaluation
• Based on 6 shows
– 104 shows used for training
• Used ASR output for words/positions
– Contrast with correct forced alignment
• Used manual speaker segmentation
• Bizarre cost metric
• Basic units: Chop at 0.572 sec pause
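A minimal sketch of the pause-based chopping step, assuming the recognizer output is available as `(token, start_s, end_s)` tuples. That layout, the function name, and the choice of `>=` rather than `>` at the threshold are assumptions; only the 0.572 second value comes from the slide.

```python
from typing import List, Tuple

# Assumed layout for time-aligned ASR output: (token, start_s, end_s).
Word = Tuple[str, float, float]


def chop_at_pauses(words: List[Word],
                   min_pause_s: float = 0.572) -> List[List[str]]:
    """Split a word stream into units wherever the silence between
    consecutive words is at least `min_pause_s` seconds."""
    units: List[List[str]] = []
    current: List[str] = []
    prev_end = None
    for token, start, end in words:
        if prev_end is not None and start - prev_end >= min_pause_s:
            units.append(current)  # pause is long enough: close the unit
            current = []
        current.append(token)
        prev_end = end
    if current:
        units.append(current)
    return units
```

Each gap between the resulting units is then a candidate position to classify as topic boundary or not.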
Decision Tree Classification
• Prosody-only features:
– Pause duration, F0 difference, speaker change, gender
• Consistent with GHN
• Gender? Different styles for males/females
• Combined:
– HMM LM likelihoods, pause, F0 difference
Best Results
• Integrate prosody and lexical cues
• HMM-based model combination better
– Decision tree thresholding inconsistent
• Improves over HMM classifier only