Page 1: 18.0   Some Recent Developments in NTU

18.0 Some Recent Developments in NTU

Reference:

1. “Segmental Eigenvoice with Delicate Eigenspace for Improved Speaker Adaptation”, IEEE Transactions on Speech and Audio Processing, Vol.13, No.3, May 2005, pp.399-411.

2. “Higher Order Cepstral Moment Normalization (HOCMN) for Robust Speech Recognition”, International Conference on Acoustics, Speech and Signal Processing, Montreal, Canada, May 2004, pp.197-200.

3. “Extension and Further Analysis of Higher Order Cepstral Moment Normalization (HOCMN) for Robust Features in Speech Recognition”, International Conference on Spoken Language Processing, Pittsburgh, USA, Sept 2006.

4. “Powered Cepstral Normalization (P-CN) for Robust Features in Speech Recognition”, International Conference on Spoken Language Processing, Pittsburgh, USA, Sept 2006.

5. “Improved Spontaneous Mandarin Speech Recognition by Disfluency Interruption Point (IP) Detection Using Prosodic Features”, European Conference on Speech Communication and Technology, Lisbon, Sept. 2005, pp.1621-1624.

6. “Prosodic Modeling in Large Vocabulary Mandarin Speech Recognition”, International Conference on Spoken Language Processing, Pittsburgh, USA, Sept 2006.

7. “Latent Prosodic Modeling (LPM) for Speech with Applications in Recognizing Spontaneous Mandarin Speech with Disfluencies”, International Conference on Spoken Language Processing, Pittsburgh, USA, Sept 2006.

Page 2: 18.0   Some Recent Developments in NTU

Reference: 8. “Entropy-based Feature Parameter Weighting for Robust Speech Recognition”, International Conference on Acoustics, Speech and Signal Processing, Toulouse, France, May 2006.

9. “A New Framework for System Combination Based on Integrated Hypothesis Space,” International Conference on Spoken Language Processing, Pittsburgh, USA, Sept 2006.

10. “Improved Spoken Document Summarization Using Probabilistic Latent Semantic Analysis (PLSA)”, International Conference on Acoustics, Speech and Signal Processing, Toulouse, France, May 2006.

11. “Analytical Comparison between Position Specific Posterior Lattices and Confusion Networks Based on Words and Subword Units for Spoken Document Indexing”, IEEE Automatic Speech Recognition and Understanding Workshop, Kyoto, Japan, December 2007.

12. “A Multi-Modal Dialogue System for Information Navigation and Retrieval across Spoken Document Archives with Topic Hierarchies”, Proceedings of IEEE Automatic Speech Recognition and Understanding Workshop, San Juan, Nov-Dec 2005.

13. “Efficient Interactive Retrieval of Spoken Documents with Key Terms Ranked by Reinforcement Learning”, International Conference on Spoken Language Processing, Pittsburgh, USA, Sept 2006.

14. “Type-II Dialogue Systems for Information Access from Unstructured Knowledge Sources”, IEEE Automatic Speech Recognition and Understanding Workshop, Kyoto, Japan, December 2007.


Page 3: 18.0   Some Recent Developments in NTU

Reference: 15. “Histogram-Based Quantization (HQ) for Robust and Scalable Distributed Speech Recognition”, European Conference on Speech Communication and Technology, Lisbon, Sept. 2005, pp.957-960.

16. “Joint Uncertainty Decoding (JUD) with Histogram-Based Quantization (HQ) for Robust and/or Distributed Speech Recognition”, International Conference on Acoustics, Speech and Signal Processing, Toulouse, France, May 2006.


Page 4: 18.0   Some Recent Developments in NTU

Role of Spoken Language Processing under Network Environment

[Diagram: User Interface, Content Analysis and User-Content Interaction around the Internet]

• User Interface: when keyboards/mice are inadequate

• Content Analysis: helps in browsing/retrieval of multimedia content

• User-Content Interaction: all text-based interaction can be accomplished by spoken language

Page 5: 18.0   Some Recent Developments in NTU

Hierarchy of Research Areas

[Diagram: research areas arranged in layers]

Layer labels: Applications, Integrated Technologies, Applied Technologies, Basic Technologies

Areas shown:

• Multimedia Technologies
• Spoken Dialogue
• Speech-based Information Retrieval
• Dictation & Transcription
• Distributed Speech Recognition and Wireless Environment
• Multilingual Speech Processing
• Information Indexing & Retrieval
• Text-to-speech Synthesis
• Speech/Language Understanding
• Decoding & Search Algorithms
• Linguistic Processing & Language Modeling
• Wireless Transmission & Network Environment
• Speech Recognition Core
• Keyword Spotting
• Robustness: noise/channel, feature/model
• Hands-free Interaction: acoustic reception, microphone array, etc.
• Speaker Adaptation & Recognition
• Acoustic Processing: features, modeling, etc.
• Spoken Document Understanding and Organization
• Prosodic Modeling
• Spontaneous Speech Processing: pronunciation modeling, disfluencies, etc.

Reference numbers marked on the diagram: 12, 14, 4, 15; 11, 10, 2, 3, 1; 7, 5, 6, 8; 13; 9

Page 6: 18.0   Some Recent Developments in NTU

Segmental Eigenvoice

– Decompose the supervectors into sub-supervectors, from which sub-eigenspaces can be constructed, so that better performance can be obtained with more adaptation data

Page 7: 18.0   Some Recent Developments in NTU

Segmental Eigenvoice (1/3)

Page 8: 18.0   Some Recent Developments in NTU

Segmental Eigenvoice (2/3)

Page 9: 18.0   Some Recent Developments in NTU

Segmental Eigenvoice (3/3)

Page 11: 18.0   Some Recent Developments in NTU

Higher Order Cepstral Moment Normalization (HOCMN) for Robust Speech Recognition

– to reduce the mismatch between the statistical characteristics of training and testing corpora by normalizing the cepstral moments

Page 12: 18.0   Some Recent Developments in NTU

Cepstral Moment Normalization

• Moment Estimation:
– Time average: N-th moment of MFCC parameters about the origin

  $E[X^N(n)] = \frac{1}{T}\sum_{k=0}^{T-1} X^N(k) = \frac{1}{T}\sum_{k=0}^{T-1} [X(k)]^N$

• Cepstral Normalization:
– For odd order L

  $E[X_{[L]}^{L}(n)] = 0$

  Example: CMS for L=1

– For even order N

  $E[X_{[N]}^{N}(n)] = M_N$

  Example: CMVN for orders 1 and 2 (L=1, N=2)
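As a minimal sketch of the two examples named on this slide (CMS and CMVN), the following estimates moments by time averaging and normalizes a single cepstral coefficient track. The function names, the sample values and the target second moment of 1.0 are illustrative assumptions, not taken from the referenced papers; normalizing orders above 2 is not attempted here.

```python
import math

def moment(x, order):
    """Time-average estimate of the moment of the given order about the origin."""
    return sum(v ** order for v in x) / len(x)

def cms(x):
    """Cepstral mean subtraction: forces the first moment (L=1) to zero."""
    m1 = moment(x, 1)
    return [v - m1 for v in x]

def cmvn(x, target_m2=1.0):
    """CMVN: first moment to zero, then second moment (N=2) to target_m2.

    target_m2 = 1.0 is an assumed target; a constant track (zero variance)
    is not handled in this sketch.
    """
    centered = cms(x)
    scale = math.sqrt(target_m2 / moment(centered, 2))
    return [v * scale for v in centered]

# Hypothetical values of one cepstral coefficient over four frames.
c1 = [1.0, 2.0, 4.0, 5.0]
normalized = cmvn(c1)
```

After normalization the time-averaged first moment is zero and the second moment equals the chosen target, which is what the L=1 and N=2 constraints require.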

Page 13: 18.0   Some Recent Developments in NTU

Higher Order Cepstral Moment Normalization (HOCMN)

• Aurora 2, Clean Condition Training, Word Accuracy Averaged over 0~20dB and All Types of Noise (sets A,B,C)

[Figure: word accuracy plots; curve labels: CN, CTN=HOCMN[1,2,3], CN (l=86); CMVN, CTN=HOCMN[1,3,2], CMVN (l=86)]

Page 14: 18.0   Some Recent Developments in NTU

Skewness and Kurtosis (1)

• Skewness
– Third moment about the mean and normalized to the standard deviation
– Departure of pdf from symmetry
  • Positive/negative indicates skew to right/left
  • Zero indicates symmetric

• Kurtosis
– Fourth moment about the mean and normalized to the standard deviation
– Peaked or “flat with tails of large size” as compared to standard Gaussian
  • “3” is the fourth moment of N(0,1)
  • Positive/negative indicates flatter/more peaked
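The two statistics above can be sketched as follows, using the standard definitions (central moments divided by the matching power of the standard deviation, with 3 subtracted from the kurtosis so that a Gaussian gives zero). The function names and sample values are illustrative assumptions.

```python
def central_moment(x, order):
    """Sample moment of the given order about the sample mean."""
    mean = sum(x) / len(x)
    return sum((v - mean) ** order for v in x) / len(x)

def skewness(x):
    """Third central moment normalized by the cube of the standard deviation."""
    return central_moment(x, 3) / central_moment(x, 2) ** 1.5

def excess_kurtosis(x):
    """Fourth central moment normalized by the variance squared, minus 3
    (3 is the fourth moment of the standard Gaussian N(0,1))."""
    return central_moment(x, 4) / central_moment(x, 2) ** 2 - 3.0

symmetric = [-2.0, -1.0, 0.0, 1.0, 2.0]
right_skewed = [0.0, 0.0, 0.0, 0.0, 10.0]
two_point = [-1.0, 1.0, -1.0, 1.0]
```

Here `skewness(symmetric)` is zero, `skewness(right_skewed)` is positive, and `excess_kurtosis(two_point)` is -2, the smallest value this statistic can take.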

Page 15: 18.0   Some Recent Developments in NTU

Skewness and Kurtosis (2)

• Define: Generalized Skewness of Odd Order L

  $S_L = E(X^L)$, L: an odd integer

– L not necessarily 3
– Similar meaning as skewness (skew to right or left) except in the sense of L-th moment

• Define: Generalized Kurtosis of Even Order N

– N not necessarily 4
– Similar meaning as kurtosis (peaked or flat) except in the sense of N-th moment

Page 16: 18.0   Some Recent Developments in NTU

Skewness and Kurtosis (3)

• Normalizing Odd Order Moment is to Constrain the pdf to be Symmetric about the Origin

– Except in the sense of L-th moment

• Normalizing Even Order Moment is to Constrain the pdf to be “Equally Flat with Tails of Equal Size” as Compared to a Standard Gaussian

– Except in the sense of N-th moment

Page 17: 18.0   Some Recent Developments in NTU

Generalized Moments with Non-integer Orders

• The Orders of Normalized Moments Are Not Necessarily Integers

• Generalized Moments
– Type 1:
  • Reduced to odd order moment when u is an odd integer L (example: L=1 or 3)
– Type 2:
  • Reduced to even order moment when u is an even integer N (example: N=2 or 4)
– HOCMN with Non-integer Moment Orders

Page 18: 18.0   Some Recent Developments in NTU

PDF Analysis

• HEQ
– Overfitted to Gaussian
– Original statistics lost

• HOCMN
– Fitting the generalized skewness and kurtosis of a few orders only
– Retain more original characteristics

[Figure: pdfs of the original C0 & C1 compared with those after HEQ and after HOCMN]

Page 20: 18.0   Some Recent Developments in NTU

Use of Prosody in Recognition and Handling Disfluencies in Spontaneous Speech

— prosody may be useful in recognition, and in particular in handling disfluencies in spontaneous speech

Page 21: 18.0   Some Recent Developments in NTU

Prosodic Features (Ⅰ): Pitch-related Features

[Figure: fundamental frequency (Hz, 100 to 400) versus frame number (1 to 39) for syllables of Tone 2, Tone 3 and Tone 4, with P1, P2, d1 and d2 marked]

• Pitch-related Features
– The average pitch value within the syllable
– The maximum difference of pitch value within the syllable
– The average of absolute values of pitch variations within the syllable
– The magnitude of pitch reset for boundaries
– The difference of such feature values of adjacent syllable boundaries (P1-P2, d1-d2, etc.)
– A total of 54 pitch-related features were obtained
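A minimal sketch of the first four pitch features listed above, computed from per-frame F0 values of a syllable. The slide does not give exact definitions, so the choices here are assumptions: "pitch variation" is taken as the frame-to-frame difference, and "pitch reset" as the jump from the last frame of one syllable to the first frame of the next. All names and values are hypothetical.

```python
def pitch_features(f0):
    """Pitch features of one syllable from its per-frame F0 values (Hz)."""
    deltas = [abs(b - a) for a, b in zip(f0, f0[1:])]
    return {
        "average": sum(f0) / len(f0),
        "max_difference": max(f0) - min(f0),
        # Assumed definition: mean absolute frame-to-frame F0 change.
        "avg_abs_variation": sum(deltas) / len(deltas) if deltas else 0.0,
    }

def pitch_reset(prev_f0, next_f0):
    """Assumed definition: magnitude of the F0 jump across a syllable boundary."""
    return abs(next_f0[0] - prev_f0[-1])

syllable = [200.0, 220.0, 210.0]
features = pitch_features(syllable)
reset = pitch_reset([200.0, 180.0], [230.0, 220.0])
```

Differences of such values across adjacent boundaries (P1-P2, d1-d2) would then be simple subtractions of these outputs.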

Page 22: 18.0   Some Recent Developments in NTU

Prosodic Features (Ⅱ): Duration-related Features

[Figure: syllables A, B, C, D, E between the beginning and the end of an utterance, with pauses a and b at syllable boundaries]

• Duration-related Features
– Pause duration: b
– Average syllable duration: (B+C+D+E)/4 or ((D+E)/2 + C)/2
– Average syllable duration ratio: (D+E)/(B+C) or ((D+E)/2)/C
– Combination of pause & syllable features (ratio or product): C*b, D*b, C/b, D/b
– Lengthening: C / ((A+B)/2)
– Standard deviation of feature values
– A total of 38 duration-related features were obtained
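The duration formulas on this slide can be evaluated directly. The sketch below assumes A to E are syllable durations and b is a pause duration, all in the same unit; the function name, dictionary keys and sample values (milliseconds) are hypothetical.

```python
def duration_features(A, B, C, D, E, b):
    """Duration-related features built from syllable durations A..E and pause b."""
    return {
        "pause": b,
        "avg_syllable": (B + C + D + E) / 4,
        "avg_syllable_alt": ((D + E) / 2 + C) / 2,
        "ratio": (D + E) / (B + C),
        "ratio_alt": ((D + E) / 2) / C,
        "C_times_b": C * b,
        "D_times_b": D * b,
        # Ratios with the pause are undefined when there is no pause.
        "C_over_b": C / b if b else None,
        "D_over_b": D / b if b else None,
        "lengthening": C / ((A + B) / 2),
    }

feats = duration_features(A=200, B=200, C=300, D=100, E=100, b=500)
```

With these values syllable C is 1.5 times the average of the two preceding syllables, which is the kind of lengthening cue the slide refers to.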

Page 23: 18.0   Some Recent Developments in NTU

Recognition Framework with Prosodic Modeling

• Rescoring Formula:

  $S(W) = \log P(X|W) + \lambda_l \log P(W) + \lambda_p \log P(F|W)$

  λl, λp: weighting coefficients; P(F|W): prosodic model

• Two-pass Recognition
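A small sketch of second-pass rescoring with a score of the form log P(X|W) + λl log P(W) + λp log P(F|W), applied to a list of first-pass hypotheses. The dictionary keys, hypothesis names and all numeric values are made up for illustration.

```python
def rescore(hypotheses, lam_l, lam_p):
    """Return the hypothesis with the highest combined score.

    Each hypothesis carries three log-scores: acoustic log P(X|W),
    language model log P(W), and prosodic model log P(F|W).
    """
    def score(h):
        return h["log_acoustic"] + lam_l * h["log_lm"] + lam_p * h["log_prosody"]
    return max(hypotheses, key=score)

# Hypothetical first-pass output.
hyps = [
    {"words": "hyp A", "log_acoustic": -100.0, "log_lm": -20.0, "log_prosody": -30.0},
    {"words": "hyp B", "log_acoustic": -101.0, "log_lm": -20.0, "log_prosody": -10.0},
]
best_without_prosody = rescore(hyps, lam_l=1.0, lam_p=0.0)
best_with_prosody = rescore(hyps, lam_l=1.0, lam_p=1.0)
```

In this toy case the prosodic term changes the winner from "hyp A" to "hyp B", which is the intended effect of the extra knowledge source.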

Page 24: 18.0   Some Recent Developments in NTU

Prosodic Feature Extraction from Paths in the Word Graph

• Define a tone variable (T_jk)

• Directly take the LW boundaries as a prosodic cue (B_jk)

  $P(f_j | w_j) = \prod_{k=1}^{L_j} P(f_{jk} | T_{jk}, B_{jk})$

  L_j: the length of the j-th word

Page 25: 18.0   Some Recent Developments in NTU

Prosodic modeling

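Three ways to obtain $P(f_{jk} | T_{jk}, B_{jk})$:

• GMM
– f_jk → GMM classifier → P(f_jk | T_jk, B_jk)

• Decision Tree
– f_jk → P(T_jk, B_jk | f_jk), combined with P(B_jk) by Bayes' rule → P(f_jk | T_jk, B_jk)

• Hybrid
– f_jk → f_jk' = [ p(T_jk=1, B_jk=0 | f_jk), p(T_jk=2, B_jk=0 | f_jk), ..., p(T_jk=5, B_jk=1 | f_jk) ] → GMM classifier → P(f_jk | T_jk, B_jk)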

Page 26: 18.0   Some Recent Developments in NTU

Examples of Disfluencies in Spontaneous Speech

(*): the disfluency interruption point (IP)

• Overt Repair

  Do you import * uhn export products?

  是 (shi4) 進口 (jin4kou3) 嗯 (EN) 出口 (chu1kou3) 嗎 (ma1)
  is import [discourse particle] export [interrogative particle]

  Parts marked: reparandum, optional editing term, resumption

• Abandoned Utterances

  It has a * eh there is a resort there.

  它 (ta1) 有 (you3) 一個 (yi2ge5) 呃 (E) 那邊 (ne4bian1) 有個 (you3ge5) 度假村 (du4jia4cun1) 嘛 (MA)
  it has one [discourse particle] there has a resort [discourse particle]

  Parts marked: optional editing term

Page 27: 18.0   Some Recent Developments in NTU

Spontaneous Speech Recognition with Disfluency Interruption Point (IP) Detection

• Rescoring with IP information

  $W^* = \arg\max_W P(W | X, F) = \arg\max_W P(W | F)\, P(X | W)$

  $P(W | F) = \prod_{n} P(w_n | w_{n-N+1}^{n-1}, F)$

  $P(w_n | w_{n-N+1}^{n-1}, F) = \sum_{c} P(c | w_{n-N+1}^{n-1}, F)\, P(w_n | w_{n-N+1}^{n-1}, c)$   (c: IP class)

  – $P(c | w_{n-N+1}^{n-1}, F)$: IP probability given by detection models
  – $P(w_n | w_{n-N+1}^{n-1}, c)$: word n-grams when crossing IP boundaries

• Recognition Results

[Figure: character accuracy (44 to 46.5) with disfluency handling versus the baseline; horizontal axis values 0.5, 0.9, 1.3, 2, 4]
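The language model term above sums over the IP class c, weighting a class-dependent n-gram probability by the IP probability from the detection model. A minimal sketch of that sum for one word position follows; the two-class setup ("ip" / "no_ip") and all numbers are illustrative assumptions.

```python
def word_prob_given_prosody(class_posteriors, word_prob_by_class):
    """Sum over IP classes c of P(c | history, F) * P(w_n | history, c)."""
    return sum(class_posteriors[c] * word_prob_by_class[c] for c in class_posteriors)

# Hypothetical values for one word position.
p_class = {"ip": 0.25, "no_ip": 0.75}   # from the IP detection model
p_word = {"ip": 0.40, "no_ip": 0.04}    # class-dependent n-gram probabilities
p = word_prob_given_prosody(p_class, p_word)
```

When the detector is confident that an IP precedes the word, the n-gram trained across IP boundaries dominates the result.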

Page 29: 18.0   Some Recent Developments in NTU

Entropy-based Weighted Viterbi Decoding

– contribution of each feature parameter in Viterbi decoding weighted by its entropy with respect to different phone classes

Page 30: 18.0   Some Recent Developments in NTU

Entropy-based Weighting

Notation:
– t: frame index
– x(t): feature vector
– d: index of feature parameter in x(t)
– c: class index

• Basic Idea
– If a feature parameter is discriminative
  • its Entropy value is low
– If not discriminative
  • its Entropy value is high

[Figure: observation probability distributions of different classes]
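The basic idea can be illustrated with a small sketch. The exact entropy definition is not recoverable from the slide text, so this assumes the entropy is taken over class posteriors obtained from per-class likelihoods of one feature parameter with equal class priors; the function names and likelihood values are hypothetical.

```python
import math

def class_posteriors(likelihoods):
    """Posteriors over classes from per-class likelihoods, assuming equal priors."""
    total = sum(likelihoods)
    return [value / total for value in likelihoods]

def entropy(probabilities):
    """Shannon entropy (natural log) of a discrete distribution."""
    return -sum(p * math.log(p) for p in probabilities if p > 0)

# Likelihoods of one observed feature parameter under three classes (made up).
discriminative = class_posteriors([0.98, 0.01, 0.01])
non_discriminative = class_posteriors([0.30, 0.30, 0.30])

h_low = entropy(discriminative)
h_high = entropy(non_discriminative)
```

A parameter whose observation clearly favors one class gives a low entropy, while one that is equally likely under all classes reaches the maximum, log(3) for three classes.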

Page 31: 18.0   Some Recent Developments in NTU

Entropy Estimation by GMMs

• GMMs for Different Classes c

– “GMM c” is developed for the acoustic class “c” (c = 1, 2, …)

Page 32: 18.0   Some Recent Developments in NTU

Entropy-based Weighted Viterbi Decoding

• Testing

• Viterbi decoding

  $\log[b_j(x(t))] = \sum_{d=1}^{D} W(t,d) \left( \log \sum_{m=1}^{M} c_{jm}\, N(x_d(t); \mu_{jmd}, \sigma_{jmd}) \right)$
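A sketch of a weighted observation log-probability of the kind used here, assuming the form: sum over feature parameters d of W(t,d) times the log of a one-dimensional Gaussian mixture for that parameter. That reading of the formula, the use of variances rather than standard deviations, and all names and values are assumptions.

```python
import math

def gaussian(x, mean, var):
    """One-dimensional Gaussian density."""
    return math.exp(-0.5 * (x - mean) ** 2 / var) / math.sqrt(2 * math.pi * var)

def weighted_log_obs(x, weights, mix_weights, means, variances):
    """Sum over d of weights[d] * log( sum over m of c_m * N(x[d]; mean[m][d], var[m][d]) )."""
    total = 0.0
    for d, x_d in enumerate(x):
        mixture = sum(c * gaussian(x_d, means[m][d], variances[m][d])
                      for m, c in enumerate(mix_weights))
        total += weights[d] * math.log(mixture)
    return total

# Hypothetical two-dimensional frame and a single-component "mixture".
x = [0.0, 1.0]
full = weighted_log_obs(x, [1.0, 1.0], [1.0], [[0.0, 0.0]], [[1.0, 1.0]])
first_only = weighted_log_obs(x, [1.0, 0.0], [1.0], [[0.0, 0.0]], [[1.0, 1.0]])
```

Setting a weight to zero removes that parameter's contribution, and equal weights of one recover the unweighted per-dimension log-likelihood.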

Page 33: 18.0   Some Recent Developments in NTU

Experimental Results

• MFCC
– Consistent improvements for all types of noise and SNR conditions

• Similar Results for PLP and Other Features

MFCC       Original   Parameter Weighting   Relative Error Reduction (%)
Set A      61.34      68.00                 17.23
Set B      55.75      63.74                 18.06
Set C      66.14      69.46                  9.81
Average    61.08      67.07                 15.39
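The relative error reduction column follows from the two accuracy columns: convert each accuracy to an error rate (100 minus accuracy) and divide the drop in error by the original error. A short check of the four rows, with a hypothetical function name:

```python
def relative_error_reduction(acc_before, acc_after):
    """Relative reduction (%) of the error rate implied by two accuracies (%)."""
    err_before = 100.0 - acc_before
    err_after = 100.0 - acc_after
    return 100.0 * (err_before - err_after) / err_before

rows = {
    "Set A": (61.34, 68.00),
    "Set B": (55.75, 63.74),
    "Set C": (66.14, 69.46),
    "Average": (61.08, 67.07),
}
reductions = {name: round(relative_error_reduction(a, b), 2)
              for name, (a, b) in rows.items()}
```

For Set A, for example, the error drops from 38.66 to 32.00, a reduction of 6.66 / 38.66, or about 17.23%.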

Page 35: 18.0   Some Recent Developments in NTU

System Combination by Integrated Hypothesis Space and Delicate Rescoring

– properly integrating useful information from different approaches

Page 36: 18.0   Some Recent Developments in NTU

Conventional System Combination Approaches

[Diagram: Input Speech → Decoder 1 ... Decoder N → Alignment Module → Voting Module → result; labels shown on the decoder outputs: N-Best, Confusion Network, Inner Word Graph]

Issues:
1. Alignment Algorithms
2. Distortion introduced

Page 37: 18.0   Some Recent Developments in NTU

Proposed Approach

[Diagram: Input Speech → Decoder 1 ... Decoder N → Direct Integration of Individual Hypothesis Spaces → Integrated Hypothesis Space → Delicate Rescoring → result]

• Produce Integrated Hypothesis Space with detailed time information

• Perform Delicate Rescoring on the Integrated Hypothesis Space

Page 38: 18.0   Some Recent Developments in NTU

Hypothesis Space Integration

• Merged Word Graph W, formed from the word arcs of the individual word graphs W1 and W2

• If Two Word Arcs from Different Systems are Equal
– Define: q1=q2 ≡ pw1=pw2 , w1=w2 , ts1=ts2 , te1=te2
– S(q=q1+q2)=combine(S(q1), S(q2)) if q1=q2

• Others
– S(q=qi)=S(qi)

[Figure: word graphs of System 1 and System 2 (word arcs W1 to W10) and the merged word graph]
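The merging rule can be sketched with word arcs stored in dictionaries keyed by (pw, w, ts, te), so that two arcs are equal exactly when all four fields match. The slide does not say what combine does, so the default below (adding the two scores) is only a placeholder assumption, as are the arc values.

```python
def merge_word_graphs(graph1, graph2, combine=lambda s1, s2: s1 + s2):
    """Merge two sets of word arcs.

    Each graph maps an arc key (pw, w, ts, te) to its score S(q).
    Equal arcs get combine(S(q1), S(q2)); all other arcs keep their own score.
    """
    merged = dict(graph1)
    for arc, score in graph2.items():
        merged[arc] = combine(merged[arc], score) if arc in merged else score
    return merged

# Hypothetical arcs: (previous word, word, start frame, end frame) -> score.
system1 = {("<s>", "W1", 0, 10): 0.6, ("W1", "W4", 10, 25): 0.3}
system2 = {("<s>", "W1", 0, 10): 0.5, ("W1", "W2", 10, 20): 0.4}
merged = merge_word_graphs(system1, system2)
```

Because the time marks are part of the key, no alignment step is needed, which is the contrast with the alignment-based combination shown earlier.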

Page 39: 18.0   Some Recent Developments in NTU

Delicate Rescoring Example (Ⅰ) – Expected Phone Accuracy Score (EPA)

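• Borrowing the Concept of Expected Phone Accuracy in MPE Training
– Phone accuracy of a phone p against overlapping phones p', with overlap ratio e(p, p'):

  $A(p) = \max_{p'} \begin{cases} -1 + 2e(p,p') & \text{if } p \text{ and } p' \text{ are the same phone} \\ -1 + e(p,p') & \text{if } p \text{ and } p' \text{ are different phones} \end{cases}$

– Accuracy of a word arc q with phones p_1, ..., p_K: $A(q) = \sum_{i=1}^{K} A(p_i)$, $p_i \in q$
– $S_{EPA}(q)$: obtained from S(q) and the expected accuracy E[A(q)]

  Example: 豪雨 (h_a au sic_iu u) versus 陶藝 (t_a au sic_i u)
  – t_a: overlap 1/6 = 0.17, giving -1 + 2*0.17 = -0.66
  – au: overlap 5/6 = 0.83, giving -1 + 0.83 = -0.17

• Decoding Procedure

  $y^* = \arg\max_{y \in W} \sum_{k=1}^{M} S(q_k)$, $q_k \in y$

  q_k : the k-th word in the path y
  y : word sequence for a path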
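A sketch of the phone accuracy measure borrowed from MPE training, reproducing the t_a / au numbers on this slide. It assumes the usual MPE convention that the overlap ratio is the fraction of the other phone's duration that is covered; the function name and the frame boundaries are made up to give overlaps of 1/6 and 5/6.

```python
def phone_accuracy(phone, others):
    """Max over overlapping phones of (-1 + 2e) if same label, else (-1 + e).

    phone and each entry of others are (label, start_frame, end_frame),
    with end_frame exclusive; e is the covered fraction of the other phone.
    """
    label, start, end = phone
    best = float("-inf")
    for o_label, o_start, o_end in others:
        overlap = max(0, min(end, o_end) - max(start, o_start))
        e = overlap / (o_end - o_start)
        best = max(best, -1 + 2 * e if o_label == label else -1 + e)
    return best

# Hypothetical timing: the phone covers 1 of 6 frames of "t_a" and 5 of 6 frames of "au".
acc = phone_accuracy(("t_a", 5, 11), [("t_a", 0, 6), ("au", 6, 12)])
```

The same-phone candidate scores -1 + 2/6, about -0.67, the different-phone candidate scores -1 + 5/6, about -0.17, and the maximum is kept.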

Page 40: 18.0   Some Recent Developments in NTU

Delicate Rescoring Example (Ⅱ) – Time Frame Error Score (TFE)

• Borrowing the Concept from Minimum Time Frame Error Decoding
– frame level loss function: $S_{TFE}(q_i)$ for a word arc $q_i = [pw, w; t_s, t_e]$ is computed from $S(q_i)$, the arc length $t_e - t_s + 1$, and $overlap(q_i, q')\, P(q')$ over the arcs $q' = [pw', w'; t_s', t_e'] \in W$
– P(q’) is available from the process of calculating consensus scores

• Decoding Procedure

  $y^* = \arg\min_{y \in W} \sum_{k=1}^{M} S(q_k)$, $q_k \in y$

  q_k : the k-th word in the path
  y : word sequence for a path

Page 41: 18.0   Some Recent Developments in NTU

Experimental Results

• For the Chinese language, SER and CER make better sense due to the word segmentation problem

• For SER (for syllables) and CER (for characters), the proposed approach is significantly better than the ROVER upper bound
– Alignment distortion

• TFE has the best performance
– Discriminative Decoding

Tested system                          SER     CER     WER
Baseline            MFCC               15.89   22.19   29.93
                    HLDA               14.43   20.80   28.53
ROVER upper bound   1-Best             14.90   20.39   26.92
                    10-Best            14.64   20.21   26.76
                    20-Best            14.49   20.12   26.79
Integrated          (1) CONS           13.67   19.62   26.88
Hypothesis Space    (2) EPA            13.41   19.73   27.70
                    (3) CONS+EPA       13.55   19.54   26.97
                    (4) TFE            13.35   19.27   26.71

Page 43: 18.0   Some Recent Developments in NTU

Multimedia Content Analysis for Efficient Browsing and Retrieval

– automatic generation of titles, summaries and semantic structures for multimedia documents

Page 44: 18.0   Some Recent Developments in NTU

Difficulties in Browsing Multimedia/Spoken Documents

• Written Documents are Better Structured and Easier to Browse
– in paragraphs with titles
– easily summarized and shown on the screen
– easily decided at a glance if it is what the user is looking for

• Multimedia/Spoken Documents are just Video/Audio Signals
– not easy to be summarized and shown on the screen
– the user can’t go through each one from the beginning to the end during browsing
– better approaches for efficient browsing and retrieval are needed

Page 45: 18.0   Some Recent Developments in NTU

Integration Relationships among the Involved Technology Areas

[Diagram: blocks shown: Key Terms/Named Entity Extraction from Spoken Documents; Semantic Analysis; Information Indexing, Retrieval and Browsing; Key Term Extraction from Spoken Documents]

Page 47: 18.0   Some Recent Developments in NTU

Improved and Interactive Spoken Document Retrieval

– improved spoken document retrieval with higher accuracy and better user-content interaction

Page 48: 18.0   Some Recent Developments in NTU

Lattices, Position Specific Posterior Lattices (PSPL), Confusion Networks (CN)

[Figure: a lattice from start node to end node along the time index, with word arcs W1 to W10 (all paths: W1W2, W3W4W5, W6W8W9W10, ...), and the corresponding PSPL structure and CN structure, each shown as cluster 1 to cluster 4 holding entries of the form "Wi: prob"]

• PSPL:
– Locate a word in a segment according to the order of the word in a path

• CN:
– Cluster several words in a segment according to similar time spans and word pronunciation
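The PSPL idea of locating a word by its order in a path can be sketched by enumerating paths: each word adds its path's posterior to the bin for its position. The three paths come from the slide, but their posteriors are made up, and a practical system would compute the same quantities on the lattice without enumerating paths.

```python
from collections import defaultdict

def build_pspl(paths):
    """Position-specific posteriors from (word sequence, path posterior) pairs.

    Returns {position: {word: posterior of that word at that position}}.
    """
    pspl = defaultdict(lambda: defaultdict(float))
    for words, posterior in paths:
        for position, word in enumerate(words):
            pspl[position][word] += posterior
    return pspl

# Paths listed on the slide, with hypothetical posteriors.
paths = [
    (["W1", "W2"], 0.5),
    (["W3", "W4", "W5"], 0.3),
    (["W6", "W8", "W9", "W10"], 0.2),
]
pspl = build_pspl(paths)
```

Position 0 then holds W1, W3 and W6 with posteriors summing to one, while later positions hold only the words of the longer paths.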

Page 49: 18.0   Some Recent Developments in NTU

OOV/Rare Word Problem

• OOV word W=w1w2w3w4 and a lattice L of document D
– wi : subword units

• W never appears in L
– Never find D under PSPL

• But W=w1w2w3w4 is hidden in L at subword level

• Subword-based PSPL (S-PSPL)

[Figure: word lattice L along the time index, with word arcs such as "a w1w2", "w1w2", "w2w3", "w3w4 b", "w3w4 bcd" and "w3w4 e" that contain the subword units of W]

Page 50: 18.0   Some Recent Developments in NTU

Subword-based PSPL and CN

[Figure: lattice represented by subword arcs along the time index (w1_1, w1_2, w1_3, w2_1, ..., w10_1, w10_2), with the S-PSPL structure and the S-CN structure, each shown as cluster 1, cluster 2, ..., cluster 8 holding entries such as "w1_1: prob", "w1_2: prob", ...]

Page 51: 18.0   Some Recent Developments in NTU

Performance Comparison

[Figure: MAP (0.54 to 0.89) versus index size (0 to 20 MB) for PSPL(word), CN(word), PSPL(character), CN(character), PSPL(syllable) and CN(syllable)]

Page 52: 18.0   Some Recent Developments in NTU

Interactive Retrieval of Spoken Documents by Topic Hierarchy

• Interactive Process between User and Content for Spoken Document Retrieval

• Given User’s Initial Query, the Extracted Key Terms can be Many, even in a Hierarchy
– Ranking the key terms will be helpful in efficient retrieval

[Diagram: components shown: User, Multi-modal Dialogue, Query/Instruction, Retrieved Documents, Topic Hierarchy, Retrieval System, Spoken Document Archive]

Page 53: 18.0   Some Recent Developments in NTU

Query Term Suggestions and Improved Interaction by Dialogue Modeling

[Diagram: Key Term Space (key terms ti, tj, tk, tl) mapped to the Archive Space / Document Space (document sets C(ti), C(tj), C(tk))]

• Such mapping is defined by some IR function (ex: PLSA)

• states: s1, s2, s3, …
– s1 = [ti ], s2 = [ti ,tj ], s3 = [ti ,tk ], sn = [ti ,tj ,tl ]
– G1 = C(ti ), G2 = C(ti + tj), G3 = C(ti + tk), Gn = C(ti +tj +tl)

• actions: ti, tj, tk, …
– state_s1 + action_tj → state_s2

Page 54: 18.0   Some Recent Developments in NTU

Learning User’s Behavior in Retrieval by a Large Number of Simulated Users

• A State Transition Diagram Generated for Each User Given the Initial Query s1

• User Assumed Satisfied (Double Circles) when Recall Rate = L/|D| > τ0
– L: number of relevant documents appearing in the top K retrieved documents
– D: desired document set
– m(s) = Minimum Number of Steps or Queries to Arrive at the Final State

[Figure: state transition diagram over states s1, s2, s3, s4, s6, s7, s8, s9, s12, s13, s14, s15, annotated with m(s3) = 3, m(s4) = 2, m(s7) = 3, m(s9) = 3, m(s12) = 4, m(s13) = 4, m(s15) = 5]

• Goal: to minimize the number of steps to arrive at the final state
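To make the goal concrete, the smallest number of steps from the initial query to any satisfied state can be found by breadth-first search over a state transition diagram. The graph below is hypothetical and does not reproduce the diagram on the slide; counting transitions from the initial state is also an assumption about how steps are counted.

```python
from collections import deque

def min_steps_to_satisfied(transitions, start, satisfied):
    """Fewest transitions from start to any state in satisfied, or None if unreachable."""
    seen = {start}
    queue = deque([(start, 0)])
    while queue:
        state, steps = queue.popleft()
        if state in satisfied:
            return steps
        for nxt in transitions.get(state, []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, steps + 1))
    return None

# Hypothetical diagram: each edge corresponds to adding one key term to the query.
transitions = {"s1": ["s2", "s3"], "s2": ["s4"], "s3": ["s5"], "s5": ["s6"]}
steps = min_steps_to_satisfied(transitions, "s1", satisfied={"s4", "s6"})
```

Statistics of this kind, gathered over many simulated users, are what a key term ranking policy could be trained against.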

Page 55: 18.0   Some Recent Developments in NTU

Type-I and Type-II Dialogue Systems

Type-I:

[Diagram: components shown: Input Speech Utterance (U); ASR (words, lattices); Spoken Language Understanding (Language Understanding, Dialogue Act Classification, Semantic Frame); User Act (Au); Dialogue Manager (Dialogue Modeling, Dialogue State, System Action); Well-organized Database; Output Generator (Speech, Graph, Tables)]

Page 56: 18.0   Some Recent Developments in NTU

Type-I and Type-II Dialogue Systems

Type-II:

[Diagram: components shown: Input Spoken Query (q); ASR (word/phone lattice, one-best, N-best); Spoken Documents (d) in a Multimedia Document Archive; ASR and Indexing (word/phone lattice, one-best, N-best; inverted index file); Retrieval Engine; Related Documents; Dialogue Manager (Dialogue Modeling, Internal State); Output Presentation; Multi-modal User Interface (Multi-modal interactions, Information Obtained); the whole labeled Spoken Language based Information Access]

Page 57: 18.0   Some Recent Developments in NTU

Improved Performance by Dialogue Modeling

[Figure: two panels plotted against ASR Character Accuracy in % for Queries (74, 88, 92, 100): Task Success Rate, and Average Number of Key Terms Needed for Successful Trials; curves in each panel: Dialogue Modeling, wpq, tf-idf]

Page 59: 18.0   Some Recent Developments in NTU

Histogram-based Quantization (HQ) for Robust Distributed Speech Recognition

– quantization dynamically determined by local statistics, thus automatically absorbing the various disturbances

Page 60: 18.0   Some Recent Developments in NTU

Distributed Speech Recognition (DSR) and Wireless Environment

• An Example Partition of Speech Recognition Processes into Client/Server
– encoded feature parameters transmitted in packets

[Diagram: recognition blocks shown: Input Speech, Front-end Signal Processing, Feature Vectors, Linguistic Decoding and Search Algorithm, Output Sentence, Acoustic Models, Lexicon, Language Model, Grammar, Speech Corpora, Acoustic Model Training, Text Corpora, Language Model Construction, Lexical Knowledge-base; partition labels: Client, Server]

• Client/Server Structure

[Diagram: Clients connected to Servers through the Network]

Page 61: 18.0   Some Recent Developments in NTU

Problems with Conventional Vector Quantization (VQ)

• Conventional VQ (e.g. SVQ) Popularly Used in DSR

• Dynamic Environmental Noise and Codebook Mismatch Jointly Degrade the Performance of SVQ
– Noise moves clean speech to another partition cell (X to Y)
– Mismatch between fixed VQ codebook and test data increases distortion
– Quantization increases difference between clean and noisy features

Page 62: 18.0   Some Recent Developments in NTU

Histogram-based Quantization (HQ) (Ⅰ)

– Decision boundaries y_i {i=1,…,N} are dynamically defined by C(y).

– Representative values z_i {i=1,…,N} are fixed, transformed by a standard Gaussian.

  $D = \{ z_i, b_i \text{ (vertical scale)}, i = 1, ..., N \}$, determined by Lloyd-Max and a standard Gaussian distribution

Page 63: 18.0   Some Recent Developments in NTU

Histogram-based Quantization (HQ) (Ⅱ)

– With histogram C’(y’), decision boundaries are automatically changed to $(y'_{i-1}, y'_i)$.

– Decision boundaries are adjusted according to local statistics, so there is no codebook mismatch problem.

  $x_t \to z_i$ if $b_{i-1} < C'(x_t) \le b_i$, or $y'_{i-1} < x_t \le y'_i$, where $i = 1, 2, ..., N$
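A minimal sketch of the quantization rule for one feature parameter over a block of frames: each value is located through the block's own empirical CDF and replaced by the fixed representative value of the bin, on the vertical scale, into which that CDF value falls. Several choices are assumptions: N = 4, the approximate Lloyd-Max levels and thresholds for a unit Gaussian, the rank-based CDF estimate, and which side of each boundary is closed.

```python
import math
from bisect import bisect_left

# Approximate 4-level Lloyd-Max quantizer for a standard Gaussian (assumed N = 4).
Z = [-1.510, -0.4528, 0.4528, 1.510]   # fixed representative values z_i
THRESHOLDS = [-0.9816, 0.0, 0.9816]    # Gaussian-domain decision thresholds

def std_normal_cdf(v):
    return 0.5 * (1.0 + math.erf(v / math.sqrt(2.0)))

# Boundaries b_i on the vertical (CDF) scale.
B = [std_normal_cdf(t) for t in THRESHOLDS]

def histogram_quantize(block):
    """Quantize each value of a block using the block's own empirical CDF."""
    n = len(block)
    out = []
    for x in block:
        cdf = sum(1 for v in block if v <= x) / n  # rank-based estimate, in (0, 1]
        out.append(Z[bisect_left(B, cdf)])         # bin i with b_{i-1} < cdf <= b_i
    return out

block = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
clean = histogram_quantize(block)
# A gain and an offset applied to the whole block leave the output unchanged.
disturbed = histogram_quantize([2.0 * v + 5.0 for v in block])
```

Because only the ordering of values within the block matters, any disturbance that preserves that ordering is absorbed, which is the sense in which there is no codebook mismatch.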

Page 64: 18.0   Some Recent Developments in NTU

Histogram-based Quantization (HQ) (Ⅱ)

• Based on CDF on the Vertical Scale and Histogram, Less Sensitive to Noise on the Horizontal Scale

• Disturbances are Automatically Absorbed into HQ Blocks

[Figure: dynamic nature of HQ: hidden codebook on the vertical scale, transformed by the dynamic C(y) into {y_i}, which are dynamic on the horizontal scale]

Page 65: 18.0   Some Recent Developments in NTU

Histogram-based VQ (HVQ)

Page 66: 18.0   Some Recent Developments in NTU

Experimental Results

• Different Types of Noise, Averaged over All SNR Values

[Figure: results for four configurations: Client HEQ-SVQ; Client HEQ-SVQ with Server UD; Client HQ; Client HQ with Server JUD]

Page 67: 18.0   Some Recent Developments in NTU

Performance in Mobile Wireless Networks