Algorithms for Speech Recognition and Language Processing



    cmp-lg/9608018 v2, 17 Sep 1996

    Algorithms for Speech Recognition and Language Processing

    Mehryar Mohri Michael Riley Richard Sproat

    AT&T Laboratories AT&T Laboratories Bell Laboratories

    [email protected] [email protected] [email protected]

    Joint work with Emerald Chung, Donald Hindle, Andrej Ljolje, Fernando Pereira

    Tutorial presented at COLING96, August 3rd, 1996.


    Introduction (1)

    Text and speech processing: hard problems

    Theory of automata

    Appropriate level of abstraction

    Well-defined algorithmic problems


    Introduction (2)

    Three Sections:

    Algorithms for text and speech processing (2h)

    Speech recognition (2h)

    Finite-state methods for language processing (2h)


    PART I
    Algorithms for Text and Speech Processing

    Mehryar Mohri, AT&T Laboratories

    [email protected]

    August 3rd, 1996


    Definitions: finite automata (1)

    A = (Σ, Q, δ, I, F)

    Alphabet Σ,

    Finite set of states Q,

    Transition function δ: Q × Σ → 2^Q,

    I ⊆ Q set of initial states, F ⊆ Q set of final states.

    A recognizes L(A) = { w ∈ Σ* : δ(I, w) ∩ F ≠ ∅ }

    (Hopcroft and Ullman, 1979; Perrin, 1990)

    Theorem 1 (Kleene, 1965). A set is regular (or rational) iff it can be recognized by a finite automaton.


    Definitions: finite automata (2)

    [Figure 1: two finite automata (nondeterministic and deterministic) recognizing L(A) = ab*a.]


    Definitions: weighted automata (1)

    A = (Σ, Q, δ, λ, σ, ρ, I, F)

    (Σ, Q, δ, I, F) is an automaton,

    Initial output function λ,

    Output function σ: Q × Σ × Q → K,

    Final output function ρ.

    Function f: Σ* → (K, +, ·) associated with A:

    ∀u ∈ Dom(f), f(u) = Σ_{(i,q) ∈ I × (δ(i,u) ∩ F)} λ(i) · σ(i, u, q) · ρ(q)


    Definitions: weighted automata (2)

    [Figure 2: Index of t = aba — a weighted automaton whose states carry output weights (e.g. 0/4, 1/0) and whose arcs are labeled a/0, a/2, b/1, b/0.]


    Definitions: rational power series

    Power series: functions mapping Σ* to a semiring (K, +, ·)

    Notation: S = Σ_{w ∈ Σ*} (S, w) w, where the (S, w) are the coefficients

    Support: supp(S) = { w ∈ Σ* : (S, w) ≠ 0 }

    Sum: (S + T, w) = (S, w) + (T, w)

    Star: S* = Σ_{n ≥ 0} S^n

    Product: (S · T, w) = Σ_{uv = w} (S, u) · (T, v)

    Rational power series: closure under the rational operations of polynomials (polynomial power series) (Salomaa and Soittola, 1978; Berstel and Reutenauer, 1988)

    Theorem 2 (Schützenberger, 1961). A power series is rational iff it can be represented by a weighted finite automaton.


    Definitions: transducers (1)

    T = (Σ, Δ, Q, δ, σ, I, F)

    Finite alphabets Σ and Δ,

    Finite set of states Q,

    Transition function δ: Q × Σ → 2^Q,

    Output function σ: Q × Σ × Q → Δ*,

    I ⊆ Q set of initial states,

    F ⊆ Q set of final states.

    T defines a relation: R(T) = { (u, v) ∈ Σ* × Δ* : v ∈ ⋃_{q ∈ δ(I,u) ∩ F} σ(I, u, q) }


    Definitions: transducers (2)

    [Figure 3: Fibonacci normalizer, a transducer rewriting abb → baa.]


    Definitions: weighted transducers

    [Figure 4: a weighted transducer mapping aaba to bbcb along two paths, with weight sequences (0,0,1,0) and (0,1,1,0).]

    (min, +): aaba → min{0+0+1+0, 0+1+1+0} = min{1, 2} = 1

    (+, ×): aaba → (0·0·1·0) + (0·1·1·0) = 0 + 0 = 0


    Composition: Motivation (1)

    Construction of complex sets or functions from more elementary ones

    Modular (modules, distinct linguistic descriptions)

    On-the-fly expansion


    Composition: Motivation (2)

    [Figure 5: Phases of a compiler (Aho et al., 1986): source program → lexical analyzer → syntax analyzer → semantic analyzer → intermediate code generator → code optimizer → code generator → target program.]


    Composition: Motivation (3)

    [Figure 6: Complex indexation — a spell checker, inflected forms, and an index map a source text to a set of positions.]


    Composition: Example (1)

    T1: 0 -a:a-> 1 -b:ε-> 2 -c:ε-> 3 -d:d-> 4

    T2: 0 -a:d-> 1 -ε:e-> 2 -d:a-> 3

    T1 ∘ T2: (0,0) -a:d-> (1,1) -b:e-> (2,2) -c:ε-> (3,2) -d:a-> (4,3)

    Figure 7: Composition of transducers.


    Composition: Example (2)

    T1: 0 -a:a/3-> 1 -b:ε/1-> 2 -c:ε/4-> 3 -d:d/2-> 4

    T2: 0 -a:d/5-> 1 -ε:e/7-> 2 -d:a/6-> 3

    T1 ∘ T2: (0,0) -a:d/15-> (1,1) -b:e/7-> (2,2) -c:ε/4-> (3,2) -d:a/12-> (4,3)

    Figure 8: Composition of weighted transducers (+, ×).


    Composition: Algorithm (1)

    Construction of pairs of states

    Match: q1 -a:b/w1-> q1' and q2 -b:c/w2-> q2'

    Result: (q1, q2) -a:c/(w1 ⊗ w2)-> (q1', q2')

    Elimination of ε-path redundancy: filter

    Complexity: quadratic

    On-the-fly implementation
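    The matching rule above translates directly into code. The following is a minimal sketch (mine, not the AT&T implementation) of eager composition for ε-free weighted transducers; machines are dicts from state to (input, output, weight, next-state) arcs, and the weight product is passed in as a function `times`:

        from collections import deque

        def compose(A, B, a_start, b_start, times):
            """Eager composition of epsilon-free weighted transducers.
            A[state] = list of (inp, out, weight, next_state) arcs."""
            start = (a_start, b_start)
            arcs, queue, seen = {}, deque([start]), {start}
            while queue:
                s1, s2 = queue.popleft()
                merged = []
                for inp, mid, w1, n1 in A.get(s1, []):
                    for mid2, out, w2, n2 in B.get(s2, []):
                        if mid == mid2:          # match: output of A = input of B
                            nxt = (n1, n2)
                            merged.append((inp, out, times(w1, w2), nxt))
                            if nxt not in seen:
                                seen.add(nxt)
                                queue.append(nxt)
                arcs[(s1, s2)] = merged          # (q1,q2) -a:c/(w1*w2)-> (q1',q2')
            return arcs

    The number of candidate state pairs is quadratic in the worst case, matching the complexity noted above.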


    Composition: Algorithm (2)

    [Figure 9: Composition of weighted transducers with ε-transitions — machines A and B are turned into A' and B' by marking ε-transitions (ε1, ε2) so that a filter can remove redundant ε-paths.]


    Composition: Algorithm (3)

    [Figure 10: Redundancy of ε-paths — without filtering, multiple interleavings of (1:1), (2:2), and (2:1) moves connect the same state pairs.]


    Composition: Algorithm (4)

    [Figure 11: Filter for efficient composition — a three-state transducer over x:x, 1:1, 2:2, and 2:1 moves that admits exactly one canonical ε-path per state pair.]


    Composition: Theory

    Transductions (Elgot and Mezei, 1965; Eilenberg, 1974–1976; Berstel, 1979).

    Theorem 3. Let τ1 and τ2 be two (weighted) automata or transducers; then τ1 ∘ τ2 is a (weighted) automaton or transducer.

    Efficient composition of weighted transducers (Mohri, Pereira, and Riley, 1996).

    Works with any semiring

    Intersection: composition of (weighted) automata.


    Intersection: Example

    [Figure 12: Intersection of automata — the pair construction yields states (0,0), (0,1), (0,2), (1,3), (2,4), (3,5).]


    Union: Example

    [Figure 13: Union of weighted automata (min, +) — the two operands are joined, apparently via a new initial state with ε/0 transitions to each operand's initial state.]


    Determinization: Motivation (1)

    Efficiency of use (time)

    Elimination of redundancy

    No loss of information (≠ pruning)


    Determinization: Motivation (3)

    [Figure 15: Determinized language model (9 states, 11 transitions, 4 paths) over the phrases "which flights/flight leave/leaves Detroit", with arc weights such as which/69.9, flights/53.1, leaves/62.3, Detroit/101–105.]


    Determinization: Example (1)

    [Figure 16: Determinization of automata — the subset construction maps the nondeterministic states to {0}, {1,2}, {3}.]


    Determinization: Example (2)

    [Figure 17: Determinization of weighted automata (min, +) — subsets of (state, residual weight) pairs, e.g. {(0,0)}, {(1,2),(2,0)}, {(1,0),(2,3)}, {(3,0)}.]


    Determinization: Example (3)

    [Figure 18: Determinization of transducers — subsets of (state, residual string) pairs, e.g. {(0,ε)}, {(1,a),(2,ε)}, {(3,ε)}.]


    Determinization: Example (4)

    [Figure 19: Determinization of weighted transducers (min, +) — subsets of (state, residual string, residual weight) triples, e.g. {(0,ε,0)}, {(1,a,1),(2,ε,0)}, {(3,ε,0)}.]


    Determinization: Algorithm (1)

    Generalization of the classical powerset construction for automata

    Subsets made of (state, weight) or (state, string, weight) pairs

    Applies to subsequentiable weighted automata and transducers

    Time and space complexity: exponential (polynomial w.r.t. the size of the result)

    On-the-fly implementation
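    As an illustration of the subset construction with (state, weight) pairs, here is a minimal sketch (mine, not the tutorial's code) over the (min, +) semiring; it assumes the input automaton is determinizable (e.g., acyclic) and ignores final weights for brevity:

        def determinize_min_plus(arcs, start):
            """Weighted subset construction over (min, +).
            arcs[state] = list of (label, weight, next_state)."""
            init = frozenset({(start, 0.0)})
            det, stack, seen = {}, [init], {init}
            while stack:
                subset = stack.pop()
                by_label = {}
                for state, residual in subset:
                    for label, w, nxt in arcs.get(state, []):
                        by_label.setdefault(label, []).append((nxt, residual + w))
                det[subset] = {}
                for label, pairs in by_label.items():
                    best = {}                       # min residual per target state
                    for s, w in pairs:
                        best[s] = min(best.get(s, float("inf")), w)
                    w_min = min(best.values())      # weight emitted on the new arc
                    dest = frozenset((s, w - w_min) for s, w in best.items())
                    det[subset][label] = (w_min, dest)
                    if dest not in seen:
                        seen.add(dest)
                        stack.append(dest)
            return det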


    Determinization: Algorithm (2) Conditions of application

    Twin states: q and q' are twin states iff:

    If: they can be reached from the initial states by the same input string u,

    Then: cycles at q and q' with the same input string v have the same output value.

    Theorem 4 (Choffrut, 1978; Mohri, 1996a). Let τ be an unambiguous weighted automaton (transducer, weighted transducer); then τ can be determinized iff it has the twin property.

    Theorem 5 (Mohri, 1996a). The twin property can be tested in polynomial time.


    Determinization: Theory

    Determinization of automata:

    General case (Aho, Sethi, and Ullman, 1986)

    Specific case: failure functions (Mohri, 1995)

    Determinization of transducers, weighted automata, and weighted transducers:

    General description, theory, and analysis (Mohri, 1996a; Mohri, 1996b)

    Conditions of application and test algorithm

    Acyclic weighted transducers or transducers admit determinization

    Can be used with other semirings (e.g., (R, +, ·))


    Local determinization: Motivation

    Time efficiency

    Reduction of redundancy

    Control of the resulting size (flexibility)

    Equivalent function (or equal set)

    No loss of information


    Local determinization: Example

    [Figure 20: Local determinization of weighted transducers (min, +) — only states whose out-degree exceeds the threshold are determinized, yielding subset states such as {(1,a,0),(2,b,1),(3,a,2)}.]


    Local determinization: Algorithm

    Predicate, e.g.: P(q) ≡ (out-degree(q) > k), with k a threshold parameter

    Local: Dom(det) = { q : P(q) }

    Determinization only for q ∈ Dom(det)

    On-the-fly implementation

    Complexity: O(|Dom(det)| · max_{q ∈ Q} out-degree(q))


    Local determinization: Theory

    Various choices of predicate (constraint: local)

    Definition of parameters

    Applies to all automata, weighted automata, transducers, and weighted transducers

    Can be used with other semirings (e.g., (R, +, ·))


    Minimization: Motivation

    Space efficiency

    Equivalent function (or equal set)

    No loss of information (≠ pruning)


    Minimization: Motivation (2)

    [Figure 21: Determinized language model of Figure 15, before minimization.]


    Minimization: Motivation (3)

    [Figure 22: Minimized language model (6 states).]


    Minimization: Example (1)

    [Figure 23: Minimization of automata — a six-state automaton t96 reduced to an equivalent five-state automaton t97.]


    Minimization: Example (2)

    [Figure 24: Minimization of weighted automata (min, +) — weights are pushed toward the initial state, then classical minimization merges equivalent states (8 states reduced to 6).]


    Minimization: Example (3)

    [Figure 25: Minimization of transducers — output strings are pushed toward the initial state (e.g. a:A becomes a:ABCDB), then classical minimization merges states (8 states reduced to 7).]


    Minimization: Example (4)

    [Figure 26: Minimization of weighted transducers (min, +) — both output strings and weights are pushed (e.g. a:A/0 becomes a:ABCDB/15), then equivalent states are merged.]


    Minimization: Algorithm (1)

    Two steps:

    Pushing, or extraction of strings or weights towards the initial state

    Classical minimization of automata, with each (input, output) pair considered as a single label

    Algorithm for the first step:

    Transducers: specific algorithm

    Weighted automata: shortest-paths algorithms
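    For weighted automata over (min, +), the pushing step computes, for every state, the shortest distance d(q) to a final state and reweights each arc to w + d(next) − d(q). A rough sketch (mine), assuming an acyclic automaton in which every state can reach a final state:

        def push_weights(arcs, finals, topo_order):
            """arcs[q] = list of (label, weight, next_state); finals = set of final states.
            topo_order lists the states so that every arc goes forward."""
            d = {q: (0.0 if q in finals else float("inf")) for q in topo_order}
            for q in reversed(topo_order):          # shortest distance to a final state
                for _, w, nxt in arcs.get(q, []):
                    d[q] = min(d[q], w + d[nxt])
            return {q: [(label, w + d[nxt] - d[q], nxt)   # pushed arc weights
                        for label, w, nxt in arcs.get(q, [])]
                    for q in topo_order}

    After pushing, classical minimization can treat each (label, weight) pair as a single symbol, as described above.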


    Minimization: Algorithm (2) Complexity

    E: set of transitions

    S: sum of the lengths of the output strings

    P_max: the longest of the longest common prefixes of the output paths leaving each state

    Type                General                               Acyclic
    Automata            O(|E| log |Q|)                        O(|Q| + |E|)
    Weighted automata   O(|E| log |Q|)                        O(|Q| + |E|)
    Transducers         O(|Q| + |E| (log |Q| + |P_max|))      O(S + |E| + |Q| + (|E| − (|Q| − |F|)) |P_max|)


    Minimization: Theory

    Minimization of automata (Aho, Hopcroft, and Ullman, 1974; Revuz, 1991)

    Minimization of transducers (Mohri, 1994)

    Minimization of weighted automata (Mohri, 1996a)

    Minimal number of transitions

    Test of equivalence

    Standardization of power series (Schützenberger, 1961):

    Works only with fields

    Creates too many transitions


    Conclusion (1)

    Theory

    Rational power series

    Weighted automata and transducers

    Algorithms

    General (various semirings)

    Efficiency (used in practice, large sizes)


    Conclusion (2)

    Applications

    Text processing (spelling checkers, pattern-matching, indexation, OCR)

    Language processing (morphology, phonology, syntax, language modeling)

    Speech processing (speech recognition, text-to-speech synthesis)

    Computational biology (matching with errors)

    Many other applications


    PART II
    Speech Recognition

    Michael Riley, AT&T Laboratories

    [email protected]

    August 3rd, 1996


    Overview

    The speech recognition problem

    Acoustic, lexical and grammatical models

    Finite-state automata in speech recognition

    Search in finite-state automata


    Speech Recognition

    Given an utterance, find its most likely written transcription.

    Fundamental ideas:

    Utterances are built from sequences of units

    Acoustic correlates of a unit are affected by surrounding units

    Units combine into higher-level units: phones → syllables → words

    Relationships between levels can be modeled by weighted graphs; we use weighted finite-state transducers

    Recognition: find the best path in a suitable product graph


    Levels of Speech Representation


    Maximum A Posteriori Decoding

    Overall analysis [4, 57]:

    Acoustic observations: parameter vectors derived by local spectral analysis of the speech waveform at regular (e.g., 10 ms) intervals

    Observation sequence o

    Transcriptions w

    Probability P(o | w) of observing o when w is uttered

    Maximum a posteriori decoding:

    ŵ = argmax_w P(w | o) = argmax_w P(o | w) P(w) / P(o) = argmax_w P(o | w) P(w)

    where P(o | w) is the generative model and P(w) the language model.


    Generative Models of Speech

    Typical decomposition of P(o | w) into conditionally-independent mappings between levels:

    Acoustic model P(o | p): phone sequences → observation sequences. Detailed model:

    P(o | d): distributions → observation vectors (symbolic → quantitative)

    P(d | m): context-dependent phone models → distribution sequences

    P(m | p): phone sequences → model sequences

    Pronunciation model P(p | w): word sequences → phone sequences

    Language model P(w): word sequences


    Recognition Cascades: General Form

    Multistage cascade: o = s_k -> stage k -> s_{k-1} -> ... -> s_1 -> stage 1 -> w = s_0

    Find s_0 maximizing

    P(s_0, s_k) = P(s_k | s_0) P(s_0) = P(s_0) Σ_{s_1, ..., s_{k-1}} Π_{1 ≤ j ≤ k} P(s_j | s_{j-1})

    Viterbi approximation:

    Cost(s_0, s_k) = Cost(s_k | s_0) + Cost(s_0)

    Cost(s_k | s_0) ≈ min_{s_1, ..., s_{k-1}} Σ_{1 ≤ j ≤ k} Cost(s_j | s_{j-1})

    where Cost(·) = −log P(·).


    Speech Recognition Problems

    Modeling: how to describe accurately the relations between levels ⇒ modeling errors

    Search: how to find the best interpretation of the observations according to the given models ⇒ search errors


    Acoustic Modeling Feature Selection I

    Short-time spectral analysis:

    log |∫ g(τ) x(t + τ) e^(−i 2π f τ) dτ|

    [Figure: short-time (25 ms Hamming window) spectrum of /ae/, Hz vs. dB.]

    Scale selection: cepstral smoothing

    Parameter sampling (13 parameters)


    Acoustic Modeling Feature Selection II [40, 38]

    Refinements:

    Time derivatives: 1st and 2nd order

    Non-Fourier analysis (e.g., Mel scale)

    Speaker/channel adaptation: mean cepstral subtraction, vocal tract normalization, linear transformations

    Result: 39-dimensional feature vector (13 cepstra, 13 delta cepstra, 13 delta-delta cepstra) every 10 milliseconds
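    This 39-dimensional front end is easy to reproduce with modern tools; a sketch of mine using the librosa library (which long postdates the tutorial; the file name is hypothetical):

        import librosa
        import numpy as np

        y, sr = librosa.load("utterance.wav", sr=16000)
        # 13 cepstra from a 25 ms window, every 10 ms
        mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13,
                                    n_fft=int(0.025 * sr),
                                    hop_length=int(0.010 * sr))
        delta = librosa.feature.delta(mfcc)            # 13 delta cepstra
        delta2 = librosa.feature.delta(mfcc, order=2)  # 13 delta-delta cepstra
        features = np.vstack([mfcc, delta, delta2])    # 39 x n_frames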


    Acoustic Modeling Stochastic Distributions [4, 61, 39, 5]

    Vector quantization: find a codebook of prototypes

    Full covariance multivariate Gaussians:

    P[y] = (2π)^(−N/2) |S|^(−1/2) exp(−½ (y − μ)ᵀ S⁻¹ (y − μ))

    Diagonal covariance Gaussian mixtures

    Semi-continuous, tied mixtures
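    A direct transcription of the full-covariance density above into NumPy (a sketch of mine; the log-density is computed for numerical stability):

        import numpy as np

        def gaussian_log_density(y, mu, S):
            """log P[y] for the full-covariance Gaussian defined above."""
            n = len(mu)
            diff = y - mu
            _, logdet = np.linalg.slogdet(S)           # log |S|
            quad = diff @ np.linalg.solve(S, diff)     # (y-mu)^T S^{-1} (y-mu)
            return -0.5 * (n * np.log(2 * np.pi) + logdet + quad)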


    Acoustic Modeling Units and Training [61, 36]

    Units:

    Phonetic (sub-word) units, e.g., cat → /k ae t/

    Context-dependent units, e.g., ae_{k,t}

    Multiple distributions ("states") per phone: left, middle, right

    Training:

    Given a segmentation, training is straightforward

    Obtain segmentation by transcription

    Iterate until convergence


    Generating Lexicons Two Steps

    Orthography → Phonemes: had → /hh ae d/, your → /y uw r/

    complex, context-independent mapping

    usually a small number of alternatives

    determined by spelling constraints; lexical facts

    large online dictionaries available

    Phonemes → Phones: /hh ae d y uw r/ → [hh ae dcl jh axr] (60% prob); /hh ae d y uw r/ → [hh ae dcl d y axr] (40% prob)

    complex, context-dependent mapping; many possible alternatives

    determined by phonological and phonetic constraints


    1. Decision Tree Splitting Rules

    Which split to take at a node? Candidate splits considered:

    Binary cuts: for continuous x (−∞ < x < ∞), consider splits of the form x ≤ k vs. x > k, ∀k.

    Binary partitions: for categorical x ∈ {1, 2, ..., n} = X, consider splits of the form x ∈ A vs. x ∈ X − A, ∀A ⊆ X.


    2. Decision Tree Stopping Rules


    When to declare a node terminal? Strategy (cost-complexity pruning):

    1. Grow an over-large tree.

    2. Form a sequence of subtrees T_0, ..., T_n ranging from the full tree to just the root node.

    3. Estimate an "honest" error rate for each subtree.

    4. Choose the tree size with minimum "honest" error rate.

    To form the sequence of subtrees, vary α from 0 (for the full tree) to ∞ (for just the root node) in:

    min_T R(T) + α |T|.

    To estimate the "honest" error rate, test on data different from the training data, e.g., grow the tree on 9/10 of the available data and test on the remaining 1/10, repeating 10 times and averaging (cross-validation).
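    Scikit-learn implements this cost-complexity pruning directly; a sketch of mine, on a synthetic dataset, forming the subtree sequence by varying α and picking the size by 10-fold cross-validation:

        from sklearn.datasets import make_classification
        from sklearn.model_selection import cross_val_score
        from sklearn.tree import DecisionTreeClassifier

        X, y = make_classification(n_samples=2000, random_state=0)
        # One alpha per subtree T_0 ... T_n, from full tree toward the root
        path = DecisionTreeClassifier(random_state=0).cost_complexity_pruning_path(X, y)
        scores = [(a, cross_val_score(DecisionTreeClassifier(ccp_alpha=a, random_state=0),
                                      X, y, cv=10).mean())
                  for a in path.ccp_alphas]
        best_alpha = max(scores, key=lambda s: s[1])[0]  # max CV accuracy = min "honest" error
        tree = DecisionTreeClassifier(ccp_alpha=best_alpha, random_state=0).fit(X, y)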


    End of Declarative Sentence Prediction: Pruning Sequence

    [Plot: error rate (0.005–0.025) vs. number of terminal nodes (0–100); + = raw error, o = cross-validated error.]


    3. Decision Tree Node Assignment

    Which class/value to assign to a terminal node?

    Plurality vote: choose the most frequent class at that node for classification; choose the mean value for regression.


    End-of-Declarative-Sentence Prediction: Features [65]


    Prob[word with "." occurs at end of sentence]

    Prob[word after "." occurs at beginning of sentence]

    Length of word with "."

    Length of word after "."

    Case of word with ".": Upper, Lower, Cap, Numbers

    Case of word after ".": Upper, Lower, Cap, Numbers

    Punctuation after "." (if any)

    Abbreviation class of word with ".": e.g., month name, unit-of-measure, title, address name, etc.


    End of Declarative Sentence?

    [Figure: decision tree over features bprob, eprob, next (case of the following word), and type (abbreviation class), with yes/no counts at each node, e.g. 48294/52895 yes at the root and 42755/42875 yes at the largest leaf.]


    Phoneme-to-Phone Alignment


    PHONEME   PHONE   WORD
    p         p       purpose
    er        er
    p         pcl
    -         p
    ax        ix
    s         s
    ae        ax      and
    n         n
    d         -
    r         r       respect
    ih        ix
    s         s
    p         pcl
    -         p
    eh        eh
    k         kcl
    t         t


    Phoneme-to-Phone Realization: Features [66, 10, 62]


    Phonemic Context:

    Phoneme to predict

    Three phonemes to left

    Three phonemes to right

    Stress (0, 1, 2)

    Lexical Position:

    Phoneme count from start of word

    Phoneme count from end of word


    Phoneme-to-Phone Realization: Prediction Example


    Tree splits for /t/ in "your pretty red":

    PHONE    COUNT    SPLIT
    ix       182499
    n        87283    cm0: vstp,ustp,vfri,ufri,vaff,uaff,nas
    kcl+k    38942    cm0: vstp,ustp,vaff,uaff
    tcl+t    21852    cp0: alv,pal
    tcl+t    11928    cm0: ustp
    tcl+t    5918     vm1: mono,rvow,wdi,ydi
    dx       3639     cm-1: ustp,rho,n/a
    dx       2454     rstr: n/a,no


    Phoneme-to-Phone Realization: Network Example


    Phonetic network for "Don had your pretty...":

    PHONEME   PHONE1         PHONE2        CONTEXT
    d         0.91 d
    aa        0.92 aa
    n         0.98 n
    hh        0.74 hh       0.15 hv
    ae        0.73 ae       0.19 eh
    d         0.51 dcl jh   0.37 dcl d
    y         0.90 y                      (if d → dcl d)
              0.84 -        0.16 y        (if d → dcl jh)
    uw        0.48 axr      0.29 er
    r         0.99 -
    p         0.99 pcl p
    r         0.99 r
    ih        0.86 ih
    t         0.73 dx       0.11 tcl t
    iy        0.90 iy


    Acoustic Model Context Selection [92, 39]

    Statistical regression trees used to predict contexts based on distribution variance

    One tree per context-independent phone and state (left, middle, right)

    The trees were grown until the data criterion of 500 frames per distribution was met

    Trees pruned using cost-complexity pruning and cross-validation to select the best contexts

    About 44,000 context-dependent phone models

    About 16,000 distributions


    N-Grams: Basics


    Chain Rule and Joint/Conditional Probabilities:

    P[x1 x2 ... xN] = P[xN | x1 ... x(N-1)] P[x(N-1) | x1 ... x(N-2)] ... P[x2 | x1] P[x1]

    where, e.g.,

    P[xN | x1 ... x(N-1)] = P[x1 ... xN] / P[x1 ... x(N-1)]

    (First-order) Markov assumption:

    P[xk | x1 ... x(k-1)] = P[xk | x(k-1)] = P[x(k-1) xk] / P[x(k-1)]

    nth-order Markov assumption:

    P[xk | x1 ... x(k-1)] = P[xk | x(k-n) ... x(k-1)] = P[x(k-n) ... xk] / P[x(k-n) ... x(k-1)]


    N-Grams: Maximum Likelihood Estimation


    Let N be the total number of n-grams observed in a corpus and c(x1 ... xn) be the number of times the n-gram x1 ... xn occurred. Then

    P[x1 ... xn] = c(x1 ... xn) / N

    is the maximum likelihood estimate of that n-gram probability.

    For conditional probabilities,

    P[xn | x1 ... x(n-1)] = c(x1 ... xn) / c(x1 ... x(n-1))

    is the maximum likelihood estimate. With this method, an n-gram that does not occur in the corpus is assigned zero probability.
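    The estimates are just ratios of counts; a minimal bigram sketch (mine; it ignores sentence boundaries):

        from collections import Counter

        tokens = "a rose is a rose".split()
        bigrams = Counter(zip(tokens, tokens[1:]))
        unigrams = Counter(tokens)
        # Maximum likelihood conditional estimates c(x y) / c(x)
        p = {(x, y): c / unigrams[x] for (x, y), c in bigrams.items()}
        # p[("a", "rose")] == 1.0; any unseen bigram gets probability zero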


    N-Grams: Good-Turing-Katz Estimation [29, 16]


    Let n_r be the number of n-grams that occurred r times. Then

    P[x1 ... xn] = c*(x1 ... xn) / N

    is the Good-Turing estimate of that n-gram probability, where

    c*(x) = (c(x) + 1) n_(c(x)+1) / n_(c(x)).

    For conditional probabilities,

    P[xn | x1 ... x(n-1)] = c*(x1 ... xn) / c(x1 ... x(n-1)), for c(x1 ... xn) > 0,

    is Katz's extension of the Good-Turing estimate.

    With this method, an n-gram that does not occur in the corpus is assigned the backoff probability

    P[xn | x1 ... x(n-1)] = α P[xn | x2 ... x(n-1)],

    where α is a normalizing constant.
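    The adjusted counts c*(x) depend only on the count-of-counts n_r; a sketch of mine reusing the bigram counts from the previous example. As written, items with the highest observed count get c* = 0; practical estimators smooth the n_r:

        from collections import Counter

        def good_turing(counts):
            """c*(x) = (c(x)+1) * n_{c(x)+1} / n_{c(x)} for observed items."""
            n = Counter(counts.values())     # n[r] = number of items seen r times
            return {x: (c + 1) * n.get(c + 1, 0) / n[c] for x, c in counts.items()}

        tokens = "a rose is a rose".split()
        adjusted = good_turing(Counter(zip(tokens, tokens[1:])))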


    Finite-State Modeling [57]


    Our view of recognition cascades: represent the mappings between levels, the observation sequences, and the language uniformly with weighted finite-state machines:

    Probabilistic mapping P(x | y): weighted finite-state transducer. Example word pronunciation transducer:

    [Figure: transducer for "data" with arcs d:ε/1, ey:ε/0.4, ae:ε/0.6, dx:ε/0.8, t:ε/0.2, ax:"data"/1.]

    Language model P(w): weighted finite-state acceptor


    Example of Recognition Cascade


    [Diagram: observations O → A → phones → D → words → M]

    Recognition from observations o by composition:

    Observations: O(s, s) = 1 if s = o, 0 otherwise

    Acoustic-phone transducer: A(a, p) = P(a | p)

    Pronunciation dictionary: D(p, w) = P(p | w)

    Language model: M(w, w) = P(w)

    Recognition: ŵ = argmax_w (O ∘ A ∘ D ∘ M)(o, w)


    Speech Models as Weighted Automata


    Quantized observations: o1 o2 ... on at times t0 t1 ... tn

    Phone model A_p: observations → phones

    [Figure: three-state model with states s0, s1, s2; self-loops oi:ε/p00(i), oi:ε/p11(i), oi:ε/p22(i); forward arcs oi:ε/p01(i), oi:ε/p12(i); and exit arc ε:p/p2f.]

    Acoustic transducer: A = ⋃_p A_p

    Word pronunciation D_"data": phones → words

    [Figure: transducer with arcs d:ε/1, ey:ε/0.4, ae:ε/0.6, dx:ε/0.8, t:ε/0.2, ax:"data"/1.]

    Dictionary: D = ⋃_w D_w


    Sample Pronunciation Dictionary D

    Dictionary with "hostile", "battle" and "bottle" as a weighted transducer:

    [Figure: weighted transducer with phone:word/cost arcs such as hh:hostile/0.134, hv:hostile/2.635, b:battle/0.000, b:bottle/0.000, aa:-/0.055, ae:-/0.057, t:-/2.113, dx:-/0.240, l:-/0.112.]


    Sample Language Model M

    Simplified language model as a weighted acceptor:

    [Figure: weighted acceptor over "battle", "bottle", "hostile", with arc costs such as battle/6.603, hostile/9.394, bottle/11.510, battle/9.268, hostile/11.119.]

    Recognition by Composition


    From phones to words: compose the dictionary with the phone lattice to yield a word lattice with combined acoustic and pronunciation costs:

    0 -hostile/-32.900-> 1 -battle/-26.825-> 2

    Applying the language model: compose the word lattice with the language model to obtain a word lattice with combined acoustic, pronunciation, and language model costs:

    [Word lattice with states 0–3 and arcs hostile/-21.781, hostile/-19.407, battle/-17.916, battle/-15.250.]


    Context-Dependency Examples

    Context-dependent phone models: maps from CI units to CD units. Example: ae/b_d → ae_{b,d}

    Context-dependent allophonic rules: maps from baseforms to detailed phones. Example: t/V'_V → dx (flapping between vowels)

    Difficulty: cross-word contexts. Where several words enter and leave a state in the grammar, simple substitution does not apply.


    Context-Dependency Transducers


    Example triphonic context transducer for two symbols x and y:

    [Figure: four states x.x, x.y, y.x, y.y with arcs x/x_x:x, x/x_y:x, y/x_x:y, y/x_y:y, x/y_x:x, x/y_y:x, y/y_x:y, y/y_y:y.]


    On-Demand Composition [69, 53]


    Create a generalized state machine C for the composition A ∘ B:

    C.start := (A.start, B.start)

    C.final((s1, s2)) := A.final(s1) ∧ B.final(s2)

    C.arcs((s1, s2)) := Merge(A.arcs(s1), B.arcs(s2))

    Merged arcs are defined by:

    (l1, l3, x + y, (ns1, ns2)) ∈ Merge(A.arcs(s1), B.arcs(s2))

    iff

    (l1, l2, x, ns1) ∈ A.arcs(s1) and (l2, l3, y, ns2) ∈ B.arcs(s2)
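    A minimal sketch (mine, not the AT&T code) of this generalized state machine in Python; states are expanded only when the decoder asks for their arcs, so nothing is tabulated up front:

        class LazyCompose:
            """On-demand composition of machines exposing start, final(s), arcs(s)."""
            def __init__(self, A, B):
                self.A, self.B = A, B
                self.start = (A.start, B.start)

            def final(self, state):
                s1, s2 = state
                return self.A.final(s1) and self.B.final(s2)

            def arcs(self, state):              # Merge(A.arcs(s1), B.arcs(s2))
                s1, s2 = state
                for l1, l2, x, ns1 in self.A.arcs(s1):
                    for m2, l3, y, ns2 in self.B.arcs(s2):
                        if l2 == m2:
                            yield (l1, l3, x + y, (ns1, ns2))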


    State Caching


    Create a generalized state machine B for an input machine A:

    B.start := A.start

    B.final(state) := A.final(state)

    B.arcs(state) := A.arcs(state)

    Cache disciplines:

    Expand each state of A exactly once, i.e., always save in the cache (memoize).

    Cache, but forget old states using a least-recently-used criterion.

    Use instructions (ref counts) from the user (decoder) to save and forget.


    On-Demand Composition Results

    ATIS task: class-based trigram grammar, full cross-word triphonic context-dependency.

                      states      arcs
    context              762     40386
    lexicon             3150      4816
    grammar            48758    359532
    full expansion   1.6 × 10^6  5.1 × 10^6

    For the same recognition accuracy as with a static, fully expanded network, on-demand composition expands just 1.6% of the total number of arcs.


    Determinization in Large Vocabulary Recognition

    For large vocabularies, string lexicons are very non-deterministic

    Determinizing the lexicon solves this problem, but can introduce non-coaccessible states during its composition with the grammar

    Alternate solutions:

    Off-line compose, determinize, and minimize: Lexicon ∘ Grammar

    Pre-tabulate the non-coaccessible states in the composition of: Det(Lexicon) ∘ Grammar


    Search in Recognition Cascades


    Reminder: Cost ≡ −log probability

    Example recognition problem: ŵ = argmax_w (O ∘ A ∘ D ∘ M)(o, w)

    Viterbi search: approximate ŵ by the output word sequence for the lowest-cost path from the start state to a final state in O ∘ A ∘ D ∘ M; this ignores summing over multiple paths with the same output:

    [Figure: several parallel paths with outputs w1, ..., wi, ..., wn through O ∘ A ∘ D ∘ M.]

    Composition preserves acyclicity; O is acyclic ⇒ acyclic search graph


    Single-source Shortest Path Algorithms [83]


    Meta-algorithm:

    Q ← {s0}; Cost(s0) ← 0; for all other s, Cost(s) ← ∞

    While Q is not empty: s ← Dequeue(Q)

        For each s' ∈ Adj[s] such that Cost(s') > Cost(s) + cost(s, s'):

            Cost(s') ← Cost(s) + cost(s, s')

            Enqueue(Q, s')

    Specific algorithms:

    Name          Queue type    Cycles   Neg. weights   Complexity
    acyclic       topological   no       yes            O(|V| + |E|)
    Dijkstra      best-first    yes      no             O(|E| log |V|)
    Bellman-Ford  FIFO          yes      yes            O(|V| |E|)
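    The meta-algorithm specializes by queue discipline alone; a compact sketch (mine) in which "fifo" gives Bellman-Ford-style relaxation and "best" gives Dijkstra:

        import heapq
        from collections import deque

        def shortest_paths(adj, s0, queue="fifo"):
            """adj[s] = list of (s2, weight); returns Cost(s) for reachable states."""
            cost = {s0: 0.0}
            if queue == "best":                  # Dijkstra: non-negative weights
                heap = [(0.0, s0)]
                while heap:
                    c, s = heapq.heappop(heap)
                    if c > cost.get(s, float("inf")):
                        continue                 # stale queue entry
                    for s2, w in adj.get(s, []):
                        if cost.get(s2, float("inf")) > c + w:
                            cost[s2] = c + w
                            heapq.heappush(heap, (c + w, s2))
            else:                                # FIFO: Bellman-Ford style
                q = deque([s0])
                while q:
                    s = q.popleft()
                    for s2, w in adj.get(s, []):
                        if cost.get(s2, float("inf")) > cost[s] + w:
                            cost[s2] = cost[s] + w
                            q.append(s2)
            return cost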


    The Search Problem


    Obvious first approach: use an appropriate single-source shortest-path algorithm

    Problem: impractical to visit all states; can we do better?

    Admissible methods: guarantee finding the best path, but reorder the search to avoid exploring provably bad regions

    Non-admissible methods: may fail to find the best path, but may need to explore much less of the graph

    Current practical approaches:

    Heuristic cost functions

    Beam search

    Multipass search

    Rescoring


    Heuristic Cost Function A* Search [4, 56, 17]

    States in the search are ordered by cost-so-far(s) + lower-bound-to-complete(s)

    With a tight bound, states not on good paths are not explored

    With a loose lower bound, no better than Dijkstra's algorithm

    Where to find a tight bound? Full search of a composition of smaller automata (homomorphic automata with lower-bounding costs?)

    Non-admissible A* variants: use an averaged estimate of the cost-to-complete, not a lower bound


    Beam Search [35]

    Only explore states with costs within a beam (threshold) of the cost of the best comparable state

    Non-admissible

    Comparable states: states corresponding to (approximately) the same observations

    Synchronous (Viterbi) search: explore composition states in chronological observation order

    Problem with synchronous beam search: too local; some observation subsequences are unreliable and may locally put the best overall path outside the beam
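    One synchronous step of beam pruning fits in a few lines; a sketch of mine of frame-synchronous Viterbi search in which, after each observation, only hypotheses within `beam` of the best cost survive:

        def viterbi_beam(frames, expand, start, beam):
            """frames: iterable of observations; expand(state, obs) -> [(next, cost)]."""
            active = {start: 0.0}
            for obs in frames:
                nxt = {}
                for s, c in active.items():
                    for s2, w in expand(s, obs):
                        if c + w < nxt.get(s2, float("inf")):
                            nxt[s2] = c + w
                if not nxt:
                    break                        # all hypotheses pruned away
                best = min(nxt.values())
                active = {s: c for s, c in nxt.items() if c <= best + beam}
            return active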


    Beam-Search Tradeoffs [68]


    Word lattice: result of composing the observation sequence, the level transducers, and the language model.

    Beam   Word lattice error rate   Median number of edges
    4      7.3%                      86.5
    6      5.4%                      244.5
    8      4.4%                      827
    10     4.1%                      3520
    12     4.0%                      13813.5


    Multipass Search [52, 3, 68]


    Use a succession of binary compositions instead of a single n-way composition; combinable with other methods

    Prune: use a two-pass variant of composition to remove states not on any path close enough to the best

    Pruned intermediate lattices are smaller, so fewer state pairings are considered

    Approximate: use simpler models (context-independent phone models, low-order language models)

    Rescore ...


    PART III
    Finite-State Methods in Language Processing

    Richard Sproat

    Speech Synthesis Research Department

    Bell Laboratories, Lucent Technologies

    [email protected]


    Overview


    Text-analysis for Text-to-Speech (TTS) Synthesis

    A rich domain with lots of linguistic problems

    Probably the least familiar application of NLP technologies

    Syntactic analysis

    Some thoughts on text indexation


    The Nature of the TTS Problem


    This is some text: It was a dark and stormy night. Four score and seven years ago. Now is the time for all good men. Let them eat cake. Quoth the raven nevermore.

    Linguistic Analysis: text → phonemes, durations and pitch contours

    Speech Synthesis: → speech waveforms


    From Text to Linguistic Representation


    [Figure: linguistic analysis of the Mandarin sentence lao3shu3 chi1 you2 "The rat is eating the oil": part-of-speech tags N V N, tones L H H L HL, and the associated phoneme, duration, and pitch annotations.]


    Russian Percentages: The Problem

    How do you say "%" in Russian?

    Adjectival forms when modifying nouns


    20% skidka ⇒ dvadcati-procentnaja skidka
    ("20% discount")

    s 20% rastvorom ⇒ s dvadcati-procentnym rastvorom
    ("with 20% solution")

    Nominal forms otherwise:

    21% ⇒ dvadcat' odin procent

    23% ⇒ dvadcat' tri procenta

    20% ⇒ dvadcat' procentov

    s 20% ⇒ s dvadcat'ju procentami ("with 20%")


    Text Analysis Problems

    Segment text into words.


    Segment text into sentences, checking for and expanding abbreviations:

    St. Louis is in Missouri.

    Expand numbers

    Lexical and morphological analysis

    Word pronunciation

    Homograph disambiguation

    Phrasing

    Accentuation


    Desiderata for a Model of Text Analysis for TTS


    Delay decisions until we have enough information to make them

    Possibly weight various alternatives

    Weighted Finite-State Transducers offer an attractive computational model


    Overall Architectural Matters

    Example: word pronunciation in Russian


    Text form: kostra <kostra> (bonfire + genitive.singular)

    Morphological analysis: kost"{E}r{noun}{masc}{inan} + "a{sg}{gen}

    Pronunciation: /kastr"a/

    Minimal Morphologically-Motivated Annotation (MMA): kostr"a

    (Sproat, 1996)


    Overall Architectural Matters


    [Figure: overall architecture. The surface orthographic form (#KOSTRA#) is related by a lexical analysis WFST L to the morphological analysis (#KOST"{E}R{noun}{masc}{inan}+"A{sg}{gen}#) and the MMA (#KOSTR"A#), and by a phonological analysis WFST P to the pronunciation (#kastr"a#); each level-to-level mapping is an FST.]

Orthography → Lexical Representation

    A Closer Look

[Figure: the orthography-to-lexical-representation machine, built from component transducers: Words : Lexical Annotations, Lexical Annotations : Lexical Analysis, Special Symbols : Expansions, and Numerals : Expansions, joined by Punctuation : Interpretation and SPACE : Interpretation arcs.]

SPACE: white space in German, Spanish, Russian, ...; ε in Japanese, Chinese, ...

    M.Mohri-M.Riley-R.Sproat Algorithms for Speech Recognition and Language Processing PART III 111

    Chinese Word Segmentation

[Sample dictionary entries; the Chinese characters were lost in extraction, so each entry is shown as its pinyin analysis with part-of-speech tag and weight (a negative log probability):]

    le0              perf(ective)         asp   4.68
    liao3jie3        'understand'         vb    8.11
    da4              'big'                vb    5.56
    da4jie1          'avenue'             nc   11.45
    bu4              'not'                adv   4.58
    zai4             'at'                 vb    4.45
    wang4            'forget'             vb   11.77
    wang4+bu4liao3   'unable to forget'   npot 12.23
    wo3              'I'                  np    4.88
    fang4            'place'              vb    8.05
    fang4da4         'enlarge'            vb   10.70
    na3li3           'where'              nc   11.02
    jie1             'avenue'             nc   10.35
    jie3fang4        'liberation'         nc   10.92
    xie4fang4da4     (personal name)      urnp 42.23

    M.Mohri-M.Riley-R.Sproat Algorithms for Speech Recognition and Language Processing PART III 112

    Chinese Word Segmentation

Space = ε : #

L = Space ∪ (Dictionary (Space ∪ Punc))⁺

BestPath(Input ∘ L) = wo3 np<4.88> # wang4+bu4liao3 vb+npot<12.23> # jie3fang4 nc<10.92> # da4jie1 nc<11.45> ...

'I couldn't forget where Liberation Avenue is.'

    M.Mohri-M.Riley-R.Sproat Algorithms for Speech Recognition and Language Processing PART III 113
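Conceptually, BestPath(Input ∘ L) is just a lowest-cost segmentation. The sketch below is a stand-in, not the actual WFST machinery: pinyin syllables replace the lost Chinese characters, the weights are the dictionary costs above, and the best segmentation is found by dynamic programming over prefix positions.

    # Weights are -log probabilities from the dictionary slide; pinyin
    # syllables stand in for the Chinese characters lost in extraction.
    DICT = {
        ("wo3",): 4.88,                     # 'I'
        ("wang4",): 11.77,                  # 'forget'
        ("bu4",): 4.58,                     # 'not'
        ("wang4", "bu4", "liao3"): 12.23,   # 'unable to forget'
        ("liao3", "jie3"): 8.11,            # 'understand'
        ("jie3", "fang4"): 10.92,           # 'liberation'
        ("fang4",): 8.05,                   # 'place'
        ("fang4", "da4"): 10.70,            # 'enlarge'
        ("da4",): 5.56,                     # 'big'
        ("da4", "jie1"): 11.45,             # 'avenue'
        ("jie1",): 10.35,                   # 'avenue'
    }

    def best_segmentation(syllables):
        """Lowest total cost segmentation, by DP over prefix positions."""
        n = len(syllables)
        best = [(0.0, [])] + [(float("inf"), None)] * n
        for i in range(n):
            if best[i][1] is None:          # position i not reachable
                continue
            for j in range(i + 1, n + 1):
                word = tuple(syllables[i:j])
                if word in DICT and best[i][0] + DICT[word] < best[j][0]:
                    best[j] = (best[i][0] + DICT[word], best[i][1] + [word])
        return best[n]

    print(best_segmentation(
        ["wo3", "wang4", "bu4", "liao3", "jie3", "fang4", "da4", "jie1"]))
    # -> cost 39.48 with segmentation wo3 | wang4+bu4+liao3 | jie3+fang4 | da4+jie1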

    Numeral Expansion


234   Factorization  ⇒  2 × 10^2 + 3 × 10^1 + 4

      DecadeFlop     ⇒  2 × 10^2 + 4 + 3 × 10^1

      NumberLexicon  ⇓

zwei+hundert+vier+und+dreißig

    M.Mohri-M.Riley-R.Sproat Algorithms for Speech Recognition and Language Processing PART III 114
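The same pipeline can be mimicked procedurally for numbers below 1000. The following toy Python sketch is ours (the real system is a cascade of transducers); it performs factorization, the decade flop and lexicon lookup, writing '+' for the {++} morph boundary and 'ss' for ß:

    UNITS = {1: "eins", 2: "zwei", 3: "drei", 4: "vier", 5: "fuenf",
             6: "sechs", 7: "sieben", 8: "acht", 9: "neun"}
    TEENS = {0: "zehn", 1: "elf", 2: "zwoelf", 3: "dreizehn", 4: "vierzehn",
             5: "fuenfzehn", 6: "sechzehn", 7: "siebzehn", 8: "achtzehn",
             9: "neunzehn"}
    TENS = {2: "zwanzig", 3: "dreissig", 4: "vierzig", 5: "fuenfzig",
            6: "sechzig", 7: "siebzig", 8: "achtzig", 9: "neunzig"}

    def expand(n):
        """Factorize n = h*10^2 + t*10^1 + u, flop units before tens, look up."""
        assert 0 < n < 1000
        h, t, u = n // 100, n % 100 // 10, n % 10
        parts = []
        if h:
            parts += ["ein" if h == 1 else UNITS[h], "hundert"]
        if t == 1:
            parts.append(TEENS[u])                  # zehn, elf, zwoelf, ...
        elif t:
            if u:                                   # the decade flop
                parts += ["ein" if u == 1 else UNITS[u], "und"]
            parts.append(TENS[t])
        elif u:
            parts.append(UNITS[u])
        return "+".join(parts)

    print(expand(234))  # -> zwei+hundert+vier+und+dreissig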

    Numeral Expansion

[Figure: the factorization transducer. Digits pass through unchanged (0:0, 1:1, ..., 9:9) and markers for powers of ten are inserted (ε:10^2, ε:10^1) after the appropriate digit positions, so that 234 maps to 2 10^2 3 10^1 4.]

    M.Mohri-M.Riley-R.Sproat Algorithms for Speech Recognition and Language Processing PART III 115

German Numeral Lexicon

/{1} : (eins{num}({masc}|{neut}){sg}{##})/
/{2} : (zwei{num}{##})/
/{3} : (drei{num}{##})/
...
/({0}{+++}{1}{10^1}) : (zehn{num}{##})/
/({1}{+++}{1}{10^1}) : (elf{num}{##})/
/({2}{+++}{1}{10^1}) : (zwölf{num}{##})/
/({3}{+++}{1}{10^1}) : (drei{++}zehn{num}{##})/
...
/({2}{10^1}) : (zwan{++}zig{num}{##})/
/({3}{10^1}) : (drei{++}ßig{num}{##})/
...
/({10^2}) : (hundert{num}{##})/
/({10^3}) : (tausend{num}{neut}{##})/

    M.Mohri-M.Riley-R.Sproat Algorithms for Speech Recognition and Language Processing PART III 116

Morphology: Paradigmatic Specifications

    Paradigm A1

Paradigm {A1}
# strong inflection (e.g., after the indefinite article)

Suffix {++}er {sg}{masc}{nom}
Suffix {++}en {sg}{masc}({gen}|{dat}|{acc})
Suffix {++}e {sg}{femi}({nom}|{acc})
Suffix {++}en {sg}({femi}|{neut})({gen}|{dat})
Suffix {++}es {sg}{neut}({nom}|{acc})
Suffix {++}e {pl}({nom}|{acc})
Suffix {++}er {pl}{gen}
Suffix {++}en {pl}{dat}

    M.Mohri-M.Riley-R.Sproat Algorithms for Speech Recognition and Language Processing PART III 117
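Read as data, such a paradigm is simply a list of (suffix, features) pairs that can be concatenated onto any stem assigned to {A1}. A minimal sketch (the rendering and names are ours):

    # Paradigm {A1} as (suffix, features) pairs.
    A1 = [
        ("er", "{sg}{masc}{nom}"),
        ("en", "{sg}{masc}({gen}|{dat}|{acc})"),
        ("e",  "{sg}{femi}({nom}|{acc})"),
        ("en", "{sg}({femi}|{neut})({gen}|{dat})"),
        ("es", "{sg}{neut}({nom}|{acc})"),
        ("e",  "{pl}({nom}|{acc})"),
        ("er", "{pl}{gen}"),
        ("en", "{pl}{dat}"),
    ]

    def inflect(stem):
        """Attach each strong-inflection suffix of {A1} to a stem."""
        return ["%s{++}%s %s" % (stem, suffix, feats) for suffix, feats in A1]

    for form in inflect("aal{++}glatt"):
        print(form)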

Morphology: Paradigmatic Specifications

/{A1} : (aal{++}glatt{adj})/
/{A1} : (ab{++}änder{++}lich{adj}{umlt})/
/{A1} : (ab{++}artig{adj})/
/{A1} : (ab{++}bau{++}würdig{adj}{umlt})/
...
/{A6} : (dein{adj})/
/{A6} : (euer{adj})/
/{A6} : (ihr{adj})/
/{A6} : (Ihr{adj})/
/{A6} : (mein{adj})/
/{A6} : (sein{adj})/
/{A6} : (unser{adj})/

    M.Mohri-M.Riley-R.Sproat Algorithms for Speech Recognition and Language Processing PART III 119

Morphology: Paradigmatic Specifications

Project(({A6} Endings) ∘ (({A6} : Stems) ∪ Id(Σ))) ⇒

[Figure: the resulting automaton for the stem 'mein' with the A6 endings: m-e-i-n {adj} {++}, followed by the suffixes -e/-em/-en/-er/-es with their number ({sg}/{pl}), gender ({masc}/{femi}/{neut}) and case ({nom}/{gen}/{dat}/{acc}) features.]

    M.Mohri-M.Riley-R.Sproat Algorithms for Speech Recognition and Language Processing PART III 120


    Morphology: Finite-State Grammar

FUGE    SECOND {++} <1.5>
FUGE    SECOND {++}s{++} ...
...
SECOND  PREFIX {Eps} <1.0>
SECOND  STEM {Eps} <2.0>
SECOND  WORD {Eps} <2.0>
...
WORD    ...

    M.Mohri-M.Riley-R.Sproat Algorithms for Speech Recognition and Language Processing PART III 122

    Morphology: Finite-State Grammar

Unanständigkeitsunterstellung ('allegation of indecency')

⇓

"un{++}"an{++}stand{++}ig{++}keit{++}s{++}unter{++}stell{++}ung

    M.Mohri-M.Riley-R.Sproat Algorithms for Speech Recognition and Language Processing PART III 123

    Rewrite Rule Compilation

    Context-dependent rewrite rules

General form: φ → ψ / λ __ ρ

φ, ψ, λ, ρ: regular expressions.

Constraint: ψ cannot itself be rewritten, but it can be used as a context

Example: a → b / c __ b

(Johnson, 1972; Kaplan & Kay, 1994; Karttunen, 1995; Mohri & Sproat, 1996)

    M.Mohri-M.Riley-R.Sproat Algorithms for Speech Recognition and Language Processing PART III 124

    Example

a → b / c __ b

w = cab

    M.Mohri-M.Riley-R.Sproat Algorithms for Speech Recognition and Language Processing PART III 125
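For this particular rule the effect of the compiled transducer is easy to check with a regular-expression sketch: a lookbehind for the left context c and a lookahead for the right context b leave the contexts unconsumed, just as the rule's contexts are not themselves rewritten. (This shortcut is only illustrative; it does not generalize to the full obligatory left-to-right semantics that the compiler implements.)

    import re

    # a -> b / c __ b, applied to w = cab
    rule = re.compile(r"(?<=c)a(?=b)")
    print(rule.sub("b", "cab"))  # -> cbb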

[Figures: the intermediate automata for this example at successive stages of the compilation, e.g. the result after the replace transducer has applied.]

    Based on the use of marking transducers

    Brackets inserted only where needed

Efficiency:

    3 determinizations + additional linear time work

    Smaller number of compositions

    M.Mohri-M.Riley-R.Sproat Algorithms for Speech Recognition and Language Processing PART III 128

    Rule Compilation Method

The rule φ → ψ / λ __ ρ is compiled as the composition r ∘ f ∘ replace ∘ l1 ∘ l2, where:

r : inserts a marker > before every instance of ρ

f : inserts markers <1 and <2 before each instance of φ that is followed by >

replace : rewrites φ as ψ between <1 and >, deleting all > markers

l1 : admits only strings in which <1 is preceded by λ, and deletes <1

l2 : admits only strings in which <2 is not preceded by λ, and deletes <2

    M.Mohri-M.Riley-R.Sproat Algorithms for Speech Recognition and Language Processing PART III 129

    Marking Transducers

Proposition: Let A be a deterministic automaton representing Σ*β. Then the transducer built below post-marks occurrences of β with #.

[Figure: a final state q of Id(A) with its entering and leaving transitions (a:a, b:b, c:c, d:d); after the modification, q acquires a new transition emitting the marker (ε:#), giving the marking transducer.]

    M.Mohri-M.Riley-R.Sproat Algorithms for Speech Recognition and Language Processing PART III 130
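The idea is mechanical enough to sketch directly: run the deterministic automaton for Σ*β over the input and emit the marker whenever the machine is in a final state, i.e. whenever the prefix read so far ends in β. A toy Python version with β = b (state names and the alphabet {a, b, c} are ours):

    # Deterministic automaton for Sigma* b over {a, b, c} (beta = b).
    DELTA = {("q0", "a"): "q0", ("q0", "b"): "q1", ("q0", "c"): "q0",
             ("q1", "a"): "q0", ("q1", "b"): "q1", ("q1", "c"): "q0"}
    FINAL = {"q1"}

    def post_mark(w):
        """Copy the input, emitting '#' after every prefix ending in beta."""
        out, state = [], "q0"
        for ch in w:
            state = DELTA[(state, ch)]
            out.append(ch)
            if state in FINAL:
                out.append("#")
        return "".join(out)

    print(post_mark("cabb"))  # -> cab#b#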


    The Transducers as Expressions using Marker

r  = reverse(Marker(reverse(Σ* ρ), 1, {>}, ∅))

f  = reverse(Marker(reverse((Σ ∪ {>})* φ >), 1, {<1, <2}, ∅))

l1 = Marker(Σ* λ, 2, ∅, {<1})

l2 = Marker(Σ* λ, 3, ∅, {<2})

    M.Mohri-M.Riley-R.Sproat Algorithms for Speech Recognition and Language Processing PART III 132

Example: r for rule a → b / c __ b

Here ρ = b, so Σ*ρ is represented by an automaton β over {a, b, c} accepting the strings that end in b.

[Figures: reverse(β); the transducer Marker(reverse(β), 1, {>}, ∅), which inserts > (ε:>) after each accepted reversed prefix; and r = reverse(Marker(reverse(β), 1, {>}, ∅)), which inserts > before every occurrence of b.]

    M.Mohri-M.Riley-R.Sproat Algorithms for Speech Recognition and Language Processing PART III 133

    The Replace Transducer

[Figure: the replace transducer. Outside a bracketed region it copies symbols and brackets (σ:σ, <2:<2) and deletes stray markers (>:ε); on reading <1 it enters a rewriting state in which φ is replaced by ψ up to the closing >.]

    M.Mohri-M.Riley-R.Sproat Algorithms for Speech Recognition and Language Processing PART III 134

    Extension to Weighted Rules

    Weighted context-dependent rules:

φ → ψ / λ __ ρ

φ, λ, ρ: regular expressions,

ψ: a formal power series on the tropical semiring

Example: c → (.9 c) + (.1 t) / a __ t

    M.Mohri-M.Riley-R.Sproat Algorithms for Speech Recognition and Language Processing PART III 135

    Rational power series

Functions S : Σ* → ℝ₊ ∪ {∞}: rational power series

Tropical semiring: (ℝ₊ ∪ {∞}, min, +)

Notation: S = Σ_{w ∈ Σ*} (S, w) w

Example: S = (2a)(3b)(4b)(5b) + (5a)(2b)*

(S, abbb) = min{2 + 3 + 4 + 5, 5 + 2 + 2 + 2} = min{14, 11} = 11

Theorem 6 (Schützenberger, 1961): S is rational iff it is recognizable (representable by a weighted transducer).

    M.Mohri-M.Riley-R.Sproat Algorithms for Speech Recognition and Language Processing PART III 136
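Evaluating (S, w) is a min-plus computation over the paths of a weighted automaton. The sketch below encodes the (reconstructed) example above as an arc list, one path per term of the series, and evaluates abbb; the states and the encoding are ours:

    from math import inf

    # Arcs: state -> [(symbol, weight, next_state)]; weights as in the series above.
    ARCS = {
        0: [("a", 2, 1), ("a", 5, 5)],
        1: [("b", 3, 2)],
        2: [("b", 4, 3)],
        3: [("b", 5, 4)],
        5: [("b", 2, 5)],       # the (2b)* loop
    }
    FINALS = {4, 5}

    def series_weight(w):
        """(S, w): min over accepting paths of the sum of arc weights."""
        frontier = {0: 0.0}
        for ch in w:
            nxt = {}
            for q, c in frontier.items():
                for sym, wt, r in ARCS.get(q, []):
                    if sym == ch:
                        nxt[r] = min(nxt.get(r, inf), c + wt)
            frontier = nxt
        return min((c for q, c in frontier.items() if q in FINALS), default=inf)

    print(series_weight("abbb"))  # min(2+3+4+5, 5+2+2+2) = 11.0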

    Compilation of weighted rules

    Extension of the composition algorithm to the weighted case

Efficient filter for ε-transitions

    Addition of weights of matching labels

    Same compilation algorithm

Single-source shortest-paths algorithms to find the best path

    M.Mohri-M.Riley-R.Sproat Algorithms for Speech Recognition and Language Processing PART III 137
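Since the tropical semiring is (min, +), extracting the best path from the result of a weighted composition is ordinary single-source shortest paths. A minimal Dijkstra-style sketch over a toy arc list (the graph representation is ours):

    import heapq

    def best_path(arcs, start, finals):
        """Dijkstra in the tropical semiring: lowest-cost accepting path."""
        heap, seen = [(0.0, start, [])], set()
        while heap:
            cost, q, labels = heapq.heappop(heap)
            if q in seen:
                continue
            seen.add(q)
            if q in finals:
                return cost, labels
            for wt, r, lab in arcs.get(q, []):
                if r not in seen:
                    heapq.heappush(heap, (cost + wt, r, labels + [lab]))
        return float("inf"), None

    arcs = {0: [(2.0, 1, "x"), (5.0, 2, "y")], 1: [(4.0, 2, "z")]}
    print(best_path(arcs, 0, {2}))  # -> (5.0, ['y'])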

    Rewrite Rules: An Example

s → z / __ ($ | #) VStop

[Figure: the compiled voicing transducer (5 states). It passes V, z, $, # and VStop through unchanged, and emits both s:s and s:z hypotheses, keeping the voiced output only when the following context (a $ or # boundary followed by a voiced segment) is confirmed.]

/mis$mo$/ ∘ Voicing ⇒ /miz$mo$/

    M.Mohri-M.Riley-R.Sproat Algorithms for Speech Recognition and Language Processing PART III 138


    Russian Percentage Expansion: An example

s 5% skidkoj ('with a 5% discount')

Lexical Analysis FST

⇓

s{prep} pjat'{num}{nom}-procentn{adj}+aja{fem}{sg}{nom} skidk{fem}+oj{sg}{instr}

s{prep} pjat'i{gen}-procentn{adj}+oj{fem}{sg}{instr} skidk{fem}+oj{sg}{instr} <2.0>

s{prep} pjat'ju{instr}-procent{noun}+ami{pl}{instr} skidk{fem}+oj{sg}{instr} <4.0>

...

    M.Mohri-M.Riley-R.Sproat Algorithms for Speech Recognition and Language Processing PART III 140


    Percentage Expansion: Continued

s 5% skidkoj

⇓ (best path of the lexical analysis)

s pjat'i{gen}-procentn{adj}+oj{sg}{instr} skidkoj

⇓ (L ∘ P)

s # PiT"!pr@c"Entn&y # sK"!tk&y

    M.Mohri-M.Riley-R.Sproat Algorithms for Speech Recognition and Language Processing PART III 142

    Phrasing Prediction

Problem: predict intonational phrase boundaries in long unpunctuated utterances:

For his part, Clinton told reporters in Little Rock, Ark., on Wednesday ‖ that the pact can be a good thing for America ‖ if we change our economic policy ‖ to rebuild American industry here at home ‖ and if we get the kind of guarantees we need on environmental and labor standards in Mexico ‖ and a real plan ‖ to help the people who will be dislocated by it.

The Bell Labs synthesizer uses a CART-based predictor trained on labeled corpora (Wang & Hirschberg 1992).

    M.Mohri-M.Riley-R.Sproat Algorithms for Speech Recognition and Language Processing PART III 143
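At run time a CART is just a chain of feature tests ending in a leaf decision. The sketch below walks such a tree; the features and splits are invented for illustration, loosely echoing the punc / j3f / syls splits in the sample tree that follows:

    def predict(node, feats):
        """Walk a CART: leaves are decisions, internal nodes are feature tests."""
        if isinstance(node, str):
            return node
        feature, test, if_true, if_false = node
        value = feats[feature]
        taken = value in test if isinstance(test, (set, frozenset)) else value < test
        return predict(if_true if taken else if_false, feats)

    # Invented tree: punctuation forces a boundary; otherwise a conjunction or
    # preposition followed by a long stretch of syllables suggests one.
    TREE = ("punc", {"YES"}, "yes",
            ("j3f", {"CC", "CS", "IN", "TO"},
             ("syls", 7.5, "no", "yes"),
             "no"))

    print(predict(TREE, {"punc": "NO", "j3f": "CC", "syls": 9.0}))  # -> yes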


    Phrasing Prediction: Sample Tree

[Figure: the full phrasing-prediction CART for read speech. The root splits on the presence of punctuation (punc); deeper nodes test function-word classes of nearby words (j1f, j3f, j4f), parts of speech (j1v, j2n, j3n, j4n), wh-words (j3w), syllable counts (syls, ssylsp), noun-phrase distance and location (npdist, nploc) and accent status (raj4); each leaf records its training counts and a yes/no boundary decision.]

    M.Mohri-M.Riley-R.Sproat Algorithms for Speech Recognition and Language Processing PART III 145

    Phrasing Prediction: Results

Results for multi-speaker read speech:

    major boundaries only: 91.2%

    collapsed major/minor phrases: 88.4%

    3-way distinction between major, minor and null boundary: 81.9%

Results for spontaneous speech:

    major boundaries only: 88.2%

    collapsed major/minor phrases: 84.4%

    3-way distinction between major, minor and null boundary: 78.9%

Results for 85K words of hand-annotated text, cross-validated on training data: 95.4%.

    M.Mohri-M.Riley-R.Sproat Algorithms for Speech Recognition and Language Processing PART III 146

    Tree-Based Modeling: Prosodic Phrase Prediction

[Figure: a smaller illustrative prosodic-phrase tree. Nodes split on the distance from the last punctuation (dpunc) and on the part of speech of the words to the left and right of the juncture (lpos, rpos: N, V, A, Adv, D, P); leaves give yes/no boundary decisions with counts.]

    M.Mohri-M.Riley-R.Sproat Algorithms for Speech Recognition and Language Processing PART III 147

    The Tree Compilation Algorithm

    (Sproat & Riley, 1996)

Each leaf node corresponds to a single rule defining a constrained weighted mapping for the input symbol associated with the tree

Decisions at each node are stateable as regular expressions restricting the left or right context of the rule(s) dominated by the branch

The full left/right context of the rule at a leaf node is derived by intersecting the expressions traversed between the root and the leaf node

The transducer for the entire tree represents the conjunction of all the constraints expressed at the leaf nodes; it is derived by intersecting together the set of WFSTs corresponding to each of the leaves

Note that intersection is defined for transducers that express same-length relations

The alphabet is defined to be an alphabet of all correspondence pairs that were determined empirically to be possible

    M.Mohri-M.Riley-R.Sproat Algorithms for Speech Recognition and Language Processing PART III 148
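Because the per-leaf WFSTs express same-length relations, each can be viewed as an acceptor over an alphabet of input:output pairs, and intersection reduces to the standard product construction on DFAs. A small sketch under that encoding (the machines and pair symbols are invented):

    # dfa = (start, finals, delta) with delta[(state, pair_sym)] = next state.
    def intersect(dfa1, dfa2):
        """Product construction: accept exactly the pair strings both DFAs accept."""
        s1, f1, d1 = dfa1
        s2, f2, d2 = dfa2
        start = (s1, s2)
        delta, finals = {}, set()
        stack, seen = [start], {start}
        while stack:
            q1, q2 = stack.pop()
            if q1 in f1 and q2 in f2:
                finals.add((q1, q2))
            for (p, sym), r1 in d1.items():
                if p != q1 or (q2, sym) not in d2:
                    continue
                r = (r1, d2[(q2, sym)])
                delta[((q1, q2), sym)] = r
                if r not in seen:
                    seen.add(r)
                    stack.append(r)
        return start, finals, delta

    # One rule allows '#' to surface as '||' or '-'; another allows only '-'.
    A = (0, {0}, {(0, ("#", "||")): 0, (0, ("#", "-")): 0})
    B = (0, {0}, {(0, ("#", "-")): 0})
    print(intersect(A, B)[2])   # only the ('#', '-') arc survives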

    Interpretation of Tree as a Ruleset

The regular expressions along the path to leaf node 16 (the decisions numbered 1, 2 and 4 in the tree) are intersected to yield a single weighted rule for the juncture symbol, roughly:

# → (‖ <1.09>) + (# <0.41>) / ‖ (¬#)* (N ∪ V ∪ A ∪ Adv ∪ D) __

    M.Mohri-M.Riley-R.Sproat Algorithms for Speech Recognition and Language Processing PART III 149

    Summary of Compilation Algorithm

    Each rule represents a weighted two-level surface coercion rule

Rule_L = Compile(φ → ψ_L / ⋂_{p ∈ P} λ_p __ ⋂_{p ∈ P} ρ_p)

Each tree/forest represents a set of simultaneous weighted two-level surface coercion rules:

Rule_T = ⋂_{L ∈ T} Rule_L

Rule_F = ⋂_{T ∈ F} Rule_T

BestPath(,D#N#V#Adv#D#A#N ∘ Tree) ⇒ ,D#N#V#Adv,D#A#N <2.76>

    M.Mohri-M.Riley-R.Sproat Algorithms for Speech Recognition and Language Processing PART III 150

    Lexical Ambiguity Resolution

Word sense disambiguation:

    She handed down a harsh sentence. (French peine)

    This sentence is ungrammatical. (French phrase)

Homograph disambiguation:

    He plays bass. /beɪs/

    This lake contains a lot of bass. /bæs/

Diacritic restoration:

    appeler l'autre côté de l'atlantique (côté 'side')

    Côte d'Azur (côte 'coast')

(Yarowsky, 1992; Yarowsky, 1996; Sproat, Hirschberg & Yarowsky, 1992; Hearst, 1991)

    M.Mohri-M.Riley-R.Sproat Algorithms for Speech Recognition and Language Processing PART III 151


    Homograph Disambiguation 2

Sort by Abs(Log(Pr(Pron1 | Collocation_i) / Pr(Pron2 | Collocation_i)))

Decision List for lead

Logprob   Evidence             Pronunciation
11.40     follow/V + lead   ⇒  lid
11.20     zinc ↔ lead       ⇒  lɛd
11.10     lead level/N      ⇒  lɛd
10.66     of lead in        ⇒  lɛd
10.59     the lead in       ⇒  lid
10.51     lead role         ⇒  lid
10.35     copper ↔ lead     ⇒  lɛd
10.28     lead time         ⇒  lid
10.16     lead poisoning    ⇒  lɛd

    M.Mohri-M.Riley-R.Sproat Algorithms for Speech Recognition and Language Processing PART III 153

    Homograph Disambiguation 3: Pruning

Redundancy by subsumption

Evidence        lɛd    lid    Logprob
lead level/N    219    0      11.10
lead levels     167    0      10.66
lead level      52     0       8.93

Redundancy by association

Evidence             tɛr    tɪr
tear gas             0      1671
tear ↔ police        0      286
tear ↔ riot          0      78
tear ↔ protesters    0      71

    M.Mohri-M.Riley-R.Sproat Algorithms for Speech Recognition and Language Processing PART III 154

    Homograph Disambiguation 4: Use

Choose the single best piece of matching evidence.

Decision List for lead

Logprob   Evidence             Pronunciation
11.40     follow/V + lead   ⇒  lid
11.20     zinc ↔ lead       ⇒  lɛd
11.10     lead level/N      ⇒  lɛd
10.66     of lead in        ⇒  lɛd
10.59     the lead in       ⇒  lid
10.51     lead role         ⇒  lid
10.35     copper ↔ lead     ⇒  lɛd
10.28     lead time         ⇒  lid
10.16     lead poisoning    ⇒  lɛd

    M.Mohri-M.Riley-R.Sproat Algorithms for Speech Recognition and Language Processing PART III 155
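Applying a decision list at run time is a single ordered scan. A toy Python sketch (the predicates, context encoding and ASCII pronunciation labels are ours):

    # Decision list for 'lead': (|logprob|, matcher over a context dict, pron).
    DECISIONS = [
        (11.40, lambda c: c.get("prev_verb") == "follow",            "lid"),
        (11.20, lambda c: "zinc" in c.get("window", ()),             "lEd"),
        (11.10, lambda c: c.get("next") in ("level", "levels"),      "lEd"),
        (10.59, lambda c: c.get("prev") == "the" and c.get("next") == "in", "lid"),
    ]

    def disambiguate(context, default="lid"):
        """The first (strongest) matching piece of evidence decides."""
        for _, matches, pron in DECISIONS:
            if matches(context):
                return pron
        return default

    print(disambiguate({"window": {"zinc", "mine"}, "next": "ore"}))  # -> lEd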

Homograph Disambiguation: Evaluation

Word       Pron1      Pron2      Sample Size   Prior   Performance
lives      laɪvz      lɪvz       33186         .69     .98
wound      waʊnd      wund       4483          .55     .98
Nice       naɪs       nis        573           .56     .94
Begin      bɪˈgɪn     beɪgɪn     1143          .75     .97
Chi        tʃi        kaɪ        1288          .53     .98
Colon      koʊˈloʊn   ˈkoʊlən    1984          .69     .98
lead (N)   lid        lɛd        12165         .66     .98
tear (N)   tɛr        tɪr        2271          .88     .97
axes (N)   ˈæksiz     ˈæksɪz     1344          .72     .96
IV         aɪ vi      fɔɹθ       1442          .76     .98
Jan        dʒæn       jɑn        1327          .90     .98
routed     ɹutɪd      ɹaʊtɪd     589           .60     .94
bass       beɪs       bæs        1865          .57     .99
TOTAL                            63660         .67     .97

    M.Mohri-M.Riley-R.Sproat Algorithms for Speech Recognition and Language Processing PART III 156

    Decision Lists: Summary

Efficient and flexible use of data.

    Easy to interpret and modify.

    M.Mohri-M.Riley-R.Sproat Algorithms for Speech Recognition and Language Processing PART III 157

    Decision Lists as WFSTs

    The lead example

Construct homograph taggers H0, H1, ... that find and tag instances of a homograph set in a lexical analysis. For example, H1 is:

[Figure: the tagger H1, a transducer that passes material through (σ:σ, ##:##) and, on reading 'l e a d' followed by the tag nn, inserts the homograph tag 1 (ε:1) between the word and its part-of-speech tag.]

    M.Mohri-M.Riley-R.Sproat Algorithms for Speech Recognition and Language Processing PART III 158

    Decision Lists as WFSTs

Construct an environmental classifier consisting of a pair of transducers C1 and C2, where:

C1 optionally rewrites any symbol except the word boundary or the homograph tags H0, H1, ..., as a single dummy symbol

C2 classifies contextual evidence from the decision list according to its type, and assigns a cost equal to the position of the evidence in the list; it otherwise passes the dummy symbol, word boundary and H0, H1, ... through:

## follow vb ##     →  ## V0 ##  <1>
## zinc nn ##       →  ## C1 ##  <2>
## level(s?) nn ##  →  ## R1 ##  <3>
## of pp ##         →  ## [1 ##  <2>
## in pp ##         →  ## 1] ##
...

    M.Mohri-M.Riley-R.Sproat Algorithms for Speech Recognition and Language Processing PART III 159

    Decision Lists as WFSTs

Construct a disambiguator D from a set of optional rules of the form:

H0 → ε / V0 ...
H1 → ε / C1 ...
H1 → ε / C1 ...
H0 → ε / ## R0 ...
H1 → ε / ## R1 ...
H0 → ε / [0 #