
Word-subword based keyword spotting with implications in OOV detection

Jan “Honza” Černocký, Igor Szöke, Mirko Hannemann, Stefan Kombrink

Brno University of Technology, BUT Speech@FIT

44th Asilomar Conference on Signals, Systems and Computers, 8.11.2010


Agenda

• Word-based STD, OOV problem, subwords
• Experiments
• Sub-word units
• Hybrid word-subword system
• What can we do with OOVs
• Conclusion


Goal of STD and glossary of terms

Goal: detect keywords or key-phrases in input speech. For each detection, output:
• Identity
• Position
• Score

Glossary
• Large Vocabulary Continuous Speech Recognizer – LVCSR – system converting speech into text.
• Out-of-vocabulary – OOV – word which is not in the LVCSR vocabulary.
• Term – textual entry consisting of one or more words in sequence.
• Spoken Term Detection – STD – a way to search for a term in spoken data.
• Subword(s) – unit(s) that are parts of words (phones, syllables, automatically found units, etc.).


Word-based STD

• Thanks to the language model, word-based STD systems reach better accuracy than purely acoustic ones.


Implementation
• The term is searched in the recognition lattice.
• This allows the posterior probability of a term to be estimated.
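A minimal sketch of how a posterior can be read off a lattice, assuming a toy lattice format that is not from the slides: arcs are `(src, dst, label, log_score)` tuples with topologically numbered nodes. The forward-backward pass below gives the posterior of each arc; the scores in `LATTICE` are made up for illustration.

```python
import math
from collections import defaultdict

def log_add(a, b):
    """Return log(exp(a) + exp(b)), safe for -inf inputs."""
    if a == -math.inf:
        return b
    if b == -math.inf:
        return a
    m = max(a, b)
    return m + math.log(math.exp(a - m) + math.exp(b - m))

def arc_posteriors(arcs, start, end):
    """Posterior of every arc; node ids must be topologically ordered (src < dst)."""
    alpha = defaultdict(lambda: -math.inf)  # forward log-scores
    beta = defaultdict(lambda: -math.inf)   # backward log-scores
    alpha[start] = 0.0
    beta[end] = 0.0
    # Forward pass: arcs sorted by source node.
    for src, dst, _, score in sorted(arcs, key=lambda a: a[0]):
        alpha[dst] = log_add(alpha[dst], alpha[src] + score)
    # Backward pass: arcs sorted by destination node, descending.
    for src, dst, _, score in sorted(arcs, key=lambda a: -a[1]):
        beta[src] = log_add(beta[src], score + beta[dst])
    total = alpha[end]  # log-score of all paths through the lattice
    return [(label, src, dst, math.exp(alpha[src] + score + beta[dst] - total))
            for src, dst, label, score in arcs]

# Hypothetical two-path lattice (scores invented for the example).
LATTICE = [
    (0, 1, "AN", math.log(0.6)),
    (1, 3, "EXAMPLE", math.log(0.5)),
    (0, 2, "AMEX", math.log(0.4)),
    (2, 3, "APPLE", math.log(0.5)),
]

for label, src, dst, post in arc_posteriors(LATTICE, start=0, end=3):
    print(f"{label:8s} {src}->{dst} posterior={post:.2f}")
```

A real system would also have to merge overlapping arcs of the same word and handle multi-word terms; the sketch leaves that out.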


The OOV problem

REF: THIS IS AN EXAMPLE OF RECOGNIZER OUTPUT
REC: THIS IS AMEX APPLE OF RECOGNIZER OUTPUT

• One OOV causes several errors:
  • The OOV cannot be found (in the output of LVCSR).
  • The OOV impairs recognition of neighboring words.
• An OOV usually carries a lot of information (named entity).

We need to handle OOVs!
• Word accuracy.
• Spoken term detection accuracy.
• Practical (memory, CPU, index size, etc.).


Answer to OOV problem – sub-word STD

• A subword recognizer is built (its output is a subword lattice).

• The term is converted from words to a sequence of subwords.

• This sequence is searched in the subword lattice.

PRIME MINISTER -> *p-r-ay m* *m-ih-n ih-s t-axr*



Evaluation - TWV

• TWV (Term Weighted Value), defined by NIST for the NIST STD 2006 evaluation:
  • one number
  • higher is better
  • depends on score normalization
• Requires a full STD system


Normalization-independent evaluation - UBTWV

• UBTWV - Upper Bound Term Weighted Value
  • Finds the optimum threshold for each term
  • one number
  • higher is better
  • independent of normalization
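A minimal sketch of both metrics, assuming the usual NIST STD 2006 form TWV(θ) = 1 − average over terms of [P_miss + β·P_FA] with β = 999.9 and the number of non-target trials approximated by seconds of speech minus the true occurrences. The data layout (`per_term`) and the example scores are hypothetical.

```python
import math

BETA = 999.9  # weight of false alarms, as assumed from the NIST STD 2006 setup

def term_value(detections, n_true, speech_seconds, threshold):
    """Value of one term; detections is a list of (score, is_hit) pairs."""
    accepted = [hit for score, hit in detections if score >= threshold]
    n_correct = sum(accepted)
    n_false_alarm = len(accepted) - n_correct
    p_miss = 1.0 - n_correct / n_true
    p_fa = n_false_alarm / (speech_seconds - n_true)
    return 1.0 - (p_miss + BETA * p_fa)

def twv(per_term, speech_seconds, threshold):
    """One global threshold for all terms: depends on score normalization."""
    values = [term_value(d, n, speech_seconds, threshold)
              for d, n in per_term.values()]
    return sum(values) / len(values)

def ubtwv(per_term, speech_seconds):
    """Best threshold chosen separately for each term: normalization-independent."""
    values = []
    for detections, n_true in per_term.values():
        # Candidate thresholds: every detection score, plus "accept nothing".
        candidates = [score for score, _ in detections] + [math.inf]
        values.append(max(term_value(detections, n_true, speech_seconds, t)
                          for t in candidates))
    return sum(values) / len(values)

# Hypothetical detections: term -> ([(score, is_hit), ...], true occurrences)
per_term = {
    "prime minister": ([(0.9, True), (0.8, False), (0.4, True)], 2),
    "amex": ([(0.3, True), (0.7, False)], 1),
}
print("TWV @0.5:", twv(per_term, speech_seconds=10000, threshold=0.5))
print("UBTWV   :", ubtwv(per_term, speech_seconds=10000))
```

Because each term gets its own best threshold, UBTWV can never be lower than TWV at any single global threshold.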


Data

• NIST STD 2006 evaluations.
• 3h of English telephone conversations.
• 373 terms, 1-4 words long, occurring 4737/196 times.


Recognizer I.

• LVCSR developed in the AMI/AMIDA project.
• State-of-the-art system including VTLN, MPE, posterior features, SAT, 3 passes.
• Acoustic models trained on 278h of speech.
• Language model trained on 977M word tokens (50k vocabulary).
• Dictionary pruned to generate OOVs -> WRDRED.
• Word accuracy – 69.04%.


Results

• Words
• Words converted to phones
• Phone recognizer

Phones are too small => longer units are needed



Better subwords – phone multigrams

• Statistics of phone n-grams (up to length 6) are collected from training data (phone transcriptions of speech).
• Probabilities of all units are estimated.
• Training data are segmented into the most probable sequence of multigrams.
• Statistics are recomputed and rarely occurring units are deleted. Several iterations.
• An n-gram language model is estimated on top of the multigram segmentation of the training data.
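The segmentation step can be sketched as a Viterbi search over unit boundaries. This is an illustration under assumptions, not the authors' implementation: units are written as phones joined by "-", and the inventory `UNITS` with its probabilities is invented for the example. Segmenting word by word mimics a constraint that units must not cross word boundaries.

```python
import math

def segment(phones, unit_logprob, max_len=6):
    """Most probable split of a phone sequence into multigram units.

    phones: list of phone symbols; unit_logprob: dict unit -> log-probability.
    Returns (units, total_log_probability).
    """
    n = len(phones)
    best = [-math.inf] * (n + 1)  # best[i]: best log-prob of phones[:i]
    back = [0] * (n + 1)          # back[i]: start index of the last unit
    best[0] = 0.0
    for end in range(1, n + 1):
        for start in range(max(0, end - max_len), end):
            logp = unit_logprob.get("-".join(phones[start:end]))
            if logp is not None and best[start] + logp > best[end]:
                best[end] = best[start] + logp
                back[end] = start
    if best[n] == -math.inf:
        raise ValueError("sequence cannot be covered by the unit inventory")
    units, end = [], n
    while end > 0:  # follow back-pointers from the end of the sequence
        units.append("-".join(phones[back[end]:end]))
        end = back[end]
    return units[::-1], best[n]

# Hypothetical unit inventory (probabilities made up for illustration).
UNITS = {u: math.log(p) for u, p in {
    "p-r-ay": 0.05, "m-ih-n": 0.05, "ih-s": 0.05, "t-axr": 0.05, "m": 0.04,
    "p": 0.01, "r": 0.01, "ay": 0.01, "ih": 0.01, "n": 0.01,
    "s": 0.01, "t": 0.01, "axr": 0.01,
}.items()}

for word_pron in ("p r ay m", "m ih n ih s t axr"):  # PRIME, MINISTER
    units, _ = segment(word_pron.split(), UNITS)
    print(" ".join(units))
```

In training, the unit counts from such segmentations would be re-estimated and rare units pruned over several iterations; that loop is omitted here.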


Constrained multigrams

• nosil – sil is not part of a multigram unit.
• noxwrd – word-boundary information is added to the multigram unit.

Term (word representation): PRIME MINISTER
Term pronunciation: p r ay m m ih n ih s t axr
Term (subword representation): *p-r-ay m* *m-ih-n ih-s t-axr*


Results

• Subword search can process OOV terms.
• Subword search is not as accurate as word search for in-vocabulary terms.
• Subword search consumes more index space.

=> Need for a combination of word and subword searches.



Parallel word-subword

… works, but needs to maintain and run 2 systems.


Hybrid word-subword


Implementation by composition of networks


Multigram dictionary for hybrid system

• For the hybrid system, phone multigrams must not be trained on utterances.
• Phone multigrams are trained on a dictionary.
• Experimented with LVCSR vs. big vs. OOV dictionary.


Results – different configurations

• Pruning factors play a role in memory consumption, index size, RT factor …
• “Reasonable system”:
  • ~2.5x slower than the word system
  • ~2.5x bigger index than the word system
  • Matches the accuracy of the word system for IV terms
  • OOVs are found.



OOV detection by the hybrid system

Comparison of the subword confidence measure to a threshold => detection of OOVs
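A minimal sketch of the thresholding idea, under assumptions not stated on the slide: the hybrid 1-best output is a list of `(label, is_subword, confidence)` tokens, and the confidence of a run of consecutive subwords is taken as the mean of its token confidences. The token values are invented.

```python
from itertools import groupby

def detect_oovs(tokens, threshold):
    """Return OOV candidates as (start, end, subword_labels, confidence).

    tokens: list of (label, is_subword, confidence) from the hybrid 1-best.
    A run of consecutive subword tokens is reported when its mean
    confidence reaches the threshold (the averaging is an assumption).
    """
    regions = []
    position = 0
    for is_subword, group in groupby(tokens, key=lambda t: t[1]):
        group = list(group)
        if is_subword:
            confidence = sum(t[2] for t in group) / len(group)
            if confidence >= threshold:
                labels = [t[0] for t in group]
                regions.append((position, position + len(group), labels, confidence))
        position += len(group)
    return regions

# Hypothetical hybrid output: words in capitals, subword units in lower case.
tokens = [
    ("OFFICE", False, 0.90),
    ("m-ae", True, 0.80),
    ("k-s", True, 0.60),
    ("IS", False, 0.95),
    ("ax", True, 0.20),
]
for start, end, labels, confidence in detect_oovs(tokens, threshold=0.5):
    print(start, end, labels, round(confidence, 2))
```

Raising the threshold trades missed OOVs for fewer false detections.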


OOV recovery

Use of phoneme-to-grapheme (P2G) conversion to derive the word form of a detected OOV


Alignment error model

• Some detected OOVs could even be converted back to in-vocabulary words!
• But the phone pronunciation in the 1-best output is not ideal …
• … alignment error model
  • Parameters (probabilities of deletion, insertion, substitution) are trained from data.
  • Can process a dictionary and look up detected OOVs.
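One way to picture such an error model is a probabilistic edit distance between the recognized phone string and dictionary pronunciations. The sketch below makes several assumptions: a single probability per error type (a trained model would be estimated from data and could depend on the phones involved), invented parameter values, and a tiny hypothetical dictionary.

```python
import math

# Hypothetical parameters; a real model would train these from data.
P_MATCH, P_SUB, P_INS, P_DEL = 0.85, 0.05, 0.05, 0.05

def align_logprob(recognized, canonical):
    """Log-probability of the best alignment of two phone sequences."""
    n, m = len(recognized), len(canonical)
    d = [[-math.inf] * (m + 1) for _ in range(n + 1)]
    d[0][0] = 0.0
    for i in range(n + 1):
        for j in range(m + 1):
            if i > 0:  # recognized phone with no canonical counterpart
                d[i][j] = max(d[i][j], d[i - 1][j] + math.log(P_INS))
            if j > 0:  # canonical phone missing from the recognition
                d[i][j] = max(d[i][j], d[i][j - 1] + math.log(P_DEL))
            if i > 0 and j > 0:  # match or substitution
                p = P_MATCH if recognized[i - 1] == canonical[j - 1] else P_SUB
                d[i][j] = max(d[i][j], d[i - 1][j - 1] + math.log(p))
    return d[n][m]

def lookup(recognized, dictionary):
    """Return (word, log_prob) of the best-matching dictionary entry."""
    scored = ((word, align_logprob(recognized, pron.split()))
              for word, pron in dictionary.items())
    return max(scored, key=lambda item: item[1])

# Hypothetical dictionary with illustrative pronunciations.
DICTIONARY = {
    "ALCOHOL": "ae l k ax hh aa l",
    "OFFICE": "aa f ax s",
    "INFORMATION": "ih n f er m ey sh en",
}

# Detected OOV region with one phone error (ah instead of ax).
word, score = lookup("ae l k ah hh aa l".split(), DICTIONARY)
print(word, round(score, 3))
```

The same alignment score can serve as a similarity measure between two detected OOVs.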


Going more complex …

Can construct a WFST accounting for
• Sequences of in-vocabulary words
• In-vocabulary words + common prefixes and suffixes
• OOVs
• And combinations …

m ey sh en -> INFORMATION
ae l k ax hh aa l ih z em (ALCOHOLISM) -> ALCOHOL / ISM
aa f ax s m ae k s (’Office Max’) -> OFFICE OOV1572


OOV clustering

• Alignment model allows for the evaluation of similarity

• Clustering possible



Conclusion

• Subword system with constrained multigrams - very good STD performance and an OOV-tolerant system.
• Improved hybrid word-subword system tested from the STD accuracy and real application points of view.
  • The hybrid system brings a better accuracy/size ratio and is faster than the standalone system.
  • It works well in a real indexing & search engine.
• With a hybrid system, we can
  • Recover OOVs (simple P2G or more elaborate model)
  • Measure similarity of OOVs
  • Cluster them, find re-occurring ones, update the vocabulary.


Reading and playing with

• Igor Szöke: Hybrid word-subword spoken term detection, Ph.D. thesis, Brno University of Technology, Oct 2010.
• Stefan Kombrink, Mirko Hannemann, Lukáš Burget, and Hynek Heřmanský: Recovery of Rare Words in Lecture Speech, in Proc. Text, Speech and Dialogue (TSD) 2010, Brno, 2010.
• Mirko Hannemann, Stefan Kombrink, Martin Karafiát, and Lukáš Burget: Similarity Scoring for Recognizing Repeated Out-of-Vocabulary Words, in Proc. Interspeech 2010, Makuhari, Japan, 2010.
• … ‘Publications’ section of http://speech.fit.vutbr.cz/
• http://www.superlectures.com/odyssey/


Thank you for your attention
