mediaeval 2015 - gtm-uvigo systems for the query-by-example search on speech task at mediaeval 2015

GTM-UVigo Systems for the Query-by-Example Search on Speech Task at MediaEval 2015

Paula López Otero, Laura Docío Fernández, Carmen García Mateo


Main contributions

Neural networks for phoneme posteriorgram extraction

Phoneme unit selection


Neural networks

We used two Kaldi ASR recipes for phoneme posteriorgram extraction:

LSTM → everything went fine
DNN → very slow and highly memory consuming!

minCnxe on Dev (GA, ES, EN, CZ), ISF and PMUI:

System   GA      ES      EN      CZ      ISF    PMUI
LSTM     0.895   0.879   0.915   0.904   0.34   0.22
DNN      0.898   0.897   0.915   0.922   2.93   6


Neural networks: Untangling the DNN recipe

System                   minCnxe   ISF    PMUI
Initial version          0.897     2.63   6
Phoneme network          0.896     9.96   1.73
One ASR pass             0.896     4.48   1.73
No lattice determinize   0.895     1.66   2.04
No fMLLR                 0.867     0.62   0.48

What works for ASR doesn’t always work for QbESTD
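To make the effect of the DNN recipe changes easier to read, the short sketch below compares two rows of the table. The figures are copied from the slide; the dictionary layout, the `compare` helper and the choice of ratios are illustrative assumptions, and the sketch assumes that lower values are better for minCnxe, ISF and PMUI.

```python
# Figures copied from the table above: (minCnxe, ISF, PMUI) per configuration.
results = {
    "Initial version": (0.897, 2.63, 6.0),
    "Phoneme network": (0.896, 9.96, 1.73),
    "One ASR pass": (0.896, 4.48, 1.73),
    "No lattice determinize": (0.895, 1.66, 2.04),
    "No fMLLR": (0.867, 0.62, 0.48),
}


def compare(baseline, candidate):
    """Hypothetical helper: relative minCnxe drop and ISF/PMUI reduction factors."""
    b_cnxe, b_isf, b_pmui = results[baseline]
    c_cnxe, c_isf, c_pmui = results[candidate]
    return {
        "cnxe_relative_drop": (b_cnxe - c_cnxe) / b_cnxe,  # fraction of the baseline minCnxe
        "isf_ratio": b_isf / c_isf,                        # values above 1 mean the candidate is lighter
        "pmui_ratio": b_pmui / c_pmui,
    }


summary = compare("Initial version", "No fMLLR")
for name, value in summary.items():
    print(f"{name}: {value:.3f}")
```

Under these assumptions, dropping fMLLR lowers minCnxe by roughly 3% relative to the initial version while also reducing ISF and PMUI by a large factor.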


Phoneme unit selection: Cross-lingual search on speech

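[Figure: Venn diagram of the Spanish and English phoneme sets]

Spanish only: /©/ /R/ /R/ /x/

English only: /ð/ /dʒ/ /h/ /ŋ/ /θ/ /r/ /ʃ/ /v/ /w/ /z/ /ʒ/ /æ/ /ª:/ /£/ /i:/ /a:/ /u:/ /ɜ:/

Common to both: /a/ /e/ /i/ /o/ /u/ /b/ /d/ /g/ /p/ /t/ /k/ /m/ /n/ /s/ /tʃ/ /j/ /l/ /f/ /D/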

Many phonemes are not common to both languages ⇒ Do they really contribute somehow?

But we are working with unknown languages! ⇒ automatic selection of the most suitable phonetic units



Computation of the best alignment path $P(Q, D)$ of length $K$

Every step of the path has a cost $c_{i,j}$

Relevance of phoneme $\alpha$: $R(P(Q, D), \alpha) = \frac{1}{K} \sum_{k=1}^{K} c_{i_k, j_k}(\alpha)$

Relevance of phoneme $\beta$: $R(P(Q, D), \beta) = \frac{1}{K} \sum_{k=1}^{K} c_{i_k, j_k}(\beta)$

Relevance of phoneme $\gamma$: $R(P(Q, D), \gamma) = \frac{1}{K} \sum_{k=1}^{K} c_{i_k, j_k}(\gamma)$

$R(P(Q, D), \gamma) > R(P(Q, D), \alpha) > R(P(Q, D), \beta)$
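The relevance computation can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the authors' implementation: the per-phoneme cost is taken to be the absolute difference between query and document posteriors, the total step cost is the sum of the per-phoneme costs, and the path comes from a plain DTW that aligns the full query with the full document. The function names are hypothetical.

```python
import numpy as np


def per_phoneme_cost(query, doc):
    """Assumed per-unit cost c_{i,j}(u) = |q_i[u] - d_j[u]|, shape (n, m, U)."""
    return np.abs(query[:, None, :] - doc[None, :, :])


def best_path(cost):
    """Plain DTW over an (n, m) cost matrix; returns the path as (i, j) pairs."""
    n, m = cost.shape
    acc = np.full((n + 1, m + 1), np.inf)
    acc[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            acc[i, j] = cost[i - 1, j - 1] + min(
                acc[i - 1, j - 1], acc[i - 1, j], acc[i, j - 1]
            )
    # Backtrack from the last cell to the first one.
    i, j = n, m
    path = [(i - 1, j - 1)]
    while (i, j) != (1, 1):
        i, j = min(
            [(i - 1, j - 1), (i - 1, j), (i, j - 1)], key=lambda s: acc[s]
        )
        path.append((i - 1, j - 1))
    return path[::-1]


def relevance(query, doc):
    """R(P(Q, D), u) = (1/K) * sum_k c_{i_k, j_k}(u) for every phoneme unit u."""
    unit_cost = per_phoneme_cost(query, doc)
    path = best_path(unit_cost.sum(axis=2))  # total cost: sum over units (assumption)
    rows, cols = np.array(path).T
    return unit_cost[rows, cols, :].mean(axis=0)


# Toy posteriorgrams with 2 frames and 3 units; only the third unit differs.
query = np.array([[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]])
doc = np.array([[1.0, 0.0, 0.5], [0.0, 1.0, 0.5]])
scores = relevance(query, doc)
ranking = np.argsort(scores)[::-1]  # units sorted from highest to lowest relevance
```

In this toy case the first two units contribute no cost along the path, so only the third unit receives a non-zero relevance.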


Phoneme unit selection: Performance using different phoneme posteriorgrams

[Figure: minCnxe (y-axis, 0.88 to 1) versus number of phoneme units (x-axis, 20 to 80) for the posteriorgrams CZtraps, HUtraps, RUtraps, CZdnn, CZlstm, GAdnn, GAlstm, ESdnn, ESlstm, ENdnn and ENlstm]
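The comparison varies the number of phoneme units that are kept. A minimal sketch of such a reduction follows, assuming that the units with the highest relevance scores are the ones retained and that each frame is renormalised afterwards; both choices, and the `select_units` name, are assumptions made for illustration rather than details taken from the slides.

```python
import numpy as np


def select_units(posteriorgram, relevance, n_units):
    """Keep the n_units highest-relevance columns of a (frames, units) posteriorgram."""
    # Indices of the top-scoring units, restored to their original order.
    keep = np.sort(np.argsort(relevance)[::-1][:n_units])
    reduced = posteriorgram[:, keep]
    # Renormalise each frame so the kept posteriors sum to 1 (assumption).
    totals = np.maximum(reduced.sum(axis=1, keepdims=True), 1e-12)
    return reduced / totals, keep


posteriorgram = np.array([[0.2, 0.3, 0.5], [0.1, 0.6, 0.3]])
relevance_scores = np.array([0.1, 0.9, 0.5])
reduced, kept = select_units(posteriorgram, relevance_scores, n_units=2)
```

Sweeping `n_units` and re-running the search for each value would give one point per setting on a curve like those in the figure.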

