Post on 20-Jan-2017


GTM-UVigo Systems for the Query-by-Example Search on Speech Task at MediaEval 2015

Paula López Otero, Laura Docío Fernández, Carmen García Mateo

López Otero et al. | GTM-UVigo Systems for the Query-by-Example Search on Speech Task at MediaEval 2015 1/9

Main contributions

Neural networks for phoneme posteriorgram extraction

Phoneme unit selection


Neural networks

We used two Kaldi ASR recipes for phoneme posteriorgram extraction:

LSTM → everything went fine
DNN → very slow and highly memory consuming!!!

minCnxe (Dev)  GA     ES     EN     CZ     ISF   PMUI
LSTM           0.895  0.879  0.915  0.904  0.34  0.22
DNN            0.898  0.897  0.915  0.922  2.93  6


Neural networks
Untangling the DNN recipe

System                  minCnxe  ISF   PMUI
Initial version         0.897    2.63  6
Phoneme network         0.896    9.96  1.73
One ASR pass            0.896    4.48  1.73
No lattice determinize  0.895    1.66  2.04
No fMLLR                0.867    0.62  0.48

What works for ASR doesn’t always work for QbESTD
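
The table can also be read as savings relative to the initial version. The following is a small sketch that uses only the figures on the slide; the dictionary layout and the function name `relative_to_initial` are illustrative, and it assumes that lower is better in all three columns.

```python
# Figures copied from the slide: (minCnxe, ISF, PMUI) per system variant.
RESULTS = {
    "Initial version": (0.897, 2.63, 6.0),
    "Phoneme network": (0.896, 9.96, 1.73),
    "One ASR pass": (0.896, 4.48, 1.73),
    "No lattice determinize": (0.895, 1.66, 2.04),
    "No fMLLR": (0.867, 0.62, 0.48),
}


def relative_to_initial(name):
    """Compare one variant against the initial version.

    Returns the change in minCnxe (negative means lower than the initial
    version) and how many times smaller ISF and PMUI became.
    """
    base_cnxe, base_isf, base_pmui = RESULTS["Initial version"]
    cnxe, isf, pmui = RESULTS[name]
    return {
        "minCnxe_delta": cnxe - base_cnxe,
        "isf_ratio": base_isf / isf,
        "pmui_ratio": base_pmui / pmui,
    }


if __name__ == "__main__":
    for system in RESULTS:
        print(system, relative_to_initial(system))
```

Under that reading, the variant without fMLLR is the only one that lowers all three columns at once.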


Phoneme unit selection
Cross-lingual search on speech

[Figure: Venn diagram of the Spanish and English phoneme sets]
Spanish: /©/ /R/ /R/ /x/
English: /ð/ /d¶/ /h/ /ŋ/ /θ/ /r/ /¬/ /v/ /w/ /z/ /¶/ /æ/ /ª:/ /£/ /i:/ /a:/ /u:/ /3:/
Shared region: /a/ /e/ /i/ /o/ /u/ /b/ /d/ /g/ /p/ /t/ /k/ /m/ /n/ /s/ /t¬/ /j/ /l/ /f/ /D/

Many phonemes are not common to both languages ⇒ Do they really contribute somehow?

But we are working with unknown languages! ⇒ automatic selection of the most suitable phonetic units


Phoneme unit selection

Computation of the best alignment path P(Q, D) of length K

Every step of the path has a cost $c_{i,j}$

Relevance of phoneme α: $R(P(Q, D), \alpha) = \frac{1}{K} \sum_{k=1}^{K} c_{i_k, j_k}(\alpha)$

Relevance of phoneme β: $R(P(Q, D), \beta) = \frac{1}{K} \sum_{k=1}^{K} c_{i_k, j_k}(\beta)$

Relevance of phoneme γ: $R(P(Q, D), \gamma) = \frac{1}{K} \sum_{k=1}^{K} c_{i_k, j_k}(\gamma)$

$R(P(Q, D), \gamma) > R(P(Q, D), \alpha) > R(P(Q, D), \beta)$
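
The relevance defined on this slide is the average, over the K steps of the best alignment path, of the cost attributed to one phoneme unit. A minimal sketch of that computation follows. It rests on assumptions the slides do not state: the path is found with plain dynamic time warping, and the per-unit cost is the squared difference between the query and document posteriors of that unit. The names `frame_costs`, `best_path` and `phoneme_relevance` are illustrative.

```python
import numpy as np


def frame_costs(query, doc):
    """Per-unit cost c[i, j, u] between query frame i and document frame j.

    Assumption: squared posterior difference per phoneme unit u.
    """
    return (query[:, None, :] - doc[None, :, :]) ** 2


def best_path(cost):
    """Plain DTW over an (n, m) cost matrix; returns the path as (i, j) pairs."""
    n, m = cost.shape
    acc = np.full((n, m), np.inf)
    acc[0, 0] = cost[0, 0]
    for i in range(n):
        for j in range(m):
            if i == 0 and j == 0:
                continue
            prev = min(
                acc[i - 1, j] if i > 0 else np.inf,
                acc[i, j - 1] if j > 0 else np.inf,
                acc[i - 1, j - 1] if i > 0 and j > 0 else np.inf,
            )
            acc[i, j] = cost[i, j] + prev
    # Backtrack from the last cell to the first one.
    i, j = n - 1, m - 1
    path = [(i, j)]
    while (i, j) != (0, 0):
        candidates = []
        if i > 0 and j > 0:
            candidates.append((acc[i - 1, j - 1], i - 1, j - 1))
        if i > 0:
            candidates.append((acc[i - 1, j], i - 1, j))
        if j > 0:
            candidates.append((acc[i, j - 1], i, j - 1))
        _, i, j = min(candidates)
        path.append((i, j))
    return path[::-1]


def phoneme_relevance(query, doc):
    """Return (relevance per unit, path): mean per-unit cost along the path."""
    c = frame_costs(query, doc)          # shape (n, m, U)
    path = best_path(c.sum(axis=2))      # total frame cost drives the alignment
    per_step = np.array([c[i, j] for i, j in path])  # shape (K, U)
    return per_step.mean(axis=0), path


if __name__ == "__main__":
    q = np.array([[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]])
    d = np.array([[1.0, 0.0, 0.0], [0.0, 0.0, 1.0]])
    relevance, path = phoneme_relevance(q, d)
    print(path, relevance)
```

Both inputs are posteriorgrams of shape (frames, units). Sorting the returned vector gives an ordering of the units such as the one in the final inequality on the slide.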


Phoneme unit selection
Performance using different phoneme posteriorgrams

[Figure: minCnxe (y-axis, 0.88 to 1) versus number of phoneme units (x-axis, 20 to 80) for the posteriorgram sets CZtraps, HUtraps, RUtraps, CZdnn, CZlstm, GAdnn, GAlstm, ESdnn, ESlstm, ENdnn and ENlstm]
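
Varying the number of phoneme units, as on the x-axis of this plot, means reducing each posteriorgram to a subset of its columns. A minimal sketch of one way to do that is given below. It assumes that the units with the highest relevance are the ones kept and that the remaining posteriors are renormalised to sum to one; neither choice is stated on the slides, and the function name `select_units` is illustrative.

```python
import numpy as np


def select_units(posteriorgram, relevance, n_units, renormalize=True):
    """Keep the n_units columns with the highest relevance (assumed rule).

    posteriorgram: array of shape (frames, units)
    relevance:     array of shape (units,)
    Returns the reduced posteriorgram and the kept column indices.
    """
    order = np.argsort(relevance)[::-1]   # highest relevance first
    keep = np.sort(order[:n_units])       # preserve the original unit order
    reduced = posteriorgram[:, keep]
    if renormalize:
        totals = reduced.sum(axis=1, keepdims=True)
        # Avoid division by zero for frames whose kept posteriors are all zero.
        reduced = reduced / np.where(totals > 0, totals, 1.0)
    return reduced, keep


if __name__ == "__main__":
    post = np.array([[0.5, 0.3, 0.2], [0.1, 0.1, 0.8]])
    rel = np.array([0.2, 0.9, 0.5])
    reduced, keep = select_units(post, rel, n_units=2)
    print(keep, reduced)
```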
