
SPHEAR Workshop 2000

Labeling an Audio-Visual Database and Training an ANN/HMM Audio-Visual Speech Recognition system

Universität Karlsruhe: Martin Heckmann, Kristian Kroschel

Institut de la Communication Parlée (ICP), Grenoble: Frédéric Berthommier, Christophe Savariaux

Overview

The Database
The System
Multi-Stage Labeling
Results
Outlook





Database Acquisition

Transposition of NUMBERS95

Audio-visual repetition of a subset of NB95 (1700 sentences)

Audio and video recordings

Chroma key process to extract lip parameters





Lip parameters

outer width
outer height
inner width
inner height
lip surface
inner mouth surface





Recognition System

Hybrid ANN/HMM audio-visual system (STRUT)

Separate Identification (SI) structure

SNR given as contextual information

[Block diagram: audio path RASTA-PLP → Audio ANN; video path Chroma-Key → Video ANN; AV Fusion (with the SNR as input) → HMM]





Recognition System

Independent weighting of each posterior value for each frame is possible with STRUT

[Figure: ANN structure for the audio parameters (13x3x9 inputs) and the video parameters (6x3x9 inputs). Other labels in the figure: 27, 27, 27, 16, 203, 51, 0.]





Recognition System

Fusion of audio and video via:

$$P(H_i \mid x_A, x_V) \propto P(H_i \mid x_A)^{\alpha} \, P(H_i \mid x_V)^{1-\alpha}$$

Phoneme duration modeled via concatenation of states in the HMM

Dictionary containing 30 words is used

Grammar-free continuous number recognition

No distinction between phonemes and visemes is made
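The fusion rule on this slide appears to combine the audio and video posteriors of each phoneme hypothesis as a weighted product. A minimal sketch of such a fusion follows, assuming a weight in [0, 1] that may differ per frame (for instance chosen from the SNR). The function name `fuse_posteriors`, the parameter name `alpha`, and the renormalization step are illustrative assumptions, not taken from STRUT.

```python
import numpy as np

def fuse_posteriors(p_audio, p_video, alpha):
    """Weighted geometric fusion of per-frame posteriors (illustrative sketch).

    p_audio, p_video: arrays of shape (frames, classes); each row sums to 1.
    alpha: audio weight in [0, 1], either a scalar or one value per frame.
    """
    p_audio = np.asarray(p_audio, dtype=float)
    p_video = np.asarray(p_video, dtype=float)
    # Expand alpha to one weight per frame, shaped (frames, 1) for broadcasting.
    alpha = np.broadcast_to(np.asarray(alpha, dtype=float), p_audio.shape[:1])[:, None]
    eps = 1e-12  # guards against log(0)
    # Work in the log domain: alpha * log P_A + (1 - alpha) * log P_V.
    log_fused = alpha * np.log(p_audio + eps) + (1.0 - alpha) * np.log(p_video + eps)
    # Subtract the row maximum before exponentiating for numerical stability.
    log_fused -= log_fused.max(axis=1, keepdims=True)
    fused = np.exp(log_fused)
    # Renormalize so every frame is again a probability distribution (assumption).
    return fused / fused.sum(axis=1, keepdims=True)
```

With `alpha = 1` the result follows the audio stream alone, with `alpha = 0` the video stream alone; intermediate values trade the two off frame by frame.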





Multi-Stage Labeling

Training on the large multi-speaker database NUMBERS95

WER on NB95: 11.6%

[Flow diagram of the labeling stages and the blocks shown for each:
Audio Pretraining on NUMBERS95: Numbers95, ANN training, pretrained audio ANN
Audio Labeling: AV database audio, alignment, audio-labeled AV database
Audio Training: ANN training, trained audio ANN
Audio Relabeling: AV database audio, alignment, retrained audio ANN, relabeled AV database audio
Video Labeling: AV database video, video-labeled AV database]
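The word error rates (WER) quoted on these slides are not defined in the deck. The usual definition counts substitutions, deletions and insertions from a word-level edit distance and divides by the number of reference words. A minimal sketch under that assumption follows; the function name `word_error_rate` is hypothetical, and the scoring tool actually used by the authors is not stated.

```python
def word_error_rate(reference, hypothesis):
    """WER in percent from a word-level Levenshtein distance (standard definition)."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # dist[i][j] = edit distance between ref[:i] and hyp[:j]
    dist = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dist[i][0] = i  # delete all i reference words
    for j in range(len(hyp) + 1):
        dist[0][j] = j  # insert all j hypothesis words
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            substitution = dist[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            deletion = dist[i - 1][j] + 1
            insertion = dist[i][j - 1] + 1
            dist[i][j] = min(substitution, deletion, insertion)
    return 100.0 * dist[-1][-1] / len(ref)
```

Because insertions are counted, the WER can exceed 100% when the hypothesis is much longer than the reference.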







Multi-speaker NUMBERS95

WER on NB95: 11.6%

Single-speaker audio-visual NB95 (AVNB95)

WER on AVNB95: 28.5%

Forced Viterbi alignment
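Forced alignment assigns every frame to one phoneme of the known transcription, in order, so that the summed log-probability is maximal. The sketch below is a minimal version that assumes per-frame log-posteriors from the ANN, uniform transition costs, and no minimum-duration states; the function name `forced_align` is hypothetical and the procedure inside STRUT may differ.

```python
import numpy as np

def forced_align(log_post, phone_ids):
    """Return one phoneme id per frame for a known phoneme sequence.

    log_post: array (frames, classes) of log-posteriors.
    phone_ids: the transcription as a list of class indices, in order.
    """
    log_post = np.asarray(log_post, dtype=float)
    T, N = log_post.shape[0], len(phone_ids)
    if T < N:
        raise ValueError("need at least one frame per phoneme")
    emit = log_post[:, phone_ids]          # (T, N): score of phoneme n at frame t
    score = np.full((T, N), -np.inf)       # best path score ending in (t, n)
    back = np.zeros((T, N), dtype=int)     # 1 if the path advanced into n at frame t
    score[0, 0] = emit[0, 0]               # the path must start in the first phoneme
    for t in range(1, T):
        for n in range(N):
            stay = score[t - 1, n]
            move = score[t - 1, n - 1] if n > 0 else -np.inf
            if move > stay:
                score[t, n] = move + emit[t, n]
                back[t, n] = 1
            else:
                score[t, n] = stay + emit[t, n]
    # Backtrace from the last phoneme at the last frame.
    path = np.empty(T, dtype=int)
    n = N - 1
    for t in range(T - 1, -1, -1):
        path[t] = n
        n -= back[t, n]
    return [phone_ids[n] for n in path]
```

The per-frame labels returned here are the kind of target that the subsequent ANN training stage would use.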










WER AVNB95 first labeling: 7.1%




WER AVNB95 second labeling: 4%










Direct application of the audio labeling to the video path

WER using only video: 35.1%









Result of the labeling

Result of audio and video labeling for "three oh two"

[Figure: three panels over a common time axis from 0 to 1.5; phoneme labels: #p th r iy #p ow t uw #p; one panel shows the inner lip height]





Effects of Audio Pretraining

Comparison between training from scratch on AVNB95 and pretraining on NB95 followed by training continuation on AVNB95

[Plot: word error rate in % (0 to 100) versus SNR (-12 dB, -6 dB, 0 dB, 6 dB, 12 dB, clean); curves: audio pretrained, audio from scratch]





Audio-Visual Recognition

Comparison of audio, video and audio-visual recognition

[Plot: word error rate in % (0 to 100) versus SNR (-12 dB, -6 dB, 0 dB, 6 dB, 12 dB, clean); curves: video, audio, audio-visual]





Conclusion

Automatic audio-visual labeling process presented

Pretraining on a large audio database only advantageous for labeling

Good audio-visual recognition scores with hybrid ANN/HMM system





Outlook

Improvements of the recognition scores via:

Grouping of visually identical phonemes into visemes

Introduction of a more elaborate fusion modality

Implementation of a system to determine the weighting of audio and video from the data (e.g. entropy of posteriors) instead of using SNR given as context information
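The slide names the entropy of the posteriors only as an example of a data-driven reliability measure. One way this could look is sketched below: a stream whose posteriors are flat (high entropy) is treated as less reliable and receives a smaller weight. The mapping from the two entropies to a weight, and the names `posterior_entropy` and `entropy_based_audio_weight`, are assumptions made for illustration, not the authors' method.

```python
import numpy as np

def posterior_entropy(posteriors):
    """Per-frame Shannon entropy (in nats) of posteriors with shape (frames, classes)."""
    p = np.asarray(posteriors, dtype=float)
    # Treat 0 * log(0) as 0 by replacing zeros with 1 inside the logarithm.
    return -(p * np.log(np.where(p > 0, p, 1.0))).sum(axis=1)

def entropy_based_audio_weight(p_audio, p_video):
    """Audio weight per frame: larger when the audio posteriors are more peaked.

    Assumed mapping: weight = H_video / (H_audio + H_video), and 0.5 when
    both entropies are zero.
    """
    h_audio = posterior_entropy(p_audio)
    h_video = posterior_entropy(p_video)
    total = h_audio + h_video
    return np.where(total > 0, h_video / np.maximum(total, 1e-12), 0.5)
```

The resulting per-frame weight could take the place of an SNR-derived weight in a weighted fusion of the two streams.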
