SPHEAR Workshop 2000

Labeling an Audio-Visual Database and Training an ANN/HMM Audio-Visual Speech Recognition System

Universität Karlsruhe: Martin Heckmann, Kristian Kroschel

Institut de la Communication Parlée: Frédéric Berthommier, Christophe Savariaux

Universität Karlsruhe

M. Heckmann, K. Kroschel

Institut de la Communication Parlée (ICP), Grenoble

F. Berthommier, C. Savariaux

Overview

The Database

The System

Multi-Stage Labeling

Results

Outlook

Database Acquisition

Transposition of NUMBERS95

Audio-visual repetition of a subset of NB95 (1700 sentences)

Audio and video recordings

Chroma-key process to extract lip parameters

Lip parameters

[Figure: lip image annotated with the six extracted parameters: outer width, outer height, inner width, inner height, lip surface, inner mouth surface]

Recognition System

Hybrid ANN/HMM audio-visual system (STRUT)

Separate Identification (SI) structure

SNR given as contextual information

[Figure: block diagram with the blocks RASTA-PLP, Chroma-Key, Audio ANN, Video ANN, AV Fusion, HMM, and SNR]

Recognition System

Independent weighting of each posterior value for each frame is possible with STRUT

[Figure: network diagram. Audio parameters: 13x3x9 inputs. Video parameters: 6x3x9 inputs. Three blocks in the diagram are labeled 27.]
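The input sizes 13x3x9 and 6x3x9 can be read as products. One plausible reading, which is an assumption and not stated on the slide, is 13 audio coefficients (or 6 lip parameters) times 3 feature streams (for example static, delta, and delta-delta) times a context window of 9 frames. The NumPy sketch below illustrates only the window stacking; `stack_context` and the edge padding are hypothetical choices, not taken from STRUT.

```python
import numpy as np

def stack_context(features: np.ndarray, context: int = 9) -> np.ndarray:
    """Stack `context` consecutive frames around each frame.

    features: array of shape (T, D), one D-dimensional vector per frame.
    Returns an array of shape (T, context * D). Edges are padded by
    repeating the first and last frame (an assumption for this sketch).
    """
    if context % 2 == 0:
        raise ValueError("context must be odd")
    half = context // 2
    padded = np.pad(features, ((half, half), (0, 0)), mode="edge")
    # Block k of the output holds the frame at offset k - half.
    windows = [padded[k:k + len(features)] for k in range(context)]
    return np.concatenate(windows, axis=1)

# Random stand-ins for 100 frames of audio and video features.
audio = np.random.default_rng(0).normal(size=(100, 13 * 3))
video = np.random.default_rng(1).normal(size=(100, 6 * 3))
audio_in = stack_context(audio)
video_in = stack_context(video)
print(audio_in.shape, video_in.shape)
```

Under this reading the audio network would see 351 values per frame and the video network 162.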

Recognition System

Fusion of audio and video via:

$$P(H_i \mid \mathbf{x}_A, \mathbf{x}_V) = P_A(H_i \mid \mathbf{x}_A)^{\alpha} \, P_V(H_i \mid \mathbf{x}_V)^{1-\alpha}$$

Phoneme duration modeled via concatenation of states in the HMM

Dictionary containing 30 words is used

Grammar-free continuous numbers recognition

No distinction between phonemes and visemes is made
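The fusion rule on this slide, read as a weighted product of the audio and video posteriors, can be sketched in NumPy as follows. This is a minimal illustration under stated assumptions: the weight is called `alpha` here, it may be given per frame, and the result is renormalized so that each frame sums to one (the slide does not show a normalization step). The function name `fuse_posteriors` is hypothetical.

```python
import numpy as np

def fuse_posteriors(p_audio, p_video, alpha):
    """Weighted geometric fusion of per-frame posteriors.

    p_audio, p_video: arrays of shape (T, N), each row a posterior over N classes.
    alpha: scalar or array of shape (T,) in [0, 1]; the weight of the audio stream.
    Returns an array of shape (T, N) whose rows sum to one.
    """
    p_audio = np.asarray(p_audio, dtype=float)
    p_video = np.asarray(p_video, dtype=float)
    alpha = np.broadcast_to(np.asarray(alpha, dtype=float), p_audio.shape[:1])[:, None]
    eps = 1e-12  # avoids log(0)
    # Work in the log domain: alpha * log P_A + (1 - alpha) * log P_V.
    log_fused = alpha * np.log(p_audio + eps) + (1.0 - alpha) * np.log(p_video + eps)
    log_fused -= log_fused.max(axis=1, keepdims=True)
    fused = np.exp(log_fused)
    return fused / fused.sum(axis=1, keepdims=True)

p_a = np.array([[0.7, 0.2, 0.1]])
p_v = np.array([[0.2, 0.6, 0.2]])
print(fuse_posteriors(p_a, p_v, 0.5))
```

With `alpha = 1` the fused result reduces to the audio posterior and with `alpha = 0` to the video posterior, which is the behavior one would want at very high and very low SNR.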

Multi-Stage Labeling

Training on the large multi-speaker database NUMBERS95

WER on NB95: 11.6%

[Figure: flow chart of the multi-stage labeling process, shown on each Multi-Stage Labeling slide. Stages and blocks:
Audio Pretraining on NUMBERS95: Numbers95, ANN Training, ANN Audio Pretrained
Audio Labeling: AV Database Audio, ANN Audio Pretrained, Alignment, AV Database Audio Labeled
Audio Training: AV Database Audio Labeled, ANN Training, ANN Audio Trained
Audio Relabeling: AV Database Audio, ANN Audio Retrained, Alignment, AV Database Audio Relabeled
Video Labeling: AV Database Audio Relabeled, AV Database Video, AV Database Video Labeled]

Multi-Stage Labeling

Multi-speaker NUMBERS95

WER on NB95: 11.6%

Single-speaker audio-visual NB95

WER on AVNB95: 28.5%

Forced Viterbi alignment
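Forced Viterbi alignment assigns each frame to one phoneme of a known transcription. The sketch below is a simplified illustration, not the STRUT implementation: it assumes one state per phoneme, a strict left-to-right order without skips, and uses log posteriors directly as frame scores (a hybrid system would typically divide by class priors and use several states per phoneme). The function name `forced_align` is hypothetical.

```python
import numpy as np

def forced_align(log_post, phones):
    """Align frames to a known phoneme sequence (left-to-right, no skips).

    log_post: array (T, N) of per-frame log posteriors over N phoneme classes.
    phones: list of class indices in the order given by the transcription.
    Returns a list of length T with one class index per frame.
    """
    T, S = log_post.shape[0], len(phones)
    if T < S:
        raise ValueError("need at least one frame per phoneme")
    score = np.full((T, S), -np.inf)
    back = np.zeros((T, S), dtype=int)  # 0 = stayed in state, 1 = advanced
    score[0, 0] = log_post[0, phones[0]]
    for t in range(1, T):
        for s in range(S):
            stay = score[t - 1, s]
            move = score[t - 1, s - 1] if s > 0 else -np.inf
            back[t, s] = int(move > stay)
            score[t, s] = max(stay, move) + log_post[t, phones[s]]
    # Trace back from the last phoneme at the last frame.
    s = S - 1
    path = [s]
    for t in range(T - 1, 0, -1):
        s -= int(back[t, s])
        path.append(s)
    path.reverse()
    return [phones[s] for s in path]

post = np.array([[0.8, 0.1, 0.1],
                 [0.7, 0.2, 0.1],
                 [0.2, 0.7, 0.1],
                 [0.1, 0.8, 0.1],
                 [0.1, 0.8, 0.1]])
labels = forced_align(np.log(post), [0, 1])
print(labels)  # [0, 0, 1, 1, 1]
```

The returned frame labels are what a subsequent ANN training stage would use as targets.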

Multi-Stage Labeling

WER AVNB95 first labeling: 7.1%

Multi-Stage Labeling

WER AVNB95 first labeling: 7.1%

WER AVNB95 second labeling: 4%
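The slides do not say how the word error rate (WER) was scored. Assuming the standard definition, WER is the number of substitutions, deletions, and insertions needed to turn the reference into the hypothesis, divided by the number of reference words. A minimal sketch with a hypothetical helper `word_error_rate`:

```python
def word_error_rate(reference, hypothesis):
    """Word error rate in percent, computed via word-level edit distance."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] is the edit distance between the processed reference prefix and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            curr.append(min(prev[j] + 1,              # deletion
                            curr[j - 1] + 1,          # insertion
                            prev[j - 1] + (r != h)))  # substitution or match
        prev = curr
    return 100.0 * prev[-1] / len(ref)

# One deleted word out of three reference words.
wer = word_error_rate("three oh two", "three two")
print(round(wer, 1))  # 33.3
```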

Multi-Stage Labeling

Direct application of the audio labeling to the video path

WER using only video: 35.1%

Result of the labeling

Result of audio and video labeling for "three oh two"

[Figure: time-aligned panels over a time axis from 0 to 1.5, showing the phoneme labels "#p th r iy #p ow t uw #p" and the inner lip height]

Effects of Audio Pretraining

Comparison between training from scratch on AVNB95 and pretraining on NB95 followed by training continuation on AVNB95

[Figure: word error rate in % (0 to 100) versus SNR (-12 dB, -6 dB, 0 dB, 6 dB, 12 dB, clean) for two systems: audio pretrained and audio from scratch]

Audio-Visual Recognition

Comparison of audio, video and audio-visual recognition

[Figure: word error rate in % (0 to 100) versus SNR (-12 dB, -6 dB, 0 dB, 6 dB, 12 dB, clean) for three systems: video, audio, and audio-visual]

Conclusion

Automatic audio-visual labeling process presented

Pretraining on a large audio database only advantageous for labeling

Good audio-visual recognition scores with hybrid ANN/HMM system

Outlook

Improvements of the recognition scores via:

  Grouping of visually identical phonemes into visemes

  Introduction of a more elaborate fusion method

Implementation of a system that determines the weighting of audio and video from the data (e.g. the entropy of the posteriors) instead of using the SNR given as context information
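The entropy-based weighting mentioned in the last point could look roughly like the sketch below. It is only an illustration: the function names and the mapping from entropy to weight are assumptions, not the authors' method. The idea is that a flat posterior distribution (high entropy) signals an unreliable stream, so that stream receives less weight.

```python
import numpy as np

def normalized_entropy(posteriors):
    """Per-frame entropy of posteriors of shape (T, N), scaled to [0, 1]."""
    p = np.clip(np.asarray(posteriors, dtype=float), 1e-12, 1.0)
    entropy = -(p * np.log(p)).sum(axis=1)
    return entropy / np.log(p.shape[1])

def audio_weight(p_audio, p_video):
    """Hypothetical mapping: the stream with lower entropy gets more weight."""
    conf_a = 1.0 - normalized_entropy(p_audio)
    conf_v = 1.0 - normalized_entropy(p_video)
    total = conf_a + conf_v
    # Fall back to equal weights when both streams are uninformative.
    weight = np.where(total > 1e-9, conf_a / np.maximum(total, 1e-9), 0.5)
    return np.clip(weight, 0.0, 1.0)

sharp = np.array([[0.98, 0.01, 0.01]])
flat = np.array([[1 / 3, 1 / 3, 1 / 3]])
print(audio_weight(sharp, flat))  # close to 1: trust the audio stream
print(audio_weight(flat, flat))   # 0.5: no preference
```

The resulting per-frame weight could serve as the audio exponent in the weighted product used for fusion, replacing a weight chosen from the known SNR.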