Signal Processing Institute, Swiss Federal Institute of Technology (EPFL), Lausanne

Slide 1: Feature selection for audio-visual speech recognition
Mihai Gurban
Slide 2: Outline
– Feature selection and extraction
  – Why select features?
  – Information-theoretic criteria
– Our approach
  – The audio-visual recognizer
  – Audio-visual integration
  – Features and selection methods
– Experimental results
– Conclusion
Slide 3: Feature selection
– Features and classification
  – Features (or attributes, properties, characteristics): different types of measures that can be taken on the same physical phenomenon.
  – An instance (or pattern, sample, example): a collection of feature values representing simultaneous measurements.
  – For classification, each sample has an associated class label.
– Feature selection
  – Finding, from the original feature set, a subset that retains most of the information relevant to the classification task.
  – This is needed because of the curse of dimensionality.
– Why dimensionality reduction?
  – The number of samples required to obtain accurate models of the data grows exponentially with the dimensionality.
  – The computing resources required also grow with the dimensionality of the data.
  – Irrelevant information can decrease performance.
Slide 4: Feature selection
– Entropy and mutual information
  – H(X), the entropy of X: the amount of uncertainty about the value of X.
  – I(X;Y), the mutual information between X and Y: the reduction in the uncertainty of X due to knowledge of Y (or vice versa).
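Both quantities can be estimated from discrete samples by plugging empirical frequencies into the definitions; a minimal sketch (not part of the slides), using the identity I(X;Y) = H(X) + H(Y) − H(X,Y):

```python
import math
from collections import Counter

def entropy(samples):
    """H(X) in bits, estimated from empirical frequencies of discrete samples."""
    n = len(samples)
    return -sum((c / n) * math.log2(c / n) for c in Counter(samples).values())

def mutual_information(xs, ys):
    """I(X;Y) = H(X) + H(Y) - H(X,Y), estimated from paired samples."""
    return entropy(xs) + entropy(ys) - entropy(list(zip(xs, ys)))

# When X fully determines Y (and vice versa), I(X;Y) = H(X)
x = [0, 0, 1, 1]
print(mutual_information(x, x))   # 1.0 bit
```

With few samples these plug-in estimates are biased, but they suffice to illustrate the criteria that follow.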
– Maximum dependency
  – Mutual information with the class is one of the most frequently used criteria.
  – Pick Y_S1, ..., Y_Sm from the set Y_1, ..., Y_n of features such that I(Y_S1, Y_S2, ..., Y_Sm ; C) is maximal.
– How many subsets?
  – It is impossible to check all subsets, because the number of combinations is high: there are C(n, m) = n! / (m!(n−m)!) subsets of size m.
  – As an approximate solution, greedy algorithms are used, adding one feature at a time.
  – The number of candidate evaluations is thus reduced to n + (n−1) + ... + (n−m+1).
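The greedy approximation above (repeatedly adding the feature that most increases the joint mutual information with the class) can be sketched as follows, assuming discrete feature values so that probabilities can be estimated by counting:

```python
import math
from collections import Counter

def entropy(samples):
    n = len(samples)
    return -sum((c / n) * math.log2(c / n) for c in Counter(samples).values())

def joint_mi(columns, labels):
    """I(Y_1,...,Y_k ; C): treat each sample's feature values as one tuple."""
    joint = list(zip(*columns))
    return entropy(joint) + entropy(labels) - entropy(list(zip(joint, labels)))

def greedy_select(features, labels, m):
    """Greedily pick m feature names maximizing I(selected + candidate ; C).

    features: dict mapping feature name -> list of discrete values.
    """
    selected = []
    for _ in range(m):
        remaining = [f for f in features if f not in selected]
        best = max(remaining,
                   key=lambda f: joint_mi([features[s] for s in selected + [f]],
                                          labels))
        selected.append(best)
    return selected
```

Each of the m rounds scans the remaining features once, giving the n + (n−1) + ... + (n−m+1) evaluations mentioned above; estimating the joint mutual information reliably, however, gets harder as the selected tuple grows.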
Slide 5: A simple example
– Entropies and mutual information can be represented by Venn diagrams.
– We are searching for the features Y_Si with maximum mutual information with the class label C.
– Assume the complete set of features is Y_1, ..., Y_5.
[Figure: Venn diagram of the class label C overlapping features Y_1 through Y_5]
Slide 7: A simple example
[Figure: Venn diagram of C with features Y_1 and Y_2]
Slide 8: A simple example
[Figure: Venn diagram of C with features Y_1, Y_2 and Y_3]
Slide 9: A simple example
[Figure: Venn diagram of C with features Y_1, Y_2 and Y_3]
Slide 10: Which criterion to penalize redundancy?
– Many different criteria have been proposed in the literature.
– Our criterion penalizes only relevant redundancy.
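The slides do not spell out the criterion itself, so as an illustrative stand-in here is the widely used mRMR score (relevance minus mean redundancy, after Peng et al.); note that, unlike the criterion above, plain mRMR penalizes all redundancy between features, not only the part relevant to the class:

```python
import math
from collections import Counter

def entropy(samples):
    n = len(samples)
    return -sum((c / n) * math.log2(c / n) for c in Counter(samples).values())

def mi(xs, ys):
    return entropy(xs) + entropy(ys) - entropy(list(zip(xs, ys)))

def mrmr_score(candidate, selected, labels):
    """Relevance I(Y_j;C) minus mean redundancy with already-selected features."""
    relevance = mi(candidate, labels)
    if not selected:
        return relevance
    redundancy = sum(mi(candidate, s) for s in selected) / len(selected)
    return relevance - redundancy
```

A duplicate of an already-selected feature scores zero here even if it is highly relevant, which is the behavior a redundancy penalty is meant to produce.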
Slide 11: Solutions from the literature
– "Natural" DCT ordering: zigzag scanning, as used in compression (JPEG/MPEG).
– Maximum mutual information: typically, redundancy is not taken into account.
– Linear Discriminant Analysis: a transform is applied to the features.
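The "natural" zigzag ordering can be generated as in JPEG by walking the anti-diagonals of the DCT block in alternating directions; a sketch, not tied to any particular codec implementation:

```python
def zigzag_order(n):
    """(row, col) indices of an n x n DCT block in JPEG-style zigzag order."""
    order = []
    for s in range(2 * n - 1):                # s = row + col, one anti-diagonal
        diag = [(i, s - i) for i in range(n) if 0 <= s - i < n]
        # Even diagonals run toward the top-right, odd ones toward the bottom-left
        order.extend(diag if s % 2 else diag[::-1])
    return order

# Keeping the first k coefficients in this order keeps the low frequencies
print(zigzag_order(3)[:4])   # [(0, 0), (0, 1), (1, 0), (2, 0)]
```

Taking the first k entries of this list is the baseline feature "selection" the experiments compare against.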
Slide 12: Our application: AVSR
[Diagram: the audio signal feeds audio feature extraction; the video signal feeds a visual front-end (face detection, mouth localization, lip tracking) followed by visual feature extraction; the two streams support audio-only and visual-only recognition, and are combined by audio-visual fusion for audio-visual recognition]
– Experiments on the CUAVE database:
  – 36 speakers, 10 words, 5 repetitions per speaker.
  – Leave-one-out cross-validation.
  – Audio features: MFCC coefficients.
  – Visual features: DCT coefficients with first and second temporal derivatives.
  – Different levels of noise added to the audio.
Slide 13: The multi-stream HMM
– Audio stream: 39 MFCCs. Video stream: DCT features.
– Audio-visual integration with multi-stream HMMs:
  – States are modeled with Gaussian mixtures.
  – Each modality is modeled separately.
  – The emission likelihood is a weighted product of the per-stream likelihoods.
  – The optimal stream weights are chosen for each SNR.
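The weighted product of stream likelihoods is a weighted sum in the log domain; a minimal one-dimensional sketch (the actual system uses Gaussian mixtures per state, omitted here):

```python
import math

def log_gaussian(x, mean, var):
    """Log-density of a 1-D Gaussian, standing in for a mixture component."""
    return -0.5 * (math.log(2 * math.pi * var) + (x - mean) ** 2 / var)

def multistream_loglik(logp_audio, logp_video, w_audio):
    """log b(o) = w * log b_a(o_a) + (1 - w) * log b_v(o_v)."""
    return w_audio * logp_audio + (1.0 - w_audio) * logp_video

# A larger audio weight suits clean audio; a smaller one suits noisy audio
la = log_gaussian(0.1, 0.0, 1.0)
lv = log_gaussian(0.5, 0.0, 1.0)
print(multistream_loglik(la, lv, 0.7))
```

Tuning the weight per SNR lets the decoder lean on the video stream exactly when the audio stream degrades.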
Slide 14: Information content of different types of features
[Figure: bar chart of the mutual information I(X;C), between 0 and 1, for five features of each type: MFCC in clean conditions, MFCC at -10 dB SNR, DCT coefficients, PCA coefficients, and optical flow coefficients]
Slide 15: Visual-only recognition rate
[Figure: word accuracy (%) versus number of features (0 to 200) for three orderings: max MI with redundancy penalized, max MI, and zigzag ordering]
Slide 16: Audio-visual performance
[Figure: word accuracy (%) versus SNR (clean down to -10 dB) for audio-only, video-only, and audio-visual recognition]
Slide 17: AV performance with clean audio
[Figure: word accuracy (%) versus number of visual features (0 to 200) for audio-visual versus audio-only recognition; accuracies between about 98.2% and 99.2%]
Slide 18: AV performance at 10 dB SNR
[Figure: word accuracy (%) versus number of visual features (0 to 200) for audio-visual versus audio-only recognition; accuracies between about 91% and 96%]
Slide 19: Noisy AV and visual-only comparison
[Figure: word accuracy (%) versus number of features (0 to 200) comparing AV performance at -10 dB SNR with video-only performance]
Slide 20: Conclusion and future work
– Feature selection for audio-visual speech recognition:
  – The visual-only recognition rate is not a good predictor of audio-visual performance, because of dimensionality effects.
  – Maximum audio-visual performance is obtained at small video dimensionalities.
  – Algorithms that improve performance at small dimensionalities are therefore needed.
– Future work:
  – Better methods to compute the amount of redundancy between features.