Look who’s talking?
Project 3.1
Yannick Thimister, Han van Venrooij, Bob Verlinden
27-01-2011
DKE, Maastricht University
Project 3.1 DKE - Maastricht University
Contents
  Speaker recognition
  Speech samples
  Voice activity detection
  Feature extraction
  Speaker recognition
  Multi speaker recognition
  Experiments and results
  Discussion
  Conclusion
Speaker recognition
  Speech contains several layers of information
    Spoken words
    Speaker identity
  Speaker-related differences are a combination of anatomical differences and learned speaking habits
Speech samples
  Self-recorded database
    55 sentences from 11 different people
    2x2 predefined sentences and 1 random sentence per person
    Professional recording microphone and built-in laptop microphone
  Database via Voxforge.org
    610 sentences from 61 different people
    Varying recording microphones and environments
Voice activity detection
  Power-based
  Entropy-based
  Long term spectral divergence
  Frames
  Initial frames are assumed to be noise
  Hangover
  Adaptive noise estimation
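The frame-level ideas on this slide can be sketched as follows. This is a minimal sketch under stated assumptions, not the project's implementation: the names `split_frames` and `apply_hangover`, the non-overlapping framing, and the hangover length are all hypothetical. The hangover here keeps the speech label for a few frames after the detector last fired, so that weak word endings are not cut off.

```python
def split_frames(samples, frame_len):
    """Split a list of samples into consecutive non-overlapping frames.

    A trailing partial frame is dropped (an assumption of this sketch).
    """
    n_frames = len(samples) // frame_len
    return [samples[i * frame_len:(i + 1) * frame_len] for i in range(n_frames)]


def apply_hangover(raw_flags, hangover):
    """Extend each speech decision by up to `hangover` frames.

    raw_flags: list of booleans, True where the detector fired.
    """
    smoothed = []
    remaining = 0  # hangover frames still available
    for is_speech in raw_flags:
        if is_speech:
            remaining = hangover
            smoothed.append(True)
        elif remaining > 0:
            remaining -= 1
            smoothed.append(True)
        else:
            smoothed.append(False)
    return smoothed
```

With a hangover of 2, a single detected frame is followed by two extra speech frames before the label falls back to noise.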
Voice activity detection
  Power-based
    Assumes that the noise is normally distributed
    Calculate mean and standard deviation of the noise
    For each sample n
      Calculate [equation not preserved]
    For each frame j
      The majority of the samples [condition not preserved]
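One plausible reading of the power-based steps is sketched below. The per-sample test and the frame rule are not preserved on the slide, so the sketch assumes a common variant: a sample counts as voiced when it lies more than `k` standard deviations from the noise mean, and a frame is speech when most of its samples are voiced. The function name `power_vad` and the default `k=3.0` are assumptions.

```python
import statistics


def power_vad(frames, noise_frames, k=3.0):
    """Label each frame as speech (True) or noise (False).

    frames: list of frames, each a list of samples.
    noise_frames: number of leading frames assumed to contain only noise.
    k: deviation threshold in standard deviations (assumed value).
    """
    noise = [s for frame in frames[:noise_frames] for s in frame]
    mu = statistics.mean(noise)
    sigma = statistics.pstdev(noise)
    if sigma == 0:
        sigma = 1e-12  # avoid division by zero on perfectly flat noise

    flags = []
    for frame in frames:
        # Count samples that deviate strongly from the noise distribution.
        voiced = sum(1 for s in frame if abs(s - mu) / sigma > k)
        # The frame is speech if the majority of its samples are voiced.
        flags.append(voiced > len(frame) / 2)
    return flags
```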
Voice activity detection
  Entropy-based
    Scale DFT coefficients [equation not preserved]
    Entropy equals [equation not preserved]
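A minimal sketch of the entropy computation, assuming the usual spectral-entropy definition: the power spectrum is scaled so that it sums to one and the Shannon entropy of the result is taken. Whether the project scaled magnitudes or powers is not stated, and the decision threshold is not shown on the slide, so the sketch stops at the entropy value. The names `dft_magnitudes` and `spectral_entropy` are hypothetical, and the naive DFT is used only to keep the example free of dependencies.

```python
import cmath
import math


def dft_magnitudes(frame):
    """Magnitudes of DFT bins 0..N/2 (naive O(N^2) transform)."""
    n = len(frame)
    mags = []
    for k in range(n // 2 + 1):
        acc = sum(frame[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
        mags.append(abs(acc))
    return mags


def spectral_entropy(frame):
    """Shannon entropy of the power spectrum scaled to sum to one."""
    power = [m * m for m in dft_magnitudes(frame)]
    total = sum(power)
    if total == 0:
        return 0.0  # silent frame: entropy defined as zero in this sketch
    probs = [p / total for p in power]
    return -sum(p * math.log(p) for p in probs if p > 0)
```

A frame with a flat spectrum (for example a single impulse) gives the maximum entropy, while a pure tone concentrates its energy in one bin and gives a value close to zero.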
Voice activity detection
  Long term spectral divergence
    L-frame window
    Estimation [equation not preserved]
    Divergence [equation not preserved]
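The two quantities named on this slide can be sketched as follows, assuming the commonly used definitions: the long-term spectral envelope is the per-bin maximum of the magnitude spectrum over a window of neighboring frames, and the divergence is the mean squared ratio of envelope to noise spectrum, expressed in dB. The names `ltse` and `ltsd`, the symmetric window of `order` frames on each side, and the requirement that the noise spectrum is strictly positive are assumptions of this sketch.

```python
import math


def ltse(spectra, frame_idx, order):
    """Long-term spectral envelope: per-bin maximum over neighboring frames.

    spectra: list of magnitude spectra, one list of bins per frame.
    order: number of frames taken on each side of frame_idx.
    """
    lo = max(0, frame_idx - order)
    hi = min(len(spectra), frame_idx + order + 1)
    n_bins = len(spectra[frame_idx])
    return [max(spectra[j][k] for j in range(lo, hi)) for k in range(n_bins)]


def ltsd(spectra, frame_idx, noise_spectrum, order):
    """Long-term spectral divergence in dB between envelope and noise."""
    envelope = ltse(spectra, frame_idx, order)
    ratio = sum((e * e) / (n * n) for e, n in zip(envelope, noise_spectrum))
    # Floor the argument so an all-zero envelope does not break the logarithm.
    return 10.0 * math.log10(max(ratio / len(envelope), 1e-12))
```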
Voice activity detection
  Long term spectral divergence
    Estimate the noise spectrum
      Averages of the DFT coefficients
    Calculate mean (μ) LTSD of noise frames
    For each frame f
      Calculate the LTSD
      Classify as speech if LTSD > c μ
      Update the noise estimate
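The decision loop on this slide might look like the sketch below. It is self-contained and makes several assumptions that the slide does not confirm: the noise estimate is updated only on non-speech frames using exponential smoothing with factor `alpha`, the divergence is measured in dB, and `order`, `c`, and `alpha` have illustrative default values. The names `ltsd_value` and `ltsd_vad` are hypothetical.

```python
import math


def ltsd_value(envelope, noise):
    """Divergence in dB between a spectral envelope and the noise spectrum."""
    ratio = sum((e * e) / (n * n) for e, n in zip(envelope, noise)) / len(noise)
    return 10.0 * math.log10(max(ratio, 1e-12))


def ltsd_vad(spectra, noise_frames, order=1, c=2.0, alpha=0.9):
    """Return one speech/noise flag per frame.

    spectra: magnitude spectra, one list of bins per frame.
    noise_frames: number of leading frames assumed to be noise.
    order, c, alpha: window size, threshold factor, smoothing factor
    (all assumed values).
    """
    n_bins = len(spectra[0])

    def envelope(i, limit):
        # Per-bin maximum over frames i-order..i+order, clipped to [0, limit).
        lo, hi = max(0, i - order), min(limit, i + order + 1)
        return [max(spectra[j][k] for j in range(lo, hi)) for k in range(n_bins)]

    # Noise spectrum: average of the DFT magnitudes over the initial frames.
    noise = [sum(s[k] for s in spectra[:noise_frames]) / noise_frames
             for k in range(n_bins)]
    # mu: mean LTSD over the initial noise frames only.
    mu = sum(ltsd_value(envelope(i, noise_frames), noise)
             for i in range(noise_frames)) / noise_frames

    flags = []
    for i, spectrum in enumerate(spectra):
        is_speech = ltsd_value(envelope(i, len(spectra)), noise) > c * mu
        flags.append(is_speech)
        if not is_speech:
            # Adapt the noise estimate with the current non-speech frame.
            noise = [alpha * n + (1 - alpha) * x for n, x in zip(noise, spectrum)]
    return flags
```

Because the envelope looks `order` frames ahead and behind, frames directly adjacent to a speech burst are also flagged as speech.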
Feature extraction
  Representation of speakers
  Mel frequency cepstral coefficients
    Imitates human hearing
  Linear predictive coding
    Linear function of previous samples
MFCC
  Hamming window
  FFT
  Mel-scale
  Log
  FFT
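The pipeline on this slide can be sketched for a single frame as shown below. Several details are assumptions rather than the project's settings: the mel formula, the number of triangular filters, the naive DFT (used only to avoid dependencies), and the last step, where the sketch applies a DCT-II, a common choice for the final transform, instead of the second FFT named on the slide. The function names are hypothetical.

```python
import cmath
import math


def hz_to_mel(hz):
    return 2595.0 * math.log10(1.0 + hz / 700.0)


def mel_to_hz(mel):
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)


def mfcc(frame, sample_rate, n_filters=20, n_coeffs=10):
    """Cepstral coefficients of one frame (list of at least two samples)."""
    n = len(frame)
    # 1. Hamming window
    windowed = [x * (0.54 - 0.46 * math.cos(2 * math.pi * t / (n - 1)))
                for t, x in enumerate(frame)]
    # 2. Power spectrum of bins 0..n/2 via a naive DFT
    n_bins = n // 2 + 1
    power = []
    for k in range(n_bins):
        acc = sum(windowed[t] * cmath.exp(-2j * math.pi * k * t / n)
                  for t in range(n))
        power.append(abs(acc) ** 2)
    # 3. Triangular filters spaced evenly on the mel scale
    top = hz_to_mel(sample_rate / 2.0)
    edges = [mel_to_hz(top * i / (n_filters + 1)) for i in range(n_filters + 2)]
    bin_hz = sample_rate / n
    log_energies = []
    for m in range(1, n_filters + 1):
        left, center, right = edges[m - 1], edges[m], edges[m + 1]
        energy = 0.0
        for k in range(n_bins):
            f = k * bin_hz
            if left < f <= center:
                energy += power[k] * (f - left) / (center - left)
            elif center < f < right:
                energy += power[k] * (right - f) / (right - center)
        # 4. Logarithm (small offset guards against log(0) for empty filters)
        log_energies.append(math.log(energy + 1e-10))
    # 5. DCT-II of the log filter energies
    return [sum(e * math.cos(math.pi * c * (m + 0.5) / n_filters)
                for m, e in enumerate(log_energies))
            for c in range(n_coeffs)]
```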
LPC
  P-th order linear function is estimated
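A minimal sketch of estimating a P-th order linear predictor, assuming the autocorrelation method with the Levinson-Durbin recursion. The slide does not say which estimation method the project used, so this is one standard option, and the names `autocorrelation` and `lpc` are hypothetical.

```python
def autocorrelation(signal, max_lag):
    """Autocorrelation values r[0..max_lag] of a list of samples."""
    n = len(signal)
    return [sum(signal[t] * signal[t - lag] for t in range(lag, n))
            for lag in range(max_lag + 1)]


def lpc(signal, order):
    """Estimate coefficients a[1..order] so that signal[t] is
    approximated by sum(a[i] * signal[t - i] for i in 1..order)."""
    r = autocorrelation(signal, order)
    a = [0.0] * (order + 1)  # a[0] is unused
    error = r[0]             # prediction error energy
    for i in range(1, order + 1):
        if error == 0:
            break  # signal is perfectly predicted (or all zero)
        # Reflection coefficient for stage i
        k = (r[i] - sum(a[j] * r[i - j] for j in range(1, i))) / error
        new_a = a[:]
        new_a[i] = k
        for j in range(1, i):
            new_a[j] = a[j] - k * a[i - j]
        a = new_a
        error *= 1.0 - k * k
    return a[1:]
```

For a decaying signal such as `0.9 ** t`, where each sample is 0.9 times the previous one, a first-order fit returns a coefficient very close to 0.9.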
Speaker recognition
  Nearest neighbor
    Euclidean distance
  Neural network
    Multilayer perceptron
Nearest neighbor
  Features compared pairwise
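Pairwise comparison with Euclidean distance can be sketched as follows. The data layout (a list of feature vector and speaker label pairs) and the names `euclidean` and `nearest_neighbor` are assumptions for illustration.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def nearest_neighbor(query, training):
    """Return the label of the training vector closest to `query`.

    training: list of (feature_vector, speaker_label) pairs.
    """
    best_label, best_dist = None, float("inf")
    for features, label in training:
        d = euclidean(query, features)
        if d < best_dist:
            best_label, best_dist = label, d
    return best_label
```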
Neural network
Multi speaker recognition
  Preprocessing using VAD
  Consecutive speech frames
  Single speaker recognition per segment
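The segmentation step can be sketched as below: runs of consecutive speech frames become segments, and each segment is passed to a single-speaker classifier. The `classify` callable stands in for whichever recognizer is used and is hypothetical, as are the function names.

```python
def speech_segments(flags):
    """Return (start, end) pairs of consecutive speech frames, end exclusive."""
    segments, start = [], None
    for i, is_speech in enumerate(flags):
        if is_speech and start is None:
            start = i
        elif not is_speech and start is not None:
            segments.append((start, i))
            start = None
    if start is not None:
        segments.append((start, len(flags)))
    return segments


def recognize_speakers(flags, features, classify):
    """Run single-speaker recognition once per speech segment.

    flags: VAD decision per frame.
    features: one feature vector per frame.
    classify: callable mapping a list of feature vectors to a speaker label.
    """
    return [(start, end, classify(features[start:end]))
            for start, end in speech_segments(flags)]
```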
Experiments VAD
  Hand-labeled samples
  Percentage correctly classified
  False negatives
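The two measures can be computed against hand labels as sketched below. The slide does not say how the false-negative percentage is normalized; the sketch assumes it is taken over all labeled items, with a false negative being speech that the detector labeled as noise. The name `vad_scores` is hypothetical.

```python
def vad_scores(predicted, reference):
    """Return (percent correctly classified, percent false negatives).

    predicted, reference: equal-length, non-empty lists of booleans
    (True = speech).
    """
    if len(predicted) != len(reference):
        raise ValueError("label sequences must have the same length")
    total = len(reference)
    correct = sum(1 for p, r in zip(predicted, reference) if p == r)
    # False negative: hand label says speech, detector says noise.
    false_neg = sum(1 for p, r in zip(predicted, reference) if r and not p)
    return 100.0 * correct / total, 100.0 * false_neg / total
```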
Results VAD
  Method                          Correctly classified   False negatives
  Entropy-based                   65.3%                  9.3%
  Power-based                     76.3%                  6.2%
  Long term spectral divergence   79.0%                  1.6%
Experiments feature extraction
  Number of coefficients
    MFCC
      Optimal: 10 (90.9%)
    LPC
      Optimal: 8 (77.3%)
Experiments single speaker recognition
  Professional vs. built-in laptop microphone
  Silence removal
  Trained   Tested   Neural network   Nearest neighbor
  Pro       Pro      90.9%            100%
  Laptop    Laptop   61.1%            94.4%
  Pro       Laptop   16.7%            33.3%
  Laptop    Pro      9.4%             21.4%
Experiments neural network
  Optimal number of nodes
    Self-recorded database: 25 nodes
    Voxforge database: 100 nodes
Experiments neural network
  Cycles
Experiments multi speaker recognition
  Self-made samples
  Optimal settings used
  Neural network: 66.7%
  Nearest neighbor: 76.5%
Discussion
  Nearest neighbor better than neural network?
  Neural network is more applicable
  VAD gives no improvement
Conclusions
  LTSD is the best VAD method
  MFCC outperforms LPC
  Training and testing with different microphones gives significantly lower accuracy
  Nearest neighbor works better than an optimized neural network
Questions?