Look who’s talking? Project 3.1

Yannick Thimister, Han van Venrooij, Bob Verlinden

27-01-2011, DKE, Maastricht University

Project 3.1 DKE - Maastricht University

Contents

Speaker recognition
Speech samples
Voice activity detection
Feature extraction
Speaker recognition
Multi speaker recognition
Experiments and results
Discussion
Conclusion


Speaker Recognition

Speech contains several layers of info
  Spoken words
  Speaker identity
Speaker-related differences are a combination of anatomical differences and learned speaking habits


Speech samples

Self-recorded database
  55 sentences from 11 different people
  2x2 predefined and 1 random
  Pro recording and built-in laptop microphone
Database via Voxforge.org
  610 sentences from 61 different people
  Varying recording microphones and environments


Voice activity detection

Power-based
Entropy-based
Long term spectral divergence

Frames
Initial frames are noise
Hangover
Adaptive noise estimation
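The hangover idea can be illustrated with a short sketch. It assumes one boolean VAD decision per frame and keeps the speech flag raised for a fixed number of frames after the last detected speech frame. The function name `apply_hangover` and the parameter `hangover` are hypothetical; the slides do not state the hangover length that was used.

```python
def apply_hangover(decisions, hangover=3):
    """Smooth raw per-frame VAD decisions (list of booleans).

    After a speech frame, the next `hangover` non-speech frames are
    still reported as speech, so short pauses and weak word endings
    are not cut off. The value of `hangover` is an assumption.
    """
    smoothed = []
    remaining = 0  # hangover frames that may still be reported as speech
    for is_speech in decisions:
        if is_speech:
            remaining = hangover
            smoothed.append(True)
        elif remaining > 0:
            remaining -= 1
            smoothed.append(True)
        else:
            smoothed.append(False)
    return smoothed


raw = [False, True, True, False, False, False, False, True, False]
print(apply_hangover(raw, hangover=2))
```

With a hangover of 2, the two frames following each speech run are kept as speech, while longer silences are still detected.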


Power-based
  Assumes that the noise is normally distributed
  Calculate mean, standard deviation
  For each sample n
    Calculate
  For each frame j
    The majority of the samples
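The per-sample formula and the frame rule are not spelled out in the text of this slide, so the following is only a sketch under stated assumptions: the noise mean and standard deviation are estimated from the initial frames, a sample counts as speech-like when it lies more than `k` standard deviations from the noise mean, and a frame is speech when the majority of its samples are speech-like. The names `power_vad`, `noise_frames` and `k` are hypothetical.

```python
import statistics


def power_vad(samples, frame_len, noise_frames=2, k=3.0):
    """Power-based VAD sketch returning one boolean per full frame.

    Assumptions (not taken from the slides): the first `noise_frames`
    frames contain only noise, and a sample is flagged when
    |x - mean| > k * std of that noise.
    """
    noise = samples[:noise_frames * frame_len]
    mu = statistics.mean(noise)
    sigma = statistics.pstdev(noise)

    decisions = []
    for start in range(0, len(samples) - frame_len + 1, frame_len):
        frame = samples[start:start + frame_len]
        flagged = sum(1 for x in frame if abs(x - mu) > k * sigma)
        # Majority vote over the samples of the frame.
        decisions.append(flagged > frame_len / 2)
    return decisions


noise = [0.01, -0.01] * 4          # two quiet frames of four samples
speech = [1.0, -1.0, 1.0, -1.0]    # one loud frame
print(power_vad(noise + speech + noise[:4], frame_len=4))
```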




Entropy-based

Scale DFT coefficients

Entropy equals
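A minimal sketch of a spectral entropy computation, assuming that "scale DFT coefficients" means normalising the magnitude spectrum so it sums to one, and that the entropy is the Shannon entropy of that distribution. The formula from the slide is not available in the text, so both points are assumptions, as are the function names. A naive DFT is used to keep the example dependency-free.

```python
import cmath
import math


def dft_magnitudes(frame):
    """Magnitudes of the non-redundant DFT bins of a real-valued frame."""
    n = len(frame)
    return [
        abs(sum(x * cmath.exp(-2j * math.pi * k * t / n)
                for t, x in enumerate(frame)))
        for k in range(n // 2 + 1)
    ]


def spectral_entropy(frame):
    """Shannon entropy (in nats) of the normalised magnitude spectrum."""
    mags = dft_magnitudes(frame)
    total = sum(mags)
    if total == 0:
        return 0.0
    probs = [m / total for m in mags]
    return -sum(p * math.log(p) for p in probs if p > 0)


tone = [math.sin(2 * math.pi * 2 * t / 16) for t in range(16)]
impulse = [1.0] + [0.0] * 15
print(spectral_entropy(tone), spectral_entropy(impulse))
```

A pure tone concentrates its energy in one bin and gives an entropy near zero, while an impulse has a flat spectrum and reaches the maximum, the logarithm of the number of bins. How the entropy is thresholded into a speech decision is not stated in the slides.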



Long term spectral divergence
  L-frame window
  Estimation
  Divergence
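The estimation and divergence steps can be sketched as follows. The formulas on the slide are not available in the text, so this sketch assumes the commonly used definition: the long-term spectral envelope is the per-bin maximum of the magnitude spectra in a window around the current frame, and the divergence is the mean ratio of squared envelope to squared noise spectrum, expressed in dB. The names `ltse`, `ltsd` and `order` (the number of frames on each side of the centre) are hypothetical.

```python
import math


def ltse(spectra, center, order):
    """Long-term spectral envelope: per-bin maximum over the window."""
    lo = max(0, center - order)
    hi = min(len(spectra), center + order + 1)
    n_bins = len(spectra[0])
    return [max(spectra[j][k] for j in range(lo, hi)) for k in range(n_bins)]


def ltsd(spectra, center, order, noise_spectrum):
    """Long-term spectral divergence in dB (noise bins must be non-zero)."""
    envelope = ltse(spectra, center, order)
    ratio = sum((e * e) / (n * n) for e, n in zip(envelope, noise_spectrum))
    return 10.0 * math.log10(ratio / len(envelope))


# Four frames with two frequency bins each; frame 2 has a loud first bin.
spectra = [[1.0, 1.0], [1.0, 1.0], [10.0, 1.0], [1.0, 1.0]]
noise_spectrum = [1.0, 1.0]
print(ltsd(spectra, 0, 1, noise_spectrum))  # window covers frames 0..1
print(ltsd(spectra, 1, 1, noise_spectrum))  # window covers frames 0..2
```

Because the envelope looks at neighbouring frames, a loud frame raises the divergence of the frames next to it as well.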



Long term spectral divergence
  Estimate the noise spectrum
    Averages of the DFT coefficients
  Calculate mean (μ) LTSD of noise frames
  For each frame f
    Calculate the LTSD
    LTSD > c μ
    Update
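The decision loop can be sketched as below, under several assumptions that the slides do not confirm: the first `n_noise` frames are noise, a frame is speech when its LTSD exceeds `c` times the mean noise LTSD μ, and the "update" step is an exponential smoothing of the noise spectrum on non-speech frames with factor `alpha`. All names and default values are hypothetical.

```python
import math


def ltsd(spectra, center, order, noise_spectrum):
    """LTSD in dB: mean of (windowed per-bin maximum)^2 / noise^2."""
    lo = max(0, center - order)
    hi = min(len(spectra), center + order + 1)
    total = 0.0
    for k, noise_k in enumerate(noise_spectrum):
        envelope_k = max(spectra[j][k] for j in range(lo, hi))
        total += (envelope_k ** 2) / (noise_k ** 2)
    return 10.0 * math.log10(total / len(noise_spectrum))


def ltsd_vad(spectra, n_noise, order=1, c=2.0, alpha=0.9):
    """Return one speech/non-speech decision per frame."""
    n_bins = len(spectra[0])
    # Noise spectrum: average of the DFT magnitudes of the initial frames.
    noise = [sum(frame[k] for frame in spectra[:n_noise]) / n_noise
             for k in range(n_bins)]
    # Mean LTSD of the noise frames, computed on the noise frames only.
    noise_only = spectra[:n_noise]
    mu = sum(ltsd(noise_only, f, order, noise) for f in range(n_noise)) / n_noise

    decisions = []
    for f in range(len(spectra)):
        is_speech = ltsd(spectra, f, order, noise) > c * mu
        decisions.append(is_speech)
        if not is_speech:
            # Assumed update rule: smooth the noise estimate towards the frame.
            noise = [alpha * n + (1.0 - alpha) * x
                     for n, x in zip(noise, spectra[f])]
    return decisions


spectra = [[1.0, 1.0], [2.0, 2.0], [1.0, 1.0], [2.0, 2.0],   # noise
           [20.0, 20.0], [20.0, 20.0],                       # speech
           [1.0, 1.0], [2.0, 2.0]]                           # noise
print(ltsd_vad(spectra, n_noise=4))
```

In this toy input the frames directly before and after the loud frames are also marked as speech, because the envelope window reaches into the loud region.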



Feature extraction

Representation of speakers
Mel frequency cepstral coefficients
  Imitates human hearing
Linear predictive coding
  Linear function of previous samples


MFCC

Hamming window
FFT
Mel-scale
Log
FFT
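The pipeline above can be sketched for a single frame as follows. This is an illustration under assumptions, not the project's implementation: the mel mapping uses the common 2595·log10(1 + f/700) formula, the filterbank is triangular, and the last step uses a DCT-II, which is the usual choice for MFCCs, where the slide lists a second FFT. The number of filters and all function names are hypothetical; the default of 10 coefficients matches the optimum reported in the experiments.

```python
import cmath
import math


def hz_to_mel(hz):
    return 2595.0 * math.log10(1.0 + hz / 700.0)


def mel_to_hz(mel):
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)


def hamming(n_samples):
    return [0.54 - 0.46 * math.cos(2.0 * math.pi * i / (n_samples - 1))
            for i in range(n_samples)]


def power_spectrum(frame):
    """Naive DFT power spectrum of the non-redundant bins."""
    n = len(frame)
    return [abs(sum(x * cmath.exp(-2j * math.pi * k * t / n)
                    for t, x in enumerate(frame))) ** 2
            for k in range(n // 2 + 1)]


def mel_filterbank(n_filters, n_bins, sample_rate):
    """Triangular filters with centres equally spaced on the mel scale."""
    top = hz_to_mel(sample_rate / 2.0)
    points = [mel_to_hz(top * i / (n_filters + 1)) for i in range(n_filters + 2)]
    bin_hz = (sample_rate / 2.0) / (n_bins - 1)
    bank = []
    for m in range(1, n_filters + 1):
        left, center, right = points[m - 1], points[m], points[m + 1]
        weights = []
        for k in range(n_bins):
            f = k * bin_hz
            if left < f <= center:
                weights.append((f - left) / (center - left))
            elif center < f < right:
                weights.append((right - f) / (right - center))
            else:
                weights.append(0.0)
        bank.append(weights)
    return bank


def mfcc(frame, sample_rate, n_filters=20, n_coeffs=10):
    window = hamming(len(frame))
    spectrum = power_spectrum([x * w for x, w in zip(frame, window)])
    bank = mel_filterbank(n_filters, len(spectrum), sample_rate)
    # Small offset avoids log(0) for filters that catch no energy.
    log_energies = [math.log(sum(w * p for w, p in zip(weights, spectrum)) + 1e-10)
                    for weights in bank]
    # DCT-II of the log filterbank energies (assumed final transform).
    return [sum(e * math.cos(math.pi * c * (i + 0.5) / n_filters)
                for i, e in enumerate(log_energies))
            for c in range(n_coeffs)]


frame = [math.sin(2 * math.pi * 440 * t / 8000) for t in range(64)]
print(len(mfcc(frame, sample_rate=8000)))
```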


LPC

Pth order linear function estimated
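One common way to estimate a P-th order linear predictor is the autocorrelation method solved with the Levinson-Durbin recursion. The slides do not say which estimation method the project used, so the sketch below is an assumption; the names `autocorrelation` and `lpc` are hypothetical. The returned coefficients a[1..P] predict each sample as x[n] ≈ a[1]·x[n-1] + ... + a[P]·x[n-P].

```python
def autocorrelation(signal, max_lag):
    """Autocorrelation r[0..max_lag] of a finite signal."""
    return [sum(signal[n] * signal[n + lag] for n in range(len(signal) - lag))
            for lag in range(max_lag + 1)]


def lpc(signal, order):
    """Predictor coefficients a[1..order] via Levinson-Durbin."""
    r = autocorrelation(signal, order)
    a = [0.0] * (order + 1)   # a[0] is unused
    error = r[0]              # prediction error energy
    for i in range(1, order + 1):
        if error == 0:
            break             # silent or perfectly predicted signal
        acc = r[i] - sum(a[j] * r[i - j] for j in range(1, i))
        k = acc / error       # reflection coefficient
        new_a = a[:]
        new_a[i] = k
        for j in range(1, i):
            new_a[j] = a[j] - k * a[i - j]
        a = new_a
        error *= (1.0 - k * k)
    return a[1:]


# A decaying exponential follows x[n] = 0.9 * x[n-1].
signal = [0.9 ** n for n in range(200)]
print(lpc(signal, 2))
```

For this first-order signal the first coefficient comes out close to 0.9 and the second close to zero.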


Speaker recognition

Nearest Neighbor
  Euclidean distance
Neural Network
  Multilayer perceptron


Nearest neighbor

Features compared pairwise
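A minimal nearest neighbor sketch, assuming each speech sample has already been reduced to a fixed-length feature vector (for example averaged MFCCs, which is an assumption here) and each training vector carries a speaker label. The query is compared with every training vector using the Euclidean distance. The function names and the toy data are hypothetical.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def nearest_neighbor(query, training_set):
    """Return the label of the closest training vector.

    `training_set` is a list of (feature_vector, speaker_label) pairs.
    """
    best_label = None
    best_dist = float("inf")
    for features, label in training_set:
        dist = euclidean(query, features)
        if dist < best_dist:
            best_dist = dist
            best_label = label
    return best_label


training = [([0.0, 0.0], "speaker A"), ([10.0, 10.0], "speaker B")]
print(nearest_neighbor([1.0, 2.0], training))
```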



Neural network


Multi speaker recognition

Preprocessing using VAD
Consecutive speech frames
Single speaker recognition per segment
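The segmentation step can be sketched as follows, assuming one VAD decision and one feature value (or vector) per frame. Runs of consecutive speech frames become segments, and each segment is passed to a single-speaker classifier. The classifier is supplied by the caller; `speech_segments`, `label_segments` and the toy classifier are hypothetical names for illustration.

```python
def speech_segments(decisions):
    """Group consecutive speech frames into (start, end) index pairs.

    `end` is exclusive, so frames[start:end] is the segment.
    """
    segments = []
    start = None
    for i, is_speech in enumerate(decisions):
        if is_speech and start is None:
            start = i
        elif not is_speech and start is not None:
            segments.append((start, i))
            start = None
    if start is not None:
        segments.append((start, len(decisions)))
    return segments


def label_segments(decisions, frame_features, classify):
    """Run a single-speaker classifier on every speech segment."""
    return [(start, end, classify(frame_features[start:end]))
            for start, end in speech_segments(decisions)]


decisions = [False, True, True, False, False, True]
features = [0.0, 1.0, 2.0, 0.0, 0.0, 9.0]


def toy_classifier(segment):
    # Placeholder rule standing in for a real speaker recognizer.
    return "A" if sum(segment) / len(segment) < 5 else "B"


print(label_segments(decisions, features, toy_classifier))
```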


Experiments VAD

Hand-labeled samples
Percentage of correctly classified
False negatives
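The two measures can be computed from per-frame decisions as in the sketch below. It assumes that a false negative is a frame hand-labeled as speech but classified as non-speech, and that both measures are percentages of all frames; the slides do not state the denominators, so this is an assumption. The function name `vad_scores` is hypothetical.

```python
def vad_scores(predicted, reference):
    """Return (percent correctly classified, percent false negatives).

    `predicted` and `reference` are equal-length lists of booleans,
    where True means speech. Both percentages are relative to the
    total number of frames (an assumption).
    """
    if len(predicted) != len(reference):
        raise ValueError("predicted and reference must have the same length")
    total = len(reference)
    correct = sum(1 for p, r in zip(predicted, reference) if p == r)
    false_neg = sum(1 for p, r in zip(predicted, reference) if r and not p)
    return 100.0 * correct / total, 100.0 * false_neg / total


predicted = [True, False, False, True]
reference = [True, True, False, False]
print(vad_scores(predicted, reference))
```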


Results VAD

Entropy-based
  Correctly classified: 65.3%
  False negatives: 9.3%
Power-based
  Correctly classified: 76.3%
  False negatives: 6.2%
Long term spectral divergence
  Correctly classified: 79.0%
  False negatives: 1.6%


Experiments Feature extraction

Nr. of coefficients
MFCC
  Optimal: 10 (90.9%)
LPC
  Optimal: 8 (77.3%)


Experiments single speaker recognition

Professional vs. built-in laptop microphone
Silence removal

Trained   Tested   Neural network   Nearest neighbor
Pro       Pro      90.9%            100%
Laptop    Laptop   61.1%            94.4%
Pro       Laptop   16.7%            33.3%
Laptop    Pro      9.4%             21.4%


Experiments neural network

Optimal number of nodes
  Self-recorded database: 25 nodes
  Voxforge database: 100 nodes


Experiments neural network

Cycles


Experiments multi speaker recognition

Self-made samples
Optimal settings used
Neural network: 66.7%
Nearest neighbor: 76.5%


Discussion

Nearest neighbor better than neural network?
Neural network better applicable
VAD gives no improvement


Conclusions

LTSD is the best VAD method
MFCC outperforms LPC
Training and testing with different microphones gives significantly lower accuracy
Nearest neighbor works better than an optimized neural network



Questions?
