Text-Prompted Remote Speaker Authentication : Joint Speech and Speaker Recognition/Verification
DESCRIPTION
Joint speech and speaker recognition using Hidden Markov Model/Vector Quantization for speaker-independent speech recognition and Gaussian Mixture Model for text-independent speaker recognition. MFCC (Mel-Frequency Cepstral Coefficients) are used for feature extraction, with delta, delta-delta, and energy terms (39 coefficients). Developed in Java with a client/server architecture; the web interface is developed in Adobe Flex. This project was done at TU, IOE - Pulchowk Campus, Nepal. For more details visit http://ganeshtiwaridotcomdotnp.blogspot.com

ABSTRACT OF PROJECT

A biometric is a physical characteristic unique to each individual, with very useful applications in authentication and access control. The designed system is a text-prompted voice biometric which incorporates a text-independent speaker verification system and a speaker-independent speech verification system, implemented independently. The foundation for this joint system is that the speech signal conveys both the speech content and the speaker identity. Such systems are more secure against playback attack, since the word to speak during authentication is not set in advance.

During the course of the project, various digital signal processing and pattern classification algorithms were studied. Short-time spectral analysis was performed to obtain MFCC, energy, and their deltas as features; the feature extraction module is shared by both systems. Speaker modeling was done by GMM, and a left-to-right discrete HMM with VQ was used for isolated-word modeling. The results of both systems were combined to authenticate the user.

The speech model for each word was pre-trained using utterances of 45 English words. The speaker model was trained on about 2 minutes of speech from each of 15 speakers. On individual words, the recognition rate of the speech recognition system is 92% and that of the speaker recognition system is 66%. For longer utterances (>5 s), the recognition rate of the speaker recognition system improves to 78%.

TRANSCRIPT
MAJOR PROJECT MID-TERM PRESENTATION :
SPEAKER VERIFICATION FOR REMOTE AUTHENTICATION
Members:
Ganesh Tiwari (063BCT510)
Madhav Pandey(063BCT514)
Manoj Shrestha(063BCT518)
Supervisor :
Dr. Subarna Shakya
Associate Professor
INTRODUCTION
Voice biometric system for user login
Text-prompted system : the claimant is asked to speak a prompted text
Joint speech and speaker recognition/verification
More secure against playback attack
Web application
Client (Adobe Flex) : voice capture, preprocessing, and feature extraction
Server (Java) : training / classification
BlazeDS RPC for Java-Flex connectivity
BLOCK DIAGRAM OF SPEAKER / SPEECH RECOGNITION SYSTEM
Signal Capture and Pre-Processing
CAPTURE AND PREPROCESSING
Get the audio signal (i.e., ADC)
Make it suitable for feature extraction
Capture
PCM Extract
Silence Removal
Pre-Emphasis
Framing
Windowing
CAPTURE AND PREPROCESSING : CAPTURE
22050 Hz 16-bits, Signed Little Endian Mono Uncompressed PCM
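The capture settings above can be expressed with the Java Sound API. This is a minimal sketch (the class name is illustrative; the project's actual capture code is not shown in the transcript):

```java
import javax.sound.sampled.AudioFormat;

public class CaptureFormat {
    // 22050 Hz, 16-bit, mono, signed PCM, little-endian -- the settings listed on the slide
    public static AudioFormat slideFormat() {
        return new AudioFormat(22050f, 16, 1, true, false);
    }

    public static void main(String[] args) {
        AudioFormat fmt = slideFormat();
        System.out.println(fmt.getSampleRate() + " Hz, "
                + fmt.getSampleSizeInBits() + "-bit, "
                + fmt.getChannels() + " channel(s)");
    }
}
```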
CAPTURE AND PREPROCESSING : PCM EXTRACT
CAPTURE AND PREPROCESSING :
SILENCE REMOVAL
Algorithm described in the paper 'A New Method for Silence Removal and Endpoint Detection' †
† G. Saha, Sandipan Chakroborty, Suman Senapati, Department of Electronics and Electrical Communication Engineering, Indian Institute of Technology, Kharagpur, India
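The cited algorithm is a statistical endpoint-detection method; the sketch below is NOT that algorithm, only a simplified energy-threshold stand-in (frame length and threshold are illustrative) to show where silence removal sits in the pipeline:

```java
import java.util.ArrayList;
import java.util.List;

public class SilenceRemoval {
    // Keep only frames whose mean absolute amplitude exceeds the threshold.
    // A crude stand-in for the statistical method of Saha et al.
    public static double[] removeSilence(double[] samples, int frameLen, double threshold) {
        List<Double> voiced = new ArrayList<>();
        for (int start = 0; start + frameLen <= samples.length; start += frameLen) {
            double sum = 0;
            for (int i = start; i < start + frameLen; i++) sum += Math.abs(samples[i]);
            if (sum / frameLen > threshold) {
                for (int i = start; i < start + frameLen; i++) voiced.add(samples[i]);
            }
        }
        double[] out = new double[voiced.size()];
        for (int i = 0; i < out.length; i++) out[i] = voiced.get(i);
        return out;
    }
}
```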
[Waveform plots: the signal before and after silence removal]
CAPTURE AND PREPROCESSING : PRE-EMPHASIS
Boosting the high frequency energy
In time domain, y[n] = x[n]−αx[n−1], 0.9 ≤ α ≤ 1.0
[Plots: magnitude spectrum |Y(f)|, 0-12000 Hz, before and after pre-emphasis]
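The first-order filter y[n] = x[n] − αx[n−1] above is a one-line loop in Java; α = 0.95 is assumed here (the slide only gives the range 0.9 ≤ α ≤ 1.0):

```java
public class PreEmphasis {
    // y[n] = x[n] - alpha * x[n-1], boosting high-frequency energy
    public static double[] apply(double[] x, double alpha) {
        double[] y = new double[x.length];
        y[0] = x[0];  // no previous sample for n = 0
        for (int n = 1; n < x.length; n++) {
            y[n] = x[n] - alpha * x[n - 1];
        }
        return y;
    }
}
```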
CAPTURE AND PREPROCESSING :
FRAMING
The speech signal is stationary (in its statistical properties) over 10-30 ms
50% overlapped frames, each of 23 ms, are used
CAPTURE AND PREPROCESSING :
WINDOWING
Windowing is done on the frame-blocked signal
Hamming window
[Plots: Hamming window shape; a frame before and after windowing]
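The framing and windowing steps can be sketched together in Java. At 22050 Hz a 23 ms frame is roughly 507 samples; the frame length is left as a parameter since the exact sample count used in the project is not stated:

```java
public class Windowing {
    // Hamming window: w[n] = 0.54 - 0.46 * cos(2*pi*n / (N-1))
    public static double[] hamming(int N) {
        double[] w = new double[N];
        for (int n = 0; n < N; n++) {
            w[n] = 0.54 - 0.46 * Math.cos(2 * Math.PI * n / (N - 1));
        }
        return w;
    }

    // Split the signal into 50%-overlapped frames and apply the Hamming window
    public static double[][] frameAndWindow(double[] x, int frameLen) {
        int hop = frameLen / 2;  // 50% overlap
        int numFrames = (x.length - frameLen) / hop + 1;
        double[] w = hamming(frameLen);
        double[][] frames = new double[numFrames][frameLen];
        for (int f = 0; f < numFrames; f++) {
            for (int n = 0; n < frameLen; n++) {
                frames[f][n] = x[f * hop + n] * w[n];
            }
        }
        return frames;
    }
}
```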
Feature Extraction
FEATURE EXTRACTION
Transform the input audio signal into a sequence of acoustic feature vectors
MFCC : Mel-Frequency Cepstral Coefficients as features
Perceptual approach : the human ear processes audio on the Mel scale
Mel scale : linear up to 1 kHz and logarithmic above 1 kHz
MFCC gives the distribution of energy in the Mel frequency bands
Calculated for each frame
Fourier Transform
Mel Filter
Log
IFT : DCT
Cepstral Mean Subtraction
Energy and Deltas
FEATURE EXTRACTION :
FOURIER TRANSFORM
Gives information about the amount of energy in each frequency band
FFT is used
[Plots: time-domain frame and its FFT magnitude spectrum |Y(f)|, 0-12000 Hz]
FEATURE EXTRACTION :
MEL FILTER
We used a filter bank of triangular filters spaced on the Mel scale
FEATURE EXTRACTION :
MEL FILTER (CONTD.)
[Mel filter-bank equation not preserved in the transcript]
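The slide's filter-bank equation did not survive the transcript. The standard Mel-scale mapping commonly used with triangular filter banks (assumed here to be the one intended) is:

```latex
\mathrm{mel}(f) = 2595 \,\log_{10}\!\left(1 + \frac{f}{700}\right)
```

This mapping is linear below about 1 kHz and logarithmic above it, matching the description on the earlier slide; the triangular filters are spaced uniformly on this Mel axis.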
Fourier Transform
Mel Filter
Log
IFT : DCT
Cepstral Mean Subtraction
Energy and Deltas
FEATURE EXTRACTION : LOG, IFT(DCT)
Log
DCT
MFCC
[Plot: the resulting cepstral (MFCC) coefficients]
FEATURE EXTRACTION :
CEPSTRAL MEAN SUBTRACTION
CMS : for minimizing the channel effect
FEATURE EXTRACTION :
ENERGY AND DELTAS
For completeness of the feature vector and to achieve a high recognition rate:
An energy feature
A delta (velocity) feature and a double-delta (acceleration) feature
Calculated by linear regression over a regression window of size M
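The regression formula itself is not shown in the transcript; the standard linear-regression delta over a window of size M (as popularized by HTK — an assumption about this project's exact form) is:

```latex
d_t = \frac{\sum_{m=1}^{M} m\,\bigl(c_{t+m} - c_{t-m}\bigr)}{2 \sum_{m=1}^{M} m^2}
```

Here c_t is a cepstral (or energy) coefficient at frame t; applying the same formula to the deltas yields the double-delta (acceleration) features.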
COMPOSITION OF FEATURE VECTOR
12 MFCC features
12 delta MFCC
12 delta-delta MFCC
1 energy feature
1 delta energy feature
1 delta-delta energy feature
= 39 features from each frame
Speaker Recognition/Verification by GMM
GAUSSIAN MIXTURE MODEL
Parametric probability density function
Based on a clustering technique
M Gaussian components

p(x | λ) = Σ (m = 1 to M) w_m g(x | μ_m, Σ_m)

where
x : a k-dimensional random vector
w_m : mixture weight of the m-th component
g(x | μ_m, Σ_m) : k-dimensional Gaussian density function (pdf)
λ = (w_m, μ_m, Σ_m), m = 1, ..., M
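Evaluating the mixture density above is a weighted sum of Gaussians. A minimal Java sketch, assuming diagonal covariances (a common choice for speaker GMMs, though the project's covariance type is not stated):

```java
public class GmmDensity {
    // log p(x | lambda) = log sum_m w[m] * N(x; mu[m], var[m]), diagonal covariance
    public static double logLikelihood(double[] x, double[] w,
                                       double[][] mu, double[][] var) {
        double sum = 0;
        for (int m = 0; m < w.length; m++) {
            // log of the k-dimensional diagonal Gaussian at x
            double logG = 0;
            for (int k = 0; k < x.length; k++) {
                double d = x[k] - mu[m][k];
                logG += -0.5 * (Math.log(2 * Math.PI * var[m][k]) + d * d / var[m][k]);
            }
            sum += w[m] * Math.exp(logG);
        }
        return Math.log(sum);
    }
}
```

A frame sequence is scored by summing this log-likelihood over all frames; EM training re-estimates w, mu, and var from those per-frame component responsibilities.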
GMM TRAINING
Goal : estimate the parameters λ
Method : Maximum Likelihood estimation
Input : training feature vectors X = {x_1, ..., x_T}
Maximize P(X | λ) with the Expectation-Maximization algorithm
Iterative process :
initial model λ, new model λ̄ such that P(X | λ̄) ≥ P(X | λ)
Convergence condition :
P(X | λ̄) − P(X | λ) < ε
VERIFICATION
Decision : hypothesis test
H0 : the speaker is the claimed speaker
H1 : the speaker is an imposter
Based on the likelihood ratio
Λ(X) = p(X | λ_claimed) / p(X | λ_imposter)
Decision by threshold θ :
Λ(X) < θ → reject identity claim
Λ(X) ≥ θ → accept identity claim
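In log space the ratio test reduces to a subtraction and a threshold comparison. A sketch (the imposter score would typically come from a background model; how this project models H1 is not stated):

```java
public class LlrDecision {
    // Accept the claim when the log-likelihood ratio meets the threshold theta
    public static boolean accept(double logPClaimed, double logPImposter, double theta) {
        return (logPClaimed - logPImposter) >= theta;
    }
}
```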
Speech Recognition by HMM/VQ
HIDDEN MARKOV MODEL :DEFINITION
Hidden Markov Model (HMM) is a statistical model
HMM is an extension of a Markov process
An HMM has hidden states and observable symbols per state
HMM model : λ = (A, B, π)
Observed data : feature vectors
Hidden states : phonemes
CODEBOOK GENERATION
K-Means Clustering Clustering the whole database & Codebook
Generation
VQ : Vector Quantization is used for mapping each input feature vector to discrete quantized symbols Codebook for each incoming feature vector is built Compare it to each of the prototype vectors in
codebook Select the one which is closest (by some distance
metric) Replace the input vector by the index of this
prototype vector observation sequence
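The mapping step above is a nearest-neighbour search over the codebook. A minimal Java sketch, assuming squared Euclidean distance as the metric:

```java
public class VectorQuantizer {
    // Map each feature vector to the index of its nearest codebook prototype,
    // producing the discrete observation sequence for the HMM
    public static int[] quantize(double[][] features, double[][] codebook) {
        int[] obs = new int[features.length];
        for (int t = 0; t < features.length; t++) {
            int best = 0;
            double bestDist = Double.MAX_VALUE;
            for (int c = 0; c < codebook.length; c++) {
                double dist = 0;  // squared Euclidean distance
                for (int k = 0; k < features[t].length; k++) {
                    double d = features[t][k] - codebook[c][k];
                    dist += d * d;
                }
                if (dist < bestDist) { bestDist = dist; best = c; }
            }
            obs[t] = best;
        }
        return obs;
    }
}
```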
SPEECH RECOGNITION SYSTEM : HMM / VQ
HIDDEN MARKOV MODEL : TRAINING
Training by the forward-backward (Baum-Welch) algorithm
The forward-backward algorithm iteratively re-estimates the parameters and improves the probability that the given observations are generated by the new parameters
Three parameters need to be re-estimated: Initial state distribution: πi
Transition probabilities: ai,j
Emission probabilities: bi(ot)
Input is observation sequence, given by VQ
HIDDEN MARKOV MODEL :VERIFICATION/MATCHING
Viterbi algorithm is used
Inputs : the observation sequence (given by VQ) and the HMM model of each word
Best matched word is returned
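A log-domain Viterbi scorer for a discrete HMM can be sketched as below; the recognizer would run this for every word model and return the word with the highest score (state back-tracking is omitted here, since only the score is needed for matching):

```java
public class Viterbi {
    // Log probability of the best state path for observation sequence obs
    // under a discrete HMM with initial probs pi, transitions A, emissions B
    public static double bestPathLogProb(int[] obs, double[] pi,
                                         double[][] A, double[][] B) {
        int N = pi.length;
        double[] delta = new double[N];
        for (int i = 0; i < N; i++) {
            delta[i] = Math.log(pi[i]) + Math.log(B[i][obs[0]]);
        }
        for (int t = 1; t < obs.length; t++) {
            double[] next = new double[N];
            for (int j = 0; j < N; j++) {
                double best = Double.NEGATIVE_INFINITY;
                for (int i = 0; i < N; i++) {
                    best = Math.max(best, delta[i] + Math.log(A[i][j]));
                }
                next[j] = best + Math.log(B[j][obs[t]]);
            }
            delta = next;
        }
        double best = Double.NEGATIVE_INFINITY;
        for (int i = 0; i < N; i++) best = Math.max(best, delta[i]);
        return best;
    }
}
```

For the left-to-right topology used in the project, A is upper-triangular (no backward transitions); the code above works for any topology since zero-probability transitions become -Infinity in log space.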
PROBLEMS FACED
Learning curve
Complex mathematics
Flex & Java connectivity (initially)
Data conversion
REMAINING TASKS
Speech Training Data Collection
Model Training (HMM, GMM)
Module Integration
Testing
Thanks