
A FINAL YEAR PROJECT PROPOSAL ON TEXT-INDEPENDENT SPEAKER RECOGNITION SYSTEM

Project Members

ASHOK SHARMA PAUDEL (066/BEX/405)
DEEPESH LEKHAK (066/BEX/414)
KESHAV BASHYAL (066/BEX/418)
SUSHMA SHRESTHA (066/BEX/444)

TEXT-INDEPENDENT SPEAKER RECOGNITION SYSTEM

1

OVERVIEW OF PRESENTATION
Introduction
Objective
System Architecture
Methodology
Results and Analysis
Applications
Limitations
Problems Faced
Conclusion

2

1. INTRODUCTION
Speech: a universal method of communication.

Information carried by the speech signal:

1. High-level characteristics: syntax, dialect, style, and the overall meaning of a spoken message.

2. Low-level characteristics: pitch and phonemic spectra, associated much more with the physiology of the vocal tract.

3

1. INTRODUCTION (2)

4

1. INTRODUCTION (3)
Speech is a diverse field with many applications.

Speech Signal ->
  Speech Recognition   -> Words          (e.g., "How are you?")
  Language Recognition -> Language Name  (e.g., English)
  Speaker Recognition  -> Speaker Name   (e.g., Deepesh)

5

1. INTRODUCTION (4)
What is Speaker Recognition? Recognition of who is speaking, based on characteristics of the speech signal. It can be text-independent or text-dependent.
Speaker Identification: determines which registered speaker has spoken.
Speaker Verification: accepts or rejects the claimed identity of a speaker.

6

1. INTRODUCTION (5)
Biometric: a human-generated signal or attribute used to authenticate a person's identity.
Why voice?
- A natural signal to produce
- The only biometric that allows users to authenticate remotely
- Does not require a specialized input device; implementation cost is low
- Ubiquitous: telephones and microphone-equipped PCs

7

1. INTRODUCTION (6)
Combining the voice biometric with other forms of security gives the strongest security:
- Something you have: badge
- Something you know: password
- Something you are: voice

Why text-independent speaker recognition?
- Independent of text: easy to access, cannot be forgotten or misplaced
- Independent of language; acceptable to users

8


2. OBJECTIVE
The main goal of the project is to design and implement a text-independent speaker recognition system on FPGA.

The specific goals can be summarized as:
- To learn about digital signal processing and FPGA
- To implement and analyze the system in MATLAB
- To design and implement the system on FPGA

9

3. SYSTEM ARCHITECTURE

10

4. METHODOLOGY
Input signal (training data / testing data) -> Feature extraction -> Feature matching -> Threshold -> Output

11

4.1. System Implementation in MATLAB

4.1.1. Voice Capturing and Storage
- Input through a microphone, saved in .wav format
- Sound format: 22050 Hz, 16-bit PCM, mono channel

12
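A minimal MATLAB sketch of the capture-and-store step described above. Only the 22050 Hz / 16-bit PCM / mono format comes from the slide; the 3-second duration and the file name are illustrative assumptions.

% Record a short utterance at 22050 Hz, 16-bit PCM, mono, and save it as .wav
fs  = 22050;
rec = audiorecorder(fs, 16, 1);       % sample rate, bits per sample, channels
recordblocking(rec, 3);               % 3-second duration is illustrative
s = getaudiodata(rec);                % samples as a double column vector in [-1, 1]
audiowrite('speaker01.wav', s, fs);   % hypothetical file name

% Later, read the stored utterance back for processing
[s, fs] = audioread('speaker01.wav');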

4.1.2. Pre-Processing

1) Silence removal

13
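The slides do not spell out the silence-removal algorithm; a common choice is a short-time energy threshold, sketched below under that assumption (the frame length and the threshold factor are illustrative, and s is the recorded signal from the capture sketch above).

% Energy-based silence removal (assumed approach; parameters are illustrative)
frameLen = 256;                                  % samples per analysis frame
nFrames  = floor(length(s) / frameLen);
energies = zeros(nFrames, 1);
for i = 1:nFrames
    frame = s((i-1)*frameLen + 1 : i*frameLen);
    energies(i) = sum(frame.^2);                 % short-time energy of the frame
end
thr  = 0.05 * max(energies);                     % keep frames above 5% of the peak energy
keep = [];
for i = 1:nFrames
    if energies(i) > thr
        keep = [keep; s((i-1)*frameLen + 1 : i*frameLen)]; %#ok<AGROW>
    end
end
s = keep;                                        % speech with silent frames removed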

4.1.2. Pre-Processing (2)
1) Silence removal  2) Pre-emphasis

s[n] = s[n] - a·s[n-1]   [1]

[1] Shi-Huang Chen and Yu-Ren Luo, "Speaker Verification Using MFCC and Support Vector Machine"

14
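The pre-emphasis difference equation above maps directly onto MATLAB's filter function; the coefficient a is not given on the slide, so the usual value 0.95 is assumed here.

% Pre-emphasis: s[n] = s[n] - a*s[n-1] (first-order high-pass FIR)
a = 0.95;                     % typical value; assumed, not taken from the slides
s = filter([1 -a], 1, s);     % boosts the high-frequency part of the spectrum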

4.1.2. Pre-Processing (3)
1) Silence removal  2) Pre-emphasis  3) Framing

Overlapping frames: frame blocks of 23.22 ms with 50% overlap, i.e., 512 samples per frame.

15

4.1.2. Pre-Processing (4)
1) Silence removal  2) Pre-emphasis  3) Framing  4) Windowing

x[n] = s[n] · w[n-m],  for n = m, m+1, ..., m+N-1, where the window w[n] is defined for n = 0, 1, ..., N-1   [2]

[2] Shi-Huang Chen and Yu-Ren Luo, "Speaker Verification Using MFCC and Support Vector Machine"

16
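A sketch of the framing and windowing steps for the parameters given above (512-sample frames, 50% overlap). The slides do not name the window function; a Hamming window is assumed (requires the Signal Processing Toolbox).

% Split the pre-emphasized speech into 512-sample frames with 50% overlap
% and multiply each frame by the window, x[n] = s[n]*w[n-m].
N   = 512;                          % frame length (23.22 ms at 22050 Hz)
hop = N/2;                          % 50% overlap
w   = hamming(N);                   % assumed window function
nFrames = floor((length(s) - N) / hop) + 1;
frames  = zeros(N, nFrames);
for m = 1:nFrames
    start = (m-1)*hop;
    frames(:, m) = s(start+1 : start+N) .* w;   % one windowed frame per column
end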

4.1.3. Feature Extraction using MFCC

MFCC: Mel-Frequency Cepstral Coefficients. A perceptual approach that models human perception of speech, applied to the sample frames to extract the features of the speech.

Steps for calculating MFCC
1. Discrete Fourier Transform using the FFT, and the power spectrum |X[k]|² of the signal

17
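Step 1 above, sketched in MATLAB for the 512-sample frames produced earlier (frames is the matrix from the windowing sketch).

% FFT of each windowed frame and its power spectrum |X[k]|^2
NFFT = 512;
X = fft(frames, NFFT);                 % column-wise FFT, one column per frame
P = abs(X(1:NFFT/2 + 1, :)).^2;        % keep the non-negative-frequency half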

4.1.3. Feature Extraction using MFCC (2)
2. Mel scaling. The Mel scale is linear up to 1 kHz and logarithmic above 1 kHz. The powers of the spectrum are mapped onto the Mel scale using a Mel filter bank, giving the Mel spectral coefficients G[k].

Filter bank: overlapping windows

18
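A sketch of the Mel-scaling step. The slides only say that the scale is linear below 1 kHz and logarithmic above it and that the filter bank uses overlapping windows; the triangular filter shape and the choice of 26 filters are assumptions. fs, NFFT and P come from the earlier sketches.

% Map the power spectrum onto the Mel scale with overlapping triangular filters
nFilt = 26;                                  % assumed number of Mel filters
mel   = @(f) 2595 * log10(1 + f/700);        % Hz  -> Mel
imel  = @(m) 700 * (10.^(m/2595) - 1);       % Mel -> Hz
melPts = linspace(mel(0), mel(fs/2), nFilt + 2);         % edges equally spaced in Mel
binPts = floor((NFFT + 1) * imel(melPts) / fs) + 1;      % FFT-bin index of each edge
H = zeros(nFilt, NFFT/2 + 1);
for j = 1:nFilt
    for k = binPts(j):binPts(j+1)                         % rising edge of filter j
        H(j, k) = (k - binPts(j)) / (binPts(j+1) - binPts(j));
    end
    for k = binPts(j+1):binPts(j+2)                       % falling edge of filter j
        H(j, k) = (binPts(j+2) - k) / (binPts(j+2) - binPts(j+1));
    end
end
G = H * P;    % Mel spectral coefficients G[k], one column per frame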

4.1.3. Feature Extraction using MFCC (3)
3. The log of the Mel spectral coefficients is taken: log(G[k]).

4. Discrete Cosine Transform (DCT) -> Mel-cepstrum c[q].

(Source: Shi-Huang Chen and Yu-Ren Luo, "Speaker Verification Using MFCC and Support Vector Machine")

19
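Steps 3 and 4, continuing from the Mel spectral coefficients G above. Keeping 20 coefficients matches the best-performing configuration reported in the results; dct requires the Signal Processing Toolbox.

% Log of the Mel spectral coefficients followed by the DCT -> Mel-cepstrum c[q]
C = dct(log(G + eps));        % eps avoids log(0); DCT is taken along each column
nCoeff = 20;                  % number of MFCCs to keep (8-20 are used in the results)
mfcc = C(1:nCoeff, :);        % one column of MFCCs per frame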

4.1.3. Feature Extraction using MFCC (4)

20

4.1.4. Feature Matching using GMM

Gaussian Mixture Model: a parametric probability density function, based on a soft clustering technique; a mixture of Gaussian components.

21

4.1.4. Feature Matching using GMM (2)

GMM Training

22

4.1.4. Feature Matching using GMM (3)
The GMM modeling process consists of two steps:
Initialization: initial values of the mean, covariance & weight are assigned.
Expectation Maximization (EM): the values of the mean, covariance & weight are re-estimated adaptively by finding the maximum likelihood of the parameters.

23
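A sketch of the GMM training step using fitgmdist from the Statistics and Machine Learning Toolbox, which performs the initialization and EM iterations internally. The model order M = 20 mirrors the best result in the slides; the diagonal covariance and the regularization value are assumptions, not stated in the presentation.

% Train one GMM per enrolled speaker from that speaker's MFCC feature vectors
M = 20;                                    % number of Gaussian components (model order)
features = mfcc';                          % fitgmdist expects one observation per row
gmmModel = fitgmdist(features, M, ...
    'CovarianceType', 'diagonal', ...      % assumed; keeps EM cheap and stable
    'RegularizationValue', 1e-3, ...       % avoids singular covariance estimates
    'Options', statset('MaxIter', 200));   % EM iteration limit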

4.1.5. Identification & Verification
For speaker identification, the speaker model with the maximum posterior probability within a group of S speakers is selected. For verification, a threshold on the log-likelihood of the claimed speaker is set on an adaptive basis.


Feature Extraction -> Feature Matching -> Decision
Accept if score > Threshold; reject if score < Threshold.

24
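A sketch of the decision stage. speakerModels is assumed to be a cell array holding one trained gmdistribution object per enrolled speaker, and testMfcc the MFCC matrix of the test utterance; the numeric threshold is illustrative, since the slides state it is set adaptively/empirically.

% Identification: pick the speaker model with the maximum log-likelihood
testFeatures = testMfcc';                      % observations in rows
S = numel(speakerModels);
logLik = zeros(S, 1);
for i = 1:S
    p = pdf(speakerModels{i}, testFeatures);   % per-frame likelihoods under model i
    logLik(i) = sum(log(p + eps));             % utterance-level log-likelihood
end
[~, identifiedSpeaker] = max(logLik);          % identified speaker index

% Verification: compare the claimed speaker's average log-likelihood to a threshold
claimed   = 3;                                 % index of the claimed identity (example)
threshold = -50;                               % illustrative; chosen empirically in practice
accepted  = logLik(claimed) / size(testFeatures, 1) > threshold;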

4.2. System Implementation on FPGA

25

4.2. System Implementation on FPGA (2)
Sound Capture and Level Shifter
- The audio is captured using a condenser microphone and amplified using an op-amp.
- The DC offset of the input audio signal is shifted to 1.65 volts.
Analog-to-Digital and Digital-to-Analog Conversion
- The Spartan-3E FPGA board has an ADC module with SPI operation.
- 14-bit sample values are obtained from the ADC at a rate of 25000 samples per second.

26

4.2. System Implementation on FPGA (3)
Double Data Rate (DDR) SDRAM
- ADC samples are stored temporarily in the DDR SDRAM before further processing.
- Burst mode 4 with burst length 2 is used, i.e., 64 bits are written to the SDRAM.
- The Wishbone communication protocol is used for communication with the DDR SDRAM.

27

4.2. System Implementation on FPGA (4)
Framing and Windowing
- ADC samples stored in the DDR are pre-emphasized.
- 50% overlapped frames with a frame length of 512 samples are used.

Fast Fourier Transform
- A 512-point radix-2 FFT is performed using the Xilinx LogiCORE IP.

28

4.2. System Implementation on FPGA (5)

FFT timing diagram

29

4.2. System Implementation on FPGA (6)
Mel Spectrum
- Spectrum (linear scale) => Mel spectrum

Log Calculation
- Natural log computed using look-up tables.
- Input data: 24 bits; output: 12 bits.

30
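The slides only state that the natural log is computed with look-up tables (24-bit input, 12-bit output). One common way to build such a table offline, shown here purely as an assumed illustration rather than the project's actual design, is to store the fractional log2 of the normalized mantissa and combine it with the position of the leading bit at run time.

% Generate a quantized log look-up table in MATLAB (assumed 1024-entry depth)
addrBits = 10;                                   % ROM depth of 2^10 entries (assumption)
outBits  = 12;                                   % matches the 12-bit output width
m   = 1 + (0:2^addrBits - 1)' / 2^addrBits;      % normalized mantissa in [1, 2)
lut = round(log2(m) * (2^outBits - 1));          % fractional log2, quantized to 12 bits

% Reference evaluation for a 24-bit sample value x (software check only)
x    = 1234567;                                  % illustrative input
n    = floor(log2(x));                           % position of the most significant bit
addr = floor((x / 2^n - 1) * 2^addrBits) + 1;    % mantissa -> 1-based ROM address
lnApprox = (n + lut(addr) / (2^outBits - 1)) * log(2);   % ~ natural log of x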

4.2. System Implementation on FPGA (7)
Discrete Cosine Transform (DCT)
- DCT core from opencores.org
- Input: 1 bit; output: 16-bit parallel

Universal Asynchronous Receiver/Transmitter (UART)
- Baud rate of 19.2 kbps
- Each MFCC (32 bits) is divided into four 8-bit components.
- Implemented on an unused jumper pin to carry the UART protocol via CDC.

31

4.3. Further Processing in MATLAB
- MFCCs are received in MATLAB in int32 format.
- Training phase: MFCC feature vectors => Gaussian Mixture Model
- Testing phase: MFCC feature vectors => posterior probability (recognition)

32
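A sketch of how the MFCC stream from the FPGA's UART might be read into MATLAB using the serialport interface (R2019b or later). Only the 19.2 kbps baud rate and the int32 format come from the slides; the port name, coefficient count and frame count are assumptions.

% Read int32 MFCC values arriving over the serial link
dev = serialport("COM3", 19200);             % port name is an assumption
nCoeff  = 20;                                % MFCCs per frame (assumed, matches results)
nFrames = 100;                               % number of frames to collect (illustrative)
raw = read(dev, nCoeff * nFrames, "int32");  % each MFCC arrives as four 8-bit bytes
mfccFromFpga = reshape(double(raw), nCoeff, nFrames);
clear dev                                    % release the serial port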

5. RESULTS AND ANALYSIS

33

5.1. Output in MATLAB
- Training data: 31 speakers (20 male, 11 female)
- Training data length = 10-30 seconds
- Testing data length = 1-10 seconds
- No. of MFCCs = 8-20
- Up to 99% recognition when training data length = 30 seconds, testing data length = 10 seconds, and no. of MFCCs = 20

5.1. Output in MATLAB (2)

Recognition rate by amount of training speech, model order (M), and duration of testing speech:

Amount of Training Speech   Model Order (M)   1 second   5 seconds   10 seconds
10 seconds                   8                51.3%      75.5%       82.9%
10 seconds                  13                60.3%      83.5%       88.4%
10 seconds                  20                64.7%      85.1%       90.4%
20 seconds                   8                67.3%      86.3%       93.6%
20 seconds                  13                75.1%      95.1%       97.3%
20 seconds                  20                78.3%      95.4%       97.4%
30 seconds                   8                71.7%      95.5%       97.5%
30 seconds                  13                79.2%      97.8%       98.5%
30 seconds                  20                84.1%      98.1%       99.1%

34

The largest increase in performance occurs when the training data increases from 10 to 20 seconds; increasing it to 30 seconds improves the performance only slightly. At most 30 seconds of speech is needed to maintain high performance. There is an abrupt change in performance when the testing speech duration increases from 1 to 5 seconds, and only a slight increase from 5 to 10 seconds. Using more training data improves the performance.

35

5.1. Output in MATLAB (3)

77% of unknown female voices were matched with a female voice, and 85% of unknown male voices were matched with a male voice. During the experiments, four languages were used (English, Nepali, Hindi, and German); the speaker was recognized correctly regardless of the spoken text and language.

36

5.1. Output in MATLAB (4)

Total Error Rate (TER) = FAR + FRR. The threshold for speaker verification was calculated empirically using the FAR (false acceptance rate) and FRR (false rejection rate).


37
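A sketch of the empirical threshold selection described above. genuineScores and impostorScores are assumed vectors of verification scores (e.g., average log-likelihoods) for genuine and impostor trials.

% Sweep candidate thresholds and pick the one minimizing TER = FAR + FRR
thresholds = linspace(min(impostorScores), max(genuineScores), 200);
FAR = zeros(size(thresholds));
FRR = zeros(size(thresholds));
for t = 1:numel(thresholds)
    FAR(t) = mean(impostorScores >= thresholds(t));   % impostors wrongly accepted
    FRR(t) = mean(genuineScores  <  thresholds(t));   % genuine speakers wrongly rejected
end
TER = FAR + FRR;                                      % total error rate
[~, best] = min(TER);
bestThreshold = thresholds(best);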

5.1. Output in MATLAB (5)

5.2. Output Analysis in FPGA
The recognition rate is less than that of the software implementation. Overall resource utilization in the FPGA:
- RAMs: 7
- ROMs: 3
- Multipliers: 15
- Adders/Subtractors: 18
- Counters: 9
- Registers: 132
- Comparators: 20
- Multiplexers: 2

38

5.2. Output Analysis in FPGA (2)

Device Utilization Summary

Logic Utilization                                 Used    Available   Utilization
Number of Slice Flip-Flops                        8225    9312        88%
Number of 4-input LUTs                            8734    9312        93%
Number of occupied Slices                         2355    4656        54%
Number of Slices containing only related logic    1325    1325        100%
Number of Slices containing unrelated logic       0       1325        0%
Total number of 4-input LUTs                      8903    9312        94%
Number of bonded IOBs                             215     232         94%
Number of RAMB16s                                 7       20          35%
Number of BUFGMUXs                                2       24          8%
Number of MULT18X18SIOs                           15      20          75%
Average fanout of non-clock nets                  2.72

39

6. APPLICATIONS

40

7. LIMITATIONS
- The duration of the speech signal limits the performance.
- Intrusion based on voice imitation cannot be detected.
- Choosing the optimal model order.
- The silence-removal process is not efficient.

41

8. PROBLEMS FACED
- Limited resources on the Spartan-3E.
- Lack of sufficient block RAM and ROM memory.
- Synchronization problems between different modules/components.

42

9. CONCLUSION
- The system has been implemented using MFCC for feature extraction and GMM to model the speakers.
- The performance of the software implementation of the system is very good.
- The implementation on FPGA is not yet satisfactory.
- Noise-reduction algorithms can be used to improve the performance of the system.

43

THANK YOU

44